Published as a conference paper at ICLR 2021

ROBUST PRUNING AT INITIALIZATION

Soufiane Hayou, Jean-Francois Ton, Arnaud Doucet & Yee Whye Teh
Department of Statistics
University of Oxford
United Kingdom
{soufiane.hayou, ton, doucet, teh}@stats.ox.ac.uk

Overparameterized Neural Networks (NN) display state-of-the-art performance. However, there is a growing need for smaller, energy-efficient neural networks to be able to run machine learning applications on devices with limited computational resources. A popular approach consists of using pruning techniques. While these techniques have traditionally focused on pruning pre-trained NNs (LeCun et al., 1990; Hassibi et al., 1993), recent work by Lee et al. (2018) has shown promising results when pruning at initialization. However, for deep NNs, such procedures remain unsatisfactory as the resulting pruned networks can be difficult to train and, for instance, they do not prevent one layer from being fully pruned. In this paper, we provide a comprehensive theoretical analysis of Magnitude and Gradient based pruning at initialization and training of sparse architectures. This allows us to propose novel principled approaches which we validate experimentally on a variety of NN architectures.

1 INTRODUCTION

Overparameterized deep NNs have achieved state of the art (SOTA) performance in many tasks (Nguyen and Hein, 2018; Du et al., 2019; Zhang et al., 2016; Neyshabur et al., 2019). However, it is impractical to implement such models on small devices such as mobile phones. To address this problem, network pruning is widely used to reduce the time and space requirements both at training and test time. The main idea is to identify weights that do not contribute significantly to the model performance based on some criterion, and to remove them from the NN. However, most existing pruning procedures can only be applied after having trained the full NN (LeCun et al., 1990; Hassibi et al., 1993; Mozer and Smolensky, 1989; Dong et al., 2017), although methods that prune the NN during training have also been proposed. For example, Louizos et al. (2018) propose an algorithm which adds an L0 regularization on the weights to enforce sparsity, while Carreira-Perpiñán and Idelbayev (2018), Alvarez and Salzmann (2017) and Li et al. (2020) propose the inclusion of compression inside training steps. Other pruning variants consider training a secondary network that learns a pruning mask for a given architecture (Li et al. (2020); Liu et al. (2019)). Recently, Frankle and Carbin (2019) have introduced and validated experimentally the Lottery Ticket Hypothesis, which conjectures the existence of a sparse subnetwork that achieves similar performance to the original NN. These empirical findings have motivated the development of pruning at initialization, such as SNIP (Lee et al. (2018)), which demonstrated performance similar to classical pruning-after-training methods. Importantly, pruning at initialization never requires training the complete NN and is thus more memory efficient, allowing one to train deep NNs using limited computational resources. However, such techniques may suffer from different problems. In particular, nothing prevents such methods from pruning one whole layer of the NN, making it untrainable. More generally, it is typically difficult to train the resulting pruned NN (Li et al., 2018). Lee et al. (2020) tackle this issue by enforcing dynamical isometry using orthogonal weights, while Wang et al. (2020) (GraSP) use Hessian-based pruning to preserve gradient flow.
Other work by Tanaka et al. (2020) considers a data-agnostic iterative approach using the concept of synaptic flow in order to avoid the layer-collapse phenomenon (pruning a whole layer). In our work, we use principled scaling and re-parameterization to solve this issue, and show numerically that our algorithm achieves SOTA performance on CIFAR10, CIFAR100, Tiny ImageNet and ImageNet in some scenarios and remains competitive in others.

Table 1: Classification accuracies on CIFAR10 for Resnets with varying depths and sparsities, using SNIP (Lee et al. (2018)) and our algorithm SBP-SR.

ALGORITHM            90%            95%            98%            99.5%           99.9%
RESNET32   SNIP      92.26 ± 0.32   91.18 ± 0.17   87.78 ± 0.16   77.56 ± 0.36     9.98 ± 0.08
           SBP-SR    92.56 ± 0.06   91.21 ± 0.30   88.25 ± 0.35   79.54 ± 1.12    51.56 ± 1.12
RESNET50   SNIP      91.95 ± 0.13   92.12 ± 0.34   89.26 ± 0.23   80.49 ± 2.41    19.98 ± 14.12
           SBP-SR    92.05 ± 0.06   92.74 ± 0.32   89.57 ± 0.21   82.68 ± 0.52    58.76 ± 1.82
RESNET104  SNIP      93.25 ± 0.53   92.98 ± 0.12   91.58 ± 0.19   33.63 ± 33.27   10.11 ± 0.09
           SBP-SR    94.69 ± 0.13   93.88 ± 0.17   92.08 ± 0.14   87.47 ± 0.23    72.70 ± 0.48

In this paper, we provide novel algorithms for Sensitivity-Based Pruning (SBP), i.e. pruning schemes that prune a weight $W$ based on the magnitude of $|W \, \partial\mathcal{L}/\partial W|$ at initialization, where $\mathcal{L}$ is the loss. Experimentally, compared to other available one-shot pruning schemes, these algorithms provide state-of-the-art results, although this may not hold in some regimes. Our work is motivated by a new theoretical analysis of gradient back-propagation relying on the mean-field approximation of deep NNs (Hayou et al., 2019; Schoenholz et al., 2017; Poole et al., 2016; Yang and Schoenholz, 2017; Xiao et al., 2018; Lee et al., 2018; Matthews et al., 2018). Our contribution is threefold:

- For deep fully connected Feed Forward NNs (FFNN) and Convolutional NNs (CNN), it has been previously shown that only an initialization on the so-called Edge of Chaos (EOC) makes models trainable; see e.g. (Schoenholz et al., 2017; Hayou et al., 2019). For such models, we show that an EOC initialization is also necessary for SBP to be efficient. Outside this regime, one layer can be fully pruned.

- For these models, pruning pushes the NN out of the EOC, making the resulting pruned model difficult to train. We introduce a simple rescaling trick to bring the pruned model back into the EOC regime, making the pruned NN easily trainable.

- Unlike FFNN and CNN, we show that Resnets are better suited for pruning at initialization since they live on the EOC by default (Yang and Schoenholz, 2017). However, they can suffer from exploding gradients, which we resolve by introducing a re-parameterization, called Stable Resnet (SR). The performance of the resulting SBP-SR pruning algorithm is illustrated in Table 1: SBP-SR allows for pruning up to 99.5% of ResNet104 on CIFAR10 while still retaining around 87% test accuracy.

The precise statements and proofs of the theoretical results are given in the Supplementary. Appendix H also includes the proof of a weak version of the Lottery Ticket Hypothesis (Frankle and Carbin, 2019) showing that, starting from a randomly initialized NN, there exists a subnetwork initialized on the EOC.

2 SENSITIVITY PRUNING FOR FFNN/CNN AND THE RESCALING TRICK

2.1 SETUP AND NOTATIONS

Let $x$ be an input in $\mathbb{R}^d$.
A NN of depth $L$ is defined by
$$y^l(x) = \mathcal{F}_l(W^l, y^{l-1}(x)) + B^l, \quad 1 \le l \le L, \tag{1}$$
where $y^l(x)$ is the vector of pre-activations, $W^l$ and $B^l$ are respectively the weights and bias of the $l^{\text{th}}$ layer, and $\mathcal{F}_l$ is a mapping that defines the nature of the layer. The weights and bias are initialized with $W^l \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma_w^2/v_l)$, where $v_l$ is a scaling factor used to control the variance of $y^l$, and $B^l \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma_b^2)$. Hereafter, $M_l$ denotes the number of weights in the $l^{\text{th}}$ layer, $\phi$ the activation function, and $[m:n] := \{m, m+1, \dots, n\}$ for $m \le n$. Two examples of such architectures are:

Fully connected FFNN. For a FFNN of depth $L$ and widths $(N_l)_{0 \le l \le L}$, we have $v_l = N_{l-1}$, $M_l = N_{l-1} N_l$ and
$$y^1_i(x) = \sum_{j=1}^{d} W^1_{ij} x_j + B^1_i, \qquad y^l_i(x) = \sum_{j=1}^{N_{l-1}} W^l_{ij} \phi(y^{l-1}_j(x)) + B^l_i \quad \text{for } l \ge 2. \tag{2}$$

CNN. For a 1D CNN of depth $L$, number of channels $(n_l)_{l \le L}$, and number of neurons per channel $(N_l)_{l \le L}$, we have
$$y^1_{i,\alpha}(x) = \sum_{j} \sum_{\beta \in \text{ker}_1} W^1_{i,j,\beta} \, x_{j,\alpha+\beta} + b^1_i, \qquad y^l_{i,\alpha}(x) = \sum_{j=1}^{n_{l-1}} \sum_{\beta \in \text{ker}_l} W^l_{i,j,\beta} \, \phi(y^{l-1}_{j,\alpha+\beta}(x)) + b^l_i, \quad \text{for } l \ge 2, \tag{3}$$
where $i \in [1:n_l]$ is the channel index, $\alpha \in [0:N_l-1]$ is the neuron location, $\text{ker}_l = [-k_l : k_l]$ is the filter range, and $2k_l+1$ is the filter size. To simplify the analysis, we assume hereafter that $N_l = N$ and $k_l = k$ for all $l$. Here, we have $v_l = n_{l-1}(2k+1)$ and $M_l = n_{l-1} n_l (2k+1)$. We assume periodic boundary conditions, so $y^l_{i,\alpha} = y^l_{i,\alpha+N} = y^l_{i,\alpha-N}$. Generalization to multidimensional convolutions is straightforward.

When no specific architecture is mentioned, $(W^l_i)_{1 \le i \le M_l}$ denotes the weights of the $l^{\text{th}}$ layer. In practice, a pruning algorithm creates a binary mask $\delta$ over the weights to force the pruned weights to be zero. The neural network after pruning is given by
$$y^l(x) = \mathcal{F}_l(\delta^l \odot W^l, y^{l-1}(x)) + B^l, \tag{4}$$
where $\odot$ is the Hadamard (i.e. element-wise) product. In this paper, we focus on pruning at initialization. The mask is typically created from a vector $g^l$ of the same dimension as $W^l$, computed using a mapping of choice (see below); we then prune the network by keeping the weights that correspond to the top $k$ values in the sequence $(g^l_i)_{i,l}$, where $k$ is fixed by the sparsity that we want to achieve. There are three popular types of criteria in the literature:

Magnitude based pruning (MBP): We prune weights based on the magnitude $|W|$.

Sensitivity based pruning (SBP): We prune weights based on the values of $|W \, \partial\mathcal{L}/\partial W|$, where $\mathcal{L}$ is the loss. This is motivated by the first-order approximation $\mathcal{L}_W \approx \mathcal{L}_{W=0} + W \, \partial\mathcal{L}/\partial W$ used in SNIP (Lee et al. (2018)).

Hessian based pruning (HBP): We prune weights based on some function that uses the Hessian of the loss function, as in GraSP (Wang et al., 2020).

In the remainder of the paper, we focus exclusively on SBP, while our analysis of MBP is given in Appendix E. We leave HBP for future work. However, we include empirical results with GraSP (Wang et al., 2020) in Section 4. Hereafter, we denote by $s$ the sparsity, i.e. the fraction of weights we want to prune. Let $A_l$ be the set of indices of the weights in the $l^{\text{th}}$ layer that are pruned, i.e. $A_l = \{i \in [1:M_l] \text{ s.t. } \delta^l_i = 0\}$. We define the critical sparsity $s_{cr}$ by
$$s_{cr} = \min\{s \in (0,1) \text{ s.t. } \exists\, l, \ |A_l| = M_l\},$$
where $|A_l|$ is the cardinality of $A_l$. Intuitively, $s_{cr}$ represents the maximal sparsity we are allowed to choose without fully pruning at least one layer. $s_{cr}$ is random as the weights are initialized randomly. Thus, we study the behaviour of the expected value $\mathbb{E}[s_{cr}]$ where, hereafter, all expectations are taken w.r.t. the random initial weights. This provides theoretical guidelines for pruning at initialization.
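To make the mask construction and the critical sparsity concrete, the following is a minimal sketch (not the authors' code) that builds a global top-$k$ mask from arbitrary per-weight scores $g^l$ and reports which layers, if any, are fully pruned; the layer shapes and random scores below are placeholders standing in for a real network and a real criterion.

```python
import torch

def global_topk_masks(scores, sparsity):
    """Keep the top (1 - sparsity) fraction of weights across all layers.

    scores: list of tensors g^l, one per layer, with the same shapes as the weights.
    Returns a list of binary masks delta^l.
    """
    flat = torch.cat([g.flatten() for g in scores])
    k = int((1.0 - sparsity) * flat.numel())        # k = (1 - s) * sum_l M_l
    threshold = torch.topk(flat, k).values.min()    # k-th largest score
    return [(g >= threshold).float() for g in scores]

def fully_pruned_layers(masks):
    """Indices of layers whose mask is all zeros, i.e. |A_l| = M_l."""
    return [l for l, m in enumerate(masks) if m.sum() == 0]

# Toy example: random scores standing in for |W * dL/dW| at initialization.
torch.manual_seed(0)
layer_shapes = [(100, 784)] + [(100, 100)] * 4 + [(10, 100)]
scores = [torch.rand(shape) for shape in layer_shapes]
masks = global_topk_masks(scores, sparsity=0.9)
print("fully pruned layers:", fully_pruned_layers(masks))
```

Sweeping `sparsity` upward until `fully_pruned_layers` first returns a non-empty list gives an empirical estimate of $s_{cr}$ for a given draw of the scores.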
For all $l \in [1:L]$, we define $\alpha_l$ by $v_l = \alpha_l N$ where $N > 0$, and $\zeta_l > 0$ such that $M_l = \zeta_l N^2$, where we recall that $v_l$ is a scaling factor controlling the variance of $y^l$ and $M_l$ is the number of weights in the $l^{\text{th}}$ layer. This notation assumes that, in each layer, the number of weights is quadratic in the number of neurons, which is satisfied by classical FFNN and CNN architectures.

2.2 SENSITIVITY-BASED PRUNING (SBP)

SBP is a data-dependent pruning method that uses the data to compute the gradient with back-propagation at initialization (one-shot pruning). We randomly sample a batch and compute the gradients of the loss with respect to each weight. The mask is then defined by $\delta^l_i = \mathbb{1}(|W^l_i \, \partial\mathcal{L}/\partial W^l_i| \ge t_s)$, where $t_s = |W \, \partial\mathcal{L}/\partial W|_{(k_s)}$, $k_s = (1-s) \sum_l M_l$, and $|W \, \partial\mathcal{L}/\partial W|_{(k_s)}$ is the $k_s^{\text{th}}$ order statistic of the sequence $(|W^l_i \, \partial\mathcal{L}/\partial W^l_i|)_{1 \le l \le L,\, 1 \le i \le M_l}$. However, this simple approach suffers from the well-known exploding/vanishing gradients problem, which makes the first/last few layers respectively susceptible to being completely pruned. We give a formal definition of this problem.

Definition 1 (Well-conditioned & ill-conditioned NN). Let $m_l = \mathbb{E}[|W^l_1 \, \partial\mathcal{L}/\partial W^l_1|^2]$ for $l \in [1:L]$. We say that the NN is well-conditioned if there exist $A, B > 0$ such that for all $L \ge 1$ and $l \in [1:L]$ we have $A \le m_l/m_L \le B$, and it is ill-conditioned otherwise.

Understanding the behaviour of gradients at initialization is thus crucial for SBP to be efficient. Using a mean-field approach, such an analysis has been carried out in (Schoenholz et al., 2017; Hayou et al., 2019; Xiao et al., 2018; Poole et al., 2016; Yang, 2019), where it has been shown that an initialization known as the EOC is beneficial for DNN training. The mean-field analysis of DNNs relies on two standard approximations that we will also use here.

Approximation 1 (Mean-Field Approximation). When $N_l \gg 1$ for FFNN or $n_l \gg 1$ for CNN, we use the approximation of infinitely wide NNs. This means an infinite number of neurons per layer for fully connected layers and an infinite number of channels per layer for convolutional layers.

Approximation 2 (Gradient Independence). The weights used for forward propagation are independent from those used for back-propagation.

These two approximations are ubiquitous in the literature on the mean-field analysis of neural networks. They have been used to derive theoretical results on signal propagation (Schoenholz et al., 2017; Hayou et al., 2019; Poole et al., 2016; Yang, 2019; Yang and Schoenholz, 2017; Yang et al., 2019) and are also key tools in the derivation of the Neural Tangent Kernel (Jacot et al., 2018; Arora et al., 2019; Hayou et al., 2020). Approximation 1 simplifies the analysis of the forward propagation as it allows the derivation of closed-form formulas for covariance propagation. Approximation 2 does the same for back-propagation. See Appendix A for a detailed discussion of these approximations. Throughout the paper, we provide numerical results that substantiate the theoretical results derived using these two approximations and show that they lead to an excellent match between theory and experiments.

Edge of Chaos (EOC): For inputs $x, x'$, let $c^l(x,x')$ be the correlation between $y^l(x)$ and $y^l(x')$. From (Schoenholz et al., 2017; Hayou et al., 2019), there exists a so-called correlation function $f$ that depends on $(\sigma_w, \sigma_b)$ such that $c^{l+1}(x,x') = f(c^l(x,x'))$. Let $\chi(\sigma_b, \sigma_w) = f'(1)$. The EOC is the set of hyperparameters $(\sigma_w, \sigma_b)$ satisfying $\chi(\sigma_b, \sigma_w) = 1$.
When χ(σb, σw) > 1, we are in the Chaotic phase, the gradient explodes and cl(x, x ) converges exponentially to some c < 1 for x = x and the resulting output function is discontinuous everywhere. When χ(σb, σw) < 1, we are in the Ordered phase where cl(x, x ) converges exponentially fast to 1 and the NN outputs constant functions. Initialization on the EOC allows for better information propagation (see Supplementary for more details). Hence, by leveraging the above results, we show that an initialization outside the EOC will lead to an ill-conditioned NN. Theorem 1 (EOC Initialization is crucial for SBP). Consider a NN of type (2) or (3) (FFNN or CNN). Assume (σw, σb) are chosen on the ordered phase, i.e. χ(σb, σw) < 1, then the NN is ill-conditioned. Moreover, we have 1 + log(κLN 2) where κ = | log χ(σb, σw)|/8. If (σw, σb) are on the EOC, i.e. χ(σb, σw) = 1, then the NN is well-conditioned. In this case, κ = 0 and the above upper bound no longer holds. The proof of Theorem 1 relies on the behaviour of the gradient norm at initialization. On the ordered phase, the gradient norm vanishes exponentially quickly as it back-propagates, thus resulting in an ill-conditioned network. We use another approximation for the sake of simplification of the proof (Approximation 3 in the Supplementary) but the result holds without this approximation although the resulting constants would be a bit different. Theorem 1 shows that the upper bound decreases the farther χ(σb, σw) is from 1, i.e. the farther the initialization is from the EOC. For constant width FFNN with L = 100, N = 100 and κ = 0.2, the theoretical upper bound is E[scr] 27% while we obtain E[scr] 22% based on 10 simulations. A similar result can be obtained when the NN is initialized on the chaotic phase; in this case too, the NN is ill-conditioned. To illustrate these results, Figure 1 shows the impact of the initialization with sparsity s = 70%. The dark area in Figure 1(b) corresponds to layers that are fully pruned in the chaotic phase due to exploding gradients. Using an EOC initialization, Figure 1(a) shows that pruned weights are well distributed in the NN, ensuring that no layer is fully pruned. Published as a conference paper at ICLR 2021 (a) Edge of Chaos (b) Chaotic phase Figure 1: Percentage of weights kept after SBP applied to a randomly initialized FFNN with depth 100 and width 100 for 70% sparsity on MNIST. Each pixel (i, j) corresponds to a neuron and shows the proportion of connections to neuron (i, j) that have not been pruned. The EOC (a) allows us to preserve a uniform spread of the weights, whereas the Chaotic phase (b), due to exploding gradients, prunes entire layers. 2.3 TRAINING PRUNED NETWORKS USING THE RESCALING TRICK We have shown previously that an initialization on the EOC is crucial for SBP. However, we have not yet addressed the key problem of training the resulting pruned NN. This can be very challenging in practice (Li et al., 2018), especially for deep NN. Consider as an example a FFNN architecture. After pruning, we have for an input x ˆyl i(x) = PNl 1 j=1 W l ijδl ijφ(ˆyl 1 j (x)) + Bl i, for l 2, (5) where δ is the pruning mask. While the original NN initialized on the EOC was satisfying cl+1(x, x ) = f(cl(x, x )) for f (1) = χ(σb, σw) = 1, the pruned architecture leads to ˆcl+1(x, x ) = fpruned(ˆcl(x, x )) with f pruned(1) = 1, hence pruning destroys the EOC. Consequently, the pruned NN will be difficult to train (Schoenholz et al., 2017; Hayou et al., 2019) especially if it is deep. 
Hence, we propose to bring the pruned NN back on the EOC. This approach consists of rescaling the weights obtained after SBP in each layer by factors that depend on the pruned architecture itself.

Proposition 1 (Rescaling Trick). Consider a NN of type (2) or (3) (FFNN or CNN) initialized on the EOC. Then, after pruning, the pruned NN is not initialized on the EOC anymore. However, the rescaled pruned NN
$$y^l(x) = \mathcal{F}(\rho^l \odot \delta^l \odot W^l, y^{l-1}(x)) + B^l, \quad \text{for } l \ge 1, \tag{6}$$
with
$$\rho^l_{ij} = \big(\mathbb{E}[N_{l-1}(W^l_{i1})^2 \delta^l_{i1}]\big)^{-1/2} \ \text{for FFNN}, \qquad \rho^l_{i,j,\beta} = \big(\mathbb{E}[n_{l-1}(W^l_{i,1,\beta})^2 \delta^l_{i,1,\beta}]\big)^{-1/2} \ \text{for CNN}, \tag{7}$$
is initialized on the EOC. (The scaling is constant across $j$.)

The scaling factors in equation 7 are easily approximated using the weights kept after pruning. Algorithm 1 (see Appendix I) details a practical implementation of this rescaling technique for FFNN. We illustrate experimentally the benefits of this approach in Section 4.

3 SENSITIVITY-BASED PRUNING FOR STABLE RESIDUAL NETWORKS

Resnets and their variants (He et al., 2015; Huang et al., 2017) are currently the best performing models on various classification tasks (CIFAR10, CIFAR100, ImageNet, etc. (Kolesnikov et al., 2019)). Thus, understanding Resnet pruning at initialization is of crucial interest. Yang and Schoenholz (2017) showed that Resnets naturally live on the EOC. Using this result, we show that Resnets are actually better suited to SBP than FFNN and CNN. However, Resnets suffer from an exploding gradient problem (Yang and Schoenholz, 2017) which might affect the performance of SBP. We address this issue by introducing a new Resnet parameterization.

Figure 2: Percentage of non-pruned weights per layer in a ResNet32 for our Stable ResNet32 and standard ResNet32 with Kaiming initialization on CIFAR10. With Stable Resnet, we prune weights in the deeper layers less aggressively than for standard Resnet.

Let a standard Resnet architecture be given by
$$y^1(x) = \mathcal{F}(W^1, x), \qquad y^l(x) = y^{l-1}(x) + \mathcal{F}(W^l, y^{l-1}), \quad \text{for } l \ge 2, \tag{8}$$
where $\mathcal{F}$ defines the blocks of the Resnet. Hereafter, we assume that $\mathcal{F}$ is either of the form (2) or (3) (FFNN or CNN). The next theorem shows that Resnets are well-conditioned independently of the initialization and are thus well suited for pruning at initialization.

Theorem 2 (Resnets are Well-Conditioned). Consider a Resnet with either Fully Connected or Convolutional layers and ReLU activation function. Then for all $\sigma_w > 0$, the Resnet is well-conditioned. Moreover, for all $l \in \{1, \dots, L\}$, we have $m_l = \Theta\big((1 + \tfrac{\sigma_w^2}{2})^L\big)$.

The above theorem proves that Resnets are always well-conditioned. However, taking a closer look at $m_l$, which represents the variance of the pruning criterion (Definition 1), we see that it grows exponentially with the number of layers $L$. This could lead to a higher variance of pruned networks and hence high-variance test accuracy. To this end, we propose a Resnet parameterization which we call Stable Resnet. Stable Resnets prevent the second moment from growing exponentially, as shown below.

Proposition 2 (Stable Resnet). Consider the following Resnet parameterization
$$y^l(x) = y^{l-1}(x) + \frac{1}{\sqrt{L}} \mathcal{F}(W^l, y^{l-1}), \quad \text{for } l \ge 2, \tag{9}$$
then the NN is well-conditioned for all $\sigma_w > 0$. Moreover, for all $l \le L$ we have $m_l = \Theta(L^{-1})$.

In Proposition 2, $L$ is not the number of layers but the number of blocks. For example, ResNet32 has 15 blocks and 32 layers, hence $L = 15$. Figure 2 shows the percentage of weights in each layer kept after pruning ResNet32 and Stable ResNet32 at initialization.
The jumps correspond to limits between sections in Res Net32 and are caused by max-pooling. Within each section, Stable Resnet tends to have a more uniform distribution of percentages of weights kept after pruning compared to standard Resnet. In Section 4 we show that this leads to better performance of Stable Resnet compared to standard Resnet. Further theoretical and experimental results for Stable Resnets are presented in (Hayou et al., 2021). In the next proposition, we establish that, unlike FFNN or CNN, we do not need to rescale the pruned Resnet for it to be trainable as it lives naturally on the EOC before and after pruning. Proposition 3 (Resnet live on the EOC even after pruning). Consider a Residual NN with blocks of type FFNN or CNN. Then, after pruning, the pruned Residual NN is initialized on the EOC. 4 EXPERIMENTS In this section, we illustrate empirically the theoretical results obtained in the previous sections. We validate the results on MNIST, CIFAR10, CIFAR100 and Tiny Image Net. 4.1 INITIALIZATION AND RESCALING According to Theorem 1, an EOC initialization is necessary for the network to be well-conditioned. We train FFNN with tanh activation on MNIST, varying depth L {2, 20, 40, 60, 80, 100} and Published as a conference paper at ICLR 2021 (a) EOC Init & Rescaling (b) EOC Init (c) Ordered phase Init Figure 3: Accuracy on MNIST with different initialization schemes including EOC with rescaling, EOC without rescaling, Ordered phase, with varying depth and sparsity. This shows that rescaling to be on the EOC allows us to train not only much deeper but also sparser models. sparsity s {10%, 20%, .., 90%}. We use SGD with batchsize 100 and learning rate 10 3, which we found to be optimal using a grid search with an exponential scale of 10. Figure 3 shows the test accuracy after 10k iterations for 3 different initialization schemes: Rescaled EOC, EOC, Ordered. On the Ordered phase, the model is untrainable when we choose sparsity s > 40% and depth L > 60 as one layer being fully pruned. For an EOC initialization, the set (s, L) for which NN are trainable becomes larger. However, the model is still untrainable for highly sparse deep networks as the sparse NN is no longer initialized on the EOC (see Proposition 1). As predicted by Proposition 1, after application of the rescaling trick to bring back the pruned NN on the EOC, the pruned NN can be trained appropriately. Table 2: Classification accuracies for CIFAR10 and CIFAR100 after pruning CIFAR10 CIFAR100 SPARSITY 90% 95% 98% 90% 95% 98% Res Net32 (NO PRUNING) 94.80 - - 74.64 - - OBD LECUN ET AL. (1990) 93.74 93.58 93.49 73.83 71.98 67.79 RANDOM PRUNING 89.95 0.23 89.68 0.15 86.13 0.25 63.13 2.94 64.55 0.32 19.83 3.21 MBP 90.21 0.55 88.35 0.75 86.83 0.27 67.07 0.31 64.92 0.77 59.53 2.19 SNIP LEE ET AL. (2018) 92.26 0.32 91.18 0.17 87.78 0.16 69.31 0.52 65.63 0.15 55.70 1.13 GRASP WANG ET AL. (2020) 92.20 0.31 91.39 0.25 88.70 0.42 69.24 0.24 66.50 0.11 58.43 0.43 GRASP-SR 92.30 0.19 91.16 0.13 87.8 0.32 69.12 0.15 65.49 0.21 58.63 0.23 SYNFLOW TANAKA ET AL. 
(2020) 92.01 0.22 91.67 0.17 88.10 0.25 69.03 0.20 65.23 0.31 58.73 0.30 SBP-SR (STABLE RESNET) 92.56 0.06 91.21 0.30 88.25 0.35 69.51 0.21 66.72 0.12 59.51 0.15 Res Net50 (NO PRUNING) 94.90 - - 74.9 - - RANDOM PRUNING 85.11 4.51 88.76 0.21 85.32 0.47 65.67 0.57 60.23 2.21 28.32 10.35 MBP 90.11 0.32 89.06 0.09 87.32 0.16 68.51 0.21 63.32 1.32 55.21 0.35 SNIP 91.95 0.13 92.12 0.34 89.26 0.23 70.43 0.43 67.85 1.02 60.38 0.78 GRASP 92.10 0.21 91.74 0.35 89.97 0.25 70.53 0.32 67.84 0.25 63.88 0.45 SYNFLOW 92.05 0.20 91.83 0.23 89.61 0.17 70.43 0.30 67.95 0.22 63.95 0.11 SBP-SR 92.05 0.06 92.74 0.32 89.57 0.21 71.79 0.13 68.98 0.15 64.45 0.34 Res Net104 (NO PRUNING) 94.92 - - 75.24 - - RANDOM PRUNING 89.80 0.33 87.86 1.22 85.52 2.12 66.73 1.32 64.98 0.11 30.31 4.51 MBP 90.05 1.23 88.95 0.65 87.83 1.21 69.57 0.35 64.31 0.78 60.21 2.41 SNIP 93.25 0.53 92.98 0.12 91.58 0.19 71.94 0.22 68.73 0.09 63.31 0.41 GRASP 93.08 0.17 92.93 0.09 91.19 0.35 73.33 0.21 70.95 1.12 66.91 0.33 SYNFLOW 93.43 0.10 92.85 0.18 91.03 0.25 72.85 0.20 70.33 0.15 67.02 0.10 SBP-SR 94.69 0.13 93.88 0.17 92.08 0.14 74.17 0.11 71.84 0.13 67.73 0.28 4.2 RESNET AND STABLE RESNET Although Resnets are adapted to SBP (i.e. they are always well-conditioned for all σw > 0), Theorem 2 shows that the magnitude of the pruning criterion grows exponentially w.r.t. the depth L. To resolve this problem we introduced Stable Resnet. We call our pruning algorithm for Res Net SBP-SR (SBP with Stable Resnet). Theoretically, we expect SBP-SR to perform better than other methods for deep Resnets according to Proposition 2. Table 2 shows test accuracies for Res Net32, Res Net50 and Res Net104 with varying sparsities s {90%, 95%, 98%} on CIFAR10 and CIFAR100. For all our experiments, we use a setup similar to (Wang et al., 2020), i.e. we use SGD for 160 and 250 epochs for CIFAR10 and CIFAR100, respectively. We use an initial learning rate of 0.1 and decay it by 0.1 Published as a conference paper at ICLR 2021 at 1/2 and 3/4 of the number of total epoch. In addition, we run all our experiments 3 times to obtain more stable and reliable test accuracies. As in (Wang et al., 2020), we adopt Resnet architectures where we doubled the number of filters in each convolutional layer. As a baseline, we include pruning results with the classical OBD pruning algorithm (Le Cun et al., 1990) for Res Net32 (train prune repeat). We compare our results against other algorithms that prune at initialization, such as SNIP (Lee et al., 2018), which is a SBP algorithm, Gra SP (Wang et al., 2020) which is a Hessian based pruning algorithm, and Syn Flow (Tanaka et al., 2020), which is an iterative data-agnostic pruning algorithm. As we increase the depth, SBP-SR starts to outperform other algorithms that prune at initialization (SBP-SR outperforms all other algorithms with Res Net104 on CIFAR10 and CIFAR100). Furthermore, using Gra SP on Stable Resnet did not improve the result of Gra SP on standard Resnet, as our proposed Stable Resnet analysis only applies to gradient based pruning. The analysis of Hessian based pruning could lead to similar techniques for improving trainability, which we leave for future work. 
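To make the Stable Resnet parameterization used by SBP-SR concrete, here is a minimal PyTorch sketch of a residual block with the $1/\sqrt{L}$ scaling of equation (9); it is a simplified basic block (no batch normalization or downsampling), not the exact architecture used in the experiments above, and `num_blocks` plays the role of $L$ (15 for ResNet32).

```python
import torch
import torch.nn as nn

class StableBasicBlock(nn.Module):
    """Residual block with the 1/sqrt(L) branch scaling of Proposition 2 (Stable Resnet)."""

    def __init__(self, channels, num_blocks):
        super().__init__()
        self.scale = 1.0 / (num_blocks ** 0.5)   # 1/sqrt(L), L = number of residual blocks
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        # y^l = y^{l-1} + (1/sqrt(L)) * F(W^l, y^{l-1})
        return x + self.scale * self.branch(x)

# Example: a stack of 15 blocks, matching the 15 residual blocks of ResNet32.
blocks = nn.Sequential(*[StableBasicBlock(16, num_blocks=15) for _ in range(15)])
out = blocks(torch.randn(2, 16, 32, 32))
print(out.shape)  # torch.Size([2, 16, 32, 32])
```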
Table 3: Classification accuracies on Tiny Image Net for Resnet with varying depths ALGORITHM 85% 90% 95% RESNET32 SBP-SR 57.25 0.09 55.67 0.21 50.63 0.21 SNIP 56.92 0.33 54.99 0.37 49.48 0.48 GRASP 57.25 0.11 55.53 0.11 51.34 0.29 SYNFLOW 56.75 0.09 55.60 0.07 51.50 0.21 RESNET50 SBP-SR 59.8 0.18 57.74 0.06 53.97 0.27 SNIP 58.91 0.23 56.15 0.31 51.19 0.47 GRASP 58.46 0.29 57.48 0.35 52.5 0.41 SYNFLOW 59.31 0.17 57.67 0.15 53.14 0.31 RESNET104 SBP-SR 62.84 0.13 61.96 0.11 57.9 0.31 SNIP 59.94 0.34 58.14 0.28 54.9 0.42 GRASP 61.1 0.41 60.14 0.38 56.36 0.51 SYNFLOW 61.71 0.08 60.81 0.14 55.91 0.43 To confirm these results, we also test SBP-SR against other pruning algorithms on Tiny Image Net. We train the models for 300 training epochs to make sure all algorithms converge. Table 3 shows test accuracies for SBP-SR, SNIP, Gra SP, and Syn Flow for s {85%, 90%, 95%}. Although Syn Flow competes or outperforms Gra SP in many cases, SBP-SR has a clear advantage over Syn Flow and other algorithms, especially for deep networks as illustrated on Res Net104. Additional results with Image Net dataset are provided in Appendix F. 4.3 RESCALING TRICK AND CNNS The theoretical analysis of Section 2 is valid for Vanilla CNN i.e. CNN without pooling layers. With pooling layers, the theory of signal propagation applies to sections between successive pooling layers; each of those section can be seen as Vanilla CNN. This applies to standard CNN architectures such as VGG. As a toy example, we show in Table 4 the test accuracy of a pruned V-CNN with sparsity s = 50% on MNIST dataset. Similar to FFNN results in Figure 3, the combination of the EOC Init and the Re Scaling trick allows for pruning deep V-CNN (depth 100) while ensuring their trainability. Table 4: Test accuracy on MNIST with V-CNN for different depths with sparsity 50% using SBP(SNIP) L = 10 L = 50 L = 100 ORDERED PHASE INIT 98.12 0.13 10.00 0.0 10.00 0.0 EOC INIT 98.20 0.17 98.75 0.11 10.00 0.0 EOC + RESCALING 98.18 0.21 98.90 0.07 99.15 0.08 However, V-CNN is a toy example that is generally not used in practice. Standard CNN architectures such as VGG are popular among practitioners since they achieve SOTA accuracy on many tasks. Table 5 shows test accuracies for SNIP, Syn Flow, and our EOC+Re Scaling trick for VGG16 on CIFAR10. Our results are close to the results presented by Frankle et al. (2020). These three Published as a conference paper at ICLR 2021 algorithms perform similarly. From a theoretical point of view, our Re Scaling trick applies to vanilla CNNs without pooling layers, hence, adding pooling layers might cause a deterioration. However, we know that the signal propagation theory applies to vanilla blocks inside VGG (i.e. the sequence of convolutional layers between two successive pooling layers). The larger those vanilla blocks are, the better our Re Scaling trick performs. We leverage this observation by training a modified version of VGG, called 3x VGG16, which has the same number of pooling layers as VGG16, and 3 times the number of convolutional layers inside each vanilla block. Numerical results in Table 5 show that the EOC initialization with the Re Scaling trick outperforms other algorithms, which confirms our hypothesis. However, the architecture 3x VGG16 is not a standard architecture and it does not seem to improve much the test accuracy of VGG16. An adaptation of the Re Scaling trick to standard VGG architectures would be of great value and is left for future work. 
Table 5: Classification accuracy on CIFAR10 for VGG16 and 3x VGG16 with varying sparsities ALGORITHM 85% 90% 95% VGG16 SNIP 93.09 0.11 92.97 0.08 92.61 0.10 SYNFLOW 93.21 0.13 93.05 0.11 92.19 0.12 EOC + RESCALING 93.15 0.12 92.90 0.15 92.70 0.06 3XVGG16 SNIP 93.30 0.10 93.12 0.20 92.85 0.15 SYNFLOW 92.95 0.13 92.91 0.21 92.70 0.20 EOC + RESCALING 93.97 0.17 93.75 0.15 93.40 0.16 Summary of numerical results. We summarize in Table 6 our numerical results. The letter C refers to Competition between algorithms in that setting, and indicates no clear winner is found, while the dash means no experiment has been run with this setting. We observe that our algorithm SBP-SR consistently outperforms other algorithms in a variety of settings. Table 6: Which algorithm performs better? (according to our results) DATASET ARCHITECTURE 85% 90% 95% 98% CIFAR10 RESNET32 - C C GRASP RESNET50 - C SBP-SR GRASP RESNET104 - SBP-SR SBP-SR SBP-SR VGG16 C C C - 3XVGG16 EOC+RESC EOC+RESC EOC+RESC - CIFAR100 RESNET32 - SBP-SR SBP-SR SBP-SR RESNET50 - SBP-SR SBP-SR SBP-SR RESNET104 - SBP-SR SBP-SR SBP-SR TINY IMAGENET RESNET32 C C SYNFLOW - RESNET50 SBP-SR C SBP-SR - RESNET104 SBP-SR SBP-SR SBP-SR - 5 CONCLUSION In this paper, we have formulated principled guidelines for SBP at initialization. For FNNN and CNN, we have shown that an initialization on the EOC is necessary followed by the application of a simple rescaling trick to train the pruned network. For Resnets, the situation is markedly different. There is no need for a specific initialization but Resnets in their original form suffer from an exploding gradient problem. We propose an alternative Resnet parameterization called Stable Resnet, which allows for more stable pruning. Our theoretical results have been validated by extensive experiments on MNIST, CIFAR10, CIFAR100, Tiny Image Net and Image Net. Compared to other available one-shot pruning algorithms, we achieve state-of the-art results in many scenarios. Published as a conference paper at ICLR 2021 Alvarez, J. M. and M. Salzmann (2017). Compression-aware training of deep networks. In 31st Conference in Neural Information Processing Systems, pp. 856 867. Arora, S., S. Du, W. Hu, Z. Li, R. Salakhutdinov, and R. Wang (2019). On exact computation with an infinitely wide neural net. In 33rd Conference on Neural Information Processing Systems. Carreira-Perpi n an, M. and Y. Idelbayev (2018, June). Learning-compression algorithms for neural net pruning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Dong, X., S. Chen, and S. Pan (2017). Learning to prune deep neural networks via layer-wise optimal brain surgeon. In 31st Conference on Neural Information Processing Systems, pp. 4860 4874. Du, S., X. Zhai, B. Poczos, and A. Singh (2019). Gradient descent provably optimizes overparameterized neural networks. In 7th International Conference on Learning Representations. Frankle, J. and M. Carbin (2019). The lottery ticket hypothesis: Finding sparse, trainable neural networks. In 7th International Conference on Learning Representations. Frankle, J., G. Dziugaite, D. Roy, and M. Carbin (2020). Pruning neural networks at initialization: Why are we missing the mark? ar Xiv preprint ar Xiv:2009.08576. Hardy, G., J. Littlewood, and G. P olya (1952). Inequalities, Volume 2. Cambridge Mathematical Library. Hassibi, B., D. Stork, and W. Gregory (1993). Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks, pp. 293 299 vol.1. Hayou, S., E. 
Clerico, B. He, G. Deligiannidis, A. Doucet, and J. Rousseau (2021). Stable resnet. In 24th International Conference on Artificial Intelligence and Statistics. Hayou, S., A. Doucet, and J. Rousseau (2019). On the impact of the activation function on deep neural networks training. In 36th International Conference on Machine Learning. Hayou, S., A. Doucet, and J. Rousseau (2020). Mean-field behaviour of neural tangent kernel for deep neural networks. ar Xiv preprint ar Xiv:1905.13654. He, K., X. Zhang, S. Ren, and J. Sun (2015). Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770 778. Huang, G., Z. Liu, L. Maaten, and K. Weinberger (2017). Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261 2269. Jacot, A., F. Gabriel, and C. Hongler (2018). Neural tangent kernel: Convergence and generalization in neural networks. In 32nd Conference on Neural Information Processing Systems. Kolesnikov, A., L. Beyer, X. Zhai, J. Puigcerver, J. Yung, S. Gelly, and N. Houlsby (2019). Large scale learning of general visual representations for transfer. ar Xiv preprint ar Xiv:1912.11370. Le Cun, Y., J. Denker, and S. Solla (1990). Optimal brain damage. In Advances in Neural Information Processing Sstems, pp. 598 605. Lee, J., Y. Bahri, R. Novak, S. Schoenholz, J. Pennington, and J. Sohl-Dickstein (2018). Deep neural networks as Gaussian processes. In 6th International Conference on Learning Representations. Lee, J., L. Xiao, S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington (2019). Wide neural networks of any depth evolve as linear models under gradient descent. In 33rd Conference on Neural Information Processing Systems. Lee, N., T. Ajanthan, S. Gould, and P. H. S. Torr (2020). A signal propagation perspective for pruning neural networks at initialization. In 8th International Conference on Learning Representations. Lee, N., T. Ajanthan, and P. H. Torr (2018). Snip: Single-shot network pruning based on connection sensitivity. In 6th International Conference on Learning Representations. Published as a conference paper at ICLR 2021 Li, H., A. Kadav, I. Durdanovic, H. Samet, and H. Graf (2018). Pruning filters for efficient convnets. In 6th International Conference on Learning Representations. Li, Y., S. Gu, C. Mayer, L. V. Gool, and R. Timofte (2020). Group sparsity: The hinge between filter pruning and decomposition for network compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8018 8027. Li, Y., S. Gu, K. Zhang, L. Van Gool, and R. Timofte (2020). Dhp: Differentiable meta pruning via hypernetworks. ar Xiv preprint ar Xiv:2003.13683. Lillicrap, T., D. Cownden, D. Tweed, and C. Akerman (2016). Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications 7(13276). Liu, Z., H. Mu, X. Zhang, Z. Guo, X. Yang, K.-T. Cheng, and J. Sun (2019). Metapruning: Meta learning for automatic neural network channel pruning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3296 3305. Louizos, C., M. Welling, and D. Kingma (2018). Learning sparse neural networks through l0 regularization. In 6th International Conference on Learning Representations. Matthews, A., J. Hron, M. Rowland, R. Turner, and Z. Ghahramani (2018). Gaussian process behaviour in wide deep neural networks. In 6th International Conference on Learning Representations. Mozer, M. and P. 
Smolensky (1989). Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in Neural Information Processing Systems, pp. 107 115. Neal, R. (1995). Bayesian Learning for Neural Networks, Volume 118. Springer Science & Business Media. Neyshabur, B., Z. Li, S. Bhojanapalli, Y. Le Cun, and N. Srebro (2019). The role of overparametrization in generalization of neural networks. In 7th International Conference on Learning Representations. Nguyen, Q. and M. Hein (2018). Optimization landscape and expressivity of deep CNNs. In 35th International Conference on Machine Learning. Peˇcari c, J., F. Proschan, and Y. Tong (1992). Convex Functions, Partial Orderings, and Statistical Applications. Academic Press. Poole, B., S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli (2016). Exponential expressivity in deep neural networks through transient chaos. In 30th Conference on Neural Information Processing Systems. Puri, M. and S. Ralescu (1986). Limit theorems for random central order statistics. Lecture Notes Monograph Series Vol. 8, Adaptive Statistical Procedures and Related Topics. Schoenholz, S., J. Gilmer, S. Ganguli, and J. Sohl-Dickstein (2017). Deep information propagation. In 5th International Conference on Learning Representations. Tanaka, H., D. Kunin, D. L. Yamins, and S. Ganguli (2020). Pruning neural networks without any data by iteratively conserving synaptic flow. In 34th Conference on Neural Information Processing Systems. Van Handel, R. (2016). Probability in High Dimension. Princeton University. APC 550 Lecture Notes. Von Mises, R. (1936). La distribution de la plus grande de n valeurs. Selected Papers Volumen II, American Mathematical Society, 271 294. Wang, C., G. Zhang, and R. Grosse (2020). Picking winning tickets before training by preserving gradient flow. In 8th International Conference on Learning Representations. Published as a conference paper at ICLR 2021 Xiao, L., Y. Bahri, J. Sohl-Dickstein, S. S. Schoenholz, and P. Pennington (2018). Dynamical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla convolutional neural networks. In 35th International Conference on Machine Learning. Yang, G. (2019). Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. ar Xiv preprint ar Xiv:1902.04760. Yang, G., J. Pennington, V. Rao, J. Sohl-Dickstein, and S. S. Schoenholz (2019). A mean field theory of batch normalization. In 7th International Conference on Learning Representations. Yang, G. and S. Schoenholz (2017). Mean field residual networks: On the edge of chaos. In 31st Conference in Neural Information Processing Systems, Volume 30, pp. 2869 2869. Zhang, C., S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016). Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations. Published as a conference paper at ICLR 2021 A DISCUSSION ABOUT APPROXIMATIONS 1 AND 2 A.1 APPROXIMATION 1: INFINITE WIDTH APPROXIMATION Feed Forward Neural Network Consider a randomly initialized FFNN of depth L, widths (Nl)1 l L, weights W l ij iid N(0, σ2 w Nl 1 ) and bias Bl i iid N(0, σ2 b), where N(µ, σ2) denotes the normal distribution of mean µ and variance σ2. For some input x Rd, the propagation of this input through the network is given by j=1 W 1 ijxj + B1 i , (10) j=1 W l ijφ(yl 1 j (x)) + Bl i, for l 2. (11) Where φ : R R is the activation function. 
When we take the limit Nl 1 , the Central Limit Theorem implies that yl i(x) is a Gaussian variable for any input x. This approximation by infinite width solution results in an error of order O(1/ p Nl 1) (standard Monte Carlo error). More generally, an approximation of the random process yl i(.) by a Gaussian process was first proposed by Neal (1995) in the single layer case and has been recently extended to the multiple layer case by Lee et al. (2018) and Matthews et al. (2018). We recall here the expressions of the limiting Gaussian process kernels. For any input x Rd, E[yl i(x)] = 0 so that for any inputs x, x Rd κl(x, x ) = E[yl i(x)yl i(x )] = σ2 b + σ2 w E[φ(yl 1 i (x))φ(yl 1 i (x ))] = σ2 b + σ2 w Fφ(κl 1(x, x), κl 1(x, x ), κl 1(x , x )), where Fφ is a function that only depends on φ. This provides a simple recursive formula for the computation of the kernel κl; see, e.g., Lee et al. (2018) for more details. Convolutional Neural Networks Similar to the FFNN case, the infinite width approximation with 1D CNN (introduced in the main paper) yields a recursion for the kernel. However, the infinite width here means infinite number of channels, and results in an error O(1/ nl 1). The kernel in this case depends on the choice of the neurons in the channel and is given by κl α,α (x, x ) = E[yl i,α(x)yl i,α (x )] = σ2 b + σ2 w 2k + 1 β ker E[φ(yl 1 1,α+β(x))φ(yl 1 1,α +β(x ))] κl α,α (x, x ) = σ2 b + σ2 w 2k + 1 β ker Fφ(κl 1 α+β,α +β(x, x), κl 1 α+β,α +β(x, x ), κl 1 α+β,α +β(x , x )). The convolutional kernel κl α,α has the self-averaging property; i.e. it is an average over the kernels corresponding to different combination of neurons in the previous layer. However, it is easy to simplify the analysis in this case by studying the average kernel per channel defined by ˆκl = 1 N2 P α,α κl α,α . Indeed, by summing terms in the previous equation and using the fact that we use circular padding, we obtain ˆκl(x, x ) = σ2 b + σ2 w 1 N 2 X α,α Fφ(κl 1 α,α (x, x), κl 1 α,α (x, x ), κl 1 α,α (x , x )). This expression is similar in nature to that of FFNN. We will use this observation in the proofs. Note that our analysis only requires the approximation that, in the infinite width limit, for any two inputs x, x , the variables yl i(x) and yl i(x ) are Gaussian with covariance κl(x, x ) for FFNN, and Published as a conference paper at ICLR 2021 yl i,α(x) and yl i,α (x ) are Gaussian with covariance κl α,α (x, x ) for CNN. We do not need the much stronger approximation that the process yl i(x) (yl i,α(x) for CNN) is a Gaussian process. Residual Neural Networks The infinite width limit approximation for Res Net yields similar results with an additional residual terms. It is straighforward to see that, in the case a Res Net with FFNN-type layers, we have that κl(x, x ) = κl 1(x, x ) + σ2 b + σ2 w Fφ(κl 1(x, x), κl 1(x, x ), κl 1(x , x )), whereas for Res Net with CNN-type layers, we have that κl α,α (x, x ) = κl 1 α,α (x, x ) + σ2 b + σ2 w 2k + 1 β ker Fφ(κl 1 α+β,α +β(x, x), κl 1 α+β,α +β(x, x ), κl 1 α+β,α +β(x , x )). A.2 APPROXIMATION 2: GRADIENT INDEPENDENCE For gradient back-propagation, an essential assumption in prior literature in Mean-Field analysis of DNNs is that of the gradient independence which is similar in nature to the practice of feedback alignment (Lillicrap et al., 2016). This approximation allows for derivation of recursive formulas for gradient back-propagation, and it has been extensively used in literature and verified empirically; see references below. 
Gradient Covariance back-propagation: this approximation was used to derive analytical formulas for gradient covariance back-propagation in (Hayou et al., 2019; Schoenholz et al., 2017; Yang and Schoenholz, 2017; Lee et al., 2018; Poole et al., 2016; Xiao et al., 2018; Yang, 2019). It was shown empirically through simulations that it is an excellent approximation for FFNN in Schoenholz et al. (2017), for Resnets in Yang and Schoenholz (2017) and for CNN in Xiao et al. (2018). Neural Tangent Kernel (NTK): this approximation was implicitly used by Jacot et al. (2018) to derive the recursive formula of the infinite width Neural Tangent Kernel (See Jacot et al. (2018), Appendix A.1). Authors have found that this approximation yields excellent match with exact NTK. It was also exploited later in (Arora et al., 2019; Hayou et al., 2020) to derive the infinite NTK for different architectures. The difference between the infinite width NTK Θ and the empirical (exact) NTK ˆΘ was studied in Lee et al. (2019) where authors have shown that Θ ˆΘ F = O(N 1) where N is the width of the NN. More precisely, we use the approximation that, for wide neural networks, the weights used for forward propagation are independent from those used for back-propagation. When used for the computation of gradient covariance and Neural Tangent Kernel, this approximation was proven to give the exact computation for standard architectures such as FFNN, CNN and Res Nets, without Batch Norm in Yang (2019) (section D.5). Even with Batch Norm, in Yang et al. (2019), authors have found that the Gradient Independence approximation matches empirical results. This approximation can be alternatively formulated as an assumption instead of an approximation as in Yang and Schoenholz (2017). Assumption 1 (Gradient Independence): The gradients are computed using an i.i.d. version of the weights used for forward propagation. B PRELIMINARY RESULTS Let x be an input in Rd. In its general form, a neural network of depth L is given by the following set of forward propagation equations yl(x) = Fl(W l, yl 1(x)) + Bl, 1 l L, (12) where yl(x) is the vector of pre-activations and W l and Bl are respectively the weights and bias of the lth layer. Fl is a mapping that defines the nature of the layer. The weights and bias are initialized with W l iid N(0, σ2 w vl ) where vl is a scaling factor used to control the variance of yl, and Bl iid N(0, σ2 b). Hereafter, we denote by Ml the number of weights in the lth layer, φ the activation function and [n : m] the set of integers {n, n + 1, ..., m} for n m. Two examples of such architectures are: Published as a conference paper at ICLR 2021 Fully-connected Feed Forward Neural Network (FFNN) For a fully connected feedforward neural network of depth L and widths (Nl)0 l L, the forward propagation of the input through the network is given by j=1 W 1 ijxj + B1 i , j=1 W l ijφ(yl 1 j (x)) + Bl i, for l 2. Here, we have vl = Nl 1 and Ml = Nl 1Nl. Convolutional Neural Network (CNN/Conv Net) For a 1D convolutional neural network of depth L, number of channels (nl)l L and number of neurons per channel (Nl)l L. we have y1 i,α(x) = β kerl W 1 i,j,βxj,α+β + b1 i , yl i,α(x) = β kerl W l i,j,βφ(yl 1 j,α+β(x)) + bl i, for l 2, where i [1 : nl] is the channel index, α [0 : Nl 1] is the neuron location, kerl = [ kl : kl] is the filter range and 2kl + 1 is the filter size. To simplify the analysis, we assume hereafter that Nl = N and kl = k for all l. Here, we have vl = nl 1(2k + 1) and Ml = nl 1nl(2k + 1). 
We assume periodic boundary conditions, so yl i,α = yl i,α+N = yl i,α N. Generalization to multidimensional convolutions is straighforward. Notation: Hereafter, for FFNN layers, we denote by ql(x) the variance of yl 1(x) (the choice of the index 1 is not crucial since, by the mean-field approximation, the random variables (yl i(x))i [1:Nl] are iid Gaussian variables). We denote by ql(x, x ) the covariance between yl 1(x) and yl 1(x ), and cl 1(x, x ) the corresponding correlation. For gradient back-propagation, for some loss function L, we denote by ql(x, x ) the gradient covariance defined by ql(x, x ) = E h L yl 1 (x) L yl 1 (x ) i . Similarly, ql(x) denotes the gradient variance at point x. For CNN layers, we use similar notation accross channels. More precisely, we denote by ql α(x) the variance of yl 1,α,(x) (the choice of the index 1 is not crucial here either since, by the meanfield approximation, the random variables (yl i,α(x))i [1:Nl] are iid Gaussian variables). We denote by ql α,α (x, x ) the covariance between yl 1,α(x) and yl 1,α (x ), and cl α,α (x, x ) the corresponding correlation. As in the FFNN case, we define the gradient covariance by ql α,α (x, x ) = E L yl 1,α (x) L yl 1,α (x ) . B.1 WARMUP : SOME RESULTS FROM THE MEAN-FIELD THEORY OF DNNS We start by recalling some results from the mean-field theory of deep NNs. B.1.1 COVARIANCE PROPAGATION Covariance propagation for FFNN: In Section A.1, we presented the recursive formula for covariance propagation in a FFNN, which we derive using the Central Limit Theorem. More precisely, for two inputs x, x Rd, we have ql(x, x ) = σ2 b + σ2 w E[φ(yl 1 i (x))φ(yl 1 i (x ))]. This can be rewritten as ql(x, x ) = σ2 b + σ2 w E φ q ql(x )(cl 1Z1 + q 1 (cl 1)2Z2 Published as a conference paper at ICLR 2021 where cl 1 := cl 1(x, x ). With a Re LU activation function, we have ql(x, x ) = σ2 b + σ2 w 2 ql(x )f(cl 1), where f is the Re LU correlation function given by (Hayou et al. (2019)) π (c arcsin c + p Covariance propagation for CNN: Similar to the FFNN case, it is straightforward to derive recusive formula for the covariance. However, in this case, the independence is across channels and not neurons. Simple calculus yields ql α,α (x, x ) = E[yl i,α(x)yl i,α (x )] = σ2 b + σ2 w 2k + 1 β ker E[φ(yl 1 1,α+β(x))φ(yl 1 1,α +β(x ))] Using a Re LU activation function, this becomes ql α,α (x, x ) = σ2 b + σ2 w 2k + 1 ql α+β(x) q ql α +β(x )f(cl 1 α+β,α +β(x, x )). Covariance propagation for Res Net with Re LU : This case is similar to the non residual case. However, an added residual term shows up in the recursive formula. For Res Net with FFNN layers, we have ql(x, x ) = ql 1(x, x ) + σ2 b + σ2 w 2 ql(x )f(cl 1) and for Res Net with CNN layers, we have ql α,α (x, x ) = ql 1 α,α (x, x ) + σ2 b + σ2 w 2k + 1 ql α+β(x) q ql α +β(x )f(cl 1 α+β,α +β(x, x )). B.1.2 GRADIENT COVARIANCE BACK-PROPAGATION Gradiant Covariance back-propagation for FFNN: Let L be the loss function. Let x be an input. The back-propagation of the gradient is given by the set of equations L yl i = φ (yl i) L yl+1 j W l+1 ji . Using the approximation that the weights used for forward propagation are independent from those used in backpropagation, we have as in Schoenholz et al. (2017) ql(x) = ql+1(x)Nl+1 Nl χ(ql(x)), where χ(ql(x)) = σ2 w E[φ( p Gradient Covariance back-propagation for CNN: Similar to the FFNN case, we have that L W l i,j,β = X L yl i,α φ(yl 1 j,α+β) and L yl i,α = L yl+1 j,α β W l+1 i,j,βφ (yl i,α). 
Published as a conference paper at ICLR 2021 Using the approximation of Gradient independence and averaging over the number of channels (using CLT) we have that 2 ] = σ2 w E[φ ( p qlα(x)Z)2] 2k + 1 β ker E[ L yl+1 i,α β We can get similar recursion to that of the FFNN case by summing over α and using the periodic boundary condition, this yields 2 ] = χ(ql α(x)) X B.1.3 EDGE OF CHAOS (EOC) Let x Rd be an input. The convergence of ql(x) as l increases has been studied by Schoenholz et al. (2017) and Hayou et al. (2019). In particular, under weak regularity conditions, it is proven that ql(x) converges to a point q(σb, σw) > 0 independent of x as l . The asymptotic behaviour of the correlations cl(x, x ) between yl(x) and yl(x ) for any two inputs x and x is also driven by (σb, σw): the dynamics of cl is controlled by a function f i.e. cl+1 = f(cl) called the correlation function. The authors define the EOC as the set of parameters (σb, σw) such that σ2 w E[φ ( p q(σb, σw)Z)2] = 1 where Z N(0, 1). Similarly the Ordered, resp. Chaotic, phase is defined by σ2 w E[φ ( p q(σb, σw)Z)2] < 1, resp. σ2 w E[φ ( p q(σb, σw)Z)2] > 1. On the Ordered phase, the gradient will vanish as it backpropagates through the network, and the correlation cl(x, x ) converges exponentially to 1. Hence the output function becomes constant (hence the name Ordered phase ). On the Chaotic phase, the gradient explodes and the correlation converges exponentially to some limiting value c < 1 which results in the output function being discontinuous everywhere (hence the Chaotic phase name). On the EOC, the second moment of the gradient remains constant throughout the backpropagation and the correlation converges to 1 at a sub-exponential rate, which allows deeper information propagation. Hereafter, f will always refer to the correlation function. B.1.4 SOME RESULTS FROM THE MEAN-FIELD THEORY OF DEEP FFNNS Let ϵ (0, 1) and Bϵ = {(x, x )Rd : c1(x, x ) < 1 ϵ} (For now Bϵ is defined only for FFNN). Using Approximation 1, the following results have been derived by Schoenholz et al. (2017) and Hayou et al. (2019): There exist q, λ > 0 such that supx Rd |ql q| e λl. On the Ordered phase, there exists γ > 0 such that supx,x Rd |cl(x, x ) 1| e γl. On the Chaotic phase, For all ϵ (0, 1) there exist γ > 0 and c < 1 such that sup(x,x ) Bϵ |cl(x, x ) c| e γl. For Re LU network on the EOC, we have f(x) = x 1 x + 2 2 3π (1 x)3/2 + O((1 x)5/2). In general, we have f(x) = σ2 b + σ2 w E[φ( q Z1)φ( q Z(x))] where Z(x) = x Z1 + 1 x2Z2 and Z1, Z2 are iid standard Gaussian variables. On the EOC, we have f (1) = 1 On the Ordered, resp. Chaotic, phase we have that f (1) < 1, resp. f (1) > 1. For non-linear activation functions, f is strictly convex and f(1) = 1. f is increasing on [ 1, 1]. Published as a conference paper at ICLR 2021 On the Ordered phase and EOC, f has one fixed point which is 1. On the chaotic phase, f has two fixed points: 1 which is unstable, and c (0, 1) which is a stable fixed point. On the Ordered/Chaotic phase, the correlation between gradients computed with different inputs converges exponentially to 0 as we back-progapagate the gradients. Similar results exist for CNN. Xiao et al. (2018) show that, similarly to the FFNN case, there exists q such that ql α(x) converges exponentially to q for all x, α, and studied the limiting behaviour of correlation between neurons at the same channel cl α,α (x, x) (same input x). These correlations describe how features are correlated for the same input. 
However, they do not capture the behaviour of these features for different inputs (i.e. cl α,α (x, x ) where x = x ). We establish this result in the next section. B.2 CORRELATION BEHAVIOUR IN CNN IN THE LIMIT OF LARGE DEPTH Appendix Lemma 1 (Asymptotic behaviour of the correlation in CNN with smooth activation functions). We consider a 1D CNN. Let (σb, σw) (R+)2 and x = x be two inputs Rd. If (σb, σw) are either on the Ordered or Chaotic phase, then there exists β > 0 such that sup α,α |cl α,α (x, x ) c| = O(e βl), where c = 1 if (σb, σw) is in the Ordered phase, and c (0, 1) if (σb, σw) is in the Chaotic phase. Proof. Let x = x be two inputs and α, α two nodes in the same channel i. From Section B.1, we have that ql α,α (x, x ) = E[yl i,α(x)yl i,α (x )] = σ2 w 2k + 1 β ker E[φ(yl 1 1,α+β(x))φ(yl 1 1,α +β(x ))] + σ2 b. This yields cl α,α (x, x ) = 1 2k + 1 β ker f(cl 1 α+β,α +β(x, x )), where f is the correlation function. We prove the result in the Ordered phase, the proof in the Chaotic phase is similar. Let (σb, σw) be in the Ordered phase and cl m = minα,α cl α,a (x, x ). Using the fact that f is non-decreasing (section B.1), we have that cl α,α (x, x ) 1 2k+1 P β ker cl 1 α+β,α +β(x, x )) f(cl 1 m ). Taking the min again over α, α , we have cl m f(cl 1 m ), therefore cl m is non-decreasing and converges to a stable fixed point of f. By the convexity of f, the limit is 1 (in the Chaotic phase, f has two fixed point, a stable point c1 < 1 and c2 = 1 unstable). Moreover, the convergence is exponential using the fact that 0 < f (1) < 1. We conclude using the fact that supα,α |cl α,α (x, x ) 1| = 1 cl m. C PROOFS FOR SECTION 2 : SBP FOR FFNN/CNN AND THE RESCALING TRICK In this section, we prove Theorem 1 and Proposition 1. Before proving Theorem 1, we state the degeneracy approximation. Approximation 3 (Degeneracy on the Ordered phase). On the Ordered phase, the correlation cl and the variance ql converge exponentially quickly to their limiting values 1 and q respectively. The degeneracy approximation for FFNN states that x = x , cl(x, x ) 1 x = x , α, α , cl α,α (x, x ) 1 Published as a conference paper at ICLR 2021 x, ql α(x) q The degeneracy approximation is essential in the proof of Theorem 1 as it allows us to avoid many unnecessary complications. However, the results holds without this approximation although the constants may be a bit different. Theorem 1 (Initialization is crucial for SBP). We consider a FFNN (2) or a CNN (3). Assume (σw, σb) are chosen on the ordered, i.e. χ(σb, σw) < 1, then the NN is ill-conditioned. Moreover, we have 1 + log(κLN 2) where κ = | log χ(σb, σw)|/8. If (σw, σb) are on the EOC, i.e. χ(σb, σw) = 1, then the NN is well-conditioned. In this case, κ = 0 and the above upper bound no longer holds. Proof. We prove the result using Approximation 3. 1. Case 1 : Fully connected Feedforward Neural Networks To simplify the notation, we assume that Nl = N and Ml = N 2 (i.e. αl = 1 and ζl = 1) for all l. We prove the result for the Ordered phase, the proof for the Chaotic phase is similar. Let L0 1, ϵ (0, 1 1 L0 ), L L0 and x ( 1 L + ϵ, 1). With sparsity x, we keep kx = (1 x)LN 2 weights. We have P(scr x) P(max i,j |W 1 ij| L where t(kx) is the kth x order statistic of the sequence {|W l ij| L W l ij , l > 0, (i, j) [1 : N]2}. L W l ij = 1 |D| L yl i(x) yl i(x) W l ij L yl i(x)φ(yl 1 j (x)). On the Ordered phase, the variance ql(x) and the correlation cl(x, x ) converge exponentially to their limiting values q, 1 (Section B.1). 
Under the degeneracy Approximation 3, we have x = x , cl(x, x ) 1 x, ql(x) q Let ql(x) = E[ L yl i(x) 2] (the choice of i is not important since (yl i(x))i are iid ). Using these approximations, we have that yl i(x) = yl i(x ) almost surely for all x, x . Thus 2 = E[φ( q Z)2] ql(x), where x is an input. The choice of x is not important in our approximation. From Section B.1.2, we have ql x = ql+1 x Nl+1 Then we obtain ql x = NL Nl q L x χL l = q L x χL l, where χ = σ2 w E[φ( q Z)2] as we have assumed Nl = N. Using this result, we have 2 = A χL l, Published as a conference paper at ICLR 2021 where A = E[φ( q Z)2] q L x for an input x. Recall that by definition, one has χ < 1 on the Ordered phase. In the general case, i.e. without the degeneracy approximation on cl and ql, we can prove that 2 = Θ(χL l) which suffices for the rest of the proof. However, the proof of this result requires many unnecessary complications that do not add any intuitive value to the proof. In the general case where the widths are different, ql will also scale as χL l up to a different constant. Now we want to lower bound the probability P(max i,j |W 1 ij| L Let t(kx) ϵ be the kth x order statistic of the sequence {|W l ij| L W l ij , l > 1 + ϵL, (i, j) [1 : N]2}. It is clear that t(kx) > t(kx) ϵ , therefore P(max i,j |W 1 ij| L < t(kx)) P(max i,j |W 1 ij| L < t(kx) ϵ ). Using Markov s inequality, we have that Note that Var(χ l L W l ij ) = A. In general, the random variables χ l L W l ij have a density f l ij for all l > 1+ ϵL, (i, j) [1 : N]2, such that f l ij(0) = 0. Therefore, there exists a constant λ such that for x small enough, By selecting x = χ (1 ϵ/2)L 1 2 , we obtain Therefore, for L large enough, and all l > 1 + ϵL, (i, j) [1 : Nl] [1 : Nl 1], we have 2 1 λ χϵL/2. Now choosing α = χ (1 ϵ/4)L 1 2 in inequality (16) yields 2 ) 1 A χϵL/4. Since we do not know the exact distribution of the gradients, the trick is to bound them using the previous concentration inequalities. We define the event B := { (i, j) [1 : N] [1 : d], L W 1 ij χ (1 ϵ/4)L 1 2 } { l > 1 + ϵL, (i, j) [1 : N]2, L W l ij χ (1 ϵ/2)L 1 P(max i,j |W 1 ij| L < t(kx) ϵ ) P(max i,j |W 1 ij| L < t(kx) ϵ B)P(B). Published as a conference paper at ICLR 2021 But, by conditioning on the event B, we also have P(max i,j |W 1 ij| L < t(kx) ϵ B) P(max i,j |W 1 ij| < χ ϵL/8t ϵ (kx)), where t ϵ (kx) is the kth x order statistic of the sequence {|W l ij|, l > 1 + ϵL, (i, j) [1 : N]2}. Now, as in the proof of Proposition 4 in Appendix E (MBP section), define xζ,γL = min{y (0, 1) : x > y, γLQx > Q 1 (1 x)γ2 ζ L }, where γL = χ ϵL/8. Since limζ 2 xζ,γL = 0, then there exists ζϵ < 2 such that xζϵ,γL = ϵ + 1 As L grows, t ϵ (kx) converges to the quantile of order x ϵ 1 ϵ. Therefore, P(max i,j |W 1 ij| < χ ϵL/8t ϵ (kx)) P(max i,j |W 1 ij| < Q 1 (1 x ϵ 1 ϵ )γ2 ζϵ L ) + O( 1 1 ϵ )γ2 ζϵ L + O( 1 Using the above concentration inequalities on the gradient, we obtain P(B) (1 A χϵL/4)N 2(1 λ χϵL/2)LN2. Therefore there exists a constant η > 0 independent of ϵ such that P(B) 1 ηLN 2χϵL/4. Hence, we obtain P(scr x) N 2(x ϵ 1 ϵ )γ2 ζϵ L + ηLN 2χϵL/4 + O( 1 Integration of the previous inequality yields E[scr] ϵ + 1 1 + γ2 ζϵ L + ηLN 2χϵL/4 + O( 1 Now let κ = | log(χ)| 8 and set ϵ = log(κLN2) κL . By the definition of xζϵ, we have γLQxζϵ,γL = Q 1 (1 xζϵ,γL)γ2 ζϵ L . For the left hand side, we have γLQxζϵ,γL αγL log(κLN 2) κL where α > 0 is the derivative at 0 of the function x Qx. Since γL = κLN 2, we have γLQxζϵ,γL αN 2 log(κLN 2) Which diverges as L goes to infinity. 
In particular this proves that the right hand side diverges and therefore we have that (1 xζϵ,γL)γ2 ζϵ L converges to 0 as L goes to infinity. Using the asymptotic equivalent of the right hand side as L , we have Q 1 (1 xζϵ,γL)γ2 ζϵ L q 2 log((1 xζϵ,γL)γ2 ζϵ L ) = γ1 ζϵ/2 L p 2 log(1 xζϵ,γL). Therefore, we obtain Q 1 (1 xζϵ,γL)γ2 ζϵ L γ1 ζϵ/2 L 2 log(κLN 2) Combining this result to the fact that γLQxζϵ,γL αγL log(κLN2) κL we obtain γ ζϵ L β log(κLN 2) Published as a conference paper at ICLR 2021 where β is a positive constant. This yields E[scr] log(κLN 2) L + µ κLN 2 log(κLN 2)(1 + o(1)) + η 1 κ2LN 2 + O( 1 L(1 + log(κLN 2) where κ = | log(χ)| 8 and µ is a constant. 2. Case 2 : Convolutional Neural Networks The proof for CNNs in similar to that of FFNN once we prove that E L W l i,j,β where A is a constant. We have that L W l i,j,β = X L yl i,α φ(yl 1 j,α+β) and L yl i,α = L yl+1 j,α β W l+1 i,j,βφ (yl i,α). Using the approximation of Gradient independence and averaging over the number of channels (using CLT) we have that 2 ] = σ2 w E[φ ( q Z)2] β ker E[ L yl+1 i,α β Summing over α and using the periodic boundary condition, this yields X Here also, on the Ordered phase, the variance ql and the correlation cl converge exponentially to their limiting values q and 1 respectively. As for FFNN, we use the degeneracy approximation that states x = x , α, α , cl α,α (x, x ) 1, x, ql α(x) q. Using these approximations, we have E L W l i,j,β 2 = E[φ( q Z)2] ql(x), where ql(x) = P α E[ L yl i,α(x) 2] for an input x. The choice of x is not important in our approximation. From the analysis above, we have ql(x) = q L(x)χL l, so we conclude that E L W l i,j,β where A = E[φ( q Z)2] q L(x). Published as a conference paper at ICLR 2021 After pruning, the network is usually deep in the Ordered phase in the sense that χ = f (1) 1. To re-place it on the Edge of Chaos, we use the Rescaling Trick. Proposition 1 (Rescaling Trick). Consider a NN of the form (2) or (3) (FFNN or CNN) initialized on the EOC. Then, after pruning, the sparse network is not initialized on the EOC. However, the rescaled sparse network yl(x) = F(ρl δl W l, yl 1(x)) + Bl, for l 1, (17) E[Nl 1(W l i1)2δl i1] for FFNN of the form (2), ρl i,j,β = 1 q E[nl 1(W l i,1,β)2δl i,1,β] for CNN of the form (3), is initialized on the EOC. Proof. For two inputs x, x , the forward propagation of the covariance is given by ˆql(x, x ) = E[yl i(x)yl i(x )] j,k W l ij W l ikδl ijδl ikφ(ˆyl 1 j (x))φ(ˆyl 1 j (x ))] + σ2 b. L W l ij = 1 |D| L yl i(x) yl i(x) W l ij L yl i(x)φ(yl 1 j (x)). Under the assumption that the weights used for forward propagation are independent from the weights used for back-propagation, W l ij and L yl i(x) are independent for all x D. We also have that W l ij and φ(yl 1 j (x)) are independent for all x D. Therefore, W l ij and L W l ij are independent for all l, i, j. This yields ˆql(x, x ) = σ2 wαl E[φ(ˆyl 1 1 (x))φ(ˆyl 1 1 (x ))] + σ2 b, where αl = E[Nl 1(W l 11)2δl 11] (the choice of i, j does not matter because they are iid). Unless we do not prune any weights from the lth layer, we have that αl < 1. These dynamics are the same as a FFNN with the variance of the weights given by ˆσ2 w = σ2 wαl. Since the EOC equation is given by σ2 w E[φ ( q Z)2] = 1, with the new variance, it is clear that ˆσ2 w E[φ ( ˆq Z)2] = 1 in general. Hence, the network is no longer on the EOC and this could be problematic for training. 
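The effect just described, namely that pruning shrinks the effective weight variance to σ_w^2 α_l and pushes the network off the EOC, and the per-row rescaling of Algorithm 1 (Appendix I) that compensates for it, can both be checked with a short simulation. The sketch below is illustrative only: it assumes weights drawn as N(0, 1/N_{l−1}) with σ_w kept as an explicit multiplier, and a Bernoulli pruning mask.

```python
import numpy as np

# Sketch: a Bernoulli mask with keep-rate p shrinks alpha_l = E[N_{l-1} W^2 delta]
# from ~1 to ~p (effective variance sigma_w^2 * alpha_l, i.e. off the EOC), and the
# per-row rescaling rho_i = 1 / sqrt(alpha_i) of Algorithm 1 restores it.
# Assumes W_ij ~ N(0, 1/N_{l-1}) with sigma_w kept as an explicit multiplier.

rng = np.random.default_rng(0)
N_in, N_out, keep = 1024, 1024, 0.2

W = rng.normal(0.0, 1.0 / np.sqrt(N_in), size=(N_out, N_in))
delta = (rng.random((N_out, N_in)) < keep).astype(float)     # pruning mask

alpha = np.sum((W * delta) ** 2, axis=1, keepdims=True)      # per-row alpha_i (~ keep)
W_rescaled = (W * delta) / np.sqrt(alpha)                    # rho_i = 1 / sqrt(alpha_i)

print("dense    E[sum_j W_ij^2]       ~", np.mean(np.sum(W ** 2, axis=1)))             # ~ 1.0
print("pruned   E[sum_j W_ij^2 d_ij]  ~", np.mean(np.sum((W * delta) ** 2, axis=1)))   # ~ keep
print("rescaled E[sum_j (rho W)^2 d]  ~", np.mean(np.sum(W_rescaled ** 2, axis=1)))    # ~ 1.0
```

In expectation α_i is close to the keep-rate, so the rescaling simply undoes the variance shrinkage introduced by the mask; the computation that follows formalizes this at the level of the covariance propagation.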
With the rescaling, this becomes ˆql(x, x ) = σ2 wρ2 l αl E[φ( yl 1 1 (x))φ( yl 1 1 (x ))] + σ2 b = σ2 w E[φ( yl 1 1 (x))φ( yl 1 1 (x ))] + σ2 b. Therefore, the new variance after re-scaling is σ2 w = σ2 w, and the limiting variance q = q remains also unchanged since the dynamics are the same. Therefore σ2 w E[φ ( q Z)2] = σ2 w E[φ ( q Z)2] = 1. Thus, the re-scaled network is initialized on the EOC. The proof is similar for CNNs. Published as a conference paper at ICLR 2021 D PROOF FOR SECTION 3 : SBP FOR STABLE RESIDUAL NETWORKS Theorem 2 (Resnet is well-conditioned). Consider a Resnet with either Fully Connected or Convolutional layers and Re LU activation function. Then for all σw > 0, the Resnet is well-conditioned. Moreover, for all l {1, ..., L}, ml = Θ((1 + σ2 w 2 )L). Proof. Let us start with the case of a Resnet with Fully Connected layers. we have that L W l ij = 1 |D| L yl i(x) yl i(x) W l ij L yl i(x)φ(yl 1 j (x)) and the backpropagation of the gradient is given by the set of equations L yl i = L yl+1 i + φ (yl i) L yl+1 j W l+1 ji . Recall that ql(x) = E[yl i(x)2] and ql(x, x ) = E[ L yl i(x) L yl i(x )] for some inputs x, x . We have that ql(x) = E[yl 1 i (x)2] + σ2 w E[φ(yl 1 1 )2] = (1 + σ2 w 2 )ql 1(x), and ql(x, x ) = (1 + σ2 w E[φ (yl i(x))φ (yl i(x ))]) ql+1(x, x ). We also have 2 ] = 1 |D|2 X x,x tl x,x , where tl x,x = ql(x, x ) p ql(x)ql(x )f(cl 1(x, x )) and f is defined in the preliminary results (Eq 15). Let k {1, 2, ..., L} be fixed. We compare the terms tl x,x for l = k and l = L. The ratio between the two terms is given by (after simplification) tk x,x t L x,x = QL 1 l=k (1 + σ2 w 2 f (cl(x, x ))) 2 )L k f(ck 1(x, x )) f(c L 1(x, x )). We have that f (cl(x, x)) = f (1) = 1. A Taylor expansion of f near 1 yields f (cl(x, x )) = 1 l 1 + o(l 1) and f(cl(x, x)) = 1 sl 2 + o(l 2) (see Hayou et al. (2019) for more details). Therefore, there exist two constants A, B > 0 such that A < QL 1 l=k (1+ σ2 w 2 f (cl(x,x ))) 2 )L k < B for all L and k {1, 2, ..., L}. This yields which concludes the proof. For Resnet with convolutional layers, we have L W l i,j,β = 1 |D| L yl i,α(x)φ(yl 1 j,α+β(x)) Published as a conference paper at ICLR 2021 and L yl i,α = L yl+1 i,α + L yl+1 j,α β W l+1 i,j,βφ (yl i,α). Recall the notation ql α,α (x, x ) = E[ L yl i,α(x) L yl i,α (x )]. Using the hypothesis of independence of forward and backward weights and averaging over the number of channels (using CLT), we have ql α,α (x, x ) = ql+1 α,α (x, x ) + σ2 wf (cl α,α (x, x )) 2(2k + 1) β ql+1 α+β,α +β(x, x ). Let Kl = (( ql α,α+β(x, x ))α [0:N 1])β [0:N 1] be a vector in RN2. Writing this previous equation in matrix form, we obtain Kl = (I + σ2 wf (cl α,α (x, x )) 2(2k + 1) U)Kl+1 E[ L W l i,j,β 2 ] = 1 |D|2 X α,α tl α,α (x, x ), where tl α,α (x, x ) = ql α,α (x, x ) q ql α+β(x)ql α +β(x )f(cl 1 α+β,α +β(x, x )). Since we have f (cl α,α (x, x )) 1, then by fixing l and letting L goes to infinity, it follows that Kl L (1 + σ2 w 2 )L le1e T 1 KL and, from Lemma 2, we know that q ql α+β(x)ql α +β(x ) = (1 + σ2 w 2 )l 1 q0,xq0,x . Therefore, for a fixed k < L, we have tk α,α (x, x ) (1 + σ2 w 2 )L 1f(ck 1 α+β,α +β(x, x ))(e T 1 KL) = Θ(t L α,α (x, x )). This concludes the proof. Proposition 2 (Stable Resnet). Consider the following Resnet parameterization yl(x) = yl 1(x) + 1 L F(W l, yl 1), for l 2, (18) then the network is well-conditioned for all choices of σw > 0. Moreover, for all l {1, ..., L} we have ml = Θ(L 1). Proof. The proof is similar to that of Theorem 2 with minor differences. 
Let us start with the case of a Resnet with fully connected layers, we have L W l ij = 1 |D| L yl i(x) yl i(x) W l ij L yl i(x)φ(yl 1 j (x)) and the backpropagation of the gradient is given by L yl i = L yl+1 i + 1 L yl+1 j W l+1 ji . Published as a conference paper at ICLR 2021 Recall that ql(x) = E[yl i(x)2] and ql(x, x ) = E[ L yl i(x) L yl i(x )] for some inputs x, x . We have ql(x) = E[yl 1 i (x)2] + σ2 w L E[φ(yl 1 1 (x))2] = (1 + σ2 w 2L)ql 1(x) ql(x, x ) = (1 + σ2 w L E[φ (yl i(x))φ (yl i(x ))]) ql+1(x, x ). We also have 2 ] = 1 L|D|2 X x,x tl x,x , where tl x,x = ql(x, x ) p ql(x)ql(x )f(cl 1(x, x )) and f is defined in the preliminary results (Eq. 15). Let k {1, 2, ..., L} be fixed. We compare the terms tl x,x for l = k and l = L. The ratio between the two terms is given after simplification by tk x,x t L x,x = QL 1 l=k (1 + σ2 w 2L f (cl(x, x ))) 2L )L k f(ck 1(x, x )) f(c L 1(x, x )). As in the proof of Theorem 2, we have that f (cl(x, x)) = 1, f (cl(x, x )) = 1 l 1 + o(l 1) and f(cl(x, x)) = 1 sl 2 + o(l 2). Therefore, there exist two constants A, B > 0 such that QL 1 l=k (1+ σ2 w 2L f (cl(x,x ))) 2L )L k < B for all L and k {1, 2, ..., L}. This yields Moreover, since (1+ σ2 w 2L )L eσ2 w/2, then ml = Θ(1) for all l {1, ..., L}. This concludes the proof. For Resnet with convolutional layers, the proof is similar. With the scaling, we have L W l i,j,β = 1 L yl i,α(x)φ(yl 1 j,α+β(x)) and L yl i,α = L yl+1 i,α + 1 L yl+1 j,α β W l+1 i,j,βφ (yl i,α). Let ql α,α (x, x ) = E[ L yl i,α(x) L yl i,α (x )]. Using the hypothesis of independence of forward and backward weights and averaging over the number of channels (using CLT) we have ql α,α (x, x ) = ql+1 α,α (x, x ) + σ2 wf (cl α,α (x, x )) 2(2k + 1)L β ql+1 α+β,α +β(x, x ). Let Kl = (( ql α,α+β(x, x ))α [0:N 1])β [0:N 1] is a vector in RN2. Writing this previous equation in matrix form, we have Kl = (I + σ2 wf (cl α,α (x, x )) 2(2k + 1)L U)Kl+1, E[ L W l i,j,β 2 ] = 1 L|D|2 X α,α tl α,α (x, x ), Published as a conference paper at ICLR 2021 where tl α,α (x, x ) = ql α,α (x, x ) q ql α+β(x)ql α +β(x )f(cl 1 α+β,α +β(x, x )). Since we have f (cl α,α (x, x )) 1, then by fixing l and letting L goes to infinity, we obtain Kl L (1 + σ2 w 2L)L le1e T 1 KL and we know from Appendix Lemma 2 (using αβ = σ2 w 2L for all β) that q ql α+β(x)ql α +β(x ) = (1 + σ2 w 2L)l 1 q0,xq0,x . Therefore, for a fixed k < L, we have tk α,α (x, x ) (1 + σ2 w 2L )L 1f(ck 1 α+β,α +β(x, x ))(e T 1 KL) = Θ(t L α,α (x, x )) which proves that the stable Resnet is well conditioned. Moreover, since (1 + σ2 w 2L )L 1 eσ2 w/2, then ml = Θ(L 1) for all l. In the next Lemma, we study the asymptotic behaviour of the variance ql α. We show that, as l , a phenomenon of self averaging shows that ql α becomes independent of α. Appendix Lemma 2. Let x Rd. Assume the sequence (al,α)l,α is given by the recursive formula al,α = al 1,α + X β ker λβal 1,α+β where λβ > 0 for all β. Then, there exists ζ > 0 such that for all x Rd and α, al,α(x) = (1 + X β αβ)la0 + O((1 + X β αβ)le ζl)), where a0 is a constant and the O is uniform in α. Proof. Recall that al,α = al 1,α + X β ker λβal 1,α+β. We rewrite this expression in a matrix form Al = UAl 1, where Al = (al,α)α is a vector in RN and U is the is the convolution matrix. As an example, for k = 1, U given by 1 + λ0 λ1 0 ... 0 λ 1 λ 1 1 + λ0 λ1 0 ... 0 0 λ 1 1 + λ0 λ1 ... 0 0 0 λ 1 1 + λ0 ... 0 ... ... ... ... λ1 0 . . . 0 λ 1 1 + λ0 U is a circulant symmetric matrix with eigenvalues b1 > b2 b3... b N. 
The largest eigenvalue of U is given by b1 = 1 + P β λβ and its equivalent eigenspace is generated by the vector e1 = 1 N (1, 1, ..., 1) RN. This yields b l 1 U l = e1e T 1 + O(e ζl), where ζ = log( b1 b2 ). Using this result, we obtain b l 1 Al = (b l 1 U l)A0 = e1e T 1 A0 + O(e ζl). This concludes the proof. Published as a conference paper at ICLR 2021 Unlike FFNN or CNN, we do not need to rescale the pruned network. The next proposition establishes that a Resnet lives on the EOC in the sense that the correlation between yl i(x) and yl i(x ) converges to 1 at a sub-exponential O(l 2) rate. Proposition 3 (Resnet live on the EOC even after pruning). Let x = x be two inputs. The following statments hold 1. For Resnet with Fully Connected layers, let ˆcl(x, x ) be the correlation between ˆyl i(x) and ˆyl i(x ) after pruning the network. Then we have 1 ˆcl(x, x ) κ where κ > 0 is a constant. 2. For Resnet with Convolutional layers, let ˆcl(x, x ) = α,α E[yl 1,α(x)yl 1,α (x )] P qlα (x ) be an average correlation after pruning the network. Then we have 1 ˆcl(x, x ) l 2. Proof. 1. Let x and x be two inputs. The covariance of ˆyl i(x) and ˆyl i(x ) is given by ˆql(x, x ) = ˆql 1(x, x ) + αE(Z1,Z2) N(0,Ql 1)[φ(Z1)φ(Z2)] where Ql 1 = ˆql 1(x) ˆql 1(x, x ) ˆql 1(x, x ) ˆql 1(x ) and α = E[Nl 1W l 11 2δl 11]. Consequently, we have ˆql(x) = (1 + α 2 )ˆql 1(x). Therefore, we obtain ˆcl(x, x ) = 1 1 + λˆcl 1(x, x ) + λ 1 + λf(ˆcl 1(x, x )), where λ = α 2 and f(x) = 2E[φ(Z1)φ(x Z1 + 1 x2Z2)] and Z1 and Z2 are iid standard normal variables. Using the fact that f is increasing (Section B.1), it is easy to see that ˆcl(x, x ) 1. Let ζl = 1 ˆcl(x, x ). Moreover, using a Taylor expansion of f near 1 (Section B.1) f(x) = x 1 x + β(1 x)3/2 + O((1 x)5/2), it follows that ζl = ζl 1 ηζ3/2 l 1 + O(ζ5/2 l 1), where η = λβ 1+λ. Now using the asymptotic expansion of ζ 1/2 l given by ζ 1/2 l = ζ 1/2 l 1 + η 2 + O(ζl 1), this yields ζ 1/2 l l η 2l. We conclude that 1 ˆcl(x, x ) 4 η2l2 . 2. Let x be an input. Recall the forward propagation of a pruned 1D CNN yl i,α(x) = yl 1 i,α (x) + β ker δl i,j.βW l i,j,βφ(yl 1 j,α+β(x)) + bl i. Unlike FFNN, neurons in the same channel are correlated since we use the same filters for all of them. Let x, x be two inputs and α, α two nodes in the same channel i. Using the Central Limit Theorem in the limit of large nl (number of channels), we have E[yl i,α(x)yl i,α (x )] = E[yl 1 i,α (x)yl 1 i,α (x )]+ 1 2k + 1 β ker αβE[φ(yl 1 1,α+β(x))φ(yl 1 1,α +β(x ))], where αβ = E[δl i,1.βW l i,1,β 2nl 1]. Published as a conference paper at ICLR 2021 Let ql α(x) = E[yl 1,α(x)2]. The choice of the channel is not important since for a given α, neurons (yl i,α(x))i [c] are iid. Using the previous formula, we have ql α(x) = ql 1 α (x) + 1 2k + 1 β ker αβE[φ(yl 1 1,α+β(x))2] = ql 1 α (x) + 1 2k + 1 β ker αβ ql 1 α+β(x) Therefore, letting ql(x) = 1 α [N] ql α(x) and σ = β αβ 2k+1 , we obtain ql(x) = ql 1(x) + 1 2k + 1 ql 1 α+β(x) 2 )ql 1(x) = (1 + σ 2 )l 1q1(x), where we have used the periodicity ql 1 α = ql 1 α N = ql 1 α+N. Moreover, we have minα ql α(x) (1 + σ 2 ) minα ql 1 α (x) (1 + σ 2 )l 1 minα q1 α(x). The convolutional structure makes it hard to analyse the correlation between the values of a neurons for two different inputs. Xiao et al. (2018) studied the correlation between the values of two neurons in the same channel for the same input. 
Although this could capture the propagation of the input structure (say how different pixels propagate together) inside the network, it does not provide any information on how different structures from different inputs propagate. To resolve this situation, we study the average correlation per channel defined as cl(x, x ) = α,α E[yl 1,α(x)yl 1,α (x )] P α,α p for any two inputs x = x . We also define cl(x, x ) by cl(x, x ) = α,α E[yl 1,α(x)yl 1,α (x )] q α qlα(x ) . Using the concavity of the square root function, we have s α qlα(x ) = s α,α qlα(x)qlα(x ) α,α |E[yl 1,α(x)yl 1,α (x )]|. This yields cl(x, x ) cl(x, x ) 1. Using Appendix Lemma 2 twice with al,α = ql α(x), al,α = ql α(x ), and λβ = αβ 2(2k+1), there exists ζ > 0 such that cl(x, x ) = cl(x, x )(1 + O(e ζl)). (19) This result shows that the limiting behaviour of cl(x, x ) is equivalent to that of cl(x, x ) up to an exponentially small factor. We study hereafter the behaviour of cl(x, x ) and use this result to conclude. Recall that E[yl i,α(x)yl i,α (x )] = E[yl 1 i,α (x)yl 1 i,α (x )]+ 1 2k + 1 β ker αβE[φ(yl 1 1,α+β(x))φ(yl 1 1,α +β(x ))]. Published as a conference paper at ICLR 2021 Therefore, X α,α E[yl 1,α(x)yl 1,α (x )] α,α E[yl 1 1,α (x)yl 1 1,α (x )] + 1 2k + 1 β ker αβE[φ(yl 1 1,α+β(x))φ(yl 1 1,α +β(x ))] α,α E[yl 1 1,α (x)yl 1 1,α (x )] + σ X α,α E[φ(yl 1 1,α (x))φ(yl 1 1,α (x ))] α,α E[yl 1 1,α (x)yl 1 1,α (x )] + σ ql 1 α (x) q ql 1 α (x )f(cl 1 α,α (x, x )), where f is the correlation function of Re LU. Let us first prove that cl(x, x ) converges to 1. Using the fact that f(z) z for all z (0, 1) (Section B.1), we have that α,α E[yl 1,α(x)yl 1,α (x )] X α,α E[yl 1 1,α (x)yl 1 1,α (x )] + σ ql 1 α (x) q ql 1 α (x )cl 1 α,α (x, x ) α,α E[yl 1 1,α (x)yl 1 1,α (x )] + σ α,α E[yl 1 1,α (x)yl 1 1,α (x )] 2 )E[yl 1 1,α (x)yl 1 1,α (x )]. Combining this result with the fact that P α ql α(x) = (1 + σ α ql 1 α (x), we have cl(x, x ) cl 1(x, x ). Therefore cl(x, x ) is non-decreasing and converges to a limiting point c. Let us prove that c = 1. By contradiction, assume the limit c < 1. Using equation (19), we have that cl(x,x ) cl(x,x ) converge to 1 as l goes to infinity. This yields cl(x, x ) c. Therefore, there exists α0, α 0 and a constant δ < 1 such that for all l, cl α0,α 0(x, x ) δ < 1. Knowing that f is strongly convex and that f (1) = 1, we have that f(cl α0,α 0(x, x )) cl α0,α 0(x, x ) + f(δ) δ. Therefore, cl(x, x ) cl 1(x, x ) + ql 1 α0 (x)ql 1 α 0 (x ) ql(x ) (f(δ) δ) cl 1(x, x ) + minα q1α(x) minα q1 α (x ) q1(x ) (f(δ) δ). By taking the limit l , we find that c c + minα q1α(x) minα q1 α (x ) q1(x ) (f(δ) δ). This cannot be true since f(δ) > δ. Thus we conclude that c = 1. Now we study the asymptotic convergence rate. From Section B.1, we have that f(x) = x 1 x + 2 2 3π (1 x)3/2 + O((1 x)5/2). Therefore, there exists κ > 0 such that, close to 1 we have that f(x) x + κ(1 x)3/2. Using this result, we can upper bound cl(x, x ) Published as a conference paper at ICLR 2021 cl(x, x ) cl 1(x, x ) + κ X ql 1 α (x) q ql 1 α (x ) p ql(x ) (1 cl α,α (x, x ))3/2. To get a polynomial convergence rate, we should have an upper bound of the form cl cl 1 +ζ(1 cl 1)1+ϵ (see below). However, the function x3/2 is convex, so the sum cannot be upper-bounded directly using Jensen s inequality. We use here instead (Peˇcari c et al., 1992, Theorem 1) which states that for any x1, x2, ...xn > 0 and s > r > 0, we have X i xs i 1/s < X i xr i 1/r. 
(20) Let zl α,α = ql 1 α (x) q ql 1 α (x ) ql(x ) , we have α,α zl α,α (1 cl α,α (x, x ))3/2 ζl X α,α [zl α,α (1 cl α,α (x, x ))]3/2, where ζl = maxα,α 1 zl α,α 1/2 . Using the inequality (20) with s = 3/2 and r = 1, we have α,α [zl α,α (1 cl α,α (x, x ))]3/2 ( X α,α zl α,α (1 cl α,α (x, x )))3/2 α,α zl α,α cl(x, x )))3/2. Moreover, using the concavity of the square root function, we have P α,α zl α,α 1. This yields cl(x, x ) cl 1(x, x ) + ζ(1 cl 1(x, x ))3/2, where ζ is constant. Letting γl = 1 cl(x, x ), we can conclude using the following inequality (we had an equality in the case of FFNN) γl γl 1 ζγ3/2 l 1 which leads to γ 1/2 l γ 1/2 l 1 (1 ζγ1/2 l 1) 1/2 = γ 1/2 l 1 + ζ Hence we have γl l 2. Using this result combined with (19) again, we conclude that 1 cl(x, x ) l 2. E THEORETICAL ANALYSIS OF MAGNITUDE BASED PRUNING (MBP) In this section, we provide a theoretical analysis of MBP. The two approximations from Appendix A are not used here. MBP is a data independent pruning algorithm (zero-shot pruning). The mask is given by δl i = 1 if |W l i | ts, 0 if |W l i | < ts, Published as a conference paper at ICLR 2021 where ts is a threshold that depends on the sparsity s. By defining ks = (1 s) P l Ml, ts is given by ts = |W|(ks) where |W|(ks) is the kth s order statistic of the network weights (|W l i |)1 l L,1 i Ml (|W|(1) > |W|(2) > ...). With MBP, changing σw does not impact the distribution of the resulting sparse architecture since it is a common factor for all the weights. However, in the case of different scaling factors vl, the variances σ2 w vl used to initialize the weights vary across layers. This gives potentially the erroneous intuition that the layer with the smallest variance will be highly likely fully pruned before others as we increase the sparsity s. This is wrong in general since layers with small variances might have more weights compared to other layers. However, we can prove a similar result by considering the limit of large depth with fixed widths. Proposition 4 (MBP in the large depth limit). Assume N is fixed and there exists l0 [|1, L|] such that αl0 > αl for all l = l0. Let Qx be the xth quantile of |X| where X iid N(0, 1) and γ = minl =l0 αl0 αl . For ϵ (0, 2), define xϵ,γ = inf{y (0, 1) : x > y, γQx > Q1 (1 x)γ2 ϵ} and xϵ,γ = for the null set. Then, for all ϵ (0, 2), xϵ,γ is finite and there exists a constant ν > 0 such that E[scr] inf ϵ (0,2){xϵ,γ + ζl0N 2 1 + γ2 ϵ (1 xϵ,γ)1+γ2 ϵ} + O( 1 Proposition 4 gives an upper bound on E[scr] in the large depth limit. The upper bound is easy to approximate numerically. Table 7 compares the theoretical upper bound in Proposition 4 to the empirical value of E[scr] over 10 simulations for a FFNN with depth L = 100, N = 100, α1 = γ and α2 = α3 = = αL = 1. Our experiments reveal that this bound can be tight. Table 7: Theoretical upper bound of Proposition 4 and empirical observations for a FFNN with N = 100 and L = 100 GAMMA γ = 2 γ = 5 γ = 10 UPPER BOUND 5.77 0.81 0.72 EMPIRICAL OBSERVATION 1 0.79 0.69 Proof. Let x (0, 1) and kx = (1 x)ΓLN 2, where ΓL = P l =l0 ζl. We have P(scr x) P(max i |W l0 i | < |W|(kx)), where |W|(kx) is the kth x order statistic of the sequence {|W l i |, l = l0, i [1 : Ml]}; i.e |W|(1) > |W|(2) > ... > |W|(kx). Let (Xi)i [1:Ml0] and (Zi)i [1:ΓLN2] be two sequences of iid standard normal variables. 
It is easy to see that P(max i,j |W l0 ij | < |W|(kx)) P(max i |Xi| < γ|Z|(kx)) where γ = minl =l0 αl0 Moreover, we have the following result from the theory of order statistics, which is a weak version of Theorem 3.1. in Puri and Ralescu (1986) Appendix Lemma 3. Let X1, X2, ..., Xn be iid random variables with a cdf F. Assume F is differentiable and let p (0, 1) and let Qp be the order p quantile of the distribution F, i.e. F(Qp) = p. Then we have n(X(pn) Qp)F (Qp)σ 1 p D N(0, 1), where the convergence is in distribution and σp = p(1 p). Published as a conference paper at ICLR 2021 Using this result, we obtain P(max i |Xi| < γ|Z|(kx)) = P(max i |Xi| < γQx) + O( 1 where Qx is the x quantile of the folded standard normal distribution. The next result shows that xϵ,γ is finite for all ϵ (0, 2). Appendix Lemma 4. Let γ > 1. For all ϵ (0, 2), there exists xϵ (0, 1) such that, for all x > xϵ, γQx > Q1 (1 x)γ2 ϵ. Proof. Let ϵ > 0, and recall the asymptotic equivalent of Q1 x given by Therefore, γQx Q1 (1 x)γ2 ϵ x 1 γϵ > 1. Hence xϵ exists and is finite. Let ϵ > 0. Using Appendix Lemma 4, there exists xϵ > 0 such that P(max i |Xi| < γQx) P(max i |Xi| < Q1 (1 x)γ2 ϵ) = (1 (1 x)γ2 ϵ)ζl0N2 1 ζl0N 2(1 x)γ2 ϵ, where we have used the inequality (1 t)z 1 zt for all (t, z) [0, 1] (1, ) and β = αl0αl0+1. Using the last result, we have P(scr x) βN 2(1 x)γ2 ϵ + O( 1 Now we have E[scr] = Z 1 0 P(scr x)dx xϵ P(scr x)dx 1 + γ2 ϵ (1 xϵ)γ2 ϵ+1 + O( 1 This is true for all ϵ (0, 2), and the additional term O( 1 LN2 ) does not depend on ϵ. Therefore there exists a constant ν R such that for all ϵ E[scr] xϵ + βN 2 1 + γ2 ϵ (1 xϵ)γ2 ϵ+1 + ν We conclude by taking the infimum over ϵ. Another interesting aspect of MBP is when the depth is fixed and the width goes to infinity. The next result gives a lower bound on the probability of pruning at least one full layer. Proposition 5 (MBP in the large width limit). Assume there exists l0 [1 : L] such that αl0 > αl (i.e. vl0 > vl) for all l, and let s0 = Ml0 P l Ml . For some sparsity s, let PRl0(s) be the event that layer l0 is fully pruned before other layers, i.e. PRl0(s) = {|Al0| = Ml0} l [1:L] {|Al| < Ml}, and let PRl0 = s (s0,smax)PRl0(s) be the event where there exists a sparsity s such that layer l0 is fully pruned before other layers. Then, we have P(PRl0) 1 Lπ2 4(γ 1)2 log(N)2 + o 1 log(N)2 , where γ = mink =l0 αl0 Published as a conference paper at ICLR 2021 Proposition 5 shows that when the width is not the same for all layers, MBP will result in one layer being fully pruned with a probability that converges to 1 as the width goes to infinity. The larger the ratio γ (ratio of widths between the largest and the second largest layers), the faster this probability goes to 1. The intuition behind Proposition 5 comes from a result in Extreme Value Theory stated in Appendix Lemma 6. Indeed, the problem of pruning one whole layer before the others is essentially a problem of maxima: we prune one whole layer l0 before the others if and only if for all maxi |W l0 i | < minl =l0 maxi |W l i |. The expected value of n iid standard Gaussian variables is known to scale as log n for large n; see e.g. Van Handel (2016). The proof of Proposition 5 relies on the following two auxiliary results. Appendix Lemma 5 (Rearrangement inequality (Hardy et al., 1952)). Let f, g : R R+ be functions which are either both non-decreasing or non-increasing and let X be a random variable. Then E[f(X)g(X)] E[f(X)]E[g(X)]. Appendix Lemma 6 (Von Mises (1936)). 
Let (Xi)1 i n be iid random variables with common density f and cumulative distribution function F. Assume limx F 1(1)( d dx (1 F (x)) f(x) ) = 0, then limn P(maxi Xi anx + bn) = G(x) where G is the Gumbel cumulative distribution function and series an and bn are given by bn = F 1(1 1 n) and an = 1 nf(bn). We are now in a position to prove Proposition 5. Proof. Assume there exists l0 [1 : L] such that αl0 > αl for all l. The trick is to see that PRl0 = { k = l0, max i |W l0 i | < max ij |W k i |}. Let us prove that P(PRl0) Y k =l0 P(max i |W l0 i | < max j |W k i |). Let X = maxi |W l0 i |. We have that P(PRl0) = E[ Y k =l0 P(X < max i |W k i ||X)] using the rearrangement inequality presented in Appendix Lemma 5 with functions fi(x) = P(X < maxi |W k i ||X = x) which are all non-increasing, we obtain k =l0 E[P(X < max i |W k i ||X)] = Y k =l0 P(max i |W l0 i | < max i |W k i |). In order to deal with the probability P(maxi |W l0 i | < maxi |W k i |), we use Appendix Lemma 6 which is a result from Extreme Value Theory which provides a comprehensive description of the law of maxi Xi needed in our analysis. In our case, we want to characterise the behaviour of maxi |Xi| where Xi are iid Gaussian random variables. Let Ψ and ψ be the cdf and density of a standard Gaussian variable X. The cdf of |X| is given by F = 2Ψ 1 and its density is given by f = 2ψ on the positive real line. Thus, 1 F ψ and it is sufficient to verify the conditions of Appendix Lemma 6 for the standard Gaussian distribution. We have limx F 1(1) d dx 1 Ψ(x) ψ(x) = limx F 1(1) x (1 Ψ(x)) ψ(x) 1 = x/x 1 = 0, where we have used the fact that x(1 Ψ(x)) φ(x) in the large x limit. Let us now find the values of an and bn. In the large x limit, we have 1 F(x) = 2 Z Published as a conference paper at ICLR 2021 Therefore, one has log(1 F(x)) x2 This yields bn = F 1(1 1 Using the same asymptotic expansion of 1 F(x), we can obtain a more precise approximation of bn 2 log n 1 log(log n) 4 ) 2 log n log(log n) 8(log n)2 + o(log(log n) (log n)2 ) . Now let us find an approximation for an. We have Therefore, it follows that an π 2 log n. We use these results to lower bound the probability P(maxi |W l0 i | < maxi |W k i |). We have P(max i |W l0 i | max i |W k i |) = P(max i |Xi| γk max i |Yi|), where γk = αl0 αk and (Xi) and (Yi) are standard Gaussian random variables. Note that γk > 1. Let AN = maxi |Xi| and BN = maxi |Yi|. We have that P(AN γk BN) = P(AN E[AN] γk(BN E[BN]) + γk E[BN] E[AN]) E (AN E[AN])2 (γk(BN E[BN]) + γk E[BN] E[AN]))2 N π2 4(γk 1)2 log(N)2 . We conclude that for large N P(PRl0) 1 Lπ2 4(γ 1)2 log(N)2 + o( 1 log(N)2 ), where γ = mink =l0 αl0 F IMAGENET EXPERIMENTS To validate our results on large scale datasets, we prune Res Net50 using SNIP, Gra SP, Syn Flow and our algorithm SBP-SR, and train the pruned network on Image Net. We train the pruned model for 90 epochs with SGD. The training starts with a learning rate 0.1 and it drops by a factor of 10 at epochs 30, 60, 80. We report in table 8 Top-1 test accuracy for different sparsities. Our algorithm SBP-SR has a clear advantage over other algorithms. We are currently running extensive simulations on Image Net to confirm these results. 
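For reference, the optimization schedule described above (90 epochs of SGD, learning rate 0.1 divided by 10 at epochs 30, 60 and 80) corresponds to the standard setup sketched below. The `model` and training-loop placeholders, as well as the momentum and weight-decay values, are illustrative and not taken from our experimental configuration.

```python
import torch

# Sketch of the schedule described above: SGD for 90 epochs, lr 0.1 divided by 10
# at epochs 30, 60, 80. `model` and the training loop are placeholders.

model = torch.nn.Linear(10, 10)          # placeholder for the pruned ResNet50
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)   # illustrative values
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60, 80], gamma=0.1)

for epoch in range(90):
    # train_one_epoch(model, optimizer)  # training loop omitted in this sketch
    scheduler.step()                     # lr: 0.1 -> 0.01 -> 0.001 -> 0.0001
```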
Table 8: Classification accuracy on ImageNet (Top-1) for ResNet50 with varying sparsities (TODO: These results will be updated to include confidence intervals)

ALGORITHM   85%     90%     95%
SNIP        69.05   64.25   44.90
GRASP       69.45   66.41   62.10
SYNFLOW     69.50   66.20   62.05
SBP-SR      69.75   67.02   62.66

Table 9: Test accuracy of pruned neural networks on CIFAR10 with different activation functions

            Algo     90            98            99.5          99.9
Resnet32
  Relu      SBP-SR   92.56 (0.06)  88.25 (0.35)  79.54 (1.12)  51.56 (1.12)
            SNIP     92.24 (0.25)  87.63 (0.16)  77.56 (0.36)  10 (0)
  Tanh      SBP-SR   90.97 (0.2)   86.62 (0.38)  75.04 (0.49)  51.88 (0.56)
            SNIP     90.69 (0.28)  85.47 (0.18)  10 (0)        10 (0)
Resnet50
  Relu      SBP-SR   92.05 (0.06)  89.57 (0.21)  82.68 (0.52)  58.76 (1.82)
            SNIP     91.64 (0.14)  89.20 (0.54)  80.49 (2.41)  19.98 (14.12)
  Tanh      SBP-SR   90.43 (0.32)  88.18 (0.10)  80.09 (0.55)  58.21 (1.61)
            SNIP     89.55 (0.10)  10 (0)        10 (0)        10 (0)

G ADDITIONAL EXPERIMENTS

In Table 10, we present additional experiments with varying Resnet architectures (Resnet32/50) and sparsities (up to 99.9%) with Relu and Tanh activation functions on CIFAR10. We see that, overall, our proposed Stable Resnet performs better than standard Resnets. In addition, we also plot the remaining weights for each layer to get a better understanding of the different pruning strategies, as well as to understand why some of the Resnets with Tanh activation functions are untrainable. Furthermore, we added additional MNIST experiments with different activation functions (ELU, Tanh) and note that our rescaled version allows us to prune significantly more for deeper networks.

Figure 4: Percentage of pruned weights per layer in a Resnet32 for our scaled Resnet32 and standard Resnet32 with Kaiming initialization.

Figure 5: Accuracy on MNIST with different initialization schemes including EOC with rescaling, EOC without rescaling, and Ordered phase initialization, with varying depth and sparsity. Panels: (a) ELU with EOC Init & Rescaling, (b) ELU with EOC Init, (c) ELU with Ordered phase Init, (d) Tanh with EOC Init & Rescaling, (e) Tanh with EOC Init, (f) Tanh with Ordered phase Init. This figure clearly illustrates the benefits of rescaling very deep and sparse FFNN.

Table 10: Test accuracy of pruned vanilla-CNN on CIFAR10 with different depth/sparsity levels

            Algo     90            98            99.5          99.9
Resnet32
  Relu      SBP-SR   92.56 (0.06)  88.25 (0.35)  79.54 (1.12)  51.56 (1.12)
            SNIP     92.24 (0.25)  87.63 (0.16)  77.56 (0.36)  10 (0)
  Tanh      SBP-SR   90.97 (0.2)   86.62 (0.38)  75.04 (0.49)  51.88 (0.56)
            SNIP     90.69 (0.28)  85.47 (0.18)  10 (0)        10 (0)
Resnet50
  Relu      SBP-SR   92.05 (0.06)  89.57 (0.21)  82.68 (0.52)  58.76 (1.82)
            SNIP     91.64 (0.14)  89.20 (0.54)  80.49 (2.41)  19.98 (14.12)
  Tanh      SBP-SR   90.43 (0.32)  88.18 (0.10)  80.09 (0.55)  58.21 (1.61)
            SNIP     89.55 (0.10)  10 (0)        10 (0)        10 (0)

H ON THE LOTTERY TICKET HYPOTHESIS

The Lottery Ticket Hypothesis (LTH) (Frankle and Carbin, 2019) states that "randomly initialized networks contain subnetworks that, when trained in isolation, reach test accuracy comparable to the original network". We have shown so far that pruning a NN initialized on the EOC outputs sparse NNs that can be trained after rescaling. Conversely, if we initialize a random NN with any hyperparameters (σ_b, σ_w), then, intuitively, we can prune this network in a way that ensures that the pruned NN is on the EOC. This would theoretically make the sparse architecture trainable. We formalize this intuition as follows.
Weak Lottery Ticket Hypothesis (WLTH): For any randomly initialized network, there exists a subnetwork that is initialized on the Edge of Chaos.

In the next theorem, we prove that the WLTH is true for FFNN and CNN architectures that are initialized with Gaussian weights.

Theorem 3. Consider a FFNN or CNN with layers initialized with variance σ_w^2 > 0 for the weights and variance σ_b^2 for the bias. Let σ_{w,EOC} be the value of σ_w such that (σ_{w,EOC}, σ_b) ∈ EOC. Then, for all σ_w > σ_{w,EOC}, there exists a subnetwork that is initialized on the EOC. Therefore, the WLTH is true.

The idea behind the proof of Theorem 3 is that by removing a fraction of the weights from each layer, we change the covariance structure in the next layer. By doing so in a precise way, we can find a subnetwork that is initialized on the EOC. We prove a slightly more general result than the one stated.

Theorem 4 (Winning Tickets on the Edge of Chaos). Consider a neural network with layers initialized with variances σ_{w,l} ∈ R^+ for each layer and variance σ_b > 0 for the bias. We define σ_{w,EOC} to be the value of σ_w such that (σ_{w,EOC}, σ_b) ∈ EOC. Then, for all sequences (σ_{w,l})_l such that σ_{w,l} > σ_{w,EOC} for all l, there exists a distribution of subnetworks initialized on the Edge of Chaos.

Proof. We prove the result for FFNN; the proof for CNN is similar. Let x, x' be two inputs. For all l, let (δ^l_{ij}) be a collection of Bernoulli variables with probability p_l. The forward propagation of the covariance is given by

q̂^l(x, x') = E[ŷ^l_i(x) ŷ^l_i(x')] = E[Σ_{j,k} W^l_{ij} W^l_{ik} δ^l_{ij} δ^l_{ik} φ(ŷ^{l−1}_j(x)) φ(ŷ^{l−1}_k(x'))] + σ_b^2.

This yields

q̂^l(x, x') = σ_{w,l}^2 p_l E[φ(ŷ^{l−1}_1(x)) φ(ŷ^{l−1}_1(x'))] + σ_b^2.

By choosing p_l = σ_{w,EOC}^2 / σ_{w,l}^2, this becomes

q̂^l(x, x') = σ_{w,EOC}^2 E[φ(ŷ^{l−1}_1(x)) φ(ŷ^{l−1}_1(x'))] + σ_b^2.

Therefore, the new variance after pruning with the Bernoulli mask δ is σ̂_w^2 = σ_{w,EOC}^2. Thus, the subnetwork defined by δ is initialized on the EOC. The distribution of these subnetworks is directly linked to the distribution of δ. We can see this result as layer-wise pruning, i.e. pruning each layer separately. The proof is similar for CNNs.

Theorem 3 is a special case of the previous result where the variances σ_{w,l} are the same for all layers.

I ALGORITHM FOR SECTION 2.3

Algorithm 1 Rescaling trick for FFNN
Input: Pruned network, size m
for l = 1 to L do
  for i = 1 to N_l do
    α^l_i ← Σ_{j=1}^{N_{l−1}} (W^l_{ij})^2 δ^l_{ij}
    ρ^l_{ij} ← 1/√(α^l_i) for all j
  end for
end for
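As a complement to the proofs above, the layer-wise Bernoulli construction used in Theorem 4 is straightforward to simulate: keeping each weight of layer l with probability p_l = σ_{w,EOC}^2 / σ_{w,l}^2 brings the effective weight variance of the masked layer back to σ_{w,EOC}^2. The sketch below uses a single layer and the ReLU case (for which σ_{w,EOC} = √2 with σ_b = 0); both are illustrative choices rather than part of the formal argument.

```python
import numpy as np

# Sketch of the construction in Theorem 4: mask each weight of a layer initialized
# with sigma_w,l > sigma_w,EOC independently with keep-probability
#   p_l = sigma_w,EOC^2 / sigma_w,l^2,
# so that the masked layer has effective variance sigma_w,EOC^2 (i.e. lies on the EOC).
# ReLU with sigma_b = 0 is used here, for which sigma_w,EOC = sqrt(2).

rng = np.random.default_rng(0)
N_in, N_out = 2000, 2000
sigma_w_l, sigma_w_eoc = 2.5, np.sqrt(2.0)

W = rng.normal(0.0, sigma_w_l / np.sqrt(N_in), size=(N_out, N_in))
p_l = (sigma_w_eoc / sigma_w_l) ** 2
delta = (rng.random((N_out, N_in)) < p_l).astype(float)    # Bernoulli(p_l) mask

eff_var = np.mean(np.sum((W * delta) ** 2, axis=1))        # estimates sigma_w,l^2 * p_l
print("keep probability p_l        :", round(p_l, 3))
print("effective variance (masked) :", round(eff_var, 3))  # ~ sigma_w,EOC^2 = 2.0
print("target sigma_w,EOC^2        :", sigma_w_eoc ** 2)
```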