# Logarithmic Pruning is All You Need

Laurent Orseau (DeepMind, London, UK), lorseau@google.com
Marcus Hutter (DeepMind, London, UK), www.hutter1.net
Omar Rivasplata (DeepMind, London, UK), rivasplata@google.com

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

## Abstract

The Lottery Ticket Hypothesis is a conjecture that every large neural network contains a subnetwork that, when trained in isolation, achieves comparable performance to the large network. An even stronger conjecture has been proven recently: Every sufficiently overparameterized network contains a subnetwork that, at random initialization, but without training, achieves comparable accuracy to the trained large network. This latter result, however, relies on a number of strong assumptions and guarantees only a polynomial factor on the size of the large network compared to the target function. In this work, we remove the most limiting assumptions of this previous work while providing significantly tighter bounds: the overparameterized network only needs a number of neurons per weight of the target subnetwork that is logarithmic in all variables but the depth.

## 1 Introduction

The recent success of neural network (NN) models in a variety of tasks, ranging from vision [Khan et al., 2020] to speech synthesis [van den Oord et al., 2016] to playing games [Schrittwieser et al., 2019, Ebendt and Drechsler, 2009], has sparked a number of works aiming to understand how and why they work so well. Proving theoretical properties for neural networks is quite a difficult task, with challenges due to the intricate composition of the functions they implement and the high-dimensional regimes of their training dynamics. The field is vibrant but still in its infancy: many theoretical tools are yet to be built to provide guarantees on what and how NNs can learn. A lot of progress has been made towards understanding the convergence properties of NNs (see e.g., Allen-Zhu et al. [2019], Zou and Gu [2019] and references therein). The fact remains that training and deploying deep NNs has a large cost [Livni et al., 2014], which is problematic. To avoid this problem, one could stick to a small network size. However, it is becoming evident that there are benefits to using oversized networks, as the literature on overparametrized models [Ma et al., 2018] points out. Another solution, commonly used in practice, is to prune a trained network so as to reduce its size and hence the cost of prediction/deployment. While theoretical guarantees are missing, experimental works show that pruning can considerably reduce the network size without sacrificing accuracy.

The influential work of Frankle and Carbin [2019] has pointed out the following observation: a) train a large network for long enough and observe its performance on a dataset, b) prune it substantially to reveal a much smaller subnetwork with good (or better) performance, c) reset the weights of the subnetwork to their original values and remove the rest of the weights, and d) retrain the subnetwork in isolation; then the subnetwork reaches the same test performance as the large network, and trains faster. Frankle and Carbin [2019] thus conjecture that every successfully trained network contains a much smaller subnetwork that, when trained in isolation, has comparable performance to the large network, without sacrificing computing time. They name this phenomenon the Lottery Ticket Hypothesis, and a winning ticket is a subnetwork of the kind just described.
Ramanujan et al. [2019] went even further by observing that if the network architecture is large enough, then it contains a smaller network that, even without any training, has comparable accuracy to the trained large network. They support their claim with empirical results using a new pruning algorithm, and even provide a simple asymptotic justification that we rephrase here: Starting from the inputs and progressing toward the outputs, for any neuron of the target network, sample as many neurons as required until one calculates a function within small error of the target neuron; then, after pruning the unnecessary neurons, the newly generated network will be within some small error of the target network. Interestingly, Ulyanov et al. [2018] pointed out that randomly initialized but untrained ConvNets already encode a great deal of the image statistics required for restoration tasks such as denoising and inpainting, and the only prior information needed to do them well seems to be contained in the network structure itself, since no part of the network was learned from data.

Very recently, building upon the work of Ramanujan et al. [2019], Malach et al. [2020] proved a significantly stronger version of the pruning is all you need conjecture, moving away from asymptotic results to non-asymptotic ones: With high probability, any target network of $\ell$ layers and $n$ neurons per layer can be approximated within $\varepsilon$ accuracy by pruning a larger network whose size is polynomial in the size of the target network. To prove their bounds, Malach et al. [2020] make assumptions about the norms of the inputs and of the weights. This polynomial bound already tells us that unpruned networks contain many winning tickets even without training. It is then natural to ask: could the most important task of gradient descent be pruning?

Building on top of these previous works, we aim at providing stronger theoretical guarantees, still based on the motto that pruning is all you need, while hoping to provide further insights into how winning tickets may be found. In this work we relax the aforementioned assumptions while greatly strengthening the theoretical guarantees: the number of samples required to approximate one target weight improves from polynomial to logarithmic order in all variables except the depth.

How this paper is organized. After some notation (Section 2) and the description of the problem (Section 3), we provide a general approximation propagation lemma (Section 4), which shows the effect of the different variables on the required accuracy. Next, in Section 5 we show how to construct the large, fully connected ReLU network; the construction is identical to that of Malach et al. [2020], except that the weights are sampled from a hyperbolic distribution instead of a uniform one. We then give our theoretical results in Section 6, showing that only $O(\log(\ell n_{\max}/\varepsilon))$ neurons per target weight are required under conditions similar to those of the previous work (with $\ell$ layers, $n_{\max}$ neurons per layer and $\varepsilon$ accuracy), or $O(\ell \log(n_{\max}/\varepsilon))$ (with some other dependencies inside the log) if these conditions are relaxed. For completeness, the most important technical result is included in Section 7. Other technical results, a table of notation, and further ideas can be found in Appendix C.
## 2 Notation and definitions

A network architecture $A(\ell, n, \sigma)$ is described by a positive integer $\ell$ corresponding to the number of fully connected feed-forward layers, a list of positive integers $n = (n_0, n_1, \ldots, n_\ell)$ corresponding to the profile of widths, where $n_i$ is the number of neurons in layer $i \in [\ell] = \{1, \ldots, \ell\}$ and $n_0$ is the input dimension, and a list of activation functions $\sigma = (\sigma_1, \ldots, \sigma_\ell)$: all neurons in layer $i$ use the activation function $\sigma_i$. Networks from the architecture $A(\ell, n, \sigma)$ implement functions from $\mathbb{R}^{n_0}$ to $\mathbb{R}^{n_\ell}$ that are obtained by successive compositions: $\mathbb{R}^{n_0} \to \mathbb{R}^{n_1} \to \cdots \to \mathbb{R}^{n_\ell}$.

Let $F$ be a target network from architecture $A(\ell, n, \sigma)$. The composition of such an $F$ is as follows: Each layer $i \in [\ell]$ has a matrix $W_i \in [-w_{\max}, w_{\max}]^{n_i \times n_{i-1}}$ of connection weights, and an activation function $\sigma_i$, such as tanh, the logistic sigmoid, ReLU, Heaviside, etc. The network takes as input a vector $x \in X \subseteq \mathbb{R}^{n_0}$, where for example $X = \{-1, 1\}^{n_0}$ or $X = [0, x_{\max}]^{n_0}$, etc. In layer $i$, the neuron $j$ with in-coming weights $W_{i,j}$ calculates $f_{i,j}(y) = \sigma_i(W_{i,j}\, y)$, where $y \in \mathbb{R}^{n_{i-1}}$ is usually the output of the previous layer. Note that $W_{i,j}$ is the $j$-th row of the matrix $W_i$. The vector $f_i(y) = [f_{i,1}(y), \ldots, f_{i,n_i}(y)]^\top \in \mathbb{R}^{n_i}$ denotes the output of the whole layer $i$ when it receives $y \in \mathbb{R}^{n_{i-1}}$ from the previous layer. Furthermore, for a given network input $x \in X$ we recursively define $F_i(x)$ by setting $F_0(x) = x$ and, for $i \in [\ell]$, $F_i(x) = f_i(F_{i-1}(x))$. The output of neuron $j \in [n_i]$ in layer $i$ is $F_{i,j}(x) = f_{i,j}(F_{i-1}(x))$. The network output is $F(x) = F_\ell(x)$.

For an activation function $\sigma(\cdot)$, let $\lambda$ be its Lipschitz factor (when it exists), that is, $\lambda$ is the smallest real number such that $|\sigma(x) - \sigma(y)| \le \lambda |x - y|$ for all $(x, y) \in \mathbb{R}^2$. For ReLU and tanh we have $\lambda = 1$, and for the logistic sigmoid $\lambda = 1/4$. Let $\lambda_i$ be the $\lambda$ corresponding to the activation function $\sigma_i$ of all the neurons in layer $i$, and let $\lambda_{\max} = \max_{i \in [\ell]} \lambda_i$. Define $n_{\max} = \max_{i \in [0..\ell]} n_i$ to be the maximum number of neurons per layer. The total number of connection weights in the architecture $A(\ell, n, \sigma)$ is denoted $N$, and we have $N \le \ell\, n_{\max}^2$.

For all $x \in X$, let $F_{\max}(x) = \max_{i \in [\ell]} \max_{j \in [n_{i-1}]} |F_{i-1,j}(x)|$ be the maximum activation at any layer of a target network $F$, including the network inputs but excluding the network outputs. We also write $F_{\max}(X) = \sup_{x \in X} F_{\max}(x)$; when $X$ is restricted to the set of inputs of interest (not necessarily the set of all possible inputs), such as a particular dataset, we expect $F_{\max}(X)$ to be bounded by a small constant in most if not all cases. For example, $F_{\max}(X) \le 1$ for a neural network with only sigmoid activations and inputs in $[-1, 1]^{n_0}$. For ReLU activations, $F_{\max}(X)$ can in principle grow as fast as $(n_{\max} w_{\max})^\ell$, but since networks with sigmoid activations are universal approximators, we expect that for all functions that can be approximated with a sigmoid network there is a ReLU network calculating the same function with $F_{\max}(X) = O(1)$.

The large network $G$ has an architecture $A(\ell', n', \sigma')$, possibly wider and deeper than the target network $F$. The pruned network $\hat G$ is obtained by pruning (setting to 0) many weights of the large network $G$. For each layer $i \in [\ell']$ and each pair of neurons $j_1 \in [n'_i]$ and $j_2 \in [n'_{i-1}]$, for the weight $w_{i,j_1,j_2}$ of the large network $G$ there is a corresponding mask $b_{i,j_1,j_2} \in \{0, 1\}$ such that the weight of the pruned network $\hat G$ is $w'_{i,j_1,j_2} = b_{i,j_1,j_2}\, w_{i,j_1,j_2}$.
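To fix ideas, here is a minimal sketch (our own illustration; the function names and the toy shapes are hypothetical, not from the paper) of the recursive definition $F_i(x) = \sigma_i(W_i F_{i-1}(x))$ and of pruning by an element-wise binary mask, $w'_{i,j_1,j_2} = b_{i,j_1,j_2}\, w_{i,j_1,j_2}$.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(x, weights, activations):
    """Network output F(x) = F_l(x), with F_0(x) = x and F_i(x) = sigma_i(W_i F_{i-1}(x))."""
    F = x
    for W, sigma in zip(weights, activations):
        F = sigma(W @ F)
    return F

def prune(weights, masks):
    """Pruned weights w' = b * w, element-wise, one binary mask per layer."""
    return [b * W for W, b in zip(weights, masks)]

# Toy architecture A(l=2, n=(3, 4, 2), sigma=(ReLU, ReLU)).
rng = np.random.default_rng(0)
W = [rng.uniform(-1.0, 1.0, size=(4, 3)), rng.uniform(-1.0, 1.0, size=(2, 4))]
b = [rng.integers(0, 2, size=Wi.shape) for Wi in W]   # binary masks b in {0, 1}
x = rng.uniform(-1.0, 1.0, size=3)
print(forward(x, W, [relu, relu]))                    # the unpruned network
print(forward(x, prune(W, b), [relu, relu]))          # the same network after masking
```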
The pruned network $\hat G$ will have a different architecture from $F$, but at a higher level (by grouping some neurons together) it will have the same virtual architecture, with virtual weights $\hat W$. As in previous theoretical work, we consider an oracle pruning procedure, as our objective is to understand the limitations of even the best pruning procedures.

For a matrix $M \in [-c, c]^{n \times m}$, we denote by $\|M\|_2$ its spectral norm, equal to its largest singular value, and its max-norm is $\|M\|_{\max} = \max_{i,j} |M_{i,j}|$. In particular, for a vector $v$, we have $\|Mv\|_2 \le \|M\|_2 \|v\|_2$, as well as $\|M\|_{\max} \le \|M\|_2 \le \sqrt{nm}\, \|M\|_{\max}$ and also $\|M\|_{\max} \le c$. This means for example that $\|M\|_2 \le 1$ is a stronger condition than $\|M\|_{\max} \le 1$.

## 3 Objective

Objective: Given an architecture $A(\ell, n, \sigma)$ and accuracy $\varepsilon > 0$, construct a network $G$ from some larger architecture $A(\ell', n', \sigma')$, such that if the weights of $G$ are randomly initialized (no training), then for any target network $F$ from $A(\ell, n, \sigma)$, setting some of the weights of $G$ to 0 (pruning) reveals a subnetwork $\hat G$ such that, with high probability, $\|F(x) - \hat G(x)\|_2 \le \varepsilon$ for all $x \in X$.

Question: How large must $G$ be to contain all such $\hat G$? More precisely, how many more neurons or how many more weights must $G$ have compared to $F$?

Ramanujan et al. [2019] were the first to provide a formal asymptotic argument proving that such a $G$ can indeed exist at all. Malach et al. [2020] went substantially further by providing the first polynomial bound on the size of $G$ compared to the size of the target network $F$. To do so, they make the following assumptions on the target network: (i) the inputs $x \in X$ must satisfy $\|x\|_2 \le 1$, and at all layers $i \in [\ell]$: (ii) the weights must be bounded in $[-1/\sqrt{n_{\max}}, 1/\sqrt{n_{\max}}]$, (iii) they must satisfy $\|W_i\|_2 \le 1$, and (iv) the number of non-zero weights at layer $i$ must be less than $n_{\max}$: $\|W_i\|_0 \le n_{\max}$. Note that these constraints imply that $F_{\max}(X) \le 1$. Then, under these conditions, they prove that any ReLU network with $\ell$ layers and $n_{\max}$ neurons per layer can be approximated¹ within $\varepsilon$ accuracy with probability $1 - \delta$ by pruning a network $G$ with $2\ell$ ReLU layers, where each added intermediate layer has $n_{\max}^2 \left\lceil \frac{64\, \ell^2 n_{\max}^3}{\varepsilon^2} \log \frac{2 \ell n_{\max}^2}{\delta} \right\rceil$ neurons. These assumptions are rather strong, as in general this forces the activation signal to decrease quickly with the depth. Relaxing these assumptions while using the same proof steps would make the bounds exponential in the number of layers.

We build upon the work of Ramanujan et al. [2019] and Malach et al. [2020], who gave the first theoretical results on the Lottery Ticket Hypothesis, albeit under restrictive assumptions. Our work re-uses some of their techniques to provide sharper bounds while removing these assumptions.

¹Note that even though their bounds are stated in the 1-norm, this is because they consider a single output; for multiple outputs their result holds in the 2-norm, which is what their proof uses.

## 4 Approximation Propagation

In this section, we analyze how the approximation error between two networks with the same architecture propagates through the layers. The following lemma is a generalization of the (end of the) proof of Malach et al. [2020, Theorem A.6]; it removes their aforementioned assumptions and provides better insight into the impact of the different variables on the required accuracy, but is not sufficient in itself to obtain better bounds. For two given networks with the same architecture, it determines what accuracy is needed on each individual weight so that the outputs of the two networks differ by at most $\varepsilon$ on any input.
Note that no randomization appears at this stage.

Lemma 1 (Approximation propagation). Consider two networks $F$ and $\hat G$ with the same architecture $A(\ell, n, \sigma)$, with respective weight matrices $W$ and $\hat W$, each weight being in $[-w_{\max}, w_{\max}]$. Given $\varepsilon > 0$, if for each weight $w$ of $F$ and the corresponding weight $\hat w$ of $\hat G$ we have $|w - \hat w| \le \varepsilon_w$, with
$$\varepsilon_w \;\le\; \frac{\varepsilon}{\ell\, e\, \lambda_{\max}\, n_{\max}^{3/2}\, F_{\max}(X) \prod_{i \in [\ell]} \max\{1,\ \lambda_i \|\hat W_i\|_2\}}\,,$$
then $\sup_{x \in X} \|F(x) - \hat G(x)\|_2 \le \varepsilon$.

The proof is given in Appendix C.

Example 2. Consider an architecture with only the ReLU activation function ($\lambda = 1$), weights in $[-1, 1]$, and assume that $F_{\max}(X) = 1$; taking the worst case $\|\hat W_i\|_2 \le w_{\max} n_{\max} = n_{\max}$, Lemma 1 tells us that the approximation error on each individual weight must be at most $\varepsilon_w \le \varepsilon / (e\, \ell\, n_{\max}^{3/2 + \ell})$ so as to guarantee that the approximation error between the two networks is at most $\varepsilon$. This is exponential in the number of layers. If we assume instead that $\|\hat W_i\|_2 \le 1$, as in previous work, then this reduces to a mild polynomial dependency: $\varepsilon_w \le \varepsilon / (e\, \ell\, n_{\max}^{3/2})$.

## 5 Construction of the ReLU Network G and Main Ideas

We now explain how to construct the large network $G$ given only the architecture $A(\ell, n, \sigma)$, the accuracy $\varepsilon$, and the domain $[-w_{\max}, w_{\max}]$ of the weights. Apart from this, the target network $F$ is unknown. In this section all activation functions are the ReLU $\sigma(x) = \max\{0, x\}$, and thus $\lambda = 1$.

We use a similar construction of the large network $G$ as Malach et al. [2020]: both the target network $F$ and the large network $G$ consist of fully connected ReLU layers, but $G$ may be wider and deeper. The weights of $F$ are in $[-w_{\max}, w_{\max}]$. The weights of $G$ (at all layers) are all sampled from the same distribution; the only difference with the previous work is the distribution of the weights: we use a hyperbolic distribution instead of a uniform one.

Between layers $i - 1$ and $i$ of the target architecture, for the large network $G$ we insert an intermediate layer $i - 1/2$ of ReLU neurons. Layer $i - 1$ is fully connected to layer $i - 1/2$, which is fully connected to layer $i$. By contrast to the target network $F$, in $G$ the layers $i - 1$ and $i$ are not directly connected.

The insight of Malach et al. [2020] is to use two intermediate (fully connected ReLU) neurons $z^+$ and $z^-$ of the large network $G$ to mimic one weight $w^*$ of the target network (see Fig. 1): Calling $z^+_{\text{in}}, z^+_{\text{out}}, z^-_{\text{in}}, z^-_{\text{out}}$ the input and output weights of $z^+$ and $z^-$ that match the input and output of the connection $w^*$, in the pruned network $\hat G$ all connections apart from these 4 are masked out by a binary mask $b$ set to 0. These two neurons together implement a virtual weight $\hat w$ and calculate the function $x \mapsto \hat w x$ by taking advantage of the identity $x = \sigma(x) - \sigma(-x)$:
$$\hat w x = z^+_{\text{out}}\, \sigma(z^+_{\text{in}} x) + z^-_{\text{out}}\, \sigma(z^-_{\text{in}} x)\,.$$
Hence, if $z^+_{\text{out}} z^+_{\text{in}} \approx w^* \approx z^-_{\text{out}} z^-_{\text{in}}$ (with $z^+_{\text{out}} > 0 > z^-_{\text{out}}$), the virtual weight $\hat w$ made of $z^+$ and $z^-$ is approximately $w^*$. Then, for each target weight $w^*$, Malach et al. [2020] sample many such intermediate neurons to ensure that two of them can be pruned so that $|w^* - \hat w| \le \varepsilon_w$ with high probability. This requires $\Omega(1/\varepsilon_w^2)$ samples and, when combined with Lemma 1 (see Example 2), makes the general bound on the whole network grow exponentially with the number of layers, unless strong constraints are imposed.

To obtain a logarithmic dependency on $\varepsilon_w$, we use three new insights that take advantage of the composability of neural networks: 1) binary decomposition of the weights, 2) product weights, and 3) batch sampling. We detail them next.
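Before detailing these three insights, here is a minimal sketch of the basic two-neuron virtual weight just described (our own illustration with hypothetical names): two ReLU neurons with appropriately signed weights reproduce the map $x \mapsto w^* x$ through the identity $x = \sigma(x) - \sigma(-x)$.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def virtual_weight(x, z_in_plus, z_out_plus, z_in_minus, z_out_minus):
    """Two intermediate ReLU neurons simulating x -> w*x.

    Assumes z_out_plus > 0 > z_out_minus, sign(z_in_plus) = sign(w*) = -sign(z_in_minus),
    and that both products z_out * z_in approximate w*.
    """
    return z_out_plus * relu(z_in_plus * x) + z_out_minus * relu(z_in_minus * x)

# Toy check with w* = 0.37: both products are chosen to equal w* exactly here.
w_star = 0.37
for x in (-2.0, -0.5, 0.0, 1.3):
    y = virtual_weight(x, z_in_plus=0.74, z_out_plus=0.5,
                       z_in_minus=-0.74, z_out_minus=-0.5)
    assert np.isclose(y, w_star * x)
```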
[Figure 1: schematic comparing a target weight $w^*$, its simulation by a single pair of intermediate neurons (Malach et al.), and its simulation by a binary decomposition (this work).] Figure 1: The target weight $w^*$ is simulated in the pruned network $\hat G$ by 2 intermediate neurons, requiring $\Omega(1/\varepsilon^2)$ sampled neurons (previous work), or by roughly $2 \log(1/\varepsilon)$ intermediate neurons due to a binary decomposition of $w^*$, requiring only $O(\log(1/\varepsilon))$ sampled neurons (this work).

Weight decomposition. Our most important improvement is to build the weight $\hat w$ not with just two intermediate neurons, but with $O(\log 1/\varepsilon)$ of them, so as to decompose the weight into pieces of different precisions and recombine them with the sum in the neuron at layer $i + 1$ (see Fig. 1), using a suitable binary mask vector $b$ in the pruned network $\hat G$. Intuitively, the weight $\hat w$ is decomposed into its binary representation up to a precision of $k \approx \lceil \log_2 1/\varepsilon \rceil$ bits: $\sum_{s=1}^{k} b_s 2^{-s}$. Using a uniform distribution to obtain these weights $2^{-s}$ would not help, however. But, because the high-precision bits are now all centered around 0, we can use a hyperbolic sampling distribution $p_w(|w|) \propto 1/|w|$, which has high density near 0. More precisely, but still a little simplified, for a weight $w^* \in [-1, 1]$ we approximate $w^*$ within $2^{-k}$ accuracy with the virtual weight $\hat w$ such that
$$\sum_{s=1}^{k} b_s \left[ z^+_{\text{out},s}\, \sigma(z^+_{\text{in},s} x) + z^-_{\text{out},s}\, \sigma(z^-_{\text{in},s} x) \right] \;\approx\; \sum_{s=1}^{k} b_s\, \operatorname{sgn}(w^*)\, 2^{-s} x \;\approx\; w^* x \qquad (1)$$
where $b_s \in \{0, 1\}$ is factored out since all connections of the $s$-th pair of intermediate neurons have the same mask, and where $z^+_{\text{out},s} z^+_{\text{in},s} \approx \operatorname{sgn}(w^*)\, 2^{-s} \approx z^-_{\text{out},s} z^-_{\text{in},s}$ with $z^+_{\text{out},s} > 0$, $\operatorname{sgn}(z^+_{\text{in},s}) = \operatorname{sgn}(w^*)$, $z^-_{\text{out},s} < 0$ and $\operatorname{sgn}(z^-_{\text{in},s}) = -\operatorname{sgn}(w^*)$. Note however that, because of the inexactness of the sampling process, we use a decomposition in base 3/2 instead of base 2 (Lemma 9 in Section 7).

Product weights. Recall that $z^+_{\text{out},s}\, \sigma(z^+_{\text{in},s} x) = z^+_{\text{out},s} \max\{0, z^+_{\text{in},s} x\}$. For fixed signs of $z^+_{\text{out},s}$ and $z^+_{\text{in},s}$, this function can be equivalently calculated for all possible values of these two weights such that the product $z^+_{\text{out},s} z^+_{\text{in},s}$ remains unchanged. Hence, forcing $z^+_{\text{out},s}$ and $z^+_{\text{in},s}$ to take 2 specific values is wasteful, as one can take advantage of the cumulative probability mass of all their combinations. We thus make use of the induced product distribution, which avoids squaring the number of required samples. Define the distribution $p_{w \ge 0}$ for positive weights $w \in [\alpha, \beta]$ with $0 < \alpha < \beta$, and $p_w$, symmetric around 0, for $w \in [-\beta, -\alpha] \cup [\alpha, \beta]$:
$$p_{w \ge 0}(w) = \frac{1}{w \ln(\beta/\alpha)} \propto \frac{1}{w}\,, \qquad p_w(w) = p_w(-w) = \tfrac12\, p_{w \ge 0}(|w|) = \frac{1}{2 |w| \ln(\beta/\alpha)}\,.$$
Then, instead of sampling uniformly until both $z^+_{\text{out},s} \approx 1$ and $z^+_{\text{in},s} \approx w^*$, we sample both from $p_w$ so that $z^+_{\text{out},s} z^+_{\text{in},s} \approx w^*$, taking advantage of the induced product distribution $\bar p_w \ge \tfrac12 p_{w \ge 0}$ (Lemma C.11).

Batch sampling. Sampling sufficiently many intermediate neurons so that a subset of them are employed in approximating one target weight $w^*$ with high probability, and then discarding (pruning) all other intermediate neurons, is wasteful. Instead, we allow these samples to be recycled and used for other neurons in the same layer. This is done by partitioning the neurons into different buckets (categories) and ensuring that each bucket has enough neurons (Lemma C.12).
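The following sketch (ours, with hypothetical names; alpha and beta stand for the interval endpoints $\alpha, \beta$ above) shows how such hyperbolic weights can be drawn by inverse-CDF sampling and how two of them combine into a product weight, whose distribution places heavy mass near 0, exactly where the high-precision pieces of the decomposition live.

```python
import numpy as np

def sample_hyperbolic(alpha, beta, size, rng):
    """Sample w with density proportional to 1/|w| on [-beta,-alpha] U [alpha,beta].

    Positive part via inverse CDF: |w| = alpha * (beta/alpha)**u with u ~ Uniform[0,1];
    the sign is a fair coin, matching the symmetric density p_w defined above.
    """
    u = rng.uniform(size=size)
    magnitude = alpha * (beta / alpha) ** u
    sign = rng.choice([-1.0, 1.0], size=size)
    return sign * magnitude

rng = np.random.default_rng(0)
alpha, beta = 1e-3, 1.0        # hypothetical interval [alpha, beta]
z_in = sample_hyperbolic(alpha, beta, 100_000, rng)
z_out = sample_hyperbolic(alpha, beta, 100_000, rng)
products = z_in * z_out        # product weights, spread over [alpha**2, beta**2]
print(np.quantile(np.abs(products), [0.1, 0.5, 0.9]))  # heavy mass near 0
```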
## 6 Theoretical Results

We now have all the elements to present our central theorem, which tells us how many intermediate neurons to sample so as to approximate all the weights at a layer of the target network with high probability. Remark 4 below will then describe the result in terms of the number of neurons per target weight.

Theorem 3 (ReLU sampling bound). For a given architecture $A(\ell, n, \sigma)$ where $\sigma$ is the ReLU function, with weights in $[-w_{\max}, w_{\max}]$ and a given accuracy $\varepsilon$, the network $G$ constructed as above with weights sampled from $p_w$ with $[\alpha, \beta] = [\alpha_0/q, \beta_0/q]$, $\alpha_0 = 2\varepsilon_w/9$, $\beta_0 = 2 w_{\max}/3$, and $q = (\alpha_0 \beta_0)^{1/4}$, requires only to sample
$$M_i = \left\lceil 16\, k_0 \left( n_i n_{i-1} + \ln \frac{2 \ell k_0}{\delta} \right) \right\rceil$$
intermediate neurons for each layer $i$, with
$$k_0 = \left\lceil \log_{3/2} \frac{3 w_{\max}}{\varepsilon_w} \right\rceil = \left\lceil \log_{3/2}\!\left( \frac{3 w_{\max}\, \ell\, e\, n_{\max}^{3/2}\, F_{\max}(X)}{\varepsilon} \prod_{i \in [\ell]} \max\{1, \|\hat W_i\|_2\} \right) \right\rceil$$
($\varepsilon_w$ is as in Lemma 1 with $\lambda = 1$ for ReLU), in order to ensure that, for any target network $F$ with the given architecture $A(\ell, n, \sigma)$, with probability at least $1 - \delta$ there exist binary masks $b_{i,j} = (b_{i,j,1}, \ldots, b_{i,j,n_{i-1}})$ of $G$ such that, for the resulting subnetwork $\hat G$, $\sup_{x \in X} \|F(x) - \hat G(x)\|_2 \le \varepsilon$.

Proof. Step 1. Sampling intermediate neurons to obtain product weights. Consider a single target weight $w^*$. Recalling that $z^+_{\text{out},s} > 0$ and $z^-_{\text{out},s} < 0$, we rewrite Eq. (1) as
$$\hat w x = \underbrace{\Big( \sum_s b_s\, z^+_{\text{out},s} z^+_{\text{in},s} \Big)}_{=: \hat w^+} x\, \mathbf{1}[\operatorname{sgn}(w^*) x > 0] \;+\; \underbrace{\Big( \sum_s b_s\, z^-_{\text{out},s} z^-_{\text{in},s} \Big)}_{=: \hat w^-} x\, \mathbf{1}[\operatorname{sgn}(w^*) x \le 0]\,.$$
The two virtual weights $\hat w^+$ and $\hat w^-$ are obtained separately. We need both $|w^* - \hat w^+| \le \varepsilon_w/2$ and $|w^* - \hat w^-| \le \varepsilon_w/2$ so that $|w^* - \hat w| \le \varepsilon_w$. Consider $\hat w^+$ (the case of $\hat w^-$ is similar). We now sample $m$ intermediate neurons, fully connected to the previous and next layers, but only keeping the connection between the same input and output neurons as $w^*$ (the other weights are zeroed out by the mask $b$). For a single sampled intermediate neuron $z$, all its weights, including $z^+_{\text{out}}$, are sampled from $p_w$; thus the product $|z^+_{\text{out}} z^+_{\text{in}}|$ is sampled from the induced product distribution $\bar p_w$ and, a quarter of the time, $z^+_{\text{out}}$ and $z^+_{\text{in}}$ have the correct signs (recall that we need $z^+_{\text{out}} > 0$ and $\operatorname{sgn}(z^+_{\text{in}}) = \operatorname{sgn}(w^*)$). Define $p^+$ as the density of the product $z^+_{\text{out}} z^+_{\text{in}}$ restricted to the event $\{z^+_{\text{out}} > 0,\ \operatorname{sgn}(z^+_{\text{in}}) = \operatorname{sgn}(w^*)\}$; then
$$p^+(z^+_{\text{out}} z^+_{\text{in}}) \;\ge\; \bar p_w(|z^+_{\text{out}} z^+_{\text{in}}|)/4 \;\ge\; p_{w \ge 0}(|z^+_{\text{out}} z^+_{\text{in}}|)/8\,,$$
where the last inequality follows from Lemma C.11 for $|z^+_{\text{out}} z^+_{\text{in}}| \in [\alpha_0, \beta_0]$, $z^+_{\text{out}} \in [\alpha, \beta]$ and $z^+_{\text{in}} \in [\alpha, \beta]$, and similarly for $z^-_{\text{out}} z^-_{\text{in}}$ with $p^-(z^-_{\text{out}} z^-_{\text{in}}) \ge p_{w \ge 0}(|z^-_{\text{out}} z^-_{\text{in}}|)/8$. Note that because $\operatorname{sgn}(z^+_{\text{out}}) = -\operatorname{sgn}(z^-_{\text{out}})$ and $\operatorname{sgn}(z^+_{\text{in}}) = -\operatorname{sgn}(z^-_{\text{in}})$, the samples for $\hat w^+$ and $\hat w^-$ are mutually exclusive, which will save us a factor 2 later.

Step 2. Binary decomposition/recomposition. Consider a target weight $w^* \in [-w_{\max}, w_{\max}]$. Noting that Corollary 10 equally applies to negative weights by first negating them, we obtain $\hat w^+$ and $\hat w^-$ by two separate applications of Corollary 10, where we substitute $P_\varepsilon \leftarrow P_\varepsilon/8 = p_{w \ge 0}/8$, $\varepsilon \leftarrow \varepsilon_w/2$, and $\delta \leftarrow \delta_w$. Substituting $P_\varepsilon$ with $p_{w \ge 0}/8$ in Eq. (2) shows that this leads to a factor 8 on $m$. Therefore, sampling $m = 8 \lceil k_0 \ln \frac{k_0}{\delta_w} \rceil$ weights from $p_w$, with products in $[\alpha_0, \beta_0] = [2\varepsilon_w/9,\ 2 w_{\max}/3]$ and $k_0 = \lceil \log_{3/2} \frac{3 w_{\max}}{\varepsilon_w} \rceil$, ensures that there exists a binary mask $b$ of size at most $k_0$ such that $|w^* - \hat w^+| \le \varepsilon_w/2$ with probability at least $1 - \delta_w$. We proceed similarly for $\hat w^-$. Note that Corollary 10 guarantees $|\hat w| \le |w^*| \le w_{\max}$, even though the large network $G$ may have individual weights larger than $w_{\max}$.

Step 2'. Batch sampling. Take $k := \lceil \log_{3/2} \frac{3 w_{\max}}{2 \varepsilon_w} \rceil \le k_0$ to be the number of bits required to decompose a weight with Corollary 10 (via Lemma 9). Sampling $m$ different intermediate neurons for each target weight and discarding $m - k$ samples is wasteful: Since there are $n_i n_{i-1}$ target weights at layer $i$, we would need $n_i n_{i-1} m$ intermediate neurons, when in fact most of the discarded neurons could be recycled for other target weights. Instead, we sample all the weights of layer $i$ at the same time, requiring that we have at least $n_i n_{i-1}$ samples for each of the $k$ intervals of the binary decompositions of $\hat w^+$ and $\hat w^-$. Then we use Lemma C.12 with $2k$ categories: The first $k$ categories are for the decomposition of $\hat w^+$ and the next $k$ ones are for $\hat w^-$.
Note that these categories are indeed mutually exclusive, as explained in Step 1, and, adapting Eq. (2), each has probability at least $\frac18 \int_{\gamma^{u+1}}^{\gamma^u} p_{w \ge 0}(w)\, \mathrm{d}w \ge 1/(8 \log_{3/2}(3 w_{\max}/\varepsilon_w)) = 1/(8 k_0)$ (for any $u$). Hence, using Lemma C.12, where we take $n \leftarrow n_i n_{i-1}$ and $\delta \leftarrow \delta_i$, we only need to sample $\lceil 16 k_0 (n_i n_{i-1} + \ln \frac{2k}{\delta_i}) \rceil \le \lceil 16 k_0 (n_i n_{i-1} + \ln \frac{2 k_0}{\delta_i}) \rceil = M_i$ intermediate neurons to ensure that, with probability at least $1 - \delta_i$, each $\hat w^+$ and $\hat w^-$ can be decomposed into $k$ product weights, one in each of the intervals of Lemma 9.

Step 3. Network approximation. Using a union bound, we need $\delta_i = \delta/\ell$ for the claim to hold simultaneously for all layers. Finally, when considering only the virtual weights $\hat w$ (constructed from $\hat w^+$ and $\hat w^-$), $\hat G$ and $F$ now have the same architecture, hence choosing $\varepsilon_w$ as in Lemma 1 ensures that, with probability at least $1 - \delta$, $\sup_{x \in X} \|F(x) - \hat G(x)\|_2 \le \varepsilon$.

Remark 4. Consider $n_i = n_{\max}$ for all $i$ and assume $\|W_i\|_2 \le 1$, $w_{\max} = 1$ and $F_{\max}(X) \le 1$. Then $\varepsilon_w \le \varepsilon/(e \ell n_{\max}^{3/2})$ and $k_0 \le \log_{3/2}(3 e \ell n_{\max}^{3/2}/\varepsilon)$. Then we can interpret Theorem 3 as follows: When sampling the weights of a ReLU architecture from the hyperbolic distribution, we only need to sample $M_i/n_{\max}^2 \approx 16 k_0 + \ln(2 \ell k_0/\delta)/n_{\max}^2 = O(\log(\ell n_{\max}/\varepsilon))$ neurons per target weight (assuming $n_{\max}^2 > \log(\ell k_0/\delta)$). Compare with the bound of Malach et al. [2020, Theorem A.6] which, under the further constraints that $w_{\max} \le 1/\sqrt{n_{\max}}$ and $\max_{i \in [\ell]} \|W_i\|_0 \le n_{\max}$, and with uniform sampling in $[-1, 1]$, needed to sample $M_i/n_{\max}^2 = \lceil 64\, \ell^2 n_{\max}^3 \log(2N/\delta)/\varepsilon^2 \rceil$ neurons per target weight.

Example 5. Under the same assumptions as Remark 4, for $n_{\max} = 100$, $\ell = 10$, $\varepsilon = 0.01$, $\delta = 0.01$, the bound above for Malach et al. [2020] gives $M_i/n_{\max}^2 \approx 2 \cdot 10^{15}$, while our bound in Theorem 3 gives $M_i/n_{\max}^2 \approx 16 k_0$, i.e., in the hundreds (a rough numeric check is sketched at the end of this section).

Example 6. Under the same conditions as Example 5, if we remove the assumption that $\|W_i\|_2 \le 1$, then Theorem 3 gives $M_i/n_{\max}^2 = O(\ell \log(n_{\max}/\varepsilon))$ and numerically $M_i/n_{\max}^2 \approx 2\,450$.

We can now state our final result.

Corollary 7 (Weight count ratio). Under the same conditions as Theorem 3, let $N$ be the number of weights in the fully connected architecture $A(\ell, n, \sigma)$ and $N_G$ the number of weights of the large network $G$; then the weight count ratio is $N_G/N \le 32\, n_{\max} k_0 + O(\log(k_0/\delta))$.

Proof. We have $N = \sum_{i=1}^{\ell} n_{i-1} n_i$, and the total number of weights in the network $G$, if layers are fully connected, is at most $N_G = \sum_{i=1}^{\ell} (n_{i-1} + n_i) M_i$, where $M_i = 16 k_0\, n_{i-1} n_i + O(\log(k_0/\delta))$. Hence the weight count ratio is $N_G/N \le 32\, n_{\max} k_0 + O(\log(k_0/\delta))$.

Remark 8. Since in the pruned network $\hat G$ each target weight requires of the order of $k_0$ neurons, the large network has at most a constant factor more neurons than the pruned network.
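As a rough numeric check of Examples 5 and 6 under the assumptions of Remark 4 (this is our own illustrative script, simply evaluating the displayed formulas; it is not from the paper), the two bounds compare as follows.

```python
import math

# Setting of Example 5: n_max = 100, l = 10, eps = 0.01, delta = 0.01,
# with w_max = 1, F_max(X) <= 1 and ||W_i||_2 <= 1 (Remark 4).
n_max, l, eps, delta = 100, 10, 0.01, 0.01

# Malach et al. [2020, Theorem A.6]: neurons per target weight.
N = l * n_max**2   # upper bound on the total number of target weights
malach = math.ceil(64 * l**2 * n_max**3 * math.log(2 * N / delta) / eps**2)

# Theorem 3 / Remark 4: k0 ~ log_{3/2}(3 e l n_max^{3/2} / eps) and
# M_i / n_max^2 ~ 16 k0 + ln(2 l k0 / delta) / n_max^2.
k0 = math.log(3 * math.e * l * n_max**1.5 / eps, 1.5)
ours = 16 * k0 + math.log(2 * l * k0 / delta) / n_max**2

print(f"Malach et al. bound: {malach:.2e} neurons per target weight")
print(f"Theorem 3 bound    : {ours:.0f} neurons per target weight")
```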
## 7 Technical Lemma: Random Weights

The following lemma shows that if $m$ weights are sampled from a hyperbolic distribution, we can construct a "goldary" (as opposed to binary) representation of the weight with only $O(\ln\frac1\varepsilon \ln\frac{1}{\varepsilon\delta})$ samples. Because of the randomness of the process, we use a base 3/2 instead of a base 2 for the logarithms, so that the different bits have overlapping intervals. As the proof clarifies, the optimal base is $1/\gamma = \frac12(\sqrt5 + 1) \approx 1.62$; the base $1/\gamma = 3/2$ is convenient. The number $\frac12(\sqrt5 + 1)$ is known as the golden ratio in the mathematical literature, which explains the name we use.

Lemma 9 (Golden-ratio decomposition). For any given $\varepsilon > 0$ and $1/\varphi \le \gamma < 1$, where $\varphi := \frac12(\sqrt5 + 1)$ is the golden ratio, define the probability density $P_\varepsilon(v) := \frac{c_0}{v}$ for $v \in [\varepsilon\gamma^2, \gamma]$ with normalization $c_0 := [\ln\frac{1}{\gamma\varepsilon}]^{-1}$. For any $\delta \in (0, 1)$, if $m = \lceil k_0 \ln(k_0/\delta) \rceil = O(\ln\frac1\varepsilon \ln\frac{1}{\varepsilon\delta})$ with $k_0 := \log_\gamma(\gamma\varepsilon)$, then with probability at least $1 - \delta$ over the random sampling of $m$ weights $v_s \sim P_\varepsilon$ for $s = 1, \ldots, m$, the following holds: For every target weight $w \in [0, 1]$, there exists a mask $b \in \{0, 1\}^m$ with $|b| \le k_0$ such that $\hat w := b_1 v_1 + \cdots + b_m v_m$ is $\varepsilon$-close to $w$; indeed $w - \varepsilon \le \hat w \le w$.

Proof. Let $k = \lceil \log_\gamma \varepsilon \rceil \le 1 + \log_\gamma \varepsilon = k_0$. First, consider a sequence $(v_i)_{i \in [k]}$ such that each $v_i$ is in the interval $I_i := (\gamma^{i+1}, \gamma^i]$ for $i = 1, \ldots, k$. We construct an approximation $\hat w$ of any weight $w_0 := w \in [0, 1]$ by successively subtracting $v_i$ when possible. Formally:

for $i = 1, \ldots, k$: if $w_{i-1} \ge \gamma^i$ then $\{\, w_i := w_{i-1} - v_i;\ b_i := 1 \,\}$ else $\{\, w_i := w_{i-1};\ b_i := 0 \,\}$.

By induction we can show that $0 \le w_i \le \gamma^i$. This holds for $w_0$. Assume $0 \le w_{i-1} \le \gamma^{i-1}$: If $w_{i-1} < \gamma^i$ then $w_i = w_{i-1} < \gamma^i$. If $w_{i-1} \ge \gamma^i$ then $w_i = w_{i-1} - v_i \le \gamma^{i-1} - \gamma^{i+1} = (\gamma^{-1} - \gamma)\gamma^i \le \gamma^i$. The last inequality is true for $\gamma \ge \frac12(\sqrt5 - 1)$, which is satisfied due to the restriction $1/\varphi \le \gamma < 1$. Hence the error satisfies $0 \le w - \hat w = w_k \le \gamma^k \le \varepsilon \le \gamma^{k-1}$ for $k := \lceil \log_\gamma \varepsilon \rceil \ge 0$.

Now consider a random sequence $(v_s)_{s \in [m]}$ where we sample $v_s$ i.i.d. from $P_\varepsilon$ over the interval $[\gamma^2\varepsilon, \gamma]$ for $s = 1, \ldots, m > k$. In the event that there is at least one sample in each interval $I_i$, we can use the construction above with a subsequence $v'$ of $v$ such that $v'_i \in I_i$ and $\hat w = \sum_{i \in [k]} b_i v'_i$ as in the construction above. Next we lower bound the probability $p$ that each interval $I_i$ contains at least one sample. Let $E_i$ be the event that no sample is in $I_i$ and let $c = \min_{i \in [k]} \mathbb{P}[v \in I_i]$. Then $\mathbb{P}[E_i] = (1 - \mathbb{P}[v \in I_i])^m \le (1 - c)^m$, hence $p = 1 - \mathbb{P}[E_1 \vee \cdots \vee E_k] \ge 1 - \sum_i \mathbb{P}[E_i] \ge 1 - k(1 - c)^m \ge 1 - k\exp(-cm)$, and thus choosing $m \ge \frac1c \ln\frac{k}{\delta}$ ensures that $p \ge 1 - \delta$. Finally,
$$c = \min_{i \in [k]} \mathbb{P}[v \in I_i] = \min_i \mathbb{P}[\gamma^{i+1} < v \le \gamma^i] = \min_i \int_{\gamma^{i+1}}^{\gamma^i} \frac{c_0}{v}\, \mathrm{d}v = c_0 \ln\frac1\gamma = 1/\log_\gamma(\gamma\varepsilon) = \frac{1}{k_0}\,,$$
and so we can take $m = \lceil k_0 \ln\frac{k_0}{\delta} \rceil$.

Corollary 10 (Golden-ratio decomposition for weights in $[0, w_{\max}]$). For any given $\varepsilon > 0$, define the probability density $P_\varepsilon(v) := \frac{c_0}{v}$ for $v \in [\frac49\varepsilon,\ \frac23 w_{\max}]$ with normalization $c_0 := 1/\ln\frac{3 w_{\max}}{2\varepsilon}$, and let $k_0 := \log_{3/2}\frac{3 w_{\max}}{2\varepsilon}$. For any $\delta \in (0, 1)$, if $m = \lceil k_0 \ln\frac{k_0}{\delta} \rceil = O(\ln\frac{w_{\max}}{\varepsilon} \ln\frac{w_{\max}}{\varepsilon\delta})$, then with probability at least $1 - \delta$ over the random sampling of $m$ weights $v_s \sim P_\varepsilon$ ($s = 1, \ldots, m$) the following holds: For every target weight $w \in [0, w_{\max}]$, there exists a mask $b \in \{0, 1\}^m$ with $|b| \le k_0$ such that $\hat w := b_1 v_1 + \cdots + b_m v_m$ is $\varepsilon$-close to $w$; indeed $w - \varepsilon \le \hat w \le w$.

Proof. Follows from Lemma 9 with $\gamma = 2/3$ and a simple rescaling argument: First rescale $w' = w/w_{\max}$ and apply Lemma 9 to $w'$ with accuracy $\varepsilon/w_{\max}$. Then the constructed $\hat w'$ satisfies $w' - \varepsilon/w_{\max} \le \hat w' \le w'$, and multiplying by $w_{\max}$ gives the required accuracy. Also note that the density $P_\varepsilon(v) \propto 1/v$ is scale-invariant.
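The greedy construction in the proof of Lemma 9 is short enough to state as code. The sketch below (our own illustration, hypothetical names) samples $m$ weights from $P_\varepsilon$ with $\gamma = 2/3$ by inverse-CDF sampling and greedily builds the mask for a target weight in $[0, 1]$; it fails, with probability at most $\delta$, only when some interval $I_i$ received no sample.

```python
import math
import numpy as np

def golden_decompose(w, eps, delta, gamma=2/3, seed=0):
    """Greedy 'goldary' decomposition of w in [0, 1] from random samples (Lemma 9 sketch).

    Draws v_s with density proportional to 1/v on [eps*gamma**2, gamma]; then, for
    i = 1..k, subtracts one sample from I_i = (gamma**(i+1), gamma**i] whenever the
    residual is at least gamma**i, so that 0 <= w - w_hat <= gamma**k <= eps.
    """
    rng = np.random.default_rng(seed)
    k0 = math.log(gamma * eps, gamma)              # k0 = log_gamma(gamma * eps)
    m = math.ceil(k0 * math.log(k0 / delta))       # number of sampled weights
    lo, hi = eps * gamma**2, gamma
    v = lo * (hi / lo) ** rng.uniform(size=m)      # inverse-CDF sampling of P_eps
    k = math.ceil(math.log(eps, gamma))
    mask = np.zeros(m, dtype=int)
    residual = w
    for i in range(1, k + 1):
        if residual >= gamma**i:
            idx = np.flatnonzero((v > gamma**(i + 1)) & (v <= gamma**i))
            if idx.size == 0:   # happens with probability at most delta overall
                raise RuntimeError("no sample in interval I_i; resample")
            mask[idx[0]] = 1
            residual -= v[idx[0]]
    return float(mask @ v), mask

w_hat, mask = golden_decompose(w=0.618, eps=1e-3, delta=0.01)
print(w_hat, 0.618 - w_hat)   # satisfies w - eps <= w_hat <= w
```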
## 8 Related Work

In the version of this paper that was submitted for review, we conjectured, with supporting experimental evidence, that high precision could also be obtained with uniform sampling when taking advantage of sub-sums (see Appendix A). After the submission deadline, we were made aware that Pensia et al. [2020] concurrently and independently submitted a paper that resolves this conjecture by using a theorem of Lueker [1998]. Pensia et al. [2020] furthermore use a different grouping of the samples in each layer, leading to a refined bound with a logarithmic dependency on the number of weights per target weight, and they provide a matching lower bound (up to constant factors). Their results are heavily anchored in the assumptions that the max norms of the weights and of the inputs are bounded by 1, leaving open the question of what happens without these constraints; this could be dealt with by combining their results with our Lemma 1.

## 9 Conclusion

We have proven that large randomly initialized ReLU networks contain many more subnetworks than previously shown, which gives further weight to the idea that one important task of stochastic gradient descent (and of learning in general) may be to effectively prune connections by driving their weights to 0, revealing the so-called winning tickets. One could even conjecture that the effect of pruning is to reach a vicinity of the global optimum, after which gradient descent can perform local quasi-convex optimization; then the required precision $\varepsilon$ may not need to be very high. Further questions include the impact of convolutional and batch-norm layers, skip-connections and LSTMs on the number of sampled neurons required to maintain a good accuracy.

## Statement of broader impact

This work is theoretical, and in this regard we do not expect any direct societal or ethical consequences. It is our hope, however, that studying the theoretical foundations of neural networks will eventually help the research community make better and safer learning algorithms.

## Acknowledgements

The authors would like to thank Claire Orseau for her great help with time allocation, Tor Lattimore for the punctual help and insightful remarks, András György for initiating the Neural Net Readathon, and Ilja Kuzborskij for sharing helpful comments.

## References

Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242-252, 2019.

R. Ebendt and R. Drechsler. Weighted A* search - unifying view and application. Artificial Intelligence, 173(14):1310-1342, 2009.

J. Frankle and M. Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019.

A. Khan, A. Sohail, U. Zahoora, and A. S. Qureshi. A survey of the recent architectures of deep convolutional neural networks. Artificial Intelligence Review, Apr. 2020.

R. Livni, S. Shalev-Shwartz, and O. Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pages 855-863, 2014.

G. Lueker. Exponentially small bounds on the expected optimum of the partition and subset sum problems. Random Structures and Algorithms, 12:51-62, 1998.

S. Ma, R. Bassily, and M. Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In International Conference on Machine Learning, pages 3325-3334, 2018.

E. Malach, G. Yehudai, S. Shalev-Shwartz, and O. Shamir. Proving the lottery ticket hypothesis: Pruning is all you need. arXiv preprint arXiv:2002.00585. To appear in ICML 2020, 2020.

A. Pensia, S. Rajput, A. Nagle, H. Vishwakarma, and D. Papailiopoulos. Optimal lottery tickets via subset sum: Logarithmic over-parameterization is sufficient. arXiv preprint arXiv:2006.07990. To appear in NeurIPS 2020, 2020.

V. Ramanujan, M. Wortsman, A. Kembhavi, A. Farhadi, and M. Rastegari. What's hidden in a randomly weighted neural network? arXiv preprint arXiv:1911.13299, 2019.

J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. Lillicrap, and D. Silver. Mastering Atari, Go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265, 2019.

D. Ulyanov, A. Vedaldi, and V. Lempitsky. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9446-9454, 2018.
A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio, 2016.

D. Zou and Q. Gu. An improved analysis of training over-parameterized deep neural networks. In Advances in Neural Information Processing Systems, pages 2053-2062, 2019.