Published as a conference paper at ICLR 2022

ON THE EXISTENCE OF UNIVERSAL LOTTERY TICKETS

Rebekka Burkholz, CISPA Helmholtz Center for Information Security, burkholz@cispa.de
Nilanjana Laha, Rajarshi Mukherjee, Harvard T.H. Chan School of Public Health, rmukherj@hsph.harvard.edu
Alkis Gotovos, MIT CSAIL, alkisg@mit.edu

The lottery ticket hypothesis conjectures the existence of sparse subnetworks of large randomly initialized deep neural networks that can be successfully trained in isolation. Recent work has experimentally observed that some of these tickets can be practically reused across a variety of tasks, hinting at some form of universality. We formalize this concept and theoretically prove that not only do such universal tickets exist but they also do not require further training. Our proofs introduce a couple of technical innovations related to pruning for strong lottery tickets, including extensions of subset sum results and a strategy to leverage higher amounts of depth. Our explicit sparse constructions of universal function families might be of independent interest, as they highlight representational benefits induced by univariate convolutional architectures.

1 INTRODUCTION

Deep learning has achieved major breakthroughs in a variety of tasks (LeCun et al., 1990; Schmidhuber, 2015), yet it comes at a considerable computational cost (Sharir et al., 2020), which is exacerbated by the recent trend towards ever wider and deeper neural network architectures. Reducing the size of the networks before training could therefore significantly broaden the applicability of deep learning, lower its environmental impact, and increase access (Dhar, 2020). However, such sparse representations are often difficult to learn, as they may not enjoy the benefits associated with over-parameterization (Belkin et al., 2019).

Frankle & Carbin (2019) provided a proof of concept that sparse neural network architectures are well trainable if initialized appropriately. Their lottery ticket hypothesis states that a randomly initialized network contains a small subnetwork that can compete with the performance of the original network when trained in isolation. Further, Ramanujan et al. (2020) conjectured the existence of strong lottery tickets, which do not need any further training and achieve competitive performance at their initial parameters. These tickets could thus be obtained by pruning a large randomly initialized deep neural network. Unfortunately, existing pruning algorithms that search for (strong) lottery tickets have high computational demands, which are often comparable to or higher than training the original large network. However, Morcos et al. (2019) posited the existence of so-called universal lottery tickets that, once identified, can be effectively reused across a variety of settings.

Contributions. In this paper, we formalize the notion of universality and prove a strong version of the original universal lottery ticket conjecture. Namely, we show that a sufficiently overparameterized, randomly initialized neural network contains a subnetwork that qualifies as a universal lottery ticket without further training of its parameters. Furthermore, it is adapted to a new task only by a linear transformation of its output. This view can explain some empirical observations regarding the required size of universal lottery tickets.
Our proof relies on the explicit construction of basis functions, for which we find sparse neural network representations that benefit from parameter sharing, as realized by convolutional neural networks. The fact that these representations are sparse and universal is the most remarkable insight. To show that they can also be obtained by pruning a larger randomly initialized neural network, we extend existing subset sum results (Lueker, 1998) and develop a proof strategy, which might be of independent interest, as it improves current bounds on pruning for general architectures by making them depth dependent. Accordingly, the width of the large random network can scale as $n_0 \in O\!\left(n_t \frac{L_t}{L_0} \log\!\left(\frac{n_t L_0}{L_t \epsilon}\right)\right)$ to achieve a maximum error $\epsilon$, where $L_t$ denotes the depth and $n_t$ the width of the target network, and $L_0$ the depth of the large network. In support of our existence proofs, we adapt standard parameter initialization techniques to a specific non-zero bias initialization and show in experiments that pruning is feasible in the proposed setting under realistic conditions and for different tasks.

Related work. The lottery ticket hypothesis (Frankle & Carbin, 2019) and its strong version (Ramanujan et al., 2020) have inspired the proposal of a number of pruning algorithms that either prune before (Wang et al., 2020; Lee et al., 2019; Tanaka et al., 2020) or during and after training (Frankle & Carbin, 2019; Savarese et al., 2020). Usually, they try to find lottery tickets in the weak sense, with the exception of the edge-popup algorithm (Ramanujan et al., 2020), which identifies strong lottery tickets, albeit at less extreme sparsity. In general, network compression is a problem that has been studied for a long time and for good reasons; see, e.g., Lin et al. (2020) for a recent literature discussion. Here we focus specifically on lottery tickets whose existence has been proven in the strong sense, i.e., tickets that can be derived from sufficiently large, randomly initialized deep neural networks by pruning alone. To obtain these results, recent work has also provided lower bounds for the required width of the large randomly initialized neural network (Malach et al., 2020; Pensia et al., 2020; Orseau et al., 2020; Fischer & Burkholz, 2021; 2022). In addition, it was shown that multiple candidate tickets exist that are also robust to parameter quantization (Diffenderfer & Kailkhura, 2021). The significant computational cost associated with finding good lottery tickets has motivated the quest for universal tickets that can be transferred to different tasks (Morcos et al., 2019; Chen et al., 2020). We prove here their existence.

1.1 NOTATION

For any d-dimensional input $x = (x_1, \dots, x_d)^T$, let $f(x)$ be a fully connected deep neural network with architecture $n = [n_0, n_1, \dots, n_L]$, i.e., depth $L$ and widths $n_l$ for layers $l = 0, \dots, L$, with ReLU activation function $\phi(x) := \max(x, 0)$. An input vector $x^{(0)}$ is mapped to neurons $x^{(l)}_i$ as

$$x^{(l)} = \phi\big(h^{(l)}\big), \qquad h^{(l)} = W^{(l)} x^{(l-1)} + b^{(l)}, \qquad W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}, \ b^{(l)} \in \mathbb{R}^{n_l}, \tag{1}$$

where $h^{(l)}_i$ is called the pre-activation of neuron $i$, $W^{(l)}$ the weight matrix, and $b^{(l)}$ the bias vector of layer $l$. We also write $\theta$ for the collection of all parameters, $\theta := \big(W^{(l)}, b^{(l)}\big)_{l=1}^{L}$, and indicate a dependence of $f$ on the parameters by $f(x|\theta)$. We also use 1-dimensional convolutional layers, for which the width $n_l$ refers to the number of channels in architecture $n$. For simplicity, we only consider 1-dimensional kernels with stride 1.
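As a concrete reference for the notation above, the following minimal NumPy sketch implements the forward pass of Eq. (1) for a fully connected ReLU network. The toy architecture, the random initialization, and the choice to keep the output layer linear are illustrative assumptions, not part of the paper's constructions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, weights, biases):
    """Forward pass of Eq. (1): h^(l) = W^(l) x^(l-1) + b^(l), x^(l) = phi(h^(l)).

    The output layer is kept linear here, matching the regression setting
    assumed later for the trainable last layer (Section 2.2).
    """
    h = np.asarray(x, dtype=float)
    for l, (W, b) in enumerate(zip(weights, biases)):
        pre = W @ h + b                                   # pre-activation h^(l)
        h = pre if l == len(weights) - 1 else relu(pre)   # x^(l)
    return h

# Toy architecture n = [3, 5, 2]: d = 3 inputs, one hidden layer, 2 outputs.
rng = np.random.default_rng(0)
n = [3, 5, 2]
weights = [rng.uniform(-1, 1, (n[l + 1], n[l])) for l in range(len(n) - 1)]
biases = [rng.uniform(-1, 1, n[l + 1]) for l in range(len(n) - 1)]
print(forward(rng.uniform(0, 1, n[0]), weights, biases))
```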
Larger kernels could simply be pruned to that size and higher strides could be supported as they are defined so that filters overlap. The purpose of such convolutional layers is to represent a univariate function, which is applied to each input component. Typically, we distinguish three different networks: 1) a large (usually untrained) deep neural network f0, which we also call the mother network, 2) a smaller target network f, and 3) a close approximation, our lottery ticket (LT) fϵ, which will correspond to a subnetwork of f0. fϵ is obtained by pruning f0, as indicated by a binary mask B = (bi)i {0,1}|θ0| that specifies for each parameter θi = biθi,0 whether it is set to zero (bi = 0) or inherits the parameter of f0 by θi = θi,0 (for bi = 1). We usually provide approximation results with respect to the l1-norm x := P i |xi| but they hold for any p-norm with p 1. C generally stands for a universal constant that can change its value from equation to equation. Its precise value can be determined based on the proofs. Furthermore, we make use of the notation [n] := {0, ..., n} for n N, and [n]k for a k-dimensional multi-index with range in [n]. Published as a conference paper at ICLR 2022 2 UNIVERSAL LOTTERY TICKETS Before we can prove the existence of strong universal LTs, we have to formalize our notion of what makes a strong LT universal. First of all, a universal LT cannot exist in the same way as a strong LT, which is hidden in a randomly initialized deep neural network and is identified by pruning, i.e., setting a large amount of its parameters to zero while the rest keep their initial value. For a ticket to be universal and thus applicable to a variety of tasks, some of its parameters, if not all, need to be trained. So which parameters should that be? In deep transfer learning, it is common practice to only train the top layers (close to the output) of a large deep neural network. The bottom layers (close to the input) are reused and copied from a network that has been already trained successfully to perform a related task. This approach saves significant computational resources and often leads to improved training results. It is therefore reasonable to transfer it to LTs (Morcos et al., 2019). Independently from LTs, we discuss conditions when this is a promising approach, i.e., when the bottom layers of the deep neural network represent multivariate (basis) functions, whose linear combination can represent a large class of multivariate functions. The independence of the functions is not required and could be replaced by dictionaries, but the independence aids the compression of the bottom layers and thus our objective to find sparse LTs. This view also provides an explanation of the empirically observed phenomenon that universal tickets achieve good performance across a number of tasks only at moderate sparsity levels and become more universal when trained on larger datasets (Morcos et al., 2019; Chen et al., 2020). Including a higher number of basis functions naturally reduces the sparsity of a LT but also makes it adaptable to richer function families. 2.1 HOW UNIVERSAL CAN A LOTTERY TICKET BE? A trivial universal ticket. A trivial solution of our problem would be to encode the identity function by the first layers, which would only require 2d or d neurons per layer or even 1-2 neurons per convolutional layer. This would be an extremely sparse ticket, yet, pointless as the ticket does not reduce the hardness of our learning task. 
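To make the trivial-ticket observation concrete, here is a minimal sketch (hypothetical sizes and names) of how a single layer of 2d ReLU neurons encodes the identity via $x = \phi(x) - \phi(-x)$; it shows why such a ticket is extremely sparse yet leaves the learning task exactly as hard as before.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def identity_layer(x):
    """Encode the identity with 2d ReLU neurons using x = phi(x) - phi(-x).

    This is the 'trivial universal ticket': the layer just passes the input
    through unchanged, so nothing about the downstream task gets easier.
    """
    hidden = np.concatenate([relu(x), relu(-x)])            # 2d hidden neurons
    W_out = np.concatenate([np.eye(x.size), -np.eye(x.size)], axis=1)
    return W_out @ hidden

x = np.array([0.3, -1.2, 0.7])
assert np.allclose(identity_layer(x), x)
```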
Moreover, such a ticket cannot leverage the full depth of the neural network and needs to rely on shallow function representations. How could we improve the learning task? The next idea that comes to mind is to reduce the complexity of the function that has to be learned by the upper layers. For instance, we could restrict it to learning univariate functions. To explore this option, our best chance of success might be to utilize the following theorem. The Kolmogorov-Arnold representation theorem states that every multivariate function can be written as a composition and linear combination of univariate functions. In particular, recent results based on Cantor sets $C$ promise potential for efficient representations. Thm. 2 in (Schmidt-Hieber, 2021) shows the existence of only two univariate functions $g : C \to \mathbb{R}$ and $\psi : [0, 1] \to C$ so that any continuous function $f : [0, 1]^d \to \mathbb{R}$ can be written as $f(x) = g\big(\sum_{i=1}^{d} 3^{1-i}\psi(x_i)\big)$. Furthermore, only $g$ depends on the function $f$, while $\psi$ is shared by all functions $f$ and is hence universal. Could $\sum_{i=1}^{d} 3^{1-i}\psi(x_i)$ be our universal LT? Unfortunately, it seems to be numerically infeasible to compute for higher input dimensions $d > 10$. In addition, the resulting representation of $f$ seems to be sensitive to approximation errors. On top of this, the outer function $g$ is relatively rough, even though it inherits some smoothness properties of the function $f$ (cf. Schmidt-Hieber, 2021), and is difficult to learn. Thus, even restricting ourselves to learning a univariate function $g$ in the last layers does not adequately simplify our learning problem. To make meaningful progress in deriving a notion of universal LTs, we therefore need a stronger simplification.

2.2 DEFINING UNIVERSALITY

To ensure that the knowledge of a LT substantially simplifies our learning task, we only allow the training of the last layer. A consequence of this requirement is that we have to limit the family of functions that we are able to learn, which means we have to make some concessions with regard to universality. We thus define a strongly universal LT always with respect to a family of functions. In the following, we focus on regression and assume that the last layer of a neural network has linear activation functions, which reduces our learning task to linear regression once we have established the LT. Classification problems could be treated in a similar way: replacing the activation functions in the last layer by softmax activation functions would lead to the standard setting, in which case we would have to perform a multinomial logistic regression instead of a linear regression to train the last layer. We omit this case here to improve the clarity of our derivations.

Definition 1 (Strong Universality). Let $\mathcal{F}$ be a family of functions defined on $S \subseteq \mathbb{R}^d$ with $\mathcal{F} \ni g : S \to \mathbb{R}^n$. A function $b : \mathbb{R}^d \to \mathbb{R}^k$ is called strongly universal with respect to $\mathcal{F}$ up to error $\epsilon > 0$ if for every $f \in \mathcal{F}$ there exist a matrix $W \in \mathbb{R}^{n \times k}$ and a vector $c \in \mathbb{R}^n$ so that

$$\sup_{x \in S} \left\| W b(x) + c - f(x) \right\| \le \epsilon. \tag{2}$$

Note that we have defined the universality property for any function, including a neural network. It also applies to general transfer learning problems in which we train only the last layer. To qualify as a (strong) LT, we have to obtain $b$ by pruning a larger neural network.

Definition 2 (Lottery ticket). A neural network $f : S \subseteq \mathbb{R}^d \to \mathbb{R}^k$ is called a lottery ticket (LT) with respect to $f_0 : S \subseteq \mathbb{R}^d \to \mathbb{R}^k$ with parameters $\theta_0$ if there exists a binary mask $B \in \{0, 1\}^{|\theta_0|}$ so that $f_0(x \,|\, B\theta_0) = f(x)$ for all $x \in S$.
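The following sketch illustrates the interplay of Definitions 1 and 2 on a toy example: a binary mask selects a subnetwork of a randomly initialized mother network, its frozen output serves as the feature map $b(x)$, and adaptation to a new task amounts to fitting only $W$ and $c$ by least squares. All sizes, the random mask, and the target function are hypothetical; a random mask will of course not yield a good ticket, the point is only the mechanics.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(1)

# A toy mother network f0 with one hidden layer and parameters theta_0.
d, width, k = 2, 64, 8
W0, b0 = rng.uniform(-2, 2, (width, d)), rng.uniform(-1, 1, width)
V0 = rng.uniform(-2, 2, (k, width))

# A binary mask B selects the ticket f(x) = f0(x | B * theta_0) (Def. 2).
# (Random masking is only for illustration; it is not a good ticket.)
mask_W = rng.random(W0.shape) < 0.2
mask_b = rng.random(b0.shape) < 0.2
mask_V = rng.random(V0.shape) < 0.2

def ticket_features(X):
    """b(x): the frozen, pruned feature map obtained from f0 by masking."""
    H = relu(X @ (W0 * mask_W).T + b0 * mask_b)
    return H @ (V0 * mask_V).T                 # shape (n_samples, k)

# Strong universality (Def. 1): adapt to a new task by fitting only W, c
# of the affine map W b(x) + c, i.e., ordinary least squares.
X = rng.uniform(0, 1, (500, d))
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2    # some hypothetical target f
Phi = np.column_stack([ticket_features(X), np.ones(len(X))])
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print("max error on the sample:", np.max(np.abs(Phi @ coef - y)))
```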
We also write f f0. 3 EXISTENCE OF UNIVERSAL LOTTERY TICKETS Our Def. 1 of strong universality assumes that our target ticket b has a finite amount of k features, which is reasonable in practice but limits the complexity of the induced function family F. However, universal function approximation regarding general continuous functions on [0, 1]d can be achieved by neural networks only if they have arbitrary width (Pinkus, 1999; Cybenko, 1989; Kurt & Hornik, 1991) or arbitrary depth (Telgarsky, 2016; Yarotsky, 2017; Schmidt-Hieber, 2020). Feed forward networks of higher depth are usually more expressive (Yarotsky, 2018) and thus require less parameters than width constrained networks to approximate a continuous function f with modulus of continuity ωf up to maximal error ϵ. Yarotsky (2018) has shown that the minimum number of required parameters is of order O(ωf(O(ϵ d/2)) but has to assume that the depth of the network is almost linear in this number of parameters. Shallow networks in contrast need O(ωf(O(ϵ d)) parameters. Note that the input dimension d can be quite large in machine learning applications like image classification and the number of parameters depends on the Lipschitz constant of a function via ωf, which can be huge in general. In consequence, we need to narrow our focus regarding which function families we can hope to approximate with finite neural network architectures that have sparse representations and limit ourselves to the explicit construction of k basis functions of a family that has the universal function approximation property. We follow a similar strategy as most universal approximation results by explicitly constructing polynomials and Fourier basis functions. However, we propose a sparser, non-standard construction that, in contrast to the literature on feed forward neural networks, leverages convolutional layers to share parameters. Another advantage of our construction is that it is composed of linear multivariate and univariate functions, for which we can improve recent results on lottery ticket pruning. The existence of such sparse representations is remarkable because, in consequence, we would expect that most functions that occur in practice can be approximated by sparse neural network architectures and that these architectures are often universally transferable to other tasks. Polynomials Sufficiently smooth functions, which often occur in practice, can be well approximated by a few monomials of low degree. At least locally this is possible, for instance, by a Taylor approximation. How can we approximate these monomials with a neural networks? In principle, we could improve on the parameter sparse construction by Yarotsky (2017); Schmidt-Hieber (2020) based on tooth functions by using convolutional layers that approximate univariate monomials in each component separately followed by feed forward fully-connected layers that multiply these pairwise. This, however, would require an unrealistically large depth L = O(log(ϵ/k) log(d)) and also a considerable width of at least n = O(kd) in many layers. Alternatively, we propose a constant depth solution as visualized in Fig. 1 (a). It has an ϵ dependent width that is maximally n = O(d p k/ϵ) in just one layer and n = O( p kd/ϵ) in another. It leverages the following observation. A multivariate monomial b(x) = Qd i=1 0.5ri(1 + xi)ri, which is restricted to the domain [0, 1]d, can also be written as b(x) = exp (P i ri log(1 + xi) log(2) P i ri). 
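As a quick sanity check of this identity (a hypothetical numerical example, not part of the paper's construction), the product form of the monomial and its exp-log composition agree on random inputs:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
x = rng.uniform(0, 1, d)          # inputs restricted to [0, 1]^d
r = rng.integers(0, 4, d)         # exponents r_i

# Direct product form: prod_i 0.5^{r_i} (1 + x_i)^{r_i}.
direct = np.prod(0.5 ** r * (1 + x) ** r)

# Kolmogorov-Arnold style composition g(sum_i r_i h(x_i)) with
# h(x) = log(1 + x) - log(2) and g(y) = exp(y).
composed = np.exp(np.sum(r * (np.log(1 + x) - np.log(2))))

assert np.isclose(direct, composed)
print(direct, composed)
```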
It is therefore a composition and linear combination of univariate functions, $b(x) = g\big(\sum_i r_i h(x_i)\big)$, where $g(x) = \exp(x)$ and $h(x) = \log(1 + x) - \log(2)$, as in Kolmogorov-Arnold form. Most importantly, every monomial has the same structure. We can therefore construct a family of monomials efficiently by approximating each univariate function with a convolutional neural network and the linear combinations of the functions as potentially fully connected layers. Fig. 1 (a) visualizes the architecture and the general construction idea. It applies to any composition of functions in the form of the Kolmogorov-Arnold representation theorem, which in theory exists for every continuous multivariate function. This makes our construction more general than it may seem at first. In practice, however, the univariate functions still need to be amenable to efficient approximation by deep neural networks, which is the case for polynomials, as we show in the next sections.

Figure 1: Left: Visualization of the proposed architecture for approximating function families. Potentially fully connected networks are colored blue, while green parts can be encoded as convolutional layers by applying the same function to all components. The last orange linear layer can be trained, while all other parameters are frozen at their initialization. (a) Polynomials: $h(x) = \log((1 + x)/2)$, $g(y) = \exp(y)$. (b) Fourier series: $g(x) = \sin(2\pi x)$. Right: Visualization of the lottery ticket construction for multivariate linear functions. (c) Visualization of an exemplary linear layer. (d) Approximation of the linear layer in (c) by a subset sum block. Blue nodes with label $+$ represent neurons in the intermediary layer that are pruned to neurons of the form $\phi(w x_{1/2})$ with $w > 0$, while green nodes correspond to neurons with $w < 0$, and orange nodes with label $b$ to bias neurons of the form $\phi(b)$. (e) The subset sum block in (d) is distributed across three layers. Two paths of bounded weight are highlighted in orange.

Fourier series. Fourier analysis and discrete time Fourier transformations seek representations with respect to a Fourier basis $f(x) = a_0 + \sum_{n \in [N]^d} a_n \sin\big(2\pi\big(\sum_{i=1}^{d} n_i x_i + c_n\big)\big)$. Fig. 1 (b) shows how to construct the functions $\sin\big(2\pi\big(\sum_{i=1}^{d} n_i x_i + c_n\big)\big)$ by (affine) linear combinations $\sum_{i=1}^{d} n_i x_i + c_n$ in the first layers close to the input, followed by convolutional layers computing the univariate $\sin$, which has quite sparse representations if enough depth is available to exploit its symmetries. Again we use a composition of linear transformations and univariate functions, which share parameters via convolutional layers.

Even though the function families above can be represented efficiently as LTs, we should mention that a big advantage of neural networks is that the actual function family can be learned. This could also lead to a combination of different families when this improves the ability of the lottery ticket to solve specific tasks efficiently. Accordingly, adding dependent functions to the outputs might also provide advantages. Our universal lottery ticket constructions allow for this as well. We simply focus our discussion on families of independent functions, as they usually induce higher sparsity.

3.1 EXISTENCE OF LOTTERY TICKETS LEVERAGING DEPTH

The targets that we propose as strongly universal functions are composed of linear and univariate neural networks, as sketched below.
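The following sketch illustrates this compositional structure for the two families of Fig. 1, using the exact univariate maps $h$, $g$, and $\sin$ in place of their ReLU approximations from the proofs; the array shapes, names, and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def monomial_features(X, R):
    """Fig. 1 (a): b_j(x) = exp(sum_i R[j, i] * h(x_i)) with h(x) = log((1+x)/2).

    The shared univariate maps h and g play the role of convolutional layers
    (the same function applied to every component/channel); here we use the
    exact functions instead of their sparse ReLU approximations.
    """
    H = np.log((1 + X) / 2)            # componentwise h, shape (n_samples, d)
    return np.exp(H @ R.T)             # k monomials, shape (n_samples, k)

def fourier_features(X, N, C):
    """Fig. 1 (b): sin(2*pi*(n . x + c_n)) for frequency vectors n and shifts c."""
    return np.sin(2 * np.pi * (X @ N.T + C))

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, (4, 3))
R = rng.integers(0, 3, (5, 3)).astype(float)     # 5 monomial exponent vectors
print(monomial_features(X, R).shape)             # (4, 5)
N_, C_ = rng.integers(0, 3, (5, 3)), rng.uniform(0, 1, 5)
print(fourier_features(X, N_, C_).shape)         # (4, 5)
```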
Some of our improvements with respect to the literature on LT existence leverage this fact. While Malach et al. (2020); Pensia et al. (2020); Orseau et al. (2020); Fischer & Burkholz (2021) provide a lower bound on the required width of the mother network f0 so that a subnetwork could approximate our target neural network with depth Lt, f0 would need to have exactly twice the depth L0 = 2Lt, which would limit our universality statement to a specific architecture. To address Published as a conference paper at ICLR 2022 this issue, we allow for more flexible mother networks and utilize additional available depth. As Pensia et al. (2020), we approximate each target parameter by a subset sum block, but distribute this block across multiple network layers. This makes the original bound n0 = O(N log (N/ϵ)) on the width of the mother network depth dependent. This approach requires, among others, two additional innovations of practical consequence. The first one is our parameter initialization proposal. Most previous work neglects the biases and assumes that they are zero. Only Fischer & Burkholz (2021) can handle architectures with non-zero biases, which we need to represent our univariate functions of interest. They propose an initialization scheme that extends standard approaches like He (He et al., 2015) or Glorot (Glorot & Bengio, 2010) initialization to non-zero biases and supports the existence of LTs while keeping the large mother network f0 trainable. We modify it to enable our second innovation, i.e., paths through the network that connect network subset sum blocks in different layers to the output. Definition 3 (Parameter initialization). We assume that the parameters of a deep neural network are independently distributed as w(l) ij U ([ σw,l, σw,l]) or w(l) ij N (0, σw,l) for some σw,l > 0 and b(l) i U([ Ql k=1 σw,k/2, Ql k=1 σw,k/2]) or b(l) i N 0, Ql m=1 σw,m/2 , respectively. Also dependencies between the weights as in (Burkholz & Dubatovka, 2019; Balduzzi et al., 2017) are supported by our proofs. The above initialization results in a rescaling of the output by λ = QL k=1 σw,k/2 in comparison with an initialization of the weights by w U ([ 2, 2]) or w N (0, 4) and biases by b U ([ 1, 1]) or w N (0, 1). As pruning deletes a high percentage of parameters, we can expect that the output of the resulting network is also scaled roughly by this scaling factor λ (see Thm. 2). However, the last linear layer that we concatenate to f and assume to be trained can compensate for this. This rescaling means effectively that we can prune weight parameters from the interval [ 2, 2] in contrast to [ 1, 1] as in (Fischer & Burkholz, 2021; 2022). We can therefore find weight parameters wi that are bigger or smaller than 1 with high enough probability so that we can prune for paths of bounded weight 1 Qk i=1 wi C through the network, as stated next. Lemma 1. Define α = 3/4 and let wj U[ 2, 2] denote k independently and identically (iid) uniformly distributed random variables with j [k]. Then wj is contained in an interval with probability at least q = 1/16. If this is fulfilled for each wj, then The same holds true if each wj N(0, 4) is iid normally distributed instead. This defines the setting of our existence proofs as formalized in the next theorem. Theorem 2 (LT existence). Assume that ϵ, δ (0, 1), a target network f: S Rd Rm with depth Lt and architecture nt, and a mother network f0 with depth L0 2 and architecture n0 are given. Let f0 be initialized according to Def. 3. 
Then, with probability at least 1 δ, f0 contains a sparse approximation fϵ f0 so that each output component i is approximated as maxx S |fi(x) λfϵ,i(x)| ϵ with λ = QL l=1(2σ 1 w,l) if nl,0 g( nt) for each 1 < l L0 1. The required width g( nt) needs to be specified to obtain complete results. We start with the construction of a single layer ϕ (Wx + b), which we can also use to represent a linear layer by constructing its positive ϕ (Wx + b) and negative part ϕ ( Wx b) separately. Note that in our polynomial architecture, all components of Wx + b are negative. Theorem 3 (Multivariate LTs (single layer)). Assume the same set-up as in Thm. 2 and a target function f(x) = ϕ (Wx + b) with M := maxi,j max(|wi,j|, |bi|) , N non-zero parameters, and Q = (supx S x 1 + 1). A lottery ticket fϵ exists if nl,0 C Md L0 1 log (M/ min {δ/(2(m + d)(L0 1) + N + 1), ϵ/Q}) Published as a conference paper at ICLR 2022 for each 1 < l L0 1 whenever L0 > 2 and n1,0 CMd log M min{δ/(N+1),ϵ/Q} when L0 = 2. Proof idea. The main idea to utilize subset sum approximations for network pruning has been introduced by Pensia et al. (2020). Each target layer ϕ(h) = ϕ (Wx + b) is represented by two layers in the pruned network fϵ as follows. Using the property of Re LUs that x = ϕ(x) ϕ( x), we can write each pre-activation component as hi = Pm j=1 wijϕ(xj) wijϕ( xj) + sign(bi)ϕ(|bi|). This suggests that intermediate neurons should be pruned into an univariate form ϕ(w(1) lj xj) or ϕ(b(1) l ) if bl > 0. Each actual parameter wij or bi is, however, approximated by solving a subset sum approximation problem involving several intermediate neurons l Iij of the same type so that wij P l Iij w(2) il w(1) lj . Fig. 1 (d) visualizes the approach. Node colors distinguish different neuron types ϕ(w(1) lj xj) or a bias type ϕ(b(1) l ) for bl > 0 within a subset sum block. We split such a block into similarly sized parts and assign one to each available layer in the mother network as shown in Fig. 1 (e). The challenge in this construction is to collect the different results that are computed at different layers and to send them through the full network along a path. The product of weights p = QL0 2 k=1 wk along this path is multiplied with the output. Since we can ensure that this factor is bounded reasonably by Lemma 1, we can compensate for it by solving a subset sum problem for wij/ QL0 2 k=1 wk instead of wij without increasing the associated approximation error. Furthermore, we only need to prune C log(1/δ ) neurons in each layer to find such a path. A rigorous proof is presented in Appendix B.1. As every target neural network is composed of Lt layers of this form, we have derived a bound n0 O(nt Lt/L0 log (nt L0/(Ltϵ))) that scales as 1/L0 log(L0) and highlights the benefits of additional depth in general LT pruning, which might be of independent interest. Yet, the depth of the mother network needs to be considerably larger than the one of the target. For univariate networks, we can again improve on this result in the following way. As explained in the appendix, every univariate neural network can be written as i=0 aiϕ(x si) + a N (3) with respect to 2N + 1 parameters ai and si for 0 i N 1, where the width N determines its potential approximation accuracy of a continuous univariate function f and enters our bound if we want to find a corresponding LT by pruning. Theorem 4 (Univariate LT). Assume the same set-up as in Thm. 2 and that an univariate target network f : S R R in form of Eq. (3) is given. 
Define M := (1 + max (maxi |ai|, maxj |sj|)) and Q = maxx S |x|. f0 contains a LT fϵ f0 if nl,0 C max{M, N} L0 2 log M min {δ/ [L0(N + 2) 1] , ϵ/(2(Q + M))} for L0 3. We require n1,0 C MQ ϵ log M min{δ/(N+1),ϵ/(2(Q+M))} for L0 = 2. Next, we can utilize our improved LT pruning results to prove the existence of universal LTs. 3.2 EXISTENCE OF POLYNOMIAL LOTTERY TICKETS In the following, we discuss the existence of polynomial LTs in more detail. Corresponding results for Fourier basis functions are presented in the appendix. Our objective is to define a parameter sparse approximation of k multivariate monomials b(x) = Qd i=1 0.5ri(1+xi)ri as our target neural network (see Fig. 1 (a)) and then prove that it can be recovered by pruning a larger, but ideally not much larger, randomly initialized mother network. Thus, in contrast to previous results on LT existence, we have to take not only the pruning error but also the approximation error into account. A multivariate linear function can be represented exactly by a Re LU neural network, but we have to approximate the known functions h(x) = log((1 + x)/2) and g(y) = exp(y) in a parameter sparse Published as a conference paper at ICLR 2022 way. Thm. 2 Yarotsky (2017) states that univariate Lipschitz continuous function can be approximated by a depth-6 feed forward neural network with no more than c/(ϵ log(1/ϵ)) parameters. By specializing our derivation to our known functions, in contrast, we obtain a better scaling c/ ϵ with depth L = 2, which leads to our main theorem. Theorem 5. Let B n Qd i=1 0.5ri(1 + xi)ri|ri q, P ri t, xi [0, 1] o be a subset of the poly- nomial basis functions with bounded maximal degree q of cardinality k = |B|. Let F (B) be the family of bounded (affine) linear combinations of these basis functions. For ϵ, δ (0, 1), with probability 1 δ up to error ϵ, f0 contains a strongly universal lottery ticket fϵ f0 with respect to F if f has Llog 2 convolutional layers, followed by Lmulti 2 fully connected layers, and Lexp 2 convolutional layers with channel size or width chosen as in Thms. 4 and 3 with the following set of parameters. Logarithm: δlog = 1 (1 δ)1/3, ϵlog = ϵ 6tkm, Nlog = 1 + q , Mlog = 2, Qlog = 1. Multivariate: δmulti = 1 (1 δ)1/3, ϵmulti = ϵ 6km, Mmulti = q, Qmulti = d. Exponential: δexp = 1 (1 δ)1/3, ϵexp = ϵ 6km, Nexp = 1 + t q , Mexp = 2, Qexp = t log 2. As we show next in experiments, all our innovations provide us with practical insights into conditions under which pruning for universal lottery tickets is feasible. 4 EXPERIMENTS To showcase the practical relevance of our main theorems, we conduct two types of experiments on a machine with Intel(R) Core(TM) i9-10850K CPU @ 3.60GHz processor and GPU NVIDIA Ge Force RTX 3080 Ti. First, we show that the derived bounds on the width of the mother networks are realistic and realizable by explicitly pruning for our proposed universal lottery tickets. Second, we prune mother network architectures from our theorems for strong winning tickets with the edgepopup algorithm (Ramanujan et al., 2020) to transfer our insights to a different domain. In the first type of experiment, we explicitly construct universal lottery tickets by following the approach outlined in our proofs. The construction involves solving multiple subset sum approximation problems, which is generally NP-hard. Instead, we obtain an good approximate solution by taking the sum over the best 5-or-less-element subset out of a random ground set. 
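A minimal sketch of this best-subset search follows (illustrative names and sizes; the ground set is drawn as products of uniform random variables, as in Lemma 8 and Def. 3, and the target value is hypothetical). The search is exhaustive but cheap because subsets are capped at five elements.

```python
import numpy as np
from itertools import combinations

def best_small_subset_sum(target, ground_set, max_size=5):
    """Approximate `target` by the sum over the best subset of at most
    `max_size` elements of `ground_set`.

    Exhaustive search over small subsets; the general subset sum problem
    is NP-hard, but capping the subset size keeps this tractable."""
    best_subset, best_err = (), abs(target)
    for size in range(1, max_size + 1):
        for subset in combinations(range(len(ground_set)), size):
            err = abs(target - ground_set[list(subset)].sum())
            if err < best_err:
                best_subset, best_err = subset, err
    return best_subset, best_err

rng = np.random.default_rng(4)
# Ground set: products of uniform random variables, as in Lemma 8 / Def. 3.
ground = rng.uniform(-2, 2, 25) * rng.uniform(0, 1, 25)
subset, err = best_small_subset_sum(1.37, ground)
print(subset, err)   # with 20-30 elements the error is typically very small
```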
Usually, ground sets consisting of 20 to 30-elements are sufficient to obtain good approximation results. The mother network has uniformly distributed parameters according to Def. 3. Polynomial function family: We construct a polynomial function family consisting of k features of the form (1 + xi)b for any b [0, 4] assuming d-dimensional inputs. If the mother network has a maximum channel size of 500, we obtain a maximum approximation error of 0.001, which is negligible. Concretely, assuming convolutional (or fully connected) layers of sizes [d, 200, 10, 200, 10, k 30, k, 500, 40, 500, 1] (or higher) is sufficient. This further assumes that we use Nlog = 10 and Nexp = 20 intermediary neurons in the approximation of the respective univariate functions. After pruning, 0.022 of the original parameters remain, which is sparser than the tickets that most pruning algorithms can identify in practice. Note that we can always achieve an even better sparsity relative to the mother network by increasing the width of the mother network. Fourier basis: The most restrictive part is the approximation of f(x) = sin(2πx) on [0, 1], which can be easily obtained by pruning a mother network of depth 4 with architecture [1, 250, 21, 250, 1] with Nsin = 21 intermediary neurons in the univariate representation. Pruning such a random network achieves an approximation error of 0.01 and keeps only 0.038 of the original parameters. Note that we can always achieve a better sparsity by increasing the width of the mother network. For instance, pruning a network with architecture [1, 250, 250, 250, 1] would result in 0.0035. In the second type of experiments, we train our mother networks with edge-popup (Ramanujan et al., 2020) on MNIST (Le Cun & Cortes, 2010) for 100 epochs based on SGD with momentum 0.9, weight decay 0.0001, batch size 128, and target sparsity 0.5. Parameters are initialized from a normal distribution according to Def. 3. As the original algorithm is constrained to zero biases, we had to extend it to pruning non-zero biases as well. It finds subnetworks of the randomly initialized mother ticket architecture achieving an average accuracy with 0.95 confidence region of 92.4 0.5 Published as a conference paper at ICLR 2022 % (polynomial architecture) and 97.9 0.1 % (Fourier architecture) over 5 independent runs. Note that pruning the polynomial architecture is not always successful on MNIST but such runs can be detected early in training. This implies that our insights can transfer to a domain different from polynomial or Fourier regression and attests the architectures some degree of universality. Note that even though edge-popup is free to learn tickets that are different from the proposed universal function families, the architecture is still constrained to compositions of univariate and linear multivariate functions. In addition, these experiments highlight that our parameter initialization (with non-zero biases) induces trainable architectures, as the success of edge-popup relies on meaningful gradients. 5 DISCUSSION We have derived a formal notion of strong universal lottery tickets with respect to a family of functions, for which the knowledge of a universal lottery ticket reduces the learning task to linear regression. As we have proven, these universal lottery tickets exist in a strong sense, that is, with high probability they can be identified as subset of a large randomly initialized neural network. 
Thus, once a ticket is identified, no further training is required except for the linear regression. We have shown this with an explicit construction of deep Re LU networks that represent two major function classes, multivariate polynomials and trigonometric functions, which have the universal approximation property. These classes consist of basis functions which can be represented by compositions of univariate functions and multivariate linear functions, which are amenable to sparse approximations by deep Re LU networks. As we highlight, the use of convolutional network layers can significantly improve the sparsity of these representations. Up to our knowledge, we have presented the first proof of the existence of universal lottery tickets and of lottery tickets in general in convolutional neural networks. Most remarkable is the fact that these common function classes with the universal approximation property have sparse neural network representations. Furthermore, they can be found by pruning for (universal) lottery tickets. In consequence, we should expect that many tasks can be solved with the help of sparse neural network architectures and these architectures transfer potentially to different domains. Our theoretical insights provide some practical guidance with respect to the setting in which sparse universal tickets could be found. We have shown how parameters can be initialized effectively and discussed what kind of architectures promote the existence of universal lottery tickets. In comparison with the theoretical literature on strong lottery tickets, we have relaxed the requirements on the width of the large randomly initialized mother network and made its depth variable. Some of our novel proof ideas might therefore be of independent interest for lottery ticket pruning. Interestingly, in contrast to the standard construction of lottery tickets, we derived a proof that does not start from representing a function as neural network but as linear combination with respect to a basis. In future, we might be able to use similar insights to deepen our understanding of what kind of function classes are amenable to training by pruning in search of lottery tickets. As a word of caution, we would like to mention that a strongly universal ticket (whose existence we have proven) is not necessarily also a weakly universal ticket (which is more commonly identified in practice). The reason is that in case that we train a ticket that represents the right basis functions together with the last layer (i.e. the linear regression weights), also the parameters of the ticket might change and move away from the suitable values that they had initially. In several common initialization settings, however, the parameters of the bottom layers (that correspond to the ticket) change only slowly during training, if at all. Therefore, a strong ticket will often be also a weak ticket from a practical point of view. Furthermore, it is important to note that the universal lottery tickets that we have constructed here are not necessarily the tickets that are identified by pruning neural networks in common applications related to imaging or natural language processing. It would be interesting to see in future how much redundancy these tickets encode and whether they could be further compressed into a basis. 
Published as a conference paper at ICLR 2022 ETHICS AND REPRODUCIBILITY STATEMENT We do not have ethical concerns about our work that go beyond the general risks associated with deep learning that are, for instance, related to potentially harmful applications, a lack of fairness due to biases in data, or vulnerabilities of deployed models to adversarial examples. The advantage of universal lottery tickets, whose existence we have proven here, is that we might be able to understand and address some of this limitations for specific architectures that can than be reused in different contexts. Code for the experiments is publicly available in the Github repository Universal LT, which can be accessed with the following url https://github.com/Relational ML/ Universal LT. ACKNOWLEDGMENTS NL and RM were supported by the National Institutes of Health grant P42ES030990. AG was supported by a Swiss National Science Foundation Early Postdoc.Mobility fellowship. Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. In International Conference on Learning Representations, 2018. David Balduzzi, Marcus Frean, Lennox Leary, J. P. Lewis, Kurt Wan-Duo Ma, and Brian Mc Williams. The shattered gradients problem: If resnets are the answer, then what is the question? In International Conference on Machine Learning, 2017. Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machinelearning practice and the classical bias variance trade-off. Proceedings of the National Academy of Sciences, 2019. Rebekka Burkholz and Alina Dubatovka. Initialization of Re LUs for dynamical isometry. In Advances in Neural Information Processing Systems, volume 32, 2019. Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Zhangyang Wang, and Michael Carbin. The lottery ticket hypothesis for pre-trained bert networks. In Advances in Neural Information Processing Systems, volume 33, pp. 15834 15846. 2020. G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303 314, December 1989. I. Daubechies, R. De Vore, S. Foucart, B. Hanin, and G. Petrova. Nonlinear approximation and (deep) relu networks. ar Xiv, art. 1905.02199, 2019. Payal Dhar. The carbon impact of artificial intelligence. Nat Mach Intell, 2(8):423 425, 2020. James Diffenderfer and Bhavya Kailkhura. Multi-prize lottery ticket hypothesis: Finding accurate binary neural networks by pruning a randomly weighted network. In International Conference on Learning Representations, 2021. Jonas Fischer and Rebekka Burkholz. Towards strong pruning for lottery tickets with non-zero biases. ar Xiv, 2021. Jonas Fischer and Rebekka Burkholz. Plant n seek: Can you find the winning ticket? International Conference on Learning Representations, 2022. Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, volume 9, pp. 249 256, May 2010. Published as a conference paper at ICLR 2022 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026 1034, 2015. 
Kurt and Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251 257, 1991. Yann Le Cun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann. lecun.com/exdb/mnist/. Yann Le Cun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural information processing systems, pp. 598 605, 1990. Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. Snip: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, 2019. Tao Lin, Sebastian U. Stich, Luis Barba, Daniil Dmitriev, and Martin Jaggi. Dynamic model pruning with feedback. In International Conference on Learning Representations, 2020. George S. Lueker. Exponentially small bounds on the expected optimum of the partition and subset sum problems. Random Structures & Algorithms, 12(1):51 62, 1998. Eran Malach, Gilad Yehudai, Shai Shalev-Schwartz, and Ohad Shamir. Proving the lottery ticket hypothesis: Pruning is all you need. In International Conference on Machine Learning, 2020. Ari Morcos, Haonan Yu, Michela Paganini, and Yuandong Tian. One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. In Advances in Neural Information Processing Systems, pp. 4932 4942. 2019. Laurent Orseau, Marcus Hutter, and Omar Rivasplata. Logarithmic pruning is all you need. Advances in Neural Information Processing Systems, 33, 2020. Ankit Pensia, Shashank Rajput, Alliot Nagle, Harit Vishwakarma, and Dimitris Papailiopoulos. Optimal lottery tickets via subset sum: Logarithmic over-parameterization is sufficient. In Advances in Neural Information Processing Systems, volume 33, pp. 2599 2610, 2020. Allan Pinkus. Approximation theory of the mlp model in neural networks. Acta Numerica, 8: 143 195, 1999. Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What s hidden in a randomly weighted neural network? In Computer Vision and Pattern Recognition, pp. 11893 11902, 2020. Pedro Savarese, Hugo Silva, and Michael Maire. Winning the lottery with continuous sparsification. In Advances in Neural Information Processing Systems, volume 33, pp. 11380 11390, 2020. J urgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 2015. Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with Re LU activation function. The Annals of Statistics, 48(4):1875 1897, 2020. Johannes Schmidt-Hieber. The kolmogorov arnold representation theorem revisited. Neural Networks, 137:119 126, 2021. Or Sharir, Barak Peleg, and Yoav Shoham. The cost of training NLP models: A concise overview. ar Xiv, 2004.08900, 2020. Hidenori Tanaka, Daniel Kunin, Daniel L. Yamins, and Surya Ganguli. Pruning neural networks without any data by iteratively conserving synaptic flow. In Advances in Neural Information Processing Systems, 2020. Matus Telgarsky. Benefits of depth in neural networks. In Conference on Learning Theory, volume 49, 2016. Published as a conference paper at ICLR 2022 Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. In International Conference on Learning Representations, 2020. Dmitry Yarotsky. Error bounds for approximations with deep relu networks. Neural Networks, 94: 103 114, 2017. Dmitry Yarotsky. Optimal approximation of continuous functions by very deep relu networks. In Conference On Learning Theory, 2018. 
A RESULTS IN SUPPORT OF EXISTENCE THEOREMS In the following sections, we represent the proofs of our theorems and discuss different representation options of lottery tickets. In particular, we highlight the benefits of additional depth for reducing the maximum required width of the mother network f0. We frequently utilize various forms of solutions to the subset sum problem and present novel results with regard to this general problem. A.1 INITIALIZATION We derive our proofs by assuming that the parameters of the randomly initialized mother network f0 are distributed in the following way. In case of uniform parameter initialization, we assume that the weights w and biases b follow the distributions w U[ 2, 2] and biases b U[ 1, 1]. In case of normal parameter initialization, we have w N(0, 4) and biases b N(0, 1). The following lemma explains, why we can follow this approach when the parameters are initialized according to Definition 3 and the output of a neural network is corrected by a scaling factor λ = QL l=1(2σ 1 w,l). Lemma 6 (Output scaling). Let h (θ0, σ) denote a transformation of the parameters θ0 of the deep neural network f0, where each weight is multiplied by a scalar σl, i.e., h(l) ij (w(l) 0,ij) = σlw(l) 0,ij, and each bias is transformed to h(l) i (b(l) 0,i) = Ql m=1 σmb(l) 0,i. Then, we have f (x | h(θ0, σ)) = QL l=1 σlf(x | θ0). For completeness, we present the proof but note that it has been also derived by (Fischer & Burkholz, 2021). Proof. Let the activation function ϕ of a neuron either be a Re LU ϕ(x) = max(x, 0) or the identity ϕ(x) = x. A neuron x(l) i in the original network becomes g x(l) i after parameter transformation. We prove the statement by induction over the depth L of a deep neural network. First, assume that L = 1 so that we have x(1) i = ϕ P j w(1) ij xj + b(1) i After transformation by w(1) ij 7 σ1w(1) ij and b(1) i 7 σ1b(1) i , we receive g x(1) i = ϕ P j w(1) ij σ1xj + σ1b(1) i = σ1x(1) i because of the homogeneity of ϕ( ). This proves our claim for L = 1. Next, our induction hypothesis is that g x(L 1) i = QL 1 m=1 σmx(L 1) i . It follows that g x(L) i = ϕ j w(L) ij σLg x(L 1) j + b(L) i (def. of transformation) (5) j w(L) ij σL m=1 σmx(L 1) j + b(L) i (induction hypothesis) (6) m=1 σmx(L) i (homogeneity of ϕ), (7) which was to be shown. Published as a conference paper at ICLR 2022 A.2 SUBSET SUM PROBLEM Our proofs frequently rely on solving the subset sum problem (Lueker, 1998), which was first utilized in existence proofs of strong lottery tickets without biases (Pensia et al., 2020) and then transferred to proofs of strong tickets with biases (Fischer & Burkholz, 2021; 2022). It applies to general iid random variables that contain a uniform distribution, as defined next. Definition 4. A random variable X contains a uniform distribution if there exist constants α (0, 1], c, h > 0 and a random variable G1 so that X αU[c h, c + h] + (1 α)G1. Such random variables can be used to approximate any z [ m, m] in a bounded interval, as our next corollary states. Corollary 7 (Subset sum approximation ). Let X1, ..., Xn be independent bounded random variables with |Xk| B. Assume that each Xk X contains a uniform distribution with c = 0 (see Definition 4). Let ϵ, δ (0, 1) and m R with m 1 be given. Then for any z [ m, m] there exists a subset S [n] so that with probability at least 1 δ we have |z P k S Xk| ϵ if n C max 1, m mink{αk} log min δ max{1,m/h}, ϵ max{m,h} Proof. For m = 1 the proof is a subset of the proof of Corollary 3.3. by Lueker (1998). 
To extend these results to m > 1, let us fix a δ and ϵ that we will later choose depending on ϵ, δ, and m. We know that approximating any z [ 1, 1] is feasible with probability 1 δ up to error ϵ based on random variables Xk/h as long as we have at least α log B min (δ , ϵ ) random variables. Next we distinguish two different cases. 1) Let us start with h/m 1. To approximate any z [ m, m], we approximate z = zh/m by solving m = m/h 1 separate subset sum approximation problems. Note that the variables Xk/h contain a uniform distribution U[ 1, 1] so that we can approximate z/m [ 1, 1] by |z/m P k S Xk/h| ϵ . It follows that |zh/m P k S Xk| ϵ h. We use this approximation result m = m/h times. Accordingly, we draw in total n = m/h n1 random variables, assign each to one of m = m/h independent batches Ak = (X(k 1)m +1, ..., Xkm ), and use each batch Ak to approximate zh/m up to error ϵ h. This approximation is successful for all batches with probability (1 δ )m and identifies index sets Sk so that k=1 |zh/m X i Sk Xi| m hϵ . Thus, if we choose δ and ϵ so that (1 δ )m (1 δ), we obtain the desired approximation guarantees with n = m n1. This is fulfilled for δ = δ/m δh/m, and ϵ = ϵ/(hm ) ϵ/m. 2) The second case assumes h/m > 1. Let us define z := zm/h with m/h < 1, for which we have | z/m| < |z|/m 1. Thus, also z/h [ 1, 1], as z/h = z/m. It follows that we can approximate z/h with a subset S of n1 random variables Xk/h so that |z/h P k S Xk/h| ϵ . Hence, we have k S Xk| hϵ . The choice ϵ = ϵ/h meets our approximation objective. Next, we explicitly state two special cases that we use frequently in our initialization set-up. Lemma 8 (Approximation by products of uniform random variables). Let X1, ..., Xn be iid random variables with a distribution X1 V2V1, where V1 U[0, 1] and V2 U[ 2, 2] are independently Published as a conference paper at ICLR 2022 0.0 0.1 0.2 0.3 0.4 0.5 Figure 2: Each standard normal distribution Z N(0, 1) contains a uniform distribution U[ h, h] with the shown α, where α = ϕ(h/σ)2h/σ. We use the highlighted value h = 1 in our proofs. distributed. Let ϵ, δ (0, 1) and m > 0 be given. Then for any z [ m, m] there exists a subset S [n] so that with probability at least 1 δ we have |z P i S Xi| ϵ if n Cm log m min (δ, ϵ) Proof. Our main proof strategy is the application of Corollary 7. Since X1 is obviously bounded by B = 2, we only have to show that X1 contains a uniform distribution. For V2 U[ 1, 1] this has been shown already by Pensia et al. (2020) with Corollary 1 and α = log(2)/2, h = 1/2, and c = 0. The same arguments apply to our case with α = log(4)/4, h = 1, and c = 0, which we integrate into the generic constant C. Alternatively, we might want to consider initialization schemes with normally distributed parameters. The next lemma confirms that each normal distribution contains a uniform distribution. Lemma 9. A normally distributed random variable Z N(0, σ2) contains a uniform distribution U[ σ, σ] with α = 0.4. Proof. From Definition 4 it can be showed that a random variable Z contains a uniform distribution U[ h, h] if the probability that Z [ h, h] is at least as big as if would flip a biased coin that turns up heads with probability α > 0. In case of this event, we would further draw an element from [ h, h] from a uniform distribution U[ h, h]. In other words, it suffices to show that there exists α > 0 so that the probability density p Z(x) fulfills p Z(x) α/(2h) for all x [ h, h]. 
Since Z N(0, σ2), its density is given by ϕ(x/σ)/σ, where ϕ(x) denotes the density of the standard normal distribution. Since ϕ(x) is monotonously decreasing in x for x > 0, we need to fulfill α ϕ(h/σ)2h/σ. For h = σ, this is fulfilled for α = 0.4 < 2 ϕ(1). Figure 2 shows the different choices of α for varying h and σ = 1, which confirms that h = σ in general leads to relatively high values of α. We can use this proof to show that products of normal distributions are also suitable for approximation with the help of the subset sum problem. Published as a conference paper at ICLR 2022 Lemma 10 (Approximation by products of normal random variables). Let X1, ..., Xn be iid random variables with a distribution X1 V2V1, where V1 (N | N > 0) with N N(0, σ2 1) and V2 N(0, 4) are independently distributed with σ1 = 1 or σ1 = 2. Let ϵ, δ (0, 1) and m > 0 be given. Then for any z [ m, m] there exists a subset S [n] so that with probability at least 1 δ we have |z P i S Xi| ϵ if n Cm log m min (δ, ϵ) Proof. To apply Corollary 7, we have to show that the product V1V2, contains a uniform distribution. In addition, we have to solve the technical problem that normal distributions are not bounded random variables. However, for a given bound B, we can simply ignore all variables that exceed that bound |Xi| > B and just make sure that enough bounded variables are left with high probability so that we can solve our approximation problem. In fact, let us only use variables with 1/2 V1 1 and |V2| B and assume that we need at least np of the variables Xi that fulfill these criteria with probability (1 δ/2). With the help of a union bound, we can see that we need at least n C(np + log(2/δ)) variables to ensure this. Note that C depends on the bound B and decreases for increasing B, while the width requirement in Corollary 7 increases in B as log(B). One could trade-off both with the right choice of B but this is not relevant conceptually in our proofs. Given the variable V1, the random variable V2V1 is distributed as V2V1 | V1 N 0, 4V 2 1 , which contains U[ 2V1, 2V1] with α = 0.4 according to Lemma 9. Note that U[ 2V1, 2V1] contains U[ 1, 1] for all 1/2 V1 1 with α = 1/(2V1) so that V2V1 | V1 contains U[ 1, 1] with α = 0.4/(2V1) 0.2 Thus, V2V1 also contains U[ 1, 1] with α = 0.2. Applying Corollary 7 to bound np thus concludes our proof. Next, we will prove an even stronger result on approximation by solving a subset sum problem that allows us to mix random variables with different distributions. Corollary 11 (Extended approximation with subset sum). Let X1, ..., Xnx be independent bounded random variables with |Xi| B. Assume that each Xi X contains a uniform distribution U[ 1, 1] with potentially different αi > 0 (see Definition 4). Let ϵ, δ (0, 1) and m N with m 1 be given. Then for any z [ m, m] there exists a subset S [n] so that with probability at least 1 δ we have |z P i S Xi| ϵ if n C hm mini{αi} log Bm min (δ, ϵ) Proof. Following the same arguments as Lueker (1998) in his proof of Corollary 3.3, we only need to ensure that we can obtain k C log(1/ϵ) samples that are drawn from a uniform distribution U[ 1, 1] to approximate a z [ 1, 1] up to precision ϵ with high probability. The probability that an individual sample Xi falls into this category is αi and is at least α = mini{αi} for every sample. With this α , we can follow exactly the same steps as Lueker (1998) and arrive at the stated result. The following lemma is also stated in the main manuscript as Lemma 1. Lemma 12. 
Define α = 3/4 and let wj U[ 2, 2] denote k iid uniformly distributed random variables with j [k]. Then wj is contained in an interval with probability at least q = 1/16. If this is fulfilled for each wj, then The same holds true if each wj N(0, 4) is iid normally distributed instead. Published as a conference paper at ICLR 2022 Proof. We prove the statement by induction. First, let k = 1 and note that 1 w1 1/α with probability (4/3 1)/4 = 1/12 1/16 for uniformly distributed w1. For normally distributed w1, 1 w1 1/α is fulfilled with probability Φ(1/(2α))) Φ(1/2) (1/α 1)/4 = 1/12 1/16. In the induction step from k to k + 1, we assume that with probability q for all j k, which implies that We have to show that also 1 wk+1 Qk i=1 wi 1/α and thus For uniformly distributed wk+1, this is fulfilled with probability (1/α 1)/ 4 Qk i=1 wi (1/α 1)α/4 = 1/16 = q, where we used the induction hypothesis. For normally distributed wk+1, this is also fulfilled with probability (1/α 1)α/4 = 1/16 q. B EXISTENCE OF LINEAR AND UNIVARIATE LOTTERY TICKETS At least for specific function classes, we can significantly relax the width and depth requirements. The family of functions that we are particularly interested in are compositions of univariate functions and multivariate linear functions. This also allows us to leverage the parameter sharing offered by convolutional layers to gain further improvements. We first focus on fully-connected mother networks f0, as we can also utilize them along the channel dimension in convolutional layers as well. B.1 MULTIVARIATE LINEAR LOTTERY TICKETS Our first existence results we present for affine linear multivariate functions f(x) = Wx + b. As every layer of a neural network layer is essentially a composition of this function and an activation function ϕ(f(x)), the main ideas that we present here could be immediately transferred to pruning for general target networks. Statement. Assume that ϵ, δ (0, 1) and a target function f : Rd S Rm with f(x) = Wx+b with M := maxi,j max(|wi,j|, |bi|) , N non-zero parameters, and Q = (supx S x 1 + 1) are given. Let each weight and bias of f0 with depth L 2 and architecture n0 be initialized according to Def. 3. Then, with probability at least 1 δ, f0 contains a sparse approximation fϵ f0 so that each output component i is approximated as maxx S |fi(x) λfϵ,i(x)| ϵ if L 1 log (M/ min {δ/(2(m + d)(L 1) + N + 1), ϵ/Q}) for L > 2 and n1,0 CMd log M min {δ/(N + 1), ϵ/Q} for L = 2 and λ = QL l=1(2σ 1 w,l). Proof. First, we analyze the amount of error ϵl that we can make with each parameter and still meet our approximation objective. Published as a conference paper at ICLR 2022 Error propagation: Let us assume we approximate a target function component fi(x) = w T x + bi with another multivariate affine linear function gϵ(x) = a T x + c with close parameters so that |ai wi| ϵl and |bi c| ϵl. Then, it follows that supx S |fi(x) gϵ,i(x)| supx S Pd j=1 |wj aj||xj|+|bi c| ϵl (supx S x 1 + 1) = Qϵl ϵ, if ϵl ϵ/Q. Lottery ticket construction: Next, we have to construct a lottery ticket that computes such a function gϵ(x) for each output component. Our construction strategy depends on the available depth L. We start with the case L = 2. Case L = 2: As Pensia et al. (2020), we utilize solutions to the subset sum problem. In addition, however, we have to deal with the fact that our parameters are not necessarily bounded by 1 and that we have non-zero biases. 
B EXISTENCE OF LINEAR AND UNIVARIATE LOTTERY TICKETS

At least for specific function classes, we can significantly relax the width and depth requirements. The family of functions that we are particularly interested in consists of compositions of univariate functions and multivariate linear functions. This also allows us to leverage the parameter sharing offered by convolutional layers to gain further improvements. We first focus on fully-connected mother networks f_0, as we can utilize them along the channel dimension in convolutional layers as well.

B.1 MULTIVARIATE LINEAR LOTTERY TICKETS

We present our first existence results for affine linear multivariate functions f(x) = Wx + b. As every layer of a neural network is essentially a composition of such a function with an activation function, ϕ(f(x)), the main ideas that we present here could be immediately transferred to pruning for general target networks.

Statement. Assume that ε, δ ∈ (0, 1) and a target function f : R^d ⊇ S → R^m with f(x) = Wx + b, M := max_{i,j} max(|w_{i,j}|, |b_i|), N non-zero parameters, and Q = sup_{x∈S} ‖x‖_1 + 1 are given. Let each weight and bias of f_0 with depth L ≥ 2 and architecture n_0 be initialized according to Def. 3. Then, with probability at least 1 − δ, f_0 contains a sparse approximation f_ε ⊂ f_0 so that each output component i is approximated as max_{x∈S} |f_i(x) − λ f_{ε,i}(x)| ≤ ε if

n_{l,0} ≥ C M d/(L − 1) log( M / min{ δ/(2(m + d)(L − 1) + N + 1), ε/Q } ) for L > 2, and

n_{1,0} ≥ C M d log( M / min{ δ/(N + 1), ε/Q } ) for L = 2,

where λ = ∏_{l=1}^L 2σ_{w,l}^{−1}.

Proof. First, we analyze the amount of error ε_l that we can make in each parameter and still meet our approximation objective.

Error propagation: Let us assume we approximate a target function component f_i(x) = w^T x + b_i with another multivariate affine linear function g_ε(x) = a^T x + c with close parameters so that |a_j − w_j| ≤ ε_l and |b_i − c| ≤ ε_l. Then it follows that

sup_{x∈S} |f_i(x) − g_ε(x)| ≤ sup_{x∈S} Σ_{j=1}^d |w_j − a_j| |x_j| + |b_i − c| ≤ ε_l ( sup_{x∈S} ‖x‖_1 + 1 ) = Q ε_l ≤ ε,

if ε_l ≤ ε/Q.

Lottery ticket construction: Next, we have to construct a lottery ticket that computes such a function g_ε(x) for each output component. Our construction strategy depends on the available depth L. We start with the case L = 2.

Case L = 2: As Pensia et al. (2020), we utilize solutions to the subset sum problem. In addition, however, we have to deal with the fact that our parameters are not necessarily bounded by 1 and that we have non-zero biases. While Layer 0 and Layer 2 of both f_0 and our ticket f_ε are fixed to the input and output neurons, respectively, we can only prune Layer 1 of f_0, which has width n_{1,0}. How can we represent our target f in this setting? We can write each output component of our target as

f_i(x) = Σ_{j=1}^d w_j x_j + b_i = Σ_{j=1}^d w_j ϕ(x_j) + Σ_{j=1}^d (−w_j) ϕ(−x_j) + b_i ϕ(1),

utilizing the identity x = ϕ(x) − ϕ(−x) for ReLUs. This shows that we can represent f as a 2-layer neural network consisting of 2d + 1 neurons in Layer 1, i.e., ϕ(x_j), ϕ(−x_j), and ϕ(1). Let us thus prune n_p out of the n_{1,0} neurons in f_0 to neurons of the form u_j ϕ(v_j x_j) with v_j > 0 to help represent w_j ϕ(x_j), prune n_p of the neurons in f_0 to neurons of the form u_j ϕ(v_j x_j) with v_j < 0 to help represent (−w_j) ϕ(−x_j), and prune n_p neurons to u_j ϕ(v_j) with v_j > 0 to represent b_i ϕ(1).

How large must n_{1,0} be so that this pruning is feasible with probability at least 1 − δ′ for a given n_p and δ′ > 0? (Note that this was assumed to be possible in Pensia et al. (2020) with probability one, which is not entirely correct.) To use the union bound to answer this question, let us first estimate the probability that we cannot find n_p neurons of each type. If we could not find those, then at least n_{1,0} − n_p + 1 neurons in f_0 must be useless for a specific type. A neuron is useless for a specific type with probability 0.5, since v_j > 0 (or v_j < 0) with probability 0.5. We only have d + 1 possible cases of failed pruning, because pruning can only fail for one of the types w_j ϕ(x_j) or (−w_j) ϕ(−x_j) but not for both, since each neuron in Layer 1 of f_0 falls into one of the two categories. With the help of a union bound, we can thus ensure that we find enough neurons of each type with probability at least 1 − δ′ if 1 − δ′ = 1 − (d + 1) 0.5^{n_{1,0} − n_p + 1} and thus

n_{1,0} ≥ max( n_p + log((d + 1)/δ′)/log(2), (2d + 1) n_p ).   (12)

Now that we have n_p neurons for each of the 2d + 1 neurons that we want to approximate, we can utilize them as ground set in solving a subset sum problem to achieve a small approximation error for each w_j ϕ(x_j). Using Lemma 8 or Lemma 10, we are successful with our approximation up to error ε′ with probability at least 1 − δ′ if n_p ≥ C M log( M / min{δ′, ε′} ).

Putting everything together, we have to ensure that the pruning works (see Eq. (12)) with probability 1 − δ′ and that the approximation of each of the N non-zero parameters is successful with probability 1 − δ′. This is achieved with overall probability at least 1 − δ for δ′ = δ/(N + 1), as we obtain from a union bound. Furthermore, we can allow an approximation error of ε′ = ε/Q for each single parameter, as we derived earlier. This can all be achieved by

n_{1,0} ≥ C M d log( M / min{ δ/(N + 1), ε/Q } ).
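The following sketch (not part of the paper) mimics the L = 2 construction for a single input coordinate: neurons with positive first-layer weight realize multiples of ϕ(x_j), neurons with negative first-layer weight realize multiples of ϕ(−x_j), and a greedy stand-in for the subset sum selection recovers the target weight w_j. The width and the target value are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda z: np.maximum(z, 0.0)

def greedy_subset(values, target):
    """Greedy stand-in for the subset-sum selection guaranteed by Lemma 8/10."""
    total, sel = 0.0, []
    for i in np.argsort(-np.abs(values)):
        if abs(target - (total + values[i])) < abs(target - total):
            total += values[i]
            sel.append(i)
    return np.array(sel, dtype=int)

width = 2000
v = rng.uniform(-1.0, 1.0, size=width)    # Layer-1 weights acting on one input x_j
u = rng.uniform(-1.0, 1.0, size=width)    # Layer-2 weights
w_target = 0.73                           # one target weight w_j of f(x) = Wx + b

# Neurons with v_i > 0 compute u_i*phi(v_i x) = (u_i v_i)*phi(x); pick a subset whose
# products sum to ~w_j.  Neurons with v_i < 0 compute (u_i |v_i|)*phi(-x); pick a subset
# whose products sum to ~-w_j.  Together they realize w_j*(phi(x) - phi(-x)) = w_j*x.
pos, neg = np.flatnonzero(v > 0), np.flatnonzero(v < 0)
S_pos = pos[greedy_subset(u[pos] * v[pos], w_target)]
S_neg = neg[greedy_subset(u[neg] * np.abs(v[neg]), -w_target)]
S = np.concatenate([S_pos, S_neg])

xs = np.linspace(-1.0, 1.0, 201)
ticket = np.array([np.sum(u[S] * relu(v[S] * x)) for x in xs])
print("max |w_j*x - ticket(x)| on [-1,1]:", np.max(np.abs(w_target * xs - ticket)))
```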
Case L > 2: In comparison to the case L = 2, we have the additional advantage that we can also prune Layers 2, ..., L − 1. As in the proof of the case L = 2, we need in total n_tot = (2d + 1) n_p neurons for a successful approximation. But utilizing Corollary 11, we can distribute these n_tot neurons over L − 1 layers. Thus, each layer has to cover only n_tot/(L − 1) neurons, and widths of

n_{l,0} ≥ C M d/(L − 1) log( M / min{ δ′/(N + 1), ε/Q } )

are therefore sufficient for a successful approximation with probability at least 1 − δ′. Our construction that achieves this is visualized in Figure 1 (e). It consists of L − 1 subset sum blocks, i.e., subnetworks with (2d + 1) n_p/(L − 1) neurons in the intermediary layer and m outputs. Each of these intermediary outputs, say y_k (represented by a blue node in the figure), is connected to the corresponding final output x^{(L)}_k along a path of neurons of the form w_{L−1} ϕ(... ϕ(w_j y_k)) with w_j > 0, thus adding ∏_{j=1}^{L−2} w_j · y_k to the final output x^{(L)}_k. Exemplary paths are also highlighted in orange in the figure. Corollary 11 states that each of these contributions is valid in a subset sum approximation, as long as we can bound the path weight in such a way that we can integrate it into the universal constant of each width requirement. This is the case for 1 ≤ ∏_{j=1}^{L−2} w_j ≤ 1/α for an α < 1. In each layer, we need to find at most m + d of such w_j. Lemma 1 provides us with the result that a random neuron can be pruned to represent such a neuron ϕ(w_j x) on that path with probability at least q so that α = 3/4. Note that q and α are independent of L. An application of a union bound then guarantees that

n_{l,0} ≥ C log( (m + d)/δ′ )   (13)

neurons need to be available for pruning to find the desired 2(m + d) neurons with probability at least 1 − δ′, where C = 1/log(1/(1 − q)).

Putting everything together, using a union bound, we can guarantee an overall existence probability of at least 1 − δ by setting δ′ = δ/(2(m + d)(L − 1) + N + 1) and δ″ = δ(N + 1)/(2(m + d)(L − 1) + N + 1), which leads to

n_{l,0} ≥ C M d/(L − 1) log( M / min{ δ/(2(m + d)(L − 1) + N + 1), ε/Q } ).

B.2 UNIVARIATE LOTTERY TICKETS

A similar reasoning as for linear multivariate lottery tickets allows us to prove our results for univariate lottery tickets, the second important building block of our representation of basis functions. However, we generally need 3 layers to implement a subset sum approximation instead of the 2 layers required in the case of linear multivariate lottery tickets. Note that other results on network pruning would require at least two blocks and thus exactly 4 layers in the mother network (Fischer & Burkholz, 2021), or could not represent the architecture because they are restricted to zero biases (Pensia et al., 2020). Our 3-layer block leverages the specific neural network structure that we use to approximate a univariate function, which we introduce next.

It is well known that every deep neural network with ReLU activation functions ϕ(x) = max(x, 0) is a piecewise linear function (Arora et al., 2018). Conversely, each piecewise linear function with a finite number of linear regions can be represented by a ReLU network as long as it has enough neurons (Arora et al., 2018; Daubechies et al., 2019). Deep neural networks are generally overparametrized, yet all parameters of a univariate network f can be mapped to effective parameter vectors a and s that represent f in the following way. The knots s = (s_i)_{i=0,...,N−1} mark the boundaries of the linear regions, and a = (a_i)_{i=0,...,N} indicate the changes in the slopes m_i = (f(s_{i+1}) − f(s_i))/(s_{i+1} − s_i) (with s_N := s_{N−1} + ε) from one linear region to the next. Assuming f(x) = a_N for x < s_0, we can write

f(x) = Σ_{i=0}^{N−1} a_i ϕ(x − s_i) + a_N   (14)

with a_i = m_i − m_{i−1} for 1 ≤ i ≤ N − 1, a_0 = m_0, and a_N = f(s_0). The relevant domain, where f varies non-linearly, is defined by [s_0, s_{N−1}]. Without loss of generality, we assume that f is positive so that f(x) ≥ 0 for all x ∈ [s_0, s_{N−1}]. This can always be achieved by an affine linear transformation of the output that is learned by the last layer. Alternatively, the parameters of the deep neural network could be transformed appropriately.
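As a concrete illustration (not part of the paper), the following sketch maps knots and function values of a piecewise linear function to the coefficients of the ReLU form in Eq. (14) and checks the identity numerically; the example function and the knot count are arbitrary.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def pwl_relu_params(s, f_vals):
    """Map knots s_0 < ... < s_{N-1} and values f(s_i) to the ReLU
    representation of Eq. (14): f(x) = sum_i a_i*phi(x - s_i) + a_N."""
    slopes = np.diff(f_vals) / np.diff(s)          # m_0, ..., m_{N-2}
    a = np.empty(len(s) + 1)
    a[0] = slopes[0]                               # a_0 = m_0
    a[1:len(slopes)] = np.diff(slopes)             # a_i = m_i - m_{i-1}
    a[len(slopes):-1] = 0.0                        # no slope change beyond the last knot here
    a[-1] = f_vals[0]                              # a_N = f(s_0)
    return a

def pwl_eval(x, s, a):
    return sum(a_i * relu(x - s_i) for a_i, s_i in zip(a[:-1], s)) + a[-1]

# Example: piecewise-linear interpolant of h(x) = log((1+x)/2) on equidistant knots.
s = np.linspace(0.0, 1.0, 9)
f_vals = np.log((1.0 + s) / 2.0)
a = pwl_relu_params(s, f_vals)

x = np.linspace(0.0, 1.0, 1001)
err = np.abs(pwl_eval(x, s, a) - np.interp(x, s, f_vals))
print("ReLU form matches the linear interpolant up to", err.max())
```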
The last layer of the randomly initialized mother network f_0 can have either ReLU or linear activation functions, since we assume f(x) ≥ 0 so that ϕ(f(x)) = f(x) holds. For a given f of this form, we want to find a lottery ticket that is an ε-approximation of f. The first question that we have to answer is how much error we are allowed to make in each parameter in order to achieve this.

B.2.1 PRUNING ERROR FOR A UNIVARIATE LOTTERY TICKET

Lemma 13 (Approximation of univariate piecewise linear function). Let ε > 0, f be defined as in Eq. (14), and f_ε be a piecewise linear function whose parameters fulfill |a_i − a_{i,ε}| ≤ ε_a and |s_i − s_{i,ε}| ≤ ε_s with

ε_a = ε / ( 2 [ (N + 1) max{|s_0|, |s_{N−1}|} + Σ_{i=0}^{N−1} |s_i| ] ),  ε_s = ε / ( 2 [ N + Σ_{i=0}^{N−1} |a_i| ] ).   (15)

Then max_{x∈[s_0,s_{N−1}]} |f(x) − f_ε(x)| ≤ ε.

Proof. It follows from the definition of f, the triangle inequality, and the fact that |ϕ(x) − ϕ(y)| ≤ |x − y| that

|f(x) − f_ε(x)| ≤ Σ_{i=0}^{N−1} ( |a_i − a_{i,ε}| |ϕ(x − s_i)| + |a_{i,ε}| |ϕ(x − s_i) − ϕ(x − s_{i,ε})| ) + |a_N − a_{N,ε}|
≤ Σ_{i=0}^{N−1} ( ε_a |x − s_i| + |a_{i,ε}| |s_i − s_{i,ε}| ) + ε_a
≤ ε_a ( N max{|s_0|, |s_{N−1}|} + Σ_{i=0}^{N−1} |s_i| + 1 ) + Σ_{i=0}^{N−1} (|a_i| + ε_a) ε_s
≤ ε_a ( (N + 1) max{|s_0|, |s_{N−1}|} + Σ_{i=0}^{N−1} |s_i| ) + ( Σ_{i=0}^{N−1} |a_i| + N ) ε_s ≤ ε/2 + ε/2 = ε

for all x ∈ [s_0, s_{N−1}], where we assumed |s_{N−1}| ≥ 1 and ε_a ≤ 1 in the second to last inequality and used Eq. (15) in the last.

Letting Q = max{|s_0|, |s_{N−1}|} and M = 1 + max( max_i |a_i|, max_j |s_j| ), and using similar arguments, we can more generally derive that

max_{x∈S} |f(x) − f_ε(x)| ≤ ε_a (1 + NQ + (N − 1)M) + ε_s N M ≤ ε_a N(Q + M) + ε_s N M,

where we used the fact that M ≥ 1 in the last step. It is therefore sufficient to bound the error in each individual parameter by |θ_i − θ_{i,ε}| ≤ ε/(2N(Q + M)).

For convenience, we state Thm. 4 again before we prove it.

Statement (Existence of univariate lottery ticket). Assume that ε, δ ∈ (0, 1) and a univariate target network f : R ⊇ S → R in the form of Eq. (14) consisting of N intermediary neurons are given. Define M := 1 + max( max_i |a_i|, max_j |s_j| ) and Q = max_{x∈S} |x|. Let each weight and bias of f_0 with depth L ≥ 2 and architecture n_0 be initialized according to Def. 3. Then, with probability at least 1 − δ, f_0 contains a sparse approximation f_ε ⊂ f_0 so that max_{x∈[s_0,s_{N−1}]} |f(x) − λ f_ε(x)| ≤ ε with

n_{l,0} ≥ C max{M, N}/(L − 2) log( M / min{ δ/[L(N + 2) − 1], ε/(2(Q + M)) } ) for L ≥ 3, and

n_{1,0} ≥ C M Q/ε log( M / min{ δ/(N + 1), ε/(2(Q + M)) } ) for L = 2,

where the output is scaled by λ = ∏_{l=1}^L 2σ_{w,l}^{−1}.

Proof. Let us start with the case L = 3.

Case L = 3: Note that we can represent our target network f(x) = Σ_{i=0}^{N−1} a_i ϕ(x − s_i) + a_N as a univariate neural network consisting of 4 layers,

f(x) = Σ_{i=0}^{N−1} a_i ϕ( ϕ(x) − ϕ(−x) − s_i ϕ(1) ) + sign(a_N) ϕ( |a_N| ϕ(1) ),

where the first layer has width n_1 = 3 and the second layer has width n_2 = N + 1. (Note that we could also apply another ReLU activation function to the output, ϕ(f(x)), because we assumed that f(x) is positive. We omit it for clarity.) Lemmas 8 or 10 let us solve the associated 2N + 1 subset sum problems if the first layer represents enough neurons of type ϕ(x), ϕ(−x), and ϕ(1) with

n_{1,0} ≥ C M log( M / min{ δ/(3N + 2), ε/(2N(Q + M)) } ),

where each parameter (1, 1, s_i) is approximated up to error ε/(2N(Q + M)). The n_{1,0} neurons serve the creation of n_{2,0} neurons in the next layer, which are approximately of the form ϕ(x − s_i) and a_N. These in turn are used to solve another set of subset sum problems that approximate the parameters a_i. This is feasible for

n_{2,0} ≥ C N log( M / min{ δ/(3N + 2), ε/(2N(Q + M)) } ).
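A short numerical check (not from the paper) that the 4-layer form used in the case L = 3 coincides with the flat representation of Eq. (14); the knots and coefficients below are arbitrary.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def f_flat(x, a, s):
    """Univariate target in the form of Eq. (14)."""
    return sum(ai * relu(x - si) for ai, si in zip(a[:-1], s)) + a[-1]

def f_blocks(x, a, s):
    """Same function written with the layered structure used in the L = 3
    construction: the first layer provides phi(x), phi(-x), and phi(1)."""
    px, nx, one = relu(x), relu(-x), relu(1.0)
    hidden = sum(ai * relu(px - nx - si * one) for ai, si in zip(a[:-1], s))
    return hidden + np.sign(a[-1]) * relu(abs(a[-1]) * one)

rng = np.random.default_rng(3)
s = np.sort(rng.uniform(-1.0, 1.0, size=6))       # knots s_i
a = rng.uniform(-1.0, 1.0, size=7)                # a_0, ..., a_{N-1}, a_N

x = np.linspace(-1.0, 1.0, 501)
print("max deviation:", np.max(np.abs(f_flat(x, a, s) - f_blocks(x, a, s))))
```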
Case L > 3: Following the same steps of reasoning as in our proof of the existence of linear multivariate lottery tickets (see Figure 1), we can distribute the approximation of the N + 1 neurons over L − 2 layers. The approximation of the target f is thus successful with probability 1 − δ if each of the 2N + 1 subset sum problems is solved and we find the following number of connections (of type w_j) that create paths between our intermediary outputs and the final output. The first layer needs one connection, the second also one, and all L − 2 remaining layers need at most N + 2 connections. In total, this results in δ′ = δ/(L(N + 2) − 1) and thus

n_{l,0} ≥ C max{M, N}/(L − 2) log( M / min{ δ/[L(N + 2) − 1], ε/(2(Q + M)) } ).

Case L = 2: The case L = 2 is more complicated, because we do not have enough layers to construct univariate neurons of the form ϕ(x), ϕ(−x), and ϕ(1). Instead, we have to create the required N neurons of the form ϕ(x − s_i) and a_N ϕ(1) directly in Layer 1. We can still achieve this approximation by utilizing Corollary 7 and thus solving N + 1 subset sum problems. Note that we can associate a neuron w_2 ϕ(w_1 x + b) in f_0 that is used to approximate a neuron a_i ϕ(x − s_i) of f with a random variable X. We construct X such that it contains a uniform distribution U[−1, 1]. With n_{1,0} of such X, we can then approximate our target f using Corollary 7.

Thus, let us derive the distribution of X. With probability at least q we have that 1 ≤ w_1 ≤ 2 (q = 1/4 for uniformly initialized w_1 and q = 1/8 for normally initialized w_1). Given this w_1, we can approximate s_i within error ε′ with probability at least ε′ w_1/2 ≥ ε′/2 by pruning neurons to have an appropriate b. Moreover, w_2 contains a uniform distribution U[−1, 1] with α ≥ 1/4, which we can use to approximate a_i/w_1 up to error ε′/w_1 ≥ ε′/2. All together, the random variable X that reflects the described pruning contains a uniform distribution U[−1, 1] with α = ε′/8. We can use n_{1,0} random independent copies of X to approximate each of the N + 1 neurons up to error ε′/2 with probability at least 1 − δ′. According to Corollary 7, this is achieved for

n_{1,0} ≥ (C M/α) log( M / min{ δ′/(N + 1), ε′ } ),

where the choice ε′ = ε/(2Q) and δ′ = δ proves the claim of the theorem.

C EXISTENCE OF UNIVERSAL LOTTERY TICKETS

In the following, we present theorems about the construction of lottery tickets as concatenated basis functions. Both versions, one for polynomial and one for Fourier basis functions, share the same general set-up, so we start with a joint beginning of the proofs.

Proof. To show that f_ε is a strongly universal lottery ticket with respect to the function family F, we have to prove that it fulfils Definition 1. Thus, for every g ∈ F we have to find W ∈ R^{m×k} and c ∈ R^m so that

sup_{x∈[0,1]^d} ‖W f_ε(x) + c − g(x)‖ ≤ ε.   (21)

Since g ∈ F, we can write each component as a linear combination with respect to our basis functions b_i ∈ B so that g_i(x) = Σ_{j=1}^k a_{ij} b_j(x) + d_i with a_{ij}, d_i ∈ [−1, 1]. If we define w_{ij} = a_{ij} λ and c_i = d_i for a positive scaling factor λ > 0, we receive

sup_{x∈[0,1]^d} ‖W f_ε(x) + c − g(x)‖ ≤ sup_{x∈[0,1]^d} Σ_i Σ_{j=1}^k |a_{ij}| |λ f_{ε,j}(x) − b_j(x)|   (22)
≤ k m sup_{x∈[0,1]^d, j∈[k]} |λ f_{ε,j}(x) − b_j(x)|,   (23)

using |a_{ij}| ≤ 1 and the triangle inequality. Note that this inequality holds for any p-norm with p ≥ 1. Thus, f_ε is a strongly universal lottery ticket with respect to F if each component j is bounded as

sup_{x∈[0,1]^d} |λ f_{ε,j}(x) − b_j(x)| ≤ ε/(km).   (24)

To simplify the notation, in the remainder we drop the index j but understand that both f_ε(x) and b(x) have 1-dimensional output.
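The following sketch (not part of the paper) illustrates the reduction in Eqs. (21)–(24): once a ticket reproduces each basis function up to a scale λ and a small per-component error, any g in the family is matched by a purely linear readout W, c. The basis choice, λ, and the error level are made-up values.

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, m = 2, 5, 3
x = rng.uniform(0.0, 1.0, size=(4000, d))

# Hypothetical basis outputs b_j(x): a few monomials of the form used in Thm. 5.
powers = rng.integers(0, 3, size=(k, d))
B = np.stack([np.prod((0.5 * (1.0 + x)) ** p, axis=1) for p in powers], axis=1)

# Pretend the ticket f_eps reproduces each basis function up to error eps_b and scale lam.
lam, eps_b = 2.0, 1e-3
F = B / lam + rng.uniform(-eps_b, eps_b, size=B.shape) / lam

# A new task g in F(B): bounded linear combination of the basis functions.
A = rng.uniform(-1.0, 1.0, size=(m, k))
c = rng.uniform(-1.0, 1.0, size=m)
G = B @ A.T + c

# Reuse the ticket: only the linear readout changes (W = lam*A, offset c), as in Eq. (21).
pred = F @ (lam * A).T + c
err = np.max(np.abs(pred - G))
print(f"max readout error {err:.2e}  vs  k*m*eps_b = {k * m * eps_b:.2e}")
```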
To bound the error, we decompose it into two parts, one that captures the approximation of b(x) by a target deep neural network f_N (and thus a piecewise linear function) and one that handles the error related to pruning a mother network f_0 to obtain the actual lottery ticket f_ε:

sup_{x∈[0,1]^d} |λ f_ε(x) − b(x)| ≤ sup_{x∈[0,1]^d} |λ f_ε(x) − f_N(x)| [pruning] + sup_{x∈[0,1]^d} |f_N(x) − b(x)| [approximation] ≤ ε/(2km) + ε/(2km) = ε/(km).   (25)

The last inequality needs to be shown. The approximation error can be controlled by a sensible choice of the deep neural network architecture, while the pruning error is handled by Theorems 4 and 3. In both cases, we also have to consider how the error propagates through the layers of the deep neural network. The following arguments depend on the specifics of the construction.

C.1 POLYNOMIAL UNIVERSAL LOTTERY TICKETS

We continue with the proof for polynomial basis functions. For convenience, we restate Thm. 5 from the main manuscript.

Statement. Let B ⊆ { ∏_{i=1}^d 0.5^{r_i} (1 + x_i)^{r_i} | r_i ≤ q, x ∈ [0, 1]^d } be a subset of the polynomial basis functions with bounded maximal degree q and cardinality k = |B|. Let F(B) be the family of bounded (affine) linear combinations of these basis functions. For ε, δ ∈ (0, 1), with probability 1 − δ and up to error ε, f_0 contains a strongly universal lottery ticket f_ε ⊂ f_0 with respect to F if f_0 has L_log ≥ 2 convolutional layers, followed by L_multi ≥ 2 fully connected layers, and L_exp ≥ 2 convolutional layers with channel size or width chosen as in Thms. 4 and 3 with the following set of parameters.

Logarithm: δ_log = 1 − (1 − δ)^{1/3}, ε_log = ε/(6tkm), N_log = 1 + ⌈√(kmt/ε)⌉, M_log = 2, Q_log = 1.
Multivariate: δ_multi = 1 − (1 − δ)^{1/3}, ε_multi = ε/(6km), N_multi = kd, M_multi = q, Q_multi = d.
Exponential: δ_exp = 1 − (1 − δ)^{1/3}, ε_exp = ε/(6km), N_exp = 1 + ⌈t √(km/ε)⌉, M_exp = 2, Q_exp = t log 2.

Proof. Our objective is to bound the approximation and pruning error in Eq. (25) sufficiently.

Approximation error: First, we bound the approximation error in Eq. (25) by ε/(2km). Our basis function b(x) as well as our approximating deep neural network f_N(x) can be written as a composition of three functions: b(x) = g( Σ_{i=1}^d a_i h(x_i) ), where h(x) = log((1 + x)/2), 0 ≤ a_i ≤ q, Σ_{i=1}^d a_i ≤ t, and g(y) = exp(y) for −log(2) t ≤ y ≤ 0, g(y) = 0.5^t for y < −log(2) t, and g(y) = 1 for y > 0. Correspondingly, f_N(x) = g_N( Σ_{i=1}^d a_i h_N(x_i) ), where g_N and h_N are piecewise linear approximations of g and h, respectively.

As we have restricted our approximation to a compact domain, all functions are Lipschitz continuous. Here, and in the sequel, we use the fact that on any interval, the slope of the piecewise linear function g_N is bounded by that of g, which follows from the construction of g_N in Eq. (14). Therefore, on any interval, the Lipschitz constant of g is also a Lipschitz constant of g_N. Thus, we can bound the difference between both functions as

sup_{x∈[0,1]^d} |f_N(x) − b(x)| = sup_{x∈[0,1]^d} | g( Σ_{i=1}^d a_i h(x_i) ) − g_N( Σ_{i=1}^d a_i h_N(x_i) ) |
≤ sup_{x∈[0,1]^d} | g( Σ_{i=1}^d a_i h(x_i) ) − g( Σ_{i=1}^d a_i h_N(x_i) ) | + sup_{x∈[0,1]^d} | g( Σ_{i=1}^d a_i h_N(x_i) ) − g_N( Σ_{i=1}^d a_i h_N(x_i) ) |
≤ L_g t sup_{x∈[0,1]} |h(x) − h_N(x)| + sup_{x∈[−t log(2)−ε′, ε′]} |g(x) − g_N(x)|,   (26)

where we used Σ_{i=1}^d a_i ≤ t, and ε′ denotes the pruning error for now. L_g denotes the Lipschitz constant of g(y) and g_N(y) on their domain [−t log 2 − ε′, ε′], which is bounded by 1. Thus, bounding

sup_{x∈[0,1]} |h(x) − h_N(x)| ≤ ε/(4kmt)  and  sup_{x∈[−t log(2)−ε′, ε′]} |g(x) − g_N(x)| ≤ ε/(4km)   (27)

would be sufficient to control our approximation error.
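For intuition, the following check (not part of the paper) verifies the decomposition just used: the polynomial basis function equals the composition of h(x) = log((1 + x)/2), a weighted sum, and the clipped exponential g. Dimensions and exponents are arbitrary choices.

```python
import numpy as np

def basis_direct(x, r):
    """Polynomial basis function prod_i 0.5^{r_i} (1 + x_i)^{r_i} on [0, 1]^d."""
    return np.prod((0.5 * (1.0 + x)) ** r, axis=-1)

def basis_composed(x, r):
    """Same function as the composition g(sum_i r_i h(x_i)) with
    h(x) = log((1+x)/2) and g(y) = exp(y), cut off as in the proof."""
    h = np.log((1.0 + x) / 2.0)
    y = np.sum(r * h, axis=-1)
    t = np.sum(r)                      # total degree; y lies in [-t*log 2, 0]
    return np.exp(np.clip(y, -t * np.log(2.0), 0.0))

rng = np.random.default_rng(5)
d, q = 3, 4
r = rng.integers(0, q + 1, size=d)     # exponents r_i <= q
x = rng.uniform(0.0, 1.0, size=(1000, d))
print("max |direct - composed|:", np.max(np.abs(basis_direct(x, r) - basis_composed(x, r))))
```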
To achieve this, we have to allow for a large enough number of neurons N_exp and N_log in the univariate representation (14) of g_N and h_N.

Approximation of h(x) = log((1 + x)/2): Let us determine N_log first. We partition the domain [0, 1] of h_N(x) into intervals I_i = [s_i, s_{i+1}) ⊂ [0, 1] and I_{N−1} = [s_{N−2}, s_{N−1}] ⊂ [0, 1] of length Δ = 1/(N − 1) based on equidistant knots s_i with s_0 = 0 and s_{N−1} = 1. For every x, let i_x := argmin_{i : x − s_i ≥ 0} (x − s_i) mark the index of the interval I_{i_x} that contains x. Since h(x) = log((1 + x)/2) is concave, we have for all x

|h(x) − h_N(x)| = h(x) − h_N(x) = h(x) − h(s_{i_x}) − (x − s_{i_x}) ( h(s_{i_x+1}) − h(s_{i_x}) )/Δ.   (28)

Furthermore, the maximum approximation error is attained on the interval I_0 = [0, Δ]. Accordingly, we have

sup_{x∈[0,1]} |h(x) − h_N(x)| = sup_{x∈[0,Δ]} ( log(0.5(1 + x)) − log(0.5) − x m ) = sup_{x∈[0,Δ]} ( log(1 + x) − x m )

with m = ( log(0.5(1 + Δ)) − log(0.5) )/Δ = log(1 + Δ)/Δ. To identify the position of maximum error, we set the first derivative of the error objective with respect to x to zero and obtain x* = 1/m − 1. This results in

sup_{x∈[0,1]} |h(x) − h_N(x)| = sup_{x∈[0,Δ]} ( log(1 + x) − x m ) = log(1 + 1/m − 1) − (1/m − 1) m = −log(m) + m − 1 ≤ (1/4) Δ²,

where we used the series expansion of log(1 + x). For (1/4) Δ² ≤ ε/(4kmt) we can therefore guarantee a small enough approximation error. Hence, the choice N_log = 1 + ⌈√(kmt/ε)⌉ is sufficient to achieve the desired bound in Eq. (27).

Approximation of g(x) = exp(x): We can make a similar argument for g(x) = exp(x) on its domain D = [−t log(2) − ε′, ε′]. Since we cut off g and g_N outside of the actual support [−t log(2), 0] to avoid amplifying errors that we know are errors, we can focus on the domain [−t log(2), 0]. Note that exp(·) is also monotonically increasing, but since it is convex, we have

sup_{x∈[−t log(2),0]} |g(x) − g_N(x)| = sup_{x∈[−Δ,0]} ( g_N(x) − g(x) ) = sup_{x∈[−Δ,0]} ( exp(−Δ) + m(x + Δ) − exp(x) )

with m = (1 − exp(−Δ))/Δ ≥ (Δ − 0.5 Δ²)/Δ = 1 − 0.5 Δ. The maximum is attained at x* = log(m), so that

sup_{x∈[−t log(2),0]} |g(x) − g_N(x)| = exp(−Δ) + m (log(m) + Δ) − m ≤ (1/8) Δ²,

where we employed series expansions of exp(x) and log(1 + x). Again, from Δ = log(2) t/(N_exp − 1) it follows that any N_exp ≥ 1 + ⌈t √(km/ε)⌉ is sufficient to achieve our approximation objective (27).

Last but not least, we mention the number of non-zero parameters that we want to prune in case of the multivariate function. As we can represent this function exactly by a deep neural network, we do not inflict any approximation error, and we can set N_multi = dk to the maximum number of its non-zero parameters. With this, we have derived the stated N. We still have to bound the pruning error to conclude our proof.
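The two interpolation error bounds just derived, Δ²/4 for h and Δ²/8 for g, can be checked empirically; the sketch below (not from the paper, arbitrary knot counts and t) compares the piecewise-linear interpolation error on an equidistant grid against these bounds.

```python
import numpy as np

def pwl_error(fn, lo, hi, n_knots, n_eval=200_001):
    """Max deviation between fn and its piecewise-linear interpolant on an
    equidistant grid with n_knots knots (spacing Delta)."""
    s = np.linspace(lo, hi, n_knots)
    x = np.linspace(lo, hi, n_eval)
    return np.max(np.abs(fn(x) - np.interp(x, s, fn(s)))), (hi - lo) / (n_knots - 1)

h = lambda x: np.log((1.0 + x) / 2.0)
g = np.exp
t = 3.0

for n_knots in (5, 9, 17, 33):
    err_h, dh = pwl_error(h, 0.0, 1.0, n_knots)
    err_g, dg = pwl_error(g, -t * np.log(2.0), 0.0, n_knots)
    print(f"h: Delta={dh:.4f} err={err_h:.2e} bound Delta^2/4={dh**2 / 4:.2e} | "
          f"g: Delta={dg:.4f} err={err_g:.2e} bound Delta^2/8={dg**2 / 8:.2e}")
```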
Pruning error: The pruning error in Eq. (25) has to be smaller than or equal to ε/(2km), which we can achieve by limiting the pruning approximation errors ε_exp, ε_multi, and ε_log in Theorems 4 and 3 for representing the targets g_N, h_N, and the multivariate linear map with matrix A. Similarly to Eq. (26), we have to consider the error propagation through a composition of functions. Yet, this time also the multivariate linear map inflicts an error. Recall that our deep neural network f_N(x) can be written as a composition of three functions so that f_N(x) = g_N( Σ_{i=1}^d a_i h_N(x_i) ). Analogously, our lottery ticket is of a similar form: f_ε(x) = g_ε( a_ε( h_ε(x_1), ..., h_ε(x_d) ) ), where a_ε denotes a function from R^d to R.

A slight complication is that we can approximate each lottery ticket only up to a scaling factor λ. The scaling factor of each function is not completely arbitrary, because all factors together have to harmonize with the initialization of the biases, as previously explained. The exact scaling factors in the theorems ensure that, with λ = λ_exp λ_multi λ_log, we can write

sup_{x∈[0,1]^d} |f_N(x) − λ f_ε(x)| = sup_{x∈[0,1]^d} | g_N( Σ_{i=1}^d a_i h_N(x_i) ) − λ_exp g_ε( λ_multi a_ε( λ_log h_ε(x) ) ) |
≤ sup_{x∈[0,1]^d} | g_N( Σ_{i=1}^d a_i h_N(x_i) ) − g_N( λ_multi a_ε[ λ_log h_ε(x) ] ) | + sup_{x∈[0,1]^d} | g_N( λ_multi a_ε[ λ_log h_ε(x) ] ) − λ_exp g_ε( λ_multi a_ε[ λ_log h_ε(x) ] ) |
≤ sup_{x∈[0,1]^d} L_{g_N} | Σ_{i=1}^d a_i h_N(x_i) − λ_multi a_ε[ λ_log h_ε(x) ] | + sup_{y∈[−qd log(2)−ε′, ε′]} | g_N(y) − λ_exp g_ε(y) | [≤ ε_exp]
≤ sup_{x∈[0,1]^d} L_{g_N} | Σ_{i=1}^d a_i ( h_N(x_i) − λ_log h_ε(x_i) ) | + sup_{x∈[0,1]^d} L_{g_N} | Σ_{i=1}^d a_i λ_log h_ε(x_i) − λ_multi a_ε[ λ_log h_ε(x) ] | [≤ ε_multi] + ε_exp
≤ sup_{x∈[0,1]} L_{g_N} t | h_N(x) − λ_log h_ε(x) | [≤ ε_log] + L_{g_N} ε_multi + ε_exp
≤ L_{g_N} t ε_log + L_{g_N} ε_multi + ε_exp,   (29)

where we have used Theorems 4 and 3 to bound the difference between a target network and the corresponding lottery ticket. Theorem 3 enters with M = q (since all parameters are bounded by q) and Q = d, since |h_N(x)| ≤ 1 so that the inputs to the multivariate linear function are bounded accordingly. Since L_{g_N} ≤ 1, we can thus achieve a pruning error of at most ε/(2km) with ε_log = ε/(6kmt), ε_multi = ε/(6km), and ε_exp = ε/(6km). With the stated parameter choice, each of the three parts of the lottery ticket exists independently with probability at least (1 − δ)^{1/3}. All three thus exist with probability at least ((1 − δ)^{1/3})³ = 1 − δ, which concludes our proof.
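To make the error propagation in Eq. (29) concrete, the following sketch (not part of the paper) perturbs each of the three blocks by a bounded amount and checks that the composed error stays below t·ε_log + ε_multi + ε_exp. The scaling factors are set to 1 for simplicity, and the per-block error levels are made up.

```python
import numpy as np

rng = np.random.default_rng(6)
d, t = 3, 4.0
a = rng.uniform(0.0, t / d, size=d)              # coefficients a_i >= 0 with sum_i a_i <= t
h = lambda x: np.log((1.0 + x) / 2.0)
g = lambda y: np.exp(np.clip(y, -t * np.log(2.0), 0.0))   # exp, cut off as in the proof

eps_log, eps_multi, eps_exp = 1e-3, 2e-3, 5e-4   # hypothetical per-block pruning errors

def noisy(f, eps):
    """Stand-in for a pruned block: the exact map plus a bounded perturbation."""
    return lambda z: f(z) + rng.uniform(-eps, eps, size=np.shape(f(z)))

h_eps, g_eps = noisy(h, eps_log), noisy(g, eps_exp)
a_eps = lambda H: H @ a + rng.uniform(-eps_multi, eps_multi, size=H.shape[0])

x = rng.uniform(0.0, 1.0, size=(5000, d))
exact = g(h(x) @ a)
ticket = g_eps(a_eps(h_eps(x)))
bound = t * eps_log + eps_multi + eps_exp        # Eq. (29) with L_{g_N} <= 1
print(f"max composition error {np.max(np.abs(ticket - exact)):.2e}  vs  bound {bound:.2e}")
```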
C.1.1 FOURIER UNIVERSAL LOTTERY TICKETS

Similarly, we can also bound the approximation and pruning error in case of the second family of functions that we have considered. In doing so, we prove the following theorem.

Theorem 14. For ε, δ ∈ (0, 1), with probability 1 − δ and up to error ε, f_0 with the specified properties contains a strongly universal lottery ticket f_ε ⊂ f_0 with respect to the function family F(B), which is defined as bounded affine linear combinations with respect to a finite subset of the Fourier basis functions B ⊆ { sin( 2π Σ_{i=1}^d n_i x_i + c ) | n_i ∈ [M] (∀ i ∈ [d]), c ∈ [0, 1], x_i ∈ [0, 1] } with cardinality k = |B|. f_0 consists of L_multi ≥ 2 fully connected layers followed by L_sin ≥ 2 convolutional layers whose width or number of channels fulfill the requirements of Thms. 4 and 3 with the following set of parameters.

Multivariate: δ_multi = 1 − (1 − δ)^{1/2}, ε_multi = ε/(8kmπ), N_multi = dk, M_multi = 1 + M, Q_multi = d.
Sin: δ_sin = 1 − (1 − δ)^{1/2}, ε_sin = ε/(4km), N_sin = 1 + ⌈π (arcsin(ε_sin/2))^{−1}⌉, M_sin = 1 + (d + 1)M, Q_sin = (d + 1)M.

Proof. As for polynomial lottery tickets, we have to bound the approximation error and the pruning error separately to be smaller than or equal to ε/(2km) to fulfill Eq. (25).

Approximation error: The basis function b(x) as well as our approximating deep neural network f_N(x) can be written as a composition of two functions: b(x) = g( Σ_{i=1}^d n_i x_i + c ), where g(y) = sin(2πy), 0 ≤ n_i ≤ M, and c ∈ [0, 1]. Correspondingly, f_N(x) = g_N( Σ_{i=1}^d n_i x_i + c ), where g_N is a piecewise linear approximation of g. As we have restricted our approximation to a compact domain, all functions are Lipschitz continuous, and we can bound the difference between both functions as

sup_{x∈[0,1]^d} |b(x) − f_N(x)| = sup_{x∈[0,1]^d} | g( Σ_{i=1}^d n_i x_i + c ) − g_N( Σ_{i=1}^d n_i x_i + c ) | ≤ sup_{y∈[0, dM+1]} |g(y) − g_N(y)| = sup_{y∈[0, dM+1]} |sin(2πy) − g_N(y)|,

where [0, dM + 1] is the domain of g and g_N. As in the previous proof, we assume that g_N approximates g on an equidistant grid with linear regions of size Δ = 1/(N_sin − 1). Similarly as before, a piecewise linear approximation inflicts a maximal error of

sup_{y∈[0,dM+1]} |sin(2πy) − g_N(y)| ≤ 2 sup_{z,y∈[0,dM+1], |z−y|≤Δ} |sin(2πz) − sin(2πy)| = 4 sup_{z,y∈[0,dM+1], |z−y|≤Δ} |sin(π(z − y)) cos(π(z + y))| ≤ 4 sup_{x∈[0,Δ]} |sin(πx)| = 4 sin(πΔ),

using the addition theorem sin(α + β) − sin(α − β) = 2 sin(β) cos(α) with α = π(z + y) and β = π(z − y) for the first equality, and assuming Δ ≤ 1/2 to obtain the last equality. It follows that N_sin ≥ 1 + ⌈π (arcsin(ε/(8km)))^{−1}⌉ leads to a sufficiently small approximation error. Note that N_sin is always bigger than 3, so that Δ ≤ 1/2 is fulfilled automatically.

Pruning error: The argument for the pruning error is quite similar, but the error related to the linear transformation a_N(x) = Σ_{i=1}^d n_i x_i + c is non-zero and needs to be controlled as well. With λ = λ_sin λ_multi, we can derive

sup_{x∈[0,1]^d} |f_N(x) − λ f_ε(x)| = sup_{x∈[0,1]^d} | g_N( a_N(x) ) − λ_sin g_ε( λ_multi a_ε(x) ) |
≤ sup_{x∈[0,1]^d} | g_N( a_N(x) ) − g_N( λ_multi a_ε(x) ) | + sup_{x∈[0,1]^d} | g_N( λ_multi a_ε(x) ) − λ_sin g_ε( λ_multi a_ε(x) ) |
≤ L_{g_N} sup_{x∈[0,1]^d} | a_N(x) − λ_multi a_ε(x) | [≤ ε_multi] + sup_{y∈[0, dM+1]} | g_N(y) − λ_sin g_ε(y) | [≤ ε_sin]
≤ L_{g_N} ε_multi + ε_sin ≤ ε/(2km)

for ε_multi = ε/(8kmπ) and ε_sin = ε/(4km), where we have used that L_{g_N} = 2π, i.e., the Lipschitz constant of the function sin(2π·). These estimates hold in reference to Theorems 4 and 3. It might seem strange that these errors do not depend on M. However, in the end, the widths of both mother networks do depend on M, as Theorems 4 and 3 consider the domain and the maximum values of the parameters of the functions g_N and a_N in the error assessment. We conclude that both parts of the neural network, and hence the lottery ticket, exist with probability at least (1 − δ)^{1/2} (1 − δ)^{1/2} = 1 − δ.
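The bound 4 sin(πΔ) and the knot count N_sin of Theorem 14 can be checked numerically; the sketch below (not from the paper) compares the piecewise-linear interpolation error of sin(2πy) on an equidistant grid against this bound for some made-up ε, k, and m.

```python
import numpy as np

def sin_pwl_error(n_knots, n_eval=400_001):
    """Max error of the equidistant PWL interpolant of g(y) = sin(2*pi*y) on [0, 1]."""
    s = np.linspace(0.0, 1.0, n_knots)
    y = np.linspace(0.0, 1.0, n_eval)
    g = lambda z: np.sin(2.0 * np.pi * z)
    return np.max(np.abs(g(y) - np.interp(y, s, g(s)))), 1.0 / (n_knots - 1)

# Knot count suggested by Thm. 14 for some hypothetical epsilon, k, m.
eps, k, m = 0.1, 4, 2
eps_sin = eps / (4 * k * m)
n_sin = 1 + int(np.ceil(np.pi / np.arcsin(eps_sin / 2.0)))

for n_knots in (17, 129, n_sin):
    err, delta = sin_pwl_error(n_knots)
    print(f"knots={n_knots:5d}  Delta={delta:.2e}  err={err:.2e}  "
          f"bound 4*sin(pi*Delta)={4.0 * np.sin(np.pi * delta):.2e}")
print("target eps/(2km) =", eps / (2 * k * m))
```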