Journal of Machine Learning Research 23 (2022) 1-46. Submitted 12/20; Revised 9/21; Published 2/22.

Universal Approximation in Dropout Neural Networks

Oxana A. Manita (o.zaal.manita@tue.nl), Mark A. Peletier (m.a.peletier@tue.nl), Jacobus W. Portegies (j.w.portegies@tue.nl), Jaron Sanders (jaron.sanders@tue.nl), Albert Senen-Cerda (a.senen.cerda@tue.nl)
Department of Mathematics & Computer Science, Eindhoven University of Technology, Eindhoven, The Netherlands

Editor: Ruslan Salakhutdinov

Abstract

We prove two universal approximation theorems for a range of dropout neural networks. These are feed-forward neural networks in which each edge is given a random {0, 1}-valued filter, and which have two modes of operation: in the first, each edge output is multiplied by its random filter, resulting in a random output, while in the second, each edge output is multiplied by the expectation of its filter, leading to a deterministic output. It is common to use the random mode during training and the deterministic mode during testing and prediction. Both theorems are of the following form: given a function to approximate and a threshold ε > 0, there exists a dropout network that is ε-close in probability and in L^q. The first theorem applies to dropout networks in the random mode. It assumes little about the activation function, applies to a wide class of networks, and can even be applied to approximation schemes other than neural networks. Its core is an algebraic property which shows that deterministic networks can be exactly matched in expectation by random networks. The second theorem makes stronger assumptions and gives a stronger result: given a function to approximate, it provides the existence of a network that approximates in both modes simultaneously. Its proof components are a recursive replacement of edges by independent copies, and a special first-layer replacement that couples the resulting larger network to the input.
The functions to be approximated are assumed to be elements of general normed spaces, and the approximations are measured in the corresponding norms. The networks are constructed explicitly. Because of the different methods of proof, the two results give independent insight into the approximation properties of random dropout networks. With this, we establish that dropout neural networks broadly satisfy a universal-approximation property.

Keywords: neural networks, approximation, dropout, random neural network

All authors contributed equally to this work. ©2022 Oxana A. Manita, Mark A. Peletier, Jacobus W. Portegies, Jaron Sanders and Albert Senen-Cerda. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/20-1433.html.

1. Introduction

The class of functions that are generated by neural networks satisfies a universal approximation property: any given function can be approximated to arbitrary precision by such a neural network (Cybenko, 1989; Leshno et al., 1993). This property partially explains why neural networks are so effective as approximators of implicitly given functions. It is commonly observed that the training of such networks improves upon introducing dropout, the random dropping of edges (Goodfellow et al., 2016, Sec. 7.12). Dropout converts a deterministic network into a random one. In this paper we address the question that this observation implicitly raises: does this randomness interfere with the universal approximation property? Or, to formulate it in the affirmative: does the class of dropout neural networks still satisfy universal approximation? To provide a first quantification of this question, let us explain the expectation-variance split, which in the context of dropout goes back to the theoretical analysis by Wager et al. (2013).
We will think of a dropout neural network as a function Ψ : R^d × R^n → R together with a {0, 1}^n-valued random variable f. We call the components of f filter variables. We think of R^d as data space and R^n as parameter space (the space of weights and biases for the neural network). The parameters get multiplied elementwise with the vector of filter variables f. That means that when ζ : R^d → R is a function we want to approximate, we try to approximate it with the stochastic function that maps x to Ψ(x, w ⊙ f). For fixed x, the expectation-variance split reads

E[(Ψ(x, w ⊙ f) − ζ(x))^2] = (E[Ψ(x, w ⊙ f)] − ζ(x))^2 + V[Ψ(x, w ⊙ f)].  (1)

Here ⊙ denotes element-wise multiplication, and throughout the paper P, E, V stand for probability, expectation and variance with respect to the filter variables f, respectively. As both terms on the right-hand side are nonnegative, both have to be small in order to have a good approximation. Is this possible? Foong et al. (2020) showed that deep Rectified Linear Unit (ReLU) neural networks with node-dropout can still approximate functions arbitrarily well, by showing that both the expectation term and the variance term in the expectation-variance split can be made arbitrarily small (see their Theorem 3). In fact, the two terms are arbitrarily small uniformly over x in the unit cube in R^d. With this statement, Foong et al. effectively showed a universal-approximation result. In this paper we show two universal-approximation results for wider classes of dropout neural networks. Where Foong et al. made specific use of the ReLU activation, assume Bernoulli filter variables (thus equidistributed, independent, and with finite variance), and restrict to one hidden layer, we show that the property of universal approximation holds under more general assumptions.
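The split (1) is an exact algebraic identity, so it can be checked directly on samples: with the population variance, the sample versions of the three terms satisfy the same relation. Below is a minimal sketch with a toy one-unit ReLU network and independent Bernoulli filters; all names and numerical values are our own illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def psi(x, w):
    """A toy network Psi(x, w): one ReLU unit; w = (weight, bias)."""
    return np.maximum(w[..., 0] * x + w[..., 1], 0.0)

# Fixed input, target value, parameters, and Bernoulli(1 - p) filters.
x, zeta_x = 0.7, 0.5
w = np.array([1.3, -0.2])
p = 0.4                                  # dropout probability
f = (rng.random((100_000, 2)) > p)       # filter samples, shape (N, 2)

y = psi(x, w * f)                        # random outputs Psi(x, w ⊙ f)

lhs = np.mean((y - zeta_x) ** 2)                 # E[(Psi - zeta)^2]
rhs = (np.mean(y) - zeta_x) ** 2 + np.var(y)     # squared bias + variance
print(lhs, rhs)   # the two sides of (1) agree up to floating-point error
```

With the population variance (`ddof=0`, NumPy's default) the two sample quantities agree exactly, which is the finite-sample analogue of (1).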
Our distinguishing insight is that certain classes of random neural networks satisfy an algebraic property, which enables us to deal with arbitrary depth, generic activation functions, and dependent filter variables that are not necessarily equidistributed and possibly have infinite variance. Notably, our techniques allow for dropout of edges from the input layer. For the theorems we prove below, the structural assumptions on the network reduce to the assumption that the underlying deterministic network has the universal-approximation property; necessary and sufficient conditions for the latter to hold are well known, see for example Leshno et al. (1993). In addition, our main theorems allow for very general classes of filters, including the original node-based dropout (Hinton et al., 2012), the edge-based dropconnect (Wan et al., 2013), and many others, including sets of filters with strong dependence. With Theorem 1 below we show that the class of dropout networks can exactly match a given deterministic network, at least in expectation. With Theorem 17 below we show that we can construct networks that approximate a given function arbitrarily well, both as a random network and as a deterministic filtered network. Finally, we provide control of the error both in probability and in L^q.

1.1 Approximation by Random Neural Networks

In a deterministic context, a universal-approximation theorem for some class C is a density statement: any function ζ can be approximated to arbitrary precision by neural networks in C, where the approximation is measured in some seminormed function space (F, ‖·‖_F). Such approximation statements can be generalized to a stochastic context in multiple ways. We will focus on two of these, approximation in probability and in L^q for q ∈ [1, ∞).
Universal approximation in probability is the property that for every function ζ ∈ F to approximate, and every ε > 0, there exist a neural network Ψ, a weight vector w and a random vector f (all with certain extra properties to make the statement nontrivial), such that

P[‖ζ(·) − Ψ(·, w ⊙ f)‖_F > ε] < ε.

A stronger statement involves approximation in L^q for q ∈ [1, ∞): there exist Ψ, w and f such that

E[‖ζ(·) − Ψ(·, w ⊙ f)‖_F^q]^{1/q} < ε.

In this article we indeed show such universal approximation statements for certain classes of deep dropout neural networks. We prove two main classes of approximation results, corresponding to the two main ways that dropout networks are used in practice. In the first class of results the network is a random object as described above, and the same network is used during training and prediction; we call this random-approximation dropout. (This has also been called Monte Carlo dropout because of the close connection with Monte Carlo estimation of integrals; see Gallicchio and Scardapane, 2020.) In the second class of results the network is trained with random networks of the form Ψ(·, w ⊙ f), but for prediction the deterministic network Ψ(·, w ⊙ E[f]) is used, in which the filters are replaced by their expectations. We call this type of dropout expectation-replacement.

1.2 Main Results 1: Random-approximation

We start with uniform random-approximation, that is, the property that any function ζ in an appropriate set can be approximated by random networks of the form Ψ(·, w ⊙ f). At the highest level the proof strategy is the same as in Foong et al. and consists of the following three steps. Given a function ζ ∈ F to be approximated:

1. Approximate ζ by a neural network Ψ(·, w) using classical deterministic universal approximation results (e.g., Leshno et al. (1993));

2. Use Ψ(·, w) to construct a larger, random dropout neural network Ψ(·, w ⊙ f) that matches Ψ(·, w) in expectation;

3.
Construct an even larger random neural network Ψ̂(·, ŵ ⊙ f̂), consisting of many independent copies of the network Ψ(·, w ⊙ f), to obtain an approximation of ζ that is close in expectation and also has small variance.

We consider Step 1 as given by existing results, and Step 3 is a standard procedure. The novelty of this paper for random-approximation lies in Step 2, which we describe in the rest of this section. Step 2 is based on an algebraic property of common classes of neural networks, which is illustrated by the following simpler version of the central theorem. We write 2^[n] for the collection of subsets of [n] = {1, ..., n}, and for any such subset U, we write 1_U ∈ {0, 1}^n for the vector with entries (1_{j∈U})_{j∈[n]}.

Theorem 1. Let Ψ : R^d × R^n → R be any given function. Let (f_U)_{U∈2^[n]} be a collection of {0, 1}^n-valued random variables indexed by subsets U ∈ 2^[n] such that for every U,

P[f_U = (1, ..., 1)] > 0.

Then there exist constants (a_U)_{U∈2^[n]}, independent of w, such that for all w,

E[ Σ_{U∈2^[n]} a_U Ψ(·, (w ⊙ 1_U) ⊙ f_U) ] = Ψ(·, w).  (2)

This theorem should be read as follows. The right-hand side of (2) plays the role of a deterministic function that we want to approximate. The left-hand side is the expectation of a linear combination of many copies of Ψ(·, w). Each copy has two dropout modifications: the vector 1_U implements a deterministic dropout, and the random filter variables f_U a stochastic one. With a view to generality, the random filter vector f_U is allowed to be a different random vector for each subset U of edges, but note that the distribution of f_U on {0, 1}^n can be completely unrelated to the subset U ⊆ [n]; the subset U only serves as a label. The theorem establishes the following fact: for any collection of random filter variables f_U, for any function Ψ, and for any parameter point w, the function Ψ(·, w) can be matched exactly by the expectation of a sum of filtered versions of the same function.
The important caveat is that one needs to take into account all reduced versions of the function Ψ, i.e., the whole hierarchy of deterministically modified versions indexed by subsets U. Theorem 1 suggests a special role for classes of networks with the property that, given a network Ψ(·, w), we can in some sense define a new (random) network Ψ(·, w ⊙ f) by

Ψ(·, w ⊙ f) := Σ_{U∈2^[n]} a_U Ψ(·, (w ⊙ 1_U) ⊙ f_U).  (3)

To formalize this, we assume that we have chosen a set DDNN (a set of "random neural networks"), which can be any collection of tuples (n, Ψ, f) that satisfy the following properties: (i) n ∈ N is a natural number; (ii) Ψ : R^d × R^n → R is a function such that for every w ∈ R^n, Ψ(·, w) ∈ F; (iii) f is a {0, 1}^n-valued random variable such that

P[f = (1, ..., 1)] > 0.  (4)

Moreover, we assume that DDNN is closed under linear, independent combinations. By this we mean that whenever a, b ∈ R and (m, Φ, f) and (n, Ψ, g) are in DDNN, then also (n′, aΦ + bΨ, h) ∈ DDNN with n′ = n + m, where h is a {0, 1}^{n′}-valued random variable that is the independent concatenation of f and g, and aΦ + bΨ : R^d × R^{n′} → R is given by (x, (w_1, w_2)) ↦ aΦ(x, w_1) + bΨ(x, w_2). This closure assumption implies that a definition of the form (3) is meaningful. The range of possible classes DDNN satisfying these requirements is vast. Typical examples are neural networks with node-dropout, as originally introduced by Hinton et al. (2012), and dropconnect, as introduced by Wan et al. (2013), but many other choices are also possible. Note that the function Ψ may be extremely general, implying that there are no restrictions on, e.g., the form of the activation function or the structure of the network. In fact, nothing in the requirements on DDNN restricts to functions Ψ generated by neural networks; other approximation methodologies may also be used, for instance those based on Fourier or wavelet expansions. See Section 2 for a detailed description of DDNN.
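For concreteness, the constants a_U in Theorem 1 (equivalently, in (3)) can be computed numerically by solving a small linear system. The sketch below assumes i.i.d. Bernoulli(q) filters for every f_U (the theorem itself allows far more general filter laws) and uses a toy Ψ of our own choosing; the matrix entry M[u, v] is the probability that the surviving support of 1_U ⊙ f_U equals V.

```python
import itertools
import numpy as np

n, q = 2, 0.8                     # n parameters, keep-probability q
subsets = list(itertools.chain.from_iterable(
    itertools.combinations(range(n), k) for k in range(n + 1)))

def mask(U):
    m = np.zeros(n)
    m[list(U)] = 1.0
    return m

# M[u, v] = P[support of 1_U ⊙ f_U equals V] for i.i.d. Bernoulli(q) filters.
M = np.zeros((len(subsets), len(subsets)))
for u, U in enumerate(subsets):
    for v, V in enumerate(subsets):
        if set(V) <= set(U):
            M[u, v] = q ** len(V) * (1 - q) ** (len(U) - len(V))

# Equation (2) in expectation requires  sum_U a_U M[U, V] = 1_{V = {1,...,n}}.
e_full = np.zeros(len(subsets))
e_full[-1] = 1.0
a = np.linalg.solve(M.T, e_full)

# Check the exact matching for an arbitrary nonlinear Psi (input x suppressed).
w = np.array([0.9, -1.7])
psi = lambda wv: np.tanh(wv).sum() ** 2       # any function of the parameters
lhs = sum(a[u] * sum(M[u, v] * psi(w * mask(V)) for v, V in enumerate(subsets))
          for u in range(len(subsets)))
print(lhs, psi(w))                            # the two values coincide
```

The system is triangular with respect to subset inclusion and has strictly positive diagonal q^|U|, so it is always solvable; in particular the coefficient of the full set is a_[n] = q^{-n}.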
By combining Theorem 1 with the law of large numbers we then find Corollary 2 below, which expresses the following insight: if the class DDNN is rich enough to approximate any function in F when all filter variables are set to 1, then any function in F can also be approximated by a (random) dropout neural network in DDNN.

Corollary 2. Let ζ ∈ F and ε > 0. Assume there exist an (m, Φ, g) ∈ DDNN and a v ∈ R^m such that ‖Φ(·, v) − ζ‖_F < ε/2. Then there exist an (n, Ψ, f) ∈ DDNN and a w ∈ R^n such that

P[‖Ψ(·, w ⊙ f) − ζ‖_F > ε] < ε  (5)

and

E[‖Ψ(·, w ⊙ f) − ζ‖_F^q]^{1/q} < ε.

Section 3 is devoted to these results, but develops them in more generality. There we also give some examples and calculate the coefficients a_U explicitly for the case of independent Bernoulli filters.

1.3 Main Results 2: Expectation-replacement

In the previous section we considered a dropout neural network to be a random object, both during training and during prediction. By contrast, it is common practice to choose the filter variables to be random during training and, during prediction, to be deterministic and equal to their expectations; see e.g., Goodfellow et al. (2016, Sec. 7.12). We call this expectation-replacement dropout, and Corollary 2 above does not say anything about this situation. In fact, we show in Example 5 that the construction at the heart of Corollary 2 may lead to networks that are bad approximators in this specific sense: given a function ζ, the constructed networks approximate ζ with high probability with random filters, but do not approximate ζ at all when the filters are replaced by their expectations. To address this, we describe in Section 4 the construction of dropout neural networks that approximate not only in probability and in L^q, but also in this expectation-replacement sense.
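Corollary 2 can be illustrated numerically: take a small deterministic network, form the filtered combination of Theorem 1, and average M independent copies of it. The sketch below uses a one-hidden-layer sigmoid network with Bernoulli dropconnect filters on the two input weights only, and the closed-form coefficients a_U = (−(1 − q))^{n−|U|} / q^n; this closed form is our own derivation for i.i.d. Bernoulli(q) filters (the paper computes such coefficients in its Section 3.4), and every concrete number is an illustrative choice.

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# A tiny one-hidden-layer network: Psi(x, w) = sum_j c_j sigmoid(w_j x + b_j).
c = np.array([1.0, -0.8])
b = np.array([0.3, -1.1])
w = np.array([2.0, -1.5])
psi = lambda x, wv: (c * sigmoid(np.outer(x, wv) + b)).sum(axis=1)

n, q = 2, 0.8                          # filters on the two input weights only
subsets = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n)]
a = {tuple(U): (-(1 - q)) ** (n - int(U.sum())) / q ** n for U in subsets}

xs = np.linspace(-3.0, 3.0, 50)
M = 20_000                             # number of independent copies
acc = np.zeros_like(xs)
for _ in range(M):
    for U in subsets:                  # one independent filter draw per subset
        f = (rng.random(n) < q).astype(float)
        acc += a[tuple(U)] * psi(xs, w * U * f)
acc /= M
err = np.max(np.abs(acc - psi(xs, w)))
print(err)      # small for large M, by the law of large numbers
```

The averaged random network approaches the deterministic one uniformly over the plotted grid, which is the mechanism behind the probability bound (5).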
As in the case of Corollary 2, the construction builds on existing density results for deterministic networks: we start with a given deterministic neural network Ψ(·, w) that is close to a given function ζ. Differently from Corollary 2, however, the nonlinearity of Ψ forces us to apply the law of large numbers to each edge (or weight, in this context) separately, instead of simultaneously for the whole network. In the construction in Section 4 we therefore iteratively replace each edge in the deterministic neural network Ψ by a set of parallel edges, with edge-weights taken from the original edge, and with independent filter variables on each of them. In this way we can use the law of large numbers to obtain convergence estimates for each edge separately, and then combine these estimates into a single convergence estimate for the whole network. The convergence estimate for a single edge arises from the following statement (which is a simplified version of Lemma 14). It describes how the error incurred by averaging N independently filtered edges can be controlled in probability. At the same time it also allows for small perturbations of the inputs to this edge. This latter perturbation freedom is needed in order to apply this lemma progressively, moving from edge to edge through the network.

Lemma 3. Consider any continuous function σ : R^m → R^m and let W ∈ R^{m×n}, b ∈ R^m. Let {F_i}_{i∈[N]} be a collection of independent copies of a random matrix F ∈ {0, 1}^{m×n}. Then for every K > 0 there exists a δ > 0 such that

sup_{x∈B(0,K)} sup_{(x_i)∈B(x,δ)^N} ‖ σ( (1/N) Σ_{i=1}^N (W ⊙ F_i) x_i + b ) − σ( (W ⊙ E[F]) x + b ) ‖  (6)

converges to zero in probability as N → ∞.

In Section 4 this construction is described in detail. A separate part of this description is how to connect the resulting dropout neural network to the inputs of the original layer; for this we introduce a single additional layer that implements this connection.
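The single-edge averaging in Lemma 3 is easy to observe numerically. The sketch below fixes all inputs x_i = x (the lemma additionally allows δ-perturbations) and uses a ReLU-type σ with i.i.d. Bernoulli filters; all sizes and distributions are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_in, p = 3, 4, 0.5
W = rng.normal(size=(m, n_in))          # edge weights W
b = rng.normal(size=m)                  # bias b
x = rng.normal(size=n_in)               # here we take all x_i equal to x
sigma = lambda z: np.maximum(z, 0.0)    # a continuous activation

def avg_edge_error(N):
    """Error between the N-averaged filtered edge and its deterministic limit."""
    F = (rng.random((N, m, n_in)) > p).astype(float)   # filters F_1, ..., F_N
    avg = np.mean(W * F, axis=0) @ x + b               # (1/N) sum (W ⊙ F_i) x + b
    det = (W * (1 - p)) @ x + b                        # (W ⊙ E[F]) x + b
    return np.max(np.abs(sigma(avg) - sigma(det)))

for N in (10, 1_000, 100_000):
    print(N, avg_edge_error(N))   # the error shrinks as N grows
```

Since E[W ⊙ F_i] = W ⊙ E[F], the law of large numbers drives the averaged pre-activation to its expectation-filtered counterpart, and continuity of σ carries this over to the outputs, exactly as (6) asserts.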
The main Theorem 17 allows for a wide range of choices of activation functions and filter-variable distributions. For example, the following are simple, concrete corollaries for a ReLU activation function with dropconnect and node-dropout, respectively.

Corollary 4. Take F to be the space of continuous functions R^d → R, and endow it with the seminorm ‖·‖_F equal to the supremum of the function on the unit cube. Then for every ζ ∈ F and every ε > 0 there exist a dropconnect ReLU neural network (Ψ, f) and a parameter vector w such that

P[‖Ψ(·, w ⊙ f) − ζ‖_F > ε] < ε  (7)

and

E[‖Ψ(·, w ⊙ f) − ζ‖_F^q]^{1/q} < ε,

while

‖Ψ(·, w ⊙ E[f]) − ζ‖_F < ε.

Corollary 5. Corollary 4 also holds when using node-dropout instead of dropconnect.

Note that where the construction of the previous section applied to a very wide class of functions Ψ, not only those generated by neural networks, the construction underlying Corollaries 4 and 5 depends in a detailed manner on the fact that Ψ has the structure of a neural network. Finally, we want to emphasize that constructing a network that approximates well when filters are replaced by their expectations (the last bound in Corollaries 4 and 5) is not difficult, due to the universal approximation property for deterministic networks: it is enough to scale the coefficients. However, the network obtained in this way may have no relation to the training process. To overcome that, we need simultaneous approximation as a random network and as a deterministic network after replacing filters by their expectations.

1.4 Random-approximation vs. Expectation-replacement Dropout

Using a random neural network to approximate a given deterministic function is non-trivial; the variance of the network needs to be reduced while matching the expectation, as described in Section 1.2. In expectation-replacement dropout, however, the networks used in prediction are deterministic, and this difficulty is absent.
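The two modes compared in Corollaries 4 and 5 are simply two evaluations of the same parameter vector: once with sampled filters, and once with the filters replaced by their expectations. Below is a minimal sketch with our own one-hidden-layer ReLU network and dropconnect filters, purely for illustration; it is not the network constructed in the proof of Theorem 17.

```python
import numpy as np

rng = np.random.default_rng(2)

def forward(x, W1, b1, W2, b2, F1=None, F2=None):
    """One-hidden-layer ReLU net; optional dropconnect masks F1, F2."""
    if F1 is not None:
        W1, W2 = W1 * F1, W2 * F2
    h = np.maximum(W1 @ x + b1, 0.0)
    return W2 @ h + b2

d, width, p = 2, 16, 0.3
W1, b1 = rng.normal(size=(width, d)), rng.normal(size=width)
W2, b2 = rng.normal(size=(1, width)), rng.normal(size=1)
x = rng.normal(size=d)

# Random mode: sample dropconnect filters on the weight matrices.
F1 = (rng.random(W1.shape) > p).astype(float)
F2 = (rng.random(W2.shape) > p).astype(float)
y_random = forward(x, W1, b1, W2, b2, F1, F2)

# Expectation-replacement mode: every filter replaced by E[f] = 1 - p.
y_expect = forward(x, W1, b1, W2, b2,
                   (1 - p) * np.ones_like(W1), (1 - p) * np.ones_like(W2))
print(y_random, y_expect)   # generally different; Corollary 4 says both
                            # can be made close to the same target zeta
```

Note that the expectation-replacement mode is just the deterministic network with rescaled weight matrices, which is why scaling the coefficients alone suffices for the last bound in Corollary 4, as remarked above.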
In fact, the difference between training and prediction is the reason we include expectation-replacement in this paper. This difference poses an intriguing question. Suppose that the dropout training algorithm yields a parameter point w*. During this training, the algorithm has observed random networks Ψ(·, w* ⊙ f), but it has never observed the deterministic network Ψ(·, w* ⊙ E[f]). Why, then, should the result w* of the dropout training generate a good deterministic network Ψ(·, w* ⊙ E[f])? Example 5 confirms that this method may fail badly. At the same time, expectation-replacement dropout is both very widespread and very successful; see e.g., Goodfellow et al. (2016, Sec. 7.12) or Labach et al. (2019). How can these two observations be reconciled? The results of Section 4, e.g., Corollaries 4 and 5, provide a partial answer to this question. We show that dropout neural networks have sufficient representational capacity to approximate well simultaneously in probability, in L^q, and in the expectation-replacement sense. While this does not explain why any given training algorithm finds parameter points that approximate well in the expectation-replacement sense, it at least shows that the contrast between random training and deterministic prediction is not an obstacle to good performance.

1.5 Related Literature

The universal approximation property for neural networks is fundamental and essentially determines whether the whole training process of the network makes sense: if the algorithmically generated functions don't form a dense set in the function space of interest, the approximation problem is ill-posed. Establishing the universal approximation property for different classes of networks has therefore been an active research area in the last decades.
However, most classes of networks for which a universal approximation property has been established do not include, for example, node-dropout or dropconnect neural networks. The first universal approximation theorem for neural networks with a sigmoidal activation function can be found in the paper by Cybenko (1989), and this canonical work led to much follow-up research. Several years later, Hornik (1991) showed that the universal approximation property relies more on a neural network's architecture than on the specific use of sigmoid activation functions. Moreover, Leshno et al. (1993) established that deep, feed-forward neural networks require a nonpolynomial activation function in order for a universal approximation theorem to hold. Makovoz (1996, 1998) used the so-called probabilistic method to prove the existence of a deterministic function that suitably approximates a target function in deterministic neural networks. At approximately the same time, the study of random networks started. White's (1989) paper on QuickNet is one of the first works in which universal approximation is mentioned (but not proved) side by side with a neural network algorithm in which random hidden nodes are placed. The class of networks with random weights and biases, called Random Vector Functional Link Nets, was introduced in 1994 by Pao et al. (1994). Igelnik and Pao (1995) proved a universal approximation property of these networks, by showing that the span of the node functions is almost surely asymptotically dense in the many-node limit. This result does not apply to dropout schemes, since in the dropout setup the randomness is applied after choosing coefficients. Gelenbe et al. (1999a,b) introduced a class of neural networks that relies on a fixed neural network topology on top of which neurons forward positive and negative signals (spikes) at random points in time, based on their own "potential".
Specifically, they gave a constructive proof of the universal approximation theorem for such stochastic neural networks in steady state. This class of networks also doesn't cover the node-dropout or dropconnect cases, due to the different dynamics assumed; moreover, a dropout neural network is trained randomly, but typically operated deterministically. Rahimi and Recht (2008) investigated uniform approximation of functions with random bases. This is a particular case of a so-called random feature method, in which the parameters are split into two groups: the parameters in one group are taken randomly (and not tuned), and the other part is trained to achieve the best approximation. Therefore these results also don't cover node-dropout or dropconnect, since for the latter algorithms all parameters are trained. Another commonly used class of neural networks is the mixture-of-experts model. The idea is that for different input regions different, typically simpler, networks (learners) are used for prediction. The choice is performed by a gating network; training of the model then consists of training the individual learners together with training the gating network. Nguyen et al. (2016) proved a universal approximation theorem for a mixture-of-experts model, and Nguyen (2017) subsequently generalized their findings to allow for so-called Gaussian gating. De Bie et al. (2019) considered a network architecture that can handle probability measures as input and output. They proved universal approximation in the Wasserstein metric for continuous maps from the space of measures into itself. Our results are more specific, and not covered by theirs, since we study a different (more restricted) approximation scheme. As mentioned in the introduction, Foong et al. (2020) show a universal approximation property for random-approximation dropout networks (see their Theorem 3).
We recover this result as Corollary 2 when identifying h ≡ 0. Another difference is that we allow for activation functions other than ReLU, and consider a stronger sense of approximation. Finally, we refer interested readers to the following surveys to complete their picture of known results. A survey of approximation-theoretic problems was written by Pinkus (1999); a recent survey by Elbrächter et al. (2020) contains a comparison of approximation properties for finite-width and finite-depth networks. Several uniform approximation results for random neural networks can be found in Section 5.4 of Timotheou (2010). Approximation literature for random neural networks was also summarized by Yin (2019).

1.6 Structure of this paper

Definitions of dropout neural networks are given in Section 2. In Section 3 we show universal approximation results for random-approximation dropout, whereas Section 4 is devoted to universal approximation results for expectation-replacement dropout. We discuss our results in Section 5 and conclude in Section 6.

2. Specification of Dropout Neural Networks

In the introduction, we considered general functions Ψ : R^d × R^n → R together with a {0, 1}^n-valued random variable f (Section 1.2), and more specific functions Ψ that arise from a neural network (Section 1.3). In this section, we specify this neural network structure and introduce the corresponding notation.

2.1 Neural Networks

We specify a (feed-forward) neural network as a special type of parametrized function Ψ : R^d × R^n → R from an input vector space R^d to R, parametrized by vectors in R^n. The function is special in that it is assumed to be the composition of multiple functions of a much simpler type:

Ψ(·, w) = Ψ_L(·, w^(L)) ∘ Ψ_{L−1}(·, w^(L−1)) ∘ ⋯ ∘ Ψ_1(·, w^(1)).  (8)

Here, L is an integer, and the parameter w is the concatenation of the individual parameter vectors w^(j) = (W^(j), b^(j)) for j = 1, ..., L, which in turn consist of a d_j × d_{j−1} weight matrix W^(j) and a bias vector b^(j) ∈ R^{d_j}.
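In code, the layered structure (8) is just a composition of affine maps followed by elementwise activations. Below is a minimal sketch with our own layer sizes and activation; the paper imposes no such specific choices.

```python
import numpy as np

rng = np.random.default_rng(5)

def make_layer(d_prev, d_next, sigma):
    """Psi_j(., w^(j)) : R^{d_prev} -> R^{d_next}, x |-> sigma(W x + b)."""
    W = rng.normal(size=(d_next, d_prev))   # weight matrix W^(j)
    b = rng.normal(size=d_next)             # bias vector b^(j)
    return lambda x: sigma(W @ x + b)

dims = [4, 8, 8, 1]                         # d_0 = d, d_1, d_2, d_L = 1
layers = [make_layer(dims[j], dims[j + 1], np.tanh)
          for j in range(len(dims) - 1)]

def psi(x):
    for layer in layers:                    # Psi = Psi_L ∘ ... ∘ Psi_1, as in (8)
        x = layer(x)
    return x

print(psi(rng.normal(size=4)))              # a single (shape-(1,)) output
```

Each `make_layer` call corresponds to one factor Ψ_j in (8), with the layer map (9) applied elementwise through the activation.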
We set d_0 = d and d_L = 1. In (8), every Ψ_j is a function from R^{d_{j−1}} to R^{d_j} given by

Ψ_j(x, w^(j)) := σ_j(W^(j) x + b^(j)),  (9)

where the function σ_j : R → R is called the activation function. The activation function is applied elementwise.

2.2 Dropout Neural Networks

A dropout neural network consists of a neural network Ψ : R^d × R^n → R as above, together with a random vector f ∈ {0, 1}^n. The components of f are called filter variables. The network Ψ, the filter variables f, and a parameter vector w ∈ R^n together form a stochastic function from R^d to R given by x ↦ Ψ(x, w ⊙ f). For the constructions later in the article, we recall what we precisely mean by random variables. Throughout the article, (Ω, F_Ω, P) is an arbitrary, rich enough probability space. Whenever we write random variable, random vector or random matrix, we mean a measurable function defined on this probability space.

2.2.1 Node-dropout

In the original version of dropout, filter variables acted on nodes of the network (Hinton et al., 2012). In this paper the filter variables act on edges instead. The original version, which we call node-dropout, can be represented in the edge-based setting of this paper as follows. The filter variables are partitioned into blocks: filter variables are in the same block if and only if they multiply elements in the same column of the same weight matrix, or they multiply elements of the same bias vector. Filter variables in the same block always attain the same value, i.e., with probability one. Filter variables in different blocks are independent. We will use the convention that filter variables that multiply biases are always on, whereas filter variables that multiply elements of weight matrices are on, i.e., equal to 1, with probability 1 − p for some 0 ≤ p < 1. We can understand node-dropout from the previous description in the notation of (9). For any j = 1, ..., L, choose probabilities p_j, and let f^j_1, ...
, f^j_{d_{j−1}} be independent Bernoulli filters with success probability 1 − p_j. Let D_j ∈ R^{d_{j−1} × d_{j−1}} be the diagonal matrix with f^j_1, ..., f^j_{d_{j−1}} on the diagonal. If we then arrange all nodes per block, node-dropout implements, for j = 1, ..., L,

Ψ_j(·, w^(j) ⊙ f^(j)) = σ_j(W^(j) D_j (·) + b^(j)).  (10)

Note that if p_1 > 0, then with positive probability an input is masked. For this reason we call the case p_1 > 0 node-dropout with dropout on the inputs, and the case p_1 = 0 node-dropout without dropout on the inputs.

2.2.2 Dropconnect

Dropconnect is another dropout regularization scheme (Wan et al., 2013). Although Wan et al. (2013) also allowed for dropout of biases, we will use the term dropconnect for the dropout neural network in which only the matrices W^(j) are filtered. This is achieved by choosing the filter variables multiplying the biases to be equal to 1 with probability one. We can understand dropconnect in the notation of (9). For j = 1, ..., L, let F^(j) ∈ {0, 1}^{d_j × d_{j−1}} be random matrices composed of entries (F^(j))_{ik}, all of which are mutually independent Bernoulli random variables with the same success probability 1 − p. Dropconnect then implements, for j = 1, ..., L,

Ψ_j(·, w^(j) ⊙ f^(j)) = σ_j((W^(j) ⊙ F^(j))(·) + b^(j)).  (11)

3. Universal Approximation for Random-approximation Dropout

The aim of this section is to derive the abstract universal approximation statement for random-approximation dropout already mentioned in the introduction (Corollary 2).

3.1 Key Approximation Result

The following theorem is Theorem 1 from the introduction, extended with a convergence statement.

Theorem 6. Let (F, ‖·‖_F) be a seminormed vector space of functions from R^d to R. Let Ψ : R^d × R^n → R be a given function such that Ψ(·, w) ∈ F for every w ∈ R^n. Let (f_U)_{U∈2^[n]} be a collection of {0, 1}^n-valued random variables indexed by subsets U ∈ 2^[n], such that for every U,

P[f_U = (1, ..., 1)] > 0.
(12)

Then there exist constants (a_U)_{U∈2^[n]}, independent of w, such that

E[ Σ_{U∈2^[n]} a_U Ψ(·, (w ⊙ 1_U) ⊙ f_U) ] = Ψ(·, w).  (13)

In particular, by the weak law of large numbers, if the f_{i,U} are independent copies of f_U, then as M → ∞,

(1/M) Σ_{i=1}^M Σ_{U∈2^[n]} a_U Ψ(·, (w ⊙ 1_U) ⊙ f_{i,U}) → Ψ(·, w)  (14)

in probability in (F, ‖·‖_F) and in L^q for every q ∈ [1, ∞).

A proof of Theorem 6 can be found in Appendix A.1. The main observation in Theorem 6 is the existence of the constants (a_U)_{U∈2^[n]}. This purely algebraic statement follows by induction, as explained by Lemma 19 in Appendix A.1. From Theorem 6 it follows that one can see a dropout neural network as a linear combination of dropout networks with weights (w ⊙ 1_U)_{U∈2^[n]}, such that the linear combination equals the original neural network in expectation, as shown in (13). To get a dropout neural network that is close to the original network in probability, in (14) one takes a large average of independent copies of the dropout network that approximates the original network in expectation. The convergence in probability of (14) then follows from the weak law of large numbers. The convergence in L^q finally follows because the expression is uniformly bounded in ‖·‖_F for any realization of the filter variables f_{i,U}, so that the convergence in probability implies the convergence in L^q by dominated convergence. Note that the number of parameters in this construction increases exponentially, which limits the potential applicability of this theorem. On the other hand, for particular cases the coefficients can be calculated explicitly, as will be shown in Section 3.4.

3.2 Examples

We further illustrate the construction of Theorem 6 with the following examples.

Example 1 (One-hidden-layer dropconnect networks). Consider the function Ψ : R^d × R^n → R given by

Ψ(x, w) = Σ_{j=1}^n c_j σ(w_j x + b_j)  (15)

where the activation function σ : R → R is continuous with σ(x) → 0 as x → −∞ and σ(x) → 1 as x → ∞.
In (15) we have biases b_j ∈ R and weights made up of the constants c_j ∈ R and the 1×d matrices w_j. The well-known result by Cybenko (1989) implies that the class of all such functions is dense in C([0, 1]^d) endowed with the supremum norm. An example of an approximation by functions in (15) is depicted in Figure 1.

Figure 1: A Cybenko neural network as in (15), trained to approximate a function ζ; here, n = 2, d = 1, and ζ(x) = sin(x + 3) exp(−|x − 3|). The figure shows the network diagram together with a plot of the truth ζ(x) and the Cybenko neural network Ψ(x, v).

We suppose that the distribution of the filters follows the case of dropconnect, as described in Section 2.2.2. Theorem 6 directly yields that by choosing appropriate weights c_{j,U} and weight matrices w_{j,U}, the one-hidden-layer dropconnect network given by

(1/M) Σ_{i=1}^M Σ_{U∈2^n} Σ_{j=1}^n a_U c_{j,U} g^i_{j,U} σ((w_{j,U} ∘ f^i_{j,U}) x + b_j) (16)

can be chosen to be close to Ψ in L^q for large M. Here the g^i_{j,U} are independent Bernoulli random variables, and the f^i_{j,U} are random vectors with independently Bernoulli-distributed components, all with success probability 1 − p. This result is illustrated by Figures 2 and 3: Figure 2 shows a single realization of the neural network in (13) with dropconnect, while Figure 3 depicts the average of M independent copies of the network from the previous construction, as in (16).

Figure 2: A single realization of the random neural network in (13) using dropconnect; the plot compares the truth ζ(x), the linear combination Σ_{U∈2^n} a_U Ψ(x, (w ∘ 1_U) ∘ f^U), and the four constituent networks Ψ(x, (w ∘ 1_U) ∘ f^U). Based on the trained Cybenko neural network in Figure 1; for simplicity, we apply dropout only to the weights w_j of (15), which we denote by w and which correspond to the edges joining the nodes connected to the input x.
With n = 2 and d = 1, there are four different random neural networks, each with its own independent filters. All of them use Ψ(·, w) of Figure 1 as base. In this realization, some of the edges are filtered, which is depicted with red crosses. The explicit coefficients a_U used for dropconnect are computed in Section 3.4.

Figure 3: An approximation of ζ with a large neural network using dropconnect, based on the Cybenko neural network in (15) depicted in Figure 1. The plot compares the truth ζ(x), the average (1/M) Σ_{i=1}^M Σ_{U∈2^n} (a_U/M) Ψ(x, (w ∘ 1_U) ∘ f^{U,i}) over M i.i.d. copies, and several individual realizations. By adding many independent copies of the network from Figure 2, we leverage the law of large numbers as in (16). Different independent copies of the network may have different realizations of the filters, which is here depicted by the red crosses on the edges.

In a similar way, we can also consider more general dropconnect networks.

Example 2 (Dropconnect networks) Consider a deep neural network Ψ : R^d × R^n → R as introduced in (8) with dropconnect filters as described in Section 2.2.2. Here, the filter variables, i.e., the components of f^{i,U} in (8), are i.i.d. Bernoulli distributed with success probability 1 − p if they multiply elements of the weight matrices W^{(j)}, and are equal to 1 if they multiply biases b^{(j)}. Let w ∈ R^n. We choose for F the vector space of continuous functions on R^d, endowed with the supremum seminorm over the closed unit cube. Then the dropconnect random network in (14) is for large M close to the network Ψ(·, w) in L^q.

Example 3 (Node-dropout networks) Consider again the deep neural network in (8) with node-dropout as described in Section 2.2.1. The random neural network in (14) is then again a node-dropout neural network. In this way, we recover Foong et al.
(2020)'s Theorem 3 (with h ≡ 0), which for ReLU activation functions and a target function ζ bounds

sup_{x∈[0,1]^d} Var(ζ(x) − Ψ(x, w ∘ f)). (17)

When F is the space of continuous functions with supremum norm, (17) can be bounded by a constant times the square of the L^2-norm. Hence, Theorem 6 approximates in a stronger sense, namely, in L^q for any q ∈ [1, ∞). Moreover, Theorem 6 also allows for activation functions other than ReLU.

Example 4 (Dropout networks with dropout on input) In contrast, if there is also dropout on the input, then the neural network in (14) is not again a dropout neural network with dropout on the inputs. Results by Foong et al. (2020) imply that in general neural networks with dropout on the input cannot satisfy a universal approximation property. We remark that this kind of stochastic network is not a dropout neural network as defined in Section 2.2, as the following example shows. Suppose that Ψ_1, Ψ_2 : R^d × R^n → R are two different dropout neural networks with weights w_1, w_2 and with respective filter random variables f, g with values in {0, 1}^n. Then we can define the dropout neural network Ψ with value

Ψ(x, (w_1 ∘ f, w_2 ∘ g)) = Ψ_1(x, w_1 ∘ f) + Ψ_2(x, w_2 ∘ g). (18)

Suppose that, additionally, we add independent filters h_1 and h_2 with values in {0, 1}^d to Ψ_1, Ψ_2 for their respective inputs. Then Ψ_1(x ∘ h_1, w_1) + Ψ_2(x ∘ h_2, w_2) is not necessarily of the type Ψ(x ∘ h, (w_1, w_2)) for some random variable h with values in {0, 1}^d.

As the above examples illustrate, a crucial aspect of whether a certain class of dropout neural networks (such as dropconnect or node-dropout) satisfies a universal approximation property is whether linear, independent combinations of such networks are again networks in the same class. On the other hand, many details of the neural networks, such as them being compositions of simpler functions, are irrelevant for the proof of Theorem 6.
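To make the mechanism of Theorem 6 concrete, here is a minimal numerical sketch for a single dropconnect filter (n = 1) acting on the weight of a one-hidden-layer unit as in (15). The base network, bias, dropout probability and test point are illustrative choices, and the coefficients are taken from the block formula computed later in Section 3.4: a_{1} = 1/(1 − p) and a_∅ = −p/(1 − p).

```python
import numpy as np

# Sketch of Theorem 6 with n = 1: one dropconnect filter f ~ Bernoulli(1 - p)
# on the single weight w of Psi(x, w) = sigmoid(w x + b).  The coefficients
# a_full, a_empty are the block coefficients of Section 3.4.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def psi(x, w, b=0.5):
    # base network Psi(x, w) = sigma(w x + b) with a fixed (illustrative) bias
    return sigmoid(w * x + b)

p = 0.3          # dropout probability, P[f = 0] = p
w, x = 2.0, 1.5  # weight and test point (arbitrary choices)
a_full = 1.0 / (1.0 - p)
a_empty = -p / (1.0 - p)

# Exact expectation of the linear combination in (13), by enumerating f:
lhs = a_full * ((1 - p) * psi(x, w) + p * psi(x, 0.0)) + a_empty * psi(x, 0.0)
assert np.isclose(lhs, psi(x, w))  # the identity (13) holds exactly

# Law of large numbers as in (14): average M independent realizations.
rng = np.random.default_rng(0)
M = 200_000
f = rng.random(M) < (1 - p)                    # i.i.d. Bernoulli(1 - p) filters
samples = a_full * psi(x, w * f) + a_empty * psi(x, 0.0)
print(abs(samples.mean() - psi(x, w)))         # small for large M
```

The first assertion verifies the algebraic identity (13) exactly; the Monte Carlo average illustrates the convergence in probability claimed in (14).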
3.3 The Classes DDNN

In the introduction we introduced classes DDNN of tuples (n, Ψ, f) that are closed under linear, independent combinations as the basic objects with which we want to approximate a given function ζ ∈ F. The convergence statement (14) of Theorem 6 then immediately implies Corollary 7, which was already given as Corollary 2. It expresses that if the class DDNN is rich enough to approximate any function in F when all filter variables are set to 1 in the event in (4), then for every function in F there exists a dropout neural network such that, with high probability with regard to the filter variables, the dropout neural network also approximates the function.

Corollary 7 Let ζ ∈ F and ϵ > 0. Assume there exist an (m, Φ, f) ∈ DDNN and a v ∈ R^m such that ∥Φ(·, v) − ζ∥_F < ϵ/2. Then there exist an (n, Ψ, g) ∈ DDNN and a w ∈ R^n such that

P[∥Ψ(·, w ∘ g) − ζ∥_F > ϵ] < ϵ (19)

and

E[∥ζ(·) − Ψ(·, w ∘ g)∥_F^q]^{1/q} < ϵ.

The proof of Corollary 7 can be found in Appendix A.2. This corollary can then be combined with deterministic universal approximation properties of certain classes of neural networks to obtain concrete universal approximation properties of dropout neural networks. For instance, because both the class of node-dropout networks and the class of dropconnect networks defined in Sections 2.2.1 and 2.2.2 form examples of a set DDNN, we obtain the following universal approximation property by combining Corollary 7 with the universal approximation result in Proposition 1 of Leshno et al. (1993).

Corollary 8 Assume µ is a probability measure on R^d with compact support, absolutely continuous with respect to the Lebesgue measure. Take F = L^r(µ) for some r ∈ [1, ∞). Assume that the activation function σ : R → R is not almost everywhere equal to a polynomial.
Then for every ϵ > 0 there exist a one-hidden-layer dropconnect neural network (n, Ψ, g) and a w ∈ R^n such that

P[∥Ψ(·, w ∘ g) − ζ∥_{L^r(µ)} > ϵ] < ϵ (20)

and

E[∥ζ(·) − Ψ(·, w ∘ g)∥_{L^r(µ)}^q]^{1/q} < ϵ.

There also exists a one-hidden-layer node-dropout neural network with the same properties.

Certainly, many variations of the above corollary can be constructed. To further illustrate Corollary 7, in Figure 4 we look at the approximation in probability of our construction from Theorem 6.

Figure 4: An illustration of function approximation in probability with our construction. The plot shows the truth ζ(x), the Cybenko neural network Φ(x, v), a band of width ϵ around it, and the constructions Ψ(x, w ∘ f). Here, M = 256, and 20 independent runs of the construction are shown in red; ϵ = 0.1 was chosen for illustrative purposes. Most of the runs lie within distance ϵ of the Cybenko neural network from (15), which we approximate with our construction in (16). Altogether, we are approximating the target function ζ.

3.4 Explicit Computation of Coefficients

To further illustrate Theorem 6, we will compute the coefficients a_U in (13) explicitly for a special case of dropout neural networks in which the filter variables are partitioned into independent blocks. All variables in one block i are simultaneously off with probability p_i and simultaneously on with probability 1 − p_i. Both node-dropout and dropconnect are special cases.

Proposition 9 Let f be a {0, 1}^n-valued random variable with a distribution specified as follows. Let {1, …, n} = I_1 ∪ ⋯ ∪ I_r be a disjoint partition and suppose that f_i = f_j whenever i, j ∈ I_s for some s ∈ {1, …, r}. Let f = (f_{I_1}, …, f_{I_r}) denote the random variable ordered in blocks, and suppose that P(f_{I_s} = 1) = 1 − p_s > 0 for all s ∈ {1, …, r} and that {f_{I_s}}_{s∈{1,…,r}} are mutually independent.
Then we have

Ψ(·, w) = Σ_{V∈2^{{1,…,r}}} (Π_{s∈V} 1/(1 − p_s)) (Π_{s∉V} (−p_s)/(1 − p_s)) E[Ψ(·, (w ∘ 1_{ι(V)}) ∘ f)],

where ι : 2^r → 2^n is the embedding characterized by j ∈ ι(V) if and only if j ∈ I_i for some i ∈ V.

We prove Proposition 9 in Appendix A.3. Note that as p_i → 1, the coefficients a_U become large. From this fact, together with the observation that the sum is taken over the large set 2^r, it is clear that the construction is computationally demanding. Still, small examples in the case of dropconnect are shown in Figures 2, 3 and 4.

3.5 Why the Results in this Section are only for Random-approximation Dropout

In this section, we have shown a random-approximation universal approximation result, i.e., a universal approximation result that is relevant when the dropout neural network is also used at prediction time with a stochastic output. In practice, however, the filter variables are usually replaced by their average values at prediction time. The following example shows that the construction in this section can lead to a bad approximation when doing expectation-replacement.

Example 5 Let σ be the standard ReLU activation function. The approximation procedure in Corollary 7 yields that the function ζ : R → R given by ζ(x) := σ(x − 1) can be well approximated by an average of many independent copies of the dropout neural network

x ↦ 4 f_1 σ(f_2 x − 1),

where f_1 and f_2 are i.i.d. Bernoulli random variables with success probability 1/2. However, replacing f_1 and f_2 by 1/2, we obtain the function x ↦ 2 σ(x/2 − 1), which is not a good approximation to the function ζ at all.

4. Use of Average Filter Variables for Prediction

We will now approximate a neural network by a larger dropout neural network that is also close to the original neural network when the filter variables are replaced by their expected values. The replacement of the filter random variables f by their expected values E[f] is common practice when using a trained dropout neural network for prediction.
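Since Example 5 involves only two Bernoulli filters, both modes of operation can be checked exactly by enumerating the four filter outcomes; the test point x = 4 below is an arbitrary illustrative choice.

```python
import numpy as np

# Exact check of Example 5 by enumerating the four outcomes of the i.i.d.
# Bernoulli(1/2) filters f1, f2.  The target is zeta(x) = relu(x - 1).

def relu(z):
    return np.maximum(z, 0.0)

def dropout_net(x, f1, f2):
    return 4.0 * f1 * relu(f2 * x - 1.0)

x = 4.0
zeta = relu(x - 1.0)                       # = 3.0

# Random mode: the expectation over the filters recovers zeta exactly,
# because relu(-1) = 0 kills the f2 = 0 outcomes.
expectation = sum(0.25 * dropout_net(x, f1, f2)
                  for f1 in (0, 1) for f2 in (0, 1))
assert np.isclose(expectation, zeta)

# Expectation-replacement mode: substituting f1 = f2 = 1/2 gives
# 2 relu(x/2 - 1), a poor approximation of zeta at this test point.
avg_filter = dropout_net(x, 0.5, 0.5)      # = 2 relu(1) = 2, not 3
assert np.isclose(avg_filter, 2.0)
```

The gap of 1 between the two modes at x = 4 is exactly the failure of expectation-replacement that motivates the constructions of this section.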
Informally, the main Theorem 17 below states that for any base neural network Ψ(·, w), there exist a larger neural network NN_{Γ,Ξ}(·, v) and filter variables f such that

NN_{Γ,Ξ}(x, v ∘ f) ≈ Ψ(x, w) ≈ NN_{Γ,Ξ}(x, v ∘ E[f]). (21)

Global Variables. In order to improve the readability of this section, we fix for the entire section a few (otherwise arbitrary) variables. Throughout this section:

- The base neural network Ψ is assumed to be a fixed (L − 1)-hidden-layer neural network as described in Section 2.1. We assume that its activation functions σ_j are continuous. We also keep the weights W^{(j)} and biases b^{(j)} fixed.
- We fix a number R > 0, which will play the role of the radius of a ball in the input space.
- We fix a number β ∈ (0, 1) and assume that for every random filter matrix F in this section, each entry is on with probability at least β, i.e., for all r, c, P[F_{rc} = 1] ≥ β > 0.
- We fix a number Q > 1, whose role will become clear later.

4.1 Heuristic Description of the Construction

In this section we describe the construction of the larger dropout neural network NN_{Γ,Ξ} in heuristic terms; the full details are given in the subsequent sections. The construction starts at the last layer of the base neural network Ψ, which is a function Ψ_L : R^{d_{L−1}} → R^{d_L} given by

x ↦ σ_L(W^{(L)} x + b^{(L)}). (22)

We construct the last layer of the larger dropout neural network such that it remains close to (22) as follows. Let N ∈ N and consider any collection {F^{(L),i}}_{i∈{1,…,N}} of i.i.d. random filter matrices such that for each i, F^{(L),i} has the same dimensions as W^{(L)}. By the law of large numbers, we can expect that the function

x ↦ σ_L((1/N) Σ_{i=1}^N ((W^{(L)} ⊘ E[F^{(L),i}]) ∘ F^{(L),i}) x + b^{(L)}) (23)

will be close to the function Ψ_L for sufficiently large N. Here we write ⊘ for element-wise division.

Figure 5: An example base neural network Ψ where L = 4.
The top diagram indicates the individual activation functions and the dimensions of each layer of nodes: d = d_0 = 1, d_1 = 2, d_2 = 4, d_3 = 3, and d_4 = 1. The set of all arrows connecting layer k − 1 of nodes to layer k corresponds to the parameters (W^{(k)}, b^{(k)}). The bottom diagram shows the same network, with the edges between layers compressed to single arrows; this compressed notation is the basis for the diagram in Figure 6 below.

Viewed as a one-layer neural network, the function (23) is a one-layer dropout neural network with N times as many edges as Ψ_L, and it can replace (22), i.e., Ψ_L, while staying close to Ψ_L. A further adaptation is necessary, however, because in (23) each copy W^{(L)} ⊘ E[F^{(L),i}] takes the same input x ∈ R^{d_{L−1}}. To make (23) a bona fide dropout network, different edges should take different inputs, and therefore we generalize (23) to

(R^{d_{L−1}})^N ∋ (x_i)_{i∈{1,…,N}} ↦ σ_L((1/N) Σ_{i=1}^N ((W^{(L)} ⊘ E[F^{(L),i}]) ∘ F^{(L),i}) x_i + b^{(L)}). (24)

By precomposing each of the inputs x_i with Ψ^{(L−1)}, and performing the same construction as above (copying the input to these copies of Ψ^{(L−1)}), we can inductively create our larger dropout neural network NN_{Γ,Ξ} that will be close to Ψ. There are now three points of attention:

- The intuitive statement "repeating this construction" needs to be formalized by an inductive construction. This requires a mathematical object that can record the intermediate stages of the construction.
- We need to show inductively that the resulting intermediate neural networks are close to (a network closely related to) the original network. In particular, we need to introduce a mathematical specification of "close" that is compatible with an inductive argument.
- The input space to the neural network in this construction grows with each step, whereas we still aim to have a final neural network with data space R^d. This requires us to deal with the first layer of the network differently.
These points are the topics of the subsequent sections.

4.2 Dropout-trees

We will encode the intermediate stages of our inductive construction by a mathematical object that we will refer to as a dropout-tree. The idea is that we start with a root, then attach incoming edges labeled with random filter matrices to it (creating leaves), and then recursively attach even more edges to the leaves. To be consistent with the numbering of layers in Section 2.1, we will prefer here to speak about the level j of a vertex or an edge in a tree rather than its depth L − j (the latter is the established jargon in graph theory, but the former aligns our notation with that of the neural network). In this numbering, the root is at level L.

Definition 10 A vertex v of a rooted tree is at level j ∈ {0, 1, …, L} if the path from v to the root v_0 has length L − j. An edge e = (u, v) of a rooted tree is at level j ∈ {0, 1, …, L} if its target vertex v is at level j.

From now on we will write σ_v for σ_{level(v)}, W^v for W^{(level(v))}, b^v for b^{(level(v))}, et cetera. This simplifies the notation at only a minor cost of abuse of notation.

Definition 11 A dropout-tree Γ of an (L − 1)-hidden-layer neural network Ψ is a directed graph G together with a labeling of the edges such that:

- the graph G is connected and acyclic;
- one of the vertices, say v_0, is designated as the root;
- the depth of the tree is at most L − 1;
- all directed edges point towards the root;
- every edge e is labeled with a random matrix F^e such that (a) F^e has the same dimensions as W^{target(e)}, (b) the entries of F^e are {0, 1}-valued, and (c) for all r, c, P[F^e_{rc} = 1] ≥ β > 0;
- for every vertex v that is not a leaf, {F^e}_{e∈into(v)} is a collection of mutually independent, identically distributed random matrices.

For convenience we recall some terminology. A directed edge points from a source to a target, and for an edge e we denote these by source(e) and target(e); we write into(v) for the set of all edges with target vertex v.
In the trees in this paper, all edges point towards the root of the tree. A vertex v is a child of a vertex w if there is an edge pointing from v to w; w is then the parent of v. A leaf is a vertex without children.

Dropout-trees can be constructed iteratively by starting from the trivial dropout-tree consisting only of a root and then repeatedly performing a so-called µ-input-copy construction. This allows us to inductively create a larger dropout-tree from a smaller dropout-tree.

Definition 12 Let Γ be a dropout-tree and let ℓ be a leaf of Γ at level k. Let µ be a distribution of a random matrix F ∈ {0, 1}^{d_k×d_{k−1}} that satisfies for all r, c, P[F_{rc} = 1] ≥ β > 0. A dropout-tree Γ′ is a µ-input-copy to the leaf ℓ of Γ if one can obtain Γ′ from Γ by: (a) attaching child vertices to ℓ, and (b) labeling each edge going into ℓ by an independent copy of F. The size of a µ-input-copy to ℓ of Γ refers to the number of children of ℓ in Γ′.

Let us describe the precise meaning of procedure (b) in Definition 12. For that, it may be useful to recall that random matrices are nothing but measurable functions defined on the probability space (Ω, F_Ω, P). Procedure (b) means precisely that the sigma-algebras generated by the filter variables F^e with e ∈ into(ℓ) are independent, and that for every e ∈ into(ℓ) the law of F^e equals µ. In particular, this condition allows for some correlation between filter variables labeling edges in the dropout-tree that do not go into ℓ. Moreover, in general there can be many different dropout-trees Γ′ that are µ-input-copies of Γ.

We will now describe how a dropout-tree encodes a dropout neural network.

4.2.1 Dropout neural networks encoded by dropout-trees

Each dropout-tree Γ induces a stochastic function Φ^{v_0}_Γ, a dropout neural network, as follows. For any edge e = (u, v) of Γ, let

V^e_Γ := W^e ⊘ E[F^e] (25)

be the rescaled weights for the dropout neural network.
We define

Φ^v_Γ := σ_v((1/#into(v)) Σ_{e∈into(v)} (V^e_Γ ∘ F^e) Φ^{source(e)}_Γ + b^v) if v is not a leaf, and Φ^v_Γ := Identity on R^{d_v} if v is a leaf.

Figure 6 illustrates this construction, based on the network Ψ depicted in Figure 5.

4.3 Dropout Neural Networks induced by Dropout-trees are Close to their Deterministic Counterpart

We will give an inductive argument that Φ^{v_0}_Γ is close to Φ^{v_0}_{Γ_det}. Here, Γ_det denotes the same dropout-tree as Γ except that we have replaced each and every filter variable deterministically by its expectation. Loosely speaking, the inductive argument implies that dropout neural networks induced by dropout-trees are close to their deterministic counterparts.

As a technical preparation, we define a sequence of radii R_0, R_1, …, R_L. The idea is that these provide bounds on the output after applying several layers, no matter the choice of filter variables or weights in the upcoming construction. Given the radius R > 0 defined at the start of this section, we set R_0 := R, and then choose the R_j inductively such that for all j ∈ {1, …, L} and all x ∈ B(0, Q β^{−1} ∥W^{(j)}∥_HS R_{j−1} + 1) it holds that

∥Ψ_j(x, (I, b^{(j)}))∥ < R_j − 1, (26)

where β ∈ (0, 1) and Q > 1 are two of the global variables that we defined at the beginning of the section. Here ∥·∥_HS denotes the Hilbert–Schmidt norm of a matrix and I denotes an identity matrix of the corresponding size.

We denote the input space of a network induced by a dropout-tree Γ by Inp_Γ. That is, Inp_Γ is the vector space of tuples (x_ℓ ∈ R^{d_{level(ℓ)}} | ℓ ∈ leaves(Γ)). We endow Inp_Γ with the norm ∥(x_ℓ)∥_{Inp_Γ} := max_{ℓ∈leaves(Γ)} ∥x_ℓ∥. We define In_Γ : R^d → Inp_Γ to be the collection of functions, indexed by the leaves ℓ of Γ, that are generated by those layers in the base network Ψ that are not represented in Γ at leaf ℓ:

In^ℓ_Γ(x) := (Ψ_{level(ℓ)}(·, (W^{(level(ℓ))}, b^{(level(ℓ))})) ∘ ⋯ ∘ Ψ_1(·, (W^{(1)}, b^{(1)})))(x). (27)

Note that by the definition (26) of the radii R_j we have In^ℓ_Γ(B(0, R)) ⊆ B(0, R_{level(ℓ)} − 1).
(28)

We say that a dropout-tree Γ satisfies property ApProp_Γ(δ, ϵ) if

P[sup_{x∈B(0,R)} sup_{x̃∈B(In_Γ(x),δ)} ∥Φ^{v_0}_Γ(x̃) − Φ^{v_0}_{Γ_det}(In_Γ(x))∥ > ϵ] < ϵ. (29)

We will prove Lemma 13; its message is that one can always construct a full dropout-tree that satisfies ApProp_Γ(δ, ϵ) for some δ > 0, by copying inputs at vertices.

Lemma 13 Let Γ be a dropout-tree and let ℓ be a leaf of Γ at level k > 1. Let µ be the distribution of a random matrix F ∈ {0, 1}^{d_k×d_{k−1}} that satisfies for all r, c, P[F_{rc} = 1] ≥ β > 0. The following now holds: if Γ satisfies ApProp_Γ(δ, ϵ) in (29) for some δ, ϵ > 0, then for every sufficiently large µ-input-copy Γ′ at ℓ there exists a δ′ > 0 such that Γ′ satisfies ApProp_{Γ′}(δ′, ϵ).

The proof of Lemma 13 is relegated to Appendix B.1. There, we show that Lemma 13 follows from Lemma 14, which is displayed next and proved in Appendix B.2.

Lemma 14 Consider any continuous function σ : R^m → R^m and let W ∈ R^{m×n}, b ∈ R^m. Let {F^i}_{i≥1} be a sequence of mutually independent copies of a random matrix F ∈ R^{m×n} that satisfies: for r ∈ {1, …, m} and c ∈ {1, …, n}, 0 < E[F_{rc}] < ∞ and 0 ≤ F_{rc} ≤ M < ∞ with probability one. Let V := W ⊘ E[F]. The following now holds: for every 0 ≤ K < ∞ and ρ > 0 there exists a δ > 0 such that, as N → ∞,

P[sup_{x∈B(0,K)} sup_{(x̃_i)∈B(x,δ)} ∥σ((1/N) Σ_{i=1}^N (V ∘ F^i) x̃_i + b) − σ(W x + b)∥ > ρ] → 0. (30)

4.4 Replacing the First Layer

We now look at the first layer, and consider first the case in which we do not use dropout on it, and second the case in which we do.

4.4.1 Copying the First Layer many times, without Dropout of Input Edges

We will now start from a full dropout-tree and connect copies of the first layer many times, to obtain a neural network that approximates the original neural network well, both in terms of random approximation and in terms of expectation replacement. In this section, it is crucial that there is no dropout of edges in this very first layer.
Assume therefore that we have constructed a full dropout-tree, i.e., a dropout-tree of which all leaves are at level 1 (at depth L − 1). We denote by NN_{Γ,Ξ} the neural network that is formed by precomposing every leaf ℓ of the neural network Φ^{v_0}_Γ induced by the dropout-tree with the first layer of the original network. To align with the next section, we denote this first-layer function by Ξ^ℓ := Ψ_1(·, (W^{(1)}, b^{(1)})). We also let NN^{avg-filt}_{Γ,Ξ} denote almost the same network, but with the modification that every random filter variable is replaced by its expectation. The next theorem expresses that this construction yields a dropout neural network that approximates the original neural network well in terms of both random approximation and expectation replacement.

Theorem 15 Fix 0 < ϵ < 1. Let Γ be a full dropout-tree satisfying ApProp_Γ(δ, ϵ) for some δ > 0. Let NN_{Γ,Ξ} and NN^{avg-filt}_{Γ,Ξ} be the networks described above, or more precisely defined in Example 6. Then

P[sup_{x∈B(0,R)} ∥NN_{Γ,Ξ}(x) − Ψ(x, w)∥ > ϵ] < ϵ (31)

and

E[sup_{x∈B(0,R)} ∥NN_{Γ,Ξ}(x) − Ψ(x, w)∥^q]^{1/q} < ϵ, (32)

while

sup_{x∈B(0,R)} ∥NN^{avg-filt}_{Γ,Ξ}(x) − Ψ(x, w)∥ < ϵ. (33)

The theorem follows from Theorem 17, which deals with the more general case in which edges in the input layer can be dropped. Indeed, if in Theorem 17 one takes σ_0 : R → R to be the identity function and one sets all filter variables deterministically equal to 1, then for every α and N the networks constructed in Theorem 17 coincide exactly with the NN_{Γ,Ξ} and NN^{avg-filt}_{Γ,Ξ} introduced above.

4.4.2 First Layers in which Inputs are Dropped

The situation is more complicated if we allow for dropout of edges in every layer, particularly the first. In this section we will show that also in that case we can find a dropout neural network that approximates the original neural network well, both in terms of random approximation and in terms of expectation replacement.
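The law-of-large-numbers mechanism behind Theorem 15, i.e., averaging many rescaled dropped-out copies of a layer as in (23) and Lemma 14, can be sketched numerically as follows. The layer sizes, the activation tanh, and the keep-probability q = 1 − p are illustrative choices.

```python
import numpy as np

# Toy check of the averaging mechanism of (23) and Lemma 14: replace one
# layer x -> sigma(W x + b) by an average of N dropped-out copies with
# rescaled weights V = W (elementwise /) E[F], so that
#   sigma((1/N) sum_i (V o F_i) x + b)  ~  sigma(W x + b)  for large N.

rng = np.random.default_rng(1)
m, n, N, q = 3, 5, 50_000, 0.8             # illustrative sizes and keep-prob.

W = rng.normal(size=(m, n))
b = rng.normal(size=m)
x = rng.normal(size=n)
V = W / q                                  # rescaling: E[F_rc] = q element-wise

F = (rng.random(size=(N, m, n)) < q)       # N i.i.d. Bernoulli(q) filter matrices
pre = np.mean((V * F) @ x, axis=0) + b     # (1/N) sum_i (V o F_i) x + b
out_avg = np.tanh(pre)
out_det = np.tanh(W @ x + b)

print(np.max(np.abs(out_avg - out_det)))   # shrinks like 1/sqrt(N)
```

Here all N copies receive the same input x; the paper's construction (24) additionally allows each copy its own nearby input x_i, which the continuity of σ absorbs.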
Assume again that we have constructed a full dropout-tree, that is, a dropout-tree of which all leaves are at level 1 (i.e., at depth L − 1). This means that we have constructed suitable replacements for every layer of the neural network except the first. This layer contains the edges that have the global input as source. Replacing the first layer requires a different construction: if we were to outright drop edges in the first layer, we could not control the error with the current technique.

We now describe how we replace the first layer. For every leaf ℓ in the full dropout-tree, we precompose the input at ℓ with a stochastic function Ξ^ℓ : R^{d_0} → R^{d_1}. We record this information in what we call a precomposition for a dropout-tree. Figure 6 illustrates this precomposition.

Definition 16 A precomposition Ξ for a full dropout-tree Γ is a map Ξ : leaves(Γ) → (R^{d_0} → R^{d_1}) that sends every leaf ℓ ∈ leaves(Γ) to a stochastic function Ξ^ℓ : R^{d_0} → R^{d_1}.

Let ∆ : R^{d_0} → (R^{d_0})^{leaves(Γ)} be the diagonal map sending x to copies of x. The neural network induced by the full dropout-tree Γ and a precomposition Ξ that we consider is given by

NN_{Γ,Ξ} := Φ^{v_0}_{Γ,Ξ} ∘ ∆, (34)

where

Φ^v_{Γ,Ξ} := σ_v((1/#into(v)) Σ_{e∈into(v)} (V^e ∘ F^e) Φ^{source(e)}_{Γ,Ξ} + b^v) if v is not a leaf, and Φ^v_{Γ,Ξ} := Ξ^v if v is a leaf. (35)

We also define NN^{avg-filt}_{Γ,Ξ} as almost the same neural network as NN_{Γ,Ξ}, the only difference being that we replace each random filter variable F^e in (35) with its expectation E[F^e]. Recall for (34) that v_0 designates the root of the dropout-tree, and note that (35) constructs the neural network recursively (layer by layer).

Example 6 In fact, we already examined one specific precomposition Ξ in Section 4.4.1: the one that assigns the function Ψ_1(·, (W^{(1)}, b^{(1)})) to every leaf. This precomposition yields a neural network NN_{Γ,Ξ} in which edges in the first layer, i.e., the input edges of the neural network, are never dropped.
In this case, NN^{avg-filt}_{Γ,Ξ} actually coincides with the original neural network Ψ(·, w). We will now construct precompositions that do allow for the possibility of dropping edges in the first layer and, e.g., applying the ReLU function to them immediately. Concretely, we add a zeroth layer with an activation function σ_0 : R → R. We assume that σ_0(0) = 0 and that σ_0 has one-sided derivatives σ_− and σ_+ at the point 0 ∈ R:

σ_− := lim_{α↓0} (σ_0(−α) − σ_0(0))/(−α),  σ_+ := lim_{α↓0} (σ_0(+α) − σ_0(0))/α. (36)

Define the sign function

S(x) := − if x < 0, and S(x) := + if x ≥ 0, (37)

applied component-wise. If x ≠ 0, then σ_{S(−x)} + σ_{S(x)} = σ_− + σ_+ does not depend on x, a critical fact that we will leverage in our construction.

Figure 6: An illustration of how we use the base neural network Ψ from Figure 5 and a dropout-tree (indicated with thicker arrows) to ultimately construct the larger neural network NN_{Γ,Ξ} in which edges are being dropped stochastically. Here, L = 4: the layers carry the activation functions σ_4, σ_3, σ_2, σ_1, σ_0, the root is at level 4, the leaves are at level 1, and the zeroth layer σ_0 together with the copying of the input constitutes the precomposition Ξ.

Example 7 Consider a zeroth layer that is the identity function, i.e., σ_0(y) = y. Then σ_− = σ_+ = 1. Choosing σ_0 as the identity function is allowed here, and means that the layer is not adapted.

Example 8 Consider a zeroth layer with ReLU activation function, σ_0(z) := ReLU(z) := z 1[z ≥ 0]. Then σ_− = 0 and σ_+ = 1.

Here are the precompositions that we employ: we call Ξ an (α, N)-precomposition associated to a set of distributions {µ_ℓ, ν_ℓ}_{ℓ∈leaves(Γ)} if for each leaf ℓ,

Ξ^ℓ(x) := σ_1((1/(2N)) Σ_{i=1}^{2N} (−1)^i (V^ℓ ∘ F^{ℓ,i}) σ_0((−1)^i α (I ∘ G^{ℓ,i}) x) + b^ℓ), (38)

where, element-wise,

V^ℓ_{rc} = 2 W^{(1)}_{rc} / (α (σ_− + σ_+) E[F^ℓ_{rc}] E[G^ℓ_{cc}]), (39)

and {F^{ℓ,i}}_{i≥1}, {G^{ℓ,i}}_{i≥1} are sequences of mutually independent copies of random matrices F^ℓ, G^ℓ that have distributions µ_ℓ, ν_ℓ, respectively. Furthermore, the F^ℓ are presumed to have the same size as W^{(1)}, and the G^ℓ to have size d_0 × d_0.
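As a sanity check on (38)–(39), consider the special case σ_0 = ReLU (so σ_− = 0 and σ_+ = 1) with deterministic filters F^ℓ = G^ℓ = 1, i.e., point-mass distributions µ_ℓ, ν_ℓ. Then each pair of terms i = 2k − 1, 2k in (38) reconstructs W^{(1)}x exactly, because ReLU(t) − ReLU(−t) = t, so Ξ^ℓ = Ψ_1 for every α > 0 and N. The sketch below uses illustrative sizes and σ_1 = tanh; the normalization V^ℓ = 2W^{(1)}/α is the one implied by the 1/(2N) scaling in (38), as reconstructed in (39) above.

```python
import numpy as np

# Sketch of the (alpha, N)-precomposition (38)-(39) for sigma_0 = ReLU
# (sigma_- = 0, sigma_+ = 1) and deterministic filters F = G = 1.
# In this case the construction reproduces the first layer exactly.

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(2)
d0, d1, N, alpha = 4, 3, 8, 0.01           # illustrative sizes

W1 = rng.normal(size=(d1, d0))
b1 = rng.normal(size=d1)
x = rng.normal(size=d0)

# (39) with E[F_rc] = E[G_cc] = 1 and sigma_- + sigma_+ = 1:
V = 2.0 * W1 / alpha

acc = np.zeros(d1)
for i in range(1, 2 * N + 1):
    sign = (-1) ** i
    # term i of (38): (-1)^i (V o F) sigma_0((-1)^i alpha (I o G) x)
    acc += sign * (V @ relu(sign * alpha * x))
xi = np.tanh(acc / (2 * N) + b1)

target = np.tanh(W1 @ x + b1)              # the first layer Psi_1(x)
assert np.allclose(xi, target)
```

With random Bernoulli filters instead of point masses, the same computation only converges as N grows large and α small, which is exactly the content of Lemma 18 below.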
Note that these assumptions allow us to place unit mass on any particular outcome, and thus to replace F^ℓ, G^ℓ by deterministic counterparts. The idea of (38) is that it represents two layers of a dropout neural network that satisfies the approximation properties we are after. The functions σ_1, σ_0 can be understood as their activation functions, the matrices V^ℓ, αI as their weights, b^ℓ as a bias, and the matrices F^ℓ, G^ℓ as describing which edges and inputs are randomly removed. By scaling the weights V^ℓ by 1/(2N) and generating 2N independent copies of the first layer, we prepare for an application of the law of large numbers. Furthermore, by allowing for arbitrarily small α, we prepare for a linearization of σ_0 around 0. Finally, the alternatingly positive and negative multiplicative factors (−1)^i allow us to cover one-sided directional derivatives such as those of the ReLU activation function. All together, the construction allows us to prove the following theorem.

Theorem 17 Fix 0 < ϵ < 1. Let Γ be a full dropout-tree satisfying ApProp_Γ(δ, ϵ) for some δ > 0. Let Ξ be an (α, N)-precomposition associated to a set of distributions {µ_ℓ, ν_ℓ}_{ℓ∈leaves(Γ)}. Assume that for every ℓ, if F is a matrix of filter variables distributed according to µ_ℓ or ν_ℓ, then for every r, c, P[F_{rc} = 1] ≥ β > 0. Let σ_0 : R → R be a continuous function with one-sided derivatives σ_− and σ_+ at 0, such that σ_− + σ_+ ≠ 0 and such that σ_0(0) = 0. Assume moreover that σ_− and σ_+ satisfy the following inequality with respect to the global variable Q:

4 (|σ_−| + |σ_+|) / |σ_− + σ_+| < Q. (40)

The following inequalities now hold for α > 0 small enough and N ∈ N large enough:

P[sup_{x∈B(0,R)} ∥NN_{Γ,Ξ}(x) − Ψ(x, w)∥ > ϵ] < ϵ (41)

and

E[sup_{x∈B(0,R)} ∥NN_{Γ,Ξ}(x) − Ψ(x, w)∥^q]^{1/q} < ϵ, (42)

while

sup_{x∈B(0,R)} ∥NN^{avg-filt}_{Γ,Ξ}(x) − Ψ(x, w)∥ < ϵ.
(43)

An important consequence of Theorem 17 is that we obtain, for instance, a universal approximation result for node-dropout and dropconnect neural networks with ReLU activation functions that also guarantees a good approximation when the filter variables are replaced by their averages, as formalized by Corollaries 4 and 5 in the introduction.

Theorem 17 is proven in Appendix B.3. There, we show that Theorem 17 follows from Lemma 18 below, which in turn is proven in Appendix B.4 using compactness arguments and the law of large numbers.

Lemma 18 Let σ_0 : R → R be a continuous function with σ_0(0) = 0 and with two one-sided derivatives σ_− and σ_+ at 0 satisfying |σ_− + σ_+| > 0. Let Ξ be an (α, N)-precomposition associated to a set of distributions {µ_ℓ, ν_ℓ}_{ℓ∈leaves(Γ)} such that for all ℓ, r, c, E[F^ℓ_{rc}] > 0, E[G^ℓ_{rc}] > 0, 0 ≤ F^ℓ_{rc} ≤ M with probability one, and 0 ≤ G^ℓ_{rc} ≤ M < ∞ with probability one. The following now holds: for every leaf ℓ, 0 ≤ K < ∞, and ρ > 0, for α small enough and N large enough,

P[sup_{x∈B(0,K)} ∥Ξ^ℓ(x) − Ψ_1(x, (W^{(1)}, b^{(1)}))∥ > ρ] < ρ (44)

and

sup_{x∈B(0,K)} ∥Ξ^{ℓ,avg-filt}(x) − Ψ_1(x, (W^{(1)}, b^{(1)}))∥ < ρ, (45)

where Ξ^{ℓ,avg-filt} denotes the function Ξ^ℓ in (38) but with each filter variable F^{ℓ,i} replaced by its expectation E[F^ℓ].

5. Discussion

In this article, we showed that dropout neural networks are rich enough for a universal approximation property to hold, both for random-approximation and for expectation-replacement dropout. This is further evidence that the representational capacity of neural networks is so large that approximations are possible despite significant additional constraints. In the case of dropout, these additional constraints are the implicit symmetry constraints enforced by the turning on and off of the filter variables: in dropconnect, for instance, for most realizations of the filter variables, the output of the dropconnect neural network still approximates the original neural network well after the filter variables are randomly permuted.
Despite the enforced invariance with respect to this operation, there is enough room in the parameter space for the weights of the network to provide a good approximation for the overwhelming majority of realizations of the filter variables. Our proof of the universal approximation property for random-approximation dropout explicitly works with this symmetry. By this, we mean the following. The universal approximation property that we show holds even when edges from the input nodes are dropped out at random. The output in the first hidden layer is then inherently random, and in no way close to deterministic. This is in contrast with, for instance, the universal approximation property of Foong et al. (2020), in which the layers are all very close to deterministic. Yet even though the values in the nodes are random, we do have a good understanding of the distribution of the values in the nodes, and two stochastic realizations are most likely almost permutations of each other. By blowing up the first layer, i.e., repeating it many times in parallel, we then know the output very well up to this permutation symmetry, and this turns out to be enough for us to show a universal approximation property.

5.1 Limitations of our Results

Our results and methods have several limitations. We only show the existence of dropout neural networks close to a given function. It is a completely separate question whether an algorithm such as dropout stochastic gradient descent would actually be able to find such an approximation. The main message of our result is that at least there is no theoretical obstruction to approximating functions with dropout neural networks. In the proofs, we used very explicitly that filter variables only take the values zero or one, while other forms of dropout also exist (for instance with Gaussian filter variables).
Our algebraic proof does not readily generalize to this broader setting, but it is possible that parts of the proof could be reused. In this paper we made no effort to reduce the number of parameters of the approximating networks, and indeed the number of parameters can grow rapidly with increasing dimension of the parameter w. This can for instance be recognized in the exponential number of additional parameters (a_U) in Theorem 6, the large (but algebraic) number M of copies that is required to reduce variance in (14), and the exponential increase of leaves in dropout-trees with increasing depth. Naturally, understanding optimal approximation rates in terms of parameter dimensions is a central question in approximation theory, but we leave this to future work.

6. Conclusion

We showed two types of universal approximation results for dropout neural networks: one for random-approximation dropout, in which case the random filter variables are also used at prediction time, and one for expectation-replacement dropout, in which case the filter variables are replaced by their averages at prediction time. Our results allow for dropout of edges from the input layer, for a wide class of distributions on filter variables, and for a wide class of activation functions. By making the difference between random-approximation and expectation-replacement dropout explicit, our results also highlight the following mystery: how is it that expectation-replacement dropout performs so well at prediction time?

Acknowledgments

J.W. Portegies was supported by the Electronic Component Systems for European Leadership Joint Undertaking under grant agreement No 737459 (project Productive 4.0).
This Joint Undertaking receives support from the European Union Horizon 2020 research and innovation programme and Germany, Austria, France, Czech Republic, Netherlands, Belgium, Spain, Greece, Sweden, Italy, Ireland, Poland, Hungary, Portugal, Denmark, Finland, Luxembourg, Norway, Turkey.

Appendix A. Proofs of Section 3

In this appendix we prove the results of Section 3. In particular, we prove Theorem 6, Corollary 7 and Proposition 9.

A.1 Proof of Theorem 6

We require the following algebraic lemma, which lies at the heart of Theorem 6, as it implies the existence of the constants (a_U).

Lemma 19 Let Ψ : ℝ^d × ℝ^n → ℝ be a function. Let (f_U) be a collection of {0, 1}^n-valued random variables indexed by subsets U ⊆ ⟦1, n⟧, such that for every U, P[f_U = (1, . . . , 1)] > 0. Then for every subset V ⊆ ⟦1, n⟧, it holds that

Ψ(·, w ⊙ 1_V) ∈ span{ E[Ψ(·, (w ⊙ 1_U) ⊙ f_U)] : U ∈ 2^n },

where for any subset S ∈ 2^n, i.e., S ⊆ {1, . . . , n}, we denote by 1_S the characteristic function of S. In other words, there exist constants (a_{U,V})_{U∈2^n}, independent of w and x, such that for all (x, w) ∈ ℝ^d × ℝ^n,

Ψ(x, w ⊙ 1_V) = E[ Σ_{U∈2^n} a_{U,V} Ψ(x, (w ⊙ 1_U) ⊙ f_U) ],

and in particular there exist constants (a_U) such that

Ψ(x, w) = E[ Σ_{U∈2^n} a_U Ψ(x, (w ⊙ 1_U) ⊙ f_U) ].

Proof The proof is by induction on the cardinality of V and follows from the equality

Σ_{S : V⊆S} P[f_V = 1_S] Ψ(·, w ⊙ 1_V) = E[ Ψ(·, (w ⊙ 1_V) ⊙ f_V) ] − Σ_{S : V\S ≠ ∅} P[f_V = 1_S] Ψ(·, w ⊙ 1_{S∩V}). (46)

In particular, for the base case in which V is empty, the last term vanishes. In the induction step, the functions Ψ(·, w ⊙ 1_{S∩V}) are by the induction hypothesis all in the required span.

Proof (of Theorem 6) By Lemma 19, we can find constants a_U for U ∈ 2^n such that (13) holds. We look now at (14). By the law of large numbers, convergence in probability in the normed vector space (F, ∥·∥) follows.
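The induction identity (46) can be checked numerically for a small example. The sketch below uses illustrative choices not taken from the paper: n = 2, filter variables with i.i.d. Bernoulli(q) coordinates, and Ψ(x, w) = tanh(⟨w, x⟩). It enumerates all filter realizations and verifies that the two sides of (46) agree.

```python
import itertools, math

# Illustrative setup (not from the paper): n = 2 weight coordinates,
# i.i.d. Bernoulli(q) filter coordinates, Psi(x, w) = tanh(<w, x>).
n, q = 2, 0.7
x, w = [0.4, -1.3], [0.9, 2.1]

def psi(mask):  # Psi(x, w * 1_S) for the subset encoded by `mask`
    return math.tanh(sum(w[i] * x[i] for i in mask))

subsets = [frozenset(s) for m in range(n + 1)
           for s in itertools.combinations(range(n), m)]
# P[f = 1_S] under independent Bernoulli(q) coordinates
prob = {S: q ** len(S) * (1 - q) ** (n - len(S)) for S in subsets}

V = frozenset({0})  # the subset V appearing in (46)
# E[Psi(x, (w * 1_V) * f_V)] by enumerating all realizations f_V = 1_S
expectation = sum(prob[S] * psi(S & V) for S in subsets)
# left-hand side of (46)
lhs = sum(prob[S] for S in subsets if V <= S) * psi(V)
# right-hand side of (46)
rhs = expectation - sum(prob[S] * psi(S & V) for S in subsets if V - S)
assert abs(lhs - rhs) < 1e-12
```

Since the coefficient of Ψ(·, w ⊙ 1_V) on the left is at least P[f_V = (1, . . . , 1)] > 0, one can divide by it, which is what drives the induction.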
Moreover, for any V ∈ 2^n we have

∥Ψ(·, (w ⊙ 1_V) ⊙ f_V)∥_F ≤ max_{U∈2^n} ∥Ψ(·, (w ⊙ 1_U) ⊙ f_U)∥_F =: C_w, (47)

so that for any q ∈ [1, ∞) and M,

∥ Σ_{U∈2^n} a_U Ψ(·, (w ⊙ 1_U) ⊙ f_{i,U}) ∥ ≤ 2^n max_{U∈2^n} |a_U| C_w. (48)

With uniform boundedness for all M, we can use the dominated convergence theorem, which then implies convergence in L^q of the F-valued random variables as M → ∞.

A.2 Proof of Corollary 7

Proof Let ζ ∈ F and let ϵ > 0. Assume there exists a (m, Φ, f) ∈ DDNN and a v ∈ ℝ^m such that ∥Φ(·, v) − ζ∥_F < ϵ. Define η := ϵ − ∥Φ(·, v) − ζ∥_F > 0. Define the collection (f_U) of {0, 1}^m-valued filter variables, each being specifically an independent copy of f. By Theorem 6, there exist constants (a_U), a number M ∈ ℕ and 2^m M independent copies (f_{i,U}) of f such that

P[ ∥ (1/M) Σ_{i=1}^M Σ_{U∈2^m} a_U Φ(·, (v ⊙ 1_U) ⊙ f_{i,U}) − Φ(·, v) ∥_F > η ] < η

and

∥ (1/M) Σ_{i=1}^M Σ_{U∈2^m} a_U Φ(·, (v ⊙ 1_U) ⊙ f_{i,U}) − Φ(·, v) ∥_{L^q} < η.

Hence by the triangle inequality, in fact

P[ ∥ (1/M) Σ_{i=1}^M Σ_{U∈2^m} a_U Φ(·, (v ⊙ 1_U) ⊙ f_{i,U}) − ζ ∥_F > ϵ ] < ϵ

and

∥ (1/M) Σ_{i=1}^M Σ_{U∈2^m} a_U Φ(·, (v ⊙ 1_U) ⊙ f_{i,U}) − ζ ∥_{L^q} < ϵ.

We then define the tuple (n, Ψ, g) ∈ DDNN as an independent finite linear combination of 2^m M copies of (m, Φ, f), with coefficients a_U/M. Setting v̄ ∈ ℝ^{2^m m} to be the concatenation of the 2^m modified vectors (v ⊙ 1_U)_{U⊆⟦1,m⟧}, we then set w ∈ ℝ^{2^m M m} to be the subsequent concatenation of M copies of v̄. The combination of (n, Ψ, g) and w achieves the assertion.

A.3 Proof of Proposition 9

As we have seen in Lemma 19, we can find the map Ψ(·, w) in the span of E[Ψ(·, (w ⊙ 1_V) ⊙ f_V)] for V ∈ 2^n. We will now look at specific cases where we can explicitly compute the linear combination. In particular, we consider dropout (Hinton et al., 2012), that is, we drop nodes independently with the same probability, and dropconnect (Wan et al., 2013), where we drop individual weights independently with the same probability. In both cases the filter variables f_i take the same values on some disjoint subsets of ⟦1, n⟧, where n is the number of weights to which we apply filters. That is, if we have a disjoint set decomposition ⟦1, n⟧ = I_1 ⊔ · · · ⊔ I_r with I_k ∩ I_s = ∅ whenever k ≠ s, then f_i = f_j for all i, j ∈ I_k, and f_i, f_l are independent if they belong to disjoint sets, f_i with i ∈ I_k and f_l with l ∈ I_s and s ≠ k. In this section we drop the index U ∈ 2^n of the random variables f_U for notational convenience, as they are identically distributed. We will use this property to obtain an explicit decomposition in (13), also in a more general setting where the probability of the filters may differ depending on which disjoint set they belong to. For K, S ∈ 2^n, we denote by K ⊆ S the usual set inclusion, that is, i ∈ K implies i ∈ S for all i ∈ ⟦1, n⟧. We need the following lemmas:

Lemma 20 Let 1 ≥ q_1, . . . , q_r > 0 and r ∈ ℕ. For K ∈ 2^r,

Σ_{S∈2^r : K⊆S} (−1)^{|S\K|} Π_{i∈S\K} (1 − q_i) Π_{i∈⟦1,r⟧\S} (1 − q_i) = { 1 if K = ⟦1, r⟧, 0 otherwise. (50)

Proof Let x_1, . . . , x_r be free variables. For S ∈ 2^r we denote the monomial x_S = Π_{i∈S} x_i. We will prove the identity by comparing coefficients of two equal polynomials. We have

q(x_1, . . . , x_r) = Π_{i=1}^r (q_i x_i + (1 − q_i)) = Σ_{S∈2^r} x_S Π_{i∈S} q_i Π_{i∈⟦1,r⟧\S} (1 − q_i), (51)

where we have expanded all 2^r monomials appearing in the decomposition of q. Now we set x_i = (1/q_i) y_i − (1 − q_i)/q_i in (51). We have

q( (1/q_1) y_1 − (1 − q_1)/q_1, . . . , (1/q_r) y_r − (1 − q_r)/q_r ) = Π_{i=1}^r y_i = y_{⟦1,r⟧}. (52)

On the other hand, if we substitute x_i = (1/q_i) y_i − (1 − q_i)/q_i in the monomials x_S in (51), we have

Σ_{S∈2^r} Π_{i∈S} q_i Π_{i∈⟦1,r⟧\S} (1 − q_i) Π_{i∈S} ( (1/q_i) y_i − (1 − q_i)/q_i ) = Σ_{S∈2^r} Π_{i∈⟦1,r⟧\S} (1 − q_i) Σ_{K⊆S} (−1)^{|S\K|} Π_{i∈S\K} (1 − q_i) y_K =: Σ_{K∈2^r} µ_K y_K, (53)

so that we must have µ_K = 1 if K = ⟦1, r⟧ and zero otherwise.

Let ⟦1, n⟧ = I_1 ⊔ · · · ⊔ I_r be a disjoint partition of ⟦1, n⟧, i.e., I_j ∩ I_i = ∅ if i ≠ j. We consider S ∈ 2^r also as an element of 2^n via the inclusion ι : 2^r → 2^n given by j ∈ ι(S) if j ∈ I_i and i ∈ S; i.e., we consider the index i as the set of all indices j ∈ I_i. Note then that ι(⟦1, r⟧) = ⟦1, n⟧. Recall now that the filter random variables with values in {0, 1}^n are denoted by f = (f_1, . . . , f_n). We suppose now that the filter random variables satisfy f_i = f_j whenever i, j ∈ I_s for some s ∈ ⟦1, r⟧. We denote by B_s the {0, 1}-valued random variable corresponding to the I_s part of ⟦1, n⟧.
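Because the displayed identity in Lemma 20 is easy to get wrong by a sign, a small numerical enumeration is a useful sanity check. The sketch below uses illustrative values not taken from the paper: r = n = 3 singleton blocks (a dropconnect-style partition), arbitrary success probabilities q_i, and Ψ(x, w) = tanh(⟨w, x⟩). Part 1 checks the alternating-sum identity in the form implied by the polynomial proof of Lemma 20; part 2 checks that a matching choice of coefficients a_V — a reconstruction consistent with Lemmas 20 and 21, not a verbatim quote of Proposition 9 — inverts the dropout expectation.

```python
import itertools, math

# Illustrative setup (not from the paper): r = n = 3 singleton blocks,
# success probabilities q_i, Psi(x, w) = tanh(<w, x>).
n = 3
q = [0.8, 0.5, 0.6]
x, w = [0.3, -0.7, 1.1], [1.2, -0.4, 0.8]
idx = list(range(n))
subsets = [frozenset(s) for m in range(n + 1)
           for s in itertools.combinations(idx, m)]
full = frozenset(idx)

def mu(K):  # the alternating sum of Lemma 20, as implied by its proof
    total = 0.0
    for S in subsets:
        if K <= S:
            term = (-1.0) ** len(S - K)
            term *= math.prod(1 - q[i] for i in S - K)
            term *= math.prod(1 - q[i] for i in idx if i not in S)
            total += term
    return total

for K in subsets:  # mu_K = 1 iff K is the full index set
    assert abs(mu(K) - (1.0 if K == full else 0.0)) < 1e-12

def psi(mask):  # Psi(x, w * 1_mask)
    return math.tanh(sum(w[i] * x[i] for i in mask))

def expected_psi(V):  # E[Psi(x, (w * 1_V) * f)] by exhaustive enumeration
    total = 0.0
    for S in subsets:
        p = math.prod(q[i] if i in S else 1 - q[i] for i in idx)
        total += p * psi(S & V)
    return total

# Reconstructed coefficients (hypothetical, consistent with Lemmas 20/21):
# a_V = (-1)^(n-|V|) * prod_{i not in V}(1 - q_i) / prod_i q_i
qprod = math.prod(q)
recovered = sum((-1.0) ** (n - len(V))
                * math.prod(1 - q[i] for i in idx if i not in V) / qprod
                * expected_psi(V)
                for V in subsets)
assert abs(recovered - psi(full)) < 1e-12  # recovers Psi(x, w) exactly
```

The enumeration is exact (no sampling), so the asserted equalities hold up to floating-point error only.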
We suppose that P(B_s = 1) = q_s = 1 − p_s for all s ∈ ⟦1, r⟧, where q_s is the probability of success and p_s the dropping probability. Moreover, we suppose that the (B_s)_{s∈⟦1,r⟧} are mutually independent. With this notation we have:

E[Ψ(·, w ⊙ f)] = Σ_{L∈2^r} Π_{i∈L} q_i Π_{i∈⟦1,r⟧\L} (1 − q_i) Ψ(·, w ⊙ 1_{ι(L)}). (54)

In the following lemma, we embed 2^r into 2^n as blocks according to a partition of ⟦1, n⟧ using ι:

Lemma 21 For S ∈ 2^r,

E[Ψ(·, (w ⊙ 1_{ι(S)}) ⊙ f)] = Σ_{K⊆S} Π_{i∈K} q_i Π_{i∈S\K} (1 − q_i) Ψ(·, w ⊙ 1_{ι(K)}). (55)

Proof Let S ∈ 2^r. Observe that f = Σ_{s=1}^r 1[B_s = 1] 1_{I_s}, and note in particular that

g := 1_{ι(S)} ⊙ f = Σ_{s=1}^r 1[B_s = 1] 1_{ι(S)} ⊙ 1_{I_s} = Σ_{s∈S} 1[B_s = 1] 1_{I_s}. (56)

Hence, g depends only on (B_s)_{s∈S} and is thus moreover independent of (B_t)_{t∈S^c} by assumption. Consequently Ψ(·, w ⊙ g) also depends only on (B_s)_{s∈S} and is also independent of (B_t)_{t∈S^c}. The result then follows.

To see this in detail, suppose that S = {s_1, . . . , s_m} and S^c = {t_1, . . . , t_{r−m}}, say. Use the law of the unconscious statistician together with (i) independence to conclude that

E[Ψ(·, w ⊙ g)] = Σ_{b_1,...,b_r∈{0,1}} Ψ(·, w ⊙ Σ_{s∈S} 1[b_s = 1] 1_{I_s}) P[B_1 = b_1, . . . , B_r = b_r]
= Σ_{b_{s_1},...,b_{s_m}∈{0,1}} Ψ(·, w ⊙ Σ_{i=1}^m 1[b_{s_i} = 1] 1_{I_{s_i}}) P[∩_{i=1}^m {B_{s_i} = b_{s_i}}] · Σ_{b_{t_1},...,b_{t_{r−m}}∈{0,1}} P[∩_{j=1}^{r−m} {B_{t_j} = b_{t_j}}],

where the last sum equals 1 as an axiom of the probability distribution, and

P[∩_{i=1}^m {B_{s_i} = b_{s_i}}] =(i) Π_{i=1}^m P[B_{s_i} = b_{s_i}] = Π_{i=1}^m q_{s_i}^{b_{s_i}} (1 − q_{s_i})^{1−b_{s_i}}, (58)

and then apply the change of variables K(b_{s_1}, . . . , b_{s_m}) = ∪_{i=1}^m {s_i : b_{s_i} = 1} to identify the right-hand side of (55).

We can now prove Proposition 9:

Proof (of Proposition 9) In the same notation as in Lemma 20 and Lemma 21, we use that q_s = 1 − p_s is the success probability. Then, with (a_V)_{V∈2^r} the coefficients provided by Proposition 9, we can write

Σ_{V∈2^r} a_V E[Ψ(·, (w ⊙ 1_{ι(V)}) ⊙ f)] (Lemma 21) = Σ_{V∈2^r} a_V Σ_{K⊆V} Π_{i∈K} q_i Π_{i∈V\K} (1 − q_i) Ψ(·, w ⊙ 1_{ι(K)})
= Σ_{K∈2^r} Ψ(·, w ⊙ 1_{ι(K)}) Σ_{V : K⊆V} a_V Π_{i∈K} q_i Π_{i∈V\K} (1 − q_i)
= Σ_{K∈2^r} Ψ(·, w ⊙ 1_{ι(K)}) µ_K (Lemma 20) = Ψ(·, w ⊙ 1_{ι(⟦1,r⟧)}) = Ψ(·, w). (59)

Note finally that we obtain Proposition 9 after substituting q_s = 1 − p_s.

Appendix B. Proofs of Section 4

In this appendix we prove the results of Section 4.
In particular, we prove Lemmas 13, 14 and 18 as well as Theorem 17.

B.1 Proof of Lemma 13

Let Γ be a dropout tree. Let ℓ be a leaf of Γ at level k > 1. Let µ be the distribution of a random matrix F ∈ {0, 1}^{d_k×d_{k−1}} that satisfies for all r, c, P[F_rc = 1] ≥ β > 0. Assume that Γ satisfies ApProp_Γ(δ, ϵ), i.e.,

P[ sup_{x∈B(0,R)} sup_{x̃∈B(In_Γ(x),δ)} ∥Φ^{v_0}_Γ((x̃_ℓ)_ℓ) − Φ^{v_0}_{Γdet}(In_Γ(x))∥ > ϵ ] < ϵ^q/(4R_L)^q.

Define κ > 0 by

κ := ϵ^q/(4R_L)^q − P[ sup_{x∈B(0,R)} sup_{x̃∈B(In_Γ(x),δ)} ∥Φ^{v_0}_Γ(x̃) − Φ^{v_0}_{Γdet}(In_Γ(x))∥ > ϵ ].

By Lemma 14, there exists an η > 0 and an N_0 ∈ ℕ such that for all N ≥ N_0, if F^i are independent, identically distributed filter matrices distributed according to µ, and if V is given by (25), then

P[ sup_{z∈B(0,R_k)} sup_{(z̃^i)∈B(z,η)^N} ∥ σ_k( (1/N) Σ_{i=1}^N (V ⊙ F^i) z̃^i + b^{(k)} ) − σ_k( W^{(k)} z + b^{(k)} ) ∥ > δ ] < κ.

Now let Γ′ be a µ-input-copy of Γ at ℓ of size N ≥ N_0. Choose δ′ := min(δ, η). Consider the event A that

sup_{x∈B(0,R)} sup_{x̃∈B(In_Γ(x),δ)} ∥Φ^{v_0}_Γ((x̃_ℓ)_ℓ) − Φ^{v_0}_{Γdet}(In_Γ(x))∥ > ϵ,

which (informally) means that the tree Γ provides a bad approximation. Consider also the event B that

sup_{z∈B(0,R_{k−1})} sup_{(z̃_m)∈B(z,η)^N} ∥ σ_k( (1/N) Σ_{e∈into(ℓ)} (V^e ⊙ F^e) z̃_{m(e)} + b^e ) − σ_k( W^e z + b^e ) ∥ > δ,

where m(e) = source(e). Here, into(ℓ) refers to the dropout-tree Γ′. This event (informally) means that the added part provides a bad approximation. Note that

P[A ∪ B] ≤ P[A] + P[B] < P[A] + κ = ϵ^q/(4R_L)^q.

Next, let us show that on (A ∪ B)^c one has

sup_{x∈B(0,R)} sup_{x̃∈B(In_{Γ′}(x),δ′)} ∥Φ^{v_0}_{Γ′}((x̃_ℓ)_ℓ) − Φ^{v_0}_{Γ′det}(In_{Γ′}(x))∥ ≤ ϵ. (60)

To do this, suppose that (A ∪ B)^c holds and x ∈ B(0, R). For every leaf m ∈ children(ℓ) in Γ′ define z := In^m_{Γ′}(x) (recall the definition from (27)) and note that z ∈ B(0, R_{k−1}). Let x̃ ∈ B(In_{Γ′}(x), δ′). For every leaf m ∈ children(ℓ) define z̃_m := x̃_m ∈ B(z, η), by the choice of δ′. Since B^c holds,

∥ σ_k( (1/N) Σ_{e∈into(ℓ)} (V^e ⊙ F^e) z̃_{m(e)} + b^e ) − σ_k( W^e z + b^e ) ∥ ≤ δ,

or, in other words,

σ_k( (1/N) Σ_{e∈into(ℓ)} (V^e ⊙ F^e) z̃_{m(e)} + b^e ) ∈ B(In^ℓ_Γ(x), δ).

Together with A^c this implies (60).
Finally, by the law of total probability, we estimate

P[ sup_{x∈B(0,R)} sup_{x̃∈B(In_{Γ′}(x),δ′)} ∥Φ^{v_0}_{Γ′}((x̃_ℓ)_ℓ) − Φ^{v_0}_{Γ′det}(In_{Γ′}(x))∥ > ϵ ]
= P[ sup_{x∈B(0,R)} sup_{x̃∈B(In_{Γ′}(x),δ′)} ∥Φ^{v_0}_{Γ′}((x̃_ℓ)_ℓ) − Φ^{v_0}_{Γ′det}(In_{Γ′}(x))∥ > ϵ | A ∪ B ] P[A ∪ B] + P[ sup_{x∈B(0,R)} sup_{x̃∈B(In_{Γ′}(x),δ′)} ∥Φ^{v_0}_{Γ′}((x̃_ℓ)_ℓ) − Φ^{v_0}_{Γ′det}(In_{Γ′}(x))∥ > ϵ | (A ∪ B)^c ] P[(A ∪ B)^c]
≤ P[A ∪ B] + 0 < ϵ^q/(4R_L)^q.

This completes the proof of Lemma 13.

B.2 Proof of Lemma 14

Let 0 ≤ K < ∞ and ρ > 0 be fixed. For every N < ∞, the suprema in (30) over x, (x̃^i) are in fact attained, say at X*, (X̃^i_*), because σ is continuous and the optimization domain is closed and bounded. We now need to be careful, because X*, (X̃^i_*) depend on the collection {F^i}_{i∈⟦1,N⟧}. Recall that the continuity of σ implies that σ is also uniformly continuous on each compact set, i.e., for every ζ > 0 there exists an η_ζ > 0 such that for all x, y from this compact set,

|x − y| < η_ζ ⟹ |σ(y) − σ(x)| < ζ. (62)

Define X̄* := (1/N) Σ_{i=1}^N X̃^i_*. Then uniform continuity of σ implies that

∥ σ(W X̄* + b) − σ(W X* + b) ∥ ≤ sup_{x∈B(0,K)} sup_{y∈B(x,δ)} ∥ σ(W y + b) − σ(W x + b) ∥ =: γ_δ < ∞. (63)

Moreover, γ_δ is independent of N. Remark also that

sup_{f∈{0,1}^{m×n}} sup_{y∈B(0,δ)} ∥ ((W ⊙ f − W ⊙ E[F])/E[F]) y ∥ =: c_δ < ∞, (64)

where f stands for all possible deterministic realizations of the filters F, and the division by E[F] is understood element-wise. Again, c_δ is independent of N. Finally, by construction there exists a compact set C ⊆ ℝ^m such that the points

(1/N) Σ_{i=1}^N (V ⊙ F^i) X̃^i_* + b,  W X̄* + b (65)

lie in C with probability one. First, fix ζ = ρ/2. From uniform continuity of σ on the compact set C there exists η_ζ > 0 such that (62) holds for all x, y ∈ C. Second, observe that γ_δ → 0 and c_δ → 0 as δ → 0. Hence we can choose δ and fix it such that

0 < ζ < ρ − γ_δ and c_δ < η_ζ. (66)

Combining (63) with the triangle inequality and using (66), we arrive at

LHS (30) ≤ P[ Z > ρ − γ_δ ], where Z := ∥ σ( (1/N) Σ_{i=1}^N (V ⊙ F^i) X̃^i_* + b ) − σ( W X̄* + b ) ∥. (67)

Consider now the event

E := { ∥ (1/N) Σ_{i=1}^N (V ⊙ F^i) X̃^i_* − W X̄* ∥ < η_ζ }.
(68)

Then by the law of total probability and uniform continuity,

P[Z > ρ − γ_δ] = P[Z > ρ − γ_δ | E] P[E] + P[Z > ρ − γ_δ | E^c] P[E^c] ≤ 1[ζ > ρ − γ_δ] P[E] + P[E^c] =(66) P[E^c]. (69)

We proceed by bounding P[E^c]. Use the triangle inequality twice to establish that for any x ∈ B(0, K),

∥ (1/N) Σ_{i=1}^N (V ⊙ F^i) X̃^i_* − W X̄* ∥ ≤ ∥ (1/N) Σ_{i=1}^N ((W ⊙ F^i − W ⊙ E[F])/E[F]) x ∥ + ∥ (1/N) Σ_{i=1}^N ((W ⊙ F^i − W ⊙ E[F])/E[F]) (X̃^i_* − x) ∥ ≤(64) ∥ (1/N) Σ_{i=1}^N ((W ⊙ F^i − W ⊙ E[F])/E[F]) x ∥ + c_δ. (70)

Note now additionally that by Khinchin's weak law of large numbers,

(1/N) Σ_{i=1}^N W ⊙ F^i →P W ⊙ E[F] as N → ∞. (71)

Therefore, using (66), we get as N → ∞

P[E^c] ≤(70) P[ ∥ (1/N) Σ_{i=1}^N ((W ⊙ F^i − W ⊙ E[F])/E[F]) x ∥ ≥ η_ζ − c_δ ] →(71) 0. (72)

Bounding (69) by (72) completes the proof.

B.3 Proof of Theorem 17

We start by showing that for α small enough and for N large enough,

P[ sup_{x∈B(0,R)} ∥NN_{Γ,Ξ}(x) − Ψ(x, w)∥ > ϵ ] < ϵ^q/(4R_L)^q. (73)

Afterwards, we deduce the three assertions (41)–(43) from (73).

Proof of (73). Recall that the assumption ApProp_Γ(δ, ϵ) means that

P[ sup_{x∈B(0,R)} sup_{x̃∈B(In_Γ(x),δ)} ∥Φ^{v_0}_{Γ,Ξ}(x̃) − Φ^{v_0}_{Γdet,Ξ}(In_Γ(x))∥ > ϵ ] < ϵ^q/(4R_L)^q.

Define therefore κ > 0 by

κ := (1/#leaves(Γ)) ( ϵ^q/(4R_L)^q − P[ sup_{x∈B(0,R)} sup_{x̃∈B(In_Γ(x),δ)} ∥Φ^{v_0}_{Γ,Ξ}(x̃) − Φ^{v_0}_{Γdet,Ξ}(In_Γ(x))∥ > ϵ ] ). (74)

Observe now that the function Ψ_1 is continuous, and the function Φ^{v_0}_{Γdet} is continuous on (ℝ^{d_1})^{leaves(Γ)}. Since this implies uniform continuity on compact sets (see (62)), there exists a ζ > 0 such that whenever a function g : B(0, R) → Inp_Γ satisfies

sup_{x∈B(0,R)} sup_{ℓ∈leaves(Γ)} ∥g_ℓ(x) − Ψ_1(x; (W^{(1)}, b^{(1)}))∥ < ζ,

then we also have

sup_{x∈B(0,R)} ∥Φ^{v_0}_{Γdet}(g(x)) − Ψ(x, w)∥ < ϵ. (75)

Now choose ρ := min(δ/2, κ, ζ) and K := R, which we use as parameters for Lemma 18. This choice ensures that for α small enough and N large enough, for all leaves ℓ of Γ, by inequality (44),

P[ sup_{x∈B(0,R)} ∥Ξ_ℓ(x) − Ψ_1(x; (W^{(1)}, b^{(1)}))∥ > δ/2 ] < κ (76)

and by inequality (45),

sup_{x∈B(0,R)} ∥Ξ_{ℓ,avg filt}(x) − Ψ_1(x; (W^{(1)}, b^{(1)}))∥ < ζ.
(77)

Consider now the event A that there exists a leaf ℓ of Γ such that

sup_{x∈B(0,R)} ∥Ξ_ℓ(x) − Ψ_1(x; (W^{(1)}, b^{(1)}))∥ > δ/2.

From the law of total probability, it follows that

P[ sup_{x∈B(0,R)} ∥NN_{Γ,Ξ}(x) − Ψ(x, w)∥ > ϵ ]
= P[ sup_{x∈B(0,R)} ∥NN_{Γ,Ξ}(x) − Ψ(x, w)∥ > ϵ | A ] P[A] + P[ sup_{x∈B(0,R)} ∥NN_{Γ,Ξ}(x) − Ψ(x, w)∥ > ϵ | A^c ] P[A^c]. (78)

Observe that

P[ sup_{x∈B(0,R)} ∥NN_{Γ,Ξ}(x) − Ψ(x, w)∥ > ϵ | A ] ≤ 1, (79)

and by (i) Boole's inequality,

P[A] = P[ ∪_{ℓ∈leaves(Γ)} { sup_{x∈B(0,R)} ∥Ξ_ℓ(x) − Ψ_1(x; (W^{(1)}, b^{(1)}))∥ > δ/2 } ] ≤(i) Σ_{ℓ∈leaves(Γ)} P[ sup_{x∈B(0,R)} ∥Ξ_ℓ(x) − Ψ_1(x; (W^{(1)}, b^{(1)}))∥ > δ/2 ] ≤(76) κ · #leaves(Γ). (80)

Furthermore,

P[ sup_{x∈B(0,R)} ∥NN_{Γ,Ξ}(x) − Ψ(x, w)∥ > ϵ | A^c ] ≤(75) P[ sup_{x∈B(0,R)} sup_{x̃∈B(In_Γ(x),δ)} ∥Φ^{v_0}_{Γ,Ξ}(x̃) − Φ^{v_0}_{Γdet,Ξ}(In_Γ(x))∥ > ϵ ]. (81)

By bounding (78) using (79)–(81) and P[A^c] ≤ 1, we find that

P[ sup_{x∈B(0,R)} ∥NN_{Γ,Ξ}(x) − Ψ(x, w)∥ > ϵ ] < κ · #leaves(Γ) + P[ sup_{x∈B(0,R)} sup_{x̃∈B(In_Γ(x),δ)} ∥Φ^{v_0}_{Γ,Ξ}(x̃) − Φ^{v_0}_{Γdet,Ξ}(In_Γ(x))∥ > ϵ ] =(74) ϵ^q/(4R_L)^q.

This shows (73).

Next, we prove that (41)–(43) follow from (73).

Proof of (41). This inequality follows from (73) since R_L ≥ 1 by construction and q ≥ 1 by assumption.

Proof of (43). This inequality is a direct consequence of inequality (75), by choosing g_ℓ := Ξ_{ℓ,avg filt} and using (77).

Proof of (42). We will prove that, by the definition of R_j in (26), for all x ∈ B(0, R),

∥NN_{Γ,Ξ}(x)∥^q < R_L^q w.p. one, and ∥Ψ(x, w)∥^q < R_L^q. (83)

Namely, if (83) holds true, then (42) follows. To see the implication, consider the event D for which

sup_{x∈B(0,R)} ∥NN_{Γ,Ξ}(x) − Ψ(x, w)∥ > ϵ (84)

and apply the law of total expectation:

E[ sup_{x∈B(0,R)} ∥NN_{Γ,Ξ}(x) − Ψ(x, w)∥^q ] (85)
= E[ sup_{x∈B(0,R)} ∥NN_{Γ,Ξ}(x) − Ψ(x, w)∥^q | D ] P[D] + E[ sup_{x∈B(0,R)} ∥NN_{Γ,Ξ}(x) − Ψ(x, w)∥^q | D^c ] P[D^c].

By the triangle inequality,

E[ sup_{x∈B(0,R)} ∥NN_{Γ,Ξ}(x) − Ψ(x, w)∥^q | D ] ≤(83) (2R_L)^q.
(86)

On the other hand,

E[ sup_{x∈B(0,R)} ∥NN_{Γ,Ξ}(x) − Ψ(x, w)∥^q | D^c ] ≤(84) ϵ^q. (87)

Bound now (85) using (73), (86), (87), and the elementary bound P[D^c] ≤ 1, to obtain

E[ sup_{x∈B(0,R)} ∥NN_{Γ,Ξ}(x) − Ψ(x, w)∥^q ] < (2R_L)^q · ϵ^q/(4R_L)^q + ϵ^q.

That would prove (42). What remains is to prove (83).

Proof of (83). Observe immediately that the right inequality in (83) follows immediately, as 0 < β < 1 and Q > 1 (recall the definition of Ψ in (8)). Next, we will prove the left inequality in (83) by mathematical induction (recall the recursion in (34) and (35) that defines NN_{Γ,Ξ}).

Base case. Recall from (34) and (35) that the induction starts with the functions

Ξ_ℓ(x) := σ_1( (1/(2N)) Σ_{i=1}^{2N} (−1)^i (V^ℓ ⊙ F^{ℓ,i}) σ_0( (−1)^i α (I ⊙ G^{ℓ,i}) x ) + b^ℓ ), (88)

where element-wise

V^ℓ_rc = W^{(1)}_rc / ( (α/2)(σ_− + σ_+) E[F^ℓ_rc] E[G^ℓ_cc] ). (89)

We are now going to prove that for every x ∈ B(0, R), the point

(1/(2N)) Σ_{i=1}^{2N} (−1)^i (V^ℓ ⊙ F^{ℓ,i}) σ_0( (−1)^i α (I ⊙ G^{ℓ,i}) x ) ∈ B(0, β^{−1} ∥W^{(1)}∥_HS R_0) w.p. one. (90)

In particular, by the definition of R_1 in (26) (which implicitly deals with the bias b^ℓ), this implies that for all x ∈ B(0, R), ∥Ξ_ℓ(x)∥ ≤ R_1.

Start by noting that there exists an α_0 > 0 such that for all 0 < α ≤ α_0 and all ξ ∈ [−R, R] ⊆ ℝ we have

|σ_0(αξ)| ≤ 2(|σ_−| + |σ_+|) α |ξ|. (91)

It follows from the bound (91) that for all x ∈ B(0, R) and all i ∈ ⟦1, 2N⟧,

| (1/(α(σ_− + σ_+) E[G^ℓ_cc])) σ_0( α(I ⊙ G^{ℓ,i}) x )_c | < (2(|σ_−| + |σ_+|)/|σ_− + σ_+|) β^{−1} |x_c| w.p. one.

Since we assumed in (40) that 4(|σ_−| + |σ_+|)/|σ_− + σ_+| < Q, it follows that for all x ∈ B(0, R) and all i ∈ ⟦1, 2N⟧,

∥ (1/(α(σ_− + σ_+) E[G^ℓ])) σ_0( α(I ⊙ G^{ℓ,i}) x ) ∥ < Q β^{−1} R < R_0 w.p. one.

Therefore, for every x ∈ B(0, R),

∥ (1/(2N)) Σ_{i=1}^{2N} (−1)^i (V^ℓ ⊙ F^{ℓ,i}) σ_0( (−1)^i α(I ⊙ G^{ℓ,i}) x ) ∥ ≤ (1/(2N)) Σ_{i=1}^{2N} ∥ (W^{(1)} ⊙ F^{ℓ,i})/E[F^ℓ] ∥_HS R_0 ≤ β^{−1} ∥W^{(1)}∥_HS R_0 w.p. one,

where the division by E[F^ℓ] is element-wise. This proves (90).

Inductive step. In the definition of NN_{Γ,Ξ} we defined, for v not a leaf in Γ,

Φ^v_{Γ,Ξ} = σ_v( (1/#into(v)) Σ_{e∈into(v)} (V^e ⊙ F^e) Φ^{source(e)}_{Γ,Ξ} + b^e ).
By an inductive argument we find that for all x ∈ B(0, R), it holds that

∥ (1/#into(v)) Σ_{e∈into(v)} (V^e ⊙ F^e) Φ^{source(e)}_{Γ,Ξ}(x) ∥ ≤ β^{−1} ∥W^e∥_HS R_{level(v)−1} w.p. one,

so that by definition of R_{level(v)} it holds that ∥Φ^v_{Γ,Ξ}(x)∥ < R_{level(v)} w.p. one. In particular,

∥NN_{Γ,Ξ}(x)∥^q = ∥Φ^{v_0}_{Γ,Ξ}(x)∥^q < R_L^q w.p. one.

This proves (83). With that, Theorem 17 is proven.

B.4 Proof of Lemma 18

Proof of (44). Let 0 ≤ K < ∞, let ℓ be a leaf of Γ, and let ρ > 0. Recall that

Ψ_1(x; (W^{(1)}, b^{(1)})) = σ_1( W^{(1)} x + b^{(1)} ), (92)

and for x ∈ B(0, K), define

Z^ℓ_N(x) = (1/(2N)) Σ_{i=1}^{2N} (−1)^i (V^ℓ ⊙ F^{ℓ,i}) σ_0( (−1)^i α (I ⊙ G^{ℓ,i}) x ) + b^ℓ, (93)

so that Ξ_ℓ(x) = σ_1(Z^ℓ_N(x)). Note that the weights W are fixed and therefore uniformly bounded. Continuity of σ_0 and σ_1, boundedness of F and G, positivity of E[F^ℓ_rc] and E[G^ℓ_rc], and compactness of the optimization domain imply that the supremum of the optimization problem is attained, say at X* ∈ B(0, K). Just like in Appendix B.2, note that X* is random and depends on the collections {F^{ℓ,i}}_{i∈⟦1,2N⟧}, {G^{ℓ,i}}_{i∈⟦1,2N⟧}. Summarizing:

P[ sup_{x∈B(0,K)} |σ_1(Z^ℓ_N(x)) − σ_1(W^{(1)}x + b^{(1)})| > ρ ] = P[ |σ_1(Z^ℓ_N(X*)) − σ_1(W^{(1)}X* + b^{(1)})| > ρ ]. (94)

Note that here we slightly abuse notation by using the |·| sign not only for the absolute value of numbers but also, as in the last formula, for the Euclidean norm of a vector. By construction, there exists a compact set C so that the points

Z^ℓ_N(X*), W^{(1)}X* + b^{(1)} (95)

lie in C with probability one. The uniform continuity of σ_1 on C implies that for each ζ > 0 there exists η_ζ > 0 such that (62) holds for σ_1 and all x, y ∈ C. Fix ζ = ρ and introduce the event

D(X*) = { ∥ Z^ℓ_N(X*) − (W^{(1)}X* + b^{(1)}) ∥_2 < η_ρ }. (96)

By the law of total probability,

P[ |σ_1(Z^ℓ_N(X*)) − σ_1(W^{(1)}X* + b^{(1)})| > ρ ]
= P[ |σ_1(Z^ℓ_N(X*)) − σ_1(W^{(1)}X* + b^{(1)})| > ρ | D(X*) ] P[D(X*)] + P[ |σ_1(Z^ℓ_N(X*)) − σ_1(W^{(1)}X* + b^{(1)})| > ρ | D^c(X*) ] P[D^c(X*)]
≤ P[D^c(X*)].
(97)

We will next prove that for all x ∈ B(0, K),

P[D^c(x)] → 0 (98)

as α → 0 and N → ∞. Together with (97), this implies the result. Let x ∈ B(0, K). Component-wise,

( Z^ℓ_N(x) − (W^{(1)}x + b^{(1)}) )_r = (1/(2N)) Σ_{i=1}^{2N} Σ_c (−1)^i V^ℓ_rc F^{ℓ,i}_rc σ_0( (−1)^i α (I ⊙ G^{ℓ,i}) x )_c + b^ℓ_r − Σ_c W^{(1)}_rc x_c − b^{(1)}_r. (99)

Substituting (39) into (99), using the triangle inequality, and rearranging terms, we find that

| ( Z^ℓ_N(x) − (W^{(1)}x + b^{(1)}) )_r | ≤ Σ_c | W^{(1)}_rc ( ½(σ_− + σ_+) E[G^ℓ_cc] )^{−1} (1/(2Nα)) Σ_{i=1}^{2N} (F^{ℓ,i}_rc/E[F^ℓ_rc]) (−1)^i σ_0( (−1)^i α (I ⊙ G^{ℓ,i}) x )_c − W^{(1)}_rc x_c |. (100)

Note that the assumptions of the lemma imply that

max_{r,c} | W^{(1)}_rc ( ½(σ_− + σ_+) E[G^ℓ_cc] )^{−1} | ≤ C_{w,G,σ} < +∞.

We focus now on the term within brackets in (100). Let δ_1 > 0 and consider the event

E_{F,N}(δ_1) = { | (1/(2N)) Σ_{i=1}^{2N} F^{ℓ,i}_rc / E[F^ℓ_rc] − 1 | < δ_1 }. (101)

There exists C_1 > 0 such that, conditional on E_{F,N}(δ_1),

| (1/(2Nα)) Σ_{i=1}^{2N} (F^{ℓ,i}_rc/E[F^ℓ_rc] − 1) (−1)^i σ_0( (−1)^i α (I ⊙ G^{ℓ,i}) x )_c | ≤ C_1 δ_1, (102)

since the argument of σ_0 is uniformly bounded, and σ_0 is continuous. Moreover, there exists C_2 > 0 such that, conditional on E_{F,N}(δ_1),

| ( Z^ℓ_N(x) − (W^{(1)}x + b^{(1)}) )_r | ≤ Σ_c | W^{(1)}_rc ( ½(σ_− + σ_+) E[G^ℓ_cc] )^{−1} (1/(2Nα)) Σ_{i=1}^{2N} (−1)^i σ_0( (−1)^i α (I ⊙ G^{ℓ,i}) x )_c − W^{(1)}_rc x_c | + C_2 δ_1. (103)

Note that C_1, C_2 are independent of N, δ_1. Recall now that by (36) and (37) we can find for each γ > 0 an α > 0 such that for all y ∈ ℝ^{d_0} and all c,

| (1/α) σ_0(α y)_c − σ_{S(y_c)} y_c | < γ. (104)

Recall furthermore that max_{rc} G^ℓ_rc ≤ M < ∞ with probability one by assumption. Together, this implies that we can find for each γ > 0 an α > 0 such that for all x ∈ B(0, K) and all c,

| (1/α) σ_0( α (I ⊙ G^{ℓ,i}) x )_c − σ_{S(G^{ℓ,i}_cc x_c)} G^{ℓ,i}_cc x_c | < γ, i = 1, . . . , 2N. (105)

Fix γ ∈ (0, δ_1) and a corresponding α > 0. Then there exists a constant C_3 > 0, independent of δ_1, γ, N, such that conditional on E_{F,N}(δ_1),

| ( Z^ℓ_N(x) − (W^{(1)}x + b^{(1)}) )_r | ≤ Σ_c | W^{(1)}_rc ( ½(σ_− + σ_+) E[G^ℓ_cc] )^{−1} (1/(2N)) Σ_{i=1}^{2N} σ_{S((−1)^i G^{ℓ,i}_cc x_c)} G^{ℓ,i}_cc x_c − W^{(1)}_rc x_c | + C_3 δ_1
=(i) Σ_{c∈⟦1,d_0⟧ : |x_c|>0} | W^{(1)}_rc ( ½(σ_− + σ_+) E[G^ℓ_cc] )^{−1} (1/(2N)) Σ_{i=1}^{2N} σ_{S((−1)^i G^{ℓ,i}_cc x_c)} G^{ℓ,i}_cc x_c − W^{(1)}_rc x_c | + C_3 δ_1. (106)

To conclude (i), we used the fact that σ_0(0) = 0.
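The alternating-sign linearization driving this computation can be illustrated with a small Monte Carlo sketch. All concrete values below are illustrative assumptions, not taken from the paper: σ_0 = ReLU (so σ_− = 0 and σ_+ = 1), scalar Bernoulli filters G_i with E[G] = 0.9, and the rescaling 2/(α(σ_− + σ_+)E[G]) standing in for the weights V of (89).

```python
import random

# Illustrative assumptions (not from the paper): sigma0 = ReLU, so the
# one-sided derivatives at 0 are s_minus = 0 and s_plus = 1; scalar
# Bernoulli filters G_i with E[G] = 0.9; rescaling as in (89).
random.seed(0)
relu = lambda t: max(t, 0.0)
alpha, N, eg = 1e-3, 50_000, 0.9
s_minus, s_plus = 0.0, 1.0

def z(x):
    # (1/(2N)) * sum_{i=1}^{2N} (-1)^i * sigma0((-1)^i * alpha * G_i * x),
    # rescaled by 2 / (alpha * (s_minus + s_plus) * E[G]).  For either sign
    # of x only half the terms survive, and by the law of large numbers
    # their average recovers (1/2)(s_minus + s_plus) * alpha * E[G] * x.
    acc = 0.0
    for i in range(1, 2 * N + 1):
        g = 1.0 if random.random() < eg else 0.0
        acc += (-1.0) ** i * relu((-1.0) ** i * alpha * g * x)
    return acc / (2 * N) * 2.0 / (alpha * (s_minus + s_plus) * eg)

for x0 in (-1.5, 0.7):
    assert abs(z(x0) - x0) < 0.05  # the construction linearises sigma0 at 0
```

Note how both signs of the input are handled: for x > 0 the even-indexed terms activate the σ_+ branch, while for x < 0 the odd-indexed terms activate the σ_− branch (here trivial, since σ_− = 0 for ReLU), which is precisely why the factors (−1)^i are needed.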
By assumption G^{ℓ,i}_cc ≥ 0 with probability one, so if moreover |x_c| > 0, then S((−1)^i G^{ℓ,i}_cc x_c) = S((−1)^i x_c) with probability one (recall its definition in (37)). Thus there exists C_4, independent of δ_1, γ, N, such that conditional on the event E_{F,N}(δ_1) ∩ E_{G,N}(δ_1), where E_{G,N}(δ_1) is defined analogously to (101) with G in place of F,

| ( Z^ℓ_N(x) − (W^{(1)}x + b^{(1)}) )_r | ≤ Σ_{c∈⟦1,d_0⟧ : |x_c|>0} | W^{(1)}_rc ( ½(σ_− + σ_+) )^{−1} (1/(2N)) Σ_{i=1}^{2N} σ_{S((−1)^i x_c)} x_c − W^{(1)}_rc x_c | + C_4 δ_1 = C_4 δ_1. (107)

The last equality holds because the sum runs only over c with |x_c| > 0, for which (1/(2N)) Σ_{i=1}^{2N} σ_{S((−1)^i x_c)} = ½(σ_− + σ_+), so that every summand vanishes. All that remains is to prove that

P[ E_{F,N}(δ_1) ∩ E_{G,N}(δ_1) ] → 1 as N → ∞. (108)

This fact follows immediately from the independence of F, G, and a subsequent application of Khinchin's weak law of large numbers (which may be applied since the expectations of F and G are bounded). Note that δ_1 is an arbitrary parameter: choosing it such that C_4 δ_1 < η_ζ, and then choosing N sufficiently large, completes the proof of (44).

Proof of (45). The assertion (44) is proven for any (α, N)-precomposition associated with some distributions µ_ℓ, ν_ℓ with finite nonzero mean. In particular, the same argument shows that (44) holds when F^ℓ and G^ℓ are taken deterministic and equal to the expectations of the corresponding random variables (see also the discussion following (39)). This proves (45).

References

George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

Gwendoline De Bie, Gabriel Peyré, and Marco Cuturi. Stochastic deep networks. In International Conference on Machine Learning, pages 1556–1565. PMLR, 2019.

Dennis Elbrächter, Dmytro Perekrestenko, Philipp Grohs, and Helmut Bölcskei. Deep neural network approximation theory. IEEE Transactions on Information Theory, submitted Jan. 2019, revised June 2020. URL http://www.nari.ee.ethz.ch/pubs/p/deep-it-2019.

Andrew Foong, David Burt, Yingzhen Li, and Richard Turner. On the expressiveness of approximate inference in Bayesian neural networks.
Advances in Neural Information Processing Systems, 33, 2020.

Claudio Gallicchio and Simone Scardapane. Deep randomized neural networks. In Recent Trends in Learning From Data, pages 43–68. Springer, 2020.

Erol Gelenbe, Zhi-Hong Mao, and Yan-Da Li. Function approximation with spiked random networks. IEEE Transactions on Neural Networks, 10(1):3–9, 1999a.

Erol Gelenbe, Zhi-Hong Mao, and Yan-Da Li. Approximation by random networks with bounded number of layers. In Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop (Cat. No. 98TH8468), pages 166–175. IEEE, 1999b.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.

Boris Igelnik and Yoh-Han Pao. Stochastic choice of basis functions in adaptive function approximation and the functional-link net. IEEE Transactions on Neural Networks, 6(6):1320–1329, 1995.

Alex Labach, Hojjat Salehinejad, and Shahrokh Valaee. Survey of dropout methods for deep neural networks. arXiv preprint arXiv:1904.13310, 2019.

Moshe Leshno, Vladimir Ya. Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867, 1993.

Yuly Makovoz. Random approximants and neural networks. Journal of Approximation Theory, 85(1):98–109, 1996.

Yuly Makovoz. Uniform approximation by neural networks. Journal of Approximation Theory, 95(2):215–228, 1998.

Hien Nguyen. A universal approximation theorem for Gaussian-gated mixture of experts models. Available at SSRN 2946964, 2017.

Hien D.
Nguyen, Luke R. Lloyd-Jones, and Geoffrey J. McLachlan. A universal approximation theorem for mixture-of-experts models. Neural Computation, 28(12):2585–2593, 2016.

Yoh-Han Pao, Gwang-Hoon Park, and Dejan J. Sobajic. Learning and generalization characteristics of the random vector functional-link net. Neurocomputing, 6(2):163–180, 1994.

Allan Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143–195, 1999.

Ali Rahimi and Benjamin Recht. Uniform approximation of functions with random bases. In 2008 46th Annual Allerton Conference on Communication, Control, and Computing, pages 555–561. IEEE, 2008.

Stelios Timotheou. The random neural network: a survey. The Computer Journal, 53(3):251–267, 2010.

Stefan Wager, Sida Wang, and Percy S. Liang. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems, pages 351–359, 2013.

Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. Regularization of neural networks using DropConnect. In International Conference on Machine Learning, pages 1058–1066, 2013.

Halbert White. An additional hidden unit test for neglected nonlinearity in multilayer feedforward networks. In Proceedings of the International Joint Conference on Neural Networks, volume 2, pages 451–455, Washington, DC, 1989.

Yonghua Yin. Random neural network methods and deep learning. Probability in the Engineering and Informational Sciences, pages 1–31, 2019.