Published as a conference paper at ICLR 2022

DEEP RELU NETWORKS PRESERVE EXPECTED LENGTH

Boris Hanin
Dept. of Operations Research & Financial Engineering, Princeton University, Princeton, NJ 08544, USA
bhanin@princeton.edu

Ryan Jeong
Dept. of Mathematics, University of Pennsylvania, Philadelphia, PA 19104, USA
rsjeong@sas.upenn.edu

David Rolnick
School of Computer Science, McGill University, Montréal, QC H3A 0G4, Canada
drolnick@cs.mcgill.ca

ABSTRACT

Assessing the complexity of functions computed by a neural network helps us understand how the network will learn and generalize. One natural measure of complexity is how the network distorts length: if the network takes a unit-length curve as input, what is the length of the resulting curve of outputs? It has been widely believed that this length grows exponentially in network depth. We prove that in fact this is not the case: the expected length distortion does not grow with depth, and indeed shrinks slightly, for ReLU networks with standard random initialization. We also generalize this result by proving upper bounds both for higher moments of the length distortion and for the distortion of higher-dimensional volumes. These theoretical results are corroborated by our experiments.

1 INTRODUCTION

The utility of deep neural networks ultimately arises from their ability to learn functions that are sufficiently complex to fit training data and yet simple enough to generalize well. Understanding the precise sense in which functions computed by a given network have low or high complexity is therefore important for studying when the network will perform well. Despite the fundamental importance of this question, our mathematical understanding of the functions expressed and learned by different neural network architectures remains limited.

A popular way to measure the complexity of a neural network function is to compute how it distorts lengths. This may be done by considering a set of inputs lying along a curve and measuring the length of the corresponding curve of outputs. It has been claimed in prior literature that in a ReLU network this length distortion grows exponentially with the network's depth (Price & Tanner, 2019; Raghu et al., 2017), and this has been used as a justification of the power of deeper networks. We prove that, in fact, for networks with the typical initialization used in practice, expected length distortion does not grow at all with depth. Our main contributions are:

1. We prove that for ReLU networks initialized with the usual 2/fan-in weight variance, the expected length distortion does not grow with depth at initialization, actually decreasing slightly with depth (Thm. 3.1) and exhibiting an interesting width dependence.

2. We prove bounds on higher moments of the length distortion, giving upper bounds that hold with high probability (Thm. 4.1). We also obtain similar results for the distortion in the volume of higher-dimensional manifolds of inputs (Thm. 4.2).

3. We empirically verify that our theoretical results accurately predict observed behavior for networks at initialization, while previous bounds are loose and fail to capture subtle architecture dependencies.

It is worth explaining why our conclusions differ from those of Price & Tanner (2019) and Raghu et al. (2017).
First, prior authors prove only lower bounds on the expected length distortion, while we use a different methodology to calculate tight upper bounds, allowing us to say that the expected length distortion does not grow with depth. As we show in Thm. 5.1 and Fig. 1, the prior bounds are in fact quite loose. Second, Thm. 3(a) in Raghu et al. (2017) has a critical typo¹, which has unfortunately been perpetuated in Thm. 1 of Price & Tanner (2019). Namely, a factor of √2 was omitted in the following statement (paraphrased from the original):

E[length distortion] = Ω((σ_w / √2)^depth),

where σ_w²/width is the variance of the weight distribution. As we prove in Thm. 3.1, leaving out the √2 makes the statement false, incorrectly suggesting that length distortion explodes with depth for the standard He initialization σ_w = √2. Finally, prior authors drew the conclusion that length distortion grows exponentially with depth by considering the behavior of ReLU networks with unrealistically large weights. If one multiplies the weights and biases of a ReLU network by C, one multiplies the length distortion by C^depth, so it should come as no surprise that there exist settings of the weights for which the distortion grows (or decays) exponentially with depth. The value of our results comes in analyzing the behavior specifically at He initialization (σ_w = √2) (He et al., 2015). This initialization is the one used in practice, since it is the weight variance that must be used if the outputs (Hanin & Rolnick, 2018) and gradients (Hanin, 2018) are to remain well-controlled at initialization. In the present work, we show that this is also the correct initialization for the expected length distortion to remain well-behaved.

2 RELATED WORK

A range of complexity measures for functions computed by deep neural networks have been considered in the literature; the prior work falls into at least three categories. In the first, the emphasis is on worst-case (or best-case) scenarios: what is the maximal possible complexity of functions computed by a given network architecture? These works essentially study the expressive power of neural networks and often focus on showing that deep networks are able to express functions that cannot be expressed by shallower networks. For example, it has been shown that it is possible to set the weights of a deep ReLU network such that the number of linear regions computed by the network grows exponentially in the depth (Daniely, 2017; Eldan & Shamir, 2016; Montúfar et al., 2014; Telgarsky, 2015; 2016). Other works consider the degree of polynomials approximable by networks of different depths (Lin et al., 2017; Rolnick & Tegmark, 2018) and the topological invariants of networks (Bianchini & Scarselli, 2014).

While such work has sometimes been used to explain the utility of different neural network architectures (especially deeper ones), a second strand of prior work has shown that a significant gap can exist between the functions expressible by a given architecture and those which may be learned in practice. Such average-case analyses have indicated that some readily expressible functions are provably difficult to learn (Shalev-Shwartz et al., 2017) or vanishingly unlikely for random networks (Hanin & Nica, 2020; Hanin & Rolnick, 2019; 2020), and that some functions learned more easily by deep architectures are nonetheless expressible by shallow ones (Ba & Caruana, 2014).
(While here we consider neural nets with ReLU activation, it is worth noting that for arithmetic circuits, worst-case and average-case scenarios may be more similar: in both scenarios, a significant gap exists between the matricization rank of the functions computed by deep and shallow architectures (Cohen et al., 2016).) As noted earlier, average-case analyses in Price & Tanner (2019) and Raghu et al. (2017) provided lower bounds on expected length distortion, while Poole et al. (2016) presented a similar analysis for the curvature of output trajectories.

Finally, a number of authors have sought complexity measures that either empirically or theoretically correlate with generalization (Jiang et al., 2019). Such measures have been based on classification margins (Bartlett et al., 2017), network compressibility (Arora et al., 2018), and PAC-Bayes considerations (Dziugaite & Roy, 2017).

¹ We have confirmed this in personal correspondence with the authors, and it is simple to verify that the typo arises in the last step of the proof given in the Supplementary Material of Raghu et al. (2017).

3 EXPECTED LENGTH DISTORTION

3.1 MOTIVATION

While neural networks are typically overparameterized and trained with little or no explicit regularization, the functions they learn in practice are often able to generalize to unseen data. This phenomenon indicates an implicit regularization that causes these learned functions to be surprisingly simple. Why this occurs is not well understood theoretically, but there is a high-level intuition. In a nutshell, randomly initialized neural networks will compute functions of low complexity. Moreover, as optimization by first-order methods is a kind of greedy local search, network training is attracted to minima of the loss that are not too far from the initialization and hence will still be well-behaved.

While this intuition is compelling, a key challenge in making it rigorous is to devise appropriate notions of complexity that are small throughout training. The present work is intended to provide a piece of the puzzle, making precise the idea that, at the start of training, neural networks compute tame functions. Specifically, we demonstrate that neural networks at initialization have low distortion of length and volume, as defined in §3.2.

An important aspect of our analysis is that we study networks in typical, average-case scenarios. A randomly initialized neural network could, in principle, compute any function expressible by a network of that architecture, since the weights might with low probability take on any set of values. Some settings of weights will lead to functions of high complexity, but these settings may be unlikely to occur in practice, depending on the distribution over weights that is under consideration. As prior work has emphasized (Price & Tanner, 2019; Raghu et al., 2017), atypical distributions of weights can lead to exponentially high length distortion. We show that, in contrast, for deep ReLU networks with the standard initialization, the functions computed have low distortion in expectation and with high probability.

Figure 1: Mean length distortion as a function of depth, for randomly initialized ReLU networks of varying architecture. As described in Thm. 3.1, distortion not only fails to grow, but shrinks with depth, especially for networks of small width. (a) compares the predictions of our Thm. 5.1 (solid lines) to the lower bounds proven in prior work by Raghu et al.
(2017) (dashed lines) and the true empirical means (colored dots), which closely track our predictions. (b) zooms in on the upper part of (a), showing the empirical mean length distortion as a function of depth for different widths. The horizontal black line is y = Γ(3)/(Γ(2.5)·√2.5) ≈ 0.95, the mean length distortion we predict when n_0 = n_L in the limit of infinite width for any fixed depth. In each network, width is constant in all hidden layers, while input and output dimension are both 5. The input curve is a fixed line segment of unit length. Length distortion is calculated for 500 different initializations of the weights and biases of the network (the weight variance is 2/fan-in). For further experimental details, see Appendix B.

3.2 DEFINITIONS

Let L ≥ 1 be a positive integer and fix positive integers n_0, ..., n_L. We consider a fully connected feed-forward ReLU network N with input dimension n_0, output dimension n_L, and hidden layer widths n_1, ..., n_{L-1}. Suppose that the weights and biases of N are independent and Gaussian, with the weights W^(ℓ)_{ij} between layers ℓ-1 and ℓ and the biases b^(ℓ)_j in layer ℓ satisfying

W^(ℓ)_{ij} ~ G(0, 2/n_{ℓ-1}),   b^(ℓ)_j ~ G(0, C_b).   (1)

Here, C_b > 0 is any fixed constant and G(µ, σ²) denotes a Gaussian with mean µ and variance σ². For any M ⊆ R^{n_0}, we denote by N(M) ⊆ R^{n_L} the (random) image of M under the map x ↦ N(x). Our primary object of study will be the size of the output N(M) relative to that of M. Specifically, when M is a one-dimensional curve, we define

length distortion := len(N(M)) / len(M).

Note that while a priori this random variable depends on M, we will find that its statistics depend only on the architecture of N (see Thms. 3.1 and 5.1).

3.3 RESULTS

We prove that the expected length distortion does not grow with depth; in fact, it slightly decreases. Our result may be informally stated thus (for a formal statement and proof, see Thm. 5.1 and App. C):

Theorem 3.1 (Length distortion: Mean; Informal Statement of Theorem 5.1). Consider a ReLU network of depth L, input dimension n_0, output dimension n_L, and hidden layer widths n_ℓ, with weights given by standard He normal initialization (He et al., 2015). The expected length distortion is upper bounded by √(n_L/n_0). More precisely,

E[length distortion] ≤ C · √(n_L/n_0),   where C := √(2/n_L) · Γ((n_L+1)/2) / Γ(n_L/2) ≤ 1.

Figure 2: Mean length distortion as a function of the ratio of output to input dimension, for ReLU networks with various architectures (e.g. [10, 10, 10] denotes three hidden layers, each of width 10). All networks are randomly initialized as in Figure 1. We sample 100 pairs of input dimension n_0 and output dimension n_L, each at most 50, such that the ratio of output to input dimension is distinct for each such pair. For each pair, 200 different network initializations are tested and the resulting length distortion is calculated; the log of the empirical mean is plotted. The dashed black line plots log(y) = (1/2) log(x), the approximate prediction of Theorem 3.1.

Our experiments align with the theorem; a minimal simulation in the spirit of these experiments is sketched below.
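The following Python sketch is illustrative only and is not the authors' code: the helper names (make_net, apply_net, mean_length_distortion) are our own, and the bias variance 0.1 follows the choice reported in Appendix B. It estimates the mean length distortion of a He-initialized ReLU network on a unit-length segment by discretizing the segment and averaging over random initializations, to be compared with the bound √(n_L/n_0) of Theorem 3.1.

```python
# Minimal Monte-Carlo sketch of the Figure 1/2 experiment, under the stated
# assumptions: weights ~ N(0, 2/fan-in), biases ~ N(0, 0.1), unit-length input segment.
import numpy as np

def make_net(widths, bias_var=0.1, rng=None):
    """Sample He-normal weights and Gaussian biases for a fully connected net."""
    rng = np.random.default_rng() if rng is None else rng
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
        b = rng.normal(0.0, np.sqrt(bias_var), size=n_out)
        params.append((W, b))
    return params

def apply_net(params, X):
    """Apply the network to a batch X of shape (num_points, n0); ReLU on hidden layers."""
    for W, b in params[:-1]:
        X = np.maximum(X @ W.T + b, 0.0)
    W, b = params[-1]
    return X @ W.T + b

def mean_length_distortion(widths, n_trials=200, n_points=1000, seed=0):
    rng = np.random.default_rng(seed)
    n0 = widths[0]
    u = rng.normal(size=n0); u /= np.linalg.norm(u)   # direction of the unit segment
    base = rng.normal(size=n0)
    ts = np.linspace(0.0, 1.0, n_points)
    inputs = base + ts[:, None] * u                   # discretized unit-length segment
    lengths = []
    for _ in range(n_trials):
        outputs = apply_net(make_net(widths, rng=rng), inputs)
        lengths.append(np.linalg.norm(np.diff(outputs, axis=0), axis=1).sum())
    return np.mean(lengths)                           # input length is 1, so this is the distortion

widths = [5] + [20] * 10 + [5]                        # n0 = nL = 5, ten hidden layers of width 20
print(mean_length_distortion(widths), np.sqrt(widths[-1] / widths[0]))
# Theorem 3.1 predicts the first number to be at most the second (here, 1),
# and to shrink slowly as depth grows.
```

The polyline length slightly underestimates len(N(M)) for a coarse discretization; increasing n_points reduces this error.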
In Figure 1, we compute the empirical mean of the length distortion for randomly initialized deep ReLU networks of various architectures with fixed input and output dimension. The figure confirms three of our key predictions: (1) the expected length distortion does not grow with depth, and actually decreases slightly; (2) the decrease happens faster for narrower networks (since 1/n_ℓ is larger for smaller n_ℓ); and (3) for equal input and output dimension n_0 = n_L, there is an upper bound of 1 (in fact, the tighter upper bound of C is approximately 0.9515 for n_L = 5, as shown in the figure). The figure also shows the prior bounds in Raghu et al. (2017) to be quite loose. For further experimental details, see Appendix B.

In Fig. 2, we instead fix the set of hidden layer widths, but vary input and output dimensions. The results confirm that the expected length distortion indeed grows as the square root of the ratio of output to input dimension.

Note that our theorem also applies for initializations other than He normal; the result in such cases is simply less interesting. Suppose we initialize the weights in each layer ℓ as i.i.d. normal with variance 2c²/n_{ℓ-1} instead of 2/n_{ℓ-1}. This is equivalent to taking a He normal initialization and multiplying the weights by c. Multiplying the weights and biases in a depth-L ReLU network by c simply multiplies the output (and hence the length distortion) by c^L, so the expected distortion is c^L times the value given in Thm. 3.1. This emphasizes why an exponentially large distortion necessarily occurs if one sets the weights too large.

3.4 INTUITIVE EXPLANATION

The purpose of this section is to give an intuitive but technical explanation for why, in Theorem 3.1, we find that the distortion len(N(M))/len(M) of the length of a 1D curve M under the neural network N is typically not much larger than 1, even for deep networks. Our starting point is that, near a given point x ∈ M, the length of the image under N of a small portion of M with length dx is approximately given by ||J_x u|| dx, where J_x is the input-output Jacobian of N at the input x and u is the unit tangent vector to M at x. Similar estimates on such Jacobian-vector products appear in prior work (see Fact 7.2 in Allen-Zhu et al. (2019)). In fact, in Lemma C.1 of the Supplementary Material, we use a simple argument to give upper bounds on the moments of len(N(M))/len(M) in terms of the moments of the norm ||J_x u|| of the Jacobian-vector product J_x u.

Figure 3: Length distortion as a function of Σ_{ℓ=1}^{L} n_ℓ^{-1}, showing both mean and standard deviation across initializations. We test several types of network architecture: with constant width 20, alternating between widths 30 and 10, and with the first (respectively, second) half of the layers of width 30 and the rest width 10. Each architecture type is tested for several depths. For each such network, we use n_0 = n_L = 5 and compute length distortion for 100 initializations on a fixed line segment. As predicted in Thm. 4.1, the mean length distortion decreases with the sum of width reciprocals. Empirical standard deviation does not, in general, increase with greater depth, remaining modest throughout, as is consistent with the upper bound on variance in Thm. 4.1.

Thus, upper bounds on the length distortion reduce to studying the Jacobian J_x, which in a network of depth L can be written as a product J_x = J_{L,x} ⋯ J_{1,x} of the layer-(ℓ-1)-to-layer-ℓ Jacobians J_{ℓ,x}. In a worst-case analysis, the left singular vectors of J_{ℓ,x} would align with the right singular vectors of J_{ℓ+1,x}, causing the resulting product J_x to have a largest singular value that grows exponentially with L.
However, at initialization, this is provably not the case with high probability. Indeed, the Jacobians J_{ℓ,x} are independent (see Lemma C.3), and their singular vectors are therefore incoherent. We find in particular (Lemma C.2) the following equality in distribution:

||J_x u|| = ∏_{ℓ=1}^{L} ||J_{ℓ,x} e_1||   (in distribution),

where e_1 is the first standard unit vector and the terms in the product are independent. On average, each term in this product is close to 1:

E[ ||J_{ℓ,x} e_1|| ] = 1 + O(n_ℓ^{-1}).

This is a consequence of initializing weights to have variance 2/fan-in. Put together, the two preceding displayed equations suggest writing

||J_x u|| = exp( Σ_{ℓ=1}^{L} log ||J_{ℓ,x} e_1|| ).

The terms being summed in the exponent are independent, and the argument of each logarithm scales like 1 + O(n_ℓ^{-1}). With probability 1 - δ we have, for all ℓ, that log ||J_{ℓ,x} e_1|| ≤ c n_ℓ^{-1}, where c is a constant depending only on δ. Thus, all together,

||J_x u|| ≤ exp( c Σ_{ℓ=1}^{L} n_ℓ^{-1} )

with high probability. In particular, in the simple case where the n_ℓ are proportional to a single large constant n, we find the typical size of ||J_x u|| is exponential in L/n rather than exponential in L, as in the worst case. If N is wider than it is deep, so that L/n is bounded above, the typical size of ||J_x u||, and hence of len(N(M))/len(M), remains bounded. Theorem 5.1 makes this argument precise. Moreover, let us note that articles such as Daniely (2017); Giryes et al. (2016); Poole et al. (2016) show low distortion of angles and volumes in very wide networks. Our results, however, apply directly to more realistic network widths.

4 FURTHER RESULTS

4.1 HIGHER MOMENTS

We have seen that the mean length distortion is upper-bounded by 1, and indeed by a function of the architecture that decreases with the depth of the network. However, a small mean is insufficient by itself to ensure that typical networks have low length distortion. For this, we now show that in fact all moments of the length distortion are well-controlled. Specifically, the variance is bounded above by the ratio of output to input dimension, and higher moments are upper-bounded by a function that grows very slowly with depth. Our results may be informally stated thus:

Theorem 4.1 (Length distortion: Higher moments). Consider, as before, a ReLU network of depth L, input dimension n_0, output dimension n_L, and hidden layer widths n_ℓ, with weights given by He normal initialization. We have the following bounds on higher moments of the length distortion:

Var[length distortion] ≤ n_L/n_0   and   E[(length distortion)^m] ≤ (n_L/n_0)^{m/2} · exp( c m² Σ_{ℓ=1}^{L-1} n_ℓ^{-1} )

for some universal constant c > 0, provided m is at most a small constant multiple of the minimum hidden layer width. A formal statement and proof of this result are given in Theorem 5.1 and Appendix C.

We consider the mean and variance of length distortion for a wide range of network architectures in Figure 3. The figure confirms that the variance is modest for all architectures and does not increase with depth, and that the mean length distortion is consistently decreasing with the sum of layer width reciprocals, as predicted by Theorem 4.1.

4.2 HIGHER-DIMENSIONAL VOLUMES

Another natural generalization to consider is how ReLU networks distort higher-dimensional volumes: given a d-dimensional input M, the output N(M) will in general also be d-dimensional, and we can therefore consider the volume distortion vol_d(N(M))/vol_d(M). The theorems presented in §3.3 for d = 1 can be extended to all d, and the results may be informally stated thus:

Theorem 4.2 (Volume distortion).
Consider a ReLU network of depth L, input dimension n_0, output dimension n_L, and hidden layer widths n_ℓ, with weights given by He normal initialization. Both the squared mean and the variance of the volume distortion are upper-bounded by (n_L/n_0)^d.

A formal statement and proof of this result are given in Theorem 5.2 and Appendix D.

5 FORMAL STATEMENTS

In this section, we provide formal statements of our results. Full proofs are given in the appendices.

5.1 ONE-DIMENSIONAL MANIFOLDS

Fix a smooth one-dimensional manifold M ⊆ R^{n_0} (i.e. a curve) with unit length. Each point on M represents a possible input to our ReLU network N, which we recall has input dimension n_0, output dimension n_L, and hidden layer widths n_1, ..., n_{L-1}. The function x ↦ N(x) computed by N is continuous and piecewise linear. Thus, the image N(M) ⊆ R^{n_L} of M under this map is a piecewise smooth curve in R^{n_L} with a finite number of pieces, and its length len(N(M)) is a random variable. Our first result states that, provided N is wider than it is deep, this distortion is small with high probability at initialization.

Theorem 5.1. Let N be a fully connected ReLU network with input dimension n_0, output dimension n_L, hidden layer widths n_1, ..., n_{L-1}, and independent centered Gaussian weights/biases with the weights having variance 2/fan-in (as in (1)). Let M be a 1-dimensional curve of unit length in R^{n_0}. Then the mean length E[len(N(M))] equals

(n_L/n_0)^{1/2} · (2/n_L)^{1/2} · Γ((n_L+1)/2)/Γ(n_L/2) · exp( -(5/8) Σ_{ℓ=1}^{L-1} n_ℓ^{-1} + O( Σ_{ℓ=1}^{L-1} n_ℓ^{-2} ) ),   (2)

where the implied constants in O(·) are universal and Γ(·) denotes the Gamma function. Moreover,

Var[len(N(M))] ≤ n_L/n_0.   (3)

Finally, there exist universal constants c_1, c_2 > 0 such that if m < c_1 min{n_1, ..., n_{L-1}}, then

E[(len(N(M)))^m] ≤ (n_L/n_0)^{m/2} · exp( c_2 m² Σ_{ℓ=1}^{L-1} n_ℓ^{-1} ).   (4)

Theorem 5.1 is proved in Appendix C. Several comments are in order.

First, we have assumed for simplicity that M has unit length. For curves of arbitrary length, the equality (2) and the bounds (3) and (4) all hold and are derived in the same way, provided len(N(M)) is replaced by the distortion len(N(M))/len(M) per unit length.

Second, in the expression (2), note that the exponent tends to 0 as n_ℓ → ∞ for any fixed L. More precisely, when n_ℓ = n is large and constant, it scales as -5L/8n. This shows that, for fixed n_0, n_L, n, the mean E[len(N(M))] is actually decreasing in L. This somewhat surprising phenomenon is borne out in Fig. 1 and is a consequence of the fact that wide fully connected ReLU nets are strongly contracting (see e.g. Thms. 3 and 4 in Giryes et al. (2016)).

Third, the pre-factor (n_L/n_0)^{1/2} encodes the fact that, on average over initialization, for any vector of inputs x ∈ R^{n_0}, we have (see e.g. Cor. 1 in Hanin & Rolnick (2018))

E[ ||N(x)||² ] ≈ (n_L/n_0) · ||x||²,   (5)

where the approximate equality becomes exact if the bias variance C_b is set to 0 (see (1)). In other words, at the start of training, the network N rescales inputs by an average factor of (n_L/n_0)^{1/2}. This overall scaling factor also dilates len(N(M)) relative to len(M).

Fourth, we stated Theorem 5.1 for Gaussian weights and biases (see (1)), but the result holds for any weight distribution that is symmetric around 0, has variance 2/fan-in, and has finite higher moments. The only difference is that the constant 5/8 may need to be slightly enlarged (see the discussion around (19)).

Finally, all the estimates (2), (3), and (4) are consistent with the statement that, up to the scaling factor (n_L/n_0)^{1/2}, the random variable len(N(M)) is bounded above by a log-normal random variable exp(G(-β/2, β)), where β = c Σ_{ℓ=1}^{L-1} n_ℓ^{-1} and c is a fixed constant. For concreteness, the leading-order prediction in (2) is easy to evaluate numerically; a minimal sketch follows.
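The snippet below is a hedged illustration only: it evaluates formula (2) as reconstructed above, drops the O(Σ_ℓ n_ℓ^{-2}) correction, and is not the authors' code.

```python
# Leading-order evaluation of the mean-length formula (2): the prefactor
# sqrt(nL/n0) * sqrt(2/nL) * Gamma((nL+1)/2) / Gamma(nL/2) times
# exp(-(5/8) * sum of reciprocal hidden widths); the O(...) term is ignored.
from math import exp, gamma, sqrt

def predicted_mean_distortion(n0, nL, hidden_widths):
    prefactor = sqrt(nL / n0) * sqrt(2.0 / nL) * gamma((nL + 1) / 2) / gamma(nL / 2)
    return prefactor * exp(-(5.0 / 8.0) * sum(1.0 / n for n in hidden_widths))

for depth in (2, 5, 10, 20):
    hidden = [20] * (depth - 1)          # a depth-L network has L-1 hidden layers
    print(depth, round(predicted_mean_distortion(5, 5, hidden), 4))
# For n0 = nL = 5 the values stay below Gamma(3)/(Gamma(2.5)*sqrt(2.5)) ~ 0.95
# and decrease slowly with depth, as in Figure 1(b).
```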
5.2 HIGHER-DIMENSIONAL MANIFOLDS

The results in the previous section generalize to the case when M has higher dimension. To consider this case, fix a positive integer d ≤ min{n_0, ..., n_L} and a smooth d-dimensional manifold M ⊆ R^{n_0} of unit volume, vol_d(M) = 1. (Note that if d > min{n_0, ..., n_L}, then N(M) is at most (d-1)-dimensional and its d-dimensional volume vanishes.) The image N(M) ⊆ R^{n_L} is a piecewise smooth manifold of dimension at most d. The following result, proved in Appendix D, gives an upper bound on the typical value of the d-dimensional volume of N(M).

Theorem 5.2. Let N be a fully connected ReLU network with input dimension n_0, output dimension n_L, hidden layer widths n_1, ..., n_{L-1}, and independent centered Gaussian weights/biases with the variance of the weights given by 2/fan-in (see (1)). Let M be a d-dimensional smooth submanifold of R^{n_0} with unit volume and d ≤ min{n_0, ..., n_L}. Then both the squared mean and the variance of the d-dimensional volume vol_d(N(M)) of N(M) are bounded above by

(n_L/n_0)^d.   (6)

6 PROOF SKETCH

The purpose of this section is to explain the main steps in obtaining the mean and variance estimates (2) and (3) from Theorem 5.1. In several places, we will gloss over some mathematical subtleties that we deal with in the detailed proof of Theorem 5.1 given in Appendix C. We will also content ourselves here with proving a slightly weaker estimate than (2) and show instead simply that

E[len(N(M))] ≤ (n_L/n_0)^{1/2}   and   Var[len(N(M))] ≤ n_L/n_0.   (7)

We refer the reader to Appendix C for a derivation of the more refined result stated in Theorem 5.1. Since E[X]² ≤ E[X²] and Var[X] ≤ E[X²], both estimates in (7) follow from

E[len(N(M))²] ≤ n_L/n_0.   (8)

To obtain this bound, we begin by relating moments of len(N(M)) to those of the input-output Jacobian of N at a single point, for which we need some notation. Namely, let us choose a unit speed parameterization of M; that is, fix a smooth mapping γ : R → R^{n_0}, γ(t) = (γ_1(t), ..., γ_{n_0}(t)), for which M is the image under γ of the interval [0, 1]. That γ has unit speed means that for every t ∈ [0, 1] we have ||γ'(t)|| = 1, where γ'(t) := (γ_1'(t), ..., γ_{n_0}'(t)). Then the mapping Γ := N ∘ γ (for Γ : R → R^{n_L}) gives a parameterization of the curve N(M). Note that this parameterization is not unit speed. Rather, the length of N(M) is given by

len(N(M)) = ∫_0^1 ||Γ'(t)|| dt.   (9)

Intuitively, the integrand ||Γ'(t)|| dt computes, at a given t, the length of Γ([t, t + dt]) as dt → 0. The following lemma uses (9) to bound the moments of len(N(M)) in terms of the moments of the norm of the input-output Jacobian J_x, defined for any x ∈ R^{n_0} by

J_x := ( ∂N_i/∂x_j (x) )_{1 ≤ i ≤ n_L, 1 ≤ j ≤ n_0},   (10)

where N_i is the i-th component of the network output.

Lemma 6.1. We have

E[len(N(M))²] ≤ ∫_0^1 E[ ||J_{γ(t)} γ'(t)||² ] dt.   (11)

Sketch of Proof. Taking powers in (9) and interchanging the expectation and integrals, we obtain

E[len(N(M))²] = ∫_0^1 ∫_0^1 E[ ||Γ'(t_1)|| · ||Γ'(t_2)|| ] dt_1 dt_2.   (12)

Applying the inequality ab ≤ (a² + b²)/2, for a, b ∈ R, to the integrand in (12), we conclude

E[len(N(M))²] ≤ ∫_0^1 E[ ||Γ'(t)||² ] dt.   (13)

The chain rule yields Γ'(t) = J_{γ(t)} γ'(t). Substituting this into (13) completes the proof.
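The reduction above is easy to check numerically. The following hedged sketch (illustrative, not from the paper; the forward-mode helper and the Riemann-sum approximation of (9) are our own choices) compares the polyline length of N(M) with the integral of ||J_{γ(t)} γ'(t)|| for one randomly initialized network:

```python
# For a single He-initialized network, estimate len(N(M)) in two ways:
# (a) the polyline length of the image of a finely discretized unit segment, and
# (b) a Riemann sum for (9), using forward-mode propagation of the tangent
#     vector to obtain Gamma'(t) = J_{gamma(t)} gamma'(t).
import numpy as np

rng = np.random.default_rng(0)
widths = [5, 20, 20, 20, 5]
params = []
for n_in, n_out in zip(widths[:-1], widths[1:]):
    params.append((rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in)),
                   rng.normal(0.0, np.sqrt(0.1), size=n_out)))

def forward_with_jvp(x, v):
    """Return (N(x), J_x v): propagate the point and the tangent vector together."""
    for i, (W, b) in enumerate(params):
        z, v = W @ x + b, W @ v
        if i < len(params) - 1:                  # ReLU on hidden layers only
            mask = (z > 0).astype(float)
            x, v = np.maximum(z, 0.0), mask * v
        else:                                    # linear readout
            x = z
    return x, v

u = rng.normal(size=5); u /= np.linalg.norm(u)   # gamma(t) = base + t*u has unit speed
base = rng.normal(size=5)
ts = np.linspace(0.0, 1.0, 4000)
outs, tangents = zip(*(forward_with_jvp(base + t * u, u) for t in ts))

polyline = np.linalg.norm(np.diff(np.stack(outs), axis=0), axis=1).sum()
riemann = np.mean([np.linalg.norm(v) for v in tangents])   # approximates (9)
print(polyline, riemann)                                   # the two estimates agree closely
```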
Lemma 6.1, while elementary, appears to be new, allowing us to bound global quantities such as length in terms of local quantities such as the moments of ||J_x u||. In particular, having obtained the expression (11), we have reduced bounding the second moment of len(N(M)) to bounding the second moment of the random variable ||J_x u|| for a fixed x ∈ R^{n_0} and unit vector u ∈ R^{n_0}. This is a simple matter since, in our setting, the distribution of weights and biases is Gaussian and hence, in distribution, ||J_x u|| is equal to a product of independent random variables:

Proposition 6.2. For any x ∈ R^{n_0} and any unit vector u ∈ R^{n_0}, the random variable ||J_x u||² is equal in distribution to a product of independent scaled chi-squared random variables,

||J_x u||² = (n_L/n_0) · ∏_{ℓ=1}^{L-1} (2/n_ℓ) χ²_{ñ_ℓ} · (1/n_L) χ²_{n_L}   (in distribution),

where the number of degrees of freedom in the ℓ-th term of the product (for ℓ = 1, ..., L-1) is given by an independent binomial ñ_ℓ ~ Bin(n_ℓ, 1/2) with n_ℓ trials and success probability 1/2. The number of degrees of freedom in the final term is deterministic and given by n_L.

This Proposition has been derived a number of times in the literature (see Theorem 3 in Hanin (2018) and Fact 7.2 in Allen-Zhu et al. (2019)). We provide an alternative proof, based on Proposition 2 in Hanin & Nica (2019), in the Supplementary Material. Note that the distribution of ||J_x u|| is the same at every x and u. Thus, fixing x ∈ R^{n_0} and a unit vector u ∈ R^{n_0}, we find

E[len(N(M))²] ≤ E[ ||J_x u||² ].

To prove (8), we note that E[χ²_n] = n (so that, by the tower property, E[χ²_{ñ_ℓ}] = E[ñ_ℓ] = n_ℓ/2) and apply Proposition 6.2 to find, as desired,

E[ ||J_x u||² ] = (n_L/n_0) · E[ (1/n_L) χ²_{n_L} ] · ∏_{ℓ=1}^{L-1} (2/n_ℓ) E[ñ_ℓ] = n_L/n_0.

7 LIMITATIONS AND FUTURE WORK

We show that deep ReLU networks with appropriate initialization do not appreciably distort lengths and volumes, contrary to prior assumptions of exponential growth. Specifically, we provide an exact expression for the mean length distortion, which is bounded above by 1 and decreases slightly with increasing depth. We also prove that higher moments of the length distortion admit well-controlled upper bounds, and we generalize this to distortion of higher-dimensional volumes. We show empirically that our theoretical results closely match observations, unlike previous loose lower bounds.

There are several notable limitations of this paper, which offer promising directions for future work. First, we prove statements for networks at initialization, and our results do not necessarily hold after training; analyzing such behavior formally would require consideration of the loss function, optimizer, and data distribution. Second, our results, as stated, do not apply to convolutional, residual, or recurrent networks. We envision generalizations for networks with skip connections being straightforward, while formulations for convolutional and recurrent networks will likely be more complicated. In Appendix A, we provide preliminary results suggesting that expected length distortion decreases modestly with depth in both convolutional networks and those with skip connections. Third, we believe that various other measures of complexity for neural networks, such as curvature, likely demonstrate similar behavior to length and volume distortion in average-case scenarios with appropriate initialization. While somewhat parallel results for counting linear regions were shown in Hanin & Rolnick (2019; 2020), there remains much more to be understood for other notions of complexity.
Finally, we look forward to work that explicitly ties properties such as low length distortion to improved generalization, as well as new learning algorithms that leverage this line of theory in order to better control inductive biases to fit real-world tasks. Published as a conference paper at ICLR 2022 8 ETHICS STATEMENT This paper falls within deep learning theory and is intended to advance innovation for more effective and reliable algorithms. However, insofar as our work shows that deep Re LU networks at initialization are better-behaved than was previously believed, this could encourage the use of deeper neural networks. While deeper networks may enhance performance in some contexts, larger numbers of parameters are also associated with increased computational cost, which can both increase greenhouse gas emissions and exacerbate inequities caused by differential access to computational resources. 9 ACKNOWLEDGEMENTS The authors gratefully acknowledge a range of funding sources that supported this research. BH would like to acknowledge NSF grants DMS-1855684 DMS-2133806 as well as an NSF CAREER grant DMS-2143754 and an ONR MURI on Foundations of Deep Learning. DR would like to acknowledge support from the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants program and the Canada CIFAR AI Chairs program. Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via overparameterization. In International Conference on Machine Learning (ICML), 2019. Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning (ICML), 2018. Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Conference on Neural Information Processing Systems (Neur IPS), 2014. Peter Bartlett, Dylan J Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. Preprint ar Xiv:1706.08498, 2017. Monica Bianchini and Franco Scarselli. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 25(8):1553 1565, 2014. Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory (COLT), 2016. Amit Daniely. Depth separation for neural networks. In Conference on Learning Theory (COLT), 2017. Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. Preprint ar Xiv:1703.11008, 2017. Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory (COLT), 2016. Raja Giryes, Guillermo Sapiro, and Alex M Bronstein. Deep neural networks with random Gaussian weights: A universal classification strategy? IEEE Transactions on Signal Processing, 64(13): 3444 3457, 2016. Boris Hanin. Which neural net architectures give rise to exploding and vanishing gradients? In Conference on Neural Information Processing Systems (Neur IPS), 2018. Boris Hanin and Mihai Nica. Products of many large random matrices and gradients in deep neural networks. Communications in Mathematical Physics, pp. 1 36, 2019. Boris Hanin and Mihai Nica. Finite depth and width corrections to the neural tangent kernel. In International Conference on Learning Representations (ICLR), 2020. 
Published as a conference paper at ICLR 2022 Boris Hanin and David Rolnick. How to start training: The effect of initialization and architecture. In Conference on Neural Information Processing Systems (Neur IPS), 2018. Boris Hanin and David Rolnick. Complexity of linear regions in deep networks. In International Conference on Machine Learning (ICML), 2019. Boris Hanin and David Rolnick. Deep Re LU networks have surprisingly few activation patterns. In Conference on Neural Information Processing Systems (Neur IPS), 2020. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on Image Net classification. In IEEE International Conference on Computer Vision (ICCV), 2015. Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. Preprint ar Xiv:1912.02178, 2019. Henry W Lin, Max Tegmark, and David Rolnick. Why does deep and cheap learning work so well? Journal of Statistical Physics, 168(6):1223 1247, 2017. Guido Mont ufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Conference on Neural Information Processing Systems (Neur IPS), 2014. Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Conference on Neural Information Processing Systems (Neur IPS), 2016. Ilan Price and Jared Tanner. Trajectory growth lower bounds for random sparse deep Re LU networks. Preprint ar Xiv:1911.10651, 2019. Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. In International Conference on Machine Learning (ICML), 2017. David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. In International Conference on Learning Representations, 2018. Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Failures of gradient-based deep learning. In International Conference on Machine Learning (ICML), 2017. Matus Telgarsky. Representation benefits of deep feedforward networks. Preprint ar Xiv:1509.08101, 2015. Matus Telgarsky. Benefits of depth in neural networks. In Conference on Learning Theory (COLT), 2016. Published as a conference paper at ICLR 2022 A EXPERIMENTS FOR CONVOLUTIONAL AND RESIDUAL NETWORKS We here provide preliminary experimental results for the mean length distortion in networks with convolutional layers and skip connections. In Figure 4(a), we show results for networks with sequential convolutional layers (without pooling layers), initialized as before with weights i.i.d. normal with variance 2/fan-in, and a final fully connected layer. We use input dimension n0 = 16 16 3 = 768 and output dimension 5. Our results indicate that the mean length distortion is, as expected, approximately equal to p n L/n0 0.08, and that it decays slightly with depth. In Figure 4(b), we consider residual networks. Here, the overall network N is defined in terms of residual modules N1, N2, . . . , NL and scales η1, η2, . . . , ηL according to: N(x) = x + η1N1(x) + η2N2(x + η1N1(x)) + + ηLNL (x + η1N1(x) + η2N2(x + η1N1(x)) + )) . We set all residual modules Nℓto be two-layer, fully connected Re LU networks. In keeping with Hanin & Rolnick (2018), we initialize all weights i.i.d. normal with variance 2/fan-in, while setting η1 = η2 = = ηL = 1/L. 
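To make this recursion concrete, here is a hedged sketch of the residual architecture just described. It is illustrative only: the module widths, the omission of biases inside the blocks, and all helper names are our own choices, not the authors' code.

```python
# Residual network as defined above: N(x) = x + sum_l eta_l * N_l(running partial sum),
# with each N_l a two-layer fully connected ReLU module, He-normal weights
# (variance 2/fan-in), and eta_1 = ... = eta_L = 1/L.
import numpy as np

def relu_module(dim, hidden, rng):
    """A two-layer ReLU block; biases are omitted here for brevity."""
    W1 = rng.normal(0.0, np.sqrt(2.0 / dim), size=(hidden, dim))
    W2 = rng.normal(0.0, np.sqrt(2.0 / hidden), size=(dim, hidden))
    return lambda x: W2 @ np.maximum(W1 @ x, 0.0)

def residual_net(dim, hidden, L, rng):
    modules = [relu_module(dim, hidden, rng) for _ in range(L)]
    eta = 1.0 / L
    def net(x):
        h = x.copy()
        for module in modules:
            h = h + eta * module(h)      # each block acts on the running partial sum
        return h
    return net

rng = np.random.default_rng(0)
net = residual_net(dim=5, hidden=20, L=10, rng=rng)
print(net(rng.normal(size=5)))
```

Unrolling the loop reproduces the nested expression above: after the first block h = x + η_1 N_1(x), after the second h = x + η_1 N_1(x) + η_2 N_2(x + η_1 N_1(x)), and so on.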
Our results suggest that mean length distortion again decays modestly overall, with a somewhat sharper decrease for small depths. Figure 4: Mean length distortion as a function of depth, for networks with convolutional layers and skip connections, initialized using He normal initialization He et al. (2015). (a) shows results for networks having convolutional layers, where the input dimension n0 equals 16 16 3 = 768 and output dimension n L equals 5. Here width corresponds to the number of 3 3 kernels in each layer. As expected, the mean length distortion is p n L/n0 0.08 and decays modestly with depth. (b) shows results for networks with skip connections occurring between even layers (i.e. each residual block is a fully-connected Re LU network of depth 2), again taking a line segment of unit length as the input curve. Here, depth corresponds to the total number of layers in the network, so that a depth-20 network includes 10 residual blocks. Both plots (a) and (b) show the empirical mean over 100 initializations of the weights and biases. B EXPERIMENTAL DETAILS For all experiments, weights were initialized from i.i.d. normal distributions with variance 2/fan-in and bias variance 0.1. We run several experiments that involve computing the length distortion of a given line segment in Rn0. We remark on how this was done, for which we rely on the notion of linear regions and bent hyperplanes associated with Re LU networks, explored in Hanin & Rolnick (2020); Telgarsky (2015). Specifically, Re LU networks partition input space into a collection of convex polytopes, which we call linear regions, as (generically) distinct linear functions are defined on each region corresponding to a different subset of the hidden neurons being activated (where a neuron is activated if its pre-activation is nonnegative). The boundaries of these linear regions are given by sets of points for which a particular neuron has preactivation equal to 0. Published as a conference paper at ICLR 2022 Say we are given a line segment ℓwith endpoints p1, p2 Rn0 parametrized by unit speed on [0, 1], given by the set {p1 + t(p2 p1) : t [0, 1]}, for which we aim to compute the length distortion. Let wℓ= p2 p1. We first approximate the intersections of the line segment with linear region boundaries using a binary search subroutine. Specifically, initialize the set S = [0, 0.5, 1], which will contain the parameter values for these approximations. For each iteration of the subroutine, we run the following procedure: for any three consecutive points t1, t2, t3 S, let x1 = p1 + t1(p2 p1), x2 = p1 + t2(p2 p1), x3 = p1 + t3(p2 p1), and consider the values ||N (x2) N(x1)|| ||x2 x1|| and ||N (x3) N(x2)|| ||x3 x2|| . These values are equal if the three points are in the same linear region, in which case we eliminate t1 and t3 from S. If not equal, then there exists a linear region boundary in the segment from x1 to x3; we add the points t1+t2 2 and t2+t3 2 to S. We iterate this procedure (15 times in our implementation); for the final iteration, we ensure that only the last point associated with a given linear region is in S. Now take the set S = [t1, t2, . . . , tn] returned by the binary search procedure; we proceed to compute the parameter values denoting the exact intersections of the line segment with region boundaries, which we shall store in S . For consecutive points ti, ti+1 S, i = 1, . . . 
, n 1, determine the set of activated neurons for both points at xi = p1 + ti(p2 p1), xi+1 = p1 + ti+1(p2 p1), and find the neuron at which these sets differ; we solve a linear equation to determine the value t [0, 1] for which this neuron is 0, and replace ti with the exact value t i in S . Finally, we append 0 and 1 to the respective ends of S = [t 0 = 0, t 1, . . . , t n = 1]. Computing the length distortion is reduced to summing the lengths of the output segments corresponding to consecutive pairs ti, ti+1 S . Namely, for each such pair, the network is given by a single linear function on the segment of inputs between xi = p1 + ti(p2 p1) and xi+1 = p1 + ti+1(p2 p1). We calculate the weight matrix Wi of this linear function the length of the corresponding output segment is the product of Wi with the vector wℓ. The total length distortion is then given by the sum i=0 ||Wiwℓ||(t i+1 t i ) C PROOF OF THEOREM 5.1 Our first step in proving Theorem 5.1 is to obtain bounds on the moments of len(N(M)) in terms of the input-output Jacobian of N at a single point, which we recall was defined in (10). To accomplish them, we recall the notation from 6. Namely, fix a smooth unit speed parameterization of M = γ([0, 1]) with γ : R Rn0, γ(t) = (γ1(t), . . . , γn0(t)) . The mapping Γ := N γ, Γ : R Rn L gives a parameterization of the curve N(M), and we have len(N(M)) = Z 1 0 ||Γ (t)|| dt. (14) Let us note an important but ultimately benign technicality. The Jacobian Jx of the map x 7 N(x) is not defined at every x (namely those x where some neuron turns from on to off). Thus, a priori, Γ (t) exists only at those t for which Jγ(t) exists. However, the formula (14) is still valid. Indeed, for any setting of weights and biases of N the map Γ is Lipschitz. Thus, by Rademacher s theorem, Γ (t) exists for almost every t [0, 1] and the length of the curve is given by integrating the norm of this almost-everywhere defined derivative. The following simple Lemma is a generalization of Lemma 6.1 and allows us to bound all moments of len(N(M)) in terms of the moments of the norm of the Jacobian vector product ||Jxu||. Lemma C.1. For any integer m 1, we have E [len(N(M))m] Z 1 0 E Jγ(t)γ (t) m dt. (15) Published as a conference paper at ICLR 2022 Proof. Taking powers in (14) and using Tonelli s theorem to interchange the expectation and integrals, we obtain E [len(N(M))m] = Z 1 j=1 ||Γ (tj)|| dt1 dtm. (16) To proceed, let us recall a special case of the power mean inequality which says that for any ai 0 we have m Y Applying this to the integrand in (16), we conclude E [len(N(M))m] Z 1 0 E ||Γ (t)||m dt. (17) Next, fix t [0, 1]. For any neuron z in N, denote by x 7 z(x) its pre-activation. Note that P(z(γ(t)) = 0) = 0 since our bias variance Cb is set to some fixed positive constant and hence the bias of each neuron has a continuous density. Therefore, with probability 1 over the randomness in the weights and biases of N, there exists a neighborhood U of γ(t) on which z(x) has constant sign for all x U. The Jacobian Jγ(t) is therefore well-defined and the chain rule yields Γ (t) = Jγ(t)γ (t). (18) Substituting this into (17) completes the proof. Having obtained the expression (15), we have reduced bounding the moments of len(N(M)) to bounding the moments of the random variable ||Jxu|| for a fixed x Rn0 and unit vector u Rn0. Prior work (e.g. Thm. 1 in Hanin & Nica (2019)) shows that these moments satisfy E h ||Jxu||2mi = n L ℓ=1 n 1 ℓ + O provided m 2 < min ℓ=1,...,L 1 nℓ. 
(19) Substituting these moment estimates in (15) completes the derivation of (4). However, the results in Hanin & Nica (2019) are subtle because they apply to any distribution of weights and biases. They also give the slightly sub-optimal restriction (19) that m2 must be smaller than a constant times the minimum of the nℓ s. In the special case where the distribution of weights and biases is Gaussian, we can do slightly better by computing the moments of ||Jxu|| more directly (note that in the statement of Theorem 5.1, we required only that m is smaller than a constant times the minimum of the nℓ s). We will need this anyway to derive the slightly more refined estimates in (2) and (3). Let us therefore check that, in distribution, ||Jxu|| is equal to a product of independent random variables: Proposition C.2. For any x Rn0 and any unit vector u Rn0, the random variable ||Jxu||2 is equal in distribution to a product of independent scaled chi-squared random variables ||Jxu||2 d= n L n L χ2 n L, where the number of degrees of freedom in the ℓth term of the product (for ℓ= 1, . . . , L 1) is given by an independent binomial nℓ d= Bin (nℓ, 1/2) with nℓtrials and success probability 1/2. The number of degrees of freedom in the final term is deterministic and given by n L. Published as a conference paper at ICLR 2022 Proof. Consider a Re LU network N with input dimension 1, output dimension n L, and hidden layer widths n1, . . . , n L 1. Suppose the weight matrices W (ℓ) and bias vectors b(ℓ) are independent with i.i.d. Gaussian components: W (ℓ) ij G(0, 2/nℓ 1), b(ℓ) j G(0, Cb), where Cb > 0 is any fixed constant. For a fixed network input x Rn0, with probability 1, the input-output Jacobian Jx is well-defined. Moreover, it can be written as Jx = W (L)D(L 1)W (L 1) D(1)W (1), where W (ℓ) is the matrix of weights from layer ℓ 1 to layer ℓand D(ℓ) is an nℓ nℓdiagonal matrix: D(ℓ) = Diag 1n z(ℓ) i 0 o, i = 1, . . . , nℓ whose diagonal entries are 0 or 1 depending on whether the pre-activation z(ℓ) i of neuron i in layer ℓis positive at our fixed input x. Next, according to Proposition 2 in Hanin & Nica (2019), the marginal distribution of each D(ℓ) is that its diagonal entries are independent Bernoulli(1/2) random variables. Moreover, we have the following equality in distribution: Jx d= ηW (L)D(L 1)W (L 1) D(1)W (1) where D(ℓ) are independent of each other (and of W (ℓ)) resampled from the marginal distribution of D(ℓ) and η is an independent diagonal matrix with independent diagonal entries that are 1 with equal probability. In particular, for a fixed unit vector u Rn0, we have ||Jxu|| d= ||W (L)D(L 1)W (L 1) D(1)W (1)u||. We may rewrite this as ||W (L)D(L 1)W (L 1) D(2)W (2)u(1)|| ||D(1)W (1)u||, (20) for u(1) := D(1)W (1)u ||D(1)W (1)u||, where we interpret the expression (20) as equal to 0 if D(1) is the zero matrix. To complete the proof, we need the following standard observation. Lemma C.3. Suppose W is an n n matrix with i.i.d. Gaussian entries and u is a random unit vector in Rn that is independent of W but otherwise has any distribution. Then Wu is independent of u and is equal in distribution to Wv where v is any fixed unit vector in Rn . Proof. For any fixed orthogonal matrix O, it is a standard fact that WO is equal in distribution to W. Thus, for any measurable sets A Rn and B Rn , since u, W are independent we have P (Wu A, u B) = Z Sn 1 P (Wu0 A) d Pu(u0), where Pu(u0) is the distribution of u. 
Fix any u0 Sn 1 and let O = O(u0) be any orthogonal matrix so that u0 = Oe1 with e1 = (1, 0, . . . , 0) is the first standard unit vector. Then, since WO is equal in distribution to W, we have P (Wu0 A) = P (We1 A) , which is independent of u0. We therefore find P (Wu A, u B) = P (We1 A) P(u B), as desired. We are now in a position to complete the proof of Proposition C.2. Combining Lemma C.3 with (20), we find that, in distribution ||Jxu|| equals ||W (L)D(L 1)W (L 1) D(2)W (2)u|| ||D(1)W (1)u||. (21) Published as a conference paper at ICLR 2022 Note that these two terms are independent. Repeating this argument, we obtain that ||Jxu|| is equal in distribution to the product ||W (L)u|| ||D(L 1)W (L 1)u|| ||D(1)W (1)u||. (22) The terms in this product are independent. To complete the proof note that for ℓ= 1, . . . , L 1 the number of non-zero entries in D(ℓ) is a binomial random variable nℓwith nℓtrials, each with probability of success 1/2 and that ||D(ℓ)W (ℓ)u||2 is precisely 2/nℓtimes a sum of nℓsquares of independent standard Gaussians. Thus, for ℓ= 1, . . . , L 1, ||D(ℓ)W (ℓ)u||2 d= 2 nℓ 1 χ2 nℓ. (23) ||W (L)u||2 d= 2 n L 1 χ2 n L. (24) Substituting (23) and (24) into (22) completes the proof. Evaluating the moments of len(N(M)) is now a matter of computing the moments of some scaled chi-squared random variables with a random number of degrees of freedom. For instance, recalling that E χ2 k = k and applying Lemma C.2 as well as the tower property of the expectation we find E h ||Jxu||2i = n L E n 1 L χ2 n L nℓ E E χ2 nℓ| nℓ Substituting this into (15) with m = 2 yields Var [len(N(M))] n L yielding (3). To prove (2), we need to estimate E [len N(M)]. By taking expectations in (14) and using (18), we find E [len N(M)] = Z 1 0 E Jγ(t)γ (t) dt = E [||Jxu||] , where we ve used that by Lemma C.2, the distribution of ||Jxu|| is the same for every x Rn0 and every unit vector u. Moreover, again using Lemma C.2, we see that E [||Jxu||] equals To simplify this, a direct computation shows that E h k 1χ2 k 1/2i = Γ k+1 Published as a conference paper at ICLR 2022 where Γ( ) is the Gamma function. Next, for any positive random variable X with E [X] = 1 we have E h X1/2i = E h (1 + (X 1))1/2i 8Var[X] + O(|X 1|3). Recall that Var[χ2 k] = 2k. Using this and the law of total variance yields = 1 1 2n2 ℓ Var[χ2 nℓ] + O(n 2 ℓ) = 1 1 2n2 ℓ E Var[χ2 nℓ| nℓ] + Var[E χ2 nℓ| nℓ ] + O(n 2 ℓ) = 1 1 2n2 ℓ [E [2nℓ] + Var[nℓ]] + O(n 2 ℓ) = 1 1 2n2 ℓ i + O(n 2 ℓ) = 1 5 8nℓ + O(n 2 ℓ). Thus, we find that E [len(N(M))] equals n L 1/2 Γ n L+1 ℓ=1 n 1 ℓ + O as claimed. Finally, let us check the higher moment estimates (4). By Lemma C.2 we have the following estimate ||Jxu||2 = n L n 1 L χ2 n L nℓ χ2 nℓ 1 # n 1 L χ2 n L, where we used that x = x 1 + 1 ex 1. For any m, we have Therefore, ||Jxu||2m is bounded above by n L nℓ χ2 nℓ 1 + m2 Finally, for any fixed positive integer n we may write 2 nχ2 n 1 = 2 Published as a conference paper at ICLR 2022 where ξk are independent Bernoulli(1/2) random variables and Zk are independent standard Gaussians. A direct computation shows that the centered random variables ξk Z2 k 1 2 are sub-exponential SE(ν2, α) with some universal parameters ν2, α. Thus, by the stability of sub-exponential random variables under summation we have 2 nℓ χ2 nℓ 1 SE 4ν2 Again using this property we conclude 2 nℓ χ2 nℓ 1 SE where n = min {n1, . . . , n L 1} . Therefore, by (25), we find that E h ||Jxu||2mi n L m E em Y exp m2 where Y SE 4ν2 PL 1 ℓ=1 1 nℓ, 2α . 
By definition of sub-exponential random variables, we have 4m2ν2 L 1 X provided m < n /2α for some universal constant c > 0. This completes the derivation of (4) and completes the proof of Theorem 5.1. D PROOF OF THEOREM 5.2 The proof of Theorem 5.2 follows closely the argument for Theorem 5.1. For a given input x Rn0, we will continue to write JN (x) := xj Ni(x) n1 j n0 1 i n L for the input-output Jacobian of the network function, which exists for Lebesgue almost every x Rn0. We will write ΠTx M : Rn0 Tx M for the orthogonal projection onto this tangent space. We have vold(N(M)) = Z det ΠTx MJN (x)T JN (x)ΠTx M 1/2 vold(dx), where vold is the d dimensional Hausdorff measure. Indeed, the integrand measures, at each x M, the volume of the image of an infinitesimal cube on M of volume vold(dx) centered at x under the map x 7 N(x). Arguing precisely as in the proof of Lemma C.1, we find that for any integer m 1 E [vold(N(M))m] Z M E det ΠTx MJN (x)T JN (x)ΠTx M m/2 vold(dx) (26) Fix x M and denote by e1(x), . . . , ed(x) an orthonormal basis of the tangent space of M. Then, by the Gram identity, we may write det ΠTx MJN (x)T JN (x)ΠTx M = ||JN (x)e1(x) JN (x)ed(x)||2 , (27) where we recall that the wedge product is the anti-symmetrization of the tensor product. Just as in the proof of Theorem 5.1, the key observation is that for Gaussian weights we have that ||JN (x)e1(x) JN (x)ed(x)|| is a product of i.i.d. random variables. The formal statement is the following: Published as a conference paper at ICLR 2022 Proposition D.1. For any x Rn0 and collection of orthonormal unit vectors u1, . . . , ud Rn0, the random variable ||Jxu1 Jxud||2 is equal in distribution to a product of independent scaled chi-squared random variables: n L 2 nℓ χ2 nℓ j+1 1 n L χ2 n L j+1, where all the terms in the product are independent and for ℓ= 1, . . . , L 1 we ve written nℓfor independent Binomial random variables: nℓ d= Bin (nℓ, 1/2) with nℓtrials and success probability 1/2. Proof. The proof is identical to that of Lemma C.2. The only difference is that we must invoke the following amplification of Lemma C.3: Lemma D.2. Let u1, . . . , uk Rn be a collection of orthonormal vectors (i.e. a collection) with any distribution. Let W be an independent n n matrix with i.i.d. Gaussian entries. Then W(u1 uk) = Wu1 Wuk is independent of u1 uk and is equal in distribution to Wv1 Wvk where v1, . . . , vk is any fixed collection of orthonormal vectors. Proof. The proof is identical to that of Lemma C.3, except that we note that for that, given u1, . . . , uk, there exists an orthogonal matrix O = O(u1, . . . , uk) so that uj = Oej, j = 1, . . . , k where ej are the standard basis vectors. With Lemma D.1 in hand, we complete the proof of Theorem 5.2 as follows. First, note that for any random variable X we have E [X] E X2 1/2 , Var[X] E X2 . Hence, Theorem 5.2 will follow once we show that E vold(N(M))2 n L Combining Proposition D.1 with (26) and (27), this estimate follows by showing that E h ||Jxu1 Jxud||2i n L To check this, recall that E χ2 k = k. Hence, E h ||JN (x)e1(x) JN (x)ed(x)||2i = n L 2 nℓ E χ2 nℓ j+1 1 n L E χ2 n L j+1 where in the last line we used that 1 + x ex. This completes the proof.