# Invariance of Weight Distributions in Rectified MLPs

Russell Tsuchida 1, Farbod Roosta-Khorasani 2 3, Marcus Gallagher 1

An interesting approach to analyzing neural networks that has received renewed attention is to examine the equivalent kernel of the neural network. This is based on the fact that a fully connected feedforward network with one hidden layer, a certain weight distribution, an activation function, and an infinite number of neurons can be viewed as a mapping into a Hilbert space. We derive the equivalent kernels of MLPs with ReLU or Leaky ReLU activations for all rotationally-invariant weight distributions, generalizing a previous result that required Gaussian weight distributions. Additionally, the Central Limit Theorem is used to show that for certain activation functions, kernels corresponding to layers with weight distributions having 0 mean and finite absolute third moment are asymptotically universal, and are well approximated by the kernel corresponding to layers with spherical Gaussian weights. In deep networks, as depth increases the equivalent kernel approaches a pathological fixed point, which can be used to argue why training randomly initialized networks can be difficult. Our results also have implications for weight initialization.

1 School of ITEE, University of Queensland, Brisbane, Queensland, Australia. 2 School of Mathematics and Physics, University of Queensland, Brisbane, Queensland, Australia. 3 International Computer Science Institute, Berkeley, California, USA. Correspondence to: Russell Tsuchida, Farbod Roosta-Khorasani, Marcus Gallagher.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

## 1. Introduction

Neural networks have recently been applied to a number of diverse problems with impressive results (van den Oord et al., 2016; Silver et al., 2017; Berthelot et al., 2017). These breakthroughs largely appear to be driven by application rather than an understanding of the capabilities and training of neural networks. Recently, significant work has been done to increase understanding of neural networks (Choromanska et al., 2015; Haeffele & Vidal, 2015; Poole et al., 2016; Schoenholz et al., 2017; Zhang et al., 2016; Martin & Mahoney, 2017; Shwartz-Ziv & Tishby, 2017; Balduzzi et al., 2017; Raghu et al., 2017). However, there is still work to be done to bring theoretical understanding in line with the results seen in practice.

The connection between neural networks and kernel machines has long been studied (Neal, 1994). Much past work has been done to investigate the equivalent kernel of certain neural networks, either experimentally (Burgess, 1997), through sampling (Sinha & Duchi, 2016; Livni et al., 2017; Lee et al., 2017), or analytically by assuming some random distribution over the weight parameters in the network (Williams, 1997; Cho & Saul, 2009; Pandey & Dukkipati, 2014a;b; Daniely et al., 2016; Bach, 2017a). Surprisingly, in the latter approach, rarely have distributions other than the Gaussian distribution been analyzed. This is perhaps due to early influential work on Bayesian networks (MacKay, 1992), which laid a strong mathematical foundation for a Bayesian approach to training networks. Another reason may be that some researchers hold the intuitive (but not necessarily principled) view that the Central Limit Theorem (CLT) should somehow apply.
In this work, we investigate the equivalent kernels for networks with Rectified Linear Unit (ReLU), Leaky ReLU (LReLU) or other activation functions, one hidden layer, and more general weight distributions. Our analysis carries over to deep networks. We investigate the consequences that weight initialization has on the equivalent kernel at the beginning of training. While initialization schemes that mitigate exploding/vanishing gradient problems (Hochreiter, 1991; Bengio et al., 1994; Hochreiter et al., 2001) for other activation function and weight distribution combinations have been explored in earlier works (Glorot & Bengio, 2010; He et al., 2015), we discuss an initialization scheme for Multi-Layer Perceptrons (MLPs) with LReLUs and weights coming from distributions with 0 mean and finite absolute third moment. The derived kernels also allow us to analyze the loss of information as an input is propagated through the network, offering a complementary view to the shattered gradient problem (Balduzzi et al., 2017).

## 2. Preliminaries

Consider a fully connected (FC) feedforward neural network with $m$ inputs and a hidden layer with $n$ neurons. Let $\sigma : \mathbb{R} \to \mathbb{R}$ be the activation function of all the neurons in the hidden layer. Further assume that the biases are 0, as is common when initializing neural network parameters. For any two inputs $x, y \in \mathbb{R}^m$ propagated through the network, the dot product in the hidden layer is

$$\frac{1}{n}\,h(x)\cdot h(y) = \frac{1}{n}\sum_{i=1}^{n}\sigma(w_i\cdot x)\,\sigma(w_i\cdot y), \qquad (1)$$

where $h(\cdot)$ denotes the $n$-dimensional vector in the hidden layer and $w_i \in \mathbb{R}^m$ is the weight vector into the $i$th neuron. Assuming an infinite number of hidden neurons, the sum in (1) has an interpretation as an inner product in feature space, which corresponds to the kernel of a Hilbert space. We have

$$k(x, y) = \int_{\mathbb{R}^m} \sigma(w\cdot x)\,\sigma(w\cdot y)\,f(w)\,dw, \qquad (2)$$

where $f(w)$ is the probability density function (PDF) for the identically distributed weight vector $W = (W_1, \dots, W_m)^T$ in the network.

The connection of (2) to the kernels in kernel machines is well known (Neal, 1994; Williams, 1997; Cho & Saul, 2009). Probabilistic bounds for the error between (1) and (2) have been derived in special cases (Rahimi & Recht, 2008) when the kernel is shift-invariant. Two specific random feature mappings are considered: (1) random Fourier features are taken for the $\sigma$ in (1). Calculating the approximation error in this way requires being able to sample from the PDF defined by the Fourier transform of the target kernel. More explicitly, the weight distribution $f$ is the Fourier transform of the target kernel and the $n$ samples $\sigma(w_i\cdot x)$ are replaced by some appropriate scaling of $\cos(w_i\cdot x)$. (2) A random bit string $\sigma(x_i)$ is associated to each input according to a grid with random pitch $\delta$ sampled from $f$ imposed on the input space. This method requires having access to the second derivative of the target kernel to sample from the distribution $f$.

Other work (Bach, 2017b) has focused on the smallest error between a target function $g$ in the reproducing kernel Hilbert space (RKHS) defined by (2) and an approximate function $\hat{g}$ expressible by the RKHS with the kernel (1). More explicitly, let $g(x) = \int_{\mathbb{R}^m} G(w)\,\sigma(w, x)\,f(w)\,dw$ be the representation of $g$ in the RKHS. The quantity $\big\|\hat{g} - g\big\| = \big\|\sum_{i=1}^{n}\alpha_i\sigma(w_i, \cdot) - \int_{\mathbb{R}^m} G(w)\,\sigma(w, \cdot)\,f(w)\,dw\big\|$ (with some suitable norm) is studied for the best set of $\alpha_i$ and random $w_i$ with an optimized distribution. Yet another measure of kernel approximation error is investigated by Rudi & Rosasco (2017).
Let $\hat{g}$ and $g$ be the optimal solutions to the ridge regression problem of minimizing a regularized cost function $C$ using the kernel (1) and the kernel (2) respectively. The number of datapoints $n$ required to probabilistically bound $C(\hat{g}) - C(g)$ is found to be $O(\sqrt{n}\log n)$ under a suitable set of assumptions. This work notes the connection between kernel machines and one-layer neural networks with ReLU activations and Gaussian weights by citing Cho & Saul (2009). We extend this connection by considering other weight distributions and activation functions. In this work our focus is on deriving expressions for the target kernel, not the approximation error. Additionally, we consider random mappings that have not been considered elsewhere.

Our work is related to work by Poole et al. (2016) and Schoenholz et al. (2017). However, our results apply to the unbounded (L)ReLU activation function and more general weight distributions, and their work considers random biases as well as weights.

## 3. Equivalent Kernels for Infinite Width Hidden Layers

The kernel (2) has previously been evaluated for a number of choices of $f$ and $\sigma$ (Williams, 1997; Roux & Bengio, 2007; Cho & Saul, 2009; Pandey & Dukkipati, 2014a;b). In particular, the equivalent kernel for a one-hidden-layer network with spherical Gaussian weights of variance $E[W_i^2]$ and mean 0 is the Arc-Cosine Kernel (Cho & Saul, 2009)

$$k(x, y) = \frac{E[W_i^2]\,\|x\|\,\|y\|}{2\pi}\big(\sin\theta_0 + (\pi - \theta_0)\cos\theta_0\big), \qquad (3)$$

where $\theta_0 = \cos^{-1}\!\big(\tfrac{x\cdot y}{\|x\|\,\|y\|}\big)$ is the angle between the inputs $x$ and $y$, and $\|\cdot\|$ denotes the $\ell_2$ norm. Noticing that the Arc-Cosine Kernel $k(x, y)$ depends on $x$ and $y$ only through their norms and the angle $\theta_0$, with an abuse of notation we will henceforth write $k(x, y)$ as $k(\theta_0)$.

Define the normalized kernel to be the cosine similarity between the signals in the hidden layer. The normalized Arc-Cosine Kernel is given by

$$\cos\theta_1 = \frac{k(x, y)}{\sqrt{k(x, x)\,k(y, y)}} = \frac{1}{\pi}\big(\sin\theta_0 + (\pi - \theta_0)\cos\theta_0\big),$$

where $\theta_1$ is the angle between the signals in the first layer. Figure 1 shows a plot of the normalized Arc-Cosine Kernel.

One might ask how the equivalent kernel changes for a different choice of weight distribution. We investigate the equivalent kernel for networks with (L)ReLU activations and general weight distributions in Sections 3.1 and 3.2. The equivalent kernel can be composed and applied to deep networks. The kernel can also be used to choose good weights for initialization. These, as well as other implications for practical neural networks, are investigated in Section 5.

Figure 1. Normalized Arc-Cosine Kernel as a function of $\theta_0$ for a single hidden layer network, Gaussian weights, and ReLU activations. Empirical samples from a network with 1000 inputs and 1000 hidden units are plotted alongside the theoretical curve. Samples are obtained by generating $R$ from a QR decomposition of a random matrix, then setting $x = R^T(1, 0, \dots, 0)^T$ and $y = R^T(\cos\theta, \sin\theta, 0, \dots, 0)^T$.
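The sampling procedure described in the Figure 1 caption is straightforward to reproduce. The following is a minimal NumPy sketch of that experiment (our code, not the authors'); the weight variance $2/m$ is an arbitrary choice, since the normalized kernel does not depend on the overall weight scale.

```python
# Minimal sketch of the Figure 1 experiment: inputs at a known angle theta_0,
# one random ReLU layer, empirical cosine similarity vs. the normalized
# Arc-Cosine Kernel.
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 1000                                   # input dimension, hidden width

R, _ = np.linalg.qr(rng.standard_normal((m, m)))    # random rotation
W = rng.standard_normal((n, m)) * np.sqrt(2.0 / m)  # spherical Gaussian weights
relu = lambda z: np.maximum(z, 0.0)

e1, e2 = np.eye(m)[0], np.eye(m)[1]
for theta in np.linspace(0.1, np.pi - 0.1, 8):
    x = R.T @ e1                                    # R^T (1, 0, ..., 0)^T
    y = R.T @ (np.cos(theta) * e1 + np.sin(theta) * e2)
    hx, hy = relu(W @ x), relu(W @ y)
    empirical = hx @ hy / (np.linalg.norm(hx) * np.linalg.norm(hy))
    theory = (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi
    print(f"theta0={theta:4.2f}  empirical={empirical:5.3f}  theory={theory:5.3f}")
```

Already at these moderate widths the empirical cosine similarities track $\frac{1}{\pi}(\sin\theta_0 + (\pi - \theta_0)\cos\theta_0)$ closely, as in Figure 1.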
### 3.1. Kernels for Rotationally-Invariant Weights

In this section we show that (3) holds more generally than for the case where $f$ is Gaussian. Specifically, (3) holds when $f$ is any rotationally-invariant distribution. We do this by casting (2) as the solution to an ODE, and then solving the ODE. We then extend this result using the same technique to the case where $\sigma$ is LReLU. A rotationally-invariant PDF is one with the property $f(w) = f(Rw) = f(\|w\|)$ for all $w$ and orthogonal matrices $R$.

Recall that the class of rotationally-invariant distributions (Bryc, 1995), as a subclass of elliptically contoured distributions (Johnson, 2013), includes the Gaussian distribution, the multivariate t-distribution, the symmetric multivariate Laplace distribution, and symmetric multivariate stable distributions.

**Proposition 1.** Suppose we have a one-hidden-layer feedforward network with ReLU $\sigma$ and random weights $W$ with uncorrelated and identically distributed rows with rotationally-invariant PDF $f : \mathbb{R}^m \to \mathbb{R}$ and $E[W_i^2] < \infty$. The equivalent kernel of the network is (3).

*Proof.* First, we require the following.

**Proposition 2.** With the conditions in Proposition 1 and inputs $x, y \in \mathbb{R}^m$, the equivalent kernel of the network is the solution to the Initial Value Problem (IVP)

$$k''(\theta_0) + k(\theta_0) = F(\theta_0), \qquad k'(\pi) = 0, \qquad k(\pi) = 0, \qquad (4)$$

where $\theta_0 \in (0, \pi)$ is the angle between the inputs $x$ and $y$. The derivatives are meant in the distributional sense; they are functionals applying to all test functions in $C_c^{\infty}(0, \pi)$. $F(\theta_0)$ is given by the $(m-1)$-dimensional integral

$$F(\theta_0) = \|x\|\,\|y\|\sin\theta_0 \int_{\mathbb{R}^{m-1}} f\big((s\sin\theta_0, -s\cos\theta_0, w_3, \dots, w_m)^T\big)\,\Theta(s)\,s^3\,ds\,dw_3\,dw_4 \cdots dw_m, \qquad (5)$$

where $\Theta$ is the Heaviside step function.

The proof is given in Appendix A. The main idea is to rotate $w$ (following Cho & Saul (2009)) so that

$$k(x, y) = \|x\|\,\|y\| \int_{\mathbb{R}^m} \Theta(w_1)\,\Theta(w_1\cos\theta_0 + w_2\sin\theta_0)\,w_1(w_1\cos\theta_0 + w_2\sin\theta_0)\,f(w)\,dw.$$

Now differentiating twice with respect to $\theta_0$ yields the second-order ODE (4).

The usefulness of the ODE in its current form is limited, since the forcing term $F(\theta_0)$ as in (5) is difficult to interpret. However, regardless of the underlying distribution on weights $w$, as long as the PDF $f$ in (5) corresponds to any rotationally-invariant distribution, the integral enjoys a much simpler representation.

**Proposition 3.** With the conditions in Proposition 1, the forcing term $F(\theta_0)$ in the kernel ODE is given by $F(\theta_0) = K\sin\theta_0$, where

$$K = \|x\|\,\|y\| \int_{\mathbb{R}^{m-1}} \Theta(s)\,s^3\,f\big((s, 0, w_3, \dots, w_m)^T\big)\,ds\,dw_3 \cdots dw_m < \infty,$$

and the solution to the distributional ODE (4) is the solution to the corresponding classical ODE.

The proof is given in Appendix B. Note that in the representation $F(\theta_0) = K\sin\theta_0$ of the forcing term, the underlying distribution appears only as a constant $K$. For all rotationally-invariant distributions, the forcing term in (4) results in an equivalent kernel with the same form.

We can combine Propositions 2 and 3 to find the equivalent kernel assuming rotationally-invariant weight distributions. Due to the rotational invariance of $f$,

$$k(0) = \|x\|\,\|y\| \int_{\mathbb{R}^m} \Theta(w_1)\,w_1^2\,f(w)\,dw = \frac{\|x\|\,\|y\|\,E[W_i^2]}{2}.$$

The solution to the ODE in Proposition 2 using the forcing term from Proposition 3 is

$$k(\theta_0) = c_1\cos\theta_0 + c_2\sin\theta_0 - \tfrac{1}{2}K\theta_0\cos\theta_0.$$

The conditions $k(\pi) = 0$ and $k'(\pi) = 0$ give $c_1 = K\pi/2$ and $c_2 = K/2$, and the value of $k(0)$ then gives $K = \|x\|\,\|y\|\,E[W_i^2]/\pi$; substituting these values of $c_1$, $c_2$ and $K$ yields (3).

One can apply the same technique to the case of LReLU activations $\sigma(z) = \big(a + (1 - a)\Theta(z)\big)z$, where $a$ specifies the gradient of the activation for $z < 0$.

**Proposition 4.** Consider the same situation as in Proposition 1, with the exception that the activations are LReLU. The integral (2) is then given by

$$k(x, y) = \Big[\frac{(1-a)^2}{2\pi}\big(\sin\theta_0 + (\pi - \theta_0)\cos\theta_0\big) + a\cos\theta_0\Big] E[W_i^2]\,\|x\|\,\|y\|, \qquad (6)$$

where $a \in [0, 1)$ is the LReLU gradient parameter.

This is just a slightly more involved calculation than the ReLU case; we defer our proof to the supplementary material.
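Proposition 4 is easy to check by Monte Carlo for a concrete rotationally-invariant, non-Gaussian weight distribution. The sketch below is our example (the dimension, sample count, degrees of freedom and slope $a$ are arbitrary choices); it draws weights from a spherical multivariate t-distribution and compares the sample average of $\sigma(w\cdot x)\sigma(w\cdot y)$ against the closed form (6).

```python
# Monte Carlo check of Proposition 4 / equation (6): spherical multivariate t
# weights (rotationally invariant, non-Gaussian), LReLU activations.
import numpy as np

rng = np.random.default_rng(1)
m, n_samples, nu, a, theta0 = 50, 200_000, 5.0, 0.2, 1.0

x = np.zeros(m); x[0] = 1.0
y = np.zeros(m); y[0], y[1] = np.cos(theta0), np.sin(theta0)

# Spherical multivariate t: a Gaussian vector divided by an independent chi/sqrt(nu).
Z = rng.standard_normal((n_samples, m))
W = Z / np.sqrt(rng.chisquare(nu, size=(n_samples, 1)) / nu)
second_moment = nu / (nu - 2.0)                     # E[W_i^2] for nu > 2

lrelu = lambda z: np.where(z > 0, z, a * z)
empirical = np.mean(lrelu(W @ x) * lrelu(W @ y))
theory = ((1 - a) ** 2 / (2 * np.pi) * (np.sin(theta0) + (np.pi - theta0) * np.cos(theta0))
          + a * np.cos(theta0)) * second_moment * np.linalg.norm(x) * np.linalg.norm(y)
print(f"empirical: {empirical:.4f}   closed form (6): {theory:.4f}")
```

The two numbers agree to within Monte Carlo error, even though the weights are far from Gaussian.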
### 3.2. Asymptotic Kernels

In this section we approximate $k$ for large $m$ and more general weight PDFs. We invoke the CLT as $m \to \infty$, which requires a condition that we discuss briefly before presenting it formally. The dot product $w\cdot x$ can be seen as a linear combination of the weights, with the coefficients corresponding to the coordinates of $x$. Roughly, such a linear combination will obey the CLT if many coefficients are non-zero.

To let $m \to \infty$, we construct a sequence of inputs $\{x^{(m)}\}_{m=2}^{\infty}$. This may appear unusual in the context of neural networks, since $m$ is fixed and finite in practice. The sequence is used only for asymptotic analysis. As an example, if the dataset were CelebA (Liu et al., 2015) with 116412 inputs, one would have $x^{(116412)}$. To generate an artificial sequence, one could down-sample the image to be of size 116411, 116410, and so on. At each point in the sequence, one could normalize the point so that its $\ell_2$ norm is $\|x^{(116412)}\|$. One could similarly up-sample the image. Intuitively, if the up-sampled image does not just insert zeros, as $m$ increases we expect the ratio $|x_i^{(m)}|/\|x^{(m)}\|$ to decrease, because the denominator stays fixed and the numerator gets smaller. In our proof the application of the CLT requires $\max_{i=1}^{m} |x_i^{(m)}|/\|x^{(m)}\|$ to decrease faster than $m^{-1/4}$. Hypothesis 5 states this condition precisely.

**Hypothesis 5.** For $x^{(m)}, y^{(m)} \in \mathbb{R}^m$, define sequences of inputs $\{x^{(m)}\}_{m=2}^{\infty}$ and $\{y^{(m)}\}_{m=2}^{\infty}$ with fixed $\|x^{(m)}\| = \|x\|$, $\|y^{(m)}\| = \|y\|$, and $\theta_0 = \cos^{-1}\!\big(\tfrac{x^{(m)}\cdot y^{(m)}}{\|x\|\,\|y\|}\big)$ for all $m$. Letting $x_i^{(m)}$ be the $i$th coordinate of $x^{(m)}$, assume that

$$\lim_{m\to\infty} m^{1/4}\,\frac{\max_{i=1}^{m} |x_i^{(m)}|}{\|x\|} \quad \text{and} \quad \lim_{m\to\infty} m^{1/4}\,\frac{\max_{i=1}^{m} |y_i^{(m)}|}{\|y\|}$$

are both 0.

Figures 2 and 5 empirically investigate Hypothesis 5 for two datasets, suggesting that it makes reasonable assumptions on high-dimensional data such as images and audio.

Figure 2. The solid line is an average over 1000 randomly sampled datapoints. The shaded region represents 1 standard deviation in the worst-case direction. Data is preprocessed so that each dimension is in the range [0, 255]. (Left) Aligned and cropped CelebA dataset (Liu et al., 2015), with true dimensionality m = 116412. The images are compressed using bicubic interpolation. (Right) CHiME3 embedded et05 real live speech data from the 4th CHiME Speech Separation and Recognition Challenge (Vincent et al., 2017; Barker et al., 2017). Each clip is trimmed to a length of 6.25 seconds and the true sample rate is 16000 Hz, so the true dimensionality is m = 100000. Compression is achieved through subsampling by integer factors.

**Theorem 6.** Consider an infinitely wide FC layer with almost everywhere continuous activation functions $\sigma$. Suppose the random weights $W^{(m)}$ come from an IID distribution with PDF $f_m$ such that $E[W_i] = 0$ and $E|W_i^3| < \infty$. Suppose that the conditions in Hypothesis 5 are satisfied. Then

$$\sigma\big(W^{(m)}\cdot x^{(m)}\big)\,\sigma\big(W^{(m)}\cdot y^{(m)}\big) \xrightarrow{D} \sigma(Z_1)\,\sigma(Z_2),$$

where $\xrightarrow{D}$ denotes convergence in distribution and $Z = (Z_1, Z_2)^T$ is a Gaussian random vector with mean 0 and covariance matrix

$$E[W_i^2]\begin{pmatrix} \|x\|^2 & \|x\|\,\|y\|\cos\theta_0 \\ \|x\|\,\|y\|\cos\theta_0 & \|y\|^2 \end{pmatrix}.$$

Every $Z^{(m)} = \big(W^{(m)}\cdot x^{(m)},\, W^{(m)}\cdot y^{(m)}\big)^T$ has the same mean and covariance matrix as $Z$.
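Hypothesis 5, which Theorem 6 assumes, is easy to check numerically for a given input. The sketch below mirrors the construction behind Figure 2 but uses a synthetic smooth signal as a stand-in for an image or audio clip (our illustrative choice, not the CelebA or CHiME3 data); it evaluates $m^{1/4}\max_i |x_i^{(m)}|/\|x^{(m)}\|$ under subsampling.

```python
# Numerical check of the Hypothesis 5 statistic m^(1/4) * max_i |x_i| / ||x||
# on progressively subsampled versions of a synthetic smooth signal.
# (Re-normalizing x to a fixed l2 norm, as in the paper's construction,
# would not change this ratio.)
import numpy as np

rng = np.random.default_rng(2)
full = np.cumsum(rng.standard_normal(100_000))   # smooth-ish surrogate for an image/audio clip

for m in (100_000, 50_000, 20_000, 10_000, 5_000, 2_000, 1_000):
    x = full[:: full.size // m][:m]              # subsample by an integer factor
    stat = m ** 0.25 * np.max(np.abs(x)) / np.linalg.norm(x)
    print(f"m = {m:6d}   m^(1/4) max|x_i| / ||x|| = {stat:.3f}")
```

For data of this kind the statistic decays as $m$ grows, which is the behaviour Figure 2 reports for real images and audio.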
Convergence in distribution is a weak form of convergence, so we cannot expect in general that all kernels should converge asymptotically. For some special cases, however, this is indeed possible to show. We first present the ReLU case.

**Corollary 7.** Let $m$, $W$, $f_m$, $E[W_i]$ and $E|W_i^3|$ be as defined in Theorem 6. Define the corresponding kernel to be $k_f^{(m)}(x^{(m)}, y^{(m)})$. Consider a second infinitely wide FC layer with $m$ inputs. Suppose the random weights come from a spherical Gaussian with $E[W_i] = 0$ and finite variance $E[W_i^2]$, with PDF $g_m$. Define the corresponding kernel to be $k_g^{(m)}(x^{(m)}, y^{(m)})$. Suppose that the conditions in Hypothesis 5 are satisfied and the activation functions are $\sigma(z) = \Theta(z)z$. Then for all $s \ge 2$,

$$\lim_{m\to\infty} k_f^{(m)}\big(x^{(m)}, y^{(m)}\big) = k_g^{(s)}\big(x^{(s)}, y^{(s)}\big) = E\big[\sigma(Z_1)\,\sigma(Z_2)\big],$$

where $Z$ is as in Theorem 6. Explicitly, $k_f^{(m)}$ converges to (3).

The proof is given in Appendix D. This implies that the Arc-Cosine Kernel is well approximated by ReLU layers with weights from a wide class of distributions. Similar results hold for other $\sigma$ including the LReLU and ELU (Clevert et al., 2016), as shown in the supplementary material.

Figure 3. Theoretical normalized kernel for networks of increasing depth, alongside empirical samples from networks with between 1 and 128 hidden layers, 1000 hidden neurons in each layer, m = 1000 and weights coming from different symmetric distributions. The sampling process for each $\theta_0$ is as described in Figure 1. The variance is chosen according to (8). (a) ReLU activations, distribution (7). (b) LReLU activations with a = 0.2, distribution (7). (c) ReLU activations, t-distribution. (d) LReLU activations with a = 0.2, t-distribution.

## 4. Empirical Verification of Results

We empirically verify our results using two families of weight distributions. First, consider the $m$-dimensional t-distribution

$$f(w) = \frac{\Gamma[(\nu + m)/2]}{\Gamma(\nu/2)\,\nu^{m/2}\pi^{m/2}\sqrt{|\det(\Sigma)|}}\Big[1 + \frac{1}{\nu}\,w^T\Sigma^{-1}w\Big]^{-(\nu + m)/2},$$

with degrees of freedom $\nu$ and identity shape matrix $\Sigma = I$. The multivariate t-distribution approaches the multivariate Gaussian as $\nu \to \infty$. Random variables drawn from the multivariate t-distribution are uncorrelated but not independent. This distribution is rotationally-invariant and satisfies the conditions in Propositions 1 and 4.

Second, consider the multivariate distribution

$$f(w) = \prod_{i=1}^{m} \frac{\beta}{2\alpha\Gamma(1/\beta)}\,e^{-|w_i/\alpha|^{\beta}}, \qquad (7)$$

which is not rotationally-invariant (except when $\beta = 2$, which coincides with a Gaussian distribution) but whose coordinates are IID and satisfy the conditions in Theorem 6. As $\beta \to \infty$ this distribution converges pointwise to the uniform distribution on $[-\alpha, \alpha]$ in each coordinate.

In Figure 3, we empirically verify Propositions 1 and 4. In the one hidden layer case, the samples follow the blue curve $j = 1$, regardless of the specific multivariate t weight distribution, which varies with $\nu$. We also observe that the universality of the equivalent kernel appears to hold for the distribution (7) regardless of the value of $\beta$, as predicted by theory. We discuss the relevance of the curves $j > 1$ in Section 5.

## 5. Implications for Practical Networks

### 5.1. Composed Kernels in Deep Networks

A recent advancement in understanding the difficulty of training deep neural networks is the identification of the shattered gradients problem (Balduzzi et al., 2017). Without skip connections, the gradients of deep networks approach white noise as they are backpropagated through the network, making them difficult to train. A simple observation that complements this view is obtained through repeated composition of the normalized kernel. As $m \to \infty$, the angle between two inputs in the $j$th layer of a LReLU network with random weights with $E[W_i] = 0$ and $E|W_i^3| < \infty$ approaches

$$\cos\theta_j = \frac{1}{1 + a^2}\Big[\frac{(1-a)^2}{\pi}\big(\sin\theta_{j-1} + (\pi - \theta_{j-1})\cos\theta_{j-1}\big) + 2a\cos\theta_{j-1}\Big].$$
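The fixed-point behaviour of this recursion can be seen directly by iterating it, as in the following sketch (our code; the slope $a$, the starting angles and the reported depths are arbitrary choices).

```python
# Iterating the composed normalized-kernel map for LReLU activations:
# cos(theta_j) = T(cos(theta_{j-1})).
import numpy as np

def T(z, a):
    """Map z = cos(theta_{j-1}) to cos(theta_j) for a LReLU layer with slope a."""
    z = np.clip(z, -1.0, 1.0)
    return ((1 - a) ** 2 / np.pi * (np.sqrt(1.0 - z * z) + (np.pi - np.arccos(z)) * z)
            + 2.0 * a * z) / (1.0 + a * a)

a = 0.2
for theta0 in (0.5, 1.5, 3.0):
    z, report = np.cos(theta0), []
    for j in range(1, 129):
        z = T(z, a)
        if j in (1, 8, 32, 128):
            report.append(f"theta_{j}={np.arccos(np.clip(z, -1.0, 1.0)):.3f}")
    print(f"theta_0={theta0:.1f}: " + ", ".join(report))
```

Regardless of the starting angle, $\theta_j$ collapses towards 0 within a few tens of layers, matching the deep curves in Figure 3.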
A result similar to the following is hinted at by Lee et al. (2017), citing Schoenholz et al. (2017). Their analysis, which considers biases in addition to weights (Poole et al., 2016), yields insights on the trainability of random neural networks that our analysis cannot. However, their argument does not appear to provide a complete formal proof for the case when the activation functions are unbounded, e.g., ReLU. The degeneracy of the composed kernel with more general activation functions is also proved by Daniely et al. (2016), under the assumption that the weights are Gaussian distributed.

**Corollary 8.** The normalized kernel corresponding to LReLU activations converges to a fixed point at $\theta^* = 0$.

*Proof.* Let $z = \cos\theta_{j-1}$ and define

$$T(z) = \frac{1}{1 + a^2}\Big[\frac{(1-a)^2}{\pi}\big(\sqrt{1 - z^2} + (\pi - \cos^{-1}z)\,z\big) + 2az\Big].$$

The magnitude of the derivative of $T$ is

$$|T'(z)| = \Big|1 - \frac{(1-a)^2}{\pi(1 + a^2)}\cos^{-1}z\Big|,$$

which is bounded above by 1 on $[-1, 1]$. Therefore, $T$ is a contraction mapping. By Banach's fixed point theorem there exists a unique fixed point $z^* = \cos\theta^*$. Setting $\theta^* = 0$ verifies that $\theta^* = 0$ is a solution, and $\theta^*$ is unique.

Corollary 8 implies that for this deep network, the angle between any two signals at a deep layer approaches 0. No matter what the input is, the kernel sees the same thing after accounting for the scaling induced by the norm of the input. Hence, it becomes increasingly difficult to train deeper networks, as much of the information is lost and the outputs will depend merely on the norm of the inputs; the signals decorrelate from the inputs as they propagate through the layers. At first this may seem counter-intuitive. An appeal to intuition can be made by considering the corresponding linear network with deterministic and equal weight matrices in each layer, which amounts to the celebrated power iteration method. In this case, the repeated application of a matrix transformation $A$ to a vector $v$ converges to the dominant eigenvector (i.e., the eigenvector corresponding to the largest eigenvalue) of $A$.

Figure 3 shows that the theoretical normalized kernel for networks of increasing depth closely follows empirical samples from randomly initialized neural networks.

In addition to convergence of direction, by also requiring that $\|x\| = \|y\|$ it can be shown that, after accounting for scaling, the magnitudes of the signals converge as the signals propagate through the network. This is analogous to having the dominant eigenvalue equal to 1 in the power iteration comparison.

**Corollary 9.** The quantity $E\big[(\sigma^{(j)}(x) - \sigma^{(j)}(y))^2\big]/E\big[\sigma^{(j)}(x)^2\big]$ in a $j$-layer random (L)ReLU network of infinite width with random uncorrelated and identically distributed rotationally-invariant weights and $\|x\| = \|y\|$ approaches 0 as $j \to \infty$.

*Proof.* Denote the output of one neuron in the $j$th layer of a network $\sigma(W^{(1)}\sigma(\dots\sigma(W^{(j)}x)))$ by $\sigma^{(j)}(x)$ and let $k_j$ be the kernel for the $j$-layer network. Then

$$E\big[(\sigma^{(j)}(x) - \sigma^{(j)}(y))^2\big]/E\big[\sigma^{(j)}(x)^2\big] = \big(k_j(x, x) - 2k_j(x, y) + k_j(y, y)\big)/k_j(x, x) = 2 - 2\cos\theta_j,$$

which approaches 0 as $j \to \infty$.
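Corollary 9 can also be observed directly in a finite random network. The sketch below is our example (width, depth and the LReLU slope are arbitrary, and the weight variance uses the scheme derived in Section 5.2); it propagates two equal-norm Gaussian inputs through a deep random LReLU network and prints the normalized squared distance between their hidden representations.

```python
# Empirical illustration of Corollary 9: the normalized squared distance between
# the hidden representations of two equal-norm inputs shrinks with depth.
import numpy as np

rng = np.random.default_rng(3)
n, a = 1000, 0.2                                    # width (= input dimension), LReLU slope
lrelu = lambda z: np.where(z > 0, z, a * z)

x = rng.standard_normal(n)
y = rng.standard_normal(n)
y *= np.linalg.norm(x) / np.linalg.norm(y)          # enforce ||x|| = ||y||

hx, hy = x, y
for j in range(1, 33):
    W = rng.standard_normal((n, n)) * np.sqrt(2.0 / ((1 + a ** 2) * n))
    hx, hy = lrelu(W @ hx), lrelu(W @ hy)
    if j in (1, 4, 16, 32):
        ratio = np.sum((hx - hy) ** 2) / np.sum(hx ** 2)
        print(f"layer {j:2d}:  ||h(x) - h(y)||^2 / ||h(x)||^2 = {ratio:.3f}")
```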
Contrary to the shattered gradients analysis, which applies to gradient-based optimizers, our analysis relates to any optimizer that initializes weights from some distribution satisfying the conditions in Proposition 4 or Corollary 7. Since information is lost during signal propagation, the network's output shares little information with the input. An optimizer that tries to relate inputs, outputs and weights through a suitable cost function will be blind to relationships between inputs and outputs.

Figure 4. Histograms showing the ratio of the norm of signals in layer $j$ to the norm of the input signals. Each histogram contains 1000 data points randomly sampled from a unit Gaussian distribution. The network tested has 1000 inputs, 1000 neurons in each layer, and LReLU activations with a = 0.2. The legend indicates the number of layers in the network. The weights are randomly initialized from a Gaussian distribution. (Left) Weights initialized according to the method of He et al. (2015). (Right) Weights initialized according to (8).

Our results can be used to argue against the utility of the controversial Extreme Learning Machine (ELM) (Huang et al., 2004), which randomly initializes hidden layers from symmetric distributions and only learns the weights in the final layer. A single-layer ELM can be replaced by kernel ridge regression using the equivalent kernel. Furthermore, a Multi-Layer ELM (Tang et al., 2016) with (L)ReLU activations utilizes a pathological kernel, as shown in Figure 3. It should be noted that the ELM bears resemblance to early works (Schmidt et al., 1992; Pao et al., 1994).

### 5.2. Initialization

Suppose we wish to approximately preserve the $\ell_2$ norm from the input to the hidden layer. By comparing (1) and (2), we approximately have $\|h(x)\| \approx \sqrt{k(x, x)\,n}$. Letting $\theta_0 = 0$ in (6), we have $\|h(x)\| = \|x\|\sqrt{n\,E[W_i^2](1 + a^2)/2}$. Setting $\|h(x)\| = \|x\|$,

$$E[W_i^2] = \frac{2}{(1 + a^2)\,n}. \qquad (8)$$

This applies whenever the conditions in Proposition 4 or Corollary 12 are satisfied. It agrees with the well-known case where the elements of $W$ are IID (He et al., 2015) and $a = 0$. For small values of $a$, (8) is well approximated by the known result of He et al. (2015). For larger values of $a$, this approximation breaks down, as shown in Figure 4.

An alternative approach to weight initialization is the data-driven approach (Mishkin & Matas, 2016), which can be applied to more complicated network structures such as convolutional and max-pooling layers commonly used in practice. As parameter distributions change during training, batch normalization inserts layers with learnable scaling and centering parameters at the cost of increased computation and complexity (Ioffe & Szegedy, 2015).
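A minimal check of the norm-preservation argument behind (8), as our example (the choice $a = 0.5$ is deliberately large so the gap is visible): a single LReLU layer initialized with $E[W_i^2] = 2/((1+a^2)n)$ roughly preserves the input norm, whereas the $a = 0$ ReLU scaling $2/n$ inflates it by a factor of about $\sqrt{1 + a^2}$.

```python
# Norm preservation under (8) versus the a = 0 ReLU scaling, for a single
# LReLU layer.
import numpy as np

rng = np.random.default_rng(4)
m, n, a = 1000, 1000, 0.5
lrelu = lambda z: np.where(z > 0, z, a * z)
x = rng.standard_normal(m)

for label, var in (("eq. (8), 2/((1+a^2)n)", 2.0 / ((1 + a ** 2) * n)),
                   ("ReLU rule, 2/n       ", 2.0 / n)):
    W = rng.standard_normal((n, m)) * np.sqrt(var)
    ratio = np.linalg.norm(lrelu(W @ x)) / np.linalg.norm(x)
    print(f"{label}  ||h(x)|| / ||x|| = {ratio:.3f}")
```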
## 6. Conclusion

We have considered universal properties of MLPs with weights coming from a large class of distributions. We have theoretically and empirically shown that the equivalent kernel for networks with an infinite number of hidden ReLU neurons and all rotationally-invariant weight distributions is the Arc-Cosine Kernel. The CLT can be applied to approximate the kernel for high-dimensional input data. When the activations are LReLUs, the equivalent kernel has a similar form. The composed kernel converges to a fixed point, showing that information is lost as signals propagate through the network.

One avenue for future work is to study the equivalent kernel for different activation functions, noting that some activations such as the ELU may not be expressible in closed form (we do show in the supplementary material, however, that the ELU has an asymptotically universal kernel). Since wide networks with centered weight distributions have approximately the same equivalent kernel, powerful trained deep and wide MLPs with (L)ReLU activations should have asymmetric, non-zero mean, non-IID parameter distributions. Future work may consider analyzing the equivalent kernels of trained networks and more complicated architectures. We should not expect that $k(x, y)$ may be expressed neatly as $k(\theta_0)$ in these cases. This work is a crucial first step in identifying invariant properties in neural networks and sets a foundation from which we hope to expand in future.

## A. Proof of Proposition 2

*Proof.* The kernel with weight PDF $f(\omega)$ and ReLU $\sigma$ is

$$k(x, y) = \int_{\mathbb{R}^m} \Theta(\omega\cdot x)\,\Theta(\omega\cdot y)\,(\omega\cdot x)(\omega\cdot y)\,f(\omega)\,d\omega.$$

Let $\theta_0$ be the angle between $x$ and $y$. Define $u = (\|x\|, 0, \dots, 0)^T$ and $v = (\|y\|\cos\theta_0, \|y\|\sin\theta_0, 0, \dots, 0)^T$ with $u, v \in \mathbb{R}^m$. Following Cho & Saul (2009), there exists some $m \times m$ rotation matrix $R$ such that $x = Ru$ and $y = Rv$. We have

$$k(x, y) = \int_{\mathbb{R}^m} \Theta(\omega\cdot Ru)\,\Theta(\omega\cdot Rv)\,(\omega\cdot Ru)(\omega\cdot Rv)\,f(\omega)\,d\omega.$$

Let $\omega = Rw$ and note that the dot product is invariant under rotations and the determinant of the Jacobian of the transformation is 1 since $R$ is orthogonal. We have

$$k(x, y) = \int_{\mathbb{R}^m} \Theta(w\cdot u)\,\Theta(w\cdot v)\,(w\cdot u)(w\cdot v)\,f(w)\,dw = \|x\|\,\|y\| \int_{\mathbb{R}^m} \Theta(\|x\|w_1)\,\Theta\big(\|y\|(w_1\cos\theta_0 + w_2\sin\theta_0)\big)\,w_1(w_1\cos\theta_0 + w_2\sin\theta_0)\,f(w)\,dw. \qquad (9)$$

One may view the integrand as a functional acting on test functions of $\theta_0$. Denote the set of infinitely differentiable test functions on $(0, \pi)$ by $C_c^{\infty}(0, \pi)$. The linear functional acting over $C_c^{\infty}(0, \pi)$ is a generalized function and we may take distributional derivatives under the integral by Theorem 7.40 of Jones (1982). Differentiating twice,

$$k''(\theta_0) + k(\theta_0) = \|x\|\,\|y\| \int_{\mathbb{R}^m} \Theta(w_1)\,w_1(-w_1\sin\theta_0 + w_2\cos\theta_0)^2\,\delta(w_1\cos\theta_0 + w_2\sin\theta_0)\,f(w)\,dw = \|x\|\,\|y\|\sin\theta_0 \int_{\mathbb{R}^{m-1}} f\big((s\sin\theta_0, -s\cos\theta_0, w_3, \dots, w_m)^T\big)\,\Theta(s)\,s^3\,ds\,dw_3\,dw_4 \cdots dw_m.$$

The initial condition $k(\pi) = 0$ is obtained by putting $\theta_0 = \pi$ in (9) and noting that the resulting integrand contains a factor of $\Theta(w_1)\Theta(-w_1)w_1$, which is 0 everywhere. Similarly, the integrand of $k'(\pi)$ contains a factor of $\Theta(w_2)\Theta(-w_2)w_2$. The ODE is meant in a distributional sense, namely that

$$\int_0^{\pi} \psi(\theta_0)\big(k''(\theta_0) + k(\theta_0) - F(\theta_0)\big)\,d\theta_0 = 0 \quad \forall\,\psi \in C_c^{\infty}(0, \pi),$$

where $k$ is a distribution with a distributional second derivative $k''$.

## B. Proof of Proposition 3

*Proof.* Denote the marginal PDF of the first two coordinates of $W$ by $f_{12}$. Due to the rotational invariance of $f$, $f(Ox) = f(\|x\|) = f(x)$ for any orthogonal matrix $O$. So

$$F(\theta_0) = \|x\|\,\|y\|\sin\theta_0 \int_{\mathbb{R}^{m-1}} f\big((s\sin\theta_0, -s\cos\theta_0, w_3, \dots, w_m)^T\big)\,\Theta(s)\,s^3\,ds\,dw_3 \cdots dw_m = \|x\|\,\|y\|\sin\theta_0 \int_{\mathbb{R}} \Theta(s)\,s^3\,f_{12}\big((s, 0)^T\big)\,ds = K\sin\theta_0, \qquad K \in (0, \infty].$$

It remains to check that $K < \infty$. $F$ is integrable since

$$\int_0^{\pi}\int_{\mathbb{R}^2} \Theta(w_1)\,w_1(-w_1\sin\theta_0 + w_2\cos\theta_0)^2\,\delta(w_1\cos\theta_0 + w_2\sin\theta_0)\,f_{12}(w_1, w_2)\,d\theta_0\,dw_1\,dw_2 \le \int_{\mathbb{R}^2} \Theta(w_1)\,w_1\,(w_1^2 + w_2^2)^{1/2}\,f_{12}(w_1, w_2)\,dw_1\,dw_2 \le \sqrt{E\big[\Theta^2(W_1)W_1^2\big]}\,\sqrt{E\big[W_1^2 + W_2^2\big]} < \infty.$$

Therefore, $F$ is finite almost everywhere. This is only true if $K < \infty$. Since $k'' = F - k$ must be a function, the distributional and classical derivatives coincide.

## C. Proof of Theorem 6

*Proof.* There exist orthonormal $R_1, R_2 \in \mathbb{R}^m$ such that $y^{(m)} = \|y^{(m)}\|(R_1\cos\theta_0 + R_2\sin\theta_0)$ and $x^{(m)} = \|x^{(m)}\| R_1$. We would like to examine the asymptotic distribution of

$$\sigma\big(\|y^{(m)}\|\,W^{(m)}\cdot(R_1\cos\theta_0 + R_2\sin\theta_0)\big)\,\sigma\big(\|x^{(m)}\|\,W^{(m)}\cdot R_1\big).$$

Let $U_1^{(m)} = W\cdot R_1\cos\theta_0 + W\cdot R_2\sin\theta_0$ and $U_2^{(m)} = -W\cdot R_1\sin\theta_0 + W\cdot R_2\cos\theta_0$. Note that $E[U_1^{(m)2}] = E[U_2^{(m)2}] = E[W_i^2]$ and $E[U_1^{(m)}] = E[U_2^{(m)}] = 0$. Also note that $U_1^{(m)}$ and $U_2^{(m)}$ are uncorrelated, since

$$E\big[U_1^{(m)}U_2^{(m)}\big] = E\Big[(W\cdot R_1)(W\cdot R_2)\big(\cos^2\theta_0 - \sin^2\theta_0\big) + \cos\theta_0\sin\theta_0\big((W\cdot R_2)^2 - (W\cdot R_1)^2\big)\Big] = 0.$$

Let $M_k = E|W_i|^k$, $U^{(m)} = (U_1, U_2)^T$, $I$ be the $2 \times 2$ identity matrix and $Q \sim \mathcal{N}(0, M_2 I)$. Then for any convex set $S \subseteq \mathbb{R}^2$ and some $C \in \mathbb{R}$, by the Berry-Esseen Theorem, $\big|P[U^{(m)} \in S] - P[Q \in S]\big| \le C\gamma$, where $\gamma$ is given by

$$\gamma = \sum_{j=1}^{m} E\left\|M_2^{-1/2}\,W_j\begin{pmatrix} R_{1j}\cos\theta_0 + R_{2j}\sin\theta_0 \\ -R_{1j}\sin\theta_0 + R_{2j}\cos\theta_0 \end{pmatrix}\right\|^3 = M_2^{-3/2} M_3 \sum_{j=1}^{m}\big(R_{1j}^2 + R_{2j}^2\big)^{3/2},$$

so that by the Cauchy-Schwarz inequality

$$\gamma^2 \le M_2^{-3} M_3^2\, m \sum_{j=1}^{m}\big(R_{1j}^2 + R_{2j}^2\big)^3 = M_2^{-3} M_3^2\, m \sum_{j=1}^{m}\big(R_{1j}^6 + 3R_{1j}^4 R_{2j}^2 + 3R_{1j}^2 R_{2j}^4 + R_{2j}^6\big) \le M_2^{-3} M_3^2\, m\Big(4\max_{k=1}^{m} R_{1k}^4 + 4\max_{k=1}^{m} R_{2k}^4\Big).$$
The last line is due to the fact that

$$\sum_{j=1}^{m}\big(R_{1j}^6 + 3R_{1j}^4 R_{2j}^2\big) \le \Big(\max_{k=1}^{m} R_{1k}^4\Big)\sum_{j=1}^{m}\big(R_{1j}^2 + 3R_{2j}^2\big) = 4\max_{k=1}^{m} R_{1k}^4,$$

and analogously for the remaining two terms. Now $R_{1k} = \frac{x_k}{\|x\|}$ and $R_{2k} = \frac{1}{\sin\theta_0}\big(\frac{y_k}{\|y\|} - \frac{x_k}{\|x\|}\cos\theta_0\big)$, so if $\theta_0 \ne 0, \pi$, by Hypothesis 5 $U^{(m)}$ converges in distribution to the bivariate spherical Gaussian with variance $E[W_i^2]$. Then the random vector

$$Z^{(m)} = \big(Z_1^{(m)}, Z_2^{(m)}\big)^T = \Big(\|x\|\,W\cdot R_1,\; \|y\|\big(W\cdot R_1\cos\theta_0 + W\cdot R_2\sin\theta_0\big)\Big)^T = \Big(\|x\|\big(U_1\cos\theta_0 - U_2\sin\theta_0\big),\; \|y\|U_1\Big)^T$$

converges in distribution to the bivariate Gaussian random variable with covariance matrix

$$E[W_i^2]\begin{pmatrix} \|x\|^2 & \|x\|\,\|y\|\cos\theta_0 \\ \|x\|\,\|y\|\cos\theta_0 & \|y\|^2 \end{pmatrix}.$$

Since $\sigma$ is continuous almost everywhere, by the Continuous Mapping Theorem,

$$\sigma\big(W^{(m)}\cdot x^{(m)}\big)\,\sigma\big(W^{(m)}\cdot y^{(m)}\big) \xrightarrow{D} \sigma(Z_1)\,\sigma(Z_2).$$

If $\theta_0 = 0$ or $\theta_0 = \pi$, we may treat $R_2$ as 0 and the above still holds.

## D. Proof of Corollary 7

*Proof.* We have $\lim_{m\to\infty} k_f^{(m)}(x^{(m)}, y^{(m)}) = \lim_{m\to\infty} E\big[\sigma(Z_1^{(m)})\sigma(Z_2^{(m)})\big]$ and would like to bring the limit inside the expected value. By Theorem 6 and Theorem 25.12 of Billingsley (1995), it suffices to show that $\sigma(Z_1^{(m)})\sigma(Z_2^{(m)})$ is uniformly integrable. Define $h$ to be the joint PDF of $Z^{(m)}$. We have

$$\int_{|\sigma(z_1)\sigma(z_2)| > \alpha} |\sigma(z_1)\sigma(z_2)|\,h(z_1, z_2)\,dz_1\,dz_2 = \int_{|\Theta(z_1)\Theta(z_2)z_1 z_2| > \alpha} |\Theta(z_1)\Theta(z_2)z_1 z_2|\,h(z_1, z_2)\,dz_1\,dz_2,$$

but the integrand is 0 whenever $z_1 \le 0$ or $z_2 \le 0$. So

$$\int_{|\sigma(z_1)\sigma(z_2)| > \alpha} |\sigma(z_1)\sigma(z_2)|\,h(z_1, z_2)\,dz_1\,dz_2 = \int_{\mathbb{R}^2} z_1 z_2\,\Theta(z_1 z_2 - \alpha)\,\Theta(z_1)\,\Theta(z_2)\,h(z_1, z_2)\,dz_1\,dz_2.$$

We may raise the Heaviside functions to any power without changing the value of the integral. Squaring the Heaviside functions and applying Hölder's inequality, we have

$$\int_{\mathbb{R}^2} z_1 z_2\,\Theta^2(z_1 z_2 - \alpha)\,\Theta^2(z_1)\,\Theta^2(z_2)\,h(z_1, z_2)\,dz_1\,dz_2 \le \sqrt{E\big[z_1^2\,\Theta(z_1 z_2 - \alpha)\Theta(z_1)\Theta(z_2)\big]\,E\big[z_2^2\,\Theta(z_1 z_2 - \alpha)\Theta(z_1)\Theta(z_2)\big]}.$$

Examining the first of these factors,

$$E\big[z_1^2\,\Theta(z_1 z_2 - \alpha)\Theta(z_1)\Theta(z_2)\big] = \int_0^{\infty}\int_{\alpha/z_1}^{\infty} z_1^2\,h(z_1, z_2)\,dz_2\,dz_1.$$

Now let $g_\alpha(z_1) = \int_{\alpha/z_1}^{\infty} h(z_1, z_2)\,dz_2$. The function $g_\alpha(z_1)z_1^2$ is monotonically pointwise non-increasing to 0 in $\alpha$ for all $z_1 > 0$, and $\int z_1^2 g_0(z_1)\,dz_1 \le E[Z_1^2] < \infty$. By the Monotone Convergence Theorem, $\lim_{\alpha\to\infty} E\big[z_1^2\,\Theta(z_1 z_2 - \alpha)\Theta(z_1)\big] = 0$. The second factor has the same limit, so the limit of the right-hand side of Hölder's inequality is 0.

## Acknowledgements

We thank the anonymous reviewers for directing us toward relevant work and providing helpful recommendations regarding the presentation of the paper. Farbod Roosta-Khorasani gratefully acknowledges the support from the Australian Research Council through a Discovery Early Career Researcher Award (DE180100923). Russell Tsuchida's attendance at the conference was made possible by an ICML travel award.

## References

Bach, F. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19):1–53, 2017a.

Bach, F. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(21):1–38, 2017b.

Balduzzi, D., Frean, M., Leary, L., Lewis, J.P., Ma, K.W., and McWilliams, B. The shattered gradients problem: If resnets are the answer, then what is the question? In Proceedings of the 34th International Conference on Machine Learning, volume 70, pp. 342–350, 2017.

Barker, J., Marxer, R., Vincent, E., and Watanabe, S. The third CHiME speech separation and recognition challenge: Analysis and outcomes. Computer Speech and Language, 46:605–626, 2017.

Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

Berthelot, D., Schumm, T., and Metz, L. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.

Billingsley, P. Probability and Measure. Wiley-Interscience, 3rd edition, 1995. ISBN 0471007102.
Bryc, W. Rotation invariant distributions. In The Normal Distribution, pp. 51–69. Springer, 1995.

Burgess, A.N. Estimating equivalent kernels for neural networks: A data perturbation approach. In Advances in Neural Information Processing Systems, pp. 382–388, 1997.

Cho, Y. and Saul, L.K. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pp. 342–350, 2009.

Choromanska, A., Henaff, M., Mathieu, M., Arous, G.B., and LeCun, Y. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pp. 192–204, 2015.

Clevert, D., Unterthiner, T., and Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations, 2016.

Daniely, A., Frostig, R., and Singer, Y. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems, pp. 2253–2261, 2016.

Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.

Haeffele, B.D. and Vidal, R. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540, 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.

Hochreiter, S. Untersuchungen zu dynamischen neuronalen Netzen. Diploma, Technische Universität München, 91, 1991.

Hochreiter, S., Bengio, Y., and Frasconi, P. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In Kolen, J. and Kremer, S. (eds.), Field Guide to Dynamical Recurrent Networks. IEEE Press, 2001.

Huang, G., Zhu, Q., and Siew, C. Extreme learning machine: a new learning scheme of feedforward neural networks. In Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference on, volume 2, pp. 985–990. IEEE, 2004.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456, 2015.

Johnson, M.E. Multivariate Statistical Simulation: A Guide to Selecting and Generating Continuous Multivariate Distributions. John Wiley & Sons, 2013.

Jones, D.S. The Theory of Generalised Functions, chapter 7, pp. 263. Cambridge University Press, 2nd edition, 1982.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. 2009.

Lee, J., Bahri, Y., Novak, R., Schoenholz, S.S., Pennington, J., and Sohl-Dickstein, J. Deep neural networks as Gaussian processes. arXiv preprint arXiv:1611.01232, 2017.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), December 2015.

Livni, R., Carmon, D., and Globerson, A. Learning infinite layer networks without the kernel trick. In International Conference on Machine Learning, pp. 2198–2207, 2017.

MacKay, D.J.C. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.

Martin, C.H. and Mahoney, M.W. Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior. arXiv preprint arXiv:1710.09553, 2017.
Mishkin, D. and Matas, J. All you need is a good init. In International Conference on Learning Representations, 2016.

Neal, R.M. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, 1994.

Pandey, G. and Dukkipati, A. To go deep or wide in learning? In Artificial Intelligence and Statistics, pp. 724–732, 2014a.

Pandey, G. and Dukkipati, A. Learning by stretching deep networks. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1719–1727, 2014b.

Pao, Y., Park, G., and Sobajic, D.J. Learning and generalization characteristics of the random vector functional-link net. Neurocomputing, 6(2):163–180, 1994.

Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., and Ganguli, S. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, pp. 3360–3368, 2016.

Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Sohl-Dickstein, J. On the expressive power of deep neural networks. In Precup, D. and Teh, Y.W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 2847–2854, 2017.

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177–1184, 2008.

Roux, N. Le and Bengio, Y. Continuous neural networks. In Artificial Intelligence and Statistics, pp. 404–411, 2007.

Rudi, A. and Rosasco, L. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, pp. 3218–3228, 2017.

Schmidt, W.F., Kraaijveld, M.A., and Duin, R.P.W. Feedforward neural networks with random weights. In Pattern Recognition, 1992. Vol. II. Conference B: Pattern Recognition Methodology and Systems, Proceedings., 11th IAPR International Conference on, pp. 1–4. IEEE, 1992.

Schoenholz, S.S., Gilmer, J., Ganguli, S., and Sohl-Dickstein, J. Deep information propagation. In International Conference on Learning Representations, 2017.

Shwartz-Ziv, R. and Tishby, N. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., and Hassabis, D. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.

Sinha, A. and Duchi, J.C. Learning kernels with random features. In Advances in Neural Information Processing Systems, pp. 1298–1306, 2016.

Tang, J., Deng, C., and Huang, G. Extreme learning machine for multilayer perceptron. IEEE Transactions on Neural Networks and Learning Systems, 27(4):809–821, 2016.

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

Vincent, E., Watanabe, S., Nugraha, A., Barker, J., and Marxer, R. An analysis of environment, microphone and data simulation mismatches in robust speech recognition. Computer Speech and Language, 46:535–557, 2017.

Williams, C.K.I. Computing with infinite networks. In Advances in Neural Information Processing Systems, pp. 295–301, 1997.
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.