# DEEP INFORMATION PROPAGATION

Published as a conference paper at ICLR 2017

Samuel S. Schoenholz (Google Brain), Justin Gilmer (Google Brain), Surya Ganguli (Stanford University), Jascha Sohl-Dickstein (Google Brain)

Work done as a member of the Google Brain Residency program (g.co/brainresidency).

ABSTRACT

We study the behavior of untrained neural networks whose weights and biases are randomly distributed using mean field theory. We show the existence of depth scales that naturally limit the maximum depth of signal propagation through these random networks. Our main practical result is to show that random networks may be trained precisely when information can travel through them. Thus, the depth scales that we identify provide bounds on how deep a network may be trained for a specific choice of hyperparameters. As a corollary, we argue that in networks at the edge of chaos one of these depth scales diverges; thus arbitrarily deep networks may be trained only sufficiently close to criticality. We show that the presence of dropout destroys the order-to-chaos critical point and therefore strongly limits the maximum trainable depth for random networks. Finally, we develop a mean field theory for backpropagation and show that the ordered and chaotic phases correspond to regions of vanishing and exploding gradients, respectively.

1 INTRODUCTION

Deep neural network architectures have become ubiquitous in machine learning. The success of deep networks is due to the fact that they are highly expressive (Montufar et al., 2014) while simultaneously being relatively easy to optimize (Choromanska et al., 2015; Goodfellow et al., 2014) with strong generalization properties (Recht et al., 2015). Consequently, developments in machine learning often accompany improvements in our ability to train increasingly deep networks. Despite this, designing novel network architectures is frequently equal parts art and science. This is, in part, because a general theory for neural networks that might inform design decisions has lagged behind the feverish pace of design.

A pair of recent papers (Poole et al., 2016; Raghu et al., 2016) demonstrated that random neural networks are exponentially expressive in their depth. Central to their approach was the consideration of networks after random initialization, whose weights and biases were i.i.d. Gaussian distributed. In particular, Poole et al. (2016) developed a mean field formalism for treating wide, untrained neural networks and showed that these mean field networks exhibit an order-to-chaos transition as a function of the weight and bias variances. Notably, the mean field formalism is not closely tied to a specific choice of activation function or loss.

In this paper, we demonstrate the existence of several characteristic depth scales that emerge naturally and control signal propagation in these random networks. We then show that one of these depth scales, $\xi_c$, diverges at the boundary between order and chaos. This result is insensitive to many architectural decisions (such as the choice of activation function) and will generically be true at any order-to-chaos transition. We then extend these results to include dropout and show that even small amounts of dropout destroy the order-to-chaos critical point and consequently remove the divergence in $\xi_c$. Together these results bound the depth to which signal may propagate through random neural networks.
We then develop a corresponding mean field model for gradients and show that a duality exists between the forward propagation of signals and the backpropagation of gradients. The ordered and chaotic phases that Poole et al. (2016) identified correspond to regions of vanishing and exploding gradients, respectively. We demonstrate the validity of this mean field theory by computing gradients of random networks on MNIST. This provides a formal explanation of the vanishing gradient phenomenon that has long been observed in neural networks (Bengio et al., 1993). We further show that the covariance between two gradients is controlled by the same depth scale that limits correlated signal propagation in the forward direction.

Finally, we hypothesize that a necessary condition for a random neural network to be trainable is that information should be able to pass through it. Thus, the depth scales identified here bound the set of hyperparameters that will lead to successful training. To test this ansatz we train ensembles of deep, fully connected, feed-forward neural networks of varying depth on MNIST and CIFAR10, with and without dropout. Our results confirm that neural networks are trainable precisely when their depth is not much larger than $\xi_c$. This result is dataset independent and is, therefore, a universal function of network architecture. A corollary of these results is that asymptotically deep neural networks should be trainable provided they are initialized sufficiently close to the order-to-chaos transition.

The notion of edge-of-chaos initialization has been explored previously. Such investigations have been either direct, as in Bertschinger et al. (2005) and Glorot & Bengio (2010), or indirect, through initialization schemes that favor deep signal propagation such as batch normalization (Ioffe & Szegedy, 2015), orthogonal matrix initialization (Saxe et al., 2014), random walk initialization (Sussillo & Abbott, 2014), composition kernels (Daniely et al., 2016), or residual network architectures (He et al., 2015). The novelty of the work presented here is two-fold. First, our framework predicts the depth at which networks may be trained even far from the order-to-chaos transition. While a skeptic might ask when it would be profitable to initialize a network far from criticality, we respond by noting that there are architectures (such as neural networks with dropout) where no critical point exists, so this more general framework is needed. Second, our work provides a formal, as opposed to intuitive, explanation for why very deep networks can only be trained near the edge of chaos.

2 BACKGROUND

We begin by recapitulating the mean field formalism developed in Poole et al. (2016). Consider a fully connected, untrained, feed-forward neural network of depth $L$ with layer widths $N_l$ and some nonlinearity $\phi : \mathbb{R} \to \mathbb{R}$. Since this is an untrained network, we suppose that its weights and biases are respectively i.i.d. with $W^l_{ij} \sim \mathcal{N}(0, \sigma_w^2 / N_l)$ and $b^l_i \sim \mathcal{N}(0, \sigma_b^2)$. Notationally, we set $z^l_i$ to be the pre-activations of the $l$th layer and $y^{l+1}_i$ to be the activations of that layer. Finally, we take the input to the network to be $y^0_i = x_i$. The propagation of a signal through the network is described by the pair of equations,

$$z^l_i = \sum_j W^l_{ij}\, y^l_j + b^l_i, \qquad y^{l+1}_i = \phi(z^l_i). \tag{1}$$
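To make the setup concrete, the following is a minimal Python/NumPy sketch of one forward pass through a network sampled from this ensemble according to eq. 1. It is our own illustration, not code from the paper; the function name `random_forward` and all numerical values are hypothetical choices.

```python
import numpy as np

def random_forward(x, depth, width, sigma_w2, sigma_b2, phi=np.tanh, seed=0):
    """Propagate an input through one random network drawn from the ensemble of eq. (1)."""
    rng = np.random.default_rng(seed)
    y = x  # y^0 = x
    pre_activations = []
    for _ in range(depth):
        fan_in = y.shape[0]
        # W^l_ij ~ N(0, sigma_w^2 / N_l),  b^l_i ~ N(0, sigma_b^2)
        W = rng.normal(0.0, np.sqrt(sigma_w2 / fan_in), size=(width, fan_in))
        b = rng.normal(0.0, np.sqrt(sigma_b2), size=width)
        z = W @ y + b        # z^l = W^l y^l + b^l
        y = phi(z)           # y^{l+1} = phi(z^l)
        pre_activations.append(z)
    return pre_activations

# Empirical layer-wise variance of the pre-activations (compare with q^l_aa below).
zs = random_forward(np.random.randn(784), depth=20, width=1000,
                    sigma_w2=1.5, sigma_b2=0.05)
print([round(float(np.mean(z ** 2)), 3) for z in zs[:5]])
```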
Since the weights and biases are randomly distributed, these equations define a probability distribution on the activations and pre-activations over an ensemble of untrained neural networks. The mean field approximation is then to replace $z^l_i$ by a Gaussian whose first two moments match those of $z^l_i$. For the remainder of the paper we take the mean field approximation as given.

Consider first the evolution of a single input, $x_{i;a}$, as it passes through the network (as quantified by $y^l_{i;a}$ and $z^l_{i;a}$). Since the weights and biases are independent with zero mean, the first two moments of the pre-activations in the same layer are

$$\mathbb{E}[z^l_{i;a}] = 0, \qquad \mathbb{E}[z^l_{i;a} z^l_{j;a}] = q^l_{aa}\,\delta_{ij} \tag{2}$$

where $\delta_{ij}$ is the Kronecker delta. Here $q^l_{aa}$ is the variance of the pre-activations in the $l$th layer due to an input $x_{i;a}$, and it is described by the recursion relation

$$q^l_{aa} = \sigma_w^2 \int \mathcal{D}z\, \phi^2\!\left(\sqrt{q^{l-1}_{aa}}\, z\right) + \sigma_b^2 \tag{3}$$

where $\int \mathcal{D}z = \int \frac{dz}{\sqrt{2\pi}}\, e^{-z^2/2}$ is the standard Gaussian measure. Together these equations completely describe the evolution of a single input through a mean field neural network. For any choice of $\sigma_w^2$ and $\sigma_b^2$ with bounded $\phi$, eq. 3 has a fixed point at $q^* = \lim_{l\to\infty} q^l_{aa}$.

The propagation of a pair of signals, $x^0_{i;a}$ and $x^0_{i;b}$, through this network can be understood similarly. Here the mean pre-activations are trivially the same as in the single-input case. The independence of the weights and biases implies that the covariance between different pre-activations in the same layer is given by $\mathbb{E}[z^l_{i;a} z^l_{j;b}] = q^l_{ab}\,\delta_{ij}$. The covariance, $q^l_{ab}$, obeys the recurrence relation

$$q^l_{ab} = \sigma_w^2 \int \mathcal{D}z_1 \mathcal{D}z_2\, \phi(u_1)\phi(u_2) + \sigma_b^2 \tag{4}$$

where $u_1 = \sqrt{q^{l-1}_{aa}}\, z_1$ and $u_2 = \sqrt{q^{l-1}_{bb}}\left[c^{l-1}_{ab} z_1 + \sqrt{1 - (c^{l-1}_{ab})^2}\, z_2\right]$, with $c^l_{ab} = q^l_{ab}/\sqrt{q^l_{aa} q^l_{bb}}$. Here $u_1$ and $u_2$ are Gaussian approximations to the pre-activations in the preceding layer with the correct covariance matrix, and $c^l_{ab}$ is the correlation between the two inputs after $l$ layers.

Figure 1: Mean field criticality. (a) The mean field phase diagram showing the boundary between ordered and chaotic phases as a function of $\sigma_w^2$ and $\sigma_b^2$. (b) The residual $|q^* - q^l_{aa}|$ as a function of depth on a log scale with $\sigma_b^2 = 0.05$ and $\sigma_w^2$ from 0.01 (red) to 1.7 (purple). Clear exponential behavior is observed. (c) The residual $|c^* - c^l_{ab}|$ as a function of depth on a log scale. Again, the exponential behavior is clear. The same color scheme is used as in (b).

Examining eq. 4 it is clear that $c^* = 1$ is a fixed point of the recurrence relation. To determine whether or not $c^* = 1$ is an attractive fixed point, the quantity

$$\chi_1 = \left.\frac{\partial c^l_{ab}}{\partial c^{l-1}_{ab}}\right|_{c=1} = \sigma_w^2 \int \mathcal{D}z \left[\phi'\!\left(\sqrt{q^*}\, z\right)\right]^2 \tag{5}$$

is introduced. Poole et al. (2016) note that the $c^* = 1$ fixed point is stable if $\chi_1 < 1$ and is unstable otherwise. Thus, $\chi_1 = 1$ represents a critical line separating an ordered phase (in which $c^* = 1$ and all inputs end up asymptotically correlated) and a chaotic phase (in which $c^* < 1$ and all inputs end up asymptotically decorrelated). For the case of $\phi = \tanh$, the phase diagram in fig. 1 (a) is observed.

3 ASYMPTOTIC EXPANSIONS AND DEPTH SCALES

Our first contribution is to demonstrate the existence of two depth scales that arise naturally within the framework of mean field neural networks. To motivate these depth scales, we iterate eq. 3 and eq. 4 until convergence for many values of $\sigma_w^2$ between 0.1 and 3.0 with $\sigma_b^2 = 0.05$, starting from $q^0_{aa} = q^0_{bb} = 0.8$ and $c^0_{ab} = 0.6$.
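As a numerical illustration (our own sketch, not the authors' code), the recursions in eq. 3 and eq. 4 can be iterated directly for $\phi = \tanh$ by evaluating the Gaussian integrals with Gauss-Hermite quadrature. The helper names `gauss1d`, `gauss2d`, `q_step`, and `c_step` are ours, and for simplicity the correlation map is iterated at the fixed-point variance $q^*$.

```python
import numpy as np

# Gauss-Hermite nodes/weights, changed to the standard Gaussian measure Dz.
_x, _w = np.polynomial.hermite.hermgauss(60)
_z, _p = np.sqrt(2.0) * _x, _w / np.sqrt(np.pi)

def gauss1d(f):
    """Approximate int Dz f(z)."""
    return float(np.sum(_p * f(_z)))

def gauss2d(f):
    """Approximate int Dz1 Dz2 f(z1, z2)."""
    Z1, Z2 = np.meshgrid(_z, _z)
    return float(np.sum(np.outer(_p, _p) * f(Z1, Z2)))

def q_step(q, sw2, sb2, phi=np.tanh):
    """One application of eq. (3)."""
    return sw2 * gauss1d(lambda z: phi(np.sqrt(q) * z) ** 2) + sb2

def c_step(q, c, sw2, sb2, phi=np.tanh):
    """One application of eq. (4), evaluated at q^l_aa = q^l_bb = q."""
    c = float(np.clip(c, -1.0, 1.0))  # guard against numerical overshoot
    u2 = lambda z1, z2: np.sqrt(q) * (c * z1 + np.sqrt(1.0 - c ** 2) * z2)
    q_ab = sw2 * gauss2d(lambda z1, z2: phi(np.sqrt(q) * z1) * phi(u2(z1, z2))) + sb2
    return q_ab / q

sw2, sb2 = 1.5, 0.05
q = 0.8
for _ in range(200):          # converge the variance to q*
    q = q_step(q, sw2, sb2)
c = 0.6
for _ in range(200):          # then iterate the correlation map towards c*
    c = c_step(q, c, sw2, sb2)
print(q, c)
```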
We see in fig. 1 (b) and (c) that the manner in which both $q^l_{aa}$ approaches $q^*$ and $c^l_{ab}$ approaches $c^*$ is exponential over many orders of magnitude. We therefore anticipate that asymptotically $|q^l_{aa} - q^*| \sim e^{-l/\xi_q}$ and $|c^l_{ab} - c^*| \sim e^{-l/\xi_c}$ for sufficiently large $l$. Here, $\xi_q$ and $\xi_c$ define depth scales over which information may propagate about the magnitude of a single input and the correlation between two inputs, respectively.

We now prove that $q^l_{aa}$ and $c^l_{ab}$ are asymptotically exponential. In both cases we use the same fundamental strategy: we expand one of the recurrence relations (either eq. 3 or eq. 4) about its fixed point to obtain an approximate asymptotic recurrence relation. This asymptotic recurrence relation in turn implies exponential decay towards the fixed point over a depth scale, $\xi_x$.

We first analyze eq. 3 and identify a depth scale over which information about a single input may propagate. Let $q^l_{aa} = q^* + \epsilon^l$. By construction, so long as $\lim_{l\to\infty} q^l_{aa} = q^*$ exists, it follows that $\epsilon^l \to 0$ as $l \to \infty$. Eq. 3 may be expanded to lowest order in $\epsilon^l$ to arrive at an asymptotic recurrence relation (see Appendix 7.1),

$$\epsilon^{l+1} = \epsilon^l \left[\chi_1 + \sigma_w^2 \int \mathcal{D}z\, \phi''\!\left(\sqrt{q^*} z\right)\phi\!\left(\sqrt{q^*} z\right)\right] + \mathcal{O}\!\left((\epsilon^l)^2\right). \tag{6}$$

Notably, the term multiplying $\epsilon^l$ is a constant. It follows that for large $l$ the asymptotic recurrence relation has an exponential solution, $\epsilon^l \sim e^{-l/\xi_q}$, with $\xi_q$ given by

$$\xi_q^{-1} = -\log\left[\chi_1 + \sigma_w^2 \int \mathcal{D}z\, \phi''\!\left(\sqrt{q^*} z\right)\phi\!\left(\sqrt{q^*} z\right)\right]. \tag{7}$$

This establishes $\xi_q$ as a depth scale that controls how deep information from a single input may penetrate into a random neural network.

Next, we consider eq. 4. Using a similar argument (detailed in Appendix 7.2) we can expand about $c^l_{ab} = c^* + \epsilon^l$ to find an asymptotic recurrence relation,

$$\epsilon^{l+1} = \epsilon^l\, \sigma_w^2 \int \mathcal{D}z_1 \mathcal{D}z_2\, \phi'(u_1^*)\phi'(u_2^*) + \mathcal{O}\!\left((\epsilon^l)^2\right). \tag{8}$$

Here $u_1^* = \sqrt{q^*} z_1$ and $u_2^* = \sqrt{q^*}\left[c^* z_1 + \sqrt{1 - (c^*)^2}\, z_2\right]$. Thus, once again, we expect that for large $l$ this recurrence will have an exponential solution, $\epsilon^l \sim e^{-l/\xi_c}$, with $\xi_c$ given by

$$\xi_c^{-1} = -\log\left[\sigma_w^2 \int \mathcal{D}z_1 \mathcal{D}z_2\, \phi'(u_1^*)\phi'(u_2^*)\right]. \tag{9}$$

In the ordered phase $c^* = 1$ and so $\xi_c^{-1} = -\log \chi_1$. Since the transition between order and chaos occurs when $\chi_1 = 1$, it follows that $\xi_c$ diverges at any order-to-chaos transition so long as $q^*$ and $c^*$ exist.

Figure 2: Depth scales. (a) The iterative correlation map showing $c^{l+1}_{ab}$ as a function of $c^l_{ab}$ for three different values of $\sigma_w^2$. Green inset lines show the linearization of the iterative map about the critical point, $e^{-1/\xi_c}$. The three curves show networks far in the ordered regime (red), at the edge of chaos (purple), and deep in the chaotic regime (blue). (b) The depth scale for information propagated in a single input, $\xi_q$, as a function of $\sigma_w^2$ for $\sigma_b^2 = 0.01$ (black) to $\sigma_b^2 = 0.3$ (green). Dashed lines show theoretical predictions while solid lines show measurements. (c) The depth scale for correlations between inputs, $\xi_c$, for the same values of $\sigma_b^2$. Again dashed lines are the theoretical predictions while solid lines show measurements. Here a clear divergence is observed at the order-to-chaos transition.

These results can be investigated intuitively by plotting $c^{l+1}_{ab}$ vs $c^l_{ab}$, as in fig. 2 (a). In the ordered phase there is only a single fixed point, $c^l_{ab} = 1$. In the chaotic regime a second fixed point develops and the $c^l_{ab} = 1$ point becomes unstable. We see that the linearization about the fixed points becomes significantly closer to the trivial map near the order-to-chaos transition.
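Continuing the sketch above (same caveats; the quadrature helpers `gauss1d`/`gauss2d` and the converged `q`, `c`, `sw2` are reused from that block), the depth scales of eqs. 5, 7, and 9 can be evaluated numerically for $\tanh$, whose derivatives are $\phi' = 1 - \tanh^2$ and $\phi'' = -2\tanh\,(1 - \tanh^2)$.

```python
import numpy as np

# For phi = tanh:  phi'(x) = 1 - tanh(x)^2,  phi''(x) = -2 tanh(x) (1 - tanh(x)^2).
dphi  = lambda x: 1.0 - np.tanh(x) ** 2
d2phi = lambda x: -2.0 * np.tanh(x) * (1.0 - np.tanh(x) ** 2)

def chi1(q, sw2):
    """Eq. (5): chi_1 = sw2 * E[phi'(sqrt(q*) z)^2]."""
    return sw2 * gauss1d(lambda z: dphi(np.sqrt(q) * z) ** 2)

def xi_q(q, sw2):
    """Eq. (7): xi_q^{-1} = -log[chi_1 + sw2 * E[phi''(sqrt(q*) z) phi(sqrt(q*) z)]]."""
    second = sw2 * gauss1d(lambda z: d2phi(np.sqrt(q) * z) * np.tanh(np.sqrt(q) * z))
    return -1.0 / np.log(chi1(q, sw2) + second)

def xi_c(q, c, sw2):
    """Eq. (9): xi_c^{-1} = -log[sw2 * E[phi'(u1*) phi'(u2*)]]."""
    c = float(np.clip(c, -1.0, 1.0))
    u2 = lambda z1, z2: np.sqrt(q) * (c * z1 + np.sqrt(1.0 - c ** 2) * z2)
    return -1.0 / np.log(
        sw2 * gauss2d(lambda z1, z2: dphi(np.sqrt(q) * z1) * dphi(u2(z1, z2))))

print(xi_q(q, sw2), xi_c(q, c, sw2))   # q, c, sw2 from the previous sketch
```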
To test these claims we measure $\xi_q$ and $\xi_c$ directly by iterating the recurrence relations for $q^l_{aa}$ and $c^l_{ab}$ as before, with $q^0_{aa} = q^0_{bb} = 0.8$ and $c^0_{ab} = 0.6$. In this case we consider values of $\sigma_w^2$ between 0.1 and 3.0 and $\sigma_b^2$ between 0.01 and 0.3. For each hyperparameter setting we fit the resulting residuals, $|q^l_{aa} - q^*|$ and $|c^l_{ab} - c^*|$, to exponential functions and infer the depth scale. We then compare this measured depth scale to that predicted by the asymptotic expansion. The result of this measurement is shown in fig. 2. In general we see that the agreement is quite good. As expected, $\xi_c$ diverges at the critical point. As observed in Poole et al. (2016), the depth scale for the propagation of information in a single input, $\xi_q$, is consistently finite and significantly shorter than $\xi_c$. To understand why this is the case, consider eq. 6 and note that for tanh nonlinearities the second term is always negative. Thus, even as $\chi_1$ approaches 1, we expect $\chi_1 + \sigma_w^2 \int \mathcal{D}z\, \phi''(\sqrt{q^*} z)\phi(\sqrt{q^*} z)$ to be substantially smaller than 1.

3.1 DROPOUT

The mean field formalism can be extended to include dropout. The main contribution here is to argue that even infinitesimal amounts of dropout destroy the mean field critical point and therefore limit the trainable network depth. In the presence of dropout the propagation equation, eq. 1, becomes

$$z^l_i = \frac{1}{\rho}\sum_j W^l_{ij}\, p^l_j\, y^l_j + b^l_i \tag{10}$$

where $p^l_j \sim \text{Bernoulli}(\rho)$ and $\rho$ is the probability that a given unit is kept (so $\rho = 1$ corresponds to no dropout). As is typically the case, we have re-scaled the sum by $\rho^{-1}$ so that the mean of the pre-activation is invariant with respect to our choice of dropout rate. Following a similar procedure to the original mean field calculation, consider the fate of two inputs, $x^0_{i;a}$ and $x^0_{i;b}$, as they are propagated through such a random network. We take the dropout masks to be chosen independently for the two inputs, mimicking the manner in which dropout is employed in practice. With dropout, the diagonal term in the covariance matrix is (see Appendix 7.3)

$$q^l_{aa} = \frac{\sigma_w^2}{\rho} \int \mathcal{D}z\, \phi^2\!\left(\sqrt{q^{l-1}_{aa}}\, z\right) + \sigma_b^2. \tag{11}$$

The variance of a single input with dropout therefore propagates in an identical fashion to the vanilla case with a re-scaling $\sigma_w^2 \to \sigma_w^2/\rho$. Intuitively, this result implies that, for the case of a single input, the presence of dropout simply increases the effective variance of the weights. Computing the off-diagonal term of the covariance matrix similarly (see Appendix 7.4),

$$q^l_{ab} = \sigma_w^2 \int \mathcal{D}z_1 \mathcal{D}z_2\, \phi(\tilde u_1)\phi(\tilde u_2) + \sigma_b^2 \tag{12}$$

with $\tilde u_1$, $\tilde u_2$, and $c^l_{ab}$ defined by analogy to the mean field equations without dropout. Here, unlike in the case of a single input, the recurrence relation is identical to the recurrence relation without dropout.

To see that $c^* = 1$ is no longer a fixed point of these dynamics, consider what happens to eq. 12 when we input $c^l_{ab} = 1$. For simplicity, we leverage the short range of $\xi_q$ to replace $q^l_{aa} = q^l_{bb} = q^*$. We find (see Appendix 7.5)

$$c^{l+1}_{ab} = 1 - \frac{\sigma_w^2}{q^*}\left(\frac{1-\rho}{\rho}\right)\int \mathcal{D}z\, \phi^2\!\left(\sqrt{q^*}\, z\right). \tag{13}$$

The subtracted term is positive for any $\rho < 1$. This implies that if $c^l_{ab} = 1$ for any $l$ then $c^{l+1}_{ab} < 1$. Thus, $c^* = 1$ is not a fixed point of eq. 12 for any $\rho < 1$. Since eq. 12 is identical in form to eq. 4, it follows that the depth scale for signal propagation with dropout will likewise be given by eq. 9, with the fixed points $q^*$ and $c^*$ now computed using eq. 11 and eq. 12, respectively. Importantly, since there is no longer a sharp critical point with dropout, we do not expect a diverging depth scale.
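A short continuation of the earlier sketch (again our own code, reusing `gauss1d`/`gauss2d`; the helper names `q_step_dropout` and `c_step_dropout` are hypothetical) makes the loss of the $c^* = 1$ fixed point visible numerically: iterating eq. 11 to its fixed point and then applying eq. 12 to perfectly correlated inputs returns a value strictly below 1 whenever $\rho < 1$, in line with eq. 13.

```python
import numpy as np

def q_step_dropout(q, sw2, sb2, rho, phi=np.tanh):
    """Eq. (11): variance recursion with dropout keep probability rho."""
    return (sw2 / rho) * gauss1d(lambda z: phi(np.sqrt(q) * z) ** 2) + sb2

def c_step_dropout(q, c, sw2, sb2, rho, phi=np.tanh):
    """Eq. (12), evaluated at q^l_aa = q^l_bb = q (the dropout fixed point)."""
    c = float(np.clip(c, -1.0, 1.0))
    u2 = lambda z1, z2: np.sqrt(q) * (c * z1 + np.sqrt(1.0 - c ** 2) * z2)
    q_ab = sw2 * gauss2d(lambda z1, z2: phi(np.sqrt(q) * z1) * phi(u2(z1, z2))) + sb2
    return q_ab / q

sw2, sb2, rho = 1.5, 0.05, 0.95
q = 0.8
for _ in range(200):
    q = q_step_dropout(q, sw2, sb2, rho)       # converge to the dropout fixed point
print(c_step_dropout(q, 1.0, sw2, sb2, rho))   # strictly < 1 for any rho < 1 (eq. 13)
```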
As in networks without dropout, we plot in fig. 3 (a) the iterative map $c^{l+1}_{ab}$ as a function of $c^l_{ab}$. Most significantly, we see that $c^l_{ab} = 1$ is no longer a fixed point of the dynamics. Instead, as the dropout rate increases, $c^l_{ab}$ gets mapped to decreasing values and the fixed point monotonically decreases.

Figure 3: Dropout destroys the critical point, and limits the depth to which information can propagate in a deep network. (a) The iterative correlation map showing $c^{l+1}_{ab}$ as a function of $c^l_{ab}$ for three different values of dropout, $\rho = 1.0$, $0.95$, and $0.9$, for networks tuned close to their critical point. Green inset lines show the linearization of the iterative map about the critical point, $e^{-1/\xi_c}$. (b) The asymptotic value of the correlation map, $c^*$, as a function of $\sigma_w^2$ for different values of dropout from $\rho = 1$ (black) to $\rho = 0.8$ (blue). For all values of dropout except $\rho = 1$, $c^*$ does not show a sharp transition between an ordered phase and a chaotic phase. (c) The correlation depth scale $\xi_c$ as a function of $\sigma_w^2$ for the same values of dropout as in (b). For all values of $\rho$ except $\rho = 1$ there is no divergence in $\xi_c$.

To test these results we plot in fig. 3 (b) the asymptotic correlation, $c^*$, as a function of $\sigma_w^2$ for different values of dropout from $\rho = 0.8$ to $\rho = 1.0$. As expected, we see that for all $\rho < 1$ there is no sharp transition between $c^* = 1$ and $c^* < 1$. Moreover, as the dropout rate increases, the correlation $c^*$ monotonically decreases. Intuitively this makes sense: identical inputs passed through two different dropout masks will become increasingly dissimilar as the dropout rate increases. In fig. 3 (c) we show the depth scale, $\xi_c$, as a function of $\sigma_w^2$ for the same range of dropout probabilities. We find that, as predicted, the depth of signal propagation with dropout is drastically reduced and, importantly, there is no longer a divergence in $\xi_c$. Increasing the dropout rate continues to decrease the correlation depth at constant $\sigma_w^2$.

4 GRADIENT BACKPROPAGATION

There is a duality between the forward propagation of signals and the backpropagation of gradients. To elucidate this connection, consider the backpropagation equations given a loss $E$,

$$\frac{\partial E}{\partial W^l_{ij}} = \delta^l_i\, \phi(z^{l-1}_j), \qquad \delta^l_i = \phi'(z^l_i) \sum_j \delta^{l+1}_j W^{l+1}_{ji} \tag{14}$$

with the identification $\delta^l_i = \partial E / \partial z^l_i$. Within mean field theory, it is clear that the scale of fluctuations of the gradient of the weights in a layer will be proportional to $\mathbb{E}[(\delta^l_i)^2]$ (see Appendix 7.6). In contrast to the pre-activations in forward propagation (eq. 1), the $\delta^l_i$ will typically not be Gaussian distributed, even in the large layer width limit. Nonetheless, we can work out a recurrence relation for the variance of the error, $\tilde q^l_{aa} = \mathbb{E}[(\delta^l_i)^2]$, leveraging the Gaussian ansatz on the pre-activations. In order to do this, however, we must first make an additional approximation: that the weights used during forward propagation are drawn independently from the weights used in backpropagation. This approximation is similar in spirit to the vanilla mean field approximation and is reminiscent of work on feedback alignment (Lillicrap et al., 2014). With this in mind we arrive at the recurrence (see Appendix 7.7),

$$\tilde q^l_{aa} = \tilde q^{l+1}_{aa}\, \frac{N_{l+1}}{N_l}\, \chi_1. \tag{15}$$

The presence of $\chi_1$ in the above equation should perhaps not be surprising: Poole et al. (2016) show that $\chi_1$ is intimately related to the tangent space of a given layer in mean field neural networks.
We note that the backpropagation recurrence features an explicit dependence on the ratio of widths of adjacent layers of the network, Nl+1/Nl. Here we will consider exclusively constant width networks where this factor is unity. For a discussion of the case of unequal layer widths see Glorot & Bengio (2010). Since χ1 depends only on the asymptotic q it follows that for constant width networks we expect eq. 15 to again have an exponential solution with, q l aa = q L aae (L l)/ξ ξ 1 = log χ1. (16) Note that here ξ 1 = log χ1 both above and below the transition. It follows that ξ can be both positive and negative. We conclude that there should be three distinct regimes for the gradients. 1. In the ordered phase, χ1 < 1 and so ξ > 0. We therefore expect gradients to vanish over a depth |ξ |. 2. At criticality, χ1 1 and so ξ . Here gradients should be stable regardless of depth. 3. In the chaotic phase, χ1 > 1 and so ξ < 0. It follows that in this regime gradients should explode over a depth |ξ |. Intuitively these three regimes make sense. To see this, recall that perturbations to a weight in layer l can alternatively be viewed as perturbations to the pre-activations in the same layer. In the ordered phase both the perturbed signal and the unperturbed signal will be asymptotically mapped to the same point and the derivative will be small. In the chaotic phase the perturbed and unperturbed signals will become asymptotically decorrelated and the gradient will be large. Figure 4: Gradient backpropagation behaves similarly to signal forward propagation. (a) The 2norm, || W l ab E||2 2 as a function of layer, l, for a 240 layer random network with a cross-entropy loss on MNIST. Different values of σ2 w from 1.0 (blue) to 4.0 (red) are shown. Clear exponential vanishing / explosion is observed over many orders of magnitude. (b) The depth scale for gradients predicted by theory (dashed line) compared with measurements from experiment (red dots). Similarity between theory and experiment is clear. Deviations near the critical point are primarily due to finite size effects. To investigate these predictions we construct deep random networks of depth L = 240 and layerwidth Nl = 300. We then consider the cross-entropy loss of these networks on MNIST. In fig. 4 (a) we plot the layer-by-layer 2-norm of the gradient, || W l ab E||2 2, as a function of layer, l, for different values of σ2 w. We see that || W l ab E||2 2 behaves exponentially over many orders of magnitude. Moreover, we see that the gradient vanishes in the ordered phase and explodes in the chaotic phase. We test the quantitative predictions of eq. 16 in fig. 4 (b) where we compare |ξ | as predicted from theory with the measured depth-scale constructed from exponential fits to the gradient data. Here we see good quantitative agreement between the theoretical predictions from mean field random networks and experimentally realized networks. Together these results suggest that the approximations on the backpropagation equations were representative of deep, wide, random networks. Finally, we show that the depth scale for correlated signal propagation likewise controls the depth at which information stored in the covariance between gradients can survive. The existence of Published as a conference paper at ICLR 2017 consistent gradients across similar samples from a training set ought to be especially important for determining whether or not a given neural network architecture can be trained. 
Finally, we show that the depth scale for correlated signal propagation likewise controls the depth at which information stored in the covariance between gradients can survive. The existence of consistent gradients across similar samples from a training set ought to be especially important for determining whether or not a given neural network architecture can be trained. To establish this depth scale, first note (see Appendix 7.8) that the covariance between the gradients of two different inputs, $x_{i;a}$ and $x_{i;b}$, will be proportional to

$$\mathbb{E}\left[\left(\nabla_{W^l_{ij}} E_a\right)\left(\nabla_{W^l_{ij}} E_b\right)\right] \propto \mathbb{E}[\delta^l_{i;a}\delta^l_{i;b}] = \tilde q^l_{ab}$$

where $E_a$ is the loss evaluated on $x_{i;a}$ and $\delta^l_{i;a} = \partial E_a/\partial z^l_{i;a}$ are the appropriately defined errors. It can be shown (see Appendix 7.9) that $\tilde q^l_{ab}$ obeys the recurrence relation

$$\tilde q^l_{ab} = \tilde q^{l+1}_{ab}\, \frac{N_{l+2}}{N_{l+1}}\, \sigma_w^2 \int \mathcal{D}z_1 \mathcal{D}z_2\, \phi'(u_1)\phi'(u_2) \tag{17}$$

where $u_1$ and $u_2$ are defined as in the forward pass. Expanding asymptotically, it is clear that to zeroth order in $\epsilon^l$, $\tilde q^l_{ab}$ has an exponential solution, $\tilde q^l_{ab} = \tilde q^L_{ab}\, e^{-(L-l)/\xi_c}$, with $\xi_c$ as defined in the forward pass.

5 EXPERIMENTAL RESULTS

Taken together, the results of this paper lead us to the following hypothesis: a necessary condition for a random network to be trained is that information about the inputs should be able to propagate forward through the network, and information about the gradients should be able to propagate backwards through the network. The preceding analysis shows that networks will have this property precisely when the network depth, $L$, is not much larger than the depth scale $\xi_c$. This criterion is data independent and therefore offers a universal constraint on the hyperparameters that depends on network architecture alone. We now explore this relationship between depth of signal propagation and network trainability empirically.

Figure 5: Mean field depth scales control trainable hyperparameters. The training accuracy for neural networks as a function of their depth and initial weight variance, $\sigma_w^2$, from high accuracy (red) to low accuracy (black). In (a) we plot the training accuracy after 200 training steps on MNIST using SGD; overlaid in grey dashed lines are different multiples of the depth scale for correlated signal propagation, $n\xi_c$. We plot the accuracy in (b) after 2000 training steps on CIFAR10 using SGD, in (c) after 14000 training steps on MNIST using SGD, and in (d) after 300 training steps on MNIST using RMSProp. In (b)-(d) we overlay $6\xi_c$ in white dashed lines.

To investigate this prediction, we consider random networks of depth $10 \le L \le 300$ and $1 \le \sigma_w^2 \le 4$ with $\sigma_b^2 = 0.05$. We train these networks using Stochastic Gradient Descent (SGD) and RMSProp on MNIST and CIFAR10. We use a learning rate of $10^{-3}$ for SGD when $L \le 200$, $10^{-4}$ for larger $L$, and $10^{-5}$ for RMSProp. These learning rates were selected by grid search between $10^{-6}$ and $10^{-2}$ in exponentially spaced steps of factor 10. We note that the depth dependence of the learning rate was explored in detail in Saxe et al. (2014).

In fig. 5 (a)-(d) we color in red the training accuracy that neural networks achieved as a function of $\sigma_w^2$ and $L$ for different datasets, training times, and choices of minimizer (see Appendix 7.10 for more comparisons). In all cases the neural networks overfit the data to give a training accuracy of 100% and test accuracies of 98% on MNIST and 55% on CIFAR10. We emphasize that the purpose of this study is to demonstrate trainability as opposed to optimizing test accuracy.

We now make the connection between the depth scale, $\xi_c$, and the maximum trainable depth more precise. Given the arguments in the preceding sections, we note that if $L = n\xi_c$ then the signal through the network will be attenuated by a factor of $e^n$. To understand how much signal can be lost while still allowing for training, we overlay in fig. 5 (a) curves corresponding to $n\xi_c$ for $n$ from 1 to 6. We find that networks appear to be trainable when $L \lesssim 6\xi_c$. It would be interesting to understand why this is the case.
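As a rough numerical illustration of this criterion (our own sketch, reusing `q_step`, `c_step`, and `xi_c` from the earlier blocks; the factor of 6 is the empirical constant suggested by fig. 5, not a derived quantity), one can tabulate the predicted maximum trainable depth $6\xi_c$ as a function of $\sigma_w^2$.

```python
import numpy as np

def max_trainable_depth(sw2, sb2, n=6):
    """Estimate the heuristic bound L <~ n * xi_c for a given (sigma_w^2, sigma_b^2)."""
    q, c = 0.8, 0.6
    for _ in range(500):
        q = q_step(q, sw2, sb2)        # forward variance fixed point q*
    for _ in range(500):
        c = c_step(q, c, sw2, sb2)     # correlation fixed point c*
    return n * xi_c(q, c, sw2)

for sw2 in (1.0, 1.5, 2.5, 3.5):
    depth = max_trainable_depth(sw2, 0.05)
    print(f"sigma_w^2 = {sw2}: predicted trainable depth ~ {depth:.0f}")
```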
Motivated by this argument, in fig. 5 (b)-(d) we overlay in white dashed lines the predicted depth scale, $6\xi_c$. There is clearly a relationship between the depth of correlated signal propagation and whether or not these networks are trainable. Networks closer to their critical point appear to train more quickly than those further away. Moreover, this relationship has no obvious dependence on dataset, duration of training, or minimizer. We therefore conclude that these bounds on trainable hyperparameters are universal. This in turn implies that to train increasingly deep networks, one must generically be ever closer to criticality.

Figure 6: The effect of dropout on trainability. The same scheme as in fig. 5, but with dropout at (a) $\rho = 0.99$, (b) $\rho = 0.98$, and (c) $\rho = 0.94$. Even for modest amounts of dropout we see an upper bound on the maximum trainable depth for neural networks. We continue to see good agreement between the prediction of our theory and our experimental training accuracy.

Next we consider the effect of dropout. As we showed earlier, even infinitesimal amounts of dropout disrupt the order-to-chaos phase transition and cause the depth scale to become finite. However, since the effect of a single dropout mask is simply to re-scale the weight variance by $\sigma_w^2 \to \sigma_w^2/\rho$, the gradient magnitude will be stable near criticality, while the input and gradient correlations will not be. This therefore offers a unique opportunity to test whether the relevant depth scale is $|1/\log \chi_1|$ or $\xi_c$. In fig. 6 we repeat the same experimental setup as above on MNIST with $\rho = 0.99$, $0.98$, and $0.94$. We observe, first and foremost, that even extremely modest amounts of dropout limit the maximum trainable depth to about $L = 100$. We additionally notice that the depth scale, $\xi_c$, accurately predicts the trainable region for varying amounts of dropout.

6 DISCUSSION

In this paper we have elucidated the existence of several depth scales that control signal propagation in random neural networks. Furthermore, we have shown that the degree to which a neural network can be trained depends crucially on its ability to propagate information about inputs and gradients through its full depth. At the transition between order and chaos, information stored in the correlation between inputs can propagate infinitely far through these random networks. This in turn implies that extremely deep neural networks may be trained sufficiently close to criticality. However, our contribution goes beyond advocating for hyperparameter selection that brings random networks close to criticality. Instead, we offer a general-purpose framework that predicts, at the level of mean field theory, which hyperparameters should allow a network to be trained. This is especially relevant when analyzing schemes like dropout, where there is no critical point and which therefore imply an upper bound on the trainable network depth.

An alternative perspective as to why information stored in the covariance between inputs is crucial for training can be understood by appealing to the correspondence between infinitely wide Bayesian neural networks and Gaussian Processes (Neal, 2012). In particular, the covariance, $q^l_{ab}$, is intimately related to the kernel of the induced Gaussian Process.
It follows that cases in which signal stored in the covariance between inputs may propagate through the network correspond precisely to situations in which the associated Gaussian Process is well defined.

Our work suggests that it may be fruitful to investigate pre-training schemes that attempt to perturb the weights of a neural network to favor information flow through the network. In principle this could be accomplished through a layer-by-layer local criterion for information flow or by selecting the mean and variance in schemes like batch normalization to maximize the covariance depth scale.

These results suggest that theoretical work on random neural networks can be used to inform practical architectural decisions. However, there is still much work to be done. For instance, the framework developed here does not apply to unbounded activations, such as rectified linear units, where it can be shown that there are phases in which eq. 3 does not have a fixed point. Additionally, the analysis here applies directly only to fully connected feed-forward networks, and will need to be extended to architectures with structured weight matrices such as convolutional networks.

We close by noting that in physics it has long been known that, through renormalization, the behavior of systems near critical points can control their behavior even far from the idealized critical case. We therefore make the somewhat bold hypothesis that a broad class of neural network topologies will be controlled by the fully-connected mean field critical point.

ACKNOWLEDGMENTS

We thank Ben Poole, Jeffrey Pennington, Maithra Raghu, and George Dahl for useful discussions. We are additionally grateful to Rocket AI for introducing us to Temporally Recurrent Online Learning and two-dimensional time.

REFERENCES

Y. Bengio, Paolo Frasconi, and P. Simard. The problem of learning long-term dependencies in recurrent networks. In Neural Networks, 1993., IEEE International Conference on, pp. 1183-1188. IEEE, 1993.

Nils Bertschinger, Thomas Natschläger, and Robert A. Legenstein. At the edge of chaos: Real-time computations and self-organized criticality in recurrent neural networks. In L. K. Saul, Y. Weiss, and L. Bottou (eds.), Advances in Neural Information Processing Systems 17, pp. 145-152. MIT Press, 2005.

Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In AISTATS, 2015.

A. Daniely, R. Frostig, and Y. Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. arXiv:1602.05897, 2016.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, volume 9, pp. 249-256, 2010.

Ian J. Goodfellow, Oriol Vinyals, and Andrew M. Saxe. Qualitatively characterizing neural network optimization problems. arXiv:1412.6544, 2014.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv e-prints, December 2015.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning, pp. 448-456, 2015.

Timothy P. Lillicrap, Daniel Cownden, Douglas B. Tweed, and Colin J. Akerman. Random feedback weights support learning in deep neural networks. arXiv:1411.0247, 2014.

Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 27, pp. 2924-2932. Curran Associates, Inc., 2014.
Radford M. Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.

B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli. Exponential expressivity in deep neural networks through transient chaos. arXiv:1606.05340, June 2016.

M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein. On the expressive power of deep neural networks. arXiv:1606.05336, June 2016.

Benjamin Recht, Moritz Hardt, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv:1509.01240, 2015.

A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. International Conference on Learning Representations, 2014.

David Sussillo and L. F. Abbott. Random walks: Training very deep nonlinear feed-forward networks with smart initialization. CoRR, abs/1412.6558, 2014.

7 APPENDIX

Here we present derivations of results from throughout the paper.

7.1 SINGLE INPUT DEPTH SCALE

Consider the recurrence relation for the variance of a single input,

$$q^l_{aa} = \sigma_w^2 \int \mathcal{D}z\, \phi^2\!\left(\sqrt{q^{l-1}_{aa}}\, z\right) + \sigma_b^2 \tag{18}$$

and a fixed point of the dynamics, $q^*$. Then $q^l_{aa}$ can be expanded about the fixed point to yield the asymptotic recurrence relation

$$\epsilon^{l+1} = \epsilon^l\left[\chi_1 + \sigma_w^2 \int \mathcal{D}z\, \phi''\!\left(\sqrt{q^*} z\right)\phi\!\left(\sqrt{q^*} z\right)\right] + \mathcal{O}\!\left((\epsilon^l)^2\right). \tag{19}$$

Derivation: Writing $q^l_{aa} = q^* + \epsilon^l$ and expanding to first order in $\epsilon^l$,

$$q^* + \epsilon^{l+1} = \sigma_w^2 \int \mathcal{D}z\, \phi^2\!\left(\sqrt{q^* + \epsilon^l}\, z\right) + \sigma_b^2 \tag{20}$$
$$= \sigma_w^2 \int \mathcal{D}z \left[\phi\!\left(\sqrt{q^*} z\right) + \frac{\epsilon^l z}{2\sqrt{q^*}}\,\phi'\!\left(\sqrt{q^*} z\right)\right]^2 + \sigma_b^2 + \mathcal{O}((\epsilon^l)^2)$$
$$= \sigma_w^2 \int \mathcal{D}z\, \phi^2\!\left(\sqrt{q^*} z\right) + \sigma_b^2 + \epsilon^l\, \frac{\sigma_w^2}{\sqrt{q^*}} \int \mathcal{D}z\, z\, \phi\!\left(\sqrt{q^*} z\right)\phi'\!\left(\sqrt{q^*} z\right) + \mathcal{O}((\epsilon^l)^2) \tag{23}$$
$$= q^* + \epsilon^l\, \frac{\sigma_w^2}{\sqrt{q^*}} \int \mathcal{D}z\, z\, \phi\!\left(\sqrt{q^*} z\right)\phi'\!\left(\sqrt{q^*} z\right) + \mathcal{O}((\epsilon^l)^2). \tag{24}$$

We therefore arrive at the approximate recurrence relation

$$\epsilon^{l+1} = \epsilon^l\, \frac{\sigma_w^2}{\sqrt{q^*}} \int \mathcal{D}z\, z\, \phi\!\left(\sqrt{q^*} z\right)\phi'\!\left(\sqrt{q^*} z\right) + \mathcal{O}((\epsilon^l)^2). \tag{25}$$

Using the identity $\int \mathcal{D}z\, z f(z) = \int \mathcal{D}z\, f'(z)$ we can rewrite this asymptotic recurrence relation as

$$\epsilon^{l+1} = \epsilon^l\left[\sigma_w^2 \int \mathcal{D}z\, \left[\phi'\!\left(\sqrt{q^*} z\right)\right]^2 + \sigma_w^2 \int \mathcal{D}z\, \phi\!\left(\sqrt{q^*} z\right)\phi''\!\left(\sqrt{q^*} z\right)\right] + \mathcal{O}((\epsilon^l)^2) \tag{26}$$
$$= \epsilon^l\left[\chi_1 + \sigma_w^2 \int \mathcal{D}z\, \phi\!\left(\sqrt{q^*} z\right)\phi''\!\left(\sqrt{q^*} z\right)\right] + \mathcal{O}((\epsilon^l)^2) \tag{27}$$

as required.

7.2 TWO INPUT DEPTH SCALE

Consider the recurrence relation for the covariance of two inputs,

$$q^l_{ab} = \sigma_w^2 \int \mathcal{D}z_1 \mathcal{D}z_2\, \phi(u_1)\phi(u_2) + \sigma_b^2, \tag{28}$$

the correlation between the inputs, $c^l_{ab} = q^l_{ab}/\sqrt{q^l_{aa} q^l_{bb}}$, and a fixed point of the dynamics, $c^*$. Then $c^l_{ab}$ can be expanded about the fixed point to yield the asymptotic recurrence relation

$$\epsilon^{l+1} = \epsilon^l\, \sigma_w^2 \int \mathcal{D}z_1 \mathcal{D}z_2\, \phi'(u_1^*)\phi'(u_2^*) + \mathcal{O}\!\left((\epsilon^l)^2\right). \tag{29}$$

Derivation: Since the relaxation of $q^l_{aa}$ and $q^l_{bb}$ to $q^*$ occurs much more quickly than the convergence of $q^l_{ab}$, we approximate $q^l_{aa} = q^l_{bb} = q^*$ as in Poole et al. (2016). We therefore consider the perturbation $q^l_{ab}/q^* = c^l_{ab} = c^* + \epsilon^l$. It follows that we may make the approximation

$$u^l_2 = \sqrt{q^*}\left[c^l_{ab} z_1 + \sqrt{1 - (c^l_{ab})^2}\, z_2\right] = \sqrt{q^*}\left[c^* z_1 + \sqrt{1 - (c^*)^2 - 2c^*\epsilon^l}\, z_2\right] + \sqrt{q^*}\,\epsilon^l z_1 + \mathcal{O}(\epsilon^2). \tag{31}$$

We now consider the cases $c^* < 1$ and $c^* = 1$ separately; we will later show that the two results agree with one another. First consider $c^* < 1$, in which case we may safely expand the above equation to get

$$u^l_2 = \sqrt{q^*}\left[c^* z_1 + \sqrt{1 - (c^*)^2}\, z_2\right] + \sqrt{q^*}\,\epsilon^l\left(z_1 - \frac{c^* z_2}{\sqrt{1 - (c^*)^2}}\right) + \mathcal{O}(\epsilon^2). \tag{33}$$
This allows us in turn to approximate the recurrence relation,

$$c^{l+1}_{ab} = \frac{\sigma_w^2}{q^*} \int \mathcal{D}z_1 \mathcal{D}z_2\, \phi(u_1^*)\phi(u^l_2) + \frac{\sigma_b^2}{q^*} \tag{34}$$
$$= \frac{\sigma_w^2}{q^*} \int \mathcal{D}z_1 \mathcal{D}z_2\, \phi(u_1^*)\left[\phi(u_2^*) + \sqrt{q^*}\,\epsilon^l\left(z_1 - \frac{c^* z_2}{\sqrt{1-(c^*)^2}}\right)\phi'(u_2^*)\right] + \frac{\sigma_b^2}{q^*} + \mathcal{O}(\epsilon^2)$$
$$= c^* + \frac{\sigma_w^2}{\sqrt{q^*}}\,\epsilon^l \int \mathcal{D}z_1 \mathcal{D}z_2 \left(z_1 - \frac{c^* z_2}{\sqrt{1-(c^*)^2}}\right)\phi(u_1^*)\phi'(u_2^*) \tag{36}$$
$$= c^* + \frac{\sigma_w^2}{\sqrt{q^*}}\,\epsilon^l \left[\int \mathcal{D}z_1 \mathcal{D}z_2\, z_1\, \phi(u_1^*)\phi'(u_2^*) - \frac{c^*}{\sqrt{1-(c^*)^2}} \int \mathcal{D}z_1 \mathcal{D}z_2\, z_2\, \phi(u_1^*)\phi'(u_2^*)\right]$$
$$= c^* + \sigma_w^2\,\epsilon^l \left[\int \mathcal{D}z_1 \mathcal{D}z_2 \left(\phi'(u_1^*)\phi'(u_2^*) + c^*\phi(u_1^*)\phi''(u_2^*)\right) - c^* \int \mathcal{D}z_1 \mathcal{D}z_2\, \phi(u_1^*)\phi''(u_2^*)\right]$$
$$= c^* + \sigma_w^2\,\epsilon^l \int \mathcal{D}z_1 \mathcal{D}z_2\, \phi'(u_1^*)\phi'(u_2^*) \tag{39}$$

where $u_1^*$ and $u_2^*$ are the appropriately defined asymptotic random variables. This leads to the asymptotic recurrence relation

$$\epsilon^{l+1} = \sigma_w^2\,\epsilon^l \int \mathcal{D}z_1 \mathcal{D}z_2\, \phi'(u_1^*)\phi'(u_2^*) \tag{40}$$

as required. We now consider the case where $c^* = 1$ and $c^l_{ab} = 1 - \epsilon^l$. In this case the expansion of $u^l_2$ becomes

$$u^l_2 = \sqrt{q^*}\, z_1 + \sqrt{2 q^* \epsilon^l}\, z_2 - \sqrt{q^*}\,\epsilon^l z_1 + \mathcal{O}(\epsilon^{3/2}) \tag{41}$$

and so the lowest order correction is of order $\mathcal{O}(\sqrt{\epsilon^l})$ as opposed to $\mathcal{O}(\epsilon^l)$. As usual we now expand the recurrence relation, noting that $u_2^* = u_1^*$ is independent of $z_2$ when $c^* = 1$, to find

$$c^{l+1}_{ab} = \frac{\sigma_w^2}{q^*} \int \mathcal{D}z_1 \mathcal{D}z_2\, \phi(u_1^*)\phi(u^l_2) + \frac{\sigma_b^2}{q^*} \tag{42}$$
$$= \frac{\sigma_w^2}{q^*} \int \mathcal{D}z_1 \mathcal{D}z_2\, \phi(u_1^*)\left[\phi(u_2^*) + \left(\sqrt{2 q^* \epsilon^l}\, z_2 - \sqrt{q^*}\,\epsilon^l z_1\right)\phi'(u_2^*) + q^*\epsilon^l z_2^2\, \phi''(u_2^*)\right] + \frac{\sigma_b^2}{q^*}$$
$$= c^* + \sigma_w^2\,\epsilon^l \int \mathcal{D}z\, \phi\!\left(\sqrt{q^*} z\right)\left[\phi''\!\left(\sqrt{q^*} z\right) - \frac{z}{\sqrt{q^*}}\,\phi'\!\left(\sqrt{q^*} z\right)\right] \tag{44}$$
$$= c^* + \sigma_w^2\,\epsilon^l \left[\int \mathcal{D}z\, \phi\!\left(\sqrt{q^*} z\right)\phi''\!\left(\sqrt{q^*} z\right) - \frac{1}{\sqrt{q^*}} \int \mathcal{D}z\, z\, \phi\!\left(\sqrt{q^*} z\right)\phi'\!\left(\sqrt{q^*} z\right)\right] \tag{45}$$
$$= c^* - \sigma_w^2\,\epsilon^l \int \mathcal{D}z \left[\phi'\!\left(\sqrt{q^*} z\right)\right]^2. \tag{46}$$

It follows that the asymptotic recurrence relation in this case will be

$$\epsilon^{l+1} = \epsilon^l\, \sigma_w^2 \int \mathcal{D}z \left[\phi'\!\left(\sqrt{q^*} z\right)\right]^2 = \epsilon^l \chi_1, \tag{47}$$

where $\chi_1$ is the stability condition for the ordered phase. We note that, although the approximations were somewhat different, the asymptotic recurrence relation for $c^* < 1$ reduces to the eq. 47 result when $c^* = 1$. We may therefore use eq. 40 for all $c^*$.

7.3 VARIANCE OF AN INPUT WITH DROPOUT

In the presence of dropout with keep probability $\rho$, the variance of a single input as it is passed through the network is described by the recurrence relation

$$q^l_{aa} = \frac{\sigma_w^2}{\rho} \int \mathcal{D}z\, \phi^2\!\left(\sqrt{q^{l-1}_{aa}}\, z\right) + \sigma_b^2. \tag{48}$$

Derivation: Recall that the recurrence relation for the pre-activations is given by

$$z^l_i = \frac{1}{\rho} \sum_j W^l_{ij}\, p^l_j\, y^l_j + b^l_i \tag{49}$$

where $p^l_j \sim \text{Bernoulli}(\rho)$. It follows that the variance will be given by

$$q^l_{aa} = \mathbb{E}[(z^l_i)^2] = \frac{1}{\rho^2} \sum_j \mathbb{E}[(W^l_{ij})^2]\, \mathbb{E}[(p^l_j)^2]\, \mathbb{E}[(y^l_j)^2] + \mathbb{E}[(b^l_i)^2] = \frac{\sigma_w^2}{\rho} \int \mathcal{D}z\, \phi^2\!\left(\sqrt{q^{l-1}_{aa}}\, z\right) + \sigma_b^2, \tag{52}$$

where we have used the fact that $\mathbb{E}[(p^l_j)^2] = \rho$.

7.4 COVARIANCE OF TWO INPUTS WITH DROPOUT

The covariance between two signals, $z^l_{i;a}$ and $z^l_{i;b}$, with separate i.i.d. dropout masks $p^l_{i;a}$ and $p^l_{i;b}$, is given by

$$q^l_{ab} = \sigma_w^2 \int \mathcal{D}z_1 \mathcal{D}z_2\, \phi(\tilde u_1)\phi(\tilde u_2) + \sigma_b^2, \tag{53}$$

where, in analogy to eq. 4, $\tilde u_1 = \sqrt{q^{l-1}_{aa}}\, z_1$ and $\tilde u_2 = \sqrt{q^{l-1}_{bb}}\left[c^{l-1}_{ab} z_1 + \sqrt{1 - (c^{l-1}_{ab})^2}\, z_2\right]$.

Derivation: Proceeding directly we find that

$$\mathbb{E}[z^l_{i;a} z^l_{i;b}] = \frac{1}{\rho^2} \sum_j \mathbb{E}[(W^l_{ij})^2]\, \mathbb{E}[p^l_{j;a}]\, \mathbb{E}[p^l_{j;b}]\, \mathbb{E}[y^l_{j;a} y^l_{j;b}] + \mathbb{E}[(b^l_i)^2] = \sigma_w^2 \int \mathcal{D}z_1 \mathcal{D}z_2\, \phi(\tilde u_1)\phi(\tilde u_2) + \sigma_b^2 \tag{55}$$

where we have used the fact that $\mathbb{E}[p^l_{j;a}] = \mathbb{E}[p^l_{j;b}] = \rho$. We have also used the same Gaussian substitution for $\mathbb{E}[y^l_{j;a} y^l_{j;b}]$ used in the original mean field calculation.

7.5 THE LACK OF A $c^* = 1$ FIXED POINT WITH DROPOUT

If $c^l_{ab} = 1$ then it follows that

$$c^{l+1}_{ab} = 1 - \frac{\sigma_w^2}{q^*}\left(\frac{1-\rho}{\rho}\right)\int \mathcal{D}z\, \phi^2\!\left(\sqrt{q^*}\, z\right) \tag{56}$$

subject to the approximation $q^l_{aa} \approx q^l_{bb} \approx q^*$. This implies that $c^{l+1}_{ab} < 1$.

Derivation: Plugging in $c^l_{ab} = 1$ with $q^l_{aa} \approx q^l_{bb} \approx q^*$, we find that $\tilde u_1 = \tilde u_2 = \sqrt{q^*} z_1$. It follows that

$$c^{l+1}_{ab} = \frac{q^{l+1}_{ab}}{q^*} = \frac{1}{q^*}\left[\sigma_w^2 \int \mathcal{D}z\, \phi^2\!\left(\sqrt{q^*} z\right) + \sigma_b^2\right] \tag{57}$$
$$= \frac{1}{q^*}\left[\sigma_w^2\left(1 - \rho^{-1} + \rho^{-1}\right)\int \mathcal{D}z\, \phi^2\!\left(\sqrt{q^*} z\right) + \sigma_b^2\right]$$
$$= \frac{1}{q^*}\left[\frac{\sigma_w^2}{\rho}\int \mathcal{D}z\, \phi^2\!\left(\sqrt{q^*} z\right) + \sigma_b^2\right] + \frac{\sigma_w^2}{q^*}\left(1 - \rho^{-1}\right)\int \mathcal{D}z\, \phi^2\!\left(\sqrt{q^*} z\right) \tag{60}$$
$$= 1 - \frac{\sigma_w^2}{q^*}\left(\frac{1-\rho}{\rho}\right)\int \mathcal{D}z\, \phi^2\!\left(\sqrt{q^*} z\right) \tag{61}$$

as required. Here we have integrated out $z_2$ since neither $\tilde u_1$ nor $\tilde u_2$ depends on it.

7.6 MEAN FIELD GRADIENT SCALING

In mean field theory the expected magnitude of the gradient, $\|\nabla_{W^l_{ij}} E\|^2$, will be proportional to $\mathbb{E}[(\delta^l_i)^2]$.
Derivation: We first note that, since the $W^l_{ij}$ are i.i.d.,

$$\|\nabla_{W^l_{ij}} E\|^2 = \sum_{ij}\left(\frac{\partial E}{\partial W^l_{ij}}\right)^2 \propto \mathbb{E}\left[\left(\frac{\partial E}{\partial W^l_{ij}}\right)^2\right]$$

where we have used the fact that the sum over weights is related to the sample expectation over the different realizations of the $W^l_{ij}$, which we approximate by the analytic expectation. In mean field theory, since the pre-activations in each layer are assumed to be i.i.d. Gaussian, it follows that

$$\mathbb{E}\left[\left(\frac{\partial E}{\partial W^l_{ij}}\right)^2\right] = \mathbb{E}[(\delta^l_i)^2]\,\mathbb{E}[\phi^2(z^{l-1}_j)] \tag{64}$$

and the result follows.

7.7 MEAN FIELD BACKPROPAGATION

In mean field theory the recursion relation for the variance of the errors, $\tilde q^l_{aa} = \mathbb{E}[(\delta^l_i)^2]$, is given by

$$\tilde q^l_{aa} = \tilde q^{l+1}_{aa}\, \frac{N_{l+2}}{N_{l+1}}\, \chi_1(q^l_{aa}). \tag{65}$$

Derivation: Computing the variance directly and using the mean field approximation,

$$\tilde q^l_{aa} = \mathbb{E}[(\delta^l_{i;a})^2] = \mathbb{E}[(\phi'(z^l_{i;a}))^2] \sum_j \mathbb{E}[(\delta^{l+1}_{j;a})^2]\, \mathbb{E}[(W^{l+1}_{ji})^2] \tag{66}$$
$$= \mathbb{E}[(\phi'(z^l_{i;a}))^2]\, \frac{\sigma_w^2}{N_{l+1}} \sum_j \mathbb{E}[(\delta^{l+1}_{j;a})^2] \tag{67}$$
$$= \mathbb{E}[(\phi'(z^l_{i;a}))^2]\, \frac{N_{l+2}}{N_{l+1}}\, \sigma_w^2\, \tilde q^{l+1}_{aa} \tag{68}$$
$$= \sigma_w^2\, \tilde q^{l+1}_{aa}\, \frac{N_{l+2}}{N_{l+1}} \int \mathcal{D}z \left[\phi'\!\left(\sqrt{q^l_{aa}}\, z\right)\right]^2 \tag{69}$$
$$= \tilde q^{l+1}_{aa}\, \frac{N_{l+2}}{N_{l+1}}\, \chi_1 \tag{70}$$

as required. In the last step we have made the approximation $q^l_{aa} \approx q^*$, since the depth scale for the variance is short ranged.

7.8 MEAN FIELD GRADIENT COVARIANCE SCALING

In mean field theory we expect the covariance between the gradients of two different inputs to scale as

$$\mathbb{E}\left[\left(\nabla_{W^l_{ij}} E_a\right)\left(\nabla_{W^l_{ij}} E_b\right)\right] \propto \mathbb{E}[\delta^l_{i;a}\delta^l_{i;b}]. \tag{71}$$

Derivation: We proceed in a manner analogous to Appendix 7.6. Note that in mean field theory, since the weights are i.i.d., it follows that

$$\left(\nabla_{W^l_{ij}} E_a\right)\cdot\left(\nabla_{W^l_{ij}} E_b\right) = \sum_{ij} \frac{\partial E_a}{\partial W^l_{ij}}\,\frac{\partial E_b}{\partial W^l_{ij}} \propto \mathbb{E}\left[\frac{\partial E_a}{\partial W^l_{ij}}\,\frac{\partial E_b}{\partial W^l_{ij}}\right] \tag{72}$$

where, as before, the final term approximates the sample expectation. Since the weights in the forward and backward passes are chosen independently, it follows that we can factor the expectation as

$$\mathbb{E}\left[\frac{\partial E_a}{\partial W^l_{ij}}\,\frac{\partial E_b}{\partial W^l_{ij}}\right] = \mathbb{E}[\delta^l_{i;a}\delta^l_{i;b}]\,\mathbb{E}[\phi(z^{l-1}_{j;a})\phi(z^{l-1}_{j;b})] \tag{74}$$

and the result follows.

7.9 MEAN FIELD BACKPROPAGATION OF COVARIANCE

The covariance between the gradients due to two inputs obeys

$$\tilde q^l_{ab} = \tilde q^{l+1}_{ab}\, \frac{N_{l+2}}{N_{l+1}}\, \sigma_w^2 \int \mathcal{D}z_1 \mathcal{D}z_2\, \phi'(u_1)\phi'(u_2) \tag{75}$$

under backpropagation. As in the analogous derivation for the variance, we compute directly,

$$\tilde q^l_{ab} = \mathbb{E}[\delta^l_{i;a}\delta^l_{i;b}] = \mathbb{E}\left[\phi'(z^l_{i;a})\phi'(z^l_{i;b})\right] \sum_j \mathbb{E}[\delta^{l+1}_{j;a}\delta^{l+1}_{j;b}]\, \mathbb{E}[(W^{l+1}_{ji})^2] \tag{76}$$
$$= \tilde q^{l+1}_{ab}\, \frac{N_{l+2}}{N_{l+1}}\, \sigma_w^2 \int \mathcal{D}z_1 \mathcal{D}z_2\, \phi'(u_1)\phi'(u_2) \tag{77}$$

as required.

7.10 FURTHER EXPERIMENTAL RESULTS

Here we include additional experimental figures that investigate the effects of training time, minimizer, and dataset more closely.

Figure 7: Training accuracy on MNIST after (a) 45, (b) 304, (c) 2048, and (d) 13780 steps of SGD with learning rate $10^{-3}$.

Figure 8: Training accuracy on MNIST after (a) 45, (b) 304, (c) 2048, and (d) 13780 steps of RMSProp with learning rate $10^{-5}$.