Published as a conference paper at ICLR 2019

FIXUP INITIALIZATION: RESIDUAL LEARNING WITHOUT NORMALIZATION

Hongyi Zhang (MIT), hongyiz@mit.edu
Yann N. Dauphin (Google Brain), yann@dauphin.io
Tengyu Ma (Stanford University), tengyuma@stanford.edu
(Equal contribution; work done at Facebook.)

ABSTRACT

Normalization layers are a staple in state-of-the-art deep neural network architectures. They are widely believed to stabilize training, enable higher learning rates, accelerate convergence and improve generalization, though the reason for their effectiveness is still an active research topic. In this work, we challenge the commonly-held beliefs by showing that none of the perceived benefits is unique to normalization. Specifically, we propose fixed-update initialization (Fixup), an initialization motivated by solving the exploding and vanishing gradient problem at the beginning of training via properly rescaling a standard initialization. We find training residual networks with Fixup to be as stable as training with normalization, even for networks with 10,000 layers. Furthermore, with proper regularization, Fixup enables residual networks without normalization to achieve state-of-the-art performance in image classification and machine translation.

1 INTRODUCTION

Artificial intelligence applications have witnessed major advances in recent years. At the core of this revolution is the development of novel neural network models and their training techniques. For example, since the landmark work of He et al. (2016), most of the state-of-the-art image recognition systems are built upon a deep stack of network blocks consisting of convolutional layers and additive skip connections, with some normalization mechanism (e.g., batch normalization (Ioffe & Szegedy, 2015)) to facilitate training and generalization. Besides image classification, various normalization techniques (Ulyanov et al., 2016; Ba et al., 2016; Salimans & Kingma, 2016; Wu & He, 2018) have been found essential to achieving good performance on other tasks, such as machine translation (Vaswani et al., 2017) and generative modeling (Zhu et al., 2017). They are widely believed to have multiple benefits for training very deep neural networks, including stabilizing learning, enabling higher learning rates, accelerating convergence, and improving generalization.

Despite the enormous empirical success of training deep networks with normalization, and recent progress on understanding the workings of batch normalization (Santurkar et al., 2018), there is currently no general consensus on why these normalization techniques help training residual neural networks. Intrigued by this topic, in this work we study (i) whether a deep residual network can be trained reliably without normalization, and, if so, (ii) whether it can be trained with the same learning rate, converge at the same speed, and generalize equally well (or even better). Perhaps surprisingly, we find the answers to both questions are Yes. In particular, we show:

Why normalization helps training. We derive a lower bound for the gradient norm of a residual network at initialization, which explains why, with standard initializations, normalization techniques are essential for training deep residual networks at maximal learning rate. (Section 2)

Training without normalization.
We propose Fixup, a method that rescales the standard initialization of residual branches by adjusting for the network architecture. Fixup enables training very deep residual networks stably at maximal learning rate without normalization. (Section 3)

Image classification. We apply Fixup to replace batch normalization on the image classification benchmarks CIFAR-10 (with Wide ResNet) and ImageNet (with ResNet), and find Fixup with proper regularization matches the well-tuned baseline trained with normalization. (Section 4.2)

Machine translation. We apply Fixup to replace layer normalization on the machine translation benchmarks IWSLT and WMT using the Transformer model, and find it outperforms the baseline and achieves new state-of-the-art results on the same architecture. (Section 4.3)

[Figure 1 shows three block diagrams, annotated "remove normalization", "rescale weights (scaled down)", "add scalar multipliers (initialized at 1)" and "add scalar biases (initialized at 0)".]
Figure 1: Left: ResNet basic block (He et al., 2016); batch normalization (Ioffe & Szegedy, 2015) layers are marked in red. Middle: a simple network block ("Fixup w/o bias") that trains stably when stacked together. Right: Fixup further improves training by adding bias parameters. (See Section 3 for details.)

In the remainder of this paper, we first analyze the exploding gradient problem of residual networks at initialization in Section 2. To solve this problem, we develop Fixup in Section 3. In Section 4 we quantify the properties of Fixup and compare it against state-of-the-art normalization methods on real-world benchmarks. A comparison with related work is presented in Section 5.

2 PROBLEM: RESNET WITH STANDARD INITIALIZATIONS LEADS TO EXPLODING GRADIENTS

Standard initialization methods (Glorot & Bengio, 2010; He et al., 2015; Xiao et al., 2018) attempt to set the initial parameters of the network such that the activations neither vanish nor explode. Unfortunately, it has been observed that without normalization techniques such as BatchNorm they do not account properly for the effect of residual connections, and this causes exploding gradients. Balduzzi et al. (2017) characterize this problem for ReLU networks, and we will generalize this to residual networks with positively homogeneous activation functions. A plain (i.e., without normalization layers) ResNet with residual blocks $\{F_1, \dots, F_L\}$ and input $x_0$ computes the activations as

$$x_l = x_0 + \sum_{i=0}^{l-1} F_i(x_i). \qquad (1)$$

ResNet output variance grows exponentially with depth. Here we only consider the initialization, view the input $x_0$ as fixed, and consider the randomness of the weight initialization. We analyze the variance of each layer $x_l$, denoted by $\mathrm{Var}[x_l]$ (which is technically defined as the sum of the variances of all the coordinates of $x_l$). For simplicity we assume the blocks are initialized to be zero mean, i.e., $\mathbb{E}[F_l(x_l) \mid x_l] = 0$. By $x_{l+1} = x_l + F_l(x_l)$ and the law of total variance, we have $\mathrm{Var}[x_{l+1}] = \mathbb{E}[\mathrm{Var}[F_l(x_l) \mid x_l]] + \mathrm{Var}[x_l]$. The ResNet structure prevents $x_l$ from vanishing by forcing the variance to grow with depth, i.e., $\mathrm{Var}[x_l] < \mathrm{Var}[x_{l+1}]$ if $\mathbb{E}[\mathrm{Var}[F_l(x_l) \mid x_l]] > 0$. Yet, combined with initialization methods such as He et al. (2015), the output variance of each residual branch $\mathrm{Var}[F_l(x_l) \mid x_l]$ will be about the same as its input variance $\mathrm{Var}[x_l]$, and thus $\mathrm{Var}[x_{l+1}] \approx 2\,\mathrm{Var}[x_l]$. This causes the output variance to explode exponentially with depth without normalization (Hanin & Rolnick, 2018) for positively homogeneous blocks (see Definition 1). This is detrimental to learning because it can in turn cause gradient explosion.
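To make the variance argument above concrete, the following is a small numerical sketch (our own illustration, not the paper's code; the width, depth, and two-layer branch design are arbitrary choices): stacking bias-free, He-initialized residual branches and tracking $\mathrm{Var}[x_l]$ shows the geometric growth directly.

```python
# Illustrative sketch: track Var[x_l] through a plain (normalization-free)
# residual stack whose branches are bias-free, He-initialized two-layer MLPs.
import torch
import torch.nn as nn

torch.manual_seed(0)
width, depth = 256, 48   # arbitrary illustrative sizes

def make_branch():
    branch = nn.Sequential(
        nn.Linear(width, width, bias=False), nn.ReLU(),
        nn.Linear(width, width, bias=False),
    )
    for m in branch:
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight, nonlinearity='relu')  # He et al. (2015)
    return branch

blocks = [make_branch() for _ in range(depth)]
x = torch.randn(1024, width)                     # x_0 with unit variance
with torch.no_grad():
    for l, branch_l in enumerate(blocks, start=1):
        x = x + branch_l(x)                      # x_l = x_{l-1} + F_{l-1}(x_{l-1})
        if l % 12 == 0:
            print(f"block {l:2d}: Var[x_l] ~ {x.var().item():.3e}")
# The printed variance grows geometrically with depth; the exact per-block factor
# depends on the branch design, but the exponential blow-up is the point.
```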
As we will show, at initialization, the gradient norm of certain activations and weight tensors is lower bounded by the cross-entropy loss up to some constant. Intuitively, this implies that a blowup in the logits will cause gradient explosion. Our result applies to convolutional and linear weights in a neural network with ReLU nonlinearity (e.g., feed-forward network, CNN), possibly with skip connections (e.g., ResNet, DenseNet), but without any normalization. Our analysis utilizes properties of positively homogeneous functions, which we now introduce.

Definition 1 (positively homogeneous function of first degree). A function $f: \mathbb{R}^m \to \mathbb{R}^n$ is called positively homogeneous (of first degree) (p.h.) if for any input $x \in \mathbb{R}^m$ and $\alpha > 0$, $f(\alpha x) = \alpha f(x)$.

Definition 2 (positively homogeneous set of first degree). Let $\theta = \{\theta_i\}_{i \in S}$ be the set of parameters of $f(x)$ and $\theta_{ph} = \{\theta_i\}_{i \in S_{ph}}$ with $S_{ph} \subset S$. We call $\theta_{ph}$ a positively homogeneous set (of first degree) (p.h. set) if for any $\alpha > 0$, $f(x; \theta \setminus \theta_{ph}, \alpha\theta_{ph}) = \alpha f(x; \theta \setminus \theta_{ph}, \theta_{ph})$, where $\alpha\theta_{ph}$ denotes $\{\alpha\theta_i\}_{i \in S_{ph}}$.

Intuitively, a p.h. set is a set of parameters $\theta_{ph}$ in a function $f$ such that for any fixed input $x$ and fixed parameters $\theta \setminus \theta_{ph}$, the map $\theta_{ph} \mapsto f(x; \theta \setminus \theta_{ph}, \theta_{ph})$ is a p.h. function. Examples of p.h. functions are ubiquitous in neural networks, including various kinds of linear operations without bias (fully-connected (FC) and convolution layers, pooling, addition, concatenation, dropout, etc.) as well as the ReLU nonlinearity. Moreover, we have the following claim:

Proposition 1. A function that is the composition of p.h. functions is itself p.h.

We study classification problems with $c$ classes and the cross-entropy loss. We use $f$ to denote a neural network function except for the softmax layer. The cross-entropy loss is defined as $\ell(z, y) \triangleq -y^T(z - \mathrm{logsumexp}(z))$, where $y$ is the one-hot label vector, $z \triangleq f(x) \in \mathbb{R}^c$ is the vector of logits with $z_i$ denoting its $i$-th element, and $\mathrm{logsumexp}(z) \triangleq \log\big(\sum_{i \in [c]} \exp(z_i)\big)$. Consider a minibatch of training examples $D_M = \{(x^{(m)}, y^{(m)})\}_{m=1}^{M}$ and the average cross-entropy loss $\ell_{avg}(D_M) \triangleq \frac{1}{M}\sum_{m=1}^{M} \ell(f(x^{(m)}), y^{(m)})$, where we use $^{(m)}$ to index quantities referring to the $m$-th example. $\|\cdot\|$ denotes any valid norm. We only make the following assumptions about the network $f$:

1. $f$ is a sequential composition of network blocks $\{f_i\}_{i=1}^{L}$, i.e. $f(x_0) = f_L(f_{L-1}(\dots f_1(x_0)))$, each of which is composed of p.h. functions.
2. Weight elements in the FC layer are i.i.d. sampled from a zero-mean symmetric distribution.

These assumptions hold at initialization if we remove all the normalization layers in a residual network with ReLU nonlinearity, assuming all the biases are initialized at 0.

Our results are summarized in the following two theorems, whose proofs are listed in the appendix:

Theorem 1. Denote the input to the $i$-th block by $x_{i-1}$. With Assumption 1, we have

$$\left\|\frac{\partial \ell}{\partial x_{i-1}}\right\| \ge \frac{\ell(z, y) - H(p)}{\|x_{i-1}\|}, \qquad (2)$$

where $p$ is the vector of softmax probabilities and $H$ denotes the Shannon entropy.

Since $H(p)$ is upper bounded by $\log(c)$ and $\|x_{i-1}\|$ is small in the lower blocks, a blowup in the loss will cause a large gradient norm with respect to the lower block input. Our second theorem proves a lower bound on the gradient norm of a p.h. set in a network.

Theorem 2. With Assumption 1, we have

$$\left\|\frac{\partial \ell_{avg}}{\partial \theta_{ph}}\right\| \ge \frac{1}{M\|\theta_{ph}\|}\sum_{m=1}^{M}\big(\ell(z^{(m)}, y^{(m)}) - H(p^{(m)})\big) \triangleq G(\theta_{ph}). \qquad (3)$$

Furthermore, with Assumptions 1 and 2, we have

$$\mathbb{E}\,G(\theta_{ph}) \ge \frac{\mathbb{E}[\max_{i \in [c]} z_i] - \log(c)}{\|\theta_{ph}\|}. \qquad (4)$$

It remains to identify such p.h. sets in a neural network. In Figure 2 we provide three examples of p.h. sets in a ResNet without normalization.
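Before turning to the implications of Theorem 2, here is a quick sanity check of Definition 1 and the Theorem 1 bound (our own toy illustration, not the paper's code): a bias-free ReLU MLP is a composition of p.h. functions, so its logits scale linearly with a positive rescaling of the input, and the gradient with respect to the input satisfies the stated inequality.

```python
# Toy check of positive homogeneity and the Theorem 1 lower bound on a
# bias-free ReLU MLP (a composition of p.h. functions).
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
layers = []
for _ in range(4):
    layers += [nn.Linear(64, 64, bias=False), nn.ReLU()]
net = nn.Sequential(*layers, nn.Linear(64, 10, bias=False))  # c = 10 classes

x = torch.randn(1, 64, requires_grad=True)
alpha = 3.0
# Definition 1: f(alpha * x) = alpha * f(x) for alpha > 0
print(torch.allclose(net(alpha * x), alpha * net(x), atol=1e-4))

# Theorem 1: ||d loss / d x|| >= (loss - H(p)) / ||x||
z = net(x)
y = torch.tensor([3])                      # arbitrary label
loss = F.cross_entropy(z, y)
p = F.softmax(z, dim=-1)
entropy = -(p * p.log()).sum()
grad_x, = torch.autograd.grad(loss, x)
lhs = grad_x.norm().item() * x.norm().item()
rhs = (loss - entropy).item()
print(f"{lhs:.4f} >= {rhs:.4f}")           # the bound (trivially true if rhs < 0)
```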
Theorem 2 suggests that these layers would suffer from the exploding gradient problem if the logits $z$ blow up at initialization, which unfortunately would occur in a ResNet without normalization if initialized in a traditional way. This motivates us to introduce a new initialization in the next section.

Figure 2: Examples of p.h. sets in a ResNet without normalization: (1) the first convolution layer before max pooling; (2) the fully connected layer before softmax; (3) the union of a spatial downsampling layer in the backbone and a convolution layer in its corresponding residual branch.

3 FIXUP: UPDATE A RESIDUAL NETWORK Θ(η) PER SGD STEP

Our analysis in the previous section points out the failure mode of standard initializations for training deep residual networks: the gradient norm of certain layers is in expectation lower bounded by a quantity that increases indefinitely with the network depth. However, escaping this failure mode does not necessarily lead us to successful training; after all, it is the whole network as a function that we care about, rather than a layer or a network block. In this section, we propose a top-down design of a new initialization that ensures proper update scale to the network function, by simply rescaling a standard initialization. To start, we denote the learning rate by $\eta$ and set our goal: $f(x; \theta)$ is updated by $\Theta(\eta)$ per SGD step after initialization as $\eta \to 0$. That is, $\|\Delta f(x)\| = \Theta(\eta)$ where $\Delta f(x) \triangleq f(x; \theta - \eta \nabla_\theta \ell(f(x), y)) - f(x; \theta)$.

Put another way, our goal is to design an initialization such that SGD updates to the network function are in the right scale and independent of the depth.

We define the Shortcut as the shortest path from input to output in a residual network. The Shortcut is typically a shallow network with a few trainable layers.¹ We assume the Shortcut is initialized using a standard method, and focus on the initialization of the residual branches.

Residual branches update the network in sync. To start, we first make an important observation that the SGD update to each residual branch changes the network output in highly correlated directions. This implies that if a residual network has $L$ residual branches, then an SGD step to each residual branch should change the network output by $\Theta(\eta/L)$ on average to achieve an overall $\Theta(\eta)$ update. We defer the formal statement and its proof until Appendix B.1.

Study of a scalar branch. Next we study how to initialize a residual branch with $m$ layers so that its SGD update changes the network output by $\Theta(\eta/L)$. We assume $m$ is a small positive integer (e.g., 2 or 3). As we are only concerned about the scale of the update, it is sufficiently instructive to study the scalar case, i.e., $F(x) = \big(\prod_{i=1}^{m} a_i\big)\,x$ where $a_1, \dots, a_m, x \in \mathbb{R}^+$. For example, the standard initialization methods typically initialize each layer so that the output (after nonlinear activation) preserves the input variance, which can be modeled as setting $a_i = 1$ for all $i \in [m]$. In turn, setting $a_i$ to a positive number other than 1 corresponds to rescaling the $i$-th layer by $a_i$.

Through deriving the constraints for $F(x)$ to make $\Theta(\eta/L)$ updates, we will also discover how to rescale the weight layers of a standard initialization as desired.

¹For example, in the ResNet architecture (e.g., ResNet-50, ResNet-101 or ResNet-152) for ImageNet classification, the Shortcut is always a 6-layer network with five convolution layers and one fully-connected layer, irrespective of the total depth of the whole network.
In particular, we show that the SGD update to $F(x)$ is $\Theta(\eta/L)$ if and only if the initialization satisfies the following constraint:

$$\prod_{i \in [m] \setminus \{j\}} a_i = \Theta\left(\frac{1}{\sqrt{L}}\right), \quad \text{where } j \triangleq \arg\min_k a_k. \qquad (5)$$

We defer the derivation until Appendix B.2. Equation (5) suggests new methods to initialize a residual branch through rescaling the standard initialization of the $i$-th layer in a residual branch by its corresponding scalar $a_i$. For example, we could set $a_i = L^{-\frac{1}{2m-2}}$ for all $i \in [m]$. Alternatively, we could start the residual branch as a zero function by setting $a_m = 0$ and $a_i = L^{-\frac{1}{2m-2}}$ for all $i \in [m-1]$. In the second option, the residual branch does not need to unlearn its potentially bad random initial state, which can be beneficial for learning. Therefore, we use the latter option in our experiments, unless otherwise specified.

The effects of biases and multipliers. With proper rescaling of the weights in all the residual branches, a residual network is supposed to be updated by $\Theta(\eta)$ per SGD step, and our goal is achieved. However, in order to match the training performance of a corresponding network with normalization, there are two more things to consider: biases and multipliers.

Using biases in the linear and convolution layers is a common practice. In normalization methods, bias and scale parameters are typically used to restore the representation power after normalization.² Intuitively, because the preferred input/output mean of a weight layer may be different from the preferred output/input mean of an activation layer, it also helps to insert bias terms in a residual network without normalization. Empirically, we find that inserting just one scalar bias before each weight layer and nonlinear activation layer significantly improves the training performance.

Multipliers scale the output of a residual branch, similar to the scale parameters in batch normalization. They have an interesting effect on the learning dynamics of weight layers in the same branch. Specifically, as the stochastic gradient of a layer is typically almost orthogonal to its weight, learning rate decay tends to cause the weight norm equilibrium to shrink when combined with L2 weight decay (van Laarhoven, 2017). In a branch with multipliers, this in turn causes the growth of the multipliers, increasing the effective learning rate of other layers. In particular, we observe that inserting just one scalar multiplier per residual branch mimics the weight norm dynamics of a network with normalization, and spares us the search for a new learning rate schedule.

Put together, we propose the following method to train residual networks without normalization:

Fixup initialization (or: How to train a deep residual network without normalization)
1. Initialize the classification layer and the last layer of each residual branch to 0.
2. Initialize every other layer using a standard method (e.g., He et al. (2015)), and scale only the weight layers inside residual branches by $L^{-\frac{1}{2m-2}}$.
3. Add a scalar multiplier (initialized at 1) in every branch and a scalar bias (initialized at 0) before each convolution, linear, and element-wise activation layer.

It is important to note that Rule 2 of Fixup is the essential part, as predicted by Equation (5). Indeed, we observe that using Rule 2 alone is sufficient and necessary for training extremely deep residual networks.
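Below is a minimal PyTorch sketch of the three rules above, applied to a basic residual block with $m = 2$ weight layers (so the Rule 2 factor is $L^{-\frac{1}{2m-2}} = L^{-1/2}$). This is our own illustrative rendering of the recipe rather than the authors' released code, and the module and helper names are ours.

```python
# Sketch of Fixup initialization for a stack of basic residual blocks (m = 2).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FixupBasicBlock(nn.Module):
    def __init__(self, planes):
        super().__init__()
        # Rule 3: scalar biases before each conv / activation, one scalar multiplier.
        self.bias1a = nn.Parameter(torch.zeros(1))
        self.conv1 = nn.Conv2d(planes, planes, 3, padding=1, bias=False)
        self.bias1b = nn.Parameter(torch.zeros(1))
        self.bias2a = nn.Parameter(torch.zeros(1))
        self.conv2 = nn.Conv2d(planes, planes, 3, padding=1, bias=False)
        self.scale = nn.Parameter(torch.ones(1))
        self.bias2b = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        out = self.conv1(x + self.bias1a)
        out = F.relu(out + self.bias1b)
        out = self.conv2(out + self.bias2a)
        out = out * self.scale + self.bias2b
        return F.relu(x + out)

def fixup_init(blocks, classifier, m=2):
    L = len(blocks)
    for b in blocks:
        # Rule 2: standard He init, rescaled by L^{-1/(2m-2)}.
        nn.init.kaiming_normal_(b.conv1.weight, nonlinearity='relu')
        with torch.no_grad():
            b.conv1.weight.mul_(L ** (-1.0 / (2 * m - 2)))
        # Rule 1: the last layer of each residual branch starts at zero.
        nn.init.zeros_(b.conv2.weight)
    # Rule 1: the classification layer also starts at zero.
    nn.init.zeros_(classifier.weight)
    nn.init.zeros_(classifier.bias)

blocks = nn.ModuleList([FixupBasicBlock(16) for _ in range(8)])
classifier = nn.Linear(16, 10)
fixup_init(blocks, classifier)

x = torch.randn(2, 16, 8, 8)
for b in blocks:
    x = b(x)
logits = classifier(x.mean(dim=(2, 3)))   # global average pool, then classify
```

With the last layer of every branch at zero, the network starts out as its Shortcut, and Rule 2 keeps the first SGD steps at the $\Theta(\eta/L)$-per-branch scale derived above.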
On the other hand, Rule 1 and Rule 3 make further improvements for training so as to match the performance of a residual network with normalization layers, as explained above.³ Ablation experiments confirm our claims (see Appendix C.1).

²For example, in batch normalization the gamma and beta parameters are used to affine-transform the normalized activations per channel.
³It is worth noting that the design of Fixup is a simplification of the common practice, in that we only introduce $O(K)$ parameters beyond convolution and linear weights (since we remove bias terms from convolution and linear layers), whereas the common practice includes $O(KC)$ (Ioffe & Szegedy, 2015; Salimans & Kingma, 2016) or $O(KCWH)$ (Ba et al., 2016) additional parameters, where $K$ is the number of layers, $C$ is the maximum number of channels per layer, and $W$, $H$ are the spatial dimensions of the largest feature maps.

Our initialization and network design are consistent with recent theoretical work (Hardt & Ma, 2016; Li et al., 2018), which, in much more simplified settings such as linearized residual nets and quadratic neural nets, proposes that small initializations tend to stabilize optimization and help generalization. However, our approach suggests that more delicate control of the scale of the initialization is beneficial.⁴

4 EXPERIMENTS

4.1 TRAINING AT INCREASING DEPTH

One of the key advantages of BatchNorm is that it leads to fast training even for very deep models (Ioffe & Szegedy, 2015). Here we will determine if we can match this desirable property by relying only on proper initialization. We propose to evaluate how each method affects training very deep nets by measuring the test accuracy after the first epoch as we increase depth. In particular, we use the wide residual network (WRN) architecture with width 1 and the default weight decay 5e-4 (Zagoruyko & Komodakis, 2016). We specifically use the default learning rate of 0.1 because the ability to use high learning rates is considered to be important to the success of BatchNorm. We compare Fixup against three baseline methods: (1) rescale the output of each residual block by $\sqrt{1/2}$ (Balduzzi et al., 2017); (2) post-process an orthogonal initialization such that the output variance of each residual block is close to 1 (layer-sequential unit-variance orthogonal initialization, or LSUV) (Mishkin & Matas, 2015); (3) batch normalization (Ioffe & Szegedy, 2015). We use the default batch size of 128 up to 1000 layers, with a batch size of 64 for 10,000 layers. We limit our budget of epochs to 1 due to the computational strain of evaluating models with up to 10,000 layers.

Figure 3: Depth of residual networks versus test accuracy at the first epoch for various methods ($\sqrt{1/2}$-scaling, LSUV, BatchNorm, Fixup) on CIFAR-10 with the default BatchNorm learning rate. We observe that Fixup is able to train very deep networks with the same learning rate as batch normalization. (Higher is better.)

Figure 3 shows the test accuracy at the first epoch as depth increases. Observe that Fixup matches the performance of BatchNorm at the first epoch, even with 10,000 layers. LSUV and $\sqrt{1/2}$-scaling are not able to train with the same learning rate as BatchNorm past 100 layers.

4.2 IMAGE CLASSIFICATION

In this section, we evaluate the ability of Fixup to replace batch normalization in image classification applications.
On the CIFAR-10 dataset, we first test on ResNet-110 (He et al., 2016) with default hyper-parameters; results are shown in Table 1. Fixup obtains a 7% relative improvement in test error compared with standard initialization; however, we note a substantial difference in the difficulty of training. While the network with Fixup is trained with the same learning rate and converges as fast as the network with batch normalization, we fail to train a Xavier-initialized ResNet-110 with 0.1x maximal learning rate.⁵ The test error gap in Table 1 is likely due to the regularization effect of BatchNorm rather than difficulty in optimization; when we train Fixup networks with better regularization, the test error gap disappears and we obtain state-of-the-art results on CIFAR-10 and SVHN without normalization layers (see Appendix C.2).

⁴For example, a learning rate smaller than our choice would also stabilize the training, but lead to a lower convergence rate.
⁵Personal communication with the authors of Shang et al. (2017) confirms our observation, and reveals that the Xavier-initialized network needs more epochs to converge.

Table 1: Results on CIFAR-10 with ResNet-110 (mean/median of 5 runs; lower is better).
ResNet-110                          | Normalization | Large η | Test Error (%)
w/ BatchNorm (He et al., 2016)      | yes           | yes     | 6.61
w/ Xavier Init (Shang et al., 2017) | no            | no      | 7.78
w/ Fixup-init                       | no            | yes     | 7.24

On the ImageNet dataset, we benchmark Fixup with the ResNet-50 and ResNet-101 architectures (He et al., 2016), trained for 100 epochs and 200 epochs respectively. Similar to our finding on the CIFAR-10 dataset, we observe that (1) training with Fixup is fast and stable with the default hyperparameters, (2) Fixup alone significantly improves the test error of standard initialization, and (3) there is a large test error gap between Fixup and BatchNorm. Further inspection reveals that Fixup-initialized models obtain significantly lower training error compared with BatchNorm models (see Appendix C.3), i.e., Fixup suffers from overfitting. We therefore apply stronger regularization to the Fixup models using Mixup (Zhang et al., 2017). We find it is beneficial to reduce the learning rate of the scalar multiplier and bias by 10x when additional large regularization is used. The best Mixup coefficients are found through cross-validation: they are 0.2, 0.1 and 0.7 for BatchNorm, GroupNorm (Wu & He, 2018) and Fixup respectively. We present the results in Table 2, noting that with better regularization, the performance of Fixup is on par with GroupNorm.

Table 2: ImageNet test results using the ResNet architecture (lower is better).
Model      | Method                                 | Normalization | Test Error (%)
ResNet-50  | BatchNorm (Goyal et al., 2017)         | yes | 23.6
ResNet-50  | BatchNorm + Mixup (Zhang et al., 2017) | yes | 23.3
ResNet-50  | GroupNorm + Mixup                      | yes | 23.9
ResNet-50  | Xavier Init (Shang et al., 2017)       | no  | 31.5
ResNet-50  | Fixup-init                             | no  | 27.6
ResNet-50  | Fixup-init + Mixup                     | no  | 24.0
ResNet-101 | BatchNorm (Zhang et al., 2017)         | yes | 22.0
ResNet-101 | BatchNorm + Mixup (Zhang et al., 2017) | yes | 20.8
ResNet-101 | GroupNorm + Mixup                      | yes | 21.4
ResNet-101 | Fixup-init + Mixup                     | no  | 21.4

4.3 MACHINE TRANSLATION

To demonstrate the generality of Fixup, we also apply it to replace layer normalization (Ba et al., 2016) in the Transformer (Vaswani et al., 2017), a state-of-the-art neural network for machine translation. Specifically, we use the fairseq library (Gehring et al., 2017) and follow the Fixup template in Section 3 to modify the baseline model. We evaluate on two standard machine translation datasets, IWSLT German-English (de-en) and WMT English-German (en-de), following the setup of Ott et al. (2018).
For the IWSLT de-en dataset, we cross-validate the dropout probability from {0.3, 0.4, 0.5, 0.6} and find 0.5 to be optimal for both Fixup and the LayerNorm baseline. For the WMT'16 en-de dataset, we use dropout probability 0.4. All models are trained for 200k updates.

It was reported (Chen et al., 2018) that "layer normalization is most critical to stabilize the training process... removing layer normalization results in unstable training runs". However, we find training with Fixup to be very stable and as fast as the baseline model. Results are shown in Table 3. Surprisingly, we find the models do not suffer from overfitting when LayerNorm is replaced by Fixup, thanks to the strong regularization effect of dropout. Instead, Fixup matches or surpasses the state-of-the-art results using the Transformer model on both datasets.

Table 3: Comparing Fixup vs. LayerNorm for machine translation tasks (BLEU; higher is better).
Dataset     | Model                        | Normalization | BLEU
IWSLT DE-EN | (Deng et al., 2018)          | yes | 33.1
IWSLT DE-EN | LayerNorm                    | yes | 34.2
IWSLT DE-EN | Fixup-init                   | no  | 34.5
WMT EN-DE   | (Vaswani et al., 2017)       | yes | 28.4
WMT EN-DE   | LayerNorm (Ott et al., 2018) | yes | 29.3
WMT EN-DE   | Fixup-init                   | no  | 29.3

5 RELATED WORK

Normalization methods. Normalization methods have enabled training very deep residual networks, and are currently an essential building block of the most successful deep learning architectures. All normalization methods for training neural networks explicitly normalize (i.e., standardize) some component (activations or weights) by dividing the activations or weights by some real number computed from their statistics, and/or subtracting some statistic of the activations (typically the mean) from the activations.⁶ In contrast, Fixup does not compute statistics (mean, variance or norm) at initialization or during any phase of training, hence it is not a normalization method.

⁶For reference, we include a brief history of normalization methods in Appendix D.

Theoretical analysis of deep networks. Training very deep neural networks is an important theoretical problem. Early works study the propagation of variance in the forward and backward pass for different activation functions (Glorot & Bengio, 2010; He et al., 2015). Recently, the study of dynamical isometry (Saxe et al., 2013) provides a more detailed characterization of the forward and backward signal propagation at initialization (Pennington et al., 2017; Hanin, 2018), enabling training 10,000-layer CNNs from scratch (Xiao et al., 2018). For residual networks, activation scale (Hanin & Rolnick, 2018), gradient variance (Balduzzi et al., 2017) and the dynamical isometry property (Yang & Schoenholz, 2017) have been studied. Our analysis in Section 2 leads to a similar conclusion as previous work, namely that the standard initialization for residual networks is problematic. However, our use of positive homogeneity for lower bounding the gradient norm of a neural network is novel, and applies to a broad class of neural network architectures (e.g., ResNet, DenseNet) and initialization methods (e.g., Xavier, LSUV) with simple assumptions and proof.

Hardt & Ma (2016) analyze the optimization landscape (loss surface) of linearized residual nets in the neighborhood around the zero initialization, where all the critical points are proved to be global minima. Yang & Schoenholz (2017) study the effect of the initialization of residual nets on the test performance and point out that the Xavier or He initialization schemes are not optimal. In this paper, we give a concrete recipe for an initialization scheme with which we can successfully train deep residual networks without batch normalization.
Understanding batch normalization. Despite its popularity in practice, batch normalization has not been well understood. Ioffe & Szegedy (2015) attributed its success to reducing "internal covariate shift", whereas Santurkar et al. (2018) argue that its effect may be to smooth the loss surface. Our analysis in Section 2 corroborates the latter idea of Santurkar et al. (2018) by showing that standard initialization leads to a very steep loss surface at initialization. Moreover, we empirically showed in Section 3 that the steep loss surface may be alleviated for residual networks by using a smaller initialization than standard ones such as Xavier or He's initialization in the residual branches. van Laarhoven (2017) and Hoffer et al. (2018) studied the effect of (batch) normalization and weight decay on the effective learning rate. Their results inspire us to include a multiplier in each residual branch.

ResNet initialization in practice. Gehring et al. (2017) and Balduzzi et al. (2017) proposed to address the initialization problem of residual nets by using the recurrence $x_l = \sqrt{1/2}\,(x_{l-1} + F_l(x_{l-1}))$. Mishkin & Matas (2015) proposed a data-dependent initialization to mimic the effect of batch normalization in the first forward pass. While both methods limit the scale of activations and gradients, they would fail to train stably at the maximal learning rate for very deep residual networks, since they fail to consider the accumulation of highly correlated updates contributed by different residual branches to the network function (Appendix B.1). Srivastava et al. (2015); Hardt & Ma (2016); Goyal et al. (2017); Kingma & Dhariwal (2018) found that initializing the residual branches at (or close to) zero helped optimization. Our results support their observation in general, but Equation (5) suggests additional subtleties when choosing a good initialization scheme.

6 CONCLUSION

In this work, we study how to train a deep residual network reliably without normalization. Our theory in Section 2 suggests that the exploding gradient problem at initialization in a positively homogeneous network such as ResNet is directly linked to the blowup of logits. In Section 3 we develop Fixup initialization to ensure that the whole network, as well as each residual branch, gets updates of proper scale, based on a top-down analysis. Extensive experiments on real-world datasets demonstrate that Fixup matches normalization techniques in training deep residual networks, and achieves state-of-the-art test performance with proper regularization.

Our work opens up new possibilities for both theory and applications. Can we analyze the training dynamics of Fixup, which may potentially be simpler than analyzing models with batch normalization? Could we apply or extend the initialization scheme to other applications of deep learning? It would also be very interesting to understand the regularization benefits of various normalization methods, and to develop better regularizers to further improve the test performance of Fixup.

ACKNOWLEDGMENTS

The authors would like to thank Yuxin Wu, Kaiming He, Aleksander Madry and the anonymous reviewers for their helpful feedback.

REFERENCES

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

David Balduzzi, Marcus Frean, Lennox Leary, JP Lewis, Kurt Wan-Duo Ma, and Brian McWilliams.
The shattered gradients problem: If resnets are the answer, then what is the question? ar Xiv preprint ar Xiv:1702.08591, 2017. Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, et al. The best of both worlds: Combining recent advances in neural machine translation. ar Xiv preprint ar Xiv:1804.09849, 2018. Yuntian Deng, Yoon Kim, Justin Chiu, Demi Guo, and Alexander M Rush. Latent alignment and variational attention. Thirty-second Conference on Neural Information Processing Systems (NIPS), 2018. Terrance De Vries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. ar Xiv preprint ar Xiv:1708.04552, 2017. Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional Sequence to Sequence Learning. In Proc. of ICML, 2017. Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249 256, 2010. Priya Goyal, Piotr Doll ar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: training imagenet in 1 hour. ar Xiv preprint ar Xiv:1706.02677, 2017. Benjamin Graham. Fractional max-pooling. ar Xiv preprint ar Xiv:1412.6071, 2014. Boris Hanin. Which neural net architectures give rise to exploding and vanishing gradients? ar Xiv preprint ar Xiv:1801.03744, 2018. Published as a conference paper at ICLR 2019 Boris Hanin and David Rolnick. How to start training: The effect of initialization and architecture. ar Xiv preprint ar Xiv:1803.01719, 2018. Moritz Hardt and Tengyu Ma. Identity matters in deep learning. ar Xiv preprint ar Xiv:1611.04231, 2016. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026 1034, 2015. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. David J Heeger. Normalization of cell responses in cat striate cortex. Visual neuroscience, 9(2): 181 197, 1992. Elad Hoffer, Ron Banner, Itay Golan, and Daniel Soudry. Norm matters: efficient and accurate normalization schemes in deep networks. ar Xiv preprint ar Xiv:1803.01814, 2018. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ar Xiv preprint ar Xiv:1502.03167, 2015. Diederik P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. ar Xiv preprint ar Xiv:1807.03039, 2018. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097 1105, 2012. Chen-Yu Lee, Patrick W Gallagher, and Zhuowen Tu. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In Artificial Intelligence and Statistics, pp. 464 472, 2016. Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix recovery. Conference on Learning Theory (COLT), 2018. Siwei Lyu and Eero P Simoncelli. Nonlinear image representation using divisive normalization. 
In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pp. 1 8. IEEE, 2008. Dmytro Mishkin and Jiri Matas. All you need is a good init. ar Xiv preprint ar Xiv:1511.06422, 2015. Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. ar Xiv preprint ar Xiv:1806.00187, 2018. Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Advances in neural information processing systems, pp. 4785 4795, 2017. Nicolas Pinto, David D Cox, and James J Di Carlo. Why is real-world visual object recognition hard? PLo S computational biology, 4(1):e27, 2008. Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901 909, 2016. Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization?(no, it is not about internal covariate shift). ar Xiv preprint ar Xiv:1805.11604, 2018. Andrew M Saxe, James L Mc Clelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. ar Xiv preprint ar Xiv:1312.6120, 2013. Published as a conference paper at ICLR 2019 Wenling Shang, Justin Chiu, and Kihyuk Sohn. Exploring normalization in deep residual networks with concatenated rectified linear units. In AAAI, pp. 1509 1516, 2017. Rupesh Kumar Srivastava, Klaus Greff, and J urgen Schmidhuber. Highway networks. ar Xiv preprint ar Xiv:1505.00387, 2015. Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. Co RR, abs/1607.08022, 2016. Twan van Laarhoven. L2 regularization versus batch and weight normalization. ar Xiv preprint ar Xiv:1706.05350, 2017. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998 6008, 2017. Yuxin Wu and Kaiming He. Group normalization. In The European Conference on Computer Vision (ECCV), September 2018. Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S Schoenholz, and Jeffrey Pennington. Dynamical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla convolutional neural networks. ar Xiv preprint ar Xiv:1806.05393, 2018. Yoshihiro Yamada, Masakazu Iwamura, and Koichi Kise. Shakedrop regularization. ar Xiv preprint ar Xiv:1802.02375, 2018. Ge Yang and Samuel Schoenholz. Mean field residual networks: On the edge of chaos. In Advances in neural information processing systems, pp. 7103 7114, 2017. Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. ar Xiv preprint ar Xiv:1605.07146, 2016. Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. ar Xiv preprint ar Xiv:1710.09412, 2017. Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017. A PROOFS FOR SECTION 2 A.1 GRADIENT NORM LOWER BOUND FOR THE INPUT TO A NETWORK BLOCK Proof of Theorem 1. We use fi j to denote the composition fj fj 1 fi, so that z = fi L(xi 1) for all i [L]. Note that z is p.h. with respect to the input of each network block, i.e. 
fi L((1 + ϵ)xi 1) = (1 + ϵ)fi L(xi 1) for ϵ > 1. This allows us to compute the gradient of the cross-entropy loss with respect to the scaling factor ϵ at ϵ = 0 as ϵℓ(fi L((1 + ϵ)xi 1), y) ϵ=0 = ℓ ϵ = y T z + p T z = ℓ(z, y) H(p) (6) Since the gradient L2 norm ℓ/ xi 1 must be greater than the directional derivative tℓ(fi L(xi 1 + t xi 1 xi 1 ), y), defining ϵ = t/ xi 1 we have ℓ xi 1 ϵℓ(fi L(xi 1 + ϵxi 1), y) ϵ t = ℓ(z, y) H(p) Published as a conference paper at ICLR 2019 A.2 GRADIENT NORM LOWER BOUND FOR POSITIVELY HOMOGENEOUS SETS Proof of Theorem 2. The proof idea is similar. Recall that if θph is a p.h. set, then f (m)(θph) f(x(m); θ \ θph, θph) is a p.h. function. We therefore have ϵℓavg(DM; (1 + ϵ)θph) ϵ=0 = 1 ℓ z(m) f (m) m=1 ℓ(z(m), y(m)) H(p(m)) (8) hence we again invoke the directional derivative argument to show ℓavg m=1 ℓ(z(m), y(m)) H(p(m)) G(θph). (9) In order to estimate the scale of this lower bound, recall the FC layer weights are i.i.d. sampled from a symmetric, mean-zero distribution, therefore z has a symmetric probability density function with mean 0. We hence have Eℓ(z, y) = E[ y T (z logsumexp(z))] E[y T (maxi [c] zi z)] = E[maxi [c] zi] (10) where the inequality uses the fact that logsumexp(z) maxi [c] zi; the last equality is due to y and z being independent at initialization and Ez = 0. Using the trivial bound EH(p) log(c), we get EG(θph) E[maxi [c] zi] log(c) which shows that the gradient norm of a p.h. set is of the order Ω(E[maxi [c] zi]) at initialization. B PROOFS FOR SECTION 3 B.1 RESIDUAL BRANCHES UPDATE THE NETWORK IN SYNC A common theme in previous analysis of residual networks is the scale of activation and gradient (Balduzzi et al., 2017; Yang & Schoenholz, 2017; Hanin & Rolnick, 2018). However, it is more important to consider the scale of actual change to the network function made by a (stochastic) gradient descent step. If the updates to different layers cancel out each other, the network would be stable as a whole despite drastic changes in different layers; if, on the other hand, the updates to different layers align with each other, the whole network may incur a drastic change in one step, even if each layer only changes a tiny amount. We now provide analysis showing that the latter scenario more accurately describes what happens in reality at initialization. For our result in this section, we make the following assumptions: f is a sequential composition of network blocks {fi}L i=1, i.e. f(x0) = f L(f L 1(. . . f1(x0))), consisting of fully-connected weight layers, Re LU activation functions and residual branches. f L is a fully-connected layer with weights i.i.d. sampled from a zero-mean distribution. There is no bias parameter in f. For l < L, let xl 1 be the input to fl and Fl(xl 1) be a branch in fl with ml layers. Without loss of generality, we study the following specific form of network architecture: Fl(xl 1) = ( ml Re LU z }| { Re LU W (ml) l Re LU W (1) l )(xl 1), fl(xl 1) = xl 1 + Fl(xl 1). For the last block we denote m L = 1 and f L(x L 1) = FL(x L 1) = W (1) L x L 1. Furthermore, we always choose 0 as the gradient of Re LU when its input is 0. As such, with input x, the output and gradient of Re LU(x) can be simply written as D1[x>0]x, where D1[x>0] is a diagonal matrix with diagonal entries corresponding to 1[x > 0]. Denote the preactivation of the i-th layer Published as a conference paper at ICLR 2019 (i.e. the input to the i-th Re LU) in the l-th block by x(i) l . 
We define the following terms to simplify our presentation: F (i ) l D1[x(i 1) l >0]W (i 1) l D1[x(1) l >0]W (1) l xl 1, l < L, i [ml] F (i+) l D1[x (ml) l >0]W (ml) l D1[x(i) l >0], l < L, i [ml] F (1 ) L x L 1 We have the following result on the gradient update to f: Theorem 3. With the above assumptions, suppose we update the network parameters by θ = η θℓ(f(x0; θ), y), then the update to network output f(x0) f(x0; θ + θ) f(x0; θ) is Ji l z }| { F (i ) l 2 f T F (i+) l F (i+) l T f ℓ z + O(η2), (12) where z f(x0) Rc is the logits. Let us discuss the implecation of this result before delving into the proof. As each Ji l is a c c real symmetric positive semi-definite matrix, the trace norm of each Ji l equals its trace. Similarly, the trace norm of J P i Ji l equals the trace of the sum of all Ji l as well, which scales linearly with the number of residual branches L. Since the output z has no (or little) correlation with the target y at the start of training, ℓ z is a vector of some random direction. It then follows that the expected update scale is proportional to the trace norm of J, which is proportional to L as well as the average trace of Ji l . Simply put, to allow the whole network be updated by Θ(η) per step independent of depth, we need to ensure each residual branch contributes only a Θ(η/L) update on average. Proof. The first insight to prove our result is to note that conditioning on a specific input x0, we can replace each Re LU activation layer by a diagonal matrix and does not change the forward and backward pass. (In fact, this is valid even after we apply a gradient descent update, as long as the learning rate η > 0 is sufficiently small so that all positive preactivation remains positive. This observation will be essential for our later analysis.) We thus have the gradient w.r.t. the i-th weight layer in the l-th block is ℓ Vec(W (i) l ) = xl Vec(W (i) l ) f z = F (i ) l I(i) l F (i+) l T f where denotes the Kronecker product. The second insight is to note that with our assumptions, a network block and its gradient w.r.t. its input have the following relation: fl(xl 1) = fl xl 1 xl 1. (14) We then plug in Equation (13) to the gradient update θ = η θℓ(f(x0; θ), y), and recalculate the forward pass f(x0; θ+ θ). The theorem follows by applying Equation (14) and a first-order Taylor series expansion in a small neighborhood of η = 0 where f(x0; θ + θ) is smooth w.r.t. η. B.2 WHAT SCALAR BRANCH HAS Θ(η/L) UPDATES? For this section, we focus on the proper initialization of a scalar branch F(x) = (Qm i=1 ai)x. We have the following result: Theorem 4. Assuming i, ai 0, x = Θ(1) and ℓ F (x) = Θ(1), then F(x) F(x; θ η ℓ θ) F(x; θ) is Θ(η/L) if and only if k [m]\{j} ak , where j arg min k ak (15) Published as a conference paper at ICLR 2019 Proof. We start by calculating the gradient of each parameter: k [m]\{i} ak and a first-order approximation of F(x): F(x) = η ℓ F(x) (F(x))2 m X 1 a2 i (17) where we conveniently abuse some notations by defining k [m]\{i} ak x, if ai = 0. (18) Denote Pm i=1 1 a2 i as M and mink ak as A, we have A2 (F(x))2M (F(x))2 m and therefore by rearranging Equation (17) and letting F(x) = Θ(η/L) we get F(x) η ℓ F (x) i.e. F(x)/A = Θ(1/ L). Hence the only if part is proved. For the if part, we apply Equation (19) to Equation (17) and observe that by Equation (15) F(x) = Θ η(F(x))2 1 The result of this theorem provides useful guidance on how to rescale the standard initialization to achieve the desired update scale for the network function. 
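To close this appendix, here is a small numerical sketch (ours, not part of the paper) of Theorem 4 for the scalar branch $F(x) = \big(\prod_{i=1}^{m} a_i\big)\,x$: with $a_i = L^{-\frac{1}{2m-2}}$ and a fixed upstream gradient of order one, a single SGD step changes $F(x)$ by $\Theta(\eta/L)$.

```python
# Numerical sketch of Theorem 4: the update to a scalar branch initialized with
# a_i = L^(-1/(2m-2)) scales as eta / L.
import torch

def branch_update(L, m=2, eta=0.1, x=1.0, upstream_grad=1.0):
    a = torch.full((m,), float(L) ** (-1.0 / (2 * m - 2)), requires_grad=True)
    F = a.prod() * x
    loss = upstream_grad * F            # pretend dl/dF = upstream_grad = Theta(1)
    loss.backward()
    with torch.no_grad():
        a_new = a - eta * a.grad        # one SGD step on the branch parameters
        F_new = a_new.prod() * x
        return (F_new - F).abs().item()

eta = 0.1
for L in (10, 100, 1000, 10000):
    delta = branch_update(L, eta=eta)
    print(f"L = {L:5d}   |Delta F| = {delta:.3e}   L * |Delta F| / eta = {L * delta / eta:.2f}")
# The last column stays roughly constant, i.e. |Delta F| = Theta(eta / L).
```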
C ADDITIONAL EXPERIMENTS

C.1 ABLATION STUDIES OF FIXUP

In this section we present the training curves of different architecture designs and initialization schemes. Specifically, we compare the training accuracy of batch normalization, Fixup, as well as a few ablated options: (1) removing the bias parameters in the network; (2) using 0.1x the suggested initialization scale and no bias parameters; (3) using 10x the suggested initialization scale and no bias parameters; and (4) removing all the residual branches. The results are shown in Figure 4. We see that initializing the residual branch layers at a smaller scale (or all zero) slows down learning, whereas training fails when initializing them at a larger scale; we also see the clear benefit of adding bias parameters in the network.

C.2 CIFAR AND SVHN WITH BETTER REGULARIZATION

We perform additional experiments to validate our hypothesis that the gap in test error between Fixup and batch normalization is primarily due to overfitting. To combat overfitting, we use Mixup (Zhang et al., 2017) and Cutout (DeVries & Taylor, 2017) with default hyperparameters as additional regularization. On the CIFAR-10 dataset, we perform experiments with WideResNet-40-10 and on SVHN we use WideResNet-16-12 (Zagoruyko & Komodakis, 2016), all with the default hyperparameters. We observe in Table 4 that models trained with Fixup and strong regularization are competitive with state-of-the-art methods on CIFAR-10 and SVHN, as well as with our baseline with batch normalization.

Figure 4: Minibatch training accuracy of ResNet-110 on the CIFAR-10 dataset with different configurations (BatchNorm, Fixup, $L^{-\frac{1}{2m-2}}$ scaling without bias, no residual branches) in the first 3 epochs. We use a minibatch size of 128 and smooth the curves using a 10-step moving average.

Table 4: Additional results on the CIFAR-10 and SVHN datasets (lower is better).
Dataset  | Model                         | Normalization | Test Error (%)
CIFAR-10 | (Zagoruyko & Komodakis, 2016) | Yes | 3.8
CIFAR-10 | (Yamada et al., 2018)         | Yes | 2.3
CIFAR-10 | BatchNorm + Mixup + Cutout    | Yes | 2.5
CIFAR-10 | (Graham, 2014)                | No  | 3.5
CIFAR-10 | Fixup-init + Mixup + Cutout   | No  | 2.3
SVHN     | (Zagoruyko & Komodakis, 2016) | Yes | 1.5
SVHN     | (DeVries & Taylor, 2017)      | Yes | 1.3
SVHN     | BatchNorm + Mixup + Cutout    | Yes | 1.4
SVHN     | (Lee et al., 2016)            | No  | 1.7
SVHN     | Fixup-init + Mixup + Cutout   | No  | 1.4

C.3 TRAINING AND TEST CURVES ON IMAGENET

Figure 5 shows that without additional regularization Fixup fits the training set very well, but overfits significantly. We see in Figure 6 that Fixup is competitive with networks trained with normalization when the Mixup regularizer is used.

Figure 5: Training and test errors on ImageNet using ResNet-50 without additional regularization (BatchNorm, GroupNorm, Fixup). We observe that Fixup is able to better fit the training data and that leads to overfitting; more regularization is needed. Results of BatchNorm and GroupNorm reproduced from Wu & He (2018).

Figure 6: Test error of ResNet-50 on ImageNet with Mixup (Zhang et al., 2017), comparing BatchNorm + Mixup, GroupNorm + Mixup, and Fixup + Mixup. Fixup closely matches the final results yielded by the use of GroupNorm, without any normalization.
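For reference, below are minimal sketches of the two regularizers used in this appendix, Mixup (Zhang et al., 2017) and Cutout (DeVries & Taylor, 2017), in their standard formulations. The hyperparameter values shown (Mixup coefficient 0.7, as cross-validated for Fixup in Section 4.2, and a 16x16 Cutout patch, a common CIFAR-10 default) and the helper names are our own choices, not the paper's exact code.

```python
# Minimal sketches of the Mixup and Cutout regularizers.
import torch

def mixup(x, y_onehot, alpha=0.7):
    """Mixup: train on convex combinations of pairs of examples and labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix               # use a soft-target cross-entropy on y_mix

def cutout(x, size=16):
    """Cutout: zero out a random square patch in each image of the batch."""
    n, _, h, w = x.shape
    x = x.clone()
    for i in range(n):
        cy = torch.randint(h, (1,)).item()
        cx = torch.randint(w, (1,)).item()
        y0, y1 = max(0, cy - size // 2), min(h, cy + size // 2)
        x0, x1 = max(0, cx - size // 2), min(w, cx + size // 2)
        x[i, :, y0:y1, x0:x1] = 0.0
    return x
```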
D ADDITIONAL REFERENCES: A BRIEF HISTORY OF NORMALIZATION METHODS

The first use of normalization in neural networks appears in the modeling of the biological visual system and dates back at least to Heeger (1992) in neuroscience and to Pinto et al. (2008) and Lyu & Simoncelli (2008) in computer vision, where each neuron output is divided by the sum (or norm) of all of the outputs, a module called divisive normalization. Recent popular normalization methods, such as local response normalization (Krizhevsky et al., 2012), batch normalization (Ioffe & Szegedy, 2015) and layer normalization (Ba et al., 2016), mostly follow this tradition of dividing the neuron activations by certain summary statistics, often also with the activation mean subtracted. An exception is weight normalization (Salimans & Kingma, 2016), which instead divides the weight parameters by their statistics, specifically the weight norm; weight normalization also adopts the idea of activation normalization for weight initialization. The recently proposed actnorm (Kingma & Dhariwal, 2018) removes the normalization of weight parameters, but still uses activation normalization to initialize the affine transformation layers.