# Understanding and Improving Layer Normalization

Jingjing Xu¹, Xu Sun¹,², Zhiyuan Zhang¹, Guangxiang Zhao², Junyang Lin¹
¹ MOE Key Lab of Computational Linguistics, School of EECS, Peking University
² Center for Data Science, Peking University
{jingjingxu,xusun,zzy1210,zhaoguangxiang,linjunyang}@pku.edu.cn

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

Layer normalization (Layer Norm) is a technique to normalize the distributions of intermediate layers. It enables smoother gradients, faster training, and better generalization accuracy. However, it is still unclear where this effectiveness stems from. In this paper, our main contribution is to take a step further in understanding Layer Norm. Many previous studies believe that the success of Layer Norm comes from forward normalization. Unlike them, we find that the derivatives of the mean and variance are more important than forward normalization because they re-center and re-scale backward gradients. Furthermore, we find that the parameters of Layer Norm, including the bias and gain, increase the risk of over-fitting and do not work in most cases. Experiments show that a simple version of Layer Norm (Layer Norm-simple) without the bias and gain outperforms Layer Norm on four datasets. It obtains the state-of-the-art performance on En-Vi machine translation. To address the over-fitting problem, we propose a new normalization method, Adaptive Normalization (Ada Norm), which replaces the bias and gain with a new transformation function. Experiments show that Ada Norm achieves better results than Layer Norm on seven out of eight datasets.

1 Introduction

Neural network training has long been a focus of deep learning research. One of the most prominent advances is the application of normalization methods. Ioffe and Szegedy [2015] introduce the concept of normalizing layers with the proposed Batch Normalization (Batch Norm). It is widely believed that by controlling the mean and variance of layer inputs across mini-batches, Batch Norm stabilizes the distribution and improves training efficiency. Following this work, Lei Ba et al. [2016] point out its limitations in Recurrent Neural Networks (RNN) and propose Layer Normalization (Layer Norm), which is performed across the neurons in a layer. Layer Norm is well suited to RNN and self-attention-based models. A typical example is its application in the state-of-the-art framework, Transformer [Vaswani et al., 2017]. Layer Norm enables faster training of Transformer and is irreplaceable in this framework.

Despite its great success, it is still unclear why Layer Norm is so effective. The widely accepted explanation is that forward normalization brings distribution stability [Ioffe and Szegedy, 2015, Lei Ba et al., 2016]. Recent studies show that the effects of Batch Norm are not related to the stability of input distributions [Zhang et al., 2017, Santurkar et al., 2018]. They instead argue that Batch Norm is effective because normalization smooths the optimization landscape. However, it is still unclear whether these theories can explain the success of Layer Norm.

The main contribution of this paper is to explore how Layer Norm works. Through a series of analyses, we find that the derivatives of the mean and variance are important because they re-center and re-scale backward gradients.
Furthermore, contrary to our expectations, the bias and gain do not work in most cases. The details of our findings are illustrated below.

The derivatives of the mean and variance are more important to Layer Norm than forward normalization. Many previous studies believe that forward normalization is the only decisive factor in Layer Norm: it makes the input distribution more stable and thus brings better convergence. Unlike them, our experimental results show that forward normalization has little to do with the effectiveness, and that the derivatives of the mean and variance play a significant role in Layer Norm. To illustrate how these derivatives work, we propose Detach Norm, which adds an additional detaching operation to Layer Norm to change the mean and variance from variables to constants. It preserves re-centering and re-scaling in the forward pass but cuts off the derivatives of the mean and variance with respect to the input. Detach Norm performs worse than Layer Norm on six out of eight datasets. This shows that the derivatives of the mean and variance are useful to Layer Norm. Furthermore, to investigate the reason for this observation, we analyze the gradients in Layer Norm and Detach Norm, and find that the derivatives of the mean re-center gradients and the derivatives of the variance re-scale gradients.

The parameters of Layer Norm, including the bias and gain, increase the risk of over-fitting and do not work in most cases. The bias and gain apply an affine transformation to normalized vectors. They are expected to enhance the expressive power by re-shaping the distribution. To evaluate their effect on results, we build a simple version of Layer Norm (Layer Norm-simple) by removing the bias and gain. Our experimental results show that Layer Norm-simple achieves better results than Layer Norm on four datasets. It even achieves the state-of-the-art performance on En-Vi machine translation. By comparing the loss curves of Layer Norm with and without the bias and gain, we find that the bias and gain cause over-fitting. We speculate that the main reason is that the bias and gain are learned from the training set and cannot adjust themselves to different input distributions at test time.

Motivated by this assumption, we propose a novel normalization method, Adaptive Normalization (Ada Norm). Ada Norm replaces the bias and gain with a new transformation function that adaptively adjusts scaling weights based on input values. We evaluate Ada Norm and Layer Norm on eight datasets, covering machine translation, language modeling, text classification, image classification, and dependency parsing. Results show that Ada Norm achieves better results on seven datasets.

2 Preliminaries

In this section, we first review the Layer Norm algorithm and then introduce the datasets and models used in the following analysis sections.

2.1 Layer Norm Algorithm

Let $x = (x_1, x_2, \ldots, x_H)$ be the vector representation of an input of size $H$ to a normalization layer. Layer Norm re-centers and re-scales input $x$ as

$$
h = g \odot N(x) + b, \quad N(x) = \frac{x - \mu}{\sigma}, \quad \mu = \frac{1}{H}\sum_{i=1}^{H} x_i, \quad \sigma = \sqrt{\frac{1}{H}\sum_{i=1}^{H}(x_i - \mu)^2}, \tag{1}
$$

where $h$ is the output of a Layer Norm layer and $\odot$ is the element-wise product operation. $\mu$ and $\sigma$ are the mean and standard deviation of the input. Bias $b$ and gain $g$ are parameters with the same dimension $H$.

2.2 Experimental Setup

To investigate how Layer Norm works, we conduct a series of experiments in this paper.
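As a concrete reference for the analyses that follow, the minimal PyTorch sketch below restates Eq. (1); it is our own illustration rather than the authors' released code, and setting `elementwise_affine=False` yields the Layer Norm-simple variant (no bias and gain) studied in Section 3.

```python
import torch
import torch.nn as nn

class LayerNormSketch(nn.Module):
    """Eq. (1): h = g ⊙ (x - µ)/σ + b, computed over the last dimension.

    With elementwise_affine=False this reduces to Layer Norm-simple,
    i.e., plain re-centering and re-scaling without the bias and gain.
    """
    def __init__(self, hidden_size, eps=1e-5, elementwise_affine=True):
        super().__init__()
        self.eps = eps
        self.elementwise_affine = elementwise_affine
        if elementwise_affine:
            self.gain = nn.Parameter(torch.ones(hidden_size))   # g in Eq. (1)
            self.bias = nn.Parameter(torch.zeros(hidden_size))  # b in Eq. (1)

    def forward(self, x):
        mu = x.mean(dim=-1, keepdim=True)
        sigma = x.std(dim=-1, unbiased=False, keepdim=True)
        y = (x - mu) / (sigma + self.eps)  # N(x)
        if self.elementwise_affine:
            y = self.gain * y + self.bias
        return y
```

The built-in `torch.nn.LayerNorm` computes essentially the same thing; the explicit version above only makes the mean and standard-deviation terms visible for the detaching experiments in Section 3.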
Since Layer Norm is a default setting in Transformer [Vaswani et al., 2017] and Transformer-XL [Dai et al., 2019], which have shown state-of-the-art results on a variety of tasks (e.g., machine translation), we primarily consider normalization on Transformer and Transformer-XL networks. Also, to avoid the impact of model architecture, we evaluate the effects of normalization on feed-forward neural networks and convolutional neural networks. The datasets and models are listed below; more details can be found in the arXiv version.

Machine translation includes three widely used datasets: WMT English-German (En-De), IWSLT 2014 German-English (De-En) [Cettolo et al., 2014], and IWSLT 2015 English-Vietnamese (En-Vi) [Cettolo et al., 2015]. For all datasets, we use the Pre-Norm setting, where normalization is applied before each layer. We re-implement Transformer with the released code of Fairseq (https://github.com/pytorch/fairseq) [Ott et al., 2019]. The evaluation metric is BLEU [Papineni et al., 2002].

For the En-De dataset, we use the same dataset splits and the same compound splitting as previous work [Vaswani et al., 2017]. BPE is used to build the vocabularies. We use the shared embedding setting, and the vocabulary size is 32,765. We use transformer_wmt_en_de_big_t2t as our basic model. The dropout rate is 0.3 and the learning rate is 0.001. The training batch size is 4,096 tokens. We use the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.98$. The number of warmup steps is 4K.

The De-En dataset is provided by the IWSLT 2014 Evaluation Campaign. We use the same dataset splits as previous work [Ott et al., 2019, Ranzato et al., 2016, Wiseman and Rush, 2016]: 153K sentences for training, 7K sentences for validation, and 7K sentences for testing. BPE is used to build the vocabularies. We use the shared embedding setting, and the vocabulary size is 10,149. We use transformer_iwslt_de_en as our basic model. The dropout rate is 0.3, the attention dropout rate is 0.1, and the activation dropout is 0.1. The initial learning rate is 1e-07 and the learning rate is 0.0015. The training batch size is 4,096 tokens, and gradients are accumulated over 2 steps before each update. The number of warmup steps is 8K.

The En-Vi dataset contains 133K training sentence pairs provided by the IWSLT 2015 Evaluation Campaign. We use TED tst2012 (1,553 sentences) as the validation set and TED tst2013 (1,268 sentences) as the test set. BPE is used to build the input and output vocabularies. The English and Vietnamese vocabulary sizes are 7,669 and 6,669, respectively. The dropout rate is 0.1 and the learning rate is 0.001. The training batch size is 4,096 tokens. The number of warmup steps is 8K. We use transformer_wmt_en_de as our basic model. We use the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.98$.

Language modeling includes a large dataset, Enwiki8 (http://mattmahoney.net/dc/text.html), which contains 100M bytes of unprocessed Wikipedia text. We implement a 12-layer Transformer-XL model. The dimension of each layer is 512. Multi-head attention contains 8 heads, and the dimension of each head is 64. The dropout rate is 0.1 and the batch size is 22. We use the Adam optimizer with a learning rate of 0.00025. We use the average number of Bits-Per-Character (BPC) as the evaluation metric [Al-Rfou et al., 2018, Dai et al., 2019].

Text classification includes two sentence classification datasets: RT [Pang and Lee, 2005] and SST5 [Socher et al., 2013]. RT is a binary sentiment classification dataset of online movie reviews. We randomly divide all examples into 8,608 for training, 964 for validation, and 1,089 for testing.
SST5 is a single-sentence classification dataset built on movie reviews; we run experiments on the five-label set. We build a Transformer model with a 4-layer encoder. The batch size is 4,096 tokens. The word embedding dimension is 128 and the hidden dimension is 128. The dropout rate is 0.2. We use the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.998$. Normalization is applied before each layer. Accuracy is the evaluation metric.

Image classification includes a widely used dataset, MNIST [LeCun et al., 1998]. It consists of 55,000 training images, 5,000 validation images, and 10,000 testing images. We implement a 3-layer convolutional neural network for classification. The first 2D-convolution layer has 1 in-channel and 20 out-channels. The second 2D-convolution layer has 20 in-channels and 50 out-channels. We flatten the output of the second 2D-convolution layer and send it to a linear layer. The batch size is 32. We use the Adam optimizer with a learning rate of 0.001. We apply Layer Norm before the activation in every linear layer and train the model for 20 epochs. Accuracy is the evaluation metric.

Dependency parsing uses the English Penn Treebank (PTB) [Marcus et al., 1993]. We follow the standard split of the corpus, with sections 2-21 as the training set (39,832 sentences, 1,900,056 transition examples), section 22 as the validation set (1,700 sentences, 80,234 transition examples), and section 23 as the testing set (2,416 sentences, 113,368 transition examples). We implement an MLP-based parser following Chen and Manning [2014]. The dimension of the hidden state is 512, the batch size is 1,024, and the dropout rate is 0.2. We use the Adam optimizer and initialize the learning rate to 0.001. We apply normalization before the activation in every linear layer. Following Chen and Manning [2014], we use the Unlabeled Attachment Score (UAS) as the evaluation metric.

3 Understanding Layer Norm

To investigate how Layer Norm facilitates training, we conduct ablation studies to observe each part's contribution to the performance. In this section, we analyze the effects of the bias and gain, forward normalization, and backward normalization.

Table 1: The bias and gain do not work on six out of eight datasets. "w/o Norm" is a naive model without Layer Norm. "Layer Norm-simple" is a variant of Layer Norm that drops the bias and gain. (+) means higher is better; (-) means lower is better.

| Models | En-De (+) | De-En (+) | En-Vi (+) | Enwiki8 (-) | RT (+) | SST5 (+) | MNIST (+) | PTB (+) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Model Layers | 12 | 12 | 12 | 12 | 4 | 4 | 3 | 3 |
| w/o Norm | Diverge | 34.0 | 28.4 | 1.04 | 76.85 | 38.55 | 99.14 | 88.31 |
| Layer Norm | 28.3 | 35.5 | 31.2 | 1.07 | 77.21 | 39.23 | 99.13 | 89.12 |
| Layer Norm-simple | 28.4 | 35.5 | 31.6 | 1.07 | 76.66 | 40.54 | 99.09 | 89.19 |

3.1 The Effect of the Bias and Gain in Layer Norm

The bias and gain do not work in most cases. From Table 1, it can be seen that Layer Norm is an effective approach: it brings large performance improvements on six out of eight datasets compared with the naive baseline without Layer Norm ("w/o Norm"). By comparing Layer Norm and Layer Norm-simple, we find that dropping the bias and gain ("Layer Norm-simple") does not decrease the performance on six datasets. Surprisingly, Layer Norm-simple outperforms Layer Norm on four datasets, even with a 0.4 BLEU improvement on En-Vi and a 1.31 ACC improvement on SST5.
It is also worth noting that the 31.6 BLEU achieved by Layer Norm-simple is the state-of-the-art result on En-Vi machine translation.

Furthermore, we find that the bias and gain increase the risk of over-fitting. They were introduced because input information may be lost when normalizing input distributions: the bias and gain apply an affine transformation to normalized vectors so as to enhance the expressive power. However, since the bias and gain are learned from the training set and ignore the input distributions of the testing data, they may increase the risk of over-fitting in Layer Norm. This is verified by the convergence curves in Figure 1: Layer Norm achieves lower training loss (or BPC) but higher validation loss (or BPC) than Layer Norm-simple on En-Vi and Enwiki8. These results indicate that the current affine transformation mechanism carries a potential risk of over-fitting and needs to be further improved.

Figure 1: Convergence curves (training and validation) of Layer Norm and Layer Norm-simple on En-Vi and Enwiki8. Lower is better. The bias and gain increase the risk of over-fitting.

3.2 The Effect of Forward Normalization

Table 2: The derivatives of the mean and variance matter. "w/o Norm" is the naive model without normalization. "Detach Norm" is a variant of Layer Norm-simple that detaches the derivatives of the mean and variance. (+) means higher is better; (-) means lower is better. The top table shows the effect of forward normalization. The bottom table shows the effect of the derivatives of the mean and variance.

| Models | En-De (+) | De-En (+) | En-Vi (+) | Enwiki8 (-) | RT (+) | SST5 (+) | MNIST (+) | PTB (+) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Model Layers | 12 | 12 | 12 | 12 | 4 | 4 | 3 | 3 |
| w/o Norm | Diverge | 34.0 | 28.4 | 1.04 | 76.85 | 38.55 | 99.14 | 88.31 |
| Detach Norm | Diverge | 33.9 | 27.7 | 1.12 | 76.40 | 40.04 | 99.10 | 89.79 |
| Improvement | - | -0.1 | -0.7 | -0.08 | -0.45 | 1.49 | -0.04 | 1.48 |

| Models | En-De (+) | De-En (+) | En-Vi (+) | Enwiki8 (-) | RT (+) | SST5 (+) | MNIST (+) | PTB (+) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Model Layers | 12 | 12 | 12 | 12 | 4 | 4 | 3 | 3 |
| Detach Norm | Diverge | 33.9 | 27.7 | 1.12 | 76.40 | 40.04 | 99.10 | 89.79 |
| Layer Norm-simple | 28.4 | 35.5 | 31.6 | 1.07 | 76.66 | 40.54 | 99.09 | 89.19 |
| Improvement | - | 1.6 | 3.9 | 0.05 | 0.26 | 0.50 | -0.01 | -0.60 |

For easier analysis, we only consider Layer Norm without the bias and gain here. Let $y = (y_1, y_2, \ldots, y_H)$ be the normalized vector. The calculation process of Layer Norm without the bias and gain can be written as

$$
y = \frac{x - \mu}{\sigma}, \quad \mu = \frac{1}{H}\sum_{i=1}^{H} x_i, \quad \sigma = \sqrt{\frac{1}{H}\sum_{i=1}^{H}(x_i - \mu)^2}, \tag{2}
$$

where $x = (x_1, x_2, \ldots, x_H)$ is the input vector and $H$ is the dimension of $x$. $\mu$ and $\sigma$ are the mean and standard deviation of $x_1, x_2, \ldots, x_H$. Then, let $\bar{y}$ and $D_y$ be the mean and variance of $y_1, y_2, \ldots, y_H$. It is easy to verify that

$$
\bar{y} = \frac{1}{H}\sum_{i=1}^{H} \frac{x_i - \mu}{\sigma} = 0, \qquad D_y = \frac{1}{H}\sum_{i=1}^{H} (y_i - \bar{y})^2 = \frac{\sigma^2}{\sigma^2} = 1. \tag{3}
$$

Eq. (3) shows that normalization re-centers and re-scales the input vector $x$. By now, a widely accepted belief is that the effectiveness of Layer Norm comes from the steady layer distributions brought by forward normalization [Lei Ba et al., 2016]. To evaluate whether forward normalization explains the effectiveness of Layer Norm, we need to separate the effect on forward layer inputs from the effect on backward gradients. In this paper, we design a new method, called Detach Norm. The difference between Layer Norm and Detach Norm is that Detach Norm detaches the derivatives of the mean and variance (in our implementation, we detach the derivative of the standard deviation, the square root of the variance).
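To make the detaching operation concrete, here is a minimal PyTorch sketch of the idea; it is our own illustration based on the description above, not the authors' implementation. `detach()` keeps the forward value of a tensor but blocks gradients from flowing through it:

```python
import torch

def detach_norm(x, eps=1e-5):
    """Detach Norm (sketch): identical to Layer Norm-simple in the forward
    pass, but the mean and standard deviation are treated as constants in
    the backward pass, so their derivatives w.r.t. x are cut off."""
    mu = x.mean(dim=-1, keepdim=True)
    sigma = x.std(dim=-1, unbiased=False, keepdim=True)
    # .detach() plays the role of the copy function θ(·) in Eq. (4) below:
    # the values of µ and σ are kept, but they no longer require gradients.
    return (x - mu.detach()) / (sigma.detach() + eps)
```

Detaching only `mu` or only `sigma` gives the "Detach Mean" and "Detach Variance" variants compared in Table 3.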
Detaching the derivatives means treating the mean and variance as constants in the backward pass (their values still change with the input), rather than as variables that require gradients in backward propagation. The calculation of Detach Norm can be written as

$$
y = \frac{x - \hat{\mu}}{\hat{\sigma}}, \quad \hat{\mu} = \theta(\mu), \quad \hat{\sigma} = \theta(\sigma), \tag{4}
$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the input $x$, as calculated in Eq. (2). The function $\theta(\cdot)$ can be seen as a special copy function, which copies the values of $\mu$ and $\sigma$ into the constants $\hat{\mu}$ and $\hat{\sigma}$. In all, Detach Norm keeps the same forward normalization as Layer Norm does, but cuts off the derivatives of the mean and variance.

Since Detach Norm re-centers and re-scales forward propagation in the same way as Layer Norm-simple, the gap between Detach Norm and "w/o Norm" shows the effect of forward normalization. As we can see, Detach Norm performs worse than "w/o Norm", showing that forward normalization has little to do with the success of Layer Norm. Furthermore, the only difference between Detach Norm and Layer Norm-simple is that Detach Norm detaches the derivatives of the mean and variance. As shown in Table 2, Detach Norm performs worse than Layer Norm-simple on six datasets. This is mainly because Detach Norm converges to much worse local optima than Layer Norm-simple, as shown in Figure 2. The gap between Detach Norm and Layer Norm-simple shows the effectiveness of the derivatives of the mean and variance. By comparing the achieved improvements, we find that the derivatives of the mean and variance bring larger improvements than forward normalization does. These results demonstrate that the derivatives of the mean and variance play a significant role. In addition, the much worse results of Detach Norm on En-De, De-En, and En-Vi indicate that the derivatives of the mean and variance may be more important for deeper models. In the following section, we give a detailed analysis of why and how the derivatives of the mean and variance contribute to the performance.

Figure 2: Convergence curves of Layer Norm-simple and Detach Norm on two translation datasets.

3.3 The Effect of the Derivatives of the Mean and Variance

To understand how the derivatives of the mean and variance work, we analyze the gradients of Layer Norm-simple and Detach Norm. According to the chain rule, the gradient of $x$ is (adopting the denominator layout)

$$
\frac{\partial \ell}{\partial x} = \frac{\partial y}{\partial x}\,\frac{\partial \ell}{\partial y}, \tag{5}
$$

where $\ell$ is the loss function, $x$ is the input vector, and $y$ is the normalized vector. We here analyze the effect of detaching the derivatives of the mean and variance on backward gradients. Our results are summarized in the following theorem, whose proof is given in the Appendix of the arXiv version.

Theorem 1. Given $\frac{\partial \ell}{\partial y} = (g_1, g_2, \ldots, g_H)^T$, let $\bar{g}$ and $D_g$ be the mean and variance of $g_1, g_2, \ldots, g_H$. For the case of detaching the derivatives of $\mu$ and $\sigma$, suppose $\frac{\partial \ell}{\partial x} = (a_1, a_2, \ldots, a_H)^T$ is the gradient of $x$ with mean $\bar{a}$ and variance $D_a$. We have $\bar{a} = \bar{g}/\sigma$ and $D_a = D_g/\sigma^2$.

(1) For the case of standard Layer Norm-simple, suppose $\frac{\partial \ell}{\partial x} = (b_1, b_2, \ldots, b_H)^T$ is the gradient of $x$ with mean $\bar{b}$ and variance $D_b$. We have $\bar{b} = 0$ and $D_b \le D_g/\sigma^2$.

(2) For the case of detaching only the derivative of $\mu$, suppose $\frac{\partial \ell}{\partial x} = (c_1, c_2, \ldots, c_H)^T$ is the gradient of $x$ with mean $\bar{c}$ and variance $D_c$. We have $\bar{c} = \bar{g}/\sigma$ and $D_c \le D_g/\sigma^2$.
(3) For the case of detaching only the derivative of $\sigma$, suppose $\frac{\partial \ell}{\partial x} = (d_1, d_2, \ldots, d_H)^T$ is the gradient of $x$ with mean $\bar{d}$ and variance $D_d$. We have $\bar{d} = 0$ and $D_d = D_g/\sigma^2$.

By comparing the case of detaching the derivative of $\mu$ with that of standard Layer Norm-simple in Theorem 1, we find that the derivative of $\mu$ re-centers $\frac{\partial \ell}{\partial x}$ to zero. By comparing the case of detaching the derivative of $\sigma$ with that of standard Layer Norm-simple, we find that the derivative of $\sigma$ reduces the variance of $\frac{\partial \ell}{\partial x}$, which can be seen as a kind of re-scaling. We refer to gradient re-centering and re-scaling as gradient normalization.

To further evaluate the effect of gradient normalization on model performance, we test the derivatives of the mean and variance separately. Table 3 shows that detaching the derivative of the variance decreases the performance significantly on deeper networks. Therefore, it is necessary to control the variance of gradients for deeper networks.

In conclusion, Layer Norm normalizes forward layer inputs and backward gradients. The derivatives of the mean and variance play more important roles than forward normalization in Layer Norm. Furthermore, unlike previous work [Santurkar et al., 2018], which only notices that normalization smooths gradients, this paper provides deeper insight into how normalization impacts backward gradients.

Table 3: The derivative of the variance is more important than that of the mean for deeper networks. (+) means higher is better; (-) means lower is better.

| Models | En-De (+) | De-En (+) | En-Vi (+) | Enwiki8 (-) | RT (+) | SST5 (+) | MNIST (+) | PTB (+) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Model Layers | 12 | 12 | 12 | 12 | 4 | 4 | 3 | 3 |
| Layer Norm-simple | 28.4 | 35.5 | 31.6 | 1.07 | 76.66 | 40.54 | 99.09 | 89.19 |
| Detach Mean | 28.3 | 35.6 | 31.3 | 1.07 | 75.02 | 40.99 | 99.25 | 89.45 |
| Detach Variance | Diverge | 34.2 | 29.8 | 1.10 | 77.04 | 41.74 | 99.10 | 89.80 |

4 Ada Norm

To address the over-fitting problem described in Section 3.1, we propose a new normalization method, Adaptive Normalization (Ada Norm). Ada Norm adopts a new transformation function which can adaptively control scaling weights for different inputs. (Our code is released at https://github.com/lancopku/AdaNorm.)

4.1 Ada Norm Algorithm

Formally, let $y = N(x) = (x - \mu)/\sigma$ be the normalized vector, where $\mu$ and $\sigma$ are the mean and standard deviation of the input $x = (x_1, x_2, \ldots, x_H)$. We use $\phi(y)$, a function with respect to the input $x$, to replace the bias and gain:

$$
z = \phi(y) \odot y = \phi(N(x)) \odot N(x), \tag{6}
$$

where $z = (z_1, z_2, \ldots, z_H)$ is the output of Ada Norm and $\odot$ is the element-wise product operation. Unlike the bias and gain, which are fixed after training in Layer Norm, $\phi(y)$ can adaptively adjust scaling weights based on the input.

To keep training stable, we expect $\phi(\cdot)$ to have some properties. First, $\phi(\cdot)$ must be differentiable. Second, we expect the average scaling weight to be fixed, namely the average of $\phi(y)$ is a constant $C$ with $C > 0$. Third, we expect the average of $z$ to be bounded, which avoids the problem of exploding loss; namely, we require that there exists a constant $M$ such that $|\frac{1}{H}\sum_{i=1}^{H} z_i| < M$. Theorem 2 proves that there exists a unique solution satisfying these requirements. The proof is given in the Appendix of the arXiv version.

Theorem 2. Suppose $\phi(y_i)$ is differentiable, $\forall y$: $\frac{1}{H}\sum_{i=1}^{H} \phi(y_i) = C > 0$, and $\exists M > 0$ such that $|\frac{1}{H}\sum_{i=1}^{H} z_i| < M$, where $H$ is the hidden size. Then the only solution satisfying these requirements is $\phi(y_i) = C(1 - k y_i)$.

Since $1 - k y_i < 0$ would undesirably change the direction of the vector, we expect $\phi(y_i) > 0$ to hold, which means $y_i < 1/k$ must hold. Due to the symmetry of $y_i$, $|y_i| < 1/k$ is required to hold as well.
Based on Chebyshev's Inequality, we have

$$
P(|y_i| < 1/k) = P(|y_i - E(y_i)| < 1/k) \ge 1 - \frac{D_y}{(1/k)^2} = 1 - k^2 D_y, \tag{7}
$$

where $D_y$ is the variance of $y = (y_1, y_2, \ldots, y_H)$ and $H$ is the dimension of $y$. Based on Eq. (3), we can verify $D_y = 1$. If we expect $|y_i| < 1/k$ to hold with a probability higher than 99%, then $k = 1/10$ should be chosen based on Eq. (7). Namely, we choose

$$
\phi(y_i) = C\left(1 - \frac{y_i}{10}\right). \tag{8}
$$

Given an input vector $x$, the complete calculation process of Ada Norm is

$$
z = C(1 - k y) \odot y, \quad y = \frac{x - \mu}{\sigma}, \quad \mu = \frac{1}{H}\sum_{i=1}^{H} x_i, \quad \sigma = \sqrt{\frac{1}{H}\sum_{i=1}^{H}(x_i - \mu)^2}, \tag{9}
$$

where $C$ is a hyper-parameter, $\odot$ is the element-wise product operation, and $k$ is recommended to be set to 1/10. To prevent the introduced term $C(1 - ky)$ from dismissing the gradient re-centering and re-scaling effect, we detach the gradient of $C(1 - ky)$ and treat it as a changeable constant in the implementation.

Table 4: Results of Layer Norm and Ada Norm. (+) means higher is better; (-) means lower is better. Ada Norm outperforms Layer Norm on seven datasets.

| Models | En-De (+) | De-En (+) | En-Vi (+) | Enwiki8 (-) | RT (+) | SST5 (+) | MNIST (+) | PTB (+) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| w/o Norm | Diverge | 34.0 | 28.4 | 1.04 | 76.85 | 38.55 | 99.14 | 88.31 |
| Layer Norm | 28.3 | 35.5 | 31.2 | 1.07 | 77.21 | 39.23 | 99.13 | 89.12 |
| Layer Norm-simple | 28.4 | 35.5 | 31.6 | 1.07 | 76.66 | 40.54 | 99.09 | 89.19 |
| Ada Norm | 28.5 | 35.6 | 31.4 | 1.07 | 77.50 | 40.54 | 99.35 | 89.23 |

4.2 Comparison between Ada Norm and Layer Norm

The comparison between Layer Norm and Ada Norm is shown in Table 4 (for the Ada Norm implementation, Kaiming initialization and the Pre-Norm setting are recommended). Ada Norm outperforms Layer Norm on seven datasets, with 0.2 BLEU on En-De, 0.1 BLEU on De-En, 0.2 BLEU on En-Vi, 0.29 ACC on RT, 1.31 ACC on SST5, 0.22 ACC on MNIST, and 0.11 UAS on PTB. Unlike Layer Norm-simple, which only performs well on larger models, Ada Norm achieves more balanced results. Figure 3 shows the loss curves of Layer Norm and Ada Norm on the validation set of En-Vi, PTB, and De-En. Compared to Ada Norm, Layer Norm has lower training loss but higher validation loss. The lower validation loss shows that Ada Norm has better convergence.

Figure 3: Loss curves (training and validation) of Layer Norm and Ada Norm on En-Vi, PTB, and De-En.

5 Related Work

Deep neural networks have outperformed shallow models in a variety of fields, such as natural language processing [Sutskever et al., 2014, Bahdanau et al., 2015, Devlin et al., 2018] and computer vision [He et al., 2016, Huang et al., 2017]. The improvement mainly comes from the stronger expressive power of deep layers. However, as depth increases, the network training process becomes complicated and requires advanced architectural techniques. One of the most important such techniques is normalization. Currently, it is widely accepted that normalization layers assist training by smoothing gradients, enabling large learning rates, accelerating convergence, and improving generalization results [Zhang et al., 2019]. First introduced by Ioffe and Szegedy [2015], Batch Norm fixes layer distributions to reduce ICS (Internal Covariate Shift), a phenomenon in which the upper layers need to continuously adapt to the new distributions of the lower layers.
Following this work, several normalization methods have been proposed, such as instance normalization [Ulyanov et al., 2016] and group normalization [Wu and He, 2018]. In addition, several studies explore better activation functions [Klambauer et al., 2017] or initialization methods [Zhang et al., 2019] to avoid the dependency on normalization layers.

Layer Norm was proposed to extend Batch Norm to RNN. Layer Norm normalizes the mean and variance of all summed inputs to the neurons in one layer. Unlike Batch Norm, which depends on the size of the mini-batch, Layer Norm has fewer limitations. Layer Norm is adaptive to RNN and self-attention-based models. It has been applied to state-of-the-art frameworks such as Transformer [Vaswani et al., 2017], BERT [Devlin et al., 2018], and Transformer-XL [Dai et al., 2019]. Layer Norm brings better performance and is irreplaceable in these frameworks.

Despite the good performance, it is still unclear how layer normalization works. Ioffe and Szegedy [2015] claim that the effectiveness of Batch Norm comes from reducing ICS, and this has been a popular belief about Batch Norm [Santurkar et al., 2018]. However, some recent studies point out that the success of Batch Norm relates to smoother gradients and has little to do with reducing ICS [Santurkar et al., 2018, Bjorck et al., 2018]. Although these studies provide a pioneering perspective on understanding Batch Norm, some questions remain unanswered, such as how Batch Norm helps smooth gradients. Also, there is little work studying whether these theories can explain the success of Layer Norm. In this paper, we take a further step towards a better understanding of Layer Norm.

6 Conclusion

In this paper, we investigate how layer normalization works. Based on a series of experiments and theoretical analysis, we summarize some interesting conclusions. We find that the derivatives of the mean and variance are important to the success of Layer Norm because they re-center and re-scale backward gradients. Furthermore, experiments show that the bias and gain increase the risk of over-fitting and do not work in most cases. To address this problem, we propose a normalization method, Ada Norm, which replaces the bias and gain in Layer Norm with a new adaptive transformation function that can update scaling weights based on input values. Experiments show that Ada Norm outperforms Layer Norm on seven datasets. In future work, we would like to explore more alternatives to Layer Norm from the perspective of gradient normalization.

Acknowledgments

We thank all reviewers for their thoughtful and constructive suggestions. This work was supported in part by the National Natural Science Foundation of China (No. 61673028).

References

R. Al-Rfou, D. Choe, N. Constant, M. Guo, and L. Jones. Character-level language modeling with deeper self-attention. CoRR, abs/1808.04444, 2018.

D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

N. Bjorck, C. P. Gomes, B. Selman, and K. Q. Weinberger. Understanding batch normalization. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 7705-7716, 2018.
M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, and M. Federico. The IWSLT 2015 evaluation campaign. In IWSLT 2014, International Workshop on Spoken Language Translation, 2014.

M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, R. Cattoni, and M. Federico. The IWSLT 2015 evaluation campaign. In IWSLT 2015, International Workshop on Spoken Language Translation, 2015.

D. Chen and C. D. Manning. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 740-750, 2014.

Z. Dai, Z. Yang, Y. Yang, W. W. Cohen, J. Carbonell, Q. V. Le, and R. Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700-4708, 2017.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 448-456, 2015.

G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter. Self-normalizing neural networks. In Advances in Neural Information Processing Systems, pages 971-980, 2017.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

J. Lei Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, 1993.

M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038, 2019.

B. Pang and L. Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the Association for Computational Linguistics (ACL), pages 115-124, 2005.

K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pages 311-318, 2002.

M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.

S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry. How does batch normalization help optimization? In Advances in Neural Information Processing Systems, pages 2483-2493, 2018.

R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631-1642, 2013.
I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112, 2014.

D. Ulyanov, A. Vedaldi, and V. S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. CoRR, abs/1607.08022, 2016.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6000-6010, 2017.

S. Wiseman and A. M. Rush. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 1296-1306, 2016.

Y. Wu and K. He. Group normalization. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIII, pages 3-19, 2018.

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.

H. Zhang, Y. N. Dauphin, and T. Ma. Fixup initialization: Residual learning without normalization. CoRR, abs/1901.09321, 2019.