# On the Nonlinearity of Layer Normalization

Yunhao Ni 1, Yuxin Guo 1, Junlong Jia 1, Lei Huang* 1

Abstract

Layer normalization (LN) is a ubiquitous technique in deep learning, but our theoretical understanding of it remains elusive. This paper investigates a new theoretical direction for LN, regarding its nonlinearity and representation capacity. We investigate the representation capacity of a network with a layerwise composition of linear and LN transformations, referred to as LN-Net. We theoretically show that, given m samples with any label assignment, an LN-Net with only 3 neurons in each layer and O(m) LN layers can correctly classify them. We further show a lower bound on the VC dimension of an LN-Net. The nonlinearity of LN can be amplified by group partition, which is also theoretically demonstrated under mild assumptions and empirically supported by our experiments. Based on our analyses, we consider designing neural architectures by exploiting and amplifying the nonlinearity of LN, and the effectiveness is supported by our experiments.

1. Introduction

Layer normalization (LN) (Ba et al., 2016) is a ubiquitous technique in deep learning, enabling various neural networks to train effectively. It was initially proposed to address the train-inference inconsistency problem of Batch Normalization (BN) (Ioffe & Szegedy, 2015) applied in recurrent neural networks for Natural Language Processing (NLP) tasks. It then became a key component of the Transformer (Vaswani et al., 2017) and its variants (Dai et al., 2019; Xiong et al., 2020; Dosovitskiy et al., 2021), spreading from NLP (Radford et al., 2021; Devlin et al., 2019; Raffel et al., 2020) to the Computer Vision (CV) (Dosovitskiy et al., 2021; Carion et al., 2020; Cheng et al., 2022) community. LN has secured a firm position (Huang et al., 2023) in the evolution of neural architectures and is currently a basic layer in almost all foundation models (Brown et al., 2020; Alayrac et al., 2022; Kirillov et al., 2023).

1 SKLCCSE, Institute of Artificial Intelligence, Beihang University, Beijing, China. Correspondence to: Lei Huang. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

While LN is extensively used in practice, our theoretical understanding of it remains elusive. One main line of theoretical work on LN concerns its scale-invariant property, which was initially discussed in (Ba et al., 2016) to illustrate its ability to stabilize training and was further extended in (Hoffer et al., 2018; Arora et al., 2019; Li & Arora, 2020) to consider its potential effects on optimization dynamics. Different from previous work focusing on theoretical analyses of LN from the perspective of optimization, this paper investigates a new theoretical direction for LN, regarding its nonlinearity and representation capacity.

We mathematically demonstrate that LN is a nonlinear transformation. We highlight that LN may be nonlinear by intuition, but, to the best of our knowledge, no previous work demonstrates this. Our demonstration is based on a lower bound we define, named LSSR (Definition 2). The LSSR cannot be broken by any linear transformation, by definition, but we show that a linear neural network combined with LN can break the LSSR. Therefore, LN has nonlinearity. We also show that an LN-Net, which is a layerwise composition of linear and LN transformations, has nonlinearity.
One interesting question is how powerful the nonlinearity of an LN-Net is in theory. We theoretically show that, given m samples with any label assignment, an LN-Net with only 3 neurons in each layer and O(m) LN layers can correctly classify them. We further show a lower bound on the VC dimension of an LN-Net. In particular, given an LN-Net with only 3 neurons in each layer and L LN layers, its VC dimension is lower bounded by L + 2. These results show that an LN-Net has great representation capacity in theory, implying the possibility that a network with only linear and LN layers can work well in practice.

We further investigate how to amplify and exploit the nonlinearity of LN. We find that Group based LN (LN-G), which divides the neurons of a layer into groups and performs LN in each group in parallel, has stronger nonlinearity than the naive LN counterpart. This is also theoretically demonstrated under mild assumptions and empirically supported by our comprehensive experiments. We also consider practical scenarios, where we replace LN with LN-G in the Transformer and ViT, since we believe the amplified nonlinearity can benefit the models. The preliminary results show the potential of this design in neural architectures.

2. Preliminary and Notation

We use a lowercase letter x ∈ R to denote a scalar, a boldface lowercase letter x ∈ R^d for a vector, and a boldface uppercase letter X ∈ R^(d×m) for a matrix, where R is the set of real numbers and d, m are positive integers.

Neural Network. Given the input x, a classical neural network f_θ(x) is typically represented as a layer-wise linear¹ and nonlinear transformation:

  h^(l) = W^(l) x^(l−1) + b^(l),   (1)
  x^(l) = φ(h^(l)),  l = 1, ..., L,   (2)

where θ = {(W^(l), b^(l)), l = 1, ..., L} are learnable parameters, x^(0) = x, W^(l) ∈ R^(d_l × d_{l−1}), b^(l) ∈ R^(d_l), and d_l indicates the number of neurons in the l-th layer. We set x^(L) = h^(L) as the output of the network f_θ(x) to simplify notation. A neural network without the nonlinear transformation φ(·) (Eqn. 2) is referred to as a linear neural network, which is still a linear transformation in nature.

¹We follow the convention in the deep learning community and do not differentiate between linear and affine transformations.

Layer Normalization. Layer Normalization (LN) is an essential layer in modern deep neural networks, mainly for stabilizing training. Given a single sample of layer input x = [x^(1), x^(2), ..., x^(d)]^T ∈ R^d with d neurons in a neural network, LN standardizes x within the neurons as²:

  x̂^(j) = LN(x)^(j) = (x^(j) − µ) / σ,  j = 1, 2, ..., d,   (3)

where µ = (1/d) Σ_{j=1}^d x^(j) and σ² = (1/d) Σ_{j=1}^d (x^(j) − µ)² are the mean and variance of the sample, respectively. The standardization operation can be viewed as a combination of centering and scaling operations. Centering projects x onto the hyperplane {x ∈ R^d : x^(1) + ... + x^(d) = 0}, by x̃ = (I − (1/d) 1_d 1_d^T) x. Scaling projects x̃ onto the sphere {x ∈ R^d : [x^(1)]² + ... + [x^(d)]² = d}, by x̂ = √d · x̃ / ||x̃||_2. From the geometric perspective, we thus also call scaling Spherical Projection (SP). Note that SP is the only normalization operation in RMSNorm (Zhang & Sennrich, 2019).

²LN usually uses extra learnable scale and shift parameters (Ioffe & Szegedy, 2015); we omit them to simplify the discussion, as they are affine transformations in nature.
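To make the centering/scaling decomposition concrete, here is a minimal NumPy sketch (our illustration; the function names are ours, and the learnable affine parameters are omitted as above). It checks that centering followed by spherical projection reproduces the standardization in Eqn. 3.

```python
import numpy as np

def centering(x):
    # Project x onto the hyperplane {x : sum_j x^(j) = 0}.
    d = x.shape[0]
    return x - x.mean() * np.ones(d)

def scaling(x):
    # Spherical Projection (SP): project onto the sphere of radius sqrt(d).
    d = x.shape[0]
    return np.sqrt(d) * x / np.linalg.norm(x)

def layer_norm(x):
    # Standardization as in Eqn. 3 (no learnable affine parameters).
    return (x - x.mean()) / x.std()

x = np.random.randn(8)
assert np.allclose(scaling(centering(x)), layer_norm(x))
```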
Sum of Squares. Sum of Squares (SS) (Fisher, 1970) is a statistical concept that measures the variability or dispersion within a set of data. Denote m samples from class c as x_c1, ..., x_cm ∈ R^d, represented as a matrix X_c = [x_c1, ..., x_cm]; then the SS of X_c is defined as

  SS(X_c) = Σ_{i=1}^m ||x_ci − x̄_c||²,   (4)

where x̄_c = (x_c1 + ... + x_cm)/m.

3. The Existence of Nonlinearity in LN

In this section, we define the Sum of Squares Ratio (SSR) and its linear-invariant lower bound named LSSR. We then show that LN can break the lower bound of SSR and plays a role in nonlinear representation.

3.1. Linear Invariant Lower Bound

We consider binary classification to simplify the discussion. Let X_c = [x_c1, ..., x_cm] represent m samples³ in R^d from the corresponding class c ∈ {1, 2}, and let [X_1, X_2] ∈ R^(d×2m) represent all the samples together.

³We use the same number (m) of samples in each class to simplify notation; our subsequent definitions and conclusions also apply to different numbers of samples per class.

Definition 1. (SSR.) Given SS([X_1, X_2]) ≠ 0, the Sum of Squares Ratio (SSR) between X_1 and X_2 is defined as

  SSR(X_1, X_2) = (SS(X_1) + SS(X_2)) / SS([X_1, X_2]).   (5)

It is easy to demonstrate that SSR(X_1, X_2) ∈ [0, 1]. SSR can serve as an indicator of how easily the samples from different classes can be separated in Euclidean space: in most cases, the smaller the SSR is, the more easily X_1 and X_2 can be separated with Euclidean distance as the measurement. Based on SSR, we further define its lower bound under any linear transformation as follows.

Definition 2. (LSSR.) The Linear SSR (LSSR) between X_1 and X_2 is defined as

  LSSR(X_1, X_2) = inf_{φ ∈ D_φ(d)} SSR(φ(X_1), φ(X_2)),   (6)

where D_φ(d) is the set of all linear functions defined on R^d.

By definition, LSSR is the lower bound of SSR under any linear transformation. LSSR can serve as an indicator of how easily the samples from different classes can be linearly separated. We provide illustrative examples in Appendix A. In the following proposition, we show that a linear neural network cannot break the LSSR.

Proposition 1. Given X_1, X_2 ∈ R^(d_0×m) and a linear neural network represented as φ̃ = φ_1 ∘ ... ∘ φ_L, where φ_l : R^(d_{l−1}) → R^(d_l) (l = 1, ..., L) are all linear transformations as shown in Eqn. 1, we have

  SSR(φ̃(X_1), φ̃(X_2)) ≥ LSSR(X_1, X_2).   (7)

Proposition 1 is easily proved by the definition of LSSR, since φ̃ ∈ D_φ(d_0). Proposition 1 implies that the SSR will not break the lower bound if we use an arbitrary linear neural network as a representation transformation over the samples. One interesting question is whether a linear neural network combined with LN can break the lower bound of SSR. If yes, we can show that LN has nonlinearity.
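Definitions 1 and 2 are easy to compute numerically. The NumPy sketch below (our illustration; the helper names are ours) computes SS and SSR for two classes of samples stored as columns; well-separated clusters give a small SSR, while overlapping clusters give a value near 1.

```python
import numpy as np

def sum_of_squares(X):
    # SS(X) = sum_i ||x_i - mean||^2, with samples as the columns of X (Eqn. 4).
    centered = X - X.mean(axis=1, keepdims=True)
    return np.sum(centered ** 2)

def ssr(X1, X2):
    # SSR(X1, X2) = (SS(X1) + SS(X2)) / SS([X1, X2])  (Eqn. 5), a value in [0, 1].
    total = sum_of_squares(np.concatenate([X1, X2], axis=1))
    return (sum_of_squares(X1) + sum_of_squares(X2)) / total

# Two well-separated Gaussian clusters give a small SSR.
rng = np.random.default_rng(0)
X1 = rng.normal(loc=-3.0, scale=1.0, size=(2, 256))
X2 = rng.normal(loc=+3.0, scale=1.0, size=(2, 256))
print(ssr(X1, X2))  # close to 0; overlapping clusters would give a value near 1
```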
3.2. Break the Lower Bound of SSR with LN

Here, we focus on a linear neural network combined with LN. To state this more precisely, we define the LN-Net as follows.

Definition 3. (LN-Net.) The LN-Net f_θ(x) is defined as a layer-wise composition of linear and LN transformations:

  h^(l) = W^(l) x^(l−1) + b^(l),  l = 1, ..., L,   (8)
  x^(l) = LN(h^(l)),  l = 1, ..., L − 1,

where θ = {(W^(l), b^(l)), l = 1, ..., L} are learnable parameters, x^(0) = x, and LN(·) denotes the LN operation. We set x^(L) = h^(L) as the output of the network f_θ(x) to simplify notation.

We first provide a tractable method to calculate LSSR, stated by the following proposition.

Proposition 2. Given X_1, X_2 ∈ R^(d×m), we denote M = Σ_{c=1}^2 Σ_{i=1}^m (x_ci − x̄_c)(x_ci − x̄_c)^T and N = Σ_{c=1}^2 Σ_{i=1}^m (x_ci − x̄)(x_ci − x̄)^T, where x̄ = (x̄_1 + x̄_2)/2. Supposing that N is invertible, we have

  LSSR(X_1, X_2) = λ*,   (9)

and correspondingly,

  LSSR(X_1, X_2) = SSR((u*)^T X_1, (u*)^T X_2),   (10)

where λ* and u* are the minimum eigenvalue of N^(−1)M and its corresponding eigenvector.

The proof of Proposition 2 is shown in Appendix B. Based on Proposition 2, we further define f_SSR(t) as

  f_SSR(t) = LSSR(X_1, X_2) if t = 0;  SSR(ψ̃(t; X_1), ψ̃(t; X_2)) if t ≠ 0,   (11)

where ψ̃(t; x_ci) = 1^T φ̃(t; x_ci) / ||φ̃(t; x_ci)||_2 with 1 = [1, 1]^T, φ̃(t; x_ci) = [(u*)^T x_ci · t, 1]^T, and t ∈ R. We point out that f_SSR(t) is differentiable at t = 0, and f'_SSR(0) is decided only by X_1 and X_2, which is proved in Appendix C. Based on the definition of f_SSR(t), we show that an LN-Net can decrease the LSSR, as stated by the following theorem.

Theorem 1. Let ψ = φ_1 ∘ LN(·) ∘ φ_2, performing over the inputs X_1, X_2 ∈ R^(d×m). If f'_SSR(0) ≠ 0, we can always find suitable linear functions φ_1 and φ_2 such that

  SSR(ψ(X_1), ψ(X_2)) < LSSR(X_1, X_2).   (12)

The proof of Theorem 1 requires a complicated derivation; please refer to Appendix C for details. Note that an LN-Net is a more general form of ψ in Theorem 1, which implies that an LN-Net can break the lower bound of SSR. Based on Theorem 1, we obtain the following statement.

Corollary 1. LN is a nonlinear transformation.

Proof. Assume that LN(·) is a linear transformation. Then an LN-Net is also a linear transformation, and by Proposition 1 an LN-Net cannot break the LSSR. This contradicts Theorem 1. Therefore, LN(·) must be a nonlinear transformation.

Summary. In this section, we mathematically show that LN is a nonlinear transformation, and that an LN-Net is a network with nonlinearity. One interesting question is how powerful the nonlinearity of an LN-Net is in theory. We discuss this in the following section.

4. Capacity of a Network with LN

In this section, we apply an LN-Net to classify m samples with any label assignment. To prove the existence of such an LN-Net, we propose the Projection Merge Algorithm (PMA) and the Parallelization Breaking Algorithm (PBA) to help find the parameters of the LN-Net.

4.1. LN for Xor Classification

To understand PMA intuitively, we use Spherical Projection (SP) rather than LN at the beginning, and we replace SP back with LN and linear layers in the end, according to the following lemma.

Lemma 1. Denote LN(·) as the LN operation in R^d (d ≥ 3), and SP(·) as the SP operation⁴ in R^(d−1). We can find some linear transformations φ̂_1 and φ̂_2 such that

  SP(·) = φ̂_1 ∘ LN(·) ∘ φ̂_2.   (13)

⁴If there are no special instructions, SP projects a sample onto the unit circle, namely x ↦ x / ||x||_2.

The proof of Lemma 1 is shown in Appendix C, and we can easily obtain the following corollary.

Corollary 2. SP(·) can be represented by an LN-Net.

Taking xor classification as an example, we first show how we use an LN-Net to classify linearly inseparable samples. As shown in Figure 1(a), (0, 0), (1, 1) and (0, 1), (1, 0) belong to different classes. Obviously, the two classes are not linearly separable. To begin with, we rotate them by 45°, as shown in Figure 1(b). Then we vertically project them onto y = 0.5, as shown in Figure 1(c). Next, we spherically project them onto the circle x² + y² = 1, as shown in Figure 1(d). Finally, we horizontally project them onto x = 0, as shown in Figure 1(e). Now we have classified the two classes.

Figure 1. Solution to the xor classification: (a) initial input; (b) rotation; (c) linear projection; (d) spherical projection; (e) linear projection.

We can therefore classify them with SP and linear transformations only; please refer to the demonstration in Figure 1 for details. By Lemma 1, we can replace SP with an LN-Net. Therefore, we can construct an LN-Net according to the operations in Figure 1 and classify the xor samples. More generally, we discuss binary classification in Section 4.2 and multi-class classification in Section 4.3.
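The pipeline of Figure 1 can be verified numerically. The NumPy sketch below (our illustration; the particular projection matrices and the 0.8 threshold are our own choices consistent with the description) applies rotation, projection onto y = 0.5, spherical projection, and projection onto x = 0 to the four xor points.

```python
import numpy as np

# Xor inputs as columns: class 0 = {(0,0), (1,1)}, class 1 = {(0,1), (1,0)}.
X = np.array([[0., 1., 0., 1.],
              [0., 1., 1., 0.]])

theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],      # (b) rotate by 45 degrees
              [np.sin(theta),  np.cos(theta)]])
P_keep_x = np.array([[1., 0.], [0., 0.]])            # used in (c): keep x, drop y
P_keep_y = np.array([[0., 0.], [0., 1.]])            # used in (e): keep y, drop x

Z = R @ X
Z = P_keep_x @ Z + np.array([[0.], [0.5]])           # (c) vertical projection onto y = 0.5
Z = Z / np.linalg.norm(Z, axis=0, keepdims=True)     # (d) spherical projection (SP)
Z = P_keep_y @ Z                                     # (e) horizontal projection onto x = 0

print(Z[1])        # class 0 maps to 1.0, class 1 maps to about 0.577
print(Z[1] > 0.8)  # a threshold on the remaining coordinate separates the classes
```

By Lemma 1, the SP step can itself be written as φ̂_1 ∘ LN(·) ∘ φ̂_2 with LN acting on R^3, so the whole pipeline corresponds to an LN-Net.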
4.2. LN for Binary Classification

Theorem 2. Given m samples with any binary label assignment in {0, 1}, there always exists an LN-Net with only 3 neurons per layer and O(m) LN layers that can correctly classify them.

To prove Theorem 2, we represent the LN-Net with SP and linear layers. We then design an algorithm to compute the parameters according to the input, and hence obtain an LN-Net with proper parameters that classifies the samples. The proof is as follows.

We represent an LN-Net as

  f_θ(·) = φ_1 ∘ LN(·) ∘ φ_2 ∘ ... ∘ φ_{L−1} ∘ LN(·) ∘ φ_L,   (14)

where φ_1, ..., φ_L denote the linear layers and LN(·) denotes the LN layers. For convenience, we replace LN with SP temporarily.

Proposition 3. The LN-Net f_θ(·) in Eqn. 14 can be equivalently represented by SP and linear layers.

Proof. Since each LN(·) acts on R^3, by Lemma 1 we can construct a 2-dimensional SP(·) = φ̂_1 ∘ LN(·) ∘ φ̂_2. Define each φ_l in Eqn. 14 as

  φ_l = φ_l^(1) ∘ φ_l^(2) ∘ φ̂_1,  l = 1;
  φ_l = φ̂_2 ∘ φ_l^(1) ∘ φ_l^(2) ∘ φ̂_1,  1 < l < L;
  φ_l = φ̂_2 ∘ φ_l^(1),  l = L,   (15)

where φ_l^(1) and φ_l^(2) are both linear functions. By Eqn. 15 and Lemma 1, we can rewrite f_θ(·) as

  f_θ(·) = φ_1^(1) ∘ φ_1^(2) ∘ SP(·) ∘ φ_2^(1) ∘ ... ∘ SP(·) ∘ φ_L^(1),   (16)

namely, f_θ(·) can be equivalently represented by SP and linear layers.

Hereafter, we consider how to compute the parameters of f_θ(·). Specifically, for each layer, we denote

  φ_l^(1) : X^(l−1) ↦ P^(l),  1 ≤ l ≤ L;
  φ_l^(2) : P^(l) ↦ H^(l),  1 ≤ l ≤ L − 1;
  SP(·) : H^(l) ↦ X^(l),  1 ≤ l ≤ L − 1.   (17)

Besides, the input of f_θ(·) is X^(0) = [x_1^(0), ..., x_m^(0)], and the output is P^(L). Now we construct f_θ(·) step by step. We require that for each P^(l) (l = 1, ..., L) the points lie on the x-axis, namely p_k^(l) = [p_k^(l), 0]^T (k = 1, ..., m). To get P^(1), we apply φ_1^(1) for initialization as below.

Proposition 4. For any input X^(0), we can find some u such that

  φ_1^(1) : x_k^(0) ↦ p_k^(1) = [u^T x_k^(0), 0]^T,   (18)

where p_i^(1) ≠ p_j^(1) if x_i^(0) ≠ x_j^(0).

Proposition 4 parameterizes φ_1^(1) and initializes P^(1) onto the x-axis without merging different points⁵; please refer to Appendix D for the proof. The suitable parameters of the other linear functions are generated by the Projection Merge Algorithm, shown as Algorithm 1 below.

⁵In this paper, we say that p_i^(l) and p_j^(l) are "different points" when p_i^(l) ≠ p_j^(l), rather than when i ≠ j, for each hidden layer (this applies to x and h as well).

In Algorithm 1, P^(L) is the output, which is also the output of f_θ(·). In fact, Algorithm 1 constructs each P^(l) recursively. For the case J_i ≠ ∅, we take 5 points as an example to show how we get P^(l+1) from P^(l) in Figure 2. For the case J_i = ∅, all points with the same label as p_i^(l) have already been merged together; therefore, we remove them from P and choose the leftmost point from the remaining P, until P = ∅.
Figure 2. Obtaining P^(l+1) from P^(l) geometrically: (a) translation; (b) spherical projection; (c) linear projection. In Figure 2(a), P^(l) is shown as the bars on the x-axis. First, find the leftmost point, namely p_1^(l). Then find the leftmost point to the right of p_1^(l) that has the same label, namely p_4^(l). Afterwards, shift all the points up by (p_4^(l) − p_1^(l))/2 and left by (p_4^(l) + p_1^(l))/2 to obtain H^(l), as shown in Figure 2(a). Next, spherically project H^(l) onto the unit circle to obtain X^(l), shown as the '+' marks in Figure 2(b). Finally, merge the points in X^(l) by their ordinates, which become the new abscissas of P^(l+1), and take 0 as the new ordinates of P^(l+1), as shown in Figure 2(c).

Algorithm 1 Projection Merge Algorithm
input: the initial input P^(1).
output: the final output P^(L).
1: l ← 1
2: P ← {p_1^(l), p_2^(l), ..., p_m^(l)}
3: while P ≠ ∅ do
4:   i ← arg min_k {p_k^(l) : p_k^(l) ∈ P}
5:   J_i ← {p_j^(l) ∈ P : p_j^(l) ≠ p_i^(l), y_j = y_i}
6:   if J_i ≠ ∅ then
7:     j ← arg min_k {p_k^(l) : p_k^(l) ∈ J_i}
8:     for k ← 1 to m do
9:       h_k^(l) ← p_k^(l) − [(p_i^(l) + p_j^(l))/2, (p_i^(l) − p_j^(l))/2]^T
10:      x_k^(l) ← h_k^(l) / ||h_k^(l)||_2
11:      p_k^(l+1) ← [[0, 1], [0, 0]] x_k^(l)
12:    end for
13:    l ← l + 1
14:    P ← {p_1^(l), p_2^(l), ..., p_m^(l)}
15:  else
16:    remove every p_j^(l) with p_j^(l) = p_i^(l) from P
17:  end if
18: end while
19: return P^(l)

Based on the above, we give the properties of each layer as follows.

Proposition 5. For each layer, φ_l^(1) (2 ≤ l ≤ L) only merges points with the same label. Nevertheless, φ_1^(1), SP(·) and φ_l^(2) (1 ≤ l ≤ L − 1) do not merge any points.

Please refer to Appendix D for the proof of Proposition 5. By Proposition 5, Algorithm 1 only merges points with the same label. Besides, we find that from P^(l) to P^(l+1), the number of different points decreases by at least 1. Since the input consists of m different points from two classes, Algorithm 1 merges at most m − 2 times; we thus have L − 1 ≤ m − 2. Based on Algorithm 1, we can construct the other linear functions with exact parameters as follows:

  φ_l^(1) : x_k^(l−1) ↦ [[0, 1], [0, 0]] x_k^(l−1),  1 < l ≤ L,
  φ_l^(2) : p_k^(l) ↦ p_k^(l) − [(p_i^(l) + p_j^(l))/2, (p_i^(l) − p_j^(l))/2]^T,  1 ≤ l < L,   (19)

where, for each l, i and j denote the indices selected by Algorithm 1 in the corresponding merging step. Therefore, f_θ(·) with the parameters in Eqn. 18 and Eqn. 19 can classify the samples X^(0). Besides, the LN-Net in Eqn. 14 with depth⁶ L − 1 = O(m) can also classify the m samples. We have thus proved Theorem 2.

⁶We denote the number of LNs as the depth of an LN-Net.

Our results above are based on an LN-Net with 3 neurons in each layer. Furthermore, we can generalize PMA to a wider neural network, but it is much more complex; please refer to Appendix D for more details. Based on Theorem 2, we can easily obtain the following corollary related to the VC dimension (Bartlett et al., 1998) of an LN-Net.

Corollary 3. Given an LN-Net f_θ(·) with width 3 and depth L, its VC dimension VCdim(f_θ(·)) is lower bounded by L + 2.
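The geometric steps of Algorithm 1 can be simulated directly on the abscissas. The NumPy sketch below (our own simulation, not the paper's code; it assumes generic, randomly drawn inputs) repeatedly applies the shift, the spherical projection, and the ordinate read-out, and prints the final values per label.

```python
import numpy as np

def pma_step(p, i, j):
    # One merge step of Algorithm 1: shift so that p_i and p_j become symmetric
    # about the y-axis, spherically project, and keep the ordinate.
    shift = np.array([(p[i] + p[j]) / 2.0, (p[i] - p[j]) / 2.0])
    h = np.stack([p, np.zeros_like(p)], axis=0) - shift[:, None]
    x = h / np.linalg.norm(h, axis=0, keepdims=True)
    return x[1]  # new abscissas of P^(l+1)

def projection_merge(p, y, tol=1e-9):
    # Simulate Algorithm 1 for binary labels on scalar abscissas p with labels y.
    active = list(range(len(p)))
    while active:
        i = min(active, key=lambda k: p[k])
        J = [k for k in active if abs(p[k] - p[i]) > tol and y[k] == y[i]]
        if J:
            j = min(J, key=lambda k: p[k])
            p = pma_step(p, i, j)
        else:
            active = [k for k in active if abs(p[k] - p[i]) > tol]
    return p

rng = np.random.default_rng(0)
p = rng.normal(size=10)
y = np.array([0, 1, 0, 1, 1, 0, 0, 1, 0, 1])
out = projection_merge(p, y)
for label in (0, 1):
    print(label, np.round(out[y == label], 4))  # points sharing a label collapse together
```

On generic inputs, points sharing a label collapse to a common value while the two classes end at distinct values, which is the behavior Proposition 5 and Theorem 2 describe.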
4.3. LN for Multi-class Classification

Theorem 3. Given m samples with any label assignment, there always exists an LN-Net with only 3 neurons per layer and O(m) LN layers that can correctly classify them.

Applying Algorithm 1 to multi-class classification may confuse two samples with different labels. We thus introduce the Parallelization Breaking Algorithm to avoid such confusion, based on which we can again construct an LN-Net to classify the samples. The detailed analysis and proof are as follows.

To begin with, we are concerned with whether Algorithm 1 applies to multi-class classification; the answer is no. Based on Figure 2(c), we recolor x_3^(l) red, as shown in Figure 3. When we merge x_1^(l) and x_4^(l), x_2^(l) and x_3^(l) are merged at the same time. In other words, the algorithm confuses them into the same class. Proposition 6 gives the necessary condition for such confusion.

Figure 3. The case of confusion in the merging process.

Proposition 6. Confusion refers to merging two points with different labels. If confusion happens when we project X^(l+1) onto the y-axis, there must be a parallelogram⁷ consisting of four different points in P^(l). Conversely, if there is no parallelogram in P^(l), confusion never happens when applying Algorithm 1.

⁷The parallelogram may be degenerate. Given four points x_1, x_2, x_3, x_4, if the sum of two of the points equals the sum of the other two, we regard them as forming a parallelogram.

Please refer to Appendix D for the proof of Proposition 6. To avoid such confusion, we propose the Parallelization Breaking Algorithm (PBA), shown as Algorithm 2.

Algorithm 2 Parallelization Breaking Algorithm
input: P^(l), u_l (obtained by Proposition 7).
output: P̂^(l).
1: for k ← 1 to m do
2:   p_k^(l) ← SP(p_k^(l) + [0, 1]^T)
3:   p̂_k^(l) ← [u_l^T p_k^(l), 0]^T
4: end for
5: return P̂^(l)

Proposition 7. We can always find u_l ∈ R² for Algorithm 2 such that there is no parallelogram in P̂^(l) and no points are merged by the algorithm.

Please refer to Appendix D for the proof of Proposition 7. PBA transforms P^(l) into P̂^(l), on which confusion will never happen. For multi-class classification, we insert PBA between φ_l^(1) and φ_l^(2) in Eqn. 16; then, given m samples with any label assignment, f_θ(·) with PBA can classify them.

Based on the above, we replace SP with LN and linear layers in f_θ(·) with PBA and then merge the adjacent linear layers; it follows that f_θ(·) with PBA is also an LN-Net. We point out that the depth of this LN-Net is no more than 2m. We have thus proved Theorem 3.
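A small NumPy sketch of Algorithm 2 (our illustration; the direction u is drawn at random here, whereas Proposition 7 guarantees that a suitable u_l exists): starting from abscissas that contain a degenerate parallelogram (0 + 3 = 1 + 2), the PBA map removes it for a generic direction.

```python
import numpy as np

def pba(p, u):
    # Algorithm 2: lift each abscissa by [0, 1], spherically project, then project onto u.
    lifted = np.stack([p, np.ones_like(p)], axis=0)       # points [p_k, 0]^T + [0, 1]^T
    projected = lifted / np.linalg.norm(lifted, axis=0)   # SP(.)
    return u @ projected                                  # new abscissas u^T SP(...)

def has_parallelogram(p, tol=1e-9):
    # A (possibly degenerate) parallelogram exists iff two disjoint index pairs share a sum.
    idx = range(len(p))
    pairs = [(a, b) for a in idx for b in idx if a < b]
    for n, (a, b) in enumerate(pairs):
        for (c, d) in pairs[n + 1:]:
            if {a, b}.isdisjoint({c, d}) and abs((p[a] + p[b]) - (p[c] + p[d])) < tol:
                return True
    return False

p = np.array([0.0, 1.0, 2.0, 3.0])   # 0 + 3 = 1 + 2: a degenerate parallelogram
u = np.random.default_rng(1).normal(size=2)
print(has_parallelogram(p), has_parallelogram(pba(p, u)))  # True, then False for generic u
```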
Summary. In this section, we show that an LN-Net also has powerful capacity in theory. Our theoretical results show that an LN-Net with width 3 and depth O(m) is able to classify m given samples with any label assignment. We see that an LN-Net operating over 3 neurons can introduce nonlinearity. One question is whether the nonlinearity of an LN-Net with d > 3 neurons can be amplified if we group the neurons and perform LN in each group in parallel. We answer it in the following section.

5. Amplify and Exploit the Nonlinearity of LN

5.1. Comparison of Nonlinearity

In this part, we first define a measurement over the Hessian matrix to compare the magnitude of nonlinearity. We then show that Group based LN (LN-G)⁸, which divides the neurons of a layer into groups and performs LN in each group in parallel, has stronger nonlinearity than the naive LN counterpart.

⁸We use the newly defined term LN-G rather than Group Normalization (GN) (Wu & He, 2018), considering that: 1) GN is defined on the convolutional input X ∈ R^(d×h×w), not on the input x ∈ R^d; 2) given the sequential input (e.g., text) X ∈ R^(d×T) in a Transformer/ViT, GN shares statistics over T by definition, while LN-G has no inter-sequence dependence and uses separate statistics over T, like LN.

Hessian of a Linear Function. Given a twice differentiable function f(x) : R^d → R, we focus on its Hessian matrix ∇²f(x). If f(x) is a linear function, we have ∇²f(x) ≡ O. More generally, suppose that φ : R^d → R^d is a linear transformation; we write φ(x) = [φ_1(x), ..., φ_d(x)]^T, and each φ_i(x) : R^d → R is a linear function, namely each Hessian matrix ∇²_x φ_i(x) = O.

Measurement of Nonlinearity. Given a twice differentiable function⁹ f : R^d → R^d and y = f(x), denote y = [y_1, ..., y_d]^T and x = [x_1, ..., x_d]^T. We define H(f; x) as an indicator describing the Hessian information of f : R^d → R^d, aggregated from the per-output Hessian matrices ∂²y_i/∂x², i = 1, ..., d. We use the Frobenius norm rather than the operator norm, for easier calculation. Note that H(f; x) ≥ 0, and H(f; x) = 0 if and only if f is a linear function. We thus assume that the larger H(f; x) is, the more nonlinearity f contains.

⁹For y = f(x), we require each y_i (i = 1, ..., d) to be twice differentiable with respect to x.

Amplifying Nonlinearity by Group. Denote ψ_G(g; ·) as Group based LN (LN-G) on R^d with group number g, and ψ_L(·) as LN on R^d. Comparing LN with LN-G, we obtain the result in Proposition 8.

Proposition 8. Given g ≤ d/3, we have

  H(ψ_G(g; ·); x) / H(ψ_L(·); x) ≥ 1.   (21)

Specifically, when g = d/4, this ratio is further lower bounded by a term that grows with d; please refer to Appendix E for the exact bound and the proof.

Proposition 8 shows that LN-G can amplify the nonlinearity of LN by using an appropriate group number. Compared with LN, LN-G shows more nonlinearity when d is larger. Besides, we generalize our discussion of H to the typical activation function ReLU; please refer to Appendix E for more details. One limitation of the result above is that the assumption that H(f; x) is a good indicator for measuring nonlinearity comes from intuition and cannot be fully verified. In the subsequent experiments, we empirically show that LN-G can indeed amplify the nonlinearity of LN.

5.2. Comparison of Representation Capacity by Fitting Random Labels

In this part, we follow the non-parametric randomization test of fitting random labels (Zhang et al., 2017) to empirically verify the nonlinearity of LN, and to further compare the representation capacity of LN-Nets with different group numbers for LN-G. The experiments are conducted on CIFAR-10 and MNIST with randomly assigned labels (CIFAR-10-RL and MNIST-RL). We evaluate the classification accuracy on the training set after the model is trained, which indicates the capacity of the models in fitting the dataset empirically. We only provide the essential components of the experimental setup; for more details, please refer to Appendix F.1.

Verify the Nonlinearity of LN. We conduct experiments on a linear neural network and an LN-Net with 256 neurons in each layer and various depths. We first sufficiently train a linear classifier and obtain its (nearly) upper-bound accuracy (18.51% on CIFAR-10-RL and 15.38% on MNIST-RL). To rule out the influence of optimization difficulty, we train the linear neural network and the LN-Net with various configurations, including different learning rates and with or without residual connections¹⁰. We report the best result from all configurations, as shown in Figure 4.

¹⁰A linear neural network with residual connections is still a linear model.

Figure 4. Results of the linear neural network and the LN-Net on fitting random labels; the black dashed line represents the upper-bound accuracy of the linear classifier. (a) Results on CIFAR-10-RL; (b) results on MNIST-RL.
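For concreteness, here is a minimal PyTorch sketch of an LN-Net of the kind trained in this experiment (our reconstruction; the width of 256 matches the text, while the depth, optimizer and other training details are assumptions that the actual experiments vary): a stack of linear layers interleaved with LayerNorm and no activation functions, trained on random labels.

```python
import torch
import torch.nn as nn

class LNNet(nn.Module):
    """Layer-wise composition of linear and LN transformations (Definition 3)."""
    def __init__(self, in_dim, width=256, depth=8, num_classes=10):
        super().__init__()
        # affine scale/shift omitted, as in the paper's analysis (footnote 2)
        layers = [nn.Linear(in_dim, width), nn.LayerNorm(width, elementwise_affine=False)]
        for _ in range(depth - 1):
            layers += [nn.Linear(width, width), nn.LayerNorm(width, elementwise_affine=False)]
        layers.append(nn.Linear(width, num_classes))  # final linear layer, no LN after it
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x.flatten(1))

# One training step on randomly labeled data; the training accuracy of this
# activation-free network is what Figure 4 reports.
model = LNNet(in_dim=3 * 32 * 32)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
x = torch.randn(128, 3, 32, 32)
y = torch.randint(0, 10, (128,))          # random labels
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()
```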
We observe that the linear neural network cannot break the bound of the linear classifier on either dataset, while the LN-Net reaches an accuracy of 55.85% on CIFAR-10-RL and 19.44% on MNIST-RL, which is much better than the linear classifier. This result also verifies empirically that LN has nonlinearity. Besides, we observe that the LN-Net generally obtains better performance as the depth increases (namely, more LN layers and greater nonlinearity). We note that an LN-Net without sufficient depth does not break the bound of the linear classifier on MNIST-RL. The likely reasons for this phenomenon are that: 1) MNIST-RL is more difficult to train on, compared to CIFAR-10-RL; 2) LN-Nets have a non-convex optimization landscape and we cannot ensure that the learned weights are optimal, given a fixed number of training epochs.

We also conduct experiments with Batch Normalization (BN) (Ioffe & Szegedy, 2015), where we replace LN with BN in the LN-Net. We find that BN cannot break the bound of the linear classifier on either dataset, like the linear neural network. This preliminary result is interesting, as it shows a potential advantage of LN over BN in terms of representation capacity.

Amplifying the Nonlinearity using Group. We conduct experiments on LN-Nets with d = 256 neurons in each layer and various depths. We replace LN in the LN-Net with LN-G and vary the group number g. We train the LN-Net with various learning rates and report the best training accuracy on CIFAR-10-RL and MNIST-RL, as shown in Figure 5.

Figure 5. Results of LN-Nets using LN-G. We vary the group number g and show the training accuracy and H(f; x): (a) training accuracy on CIFAR-10-RL; (b) training accuracy on MNIST-RL; (c) H(f; x) on CIFAR-10-RL; (d) H(f; x) on MNIST-RL. The black dashed line in (a) and (b) has the same meaning as that in Figure 4.

We observe that some LN-Nets with LN-G (e.g., depth = 8 and g = 32) can perfectly classify all the random labels on CIFAR-10-RL and MNIST-RL, which suggests that LN-G can amplify the nonlinearity of LN by using groups, as stated in Proposition 8. We also observe that an LN-Net with an appropriate group number (e.g., g = 32) obtains better performance as the depth increases. Besides, an LN-Net generally performs better with a larger group number, as long as the group number is not too large relative to the number of neurons. E.g., an LN-Net has significantly degenerated performance when g = 128, since d/g = 2 < 3 goes against the premise of Proposition 8.

We also calculate H(f; x) in certain layers and show how H(f; x) varies as the group number increases in Figure 5(c) and (d); H(f; x) is calculated by averaging over 1000 samples in our experiments. We find that H(f; x) increases as the group number of LN-G increases, which matches our theoretical analyses in Section 5.1.
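To connect Proposition 8 with the H(f; x) curves in Figure 5(c, d), the NumPy sketch below estimates H numerically for LN and LN-G (our illustration; aggregating the per-output Hessians as a sum of Frobenius norms is our assumption, since the exact formula is not reproduced above, and the finite-difference step size is a numerical choice).

```python
import numpy as np

def ln(x, eps=1e-6):
    # Layer normalization over the whole vector.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def ln_g(x, g, eps=1e-6):
    # Group based LN (LN-G): split the d neurons into g groups and apply LN per group.
    groups = x.reshape(g, -1)
    out = (groups - groups.mean(axis=1, keepdims=True)) / np.sqrt(groups.var(axis=1, keepdims=True) + eps)
    return out.reshape(-1)

def H_estimate(f, x, h=1e-3):
    # Assumed aggregation: H(f; x) = sum_i ||d^2 y_i / dx^2||_F, Hessians by finite differences.
    d = x.size
    hessians = np.zeros((d, d, d))  # indexed by [output i, j, k]
    for j in range(d):
        for k in range(d):
            e_j, e_k = np.eye(d)[j] * h, np.eye(d)[k] * h
            hessians[:, j, k] = (f(x + e_j + e_k) - f(x + e_j) - f(x + e_k) + f(x)) / h ** 2
    return sum(np.linalg.norm(hessians[i], 'fro') for i in range(d))

d = 12
x = np.random.default_rng(0).normal(size=d)
# Per Proposition 8, the ratio of the LN-G value to the LN value is expected to be >= 1
# whenever d/g >= 3.
print('LN       :', H_estimate(ln, x))
print('LN-G g=2 :', H_estimate(lambda v: ln_g(v, 2), x))
print('LN-G g=4 :', H_estimate(lambda v: ln_g(v, 4), x))
```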
5.3. Inspiration for Neural Architecture Design

In this part, we consider designing neural networks in real scenarios, given that LN-G can amplify the nonlinearity and performs well in fitting random labels, as shown in Section 5.2. We conduct experiments on both CNN and Transformer architectures.

5.3.1. CNN WITHOUT ACTIVATION FUNCTION

To further validate the representation capacity of LN-G in real scenarios, we conducted experiments on CIFAR-10 using ResNet (He et al., 2016). To exclude the influence of other nonlinearities, we remove all nonlinear activations from the ResNet and refer to the resulting network as ResNet-NA. We set the channel number of each layer to 128 for better ablating the group number of LN-G. We also conduct experiments on more CNNs, shown in Appendix F.2.

Investigation of LN-G. Note that LN-G may have several variants for a convolutional input X ∈ R^(c×h×w), where c, h and w indicate the channel, height and width dimensions of the feature maps, respectively. Following the usage of LN on CNNs, LN-G should calculate the mean/variance along all of the channel, height and width dimensions, which is equivalent to Group Normalization (GN) (Wu & He, 2018). Following the usage of LN on MLPs and Transformers, LN-G should calculate the mean/variance along only the channel dimension and use separate statistics at each position (a pair of height and width); we refer to this method as LN-G-Position.

We investigate how the group number affects the performance of the variants of LN-G (GN and LN-G-Position). We vary the group number g in {2, 4, 8, 16, 32, 64}. We train for a total of 200 epochs using SGD with a minibatch size of 128, momentum of 0.9 and weight decay of 0.0001. The initial learning rate is set to 0.1 and divided by 5 at the 60th, 120th and 160th epochs. The results are shown in Figure 6.

Figure 6. Results of the variants of LN-G (GN and LN-G-Position) with different group numbers, on the CIFAR-10 dataset using ResNet without ReLU activation: (a) training accuracy; (b) test accuracy. On the x-axis, G2 refers to a group number of 2.

We find that GN obtains slightly better performance as the group number increases. Note that this observation does not go against the experimental results of LN-G in amplifying the nonlinearity in Section 5.2, since the number of effective samples used to calculate the normalization statistics in each group of GN is h·w·c/g. We observe that LN-G-Position works particularly well and obtains over 85% test accuracy for multiple group numbers (note that there are no ReLU activations). We also find that LN-G-Position works particularly badly when the group number is 64, because the number of samples used to calculate the normalization statistics in each group of LN-G-Position is only c/g, which is 2 in that case.
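The two variants differ only in the axes over which the statistics are computed. A PyTorch sketch (our illustration; shapes follow X ∈ R^(c×h×w) with a batch dimension added, and the function names are ours):

```python
import torch

def gn_stats(x, g):
    # GN: statistics over (channels-in-group, height, width) for each sample and group.
    n, c, h, w = x.shape
    xg = x.reshape(n, g, c // g, h, w)
    mean = xg.mean(dim=(2, 3, 4), keepdim=True)        # one (mean, var) per sample and group
    var = xg.var(dim=(2, 3, 4), unbiased=False, keepdim=True)
    return ((xg - mean) / (var + 1e-5).sqrt()).reshape(n, c, h, w)

def ln_g_position_stats(x, g):
    # LN-G-Position: statistics over channels-in-group only, separately at every (h, w) position.
    n, c, h, w = x.shape
    xg = x.reshape(n, g, c // g, h, w)
    mean = xg.mean(dim=2, keepdim=True)                # one (mean, var) per sample, group and position
    var = xg.var(dim=2, unbiased=False, keepdim=True)
    return ((xg - mean) / (var + 1e-5).sqrt()).reshape(n, c, h, w)

x = torch.randn(2, 128, 8, 8)
print(gn_stats(x, 16).shape, ln_g_position_stats(x, 16).shape)
# Effective samples per group: GN uses h*w*c/g values, LN-G-Position only c/g.
```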
Comparison to other Normalization. We also conduct experiments training ResNet-NA with other normalization methods, including the original Batch Normalization (BN) (Ioffe & Szegedy, 2015), Layer Normalization (LN) (Ba et al., 2016) and Instance Normalization (IN) (Ulyanov et al., 2016). Besides, we also train ResNet-NA without normalization. We use the same setup described in the previous experiments.

Table 1. Comparison of different normalization methods on CIFAR-10 using ResNet-NA (ResNet without ReLU activation).

| Normalization method | Train Acc (%) | Test Acc (%) |
| --- | --- | --- |
| IN | 10 | 10 |
| BN | 36.0 | 39.3 |
| LN | 59.5 | 62.85 |
| GN | 82.16 | 69.37 |
| LN-G-Position | 99.66 | 86.66 |

We find that ResNet-NA without normalization is very difficult to train and shows random-guess behavior. Similarly, ResNet-NA with IN is also very difficult to train. ResNet-NA with BN can be trained normally; however, the performance of the model is relatively low, with only 39.3% test accuracy. ResNet-NA with LN obtains 62.85% test accuracy, which is significantly better than BN. Furthermore, ResNet-NA with LN-G-Position obtains the best performance, e.g., a test accuracy of 86.66% when using a group number of 16. We attribute this to the strong nonlinearity of LN-G-Position.

5.3.2. LN-G IN TRANSFORMERS

Transformer for Machine Translation. We conduct experiments applying LN-G to the Transformer (Vaswani et al., 2017) (where LN is the default normalization) for machine translation using fairseq-py (Ott et al., 2019). We evaluate on the public IWSLT14 German-to-English (De-En) dataset using BLEU (higher is better). We use the hyperparameters recommended in fairseq-py (Ott et al., 2019) for the Transformer and train for 50 epochs with five random seeds. The baseline LN has a BLEU score of 35.01 ± 0.10; LN-G (replacing all the LNs with LN-G) has a BLEU score of 35.23 ± 0.07.

ViT for Image Classification. We conducted experiments applying LN-G to Tiny-ViT (with the default normalization being LN). We performed classification tests on the test set of the CIFAR-10 dataset, with hyperparameter settings following (Steiner et al., 2021). The classification accuracy on the test dataset was 88.81% for LN and 89.26% for LN-G (replacing all the LNs with LN-G). These preliminary results show the potential of LN-G for neural architecture design in practice.

6. Related Work

Previous theoretical analyses of normalization have mainly focused on BN, the pioneering normalization method in deep learning. One main argument is that BN can improve the conditioning of the optimization problem (Cai et al., 2019), either by avoiding the rank collapse of pre-activation matrices (Daneshmand et al., 2020) or by alleviating the pathological sharpness of the landscape (Santurkar et al., 2018; Karakida et al., 2019; Ghorbani et al., 2019; Lyu et al., 2022). The improved conditioning enables large learning rates (Bjorck et al., 2018), thus improving generalization (Luo et al., 2019). Another argument is that BN is scale invariant (Ba et al., 2016), enabling it to adaptively adjust the learning rate (Arora et al., 2019; Cai et al., 2019; Zhang et al., 2019; Li & Arora, 2020), which stabilizes and further accelerates training. This scale-invariance analysis also applies to LN (Ba et al., 2016; Lubana et al., 2021). Some works address understanding LN empirically through experiments, showing that the learnable parameters in LN increase the risk of over-fitting (Xu et al., 2019). Different from these works, we investigate a new theoretical direction for LN, regarding its nonlinearity and representation capacity. We note that several works (Huang et al., 2021; Labatie et al., 2021) investigate the expressive power of normalization empirically through experiments. However, their experiments are conducted on networks with activation functions, while our work focuses on analyzing the representation capacity of a network without activation functions through theory and experiments.

7. Conclusion

We mathematically demonstrated that LN is a nonlinear transformation. We also theoretically showed the representation capacity of an LN-Net in correctly classifying samples with any label assignment.
We demonstrated these results by carefully designing algorithms that exploit the geometric properties of LN. We hope that our techniques will inspire the community to reconsider the analysis of the representation capacity of networks with normalization layers, although it poses great challenges (Huang et al., 2023).

Limitation and Future Work. Our results on the representation capacity of LN-Nets are currently very loose, much like the initial universal approximation theory for arbitrarily wide shallow neural networks (Hornik et al., 1989). We believe it is interesting to extend our results in the directions in which universal approximation theory has been extended: arbitrary depth (Gripenberg, 2003), bounded depth and bounded width (Maiorov & Pinkus, 1999), and the question of minimal possible width (Park et al., 2020). Besides, the effectiveness of the group mechanism for LN (i.e., LN-G) is only verified on small-scale networks and datasets, and more results on large-scale networks and datasets are required to support the practicality of LN-G.

Acknowledgments

This work was partially supported by the National Science and Technology Major Project under Grant 2022ZD0116310, the National Natural Science Foundation of China (Grant No. 62106012), and the Fundamental Research Funds for the Central Universities.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022.

Arora, S., Li, Z., and Lyu, K. Theoretical analysis of auto rate-tuning by batch normalization. In ICLR, 2019.

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization, 2016.

Bartlett, P., Maiorov, V., and Meir, R. Almost linear vc dimension bounds for piecewise polynomial networks. In NeurIPS, 1998.

Bjorck, N., Gomes, C. P., Selman, B., and Weinberger, K. Q. Understanding batch normalization. In NeurIPS, 2018.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In NeurIPS, 2020.

Cai, Y., Li, Q., and Shen, Z. A quantitative analysis of the effect of batch normalization on gradient descent. In ICML, 2019.

Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. End-to-end object detection with transformers. In ECCV, 2020.

Cheng, B., Misra, I., Schwing, A. G., Kirillov, A., and Girdhar, R. Masked-attention mask transformer for universal image segmentation. In CVPR, 2022.

Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q., and Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. In ACL, 2019.

Daneshmand, H., Kohler, J. M., Bach, F. R., Hofmann, T., and Lucchi, A. Batch normalization provably avoids ranks collapse for randomly initialised deep networks. In NeurIPS, 2020.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding.
In ACL, 2019. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. Fisher, R. A. Statistical methods for research workers. In Breakthroughs in statistics: Methodology and distribution, pp. 66 70. Springer, 1970. Ghorbani, B., Krishnan, S., and Xiao, Y. An investigation into neural net optimization via hessian eigenvalue density. In ICML, 2019. Gripenberg, G. Approximation by neural networks with a bounded number of nodes at each level. Journal of approximation theory, 122(2):260 266, 2003. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016. Hoffer, E., Banner, R., Golan, I., and Soudry, D. Norm matters: efficient and accurate normalization schemes in deep networks. In Neur IPS, 2018. Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359 366, 1989. Huang, L., Zhou, Y., Liu, L., Zhu, F., and Shao, L. Group whitening: Balancing learning efficiency and representational capacity. In CVPR, 2021. Huang, L., Qin, J., Zhou, Y., Zhu, F., Liu, L., and Shao, L. Normalization techniques in training dnns: Methodology, analysis and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023. Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015. Karakida, R., Akaho, S., and Amari, S.-i. The normalization method for alleviating pathological sharpness in wide neural networks. In Neur IPS, 2019. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., Dollar, P., and Girshick, R. Segment anything. In ICCV, 2023. On the Nonlinearity of Layer Normalization Labatie, A., Masters, D., Eaton-Rosen, Z., and Luschi, C. Proxy-normalizing activations to match batch normalization while removing batch dependence. In Neur IPS, 2021. Li, Z. and Arora, S. An exponential learning rate schedule for batch normalized networks. In ICLR, 2020. Lubana, E. S., Dick, R., and Tanaka, H. Beyond batchnorm: towards a unified understanding of normalization in deep learning. In Neur IPS, 2021. Luo, P., Wang, X., Shao, W., and Peng, Z. Towards understanding regularization in batch normalization. In ICLR, 2019. Lyu, K., Li, Z., and Arora, S. Understanding the generalization benefit of normalization layers: Sharpness reduction. In Neur IPS, 2022. Maiorov, V. and Pinkus, A. Lower bounds for approximation by mlp neural networks. Neurocomputing, 25(1-3):81 91, 1999. Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., and Auli, M. fairseq: A fast, extensible toolkit for sequence modeling. In ACL, 2019. Park, S., Yun, C., Lee, J., and Shin, J. Minimum width for universal approximation. ar Xiv preprint ar Xiv:2006.08859, 2020. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al. Improving language understanding by generative pre-training. 2021. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1), jan 2020. Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. How does batch normalization help optimization? In Neur IPS, 2018. 
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., and Beyer, L. How to train your vit? data, augmentation, and regularization in vision transformers. ar Xiv preprint ar Xiv:2106.10270, 2021. Ulyanov, D., Vedaldi, A., and Lempitsky, V. S. Instance normalization: The missing ingredient for fast stylization. ar Xiv preprint ar Xiv:1607.08022, 2016. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I. Attention is all you need. In Neur IPS, 2017. Wu, Y. and He, K. Group normalization. In ECCV, 2018. Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., and Liu, T.-Y. On layer normalization in the transformer architecture. In ICML, 2020. Xu, J., Sun, X., Zhang, Z., Zhao, G., and Lin, J. Understanding and improving layer normalization. In Neur IPS, 2019. Zhang, B. and Sennrich, R. Root mean square layer normalization. In Neur IPS, 2019. Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In ICLR, 2017. Zhang, G., Wang, C., Xu, B., and Grosse, R. B. Three mechanisms of weight decay regularization. In ICLR, 2019. On the Nonlinearity of Layer Normalization A. LSSR as a Linearly Separable Indicator 1.0 0.5 0.0 0.5 1.0 1.5 2.0 1.0 (a) XOR data. 8 6 4 2 0 2 4 6 8 8 (b) Inseparable Gaussian data. 8 6 4 2 0 2 4 6 8 8 (c) Separable Gaussian data. 8 6 4 2 0 2 4 6 8 8 (d) Parallel Gaussian data. Figure A1. We randomly sample 256 points from each class in the four different distributions of data above. The detailed data is shown in Table I. To show how SSR and LSSR evaluate the difficulty of separating the samples from different classes linearly, we give four different distributions of data in the figure above and their details in the table below. Table I. Detailed data of Figure A1. In Figure 1(a), the random variance X takes values 0 and 1 with probabilities 1/2 each. In the other figures, the sign N( , ) denotes the Gaussian distribution. Figure Distribution of X1 Distribution of X2 SSR LSSR Figure 1(a) X1 = X X 0.9963 0.9929 Figure 1(b) X1 N 0 0 0.9929 0.9859 Figure 1(c) X1 N 3 3 0.2304 0.1312 Figure 1(d) X1 N 2 0 0.7536 0.2157 According to Figure A1 and Table I, we have several conclusions below. In Figure 1(a) and Figure 1(b), the classes are hard to be linearly separated, whose SSR and LSSR are both near 1. In Figure 1(c), the classes are easy to be linearly separated, whose SSR and LSSR are both near 0. However, in Figure 1(d), the classes are easy to be linearly separated, but harder to be separated if focused on the Euclidean distance. As a result, its SSR is larger, but its LSSR is near 0. We hence conclude that LSSR is a better indicator than SSR in judging how two classes are linearly separable. B. Proofs of Proposition 2 Proposition 2. Given X1, X2 Rd m, we denote M = 2P i=1 (xci xc)(xci xc) , and N = 2P i=1 (xci x)(xci x) , where x = ( x1 + x2)/2. Supposing that N is reversible, we have LSSR(X1, X2) = λ , (23) and correspondingly, LSSR(X1, X2) = SSR((u ) X1, (u ) X2), (24) where λ and u are the minimum eigenvalue and corresponding eigenvector of N 1M. Since the definition of LSSR comes from a lower bound, we prove Proposition 2 from solving the optimization problem as follows. min φ LSSR(X1, X2) = SSR(φ(X1), φ(X2)), s.t. φ(x) = W x + b, W Rn d, b Rn, n N , SS(φ(X1), φ(X2)) = 0. To solve this, we first propose four lemmas, and then use them to prove Proposition 2. 
Furthermore, we give the optimal W as a corollary. On the Nonlinearity of Layer Normalization B.1. Required Lemmas for the Proof Lemma 2. The bias b Rn does not affect SSR, as well as LSSR. Proof. By the definition of SS, we obtain SS(φ(X1)) = SS(W X1 + b1 ) W x1i + b 1 i=1 (W x1i + b) W x1i W x1 2 = SS(W X1). Similarly, we have SS(φ(X2)) = SS(W X2), (27) and SS([φ(X1), φ(X2)]) = SS([W X1, W X2]). (28) Since SSR is defined with SS, the conclusion also holds for SSR, namely SSR(φ(X1), φ(X2)) = SSR(W X1, W X2), (29) where the bias b is not included. Lemma 3. Suppose the eigenvalue decomposition of W W as W W = UΛU = i=1 λiuiu i , (30) where U = [u1, , ud] is an orthogonal matrix, and Λ = diag{λ1, , λd} is a positive semi-definite and diagonal matrix. We consider to minimize SSR(W X1, W X2) over Λ with a fixed U, as: min Λ SSR(W X1, W X2) = min 1 j d SSR(u j X1, u j X2). (31) The optimal solution is that 0, j arg min 1 j d SSR(u j X1, u j X2), = 0, otherwise, (32) for j = 1, , d, and λ1, , λd are not all zeros. Proof. By Lemma 2, we find that LSSR(X1, X2) = min W SSR(W X1, W X2). (33) Besides, we figure out that i=1 W xci W xc 2 2 = i=1 (xci xc) W W (xci xc), (34) where Xc = X1, X2, or even11 [X1, X2]. 11In this case, we choose x as xc in Eqn.34 On the Nonlinearity of Layer Normalization Based on the eigenvalue decomposition, we obtain that i=1 (x1i x1) W W (x1i x1) j=1 λj(x1i x1) uju j (x1i x1) i=1 (u j x1i u j x1)2 j=1 λj SS(u j X1). The term SS(u j X1) can be regarded as that we put a linear transformation u j on X1, and then calculate its SS. Similarly, we have that j=1 λj SS(u j X2), (36) SS([W X1, W X2]) = j=1 λj SS([u j X1, u j X2]). (37) Therefore, we obtain SSR(W X1, W X2) = j=1 λj[SS(u j X1) + SS(u j X2)] j=1 λj SS([u j X1, u j X2]) . (38) By the definition of M and N in Proposition 2, we obtain that SS(u j X1) + SS(u j X2) = i=1 [(u j x1i u j x1)2 + (u j x2i u j x2)2] i=1 u j [(x1i x1)(x1i x1) + (x2i x2)(x2i x2) ]uj and similarly, we have SS([u j X1, u j X2]) = u j Nuj. (40) By the hypothesis in Definition 2, we figure out that λj(j = 1, , d) are not all zeros, otherwise SS([W X1, W X2]) = 0. Besides, by the hypothesis in Proposition 2, N is reversible. We point out that N is also a positive semi-definite matrix. Furthermore, N is a positive definite matrix. When uj = 0, we find that SS([u j X1, u j X2]) = u j Nuj > 0. (41) On the Nonlinearity of Layer Normalization Let ηj = λj SS([u j X1, u j X2]), we thus have η1 + + ηd > 0. We obtain SSR(W X1, W X2) = 1 η1 + + ηd ηj[SS(u j X1) + SS(u j X2)] SS([u j X1, u j X2]) ηj η1 + + ηd SSR(u j X1, u j X2) ηj η1 + + ηd min 1 k d SSR(u k X1, u k X2) = min 1 k d SSR(u k X1, u k X2) = min 1 j d SSR(u j X1, u j X2). We figure out that the equation holds, if and only if 0, j arg min 1 j d SSR(u j X1, u j X2); = 0, otherwise. (43) Here, j = 1, , d, and η1, , ηd are not all zeros. Since λj = ηj/SS([u j X1, u j X2]) and SS([u j X1, u j X2]) > 0, we thus have 0, j arg min 1 j d SSR(u j X1, u j X2), = 0, otherwise, (44) holds for j = 1, , d, and λ1, , λd are not all zeros. Lemma 4. Given Dv = {v : v Nv = 1} and Du = {u : u u = 1} and the map ψ : Dv Du, where u = ψ(v) = v/(v v) 1 2 , we have that ψ is a bijection. Proof. For N is a positive definite matrix , and v Nv = 1, we have v = 0. Given u = ψ(v), we obtain u u = v v/(v v) = 1, (45) for each v in Dv. Therefore, ψ is a reflection from Dv to Du. Besides, we find that u Nu = v Nv/(v v) = 1/(v v). (46) By the definition of ψ, we hence have u/(u Nu) 1 2 = ψ(v)(v v) 1 2 = v. 
(47) Therefore, for each u, we obtain that v = ψ 1(u) = u/(u Nu) 1 2 , (48) namely we find ψ 1 as the inverse mapping of ψ. As a result, ψ is a bijection. Lemma 5. Let the optimization problem be min v f(v) = v Mv s.t. v Nv = 1, (49) where M and N are defined in Proposition 2. We have that the optimal value is the minimal eigenvalue of N 1M, namely λ . And the optimal solution is the eigenvector which belongs to λ . On the Nonlinearity of Layer Normalization Proof. To get the minimum, we use the Lagrange multiplier method: L(v, α) = v Mv α(v Nv 1). (50) We figure out that the KKT conditions are L v = 2Mv 2αNv = 0, v Nv 1 = 0. (51) We hence have N 1Mv = αv, namely α is an eigenvalue of N 1M, and v is the corresponding eigenvector. Based above, we find that v Mv = v NN 1Mv = v N(αv) = α. (52) Furthermore, the minimum of L(v, α) is the minimum α, namely the minimum eigenvalue of N 1M. We hence have LSSR(X1, X2) = λmin(N 1M) = λ , (53) and the optimal solution is the eigenvector which belongs to λ . B.2. Proof of Proposition 2 Based on the four lemmas above, now we give the proof of Proposition 2. Proof. By Lemma 2 and Lemma 3, given W W = UΛU , we have LSSR(X1, X2) = min W SSR(W X1, W X2) = min U,Λ SSR(W X1, W X2) = min U min Λ SSR(W X1, W X2) = min U min 1 j d SSR(u j X1, u j X2). According to Eqn.39 and Eqn.40, we define the function f(u) = SSR(u X1, u X2), namely f(u) = u Mu u Nu . (55) By Eqn.54, there is some U, and j = arg min 1 j d SSR(u j X1, u j X2), such that LSSR(X1, X2) = min 1 j d SSR(u j X1, u j X2) = f(uj ). (56) Consider the optimization problem min u f(u) = u Mu s.t. u u = 1. (57) We denote one of the optimal solutions as u. Obviously, uj is one of the feasible solutions above, we thus have LSSR(X1, X2) = f(uj ) f( u). (58) We remind that λ and u are the minimum eigenvalue and corresponding eigenvector of N 1M. On one hand, let u0 = u / u 2 and v0 = ψ 1(u0) (ψ is defined the same as that in Lemma 4), namely v0 = u0/(u 0 Nu0) 1 2 . We first On the Nonlinearity of Layer Normalization point out that for k = 0, we have f(ku) = (ku) M(ku) u Nu = f(u). Since 1/||u ||2 = 0 and 1/(u 0 Nu0) 1 2 = 0, we obtain f(v0) = f(u0) = f(u ) = (u ) M(u ) = (u ) NN 1M(u ) = (u ) N(λ u ) (u ) N(u ) = λ , where λ is the minimal eigenvalue of N 1M, as shown in Proposition 2. Therefore, by Lemma 5, v0 is the optimal solution of (Pv). Furthermore, by Lemma 4, since ψ is a bijection between Dv and Du and f(ψ(v)) = f(v), we have that u0 = ψ(v0) is also the optimal solution of (Pu). We hence have f(u ) = f(u0) = f( u). By Eqn.58, we obtain LSSR(X1, X2) f( u) = f(u ) = SSR((u ) X1, (u ) X2). (61) On the other hand, the definition of LSSR denotes the lower bound of SSR, we hence have LSSR(X1, X2) SSR((u ) X1, (u ) X2). (62) By Eqn.60, Eqn.61 and Eqn.62, we obtain LSSR(X1, X2) = SSR((u ) X1, (u ) X2) = λ . (63) B.3. Corollaries of Proposition 2 Suppose λ is the minimal eigenvalue of N 1M, and u is its unique linearly independent eigenvector, we give the result in Corollary 4. If λ has more than one linearly independent eigenvectors, we give the result in Corollary 5. Corollary 4. Suppose λ is the minimal eigenvalue of N 1M, and u is its unique linearly independent eigenvector, we have that the optimal W satisfies that W W = Cu (u ) , C > 0. (64) Proof. By Lemma 2, we can only consider the eigenvalues and eigenvectors of W W . By Eqn.30, uj will affect W W only when the corresponding eigenvalue λj = 0. Furthermore, by Lemma 3, when λj = 0, we figure out that j arg min 1 j d SSR(u j X1, u j X2). 
(65) Since uj is a unit vector, it must be one of the optimal solutions of (Pu). We hence have f(uj) = λ . By Eqn.59 and Lemma 4, we have f(ψ 1(uj)) = f(uj) = λ , (66) On the Nonlinearity of Layer Normalization and ψ 1(uj) is one of the optimal solutions of (Pv). By Lemma 5, we have that ψ 1(uj) must satisfy the KKT conditions in Eqn.51, namely N 1Mψ 1(uj) = λ ψ 1(uj). (67) Therefore, ψ 1(uj) is the minimal eigenvector of M 1N. For ψ 1(uj) = uj/(u j Nuj) 1 2 , we obtain uj is also the eigenvector of λ . For u is the unique linearly independent eigenvector of λ , we figure out that uj = αju , αj = 0. (68) To be reminded, Eqn.68 only holds when λj = 0. Therefore, we have j=1 λjuju j λj =0 λjα2 ju (u ) = Cu (u ) , where C = P λj =0 λjα2 j. By Lemma 3, we have λ1, , λj 0 are not all zeros. Besides, we figure out that αj can be any non-zero real number. We thus have C can be any non-zero real number, to demonstrate W W . Corollary 5. Suppose that the minimal eigenvalue of N 1M, namely λ , has k linearly independent eigenvectors v1, v2, , vk. We denote V = [v1, , vk], then we have that the optimal W satisfies that W W = V CV , (70) where C is a k-order semi-positive definite and non-zero matrix. Proof. Suppose λj is an eigenvalue of W W , and its eigenvector is uj. We can identify that j arg min 1 j d SSR(u j X1, u j X2), if λj = 0. Similarly to the proof of Corollary 4, uj must be an eigenvector of N 1M, and the corresponding eigenvalue is λ . Accordingly, uj is a linear combination of all the linearly independent eigenvectors of λ , namely uj = αj1v1 + αj2v2 + + αjkvk = V αj (71) where αj = [αj1, , αjk] , and αj = 0. We remind that Eqn.71 only holds when λj = 0. We thus have j=1 λjuju j λj>0 λjuju j λj>0 λj(V αj)(V αj) λj>0 λjαjα j where C = P λj>0 λjαjα j . By Lemma 3, we have λ1, , λj 0 are not all zeros. Besides, we figure out that αj can be any non-zero vector. We thus have C is any k-order semi-positive definite and non-zero matrix, to demonstrate W W . On the Nonlinearity of Layer Normalization C. Proof Related to Breaking LSSR In this section, we prove Theorem 1 from the perspective of Taylor s expansion. We have defined f SSR(t) as ( LSSR(X1, X2), t = 0, SSR( ψ(t; X1), ψ(t; X2)), t = 0, (73) where ψ(t; xci) = 1 φ(t; xci)/ φ(t; xci) 2, φ(t; xci) = [(u ) xcit, 1] and t R. We remind Theorem 1 as below. Theorem 1. Let ψ = φ1 LN( ) φ2, performing over the input X1, X2 Rd m. If f SSR(0) = 0 , we can always find suitable linear functions φ1 and φ2, such that SSR(ψ(X1), ψ(X2)) < LSSR(X1, X2). (74) We prove Lemma 1 and show three extra lemmas before the formal proof of Theorem 1. C.1. Proof of Lemma 1 Lemma 1. Denote LN( ) as the LN operation in Rd(d 3), and SP( ) as the SP operation12 in Rd 1. We can find some linear transformations ˆφ1 and ˆφ2, such that SP( ) = ˆφ1 LN( ) ˆφ2. (75) We denote that SP( ) is defined on Rd 1, as SP(x) = x/ x 2. (76) While LN( ) is defined on Rd, where d11 x)/ x 1 d11 x 2. (77) Before the proof of Lemma 1, we propose Lemma 6 as follows. Lemma 6. There is some orthogonal matrix Q Rd d, such that z = Q x 0 {z Rd : z(1) + + z(d) = 0} (namely z is centralized), for x Rd 1, Proof. Suppose Q = {qij}d d = [q1, q2, ..., qd]. We take qd = 1 d1 specially, and q1, , qd 1 can be calculated by Schmidt orthogonalization. Given x = [x(1), , x(d 1)] Rd 1, we have = [z(1), , z(d)] . (78) Since Q is an orthogonal matrix, we have k=1 qki = 0, (79) 12If there are no special instructions, we denote SP projects the sample on to the unit circle, namely x 7 x/ x 2. 
On the Nonlinearity of Layer Normalization for i = 1, , d 1. Furthermore, we obtain that i=1 qkix(i) ! i=1 qkix(i) ! which shows that z {z Rd : z(1) + + z(d) = 0}, namely z is centralized. Now we can design ˆφ1 and ˆφ2 based on Q in Lemma 6, and then prove Lemma 1. Proof. Based above, we obtain z 2 = Q x 0 2 = x 2. (81) By Lemma 6, we have 1 z = 0, and z = z 1 d11 z. We hence find that d11 z)/ z 1 d z/ z 2. (82) Let Id denotes the identity matrix in Rd d. We thus have Id 1 0 Q LN(Q Id 1 0 x) = 1 Id 1 0 Q LN(z) Id 1 0 Q Q x 0 =x/ x 2 =SP(x). Let ˆφ1(x) = Q Id 1 0 x, and ˆφ2(x) = 1 d Id 1 0 Q x. We observe that SP( ) = ˆφ1 LN( ) ˆφ2. (84) C.2. Extra Lemmas for the Proof Let xci = (u ) xci, (i = 1, , m; c = 1, 2). We define the mean xc, the variance σ2 c and the third-order central moment (xc xc)3 with the equations below: i=1 (xci xc)2, (xc xc)3 = 1 i=1 (xci xc)3. On the Nonlinearity of Layer Normalization Based on xci = (u ) xci, we design an linear transformation φ(t; ) : R R2, where t R is a parameter: φ(t; xci) = t 1 0 Lemma 7. Let ˆ X = SP(φ(t; (u ) X)), and v = [1, 1] . Besides, we define three statistics about (u ) X: T1 = ( x1 x2)2[(x1 x1)3 + (x2 x2)3], T2 = ( x1 x2)(σ2 1 σ2 2)[( x1 x2)2 (σ2 1 + σ2 2)], T3 = [2σ2 1 + 2σ2 2 + ( x1 x2)2]2. (87) We figure out that when t 0, we have SSR(v ˆ X1, v ˆ X2) = LSSR(X1, X2) 2(T1 + T2) T3 t + o(t). (88) Proof. Denote ˆxci = SP(φ(t; xci)) = [ˆx(1) ci , ˆx(2) ci ] . By Newton s binomial expansion, we obtain that 1 φ(t; xci) 2 = 1 p 1 + (xcit)2 =(1 + x2 cit2) 1 2(x2 cit2) + 3 8(x2 cit2)2 + o((t2)2) 2x2 cit2 + o(t3). We thus have ˆx(1) ci = xcit p 1 + (xcit)2 = xcit 1 2x3 cit3 + o(t3), (90) and ˆx(2) ci = 1 p 1 + (xcit)2 = 1 1 2x2 cit2 + o(t3). (91) Let v = [1, 1] . Then we have v ˆxci = 1 + xcit 1 2x3 cit3 + o(t3). (92) We denote that a0 = 1, a1 = 1, a2 = 1 2 and a3 = 1 2, therefore s=0 asxs cits + o(t3). (93) We hence obtain that SS(v ˆ Xc) = i=1 (v ˆxci v ˆxc)2 j=1 v ˆxci(v ˆxci v ˆxcj) r=0 arxr citr + o(t3) s=0 as(xs ci xs cj)ts + o(t3) For s + r > 3, we put the multiplicative term into o(t3). Accordingly, we only consider the case s + r 3. On the Nonlinearity of Layer Normalization For r = 0, 1, 2, 3; s = 0, we have xr ci(xs ci xs cj) = 0. (95) For r = 0; s = 1, 2, 3, we have j=1 xr citr (xs ci xs cj)ts = j=1 (xs ci xs cj)ts = 0. (96) For r = 1, s = 1, we have m X j=1 xci(xci xcj) = m i=1 (xci xc) = mσ2 c. (97) For r = 2, s = 1, we observe j=1 x2 ci(xci xcj) = m i=1 x2 ci(xci xc) i=1 [(x2 ci 2xci xc + x2 c)(xci xc) + 2xci xc(xci xc) x2 c(xci xc)] = m2[(xc xc)3 + 2 xcσ2 c]. And for r = 1, s = 2, we obtain j=1 xci(x2 ci x2 cj) = j=1 x3 ci xcix2 cj j=1 x3 ci x2 cixcj j=1 x2 ci(xci xcj) = m2[(xc xc)3 + 2 xcσ2 c]. Therefore, we have that SS(v ˆ Xc) = 1 a2 1xci(xci xcj)t2 + a1a2x2 ci(xci xcj)t3 + a1a2xci(x2 ci x2 cj)t3 + o(t3) = ma2 1σ2 ct2 + 2ma1a2[(xc xci)3 + 2 xcσ2 c]t3 + o(t3) = mσ2 ct2 m[(xc xci)3 + 2 xcσ2 c]t3 + o(t3) = βc2t2 + βc3t3 + o(t3), where βc2 = mσ2 c, and βc3 = m[(xc xci)3 + 2 xcσ2 c]. To simplify the calculation, we define SSD(X1, X2) = SS([X1, X2]) SS(X1) SS(X2) i=1 (xci x) (xci x) i=1 (xci xc) (xci xc) i=1 (x cixci x x) i=1 (x cixci x c xc) = m x 1 x1 + m x 2 x2 2m x x = m x 1 x1 + m x 2 x2 m 2 ( x1 + x2) ( x1 + x2) 2 x1 x2 2 2. 
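As a quick sanity check on the SSD identity just derived, namely SS([X1, X2]) − SS(X1) − SS(X2) = (m/2)‖x̄1 − x̄2‖², the short numpy sketch below verifies it numerically. This is our own illustration, not the authors' code; it assumes SS(X) denotes the total squared deviation of the columns of X from their column mean, as in the derivation above, with m samples per class.

```python
import numpy as np

def ss(X):
    """Sum of squares: total squared deviation of the columns of X from their mean."""
    xbar = X.mean(axis=1, keepdims=True)
    return np.sum((X - xbar) ** 2)

rng = np.random.default_rng(0)
d, m = 5, 7                                  # dimension and samples per class (hypothetical values)
X1 = rng.normal(size=(d, m))
X2 = rng.normal(size=(d, m))

ssd = ss(np.hstack([X1, X2])) - ss(X1) - ss(X2)
closed_form = 0.5 * m * np.sum((X1.mean(axis=1) - X2.mean(axis=1)) ** 2)
print(np.allclose(ssd, closed_form))         # True
```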
On the Nonlinearity of Layer Normalization Similar to Eqn.100, we obtain SSD(v ˆ X1, v ˆ X2) =m 2 (v ˆx1 v ˆx2)2 2 [a1( x1 x2)t + a2(x2 1 x2 2)t2 + o(t2)]2 2 [a2 1( x1 x2)2t2 + 2a1a2( x1 x2)(x2 1 x2 2)t3 + o(t3)] 2 ( x1 x2)2t2 m 2 ( x1 x2)(x2 1 x2 2)t3 + o(t3) =β2t2 + β3t3 + o(t3), where β2 = m 2 ( x1 x2)2, and β3 = m 2 ( x1 x2)(x2 1 x2 2). We thus have SSR(v ˆ X1, v ˆ X2) = (β12 + β22)t2 + (β13 + β23)t3 + o(t3) (β12 + β22 + β2)t2 + (β13 + β23 + β3)t3 + o(t3) = (β12 + β22)t2 + (β13 + β23)t3 + o(t3) (β12 + β22 + β2)t2 h 1 + β13+β23+β3 β12+β22+β2 t + o(t) i = β12 + β22 β12 + β22 + β2 + β13 + β23 β12 + β22 + β2 t + o(t) 1 β13 + β23 + β3 β12 + β22 + β2 t + o(t) = β12 + β22 β12 + β22 + β2 + (β12 + β22 + β2)(β13 + β23) (β12 + β22)(β13 + β23 + β3) (β12 + β22 + β2)2 t + o(t) = β12 + β22 β12 + β22 + β2 + β2(β13 + β23) β3(β12 + β22) (β12 + β22 + β2)2 t + o(t). (103) We find that β12 + β22 β12 + β22 + β2 = SSR((u ) X1, (u ) X2) = LSSR(X1, X2). (104) On the other hand, we have β2(β13 + β23) β3(β12 + β22) 2m2( x1 x2)2[(x1 x1)3 + 2 x1σ2 1 + (x2 x2)3 + 2 x2σ2 2] + 1 2m2( x1 x2)(x2 1 x2 2)(σ2 1 + σ2 2) 2m2( x1 x2)2[(x1 x1)3 + (x2 x2)3] 2m2( x1 x2)2(2 x1σ2 1 + 2 x2σ2 2) + 1 2m2( x1 x2)(x2 1 x2 2)(σ2 1 + σ2 2). We figure out that ( x1 x2)2(2 x1σ2 1 + 2 x2σ2 2) ( x1 x2)(x2 1 x2 2)(σ2 1 + σ2 2) =( x1 x2)[( x1 x2)(2 x1σ2 1 + 2 x2σ2 2) ( x2 1 + σ2 1 x2 2 σ2 2)(σ2 1 + σ2 2)] =( x1 x2)[2 x1( x1 x2)σ2 1 + 2 x2( x1 x2)σ2 2 ( x2 1 x2 2)σ2 1 ( x2 1 x2 2)σ2 2 (σ2 1 σ2 2)(σ2 1 + σ2 2)] =( x1 x2)[( x1 x2)2σ2 1 ( x1 x2)2σ2 2 (σ2 1 σ2 2)(σ2 1 + σ2 2)] =( x1 x2)(σ2 1 σ2 2)[( x1 x2)2 (σ2 1 + σ2 2)]. By the definition of T1, T2 and T3, we thus obtain β2(β13 + β23) β3(β12 + β22) 2m2( x1 x2)2[(x1 x1)3 + (x2 x2)3] 1 2m2( x1 x2)(σ2 1 σ2 2)[( x1 x2)2 (σ2 1 + σ2 2)] On the Nonlinearity of Layer Normalization Moreover, we have (β12 + β22 + β2)2 = [mσ2 1 + mσ2 2 + m 2 ( x1 x2)2]2 4m2[2σ2 1 + 2σ2 2 + ( x1 x2)2]2 As a result, we obtain that SSR(v ˆ X1, v ˆ X2) = LSSR(X1, X2) 2(T1 + T2) T3 t + o(t). (109) Lemma 8. For ( LSSR(X1, X2), t = 0, SSR( ψ(t; X1), ψ(t; X2)), t = 0, (110) where ψ(t; xci) = 1 φ(t; xci)/ φ(t; xci) 2, φ(t; xci) = [(u ) xcit, 1] and t R, we have that f SSR(t) is derivable around t = 0, and f SSR(0) is only decided by X1 and X2. Proof. It is easy to identify that ψ(t; Xi) = v ˆ Xi. Therefore, by Lemma 7. We have SSR( ψ(t; X1), ψ(t; X2)) = SSR(v ˆ X1, v ˆ X2) = LSSR(X1, X2) 2(T1 + T2) T3 t + o(t). (111) We hence obtain f SSR(0) = lim t 0 f SSR(t) f SSR(0) = lim t 0 SSR( ψ(t; X1), ψ(t; X2)) LSSR(X1, X2) T3 t + o(t) = 2(T1 + T2) Conclusively, we have that f SSR(t) is derivable at t = 0. Lemma 9. Given a differentiable function f(x), with f (0) = 0, we figure out that there is some x , such that f(x ) < f(0). Proof. Given that f(0) = A and f (0) = B = 0, by the definition of derivative, we have lim h 0 h = B. That is to say, ε > 0, there exists a positive δ > 0, whenever 0 < |x| < δ, we have f(x) A x B ε, (113) then ε|x| f(x) A Bx ε|x|. (114) Let x = |B|δ 2B , and ε = |B| 2 . We have f(x ) A + Bx + ε|x| = A |B|δ 4 < A, (115) namely f(x ) < f(0). On the Nonlinearity of Layer Normalization C.3. Proof of Theorem 1 Since f SSR(0) = 2(T1 + T2)/T3 = 0, by Lemma 9, there is some t = t , such that SSR( ψ(t ; X1), ψ(t ; X2)) < LSSR(X1, X2). (116) We denote φ1(x) = φ(t ; u x) and φ2(x) = v x. By Lemma 1, we have SP( ) = ˆφ1 LN( ) ˆφ2. We hence have v ˆ Xc = φ2( ˆφ2(LN( ˆφ1( φ1(Xc)))). (117) Let ψ = φ1 LN( ) φ2, where φ1 = φ1 ˆφ1 and φ2 = ˆφ2 φ2. We thus have SSR(ψ(X1), ψ(X2)) < LSSR(X1, X2). 
(118) Obviously, φ1 and φ2 are linear functions. Consequently, we have proved Theorem 1. C.4. A Generalized Proof of Theorem 1 To begin with, we also need to project Xc to u Xc. This can reach LSSR(X1, X2), which is necessary in our discussion. More generally, we design a n-dimensional linear transformation φn(t; ) : R Rn, instead of a 2-dimensional one in Eqn.86. Specifically, we denote φn(t; x) = t wx + b = w1xt + b1 wnxt + bn Considering SP on φn(t; x) with scaling=1, we denote ˆx = SP(φn(t; x)) = φn(t; x)/ φn(t; x) . (120) Owing to the introduce of t, let t and w represent the direction and length of weight respectively. We thus add the constraint w 2 = 1 for convenience. As for the bias, if b = 0, ˆx = w/( w 2) will result in SS(ψ( ˆ X1), ψ( ˆ X2)) = 0. Therefore, we require that b = 0. Now we are concerned about LSSR( ˆ X1, ˆ X2). Factually, by Proposition 2, we need not consider all the linear functions on Rn to get LSSR. We figure out that there must be some v Rn, such that LSSR( ˆ X1, ˆ X2) = SSR(v ˆ X, v ˆY ). (121) We give the Taylor s expansion of ˆx on its each dimension. Let ξ1 = w b and ξ2 = b b. We figure out that 1 wxt + b 2 =(1 + 2ξ1xt + ξ2x2t2) 1 2(2ξ1xt + ξ2x2t2) + 3 8(2ξ1xt + ξ2x2t2)2 5 16(2ξ1xt + ξ2x2t2)3 + o(t3) =1 ξ1xt + (3 2ξ2)x2t2 + (3 2ξ3 1)x3t3 + o(t3). We further obtain ˆx(k) = wkxt + bk =(bk + wkxt)[1 ξ1xt + (3 2ξ2)x2t2 + (3 2ξ3 1)x3t3 + o(t3)] =bk + (wk ξ1bk)xt + [(3 2ξ2)bk ξ1wk]x2t2 + [(3 2ξ3 1)bk + (3 2ξ2)wk]x3t3 + o(t3). (123) Similarly, to simplify our calculation, we denote ˆx(k) ci = a(k) 0 + a(k) 1 xcit + a(k) 2 x2 cit2 + a(k) 3 x3 cit3 + o(t3) (124) On the Nonlinearity of Layer Normalization where a(k) 0 = bk, a(k) 1 = wk ξ1bk, a(k) 2 = ( 3 2ξ2)bk ξ1wk and a(k) 3 = ( 3 2ξ3 1)bk + ( 3 Let v = [v1, , vn] . We have SS(v ˆ Xc) = i=1 (v ˆxci v ˆxc)2 j=1 v ˆxci(v ˆxci v ˆxcj) k=1 vkˆx(k) ci l=1 vl[ˆx(l) ci ˆx(l) cj ] j=1 ˆx(k) ci [ˆx(l) ci ˆx(l) cj ] Similar to calculation when we discuss the 2-dimensional case, we have j=1 ˆx(k) ci [ˆx(l) ci ˆx(l) cj ] s=0 a(k) s xs cits + o(t3) s=0 a(l) s (xs ci xs cj)ts + o(t3) h a(k) 1 a(l) 1 xci(xci xcj)t2 + a(k) 2 a(l) 1 x2 ci(xci xcj)t3 + a(k) 1 a(l) 2 xci(x2 ci x2 cj)t3 + o(t3) i =ma(k) 1 a(l) 1 σ2 ct2 + m[a(k) 1 a(l) 2 + a(k) 2 a(l) 1 ][(xc xci)3 + 2 xcσ2 c]t3 + o(t3). l=1 vkvla(k) 1 a(l) 1 , (127) l=1 vkvla(k) 1 a(l) 2 = l=1 vkvla(k) 2 a(l) 1 . (128) We thus have that SS(v ˆ Xc) = j=1 ˆx(k) ci [ˆx(l) ci ˆx(l) cj ] l=1 vkvl ma(k) 1 a(l) 1 σ2 ct2 + m[a(k) 1 a(l) 2 + a(k) 2 a(l) 1 ][(xc xci)3 + 2 xcσ2 c]t3 + o(t3) = mθ1σ2 ct2 + 2mθ2[(xc xci)3 + 2 xcσ2 c]t3 + o(t3) = βc2t2 + βc3t3 + o(t3), where βc2 = mθ1σ2 c and βc3 = 2mθ2[(xc xci)3 + 2 xcσ2 c]. On the Nonlinearity of Layer Normalization On the other hand, we obtain SSD(v ˆ X1, v ˆ X2) =m 2 (v ˆx1 v ˆx2)2 i=1 [v ˆx1i v ˆx2i] k=1 vk(ˆx(k) 1i ˆx(k) 2i ) k=1 vk[a(k) 1 (x1i x2i)t + a(k) 2 (x2 1i x2 2i)t2 + o(t2)] k=1 vk[a(k) 1 ( x1 x2)t + a(k) 2 (x2 1 x2 2)t2 + o(t2)] l=1 vkvl[a(k) 1 a(l) 1 ( x1 x2)2t2 + [a(k) 1 a(l) 2 + a(k) 2 a(l) 1 ]( x1 x2)(x2 1 x2 2)t3 + o(t3)] 2 θ1( x1 x2)2t2 + mθ2( x1 x2)(x2 1 x2 2)t3 + o(t3) =β2t2 + β3t3 + o(t3), (130) where β2 = m 2 θ1( x1 x2)2 and β3 = mθ2( x1 x2)(x2 1 x2 2). 
Therefore, we figure out that β2(β13 + β23) β3(β12 + β22) =m2θ1θ2( x1 x2)2[(x1 x1)3 + 2 x1σ2 1 + (x2 x2)3 + 2 x2σ2 2] m2θ1θ2( x1 x2)(x2 1 x2 2)(σ2 1 + σ2 2) =m2θ1θ2( x1 x2)2[(x1 x1)3 + (x2 x2)3] + m2θ1θ2( x1 x2)(σ2 1 σ2 2)[( x1 x2)2 (σ2 1 + σ2 2)] =m2θ1θ2T1 + m2θ1θ2T2, and (β12 + β22 + β2)2 = h mθ1σ2 1 + mθ1σ2 2 + m 2 θ1( x1 x2)2i2 4m2θ2 1[2σ2 1 + 2σ2 2 + ( x1 x2)2]2 We hence obtain SSR(v ˆ X1, v ˆ X2) = LSSR(X1, X2) + 4θ2 T3 t + o(t). (133) Similarly, f SSR(0) = 0 means T1 + T2 = 0, we can find some t and some w, b, v, such that θ2 = 0, and SSR(v ˆ X1, v ˆ X2) < LSSR(X1, X2). We figure out that in the n-dimensional case, f SSR(0) = 0 is also required in our proof. Hereafter, the remaining proof is nearly the same as the 2-dimensional version. Take our 2-dimensional φ(t; ) as an example, w = [1, 0] , b = [0, 1] , v = [1, 1] . We thus have s1 = 0, s2 = 1. Furthermore, we obtain a(1) 1 = 1, a(2) 1 = 0, a(1) 2 = 0, a(2) 2 = 1 2 and then θ1 = 1, θ2 = 1 2. As a result, we have 4θ2/θ1 = 2, which is the same as Eqn.109. D. Proofs Related to the Algorithms D.1. Proof of Proposition 4 Proposition 4. For any input X(0), we can find some u, such that φ(1) 1 : x(0) k 7 p(1) k = [u x(0) k , 0] , (134) On the Nonlinearity of Layer Normalization where p(1) i = p(1) j if x(0) i = x(0) j . Proof. In reverse, we consider to find all the u, such that some two different points are coincident after the projection. Given two points x(0) i = x(0) j in X(0), if u project them into the same point, we have u x(0) i = u x(0) j . (135) We use S2(x(0) i , x(0) j ) to denote the whole solution space of Eqn.135, namely S2(x(0) i , x(0) j ) = {u Rd : u (x(0) i x(0) j ) = 0}. (136) Considering all the pairs of different points, we define ˆS2(X(0)) = [ x(0) i =x(0) j S2(x(0) i , x(0) j ) (137) Since x(0) i = x(0) j , we find each solution space S2(x(0) i , x(0) j ) is (d 1) dimensional, and the number of such sets is no more than m2. Therefore, the union of these solution spaces13 is still smaller than Rd, namely ˆS2(X(0)) Rd. We obtain that x(0) i = x(0) j , u x(0) i = u x(0) j , if and only if u ˆS2(X(0)). Since ˆS2(X(0)) Rd, we obtain Rd/ˆS2(X(0)) = . (138) Therefore, we can always find a u Rd, such that we have p(1) i = p(1) j , for any x(0) i = x(0) j . D.2. Proof of Proposition 5 Proposition 5. For each layer, φ(1) l (2 l L) only merges points with the same label. Nevertheless, φ(1) 1 , SP( ) and φ(2) l (1 l L 1) do not merge any points. Proof. 1) According to Proposition 4, we figure out that φ(1) 1 does not merge any points. h(l) 1 h(l) 4 h(l) 2 h(l) 3 h(l) 5 x(l) 2 x(l) 3 Figure A2. A copied figure from Figure 2(b). 2) Furthermore, we analyze SP( ). Focused on γ(l) k (there is an example of γ(l) 2 copied from Figure 2(b)), we figure out that γ(l) k = arctan 2p(l) k p(l) i p(l) j p(l) j p(l) i , (139) where p(l) i , p(l) j is defined in Algorithm 1. We can obtain that γ(l) k is monotonically decreasing with p(l) k . 13The union of finite subspaces of d 1 dimensional can not cover the whole space Rd. On the Nonlinearity of Layer Normalization When h(l) ki = h(l) kj , we have γ(l) ki = γ(l) kj , namely x(l) ki = x(l) kj . In other words, SP( ) does not merge any points. 3) We then consider φ(2) l (1 l L 1). Obviously, φ(2) l is a translation transformation, which does not merge any points. 4) Finally, we consider φ(1) l (2 l L 1). We first have p(l+1) k = 0 1 φ(1) l (x(l) k ) = sin γ(l) k . 
(140) Accordingly, p(l+1) k is also monotonically decreasing with p(l) k when γ(l) k π Given two points from different classes denoted as p(l) k1 , p(l) k2 , we discuss them under three cases. Case 1: If p(l) k1 , p(l) k2 > p(l) j , we have γ(l) k1 , γ(l) k2 < π 4 . Therefore, p(l+1) k is monotonically decreasing with p(l) k . We have p(l) k1 = p(l) k2 p(l+1) k1 = p(l+1) k2 . (141) Case 2: If one of p(l) k1 , p(l) k2 is less than p(l) j and the other is not, then one of p(l+1) k1 , p(l+1) k2 is lager than 2 2 , while the other is not. We hence have p(l+1) k1 = p(l+1) k2 Case 3: If p(l) k1 , p(l) k2 are both less than p(l) j this case will never happen, otherwise one of them belongs to the same class with p(l) i , resulting p(l) j is not the leftmost point (with the same label as p(l) i ), which contradicts the definition of j. Conclusively, we find the samples from different classes will not merge by φ(1) l (1 l L 1). Based on all the discussions above, we have proved Proposition 5. D.3. Proof of Proposition 6 Proposition 6. Confusion refers to merging two points with different labels. If confusion happens when we project X(l+1) onto the y-axis, there must be a parallelogram14 consisting of four different points in P (l). Proof. If confusion happens, we will merge some two points x(l) s and x(l) t with different labels. According to Eqn.139, we find that sin γ(l) s = sin γ(l) t . Since x(l) s and x(l) t are different points on the unit circle, we have γ(l) s = π γ(l) t , namely they are symmetric about y-axis. Furthermore, h(l) s and h(l) t are symmetric about y-axis. Besides, h(l) i and h(l) j are also symmetric about y-axis. Since the four points are on the same line, we have h(l) i + h(l) j = h(l) s + h(l) t . For H(l) is translated from P (l), we have p(l) i + p(l) j = p(l) s + p(l) t . We hence find a parallelogram in P (l). D.4. Proof of Proposition 7 Proposition 7. We can always find ul R2 for Algorithm 2, such that there is no parallelograms in ˆP (l), and no points merged in the algorithm. Proof. By Algorithm 2, we shift the points in P (l) up by 1, and then projects onto the unit circle x2 + y2 = 1, namely p(l) i SP p(l) i + 0 1 . (142) We find all points in P (l) are on the upper half circle. Obviously, any four different points in P (l) can not form a parallelogram, for the quadrilateral has two adjacent obtuse angles. In other words, give four different points pi, pj, ps, pt, we have pi + pj = ps + pt. Besides, if p(l) i = p(l) j , we have p(l) i = p(l) j . We can intuitively identify the two claims above in Figure 3. 14The parallelogram may be degenerate. Given four points x1, x2, x3, x4, if the sum of two points is the same with that of the other two, we regard they form a parallelogram. On the Nonlinearity of Layer Normalization Similarly, consider to merge different points together by ul, we can find ul from the set ˆS2( P (l)) = [ p(l) i = p(l) j S2( p(l) i , p(l) j ), (143) where S2( p(l) i , p(l) j ) = {ul R2 : u l ( p(l) i p(l) j ) = 0}. We then consider to form a parallelogram. We need some ul, and four different points pi, pj, ps, pt, such that u l p(l) i + u l p(l) j = u l p(l) s + u l p(l) t . Obviously, we can find ul from the set ˆS4( P (l)) = [ (i,j,s,t) I4( P (l)) S4( p(l) i , p(l) j , p(l) s , p(l) t ), (144) where S4( p(l) i , p(l) j , p(l) s , p(l) t ) = {ul R2 : u l ( p(l) i + p(l) j p(l) s p(l) t ) = 0}, (145) and the index set I4( P (l)) = {(i, j, s, t) : p(l) i , p(l) j , p(l) s , p(l) t are different with each other}. 
(146) Similarly, we point out that ˆS2( P (l)) consists of15 no more than m2 spaces of 1-dimensional. On the other hand, since pi + pj = ps + pt holds for any four different points in P (l), each S4( pi, pj, ps, pt) is a 1-dimensional space. Therefore, ˆS4( P (l)) consists of no more than m4 spaces of 1-dimension. We hence obtain that ˆS4( P (l)) ˆS2( P (l)) consists of no more than m2 + m4 spaces of 1-dimension, namely [ˆS4( P (l)) ˆS2( P (l))] R2. (147) We thus have that there exists p(l) i = p(l) j subjected to u l p(l) i = u l p(l) j , if and only if ul ˆS2(P (l)). On the other hand, we figure out that there exists four different points pi, pj, ps, pt subjected to u l p(l) i + u l p(l) j = u l p(l) s + u l p(l) t , if and only if ul ˆS4(P (l)). Since [ˆS4( P (l)) ˆS2( P (l))] R2, we obtain R2/[ˆS4( P (l)) ˆS2( P (l))] = . As a result, we can always find a ul R2/[ˆS4( P (l)) ˆS2( P (l))] to ensure not to merge different points, and form no parallelograms in ˆP (l) as well. D.5. Discussion on a Wider LN-Net We figure out that the algorithm here is suitable for both binary and multi-class classifications. Before giving the algorithm, we propose two lemmas as follows. Lemma 10. Given X(l) on the unit sphere, the necessary condition of x(l) i x(l) j //x(l) s x(l) t is that x(l) j Ox(l) s = x(l) i Ox(l) t , where O is origin of coordinates. Proof. For x(l) i x(l) j //x(l) s x(l) t , we have x(l) j x(l) i = k (x(l) t x(l) s ) (148) where k = 0. Accordingly, we figure out that x(l) j + k x(l) s = x(l) i + k x(l) t , (149) and furthermore, (x(l) j )2 + 2k x(l) j x(l) s + k2 (x(l) s )2 = (x(l) i )2 + 2k x(l) i x(l) t + k2 (x(l) t )2. (150) Since x(l) i , x(l) j , x(l) s , x(l) t are all on the unit sphere, we have (x(l) j )2 = (x(l) j )2 = (x(l) j )2 = (x(l) j )2 = 1. Therefore, we have x(l) j x(l) s = x(l) i x(l) t 15ˆS2( P (l)) is a point set of finite lines, hence can not cover the whole R2. On the Nonlinearity of Layer Normalization According to the cosine theorem, we have |x(l) j x(l) s | = |x(l) i x(l) t |. Furthermore, according to the central angle theorem, we have x(l) j Ox(l) s = x(l) i Ox(l) t . Lemma 11. Given p(l) i , p(l) j , p(l) s , p(l) t which are different from each other, the solution space B4(p(l) i , p(l) j , p(l) s , p(l) t ) = b Rn : (p(l) i + b) (p(l) s + b) p(l) i + b 2 p(l) s + b 2 = (p(l) j + b) (p(l) t + b) p(l) j + b 2 p(l) t + b 2 is contained in a hypersurface of n 1 dimension. Proof. We first loose the equation in B4(p(l) i , p(l) j , p(l) s , p(l) t ) to a polynomial equation. Ignoring the case b { p(l) i , p(l) j , p(l) s , p(l) t }, we can loose the equation in B4(pi, pj, ps, pt) as (p(l) i + b) (p(l) s + b) p(l) j + b 2 p(l) t + b 2 = (p(l) j + b) (p(l) t + b) p(l) i + b 2 p(l) s + b 2. (152) Furthermore, we loose it again to [(p(l) i + b) (p(l) s + b)]2 p(l) j + b 2 2 p(l) t + b 2 2 = [(p(l) j + b) (p(l) t + b)]2 p(l) i + b 2 2 p(l) s + b 2 2. (153) We define B 4(p(l) i , p(l) j , p(l) s , p(l) t ) = {b : b satisfies Eqn.153.}. (154) We find that b B 4(p(l) i , p(l) j , p(l) s , p(l) t ), for each b B4(p(l) i , p(l) j , p(l) s , p(l) t ). Since Eqn.153 is a polynomial equation about b, its solution space B 4(p(l) i , p(l) j , p(l) s , p(l) t ) is a hypersurface. From B4(p(l) i , p(l) j , p(l) s , p(l) t ) to B 4(p(l) i , p(l) j , p(l) s , p(l) t ), we add the four singularities { p(l) i , p(l) j , p(l) s , p(l) t }, and we extend cos x(l) j Ox(l) s = cos x(l) i Ox(l) t to cos2 x(l) j Ox(l) s = cos2 x(l) i Ox(l) t . 
We then prove B 4(p(l) i , p(l) j , p(l) s , p(l) t ) Rn, to ensure it is a hypersurface of d 1 dimension. (a) Four points are not on the same plane. p(l) t p(l) j (b) Four points are on the same plane, but not on the same line. (c) Four points are on the same line. Figure A3. Three cases of the four points. We figure out that b is the shift direction and distance, and becomes the new origin when we translate P (l) to H(l). Case 1: Suppose the four points p(l) i , p(l) j , p(l) s , p(l) t are not on the same plane, as shown in Figure A3(a). Choose b = (p(l) i + p(l) t )/2, we thus have h(l) i Oh(l) t = π. However, h(l) j Oh(l) s (0, π), otherwise the four points will belong to the same plane. Therefore, b / B 4(p(l) i , p(l) j , p(l) s , p(l) t ). On the Nonlinearity of Layer Normalization Case 2: Suppose the four points p(l) i , p(l) j , p(l) s , p(l) t are on the same plane, but not on the same line, as shown in Figure A3(b). We can always find b on the line segment p(l) i p(l) t , and ensure b is not on the line p(l) j p(l) s , otherwise the four points will be on the same line. We thus have h(l) i Oh(l) t = π, but h(l) j Oh(l) s (0, π). Therefore, b / B 4(p(l) i , p(l) j , p(l) s , p(l) t ). Case 3: Suppose the four points p(l) i , p(l) j , p(l) s , p(l) t are on the same line, as shown in Figure A3(c). We can draw circles with p(l) i p(l) t and p(l) j p(l) s , respectively. We can always find b on the previous circle, but not on the later one, otherwise they will be not different from each other. We thus have h(l) i Oh(l) t = π 2 , but h(l) j Oh(l) s = π 2 . Therefore, b / B 4(p(l) i , p(l) j , p(l) s , p(l) t ). Conclusively, we can always find some b / B 4(p(l) i , p(l) j , p(l) s , p(l) t ), then we have B 4(p(l) i , p(l) j , p(l) s , p(l) t ) Rn. Further, B 4(p(l) i , p(l) j , p(l) s , p(l) t ) is a hypersurface of d 1 dimension, and B4(p(l) i , p(l) j , p(l) s , p(l) t ) B 4(p(l) i , p(l) j , p(l) s , p(l) t ) Here we propose the proposition of a wider LN-Net as follows. Proposition 9. A wider LN-Net can classify m samples with any label assignment. Proof. Similarly, we hope to merge two points from the same class, and do not merge other points meanwhile. Suppose LN acts on Rn+1 by Lemma 1, we thus use SP on Rn for convenience. Given P (l) Rn m on a n 1 dimensional hyperplane, we consider to shift the points by b Rn and get H(l). After that, we spherically project H(l) onto the unit sphere x 2 = 1, represented by X(l+1). Hereafter, we linearly project X(l+1) onto another n 1 dimensional hyperplane. Different from our method on R2, we can not sort the points, it is hence much harder to design a suitable algorithm in a high dimensional space. But we can consider to merge some p(l) i and p(l) j only, without merging the other points. We analyze the merging progress backward, and show how to find the projection direction and the bias b. To get P (l+1) from X(l), without doubt the projection direction is along x(l) i x(l) j , and the target is some n 1 dimensional hyperplane. Now we need to ensure doing so will not merge other points. Obviously, its necessary and sufficient condition is that there are no other different points x(l) s , x(l) t , such that x(l) i x(l) j //x(l) s x(l) t , (155) namely x(l) s x(l) t is parallel to the projection direction. According to Lemma 10, for X(l) is on the unit sphere, the necessary condition of x(l) i x(l) j //x(l) s x(l) t is that x(l) i Ox(l) s = x(l) j Ox(l) t , where O is the origin of coordinates. 
Since X(l) = SP(H(l)), we have x(l) i Ox(l) s = x(l) j Ox(l) t h(l) i Oh(l) s = h(l) j Oh(l) t . (156) If we ensure any four different points in H(l) to satisfy h(l) i Oh(l) s = h(l) j Oh(l) t , we will not merge other points when we merge x(l+1) i and x(l+1) j . Since h(l) k = p(l) k +b, according to the cosine theorem, we point out that h(l) i Oh(l) s = h(l) j Oh(l) t is equivalent to (p(l) i + b) (p(l) s + b) p(l) i + b 2 p(l) s + b 2 = (p(l) j + b) (p(l) t + b) p(l) j + b 2 p(l) t + b 2 . (157) B4(p(l) i , p(l) j , p(l) s , p(l) t ) = b Rn : (p(l) i + b) (p(l) s + b) p(l) i + b 2 p(l) s + b 2 = (p(l) j + b) (p(l) t + b) p(l) j + b 2 p(l) t + b 2 Since p(l) i , p(l) j , p(l) s , p(l) t are different from each other, the solution space of Eqn.157 about b is contained in a hypersurface of n 1 dimension, by Lemma 11. On the Nonlinearity of Layer Normalization Again, we define ˆB4(P (l)) = [ (i,j,s,t) I4(P (l)) B4(p(l) i , p(l) j , p(l) s , p(l) t ), (159) where I4(P (l)) = {(i, j, s, t) : p(l) i , p(l) j , p(l) s , p(l) t are different with each other}. (160) We figure out that ˆB4(P (l)) is contained in a union of no more than m4 hypersurfaces of n 1 dimension. Besides, from P (l) to X(l), we can not merge any two different points. Therefore, given p(l) i = p(l) j , we need (p(l) i + b)/ p(l) i + b 2 = (p(l) j + b)/ p(l) j + b 2. Given two different points pi, pj, we define B2(p(l) i , p(l) j ) = b Rn : p(l) i + b p(l) i + b 2 = p(l) j + b p(l) j + b 2 Similarly, we can prove that B2(pi, pj) is contained in a hypersurface of n 1 dimension. We find ˆB2(P (l)) is contained in the union of no more than m2 hypersurfaces of n 1 dimension, where ˆB2(P (l)) = [ p(l) i =p(l) j B2(p(l) i , p(l) j ). (162) We figure out that ˆB2(P (l)) ˆB4(P (l)) is contained in a union of no more than m2 + m4 hypersurfaces of n 1 dimension. Therefore, we have [ˆB2(P (l)) ˆB4(P (l))] Rn (163) Choose some b Rn/[ˆB2(P (l)) ˆB4(P (l))], then h(l) j Oh(l) s = h(l) i Oh(l) t will not holds. Furthermore, by Lemma 10, x(l) i x(l) j //x(l) s x(l) t will not holds either. As a result, we can only merge p(l) i and p(l) j by projection. In conclusion, we can choose to only merge two samples with the same label each step by the method above. Furthermore, we can construct an LN-Net with depth O(m) to classify m samples with any label assignment. Note the width of LN-Net here is wider than 3, and we do not require the widths of each layer are equal. E. Proof of Proposition 8 Proposition 8. Given g d/3, we have H(ψG(g; ); x) H(ψL( ); x) 1. (164) Specifically, when g = d/4, we figure out that H(ψG(g; ); x) H(ψL( ); x) d In the proof of Proposition 8, we consider a single sample only. We use xi as the i-th ordinate of x instead of x(i) in this proof, we thus use x2 i to denote the squares rather than [x(i)]2. E.1. Required Lemmas for the Proof Lemma 12. Given x Rd, µ = (x1 + + xd)/d and σ2 = [(x1 µ)2 + + (xd µ)2]/d, we denote LN(x) as ˆx = (x µ1)/σ. We point out that H(ψL( ); x) = 3 σ4 6 dσ4 (166) On the Nonlinearity of Layer Normalization Proof. To begin with, we regard ˆxi as ψi(x), and then give the gradient xψi(x). Let s = 1 i=1 (xi µ)2 and σ = s. d, i, (167) j=1 (xj µ)2 d(xi µ), i, σ xi = 1 2 s s xi d , i. (169) We thus obtain σ xi (xi µ) + (xi µ) dσ (d 1 ˆx2 i ). While for j = i, we have σ xj (xi µ) + (xi µ) dσ ( 1 ˆxiˆxj). Based above, we calculate the Hessian matrix. For each term 2ˆxi xj xk , we figure out that there are four kinds of the second order derivative. 
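As a side note, the gradient expressions above correspond to the LN Jacobian with entries ∂x̂ᵢ/∂xⱼ = (1/(dσ))(d·δᵢⱼ − 1 − x̂ᵢx̂ⱼ). Before the four Hessian cases are worked out, the following numpy sketch (ours, not the authors' code) checks this analytic Jacobian against central finite differences.

```python
import numpy as np

def layer_norm(x):
    mu = x.mean()
    sigma = np.sqrt(((x - mu) ** 2).mean())
    return (x - mu) / sigma

def analytic_jacobian(x):
    d = x.size
    xhat = layer_norm(x)
    sigma = np.sqrt(((x - x.mean()) ** 2).mean())
    # J[i, j] = d xhat_i / d x_j = (d*delta_ij - 1 - xhat_i*xhat_j) / (d*sigma)
    return (d * np.eye(d) - 1.0 - np.outer(xhat, xhat)) / (d * sigma)

def numeric_jacobian(x, eps=1e-6):
    d = x.size
    J = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        J[:, j] = (layer_norm(x + e) - layer_norm(x - e)) / (2 * eps)
    return J

x = np.random.default_rng(0).normal(size=6)
print(np.allclose(analytic_jacobian(x), numeric_jacobian(x), atol=1e-5))  # True
```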
Case 1, i = j = k: dσ2 (d 1 ˆx2 i ) σ = 1 d2σ2 (d 1 ˆx2 i )ˆxi 2ˆxi d2σ2 (d 1 ˆx2 i ) = 1 d2σ2 [3ˆx3 i 3(d 1)ˆxi] = 1 d2σ2 (3ˆx3 i + 3ˆxi) 3ˆxi On the Nonlinearity of Layer Normalization Case 2, only one of j, k equals to i, assume i = k: 2ˆxi xi xj = 1 dσ2 (d 1 ˆx2 i ) σ = 1 d2σ2 (d 1 ˆx2 i )ˆxj 2ˆxi d2σ2 ( 1 ˆxiˆxj) = 1 d2σ2 [3ˆx2 i ˆxj + 2ˆxi (d 1)ˆxj] = 1 d2σ2 (3ˆx2 i ˆxj + 2ˆxi + ˆxj) ˆxj We have that 2ˆxi xj xk = 2ˆxi xk xj , so the result of the other case i = j has the same form with that of i = k. Case 3, j = k, but i = j: dσ2 ( 1 ˆxiˆxj) σ dσ ˆxj xj ˆxj = 1 d2σ2 ( 1 ˆxiˆxj)ˆxj ˆxi d2σ2 (d 1 ˆx2 j) ˆxj d2σ2 ( 1 ˆxiˆxj) = 1 d2σ2 [3ˆxiˆx2 j + 2ˆxj (d 1)ˆxi] = 1 d2σ2 (3ˆxiˆx2 j + 2ˆxj + ˆxi) ˆxi Case 4, i, j, k are different from each other: 2ˆxi xj xk = 1 dσ2 ( 1 ˆxiˆxj) σ dσ ˆxj xk ˆxj = 1 d2σ2 ( 1 ˆxiˆxj)ˆxk ˆxi d2σ2 ( 1 ˆxj ˆxk) ˆxj d2σ2 ( 1 ˆxiˆxk) = 1 d2σ2 (3ˆxiˆxj ˆxk + ˆxi + ˆxj + ˆxk). It is hard to calculate the operator norm of the Hessian matrix is too difficult, so we calculate the Frobenius norm instead. 1 d4σ4 (3ˆxiˆxj ˆxk + ˆxi + ˆxj + ˆxk)2 + X ˆx2 i d2σ4 2 ˆxi d3σ4 (3ˆxiˆx2 j + 2ˆxj + ˆxi) " ˆx2 j d2σ4 2 ˆxj d3σ4 (3ˆx2 i ˆxj + 2ˆxi + ˆxj) + 9ˆx2 i d2σ4 2 3ˆxi d3σ4 (3ˆx3 i + 3ˆxi) 1 d4σ4 (3ˆxiˆxj ˆxk + ˆxi + ˆxj + ˆxk)2 + ˆx2 i d2σ4 2 ˆxi d3σ4 (3ˆxiˆx2 j + 2ˆxj + ˆxi) " ˆx2 j d2σ4 2 ˆxj d3σ4 (3ˆx2 i ˆxj + 2ˆxi + ˆxj) + 6ˆx2 i d2σ4 We note that d X j=1 ˆxj = 0, j=1 ˆx2 j = d. (177) On the Nonlinearity of Layer Normalization We thus have d X 1 d4σ4 (3ˆxiˆxj ˆxk + ˆxi + ˆxj + ˆxk)2 1 d4σ4 [9ˆx2 i ˆx2 j ˆx2 k + 6ˆxiˆxj ˆxk(ˆxi + ˆxj + ˆxk) + (ˆxi + ˆxj + ˆxk)2] = 9ˆx2 i d2σ4 + 0 + 1 d4σ4 (ˆxi + ˆxj + ˆxk)2 =10ˆx2 i d2σ4 + 2 d2σ4 , ˆx2 i d2σ4 2 ˆxi d3σ4 (3ˆxiˆx2 j + 2ˆxj + ˆxi) = ˆx2 i dσ4 8ˆx2 i d2σ4 , (179) " ˆx2 j d2σ4 2 ˆxj d3σ4 (3ˆx2 i ˆxj + 2ˆxi + ˆxj) = 2 dσ4 12ˆx2 i d2σ4 4 d2σ4 . (180) Take Eqn.178, Eqn.179 and Eqn.180 into Eqn.176, we obtain F = ˆx2 i + 2 dσ4 4ˆx2 i + 2 d2σ4 (181) Now we add up all the dimensions, as LN s information of the second order H(ψL( ); x) = σ4 6 dσ4 = 3 dσ4 (d 2). (182) When d = 2, we have ˆx2 i = 1, and H(ψL( ); x)|d=2 = 0 naturally. Lemma 13. Given x Rd, let the group number of GN be g. Suppose σ2 i is the variance of the i-th group, we have that H(ψG(g; ); x) = Proof. We simplify ψG(g; ) as ψ( ) in the proof here. As for Group Normalization, suppose the number of groups is g, and d = g c. Let x = [z 1 , , z g ] , where zi = [zi1, , zic] , (i = 1, , g). Assume x = [x1, , xd] , we denote that zij = x(i 1) c+j. Let ˆx = GN(x), where GN( ) denotes the Group Normalization operation. GN can be calculated by µi = (zi1+ +zic)/c, σ2 i = [(zi1 µi)2 + + (zic µi)2]/c, and then ˆzij = (zij µi)/σi. Accordingly, we denote ˆx = [ˆz 1 , , ˆz g ] , where ˆzi = LN(zi), (i = 1, , g). To begin with ,we denote GN(x) as ψ(x) = [ψ11(x), ψ12(x), , ψgc(x)]. We thus have z1ψij(x) ... zgψij(x) , (i = 1, , g; j = 1, , c). (184) On the Nonlinearity of Layer Normalization We denote that zij = ψij(x). When k = i, we have zkψij(x) = 0. When k = i, we have ziψij(x) is a gradient of LN, for [ψi1(x), , ψic(x)] = LN(zi). We can give the Hessian matrix of GN, denoted as 2 xψij(x) = O O ... ... 2 ziψij(x) ... ... O O , (i = 1, , g; j = 1, , c) (185) By the discussion about LN above, we obtain that 2 ziψij(x) 2 F = ˆz2 ij + 2 cσ4 i 4ˆz2 ij + 2 c2σ4 i . (186) Obviously, we have 2 xψij(x) 2 F = 2 ziψij(x) 2 F . Although there are many zeros in 2 xψij(x), for c P j=1 ˆx2 ij = c, we H(ψG(g; ); x) = j=1 2 xψij(x) 2 F cσ4 i 4ˆx2 ij + 2 c2σ4 i σ4 i 6 cσ4 i Lemma 14. 
In group normalization, we have
$$\sigma^2 \;\ge\; \frac{1}{g}\sum_{i=1}^{g}\sigma_i^2, \tag{188}$$
where $\sigma^2$ is the variance used by LN on the whole vector and $\sigma_i^2$ is the variance of the $i$-th group.

Proof. According to the definitions, we have
$$\sigma^2-\frac{1}{g}\sum_{i=1}^{g}\sigma_i^2
=\frac{1}{d}\sum_{i=1}^{g}\sum_{j=1}^{c}(z_{ij}-\mu)^2-\frac{1}{g}\sum_{i=1}^{g}\frac{1}{c}\sum_{j=1}^{c}(z_{ij}-\mu_i)^2
=\frac{1}{d}\sum_{i,j}(z_{ij}^2-\mu^2)-\frac{1}{d}\sum_{i,j}(z_{ij}^2-\mu_i^2)
=\frac{1}{g}\sum_{i=1}^{g}\mu_i^2-\mu^2. \tag{189}$$
Since $c(\mu_1+\cdots+\mu_g)=cg\mu$, i.e., $\mu=\frac{1}{g}\sum_{i=1}^{g}\mu_i$, we have
$$\sigma^2-\frac{1}{g}\sum_{i=1}^{g}\sigma_i^2=\frac{1}{g}\sum_{i=1}^{g}\mu_i^2-\mu^2=\frac{1}{g}\sum_{i=1}^{g}(\mu_i-\mu)^2\;\ge\;0. \tag{190}$$

Lemma 15. $f(x)=\frac{1}{x^2}$ is a monotonically decreasing and convex function on $x>0$.

Proof. For $f(x)=\frac{1}{x^2}$, we have $f'(x)=-\frac{2}{x^3}<0$, namely $f(x)$ is monotonically decreasing. Furthermore, we have $f''(x)=\frac{6}{x^4}>0$, namely $f(x)$ is convex.

Lemma 16. Given that $\sigma_1^2,\cdots,\sigma_g^2$ and $\sigma^2$ are the variances computed in LN-G and LN respectively, we have
$$\frac{1}{g}\sum_{i=1}^{g}\frac{1}{\sigma_i^4}\;\ge\;\frac{1}{\sigma^4}. \tag{191}$$

Proof. According to Lemma 15, $f(x)=\frac{1}{x^2}$ is a convex function. By Jensen's inequality, we obtain
$$\frac{1}{g}\sum_{i=1}^{g}f(\sigma_i^2)\;\ge\;f\Big(\frac{1}{g}\sum_{i=1}^{g}\sigma_i^2\Big). \tag{192}$$
According to Lemma 14 and the monotonicity in Lemma 15, we have
$$f\Big(\frac{1}{g}\sum_{i=1}^{g}\sigma_i^2\Big)\;\ge\;f(\sigma^2). \tag{193}$$
Combining the two inequalities gives
$$\frac{1}{g}\sum_{i=1}^{g}\frac{1}{\sigma_i^4}\;\ge\;\frac{1}{\sigma^4}. \tag{194}$$

E.2. Proof of Proposition 8

Proof. To prove
$$\frac{H(\psi_G(g;\cdot);x)}{H(\psi_L(\cdot);x)}\;\ge\;1, \tag{195}$$
we can prove Eqn.196 instead:
$$H(\psi_G(g;\cdot);x)-H(\psi_L(\cdot);x)\;\ge\;0. \tag{196}$$
According to Eqn.182, Eqn.187 and Lemma 16 (note that $c=d/g\ge 3$ since $g\le d/3$, so $3-\frac{6}{c}>0$), we obtain
$$H(\psi_G(g;\cdot);x)-H(\psi_L(\cdot);x)
=\sum_{i=1}^{g}\Big(\frac{3}{\sigma_i^4}-\frac{6}{c\sigma_i^4}\Big)-\frac{3}{d\sigma^4}(d-2)
\;\ge\;\frac{3g(d-2g)}{d\sigma^4}-\frac{3(d-2)}{d\sigma^4}
=\frac{3}{d\sigma^4}(d-2g-2)(g-1). \tag{197}$$
When $g\ge 2$, we have $d\ge 3g\ge 6$. Therefore, we obtain
$$d-2g-2\;\ge\;d-\frac{2d}{3}-2\;=\;\frac{d-6}{3}\;\ge\;0. \tag{198}$$
According to Eqn.197, we can also give the necessary condition for equality in Eqn.196. One of the cases is $g=1$, obviously. The other case is $d=2g+2$; but we note that $g\,|\,d$, hence $g\,|\,2$, namely $g=2,\,d=6$ is the only other case of equality. Therefore, we have proved
$$\frac{H(\psi_G(g;\cdot);x)}{H(\psi_L(\cdot);x)}\;\ge\;1. \tag{199}$$
As for the case $g=d/4$, the bound used in Eqn.197 gives
$$H(\psi_G(g;\cdot);x)\;\ge\;\frac{3g(d-2g)}{d\sigma^4}. \tag{200}$$
The right-hand side reaches its maximum at $g=d/4$, where we have
$$H(\psi_G(g;\cdot);x)\;\ge\;\frac{3d}{8\sigma^4}. \tag{201}$$
On the other hand, we have
$$H(\psi_L(\cdot);x)=\frac{3(d-2)}{d\sigma^4}\;\le\;\frac{3}{\sigma^4}. \tag{202}$$
As a result, we obtain
$$\frac{H(\psi_G(g;\cdot);x)}{H(\psi_L(\cdot);x)}\;\ge\;\frac{d}{8}. \tag{203}$$

E.3. $\bar H$ about ReLU

We conducted additional analyses to compare the nonlinearity of ReLU and LN during the rebuttal phase. ReLU is defined as $\max(0,x)$, which is not strictly differentiable. To compare ReLU with LN, we introduce the Dirac function $\delta(x)$ as ReLU's second-order derivative, namely $\nabla^2\mathrm{ReLU}(x)=\delta(x)$. We know that $\int_I\delta(x)\,dx=1$ and $\int_I f(x)\delta(x)\,dx=f(0)$. To make these integrals applicable, we introduce an expectation and assume $x\sim N(0,I)$ is $d$-dimensional. Since we do not know how to calculate $\int_I f(x)\delta^2(x)\,dx$, we remove the square sign in $H$: we define $\bar H(f;x)$ analogously to Eqn.20 in the main paper, but with the Frobenius norm in place of its square ($\bar y_i$ is defined similarly). Based on the assumptions above, we have
$$\bar H(\mathrm{relu}(\cdot);x)=\frac{d}{\sqrt{2\pi}}=O(d), \tag{205}$$
$$\bar H(\psi_L(\cdot);x)=\sum_{i=1}^{d}\mathbb{E}_x\Big[\frac{1}{d\sigma^2}\sqrt{d(\hat x_i^2+2)-(4\hat x_i^2+2)}\Big]=O(\sqrt{d}). \tag{206}$$
Furthermore, we have
$$\bar H(\psi_G(g;\cdot);x)=g\cdot O(\sqrt{c})=O(\sqrt{gd}). \tag{207}$$
Note that we removed the square sign when defining $\bar H$, which is why a square root appears inside its summands. We hope the analysis above can help compare ReLU with LN, to some extent.

F. Experiments
F.1. Details of Experiments on Comparison of Representation Capacity by Fitting Random Labels
In this section, we provide the details of the experimental setup for comparing representation capacity by fitting random labels, as stated in Section 5.2.
F.1.1. DATASET WITH RANDOM LABELS
We construct the random-label datasets based on CIFAR-10 and MNIST, referred to as CIFAR-10-RL and MNIST-RL. In particular, for each sample of these datasets, we randomly assign a class label to the sample and save all the samples as a dataset. Even though the labels are random, the label assignment is fixed once the dataset is constructed. Therefore, it is meaningful to compare the results of different methods by fitting the random labels.
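A fixed random-label dataset of this kind can be built, for example, by wrapping an existing dataset and resampling its labels once with a fixed seed. The sketch below is our own illustration (class and variable names are assumptions, not the authors' released code):

```python
import numpy as np
from torch.utils.data import Dataset

class RandomLabelDataset(Dataset):
    """Wraps a labeled dataset and replaces every label with a fixed random one."""
    def __init__(self, base, num_classes=10, seed=0):
        self.base = base
        rng = np.random.default_rng(seed)
        # Labels are drawn once and stored, so the (random) assignment is fixed.
        self.labels = rng.integers(0, num_classes, size=len(base))

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        x, _ = self.base[idx]              # discard the true label
        return x, int(self.labels[idx])

# e.g. base = torchvision.datasets.CIFAR10(root, train=True, transform=...)
# cifar10_rl = RandomLabelDataset(base, num_classes=10, seed=0)
```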
MNIST-RL is more challenging. Here, we highlight that MNIST-RL is more challenging than CIFAR-10-RL when training a classifier to fit the labels. Let Xc denote the examples belonging to class c. For the normal MNIST dataset, the features within Xc are very close; for example, all digits "0" are very similar in representation, since they all have rounded curves. Under random labels (the MNIST-RL dataset), however, the samples in Xc receive different labels, so the network needs to map samples that are very close in representation to different labels. As a result, a more powerful model is needed to fit MNIST-RL, and it is more difficult to train.

F.1.2. DETAILS ON VERIFYING NONLINEARITY OF LN
In this part, we use various configurations of hyper-parameters to train our models, aiming to reduce the effect of optimization. We first sufficiently train a linear classifier (Figure A4 (b)) as the baseline, which provides the (nearly) upper-bound accuracy of a linear classifier. We then compare the results of the linear neural network and the LN-Net, both with a residual structure for better optimization, as shown in Figure A4 (c). We vary the depth over {2, 4, 6, 8, 10, 12, 14}, and each hidden layer has a dimension of 256.

Training protocols. For training the linear classifier, we apply both the SGD optimizer with momentum 0.1 and the Adam optimizer with betas (0.9, 0.999). We train the model for 150 epochs and use a learning rate schedule that decays by 0.5 every 20 epochs. We search over batch sizes in {128, 256}, initial learning rates in {0.001, 0.003, 0.005, 0.008, 0.05, 0.08, 0.1, 0.15} and 5 random seeds, and report the best accuracy over these configurations of hyper-parameters. For training the linear neural networks and LN-Nets, we follow the settings used for the linear classifier, except that: 1) we use a batch size of 128 and a fixed random seed; 2) we search the initial learning rates in {0.01, 0.03, 0.05, 0.08, 0.1} for SGD and in {0.001, 0.003, 0.005, 0.008, 0.05, 0.08, 0.1, 0.15} for Adam.

Results. In Figure 4 of the main paper, we show the best result of the linear classifier as a black dashed line, which is 18.51% on CIFAR-10-RL and 15.38% on MNIST-RL. We also provide the detailed results for the linear neural network and the LN-Net in Table II.

Figure A4. Schematic representation of the networks used in the experiments. (a) Original LN-Net and Linear Neural Network (LNN). (b) Linear classifier. (c) LN-Net and LNN using residual connections. (d) LN-Net using LN-G.

Table II. Results of the linear neural network (Linear+Res) and LN-Net (LN+Res) on the classification task on CIFAR-10-RL and MNIST-RL. Entries that outperform the linear classifier (18.51% on CIFAR-10-RL, 15.38% on MNIST-RL) show that layer normalization breaks the accuracy bound of a purely linear model.

                CIFAR-10-RL             MNIST-RL
depth     Linear+Res   LN+Res     Linear+Res   LN+Res
2         17.37%       20.45%     14.71%       14.45%
4         17.00%       27.97%     14.54%       15.26%
6         16.97%       39.24%     14.29%       15.28%
8         17.02%       39.39%     14.32%       15.76%
10        16.98%       31.12%     13.89%       18.26%
12        16.91%       50.48%     13.35%       18.47%
14        15.19%       55.58%     13.98%       19.44%
best      17.37%       55.58%     14.71%       19.44%

F.1.3. DETAILS ON AMPLIFYING THE NONLINEARITY USING GROUP
We use the original LN-Net and replace LN with LN-G, as shown in Figure A4 (d). For the network configuration, we fix the number of neurons at 256 and vary the depth over {1, 2, 4, 6, 8, 10, 12, 14}. We vary the group number of LN-G over {2, 4, 8, 16, 32, 64, 128}.
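Group-based LN on a fully connected layer can be implemented by partitioning the d neurons into g groups and applying LN within each group. Below is a minimal PyTorch sketch of such a module and of a residual linear/LN-G block in the spirit of Figure A4 (c)/(d); module and parameter names are ours, and the exact block layout in the authors' code may differ.

```python
import torch
import torch.nn as nn

class LNG(nn.Module):
    """Group-based layer normalization: split d neurons into g groups, standardize each group."""
    def __init__(self, dim, groups, eps=1e-5):
        super().__init__()
        assert dim % groups == 0
        self.groups, self.eps = groups, eps

    def forward(self, x):                          # x: (N, d)
        n, d = x.shape
        z = x.view(n, self.groups, d // self.groups)
        mu = z.mean(dim=-1, keepdim=True)
        var = z.var(dim=-1, unbiased=False, keepdim=True)
        return ((z - mu) / torch.sqrt(var + self.eps)).view(n, d)

class ResidualLinearLNG(nn.Module):
    """One residual block of an LN-Net: linear map followed by LN-G, plus a skip connection."""
    def __init__(self, dim, groups=1):             # groups=1 recovers plain LN
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        self.norm = LNG(dim, groups)

    def forward(self, x):
        return x + self.norm(self.linear(x))
```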
Training protocols. We use the same training protocols as in the experiment above, except that we only use the SGD optimizer with a fixed momentum of 0.1 and search the initial learning rate over {0.01, 0.03, 0.05, 0.1}.

Results. We provide the detailed results of LN-Net with LN-G on CIFAR-10-RL in Table III and on MNIST-RL in Table IV.

Table III. The performance of LN-Net with LN-G on CIFAR-10-RL. Each row corresponds to a group number of LN-G and each column to a model depth; each entry is the best accuracy of the model under that setting. The "best" row gives, for each depth, the best accuracy over all group numbers.

group \ depth    1         2         4         6         8         10        12        14
2                20.51%    29.09%    52.17%    60.70%    67.21%    71.45%    74.10%    68.53%
4                26.63%    45.19%    72.41%    84.08%    91.36%    94.02%    95.76%    96.76%
8                35.02%    60.65%    91.74%    98.57%    99.72%    99.94%    99.99%    99.96%
16               46.42%    79.71%    99.58%    99.99%    100.00%   100.00%   100.00%   100.00%
32               59.89%    93.67%    99.96%    100.00%   100.00%   100.00%   100.00%   100.00%
64               69.40%    91.62%    99.44%    99.66%    96.58%    88.20%    77.22%    44.48%
128              26.48%    14.66%    12.28%    10.38%    10.23%    10.26%    10.37%    10.22%
best             69.40%    93.67%    99.96%    100.00%   100.00%   100.00%   100.00%   100.00%

Table IV. The performance of LN-Net with LN-G on MNIST-RL. Each row corresponds to a group number of LN-G and each column to a model depth; each entry is the best accuracy of the model under that setting. The "best" row gives, for each depth, the best accuracy over all group numbers.

group \ depth    1         2         4         6         8         10        12        14
2                14.53%    18.25%    26.83%    27.76%    27.96%    27.56%    30.39%    30.81%
4                14.77%    20.98%    33.35%    40.67%    50.00%    53.52%    57.44%    58.78%
8                15.61%    25.38%    46.48%    64.51%    74.91%    81.34%    85.98%    89.97%
16               19.13%    32.43%    66.59%    86.20%    92.16%    94.03%    95.32%    95.25%
32               24.92%    47.08%    82.34%    92.40%    94.47%    95.56%    95.68%    95.96%
64               33.95%    54.00%    70.61%    68.63%    56.89%    42.89%    13.43%    10.21%
128              10.22%    10.17%    10.16%    10.22%    10.30%    10.25%    10.32%    10.31%
best             33.95%    54.00%    82.34%    92.40%    94.47%    95.56%    95.68%    95.96%

F.2. More Results of CNNs without Activation Functions
As stated in Section 5.3.1, we conduct more experiments on different networks, including VGG-style networks and the 20-layer ResNet with its original configuration of channel numbers.

Results on VGG-style networks. Following the experimental setup in Section 5.3.1, we also conduct experiments on CIFAR-10 classification using different normalization methods in VGG-style networks (the architecture is ResNet-20 with the residual connections removed) with the ReLU activations removed, where the group number g ranges over {2, 4, 8, 16, 32, 64}. The results of the different normalization methods are shown in Table V. The results of GN and LN-G-Position with different group numbers are shown in Figure A5. We have similar observations as for the ResNet-20 results in the main paper.

[Figure A5: bar charts of (a) training accuracy and (b) test accuracy versus group number (G2–G64) for GN and LN-G-Position.]
Figure A5. Results of the variants of LN-G (GN and LN-G-Position) when using different group numbers. The experiments are conducted on the CIFAR-10 dataset using a 20-layer VGG-style network without ReLU activation. We show (a) the training accuracy and (b) the test accuracy. On the x-axis, G2 refers to a group number of 2.
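GN normalizes each group of channels jointly over the spatial dimensions, whereas LN-G-Position, as its name suggests and as we assume here, normalizes each group of channels independently at every spatial position. A rough PyTorch sketch of this assumed form, for comparison with GN, is given below; it is our illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LNGPosition(nn.Module):
    """Assumed form of LN-G-Position: split channels into groups and standardize each
    group separately at every spatial position (no normalization across H, W)."""
    def __init__(self, channels, groups, eps=1e-5):
        super().__init__()
        assert channels % groups == 0
        self.groups, self.eps = groups, eps

    def forward(self, x):                                   # x: (N, C, H, W)
        n, c, h, w = x.shape
        z = x.view(n, self.groups, c // self.groups, h, w)
        mu = z.mean(dim=2, keepdim=True)                    # statistics over channels in a group,
        var = z.var(dim=2, unbiased=False, keepdim=True)    # computed per spatial position
        return ((z - mu) / torch.sqrt(var + self.eps)).view(n, c, h, w)

# For reference, GN with the same group number normalizes over (C/g, H, W) jointly:
# gn = nn.GroupNorm(num_groups=g, num_channels=C)
```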
[Figure A6: bar charts of (a) training accuracy and (b) test accuracy versus group number (G2–G16) for GN and LN-G-Position.]
Figure A6. Results of the variants of LN-G (GN and LN-G-Position) when using different group numbers. The experiments are conducted on the CIFAR-10 dataset using ResNet-20-Original without ReLU activation. We show (a) the training accuracy and (b) the test accuracy.

Table V. Comparison of different normalization methods on CIFAR-10 using VGGs-NA (VGG-style networks without ReLU activation).

Normalization method    Train Acc (%)    Test Acc (%)
IN                      9.76             10
BN                      39.41            39.52
LN                      51.51            51.06
GN                      79.16            69.93
LN-G-Position           99.33            84.5

Results on the original ResNet-20. Following the experimental setup in Section 5.3.1, we also conduct experiments on the original ResNet-20-NA (with ReLU removed), where the group number g ranges over {2, 4, 8, 16}. The results of the different normalization methods are shown in Table VI. The results of GN and LN-G-Position with different group numbers are shown in Figure A6. We again have similar observations as for the ResNet-20 results in the main paper.

Table VI. Comparison of different normalization methods on CIFAR-10 using ResNet-20-original-NA (ResNet-20 with the original configuration of channel numbers, without ReLU activation).

Normalization method    Train Acc (%)    Test Acc (%)
IN                      10               10
BN                      36.16            39.34
LN                      61.12            58.69
GN                      77.12            68.74
LN-G-Position           89.9             83.8
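As a rough guide to reproducing comparisons of this kind, the normalization layer can be made a pluggable component of a residual block with the ReLU activations removed, so that normalization is the only nonlinearity. The sketch below uses our own naming and is not the authors' code; note that LN over (C, H, W) can be expressed as GroupNorm with a single group, and the LN-G-Position module sketched above could be plugged in the same way.

```python
import torch.nn as nn

def make_norm(kind, channels, groups=16):
    """Return a normalization layer for the no-activation experiments (hypothetical factory)."""
    if kind == "IN":
        return nn.InstanceNorm2d(channels, affine=True)
    if kind == "BN":
        return nn.BatchNorm2d(channels)
    if kind == "LN":                       # LN over (C, H, W) == GroupNorm with one group
        return nn.GroupNorm(1, channels)
    if kind == "GN":
        return nn.GroupNorm(groups, channels)
    raise ValueError(kind)

class BasicBlockNA(nn.Module):
    """ResNet-style basic block with ReLU removed: normalization is the only nonlinearity."""
    def __init__(self, channels, norm="GN", groups=16):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.norm1 = make_norm(norm, channels, groups)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.norm2 = make_norm(norm, channels, groups)

    def forward(self, x):
        out = self.norm1(self.conv1(x))
        out = self.norm2(self.conv2(out))
        return x + out
```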