Published as a conference paper at ICLR 2019

TOWARDS UNDERSTANDING REGULARIZATION IN BATCH NORMALIZATION

Ping Luo1,3 Xinjiang Wang2 Wenqi Shao1 Zhanglin Peng2
1The Chinese University of Hong Kong 2SenseTime Research 3The University of Hong Kong

ABSTRACT

Batch Normalization (BN) improves both convergence and generalization in training neural networks. This work provides a theoretical understanding of these phenomena. We analyze BN by using a basic block of neural networks, consisting of a kernel layer, a BN layer, and a nonlinear activation function. This basic network helps us understand the impact of BN in three aspects. First, by viewing BN as an implicit regularizer, BN can be decomposed into population normalization (PN) and gamma decay as an explicit regularization. Second, the learning dynamics of BN and this regularization show that training converges with a large maximum and effective learning rate. Third, the generalization of BN is explored by using statistical mechanics. Experiments demonstrate that BN in convolutional neural networks shares the same traits of regularization as the above analyses.

1 INTRODUCTION

Batch Normalization (BN) is an indispensable component in many deep neural networks (He et al., 2016; Huang et al., 2017). BN has been widely used in various areas such as machine vision, speech, and natural language processing. Experimental studies (Ioffe & Szegedy, 2015) suggested that BN improves convergence and generalization by enabling a large learning rate and preventing overfitting when training deep networks. Understanding BN theoretically is a key question.

This work investigates regularization of BN as well as its optimization and generalization in a single-layer perceptron, which is a building block of deep models, consisting of a kernel layer, a BN layer, and a nonlinear activation function such as ReLU. The computation of BN is written as
$$y = g(\hat{h}), \quad \hat{h} = \gamma\,\frac{h-\mu_B}{\sigma_B} + \beta, \quad \text{and} \quad h = \mathbf{w}^{\top}\mathbf{x}. \tag{1}$$
This work denotes a scalar and a vector by using a lowercase letter (e.g. x) and a bold lowercase letter (e.g. x) respectively. In Eqn.(1), y is the output of a neuron, g(·) denotes an activation function, h and ĥ are hidden values before and after batch normalization, and w and x are the kernel weight vector and the network input respectively. In BN, µB and σB represent the mean and standard deviation of h. They are estimated within a batch of samples for each neuron independently. γ is a scale parameter and β is a shift parameter. In what follows, Sec.1.1 overviews assumptions and main results, and Sec.1.2 presents relationships with previous work.

1.1 OVERVIEW OF RESULTS

We overview results in three aspects. First, Sec.2 decomposes BN into population normalization (PN) and gamma decay. To better understand BN, we treat a single-layer perceptron with ReLU activation function as an illustrative case. Despite the simplicity of this case, it is a building block of deep networks and has been widely adopted in theoretical analyses such as proper initialization (Krogh & Hertz, 1992; Advani & Saxe, 2017), dropout (Wager et al., 2013), weight decay and data augmentation (Bös, 1998). The results in Sec.2 can be extended to deep neural networks as presented in Appendix C.4. Our analyses assume that neurons at the BN layer are independent, similar to (Salimans & Kingma, 2016; van Laarhoven, 2017; Teye et al., 2018), as the mean and the variance of BN are estimated individually for each neuron of each layer.
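Referring back to Eqn.(1), the following minimal NumPy sketch (our own illustration, not code from the paper) makes the basic block concrete: it computes h = wᵀx for every sample in a batch, normalizes by the batch statistics µB and σB, applies the scale and shift (γ, β), and passes the result through a ReLU.

```python
import numpy as np

def bn_block_forward(X, w, gamma, beta, eps=1e-5):
    """Single-neuron basic block of Eqn.(1): kernel layer -> BN -> ReLU.

    X:     (M, N) batch of inputs (M samples, N input dimensions)
    w:     (N,)   kernel weight vector
    gamma: scale parameter, beta: shift parameter
    """
    h = X @ w                                    # h = w^T x for each sample
    mu_B = h.mean()                              # batch mean of h
    sigma_B = h.std()                            # batch standard deviation of h
    h_hat = gamma * (h - mu_B) / (sigma_B + eps) + beta
    y = np.maximum(h_hat, 0.0)                   # ReLU activation g(.)
    return y, (h, mu_B, sigma_B)

# toy usage
rng = np.random.default_rng(0)
M, N = 32, 64
X = rng.normal(scale=1.0 / np.sqrt(N), size=(M, N))
w = rng.normal(size=N)
y, _ = bn_block_forward(X, w, gamma=1.0, beta=0.0)
print(y.shape)  # (32,)
```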
The form of regularization in this study does not rely on The first three authors contribute equally. Corresponding to pluo.lhi@gmail.com, {wangxinjiang, pengzhanglin}@sensetime.com, weqish@link.cuhk.edu.hk. Published as a conference paper at ICLR 2019 Gaussian assumption on the network input and the weight vector, meaning our assumption is milder than those in (Yoshida et al., 2017; Ba et al., 2016; Salimans & Kingma, 2016). Sec.2 tells us that BN has an explicit regularization form, gamma decay, where µB and σB have different impacts: (1) µB discourages reliance on a single neuron and encourages different neurons to have equal magnitude, in the sense that corrupting individual neuron does not harm generalization. This phenomenon was also found empirically in a recent work (Morcos et al., 2018), but has not been established analytically. (2) σB reduces kurtosis of the input distribution as well as correlations between neurons. (3) The regularization strengths of these statistics are inversely proportional to the batch size M, indicating that BN with large batch would decrease generalization. (4) Removing either one of µB and σB could imped convergence and generalization. Second, by using ordinary differential equations (ODEs), Sec.3 shows that gamma decay enables the network trained with BN to converge with large maximum learning rate and effective learning rate, compared to the network trained without BN or trained with weight normalization (WN) (Salimans & Kingma, 2016) that is a counterpart of BN. The maximum learning rate (LR) represents the largest LR value that allows training to converge to a fixed point without diverging, while effective LR represents the actual LR in training. Larger maximum and effective LRs imply faster convergence rate. Third, Sec.4 compares generalization errors of BN, WN, and vanilla SGD by using statistical mechanics. The large-scale regime is of interest, where number of samples P and number of neurons N are both large but their ratio P/N is finite. In this regime, the generalization errors are quantified both analytically and empirically. Numerical results in Sec.5 show that BN in CNNs has the same traits of regularization as disclosed above. 1.2 RELATED WORK Neural Network Analysis. Many studies conducted theoretical analyses of neural networks (Opper et al., 1990; Saad & Solla, 1996; Bs & Opper, 1998; Pennington & Bahri, 2017; Zhang et al., 2017b; Brutzkus & Globerson, 2017; Raghu et al., 2017; Mei et al., 2016; Tian, 2017). For example, for a multilayer network with linear activation function, Glorot & Bengio (2010) explored its SGD dynamics and Kawaguchi (2016) showed that every local minimum is global. Tian (2017) studied the critical points and convergence behaviors of a 2-layered network with Re LU units. Zhang et al. (2017b) investigated a teacher-student model when the activation function is harmonic. In (Saad & Solla, 1996), the learning dynamics of a committee machine were discussed when the activation function is error function erf(x). Unlike previous work, this work analyzes regularization emerged in BN and its impact to both learning and generalization, which are still unseen in the literature. Normalization. Many normalization methods have been proposed recently. For example, BN (Ioffe & Szegedy, 2015) was introduced to stabilize the distribution of input data of each hidden layer. 
Weight normalization (WN) (Salimans & Kingma, 2016) decouples the lengths of the network parameter vectors from their directions, by normalizing the parameter vectors to unit length. The dynamic of WN was studied by using a single-layer network (Yoshida et al., 2017). Li et al. (2018) diagnosed the compatibility of BN and dropout (Srivastava et al., 2014) by reducing the variance shift produced by them. Moreover, van Laarhoven (2017) showed that weight decay has no regularization effect when using together with BN or WN. Ba et al. (2016) demonstrated when BN or WN is employed, backpropagating gradients through a hidden layer is scale-invariant with respect to the network parameters. Santurkar et al. (2018) gave another perspective of the role of BN during training instead of reducing the covariant shift. They argued that BN results in a smoother optimization landscape and the Lipschitzness is strengthened in networks trained with BN. However, both analytical and empirical results of regularization in BN are still desirable. Our study explores regularization, optimization, and generalization of BN in the scenario of online learning. Regularization. Ioffe & Szegedy (2015) conjectured that BN implicitly regularizes training to prevent overfitting. Zhang et al. (2017a) categorized BN as an implicit regularizer from experimental results. Szegedy et al. (2015) also conjectured that in the Inception network, BN behaves similar to dropout to improve the generalization ability. Gitman & Ginsburg (2017) experimentally compared Published as a conference paper at ICLR 2019 BN and WN, and also confirmed the better generalization of BN. In the literature there are also implicit regularization schemes other than BN. For instance, random noise in the input layer for data augmentation has long been discovered equivalent to a weight decay method, in the sense that the inverse of the signal-to-noise ratio acts as the decay factor (Krogh & Hertz, 1992; Rifai et al., 2011). Dropout (Srivastava et al., 2014) was also proved able to regularize training by using the generalized linear model (Wager et al., 2013). 2 A PROBABILISTIC INTERPRETATION OF BN The notations in this work are summarized in Appendix Table 2 for reference. Training the above single-layer perceptron with BN in Eqn.(1) typically involves minimizing a negative log likelihood function with respect to a set of network parameters θ = {w, γ, β}. Then the loss function is defined by j=1 ℓ(ˆhj) = 1 j=1 log p(yj|ˆhj; θ) + ζ θ 2 2, (2) where p(yj|ˆhj; θ) represents the likelihood function of the network and P is number of training samples. As Gaussian distribution is often employed as prior distribution for the network parameters, we have a regularization term ζ θ 2 2 known as weight decay (Krizhevsky et al., 2012) that is a popular technique in deep learning, where ζ is a coefficient. To derive regularization of BN, we treat µB and σB as random variables. Since one sample x is seen many times in the entire training course, and at each time x is presented with the other samples in a batch that is drawn randomly, µB and σB can be treated as injected random noise for x. Prior of µB, σB. By following (Teye et al., 2018), we find that BN also induces Gaussian priors for µB and σB. We have µB N(µP, σ2 P M ) and σB N(σP , ρ+2 4M ), where M is batch size, µP and σP are population mean and standard deviation respectively, and ρ is kurtosis that measures the peakedness of the distribution of h. 
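The two priors stated above are easy to check by simulation. The sketch below is our own; it takes h standardized so that σP = 1, reads ρ as the excess kurtosis, draws many batches of size M, and compares the empirical variances of µB and σB against σP²/M and (ρ+2)/(4M). The agreement is asymptotic in M.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 32                       # batch size
trials = 200_000             # number of random batches

# pre-activations h for one neuron; standard normal => mu_P = 0, sigma_P = 1, excess kurtosis rho = 0
h_pool = rng.normal(size=(trials, M))

mu_B = h_pool.mean(axis=1)                    # batch means
sigma_B = h_pool.std(axis=1)                  # batch standard deviations

sigma_P, rho = 1.0, 0.0
print("Var(mu_B):    empirical %.5f  vs  sigma_P^2 / M  = %.5f"
      % (mu_B.var(), sigma_P**2 / M))
print("Var(sigma_B): empirical %.5f  vs  (rho+2)/(4M)   = %.5f"
      % (sigma_B.var(), (rho + 2) / (4 * M)))
```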
These priors tell us that µB and σB produce Gaussian noise in training. There is a tradeoff regarding this noise. For example, when M is small, training could diverge because the noise is large. This is supported by experiments on BN (Wu & He, 2018), where training diverges when M = 2 on ImageNet (Russakovsky et al., 2015). When M is large, the noise is small because µB and σB get close to µP and σP. It is known that M > 30 provides moderate noise, as the sample statistics converge in probability to the population statistics by the weak Law of Large Numbers. This is also supported by the experiment in (Ioffe & Szegedy, 2015), where BN with M = 32 already works well on ImageNet.

2.1 A REGULARIZATION FORM

The loss function in Eqn.(2) can be written as an expected loss by integrating over the priors of µB and σB, that is, $\frac{1}{P}\sum_{j=1}^{P}\mathbb{E}_{\mu_B,\sigma_B}[\ell(\hat{h}_j)]$, where E[·] denotes expectation. We show that µB and σB impose regularization on the scale parameter γ by decomposing BN into population normalization (PN) and gamma decay. To see this, we employ a single-layer perceptron with ReLU activation function as an illustrative example. A more rigorous description is provided in Appendix C.1.

Regularization of µB, σB. Let ℓ(ĥ) be the loss function defined in Eqn.(2) and ReLU be the activation function. We have
$$\frac{1}{P}\sum_{j=1}^{P}\mathbb{E}_{\mu_B,\sigma_B}\,\ell(\hat{h}_j) \approx \frac{1}{P}\sum_{j=1}^{P}\ell(\bar{h}_j) + \underbrace{\zeta(h)\gamma^2}_{\text{gamma decay}}, \quad \text{and} \quad \zeta(h) = \underbrace{\frac{\rho+2}{8M}\,I(\gamma)}_{\text{from }\sigma_B} + \underbrace{\frac{1}{2M}\,\frac{1}{P}\sum_{j=1}^{P}\sigma(\bar{h}_j)}_{\text{from }\mu_B}, \tag{3}$$
where $\bar{h}_j = \gamma\,\frac{h_j-\mu_P}{\sigma_P}+\beta$ and $h_j = \mathbf{w}^{\top}\mathbf{x}_j$ represent the computation of PN. ζ(h)γ² represents gamma decay, where ζ(h) is an adaptive decay factor that depends on the hidden value h. Moreover, ρ is the kurtosis of the distribution of h, I(γ) represents an estimate of the Fisher information of γ with $I(\gamma)=\frac{1}{P}\sum_{j=1}^{P}\big(\frac{\partial \ell(\hat{h}_j)}{\partial \gamma}\big)^2$, and σ(·) is the sigmoid function.

From Eqn.(3), we have several observations that have both theoretical and practical value. First, PN replaces the batch statistics µB, σB in BN by the population statistics µP, σP. In gamma decay, the computation of ζ(h) is data-dependent, which distinguishes it from weight decay, where the coefficient is determined manually. In fact, Eqn.(3) recasts the randomness of BN in a deterministic manner, not only enabling us to apply methodologies such as ODEs and statistical mechanics to analyze BN, but also inspiring us to imitate BN's performance by WN without computing batch statistics in the empirical study.

Second, PN is closely connected to WN, which is independent of the sample mean and variance. WN (Salimans & Kingma, 2016) is defined by $\upsilon\,\frac{\mathbf{w}^{\top}\mathbf{x}}{\|\mathbf{w}\|_2}$, which normalizes the weight vector w to unit length, where υ is a learnable parameter. Let each diagonal element of the covariance matrix of x be a and all the off-diagonal elements be zeros. Then $\bar{h}_j$ in Eqn.(3) can be rewritten as
$$\bar{h}_j = \gamma\,\frac{\mathbf{w}^{\top}\mathbf{x}_j-\mu_P}{\sigma_P}+\beta = \upsilon\,\frac{\mathbf{w}^{\top}\mathbf{x}_j}{\|\mathbf{w}\|_2}+b, \tag{4}$$
where $\upsilon=\gamma/\sqrt{a}$ and $b=\beta-\gamma\mu_P/(\sqrt{a}\|\mathbf{w}\|_2)$. Eqn.(4) removes the estimation of statistics and eases our analyses of regularization for BN.

Third, µB and σB produce different strengths in ζ(h). As shown in Eqn.(3), the strength from µB depends on the expectation of $\sigma(\bar{h}_j)\in[0,1]$, which represents excitation or inhibition of a neuron, meaning that a neuron with larger output may be exposed to larger regularization, encouraging different neurons to have equal magnitude. This is consistent with the empirical result of (Morcos et al., 2018), which prevented reliance on single neurons to improve generalization. The strength from σB works as a complement to µB.
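As a brief numerical aside on the second observation, the sketch below checks the reparameterization in Eqn.(4) under the stated assumption that x has diagonal covariance aI, using our reading of the constants, υ = γ/√a and b = β − γµP/(√a‖w‖₂); all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 64, 1000
a = 0.5                                         # diagonal covariance element of x
m = rng.normal(size=N)                          # (possibly nonzero) mean of x
X = rng.normal(loc=m, scale=np.sqrt(a), size=(P, N))
w = rng.normal(size=N)
gamma, beta = 1.3, 0.2

# population statistics of h = w^T x under x ~ N(m, a*I)
mu_P = w @ m
sigma_P = np.sqrt(a) * np.linalg.norm(w)

# population normalization (left-hand side of Eqn.(4))
h = X @ w
pn = gamma * (h - mu_P) / sigma_P + beta

# weight-normalization form (right-hand side of Eqn.(4))
upsilon = gamma / np.sqrt(a)
b = beta - gamma * mu_P / (np.sqrt(a) * np.linalg.norm(w))
wn = upsilon * h / np.linalg.norm(w) + b

print(np.max(np.abs(pn - wn)))   # ~1e-15: the two forms coincide
```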
For a single neuron, I(γ) represents the norm of gradient, implying that BN punishes large gradient norm. For multiple neurons, I(γ) is the Fisher information matrix of γ, meaning that BN would penalize correlations among neurons. Both σB and µB are important, removing either one of them would imped performance. Extensions to Deep Networks. The above results can be extended to deep networks as shown in Appendix C.4 by decomposing the expected loss at a certain hidden layer. We also demonstrate the results empirically in Sec.5, where we observe that CNNs trained with BN share similar traits of regularization as discussed above. 3 OPTIMIZATION WITH REGULARIZATION Now we show that BN converges with large maximum and effective learning rate (LR), where the former one is the largest LR when training converged, while the latter one is the actual LR during training. With BN, we find that both LRs would be larger than a network trained without BN. Our result explains why BN enables large learning rates used in practice (Ioffe & Szegedy, 2015). Our analyses are conducted in three stages. First, we establish dynamical equations of a teacherstudent model in the thermodynamic limit and acquire the fixed point. Second, we investigate the eigenvalues of the corresponding Jacobian matrix at this fixed point. Finally, we calculate the maximum and the effective LR. Teacher-Student Model. We first introduce useful techniques from statistical mechanics (SM). With SM, a student network is dedicated to learn relationship between a Gaussian input and an output by using a weight vector w as parameters. It is useful to characterize behavior of the student by using a teacher network with w as a ground-truth parameter vector. We treat single-layer perceptron as a student, which is optimized by minimizing the euclidian distance between its output and the supervision provided by a teacher without BN. The student and the teacher have identical activation function. Loss Function. We define a loss function of the above teacher-student model by 1 P PP j=1 ℓ(xj) = 1 P PP j=1 g(w Txj) g( w 2 ) 2 + ζγ2, where g(w Txj) represents supervision from the teacher, while g( w 2 ) is the output of student trained to mimic the teacher. This student is defined by following Eqn.(4) with ν = Nγ and the bias term is absorbed into w. The above loss function represents BN by using WN with gamma decay, and it is sufficient to study the learning rates of different approaches. Let θ = {w, γ} be a set of parameters updated by SGD, i.e. θj+1 = θj η ℓ(xj) θj where η denotes learning rate. The update rules for w and γ are wj+1 wj = ηδj(γj N wj 2 xj wj Txj wj 2 2 wj) and γj+1 γj = η(δj wj 2 ζγj), (5) Published as a conference paper at ICLR 2019 where wj denotes a normalized weight vector of the student, that is, wj = Nγj wj wj 2 , and δj = g ( wj Txj)[g(w Txj) g( wj Txj)] represents the gradient1 for clarity of notations. Order Parameters. As we are interested in the large-scale regime where both N and P are large and their ratio P/N is finite, it is difficult to examine a student with parameters in high dimensions directly. Therefore, we transform the weight vectors to order parameters that fully characterize interactions between the student and the teacher network. In this case, the parameter vector can be reparameterized by using a vector of three elements including γ, R, and L. In particular, γ measures length of the normalized weight vector w, that is, w T w = Nγ2 w Tw w 2 2 = Nγ2. 
The parameter R measures angle (overlapping ratio) between the weight vectors of student and teacher. We have R = w Tw w w = 1 Nγ w Tw , where the norm of the ground-truth vector is 1 N w Tw = 1. Moreover, L represents length of the original weight vector w and L2 = 1 Learning Dynamics. The update equations (5) can be transformed into a set of differential equations (ODEs) by using the above order parameters. This is achieved by treating the update step j as a continuous time variable t = j N . They can be turned into differential equations because the contiguous step t = 1 N approaches zero in the thermodynamic limit when N . We obtain a dynamical system of three order parameters dγ L2 I1 η2 γ2R 2L4 I2, and d L 2L3 I2, (6) where I1 = Ex[δ w Tx], I2 = Ex[δ2x Tx], and I3 = Ex[δw Tx] are defined to simplify notations. The derivations of Eqn.(6) can be found in Appendix C.5. 3.1 FIXED POINT OF THE DYNAMICAL SYSTEM (γ0, R0, L0) ηmax (R) ηeff (R) BN (γ0, 1, L0) WN (1, 1, L0) (I3 I1) η L2 0 SGD (1, 1, 1) (I3 I1) Table 1: Comparisons of fixed points, ηmax for R, and ηefffor R. A fixed point is denoted as (γ0, R0, L0). To find the fixed points of (6), we set dγ/dt = d R/dt = d L/dt = 0. The fixed points of BN, WN, and vanilla SGD (without BN and WN) are given in Table 1. In the thermodynamic limit, the optima denoted as (γ0, R0, L0) would be (γ0, R0, L0) = (1, 1, 1). Our main interest is the overlapping ratio R0 between the student and the teacher, because it optimizes the direction of the weight vector regardless of its length. We see that R0 for all three approaches attain optimum 1 . Intuitively, in BN and WN, this optimal solution does not depend on the value of L0 because their weight vectors are normalized. In other words, WN and BN are easier to optimize than vanilla SGD, unlike SGD where both R0 and L0 have to be optimized to 1 . Furthermore, γ0 in BN depends on the activation function. For Re LU, we have γbn 0 = 1 2ζ+1 (see Proposition 1 in Appendix C.5), meaning that norm of the normalized weight vector relies on the decay factor ζ. In WN, we have γwn 0 = 1 as WN has no regularization on γ. 3.2 MAXIMUM AND EFFECTIVE LEARNING RATES With the above fixed points, we derive the maximum and the effective LR. Specifically, we analyze eigenvalues and eigenvectors of the Jacobian matrix corresponding to Eqn.(6). We are interested in the LR to approach R0. We find that this optimum value only depends on its corresponding eigenvalue denoted as λR. We have λR = I2 R ηγ0 2L2 0 (ηmax ηeff), where ηmax and ηeffrepresent the maximum and effective LR (proposition 2 in Appendix C.5), which are given in Table 1. We demonstrate that λR < 0 if and only if ηmax > ηeff, such that the fixed point R0 is stable for all approaches (proposition 3 in Appendix C.6). Moreover, it is also able to show that ηmax of BN (ηbn max) is larger than WN and SGD, enabling R to converge with a larger learning rate. For Re LU as an example, we find that ηbn max η{wn,sgd} max + 2ζ (proposition 4 in Appendix C.7). The larger maximum LRs enables the network to be trained more stably and has the potential to be combined with other stabilization techniques (Fagan & Iyengar, 2018) during optimization. The effective LRs shown in Table 1 are consistent with previous work (van Laarhoven, 2017). 1g (x) denotes the first derivative of g(x). 
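The teacher–student setup of this section can also be simulated directly. The sketch below is our own implementation, not the authors' code: it runs online SGD on the WN+gamma-decay loss (g(w̃ᵀx) − g(w*ᵀx))² + ζγ² with w̃ = √N γ w/‖w‖₂ and ReLU g, with the gradients written out by hand, and tracks the order parameters. For a small learning rate, R approaches 1 and γ fluctuates around the value 1/(2ζ+1) predicted by Proposition 1 in Appendix C.5.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500                        # input dimension (the analysis takes N -> infinity)
eta, zeta = 0.2, 0.1           # learning rate and gamma-decay coefficient
T = 500                        # normalized training time t = j / N
relu = lambda z: np.maximum(z, 0.0)

w_star = rng.normal(size=N)
w_star *= np.sqrt(N) / np.linalg.norm(w_star)    # teacher norm: (1/N) w*^T w* = 1
w = rng.normal(size=N)                           # student weight vector
upsilon = 0.5 * np.sqrt(N)                       # upsilon = sqrt(N) * gamma, gamma_init = 0.5

gammas = []
for j in range(T * N):
    x = rng.normal(scale=1.0 / np.sqrt(N), size=N)      # x ~ N(0, I/N)
    norm_w = np.linalg.norm(w)
    proj = (w @ x) / norm_w
    u = upsilon * proj                                   # student pre-activation w_tilde^T x
    err = relu(u) - relu(w_star @ x)                     # student minus teacher
    du = float(u > 0)                                    # ReLU derivative
    # per-sample loss: (g(u) - g(w*^T x))^2 + zeta * (upsilon / sqrt(N))^2
    grad_upsilon = 2 * err * du * proj + 2 * zeta * upsilon / N
    grad_w = 2 * err * du * upsilon / norm_w * (x - proj / norm_w * w)
    upsilon -= eta * grad_upsilon
    w -= eta * grad_w
    gammas.append(upsilon / np.sqrt(N))

R = (w @ w_star) / (np.linalg.norm(w) * np.linalg.norm(w_star))
gamma_avg = np.mean(gammas[-(50 * N):])                  # average out SGD noise
print(f"R = {R:.3f}   gamma = {gamma_avg:.3f}   predicted 1/(2*zeta+1) = {1/(2*zeta+1):.3f}")
```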
Published as a conference paper at ICLR 2019 4 GENERALIZATION ANALYSIS Here we investigate generalization of BN by using a teacher-student model that minimizes a loss function 1 P PP j=1((y )j yj)2, where y represents the teacher s output and y is the student s output. We compare BN with WN+gamma decay and vanilla SGD. All of them share the same teacher network whose output is a noise-corrupted linear function y = w Tx + s, where x is drawn from N(0, 1 N ) and s is an unobserved Gaussian noise. We are interested to see how the above methods resist this noise by using student networks with both identity (linear) and Re LU activation functions. For vanilla SGD, the student is computed by y = g(w Tx) with g( ) being either identity or Re LU, and w being the weight vector to optimize, where w has the same dimension as w . The loss function of vanilla SGD is ℓsgd = 1 P PP j=1 y g(w Txj) 2. For BN, the student is defined as y = γ w Tx µB σB + β. As our main interest is the weight vector, we freeze the bias by setting β = 0. Therefore, the batch average term µB is also unnecessary to avoid additional parameters, and the loss function is written as ℓbn = 1 P PP j=1 (y )j γw Txj/σB 2. For WN+gamma decay, the student is computed similar to Eqn.(4) by using y = w 2 . Then the loss function is defined by P PP j=1 (y )j w 2 2 +ζ γ 2 2. With the above definitions, the three approaches are studied under the same teacher-student framework, where their generalization errors can be strictly compared with the other factors ruled out. 4.1 GENERALIZATION ERRORS Figure 1: (a) shows generalization error v.s. effective load α using a linear student (identity units). WN+gamma decay has two curves ζ = 1 2M and ζ = 0.25. BN is trained with M = 32. (b) shows generalization error v.s. effective load α using a Re LU student. WN+gamma decay has ζ = 1 4M and is compared to BN with batch size M = 32. The theoretical curve for vanilla SGD is also shown in blue. The red line is the generalization error of vanilla SGD with no noise in the teacher and thus serves as a lower bound. We provide closed-form solutions of the generalization errors (see Appendix D.1) for vanilla SGD with both linear and Re LU student networks. The theoretical solution of WN+gamma decay can also be solved for the linear student, but still remains difficult for Re LU student whose numerical verification is provided instead. Both vanilla SGD and WN+gamma decay are compared with numerical solutions of BN. vanilla SGD. In an identity (linear) student, the solution of generalization error depends on the rank of correlation matrix Σ = x Tx. Here we define an effective load α = P/N that is the ratio between number of samples P and number of input neurons N (number of learnable parameters). The generalization error of the identity student is denoted as ϵsgd id , which can be acquired by using the distribution of eigenvalues of Σ following (Advani & Saxe, 2017). If α < 1, ϵsgd id = 1 α + αS/(1 α). Otherwise, ϵsgd id = S/(α 1) where S is the variance of the injected noise to the teacher network. The values of ϵsgd id with respect to α are plotted in blue curve of Fig.1(a). It first decreases but then increases as α increases from 0 to 1. ϵsgd id diverges at α = 1. And it would decrease again when α > 1. In a Re LU student, the nonlinear activation yields difficulty to derive the theoretical solution. Here we utilize the statistical mechanics and calculate that ϵsgd relu = 1 α/4 + αS 2(2 α) and α < 2 (see Appendix D.2). 
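The closed-form errors quoted above are straightforward to tabulate. The helper below uses our naming and our reading of the formulas, namely ε_id = 1 − α + αS/(1−α) for α < 1, ε_id = S/(α−1) for α > 1, and ε_relu = 1 − α/4 + αS/(2(2−α)) for α < 2, and evaluates them as functions of the effective load α for a given teacher-noise variance S.

```python
import numpy as np

def eps_sgd_identity(alpha, S):
    """Generalization error of the linear (identity) student trained with vanilla SGD."""
    alpha = np.asarray(alpha, dtype=float)
    return np.where(alpha < 1.0,
                    1.0 - alpha + alpha * S / (1.0 - alpha),   # alpha < 1 regime
                    S / (alpha - 1.0))                         # alpha > 1 regime

def eps_sgd_relu(alpha, S):
    """Generalization error of the ReLU student trained with vanilla SGD (valid for alpha < 2)."""
    alpha = np.asarray(alpha, dtype=float)
    return 1.0 - alpha / 4.0 + alpha * S / (2.0 * (2.0 - alpha))

S = 0.25                                    # variance of the noise injected into the teacher
alphas = np.array([0.25, 0.5, 0.9, 1.5, 1.9])
print(eps_sgd_identity(alphas, S))          # diverges as alpha -> 1
print(eps_sgd_relu(alphas, S))              # diverges as alpha -> 2
```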
When comparing to the lower bound (trained without noisy supervision) shown as the red curve in Fig.1(b), we see that ϵsgd relu (blue curve) diverges at α = 2. This is because the student overfits the noise in the teacher s output. The curve of numerical solution is also plotted in dashed line in Fig.1(b) and it captures the diverging trend well. It should be noted that obtaining the theoretical curve empirically requires an infinitely long time of training and an infinitely small learning rate. This unreachable limit explains the discrepancies between the theoretical and the numerical solution. Published as a conference paper at ICLR 2019 WN+gamma decay. In a linear student, the gamma decay term turns the correlation matrix to Σ = x Tx + ζI , which is positive definite. Following statistical mechanics (Krogh & Hertz, 1992), the generalization error is ϵwn id = δ2 (ζG) ζ where G = 1 α ζ + ζ + (1 + α)2 1 2 ζ + (1 α)2 1 2 2ζ. We see that ϵwn id can be computed quantitatively given the values of ζ and α. Let the variance of noise injected to the teacher be 0.25. Fig.1(a) shows that no other curves could outperform the red curve when ζ = 0.25, a value equal to the noise magnitude. The ζ smaller than 0.25 (green curve ζ = 1 2M and M = 32) would exhibit overtraining around α = 1, but they still perform significantly better than vanilla SGD. For the Re LU student in Fig,1(b), a direct solution of the generalization error ϵwn relu remains an open problem. Therefore, the numerical results of WN+gamma decay (green curve) are run at each α value. It effectively reduces over-fitting compared to vanilla SGD. Numerical Solutions of BN. In the linear student, we employ SGD with M = 32 to find solutions of w for BN. The number of input neurons is 4096 and the number of training samples can be varied to change α. The results are marked as black squares in Fig.1(a). After applying the analyses for linear student (Appendix C.3), BN is equivalent to WN+gamma decay when ζ = 1 2M (green curve). It is seen that BN gets in line with the curve ζ = 1/2M (M = 32) and thus quantitatively validates our derivations. In the Re LU student, the setting is mostly the same as the linear case, except that we employ a smaller batch size M = 16. The results are shown as black squares in Fig.1(b). For Re LU units, the equivalent ζ of gamma decay is ζ = 1 4M . If one compares the generalization error of BN with WN+gamma decay (green curve), a clear correspondence is found, which also validates the derivations for the Re LU activation function. 5 EXPERIMENTS IN CNNS This section shows that BN in CNNs follows similar traits of regularization as the above analyses. To compare different methods, the CNN architectures are fixed while only the normalization layers are changed. We adopt CIFAR10 (Krizhevsky, 2009) that contains 60k images of 10 categories (50k images for training and 10k images for test). All models are trained by using SGD with momentum, while the initial learning rates are scaled proportionally (Goyal et al., 2017) when different batch sizes are presented. More empirical setting can be found in Appendix B. Evaluation of PN+Gamma Decay. This work shows that BN can be decomposed into PN and gamma decay. We empirically compare PN+gamma decay with BN by using Res Net18 (He et al., 2016). For PN+gamma decay , the population statistics of PN and the decay factor of gamma decay are estimated by using sufficient amount of training samples. 
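As a small illustration of this estimation step, the sketch below (our own; the paper uses 10k training samples for the estimate) computes per-neuron population statistics from a large pool of pre-activations and applies PN with them in place of the batch statistics.

```python
import numpy as np

def estimate_population_stats(H):
    """Estimate per-neuron population statistics from a large pool of pre-activations.

    H: (S, C) array of pre-activations, S samples (S >> batch size) and C neurons/channels.
    Returns (mu_P, sigma_P), each of shape (C,).
    """
    return H.mean(axis=0), H.std(axis=0)

def pn_forward(h, mu_P, sigma_P, gamma, beta, eps=1e-5):
    """Population normalization: like BN, but with fixed population statistics."""
    return gamma * (h - mu_P) / (sigma_P + eps) + beta

# toy usage: 10k pooled samples, 8 neurons
rng = np.random.default_rng(0)
H_pool = rng.normal(loc=0.3, scale=2.0, size=(10_000, 8))
mu_P, sigma_P = estimate_population_stats(H_pool)
h_batch = rng.normal(loc=0.3, scale=2.0, size=(128, 8))
out = pn_forward(h_batch, mu_P, sigma_P, gamma=np.ones(8), beta=np.zeros(8))
print(out.mean(axis=0).round(2), out.std(axis=0).round(2))   # approximately 0 and 1
```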
For BN, BN trained with a normal batch size M = 128 is treated as baseline as shown in Fig.2(a&b). We see that when batch size increases, BN would imped both loss and accuracy. For example, when increasing M to 1024, performance decreases because the regularization from the batch statistics reduces in large batch, resulting in overtraining (see the gap between train and validation loss in (a) when M = 1024). In comparison, we train PN by using 10k training samples to estimate the population statistics. Note that this further reduces regularization. We see that the release of regularization can be complemented by gamma decay, making PN outperformed BN. This empirical result verifies our derivation of regularization for BN. Similar trend can be observed by experiment in a down-sampled version of Image Net (see Appendix B.1). We would like to point out that PN+gamma decay is of interest in theoretical analyses, but it is computation-demanding when applied in practice because evaluating µP, σP and ζ(h) may require sufficiently large number of samples. Comparisons of Regularization. We study the regulation strengths of vanilla SGD, BN, WN, WN+mean-only BN, and WN+variance-only BN. At first, the strength of regularization terms from both µB and σB are compared by using a simpler network with 4 convolutional and 2 fully connected layers as used in (Salimans & Kingma, 2016). Fig.2(c&d) compares their training and validation losses. We see that the generalization error of BN is much lower than WN and vanilla SGD. The reason has been disclosed in this work: stochastic behaviors of µB and σB in BN improves generalization. To investigate µB and σB individually, we decompose their contributions by running a WN with mean-only BN as well as a WN with variance-only BN, to simulate their respective regularization. Published as a conference paper at ICLR 2019 Figure 2: (a) & (b) compare the loss (both training and evaluation) and validation accuracy between BN and PN on CIFAR10 using a Res Net18 network; (c) & (d) compare the training and validation loss curve with WN + mean-only BN and WN + variance-only BN; (e) & (f) validate the regularization effect of BN on both γ2 and the validation loss with different batch sizes; (g) & (h) show the loss and top-1 validation accuracy of Res Net18 with additional regularization (dropout) on large-batch training of BN and WN. As shown in Fig.2(c&d), improvements from the mean-only and the variance-only BN over WN verify our conclusion that noises from µB and σB have different regularization strengths. Both of them are essential to produce good result. Regularization and parameter norm. We further demonstrate impact of BN to the norm of parameters. We compare BN with vanilla SGD. A network is first trained by BN in order to converge to a local minima where the parameters do not change much. At this local minima, the weight vector is frozen and denoted as wbn. Then this network is finetuned by using vanilla SGD with a small learning rate 10 3 and its kernel parameters are initialized by wsgd = γ wbn σ , where σ is the moving average of σB. Fig.4 in Appendix B.2 visualizes the results. As µB and σB are removed in vanilla SGD, it is found that the training loss decreases while the validation loss increases, implying that reduction in regularization makes the network converged to a sharper local minimum that generalizes less well. 
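Returning to the mean-only and variance-only ablation above, the sketch below gives our own formulation of the forward passes only (no running statistics or backward pass): a WN layer followed by a partial normalizer, where the mean-only variant keeps only the µB noise source and the variance-only variant keeps only the σB noise source.

```python
import numpy as np

def mean_only_bn(h, gamma, beta):
    """Mean-only BN: subtract the batch mean but do not divide by the batch std."""
    return gamma * (h - h.mean(axis=0)) + beta

def variance_only_bn(h, gamma, beta, eps=1e-5):
    """Variance-only BN: divide by the batch std but do not subtract the batch mean."""
    return gamma * h / (h.std(axis=0) + eps) + beta

def weight_norm(X, V, g):
    """Weight normalization: g * X V / ||V|| per output unit (Salimans & Kingma, 2016)."""
    return g * (X @ V) / np.linalg.norm(V, axis=0)

# usage: a WN layer followed by one of the partial normalizers, as in the ablation of Sec.5
rng = np.random.default_rng(0)
X = rng.normal(size=(128, 64))                  # a batch of 128 inputs
V = rng.normal(size=(64, 16))                   # unnormalized weights for 16 units
g = np.ones(16)
h = weight_norm(X, V, g)
out_mean = mean_only_bn(h, gamma=np.ones(16), beta=np.zeros(16))
out_var = variance_only_bn(h, gamma=np.ones(16), beta=np.zeros(16))
print(out_mean.shape, out_var.shape)            # (128, 16) (128, 16)
```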
The magnitudes of kernel parameters wsgd at different layers are also observed to increase after freezing BN, due to the release of regularization on these parameters. Batch size. To study BN with different batch sizes, we train different networks but only add BN at one layer at a time. The regularization on the γ parameter is compared in Fig.2(e) when BN is located at different layers. The values of γ2 increase along with the batch size M due to the weaker regularization for the larger batches. The increase of γ2 also makes all validation losses increased as shown in Fig.2(f). BN and WN trained with dropout. As PN and gamma decay requires estimating the population statistics that increases computations, we utilize dropout as an alternative to improve regularization of BN. We add a dropout after each BN layer. Fig.2(g&h) plot the classification results using Res Net18. The generalization of BN deteriorates significantly when M increases from 128 to 1024. This is observed by the much higher validation loss (Fig.2(g)) and lower validation accuracy (Fig.2(h)) when M = 1024. If a dropout layer with ratio 0.1 is added after each residual block layer for M = 1024 in Res Net18, the validation loss is suppressed and accuracy increased by a great margin. This superficially contradicts with the original claim that BN reduces the need for dropout (Ioffe & Szegedy, 2015). As discussed in Appendix B.3, we find that there are two differences between our study and previous work (Ioffe & Szegedy, 2015). Fig.2(g&h) also show that WN can also be regularized by dropout. We apply dropout after each WN layer with ratio 0.2 and the dropout is applied at the same layers as that for BN. We found that the improvement on both validation accuracy and loss is surprising. The accuracy increases from 0.90 to 0.93, even close to the results of BN. Nevertheless, additional regularization on WN still cannot make WN on par with the performance BN. In deep neural networks the distribution after each layer would be far from a Gaussian distribution, in which case WN is not a good substitute for PN. Published as a conference paper at ICLR 2019 6 CONCLUSIONS This work investigated an explicit regularization form of BN, which was decomposed into PN and gamma decay where the regularization strengths from µB and σB were explored. Moreover, optimization and generalization of BN with regularization were derived and compared with vanilla SGD, WN, and WN+gamma decay, showing that BN enables training to converge with large maximum and effective learning rate, as well as leads to better generalization. Our analytical results explain many existing empirical phenomena. Experiments in CNNs showed that BN in deep networks share the same traits of regularization. In future work, we are interested in analyzing optimization and generalization of BN in deep networks, which is still an open problem. Moreover, investigating the other normalizers such as instance normalization (IN) (Ulyanov et al., 2016) and layer normalization (LN) (Ba et al., 2016) is also important. Understanding the characteristics of these normalizers should be the first step to analyze some recent best practices such as whitening (Luo, 2017b;a), switchable normalization (Luo et al., 2019; 2018; Shao et al., 2019), and switchable whitening (Pan et al., 2019). Madhu S. Advani and Andrew M. Saxe. High-dimensional dynamics of generalization error in neural networks. ar Xiv:1710.03667 [physics, q-bio, stat], October 2017. URL http://arxiv.org/abs/1710.03667. ar Xiv: 1710.03667. 
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. ar Xiv:1607.06450, 2016. Chris M. Bishop. Training with Noise is Equivalent to Tikhonov Regularization. Neural Computation, 7 (1):108 116, January 1995. ISSN 0899-7667, 1530-888X. doi: 10.1162/neco.1995.7.1.108. URL http: //www.mitpressjournals.org/doi/10.1162/neco.1995.7.1.108. Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. In ICML, 2017. Siegfried B os. Statistical mechanics approach to early stopping and weight decay. Physical Review E, 58(1):833, 1998. Siegfried Bs and Manfred Opper. Dynamics of batch training in a perceptron. In Journal of Physics A: Mathematical and General, volume 31(21), pp. 4835, 1998. Francois Fagan and Garud Iyengar. Robust Implicit Backpropagation. In ar Xiv:1808.02433, 2018. Igor Gitman and Boris Ginsburg. Comparison of Batch Normalization and Weight Normalization Algorithms for the Large-scale Image Classification. ar Xiv:1709.08145 [cs], September 2017. URL http://arxiv. org/abs/1709.08145. ar Xiv: 1709.08145. Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010. Priya Goyal, Piotr Doll ar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, Large Minibatch SGD: Training Image Net in 1 Hour. ar Xiv preprint ar Xiv:1706.02677, 2017. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015. Kenji Kawaguchi. Deep learning without poor local minima. In NIPS, 2016. Alex Krizhevsky. Learning multiple layers of features from tiny images. In Technical Report, 2009. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012. Published as a conference paper at ICLR 2019 Anders Krogh and John A. Hertz. Generalization in a linear perceptron in the presence of noise. Journal of Physics A: Mathematical and General, 25(5):1135, 1992. Xiang Li, Shuo Chen, Xiaolin Hu, and Jian Yang. Understanding the disharmony between dropout and batch normalization by variance shift. In ar Xiv:1801.05134, 2018. Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In ar Xiv:1608.03983, 2016. Ping Luo. Eigennet: Towards fast and structural learning of deep neural networks. IJCAI, 2017a. Ping Luo. Learning deep architectures via generalized whitened neural networks. ICML, 2017b. Ping Luo, Zhanglin Peng, Jiamin Ren, and Ruimao Zhang. Do normalization layers in a deep convnet really need to be distinct? ar Xiv:1811.07727, 2018. Ping Luo, Jiamin Ren, Zhanglin Peng, Ruimao Zhang, and Jingyu Li. Differentiable learning-to-normalize via switchable normalization. ICLR, 2019. Stephan Mandt, Matthew D. Hoffman, and David M. Blei. Stochastic Gradient Descent as Approximate Bayesian Inference. ar Xiv:1704.04289 [cs, stat], April 2017. URL http://arxiv.org/abs/1704.04289. ar Xiv: 1704.04289. Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for non-convex losses. In ar Xiv:1607.06534, 2016. Ari S. Morcos, David G.T. Barrett, Neil C. Rabinowitz, and Matthew Botvinick. 
On the importance of single directions for generalization. In ICLR, 2018. M. Opper, W. Kinzel, J. Kleinz, and R. Nehl. On the ability of the optimal perceptron to generalise. In Journal of Physics A: Mathematical and General, volume 23(11), pp. 581, 1990. Xingang Pan, Xiaohang Zhan, Jianping Shi, Xiaoou Tang, and Ping Luo. Switchable whitening for deep representation learning. In ar Xiv:1904.09739, 2019. Jeffrey Pennington and Yasaman Bahri. Geometry of neural network loss surfaces via random matrix theory. In ICML, 2017. Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl Dickstein. On the expressive power of deep neural networks. In ICML, 2017. Salah Rifai, Xavier Glorot, Yoshua Bengio, and Pascal Vincent. Adding noise to the input of a model trained with a regularized objective. ar Xiv:1104.3250 [cs], April 2011. URL http://arxiv.org/abs/1104.3250. ar Xiv: 1104.3250. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. In ICJV, 2015. David Saad and Sara A. Solla. Dynamics of on-line gradient descent learning for multilayer neural networks. In NIPS, 1996. Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In ar Xiv:1602.07868, 2016. Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How Does Batch Normalization Help Optimization? ar Xiv:1805.11604 [cs, stat], May 2018. URL http://arxiv.org/abs/1805.11604. ar Xiv: 1805.11604. H. S. Seung, Haim Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8):6056, 1992. URL https://journals.aps.org/pra/abstract/10.1103/ Phys Rev A.45.6056. Wenqi Shao, Tianjian Meng, Jingyu Li, Ruimao Zhang, Yudian Li, Xiaogang Wang, and Ping Luo. Ssn: Learning sparse switchable normalization via sparsestmax. In CVPR, 2019. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. In Journal of Machine Learning Research, 2014. Published as a conference paper at ICLR 2019 Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the Inception Architecture for Computer Vision. ar Xiv:1512.00567 [cs], December 2015. URL http: //arxiv.org/abs/1512.00567. ar Xiv: 1512.00567. Mattias Teye, Hossein Azizpour, and Kevin Smith. Bayesian uncertainty estimation for batch normalized deep networks. In ICML, 2018. Yuandong Tian. An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. In ICML, 2017. Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. ar Xiv:1607.08022, 2016. Twan. van Laarhoven. L2 regularization versus batch and weight normalization. In ar Xiv:1706.05350, 2017. Stefan Wager, Sida Wang, and Percy Liang. Dropout Training as Adaptive Regularization. ar Xiv:1307.1493 [cs, stat], July 2013. URL http://arxiv.org/abs/1307.1493. ar Xiv: 1307.1493. Yuxin Wu and Kaiming He. Group normalization. ar Xiv:1803.08494, 2018. Yuki Yoshida, Ryo Karakida, Masato Okada, and Shun ichi Amari. Statistical mechanical analysis of online learning with weight normalization in single layer perceptron. In Journal of the Physical Society of Japan, 2017. 
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, , and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017a. Qiuyi Zhang, Rina Panigrahy, and Sushant. Sachdeva. Electron-proton dynamics in deep learning. In ar Xiv:1702.00458, 2017b. Published as a conference paper at ICLR 2019 A NOTATIONS Table 2: Several notations are summarized for reference. µB, σ2 B batch mean, batch variance µP, σ2 P population mean, population variance x, y input of a network, output of a network y ground truth of an output h, ˆh hidden value before and after BN h hidden value after population normalization γ, β scale parameter, shift parameter g( ) activation function w, w weight vector, ground truth weight vector w normalized weight vector M, N, P batch size, number of neurons, sample size α an effective load value α = P/N ζ regularization strength (coefficient) ρ Kurtosis of a distribution δ gradient of the activation function ηeff, ηmax effective, maximum learning rate R overlapping ratio (angle) between w and w L norm (length) of w λmax, λmin maximum, minimum eigenvalue ϵgen generalization error B MORE EMPIRICAL SETTINGS AND RESULTS All experiments in Sec.5 are conducted in CIFAR10 by using Res Net18 and a CNN architecture similar to (Salimans & Kingma, 2016) that is summarized as conv(3,32)-conv(3,32)-conv(3,64)- conv(3,64)-pool(2,2)-fc(512)-fc(10) , where conv(3,32) represents a convolution with kernel size 3 and 32 channels, pool(2,2) is max-pooling with kernel size 2 and stride 2, and fc indicates a full connection. We follow a configuration for training by using SGD with a momentum value of 0.9 and continuously decaying the learning rate by a factor of 10 4 each step. For different batch sizes, the initial learning rate is scaled proportionally with the batch size to maintain a similar learning dynamics (Goyal et al., 2017). B.1 RESULTS IN DOWNSAMPLED IMAGENET Besides CIFAR10, we also evaluate PN+gamma decay by employing a downsampled version of Image Net (Loshchilov & Hutter, 2016), which contains identical 1.2 million data and 1k categories as the original Image Net, but each image is scaled to 32 32. We train Res Net18 in downsampled Image Net by following the training protocol used in (He et al., 2016). In particular, Res Net18 is trained by using SGD with momentum of 0.9 and the initial learning rate is 0.1, which is then decayed by a factor of 10 after 30, 60, and 90 training epochs. In downsampled Image Net, we observe similar trends as those presented in CIFAR10. For example, we see that BN would imped both loss and accuracy when batch size increases. When increasing M to 1024 as shown in Fig.3, both the loss and validation accuracy decrease because the regularization from the random batch statistics reduces in large batch size, resulting in overtraining. This can be seen by the gap between the training and the validation loss. Nevertheless, we see that the reduction of regularization can be complemented when PN is trained with adaptive gamma decay, which makes PN performed comparably to BN in downsampled Image Net. B.2 IMPACT OF BN TO THE NORM OF PARAMETERS We demonstrate the impact of BN to the norm of parameters. We compare BN with vanilla SGD, where a network is first trained by BN in order to converge to a local minima when the parameters do not change much. At this local minima, the weight vector is frozen and denoted as wbn. 
Figure 3: Results of downsampled ImageNet. (a) plots training and evaluation loss; (b) shows validation accuracy, comparing BN (M=32), BN (M=1024), and PN+adaptive gamma decay. The models are trained on 8 GPUs.

Then this network is finetuned by using vanilla SGD with a small learning rate 10^-3, with the kernel parameters initialized by w_sgd = γ w_bn / σ̄, where σ̄ is the moving average of σB. Fig.4 below visualizes the results. As µB and σB are removed in vanilla SGD, it is found from the last two figures that the training loss decreases while the validation loss increases, meaning that the reduction in regularization makes the network converge to a sharper local minimum that generalizes less well. The magnitudes of the kernel parameters w_sgd at different layers are displayed in the first four figures. All of them increase after freezing BN, due to the release of regularization on these parameters.

Figure 4: Study of parameter norm. Vanilla SGD is finetuned from a network pretrained by BN on CIFAR10. The first four figures show the magnitude of the kernel parameters in different layers during finetuning, compared to the effective norm of BN defined as γw/σB. The last two figures compare the training and validation losses during finetuning.

B.3 BN AND WN WITH DROPOUT

BN+dropout. Despite the better generalization of BN with smaller batch sizes, large-batch training is more efficient in real cases. Therefore, improving the generalization of BN with large batches is more desirable. However, gamma decay requires estimating the population statistics, which increases computation. We also found that treating the decay factor as a constant hardly improves generalization for large batches. Therefore, we utilize dropout as an alternative to compensate for the insufficient regularization. Dropout has also been analytically viewed as a regularizer (Wager et al., 2013). We add a dropout after each BN layer to impose regularization. Fig.2(g&h) in the main paper plot the classification results using ResNet18. The generalization of BN deteriorates significantly when M increases from 128 to 1024. This is observed by the much higher validation loss (Fig.2(g)) and lower validation accuracy (Fig.2(h)) when M = 1024. If a dropout layer with ratio 0.1 is added after each residual block for M = 1024 in ResNet18, the validation loss is suppressed and the accuracy increases by a great margin. This superficially contradicts the original claim that BN reduces the need for dropout (Ioffe & Szegedy, 2015). We find that there are two differences between our study and (Ioffe & Szegedy, 2015). First, in the previous study the batch size was fixed at a quite small value (e.g. 32), at which the regularization was already quite strong. Therefore, an additional dropout could not provide better regularization, but on the contrary increased the instability in training and yielded a lower accuracy. However, our study explores relatively large batches that degrade the regularization of BN, and thus dropout with a small ratio can complement it.
Second, usual trials put dropout before BN and cause BN to have different variances during training and test. In contrast, dropout follows BN in this study and the distance between two dropout layers is large (a residual block separation), thus the problem can be alleviated. The improvement by applying dropout after BN has also been observed by a recent work (Li et al., 2018). WN+dropout. Since BN can be treated as WN trained with regularization as shown in this study, combining WN with regularization should be able to match the performance of BN. As WN outperforms BN in running speed (without calculating statistics) and it suits better in RNNs than BN, an improvement of its generalization is also of great importance. Fig.2(g&h) also show that WN can also be regularized by dropout. We apply dropout after each WN layer with ratio 0.2 and the dropout is applied at the same layers as that for BN. We found that the improvement on both validation accuracy and loss is surprising. The accuracy increases from 0.90 to 0.93, even close to the results of BN. Nevertheless, additional regularization on WN still cannot make WN on par with the performance BN. In deep neural networks the distribution after each layer would be far from a Gaussian distribution, in which case WN is not a good substitute for PN. A potential substibute of BN would require us for designing better estimations of the distribution to improve the training speed and performance of deep networks. C PROOF OF RESULTS C.1 PROOF OF EQN.(3) Theorem 1 (Regularization of µB, σB). Let a single-layer perceptron with BN and Re LU activation function be defined by y = max(0, ˆh), ˆh = γ h µB σB +β and h = w Tx, where x and y are the network input and output respectively, h and ˆh are the hidden values before and after batch normalization, and w is the weight vector. Let ℓ(ˆh) be the loss function. Then j=1 EµB,σBℓ(ˆhj) 1 j=1 ℓ( hj) + ζ(h)γ2 and ζ(h) = ρ + 2 8M I(γ) + 1 2M 1 P j=1 σ( hj), where hj = γ w Txj µP σP + β represents population normalization (PN), ζ(h)γ2 represents gamma decay and ζ(h) is a data-dependent decay factor. ρ is the kurtosis of the distribution of h, I(γ) is an estimation of the Fisher information of γ and I(γ) = 1 P PP j=1( ℓ(ˆhj) γ )2, and σ( ) is a sigmoid function. Proof. We have ˆhj = γ w T xj µB σB +β and hj = γ w T xj µP σP +β. We prove theorem 1 by performing a Taylor expansion on a function A(ˆhj) at hj, where A(ˆhj) is a function of ˆhj defined according to a particular activation function. The negative log likelihood function of the above single-layer perceptron can be generally defined as log p(yj|ˆhj) = A(ˆhj) yjˆhj, which is similar to the loss Published as a conference paper at ICLR 2019 function of the generalized linear models with different activation functions. Therefore, we have j=1 EµB,σB[l(ˆhj)] = 1 j=1 EµB,σB h A(ˆhj) yjˆhji j=1 (A( hj) yj hj) + 1 j=1 EµB,σB h yj(ˆhj hj) + A(ˆhj) A( hj) i j=1 l( hj) + 1 j=1 EµB,σB h (A ( hj) yj)(ˆhj hj) i 2 (ˆhj hj)2 j=1 l( hj) + Rf + Rq, where A ( ) and A ( ) denote the first and second derivatives of function A( ). The first and second order terms in the expansion are represented by Rf and Rq respectively. To derive the analytical forms of Rf and Rq, we take a second-order Taylor expansion of of 1 σB and 1 σ2 B around σP , it suffices to have 1 σB 1 σ2 P )(σB σP) + 1 σ3 P (σB σP)2 and 1 σ2 B 1 σ3 P )(σB σP) + 3 σ4 P (σB σP)2. By applying the distributions of µB and σB introduced in section 2, we have µB N(µP, σ2 P M ) and σB N(σP , ρ+2 4M ). 
Hence, Rf can be derived as in the paper, Rf can be derived as j=1 EµB,σB h (A ( hj) yj)(ˆhj hj) i (A ( hj) yj) γ w T xj µB σB γ w T xj µP (A ( hjyj) γw T xj 1 j=1 γ(A ( hj) yj)(w T xj µP)EσB 4M γ(A ( hj) yj)w T xj µP This Rf term can be understood as below. Let h = w T x µP σP and the distribution of the population data be pxy. We establish the following relationship E(x,y) pxy EµB,σB (A ( h) y)h = EµB,σBEx px Ey|x py|x (A ( h) y)h = EµB,σBEx px h (E [y|x] Ey|x py|x [y])h i Since the sample mean converges in probability to the population mean by the Weak Law of Large Numbers, for all ϵ > 0 and a constant number K ( K > 0 and P > K), we have Published as a conference paper at ICLR 2019 p Rf E(x,y) pxy EµB,σB (A ( h) y)h ρ+2 4M ϵ = 0. This equation implies that Rf is sufficiently small with a probability of 1 given moderately large number of data points P (the above inequality holds when P > 30). On the other hand, Rq can be derived as 2 (ˆhj hj)2 (γ w T xj µB σB + β γ w T xj µP (γw T xj)2( 1 σP )2 2γµPw T xj( 1 σP )2 + (µB (w T xj µP)2EµB,σB σP )2 + EµB,σB σP )2 ρ + 2 M (1 + 3(ρ + 2) Note that 2l( hj) γ2 = A ( hj)( w T xj µP σP )2, we have I(γ) = 1 P PP j=1 A ( hj)( w T xj µP σP )2 been an estimator of the Fisher information with respect to the scale parameter γ. Then, by neglecting O(1/M 2) high-order term in Rq, we get 8M I(γ)γ2 + µd2A where µd2A indicates the mean of the second derivative of A(h). The results of both Re LU activation function and identity function are provided as below. C.2 RELU ACTIVATION FUNCTION For the Re LU non-linear activation function, that is g(h) = max(h, 0), we use its continuous approximation softplus function g(h) = log(1+exp(h)) to derive the partition function A(h). In this case, we have µd2A = 1 P PP j=1 σ( hj). Therefore, we have ζ(h) = ρ+2 8M I(γ) + 1 2M 1 P PP j=1 σ( hj) as shown in Eqn.(3). C.3 LINEAR STUDENT NETWORK WITH IDENTITY ACTIVATION FUNCTION For a loss function with identity (linear) units, 1 P PP j=1 w Txj γ(w Txj µB)/σB 2, we have I(γ) = 2λ and ρ = 0 for Gaussian input distribution. The exact expression of Eqn.(3) is also possible for such linear regression problem. Under the condition of Gaussian input x N(0, 1/N), h = w Tx is also a random variable satisfying a normal distribution N(0, 1). It can be derived that 2 ) and E σ 2 B = M 2 1) Γ( M 1 2 ) . Therefore 1 + MΓ (M 3)/2 2M Γ (M 2)/2 Furthermore, the expression of ζ can be simplified as ζ = 3 4M . If the bias term is neglected in a simple linear regression, contributions from µB to the regularization term is neglected and thus ζ = 1 4M . Note that if one uses mean square error without being divided by 2 during linear regression, the values for ζ should be multiplied by 2 as well, where ζ = 1 2M . Published as a conference paper at ICLR 2019 C.4 BN REGULARIZATION IN A DEEP NETWORK The previous derivation is based on the single-layer perceptron. In deep neural networks, the forward computation inside one basic building block of a deep network is written by zl i = g(ˆhi), ˆhl i = γl i hl i (µB)l i (σB)l i + βl i and hl i = (wl i)Tzl 1, (7) where the superscript l [1, L] is the index of a building block in a deep neural network, and i [1, N l] indexes each neuron inside a layer. z0 and z L are synonyms of input x and output y, respectively. In order to analyze the regularization of BN from a specific layer, one needs to isolate its input and focus on the noise introduced by the BN layer in this block. Therefore, the loss function ℓ(ˆhl) can also be expanded at ℓ( hl). 
In BN, the batch variance is calculated with regard to each neuron under the assumption of mutual independence of neurons inside a layer. By following this assumption and the above derivation in Appendix C.1, the loss function with BN in deep networks can also be similarly decomposed. Regularization of µl B, σl B in a deep network. Let ζl be the strength (coefficient) of the regularization at the l-th layer. Then j=1 Eµl B,σl Bℓ (ˆhl)j 1 j=1 ℓ ( hl)j + i ζl i (γl i)2, and ζl i = 1 diag Hℓ( hl)j ρl i + 2 4M (wl i)T (zl 1)j (µP)l i (σP)l i + O(1/M 2), where i is the index of a neuron in the layer, ( hl i)j = γl i (wl i)T (zl 1)j (µP)l i (σP)l i +βl i represents population normalization (PN), Hℓ( hl) is the Hessian matrix at hl regarding to the loss ℓand diag( ) represents the diagonal vector of a matrix. It is seen that the above equation is compatible with the results from the single-layer perceptron. The main difference of the regularization term in a deep model is that the Hessian matrix is not guaranteed to be positive semi-definite during training. However, this form of regularization is also seen from other regularization such as noise injection (Rifai et al., 2011) and dropout (Wager et al., 2013), and has long been recognized as a Tikhonov regularization term (Bishop, 1995). In fact, it has been reported that in common neural networks, where convex activation functions such as Re LU and convex loss functions such as common cross entropy are adopted, the Hessian matrix Hℓ( hl) can be seen as locally positive semidefinite (Santurkar et al., 2018). Especially, as training converges to its mimimum training loss, the Hessian matrix of the loss can be viewed as positive semi-definite and thus the regularization term on γl is positive. C.5 DYNAMICAL EQUATIONS Here we discuss the dynamical equations of BN. Let the length of teacher s weight vector be 1, that is, 1 N w Tw = 1. We introduce a normalized weight vector of the student as ew = Nγ w w . Then the overlapping ratio between teacher and student, the length of student s vector, and the length of student s normalized weight vector are 1 N ew Tw = QR = γR, 1 N ew T ew = Q2 = γ2, and 1 N w Tw = L2 respectively, where Q = γ. And we have 1 N w Tw = LR. We transform update equations (5) by using order parameters. The update rule for variable Q2 can be obtained by Q2 j+1 Q2 j = 1 N 2ηδj wj Txj 2ηζ Q2 j following update rule of γ. Similarly, the update rules for variables RL and L2 are calculated as follow: RL j+1 RL j = 1 Lj δjw Txj ηRj Lj δj wj Txj , L2 j+1 L2 j = 1 (L2)j δj2xj Txj η2 N(L2)j δj2( wj Txj)2 . (8) Published as a conference paper at ICLR 2019 Let t = j N is a normalized sample index that can be treated as a continuous time variable. We have t = 1 N that approaches zero in the thermodynamic limit when N . In this way, the learning dynamic of Q2, RL and L2 can be formulated as the following differential equations: dt = 2ηI1 2ηζQ2, d RL where I1 = δ w Tx x, I2 = δ2x Tx x, and I3 = δw Tx x, which are the terms presented in d Q2 dt , and d L2 dt and x denotes expectation over the distribution of x. They are used to simplify notations. Note that we neglect the last term of d L2/dt in Eqn.(8) since η2 N(L2)δ2( w Tx)2 can be approximately equal to zero when N approaches infinity. On the other hand, we have d Q2 = 2Qd Q, d RL = Rd L + Ld R and d L2 = 2Ld L. Hence, Eqn.(9) can be reduced to L2 I1 η2 Q2R 2L4 I2, d L Proposition 1. Let (Q0, R0, L0) denote a fixed point with parameters Q, R and L of Eqn.(10). 
Proposition 1. Let $(Q_0,R_0,L_0)$ denote a fixed point of Eqn.(10) with respect to the parameters $Q$, $R$ and $L$. Assume that the learning rate $\eta$ is sufficiently small when training converges and $\mathbf{x}\sim N(0,\frac{1}{N}\mathbf{I})$. If the activation function $g$ is ReLU, then $Q_0=\frac{1}{2\zeta+1}$, $R_0=1$ and $L_0$ can be arbitrary.

Proof. First, $L$ has no influence on the output of the student model since $\mathbf{w}$ is normalized, which implies that if $(Q_0,R_0,L_0)$ is a fixed point of Eqn.(10), $L_0$ can be arbitrary. Besides, we have $\eta\gg\eta^2$ because the learning rate $\eta$ is sufficiently small. Therefore, the terms in Eqn.(10) proportional to $\eta^2$ can be neglected. If $(Q_0,R_0,L_0)$ is a fixed point, it suffices to have

$$\frac{\eta}{Q_0}I_1(Q_0,R_0)-\eta\zeta Q_0=0,\qquad(11)$$
$$\frac{\eta Q_0}{L_0^2}I_3(Q_0,R_0)-\frac{\eta R_0}{L_0^2}I_1(Q_0,R_0)=0.\qquad(12)$$

To calculate $I_1$ and $I_3$, we define $s=\tilde{\mathbf{w}}^T\mathbf{x}$ and $t=\mathbf{w}^{*T}\mathbf{x}$. Since $\mathbf{x}\sim N(0,\frac{1}{N}\mathbf{I})$, we have $(s,t)^T\sim N\Big(0,\begin{pmatrix}Q^2 & QR\\ QR & 1\end{pmatrix}\Big)$, so the probability measure of $[s,t]^T$ can be written as

$$DsDt=\frac{1}{2\pi Q\sqrt{1-R^2}}\exp\Big(-\frac{1}{2}[s,t]\begin{pmatrix}Q^2 & QR\\ QR & 1\end{pmatrix}^{-1}[s,t]^T\Big)\,ds\,dt.$$

Then

$$I_1=\big\langle g'(\tilde{\mathbf{w}}^T\mathbf{x})\big(g(\mathbf{w}^{*T}\mathbf{x})-g(\tilde{\mathbf{w}}^T\mathbf{x})\big)\,\tilde{\mathbf{w}}^T\mathbf{x}\big\rangle_x=\iint\big[g'(s)\big(g(t)-g(s)\big)s\big]\,DsDt=\frac{Q\big(\pi R+2\sqrt{1-R^2}+2R\arcsin(R)\big)}{4\pi}-\frac{Q^2}{2},\qquad(13)$$

$$I_3=\big\langle g'(\tilde{\mathbf{w}}^T\mathbf{x})\big(g(\mathbf{w}^{*T}\mathbf{x})-g(\tilde{\mathbf{w}}^T\mathbf{x})\big)\,\mathbf{w}^{*T}\mathbf{x}\big\rangle_x=\iint\big[g'(s)\big(g(t)-g(s)\big)t\big]\,DsDt=\frac{\pi+2R\sqrt{1-R^2}+2\arcsin(R)}{4\pi}-\frac{QR}{2}.\qquad(14)$$

By substituting Eqn.(13) and (14) into Eqn.(11) and (12), we obtain $Q_0=\frac{1}{2\zeta+1}$ and $R_0=1$.

Proposition 2. Given the conditions in Proposition 1, let $\lambda_Q^{bn}$ and $\lambda_R^{bn}$ be the eigenvalues of the Jacobian matrix at the fixed point $(Q_0,R_0,L_0)$ corresponding to the order parameters $Q$ and $R$ respectively in BN. Then

$$\lambda_Q^{bn}=\frac{\eta}{Q_0}\frac{\partial I_1}{\partial Q}-2\eta\zeta,\qquad\lambda_R^{bn}=\frac{\eta\gamma_0}{2L_0^2}\frac{\partial I_2}{\partial R}\big(\eta_{max}^{bn}-\eta_{eff}^{bn}\big),$$

where $\eta_{max}^{bn}$ and $\eta_{eff}^{bn}$ are the maximum and effective learning rates of BN, respectively.

Proof. At the fixed point $(Q_0,R_0,L_0)=(\frac{1}{2\zeta+1},1,L_0)$ obtained in Proposition 1, the Jacobian $J^{bn}$ of the dynamical equations of BN in Eqn.(10) can be derived, and its eigenvalues corresponding to $Q$ and $R$ can be obtained by inspection:

$$\lambda_Q^{bn}=\frac{\eta}{Q_0}\frac{\partial I_1}{\partial Q}-2\eta\zeta,\qquad\lambda_R^{bn}=\frac{\eta}{L_0^2}\Big(\frac{\partial(\gamma_0I_3-I_1)}{\partial R}-\zeta\gamma_0^2\Big)-\frac{\eta^2Q_0^2}{2L_0^4}\frac{\partial I_2}{\partial R}=\frac{\eta\gamma_0}{2L_0^2}\frac{\partial I_2}{\partial R}\big(\eta_{max}^{bn}-\eta_{eff}^{bn}\big).$$

Since $\gamma_0=Q_0$, we have $\eta_{max}^{bn}=\Big(\frac{1}{\gamma_0}\frac{\partial(\gamma_0I_3-I_1)}{\partial R}-\zeta\gamma_0\Big)\Big/\Big(\frac{1}{2}\frac{\partial I_2}{\partial R}\Big)$ and $\eta_{eff}^{bn}=\frac{\eta\gamma_0}{L_0^2}$.

C.6 STABLE FIXED POINTS OF BN

Proposition 3. Given the conditions in Proposition 1, when the activation function is ReLU, (i) $\lambda_Q^{bn}<0$, and (ii) $\lambda_R^{bn}<0$ iff $\eta_{max}^{bn}>\eta_{eff}^{bn}$.

Proof. When the activation function is ReLU, we have derived $I_1=\frac{Q(\pi R+2\sqrt{1-R^2}+2R\arcsin(R))}{4\pi}-\frac{Q^2}{2}$, which gives

$$\frac{\partial I_1}{\partial Q}=-Q+\frac{\pi R+2\sqrt{1-R^2}+2R\arcsin(R)}{4\pi}.$$

Therefore, at the fixed point of BN $(Q_0,R_0,L_0)=(\frac{1}{2\zeta+1},1,L_0)$, we have

$$\lambda_Q^{bn}=\eta\Big(\frac{1}{Q_0}\frac{\partial I_1}{\partial Q}-2\zeta\Big)=\eta\Big(\frac{1}{Q_0}\Big(-Q_0+\frac{1}{2}\Big)-2\zeta\Big)=\eta\Big(-1+\frac{1}{2Q_0}-2\zeta\Big)=-\eta\Big(\zeta+\frac{1}{2}\Big)<0.$$

Note that $\mathbf{x}^T\mathbf{x}$ approximately equals 1. We then have

$$I_2\approx\iint\big[g'(s)\big(g(t)-g(s)\big)\big]^2DsDt=\frac{\pi+2R\sqrt{1-R^2}+2\arcsin(R)}{4\pi}-\frac{Q\big(\pi R+2\sqrt{1-R^2}+2R\arcsin(R)\big)}{2\pi}+\frac{Q^2}{2}.$$

At the fixed point we have $\frac{\partial I_2}{\partial R}=-Q_0<0$. Therefore, since $\lambda_R^{bn}=\frac{\eta\gamma_0}{2L_0^2}\frac{\partial I_2}{\partial R}(\eta_{max}^{bn}-\eta_{eff}^{bn})$, we conclude that $\lambda_R^{bn}<0$ iff $\eta_{max}^{bn}>\eta_{eff}^{bn}$.

C.7 MAXIMUM LEARNING RATE OF BN

Proposition 4. When the activation function is ReLU, $\eta_{max}^{bn}\geq\eta_{max}^{\{wn,sgd\}}+2\zeta$, where $\eta_{max}^{bn}$ and $\eta_{max}^{\{wn,sgd\}}$ denote the maximum learning rates of BN, and of WN and vanilla SGD, respectively.

Proof. From the above results, $I_1=\frac{Q(\pi R+2\sqrt{1-R^2}+2R\arcsin(R))}{4\pi}-\frac{Q^2}{2}$, which gives $\frac{\partial I_1}{\partial R}\geq0$ at the fixed point of BN, and it has been derived that $\frac{\partial I_2}{\partial R}<0$. Furthermore, at the fixed point of BN, $Q_0=\gamma_0=\frac{1}{2\zeta+1}<1$. Then we have

$$\eta_{max}^{bn}=\frac{\frac{1}{\gamma_0}\frac{\partial(\gamma_0I_3-I_1)}{\partial R}-\zeta\gamma_0}{\frac{1}{2}\frac{\partial I_2}{\partial R}}=\frac{\frac{\partial(I_3-I_1)}{\partial R}}{\frac{1}{2}\frac{\partial I_2}{\partial R}}+\frac{\big(1-\frac{1}{\gamma_0}\big)\frac{\partial I_1}{\partial R}}{\frac{1}{2}\frac{\partial I_2}{\partial R}}-\frac{\zeta\gamma_0}{\frac{1}{2}\frac{\partial I_2}{\partial R}}\geq\frac{\frac{\partial(I_3-I_1)}{\partial R}}{\frac{1}{2}\frac{\partial I_2}{\partial R}}+2\zeta,$$

where the inequality holds because $\big(1-\frac{1}{\gamma_0}\big)\frac{\partial I_1}{\partial R}\big/\big(\frac{1}{2}\frac{\partial I_2}{\partial R}\big)$ is positive, and the last term equals $2\zeta$ since $\frac{\partial I_2}{\partial R}=-\gamma_0$ at the fixed point. Note that $\frac{\partial(I_3-I_1)}{\partial R}\big/\big(\frac{1}{2}\frac{\partial I_2}{\partial R}\big)$ is also defined as the maximum learning rate of WN and vanilla SGD in Yoshida et al. (2017). Hence, we conclude that $\eta_{max}^{bn}\geq\eta_{max}^{\{wn,sgd\}}+2\zeta$.
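The derivative facts used in Propositions 2–4 can be checked with finite differences. The sketch below (ours, not the paper's code) evaluates $\frac{\partial I_1}{\partial Q}$ and $\frac{\partial I_2}{\partial R}$ at the fixed point $(Q_0,R_0)=(\frac{1}{2\zeta+1},1)$ and the resulting $\lambda_Q^{bn}$; the values of $\eta$ and $\zeta$ are arbitrary illustrative choices.

```python
# A quick finite-difference sketch (ours) of the quantities used in Propositions 2-4:
# at the BN fixed point (Q0, R0) = (1/(2*zeta+1), 1) we check dI1/dQ = 1/2 - Q0,
# dI2/dR = -Q0, and hence lambda_Q = eta*(dI1/dQ / Q0 - 2*zeta) = -eta*(zeta + 1/2) < 0.
import numpy as np

def I1(Q, R):
    return Q * (np.pi * R + 2 * np.sqrt(1 - R**2) + 2 * R * np.arcsin(R)) / (4 * np.pi) - Q**2 / 2

def I2(Q, R):
    return (np.pi + 2 * R * np.sqrt(1 - R**2) + 2 * np.arcsin(R)) / (4 * np.pi) \
           - Q * (np.pi * R + 2 * np.sqrt(1 - R**2) + 2 * R * np.arcsin(R)) / (2 * np.pi) + Q**2 / 2

eta, zeta = 0.1, 0.1
Q0, R0, eps = 1.0 / (2 * zeta + 1), 1.0, 1e-8

dI1_dQ = (I1(Q0 + eps, R0) - I1(Q0 - eps, R0)) / (2 * eps)   # central difference in Q
dI2_dR = (I2(Q0, R0) - I2(Q0, R0 - eps)) / eps               # one-sided difference at R0 = 1

print(f"dI1/dQ   = {dI1_dQ:.4f}  vs  1/2 - Q0        = {0.5 - Q0:.4f}")
print(f"dI2/dR   = {dI2_dR:.4f}  vs  -Q0             = {-Q0:.4f}")
print(f"lambda_Q = {eta * (dI1_dQ / Q0 - 2 * zeta):.4f}  vs  -eta*(zeta+1/2) = {-eta * (zeta + 0.5):.4f}")
```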
D PROOFS REGARDING GENERALIZATION AND STATISTICAL MECHANICS (SM)

In this section, we build an analytical model for the generalization ability of a single-layer network. The framework is based on the teacher-student model, where the teacher output $y^*=g^*(\mathbf{w}^{*T}\mathbf{x}+s)$ is learned by a student network. The weight vector of the teacher satisfies $\frac{1}{N}\mathbf{w}^{*T}\mathbf{w}^*=1$, and the bias term $s$ is a random variable $s\sim N(0,S)$, fixed for each training example $\mathbf{x}$, representing static observation errors in the training data. In the generalization analysis, the input is assumed to be drawn from $\mathbf{x}\sim N(0,\frac{1}{N}\mathbf{I})$. The output of the student can be written in a similar form $y=g(\tilde{\mathbf{w}}^T\mathbf{x})$, where the activation function $g(\cdot)$ can be either linear or ReLU in the analysis, and $\tilde{\mathbf{w}}$ is a general weight vector which applies to either WN or a common linear perceptron. Here we take WN as an example, since it has been derived in this study that BN can be decomposed into WN with a regularization term on $\gamma$. In WN, $\tilde{\mathbf{w}}=\sqrt{N}\gamma\frac{\mathbf{w}}{\|\mathbf{w}\|_2}$, and we define the same order parameters as in the previous section: $\gamma^2=\frac{1}{N}\tilde{\mathbf{w}}^T\tilde{\mathbf{w}}$ and $\gamma R=\frac{1}{N}\tilde{\mathbf{w}}^T\mathbf{w}^*$.

D.1 GENERALIZATION ERROR

Since the learning task is a regression problem with the teacher output corrupted by Gaussian noise, it is natural to use the average mean square error $\epsilon_t=\frac{1}{2P}\sum_j(y_j^*-y_j)^2$ as the training loss. The generalization error is defined as the expectation over the distribution of the input $\mathbf{x}$ and is written as

$$\epsilon_{gen}(\tilde{\mathbf{w}})=\big\langle(y^*-y)^2\big\rangle_x,$$

where $\langle\cdot\rangle_x$ denotes an average over the distribution of $\mathbf{x}$. The generalization error is a function of the weight vectors and can be converted to a function only of the aforementioned order parameters; a detailed derivation can be found in (Bös, 1998; Krogh & Hertz, 1992):

$$\epsilon_{gen}(\gamma,R)=\iint Dh_1Dh_2\Big[g^*(h_1)-g\big(\gamma Rh_1+\gamma\sqrt{1-R^2}\,h_2\big)\Big]^2,\qquad(17)$$

where $h_1$ and $h_2$ are variables drawn from the standard Gaussian distribution and $Dh_1:=N(0,1)\,dh_1$. When both the teacher network and the student network have a linear activation function, the above integration can be easily solved and

$$\epsilon_{gen}(\gamma,R)=1+\gamma^2-2\gamma R.\qquad(18)$$

For the case where the teacher is linear and the student has a ReLU activation, it can still be solved, first by decomposing the loss function:

$$\epsilon_{gen}(\gamma,R)=\iint Dh_1Dh_2\Big[h_1-g\big(\gamma Rh_1+\gamma\sqrt{1-R^2}\,h_2\big)\Big]^2=\iint Dh_1Dh_2\Big[h_1^2+g\big(\gamma Rh_1+\gamma\sqrt{1-R^2}\,h_2\big)^2-2h_1g\big(\gamma Rh_1+\gamma\sqrt{1-R^2}\,h_2\big)\Big]=1+\frac{\gamma^2}{2}-2\iint Dh_1Dh_2\,h_1\,g\big(\gamma Rh_1+\gamma\sqrt{1-R^2}\,h_2\big).$$

It should be noted that the last two terms should only be integrated over the half space $\gamma Rh_1+\gamma\sqrt{1-R^2}\,h_2>0$. Therefore, defining the angle between this boundary line and the $h_2$ axis as $\theta_0=\arccos(R)$, the remaining integration can be transformed to polar coordinates $(r,\theta)$, in which the integrand of the last term becomes $\gamma Rr^2\sin^2(\theta)+\gamma\sqrt{1-R^2}\,r^2\cos(\theta)\sin(\theta)$, integrated over the corresponding half plane.

D.2 EQUILIBRIUM ORDER PARAMETERS

Following studies on statistical mechanics, the learning process of a neural network resembles a Langevin process (Mandt et al., 2017), and at equilibrium the network parameters $\theta$ follow a Gibbs distribution; that is, weight vectors that yield lower training error have higher probability. We have $p(\theta)=Z^{-1}\exp\{-\beta\epsilon_t(\theta;\mathbf{x})\}$, where $\beta=1/T$ and $T$ is the temperature, representing the variance of noise during training and implicitly controlling the learning process. $\epsilon_t(\theta;\mathbf{x})$ is the energy term given by the training loss, $Z=\int dP(\theta)\exp\{-\epsilon_t(\theta;\mathbf{x})/T\}$ is the partition function, and $P(\theta)$ is a prior distribution.
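The generalization-error integral of Eqn.(17) and the linear closed form of Eqn.(18) above can be cross-checked with a short Monte Carlo sketch (ours, not the paper's code); the values of $\gamma$ and $R$ are arbitrary illustrative choices, and for the ReLU student we simply report the estimated value since no closed form is written out above.

```python
# Monte Carlo sketch (ours) of the generalization error integral in Eqn.(17):
# for a linear teacher and a linear student it should reproduce Eqn.(18),
# eps_gen = 1 + gamma^2 - 2*gamma*R; for a ReLU student we just report the value.
import numpy as np

rng = np.random.default_rng(0)
gamma, R, n = 0.8, 0.6, 2_000_000                  # illustrative order parameters

h1 = rng.standard_normal(n)                        # teacher field
h2 = rng.standard_normal(n)
student_field = gamma * R * h1 + gamma * np.sqrt(1 - R**2) * h2

eps_linear = np.mean((h1 - student_field) ** 2)                    # linear student
eps_relu   = np.mean((h1 - np.maximum(student_field, 0.0)) ** 2)   # ReLU student

print(f"linear student: MC {eps_linear:.4f}  vs  closed form {1 + gamma**2 - 2*gamma*R:.4f}")
print(f"ReLU student  : MC {eps_relu:.4f}")
```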
Instead of directly minimizing the energy term above, statistical mechanics finds the minima of the free energy $f$, which is a function of $T$, taking into account the fluctuations of $\theta$ at finite temperature. We have $-\beta f=\langle\ln Z\rangle_x$. By substituting the parameters that minimize $f$ back into the generalization errors calculated above, we are able to compute the averaged generalization error at a certain temperature. The SM solution requires differentiating $f$ with respect to the order parameters.

In general, the expression of the free energy under the replica theory is (Seung et al., 1992)

$$-\beta f=\frac{\gamma^2-\gamma^2R^2}{2(q^2-\gamma^2)}+\frac{1}{2}\ln(q^2-\gamma^2)+\alpha\iint Dh_1Dh_2\ln\int Dh_3\exp\Big(-\frac{\beta}{2}(g-g^*)^2\Big),\qquad(19)$$
$$g:=g\big(\gamma Rh_1+\sqrt{\gamma^2-\gamma^2R^2}\,h_2+\sqrt{q^2-\gamma^2}\,h_3\big),\qquad g^*:=g^*(h_1+s).\qquad(20)$$

In the above expression, $h_1$, $h_2$, $h_3$ are three independent variables following the standard Gaussian distribution, $\alpha=P/N$ is the ratio of the number of training samples $P$ to the number of unknown parameters $N$ in the network, $R=\frac{1}{\sqrt{N}}\frac{\mathbf{w}^T\mathbf{w}^*}{\|\mathbf{w}\|_2}$, and $q$ is the prior value of $\gamma$. The above equation can be utilized for a general SM solution of a network. However, the solution is notoriously difficult to obtain, and only a few linear settings for the student network have closed-form solutions (Bös, 1998). Here we extend the previous analysis of linear activations to a nonlinear one, though still under the condition $\beta\to\infty$, which means that the student network undergoes exhaustive learning that minimizes the training error. In the current setting, the student network is a nonlinear ReLU network while the teacher is a noise-corrupted linear one.

Proposition 5. Given a single-layer linear teacher $y^*=\mathbf{w}^{*T}\mathbf{x}+s$ and a student network $y=g(\tilde{\mathbf{w}}^T\mathbf{x})$ with $\tilde{\mathbf{w}}=\sqrt{N}\gamma\frac{\mathbf{w}}{\|\mathbf{w}\|_2}$ and $g$ being the ReLU activation function, and $\mathbf{x}\sim N(0,\frac{1}{N}\mathbf{I})$, the free energy $f$ satisfies, as $\beta\to\infty$,

$$-\beta f=\frac{\gamma^2-\gamma^2R^2}{2(q^2-\gamma^2)}+\frac{1}{2}\ln(q^2-\gamma^2)-\frac{\alpha}{4}\ln\big(1+\beta(q^2-\gamma^2)\big)-\frac{\alpha\beta\big(1-2\gamma R+\gamma^2+S\big)}{4\big(1+\beta(q^2-\gamma^2)\big)}-\frac{\alpha\beta(1+S)}{4},\qquad(21)$$

where $S$ is the variance of the Gaussian noise $s$ injected into the output of the teacher.

Proof. The most difficult step in Eqn.(19) is to solve the inner integration over $h_3$. As $\beta\to\infty$, the function $\exp(-\beta x)$ is non-negligible only in a notch around $x=0$ and vanishes elsewhere. Therefore, the integration $\int Dh_3\exp\big(-\frac{\beta}{2}(g-g^*)^2\big)$ depends on the value of $g^*$. If $g^*<0$, no solution exists for $g-g^*=0$ since $g$ is a ReLU output, and the integration is dominated by the maximum value of the integrand in the limit $\beta\to\infty$. If $g^*>0$, the integration over the notch is equivalent to the one over the full range, with the ReLU constraint dropped. That is,

$$\int Dh_3\exp\Big(-\frac{\beta}{2}(g-g^*)^2\Big)\approx\begin{cases}\displaystyle\int Dh_3\exp\Big(-\frac{\beta}{2}\big(\gamma Rh_1+\sqrt{\gamma^2-\gamma^2R^2}\,h_2+\sqrt{q^2-\gamma^2}\,h_3-g^*\big)^2\Big), & h_1+s>0,\\[4pt]\displaystyle\max_{h_3}\exp\Big(-\frac{\beta}{2}(g-g^*)^2\Big), & h_1+s\leq0.\end{cases}$$

The first case can be readily integrated out and we obtain

$$\ln\int Dh_3\exp\Big(-\frac{\beta}{2}(g-g^*)^2\Big)=-\frac{1}{2}\ln\big(1+\beta(q^2-\gamma^2)\big)-\frac{\beta\big((1-\gamma R)h_1-\sqrt{\gamma^2-\gamma^2R^2}\,h_2+s\big)^2}{2\big(1+\beta(q^2-\gamma^2)\big)}.$$

Substituting it back into Eqn.(19), the third term becomes

$$\text{Term}_3=\alpha\iint_{h_1+s>0}Dh_1Dh_2\Big[-\frac{1}{2}\ln\big(1+\beta(q^2-\gamma^2)\big)-\frac{\beta\big((1-\gamma R)h_1-\sqrt{\gamma^2-\gamma^2R^2}\,h_2+s\big)^2}{2\big(1+\beta(q^2-\gamma^2)\big)}\Big]+\alpha\iint_{h_1+s\leq0}Dh_1Dh_2\Big[-\frac{\beta}{2}(h_1+s)^2\Big].$$

To solve the above integration, we first note that $s$ is a random variable corrupting the output of the teacher, so the integration should also be averaged over $s$. Given that $s\sim N(0,S)$, it is easy to verify that $\int_{h+s>0}hs\,Dh\,Ds=0$. Through simple Gaussian integrations, we get

$$\text{Term}_3=-\frac{\alpha}{4}\ln\big(1+\beta(q^2-\gamma^2)\big)-\frac{\alpha\beta\big(1-2\gamma R+\gamma^2+S\big)}{4\big(1+\beta(q^2-\gamma^2)\big)}-\frac{\alpha\beta(1+S)}{4}.$$

Substituting Term$_3$ back yields the expression of the free energy.

Therefore, by locating the values that minimize $f$ in the above proposition, we obtain the equilibrium order parameters, which can be expressed through the quantity

$$\frac{2a+\alpha S}{2a-\alpha},\qquad(22)$$

where $a$ is defined as $a=\frac{1+\beta(q^2-\gamma^2)}{\beta(q^2-\gamma^2)}$.
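The $\beta\to\infty$ treatment of the inner $h_3$-integral in the proof above can be sanity-checked numerically. The sketch below (ours, not the paper's code) compares a Monte Carlo estimate of $\int Dh_3\exp(-\frac{\beta}{2}(g-g^*)^2)$ for a ReLU $g$ and a positive teacher value $g^*$ against the Gaussian closed form used in the proof; the constants $c$, $b$, $g^*$ and $\beta$ are illustrative stand-ins for $\gamma Rh_1+\sqrt{\gamma^2-\gamma^2R^2}h_2$, $\sqrt{q^2-\gamma^2}$, $h_1+s$ and the inverse temperature.

```python
# Numerical sketch (ours) of the beta -> infinity step in the proof of Proposition 5:
# for g* > 0, the inner integral over h3 of exp(-beta*(g - g*)^2 / 2) with a ReLU g
# is close to (1 + beta*b^2)^(-1/2) * exp(-beta*(c - g*)^2 / (2*(1 + beta*b^2))).
# The constants c, b, gstar, beta are illustrative.
import numpy as np

rng = np.random.default_rng(0)
beta, c, b, gstar = 200.0, 0.3, 0.5, 0.8      # g = max(c + b*h3, 0), teacher value g* > 0

h3 = rng.standard_normal(2_000_000)
g = np.maximum(c + b * h3, 0.0)
mc = np.mean(np.exp(-beta * (g - gstar) ** 2 / 2))             # Monte Carlo over Dh3

closed = np.exp(-beta * (c - gstar) ** 2 / (2 * (1 + beta * b**2))) / np.sqrt(1 + beta * b**2)
print(f"Monte Carlo: {mc:.5f}   closed form: {closed:.5f}")
```

The two values agree because, at large $\beta$, the integrand concentrates where $c+bh_3\approx g^*>0$, a region in which the ReLU is inactive.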
Substituting the order parameters back into the generalization error, we have

$$\epsilon_{gen}=\frac{4a+\alpha S}{2a(2a-\alpha)}.\qquad(24)$$

When $\alpha<2$ and $\beta\to\infty$, $a=1$, and the generalization error becomes

$$\epsilon_{gen}=\frac{4+\alpha S}{2(2-\alpha)}.\qquad(25)$$
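As a final illustration, the sketch below (ours) simply tabulates Eqn.(25) for a few values of the load ratio $\alpha=P/N$ and the teacher noise variance $S$; the chosen values are illustrative, and this is a direct evaluation of the formula rather than a simulation.

```python
# A tiny sketch (ours) that tabulates the generalization error of Eqn.(25),
# eps_gen = (4 + alpha*S) / (2*(2 - alpha)), for a few illustrative alpha and S.
def eps_gen(alpha, S):
    return (4 + alpha * S) / (2 * (2 - alpha))

for S in (0.0, 0.5, 1.0):
    row = ", ".join(f"alpha={a:.2f}: {eps_gen(a, S):.3f}" for a in (0.25, 0.5, 1.0, 1.5, 1.9))
    print(f"S={S:.1f} -> {row}")
```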