Towards Theoretically Inspired Neural Initialization Optimization

Yibo Yang1, Hong Wang2, Haobo Yuan3, Zhouchen Lin2,4,5
1JD Explore Academy, Beijing, China
2Key Lab. of Machine Perception (MoE), School of Intelligence Science and Technology, Peking University
3Institute of Artificial Intelligence and School of Computer Science, Wuhan University
4Institute for Artificial Intelligence, Peking University
5Pazhou Laboratory, Guangzhou, China
Corresponding author.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

Automated machine learning has been widely explored to reduce human effort in designing neural architectures and searching for proper hyperparameters. In the domain of neural initialization, however, similar automated techniques have rarely been studied. Most existing initialization methods are handcrafted and highly dependent on specific architectures. In this paper, we propose a differentiable quantity, named GradCosine, with theoretical insights to evaluate the initial state of a neural network. Specifically, GradCosine is the cosine similarity of sample-wise gradients with respect to the initialized parameters. By analyzing the sample-wise optimization landscape, we show that both the training and test performance of a network can be improved by maximizing GradCosine under a gradient norm constraint. Based on this observation, we further propose the Neural Initialization Optimization (NIO) algorithm. Generalized from the sample-wise analysis to the real batch setting, NIO is able to automatically find a better initialization with negligible cost compared with the training time. With NIO, we improve the classification performance of a variety of neural architectures on CIFAR-10, CIFAR-100, and ImageNet. Moreover, we find that our method can even help to train a large vision Transformer architecture without warmup.

1 Introduction

For a deep neural network, the architecture [19, 20, 21] and the parameter initialization [18, 16] are two initial design choices that largely account for the final model performance. Considerable human effort has been devoted to finding better answers for both aspects. To automatically produce better architectures, neural architecture search [62, 38, 5] has become a research focus. In contrast, automated techniques for parameter initialization have rarely been studied. Previous initialization methods are mostly handcrafted. They focus on finding proper variance patterns of randomly initialized weights [18, 16, 40, 36] or rely on empirical evidence derived from certain architectures [58, 23, 15, 7]. Recently, [60, 8] proposed learning-based initialization that learns to tune the norms of the initial weights so as to minimize a quantity that is intimately related to favorable training dynamics. Despite being architecture-agnostic and gradient-based, these methods do not consider the sample-wise landscape. Their derived quantities lack theoretical support relating them to model performance, so it is unclear whether optimizing these quantities can indeed lead to better training or generalization.

In order to find a better initialization, a theoretically sound quantity that closely reflects both the training and test performance should be designed. Finding such a quantity is a non-trivial problem and poses several challenges.
First, the test performance is mostly decided by the converged parameters after training, while the training dynamics are more related to the parameters at initialization or during training. Second, to efficiently find a better starting point, the quantity is expected to be differentiable so that it can be optimized in the continuous parameter space. To this end, we leverage recent advances in optimization landscape analysis [2] to propose a novel differentiable quantity and develop a corresponding algorithm for its optimization at initialization.

Specifically, our quantity is inspired by analyzing the optimization landscapes of individual training samples [59]. By generalizing prior theoretical results on batch-wise optimization [2] to sample-wise optimization, we prove that both the network's training and generalization errors are upper bounded by a theoretical quantity that correlates with the cosine similarity of the sample-wise local optima. Moreover, this quantity also relates to the training dynamics since it reflects the consistency of the optimization paths [35, 37] from the starting point. Unfortunately, the sample-wise local optima are intractable. Under the hypothesis that the sample-wise local optima can be reached by a first-order approximation from the initial parameters, we can approximate the quantity via the sample-wise gradients at initialization. Our final result shows that, under a limited gradient norm, both the training and test performance of a network can be improved by maximizing the cosine similarity of sample-wise gradients, named GradCosine, which is differentiable and easy to implement.

We then propose the Neural Initialization Optimization (NIO) algorithm based on GradCosine to find a better initialization in an architecture-agnostic manner. We generalize the algorithm from the sample-wise analysis to the batch-wise setting by dividing a batch into sub-batches for a practical implementation. Following [8, 60], we use gradient descent to learn a set of scalar coefficients for the initialized parameters. These coefficients are optimized to maximize GradCosine for better training dynamics and expected performance, while keeping the gradient norm from exploding. Experiments show that for a variety of deep architectures, including ResNet [19], DenseNet [21], and Wide ResNet [56], our method achieves better classification results on CIFAR-10/100 [27] than prior heuristic [18] and learning-based [8, 60] initialization methods. We can also initialize ResNet-50 [19] on ImageNet [9] for better performance. Moreover, our method helps the recently proposed Swin-Transformer [32] achieve stable training and competitive results on ImageNet even without warmup [17], which is otherwise crucial for the successful training of Transformer architectures [31, 52].

2 Related Work

2.1 Network Initialization

Existing initialization methods are designed to control the norms of network parameters via Gaussian initialization [16, 18] or orthonormal matrix initialization [40, 36] with different variance patterns. These analyses are most effective for simple feed-forward networks without skip connections or normalization layers. Recently, initialization techniques specialized for complex architectures have been proposed. For example, [58] studied how to initialize networks with skip connections, and [23] generalized the results to the Transformer architecture [49]. However, these heuristic methods are restricted to specific architectures.
Automated machine learning has achieved success in searching for hyperparameters [3, 13] and architectures [62, 38, 5, 54, 55, 22], while similar techniques for neural network initialization deserve more exploration. Current learning-based initialization methods [8, 60] optimize the curvature [8] or the loss reduction of the first stochastic step [60] using gradient descent to tune the norms of the initial parameters. However, these methods lack theoretical foundations relating them to model performance. In contrast, our proposed GradCosine is derived from a theoretical quantity that upper bounds both the training and generalization errors.

2.2 Evaluating Model Performance at Initialization

Evaluating the performance of a network at initialization has been an important challenge and is widely applied in zero-shot neural architecture search [1, 34, 6, 42, 59, 41, 29] and pruning [50, 28, 44]. The evaluation quantities in these studies are mainly based on the initial gradient norm [44, 41], the eigenvalues of the neural tangent kernel [6, 41], and the Fisher information matrix [47, 48, 45]. However, these quantities cannot reflect the optimization landscape, which is crucial for training dynamics and generalization [4, 12, 14, 43, 30, 2]. [2] provided theoretical evidence that for a sufficiently large neighborhood of a random initialization, the optimization landscape is nearly convex and semi-smooth. Based on this result, [59] proposed to use the density of sample-wise local optima to evaluate and rank neural architectures. Our study also performs sample-wise landscape analysis, but differs from [59] in that our proposed quantity is differentiable and reflects optimization path consistency, whereas the quantity in [59] is non-differentiable and thus cannot serve for initialization optimization. We make a more detailed comparison in Appendix C.

Figure 1: (a) An optimization landscape with sparser sample-wise local optima corresponds to a worse $\theta^*$ (larger loss $l$). (b) An optimization landscape with denser sample-wise local optima corresponds to a better $\theta^*$ (smaller loss $l$). However, the density of sample-wise local optima cannot reflect the training path. We further leverage the cosine similarity of the sample-wise local optima. Under the same local optima density, (d) corresponds to better training dynamics than (c), since (d) enjoys better optimization path consistency (smaller cosine distance between $\theta_1^* - \theta_0$ and $\theta_2^* - \theta_0$). We give more detailed explanations and discussions in Appendix C.

3 Theoretical Foundations

3.1 Sample-wise Optimization Landscape Analysis

Conventional optimization landscape analyses [30, 2] mainly focus on the objective across a mini-batch of samples and miss potential evidence hidden in the optimization landscapes of individual samples. As recently pointed out in [59], by decomposing a mini-batch objective into the summation of sample-wise objectives across individual samples in the mini-batch, a network with denser sample-wise local optima tends to reach a better local optimum for the mini-batch objective, as shown in Figure 1 (a)-(b). Based on this insight, they propose to use the sample-wise local optima density to judge model performance at initialization.

Density of sample-wise local optima.
For a batch of training samples $S = \{(x_i, y_i)\}_{i\in[n]}$, a loss function $l(\hat{y}_i, y_i)$, and a network $f_\theta(\cdot)$ parameterized by $\theta \in \mathbb{R}^m$, the sample-wise local optima density $\Psi_{S,l}(f_{\theta_0}(\cdot))$ is measured by the averaged Manhattan distance between the pair-wise local optima $\{\theta_i^*\}_{i\in[n]}$ of all $n$ samples near the random initialization $\theta_0$ [59], i.e.,

$$\Psi_{S,l}(f_{\theta_0}(\cdot)) = \frac{H}{n^2}\sum_{i,j}\|\theta_i^* - \theta_j^*\|_1, \quad i, j \in [1, n], \qquad (1)$$

where $H$ is the smoothness upper bound: $\forall k \in [m],\ i \in [n]$, $[\nabla^2 l(f_\theta(x_i), y_i)]_{k,k} \le H$. Such an $H$ always exists under the smoothness assumption that, for a neighborhood $\Gamma_{\theta_0}$ of a random initialization $\theta_0$, the sample-wise optimization landscapes are nearly convex and semi-smooth [2]. Based on this assumption, the training error of the network can be upper bounded by $\frac{n^3}{2}\Psi^2_{S,l}(f_{\theta_0}(\cdot))$. Moreover, with probability $1-\delta$, the generalization error measured by the population loss $\mathbb{E}_{(x_u,y_u)\sim D}[l(f_{\theta^*}(x_u), y_u)]$ is upper bounded by $\frac{n^3}{2}\Psi^2_{S,l}(f_{\theta_0}(\cdot)) + \frac{\sigma}{\sqrt{n\delta}}$, where $D = \{(x_u, y_u)\}_{u\in[U]}$ is the underlying data distribution of test samples and $\sigma^2$ is the upper bound of $\mathrm{Var}_{(x_u,y_u)\sim D}[\|\theta^* - \theta_u^*\|_1^2]$ [59].

Although the sample-wise local optima density is theoretically related to both the training and generalization errors of a network, we point out that it may not be a suitable quantity for evaluating network initialization, for the following reasons. First, it ignores the consistency of the optimization paths from the initialization $\theta_0$ to each sample-wise local optimum $\theta_i^*$. As shown in Figure 1 (c)-(d), while both (c) and (d) have the same local optima density, in (d) the optimization paths from the initialization to the local optima of the two samples are more consistent. Training a network on samples with more consistent optimization paths using batch gradient descent naturally corresponds to better training dynamics with faster and more stable convergence [39, 37]. Second, the sample-wise local optima in Eq. (1) are intractable. They can be approximated by measuring the consistency of sample-wise gradient signs [59], but this approximation is non-differentiable and cannot serve for initialization optimization.

Based on these observations, we aim to derive a new quantity that directly reflects optimization path consistency and is a differentiable function of the initialization $\theta_0$. Our proposed quantity is based on the cosine similarity of the paths from the initialization to the sample-wise local optima.

Cosine similarity of sample-wise local optima. Concretely, our quantity can be formulated as

$$\Theta_{S,l}(f_{\theta_0}(\cdot)) = \frac{H\alpha^2}{\beta}\Big(1 - \min_{i,j\in[1,n]}\cos\big(\theta_i^* - \theta_0,\ \theta_j^* - \theta_0\big)\Big), \qquad (2)$$

where $\alpha$ and $\beta$ are the maximal and minimal $\ell_2$-norms of the sample-wise optimization paths, i.e., $\alpha = \max_i \|\theta_i^* - \theta_0\|_2$, $\beta = \min_i \|\theta_i^* - \theta_0\|_2$, $i \in [1, n]$, and $\cos(\theta_i^* - \theta_0, \theta_j^* - \theta_0)$ refers to the cosine similarity of the paths from the initialization $\theta_0$ to the sample-wise local optima $\theta_i^*$ and $\theta_j^*$. The cosine term in Eq. (2) reflects the optimization path consistency. Together with the distance term $\frac{\alpha^2}{\beta}$, it is also able to measure the density of the sample-wise local optima. In the ideal case, when all the local optima are located at the same point, $\Theta_{S,l}(f_{\theta_0}(\cdot)) = \Psi_{S,l}(f_{\theta_0}(\cdot)) = 0$. Hence, compared with $\Psi$, our $\Theta$ is more suitable for evaluating the initialization quality.
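As a small illustration of Eq. (2) (using the reconstruction given above), the sketch below computes the path cosines, $\alpha$, and $\beta$ from a set of hypothetical sample-wise optima; `theta0`, `optima`, and the value of `H` are made-up inputs for demonstration only, not quantities from the paper.

```python
import torch

def theta_quantity(theta0, optima, H=1.0):
    """Illustrative computation of the path-consistency quantity in Eq. (2).

    theta0: flattened initialization, shape (m,)
    optima: hypothetical sample-wise local optima, shape (n, m)
    """
    paths = optima - theta0                      # theta_i^* - theta_0, shape (n, m)
    norms = paths.norm(dim=1)                    # ||theta_i^* - theta_0||_2
    alpha, beta = norms.max(), norms.min()       # longest / shortest path
    unit = paths / norms[:, None]
    cos = unit @ unit.t()                        # pairwise path cosine similarities
    return H * alpha.pow(2) / beta * (1.0 - cos.min())

# Toy check: identical optima give Theta = 0, spread-out optima give Theta > 0.
theta0 = torch.zeros(4)
same = torch.ones(3, 4)
spread = torch.tensor([[1., 0., 0., 0.], [0., 1., 0., 0.], [0., 0., 1., 1.]])
print(theta_quantity(theta0, same), theta_quantity(theta0, spread))
```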
3.2 Main Results

In this subsection, we theoretically illustrate how minimizing the quantity $\Theta$ in Eq. (2) corresponds to better training and generalization performance. Similar to [59], our derivations are also based on the evidence that there exists a neighborhood of a random initialization such that the sample-wise optimization landscapes are nearly convex and semi-smooth [2].

Lemma 1. There exists no saddle point in a sample-wise optimization landscape, and every local optimum is a global optimum [59].

Based on Lemma 1, we can draw a relation between the training error and $\Theta_{S,l}(f_{\theta_0}(\cdot))$. Moreover, we show that the proposed quantity is also related to generalization performance as an upper bound of the population error. We present the following two theoretical results.

Theorem 2. The training loss $L = \frac{1}{n}\sum_i l(f_{\theta^*}(x_i), y_i)$ of a trained network $f_{\theta^*}$ on a dataset $S = \{(x_i, y_i)\}_{i\in[n]}$ is upper bounded by $\Theta_{S,l}(f_{\theta_0}(\cdot))$, and the bound is tight when $\Theta_{S,l}(f_{\theta_0}(\cdot)) = 0$.

Theorem 3. Suppose that $\sigma^2$ is the upper bound of $\mathrm{Var}_{(x_u,y_u)\sim D}[\|\theta^* - \theta_u^*\|_2^2]$, where $\theta_u^*$ is the local optimum in the convex neighborhood of $\theta_0$ for a test sample $(x_u, y_u)$. With probability $1-\delta$, the population loss $\mathbb{E}_{(x_u,y_u)\sim D}[l(f_{\theta^*}(x_u), y_u)]$ is upper bounded by $\Theta_{S,l}(f_{\theta_0}(\cdot)) + \frac{\sigma}{\sqrt{n\delta}}$.

We provide proofs of these two theorems in Appendix A. Combining both theorems, we conclude that $\Theta_{S,l}(f_{\theta_0}(\cdot))$ upper bounds both the training and generalization errors of the network $f_{\theta^*}$. Therefore, minimizing $\Theta$ theoretically helps to improve model performance. Albeit theoretically sound, $\Theta_{S,l}(f_{\theta_0}(\cdot))$ requires the sample-wise optima $\theta_i^*$, which are intractable at initialization. To this end, we show how to develop a differentiable and tractable objective based on Eq. (2) in Section 4, and introduce the initialization optimization algorithm in Section 5.

4 GradCosine

4.1 First-Order Approximation of Sample-Wise Optimization

Since we are dealing with sample-wise optimization, it is reasonable to calculate each sample-wise optimum by a first-order approximation. We hypothesize that each sample-wise optimum can be reached via only one step of gradient descent from the initialized parameters. The rationale is that it is very easy for a deep neural network to fit the optimum of a single training sample with gradient descent. Based on this hypothesis, we can approximate each local optimum as

$$\theta_i^* \approx \theta_0 - \eta g_i, \quad i \in [1, n], \qquad (3)$$

where $g_i = \nabla_\theta l(f_\theta(x_i), y_i)|_{\theta_0}$ is the sample-wise gradient at initialization and $\eta$ is the learning rate.

Algorithm 1: GradCosine (GC) and gradient norm (GN)
Require: Initial network parameters $\theta_0$, and a batch of samples $S = \{(x_i, y_i)\}_{i\in[B]}$, $B = |S|$.
1: for $i = 1$ to $B$ do
2:    Compute the sample-wise gradient: $g_i \leftarrow \nabla_\theta l(f_\theta(x_i), y_i)|_{\theta_0}$, $g_i \in \mathbb{R}^m$
3: end for
4: Compute the average gradient norm: $\mathrm{GN}(S, \theta_0) \leftarrow \frac{1}{B}\sum_{i=1}^{B} \|g_i\|_2$
5: Compute the pairwise cosine similarities: $\phi_{i,j} \leftarrow \frac{g_i^\top g_j}{\|g_i\|_2 \|g_j\|_2}$, $i = 1, \ldots, B$, $j = 1, \ldots, B$
6: Compute the average gradient cosine: $\mathrm{GC}(S, \theta_0) \leftarrow \frac{1}{B^2}\sum_{i=1}^{B}\sum_{j=1}^{B} \phi_{i,j}$
7: Output GN and GC.

With Eq. (3), the upper bound quantity in Eq. (2) can be simplified as

$$\Theta \approx \frac{H\eta\, g_{\max}^2}{g_{\min}}\Big(1 - \min_{i,j\in[1,n]}\cos(g_i, g_j)\Big), \qquad (4)$$

where $g_{\max}$ and $g_{\min}$ are the maximal and minimal sample-wise gradient norms at initialization, i.e., $g_{\max} = \max_i \|g_i\|_2$, $g_{\min} = \min_i \|g_i\|_2$, $i \in [1, n]$.

4.2 Guidance for Initialization

Suppose that the sample-wise gradient norm at initialization is upper bounded by a constant $\gamma$, i.e., $g_{\max} \le \gamma$. Then we have

$$\Theta \lesssim \frac{H\eta\,\gamma^2}{g_{\min}}\Big(1 - \min_{i,j\in[1,n]}\cos(g_i, g_j)\Big). \qquad (5)$$

From Eq. (2) to Eq. (5), we have almost converted the upper bound into a quantity that can be easily calculated from the sample-wise gradients at initialization, except for the term $g_{\min}$ and the constraint $g_{\max} \le \gamma$. We will explicitly add the corresponding optimization objective and constraint in our neural initialization optimization algorithm in Section 5 to ensure a small $\gamma / g_{\min}$.
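The quantities entering Eqs. (4)-(5), i.e., the per-sample gradients, their norms, and their pairwise cosines (Algorithm 1), are straightforward to compute. Below is a minimal PyTorch-style sketch; `model`, `loss_fn`, and the per-sample backward loop are illustrative assumptions rather than the paper's released implementation, and the loop is simple but slow for large models.

```python
import torch

def gradcosine_and_norm(model, loss_fn, xs, ys):
    """Sample-wise GN, GC (Algorithm 1) and the g_max / g_min ratio of Eq. (5)."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = []
    for x, y in zip(xs, ys):  # one backward pass per sample
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        g = torch.autograd.grad(loss, params)
        grads.append(torch.cat([gi.reshape(-1) for gi in g]))   # g_i in R^m
    G = torch.stack(grads)                                      # (B, m)
    norms = G.norm(dim=1)                                       # ||g_i||_2
    gn = norms.mean()                                           # GN: average gradient norm
    cos = (G @ G.t()) / (norms[:, None] * norms[None, :] + 1e-12)
    gc = cos.mean()                                             # GC: average pairwise cosine
    ratio = norms.max() / norms.min()                           # g_max / g_min
    return gc, gn, ratio
```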
From Eq. (5), we can draw some useful guidance for a better initialization: (1) the sample-wise gradient norms of the initialized parameters should be large and close to the maximal value bounded by $\gamma$, so as to induce a small $\gamma / g_{\min}$; (2) the cosine similarity of the sample-wise gradients should be as large as possible.

Relations to Favorable Training Dynamics. There is significant evidence in existing studies supporting the two rules above for better training dynamics. The first rule is intimately related to the neural tangent kernel (NTK) [25], which has recently been shown to determine the early learning behavior [51, 41, 6, 42]. Specifically, [41] finds that the training dynamics of a neural network can be characterized by the trace norm of the NTK at initialization and further approximates it via the initial gradient norm as a gauge to search for neural architectures. [60] initializes a network by maximizing the loss reduction of the first gradient descent step, which also corresponds to the gradient norm under a first-order approximation of the loss function. Together, the two rules prefer a model whose sample-wise gradients are close in both $\ell_2$ and cosine distance. This is in line with prior observations that the initial gradient variance should be small to enable a large learning rate [31, 61]. However, gradient variance is not a proper optimization objective due to its high sensitivity to the gradient norm. While the first rule of improving the initial gradient norm has been suggested for better training dynamics in prior studies [60, 41], the second rule is completely new. We refer to this novel quantity for evaluating network initialization, i.e., $\cos(g_i, g_j)$, as GradCosine. Intuitively, in the ideal case where the initial gradients of all samples are identical, we obtain the smallest upper bound on the training and generalization errors. Moreover, both the gradient norm and GradCosine are differentiable and thus enable optimization to find a better initialization. We provide the calculation of the gradient norm (GN) and GradCosine (GC) for an initialized network in Algorithm 1.

Algorithm 2: Batch GradCosine (B-GC) and batch gradient norm (B-GN)
Require: Initialized network parameters $\theta_0$, a batch of samples $S = \{(x_i, y_i)\}_{i\in[B]}$, the number of sub-batches $D$, and the overlap ratio $r$.
1: for $d = 1$ to $D$ do
2:    Compute the batch-wise gradient: $g_d \leftarrow \frac{1}{N}\sum_{j\in S_d} \nabla_\theta l(f_\theta(x_j), y_j)|_{\theta_0}$
3: end for
4: Compute B-GN$(S, \theta_0)$ and B-GC$(S, \theta_0)$ from the batch-wise gradients following Lines 4-6 of Algorithm 1.
5: Output B-GN and B-GC.

Figure 2: An illustration of sample-wise inputs and batch-wise inputs with different batch settings. The example experiment is conducted on CIFAR-10 with ResNet-110: sample-wise (D = B, r = 0, Acc = 96.03), batch setting 1 (D = 4, r = 0, Acc = 95.80), batch setting 2 (D = 4, r = 0.25, Acc = 95.94), and batch setting 3 (D = 2, r = 0.5, Acc = 96.13).

5 Neural Initialization Optimization

Based on GradCosine, we propose our Neural Initialization Optimization (NIO) algorithm. Following [60, 8], our algorithm rectifies the variance of the initialized parameters of a network via gradient descent. Concretely, we introduce a set of learnable scalar coefficients denoted as $M = \{\omega_1, \ldots, \omega_m\}$. The initial parameters $\theta_0 = \{W_1, \ldots, W_m\}$ rectified by these coefficients can be formulated as $\theta_M = \{\omega_1 W_1, \ldots, \omega_m W_m\}$.
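One simple way to realize the rectified parameters $\theta_M$ is to keep the initialization frozen and route every forward pass through scaled copies of the weights, so that the objectives below stay differentiable with respect to the coefficients. The sketch assumes PyTorch's `torch.func.functional_call` (available in recent PyTorch versions); the helper names are illustrative, not the paper's code.

```python
import torch
from torch.func import functional_call

def make_scale_coefficients(model):
    """The learnable set M: one scalar omega per parameter tensor W."""
    return {name: torch.ones(1, requires_grad=True)
            for name, _ in model.named_parameters()}

def rectified_parameters(model, scales):
    """theta_M = {omega_1 W_1, ..., omega_m W_m}: frozen init times learnable scales."""
    return {name: scales[name] * p.detach()
            for name, p in model.named_parameters()}

def loss_with_rectified_init(model, scales, loss_fn, x, y):
    """Forward the network with theta_M instead of its stored parameters."""
    out = functional_call(model, rectified_parameters(model, scales), (x,))
    return loss_fn(out, y)
```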
The NIO algorithm solves the following constrained optimization problem:

$$\max_{M}\ \mathrm{GC}(S, \theta_M) + \mathrm{GN}(S, \theta_M), \quad \text{subject to}\ \max_i \|g_i\|_2 \le \gamma,\ i = 1, \ldots, B, \qquad (6)$$

where GC and GN refer to GradCosine and the gradient norm, respectively, as calculated in Algorithm 1, and $B$ is the batch size of the batch data $S$. With the maximal sample-wise gradient norm bounded by $\gamma$, we maximize both GradCosine and the gradient norm. Although the problem in Eq. (6) is differentiable and tractable, computing GradCosine and the gradient norm in the sample-wise manner can be very time-consuming, especially for networks trained on large datasets such as ImageNet. Moreover, solving Eq. (6) with gradient-based methods requires second-order gradients with respect to the network parameters, which leads to unbearable memory consumption in the sample-wise manner. To this end, we introduce a batch-wise relaxation.

5.1 Generalizing to Batch Settings

To reduce the time and space complexity of the gradient calculation, we generalize GC and GN from the sample-wise manner to a more efficient batch setting. Specifically, for a mini-batch of training samples $S = \{(x_i, y_i)\}_{i\in[B]}$, we split it into $D$ sub-batches $S_1, \ldots, S_d, \ldots, S_D$ with an overlap ratio $r$, which is used to mitigate the deviation from the sample-wise setting. The number of samples in each sub-batch is $N = \frac{B}{D - (D-1)r}$, and each sub-batch can be denoted as

$$S_d = \{(x_i, y_i)\}_{i = N(d-1)(1-r)+1, \ldots, N(d-1)(1-r)+N}, \qquad (7)$$

where $d = 1, \ldots, D$. The sample-wise input can be regarded as the special case of this batch split with $D = B$ and $r = 0$. We illustrate several split examples in Figure 2. We empirically find that making the sub-batches overlap with each other stabilizes the optimization and results in a better initialization. As shown in Figure 2, the test accuracies of models trained using our initialization under different batch settings further confirm this practice.

The GradCosine and gradient norm under the batch setting are given in Algorithm 2. The main difference is that, in the batch setting, the gradient is computed as the average of the gradients within a sub-batch. Our optimization problem in the batch setting can be formulated as

$$\max_{M}\ \text{B-GC}(S, \theta_M) + \text{B-GN}(S, \theta_M), \quad \text{subject to}\ \max_i \|g_i\|_2 \le \gamma,\ i = 1, \ldots, D, \qquad (8)$$

where B-GC and B-GN refer to the batch-wise GradCosine and gradient norm as calculated in Algorithm 2, and the $g_i$ are now the sub-batch gradients.

5.2 Main Algorithm

The final neural initialization optimization procedure is summarized in Algorithm 3. It rectifies the initialized parameters by gradient descent to solve the constrained optimization problem in Eq. (8). Concretely, we iterate for $T$ iterations. At each iteration, a random batch $S_t$ is sampled from the training set and divided into sub-batches according to $D$ and $r$. We then calculate B-GN and B-GC by Algorithm 2. If the maximal gradient norm is larger than the predefined upper bound $\gamma$, we minimize the averaged batch gradient norm to avoid gradient explosion at initialization. When the constraint is satisfied, we maximize the GradCosine (B-GC) and gradient norm (B-GN) objectives simultaneously, in order to minimize the upper bound $\Theta$ in Eq. (5), which corresponds not only to lower training and generalization errors, but also to better training dynamics.

Algorithm 3: Neural Initialization Optimization (NIO)
Require: Initialized parameters $\theta_0 \in \mathbb{R}^m$, learning rate $\tau$ for the scale coefficients $M$, upper bound of the gradient norm $\gamma$, lower bound of the scale coefficients $\alpha = 0.01$, total iterations $T$, batch size $B$, number of sub-batches $D$, and overlap ratio $r$.
1: $M^{(1)} \leftarrow \mathbf{1}$, where $\mathbf{1}$ denotes the all-ones vector in $\mathbb{R}^m$;
2: for $t = 1$ to $T$ do
3:    Sample $S_t$ from the training set;
4:    Compute $g_i$, B-GC$(S_t, \theta_{M^{(t)}})$, and B-GN$(S_t, \theta_{M^{(t)}})$ by Algorithm 2;
5:    Compute $g_{\max} = \max_i \|g_i\|_2$, $i = 1, \ldots, D$;
6:    if $g_{\max} > \gamma$ then
7:        $M^{(t+1)} \leftarrow M^{(t)} - \tau \nabla_{M^{(t)}}\,\text{B-GN}$
8:    else
9:        $M^{(t+1)} \leftarrow M^{(t)} + \tau \nabla_{M^{(t)}}(\text{B-GC} + \text{B-GN})$
10:   end if
11:   Clamp $M^{(t+1)}$ from below by $\alpha$;
12: end for
13: Output the rectified initialization parameters $\theta_{M^{(T)}}$.
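To connect Algorithm 3 with the batch relaxation above, here is a hedged PyTorch-style sketch of the NIO loop under the $\theta_M$ parameterization from the previous snippet. It assumes `torch.func.functional_call`, and the sub-batch index formula follows the reconstruction of Eq. (7); the hyper-parameter defaults, data pipeline, and helper names are illustrative only.

```python
import itertools
import torch
from torch.func import functional_call

def subbatch_indices(B, D, r):
    """Overlapping sub-batches; N = B / (D - (D - 1) r) as in Eq. (7), rounded."""
    N = int(round(B / (D - (D - 1) * r)))
    starts = [int(round(N * d * (1 - r))) for d in range(D)]
    return [list(range(s, min(s + N, B))) for s in starts]

def nio(model, loss_fn, loader, T=100, tau=1e-2, gamma=10.0, alpha=0.01, D=4, r=0.25):
    """A sketch of Algorithm 3: learn the scale coefficients M, then bake them in."""
    scales = {n: torch.ones(1, requires_grad=True) for n, _ in model.named_parameters()}
    opt = torch.optim.SGD(list(scales.values()), lr=tau)
    data = itertools.cycle(loader)
    for _ in range(T):
        x, y = next(data)
        theta_M = {n: scales[n] * p.detach() for n, p in model.named_parameters()}
        sub_grads = []
        for idx in subbatch_indices(x.size(0), D, r):
            loss = loss_fn(functional_call(model, theta_M, (x[idx],)), y[idx])
            # Gradients w.r.t. theta_M; create_graph keeps B-GN/B-GC differentiable in M.
            g = torch.autograd.grad(loss, list(theta_M.values()), create_graph=True)
            sub_grads.append(torch.cat([gi.reshape(-1) for gi in g]))
        G = torch.stack(sub_grads)
        norms = G.norm(dim=1)
        b_gn = norms.mean()
        cos = (G @ G.t()) / (norms[:, None] * norms[None, :] + 1e-12)
        b_gc = cos.mean()
        # Constrained update: minimize B-GN if the constraint is violated,
        # otherwise maximize B-GC + B-GN (Eq. (8)).
        objective = b_gn if norms.max().item() > gamma else -(b_gc + b_gn)
        opt.zero_grad()
        objective.backward()
        opt.step()
        with torch.no_grad():
            for s in scales.values():
                s.clamp_(min=alpha)          # lower-bound the coefficients by alpha
    with torch.no_grad():                    # bake theta_M back into the model
        for n, p in model.named_parameters():
            p.mul_(scales[n])
    return model
```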
6 Experiments

6.1 Initialization for CNN

Datasets and Architectures. We validate our method on three widely used datasets: CIFAR-10/100 [27] and ImageNet [9]. We select three kinds of convolutional neural networks: ResNet [19], DenseNet [21], and Wide ResNet [56]. On CIFAR-10/100, we adopt ResNet-110, DenseNet-100, and the 28-layer Wide ResNet with widen factor 10 (WRN-28-10) as the three main architectures for evaluation. To further show that our initialization method improves training dynamics and training stability, we also conduct experiments on the same architectures with the batch normalization (BN) [24] layers removed. Moreover, to illustrate that our method extends to large-scale benchmarks, we test the proposed NIO on ImageNet using ResNet-50. The detailed settings for different architectures and datasets are described in Appendix B.

Experimental Results on CIFAR-10/100. We select four initialization methods for comparison: (1) Kaiming initialization [18]; (2) training the network for one epoch with a linear warmup learning rate, denoted as Warmup; (3) MetaInit [8]; and (4) GradInit [60]. Methods (3) and (4) are learning-based initializations like ours; we re-implement them using the code provided by [60]. For MetaInit, GradInit, and our proposed NIO, the initialization is rectified starting from the Kaiming-initialized parameters. After initialization, we train these models for 500 epochs with the same training setting. Each model is trained four times with different seeds, and we report the mean and standard deviation of the test accuracies.

Table 1: Test accuracies (%) of three architectures on CIFAR-10 (ResNet-110 / DenseNet-100 / WRN-28-10). Best results are marked in bold.
With BN:
  Kaiming    95.53 ± 0.19 / 95.75 ± 0.13 / 97.27 ± 0.27
  Warmup     95.56 ± 0.12 / 95.73 ± 0.27 / 97.30 ± 0.18
  MetaInit   95.45 ± 0.33 / 95.75 ± 0.11 / 97.26 ± 0.17
  GradInit   95.81 ± 0.29 / 95.77 ± 0.22 / 97.34 ± 0.15
  NIO        96.13 ± 0.16 / 96.07 ± 0.18 / 97.50 ± 0.21
Without BN:
  Kaiming    94.83 ± 0.24 / 94.78 ± 0.16 / 97.15 ± 0.33
  Warmup     94.75 ± 0.21 / 94.86 ± 0.15 / 97.20 ± 0.18
  MetaInit   94.79 ± 0.15 / 95.15 ± 0.19 / 97.27 ± 0.25
  GradInit   95.12 ± 0.16 / 95.31 ± 0.37 / 97.12 ± 0.13
  NIO        95.27 ± 0.19 / 95.62 ± 0.18 / 97.35 ± 0.19

Table 2: Test accuracies (%) of three architectures on CIFAR-100 (ResNet-110 / DenseNet-100 / WRN-28-10). Best results are marked in bold.
With BN:
  Kaiming    74.54 ± 0.21 / 76.58 ± 0.23 / 81.40 ± 0.28
  Warmup     74.63 ± 0.33 / 76.60 ± 0.32 / 81.45 ± 0.25
  MetaInit   74.32 ± 0.26 / 76.23 ± 0.28 / 81.46 ± 0.20
  GradInit   75.40 ± 0.17 / 76.14 ± 0.21 / 81.35 ± 0.31
  NIO        75.72 ± 0.15 / 76.86 ± 0.26 / 81.83 ± 0.20
Without BN:
  Kaiming    73.03 ± 0.23 / 71.27 ± 0.25 / 79.28 ± 0.26
  Warmup     73.10 ± 0.14 / 71.50 ± 0.24 / 79.58 ± 0.13
  MetaInit   72.60 ± 0.17 / 71.68 ± 0.37 / 79.15 ± 0.24
  GradInit   72.06 ± 0.31 / 71.33 ± 0.21 / 79.64 ± 0.23
  NIO        73.17 ± 0.15 / 72.79 ± 0.22 / 79.76 ± 0.26

Figure 3: Empirical comparisons between the Kaiming initialization (red) and our NIO (blue) using ResNet-110 on CIFAR-10. (a) and (b) show the ratio of the maximal sample-wise gradient norm to the minimal one. (c) and (d) depict the averaged pair-wise cosine similarities between sample-wise gradients. (a) and (c) are calculated for each layer with a random batch; (b) and (d) are calculated for all parameters using 100 randomly sampled batches. Zoom in for a better view.
As shown in Tables 1 and 2, MetaInit and GradInit do not improve upon the Kaiming initialization in some cases. In contrast, our proposed NIO consistently improves upon the Kaiming initialization on both CIFAR-10 and CIFAR-100, with and without BN. In particular, for ResNet-110 with BN and DenseNet-100 without BN, the improvements over the Kaiming initialization exceed 0.5% on CIFAR-10 and 1.0% on CIFAR-100. Compared with GradInit, we also achieve better results by more than 1.0% for ResNet-110 and DenseNet-100 without BN on CIFAR-100.

We also make empirical comparisons between the Kaiming initialization and our NIO with ResNet-110 on CIFAR-10. As shown in Figure 3, compared with the Kaiming initialization, NIO enjoys significantly smaller ratios of the maximal sample-wise gradient norm to the minimal one, and larger cosine similarities of sample-wise gradients. After NIO, the gradient norm ratio is reduced towards 1.0 and the cosine similarity is improved towards 1.0, which indicates that the quantity in Eq. (2) is minimized and that the sample-wise gradients become close in both length and direction.

Ablation Studies on the Batch Setting. We further test the sensitivity of our method to the batch-setting hyper-parameters in Eq. (7), namely the number of sub-batches D and the overlap ratio r. Table 3 shows the model performance and the time and space complexities of NIO with ResNet-110 on CIFAR-100 for different D and r. The performance does not vary significantly with D and r. The worst accuracy (D = 2 and r = 0), 75.56%, is still higher than the Kaiming initialization result of 74.54% by 1%. As D increases, the time and memory consumption of NIO also grows, because more batch gradients need to be computed by back-propagation. When D is small, a larger r should be adopted to mitigate the deviation from the sample-wise hypothesis.

Experimental Results on ImageNet. To verify that our method extends to large-scale tasks, we further evaluate our approach on the ImageNet dataset. Using different initialization methods, we train ResNet-50 for 100 epochs with a batch size of 256.

Table 3: Model performance and complexity of ResNet-110 on CIFAR-100 with different batch settings. NIO is performed with a batch size of 128. A smaller number of sub-batches D leads to faster speed and less memory consumption, and almost does not harm the final performance. Time refers to the time consumed per NIO iteration.
  D   Acc. (r=0)   Acc. (r=0.2)   Acc. (r=0.4)   Acc. (r=0.6)   Time    GPU Mem.
  2   75.56        75.71          75.96          75.72          3.75s   5007M
  3   75.60        76.00          75.97          75.76          4.07s   6001M
  4   75.58        76.03          75.71          75.88          4.68s   7067M

Table 4: Top-1 accuracy of ResNet-50 on ImageNet. The results of FixUp and MetaInit are taken from their papers [58, 8]; the other methods are re-implemented by us under the same training setting. FixUp removes the BN layers by default, while the other models are trained with BN layers.
  Method          Accuracy (%)    Init. / Train time1
  Kaiming [18]    76.43 ± 0.05    - / 8.5h
  FixUp [58]      76.0            - / 8.5h
  MetaInit [8]    76.0            0.81h / 8.5h
  GradInit [60]   76.50 ± 0.05    0.21h / 8.5h
  NIO (ours)      76.71 ± 0.07    0.03h / 8.5h

Table 5: Top-1 accuracy of Swin-Transformer (tiny) on ImageNet with and without warmup. "fail" means that the model cannot converge. Warmup takes up the first 20 epochs when enabled.
  Setting       Kaiming   Trunc Normal   GradInit2   NIO (ours)
  w/ warmup     79.4      81.3           80.4        81.3
  w/o warmup    fail      fail           79.9        80.9

All models are trained under the same training setting. As shown in Table 4, we improve upon the Kaiming initialization by 0.3%. We also re-implement GradInit using the default configuration in their paper [60] and obtain performance close to the Kaiming initialization baseline. Moreover, NIO takes the shortest initialization time among the learning-based initialization methods: only 100 iterations are needed for NIO in our implementation, whereas MetaInit and GradInit require 1000 and 2000 iterations, respectively, in their default settings [8, 60]. Compared with the training time, the initialization time of NIO is negligible.

6.2 Initialization for Transformer

Vision Transformer architectures have become popular and have proven effective for many vision tasks [11, 46, 32, 53]. However, Transformer architectures suffer from unstable training due to the large gradient variance [31, 61] with the Adam optimizer [26], and therefore have to resort to the warmup trick, which adopts a small learning rate at the beginning of training to stabilize the training process. To test the effectiveness of NIO for improving training stability, we apply NIO to Swin-Transformer [32] with and without warmup on ImageNet. Detailed training and initialization settings are described in Appendix B. As shown in Table 5, when warmup is not enabled, both the Kaiming and truncated-normal initialized models fail to converge. In contrast, the model initialized with NIO achieves a top-1 accuracy of 80.9%, which is very close to the standard training result with warmup. These results indicate that NIO is able to produce a better initialization agnostic of architecture and dataset.

6.3 Discussions

Why Not Gradient Variance? As indicated by Eq. (2) and Figure 3, the aim of NIO is to reduce the variability of sample-wise gradients. Why, then, do we not directly minimize the gradient variance or the pair-wise Euclidean distances? The gradient variance is highly sensitive to the gradient norm: directly minimizing it would drive the gradient norm towards zero, which disables training. The ability to reduce the gradient variance while keeping the gradient norm in a proper range is therefore a crucial property of our proposed NIO.

1 Train time is tested on an NVIDIA A100 server with a batch size of 256 across 8 GPUs. The init. time of MetaInit and GradInit is tested for 1000 and 2000 iterations, respectively, according to the default configurations in their papers [8, 60]. NIO is performed for 100 iterations and thus enjoys a fast initialization process.
2 GradInit did not evaluate Swin-Transformer in their paper. We adopt their initialization settings for ResNet-50 to perform GradInit on Swin-Transformer and then train the model.

Limitations and Societal Impacts. Similar to [60], the choice of γ has not yet been determined from a theoretical standpoint. Calculating GradCosine exactly requires substantial memory for large models, which is why the batch-setting relaxation is needed. Besides, performing NIO without a dataset would also be a promising breakthrough that deserves future exploration. Our study is general and is not concerned with malicious practices with respect to privacy, public health, fairness, etc. The proposed algorithm is lightweight and does not impose much environmental burden.
7 Conclusion

In this paper, we propose a differentiable quantity with theoretical insights to evaluate the initial state of a neural network. Based on our sample-wise landscape analysis, we show that maximizing this quantity, with the maximal gradient norm upper bounded, improves both training dynamics and model performance. Accordingly, we develop an initialization optimization algorithm to rectify the initial parameters. Experimental results show that our method automatically produces a better initialization for various architectures and improves performance on multiple datasets with negligible cost. It also helps a Transformer architecture train stably even without warmup.

Acknowledgments and Disclosure of Funding

Z. Lin was supported by the NSF China (Nos. 62276004 and 61731018), the NSFC Tianyuan Fund for Mathematics (No. 12026606), and Project 2020BD006 supported by the PKU-Baidu Fund.

References

[1] M. S. Abdelfattah, A. Mehrotra, Ł. Dudziak, and N. D. Lane. Zero-cost proxies for lightweight nas. In ICLR, 2021.
[2] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. In ICML, pages 242-252. PMLR, 2019.
[3] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In NeurIPS, volume 24, 2011.
[4] A. Brutzkus and A. Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. In ICML, pages 605-614. PMLR, 2017.
[5] L.-C. Chen, M. Collins, Y. Zhu, G. Papandreou, B. Zoph, F. Schroff, H. Adam, and J. Shlens. Searching for efficient multi-scale architectures for dense image prediction. In NeurIPS, volume 31, 2018.
[6] W. Chen, X. Gong, and Z. Wang. Neural architecture search on imagenet in four gpu hours: A theoretically inspired perspective. In ICLR, 2020.
[7] D. Das, Y. Bhalgat, and F. Porikli. Data-driven weight initialization with sylvester solvers. arXiv preprint arXiv:2105.10335, 2021.
[8] Y. N. Dauphin and S. Schoenholz. Metainit: Initializing learning by learning to initialize. In NeurIPS, volume 32, 2019.
[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248-255, 2009.
[10] T. DeVries and G. W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
[11] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.
[12] S. Du, J. Lee, Y. Tian, A. Singh, and B. Poczos. Gradient descent learns one-hidden-layer cnn: Don't be afraid of spurious local minima. In ICML, pages 1339-1348. PMLR, 2018.
[13] M. Feurer and F. Hutter. Hyperparameter optimization. In Automated Machine Learning, pages 3-33. Springer, Cham, 2019.
[14] R. Ge, J. D. Lee, and T. Ma. Learning one-hidden-layer neural networks with landscape design. In ICLR, 2018.
[15] J. Gilmer, B. Ghorbani, A. Garg, S. Kudugunta, B. Neyshabur, D. Cardoze, G. Dahl, Z. Nado, and O. Firat. A loss curvature perspective on training instability in deep learning. arXiv preprint arXiv:2110.04369, 2021.
[16] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, pages 249-256, 2010.
[17] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch sgd: Training imagenet in 1 hour.
arXiv preprint arXiv:1706.02677, 2017.
[18] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, pages 1026-1034, 2015.
[19] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770-778, 2016.
[20] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[21] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 4700-4708, 2017.
[22] T. Huang, S. You, Y. Yang, Z. Tu, F. Wang, C. Qian, and C. Zhang. Explicitly learning topology for differentiable neural architecture search. arXiv preprint arXiv:2011.09300, 2020.
[23] X. S. Huang, F. Perez, J. Ba, and M. Volkovs. Improving transformer optimization through better initialization. In ICML, pages 4475-4483. PMLR, 2020.
[24] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448-456. PMLR, 2015.
[25] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In NeurIPS, volume 31, 2018.
[26] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[27] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[28] N. Lee, T. Ajanthan, and P. H. Torr. Snip: Single-shot network pruning based on connection sensitivity. In ICLR, 2019.
[29] H. Li, Y. Yang, D. Chen, and Z. Lin. Optimization algorithm inspired deep neural network structure design. In ACML, 2018.
[30] Y. Li and Y. Yuan. Convergence analysis of two-layer neural networks with relu activation. In NeurIPS, volume 30, 2017.
[31] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han. On the variance of the adaptive learning rate and beyond. In ICLR, 2020.
[32] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, pages 10012-10022, 2021.
[33] I. Loshchilov and F. Hutter. Sgdr: Stochastic gradient descent with warm restarts. In ICLR, 2017.
[34] J. Mellor, J. Turner, A. Storkey, and E. J. Crowley. Neural architecture search without training. In ICML, pages 7588-7598. PMLR, 2021.
[35] N. Meuleau and M. Dorigo. Ant colony optimization and stochastic gradient descent. Artificial Life, 8(2):103-121, 2002.
[36] D. Mishkin and J. Matas. All you need is a good init. arXiv preprint arXiv:1511.06422, 2015.
[37] S. Oymak and M. Soltanolkotabi. Overparameterized nonlinear learning: Gradient descent takes the shortest path? In ICML, pages 4951-4960. PMLR, 2019.
[38] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. In AAAI, volume 33, pages 4780-4789, 2019.
[39] S. Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
[40] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
[41] Y. Shu, S. Cai, Z. Dai, B. C. Ooi, and B. K. H. Low. Nasi: Label- and data-agnostic neural architecture search at initialization. In ICLR, 2022.
[42] J. B. Simon, M. Dickens, and M. R. DeWeese.
Neural tangent kernel eigenvalues accurately predict generalization. arXiv preprint arXiv:2110.03922, 2021.
[43] M. Soltanolkotabi. Learning relus via gradient descent. In NeurIPS, volume 30, 2017.
[44] H. Tanaka, D. Kunin, D. L. Yamins, and S. Ganguli. Pruning neural networks without any data by iteratively conserving synaptic flow. In NeurIPS, volume 33, pages 6377-6389, 2020.
[45] L. Theis, I. Korshunova, A. Tejani, and F. Huszár. Faster gaze prediction with dense networks and fisher pruning. arXiv preprint arXiv:1801.05787, 2018.
[46] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou. Training data-efficient image transformers & distillation through attention. In ICML, pages 10347-10357. PMLR, 2021.
[47] J. Turner, E. J. Crowley, M. O'Boyle, A. Storkey, and G. Gray. Blockswap: Fisher-guided block substitution for network compression on a budget. In ICLR, 2020.
[48] J. Turner, E. J. Crowley, and M. F. O'Boyle. Neural architecture search as program transformation exploration. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pages 915-927, 2021.
[49] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NeurIPS, pages 5998-6008, 2017.
[50] C. Wang, G. Zhang, and R. Grosse. Picking winning tickets before training by preserving gradient flow. In ICLR, 2020.
[51] L. Xiao, J. Pennington, and S. Schoenholz. Disentangling trainability and generalization in deep neural networks. In ICML, pages 10462-10472. PMLR, 2020.
[52] R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu. On layer normalization in the transformer architecture. In ICML, pages 10524-10533. PMLR, 2020.
[53] Y. Xu, Q. Zhang, J. Zhang, and D. Tao. Vitae: Vision transformer advanced by exploring intrinsic inductive bias. In NeurIPS, volume 34, 2021.
[54] Y. Yang, H. Li, S. You, F. Wang, C. Qian, and Z. Lin. Ista-nas: Efficient and consistent neural architecture search by sparse coding. In NeurIPS, 2020.
[55] Y. Yang, S. You, H. Li, F. Wang, C. Qian, and Z. Lin. Towards improving the consistency, efficiency, and flexibility of differentiable neural architecture search. In CVPR, 2021.
[56] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[57] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.
[58] H. Zhang, Y. N. Dauphin, and T. Ma. Residual learning without normalization via better initialization. In ICLR, 2019.
[59] Z. Zhang and Z. Jia. Gradsign: Model performance inference with theoretical insights. In ICLR, 2022.
[60] C. Zhu, R. Ni, Z. Xu, K. Kong, W. R. Huang, and T. Goldstein. Gradinit: Learning to initialize neural networks for stable and efficient training. In NeurIPS, volume 34, 2021.
[61] J. Zhuang, T. Tang, Y. Ding, S. C. Tatikonda, N. Dvornek, X. Papademetris, and J. Duncan. Adabelief optimizer: Adapting stepsizes by the belief in observed gradients. In NeurIPS, volume 33, pages 18795-18806, 2020.
[62] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] See the last three paragraphs in Section 1.
(b) Did you describe the limitations of your work? [Yes] See Section 6.3.
(c) Did you discuss any potential negative societal impacts of your work? [Yes] See Section 6.3.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] We have.
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes] See Section 3.
(b) Did you include complete proofs of all theoretical results? [Yes] See Appendix A.
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See the supplementary material.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Appendix B.
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See Tables 1 and 2.
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Table 4.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes] We cite the corresponding papers.
(b) Did you mention the license of the assets? [N/A]
(c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]