# all_you_need_is_a_good_init__323563f4.pdf

Published as a conference paper at ICLR 2016

ALL YOU NEED IS A GOOD INIT

Dmytro Mishkin, Jiri Matas
Center for Machine Perception, Czech Technical University in Prague, Czech Republic
{mishkdmy,matas}@cmp.felk.cvut.cz

Layer-sequential unit-variance (LSUV) initialization, a simple method for weight initialization for deep net learning, is proposed. The method consists of two steps. First, pre-initialize the weights of each convolution or inner-product layer with orthonormal matrices. Second, proceed from the first to the final layer, normalizing the variance of the output of each layer to be equal to one. Experiments with different activation functions (maxout, ReLU-family, tanh) show that the proposed initialization leads to learning of very deep nets that (i) reach test accuracy better than or equal to standard methods and (ii) train at least as fast as the complex schemes proposed specifically for very deep nets, such as FitNets (Romero et al. (2015)) and Highway (Srivastava et al. (2015)). Performance is evaluated on GoogLeNet, CaffeNet, FitNets and Residual nets, and the state of the art, or performance very close to it, is achieved on the MNIST, CIFAR-10/100 and ImageNet datasets.

1 INTRODUCTION

Deep nets have demonstrated impressive results on a number of computer vision and natural language processing problems. At present, state-of-the-art results in image classification (Simonyan & Zisserman (2015); Szegedy et al. (2015)) and speech recognition (Sercu et al. (2015)), etc., have been achieved with very deep (16+ layer) CNNs. Thin deep nets are of particular interest, since they are accurate and at the same time inference-efficient (Romero et al. (2015)).

One of the main obstacles preventing the wide adoption of very deep nets is the absence of a general, repeatable and efficient procedure for their end-to-end training. For example, VGGNet (Simonyan & Zisserman (2015)) was optimized by a four-stage procedure that started by training a network of moderate depth and progressively added more layers. Romero et al. (2015) stated that deep and thin networks are very hard to train by backpropagation if deeper than five layers, especially with uniform initialization.

On the other hand, He et al. (2015) showed that it is possible to train the VGGNet in a single optimization run if the network weights are initialized with a specific ReLU-aware initialization. The He et al. (2015) procedure generalizes to the ReLU non-linearity the idea of filter-size-dependent initialization, introduced for the linear case by Glorot & Bengio (2010). Batch normalization (Ioffe & Szegedy (2015)), a technique that inserts layers into the deep net that transform the output for the batch to be zero mean, unit variance, has successfully facilitated training of the twenty-two-layer GoogLeNet (Szegedy et al. (2015)). However, batch normalization adds a 30% computational overhead to each iteration.

The main contribution of the paper is a proposal of a simple initialization procedure that, in connection with standard stochastic gradient descent (SGD), leads to state-of-the-art thin and very deep neural nets¹. The result highlights the importance of initialization in very deep nets. We review the history of CNN initialization in Section 2, which is followed by a detailed description of the novel initialization method in Section 3. The method is experimentally validated in Section 4.
¹ The code allowing to reproduce the experiments is available at https://github.com/ducha-aiki/LSUVinit

Figure 1: Relative magnitude of weight updates as a function of the training iteration for different weight initialization scalings after ortho-normalization. Values in the range 0.1% to 1% lead to convergence, larger values lead to divergence, and for smaller values the network can hardly leave the initial state. Subgraphs show results for different non-linearities: ReLU (top left), VLReLU (top right), hyperbolic tangent (bottom left) and maxout (bottom right).

2 INITIALIZATION IN NEURAL NETWORKS

After the success of CNNs in ILSVRC 2012 (Krizhevsky et al. (2012)), initialization with Gaussian noise with mean equal to zero and standard deviation set to 0.01, with a bias equal to one added for some layers, became very popular. But, as mentioned before, it is not possible to train a very deep network from scratch with it (Simonyan & Zisserman (2015)). The problem is caused by the activation and/or gradient magnitude in the final layers (He et al. (2015)). If each layer, not properly initialized, scales its input by k, the final scale would be k^L, where L is the number of layers. Values of k > 1 lead to extremely large values in the output layers; k < 1 leads to a diminishing signal and gradient.

Glorot & Bengio (2010) proposed a formula for estimating the standard deviation on the basis of the number of input and output channels of the layers, under the assumption of no non-linearity between layers. Despite the invalidity of the assumption, Glorot initialization works well in many applications. He et al. (2015) extended this formula to the ReLU (Glorot et al. (2011)) non-linearity and showed its superior performance for ReLU-based nets. Figure 1 shows why scaling is important. Large weights lead to divergence via updates larger than the initial values; small initial weights do not allow the network to learn, since the updates are of the order of 0.0001% per iteration. The optimal scaling for a ReLU net is around 1.4, which is in line with the theoretically derived √2 of He et al. (2015).
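For a convolution layer with a k x k kernel, c_in input channels and c_out output channels, a common convention is fan_in = k*k*c_in and fan_out = k*k*c_out; Glorot initialization then uses std = sqrt(2/(fan_in + fan_out)) and the He (MSRA) variant uses std = sqrt(2/fan_in). A minimal sketch of the two formulas (the example layer sizes are illustrative, not taken from the paper):

```python
import math

def glorot_std(fan_in: int, fan_out: int) -> float:
    # Glorot & Bengio (2010): derived under the assumption of no
    # non-linearity between layers.
    return math.sqrt(2.0 / (fan_in + fan_out))

def msra_std(fan_in: int) -> float:
    # He et al. (2015): the extra factor of 2 compensates for ReLU
    # zeroing roughly half of the activations.
    return math.sqrt(2.0 / fan_in)

# Illustrative 3x3 convolution with 32 input and 64 output channels.
fan_in, fan_out = 3 * 3 * 32, 3 * 3 * 64
print(glorot_std(fan_in, fan_out), msra_std(fan_in))
```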
Sussillo & Abbott (2014) proposed the so-called random walk initialization, RWI, which keeps the log of the norms of the backpropagated errors constant. In our experiments we have not been able to obtain good results with our implementation of RWI, which is why this method is not evaluated in the experimental section. Hinton et al. (2014) and Romero et al. (2015) take another approach to initialization and formulate training as mimicking the teacher network's predictions (so-called knowledge distillation) and internal representations (so-called Hints initialization) rather than minimizing the softmax loss.

Srivastava et al. (2015) proposed an LSTM-inspired gating scheme to control information and gradient flow through the network. They trained 1000-layer MLP networks on MNIST. Basically, this kind of network implicitly learns the depth needed for the given task.

Independently, Saxe et al. (2014) showed that orthonormal matrix initialization works much better for linear networks than Gaussian noise, which is only approximately orthogonal. It also works for networks with non-linearities. The approach of layer-wise pre-training (Bengio et al. (2007)), which is still useful for multi-layer perceptrons, is not popular for training discriminative convolutional networks.

3 LAYER-SEQUENTIAL UNIT-VARIANCE INITIALIZATION

To the best of our knowledge, there have been no attempts to generalize the Glorot & Bengio (2010) formulas to non-linearities other than ReLU, such as tanh, maxout, etc. Also, the formula does not cover max-pooling, local normalization layers (Krizhevsky et al. (2012)) and other types of layers which influence the activation variance. Instead of a theoretical derivation for all possible layer types, or an extensive parameter search as in Figure 1, we propose a data-driven weight initialization.

We thus extend the orthonormal initialization of Saxe et al. (2014) to an iterative procedure, described in Algorithm 1. Saxe et al. (2014) can be implemented in two steps. First, fill the weights with Gaussian noise with unit variance. Second, decompose them to an orthonormal basis with a QR or SVD decomposition and replace the weights with one of the components. The LSUV process then estimates the output variance of each convolution and inner-product layer and scales the weights to make the variance equal to one. The influence of the selected mini-batch size on the estimated variance is negligible over a wide range, see the Appendix.

The proposed scheme can be viewed as an orthonormal initialization combined with batch normalization performed only on the first mini-batch. The similarity to batch normalization is the unit-variance normalization procedure, while the initial ortho-normalization of the weight matrices efficiently de-correlates layer activations, which is not done in Ioffe & Szegedy (2015). Experiments show that such normalization is sufficient and computationally highly efficient in comparison with full batch normalization.

The LSUV algorithm is summarized in Algorithm 1.

Algorithm 1: Layer-sequential unit-variance orthogonal initialization. L: a convolution or fully connected layer; W_L: its weights; B_L: its output blob; Tol_var: variance tolerance; T_i: current trial; T_max: maximum number of trials.

    Pre-initialize network with orthonormal matrices as in Saxe et al. (2014)
    for each layer L do
        while |Var(B_L) - 1.0| >= Tol_var and T_i < T_max do
            do forward pass with a mini-batch
            calculate Var(B_L)
            W_L = W_L / sqrt(Var(B_L))
        end while
    end for

The single parameter Tol_var influences convergence of the initialization procedure, not the properties of the trained network. Its value does not noticeably influence performance in the broad range of 0.01 to 0.1. Because of data variations, it is often not possible to normalize the variance with the desired precision. To eliminate the possibility of an infinite loop, we restricted the number of trials to T_max. However, in the experiments described in the paper, T_max was never reached; the desired variance was achieved in 1-5 iterations.
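For illustration, Algorithm 1 can be written as the following minimal PyTorch-style sketch. This is not the authors' Caffe implementation; the module traversal, the forward hook mechanics and the single mini-batch `batch` are assumptions made for the example.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def lsuv_init(model: nn.Module, batch: torch.Tensor,
              tol_var: float = 0.1, t_max: int = 10) -> None:
    """Sketch of Algorithm 1: orthonormal pre-init + per-layer variance scaling."""
    for layer in model.modules():
        if not isinstance(layer, (nn.Conv2d, nn.Linear)):
            continue
        # Step 1: orthonormal pre-initialization (Saxe et al., 2014).
        nn.init.orthogonal_(layer.weight)
        # Step 2: rescale until the layer's output blob has unit variance.
        stats = {}
        hook = layer.register_forward_hook(
            lambda m, inp, out, d=stats: d.__setitem__('var', out.var().item()))
        for _ in range(t_max):
            model(batch)                          # forward pass with a mini-batch
            if abs(stats['var'] - 1.0) < tol_var:
                break
            layer.weight.data /= stats['var'] ** 0.5
        hook.remove()
```

As in the paper, only convolution and inner-product (linear) layers are touched; per the Appendix, the size of the mini-batch used for the variance estimate barely affects the result.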
We also tested a variant of LSUV initialization which normalizes the input activations of each layer instead of the output ones. Normalizing the input or the output is identical for standard feed-forward nets, but normalizing the input is much more complicated for networks with maxout (Goodfellow et al. (2013)) or for networks like GoogLeNet (Szegedy et al. (2015)) which use the output of multiple layers as input. Input normalization brought no improvement of results when tested against the LSUV Algorithm 1.

LSUV was also tested with pre-initialization of the weights with Gaussian noise instead of orthonormal matrices. The Gaussian initialization led to a small, but consistent, decrease in performance.

4 EXPERIMENTAL VALIDATION

Here we show that very deep and thin nets can be trained in a single stage. The network architectures are exactly as proposed by Romero et al. (2015) and are presented in Table 1.

Table 1: FitNets (Romero et al. (2015)) network architectures used in the experiments. Non-linearity: maxout with 2 linear pieces in convolution layers, maxout with 5 linear pieces in fully connected layers.

FitNet-1 (250K params): conv 3x3x16, conv 3x3x16, conv 3x3x16, pool 2x2; conv 3x3x32, conv 3x3x32, conv 3x3x32, pool 2x2; conv 3x3x48, conv 3x3x48, conv 3x3x64, pool 8x8 (global); fc-500; softmax-10(100).

FitNet-4 (2.5M params): conv 3x3x32, conv 3x3x32, conv 3x3x32, conv 3x3x48, conv 3x3x48, pool 2x2; conv 3x3x80 (x5), pool 2x2; conv 3x3x128 (x5), pool 8x8 (global); fc-500; softmax-10(100).

FitResNet-4 (2.5M params): conv 3x3x32, conv 3x3x32 + sum, conv 3x3x48, conv 3x3x48 + sum, conv 3x3x48, pool 2x2; conv 3x3x80, conv 3x3x80 + sum, conv 3x3x80, conv 3x3x80 + sum, conv 3x3x80, pool 2x2; conv 3x3x128, conv 3x3x128 + sum, conv 3x3x128, conv 3x3x128 + sum, conv 3x3x128, pool 8x8 (global); fc-500; softmax-10(100).

FitNet-MNIST (30K params): conv 3x3x16, conv 3x3x16, pool 4x4 stride 2; conv 3x3x16, conv 3x3x16, pool 4x4 stride 2; conv 3x3x12, conv 3x3x12, pool 2x2; softmax-10.

4.1 MNIST

First, as a sanity check, we performed an experiment on the MNIST dataset (LeCun et al. (1998)). It consists of 60,000 28x28 grayscale images of handwritten digits 0 to 9. We selected the FitNet-MNIST architecture (see Table 1) of Romero et al. (2015) and trained it with the proposed initialization strategy, without data augmentation. Recognition results are shown in Table 2, right block. LSUV outperforms orthonormal initialization, and both LSUV and orthonormal outperform the Hints initialization of Romero et al. (2015). The error rates of the Deeply-Supervised Nets (DSN, Lee et al. (2015)) and maxout networks (Goodfellow et al. (2013)), the current state of the art, are provided for reference. Since the widely cited DSN error rate of 0.39%, the state of the art until recently, was obtained after replacing the softmax classifier with an SVM, we do the same and also observe improved results (line FitNet-LSUV-SVM in Table 2).

4.2 CIFAR-10/100

We validated the proposed LSUV initialization strategy on the CIFAR-10/100 (Krizhevsky (2009)) datasets. Each contains 60,000 32x32 RGB images, divided into 10 and 100 classes, respectively. The FitNets are trained with stochastic gradient descent with momentum set to 0.9 and the initial learning rate set to 0.01, reduced by a factor of 10 after the 100th, 150th and 200th epoch, finishing at the 230th epoch. Srivastava et al. (2015) and Romero et al. (2015) trained their networks for 500 epochs. Of course, training time is a trade-off dependent on the desired accuracy; one could train a slightly less accurate network much faster.
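This schedule maps directly onto a standard SGD setup; below is a minimal sketch of it, assuming a placeholder model and random stand-in data (neither is the FitNet-4 of Table 1; they only make the snippet self-contained).

```python
import torch
import torch.nn as nn

# Placeholder model and data; only the optimizer and schedule mirror the text.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
data = [(torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))) for _ in range(4)]

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[100, 150, 200], gamma=0.1)  # x0.1 after these epochs

for epoch in range(230):                  # training finishes at the 230th epoch
    for images, labels in data:
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```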
Like in the MNIST experiment, LSUV- and orthonormal-initialized nets outperformed the Hints-trained FitNets, leading to the new state of the art when using the commonly used augmentation of mirroring and random shifts. The gain on the fine-grained CIFAR-100 is much larger than on CIFAR-10. Also, note that FitNets with LSUV initialization outperform even much larger networks like Large-All-CNN (Springenberg et al. (2014)) and Fractional Max-pooling (Graham (2014a)) trained with affine and color dataset augmentation on CIFAR-100. The results of LSUV are virtually identical to those of the orthonormal initialization.

Table 2: Network performance comparison on the MNIST and CIFAR-10/100 datasets. Results marked with * were obtained with the RMSProp optimizer (Tieleman & Hinton (2012)).

Accuracy on CIFAR-10/100 with data augmentation:
Network                   CIFAR-10 [%]   CIFAR-100 [%]
Fitnet4-LSUV              93.94          70.04 (72.34*)
Fitnet4-OrthoInit         93.78          70.44 (72.30*)
Fitnet4-Hints             91.61          64.96
Fitnet4-Highway           92.46          68.09
ALL-CNN                   92.75          66.29
DSN                       92.03          65.43
NiN                       91.19          64.32
maxout                    90.62          65.46
MIN                       93.25          71.14
Extreme data augmentation:
Large ALL-CNN             95.59          n/a
Fractional MP (1 test)    95.50          68.55
Fractional MP (12 tests)  96.53          73.61

Error on MNIST without data augmentation:
Network             layers   params   Error, %
FitNet-like networks:
HighWay-16          10       39K      0.57
FitNet-Hints        6        30K      0.51
FitNet-Ortho        6        30K      0.48
FitNet-LSUV         6        30K      0.48
FitNet-Ortho-SVM    6        30K      0.43
FitNet-LSUV-SVM     6        30K      0.38
State-of-the-art networks:
DSN-Softmax         3        350K     0.51
DSN-SVM             3        350K     0.39
HighWay-32          10       151K     0.45
maxout              3        420K     0.45
MIN²                9        447K     0.24

² When preparing this submission we found the recent, unreviewed MIN paper (Chang & Chen (2015)), which uses a sophisticated combination of batch normalization, maxout and network-in-network non-linearities and establishes a new state of the art on MNIST.

5 ANALYSIS OF EMPIRICAL RESULTS

5.1 INITIALIZATION STRATEGIES AND NON-LINEARITIES

For the FitNet-1 architecture, we have not experienced any difficulties training the network with any of the activation functions (ReLU, maxout, tanh), optimizers (SGD, RMSProp) or initializations (Xavier, MSRA, Ortho, LSUV), unlike the uniform initialization used in Romero et al. (2015). The most probable cause is that CNNs tolerate a wide variety of mediocre initializations; only the learning time increases. The differences in final accuracy between the initialization methods for the FitNet-1 architecture are rather small and are therefore not presented here. The FitNet-4 architecture is much more difficult to optimize, and thus we focus on it in the experiments presented in this section.

We have explored the initializations with different activation functions in very deep networks, more specifically: ReLU, hyperbolic tangent, sigmoid, maxout and VLReLU, the very leaky ReLU (Graham (2014c)), a variant of leaky ReLU (Maas et al. (2013)) with a large value of the negative slope, 0.333, instead of the originally proposed 0.01, which is popular in Kaggle competitions (Dieleman (2015); Graham (2014b)). Testing was performed on CIFAR-10 and the results are in Table 3 and Figure 2.
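As described above, VLReLU is simply a leaky ReLU with a much larger negative slope; a one-line sketch (the PyTorch module name is an assumption, not how the paper implements it):

```python
import torch
import torch.nn as nn

# VLReLU: leaky ReLU with negative slope 0.333 instead of the usual 0.01.
vlrelu = nn.LeakyReLU(negative_slope=0.333)
print(vlrelu(torch.tensor([-3.0, 3.0])))  # approx. tensor([-0.999, 3.000])
```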
Table 3: Compatibility of activation functions and initializations. Dataset: CIFAR-10. Architecture: FitNet-4, 2.5M params for the maxout net, 1.2M for the rest, 17 layers. The n/c symbol stands for "failed to converge"; n/c*: after extensive trials we managed to train a maxout net with MSRA initialization with a very small learning rate and gradient clipping, see Figure 2; the experiment is marked n/c* because the training time was excessive and the parameters non-standard.

Init method             maxout   ReLU    VLReLU   tanh    Sigmoid
LSUV                    93.94    92.11   92.97    89.28   n/c
OrthoNorm               93.78    91.74   92.40    89.48   n/c
OrthoNorm-MSRA scaled   -        91.93   93.09    -       n/c
Xavier                  91.75    90.63   92.27    89.82   n/c
MSRA                    n/c*     90.91   92.43    89.54   n/c

Performance of the orthonormal-based methods is superior to the scaled Gaussian-noise approaches for all tested types of activation functions, except tanh. The proposed LSUV strategy outperforms orthonormal initialization by a smaller margin, but still consistently (see Table 3). All the methods failed to train a sigmoid-based very deep network. Figure 2 shows that the LSUV method not only leads to a better generalization error, but also converges faster for all tested activation functions, except tanh.

Figure 2: CIFAR-10 accuracy of FitNet-4 with different activation functions (maxout, ReLU, VLReLU, tanh) as a function of the training epoch for the LSUV, MSRA, OrthoNorm and Xavier initializations. Note that the graphs are cropped at 0.4 accuracy. Highway-19 is the network from Srivastava et al. (2015).

We have also tested how the different initializations work out of the box with residual net training (He et al. (2015)); a residual net won the ILSVRC-2015 challenge. The original paper proposed different implementations of residual learning. We adopted the simplest one, shown in Table 1 as FitResNet-4: the output of each even convolutional layer is summed with the output of the previous non-linearity layer and then fed into the next non-linearity (a sketch of this wiring follows Table 4 below). Results are shown in Table 4. LSUV is the only initialization algorithm which leads the nets to convergence with all tested non-linearities without any additional tuning, except, again, sigmoid. It is worth noting that residual training improves results for ReLU and maxout, but does not help the tanh-based network.

Table 4: The performance of activation functions and initializations in the residual learning setup (He et al. (2015)), FitResNet-4 from Table 1. The n/c symbol stands for "failed to converge".

Init method   maxout   ReLU    VLReLU   tanh    Sigmoid
LSUV          94.16    92.82   93.36    89.17   n/c
OrthoNorm     n/c      91.42   n/c      89.31   n/c
Xavier        n/c      92.48   93.34    89.62   n/c
MSRA          n/c      n/c     n/c      88.59   n/c
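A minimal sketch of the wiring just described, i.e. an identity shortcut around every pair of convolutions. The channel count and the ReLU activation are illustrative only; FitResNet-4 itself uses maxout and the widths listed in Table 1.

```python
import torch
import torch.nn as nn

class ResidualPair(nn.Module):
    """The output of the second (even) convolution is summed with the output
    of the previous non-linearity before the next non-linearity is applied."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.conv_odd = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_even = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.act(self.conv_odd(x))           # previous non-linearity output
        return self.act(self.conv_even(h) + h)   # sum, then next non-linearity

block = ResidualPair()
print(block(torch.randn(1, 32, 16, 16)).shape)   # torch.Size([1, 32, 16, 16])
```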
5.2 COMPARISON TO BATCH NORMALIZATION (BN)

The LSUV procedure could be viewed as batch normalization of layer outputs performed only before the start of training. It is therefore natural to compare LSUV against a batch-normalized network initialized with the standard method.

5.2.1 WHERE TO PUT BN: BEFORE OR AFTER THE NON-LINEARITY?

It is not clear from the paper of Ioffe & Szegedy (2015) where to put the batch-normalization layer: before the input of each layer, as stated in Section 3.1, or before the non-linearity, as stated in Section 3.2. We therefore conducted an experiment with FitNet-4 on CIFAR-10 to clarify this. The results are shown in Table 5. Exact numbers vary from run to run, but in most cases batch normalization put after the non-linearity performs better.

Table 5: CIFAR-10 accuracy of batch-normalized FitNet-4. Comparison of batch normalization put before and after the non-linearity.

Non-linearity   BN before   BN after
TanH            88.10       89.22
ReLU            92.60       92.58
Maxout          92.30       92.98

In the next experiment we compare BN-FitNet-4, initialized with Xavier, and LSUV-initialized FitNet-4. Batch normalization reduces training time in terms of the number of iterations needed, but each iteration becomes slower because of the extra computations. The accuracy versus wall-clock-time graphs are shown in Figure 3. The LSUV-initialized network is as good as the batch-normalized one. However, we are not claiming that batch normalization can always be replaced by proper initialization, especially on large datasets like ImageNet.

Figure 3: CIFAR-10 accuracy of LSUV-initialized and batch-normalized (BN) FitNet-4 networks as a function of wall-clock time, for the maxout, ReLU, VLReLU and TanH non-linearities. BN-half stands for half the number of iterations in each step.

5.3 IMAGENET TRAINING

We trained CaffeNet (Jia et al. (2014)) and GoogLeNet (Szegedy et al. (2015)) on the ImageNet-1000 dataset (Russakovsky et al. (2015)) with the original initialization and with LSUV. CaffeNet is a variant of AlexNet with nearly identical performance, in which the order of the pooling and normalization layers is switched to reduce the memory footprint.

LSUV initialization reduces the starting flat-loss time from 0.5 epochs to 0.05 for CaffeNet, and the net starts to converge faster, but it is overtaken by the standard CaffeNet at the 30th epoch (see Figure 4) and its final precision is 1.3% lower. We have no explanation for this empirical phenomenon. On the contrary, the LSUV-initialized GoogLeNet learns faster than the original one and shows better test accuracy throughout training, see Figure 5. The final accuracy is 0.680 vs. 0.672, respectively.

Figure 4: CaffeNet training on the ILSVRC-2012 dataset with LSUV and the original Krizhevsky et al. (2012) initialization. Training loss (left) and validation accuracy (right). Top: first epoch; middle: first 10 epochs; bottom: full training.

Figure 5: GoogLeNet training on the ILSVRC-2012 dataset with LSUV and the reference BVLC initialization (Jia et al. (2014)). Training loss (left) and validation accuracy (right). Top: first epoch; middle: first ten epochs; bottom: full training.

5.4 TIMINGS

A significant part of LSUV initialization is the SVD decomposition of the weight matrices; e.g., for the fc6 layer of CaffeNet, an SVD of a 9216x4096 matrix is required. The computational overhead on top of generating the almost instant scaled random Gaussian samples is shown in Table 6. In the slowest case, CaffeNet, LSUV initialization takes 3.5 minutes, which is negligible in comparison to the training time.

Table 6: Time needed for network initialization on top of random Gaussian initialization (seconds).
Network     OrthoNorm   LSUV
FitNet-4    1           4
CaffeNet    188         210
GoogLeNet   24          60

6 CONCLUSIONS

LSUV, layer-sequential unit-variance initialization, a simple strategy for weight initialization for deep net learning, is proposed. We have shown that the LSUV initialization, described fully in six lines of pseudocode, is as good as complex learning schemes which need, for instance, auxiliary nets. The LSUV initialization allows learning of very deep nets via standard SGD, is fast, and leads to (near) state-of-the-art results on the MNIST, CIFAR and ImageNet datasets, outperforming the sophisticated systems designed specifically for very deep nets such as FitNets (Romero et al. (2015)) and Highway (Srivastava et al. (2015)). The proposed initialization works well with different activation functions.

Our experiments confirm the finding of Romero et al. (2015) that very thin, thus fast and low in parameters, but deep networks obtain comparable or even better performance than wider, but shallower, ones.

ACKNOWLEDGMENTS

The authors were supported by the Czech Science Foundation Project GACR P103/12/G084 and the CTU student grant SGS15/155/OHK3/2T/13.

REFERENCES

Bengio, Yoshua, Lamblin, Pascal, Popovici, Dan, and Larochelle, Hugo. Greedy layer-wise training of deep networks. In Schölkopf, B., Platt, J.C., and Hoffman, T. (eds.), Advances in Neural Information Processing Systems 19, pp. 153-160. MIT Press, 2007. URL http://papers.nips.cc/paper/3048-greedy-layer-wise-training-of-deep-networks.pdf.

Chang, J.-R. and Chen, Y.-S. Batch-normalized Maxout Network in Network. arXiv e-prints, November 2015. URL http://arxiv.org/abs/1511.02583.

Dieleman, Sander. Classifying plankton with deep neural networks, 2015. URL http://benanne.github.io/2015/03/17/plankton.html.

Glorot, Xavier and Bengio, Yoshua. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS 2010). Society for Artificial Intelligence and Statistics, 2010.

Glorot, Xavier, Bordes, Antoine, and Bengio, Yoshua. Deep sparse rectifier neural networks. In Gordon, Geoffrey J. and Dunson, David B. (eds.), Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS-11), volume 15, pp. 315-323. Journal of Machine Learning Research - Workshop and Conference Proceedings, 2011. URL http://www.jmlr.org/proceedings/papers/v15/glorot11a/glorot11a.pdf.

Goodfellow, Ian J., Warde-Farley, David, Mirza, Mehdi, Courville, Aaron C., and Bengio, Yoshua. Maxout networks. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pp. 1319-1327, 2013. URL http://jmlr.org/proceedings/papers/v28/goodfellow13.html.

Graham, Ben. Fractional Max-Pooling. arXiv e-prints, December 2014a. URL http://arxiv.org/abs/1412.6071.

Graham, Ben. Train you very own deep convolutional network, 2014b. URL https://www.kaggle.com/c/cifar-10/forums/t/10493/train-you-very-own-deep-convolutional-network.

Graham, Ben. Spatially-sparse convolutional neural networks. arXiv e-prints, September 2014c.

He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. arXiv e-prints, December 2015. URL http://arxiv.org/abs/1512.03385.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In International Conference on Computer Vision (ICCV), 2015. URL http://arxiv.org/abs/1502.01852.
Hinton, Geoffrey, Vinyals, Oriol, and Dean, Jeff. Distilling the Knowledge in a Neural Network. In Proceedings of the Deep Learning and Representation Learning Workshop, NIPS 2014, 2014.

Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Blei, David and Bach, Francis (eds.), Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 448-456. JMLR Workshop and Conference Proceedings, 2015. URL http://jmlr.org/proceedings/papers/v37/ioffe15.pdf.

Jia, Yangqing, Shelhamer, Evan, Donahue, Jeff, Karayev, Sergey, Long, Jonathan, Girshick, Ross, Guadarrama, Sergio, and Darrell, Trevor. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

Krizhevsky, Alex. Learning Multiple Layers of Features from Tiny Images. Master's thesis, 2009. URL http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Pereira, F., Burges, C.J.C., Bottou, L., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 25, pp. 1097-1105. Curran Associates, Inc., 2012. URL http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, Nov 1998. ISSN 0018-9219. doi: 10.1109/5.726791.

Lee, Chen-Yu, Xie, Saining, Gallagher, Patrick W., Zhang, Zhengyou, and Tu, Zhuowen. Deeply-supervised nets. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2015, San Diego, California, USA, May 9-12, 2015, 2015. URL http://jmlr.org/proceedings/papers/v38/lee15a.html.

Maas, Andrew L., Hannun, Awni Y., and Ng, Andrew Y. Rectifier nonlinearities improve neural network acoustic models. Proc. ICML, 30, 2013.

Romero, Adriana, Ballas, Nicolas, Kahou, Samira Ebrahimi, Chassang, Antoine, Gatta, Carlo, and Bengio, Yoshua. FitNets: Hints for thin deep nets. In Proceedings of ICLR, May 2015. URL http://arxiv.org/abs/1412.6550.

Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Fei-Fei, Li. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), pp. 1-42, April 2015. doi: 10.1007/s11263-015-0816-y.

Saxe, Andrew M., McClelland, James L., and Ganguli, Surya. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In Proceedings of ICLR, 2014. URL http://arxiv.org/abs/1312.6120.

Sercu, T., Puhrsch, C., Kingsbury, B., and LeCun, Y. Very Deep Multilingual Convolutional Neural Networks for LVCSR. arXiv e-prints, September 2015. URL http://arxiv.org/abs/1509.08967.

Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale visual recognition. In Proceedings of ICLR, May 2015. URL http://arxiv.org/abs/1409.1556.

Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. Striving for Simplicity: The All Convolutional Net. In Proceedings of ICLR Workshop, December 2014. URL http://arxiv.org/abs/1412.6806.

Srivastava, Rupesh Kumar, Greff, Klaus, and Schmidhuber, Jürgen. Training Very Deep Networks. In Proceedings of NIPS, 2015. URL http://arxiv.org/abs/1507.06228.
Sussillo, David and Abbott, L. F. Random Walk Initialization for Training Very Deep Feedforward Networks. arXiv e-prints, December 2014. URL http://arxiv.org/abs/1412.6558.

Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. In CVPR 2015, 2015. URL http://arxiv.org/abs/1409.4842.

Tieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. In COURSERA: Neural Networks for Machine Learning. 2012.

A TECHNICAL DETAILS

A.1 INFLUENCE OF MINI-BATCH SIZE ON LSUV INITIALIZATION

We selected the tanh activation as the one where LSUV initialization shows the worst performance and tested the influence of the mini-batch size on the training process. Note that the training mini-batch is the same for all initializations; the only difference is the mini-batch used for variance estimation. One can see from Table 7 that there is no difference between small and large mini-batches, except in the extreme case where only two samples are used.

Table 7: FitNet-4 TanH final performance on CIFAR-10 as a function of the LSUV mini-batch size.

Batch size for LSUV   2       16      32      128     1024
Final accuracy [%]    89.27   89.30   89.30   89.28   89.31

A.2 LSUV WEIGHT STANDARD DEVIATIONS IN DIFFERENT NETWORKS

Tables 8 and 9 show the standard deviations of the filter weights found by the LSUV procedure and by other initialization schemes.

Table 8: Standard deviations of the weights per layer for different initializations; FitNet-4, CIFAR-10, ReLU.

Layer    LSUV    OrthoNorm   MSRA    Xavier
conv11   0.383   0.175       0.265   0.191
conv12   0.091   0.058       0.082   0.059
conv13   0.083   0.058       0.083   0.059
conv14   0.076   0.058       0.083   0.059
conv15   0.068   0.048       0.060   0.048
conv21   0.036   0.048       0.052   0.037
conv22   0.048   0.037       0.052   0.037
conv23   0.061   0.037       0.052   0.037
conv24   0.052   0.037       0.052   0.037
conv25   0.067   0.037       0.052   0.037
conv26   0.055   0.037       0.052   0.037
conv31   0.034   0.037       0.052   0.037
conv32   0.044   0.029       0.041   0.029
conv33   0.042   0.029       0.041   0.029
conv34   0.041   0.029       0.041   0.029
conv35   0.040   0.029       0.041   0.029
conv36   0.043   0.029       0.041   0.029
ip1      0.048   0.044       0.124   0.088

A.3 GRADIENTS

To check how the activation variance normalization influences the variance of the gradient, we measure the average variance of the gradient at all layers after 10 mini-batches. The variance is close to 10^-9 for all convolutional layers. It is much more stable than for the reference methods, except MSRA; see Table 10.
Table 9: Standard deviations of the weights per layer for different non-linearities, found by LSUV; FitNet-4, CIFAR-10.

Layer    TanH    ReLU    VLReLU   Maxout
conv11   0.386   0.388   0.384    0.383
conv12   0.118   0.083   0.084    0.058
conv13   0.102   0.096   0.075    0.063
conv14   0.101   0.082   0.080    0.065
conv15   0.081   0.064   0.065    0.044
conv21   0.065   0.044   0.037    0.034
conv22   0.064   0.055   0.047    0.040
conv23   0.060   0.055   0.049    0.032
conv24   0.058   0.064   0.049    0.041
conv25   0.061   0.061   0.043    0.040
conv26   0.063   0.049   0.052    0.037
conv31   0.054   0.032   0.037    0.027
conv32   0.052   0.049   0.037    0.031
conv33   0.051   0.048   0.042    0.033
conv34   0.050   0.047   0.038    0.028
conv35   0.051   0.047   0.039    0.030
conv36   0.051   0.040   0.037    0.033
ip1      0.084   0.044   0.044    0.038

Table 10: Variance of the initial gradients per layer for different initializations; FitNet-4, ReLU.

Layer    LSUV       MSRA       OrthoInit   Xavier
conv11   4.87E-10   9.42E-09   5.67E-15    2.30E-14
conv12   5.07E-10   9.62E-09   1.17E-14    4.85E-14
conv13   4.36E-10   1.07E-08   2.30E-14    9.94E-14
conv14   3.21E-10   7.03E-09   2.95E-14    1.35E-13
conv15   3.85E-10   6.57E-09   6.71E-14    3.10E-13
conv21   1.25E-09   9.11E-09   1.95E-13    8.00E-13
conv22   1.15E-09   9.73E-09   3.79E-13    1.56E-12
conv23   1.19E-09   1.07E-08   8.18E-13    3.28E-12
conv24   9.12E-10   1.07E-08   1.79E-12    6.69E-12
conv25   7.45E-10   1.09E-08   4.04E-12    1.36E-11
conv26   8.21E-10   1.15E-08   8.36E-12    2.99E-11
conv31   3.06E-09   1.92E-08   2.65E-11    1.05E-10
conv32   2.57E-09   2.01E-08   5.95E-11    2.28E-10
conv33   2.40E-09   1.99E-08   1.21E-10    4.69E-10
conv34   2.19E-09   2.25E-08   2.64E-10    1.01E-09
conv35   1.94E-09   2.57E-08   5.89E-10    2.27E-09
conv36   2.31E-09   2.97E-08   1.32E-09    5.57E-09
ip1      1.24E-07   1.95E-07   6.91E-08    7.31E-08
var(ip1)/var(conv11)   255   20   12198922   3176821