# Principled Weight Initialization for Hypernetworks

Published as a conference paper at ICLR 2020

Oscar Chang, Lampros Flokas, Hod Lipson
Columbia University, New York, NY 10027
{oscar.chang, lf2540, hod.lipson}@columbia.edu

Hypernetworks are meta neural networks that generate weights for a main neural network in an end-to-end differentiable manner. Despite extensive applications ranging from multi-task learning to Bayesian deep learning, the problem of optimizing hypernetworks has not been studied to date. We observe that classical weight initialization methods like Glorot & Bengio (2010) and He et al. (2015), when applied directly on a hypernet, fail to produce weights for the mainnet in the correct scale. We develop principled techniques for weight initialization in hypernets, and show that they lead to more stable mainnet weights, lower training loss, and faster convergence.

1 INTRODUCTION

Meta-learning describes a broad family of techniques in machine learning that deals with the problem of learning to learn. An emerging branch of meta-learning involves the use of hypernetworks, which are meta neural networks that generate the weights of a main neural network to solve a given task in an end-to-end differentiable manner. Hypernetworks were originally introduced by Ha et al. (2016) as a way to induce weight-sharing and achieve model compression by training the same meta network to learn the weights belonging to different layers in the main network. Since then, hypernetworks have found numerous applications including but not limited to: weight pruning (Liu et al., 2019), neural architecture search (Brock et al., 2017; Zhang et al., 2018), Bayesian neural networks (Krueger et al., 2017; Ukai et al., 2018; Pawlowski et al., 2017; Henning et al., 2018; Deutsch et al., 2019), multi-task learning (Pan et al., 2018; Shen et al., 2017; Klocek et al., 2019; Serrà et al., 2019; Meyerson & Miikkulainen, 2019), continual learning (von Oswald et al., 2019), generative models (Suarez, 2017; Ratzlaff & Fuxin, 2019), ensemble learning (Kristiadi & Fischer, 2019), hyperparameter optimization (Lorraine & Duvenaud, 2018), and adversarial defense (Sun et al., 2017).

Despite the intensified study of applications of hypernetworks, the problem of optimizing them remains significantly understudied to this day. In fact, even the problem of initializing hypernetworks has not been studied. Given the lack of principled approaches, prior work in the area is mostly limited to ad-hoc approaches based on trial and error (c.f. Section 3). For example, it is common to initialize the weights of a hypernetwork with small random values. Nonetheless, these ad-hoc methods do lead to successful hypernetwork training, primarily due to the use of the Adam optimizer (Kingma & Ba, 2014), which has the desirable property of being invariant to the scale of the gradients.

However, even Adam will not work if the loss diverges (i.e. overflows) at initialization, which will happen in sufficiently big models. The normalization of badly scaled gradients also results in noisy training dynamics where the loss function suffers from bigger fluctuations during training compared to vanilla stochastic gradient descent (SGD). Wilson et al. (2017) and Reddi et al. (2018) showed that while adaptive optimizers like Adam may exhibit lower training error, they fail to generalize as well to the test set as non-adaptive gradient methods.
Moreover, Adam incurs a computational overhead and requires 3x the amount of memory for the gradients compared to vanilla SGD.

Small random number sampling is reminiscent of early neural network research (Rumelhart et al., 1986), before the advent of classical weight initialization methods like Xavier init (Glorot & Bengio, 2010) and Kaiming init (He et al., 2015). Since then, a big lesson learned by the neural network optimization community is that architecture-specific initialization schemes are important to the robust training of deep networks, as shown recently in the case of residual networks (Zhang et al., 2019). In fact, weight initialization for hypernetworks was recognized as an outstanding open problem by prior work (Deutsch et al., 2019) that had questioned the suitability of classical initialization methods for hypernetworks.

Our results. We show that when classical methods are used to initialize the weights of hypernetworks, they fail to produce mainnet weights in the correct scale, leading to exploding activations and losses. This is because classical network weights transform one layer's activations into another, while hypernet weights have the added function of transforming the hypernet's activations into the mainnet's weights. Our solution is to develop principled techniques for weight initialization in hypernetworks based on variance analysis. The hypernet case poses unique challenges. For example, in contrast to variance analysis for classical networks, the case for hypernetworks can be asymmetrical between the forward and backward pass. The asymmetry arises when the gradient flow from the mainnet into the hypernet is affected by the biases, whereas in general this does not occur for gradient flow in the mainnet. This underscores again why architecture-specific initialization schemes are essential.

We show both theoretically and experimentally that our methods produce hypernet weights in the correct scale. Proper initialization mitigates exploding activations and gradients, and reduces the need to depend on Adam. Our experiments reveal that it leads to more stable mainnet weights, lower training loss, and faster convergence.

Section 2 briefly covers the relevant technical preliminaries, and Section 3 reviews problems with the ad-hoc methods currently deployed by hypernetwork practitioners. We derive novel weight initialization formulae for hypernetworks in Section 4, empirically evaluate our proposed methods in Section 5, and finally conclude in Section 6.

2 PRELIMINARIES

Definition. A hypernetwork is a meta neural network $H$ with its own parameters $\phi$ that generates the weights of a main network $\theta$ from some embedding $e$ in a differentiable manner: $\theta = H_\phi(e)$.

Unlike a classical network, in a hypernetwork the weights of the main network are not model parameters. Thus the gradients $\nabla\theta$ have to be further backpropagated to the weights of the hypernetwork $\phi$, which is then trained via gradient descent $\phi_{t+1} = \phi_t - \lambda \nabla\phi_t$. This fundamental difference suggests that conventional knowledge about neural networks may not apply directly to hypernetworks, and that novel ways of thinking about weight initialization, optimization dynamics, and architecture design for hypernetworks are sorely needed.
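To make the definition concrete, the following is a minimal sketch in PyTorch (our own illustration, not the architecture used in the experiments below): a one-layer hypernetwork maps an embedding to the weights of a single mainnet linear layer, and the backward pass deposits gradients on the hypernet parameters $\phi$ rather than on the generated weights $\theta$. All sizes here are arbitrary.

```python
# Minimal hypernetwork sketch: theta = H_phi(e), with theta used as the weights of a
# mainnet linear layer. Only phi (the hypernet's parameters) is a model parameter.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearHypernet(nn.Module):
    def __init__(self, embed_dim, mainnet_in, mainnet_out):
        super().__init__()
        # Linear output layer: flattened mainnet weight matrix = H e + beta.
        self.weight_gen = nn.Linear(embed_dim, mainnet_out * mainnet_in)
        self.mainnet_in, self.mainnet_out = mainnet_in, mainnet_out

    def forward(self, e, x):
        W = self.weight_gen(e).view(self.mainnet_out, self.mainnet_in)  # generated theta
        return F.linear(x, W)                                           # mainnet layer

hypernet = LinearHypernet(embed_dim=8, mainnet_in=32, mainnet_out=16)
e = torch.randn(8)                 # embedding (fixed here; could also be trainable)
x = torch.randn(4, 32)             # a batch of mainnet inputs
loss = hypernet(e, x).pow(2).mean()
loss.backward()                    # gradients flow to phi, not to the generated theta
print(hypernet.weight_gen.weight.grad.shape)  # torch.Size([512, 8])
```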
2.1 RICCI CALCULUS

We propose the use of Ricci calculus, as opposed to the more commonly used matrix calculus, as a suitable mathematical language for thinking about hypernetworks. Ricci calculus is useful because it allows us to reason about the derivatives of higher-order tensors with notational ease. For readers not familiar with the index-based notation of Ricci calculus, please refer to Laue et al. (2018) for a good introduction to the topic written from a machine learning perspective.

For a general $n$th-order tensor $T^{i_1,\ldots,i_k,\ldots,i_n}$, we use $d_{i_k}$ to refer to the dimension of the index set that $i_k$ is drawn from. We include explicit summations where the relevant expressions might be ambiguous, and use the Einstein summation convention otherwise. We use square brackets to denote different layers for added clarity, so for example $W[t]$ denotes the $t$-th weight layer.

2.2 XAVIER INITIALIZATION

Glorot & Bengio (2010) derived weight initialization formulae for a feedforward neural network by conducting a variance analysis over activations and gradients. For a linear layer $y^i = W^i_j x^j + b^i$, suppose we make the following Xavier Assumptions at initialization: (1) the $W^i_j$, $x^j$, and $b^i$ are all independent of each other; (2) $\forall i, j: \mathbb{E}[W^i_j] = 0$; (3) $\forall j: \mathbb{E}[x^j] = 0$; (4) $\forall i: b^i = 0$.

Then, $\mathbb{E}[y^i] = 0$ and $\mathrm{Var}(y^i) = d_j\,\mathrm{Var}(W^i_j)\,\mathrm{Var}(x^j)$. To keep the variance of the output and input activations the same, i.e. $\mathrm{Var}(y^i) = \mathrm{Var}(x^j)$, we have to sample $W^i_j$ from a distribution whose variance is equal to the reciprocal of the fan-in: $\mathrm{Var}(W^i_j) = \frac{1}{d_j}$. If analogous assumptions hold for the backward pass, then to keep the variance of the output and input gradients the same, we have to sample $W^i_j$ from a distribution whose variance is equal to the reciprocal of the fan-out: $\mathrm{Var}(W^i_j) = \frac{1}{d_i}$. Thus, the forward pass and backward pass result in symmetrical formulae. Glorot & Bengio (2010) proposed an initialization based on their harmonic mean: $\mathrm{Var}(W^i_j) = \frac{2}{d_j + d_i}$.

In general, a feedforward network is non-linear, so these assumptions are strictly invalid. But odd activation functions with unit derivative at 0 result in a roughly linear regime at initialization.

2.3 KAIMING INITIALIZATION

He et al. (2015) extended Glorot & Bengio (2010)'s analysis by looking at the case of ReLU activation functions, i.e. $y^i = W^i_j\,\mathrm{ReLU}(x^j) + b^i$. We can write $z^j = \mathrm{ReLU}(x^j)$ to get

$$\mathrm{Var}(y^i) = \sum_j \mathbb{E}[(z^j)^2]\,\mathrm{Var}(W^i_j) = \sum_j \tfrac{1}{2}\mathbb{E}[(x^j)^2]\,\mathrm{Var}(W^i_j) = \tfrac{1}{2}\, d_j\,\mathrm{Var}(W^i_j)\,\mathrm{Var}(x^j).$$

This results in an extra factor of 2 in the variance formula. The $W^i_j$ have to be symmetric around 0 to enforce Xavier Assumption 3 as the activations and gradients propagate through the layers. He et al. (2015) argued that either the forward or the backward version of the formula can be adopted, since the activations or gradients will only be scaled by a depth-independent factor. For convolutional layers, we have to further divide the variance by the size of the receptive field.

Xavier init and Kaiming init are terms that are sometimes used interchangeably. Where there might be confusion, we will refer to the forward version as fan-in init, the backward version as fan-out init, and the harmonic mean version as harmonic init.
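As a quick numerical sanity check on the fan-in formula (our own illustration, not part of the original text), the following snippet propagates standardized inputs through a stack of linear layers whose weights are sampled with $\mathrm{Var}(W^i_j) = 1/d_j$, and prints the activation variance after each layer:

```python
# Fan-in init keeps activation variance roughly constant across linear layers.
import numpy as np

rng = np.random.default_rng(0)
d, depth, batch = 512, 10, 1000
x = rng.standard_normal((batch, d))                    # standardized input, Var ~ 1
for _ in range(depth):
    W = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, d)) # Var(W) = 1/fan_in
    x = x @ W.T                                        # linear layer, biases at zero
    print(round(float(x.var()), 3))                    # stays near 1.0 at every depth
```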
3 REVIEW OF CURRENT METHODS

In the seminal Ha et al. (2016) paper, the authors identified two distinct classes of hypernetworks: dynamic (for recurrent networks) and static (for convolutional networks). They proposed Orthogonal init (Saxe et al., 2013) for the dynamic class, but omitted discussion of initialization for the static class. The static class has since proven to be the dominant variant, covering all kinds of non-recurrent networks (not just convolutional), and thus will be the central object of our investigation.

Through an extensive literature and code review, we found that hypernet practitioners mostly depend on the Adam optimizer, which is invariant to and normalizes the scale of gradients, for training, and resort to one of four weight initialization methods:

M1 Xavier or Kaiming init (as found in Pawlowski et al. (2017); Balazevic et al. (2018); Serrà et al. (2019); von Oswald et al. (2019)).
M2 Small random values (as found in Krueger et al. (2017); Lorraine & Duvenaud (2018)).
M3 Kaiming init, but with the output layer scaled by 1/10 (as found in Ukai et al. (2018)).
M4 Kaiming init, but with the hypernet embedding set to be a suitably scaled constant (as found in Meyerson & Miikkulainen (2019)).

M1 uses classical neural network initialization methods to initialize hypernetworks. This fails to produce weights for the main network in the correct scale. Consider the following illustrative example of a one-layer linear hypernet generating a linear mainnet with $T + 1$ layers, given embeddings sampled from a standard normal distribution and weights sampled entry-wise from a zero-mean distribution. We leave the biases out for now, and assume the input data $x[1]$ is standardized.

$$x[t+1]^{i_{t+1}} = W[t]^{i_{t+1}}_{i_t}\, x[t]^{i_t}, \qquad W[t]^{i_{t+1}}_{i_t} = H[t]^{i_{t+1}}_{i_t k_t}\, e[t]^{k_t}, \qquad 1 \le t \le T.$$

$$\mathrm{Var}(x[T+1]^{i_{T+1}}) = \mathrm{Var}(x[1]^{i_1}) \prod_{t=1}^{T} d_{i_t}\,\mathrm{Var}(W[t]^{i_{t+1}}_{i_t}) = \mathrm{Var}(x[1]^{i_1}) \prod_{t=1}^{T} d_{i_t} d_{k_t}\,\mathrm{Var}(H[t]^{i_{t+1}}_{i_t k_t}). \tag{1}$$

In this case, if the variance of the weights in the hypernet $\mathrm{Var}(H[t]^{i_{t+1}}_{i_t k_t})$ is equal to the reciprocal of the fan-in $d_{k_t}$, then the variance of the activations $\mathrm{Var}(x[T+1]^{i_{T+1}}) = \prod_{t=1}^{T} d_{i_t}$ explodes. If it is equal to the reciprocal of the fan-out $d_{i_t} d_{i_{t+1}}$, then the activation variance $\mathrm{Var}(x[T+1]^{i_{T+1}}) = \prod_{t=1}^{T} \frac{d_{k_t}}{d_{i_{t+1}}}$ is likely to vanish, since the size of the embedding vector is typically small relative to the width of the mainnet weight layer being generated. Where the fan-in is of a different scale than the fan-out, the harmonic mean has a scale close to that of the smaller number. Therefore, the fan-in, fan-out, and harmonic variants of Xavier and Kaiming init will all result in activations and gradients that scale exponentially with the depth of the mainnet.

M2 and M3 introduce additional hyperparameters into the model, and the ad-hoc manner in which they work is reminiscent of pre-deep-learning neural network research, before the introduction of classical initialization methods like Xavier and Kaiming init. This ad-hoc approach is not only inelegant and compute-hungry, but will likely fail for deeper and more complex hypernetworks.

M4 proposes to set the embeddings $e[t]^{k_t}$ to a suitable constant ($d_{i_t}^{-1/2}$ in this case), such that both $W[t]^{i_{t+1}}_{i_t}$ and $H[t]^{i_{t+1}}_{i_t k_t}$ can seem to be initialized with the same variance as Kaiming init. This ensures that the variance of the activations in the mainnet is preserved through the layers, but the restrictions on the embeddings might not be desirable in many applications. Luckily, the fix appears simple: set $\mathrm{Var}(H[t]^{i_{t+1}}_{i_t k_t}) = \frac{1}{d_{i_t} d_{k_t}}$. This results in the variance of the generated weights in the mainnet being $\mathrm{Var}(W[t]^{i_{t+1}}_{i_t}) = \frac{1}{d_{i_t}}$, resembling conventional neural networks initialized with fan-in init. This suggests a general hypernet weight initialization strategy: initialize the weights of the hypernet such that the mainnet weights approximate classical neural network initialization. We elaborate on and generalize this intuition in Section 4, and illustrate the scaling problem numerically below.
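The following small simulation of Equation 1 (our own illustration; the widths and depth are arbitrary) contrasts initializing the hypernet output layer with the reciprocal of its fan-in against the fix described above, $\mathrm{Var}(H) = \frac{1}{d_{i_t} d_{k_t}}$:

```python
# A one-layer linear hypernet generates each mainnet weight layer from a fresh embedding.
# With Var(H) = 1/d_emb (hypernet fan-in), mainnet activation variance explodes with depth;
# with Var(H) = 1/(d_main * d_emb), the mainnet behaves like a fan-in-initialized network.
import numpy as np

rng = np.random.default_rng(0)
d_main, d_emb, depth, batch = 256, 16, 8, 1000

def final_variance(h_variance):
    x = rng.standard_normal((batch, d_main))
    for _ in range(depth):
        e = rng.standard_normal(d_emb)                               # standard normal embedding
        H = rng.normal(0.0, np.sqrt(h_variance), size=(d_main, d_main, d_emb))
        W = H @ e                                                    # generated mainnet weights
        x = x @ W.T
    return float(x.var())

print(final_variance(1.0 / d_emb))            # ~256**8: explodes
print(final_variance(1.0 / (d_main * d_emb))) # ~1: stays in the right scale
```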
4 HYPERFAN INITIALIZATION

Most hypernetwork architectures use a linear output layer so that gradients can pass from the mainnet into the hypernet directly without any non-linearities. We make use of this fact in developing methods called hyperfan-in init and hyperfan-out init for hypernetwork weight initialization based on the principle of variance analysis.

4.1 HYPERFAN-IN

Proposition. Suppose a hypernetwork comprises a linear output layer. Then, the variance between the input and output activations of a linear layer in the mainnet $y^i = W^i_j x^j + b^i$ can be preserved using fan-in init in the hypernetwork with appropriately scaled output layers.

Case 1. The hypernet generates the weights but not the biases of the mainnet. The bias in the mainnet is initialized to zero. We can write the weight generation in the form $W^i_j = H^i_{jk} h(e)^k + \beta^i_j$, where $h$ computes all but the last layer of the hypernet and $(H, \beta)$ form the output layer. We make the following Hyperfan Assumptions at initialization: (1) the Xavier Assumptions hold for all the layers in the hypernet; (2) the $H^i_{jk}$, $h(e)^k$, $\beta^i_j$, $x^j$, and $b^i$ are all independent of each other; (3) $\forall i, j, k: \mathbb{E}[H^i_{jk}] = 0$; (4) $\mathbb{E}[x^j] = 0$; (5) $\forall i: b^i = 0$.

Use fan-in init to initialize the weights for $h$. Then, $\mathrm{Var}(h(e)^k) = \mathrm{Var}(e^l)$. If we initialize $H$ with the formula $\mathrm{Var}(H^i_{jk}) = \frac{1}{d_j d_k\,\mathrm{Var}(e^l)}$ and $\beta$ with zeros, we arrive at $\mathrm{Var}(W^i_j) = \frac{1}{d_j}$, which is the formula for fan-in init in the mainnet. The Hyperfan Assumptions imply that the Xavier Assumptions hold in the mainnet, thus preserving the input and output activations:

$$\mathrm{Var}(y^i) = \sum_j \mathrm{Var}(W^i_j)\,\mathrm{Var}(x^j) = \sum_j \sum_k \mathrm{Var}(H^i_{jk})\,\mathrm{Var}(h(e)^k)\,\mathrm{Var}(x^j) = \sum_j \sum_k \frac{\mathrm{Var}(e^l)\,\mathrm{Var}(x^j)}{d_j d_k\,\mathrm{Var}(e^l)} = \mathrm{Var}(x^j). \tag{2}$$

Case 2. The hypernet generates both the weights and biases of the mainnet. We can write the weight and bias generation in the form $W^i_j = H^i_{jk} h(e[1])^k + \beta^i_j$ and $b^i = G^i_l g(e[2])^l + \gamma^i$ respectively, where $h$ and $g$ compute all but the last layer of the hypernet, and $(H, \beta)$ and $(G, \gamma)$ form the output layers. We modify Hyperfan Assumption 2 so it includes $G^i_l$, $g(e[2])^l$, and $\gamma^i$, and further assume $\mathrm{Var}(x^j) = 1$, which holds at initialization with the common practice of data standardization. Use fan-in init to initialize the weights for $h$ and $g$. Then, $\mathrm{Var}(h(e[1])^k) = \mathrm{Var}(e[1]^m)$ and $\mathrm{Var}(g(e[2])^l) = \mathrm{Var}(e[2]^n)$. If we initialize $H$ with the formula $\mathrm{Var}(H^i_{jk}) = \frac{1}{2 d_j d_k\,\mathrm{Var}(e[1]^m)}$, $G$ with the formula $\mathrm{Var}(G^i_l) = \frac{1}{2 d_l\,\mathrm{Var}(e[2]^n)}$, and $\beta, \gamma$ with zeros, then the input and output activations in the mainnet can be preserved:

$$\mathrm{Var}(y^i) = \sum_j \mathrm{Var}(W^i_j)\,\mathrm{Var}(x^j) + \mathrm{Var}(b^i) = \sum_j \sum_k \mathrm{Var}(H^i_{jk})\,\mathrm{Var}(h(e[1])^k)\,\mathrm{Var}(x^j) + \sum_l \mathrm{Var}(G^i_l)\,\mathrm{Var}(g(e[2])^l)$$
$$= \sum_j \sum_k \frac{\mathrm{Var}(e[1]^m)\,\mathrm{Var}(x^j)}{2 d_j d_k\,\mathrm{Var}(e[1]^m)} + \sum_l \frac{\mathrm{Var}(e[2]^n)}{2 d_l\,\mathrm{Var}(e[2]^n)} = \frac{1}{2}\mathrm{Var}(x^j) + \frac{1}{2} = \mathrm{Var}(x^j). \tag{3}$$

If we initialize $G^i_l$ to zeros, then its contribution to the variance will increase during training, causing exploding activations in the mainnet. Hence, we prefer to introduce a factor of 1/2 to divide the variance between the weight and bias generation, where the variance of each component is allowed to either decrease or increase during training. This becomes a problem if the variance of the activations in the mainnet deviates too far away from 1, but we found that it works well in practice.
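A numerical check of the Case 2 calculation (our own illustration, with arbitrary dimensions): sampling $H$ and $G$ with the variances above makes the weight and bias terms each contribute roughly 1/2, so $\mathrm{Var}(y^i) \approx \mathrm{Var}(x^j) = 1$ for a linear mainnet layer.

```python
# Hyperfan-in, Case 2: Var(H) = 1/(2*d_j*d_k*Var(e1)), Var(G) = 1/(2*d_l*Var(e2)).
import numpy as np

rng = np.random.default_rng(0)
d_j, d_i, d_k, d_l, batch = 300, 200, 32, 32, 5000
var_e1 = var_e2 = 1.0

x  = rng.standard_normal((batch, d_j))                         # Var(x) = 1
e1 = rng.normal(0.0, np.sqrt(var_e1), size=d_k)                # linear hypernet: h(e1) = e1
e2 = rng.normal(0.0, np.sqrt(var_e2), size=d_l)                # g(e2) = e2
H  = rng.normal(0.0, np.sqrt(1.0 / (2 * d_j * d_k * var_e1)), size=(d_i, d_j, d_k))
G  = rng.normal(0.0, np.sqrt(1.0 / (2 * d_l * var_e2)), size=(d_i, d_l))

W = H @ e1                                                     # generated weights
b = G @ e2                                                     # generated biases
y = x @ W.T + b
print(round(float((x @ W.T).var()), 3),                        # ~0.5
      round(float(b.var()), 3),                                # ~0.5
      round(float(y.var()), 3))                                # ~1.0
```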
4.2 HYPERFAN-OUT

Case 1. The hypernet generates the weights but not the biases of the mainnet. A similar derivation can be done for the backward pass using analogous assumptions on gradients flowing in the mainnet,

$$\frac{\partial L}{\partial x[t]^{i_t}} = \frac{\partial L}{\partial x[t+1]^{i_{t+1}}}\, W[t]^{i_{t+1}}_{i_t},$$

through mainnet weights,

$$\frac{\partial L}{\partial W[t]^{i_{t+1}}_{i_t}} = \frac{\partial L}{\partial x[t+1]^{i_{t+1}}}\, x[t]^{i_t}, \qquad \frac{\partial L}{\partial h[t](e)^{k_t}} = \frac{\partial L}{\partial W[t]^{i_{t+1}}_{i_t}}\, H[t]^{i_{t+1}}_{i_t k_t},$$

and through mainnet biases,

$$\frac{\partial L}{\partial b[t]^{i_{t+1}}} = \frac{\partial L}{\partial x[t+1]^{i_{t+1}}}, \qquad \frac{\partial L}{\partial g[t](e)^{l_t}} = \frac{\partial L}{\partial b[t]^{i_{t+1}}}\, G[t]^{i_{t+1}}_{l_t}. \tag{4}$$

If we initialize the output layer $H$ with the analogous hyperfan-out formula $\mathrm{Var}(H[t]^{i_{t+1}}_{i_t k_t}) = \frac{1}{d_{i_{t+1}} d_{k_t}\,\mathrm{Var}(e^{k_t})}$ and the rest of the hypernet with fan-in init, then we can preserve input and output gradients on the mainnet: $\mathrm{Var}\big(\frac{\partial L}{\partial x[t]^{i_t}}\big) = \mathrm{Var}\big(\frac{\partial L}{\partial x[t+1]^{i_{t+1}}}\big)$. However, note that the gradients will shrink when flowing from the mainnet to the hypernet, $\mathrm{Var}\big(\frac{\partial L}{\partial h[t](e)^{k_t}}\big) = \frac{d_{i_t}}{d_{k_t}\,\mathrm{Var}(e^{k_t})}\,\mathrm{Var}\big(\frac{\partial L}{\partial W[t]^{i_{t+1}}_{i_t}}\big)$, and be scaled by a depth-independent factor due to the use of fan-in rather than fan-out init.

Case 2. The hypernet generates both the weights and biases of the mainnet. In the classical case, the forward version (fan-in init) and the backward version (fan-out init) are symmetrical. This remains true for hypernets if they only generate the weights of the mainnet. However, if they also generate the biases, then the symmetry no longer holds, since the biases do not affect the gradient flow in the mainnet but they do so for the hypernet (c.f. Equation 4). Nevertheless, we can initialize $G$ so that it helps hyperfan-out init preserve activation variance on the forward pass as much as possible (keeping the assumption that $\mathrm{Var}(x^j) = 1$ as before):

$$\mathrm{Var}(y^i) = \sum_j \mathrm{Var}(W^i_j x^j) + \mathrm{Var}(b^i) = d_j d_k\,\mathrm{Var}(e[1]^m)\,\mathrm{Var}(H[\text{hyperfan-out}]^i_{jk})\,\mathrm{Var}(x^j) + d_l\,\mathrm{Var}(e[2]^n)\,\mathrm{Var}(G^i_l)$$
$$= d_j d_k\,\mathrm{Var}(e[1]^m)\,\mathrm{Var}(H[\text{hyperfan-in}]^i_{jk})\,\mathrm{Var}(x^j).$$

Plugging in the formulae for hyperfan-in and hyperfan-out from above, we get

$$\mathrm{Var}(G^i_l) = \frac{1 - \frac{d_j}{d_i}}{d_l\,\mathrm{Var}(e[2]^n)}.$$

We summarize the variance formulae for hyperfan-in and hyperfan-out init in Table 1. It is not uncommon to re-use the same hypernet to generate different parts of the mainnet, as was originally done in Ha et al. (2016). We discuss this case in more detail in Appendix Section A.

Table 1: Hyperfan-in and hyperfan-out variance formulae for $W^i_j = H^i_{jk} h(e[1])^k + \beta^i_j$. If $y^i = \mathrm{ReLU}(W^i_j x^j + b^i)$, then $\mathbb{1}_{\mathrm{ReLU}} = 1$; else if $y^i = W^i_j x^j + b^i$, then $\mathbb{1}_{\mathrm{ReLU}} = 0$. If $b^i = G^i_l g(e[2])^l + \gamma^i$, then $\mathbb{1}_{\mathrm{HBias}} = 1$; else if $b^i = 0$, then $\mathbb{1}_{\mathrm{HBias}} = 0$. We initialize $h$ and $g$ with fan-in init, and $\beta^i_j, \gamma^i = 0$. For convolutional layers, we have to further divide $\mathrm{Var}(H^i_{jk})$ by the size of the receptive field. Uniform init: $X \sim U(\pm\sqrt{3\,\mathrm{Var}(X)})$. Normal init: $X \sim N(0, \mathrm{Var}(X))$.

| Initialization | Variance Formula |
| --- | --- |
| Hyperfan-in | $\mathrm{Var}(H^i_{jk}) = \dfrac{2^{\mathbb{1}_{\mathrm{ReLU}}}}{2^{\mathbb{1}_{\mathrm{HBias}}}\, d_j d_k\,\mathrm{Var}(e[1]^m)}$ |
| Hyperfan-out | $\mathrm{Var}(H^i_{jk}) = \dfrac{2^{\mathbb{1}_{\mathrm{ReLU}}}}{d_i d_k\,\mathrm{Var}(e[1]^m)}$ |
| Hyperfan-in | $\mathrm{Var}(G^i_l) = \dfrac{2^{\mathbb{1}_{\mathrm{ReLU}}}}{2 d_l\,\mathrm{Var}(e[2]^n)}$ |
| Hyperfan-out | $\mathrm{Var}(G^i_l) = \max\!\left(\dfrac{2^{\mathbb{1}_{\mathrm{ReLU}}}\left(1 - \frac{d_j}{d_i}\right)}{d_l\,\mathrm{Var}(e[2]^n)},\, 0\right)$ |
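Read as code, Table 1 might look like the helper below (a sketch based on our reading of the formulae, not a reference implementation; the argument names are our own). It returns the variances for the hypernet output layers $H$ and, if the biases are also generated, $G$; the weights would then be sampled from $U(\pm\sqrt{3\,\mathrm{Var}})$ or $N(0, \mathrm{Var})$, with $\beta$ and $\gamma$ set to zero and the rest of the hypernet initialized with fan-in init.

```python
import math

def hyperfan_variance(fan_in, fan_out, embed_dim_w, var_e1,
                      embed_dim_b=None, var_e2=None,
                      relu=True, hypernet_bias=False, variant="in",
                      receptive_field=1):
    """fan_in = d_j and fan_out = d_i of the generated mainnet layer;
    embed_dim_w = d_k, embed_dim_b = d_l; variant is "in" or "out"."""
    gain = 2.0 if relu else 1.0                                # 2 ** 1_ReLU
    if variant == "in":
        var_H = gain / ((2.0 if hypernet_bias else 1.0) * fan_in * embed_dim_w * var_e1)
        var_G = gain / (2.0 * embed_dim_b * var_e2) if hypernet_bias else None
    else:                                                      # hyperfan-out
        var_H = gain / (fan_out * embed_dim_w * var_e1)
        var_G = (max(gain * (1.0 - fan_in / fan_out), 0.0) / (embed_dim_b * var_e2)
                 if hypernet_bias else None)
    var_H /= receptive_field                                   # conv layers: divide Var(H) by k*k
    return var_H, var_G

# Example: hyperfan-in for a 3x3 conv layer with 64 -> 128 channels in a ReLU mainnet,
# embeddings of size 50 with unit variance, and the hypernet also generating biases.
var_H, var_G = hyperfan_variance(fan_in=64, fan_out=128, embed_dim_w=50, var_e1=1.0,
                                 embed_dim_b=50, var_e2=1.0, relu=True,
                                 hypernet_bias=True, variant="in", receptive_field=9)
print(var_H, var_G, math.sqrt(3 * var_H))   # variance for H, variance for G, uniform bound for H
```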
5 EXPERIMENTS

We evaluated our proposed methods on four sets of experiments involving different use cases of hypernetworks: feedforward networks, continual learning, convolutional networks, and Bayesian neural networks. In all cases, we optimize with vanilla SGD and sample from the uniform distribution according to the variance formula given by the init method. More experimental details can be found in Appendix Section B.

5.1 FEEDFORWARD NETWORKS ON MNIST

As an illustrative first experiment, we train a feedforward network with five hidden layers (500 hidden units), a hyperbolic tangent activation function, and a softmax output layer, on MNIST across four different settings: (1) a classical network with Xavier init, (2) a hypernet with Xavier init that generates the weights of the mainnet, (3) a hypernet with hyperfan-in init that generates the weights of the mainnet, and (4) a hypernet with hyperfan-out init that generates the weights of the mainnet.

The use of hyperfan init methods on a hypernetwork produces mainnet weights similar to those that have been trained from Xavier init on a classical network, while the use of Xavier init on a hypernetwork causes exploding activations right at the beginning of training (see Figure 1). Observe in Figure 2 that when the hypernetwork is initialized in the proper scale, the magnitude of the generated weights stabilizes quickly. This in turn leads to a more stable training regime, as seen in Figure 3. More visualizations of the activations and gradients of both the mainnet and hypernet can be viewed in Appendix Section B.1. Qualitatively similar observations were made when we replaced the activation function with ReLU and Xavier with Kaiming init, with Kaiming init leading to even bigger activations at initialization.

Suppose now the hypernet generates both the weights and biases of the mainnet instead of just the weights. We found that this architectural change leads the hyperfan init methods to take more time (but still less than Xavier init) to generate stable mainnet weights (c.f. Figure 25 in the Appendix).

Figure 1: Mainnet Activations before the Start of Training on MNIST (histograms of activation values for layers one through five under Xavier (NN), Xavier (Hyper), Hyperfan-in, and Hyperfan-out).

Figure 2: Evolution of Hypernet Output Layer Activations during Training on MNIST. Xavier init results in unstable mainnet weights throughout training, while hyperfan-in and hyperfan-out init result in mainnet weights that stabilize quickly.

Figure 3: Loss and Test Accuracy Plots on MNIST.
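For concreteness, here is a hedged sketch of the Section 5.1 hypernet-generates-weights-only setting in PyTorch. The mainnet sizes, tanh activation, embedding size, and embedding distribution follow the text and Appendix B.1; for simplicity the sketch uses a separate generator per mainnet layer, whereas the experiment shares one output layer across the equally sized hidden layers, and all other details are our own assumptions.

```python
# Tanh MLP whose weight matrices are produced by per-layer linear hypernet output layers
# initialized with hyperfan-in: Var(H) = 1/(d_in * embed_dim * Var(e)), beta = 0.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperMLP(nn.Module):
    def __init__(self, sizes=(784, 500, 500, 500, 500, 500, 10), embed_dim=50):
        super().__init__()
        self.sizes = sizes
        # Fixed (untrained) embeddings sampled from U(+/- sqrt(3)), so Var(e) = 1.
        self.embeddings = [torch.empty(embed_dim).uniform_(-math.sqrt(3), math.sqrt(3))
                           for _ in range(len(sizes) - 1)]
        self.generators = nn.ModuleList()
        for d_in, d_out in zip(sizes[:-1], sizes[1:]):
            gen = nn.Linear(embed_dim, d_out * d_in)
            bound = math.sqrt(3.0 / (d_in * embed_dim))        # uniform bound for hyperfan-in
            nn.init.uniform_(gen.weight, -bound, bound)
            nn.init.zeros_(gen.bias)                           # beta = 0
            self.generators.append(gen)

    def forward(self, x):
        for t, gen in enumerate(self.generators):
            d_in, d_out = self.sizes[t], self.sizes[t + 1]
            W = gen(self.embeddings[t]).view(d_out, d_in)      # generated mainnet weights
            x = F.linear(x, W)
            if t < len(self.generators) - 1:
                x = torch.tanh(x)
        return x                                               # logits for cross entropy

model = HyperMLP()
print(model(torch.randn(32, 784)).shape)                       # torch.Size([32, 10])
```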
5.2 CONTINUAL LEARNING ON REGRESSION TASKS

Continual learning solves the problem of learning tasks in sequence without forgetting prior tasks. von Oswald et al. (2019) used a hypernetwork to learn embeddings for each task as a way to efficiently regularize the training process to prevent catastrophic forgetting. We compare different initialization schemes on their hypernetwork implementation, which generates the weights and biases of a ReLU mainnet with two hidden layers to solve a sequence of three regression tasks. In Figure 4, we plot the training loss averaged over 15 different runs, with the shaded area showing the standard error. We observe that the hyperfan methods produce smaller training losses at initialization and during training, eventually converging to a smaller loss for each task.

Figure 4: Continual Learning Loss on a Sequence of Regression Tasks (one panel per task; curves for Kaiming (fan-in), Kaiming (fan-out), Hyperfan-in, and Hyperfan-out).

5.3 CONVOLUTIONAL NETWORKS ON CIFAR-10

Ha et al. (2016) applied a hypernetwork on a convolutional network for image classification on CIFAR-10. We note that our initialization methods do not handle residual connections, which were present in their chosen mainnet architecture and remain an important topic for future study. Instead, we implemented their hypernetwork architecture on a mainnet with the All Convolutional Net architecture (Springenberg et al., 2014), which is composed of convolutional layers and ReLU activation functions.

After searching through a dense grid of learning rates, we failed to enable the fan-in version of Kaiming init to train, even with very small learning rates. The fan-out version managed to begin delayed training, starting from around epoch 270 (see Figure 5). By contrast, both hyperfan-in and hyperfan-out init led to successful training immediately. This shows that a good init can make it possible to successfully train models that would have otherwise been unamenable to training under a bad init.

Figure 5: Loss and Test Accuracy Plots on CIFAR-10.

5.4 BAYESIAN NEURAL NETWORKS ON IMAGENET

Bayesian neural networks improve model calibration and provide uncertainty estimation, which guard against the pitfalls of overconfident networks. Ukai et al. (2018) developed a Bayesian neural network by using a hypernetwork to simulate an expressive prior distribution. We trained a similar hypernetwork by applying Ukai et al. (2018)'s methods on ImageNet, but differed in our choice of MobileNet (Howard et al., 2017) as a mainnet architecture that does not have residual connections.

In the work of Ukai et al. (2018), it was noticed that even with the use of batch normalization in the mainnet, classical initialization approaches still led to diverging losses (due to exploding activations, c.f. Section 3). We observe similar results in our experiment (see Figure 6): the fan-in version of Kaiming init, which is the default initialization in popular deep learning libraries like PyTorch and Chainer, resulted in substantially higher initial losses and led to slower training than the hyperfan methods.
We found that the observation still stands even when the last layer of the mainnet is not generated by the hypernet. This shows that while batch normalization helps, it is not the solution for a bad init that causes exploding activations. Our approach solves this problem in a principled way, and is preferable to the trial-and-error based heuristics that Ukai et al. (2018) had to resort to in order to train their model.

Surprisingly, the fan-out version of Kaiming init led to similar results as the hyperfan methods, suggesting that batch normalization might be sufficient to correct the bad initializations that result in vanishing activations. That being said, hypernet practitioners should not expect batch normalization to be the panacea for problems caused by bad initialization, especially in memory-constrained scenarios. In a Bayesian neural network application (especially in hypernet architectures without relaxed weight-sharing), the blowup in the number of parameters limits the use of big batch sizes, which is essential to the performance of batch normalization (Wu & He, 2018). For example, in this experiment, our hypernet model requires 32 times as many parameters as a classical MobileNet. To the best of our knowledge, the interaction between batch normalization and initialization is not well understood, even in the classical case, and thus our findings prompt an interesting direction for future research.

Figure 6: Loss and Test Accuracy Plots on ImageNet.

In all our experiments, hyperfan-in and hyperfan-out both led to successful hypernetwork training with SGD. We did not find a good reason to prefer one over the other (similar to He et al. (2015)'s observation in the classical case for fan-in and fan-out init).

6 CONCLUSION

For a long time, the promise of deep nets to learn rich representations of the world was left unfulfilled due to the inability to train these models. The discovery of greedy layer-wise pre-training (Hinton et al., 2006; Bengio et al., 2007) and, later, of Xavier and Kaiming init as weight initialization strategies to enable such training was a pivotal achievement that kickstarted the deep learning revolution. This underscores the importance of model initialization as a fundamental step in learning complex representations. In this work, we developed the first principled weight initialization methods for hypernetworks, a rapidly growing branch of meta-learning. We hope our work will spur momentum towards the development of principled techniques for building and training hypernetworks, and eventually lead to significant progress in learning meta representations. Other, non-hypernetwork methods of neural network generation (Stanley et al., 2009; Koutnik et al., 2010) can also be improved by considering whether their generated weights result in exploding activations, and how to avoid that if so.

7 ACKNOWLEDGEMENTS

This research was supported in part by the US Defense Advanced Research Projects Agency (DARPA) Lifelong Learning Machines Program, grant HR0011-18-2-0020. We thank Dan Martin and Yawei Li for helpful discussions, and the ICLR reviewers for their constructive feedback.

REFERENCES
Ivana Balazevic, Carl Allen, and Timothy M Hospedales. Hypernetwork knowledge graph embeddings. arXiv preprint arXiv:1808.07018, 2018.

Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems, pp. 153–160, 2007.

Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. SMASH: One-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344, 2017.

Lior Deutsch, Erik Nijkamp, and Yu Yang. A generative model for sampling high-performance and diverse weights for neural networks. arXiv preprint arXiv:1905.02898, 2019.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256, 2010.

David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.

Christian Henning, Johannes von Oswald, João Sacramento, Simone Carlo Surace, Jean-Pascal Pfister, and Benjamin F Grewe. Approximating the predictive distribution via adversarially-trained hypernetworks. In Bayesian Deep Learning Workshop, NeurIPS (Spotlight), volume 2018, 2018.

Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Sylwester Klocek, Łukasz Maziarka, Maciej Wołczyk, Jacek Tabor, Marek Śmieja, and Jakub Nowak. Hypernetwork functional image representation. arXiv preprint arXiv:1902.10404, 2019.

Jan Koutnik, Faustino Gomez, and Jürgen Schmidhuber. Evolving neural networks in compressed weight space. In Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation, pp. 619–626. ACM, 2010.

Agustinus Kristiadi and Asja Fischer. Predictive uncertainty quantification with compound density networks. arXiv preprint arXiv:1902.01080, 2019.

David Krueger, Chin-Wei Huang, Riashat Islam, Ryan Turner, Alexandre Lacoste, and Aaron Courville. Bayesian hypernetworks. arXiv preprint arXiv:1710.04759, 2017.

Sören Laue, Matthias Mitterreiter, and Joachim Giesen. Computing higher order derivatives of matrix and tensor expressions. In Advances in Neural Information Processing Systems, pp. 2755–2764, 2018.

Zechun Liu, Haoyuan Mu, Xiangyu Zhang, Zichao Guo, Xin Yang, Tim Kwang-Ting Cheng, and Jian Sun. MetaPruning: Meta learning for automatic neural network channel pruning. arXiv preprint arXiv:1903.10258, 2019.

Jonathan Lorraine and David Duvenaud. Stochastic hyperparameter optimization through hypernetworks. arXiv preprint arXiv:1802.09419, 2018.

Elliot Meyerson and Risto Miikkulainen. Modular universal reparameterization: Deep multi-task learning across diverse domains. arXiv preprint arXiv:1906.00097, 2019.
Zheyi Pan, Yuxuan Liang, Junbo Zhang, Xiuwen Yi, Yong Yu, and Yu Zheng. HyperST-Net: Hypernetworks for spatio-temporal forecasting. arXiv preprint arXiv:1809.10889, 2018.

Nick Pawlowski, Andrew Brock, Matthew CH Lee, Martin Rajchl, and Ben Glocker. Implicit weight uncertainty in neural networks. arXiv preprint arXiv:1711.01297, 2017.

Neale Ratzlaff and Li Fuxin. HyperGAN: A generative model for diverse, performant neural networks. arXiv preprint arXiv:1901.11058, 2019.

Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ryQu7f-RZ.

David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.

Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

Joan Serrà, Santiago Pascual, and Carlos Segura. Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion. arXiv preprint arXiv:1906.00794, 2019.

Falong Shen, Shuicheng Yan, and Gang Zeng. Meta networks for neural style transfer. arXiv preprint arXiv:1709.04111, 2017.

Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

Kenneth O Stanley, David B D'Ambrosio, and Jason Gauci. A hypercube-based encoding for evolving large-scale neural networks. Artificial Life, 15(2):185–212, 2009.

Joseph Suarez. Language modeling with recurrent highway hypernetworks. In Advances in Neural Information Processing Systems, pp. 3267–3276, 2017.

Zhun Sun, Mete Ozay, and Takayuki Okatani. Hypernetworks with statistical filtering for defending adversarial examples. arXiv preprint arXiv:1711.01791, 2017.

Kenya Ukai, Takashi Matsubara, and Kuniaki Uehara. Hypernetwork-based implicit posterior estimation and model averaging of CNN. In Asian Conference on Machine Learning, pp. 176–191, 2018.

Johannes von Oswald, Christian Henning, João Sacramento, and Benjamin F Grewe. Continual learning with hypernetworks. arXiv preprint arXiv:1906.00695, 2019.

Ashia C Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Advances in Neural Information Processing Systems, pp. 4148–4158, 2017.

Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19, 2018.

Chris Zhang, Mengye Ren, and Raquel Urtasun. Graph hypernetworks for neural architecture search. arXiv preprint arXiv:1810.05749, 2018.

Hongyi Zhang, Yann N Dauphin, and Tengyu Ma. Fixup initialization: Residual learning without normalization. arXiv preprint arXiv:1901.09321, 2019.

A RE-USING HYPERNET WEIGHTS

A.1 FOR MAINNET WEIGHTS OF THE SAME SIZE

For model compression or weight-sharing purposes, different parts of the mainnet might be generated by the same hypernet function. This will cause some assumptions of independence in our analysis to be invalid. Consider the example of the same hypernet being used to generate multiple different mainnet weight layers of the same size, i.e. $H[t]^{i_{t+1}}_{i_t k} = H[t+1]^{i_{t+2}}_{i_{t+1} k}$ with $d_{i_{t+1}} = d_{i_{t+2}} = d_{i_t}$. Then,

$$x[t+1]^{i_{t+1}} = H[t]^{i_{t+1}}_{i_t k_t}\, e[t]^{k_t}\, x[t]^{i_t}, \qquad W[t+1]^{i_{t+2}}_{i_{t+1}} = H[t+1]^{i_{t+2}}_{i_{t+1} k_{t+1}}\, e[t+1]^{k_{t+1}}.$$
The relaxation of some of these independence assumptions does not always prove to be a big problem in practice, because the correlations introduced by the repeated use of $H$ can be minimized with the use of flat distributions like the uniform distribution. It can even be helpful, since the re-use of the same hypernet for different layers causes the gradient flowing through the hypernet output layer to be the sum of the gradients from the weights of these layers,

$$\frac{\partial L}{\partial h(e)^k} = \sum_t \frac{\partial L}{\partial W[t]^{i_{t+1}}_{i_t}}\, H^{i_{t+1}}_{i_t k},$$

thus combating the shrinking effect.

A.2 FOR MAINNET WEIGHTS OF DIFFERENT SIZES

Similar reasoning applies if the same hypernet was used to generate differently sized subsets of weights in the mainnet. However, we encourage avoiding this kind of hypernet architecture design if not otherwise essential, since it complicates the initialization formulae listed in Table 1.

Consider Ha et al. (2016)'s hypernetwork architecture. Their two-layer hypernet generated weight chunks of size $(K, n, n)$ for a main convolutional network, where $K = 16$ was found to be the highest common factor among the sizes of the mainnet layers, and $n^2 = 9$ was the size of the receptive field. We simplify the presentation by writing $i$ for $i_t$, $j$ for $j_t$, $k$ for $k_{t,m}$, and $l$ for $l_{t,m}$.

$$W[t]^i_j = \begin{cases} H^{\,i\,(\mathrm{mod}\,K)}_{k}\, \alpha[t]\big[j + \tfrac{i}{K} d_j\big]^{k} + \beta^{\,i\,(\mathrm{mod}\,K)} & \text{if } i \text{ is divisible by } K, \\[4pt] \delta^{\,j\,(\mathrm{mod}\,K)}_{\,j\,(\mathrm{mod}\,K)}\, H^{\,j\,(\mathrm{mod}\,K)}_{k}\, \alpha[t]\big[i + \tfrac{j}{K} d_i\big]^{k} + \beta^{\,j\,(\mathrm{mod}\,K)} & \text{if } j \text{ is divisible by } K, \end{cases}$$

$$\alpha[t][m_t]^{k} = G[t][m_t]^{k}_{\,l}\, e[t][m_t]^{l} + \gamma[t][m_t]^{k}.$$

Because the output layer $(H, \beta)$ in the hypernet was re-used to generate mainnet weight matrices of different sizes (i.e. in general, $i_t \neq i_{t+1}$ and $j_t \neq j_{t+1}$), $G$ effectively becomes the output layer that we want to be considering for hyperfan-in and hyperfan-out initialization. Hence, to achieve fan-in in the mainnet, $\mathrm{Var}(W[t]^i_j) = \frac{1}{d_j}$, we have to use fan-in init for $H$ (i.e. $\mathrm{Var}(H^{\,i\,(\mathrm{mod}\,K)}_{k}) = \frac{1}{d_k}$) and hyperfan-in init for $G$ (i.e. $\mathrm{Var}(G[t][m_t]^{k}_{\,l}) = \frac{1}{d_j d_l\,\mathrm{Var}(e[t][m_t]^l)}$). Analogously, to achieve fan-out in the mainnet, $\mathrm{Var}(W[t]^i_j) = \frac{1}{d_i}$, we have to use fan-in init for $H$ (i.e. $\mathrm{Var}(H^{\,i\,(\mathrm{mod}\,K)}_{k}) = \frac{1}{d_k}$) and hyperfan-out init for $G$ (i.e. $\mathrm{Var}(G[t][m_t]^{k}_{\,l}) = \frac{1}{d_i d_l\,\mathrm{Var}(e[t][m_t]^l)}$).
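To illustrate the assignment above, here is a simplified sketch (our own, not Ha et al.'s exact indexing) of block-wise generation with a shared output layer: $H$ gets plain fan-in init and the per-layer projection $G$ carries the hyperfan-in scaling, so each generated $(K, n, n)$ chunk ends up with roughly fan-in variance in the mainnet. Assembling the chunks into the full convolution kernel, and folding the receptive field into the fan-in, are our own simplifying assumptions.

```python
# Shared output layer H: hidden -> (K, n, n) chunk, fan-in init.
# Per-layer projection G: embedding -> hidden, hyperfan-in init with the mainnet fan-in.
import math
import torch
import torch.nn as nn

K, n, d_k, d_l = 96, 3, 50, 50             # block size, filter size, hidden and embedding dims
var_e = 1.0                                 # embeddings sampled from U(+/- sqrt(3))

H = nn.Linear(d_k, K * n * n)
nn.init.uniform_(H.weight, -math.sqrt(3.0 / d_k), math.sqrt(3.0 / d_k))   # Var(H) = 1/d_k
nn.init.zeros_(H.bias)

def make_projection(mainnet_fan_in):
    """G for one mainnet layer; here we take the fan-in to be in_channels * n * n."""
    G = nn.Linear(d_l, d_k)
    bound = math.sqrt(3.0 / (mainnet_fan_in * d_l * var_e))    # Var(G) = 1/(d_j * d_l * Var(e))
    nn.init.uniform_(G.weight, -bound, bound)
    nn.init.zeros_(G.bias)
    return G

# Generate the chunks for a 192 -> 96 conv layer with 3x3 filters.
in_ch, out_ch = 192, 96
G = make_projection(mainnet_fan_in=in_ch * n * n)
chunks = [H(G(torch.empty(d_l).uniform_(-math.sqrt(3), math.sqrt(3)))).view(K, n, n)
          for _ in range((out_ch // K) * (in_ch // K))]
W_blocks = torch.stack(chunks)              # rearranging into (out_ch, in_ch, n, n) is omitted
print(W_blocks.var().item())                # roughly 1/(192*9) ~ 5.8e-4, i.e. mainnet fan-in scale
```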
B MORE EXPERIMENTAL DETAILS

B.1 FEEDFORWARD NETWORKS ON MNIST

The networks were trained on MNIST for 30 epochs with batch size 10, using a learning rate of 0.0005 for the hypernets and 0.01 for the classical network. The hypernets had one linear layer with embeddings of size 50, and the different hidden layers in the mainnet were all generated by the same hypernet output layer with a different embedding, which was randomly sampled from $U(\pm\sqrt{3})$ and fixed. We use the mean cross entropy loss for training, but the summed cross entropy loss for testing.

We show activation and gradient plots for two cases: (i) the hypernet generates only the weights of the mainnet, and (ii) the hypernet generates both the weights and biases of the mainnet. (i) covers Figures 3, 1, 7, 8, 9, 10, 11, 12, 2, 13, 14, 15, and 16. (ii) covers Figures 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, and 29. The activations and gradients in our plots were calculated by averaging across a fixed held-out set of 300 examples drawn randomly from the test set.

In Figures 1, 8, 9, 11, 12, 13, 14, 16, 18, 20, 21, 23, 24, 26, 27, and 29, the y axis shows the number of activations/gradients, while the x axis shows the value of the activations/gradients. The values of activations/gradients from the hypernet output layer correspond to the values of mainnet weights. In Figures 2, 7, 10, 15, 19, 22, 25, and 28, the y axis shows the mean value of the activations/gradients, while each increment on the x axis corresponds to a measurement taken every 1000 training batches, with the bars denoting one standard deviation away from the mean.

B.1.1 HYPERNET GENERATES ONLY THE MAINNET WEIGHTS

Figure 7: Evolution of Mainnet Activations during Training on MNIST.

Figure 8: Mainnet Activations at the End of Training on MNIST.

Figure 9: Mainnet Gradients before the Start of Training on MNIST.

Figure 10: Evolution of Mainnet Gradients during Training on MNIST.

Figure 11: Mainnet Gradients at the End of Training on MNIST.
Figure 12: Hypernet Output Layer Activations before the Start of Training on MNIST.

Figure 13: Hypernet Output Layer Activations at the End of Training on MNIST.

Figure 14: Hypernet Output Layer Gradients before the Start of Training on MNIST.

Figure 15: Evolution of Hypernet Output Layer Gradients during Training on MNIST.

Figure 16: Hypernet Output Layer Gradients at the End of Training on MNIST.

B.1.2 HYPERNET GENERATES BOTH MAINNET WEIGHTS AND BIASES

Figure 17: Loss and Test Accuracy Plots on MNIST.

Figure 18: Mainnet Activations before the Start of Training on MNIST.
Figure 19: Evolution of Mainnet Activations during Training on MNIST.

Figure 20: Mainnet Activations at the End of Training on MNIST.

Figure 21: Mainnet Gradients before the Start of Training on MNIST.

Figure 22: Evolution of Mainnet Gradients during Training on MNIST.

Figure 23: Mainnet Gradients at the End of Training on MNIST.

Figure 24: Hypernet Output Layer Activations before the Start of Training on MNIST.

Figure 25: Evolution of Hypernet Output Layer Activations during Training on MNIST.
Figure 26: Hypernet Output Layer Activations at the End of Training on MNIST.

Figure 27: Hypernet Output Layer Gradients before the Start of Training on MNIST.

Figure 28: Evolution of Hypernet Output Layer Gradients during Training on MNIST.

Figure 29: Hypernet Output Layer Gradients at the End of Training on MNIST.

B.1.3 REMARK ON THE COMBINATION OF FAN-IN AND FAN-OUT INIT

Glorot & Bengio (2010) proposed to use the harmonic mean of the two different initialization formulae derived from the forward and backward pass. He et al. (2015) commented that either version suffices for convergence, and that it does not really matter, given that the difference between the two will be a depth-independent factor. We experimented with the harmonic, geometric, and arithmetic means of the two different formulae in both the classical and the hypernet case. There was no indication of any significant benefit from taking any of the three different means in either case. Thus, we confirm and concur with He et al. (2015)'s original observation that either the fan-in or the fan-out version suffices.

B.2 CONTINUAL LEARNING ON REGRESSION TASKS

The mainnet is a feedforward network with two hidden layers (10 hidden units) and the ReLU activation function. The weights and biases of the mainnet are generated from a hypernet with two hidden layers (10 hidden units) and trainable embeddings of size 2 sampled from $U(\pm\sqrt{3})$. We keep the same continual learning hyperparameter $\beta_{\mathrm{output}}$ value of 0.005 and pick the best learning rate for each initialization method from $\{10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}\}$. Notably, Kaiming (fan-in) could only be trained with learning rate $10^{-5}$, with losses diverging soon after initialization using the other learning rates. Each task was trained for 6000 training iterations using batch size 32, with Figure 4 plotted from losses measured every 100 iterations.
B.3 CONVOLUTIONAL NETWORKS ON CIFAR-10

The networks were trained on CIFAR-10 for 500 epochs, starting with an initial learning rate of 0.0005 using batch size 100, and decaying with $\gamma = 0.1$ at epochs 350 and 450. The hypernet is composed of two layers (50 hidden units) with separate embeddings and separate input layers but shared output layers. The weight generation happens in blocks of $(96, 3, 3)$, where $K = 96$ is the highest common factor between the different sizes of the convolutional layers in the mainnet and $n = 3$ is the size of the convolutional filters (see Appendix Section A.2 for a more detailed explanation of the hypernet architecture). The embeddings are of size 50 and fixed after random sampling from $U(\pm\sqrt{3})$. We use the mean cross entropy loss for training, but the summed cross entropy loss for testing.

B.4 BAYESIAN NEURAL NETWORK ON IMAGENET

Ukai et al. (2018) showed that a Bayesian neural network can be developed by using a hypernetwork to express a prior distribution without substantial changes to the vanilla hypernetwork setting. Their method simply requires putting L2-regularization on the model parameters and sampling from stochastic embeddings. We trained a linear hypernet to generate the weights of a MobileNet mainnet architecture (excluding the batch normalization layers), using the block-wise sampling strategy described in Ukai et al. (2018), with a factor of 0.0005 for the L2-regularization. We initialize fixed embeddings of size 32 sampled from $U(\pm\sqrt{3})$, and sample additive stochastic noise from $U(-0.1, 0.1)$ at the beginning of every training mini-batch. The training was done on ImageNet with batch size 256 and learning rate 0.1 for 25 epochs, or equivalently, 125125 iterations. The testing was done with 10 Monte Carlo samples. We omit the test loss plots due to the computational expense of doing 10 forward passes after every mini-batch instead of after every epoch.