# Explicit Regularisation in Gaussian Noise Injections

Alexander Camuto (University of Oxford, Alan Turing Institute) acamuto@turing.ac.uk
Matthew Willetts (University of Oxford, Alan Turing Institute) mwilletts@turing.ac.uk
Umut Şimşekli (University of Oxford, Institut Polytechnique de Paris) umut.simsekli@telecom-paris.fr
Stephen Roberts (University of Oxford, Alan Turing Institute) sjrob@robots.ox.ac.uk
Chris Holmes (University of Oxford, Alan Turing Institute) cholmes@stats.ox.ac.uk

We study the regularisation induced in neural networks by Gaussian noise injections (GNIs). Though such injections have been extensively studied when applied to data, there have been few studies on understanding the regularising effect they induce when applied to network activations. Here we derive the explicit regulariser of GNIs, obtained by marginalising out the injected noise, and show that it penalises functions with high-frequency components in the Fourier domain, particularly in layers closer to a neural network's output. We show analytically and empirically that such regularisation produces calibrated classifiers with large classification margins.

1 Introduction

Noise injections are a family of methods that involve adding or multiplying samples from a noise distribution, typically an isotropic Gaussian, to the weights or activations of a neural network during training. The benefits of such methods are well documented. Models trained with noise often generalise better to unseen data and are less prone to overfitting (Srivastava et al., 2014; Kingma et al., 2015; Poole et al., 2014).

Even though the regularisation conferred by Gaussian noise injections (GNIs) can be observed empirically, and the benefits of noising data are well understood theoretically (Bishop, 1995; Cohen et al., 2019; Webb, 1994), there have been few studies on understanding the benefits of methods that inject noise throughout a network. Here we study the explicit regularisation of such injections, which is a positive term added to the loss function obtained when we marginalise out the noise we have injected. Concretely, our contributions are:

- We derive an analytic form for an explicit regulariser that explains most of GNIs' regularising effect.
- We show that this regulariser penalises networks that learn functions with high-frequency content in the Fourier domain and most heavily regularises neural network layers that are closer to the output. See Figure 1 for an illustration.
- Finally, we show analytically and empirically that this regularisation induces larger classification margins and better calibration of models.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Figure 1: Here we illustrate the effect of GNIs injected throughout a network's activations. Each coloured dot represents a neuron's activations. We add GNIs, represented as circles, to each layer's activations bar the output layer. GNIs induce a network for which each layer learns a progressively lower frequency function, represented as a sinusoid matching in colour to its corresponding layer.
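To make the injection scheme illustrated in Figure 1 concrete before it is formalised in Section 2.1 below, here is a minimal sketch of training-time additive GNIs on a small fully connected network. It is illustrative only: the class name, layer sizes, and the single noise scale are our own choices, and PyTorch is used purely for convenience; it is not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GNIMLP(nn.Module):
    """A small MLP with additive Gaussian noise injected into every layer's
    activations (the input included) during training, but not into the output layer."""
    def __init__(self, d_in=32, d_hidden=64, d_out=10, sigma=0.1):
        super().__init__()
        self.hidden = nn.ModuleList([nn.Linear(d_in, d_hidden),
                                     nn.Linear(d_hidden, d_hidden)])
        self.out = nn.Linear(d_hidden, d_out)
        self.sigma = sigma          # standard deviation of the injected noise

    def forward(self, x):
        h = x
        for layer in self.hidden:
            if self.training:       # noise the previous layer's activations first
                h = h + self.sigma * torch.randn_like(h)
            h = torch.relu(layer(h))
        if self.training:           # noise the last hidden activations as well
            h = h + self.sigma * torch.randn_like(h)
        return self.out(h)          # the output layer itself receives no noise

model = GNIMLP()
x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
loss = F.cross_entropy(model(x), y) # optimised with SGD exactly as in the noise-free case
loss.backward()
```

Noise is drawn afresh at every forward pass and only at training time, so evaluation uses the clean network.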
2 Background

2.1 Gaussian Noise Injections

Training a neural network involves optimising network parameters to maximise the marginal likelihood of a set of labels given features via gradient descent. With a training dataset D composed of N data-label pairs of the form (x, y), x ∈ R^d, y ∈ R^m, and a feed-forward neural network with M parameters divided into L layers, θ = {W_1, ..., W_L}, θ ∈ R^M, our objective is to minimise the expected negative log likelihood of labels y given data x, −log p_θ(y|x), and find the optimal set of parameters satisfying:

$$\theta^{\ast} = \arg\min_{\theta} \mathcal{L}(\mathcal{D};\theta), \qquad \mathcal{L}(\mathcal{D};\theta) := \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[-\log p_{\theta}(y\,|\,x)\right]. \tag{1}$$

Under stochastic optimisation algorithms, such as Stochastic Gradient Descent (SGD), we estimate L by sampling a mini-batch of data-label pairs B ⊂ D:

$$\mathcal{L}(\mathcal{B};\theta) = \mathbb{E}_{(x,y)\sim\mathcal{B}}\left[-\log p_{\theta}(y\,|\,x)\right] \approx \mathcal{L}(\mathcal{D};\theta). \tag{2}$$

Consider an L-layer network with no noise injections and a non-linearity φ at each layer. We obtain the activations h = {h_0, ..., h_L}, where h_0 = x is the input data before any noise is injected. For a network consisting of dense layers (a.k.a. a multi-layer perceptron, MLP) we have that:

$$h_k(x) = \phi\left(W_k\, h_{k-1}(x)\right). \tag{3}$$

What happens to these activations when we inject noise? First, let ε be the set of noise injections at each layer: ε = {ε_0, ..., ε_{L−1}}. When performing a noise injection procedure, the value of the next layer's activations depends on the noised value of the previous layer. We denote the intermediate, soon-to-be-noised value of an activation as ĥ_k and the subsequently noised value as h̃_k:

$$\hat{h}_k(x) = \phi\left(W_k\, \tilde{h}_{k-1}(x)\right), \qquad \tilde{h}_k(x) = \hat{h}_k(x) \circ \epsilon_k, \tag{4}$$

where ∘ is some element-wise operation. We can, for example, add or multiply Gaussian noise to each hidden layer unit. In the additive case, we obtain:

$$\tilde{h}_k(x) = \hat{h}_k(x) + \epsilon_k, \qquad \epsilon_k \sim \mathcal{N}(0, \sigma_k^2 I). \tag{5}$$

The multiplicative case can be rewritten as an activation-scaled addition:

$$\tilde{h}_k(x) = \hat{h}_k(x) + \epsilon_k, \qquad \epsilon_k \sim \mathcal{N}\!\left(0, \sigma_k^2\,\mathrm{diag}\!\left(\hat{h}_k(x)\right)^{2}\right). \tag{6}$$

Here we focus our analysis on noise additions, but through equation (6) we can translate our results to the multiplicative case.

2.2 Sobolev Spaces

To define a Sobolev space we use the generalisation of the derivative for multivariate functions of the form g : R^d → R. We use a multi-index notation α ∈ N^d which defines mixed partial derivatives. We denote the α-th derivative of g with respect to its input x as D^α g(x):

$$D^{\alpha} g = \frac{\partial^{|\alpha|} g}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}, \qquad |\alpha| = \sum_{i=1}^{d} \alpha_i.$$

Note that x^α = x_1^{α_1} ⋯ x_d^{α_d} and α! = α_1! ⋯ α_d!.

Definition 2.1 (Cucker and Smale (2002)). Sobolev spaces are denoted W^{l,p}(Ω), Ω ⊆ R^d, where l, the order of the space, is a non-negative integer and p ≥ 1. The Sobolev space of index (l, p) is the space of locally integrable functions f : Ω → R such that for every multi-index α with |α| ≤ l the derivative D^α f exists and D^α f ∈ L^p(Ω). The norm in such a space is given by

$$\|f\|_{W^{l,p}(\Omega)} = \Big(\sum_{|\alpha| \le l} \int_{\Omega} |D^{\alpha} f(x)|^{p}\, dx\Big)^{1/p}.$$

For p = 2 these spaces are Hilbert spaces, with a dot product that defines the L² norm of a function's derivatives. Further, these Sobolev spaces can be defined in a measure space with finite measure μ. We call such spaces finite measure spaces of the form W^{l,p}_μ(R^d); these are the spaces of locally integrable functions such that for every α with |α| ≤ l, D^α f ∈ L^p_μ(R^d), the L^p space equipped with the measure μ. The norm in such a space is given by (Hornik, 1991):

$$\|f\|_{W^{l,p}_{\mu}(\mathbb{R}^d)} = \Big(\sum_{|\alpha| \le l} \int_{\mathbb{R}^d} |D^{\alpha} f(x)|^{p}\, d\mu(x)\Big)^{1/p}, \qquad f \in W^{l,p}_{\mu}(\mathbb{R}^d), \quad |\mu(x)| < \infty \;\; \forall x \in \mathbb{R}^d. \tag{7}$$

Generally a Sobolev space over a compact subset Ω of R^d can be expressed as a weighted Sobolev space with a measure μ which has compact support on Ω (Hornik, 1991).

Hornik (1991) has shown that neural networks with continuous activations, which have continuous and bounded derivatives up to order l, such as the sigmoid function, are universal approximators in the weighted Sobolev spaces of order l, meaning that they form a dense subset of these Sobolev spaces.
Further, Czarnecki et al. (2017) have shown that networks that use piecewise linear activation functions (such as ReLU and its extensions) are also universal approximators in the Sobolev spaces of order 1 where the domain is some compact subset of R^d. As mentioned above, this is equivalent to being dense in a weighted Sobolev space on R^d where the measure μ has compact support. Hence, we can view a neural network with sigmoid or piecewise linear activations as a parameter θ that indexes a function in a weighted Sobolev space with index (1, 2), i.e. f_θ ∈ W^{1,2}_μ(R^d).

3 The Explicit Effect of Gaussian Noise Injections

Here we consider the case where we noise all layers with isotropic noise, except the final predictive layer, which we also consider to have no activation function. We can express the effect of the Gaussian noise injection on the cost function as an added term ΔL, which is dependent on E_L, the noise accumulated on the final layer L from the noise additions on the previous hidden layer activations:

$$\tilde{\mathcal{L}}(\mathcal{B};\theta,\epsilon) = \mathcal{L}(\mathcal{B};\theta) + \Delta\mathcal{L}(\mathcal{B};\theta, E_L). \tag{8}$$

To understand the regularisation induced by GNIs, we want to study the regularisation that these injections induce consistently from batch to batch. To do so, we want to remove the stochastic component of the GNI regularisation and extract a regulariser that is of consistent sign. Regularisers that change sign from batch to batch do not give a constant objective to optimise, making them unfit as regularisers (Botev et al., 2017; Sagun et al., 2018; Wei et al., 2020). As such, we study the explicit regularisation these injections induce by way of the expected regulariser, E_{ε∼p(ε)}[ΔL(ε)], which marginalises out the injected noise ε. To lighten notation, we denote this as E_ε[ΔL(ε)].

We extract R, a constituent term of the expected regulariser that dominates the remainder terms in norm and is consistently positive. Because of these properties, R provides a lens through which to study the effect of GNIs. As we show, this term has a connection to the Sobolev norm and the Fourier transform of the function parameterised by the neural network. Using these connections we make inroads into better understanding the regularising effect of noise injections on neural networks.

To begin deriving this term, we first need to define the accumulated noise E_L. We do so by applying a Taylor expansion to each noised layer. As in Section 2.2, we use the generalisation of the derivative for multivariate functions using a multi-index α. For example, D^α h_{k,i}(h_{k−1}(x)) denotes the α-th derivative of the i-th activation of the k-th layer (h_{k,i}) with respect to the preceding layer's activations h_{k−1}(x), and D^α L(h_k(x), y) denotes the α-th derivative of the loss with respect to the non-noised activations h_k(x).

Proposition 1. Consider an L-layer neural network experiencing isotropic GNIs at each layer k ∈ [0, ..., L−1] of dimensionality d_k. We denote this added noise as ε = {ε_0, ..., ε_{L−1}}. We assume h_L is in C^∞, the class of infinitely differentiable functions. We can define the accumulated noise at each layer k using a multi-index α_k ∈ N^{d_{k−1}}:

$$E_{L,i} = \sum_{|\alpha_L| \ge 1} \frac{1}{\alpha_L!}\left(D^{\alpha_L} h_{L,i}(h_{L-1}(x))\right) E_{L-1}^{\alpha_L}, \qquad i = 1,\dots,d_L$$
$$E_{k,i} = \epsilon_{k,i} + \sum_{|\alpha_k| \ge 1} \frac{1}{\alpha_k!}\left(D^{\alpha_k} h_{k,i}(h_{k-1}(x))\right) E_{k-1}^{\alpha_k}, \qquad i = 1,\dots,d_k, \quad k = 1,\dots,L-1$$
$$E_0 = \epsilon_0,$$

where x is drawn from the dataset D and h_k are the activations before any noise is added, as defined in Equation (3). See Appendix A.1 for the proof.
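The leading, first-order part of Proposition 1 says that for small injections the noise reaching the output is approximately Σ_k J_k ε_k, where J_k is the Jacobian of the network output with respect to the k-th layer's activations. The sketch below is a minimal numerical check of this linearisation on a toy tanh network; it is our own illustration (PyTorch autograd, arbitrary sizes), not code from the paper.

```python
import torch

torch.manual_seed(0)
d_in, d_hidden, d_out = 4, 8, 3
W = [torch.randn(d_hidden, d_in), torch.randn(d_hidden, d_hidden), torch.randn(d_out, d_hidden)]
act = torch.tanh                      # smooth activation, as assumed in Proposition 1

def forward_from(h, layer):
    """Propagate an activation vector from `layer` up to the (linear, un-noised) output."""
    for k in range(layer, len(W) - 1):
        h = act(W[k] @ h)
    return W[-1] @ h

x = torch.randn(d_in)

# clean activations h_0 = x, h_1, h_2 (every noised layer)
h = [x]
for k in range(len(W) - 1):
    h.append(act(W[k] @ h[-1]))

sigma = 1e-3
eps = [sigma * torch.randn_like(hk) for hk in h]      # one injection per noised layer

# noisy forward pass: each layer sees the *noised* previous activation, as in eq. (4)
h_noisy = x + eps[0]
for k in range(len(W) - 1):
    h_noisy = act(W[k] @ h_noisy) + eps[k + 1]
out_noisy = W[-1] @ h_noisy

# first-order prediction of the accumulated noise: E_L ~ sum_k J_k eps_k
E_pred = torch.zeros(d_out)
for k, (hk, ek) in enumerate(zip(h, eps)):
    J_k = torch.autograd.functional.jacobian(lambda a, k=k: forward_from(a, k), hk)
    E_pred = E_pred + J_k @ ek

print("actual output perturbation :", out_noisy - forward_from(x, 0))
print("first-order prediction     :", E_pred)
```

With σ = 1e-3 the two printed vectors should agree closely; increasing σ exposes the higher-order terms collected in Proposition 1.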
Given this form for the accumulated noise, we can now define the expected regulariser. For compactness of notation, we denote each layer's Jacobian as J_k ∈ R^{d_L × d_k} and the Hessian of the loss with respect to the final layer as H_L ∈ R^{d_L × d_L}. Each entry of J_k is a partial derivative of f_{k,i}, the function from layer k to the i-th network output, i = 1...d_L:

$$J_k = \begin{bmatrix} \frac{\partial f_{k,1}}{\partial h_{k,1}} & \cdots & \frac{\partial f_{k,1}}{\partial h_{k,d_k}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_{k,d_L}}{\partial h_{k,1}} & \cdots & \frac{\partial f_{k,d_L}}{\partial h_{k,d_k}} \end{bmatrix}, \qquad H_L(x,y) = \begin{bmatrix} \frac{\partial^2 \mathcal{L}}{\partial h_{L,1}\,\partial h_{L,1}} & \cdots & \frac{\partial^2 \mathcal{L}}{\partial h_{L,1}\,\partial h_{L,d_L}} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 \mathcal{L}}{\partial h_{L,d_L}\,\partial h_{L,1}} & \cdots & \frac{\partial^2 \mathcal{L}}{\partial h_{L,d_L}\,\partial h_{L,d_L}} \end{bmatrix}.$$

Using these notations we can now define the explicit regularisation induced by GNIs.

Theorem 1. Consider an L-layer neural network experiencing isotropic GNIs at each layer k ∈ [0, ..., L−1] of dimensionality d_k. We denote this added noise as ε = {ε_0, ..., ε_{L−1}}. We assume the loss L is in C^∞, the class of infinitely differentiable functions. We can marginalise out the injected noise to obtain an added regulariser:

$$\mathbb{E}_{\epsilon}\left[\Delta\mathcal{L}(\mathcal{B};\theta, E_L)\right] = \mathbb{E}_{(x,y)\sim\mathcal{B}}\left[\frac{1}{2}\sum_{k=0}^{L-1}\sigma_k^2\,\mathrm{Tr}\!\left(J_k^{\top}(x)\, H_L(x,y)\, J_k(x)\right)\right] + \mathbb{E}_{\epsilon}\left[C(\mathcal{B},\epsilon)\right],$$

where h_k are the activations before any noise is added, as in Equation (3), and E_ε[C(B, ε)] is a remainder term in higher-order derivatives. See Appendix A.2 for the proof and for the exact form of the remainder.

We denote the first term in Theorem 1 as:

$$\mathcal{R}(\mathcal{B};\sigma) = \mathbb{E}_{(x,y)\sim\mathcal{B}}\left[\frac{1}{2}\sum_{k=0}^{L-1}\sigma_k^2\,\mathrm{Tr}\!\left(J_k^{\top}(x)\, H_L(x,y)\, J_k(x)\right)\right].$$

To understand the main contributors behind the regularising effect of GNIs, we first want to establish the relative importance of the two terms that constitute the explicit effect. We know that R is the added regulariser for the linearised version of a neural network, defined by its Jacobian. This linearisation well approximates neural network behaviour for sufficiently wide networks (Jacot et al., 2018; Chizat et al., 2019; Arora et al., 2019) in early stages of training (Chen et al., 2020), and we can expect R to dominate the remainder term in norm, which consists of higher-order derivatives. In Figure 2 we show that this is the case for a range of GNI variances, datasets, and activation functions for networks with 256 neurons per layer, where the remainder is estimated as:

$$\mathbb{E}_{\epsilon}\left[C(\mathcal{B},\epsilon)\right] \approx \frac{1}{1000}\sum_{i=1}^{1000}\tilde{\mathcal{L}}(\mathcal{B};\theta,\epsilon^{(i)}) - \mathcal{R}(\mathcal{B};\sigma) - \mathcal{L}(\mathcal{B};\theta),$$

where the ε^{(i)} are independent noise draws. These results show that R is a significant component of the regularising effect of GNIs. It dominates the remainder E_ε[C(ε)] in norm and is always positive, as we will show, thus offering a consistent objective for SGD to minimise. Given that R is a likely candidate for understanding the effect of GNIs, we further study this term separately in regression and classification settings.

Figure 2: In (a, b) we plot R(σ) vs E_ε[C(ε)] at initialisation for 6-layer MLPs with GNIs at each 256-neuron layer with the same variance σ² ∈ {0.1, 0.25, 1.0, 4.0} at each layer. Each point corresponds to one of 250 different network initialisations acting on a batch of size 32 for the classification dataset CIFAR-10 and the regression dataset Boston House Prices (BHP). The dotted red line corresponds to y = x and demonstrates that for all batches and GNI variances R is greater than E_ε[C(ε)]. In (c, d) we plot the ratio |E_ε[C(ε)]| / R(σ) over the first 100 training iterations for 10 randomly initialised networks. Shading corresponds to the standard deviation of values over the 10 networks. R(σ) remains dominant in early stages of training, as the ratio is less than 1 for all steps. Panels: (a) BHP sigmoid, (b) CIFAR-10 ELU, (c) BHP sigmoid, (d) CIFAR-10 ELU.
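The comparison behind Figure 2 can be reproduced in miniature: compute R from Theorem 1 with automatic differentiation and compare it against a Monte-Carlo estimate of the full marginal effect of the injections, E_ε[L̃] − L. The sketch below does this for a single sample, a toy tanh network, and a mean-squared-error loss; it is a minimal illustration under our own assumptions (PyTorch, arbitrary dimensions), not the paper's experimental code.

```python
import torch

torch.manual_seed(0)
d_in, d_h, d_out, sigma2 = 4, 16, 2, 0.1
W = [torch.randn(d_h, d_in) / d_in ** 0.5,
     torch.randn(d_h, d_h) / d_h ** 0.5,
     torch.randn(d_out, d_h) / d_h ** 0.5]
act = torch.tanh

def forward_from(h, layer):
    """Propagate activations from `layer` up to the linear output layer h_L."""
    for k in range(layer, len(W) - 1):
        h = act(W[k] @ h)
    return W[-1] @ h

def noisy_forward(x, eps):
    """Forward pass with additive GNIs on h_0, ..., h_{L-1}."""
    h = x + eps[0]
    for k in range(len(W) - 1):
        h = act(W[k] @ h) + eps[k + 1]
    return W[-1] @ h

x, y = torch.randn(d_in), torch.randn(d_out)
mse = lambda out: 0.5 * ((y - out) ** 2).sum()          # MSE loss (eq. (10) below)

h = [x]                                                 # clean activations h_0, ..., h_{L-1}
for k in range(len(W) - 1):
    h.append(act(W[k] @ h[-1]))
L_clean = mse(forward_from(x, 0))

# explicit term of Theorem 1: R = 1/2 * sum_k sigma_k^2 Tr(J_k^T H_L J_k)
H_L = torch.autograd.functional.hessian(mse, forward_from(x, 0))
R = torch.tensor(0.0)
for k, hk in enumerate(h):
    J_k = torch.autograd.functional.jacobian(lambda a, k=k: forward_from(a, k), hk)
    R = R + 0.5 * sigma2 * torch.trace(J_k.T @ H_L @ J_k)

# Monte-Carlo estimate of the full GNI effect E_eps[L_tilde] - L
with torch.no_grad():
    draws = torch.stack([
        mse(noisy_forward(x, [sigma2 ** 0.5 * torch.randn_like(hk) for hk in h]))
        for _ in range(5000)])

print(f"explicit regulariser R     : {R.item():.4f}")
print(f"Monte-Carlo E[L_tilde] - L : {(draws.mean() - L_clean).item():.4f}")
```

The gap between the two printed numbers is, up to Monte-Carlo error, exactly the remainder E_ε[C(B, ε)].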
Regularisation in Regression. In the case of regression, one of the most commonly used loss functions is the mean-squared error (MSE), which is defined for a data-label pair (x, y) as:

$$\mathcal{L}(x,y) = \frac{1}{2}\left(y - h_L(x)\right)^{2}. \tag{10}$$

For this loss, the Hessians in Theorem 1 are simply the identity matrix, and the explicit regularisation term, guaranteed to be positive, is:

$$\mathcal{R}(\mathcal{B};\sigma) = \mathbb{E}_{(x,y)\sim\mathcal{B}}\left[\frac{1}{2}\sum_{k=0}^{L-1}\sigma_k^2\,\|J_k(x)\|_F^2\right], \tag{11}$$

where σ_k² is the variance of the noise ε_k injected at layer k and ‖·‖_F is the Frobenius norm. See Appendix A.4 for a proof.

Regularisation in Classification. In the case of classification, we consider a cross-entropy (CE) loss. Recall that we consider our network outputs h_L to be the pre-softmax logits of the final layer. We denote p(x) = softmax(h_L(x)). For a pair (x, y) we have:

$$\mathcal{L}(x,y) = -\sum_{c=1}^{C} y_c \log\!\left(p(x)_c\right), \tag{12}$$

where c indexes over the C possible classes. The Hessian H_L(x) no longer depends on y:

$$\left[H_L(x)\right]_{ij} = \begin{cases} p(x)_i\left(1 - p(x)_i\right) & i = j \\ -\,p(x)_i\, p(x)_j & i \neq j. \end{cases} \tag{13}$$

This Hessian is positive semi-definite, and R(σ), guaranteed to be positive, can be written as:

$$\mathcal{R}(\mathcal{B};\sigma) = \mathbb{E}_{(x,y)\sim\mathcal{B}}\left[\frac{1}{2}\sum_{k=0}^{L-1}\sigma_k^2\,\mathrm{diag}\!\left(H_L(x)\right)^{\!\top} J_k^{2}(x)\,\mathbf{1}_{d_k}\right],$$

where J_k² denotes the element-wise square of J_k, 1_{d_k} is a vector of ones, and σ_k² is the variance of the noise ε_k injected at layer k. See Appendix A.5 for the proof.

To test our derived regularisers, in Figure 3 we show that models trained with R and models trained with GNIs have similar training profiles: they have similar test-set losses and parameter Hessians throughout training, meaning that they follow almost identical trajectories through the loss landscape. This implies that R is a good descriptor of the effect of GNIs and that we can use this term to understand the mechanism underpinning their regularising effect. As we now show, it penalises neural networks that parameterise functions with higher frequencies in the Fourier domain, offering a novel lens under which to study GNIs.

Figure 3: Panel (a) shows the test-set loss for convolutional models (CONV) and 4-layer MLPs trained on SVHN with R(σ), with GNIs (σ² = 0.1), and with no noise (Baseline). Panel (b) shows the trace of the network parameter Hessian, H_{i,j} = ∂²L/∂w_i∂w_j, for a 2-layer, 32-unit-per-layer MLP; this trace is a proxy for the parameters' location in the loss landscape. All networks use ELU activations. See Appendix F for more such results on other datasets and network architectures. Panels: (a) SVHN ELU, (b) SVHN MLP Tr(H).

4 Fourier Domain Regularisation

To link our derived regularisers to the Fourier domain, we use the connection between neural networks and Sobolev spaces mentioned above. Recall that, by Hornik (1991), we can only assume a sigmoid or piecewise linear neural network parameterises a function in a weighted Sobolev space with measure μ if we assume that μ has compact support on a subset Ω ⊆ R^d. As such, we equip our space with the probability measure μ(x), which we assume has compact support on some subset Ω ⊆ R^d with μ(Ω) = 1. We define it such that dμ(x) = p(x)dx, where dx is the Lebesgue measure and p(x) is the data density function. Given this measure, we can connect the derivatives of functions in the Hilbert-Sobolev space W^{1,2}_μ(R^d) to the Fourier domain.

Theorem 2. Consider a function f : R^d → R, with a d-dimensional input and a single output, such that f ∈ W^{1,2}_μ(R^d), where μ is a probability measure which we assume has compact support on some subset Ω ⊆ R^d with μ(Ω) = 1. Let us assume the derivative of f, D^α f, is in L²(R^d) for some multi-index α with |α| = 1. Then we can write that:

$$\|D^{\alpha} f\|^{2}_{L^{2}_{\mu}(\mathbb{R}^d)} = \int_{\mathbb{R}^d} G(\omega, j)\,\overline{\left(G(\cdot, j) \circledast P\right)(\omega)}\; d\omega, \qquad G(\omega, j) = \omega_j\, F(\omega),$$

where F is the Fourier transform of f, P is the Fourier transform (the characteristic function) of the probability measure μ, j indexes over ω = [ω_1, ..., ω_d] and is the coordinate for which α_j = 1, ⊛ is the convolution operator, and the bar denotes the complex conjugate. See Appendix A.3 for the proof.
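Returning briefly to the classification Hessian in equation (13): its independence from the label y and its positive semi-definiteness are easy to check numerically. The sketch below is our own illustration (PyTorch autograd; names and dimensions are arbitrary), not the paper's code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C = 5
z = torch.randn(C)                          # pre-softmax logits h_L(x)
y = torch.tensor(2)                         # arbitrary class label

ce = lambda logits: F.cross_entropy(logits.unsqueeze(0), y.unsqueeze(0))
H = torch.autograd.functional.hessian(ce, z)          # autograd Hessian w.r.t. the logits

p = torch.softmax(z, dim=0)
H_closed_form = torch.diag(p) - torch.outer(p, p)     # eq. (13): diag(p) - p p^T

print(torch.allclose(H, H_closed_form, atol=1e-6))    # True, and independent of y
print(torch.linalg.eigvalsh(H_closed_form).min())     # >= 0 up to float error: PSD
```

Because H_L is positive semi-definite, every trace term Tr(J_k^⊤ H_L J_k) in Theorem 1 is non-negative, which is what makes R a consistently positive objective.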
Note that in the case where the dataset contains finitely many points, the integrals for norms of the form ‖D^α f‖²_{L²_μ(R^d)} are approximated by sampling a batch from the dataset, which is distributed according to the presumed probability measure μ(x). Expectations over a batch thus approximate integration over R^d with the measure μ(x), and this approximation improves as the batch size grows. We can now use Theorem 2 to link R to the Fourier domain.

Regression. Let us begin with the case of regression. Assuming differentiable and continuous activation functions, the Jacobians within R are equivalent to the derivatives in Definition 2.1. Theorem 2 only holds for functions with 1-D outputs, but we can decompose the Jacobians J_k as the derivatives of multiple 1-D-output functions. Recall that the i-th row of the matrix J_k is the set of partial derivatives of f_{k,i}, the function from layer k to the i-th network output, i = 1...d_L, with respect to the k-th layer activations. Using this perspective, the fact that each f_{k,i} ∈ W^{1,2}_μ(R^{d_k}) (d_k is the dimensionality of the k-th layer), and the assumption that the probability measure of our space μ(x) has compact support, we use Theorem 2 to write:

$$\mathcal{R}(\mathcal{B};\sigma) = \mathbb{E}_{(x,y)\sim\mathcal{B}}\Big[\frac{1}{2}\sum_{k=0}^{L-1}\sigma_k^2\sum_{i=1}^{d_L}\|J_{k,i}(x)\|^2\Big] \approx \frac{1}{2}\sum_{k=0}^{L-1}\sigma_k^2\sum_{i=1}^{d_L}\sum_{j=1}^{d_k}\int_{\mathbb{R}^{d_k}} G_{k,i}(\omega,j)\,\overline{\big(G_{k,i}(\cdot,j)\circledast P\big)(\omega)}\,d\omega, \tag{16}$$

where h_0 = x, i indexes over output neurons, and G_{k,i}(ω, j) = ω_j F_{k,i}(ω), with F_{k,i} the Fourier transform of the function f_{k,i}. The approximation comes from the fact that in SGD, as mentioned above, integration over the dataset is approximated by sampling mini-batches B. If we take the data density function to be the empirical data density, meaning that it is supported on the N points of the dataset D (i.e. it is a set of δ-functions centred on each point), then as the size |B| of a batch B tends to N we can write that:

$$\lim_{|\mathcal{B}|\to N}\mathcal{R}(\mathcal{B};\sigma) = \frac{1}{2}\sum_{k=0}^{L-1}\sigma_k^2\sum_{i=1}^{d_L}\sum_{j=1}^{d_k}\int_{\mathbb{R}^{d_k}} G_{k,i}(\omega,j)\,\overline{\big(G_{k,i}(\cdot,j)\circledast P\big)(\omega)}\,d\omega. \tag{17}$$

Figure 4: As in Rahaman et al. (2019), we train 6-layer deep, 256-unit wide ReLU networks to regress the function λ(z) = Σ_i sin(2π r_i z + φ_i), with r_i ∈ (5, 10, ..., 45, 50). We train these networks with no noise (Baseline), with GNIs of variance 0.1 injected into each layer except the final layer (Noise), and with the R(σ) for regression in (11). The first row shows the Fourier spectrum (x-axis) of the networks as training progresses (y-axis), averaged over 10 training runs; colours show each frequency's amplitude, clipped between 0 and 1. The second row shows samples of randomly generated target functions and the function learnt by the networks. Panels: (a) Baseline, (b) Noise, (c) R.

Classification. The classification setting requires a bit more work. Recall that our Jacobians are weighted by diag(H_L(x))^⊤, which has positive entries that are less than 1 by Equation (13). We can define a new set of measures such that dμ_i(x) = diag(H_L(x))_i p(x)dx, i = 1...d_L. Because each new measure is positive, finite, and still has compact support, Theorem 2 still holds for the spaces indexed by i: W^{1,2}_{μ_i}(R^{d_k}). Using these new measures, and the fact that each f_{k,i} ∈ W^{1,2}_{μ_i}(R^{d_k}), we can use Theorem 2 to write that for classification models:

$$\mathcal{R}(\mathcal{B};\sigma) = \mathbb{E}_{(x,y)\sim\mathcal{B}}\Big[\frac{1}{2}\sum_{k=0}^{L-1}\sigma_k^2\sum_{i=1}^{d_L}\mathrm{diag}\!\left(H_L(x)\right)_i\|J_{k,i}(x)\|^2\Big] \approx \frac{1}{2}\sum_{k=0}^{L-1}\sigma_k^2\sum_{i=1}^{d_L}\sum_{j=1}^{d_k}\int_{\mathbb{R}^{d_k}} G_{k,i}(\omega,j)\,\overline{\big(G_{k,i}(\cdot,j)\circledast P_i\big)(\omega)}\,d\omega. \tag{18}$$

Here P_i is the Fourier transform of the i-th measure μ_i and, as before, G_{k,i}(ω, j) = ω_j F_{k,i}(ω), where F_{k,i} is the Fourier transform of the function f_{k,i}. Again, as the batch size increases to the size of the dataset, this approximation becomes exact.

For both regression and classification, GNIs, by way of R, induce a prior which favours smooth functions with low-frequency components. This prior is enforced by the terms G_{k,i}(ω, j), which become large in magnitude when functions have high-frequency components, penalising neural networks that learn such functions.
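To see why the G_{k,i}(ω, j) = ω_j F_{k,i}(ω) weighting punishes high frequencies, note that the Jacobian-norm terms inside R are gradient norms averaged under the data measure. A minimal NumPy sketch (our own toy example: a uniform data measure and pure sinusoids, not taken from the paper) shows that this average grows quadratically with frequency:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-np.pi, np.pi, size=100_000)       # samples from the data measure mu

# Gradient-norm penalty E_x[|f'(x)|^2] for a pure-frequency function f(x) = sin(w x).
# Since f'(x) = w cos(w x), the penalty grows as w^2 / 2 under this measure.
for w in [1, 2, 4, 8, 16]:
    grad_sq = np.mean((w * np.cos(w * x)) ** 2)
    print(f"frequency {w:2d}:  E[|f'|^2] ~ {grad_sq:7.2f}   (w^2 / 2 = {w ** 2 / 2:7.2f})")
```

A component oscillating at frequency ω therefore pays a penalty scaling like ω², which is the sense in which GNIs favour low-frequency functions.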
In Appendix B we also show that this regularisation in the Fourier domain corresponds to a form of Tikhonov regularisation. In Figure 4, we demonstrate empirically that networks trained with GNIs learn functions that do not overfit, with lower-frequency components relative to their non-noised counterparts.

A layer-wise regularisation. Note that there is a recursive structure to the penalisation induced by R. Consider the layer-to-layer functions which map from a layer k−1 to k, h_k(h_{k−1}(x)). With a slight abuse of notation, ∇_{h_{k−1}} h_k(x) is the Jacobian defined element-wise as [∇_{h_{k−1}} h_k(x)]_{i,j} = ∂h_{k,i}/∂h_{k−1,j}(x), where, as before, h_{k,i} is the i-th activation of layer k. The norm ‖∇_{h_{k−1}} h_k(x)‖²_2 is penalised k times in R, as this derivative appears in J_0, J_1, ..., J_{k−1} due to the chain rule. As such, when training with GNIs, we can expect ‖∇_{h_{k−1}} h_k(x)‖²_2 to decrease as the layer index k increases (i.e. the closer we are to the network output). By Theorem 2 and Equations (16) and (18), larger ‖∇_{h_{k−1}} h_k(x)‖²_2 corresponds to functions with higher frequency components. Consequently, when training with GNIs, the layer-to-layer function h_k(h_{k−1}(x)) will have higher frequency components than the next layer's function h_{k+1}(h_k(x)).

We measure this layer-wise regularisation in ReLU networks, using ∇_{h_{k−1}} h_k(x) = W̃_k, where W̃_k is obtained from the original weight matrix W_k by setting its i-th column to zero whenever neuron i of the k-th layer is inactive. Also note that the inputs of ReLU-network hidden layers, which are the outputs of another ReLU layer, will be positive. Negative weights are likely to deactivate a ReLU neuron, inducing smaller ‖W̃_k‖²_2 and thus parameterising a lower frequency function. We use the trace of a weight matrix as an indicator for the number of negative components. In Figure 5 we demonstrate that ‖W̃_k‖²_2 and Tr(W_k) decrease as k increases for ReLU networks trained with GNIs, indicating that each successive layer in these networks learns a function with lower frequency components than the previous layer. This striation and ordering is clearly absent in the baselines trained without GNIs.

Figure 5: We use 6-layer deep, 256-unit wide ReLU networks on the same dataset as in Figure 4, trained with GNIs (GNI) and without (Baseline). In (a, b), for layers with square weight matrices, we plot ‖W̃_k‖²_2. In (c, d) we plot the trace of these layers' weight matrices, Tr(W_k). For GNI models, as the layer index k increases, Tr(W_k) and ‖W̃_k‖²_2 decrease, indicating that each successive layer learns a function with lower frequency components than the previous layer. Panels: (a) Baseline, (b) GNI, (c) Baseline, (d) GNI.

The Benefits of GNIs. What does regularisation in the Fourier domain accomplish? The terms in R are the traces of the Gauss-Newton decompositions of the second-order derivatives of the loss. By penalising these we are more likely to land in wider (smoother) minima (see Figure 3), which has been shown, contentiously (Dinh et al., 2017), to induce networks with better generalisation properties (Keskar et al., 2019; Jastrzębski et al., 2017). GNIs, however, confer other benefits too.

Sensitivity. A model's weakness to input perturbations is termed its sensitivity.
Rahaman et al. (2019) have shown empirically that classifiers biased towards lower frequencies in the Fourier domain are less sensitive, and there is ample evidence demonstrating that models trained with noised data are less sensitive (Liu et al., 2019; Li et al., 2018). The connection between the Fourier domain and sensitivity can be established by studying the classification margins of a model (see Appendix D).

Calibration. Given a network's prediction ŷ(x) with confidence p̂(x) for a point x, perfect calibration consists of being as likely to be correct as you are confident: p(ŷ = y | p̂ = r) = r, ∀r ∈ [0, 1] (Dawid, 1982; De Groot and Fienberg, 1983). In Appendix E we show that models that are biased toward lower frequency spectra have lower "capacity measures", which measure model complexity; lower values of these measures have been shown empirically to induce better calibrated models (Guo et al., 2017). In Figure F.7 we show that this holds true for models trained with GNIs.

5 Related Work

Many variants of GNIs have been proposed to regularise neural networks. Poole et al. (2014) extend this process and apply noise to all computational steps in a neural network layer: noise is applied not only to the layer input but also to the layer output and to the pre-activation logits. The authors allude to explicit regularisation but only derive a result for a single-layer auto-encoder with a single noise injection. Similarly, Bishop (1995) derives an analytic form for the explicit regulariser induced by noise injections on data and shows that such injections are equivalent to Tikhonov regularisation in an unspecified function space. Recently, Wei et al. (2020) conducted an analysis similar to ours, dividing the effects of Bernoulli dropout into explicit and implicit effects. Their work builds on that of Mele and Altarelli (1993), Helmbold and Long (2015), and Wager et al. (2013), who perform this analysis for linear neural networks. Arora et al. (2020) derive an explicit regulariser for Bernoulli dropout on the final layer of a neural network. Further, recent work by Dieng et al. (2018) shows that noise additions on recurrent network hidden states outperform Bernoulli dropout in terms of performance and bias.

6 Conclusion

In this work, we derived analytic forms for the explicit regularisation induced by Gaussian noise injections, demonstrating that the explicit regulariser penalises networks with high-frequency content in Fourier space. Further, we showed that this regularisation is not distributed evenly within a network, as it disproportionately penalises high-frequency content in layers closer to the network output. Finally, we demonstrated that this regularisation in the Fourier domain has a number of beneficial effects: it induces training dynamics that preferentially land in wider minima, it reduces model sensitivity to noise, and it induces better calibration.

Acknowledgments

This research was directly funded by the Alan Turing Institute under Engineering and Physical Sciences Research Council (EPSRC) grant EP/N510129/1. AC was supported by an EPSRC Studentship. MW was supported by EPSRC grant EP/G03706X/1. US was supported by the French National Research Agency (ANR) as part of the FBIMATRIX (ANR-16-CE23-0014) project. SR gratefully acknowledges support from the UK Royal Academy of Engineering and the Oxford-Man Institute.
CH was supported by the Medical Research Council, the Engineering and Physical Sciences Research Council, Health Data Research UK, and the Li Ka Shing Foundation.

Impact Statement

This paper uncovers a new mechanism by which a widely used regularisation method operates and paves the way for designing new regularisation methods which take advantage of our findings. Regularisation methods produce models that are not only less likely to overfit, but also have better calibrated predictions that are more robust to distribution shifts. As such, improving our understanding of such methods is critical as machine learning models become increasingly ubiquitous and embedded in decision making.

Bibliography

Rita Aleksziev. Tangent Space Separability in Feedforward Neural Networks. In NeurIPS, 2019.
Raman Arora, Peter Bartlett, Poorya Mianjy, and Nathan Srebro. Dropout: Explicit Forms and Capacity Control. 2020.
Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Russ R. Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
Chris M. Bishop. Training with Noise is Equivalent to Tikhonov Regularization. Neural Computation, 7(1):108-116, 1995.
Aleksandar Botev, Hippolyt Ritter, and David Barber. Practical Gauss-Newton optimisation for deep learning. In ICML, 2017.
Martin Burger and Andreas Neubauer. Analysis of Tikhonov regularization for function approximation by neural networks. Neural Networks, 16(1):79-90, 2003.
Zixiang Chen, Yuan Cao, Quanquan Gu, and Tong Zhang. A generalized neural tangent kernel analysis for two-layer neural networks, 2020.
Lénaïc Chizat, Edouard Oyallon, and Francis Bach. On Lazy Training in Differentiable Programming. In NeurIPS, December 2019.
Jeremy Cohen, Elan Rosenfeld, and J. Zico Kolter. Certified adversarial robustness via randomized smoothing. In ICML, 2019.
G. M. Constantine and T. H. Savits. A multivariate Faà di Bruno formula with applications. Transactions of the American Mathematical Society, 348(2):503-520, 1996.
Felipe Cucker and Steve Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1-49, 2002.
Wojciech Marian Czarnecki, Simon Osindero, Max Jaderberg, Grzegorz Swirszcz, and Razvan Pascanu. Sobolev training for neural networks. In NeurIPS, 2017.
A. P. Dawid. The Well-Calibrated Bayesian. Journal of the American Statistical Association, 77(379), 1982.
Morris H. De Groot and Stephen E. Fienberg. The comparison and evaluation of forecasters. Journal of the Royal Statistical Society. Series D (The Statistician), 32:12-22, 1983.
Adji B. Dieng, Rajesh Ranganath, Jaan Altosaar, and David M. Blei. Noisin: Unbiased regularization for recurrent neural networks. arXiv preprint arXiv:1805.01500, 2018.
Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In ICML, 2017.
Sebastian Farquhar, Lewis Smith, and Yarin Gal. Try Depth Instead of Weight Correlations: Mean-field is a Less Restrictive Assumption for Deeper Networks. In NeurIPS, 2020.
F. Girosi and T. Poggio. Networks and the Best Approximation Property. Biological Cybernetics, 169-176, 1990.
Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In ICML, 2017.
Michael Hauser and Asok Ray. Principles of Riemannian geometry in neural networks. In NeurIPS.
David P. Helmbold and Philip M. Long. On the inductive bias of dropout. Journal of Machine Learning Research, 16:3403-3454, 2015.
Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251-257, 1991.
Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In NeurIPS, 2018.
Daniel Jakubovitz and Raja Giryes. Improving DNN robustness to adversarial attacks using Jacobian regularization. Lecture Notes in Computer Science, pages 525-541, 2018.
Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three Factors Influencing Minima in SGD. In NeurIPS, 2017.
Nitish Shirish Keskar, Jorge Nocedal, Ping Tak Peter Tang, Dheevatsa Mudigere, and Mikhail Smelyanskiy. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2019.
Diederik P. Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In NeurIPS, 2015.
Daniel Kunin, Jonathan M. Bloom, Aleksandrina Goeva, and Cotton Seed. Loss landscapes of regularized linear autoencoders. In ICML, 2019.
Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient BackProp, pages 9-48. 1998.
Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. Second-order adversarial attack and certifiable robustness. CoRR, 2018.
Yuhang Liu, Wenyong Dong, Lei Zhang, Dong Gong, and Qinfeng Shi. Variational Bayesian dropout with a hierarchical prior. In IEEE CVPR, 2019.
Barbara Mele and Guido Altarelli. Lepton spectra as a measure of b quark polarization at LEP. Physics Letters B, 299(3-4):345-350, 1993.
Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian Binning. Proceedings of the National Conference on Artificial Intelligence, 4:2901-2907, 2015.
Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In PMLR, 2015.
Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning. In NeurIPS, 2017.
Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In ICML, 2005.
Ben Poole, Jascha Sohl-Dickstein, and Surya Ganguli. Analyzing noise in autoencoders and deep networks. 2014.
Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In NeurIPS, 2016.
Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. In ICML, 2019.
Levent Sagun, Utku Evci, V. Uğur Güney, Yann Dauphin, and Léon Bottou. Empirical analysis of the Hessian of over-parametrized neural networks. 2018.
Jure Sokolić, Raja Giryes, Guillermo Sapiro, and Miguel R. D. Rodrigues. Robust Large Margin Deep Neural Networks. IEEE Transactions on Signal Processing, 65(16):4265-4280, 2017.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929-1958, 2014.
A. N. Tikhonov. Solutions of ill-posed problems. Andrey N. Tikhonov and Vasiliy Y. Arsenin; translation editor, Fritz John. 1977.
Stefan Wager, Sida Wang, and Percy S. Liang. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems, pages 351-359, 2013.
Andrew R. Webb. Functional Approximation by Feed-Forward Networks: A Least-Squares Approach to Generalization. IEEE Transactions on Neural Networks, 5(3):363-371, 1994.
Colin Wei, Sham Kakade, and Tengyu Ma. The Implicit and Explicit Regularization Effects of Dropout. 2020.
Chiyuan Zhang, Benjamin Recht, Samy Bengio, Moritz Hardt, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.