# Understanding Noise Injection in GANs

Ruili Feng 1, Deli Zhao 2, Zheng-Jun Zha 1

1 University of Science and Technology of China, Hefei, China. 2 Alibaba Group. Correspondence to: Ruili Feng, Deli Zhao, Zheng-Jun Zha. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract. Noise injection is an effective way of circumventing overfitting and enhancing generalization in machine learning, and its rationale has been validated in deep learning as well. Recently, noise injection has exhibited surprising effectiveness when generating high-fidelity images with Generative Adversarial Networks (GANs) (e.g., StyleGAN). Despite its successful applications in GANs, the mechanism behind its validity is still unclear. In this paper, we propose a geometric framework to theoretically analyze the role of noise injection in GANs. First, we point out the existence of the adversarial dimension trap inherent in GANs, which makes it difficult to learn a proper generator. Second, we model the noise injection framework with exponential maps based on Riemannian geometry. Guided by our theories, we propose a general geometric realization for noise injection. Under our novel framework, the simple noise injection used in StyleGAN reduces to the Euclidean case. The goal of our work is to take theoretical steps towards understanding the underlying mechanism of state-of-the-art GAN algorithms. Experiments on image generation and GAN inversion validate our theory in practice.

1. Introduction

Noise injection is usually applied as regularization to cope with overfitting or to facilitate generalization in neural networks (Bishop, 1995; An, 1996). The effectiveness of this simple technique has also been proved in various deep learning tasks, such as learning deep architectures (Hinton et al., 2012; Srivastava et al., 2014; Noh et al., 2017), defending against adversarial attacks (He et al., 2019), stabilizing differentiable architecture search with reinforcement learning (Liu et al., 2019; Chu et al., 2020), and quantizing neural networks (Baskin et al., 2018). In recent years, noise injection [1] has attracted more and more attention in the community of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014a). Extensive research shows that it helps stabilize the training procedure (Arjovsky & Bottou, 2017; Jenni & Favaro, 2019) and generate images of high fidelity (Karras et al., 2019a;b; Brock et al., 2018). Particularly, noise injection in StyleGAN (Karras et al., 2019a;b) has shown an amazing capability of helping generate sharp details in images (see Figure 1 for illustration), shedding new light on obtaining high-quality photo-realistic results with GANs. In other domains, such as variants of the variational auto-encoder (Vahdat & Kautz, 2020; Child, 2020), the noise injection technique also contributes to the appealing synthesis quality. Therefore, studying the underlying principle of noise injection in GANs is an important theoretical step towards understanding GAN algorithms.

In this paper, we propose a theoretical framework to explain and improve the effectiveness of noise injection in GANs. Our contributions are listed as follows:

- We uncover an intrinsic defect of GAN models: the expressive power of the generator is limited by the rank of its Jacobian matrix, and this rank is monotonically (but not strictly) decreasing as the network gets deeper.
- We prove that noise injection is an effective weapon for enhancing the expressive power of generators in GANs.
- Based on our theory, we propose a generalized form of noise injection in GANs via exponential maps, which can overcome the adversarial dimension trap.
- Experiments on the state-of-the-art GAN, StyleGAN2 (Karras et al., 2019b), validate the effectiveness of our geometric model.

To the best of our knowledge, this is the first work that theoretically draws the geometric picture of noise injection in GANs and uncovers this intrinsic defect in the expressive power of generators.

[1] Note that noise injection here is totally different from the adversarial attacks raised in (Goodfellow et al., 2014b).

Figure 1. Noise injection significantly improves the detail quality of generated images. From left to right, we inject extra noise into the generator layer by layer. We can see that hair quality is clearly improved. By varying the injected noise and visualizing the standard deviation (std) over 100 different seeds, we find that detailed information such as hair, parts of the background, and silhouettes are most involved, while global information such as identity and pose is less affected.

2. Related Work

The main drawbacks of GANs are unstable training and mode collapse. Arjovsky and Bottou (2017) theoretically show that injecting noise into the image space can help smooth the distribution so as to stabilize training. The authors of the distribution-filtering GAN (Jenni & Favaro, 2019) then put this idea into practice and prove that this technique does not influence the global optimality of the real data distribution. However, as the authors of (Arjovsky & Bottou, 2017) point out, this method depends on the amount of noise and does not respect the intrinsic geometry of the synthesis and data distributions. Different from these works, our method of noise injection follows that of StyleGAN (Karras et al., 2019a) and is performed on features of different layers. Besides, we provide a theoretical insight explaining the connection between the injected noise and the features.

BigGAN (Brock et al., 2018) splits input latent vectors into one chunk per layer and projects each chunk to the gains and biases of batch normalization in each layer. The authors claim that this design allows direct influence on features at different resolutions and levels of hierarchy. StyleGAN (Karras et al., 2019a) and StyleGAN2 (Karras et al., 2019b) adopt a slightly different view, where noise injection is introduced to enhance randomness for multi-scale stochastic variations. Different from the settings in BigGAN, they inject extra noise, independent of the latent inputs, into different layers of the network without projection. Our theoretical analysis is mainly motivated by the success of noise injection in StyleGAN (Karras et al., 2019a). Our proposed framework reveals that noise injection in StyleGAN is a kind of reparameterization in Euclidean spaces, and we extend it to generic manifolds (Section 4.3).

Figure 2. Illustration of dimensions in the generator of a GAN. We assume that the data $\mathcal{X}$ lie in an underlying low-dimensional manifold $\mathcal{M}^{d_x}$ embedded in the high-dimensional Euclidean space $\mathbb{R}^m$, where $d_x$ is the intrinsic dimension of $\mathcal{M}$ and $m$ is the ambient dimension. Usually, we have $d_x \ll m$ and $n \ll m$.
3. Inherent Drawbacks of GANs

We analyze the inherent drawbacks of traditional GANs in this section. Our argument has three steps. We first prove that the rank of the Jacobian matrix limits the intrinsic dimension of the manifold learned by the generator. Then we show that the rank of the Jacobian matrix monotonically decreases as the network gets deeper. Finally, we prove that the expressive power of the distribution learned by a GAN is limited by its intrinsic dimension. Prior to our argument, we briefly introduce the geometric perspective of generative models.

Given a prior $z$, the generator $G$ of a GAN produces a fake sample $x = G(z)$. The fake sample has the same ambient dimension as the real sample set $\mathcal{X}$ in the Euclidean space $\mathbb{R}^m$. The input prior $z$ is an $n$-dimensional vector, usually sampled from a Gaussian distribution on $\mathbb{R}^n$. Following the convention of manifold learning (Tenenbaum et al., 2000; Roweis & Saul, 2000), we assume that the real data $\mathcal{X}$ lie in an underlying low-dimensional manifold $\mathcal{M}^{d_x}$ embedded in the high-dimensional $\mathbb{R}^m$, where $d_x$ is called the intrinsic dimension of $\mathcal{M}$ and $m$ is the ambient dimension in geometric language. Usually, we have $d_x \ll m$ and $n \ll m$, e.g., $n = 512$ and $m = 1024 \times 1024 \times 3$ in StyleGAN. The generation process and related dimensions are illustrated in Figure 2. The intrinsic dimension $d_x$ is actually ambiguous in most cases, which makes the generation problem complicated. We refer to the dimension of the data manifold as the intrinsic dimension $d_x$ except where noted otherwise. The purpose of a GAN is then to approximate this data manifold $\mathcal{M}^{d_x}$ with the generator-induced manifold $\mathcal{G}^{d_g} = G(Z)$.

3.1. Jacobian Limits the Intrinsic Dimension

We first prove that the intrinsic dimension $d_g$ of the generated distribution can be identified through the Jacobian matrix of the generator.

Definition 1 (Jacobian matrix). Let $f: Z \to \mathbb{R}^m$, where $Z$ is an open subset of $\mathbb{R}^n$. Let
$$\frac{\partial f_i}{\partial z_j}(z) = \lim_{h \to 0} \frac{f_i(z + h e_j) - f_i(z)}{h},$$
where $e_j \in \mathbb{R}^n$ is the vector whose $j$-th component is 1 and all other components are 0, and $f_i$ is the $i$-th component of $f$. Then the Jacobian matrix of $f$ with respect to the variable $z$ is the matrix $J_z f = \big[\frac{\partial f_i}{\partial z_j}(z)\big]_{1 \le i \le m,\ 1 \le j \le n}$.

Lemma 1 (Sard's Theorem (Hirsch, 2012)). Let $f: \mathbb{R}^n \to \mathbb{R}^m$ and let $f(Z) = \mathcal{X}^{d_f}$ be a manifold embedded in $\mathbb{R}^m$, with $Z$ an open subset of $\mathbb{R}^n$. Then for almost every point $x \in \mathcal{X}^{d_f}$, the Jacobian matrix of $f$ has constant rank $\mathrm{rank}(J_z f)$ at the pre-image of $x$ in $Z$, and the intrinsic dimension of $\mathcal{X}^{d_f}$ is $d_f = \mathrm{rank}(J_z f)$.

When $f$ is a linear transformation, i.e., $f(x) = Ax$ with $A \in \mathbb{R}^{m \times n}$, Lemma 1 reduces to the rank theorem of matrices (Strang, 1993): the dimension of the subspace induced by a matrix is equal to the rank of that matrix. Lemma 1 gives a quantitative description of the generated manifold $\mathcal{G}^{d_g}$: its intrinsic dimension equals the rank of $J_z G \in \mathbb{R}^{m \times n}$. Recall that a matrix in $\mathbb{R}^{m \times n}$ has rank at most $\min\{n, m\}$. Thus we have $d_g \le \min\{n, m\}$. In practice, the prior dimension $n$ is usually rather small compared with the high variance of details in real-world data. Taking face images as an example, the hair, freckles, and wrinkles have an extremely high degree of freedom, whose combinations may exceed millions of types, while typical GANs only have latent dimensions around 512. In order to plausibly model the details of images, we need to endow the generator network with more degrees of freedom.
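To make Lemma 1 tangible, the bound $d_g \le \min\{n, m\}$ can be checked numerically on a toy generator by computing the rank of its Jacobian with automatic differentiation. The following is a minimal sketch; the two-layer MLP generator and all dimensions are hypothetical and chosen only for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n, m = 8, 64  # latent (prior) dimension and ambient output dimension
G = nn.Sequential(nn.Linear(n, 32), nn.Tanh(), nn.Linear(32, m))  # toy generator

z = torch.randn(n)  # a single latent code
# Jacobian of G at z, shape (m, n)
J = torch.autograd.functional.jacobian(lambda v: G(v), z)
print(J.shape)                               # torch.Size([64, 8])
print(torch.linalg.matrix_rank(J).item())    # at most min(n, m) = 8, so d_g <= 8 by Lemma 1
```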
3.2. Monotonic Decrease of the Jacobian Rank

Apart from the relatively small prior dimension, another problem comes from the network depth. To capture highly non-linear features of data manifolds, current generators often use very deep coupling modules of convolutional neural networks (CNNs). The following lemma then suggests a monotonic decline in the dimension of the generated manifold as the network gets deeper.

Lemma 2. Let $f = f_1 \circ f_2$. We have $\mathrm{rank}(J f) \le \min\{\mathrm{rank}(J f_1), \mathrm{rank}(J f_2)\}$. Further, let $F_k = f_1 \circ f_2 \circ \cdots \circ f_k$. Then $\mathrm{rank}(J F_s) \le \mathrm{rank}(J F_t)$ if $s \ge t$, and $\mathrm{rank}(J F_s) \le \mathrm{rank}(J f_k)$ for all $k \le s$.

Typical generators are composed of a large number of blocks of multilayer perceptrons (MLPs) or CNNs, which keep reducing the dimension of the feature manifolds during the feedforward procedure. This tendency of the deep generator network to reduce the dimension of the underlying manifold, combined with the relatively low dimension of the input prior and Lemma 1, forces the dimension of the generated manifold below that of the real data manifold. We now look into how this influences the expressive power of GANs.

3.3. Adversarial Dimension Trap

The previous sections have demonstrated that, in practice, there is a very high chance that the generated manifold has an intrinsic dimension lower than that of the data manifold. During training, however, the discriminator, which measures the distance between these two distributions, keeps encouraging the generator to increase the dimension up to that of the true data. This contradictory functionality incurs severe punishment on the smoothness and invertibility of the generative model, which we refer to as the adversarial dimension trap.

Theorem 1. For a deterministic GAN model with generator $G: Z \to \mathcal{X}$, if $\mathrm{rank}(J_z G) < d_x$, then at least one of the following two cases must hold:
1. $\sup_{z \in Z} \|J_z G\| = \infty$;
2. the generator network fails to capture the data distribution and is unable to perform inversion. Namely, for an arbitrary point $x \in \mathcal{X}$, the probability of $G^{-1}(x) = \emptyset$ is 1, and we have the estimate $D_{JS}(P_g, P_r) = \log 2$, where $D_{JS}$ is the Jensen-Shannon divergence, and $P_g$ and $P_r$ are the generated and data distributions, respectively.

Figure 3. Illustration of the skeleton set and representative pair: (a) skeleton of the manifold; (b) representative pair. The blue curve in (a) is the skeleton. In (b), the dashed sphere in $\mathcal{M}$ is the geodesic ball, while the solid sphere in $T_\mu\mathcal{M}$ is its projection onto the tangent space. The normal vector $n$ determines the final affine transformation into the Euclidean space.

The above theorem holds for a wide range of GAN loss functions, including the Wasserstein divergence, the Jensen-Shannon divergence, and other KL-divergence based losses. Notice that this theorem implies a much worse situation than it states. For any open sphere $B$ in the data manifold $\mathcal{X}$, the generator restricted to the pre-image of $B$ also obeys this theorem, which suggests bad properties of nearly every local neighborhood. It also suggests that both consequences of Theorem 1 may occur simultaneously: in some subsets the generator may successfully capture the data distribution, while in others it may fail to do so.

The theorem describes the relationship between $\mathrm{rank}(J_z G)$ and the expressive power of GANs. It means that a generator with a very small Jacobian rank may not be able to model complicated manifolds. We will show how noise injection addresses this issue.
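The layer-wise decline described by Lemma 2 is also easy to observe numerically: once any intermediate block has a low-rank Jacobian, no later block can recover the lost dimensions. Below is a minimal sketch; the toy network, its bottleneck width, and all dimensions are hypothetical:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
blocks = nn.ModuleList([
    nn.Sequential(nn.Linear(16, 16), nn.Tanh()),
    nn.Sequential(nn.Linear(16, 4), nn.Tanh()),   # rank bottleneck: Jacobian rank at most 4
    nn.Sequential(nn.Linear(4, 16), nn.Tanh()),
    nn.Sequential(nn.Linear(16, 64), nn.Tanh()),
])

z = torch.randn(16)
for k in range(1, len(blocks) + 1):
    F_k = lambda v, k=k: nn.Sequential(*blocks[:k])(v)   # composition of the first k blocks
    J = torch.autograd.functional.jacobian(F_k, z)
    print(k, torch.linalg.matrix_rank(J).item())
# rank(J F_k) never increases with k and never exceeds 4 after the bottleneck, as in Lemma 2
```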
The reader may note that our notion of expressive power here is a bit different from that in classification tasks. For example, in a binary classification task, the intrinsic dimension of the output space is the same as that of its ambient space, and expressive power is more about modeling the highly non-linear structure of the classes. In this paper, as our target is a data manifold with unknown intrinsic dimension, expressive power focuses on capturing all of its intrinsic dimensions, which correspond to certain semantic features of images.

4. Riemannian Geometry of Noise Injection

The generator $G$ in a traditional GAN is a composite of sequential non-linear feature mappings, which can be denoted as $G(z) = f_k \circ f_{k-1} \circ \cdots \circ f_1(z)$, where $z \sim N(0, 1)$ is standard Gaussian. Each feature mapping is typically a single-layer CNN combined with non-linear operations such as normalization, pooling, and activation. The whole network is then a deterministic mapping from the latent space $Z$ to the image space $\mathcal{X}$. The common noise injection is the linear perturbation
$$f_k \mapsto f_k + a\epsilon, \quad \epsilon \sim N(0, 1), \quad (4)$$
where $a$ is a learnable scalar parameter and the noise $\epsilon$ is randomly sampled from the Gaussian $N(0, 1)$. This simple technique significantly improves the performance of GANs, especially the fidelity and realism of generated images, as displayed in Figure 1. In order to establish a solid geometric framework, we propose a general formulation by replacing $f_k(x)$ with
$$g_k(x) = \mu_k(x) + \sigma_k(x)\epsilon, \quad x \in g_{k-1} \circ \cdots \circ g_1(Z), \quad (5)$$
where $\mu_k(x)$ and $\sigma_k(x)$ are both learnable operators on the layer input $x$. It is straightforward to see that noise injection in (5), which is a type of deep noise injection applied to the feature maps of each layer, is essentially different from the reparameterization trick used in VAEs (Kingma & Welling, 2013), which is applied only once in the latent space. In what follows, we call (5) Riemannian Noise Injection (RNI), as our theory is established with Riemannian geometry. It is worth emphasizing that RNI in (5) can be viewed as a fuzzy equivalence relation on the original features, and it uses reparameterization to model the low-dimensional feature manifolds. We present this content in the supplementary material for interested readers.
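As a concrete reference point, the two forms (4) and (5) can be written as small feature-map modules. The sketch below is only illustrative and is not the authors' released implementation; the module names and the choice of convolutional $\mu_k$ and $\sigma_k$ in the generalized variant are our own assumptions:

```python
import torch
import torch.nn as nn

class EuclideanNoiseInjection(nn.Module):
    """Eq. (4): f_k -> f_k + a * eps, with a single learnable scalar a."""
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.zeros(1))

    def forward(self, f):                      # f: (N, C, H, W) feature map
        eps = torch.randn_like(f)              # eps ~ N(0, 1), resampled at every call
        return f + self.a * eps

class GeneralizedNoiseInjection(nn.Module):
    """Eq. (5): g_k(x) = mu_k(x) + sigma_k(x) * eps, with learnable mu_k and sigma_k.
    Here both are modeled as 3x3 convolutions of the layer input x (one possible choice)."""
    def __init__(self, channels):
        super().__init__()
        self.mu = nn.Conv2d(channels, channels, 3, padding=1)
        self.sigma = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):                      # x: (N, C, H, W) layer input
        eps = torch.randn_like(x)
        return self.mu(x) + self.sigma(x) * eps
```

Dropping such a module into an existing generator block amounts to replacing $f_k$ with $g_k$ as in Eq. (5).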
4.1. Handling the Adversarial Dimension Trap with Noise Injection

As Sard's theorem tells us (Petersen et al., 2006), the key to resolving the adversarial dimension trap is to avoid mapping low-dimensional feature spaces into feature manifolds with higher intrinsic dimensions. However, we cannot control the intrinsic dimension of the data manifold, and in each intermediate feature space of the network we also face the dimension-drop problem described in Lemma 2. So the solution could be that, instead of learning mappings onto the full feature spaces, we map only onto a skeleton of each feature space and use random noise to fill up the remaining space. For a compact manifold, the intrinsic dimension of the skeleton set can be made arbitrarily low by applying the Heine-Borel theorem to the skeleton (Rudin et al., 1964). In this way, the model can escape from the adversarial dimension trap.

Now we formulate the idea in detail. The whole idea is based on approximating the manifold by a tangent polyhedron. Assume that the feature space $\mathcal{M}$ is a Riemannian manifold embedded in $\mathbb{R}^m$. Then for any point $\mu \in \mathcal{M}$, the local geometry induces a coordinate transformation from a small neighborhood of $\mu$ in $\mathcal{M}$ to its projection onto the tangent space $T_\mu\mathcal{M}$ at $\mu$, by the following theorem.

Theorem 2. Given a Riemannian manifold $\mathcal{M}$ embedded in $\mathbb{R}^m$, for any point $\mu \in \mathcal{M}$, let $T_\mu\mathcal{M}$ denote the tangent space at $\mu$. Then the exponential map $\mathrm{Exp}_\mu$ induces a smooth diffeomorphism from a Euclidean ball $B_{T_\mu\mathcal{M}}(0, r)$ centered at 0 to a geodesic ball $B_{\mathcal{M}}(\mu, r)$ centered at $\mu$ in $\mathcal{M}$. Thus $\{\mathrm{Exp}_\mu^{-1}, B_{\mathcal{M}}(\mu, r), B_{T_\mu\mathcal{M}}(0, r)\}$ forms a local coordinate system of $\mathcal{M}$ in $B_{\mathcal{M}}(\mu, r)$, which we call the normal coordinates.

Thus we have
$$B_{\mathcal{M}}(\mu, r) = \mathrm{Exp}_\mu(B_{T_\mu\mathcal{M}}(0, r)) \quad (6)$$
$$= \{\tau : \tau = \mathrm{Exp}_\mu(v),\ v \in B_{T_\mu\mathcal{M}}(0, r)\}. \quad (7)$$

For each local geodesic neighborhood $B_{\mathcal{M}}(\mu, r)$ of a point $\mu$ in the feature manifold $\mathcal{M}$, we can model it by its tangent space in the ambient Euclidean space, with error no more than $o(r)$, as follows.

Theorem 3. The differential of $\mathrm{Exp}_\mu$ at the origin of $T_\mu\mathcal{M}$ is the identity $I$. Thus $\mathrm{Exp}_\mu$ can be approximated by
$$\mathrm{Exp}_\mu(v) = \mu + Iv + o(\|v\|_2). \quad (8)$$

Thus, if $r$ in equation (6) is small enough, we can approximate $B_{\mathcal{M}}(\mu, r)$ by
$$B_{\mathcal{M}}(\mu, r) \approx \mu + I B_{T_\mu\mathcal{M}}(0, r) \quad (9)$$
$$= \{\tau : \tau = \mu + Iv,\ v \in B_{T_\mu\mathcal{M}}(0, r)\}. \quad (10)$$

Considering that $T_\mu\mathcal{M}$ is an affine subspace of $\mathbb{R}^m$, the coordinates on $B_{T_\mu\mathcal{M}}(0, r)$ admit an affine transformation into the coordinates of $\mathbb{R}^m$. Thus equation (9) can be written as
$$B_{\mathcal{M}}(\mu, r) \approx \mu + I B_{T_\mu\mathcal{M}}(0, r) \quad (11)$$
$$= \{\tau : \tau = \mu + r\, T(\mu)\epsilon,\ \epsilon \in B(0, 1)\}. \quad (12)$$

We remind the reader that the linear component matrix $T(\mu)$ differs at different $\mu \in \mathcal{M}$ and is decided by the local geometry near $\mu$. In the above formula, $\mu$ defines the center point and $r\,T(\mu)$ defines the shape of the approximated neighborhood, so we call them a representative pair of $B_{\mathcal{M}}(\mu, r)$. Picking a series of such representative pairs, which we refer to as the skeleton set, we can construct a tangent polyhedron $\mathcal{H}$ of $\mathcal{M}$. Thus, instead of trying to learn the feature manifold directly, we adopt a two-stage procedure. We first learn a map $f: x \mapsto [\mu(x), \sigma(x)]$ (with $\sigma(x) \approx r\,T(\mu(x))$) onto the skeleton set, and then use noise injection $g: x \mapsto \mu(x) + \sigma(x)\epsilon$, $\epsilon \sim U(0, 1)$ (uniform distribution), to fill up the flesh of the skeleton, as shown in Figure 3.

However, real-world data often include fuzzy semantics, and even long-range features can share some structural relations. It is unwise to model them with non-smooth constructs such as locally bounded spheres and uniform distributions. We therefore borrow ideas from fuzzy topology (Ling & Bo, 2003; Murali, 1989; Recasens, 2010), which is designed to address this issue. It is well known that for any distance metric $d(\cdot, \cdot)$, $e^{-d(\mu, \cdot)}$ admits a fuzzy equivalence relation for points near $\mu$, which is similar to the density of a Gaussian. The fuzzy equivalence relation can be viewed as a suitable smooth alternative to the sphere neighborhood $B_{\mathcal{M}}(\mu, r)$. Thus we replace the uniform distribution with an unclipped Gaussian [2]. Under this setting, the first-stage map in fact learns a fuzzy equivalence relation, while the second stage is a reparameterization technique. Notice that the skeleton set can have arbitrarily low dimension, as we only need finitely many skeleton points to reconstruct the full manifold, and capturing finitely many points is easy for functions with Jacobians of any rank. Thus the first-stage map can be smooth, well conditioned, and expressive in modeling the target manifold.

Theorem 4. If the manifold $\mathcal{M}$ is compact, then there exist finitely many points $\mu_1, \ldots, \mu_k \in \mathcal{M}$ such that the skeleton set $S = \{\mu_1, \ldots, \mu_k\}$, with representative pairs and radius $r$ defined in Theorems 2 & 3, approximates $\mathcal{M}$ with local error no more than $o(r)$.
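Before moving on, a standard closed-form instance of the exponential map of Theorems 2 and 3 may help fix intuition. This worked example (the unit sphere) is ours and is not taken from the paper:

```latex
% Exponential map on the unit sphere S^{m-1} \subset \mathbb{R}^m.
% For \mu \in S^{m-1} and a tangent vector v with \langle \mu, v \rangle = 0:
\mathrm{Exp}_\mu(v) = \cos(\|v\|_2)\,\mu + \sin(\|v\|_2)\,\frac{v}{\|v\|_2}.
% Taylor expansion around v = 0:
\mathrm{Exp}_\mu(v)
  = \mu + v - \tfrac{1}{2}\|v\|_2^2\,\mu - \tfrac{1}{6}\|v\|_2^2\,v + \cdots
  = \mu + Iv + o(\|v\|_2),
% so within a geodesic ball of radius r the flat tangent-space approximation
% \mathrm{Exp}_\mu(v) \approx \mu + v is accurate up to o(r), as used in Eqs. (9)-(12).
```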
Remark 1. Theorem 4 characterizes the expressive power of noise injection. Combined with Theorem 2, it shows that for any compact manifold embedded in $\mathbb{R}^m$, the generator with noise injection can approximate it with error no more than $o(r)$, where $r$ is the radius of the geodesic ball defined in Eq. (11), regardless of the relation between $J_z G$ and $d_x$.

For the second stage, we can show that it possesses a smoothness property in expectation, by the following theorem.

Theorem 5. Given $f: x \mapsto [\mu(x), \sigma(x)]^T$ with $f$ locally Lipschitz and $\sigma = o(1)$, define $g(x) = \mu(x) + \sigma(x)\epsilon$, $\epsilon \sim N(0, 1)$ (standard Gaussian). Then for any bounded set $U$, there exists $L > 0$ such that
$$\mathbb{E}[\|g(x) - g(y)\|_2] \le L \|x - y\|_2 + o(1), \quad x, y \in U.$$
Namely, the principal component of $g$ is locally Lipschitz in expectation. Specifically, if the domain of $f$ is bounded, then the principal component of $g$ is globally Lipschitz in expectation.

4.2. Properties of Noise Injection

As we have discussed, traditional GANs face two challenges: the relatively low-dimensional latent space compared with the complicated details of real images, and the tendency of the feedforward procedure to drop dimensions. Both of these challenges lead to the adversarial dimension trap of Theorem 1. The adversarial dimension trap implies an unstable training procedure because of the gradient explosion that may occur in the generator. With noise injection in the generator network, however, we can theoretically overcome such problems if the representative pairs are constructed properly to capture the local geometry. In this case, our model does not need to fit an image manifold whose intrinsic dimension is higher than the network architecture can handle. Thus the training procedure no longer encourages a non-smooth generator and can proceed more stably. Also, the extra noise can compensate for the loss of information caused by compression so as to capture high-variance details, which has been discussed and illustrated in StyleGAN (Karras et al., 2019a). We evaluate the performance of our method from these aspects in Section 5.

[2] A detailed analysis of why an unclipped Gaussian should be applied is offered in the supplementary material.

4.3. Geometric Realization of µ(x) and σ(x)

As $\mu$ stands for a particular point in the feature space, we simply model it by a traditional deep CNN architecture. $\sigma(x)$ is designed to fit the local geometry at $\mu(x)$. According to our theory, the local geometry should only admit minor deviations from $\mu(x)$. Thus we believe that $\sigma(x)$ should be determined by the spatial and semantic information contained in $\mu(x)$, and should characterize the local variations of that spatial and semantic information. The deviation of the pixel-wise sum along channels of the feature maps in StyleGAN2 highlights semantic variations such as hair, parts of the background, and silhouettes, as the standard deviation map over sampled instances in Figure 1 shows. This observation suggests that the sum along channels identifies the local semantics we expect to reveal, so it should be directly connected to the $\sigma(x)$ we pursue here. For a given feature map $\mu = \mathrm{DCNN}(x)$ from the deep CNN, which is a specific point in the feature manifold, the sum along its channels is
$$\bar{\mu}_{jk} = \sum_{i=1}^{c} \mu_{ijk}, \quad (13)$$
where $i$ enumerates the $c$ feature maps of $\mu$, while $j, k$ enumerate the spatial indices of $\mu$ over its $h$ rows and $w$ columns, respectively. The resulting $\bar{\mu}$ is then a spatial semantic identifier, whose variation corresponds to the local semantic variation.
We then normalize $\bar{\mu}$ to obtain a spatial semantic coefficient matrix $s$:
$$\mathrm{mean}(\bar{\mu}) = \frac{1}{hw}\sum_{j=1}^{h}\sum_{k=1}^{w}\bar{\mu}_{jk}, \quad \bar{s} = \bar{\mu} - \mathrm{mean}(\bar{\mu}), \quad \max(|\bar{s}|) = \max_{1 \le j \le h,\, 1 \le k \le w} |\bar{s}_{jk}|, \quad s = \frac{\bar{s}}{\max(|\bar{s}|)}. \quad (14)$$

Recall that the standard deviation of $s$ over sampled instances highlights the local variance in semantics. Thus $s$ can be decomposed into two independent components: $s_m$, which corresponds to the main content of the output image and is almost invariant under changes of the injected noise; and $s_v$, which is associated with the variance induced by the injected noise and is nearly orthogonal to the main content. We assume that this decomposition can be attained by an affine transformation on $s$ such that
$$s_d = A \odot s + b = s_m + s_v, \quad s_v \odot \mu \approx 0, \quad (15)$$
where $\odot$ denotes element-wise matrix multiplication and $0$ denotes the zero matrix. To avoid numerical instability, we add the all-one matrix $\mathbf{1}$ to the above decomposition so that its condition number does not explode, i.e.,
$$\tilde{s} = \alpha s_d + (1 - \alpha)\mathbf{1}, \quad \sigma = \tilde{s}. \quad (16)$$
The regularized $s_m$ component is then used to enhance the main content in $\mu$, and the regularized $s_v$ component is used to guide the variance of the injected noise. The final output $o$ is calculated as
$$o = r\,\sigma \odot \mu + r\,\sigma \odot \epsilon, \quad \epsilon \sim N(0, 1). \quad (17)$$
In the above procedure, $A$, $b$, $r$, and $\alpha$ are learnable parameters. Note that in the last equation we do not need to decompose $s$ into $s_v$ and $s_m$: $s_v$ is designed to be nearly orthogonal to $\mu$ and $s_m$ is nearly invariant, so $\sigma \odot \mu$ automatically drops the $s_v$ component, and $\sigma \odot \epsilon$ amounts to adding an invariant bias to the variance of the injected noise.

There are alternative forms of $\mu$ and $\sigma$ for various GAN architectures. However, modeling $\mu$ by deep CNNs and deriving $\sigma$ from the spatial and semantic information of $\mu$ are universal for GANs, as they comply with our theorems. We further conduct an ablation study to verify the effectiveness of the above procedure; the related results can be found in the supplementary material.

Using our formulation, noise injection in StyleGAN2 can be written as
$$\mu = \mathrm{DCNN}(x), \quad o = \mu + r\,\epsilon, \quad \epsilon \sim N(0, 1), \quad (18)$$
where $r$ is a learnable scalar parameter. This can be viewed as a special case of our method, where $T(\mu)$ in (11) is set to the identity mapping. Under this setting, the local geometry is assumed to be identical everywhere on the feature manifold, which suggests a globally Euclidean structure. Our theory supports this simplification and specialization, but our choice of $\mu(x)$ and $\sigma(x)$ suits broader and more common scenarios in which the feature manifolds are non-Euclidean. We denote this simplified noise injection as Euclidean Noise Injection (ENI) and extensively compare it with our choice in the following section.
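To fix ideas, the procedure of Eqs. (13)-(17) can be sketched as a single module. This is our own reading of the equations, not the authors' released code; in particular, the shapes chosen for the learnable $A$ and $b$, the small constant added for numerical safety, and the broadcasting of $\sigma$ over channels are assumptions:

```python
import torch
import torch.nn as nn

class RiemannianNoiseInjection(nn.Module):
    """Sketch of Eqs. (13)-(17): derive sigma from the channel-sum of mu,
    then output o = r * sigma * mu + r * sigma * eps."""
    def __init__(self, h, w):
        super().__init__()
        self.A = nn.Parameter(torch.ones(1, 1, h, w))    # assumed per-pixel affine weights
        self.b = nn.Parameter(torch.zeros(1, 1, h, w))
        self.alpha = nn.Parameter(torch.zeros(1))
        self.r = nn.Parameter(torch.zeros(1))

    def forward(self, mu):                               # mu = DCNN(x): (N, C, h, w)
        mu_bar = mu.sum(dim=1, keepdim=True)             # Eq. (13): sum along channels
        s_bar = mu_bar - mu_bar.mean(dim=(2, 3), keepdim=True)               # Eq. (14)
        s = s_bar / (s_bar.abs().amax(dim=(2, 3), keepdim=True) + 1e-8)      # 1e-8 for safety
        s_d = self.A * s + self.b                        # Eq. (15): learned affine transform
        sigma = self.alpha * s_d + (1 - self.alpha)      # Eq. (16): blend with all-one matrix
        eps = torch.randn_like(mu)                       # fresh noise at every forward pass
        return self.r * sigma * mu + self.r * sigma * eps  # Eq. (17)
```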
5. Experiments

We conduct experiments on benchmark datasets including FFHQ faces, LSUN objects, and CIFAR-10. The GAN models we use are the baseline DCGAN (Radford et al., 2015), originally without noise injection, and the state-of-the-art StyleGAN2 (Karras et al., 2019b), originally with Euclidean noise injection.

Table 1. Comparison of different generator architectures.

| GAN arch | FFHQ PPL (↓) | FFHQ FID (↓) | LSUN-Church PPL (↓) | LSUN-Church FID (↓) |
|---|---|---|---|---|
| DCGAN | 2.97 | 45.29 | 33.30 | 51.18 |
| DCGAN + ENI | 3.14 | 44.22 | 22.97 | 54.01 |
| DCGAN + RNI (Ours) | 2.83 | 40.06 | 22.53 | 46.31 |
| Plain StyleGAN2 | 28.44 | 6.87 | 425.7 | 6.44 |
| StyleGAN2 + ENI | 16.20 | 7.29 | 123.6 | 6.80 |
| StyleGAN2-No Path Reg + RNI (Ours) | 16.02 | 7.14 | 178.9 | 5.75 |
| StyleGAN2 + RNI (Ours) | 13.05 | 7.31 | 119.5 | 6.86 |

For StyleGAN2, we use images of resolution 128 x 128 and config-e of the original paper, because config-e achieves the best Perceptual Path Length (PPL) score. Otherwise, we apply the experimental settings from StyleGAN2. The noise injection presented in Section 4.3 is called Riemannian Noise Injection (RNI), while the simple form used in StyleGAN is called Euclidean Noise Injection (ENI).

Image synthesis. PPL (Zhang et al., 2018) has proven to be an effective metric for measuring the structural consistency of generated images (Karras et al., 2019b). Considering its similarity to the expectation of the Lipschitz constant of the generator, it can also be viewed as a quantification of the smoothness of the generator. The path length regularizer was proposed in StyleGAN2 to improve generated image quality by explicitly regularizing the Jacobian of the generator with respect to the intermediate latent space. We first compare the noise injection methods with the plain StyleGAN2, in which both the Euclidean noise injection and the path length regularizer of StyleGAN2 are removed. As shown in Table 1, all types of noise injection significantly improve the PPL scores. It is worth noting that our method without the path length regularizer achieves performance comparable to the standard StyleGAN2 on the FFHQ dataset, and the performance can be further improved when combined with the path length regularizer. Considering the extra GPU memory consumed by the path length regularizer during training, we believe that our method offers a computation-friendly alternative to StyleGAN2, as we observe smaller GPU memory occupation for our method in all the experiments. For the LSUN-Church dataset, we observe an obvious improvement in FID scores compared with StyleGAN2. We believe this is because the LSUN-Church data are scene images containing various semantics of multiple objects, which are hard to fit for the original StyleGAN2, which is more suitable for single-object synthesis. Our RNI architecture thus offers the generator more degrees of freedom to fit the true distribution of the dataset. In all cases, our method is superior to StyleGAN2 in both PPL and FID scores, which shows that our noise injection method is more powerful than the one used in StyleGAN2.

For DCGAN, as it does not possess an intermediate latent space, we cannot equip it with the path length regularizer, so we only compare Euclidean noise injection with our RNI method. In all cases, our method achieves the best performance in PPL and FID scores.

We also study whether our choice of µ(x) and σ(x) applies to broader occasions. We conduct further experiments on a cat dataset consisting of 100 thousand images selected from 800 thousand LSUN-Cat images by a ranking algorithm (Zhou et al., 2004). For DCGAN, we conduct extra experiments on CIFAR-10 to test whether our method succeeds in multi-class image synthesis. The results are reported in Figure 5. Our method still outperforms the compared methods in PPL scores, and the FID scores are comparable, indicating that the proposed noise injection is more favorable for preserving the structural consistency of generated images with real ones.
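Since PPL is the central metric here, a simplified sketch of how such a score can be estimated may be useful. This is not the official StyleGAN2 evaluation code: it interpolates linearly in z-space, omits the outlier filtering and face cropping of the original protocol, and assumes a generator G mapping latents to images in [-1, 1] together with the third-party lpips package for the perceptual distance:

```python
import torch
import lpips  # pip install lpips; LPIPS perceptual distance (Zhang et al., 2018)

def ppl_estimate(G, n_latent, n_samples=1000, eps=1e-4, device="cpu"):
    """Rough perceptual-path-length estimate for a generator G(z) -> (1, 3, H, W) in [-1, 1]."""
    dist_fn = lpips.LPIPS(net="vgg").to(device)
    scores = []
    with torch.no_grad():
        for _ in range(n_samples):
            z0, z1 = torch.randn(2, 1, n_latent, device=device)
            t = torch.rand((), device=device)
            za = torch.lerp(z0, z1, t)          # point on the latent path
            zb = torch.lerp(z0, z1, t + eps)    # slightly advanced point on the same path
            d = dist_fn(G(za), G(zb)) / (eps ** 2)
            scores.append(d.item())
    return sum(scores) / len(scores)
```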
Numerical stability. As we have analyzed above, noise injection should improve the numerical stability of GANs. To evaluate this, we examine the condition numbers of different GAN architectures. The condition number of a given function $f$ is defined as (Horn & Johnson, 2013)
$$C(f) = \limsup_{\Delta x \to 0} \frac{\|f(x) - f(x + \Delta x)\| / \|f(x)\|}{\|\Delta x\| / \|x\|}. \quad (19)$$
It measures how sensitive a function is to changes or errors in its input; a function with a high condition number is said to be ill-conditioned. Considering the numerical infeasibility of the sup operator in this definition, we resort to the following alternative approach. We first sample a batch of 50,000 (Input, Perturbation) pairs from the input distribution and the perturbation distribution $\Delta x \sim N(0, 10^{-4})$, and then compute the corresponding condition numbers. We take the mean of these 50,000 condition numbers as the Mean Condition (MC) and the mean of the largest 1,000 values as the Top Thousand Mean Condition (TTMC) to evaluate the conditioning of GAN models. We report the results in Table 2, where we find that noise injection significantly improves the conditioning of GAN models, and our proposed method dominates the performance.

Figure 4. Synthesized images of different StyleGAN2-based models (Plain StyleGAN2, StyleGAN2 + ENI, StyleGAN2-No Path Reg + RNI, StyleGAN2 + RNI) on FFHQ and LSUN-Church.

Table 2. Condition numbers for different GAN architectures. Lower condition metrics suggest better network stability and invertibility.

| GAN arch | FFHQ MC (↓) | FFHQ TTMC (↓) | LSUN-Church MC (↓) | LSUN-Church TTMC (↓) |
|---|---|---|---|---|
| Plain StyleGAN2 | 0.943 | 2.81 | 2.31 | 6.31 |
| StyleGAN2 + ENI | 0.666 | 1.27 | 0.883 | 1.75 |
| StyleGAN2-No Path Reg + RNI (Ours) | 0.766 | 2.39 | 1.71 | 4.74 |
| StyleGAN2 + RNI (Ours) | 0.530 | 1.05 | 0.773 | 1.51 |
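For reference, the Monte Carlo estimation of MC and TTMC described above can be sketched as follows. This is our own reading of the procedure; the batching, the interpretation of N(0, 1e-4) as the perturbation variance, and the flattening of images into vectors are assumptions:

```python
import torch

def condition_metrics(G, n_latent, n_pairs=50000, batch=100, device="cpu"):
    """Empirical condition numbers of a generator G, following Eq. (19):
    C = (||G(z) - G(z + dz)|| / ||G(z)||) / (||dz|| / ||z||)."""
    noise_std = 1e-2  # N(0, 1e-4) read as variance 1e-4, i.e. standard deviation 1e-2
    values = []
    with torch.no_grad():
        for _ in range(n_pairs // batch):
            z = torch.randn(batch, n_latent, device=device)
            dz = noise_std * torch.randn(batch, n_latent, device=device)
            gz = G(z).flatten(1)
            gzp = G(z + dz).flatten(1)
            num = (gz - gzp).norm(dim=1) / gz.norm(dim=1)
            den = dz.norm(dim=1) / z.norm(dim=1)
            values.append(num / den)
    values = torch.cat(values)
    mc = values.mean().item()                        # Mean Condition (MC)
    ttmc = values.topk(1000).values.mean().item()    # Top Thousand Mean Condition (TTMC)
    return mc, ttmc
```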
GAN inversion. StyleGAN2 makes use of a latent style space that enables controllable image modifications. This characteristic motivates us to study the image embedding capability of our method via GAN inversion algorithms (Abdal et al., 2019), as it may help further leverage the potential of GAN models. In the experiments, we find that the StyleGAN2 model tends to work well for frontal, unoccluded human face images. For this type of image, we observe comparable performance across all the GAN architectures. We believe this is because such images are close to the mean face of the FFHQ dataset (Karras et al., 2019a) and are thus easy for StyleGAN-based models to learn. For faces with large pose or partial occlusion, the abilities of the compared models differ significantly: noise injection methods outperform the plain StyleGAN2 by a large margin, and our method achieves the best performance. The detailed implementation and results are reported in the supplementary material.

Figure 5. Image synthesis on CIFAR-10 and LSUN cats. CIFAR-10: DCGAN (PPL=101.4, FID=83.8, IS=4.46), DCGAN + ENI (PPL=77.9, FID=84.8, IS=4.73), DCGAN + RNI (PPL=69.9, FID=83.2, IS=4.64). Cat-Selected: StyleGAN2 + ENI (PPL=115, MC=0.725, TTMC=1.54, FID=12.7), StyleGAN2 + RNI (PPL=106, MC=0.686, TTMC=1.45, FID=13.4).

6. Conclusion

In this paper, we propose a theoretical framework to explain the effect of the noise injection technique in GANs. We prove that the generator can easily encounter difficulties with non-smoothness or expressiveness, and that noise injection is an effective approach to addressing this issue. Based on our theoretical framework, we also derive a more proper formulation for noise injection. We conduct experiments on various datasets to confirm its validity. In future work, we will further investigate universal realizations of noise injection for diverse GAN architectures, and attempt to find more powerful ways to characterize the local geometry of feature spaces.

Acknowledgement

This work is supported by the National Key R&D Program of China under Grant 2020AAA0105702, the National Natural Science Foundation of China (NSFC) under Grant U19B2038, and the University Synergy Innovation Program of Anhui Province under Grant GXXT-2019-025.

References

Abdal, R., Qin, Y., and Wonka, P. Image2StyleGAN: How to embed images into the StyleGAN latent space? In Proceedings of the IEEE International Conference on Computer Vision, pp. 4432-4441, 2019.

An, G. The effects of adding noise during backpropagation training on a generalization performance. Neural Computation, 8(3):643-674, 1996.

Arjovsky, M. and Bottou, L. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.

Baskin, C., Liss, N., Chai, Y., Zheltonozhskii, E., Schwartz, E., Giryes, R., Mendelson, A., and Bronstein, A. M. NICE: Noise injection and clamping estimation for neural network quantization. arXiv preprint arXiv:1810.00162, 2018.

Bishop, C. M. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108-116, 1995.

Brock, A., Donahue, J., and Simonyan, K. Large-scale GAN training for high fidelity natural image synthesis, 2018.

Child, R. Very deep VAEs generalize autoregressive models and can outperform them on images. arXiv preprint arXiv:2011.10650, 2020.

Chu, X., Zhang, B., and Li, X. Noisy differentiable architecture search. arXiv preprint arXiv:2005.03566, 2020.

Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial networks, 2014a.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014b.

He, Z., Rakin, A. S., and Fan, D. Parametric noise injection: Trainable randomness to improve deep neural network robustness against adversarial attack. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 588-597, 2019.

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

Hirsch, M. W. Differential Topology, volume 33. Springer Science & Business Media, 2012.

Horn, R. A. and Johnson, C. R. Matrix Analysis. Cambridge University Press, 2013.

Jenni, S. and Favaro, P. On stabilizing generative adversarial training with noise. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12145-12153, 2019.

Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401-4410, 2019a.

Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. Analyzing and improving the image quality of StyleGAN. arXiv preprint arXiv:1912.04958, 2019b.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Ling, Z. and Bo, Z. Theory of Fuzzy Quotient Space (Methods of Fuzzy Granular Computing). 2003.

Liu, H., Simonyan, K., and Yang, Y. DARTS: Differentiable architecture search.
In International Conference on Learning Representations (ICLR), 2019.

Murali, V. Fuzzy equivalence relations. Fuzzy Sets and Systems, 30(2):155-163, 1989.

Noh, H., You, T., Mun, J., and Han, B. Regularizing deep neural networks by noise: Its interpretation and optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

Petersen, P., Axler, S., and Ribet, K. Riemannian Geometry, volume 171. Springer, 2006.

Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Recasens, J. Indistinguishability Operators: Modelling Fuzzy Equalities and Fuzzy Equivalence Relations, volume 260. Springer Science & Business Media, 2010.

Roweis, S. T. and Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, 2000.

Rudin, W. et al. Principles of Mathematical Analysis, volume 3. McGraw-Hill, New York, 1964.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958, 2014.

Strang, G. Introduction to Linear Algebra, volume 3. Wellesley-Cambridge Press, Wellesley, MA, 1993.

Tenenbaum, J. B., de Silva, V., and Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, 2000.

Vahdat, A. and Kautz, J. NVAE: A deep hierarchical variational autoencoder. arXiv preprint arXiv:2007.03898, 2020.

Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586-595, 2018.

Zhou, D., Weston, J., Gretton, A., Bousquet, O., and Schölkopf, B. Ranking on data manifolds. In Advances in Neural Information Processing Systems, pp. 169-176, 2004.