# Uncertainty Principles of Encoding GANs

Ruili Feng 1, Zhouchen Lin 2 3, Jiapeng Zhu 4, Deli Zhao 5, Jinren Zhou 5, Zheng-Jun Zha 1

Abstract

The compelling synthesis results of Generative Adversarial Networks (GANs) demonstrate rich semantic knowledge in their latent codes. To obtain this knowledge for downstream applications, encoding GANs has been proposed to learn encoders, such that real-world data can be encoded to latent codes, which can then be fed to generators to reconstruct those data. However, despite the theoretical guarantees of precise reconstruction in previous works, current algorithms generally reconstruct inputs with non-negligible deviations. In this paper we study this predicament of encoding GANs, which is indispensable research for the GAN community. We prove three uncertainty principles of encoding GANs in practice: a) the perfect encoder and generator cannot be continuous at the same time, which implies that the current framework of encoding GANs is ill-posed and needs rethinking; b) neural networks cannot approximate the underlying encoder and generator precisely at the same time, which explains why we cannot get perfect encoders and generators as promised in previous theories; c) neural networks cannot be stable and accurate at the same time, which demonstrates the difficulty of training and the trade-off between fidelity and disentanglement encountered in previous works. Our work may eliminate gaps between previous theories and empirical results, promote the understanding of GANs, and guide network designs for follow-up works.

Corresponding author, co-corresponding author. 1 University of Science and Technology of China, Hefei, China. 2 Key Lab. of Machine Perception (MoE), School of EECS, Peking University, Beijing, China. 3 Pazhou Lab, Guangzhou, China. 4 Hong Kong University of Science and Technology, Hong Kong, China. 5 Alibaba Group. Correspondence to: Ruili Feng, Zhouchen Lin, Zheng-Jun Zha. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

1. Introduction

Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) are powerful unsupervised models for establishing maps from simple latent distributions to arbitrarily complex data distributions in various real-world scenarios like computer vision (Liang et al., 2017; Zhang et al., 2019; Karras et al., 2019; 2020; Zheng et al., 2020; Liu et al., 2019; Zha et al., 2020), natural language processing (Zhang et al., 2017; Xu et al., 2018; Liu et al., 2018), medicine (Yi & Babyn, 2018; Wolterink et al., 2017; Yi et al., 2018; Frid-Adar et al., 2018), and chemistry (De Cao & Kipf, 2018). Their impressive synthesis performance has aroused a surge of interest in encoding data into latent spaces of GANs for representation learning (Donahue & Simonyan, 2019; Ma et al., 2019; Asim et al., 2020), image editing (Bau et al., 2020; Richardson et al., 2020; Shen & Zhou, 2020; Abdal et al., 2020b), and other downstream tasks (Lin et al., 2019; Rosca et al., 2018; Lewis et al., 2021).

The current framework of encoding GAN research can be abstracted as follows. Let Z and X be the latent space and the data manifold, and P_Z and P_X be the latent distribution and the data distribution on Z and X, respectively.
Encoding GANs introduces a bijection between the latent and the data: an underlying perfect generator g transports the latent distribution into the data one,

$$ P_{g(Z)}(A) = \int_{g^{-1}(A)} \mathrm{d}P_Z = P_X(A), \quad \forall A \in \mathcal{F}_X, \qquad (1) $$

where F_X is the collection of measurable sets in X; and an underlying perfect encoder e inverts the generator,

$$ e \circ g(z) = z, \quad g \circ e(x) = x, \quad \forall z \in Z,\; x \in X. \qquad (2) $$

The training algorithms then aim at approximating the underlying perfect encoder and generator with parameterized neural networks E_θ and G_φ, respectively.

The above framework of encoding GANs supports both encoding a pre-trained GAN and learning a GAN equipped with an encoder in an end-to-end manner, which divides current encoding GAN algorithms into two training methodologies: 1) concurrent training, i.e. training the encoder and generator concurrently as in ALI (Dumoulin et al., 2017) and BiGAN (Donahue et al., 2017; Donahue & Simonyan, 2019); 2) two phase training, i.e. training an encoder to invert a fixed and pretrained generator as in (Perarnau et al., 2016; Reed et al., 2016; Zhu et al., 2019).

There are theoretical supports for both methodologies. For concurrent training, BiGAN and ALI have proved that neural networks will attain perfect reconstruction and synthesis at the global minimum of the training algorithms (Theorem 2 & Proposition 3 in (Donahue et al., 2017; Donahue & Simonyan, 2019)). For two phase training, the Universal Approximation Theorem (Cybenko, 1989; Pinkus, 1999) says that one-layer neural networks can fit a given continuous mapping (the inverse of the pretrained generator) arbitrarily well.

Despite the theoretical guarantee that neural networks can approximate the perfect encoder & generator, the practice of encoding GANs is far from satisfactory. Optimization-based GAN inversion methods (Abdal et al., 2019; 2020a; Gabbay & Hoshen, 2019) solve the inverted latent code of a data point x by

$$ z(x) = \arg\min_{z \in Z} \|G_\phi(z) - x\|^2 + \lambda R(z), \qquad (3) $$

where R(z) is the regularization term. They significantly outperform explicit encoders in inversion quality. The encoder and the generator with concurrent training provide informative representations for downstream tasks (Donahue et al., 2017; Dumoulin et al., 2017; Donahue & Simonyan, 2019; Belghazi et al., 2018b; Chen et al., 2016; Belghazi et al., 2018a), but generate less competitive results than state-of-the-art GAN models (Brock et al., 2018; Karras et al., 2019; 2020), and cannot achieve faithful reconstruction. While two phase training can keep the synthesis quality of generators, it still reconstructs inputs with considerable differences (Perarnau et al., 2016; Reed et al., 2016; Zhu et al., 2019). As both the synthesis and inversion ability are vital for downstream tasks, it is necessary to close the gap between theory and empirical performance, uncover the black box behind encoding GANs, and offer insights to network designs.
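For concreteness, the optimization-based inversion objective of Eq. (3) can be implemented roughly as in the following sketch. It assumes PyTorch, a frozen pretrained generator `G`, and an l2 penalty as the regularizer R(z); these are illustrative choices rather than the setup of any cited method.

```python
import torch

def invert(G, x, latent_dim, steps=1000, lr=0.01, lam=1e-3):
    """Approximately solve z(x) = argmin_z ||G(z) - x||^2 + lam * R(z) of Eq. (3)
    by gradient descent on the latent code, with R(z) = ||z||^2 as an example prior."""
    z = torch.randn(1, latent_dim, requires_grad=True)   # random initialization
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((G(z) - x) ** 2).sum() + lam * (z ** 2).sum()
        loss.backward()
        opt.step()
    return z.detach()
```

As discussed in Section 3.2, such optimizers do not have to realize a continuous encoder, which is one reason they can invert more accurately than explicit encoder networks when given a suitable initialization.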
Here we provide a theoretical framework for analyzing encoding GANs and handling the challenges mentioned above. Different from many theoretical works built on strong assumptions and narrow scenarios like smoothness, Gaussian distributions, or shallow network architectures, we only make three mild assumptions: data lie in a manifold, the neural networks are continuous and piece-wise continuously differentiable, and all involved probability distributions have densities. All assumptions are broadly accepted in the deep learning community (Wold et al., 1987; Candès et al., 2011; LeCun et al., 2015; Goodfellow et al., 2016; Lin et al., 2018), and are consistent with practice (Glorot et al., 2011; Ioffe & Szegedy, 2015; Krizhevsky et al., 2017). This allows our theory to be closely connected with the practice, have universal meaning in guiding network designs, and supplement many previous theoretical works in related directions.

Figure 1. A typical training process. Training algorithms guide the encoder network E_θ and the generator network G_φ to the underlying e and g by minimizing divergences from E_θ(X) to Z and from G_φ(Z) to X. It is worthwhile to note that, as the latent distribution is pre-assigned and fixed, we usually have dim(Z) ≠ dim(X).

Our main contributions are summarized as follows:

- Our theory demonstrates three uncertainty principles¹ in the practice of encoding GANs: a) the underlying encoder and generator cannot be continuous at the same time; b) neural networks cannot accurately approximate the underlying encoder and generator at the same time; c) neural networks cannot be stable and accurate at the same time.
- Our theorems explain why we always get imperfect encoders and generators, why we sometimes have unstable training, and why we sometimes encounter a trade-off between fidelity and disentanglement (Karras et al., 2019; 2020), despite the theoretical guarantees in (Donahue et al., 2017; Donahue & Simonyan, 2019; Dumoulin et al., 2017; Arjovsky et al., 2017; Arjovsky & Bottou, 2017; Gulrajani et al., 2017). Our theorems also supplement those previous theoretical works.
- We provide examples to validate the three uncertainty principles and provide intuitive understandings of them.
- Although our theoretical analysis is for encoding GANs, we can also apply it to the encoding of other generative models, such as Wasserstein auto-encoders (Tolstikhin et al., 2017), adversarial auto-encoders (Makhzani et al., 2015), and auto-encoders with fixed latent distributions.

¹ Here we mean that there are always two properties that cannot be reached together, which is similar to the uncertainty principle in physics (Robertson, 1929).

2. Preliminaries

We start by introducing our settings. See Tab. 1 for the meanings and examples of the notation used in this paper. Fig. 1 illustrates a typical training process. Training algorithms guide the encoder network E_θ and the generator network G_φ to the underlying perfect encoder e and generator g by minimizing divergences from E_θ(X) to Z and from G_φ(Z) to X.

Table 1. Examples of notation used in this paper.

| Meaning | Notation |
| --- | --- |
| Scalars | σ, δ, ε, m, n, d |
| Scalar-valued functions | f, g |
| Vectors | x, z |
| Vector-valued functions | e, g |
| Sets & manifolds | X, Z, D |
| Neural networks | E_θ, G_φ, D_ψ |
| Distributions | P_X(x), P_Z(z) |
| Induced distributions | P_{g(Z)}(x), P_{E_θ(X)}(z) |
| Intrinsic dimension of manifolds | dim(X), dim(Z) |
| d-dimensional volume | m_d(·) |

Popular divergences include the Jensen-Shannon divergence (Donahue et al., 2017; Dumoulin et al., 2017)

$$ D_{JS}(P_A, Q_B) = \frac{1}{2}\,\mathbb{E}_{x \sim P_A}\!\left[\log \frac{2P_A(\mathrm{d}x)}{P_A(\mathrm{d}x) + Q_B(\mathrm{d}x)}\right] + \frac{1}{2}\,\mathbb{E}_{y \sim Q_B}\!\left[\log \frac{2Q_B(\mathrm{d}y)}{P_A(\mathrm{d}y) + Q_B(\mathrm{d}y)}\right], \qquad (4) $$

the KL divergence (Makhzani et al., 2015), the l2 reconstruction loss (Choi et al., 2020; Li et al., 2017), and the Wasserstein divergence (Tolstikhin et al., 2017)

$$ W_1(P_A, Q_B) = \inf_{\pi \in \Pi(P_A, Q_B)} \int_{A \times B} \|x - y\| \,\mathrm{d}\pi(x, y), \qquad (5) $$

where Π(P_A, Q_B) is the collection of all joint distributions of (x, y) ∈ A × B which have marginal distribution P_A for x and Q_B for y.
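As a quick numerical illustration of these divergences (our own sketch, not part of the paper's experiments): for two discrete distributions the Jensen-Shannon divergence of Eq. (4) is easy to evaluate directly, and it saturates at its upper bound log 2 exactly when the supports are disjoint, a fact used repeatedly in Section 3.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Disjoint supports: the divergence equals log 2 ~= 0.693.
print(js_divergence([0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]), np.log(2))

# Overlapping supports give a strictly smaller value (~0.02 here).
print(js_divergence([0.5, 0.5], [0.3, 0.7]))
```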
Usually, the latent space Z and the data space X are treated as manifolds embedded in some Euclidean ambient spaces. We introduce the concept of manifolds and their intrinsic dimensions (Gallot et al., 1990) below, and give examples in Fig. 2. Note that throughout this paper, we use the word dimension for the intrinsic dimension of a manifold, not the dimension of its ambient space.

Definition 1 (Intrinsic Dimension and Manifold) If any point x ∈ A has a small open neighborhood U and a continuous bijection b (also called the chart at x) that maps U ∩ A to an open set in R^n, then n is the intrinsic dimension of A. We denote it as dim(A) = n. Accordingly, A is called a manifold.

We introduce two specific examples of the training process. The concurrent training process in BiGAN (Donahue et al., 2017) solves a zero-sum game

$$ \min_{\theta, \phi} \max_{\psi} V(E_\theta, G_\phi, D_\psi), \qquad (6) $$

where D_ψ is the discriminator network for the (x, z) pair and

$$ V(E_\theta, G_\phi, D_\psi) = \mathbb{E}_{x \sim P_X}\!\left[\log D_\psi(x, E_\theta(x))\right] + \mathbb{E}_{z \sim P_Z}\!\left[\log\big(1 - D_\psi(G_\phi(z), z)\big)\right]. \qquad (7) $$

Another example is the two phase training process of LIA (Zhu et al., 2019), where a generator is first trained by solving

$$ \min_{\phi} \max_{\psi} V(G_\phi, D_\psi), \qquad (8) $$

$$ V(G_\phi, D_\psi) = \mathbb{E}_{x \sim P_X}\!\left[\log D_\psi(x)\right] + \mathbb{E}_{z \sim P_Z}\!\left[\log\big(1 - D_\psi(G_\phi(z))\big)\right], \qquad (9) $$

and then an encoder is trained by optimizing

$$ \min_{\theta} \mathbb{E}_{x \sim P_X}\!\left[\|G_\phi \circ E_\theta(x) - x\|_2^2\right] + d(P_{E_\theta(X)}, P_Z), \qquad (10) $$

in which d is among the divergences of distributions introduced at the beginning of this section.

The current design of generative models assigns a fixed latent distribution to the generator, which also fixes the intrinsic dimension of the latent distribution. Specifically, for the popular standard Gaussian latents, the intrinsic dimension is the number of variables (Goodfellow et al., 2014; Gallot et al., 1990). We disallow the networks to adjust the latent distribution during training, because we need each sample z ∈ Z from the latent distribution to produce a meaningful synthesis in X through the generator. This is essentially different from auto-encoders (Hinton & Zemel, 1994; Ng et al., 2011), which are not designed for synthesis and allow self-adaptation of the latent distribution. As dim(X) is often unclear, and dim(Z) is manually assigned before training, we are safe to assume that the latent space Z and the domain of interest X have different intrinsic dimensions, i.e. dim(X) ≠ dim(Z).

To build the foundation of our theory, we make the following assumptions, which are almost the minimum requests for theoretical analysis.

Assumption 1 Throughout this paper, we assume that: the data domain X is a manifold with an intrinsic dimension n, where n is unknown; the neural networks E_θ(x) and G_φ(z) are continuous and piece-wise continuously differentiable with respect to the inputs x and z; we do not make any assumption on the training method or the loss function; the latent and the data distributions are absolutely continuous with respect to the Lebesgue measure on Z and X respectively, which are the minimum requirements for calculating the Jensen-Shannon and Wasserstein divergences.

Figure 2. Intrinsic dimensions of manifolds in R³ (panels: a 1-D, a 2-D, and a 3-D manifold). All the above sets have 3-D coordinates (x, y, z) in R³, but their intrinsic dimensions are different.

Remark 1 Obviously, neural network components such as MLPs, CNNs, ReLU, Tanh, Leaky ReLU, Softmax, Sigmoid, and neural networks composed of them are all continuous and piece-wise continuously differentiable with respect to their inputs.
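To make the objectives in Eq. (6)–(10) concrete, the value functions can be written as loss terms as in the sketch below (PyTorch-style, with the alternating minimax updates of Algorithm 1 in (Goodfellow et al., 2014) left out; the network interfaces and the latent divergence `d_latent` are placeholders introduced here for illustration).

```python
import torch

def bigan_value(D, E, G, x_real, z_prior, eps=1e-8):
    """Concurrent (BiGAN/ALI) value function V(E, G, D) of Eq. (6) & (7):
    D scores (data, latent) pairs; E and G minimize V while D maximizes it."""
    d_real = D(x_real, E(x_real))            # D(x, E(x)) on real pairs
    d_fake = D(G(z_prior), z_prior)          # D(G(z), z) on generated pairs
    return torch.log(d_real + eps).mean() + torch.log(1.0 - d_fake + eps).mean()

def two_phase_encoder_loss(E, G, x_real, d_latent):
    """Second-phase objective of Eq. (10): reconstruction through a frozen,
    pretrained generator G plus a divergence d_latent between E(X) and P_Z."""
    rec = ((G(E(x_real)) - x_real) ** 2).flatten(1).sum(dim=1).mean()
    return rec + d_latent(E(x_real))
```

In both cases the discriminator D_ψ is updated to maximize its own value function between updates of E_θ and G_φ, as in standard adversarial training.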
Remark 2 The readers may note that we do not assume the training technique and architecture details of the generator and encoder. Thus the generator and encoder can also be obtained by other methods like variational inference (Kingma & Welling, 2013).

3. Uncertainty Principles

3.1. Uncertainty in the Continuity of the Underlying Encoder and Generator

Our first result suggests that the underlying encoder and generator may not be smooth at the same time.

Theorem 1 When dim(Z) ≠ dim(X), at least one of the underlying encoder and generator in Eq. (1) & (2) is discontinuous; and for any x ∈ X and δ > 0, there is a point x' in the geodesic ball centered at x with radius δ, such that e is not continuous at x' or g is not continuous at e(x'). The same holds for Z.

Theorem 1 almost excludes continuous underlying encoders and generators in practice, and underlines that discontinuous points exist in every neighborhood. It reveals the extremely bad property of the underlying encoder and generator, and nearly excludes the chance for continuous networks to exactly represent them. Also, it urges us to rethink the current design of encoding GANs, as the continuity of underlying functions is so important in the theoretical foundations of the universal approximation abilities of neural networks (Cybenko, 1989; Hornik et al., 1989; Allan, 1999; Hanin & Sellke, 2018; Johnson, 2018; Kidger & Lyons, 2020; Park et al., 2021), representation learning (Bengio et al., 2013), unsupervised learning (Barlow, 1989; Belkin & Niyogi, 2001; Belkin et al., 2006), and other downstream tasks (Belkin & Niyogi, 2003; Zhang & Zha, 2004; Chang et al., 2004). The uncertainty in continuity no doubt strikes the heart of both network designs and downstream tasks of encoding GANs.

3.2. Uncertainty in Universal Approximation Ability

Apart from the above, we are interested in quantitatively analyzing how well neural networks can approximate the underlying encoder and generator. The Universal Approximation Theorem (Pinkus, 1999) states that neural networks can approximate any continuous function with arbitrary accuracy. However, in this paper, we find that neural networks are seldom universal approximators in encoding GANs.

Theorem 2 When dim(Z) ≠ dim(X), neural networks are not universal approximators to the underlying encoder and generator in Eq. (1) & (2). More specifically, we have

$$ \inf_{\theta, \phi}\, \delta_e(\theta) + \delta_g(\phi) \;\geq\; D_e + D_g \;>\; 0, \qquad (11) $$

where

$$ D_e = \frac{1}{2} \sup_{x \in X} \limsup_{y \to x} \|e(y) - e(x)\|, \qquad (12) $$

$$ D_g = \frac{1}{2} \sup_{z \in Z} \limsup_{w \to z} \|g(w) - g(z)\|, \qquad (13) $$

$$ \delta_e(\theta) = \sup_{x \in X} \|E_\theta(x) - e(x)\|, \qquad (14) $$

$$ \delta_g(\phi) = \sup_{z \in Z} \|G_\phi(z) - g(z)\|. \qquad (15) $$

Moreover, if dim(Z) < dim(X), we have

$$ D_{JS}(P_{G_\phi(Z)}, P_X) \geq \frac{\log 2}{2}, \qquad (16) $$

and if dim(X) < dim(Z), we have

$$ D_{JS}(P_{E_\theta(X)}, P_Z) \geq \frac{\log 2}{2}. \qquad (17) $$

Theorem 2 seems to contradict Theorem 2 in BiGAN (Donahue et al., 2017), which states that the underlying perfect encoder and generator can be reached by the training algorithms. This contradiction comes from a potential assumption in the proof of BiGAN: the Jensen-Shannon divergence between the induced (encoded or generated) distributions and the real (latent or data) distributions can reach exact zero. This assumption is well consistent with practice when it is originally used in the proof of Theorem 1 in BiGAN, where E_θ and G_φ have non-deterministic architectures, i.e. E_θ(x) = p(z|x; θ) and G_φ(z) = p(x|z; φ) are distributions rather than specific vectors given inputs x ∈ X and z ∈ Z.
However, in the proof of Theorem 2 (Appendix A.3 & A.4 of (Donahue et al., 2017)) in BiGAN, E_θ(x) and G_φ(z) are limited to deterministic functions. This assumption then no longer holds if dim(Z) ≠ dim(X), as we show in Fig. 3 and 4, and in Eq. (17) of Theorem 2. It then results in the failure of the theory of BiGAN in practice. The detailed analysis is provided in the supplementary material.

Theorem 2 estimates how closely neural networks can approach the underlying encoder and generator. For all neural networks, whatever the depth, width, architecture, and training method, the approximation error to e and g is larger than D_e + D_g, which is a positive real number when dim(Z) ≠ dim(X) and only depends on the task itself.

We note that the research field of universal approximation also extensively uses another error measure, the l_p distance between neural networks and underlying targets (Lu et al., 2017; Kidger & Lyons, 2020; Park et al., 2021). However, this error measure may be ill-posed for generative models. The l_p error measure cares more about how a prediction (such as a predicted class label) departs from its real value (such as the ground-truth label), while in the scenario of encoding generative models, we care more about how much of the data or latent distribution is covered by the generated or encoded distribution (Eq. (16) & (17) offer such an error measure). Those two things are not equivalent. To demonstrate it, consider the continuous target function

$$ f_\epsilon(x) = \begin{cases} \dfrac{\epsilon}{1-\epsilon}\, x, & 0 \leq x < 1-\epsilon, \\[4pt] \dfrac{1-\epsilon}{\epsilon}\big(x - (1-\epsilon)\big) + \epsilon, & 1-\epsilon \leq x \leq 1, \end{cases} \qquad (18) $$

where 0 < ε ≪ 1. It is easy to see that f_ε([0, 1−ε)) = [0, ε) and f_ε([1−ε, 1]) = [ε, 1]. Setting g(x) = (ε/(1−ε)) x, we have ∫_0^1 |f_ε − g| dx < ε → 0, while m_1(f_ε([0, 1]) \ g([0, 1])) = 1 − ε/(1−ε) → 1, where m_1 is the Lebesgue measure on [0, 1]. As ε is very small, g approximates f_ε very well in l_1 error, yet most of the output of f_ε is not covered by g. Thus, for encoding generative models, the uniform approximation error in Eq. (11) and the distribution divergences in Eq. (16) & (17) are more meaningful.
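The gap between the two error measures in the f_ε example can be checked with a few lines of arithmetic; the sketch below (ours, with an illustrative ε and grid resolution) reproduces both quantities.

```python
import numpy as np

eps = 1e-3
x = np.linspace(0.0, 1.0, 200_001)

# The piecewise-linear target f_eps of Eq. (18) and its l1-close approximation g.
f = np.where(x < 1 - eps,
             eps / (1 - eps) * x,
             (1 - eps) / eps * (x - (1 - eps)) + eps)
g = eps / (1 - eps) * x

l1_error = np.abs(f - g).mean()          # ~ integral of |f - g| over [0, 1]; below eps
uncovered = 1.0 - eps / (1 - eps)        # m1(f([0,1]) \ g([0,1])); close to 1
print(l1_error, uncovered)               # tiny l1 error, yet almost nothing is covered
```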
Theorem 2 explains the gap between the practice and theory of encoding GANs we have mentioned in the Introduction. Optimization-based GAN inversion methods do not need a continuous explicit encoder (Creswell & Bharath, 2018; Abdal et al., 2019; 2020a), and thus do not yield to the error bounds in Eq. (11) & (17). As a consequence, the optimization-based methods may approximate the inverse mapping more accurately than encoder-based methods if provided suitable initialization (Zhu et al., 2019). For explicit encoders and generators (Perarnau et al., 2016; Donahue et al., 2017; Rosca et al., 2017; Su, 2019; Donahue & Simonyan, 2019; Zhu et al., 2019; Pidhorskyi et al., 2020), the joint approximation errors in Eq. (11), (16) & (17) deviate at least one of the encoder and generator from high-quality outputs. In this case, neural networks are not universal approximators for the underlying encoder and generator, and the universal approximation theorem (Pinkus, 1999) does not hold in our scenario.

Figure 3. Illustration of Eq. (11). For all the neural networks, no matter their architectures and parameters, they admit a positive distance to the underlying encoder and generator, as long as the conditions of Theorem 2 hold.

Figure 4. Illustration of Eq. (17). Here we give a simple example. If the latent distribution P_A is supported on a real line A, a one-dimensional manifold, and the data distribution P_B is supported on a two-dimensional manifold B, then for any differentiable function h : A → B, h(A) is a curve on B. The curve can never occupy the whole 2-D surface B, and thus the Jensen-Shannon divergence can never reach exact zero.

3.3. Uncertainty in Training Dynamics

Our last result digs into the training dynamics. It finds that in most cases gradient explosion cannot be avoided during training, and offers an estimation of the explosion speed.

Theorem 3 Denote n = dim(X) and d = dim(Z). Let m_d(Z) and m_n(X) be the volumes of Z and X with respect to their intrinsic dimensions, respectively. Assume that Z and X are bounded manifolds embedded in high dimensional Euclidean spaces, but are almost everywhere diffeomorphic to open subsets in R^d and R^n, respectively. Denote diam(Z) = sup_{z,w∈Z} ‖z − w‖, diam(X) = sup_{x,y∈X} ‖x − y‖, and ω_i the volume of the unit ball of dimension i. For simplicity, let i, j ∈ {d, n} and

$$ \Gamma(A, B, i, j, a, b) = \left(\frac{\mathrm{diam}(A)^{j-i}\, m_j(B)}{3a\big(2^j m_j(B) + \omega_j\, \mathrm{diam}(A)^j\big)}\right)^{\frac{1}{j-i}} b\, m_j(B). $$

Then there is a trade-off between the approximation error and the maximum gradient norm of the networks if dim(Z) ≠ dim(X). Specifically, if dim(Z) < dim(X), there exist constants C_X > 0, depending only on P_X, and C_d > 0, depending only on d, such that

$$ W_1(P_{G_\phi(Z)}, P_X)\left(\sup_{z \in Z}\|\nabla G_\phi\| + 1\right)^{\frac{n}{n-d}} \;\geq\; \Gamma(Z, X, d, n, C_d, C_X); \qquad (20) $$

if D_JS(P_{G_φ(Z)}, P_X) < log 2, then we further have

$$ D_{JS}(P_{G_\phi(Z)}, P_X)\left(\sup_{z \in Z}\|\nabla G_\phi\| + 1\right)^{\frac{2n}{n-d}} \left(\mathrm{diam}(Z)\sup_{z \in Z}\|\nabla G_\phi\| + \mathrm{diam}(X)\right)^2 \;\geq\; 4\,\Gamma(Z, X, d, n, C_d, C_X)^2. \qquad (21) $$

On the other hand, if dim(Z) > dim(X), there exist constants C_Z > 0, depending only on P_Z, and C_n > 0, depending only on n, such that

$$ W_1(P_{E_\theta(X)}, P_Z)\left(\sup_{x \in X}\|\nabla E_\theta\| + 1\right)^{\frac{n}{d-n}} \;\geq\; \Gamma(X, Z, n, d, C_n, C_Z); \qquad (22) $$

if D_JS(P_{E_θ(X)}, P_Z) < log 2, then we further have

$$ D_{JS}(P_{E_\theta(X)}, P_Z)\left(\sup_{x \in X}\|\nabla E_\theta\| + 1\right)^{\frac{2n}{d-n}} \left(\mathrm{diam}(X)\sup_{x \in X}\|\nabla E_\theta\| + \mathrm{diam}(Z)\right)^2 \;\geq\; 4\,\Gamma(X, Z, n, d, C_n, C_Z)^2, \qquad (23) $$

where W_1 is the 1-Wasserstein distance (Villani, 2008).

Figure 5. Illustration of Theorem 3. For a one-dimensional curve, the only way to fit a two-dimensional manifold is to twist itself, so that it can occupy more area.

Remark 3 Typical examples of manifolds satisfying the conditions of Theorem 3 are spheres, hyperbolic surfaces, ellipsoids, and their deformed shapes embedded in high dimensional spaces.

Remark 4 As we do not impose conditions on the training process, we can apply Theorem 3 to the training of GANs, which is the first-phase learning of two phase encoding GANs.

Remark 5 Note that the value of Γ comes from two positive constants once P_Z and P_X are given. Γ (which controls the explosion speed of the maximum gradient norm) grows larger for fixed j when i decreases or m_j(B) increases, and it is very small when diam(A) is very large. These changes of values are consistent with our intuition in Fig. 5. However, in order to get a uniform format holding for all situations, we use a very loose estimation when diam(A) is tiny in the deduction. This makes the value of Γ not optimal when diam(A) is extremely small, and a better estimation can be used for this case. See the proof in the supplementary material for details.

Remark 6 Note that the Jensen-Shannon divergence has a universal upper bound log 2. Thus the condition for Eq. (21) & (23) to hold is equivalent to saying that the induced distribution of the encoder or generator network is not too far away from the real distribution.
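Before interpreting the theorem, the trade-off can be seen on a concrete toy family of maps; the zigzag construction below is our own illustration (not from the paper): a curve from [0, 1] into the unit square that oscillates k times has maximum gradient norm of order k, while the distance from a typical point of the square to the curve shrinks only like 1/k, so the product of the two stays bounded away from zero, which is exactly the flavor of Eq. (20).

```python
import numpy as np

def zigzag(t, k):
    """A continuous curve [0,1] -> [0,1]^2 that sweeps left to right while
    oscillating vertically k times; its maximum gradient norm is about 2k."""
    y = np.abs(((t * k) % 1.0) * 2.0 - 1.0)   # triangle wave with k oscillations
    return np.stack([t, y], axis=-1)

rng = np.random.default_rng(0)
square = rng.random((1000, 2))                # samples from U([0,1]^2)
curve_t = np.linspace(0.0, 1.0, 2001)

for k in [2, 8, 32]:
    curve = zigzag(curve_t, k)
    # Mean distance from a square sample to the nearest curve point: a rough
    # proxy for how well the pushforward of U([0,1]) covers U([0,1]^2).
    dists = np.linalg.norm(square[:, None, :] - curve[None, :, :], axis=-1).min(axis=1)
    grad_norm = np.sqrt(1.0 + (2.0 * k) ** 2)
    print(k, dists.mean(), grad_norm * dists.mean())   # product stays roughly constant
```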
Theorem 3 reveals a trade-off between the Wasserstein distance, which is often the training loss in practice, and the maximum gradient norm of the networks, if dim(Z) ≠ dim(X). The theorem can be understood intuitively in the following way: the only way for a one-dimensional curve to fit a two-dimensional surface is to twist the curve so that it can occupy as much area as possible. We illustrate this intuition in Fig. 5. It means that the training of encoding GANs can be rather unstable and difficult if both W_1(P_{e(X)}, P_Z) and W_1(P_{g(Z)}, P_X) are minimized. Some previous works (Bengio et al., 2013; Belkin & Niyogi, 2003) on representation learning argue that a gentle gradient norm is necessary for good representations computed by networks. Thus Theorem 3 may also suggest bad representation quality when the Wasserstein distance is small.

In a more practical setting, the data distribution is an empirical approximation to the real underlying distribution. For the Wasserstein distance, we then have:

Corollary 1 Under the conditions of Theorem 3, let {x_1, ..., x_N} be N independent samples from P_X, and let Q_N = (1/N) Σ_{i=1}^{N} δ_{x_i} be the empirical distribution of those samples, where δ_x is the Dirac distribution for sample x. Further assume that ∫_X ‖x − y‖ P_X(dy) < ∞ for all x ∈ X. If d < n, then there exists a constant C > 0 such that for all generator networks G_φ,

$$ \mathbb{E}_{x_1, \ldots, x_N \sim P_X}\!\left[W_1(P_{G_\phi(Z)}, Q_N)\right] \;\geq\; C\left(\sup_{z \in Z}\|\nabla G_\phi\| + 1\right)^{-\frac{n}{n-d}} - O\!\left(N^{-\frac{1}{n}}\right). $$

Theorem 3 and Corollary 1 point out a case where Wasserstein GANs (Arjovsky et al., 2017; Arjovsky & Bottou, 2017) may suffer from gradient explosion. Wasserstein GANs were proposed to cope with the gradient explosion issue in GANs. They replace the original Jensen-Shannon divergence (Goodfellow et al., 2014) of GANs with the Wasserstein distance. The Wasserstein distance is proved to be more stable for training than the Jensen-Shannon divergence as it is smoother (Arjovsky & Bottou, 2017). But previous works did not discuss whether the Wasserstein divergence can totally exclude gradient explosion. A recent theoretical analysis (Bottou et al., 2019) points out that the Monge-Ampère formulation (Villani, 2003; 2008) of WGAN may not have good duality. In this paper, we further give a negative answer in Eq. (20): when dim(Z) < dim(X), W_1(P_{G_φ(Z)}, P_X) → 0 implies sup_{z∈Z} ‖∇G_φ‖ → ∞. If the training process meets the exploding points of G_φ, the network will then have gradient explosion.

Theorem 3 also reveals the trade-off between fidelity and disentanglement of GANs when dim(Z) < dim(X). Specifically, there is a trade-off between the Fréchet Inception Distance (FID) (Heusel et al., 2017) and the Perceptual Path Length (PPL) (Karras et al., 2019). PPL is introduced in StyleGAN (Karras et al., 2019; 2020) to measure the semantic disentanglement of generators as

$$ l_p = \mathbb{E}_{z_1, z_2 \sim P_Z}\!\left[\frac{1}{\epsilon^2}\left\|V_\beta\big(G_\phi(t z_1 + (1-t) z_2)\big) - V_\beta\big(G_\phi((t+\epsilon) z_1 + (1-t-\epsilon) z_2)\big)\right\|_2^2\right], \qquad (25) $$

where t ∈ (0, 1) and 0 < ε ≪ 1 are constants and V_β is the pretrained VGG network. By the chain rule of differentiation, it is easy to see that sup_{z∈Z} ‖∇G_φ‖ has a positive correlation with the PPL score if the computation of the expectation meets z = arg sup_{z∈Z} ‖∇G_φ‖ or a sequence of z_k converging to it. As the FID is smaller when the generated and data distributions get closer, a lower FID suggests a lower Wasserstein distance, which by Eq. (20) suggests a higher maximum gradient norm, and results in a higher PPL score (see Fig. 9 of (Karras et al., 2019) for the opposite trend of FID and PPL in training).
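For reference, the PPL-type statistic in Eq. (25) can be estimated by Monte Carlo as in the sketch below; the pretrained VGG network V_β is abstracted into a generic feature callable `feat`, and the constants `t` and `eps` are illustrative, so this is a simplified stand-in rather than the exact StyleGAN implementation.

```python
import torch

def path_length(G, feat, latent_dim, n_samples=1024, t=0.5, eps=1e-4):
    """Monte-Carlo estimate of Eq. (25): squared feature distance between
    generations at two nearby interpolation points, scaled by 1/eps^2."""
    z1 = torch.randn(n_samples, latent_dim)
    z2 = torch.randn(n_samples, latent_dim)
    za = t * z1 + (1 - t) * z2                    # interpolation at t
    zb = (t + eps) * z1 + (1 - t - eps) * z2      # ... and at t + eps
    with torch.no_grad():
        d = feat(G(za)) - feat(G(zb))
    return (d.flatten(1).pow(2).sum(dim=1) / eps ** 2).mean()
```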
4. Validating the Uncertainty Principles

This section presents a toy example to illustrate and support our theory. The toy example aims to learn the underlying encoder and generator between uniform distributions whose supports have intrinsic dimension 1 or 2. The encoder, generator, and discriminator networks consist of 3-layer MLPs with Leaky ReLU activations. The numbers of hidden units are 10, 100, and 10 for the MLP layers, to model the upsampling and downsampling in typical generative models like StyleGANs (Karras et al., 2019; 2020). Considering the simplicity of the task, we think such shallow architectures are adequate for our purpose.

We train the networks with both concurrent training and two phase training methods. For concurrent training, we use the objective (6) as in BiGAN (Donahue et al., 2017); for two phase training, we use the objectives (8) and (10) with d = d_Z + d_recon, where d_Z is the Jensen-Shannon divergence between the encoder output and the latent distribution, and d_recon is the Jensen-Shannon divergence between the reconstructed data distribution and the real data distribution. For the zero-sum games in objectives (6), (8), and (10), we solve them by the adversarial training process in Algorithm 1 of (Goodfellow et al., 2014). For each experiment setting, we further change the number of steps applied to the discriminator (see Algorithm 1 of (Goodfellow et al., 2014) for its meaning) in each adversarial training step. For two phase training, as there are multiple discriminators, we take the following strategy: when dim(Z) < dim(X), we change the steps of the discriminators which discriminate the generator's outputs and real data; when dim(Z) > dim(X), we change the steps of the discriminators which discriminate the encoder's outputs and the latents. More discriminator steps produce more discriminative discriminators, and thus the corresponding generators or encoders have to align their outputs to the real distributions more precisely. We are going to explore how this influences the results.

4.1. Uncertainty in the Continuity of the Underlying Encoder and Generator

Theorem 1 is difficult to verify experimentally, as we do not know the exact form of the underlying encoder & generator. However, we can infer the properties of the underlying encoder & generator from the following basic geometric result:

Lemma 1 If h is continuous and D is connected, then h(D) is also connected (Rudin et al., 1964).

Let U([0, 1]) be the latent distribution P_Z and U([0, 1]²) be the data distribution P_X. Assume that both the underlying generator g and encoder e are continuous. Then it is easy to see that g maps [0, 1] \ {0.5} to [0, 1]² \ {g(0.5)}, and e performs the inversion. This is, however, against Lemma 1, as [0, 1] \ {0.5} is not connected, while [0, 1]² \ {g(0.5)} is obviously connected, as shown in Fig. 6.

Figure 6. Illustration of the impossibility of a continuous underlying encoder and generator between a real line segment in R and a unit square in R². The generator g sends the points 0.2, 0.5, 0.8 of (Z, U([0, 1])) to g(0.2), g(0.5), g(0.8) in (X, U([0, 1]²)). Refer to Section 4.1 for details.

4.2. Uncertainty in Universal Approximation Ability

We now check whether our toy networks can approximate the underlying encoder & generator. We can evaluate this with the divergence between the induced distributions (by the encoder or generator network) and the real distributions (of latents or data). The results are reported in Fig. 7 & 8.
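For reference, a minimal sketch of the toy setup described above, assuming PyTorch: the hidden widths 10, 100, 10 and Leaky ReLU follow the text, while the negative slope, the linear output layer, and the sigmoid discriminator head are illustrative choices of ours.

```python
import torch.nn as nn

def toy_mlp(in_dim, out_dim, hidden=(10, 100, 10)):
    """An MLP with hidden widths 10, 100, 10 and Leaky ReLU activations."""
    dims = [in_dim, *hidden]
    layers = []
    for a, b in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(a, b), nn.LeakyReLU(0.2)]
    layers.append(nn.Linear(dims[-1], out_dim))
    return nn.Sequential(*layers)

# Generator from 1-D latents to 2-D data, encoder going the other way, and a
# discriminator emitting a probability for the objectives in Eq. (8) & (9).
G = toy_mlp(1, 2)
E = toy_mlp(2, 1)
D = nn.Sequential(toy_mlp(2, 1), nn.Sigmoid())
```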
It may be no surprise to see that the induced distribution is a curve while the real distribution is a surface area for encoders and generators that try to transfer U([0, 1]) to U([0, 1]²), regardless of the training algorithms. As curves do not hold positive surface area, this means that the induced distributions totally fail to capture the real distributions, and the divergence between them should be considerable.

The above observation, however, is a little weak because: 1) our network design or training method may not be optimal; 2) we are not able to exactly estimate the error bound D_e + D_g to support Eq. (11) of Theorem 2 directly. Fortunately, we can check Eq. (17) of Theorem 2, which is the key to the failure of the theories in BiGAN (Donahue et al., 2017) and ALI (Dumoulin et al., 2017), by the following lemma of (Goodfellow et al., 2014):

Lemma 2 (Estimation of the JS Divergence) For a fixed generator G, when the discriminator D is optimal, we have

$$ D_{JS}(P_G, P_{\mathrm{data}}) = \log 2 + \frac{1}{2} V(D, G), \qquad (26) $$

where

$$ V(D, G) = \mathbb{E}_{x \sim P_X}\!\left[\log D(x)\right] + \mathbb{E}_{z \sim P_Z}\!\left[\log\big(1 - D(G(z))\big)\right]. \qquad (27) $$

By Lemma 2, we develop the following strategy to estimate the Jensen-Shannon divergence between the generated distribution and the data distribution. For a given generator G, we fix it and maximize V(D, G) until convergence. By Lemma 2, the maximum value of (1/2)V(D, G) plus log 2 offers an estimation of the Jensen-Shannon divergence. The results are reported in Fig. 7, where we find that the error bound log 2 / 2 for the Jensen-Shannon divergence always holds. We then look into a different case where the latent is a two-dimensional standard Gaussian N_2(0, 1) and the data distribution is still U([0, 1]²), to see what would happen if dim(Z) = dim(X). The networks then show an estimated Jensen-Shannon divergence smaller than log 2 / 2. The experiments thus verify that dim(Z) ≠ dim(X) really forces a positive lower bound on the Jensen-Shannon divergence, which does not appear if dim(Z) = dim(X), as claimed in Theorem 2.

Figure 7. Estimated Jensen-Shannon divergence and gradient norms of the networks for the distributions in Fig. 8. When d = dim(Z) ≠ n = dim(X), the estimated JSD is always larger than log 2 / 2 ≈ 0.347, and the gradient norm increases as there are more steps in training the discriminators. When dim(Z) = dim(X) = 2, the estimated JSD is smaller than log 2 / 2. (a) Estimated JSD and gradient norm against discriminator steps, for concurrent and two phase training (d ≠ n). (b) Estimated JSD (d = n):

| Methods | Encoded | Generated |
| --- | --- | --- |
| Concurrent | 0.108 (± 0.0062) | 0.0418 (± 0.0045) |
| Two-phase | 0.118 (± 0.0064) | 0.110 (± 0.0070) |
| Bound in (17) | log 2 / 2 | log 2 / 2 |

4.3. Uncertainty in Training Dynamics

Theorem 3 claims an increasing maximum norm of the gradient when the approximation gets more accurate. We are curious about how well the practice yields to this theorem. We find the following interesting facts in Fig. 7 & 8: 1) increasing the number of steps to train the discriminators makes the generated or encoded distribution "longer"; 2) the generated and encoded distributions are twisted curves in R², and we can increase the cycles of twisting by increasing the number of steps applied to the discriminators; 3) the norms of the gradients of the generator or encoder network do increase as D_JS gets smaller.
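The gradient norms reported above can be measured in several ways; one plausible sketch (ours, not necessarily the paper's exact protocol) estimates sup_z ‖∇G_φ(z)‖ by sampling latent codes and taking the largest Jacobian norm returned by automatic differentiation.

```python
import torch

def max_grad_norm(G, latent_dim, n_samples=1024):
    """Monte-Carlo proxy for sup_z ||∇G(z)||: sample latents, compute the
    Jacobian of G at each one, and return the largest Frobenius norm."""
    best = 0.0
    for _ in range(n_samples):
        z = torch.randn(1, latent_dim)   # standard Gaussian latents here; substitute the actual prior if different
        J = torch.autograd.functional.jacobian(G, z)
        best = max(best, J.norm().item())
    return best
```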
Recall the intuition that inspires Theorem 3: the only way for a one-dimensional curve to fit a two-dimensional surface is to twist the curve so that it can occupy as much area as possible. Experimental results in Fig. 8 support this intuition (Fig. 5) and Theorem 3 behind it, which could be the reason for the increasingly twisted structures when adding more steps to train the discriminators. As the cycles of the twist grow, the length of the curve of the generated distribution also increases. This suggests an increasing gradient norm, as the length of the curve is calculated from the integral of its gradient norm:

$$ \mathrm{Length}(\gamma) = \int_0^1 \|\gamma'(t)\|_2 \,\mathrm{d}t. \qquad (28) $$

This observation can be generalized to the following lemma:

Lemma 3 For a differentiable map h : D → R^d, D ⊂ R^d,

$$ m_d(h(D)) \;\leq\; \sup_{x \in D}\|\nabla h\|^d \, m_d(D), \qquad (29) $$

where m_d(·) is the volume of a set in R^d.

In high dimensional settings, Lemma 3 suggests that the volume of the output space is connected with the maximum norm of the gradient. We can infer that, for a high dimensional manifold, the only way to fit a manifold of even higher dimension is still by twisting. While twisting means larger volume, Lemma 3 suggests gradient explosion in this case, regardless of training methods and losses. This supports Theorem 3 and generalizes it to broader cases.

Figure 8. Induced distributions and real distributions of the toy examples (panels: encoded space with the encoded and prior distributions, and generated space with the generated and data distributions, under increasingly discriminative discriminators). Odd rows are results of concurrent training, and even rows are results of two phase training. Except for the last column, the first two rows report the encoder outputs when dim(Z) = 2 and dim(X) = 1, and the last two rows report the generator outputs when dim(Z) = 1 and dim(X) = 2. The last column reports the encoder and generator outputs when dim(Z) = dim(X) = 2, to provide a comparison with the cases of unequal dimensions. We can see that increasing the discriminative ability of the discriminators forces the induced distributions of the encoders and generators to be more twisted when the intrinsic dimensions of the latent and data spaces are unequal.

5. Conclusion

In this paper, we investigate why encoding GANs find it so difficult to achieve the theoretical performance promised in previous works (Donahue et al., 2017; Dumoulin et al., 2017; Pinkus, 1999). We find that three uncertainty principles deviate the practice from those previous theoretical works. The uncovered uncertainty principles give a quantifiable description of the defects of current frameworks, explain the previously reported empirical difficulties (Donahue et al., 2017; Donahue & Simonyan, 2019; Zhu et al., 2019), and reveal fundamental factors in the black box of encoding GANs, such as smoothness, approximation ability, and fitting stability. For each uncertainty principle, we provide simple geometric intuition to demonstrate it. Our theories will serve as a solid starting point for further understanding of encoding GANs and other generative models.

Acknowledgement

This work is supported by the National Key R&D Program of China under Grant 2020AAA0105702, the National Natural Science Foundation of China (NSFC) under Grant U19B2038, and the University Synergy Innovation Program of Anhui Province under Grant GXXT-2019-025. Z. Lin is supported by the National Natural Science Foundation of China (Grant Nos. 61625301 and 61731018), Project 2020BD006 supported by PKU-Baidu Fund, the Major Scientific Research Project of Zhejiang Lab (Grant Nos. 2019KB0AC01 and 2019KB0AB02), and Beijing Academy of Artificial Intelligence.
Finally, the authors sincerely express vehement protestations of gratitude to Liao Wang of Rheinische Friedrich-Wilhelms-Universität Bonn for his help and support.

References

Abdal, R., Qin, Y., and Wonka, P. Image2StyleGAN: How to embed images into the StyleGAN latent space? In Proceedings of the IEEE International Conference on Computer Vision, pp. 4432–4441, 2019.

Abdal, R., Qin, Y., and Wonka, P. Image2StyleGAN++: How to edit the embedded images? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8296–8305, 2020a.

Abdal, R., Zhu, P., Mitra, N., and Wonka, P. StyleFlow: Attribute-conditioned exploration of StyleGAN-generated images using conditional continuous normalizing flows. arXiv e-prints, pp. arXiv–2008, 2020b.

Allan, P. Approximation theory of the MLP model in neural networks [J]. Acta Numerica, 8:143–195, 1999.

Arjovsky, M. and Bottou, L. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

Asim, M., Daniels, M., Leong, O., Ahmed, A., and Hand, P. Invertible generative models for inverse problems: Mitigating representation error and dataset bias. In International Conference on Machine Learning, pp. 399–409. PMLR, 2020.

Barlow, H. B. Unsupervised learning. Neural Computation, 1(3):295–311, 1989.

Bau, D., Strobelt, H., Peebles, W., Zhou, B., Zhu, J.-Y., Torralba, A., et al. Semantic photo manipulation with a generative image prior. arXiv preprint arXiv:2005.07727, 2020.

Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, D. Mutual information neural estimation. In International Conference on Machine Learning, pp. 531–540. PMLR, 2018a.

Belghazi, M. I., Rajeswar, S., Mastropietro, O., Rostamzadeh, N., Mitrovic, J., and Courville, A. Hierarchical adversarially learned inference. arXiv preprint arXiv:1802.01071, 2018b.

Belkin, M. and Niyogi, P. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, pp. 585–591, 2001.

Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.

Belkin, M., Niyogi, P., and Sindhwani, V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(11), 2006.

Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

Bottou, L., Arjovsky, M., Lopez-Paz, D., and Oquab, M. Geometrical insights for implicit generative modeling, 2019.

Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

Candès, E. J., Li, X., Ma, Y., and Wright, J. Robust principal component analysis? Journal of the ACM (JACM), 58(3):1–37, 2011.

Chang, H., Yeung, D.-Y., and Xiong, Y. Super-resolution through neighbor embedding. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), volume 1, pp. I–I. IEEE, 2004.

Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. arXiv preprint arXiv:1606.03657, 2016.
Choi, K., Hawthorne, C., Simon, I., Dinculescu, M., and Engel, J. Encoding musical style with transformer autoencoders. In International Conference on Machine Learning, pp. 1899–1908. PMLR, 2020.

Creswell, A. and Bharath, A. A. Inverting the generator of a generative adversarial network (II), 2018.

Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

De Cao, N. and Kipf, T. MolGAN: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973, 2018.

Donahue, J. and Simonyan, K. Large scale adversarial representation learning. In Advances in Neural Information Processing Systems, pp. 10542–10552, 2019.

Donahue, J., Krähenbühl, P., and Darrell, T. Adversarial feature learning, 2017.

Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Arjovsky, M., and Courville, A. Adversarially learned inference, 2017.

Frid-Adar, M., Diamant, I., Klang, E., Amitai, M., Goldberger, J., and Greenspan, H. GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing, 321:321–331, 2018.

Gabbay, A. and Hoshen, Y. Style generator inversion for image enhancement and animation. arXiv preprint arXiv:1906.11880, 2019.

Gallot, S., Hulin, D., and Lafontaine, J. Riemannian Geometry, volume 2. Springer, 1990.

Glorot, X., Bordes, A., and Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323, 2011.

Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. Deep Learning. MIT Press, Cambridge, 2016.

Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial networks, 2014.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777, 2017.

Hanin, B. and Sellke, M. Approximating continuous functions by ReLU nets of minimal width. URL http://arxiv.org/abs/1710.11278, 2018.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.

Hinton, G. E. and Zemel, R. S. Autoencoders, minimum description length and Helmholtz free energy. In Advances in Neural Information Processing Systems, pp. 3–10, 1994.

Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift, 2015.

Johnson, J. Deep, skinny neural networks are not universal approximators. arXiv preprint arXiv:1810.00393, 2018.

Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410, 2019.

Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110–8119, 2020.

Kidger, P. and Lyons, T. Universal approximation with deep narrow networks. In Conference on Learning Theory, pp. 2306–2327. PMLR, 2020.
Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.

LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436–444, 2015.

Lewis, K. M., Varadharajan, S., and Kemelmacher-Shlizerman, I. VOGUE: Try-on by StyleGAN interpolation optimization. arXiv preprint arXiv:2101.02285, 2021.

Li, C., Liu, H., Chen, C., Pu, Y., Chen, L., Henao, R., and Carin, L. ALICE: Towards understanding adversarial learning for joint distribution matching. arXiv preprint arXiv:1709.01215, 2017.

Liang, X., Hu, Z., Zhang, H., Gan, C., and Xing, E. P. Recurrent topic-transition GAN for visual paragraph generation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3362–3371, 2017.

Lin, C. H., Chang, C.-C., Chen, Y.-S., Juan, D.-C., Wei, W., and Chen, H.-T. COCO-GAN: Generation by parts via conditional coordinating. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4512–4521, 2019.

Lin, Z., Khetan, A., Fanti, G., and Oh, S. PacGAN: The power of two samples in generative adversarial networks. Advances in Neural Information Processing Systems, 2018.

Liu, J., Zha, Z.-J., Chen, D., Hong, R., and Wang, M. Adaptive transfer network for cross-domain person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7202–7211, 2019.

Liu, L., Lu, Y., Yang, M., Qu, Q., Zhu, J., and Li, H. Generative adversarial network for abstractive text summarization. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.

Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. The expressive power of neural networks: A view from the width, 2017.

Ma, W., Cheng, F., Xu, Y., Wen, Q., and Liu, Y. Probabilistic representation and inverse design of metamaterials based on a deep generative model with semi-supervised learning strategy. Advanced Materials, 31(35):1901111, 2019.

Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

Ng, A. et al. Sparse autoencoder. CS294A Lecture Notes, 72(2011):1–19, 2011.

Park, S., Yun, C., Lee, J., and Shin, J. Minimum width for universal approximation. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=O-XJwyoIF-k.

Perarnau, G., van de Weijer, J., Raducanu, B., and Alvarez, J. M. Invertible conditional GANs for image editing, 2016.

Pidhorskyi, S., Adjeroh, D. A., and Doretto, G. Adversarial latent autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14104–14113, 2020.

Pinkus, A. Approximation theory of the MLP model in neural networks. Acta Numerica, 8(1):143–195, 1999.

Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., and Lee, H. Generative adversarial text to image synthesis, 2016.

Richardson, E., Alaluf, Y., Patashnik, O., Nitzan, Y., Azar, Y., Shapiro, S., and Cohen-Or, D. Encoding in style: A StyleGAN encoder for image-to-image translation. arXiv preprint arXiv:2008.00951, 2020.

Robertson, H. P. The uncertainty principle. Physical Review, 34(1):163, 1929.

Rosca, M., Lakshminarayanan, B., Warde-Farley, D., and Mohamed, S. Variational approaches for auto-encoding generative adversarial networks. arXiv preprint arXiv:1706.04987, 2017.
Rosca, M., Lakshminarayanan, B., and Mohamed, S. Distribution matching in variational inference. arXiv preprint arXiv:1802.06847, 2018.

Rudin, W. et al. Principles of Mathematical Analysis, volume 3. McGraw-Hill, New York, 1964.

Shen, Y. and Zhou, B. Closed-form factorization of latent semantics in GANs. arXiv preprint arXiv:2007.06600, 2020.

Su, J. O-GAN: Extremely concise approach for auto-encoding generative adversarial networks. arXiv preprint arXiv:1903.01931, 2019.

Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, B. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.

Villani, C. Topics in Optimal Transportation. American Mathematical Society, 2003.

Villani, C. Optimal Transport: Old and New, volume 338. Springer Science & Business Media, 2008.

Wold, S., Esbensen, K., and Geladi, P. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3):37–52, 1987.

Wolterink, J. M., Dinkla, A. M., Savenije, M. H., Seevinck, P. R., van den Berg, C. A., and Išgum, I. Deep MR to CT synthesis using unpaired data. In International Workshop on Simulation and Synthesis in Medical Imaging, pp. 14–23. Springer, 2017.

Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., and He, X. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1316–1324, 2018.

Yi, X. and Babyn, P. Sharpness-aware low-dose CT denoising using conditional generative adversarial network. Journal of Digital Imaging, 31(5):655–669, 2018.

Yi, X., Walia, E., and Babyn, P. Unsupervised and semi-supervised learning with categorical generative adversarial networks assisted by Wasserstein distance for dermoscopy image classification. arXiv preprint arXiv:1804.03700, 2018.

Zha, Z.-J., Liu, J., Chen, D., and Wu, F. Adversarial attribute-text embedding for person search with natural language query. IEEE Transactions on Multimedia, 22(7):1836–1846, 2020.

Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., and Metaxas, D. N. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5907–5915, 2017.

Zhang, H., Sindagi, V., and Patel, V. M. Image de-raining using a conditional generative adversarial network. IEEE Transactions on Circuits and Systems for Video Technology, 2019.

Zhang, Z. and Zha, H. Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM Journal on Scientific Computing, 26(1):313–338, 2004.

Zheng, H., Fu, J., Zeng, Y., Luo, J., and Zha, Z.-J. Learning semantic-aware normalization for generative adversarial networks. Advances in Neural Information Processing Systems, 33:21853–21864, 2020.

Zhu, J., Zhao, D., Zhang, B., and Zhou, B. Disentangled inference for GANs with latently invertible autoencoder. arXiv:1906.08090v3, 2019.