# Principled Knowledge Extrapolation with GANs

Ruili Feng 1, Jie Xiao 1, Kecheng Zheng 1, Deli Zhao 2, Jingren Zhou 3, Qibin Sun 1, Zheng-Jun Zha 1

1 University of Science and Technology of China, Hefei, China. 2 Ant Research, Hangzhou, China. 3 Alibaba Group, Hangzhou, China. Correspondence to: Zheng-Jun Zha.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract. Humans can extrapolate well: we generalize daily knowledge into unseen scenarios and raise and answer counterfactual questions. To imitate this ability via generative models, previous works have extensively studied explicitly encoding Structural Causal Models (SCMs) into the architectures of generator networks. This methodology, however, limits the flexibility of the generator, which must be carefully crafted to follow the causal graph, and demands a ground-truth SCM with the strong ignorability assumption as a prior, which is nontrivial in many real scenarios. As a result, many current causal GAN methods fail to generate high-fidelity counterfactual results, as they cannot easily leverage state-of-the-art generative models. In this paper, we propose to study counterfactual synthesis from a new perspective of knowledge extrapolation, where a given knowledge dimension of the data distribution is extrapolated while the remaining knowledge is kept indistinguishable from the original distribution. We show that an adversarial game with a closed-form discriminator can address the knowledge extrapolation problem, and that a novel principal knowledge descent method can efficiently estimate the extrapolated distribution through this adversarial game. Our method enjoys both elegant theoretical guarantees and superior performance in many scenarios.

## 1. Introduction

Human beings exhibit a remarkable ability of cognitive extrapolation (Ehrlich, 2005; Beck et al., 2006) in a variety of aspects. For example, we can accurately extrapolate the motion of objects (Ehrlich, 2005), imagine unseen objects (Kocaoglu et al., 2018), and raise and answer counterfactual questions (Beck et al., 2006). There is a temptation to wonder whether Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) can generalize well to examples whose distribution is arbitrarily far from that of the given training data (or even counterfactual examples), as shown in Fig. 1. Specifically, this knowledge extrapolation ability of GANs can be reflected as synthesizing out-of-distribution examples and manipulating semantic features to constitute counterfactual combinations.

Figure 1. Knowledge extrapolation. All the above objects are counterfactual and rarely exist in the real world. Their original domains are written beneath, and the extrapolated knowledge is marked in purple.

Counterfactual synthesis (Kocaoglu et al., 2018; Sauer & Geiger, 2020; Nemirovsky et al., 2020; Yang et al., 2021; Averitt et al., 2020; Thiagarajan et al., 2021) is one of the most promising tasks toward the general goal of knowledge extrapolation in GANs. Ignoring differences in details, most existing methods for counterfactual synthesis follow the same fashion of directly modeling a Structural Causal Model (SCM) (Pearl, 2009a) in well-designed architectures of generator networks. Factors of interest in the causal graph are designed as labels to control the synthesis of the generator. However, this type of approach is somewhat inflexible. Specifically, it demands a prior SCM that identifies all the causalities in the data generation process. Theoretically, constructing a prior SCM with photo-realistic synthesis effect can be embarrassingly difficult if many potential factors and obscure entanglements are involved (Pearl, 2009a; Sekhon, 2008; Holland, 1986).
In addition, rigorous inference of causal effects needs strong assumptions like strong ignorability (i.e., assuming no unobserved confounders) (Sekhon, 2008; Holland, 1986), which is hard to verify in practice (Pearl, 2009b;a). Therefore, these methods are limited in many knowledge extrapolation scenarios where it is inconvenient to deduce a prior SCM. In addition to demanding prior SCMs, these approaches also put constraints on the network design, which has to be consistent with the prior SCM. Concretely, they need to construct generators in which each network component corresponds to a causal graph factor so as to yield causal interventions (Pearl, 2009a; Holland, 1986). While new GAN architectures evolve rapidly, it is greatly challenging to directly apply those methods to state-of-the-art GANs (e.g., StyleGANs (Karras et al., 2019; 2020b) or BigGAN (Brock et al., 2018)), thus limiting their overall performance regarding high-fidelity counterfactual synthesis.

In this paper, we propose a principled knowledge extrapolation method to circumvent these two flaws of current methods, and provide a new viewpoint for knowledge extrapolation with GANs. Instead of modeling prior causal graphs in generators, we turn to a simple hypothesis: the original distribution and the extrapolated distribution are indistinguishable except for the extrapolated knowledge dimension. For example, if there is a hypothetical distribution in which all people (including women and children) could wear beards, then the information that distinguishes this hypothetical distribution from real-world data exists only in the distribution of beards. The beard-irrelevant information (i.e., gender, age, or other knowledge dimensions) cannot contribute to distinguishing the two distributions, as the remaining knowledge is the same between them.
Under this assumption, we prove that an adversarial learning strategy (Goodfellow et al., 2014) with a closed-form discriminator can approximate the hypothetical distribution, provided that we preserve the irrelevant knowledge unchanged during the adversarial game. To achieve this, a novel Principal Knowledge Descent (PKD) method is proposed to solve a sparse descent path in the parameter space from the pretrained generator distribution to an approximation of the hypothetical distribution. The sparse path involves only the parameters most related to the knowledge of interest, thus posing negligible influence on the other knowledge. In conclusion, the contributions of this paper include:

- We propose a new principled GAN knowledge extrapolation method that is flexible to use and can be easily adapted to state-of-the-art GAN architectures;
- We design a novel sparse descent strategy to efficiently estimate the extrapolated distribution based on our theory;
- The proposed method is the first to successfully synthesize high-fidelity counterfactual results in various image data domains.

## 2. Related Work

The prevalence of GANs (Goodfellow et al., 2014) has aroused researchers' ambitions to utilize various novel GANs to synthesize counterfactual data under the guidance of prior structural causal models. CausalGAN (Kocaoglu et al., 2018) proposes to learn a causal implicit model through adversarial training with a given causal graph for facial attribute disentanglement. CounterGAN (Nemirovsky et al., 2020) employs a residual generator to improve counterfactual realism and actionability compared to regular GANs. Counterfactual Generative Network (CGN) (Sauer & Geiger, 2020) proposes to decouple ImageNet generation into four aspects: shape, texture, background, and composer. CGN explicitly models the causality among the four aspects to yield counterfactual combinations of them, like a triumphal arch with elephant texture.
CausalVAE (Yang et al., 2021) employs a structural causal layer to encode prior causalities. Thiagarajan et al. (2021) exploit deep image priors from a U-Net (Ronneberger et al., 2015) and a classification model to synthesize counterfactual images. Those methods provide compelling insights into the causal explanation of black-box generative models, but put extra limitations on generator architectures, thus generally yielding much less plausible synthesis than state-of-the-art GAN models. Also, these methods rely on prior causal models, which grossly limits their generalization to other GAN architectures.

## 3. Knowledge Extrapolation

Given a data domain X and a data distribution P_X, we assume that there is already a pretrained generator network G_{θ_X}: Z → X that captures the data distribution, and a posterior probability P_l(x) = P(l | x)¹ for a knowledge of interest l. The pretrained generator transports the prior distribution P_Z (usually the standard Gaussian) on the latent space Z into the data distribution P_X, i.e., P_{G_{θ_X}} = P_X (Goodfellow et al., 2014), with θ_X being its parameters at convergence. The generator can be obtained from a pretrained GAN (Goodfellow et al., 2014), VAE (Kingma & Welling, 2013), or other smooth parametric methods that yield generative components (Kingma & Dhariwal, 2018; Dinh et al., 2016). The posterior probability P_l can be obtained through classification or regression neural networks on knowledge l, or other smooth parametric methods that yield a posterior estimation of l. For example, the most common case of counterfactual synthesis is a GAN that generates facial images, together with classifiers that identify the posteriors of semantic attributes such as mustache, age, gender, etc. (Kocaoglu et al., 2018).

¹This paper uses P to denote the probability density.

Figure 2. Illustration of the hypothetical distribution. The real data distribution excludes the cases of children or women with mustaches. The hypothetical distribution extrapolates to those counterfactual cases, but keeps the other aspects unchanged, especially the confounder factors (factors that influence both the dependent variable (face image) and the independent variable (mustache), causing a spurious association) like the gender or age of the faces.

Our task is to infer a hypothetical distribution P_H that differs from the real data only in the knowledge of interest, and is indistinguishable from the real data distribution P_X in all the remaining knowledge. To capture this hypothetical distribution, we want to find the parameter value θ_H such that P_{G_{θ_H}} = P_H. Fig. 2 illustrates the mustache example. If the real data are facial images and the knowledge of interest is mustache, then we want a generator capable of synthesizing images where all people, including women and children, can wear a mustache, while all the other semantic knowledge is kept realistic and plausible. The hypothetical distribution extrapolates the real data distribution along the dimension of the knowledge of interest and is counterfactual in the real world. The key difference between our method and previous ones is that we bypass SCMs to directly yield counterfactual synthesis with given pretrained generators, and eliminate the limitation on generator architectures, so that our method applies directly to any given generative model. We now propose a different perspective: solving the hypothetical distribution P_H via adversarial training with GANs.
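To make this setup concrete, the following sketch instantiates the two ingredients with toy stand-ins: a frozen linear map playing the role of the pretrained generator G_{θ_X}, and a fixed logistic model playing the role of the posterior P_l. In the paper's experiments these are a StyleGAN2/BigGAN generator and a ResNet50 classifier; everything below is an illustrative assumption, not the actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a pretrained generator G_theta: Z -> X.
# theta is just a weight matrix here; a real GAN would be a deep network.
theta_X = rng.normal(size=(8, 4))          # "pretrained" parameters (assumed)

def generator(theta, z):
    """Map latent codes z ~ N(0, I) to 8-dimensional 'images'."""
    return z @ theta.T

# Toy stand-in for the posterior P_l(x) = P(l | x) of a knowledge l,
# e.g. a mustache classifier; here a fixed logistic model.
w_l = rng.normal(size=8)

def posterior_l(x):
    """Posterior probability that knowledge l is present in x."""
    return 1.0 / (1.0 + np.exp(-x @ w_l))

z = rng.normal(size=(16, 4))               # batch of latent codes
x = generator(theta_X, z)                  # samples from P_{G_{theta_X}}
p = posterior_l(x)                         # P_l evaluated on the batch
print(x.shape)                             # prints (16, 8)
```

These two callables are all that the method assumes about the problem: a differentiable generator with parameters we may perturb, and a differentiable posterior for the knowledge of interest.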
Theoretically, the adversarial training of GAN models terminates at the global optimum where the generated distribution equals the data distribution (Goodfellow et al., 2014). Thus, if we aim to alter the data distribution into the hypothetical distribution, we may formulate an adversarial game that halts when the generated distribution equals the hypothetical distribution. So we propose to solve

$$\min_{G_\theta} \max_{D_\phi} V(D_\phi, G_\theta) = \mathbb{E}_{x \sim P_H}[\log(D_\phi(x))] + \mathbb{E}_{x \sim P_{G_\theta}}[\log(1 - D_\phi(x))], \tag{1}$$

where θ and φ are parameters, and D_φ is the discriminator network. The problem is that P_H is merely hypothetical, meaning that we do not have any sample from it at hand, so evaluating the term E_{x∼P_H}[log(D_φ(x))] of V(D_φ, G_θ), or its gradients, is impossible with typical Monte Carlo methods (Hammersley, 2013). In the next section, a novel adversarial extrapolation in the knowledge dimension is deduced to address this problem.

### 3.1. Adversarial Extrapolation in the Indiscernibility Space

The typical algorithm for solving Problem (1) is to alternately optimize the generator and discriminator to attain a Nash equilibrium (Goodfellow et al., 2014). For the hypothetical distribution, however, we can simplify the training considerably with an indistinguishability assumption.

Assumption 3.1 (Indistinguishable assumption). The real data distribution P_X is indistinguishable from the hypothetical distribution P_H except for the altered knowledge l.

Definition 3.2 (Indiscernibility space). We denote the collection of all parameters that induce generated distributions satisfying Assumption 3.1 as the indiscernibility space I_l of knowledge l, i.e., I_l = {θ : P_{G_θ} is indistinguishable from the hypothetical distribution P_H except for the knowledge of interest l}.

Apparently, we have θ_X, θ_H ∈ I_l, and θ_H is the global optimum of Problem (1).
Thus, solving Problem (1) is equivalent to solving it inside the indiscernibility space I_l:

$$\min_{G_\theta \in I_l} \max_{D_\phi} V(D_\phi, G_\theta) = \mathbb{E}_{x \sim P_H}[\log(D_\phi(x))] + \mathbb{E}_{x \sim P_{G_\theta}}[\log(1 - D_\phi(x))]. \tag{2}$$

Investigating Assumption 3.1 from knowledge l. We now solve for the optimal discriminator of Problem (2) for generators inside the indiscernibility space. Assume θ ∈ I_l. Deciding which of P_{G_θ} and P_H a sample x is more likely drawn from can leverage the ratio P_H(x)/P_{G_θ}(x): if this ratio is larger than one, then x is more likely from P_H; otherwise it is more likely from P_{G_θ}. As P_{G_θ} is indistinguishable from P_H except for l, this ratio is purely decided by the posterior distribution of knowledge l on sample x. Namely, there is some posterior distribution P_l(x) = P(l | x) of knowledge l such that

$$\frac{P_H(x)}{P_{G_\theta}(x)} = \frac{P_l(x)}{1 - P_l(x)}. \tag{3}$$

Remark 3.3. Eq. (3) means that distinguishing x amounts to distinguishing knowledge l. When sampling from P_H, knowledge l is more likely to appear; when sampling from P_{G_θ}, knowledge l is more likely to be absent. Thus, P_H extrapolates knowledge l beyond the original distribution P_{G_θ}.

An essential property of the indiscernibility space immediately follows: when the generator is constrained to the indiscernibility space, the optimal discriminator admits a closed-form solution. Rigorously, we have the following:

Theorem 3.4. If G_θ ∈ I_l, then the optimal discriminator of the problem max_{D_φ} V(D_φ, G_θ) is D_{φ*}(x) = P_l(x) for some posterior distribution P_l of knowledge l. Thus, Problem (2) is equivalent to

$$\min_{\theta \in I_l} V(D_{\phi^*}, G_\theta) = \mathbb{E}_{x \sim P_H}[\log(P_l(x))] + \mathbb{E}_{x \sim P_{G_\theta}}[\log(1 - P_l(x))]. \tag{4}$$

Let P̄_l = 1 − P_l denote the probability that l does not occur in sample x. As G_θ is not involved in E_{x∼P_H}[log(P_l(x))], we only need to solve

$$\min_{\theta \in I_l} \mathbb{E}_{x \sim P_{G_\theta}}[\log(\bar{P}_l(x))] = -H(P_{G_\theta}, \bar{P}_l), \tag{5}$$

where H is the cross-entropy function (Shore & Johnson, 1981; De Boer et al., 2005; Murphy, 2012).
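As a quick numerical sanity check of Eqs. (3)-(5) on a discrete toy example (the two five-outcome distributions below are arbitrary assumptions, not data from the paper): with the standard optimal GAN discriminator D*(x) = P_H(x)/(P_H(x) + P_G(x)), the ratio identity of Eq. (3) holds with P_l = D*, and the generator-dependent term of Eq. (4) is exactly minus the cross entropy of Eq. (5).

```python
import numpy as np

# Discrete toy: 5 outcomes stand in for the data space X.
P_G = np.array([0.30, 0.25, 0.20, 0.15, 0.10])   # generated distribution
P_H = np.array([0.10, 0.15, 0.20, 0.25, 0.30])   # hypothetical distribution

# Standard optimal GAN discriminator: D*(x) = P_H(x) / (P_H(x) + P_G(x)).
D_star = P_H / (P_H + P_G)

# Eq. (3): the ratio P_H / P_G equals D* / (1 - D*), i.e. a posterior
# P_l(x) = D*(x) of some knowledge l fully explains the distinguishability.
assert np.allclose(P_H / P_G, D_star / (1.0 - D_star))

# Eq. (4): value of the game at the optimal discriminator.
V = np.sum(P_H * np.log(D_star)) + np.sum(P_G * np.log(1.0 - D_star))
assert V >= -2.0 * np.log(2.0)   # the classic GAN lower bound on V(D*, G)

def cross_entropy(P, Q):
    return -np.sum(P * np.log(Q))

# Eq. (5): the generator-dependent term of V is -H(P_G, 1 - P_l),
# the quantity the generator then has to minimize.
term = np.sum(P_G * np.log(1.0 - D_star))
assert np.isclose(term, -cross_entropy(P_G, 1.0 - D_star))
print(round(float(V), 4))
```

The point of the check is that once the generator sits inside the indiscernibility space, the whole max-player collapses into the fixed posterior P_l, leaving only the cross-entropy term of Eq. (5) for the generator.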
Solving this problem means maximizing the information of knowledge l while keeping the generated distribution indistinguishable from the real data on the remaining parts. Intuitively, this procedure points exactly to the desired hypothetical distribution, as claimed by Theorem 3.4.

Investigating Assumption 3.1 from the remaining knowledge. The main difficulty here is handling the constraint θ ∈ I_l in Eq. (5). To this end, we review the Indistinguishable Assumption 3.1 from the perspective of the remaining knowledge. An equivalent statement to "P_{G_θ} and P_H are indistinguishable except for the knowledge of interest l" is that the remaining knowledge r is unchanged from P_{G_θ} to P_H, and all the changes occur in the knowledge of interest l. Let f^r_{G_θ}(x), f^r_H(x), f^r_X(x) be the remaining knowledge on sample x of the distributions P_{G_θ}, P_H, P_X, respectively. We then have an equivalent definition of the indiscernibility space

$$I_l = \{\theta : \forall x \in X,\ f^r_{G_\theta}(x) = f^r_H(x) = f^r_X(x)\}. \tag{6}$$

Remark 3.5. We omit the discussion of the exact form of the remaining knowledge f^r in order to bypass the demand for SCMs or for strong ignorability, i.e., that all confounders are known. Thus, our method does not rely on an exact factorization of the data distribution into causal factors.

Thus, the final objective can be transformed into

$$\min_{f^r_{G_\theta} = f^r_X} -H(P_{G_\theta}, \bar{P}_l). \tag{7}$$

Algorithm 1: Principal Knowledge Descent (PKD)

Input: maximum number of iterations K; pretrained generator G_{θ_X}, which captures the data distribution with P_{G_{θ_X}} = P_X; prior distribution N(0, I) of the generator; posterior distribution P_l(x) for knowledge l; step size ϵ; batch size m; hyper-parameter λ > 0.
Set: k = 0 and θ_k = θ_X.
Repeat until k = K:
1. Randomly sample latent codes z_1, ..., z_m from the prior distribution N(0, I);
2. Compute n = −∇_θ (1/m) Σ_{i=1}^m log(1 − P_l(G_{θ_k}(z_i)));
3. Compute I = (|n| > λ)_b and sgn = (n > 0)_b − (n < 0)_b, where (·)_b is the element-wise Boolean operation;
4. Update θ_{k+1} = θ_k + ϵ · sgn ⊙ I and set k = k + 1, where ⊙ denotes element-wise multiplication.
Output: an extrapolated generator G_{θ_K} whose induced distribution P_{G_{θ_K}} estimates the hypothetical distribution P_H.

However, we still cannot handle the constraint f^r_{G_θ} = f^r_X directly, as we do not know the exact form of f^r_{G_θ}. Fortunately, the solution to this problem can be efficiently estimated; in the next section, we study the associated numerical approximation.

Remark 3.6. If the constraint θ ∈ I_l is dropped, then the cross entropy achieves its optimal value if and only if P_{G_θ} = P_l (Murphy, 2012). This is a degenerate case in which the generator may not even yield valid synthesis, losing all knowledge except the one of interest l. Thus, enforcing the optimization to stay inside the indiscernibility space is a decisive condition for knowledge extrapolation.

### 3.2. Principal Knowledge Descent

In this section, we study how to numerically solve Problem (7). We show that its solution can be approximated through a series of principal knowledge descent steps with a sparse and convex regularization. Given the current parameter θ, we want to find a direction Δθ such that (i) the knowledge of interest is altered accordingly, meaning that the cross-entropy objective is optimized, and (ii) the other knowledge is unchanged, i.e., |f^r_{G_{θ+Δθ}} − f^r_X| is as small as possible. Such a direction keeps θ + Δθ inside I_l and decreases the value of objective (5) if θ ∈ I_l. We call this direction the principal knowledge descent direction. With this method, we can compute a path starting from θ_X along which the other knowledge is kept intact while the knowledge of interest changes drastically; any point on this path corresponds to a certain degree of altering the knowledge of interest l. We now discuss how to compute the principal knowledge descent direction.

Remaining knowledge penalty. Suppose θ ∈ I_l; then we have f^r_{G_θ} = f^r_X according to Eq. (6).
Thus the change of the other knowledge under a small perturbation Δθ can be written as

$$\left| f^r_{G_{\theta+\Delta\theta}} - f^r_X \right| = \left| f^r_{G_{\theta+\Delta\theta}} - f^r_{G_\theta} \right| \approx \left| (\nabla_\theta f^r_{G_\theta})^T \Delta\theta \right| \le \left\| \nabla_\theta f^r_{G_\theta} \right\|_\infty \left\| \Delta\theta \right\|_1 \le L \left\| \Delta\theta \right\|_1, \tag{8}$$

provided that P_{G_θ} is continuously differentiable, where L is an upper bound of ‖∇_θ f^r_{G_θ}‖ on the indiscernibility space. The penultimate inequality stems from Hölder's inequality, which offers an estimate of the upper bound of the alteration of the other knowledge. Thus, if we constrain the ℓ1 norm of Δθ, we constrain the overall change of the other knowledge.

Linear principal part. On the other hand, the descent value caused by updating θ to θ + Δθ is

$$H(P_{G_{\theta+\Delta\theta}}, \bar{P}_l) - H(P_{G_\theta}, \bar{P}_l) = \nabla_\theta H(P_{G_{\theta_0}}, \bar{P}_l)^T \Delta\theta + o(\|\Delta\theta\|). \tag{9}$$

Thus, if we limit ‖Δθ‖ ≤ ϵ ≪ 1, then maximizing the descent value is equivalent to minimizing the linear principal part −∇_θ H(P_{G_{θ_0}}, P̄_l)^T Δθ.

One-step objective. In conclusion, we obtain the descent direction of a single step by combining the remaining knowledge penalty and the linear principal part, i.e.,

$$\min_{-\epsilon \le \Delta\theta \le \epsilon} -\nabla_\theta H(P_{G_{\theta_0}}, \bar{P}_l)^T \Delta\theta + \lambda \|\Delta\theta\|_1, \tag{10}$$

where 0 < ϵ ≪ 1 is the step size, λ = Lλ_0 > 0 combines an estimate L of the upper bound of ‖∇_θ f^r_{G_{θ_0}}‖ with a hyper-parameter λ_0 that adjusts the weight of the regularization term ‖Δθ‖_1, and −ϵ ≤ Δθ ≤ ϵ means that each entry of Δθ lies in [−ϵ, ϵ]. The first term in (10) maximizes the descent value H(P_{G_{θ+Δθ}}, P̄_l) − H(P_{G_θ}, P̄_l) induced by Δθ, moving P_{G_θ} closer to P_l through the cross-entropy term. The second term λ‖Δθ‖_1 penalizes the overall change ‖∇_θ f^r_{G_θ}‖ ‖Δθ‖_1 induced on the other knowledge.

Sparsity of the solution. The ℓ1 penalty ‖Δθ‖_1 is well known to enforce sparsity on the elements of the solution (Santosa & Symes, 1986; Tibshirani, 1996). Namely, only a small number of elements of Δθ will be non-zero, meaning that principal knowledge descent involves only a fraction of the parameters, while most parameters of the generator remain unchanged.
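The one-step problem of Eq. (10) is separable across coordinates, so its minimizer can be verified by brute force. The sketch below, with an arbitrary random vector standing in for the gradient of the cross entropy and illustrative values for ϵ and λ, checks that a "move by ±ϵ only where the gradient magnitude exceeds λ" rule is optimal and touches only a fraction of the coordinates.

```python
import numpy as np

rng = np.random.default_rng(1)
g = rng.normal(size=50)          # stand-in for the cross-entropy gradient
eps, lam = 0.01, 1.0             # step size and l1 weight (illustrative)

def objective(delta):
    # One-step objective of Eq. (10): -g^T dtheta + lam * ||dtheta||_1
    return -g @ delta + lam * np.abs(delta).sum()

# Candidate closed form: move by +/- eps only where |g_i| > lam.
delta_star = eps * np.sign(g) * (np.abs(g) > lam)

# Brute force: the objective separates per coordinate, and on the box
# [-eps, eps] each coordinate's optimum is one of {-eps, 0, +eps}.
candidates = np.array([-eps, 0.0, eps])
per_coord = np.array([
    candidates[np.argmin(-gi * candidates + lam * np.abs(candidates))]
    for gi in g
])
assert np.isclose(objective(delta_star), objective(per_coord))

# Sparsity: coordinates with |g_i| <= lam stay exactly zero, so most
# generator parameters would be untouched by such a step.
print(float((delta_star == 0.0).mean()))
```

Note that this only exercises the per-step convex subproblem; the paper's Theorem on the closed form additionally characterizes the admissible range of λ.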
Typical knowledge, like the expression, age, or gender of facial images, may only contain low-dimensional information (Penev & Sirovich, 2000; Härkönen et al., 2020; Shen et al., 2020). It would be clearly excessive to extract a single knowledge dimension using the whole parameter space. Accompanying the excess parameters are two well-known challenges: over-fitting (Anderson & Burnham, 2004) and numerical instability (Hildebrand, 1987). The expressive capability of the generator's millions of parameters is powerful, and thus easily overfits the bias of the classification or regression network we use as the optimal discriminator. The other issue is numerical instability: as mentioned, the parameters relevant to a certain semantic feature should be sparse due to the low-dimensional semantic information, so updating all parameters simultaneously may bring unexpected changes such as entanglement of attributes, e.g., changing age also adds a beard to facial images.

Closed-form solution. An intriguing property of Problem (10) is that this convex optimization problem has a closed-form solution by virtue of strong duality and the Karush-Kuhn-Tucker conditions (Boyd et al., 2004).

Theorem 3.7. There is λ_max > 0 such that for all λ ∈ (0, λ_max), Problem (10) admits a non-empty solution set. Specifically, a special solution is attained by

$$\Delta\theta_i = \begin{cases} 0, & \text{if } |\nabla_\theta H_i| \le \lambda, \\ -\epsilon, & \text{if } \nabla_\theta H_i < 0,\ 0 \le \lambda < |\nabla_\theta H_i|, \\ \epsilon, & \text{if } \nabla_\theta H_i > 0,\ 0 \le \lambda < |\nabla_\theta H_i|, \end{cases} \tag{11}$$

where Δθ_i and ∇_θ H_i = [∇_θ H(P_{G_{θ_0}}, P̄_l)]_i are the i-th elements of Δθ and ∇_θ H(P_{G_{θ_0}}, P̄_l), respectively.

Principal knowledge descent. We summarize our algorithm for solving Problem (5) in Algorithm 1. The algorithm minimizes the objective of Problem (5) while maintaining the distribution of the remaining knowledge. Rigorously, we have the following theorem.

Theorem 3.8.
Let Δ be the descent value of objective (5) achieved by implementing Algorithm 1, and δ the change of the other knowledge, i.e.,

$$\Delta = H(P_{G_{\theta_K}}, \bar{P}_l) - H(P_{G_{\theta_X}}, \bar{P}_l), \tag{12}$$

$$\delta = \mathbb{E}_{x \sim P_X}\left[\left| f^r_{G_{\theta_K}}(x) - f^r_X(x) \right|\right], \tag{13}$$

where K is the number of iterations. Assume that P_{G_{θ_X}} = P_X, ϵ is small enough, and L = sup_{‖θ−θ_X‖ ≤ Kϵ} ‖∇_θ f^r_{G_θ}‖. Then there is λ_max > 0 such that for all λ ∈ (0, λ_max), we have

$$\Delta > 0 \quad \text{and} \quad \delta \le \frac{L}{\lambda}\,\Delta + o(1). \tag{14}$$

Figure 3. Knowledge extrapolation of BigGAN on ImageNet. Odd rows report the original data domain and the extrapolated counterfactual knowledge (marked with "CF:" and purple color), while even rows report the counterfactual results generated by the proposed method. To the best of our knowledge, this is the first work that yields high-fidelity photo-realistic counterfactual synthesis across various image domains.

This theorem tells us that choosing a large λ yields a small variation of the remaining knowledge. Recall that λ = λ_0 L; the theorem also implies that λ_0 controls the principal ratio of Algorithm 1, i.e., the ratio between the change in the knowledge of interest and the change in the other knowledge.

### 3.3. Dirac Knowledge Extrapolation

While our method is designed for distribution-wise extrapolation, we can also use it for single-image extrapolation by setting the data distribution to the Dirac distribution (Arfken & Weber, 1999)

$$P_X(x) = \delta_{x_0}(x) = \begin{cases} 1, & x = x_0, \\ 0, & x \ne x_0. \end{cases} \tag{15}$$

We may then replace the prior distribution N(0, I) in Algorithm 1 with δ_{G^{-1}_{θ_X}(x_0)}(z) to implement single-image knowledge extrapolation. As the Dirac distribution is not smooth, implementing it directly may cause numerical instability. To enhance numerical stability, we instead replace the prior distribution N(0, I) in Algorithm 1 with N(G^{-1}_{θ_X}(x_0), ξI), where 0 < ξ ≪ 1 is a small number, to approximate the Dirac distribution.

## 4. Findings and Results

In this section, we present several discoveries from our proposed Principal Knowledge Descent (PKD) method and extrapolation results on state-of-the-art GANs, including BigGAN256-Deep (Brock et al., 2018), StyleGAN2 (Karras et al., 2019; 2020b) on FFHQ faces (Karras et al., 2019), and StyleGAN2-ADA (Karras et al., 2020a) on BreCaHAD (Aksac et al., 2019), which contains breast cancer slices. The details of the experiments are reported in Appendix Sec. B, including the choices of the hyper-parameters λ, ϵ, K, m, the sources of all pretrained models (i.e., generators and posterior estimation models), and dataset information.

Figure 4. Pixel Principal Ratio (PPR) and Parameter Sparsity Ratio (PSR) of PKD under different λ. We find a turning point for the PPR metric, which also corresponds to the minimum volume of parameters needed to capture the knowledge of interest.

| Metric | FID | IS | Path Length |
|---|---|---|---|
| Original GAN | 5.31 | 4.36 | 185.59 |
| PKD-Mustache | 7.17 | 4.10 | 187.50 |
| PKD-Lipstick | 7.26 | 4.26 | 175.20 |
| PKD-Gray Hair | 7.04 | 4.18 | 198.24 |

Table 1. Numerical metrics for the quality of extrapolated distributions of the PKD method on the FFHQ domain. Basically, we find that the decline in synthesis quality is negligible.

λ-sparsity. For a given data domain and generator model, the volume of parameters that are active for a knowledge of interest is an interesting property. To investigate it, we randomly sample 500 images from the StyleGAN2 generator of the FFHQ domain, and conduct Dirac knowledge extrapolation on those data points with different hyper-parameter λ.

Figure 5. Knowledge extrapolation of StyleGAN2 on the FFHQ dataset. Odd rows report the original data domain and the extrapolated counterfactual knowledge (marked in purple), while even rows report the counterfactual results generated by the proposed method.

Figure 6. Statistics of the original distribution and the extrapolated distribution after extrapolating mustache on FFHQ faces. PKD increases the ratio of faces with mustache from 15.03% to 99.93%, while inducing less than 5% of faces to become male or mature. Considering the subtle case where a mustache alone can confuse age or gender, the consequence of entanglement is trivial. Thus, almost all faces in the extrapolated distribution are mustached, while the other statistics, like the ratios of children and women, are well preserved.

We report the results of two metrics: Pixel Principal Ratio (PPR) and Parameter Sparsity Ratio (PSR). PPR measures the ratio between the change of the cross entropy induced by PKD and the change of the pixel values of the image, i.e.,

$$\mathrm{PPR} = \frac{\left| \log(\bar{P}_l(G_\theta(z))) - \log(\bar{P}_l(G_{\theta_X}(z))) \right|}{\frac{1}{hwc}\left\| G_\theta(z) - G_{\theta_X}(z) \right\|_2^2}, \tag{16}$$

where h, w, c are the height, width, and channel numbers of the image domain. PSR measures the ratio of parameters that receive non-zero updates during PKD; the remaining parameters can be viewed as inactive for the extrapolated knowledge. We report the mean values and standard deviations of these two measurements in Fig. 4. As indicated by Theorem 3.8, PPR monotonically increases before a λ_max of about 2e-4. This suggests that before λ_max, increasing λ further excludes redundant parameters from the PKD process; after λ_max, however, increasing λ also hurts the extraction of the knowledge of interest. On the other hand, PSR monotonically decreases as λ increases, which is the consequence of the sparsity enforced by the ℓ1 penalty. Thus, the turning point of PPR gives the minimum ratio of parameters that are active for the knowledge of interest.

Figure 7. Dirac knowledge extrapolation of ImageNet animals. The top row shows the original data; the middle and bottom rows show the extrapolated data, with the extrapolated knowledge marked in purple.
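The two diagnostics above are straightforward to compute once the before/after syntheses and the parameter update are at hand. The sketch below spells them out on toy arrays (the 4x4 RGB "images", posterior values, and update vector are all illustrative assumptions; whether the numerator of Eq. (16) uses the posterior or its complement only changes a relabeling of the knowledge).

```python
import numpy as np

def pixel_principal_ratio(p_new, p_old, img_new, img_old):
    """PPR of Eq. (16): change of the log-posterior term per unit of
    mean squared pixel change, for one image before/after PKD."""
    h, w, c = img_new.shape
    num = abs(np.log(p_new) - np.log(p_old))
    den = np.sum((img_new - img_old) ** 2) / (h * w * c)
    return num / den

def parameter_sparsity_ratio(delta_theta):
    """PSR: fraction of parameters with a non-zero PKD update."""
    return float(np.mean(delta_theta != 0.0))

# Toy stand-ins for a 4x4 RGB synthesis before and after PKD.
rng = np.random.default_rng(2)
img_old = rng.uniform(size=(4, 4, 3))
img_new = img_old + 0.05 * rng.normal(size=(4, 4, 3))

ppr = pixel_principal_ratio(0.9, 0.2, img_new, img_old)
psr = parameter_sparsity_ratio(np.array([0.0, 0.0, 0.01, 0.0, -0.01]))
print(psr)   # 2 of 5 entries are non-zero, so psr == 0.4
```

Sweeping λ, recomputing these two quantities, and looking for the PPR turning point reproduces the selection procedure described above.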
Thus, in all the previous experiments, we set the hyper-parameter λ to a value slightly smaller than λ_max to secure better performance.

PKD offers a new methodology for counterfactual synthesis. We report knowledge extrapolation results of our PKD method on the ImageNet (Deng et al., 2009) data domain and the FFHQ (Karras et al., 2019) face data domain in Fig. 3 and 5, respectively. In the ImageNet domain, we use a pretrained BigGAN256-Deep model as the pretrained generator and ResNet50 (He et al., 2016) classifiers as the posterior distributions for the knowledge of interest. We infer counterfactual combinations of knowledge among different ImageNet categories. As displayed in Fig. 3, we successfully synthesize non-existent species such as goldfinches with cheetah spots, huskies with ursus arctos fur, oranges with strawberry surfaces, etc. In the FFHQ facial image domain, we use the StyleGAN2 model as the pretrained generator, and ResNet50 classifiers trained on CelebA-HQ (Karras et al., 2018) annotations for facial attributes like mustache, lipstick, and gray hair as the posterior distributions for the knowledge of interest. We infer the results of altering the distributions of those facial attributes in Fig. 5, which shows counterfactual images such as women and children with mustaches, men in lipstick, and children with gray hair.

Figure 8. Counterfactual synthesis of a latent image editing method (IFG) and our Dirac knowledge extrapolation method. Variations of image pixels are highlighted in the bottom row. The latent image editing method fails the counterfactual inference of women and children with mustaches, and removes the mustache when extrapolating lipstick, while our method easily handles these cases and induces changes more concentrated in the regions of interest.

Figure 9. Few-shot Dirac knowledge extrapolation of tissue slices of breast cancer patients. The top row shows the original slices. The tissue slices are stained by H&E (Dapson & Horobin, 2009), containing bluish-violet multi-core structures of cancer nuclei and the light red color of extracellular materials. Here we extrapolate each slice toward milder symptoms of cancer. The results are reported in the bottom row. In most local regions of the slices, the number of cancer nuclei decreases significantly; details are zoomed in with light green boxes. The whole training set supporting our method consists of merely 162 annotated slice images.

As neural networks are unable to detect counterfactual results, we further conduct a user study to confirm whether the knowledge extrapolation globally succeeds. To do this, we randomly sample 3,000 images each from the original generator distribution and from the extrapolated distribution after extrapolating mustache on FFHQ faces. We mix and randomly shuffle those images, and then ask testers three questions for each image: whether the person is 1) a child, 2) a female, and 3) mustached (refer to the Appendix for details of the user study). The result is reported in Fig. 6. We find that PKD only significantly changes the knowledge mustache while keeping the two confounders child and female nearly intact. Overall, the proposed PKD method generates high-fidelity counterfactual results even though the exact causality relations are not pre-defined. Thus, counterfactual synthesis need not rely on a prior SCM to guide the generator; we demonstrate that PKD is a competitive alternative for this task.

Dirac knowledge extrapolation is very powerful. We report the results of Dirac knowledge extrapolation with PKD in Fig. 7, 8, and 9. Dirac knowledge extrapolation focuses on extrapolating knowledge of a given image rather than the whole distribution. For the ImageNet and FFHQ data domains, the pretrained models and posterior distributions are selected as in the previous knowledge extrapolation, and the results are reported in Fig. 7 and 8, respectively.
Here we are interested in comparing knowledge extrapolation with recent latent image editing methods (Shen et al., 2020; Patashnik et al., 2021; Tewari et al., 2020), which edit facial attributes via attribute vectors in the latent spaces of pretrained GANs. Despite their success in editing usual attributes, those methods are generally weak at counterfactual synthesis. A major reason is that their synthesis always lies in the pretrained generator distribution, as the parameters of the GAN model are not altered, and the pretrained generator distribution can hardly yield counterfactual results. Specifically, we report the comparison with a baseline latent image editing method called InterFaceGAN (IFG) (Shen et al., 2020) in Fig. 8 (more comparisons are given in the Appendix). We further conduct Dirac knowledge extrapolation in the BreCaHAD data domain (Aksac et al., 2019). This dataset consists of 162 slice images, each with annotations for cancer nuclei. We train a ResNet50 classifier on the few-shot annotated images to infer the posterior probability of the cancer severity of a tissue slice image, and use the StyleGAN2-ADA model trained on this dataset as the pretrained generator. We infer the results of reducing cancer severity in Fig. 9. Despite the few-shot training data, the PKD method successfully infers the expected results. In all cases, Dirac Knowledge Extrapolation demonstrates compelling performance.

PKD is efficient and flexible when applied to SOTA models. The fidelity of counterfactual synthesis of PKD significantly surpasses previous causal GAN works. The major reason is its flexibility to combine with state-of-the-art generative models. Previous methods need to adapt generator architectures to be compatible with their prior SCMs or causal graphs, which is a non-trivial task.
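For contrast, the latent image editing baseline discussed above can be summarized in a few lines: it shifts a latent code along a learned semantic boundary direction while the generator weights stay frozen, so every edited sample necessarily remains inside the pretrained generator distribution. A minimal numpy sketch, with all names and values illustrative rather than taken from the IFG codebase:

```python
import numpy as np

def edit_latent(z, boundary, alpha):
    """Shift latent code z by strength alpha along a unit-normalized boundary.

    The generator itself is untouched, which is why such edits
    cannot leave the pretrained generator distribution.
    """
    direction = boundary / np.linalg.norm(boundary)
    return z + alpha * direction

rng = np.random.default_rng(0)
z = rng.standard_normal(512)          # a latent code (dimension is illustrative)
boundary = rng.standard_normal(512)   # stand-in for a learned SVM boundary vector
z_edited = edit_latent(z, boundary, alpha=3.0)
```

PKD instead perturbs the generator parameters themselves, which is what allows its synthesis to step outside the original distribution.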
To the best of our knowledge, this is the first work capable of counterfactual synthesis with photo-realistic effect without changing the generator architecture. Moreover, we find that the PKD method is very efficient when applied to SOTA models, i.e., StyleGAN2 and BigGAN. Appealing counterfactual synthesis can in most cases be attained within 10-20 PKD iterations, and the damage to synthesis fidelity caused by PKD is negligible. In Tab. 1, we report numerical metrics (i.e., Fréchet Inception Distance (FID), Inception Score (IS), and Path Length as in (Karras et al., 2019)) of synthesis quality after knowledge extrapolation, indicating that the impact on synthesis quality is minor.

5. Conclusion

This paper studies the problem of knowledge extrapolation of GANs, where the original generated distribution of a pretrained GAN is altered under the guidance of a novel Principal Knowledge Descent method to obtain counterfactual synthesis. Different from traditional methods that conduct counterfactual synthesis based on prior SCMs, this paper proposes to leverage a simple assumption that the extrapolated distribution and the original distribution are indistinguishable except for the knowledge of interest. Thus, our work gets rid of the usual demand of traditional methods to change the generator architecture to obey prior causalities. As a result, the proposed method is much more convenient to apply to SOTA generator models, and can yield much more photo-realistic counterfactual results.

Acknowledgement

This work was supported by the National Key R&D Program of China under Grant 2020AAA0105702, the National Natural Science Foundation of China (NSFC) under Grant U19B2038, and the University Synergy Innovation Program of Anhui Province under Grant GXXT-2019-025.

References

Aksac, A., Demetrick, D. J., Ozyer, T., and Alhajj, R. BreCaHAD: a dataset for breast cancer histopathological annotation and diagnosis. BMC Research Notes, 12(1):1-3, 2019.
Anderson, D. and Burnham, K. Model selection and multimodel inference. Second edition. NY: Springer-Verlag, 63(2020):10, 2004.

Arfken, G. B. and Weber, H. J. Mathematical methods for physicists, 1999.

Averitt, A. J., Vanitchanant, N., Ranganath, R., and Perotte, A. J. The counterfactual χ-GAN: Finding comparable cohorts in observational health data. Journal of Biomedical Informatics, 109:103515, 2020.

Beck, S. R., Robinson, E. J., Carroll, D. J., and Apperly, I. A. Children's thinking about counterfactuals and future hypotheticals as possibilities. Child Development, 77(2):413-426, 2006.

Boyd, S., Boyd, S. P., and Vandenberghe, L. Convex optimization. Cambridge University Press, 2004.

Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

Dapson, R. and Horobin, R. Dyes from a twenty-first century perspective. Biotechnic & Histochemistry, 84(4):135-137, 2009.

De Boer, P.-T., Kroese, D. P., Mannor, S., and Rubinstein, R. Y. A tutorial on the cross-entropy method. Annals of Operations Research, 134(1):19-67, 2005.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE, 2009.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

Ehrlich, R. The human brain's algorithm for extrapolating motion, and its possible gender-dependence. Neuroscience Letters, 374(1):38-42, 2005.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.

Hammersley, J. Monte Carlo methods. Springer Science & Business Media, 2013.

Härkönen, E., Hertzmann, A., Lehtinen, J., and Paris, S.
GANSpace: Discovering interpretable GAN controls. arXiv preprint arXiv:2004.02546, 2020.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Hildebrand, F. B. Introduction to numerical analysis. Courier Corporation, 1987.

Holland, P. W. Statistics and causal inference. Journal of the American Statistical Association, 81(396):945-960, 1986.

Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.

Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401-4410, 2019.

Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J., and Aila, T. Training generative adversarial networks with limited data. arXiv preprint arXiv:2006.06676, 2020a.

Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110-8119, 2020b.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039, 2018.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Kocaoglu, M., Snyder, C., Dimakis, A. G., and Vishwanath, S. CausalGAN: Learning causal implicit generative models with adversarial training. In International Conference on Learning Representations, 2018.

Murphy, K. P. Machine learning: a probabilistic perspective. MIT Press, 2012.

Nemirovsky, D., Thiebaut, N., Xu, Y., and Gupta, A. CounterGAN: Generating realistic counterfactuals with residual generative adversarial nets.
arXiv preprint arXiv:2009.05199, 2020.

Patashnik, O., Wu, Z., Shechtman, E., Cohen-Or, D., and Lischinski, D. StyleCLIP: Text-driven manipulation of StyleGAN imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2085-2094, 2021.

Pearl, J. Causal inference in statistics: An overview. Statistics Surveys, 3:96-146, 2009a.

Pearl, J. Causality. Cambridge University Press, 2009b.

Penev, P. S. and Sirovich, L. The global dimensionality of face space. In Proceedings Fourth IEEE International Conference on Automatic Face and Gesture Recognition (Cat. No. PR00580), pp. 264-270. IEEE, 2000.

Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234-241. Springer, 2015.

Santosa, F. and Symes, W. W. Linear inversion of band-limited reflection seismograms. SIAM Journal on Scientific and Statistical Computing, 7(4):1307-1330, 1986.

Sauer, A. and Geiger, A. Counterfactual generative networks. In International Conference on Learning Representations, 2020.

Sekhon, J. S. The Neyman-Rubin model of causal inference and estimation via matching methods. The Oxford Handbook of Political Methodology, 2:1-32, 2008.

Shen, Y., Yang, C., Tang, X., and Zhou, B. InterFaceGAN: Interpreting the disentangled face representation learned by GANs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

Shore, J. and Johnson, R. Properties of cross-entropy minimization. IEEE Transactions on Information Theory, 27(4):472-482, 1981.

Tewari, A., Elgharib, M., Bernard, F., Seidel, H.-P., Pérez, P., Zollhöfer, M., and Theobalt, C. PIE: Portrait image embedding for semantic control. ACM Transactions on Graphics (TOG), 39(6):1-14, 2020.

Thiagarajan, J., Narayanaswamy, V. S., Rajan, D., Liang, J., Chaudhari, A., and Spanias, A. Designing counterfactual generators using deep model inversion.
Advances in Neural Information Processing Systems, 34, 2021.

Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267-288, 1996.

Wu, Z., Lischinski, D., and Shechtman, E. StyleSpace analysis: Disentangled controls for StyleGAN image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12863-12872, 2021.

Yang, M., Liu, F., Chen, Z., Shen, X., Hao, J., and Wang, J. CausalVAE: Disentangled representation learning via neural structural causal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9593-9602, 2021.

A. Proofs of Theorems

A.1. Theorem 3.4

Proof. It is well known that the optimal discriminator is (Goodfellow et al., 2014)

$$D^*_\phi = \frac{P_H}{P_H + P_{G_\theta}}. \quad (17)$$

When $\theta \in I_l$, we have some probability distribution $P_l$ for knowledge $l$ such that

$$P_{G_\theta} = \frac{1 - P_l}{P_l} P_H. \quad (18)$$

Thus we have

$$D^*_\phi = \frac{P_H}{P_H + \frac{1 - P_l}{P_l} P_H} = P_l. \quad (19)$$

A.2. Theorem 3.7

Proof. Let $x = \Delta\theta$, $V = \nabla_\theta H(P_{G_{\theta_0}}, P_l)$, $N$ be the volume of parameters of the generator, $\beta = (\beta_1, \ldots, \beta_N)^T$, $\gamma = (\gamma_1, \ldots, \gamma_N)^T$ be the Lagrangian multiplier vectors, and $\mathbf{1} = (1, \ldots, 1)^T$ be the vector of all ones. We write the Lagrangian dual function of Problem (10) as

$$g(\beta, \gamma) = \inf_x L(x, \beta, \gamma) = \inf_x \Big[ V^T x + \lambda \|x\|_1 + \sum_{i=1}^N \beta_i (x_i - \epsilon) + \sum_{i=1}^N \gamma_i (-\epsilon - x_i) \Big] \quad (20)$$

$$= \inf_x \, (V + \beta - \gamma)^T x + \lambda \|x\|_1 - \epsilon (\beta + \gamma)^T \mathbf{1}, \quad (21)$$

$$\text{s.t.} \quad \beta_i \geq 0, \ \gamma_i \geq 0, \ i = 1, \ldots, N. \quad (22)$$

As $L(x, \beta, \gamma)$ is a convex function, its optimal value $x^*$ is reached if and only if

$$0 \in \partial_x L(x^*, \beta, \gamma) = V + \beta - \gamma + \lambda \, \partial_x \|x^*\|_1, \quad (23)$$

where $\partial_x$ denotes the sub-gradient operator of a convex function with respect to $x$. Considering

$$\partial_x \|x^*\|_1 = \{ n \in \mathbb{R}^N : \|n\|_\infty \leq 1 \}, \quad (24)$$

we have

$$0 \in \{ V + \beta - \gamma + \lambda n : \|n\|_\infty \leq 1 \}. \quad (25)$$

Thus we get that

$$\inf_x L(x, \beta, \gamma) = \begin{cases} -\epsilon (\beta + \gamma)^T \mathbf{1}, & \text{if } \|V + \beta - \gamma\|_\infty \leq \lambda, \\ -\infty, & \text{otherwise.} \end{cases} \quad (26)$$

Then, the Lagrangian duality form of Problem (10) is

$$\min_{\beta, \gamma} \ \epsilon (\beta + \gamma)^T \mathbf{1}, \quad (27)$$

$$\text{s.t.} \quad \beta_i \geq 0, \ \gamma_i \geq 0, \ |V_i + \beta_i - \gamma_i| \leq \lambda.
\quad (28)$$

This problem can be easily solved by linear programming techniques. A special solution $(\beta^*, \gamma^*)$ can be obtained by

$$\beta^*_i = \gamma^*_i = 0, \ \text{if } -\lambda < V_i < \lambda; \qquad \beta^*_i = 0, \ \gamma^*_i = V_i - \lambda, \ \text{if } V_i \geq \lambda; \qquad \beta^*_i = -\lambda - V_i, \ \gamma^*_i = 0, \ \text{if } V_i \leq -\lambda, \quad (29)$$

for $i = 1, \ldots, N$. Recalling the Karush-Kuhn-Tucker conditions of convex optimization, we have that $x^*, \beta^*, \gamma^*$ satisfy the following conditions:

$$\beta^*_i (x^*_i - \epsilon) = 0, \quad \gamma^*_i (-\epsilon - x^*_i) = 0, \quad i = 1, \ldots, N, \quad (30)$$

$$|V_i + \beta^*_i - \gamma^*_i| < \lambda \ \Rightarrow \ x^*_i = 0. \quad (31)$$

We then have

$$x^*_i - \epsilon = 0, \ \text{if } \beta^*_i \neq 0; \qquad -\epsilon - x^*_i = 0, \ \text{if } \gamma^*_i \neq 0; \qquad x^*_i = 0, \ \text{if } \beta^*_i = \gamma^*_i = 0, \ |V_i| < \lambda. \quad (32)$$

Combining Eq. (29), we then conclude the theorem:

$$x^*_i = 0, \ \text{if } |V_i| \leq \lambda; \qquad x^*_i = \epsilon, \ \text{if } V_i < 0, \ \lambda < |V_i|; \qquad x^*_i = -\epsilon, \ \text{if } V_i > 0, \ \lambda < |V_i|. \quad (33)$$

A.3. Theorem 3.8

Proof. We first consider one step of Principal Knowledge Descent (PKD). Assume that the current step is $k$ and $k < K$. Then we define

$$\Delta_k = H(P_{G_{\theta_k}}, P_l) - H(P_{G_{\theta_{k+1}}}, P_l) = -\nabla_\theta H(P_{G_{\theta_k}}, P_l)^T \Delta\theta_k + o(\epsilon), \quad (34)$$

$$\delta_k = \mathbb{E}_{x \sim P_X} \big[ \| f \circ G_{\theta_{k+1}}(x) - f \circ G_{\theta_k}(x) \| \big], \quad (35)$$

where $\Delta\theta_k$ is given as in Theorem 3.7. We first prove that $\Delta_k > 0$ for small $\epsilon$. This is obvious since, among the non-zero elements of $\Delta\theta_k$, we have $\Delta\theta^k_i = \epsilon$ if $\nabla_\theta H(P_{G_{\theta_k}}, P_l)_i < 0$, and $\Delta\theta^k_i = -\epsilon$ if $\nabla_\theta H(P_{G_{\theta_k}}, P_l)_i > 0$. Thus the term $-\nabla_\theta H(P_{G_{\theta_k}}, P_l)^T \Delta\theta_k$ must be positive. In fact, we have

$$-\nabla_\theta H(P_{G_{\theta_k}}, P_l)^T \Delta\theta_k \geq \lambda \epsilon, \quad (36)$$

as long as $\|\nabla_\theta H(P_{G_{\theta_k}}, P_l)\|_\infty > \lambda$. While $o(\epsilon)$ is the high-order infinitesimal of $\epsilon$, we have

$$\Delta_k = -\nabla_\theta H(P_{G_{\theta_k}}, P_l)^T \Delta\theta_k + o(\epsilon) \geq \lambda \epsilon + o(\epsilon) > 0. \quad (37)$$

Assume that there are $M$ elements of $\Delta\theta_k$ that are non-zero. Then we have

$$\Delta_k = \sum_{i=1}^N \big| \nabla_\theta H(P_{G_{\theta_k}}, P_l)_i \, \Delta\theta^k_i \big| + o(\epsilon) \geq M \lambda \epsilon + o(\epsilon), \quad (38)$$

as the non-zero elements of $\Delta\theta_k$ correspond to elements of $\nabla_\theta H(P_{G_{\theta_k}}, P_l)$ that have absolute values larger than $\lambda$. Meanwhile,

$$\|\theta_k - \theta_X\|_\infty \leq \sum_{i=1}^{k} \|\theta_i - \theta_{i-1}\|_\infty \leq k\epsilon < K\epsilon, \quad (39)$$

where $\theta_0 = \theta_X$. Thus we have

$$\sup_{\|\theta - \theta_k\|_\infty < \epsilon} \|\nabla_\theta f \circ G_\theta\| \leq L, \quad (40)$$

as

$$\{ \|\theta - \theta_k\|_\infty < \epsilon \} \subseteq \{ \|\theta - \theta_X\|_\infty \leq K\epsilon \}. \quad (41)$$

Thus we also have

$$\delta_k \leq L \, \mathbb{E}_{x \sim P_X}[\|\Delta\theta_k\|_1] = L \, \mathbb{E}_{x \sim P_X}[M\epsilon] = ML\epsilon. \quad (42)$$

Then we can conclude

$$\frac{\Delta_k}{\delta_k} \geq \frac{M\lambda\epsilon + o(\epsilon)}{ML\epsilon} = \frac{\lambda}{L} + o(1). \quad (43)$$

Note that

$$\delta \leq \delta_1 + \cdots + \delta_K, \quad (44)$$

and

$$\Delta = \Delta_1 + \cdots + \Delta_K.
\quad (45)$$

We then have

$$\frac{\Delta}{\delta} \geq \frac{\sum_{i=1}^K \Delta_i}{\sum_{i=1}^K \delta_i} \geq \frac{\sum_{i=1}^K \big( \frac{\lambda}{L} + o(1) \big) \delta_i}{\sum_{i=1}^K \delta_i} = \frac{\lambda}{L} + o(1). \quad (46)$$

Thus we have

$$\frac{\Delta}{\delta} \geq \frac{\lambda}{L} + o(1). \quad (47)$$

B. Experiment Setting

Generative Model Choice. For the FFHQ data domain, we use the pretrained StyleGAN2 generator offered by Awesome Pretrained StyleGAN2² with config-f and 512x512 resolution. For the BreCaHAD data domain, we use the official pretrained StyleGAN2-ADA generator³. For the ImageNet data domain, we use the BigGAN256-Deep model in the official TFHub repository⁴.

Posterior Estimation Model Choice. For the FFHQ data domain, we use the official pretrained ResNet50 classifiers provided by the StyleGAN2 authors⁵ as the posterior distribution Pl. For BreCaHAD, we train a ResNet50 regression model on the annotated subset of the BreCaHAD dataset. The annotations mark all cancer nuclei in the tissue slice images. We train the ResNet50 regression model to predict the number of cancer nuclei in each tissue slice image. We halt the training at an error rate of 10% on the training set to avoid overfitting, as the training dataset is small. The final predicted scores are further normalized to [0, 1] to yield the posterior estimate of cancer severity. For the ImageNet data domain, we use the official ResNet50 classifier provided by TensorFlow⁶. The ResNet50 classifier outputs a 1,000-dimensional vector, each dimension of which predicts the posterior probability of a given image category. We choose the dimension corresponding to our knowledge of interest as the final posterior estimation Pl.

Hyper-parameter Setting. For Lipstick extrapolation in the FFHQ data domain, we set K = 4; for Gray Hair extrapolation in the FFHQ data domain, we set K = 7. For all other experiments, we set K = 10. For the FFHQ and BreCaHAD domains, we set ϵ = 1e-3; for the ImageNet domain, we set ϵ = 1e-5. The choice of λ is set according to Fig. 4, where we choose a λ slightly smaller than λmax for each experiment.
For the FFHQ domain, we set λ = 1.8e-4; for the BreCaHAD domain, we set λ = 4.8e-4; for the ImageNet domain, we set λ = 1.3e-4. For all Dirac Knowledge Extrapolations, we set ξ = 0.01.

Training of InterFaceGAN. We obtain the semantic boundary vectors as directed by the InterFaceGAN paper. We use the ResNet50 classifiers provided by the StyleGAN2 authors to annotate 50,000 random samples of the StyleGAN2 generator, and then train a Support Vector Machine (SVM) to predict the binary annotations given by the ResNet50 classifiers. The normalized support vectors of those SVMs serve as the semantic boundaries for latent image editing.

²https://github.com/justinpinkney/awesome-pretrained-stylegan2
³https://github.com/NVlabs/stylegan2-ada/
⁴https://tfhub.dev/deepmind/biggan-deep-256/1
⁵https://github.com/NVlabs/stylegan2
⁶https://www.tensorflow.org/api_docs/python/tf/keras/applications/resnet50/ResNet50

Figure 10. Dirac knowledge extrapolation of mustache on randomly sampled latent codes generating FFHQ faces. IFG denotes InterFaceGAN. IFG can successfully add a mustache in regular cases, e.g., mature males, but fails the counterfactual cases.

Style Space Analysis (Wu et al., 2021). We also compare the PKD method in Dirac distribution with another latent image editing method, StyleSpace Analysis (SSA) (Wu et al., 2021); the results are reported in Fig. 14. We use the official code provided by the StyleSpace Analysis authors⁷.

User Study. We conduct user studies on the FFHQ data domain for mustache extrapolation by recruiting 50 volunteers. We randomly sample 3,000 latent codes from the prior distribution N(0, I) and feed them to the pretrained StyleGAN2 generator to produce 3,000 synthesized facial images. The same operation is performed on the extrapolated distribution to yield another independent 3,000 synthesized facial images after knowledge extrapolation.
Then we mix the StyleGAN2-synthesized facial images with the extrapolated synthesized facial images to yield a 6,000-image testing set, which is then randomly shuffled. All 50 volunteers are asked to answer the following questions for each image of the testing set: 1) Does the person in the image have a mustache? 2) Is the person in the image a child, without regard to the mustache? 3) What is the gender of the person in the image, without regard to the mustache? For each question, the answer with the highest number of votes wins.

Hardware Setting. To train StyleGAN2 on the 512x512 resolution FFHQ dataset and the InterFaceGAN semantic boundaries, we use 8 NVIDIA V100 GPUs. To train the ResNet50 regression model for the BreCaHAD data domain, we use 1 NVIDIA GTX 1080 Ti GPU. For all the remaining knowledge extrapolation experiments, we use 1 NVIDIA V100 GPU.

⁷https://github.com/betterze/StyleSpace

Figure 11. Dirac knowledge extrapolation of lipstick on randomly sampled latent codes generating FFHQ faces. IFG denotes InterFaceGAN. IFG can successfully add lipstick in regular cases, e.g., females, but fails the counterfactual cases, e.g., removing the mustache or changing the gender of the male cases.

Figure 12. Dirac knowledge extrapolation of tissue slices of cancer patients. The cancer severity is reduced in the extrapolated cases.

Figure 13. Uncurated lists of BigGAN generations after extrapolating knowledge dimensions.

Figure 14. Extrapolation of knowledge gray hair. We compare our method with the latent image editing method StyleSpace Analysis (SSA). We set the editing strength of SSA to yield a slightly stronger degree of hair grayness than ours in regular cases, e.g., middle-aged males, as shown in the first three rows of the left side.
We then investigate the performance of both methods in counterfactual cases, i.e., young women or children in gray hair. The results show that SSA can hardly handle the counterfactual cases, while our method still works well.