# Weakly Supervised Disentangled Generative Causal Representation Learning

Journal of Machine Learning Research 23 (2022) 1-55. Submitted 1/21; Revised 7/22; Published 7/22.

Xinwei Shen (xshenal@ust.hk), Department of Mathematics, The Hong Kong University of Science and Technology, Hong Kong, China
Furui Liu (liufurui@zhejianglab.com), Zhejiang Laboratory, Hangzhou, China
Hanze Dong (hdongaj@ust.hk), Department of Mathematics, The Hong Kong University of Science and Technology, Hong Kong, China
Qing Lian (qlianab@ust.hk), Department of Computer Science, The Hong Kong University of Science and Technology, Hong Kong, China
Zhitang Chen (chenzhitang2@huawei.com), Huawei Noah's Ark Lab, Shenzhen, China
Tong Zhang (tongzhang@ust.hk), Department of Computer Science and Mathematics, The Hong Kong University of Science and Technology, Hong Kong, China

Editor: Yoshua Bengio

© 2022 Xinwei Shen, Furui Liu, Hanze Dong, Qing Lian, Zhitang Chen, and Tong Zhang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-0080.html.

## Abstract

This paper proposes a Disentangled gEnerative cAusal Representation (DEAR) learning method under appropriate supervised information. Unlike existing disentanglement methods that enforce independence of the latent variables, we consider the general case where the underlying factors of interest can be causally related. We show that previous methods with independent priors fail to disentangle causally related factors even under supervision. Motivated by this finding, we propose a new disentangled learning method called DEAR that enables causal controllable generation and causal representation learning. The key ingredient of this new formulation is to use a structural causal model (SCM) as the prior distribution for a bidirectional generative model. The prior is then trained jointly with a generator and an encoder using a suitable GAN algorithm incorporated with supervised information on the ground-truth factors and their underlying causal structure. We provide theoretical justification on the identifiability and asymptotic convergence of the proposed method. We conduct extensive experiments on both synthesized and real data sets to demonstrate the effectiveness of DEAR in causal controllable generation, and the benefits of the learned representations for downstream tasks in terms of sample efficiency and distributional robustness.

Keywords: disentanglement, causality, representation learning, deep generative model

## 1. Introduction

Consider the observed data $x$ from a distribution $q_x$ on $\mathcal{X} \subseteq \mathbb{R}^d$ and the latent variable $z$ from a prior $p_z$ on $\mathcal{Z} \subseteq \mathbb{R}^k$. In bidirectional generative models (BGMs), we are normally interested in learning an encoder $E: \mathcal{X} \to \mathcal{Z}$ to infer latent variables and a generator $G: \mathcal{Z} \to \mathcal{X}$ to generate data, so as to achieve both representation learning and data generation. Classical BGMs include the Variational Autoencoder (VAE) (Kingma and Welling, 2014) and BiGAN (Donahue et al., 2017; Dumoulin et al., 2017). In representation learning, it has been argued that an effective representation for downstream learning tasks should disentangle the underlying factors of variation (Bengio et al., 2013). In generative modeling, it is highly desirable if one can control the semantic generative factors by aligning them with the latent variables, as in StyleGAN (Karras et al., 2019).
Both goals can be achieved with the disentanglement of the latent variable $z$, which informally means that each dimension of $z$ measures a distinct factor of variation in the data (Bengio et al., 2013). Earlier unsupervised disentanglement methods mostly regularized the VAE objective to encourage independence of the learned representations (Higgins et al., 2017; Burgess et al., 2017; Kim and Mnih, 2018; Chen et al., 2018; Kumar et al., 2018). Later, Locatello et al. (2019) showed that unsupervised learning of disentangled representations is fundamentally impossible without inductive biases: many existing unsupervised methods are brittle, requiring careful supervised hyperparameter tuning or implicit inductive biases. To promote identifiability, recent work resorted to various forms of supervision (Locatello et al., 2020b; Shu et al., 2020; Locatello et al., 2020a). In this work, we also incorporate supervision on the ground-truth factors in the form of a certain number of annotated labels, as described in Section 3.2. We will present experimental results showing that our method remains competitive with a small amount of labeled data (a minimum of around 100 samples).

Most of the existing methods, including those mentioned above, are built on the assumption that the underlying factors of variation are mutually independent. However, in many real-world cases, the semantically meaningful factors of interest are not independent (Bengio et al., 2020). Instead, such high-level variables are often causally related, i.e., connected by a causal graph. In this paper, we prove formally that methods with independent priors fail to disentangle causally related factors. Motivated by this observation, we propose a new method, called DEAR, to learn disentangled generative causal representations. The key ingredient of our formulation is a structural causal model (SCM) (Pearl et al., 2000) as the prior for the latent variables in a bidirectional generative model. As discussed in Section 4.1.2, we assume that a super-graph of the underlying causal graph is known a priori, which ranges from the causal ordering of the nodes in the graph to the true causal structure. The causal model prior is then learned jointly with a generator and an encoder using a suitable GAN (Goodfellow et al., 2014) algorithm. Moreover, we establish theoretical guarantees for DEAR on how it resolves the unidentifiability issue of many existing methods, as well as on the asymptotic convergence of the proposed algorithm.

An immediate application of DEAR is causal controllable generation, which can generate data from many desired interventional distributions of the latent factors. Another useful application of disentangled representations is to use them in downstream tasks, leading to better sample complexity (Bengio et al., 2013; Schölkopf et al., 2012). Moreover, it is believed that causal disentanglement is invariant and thus robust under distribution shifts (Schölkopf, 2019; Arjovsky et al., 2019). In this paper, we demonstrate these conjectures in various downstream prediction tasks for the proposed DEAR method, which has a theoretically guaranteed disentanglement property.

We summarize our main contributions as follows:

- We formally identify a problem with previous disentangled representation learning methods using the independent prior assumption, and prove that they fail to disentangle when the underlying factors of interest are causally related, even under supervision of the latents.
- We propose a new disentangled learning method, DEAR, which integrates an SCM prior into a bidirectional generative model, trained with a suitable GAN algorithm. We provide theoretical justification on the identifiability¹ of the proposed formulation and the asymptotic convergence of our algorithm.

- Extensive experiments are conducted on both synthesized and real data to demonstrate the effectiveness of DEAR in causal controllable generation, and the benefits of the learned representations for downstream tasks in terms of sample efficiency and distributional robustness.

1. Note that the identifiability in this work differs from that in Khemakhem et al. (2020) in terms of goals and assumptions. See more discussions in the related work and below Proposition 5.

**Notation** Throughout the paper, all distributions are assumed to be absolutely continuous with respect to the Lebesgue measure unless indicated otherwise. For a vector $x$, let $[x]_i$ denote the $i$-th component of $x$. For a scalar function $h(x, y)$, let $\nabla_x h(x, y)$ denote its gradient with respect to $x$ and $\nabla_x^2 h(x, y)$ denote its Hessian matrix with respect to $x$. For a vector function $g(x, y)$, let $\nabla_x g(x, y)$ denote its Jacobian matrix with respect to $x$. Without ambiguity, $\nabla_x$ is denoted by $\nabla$ for simplicity. The notation $\|\cdot\|$ stands for the Euclidean norm.

**Definition 1 (Smoothness)** Consider a function $h(x): \mathbb{R}^d \to \mathbb{R}$. $h(x)$ is $\ell_0$-smooth with respect to $x$ if $h(x)$ is differentiable and its gradient is $\ell_0$-Lipschitz continuous, i.e., we have

$$\|\nabla h(x) - \nabla h(x')\| \le \ell_0 \|x - x'\|, \quad \forall x, x' \in \mathbb{R}^d.$$

**Definition 2 (Polyak-Łojasiewicz)** For a set $S \subseteq \mathbb{R}^d$, consider a function $h(x): S \to \mathbb{R}$ and let $h^* = \min_{x \in S} h(x)$. Then $h(x)$ satisfies the Polyak-Łojasiewicz (PL) condition if there exists $c > 0$ such that for all $x \in S$,

$$h(x) - h^* \le c \|\nabla h(x)\|_2^2.$$

**Roadmap** In Section 2, we discuss the related work. In Section 3, we introduce the problem setting of disentangled generative causal representation learning and identify a problem with previous methods. In Section 4, we propose the model, formulation and algorithm of DEAR, and provide theoretical justifications on both identifiability and asymptotic convergence. We then present empirical studies concerning causal controllable generation, downstream tasks and structure learning, as well as ablation studies, in Section 5, and conclude in Section 6. Detailed proofs of all theorems, propositions and lemmas are deferred to Appendix A.

## 2. Related work

**VAE-based disentanglement methods.** A number of methods have been proposed to enrich the VAE loss with various regularizers that enforce the independence of the latent variables. β-VAE (Higgins et al., 2017) and AnnealedVAE (Burgess et al., 2017) introduced extra constraints on the capacity of the latent bottleneck by adjusting the role of the KL term; FactorVAE (Kim and Mnih, 2018) and β-TCVAE (Chen et al., 2018) encouraged the aggregated posterior (i.e., the marginal distribution of $E(x)$) to be factorized by penalizing its total correlation; DIP-VAE (Kumar et al., 2018) enforced a factorized aggregated posterior differently, by matching its moments with those of a factorized prior. Going beyond the independence perspective, Suter et al. (2019) considered disentangled causal mechanisms, meaning that all the generative factors are conditionally independent given a common confounder. This is one special case of causal relationship, while we consider more general cases where the factors can have more complex causal relationships, e.g., one factor can be a direct cause of another.
Based on the above methods, Locatello et al. (2020b) and Locatello et al. (2020a) further incorporated supervised information in the form of a few labels of the generative factors and of pairs of observations that differ in a few factors, respectively; the former is more closely related to ours and is discussed in detail in Section 3.2. Shu et al. (2020) proposed several concepts related to disentanglement, based on which they analyzed three forms of weak supervision: restricted labeling, match pairing, and rank pairing.

Going beyond the independent prior, Khemakhem et al. (2020) proposed a conditional VAE where the latent variables are assumed to be conditionally independent given some additionally observed variables. Building upon developments in nonlinear ICA, they presented the first principled identifiability theory of latent variable models, in particular VAEs, thus leading to a form of provable disentanglement under suitable conditions. Our work, in contrast, does not aim at achieving general identifiability of latent variable models or general provable disentanglement, but contributes to resolving the failure of existing methods in disentangling causally related factors. With this motivation, we consider more general model assumptions on the latent structure, as well as on the generating transformations, than those in Khemakhem et al. (2020), which apply more suitably to real-world data. To achieve disentanglement of causal factors, we need to adopt a more direct and somewhat stronger form of supervision than Khemakhem et al. (2020), i.e., we require annotated labels of the true factors for a possibly small number of samples. See Appendix C for a discussion of the two forms of supervision. The model in Khemakhem et al. (2020), however, has not yet been applied with the most advanced network architectures for image generation, such as StyleGAN (Karras et al., 2019), nor can their conditionally independent prior model the causal structure of the true factors. Therefore, their model and theory do not apply here, and our work should be regarded as complementary.

To avoid the unidentifiability of the standard Gaussian prior caused by rotation transformations, Stühmer et al. (2020) proposed hierarchical non-Gaussian priors for unsupervised disentanglement, which are not rotationally invariant. However, there remain other kinds of mixing transformations that leave these priors invariant, leading to unidentifiability. Besides, their proposed priors cannot model causal relationships.

Recently, a concurrent work by Träuble et al. (2021) conducted a large-scale empirical study to investigate the behavior of the most prominent disentanglement approaches on correlated data. In particular, they considered the case where the ground-truth factors exhibit pairwise correlation. Although pairwise correlation largely generalizes the independence assumption, it is less general than the causal correlation that we consider: for example, a parental node with multiple children immediately goes beyond pairwise correlation. Moreover, Träuble et al. (2021) focused on verifying the problem that existing methods fail to learn disentangled representations for strongly correlated factors, while we identify the problem as a motivation to propose a method that resolves it and learns disentangled representations in the causal case.

**GAN-based disentanglement methods.**
Existing GAN-based methods, including InfoGAN (Chen et al., 2016) and InfoGAN-CR (Lin et al., 2020), differ from our proposed formulation in two main ways. First, they still assume an independent prior for the latent variables, and thus suffer from the same problem as the VAE-based methods mentioned above. In addition, the idea of InfoGAN-CR is to encourage each latent code to make changes that are easy to detect, which only applies well when the underlying factors are independent. Second, as a bidirectional generative modeling method, InfoGAN further requires a variational approximation apart from adversarial training, which is inferior to the principled formulation in BiGAN and AGES (Shen et al., 2020) that we adopt.

**Generative modeling involving causal models in the latent space.** CausalGAN (Kocaoglu et al., 2018) and a work concurrent to ours (Moraffah et al., 2020) are unidirectional generative models (i.e., generative models that learn a single mapping from the latent variable to data) built upon a cGAN (Mirza and Osindero, 2014). They assign an SCM to the conditional attributes while leaving the latent variables as independent Gaussian noises. The limitation of a cGAN is that it always requires full supervision on the attributes to apply conditional adversarial training. Also, the ground-truth factors are directly fed into the generator as the conditional attributes, without any extra effort to align the dimensions of the latent variables with the underlying factors, so their models have nothing to do with disentanglement learning. Moreover, their unidirectional nature makes them unable to learn representations. Besides, they only consider binary factors, so the resulting semantic interpolations appear non-smooth, as shown in Appendix G. CausalVAE (Yang et al., 2021) assigns the SCM directly to the latent variables, but being built upon iVAE (Khemakhem et al., 2020), it adopts a conditional prior given the ground-truth factors and is thus also limited to a fully supervised setting.

GraphVAE (He et al., 2018) generalized the chain-structured latent space proposed in LadderVAE (Sønderby et al., 2016) and imposed an SCM on the latent space of a VAE. The motivation behind GraphVAE is to improve the expressive capacity of the VAE rather than to disentangle the underlying causal factors, as ours does. Purely from observational data and without any supervision on the underlying factors, the impossibility result of Locatello et al. (2019) indicates that a VAE model cannot identify the true factors. Therefore, the representations learned by GraphVAE are not guaranteed to disentangle the generative factors, and consequently the learned SCM does not reflect the true causal structure in principle. Moreover, their adopted VAE loss (ELBO) requires an explicit form of the KL divergence between the prior and the posterior, which limits the model choice for the SCM. Specifically, GraphVAE uses an additive noise model with Gaussian noises. In contrast, our method does not require the distribution induced by the SCM to be explicitly expressed and in principle allows any SCM that can be reparametrized as a generative model (i.e., given the exogenous noises, one can generate all the variables by ancestral sampling). For comparison, in our experiments we include a baseline that extends the original GraphVAE method to incorporate the same amount of supervision as ours.

**Generative modeling involving other structured latent spaces.**
VLAE (Zhao et al., 2017) decomposed the latent space into separate chunks, each of which is processed at a different level of the encoder and decoder. VQ-VAE-2 (Razavi et al., 2019) used a two-level latent space along with a multi-stage generation mechanism to capture both high- and low-level information of the data. SAE (Leeb et al., 2020) encouraged a hierarchical structure in the latent space through the structural architecture of the decoder. These methods essentially adopt implicit probabilistic or architectural hierarchies, in contrast to the causal structure that we impose on the latent space, and thus cannot achieve the goal of causal disentanglement. For example, the hierarchy in SAE represents the level of abstraction, in the sense that more high-level, abstract features are processed deeper in the decoder, while low-level, linear features are treated towards the end of the network. Such a hierarchy differs essentially from the causal structure that we consider.

Other works considered inferring the latent causal structure from visual data in the reinforcement learning setting (Dasgupta et al., 2019; Nair et al., 2019). In particular, Nair et al. (2019) developed learning-based approaches to induce causal knowledge in the form of directed acyclic graphs, which was then utilized in learning goal-conditioned policies. The interactive environment enables the agent to perform actions and observe their outcomes. Therefore, the resulting data involve various interventions, each of which entails an SCM, and are thus essentially different from the common setting in the disentanglement literature, also considered in this paper, where the observed data are independent and identically distributed.

## 3. Problem setting

In this section, we describe the probabilistic framework of disentanglement learning based on bidirectional generative models (BGMs) with supervision, and formalize the unidentifiability problem with previous methods.

### 3.1 Generative model

We follow the commonly assumed two-step data generating process that first samples the underlying generative factors and then, conditional on those factors, generates the data (Kingma and Welling, 2014). During the generation process, the generator induces the generated conditional $p_G(x|z)$ and the generated joint distribution $p_G(x, z) = p_z(z) p_G(x|z)$. During the inference process, the encoder induces the encoded conditional $q_E(z|x)$, which can be a factorized Gaussian, and the encoded joint distribution $q_E(x, z) = q_x(x) q_E(z|x)$. We consider the following objective for generative modeling:

$$\mathcal{L}_{\mathrm{gen}}(E, G) = D_{\mathrm{KL}}(q_E(x, z), p_G(x, z)), \tag{1}$$

where $D_{\mathrm{KL}}(q, p) = \int q(x, z) \log\big(q(x, z)/p(x, z)\big)\, dx\, dz$ is the Kullback-Leibler (KL) divergence between the two distributions. Objective (1) is equivalent, up to a constant, to the negative evidence lower bound (ELBO)

$$\mathbb{E}_{x \sim q_x}\big[-\mathbb{E}_{q_E(z|x)} \log p_G(x|z) + D_{\mathrm{KL}}(q_E(z|x), p_z(z))\big] \tag{2}$$

used in VAEs, and the ELBO admits a closed form that can be optimized easily only with a factorized Gaussian prior, encoder and generator (Shen et al., 2020). Since constraints on the latent space are required to enforce disentanglement, it is desirable that the distribution families of $q_E(x, z)$ and $p_G(x, z)$ be large enough, especially for complex data like images.
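To make the stated equivalence between (1) and (2) explicit, one can expand the joint densities in (1); the following short derivation is reconstructed from the definitions above rather than quoted from the paper:

$$\begin{aligned}
D_{\mathrm{KL}}(q_E(x,z),\, p_G(x,z))
&= \mathbb{E}_{x \sim q_x}\, \mathbb{E}_{q_E(z|x)} \big[ \log q_x(x) + \log q_E(z|x) - \log p_z(z) - \log p_G(x|z) \big] \\
&= \mathbb{E}_{x \sim q_x} \big[ -\mathbb{E}_{q_E(z|x)} \log p_G(x|z) + D_{\mathrm{KL}}(q_E(z|x),\, p_z(z)) \big] - H(q_x),
\end{aligned}$$

where $H(q_x) = -\mathbb{E}_{q_x} \log q_x(x)$ is the entropy of the data distribution, a constant in $(E, G)$; the bracketed term is exactly the negative ELBO (2).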
As demonstrated in the literature on image generation (Karras et al., 2019; Mescheder et al., 2017), implicit distributions, where the randomness is fed into the input or intermediate layers of the network, are favored over factorized Gaussians in terms of expressiveness. Minimizing (1) then requires adversarial training, as discussed in detail in Section 4.3.

### 3.2 Supervised regularizer

To guarantee disentanglement, we incorporate supervision when training the BGM. The first part of the supervision consists of a certain number of annotated labels of the ground-truth factors, following a similar idea to Locatello et al. (2020b) but with a different formulation. We leverage another part of supervision on the graph structure of the factors, which will be discussed in Section 4.1.2. Specifically, let $\xi \in \mathbb{R}^m$ be the underlying ground-truth factors of interest of data $x$, following distribution $p_\xi$, and let $[y]_i$ be some continuous or discrete annotated observation of the $i$-th underlying factor $[\xi]_i$, satisfying $[\xi]_i = \mathbb{E}([y]_i | x)$ for $i = 1, \ldots, m$. For example, in the case of human face images, $[y]_1$ can be the binary label indicating whether a person is young or not, and $[\xi]_1 = \mathbb{E}([y]_1 | x) = P([y]_1 = 1 | x)$ is the probability of being young given one image $x$.

Let $\bar{E}(x)$ be the deterministic part of the stochastic transformation $E(x)$, i.e., $\bar{E}(x) = \mathbb{E}(E(x) | x)$, obtained by integrating out the additional randomness injected into the encoder, which is used for representation learning. For instance, consider a Gaussian encoder satisfying $E(x) | x \sim N(m(x), \Sigma(x))$, which can be reparametrized by $E(x) = m(x) + \Sigma(x)\epsilon$ with $\epsilon \sim N(0, I)$. Then the deterministic part is the mean, i.e., $\bar{E}(x) = m(x)$. We consider the following objective:

$$L(E, G) = \mathcal{L}_{\mathrm{gen}}(E, G) + \lambda \mathcal{L}_{\mathrm{sup}}(E), \tag{3}$$

where the supervised regularizer is $\mathcal{L}_{\mathrm{sup}} = \mathbb{E}_{x,y}[l_s(E; x, y)]$ with

- $l_s = \sum_{i=1}^m \mathrm{CE}([\bar{E}(x)]_i, [y]_i)$ if $[y]_i$ is the binary or bounded (and normalized to $[0, 1]$) continuous label of factor $[\xi]_i$, where $\mathrm{CE}(l, y) = -y \log \sigma(l) - (1 - y) \log(1 - \sigma(l))$ is the cross-entropy loss with $\sigma(\cdot)$ being the sigmoid function;
- $l_s = \sum_{i=1}^m ([\bar{E}(x)]_i - [y]_i)^2$ if $[y]_i$ is a continuous observation of $[\xi]_i$.

Here $\lambda > 0$ is the coefficient that balances the two terms. Through ablation studies in Section 5.4, we empirically find that the choice of $\lambda$ is insensitive to different tasks and data sets, and hence set $\lambda = 5$ in all experiments.

Note that in the objective (3), the unsupervised generative modeling loss and the supervised regularizer are decoupled in terms of taking expectations, in contrast to conditional GANs, where supervised labels are involved in the GAN loss. This enables one to use two separate samples with different sample sizes to estimate the two terms in (3) during training. Since in practice we may only have access to a limited amount of annotated labels, this property makes the formulation applicable in such semi-supervised settings. In the experiments, we conduct ablation studies to investigate how our method performs with varying amounts of labeled samples available.

In addition, Locatello et al. (2020b) propose a regularizer $\mathcal{L}_{\mathrm{sup}} = \sum_{i=1}^m \mathbb{E}_{x,z}\, \mathrm{CE}([\bar{E}(x)]_i, [z]_i)$ involving only the latent variable $z$, which is a part of the generative model, without distinguishing the model component $z$ from the ground-truth factor $\xi$ and its observation $y$. Hence they do not establish formal theoretical justification of disentanglement.
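To make the regularizer concrete, here is a minimal PyTorch-style sketch of $l_s$ from (3); the function and argument names are hypothetical, not from the released code:

```python
import torch
import torch.nn.functional as F

def supervised_loss(e_bar, y, is_binary):
    """Minimal sketch of the supervised regularizer l_s in Eq. (3).

    e_bar:     (batch, m) deterministic encoder outputs [E(x)]_1, ..., [E(x)]_m
    y:         (batch, m) annotated labels/observations of the true factors
    is_binary: length-m booleans; True -> cross-entropy, False -> squared loss
    """
    loss = e_bar.new_zeros(())
    for i in range(e_bar.shape[1]):
        if is_binary[i]:
            # CE(l, y) = -y log sigma(l) - (1 - y) log(1 - sigma(l))
            loss = loss + F.binary_cross_entropy_with_logits(e_bar[:, i], y[:, i])
        else:
            # squared loss for continuous observations of [xi]_i
            loss = loss + ((e_bar[:, i] - y[:, i]) ** 2).mean()
    return loss
```

Because this term is decoupled from the generative loss, it can be evaluated on a (possibly much smaller) labeled batch than the one used for the GAN loss.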
Moreover, they follow the earlier VAE-based methods in adopting the VAE loss (2) for generative modeling with an independent prior, together with an additional regularizer to enforce independence of the latent variables, which suffers from the unidentifiability problem described in the next section.

### 3.3 Unidentifiability with an independent prior

Intuitively, the above supervised regularizer aims at ensuring some kind of alignment between the underlying factor $\xi$ and the latent variable $z$ in the model. We start with the definition of a disentangled representation following this intuition.

**Definition 3 (Disentangled representation)** Given the underlying factor $\xi \in \mathbb{R}^m$ of data $x$, a deterministic encoder $E$ is said to learn a disentangled representation with respect to $\xi$ if for all $i = 1, \ldots, m$, there exists a 1-1 function $g_i$ such that $[E(x)]_i = g_i([\xi]_i)$. Further, a stochastic encoder $E$ is said to be disentangled with respect to $\xi$ if its deterministic part $\bar{E}(x)$ is disentangled with respect to $\xi$.

Note that in general, the goal of disentanglement allows for permutations of the ground-truth factors: for example, one may expect that for all $i$ there exists some $j$, not necessarily equal to $i$, such that $[E(x)]_i = g_j([\xi]_j)$. However, since in our method we supervise each latent dimension by the annotated label of each ground-truth factor, we can expect a componentwise correspondence between $E(x)$ and $\xi$, as justified formally in Proposition 5 below.

As introduced above, we consider the general case where the underlying factors of interest are causally related. The goal then becomes to disentangle the causal factors. Previous methods mostly use an independent prior for $z$, which contradicts the truth. We make this formal through the following proposition, which indicates that the disentangled representation is generally unidentifiable with an independent prior.

**Proposition 4** Let $E^*$ be any encoder that is disentangled with respect to $\xi$. Let $b^* = \mathcal{L}_{\mathrm{sup}}(E^*)$, $a = \min_G \mathcal{L}_{\mathrm{gen}}(E^*, G)$, and $b = \min_{\{(E,G): \mathcal{L}_{\mathrm{gen}} = 0\}} \mathcal{L}_{\mathrm{sup}}(E)$. Assume the elements of $\xi$ are connected by a causal graph whose adjacency matrix $A_0$ is not a zero matrix. Suppose the prior $p_z$ is factorized, i.e., $p_z(z) = \prod_{i=1}^k p_i([z]_i)$. Then we have $a > 0$, and either when $b \le b^*$, or when $b > b^*$ and $\lambda < \frac{a}{b - b^*}$, there exists a solution $(E^\circ, G^\circ)$ such that $E^\circ$ is entangled and, for any generator $G$, we have $L(E^\circ, G^\circ) < L(E^*, G)$.

This proposition directly suggests that minimizing (3) favors an entangled solution $(E^\circ, G^\circ)$ over one with a disentangled encoder $E^*$: the entangled solution attains zero generative loss, so its total loss $\lambda b$ undercuts the disentangled solution's $a + \lambda b^*$ whenever $\lambda (b - b^*) < a$. Thus, with an independent prior, we have no way to identify the disentangled solution unless $\lambda$ is large enough. However, in real applications it is impossible to estimate this threshold, and too large a $\lambda$ makes it difficult to learn the BGM. After our work was submitted, our attention was brought to a theoretical result in Träuble et al. (2021) that is similar to our Proposition 4. A discussion of the two independently proposed results is given in Appendix A.2 after the proof. In the following section, we propose a solution to this problem.

## 4. Causal disentanglement learning

In this section, we propose the DEAR method for causal disentanglement learning. We start with an introduction to the model structure in Section 4.1. Then we present the formulation of DEAR, as well as its identifiability of disentanglement at a population level, in Section 4.2. The DEAR algorithm is described in Section 4.3, with its consistency results established in Section 4.4.
### 4.1 Generative model with a causal prior

We introduce the proposed bidirectional generative model with a causal model prior, and discuss the learning of the adjacency matrix. Based on the model, we describe the mechanism of causal controllable generation from interventional distributions. We further propose a composite prior to deal with the issue of setting the latent dimension.

#### 4.1.1 SCM prior

We propose to use a causal model as the prior $p_z$. Specifically, we adopt the general nonlinear structural causal model (SCM) proposed by Yu et al. (2019) as follows:

$$z = f\big((I - A^\top)^{-1} h(\epsilon)\big) := F_\beta(\epsilon), \tag{4}$$

where $A$ is the weighted adjacency matrix of the directed acyclic graph (DAG) on the $k$ elements of $z$ (i.e., $A_{ij} \neq 0$ if and only if $[z]_i$ is a parent of $[z]_j$), $\epsilon$ denotes the exogenous variables following $N(0, I)$, $f$ and $h$ are element-wise transformations that are generally nonlinear, and $\beta = (f, h, A)$ denotes the set of parameters of $f$, $h$ and $A$, with parameter space $\mathcal{B}$. Further let $I_A = \mathbb{I}(A \neq 0)$ denote the corresponding binary adjacency matrix, where $\mathbb{I}(\cdot)$ is the element-wise indicator function. When $f$ is invertible, (4) is equivalent to

$$f^{-1}(z) = A^\top f^{-1}(z) + h(\epsilon), \tag{5}$$

which indicates that the factors $z$ satisfy a linear SCM after the nonlinear transformation $f$, and enables interventions on latent variables, as discussed later. By combining the above SCM prior with the encoder and generator introduced in Section 3.1, we end up with the model structure presented in Figure 1.

Figure 1: Model structure of a BGM (left) with an SCM prior (right).

Note that, different from our model, where $z$ is the latent variable following the prior (4) with the goal of causal disentanglement, Yu et al. (2019) propose a causal discovery method where the variables $z$ in SCM (4) are observed, with the aim of learning the causal structure among $z$.
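For concreteness, here is a minimal sketch of sampling from the SCM prior (4). The per-dimension affine choices for $f$ and $h$ are an illustrative assumption (the paper allows general nonlinear element-wise transformations), and the module name is hypothetical:

```python
import torch
import torch.nn as nn

class SCMPrior(nn.Module):
    """Minimal sketch of the SCM prior z = f((I - A^T)^{-1} h(eps)), Eq. (4)."""

    def __init__(self, k, super_graph):
        super().__init__()
        # super_graph: (k, k) binary mask of the known super-graph of the DAG,
        # with super_graph[i, j] = 1 if the edge z_i -> z_j is allowed.
        # Only weights on allowed edges are learned, so acyclicity is ensured.
        self.register_buffer("mask", super_graph.float())
        self.A = nn.Parameter(torch.zeros(k, k))    # weighted adjacency matrix
        self.f_scale = nn.Parameter(torch.ones(k))  # f(u) = f_scale * u + f_shift
        self.f_shift = nn.Parameter(torch.zeros(k))
        self.h_scale = nn.Parameter(torch.ones(k))  # h(eps) = h_scale * eps

    def forward(self, eps):
        # eps: (batch, k) exogenous noise ~ N(0, I)
        A = self.A * self.mask                      # keep only allowed edges
        k = A.shape[0]
        eye = torch.eye(k, device=eps.device)
        # u = (I - A^T)^{-1} h(eps), solved as a linear system per sample
        u = torch.linalg.solve(eye - A.T, (self.h_scale * eps).unsqueeze(-1))
        return self.f_scale * u.squeeze(-1) + self.f_shift   # z = f(u)
```

Since the masked adjacency matrix corresponds to a DAG, $I - A^\top$ is always invertible, and no separate acyclicity penalty is needed.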
#### 4.1.2 Learning of A

In causal structure learning, the graph is required to be acyclic. Traditional causal discovery methods such as PC (Spirtes et al., 2000) and GES (Chickering, 2002) deal with the combinatorial problem over the discrete space of DAGs. Recently, Zheng et al. (2018) proposed an equality constraint whose satisfaction ensures acyclicity, and solved the problem with the augmented Lagrangian method, which however leads to optimization difficulties (Ng et al., 2020). In addition, identifiability of the causal structure from purely observational data is known to be an important issue in causal discovery. Despite a number of results on structure identifiability under various parametric or semi-parametric assumptions (Zhang and Hyvarinen, 2009; Peters and Bühlmann, 2014), in a general nonparametric setting it cannot be guaranteed. Yu et al. (2019) did not discuss the identifiability of the SCM (4) in general cases.

In many disentanglement problems, we have some prior information on the causal structure of the factors of interest, based on common knowledge or expertise. In particular, we may know a causal ordering of the factors. In addition to the ordering, for some factors we may know that one particular factor cannot be a direct cause of another, which helps us remove some redundant edges in advance. Therefore, in this paper, with its focus on disentanglement, we utilize such prior information on the graph structure in disentanglement learning, and leave incorporating causal discovery from scratch to future work.

Formally, we assume that a super-graph of the true binary graph $I_{A_0}$ is given; the best case is the true graph itself, while the worst is that only the causal ordering is available. We then learn the weights of the non-zero elements of the prior adjacency matrix, which indicate the sign and scale of the causal effects, jointly with the other parameters of the generative model, using the formulation and algorithm described in Sections 4.2 and 4.3. As discussed in Section 4.2, such prior knowledge makes structure identifiability easy to hold. Moreover, the given super-graph ensures the acyclicity of the adjacency matrix, allowing us to dispense with the additional acyclicity constraint. In Section 5.3, we investigate how our method performs in learning the graph structure and weighted adjacency matrix given various amounts of prior graph information.

Note that even when a super-graph is available, to the best of our knowledge, no previous disentanglement method except GraphVAE (He et al., 2018) can utilize it to disentangle causal factors with guarantees, while we propose one such method and show its effectiveness. In fact, He et al. (2018) also assumed an ordering over the latent nodes by specifying that the parents of node $z_i$, $i = 1, \ldots, k-1$, come from the set $\{z_{i+1}, \ldots, z_k\}$. Later experiments suggest that GraphVAE shows inferior performance compared with ours.

#### 4.1.3 Generation from interventional distributions

One immediate application of our proposed model is causal controllable generation from interventional distributions of the latent variables. We now describe the mechanism. To enable interventions under SCM (5), we require $f$ to be invertible. Interventions can then be formalized as operations that modify a subset of the equations in (5) (Pearl et al., 2000). Suppose we would like to intervene on the $i$-th dimension of $z$, i.e., $\mathrm{Do}([z]_i = c)$, where $c$ is a constant. Once we obtain the latent factors $z$ inferred from data $x$, i.e., $z = E(x)$, or sampled from the prior $p_z$, we follow the modified equations in (5) to obtain $z'$ on the left-hand side using ancestral sampling, by performing (5) iteratively, where $\epsilon$ can be either fixed or resampled from its prior. Then we decode the latent factor $z'$ that follows the given interventional distribution to generate the desired sample $G(z')$. In Section 5.1 we define the two types of interventions of most interest in applications. We discuss how our method generalizes to unseen interventions in Appendix D.
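Below is a sketch of generation under $\mathrm{Do}([z]_i = c)$ via the modified equations in (5), assuming the invertible affine $f$ from the `SCMPrior` sketch above; here $\epsilon$ is kept fixed by recovering $h(\epsilon)$ from the given $z$:

```python
import torch

@torch.no_grad()
def intervene(prior, z, i, c):
    """Sketch of generation under Do([z]_i = c) using Eq. (5).

    With v = f^{-1}(z), Eq. (5) reads v = A^T v + h(eps), i.e. v = v @ A in
    row form. Here eps is kept fixed by recovering h(eps) from the given z;
    one could instead resample eps ~ N(0, I) and set h_eps = prior.h_scale * eps.
    """
    A = prior.A * prior.mask                   # weighted adjacency (k, k)
    k = A.shape[0]
    v = (z - prior.f_shift) / prior.f_scale    # v = f^{-1}(z), one row per sample
    h_eps = v - v @ A                          # h(eps) = (I - A^T) v, row form
    c_v = (c - prior.f_shift[i]) / prior.f_scale[i]
    v[:, i] = c_v                              # clamp the intervened coordinate
    for _ in range(k):                         # ancestral sampling: k passes
        v = v @ A + h_eps                      # propagate effects through the DAG
        v[:, i] = c_v                          # keep the intervention in place
    return prior.f_scale * v + prior.f_shift   # z' = f(v); then generate G(z')
```

Because the graph is a DAG over $k$ nodes, $k$ propagation passes suffice for all descendants of $[z]_i$ to settle at their interventional values, while non-descendants are left unchanged.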
#### 4.1.4 Latent dimension and composite prior

Another issue with the model is how to set the latent dimension $k$ of the generative model; to handle this, we propose a composite prior. Recall that $m$ is the number of generative factors that we are interested in disentangling, for example, all the semantic concepts related to some field, where $m$ tends to be smaller than the total number $M$ of generative factors. The latent dimension $k$ should be no less than $M$ to allow a sufficient degree of freedom to generate or reconstruct data well. Since $M$ is generally unknown in reality, we set a sufficiently large $k$, at least larger than $m$, which is a trivial lower bound of $M$. We then propose to use a composite prior: a causal model for the first $m$ dimensions, and another distribution, such as a standard Gaussian, for the other $k - m$ dimensions. In this way, the first $m$ dimensions of $z$ aim at learning the disentangled representation of the $m$ factors of interest, while the role of the remaining $k - m$ dimensions is to capture other factors that are necessary for generation, whose structure is neither of interest nor explicitly modeled. Under this model framework, we do not require the availability of annotated labels for all generative factors of the data; only those of the factors we wish to disentangle are used in the supervised regularizer in (3), which broadens the applicability of our method.

### 4.2 DEAR formulation

In this section, we first present the formulation of DEAR. Compared with the BGM described in Section 3.1, we now have one more module to learn, namely the SCM prior. Thus $p_G(x, z)$ becomes $p_{G,F}(x, z) = p_F(z) p_G(x|z)$, where $p_F(z)$ is the distribution of $F_\beta(\epsilon)$ with $\epsilon \sim N(0, I)$. We then rewrite the generative model loss as follows:

$$\mathcal{L}_{\mathrm{gen}}(E, G, F) = D_{\mathrm{KL}}(q_E(x, z), p_{G,F}(x, z)). \tag{6}$$

We then propose the following formulation to learn disentangled generative causal representations:

$$\min_{E,G,F} L(E, G, F) := \mathcal{L}_{\mathrm{gen}}(E, G, F) + \lambda \mathcal{L}_{\mathrm{sup}}(E). \tag{7}$$

We now show the identifiability of disentanglement of DEAR, in contrast to the unidentifiability result in Proposition 4. Proposition 5 indicates that, under appropriate conditions, the DEAR formulation (7) at a population level can learn the disentangled representations defined in Definition 3. Here, Assumption 1 supposes a sufficiently large capacity of the SCM in (4) to contain the underlying distribution $p_\xi$, which is reasonable due to the generality of the nonlinear SCM.

**Assumption 1** The underlying distribution $p_\xi$ belongs to the distribution family $\{p_\beta : \beta \in \mathcal{B}\}$, i.e., there exists $\beta_0 = (f_0, h_0, A_0)$ such that $p_\xi = p_{\beta_0}$.

**Proposition 5 (Identifiability)** Assume infinite capacity of $E$ and $G$, and Assumption 1. Let $(E^*, G^*, F^*) \in \arg\min_{E,G,F} L(E, G, F)$ be a solution of the DEAR formulation (7). Then $E^*$ is disentangled with respect to $\xi$ as defined in Definition 3.

Note that Proposition 5 states identifiability at the population level, i.e., the loss function is taken in expectation over the distributions of both the data and the labels of the true factors. We thus clarify that Proposition 5 does not yield general provable disentanglement, which should be analyzed under a much weaker form of supervision on the true factors, e.g., as in Khemakhem et al. (2020). In contrast, the specific identifiability stated in Proposition 5 should be interpreted as a counterpart of the unidentifiability result in Proposition 4. Specifically, Proposition 4 shows that the independent prior used by most existing disentanglement methods causes a contradiction between the generative loss $\mathcal{L}_{\mathrm{gen}}$ and the supervised loss $\mathcal{L}_{\mathrm{sup}}$ in (3), which makes the whole loss $L$ prefer an entangled model. Therefore, even with the same amount of supervised labels of the true factors, those methods cannot learn a generative model with disentangled latent representations. In contrast, Proposition 5 formally shows that, owing to the SCM prior, the two loss terms $\mathcal{L}_{\mathrm{gen}}$ and $\mathcal{L}_{\mathrm{sup}}$ in (7) can be minimized simultaneously, and the jointly optimal solution is the disentangled model.

### 4.3 Algorithm

In this section, we propose the algorithm to solve formulation (7). Estimating $\mathcal{L}_{\mathrm{gen}}$ requires the unlabeled data set $\{x_1, \ldots, x_N\}$ with sample size $N$, while estimating $\mathcal{L}_{\mathrm{sup}}$ requires a labeled data set
$\{(x_j, y_j) : j = 1, \ldots, N_s\}$, where the sample size $N_s$ can be much smaller than $N$. Without loss of generality, let $S_G = \{x_1, \ldots, x_N, y_1, \ldots, y_{N_s}\}$ denote the training data set for the generative model.

We parametrize $E_\phi(x)$ and $G_\theta(z)$ by neural networks. As mentioned in Section 3.1, to enhance the expressiveness of the generative model, we use an implicit generated conditional $p_G(x|z)$, where we inject Gaussian noises into each convolution layer in the same way as Shen et al. (2020). The SCM prior $p_F(z)$ and the implicit $p_G(x|z)$ then leave (6) without an analytic form. Hence we adopt a GAN method to adversarially estimate the gradient of (6), as in Shen et al. (2020). Different from their setting, the prior here also involves learnable parameters, namely the parameters $\beta$ of the SCM. In the following lemma we present the gradient formulas of (6).

**Lemma 6** Let $D^*(x, z) = \log[q_E(x, z) / p_{G,F}(x, z)]$. Then we have

$$\begin{aligned}
\nabla_\theta \mathcal{L}_{\mathrm{gen}} &= -\mathbb{E}_{z \sim p_\beta(z)}\big[s(x, z)\, \nabla_x D^*(x, z)\big|_{x = G_\theta(z)}\, \nabla_\theta G_\theta(z)\big], \\
\nabla_\phi \mathcal{L}_{\mathrm{gen}} &= \mathbb{E}_{x \sim q_x}\big[\nabla_z D^*(x, z)\big|_{z = E_\phi(x)}\, \nabla_\phi E_\phi(x)\big], \\
\nabla_\beta \mathcal{L}_{\mathrm{gen}} &= -\mathbb{E}_{\epsilon}\big[s(x, z)\big(\nabla_x D^*(x, z)\, \nabla_\beta G(F_\beta(\epsilon)) + \nabla_z D^*(x, z)\, \nabla_\beta F_\beta(\epsilon)\big)\big|_{x = G(F_\beta(\epsilon)),\, z = F_\beta(\epsilon)}\big],
\end{aligned} \tag{8}$$

where $s(x, z) = e^{D^*(x, z)}$ is a scaling factor. Since $D^*$ depends on the unknown densities, which makes the gradients in (8) uncomputable directly from data, we estimate them by training a discriminator $D$ via the empirical logistic regression

$$\min_D \sum_{i: w_i = 1} \log\big(1 + e^{-D(x_i, z_i)}\big) + \sum_{i: w_i = 0} \log\big(1 + e^{D(x_i, z_i)}\big), \tag{9}$$

where the class label $w_i = 1$ if $(x_i, z_i) \sim q_E$ and $w_i = 0$ if $(x_i, z_i) \sim p_{G,F}$, for $i = 1, \ldots, N_d$. We parametrize the discriminator using neural networks with parameter $\psi$. Based on the above, we propose Algorithm 1 to learn disentangled generative causal representations.

**Algorithm 1: Disentangled gEnerative cAusal Representation (DEAR) Learning**

Input: training set $S_G$; initial parameters $\phi, \theta, \beta, \psi$; batch size $n$; meta-parameter $T$.

1. for $t = 1, \ldots, T$ do
2. ....for multiple steps do
3. ........Sample $\{x_1, \ldots, x_n\}$ from the training set and $\{\epsilon_1, \ldots, \epsilon_n\}$ from $N(0, I)$
4. ........Generate from the causal prior: $z_i = F_\beta(\epsilon_i)$, $i = 1, \ldots, n$
5. ........Update $\psi$ by descending the stochastic gradient $\frac{1}{n} \sum_{i=1}^n \nabla_\psi \big[\log(1 + e^{-D_\psi(x_i, E_\phi(x_i))}) + \log(1 + e^{D_\psi(G_\theta(z_i), z_i)})\big]$
6. ....Sample $\{x_1, \ldots, x_n, y_1, \ldots, y_{n_s}\}$ and $\{\epsilon_1, \ldots, \epsilon_n\}$ as above; generate $z_i = F_\beta(\epsilon_i)$
7. ....Compute the $\theta$-gradient: $-\frac{1}{n} \sum_{i=1}^n s(G_\theta(z_i), z_i)\, \nabla_\theta D_\psi(G_\theta(z_i), z_i)$
8. ....Compute the $\phi$-gradient: $\frac{1}{n} \sum_{i=1}^n \nabla_\phi D_\psi(x_i, E_\phi(x_i)) + \frac{\lambda}{n_s} \sum_{i=1}^{n_s} \nabla_\phi l_s(\phi; x_i, y_i)$
9. ....Compute the $\beta$-gradient: $-\frac{1}{n} \sum_{i=1}^n s(G_\theta(z_i), z_i)\, \nabla_\beta D_\psi(G_\theta(F_\beta(\epsilon_i)), F_\beta(\epsilon_i))$
10. ...Update parameters $\phi, \theta, \beta$ using the gradients

Return: $\phi, \theta, \beta$
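For illustration, here is a minimal PyTorch-style sketch of one discriminator update implementing the logistic regression (9), i.e., step 5 of Algorithm 1; the modules `D`, `E`, `G`, `prior` and the optimizer `opt_D` are hypothetical placeholders:

```python
import torch
import torch.nn.functional as F

def discriminator_step(D, E, G, prior, x, opt_D):
    """One update of the discriminator via the logistic regression in Eq. (9).

    Pairs (x, E(x)) ~ q_E receive label w = 1, and generated pairs (G(z), z)
    with z = F_beta(eps) ~ p_{G,F} receive label w = 0.
    """
    eps = torch.randn(x.shape[0], prior.A.shape[0], device=x.device)
    with torch.no_grad():                    # the generative model is frozen here
        z_real = E(x)
        z_fake = prior(eps)
        x_fake = G(z_fake)
    d_real = D(x, z_real)                    # D estimates log(q_E / p_{G,F})
    d_fake = D(x_fake, z_fake)
    # softplus(-t) = log(1 + e^{-t}) and softplus(t) = log(1 + e^{t}), as in (9)
    loss = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()
    opt_D.zero_grad()
    loss.backward()
    opt_D.step()
    return loss.item()
```

At the optimum of (9), the discriminator output approximates $D^*(x, z) = \log[q_E(x,z) / p_{G,F}(x,z)]$, which is then plugged into the gradient estimates of steps 7-9.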
### 4.4 Consistency

In this section, we show the asymptotic convergence of Algorithm 1. Let $\bar\theta = (\theta, \phi, \beta)$ denote the set of parameters of the generative model, where $\theta$, $\phi$ and $\beta$ denote the parameters of the generator, encoder and SCM prior, respectively. Under this parametrization, we write the objective function in (7) as $L(\bar\theta)$. In this section, we establish the consistency of the empirical estimator $\hat{\bar\theta}$, i.e., the output of Algorithm 1, in the parametric setting. Given a discriminator $D$, the approximate gradient used in the algorithm is

$$h_D(\bar\theta) = \begin{pmatrix} -\frac{1}{N} \sum_{i=1}^N s(G_\theta(z_i), z_i)\, \nabla_x D(G_\theta(z_i), z_i)\, \nabla_\theta G_\theta(z_i) \\ \frac{1}{N} \sum_{i=1}^N \nabla_z D(x_i, E_\phi(x_i))\, \nabla_\phi E_\phi(x_i) + \frac{\lambda}{N_s} \sum_{i=1}^{N_s} \nabla_\phi l_s(\phi; x_i, y_i) \\ -\frac{1}{N} \sum_{i=1}^N s(x, z) \big[\nabla_x D(x, z)\, \nabla_\beta G(F_\beta(\epsilon_i)) + \nabla_z D(x, z)\, \nabla_\beta F_\beta(\epsilon_i)\big]\big|_{x = G(F_\beta(\epsilon_i)),\, z = F_\beta(\epsilon_i)} \end{pmatrix}.$$

We first show in the following lemma that, under appropriate conditions, the approximate gradient $h_{\hat{D}}(\bar\theta)$ based on the solution of (9) converges uniformly in probability to the true gradient. Recall the definition $D^*(x, z) = \log(q_E(x, z) / p_{G,F}(x, z))$, which depends on $\bar\theta$. Let $\mathcal{D}^* = \{D^*_{\bar\theta}(x, z) : \bar\theta \in \Theta\}$ denote the true discriminator class, and let $\mathcal{D} = \{D(x, z)\}$ denote the modeled discriminator class with the norm $\|D\|_1 = \int |D(x, z)|\, p_{\bar\theta}(x, z)\, dx\, dz$, where $p_{\bar\theta}(x, z) = (q_E(x, z) + p_{G,F}(x, z))/2$, which induces the probability measure $\mu_{\bar\theta}$.

**Lemma 7** Assume the parameter space $\Theta = \{\bar\theta = (\theta, \phi, \beta)\}$ is compact. Assume the following regularity conditions hold:

- C1. $D^*_{\bar\theta}$ is smooth with respect to $\bar\theta$ over $\Theta$, as defined in Definition 1.
- C2. The modeled discriminator class $\mathcal{D}$ is compact and contains the true class $\mathcal{D}^*$.
- C3. $\{\mu_{\bar\theta} : \bar\theta \in \Theta\}$ is uniformly tight, i.e., for any $\epsilon > 0$, there exists a compact subset $K_\epsilon$ of $\mathcal{X} \times \mathcal{Z}$ such that for all $\bar\theta \in \Theta$, $\mu_{\bar\theta}(K_\epsilon) \ge 1 - \epsilon$.
- C4. Functions in $\mathcal{D}$ have uniformly bounded function values, gradients and Hessians: there exists a positive number $B_0 < \infty$ such that for all $D \in \mathcal{D}$ and all $x, z$, we have $|D(x, z)| \le B_0$, $\|\nabla D(x, z)\| \le B_0$ and $|\mathrm{tr}(\nabla^2 D(x, z))| \le B_0$.
- C5. $E_\phi$, $G_\theta$, $\nabla E_\phi$ and $F_\beta$ are uniformly bounded.
- C6. The training set for the discriminator is independent of that for the generative model.

Then there exists a sequence of $(N, N_s, N_d)$ such that

$$\sup_{\bar\theta \in \Theta} \big\| h_{\hat{D}}(\bar\theta) - \nabla L(\bar\theta) \big\| \stackrel{p}{\to} 0, \tag{10}$$

where $\stackrel{p}{\to}$ means convergence in probability.

Based on the above, we obtain the consistency of the DEAR algorithm in the following theorem. It indicates that when the sample sizes grow large enough, with high probability, the DEAR algorithm approximately achieves the minimum of $L(\bar\theta)$, which leads to the desired disentangled model according to Proposition 5.

**Theorem 8 (Consistency)** Suppose the assumptions in Lemma 7 hold. Further assume the objective function $L(\bar\theta)$ in (7) is smooth with respect to $\bar\theta$ and satisfies the Polyak-Łojasiewicz condition in Definition 2. Let $L^* = \min_{\bar\theta \in \Theta} L(\bar\theta)$. Then there exists a sequence of $(N, N_s, N_d)$ such that $L(\hat{\bar\theta}) \stackrel{p}{\to} L^*$.

**Remark.** The Polyak-Łojasiewicz (PL) condition (Polyak, 1963) asserts that the suboptimality of a model is upper bounded by the norm of its gradient, which is a weaker condition than assumptions commonly made to ensure convergence, such as (strong) convexity. Recent literature showed that the PL condition holds in many machine learning scenarios, including some deep neural networks (Charles and Papailiopoulos, 2018; Liu et al., 2020).

## 5. Experiments

We present the experimental studies on causal controllable generation in Section 5.1, which demonstrate the effectiveness of DEAR in causal disentanglement and support the theory in Section 4. Based on these theoretical and empirical justifications, we then apply the representations learned by DEAR in downstream prediction tasks in Section 5.2, and show the benefits of the disentangled causal representations in terms of sample efficiency and distributional robustness. In addition, we investigate the performance of DEAR in learning the causal structure and weighted adjacency matrix of the SCM prior in Section 5.3.
We also provide ablation studies with varying regularization strength $\lambda$ and varying amounts of annotated labels in Section 5.4. The code and data sets are available at https://github.com/xwshen51/DEAR.

We evaluate our methods on two data sets where the ground-truth generative factors are causally related, whereas most data sets used in previous disentanglement work are assumed or designed to have independent generative factors, for example, in the large-scale experimental study by Locatello et al. (2019). The first data set is a synthesized data set, Pendulum, similar to the one in Yang et al. (2021). As shown in Figure 3, each image is generated by four continuous factors: pendulum angle, light angle, shadow length and shadow position, whose underlying structure is given in Figure 2(a), following physical mechanisms. To make the data set realistic, we introduce random noises when generating the two effects from the causes, representing measurement error. We further introduce 20% corrupted data whose shadow is randomly generated, mimicking some environmental disturbance. The sample sizes of the training, validation and test sets are all 6,724. The second one is a real human face data set, CelebA (Liu et al., 2015), with 40 labeled binary attributes. Among them, we consider two groups of causally related factors of interest, as shown in Figure 2(b,c). The sample sizes of the training, validation and test sets are 162,770, 19,867, and 19,962. We believe these two data sets are diverse enough to assess our methods because they cover real and synthesized data, with continuous and discrete annotated labels. In addition, we test our method on benchmark data sets (Gondal et al., 2019) where the generative factors are independent; the results are given in Appendix E. All the details of the experimental setup, network architectures and the synthesized data set are given in Appendix F. Notably, all VAEs and DEAR use the same network architecture for the encoder and decoder (generator).

Figure 2: Underlying causal structures. (a) Pendulum: pendulum_angle(1), light_angle(2), shadow_length(3), shadow_position(4). (b) CelebA-Smile: smile(1), gender(2), cheekbone(3), mouth_open(4), eye(5), chubby(6). (c) CelebA-Attractive: young(1), gender(2), receding_hairline(3), make_up(4), chubby(5).

### 5.1 Causal controllable generation

We first investigate the performance of our methods in disentanglement through applications in causal controllable generation. Traditional controllable generation methods mainly manipulate independent generative factors (Karras et al., 2019), while we consider the general case where the factors are causally related. With a learned SCM as the prior, we are able to generate images from many desired interventional distributions of the latent factors. For example, we can manipulate only a cause factor while leaving its effects unchanged. Besides, the bidirectional framework presented in Figure 1 enables controllable generation either from scratch or from a given unlabeled image.

We consider the two types of interventions of most interest in applications. First, in traditional traversals, we manipulate one dimension of the latent vector while keeping the others fixed to either their inferred or sampled values (Higgins et al., 2017). A causal view of such an operation is an intervention on all the variables that sets them to constants, with only one of them varying.
Another interesting type of interventional distribution is obtained by intervening on only one latent variable, i.e., $P_{\mathrm{do}([z]_i = c)}(z)$, and observing how the other variables change as a consequence. The proposed SCM prior enables us to conduct such interventions through the mechanism described in Section 4.1.3. One can naturally generalize this to intervene on more than one variable; for simplicity, we only present the results of intervening on one variable in the paper.

Figures 3 and 4 illustrate the results of causal controllable generation for the proposed DEAR method and for the baseline method with an independent prior, S-β-VAE (Locatello et al., 2020b). Results for the other baselines are given in Appendix G, including S-TCVAE and S-FactorVAE, which essentially make no difference due to the independence assumption, and the unidirectional generative model CausalGAN. In addition, we extend GraphVAE (He et al., 2018) to a supervised version, named S-GraphVAE, by adding the supervised loss in the same way as DEAR and assuming that a super-graph of the true graph is known a priori. However, in contrast to the composite prior in DEAR, GraphVAE assigns an SCM over the whole latent space and hence only allows a sufficiently low-dimensional latent space. This makes the GraphVAE model less expressive and difficult to apply to complex data sets with a large number of generative factors, like CelebA. The qualitative results of S-GraphVAE in controllable generation are given in Appendix G. Note that we do not compare with unsupervised disentanglement methods (e.g., unsupervised β-VAE, GraphVAE, etc.) because of fairness and their lack of justification.

Figure 3: Results of causal controllable generation on Pendulum. Panels: (a) traversals of S-β-VAE (single latent varied with the others fixed); (b) traversals of DEAR; (c) test data; (d) interventions on cause factors. For example, in line 1 of (a,b), when changing the first dimension $[z]_1$ of $z$, which is supervised with the annotated label of pendulum angle, while keeping the others fixed, the traversals of DEAR vary only in pendulum angle (disentanglement), while those of S-β-VAE vary in both pendulum angle and shadow length (entanglement); in line 3, when changing $[z]_3$ with the others fixed, only shadow length is affected with DEAR, but both shadow length and pendulum angle are affected with S-β-VAE. In line 1 of (d), we see that intervening on pendulum angle affects its effects shadow length and shadow position, which is consistent with the desired interventional distribution.

Figure 4: Results of causal controllable generation on CelebA. Panels: (a) traversals of S-β-VAE (single latent varied with the others fixed); (b) traversals of DEAR; (c) test data; (d) interventions on cause factors. For example, in line 1 of (a,b), when altering $[z]_1$ with the others fixed, the traversals of DEAR vary only in the single factor smile, with the factor mouth open unaffected, while S-β-VAE entangles the two factors. In lines 5-6 of (a), when changing $[z]_5$ and $[z]_6$, which are supervised with narrow eye and chubby, no factors seem to be affected, indicating that S-β-VAE fails to learn the representations of some factors.
In line 1 of (d), we see that intervening on smile affects its effect mouth open, which makes sense.

In each figure, we first infer the latent representations from a test image in block (c). The traditional traversals of the two models are given in blocks (a,b). We see that in each line, when manipulating one latent dimension while keeping the others fixed, the generated images of our model vary only in a single factor, indicating that our method can disentangle the causally related factors, while those of S-β-VAE show multiple factors affected. It is worth pointing out that we are the first to achieve disentanglement between a cause factor and its effects, while other methods tend to entangle them. One typical example is the disentanglement between smile and its effect mouth open, as shown in Figure 4. In block (d), we show the results of interventions on the latent variables representing the cause factors, which clearly show that intervening on a cause variable changes its effect variables. Results in Appendix G further show that intervening on an effect variable does not influence its cause. Specific examples are given in the captions. Note that without an SCM prior, S-β-VAE cannot generate data from general interventional distributions. More qualitative traversals from DEAR are given in Appendix G.

### 5.2 Downstream task

The previous section verifies the good disentanglement performance of DEAR. In this section, equipped with DEAR, we investigate and demonstrate the benefits of the learned disentangled causal representations for downstream tasks in terms of sample efficiency and distributional robustness. In Appendix B, we propose a quantitative metric for causal disentanglement, which is utilized to provide some justification of the relationship between causal disentanglement and performance in downstream tasks.

We now introduce the downstream prediction tasks. On CelebA, we consider the structure CelebA-Attractive in Figure 2(c). We artificially create a target label $\tau = 1$ if young=1, gender=0, receding_hairline=0, make_up=1, chubby=0, eye_bag=0, and $\tau = 0$ otherwise, indicating one kind of attractiveness as a slim young woman with makeup and thick hair.³ On the Pendulum data set, we regard the label of data corruption as the target $\tau$; that is, $\tau = 1$ if the data is corrupted and $\tau = 0$ otherwise. We consider the downstream tasks of predicting the target label. In both cases, the generative factors of interest in Figure 2(a,c) are causally related to $\tau$; they are the features that humans would use to do the task. Hence it is conjectured that a disentangled representation of these causal factors tends to be more data-efficient and invariant to distribution shifts.

3. Note that the definition of attractiveness here only refers to one kind of attractiveness, which has nothing to do with its linguistic definition.

#### 5.2.1 Sample efficiency

For a BGM, including the earlier state-of-the-art supervised disentanglement methods S-VAEs (Locatello et al., 2020b), the modified S-GraphVAE (He et al., 2018), and our proposed DEAR, we use the learned encoder to embed the training data into the latent space and train an MLP classifier on top of the representations to predict the target label. All the architectures are the same across the various methods, with details given in Appendix F. Without an encoder, one normally needs to train a convolutional neural network with raw images as input. Here we adopt ResNet50 (named ResNet in Table 1) as the baseline classifier, which is the architecture of the BGM encoder. Since the disentanglement methods use
additional supervision of the generative factors, we consider another baseline, ResNet50 (named ResNet-pretrain), that is pretrained using multi-label classification to predict the factors on the same training set. Unless indicated otherwise, DEAR, S-VAEs, S-GraphVAE, and ResNet-pretrain have access to the annotated labels for all training samples, and DEAR and S-GraphVAE are given the true graph structure. We provide detailed results for the cases with less supervised information on the labels and the graph structure in Sections 5.4 and 5.3.

To measure sample efficiency, we use the statistical efficiency score, defined as the average test accuracy based on 100 samples divided by the average accuracy based on 10,000/all samples, following Locatello et al. (2019). Note that this metric may be misleading when a method always achieves poor accuracy with both small and large training samples. Therefore, we also report the test accuracies with different training sample sizes to provide a comprehensive evaluation.

Table 1 presents the results, showing that DEAR achieves the highest sample efficiency and test accuracy on both data sets. ResNet with raw data inputs has the lowest efficiency, although multi-label pretraining improves its performance to a limited extent. S-VAEs have better efficiency than the ResNet baselines but lower accuracy in the case with more training data. Since the encoders of all S-VAEs and DEAR share the same architecture, we attribute the inferior performance of S-VAEs mainly to the independent prior contradicting the supervised loss, as indicated in Proposition 4, which makes the learned representations entangled (as shown in the previous section) and less informative. On the Pendulum data with few underlying factors, S-GraphVAE outperforms the S-VAEs when training on a smaller sample, indicating that an SCM latent structure has advantages over the independent structure under the VAE framework. Nevertheless, even with the same amount of supervision (on both annotated labels and the same given graph structure), S-GraphVAE is still inferior to DEAR, potentially due to our better causal modeling and optimization based on a GAN algorithm. On the more complex data set CelebA, S-GraphVAE gives very poor performance, even worse than S-VAEs and ResNet.

In addition, we investigate the performance of DEAR in the semi-supervised setting where only 10% of the labels are available. We find that DEAR with fewer labels has sample efficiency comparable to that in the fully supervised setting, with a sacrifice in accuracy that is still comparable to other baselines which use much more supervision. In Section 5.4, we provide ablation studies showing how DEAR behaves with varying amounts of labeled samples and different choices of the regularization strength $\lambda$.

We also study the setting with less prior information on the causal graph structure. In the last two lines of Table 1, DEAR-SG stands for the DEAR-LIN model trained with a given super-graph (which is not a full graph) of the true graph, and DEAR-O stands for the DEAR-LIN model trained with a known causal ordering. We see that DEAR-SG leads to comparable performance to DEAR with the known graph structure, while DEAR-O is slightly worse but still competitive compared with the other baseline methods.
Table 1 presents the results, showing that DEAR achieves the highest sample efficiency and test accuracy on both data sets. ResNet with raw image inputs has the lowest efficiency, although multi-label pretraining improves its performance to a limited extent. S-VAEs have better efficiency than the ResNet baselines but lower accuracy when more training data are available. Since the encoders of all S-VAEs and DEAR share the same architecture, we attribute the inferior performance of S-VAEs mainly to the independent prior contradicting the supervised loss, as indicated in Proposition 4, which makes the learned representations entangled (as shown in the previous section) and less informative. On the Pendulum data with few underlying factors, S-GraphVAE outperforms the S-VAEs when training on a smaller sample, indicating that an SCM latent structure has advantages over the independent structure under the VAE framework. Nevertheless, even with the same amount of supervision (both the annotated labels and the same given graph structure), S-GraphVAE is still inferior to DEAR, potentially due to our better causal modeling and optimization based on a GAN algorithm. On the more complex data set CelebA, S-GraphVAE gives very poor performance, even worse than S-VAEs and ResNet.

In addition, we investigate the performance of DEAR in the semi-supervised setting where only 10% of the labels are available. We find that DEAR with fewer labels has sample efficiency comparable to that in the fully supervised setting, at a cost in accuracy that still remains comparable to the other baselines, which use much more supervision. In Section 5.4, we provide ablation studies showing how DEAR behaves with varying amounts of labeled samples and different choices of the regularization strength λ.

We also study the setting with less prior information on the causal graph structure. In the last two rows of Table 1, DEAR-SG stands for the DEAR-LIN model trained with a given super-graph (which is not a full graph) of the true graph, and DEAR-O stands for the DEAR-LIN model trained with a known causal ordering. We see that DEAR-SG achieves performance comparable to DEAR with the known graph structure, while DEAR-O is slightly worse but still competitive with the other baseline methods. As we show later, on Pendulum, DEAR-O can recover the true structure, and the performance in downstream tasks is identical to that of DEAR given the true structure, so we skip the last two rows in Table 1(b). In Section 5.3, we investigate the performance in learning the SCM, and in particular the causal structure, given various amounts of prior information about the true graph, where more insights are given to explain the comparable downstream performance of DEAR-SG.

Table 1: Sample efficiency and test accuracy with different training sample sizes. DEAR-LIN and DEAR-NL denote the DEAR models with linear and nonlinear f, respectively.

(a) CelebA
Method           100 (%)        10,000 (%)     Eff (%)
ResNet           68.06 ± 0.19   79.51 ± 0.31   85.59 ± 0.27
ResNet-pretrain  76.84 ± 2.08   83.75 ± 0.93   91.74 ± 1.98
S-VAE            77.07 ± 1.42   79.87 ± 1.67   96.49 ± 1.68
S-β-VAE          71.78 ± 1.99   76.63 ± 0.24   93.67 ± 2.41
S-TCVAE          77.10 ± 2.08   81.63 ± 0.20   94.45 ± 2.72
S-GraphVAE       67.87 ± 1.19   72.09 ± 0.51   94.14 ± 1.14
DEAR-LIN         83.51 ± 0.77   84.92 ± 0.11   98.34 ± 0.81
DEAR-NL          84.44 ± 0.48   85.10 ± 0.09   99.23 ± 0.51
DEAR-LIN-10%     78.09 ± 0.59   79.54 ± 0.41   98.18 ± 0.49
DEAR-NL-10%      80.30 ± 0.24   80.87 ± 0.12   99.29 ± 0.23
DEAR-SG          83.69 ± 0.63   84.91 ± 0.06   98.57 ± 0.67
DEAR-O           82.84 ± 0.68   84.42 ± 0.05   98.13 ± 0.79

(b) Pendulum
Method           100 (%)        all (%)        Eff (%)
ResNet           79.71 ± 0.98   90.64 ± 1.57   87.97 ± 2.11
ResNet-pretrain  79.59 ± 0.93   89.16 ± 1.60   89.28 ± 0.59
S-VAE            84.16 ± 0.69   90.89 ± 0.28   92.60 ± 0.49
S-β-VAE          79.95 ± 1.65   87.87 ± 0.52   90.98 ± 1.47
S-TCVAE          85.36 ± 1.11   90.33 ± 0.33   94.51 ± 1.31
S-GraphVAE       86.08 ± 1.61   91.90 ± 0.53   93.65 ± 1.29
DEAR-LIN         90.21 ± 0.94   93.31 ± 0.14   96.68 ± 0.89
DEAR-NL          90.62 ± 0.32   92.57 ± 0.08   97.93 ± 0.29
DEAR-LIN-10%     88.93 ± 1.40   93.18 ± 0.18   95.43 ± 1.33
DEAR-NL-10%      87.65 ± 0.46   91.27 ± 0.21   96.03 ± 0.29

5.2.2 Distributional robustness

We manipulate the training data to inject spurious correlations, i.e., misleading heuristics that work for most training examples but do not always hold (Sagawa et al., 2019), between the target label and some spurious attribute. On CelebA, we regard mouth open as the spurious factor; on Pendulum, we choose the background color {blue (+), white (−)}. We manipulate the training data so that the target label is strongly correlated with the spurious attribute. Specifically, the target label and the spurious attribute of 80% of the examples are both positive or both negative, while those of the remaining 20% of examples disagree. For instance, in the manipulated training set, 80% of the smiling examples in CelebA have an open mouth, and 80% of the corrupted examples in Pendulum are masked with a blue background. The test sets, however, do not have such correlations: around half of the examples in the test sets of both CelebA and Pendulum have consistent target and spurious labels, leading to a distribution shift. Intuitively, these spurious attributes are not causally related to the target label, but standard methods based on independent and identically distributed (IID) data, such as empirical risk minimization (ERM), tend to exploit such easily learned spurious correlations for prediction, and hence suffer performance degradation when the correlation no longer holds at test time. In contrast, causal factors are regarded as invariant and thus more robust under such shifts.
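For concreteness, the following minimal sketch shows one way to subsample a training set to produce the 80%/20% correlation described above; the array names and the helper are illustrative, not our released code.

```python
import numpy as np

def inject_spurious_correlation(tau, spurious, agree=0.8, seed=0):
    """Subsample a training set so that the binary target tau and the
    binary spurious attribute agree on an `agree` fraction of examples
    and disagree on the rest (if there are not enough agreeing
    examples, the achieved ratio is lower)."""
    rng = np.random.default_rng(seed)
    agree_idx = np.flatnonzero(tau == spurious)
    disagree_idx = np.flatnonzero(tau != spurious)
    n_dis = len(disagree_idx)
    # number of agreeing examples needed to hit the target ratio
    n_agr = min(len(agree_idx), int(n_dis * agree / (1 - agree)))
    keep = np.concatenate([rng.choice(agree_idx, n_agr, replace=False),
                           disagree_idx])
    return np.sort(keep)
```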
Previous sections justify, both theoretically and empirically, that DEAR can learn disentangled causal representations well. We then apply those representations by training a classifier on top of them to predict the target label, which is conjectured to be invariant and robust. Baseline methods include ERM; multi-label ERM, which is trained to predict both the target label and the factors considered in disentanglement so as to use the same amount of supervision; the S-VAEs, which were shown above to be unable to disentangle well in the causal case; and S-GraphVAE.

Table 2 presents the average and worst-case test accuracy to assess both the overall classification performance and the distributional robustness. The worst-case accuracy (Sagawa et al., 2019) refers to the following: we group the test set according to the two binary labels, the target and the spurious attribute, into four groups and regard the group with the worst accuracy as the worst case; this group usually exhibits the spurious correlation opposite to that of the training data.

Table 2: Distributional robustness: worst-case and average test accuracy.

(a) CelebA
Method           Worst Acc (%)   Avg Acc (%)
ERM              59.12 ± 1.78    82.12 ± 0.26
ERM-multilabel   59.17 ± 4.02    82.05 ± 0.25
S-VAE            60.54 ± 3.48    79.51 ± 0.58
S-β-VAE          63.85 ± 2.09    80.82 ± 0.19
S-TCVAE          64.93 ± 3.30    81.58 ± 0.14
S-GraphVAE       50.51 ± 4.43    76.01 ± 1.73
DEAR-LIN         76.05 ± 0.70    83.56 ± 0.09
DEAR-NL          76.98 ± 0.66    83.60 ± 0.04
DEAR-LIN-10%     71.40 ± 0.47    81.04 ± 0.14
DEAR-NL-10%      70.44 ± 1.02    81.94 ± 0.31
DEAR-SG          74.95 ± 1.14    83.56 ± 0.25
DEAR-O           74.00 ± 1.47    83.45 ± 0.32

(b) Pendulum
Method           Worst Acc (%)   Avg Acc (%)
ERM              60.48 ± 2.73    87.40 ± 0.89
ERM-multilabel   61.70 ± 4.02    87.20 ± 1.00
S-VAE            20.78 ± 4.45    84.26 ± 1.31
S-β-VAE          44.12 ± 9.73    86.99 ± 1.78
S-TCVAE          35.50 ± 5.57    86.64 ± 1.15
S-GraphVAE       54.42 ± 4.15    87.64 ± 2.06
DEAR-LIN         75.60 ± 0.27    93.58 ± 0.03
DEAR-NL          75.39 ± 2.11    93.16 ± 0.04
DEAR-LIN-10%     74.05 ± 1.56    92.63 ± 0.07
DEAR-NL-10%      73.93 ± 1.98    92.72 ± 0.03
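Concretely, the worst-case accuracy over the four (target, spurious-attribute) groups can be computed as in the following sketch; the array names are illustrative.

```python
import numpy as np

def worst_group_accuracy(pred, tau, spurious):
    """Group the test set by the pair of binary labels (target,
    spurious attribute) into four groups and return the accuracy of
    the worst group, following Sagawa et al. (2019)."""
    accs = []
    for t in (0, 1):
        for s in (0, 1):
            mask = (tau == t) & (spurious == s)
            if mask.any():
                accs.append(np.mean(pred[mask] == tau[mask]))
    return min(accs)
```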
It can be seen that the classifiers trained on DEAR representations significantly outperform the baselines on both metrics. In particular, comparing the worst-case accuracy with the average one, we observe a slump from around 80 to around 60 for the other methods on CelebA, while DEAR suffers a much smaller decline. As in the sample efficiency experiments, S-GraphVAE has a smaller drop in worst-case accuracy than the S-VAEs on Pendulum but remains inferior to DEAR; on CelebA, S-GraphVAE again shows poor performance. Moreover, with fewer annotated samples (i.e., 10% of the full sample), DEAR-10% remains competitive against the baseline methods, which use many more supervised labels. DEAR-SG (given the super-graph) is slightly better than DEAR-O (given the ordering), and both are comparable to DEAR given the full structure. More ablation studies on the labeled proportion and on the strength of the supervised regularizer are given in Section 5.4.

5.3 Learning of the structure A

In this section, we take a closer look at the causal structure and the weighted adjacency matrix A of the SCM prior learned under various amounts of prior graph information. As mentioned in Section 4.1.2, the DEAR method requires prior knowledge of a super-graph of the true graph over the underlying factors of interest. The experiments shown in the previous sections are all based on the given true binary structure I_{A_0}. Here we investigate the performance in learning the causal structure when various amounts of information about the graph are known, ranging from the causal ordering to the true structure. Note that the adjacency matrices learned by DEAR-LIN and DEAR-NL are consistent up to scaling, so in this section we only show the results from DEAR-LIN as a representative.

Figure 5: The weighted adjacency matrices learned by DEAR. (a) Pendulum; (b) CelebA-Smile; (c) CelebA-Attractive.

Figure 6: The given causal structures: (a) Pendulum-O; (b) CelebA-Attractive-SG; (c) CelebA-Attractive-O. Here -O and -SG stand for the causal ordering and the super-graph. The black edges are true and the red edges are in fact redundant.

Figure 5 shows the learned weighted adjacency matrices when the true binary structure is given, for the three underlying structures shown in Figure 2. The weights exhibit meaningful signs and scales that are consistent with common knowledge. For example, the factor smile and its effect mouth open are positively correlated, that is, one is more likely to open the mouth when smiling; the corresponding element A_{14} of the weighted adjacency matrix in (b) turns out to be positive, which makes sense. Likewise, gender (the indicator of being male) and its effect make up are negatively correlated, as women tend to wear makeup more often than men; correspondingly, element A_{24} in (c) turns out to be negative.

Next, we evaluate the performance of DEAR in structure learning with less prior knowledge of the true graph, i.e., knowing a super-graph rather than the exact true graph. We first study the synthetic data set Pendulum, whose ground-truth structure is shown in Figure 2(a) and which has few causal factors and no hidden confounder. Consider the causal ordering pendulum angle, light angle, shadow position, shadow length. Given this ordering, we start from the full graph (shown in Figure 6(a)), represented by an upper-triangular adjacency matrix whose elements are randomly initialized around 0 (shown in Figure 7(a)). Figures 7(a-d) present the weighted adjacency matrices learned by DEAR at different training epochs. We observe that the weights of the two redundant edges A_{12} and A_{34} gradually vanish, and the learned matrix eventually nearly coincides with the one learned given the true graph, shown in Figure 5(a).

Figure 7: Learned weighted adjacency matrices on Pendulum given the causal ordering. (a-d) are the matrices learned by DEAR at epochs 0, 100, 200, and 500, starting from random initialization around 0, and (e) is the result from S-GraphVAE.

In contrast, Figure 7(e) shows the structure learned by S-GraphVAE. Note that GraphVAE learns a binary structure with 0-1 elements, and (e) shows the learned probability of each element being 1. We see that it learns a redundant edge A_{12} from pendulum angle to light angle and misses the edge A_{23} from light angle to shadow position. This experiment shows the advantage of DEAR over GraphVAE in learning the latent causal structure.
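For concreteness, the following sketch shows how a causal ordering (or a given super-graph) induces the mask of the trainable weighted adjacency matrix used above; it assumes PyTorch, and the helper name and initialization scale are ours.

```python
import torch

def init_adjacency(ordering, super_edges=None, scale=0.01, seed=0):
    """Build a learnable weighted adjacency matrix A for the SCM prior.
    Given only a causal ordering, every edge from an earlier to a later
    node is allowed (a full upper-triangular matrix in that ordering);
    given a super-graph, only its edges are allowed.  Allowed entries
    are randomly initialized around 0; the rest are fixed to 0 by the
    binary mask."""
    torch.manual_seed(seed)
    m = len(ordering)
    mask = torch.zeros(m, m)
    if super_edges is None:
        pos = {v: i for i, v in enumerate(ordering)}
        for i in ordering:
            for j in ordering:
                if pos[i] < pos[j]:
                    mask[i, j] = 1.0
    else:
        for i, j in super_edges:
            mask[i, j] = 1.0
    A = torch.nn.Parameter(scale * torch.randn(m, m))
    return A, mask  # use A * mask inside the prior and train A jointly

# Pendulum ordering: pendulum_angle(0), light_angle(1),
# shadow_position(2), shadow_length(3)
A, mask = init_adjacency([0, 1, 2, 3])
```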
Figure 8: Learned weighted adjacency matrices on CelebA given a super-graph. (a) is a random initialization around 0 of the weighted adjacency matrix corresponding to the super-graph in Figure 6(b); (b-d) are the matrices learned by DEAR at epochs 5, 50, and 150; (e) is the result from S-GraphVAE.

The case is more complicated on the real data set CelebA. Although the number of factors of interest, six, is not large, there are many more underlying generative factors. Some of the other factors, which we do not aim to disentangle, may serve as hidden confounders of the factors that we are interested in. For example, staying up late may cause a person to have eye bags and look chubby, and hence serves as a hidden confounder of the two factors eye bag and chubby in Figure 2(c). These hidden confounders can be captured in the remaining dimensions of the learned representations through the composite prior introduced in Section 4.1.4. However, their existence makes it difficult to identify and learn the structure among the factors of interest. Another complication comes from biases in the data, potentially caused by selection bias or unknown interventions; such biases may result in spurious correlations even among the causal variables, complicating causal structure learning. There are orthogonal works (e.g., Ke et al., 2019; Bengio et al., 2020; Brouillard et al., 2020) focusing on causal discovery under hidden confounders or unknown interventions, which is beyond the scope of this paper and will be systematically explored in future work. Here we only provide some empirical studies to evaluate our method in this more complicated case.

Figure 9: Learned weighted adjacency matrices on CelebA given the causal ordering. (a-d) are the matrices learned by DEAR at epochs 0, 5, 50, and 150, starting from random initialization around 0; (e) is the result from S-GraphVAE.

We conduct two experiments on CelebA. In the first, we assume a known super-graph (Figure 6(b)) of the true graph (Figure 2(c)) and randomly initialize its weighted adjacency matrix around 0, as in Figure 8(a). Figures 8(a-d) show the weighted adjacency matrices learned by DEAR at different training epochs. Similar to the previous experiment on Pendulum, the weights corresponding to the redundant edges gradually vanish, and DEAR eventually learns a weighted adjacency matrix that largely agrees with the one learned given the true graph, shown in Figure 5(c). After edge pruning (see the sketch after this section's summary), one can essentially recover the true graph structure. This explains why DEAR-SG (the DEAR model given this super-graph) performs competitively with DEAR given the true structure in the downstream tasks of the previous two sections. In contrast, the graph learned by GraphVAE, shown in Figure 8(e), fails to recover the true structure, even though GraphVAE is given the same super-graph as DEAR.

In the second experiment, we only assume a known causal ordering, which leads to the full graph shown in Figure 6(c), with the upper-triangular weighted adjacency matrix randomly initialized as in Figure 9(a). We observe that although DEAR removes most of the redundant edges, it mistakenly learns a large weight on the edge from young to gender. This may be due to the spurious correlation between the two factors young and gender, potentially caused by selection bias during data collection. In comparison, as shown in Figure 9(e), the graph learned by GraphVAE given the same causal ordering is even farther from the true graph than that of DEAR. Nevertheless, as discussed in the previous two sections, DEAR-O (the DEAR model given the causal ordering) still achieves reasonably satisfactory downstream performance, which indicates the robustness of DEAR to the correctness of the learned graph structure.

In summary, when given the true graph structure, DEAR learns meaningful weights for each edge. If there is no hidden confounder or spurious correlation among the factors of interest, DEAR can learn the true graph given only the causal ordering. If such biases exist, DEAR can still recover the true structure given a proper super-graph, but in general it cannot learn all edges correctly when only the causal ordering is given. In all cases, DEAR outperforms GraphVAE in learning the causal structure.
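The edge pruning mentioned above can be as simple as thresholding the learned weights; a minimal sketch, where the threshold is a tuning parameter of ours rather than a value reported in the paper:

```python
import torch

def prune_edges(W, threshold=0.1):
    """Binarize a learned weighted adjacency matrix by removing edges
    whose absolute weight falls below a threshold."""
    return (W.abs() >= threshold).int()
```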
5.4 Ablation study

In this section, we conduct ablation studies to illustrate how DEAR performs under different choices of the hyperparameter λ, which weights the supervised regularizer, and under varying amounts of labeled samples.

According to Proposition 5 and Theorem 8, at the population level, i.e., assuming an infinite amount of data, the regularization strength λ in the objective (7) can be an arbitrary positive value for the theorems to hold. In practice, with a finite sample, λ cannot be arbitrarily small, roughly due to the estimation error. We therefore treat λ as a hyperparameter and investigate its sensitivity across tasks and data sets. Figures 10-11 plot the sample efficiency and distributional robustness metrics for different choices of λ. All of these results (with λ ranging from 0.1 to 10) remain significantly superior to the baseline methods in Tables 1-2, which suggests that DEAR performs reasonably well across a wide range of λ. As λ approaches 0, we generally observe a performance decrease.

Next, we study how DEAR and the baseline methods behave as we reduce the number of annotated samples. Figures 12-13 plot the same metrics for different amounts of labeled samples. Note that 0.1% of the CelebA training set corresponds to 162 samples and 1% of the Pendulum training set corresponds to 67 samples; such small numbers of supervised labels belong to weakly supervised settings according to Locatello et al. (2020b) and would make manual labeling feasible even if no labels were available beforehand. Naturally, with fewer labeled samples all methods generally perform worse, and DEAR always outperforms the VAEs. In particular, as shown in Figure 13(a), when training with 0.1%-1% of the labels of the CelebA training sample, S-β-VAE and S-TCVAE completely fail on the worst-case group, meaning that the classifiers trained on them rely almost entirely on the spurious correlation and exhibit no robustness to distribution shifts at all. In Figure 12(a), at the lower supervised proportions, although S-β-VAE and S-TCVAE show higher sample efficiency, they actually perform poorly with both small and large samples, leading to a misleadingly high efficiency score.

Figure 10: Test accuracy when training on a small sample and sample efficiency, as defined in Section 5.2.1, against four different choices of λ: 0.1, 1, 5, and 10. (a) CelebA; (b) Pendulum.

Figure 11: Worst-case and average test accuracy, as defined in Section 5.2.2, against different choices of λ: on Pendulum, λ = 0.1, 1, 5, 10; on CelebA, λ = 0.01, 0.1, 1, 5, 10. (a) CelebA; (b) Pendulum.
Figure 12: Test accuracy with a small training sample and sample efficiency against different proportions of labeled samples among the full data. On the larger data set CelebA, we consider proportions 0.001, 0.01, 0.1, and 1; on the smaller Pendulum data, we consider 0.01, 0.1, and 1. (a) CelebA; (b) Pendulum.

Figure 13: Worst-case and average test accuracy against different proportions of labeled samples among the full data. (a) CelebA; (b) Pendulum.

6. Conclusion

In this paper, we showed that previous methods with the independent latent prior assumption fail to learn disentangled representations when the underlying factors of interest are causally related. We then proposed a new disentangled learning method called DEAR, with theoretical guarantees on identifiability and asymptotic consistency. Extensive experiments demonstrated the effectiveness of DEAR in causal controllable generation and structure learning, and the benefits of the learned representations for downstream tasks.

Several future directions are worth exploring. Although our ablation experiments demonstrated that DEAR exhibits promising performance in weakly supervised settings, both in terms of annotated labels and of the graph structure, it is worth considering more flexible forms of supervision to make DEAR applicable to more real-world scenarios. On the one hand, regarding the annotated labels of the factors of interest, one may consider other forms of supervision, such as restricted labeling or rank pairing (Shu et al., 2020). Besides, instead of direct supervision of the true factors, one may consider additionally observed variables, such as class labels or time indices (Khemakhem et al., 2020), which serve as auxiliary information to ensure more general identifiability of the true latent factors in the causal case. On the other hand, regarding the graph structure, our experiments in Section 5.3 indicated the potential of DEAR for latent structure learning. Since in many real applications even the causal ordering may not be available, it is promising to incorporate causal discovery methods into the DEAR framework to learn the latent causal structure from scratch (i.e., without any prior information), with a guarantee of structure identifiability.

In addition, the proposed method applies to the case where the observational data are IID, as commonly considered in the literature on generative models and disentanglement. It would be interesting to extend the current approach to non-IID settings, in particular to scenarios where one can perform interventions during data collection. For example, in reinforcement learning, the interactive environment allows the agent to perform actions and observe their outcomes; the resulting data set, which contains a mixture of interventional distributions (e.g., Ke et al., 2021), could be leveraged in causal disentanglement learning.
Acknowledgments

We would like to thank the anonymous reviewers for their valuable comments, which were very useful for improving the quality of this work. The work was supported by the General Research Fund (GRF) of Hong Kong (No. 16201320). F. Liu's research was supported in part by a Key Research Project of Zhejiang Lab (No. 2022PE0AC04).

Appendix A. Proofs

A.1 Preliminaries

This section presents some preliminary notions and lemmas that are used in the proofs.

Definition 9 (Bracketing covering number (van de Geer, 2000)) Consider a function class G = {g(x)} and a probability measure µ defined on X. Given any δ > 0, let N_{1,B}(δ, G, µ) be the smallest value of N for which there exist pairs of functions {[g_j^L, g_j^U]}_{j=1}^N such that ∫ |g_j^L(x) − g_j^U(x)| dµ ≤ δ for all j = 1, . . . , N, and such that for each g ∈ G there is a j = j(g) ∈ {1, . . . , N} such that g_j^L ≤ g ≤ g_j^U. Then N_{1,B}(δ, G, µ) is called the δ-bracketing covering number of G.

Lemma 10 (Uniform continuous mapping theorem) Let X_n, X be random vectors defined on X. Let f : R^d → R^m be uniformly continuous and T_θ : X → R^d for θ ∈ Θ. Suppose T_θ(X_n) converges uniformly in probability to T_θ(X) over Θ, i.e., as n → ∞ we have sup_{θ∈Θ} ‖T_θ(X_n) − T_θ(X)‖ →_p 0. Then f(T_θ(X_n)) converges uniformly in probability to f(T_θ(X)), i.e., sup_θ ‖f(T_θ(X_n)) − f(T_θ(X))‖ →_p 0.

Proof Given any ϵ > 0, because f is uniformly continuous, there exists δ > 0 such that ‖f(x) − f(y)‖ ≤ ϵ for all ‖x − y‖ ≤ δ. We have

P( sup_{θ∈Θ} ‖T_θ(X_n) − T_θ(X)‖ ≤ δ ) = P( ∀θ ∈ Θ : ‖T_θ(X_n) − T_θ(X)‖ ≤ δ )   (11)
≤ P( ∀θ ∈ Θ : ‖f(T_θ(X_n)) − f(T_θ(X))‖ ≤ ϵ ) = P( sup_{θ∈Θ} ‖f(T_θ(X_n)) − f(T_θ(X))‖ ≤ ϵ ).   (12)

By the uniform convergence of T_θ(X_n), the left-hand side of (11) converges to 1. Hence (12) goes to 1, which implies the desired result.

Lemma 11 Let µ_n and µ be a sequence of measures on the probability space (X, Σ) with densities p_n(x) and p(x). Given any compact subset K of X, suppose p_n is uniformly bounded and Lipschitz on K (∗). If H²(µ_n, µ) →_p 0, then sup_{x∈K} |p_n(x) − p(x)| →_p 0 as n → ∞, where H(q_1, q_2) = { ∫ (q_1^{1/2} − q_2^{1/2})² dx dz / 2 }^{1/2} denotes the Hellinger distance between two distributions with densities q_1 and q_2.

Proof Note that the assumptions in (∗) satisfy the requirements of the Arzelà-Ascoli theorem. Thus, for each subsequence of p_n, there is a further subsequence p_{n_m} that converges uniformly on the compact set K, i.e., for some p_0, as m → ∞ we have

sup_{x∈K} |p_{n_m}(x) − p_0(x)| → 0.

By Scheffé's theorem, we have H(p_{n_m}, p_0) → 0. On the other hand, we have H(p_{n_m}, p) →_p 0. By the triangle inequality, H(p, p_0) ≤ H(p_{n_m}, p_0) + H(p_{n_m}, p) →_p 0. Since the inequality holds for all m and the left-hand side is deterministic, we have H(p, p_0) = 0, which implies p = p_0 a.e. with respect to the Lebesgue measure. Hence we have sup_{x∈K} |p_{n_m}(x) − p(x)| → 0 a.e. By Durrett (2019, Theorem 2.3.2), we have sup_{x∈K} |p_n(x) − p(x)| →_p 0 as n → ∞.

A.2 Proof of Proposition 4

Proof On one hand, by the assumption that the elements of ξ are connected by a causal graph whose adjacency matrix is not a zero matrix, there exist i ≠ j such that [ξ]_i and [ξ]_j are not independent, indicating that the probability density of ξ cannot be factorized. Since E∗ is disentangled with respect to ξ, by Definition 3, for each i = 1, . . . , m there exists g_i such that [E∗(x)]_i = g_i([ξ]_i). This implies that the probability density of E∗(x) is not factorized. On the other hand, notice that the distribution family of the latent prior is contained in {p_z : p_z is factorized}.
Hence the intersection of the marginal distribution families of z and E∗(x) is an empty set. Then the joint distribution families of (x, E∗(x)) and (G(z), z) also have an empty intersection. We know that L_gen(E∗, G) = 0 implies q_{E∗}(x, z) = p_G(x, z), which contradicts the above. Therefore, we have a := min_G L_gen(E∗, G) > 0.

Let (E′, G′) be the solution of the optimization problem min_{(E,G): L_gen=0} L_sup(E). From the above we know that E′ cannot be disentangled with respect to ξ. Then we have L′ := L(E′, G′) = λb′ with b′ := L_sup(E′), while for the disentangled encoder E∗ we have L∗ := L(E∗, G) ≥ a + λb∗ > λb∗ with b∗ := L_sup(E∗), for any generator G. When b′ ≤ b∗ we directly have L′ < L∗. When b∗ < b′ and λ is not large enough, i.e., λ < a/(b′ − b∗), we also have L′ < L∗.

Discussion of Träuble et al. (2021, Proposition 1). Proposition 1 in Träuble et al. (2021) and our Proposition 4 state the same unidentifiability issue from different perspectives. Proposition 1 in Träuble et al. (2021) says that maximum likelihood estimation (MLE) cannot identify the disentangled representation, while our Proposition 4 says that the formulation (7) in our paper cannot identify the disentangled representation. The relationship between the two formulations, MLE and (7), is that the first term in (7) is an upper bound of the negative log-likelihood. Therefore, our Proposition 4 is more direct in the sense that it studies the very formulation used in disentanglement methods.

A.3 Proof of Proposition 5

In this section, we prove a full statement of Proposition 5. Specifically, we add an assumption on structure identifiability and the consequent result of learning the true structure. Assumption 2 states the identifiability of the true causal structure I_{A_0} of ξ, which holds given the true causal ordering under the basic Markov and causal minimality conditions (Pearl, 2014; Zhang and Spirtes, 2011).

Assumption 2 For all β = (f, h, A) ∈ B with p_β = p_{β_0}, it holds that I_A = I_{A_0}.

Proposition 12 (Full statement of Proposition 5) Assume the infinite capacity of E and G. Then, under Assumptions 1 and 2, the DEAR formulation (7) learns the disentangled encoder E∗ and the true causal structure I_{A_0}. Specifically, we have g_i(x) = σ^{-1}(x) with the CE loss as the supervised regularizer, and g_i(x) = x with the L2 loss.

Proof To simplify the notation in this section, for a vector x, let x_i denote the i-th element of x instead of [x]_i; for a vector function g(x), let g_i(x) denote its i-th component function. Assume E is deterministic. On one hand, for each i = 1, . . . , m, first consider the cross-entropy loss

L_sup,i(E) = E_{(x,y)}[CE(E_i(x), y_i)] = −∫ q_x(x) p(y_i|x) [ y_i log σ(E_i(x)) + (1 − y_i) log(1 − σ(E_i(x))) ] dx dy_i,

where p(y_i|x) is the probability mass function of the binary label y_i given x, characterized by P(y_i = 1|x) = E(y_i|x) and P(y_i = 0|x) = 1 − E(y_i|x). Setting the pointwise derivative of L_sup,i with respect to σ(E_i(x)) to zero yields σ(E_i(x)) = E(y_i|x), so we know that E_i∗(x) = σ^{-1}(E(y_i|x)) = σ^{-1}(ξ_i) minimizes L_sup,i.

Next consider the L2 loss

L_sup,i(E) = E_{(x,y)}[E_i(x) − y_i]² = ∫ q_x(x) p(y_i|x) [E_i(x) − y_i]² dx dy_i.

Setting ∂L_sup,i/∂E_i(x) = 2 ∫ q_x(x) p(y_i|x) (E_i(x) − y_i) dx dy_i = 0, we know that E_i∗(x) = E(y_i|x) = ξ_i minimizes L_sup,i in this case.

On the other hand, by Assumption 1 there exists β_0 = (f_0, h_0, A_0) such that p_ξ = p_{β_0}. Further, due to the infinite capacity of G and Assumption 1, the distribution family of p_{G,F}(x, z) contains q_{E∗}(x, z).
Then, by minimizing the loss in (7) over G, we can find G′ and F′ such that p_{G′,F′}(x, z) matches q_{E∗}(x, z), so that L_gen(E∗, G′, F′) reaches 0, where F′ corresponds to the parameter β′ = (f′, h′, A′). Note that p_{G′,F′}(x, z) = q_{E∗}(x, z) implies that the marginal distributions match, i.e., p_{F′}(z) = q_{E∗}(z). Generally denote E_i∗(x) = g_i(ξ_i) for i = 1, . . . , m. Then, for i = 1, . . . , m, the distributions of g_i^{-1}(E_i∗(x)) = ξ_i and g_i^{-1}(F_i′(ϵ)) are identical. It can be seen that p_{β_0} = p_{β′_0} with β′_0 = (g^{-1} ◦ f′, h′, A′), where ◦ denotes element-wise composition. Then, according to Assumption 2, we have I_{A′} = I_{A_0}. Hence, minimizing L = L_gen + λL_sup, which is the DEAR formulation (7), leads to the solution with E_i∗(x) = g_i(ξ_i), where g_i(ξ_i) = σ^{-1}(ξ_i) if the CE loss is used and g_i(ξ_i) = ξ_i if the L2 loss is used, together with the true binary adjacency matrix I_{A_0}. For a stochastic encoder, we establish the disentanglement of its deterministic part as above and follow Definition 3 to obtain the desired result.

A.4 Proof of Lemma 7

Proof [Proof of Lemma 7] The proof proceeds in three steps, after an introduction to logistic regression in the setting of generative models. For a pair (x, z), let the label w = 1 if (x, z) ∼ q_E and w = 0 if (x, z) ∼ p_G, which states that p(x, z|w = 1) = q_E(x, z) and p(x, z|w = 0) = p_{G,F}(x, z). In generative models, the prior is given by P(w = 1) = P(w = 0) = 1/2. Then the marginal distribution of (x, z) is given by p∗(x, z) = q_E(x, z)/2 + p_{G,F}(x, z)/2, which induces the probability measure µ∗. Note that the analysis below holds for all θ ∈ Θ in a pointwise manner unless indicated otherwise, so for simplicity of notation we omit the subscript θ. By the Bayes formula, we have P(w = 1|x, z) = q_E(x, z)/(q_E(x, z) + p_{G,F}(x, z)) and P(w = 0|x, z) = p_{G,F}(x, z)/(q_E(x, z) + p_{G,F}(x, z)), which defines the probability mass function p∗(w|x, z). Let p∗(x, z, w) = p∗(x, z) p∗(w|x, z). Recall the definition D∗(x, z) = log(q_E(x, z)/p_{G,F}(x, z)), so that P(w = 1|x, z) = 1/(1 + e^{−D∗(x,z)}).

Consider the family of conditional distributions P = { P_D(w = 1|x, z) = 1/(1 + e^{−D(x,z)}) : D ∈ D }. Logistic regression maximizes the log-likelihood E_{(x,z,w)∼p∗}[log p_D(x, z, w)] = E_{p∗(x,z,w)} log[ p∗(x, z) p_D(w|x, z) ] over P, or equivalently over D. Given IID samples (x_i, z_i, w_i), i = 1, . . . , N_d, from p∗(x, z, w), the empirical loss to be minimized is

−∑_{i=1}^{N_d} [ log p_D(w_i|x_i, z_i) + log p∗(x_i, z_i) ] = ∑_{i: w_i=1} log(1 + e^{−D(x_i,z_i)}) + ∑_{i: w_i=0} log(1 + e^{D(x_i,z_i)}) + c,

which is equivalent to (9) up to a constant c. Let ˆD = argmin_{D∈D} ˆL_d(D) and ˆP(w = 1|x, z) = 1/(1 + e^{−ˆD(x,z)}).

Step I. We now establish the consistency of ˆD(x, z) to D∗(x, z), as stated in (14) below, based on the generalization analysis of maximum likelihood estimation. Let the class

G = { g(x, z, w) = (1/2) log( (p_D(x, z, w) + p∗(x, z, w)) / (2 p∗(x, z, w)) ) : D ∈ D }.

Note that each element of G can be written as g(x, z, w) = (1/2) log( (p_D(w|x, z) + p∗(w|x, z)) / (2 p∗(w|x, z)) ). By the boundedness of D in condition C4, the mass function p∗(w|x, z) is bounded within a closed interval inside (0, 1), so g is uniformly bounded. Let g∗ = sup_{g∈G} |g|. Then E_{p∗(x,z,w)}[g∗(x, z, w)] < ∞. Moreover, for all δ > 0, the compactness of D assumed in condition C2 implies a finite bracketing covering number as defined in Definition 9, i.e., N_{1,B}(δ, D, µ∗) < ∞.
Then it follows from van de Geer (2000, Theorem 4.3) that

H(p_{ˆD}(x, z, w), p∗(x, z, w)) → 0   (13)

almost surely as N_d → ∞, where H denotes the Hellinger distance. Consider any compact subset K of X × Z. For all D ∈ D, D(x, z) is continuous and thus bounded and Lipschitz on K. Also, from the boundedness of D, we know that p∗(x, z) is bounded away from 0 on K. It then follows from (13) and Lemma 11 that sup_{(x,z)∈K} |ˆP(w = 1|x, z) − P(w = 1|x, z)| →_p 0. Then, by the continuous mapping theorem (Lemma 10) and noting that l(p) = log(p/(1 − p)) is uniformly continuous on a closed interval within (0, 1), we have, as N_d → ∞,

sup_{(x,z)∈K} |ˆD(x, z) − D∗(x, z)| →_p 0.   (14)

Step II. We then prove the pointwise consistency of ∇ˆD(x, z) to ∇D∗(x, z), as stated in (17). Construct an arbitrary probability measure µ on X × Z that satisfies the following (e.g., a Gaussian measure): µ is absolutely continuous with respect to the Lebesgue measure with a density ρ; µ is tight, i.e., for any ϵ > 0 there is a compact subset K_ϵ of X × Z such that µ(K_ϵ) ≥ 1 − ϵ; ∇log ρ is uniformly bounded, i.e., there exists B_1 > 0 such that ‖∇log ρ(x, z)‖ ≤ B_1 for all (x, z) ∈ X × Z; and ρ vanishes at infinity rapidly enough, i.e., ρ(x, z) = o(r^{−d−k}) with r = (‖x‖² + ‖z‖²)^{1/2}. For a function u that is uniformly bounded on X × Z, we have from integration by parts and the Cauchy-Schwarz inequality that

∫ ‖∇u‖² dµ = −∫ u tr(∇²u) dµ − ∫ u ∇uᵀ∇log ρ dµ ≤ √( ∫ |u|² dµ ∫ [tr(∇²u)]² dµ ) + √( ∫ |u|² dµ ∫ (∇uᵀ∇log ρ)² dµ ).   (15)

Recall from condition C4 that there exists a positive number B_0 < ∞ such that for all x, z and all D ∈ D, we have |D(x, z)| ≤ B_0, ‖∇D(x, z)‖ ≤ B_0, and |tr(∇²D(x, z))| ≤ B_0. Given an arbitrary ϵ > 0, we know from the tightness of µ that there exists a compact subset K_ϵ of X × Z such that µ(K_ϵ) ≥ 1 − ϵ. Let B = max{B_0, B_1}. Then we have, for all θ ∈ Θ,

∫_{X×Z} ‖∇ˆD(x, z) − ∇D∗(x, z)‖² dµ ≤ ∫_{K_ϵ} |ˆD(x, z) − D∗(x, z)|² dµ + ∫_{K_ϵ^c} |ˆD(x, z) − D∗(x, z)|² dµ ≤ ∫_{K_ϵ} |ˆD(x, z) − D∗(x, z)|² dµ + 2B²ϵ,   (16)

where K_ϵ^c = (X × Z) \ K_ϵ is the complement, the first inequality is an application of (15) together with the boundedness of ∇log ρ and of the gradients and Hessians of functions in D, and the second inequality comes from the boundedness of functions in D and the tightness of µ. By the uniform convergence in (14) over K_ϵ, for all (x, z) ∈ K_ϵ there exists a sequence a_{N_d} = o_p(1), free of (x, z), such that |ˆD(x, z) − D∗(x, z)|² ≤ a_{N_d}. Then we have

∫_{K_ϵ} |ˆD(x, z) − D∗(x, z)|² dµ ≤ a_{N_d} µ(K_ϵ) = o_p(1),

noting that µ is finite. Further, by the arbitrariness of ϵ, we let ϵ → 0 and obtain from (16) that

∫_{X×Z} ‖∇ˆD(x, z) − ∇D∗(x, z)‖² dµ →_p 0.

Recall the arbitrariness of µ. For any (x, z) ∈ X × Z, construct µ such that ρ(x, z) > 0. Let v(x, z) = ‖∇ˆD(x, z) − ∇D∗(x, z)‖² ρ(x, z). By the converse of the mean value theorem, if (x, z) is not an extremum of v, then there exists a bounded subset S(x, z) ⊂ X × Z such that

‖∇ˆD(x, z) − ∇D∗(x, z)‖² ρ(x, z) = (1/ν(S(x, z))) ∫_{S(x,z)} v(x′, z′) dx′ dz′ ≤ (1/ν(S(x, z))) ∫_{X×Z} v(x′, z′) dx′ dz′ →_p 0,

where ν denotes the Lebesgue measure. Since ρ(x, z) > 0, this implies ‖∇ˆD(x, z) − ∇D∗(x, z)‖ →_p 0 for all non-extrema. By the Lipschitz continuity of v on any compact set, we also have ‖∇ˆD(x, z) − ∇D∗(x, z)‖ →_p 0 for all extrema. So far we have shown that, for all θ ∈ Θ and (x, z) ∈ X × Z, ‖∇ˆD(x, z) − ∇D∗(x, z)‖ →_p 0 as N_d → ∞. Further, from the smoothness in condition C1 and the compactness of Θ, we have, for all x, z, as N_d → ∞,

sup_{θ∈Θ} ‖∇ˆD(x, z) − ∇D∗(x, z)‖ →_p 0.   (17)
Step III. Based on the convergence statements established above, we proceed to show the consistency of the approximate gradient h_{ˆD}(θ) and complete the proof. By condition C3, {µ∗} is uniformly tight: for arbitrary ϵ > 0, there exists a compact subset K_ϵ of X × Z such that µ∗(K_ϵ^c) < ϵ. Because ∇D(x, z) is Lipschitz continuous with respect to (x, z) on K_ϵ, we have, as N_d → ∞,

sup_{θ∈Θ, (x,z)∈K_ϵ} ‖∇ˆD(x, z) − ∇D∗(x, z)‖ →_p 0.   (18)

Given any S_N = {(x_i, z_i) ∼ µ∗, y_j : i = 1, . . . , N, j = 1, . . . , N_s} and δ > 0, define the events A_{N_d} = {sup_θ ‖h_{ˆD}(θ) − h_{D∗}(θ)‖ ≤ δ} and B_{N,ϵ} = {∀i : (x_i, z_i) ∈ K_ϵ}. We have from the tightness of µ∗ that P(B_{N,ϵ}) ≥ (1 − ϵ)^N. We know from (18) and the continuous mapping theorem (Lemma 10) that for any S_N and ϵ > 0, as N_d → ∞ (free of S_N), we have P(A_{N_d}|B_{N,ϵ}) → 1. Then, as N_d → ∞, we have

P(A_{N_d}) ≥ P(A_{N_d} ∩ B_{N,ϵ}) = P(A_{N_d}|B_{N,ϵ}) P(B_{N,ϵ}) ≥ P(A_{N_d}|B_{N,ϵ}) (1 − ϵ)^N → (1 − ϵ)^N.

Letting ϵ → 0, we have P(A_{N_d}) → 1 as N_d → ∞. Since δ is arbitrary, we have that for any S_N, sup_{θ∈Θ} ‖h_{ˆD}(θ) − h_{D∗}(θ)‖ →_p 0 as N_d → ∞. On the other hand, by condition C5 and the boundedness of D∗, and according to the gradient formulas in Lemma 6, it follows from the uniform law of large numbers that ‖h_{D∗}(θ) − ∇L(θ)‖ →_p 0 uniformly over Θ as N, N_s → ∞. By the triangle inequality, we have

sup_{θ∈Θ} ‖h_{ˆD}(θ) − ∇L(θ)‖ ≤ sup_{θ∈Θ} ‖h_{ˆD}(θ) − h_{D∗}(θ)‖ + sup_{θ∈Θ} ‖h_{D∗}(θ) − ∇L(θ)‖.

Therefore, there exists a sequence of (N, N_s, N_d) such that sup_{θ∈Θ} ‖h_{ˆD}(θ) − ∇L(θ)‖ →_p 0, which completes the proof.

A.5 Proof of Theorem 8

Proof [Proof of Theorem 8] Consider the gradient descent step based on the approximate gradient θ_t = θ_{t−1} − η h_{ˆD}(θ_{t−1}), where η is the learning rate. Suppose L(θ) is ℓ_0-smooth. Then we have

L(θ_t) ≤ L(θ_{t−1}) − η h_{ˆD}(θ_{t−1})ᵀ ∇L(θ_{t−1}) + (η²ℓ_0/2) h_{ˆD}(θ_{t−1})ᵀ h_{ˆD}(θ_{t−1}).

Let ˆϵ(θ) = ∇L(θ) − h_{ˆD}(θ). By Lemma 7, there exists a sequence of (N, N_s, N_d) such that ˆϵ := sup_θ ‖ˆϵ(θ)‖ →_p 0. Then we have

−η h_{ˆD}(θ_{t−1})ᵀ ∇L(θ_{t−1}) = −η h_{ˆD}(θ_{t−1})ᵀ ( h_{ˆD}(θ_{t−1}) + ˆϵ(θ_{t−1}) )
≤ −η ‖h_{ˆD}(θ_{t−1})‖² + η ( ‖h_{ˆD}(θ_{t−1})‖² + ˆϵ² )/2 = −(η/2) ‖h_{ˆD}(θ_{t−1})‖² + (η/2) ˆϵ² ≤ −(η/4) ‖h_{ˆD}(θ_{t−1})‖²,

under the case where ‖h_{ˆD}(θ_{t−1})‖² ≥ 2ˆϵ². We then note that

L(θ_t) ≤ L(θ_{t−1}) − (η/4) ‖h_{ˆD}(θ_{t−1})‖² + (η²ℓ_0/2) ‖h_{ˆD}(θ_{t−1})‖² ≤ L(θ_{t−1}) − (η/8) ‖h_{ˆD}(θ_{t−1})‖²,

when η < 1/(4ℓ_0), which can be satisfied with a sufficiently small learning rate. Summing over t = 1, . . . , T, we have

L(θ_T) − L(θ_0) ≤ −0.125 η ∑_{t=1}^T ‖h_{ˆD}(θ_{t−1})‖².

Note that L(θ) is lower bounded by 0. Then we have ∑_t ‖h_{ˆD}(θ_{t−1})‖² = O(1), so there exists t in {0, . . . , T} such that ‖h_{ˆD}(θ_{t−1})‖² = O(1/T); otherwise there exists t such that ‖h_{ˆD}(θ_{t−1})‖ < 2ˆϵ = o_p(1). Therefore, the empirical estimator satisfies ‖h_{ˆD}(ˆθ)‖ →_p 0. By the uniform convergence (10) from Lemma 7, we have ‖∇L(ˆθ)‖ →_p 0. Then, by the PL condition, there exists a sequence of (N, N_s, N_d) such that L(ˆθ) − L∗ →_p 0, which leads to the desired result.

A.6 Proof of Lemma 6

We follow the same proof scheme as in Shen et al. (2020), where the only difference lies in the gradient with respect to the prior parameter β. To make this paper self-contained, we restate some proof steps here using our notation. Let ‖·‖ denote the vector 2-norm. For a scalar function h(x, y), let ∇_x h(x, y) denote its gradient with respect to x. For a vector function g(x, y), let ∇_x g(x, y) denote its Jacobian matrix with respect to x. Given a differentiable vector function g(x) : R^k → R^k, we use ∇·g(x) to denote its divergence, defined as ∇·g(x) = ∑_{j=1}^k ∂[g(x)]_j / ∂[x]_j, where [x]_j denotes the j-th component of x. We know that ∫ ∇·g(x) dx = 0 for all vector functions g(x) such that g(∞) = 0.
Given a matrix function w(x) = (w_1(x), . . . , w_l(x)) : R^k → R^{k×l}, where each w_i(x), i = 1, . . . , l, is a k-dimensional differentiable vector function, its divergence is defined as ∇·w(x) = (∇·w_1(x), . . . , ∇·w_l(x)).

To prove Lemma 6, we need the following lemma, which specifies the dynamics of the generator joint distribution p_g(x, z) and the encoder joint distribution p_e(x, z), denoted by p_θ(x, z) and q_φ(x, z) here.

Lemma 13 Using the definitions and notations in Lemma 6, we have

∇_θ p_{θ,β}(x, z) = −∇_x p_{θ,β}(x, z)ᵀ g_θ(x) − p_{θ,β}(x, z) ∇·g_θ(x),   (19)
∇_φ q_φ(x, z) = −∇_z q_φ(x, z)ᵀ e_φ(z) − q_φ(x, z) ∇·e_φ(z),   (20)
∇_β p_{θ,β}(x, z) = −∇_x p_{θ,β}(x, z)ᵀ f̃_β(x) − ∇_z p_{θ,β}(x, z)ᵀ f_β(z) − p_{θ,β}(x, z) ( ∇·f̃_β(x) + ∇·f_β(z) ),   (21)

for all data x and latent variables z, where g_θ(G_θ(z, ϵ)) = ∇_θ G_θ(z, ϵ), e_φ(E_φ(x, ϵ)) = ∇_φ E_φ(x, ϵ), f_β(F_β(ϵ)) = ∇_β F_β(ϵ), and f̃_β(G(F_β(ϵ))) = ∇_β G(F_β(ϵ)).

Proof [Proof of Lemma 13] We only prove (21), which is the part distinct from Shen et al. (2020). Let l be the dimension of the parameter β. To simplify notation, let the random vectors Z = F_β(ϵ), X = G(Z) ∈ R^d, and Y = (X, Z) ∈ R^{d+k}, and let p be the probability density of Y. For each i = 1, . . . , l, let ∆ = δe_i, where e_i is an l-dimensional unit vector whose i-th component is one and all the others are zero, and δ is a small scalar. Let Z′ = F_{β+∆}(ϵ), X′ = G(Z′), and Y′ = (X′, Z′), so that Y′ is a random variable transformed from Y by

Y′ = Y + ( f̃_β(X) ∆, f_β(Z) ∆ ) + o(δ).

Let p′ be the probability density of Y′. For an arbitrary y′ = (x′, z′) ∈ R^{d+k}, let y = y′ − ( f̃_β(x′) ∆, f_β(z′) ∆ ) + o(δ) and y = (x, z). Then we have

p′(y′) = p(y) |det(dy′/dy)|^{-1}
       = p(y) (1 + ( ∇·f̃_β(x) + ∇·f_β(z) )ᵀ∆ + o(δ))^{-1}
       = p(y) (1 − ( ∇·f̃_β(x) + ∇·f_β(z) )ᵀ∆ + o(δ))
       = p(y′) − ( ∇_x p(x′, z′)ᵀ f̃_β(x′) + ∇_z p(x′, z′)ᵀ f_β(z′) ) ∆ − p(y′) ( ∇·f̃_β(x′) + ∇·f_β(z′) )ᵀ∆ + o(δ).

Since y′ is arbitrary, the above implies that

p′(x, z) − p(x, z) = −( ∇_x p(x, z)ᵀ f̃_β(x) + ∇_z p(x, z)ᵀ f_β(z) ) ∆ − p(x, z) ( ∇·f̃_β(x) + ∇·f_β(z) )ᵀ∆ + o(δ)

for all x ∈ R^d, z ∈ R^k, and i = 1, . . . , l, leading to (21) by taking δ → 0 and noting that p = p_β and p′ = p_{β+∆}. Similarly we can obtain (19) and (20).

Proof [Proof of Lemma 6] Recall the objective D_KL(q, p) = ∫ q(x, z) log(q(x, z)/p(x, z)) dx dz, and denote its integrand by ℓ(q, p). Let ℓ′_2(q, p) = ∂ℓ(q, p)/∂p. We have

∇_β ℓ(q(x, z), p(x, z)) = ℓ′_2(q(x, z), p(x, z)) ∇_β p_{θ,β}(x, z),

where ∇_β p_{θ,β}(x, z) is computed in Lemma 13. Besides, we have

∇_x·[ ℓ′_2(q, p) p(x, z) f̃_β(x) ] = ℓ′_2(q, p) p(x, z) ∇·f̃_β(x) + ℓ′_2(q, p) ∇_x p(x, z)ᵀ f̃_β(x) + ∇_x ℓ′_2(q, p)ᵀ p(x, z) f̃_β(x),
∇_z·[ ℓ′_2(q, p) p(x, z) f_β(z) ] = ℓ′_2(q, p) p(x, z) ∇·f_β(z) + ℓ′_2(q, p) ∇_z p(x, z)ᵀ f_β(z) + ∇_z ℓ′_2(q, p)ᵀ p(x, z) f_β(z).

Hence

∇_β L_gen = ∫ ∇_β ℓ(q(x, z), p(x, z)) dx dz = ∫ p(x, z) [ ∇_x ℓ′_2(q, p)ᵀ f̃_β(x) + ∇_z ℓ′_2(q, p)ᵀ f_β(z) ] dx dz,

where ∇_x ℓ′_2(q, p) = s(x, z) ∇_x D∗(x, z) and ∇_z ℓ′_2(q, p) = s(x, z) ∇_z D∗(x, z). Hence

∇_β L_gen = E_{(x,z)∼p(x,z)} [ s(x, z) ( ∇_x D∗(x, z)ᵀ f̃_β(x) + ∇_z D∗(x, z)ᵀ f_β(z) ) ]
          = E_ϵ [ s(x, z) ( ∇_x D∗(x, z)ᵀ ∇_β G(F_β(ϵ)) + ∇_z D∗(x, z)ᵀ ∇_β F_β(ϵ) ) |_{x=G(F_β(ϵ)), z=F_β(ϵ)} ],

where the second equality follows from reparametrization.

Appendix B. Causal disentanglement and downstream tasks

In the main text, we first demonstrate the good performance of DEAR in causal disentanglement through causal controllable generation in Section 5.1, and then show the advantages of the DEAR representations in downstream tasks in terms of sample efficiency (Section 5.2.1) and distributional robustness (Section 5.2.2). In comparison with previous methods, mainly the VAE-based disentanglement methods, we adopt the same network architectures for the encoder and decoder, and use the same amount of annotated labels.
In addition, for GraphVAE, we also assume the same prior information on the graph structure as for DEAR. Therefore, we conclude that the superior performance of DEAR is due to better modeling. To further justify whether these advantages come from the disentanglement of the learned representations, in this section we propose a metric for causal disentanglement based on the FactorVAE metric and investigate the correlation between the disentanglement metric and the metrics for downstream tasks.

B.1 Metric for causal disentanglement

Many existing disentanglement papers also propose their own metrics for disentanglement, including the β-VAE metric (Higgins et al., 2017), the FactorVAE metric (Kim and Mnih, 2018), the Mutual Information Gap (MIG) (Chen et al., 2018), and the Separated Attribute Predictability (SAP) score (Kumar et al., 2018). We refer the reader to Locatello et al. (2019) for a comprehensive introduction and discussion of these metrics. However, all of these metrics apply only to the case where the ground-truth generative factors are mutually independent, and do not apply when the factors are correlated. For example, the MIG score measures, for each factor, the normalized gap between its mutual information with the highest and with the second highest coordinate of E(x). Suppose a factor ξ_1 is correlated with ξ_2 and E(x) is a disentangled representation, so that there exist one-to-one functions g_1 and g_2 such that E_1(x) = g_1(ξ_1) and E_2(x) = g_2(ξ_2). Then the mutual information of ξ_1 with (supposedly the highest coordinate) E_1(x) and with (supposedly the second highest coordinate) E_2(x) will both be large, so their difference will be small. As a result, a disentangled representation in this case will not receive a large MIG score as expected.

To this end, we propose a metric for causal disentanglement (i.e., disentanglement of causally related ground-truth factors) based on the FactorVAE metric. Suppose there are m generative factors of interest ξ_1, . . . , ξ_m that are causally related following the true SCM C, which is assumed available. The procedure to compute the metric is presented in Algorithm 2. These steps largely follow those of the FactorVAE metric, with the parts tailored for causal disentanglement explained below the algorithm.

Algorithm 2: Metric for causal disentanglement
Input: Encoder E, meta-parameters M, N
1  for k = 1, . . . , m do
2    for i = 1, . . . , M do
3      Fix ξ_k to a randomly sampled value.
4      Randomly sample the other factors ξ_{−k} from C conditioning on ξ_k, N times.
5      Generate data with the N factors.
6      Obtain their representations using the learned encoder.
7      Normalize each dimension by its empirical standard deviation over the full data (or a large enough random subset).
8      Compute the empirical variance in each dimension of these normalized representations.
9      Take the index of the dimension with the lowest variance.
10     If the index matches k, it counts as a correct sample.
11   Let C_k be the total number of correct samples among the M samples.
12 Obtain the score S = ∑_{k=1}^m C_k / (mM).
Return: S

Line 4: the FactorVAE metric samples all factors independently from uniform distributions, which does not match (and can be far from) the true distribution of the causal factors. Instead, we sample the factors following the true SCM and hence respect the data distribution.

Lines 10-12: the FactorVAE metric uses the error rate of a majority-vote classifier, because in an unsupervised setting one does not know which factor each representation dimension captures. In contrast, the weakly supervised setting guarantees the alignment between each representation dimension and a particular factor, so we do not need the majority-vote classifier to identify this correspondence. Instead, we directly check whether the dimension with the lowest empirical variance matches the given index k.
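For reference, Algorithm 2 can be transcribed as in the following sketch; all the callables are assumptions standing in for the experimental setup (encoder, SCM sampler, image renderer), and their names are hypothetical.

```python
import numpy as np

def causal_disentanglement_score(encode, sample_factors, render, sigmas,
                                 m, M=200, N=50, seed=0):
    """Transcription of Algorithm 2.  `sample_factors(k, xi_k, N)` draws
    N factor vectors from the true SCM C conditioning on factor k fixed
    to xi_k; `render` generates images from factors; `encode` is the
    learned encoder; `sigmas` holds the per-dimension empirical standard
    deviations over the full data."""
    rng = np.random.default_rng(seed)
    correct = 0
    for k in range(m):                        # line 1
        for _ in range(M):                    # line 2
            xi_k = rng.uniform()              # line 3 (a uniform draw here,
            #                                   purely for illustration)
            factors = sample_factors(k, xi_k, N)   # line 4
            reps = encode(render(factors))    # lines 5-6: (N, latent_dim)
            reps = reps / sigmas              # line 7
            j = np.argmin(reps.var(axis=0))   # lines 8-9
            correct += int(j == k)            # line 10
    return correct / (m * M)                  # lines 11-12
```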
As we note, this metric is limited in that it not only requires the ground-truth factors of the data with sufficient coverage of the data distribution, as previous metrics do, but also requires the ground-truth SCM, which is available only for synthetic data. Nevertheless, in this work we only use this metric to provide evaluations and justification of the relationship between causal disentanglement and performance in downstream tasks. We leave a widely applicable quantitative metric for causal disentanglement to future work.

B.2 Experimental results

Figure 14 shows the scatter plots of the metrics considered in the downstream tasks (Section 5.2) against the metric for causal disentanglement (with M = 200 and N = 50). Each metric is used to evaluate seven disentanglement models, including S-β-VAE, S-TCVAE, S-GraphVAE, and multiple DEAR-LIN models with λ = 0.1, 1, 5, 10. All models are trained with fully supervised labels, and GraphVAE and DEAR are given the true graph structure. The network architectures for the encoders and decoders are all the same. We observe a positive correlation between causal disentanglement and performance in downstream tasks, indicating that learned representations with a higher disentanglement score tend to perform better in terms of sample efficiency and distributional robustness. In particular, the small-sample accuracy and the worst-case accuracy benefit the most from better causal disentanglement, as the corresponding fitted lines have the largest slopes.

Figure 14: Relationship between causal disentanglement and performance in downstream tasks: (a) sample efficiency metrics (efficiency, large-sample accuracy, small-sample accuracy) and (b) distributional robustness metrics (average accuracy, worst-case accuracy) against the disentanglement score.

Appendix C. Discussion: supervision for disentanglement learning

We comment on the two forms of supervision that are commonly considered in the literature for the task of disentangled representation learning.

Form 1 (direct and few labels): in some scenarios, we have conceptual knowledge about the data in the sense that we know the concepts of the underlying generative factors, especially those we are interested in. In such cases, a weakly supervised setting is feasible where only a few samples have annotated labels of the factors, since manual labeling of a few examples is practical. A representative work using this form of supervision is Locatello et al. (2020b).

Form 2 (auxiliary information for the full sample): in other scenarios, we have no prior knowledge of what the ground-truth concepts are and thus cannot obtain direct annotated labels for them.
Auxiliary information is then needed for all samples, with assumptions on the variability of this side information and on its correlation with the true generative factors. A representative work along this line is Khemakhem et al. (2020).

Both settings have real applications and limitations, which makes them complementary. On the one hand, Form 2 in general requires weaker supervision than Form 1, in the sense that it does not require direct annotations of the true factors themselves; thus, efforts toward general provable disentanglement should be put into studying Form 2. In fact, however, the auxiliary observed variables in Form 2 also require certain knowledge of the true factors in order to verify the mathematical assumptions required for identifiability, e.g., the variability condition in Khemakhem et al. (2020). Intuitively, auxiliary variables that can guarantee disentanglement should have enough variability and correlation with the true factors. In addition, current identifiability theory for Form 2 still makes relatively strong and restrictive structural assumptions on the true factors, e.g., conditional independence in Khemakhem et al. (2020). On the other hand, current research on disentanglement mostly focuses on scenarios where we do have some conceptual knowledge of the true factors, which makes Form 1 at least a feasible and practical setting. For simple structures of the true factors (e.g., independence or conditional independence, as assumed in most previous work), existing methods with Form 1 can achieve disentanglement, which is much more straightforward than provable disentanglement with supervision of Form 2. However, for more complex structures (e.g., a causal graph, as considered in our paper), existing methods using independent or conditionally independent priors generally cannot identify disentanglement even with supervision in Form 1, as shown in our Proposition 4. In particular, existing formulations (e.g., Locatello et al. (2020b)) in general cannot even reach the optimum of the supervised loss, so they cannot disentangle. To this end, our paper proposes a bidirectional generative model with an SCM prior trained using a GAN-type algorithm, which resolves this problem under the clearly stated setup and assumptions.

Appendix D. Discussion: generalization to unseen interventions

We recall that DEAR is trained on observational data; that is, the training data are IID samples from the data distribution q_x, and the latent variables jointly follow a distribution p_z, e.g., induced by an SCM, without any mixture with interventional distributions. When the generative model is perfectly learned, we have q_x(x) = ∫ p_{G∗}(x|z) p_z(z) dz. An interesting question is then how our method generalizes to unseen interventions. Specifically, let p_z^I(z) be an interventional distribution. The consequent data distribution q_x^I(x) = ∫ p_{G∗}(x|z) p_z^I(z) dz does not match the observational distribution q_x, and a model trained on an IID sample from q_x has never seen q_x^I. We now give some insight into how, given the true graph structure, DEAR trained on observational data can sample from an interventional distribution q_x^I.

We start with the general definition of an SCM. A structural causal model (SCM) over variables Z_i, i = 1, . . . , m, can be generally expressed as

Z_i = f_i(Pa(Z_i; A), ϵ_i),  i = 1, . . . , m,   (22)

where A denotes the adjacency matrix, Pa(Z_i; A) denotes the set of parents of node Z_i, and ϵ_i is the exogenous noise.
Learning an SCM consists of structure learning of A and parameter estimation of all the assignments f_i, i = 1, . . . , m, in the SCM, i.e., of how each node is generated given its parents and exogenous noise. When the underlying causal structure is given, standard parameter estimation methods like maximum likelihood estimation can yield a consistent estimator of the true SCM assignments from the observational data:

Z_i = f̂_i(Pa(Z_i; A), ϵ_i),  i = 1, . . . , m.   (23)

Note that an intervention can be defined as an operation that modifies a subset of the assignments in (22), e.g., changing ϵ_i, or setting f_i (and thus Z_i) to a constant (Pearl et al., 2000; Schölkopf, 2019). Therefore, with the estimated SCM (23) at hand, we can sample from any interventional distribution.
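As a sketch, ancestral sampling with a do-intervention on an estimated SCM of the form (23) may look as follows; all names are illustrative, and the Gaussian exogenous noise is an assumption for this example.

```python
import numpy as np

def sample_scm(f_hat, parents, intervene=None, m=4, n=1, rng=None):
    """Ancestral sampling from an estimated SCM.  `f_hat[i]` is the
    learned assignment of node i and `parents[i]` its parent index list;
    `intervene` maps a node index to a constant, implementing do(Z_i = c)
    by overriding that node's assignment."""
    rng = rng or np.random.default_rng()
    Z = np.zeros((n, m))
    for i in range(m):  # nodes assumed topologically ordered
        if intervene and i in intervene:
            Z[:, i] = intervene[i]      # cut incoming edges: set a constant
        else:
            eps = rng.normal(size=n)    # exogenous noise
            Z[:, i] = f_hat[i](Z[:, parents[i]], eps)
    return Z

# e.g., sample_scm(f_hat, parents, intervene={0: 1.0}) draws latents from
# do(Z_1 = 1), which the learned generator then decodes into images.
```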
We illustrate this with the experimental results shown in Figure 15. In (a), we intervene on the two factors bald and gender: in each line, we keep gender = female and gradually increase the probability of being bald. In particular, in the red box we obtain images of bald female faces, which are never seen in the observational data. In (b), we intervene on beard and gender to generate images of females with beards, shown in the red box. In (c), we show generated samples that gradually put on (sun)glasses, while the training data contain only images with or without glasses and no intermediate states. In (d), we intervene on all four factors: in each line, the image in the middle follows the true SCM (described later in Appendix F), so that the factors satisfy the projection law; we then change the value of one factor while keeping the others fixed, which yields samples that do not satisfy the projection law. In summary, although these interventions do not appear in the observational data, DEAR is able to generate samples from such interventional distributions, suggesting its generalizability to unseen interventions. A more systematic analysis of the out-of-distribution generalizability of the encoder is left to future work. One potential direction is to utilize the generalizability of the generator to unseen interventions to improve the OOD performance of the encoder. Along this direction, for example, Sauer and Geiger (2021) recently combined disentangled generative models and out-of-distribution classification, but adopted a different disentanglement framework.

Figure 15: Samples from unseen interventional distributions. (a) Bald female; (b) female with beard; (c) glasses: gradually putting on (sun)glasses; (d) images not following the projection law.

Appendix E. Experiments in the independent case

In this section, we test our method on benchmark data sets where the ground-truth generative factors are independent, a special case of the causal case with no edges in the graph structure. Gondal et al. (2019) proposed a real-world benchmark data set, MPI3D-real, which consists of over one million images of physical 3D objects with seven independent factors of variation, such as object color, shape, size, and position; they also provided two simulation data sets. We test a simplified version of DEAR with an independent prior (a standard Gaussian) on the real data MPI3D-real and the simulated data MPI3D-simu (MPI3D-realistic in Gondal et al. (2019)). Both data sets consist of 1,036,800 images with resolution 64 × 64 × 3. We assume 0.01% of the data (around 100 samples) have annotated labels. No prior information on the graph structure is needed, since we directly use an independent prior for the latent variables instead of an SCM. We are interested in the disentanglement performance on both real and simulated data, as well as in the transferability of the representations from simulation to the real world and vice versa.

As shown in the experiments of Gondal et al. (2019), most existing VAE-based methods perform similarly in disentanglement, and all the disentanglement metrics also give similar results. Hence, we consider weakly supervised TCVAE as a representative of the baseline methods and use the FactorVAE metric to measure disentanglement. As mentioned above, the weakly supervised setting guarantees the alignment between each representation dimension and a particular factor; therefore, when computing the FactorVAE metric, we skip the majority-vote classifier and directly apply lines 10-12 of Algorithm 2 to obtain the score. As shown in Table 3, DEAR always significantly outperforms TCVAE in the disentanglement score and is particularly superior when training and testing on the same data set. In the transfer setting, where we apply the encoder trained on one data set to the other, both methods suffer a performance decline. This is consistent with the finding of Gondal et al. (2019) that direct transfer of learned representations from simulated to real data works rather poorly. To sum up, this section suggests that DEAR achieves state-of-the-art performance on data whose underlying factors are independent, even though it is developed to handle the causal case.

Table 3: Results on MPI3D data.
Train        Test         Method   Disentanglement
MPI3D-simu   MPI3D-simu   DEAR     0.9543
                          TCVAE    0.5800
MPI3D-real   MPI3D-real   DEAR     0.9579
                          TCVAE    0.5793
MPI3D-simu   MPI3D-real   DEAR     0.4879
                          TCVAE    0.3614
MPI3D-real   MPI3D-simu   DEAR     0.5571
                          TCVAE    0.3443

Appendix F. Implementation details

In this section, we provide the details of the experimental setup and the network architectures used in all experiments, followed by a description of the synthesized Pendulum data set.

Preprocessing and hyperparameters. We preprocess the images by taking 128 × 128 center crops for CelebA and resizing all images in CelebA and Pendulum to 64 × 64 resolution. We adopt Adam with β_1 = 0, β_2 = 0.999, and a learning rate of 1 × 10^-4 for D, 5 × 10^-5 for E, G, and F, and 1 × 10^-3 for the weighted adjacency matrix A. We use a mini-batch size of 128. For the adversarial training in Algorithm 1, we train D once on each mini-batch. The coefficient λ of the supervised regularizer is set to 5 unless indicated otherwise. We use the CE supervised loss both for CelebA, with binary observations of the underlying factors, and for Pendulum, with bounded continuous observations; note that the L2 loss works comparably to the CE loss on Pendulum. The results of DEAR and the baseline methods in controllable generation presented in Section 5.1 and Appendix G use full supervision of the underlying generative factors, i.e., N_s = N, since the qualitative results with 10% labels show no big difference. In downstream tasks, for BGMs with an encoder, we train a two-layer MLP classifier with 100 hidden nodes using Adam with a learning rate of 1 × 10^-2 and a mini-batch size of 128. Models were trained for around 150 epochs on CelebA, 600 epochs on Pendulum, and 50 epochs on MPI3D, on an NVIDIA RTX 2080 Ti.

Figure 16: Generative factors of the Pendulum data set. ξ1: pendulum angle, ξ2: light angle, ξ3: shadow length, ξ4: shadow position.

Description of the Pendulum data set. In Figure 16, we illustrate the generative factors of the synthesized Pendulum data set, following Yang et al. (2021). Given the pendulum angle (ξ1) and the light angle (ξ2), the shadow length (ξ3) and the shadow position (ξ4) are determined by the projection law; note that we consider parallel light in our simulator. Specifically, define the constants: (c_x, c_y) = (10, 10.5) is the center (pendulum origin); l_p = 9.5 is the pendulum length (including the red ball); and the bottom line of a single plot corresponds to y = b with base b = 0.5. Then the ground-truth structural causal model is expressed as follows:

ξ1 ∼ U(π/4, π/2),  ξ2 ∼ U(0, π/4),
s_0 = c_x + (c_y − b)/tan ξ2,  s_1 = c_x + l_p sin ξ1 + (c_y − l_p cos ξ1 − b)/tan ξ2,
ξ3 = |s_1 − s_0|,  ξ4 = (s_1 + s_0)/2,

where s_0 and s_1 denote the projections of the pendulum origin and of the ball onto the base line y = b, and U(a, b) denotes the uniform distribution on the interval (a, b).
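A minimal sketch of the factor simulator, consistent with the projection formulas as reconstructed above (the image rendering is omitted, and the angle ranges are as stated):

```python
import numpy as np

def sample_pendulum_factors(n, seed=0):
    """Sample (xi1, xi2) and derive (xi3, xi4) via the projection law,
    using the constants cx, cy, lp, b defined above."""
    cx, cy, lp, b = 10.0, 10.5, 9.5, 0.5
    rng = np.random.default_rng(seed)
    xi1 = rng.uniform(np.pi / 4, np.pi / 2, n)   # pendulum angle
    xi2 = rng.uniform(0, np.pi / 4, n)           # light angle
    # projections of the pendulum origin and the ball onto the base line
    s0 = cx + (cy - b) / np.tan(xi2)
    s1 = cx + lp * np.sin(xi1) + (cy - lp * np.cos(xi1) - b) / np.tan(xi2)
    xi3 = np.abs(s1 - s0)      # shadow length
    xi4 = (s1 + s0) / 2        # shadow position (midpoint)
    return np.stack([xi1, xi2, xi3, xi4], axis=1)
```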
In downstream tasks, for BGMs with an encoder, we train a two-layer MLP classifier with 100 hidden units using Adam with a learning rate of 1 × 10^{-2} and a mini-batch size of 128. Models were trained for around 150 epochs on CelebA, 600 epochs on Pendulum, and 50 epochs on MPI3D, on an NVIDIA RTX 2080 Ti GPU.

Description of the Pendulum data set. In Figure 16, we illustrate the generative factors of the synthesized Pendulum data set, following Yang et al. (2021). Given the pendulum angle (ξ1) and the light angle (ξ2), the projection law determines the shadow length (ξ3) and the shadow position (ξ4). Note that we assume parallel light in our simulator. Specifically, define the constants: c_x = 10 and c_y = 10.5 are the coordinates of the pendulum origin; l_p = 9.5 is the pendulum length (including the red ball); and the bottom line of a plot corresponds to y = b with base b = 0.5. Writing

s_1 = c_x + l_p \sin\xi_1 + (c_y - l_p \cos\xi_1 - b) \tan\xi_2, \qquad s_0 = c_x + (c_y - b) \tan\xi_2

for the projections of the ball and of the pendulum origin onto the ground line along the light direction, the ground-truth structural causal model is expressed as follows:

\xi_1 \sim U(\pi/4, \pi/2), \quad \xi_2 \sim U(0, \pi/4), \quad \xi_3 = |s_1 - s_0|, \quad \xi_4 = (s_0 + s_1)/2,

where U(a, b) denotes the uniform distribution on the interval (a, b).

Implementation of the SCM. Recall the nonlinear SCM used as the prior, z = f((I - A^T)^{-1} h(ϵ)) := F_β(ϵ). We find Gaussians expressive enough as the unexplained noises, so we set h to the identity mapping. As mentioned in Section 4.1, we require the invertibility of f, and we implement both a linear and a nonlinear f. For a linear f, we set f(z) = Wz + b, where W and b are learnable weights and biases; W is a diagonal matrix, so that f models an elementwise transformation, and its inverse can be computed directly as f^{-1}(z) = W^{-1}(z - b). For a nonlinear f, we use elementwise piecewise linear functions defined by

[f(z)]_i = [w_0]_i [z]_i + \sum_{t=1}^{N_a} [w_t]_i ([z]_i - a_t) \, \mathbb{I}([z]_i \geq a_t) + [b]_i,

where a_0 < a_1 < \cdots < a_{N_a} are the points of division, \mathbb{I}(\cdot) is the indicator function, and {b, w_t : t = 0, . . . , N_a} is the set of learnable parameters (see the code sketch below). By the denseness of piecewise linear functions in C[0, 1] (Shekhtman, 1982), this family is expressive enough to model general elementwise nonlinear invertible transformations.

Network architectures. We follow the architectures used in Shen et al. (2020). Specifically, for such realistic data, we adopt the SAGAN (Zhang et al., 2019) architecture for D and G. The D network consists of three modules, as shown in Figure 17(a) and described in detail in Shen et al. (2020). The architectures of G and Dx are given in Figure 17(b-c) and Table 4. The encoder is a ResNet-50 (He et al., 2016) followed by a 4-layer MLP of width 1024 placed after the global average pooling layer.

Table 4: SAGAN architecture (k = 100 for CelebA, k = 6 for Pendulum; ch = 32).

(a) Generator
  Input: z ∈ R^k ∼ p_z
  Linear → 4 × 4 × 16ch
  ResBlock up 16ch → 16ch
  ResBlock up 16ch → 8ch
  ResBlock up 8ch → 4ch
  Non-Local Block (64 × 64)
  ResBlock up 4ch → 2ch
  BN, ReLU, 3 × 3 Conv 2ch → 3

(b) Discriminator module Dx
  Input: RGB image x ∈ R^{64 × 64 × 3}
  ResBlock down ch → 2ch
  Non-Local Block (64 × 64)
  ResBlock down 2ch → 4ch
  ResBlock down 4ch → 8ch
  ResBlock down 8ch → 16ch
  ResBlock 16ch → 16ch
  ReLU, Global average pooling (f_x)
  Linear → 1 (s_x)

Experimental details for baseline methods. We reproduce the S-VAEs, including S-VAE, S-β-VAE, and S-TCVAE, using E and G with the same architectures as DEAR's, and adopt the same optimization algorithm with the same hyperparameters for training.
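As a concrete illustration of the elementwise piecewise linear f defined above, consider the following minimal PyTorch-style sketch (class and parameter names are ours; invertibility, which requires all cumulative slopes to stay positive, is not enforced here):

    import torch
    import torch.nn as nn

    class ElementwisePiecewiseLinear(nn.Module):
        """Sketch of the nonlinear f in the SCM prior. Knot positions a_t are
        fixed; the base slopes w_0, slope increments w_t, and bias b are
        learnable, one value per latent dimension."""

        def __init__(self, dim, knots):
            super().__init__()
            self.register_buffer("a", knots)                     # (Na,) sorted knots
            self.w0 = nn.Parameter(torch.ones(dim))              # base slope [w_0]_i
            self.w = nn.Parameter(torch.zeros(len(knots), dim))  # increments [w_t]_i
            self.b = nn.Parameter(torch.zeros(dim))              # bias [b]_i

        def forward(self, z):
            # relu(z - a_t) equals (z - a_t) * I(z >= a_t) for each knot a_t.
            hinges = torch.relu(z.unsqueeze(1) - self.a.view(1, -1, 1))  # (B, Na, dim)
            return self.w0 * z + (self.w.unsqueeze(0) * hinges).sum(dim=1) + self.b

Since such an f is monotone in each coordinate whenever the cumulative slopes are positive, its inverse can be computed in closed form segment by segment, which is what the bidirectional model requires.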
For the S-VAEs, the coefficient of the independence regularizer is set to 4, since we observe that a larger coefficient hurts disentanglement in the correlated case. We implement GraphVAE ourselves using the same architectures (for the encoder and decoder) and the same optimizer as DEAR. The latent dependencies of GraphVAE consist of a bottom-up network (approximating z|x):

    nn.Linear(latent_dim, 32), nn.BatchNorm1d(32), nn.ELU(),
    nn.Linear(32, node_dim), nn.Linear(node_dim, 2 * node_dim)

and a top-down network (approximating z|parents):

    nn.Linear(n_parent_nodes * node_dim, 32), nn.BatchNorm1d(32), nn.ELU(),
    nn.Linear(32, node_dim), nn.Linear(node_dim, 2 * node_dim)

Note that this implementation follows the original one: z|x, parents is obtained by precision-weighted fusion as in He et al. (2018). Since our factor dependencies are explicit, we use a latent dimension of 32 for more efficient optimization. For the supervised regularizer, we use λ = 1000 to balance generative modeling against the supervised regularizer. The ERM ResNet is trained using the same optimizer with a learning rate of 1 × 10^{-4}. We run the public source code from https://github.com/mkocaoglu/CausalGAN to produce the results of CausalGAN.

Figure 17: (a) Architecture of the discriminator D(x, z); (b) a residual block (upscaling) in the SAGAN generator, where we use nearest-neighbor interpolation for upsampling; (c) a residual block (downscaling) in the SAGAN discriminator.

Appendix G. Additional results in causal controllable generation

In this section, we present more qualitative results in causal controllable generation on two data sets, using DEAR and the baseline methods, including S-VAEs (Locatello et al., 2020b), GraphVAE (He et al., 2018), and CausalGAN (Kocaoglu et al., 2018). We consider three underlying structures on two data sets: Pendulum in Figure 2(a), CelebA-Smile in Figure 2(b), and CelebA-Attractive in Figure 2(c). Note that the ordering of the rows in the traversals below matches the indices in Figure 2.

Figure 18: Results of DEAR. (a) Traversal (CelebA-Smile); (b) intervention (CelebA-Smile); (c) traversal (CelebA-Attractive); (d) intervention (CelebA-Attractive); (e) traversal (Pendulum); (f) intervention (Pendulum). On the left we present traditional latent traversals (the first type of intervention stated in Section 5.1), which show the disentanglement. On the right we show the results of intervening on one latent variable, from which we see the consequent changes of the others (the second type of intervention). Specifically, intervening on a cause variable influences its effect variables, while intervening on an effect variable makes no difference to its causes.

Figure 19: Traversal results of baseline methods. (a) S-TCVAE (CelebA-Smile); (b) S-TCVAE (CelebA-Attractive); (c) S-FactorVAE (CelebA-Smile); (d) S-FactorVAE (CelebA-Attractive); (e) S-β-VAE (CelebA-Smile); (f) S-β-VAE (CelebA-Attractive). We see that entanglement occurs and that some factors are not captured by the generative models (traversing some dimensions of the latent vector makes no difference in the decoded images). Moreover, the generated images from the VAEs are blurry.
Figure 20: Traversal results of baseline methods. (a) CausalGAN (CelebA-Smile); (b) CausalGAN (CelebA-Attractive); (c) S-TCVAE (Pendulum); (d) S-FactorVAE (Pendulum); (e) S-GraphVAE (CelebA-Attractive); (f) S-GraphVAE (Pendulum). CausalGAN uses the binary factors as the conditional attributes, so the traversals in (a-b) exhibit sudden changes. In contrast, we regard the continuous logits of the binary labels as the underlying factors and hence obtain smooth manipulations. In addition, the controllability of CausalGAN is limited, since entanglement still exists. The results of the S-VAEs are explained in Figure 19. The traversals of S-GraphVAE on Pendulum look better than those of the S-VAEs, especially for the first two factors, whereas its performance on CelebA is poor. Moreover, S-GraphVAE has poor generation quality.

References

Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Nan Rosemary Ke, Sébastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, and Christopher Pal. A meta-transfer objective for learning to disentangle causal mechanisms. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=ryxWIgBFPS.

Philippe Brouillard, Sébastien Lachapelle, Alexandre Lacoste, Simon Lacoste-Julien, and Alexandre Drouin. Differentiable causal discovery from interventional data. arXiv preprint arXiv:2007.01754, 2020.

Christopher P. Burgess, Irina Higgins, Arka Pal, Loïc Matthey, Nicholas Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in beta-VAE. NeurIPS Workshop on Learning Disentangled Features, 2017.

Zachary Charles and Dimitris Papailiopoulos. Stability and generalization of learning algorithms that converge to global optima. In International Conference on Machine Learning, pages 745–754. PMLR, 2018.

Tian Qi Chen, Xuechen Li, Roger B. Grosse, and David K. Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, 2018.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.

David Maxwell Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507–554, 2002.

Ishita Dasgupta, Jane Wang, Silvia Chiappa, Jovana Mitrovic, Pedro Ortega, David Raposo, Edward Hughes, Peter Battaglia, Matthew Botvinick, and Zeb Kurth-Nelson. Causal reasoning from meta-reinforcement learning. arXiv preprint arXiv:1901.08162, 2019.

Andrea Dittadi, Frederik Träuble, Francesco Locatello, Manuel Wüthrich, Vaibhav Agrawal, Ole Winther, Stefan Bauer, and Bernhard Schölkopf. On the transfer of disentangled representations in realistic settings. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=8VXvj1QNRl1.

Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In International Conference on Learning Representations, 2017.
Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martín Arjovsky, Olivier Mastropietro, and Aaron C. Courville. Adversarially learned inference. In International Conference on Learning Representations, 2017.

Rick Durrett. Probability: Theory and Examples, volume 49. Cambridge University Press, 2019.

Muhammad Waleed Gondal, Manuel Wüthrich, Djordje Miladinović, Francesco Locatello, Martin Breidt, Valentin Volchkov, Joel Akpo, Olivier Bachem, Bernhard Schölkopf, and Stefan Bauer. On the transfer of inductive bias from simulation to the real world: a new disentanglement dataset. arXiv preprint arXiv:1906.03292, 2019.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

Jiawei He, Yu Gong, Joseph Marino, Greg Mori, and Andreas Lehrmann. Variational autoencoders with jointly optimized latent dependency structure. In International Conference on Learning Representations, 2018.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.

Nan Rosemary Ke, Olexa Bilaniuk, Anirudh Goyal, Stefan Bauer, Hugo Larochelle, Bernhard Schölkopf, Michael C. Mozer, Chris Pal, and Yoshua Bengio. Learning neural causal models from unknown interventions. arXiv preprint arXiv:1910.01075, 2019.

Nan Rosemary Ke, Aniket Rajiv Didolkar, Sarthak Mittal, Anirudh Goyal, Guillaume Lajoie, Stefan Bauer, Danilo Jimenez Rezende, Michael Curtis Mozer, Yoshua Bengio, and Christopher Pal. Systematic evaluation of causal discovery in visual model-based reinforcement learning. 2021.

Ilyes Khemakhem, Diederik Kingma, Ricardo Monti, and Aapo Hyvärinen. Variational autoencoders and nonlinear ICA: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pages 2207–2217, 2020.

Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In International Conference on Machine Learning, 2018.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

Murat Kocaoglu, Christopher Snyder, Alexandros G. Dimakis, and Sriram Vishwanath. CausalGAN: Learning causal implicit generative models with adversarial training. In International Conference on Learning Representations, 2018.

Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. In International Conference on Learning Representations, 2018.

Felix Leeb, Yashas Annadani, Stefan Bauer, and Bernhard Schölkopf. Structural autoencoders improve representations for generation and transfer. arXiv preprint arXiv:2006.07796, 2020.
Zinan Lin, Kiran K. Thekumparampil, Giulia Fanti, and Sewoong Oh. InfoGAN-CR and ModelCentrality: Self-supervised model training and selection for disentangling GANs. In International Conference on Machine Learning, 2020.

Chaoyue Liu, Libin Zhu, and Mikhail Belkin. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. arXiv preprint arXiv:2003.00307, 2020.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In IEEE International Conference on Computer Vision, pages 3730–3738, 2015.

F. Locatello, S. Bauer, M. Lucic, G. Raetsch, S. Gelly, B. Schölkopf, and O. Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4114–4124. PMLR, June 2019. URL http://proceedings.mlr.press/v97/locatello19a.html.

Francesco Locatello, Ben Poole, Gunnar Rätsch, Bernhard Schölkopf, Olivier Bachem, and Michael Tschannen. Weakly-supervised disentanglement without compromises. In International Conference on Machine Learning, 2020a.

Francesco Locatello, Michael Tschannen, Stefan Bauer, Gunnar Rätsch, Bernhard Schölkopf, and Olivier Bachem. Disentangling factors of variation using few labels. In International Conference on Learning Representations, 2020b.

Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. In International Conference on Machine Learning, pages 2391–2400. JMLR.org, 2017.

Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

Raha Moraffah, Bahman Moraffah, Mansooreh Karami, Adrienne Raglin, and Huan Liu. CAN: A causal adversarial network for learning observational and interventional distributions. arXiv preprint arXiv:2008.11376, 2020.

Suraj Nair, Yuke Zhu, Silvio Savarese, and Li Fei-Fei. Causal induction from visual observations for goal directed tasks. arXiv preprint arXiv:1910.01751, 2019.

Ignavier Ng, AmirEmad Ghassami, and Kun Zhang. On the role of sparsity and DAG constraints for learning linear DAGs. arXiv preprint arXiv:2006.10201, 2020.

Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Elsevier, 2014.

Judea Pearl et al. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.

Jonas Peters and Peter Bühlmann. Identifiability of Gaussian structural equation models with equal error variances. Biometrika, 101(1):219–228, 2014.

Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 3(4):643–653, 1963.

Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. In Advances in Neural Information Processing Systems, pages 14866–14876, 2019.

Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.

Axel Sauer and Andreas Geiger. Counterfactual generative networks. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=BXewfAYMmJw.

Bernhard Schölkopf. Causality for machine learning. arXiv preprint arXiv:1911.10500, 2019.
Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. In International Conference on Machine Learning, 2012.

Boris Shekhtman. Why piecewise linear functions are dense in C[0, 1]. Journal of Approximation Theory, 36(3):265–267, 1982.

Xinwei Shen, Tong Zhang, and Kani Chen. Bidirectional generative modeling using adversarial gradient estimation. arXiv preprint arXiv:2002.09161, 2020.

Rui Shu, Yining Chen, Abhishek Kumar, Stefano Ermon, and Ben Poole. Weakly supervised disentanglement with guarantees. In International Conference on Learning Representations, 2020.

Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Advances in Neural Information Processing Systems, pages 3738–3746, 2016.

Peter Spirtes, Clark N. Glymour, Richard Scheines, and David Heckerman. Causation, Prediction, and Search. MIT Press, 2000.

Jan Stühmer, Richard Turner, and Sebastian Nowozin. Independent subspace analysis for unsupervised learning of disentangled representations. In International Conference on Artificial Intelligence and Statistics, pages 1200–1210. PMLR, 2020.

Raphael Suter, Djordje Miladinović, Bernhard Schölkopf, and Stefan Bauer. Robustly disentangled causal mechanisms: Validating deep representations for interventional robustness. In International Conference on Machine Learning, pages 6056–6065. PMLR, 2019.

Frederik Träuble, Elliot Creager, Niki Kilbertus, Francesco Locatello, Andrea Dittadi, Anirudh Goyal, Bernhard Schölkopf, and Stefan Bauer. On disentangled representations learned from correlated data. In International Conference on Machine Learning, pages 10401–10412. PMLR, 2021.

Sara van de Geer. Empirical Processes in M-estimation, volume 6. Cambridge University Press, 2000.

Aad W. van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.

Mengyue Yang, Furui Liu, Zhitang Chen, Xinwei Shen, Jianye Hao, and Jun Wang. CausalVAE: Disentangled representation learning via neural structural causal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9593–9602, June 2021.

Yue Yu, Jie Chen, Tian Gao, and Mo Yu. DAG-GNN: DAG structure learning with graph neural networks. In International Conference on Machine Learning, 2019.

Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In International Conference on Machine Learning, pages 7354–7363. PMLR, 2019.

Jiji Zhang and Peter Spirtes. Intervention, determinism, and the causal minimality condition. Synthese, 182(3):335–347, 2011.

Kun Zhang and Aapo Hyvärinen. On the identifiability of the post-nonlinear causal model. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.

Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, pages 56–85, 2004.

Shengjia Zhao, Jiaming Song, and Stefano Ermon. Learning hierarchical features from generative models. In International Conference on Machine Learning, 2017.

Xun Zheng, Bryon Aragam, Pradeep K. Ravikumar, and Eric P. Xing. DAGs with NO TEARS: Continuous optimization for structure learning. Advances in Neural Information Processing Systems, 31, 2018.