# nonlinear_ica_using_volumepreserving_transformations__21baac59.pdf

Published as a conference paper at ICLR 2022

NONLINEAR ICA USING VOLUME-PRESERVING TRANSFORMATIONS

Xiaojiang Yang1, Yi Wang1, Jiacheng Sun2, Xing Zhang2, Shifeng Zhang2, Zhenguo Li2 & Junchi Yan1
1 Shanghai Jiao Tong University, 2 Huawei Noah's Ark Lab
{yangxiaojiang, refraction334, yanjunchi}@sjtu.edu.cn
{sunjiacheng1, zhangxing85, zhangshifeng4, li.zhenguo}@huawei.com

(This work was done when Xiaojiang Yang was a research intern at Huawei Noah's Ark Lab. The corresponding author is Junchi Yan. The SJTU authors are with the MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University.)

ABSTRACT

Nonlinear ICA is a fundamental problem in machine learning, aiming to identify the underlying independent components (sources) from data which is assumed to be a nonlinear function (the "mixing function") of these sources. Recent works prove that if the sources have particular structures (e.g. temporal structure), they are theoretically identifiable even if the mixing function is arbitrary. However, in many cases such restrictions on the sources are difficult to satisfy or even verify, which inhibits the applicability of the proposed methods. Different from these works, we propose a general framework for nonlinear ICA in which the mixing function is assumed to be a volume-preserving transformation, while the conditions on the sources can be much looser. We provide an insightful proof of the identifiability of the proposed framework. We implement the framework with volume-preserving flow-based models, and verify our theory by experiments on artificial data and synthesized images. Moreover, results on real-world images indicate that our framework can disentangle interpretable features.

1 INTRODUCTION

Independent component analysis (ICA) is one of the most fundamental problems in machine learning. Earlier works concentrate on linear ICA (Comon, 1994), in which the observed data is assumed to be a linear and invertible transformation (called the "mixing function") of several independent components (called "sources"), and the goal is to identify the independent components from data points. Recently, there has been increasing interest in nonlinear ICA (Hyvärinen & Pajunen, 1999), in which the mixing function is generalized to be nonlinear.

This nonlinear problem is crucial, as it is the theoretical foundation of many important tasks. One example is disentanglement (Locatello et al., 2019; Sorrenson et al., 2020; Locatello et al., 2020), which aims at learning explanatory factors of variation from data (Bengio et al., 2013) and hence facilitates the interpretability of representation learning as well as its downstream tasks (Peters et al., 2017; Lake et al., 2017). Another example is controlling the generation process of generative models using semantic factors (Karras et al., 2019; Abdal et al., 2021). These tasks essentially require their latent variables to be identifiable in highly nonlinear latent models. Otherwise the latent variables can be mixtures of explanatory / semantic factors, and hence are probably neither explanatory nor semantic.

The central problem in nonlinear ICA is unidentifiability. Specifically, if the observed data points are independent and identically distributed (i.i.d. in short), i.e. there is no temporal or similar structure in the data, then the sources are essentially not identifiable (Hyvärinen & Pajunen, 1999). This motivates many works to introduce some structure into the data for an identifiability guarantee.

To deal with the unidentifiability problem, most existing works on nonlinear ICA introduce structure into the data by restricting the sources in some plausible ways.
One popular solution is to assume the sources have some kind of temporal structure (Hyvärinen & Morioka, 2016; 2017). These works place almost no restriction on the mixing function, but limit the data to the form of time series. Recent works (Hyvärinen et al., 2019; Khemakhem et al., 2020) extend the applicability of nonlinear ICA by introducing auxiliary variables such as class labels, and assume that the priors of the sources are members of the exponential family. However, these works require many classes in a dataset, and each class should not be isolated in data space (Sorrenson et al., 2020), which is difficult to satisfy. To see this, note that many datasets (e.g. CIFAR-10 (Krizhevsky et al., 2009)) have only a few classes, and usually each class is isolated in data space.

Another direction for introducing structure into the data is to restrict the mixing function. One way is to assume that the mixing function is post-nonlinear (Taleb & Jutten, 1999), i.e. each data variable is a nonlinear function of a linear combination of the sources. This is obviously overly restrictive, and hence impractical. Some works provide a few heuristic settings of learning algorithms to facilitate identifiability (Watters et al., 2019; Sorrenson et al., 2020); e.g. Sorrenson et al. (2020) empirically report that volume preservation of flow-based models promotes disentanglement.

In this work, we explore a balancing direction: introduce some natural structure into the mixing function, and meanwhile loosen the restrictions on the sources. Specifically, motivated by the empirical evidence of Sorrenson et al. (2020), we assume the mixing function to be volume-preserving (a weaker condition is provided in Appendix B), and seek natural restrictions on the sources for an identifiability guarantee. With mild conditions, we establish novel and fundamental identifiability theorems, which mainly show that if there exist two distinct classes in the dataset, the true sources are identifiable. The proofs are essentially different from those of existing works, and provide some insights for nonlinear ICA: the main indeterminacy of a nonlinear ICA framework using volume-preserving mixing functions is the rotation of latent variables.

Based on our identifiability theorems, we implement the proposed nonlinear ICA framework using a volume-preserving flow-based model (Dinh et al., 2014), which requires merely two classes of data according to our theory. First, we empirically show that with the most essential conditions on the sources (two distinct classes with non-zero overlap) and the mixing function, the sources can be well identified, which implies that there might exist extensions of our theorems with weaker conditions. Then we compare our implementation with the state-of-the-art identifiable model iVAE (Khemakhem et al., 2020) on artificial and synthetic image datasets, and show that our framework achieves remarkably better results in terms of the mean correlation coefficient. More importantly, we empirically show that besides the given two classes, involving more classes does not improve identifiability.
Moreover, experiments on MNIST (LeCun et al., 1998) and CelebA (Liu et al., 2015) indicate that the implementation is able to disentangle interpretable features using only two classes of data, which demonstrates the applicability of our theory.

Our contributions are as follows:

1) We present a new solution to the unidentifiability problem of nonlinear ICA by using general volume-preserving transformations, as widely used in flow-based models. Specifically, our theoretical results lessen the requirements that existing works place on the sources by reducing the number of required classes to 2, at the slight cost of adding some natural conditions on the mixing function. Accordingly, all the restrictions are moderate, which facilitates the applicability of nonlinear ICA.

2) We establish two novel identifiability theorems. Specifically, by using the friendly and general volume-preserving property, they guarantee that the sources can be identified up to a point-wise nonlinearity and a point-wise linearity, respectively. The proofs suggest that the main indeterminacy of a nonlinear ICA framework with some natural conditions on the mixing function is simply the rotation of latent variables. The hope is to lay the foundation for a new path to identifiability.

3) We conduct experiments on both synthetic and real-world data to verify our theory, and point out that as long as the most essential conditions on the sources and mixing function are satisfied, our framework can well identify the true sources. This means that our theory can be further extended. Moreover, our framework clearly outperforms the state-of-the-art nonlinear ICA method, which indicates that our framework is more powerful at identifying the true sources.

2 BACKGROUNDS

In this section, we provide some background on nonlinear ICA theory. We first give a general definition of nonlinear ICA, and then explain its central problem, namely unidentifiability. Based on this, we discuss some recently proposed solutions, including their advantages and disadvantages.

2.1 DEFINITION AND THE KEY PROBLEM OF NONLINEAR ICA

Consider an observed random vector x ∈ R^d. We assume it is generated by an invertible nonlinear transformation f (called the mixing function) using n independent latent variables s = (s_1, ..., s_n) (called independent components or sources) as

x = f(s),  (1)

where n ≤ d (Khemakhem et al., 2020). In earlier ICA theory it is commonly assumed that n = d, i.e. the number of independent components should be the same as the number of observed variables. As it is commonly believed that n ≪ d for real-world data (Cayton, 2005; Narayanan & Mitter, 2010; Rifai et al., 2011), the definition above is much more practical.

The goal of nonlinear ICA theory is to identify (or recover) the independent components s_i from observations of x using an estimating function g. Given the data distribution p(x), if there exists a transformation g such that for arbitrary f, g ∘ f is a point-wise linear transformation, then the mixing model (Eq. 1) is said to be identifiable.

Unfortunately, if there are no further restrictions on the mixing model (Eq. 1), it is seriously unidentifiable (Hyvärinen & Pajunen, 1999; Locatello et al., 2019). Specifically, even if the components of the estimated latent variables z = g ∘ f(s) are independent, z_i can be a mixture of {s_i}_{i=1}^n. In other words, independence of components does not guarantee identifiability of the mixing model. For example,
suppose the prior p(s) is a factorial multivariate Gaussian; take a point-wise scaling of s such that p(s) becomes an isotropic multivariate Gaussian, and then take a rotation around the center. The independence of the components is unchanged, but the obtained components are mixtures of the original sources. As arbitrary density functions can be transformed to a Gaussian (Locatello et al., 2019), this example shows that for any prior, independence of components is insufficient for identifiability.

2.2 EXISTING SOLUTIONS FOR THE UNIDENTIFIABILITY PROBLEM

To obtain identifiability, it is necessary to restrict the mixing model. There are two possible ways to introduce restrictions: 1) restrict the sources s; 2) restrict the mixing function f.

In the first direction, one idea is to introduce temporal structure into the sources s, i.e. assuming the sources are time series s(t) with time index t (Harmeling et al., 2003; Sprekeler et al., 2014; Hyvärinen & Morioka, 2016; 2017). This solution guarantees identifiability of the mixing model with few restrictions on the mixing function, but is limited to the setting of time series. A more general way is to introduce an auxiliary variable u and assume that, conditioned on it, the prior of the sources s is a factorial member of the exponential family (Hyvärinen et al., 2019; Khemakhem et al., 2020):

p(s|u) = \prod_{i=1}^n (Q_i(s_i) / Z_i(u)) \exp( \sum_{j=1}^k T_{i,j}(s_i) λ_{i,j}(u) ),  (2)

where T_{i,j}, λ_{i,j}, Z_i are the so-called sufficient statistics, coefficients and normalizing constant, respectively. Q_i is the so-called base measure, and is simply set to 1 in many cases. k is the order of the distribution, and distributions with higher k are more flexible. The auxiliary variable u is additionally observed, and can be a time index, a class label, etc. This solution seems general, but it requires nk + 1 distinct values of u to guarantee identifiability, which is impossible in most datasets. For example, if we assume that in MNIST (LeCun et al., 1998) the prior of s is a factorial multivariate Gaussian (k = 2), and attempt to identify the sources using the 10 labels of the dataset based on the theory above, then we can identify at most 4 sources. However, the intrinsic dimension is obviously far more than 4 (Pope et al., 2021), and hence we cannot identify the true sources.

In the second direction, there are very few works that obtain identifiability by restricting the mixing function f in nonlinear ICA theory. One solution is to assume f to be post-nonlinear, i.e. {x_j}_{j=1}^d are nonlinear functions of linear combinations of the sources; then the mixing model is identifiable (Taleb & Jutten, 1999). This condition is too restrictive in practice (Hyvärinen et al., 2019).

3 THE NONLINEAR ICA FRAMEWORK

In this section, we propose our general framework for nonlinear ICA (see Fig. 1).

3.1 DEFINITION OF THE GENERATIVE MODEL

We assume that the sources s are independent conditioned on an auxiliary variable u:

p(s|u) = \prod_{i=1}^n p_i(s_i|u),  (3)

where p_i is the i-th marginal distribution of the joint distribution p. Note that so far we do not restrict the density function of each source p_i(s_i|u) to any particular functional form. Hence our restrictions on the prior are looser than those of existing works (Hyvärinen et al., 2019; Khemakhem et al., 2020).

Figure 1: Structure and mild assumptions of our proposed framework. x, s, u and z are the observed data variables, the sources, the additionally observed variable and the estimated latent variables, respectively.
f and g are the mixing function and estimating function, respectively, and both of them are volume-preserving. Conditioned on u, the sources s are assumed to be independent, while the estimated latent variables z are required to follow a factorial multivariate Gaussian.

Next we introduce some coherent and basic restrictions on the mixing function f. First, we assume it is a homeomorphism [1], mapping the sources s to a vector of observed data variables x on an n-dimensional Riemannian manifold M embedded in R^d:

f : R^n → M ⊂ R^d.  (4)

This restriction is natural, as real-world data often lies on a low-dimensional manifold in a high-dimensional space (Rifai et al., 2011; Pope et al., 2021). We further assume it is volume-preserving (Dinh et al., 2014), i.e. the volume of each infinitesimal area in R^n equals that of its image on M under this transformation. The volume-preserving property is:

|det J_f(s)| = 1,  (5)

where J_f is the Jacobian matrix of the mixing function f from the view of differential manifolds [2]. Besides volume-preserving transformations, there exists a broader class of non-volume-preserving mixing functions that guarantee identifiability. We introduce this class of functions in Appendix B.

[1] A homeomorphism is a continuous function between topological spaces with a continuous inverse function.
[2] Here J_f is an n × n Jacobian matrix of f relative to the charts (i.e. a neighborhood with a coordinate system) of R^n and M, rather than a d × n Jacobian matrix relative to those of R^n and R^d. The same holds for J_{g^{-1}}.

3.2 DEFINITION OF THE ESTIMATING MODEL

Given the generative model above, we next seek an estimating model to identify the sources s from x. First, we set the estimating function g to be a homeomorphism mapping the manifold M to R^n:

g : M → R^n, x ↦ z,  (6)

where z are the estimated latent variables. Note that the estimating function g is set to be volume-preserving:

|det J_{g^{-1}}(z)| = 1.  (7)

Note that the volume-preservation here is a natural regularization on the estimating function for identifying the true sources. This is essentially different from the volume-preservation of the mixing function, which is a restriction on the generative process for identifiability. To identify s by z, naturally z should also be conditionally independent. Besides, we set the distribution of z conditioned on u to be a factorial multivariate Gaussian:

q(z|u) = \prod_{i=1}^n (1 / Z_i(u)) \exp( -(z_i - µ_i(u))^2 / (2σ_i^2(u)) ).  (8)

Actually, the model above places a constraint on the priors p(s|u). Specifically, q(z|u) must be the push-forward of p(s|u) through g ∘ f. As a result, the restriction on z above is actually an implicit one on s. To see this, note that a volume-preserving transformation does not change the volume of a distribution's support. Hence, a multivariate uniform distribution cannot be transformed into a multivariate Gaussian by a volume-preserving function, because a uniform distribution has bounded support while the support of a Gaussian is the whole space R^n. Thus the prior p(s|u) cannot be a multivariate uniform distribution in our theory, and at the very least its support should cover R^n.
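To make the volume-preservation constraints of Eq. 5 and Eq. 7 concrete, the following is a minimal sketch (ours, not the authors' code) of a GIN-style affine coupling layer in PyTorch: the per-dimension log-scales produced by the conditioner are shifted to sum to zero, so the layer's Jacobian determinant is exactly 1. The class name, network size and the zero-mean trick are illustrative choices; GIN enforces the same zero-sum constraint in its own way.

```python
# Minimal sketch of a volume-preserving (GIN-style) affine coupling layer.
# The log-scales are shifted to sum to zero, hence log|det J| = sum(log_s) = 0.
import torch
import torch.nn as nn

class VolumePreservingCoupling(nn.Module):
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.d1 = dim // 2                      # dimensions passed through unchanged
        self.d2 = dim - self.d1                 # dimensions that get scaled and shifted
        self.net = nn.Sequential(               # conditioner network (illustrative size)
            nn.Linear(self.d1, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * self.d2),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.d1], x[:, self.d1:]
        h = self.net(x1)
        log_s, t = h[:, :self.d2], h[:, self.d2:]
        log_s = log_s - log_s.mean(dim=1, keepdim=True)   # enforce sum(log_s) = 0
        return torch.cat([x1, x2 * torch.exp(log_s) + t], dim=1)

    def inverse(self, y):
        y1, y2 = y[:, :self.d1], y[:, self.d1:]
        h = self.net(y1)
        log_s, t = h[:, :self.d2], h[:, self.d2:]
        log_s = log_s - log_s.mean(dim=1, keepdim=True)
        return torch.cat([y1, (y2 - t) * torch.exp(-log_s)], dim=1)

# Quick numerical check that |det J| = 1 for a random input:
layer = VolumePreservingCoupling(dim=4)
x = torch.randn(1, 4)
J = torch.autograd.functional.jacobian(lambda v: layer(v), x)[0, :, 0, :]
print(torch.det(J))   # = 1 up to floating-point error
```

Stacking several such blocks (with permutations in between) yields a nonlinear, invertible and volume-preserving map of the kind required of g(·; θ) above.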
Figure 2: Sketch of the discussion about condition (iii). The blue and the green ellipses represent q(z|u^(1)) and q(z|u^(2)), respectively, and the lengths of their major and minor axes correspond to σ_1^{-1} and σ_2^{-1}. The red solid arrows represent the estimated latent variables (z_1, z_2), while the red dashed arrows represent the obtained new latent variables (z'_1, z'_2). The black solid arrows represent two operations: a scaling that makes one Gaussian (represented by the blue ellipse) isotropic, and a rotation around the center by an arbitrary angle. (a) If condition (iii) is not satisfied, then the latent variables obtained by the two operations are still independent conditioned on both u^(1) and u^(2). (b) If condition (iii) is satisfied, then the obtained latent variables are no longer independent conditioned on u^(2).

4 THEORETICAL ANALYSIS

It is obvious that a generative model defined by Eq. 3-Eq. 5 is still unidentifiable using an estimating model defined by Eq. 6-Eq. 8. For example, if f is a point-wise nonlinear transformation mapping p(s|u) to an isotropic multivariate Gaussian, and g is a rotation around the center by an arbitrary angle, then the conditions Eq. 3-Eq. 8 are satisfied, but the obtained z_i is still a mixture of s. Therefore, we further seek some mild restrictions to guarantee the identifiability of the generative model. The goal is to prove that under mild restrictions, the estimated latent variables z are recoveries of the sources s up to a point-wise linear transformation, i.e. g ∘ f is a point-wise linear function. Our main results are the following theorems:

Theorem 1 (nonlinear identifiability) Assume data points are sampled from a model defined by Eq. 3-Eq. 8, and there exist two distinct observations of u, denoted by u^(1) and u^(2), such that:

(i) Both f and g have all second order derivatives.
(ii) µ_i(u^(1)) = µ_i(u^(2)), ∀ i ∈ [n].
(iii) {σ_1(u^(2))/σ_1(u^(1)), ..., σ_n(u^(2))/σ_n(u^(1))} are distinct.

Then g ∘ f is a composition of a point-wise nonlinear transformation and a permutation.

Remarks. The theorem (proof given in Appendix A) guarantees that the sources s of the generative model above are identifiable up to a nonlinear point-wise transformation using the estimating model above, and hence the generative model is identifiable. In the following, we discuss the meaning of conditions (ii) and (iii), and omit condition (i) since it is very mild and trivial.

Condition (ii) means that the two multivariate Gaussians q(z|u^(1)) and q(z|u^(2)) are concentric, which guarantees that samples from the two Gaussians have a large enough "overlap". Otherwise, if the centers of the two Gaussians are far apart, samples from these two Gaussians are likely to be completely disjoint under Monte Carlo sampling. In this case, for the estimating function g there is effectively only one Gaussian, because it can deal with samples from the two Gaussians separately, and hence identifiability is not guaranteed. In our proof (see Appendix A), however, this condition does not seem to be theoretically necessary, as the supports of the two Gaussians are the whole space and hence they theoretically overlap everywhere. How to prove an identifiability theorem without this condition is an important question, which we leave to future work.

Condition (iii) means that there do not exist two estimated latent variables whose variances under the two Gaussians are proportional. If this condition is not fulfilled, then we can easily construct other latent variables which are mixtures of the estimated latent variables, and hence identifiability cannot be guaranteed. As shown in Fig. 2, assume σ_1(u^(2))/σ_1(u^(1)) = σ_2(u^(2))/σ_2(u^(1)), and let (z'_1, z'_2) = (z_1, z_2) S R, where

S = diag( σ_1(u^(1))^{-1} / \sqrt{σ_1(u^(1))^{-1} σ_2(u^(1))^{-1}}, σ_2(u^(1))^{-1} / \sqrt{σ_1(u^(1))^{-1} σ_2(u^(1))^{-1}} )

is a scaling matrix and R is an arbitrary rotation matrix of order 2, hence SR is a volume-preserving transformation. Then (z'_1, z'_2) are independent conditioned on both u^(1) and u^(2), and hence they are also a valid set of estimated latent variables, but obviously they are mixtures of (z_1, z_2). Therefore, condition (iii) is a natural and necessary restriction for identifiability.
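The counterexample above is easy to check numerically. The following sketch (ours) samples z from two zero-mean Gaussians whose variance ratios coincide, applies the volume-preserving map SR, and verifies that the resulting (z'_1, z'_2) remain uncorrelated (and hence, being Gaussian, independent) under both classes; the angle and variances are arbitrary illustrative values.

```python
# Numerical check of the condition-(iii) counterexample: when the variance ratios
# coincide, (z'_1, z'_2) = (z_1, z_2) S R stays uncorrelated under both classes,
# so the rotation is an unresolved indeterminacy.
import numpy as np

rng = np.random.default_rng(0)
sigma1 = np.array([1.0, 0.5])          # (sigma_1(u^(1)), sigma_2(u^(1)))
sigma2 = 2.0 * sigma1                  # condition (iii) violated: both ratios equal 2

S = np.diag(1.0 / sigma1) / np.sqrt(np.prod(1.0 / sigma1))   # det S = 1, makes class 1 isotropic
theta = 0.7                                                   # arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

for name, sigma in [("u^(1)", sigma1), ("u^(2)", sigma2)]:
    z = rng.normal(size=(200000, 2)) * sigma                  # z | u^(k) ~ N(0, diag(sigma^2))
    z_prime = z @ S @ R
    corr = np.corrcoef(z_prime.T)[0, 1]
    print(name, "correlation of (z'_1, z'_2):", round(corr, 4))   # ~ 0 for both classes
# Repeating this with sigma2 = np.array([2.0, 1.5]) * sigma1 (distinct ratios) yields a
# clearly non-zero correlation under u^(2): condition (iii) rules the rotation out.
```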
Although (ii) and (iii) are conditions on the distributions of the estimated latent variables z, they are also implicit restrictions on the prior of the sources s. To see this, note that z = g ∘ f(s) and g ∘ f is a function constrained by Eq. 4-Eq. 6, and hence (ii) and (iii) implicitly restrict the prior through g ∘ f.

If we further restrict the prior p(s|u) to be a multivariate Gaussian, then we can reduce the nonlinear indeterminacy to a linear indeterminacy:

Theorem 2 (linear identifiability) Assume the hypotheses of Theorem 1 hold, and p(s|u^(1)) and p(s|u^(2)) are multivariate Gaussians. Then g ∘ f is a composition of a point-wise linear transformation and a permutation.

In this case, the sources have restrictions similar to those on the estimated latent variables. Specifically, let µ^s(u) and σ^s(u) be the mean and variance of p(s|u); then we have: (ii') µ^s_i(u^(1)) = µ^s_i(u^(2)), ∀ i ∈ [n]; and (iii') {σ^s_1(u^(2))/σ^s_1(u^(1)), ..., σ^s_n(u^(2))/σ^s_n(u^(1))} are distinct.

The proofs of the theorems above (see Appendix A) give us some key insights for nonlinear ICA: a nonlinear ICA framework using volume-preserving mixing functions has only three kinds of indeterminacy: point-wise nonlinear transformation, permutation and rotation of the latent variables. Therefore, resolving the indeterminacy of rotation by condition (iii) leads to identifiability, and the indeterminacy of point-wise nonlinear transformation can be reduced to a point-wise linear transformation by the Gaussianity of p(s|u).

5 METHODOLOGY

As our theory requires an invertible estimating function, we set it to be a volume-preserving flow-based model (Dinh et al., 2014). The estimating function is denoted by g(·; θ), where θ refers to the model parameters. Given the data variables x, g(x; θ) denotes the estimated latent variables z. Given a dataset D = {(x^(1,1), u^(1)), ..., (x^(1,M), u^(1)), (x^(2,1), u^(2)), ..., (x^(2,N), u^(2))} with only two labels u^(1) and u^(2), we can construct a loss from our theory. As in our theory the conditional distribution of the estimated latent variables should be a factorial multivariate Gaussian, we have to push {g(x^(1,i); θ)}_{i=1}^M and {g(x^(2,i); θ)}_{i=1}^N towards q(z|u^(1)) and q(z|u^(2)) (defined by Eq. 8), respectively. Thus, we minimize the negative log-likelihood following (Sorrenson et al., 2020):

L(θ) = E_{(x,u)∼D} [ \sum_i ( (g_i(x; θ) - µ_i(u))^2 / (2σ_i^2(u)) + \log σ_i(u) ) ].  (9)

To optimize the loss above, the number of sources n is required, which is usually unknown. Fortunately, we can integrate dimension reduction into the loss, and hence the underlying dimensionality can be estimated by learning. Specifically, the loss above can be further simplified as follows, which has the ability of dimension reduction (see Appendix C for a detailed derivation and discussion):

L(θ) = \sum_{k=1}^{2} \sum_{i=1}^{d} \log σ_i(u^(k); θ),  (10)

where σ_i(u^(k); θ) = \sqrt{ E_{(x,u^(k))∼D} [ ( g_i(x; θ) - E_{(x,u^(k))∼D}[g_i(x; θ)] )^2 ] }. We utilize this simplified version throughout our experiments.
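A minimal sketch (ours, with illustrative names) of how the simplified loss in Eq. 10 can be evaluated on a mini-batch is given below; `model` stands for any invertible volume-preserving network such as the GIN used in our experiments, and the small `eps` is only for numerical stability.

```python
# Sketch of the simplified loss of Eq. 10 on a mini-batch: for each of the two classes,
# take the per-dimension standard deviation of the latent codes and sum their logarithms.
import torch

def simplified_loss(model, x, u, eps=1e-6):
    """x: (B, d) batch of data, u: (B,) integer class labels in {0, 1}."""
    z = model(x)                                    # estimated latents, shape (B, d)
    loss = 0.0
    for k in (0, 1):
        z_k = z[u == k]                             # latents of class u^(k)
        sigma = z_k.std(dim=0, unbiased=False)      # biased std, as in Appendix C
        loss = loss + torch.log(sigma + eps).sum()  # sum_i log sigma_i(u^(k); theta)
    return loss

# Usage sketch (any invertible model works, e.g. a stack of the coupling blocks above):
# loss = simplified_loss(model, x_batch, u_batch); loss.backward()
```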
6 EXPERIMENTS

Although our Theorem 1 does not explicitly restrict the functional form of p(s|u), it does implicitly require p(s|u) to be identical to a factorial multivariate Gaussian up to volume-preserving transformations. However, it is difficult to verify whether this implicit restriction is satisfied for a given dataset. Therefore, our experiments are mainly based on Theorem 2, in which the prior is also assumed to be a factorial multivariate Gaussian. Hence, the main goal of this section is to show the performance of our framework in achieving linear identifiability in this case.

6.1 PROTOCOL

Datasets. We run experiments on an artificial dataset and a synthetic image dataset called "Triangles", as well as on MNIST (LeCun et al., 1998) and CelebA (Liu et al., 2015). The generation processes of the artificial dataset and the synthetic images are described in Appendix D.

Model specification. To implement the volume-preserving estimating function g(·; θ), we choose a volume-preserving flow called GIN (Sorrenson et al., 2020), which is a volume-preserving version of RealNVP (Dinh et al., 2017). For the experiments on artificial data, the network is built from fully connected coupling blocks. For the image datasets Triangles and MNIST, the networks are built from both convolutional and fully connected coupling blocks. The parameters of each network are updated by minimizing the loss function in Eq. 10 using an Adam optimizer (Kingma & Ba, 2014). For details of the networks and their optimization, refer to (Sorrenson et al., 2020).

Performance metric. To quantitatively compare the performance of different methods, we compute the mean correlation coefficient (MCC) (Khemakhem et al., 2020) between the sources and the estimated latent variables. The computation of MCC consists of three steps: i) calculate all pairs of correlation coefficients between the sources and the estimated latent variables; ii) solve a linear sum assignment problem to assign one estimated latent variable to each source so that the total correlation coefficient is highest; iii) take the average correlation coefficient of the obtained pairs. A high MCC means that we have successfully identified the true sources, up to point-wise transformations.
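The MCC computation described above can be sketched as follows (our code, not the authors'); it uses SciPy's linear sum assignment on the absolute correlation matrix, which is a common convention for this metric.

```python
# Sketch of the three-step MCC computation.
import numpy as np
from scipy.optimize import linear_sum_assignment

def mcc(sources, latents):
    """sources: (N, n) ground-truth sources; latents: (N, m) estimated latents, m >= n."""
    n = sources.shape[1]
    corr = np.corrcoef(sources.T, latents.T)[:n, n:]       # (n, m) cross-correlations
    row, col = linear_sum_assignment(-np.abs(corr))        # maximize total |correlation|
    return np.abs(corr[row, col]).mean()

# Example: a per-component rescaling (point-wise linear) keeps MCC at 1,
# while a 45-degree rotation of the sources drops it to about 0.71.
rng = np.random.default_rng(0)
s = rng.normal(size=(10000, 2))
R = np.array([[np.cos(np.pi / 4), -np.sin(np.pi / 4)],
              [np.sin(np.pi / 4),  np.cos(np.pi / 4)]])
print(mcc(s, s * np.array([3.0, 0.5])))    # ~ 1.0
print(mcc(s, s @ R.T))                     # ~ 0.71
```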
6.2 RESULTS ON ARTIFICIAL DATASET AND TRIANGLES

On the artificial dataset and Triangles, the labels of the Gaussians in the prior of the sources are used as the observations of the auxiliary variable u required by our theory. We mainly aim at verifying our theory, and at comparing our framework with the state-of-the-art nonlinear ICA method, namely iVAE (Khemakhem et al., 2020), which is based on variational autoencoders (VAE) (Kingma & Welling, 2013).

6.2.1 ABLATION STUDY

To verify which conditions in our theory are essential for identifying the true sources, we perform an ablation study on artificial data. The chosen conditions include: i) the concentricity of the two Gaussians p(s|u^(1)) and p(s|u^(2)), denoted by µ^s_i(u^(1)) = µ^s_i(u^(2)); ii) the variances of the two Gaussians should not be proportional, denoted by σ^s_1(u^(2))/σ^s_1(u^(1)) ≠ σ^s_2(u^(2))/σ^s_2(u^(1)); iii) the volume-preserving property of the mixing function f; iv) the existence of two distinct observations of the auxiliary variable, u^(1) and u^(2); v) the existence of overlap between samples from p(s|u^(1)) and p(s|u^(2)). For the verifying experiment of each condition, we synthesize an artificial dataset in which that condition is not satisfied.

Table 1: MCC score on artificial data when part of the conditions are not satisfied.

Condition                                                 Mean    STD
Fully satisfied                                           1.000   0.000
µ^s_1(u^(1)) ≠ µ^s_1(u^(2))                               1.000   0.000
σ^s_1(u^(2))/σ^s_1(u^(1)) = σ^s_2(u^(2))/σ^s_2(u^(1))     1.000   0.000
Non-volume-preserving f                                   1.000   0.000
u^(1) = u^(2)                                             0.999   0.001
No overlap                                                0.942   0.064

The quantitative results of our framework when the conditions above are not satisfied are shown in Table 1, and the qualitative results are shown in Appendix E for the sake of brevity. Note that in the case of µ^s_1(u^(1)) ≠ µ^s_1(u^(2)), samples from p(s|u^(1)) and p(s|u^(2)) still have non-zero overlap. From these experimental results, we make the following observations:

1) As long as samples from p(s|u^(1)) and p(s|u^(2)) have non-zero overlap, GIN almost perfectly identifies the true sources, even if p(s|u^(1)) and p(s|u^(2)) are not concentric. However, when the overlap vanishes, the performance of GIN declines sharply. These results indicate that condition (ii) in our Theorem 1 is not necessary, but non-zero overlap is crucial for identifiability.

2) When σ^s_1(u^(2))/σ^s_1(u^(1)) = σ^s_2(u^(2))/σ^s_2(u^(1)) or u^(1) = u^(2), the sources are unidentifiable according to our theoretical analysis, but the experimental results show that in these cases the true sources can still be well identified. This does not contradict the necessity of condition (iii) in Theorem 1 and of two distinct classes, but indicates that there exist some biases in the learning process that reduce the indeterminacy of rotation of the latent variables. We conjecture that one such bias is mini-batch sampling: the empirical distributions of different mini-batches can be very different, while according to our theory, two different priors are almost sufficient to guarantee identifiability.

3) When the mixing function f (not the estimating function g) is non-volume-preserving, the recovery of the sources can be somewhat unsuccessful. To see this, note that in Fig. 7(d) the recovery of the ground truth has some visual distortions. This demonstrates that introducing some restrictions on f (like volume preservation or the condition introduced in Appendix B) is necessary.

Figure 3: Qualitative comparison of iVAE and GIN on the artificial dataset with two classes. (a) Ground truth (observations of the sources), consisting of samples from two Gaussians, visualized in two different colors. (b) 2-dim projection of the 10-dim data points. (c) and (d) Estimations of the ground truth by iVAE and GIN, respectively. The plotted estimated latent variables are chosen by assignment of correlation coefficients between the sources and all estimated latent variables.

Figure 4: Qualitative comparison of iVAE and GIN on Triangles images. (a) Traversals by iVAE; (b) traversals by GIN. Each row is a traversal obtained by manipulating one estimated latent variable. The last image of each row shows the heat map (the map of changed pixels) of that row, generated by taking the difference of the 3rd and 7th images. Obviously, the four rows correspond to rotation, width, height and gray level, respectively.

6.2.2 QUALITATIVE RESULTS

Results on artificial data. As shown in Fig. 3, both iVAE and GIN can successfully identify the true sources from their nonlinearly mixed data points. However, while GIN almost perfectly recovers the ground truth, the recovery by iVAE has distortions everywhere.

Results on Triangles.
For a qualitative comparison of iVAE and GIN, we plot traversals of the estimated latent variables, obtained in three steps: i) solve the assignment problem of correlation coefficients between the 4 sources and all estimated latent variables, and choose the assigned latent variables for traversal, denoted by {z_i}_{i=1}^4; ii) estimate the standard deviations of the chosen latent variables, {σ_i}_{i=1}^4; iii) manipulate the value of the i-th latent variable within (-2σ_i, 2σ_i) while keeping the other estimated latent variables unchanged, and then generate the i-th row of images.

In Fig. 4, the qualitative performances of iVAE and GIN are clearly different. Each plotted estimated latent variable of iVAE is a mixture of the true sources; e.g. the 4th row of Fig. 4(a) is a mixture of rotation and color. In contrast, the plotted estimated latent variables of GIN obviously correspond to the true sources: rotation, width, height and gray level, respectively. This shows that our framework is qualitatively much better than iVAE, and is able to identify the true sources from images.

These results demonstrate that our framework is much more powerful than iVAE at identifying the true sources on both artificial data and synthetic images with a few classes.

Figure 5: Traversals of the estimated latent variables with the top 4 standard deviations by GIN on MNIST. Panels: (a) rotation, (b) bend of the vertical line, (c) thickness, (d) bend of the horizontal bar.

6.2.3 QUANTITATIVE RESULTS

The quantitative results are shown in Table 2, in which "-mc" means the model is trained and tested on a dataset with m classes. According to the theory of (Hyvärinen et al., 2019; Khemakhem et al., 2020), iVAE requires 9 classes on Triangles and 5 classes on our artificial data, hence we set iVAE-9c, iVAE-5c and iVAE-2c for fair comparisons with GIN. As for GIN, we set GIN-9c, GIN-2c and GIN-1c to verify whether the 2 classes predicted by our theory are sufficient and necessary.

Table 2: Mean and STD of the MCC score on the artificial dataset and Triangles over 5 trials.

Method    Artificial Data    Triangles
iVAE-9c   -                  0.664 ± 0.014
iVAE-5c   0.985 ± 0.001      -
iVAE-2c   0.983 ± 0.001      0.659 ± 0.055
GIN-9c    1.000 ± 0.000      0.857 ± 0.031
GIN-2c    1.000 ± 0.000      0.863 ± 0.011
GIN-1c    0.999 ± 0.001      0.787 ± 0.033

As shown in Table 2, for identifying the true sources, GIN outperforms iVAE w.r.t. MCC on both the artificial data and the synthetic images. This might be due to the volume-preserving nature of GIN, which provides a strong but natural restriction on the estimating function according to our theory. Also, besides the given two classes, increasing the number of classes does not markedly improve the performance of GIN, while reducing the number of classes to 1 leads to a sharp decline in MCC. This means that for GIN, two classes of data are sufficient and necessary for identifiability. These results are consistent with our theory.
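For reference, the traversal procedure of Sec. 6.2.2 can be sketched as below (our code, with illustrative names); it assumes an invertible `model` exposing an `inverse` method, such as a stack of the coupling blocks sketched in Sec. 3.

```python
# Sketch of the three-step traversal procedure: pick the assigned latent dimensions,
# estimate their standard deviations, then vary one dimension in (-2*sigma_i, 2*sigma_i)
# while keeping all the others fixed.
import torch

@torch.no_grad()
def traversal_rows(model, x_ref, dims, data_batch, n_steps=8):
    """x_ref: (1, d) anchor sample; dims: latent indices chosen by the correlation assignment."""
    z_ref = model(x_ref)                              # latent code of the anchor
    sigma = model(data_batch).std(dim=0)              # per-dimension std over a batch
    rows = []
    for i in dims:
        steps = torch.linspace(-2.0, 2.0, n_steps) * sigma[i]
        z = z_ref.repeat(n_steps, 1)
        z[:, i] = z[:, i] + steps                     # manipulate only dimension i
        rows.append(model.inverse(z))                 # decode back to data space
    return rows                                       # one row of images per latent

# Usage sketch: rows = traversal_rows(gin_model, x_ref, dims=[3, 7, 1, 0], data_batch=x_all)
```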
6.3 RESULTS ON REAL WORLD DATASETS

To verify whether our framework is able to identify sources from real-world datasets, we conduct experiments on MNIST and CelebA; the setting and results on CelebA are reported in Appendix F. On MNIST, we pick images of digits 1 and 7 because they probably have non-zero overlap, and use their labels as observations of the auxiliary variable. As there is no ground truth in MNIST, we cannot compute the MCC score, and hence we mainly report qualitative results. Moreover, the estimated latent variables with high standard deviations are chosen for plotting, as such variables are probably more meaningful (Sorrenson et al., 2020). We hope the estimated latent variables correspond to some interpretable attributes of the dataset, which are viewed as true sources in the disentanglement literature (Bengio et al., 2013; Locatello et al., 2019).

As shown in Fig. 5, the estimated latent variables with the top 4 standard deviations are highly interpretable, corresponding to rotation, bend of the vertical line, thickness and bend of the horizontal bar. These results are comparable with those obtained by fully utilizing the 10 classes of MNIST (Sorrenson et al., 2020). Therefore, two classes are probably sufficient here for identifying the true sources, which is consistent with our theory.

7 CONCLUSION

We have explored a new direction in nonlinear ICA: restrict the mixing function to be volume-preserving, and meanwhile relax the restrictions on the sources. With mild conditions, we establish two novel identifiability theorems, which guarantee that the sources can be identified up to a point-wise nonlinearity and a point-wise linearity, respectively. The proofs give insights for nonlinear ICA: under some natural conditions on the mixing function, the main indeterminacy is simply the rotation of latent variables. This provides new paths and new techniques for identifiability guarantees. Regarding the applicability of our theory, we show in experiments that even when most of the conditions are not satisfied, the true sources can still be successfully identified. This indicates that a stronger identifiability theorem with fewer conditions may exist, which is appealing for further exploration. We also empirically show that a volume-preserving flow-based model using two classes significantly outperforms the state-of-the-art nonlinear ICA method (iVAE), and is able to learn interpretable attributes from real-world datasets. These results show the advantages and applicability of our proposed framework.

ACKNOWLEDGMENTS

This work was supported by the China Key Research and Development Program (2020AAA0107600) and the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102).

REFERENCES

Rameen Abdal, Peihao Zhu, Niloy J Mitra, and Peter Wonka. Styleflow: Attribute-conditioned exploration of stylegan-generated images using conditional continuous normalizing flows. ACM Transactions on Graphics (TOG), 40(3):1-21, 2021.

Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35:1798-1828, 2013.

Lawrence Cayton. Algorithms for manifold learning. Univ. of California at San Diego Tech. Rep, 12(1-17):1, 2005.

Pierre Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287-314, 1994.

T. Cover and A. E. Gamal. An information-theoretic proof of Hadamard's inequality. IEEE Transactions on Information Theory, 29(6):930-931, 1983.

Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In International Conference on Learning Representations, 2017.

Stefan Harmeling, Andreas Ziehe, Motoaki Kawanabe, and Klaus-Robert Müller. Kernel-based nonlinear blind source separation. Neural Computation, 15(5):1089-1124, 2003.

Aapo Hyvärinen and Hiroshi Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In Advances in Neural Information Processing Systems, volume 29, pp. 3765-3773, 2016.
Aapo Hyvärinen and Hiroshi Morioka. Nonlinear ICA of temporally dependent stationary sources. In Artificial Intelligence and Statistics, pp. 460-469. PMLR, 2017.

Aapo Hyvärinen and Petteri Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429-439, 1999.

Aapo Hyvärinen, Hiroaki Sasaki, and Richard Turner. Nonlinear ICA using auxiliary variables and generalized contrastive learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 859-868. PMLR, 2019.

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401-4410, 2019.

Ilyes Khemakhem, Diederik Kingma, Ricardo Monti, and Aapo Hyvärinen. Variational autoencoders and nonlinear ICA: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pp. 2207-2217. PMLR, 2020.

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Xiangyu Kong, Changhua Hu, and Zhansheng Duan. Principal component analysis networks and algorithms. Springer, 2017.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), December 2015.

Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, pp. 4114-4124. PMLR, 2019.

Francesco Locatello, Ben Poole, Gunnar Rätsch, Bernhard Schölkopf, Olivier Bachem, and Michael Tschannen. Weakly-supervised disentanglement without compromises. In International Conference on Machine Learning, pp. 6348-6359. PMLR, 2020.

Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.

Hariharan Narayanan and Sanjoy Mitter. Sample complexity of testing the manifold hypothesis. In Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 2, pp. 1786-1794, 2010.

Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017.

Phil Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The intrinsic dimension of images and its impact on learning. In International Conference on Learning Representations, 2021.

Salah Rifai, Yann N Dauphin, Pascal Vincent, Yoshua Bengio, and Xavier Muller. The manifold tangent classifier. In Advances in Neural Information Processing Systems, volume 24, pp. 2294-2302, 2011.
Peter Sorrenson, Carsten Rother, and Ullrich Köthe. Disentanglement by nonlinear ICA with general incompressible-flow networks (GIN). In International Conference on Learning Representations, 2020.

Henning Sprekeler, Tiziano Zito, and Laurenz Wiskott. An extension of slow feature analysis for nonlinear blind source separation. The Journal of Machine Learning Research, 15(1):921-947, 2014.

Anisse Taleb and Christian Jutten. Source separation in post-nonlinear mixtures. IEEE Transactions on Signal Processing, 47(10):2807-2820, 1999.

Nicholas Watters, Loic Matthey, Christopher P Burgess, and Alexander Lerchner. Spatial broadcast decoder: A simple architecture for learning disentangled representations in VAEs. arXiv preprint arXiv:1901.07017, 2019.

A PROOF OF OUR IDENTIFIABILITY THEOREM

We first prove some lemmas on Euclidean space, and then prove our theorems on Riemannian manifolds using these lemmas.

A.1 LEMMAS ON EUCLIDEAN SPACE

Let s ∼ p(s|u), s ∈ R^n, and suppose there exists a homeomorphism

f : R^n → R^n, s ↦ z.  (11)

Suppose f is volume-preserving:

|det J_f(s)| = 1.  (12)

This volume-preserving property directly leads to a key corollary: denoting the density of z conditioned on u by q(z|u), we have p(s|u) = q(f(s)|u). To see this, note that the change of variables rule gives p(s|u) = q(f(s)|u) |det J_f(s)|, and substituting Eq. 12 leads to the conclusion. This is the starting point of our proof. Further suppose the components of s are independent, and q(z|u) is a factorial multivariate Gaussian:

p(s|u) = \prod_{i=1}^n p_i(s_i|u),  (13)

q(z|u) = \prod_{i=1}^n (1 / Z^q_i) \exp( θ^q_{i,1}(u) z_i - θ^q_{i,2}(u) z_i^2 ).  (14)

Lemma 1 Consider the model defined by Eq. 11-Eq. 14. If f has all second order derivatives, then there exist {θ_{i,1}(u), θ_{i,2}(u)}_{i=1}^n such that

\sum_{i=1}^n [ θ_{i,1}(u) ∂^2 f_i/∂s_j ∂s_k (s) - θ_{i,2}(u) ∂f_i/∂s_j (s) ∂f_i/∂s_k (s) ] = 0, ∀ j ≠ k, ∀ s ∈ R^n.  (15)

Proof. Note that p(s|u) = q(f(s)|u); according to Eq. 13-Eq. 14, we have:

\prod_{i=1}^n p_i(s_i|u) = \prod_{i=1}^n (1 / Z^q_i) \exp( θ^q_{i,1}(u) f_i(s) - θ^q_{i,2}(u) f_i(s)^2 ).  (16)

Taking the logarithm, we have:

\sum_{i=1}^n \log p_i(s_i|u) = \sum_{i=1}^n [ θ^q_{i,1}(u) f_i(s) - θ^q_{i,2}(u) f_i(s)^2 - \log Z^q_i ].  (17)

Because f has all second order derivatives, f_i has the Taylor expansion f_i(s_0 + Δs) = f_i(s_0) + ∇f_i(s_0)^T Δs + (1/2) Δs^T H_{f_i}(s_0) Δs + O(‖Δs‖^3). Let s = s_0 + Δs in Eq. 17, suppose f_i(s_0) = 0 without loss of generality, and take the Taylor expansion of f_i; then

\sum_{i=1}^n \log p_i(s_i|u) = \sum_{i=1}^n [ (η^q_i)^T Δs + (1/2) θ^q_{i,1}(u) Δs^T H_{f_i}(s_0) Δs - θ^q_{i,2}(u) ( ∇f_i(s_0)^T Δs )^2 + O(‖Δs‖^3) + C^q_i ],  (18)

where η^q_i = θ^q_{i,1}(u) ∇f_i(s_0) and C^q_i is a constant. Note that the left hand side of the equation above does not contain cross terms {Δs_j Δs_k}_{j ≠ k}, while the right hand side does. Therefore, the coefficients of the cross terms on the right hand side should be zero:

\sum_{i=1}^n [ θ^q_{i,1}(u) ∂^2 f_i/∂s_j ∂s_k (s) - 2 θ^q_{i,2}(u) ∂f_i/∂s_j (s) ∂f_i/∂s_k (s) ] = 0, ∀ j ≠ k, ∀ s ∈ R^n.  (19)

Let θ_{i,1}(u) ≜ θ^q_{i,1}(u) and θ_{i,2}(u) ≜ 2θ^q_{i,2}(u); then the conclusion is obtained.

Remark: Denote the mean of q_i by µ_i and the variance by σ_i^2; then θ^q_{i,1} = µ_i / σ_i^2 and θ^q_{i,2} = 1 / (2σ_i^2). Without loss of generality, we can set µ_i = 0 for all i, and hence θ_{i,1} = 0 for all i. Therefore, Eq. 15 can be further simplified to

\sum_{i=1}^n θ_{i,2}(u) ∂f_i/∂s_j (s) ∂f_i/∂s_k (s) = 0, ∀ j ≠ k, ∀ s ∈ R^n.  (20)

Lemma 2 Suppose all conditions in Lemma 1 are satisfied, and there exist two distinct values of u, denoted by u^(1) and u^(2), such that the following conditions are also satisfied:

i) µ_i(u^(1)) = µ_i(u^(2)), ∀ i ∈ [n];
ii) {σ_1(u^(2))/σ_1(u^(1)), ..., σ_n(u^(2))/σ_n(u^(1))} are distinct.
Then for all s ∈ R^n, the Jacobian J_f(s) = [ ∂f_i/∂s_j (s) ]_{n×n} is a generalized permutation matrix.

Proof. Without loss of generality, let µ_i(u^(1)) = µ_i(u^(2)) = 0, ∀ i ∈ [n]. According to Lemma 1 and condition i), we have

\sum_{i=1}^n θ_{i,2}(u^(1)) ∂f_i/∂s_j (s) ∂f_i/∂s_k (s) = 0 and \sum_{i=1}^n θ_{i,2}(u^(2)) ∂f_i/∂s_j (s) ∂f_i/∂s_k (s) = 0, ∀ j ≠ k.  (21)

Let Σ(u, s) ≜ diag( \sum_{i=1}^n θ_{i,2}(u) (∂f_i/∂s_1 (s))^2, ..., \sum_{i=1}^n θ_{i,2}(u) (∂f_i/∂s_n (s))^2 ) and Λ(u) ≜ diag( θ_{1,2}(u), ..., θ_{n,2}(u) ); then the equations above can be written as

J_f(s)^T Λ(u^(1)) J_f(s) = Σ(u^(1), s) and J_f(s)^T Λ(u^(2)) J_f(s) = Σ(u^(2), s).  (22)

Since θ_{i,2}(u) = 1/σ_i^2(u) > 0, Σ(u, s) and Λ(u) are always positive definite, hence the equations above admit square-root decompositions on both sides:

Λ(u^(1))^{1/2} J_f(s) = U(u^(1)) Σ(u^(1), s)^{1/2} and Λ(u^(2))^{1/2} J_f(s) = U(u^(2)) Σ(u^(2), s)^{1/2},  (23)

where the U(u) are orthogonal matrices, i.e. U(u)^T U(u) = I. Therefore, we have

J_f(s) = Λ(u^(1))^{-1/2} U(u^(1)) Σ(u^(1), s)^{1/2} = Λ(u^(2))^{-1/2} U(u^(2)) Σ(u^(2), s)^{1/2}.  (24)

This leads to

U(u^(2))^{-1} Λ(u^(2))^{1/2} Λ(u^(1))^{-1/2} U(u^(1)) = Σ(u^(2), s)^{1/2} Σ(u^(1), s)^{-1/2}.  (25)

Both sides of the equation above can be viewed as a singular value decomposition (SVD) of Σ(u^(2), s)^{1/2} Σ(u^(1), s)^{-1/2}. Note that, due to condition ii), the diagonal entries of Λ(u^(2))^{1/2} Λ(u^(1))^{-1/2} are distinct. Based on this, according to the uniqueness of the SVD (Kong et al., 2017), U(u^(1)) is the composition of a permutation matrix and a signature matrix. Substituting this decomposition of U(u^(1)) into Eq. 24, we finally conclude that J_f(s) is a generalized permutation matrix.

Remarks: Since J_f(s) is a generalized permutation matrix, in each row and each column it has exactly one nonzero entry. Without loss of generality, we can assume ∂f_i/∂s_i (s) ≠ 0 while ∂f_i/∂s_j (s) = 0 for all j ≠ i. Therefore, f_i(s) is a function of s_i only and is independent of {s_j}_{j ≠ i}, and hence it can be denoted by f_i(s_i). Thus, under the mild conditions of Lemma 2, z is identical to s up to a nonlinear point-wise transformation (and a permutation).

A.2 THEOREMS ON RIEMANNIAN MANIFOLDS

Suppose data points are distributed on a Riemannian manifold M which is homeomorphic to R^n. The homeomorphism can be denoted by

f : R^n → M, s ↦ x,  (26)

which is further assumed to be volume-preserving:

|det J_f(s)| = 1.  (27)

The density of s conditioned on u is denoted by p(s|u), which is assumed to be factorial, i.e. {s_i}_{i=1}^n are independent:

p(s|u) = \prod_{i=1}^n p_i(s_i|u).  (28)

Suppose the data variable x can be encoded into a vector of latent variables z by another homeomorphism, denoted by

g : M → R^n, x ↦ z,  (29)

where this homeomorphism is also volume-preserving:

|det J_{g^{-1}}(z)| = 1,  (30)

and, conditioned on u, z follows a factorial multivariate Gaussian:

q(z|u) = \prod_{i=1}^n (1 / Z_i) \exp( -(z_i - µ_i(u))^2 / (2σ_i^2(u)) ).  (31)

Theorem 1 (Nonlinear Identifiability) Assume data points are sampled from a generative model defined according to Eq. 26-Eq. 31, and there exist two distinct observations of u, denoted by u^(1) and u^(2), such that the following hold:

i) Both f and g have all second order derivatives.
ii) µ_i(u^(1)) = µ_i(u^(2)), ∀ i ∈ [n].
iii) {σ_1(u^(2))/σ_1(u^(1)), ..., σ_n(u^(2))/σ_n(u^(1))} are distinct.

Then g ∘ f is a composition of a point-wise nonlinear transformation and a permutation.

Proof. Note that g ∘ f : s ↦ z is a volume-preserving homeomorphism (|det J_{g∘f}(s)| = 1) with all second order derivatives. According to Lemma 2, we conclude that g ∘ f is a composition of a point-wise nonlinear transformation and a permutation.

Theorem 2 (Linear Identifiability) Assume the hypotheses of Theorem 1 hold, and p(s|u^(1)) and p(s|u^(2)) are multivariate Gaussians.
Then g ∘ f is a composition of a point-wise linear transformation and a permutation.

Proof. Let h = g ∘ f. According to Theorem 1, h is a composition of a point-wise nonlinear transformation and a permutation, and hence without loss of generality we write

h_i(s) = h_i(s_i), ∀ i ∈ [n].  (32)

Since p(s|u) is a multivariate Gaussian, let p(s|u) = \prod_{i=1}^n (1 / Z^p_i) \exp( θ^p_{i,1}(u) s_i - θ^p_{i,2}(u) s_i^2 ). Substituting this expression and the equation above into Eq. 17, we have

θ^p_{i,1}(u) s_i - θ^p_{i,2}(u) s_i^2 = θ^q_{i,1}(u) h_i(s_i) - θ^q_{i,2}(u) h_i(s_i)^2 + C_i, ∀ i ∈ [n],  (33)

where C_i is a constant. Since both sides are quadratic forms and h_i is a homeomorphism of R, this leads to

h_i(s_i) = a_i s_i + b_i, ∀ i ∈ [n].  (34)

To conclude, g ∘ f is a composition of a point-wise linear transformation and a permutation.

B A CLASS OF IDENTIFIABLE NVP MIXING FUNCTIONS

Even if the mixing function is non-volume-preserving (NVP), it still leads to identifiability when its Jacobian has the following form:

|det J_f(s)| = \prod_{i=1}^n |h_i(s_i)| > 0,  (35)

where each h_i is an arbitrary function satisfying |h_i(s_i)| > 0. Obviously this class of functions is an extension of the volume-preserving transformations. To see this, note that when |h_i(s_i)| = 1, this class degenerates into the volume-preserving transformations. Substituting the equation above into the change of variables rule p(s|u) = q(f(s)|u) |det J_f(s)| and taking the logarithm, we have

\sum_{i=1}^n ( \log p_i(s_i|u) + \log |h_i(s_i)| ) = \sum_{i=1}^n [ θ^q_{i,1}(u) f_i(s) - θ^q_{i,2}(u) f_i(s)^2 - \log Z^q_i ].  (17')

Using this equation, the proof of Lemma 1 still holds, because in that proof we only use the fact that the left hand side of Eq. 17 contains no cross terms, which is also a property of the equation above. Therefore, the class of mixing functions defined above also leads to identifiability, even though it is non-volume-preserving. This means that there exist some non-trivial extensions of our theory.

C SIMPLIFIED LOSS FUNCTION

When µ_i(u) and σ_i(u) are trainable, their optimal solutions under Eq. 9 have the following closed forms:

µ*_i(u^(k)) = E_{(x,u^(k))∼D}[ g_i(x; θ) ],  σ*_i(u^(k)) = \sqrt{ E_{(x,u^(k))∼D}[ ( g_i(x; θ) - µ*_i(u^(k)) )^2 ] },  (36)

where k = 1, 2. Note that these optimal solutions depend on the parameters θ; hence, given θ, we denote µ*_i(u^(k)) and σ*_i(u^(k)) by µ_i(u^(k); θ) and σ_i(u^(k); θ), respectively. Substituting these optimal solutions into the loss function above, we finally obtain (up to an additive constant):

L(θ) = \sum_{k=1}^{2} \sum_{i=1}^{d} \log σ_i(u^(k); θ).  (37)

In experiments, we only sample a mini-batch B from the full dataset D at each iteration; in this case σ_i(u^(k); θ) = \sqrt{ E_{(x,u^(k))∼B}[ ( g_i(x; θ) - µ_i(u^(k); θ) )^2 ] }, which is the biased standard deviation of {g_i(x; θ)}_{(x,u^(k))∈B}.

C.1 DIMENSION REDUCTION

Here we demonstrate that the simplified loss above leads to dimension reduction. This is a vital property for identifiability. On the one hand, dimension reduction enables the learning algorithm to estimate the unknown dimensionality of the data manifold. On the other hand, if we use a flow-based model to identify the true sources, then dimension reduction of the latent variables z is necessary, as in flow-based models the dimensionality of z (namely d) is far larger than the number of true sources (n). In a flow-based model, the simplified loss is

L(θ) = \sum_{k=1}^{2} \sum_{i=1}^{d} \log σ_i(u^(k); θ) = \sum_{k=1}^{2} \log \prod_{i=1}^{d} σ_i(u^(k); θ).  (38)

Hence minimizing the simplified loss is equivalent to minimizing \prod_{i=1}^{d} σ_i(u^(k); θ). To minimize this product, most latent variables should have almost-zero standard deviations.

Note that in implementation, the standard deviations of the latent variables cannot be exactly zero, and not all standard deviations can be almost zero.
First, in flow-based models the data points are usually augmented by adding small noise, and hence the dimensionality of the augmented data manifold is exactly d. As a result, the dimensionality of the corresponding latent space is also d, and hence an exactly zero standard deviation is not possible due to the invertibility of flow-based models. Moreover, as we assume the transformation g is volume-preserving, the entropies of the data distribution and the latent distribution must be equal, and hence there must exist some latent variables with high standard deviations. Therefore, the simplified loss assigns almost-zero standard deviations to most latent variables while preserving some high standard deviations. A natural solution is to assign high standard deviations to meaningful latent variables, and almost-zero standard deviations to redundant dimensions. This is exactly the dimension reduction ability of the simplified loss.

C.2 INDEPENDENCE ENHANCEMENT

Next we prove that the simplified loss also enhances the independence of the different components of z. This is a desirable property, as p(z|u) should be factorial according to our theory. Our proof is based on Hadamard's inequality (Cover & Gamal, 1983): for any positive definite n × n matrix K, det K ≤ \prod_i K_{i,i}, with equality iff K_{i,j} = 0, ∀ i ≠ j. Let K = Cov(z, z); then we have det Cov(z, z) ≤ \prod_i Var(z_i). Hence the simplified loss has the following lower bound:

L(θ) = (1/2) \sum_{k=1}^{2} \sum_{i=1}^{d} \log Var_{u^(k)}(z_i) ≥ (1/2) \sum_{k=1}^{2} \log det Cov_{u^(k)}(z),  (39)

where the subscript u^(k) denotes conditioning on u^(k). When the simplified loss is optimized, the gap between it and its lower bound becomes zero, and hence equality holds. This means Cov(z_i, z_j) = 0, ∀ i ≠ j. Therefore, minimizing the simplified loss enhances the independence of the different latent variables by reducing their correlations. In addition, note that the derivation above does not depend on the Gaussianity of q(z|u), and hence this conclusion is general.

D DETAILS ABOUT THE ARTIFICIAL AND SYNTHESIZED DATASETS

To generate the artificial data, following (Sorrenson et al., 2020), we synthesize the sources and the observed data in two steps: i) First, two sources are sampled from a 2-dim mixture of two Gaussians, both with mean (0.0, 0.0) and with variances (1.0, 0.5) and (0.5, 1.0), respectively; 5,000 points are sampled from each Gaussian. Then 8-dim standard Gaussian noise scaled by 0.01 is concatenated with them. ii) The observed data is generated from the 10-dim sources by a randomly initialized GIN (Sorrenson et al., 2020) with 8 fully connected coupling blocks.
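A sketch (ours) of this two-step generation is given below. For the random volume-preserving mixing we use NICE-style additive coupling blocks with random weights as a simple, self-contained stand-in for the randomly initialized GIN; the hidden size and seeds are arbitrary.

```python
# Sketch of the artificial-dataset generation: two concentric 2-D Gaussian classes padded
# with small-noise dimensions, then passed through a random volume-preserving mixing.
import torch

def vp_mix(s, n_blocks=8, hidden=32, seed=1):
    """Random nonlinear volume-preserving mixing of the 10-dim sources (|det J| = 1)."""
    g = torch.Generator().manual_seed(seed)
    x = s
    for _ in range(n_blocks):
        w1 = torch.randn(5, hidden, generator=g)
        w2 = torch.randn(hidden, 5, generator=g)
        x1, x2 = x[:, :5], x[:, 5:]
        shift = torch.tanh(x1 @ w1) @ w2        # shift of x2 conditioned on x1
        x = torch.cat([x2 + shift, x1], dim=1)  # additive coupling + swap of halves
    return x

def make_artificial_data(n_per_class=5000, seed=0):
    torch.manual_seed(seed)
    variances = [torch.tensor([1.0, 0.5]), torch.tensor([0.5, 1.0])]  # the two classes
    s, u = [], []
    for k, var in enumerate(variances):
        s2 = torch.randn(n_per_class, 2) * var.sqrt()                 # 2-dim informative sources
        noise = 0.01 * torch.randn(n_per_class, 8)                    # 8 scaled noise dimensions
        s.append(torch.cat([s2, noise], dim=1))
        u.append(torch.full((n_per_class,), k))
    s, u = torch.cat(s), torch.cat(u)
    return vp_mix(s), s, u                                            # observed x, sources s, labels u
```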
To quantitatively test our method on image datasets, we have synthesized a gray-scale image dataset called "Triangles". Specifically, we generate a gray-scale 32×32 image dataset of 2-D shapes similar to dSprites (Matthey et al., 2017), as shown in Fig. 6, in which all shapes are right triangles generated from 4 factors: rotation, width (length of the horizontal edge), height (length of the vertical edge) and gray level. These factors are viewed as the sources, and the observations of the sources are sampled from a mixture of two Gaussians with random means and variances. The value ranges of the four factors are set to [-π, π], [1, 32], [1, 32] and [0, 255], respectively.

Figure 6: Samples from Triangles.

According to our requirements, in each class these factors follow a factorial multivariate Gaussian. To satisfy condition (iii) in Theorem 1, we set the variance of each factor to a proper random value, and we do the same for the means, because we found in experiments that condition (ii) in Theorem 1 is not necessary. The particular parameters of the four factors are shown in Table 3.

Table 3: Means and variances of each class in Triangles. α and β represent uniform random variables on [-1, 1] and [1, 2], respectively.

Factor       Value Range    Mean          Variance
Rotation     [-π, π]        0 + 0.01πα    0.03πβ
Width        [1, 32]        18 + 2α       1β
Height       [1, 32]        18 + 2α       1β
Gray Level   [0, 255]       170 + 5α      21β

The generating process of each image consists of three steps: i) for each class, randomly select four means and four variances according to Table 3, obtaining a 4-dim Gaussian; ii) sample one point from the 4-dim Gaussian, which corresponds to a right triangle; iii) determine which pixels lie inside the triangle according to the given rotation, width and height, and assign them the given gray level.

E QUALITATIVE RESULTS OF ABLATION STUDY

Here we show the qualitative results of our ablation study (see Fig. 7) to support our discussion in an intuitive way. In these figures, "Reconstruction" shows the estimated latent variables with the top two standard deviations, as Sorrenson et al. (2020) report that estimated latent variables with higher standard deviations are probably more meaningful. "Spectrum" shows the sorted standard deviations of the latent variables; the spectrum of the estimated latent variables is in black, while the spectrum of the true latent variables is in grey.

Figure 7: Qualitative results of the ablation study. Panels: (a) fully satisfied; (b) µ^s_1(u^(1)) ≠ µ^s_1(u^(2)); (c) σ^s_1(u^(2))/σ^s_1(u^(1)) = σ^s_2(u^(2))/σ^s_2(u^(1)); (d) non-volume-preserving f; (e) u^(1) = u^(2); (f) no overlap.

F QUALITATIVE RESULTS ON CELEBA

To further show the performance of our framework on real-world datasets, we conduct experiments on CelebA (Liu et al., 2015) using GIN. Since our framework requires two distinct classes, we use all images in CelebA as one class and their mirror images as the other class. This setting ensures that the two obtained classes are distinct and have non-zero overlap. Similar to the experiments on MNIST, we pick the latent dimensions with high standard deviations as the estimated latent variables, and plot them to show their corresponding underlying factors. As shown in Fig. 8, the estimated latent variables with the top 6 standard deviations are highly interpretable, corresponding to lighting on the left, lighting on the right, skin color, background brightness, hair (bangs) and grin, respectively. These results further demonstrate that our framework is probably able to identify the true sources from real-world datasets, and hence support our theory.

Figure 8: Traversals of the estimated latent variables with the top 6 standard deviations by GIN on CelebA. Each row is a traversal obtained by manipulating one estimated latent variable. The six rows represent lighting on the left, lighting on the right, skin color, background brightness, hair (bangs) and grin, respectively.
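For completeness, the two-class construction used for CelebA in Appendix F can be sketched as follows (our code); the dataset root, transforms and wrapper class are illustrative assumptions, not part of the paper.

```python
# Sketch of the two-class CelebA construction: the original images form class u^(1)
# and their horizontally mirrored copies form class u^(2).
import torch
from torch.utils.data import ConcatDataset, Dataset
from torchvision import datasets, transforms

original = datasets.CelebA("data", split="train", download=True,
                           transform=transforms.ToTensor())
mirrored = datasets.CelebA("data", split="train", download=True,
                           transform=transforms.Compose([
                               transforms.RandomHorizontalFlip(p=1.0),  # deterministic flip
                               transforms.ToTensor(),
                           ]))

class WithLabel(Dataset):
    """Attach the auxiliary label u (0: original, 1: mirrored) to each image."""
    def __init__(self, base, u):
        self.base, self.u = base, u
    def __len__(self):
        return len(self.base)
    def __getitem__(self, idx):
        img, _ = self.base[idx]
        return img, self.u

two_class_celeba = ConcatDataset([WithLabel(original, 0), WithLabel(mirrored, 1)])
```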