# Generalizing Nonlinear ICA Beyond Structural Sparsity

Yujia Zheng¹, Kun Zhang¹,²
¹ Carnegie Mellon University
² Mohamed bin Zayed University of Artificial Intelligence
{yujiazh, kunz1}@cmu.edu

**Abstract.** Nonlinear independent component analysis (ICA) aims to uncover the true latent sources from their observable nonlinear mixtures. Despite its significance, the identifiability of nonlinear ICA is known to be impossible without additional assumptions. Recent advances have proposed conditions on the connective structure from sources to observed variables, known as Structural Sparsity, to achieve identifiability in an unsupervised manner. However, the sparsity constraint may not hold universally for all sources in practice. Furthermore, the assumptions of bijectivity of the mixing process and independence among all sources, which arise from the setting of ICA, may also be violated in many real-world scenarios. To address these limitations and generalize nonlinear ICA, we propose a set of new identifiability results in the general settings of undercompleteness, partial sparsity and source dependence, and flexible grouping structures. Specifically, we prove identifiability when there are more observed variables than sources (undercomplete), and when certain sparsity and/or source independence assumptions are not met for some changing sources. Moreover, we show that even in cases with flexible grouping structures (e.g., part of the sources can be divided into irreducible independent groups with various sizes), appropriate identifiability results can also be established. Theoretical claims are supported empirically on both synthetic and real-world datasets.

## 1 Introduction

The unveiling of the true generating process of observations is fundamental to scientific discovery. Nonlinear independent component analysis (ICA) provides a statistical framework that represents a set of observed variables $x$ as a nonlinear mixture of independent latent sources $s$, i.e., $x = f(s)$. Unlike linear ICA (Comon, 1994), the mixing function $f$ can be an unknown nonlinear function, thus generalizing the theory to more real-world tasks. However, the identifiability of nonlinear ICA has been a long-standing problem for decades. The main obstacle is that, without additional assumptions, there exist infinitely many spurious solutions returning independent variables that are mixtures of the true sources (Hyvärinen and Pajunen, 1999). In the context of machine learning, this makes the theoretical analysis of unsupervised learning of disentangled representations difficult (Locatello et al., 2019).

To overcome this challenge, recent work has introduced the auxiliary variable $u$ and assumed that all sources are conditionally independent given $u$. Most of these methods require auxiliary variables to be observable, such as class labels and domain indices (Hyvärinen and Morioka, 2016; Hyvärinen et al., 2019; Khemakhem et al., 2020a; Sorrenson et al., 2020; Lachapelle et al., 2022; Lachapelle and Lacoste-Julien, 2022), with the exceptions being those for time series (Hyvärinen and Morioka, 2017; Hälvä et al., 2021; Yao et al., 2021, 2022). While the use of the auxiliary variable $u$ allows for the identifiability of nonlinear ICA with mild restrictions on the mixing process, it also necessitates a large number of distinct values of $u$, which can be difficult to obtain in tasks with insufficient side information.
Moreover, since these results assume that all sources are dependent on $u$, they cannot accommodate a subset of sources with invariant distributions (e.g., content may not change with different styles).

Another possible direction is to impose appropriate conditions on the mixing process, but limited results are available in the literature. For example, it has been shown that conformal maps are identifiable up to specific indeterminacies (Hyvärinen and Pajunen, 1999; Buchholz et al., 2022). Moreover, Taleb and Jutten (1999) identify the latent sources when the mixing process is a component-wise nonlinear function applied to a linear mixture. These methods do not rely on conditional independence given the auxiliary variable and thus achieve identifiability in a fully unsupervised setting. At the same time, the requirement of the above-mentioned classes of mixing functions, such as conformal maps and post-nonlinear models, restricts the applicability of the results in another way. For instance, according to Liouville's theorem (Monge, 1850), conformal maps in Euclidean spaces of dimension higher than two are Möbius transformations, which appear to be overly restrictive for most data-generating processes. As an alternative, Zheng et al. (2022) prove that, under the assumption of Structural Sparsity, the true sources can be identified up to trivial indeterminacies. Since the proposed condition is on the connective structure from sources to observed variables, i.e., the support of the Jacobian matrix of the mixing function, it does not require the mixing function to be of any specific algebraic form. Thus, Structural Sparsity may serve as one of the first general principles for the identifiability of nonlinear ICA in a fully unsupervised setting.

While being a potential solution to the identifiability of nonlinear ICA without side information, the assumption of Structural Sparsity has its limitations from a pragmatic viewpoint. The most obvious one arises from the fact that it may fail in a number of situations where the generating processes are heavily entangled. Although the principle of simplicity may be a general rule in nature, it is intuitively possible that Structural Sparsity does not apply to at least a subset of sources, such as one or a few speakers in a crowded room. Unfortunately, Zheng et al. (2022) require Structural Sparsity to hold for all sources in order to provide any identifiability guarantee. Therefore, it would be desirable in practice to provide weaker notions of identifiability, such as the ability to identify a subset of sources up to a trivial degree of uncertainty, in cases of partial sparsity.

In addition to partial sparsity, identifiability with Structural Sparsity also fails under undercompleteness (more observed variables than sources) and/or partial source dependence (potential dependence among some hidden sources). These limitations are not unique to the sparsity assumption, but rather a result of the traditional setting of ICA, where the numbers of sources and observed variables must be equal and dependencies among sources are not allowed. However, both situations are quite common in practice. One may easily have millions of pixels (observed variables) but only dozens of hidden concepts (sources) in a picture, constituting an undercomplete case that cannot be handled by previous results. Meanwhile, dependencies among some variables are also prevalent in tasks such as computational biology (Cardoso, 1998; Theis, 2006).
The alternative assumption of conditional independence given auxiliary variables may still be overly restrictive if applied universally to all sources. For the identifiability of nonlinear ICA to truly benefit scientific discovery in a wider range of scenarios, these methodological limitations should be properly addressed.

Aiming to generalize nonlinear ICA with Structural Sparsity, we first present a set of new identifiability results to address these fundamental challenges of undercompleteness, partial sparsity, and source dependence. We show that, under the assumption of Structural Sparsity and without auxiliary variables, latent sources can be identified from their nonlinear mixtures up to a component-wise invertible transformation and a permutation, even when there are more observed variables than sources (Thm. 3.1). Moreover, if the assumption of sparsity and/or source independence does not hold for some changing sources, we provide partial identifiability results, showing that the remaining sources can still be identified up to the same trivial indeterminacy (Thm. 4.1, Thm. 4.2). Furthermore, in cases with flexible grouping structures (e.g., part of the sources can be grouped into irreducible independent subspaces with various sizes, such as mixtures of signals with various dimensions), certain types of identifiability are also guaranteed with auxiliary variables (Thm. 4.3, Thm. 4.4). Therefore, we establish, to the best of our knowledge, one of the first general frameworks for uncovering latent variables with appropriate identifiability guarantees in a principled manner. The theoretical claims are validated empirically through our experiments and many previous works involving disentanglement.

## 2 Preliminaries

The data-generating process of nonlinear ICA is as follows:

$$p_s(s) = \prod_{i=1}^{n} p_{s_i}(s_i), \quad (1)$$
$$x = f(s), \quad (2)$$

where $s = (s_1, \ldots, s_n) \in \mathcal{S} \subseteq \mathbb{R}^n$ is a latent vector representing the independent sources, and $x = (x_1, \ldots, x_m) \in \mathcal{X} \subseteq \mathbb{R}^m$ denotes the observed random vector. The mixing function $f$ is assumed to be smooth in the sense that its second-order derivatives exist. The primary objective of ICA is to establish identifiable models, i.e., the sources $s$ are identifiable (recoverable) up to certain indeterminacies by learning an estimated mixing function $\hat{f}: \hat{\mathcal{S}} \to \mathcal{X}$ with assumptions identical to the generating process (Comon, 1994). Different from most ICA results where $m = n$ and $f: \mathcal{S} \to \mathcal{X}$ must be linear, we allow $m > n$ (i.e., undercompleteness) and $f$ to be a general nonlinear function, therefore extending the previous setting. Thus, we relax the previous assumption on the invertibility of $f$, only necessitating it to be injective and its Jacobian to be of full column rank. Furthermore, we denote $p_{s_i}$ as the marginal probability density function (PDF) of the $i$-th source $s_i$ and $p_s$ as the joint PDF of the random vector $s$. Moreover, we introduce some additional technical notations as follows:

**Definition 2.1.** Given a subset $\mathcal{A} \subseteq \{1, \ldots, n\}$, the subspace $\mathbb{R}^n_{\mathcal{A}}$ is defined as
$$\mathbb{R}^n_{\mathcal{A}} := \{z \in \mathbb{R}^n \mid i \notin \mathcal{A} \implies z_i = 0\},$$
where $z_i$ is the $i$-th element of the vector $z$. That is, $\mathbb{R}^n_{\mathcal{A}}$ denotes the subspace of $\mathbb{R}^n$ specified by an index set $\mathcal{A}$.

Furthermore, we define the support of a matrix as follows:

**Definition 2.2.** The support of a matrix $M \in \mathbb{R}^{m \times n}$ is defined as
$$\mathrm{supp}(M) := \{(i, j) \mid M_{i,j} \neq 0\}.$$

With a slight abuse of notation, we reuse $\mathrm{supp}(\cdot)$ to denote the support of a matrix-valued function:

**Definition 2.3.** The support of a function $M: \Theta \to \mathbb{R}^{m \times n}$ is defined as
$$\mathrm{supp}(M(\Theta)) := \{(i, j) \mid \exists \theta \in \Theta, \ M(\theta)_{i,j} \neq 0\}.$$
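To make the setup concrete, here is a minimal NumPy sketch (not from the paper) of the generating process in Eqs. (1)–(2) together with the Jacobian support of Defns. 2.2–2.3; the particular mixing $f$ and its sparse structure `A` are hypothetical choices used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 8  # n sources, m observed variables (undercomplete: m > n)

# A hypothetical sparse mixing structure: supp(A) plays the role of supp(J_f(s)).
A = rng.normal(size=(m, n)) * (rng.random((m, n)) < 0.5)

def f(s):
    """Toy nonlinear mixing: z -> z + 0.5*tanh(z) is strictly increasing element-wise,
    so f is injective whenever A has full column rank."""
    z = A @ s
    return z + 0.5 * np.tanh(z)

def jacobian(func, s, eps=1e-6):
    """Numerical Jacobian of func at s via central differences."""
    s = np.asarray(s, dtype=float)
    J = np.zeros((m, s.size))
    for j in range(s.size):
        e = np.zeros_like(s); e[j] = eps
        J[:, j] = (func(s + e) - func(s - e)) / (2 * eps)
    return J

s = rng.normal(size=n)   # independent sources, here standard normal for illustration
x = f(s)                 # observed nonlinear mixture, Eq. (2)
J = jacobian(f, s)
support = {(i, j) for i in range(m) for j in range(n) if abs(J[i, j]) > 1e-8}
# For this toy f, supp(J_f(s)) coincides with the chosen structure supp(A):
print(support == {(i, j) for i, j in zip(*np.nonzero(A))})
```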
For brevity, we denote $\mathcal{F}$ and $\hat{\mathcal{F}}$ as the supports of the Jacobians $J_f(s)$ and $J_{\hat{f}}(\hat{s})$, respectively. Additionally, $\mathcal{T}$ refers to a set of matrices with the same support as $T(s)$ in $J_{\hat{f}}(\hat{s}) = J_f(s)T(s)$, where $T(s)$ is a matrix-valued function. Throughout this work, for any matrix $M$, we use $M_{i,:}$ to denote its $i$-th row and $M_{:,j}$ to denote its $j$-th column. For any set of indices $\mathcal{B} \subseteq \{1, \ldots, m\} \times \{1, \ldots, n\}$, analogously, we have $\mathcal{B}_{i,:} := \{j \mid (i, j) \in \mathcal{B}\}$ and $\mathcal{B}_{:,j} := \{i \mid (i, j) \in \mathcal{B}\}$.

## 3 Identifiability with undercompleteness

We first present the result on removing one of the major assumptions in ICA, i.e., that the number of observed variables $m$ must be equal to that of the hidden sources $n$. We prove that, in the undercomplete case ($m > n$), sources can be identified up to a trivial indeterminacy under Structural Sparsity.

**Theorem 3.1.** Let the observed data be a large enough sample generated by an undercomplete nonlinear ICA model as defined in Eqs. (1) and (2). Suppose the following assumptions hold:

i. For each $i \in \{1, \ldots, n\}$, there exist $\{s^{(\ell)}\}_{\ell=1}^{|\mathcal{F}_{i,:}|}$ and a matrix $T \in \mathcal{T}$ s.t. $\mathrm{span}\{J_f(s^{(\ell)})_{i,:}\}_{\ell=1}^{|\mathcal{F}_{i,:}|} = \mathbb{R}^n_{\mathcal{F}_{i,:}}$ and $[J_f(s^{(\ell)})T]_{i,:} \in \mathbb{R}^n_{\hat{\mathcal{F}}_{i,:}}$.

ii. (Structural Sparsity) For each $k \in \{1, \ldots, n\}$, there exists $\mathcal{C}_k$ s.t. $\bigcap_{i \in \mathcal{C}_k} \mathcal{F}_{i,:} = \{k\}$.

Then $s$ is identifiable up to an element-wise invertible transformation and a permutation.

The proof is included in Appx. A.1, of which part of the conditions and techniques are based on (Zheng et al., 2022). It is noteworthy that, as in previous work, we also need to add a sparsity regularization on the learned Jacobian during the estimation so that $|\hat{\mathcal{F}}| \leq |\mathcal{F}|$; this is required for all sparsity-based identification results throughout the paper, and we only emphasize it here for brevity.

Assumption i avoids some pathological conditions (e.g., samples are from very limited subpopulations that only span a degenerate subspace) and is typically satisfied asymptotically. The first part implies that there are at least $|\mathcal{F}_{i,:}|$ observed samples spanning the support space, which is almost always satisfied asymptotically. The second part is also relatively mild. Note that $\mathcal{T}$ refers to a set of matrices with the same support as $T(s)$ in $J_{\hat{f}}(\hat{s}) = J_f(s)T(s)$, and $J_{\hat{f}}(\hat{s})_{i,:} \in \mathbb{R}^n_{\hat{\mathcal{F}}_{i,:}}$. Since we only necessitate the existence of one matrix $T \in \mathcal{T}$ in the entire space, even in rare cases where these two matrices do not share the same non-zero coordinates due to non-generic cancellation between specific element values, a matrix $T \in \mathcal{T}$ fulfilling the assumption almost always exists.

*Figure 1: The structural sparsity assumption in the undercomplete case, where the matrix represents $\mathrm{supp}(J_f(s))$.*

Assumption ii, i.e., Structural Sparsity, originates from (Zheng et al., 2022). Intriguingly, compared to the original bijective setting considered by (Zheng et al., 2022), this assumption is much more likely to be satisfied in the undercomplete case. The key reason is that it only necessitates the existence of a subset of observed variables whose intersection uniquely identifies the target source variable. For instance, regarding $s_1$ in Fig. 1, there exist $x_1$ and $x_4$ s.t. the intersection of their parents is only $s_1$. In principle, the size of this set can be quite small (e.g., one or two). Hence, it is very likely to be satisfied when there is a sufficient number of observed variables (e.g., millions of pixels for images), which has also been verified empirically in our experiments (e.g., Fig. 4 in Sec. 5).
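Assumption ii can be checked mechanically once the support $\mathcal{F} = \mathrm{supp}(J_f(s))$ is given as a binary matrix. The sketch below is one such check (our own illustration, not code from the paper); the example matrix mimics the structure described for Fig. 1, where the parents of $x_1$ and $x_4$ intersect only in $s_1$.

```python
import numpy as np

def satisfies_structural_sparsity(support):
    """Check Assumption ii for a binary support matrix F of shape (m, n): every
    source k must be exactly the intersection of the supports of some set of rows
    C_k.  Intersecting over all rows that contain k gives the smallest achievable
    intersection, so it suffices to test that set."""
    m, n = support.shape
    for k in range(n):
        rows = [set(np.flatnonzero(support[i])) for i in range(m) if support[i, k]]
        if not rows:                      # source k influences no observed variable
            return False
        if set.intersection(*rows) != {k}:
            return False
    return True

# A hypothetical 5x4 support in the spirit of Fig. 1: rows are observed variables,
# columns are sources; x1 and x4 (rows 0 and 3) share only s1 (column 0).
F = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 1],
              [1, 0, 1, 0]])
print(satisfies_structural_sparsity(F))   # True for this example
```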
Additionally, in some tasks, we might even construct or select observations in a data-centric manner to satisfy this assumption. Without the previous constraint of bijectivity, structural sparsity can truly be applied in a much broader range of practical scenarios.

By proving identifiability in the undercomplete case, we remove the previous assumption of bijectivity on the mixing function $f$, thereby generalizing the theory to more application scenarios. It is worth noting that, while some work has provided results without assuming bijectivity (Khemakhem et al., 2020a), it relies on extra information from many distinct values of auxiliary variables. In contrast, we do not need any auxiliary variables and follow a fully unsupervised setting. Zheng et al. (2022) and Kivva et al. (2022) explore undercompleteness without any auxiliary variable; however, Zheng et al. (2022) only remove the rotational indeterminacy in the nonlinear case and Kivva et al. (2022) assume Gaussian mixture priors, while we provide the full identifiability result without distributional assumptions. At the same time, as elaborated above, the assumption of Structural Sparsity has been significantly weakened in the undercomplete case considered by our theorem. Moreover, identifiability with undercompleteness is also essential if assumptions are partially violated w.r.t. a subset of sources, an intuition that is verified by the theoretical results introduced in the following sections.

## 4 Identifiability with partial sparsity and source dependence

Under the condition of Structural Sparsity, we show the identifiability of undercomplete ICA with general nonlinear functions (Thm. 3.1). While this removes the restriction of bijectivity between sources and observed variables, it remains uncertain whether Structural Sparsity holds for all sources in a universal way. At the same time, even in scenarios where Structural Sparsity may not be universally satisfied for all sources, it is still valuable to consider its potential to hold true for a subset of sources. This type of partial sparsity may often be the case in practical scenarios, as illustrated by our experiments (e.g., Fig. 5 in Sec. 5). However, the corresponding partial identifiability, i.e., the theoretical guarantee for the identification of the subset of sources satisfying Structural Sparsity, is not achieved by (Zheng et al., 2022). In fact, as long as one or a few sources do not meet the assumption of Structural Sparsity, the previous work is unable to provide any identifiability guarantees.

Furthermore, in addition to universal sparsity, statistical independence between sources is another fundamental assumption. It arises from the original setting of ICA and has been adopted in most related works. However, in many real-world scenarios, requiring all sources to be mutually independent might be impractical, and there is likely to be a subset of sources that are dependent in some way. For example, the frequency and duration of smoking, as well as the type of tobacco products used, are all interrelated factors that contribute to the development of lung cancer. Without any identifiability guarantees in the case where there exist dependent sources, it is quite restrictive for nonlinear ICA, and even its undercomplete extension, to successfully deal with real problems in practice.
Therefore, similar to the partial sparsity case, it is highly desirable that alternative theoretical results, i.e., identifiability for the independent sources, can be guaranteed even in the presence of dependent sources.

To deal with these remaining challenges in the considered undercomplete case, we further relax other assumptions with additional information. To start with, we provide an identifiability result when Structural Sparsity holds true for only a subset of sources, thus alleviating the obstacle of partial sparsity. Moreover, we relax the mutual independence assumption and allow some changing sources to be dependent on each other. To this end, we partition the sources into two parts $s = [s_I, s_D]$, where variables in $s_I$ are mutually independent, but those in $s_D$ do not need to be. Let $s_I$ and $s_D$ correspond to variables in $s$ with indices $\{1, \ldots, n_I\}$ and $\{n_I+1, \ldots, n\}$, respectively. That is, $s_I = (s_1, \ldots, s_{n_I}) \in \mathcal{S}_I \subseteq \mathbb{R}^{n_I}$ and $s_D = (s_{n_I+1}, \ldots, s_n) \in \mathcal{S}_D \subseteq \mathbb{R}^{n_D}$. We denote the $i$-th scalar element in a vector, say $s$, as $s_i$. Sources in $s_D$ do not need to be mutually independent as long as they are dependent on a variable $u$, i.e.,

$$p_{s|u}(s|u) = p_{s_D|u}(s_D|u) \prod_{i=1}^{n_I} p_{s_i}(s_i). \quad (3)$$

It is noteworthy that we allow arbitrary relations between sources in $s_D$. These sources might be grouped into several subspaces or actually be mutually independent, but we do not need to obtain this information as prior knowledge. This is essential since the exact dependence structures, or even the number of dependent variables, are usually unknown in practice. By keeping this type of uncertainty for both partial sparsity and/or partial source dependence, one can be more confident in applying the theoretical advancements of nonlinear ICA in various tasks. We prove the identifiability for this arguably more flexible scenario in the following theorems:

**Theorem 4.1.** Let the observed data be a large enough sample generated by an undercomplete nonlinear ICA model defined in Eqs. (2) and (3). Suppose the following assumptions hold:

i. There exist $n_D + 1$ distinct values of $u$, i.e., $u_j$ with $j \in \{0, 1, \ldots, n_D\}$, s.t. the $n_D$ vectors $w(s_D, u, i)$ with $i \in \{n_I+1, \ldots, n\}$ are linearly independent, where the vector $w(s_D, u, i)$ is defined as follows:
$$w(s_D, u, i) = \left(\frac{\partial\bigl(\log p(s_D|u_1) - \log p(s_D|u_0)\bigr)}{\partial s_i}, \ \ldots, \ \frac{\partial\bigl(\log p(s_D|u_{n_D}) - \log p(s_D|u_0)\bigr)}{\partial s_i}\right).$$

ii. There exist $u_1, u_2 \in u$, s.t., for any set $A_s \subseteq \mathcal{S}$ with non-zero probability measure that cannot be expressed as $B_{s_I} \times \mathcal{S}_D$ for any $B_{s_I} \subset \mathcal{S}_I$, we have
$$\int_{s \in A_s} p_{s|u}(s \mid u_1) \, ds \neq \int_{s \in A_s} p_{s|u}(s \mid u_2) \, ds.$$

Then $s_D$ is identifiable up to a subspace-wise invertible transformation.

Thm. 4.1 ensures the subspace-wise identifiability of $s_D$, i.e., the estimated subspace $\hat{s}_D$ contains all and only the information from $s_D$. This implies that we can disentangle and extract the invariant part of the latents, which is beneficial for tasks like domain adaptation, where recovering each individual source might not be necessary as long as the subspace that is invariant across domains can be disentangled. Furthermore, we also prove the component-wise identifiability as follows:

**Theorem 4.2.** In addition to the assumptions in Thm. 4.1, suppose the following assumptions hold:

i. For each $i \in \{1, \ldots, n_I\}$, there exist $\{s^{(\ell)}\}_{\ell=1}^{|\mathcal{F}_{i,:n_I}|}$ and a matrix $T \in \mathcal{T}$ s.t. $\mathrm{span}\{J_f(s^{(\ell)})_{i,:n_I}\}_{\ell=1}^{|\mathcal{F}_{i,:n_I}|} = \mathbb{R}^{n_I}_{\mathcal{F}_{i,:n_I}}$ and $[J_f(s^{(\ell)})T]_{i,:n_I} \in \mathbb{R}^{n_I}_{\hat{\mathcal{F}}_{i,:n_I}}$.

ii. (Structural Sparsity) For all $k \in \{1, \ldots, n_I\}$, there exists $\mathcal{C}_k$ s.t. $\bigcap_{i \in \mathcal{C}_k} \mathcal{F}_{i,:n_I} = \{k\}$.
Then $s_I$ is identifiable up to an element-wise invertible transformation and a permutation.

The proofs are presented in Appx. A.2 and Appx. A.3. We tackle the challenge of partial sparsity and dependence by requiring Structural Sparsity and independence only for the subset of sources in $s_I$, and prove that these sources can be identified up to trivial indeterminacies. The remaining sources in $s_D$ only need to be dependent on an auxiliary variable $u$, without requiring conditional independence among sources or distributional assumptions. This extends previous models that assume all sources to be conditionally independent given $u$ (Hyvärinen et al., 2019; Lachapelle et al., 2022) or require the conditional distribution of the sources to be of a specific form (Khemakhem et al., 2020b).

The assumption on $p(s_D|u)$ in Thm. 4.1 indicates that the auxiliary variable $u$ should have a sufficiently diverse impact on the sources without the independence assumption (i.e., $s_D$). It follows a similar spirit to the standard assumption of variability (Hyvärinen et al., 2019), but we further relax it. Specifically, we only need $n_D + 1$ values of $u$ for the identifiability of sources in $s_I$. This is intuitively reasonable: the fewer changes (smaller $n_D$) a system has, the easier it is (fewer required values, i.e., $n_D + 1$) for a larger part of it (larger $n_I$, i.e., $n - n_D$) to be identifiable. In contrast, most previous works require all sources to be dependent on an auxiliary variable $u$ with $2n + 1$ distinct values of $u$: no identifiability for any subset of sources can be provided if there exists any degree of violation, either in the number of sources dependent on $u$ or in the number of values of $u$. This limits the application of these results to ideal scenarios where all sources are influenced by the same auxiliary variable with sufficient changes, without any type of compromise. In practice, however, it is often the case that only a subset of sources benefits from the additional information provided by auxiliary variables, different auxiliary variables may affect different sources, or auxiliary variables do not contain sufficient information. Assumption ii in Thm. 4.1 is originally from (Kong et al., 2022) and also necessitates the presence of change. Intuitively, the chance of having a subset $A_s$ on which all domain distributions have an equal probability measure is very slim, which has been verified empirically in (Kong et al., 2022). For both theorems, we consider the more challenging undercomplete case, for which related identifiability results are lacking in the literature. Additionally, unlike previous works assuming specific distributions of sources such as exponential families, we do not place similar distributional assumptions on the sources.

### 4.1 Results with flexible grouping structures

If we further have access to the dependence structure among variables in $s_D$, additional identifiability results for these sources may also be established. For example, consider the setting where $s_D = (s_{n_I+1}, \ldots, s_n)$ can be decomposed into $d$ irreducible independent subspaces $\{s_{c_1}, \ldots, s_{c_d}\}$, each of which is a multi-dimensional vector consisting of multiple sources. We denote the $j$-th consecutive $d$-dimensional vector ($j$-th subspace) in $s$ as $s_{c_j} = (s_{(j-1)d+1}, \ldots, s_{jd}) = (s_{c_j(l)}, \ldots, s_{c_j(h)})$, where $s_{c_j(l)}$ and $s_{c_j(h)}$ are the first and the last sources in $s_{c_j}$, respectively. Then we have

$$p_{s|u}(s|u) = \prod_{i=1}^{n_I} p_{s_i}(s_i) \prod_{j=c_1}^{c_d} p_{s_{c_j}|u}(s_{c_j}|u). \quad (4)$$
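As an illustration of the factorization in Eq. (4), the following sketch samples sources with independent one-dimensional components $s_I$ and $u$-dependent, internally dependent subspaces $s_{c_j}$; the concrete densities (Laplace marginals, correlated Gaussians indexed by $u$) and the group sizes are illustrative assumptions rather than the distributions used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_I, groups = 3, [2, 3]   # hypothetical sizes: 3 independent sources + subspaces of dim 2 and 3

def sample_sources(u, size):
    """Sample from a density of the form of Eq. (4): independent sources s_I times
    irreducible independent subspaces s_{c_j} whose joint density depends on u."""
    s_I = rng.laplace(size=(size, n_I))                      # prod_i p_{s_i}(s_i)
    blocks = []
    for j, d in enumerate(groups):                           # prod_j p_{s_{c_j}|u}(s_{c_j}|u)
        # Each subspace is jointly Gaussian with a u-dependent coupling, so its
        # coordinates are dependent within the block but independent across blocks.
        C = np.full((d, d), 0.3 + 0.1 * u * (j + 1)) + np.eye(d) * 0.7
        blocks.append(rng.multivariate_normal(np.zeros(d), C, size=size))
    return np.concatenate([s_I] + blocks, axis=1)            # s = [s_I, s_D]

s = sample_sources(u=1, size=5)
print(s.shape)   # (5, n_I + sum(groups)) = (5, 8)
```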
The factorization in Eq. (4) is similar to Independent Subspace Analysis (ISA) (Hyvärinen and Hoyer, 2000; Theis, 2006), but we allow only a subset of sources, instead of all of them, to be a composition of (conditionally) independent subspaces, which formalizes the tasks of blind source separation or uncovering latent variable models with mixtures of both high-dimensional and one-dimensional signals. The considered general setting essentially covers ICA and ISA as special cases: if $n_I = n$, it is consistent with the ICA problem; if $n_I = 0$, all sources can be decomposed into irreducible independent subspaces, and thus it becomes an ISA problem. The identifiability result under this setting is shown in the following theorem, with its proof provided in Appx. A.4:

**Theorem 4.3.** Let the observed data be a large enough sample generated from an undercomplete nonlinear ICA model as defined in Eqs. (2) and (4). Suppose the following assumptions hold:

i. For each $i \in \{1, \ldots, n_I\}$, there exist $\{s^{(\ell)}\}_{\ell=1}^{|\mathcal{F}_{i,:n_I}|}$ and a matrix $T \in \mathcal{T}$ s.t. $\mathrm{span}\{J_f(s^{(\ell)})_{i,:n_I}\}_{\ell=1}^{|\mathcal{F}_{i,:n_I}|} = \mathbb{R}^{n_I}_{\mathcal{F}_{i,:n_I}}$ and $[J_f(s^{(\ell)})T]_{i,:n_I} \in \mathbb{R}^{n_I}_{\hat{\mathcal{F}}_{i,:n_I}}$.

ii. There exist $2n_D + 1$ values of $u$, i.e., $u_i$ with $i \in \{0, 1, \ldots, 2n_D\}$, s.t. the $2n_D$ vectors $w(s_D, u_i) - w(s_D, u_0)$ with $i \in \{1, \ldots, 2n_D\}$ are linearly independent, where the vector $w(s_D, u_i)$ is defined as follows:
$$w(s_D, u_i) = \bigl(v(s_{c_1}, u_i), \ldots, v(s_{c_d}, u_i), v'(s_{c_1}, u_i), \ldots, v'(s_{c_d}, u_i)\bigr),$$
$$v(s_{c_j}, u_i) = \left(\frac{\partial \log p(s_{c_j}|u_i)}{\partial s_{c_j(l)}}, \ldots, \frac{\partial \log p(s_{c_j}|u_i)}{\partial s_{c_j(h)}}\right), \quad v'(s_{c_j}, u_i) = \left(\frac{\partial^2 \log p(s_{c_j}|u_i)}{\partial (s_{c_j(l)})^2}, \ldots, \frac{\partial^2 \log p(s_{c_j}|u_i)}{\partial (s_{c_j(h)})^2}\right).$$

iii. There exist $u_1, u_2 \in u$, s.t., for any set $A_s \subseteq \mathcal{S}$ with nonzero probability measure that cannot be expressed as $B_{s_I} \times \mathcal{S}_D$ for any $B_{s_I} \subset \mathcal{S}_I$, we have
$$\int_{s \in A_s} p_{s|u}(s \mid u_1) \, ds \neq \int_{s \in A_s} p_{s|u}(s \mid u_2) \, ds.$$

iv. (Structural Sparsity) For all $k \in \{1, \ldots, n_I\}$, there exists $\mathcal{C}_k$ s.t. $\bigcap_{i \in \mathcal{C}_k} \mathcal{F}_{i,:n_I} = \{k\}$.

Then $s_I$ is identifiable up to an element-wise invertible transformation and a permutation, and $s_D$ is identifiable up to a subspace-wise invertible transformation and a subspace-wise permutation.

All assumptions align with the same principles as those elaborated in the theorems proposed above and have been adapted to cater to the flexible grouping structure. Specifically, in addition to the identifiability of sources in $s_I$, we prove that we can also identify sources in $s_D$ up to the indeterminacy that, for each $c_i \in \{c_1, \ldots, c_d\}$, there exists an invertible transformation $h_{c_i}$ s.t. $h_{c_i}(s_{c_i}) = \hat{s}_{c_i}$, which is analogous to the previous element-wise indeterminacy. Consequently, even when dealing with mixtures of high-dimensional and one-dimensional sources, as in the case of multi-modal data, we can still recover the hidden generating process to some extent. Based on the aforementioned theoretical results, which consider undercompleteness, partial sparsity, and partial source dependence, Thm. 4.3 further generalizes the identifiability of nonlinear ICA by relaxing the dimensionality constraint on the latent generating factors.

In this vein, it is natural to consider another dependence structure, i.e., sources in $s_D$ are not marginally but conditionally independent given an auxiliary variable $u$. This is similar to the assumption made in most previous works on identifiable nonlinear ICA with surrogate information, which assume that all sources are conditionally independent of each other given the auxiliary variable. However, our setting is more flexible in the sense that we do not assume all sources to be influenced by the auxiliary variable.
Specifically, sources in $s_I$ are mutually independent as in the original ICA setting, while only sources in $s_D$ have access to the side information from the conditional independence given $u$, i.e.,

$$p_{s|u}(s|u) = \prod_{i=1}^{n_I} p_{s_i}(s_i) \prod_{j=n_I+1}^{n} p_{s_j|u}(s_j|u). \quad (5)$$

The identifiability result for all sources ($s_I$ and $s_D$) is as follows, with its proof in Appx. A.5:

**Theorem 4.4.** Let the observed data be a large enough sample generated from an undercomplete nonlinear ICA model as defined in Eqs. (2) and (5). Suppose the following assumptions hold:

i. For each $i \in \{1, \ldots, n_I\}$, there exist $\{s^{(\ell)}\}_{\ell=1}^{|\mathcal{F}_{i,:n_I}|}$ and a matrix $T \in \mathcal{T}$ s.t. $\mathrm{span}\{J_f(s^{(\ell)})_{i,:n_I}\}_{\ell=1}^{|\mathcal{F}_{i,:n_I}|} = \mathbb{R}^{n_I}_{\mathcal{F}_{i,:n_I}}$ and $[J_f(s^{(\ell)})T]_{i,:n_I} \in \mathbb{R}^{n_I}_{\hat{\mathcal{F}}_{i,:n_I}}$.

ii. There exist $2n_D + 1$ values of $u$, i.e., $u_i$ with $i \in \{0, 1, \ldots, 2n_D\}$, s.t. the $2n_D$ vectors $w(s_D, u_i) - w(s_D, u_0)$ with $i \in \{1, \ldots, 2n_D\}$ are linearly independent, where the vector $w(s_D, u_i)$ is defined as follows:
$$w(s_D, u_i) = \bigl(v(s_D, u_i), v'(s_D, u_i)\bigr),$$
where
$$v(s_D, u_i) = \left(\frac{\partial \log p(s_{n_I+1}|u_i)}{\partial s_{n_I+1}}, \ldots, \frac{\partial \log p(s_n|u_i)}{\partial s_n}\right), \quad v'(s_D, u_i) = \left(\frac{\partial^2 \log p(s_{n_I+1}|u_i)}{\partial (s_{n_I+1})^2}, \ldots, \frac{\partial^2 \log p(s_n|u_i)}{\partial (s_n)^2}\right).$$

iii. There exist $u_1, u_2 \in u$, s.t., for any set $A_s \subseteq \mathcal{S}$ with nonzero probability measure that cannot be expressed as $B_{s_I} \times \mathcal{S}_D$ for any $B_{s_I} \subset \mathcal{S}_I$, we have
$$\int_{s \in A_s} p_{s|u}(s \mid u_1) \, ds \neq \int_{s \in A_s} p_{s|u}(s \mid u_2) \, ds.$$

iv. (Structural Sparsity) For all $k \in \{1, \ldots, n_I\}$, there exists $\mathcal{C}_k$ s.t. $\bigcap_{i \in \mathcal{C}_k} \mathcal{F}_{i,:n_I} = \{k\}$.

Then $s$ is identifiable up to an element-wise invertible transformation and a permutation.

With different assumptions on different sets of sources, one could view this theorem as an expansion of both previous theoretical findings that impose distributional constraints on sources with auxiliary variables (e.g., (Hyvärinen et al., 2019)) and those that constrain the mixing function with Structural Sparsity (Zheng et al., 2022). This is particularly helpful in the context of self-supervised learning (Von Kügelgen et al., 2021) or transfer learning (Kong et al., 2022), where latent representations are modeled as a changing part and an invariant part. In (Kong et al., 2022), the component-wise identifiability of variables changing across domains (i.e., $s_D$ with multiple values of $u$ in our setting) is provided, but not that of the invariant part (i.e., $s_I$ in our setting). With the help of Thm. 4.4, we can show identifiability up to an element-wise invertible transformation and a permutation for each source, regardless of whether it changes across domains or not, which may help related tasks where full identifiability is necessary. Furthermore, for causal reasoning or disentanglement with observational time-series data, our theorem benefits the identifiability of temporal processes involving instantaneous relations. Previous works in that area can only deal with time-delayed/changing influences, as they rely on the global conditional independence of all sources given the changing time index as the auxiliary variable (Hyvärinen and Morioka, 2016, 2017; Yao et al., 2022). Our theorem, on the other hand, provides the added ability to identify unconditional sources, thanks to the partially satisfied sparsity assumption, and thus aids the uncovering of latent processes with instantaneous relations.
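To see what assumption ii of Thm. 4.4 (and its analogue in Thm. 4.3) asks for, the sketch below evaluates the vectors $w(s_D, u_i) - w(s_D, u_0)$ for a hypothetical family of conditionally Gaussian changing sources and checks their linear independence numerically; the Gaussian parameterization is only an assumed example, not part of the theorem.

```python
import numpy as np

rng = np.random.default_rng(1)
n_D = 3                       # number of changing sources (hypothetical)
K = 2 * n_D + 1               # required number of distinct values of u
# u-dependent means and variances of the conditional Gaussians p(s_j | u_i)
mu = rng.normal(size=(K, n_D))
var = rng.uniform(0.5, 2.0, size=(K, n_D))

def w(s_D, i):
    """w(s_D, u_i) = (v(s_D,u_i), v'(s_D,u_i)) for conditionally Gaussian sources:
    first-order scores -(s_j - mu_ij)/var_ij and second-order scores -1/var_ij."""
    return np.concatenate([-(s_D - mu[i]) / var[i], -1.0 / var[i]])

s_D = rng.normal(size=n_D)    # evaluate the scores at one point
W = np.stack([w(s_D, i) - w(s_D, 0) for i in range(1, K)])   # 2*n_D rows of length 2*n_D
# Full rank means the 2*n_D difference vectors are linearly independent,
# i.e., assumption ii holds for this particular family and point.
print(np.linalg.matrix_rank(W) == 2 * n_D)
```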
## 5 Experiments

In order to validate the proposed identifiability results, we conduct experiments using both simulated data and real-world images. It is noteworthy that there has been extensive research that has empirically verified that deep latent variable models are likely to be identifiable in complex scenarios, particularly in the disentanglement task (Kumar et al., 2017; Klys et al., 2018; Locatello et al., 2018; Rubenstein et al., 2018; Chen et al., 2018; Burgess et al., 2018; Duan et al., 2020; Falck et al., 2021; Carbonneau et al., 2022). While we are not sure of the exact inductive biases or side information that are available during real-world application, which has been proved to be necessary (Hyvärinen and Pajunen, 1999; Locatello et al., 2019), the empirical success of these methods sheds light on the possibility of identification in the general settings considered in this work.

**Setup.** For settings with the auxiliary variable $u$, $u$ is always available during estimation and we consider the dataset $\mathcal{D} = \{(x^{(1)}, u^{(1)}), \ldots, (x^{(N)}, u^{(N)})\}$, where $N$ is the sample size and $u^{(i)}$ is the value of $u$ (or class label) corresponding to the data point $x^{(i)}$. Given the estimated model $\hat{f}$ parameterized by $\theta$, similar to (Sorrenson et al., 2020), we consider a regularized maximum-likelihood approach for the required sparsity regularization during estimation, with the objective function
$$\mathcal{L}(\theta) = \mathbb{E}_{(x,u) \sim \mathcal{D}}\left[\log p_{\hat{f}^{-1}}(x|u)\right] - \lambda R,$$
where $\lambda \in [0, 1]$ is a regularization parameter and $R$ is the regularization term on the Jacobian of the estimated mixing function, i.e., $J_{\hat{f}}$. Based on our experimental results (Fig. 8 in Appx. B.2), we adopt the minimax concave penalty (MCP) (Zhang, 2010) as the regularization term. For settings without the auxiliary variable, we remove the access to $u$ and follow the same objective function as in (Zheng et al., 2022). We train a General Incompressible-flow Network (GIN) (Sorrenson et al., 2020), which is a flow-based generative model, to maximize the objective function $\mathcal{L}(\cdot)$. Following (Sorrenson et al., 2020), where necessary, we concatenate the latent sources with independent Gaussian noises to meet the dimensionality requirements. All results are from 20 trials with random seeds. Additional details of the experimental setup are included in Appx. B.
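The regularization term $R$ can be written out explicitly. Below is a minimal NumPy sketch of the MCP penalty (Zhang, 2010) and of how it could be summed over a batch of learned Jacobians $J_{\hat{f}}$; the function names and the batching convention are our own assumptions, not the exact implementation used with the GIN model.

```python
import numpy as np

def mcp(t, lam=0.1, gamma=3.0):
    """Minimax concave penalty (Zhang, 2010), applied element-wise:
    rho(t) = lam*|t| - t^2/(2*gamma) if |t| <= gamma*lam, else gamma*lam^2/2."""
    t = np.abs(t)
    return np.where(t <= gamma * lam, lam * t - t**2 / (2 * gamma), gamma * lam**2 / 2)

def sparsity_regularizer(jacobian_batch, lam=0.1, gamma=3.0):
    """R in the objective above: the MCP penalty summed over the entries of the
    learned Jacobian and averaged over a batch (an illustrative convention)."""
    return mcp(jacobian_batch, lam, gamma).sum(axis=(-2, -1)).mean()

# Toy usage: a batch of 8 Jacobians of an estimated mixing with m = 10, n = 4.
J_hat = np.random.default_rng(0).normal(scale=0.2, size=(8, 10, 4))
print(float(sparsity_regularizer(J_hat)))
```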
**Ablation study.** We perform an ablation study to verify the necessity of the proposed assumptions. Specifically, we focus on the following models corresponding to different assumptions: (UCSS) the assumption of Structural Sparsity, as well as the other assumptions in the undercomplete case (Thm. 3.1), are satisfied; (Mixed) the assumption of Structural Sparsity with undercompleteness and the required dependence structure among sources influenced by the auxiliary variable, as well as the other assumptions in the partial sparsity and dependence case (Thm. 4.4), are satisfied; (Base) the vanilla baseline in the undercomplete case, where the assumption of Structural Sparsity is not satisfied, in contrast to UCSS. The datasets are generated according to the required assumptions, the details of which are included in Appx. B. All experiments are conducted in the undercomplete case, where the number of observed variables is twice the number of sources. For datasets that contain both sources in $s_I$ and $s_D$ (Mixed case), we set half as $s_I$ and the other half as $s_D$, and the minimum required number of distinct values has been assigned to the auxiliary variable $u$. Following previous works (Hyvärinen and Morioka, 2016; Lachapelle et al., 2022), we use the mean correlation coefficient (MCC) between the true sources and the estimated ones as the evaluation metric. Results for each model are summarised in Fig. 2 and Fig. 3. It can be observed that when the proposed assumptions are met (UCSS and Mixed), our models achieve higher MCCs than Base. This indicates that it is indeed possible to identify sources from nonlinear mixtures up to trivial indeterminacy in the general settings with undercompleteness, partial sparsity, and partial source dependence. Additionally, we conduct experiments with different numbers of sources $n$ to evaluate the stability of the identification. Our results show that both models consistently outperform Base across all values of $n$, further supporting the theoretical claims.

*Figure 2: MCC of UCSS w.r.t. different numbers of sources.*
*Figure 3: MCC of Mixed w.r.t. different numbers of sources.*

**Undercomplete Structural Sparsity.** As previously discussed, the assumption of Structural Sparsity (Assumption ii in Thm. 3.1) is far more plausible in the undercomplete setting considered in our theory (i.e., the number of observed variables $m$ is larger than the number of sources $n$) than in the more restrictive bijective scenario required in (Zheng et al., 2022) (i.e., the numbers are equal, $m = n$). Consequently, extending the identifiability with structural sparsity from a bijective to an undercomplete setting significantly broadens its applicability in real-world contexts. In order to validate the necessity of the proposed generalization empirically, we construct several experiments studying the Structural Sparsity assumption in the undercomplete case. We consider different numbers of sources $n$ with different degrees of undercompleteness ($m/n$, where $m$ is the number of observed variables). For each setting, we generate 50 random matrices where each entry is independently determined with equal probability to be either zero or non-zero. The percentages of matrices satisfying the assumption of Structural Sparsity are presented in Fig. 4. We observe a significant gap in the percentages between the case where $m/n = 1$, i.e., the bijective setting, and the undercomplete settings where $m/n > 1$. Thus, it is clear that the assumption is much more likely to hold true when we have more observed variables than sources. Furthermore, as the degree of undercompleteness increases, the percentage of cases satisfying Structural Sparsity converges to 1. This further suggests that the assumption will almost always hold with a sufficient degree of undercompleteness, which is rather common in practice. For instance, a photo can easily have millions of pixels (observed variables) but only a dozen hidden concepts (sources).

*Figure 4: Percentage of random structures satisfying Structural Sparsity w.r.t. different degrees of undercompleteness (i.e., $m/n$).*
*Figure 5: Percentage of sources satisfying Structural Sparsity w.r.t. different numbers of sources in the bijective setting ($m/n = 1$).*
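The simulation behind Fig. 4 can be sketched as a short Monte Carlo experiment: draw random binary supports with each entry non-zero with probability 1/2 and count how often Structural Sparsity holds. The code below is an illustrative reconstruction under these stated settings, not the paper's original script.

```python
import numpy as np

def structural_sparsity_holds(F):
    """Same check as in the earlier sketch: every source k is the exact
    intersection of the supports of the rows that contain it."""
    m, n = F.shape
    for k in range(n):
        rows = [set(np.flatnonzero(F[i])) for i in range(m) if F[i, k]]
        if not rows or set.intersection(*rows) != {k}:
            return False
    return True

def percentage_satisfied(n, ratio, trials=50, p_nonzero=0.5, seed=0):
    """Fraction (in %) of random m x n supports with m = ratio*n that satisfy
    Structural Sparsity; each entry is non-zero with probability p_nonzero."""
    rng = np.random.default_rng(seed)
    m = int(ratio * n)
    hits = sum(structural_sparsity_holds(rng.random((m, n)) < p_nonzero)
               for _ in range(trials))
    return 100.0 * hits / trials

for ratio in (1, 2, 3, 4, 5):
    print(ratio, percentage_satisfied(n=5, ratio=ratio))
```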
**Partial Structural Sparsity.** Moreover, as previously noted, it is not uncommon for Structural Sparsity to be violated for a subset of sources. For instance, certain sources (such as high-decibel sound sources) may exert influence over all observed variables (microphones). Nonetheless, the prior study (Zheng et al., 2022) necessitates that the sparsity assumption holds true for all sources, providing no identifiability assurance in cases of any degree of violation. To confront this practical obstacle, we propose Thm. 4.1 and Thm. 4.2 to demonstrate that the remaining sources (i.e., the $n_I$ sources) can still be identified even when the sparsity assumption does not universally hold for some sources. These results can also be motivated empirically. For instance, from Fig. 4, one may find that Structural Sparsity is unlikely to hold for all sources when $m/n = 1$, which is the bijective setting considered in (Zheng et al., 2022). However, as discussed above, if a subset of sources satisfies the assumption, at least the identifiability of these sources can be guaranteed by our proposed theorems (Thm. 4.1 and Thm. 4.2) under certain conditions. To illustrate this, we conduct experiments in the bijective setting ($m/n = 1$) and report the percentage of sources satisfying Structural Sparsity in Fig. 5. We consider datasets with different numbers of sources and generate 50 random matrices for each of them. Each entry is independently determined with equal probability to be either zero or non-zero. Combining results from Fig. 4 and Fig. 5, we observe that, even in scenarios where Structural Sparsity is rarely satisfied for all sources ($m/n = 1$ in Fig. 4), it is almost always satisfied for a significant fraction of sources (Fig. 5). Consequently, our generalization also proves helpful even within the confines of the earlier bijective setting.

**Image datasets.** To study how reasonable the proposed theories are w.r.t. the practical generating process of observational data in complex scenarios, we conduct experiments on the "Triangles" (Yang et al., 2022) and EMNIST (Cohen et al., 2017) datasets. The "Triangles" dataset consists of 60,000 synthetic 28×28 images of triangles, which are generated from 4 factors: rotation, height, width, and brightness. By fixing the number of pixels (observed variables) and generating factors (sources), we can guarantee that the images are generated according to an undercomplete process, although the exact generating process is still unknown (e.g., a pixel could be (indirectly) influenced by multiple factors in a complicated way). For the real-world dataset, EMNIST contains 240,000 28×28 images of handwritten digits and is a larger version of the classical MNIST dataset. Although we do not know the exact number of sources, it is highly likely that it is smaller than the number of pixels (784). We present the identified sources with the top four standard deviations (SDs) from both datasets in Fig. 6 and Fig. 7. In both figures, each row represents a source identified by our model, varying from −4 to +4 SDs to illustrate its influence. The rightmost column is a heat map given by the absolute pixel difference between −1 and +1 SDs. By observing the identified sources with the top four standard deviations, one can see that it is possible to identify semantically meaningful attributes from practical image datasets, which further suggests the potential of our theory in real-world scenarios. Additional results are available in Appx. B.2.

*Figure 6: Results on Triangles. The rows may correspond to rotation, height, width, and brightness, respectively.*
*Figure 7: Results on EMNIST. The rows may correspond to line thickness, angle, upper width, and height, respectively.*
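For completeness, the MCC metric used throughout this section can be computed as sketched below, assuming Pearson correlations matched by the Hungarian algorithm (a common convention in the cited works; the paper's exact implementation details are not given here).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mean_correlation_coefficient(s_true, s_est):
    """MCC between true sources (N, n) and estimated sources (N, n): absolute
    Pearson correlations, matched by an optimal permutation, then averaged."""
    n = s_true.shape[1]
    corr = np.corrcoef(s_true.T, s_est.T)[:n, n:]        # n x n cross-correlation block
    row, col = linear_sum_assignment(-np.abs(corr))      # maximize total |correlation|
    return np.abs(corr[row, col]).mean()

# Toy usage: estimates are a permuted, element-wise transformed copy of the sources.
rng = np.random.default_rng(0)
s = rng.normal(size=(1000, 4))
s_hat = np.tanh(2 * s[:, [2, 0, 3, 1]]) + 0.05 * rng.normal(size=(1000, 4))
print(mean_correlation_coefficient(s, s_hat))   # high: sources recovered up to
                                                # permutation and element-wise transform
```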
## 6 Conclusion

We establish a set of new identifiability results for nonlinear ICA in general settings with undercompleteness, partial sparsity and source dependence, and flexible grouping structures, thereby extending the identifiability theory to a wide range of real-world scenarios. Specifically, we prove identifiability when there are more observed variables than underlying sources, and when sparsity and/or independence are not met for a subset of sources. Moreover, by leveraging various dependence structures among sources, further identifiability guarantees can also be obtained. The theoretical results have been validated through a combination of extensive previous studies and our own experiments, which involve both synthetic and real-world datasets. Future work includes adopting the theoretical framework for related tasks, such as disentanglement, transfer learning, and causal discovery. Furthermore, the proposed identifiability guarantees on generalized latent variable models bolster our confidence in uncovering hidden truths across diverse real-world settings in scientific discovery. We have only explored the visual disentanglement task, and the lack of other applications is a limitation of this work.

## Acknowledgements

We are grateful to everyone involved in the anonymous reviewing process for their insightful feedback. This project is partially supported by NSF Grant 2229881, the National Institutes of Health (NIH) under Contract R01HL159805, a grant from Apple Inc., a grant from KDDI Research Inc., and generous gifts from Salesforce Inc., Microsoft Research, and Amazon Research.

## References

L. Ardizzone, T. Bungert, F. Draxler, U. Köthe, J. Kruse, R. Schmier, and P. Sorrenson. Framework for Easily Invertible Architectures (FrEIA), 2018–2022. URL https://github.com/vislearn/FrEIA.
P. Breheny and J. Huang. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics, 5(1):232–253, 2011.
S. Buchholz, M. Besserve, and B. Schölkopf. Function classes for identifiable nonlinear independent component analysis. arXiv preprint arXiv:2208.06406, 2022.
C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner. Understanding disentangling in beta-VAE. Workshop on Learning Disentangled Representations at the 31st Conference on Neural Information Processing Systems, 2018.
M.-A. Carbonneau, J. Zaidi, J. Boilard, and G. Gagnon. Measuring disentanglement: A review of metrics. IEEE Transactions on Neural Networks and Learning Systems, 2022.
J.-F. Cardoso. Multidimensional independent component analysis. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No. 98CH36181), volume 4, pages 1941–1944. IEEE, 1998.
R. T. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud. Isolating sources of disentanglement in variational autoencoders. Advances in Neural Information Processing Systems, 31, 2018.
G. Cohen, S. Afshar, J. Tapson, and A. Van Schaik. EMNIST: Extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 2921–2926. IEEE, 2017.
P. Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994.
S. Duan, L. Matthey, A. Saraiva, N. Watters, C. Burgess, A. Lerchner, and I. Higgins. Unsupervised model selection for variational disentangled representation learning. In ICLR, 2020.
F. Falck, H. Zhang, M. Willetts, G. Nicholson, C. Yau, and C. C. Holmes.
Multi-facet clustering variational autoencoders. Advances in Neural Information Processing Systems, 34:8676–8690, 2021.
J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
H. Hälvä, S. Le Corff, L. Lehéricy, J. So, Y. Zhu, E. Gassiat, and A. Hyvärinen. Disentangling identifiable features from noisy data with structured nonlinear ICA. Advances in Neural Information Processing Systems, 34, 2021.
A. Hyvärinen and P. Hoyer. Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7):1705–1720, 2000.
A. Hyvärinen and H. Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. Advances in Neural Information Processing Systems, 29:3765–3773, 2016.
A. Hyvärinen and H. Morioka. Nonlinear ICA of temporally dependent stationary sources. In International Conference on Artificial Intelligence and Statistics, pages 460–469. PMLR, 2017.
A. Hyvärinen and P. Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429–439, 1999.
A. Hyvärinen, H. Sasaki, and R. Turner. Nonlinear ICA using auxiliary variables and generalized contrastive learning. In International Conference on Artificial Intelligence and Statistics, pages 859–868. PMLR, 2019.
I. Khemakhem, D. Kingma, R. Monti, and A. Hyvärinen. Variational autoencoders and nonlinear ICA: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pages 2207–2217. PMLR, 2020a.
I. Khemakhem, R. Monti, D. Kingma, and A. Hyvarinen. ICE-BeeM: Identifiable conditional energy-based deep models based on nonlinear ICA. Advances in Neural Information Processing Systems, 33:12768–12778, 2020b.
D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. Advances in Neural Information Processing Systems, 31, 2018.
B. Kivva, G. Rajendran, P. Ravikumar, and B. Aragam. Identifiability of deep generative models without auxiliary information. Advances in Neural Information Processing Systems, 35:15687–15701, 2022.
J. Klys, J. Snell, and R. Zemel. Learning latent subspaces in variational autoencoders. Advances in Neural Information Processing Systems, 31, 2018.
L. Kong, S. Xie, W. Yao, Y. Zheng, G. Chen, P. Stojanov, V. Akinwande, and K. Zhang. Partial disentanglement for domain adaptation. In International Conference on Machine Learning, pages 11455–11472. PMLR, 2022.
A. Kumar, P. Sattigeri, and A. Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. arXiv preprint arXiv:1711.00848, 2017.
S. Lachapelle and S. Lacoste-Julien. Partial disentanglement via mechanism sparsity. In UAI 2022 Workshop on Causal Representation Learning, 2022.
S. Lachapelle, P. R. López, Y. Sharma, K. Everett, R. L. Priol, A. Lacoste, and S. Lacoste-Julien. Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ICA. Conference on Causal Learning and Reasoning, 2022.
F. Locatello, D. Vincent, I. Tolstikhin, G. Rätsch, S. Gelly, and B. Schölkopf. Competitive training of mixtures of independent deep generative models. arXiv preprint arXiv:1804.11130, 2018.
F. Locatello, S. Bauer, M. Lucic, G. Raetsch, S. Gelly, B. Schölkopf, and O. Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations.
In International Conference on Machine Learning, pages 4114–4124. PMLR, 2019.
P.-L. Loh and M. J. Wainwright. Support recovery without incoherence: A case for nonconvex regularization. The Annals of Statistics, 45(6):2455–2482, 2017.
G. Monge. Applications de l'analyse à la géométrie, 1850.
P. Ravikumar, G. Raskutti, M. J. Wainwright, and B. Yu. Model selection in Gaussian graphical models: High-dimensional consistency of ℓ1-regularized MLE. In Advances in Neural Information Processing Systems, 2008.
P. Ravikumar, M. J. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935–980, 2011.
P. Rubenstein, B. Schölkopf, and I. Tolstikhin. Learning disentangled representations with Wasserstein auto-encoders. In 6th International Conference on Learning Representations (ICLR 2018), 2018.
P. Sorrenson, C. Rother, and U. Köthe. Disentanglement by nonlinear ICA with general incompressible-flow networks (GIN). arXiv preprint arXiv:2001.04872, 2020.
A. Taleb and C. Jutten. Source separation in post-nonlinear mixtures. IEEE Transactions on Signal Processing, 47(10):2807–2820, 1999.
F. Theis. Towards a general independent subspace analysis. Advances in Neural Information Processing Systems, 19, 2006.
J. Von Kügelgen, Y. Sharma, L. Gresele, W. Brendel, B. Schölkopf, M. Besserve, and F. Locatello. Self-supervised learning with data augmentations provably isolates content from style. Advances in Neural Information Processing Systems, 34:16451–16467, 2021.
M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.
X. Yang, Y. Wang, J. Sun, X. Zhang, S. Zhang, Z. Li, and J. Yan. Nonlinear ICA using volume-preserving transformations. In International Conference on Learning Representations, 2022.
W. Yao, Y. Sun, A. Ho, C. Sun, and K. Zhang. Learning temporally causal latent processes from general temporal data. arXiv preprint arXiv:2110.05428, 2021.
W. Yao, G. Chen, and K. Zhang. Temporally disentangled representation learning. In Advances in Neural Information Processing Systems, 2022.
C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.
Y. Zheng, I. Ng, and K. Zhang. On the identifiability of nonlinear ICA: Sparsity and beyond. In Advances in Neural Information Processing Systems, 2022.

**Table of Contents**

A Proofs
  A.1 Proof of Theorem 3.1
  A.2 Proof of Theorem 4.1
  A.3 Proof of Theorem 4.2
  A.4 Proof of Theorem 4.3
  A.5 Proof of Theorem 4.4
B Experiments
  B.1 Supplementary experimental settings
  B.2 Supplementary experimental results

### A.1 Proof of Theorem 3.1

**Theorem 3.1.** Let the observed data be a large enough sample generated by an undercomplete nonlinear ICA model as defined in Eqs. (1) and (2). Suppose the following assumptions hold: i. For each i ∈ {1, . . . , n}, there exist {s(ℓ)}|Fi,:| ℓ=1 and a matrix T ∈ T s.t.
span{Jf(s(ℓ))i,:}|Fi,:| ℓ=1 = Rn Fi,: and Jf(s(ℓ))T i,: Rn ˆ Fi,:. ii. (Structural Sparsity) For each k {1, . . . , n}, there exists Ck s.t. T i Ck Fi,: = {k}. Then s is identifiable up to an element-wise invertible transformation and a permutation. Proof. Let h : s ˆs denotes the transformation between the true sources and estimated sources. We can apply the chain rule repeatedly to get: Jf(s) = Jˆf h(s) = Jˆf(ˆs)Jh(s). (6) Since Jˆf(ˆs) and Jf(s) both possess full column rank, Jh(s) should have a non-zero determinant. From this, we can deduce by incorporating the inverse of h: Jˆf(ˆs) = Jf(s)Jh(s) 1. (7) Our objective here is to demonstrate that the function h is a composition of a permutation and a component-wise invertible transformation. Let D(s) denote a diagonal matrix and P denote a permutation matrix, our goal can be rewritten as demonstrating that Jh(s) 1 = D(s)P. This leads us to demonstrate that: Jˆf(ˆs) = Jf(s)D(s)P. (8) Further, we can express: Jˆf(ˆs) = Jf(s)T(s), (9) where T(s) Rn n is a square matrix. Here, we define F as the support of Jf(s), ˆF as the support of Jˆf(ˆs) and T as a set of matrices with the same support of T(s). Furthermore, T T is a matrix with the same support as T(s). Based on Assumption i, we have: span{Jf(s(ℓ))i,:}|Fi,:| ℓ=1 = Rn Fi,:. (10) Given that the set {Jf(s(ℓ))i,:}|Fi,:| ℓ=1 forms a basis of Rn Fi,:, we can express any vector in this space as a linear combination of these basis vectors. In particular, for any j0 Fi,:, the one-hot vector ej0 Rn Fi,: can be written as ℓ Fi,: αℓJf(s(ℓ))i,:, (11) where αℓdenotes the respective coefficient. With this in mind, we can find the transformation of ej0 under T as Tj0,: = ej0T = X ℓ Fi,: αℓJf(s(ℓ))i,:T. (12) According to Assumption i, each term in the above summation belongs to the space Rn ˆ Fi,:. Therefore, Tj0,: itself resides in Rn ˆ Fi,:, i.e., Tj0,: Rn ˆ Fi,:. Thus j Fi,:, Tj,: Rn ˆ Fi,:. (13) Then the connections between these supports can be established according to Defn. 2.3 (i, j) F, {i} Tj,: ˆF. (14) It is noteworthy that a similar strategy to derive Eq. 14 has been applied in (Zheng et al., 2022) and part of the proof technique is inspired by that work. In contrast to the proof by Zheng et al. (2022), which assumes the invertibility of f, we only necessitate its injectivity. This distinction allows for the inclusion of undercomplete cases. Since Jf(s(ℓ)) and Jˆf(ˆs(ℓ)) have full column rank n, T(s(ℓ)) must have a non-zero determinant. Otherwise, it would follow that the rank of T(s(ℓ)) is less than n, which would imply a contradiction that Jˆf(ˆs(ℓ)) = Jf(s(ℓ))T(s(ℓ)) has a column rank less than n. Representing the determinant of the matrix T(s(ℓ)) as its Leibniz formula yields det(T(s(ℓ))) = X i=1 T(s(ℓ))i,σ(i) where Sn is the set of n-permutations. Thus, there is at least one term in the sum that is non-zero, i.e., σ Sn, i {1, . . . , n}, sgn(σ) i=1 T(s(ℓ))i,σ(i) = 0, (16) which is equivalent to σ Sn, i {1, . . . , n}, T(s(ℓ))i,σ(i) = 0. (17) Then we can conclude that this σ is in the support of T(s) since s(ℓ) s. Therefore, it follows that j {1, . . . , n}, σ(j) Tj,:. (18) Together with Eq. (14), we have (i, j) F, (i, σ(j)) {i} Tj,: ˆF. (19) Denote σ(F) = {(i, σ(j)) | (i, j) F}. (20) Then we have σ(F) ˆF. (21) Because of the sparsity regularization on the estimated Jacobian, we further have | ˆF| |F| = |σ(F)|. (22) Combining this with Eq. (21), we derive σ(F) = ˆF. (23) Suppose T(s) = D(s)P, then j1 = j2, Tj1,: Tj2,: = . (24) Additionally, consider j3 {1, . . . 
, n} for which σ(j3) Tj1,: Tj2,:. (25) Since j1 = j2, we can assume j3 = j1 without loss of generality. A similar strategy has been used previously in (Lachapelle et al., 2022; Zheng et al., 2022). Based on Assumption ii, there exists Cj1 j1 such that T i Cj1 Fi,: = {j1}. Because j3 {j1} = \ i Cj1 Fi,:, (26) there must exists i3 Cj1 such that j3 Fi3,:. (27) Since j1 Fi3,:, it follows that (i3, j1) F. Therefore, according to Eq. (14), we have {i3} Tj1,: ˆF. (28) Notice that σ(j3) Tj1,: Tj2,: implies (i3, σ(j3)) {i3} Tj1,:. (29) Then by Eqs. (28) and (29), we have (i3, σ(j3)) ˆF. (30) This further implies (i3, j3) F by Eq. (20) and (23), which contradicts Eq. (27). Therefore, we have proven by contradiction that T(s) = D(s)P. By replacing T(s) with D(s)P in Eq. (9), we obtain Eq. (8), which is the goal. A.2 Proof of Theorem 4.1 Theorem 4.1. Let the observed data be a large enough sample generated by an undercomplete nonlinear ICA model defined in Eqs. (2) and (3). Suppose the following assumptions hold: i. There exist n D + 1 distinct values of u, i.e., uj with j {0, 1, . . . , n D}, s.t. the n D vectors w(s D, u, i) with i {n I + 1, ..., n} are linearly independent, where vector w(s D, u, i) is defined as follows: w(s D, u, i) = (log p(s D|u1) log(p(s D|u0)) si , . . . , (log p(s D|un D) log(p(s D|u0)) ii. There exist u1, u2 u, s.t., for any set As S with non-zero probability measure and cannot be expressed as Bs I s D for any Bs I SI, we have Z s As ps|u (s | u1) ds = Z s As ps|u (s | u2) ds. Then s D is identifiable up to an subspace-wise invertible transformation. Proof. Let h : s ˆs denotes the transformation between the true and estimated sources. By using chain rule repeatedly, we have Jf(s) = Jˆf h(s) = Jˆf(ˆs)Jh(s). (31) Since Jˆf(ˆs) and Jf(s) both possess full column rank, Jh(ˆs) should have a non-zero determinant. Thus, Jh(s) must be invertible and have a non-zero determinant. Otherwise, one of them would not be of full column rank, which leads to a contradiction. Applying the change of variable rule, we have ps|u(s|u)| det(Jh 1(ˆs))| = pˆs(ˆs|u). (32) Taking the logarithm on both sides yields log ps|u(s|u) + log | det(Jh 1(ˆs))| = log pˆs|u(ˆs|u). (33) Note that according to the model defined in Eqs. (2) and (3), the joint densities can be factorized as ps|u(s|u) = ps D|u(s D|u) i=1 psi(si), pˆs|u(ˆs|u) = pˆs D|u(ˆs D|u) i=1 pˆsi(ˆsi). Then we define the difference across domains as follows q(s, uj) = log ps|u(s|uj) log ps|u(s|u0). (35) By taking the difference on both sides of Eq. (33) corresponding to uj and u0, we obtain q(ˆs D, uj) = q(s D, uj), (36) where, for i {1, . . . , n I}, log pˆs|u(ˆsi|uj) and log ps|u(si|uj) have been canceled because sources in s I are not dependent on uj. That is, q(si, uj) = 0 if i {1, . . . , n I}. Taking the derivatives of both sides of Eq. (36) w.r.t. ˆsk where k {1, . . . , n I}, we have q(ˆs D, uj) Clearly, LHS of Eq. (37) equals zero. By considering each j {1, . . . , n D} for uj, we have n D equations like Eq. (37), which constitute a linear system with a n D n D coefficient matrix. According to the assumption, the coefficient matrix of the linear system has full rank. Thus, the only solution of Eq. (37) is si ˆsk = 0 for i {n I + 1, . . . , n} and k {1, . . . , n I}. As h 1( ) is smooth, its Jacobian can be written as: ˆs I B := s I ˆs D C := s D ˆs I D := s D ˆsk = 0 for j {n I + 1, . . . , n} and k {1, . . . , n I}, entries in the submatrix C must all be zero. 
A.2 Proof of Theorem 4.1

Theorem 4.1. Let the observed data be a large enough sample generated by an undercomplete nonlinear ICA model defined in Eqs. (2) and (3). Suppose the following assumptions hold:

i. There exist $n_D + 1$ distinct values of $u$, i.e., $u_j$ with $j \in \{0, 1, \ldots, n_D\}$, s.t. the $n_D$ vectors $w(s_D, u, i)$ with $i \in \{n_I + 1, \ldots, n\}$ are linearly independent, where the vector $w(s_D, u, i)$ is defined as follows:
$w(s_D, u, i) = \Big(\frac{\partial\big(\log p(s_D|u_1) - \log p(s_D|u_0)\big)}{\partial s_i}, \; \ldots, \; \frac{\partial\big(\log p(s_D|u_{n_D}) - \log p(s_D|u_0)\big)}{\partial s_i}\Big)$.

ii. There exist $u_1, u_2 \in \mathcal{U}$, s.t., for any set $A_s \subseteq \mathcal{S}$ with non-zero probability measure that cannot be expressed as $B_{s_I} \times \mathcal{S}_D$ for any $B_{s_I} \subseteq \mathcal{S}_I$, we have
$\int_{s \in A_s} p_{s|u}(s \mid u_1)\, ds \neq \int_{s \in A_s} p_{s|u}(s \mid u_2)\, ds$.

Then $s_D$ is identifiable up to a subspace-wise invertible transformation.

Proof. Let $h : s \to \hat{s}$ denote the transformation between the true and estimated sources. By using the chain rule repeatedly, we have
$J_f(s) = J_{\hat{f} \circ h}(s) = J_{\hat{f}}(\hat{s}) J_h(s)$. (31)
Since $J_{\hat{f}}(\hat{s})$ and $J_f(s)$ both possess full column rank, $J_h(s)$ must be invertible and have a non-zero determinant; otherwise, one of them would not be of full column rank, which leads to a contradiction. Applying the change-of-variables rule, we have
$p_{s|u}(s|u)\, |\det(J_{h^{-1}}(\hat{s}))| = p_{\hat{s}|u}(\hat{s}|u)$. (32)
Taking the logarithm on both sides yields
$\log p_{s|u}(s|u) + \log|\det(J_{h^{-1}}(\hat{s}))| = \log p_{\hat{s}|u}(\hat{s}|u)$. (33)
Note that, according to the model defined in Eqs. (2) and (3), the joint densities can be factorized as
$p_{s|u}(s|u) = p_{s_D|u}(s_D|u) \prod_{i=1}^{n_I} p_{s_i}(s_i), \qquad p_{\hat{s}|u}(\hat{s}|u) = p_{\hat{s}_D|u}(\hat{s}_D|u) \prod_{i=1}^{n_I} p_{\hat{s}_i}(\hat{s}_i)$. (34)
Then we define the difference across domains as
$q(s, u_j) = \log p_{s|u}(s|u_j) - \log p_{s|u}(s|u_0)$. (35)
By taking the difference of both sides of Eq. (33) corresponding to $u_j$ and $u_0$, we obtain
$q(\hat{s}_D, u_j) = q(s_D, u_j)$, (36)
where, for $i \in \{1, \ldots, n_I\}$, the terms $\log p_{\hat{s}_i|u}(\hat{s}_i|u_j)$ and $\log p_{s_i|u}(s_i|u_j)$ have been canceled because sources in $s_I$ do not depend on $u_j$; that is, $q(s_i, u_j) = 0$ if $i \in \{1, \ldots, n_I\}$. Taking the derivative of both sides of Eq. (36) w.r.t. $\hat{s}_k$, where $k \in \{1, \ldots, n_I\}$, we have
$\frac{\partial q(\hat{s}_D, u_j)}{\partial \hat{s}_k} = \sum_{i=n_I+1}^{n} \frac{\partial q(s_D, u_j)}{\partial s_i} \frac{\partial s_i}{\partial \hat{s}_k}$. (37)
Clearly, the LHS of Eq. (37) equals zero. By considering each $j \in \{1, \ldots, n_D\}$ for $u_j$, we have $n_D$ equations like Eq. (37), which constitute a linear system with an $n_D \times n_D$ coefficient matrix. According to Assumption i, the coefficient matrix of the linear system has full rank. Thus, the only solution of Eq. (37) is $\frac{\partial s_i}{\partial \hat{s}_k} = 0$ for $i \in \{n_I + 1, \ldots, n\}$ and $k \in \{1, \ldots, n_I\}$. As $h^{-1}(\cdot)$ is smooth, its Jacobian can be written as
$J_{h^{-1}}(\hat{s}) = \begin{bmatrix} A := \frac{\partial s_I}{\partial \hat{s}_I} & B := \frac{\partial s_I}{\partial \hat{s}_D} \\ C := \frac{\partial s_D}{\partial \hat{s}_I} & D := \frac{\partial s_D}{\partial \hat{s}_D} \end{bmatrix}$. (38)
Since $\frac{\partial s_j}{\partial \hat{s}_k} = 0$ for $j \in \{n_I + 1, \ldots, n\}$ and $k \in \{1, \ldots, n_I\}$, the entries in the submatrix $C$ must all be zero. Thus, the submatrix $D$ must be invertible; otherwise $J_{h^{-1}}(\hat{s})$ would not be invertible, which is a contradiction. Besides, based on Assumption ii, one can show that all entries in the submatrix $B$ are zero, following part of the proof of Theorem 4.2 in Kong et al. (2022) (Steps 1, 2, and 3). Therefore, $\hat{s}_D$ is an invertible transformation of $s_D$.

A.3 Proof of Theorem 4.2

Theorem 4.2. In addition to the assumptions in Thm. 4.1, suppose the following assumptions hold:

i. For each $i \in \{1, \ldots, n_I\}$, there exist $\{s^{(\ell)}\}_{\ell=1}^{|\mathcal{F}_{i,:n_I}|}$ and a matrix $T \in \mathcal{T}$ s.t. $\operatorname{span}\{J_f(s^{(\ell)})_{i,:n_I}\}_{\ell=1}^{|\mathcal{F}_{i,:n_I}|} = \mathbb{R}^{n_I}_{\mathcal{F}_{i,:n_I}}$ and $[J_f(s^{(\ell)})T]_{i,:n_I} \in \mathbb{R}^{n_I}_{\hat{\mathcal{F}}_{i,:n_I}}$.

ii. (Structural Sparsity) For all $k \in \{1, \ldots, n_I\}$, there exists $\mathcal{C}_k$ s.t. $\bigcap_{i \in \mathcal{C}_k} \mathcal{F}_{i,:n_I} = \{k\}$.

Then $s_I$ is identifiable up to an element-wise invertible transformation and a permutation.

Proof. Our goal here is to show that $\hat{s}_I$ is a composition of a permutation and a component-wise invertible transformation of the sources in $s_I$. By using the chain rule repeatedly, we have
$J_{\hat{f}}(\hat{s}) = J_{f \circ h^{-1}}(\hat{s}) = J_f(h^{-1}(\hat{s})) J_{h^{-1}}(\hat{s}) = J_f(s) J_{h^{-1}}(\hat{s})$. (39)
As $h^{-1}(\cdot)$ is smooth, its Jacobian can be written as
$J_{h^{-1}}(\hat{s}) = \begin{bmatrix} A := \frac{\partial s_I}{\partial \hat{s}_I} & B := \frac{\partial s_I}{\partial \hat{s}_D} \\ C := \frac{\partial s_D}{\partial \hat{s}_I} & D := \frac{\partial s_D}{\partial \hat{s}_D} \end{bmatrix}$. (40)
In the proof of Thm. 4.1, we have shown that all entries in the submatrix $C$ are zero. Then we have
$J_{\hat{f}}(\hat{s})_{:,:n_I} = J_f(s) J_{h^{-1}}(\hat{s})_{:,:n_I} \overset{(*)}{=} J_f(s)_{:,:n_I} J_{h^{-1}}(\hat{s})_{:n_I,:n_I}$, (41)
where Eq. $(*)$ follows directly from the result that all entries in $C$, i.e., those in $J_{h^{-1}}(\hat{s})_{n_I+1:,:n_I}$, are zero. Moreover, in the proof of Thm. 4.1, we have also shown that all entries in the submatrix $B$ are zero. Let $D(s)$ represent a diagonal matrix and $P$ represent a permutation matrix. Thus, our goal is equivalent to showing that $J_h(s)_{:n_I,:n_I} = P_I D_I(s)$, or $J_{h^{-1}}(\hat{s})_{:n_I,:n_I} = [J_h(s)^{-1}]_{:n_I,:n_I} = D_I(s)^{-1} P_I^{-1}$. Then we need to prove that
$J_{\hat{f}}(\hat{s})_{:,:n_I} = J_f(s)_{:,:n_I} D_I(s)^{-1} P_I^{-1}$. (42)
Additionally, we have
$J_{\hat{f}}(\hat{s})_{:,:n_I} = J_f(s)_{:,:n_I} T(s)$, (43)
where $T(s) \in \mathbb{R}^{n_I \times n_I}$ is a square matrix. Note that we have denoted $\mathcal{F}$ as the support of $J_f(s)$, $\hat{\mathcal{F}}$ as the support of $J_{\hat{f}}(\hat{s})$, and $\mathcal{T}$ as the support of $T(s)$; besides, we have also denoted $T$ as a matrix with the same support as $\mathcal{T}$. According to Assumption i, we have
$\operatorname{span}\{J_f(s^{(\ell)})_{i,:n_I}\}_{\ell=1}^{|\mathcal{F}_{i,:n_I}|} = \mathbb{R}^{n_I}_{\mathcal{F}_{i,:n_I}}$. (44)
Since $\{J_f(s^{(\ell)})_{i,:n_I}\}_{\ell=1}^{|\mathcal{F}_{i,:n_I}|}$ forms a basis of $\mathbb{R}^{n_I}_{\mathcal{F}_{i,:n_I}}$, for any $j_0 \in \mathcal{F}_{i,:n_I}$ we are able to rewrite the one-hot vector $e_{j_0} \in \mathbb{R}^{n_I}_{\mathcal{F}_{i,:n_I}}$ as
$e_{j_0} = \sum_{\ell=1}^{|\mathcal{F}_{i,:n_I}|} \alpha_\ell J_f(s^{(\ell)})_{i,:n_I}$, (45)
where $\alpha_\ell$ is the corresponding coefficient. Then
$T_{j_0,:} = e_{j_0} T = \sum_{\ell=1}^{|\mathcal{F}_{i,:n_I}|} \alpha_\ell J_f(s^{(\ell)})_{i,:n_I} T$. (46)
According to Assumption i, each term in the above summation belongs to the space $\mathbb{R}^{n_I}_{\hat{\mathcal{F}}_{i,:n_I}}$. Therefore, $T_{j_0,:}$ itself resides in $\mathbb{R}^{n_I}_{\hat{\mathcal{F}}_{i,:n_I}}$. Thus
$\forall j \in \mathcal{F}_{i,:n_I}, \quad T_{j,:} \in \mathbb{R}^{n_I}_{\hat{\mathcal{F}}_{i,:n_I}}$. (47)
Then the connections between these supports can be established according to Defn. 2.3:
$\forall (i,j) \in \mathcal{F}_{:,:n_I}, \quad \{i\} \times \mathcal{T}_{j,:} \subseteq \hat{\mathcal{F}}_{:,:n_I}$. (48)
Since $J_f(s^{(\ell)})_{:,:n_I}$ and $J_{\hat{f}}(\hat{s}^{(\ell)})_{:,:n_I}$ have full column rank $n_I$, $T(s^{(\ell)})$ must have a non-zero determinant. Otherwise, the rank of $T(s^{(\ell)})$ would be less than $n_I$, which would imply the contradiction that $J_{\hat{f}}(\hat{s}^{(\ell)})_{:,:n_I} = J_f(s^{(\ell)})_{:,:n_I} T(s^{(\ell)})$ has column rank less than $n_I$. The determinant of $T(s^{(\ell)})$ can be represented by its Leibniz formula as
$\det(T(s^{(\ell)})) = \sum_{\sigma \in S_{n_I}} \operatorname{sgn}(\sigma) \prod_{i=1}^{n_I} T(s^{(\ell)})_{i,\sigma(i)}$, (49)
where $S_{n_I}$ is the set of $n_I$-permutations. Therefore, there is at least one term in the sum that is non-zero, i.e.,
$\exists \sigma \in S_{n_I}, \; \forall i \in \{1, \ldots, n_I\}, \; \operatorname{sgn}(\sigma) \prod_{i=1}^{n_I} T(s^{(\ell)})_{i,\sigma(i)} \neq 0$, (50)
which is equivalent to
$\exists \sigma \in S_{n_I}, \; \forall i \in \{1, \ldots, n_I\}, \; T(s^{(\ell)})_{i,\sigma(i)} \neq 0$. (51)
Then we can conclude that this $\sigma$ must be in the support of $T(s)$ since $s^{(\ell)} \in \mathcal{S}$. Therefore, it follows that
$\forall j \in \{1, \ldots, n_I\}, \quad \sigma(j) \in \mathcal{T}_{j,:}$. (52)
Together with Eq. (48), we have
$\forall (i,j) \in \mathcal{F}_{:,:n_I}, \quad (i, \sigma(j)) \in \{i\} \times \mathcal{T}_{j,:} \subseteq \hat{\mathcal{F}}_{:,:n_I}$. (53)
Denote
$\sigma(\mathcal{F}_{:,:n_I}) = \{(i, \sigma(j)) \mid (i,j) \in \mathcal{F}_{:,:n_I}\}$. (54)
Then we have
$\sigma(\mathcal{F}_{:,:n_I}) \subseteq \hat{\mathcal{F}}_{:,:n_I}$. (55)
Because of the sparsity regularization on the estimated Jacobian, we further have
$|\hat{\mathcal{F}}_{:,:n_I}| \leq |\mathcal{F}_{:,:n_I}| = |\sigma(\mathcal{F}_{:,:n_I})|$. (56)
Combined with Eq. (55), we have
$\sigma(\mathcal{F}_{:,:n_I}) = \hat{\mathcal{F}}_{:,:n_I}$. (57)
Suppose $T(s) \neq D_I(s)^{-1} P_I^{-1}$. Then there exist $j_1 \neq j_2$ such that
$\mathcal{T}_{j_1,:} \cap \mathcal{T}_{j_2,:} \neq \emptyset$. (58)
Additionally, consider $j_3 \in \{1, \ldots, n_I\}$ for which
$\sigma(j_3) \in \mathcal{T}_{j_1,:} \cap \mathcal{T}_{j_2,:}$. (59)
Since $j_1 \neq j_2$, we can assume $j_3 \neq j_1$ without loss of generality. Based on Assumption ii, there exists $\mathcal{C}_{j_1} \ni j_1$ such that $\bigcap_{i \in \mathcal{C}_{j_1}} \mathcal{F}_{i,:n_I} = \{j_1\}$. Because
$j_3 \notin \{j_1\} = \bigcap_{i \in \mathcal{C}_{j_1}} \mathcal{F}_{i,:n_I}$, (60)
there must exist $i_3 \in \mathcal{C}_{j_1}$ such that
$j_3 \notin \mathcal{F}_{i_3,:n_I}$. (61)
Since $j_1 \in \mathcal{F}_{i_3,:n_I}$, it follows that $(i_3, j_1) \in \mathcal{F}_{:,:n_I}$. Therefore, according to Eq. (48), we have
$\{i_3\} \times \mathcal{T}_{j_1,:} \subseteq \hat{\mathcal{F}}_{:,:n_I}$. (62)
Notice that $\sigma(j_3) \in \mathcal{T}_{j_1,:} \cap \mathcal{T}_{j_2,:}$ implies
$(i_3, \sigma(j_3)) \in \{i_3\} \times \mathcal{T}_{j_1,:}$. (63)
Then by Eqs. (62) and (63), we have
$(i_3, \sigma(j_3)) \in \hat{\mathcal{F}}_{:,:n_I}$. (64)
This further implies $(i_3, j_3) \in \mathcal{F}_{:,:n_I}$ by Eqs. (54) and (57), which contradicts Eq. (61). Thus, we have proven by contradiction that $T(s) = D_I(s)^{-1} P_I^{-1}$. By replacing $T(s)$ with $D_I(s)^{-1} P_I^{-1}$ in Eq. (43), we obtain Eq. (42), which is the goal.

A.4 Proof of Theorem 4.3

Theorem 4.3. Let the observed data be a large enough sample generated from an undercomplete nonlinear ICA model as defined in Eqs. (2) and (4). Suppose the following assumptions hold:

i. For each $i \in \{1, \ldots, n_I\}$, there exist $\{s^{(\ell)}\}_{\ell=1}^{|\mathcal{F}_{i,:n_I}|}$ and a matrix $T \in \mathcal{T}$ s.t. $\operatorname{span}\{J_f(s^{(\ell)})_{i,:n_I}\}_{\ell=1}^{|\mathcal{F}_{i,:n_I}|} = \mathbb{R}^{n_I}_{\mathcal{F}_{i,:n_I}}$ and $[J_f(s^{(\ell)})T]_{i,:n_I} \in \mathbb{R}^{n_I}_{\hat{\mathcal{F}}_{i,:n_I}}$.

ii. There exist $2n_D + 1$ values of $u$, i.e., $u_i$ with $i \in \{0, 1, \ldots, 2n_D\}$, s.t. the $2n_D$ vectors $w(s_D, u_i) - w(s_D, u_0)$ with $i \in \{1, \ldots, 2n_D\}$ are linearly independent, where the vector $w(s_D, u_i)$ is defined as follows:
$w(s_D, u_i) = \big(v(s_{c_1}, u_i), \ldots, v(s_{c_d}, u_i), v'(s_{c_1}, u_i), \ldots, v'(s_{c_d}, u_i)\big)$,
$v(s_{c_j}, u_i) = \Big(\frac{\partial \log p(s_{c_j}|u_i)}{\partial s_{c_j}^{(l)}}, \ldots, \frac{\partial \log p(s_{c_j}|u_i)}{\partial s_{c_j}^{(h)}}\Big)$,
$v'(s_{c_j}, u_i) = \Big(\frac{\partial^2 \log p(s_{c_j}|u_i)}{\partial (s_{c_j}^{(l)})^2}, \ldots, \frac{\partial^2 \log p(s_{c_j}|u_i)}{\partial (s_{c_j}^{(h)})^2}\Big)$.

iii. There exist $u_1, u_2 \in \mathcal{U}$, s.t., for any set $A_s \subseteq \mathcal{S}$ with nonzero probability measure that cannot be expressed as $B_{s_I} \times \mathcal{S}_D$ for any $B_{s_I} \subseteq \mathcal{S}_I$, we have
$\int_{s \in A_s} p_{s|u}(s \mid u_1)\, ds \neq \int_{s \in A_s} p_{s|u}(s \mid u_2)\, ds$.

iv. (Structural Sparsity) For all $k \in \{1, \ldots, n_I\}$, there exists $\mathcal{C}_k$ s.t. $\bigcap_{i \in \mathcal{C}_k} \mathcal{F}_{i,:n_I} = \{k\}$.

Then $s_I$ is identifiable up to an element-wise invertible transformation and a permutation, and $s_D$ is identifiable up to a subspace-wise invertible transformation and a subspace-wise permutation.

Proof. Let $h : s \to \hat{s}$ denote the transformation between the true and estimated sources. By using the chain rule repeatedly, we have
$J_f(s) = J_{\hat{f} \circ h}(s) = J_{\hat{f}}(\hat{s}) J_h(s)$. (65)
Because $J_{\hat{f}}(\hat{s})$ and $J_f(s)$ have full column rank, $J_h(s)$ must be invertible and have a non-zero determinant; otherwise, one of them would not be of full column rank, which leads to a contradiction. Applying the change-of-variables rule, we have
$p_{s|u}(s|u)\, |\det(J_{h^{-1}}(\hat{s}))| = p_{\hat{s}|u}(\hat{s}|u)$. (66)
Taking the logarithm on both sides yields
$\log p_{s|u}(s|u) + \log|\det(J_{h^{-1}}(\hat{s}))| = \log p_{\hat{s}|u}(\hat{s}|u)$. (67)
Note that, according to the model defined in Eqs. (2) and (4), the joint densities can be factorized as
$p_{s|u}(s|u) = \prod_{i=1}^{n_I} p_{s_i}(s_i) \prod_{j=c_1}^{c_d} p_{s_j|u}(s_j|u), \qquad p_{\hat{s}|u}(\hat{s}|u) = \prod_{i=1}^{n_I} p_{\hat{s}_i}(\hat{s}_i) \prod_{j=c_1}^{c_d} p_{\hat{s}_j|u}(\hat{s}_j|u)$. (68)
Together with Eq. (67), we have
$\sum_{i=1}^{n_I} \log p_{s_i}(s_i) + \sum_{j=c_1}^{c_d} \log p_{s_j|u}(s_j|u) + \log|\det(J_{h^{-1}}(\hat{s}))| = \sum_{i=1}^{n_I} \log p_{\hat{s}_i}(\hat{s}_i) + \sum_{j=c_1}^{c_d} \log p_{\hat{s}_j|u}(\hat{s}_j|u)$. (69)
Therefore, for $u = u_0, \ldots, u_{2n_D}$, we have $2n_D + 1$ such equations. Subtracting each equation corresponding to $u_1, \ldots, u_{2n_D}$ from the equation corresponding to $u_0$ results in $2n_D$ equations (the terms for the invariant sources and the log-determinant cancel):
$\sum_{i=c_1}^{c_d} \big(\log p_{s_i|u}(s_i|u_j) - \log p_{s_i|u}(s_i|u_0)\big) = \sum_{i=c_1}^{c_d} \big(\log p_{\hat{s}_i|u}(\hat{s}_i|u_j) - \log p_{\hat{s}_i|u}(\hat{s}_i|u_0)\big)$. (70)
Then we take the derivatives of both sides of Eq. (70) w.r.t. $\hat{s}_k$ and $\hat{s}_v$, where $k, v \in \{1, \ldots, n\}$ and $k \neq v$; besides, if both $k > n_I$ and $v > n_I$, then they are not indices of the same subspace. It is clear that the RHS of Eq. (70) then equals zero. For the $i$-th term of the summation on the LHS, we have the following expression after taking the derivatives:
$\sum_{l=i_l}^{i_h} \Big[\Big(\frac{\partial^2 \log p_{s_i|u}(s_i|u_j)}{(\partial s_l)^2} - \frac{\partial^2 \log p_{s_i|u}(s_i|u_0)}{(\partial s_l)^2}\Big) \frac{\partial s_l}{\partial \hat{s}_k} \frac{\partial s_l}{\partial \hat{s}_v} + \Big(\frac{\partial \log p_{s_i|u}(s_i|u_j)}{\partial s_l} - \frac{\partial \log p_{s_i|u}(s_i|u_0)}{\partial s_l}\Big) \frac{\partial^2 s_l}{\partial \hat{s}_k \partial \hat{s}_v}\Big]$, (71)
where $i_l$ and $i_h$ correspond to the minimum and maximum indices of the elements in $s_i = (s_{i_l}, \ldots, s_{i_h})$. By iterating $i$ from $c_1$ to $c_d$, we are also iterating $l$ from $n_I + 1$ to $n$. Thus, by considering all these equations as well as iterating $u_j$ over $j \in \{1, \ldots, 2n_D\}$, we have a linear system with a $2n_D \times 2n_D$ coefficient matrix. Then, according to Assumption ii, the coefficient matrix of the linear system has full rank. Thus, the only solution of Eq. (71) is $\frac{\partial s_l}{\partial \hat{s}_k} \frac{\partial s_l}{\partial \hat{s}_v} = 0$ and $\frac{\partial^2 s_l}{\partial \hat{s}_k \partial \hat{s}_v} = 0$.
Note that $k \neq v$ and, if both $k > n_I$ and $v > n_I$, they are not indices of sources in the same subspace. Besides, $\frac{\partial s_l}{\partial \hat{s}_k} \frac{\partial s_l}{\partial \hat{s}_v} = 0$ indicates that it is impossible for both $\frac{\partial s_l}{\partial \hat{s}_k}$ and $\frac{\partial s_l}{\partial \hat{s}_v}$ to be non-zero. Furthermore, because $h$ is invertible, they cannot both be zero, indicating that exactly one of $\frac{\partial s_l}{\partial \hat{s}_k}$ and $\frac{\partial s_l}{\partial \hat{s}_v}$ is zero. Then, because $s_I$ has an invariant distribution w.r.t. $u$, if $\frac{\partial s_l}{\partial \hat{s}_k} \neq 0$ (i.e., $\frac{\partial s_l}{\partial \hat{s}_v} = 0$), then $k \notin \{1, \ldots, n_I\}$; otherwise, $s_l$ would also be invariant, which is a contradiction. Therefore, $k$ can only be the index of an estimated source from one independent subspace, which, together with the invertibility, leads to the conclusion that $s_D$ is a composition of an invertible subspace-wise transformation and a subspace-wise permutation of $\hat{s}_D$. So is the mapping from $\hat{s}_D$ to $s_D$, since the subspace-wise transformation is invertible and the inverse of a block-wise permutation matrix is still a block-wise invertible matrix.
Now that we have shown the identifiability result for $s_D = (s_{n_I+1}, \ldots, s_n)$, we need to show it for the remaining sources $s_I = (s_1, \ldots, s_{n_I})$. By using the chain rule repeatedly, we have
$J_{\hat{f}}(\hat{s}) = J_{f \circ h^{-1}}(\hat{s}) = J_f(h^{-1}(\hat{s})) J_{h^{-1}}(\hat{s}) = J_f(s) J_{h^{-1}}(\hat{s})$. (72)
Since we have shown that $\frac{\partial s_l}{\partial \hat{s}_k} = 0$ for every $l \in \{n_I + 1, \ldots, n\}$ and $k \in \{1, \ldots, n_I\}$, all entries of $J_{h^{-1}}(\hat{s})_{n_I+1:n,:n_I}$ must be zero. Then we have
$J_{\hat{f}}(\hat{s})_{:,:n_I} = J_f(s) J_{h^{-1}}(\hat{s})_{:,:n_I} \overset{(*)}{=} J_f(s)_{:,:n_I} J_{h^{-1}}(\hat{s})_{:n_I,:n_I}$, (73)
where Eq. $(*)$ follows directly from the result that all entries in $J_{h^{-1}}(\hat{s})_{n_I+1:,:n_I}$ are zero. Based on the proof of Theorem 4.2 in Kong et al. (2022) (particularly Steps 1, 2, and 3) and Assumption iii, one can show that $\hat{s}_I$ does not depend on $s_D$. Let $D(s)$ represent a diagonal matrix and $P$ represent a permutation matrix. Thus, our goal is equivalent to showing that $J_h(s)_{:n_I,:n_I} = P_I D_I(s)$, or $J_{h^{-1}}(\hat{s})_{:n_I,:n_I} = [J_h(s)^{-1}]_{:n_I,:n_I} = D_I(s)^{-1} P_I^{-1}$. Then we need to prove that
$J_{\hat{f}}(\hat{s})_{:,:n_I} = J_f(s)_{:,:n_I} D_I(s)^{-1} P_I^{-1}$. (74)
Additionally, we have
$J_{\hat{f}}(\hat{s})_{:,:n_I} = J_f(s)_{:,:n_I} T(s)$, (75)
where $T(s) \in \mathbb{R}^{n_I \times n_I}$ is a square matrix. Note that we have denoted $\mathcal{F}$ as the support of $J_f(s)$, $\hat{\mathcal{F}}$ as the support of $J_{\hat{f}}(\hat{s})$, and $\mathcal{T}$ as the support of $T(s)$; besides, we have also denoted $T$ as a matrix with the same support as $\mathcal{T}$.
According to Assumption i, we have
$\operatorname{span}\{J_f(s^{(\ell)})_{i,:n_I}\}_{\ell=1}^{|\mathcal{F}_{i,:n_I}|} = \mathbb{R}^{n_I}_{\mathcal{F}_{i,:n_I}}$. (76)
Since $\{J_f(s^{(\ell)})_{i,:n_I}\}_{\ell=1}^{|\mathcal{F}_{i,:n_I}|}$ forms a basis of $\mathbb{R}^{n_I}_{\mathcal{F}_{i,:n_I}}$, for any $j_0 \in \mathcal{F}_{i,:n_I}$ we are able to rewrite the one-hot vector $e_{j_0} \in \mathbb{R}^{n_I}_{\mathcal{F}_{i,:n_I}}$ as
$e_{j_0} = \sum_{\ell=1}^{|\mathcal{F}_{i,:n_I}|} \alpha_\ell J_f(s^{(\ell)})_{i,:n_I}$, (77)
where $\alpha_\ell$ is the corresponding coefficient. Then
$T_{j_0,:} = e_{j_0} T = \sum_{\ell=1}^{|\mathcal{F}_{i,:n_I}|} \alpha_\ell J_f(s^{(\ell)})_{i,:n_I} T \in \mathbb{R}^{n_I}_{\hat{\mathcal{F}}_{i,:n_I}}$, (78)
where the final step follows from Assumption i, i.e., each element in the summation belongs to $\mathbb{R}^{n_I}_{\hat{\mathcal{F}}_{i,:n_I}}$. Thus
$\forall j \in \mathcal{F}_{i,:n_I}, \quad T_{j,:} \in \mathbb{R}^{n_I}_{\hat{\mathcal{F}}_{i,:n_I}}$. (79)
Then the connections between these supports can be established according to Defn. 2.3:
$\forall (i,j) \in \mathcal{F}_{:,:n_I}, \quad \{i\} \times \mathcal{T}_{j,:} \subseteq \hat{\mathcal{F}}_{:,:n_I}$. (80)
Since $J_f(s^{(\ell)})_{:,:n_I}$ and $J_{\hat{f}}(\hat{s}^{(\ell)})_{:,:n_I}$ have full column rank $n_I$, $T(s^{(\ell)})$ must have a non-zero determinant. Otherwise, the rank of $T(s^{(\ell)})$ would be less than $n_I$, which would imply the contradiction that $J_{\hat{f}}(\hat{s}^{(\ell)})_{:,:n_I} = J_f(s^{(\ell)})_{:,:n_I} T(s^{(\ell)})$ has column rank less than $n_I$. The determinant of $T(s^{(\ell)})$ can be represented by its Leibniz formula as
$\det(T(s^{(\ell)})) = \sum_{\sigma \in S_{n_I}} \operatorname{sgn}(\sigma) \prod_{i=1}^{n_I} T(s^{(\ell)})_{i,\sigma(i)}$, (81)
where $S_{n_I}$ is the set of $n_I$-permutations. Thus, there is at least one non-zero term in the sum, i.e.,
$\exists \sigma \in S_{n_I}, \; \forall i \in \{1, \ldots, n_I\}, \; \operatorname{sgn}(\sigma) \prod_{i=1}^{n_I} T(s^{(\ell)})_{i,\sigma(i)} \neq 0$, (82)
which is equivalent to
$\exists \sigma \in S_{n_I}, \; \forall i \in \{1, \ldots, n_I\}, \; T(s^{(\ell)})_{i,\sigma(i)} \neq 0$. (83)
Then we can see that this $\sigma$ must be in the support of $T(s)$ since $s^{(\ell)} \in \mathcal{S}$. It follows that
$\forall j \in \{1, \ldots, n_I\}, \quad \sigma(j) \in \mathcal{T}_{j,:}$. (84)
Together with Eq. (80), we have
$\forall (i,j) \in \mathcal{F}_{:,:n_I}, \quad (i, \sigma(j)) \in \{i\} \times \mathcal{T}_{j,:} \subseteq \hat{\mathcal{F}}_{:,:n_I}$. (85)
Denote
$\sigma(\mathcal{F}_{:,:n_I}) = \{(i, \sigma(j)) \mid (i,j) \in \mathcal{F}_{:,:n_I}\}$. (86)
Then we have
$\sigma(\mathcal{F}_{:,:n_I}) \subseteq \hat{\mathcal{F}}_{:,:n_I}$. (87)
Because of the sparsity regularization on the estimated Jacobian, we further have
$|\hat{\mathcal{F}}_{:,:n_I}| \leq |\mathcal{F}_{:,:n_I}| = |\sigma(\mathcal{F}_{:,:n_I})|$. (88)
Together with Eq. (87), we have
$\sigma(\mathcal{F}_{:,:n_I}) = \hat{\mathcal{F}}_{:,:n_I}$. (89)
Suppose $T(s) \neq D_I(s)^{-1} P_I^{-1}$. Then there exist $j_1 \neq j_2$ such that
$\mathcal{T}_{j_1,:} \cap \mathcal{T}_{j_2,:} \neq \emptyset$. (90)
Additionally, consider $j_3 \in \{1, \ldots, n_I\}$ for which
$\sigma(j_3) \in \mathcal{T}_{j_1,:} \cap \mathcal{T}_{j_2,:}$. (91)
Since $j_1 \neq j_2$, we can assume $j_3 \neq j_1$ without loss of generality. Based on Assumption iv, there exists $\mathcal{C}_{j_1} \ni j_1$ such that $\bigcap_{i \in \mathcal{C}_{j_1}} \mathcal{F}_{i,:n_I} = \{j_1\}$. Because
$j_3 \notin \{j_1\} = \bigcap_{i \in \mathcal{C}_{j_1}} \mathcal{F}_{i,:n_I}$, (92)
there must exist $i_3 \in \mathcal{C}_{j_1}$ such that
$j_3 \notin \mathcal{F}_{i_3,:n_I}$. (93)
Since $j_1 \in \mathcal{F}_{i_3,:n_I}$, we have $(i_3, j_1) \in \mathcal{F}_{:,:n_I}$. Therefore, according to Eq. (80), we have
$\{i_3\} \times \mathcal{T}_{j_1,:} \subseteq \hat{\mathcal{F}}_{:,:n_I}$. (94)
Notice that $\sigma(j_3) \in \mathcal{T}_{j_1,:} \cap \mathcal{T}_{j_2,:}$ implies
$(i_3, \sigma(j_3)) \in \{i_3\} \times \mathcal{T}_{j_1,:}$. (95)
Then by Eqs. (94) and (95), we have
$(i_3, \sigma(j_3)) \in \hat{\mathcal{F}}_{:,:n_I}$, (96)
which implies $(i_3, j_3) \in \mathcal{F}_{:,:n_I}$ by Eqs. (86) and (89), therefore contradicting Eq. (93). Thus, we have proven by contradiction that $T(s) = D_I(s)^{-1} P_I^{-1}$. By replacing $T(s)$ with $D_I(s)^{-1} P_I^{-1}$ in Eq. (75), we obtain Eq. (74), which is the goal. Therefore, $s_I$ is identifiable up to a composition of a component-wise invertible transformation and a permutation, and $s_D$ is identifiable up to a composition of a subspace-wise invertible transformation and a subspace-wise permutation.

A.5 Proof of Theorem 4.4

Theorem 4.4. Let the observed data be a large enough sample generated from an undercomplete nonlinear ICA model as defined in Eqs. (2) and (5). Suppose the following assumptions hold:

i. For each $i \in \{1, \ldots, n_I\}$, there exist $\{s^{(\ell)}\}_{\ell=1}^{|\mathcal{F}_{i,:n_I}|}$ and a matrix $T \in \mathcal{T}$ s.t. $\operatorname{span}\{J_f(s^{(\ell)})_{i,:n_I}\}_{\ell=1}^{|\mathcal{F}_{i,:n_I}|} = \mathbb{R}^{n_I}_{\mathcal{F}_{i,:n_I}}$ and $[J_f(s^{(\ell)})T]_{i,:n_I} \in \mathbb{R}^{n_I}_{\hat{\mathcal{F}}_{i,:n_I}}$.

ii. There exist $2n_D + 1$ values of $u$, i.e., $u_i$ with $i \in \{0, 1, \ldots, 2n_D\}$, s.t. the $2n_D$ vectors $w(s_D, u_i) - w(s_D, u_0)$ with $i \in \{1, \ldots, 2n_D\}$ are linearly independent, where the vector $w(s_D, u_i)$ is defined as follows:
$w(s_D, u_i) = \big(v(s_D, u_i), v'(s_D, u_i)\big)$,
$v(s_D, u_i) = \Big(\frac{\partial \log p(s_{n_I+1}|u_i)}{\partial s_{n_I+1}}, \ldots, \frac{\partial \log p(s_n|u_i)}{\partial s_n}\Big)$,
$v'(s_D, u_i) = \Big(\frac{\partial^2 \log p(s_{n_I+1}|u_i)}{(\partial s_{n_I+1})^2}, \ldots, \frac{\partial^2 \log p(s_n|u_i)}{(\partial s_n)^2}\Big)$.

iii. There exist $u_1, u_2 \in \mathcal{U}$, s.t., for any set $A_s \subseteq \mathcal{S}$ with nonzero probability measure that cannot be expressed as $B_{s_I} \times \mathcal{S}_D$ for any $B_{s_I} \subseteq \mathcal{S}_I$, we have
$\int_{s \in A_s} p_{s|u}(s \mid u_1)\, ds \neq \int_{s \in A_s} p_{s|u}(s \mid u_2)\, ds$.

iv. (Structural Sparsity) For all $k \in \{1, \ldots, n_I\}$, there exists $\mathcal{C}_k$ s.t. $\bigcap_{i \in \mathcal{C}_k} \mathcal{F}_{i,:n_I} = \{k\}$.

Then $s$ is identifiable up to an element-wise invertible transformation and a permutation.

Proof. Let $h : s \to \hat{s}$ denote the transformation between the true and estimated sources. By using the chain rule repeatedly, we have
$J_f(s) = J_{\hat{f} \circ h}(s) = J_{\hat{f}}(\hat{s}) J_h(s)$. (97)
Because $J_{\hat{f}}(\hat{s})$ and $J_f(s)$ have full column rank, $J_h(s)$ must be invertible and have a non-zero determinant; otherwise, one of them would not be of full column rank, which leads to a contradiction. Applying the change-of-variables rule, we have
$p_{s|u}(s|u)\, |\det(J_{h^{-1}}(\hat{s}))| = p_{\hat{s}|u}(\hat{s}|u)$. (98)
Taking the logarithm on both sides yields
$\log p_{s|u}(s|u) + \log|\det(J_{h^{-1}}(\hat{s}))| = \log p_{\hat{s}|u}(\hat{s}|u)$. (99)
Note that, according to the model defined in Eqs. (2) and (5), the joint densities can be factorized as
$p_{s|u}(s|u) = \prod_{i=1}^{n_I} p_{s_i}(s_i) \prod_{j=n_I+1}^{n} p_{s_j|u}(s_j|u) = \prod_{i=1}^{n} p_{s_i|u}(s_i|u), \qquad p_{\hat{s}|u}(\hat{s}|u) = \prod_{i=1}^{n_I} p_{\hat{s}_i}(\hat{s}_i) \prod_{j=n_I+1}^{n} p_{\hat{s}_j|u}(\hat{s}_j|u) = \prod_{i=1}^{n} p_{\hat{s}_i|u}(\hat{s}_i|u)$. (100)
Together with Eq. (99), we have
$\sum_{i=1}^{n} \log p_{s_i|u}(s_i|u) + \log|\det(J_{h^{-1}}(\hat{s}))| = \sum_{i=1}^{n} \log p_{\hat{s}_i|u}(\hat{s}_i|u)$. (101)
Then we take the derivatives of both sides of Eq. (101) w.r.t. $\hat{s}_k$ and $\hat{s}_v$, where $k, v \in \{1, \ldots, n\}$ and $k \neq v$. For brevity, we first define the following terms:
$h'_{i,(k)} := \frac{\partial s_i}{\partial \hat{s}_k}$, (102)
$h''_{i,(k,v)} := \frac{\partial^2 s_i}{\partial \hat{s}_k \partial \hat{s}_v}$, (103)
$\eta'_i(s_i, u) := \frac{\partial \log p_{s_i|u}(s_i|u)}{\partial s_i}$, (104)
$\eta''_i(s_i, u) := \frac{\partial^2 \log p_{s_i|u}(s_i|u)}{(\partial s_i)^2}$. (105)
Then we have
$\sum_{i=1}^{n} \Big(\eta''_i(s_i, u)\, h'_{i,(k)} h'_{i,(v)} + \eta'_i(s_i, u)\, h''_{i,(k,v)}\Big) + \frac{\partial^2 \log|\det(J_{h^{-1}}(\hat{s}))|}{\partial \hat{s}_k \partial \hat{s}_v} = 0$. (106)
Therefore, for $u = u_0, \ldots, u_{2n_D}$, we have $2n_D + 1$ such equations. Subtracting each equation corresponding to $u_1, \ldots, u_{2n_D}$ from the equation corresponding to $u_0$ results in $2n_D$ equations:
$\sum_{i=1}^{n} \Big[\big(\eta''_i(s_i, u_j) - \eta''_i(s_i, u_0)\big) h'_{i,(k)} h'_{i,(v)}$ (107)
$\quad + \big(\eta'_i(s_i, u_j) - \eta'_i(s_i, u_0)\big) h''_{i,(k,v)}\Big] = 0$, (108)
where $j = 1, \ldots, 2n_D$. Note that for $i \in \{1, \ldots, n_I\}$, $s_i$ does not depend on $u$. Thus, we have $\eta''_i(s_i, u_j) = \eta''_i(s_i, u_{j'})$ and $\eta'_i(s_i, u_j) = \eta'_i(s_i, u_{j'})$ for all $j, j'$. Hence only sources dependent on $u$ (i.e., $s_D$) remain in Eq. (107). By considering each $j \in \{1, \ldots, 2n_D\}$ for $u_j$, we have $2n_D$ equations like Eq. (107), which constitute a linear system with a $2n_D \times 2n_D$ coefficient matrix. According to Assumption ii, the coefficient matrix of the linear system has full rank. Thus, the only solution of Eq. (107) is $h'_{i,(k)} h'_{i,(v)} = 0$ and $h''_{i,(k,v)} = 0$ for $i \in \{n_I + 1, \ldots, n\}$ and $k, v \in \{1, \ldots, n\}$, $k \neq v$. As $h^{-1}(\cdot)$ is smooth, its Jacobian can be written as
$J_{h^{-1}}(\hat{s}) = \begin{bmatrix} A := \frac{\partial s_I}{\partial \hat{s}_I} & B := \frac{\partial s_I}{\partial \hat{s}_D} \\ C := \frac{\partial s_D}{\partial \hat{s}_I} & D := \frac{\partial s_D}{\partial \hat{s}_D} \end{bmatrix}$. (109)
Because $h'_{i,(k)} h'_{i,(v)} = 0$ for all $k, v \in \{1, \ldots, n\}$ with $k \neq v$, for each $i \in \{n_I + 1, \ldots, n\}$ there is at most one index $r \in \{1, \ldots, n\}$ s.t. $h'_{i,(r)} \neq 0$. Therefore, there is at most one non-zero entry in each row indexed by $i \in \{n_I + 1, \ldots, n\}$ of the Jacobian matrix $J_{h^{-1}}(\hat{s})$. Further, the invertibility of $h^{-1}(\cdot)$ necessitates $J_{h^{-1}}$ to be full-rank, which implies that there is exactly one non-zero component in each such row, i.e., in each row of the sub-matrices $C$ and $D$. Suppose this non-zero component lies in the sub-matrix $C$; then the corresponding source in $s_D$ would not depend on $u$, which is a contradiction. Thus, the only non-zero component must lie in $D$.
Because Jh is of full-rank and C is a zero sub-matrix, D must have full rank. Hence, ˆs D must be a composition of a component-wise invertible transformation and a permutation of s D. Moreover, according to part of the proof of Theorem 4.2 in Kong et al. (2022) (Steps 1, 2, and 3), the submatrix B is zero if Assumption iii holds. Now we have shown that, for the matrix Jh 1, its submatrix D is a generalized permutation matrix and both submatrices B and C are zero-matrices). Because h 1 is smooth and invertible, the sub-matrices of the corresponding positions of Jh have the same properties. Then, we need to show the identifiability for the remaining sources s I = (s1, . . . , sn I), i.e., ˆs I is a permutation with component-wise invertible transformation of s I. Let D(s) represents a diagonal matrix and P represent a permutation matrix. Thus, our goal is equivalent to show that Jh(s):n I,:n I = PIDI(s) or Jh 1(ˆs):n I,:n I = Jh 1(ˆs):n I,:n I = Jh(s) 1 :n I,:n I = DI(s) 1PI 1. By using chain rule repeatedly, we have Jˆf(ˆs) = Jf h 1(ˆs) = Jf(h 1(ˆs))Jh 1(ˆs) = Jf(s)Jh 1(ˆs). (110) Then we have Jˆf(ˆs):,:n I = Jf(s)Jh 1(ˆs):,:n I ( ) = Jf(s):,:n IJh 1(ˆs):n I,:n I, (111) where Eq. ( ) is directly from the result that all entries in C, i.e., those in Jh 1(ˆs)n I+1:,:n I, are zero. Therefore our goal is equivalent to show that Jˆf(ˆs):,:n I = Jf(s):,:n IDI(s) 1PI 1. (112) Additionally, we have Jˆf(ˆs):,:n I = Jf(s):,:n IT(s), (113) where T(s) Rn I n I is a square matrix. Note that we have denoted F as the support of Jf(s), ˆF as the support of Jˆf(ˆs) and T as the support of T(s). Besides, we have also denoted T as a matrix with the same support of T . According to Assumption i, we have span{Jf(s(ℓ))i,:n I} |Fi,:n I | ℓ=1 = Rn Fi,:n I . (114) Since {Jf(s(ℓ))i,:n I} |Fi,:n I | ℓ=1 forms a basis of Rn I Fi,:n I , for any j0 Fi,:n I, we are able to rewrite the one-hot vector ej0 Rn I Fi,:n I as ℓ Fi,:n I αℓJf(s(ℓ))i,:n I, (115) where αℓis the corresponding coefficient. Then Tj0,: = ej0T = X ℓ Fi,:n I αℓJf(s(ℓ))i,:n IT Rn I ˆ Fi,:n I , (116) where the final follows from Assumption i that each element in the summation belongs to Rn I ˆ Fi,:n I . Thus j Fi,:n I, Tj,: Rn I ˆ Fi,:n I . (117) Then the connections between these supports can be established according to Defn. 2.3 (i, j) F:,:n I, {i} Tj,: ˆF:,:n I. (118) Since Jf(s(ℓ)):,:n I and Jˆf(ˆs(ℓ)):,:n I have full column rank n I, T(s(ℓ)) must have a non-zero determinant. Otherwise, it would follow that the rank of T(s(ℓ)) is less than n I, which would imply a contradiction that Jˆf(ˆs(ℓ)):,:n I = Jf(s(ℓ)):,:n IT(s(ℓ)) has a column rank less than n I. The determinant of the matrix T(s(ℓ)) can be represented as its Leibniz formula as det(T(s(ℓ))) = X i=1 T(s(ℓ))i,σ(i) where Sn I is the set of n I-permutations. Thus, there is at least one term in the sum that is non-zero, i.e., σ Sn I, i {1, . . . , n I}, sgn(σ) i=1 T(s(ℓ))i,σ(i) = 0, (120) which is equivalent to σ Sn I, i {1, . . . , n I}, T(s(ℓ))i,σ(i) = 0. (121) Then we can conclude that this σ is in the support of T(s) since s(ℓ) s. Therefore, it follows that j {1, . . . , n I}, σ(j) Tj,:. (122) Combined with Eq. (118), we have (i, j) F:,:n I, (i, σ(j)) {i} Tj,: ˆF:,:n I. (123) Denote σ(F:,:n I) = {(i, σ(j)) | (i, j) F:,:n I}. (124) Then we have σ(F:,:n I) ˆF:,:n I. (125) Because of the sparsity regularization on the estimated Jacobian, we further have | ˆF:,:n I| |F:,:n I| = |σ(F:,:n I)|. (126) Together with Eq. (125), it follows that σ(F:,:n I) = ˆF:,:n I. 
Suppose $T(s) \neq D_I(s)^{-1} P_I^{-1}$. Then there exist $j_1 \neq j_2$ such that
$\mathcal{T}_{j_1,:} \cap \mathcal{T}_{j_2,:} \neq \emptyset$. (128)
Additionally, consider $j_3 \in \{1, \ldots, n_I\}$ for which
$\sigma(j_3) \in \mathcal{T}_{j_1,:} \cap \mathcal{T}_{j_2,:}$. (129)
Since $j_1 \neq j_2$, we can assume $j_3 \neq j_1$ without loss of generality. Based on Assumption iv, there exists $\mathcal{C}_{j_1} \ni j_1$ such that $\bigcap_{i \in \mathcal{C}_{j_1}} \mathcal{F}_{i,:n_I} = \{j_1\}$. Because
$j_3 \notin \{j_1\} = \bigcap_{i \in \mathcal{C}_{j_1}} \mathcal{F}_{i,:n_I}$, (130)
there must exist $i_3 \in \mathcal{C}_{j_1}$ such that
$j_3 \notin \mathcal{F}_{i_3,:n_I}$. (131)
Since $j_1 \in \mathcal{F}_{i_3,:n_I}$, we have $(i_3, j_1) \in \mathcal{F}_{:,:n_I}$. Therefore, according to Eq. (118), we have
$\{i_3\} \times \mathcal{T}_{j_1,:} \subseteq \hat{\mathcal{F}}_{:,:n_I}$. (132)
Note that $\sigma(j_3) \in \mathcal{T}_{j_1,:} \cap \mathcal{T}_{j_2,:}$ implies
$(i_3, \sigma(j_3)) \in \{i_3\} \times \mathcal{T}_{j_1,:}$. (133)
Then by Eqs. (132) and (133), we have
$(i_3, \sigma(j_3)) \in \hat{\mathcal{F}}_{:,:n_I}$. (134)
This implies $(i_3, j_3) \in \mathcal{F}_{:,:n_I}$ by Eqs. (124) and (127), which contradicts Eq. (131). Therefore, we have proven by contradiction that $T(s) = D_I(s)^{-1} P_I^{-1}$. By replacing $T(s)$ with $D_I(s)^{-1} P_I^{-1}$ in Eq. (113), we obtain Eq. (112), which is the goal.

B Experiments

In this section, we describe the experimental settings as well as some additional results.

B.1 Supplementary experimental settings

To produce observational data that meets the required assumptions of the different models, we simulate the sources and the mixing process as follows:

UCSS. To ensure that the true nonlinear mixing process adheres to the Structural Sparsity condition (Assumption ii in Thm. 3.1), as in previous work (Zheng et al., 2022), we generate the observed variables in a structured way: each observed variable is a nonlinear mixture of only its direct ancestors. For instance, if the observed variable x1 has parents s1 and s2, then x1 = f1(s1, s2). We use Generative Flow (GLOW) (Kingma and Dhariwal, 2018) with a projection layer as the nonlinear function fi. The difference between GLOW and GIN is that GLOW does not impose a constraint on the determinant of the Jacobian and is thus more suitable as a general nonlinear function, since it carries less inductive bias. The implementation of GLOW is part of FrEIA¹ (Ardizzone et al., 2018–2022). The ground-truth sources are sampled from a multivariate Gaussian with zero means and variances sampled from a uniform distribution on [0.5, 3], the same values as in previous works (Khemakhem et al., 2020a; Sorrenson et al., 2020; Zheng et al., 2022). It is worth noting that we sample sources from a single multivariate Gaussian so that all sources are marginally independent, unlike most previous works, which assume conditional independence given auxiliary variables.

Mixed. For the Mixed model, we partition the sources into s_I and s_D. Sources in s_I are sampled in the same way as for UCSS. Sources in s_D are sampled from 2n_D + 1 multivariate Gaussian distributions, as required by Assumption ii in Thm. 4.4. Similarly, these multivariate Gaussian distributions have zero means and variances sampled from a uniform distribution on [0.5, 3]. For sources in s_I, we generate their influences on the observed variables in the structured way described above to satisfy the partial sparsity assumption (Assumption iv in Thm. 4.4); for sources in s_D, we remove the constraint on the structure and permit each source to affect all observed variables.

Base. For the Base model, following Sorrenson et al. (2020), we use GLOW (Kingma and Dhariwal, 2018) as the mixing function to generate the data. The sources are drawn from a single multivariate Gaussian distribution with zero means and variances uniformly sampled from the interval [0.5, 3]. No constraints on the structure are imposed for the Base model.
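For concreteness, below is a simplified sketch of the structured generating process described above. It is not the released pipeline: a small randomly initialized leaky-ReLU network per observed variable stands in for the GLOW-plus-projection mixing, and the parent sets, network width, and seed are illustrative assumptions.

```python
# Simplified sketch of the UCSS-style generating process described above.
# A per-variable random two-layer network stands in for the GLOW-based
# mixing used in the paper; only the structured (sparse) support matters here.
import numpy as np

rng = np.random.default_rng(0)
n_sources, n_obs, n_samples = 3, 6, 2000

# Sources: a single zero-mean Gaussian with variances drawn from U[0.5, 3],
# so all sources are marginally independent.
variances = rng.uniform(0.5, 3.0, size=n_sources)
S = rng.normal(0.0, np.sqrt(variances), size=(n_samples, n_sources))

# Structured support: each observed variable lists its direct parent sources,
# chosen so that Structural Sparsity holds (e.g., x1 depends on s1 only).
parents = [[0], [1], [2], [0, 1], [1, 2], [0, 2]]

def random_mixing(dim_in, width=16):
    """A small random nonlinear map R^dim_in -> R (stand-in for GLOW + projection)."""
    W1 = rng.normal(size=(dim_in, width))
    W2 = rng.normal(size=(width, 1))
    return lambda z: np.maximum(0.2 * (z @ W1), z @ W1) @ W2  # leaky ReLU

mixers = [random_mixing(len(p)) for p in parents]
X = np.hstack([mixers[i](S[:, p]) for i, p in enumerate(parents)])
print(X.shape)  # (2000, 6): more observed variables than sources (undercomplete)
```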
In the evaluation of our model, we utilize the Mean Correlation Coefficient (MCC) as a metric for assessing the correspondence between the ground-truth and the recovered latent sources. The MCC is calculated by first determining the pairwise correlation coefficients between the true sources and the recovered sources after a nonlinear component-wise transformation learned by regression. Subsequently, an assignment problem is solved to match each recovered source with the ground-truth source that exhibits the highest correlation. MCC is a widely accepted metric in the literature for measuring the degree of identifiability while accounting for component-wise transformations (Hyvärinen and Morioka, 2016); a minimal sketch of this computation is provided at the end of this appendix.

Our results are all based on 20 trials, each with a different random seed. For the synthetic datasets used in our experiments, the sample size is 2000. The training parameters include a learning rate of 0.01 and a batch size of 200. Additionally, the number of coupling layers for both GIN and GLOW is set to 10. Regarding the "Triangles" dataset, it comprises 60,000 32×32 images of drawn triangles; the statistics of the dataset are described in Yang et al. (2022). For the experiments conducted on this dataset, the learning rate is set to 3×10⁻⁴ and the batch size is 100. Concerning the EMNIST dataset, it includes 240,000 28×28 images of real-world handwritten digits. The learning rate and batch size used for these experiments are 3×10⁻⁴ and 240, respectively. The experiments are conducted directly with the official implementation of GIN² (Sorrenson et al., 2020), with an additional sparsity regularization term on the Jacobian of the estimated mixing function.

¹ https://github.com/vislearn/FrEIA
² https://github.com/VLL-HD/GIN

B.2 Supplementary experimental results

In this section, we delve deeper into the applicability of our results by offering additional empirical studies that further illuminate the implications of the proposed theory. Specifically, we focus on the following aspects: (1) the effect of different regularization terms on the performance of identification; (2) the applicability of the theory as illustrated by additional real-world examples.

[Figure 8: MCC of UCSS w.r.t. different sparsity regularizations and numbers of sources (x-axis: number of sources, 3 to 6).]

Regularization. For the regularization term, directly utilizing the ℓ0 penalty may be computationally infeasible, as it results in a discrete optimization problem. To overcome this issue, we adopt the ℓ1 regularizer, which has been extensively studied in the literature on high-dimensional support recovery, particularly for variable selection (Wainwright, 2009) and Gaussian graphical model selection (Ravikumar et al., 2008). The ℓ1 regularizer induces sparsity in the solution; however, it may also introduce bias, which can negatively affect performance (Fan and Li, 2001; Breheny and Huang, 2011). This is because the ℓ1 penalty penalizes both small and large entries, unlike the ℓ0 norm, which remains constant over nonzero entries. To remedy this bias issue, we explore alternative penalties, such as the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001) and the minimax concave penalty (MCP) (Zhang, 2010), which can be interpreted as hybrids of the ℓ0 and ℓ1 penalties. Additionally, we note that support recovery with the ℓ1 penalty relies on incoherence conditions in various settings (Wainwright, 2009; Ravikumar et al., 2008, 2011), which may be restrictive in practice, whereas the SCAD and MCP penalties do not rely on such conditions (Loh and Wainwright, 2017). Based on our experimental results (Fig. 8), we adopt the MCP penalty as the regularization term.
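To make this concrete, here is a minimal sketch of an MCP penalty applied entry-wise to the Jacobian of the learned mapping. It is a simplified stand-in rather than the released implementation: the values of `lam` and `gamma`, and the use of `torch.autograd.functional.jacobian` on per-sample inputs, are illustrative assumptions.

```python
# Minimal sketch of an MCP sparsity penalty on the Jacobian of the learned map.
# lam, gamma, and the way the Jacobian is obtained are illustrative choices.
import torch

def mcp(t, lam=0.1, gamma=3.0):
    """Minimax concave penalty (Zhang, 2010), applied element-wise and summed."""
    a = torch.abs(t)
    quad = lam * a - a ** 2 / (2.0 * gamma)        # region |t| <= gamma * lam
    flat = torch.full_like(a, 0.5 * gamma * lam ** 2)  # constant region beyond it
    return torch.where(a <= gamma * lam, quad, flat).sum()

def jacobian_penalty(net, s_batch, lam=0.1, gamma=3.0):
    """Average MCP penalty over per-sample Jacobians of a (batched) map."""
    total = 0.0
    for s in s_batch:
        J = torch.autograd.functional.jacobian(net, s, create_graph=True)
        total = total + mcp(J, lam, gamma)
    return total / len(s_batch)

# Usage (schematic): loss = nll + jacobian_penalty(decoder, latent_batch)
```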
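Finally, as a companion to the MCC description in Appendix B.1, the sketch below computes a simplified MCC using raw Pearson correlations and an optimal assignment; the component-wise nonlinear regression step mentioned in B.1 is omitted, so this is an illustration rather than the exact evaluation protocol.

```python
# Simplified MCC: absolute Pearson correlations + optimal assignment.
# The component-wise nonlinear regression step described in B.1 is omitted here.
import numpy as np
from scipy.optimize import linear_sum_assignment

def mean_correlation_coefficient(s_true, s_est):
    """s_true, s_est: arrays of shape (n_samples, n_sources)."""
    n = s_true.shape[1]
    corr = np.corrcoef(s_true.T, s_est.T)[:n, n:]      # cross-correlation block
    row, col = linear_sum_assignment(-np.abs(corr))    # maximize total |corr|
    return np.abs(corr[row, col]).mean()

# Example with a permuted, rescaled copy of the sources (MCC should be ~1).
rng = np.random.default_rng(0)
s = rng.normal(size=(2000, 4))
s_hat = 2.5 * s[:, [2, 0, 3, 1]] + 0.01 * rng.normal(size=s.shape)
print(round(mean_correlation_coefficient(s, s_hat), 3))
```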