# unpaired_multidomain_causal_representation_learning__a4f9e285.pdf

Unpaired Multi-Domain Causal Representation Learning

Nils Sturma Technical University of Munich Munich Center for Machine Learning nils.sturma@tum.de

Chandler Squires LIDS, Massachusetts Institute of Technology Broad Institute of MIT and Harvard csquires@mit.edu

Mathias Drton Technical University of Munich Munich Center for Machine Learning mathias.drton@tum.de

Caroline Uhler LIDS, Massachusetts Institute of Technology Broad Institute of MIT and Harvard cuhler@mit.edu

The goal of causal representation learning is to find a representation of data that consists of causally related latent variables. We consider a setup where one has access to data from multiple domains that potentially share a causal representation. Crucially, observations in different domains are assumed to be unpaired, that is, we only observe the marginal distribution in each domain but not their joint distribution. In this paper, we give sufficient conditions for identifiability of the joint distribution and the shared causal graph in a linear setup. Identifiability holds if we can uniquely recover the joint distribution and the shared causal representation from the marginal distributions in each domain. We transform our results into a practical method to recover the shared latent causal graph.

1 Introduction

An important challenge in machine learning is the integration and translation of data across multiple domains (Zhu et al., 2017; Zhuang et al., 2021). Researchers often have access to large amounts of unpaired data from several domains, e.g., images and text. It is then desirable to learn a probabilistic coupling between the observed marginal distributions that captures the relationship between the domains. One approach to tackle this problem is to assume that there is a latent representation that is invariant across the different domains (Bengio et al., 2013; Ericsson et al., 2022). Finding a probabilistic coupling then boils down to learning such a latent representation, that is, learning high-level, latent variables that explain the variation of the data within each domain as well as similarities across domains.

In traditional representation learning, the latent variables are assumed to be statistically independent, see for example the literature on independent component analysis (Hyvärinen and Oja, 2000; Comon and Jutten, 2010; Khemakhem et al., 2020). However, the assumption of independence can be too stringent and a poor match to reality. For example, the presence of clouds and the presence of wet roads in an image may be dependent, since clouds may cause rain which may in turn cause wet roads. Thus, it is natural to seek a causal representation, that is, a set of high-level causal variables and relations among them (Schölkopf et al., 2021; Yang et al., 2021b). Figure 1 illustrates the setup of multi-domain causal representation learning, where multiple domains provide different views on a shared causal representation.

Our motivation to study multi-domain causal representations comes, in particular, from single-cell data in biology. Given a population of cells, different technologies such as imaging and sequencing

37th Conference on Neural Information Processing Systems (Neur IPS 2023).

provide different views on the population. Crucially, since these technologies destroy the cells, the observations are uncoupled, i.e., a specific cell may either be used for imaging or sequencing but not both. The aim is to integrate the different views on the population to study the underlying causal mechanisms determining the observed features in various domains (Butler et al., 2018; Stuart et al., 2019; Liu et al., 2019; Yang et al., 2021a; Lopez et al., 2022; Gossi et al., 2023; Cao et al., 2022). Unpaired multi-domain data also appears in many applications other than single-cell biology. For example, images of similar objects are captured in different environments (Beery et al., 2018), large biomedical and neuroimaging data sets are collected in different domains (Miller et al., 2016; Essen et al., 2013; Shafto et al., 2014; Wang et al., 2003), or stocks are traded in different markets.

In this paper, we study identifiability of the shared causal representation, that is, its uniqueness in the infinite data limit. Taking on the same perspective as, for example, in Schölkopf et al. (2021) and Squires et al. (2023), we assume that observed data is generated in two steps. First, the latent variables Z = (Zi)i H are sampled from a distribution PZ, where PZ is determined by an unknown structural causal model among the latent variables. Then, in each domain e {1, . . . , m}, the observed vector Xe Rde is the image of a subset of the latent variables under a domain-specific, injective mixing function ge. That is,

Xe = ge(ZSe),

where Se H is a subset of indices. A priori, it is unknown whether a latent variable Zi with i Se is shared across domains or domain-specific. Even the number of latent variables which are shared across domains is unknown. Moreover, we only observe the marginal distribution of each random vector Xe, but none of the joint distributions over pairs Xe, Xf for e = f. Said differently, observations across domains are unpaired. Assuming that the structural causal model among the latent variables as well as the mixing functions are linear, our main contributions are:

1. We lay out conditions under which we can identify the joint distribution of X1, . . . , Xm. 2. We give additional conditions under which we are able to identify the causal structure among the shared latent variables.

In particular, identifiability of the joint distribution across domains enables data translation. That is, given observation x in domain e, translation to domain f can be achieved by computing E[Xf|Xe = x]. Furthermore, identifying the causal structure among the shared latent variables lets us study the effect of interventions on the different domains.

The main challenge in proving rigorous identifiability results for multi-domain data is that we cannot apply existing results for single-domain data in each domain separately. Even if the causal structure of the latent variables in a single domain is identifiable, it remains unclear how to combine multiple causal structures, i.e., in which way latent variables are shared. We circumvent this problem via a two-step approach: First, we extend the identifiability of linear independent component analysis (Comon, 1994; Hyvärinen and Oja, 2000; Eriksson and Koivunen, 2004; Mesters and Zwiernik, 2022) to the multi-domain setup, which allows us to identify the joint distribution and distinguish between shared and domain-specific latent variables. Moreover, we identify an overall mixing matrix and, in a second step, exploit sparsity constraints in this matrix to identify the causal structure among the shared latent variables. This leverages recent results on causal discovery under measurement error in single domains that also exploit sparsity (Xie et al., 2020; Chen et al., 2022; Xie et al., 2022; Huang et al., 2022). Although we emphasize that our focus in this paper is on identifiability, our proofs also suggest methods to learn the joint distribution as well as the shared causal graph from finite samples. We provide algorithms for the noisy setting and, moreover, we analyze how the number of domains reduce uncertainty with respect to the learned representation.

The paper is organized as follows. In the next paragraphs we discuss further related work. Section 2 provides a precise definition of the considered setup. In Section 3 we consider identifiability of the joint distribution. Using these results, we study identifiability of the causal graph in Section 4. We conclude with a small simulation study as a proof of concept for the finite sample setting in Section 5. Due to space constraints, the detailed discussion of the finite sample setting is deferred to the Appendix. Moreover, the Appendix contains all proofs, discussions on the necessity of our assumptions, and additional examples and simulation results.

Multi-domain Integration. Motivated by technological developments for measuring different modalities at single-cell resolution, several methods have been proposed recently for domain

Figure 1: Setup. A latent causal representation where multiple domains Xe provide different views on subsets of the latent variables Zi. The domains may correspond to different data modalities such as images, text or numerical data. Crucially, the observations across domains are unpaired, i.e., they arise from different states of the latent causal model.

translation between unpaired data. The proposed methods rely on a variety of techniques, including manifold alignment (Welch et al., 2017; Amodio and Krishnaswamy, 2018; Liu et al., 2019), matrix factorization (Duren et al., 2018), correlation analysis (Barkas et al., 2019; Stuart et al., 2019), coupled autoencoders (Yang and Uhler, 2019), optimal transport (Cao et al., 2022), regression analysis (Yuan and Duren, 2022), and semisupervised learning (Lin et al., 2022). Implicitly, these methods presume the existence of a shared latent space where the different modalities either completely align or at least overlap. However, to the best of our knowledge, none of these methods have rigorous identifiability guarantees, i.e., the methods are not guaranteed to recover a correct domain translation mapping even for infinite data. Our work advances the theoretical understanding of multi-domain integration by providing identifiability guarantees on recovering the shared latent space.

Group Independent Component Analysis. The primary tool that we use for identifiability is linear independent component analysis (ICA) (Comon, 1994; Eriksson and Koivunen, 2004). Many works extend ICA to the multi-domain setting. These methods primarily come from computational neuroscience, where different domains correspond to different subjects or studies. However, to the best of our knowledge, all prior works require pairing between samples. These works can be categorized based on whether the samples are assumed to be voxels (Calhoun et al., 2001; Esposito et al., 2005), time points (Svensén et al., 2002; Varoquaux et al., 2009; Hyvärinen and Ramkumar, 2013), or either (Beckmann and Smith, 2005; Sui et al., 2009). For reviews, see Calhoun et al. (2009) and Chabriel et al. (2014). Related are methods for independent vector analysis (Kim et al., 2006; Anderson et al., 2014; Bhinge et al., 2019) and multiset canonical correlation analysis (Nielsen, 2002; Li et al., 2011; Klami et al., 2014), which allow the latent variables to take on different values in each domain but still require sample pairing. Most of the mentioned methods lack identifiability guarantees, only newer work (Richard et al., 2021) provides sufficient conditions for identifiability. Furthermore, all mentioned methods assume that every latent variable is shared across all domains, while our setup allows for shared and domain-specific latent variables. Some methods, e.g., Lukic et al. (2002), Maneshi et al. (2016), and Pandeva and Forré (2023), permit both shared and domain-specific components, but only consider the paired setting. In this paper, we extend these results to the unpaired setting.

Latent Structure Discovery. Learning causal structure between latent variables has a long history, e.g., in measurement models (Silva et al., 2006). One recent line of work studies the problem under the assumption of access to interventional data (e.g., Liu et al., 2022; Squires et al., 2023). In particular, Squires et al. (2023) show that the latent graph is identifiable if the interventions are sufficiently diverse. Another line of work, closer to ours and not based on interventional data, shows that the graph is identified under certain sparsity assumptions on the mixing functions (Xie et al., 2020; Chen et al., 2022; Xie et al., 2022; Huang et al., 2022). However, these methods are not suitable in our setup since they require paired data in a single domain. One cannot apply them in each domain separately since it would be unclear how to combine the multiple latent causal graphs, that is, which of the latent variables are shared. In this work, we lay out sparsity assumptions on the mixing

functions that are tailored to the unpaired multi-domain setup. The works of Adams et al. (2021) and Zeng et al. (2021) may be considered closest to ours as they also treat a setting with multiple domains and unpaired data. However, our setup and results are more general. Adams et al. (2021) assume that the number of observed variables are the same in each domain, whereas we consider domains of different dimensions corresponding to the fact that observations may be of very different nature. Further, we allow for shared and domain-specific latent variables, where the number of shared latent variables is unknown, while in Adams et al. (2021) it is assumed that all latent variables are shared. Compared to Zeng et al. (2021), we consider a general but fixed number of observed variables, while Zeng et al. (2021) only show identifiability of the full model in a setup where the number of observed variables in each domain increases to infinity. On a more technical level, the conditions in Zeng et al. (2021) require two pure children to identify the shared latent graph, while we prove identifiability under the weaker assumption of two partial pure children; see Section 4 for precise definitions.

Notation. Let N be the set of nonnegative integers. For positive n N, we define [n] = {1, . . . , n}. For a matrix M Ra b, we denote by MA,B the submatrix containing the rows indexed by A [a] and the columns indexed by B [b]. Moreover, we write MB for the submatrix containing all rows but only the subset of columns indexed by B. Similarly, for a tuple x = (x1, . . . , xb), we denote by x B the tuple only containing the entries indexed by B. A matrix Q = Qσ Rp p is a signed permutation matrix if it can be written as the product of a diagonal matrix D with entries Dii { 1} and a permutation matrix e Qσ with entries ( e Qσ)ij = 1j=σ(i), where σ is a permutation on p elements. Let P be a p-dimensional joint probability measure of a collection of random variables Y1, . . . , Yp. Then we denote by Pi the marginal probability measure such that Yi Pi. We say that P has independent marginals if the random variables Yi are mutually independent. Moreover, we denote by M#P the d-dimensional push-forward measure under the linear map defined by the matrix M Rd p. If Q is a signed permutation matrix and the probability measure P has independent marginals, then Q#P also has independent marginals. For univariate probability measures we use the shorthand ( 1)#P = P.

Let H = [h] for h 1, and let Z = (Z1, . . . , Zh) be latent random variables that follow a linear structural equation model. That is, the variables are related by a linear equation Z = AZ + ε, (1) with h h parameter matrix A = (aij) and zero-mean, independent random variables ε = (ε1, . . . , εh) that represent stochastic errors. Assume that we have observed random vectors Xe Rde in multiple domains of interest e [m], where the dimension de may vary across domains. Each random vector is the image under a linear function of a subset of the latent variables. In particular, we assume that there is a subset L H representing the shared latent space such that each Xe is generated via the mechanism

Xe = Ge ZL ZIe

where Ie H \ L. We say that the latent variable ZIe are domain-specific for domain e [m] while the latent variables ZL are shared across all domains. As already noted, we are motivated by settings where the shared latent variables ZL capture the key causal relations and the different domains are able to give us combined information about these relations. Likewise, we may think about the domainspecific latent variables ZIe as noise in each domain, independent of the shared latent variables. Specific models are now derived from (1)-(2) by assuming specific (but unknown) sparsity patterns in A and Ge. Each model is given by a large directed acyclic graph (DAG) that encodes the multi-domain setup. To formalize this, we introduce pairwise disjoint index sets V1, . . . , Vm, where Ve indexes the coordinates of Xe, i.e., Xe = (Xv : v Ve) and |Ve| = de. Then V = V1 Vm indexes all observed random variables. We define an m-domain graph such that the latent nodes are the only parents of observed nodes and there are no edges between shared and domain-specific latent nodes. Definition 2.1. Let G be a DAG whose node set is the disjoint union H V = H V1 Vm. Let D be the edge set of G. Then G is an m-domain graph with shared latent nodes L = [ℓ] H if the following is satisfied:

1. All parent sets contain only latent variables, i.e., pa(v) = {w : w v D} H for all v H V .

Figure 2: Compact version of a 2-domain graph G2 with five latent nodes and two domains V1 and V2. All observed nodes in each domain are represented by a single grey node. We draw a dashed blue edge from latent node h H to domain Ve V if h Se = pa(Ve). The random vectors associated to the two domains are uncoupled, that is, we do not observe their joint distribution.

2. The set L consists of the common parents of variables in all different domains, i.e., u L if and only if u pa(v) pa(w) for v Ve, w Vf with e = f.

3. Let Ie = Se \ L be the domain-specific latent nodes, where Se := pa(Ve) = v Vepa(v) H. Then there are no edges in D that connect a node in L and a node m e=1Ie or that connect a node in Ie and a node in If for any e = f.

To emphasize that a given DAG is an m-domain graph we write Gm instead of G. We also say that Se is the set of latent parents in domain e and denote its cardinality by se = |Se|. Note that the third condition in Definition 2.1 does not exclude causal relations between the domain-specific latent variables, that is, there may be edges v w for v, w Ie. Since the sets Ie satisfy Ie If = for e = f, we specify w.l.o.g. the indexing convention Ie = {ℓ+ 1 + Pe 1 i=1 |Ii|, . . . , ℓ+ Pe i=1 |Ii|} and h = ℓ+ Pm e=1 |Ie|.

Example 2.2. Consider the compact version of a 2-domain graph in Figure 2. There are h = 5 latent nodes where L = {1, 2} are shared and I1 = {3, 4} and I2 = {5} are domain-specific. A full m-domain graph is given in Appendix B.

Each m-domain graph postulates a statistical model that corresponds to the structural equation model in (1) and the mechanisms in (2), with potentially sparse matrices A and Ge. For two sets of nodes W, Y H V , we denote by DW Y the subset of edges DW Y = {y w D : w W, y Y }. Moreover, let RDW Y be the set of real |W| |Y | matrices M = (mwy) with rows indexed by W and columns indexed by Y , such that the support is given by DW Y , that is, mwy = 0 if y w DW Y .

Definition 2.3. Let Gm = (H V, D) be an m-domain graph. Define the map

ϕGm : RDV H RDHH R|V | |H|

(G, A) 7 G (I A) 1.

Then the multi-domain causal representation (MDCR) model M(Gm) is given by the set of probability measures PX = B#P, where B Im(ϕGm) and P is an h-dimensional probability measure with independent, mean-zero marginals Pi, i H. We say that the pair (B, P) is a representation of PX M(Gm).

Definition 2.3 corresponds to the model defined in Equations (1) and (2). If PX M(Gm) with representation (B, P), then PX is the joint distribution of the observed domains X = (X1, . . . , Xm). The distribution of the random variables εi in Equation (1) is given by the marginals Pi. Moreover, for any matrix G RDV H, we denote the submatrix Ge = GVe,Se Rde se which coincides with the matrix Ge from Equation (2). For the graph in Figure 2, we compute a concrete example of the matrix B in Example B.1 in the Appendix. Importantly, in the rest of the paper we assume to only observe the marginal distribution PXe in each domain but not the joint distribution PX.

Ultimately, we are interested in recovering the graph GL = (L, DLL) among the shared latent nodes. We proceed by a two-step approach: In Section 3 we recover the representation (B, P) of the joint distribution PX. To be precise, we recover a matrix b B that is equal to B up to certain permutations of the columns. Then we use the matrix b B to recover the shared latent graph GL in Section 4 and show that recovery is possible up to trivial relabeling of latent nodes that appear in the same position of the causal order.

3 Joint Distribution

To identify the joint distribution PX, we apply identifiability results from linear ICA in each domain separately and match the recovered probability measures Pi for identifying which of them are shared, that is, whether or not i L. Let Gm be an m-domain graph with shared latent nodes L, and let PX M(Gm) with representation (B, P). Recall that B = G(I A) 1 with G RDV H and A RDHH. We make the following technical assumptions.

(C1) (Different error distributions.) The marginal distributions Pi, i H are non-degenerate, non-symmetric and have unit variance. Moreover, the measures are pairwise different to each other and to the flipped versions, that is, Pi = Pj and Pi = Pj for all i, j H with i = j. Subsequently, we let d be a distance on the set of univariate Borel probability measures such that d(Pi, Pj) = 0 and d(Pi, Pj) = 0 for i = j.

(C2) (Full rank of mixing.) For each e [m], the matrix Ge = GVe,Se Rde se is of full column rank.

By not allowing symmetric distributions in Condition (C1), we assume in particular that the distributions of the errors are non-Gaussian. Non-Gaussianity together with the assumptions of pairwise different and non-symmetric error distributions allow us to extend the results on identifiability of linear ICA to the unpaired multi-domain setup and to identify the joint distribution. In particular, the assumption of pairwise different error distributions allows for matching the distributions across domains to identify the ones corresponding to the shared latent space. Non-symmetry accounts for the sign-indeterminacy of linear ICA when matching the distributions. We discuss the necessity of these assumptions in Remark 3.2 and, in more detail, in Appendix C. Note that Condition (C1) is always satisfied in a generic sense, that is, randomly chosen probability distributions on the real line are pairwise different and non-symmetric with probability one. Finally, Condition (C2) requires in particular that for each shared latent node k L there is at least one node v Ve in every domain e [m] such that k pa(v).

Under Conditions (C1) and (C2) we are able to derive a sufficient condition for identifiability of the joint distribution. Let SP(p) be the set of p p signed permutation matrices. We define the set of signed permutation block matrices:

Π = {diag(ΨL, ΨI1, . . . , ΨIm) : ΨL SP(ℓ) and ΨIe SP(|Ie|)} .

Our main result is the following. Theorem 3.1. Let Gm be an m-domain graph with shared latent nodes L = [ℓ], and let PX M(Gm) with representation (B, P). Suppose that m 2 and that Conditions (C1) and (C2) are satisfied. Let (bℓ, b B, b P) be the output of Algorithm 1. Then bℓ= ℓand

b B = B Ψ and b P = Ψ #P,

for a signed permutation block matrix Ψ Π.

Theorem 3.1 says that the matrix B is identifiable up to signed block permutations of the columns. Under the assumptions of Theorem 3.1 it holds that b B# b P is equal to PX. That is, the joint distribution of the domains is identifiable. Remark 3.2. While Theorem 3.1 is a sufficient condition for identifiability of the joint distribution, we emphasize that pairwise different error distributions are in most cases also necessary; we state the exact necessary condition in Proposition C.1 in the Appendix. Said differently, if one is willing to assume that conceptually different latent variables also follow a different distribution, then identification of these variables is possible, and otherwise (in most cases) not. Apart from pairwise different error distributions, non-symmetry is then required to fully identify the joint distribution whose dependency structure is determined by the shared latent variables. If the additional assumption on non-symmetry is not satisfied, then it is still possible to identify the shared, conceptually different latent variables, which becomes clear by inspecting the proofs of Theorem 3.1 and Proposition C.1. The non-identifiability of the joint distribution would only result in sign indeterminacy, that is, entries of the matrix b B could have a flipped sign. Remark 3.3. By checking the proof of Theorem 3.1, the careful reader may notice that the statement of the theorem still holds true when we relax the third condition in the definition of an m-domain

Algorithm 1 Identify Joint Distribution

1: Input: Probability measures PXe for all e [m]. 2: Output: Number of shared latent variables bℓ, matrix b B and probability measure b P with independent marginals. 3: for e [m] do 4: Linear ICA: Find the smallest value bse such that PXe = b Be#P e for a matrix b Be Rde bse and an bse-dimensional probability measure P e with independent, mean-zero and unit-variance marginals P e i . 5: end for 6: Matching: Let bℓbe the maximal number such that there are signed permutation matrices {Qe}e [m] satisfying d([(Qe) #P e]i, [(Qf) #P f]i) = 0

for all i = 1, . . . , bℓand for all f = e. Let b L = {1, . . . , bℓ}. 7: Construct the matrix b B and the tuple of probability measures b P given by

[ b B1Q1] b L [ b B1Q1][bs1]\ b L ... ... [ b Bm Qm] b L [ b Bm Qm][bsm]\ b L

[(Q1) #P 1] b L [(Q1) #P 1][bs1]\ b L ... [(Qm) #P m][bsm]\ b L

8: return (bℓ, b B, b P).

graph. That is, one may allow directed paths from shared to domain-specific latent nodes but not vice versa. For example, an additional edge 1 4 between the latent nodes 1 and 4 would be allowed in the graph in Figure 2. In this case, the dependency structure of the domains is still determined by the shared latent space. However, the structural assumption that there are no edges between shared and domain-specific latent nodes is made for identifiability of the shared latent graph in Section 4. Remark 3.4. The computational complexity of Algorithm 1 depends on the complexity of the chosen linear ICA algorithm, to which we make m calls. Otherwise, the dominant part is the matching in Line 6 with worst case complexity O(m maxe [m] d2 e), where we recall that m is the number of domains and de is the dimension of domain e.

In Appendix D we state a complete version of Algorithm 1 for the finite sample setting. In particular, we provide a method for the matching in Line 6 based on the two-sample Kolmogorov-Smirnov test. For finite samples, there might occur false discoveries, that is, distributions are matched that are actually not the same. With our method, we show that the probability of falsely discovering shared nodes shrinks exponentially with the number of domains.

4 Identifiability of the Causal Graph

We return to our goal of identifying the causal graph GL = (L, DLL) among the shared latent nodes. By Theorem 3.1, we can identify the representation (B, P) of PX M(Gm) from the marginal distributions. In particular, we recover the matrix b B = BΨ for a signed permutation block matrix Ψ Π. Moreover, we know which columns correspond to the shared latent nodes. That is, we know that the submatrix b BL obtained by only considering the columns indexed by L = b L = [ℓ] is equal to BLΨL, where ΨL SP(ℓ). Problem 4.1. Let B Im(ϕGm) for an m-domain graph Gm with shared latent nodes L. Given b BL = BLΨL with ΨL a signed permutation matrix, when is it possible to identify the graph GL?

Recently, Xie et al. (2022) and Dai et al. (2022) show that, in the one-domain setting with independent additive noise, the latent graph can be identified if each latent variable has at least two pure children. We obtain a comparable result tailored to the multi-domain setup. Definition 4.2. Let Gm = (H V, D) be an m-domain graph with shared latent nodes L H. For k L, we say that an observed node v V is a partial pure child of k if pa(v) L = {k}.

Algorithm 2 Identify Shared Graph

1: Input: Matrix B R|V | ℓ. 2: Output: Parameter matrix b A Rℓ ℓ. 3: Remove rows B i,L from the matrix B that are completely zero. 4: Find tuples (ik, jk)k L with ik = jk such that

(i) rank(B {ik,jk},L) = 1 for all k L and (ii) rank(B {ik,iq},L) = 2 for all k, q L with k = q.

5: Let I = {i1, . . . , iℓ} and consider B I,L Rℓ ℓ. 6: Find two permutation matrices R1 and R2 such that W = R1B I,LR2 is lower triangular. 7: Multiply each column of W by the sign of its corresponding diagonal element. This yields a new matrix f W with all diagonal elements positive. 8: Divide each row of f W by its corresponding diagonal element. This yields a new matrix f W with all diagonal elements equal to one. 9: Compute b A = I (f W ) 1. 10: return b A.

For a partial pure child v V , there may still be domain-specific latent nodes that are parents of v. Definition 4.2 only requires that there is exactly one parent that is in the set L. This explains the name partial pure child; see Example B.2 in the Appendix for further elaboration.

W.l.o.g. we assume in this section that the shared latent nodes are topologically ordered such that i j DLL implies i < j for all i, j L. We further assume:

(C3) (Two partial pure children across domains.) For each shared latent node k L, there exist two partial pure children.

(C4) (Rank faithfulness.) For any two subsets Y V and W L, we assume that

rank(BY,W ) = max B Im(ϕGm) rank(B Y,W ).

The two partial pure children required in Condition (C3) may either be in distinct domains or in a single domain. This is a sparsity condition on the large mixing matrix G. In Appendix C we discuss that the identification of the joint latent graph is impossible without any sparsity assumptions. We conjecture that two partial pure children are not necessary, but we leave it open for future work to find a non-trivial necessary condition. Roughly speaking, we assume in Condition (C4) that no configuration of edge parameters coincidentally yields low rank. The set of matrices B Im(ϕGm) that violates (C4) is a subset of measure zero of Im(ϕGm) with respect to the Lebesgue measure. Note that our conditions do not impose constraints on the graph GL. Our main tool to tackle Problem 4.1 will be the following lemma.

Lemma 4.3. Let B Im(ϕGm) for an m-domain graph Gm. Suppose that Condition (C4) is satisfied and that there are no zero-rows in BL. Let v, w V . Then rank(B{v,w},L) = 1 if and only if there is a node k L such that both v and w are partial pure children of k.

The condition on no zero-rows in Lemma 4.3 is needed since we always have rank(B{v,w},L) 1 if one of the two rows is zero. However, this is no additional structural assumption since we allow zero-rows when identifying the latent graph; c.f. Algorithm 2. The lemma allows us to find partial pure children by testing ranks on the matrix b BL. If (i1, j1) and (i2, j2) are partial pure children of two nodes in L, we make sure that these two nodes are different by checking that rank(B{i1,i2},L) = 2.

For a DAG G = (V, D), we define S(G) to be the set of permutations on |V | elements that are consistent with the DAG, i.e., σ S(G) if and only if σ(i) < σ(j) for all edges i j D. The following result is the main result of this section.

Theorem 4.4. Let b B = BΨ with B Im(ϕGm) and Ψ Π, and define B = b BL to be the input of Algorithm 2. Assume that Conditions (C3) and (C4) are satisfied, and let b A be the output of Algorithm 2. Then b A = Q σ AL,LQσ for a signed permutation matrix Qσ with σ S(GL). Moreover, if Gvk > 0 for G RDV H whenever v is a pure child of k, then Qσ is a permutation matrix.

Theorem 4.4 says that the graph GL can be recovered up to a permutation of the nodes that preserves the property that i j implies i < j; see Remark 4.5. Since the columns of the matrix b B are not only permuted but also of different signs, we solve the sign indeterminacy column-wise in Line 7 before removing the scaling indeterminacy row-wise in Line 8. In case the coefficients of partial pure children are positive, this ensures that Qσ is a permutation matrix and we have no sign indeterminacy. In Appendix D we adapt Algorithm 2 for the empirical data setting, where we only have b BL BLψL. Remark 4.5. Let b A be the output of Alg. 2. Then we construct the graph b GL = (L, b DLL) as the graph with edges j i b DLL if and only if b Aij = 0. Condition (C4) ensures that b GL is equivalent to GL in the sense that there is a permutation σ S(GL) such that b DLL = {σ(i) σ(j) : i j DLL}. Example 4.6. As highlighted in the introduction, the unpaired multi-domain setup is motivated by applications from single-cell biology. For example, consider the domains of (i) gene expression and (ii) high-level phenotypic features extracted from imaging assays (e.g. Mc Quin et al., 2018). We argue that the requirement of two partial pure children is justifiable on such data as follows. The condition requires, for example, that for each shared latent variable, (i) the expression of some gene depends only upon that shared latent variable plus domain-specific latent variables, and (ii) one of the high-level phenotypic features depends only on the same latent feature plus domain-specific latent variables. Many genes have highly specialized functions, so (i) is realistic, and similarly many phenotypic features are primarily controlled by specific pathways, so (ii) is justified. Remark 4.7. In Algorithm 2, we determine the rank of a matrix by Singular Value Decomposition, which has worst case complexity O(mn min{n, m}) for an m n matrix. Since Line 4 is the dominant part, we conclude that the worst case complexity of Algorithm 2 is O(|V |2 ℓ).

5 Simulations

In this section we report on a small simulation study to illustrate the validity of our adapted algorithms for finite samples (detailed in Appendix D). We emphasize that this should only serve as a proof of concept as the focus of our work lies on identifiability. In future work one may develop more sophisticated methods; c.f. Appendix G. The adapted algorithms have a hyperparameter γ, which is a threshold on singular values to determine the rank of a matrix. In our simulations we use γ = 0.2.

Data Generation. In each experiment we generate 1000 random models with ℓ= 3 shared latent nodes. We consider different numbers of domains m {2, 3} and assume that there are |Ie| = 2 domain-specific latent nodes for each domain. The dimensions are given by de = d/m for all e [m] and d = 30. We sample the m-domain graph Gm on the shared latent nodes as follows. First, we sample the graph GL from an Erd os-Rényi model with edge probability 0.75 and assume that there are no edges between other latent nodes, that is, between L and H \ L and within H \ L. Then we fix two partial pure children for each shared latent node k L and collect them in the set W. The remaining edges from L to V \W and from H to V are sampled from an Erd os-Rényi model with edge probability 0.9. Finally, the (nonzero) entries of G and A are sampled from Unif( [0.25, 1]). The distributions of the error variables are specified in Appendix E. For simplicity, we assume that the sample sizes coincide, that is, ne = n for all e [m], and consider n {1000, 2500, 5000, 10000, 25000}.

Results. First, we plot the average number of shared nodes bℓin our experiments in Figure 3 (a). Especially for low sample sizes, we see that fewer shared nodes are detected with more domains. However, by inspecting the error bars we also see that the probability of detecting too many nodes bℓ> ℓdecreases drastically when considering 3 instead of 2 domains. This suggests that the number of falsely detected shared nodes is very low, as expected by Theorem D.3. Our findings show that more domains lead to a more conservative discovery of shared nodes, but whenever a shared node is determined this is more certain. Moreover, we measure the error in estimating b B b L in Figure 3 (b), that is, the error in the shared columns. We take

score B( b B b L) =

min Ψ SP (ℓ) β 1/2

ℓ,bℓ b B b L [BLΨ] b L F if bℓ ℓ,

min Ψ SP (bℓ) β 1/2

ℓ,bℓ [ b B b LΨ]L BL F if bℓ> ℓ,

where F denotes the Frobenius norm and βℓ,bℓ= min{ℓ, bℓ} Pm e=1 de denotes the number

of entries of the matrix over which the norm is taken. In the cases ℓ= bℓ, we also measure the

(c) Figure 3: Results. Logarithmic scale on the x-axis. Error bars in (a) are one standard deviation of the mean and in (b) and (c) they are the interquartile range.

performance of recovering the shared latent graph GL in Figure 3 (c) by taking

score A( b A) = min Qσ SP (ℓ) s.t. σ S(GL) 1 ℓ Q σ b AQσ AL,L F.

As expected, the median estimation errors for BL and AL,L decrease with increasing sample size. In Appendix F we provide additional simulations with larger ℓ. Moreover, we consider setups where we violate specific assumptions, such as pairwise different distributions (C1) and two partial pure children (C3). The results emphasize that the conditions are necessary for the algorithms provided. The computations were performed on a single thread of an Intel Xeon Gold 6242R processor (3.1 GHz), with a total computation time of 12 hours for all simulations presented in this paper (including Appendix).

6 Discussion

This work introduces the problem of causal representation learning from unpaired multi-domain observations, in which multiple domains provide complementary information about a set of shared latent nodes that are the causal quantities of primary interest. For this problem, we laid out a setting in which we can provably identify the causal relations among the shared latent nodes. To identify the desired causal structure, we proposed a two-step approach where we first make use of linear ICA in each domain separately and match the recovered error distributions to identify shared nodes and the joint distribution of the domains. In the second step, we identify the causal structure among the shared latent variables by testing rank deficiencies in the overall mixing matrix B. To the best of our knowledge, our guarantees are the first principled identifiability results for shared causal representations in a general, unpaired multi-domain setting.

We proposed algorithms for recovering the joint distribution and the shared latent space making our proofs constructive. While our focus is on identifiability guarantees, we show in Appendix D how our proofs give rise to algorithms for the finite sample setting. Moreover, we propose a method to match approximate error distributions and show that the probability of falsely discovering shared nodes decreases exponentially in the number of domains. Our work opens up numerous directions for future work as we discuss in Appendix G.

Acknowledgments and Disclosure of Funding

This project was initiated while the first author was a visitor at the Eric and Wendy Schmidt Center of the Broad Institute of MIT and Harvard. The project has received funding from the European Research Council (ERC) under the European Union s Horizon 2020 research and innovation programme (grant agreement No 883818), NCCIH/NIH (1DP2AT012345), ONR (N00014-22-1-2116), DOE-ASCR (DE-SC0023187), NSF (DMS-1651995), the MIT-IBM Watson AI Lab, and a Simons Investigator Award. Nils Sturma acknowledges support by the Munich Data Science Institute at the Technical University of Munich via the Linde/MDSI Ph D Fellowship program. Chandler Squires was partially supported by an NSF Graduate Research Fellowship.

Adams, J., Hansen, N., and Zhang, K. (2021). Identification of partially observed linear causal models: Graphical conditions for the non-gaussian and heterogeneous cases. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W., editors, Advances in Neural Information Processing Systems, volume 34, pages 22822 22833. Curran Associates, Inc.

Ahuja, K., Mahajan, D., Wang, Y., and Bengio, Y. (2023). Interventional causal representation learning. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J., editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 372 407. PMLR.

Amodio, M. and Krishnaswamy, S. (2018). MAGAN: Aligning biological manifolds. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 215 223. PMLR.

Anderson, M., Fu, G.-S., Phlypo, R., and Adalı, T. (2014). Independent vector analysis: Identification conditions and performance bounds. IEEE Trans. Signal Process., 62(17):4399 4410.

Bach, F. R. and Jordan, M. I. (2003). Kernel independent component analysis. J. Mach. Learn. Res., 3(1):1 48.

Barkas, N., Petukhov, V., Nikolaeva, D., Lozinsky, Y., Demharter, S., Khodosevich, K., and Kharchenko, P. V. (2019). Joint analysis of heterogeneous single-cell RNA-seq dataset collections. Nat. Methods, 16(8):695 698.

Beckmann, C. F. and Smith, S. M. (2005). Tensorial extensions of independent component analysis for multisubject f MRI analysis. Neuroimage, 25(1):294 311.

Beery, S., Van Horn, G., and Perona, P. (2018). Recognition in terra incognita. In Ferrari, V., Hebert, M., Sminchisescu, C., and Weiss, Y., editors, Computer Vision ECCV 2018, pages 472 489, Cham. Springer International Publishing.

Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798 1828.

Bhinge, S., Mowakeaa, R., Calhoun, V. D., and Adalı, T. (2019). Extraction of time-varying spatiotemporal networks using parameter-tuned constrained IVA. IEEE Trans. Med. Imaging, 38(7):1715 1725.

Buchholz, S., Besserve, M., and Schölkopf, B. (2022). Function classes for identifiable nonlinear independent component analysis. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A., editors, Advances in Neural Information Processing Systems, volume 35, pages 16946 16961. Curran Associates, Inc.

Butler, A., Hoffman, P., Smibert, P., Papalexi, E., and Satija, R. (2018). Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol., 36(5):411 420.

Calhoun, V. D., Adali, T., Pearlson, G. D., and Pekar, J. J. (2001). A method for making group inferences from functional MRI data using independent component analysis. Hum. Brain Mapp., 14(3):140 151.

Calhoun, V. D., Liu, J., and Adalı, T. (2009). A review of group ICA for f MRI data and ICA for joint inference of imaging, genetic, and ERP data. Neuroimage, 45(1):163 172.

Cao, K., Gong, Q., Hong, Y., and Wan, L. (2022). A unified computational framework for single-cell data integration with optimal transport. Nat. Comm., 13(1).

Cardoso, J. and Souloumiac, A. (1993). Blind beamforming for non-gaussian signals. IEE Proceedings F Radar and Signal Processing, 140(6):362.

Chabriel, G., Kleinsteuber, M., Moreau, E., Shen, H., Tichavsky, P., and Yeredor, A. (2014). Joint matrices decompositions and blind source separation: A survey of methods, identification, and applications. IEEE Signal Process. Mag., 31(3):34 43.

Chen, Z., Xie, F., Qiao, J., Hao, Z., Zhang, K., and Cai, R. (2022). Identification of linear latent variable model with arbitrary distribution. Proceedings of the AAAI Conference on Artificial Intelligence, 36(6):6350 6357.

Comon, P. (1994). Independent component analysis, a new concept? Signal Process., 36(3):287 314.

Comon, P. and Jutten, C. (2010). Handbook of Blind Source Separation: Independent Component Analysis and Applications. Elsevier.

Dai, H., Spirtes, P., and Zhang, K. (2022). Independence testing-based approach to causal discovery under measurement error and linear non-gaussian models. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A., editors, Advances in Neural Information Processing Systems, volume 35, pages 27524 27536. Curran Associates, Inc.

Duren, Z., Chen, X., Zamanighomi, M., Zeng, W., Satpathy, A. T., Chang, H. Y., Wang, Y., and Wong, W. H. (2018). Integrative analysis of single-cell genomics data by coupled nonnegative matrix factorizations. Proc. Natl. Acad. Sci., 115(30):7723 7728.

Ericsson, L., Gouk, H., Loy, C. C., and Hospedales, T. M. (2022). Self-supervised representation learning: Introduction, advances, and challenges. IEEE Signal Process. Mag., 39(3):42 62.

Eriksson, J. and Koivunen, V. (2004). Identifiability, separability, and uniqueness of linear ICA models. IEEE Signal Process. Lett., 11(7):601 604.

Esposito, F., Scarabino, T., Hyvärinen, A., Himberg, J., Formisano, E., Comani, S., Tedeschi, G., Goebel, R., Seifritz, E., and Di Salle, F. (2005). Independent component analysis of f MRI group studies by self-organizing clustering. Neuroimage, 25(1):193 205.

Essen, D. C. V., Smith, S. M., Barch, D. M., Behrens, T. E., Yacoub, E., and Ugurbil, K. (2013). The WU-minn human connectome project: An overview. Neuroimage, 80:62 79.

Gentle, J. E. (1998). Numerical linear algebra for applications in statistics. Statistics and Computing. Springer Verlag, New York.

Gessel, I. and Viennot, G. (1985). Binomial determinants, paths, and hook length formulae. Adv. in Math., 58(3):300 321.

Gossi, F., Pati, P., Chouvardas, P., Martinelli, A. L., Kruithof-de Julio, M., and Rapsomaniki, M. A. (2023). Matching single cells across modalities with contrastive learning and optimal transport. Brief. Bioinform., 24(3):bbad130.

Huang, B., Low, C. J. H., Xie, F., Glymour, C., and Zhang, K. (2022). Latent hierarchical causal structure discovery with rank constraints. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A., editors, Advances in Neural Information Processing Systems, volume 35, pages 5549 5561. Curran Associates, Inc.

Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. Neural Netw., 10(3):626 634.

Hyvärinen, A. and Oja, E. (2000). Independent component analysis: algorithms and applications. Neural Netw., 13(4):411 430.

Hyvärinen, A. and Ramkumar, P. (2013). Testing independent component patterns by inter-subject or intersession consistency. Front. Hum. Neurosci., 7:94.

Khemakhem, I., Kingma, D., Monti, R., and Hyvärinen, A. (2020). Variational autoencoders and nonlinear ICA: A unifying framework. In Chiappa, S. and Calandra, R., editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 2207 2217. PMLR.

Kim, T., Lee, I., and Lee, T.-W. (2006). Independent vector analysis: definition and algorithms. In 2006 Fortieth Asilomar Conference on Signals, Systems and Computers, pages 1393 1396. IEEE.

Klami, A., Virtanen, S., Leppäaho, E., and Kaski, S. (2014). Group factor analysis. IEEE Trans. Neural Netw. Learn. Syst., 26(9):2136 2147.

Li, X.-L., Adalı, T., and Anderson, M. (2011). Joint blind source separation by generalized joint diagonalization of cumulant matrices. Signal Process., 91(10):2314 2322.

Lin, Y., Wu, T.-Y., Wan, S., Yang, J. Y., Wong, W. H., and Wang, Y. (2022). sc Joint integrates atlas-scale single-cell RNA-seq and ATAC-seq data with transfer learning. Nat. Biotechnol., 40(5):703 710.

Lindström, B. (1973). On the vector representations of induced matroids. Bull. London Math. Soc., 5:85 90.

Liu, J., Huang, Y., Singh, R., Vert, J.-P., and Noble, W. S. (2019). Jointly embedding multiple single-cell omics measurements. In Huber, K. T. and Gusfield, D., editors, 19th International Workshop on Algorithms in Bioinformatics (WABI 2019), volume 143 of Leibniz International Proceedings in Informatics (LIPIcs), pages 10:1 10:13, Dagstuhl, Germany. Schloss Dagstuhl Leibniz-Zentrum fuer Informatik.

Liu, Y., Zhang, Z., Gong, D., Gong, M., Huang, B., Hengel, A. v. d., Zhang, K., and Shi, J. Q. (2022). Identifying weight-variant latent causal models. ar Xiv preprint ar Xiv:2208.14153.

Lopez, R., Tagasovska, N., Ra, S., Cho, K., Pritchard, J., and Regev, A. (2022). Learning causal representations of single cells via sparse mechanism shift modeling. In Neur IPS 2022 Workshop on Causality for Real-world Impact.

Lukic, A. S., Wernick, M. N., Hansen, L. K., and Strother, S. C. (2002). An ICA algorithm for analyzing multiple data sets. In Proceedings. International Conference on Image Processing, volume 2, pages 821 824. IEEE.

Lyche, T. (2020). Numerical linear algebra and matrix factorizations, volume 22 of Texts in Computational Science and Engineering. Springer, Cham. With a foreword by Geir Dahl.

Maneshi, M., Vahdat, S., Gotman, J., and Grova, C. (2016). Validation of shared and specific independent component analysis (SSICA) for between-group comparisons in f MRI. Front. Neurosci., 10:417.

Mc Quin, C., Goodman, A., Chernyshev, V., Kamentsky, L., Cimini, B. A., Karhohs, K. W., Doan, M., Ding, L., Rafelski, S. M., Thirstrup, D., Wiegraebe, W., Singh, S., Becker, T., Caicedo, J. C., and Carpenter, A. E. (2018). Cell Profiler 3.0: Next-generation image processing for biology. PLOS Biol., 16(7):e2005970.

Mesters, G. and Zwiernik, P. (2022). Non-independent components analysis. ar Xiv preprint ar Xiv:2206.13668.

Miller, K. L., Alfaro-Almagro, F., Bangerter, N. K., Thomas, D. L., Yacoub, E., Xu, J., Bartsch, A. J., Jbabdi, S., Sotiropoulos, S. N., Andersson, J. L. R., Griffanti, L., Douaud, G., Okell, T. W., Weale, P., Dragonu, I., Garratt, S., Hudson, S., Collins, R., Jenkinson, M., Matthews, P. M., and Smith, S. M. (2016). Multimodal population brain imaging in the UK biobank prospective epidemiological study. Nat. Neurosci., 19(11):1523 1536.

Nielsen, A. A. (2002). Multiset canonical correlations analysis and multispectral, truly multitemporal remote sensing data. IEEE Trans. Image Process., 11(3):293 305.

Pandeva, T. and Forré, P. (2023). Multi-view independent component analysis with shared and individual sources. In Evans, R. J. and Shpitser, I., editors, Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, volume 216 of Proceedings of Machine Learning Research, pages 1639 1650. PMLR.

Richard, H., Ablin, P., Thirion, B., Gramfort, A., and Hyvärinen, A. (2021). Shared independent component analysis for multi-subject neuroimaging. Adv. Neural Inf. Process. Syst., 34:29962 29971.

Roeder, G., Metz, L., and Kingma, D. (2021). On linear identifiability of learned representations. In Meila, M. and Zhang, T., editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 9030 9039. PMLR.

Schölkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A., and Bengio, Y. (2021). Toward causal representation learning. Proc. IEEE, 109(5):612 634.

Shafto, M. A., Tyler, L. K., Dixon, M., Taylor, J. R., Rowe, J. B., Cusack, R., Calder, A. J., Marslen-Wilson, W. D., Duncan, J., Dalgleish, T., Henson, R. N., Brayne, C., and Matthews, F. E. (2014). The cambridge centre for ageing and neuroscience (cam-CAN) study protocol: a cross-sectional, lifespan, multidisciplinary examination of healthy cognitive ageing. BMC Neurol., 14(1).

Shimizu, S., Inazumi, T., Sogawa, Y., Hyvärinen, A., Kawahara, Y., Washio, T., Hoyer, P. O., and Bollen, K. (2011). Direct Li NGAM: a direct method for learning a linear non-Gaussian structural equation model. J. Mach. Learn. Res., 12:1225 1248.

Silva, R., Scheine, R., Glymour, C., and Spirtes, P. (2006). Learning the structure of linear latent variable models. J. Mach. Learn. Res., 7(8):191 246.

Squires, C., Seigal, A., Bhate, S. S., and Uhler, C. (2023). Linear causal disentanglement via interventions. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J., editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 32540 32560. PMLR.

Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., Mauck, W. M., Hao, Y., Stoeckius, M., Smibert, P., and Satija, R. (2019). Comprehensive integration of single-cell data. Cell, 177(7):1888 1902.

Sui, J., Adali, T., Pearlson, G. D., and Calhoun, V. D. (2009). An ICA-based method for the identification of optimal FMRI features and components using combined group-discriminative techniques. Neuroimage, 46(1):73 86.

Svensén, M., Kruggel, F., and Benali, H. (2002). ICA of f MRI group study data. Neuroimage, 16(3):551 563.

van der Vaart, A. W. and Wellner, J. A. (1996). Weak convergence and empirical processes. Springer Series in Statistics. Springer-Verlag, New York. With applications to statistics.

Varoquaux, G., Sadaghiani, S., Poline, J. B., and Thirion, B. (2009). Can ICA: Model-based extraction of reproducible group-level ICA patterns from f MRI time series. In Medical Image Computing and Computer Aided Intervention, page 1.

Wang, X., Hutchinson, R., and Mitchell, T. M. (2003). Training fmri classifiers to detect cognitive states across multiple human subjects. In Thrun, S., Saul, L., and Schölkopf, B., editors, Proceedings of the 16th International Conference on Neural Information Processing Systems, volume 16, pages 709 716. MIT Press.

Wang, Y. S. and Drton, M. (2020). High-dimensional causal discovery under non-Gaussianity. Biometrika, 107(1):41 59.

Welch, J. D., Hartemink, A. J., and Prins, J. F. (2017). MATCHER: manifold alignment reveals correspondence between single cell transcriptome and epigenome dynamics. Genome Biol., 18(1):1 19.

Xie, F., Cai, R., Huang, B., Glymour, C., Hao, Z., and Zhang, K. (2020). Generalized independent noise condition for estimating latent variable causal graphs. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H., editors, Adv. in Neural Inf. Process. Syst., volume 33, pages 14891 14902. Curran Associates, Inc.

Xie, F., Huang, B., Chen, Z., He, Y., Geng, Z., and Zhang, K. (2022). Identification of linear non-Gaussian latent hierarchical structure. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S., editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 24370 24387. PMLR.

Yang, K. D., Belyaeva, A., Venkatachalapathy, S., Damodaran, K., Katcoff, A., Radhakrishnan, A., Shivashankar, G. V., and Uhler, C. (2021a). Multi-domain translation between single-cell imaging and sequencing data using autoencoders. Nat. Comm., 12(1).

Yang, K. D. and Uhler, C. (2019). Multi-domain translation by learning uncoupled autoencoders. Computational Biology Workshop, International Conference on Machine Learning.

Yang, M., Liu, F., Chen, Z., Shen, X., Hao, J., and Wang, J. (2021b). Causalvae: Disentangled representation learning via neural structural causal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9593 9602.

Yuan, Q. and Duren, Z. (2022). Integration of single-cell multi-omics data by regression analysis on unpaired observations. Genome Biol., 23(1):1 19.

Zeng, Y., Shimizu, S., Cai, R., Xie, F., Yamamoto, M., and Hao, Z. (2021). Causal discovery with multi-domain lingam for latent factors. In Zhou, Z.-H., editor, Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 2097 2103. International Joint Conferences on Artificial Intelligence Organization. Main Track.

Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2242 2251.

Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H., and He, Q. (2021). A comprehensive survey on transfer learning. Proc. IEEE, 109(1):43 76.

Proof of Theorem 3.1. Let PX M(Gm) for an m-domain graph Gm = (H V, D) with shared latent nodes L = [ℓ] and representation (B, P). By Condition (C1) the measure P has independent, non-degenerate, non-symmetric marginals Pi, i H with mean zero and variance one. Moreover, since B Im(ϕGm), we have B = G(I A) 1 for matrices G RDV H and A RDHH.

Fix one domain e [m]. Recall that we denote by Se = pa(Ve) = L Ie the set of latent parents in domain e. Define the matrix

Be := GVe,Se[(I A) 1]Se,Se = Ge[(I A) 1]Se,Se,

and observe that we can write PXe = BVe,H#P = Be#PSe. This is due to the fact that GVe,H\Se = 0 and [(I A) 1]Se,H\Se = 0 by the definition of an m-domain graph.

In particular, the equality PXe = Be#PSe shows that the representation in Line 4 of Algorithm 1 exists. Now, we show that it is unique up to signed permutation by applying results on identifiability of linear ICA. Since Ge has full column rank by Condition (C2) and [(I A) 1]Se,Se is invertible, the matrix Be also has full column rank. Let PXe = b Be#P e be any representation, where b Be Rde bse and P e is an bse-dimensional probability measure with independent, non-degenerate marginals P e i . Due to Condition (C1), all probability measures Pi are non-Gaussian and non-degenerate and therefore we have by Eriksson and Koivunen (2004, Theorem 3 and 4) the identities

b Be = Be ReΛe and P e = Λe(Re) #PSe, (3)

where Λe is an se se diagonal matrix with nonzero entries and Re is an se se permutation matrix. In particular, we have bse = se, which means that b Be Rde se and that P e is an se-dimensional probability measure. Line 4 also requires that each marginal P e i has unit variance. This removes the scaling indeterminacy in (3) and we have

b Be = Be Re De and P e = De(Re) #PSe,

where De is a diagonal matrix with entries De ii { 1}. In particular, this means that the distributions P e and PSe coincide up to permutation and sign of the marginals.

The matching in Line 6 identifies which components of P e are shared. By Condition (C1), two components of different domains P e i and P f j are shared if and only if they coincide up to sign, that is, if and only if d(P e i , P f j ) = 0 or d(P e i , P f j ) = 0. If their distribution coincide up to sign, than either d(P e i , P f j ) = 0 or d(P e i , P f j ) = 0 but not both since Condition (C1) requires the distribution of the error variables to be non-symmetric. We conclude that in each domain e [m] there exists an se se signed permutation matrix Qe such that

d([(Qe) #P e]i, [(Qf) #P f]i) = 0 (4)

for all i = 1, . . . , bℓand for all f = e. In particular, bℓ= ℓand b L = L.

It remains to show that b B = BΨ and b P = Ψ #P for a signed permutation block matrix Ψ Π. By Equation (4), the distributions [(Qe) #P e]L and [(Qe) #P e]L coincide, which means that

(Qe) #P e = (Qe) De(Re) #PSe = Ψ L 0 0 Ψ Ie

where ΨL is an ℓ ℓsigned permutation matrix and ΨIe is an |Ie| |Ie| signed permutation matrix. Importantly, the matrix Ψ L does not depend on the domain e [m]. Hence, the matrix Φe := Re De Qe is a signed permutation matrix with block structure as in Equation (5). Moreover, we have b Be Qe = Be Re De Qe = BeΦe = Be LΨL Be [se]\LΨIe

which means that the matrix b B can be factorized as

[ b B1Q1] b L [ b B1Q1][bs1]\ b L ... ... [ b Bm Qm] b L [ b Bm Qm][bsm]\ b L

B1 LΨL B1 [s1]\LΨI1 ... ... Bm L ΨL Bm [sm]\LΨIm

B1 L B1 [s1]\L ... ... Bm L Bm [sm]\L

ΨL ΨI1 ... ΨIm

where Ψ Π. Similarly, we have for all e [m],

(Qe) #P e = (Φe) #PSe = (ΨL) #PL (ΨIe) #PIe

We conclude that

[(Q1) #P 1] b L [(Q1) #P 1][bs1]\ b L ... [(Qm) #P m][bsm]\ b L

(ΨL) #PL (ΨI1) #PI1 ... (ΨIm) #PIm

ΨL ΨI1 ... ΨIm

PL PI1... PIm

Before proving Lemma 4.3 and Theorem 4.4 we fix some notation. Let Gm = (V H, D) be an m-domain graph. We denote by anc(v) = {k H : there is a directed path k v in Gm} the ancestors of a node v V . For subsets W V , we denote anc(W) = S

v W anc(w). Moreover, for L H and v V , we write shortly pa L(v) = pa(v) L.

Proof of Lemma 4.3. Let B Im(ϕGm). Then we can write B = G (I A) 1 with

GV1,L GV1,I1 ... ... GVm,L GVm,Im

Moreover, observe that, by the definition of an m-domain-graph, the matrix BV,L factorizes as

BV,L = GV,L[(I A) 1]L,L.

Now, suppose that i and j are partial pure children of a fixed node k L. Then pa L(i) = {k} = pa L(j). In particular, the only entry that may be nonzero in the row Gi,L is given by Gik and the only entry that may be nonzero in the row Gj,L is given by Gjk. Thus, we have

q L Giq[(I A) 1]q,L = Gik[(I A) 1]k,L.

Similarly, it follows that Bj,L = Gjk[(I A) 1]k,L. This means that the row Bj,L is a multiple of the row Bi,L and we conclude that rank(B{i,j},L) 1. Equality holds due to the faithfulness condition (C4) which implies that Bik = 0 and Bjk = 0, i.e., B{i,j},L is not the null matrix.

For the other direction suppose that rank(B{i,j},L) = 1. By applying the Lindström-Gessel-Viennot Lemma (Gessel and Viennot, 1985; Lindström, 1973) equivalently as in Dai et al. (2022, Theorem 1 and 2), it can be seen that

rank(B{i,j},L) min {|S| : S is a vertex cut from anc(L) to {i, j}} , (6)

where S is a vertex cut from anc(L) to {i, j} if and only if there exists no directed path in Gm from anc(L) to {i, j} without passing through S. Moreover, equality holds in (6) for generic (almost all) choices of parameters. Since we assumed rank faithfulness in Condition (C4) we exclude cases where the inequality is strict and therefore have equality. By the definition of an m-domain graph we have that anc(L) = L. Thus, if rank(B{i,j},L) = 1, there must be a single node k L such that {k} is a vertex cut from L to {i, j}. But then it follows that i and j have to be partial pure children of k by the definition of an m-domain graph and by using the assumption that there are no zero-rows in BL.

To prove Theorem 4.4 we need the following auxiliary lemma. Lemma A.1. Let G = (V, D) be a DAG with topologically ordered nodes V = [p] and let M be a lower triangular matrix with entries Mii = 0 for all i = 1, . . . , p and Mij = 0 if and only if there is a directed path j i in G. Let Qσ1 and Qσ2 be permutation matrices. Then the matrix Qσ1MQσ2 is lower triangular if and only if σ2 = σ 1 1 and σ2 S(G).

Proof of Lemma A.1. By the definition of a permutation matrix, we have

[Qσ1MQσ2]ij = Mσ1(i)σ 1 2 (j) or, equivalently, [Qσ1MQσ2]σ 1 1 (i)σ2(j) = Mij. (7)

First, suppose that σ2 = σ 1 1 and σ2 S(G), and let i, j [p] such that σ2(i) < σ2(j). Then, by the definition of S(G), there is no directed path j i in the graph G and therefore we have Mij = 0. But this means that [Qσ1MQσ2]σ2(i)σ2(j) = 0 and we conclude that the matrix Qσ1MQσ2 is lower triangular.

Now, assume that Qσ1MQσ2 is lower triangular, where σ1 and σ2 are arbitrary permutations on the set [p]. Since M has no zeros on the diagonal, we have Mii = [Qσ1MQσ2]σ 1 1 (i)σ2(i) = 0 for all i = 1, . . . , p. It follows that σ 1 1 (i) σ2(i) for all i = 1, . . . , p because Qσ1MQσ2 is lower triangular. But this is only possible if the permutations coincide on all elements, i.e., we have σ2 = σ 1 1 . It remains to show that σ2 = σ 1 1 S(G). For any edge j i D we have that Mij = 0. Recalling Equation (7) this means that [Qσ1MQσ2]σ2(i)σ2(j) = 0. But since Qσ1MQσ2 is lower triangular this can only be the case if σ2(j) < σ2(i) which proves that σ2 S(G).

Proof of Theorem 4.4. Each latent node in L has two partial pure children by Condition (C3). After removing zero-rows in Line 3 of Algorithm 2 it holds by Lemma 4.3 that rank(B {i,j},L) = 1 if and only if there is a latent node in L such that i and j are both partial pure children of that latent node. Hence, each tuple (ik, jk)k L in Line 4 of Algorithm 2 consists of two partial pure children of a certain latent node. The requirement rank(B {ik,iq},L) = 2 ensures that each pair of partial pure children has a different parent.

By the definition of an m-domain-graph and the fact that B = b BL, for I = {i1, . . . , iℓ}, we have the factorization

B I,L = BI,LΨL = GI,L(I A) 1 L,LΨL = GI,L(I AL,L) 1ΨL, (8)

where G RDV H, A RDHH and ΨL is a signed permutation matrix. Let Q1 and Q2 be permutation matrices and let Λ be a diagonal matrix with non-zero diagonal elements and let D be a diagonal matrix with entries in { 1}. Then we can rewrite Equation (8) as

B I,L = Q1 Λ(I AL,L) 1D | {z } =:M

Now, we apply Lemma A.1. Since we assume throughout Section 4 that the nodes L are topologically ordered, the matrix M is lower triangular with no zeros on the diagonal. Moreover, by Condition (C4) we have Mij = 0 if and only if there is a directed path j i in GL. In Line 6 we find other permutation matrices R1 and R2 such that

W = R1B I,LR2 = (R1Q1)M(Q2R2)

is lower triangular. Now, define the permutation matrices Qσ1 = R1Q1 and Qσ2 = Q2R2. Then we have by Lemma A.1 that Qσ1 = Q σ2 and that σ2 S(GL). Hence, the matrix W factorizes as

W = Q σ2MQσ2 = Q σ2Λ(I AL,L) 1DQσ2 = eΛQ σ2(I AL,L) 1Qσ2 e D,

where eΛ and e D are diagonal matrices with the entries given by permutations of the entries of Λ and D. Lines 7 and 8 address the scaling and sign matrices eΛ and e D. In particular, we have that f W = D Q σ2(I AL,L) 1Qσ2D for another diagonal matrix D with entries in { 1}, since each entry on the diagonal of f W is equal to 1. Thus, we have b A = I (f W ) 1

= I (D Q σ2(I AL,L) 1Qσ2D ) 1

= I D Q σ2(I AL,L)Qσ2D

= D Q σ2AL,LQσ2D . Since Qσ2D is a signed permutation matrix with σ2 S(GL), the first part of the theorem is proved. If Gvk > 0 whenever v is a pure child of k, the matrix eΛ only has positive entries which means that D is equal to the identity matrix. This proves the second part.

B Additional Examples

The graph in Figure 4 is an m-domain graph corresponding to the compact version in Figure 2 in the main paper. Example B.1. Consider the m-domain graph in Figure 4. The linear structural equation model among the latent variables is determined by lower triangular matrices of the form

0 0 0 0 0 a21 0 0 0 0 0 0 0 0 0 0 0 a43 0 0 0 0 0 0 0

Moreover, the domain-specific mixing matrices Ge are of the form

g1 11 g1 12 g1 13 0 g1 21 0 g1 23 0 0 g1 32 g1 33 g1 34 g1 41 g1 42 g1 43 g1 44

g2 11 0 g2 13 g2 21 g2 22 g2 23 0 g2 32 0 g2 41 g2 42 g2 43 g2 51 g2 52 0

Since the shared latent nodes are given by L = {1, 2}, we have

g1 11 g1 12 g1 13 0 0 g1 21 0 g1 23 0 0 0 g1 32 g1 33 g1 34 0 g1 41 g1 42 g1 43 g1 44 0 g2 11 0 0 0 g2 13 g2 21 g2 22 0 0 g2 23 0 g2 32 0 0 0 g2 41 g2 42 0 0 g2 43 g2 51 g2 52 0 0 0

B = G (I A) 1 =

a21g1 12 + g1 11 g1 12 g1 13 0 0 g1 21 0 g1 23 0 0 a21g1 32 g1 32 a43g1 34 + g1 33 g1 34 0 a21g1 42 + g1 41 g1 42 a43g1 44 + g1 43 g1 44 0 g2 11 0 0 0 g2 13 a21g2 22 + g2 21 g2 22 0 0 g2 23 a21g2 32 g2 32 0 0 0 a21g2 42 + g2 41 g2 42 0 0 g2 43 a21g2 52 + g2 51 g2 52 0 0 0

v1 1 v1 2 v1 3 v1 4

v2 1 v2 2 v2 3 v2 4 v2 5

Figure 4: A 2-domain graph with 5 latent nodes and dimensions of the observed domains given by |V1| = d1 = 4 and |V2| = d2 = 5. We denote Ve = {ve 1, . . . , ve de}, that is, the superscript indicates the domain a node belongs to.

Example B.2. Consider the m-domain graph in Figure 4. The partial pure children of node 1 L are given by {v1 2, v2 1} and the partial pure children of 2 L are given by {v1 3, v2 3}. Moreover, by continuing Example B.1, we have that the matrix BL is given by

a21g1 12 + g1 11 g1 12 g1 21 0 a21g1 32 g1 32 a21g1 42 + g1 41 g1 42 g2 11 0 a21g2 22 + g2 21 g2 22 a21g2 32 g2 32 a21g2 42 + g2 41 g2 42 a21g2 52 + g2 51 g2 52

It is easy to see that the two submatrices g1 21 0 g2 11 0

and a21g1 32 g1 32 a21g2 32 g2 32

have rank one. The first matrix corresponds to the partial pure children {v1 2, v2 1} in the graph in Figure 4 while the second matrix correspond to the partial pure children {v1 3, v2 3}. Note that the rank of any other 2 2 submatrix is generically (i.e., almost surely) equal to 2.

C Discussion of the Assumptions

In this section, we discuss aspects of Conditions (C1)-(C3) that allow for identifiability. In particular, we discuss the necessity of pairwise different and non-Gaussian error distributions if one is not willing to make further assumptions. Moreover, we elaborate on the sparsity conditions on the mixing matrix and explain why some sparsity assumption is necessary.

Pairwise Different Error Distributions. Given any two potentially different m-domain graphs Gm = (H V, D) and e Gm = ( e H V, e D), identifiability of the joint distribution in multi-domain causal representation models means that

BVe,Se#PSe = e BVe,e Se# e Pe Se for all e [m] = B#P = e B# e P (9)

for any representation (B, P) of a distribution in M(Gm) and any representation ( e B, e P) of a distribution in M( e Gm), where the matrices BVe,Se and e BVe,e Se have full column rank for all e [m]. The left-hand side says that the marginal distributions in each domain are equal, while the right-hand side says that the joint distributions are equal. If there are m-domain graphs, such that the left-hand sides holds but the right-hand side is violated, then we say that the joint distribution is not identifiable.

We assume in this section that the marginal error distributions Pi, i H are non-Gaussian and have unit variance, but are not necessarily pairwise different or non-symmetric. Then the right-hand side holds if and only if the number of shared latent nodes in each graph is equal, i.e., ℓ= eℓ, and there is a signed permutation matrix Ψ such that B = e BΨ and P = Ψ # e P. Here, the matrix Ψ does

not necessarily have a block structure. The equivalence is implied by the identifiability of the usual, one-domain linear ICA ( see, e.g., Buchholz et al. (2022)) together with the fact that for ℓ = eℓ, we have |H| = | e H| and, therefore, the distributions on the right-hand have support over different dimensional subspaces.

Theorem 3.1 shows that assumptions (C1) and (C2) are sufficient for identifiability of the joint distribution. In particular, we show that they imply identifiability in a stronger sense, namely, that it follows from the left-hand side that ℓ= eℓand B = e BΨ and P = Ψ # e P for a signed permutation Ψ Π with block structure. The next proposition reveals necessary conditions for identifiability. Proposition C.1. Let Gm be an m-domain graph with shared latent nodes L = [ℓ], and let PX M(Gm) with representation (B, P). Suppose that m 2 and that everything except the assumption about pairwise different error distributions in Conditions (C1) and (C2) is satisfied. Then, the joint distribution is not identifiable if one of the following holds:

(i) There is i, j L such that Pi = Pj or Pi = Pj.

(ii) There is i L and j Ie for some e [m] such that Pi = Pj or Pi = Pj.

(iii) For all e [m] there is ie Ie such that Pie = Pjf or Pie = Pif for all e = f.

Proof. For each of the three cases, we will construct another m-domain graph Gm = (H V, D) such that for suitable representations ( e B, e P) of distributions in M( e Gm), the left-hand side of (9) holds, but the right-hand side is violated.

To prove the statement for case (i), let i, j L and assume that Pi = Pj. We define the m-domain graph e Gm = ( e H V, e D) to be the almost same graph as Gm = (H V, D), we only swap the roles of the latent nodes i and j on an arbitrary domain e [m]. That is, for each v Ve, if there was an edge i v in D, we remove that edge from e D and add the edge j v instead, and vice versa. Otherwise, the graph e Gm has the same structure as Gm. Now, let e P = P and define a the matrix e B to be the same matrix as B, except for the subcolumns e BVe,i := BVe,j and e BVe,j := BVe,i, that is, we swapped BVe,i and BVe,j. Then the pair ( e B, e P) is a representation of some distribution in M( e Gm). Recall from the proof of Theorem 3.1 that Condition (C2) implies that the matrix BVe,Se has full column rank. Since we only swapped columns in e BVe,e Se, it still has full column rank. Moreover, observe that the left hand side of (9) is satisfied since Pi = Pj, that is, the marginal distributions on the single domains coincide.

However, now consider another domain f [m] and the submatrices

BVe Vf ,{i,j} = BVe,i BVe,j BVf ,i BVf ,j

and e BVe Vf ,{i,j} = BVe,j BVe,i BVf ,i BVf ,j

Since all of the four subcolumns are nonzero and neither BVe,j is equal to BVe,i nor BVf ,j is equal to BVf ,i, there is no signed permutation matrix Ωsuch that BVe Vf ,{i,j} = e BVe Vf ,{i,j}Ω. Hence, there is also no larger signed permutation matrix Ψ such that B = e BΨ. We conclude that the right-hand side of (9) is violated and the joint distribution is not identifiable. Finally, note that the above arguments also hold if Pi = Pj by adding signs in appropriate places.

The proof for case (ii) works with exactly the same construction. That is, for i L and j Ie we swap the roles of i and j on the domain e. Then, for any other domain f [m], we obtain the submatrices

BVe Vf ,{i,j} = BVe,i BVe,j BVf ,i 0

and e BVe Vf ,{i,j} = BVe,j BVe,i BVf ,i 0

By the same arguments as before, this shows that there is no signed permutation matrix Ψ such that B = e BΨ and, hence, the joint distribution is not identifiable.

To prove case (iii), we consider a slightly different construction. However, we also assume that Pie = Pif for all e = f, since for Pie = Pif we only have to add some signs in the following. We define the m-domain graph e Gm = ( e H V, e D) by identifying the nodes ie, e [m] with a new

node k. That is, e L = L {k} and e H = (S e [m] Ie \ {ie}) e L. For i e H \ {k} and v V , the

edge set e D contains an edge i v if and only if the edge i v is in D. For the node k e H and v V , we put an edge k v in e D if and only if there is an edge ie v in D for some e [m].

Now, define the matrix e B such that e BV, e H\{k} := BV, e H\{k} and e BVe,k := BVe,ie for all e [m].

Then the pair ( e B, e P) is a representation of some distribution in M( e Gm). Moreover, each submatrix e BVe,e Se is equal to BVe,Se up to relabeling of the columns. That is, the column that is labeled by ie in BVe,Se is now labeled by k in e BVe,e Se. We define the measure e P such that e P e H\{k} = P e H\{k} and e Pk = Pie for all e [m]. Then the pair ( e B, e P) is a representation of some distribution in M( e Gm) and, in particular, the left hand side of (9) is satisfied. That is, the marginal distributions coincide on each domain. However, the number of shared latent variables in both m-domain graphs is different since we have eℓ= ℓ+ 1. We conclude that the joint distribution is not identifiable.

The proposition states that it is in most cases necessary that error distributions are pairwise different. However, in two cases the same error distributions still lead to identifiability. First, if i, j Ie, then the corresponding error distributions may be the same and the joint distribution is still identifiable. Similarly, if there are latent nodes ie in a few domains e [m] such that the corresponding error distributions coincide, but there is at least one domain f [m] where there is no latent node with the same error distribution, then the joint distribution is also identifiable. Both can be seen by taking the the proofs of Theorem 3.1 and Proposition C.1 together.

Gaussian Errors. Without additional assumptions to those in Section 3, it is impossible to recover the joint distribution if the distributions of the errors εi of the latent structural equation model in Equation (1) are Gaussian. In this case, the distribution of Z as well as the distribution of each observed random vector Xe is determined by the covariance matrix only. The observed covariance matrix in domain e [m] is given by Σe = Ge Cov[ZL Ie](Ge) . However, knowing Σe gives no information about ZL Ie other than rank(Σe) = |L|+|Ie|, that is, we cannot distinguish which latent variables are shared and which ones are domain-specific. This is formalized in the following lemma.

Lemma C.2. Let Σ be any d d symmetric positive semidefinite matrix of rank p and let Ξ be another arbitrary p p symmetric positive definite matrix. Then there is G Rd p such that Σ = GΞG .

Proof. Let Σ be a d d symmetric positive semidefinite matrix of rank p. Then, Σ has a decomposition similar to the Cholesky decomposition; see e.g. Gentle (1998, Section 3.2.2). That is, there exists a unique matrix T, such that A = TT , where T is a lower triangular matrix with p positive diagonal elements and d p columns containing all zeros. Define e T to be the d p matrix containing only the non-zero columns of T.

On the other hand, let Ξ be a symmetric positive definite p p matrix. By the usual Cholesky decomposition (Lyche, 2020, Section 4.2.1), there exists a unique p p lower triangular matrix L with positive diagonal elements such that Ξ = LL . Now, define G := e TL 1 Rd p. Then,

Σ = e T e T = e TL 1LL L e T = GΞG .

Due to Lemma C.2 it is necessary to consider non-Gaussian distributions to obtain identifiability of the joint distribution.

Example C.3. In the Gaussian case we cannot distinguish whether the two observed domains in Figure 5 share a latent variable or not. Said differently, the observed marginal distributions may either be generated by the mechanism defined by graph (a) or graph (b) and there is no way to distinguish from observational distributions only.

Sparsity Assumptions. Let B Im(ϕGm) for an m-domain graph Gm = (H V, D) and suppose that we are given the matrix BL = GV,L(I AL,L) 1, that is, we are given the submatrix with columns indexed by the shared latent nodes. Now, assume that the graph does not impose any sparsity restrictions on GV,L, which means that the set RDV L of possible matrices GV,L is equal to R|V | |L|. Then, the set of possible matrices BL is also unrestricted, that is, BL can be any matrix in R|V | |L| no matter the form of the matrix AL,L RDLL. In other words, for arbitrary shared latent graphs

(b) Figure 5: Compact versions of two 2-domain graphs. In both graphs, both domains have two latent parents. In setup (a) there is a shared latent parent while in setup (b) there is not.

GL = (L, DLL) and arbitrary corresponding parameter matrices AL,L RDLL, we don t get any restrictions on the matrix BL. Therefore, it is impossible to infer AL,L from BL.

Condition (C3) requires that there are two partial pure children for every shared latent node k L, which implies that there are 2|L| rows in GV,L in which only one entry may be nonzero. While we show in Theorem 4.4 that this condition is sufficient for identifiability of AL,L, we leave it open for future work to find a necessary condition.

D Algorithms for Finite Samples

We adjust Algorithm 1 such that it is applicable in the empirical data setting. That is, rather than the exact distribution PXe, we have a matrix of observations Xe Rde ne in each domain e [m]. The sample size ne might be different across domains. We denote nmin = mine [m] ne and nmax = maxe [m] ne. For implementing linear ICA on finite samples, multiple well developed algorithms are available, e.g., Fast ICA (Hyvärinen, 1999; Hyvärinen and Oja, 2000), Kernel ICA (Bach and Jordan, 2003) or JADE (Cardoso and Souloumiac, 1993). Applying them, we obtain a measure b P e i which is an estimator of the true measure P e i in Algorithm 1, Line 4.

The remaining challenge is the matching in Line 6 of Algorithm 1. For finite samples, the distance between empirical distributions is almost surely not zero although the true underlying distributions might be equal. In this section, we provide a matching strategy based on the two-sample Kolmogorov Smirnov test (van der Vaart and Wellner, 1996, Section 3.7). We match two distributions if they are not significantly different. During this process, there might occur false discoveries, that is, distributions are matched that are actually not the same. We show that the probability of falsely discovering shared nodes shrinks exponentially with the number of domains.

For two univariate Borel probability measures Pi, Pj, with corresponding cumulative distribution functions Fi, Fj, the Kolmogorov-Smirnov distance is given by the L -distance

d KS(Pi, Pj) = Fi Fj = sup x R |Fi(x) Fj(x)|.

The two-sample Kolmogorov-Smirnov test statistic for the null hypothesis H0 : d KS(P e i , P f j ) = 0 is given by

T( b P e i , b P f j ) = r nenf

ne + nf d KS( b P e i , b P f j ). (10)

It is important to note that b P e i is not an empirical measure in the classical sense since it is not obtained from data sampled directly from the true distribution P e i . In addition to the sampling error there is the uncertainty of the ICA algorithm. However, in the analysis we present here, we will neglect this error and treat b P e i as an empirical measure. In this case, under H0, the test statistic T( b P e i , b P f j ) converges in distribution to B , where B is a Brownian bridge from 0 to 1 (van der Vaart and Wellner, 1996, Section 2.1). For a given level α (0, 1), we choose the critical value as cα = inf{t : P( B > t) α} and reject H0 if T( b P e i , b P f j ) > cα.

Definition D.1. Let α (0, 1) and suppose the distributions { b P e 1 , . . . , b P e bse} and { b P f 1 , . . . , b P f bsf } are given for two domains e, f [m]. Define

Ωα( b P e i , b P f j ) =

( 1 if T ef ij cα and T ef ij = min{mink [bsf ] T ef ik , mink [bse] T ef kj }, 0 else,

where T ef ij = min{T( b P e i , b P f j ), T( b P e i , b P f j )}. We say that b P e i , b P f j are matched if Ωα( b P e i , b P f j ) = 1.

Definition D.1 essentially states that two measures are matched if the test statistic (10) is not significantly large and the null hypothesis cannot be rejected. Taking the minimum of T( b P e i , b P f j ) and T( b P e i , b P f j ) accounts for the sign indeterminacy of linear ICA. For two fixed domains e, f [m], if it happens that the statistic T ef ij for multiple pairs (i, j) is small enough, then the pair with the minimal value of the statistic is matched. Note that one may use any other test than the Kolmogorov-Smirnov test to define a matching as in Definition D.1. We discover a shared latent node if it is matched consistently across domains. Definition D.2. Let C = (i1, . . . , im) be a tuple with m elements such that ie [bse]. Then we say that C determines a shared node if Ωα( b P e ie, b P f if ) = 1 for all ie, if C.

Inferring the existence of a shared node which does not actually exist may be considered a more serious error than inferring a shared node determined by a set C, where only some components of C are wrongly matched. In the following theorem we show that the probability of falsely discovering shared nodes shrinks exponentially with the number of wrongly matched components. Theorem D.3. Let C = (i1, . . . , im) be a tuple with m elements such that ie [bse]. Let g : N R 0 R 0 be a function that is monotonically decreasing in n N and assume the following:

(i) P(d KS( b P e ie, P e ie) > x) g(ne, x) for all e [m] and for all x 0.

(ii) There is E [m] with |E| 2 and a constant κ > 0 such that d KS(P e ie, P f if ) κ and

d KS(P e ie, P f if ) κ for all e E, f [m] with e = f.

P(C determines a shared node) g nmin, max κ

2 nmin cα, 0 |E| 1 .

Proof of Theorem D.3. Let C = (i1, . . . , im) and g be as in the statement of the theorem. W.l.o.g. we assume that E = {1, . . . , |E|} and that T( b P e ie, b P f if ) T( b P e ie, b P f if ). Observe that C determines a shared node if and only if X

e<f Ωα( b P e ie, b P f if ) = m 2

Now, we have

e<f Ωα( b P e ie, b P f if ) = m 2

n Ωα( b P e ie, b P f if ) = 1 o!

n T ef ij cα o!

n T( b P e ie, b P f if ) cα o n T( b P e ie, b P f if ) cα o!

n T( b P e ie, b P f if ) cα o!

e E,f [m] e<f

n T( b P e ie, b P f if ) cα o!

By the triangle inequality we have

d KS(P e ie, P f if ) d KS(P e ie, b P e ie) + d KS( b P e ie, b P f if ) + d KS( b P f if , P f if ). (12)

Moreover, if e E and f [m], then we have by Condition (ii)

d KS(P e ie, P f if ) κ. (13)

Using (12) and (13) together with the fact q nenf

ne+nf nmin 2nmax , we obtain the following chain of

implications for e E and f [m]:

T( b P e ie, b P f if ) cα

ne + nf d KS( b P e ie, b P f if ) cα

= nmin 2nmax

d KS(P e ie, P f if ) d KS( b P e ie, P e ie) d KS( b P f if , P f if ) cα

= nmin 2nmax

κ d KS( b P e ie, P e ie) d KS( b P f if , P f if ) cα

d KS( b P e ie, P e ie) + d KS( b P f if , P f if ) κ 2nmax

= d KS( b P e ie, P e ie) κ

or d KS( b P f if , P f if ) κ

Now, consider the event T e E,f [m] e<f {T( b P e ie, b P f if ) cα}. On this event, there cannot be two elements

e, f E such that both d KS( b P e ie, P e ie) < κ

and d KS( b P f if , P f if ) < κ

To see this recall that E [m]. We conclude that it must hold {d KS( b P e ie, P e ie) κ

2 nmin cα} for all but at most one element of E. We denote this exceptional element by e E. Taking up (11), we get the following:

e<f Ωα( b P e ie, b P f if ) = m 2

e E,f [m] e<f

n T( b P e ie, b P f if ) cα o!

d KS( b P e ie, P e ie) κ

d KS( b P e ie, P e ie) κ

d KS( b P e ie, P e ie) max κ

2 nmin cα, 0 !

g nmin, max κ

2 nmin cα, 0 |E| 1 . (16)

The last three steps need more explanation: Equality (14) follows from the fact that domains are unpaired. That is, the distances d KS( b P e ie, P e ie) and d KS( b P f if , P f if ) are pairwise independent for different

domains e, f [m]. Equality (15) is trivial since d KS( b P e ie, P e ie) 0. Finally, Inequality (16) follows from Condition (i) and that the function g is monotonically decreasing in n. We also used the fact that |E \ {e }| = |E| 1.

If b P e i were an empirical measure in the classical sense, then Condition (i) in Theorem D.3 translates to the well-known Dvoretzky Kiefer Wolfowitz inequality, that is, the function g is given by g(n, x) =

Algorithm 3 Identify Joint Distribution Empirical

1: Hyperparameters: γ > 0, α (0, 1). 2: Input: Matrix of observations Xe Rde ne for all e [m]. 3: Output: Number of shared latent variables bℓ, matrix b B and probability measure b P. 4: for e [m] do 5: Linear ICA: Use any linear ICA algorithm to obtain a mixing matrix e Be Rde bse, where bse = rankγ(Xe(Xe) ). Compute the matrix ηe = ( e Be) Xe Rbse ne, where ( e Be) is the Moore-Penrose pseudoinverse of e Be. 6: Scaling: Let e be a bse bse diagonal matrix with entries e ii = 1 ne [ ηe( ηe) ]ii. Define b Be = e Be( e) 1/2 and ηe = ( e) 1/2 ηe. 7: Let b P e be the estimated probability measure with independent marginals such that b P e i is the empirical measure of the row ηe i, . 8: end for 9: Matching: Let bℓbe the maximal number such that there is a signed permutation matrix Qe in each domain e [m] such that

Ωαt([(Qe) # b P e]i, [(Qf) # b P f]i) = 1

for all i = 1, . . . , bℓand all f = e, where αt = α/t with t = 2 P

e<f bsebsf. Let b L = {1, . . . , bℓ}.

10: Construct the matrix b B and the tuple of probability measures b P given by

[ b B1Q1] b L [ b B1Q1][bs1]\ b L ... ... [ b Bm Qm] b L [ b Bm Qm][bsm]\ b L

[(Q1) # b P 1] b L [(Q1) # b P 1][bs1]\ b L ... [(Qm) # b P m][bsm]\ b L

11: return (bℓ, b B, b P).

Algorithm 4 Identify Shared Graph Empirical

1: Hyperparameters: γ > 0. 2: Input: Matrix B R|V | ℓ. 3: Output: Parameter matrix b A Rℓ ℓ. 4: Remove rows B i,L with B i,L 2 γ from the matrix B . 5: Find tuples (ik, jk)k L with the smallest possible scores σmin(B {ik,jk},L) such that

(i) ik = jk for all k L and {ik, jk} {iq, jq} = for all k, q L such that k = q and (ii) σmin(B {ik,iq},L)| > γ for all k, q L such that k = q.

6: Let I = {i1, . . . , iℓ} and consider the matrix B I,L Rℓ ℓ. 7: Find two permutation matrices R1 and R2 such that W = R1B I,LR2 is as close as possible to lower triangular. This can be measured, for example, by using P

i<j W 2 ij. 8: Multiply each column of W by the sign of its corresponding diagonal element. This yields a new matrix f W with all diagonal elements positive. 9: Divide each row of f W by its corresponding diagonal element. This yields a new matrix f W with all diagonal elements equal to one. 10: Compute b A = I (f W ) 1. 11: return b A.

2 exp( 2nx2). Given a tuple C = (i1, . . . , im) that defines a shared node, Condition (ii) is an assumption on the number of wrongly matched components. The most extreme case is when the shared node does not actually exist and all components are wrongly matched. That is, the measures b P e ie and b P f if are matched even though d KS(P e ie, P f if ) = 0 and d KS(P e ie, P f if ) = 0 for all e, f [m]. On the other hand, if |E| m, then C determines a shared node where the majority of the components are correctly matched.

If g(n, x) 0 for n and x > 0, the statement of the theorem becomes meaningful under the constraint nmax/nmin 0. In this case, the probability that a given tuple C with wrong components E determines a shared node goes to zero for large sample sizes nmin. As noted, the probability of falsely discovering a shared node decreases exponentially with the number of wrongly matched components |E|. In the extreme case, this means that the probability of falsely discovering shared nodes with all components wrongly matched, i.e., E = [m], decreases exponentially with the number of domains m.

Theorem D.3 also tells us that the probability of falsely matching two measures b P e i and b P f j becomes zero if the sample size grows to infinity and the linear ICA algorithm is consistent. However, with finite samples we might fail to match two measures where the underlying true measures are actually the same, i.e., we falsely reject the true null hypothesis H0. Thus, we might be overly conservative in detecting shared nodes due to a high family-wise error rate caused by multiple testing. We suggest to correct the level α to account for the amount of tests carried out. One possibility is to apply a Bonferroni-type correction. The total number of tests is given by t = 2 P

e<f bsebsf. This means that an adjusted level is given by αt = α/t and instead of the critical value cα we consider the adjusted critical value cαt.

Algorithm 3 is the finite sample version of Algorithm 1 with the matching Ωα defined in Definition D.1. To determine the number of independent components for the linear ICA step in each domain, we need to check the rank(Xe(Xe) ). We specify the rank of a matrix M as number of singular values which are larger than a certain threshold γ and denote it by rankγ(M).

In Algorithm 4 we also provide a finite sample version of Algorithm 2 where we only have the approximation B = b BL BLΨL for a signed permutation matrix ΨL. For a matrix M, we denote by σmin(M) the smallest singular value.

E Error Distributions in Simulations

We specify L = {1, 2, 3}, I1 = {4, 5}, I2 = {6, 7} and I3 = {8, 9}. Note that the set I3 does not exist if the number of domains is m = 2. The error distributions in all simulations are specified as follows if not stated otherwise.

L: ε1 Beta(2, 3), ε2 Beta(2, 5), ε3 χ2 4,

I1: ε4 Gumbel(0, 1), ε5 Log Normal(0, 1),

I2: ε6 Weibull(1, 2), ε7 Exp(0.1),

I3: ε8 Skew Normal(6), ε9 Skew Normal(12),

where the overline means that each distribution is standardized to have mean 0 and variance 1. Figure 6 shows histograms of the empirical distributions.

F Additional Simulation Results

In this section, we make additional experiments. First, we consider another setup where all our assumptions are satisfied but we have more domains and more shared latent variables. Then, we also consider two setups where some of our assumptions are not satisfied.

Different Setup. We make additional experiments on a similar scale as in Section 5, but with more shared nodes and less domain specific nodes. This time, we consider ℓ= 5 shared latent nodes and |Ie| = 1 domain-specific latent node in each domain. Moreover, we also consider m = 4 domains. The dimensions are given by de = d/m for all e [m] and d = 48. The graphs and edge weights

Figure 6: Histograms showing the frequency of 25000 values sampled from the random variables εi with distribution as specified in Appendix E. Each distribution has mean zero and variance one. The first row shows the empirical distributions from ε1 to ε5 and the second row from ε6 to ε9.

(c) Figure 7: Simulation results for ℓ= 5 shared latent nodes. Logarithmic scale on the x-axis. Error bars in (a) are one standard deviation of the mean and in (b) and (c) they are the interquartile range.

are sampled equivalently as in Section 5 in the main paper. We also consider the same distribution of the error variables as specified in Appendix E, where we specify L = {1, 2, 3, 4, 5}, I1 = {6}, I2 = {7}, I3 = {8} and I4 = {9}.

Figure 7 shows the results where the scores are equivalent as in the main paper. Once again, we see that the estimation error for the matrices BL and AL,L decreases with increasing sample size. This supports our proof of concept and shows that the adapted algorithms are consistent for recovering BL and AL,L from finite samples.

Violated Assumptions. We consider the same setup as in the main paper in Section 5 with l = 3 shared latent nodes, but we fix the number of domains to m = 3. In this experiment, we compare the results where data was generated such that all our assumptions are satisfied with two setups where we violate some of the assumptions. In the first setup, we violate Condition (C1) that requires pairwise different error distributions. We specify the error distributions as follows.

L: ε1 Beta(2, 3), ε2 Beta(2, 5), ε3 χ2 4,

I1: ε4 Beta(2, 3), ε5 Log Normal(0, 1),

I2: ε6 Beta(2, 5), ε7 Log Normal(0, 1),

I3: ε8 χ2 4, ε9 Log Normal(0, 1),

where, as before, the overline means that each distribution is standardized to have mean 0 and variance 1. In the second setup, we do not change the error distributions but we violate Condition (C3) that requires two partial pure children per shared latent node. In this experiment, we do not make any sparsity assumptions on the mixing matrix GV,L.

Figure 8 shows the results of our experiments. As expected, we see in (a) and (b) that identifyability of the joint distribution fails if we do not require pairwise different error distributions. Recovering the joint distribution still works well in the second setup where we violate the partial pure children

(c) Figure 8: Simulation results where assumptions are not satisfied. Logarithmic scale on the x-axis. Error bars in (a) are one standard deviation of the mean and in (b) and (c) they are the interquartile range.

conditions. However, identifying the shared latent graph is impossible in this setup, as we explained in Appendix C. This is supported by the experimental results displayed in Figure 8 (c), where we can see that recovery of the shared latent graph does not work when the partial pure children assumption is not satisfied.

G Future Work

We see many directions for future work, which include the following.

Our work and algorithms rely on linear ICA. It would be interesting to study a more direct approach to recover the joint distribution and the causal graph. This might potentially be done by testing certain constraints implied by the model similar as the developments in the Li NGAM literature; see e.g. Shimizu et al. (2011) and Wang and Drton (2020). Our results require non-Gaussianity and that both the latent structural equation model and the mixing functions are linear. We consider the linear setup as a basis for any subsequent study of nonlinear cases. For example, recent advances in non-linear ICA allow identifiability of up to linear transformations, see e.g. Khemakhem et al. (2020), Buchholz et al. (2022) and Roeder et al. (2021). Identifiability of a causal representation might then be obtained from identifiability results for the linear model. Our sufficient condition for identifiability of the shared latent graph requires two partial pure children per shared latent node. In this regard, it would also be interesting to study necessary conditions; c.f. our discussion in Appendix C. This work focused purely on the observational case. However, considering interventional data can be expected to permit relaxing some conditions in both of the key steps, i.e., recovering the joint distribution and the shared latent graph. For example, recent work shows that interventional data allow for identification of the latent graph without sparsity constraints in a single-domain setup (Squires et al., 2023; Ahuja et al., 2023). Extending this to the multi-domain setup is an interesting problem for future work. It would be interesting to study the statistical properties of our setup such as theoretical bounds on the accuracy of recovering the matrices B and AL,L as well as developing algorithms that meet these bounds. Currently, our adapted algorithms for finite samples determine the rank of a matrix by using a threshold for singular values. The algorithms depend on the choice of this parameter and it would be worth studying optimal choices. Moreover, one might consider different methods for determining the rank of a matrix. There might be different matching strategies of the estimated error distributions in the finite sample setting. For example, instead of matching pairwise consistently across all domains as we propose in Appendix D, one might find an optimal matching by solving a linear program.