# multiview_causal_representation_learning_with_partial_observability__d8fee2d0.pdf

Published as a conference paper at ICLR 2024

MULTI-VIEW CAUSAL REPRESENTATION LEARNING WITH PARTIAL OBSERVABILITY

Dingling Yao1,2, Danru Xu3, Sébastien Lachapelle7,8, Sara Magliacane3,6, Perouz Taslakian9, Georg Martius4, Julius von Kügelgen2,5, and Francesco Locatello1

1Institute of Science and Technology Austria 2Max Planck Institute for Intelligent Systems, Tübingen 3University of Amsterdam 4University of Tübingen 5University of Cambridge 6MIT-IBM Watson AI Lab 7Samsung - SAIT AI Lab 8Mila, Université de Montréal 9Service Now Research

We present a unified framework for studying the identifiability of representations learned from simultaneously observed views, such as different data modalities. We allow a partially observed setting in which each view constitutes a nonlinear mixture of a subset of underlying latent variables, which can be causally related. We prove that the information shared across all subsets of any number of views can be learned up to a smooth bijection using contrastive learning and a single encoder per view. We also provide graphical criteria indicating which latent variables can be identified through a simple set of rules, which we refer to as identifiability algebra. Our general framework and theoretical results unify and extend several previous works on multi-view nonlinear ICA, disentanglement, and causal representation learning. We experimentally validate our claims on numerical, image, and multimodal data sets. Further, we demonstrate that the performance of prior methods is recovered in different special cases of our setup. Overall, we find that access to multiple partial views enables us to identify a more fine-grained representation, under the generally milder assumption of partial observability.

1 INTRODUCTION

Discovering latent structure underlying data has been important across many scientific disciplines, spanning neuroscience (Vigário et al., 1997; Brown et al., 2001), communication theory (Ristaniemi, 1999; Donoho, 2006), natural sciences (Wunsch, 1996; Chadan & Sabatier, 2012; Trapnell et al.,

2014), and countless more. The underlying assumption is that many natural phenomena measured by instruments have a simple structure that is lost in raw measurements. In the famous cocktail party problem (Cherry, 1953), multiple speakers talk concurrently, and while we can easily record their overlapping voices, we are interested in understanding what individual people are saying. From the methodological perspective, such inverse problems became common in machine learning with breakthroughs in linear (Comon, 1994; Darmois, 1951; Hyvärinen & Erkki, 2000) and non-linear (Hyvarinen et al., 2019) Independent Component Analysis (ICA), and developed into deep learning methods for disentanglement (Bengio et al., 2013; Higgins et al., 2017). More recently, approaches to causal representation learning (Schölkopf et al., 2021) began relaxing the key assumption of independent latents central to prior work (the independent in ICA), allowing for and discovering (some) hidden causal relations (Brehmer et al., 2022; Lippe et al., 2022; Lachapelle et al., 2022; Zhang et al., 2023; Ahuja et al., 2023; Varici et al., 2023; Squires et al., 2023; von Kügelgen et al., 2023).

This problem is often modeled as a two-stage sampling procedure, where latent variables z are sampled i.i.d. from a distribution pz, and the observations x are functions thereof. Intuitively, the latent variables describe the causal structure underlying a specific environment, and they are only observed through sensor measurements, entangling them via so-called mixing functions . Unfortunately, if these mixing functions are non-linear, the recovery of the latent variables is generally impossible, even if the latent variables are independent (Locatello et al., 2019; Hyvärinen & Pajunen, 1999). Following these negative results, the community has turned to settings that relax the i.i.d. condition in different ways. One particularly successful paradigm has been the assumption that data

Published as a conference paper at ICLR 2024

is not independently sampled, and in fact, multiple observations may refer to the same realization of the latent variables. This multi-view setup has generated a flurry of results in ICA (Gresele et al., 2020; Zimmermann et al., 2021; Pandeva & Forré, 2023), disentanglement (Locatello et al., 2020; Klindt et al., 2021; Fumero et al., 2023; Lachapelle et al., 2023; Ahuja et al., 2022), and causal representation learning (von Kügelgen et al., 2021; Daunhawer et al., 2023; Brehmer et al., 2022).

This paper provides a unified framework for several identifiability results in observational multi-view causal representation learning under partial observability. We assume that different views need not be functions of all the latent variables, but only of some of them. For example, a person may undertake different medical exams, each shedding light on some of their overall health status (assumed constant throughout the measurements) but none offering a comprehensive view. An X-ray may show a broken bone, an MRI how the fracture affected nearby tissues, and a blood sample may inform about ongoing infections. Our framework also allows for an arbitrary number of views, each measuring partially overlapping latent variables. It includes multi-view ICA and disentanglement as special cases.

More technically, we prove that any shared information across arbitrary subsets of views and modalities can be learned up to a smooth bijection using contrastive learning. Non-shared information can also be identified if it is independent of other latent variables. With a single identifiability proof, our result implies the identifiability of several prior works in causal representation learning (von Kügelgen et al., 2021; Daunhawer et al., 2023), non-linear ICA (Gresele et al., 2020), and disentangled representations (Locatello et al., 2020; Ahuja et al., 2022). In addition to weaker assumptions, our framework retains the algorithmic simplicity of prior contrastive multi-view (von Kügelgen et al., 2021) and multimodal (Daunhawer et al., 2023) causal representation learning approaches. Allowing partial observability and arbitrarily many views, our framework is significantly more flexible than prior work, allowing us to identify shared information between all subsets of views and not just their joint intersection.

f1 f2 f3 f4

Figure 1: Multi-View Setting with Partial Observability, for Example 2.1 with K=4 views and N=6 latents. Each view xk is generated by a subset z Sk of the latent variables through a view-specific mixing function fk. Directed arrows between latents indicate causal relations.

We highlight the following contributions:

1. We provide a unified framework for identifiability in observational multi-view causal representation learning with partial observability. This generalizes the multi-view setting in two ways: allowing (i) any arbitrary number of views, and (ii) partial observability with non-linear mixing functions. We prove that any shared information across arbitrary subsets of views and modalities can be learned up to a smooth bijection using contrastive learning and provide straightforward graphical criteria to categorize which latents can be recovered.

2. With a single proof, our result implies the identifiability of several prior works in causal representation learning, nonlinear ICA, and disentangled representations as special cases.

3. We conduct experiments for various unsupervised and supervised tasks and empirically show that (i) the performance of prior works can be recovered using a special setup of our framework and (ii) our method indicates promising disentanglement capabilities with encoder-only networks.

2 PROBLEM FORMULATION

We formalize the data generating process as a latent variable model. Let z = (z1, ..., z N) pz be possibly dependent (causally related) latent variables taking values in Z = Z1 ... ZN, where Z RN is an open, simply connected latent space with associated probability density pz. Instead of directly observing z, we observe a set of entangled measurements or views x := (x1, . . . , x K). Importantly, we assume that each observed view xk may only depend on some of the latent variables, which we call view-specific latents z Sk, indexed by subsets S1, ..., SK [N] = {1, ..., N}. For any A [N], the subset of latent variables z A and corresponding latent sub-space ZA are given by:

z A := {zj : j A}, ZA := j A Zj .

Similarly, for any V [K], the subset of views x V and corresponding observation space XV are:

x V := {xk : k V }, XV := k V Xk .

Published as a conference paper at ICLR 2024

The view-specific mixing functions {fk : ZSk Xk}k [K] are smooth, invertible mappings from the view-specific latent subspaces ZSk to observation spaces Xk Rdim(xk) with xk := fk(z Sk). Formally, the generative process for the views {x1, . . . , x K} is given by:

z pz, xk := fk(z Sk) k [K],

i.e., each view xk depends on latents z Sk through a mixing function fk, as illustrated in Fig. 1.

Assumption 2.1 (General Assumptions). For the latent generative model defined above:

(i) Each view-specific mixing function fk is a diffeomorphism;

(ii) pz is a smooth and continuous density on Z with pz > 0 almost everywhere.

Example 2.1. Throughout, we illustrate key concepts and results using the following example with K=4 views, N=6 latents, and dependencies among the zk shown as a graphical model in Fig. 1.

x1 = f1(z1, z2, z3, z4, z5), x2 = f2(z1, z2, z3, z5, z6), x3 = f3(z1, z2, z3, z4, z6), x4 = f4(z1, z2, z4, z5, z6). (2.1)

Consider a set x V of jointly observed views, and let V := {Vi V : |Vi| 2} be the set of subsets Vi V indexing two or more views. For any subset of views Vi, we refer to the set of shared latent variables (i.e., those influencing each view in the set) as the content or content block" of Vi. Formally, content variables z Ci are obtained as intersections of view-specific indexing sets:

k Vi Sk . (2.2)

Similarly, for each view k V , we can define the non-shared ( style") variables as z Sk\Ci. We use C and z C without subscript to refer to the joint content across all observed views x V .

For Example 2.1, the content of x V1 = (x1, x2) is z C1 = (z1, z2, z3, z5); the content for all four views x = (x1, x2, x3, x4) jointly is z C = (z1, z2), and the style for x1 is z S1\C = (z3, z4, z5).

Remark 2.2 ( Content-Style Terminology). We adopt these terms from von Kügelgen et al. (2021), but note that, in our setting, they are relative to a specific subset of views. Unlike in some prior works (Gresele et al., 2020; von Kügelgen et al., 2021; Daunhawer et al., 2023), style variables are generally not considered irrelevant, but may also be of interest and can sometimes be identified by other means (e.g., from other subsets of views or because they are independent of content).

Our goal is to show that we can simultaneously identify multiple content blocks given a set of jointly observed views under weak assumptions. This extends previous work (Gresele et al., 2020; von Kügelgen et al., 2021; Daunhawer et al., 2023) where only one block of content variables is considered. Isolating the shared content blocks from the rest of the view-specific style information, the learned representation (estimated content) can be used in downstream pipelines, such as classification tasks (Lachapelle et al., 2023; Fumero et al., 2023). In the best case, if each latent component is represented as one individual content block, we can learn a fully disentangled representation (Higgins et al., 2018; Locatello et al., 2020; Ahuja et al., 2022). To this end, we restate the definition of block-identifiability (von Kügelgen et al., 2021, Defn 4.1) for the multi-modal, multi-view setting:

Definition 2.3 (Block-Identifiability). The true content variables c are block-identified by a function g : X Rdim(c) if the inferred content partition ˆc = g(x) contains all and only information about c, i.e., if there exists some smooth invertible mapping h : Rdim(c) Rdim(c) s.t. ˆc = h(c).

Note that the inferred content variables ˆc can be a set of entangled latent variables rather than a single one. This differentiates our paper from the line of work on disentanglement (Locatello et al., 2020; Fumero et al., 2023; Lachapelle et al., 2023), which seek to disentangle individual latent factors and can thus be considered as special cases of our framework with content block sizes equal to one.

3 IDENTIFIABILITY THEORY

High-Level Overview. This section presents a unified framework for studying identifiability from multiple partial views: we start by establishing identifiability of the shared content block z C from any number of partially observed views (Thm. 3.2). The downside of this approach is that if we seek to

Published as a conference paper at ICLR 2024

learn content from different subsets, we need to train an exponential number of encoders for the same modality, one for each subset of views. We, therefore, extend this result and show that by considering any subset of the jointly observed views, various blocks of content variables can be identified by one single view-specific encoder (Thm. 3.8). After recovering multiple content blocks simultaneously, we show in Cors. 3.9 to 3.11 that a qualitative description of the data generative process such as in Fig. 1 can be sufficient to determine exactly the extent to which individual latents or groups thereof can be identified and disentangled. Full proofs are included in App. C.

Definition 3.1 (Content Encoders). Assume that the content size |C| is given for any jointly observed views x V . The content encoders G := {gk : Xk (0, 1)|C|}k V consist of smooth functions mapping from the respective observation spaces to the |C|-dimensional unit cube.

Theorem 3.2 (Identifiability from a Set of Views). Consider a set of views x V satisfying Asm. 2.1, and let G be a set of content encoders (Defn. 3.1) that minimizes the following objective L (G) = P

k<k V E [ gk(xk) gk (xk ) 2] | {z } Content alignment

k V H (gk(xk)) | {z } Entropy regularization

where the expectation is taken w.r.t. p(x V ) and H( ) denotes differential entropy. Then the shared content variable z C := {zj : j C} is block-identified (Defn. 2.3) by gk G for any k V .

Intuition. The alignment enforces the content encoders gk only to encode content and discard styles, while the maximized entropy implies uniformity and thus invertibility. For Example 2.1, recall that the joint content is C = k [4]{Sk} = {1, 2}. Thm. 3.2 then states that, for each k = 1, 2, 3, 4, the content encoders G = {gk : Xk (0, 1)|C|} which minimize the loss in eq. (3.1) are actually invertible mappings of the ground truth content {z1, z2}, i.e., gk(xk) = hk(z1, z2) for smooth invertible functions hk : Z1 Z2 (0, 1)2.

Discussion. Thm. 3.2 provides a learning algorithm to infer one jointly shared content block for all observed views in a set, extending prior results that only consider two views (von Kügelgen et al., 2021; Daunhawer et al., 2023; Locatello et al., 2020). However, to discover another content block Ci w.r.t. a subset of views Vi V as defined in 2, we need to train another set of encoders, since the dimensionality of the content might change. Ideally, we would like to learn one view-specific encoder rk that maps from the observation space Xk to some |Sk|-dimensional manifold and can block-identify all shared contents z Ci using one training run, combined with separate content selectors.

Definition 3.3 (View-Specific Encoders). The view-specific encoders R := {rk : Xk ZSk}k V consist of smooth functions mapping from the respective observation spaces to the view-specific latent space, where the dimension of the kth latent space |Sk| is assumed known for all k V .

Intuition. The view-specific encoders learn all view-related content blocks simultaneously, instead of training a combinatorial number of networks (as would be implied by Thm. 3.2). The view-specific encoders should learn not only a single block of content variables, but instead learn to recover all shared latents in a way that makes it easy to extract various different content blocks using simple readout functions. This is possible by construction, e.g., if each rk learns to invert the ground truth mixing fk. Inspired by this idea, we introduce content selectors.

Definition 3.4 (Selection). A selection operates between two vectors a {0, 1}d , b Rd s.t. a b := [bj : aj = 1, j [d]]

Definition 3.5 (Content Selectors). The content selectors Φ := {ϕ(i,k)}Vi V,k Vi with ϕ(i,k) {0, 1}|Sk| perform selection (Defn. 3.4) on the encoded information: for any subset Vi and view k

Vi we have the selected representation: ϕ(i,k) ˆz Sk = ϕ(i,k) rk(xk), with ϕ(i,k) 0 = ϕ(i,k ) 0 for all Vi V, k, k Vi.

Intuition. Using learnable binary weights Φ, the content selectors ϕ(i,k) should pick out those latents among the representation ˆz Sk extracted by rk that belong to the content block Ci shared among Vi. For Example 2.1, consider a learned representation r1(x1) = (ˆz1, ˆz2, ˆz3, ˆz4, ˆz5). Applying a content selector with weight ϕ(i=1,k=1) = [1, 1, 1, 0, 1] then yields: (ˆz1, ˆz2, ˆz3, ˆz5).

Published as a conference paper at ICLR 2024

What is missing? While aligning various content blocks based on the same representation rk(xk) should promote disentanglement, maximizing the entropy H(rk(xk)) of the learned representation (as in Thm. 3.2) promotes uniformity. The latter implies invertibility of the encoders (Zimmermann et al., 2021), which is necessary for block-identifiability (Defn. 2.3). However, since a uniform representation has independent components by definition, disentanglement and uniformity cannot be achieved simultaneously unless all ground truth latents are mutually independent (a strong assumption we are not willing to make). Thus, to theoretically achieve invertibility while preserving disentanglement, we introduce a set of auxiliary projection functions.

Definition 3.6 (Projections). The set of projections T := {tk}k V consist of functions tk : ZSk (0, 1)|Sk| mapping each view-specific latent space to a hyper unit-cube of the same dimension |Sk|.

Intuition. The projection functions can be understood as mathematical tools: by maximizing the entropy and thus enforcing uniformity of projected representations tk rk(xk), we can show that rk needs to be invertible without interfering with the disentanglement of different content blocks.

What if the content dimension is unknown? In Thm. 3.2 we assumed that the size |C| of the shared content block is known, and the encoders map to a space of dimension |C|. In the following, we do not assume that the content size is given. Instead, we will show that the correct content block can still be discovered by ensuring that as much information as possible is shared across any given subset of views. To this end, we define the following information-sharing regularizer.

Definition 3.7 (Information-Sharing Regularizer). The following regularizer penalizes the L0norm 0 of the content selectors Φ: Reg(Φ) := P

k Vi ϕ(i,k) 0 .

Intuition. Reg(Φ) sums the number of shared latents over Vi V and k Vi. It decreases when ϕ(i,k) contains more ones, i.e., more latents are shared across views k Vi. Thus, Reg(Φ) encourages the encoders to reuse the learned latents and maximize the shared information content.

Theorem 3.8 (View-Specific Encoder for Identifiability). Let R, Φ, T respectively be any viewspecific encoders (Defn. 3.3), content selectors (Defn. 3.1) and projections (Defn. 3.6) that solve the following constrained optimization problem:

min Reg(Φ) subject to: R, Φ, T argmin L (R, Φ, T) (3.2)

L (R, Φ, T) = X

E h ϕ(i,k) rk(xk) ϕ(i,k ) rk (xk ) 2

| {z } Content alignment

k V H (tk rk(xk)) | {z } Entropy

(3.3) Then for any subset of views Vi V and any view k Vi , ϕ(i,k) rk block-identifies (Defn. 2.3) the shared content variables z Ci, as defined in eq. (2.2).

Intuition. For Example 2.1, the representation for x1 obtained by minimizing eq. (3.2) is given by ˆz := r1(x1) = r1(f1(z S1)) = r1 f1(z1, z2, z3, z4, z5). Consider the following two subsets of views V1, V2 V containing x1, but sharing different content blocks C1, C2:

x V1 = {x1, x2} , z C1 = {z1, z2, z3, z5} , x V2 = {x1, x3} , z C2 = {z1, z2, z3, z4} .

Then one of the optimal solutions of the selectors learned by Thm. 3.8 could be

ϕ(i=1,k=1) = [1, 1, 1, 0, 1] , ϕ(i=2,k=1) = [1, 1, 1, 1, 0].

Hence, the composed results of the selectors and the view-specific encoder r1 give:

ϕ(i=1,k=1) ˆz = hi=1,k=1(z1, z2, z3, z5) , ϕ(i=2,k=1) ˆz = hi=2,k=1(z1, z2, z3, z4)

where hi,k=1 is some smooth bijection, for both i = 1, 2.

Discussion. Note that Equation (3.2) can be rewritten as a regularized loss LReg (R, Φ, T) = L (R, Φ, T) + α Reg(Φ) with a sufficiently small regularization coefficient α 0. Overall, Thm. 3.8 further weakens the assumptions of Thm. 3.2 in that no content size is required. However,

Published as a conference paper at ICLR 2024

minimizing the information-sharing regularizer is highly non-convex, and having only a finite number of samples makes finding the global optimum challenging. In practice, we could use Gumbel Softmax (Jang et al., 2016) for unsupervised learning, and consider content sizes as hyper-parameters or follow the approach by Fumero et al. (2023) for supervised classification tasks. Empirically, we will see that some of the requirements that are needed in theory can be realistically dropped, and different approximations are possible, e.g., incorporating problem-specific knowledge.

After discovering various content blocks, we are further interested in how to infer more information from the learned content blocks. For example, can we identify z C3 := {z1, z2, z3} = z C1 C2? This perspective motivates our next results, which focus on how to infer new information based on the previously identified blocks: Identifiability Algebra.

Let z C1, z C2 with C1, C2 [N] be two identified blocks of latents. Then it holds for C1, C2 that:

Corollary 3.9 (Identifiability Algebra: Intersection). The intersection z C1 C2 can be block-identified.

Corollary 3.10 (Identifiability Algebra: Complement). If C1 C2 is independent of C1\C2, then the complement z C1\C2 can be block-identified.

Corollary 3.11 (Identifiability Algebra: Union). If C1 C2, C1\C2 and C2\C1 are mutually independent, then the union z C1 C2 can be block-identified.

Discussion. While Cor. 3.9 refines the identified block of information into smaller intersections, Cors. 3.10 and 3.11 allows to extract style variables as defined w.r.t. some specific views, under the assumption that they are independent of the content block, as discussed by Lyu et al. (2021). However, our setup is more general, as we can not only explain the independent style variables between pairs of observations, but also between learned content representations. Thus, by iteratively applying Cor. 3.10 we can generalize the statement to any number of identified content blocks. Combining Cors. 3.9 to 3.11 we can immediately tell which part can be block-identified from a set of views V, given a graphical model representation such as Fig. 1 and subject to technical assumptions underlying our main results. Applying Cors. 3.9 to 3.11 iteratively on identified blocks can possibly disentangle each individual factors of variation, providing a novel approach for disentanglement. If all variables can be isolated up to element-wise nonlinear transformations, we can learn the causal relations by assuming the original link are nonlinear with additive noise. This exactly corresponds to a post-nonlinear model, whose graphic structure can be further identified using causal discovery algorithms (Zhang & Chan, 2006; Zhang & Hyvärinen, 2009).

4 RELATED WORK AND SPECIAL CASES OF OUR THEORY

Our framework unifies several prior work, including multi-view nonlinear ICA (Gresele et al., 2020), weakly-supervised disentanglement (Locatello et al., 2020; Ahuja et al., 2022) and content-style identification (von Kügelgen et al., 2021; Daunhawer et al., 2023). Tab. 1 shows a summarized (nonexhaustive) list of related works and their respective graphical models that can be considered as special cases. The graphical setups of the individual works can be recovered from our framework (Fig. 1) by varying the number of observed views and causal relations.

In addition, we present a short overview of other related work which can be connected with our theoretical results, including causal representation learning (Sturma et al., 2023; Silva et al., 2006; Adams et al., 2021; Kivva et al., 2021; Cai et al., 2019; Xie et al., 2020; 2022; Morioka & Hyvärinen, 2023; Morioka & Hyvarinen, 2023), mutual information-based contrastive learning (Tian et al., 2020; Tsai et al., 2020; Tosh et al., 2021), latent correlation maximization (Andrew et al., 2013; Benton et al., 2017; Lyu & Fu, 2020; Lyu et al., 2021), nonlinear ICA without auxiliary variables (Willetts & Paige, 2021) and multitask disentanglement with sparse classifiers (Lachapelle et al., 2023; Fumero et al., 2023). Further discussion is given in App. B. We remark that several approaches here consider the setting where two observations are generated through an intervention on some latent variable(s). This is sometimes written in the graphical model as two nodes connected by an arrow (shown in the graphs in Tab. 1 as dashed lines 99K) indicating the preand post-intervention versions of the same variable(s). We stress that this does not constitute an example of partial observability. In our setting, latent variables can be simply unobserved, regardless of whether or not they were subject to an intervention.

Causal representation learning. In the context of causal representation learning (CRL), Sturma et al. (2023) also explicitly consider partial observability in a linear, multi-domain setting. Several other works on linear CRL from i.i.d. data could also be viewed as assuming partial observability, since

Published as a conference paper at ICLR 2024

Table 1: A non-exhaustive summary of special cases of our theory and their graphical models. An asterisk ( ) indicates works that have view-specific latents that are not of interest for identifiability.

Method Graph Dependent Latents Multi-Modal Partial Observability > 2 Views

Schölkopf et al. (2016)

Gresele et al. (2020)

i = 1, . . . , K

Locatello et al. (2020)

z1 z2 z3 z3

Ahuja et al. (2022)

z1 z2 z1 z2

von Kügelgen et al. (2021)

Daunhawer et al. (2023)

Ours Fig. 1

they often rely on graphical conditions which enforce each measured variable to depend on a single (a pure child) or only a few latents (Silva et al., 2006; Adams et al., 2021; Kivva et al., 2021; Cai et al., 2019; Xie et al., 2020; 2022). In our framework, each view xk instead constitutes a nonlinear mixture of several latents. Merging partially observed causal structure has been studied without a representation learning component by Gresele et al. (2022); Mejia et al. (2022); Guo et al. (2023).

Mutual Information-based Contrastive Learning. Tian et al. (2020); Tsai et al. (2020); Tosh et al. (2021) empirically showcase the success of contrastive learning in extracting task-related information across multiple views, if the augmented views are redundant to the original data regarding task-related information (Tian et al., 2020). From this point of view, the redundant task-information can be interpreted as shared content between the views, for which our theory (Thms. 3.2 and 3.8) may provide theoretical explanations for the improved performance in downstream tasks.

Latent Correlation Maximization. Prior work (Andrew et al., 2013; Benton et al., 2017; Lyu & Fu, 2020; Lyu et al., 2021) showed that maximizing the correlation between the learned representation is equivalent to our content alignment principle (eq. (3.1)). The additional invertibility constraint on the learned encoder in their setting is enforced by entropy regularization (eq. (3.1)), as explained by Zimmermann et al. (2021). However, their theory is limited to pairs of views and full observability, while we generalize it to any number of partially observed views.

Nonlinear ICA without Auxiliary Variables. Willetts & Paige (2021) shows nonlinear ICA problem can be solved using non-observable, learnable, clustering task variables u, to replace the observed auxiliary variable in other nonlinear ICA approaches (Hyvarinen et al., 2019). While we explicitly require the learned representation to be aligned in a continuous space within the content block, Willetts & Paige (2021) impose a soft alignment constraint to encourage the encoded information to be similar within a cluster. In practice, the soft alignment requirement can be easily coded in our framework by relaxing the content alignment with an equivalence class in terms of cluster membership.

Multi-task Disentanglement with Sparse Classifiers. Our setup is slightly different from that of Lachapelle et al. (2023); Fumero et al. (2023) as they focus on multiple classification tasks using shared encoding and sparse linear readouts. Their sparse classifier head jointly enforces the sufficient representation (regarding the specific classification task, while we aim for the invertibility of the encoders) and a soft alignment up to a linear equivalence class (relaxing our hard alignment). However, the identifiability principles we use are similar: sufficient representation (entropy regularization), alignment and information sharing. While our results can be easily extended to allow for alignment up to a linear equivalence class, their identifiability theory crucially only covers independent latents.

Published as a conference paper at ICLR 2024

5 EXPERIMENTS

First, we validate Thms. 3.2 and 3.8 using numerical simulations in a fully controlled synthetic setting. Next, we conduct experiments on visual (and text) data demonstrating different special cases that are unified by our theoretical framework ( 4) and how we extend them. We use Info NCE (Oord et al., 2018) and Barlow Twins (Zbontar et al., 2021) to estimate eqs. (3.1) and (3.2). The content alignment is computed by the numerator (positive pairs) in Info NCE and the entropy regularization is estimated by the denominator (negative pairs). Further remarks on contrastive learning and entropy regularization are in App. E. For the evaluation, we follow a standard evaluation protocol (von Kügelgen et al., 2021) and predict the ground truth latents from the learned representation gk(xk), using kernel ridge regression for continuous latent variables, and logistic regression for discrete ones, respectively. Then, we report the coefficient of determination R2 to show the correlation between the learned and ground truth latent variables. An R2 close to one between the learned and ground truth variables means that the learned variables are modelling correctly the ground truth, indicating block-identifiability (Defn. 2.3). However, R2 is limited as a metric, since any style variable that strongly depends on a content variable would also become predictable, thus showing a high R2 score.

5.1 NUMERICAL EXPERIMENT: THEORY VALIDATION

{x1, x2, x3}

{x1, x2, x4}

{x1, x3, x4}

{x2, x3, x4}

{x1, x2, x3, x4}

z1 z2 z3 z4 z5 z6

1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

1.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0

0.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0

1.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0

0.0 0.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0

Figure 2: Theory Validation: Average R2 across multiple views generated from independent latents.

Experimental Setup. We generate synthetic data following eq. (2.1). The latent variables are sampled from a Gaussian distribution z N(0, Σz), where possible causal dependencies can be encoded through Σz. The view-specific mixing functions fk are implemented by randomly initialized invertible MLPs for each view k {1, . . . 4}. We report here the R2 scores for the case of independent variables, because it is easier to interpret than the R2 scores in the causally dependent case, for which we show that the learned representation still contains all and only the content information in App. D.1.

Discussion. Fig. 2 shows how the averaged R2 changes when including more views, with the y-axis denoting the ground truth latents and the x-axis showing learned representation from different subsets of views. As shown in Example 2.1 and Fig. 1, the content variables are consistently identified, having R2 1, while the independent style variables are non-predictable (R2 0). This numerical result shows that the learned representation explains almost all variation in the content block but nothing from the independent styles, which validates Thms. 3.2 and 3.8.

5.2 SELF-SUPERVISED DISENTANGLEMENT

Table 2: Self-Supervised Disentanglement Performance Comparison on MPI-3D complex (Gondal et al., 2019) and 3DIdent (Zimmermann et al., 2021), between our method and Ada-GVAE (Locatello et al., 2020).

DCI disentanglement MPI3D complex 3DIdent

Ada-GVAE 0.11 0.008 0.09 0.019 Ours 0.42 0.020 0.30 0.04

Experimental Setup. We compare our method (Thm. 3.8) with Ada-GVAE (Locatello et al., 2019), on MPI-3D complex (Gondal et al., 2019) and 3DIdent (Zimmermann et al., 2021) image datasets. We did not compare with Ahuja et al. (2022), since their method needs to know which latent is perturbed, even when guessing the offset. We experiment on a pair of views (x1, x2) where the second view x2 is obtained by randomly perturbing a subset of latent factors of x1, following (Locatello et al., 2019). We provide more details about the datasets and the experiment setup in App. D.2. As shown in Tab. 2, our method outperformed the autoencoder-based Ada-GVAE (Locatello et al., 2020), using only an encoder and computing contrastive loss in the latent space.

Discussion. As both methods are theoretically identifiable, we hypothesize that the improvement comes from avoiding reconstructing the image, which is more difficult on visually complex data. This hypothesis is supported by the fact that self-supervised contrastive learning has far exceeded the performance of autoencoder-based representation learning in both classification tasks (Chen et al., 2020; Caron et al., 2021; Oquab et al., 2023) and object discovery (Seitzer et al., 2022).

Published as a conference paper at ICLR 2024

von Kügelgen

etal. (2021)

Daunhawer etal. (2023)

Cx0,x1 Cx0,x2 Cx1,x2 Cx0,x1,x2 Selected Blocks

content style object shape object ypos object xpos spotlight pos object betarot spotlight color object gammarot background color object alpharot object color

von Kügelgen

etal. (2021)

Daunhawer etal. (2023)

Cx0,x1 Cx0,x2 Cx1,x2 Cx0,x1,x2 Selected Blocks

content style object shape object ypos object xpos object color index text phrasing

Figure 3: Simultaneous Multi-Content Identification using View-Specific Encoders. Experimental results on Multimodal3DIdent. Left: Image latents (averaged between two image views) Right: Text latents.

5.3 MULTI-MODAL CONTENT-STYLE IDENTIFIABILITY UNDER PARTIAL OBSERVABILITY Experimental setup. We experiment on a set of three views (img0, img1, txt0) extending both (Daunhawer et al., 2023; von Kügelgen et al., 2021), which are limited to two views, either two images or one image and its caption. The second image view img1 is generated by perturbing a subset of latents of img0 as in (von Kügelgen et al., 2021). Notice that this setup provides perfect partial observability, because the text is generated using text-specific modality variables that are not involved in any image views e.g., text phrasing. We train view-specific encoders to learn all content blocks simultaneously and predict individual latent variables from the each learned content blocks. We assume access to the ground truth content indices to better match the baselines, but we relax this in App. D.4.

Discussion. Fig. 3 reports the R2 on the ground truth latent values, predicted from the simultaneously learned multiple content blocks (Cx0,x1, Cx0,x2, Cx1,x2, Cx0,x1,x2, respectively). We remark that this single experiment recovers both experimental setups from (von Kügelgen et al., 2021, Sec 5.2), (Daunhawer et al., 2023, Sec 5.2): Cx0,x1 represents the content block from the image pairs (img0, img1), which aligns with the setting in (von Kügelgen et al., 2021) and Cx0,x2 shows the content block from the multi-modal pair (img0, txt0), which is studied by Daunhawer et al. (2023). We observe that the same performance for both prior works (von Kügelgen et al., 2021; Daunhawer et al., 2023) has been successfully reproduced from our single training process, which verifies the effectiveness and efficiency of Thm. 3.8. Extended evaluation and more experimental details are provided in App. D.4.

5.4 MULTI-TASK DISENTANGLEMENT Experimental setup We follow Example 2.1 with latent causal relations to verify that: (i) the improved classification performance from (Lachapelle et al., 2023; Fumero et al., 2023) originates from the fact that the task-related information is shared across multiple views (different observations from the same class) and (ii) this information can be identified (Thm. 3.2), even though the latent variables are not independent. This explains the good performance of (Fumero et al., 2023) on real-world data sets, where the latent variables are likely not independent, violating their theory.

Discussion. We synthetically generate the labels by linear/nonlinear labeling functions on the shared content values to resemble (Lachapelle et al., 2023; Fumero et al., 2023). As expected, the learned representation significantly eases the classification task and achieves an accuracy of 0.99 with linear and nonlinear labeling functions within 1k update steps, even with latent causal relations. This experimental result justifies that the success in the empirical evaluation of (Fumero et al., 2023) can be explained by our theoretical framework, as discussed in 4.

6 DISCUSSION AND CONCLUSION

This paper revisits the problem of identifying possibly dependent latent variables under multiple partial non-linear measurements. Our theoretical results extend to an arbitrary number of views, each potentially measuring a strict subset of the latent variables. In our experiments, we validate our claims and demonstrate how prior work can be obtained as a special case of our setting. While our assumptions are relatively mild, we still have notable gaps between theory and practice, thoroughly discussed in App. E. In particular, we highlight discrete variables and finite-sample errors as common gaps, which we address only empirically. Interestingly, our work offers potential connections with work in the causality literature (Triantafillou et al., 2010; Gresele et al., 2022; Mejia et al., 2022; Guo et al., 2023). Discovering hidden causal structures from overlapping but not simultaneously observed marginals (e.g., via views collected in different experimental studies at different times) remains open for future works.

Published as a conference paper at ICLR 2024

REPRODUCIBILITY STATEMENT The datasets used in ( 5) are published by Gondal et al. (2019); Zimmermann et al. (2021); von Kügelgen et al. (2021); Daunhawer et al. (2023). Results provided in the experiments section ( 5) can be reproduced using the implementation details provided in App. D. The code is available at https://github.com/Causal Learning AI/multiview-crl. The part of implementation to replicate the experiments of Fumero et al. (2023) in 5.4 was kindly provided by the authors upon request, and we do not include it in the git repository.

ACKNOWLEDGEMENTS This work was initiated at the Second Bellairs Workshop on Causality held at the Bellairs Research Institute, January 6 13, 2022; we thank all workshop participants for providing a stimulating research environment. Further, we thank Cian Eastwood, Luigi Gresele, Stefano Soatto, Marco Bagatella and A. René Geist for helpful discussion. GM is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 Project number 390727645. Jv K and GM acknowledge support from the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039B). The research of DX and SM was supported by the Air Force Office of Scientific Research under award number FA8655-22-1-7155. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force. We also thank SURF for the support in using the Dutch National Supercomputer Snellius. SL was supported by an IVADO excellence Ph D scholarship and by Samsung Electronics Co., Ldt. DY was supported by an Amazon fellowship, the International Max Planck Research School for Intelligent Systems (IMPRS-IS) and the ISTA graduate school. Work done outside of Amazon.

Jeffrey Adams, Niels Hansen, and Kun Zhang. Identification of partially observed linear causal models: Graphical conditions for the non-gaussian and heterogeneous cases. Advances in Neural Information Processing Systems, 34:22822 22833, 2021. 6, 7

Kartik Ahuja, Jason S Hartford, and Yoshua Bengio. Weakly supervised representation learning with sparse perturbations. Advances in Neural Information Processing Systems, 35:15516 15528, 2022. 2, 3, 6, 7, 8, 17

Kartik Ahuja, Divyat Mahajan, Yixin Wang, and Yoshua Bengio. Interventional causal representation learning. In International Conference on Machine Learning, pp. 372 407. PMLR, 2023. 1

Rana Ali Amjad and Bernhard C. Geiger. Learning representations for neural network-based classification using the information bottleneck principle. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(9):2225 2239, sep 2020. 18

Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In Sanjoy Dasgupta and David Mc Allester (eds.), Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp. 1247 1255, Atlanta, Georgia, USA, 17 19 Jun 2013. PMLR. 6, 7, 18

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798 1828, 2013. 1

Adrian Benton, Huda Khayrallah, Biman Gujral, Dee Ann Reisinger, Sheng Zhang, and Raman Arora. Deep generalized canonical correlation analysis. ar Xiv preprint ar Xiv:1702.02519, 2017. 6, 7, 18

Johann Brehmer, Pim De Haan, Phillip Lippe, and Taco S Cohen. Weakly supervised causal representation learning. Advances in Neural Information Processing Systems, 35:38319 38331, 2022. 1, 2

Glen D Brown, Satoshi Yamada, and Terrence J Sejnowski. Independent component analysis at the neural cocktail party. Trends in neurosciences, 24(1):54 63, 2001. 1

Simon Buchholz, Goutham Rajendran, Elan Rosenfeld, Bryon Aragam, Bernhard Schölkopf, and Pradeep Ravikumar. Learning linear causal representations from interventions under general nonlinear mixing. ar Xiv preprint ar Xiv:2306.02235, 2023. 27, 32

Published as a conference paper at ICLR 2024

Ruichu Cai, Feng Xie, Clark Glymour, Zhifeng Hao, and Kun Zhang. Triad constraints for learning causal structure of latent variables. In Advances in Neural Information Processing Systems, volume 32, pp. 12883 12892, 2019. 6, 7

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the International Conference on Computer Vision (ICCV), 2021. 8

Khosrow Chadan and Pierre C Sabatier. Inverse problems in quantum scattering theory. Springer Science & Business Media, 2012. 1

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597 1607. PMLR, 2020. 8, 31

E Colin Cherry. Some experiments on the recognition of speech, with one and with two ears. The Journal of the acoustical society of America, 25(5):975 979, 1953. 1

Pierre Comon. Independent component analysis, a new concept? Signal processing, 36(3):287 314, 1994. 1

George Darmois. Analyse des liaisons de probabilité. In Proc. Int. Stat. Conferences 1947, pp. 231, 1951. 1, 19, 23

Imant Daunhawer, Alice Bizeul, Emanuele Palumbo, Alexander Marx, and Julia E Vogt. Identifiability results for multimodal contrastive learning. In The Eleventh International Conference on Learning Representations, 2023. 2, 3, 4, 6, 7, 9, 10, 18, 22, 26, 29, 30, 31

David L Donoho. Compressed sensing. IEEE Transactions on information theory, 52(4):1289 1306, 2006. 1

Marco Fumero, Florian Wenzel, Luca Zancato, Alessandro Achille, Emanuele Rodolá, Stefano Soatto, Bernhard Schölkopf, and Francesco Locatello. Leveraging sparse and shared feature activations for disentangled representation learning, 2023. 2, 3, 6, 7, 9, 10, 18, 30, 31

Muhammad Waleed Gondal, Manuel Wuthrich, Djordje Miladinovic, Francesco Locatello, Martin Breidt, Valentin Volchkov, Joel Akpo, Olivier Bachem, Bernhard Schölkopf, and Stefan Bauer. On the transfer of inductive bias from simulation to the real world: a new disentanglement dataset. Advances in Neural Information Processing Systems, 32, 2019. 8, 10, 27

Luigi Gresele, Paul K Rubenstein, Arash Mehrjou, Francesco Locatello, and Bernhard Schölkopf. The incomplete rosetta stone problem: Identifiability results for multi-view nonlinear ica. In Uncertainty in Artificial Intelligence, pp. 217 227. PMLR, 2020. 2, 3, 6, 7, 17

Luigi Gresele, Julius von Kügelgen, Jonas Kübler, Elke Kirschbaum, Bernhard Schölkopf, and Dominik Janzing. Causal inference through the structural causal marginal problem. In International Conference on Machine Learning, pp. 7793 7824. PMLR, 2022. 7, 9, 31

Siyuan Guo, Jonas Wildberger, and Bernhard Schölkopf. Out-of-variable generalization. ar Xiv preprint ar Xiv:2304.07896, 2023. 7, 9

Simon Haykin. Neural networks: a comprehensive foundation. Prentice Hall PTR, 1994. 26

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770 778, 2016. 27

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International conference on learning representations, 2017. 1

Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, and Alexander Lerchner. Towards a definition of disentangled representations. ar Xiv preprint ar Xiv:1812.02230, 2018. 3

Published as a conference paper at ICLR 2024

Aapo Hyvärinen and Oja Erkki. Independent component analysis: algorithms and applications. Neural networks, 13(4-5):411 430, 2000. 1

Aapo Hyvärinen and Petteri Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural networks, 12(3):429 439, 1999. 1, 17

Aapo Hyvarinen, Hiroaki Sasaki, and Richard E. Turner. Nonlinear ica using auxiliary variables and generalized contrastive learning, 2019. 1, 7, 17

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. ar Xiv preprint ar Xiv:1611.01144, 2016. 6

Ilyes Khemakhem, Ricardo Monti, Diederik Kingma, and Aapo Hyvarinen. Ice-beem: Identifiable conditional energy-based deep models based on nonlinear ica. volume 33, pp. 12768 12778, 2020. 27

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014. 27

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013. 17

Bohdan Kivva, Goutham Rajendran, Pradeep Ravikumar, and Bryon Aragam. Learning latent causal graphs via mixture oracles. In Advances in Neural Information Processing Systems, volume 34, pp. 18087 18101, 2021. 6, 7

Bohdan Kivva, Goutham Rajendran, Pradeep Ravikumar, and Bryon Aragam. Identifiability of deep generative models without auxiliary information. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 15687 15701. Curran Associates, Inc., 2022. 17

David A Klindt, Lukas Schott, Yash Sharma, Ivan Ustyuzhaninov, Wieland Brendel, Matthias Bethge, and Dylan Paiton. Towards nonlinear disentanglement in natural data with temporal sparse coding. In International Conference on Learning Representations, 2021. 2

Lingjing Kong, Shaoan Xie, Weiran Yao, Yujia Zheng, Guangyi Chen, Petar Stojanov, Victor Akinwande, and Kun Zhang. Partial disentanglement for domain adaptation. In International Conference on Machine Learning, pp. 11455 11472. PMLR, 2022. 18

Sébastien Lachapelle, Rodriguez Lopez, Pau, Yash Sharma, Katie E. Everett, Rémi Le Priol, Alexandre Lacoste, and Simon Lacoste-Julien. Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ICA. In First Conference on Causal Learning and Reasoning, 2022. 1

Sébastien Lachapelle, Tristan Deleu, Divyat Mahajan, Ioannis Mitliagkas, Yoshua Bengio, Simon Lacoste-Julien, and Quentin Bertrand. Synergies between disentanglement and sparsity: Generalization and identifiability in multi-task learning. In International Conference on Machine Learning, pp. 18171 18206. PMLR, 2023. 2, 3, 6, 7, 9, 18, 30

Wendong Liang, Armin Keki c, Julius von Kügelgen, Simon Buchholz, Michel Besserve, Luigi Gresele, and Bernhard Schölkopf. Causal component analysis. ar Xiv preprint ar Xiv:2305.17225, 2023. 32

Phillip Lippe, Sara Magliacane, Sindy Löwe, Yuki M Asano, Taco Cohen, and Stratis Gavves. Citris: Causal identifiability from temporal intervened sequences. In International Conference on Machine Learning, pp. 13557 13603. PMLR, 2022. 1

Yuhang Liu, Zhen Zhang, Dong Gong, Mingming Gong, Biwei Huang, Anton van den Hengel, Kun Zhang, and Javen Qinfeng Shi. Identifying weight-variant latent causal models. ar Xiv preprint ar Xiv:2208.14153, 2022. 17

Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In international conference on machine learning, pp. 4114 4124. PMLR, 2019. 1, 8

Published as a conference paper at ICLR 2024

Francesco Locatello, Ben Poole, Gunnar Raetsch, Bernhard Schölkopf, Olivier Bachem, and Michael Tschannen. Weakly-supervised disentanglement without compromises. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 6348 6359. PMLR, 13 18 Jul 2020. 2, 3, 4, 6, 7, 8, 17, 27

Qi Lyu and Xiao Fu. Nonlinear multiview analysis: Identifiability and neural network-assisted implementation. IEEE Transactions on Signal Processing, 68:2697 2712, 2020. 6, 7, 18

Qi Lyu, Xiao Fu, Weiran Wang, and Songtao Lu. Understanding latent correlation-based multiview learning and self-supervision: An identifiability perspective. ar Xiv preprint ar Xiv:2106.07115, 2021. 6, 7, 18, 26

Sergio H Garrido Mejia, Elke Kirschbaum, and Dominik Janzing. Obtaining causal information by merging datasets with maxent. In International Conference on Artificial Intelligence and Statistics, pp. 581 603. PMLR, 2022. 7, 9

Nathan Juraj Michlo. Disent - a modular disentangled representation learning framework for pytorch. Github, 2021. 27

Hiroshi Morioka and Aapo Hyvärinen. Causal representation learning made identifiable by grouping of observational variables. ar Xiv preprint ar Xiv:2310.15709, 2023. 6

Hiroshi Morioka and Aapo Hyvarinen. Connectivity-contrastive learning: Combining causal discovery and representation learning for multimodal data. In Francisco Ruiz, Jennifer Dy, and Jan-Willem van de Meent (eds.), Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pp. 3399 3426. PMLR, 25 27 Apr 2023. 6

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. ar Xiv preprint ar Xiv:1807.03748, 2018. 8, 31

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. ar Xiv preprint ar Xiv:2304.07193, 2023. 8

Teodora Pandeva and Patrick Forré. Multi-view independent component analysis with shared and individual sources. In Uncertainty in Artificial Intelligence, pp. 1639 1650. PMLR, 2023. 2

Tapani Ristaniemi. On the performance of blind source separation in cdma downlink. In Proceedings of the International Workshop on Independent Component Analysis and Signal Separation (ICA 99), pp. 437 441, 1999. 1

Bernhard Schölkopf, David W Hogg, Dun Wang, Daniel Foreman-Mackey, Dominik Janzing, Carl Johann Simon-Gabriel, and Jonas Peters. Modeling confounding by half-sibling regression. Proceedings of the National Academy of Sciences, 113(27):7391 7398, 2016. 7

Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612 634, 2021. 1

Maximilian Seitzer, Max Horn, Andrii Zadaianchuk, Dominik Zietlow, Tianjun Xiao, Carl-Johann Simon-Gabriel, Tong He, Zheng Zhang, Bernhard Schölkopf, Thomas Brox, et al. Bridging the gap to real-world object-centric learning. In The Eleventh International Conference on Learning Representations, 2022. 8

Ricardo Silva, Richard Scheines, Clark Glymour, Peter Spirtes, and David Maxwell Chickering. Learning the structure of linear latent variable models. Journal of Machine Learning Research, 7 (2), 2006. 6, 7

Chandler Squires, Anna Seigal, Salil S. Bhate, and Caroline Uhler. Linear causal disentanglement via interventions. In International Conference on Machine Learning, volume 202, pp. 32540 32560. PMLR, 2023. 1

Published as a conference paper at ICLR 2024

Nils Sturma, Chandler Squires, Mathias Drton, and Caroline Uhler. Unpaired multi-domain causal representation learning, 2023. 6

Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning? Advances in neural information processing systems, 33:6827 6839, 2020. 6, 7, 18

Christopher Tosh, Akshay Krishnamurthy, and Daniel Hsu. Contrastive learning, multi-view redundancy, and linear models. In Algorithmic Learning Theory, pp. 1179 1206. PMLR, 2021. 6, 7, 18

Cole Trapnell, Davide Cacchiarelli, Jonna Grimsby, Prapti Pokharel, Shuqiang Li, Michael Morse, Niall J Lennon, Kenneth J Livak, Tarjei S Mikkelsen, and John L Rinn. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nature biotechnology, 32(4):381 386, 2014. 1

Sofia Triantafillou, Ioannis Tsamardinos, and Ioannis Tollis. Learning causal structure from overlapping variable sets. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 860 867, 2010. 9, 31

Yao-Hung Hubert Tsai, Yue Wu, Ruslan Salakhutdinov, and Louis-Philippe Morency. Self-supervised learning from a multi-view perspective. In International Conference on Learning Representations, 2020. 6, 7, 18

Burak Varici, Emre Acarturk, Karthikeyan Shanmugam, Abhishek Kumar, and Ali Tajer. Score-based causal representation learning with interventions. ar Xiv preprint ar Xiv:2301.08230, 2023. 1

Ricardo Vigário, Veikko Jousmäki, Matti Hämäläinen, Riitta Hari, and Erkki Oja. Independent component analysis for identification of artifacts in magnetoencephalographic recordings. Advances in neural information processing systems, 10, 1997. 1

Julius von Kügelgen, Yash Sharma, Luigi Gresele, Wieland Brendel, Bernhard Schölkopf, Michel Besserve, and Francesco Locatello. Self-supervised learning with data augmentations provably isolates content from style. Advances in neural information processing systems, 34:16451 16467, 2021. 2, 3, 4, 6, 7, 8, 9, 10, 18, 20, 22, 26, 29, 30, 31

Julius von Kügelgen, Michel Besserve, Wendong Liang, Luigi Gresele, Armin Keki c, Elias Bareinboim, David M Blei, and Bernhard Schölkopf. Nonparametric identifiability of causal representations from unknown interventions. ar Xiv preprint ar Xiv:2306.00542, 2023. 1, 27, 32

Matthew Willetts and Brooks Paige. I don t need u: Identifiable non-linear ica without side information. ar Xiv preprint ar Xiv:2106.05238, 2021. 6, 7, 17

Carl Wunsch. The ocean circulation inverse problem. Cambridge University Press, 1996. 1

Feng Xie, Ruichu Cai, Biwei Huang, Clark Glymour, Zhifeng Hao, and Kun Zhang. Generalized independent noise condition for estimating latent variable causal graphs. In Advances in Neural Information Processing Systems, volume 33, pp. 14891 14902, 2020. 6, 7

Feng Xie, Biwei Huang, Zhengming Chen, Yangbo He, Zhi Geng, and Kun Zhang. Identification of linear non-gaussian latent hierarchical structure. In International Conference on Machine Learning, pp. 24370 24387. PMLR, 2022. 6, 7

Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. ar Xiv preprint ar Xiv:1505.00853, 2015. 26

Jure Zbontar, Li Jing, Ishan Misra, Yann Le Cun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. ar Xiv preprint ar Xiv:2103.03230, 2021. 8, 28, 29, 30, 31

Jiaqi Zhang, Chandler Squires, Kristjan Greenewald, Akash Srivastava, Karthikeyan Shanmugam, and Caroline Uhler. Identifiability guarantees for causal disentanglement from soft interventions. ar Xiv preprint ar Xiv:2307.06250, 2023. 1

K Zhang and A Hyvärinen. On the identifiability of the post-nonlinear causal model. In 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009), pp. 647 655. AUAI Press, 2009. 6

Published as a conference paper at ICLR 2024

Kun Zhang and Lai-Wan Chan. Extensions of ica for causality discovery in the hong kong stock market. In International Conference on Neural Information Processing, pp. 400 409. Springer, 2006. 6

Yujia Zheng, Ignavier Ng, and Kun Zhang. On the identifiability of nonlinear ICA: Sparsity and beyond. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. 18

Roland S. Zimmermann, Yash Sharma, Steffen Schneider, Matthias Bethge, and Wieland Brendel. Contrastive learning inverts the data generating process. Proceedings of Machine Learning Research, 139:12979 12990, 2021. 2, 5, 7, 8, 10, 17, 19, 22, 24, 25, 26, 27, 29

Published as a conference paper at ICLR 2024

Table of Contents

A Notation and Terminology 16

B Related work and special cases of our theory 16

C Proofs 18 C.1 Proof for Thm. 3.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 C.2 Proof for Thm. 3.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 C.3 Proofs for Identifiability Algebra . . . . . . . . . . . . . . . . . . . . . . . . . 26

D Experimental Results 26 D.1 Numerical Experiment Theory Validation . . . . . . . . . . . . . . . . . . . . 26 D.2 Self-Supervised Disentanglement . . . . . . . . . . . . . . . . . . . . . . . . . 27 D.3 Content-Style Identifiability on Images . . . . . . . . . . . . . . . . . . . . . . 29 D.4 Multi-modal Content-Style Identifiability under Partial Observability . . . . . . 29 D.5 Multi-Task Disentanglement with Sparse Classifiers . . . . . . . . . . . . . . . 30

E Discussion 31

A NOTATION AND TERMINOLOGY

C Index set for shared content variables

z C Shared content variables

V Collection of subset of views from V

Vi Index set for subset of views in set of views V

Ci Index set for shared content variables from subset of views Vi V

z Ci Shared content variables from subset of views Vi

N Number of latents

K Number of views

j Index for latent variables

k Index for views

Z Latent space

Xk Observational space for k-th view

xk Observed k-th view

Sk Index set for k-th view-specific latents

V {1, . . . , l}

B RELATED WORK AND SPECIAL CASES OF OUR THEORY

We present our identifiability results from 3 as a unified framework implying several prior works in multi-view nonlinear ICA, disentanglement, and causal representation learning.

Published as a conference paper at ICLR 2024

Multi-View Nonlinear ICA Gresele et al. (2020) extend the idea of nonlinear ICA introduced by Hyvarinen et al. (2019, Sec. 3) by allowing a more flexible relationship between the latents and the auxiliary variables: instead of imposing conditional independence the shared source information c on some auxiliary observed variables, Gresele et al. (2020) associate the shared information source with some view-specific noise variable ni through some smooth mapping gk:

xk = fk(gk(c, nk)), k [K].

Define the composition of the view-specific function fk and the noise corruption function gk as a new mixing function fk := fk gk., then each view xk, k [K] is generated by a view-specific mixing function fk which takes the shared content c and some additional unobserved noise variable nk as input, that is: xk = fk(c, nk). In this case, the shared source information s together with the view-specific latent noise nk defines the view-specific latents Sk in our notation. The shared source c corresponds to our content variables. Our results Thm. 3.2 can be considered as a generalized version of Gresele et al. (2020, Theorem 8) for multiple views, by removing the additivity constraint of the corruption function g (Gresele et al., 2020, Sec 3.3).

Willetts & Paige (2021); Kivva et al. (2022); Liu et al. (2022) show the nonlinear ICA problem can be solved using non-observable, learnable, clustering task variables u to replace the observed the auxiliary variable in the traditional nonlinear ICA literature (Hyvärinen & Pajunen, 1999). Conditioning on the same latents on various clustering task enforces recovering the true latent factors (up to a bijective mapping). The idea of utilizing the clustering task goes hand in hand with contrastive self-supervised learning. Clustering itself can be considered as a soft relaxation of our hard global invariance condition in eq. (3.2), in the sense they enforce the shared, task-relevant features to be similar within a cluster but not necessarily having the exact same value.

Weakly-Supervised Representation Learning Locatello et al. (2020) consider a pair of views (e.g., images) (x1, x2) where x2 is obtained by perturbing a subset of the generating factors of x1. Formally,

x1 = f(z) x2 = f( z) z, z Rd, (B.1)

where z S = z S while z S = z S for some subset of latents S [d]. In this case, z S is the shared content between the pair of views x1, x2. According to the adaptive algorithm (Sec 4.1 Locatello et al., 2020), the shared content is computed by averaging the encoded representation from x1, x2 across the shared dimensions, that is:

g(xk)j a(g(x1)j, g(x2)j) j S (B.2)

By substituting the extracted representation using the averaged value, Locatello et al. (2020) achieve the same invariance condition as enforced in the first term of eq. (3.1). The amortized encoder g is trained to maximize the ELBO (Kingma & Welling, 2013) over the pair of views, which is equivalent to minimizing the reconstruction loss

E [x dec(g(x1))] + E [x2 dec(g(x2))] . (B.3)

The reconstruction loss is minimized when the compression is lossless, or equivalently, the learned representation is uniformly distributed (Zimmermann et al., 2021). The uniformity of the representation, i.e., the lossless compression, can be enforced by maximizing the entropy term, as defined in eq. (3.1). Theoretically, Locatello et al. (2020, Theorem 1.) have shown that the shared content z S can be recovered up to permutation, which aligns with our results Thm. 3.8 that the shared content can be inferred up to a smooth invertible mapping.

Ahuja et al. (2022) extend Locatello et al. (2020) by exploring more perturbations options to achieve full disentanglement. The main results (Ahuja et al., 2022, Theorem 8) state that each individual factor of a d dimensional latents can be recovered (up to a bijection) when we augment the original observation with d views, each obtained perturbing one unique latent component. This can be explained by Thm. 3.8 because any (d 1) views from this set would share exactly one latent component, which makes it identifiable. Although the theoretical claim by Ahuja et al. (2022) is to some extent aligned with our theory, in practice, they explicitly require knowledge of the ground truth content indices while we do not necessarily.

Published as a conference paper at ICLR 2024

Mutual Information-based Framework Tian et al. (2020); Tsai et al. (2020) argue that the self-supervised signal should be approximately redundant to the task-related information. The selfsupervised learning methods are based on extracting the task-relevant information (by maximizing the mutual information between the extracted representation ˆzx of the input x and the self-supervised signal s: I(ˆzx, s)) and discarding the task-irrelevant information conditioned on task T: I(x, s | T). The mutual information I(ˆzx, s)) is maximized if s is a deterministic function (for example, a MLP) of ˆz (Amjad & Geiger, 2020, Theorem 1.). Since the mutual information remains invariant under deterministic transformation of the random variables, we have:

max I(ˆz, s) = max I(ˆz, g(s)) = max I(ˆz, ˆz) = max H(ˆz) (B.4)

which is equivalent to maximizing the entropy of the learned representations, as given in eq. (3.1). Coupled with the empirically shown strong connection between the task-related information and shared content across multiple views (Tian et al., 2020; Tsai et al., 2020; Tosh et al., 2021; Lachapelle et al., 2023; Fumero et al., 2023), our results (Thms. 3.2 and 3.8) provides a theoretical explanation for these approaches. As the shared content between the original view and self-supervised signal is proven to be related to the ground truth task-related information through a smooth invertible function, it is reasonable to see the usefulness of this high quality representation in downstream tasks.

Latent Correlation Maximization Similar alignment conditions, as given in eq. (C.1), have been widely studied in the latent correlation maximization / latent component matching literature (Andrew et al., 2013; Benton et al., 2017; Lyu & Fu, 2020; Lyu et al., 2021). Lyu et al. (2021, Theorem 1.) show that, by imposing additional invertibility constraint on the encoders latent correlation maximization across two views leads to identification of the shared component, up to an invertible function. This theoretical result can be considered as a explicit special case of Thm. 3.2, where we extend the identifiability proof to more than two multi-modal views.

Content-Style Identification Our work is most closely related to (von Kügelgen et al., 2021; Daunhawer et al., 2023), while our results extended prior work that purely focused on identifiability from pair of views (von Kügelgen et al., 2021; Daunhawer et al., 2023). von Kügelgen et al. (2021, Theorem 4.2) presented a special case of Thm. 3.2 where the set size l = 2 and the mixing function f1 = f2 for both views; Daunhawer et al. (2023, Theorem 1) formulate another special case of Thm. 3.2 by allowing multi modality in the pair of views, but coming with the restriction that the view-specific modality variables have to be independent from others. From the data generating perspective, our work differs from prior work in the sense that all of the entangled views are simultaneously generated, each based on view-specific set of latent, while prior work generate the augmented (second) view by perturbing some style variables. In our case, style" is relative to specific views. Style variables could become the content block for some set of views (Thm. 3.2) and thus be identifiable or can be inferred as independent complement block of the content (Cor. 3.10). Kong et al. (2022) have proven identifiability for independent partitions in the latent space but mostly focus on the domain adaptation tasks where additional targets are required as supervision signals.

Multi-Task Disentanglement Lachapelle et al. (2023); Fumero et al. (2023); Zheng et al. (2022) differs from our theory in the sense that their sparse classifier head jointly enforces the lossless compression (which we do with the entropy regularization) and a soft alignment up to a linear transformation (relaxing our hard alignment). In their setting, the different views are images of the same class and their augmentations sampled from a given task and the selector variable is implemented with the linear classifier. The identifiability principles we use of lossless compression, alignment, and information-sharing are similar. With this, we can explain the result that task-related and taskirrelevant information can be disentangled as blocks, as given in Lachapelle et al. (2023, Theorem 3.1), Fumero et al. (2023, Proposition 1.). With our theory, their identifiability results extend to non-independent blocks, which is an important case that is not covered in the original works.

C.1 PROOF FOR THM. 3.2

Our proof follows the steps from von Kügelgen et al. (2021) with slight adaptation:

1. We show in Lemma C.1 that the lower bound of the loss eq. (3.1) is zero and construct encoders {g k : Xk (0, 1)|C|}k V that reach this lower bound;

Published as a conference paper at ICLR 2024

2. Next, we show in Lemma C.3 that for any set of encoders {gk}k V that minimizes the loss, each learned gk(xk) depends only on the shared content variables z C, i.e. gk(xk) = hk(z C) for some smooth function hk : ZC (0, 1)|C|.

3. We conclude the proof by showing that every hk is invertible using Proposition 1 (Zimmermann et al., 2021, Proposition 5.).

We rephrase each step as a separate lemma and use them to complete the final proof for Thm. 3.2.

Lemma C.1 (Existence of Optimal Encoders). Consider a jointly observed set of views x V , satisfying Asm. 2.1. Let Sk [N], k V be view-specific indexing sets of latent variables and define the shared coordinates C := T

k V Sk. For any content encoders G := {gk : Xk (0, 1)|C|}k V (Defn. 3.1), we define the following objective:

E [ gk(xk) gk (xk ) 2] X

k V H (gk(xk)) (C.1)

where the expectation is taken with respect to p(x V ) and where H( ) denotes differential entropy. Then the global minimum of the loss (eq. (C.1)) is lower bounded by zero, and there exists a set of content encoders Defn. 3.1 which obtains this global minimum.

Proof. Consider the objective function L(G) defined in eq. (C.1), the global minimum of L(G) is obtained when the first term (alignment) is minimized and the second term (entropy) is maximized. The alignment term is minimized to zero when gk are perfectly aligned for all k V , i.e., gk(xk) = gk (xk ) for all x V px V . The second term (entropy) is maximized to zero only when gk(xk) is uniformly distributed on (0, 1)|C| for all views k V .

To show that there exists a set of smooth functions: G := {gk}k V that minimizes L(G), we consider the inverse function of the ground truth mixing function f 1 k 1:|C|, w.l.o.g. we assume that the content variables are at indices 1 : |C|. This inverse function exists and is a smooth function given by Asm. 2.1(i) that each mixing function fk is a smooth invertible function. By definition, we have f 1 k 1:|C|(xk) = z C for k V .

Next, we define a function d using Darmois construction (Darmois, 1951) as follows: dj (z C) := Fj (zj|z1:j 1) j {1, . . . , |C|}, (C.2) where Fj denotes the conditional cumulative distribution function (CDF) of zj given z1:j 1, i.e. Fj (zj|z1:j 1) := P (Zj zj|z1:j 1). By construction, d (z C) is uniformly distributed on (0, 1)|C|. Moreover, d is smooth because pz is a smooth density by Asm. 2.1(ii) and because conditional CDF of smooth densities is smooth

Finally, we define gk := d f 1 k 1:|C| : Xk (0, 1)|C|, k V, (C.3) which is a smooth function as a composition of two smooth functions.

Next, we show that the function set G as constructed above attains the global minimum of L(G). Given that f 1 k 1:|C|(xk) = f 1 k 1:|C|(xk ) = z C, k, k V , we have:

E [ gk(xk) gk (xk ) 2] X

k V H (gk(xk))

E [ d (z C) d (z C) 2] X

k V H (d (z C))

where z C is the shared content variables thus the first term (alignment) equals zero; and since d (z C) is uniformly distributed on (0, 1)|C|, the second term (entropy) is also zero.

To this end, we have shown that there exists a set of smooth encoders G := {gk}k V with gk as defined in eq. (C.3) which minimizes the objective L(G) in eq. (C.1).

Published as a conference paper at ICLR 2024

Lemma C.2 (Conditions of Optimal Encoders). Assume the same set of views x V as introduced in Lemma C.1, then for any set of smooth encoders G := {gk : Xk (0, 1)|C|}k V to obtain the global minimum (zero) of the objective L(G) in eq. (C.1), the following two conditions have to be fulfilled:

Invariance: All extracted representations ˆzk := gk(xk) must align across the views from the set V almost surely:

gk(xk) = gk (xk ) k, k V a.s. (C.5)

Uniformity: All extracted representations ˆzk := gk(xk) must be uniformly distributed over the hyper-cube (0, 1)|C|.

Proof. Given that G = argmin L(G), we have by Lemma C.1:

k,k V E [ gk(xk) gk (xk ) 2] X

k V H (gk(xk)) = 0 (C.6)

The minimum L(G) = 0 leads to following conditions:

E [ gk(xk) gk (xk ) 2] = 0 k, k V, k < k (C.7) H (gk(xk)) = 0 k V (C.8)

where eq. (C.7) indicates the invariance condition holds for all views xk and smooth encoders gk G almost surely; and eq. (C.8) implies that the encoded information gk(xk) must be uniformly distributed on (0, 1)|C|.

Lemma C.3 (Content-Style Isolation from Set of Views). Assume the same set of views x V as introduced in Lemma C.1, then for any set of smooth encoders G := {gk : Xk (0, 1)|C|}k V that satisfies the Invariance condition (eq. (C.5)), the learned representation can only be dependent on the content variables z C := {zj : j C}, not any style variables zs k := z Sk\C for all k V .

Proof. Note that the learned representation can be rewritten as:

gk(xk) = gk(fk(z Sk)) k V, (C.9)

we define hk := gk fk k V. (C.10)

Following the second step of the proof from von Kügelgen et al. (2021, Thm. 4.2), we show by contradiction that both hk(z Sk) for all k V can only depend on the shared content variables z C.

Let k V be any view from the jointly observed set, suppose for a contradiction that hc k := hk(z Sk)1:|C| depends on some component zq from the view-specific latent variables zs k:

q {1, . . . , dim(zs k)}, z Sk = (z C, zs k ) Zk, s.t. hc k zq (z C, zs k ) = 0, (C.11)

which means that partial derivative of hc k w.r.t. some latent variable zq zs k is non-zero at some point z Sk = (z C, zs k ) Zk. Since hc k is smooth, its first-order (partial) derivatives are continuous. By continuity of the partial derivatives, hc 1 zq must be non-zero in a neighborhood of (z C, zs k ), i.e.,

η > 0 s.t. zq hc k(z C, zs k q, zq) is strictly monotonic on (zq η, zq + η), (C.12)

where zs k q denotes the remaining view-specific style variables except zq.

Next, we define an auxiliary function for each pair of views (k, k ) with k, k V, k < k : ψk,k : ZC ZSk\C ZSk \C R 0

ψk,k (z C, zs k, zs k ) : = |hc k (z C, zs k) hk (z C, zs k )|

= hc k z Sk hc k z Sk 0. (C.13)

Published as a conference paper at ICLR 2024

Summarizing the pairwise auxiliary functions, we have ψ : ZC Q

k V ZSk\C R 0 as follows:

ψ(z C, {zs k}k V ) : = X

|hc k (z C, zs k) hk (z C, zs k )|

hc k z Sk hc k z Sk 0 (C.14)

To obtain a contradiction to the invariance condition in Lemma C.2, it remains to show that ψ from eq. (C.14) is strictly positive with a probability greater than zero w.r.t. the true generating process p; in other words, there has to exist at least one pair of views (k, k ) s.t. ψk,k > 0 with a probability greater than zero regarding p.

Since q Sk \ C, there exists at least one view k = k s.t. q / Sk (otherwise the content block C would contain q). We choose exactly such a pair of views k, k .

Depending whether there is a zero point z0 q of ψ within the region (zq η, zq + η), there are two cases to consider:

If there is no zero-point z0 q (zq η, zq + η) s.t. ψk,k z C, (zs k q, z0 q), zs k = 0, then it implies ψk,k z C, (zs k q, zq), zs k > 0 zq (zq η, zq + η). (C.15)

So there is an open set A := (zq η, zq + η) Zq such that the equation ψ in eq. (C.14) is strictly positive.

Otherwise, there is a zero point z0 q from the interval (zq η, zq + η) with

ψk,k z C, (zs k q, z0 q), zs k = 0 z0 q (zq η, zq + η), (C.16)

then strict monotonicity from eq. (C.12) implies that ψk,k > 0 for all zq in the neighborhood of z0 q, therefore:

ψ(z C, {zs k}k V ) > 0 zq A := (zq η, z0 q) (z0 q, zq + η). (C.17)

Since ψ is a sum of compositions of two smooth functions (absolute different of two smooth functions), ψ is also smooth. Consider the open set R>0 and note that, under a continuous function, pre-images of open sets are always open. For the continuous function ψ, its pre-image U corresponds to an open set:

k V ZSk\C (C.18)

in the domain of ψ on which ψ is strictly positive. Moreover, since eq. (C.17) indicated that for all zq A, the function ψ is strictly positive, which means:

{zs k q} A Y

k:q / Sk {zs k } U, (C.19)

hence, U is non-empty.

Given by Asm. 2.1 (ii) that pz is smooth and fully supported (pz > 0 almost everywhere), the non-empty set U is also fully supported by pz, which indicates:

P (ψ(z C, {zs k}k V ) > 0) P (U) > 0, (C.20)

where P denotes the probability w.r.t. the true generative process p.

According to Lemma C.2, the invariance condition and uniformity conditions has to be fulfilled. To this end, we have shown that the assumption eq. (C.11) leads to an contradiction to the invariance condition eq. (C.5). Hence, assumption eq. (C.11) cannot hold, i.e., hc k does not depend on any view-specific style variable zq from zs k. It is only a function of the shared content variables z C, that is, ˆzc k = hc k(z C).

Published as a conference paper at ICLR 2024

We list Zimmermann et al. (2021, Proposition 5.) for future use in our proof:

Proposition 1 (Proposition 5 of Zimmermann et al. (2021).). Let M, N be simply connected and oriented C1 manifolds without boundaries and h : M N be a differentiable map. Further, let the random variable z M be distributed according to z p(z) for a regular density function p, i.e., 0 < p < . If the push-forward p#h(z) through h is also a regular density, i.e., 0 < p#h < , then h is a bijection.

Theorem 3.2 (Identifiability from a Set of Views). Consider a set of views x V satisfying Asm. 2.1, and let G be a set of content encoders (Defn. 3.1) that minimizes the following objective

k<k V E [ gk(xk) gk (xk ) 2] | {z } Content alignment

k V H (gk(xk)) | {z } Entropy regularization

where the expectation is taken w.r.t. p(x V ) and H( ) denotes differential entropy. Then the shared content variable z C := {zj : j C} is block-identified (Defn. 2.3) by gk G for any k V .

Proof. Lemma C.1 verifies the existence of such a set of smooth encoders that obtains the global minimum of eq. (3.1) zero; Lemma C.2 derives the invariance conditions and the uniformity that the learned representations gk(xk) have to satisfy for all views k V . Based on the invariance condition eq. (C.5), Lemma C.3 shows that the learned representation gk(xk), k V can only depend on the content block, not on any style variables, namely gk(xk) = hk(z C) for some smooth function hk : ZC (0, 1)|C|.

We now apply Zimmermann et al. (2021, Proposition 5.) to show that all of the functions hk, k V are bijections. Note that both ZC and (0, 1)|C| are simply connected and oriented C1 manifolds, and hk are smooth, thus differentiable, functions that map the intersection set of random variables z C from C to (0, 1)|C|. Given by Asm. 2.1(ii) that pz C and the push-forward function through hk (uniform distributions) are regular densities, we conclude that all hk are diffeomorphisms for all k V .

Thus we have shown that any content set of encoders G that minimizes L(G) (eq. (3.1)) can extract the ground-truth content variables z C from view xk Xk up to a bijection hk : ZC (0, 1)|C|:

gk(xk) = hk(z C), (C.21)

That is, shared content z C is block-identified by the content encoders G = {gk}k V .

Remark on the proof technique for Thm. 3.2. For Thm. 3.2, one could imagine an alternate proof by induction over the number of views, where the proofs by von Kügelgen et al. (2021); Daunhawer et al. (2023) would be the base case. We opted for a direct proof technique as the induction proof may have been perhaps more intuitive at a high level but was significantly longer. Additionally, we present the current version because it would be generally more accessible as a more familiar proof technique.

C.2 PROOF FOR THM. 3.8 Our proof consists of the following steps:

1. We show in Lemma C.4 the loss eq. (C.22) is lower bounded by zero and construct optimal R (Defn. 3.3), Φ (Defn. 3.5), T (Defn. 3.6) that reach this lower bound;

2. Next, we show in Lemma C.6 that, if the content sizes |Ci| are known for all Vi V, then any view-specific encoders, content selectors, and projections (R, Φ, T) that minimize the loss eq. (C.22), block-identify the content variables z Ci for any Vi V, using similar steps as in the proof for Thm. 3.2.

3. As the third step, we show that any minimizer R (Defn. 3.3), Φ (Defn. 3.5), T (Defn. 3.6) of eq. (C.22) also minimizes the information-sharing regularizer (Defn. 3.7); and show that the optimal solution (R , Φ , T ) we constructed in the first step reaches this lower bound of Defn. 3.7.

4. Then, we show by contradiction that any optimal content selector Φ that solves the constrained optimization problem in eq. (3.2) recovers the correct content size |Ci| for each subset Vi, using the invariance condition in Lemma C.5.

5. Lastly, we apply the results from Lemma C.6 and conclude our proof for Thm. 3.8.

Published as a conference paper at ICLR 2024

We rephrase each step as a separate lemma and use them to complete the final proof for Thm. 3.8.

Lemma C.4 (Existence of Encoders, Selectors and Projections). Consider a jointly observed set of views x V satisfying Asm. 2.1. For any set of view-specific encoders R (Defn. 3.3), content selectors RΦ (Defn. 3.5) and projections T (Defn. 3.6), we define the following objective:

L (R, Φ, T) = X

E h ϕ(i,k) rk(xk) ϕ(i,k ) rk (xk ) 2

k V H (tk rk(xk)) .

which is lower bounded by zero; and there exists such combination of R, Φ, T that obtains this global minimum zero.

Proof. Consider the objective function L(R, Φ, T) (eq. (C.22)), the global minimum of L(R, Φ, T) is obtained when the first term (alignment) is minimized and the second term (entropy) is maximized. The alignment term is minimized to zero when selected representations ϕ(i,k) rk are perfectly aligned for all k V almost surely. The second term (entropy) is maximized to zero only when tk rk(xk) is uniformly distributed on (0, 1)|Sk| for all view k V . Thus we have shown that the loss (eq. (C.22)) is lower-bounded by zero.

The optimal view-specific encoders can be defined via the inverse of the view-specific mixing functions {fk}k V , which by Asm. 2.1(i) are smooth and invertible. By definition, we have f 1 k (xk) = z Sk for all k V . Formally, we define the set of optimal view-specific encoders

R := {f 1 k }k V . (C.23)

Next, we define the optimal auxiliary transformation tk for each view k using Darmois construction, writing tk rk(xk) = tk f 1 k (xk) = tj k (z Sk), we have:

tj k (z Sk) := F k j ([z Sk]j|[z Sk]1:j 1) = P ([ZSk]j [z Sk]j|[z Sk]1:j 1) j {1, . . . , |Sk|}, (C.24)

where F k j denotes the conditional cumulative distribution function (CDF) of [z Sk]j given [z Sk]1:j 1. Thus, tk (z Sk) is uniformly distributed on (0, 1)|Sk| and tk is smooth by Asm. 2.1(ii) which states that pz is a smooth density.

As for the optimal content selectors Φ = {ϕ(i,k)}Vi V,k Vi, choose ϕ(i,k) such that

ϕ(i,k) ˆz Sk := ˆz Ci (C.25)

Writing f 1 k (xk) = z Sk, the loss L(R, Φ, T) from eq. (C.22) takes the value:

L (R, Φ, T) = X

E h ϕ(i,k) rk(xk) ϕ(i,k ) rk (xk ) 2

k V H (tk rk(xk))

E h ϕ(i,k) f 1 k (xk) ϕ(i,k ) f 1 k (xk ) 2

k V H tk f 1 k (xk)

E h ϕ(i,k) z Sk ϕ(i,k ) z Sk 2

k V H (tk(z Sk))

E [ z Ci z Ci 2] X

k V H (tk (z Sk))

=0 (C.26) Note that the first term is minimized to zero because the shared content values z Ci align among the views in one subset Vi V; the second term is maximized to zero because tk (z Sk) is uniformly distributed on (0, 1)|Sk| given by the property of Darmois construction (Darmois, 1951). To this end, we have shown that there exists such optimum R, Φ, T as defined in eqs. (C.23) to (C.25) that minimizes the objective in eq. (C.22).

Published as a conference paper at ICLR 2024

Lemma C.5 (Conditions of Optimal Encoders, Selectors and projections). Given the same set of views x V as introduced in Lemma C.4, to minimize L(R, Φ, T) in eq. (C.22), any optimum R, Φ, T (Defns. 3.3, 3.5 and 3.6) has to satisfy similar invariance and uniformity conditions from Lemma C.2:

Invariance: All selected representations ϕ(i,k) rk(xk), k V must align across the views from the set Vi V almost surely:

ϕ(i,k) rk(xk) = ϕ(i,k ) rk (xk ) Vi V k, k Vi a.s. (C.27)

Uniformity: All extracted representations tk rk(xk), k V must be uniformly distributed over the hyper unit-cube (0, 1)|Sk|.

Proof. The minimum of L(R, Φ, T) = 0 can only be obtained when both terms are zero. For the first term (alignment) to be zero, it is necessary that ϕ(i,k) rk(xk) = ϕ(i,k ) rk (xk ) almost surely for all Vi V , k, k Vi w.r.t. the true generating process. The second term (entropy) is upper-bounded by zero; this maximum can only be obtained when the auxiliary encoding tk rk(xk), k V follows uniformity, as also indicated by Lemma C.2.

Lemma C.6 (View-Specific Encoder for Identifiability Given Content Sizes). Consider a jointly observed set of views x V satisfying Asm. 2.1 and assume that the dimensionality of the subset-specific content |Ci| is given for all subset Vi V. We consider a special type of content selectors Φ with ϕ(i,k) 0 = |Ci| for all k Vi. Let R, T respectively denote some view-specific encoders (Defn. 3.3), and projections (Defn. 3.6), which jointly minimize the following objective together with the special content selectors Φ:

L (R, Φ, T) = X

E h ϕ(i,k) rk(xk) ϕ(i,k ) rk (xk ) 2

k V H (tk rk(xk)) .

Then for any view k V , any subset of views Vi V with k Vi, the composed function ϕ(i,k) rk block-identifies the shared content variables z Ci in the sense that the learned representation ˆz(i) k := ϕ(i,k) rk(xk) is related to the ground truth content variables through some smooth invertible mapping hk : ZCi ZCi with ˆz(i) k = h(i) k (z Ci).

Proof. Lemma C.4 verifies that there exists such optimum which minimizes the loss eq. (C.28) to zero; the invariance and uniformity conditions have to be satisfied by any optimum, as shown in Lemma C.5. Following Lemma C.3, the composition r(i) k := ϕ(i,k) rk can only encode information related to the subset-specific content Ci for any subset Vi V otherwise it will lead to a contradiction to the invariance condition from Lemma C.5. The last step is to prove the invertibility of the encoders G. Notice that tk rk(xk) = tk rk fk(z Sk) By applying Zimmermann et al. (2021, Proposition 5.) with similar arguments as in the proof for Thm. 3.2, we can show that composition tk rk fk is a smooth bijection of the subset-specific content z Ci. Since fk is a smooth invertible mapping by Asm. 2.1 (i), we have:

(tk rk fk) f 1 k = (tk rk) (fk f 1 k ) = tk rk,

Hence, tk rk is bijective as the composition of bijections is a bijection. Next, we show that rk is bijective. Showing that rk is bijective on its image is equivalent to showing that it is injective. By contradiction, suppose rk is not injective. Thus there exists distinct values x1 k, x2 k Xk s.t. rk(x1 k) = rk(x2 k). This implies that tk rk(x1 k) = tk rk(x2 k), which violate injectivity of tk rk. Thus, rk must be injective.

To this end, we conclude that any R, Φ, T that minimizes eq. (C.28) block-identifies the shared content variables z Ci for any subset of views Vi V.

Claim 1. For any (R, Φ, T) (Defns. 3.1, 3.3and 3.6) that minimizes the loss eq. (C.22), the Reg(Φ) (Defn. 3.7) is lower bounded by P

Vi V |Ci| |Vi| and this minimum is obtained at the optimal content selectors defined in eq. (C.25).

Published as a conference paper at ICLR 2024

Proof. Suppose for a contradiction that there exists some binary weight parameters Φ = Φ with

Reg( Φ) = X

ϕ(i,k) 0 < Reg(Φ), (C.29)

which means, there exists at least one vector ϕ(i,k) for some view k V , subset Vi V, such that ϕ(i,k) rk(xk) = ˆz A |A| > |Ci|, (C.30) where A Sk is an index subset of the view-specific latents Sk. Given that R, Φ, T minimizes L(R, Φ, T) from eq. (C.22), these minimizers have to satisfy the invariance and uniformity constraint as shown in Lemma C.5. Since uniformity implies invertibility (Zimmermann et al., 2021), the learned representation rk(xk) contains sufficient information about the original view xk s.t. the view xk can be reconstructed by some decoder given enough capacity. Given that the number of selected dimensions |A| > |Ci|, at least one latent component j A will contain information that is not jointly shared by Vi. That means the composition r(i) k := ϕ(i,k) rk encodes some information other than just content Ci. As shown in Lemma C.3, any dependency from the learned representation on non-content variables leads to contradiction to the invariance condition as derived in Lemma C.5. Therefore, the optimal content selectors Φ following the definition in eq. (C.25) must obtain the global minimum of the information-sharing regularizer (Defn. 3.7), which equals P

Theorem 3.8 (View-Specific Encoder for Identifiability). Let R, Φ, T respectively be any viewspecific encoders (Defn. 3.3), content selectors (Defn. 3.1) and projections (Defn. 3.6) that solve the following constrained optimization problem: min Reg(Φ) subject to: R, Φ, T argmin L (R, Φ, T) (3.2)

L (R, Φ, T) = X

E h ϕ(i,k) rk(xk) ϕ(i,k ) rk (xk ) 2

| {z } Content alignment

k V H (tk rk(xk)) | {z } Entropy

(3.3) Then for any subset of views Vi V and any view k Vi , ϕ(i,k) rk block-identifies (Defn. 2.3) the shared content variables z Ci, as defined in eq. (2.2).

Proof. Lemma C.4 confirms that there exist view-specific encoders R, content selectors Φ, and projections T that obtain the minimum of the unregularized loss eq. (C.22) (equals zero); Additionally, any optimal R, Φ, T fulfills the invariance condition and uniformity (Lemma C.5) s.t. they obtain the global minimum zero. Using the invariance condition, Claim 1 substantiates that the optimal content selectors as defined in eq. (C.25) minimizes the regularization term (Defn. 3.7). We have thus shown that with R, Φ, T (as defined in eqs. (C.23) to (C.25)), eq. (3.2) obtains the global minimum.

Next, we show that the number of selected dimensions from each selector ϕ(i,k), i.e., the L0 norm of ϕ(i,k), align with the size of the shared content |Ci|.

Among the content selectors that minimize the unregularized loss (eq. (C.22)), we consider some content selectors Φ argmin Reg(Φ) that also minimize the information-sharing regulariser defined in Defn. 3.7, that is: Reg(Φ ) = X

Suppose for a contradiction that there exists a pair of binary selectors (ϕ(i,k), ϕ(i ,k )) with ϕ(i,k) {0, 1}|Sk| and ϕ(i ,k ) {0, 1}|Sk | such that ϕ(i,k) 0 > |Ci|; ϕ(i,k ) 0 < |Ci |, (C.31)

which indicates that there exists at least one latent component j Sk \ Ci being selected by ϕ(i,k); similarly, this contradicts the invariance condition as shown in Lemma C.3. Hence, the number of dimensions selected by each ϕ(i,k) has to equal the content size |Ci|.

At this stage, the problem setup is reduced to the case in Lemma C.6 where the size of the content variables |Ci| are given for all subset of views Vi V. Hence, applying Lemma C.6, we conclude that any R, Φ, T (Defns. 3.3, 3.5 and 3.6) that minimize eq. (3.2) block-identify the shared content variables z Ci for any subset of views Vi V and for all views k Vi.

Published as a conference paper at ICLR 2024

C.3 PROOFS FOR IDENTIFIABILITY ALGEBRA Let z C1, z C2 be two sets of content variables indexed by C1, C2 [N] that are block-identified by some smooth encoders g1 : X1 ZC1, g2 : X2 ZC2, then it holds for C1, C2 that:

Corollary 3.9 (Identifiability Algebra: Intersection). The intersection z C1 C2 can be block-identified.

Proof. By the definition of block-identifiability, we construct two synthetic views using the learned representation from x1 and x2: x(1) := g1(x1) = h1(z C1)

x(2) := g2(x2) = h2(z C2) (C.32)

for some smooth invertible mapping hk : ZCk ZCk k {1, 2}. Applying the Thm. 3.2 with two views, we can block-identify the intersection C1 C2 using this pair of views (x(1), x(2)).

Corollary 3.10 (Identifiability Algebra: Complement). If C1 C2 is independent of C1\C2, then the complement z C1\C2 can be block-identified.

Proof. Construct the same synthetic views x(1), x(2) as in the proof for Cor. 3.9. We then can consider the intersection C1 C2 as the content variable and C1 \ C2 as the style variable from these two synthetic views (x(1), x(2)). Private Component Extraction from Lyu et al. (Theorem 2. 2021) has shown that if the style variable is independent of the content, then the style variables can also be extracted up to a smooth invertible mapping. Therefore, we conclude that the complement z C1\C2 can also be block-identified.

Corollary 3.11 (Identifiability Algebra: Union). If C1 C2, C1\C2 and C2\C1 are mutually independent, then the union z C1 C2 can be block-identified.

Proof. We rephrase C1 C2 as a union of the following disjoint parts:

C1 C2 = (C1 C2) (C1\C2) (C2\C1) (C.33)

Following the definition from Cors. 3.9 and 3.10 have shown that:

ˆz := h (z C1 C2) ˆz1\2 := h1\2(z C1\C2)

ˆz2\1 := h2\1(z C2\C1), (C.34)

By concatenate the learned representations, we define h : ZC1 C2 ZC1 C2 as

h (z C, z1\2, z2\1) := [ˆz C, ˆz1\2, ˆz2\1] = h (z C1 C2), (C.35)

hence, the union C1 C2 can be block-identified.

D EXPERIMENTAL RESULTS

This section provides further details about the datasets and implementation details in 5. The implementation is built upon the code open-sourced by Zimmermann et al. (2021); von Kügelgen et al. (2021); Daunhawer et al. (2023).

D.1 NUMERICAL EXPERIMENT THEORY VALIDATION Data Generation. For completeness, we summarize the setting of our numerical experiments. We generate synthetic data following Example 2.1, which we also report below. The latent variables are sampled from a Gaussian distribution z N(0, Σz), where possible causal dependencies are encoded through Σz. Note that in this setting the ground truth causal variables will be related linearly to each other. x1 = f1(z1, z2, z3, z4, z5), x2 = f2(z1, z2, z3, z5, z6), x3 = f3(z1, z2, z3, z4, z6), x4 = f4(z1, z2, z4, z5, z6). (D.1)

Implementation Details. We implement each view-specific mixing function fk, for each view k = 1, 2, 3, 4, using a 3-layer invertible, untrainable MLP (Haykin, 1994) with Leaky Re LU (Xu

Published as a conference paper at ICLR 2024

Table 3: Linear Evaluation: Mean Correlation Coefficients across multiple views.

(x1, x2) (x1, x3) (x1, x4) (x2, x3) (x2, x4) (x3, x4) (x1, x2, x3) (x1, x2, x4) (x1, x3, x4) (x2, x3, x4) (x1, x2, x3, x4)

ind. 0.887 0.000 0.881 0.000 0.882 0.000 0.885 0.000 0.886 0.000 0.880 0.000 0.853 0.000 0.854 0.000 0.846 0.000 0.851 0.000 0.786 0.000 dep. 0.956 0.000 0.880 0.002 0.891 0.002 0.795 0.002 0.805 0.002 0.805 0.002 0.945 0.001 0.969 0.001 0.858 0.003 0.744 0.003 0.944 0.001

et al., 2015)(α = 0.2). The weight parameters in the mixing functions are randomly initialized. For the learnable view-specific encoders, we use a 7-layer MLP with Leaky Re LU (α = 0.01) for each view. The encoders are trained using the Adam optimizer (Kingma & Ba, 2014) with lr=1e-4. All implementation details are summarized in Tab. 4.

Additional Experiments. We experiment on causally dependent synthetic data, generated by z N(0, Σz) with Σz Wishart(0, I). The results are shown in Fig. 4. The rows denote the ground truth latent factors, and the columns represent the learned representation from different subsets of views. Each cell reports the R2 score between the respective ground truth factors and the learned representation. For example, the cell with col={x1, x2} and row=z1 shows the R2 score when trying to predict z1 using the learned representation from subset {x1, x2}. Since dependent style variables become predictable, as discussed in App. D.1, we aim to verify that the learned representation contains all and only content variables. In other words, it block-identifies the ground truth content factors. For that, we consider all the views {x1, . . . , x4} and train a linear regression from the ground truth content variables z1, z2 to the individual style variables z3, z4, z5, z5. We report the coefficient of determination R2 in Tab. 6. We observe that the R2 values obtained from the ground truth content are highly similar to the ones in the last column of the heatmap (Fig. 4). Based on this, we have showcased that the learned representation indeed block-identifies the content variables.

Additional Evaluation Metric. We report the Mean Correlation Coefficient(MCC) (Khemakhem et al., 2020) on the numerical experiments. MCC has been used in several recent works on identifiability of causal representation learning (Buchholz et al., 2023; von Kügelgen et al., 2023), it measures the component-wise linear correlation up to permutations. A high MCC (close to one) indicates a clear 1-to-1 linear correspondence between the learned representation and the ground truth latents. We remark our theoretical framework considers block-identifiability, which could imply any type of bijective relation to the ground truth content variables, including nonlinear transformations. Nevertheless, we observe high MCC score on both independent and dependent cases, showing that the learned representation having a high linear correlation to the latent components which indicates stronger identifiability results.

D.2 SELF-SUPERVISED DISENTANGLEMENT

Datasets. In this experiment, we test on MPI-3D complex (Gondal et al., 2019) and 3DIdent (Zimmermann et al., 2021). Both are high-dimensional image datasets generated from mutually independent latent factors: MPI-3D complex contains real-world complex shaped images with ten discretized latent factors while 3DIdent renders a teapot conditioned on ten continuous latent factors.

Original view

Augmeted view

(a) Example Input: MPI-3D complex

Original view

Augmeted view

(b) Example Input: 3DIdent

Implementation Details. We used the implementation from (Michlo, 2021) for Ada-GVAE (Locatello et al., 2020), following the same architecture as (Locatello et al., 2020, Tab. 1 Appendix ). For our method, we use Res Net-18 (He et al., 2016) as the image encoder, details given in Tab. 8. For both approaches, we set ENCODING SIZE=10, following the setup in Locatello et al. (2020).

Published as a conference paper at ICLR 2024

Table 4: Parameters for numerical simulation ( 5.1 and App. D.1).

Parameter Value

Mixing function 3-layer MLP Encoder 7-layer MLP Optimizer Adam Adam: learning rate 1e-4 Adam: beta1 0.9 Adam: beta2 0.999 Adam: epsilon 1e-8 Batch size 4096 Temperature τ 1.0 # Iterations 100,000 # Seeds 3 Similarity metric Euclidian

Table 5: Parameters for experiments 5.2 and 5.3 and Apps. D.3 and D.4. : for both image and text encoders. : hyper-arapmeter for Barlow Twins (Zbontar et al., 2021).

Parameter Values

Content encoding size 8 View-specific encoding size 11 Image hidden size 100 Text embedding dim 128 Text vocab size 111 Text fbase 25 Batch size 128 Temperature 1.0 Off-diagonal constant λ 1.0 Optimizer Adam Adam: beta1 0.9 Adam: beta2 0.999 Adam: epsilon 1e-8 Adam: learning rate 1e-4 # Iterations 300,000 # Seeds 3 Similarity metric Cosine similarity Gradient clipping 2-norm; max value 2

{x1, x2, x3}

{x1, x2, x4}

{x1, x3, x4}

{x2, x3, x4}

{x1, x2, x3, x4}

z1 z2 z3 z4 z5 z6

1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

1.0 1.0 0.8 1.0 0.9 0.9 1.0 0.7 0.7 0.9 0.4

0.8 0.9 0.9 0.9 0.9 0.9 0.8 0.8 0.9 0.9 0.7

1.0 0.8 1.0 0.8 1.0 0.8 0.8 1.0 0.7 0.8 0.6

0.8 0.7 0.8 0.8 0.8 0.8 0.7 0.8 0.8 0.7 0.7

Figure 4: Theory Verfication: Average R2 across multiple views generated from causally dependent latents.

Table 6: Linear R2 from ground truth content variables to styles when consider {x1, x2, x3, x4}, these values align with the last column of Fig. 4, showing that we have block-identified the content variables {z1, z2}

content style {z1, z2} z3 z4 z5 z6 1.0 0.32 0.65 0.58 0.71

Published as a conference paper at ICLR 2024

Table 7: Thm. 3.2 Validation on Causal3DIdent: R2 mean std. Green: content, bold: R2 > 0.50.

positions hues rotations Views generated by changing class x y z spotl obj spotl bkg ϕ θ ψ

hues 1.00 0.00 0.76 0.01 0.56 0.02 0.00 0.00 0.82 0.01 0.27 0.03 0.00 0.01 0.00 0.00 0.25 0.02 0.27 0.02 0.27 0.02 positions 1.00 0.00 0.00 0.01 0.46 0.02 0.00 0.01 0.00 0.01 0.32 0.02 0.00 0.01 0.92 0.00 0.26 0.02 0.29 0.02 0.27 0.02 rotations 1.00 0.00 0.11 0.01 0.50 0.02 0.00 0.00 0.06 0.01 0.31 0.02 0.00 0.01 0.83 0.01 0.25 0.01 0.27 0.02 0.06 0.01 hues+pos 1.00 0.00 0.00 0.00 0.20 0.02 0.00 0.01 0.00 0.01 0.14 0.02 0.00 0.00 0.00 0.01 0.07 0.01 0.18 0.02 0.12 0.02 hues+rot 1.00 0.00 0.09 0.02 0.36 0.02 0.00 0.00 0.51 0.01 0.25 0.02 0.00 0.01 0.00 0.01 0.00 0.01 0.25 0.02 0.25 0.01 pos+rot 1.00 0.00 0.00 0.00 0.21 0.02 0.00 0.01 0.00 0.00 0.07 0.01 0.00 0.01 0.23 0.02 0.05 0.01 0.20 0.02 0.13 0.02 hues+pos+rot 1.00 0.00 0.00 0.00 0.42 0.02 0.00 0.01 0.00 0.00 0.25 0.02 0.00 0.00 0.00 0.00 0.01 0.01 0.26 0.02 0.26 0.02

D.3 CONTENT-STYLE IDENTIFIABILITY ON IMAGES Datasets. Causal3DIdent (von Kügelgen et al., 2021) extends 3Dident (Zimmermann et al., 2021) by introducing different classes of objects, thus object shape (or class) is added as an additional discrete factor of variation. We extend the image pairs experiments from (von Kügelgen et al., 2021) by inputting three views, as shown in Fig. 6b, where the second and third images are obtained by perturbing different subsets of latent factors of the first image. To perturb one specific latent component, we uniformly sample one latent in the predefined latent space (Unif[ 1, 1], details see (von Kügelgen et al., 2021, App. B)), then we use indexing search to retrieve the image in the dataset that has the closest latent values as the sampled ones. Note that only a finite number of images are available; thus, there is not always a perfect match. More frequently, we observe slight changes in the non-perturbing latent dimensions. For instance, the hues of the third view is slightly different than the original view, although we intended to share the same hue values.

(a) Underlying causal relation in Causal3DIdent and Multimodal3DIdent images. Figure adopted from (von Kügelgen et al., 2021, Fig. 2)

Original view

Change hues

Change pos + rot

(b) Example Input: Causal3DIdent

Figure 6: Causal3DIdent: Underlying causal relations and input examples.

Implementation Details. The encoder structure and parameters are summarized in (Tabs. 5 and 8). We train using Barlow Twins (Zbontar et al., 2021) with cosine similarity and off-diagonal importance constant λ = 1. Barlow Twins is another contrastive loss that jointly encourages alignment and uniformity, by enforcing the cross-correlation matrix of the learned representations to be identity. The on-diagonal elements represent the content alignment while the off-diagonal elements approximate the entropy regularization.

Thm. 3.2 Validation. We train content encoders (Defn. 3.1) on Causal3DIdent to verify Thm. 3.2. Note that we experiment on three views (von Kügelgen et al., 2021) cannot naively handle. Tab. 7 summarizes results for all possible perturbations among the three views. We can observe that the discrete factor class learned perfectly; dependent style variables become predictable from the content (class) latent causal dependence. Note that this table shows similar results as in (von Kügelgen et al., 2021, Table 6. Latent Transformation (LT)). We remark that there is a reality between theory in practice: In theory, we assume that the content variables share the exact same value across all views; however, in practice, finding a perfect match of all of the continuous content values become impossible, since there is only a finite number of training data available. We believe this reality gap negatively influenced the learning performance on the content variables, thus preventing efficient prediction on certain content variables, such as object hues.

D.4 MULTI-MODAL CONTENT-STYLE IDENTIFIABILITY UNDER PARTIAL OBSERVABILITY Dataset. Multimodal3DIdent (Daunhawer et al., 2023) augments Causal3DIdent (von Kügelgen et al., 2021) with text annotations for each image view, and discretizes the objection positions (x, y, z)

Published as a conference paper at ICLR 2024

Table 8: Encoder Architectures for Causal3DIdent and Multimodal3DIdent.

Image Encoder Text Encoder

Input size = H W 3 Input size = vocab size Res Net-18(hidden size) Linear(fbase, text embedding dim) Leaky Re Lu(α = 0.01) Conv2D(1, fbase, 4, 2, 1, bias=True) Linear(hidden size, image encoding size) Batch Norm(fbase); Re LU Conv2D(fbase, fbase 2, 4, 2, 1, bias=True) Batch Norm(fbase 2); Re LU Conv2D(fbase 2, fbase 4, 4, 2, 1, bias=True) Batch Norm(fbase 4); Re LU Linear(fbase 4 3 16, text encoding size)

Table 9: Thm. 3.2 Validation on Multimodal3DIdent: R2 mean std. Green: content, bold: R2 > 0.50.

views generated class img pos img hues txt class txt pos txt hue txt phrasing by changing x y spotl obj spotl bkg x y obj_color_idx

hues + rot 0.82 0.01 1.00 0.00 1.00 0.00 0.00 0.00 0.87 0.01 0.00 0.00 - 0.85 0.03 1.00 0.00 1.00 0.00 0.15 0.02 0.21 0.02

pos 1.00 0.00 0.47 0.02 0.64 0.01 0.00 0.00 0.67 0.02 0.00 0.00 - 1.00 0.00 0.34 0.02 0.94 0.01 0.16 0.03 0.21 0.02

to categorical variables. In particular, object-zpos is a constant and thus not shown in our evaluation (Fig. 3). Our experiment extends (Daunhawer et al., 2023) by adding one additional image to the original image-text pair, perturbing the hues, object rotations and spotlight positions of the original image (Uniformly sample from Unif[0, 1]). Thus, (img0, img1) share object shape and background color; Thus, (img0, txt0) share object shape and object x-y positions; both (img1, txt0) and the joint set (img0, img1, txt0) share only the object shape. One example input is shown in Fig. 7.

Original view

Change hues + rot

Text description: A hare of bright yellow green color is visible, positioned at the mid-left of the image.

Figure 7: Example input: Multimodal3DIdent. Left: pair of images that are original and perturbed images. Right: Text annotation for the original view.

Implementation Details. Tabs. 5 and 8 shows the network architecture and implementation details, mostly following (Daunhawer et al., 2023). Note that we use the same encoding size for both image and text encoders for the convenience of implementation. We train using Barlow Twins with λ = 1. In practice, we treat the unknown content sizes as a list of hyper-parameters and optimize it over different validations.

Further Discussion about 5.3. The fundamental difference between the Multimodal3D and Causal3DIdent datasets, as mentioned above, makes a direct comparison between our results in Fig. 3 and (von Kügelgen et al., 2021) harder. However, Tab. 7 shows similar R2 scores as the results given in von Kügelgen et al. (2021, Sec 5.2), which verifies the correctness of our method.

Thm. 3.2 Validation. We additionally learn content encoders on three partially observed views (img0, img1, txt0) using the loss from Zbontar et al. (2021), to justify Thm. 3.2. We use the same architecture and parameters as summarized in Tabs. 5 and 8. Tab. 9 shows the content encoders consistently predict the content variables well and that our evaluation results highly align with (Daunhawer et al., 2023, Fig. 3) on image-text pairs (img0, txt0) as inputs, which empirically validates Thm. 3.2.

D.5 MULTI-TASK DISENTANGLEMENT WITH SPARSE CLASSIFIERS Following Example 2.1, we synthetically generate the class labels by linear/nonlinear labeling functions on the shared content values, which resembles the underlying inductive bias from (Lachapelle et al., 2023; Fumero et al., 2023), that the shared features across different images with the same label should be most task-relevant. Here, we use the same sparse-classifier implementation from (Fumero et al., 2023). We remark that the goal of this experiment is to verify our expectation from Thm. 3.2

Published as a conference paper at ICLR 2024

that the method of (Fumero et al., 2023) can be explained by our theory, although they assume mutually independent latents, which is a special case of our setup. In our experimental setup, an input gets a label 1 when the following labeling function value is greater than zero:

Linear: Pd j=1 ˆzj where d denotes the encoding size.

Nonlinear: tanh Pd j=1 ˆz3 j

Thus, we have resembled the inductive hypothesis in Fumero et al. (2023) that the classification task is only dependent on the shared features. The fact that the binary classification is solved in several iterations verifies that Fumero et al. (2023) used the same soft alignment principle as described in Thm. 3.2.

E DISCUSSION

The Theory-Practice Gap It is noticeable that some of the technical assumptions we made in the theory may not exactly hold in practice. A common assumption in identifiability literature is that the latent variables z are continuous, while this is not true for e.g. the object shape in Causal3DIdent (von Kügelgen et al., 2021) and object shape, positions in Multimodal3DIdent (Daunhawer et al., 2023). Another related gap regarding the dataset is that the additional views are generated by uniformly sampling a subset of latents from the original view and then trying to retrieve an image among the existing finite dataset, whose latent value is closest to the proposed one. However, having only a finite number of images implies that always finding a perfect match for each perturbed latents is almost impossible in practice. As a consequence, the designed to be strictly aligned content values between different views could differ from each other by a certain margin. Also, both Thms. 3.2 and 3.8 holds asymptotically and the global minimum is obtained only when given infinitely amount of data. Given that there is no closed-form solution for entropy regularization term eqs. (3.1) and (3.2), it is approximated either using negative samples (Oord et al., 2018; Chen et al., 2020) or by optimizing the cross-correlation between the encoded information to be close to Identity matrix (Zbontar et al., 2021); in both cases there is only a finite number of samples available, which makes converging to global minimum almost impossible in practice.

Discovering Hidden Causal Structure from Overlapping Marginals. Identifying latent blocks {z Bi} provides us with access to the marginal distributions over the corresponding subsets of latents {p(z Bi)}. With observed variables and known graph, this has been termed the causal marginal problem (Gresele et al., 2022), and our setup could therefore also been seen as generalization along those dimensions. It may be possible to extract some causal relations from the inferred marginal distributions over blocks, either by imposing additional assumptions or through constraint-based approaches (Triantafillou et al., 2010).

How to Learn the Content Sizes? Thm. 3.8 shows that content blocks from any arbitrary subset of views can be discovered simultaneously using view-specific encoders (Defn. 3.3), content selectors (Defn. 3.5) and some projections (Defn. 3.6). We remarked in the main text that optimizing the information-sharing regularizer (Defn. 3.7) is highly non-convex and thus impractical. We proposed alternatives for both unsupervised and supervised cases: for self-supervised representation learning, one could employ Gumble-Softmax to learn the hard mask. We hypothesize that if there is an additional inclusion relation about the content blocks available, for example, we know that C1 C2 C3, then the learning process could be eased by coding this inclusion relation in the mask implementation. This additional information is naturally inherited from the fact that the more views we include, the smaller the shared content will be. Another idea would be manually allocating individual content blocks in the learned encoding in a sequential manner, e.g. we set index=1, 2, 3 for the first content block and index=4, 5 for the second content block, and enforcing alignment correspondingly. Thus, for each view, we learn a concatenated representation of the shared content. Although this method does not perfectly follow the theoretical setting in the Thm. 3.8, it still learn all of the contents simultaneously and shows faster convergence. In classification tasks, the hard alignment constraint is relaxed to some soft constraint within one equivalence class e.g. samples which have the same label. In this case, we can replace the binary content selector with linear readouts, as studied and implemented by Fumero et al. (2023). Another, yet the most common approach to deal with this problem is to treat the content sizes as hyperparameters, as shown in (von Kügelgen et al., 2021; Daunhawer et al., 2023).

Published as a conference paper at ICLR 2024

Trade-off between Invertibility and Feature Sharing. An invertible encoder implies that the extracted representation is a lossless compression, which means that the original observation can be reconstructed using this learned representation, given enough capacity. On the one hand, the invertibility of the encoders is enforced by the entropy regularization, such that the encoder preserves all information about the content variables; on the other hand, the info-sharing regularizer (Defn. 3.7) encourages reuse of the learned feature, which potentially prevents perfect reconstruction for each individual view. Intuitively, Thm. 3.8 seeks the sweet spot between invertibility and feature sharing: When the encoder shares more than the ground truth, then it loses information about certain views, and thus the compression is not lossless; When the invertibility is fulfilled but the info-sharing is not maximized, then the learned encoder is not an optimal solution either, given by the regularization penalty from the infor-sharing regularizer.

Causal Representation Learning from Interventional Data. Our framework considers purely observational data, where multiple partial views are generated from concurrently sampled latent variables using view-specific mixing functions. Recent works (Liang et al., 2023; Buchholz et al., 2023; von Kügelgen et al., 2023) have shown identifiability in non-parametric causal representation learning using interventional data, allowing discovering (some) hidden causal relations. Since simultaneously identifying the latent representation and the underlying causal structure in a partially observable setup has been a long standing goal, we believe incorporating interventional data into the proposed framework could be one interesting direction to explore.