Weakly Supervised Representation Learning with Sparse Perturbations

Kartik Ahuja, Jason Hartford, Yoshua Bengio

Abstract

The theory of representation learning aims to build methods that provably invert the data generating process with minimal domain knowledge or any source of supervision. Most prior approaches require strong distributional assumptions on the latent variables and weak supervision (auxiliary information such as timestamps) to provide provable identification guarantees. In this work, we show that if one has weak supervision from observations generated by sparse perturbations of the latent variables (e.g., images in a reinforcement learning environment where actions move individual sprites), identification is achievable under unknown continuous latent distributions. We show that if the perturbations are applied only on mutually exclusive blocks of latents, we identify the latents up to those blocks. We also show that if these perturbation blocks overlap, we identify latents up to the smallest blocks shared across perturbations. Consequently, if there are blocks that intersect in one latent variable only, then such latents are identified up to permutation and scaling. We propose a natural estimation procedure based on this theory and illustrate it on low-dimensional synthetic and image-based experiments.

1 Introduction

If you are reading this paper on a computer, press one of the arrow keys: all the text you are reading jumps as the screen refreshes in response to your action. Now imagine you were playing a video game like Atari's Space Invaders; the same keystroke would cause a small sprite at the bottom of your screen to move in response. These actions induce changes in pixels that are very different, but in both cases, the visual feedback in response to our actions indicates the presence of some object on the screen (a virtual paper and a virtual spacecraft, respectively) with properties that we can manipulate. Our keystrokes induce sparse changes to a program's state, and these changes are reflected on the screen, albeit not necessarily in a correspondingly sparse way (e.g., most pixels change when scrolling). Similarly, many of our interactions with the real world induce sparse changes to the underlying causal factors of our environment: lift a coffee cup and the cup moves, but not the rest of the objects on your desk; turn your head laterally, and the coordinates of all the objects in the room shift, but only in the horizontal direction. These examples hint at the main question we aim to answer in this paper: if we know that actions have sparse effects on the latent factors of our system, can we use that knowledge as weak supervision to help disentangle these latent factors from pixel-level data?

Self- and weakly-supervised learning approaches have made phenomenal progress in the last few years, with large-scale systems like GPT-3 (Brown et al., 2020) offering large improvements on all natural language benchmarks, and CLIP (Radford et al., 2021) outperforming state-of-the-art supervised models from six years ago (Szegedy et al., 2016) on the ImageNet challenge (Deng et al., 2009) without using any of the labels.

Mila - Quebec AI Institute, Université de Montréal. CIFAR Fellow. Correspondence to: kartik.ahuja@mila.quebec. 36th Conference on Neural Information Processing Systems (NeurIPS 2022).
Figure 1: Ball agent interaction environment. Different frames show the effect of sparse perturbations of individual ball coordinates (e.g., perturbing the x or y coordinate of the blue, purple, or green ball); the weak sparsity supervision relates consecutive frames through $f(\tilde{x}_i) = f(x) + \delta_i$.

Yet, despite these advances, these systems are still far from human reasoning abilities and often fail on out-of-distribution examples (Geirhos et al., 2020). To robustly generalize out of distribution, we need models that can infer the causal mechanisms that relate latent variables (Schölkopf et al., 2021; Schölkopf and von Kügelgen, 2022) because these mechanisms are invariant under distribution shift. The field of causal inference has developed theory and methods to infer causal mechanisms from data (Pearl, 2009; Peters et al., 2017), but these methods assume access to high-level abstract features instead of low-level signal data such as video, text, and images. We need representation learning methods that reliably recover these abstract features if we are to bridge the gap between causal inference and deep learning. This is a challenging task because the problem of inferring latent variables is not identified with independent and identically distributed (IID) data (Hyvärinen and Pajunen, 1999; Locatello et al., 2019), even in the limit of an infinite number of such IID examples. However, there has been significant recent progress in developing representation learning approaches that provably recover latent factors $Z$ (e.g., object positions, object colors, etc.) underlying complex data $X$ (e.g., images), where $X = g(Z)$, by going beyond the IID setting and using observations of $X$ along with minimal domain knowledge and supervision (Hyvarinen and Morioka, 2016, 2017; Locatello et al., 2020; Khemakhem et al., 2020a). These works establish provable identification of latents by leveraging strong structural assumptions such as independence conditional on auxiliary information (e.g., timestamps).

In this work, we aim to relax these distributional assumptions on the latent variables to achieve identification for arbitrary continuous latent distributions. Instead of distributional assumptions, we assume access to data generated under sparse perturbations that change only a few latent variables at a time as a source of weak supervision. Figure 1 illustrates our working example of this assumption: a simple environment where an agent's actions perturb the coordinates of a few balls at a time.

Our main contributions are summarized as follows. We show that sparse perturbations that impact one latent at a time are sufficient to learn latents that follow any unknown continuous distribution, up to permutation and scaling. Next, we consider more general settings, where perturbations impact one block of latent variables at a time. In the setting where blocks do not overlap, we recover the latents up to an affine transformation of these blocks. Further, we show that when perturbation blocks overlap, we get stronger identification: in this setting, we prove identification up to an affine transformation of the smallest intersecting block. Consequently, if there are blocks that intersect in one latent variable only, then such latents are identified up to permutation and scaling. We leverage these results to propose a natural estimation procedure and experimentally illustrate the theoretical claims on low-dimensional synthetic and high-dimensional image-based data.
2 Related works

Many of the works on provable identification of representations trace their roots to non-linear ICA (Hyvärinen and Pajunen, 1999). Hyvarinen and Morioka (2016, 2017) were the first to use auxiliary information in the form of timestamps and additional structure on the latent evolution to achieve provable identification. Since then, these works have been generalized in many exciting ways. Khemakhem et al. (2020a) assume independence of latents conditional on auxiliary information, and several of these assumptions were further relaxed by Khemakhem et al. (2020b).

Our work builds on the machinery developed by Ahuja et al. (2022). Ahuja et al. show that if we know the mechanisms that drive the evolution of latents, then the latents are identified up to equivariances of those mechanisms. However, the authors leave open the question of achieving identification without such knowledge. Here we consider a class of mechanisms where an agent's actions impact the latents through unknown perturbations, and we show how to achieve identification by exploiting the sparsity of the perturbations. This class of perturbations was first leveraged to prove identification by Locatello et al. (2020). However, Locatello et al. assume that the latents are independent, whereas we make no assumptions on the distribution other than continuity. Lachapelle et al. (2022) also use sparse interventions on the latents to strengthen the identification guarantees of Khemakhem et al. (2020a) for conditional exponential distributions. However, the form of sparsity that is leveraged in their work is different from ours. In our work, we assume that the vector of changes in the latents is sparse, i.e., some components change and the rest do not. In Lachapelle et al. (2022), all the components of the latents change post-intervention, but the graphical model capturing the interaction between the interventions (described as random variables) and the latents is sparse. Klindt et al. (2021) also use sparsity in time-series settings to attain identification: they enforce soft ℓ1-norm-driven sparsity in the vector of changes in the latents by assuming that latents evolve independently under a Laplace distribution, but they do not require access to data under interventions. Yao et al. (2021) and Lippe et al. (2022) model the latent evolution as a structural causal model unrolled in time. Yao et al. exploit non-stationarity and sufficient variability dictated by the auxiliary information to provide identification guarantees. Lippe et al. exploit causal interventions on the latents to provide identification guarantees, but they require knowledge of the intervention targets and assume an invariant causal model describing the relations between any adjacent time frames. In concurrent work, Brehmer et al. (2022) leverage data generated under causal interventions as a source of weak supervision and prove identification for structural causal models that are diffeomorphic transforms of exogenous noise. Our work also connects to an insightful line of work on multi-view ICA (Gresele et al., 2020), which proves identification under independent latents, in the following sense: we can interpret the data under different perturbations as different views of the same underlying latent. In addition, some recent papers explain the success of self-supervised contrastive learning (Zimmermann et al., 2021; Von Kügelgen et al., 2021) through the lens of identification of representations.
Above, we focused on provable representation identification, which is central to this work. We now give a brief overview of empirical works on disentanglement, which have shown success on some benchmark tasks but do not theoretically characterize the conditions for successful disentanglement. Many variations of variational autoencoders (VAEs) have been developed over the years to achieve disentanglement. β-VAE (Higgins et al., 2016) uses a hyperparameter in front of the KL regularizer to make the learned latents independent. FactorVAE (Kim and Mnih, 2018) proposes an adversarial training-based approach, where the discriminator encourages the learned representation to have independent components. Annealed β-VAE (Burgess et al., 2018) proposes to progressively increase the capacity of the bottleneck to enforce independence one component at a time. Ideas based on disentanglement have also been used in reinforcement learning; Higgins et al. (2017), Dittadi et al. (2020), and Miladinović et al. (2019) are some of the representative works in the area. Locatello et al. (2019) showed that most of the above methods can often fail to disentangle in the absence of supervision or inductive biases. As a result, there has been a surge of interest in building approaches that achieve provable representation identification. Lastly, there is a line of work that does not focus on disentanglement or representation identification, but has shown the benefits of sparsity-based inductive biases, such as sparse changes in latents over time (Goyal et al., 2019) or sparse interactions between latents (Goyal et al., 2021), under distribution shifts.

3 Latent identification under sparse perturbations

Data generation process. We start by describing the data generation process used for the rest of the work. There are two classes of variables we consider: a) unobserved latent variables $Z \in \mathcal{Z} \subseteq \mathbb{R}^d$, and b) observed variables $X \in \mathcal{X} \subseteq \mathbb{R}^n$. The latent variables $Z$ are sampled from a distribution $P_Z$ and then transformed by a map $g: \mathbb{R}^d \rightarrow \mathbb{R}^n$, where $g$ is injective and analytic², to generate $X$. We write this as follows:

$$z \sim P_Z, \qquad x = g(z), \qquad (1)$$

where $z$ and $x$ are realizations of the random variables $Z$ and $X$ respectively. It is impossible to invert $g$ just from the realizations of $X$ (Hyvärinen and Pajunen, 1999; Locatello et al., 2019). Most work has gone into understanding how the structure of the latents $Z$ and auxiliary information in the form of timestamps or labels play a role in solving the above problem. In this work, we depart from these assumptions and instead investigate the role of data generated under perturbations of the latents to achieve identification.

Define the set of perturbations as $I = \{1, \dots, m\}$ and the corresponding perturbation vectors as $\Delta = \{\delta_1, \dots, \delta_m\}$, where $\delta_i$ is the $i$th perturbation. Each latent $z$ is sampled from an arbitrary and unknown distribution $P_Z$. The same set of unknown perturbations in $\Delta$ is applied to each $z$ to generate $m$ perturbed latents $\{\tilde{z}_k\}_{k=1}^m$ per sampled $z$ and the corresponding observed vectors $\{\tilde{x}_k\}_{k=1}^m$. Each of these latents is transformed by the map $g$, and we observe $(x, \tilde{x}_1, \dots, \tilde{x}_m)$. Our goal is to use these observations and estimate the underlying latents. We summarize this data generation process (DGP) in the following assumption.

Assumption 1. The DGP follows
$$z \sim P_Z, \quad x = g(z), \quad \tilde{z}_k = z + \delta_k \;\; \forall k \in I, \quad \tilde{x}_k = g(\tilde{z}_k) \;\; \forall k \in I, \qquad (2)$$
where $g$ is injective and analytic, and $Z$ is a continuous random vector with full support over $\mathbb{R}^d$.³

The above DGP is very close to the DGP in Locatello et al. (2020), except that we do not require the latent dimensions to be mutually independent.
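To make the DGP in equation (2) concrete, here is a minimal simulation sketch. It is not the authors' code: the mixing function `g`, the perturbation magnitudes, and the latent distribution below are illustrative stand-ins chosen only to satisfy Assumption 1 (injective, smooth `g`; continuous latents; one-sparse perturbations).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m = 4, 16, 4            # latent dim, observation dim, number of perturbations

# A hypothetical injective smooth mixing function g: R^d -> R^n
# (a random linear map followed by a smooth nonlinearity plus a linear skip term).
W = rng.standard_normal((n, d))
def g(z):
    return np.tanh(z @ W.T) + 0.1 * (z @ W.T)

# Unknown perturbation set Delta: here one-sparse, one perturbation per latent dimension.
deltas = np.eye(d) * rng.uniform(0.5, 1.5, size=d)   # delta_k is non-zero in coordinate k only

def sample_example():
    z = rng.uniform(-1.0, 1.0, size=d)                        # arbitrary continuous P_Z
    x = g(z)
    x_tilde = np.stack([g(z + deltas[k]) for k in range(m)])  # perturbed observations
    return x, x_tilde                                         # the learner never sees z or deltas

x, x_tilde = sample_example()
print(x.shape, x_tilde.shape)   # (16,) (4, 16)
```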
To better understand the above DGP, let us turn to some examples. Consider a setting where an agent is interacting with an environment containing several balls (see Figure 1). The latent $z$ captures the properties of the objects; for example, in Figure 1, $z$ just captures the positions of each ball, but in general it could include more properties such as velocity, shape, color, etc. The agent perturbs the objects in the scene by $\delta_k$, which can modify a single property associated with one object or multiple properties from one or more objects, depending on how the agent acts. Note that when the agent perturbs a latent, it can lead to downstream effects. For instance, if the agent moves a ball to the edge of the table, the ball falls in subsequent frames. For this work, we only consider the observations just before and after the perturbation and not the downstream effects. In the Appendix (Section A.2.5), we explain these downstream effects using structural causal models. We also explain the connection between the perturbations in equation (2) and the causal interventions leveraged in Brehmer et al. (2022) and Lachapelle et al. (2022). The above example is typical of a reinforcement learning environment; other examples include natural videos with sparse changes (e.g., the MPI3D data (Gondal et al., 2019)).

In the DGP in equation (2), we assumed that for each scene $x$ there are multiple perturbations. It is possible to extend our results to settings where we perturb each scene only once, given a sufficiently diverse set of perturbations, i.e., for a small neighborhood of scenes around $x$, each scene in the neighborhood receives a different perturbation. We compare these two approaches experimentally.

Learning objective. The learner's objective is to use the observed samples $(x, \tilde{x}_1, \dots, \tilde{x}_m)$ generated by the DGP in Assumption 1 and learn an encoder $f: \mathbb{R}^n \rightarrow \mathbb{R}^d$ that inverts the function $g$ and recovers the true latents. For each observed sample $(x, \tilde{x}_1, \dots, \tilde{x}_m)$, the learner compares all the pairs $(x, \tilde{x}_k)$ pre- and post-perturbation. For every unknown perturbation $\delta_k$ used in the DGP in equation (2), the learner guesses the perturbation $\tilde{\delta}_k$ and enforces that the latents predicted by the encoder for $x$ and $\tilde{x}_k$ are consistent with the guess. We write this as

$$\forall (x, \tilde{x}_1, \dots, \tilde{x}_m) \text{ generated by the DGP in (2)}: \quad f(\tilde{x}_k) = f(x) + \tilde{\delta}_k, \;\; \forall k \in I. \qquad (3)$$

² An analytic function $g$ is an infinitely differentiable function such that, for all $z'$ in its domain, the Taylor series evaluated at $z'$ converges pointwise to $g(z')$.
³ The assumption on the support of $Z$ can be relaxed.

We denote the set of guessed perturbations as $\tilde{\Delta} = \{\tilde{\delta}_1, \dots, \tilde{\delta}_m\}$, where $\tilde{\delta}_i$ is the guess for perturbation $\delta_i$. We can turn the above identity into a mean square error loss given as

$$\min_{f, \tilde{\Delta}} \; \mathbb{E}\Big[ \big\| f(\tilde{x}_k) - f(x) - \tilde{\delta}_k \big\|^2 \Big], \qquad (4)$$

where the expectation is taken over the observed samples generated by the DGP in (2), and the minimization is over all possible maps $f$ and perturbation guesses in the set $\tilde{\Delta}$. Note that a trivial solution to the above problem is an encoder that maps everything to zero, with all guesses equal to zero. In the next section, we get rid of these trivial solutions by imposing the additional requirement that the span of the set $\tilde{\Delta}$ is $\mathbb{R}^d$. It is worth pointing out that we do not restrict the set of $f$'s to injective maps in theory or experiments. We denote the latent estimated by the encoder for a point $x$ as $\hat{z} = f(x)$. It is related to the true latent as follows: $\hat{z} = f \circ g(z) = a(z)$, where $a$ is some function that relates the true $z$ to the estimated $\hat{z}$.
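As a hedged illustration of the estimation objective in equation (4), the sketch below (ours, assuming PyTorch; not the authors' released implementation) jointly optimizes an encoder and the guessed perturbations. For simplicity it handles a single known perturbation index per batch, mirroring the single-perturbation-per-image variant used in the image experiments; the encoder and names here are hypothetical.

```python
import torch
import torch.nn as nn

d, n, m = 4, 16, 4

# Encoder f: R^n -> R^d (a small MLP stands in for the ResNet used on images).
f = nn.Sequential(nn.Linear(n, 100), nn.ReLU(), nn.Linear(100, d))

# Guessed perturbations delta_hat (one row per perturbation index k), learned jointly with f.
delta_hat = nn.Parameter(torch.randn(m, d))

opt = torch.optim.Adam(list(f.parameters()) + [delta_hat], lr=1e-3)

def loss_fn(x, x_tilde, k):
    """Squared error version of equation (4): f(x_tilde_k) should equal f(x) + delta_hat_k."""
    return ((f(x_tilde) - f(x) - delta_hat[k]) ** 2).sum(dim=-1).mean()

# x: (batch, n) observations, x_tilde: (batch, n) observations after perturbation k.
x, x_tilde, k = torch.randn(32, n), torch.randn(32, n), 2
opt.zero_grad()
loss = loss_fn(x, x_tilde, k)
loss.backward()
opt.step()
```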
In the next section, we show that if the perturbations are diverse, then $a$ is an affine transform. Further, we show that if the perturbations are sparse, then $a$ takes an even simpler form.

3.1 Sparse perturbations

We first show that it is possible to identify the true latents up to an affine transformation without any sparsity assumptions. Later, we leverage sparsity to strengthen the identification guarantees.

Assumption 2. The dimension of the span of the perturbations $\Delta$ in (2) is $d$, i.e., $\dim\big(\mathrm{span}(\Delta)\big) = d$.

The above assumption implies that the perturbations are diverse. We now state a regularity condition on the function $a$.

Assumption 3. $a$ is an analytic function. For each component $i \in \{1, \dots, d\}$ of $a(z)$ and each component $j \in \{1, \dots, d\}$ of $z$, define the set $S_{ij} = \{\theta \,|\, \partial_j a_i(z + b) = \partial_j a_i(z) + \partial_j^2 a_i(\theta)\, b, \; z \in \mathbb{R}^d\}$, where $b$ is a fixed vector in $\mathbb{R}^d$. Each set $S_{ij}$ has non-zero Lebesgue measure in $\mathbb{R}^d$.

If we restrict the encoder $f$ to be analytic, then $a$ is analytic since $g$ is also analytic, thus satisfying the first part of the above assumption. The second part of the above assumption can be understood as follows: suppose we have a scalar-valued function $h: \mathbb{R} \rightarrow \mathbb{R}$ that is differentiable. If we expand $h(u + v)$ around $h(u)$, by the mean value theorem we get $h(u + v) = h(u) + h'(\epsilon)v$, where $\epsilon \in [u, u + v]$. If we vary $u$ to take all the values in $\mathbb{R}$, then $\epsilon$ also varies. The above assumption states that the set of such $\epsilon$'s has non-zero Lebesgue measure.

Under the above assumptions, we show that an encoder that solves equation (3) identifies the true latents up to an affine transform, i.e., $\hat{z} = Az + c$, where $A \in \mathbb{R}^{d \times d}$ is a matrix and $c \in \mathbb{R}^d$ is an offset.

Proposition 1. If Assumptions 1, 2, and 3 hold, then the encoder that solves equation (3) (with $\tilde{\Delta}$ such that $\dim\big(\mathrm{span}(\tilde{\Delta})\big) = d$) identifies the true latents up to an invertible affine transform, i.e., $\hat{z} = Az + c$, where $A \in \mathbb{R}^{d \times d}$ is an invertible matrix and $c \in \mathbb{R}^d$ is an offset.

The proof of the above proposition follows the proof technique from Ahuja et al. (2022); for further details, refer to the Appendix (Section A.1). We interpret the above result in the context of the agent interacting with balls (as shown in Figure 1), where the latent vector $z$ captures the $x$ and $y$ coordinates of the $n_{\text{balls}}$ balls. Under each perturbation, the balls move along the vector dictated by the perturbation. If there are at least $2\,n_{\text{balls}}$ perturbations, then the latents estimated by the learned encoder are guaranteed to be an affine transformation of the actual positions of the balls.

3.1.1 Non-overlapping perturbations

In Proposition 1, we showed affine identification guarantees for the DGP from Assumption 1. We now explore identification when the perturbations are one-sparse, i.e., one latent changes at a time.

Assumption 4. The perturbations in $\Delta$ are one-sparse, i.e., each $\delta_i$ has one non-zero component.

Next, we show that under one-sparse perturbations, the estimated latents identify the true latents up to permutation and scaling.

Theorem 1. If Assumptions 1-4 hold and the number of perturbations per example equals the latent dimension, $m = d$,⁴ then the encoder that solves equation (3) (with $\tilde{\Delta}$ one-sparse and $\dim\big(\mathrm{span}(\tilde{\Delta})\big) = d$) identifies the true latents up to permutation and scaling, i.e., $\hat{z} = \Pi\Lambda z + c$, where $\Lambda \in \mathbb{R}^{d \times d}$ is an invertible diagonal matrix, $\Pi \in \mathbb{R}^{d \times d}$ is a permutation matrix, and $c$ is an offset.

For the proof of the above theorem, refer to Section A.1 in the Appendix. The theorem does not require that the learner knows either the identity of the perturbed component or the amount by which it changed. However, the learner has to use one-sparse perturbations as guesses.
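One simple way to enforce one-sparse guesses in practice is to fix the support of each guessed perturbation to a single coordinate and learn only its magnitude. The sketch below is a hypothetical parameterization for illustration; the paper does not prescribe this exact implementation.

```python
import torch
import torch.nn as nn

d = 4                                   # latent dimension, with m = d one-sparse guesses
scales = nn.Parameter(torch.ones(d))    # learned magnitude of each guess

def delta_hat(k):
    """Guess for perturbation k: non-zero only in coordinate k, so one-sparse by construction."""
    e_k = torch.zeros(d)
    e_k[k] = 1.0
    return scales[k] * e_k

print(delta_hat(2))   # tensor([0., 0., 1., 0.], grad_fn=...) with the initialization above
```

The one-sparse support is hard-coded here, so only the scales (and the encoder) are optimized; the theorem's requirement that the guesses span $\mathbb{R}^d$ is then satisfied as long as every scale stays non-zero.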
Suppose the learner does not know that the actual perturbations are one-sparse and instead uses guesses that are $p$-sparse, i.e., $p$ latents change at a time. In such a case, $\hat{z}$ and the true $z$ are related to each other through a permutation and a block-diagonal matrix, i.e., we can replace $\Lambda$ in the above result with a block-diagonal matrix instead of a diagonal matrix (see the Appendix for details). In the context of the ball agent interaction environment from Figure 1, the above result states that, provided the agent interacts with one coordinate of each ball at a time, it is possible to learn the position of each ball up to scaling errors.

We now consider a natural extension of the setting above, where the perturbations simultaneously operate on blocks of latents. In the ball agent interaction environment, this can lead to multiple scenarios: i) the agent interacts with one ball at a time but perturbs both coordinates simultaneously, or ii) the agent interacts with several balls simultaneously. Consider a perturbation $\delta_i$ (from equation (2)). We define the block of latents that is impacted under perturbation $\delta_i$ as $\{j \,|\, \delta_i^j \neq 0, \, j \in \{1, \dots, d\}\}$, where $\delta_i^j$ is the $j$th component of $\delta_i$. We group the perturbations in $I$ based on the block they act upon, i.e., perturbations in the same group act on the same block of latents. Define the set of groups corresponding to perturbations in $I$ as $G_I$. Define the set of corresponding blocks as $B_I = \{B_1, \dots, B_g\}$, where $B_k$ is the block impacted by the perturbations in group $k$. If $B_I$ partitions the set of latent components indexed $\{1, \dots, d\}$, then all the distinct blocks are non-overlapping. We formally define this below.

Definition 1. Blockwise and non-overlapping perturbations. If the set of blocks $B_I$ corresponding to perturbations $I$ forms a partition of $\{1, \dots, d\}$, then $I$ is said to be blockwise and non-overlapping. Formally stated, any two distinct $B_i, B_j \in B_I$ do not intersect, i.e., $B_i \cap B_j = \emptyset$, and $\cup_i B_i = \{1, \dots, d\}$.

From the above definition it follows that two perturbations either act on the same block or on completely different blocks with no overlapping variables.

Assumption 5. The perturbations $I$ (used in equation (2)) are blockwise and non-overlapping (see Definition 1). Each perturbation in $I$ is $p$-sparse, i.e., it impacts blocks of length $p$ ($p \leq d$) at a time.

Assumption 6. The learner knows the group label for each perturbation $i \in I$. Therefore, any two guessed perturbations in $\tilde{\Delta}$ associated with the same group in $G_I$ impact the same block of latents.

We illustrate Assumptions 5 and 6 in the following example. Consider the ball agent interaction environment (Figure 1). $z = [z_{1x}, z_{1y}, \dots, z_{n_{\text{balls}}x}, z_{n_{\text{balls}}y}]$ is the vector of positions of all balls, where $z_{ix/y}$ is the $x/y$ coordinate of ball $i$. If the agent randomly perturbs ball $i$, then it changes the block $(z_{ix}, z_{iy})$. We would call such a system 2-sparse. All the perturbations on ball $i$ are in one group. The agent knows the group of each perturbation: it does not know the ball index, but it knows whenever it interacts with the same ball.

Definition 2. If the recovered latent variables satisfy $\hat{z} = \Pi\Lambda z + c$, where $\Pi$ is a permutation matrix and $\Lambda$ is a block-diagonal matrix, then the latent variables are said to be recovered up to permutations and block-diagonal transforms.

In the theorem that follows, we show that under the assumptions made in this section, we achieve identification up to permutations and block-diagonal transforms with invertible $p \times p$ blocks.

Theorem 2.
If Assumptions 1-3, 5, and 6 hold, then the encoder that solves equation (3) (where $\tilde{\Delta}$ is $p$-sparse and $\dim\big(\mathrm{span}(\tilde{\Delta})\big) = d$) identifies the true latents up to permutation and block-diagonal transforms, i.e., $f(x) = \hat{z} = \Pi\Lambda z + c$, where $\Lambda \in \mathbb{R}^{d \times d}$ is an invertible block-diagonal matrix with blocks of size $p \times p$, $\Pi \in \mathbb{R}^{d \times d}$ is a permutation matrix, and $c \in \mathbb{R}^d$ is an offset.

⁴ We can relax this condition to $m \geq d$; refer to the Appendix for details.

For the proof of the above theorem, refer to Section A.1 in the Appendix. From the above theorem, we gather that the learner can separate the perturbed blocks. However, the latent dimensions within a block are linearly entangled. In the ball agent interaction environment with 2-sparse perturbations, the above theorem implies that the agent can separate each ball out but not their respective $x$ and $y$ coordinates. In the above theorem, we require the learner to know the group of each intervention (Assumption 6). In the Appendix (Section A.2.2), we relax Assumption 6 and show that we can continue to achieve identification up to permutation and block-diagonal transforms. However, we need a computationally expensive procedure that searches over subsets of latent dimensions to identify the dimensions impacted under the current intervention.

We briefly compare with Von Kügelgen et al. (2021), where the authors also establish block identification guarantees. In Von Kügelgen et al. (2021), the latent vector is divided into two parts: the content block and the style block. Across augmentations, style is varied and content is fixed. Von Kügelgen et al. leverage this invariance of the content across augmentations to learn the content block and not the style block. To summarize, invariance of content across different views is the key signal that is used to achieve identification. In our case, the perturbations act on different blocks of latents. In contrast to Von Kügelgen et al., we leverage sparsity of changes, i.e., we exploit both the varying part and the invariant part to identify all the distinct blocks and not just the content block.

3.1.2 Overlapping perturbations

In the previous section, we assumed that the blocks across different perturbations are non-overlapping. This section relaxes this assumption and allows the perturbation blocks to overlap. We start with a motivating example to show how overlapping perturbations can lead to stronger identification. Consider the agent interacting with two balls, where $z = [z_{1x}, z_{1y}, z_{2x}, z_{2y}]$ describes the coordinates of the two balls. The agent perturbs the first ball and then perturbs the second ball. For the purpose of this example, assume that these perturbations satisfy the assumptions in Theorem 2. We obtain that the estimated position of each ball, $\hat{z}_{ix/y}$, is linearly entangled w.r.t. the true $x$ and $y$ coordinates. For the first ball we get $\hat{z}_{1x} = a_1 z_{1x} + a_2 z_{1y} + a_3$. We also have the agent perturb the $x$ coordinates of the first and second ball together, and then it does the same with the $y$ coordinates. We apply Theorem 2 and obtain that the estimated $x$ coordinates of each ball are linearly entangled. We write this as $\hat{z}_{1x} = b_1 z_{1x} + b_2 z_{2x} + b_3$. We take the difference of the two relations for $\hat{z}_{1x}$ to get

$$(a_1 - b_1) z_{1x} + a_2 z_{1y} - b_2 z_{2x} + a_3 - b_3 = 0. \qquad (5)$$

Since the above has to hold for all $z_{1x}, z_{1y}, z_{2x}$, we get $a_1 = b_1$, $a_2 = 0$, $b_2 = 0$, and $a_3 = b_3$. Thus $\hat{z}_{1x} = a_1 z_{1x} + a_3$. Similarly, we can disentangle the rest of the balls. We take the insights from the above example and generalize them below.
Let us suppose that from the set of perturbations $I$ we can construct at least two distinct subsets $I_1$ and $I_2$ such that both subsets form blockwise non-overlapping perturbations (see Definition 1). Perturbations in $I_1$ ($I_2$) partition $\{1, \dots, d\}$ into blocks $B_{I_1}$ ($B_{I_2}$) respectively. It follows that there exist at least two blocks $B_1 \in B_{I_1}$ and $B_2 \in B_{I_2}$ such that $B_1 \cap B_2 \neq \emptyset$. From Theorem 2, we know that we can identify the latents in blocks $B_1$ and $B_2$ up to affine transforms. In the next theorem, we show that we can identify the latents in each of the blocks $B_1 \cap B_2$, $B_1 \setminus B_2$, $B_2 \setminus B_1$ up to affine transforms.

Assumption 7. Each perturbation in $I$ is $p$-sparse. The perturbations in each group span a $p$-dimensional space, i.e., $\forall q \in G_I$, $\dim\big(\mathrm{span}\{\delta_i\}_{i \in q}\big) = p$. There exist at least two distinct subsets of perturbations $I_1 \subseteq I$ and $I_2 \subseteq I$ that are both blockwise and non-overlapping.

Theorem 3. Suppose Assumptions 1, 3, 6, and 7 hold. Consider the subsets $I_1$ and $I_2$ that satisfy Assumption 7. For every pair of blocks $B_1 \in B_{I_1}$ and $B_2 \in B_{I_2}$, the encoder that solves equation (3) (where $\tilde{\Delta}$ is $p$-sparse and $\dim\big(\mathrm{span}(\tilde{\Delta})\big) = d$) identifies the latents in each of the blocks $B_1 \cap B_2$, $B_1 \setminus B_2$, $B_2 \setminus B_1$ up to invertible affine transforms.

For the proof of the above theorem, refer to Section A.1 in the Appendix. From the above theorem, it follows that if blocks overlap at one latent only, then all such latents are identified up to permutation and scaling. We now construct an example to show the identification of all the latents under overlapping perturbations. Suppose we have a 4-dimensional latent. The set of all contiguous blocks of length 2 is $\{\{1, 2\}, \{2, 3\}, \{3, 4\}, \{4, 1\}\}$. Different 2-sparse perturbations impact these blocks. Observe that every component between 1 and 4 gets to be the first element of a block exactly once and the last element of a block exactly once. As a result, each latent gets to be the only element at the intersection of two blocks. We apply Theorem 3 to this case and get that all the latents are identified up to permutation and scaling. We generalize this example below.

Assumption 8. $B_I$ is the set of all contiguous blocks of length $p$, where $p < d$. The perturbations in each block span a $p$-dimensional space. Further, assume that $d \bmod p = 0$.

In the above assumption, we construct $d$ contiguous blocks of length $p$. The construction ensures that each index in $\{1, \dots, d\}$ forms the first element of exactly one block and the last element of exactly one block. In Theorem 1 of Locatello et al. (2019) and in Theorem 5 of Lachapelle et al. (2022), a similar assumption is made that requires exactly one latent to be at the intersection of multiple blocks. In the next theorem, we show that under the above assumption, we achieve identification up to permutation and scaling.

Theorem 4. Suppose Assumptions 1, 3, 6, and 8 hold. Then the encoder that solves the identity in equation (3) (where $\tilde{\Delta}$ is $p$-sparse and $\dim\big(\mathrm{span}(\tilde{\Delta})\big) = d$) identifies the true latents up to permutations and scaling, i.e., $\hat{z} = \Pi\Lambda z + c$, where $\Pi \in \mathbb{R}^{d \times d}$ is a permutation matrix and $\Lambda \in \mathbb{R}^{d \times d}$ is a diagonal matrix.

For the proof of the above theorem, refer to Section A.1 in the Appendix. The total number of perturbations required in the above theorem is $p \cdot d$. If we plug in $p = 1$, we recover Theorem 1 as a special case. The above result highlights that if the block lengths are larger, then we need to scale the number of perturbations by the same factor to achieve identification up to permutation and scaling. We assumed a special class of perturbations operating on contiguous blocks.
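The contiguous-block construction in Assumption 8 is easy to write down explicitly. The following illustrative snippet (ours; indices are 0-based, unlike the 1-based notation above) builds the $d$ cyclic blocks of length $p$ and checks the property used by Theorem 3: every latent index appears as the sole element of some pairwise block intersection.

```python
from itertools import combinations

d, p = 6, 3
assert d % p == 0

# All d contiguous (cyclic) blocks of length p, as in Assumption 8.
blocks = [frozenset((start + i) % d for i in range(p)) for start in range(d)]

# Every latent index should be the unique element of some intersection of two blocks,
# which is what lets Theorem 3 pin it down up to permutation and scaling.
for j in range(d):
    assert any(b1 & b2 == {j} for b1, b2 in combinations(blocks, 2)), j

print(blocks)
```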
In general, the total number of distinct blocks can be up to $\binom{d}{p}$. Suppose $s$ distinct random blocks of length $p$ are selected for perturbations. As $s$ grows, we reach a point where each latent component is at the intersection of two blocks from different sets of blockwise non-overlapping perturbations. At that point, we identify all latents up to permutation and scaling.

Extensions. In the discussion so far, we made some assumptions for ease of exposition. In the appendix, we describe how to relax them. In the DGP in Assumption 1, the perturbations used are deterministic. In Section A.2.3, we extend the DGP in Assumption 1 to incorporate stochastic perturbations. Specifically, instead of $z \rightarrow z + \delta$ we consider a DGP where $z \rightarrow z + \delta + n$, where $n$ is a noise vector added to the perturbation $\delta$. We show that the key results presented in the paper extend provided the noise vector $n$ follows the same sparsity pattern as $\delta$. We also present experiments for this model in Section A.3. In the DGP in Assumption 1, the perturbations used are independent of the value of $z$. Instead of $z \rightarrow z + \delta$ we also consider a DGP given as $z \rightarrow z + m(z)$, where $m(\cdot)$ is a general non-linear perturbation map. In Section A.2.4, we show that the key results presented in the paper extend to this setting with non-linear perturbation mechanisms.

4 Experiments

Data generation processes. We conducted two sets of experiments, with low-dimensional synthetic and image-based inputs, that follow the DGP in equation (2). In the low-dimensional synthetic experiments we used two choices for $P_Z$: a) a uniform distribution with independent latents, and b) a normal distribution with latents that are blockwise independent (with block length $d/2$). We used an invertible multi-layer perceptron (MLP) (with 2 hidden layers) from Zimmermann et al. (2021) for $g$. We evaluated latent dimensions $d \in \{6, 10, 20\}$. The training and test data sizes were 10000 and 5000 respectively. For the image-based experiments we used PyGame's (Shinners, 2011) rendering engine for $g$ and generated 64 x 64 pixel images that look like those shown in Figure 1. The coordinates of each ball, $z_i$, were drawn independently from a uniform distribution, $z_i \sim U(0.1, 0.9)$. We varied the number of balls from 2 ($d = 4$) to 4 ($d = 8$). For these experiments, there was no fixed-size training set; instead, the images were generated online and we trained to convergence. Because these problems are high dimensional, we only sampled a single perturbation for each image.

Loss function, architecture, evaluation metrics. In all the experiments we optimized equation (4) with the squared error loss. The encoder $f$ was an MLP with two hidden layers of size 100 for the low-dimensional synthetic experiments and a ResNet-18 (He et al., 2015) for the image-based experiments. Further training details, such as the optimizers used, hyperparameters, etc., are in the Appendix (Section A.3).
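For concreteness, here is a sketch of the low-dimensional synthetic setup described above. The block-correlated normal latents and the mixing function are illustrative: the mixing uses orthogonal linear layers with leaky-ReLU in the spirit of the invertible MLP of Zimmermann et al. (2021), not their exact construction or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6

# Blockwise-independent normal latents: two independent blocks of length d/2,
# with correlations inside each block.
block = d // 2
A = rng.standard_normal((block, block))
cov_block = A @ A.T + np.eye(block)                     # a valid covariance for one block
cov = np.block([[cov_block, np.zeros((block, block))],
                [np.zeros((block, block)), cov_block]])

def sample_z(n_samples):
    return rng.multivariate_normal(np.zeros(d), cov, size=n_samples)

# Invertible-MLP-style mixing g: square orthogonal weights + leaky-ReLU keep g injective.
def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

W1 = np.linalg.qr(rng.standard_normal((d, d)))[0]
W2 = np.linalg.qr(rng.standard_normal((d, d)))[0]
def g(z):
    return leaky_relu(leaky_relu(z @ W1) @ W2)

x = g(sample_z(10000))
print(x.shape)   # (10000, 6)
```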
Table 1: Comparing MCC and BMCC for non-overlapping perturbations. The number of perturbations applied for each example is given in parentheses.

| d  | P_Z     | MCC C-wise (d) | MCC C-wise (1) | BMCC B-wise (d) | BMCC B-wise (1) |
|----|---------|----------------|----------------|-----------------|-----------------|
| 6  | Normal  | 0.99 ± 0.00    | 0.99 ± 0.00    | 0.99 ± 0.00     | 0.99 ± 0.01     |
| 10 | Normal  | 0.99 ± 0.00    | 0.99 ± 0.01    | 0.99 ± 0.00     | 0.91 ± 0.02     |
| 20 | Normal  | 0.99 ± 0.00    | 0.88 ± 0.03    | 0.99 ± 0.00     | 0.90 ± 0.01     |
| 6  | Uniform | 0.99 ± 0.00    | 0.99 ± 0.00    | 0.99 ± 0.00     | 0.96 ± 0.04     |
| 10 | Uniform | 0.99 ± 0.00    | 0.99 ± 0.01    | 0.99 ± 0.00     | 0.81 ± 0.05     |
| 20 | Uniform | 0.99 ± 0.00    | 0.82 ± 0.02    | 0.85 ± 0.08     | 0.51 ± 0.04     |

Figure 2: Illustrating blockwise dependence (d = 10). Panels plot predicted latents against true latents: Predicted Z1/Z2 vs. True Z1/Z2 (within the block) and Predicted Z3/Z4 vs. True Z1/Z2 (outside the block).

Table 2: MCC for B-wise (overlap).

| d  | Distribution | MCC         |
|----|--------------|-------------|
| 6  | Normal       | 0.95 ± 0.01 |
| 10 | Normal       | 0.96 ± 0.01 |
| 20 | Normal       | 0.99 ± 0.01 |
| 6  | Uniform      | 0.86 ± 0.03 |
| 10 | Uniform      | 0.88 ± 0.03 |
| 20 | Uniform      | 0.81 ± 0.03 |

We used the mean correlation coefficient (MCC) (Hyvarinen and Morioka, 2016) to verify the claims in Theorems 1 and 4. If MCC equals one, then the estimated latents identify the true latents up to permutation and scaling. We extend MCC to blockwise MCC (BMCC) to verify the claims in Theorem 2. If BMCC equals one, then the estimated latents identify the true latents up to permutation and block-diagonal transforms. Further details are in the Appendix (Section A.3).

Non-overlapping perturbations. We first conducted experiments with one-sparse perturbations; the set $\Delta$ consists of $m = d$ one-sparse perturbations that span a $d$-dimensional space. In the context of the image experiments, these perturbations correspond to moving each ball individually along a single axis. The learner solves the identity in equation (3) using a set $\tilde{\Delta}$ of random one-sparse perturbations that span a $d$-dimensional space. In Table 1, we used the low-dimensional synthetic data generating process to compare the effect of (i) applying all $m = d$ perturbations to each instance $z$ (following the DGP in (2)) against a more practical setting (ii) where a single perturbation is selected uniformly at random from $\Delta$ and applied to each instance $z$. The results for (i) and (ii) are shown in the C-wise (componentwise) columns of Table 1, labelled (d) and (1) respectively. We observed high MCCs in both settings. The results were similar in the more challenging image-based experiments (see Table 3, C-wise column), with MCC scores > 0.97 for all the settings that we tested, as expected given the results presented in Theorem 1.

In our next experiments, the set of perturbations $\Delta$ comprised $d$ 2-sparse non-overlapping perturbations that span a $d$-dimensional space. We repeated the same synthetic experiments as above with one and $d$ perturbations per instance. Under these assumptions, we should expect pairs of latents to be separated blockwise but linearly entangled within the blocks (cf. Theorem 2). We found this to be the case. The high BMCC numbers in Table 1, displayed under the B-wise (blockwise) columns (except for $d = 20$ with one perturbation per sample), show disentanglement between the blocks of latents. In Figure 2, the first two rows and columns show how the predicted latents corresponding to a block are correlated with their true counterpart (see Predicted Zi vs. True Zi) and with the other latent in the block (Predicted Z1 vs. True Z2 and vice versa). The plots in the last column show that the predicted latents did not bear a correlation with a randomly selected latent from outside the block.
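As a hedged illustration of how an MCC-style score can be computed, the sketch below uses absolute Pearson correlations and an optimal one-to-one matching; the paper's exact MCC computation, and its blockwise extension BMCC, are described in Appendix A.3 and may differ in detail.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mcc(z_true, z_pred):
    """Mean correlation coefficient under the best one-to-one matching of latent dimensions.

    z_true, z_pred: arrays of shape (n_samples, d). Uses absolute Pearson correlation.
    """
    d = z_true.shape[1]
    corr = np.abs(np.corrcoef(z_true.T, z_pred.T)[:d, d:])   # d x d cross-correlation block
    row, col = linear_sum_assignment(-corr)                   # maximise the total correlation
    return corr[row, col].mean()

# A permuted, rescaled, and shifted copy of the true latents scores (approximately) 1.
z = np.random.default_rng(0).normal(size=(1000, 4))
print(mcc(z, 2.0 * z[:, [2, 0, 3, 1]] + 1.0))                 # ~1.0
```

A blockwise variant can be obtained analogously by matching groups of dimensions instead of single dimensions and measuring how well each true block is linearly predicted by its matched block.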
Table 3: Image experiments.

| d | MCC C-wise | MCC B-wise ($d$ blocks) | MCC B-wise ($\binom{d}{2}$ blocks) |
|---|------------|-------------------------|------------------------------------|
| 4 | 0.994      | 0.710                   | 0.864                              |
| 6 | 0.981      | 0.817                   | 0.912                              |
| 8 | 0.975      | 0.866                   | 0.934                              |

Figure 3: (Left) Results for the image-based experiments. (Centre) Example images in which the bottom-left ball is shifted to the right. (Right) A trained encoder's predictions for the two images shown in (centre). The green ball prediction shifts right by $c/2$ and the other balls shift left by $c/(2(n-1))$. For further illustrations, refer to the animations in the supplement.

Overlapping perturbations. In this section, we experimented with blocks of size two that overlap, in order to conform with the setting described in Theorem 4. We used the same distributions as before and only changed the type of perturbations. The low-dimensional synthetic results are summarized in Table 2. The results were largely as expected, with a strong correspondence between the predicted and true latents reflected by high MCC values. On the image datasets (see Table 3), we found that the MCC scores depended on both the number of balls and how the blocks were selected. We compared two strategies for selecting blocks of latents to perturb: either select uniformly from all adjacent pairs, $I = \{(i \bmod d, i + 1 \bmod d)\}$ ($d$ blocks), or uniformly from all combinations of latent indices, $I = \{(i, j) : i \in \{1, \dots, d\}, j > i\}$ ($\binom{d}{2}$ blocks). The latter led to higher MCC scores (ranging from 0.86 to 0.93) as it placed more constraints on the solution space. The dependence on the number of balls is more surprising. To investigate the implied entanglement from the lower MCC scores, we evaluated trained encoders on images where we kept $n_{\text{balls}} - 1$ balls in a fixed location and moved one of the balls (see Section A.3 in the Appendix for example images). If the coordinates were perfectly disentangled, the encoder should predict no movement for the static balls. When the moving ball shifted by $c$ units, the predicted location of the static balls shifted by $c/(2(n_{\text{balls}} - 1))$ and the moving ball shifted by $c/2$ units. We further verified this claim and ran blockwise experiments with $n_{\text{balls}} = 10$ balls ($d = 20$) and got MCC scores of 0.930 and 0.969 for $d$ and $\binom{d}{2}$ blocks respectively. In the Appendix (Section A.3), we show that this solution is a stationary point, and we achieve a perfect MCC of one in the limit $n_{\text{balls}} \rightarrow \infty$. Finally, the code to reproduce the experiments presented above can be found at https://github.com/ahujak/WSRL.

5 Discussion and limitations

Our work presents the first systematic analysis of the role of sparsity in achieving latent identification under unknown arbitrary latent distributions. We assume that every sample (or at least every neighborhood of a sample) experiences the same set of perturbations. A natural question is how to extend our results to settings where this assumption may not hold. Data augmentation provides a rich source of perturbations; our results cover translations, but extending them to other forms of augmentation is an important future direction. We followed the literature on non-linear ICA (Hyvarinen et al., 2019) and made two assumptions: i) the map $g$ that mixes the latents is injective, and ii) the dimension $d$ of the latent is known. We believe future works should aim to relax these assumptions. In reinforcement learning (RL) environments, the effects of actions can often be sparse.
Therefore, we believe that illustrating the efficacy of the proposed approach in RL environments (Ahmed et al., 2020) is an important direction for demonstrating the value of the proposed theory and methods in real-world applications.

6 Acknowledgements

We thank Sébastien Lachapelle and Anirudh Goyal for insightful discussions. Kartik Ahuja acknowledges the support from the IVADO postdoctoral fellowship. Jason Hartford acknowledges support from the Natural Sciences and Engineering Research Council of Canada (NSERC) and Recursion Pharmaceuticals. Yoshua Bengio acknowledges the support from CIFAR, Samsung and IBM.

References

Ahmed, O., Träuble, F., Goyal, A., Neitz, A., Bengio, Y., Schölkopf, B., Wüthrich, M., and Bauer, S. (2020). CausalWorld: A robotic manipulation benchmark for causal structure and transfer learning. arXiv preprint arXiv:2010.04296.
Ahuja, K., Hartford, J., and Bengio, Y. (2022). Properties from mechanisms: an equivariance perspective on identifiable representation learning. In International Conference on Learning Representations.
Brehmer, J., De Haan, P., Lippe, P., and Cohen, T. (2022). Weakly supervised causal representation learning. arXiv preprint arXiv:2203.16437.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901.
Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A. (2018). Understanding disentangling in beta-VAE. arXiv preprint arXiv:1804.03599.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255. IEEE.
Dittadi, A., Träuble, F., Locatello, F., Wüthrich, M., Agrawal, V., Winther, O., Bauer, S., and Schölkopf, B. (2020). On the transfer of disentangled representations in realistic settings. arXiv preprint arXiv:2010.14407.
Geirhos, R., Jacobsen, J., Michaelis, C., Zemel, R. S., Brendel, W., Bethge, M., and Wichmann, F. A. (2020). Shortcut learning in deep neural networks. CoRR, abs/2004.07780.
Gondal, M. W., Wuthrich, M., Miladinovic, D., Locatello, F., Breidt, M., Volchkov, V., Akpo, J., Bachem, O., Schölkopf, B., and Bauer, S. (2019). On the transfer of inductive bias from simulation to the real world: a new disentanglement dataset. Advances in Neural Information Processing Systems, 32.
Goyal, A., Didolkar, A., Ke, N. R., Blundell, C., Beaudoin, P., Heess, N., Mozer, M. C., and Bengio, Y. (2021). Neural production systems. Advances in Neural Information Processing Systems, 34:25673-25687.
Goyal, A., Lamb, A., Hoffmann, J., Sodhani, S., Levine, S., Bengio, Y., and Schölkopf, B. (2019). Recurrent independent mechanisms. arXiv preprint arXiv:1909.10893.
Gresele, L., Rubenstein, P. K., Mehrjou, A., Locatello, F., and Schölkopf, B. (2020). The incomplete Rosetta Stone problem: Identifiability results for multi-view nonlinear ICA. In Uncertainty in Artificial Intelligence, pages 217-227. PMLR.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep residual learning for image recognition. CoRR, abs/1512.03385.
Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. (2016). beta-VAE: Learning basic visual concepts with a constrained variational framework.
Higgins, I., Pal, A., Rusu, A., Matthey, L., Burgess, C., Pritzel, A., Botvinick, M., Blundell, C., and Lerchner, A. (2017). DARLA: Improving zero-shot transfer in reinforcement learning. In International Conference on Machine Learning, pages 1480-1490. PMLR.
Hyvarinen, A. and Morioka, H. (2016). Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. Advances in Neural Information Processing Systems, 29.
Hyvarinen, A. and Morioka, H. (2017). Nonlinear ICA of temporally dependent stationary sources. In Artificial Intelligence and Statistics, pages 460-469. PMLR.
Hyvärinen, A. and Pajunen, P. (1999). Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429-439.
Hyvarinen, A., Sasaki, H., and Turner, R. (2019). Nonlinear ICA using auxiliary variables and generalized contrastive learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 859-868. PMLR.
Khemakhem, I., Kingma, D., Monti, R., and Hyvarinen, A. (2020a). Variational autoencoders and nonlinear ICA: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pages 2207-2217. PMLR.
Khemakhem, I., Monti, R., Kingma, D., and Hyvarinen, A. (2020b). ICE-BeeM: Identifiable conditional energy-based deep models based on nonlinear ICA. Advances in Neural Information Processing Systems, 33:12768-12778.
Kim, H. and Mnih, A. (2018). Disentangling by factorising. In International Conference on Machine Learning, pages 2649-2658. PMLR.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization.
Klindt, D. A., Schott, L., Sharma, Y., Ustyuzhaninov, I., Brendel, W., Bethge, M., and Paiton, D. (2021). Towards nonlinear disentanglement in natural data with temporal sparse coding. In International Conference on Learning Representations.
Lachapelle, S., Rodriguez, P., Sharma, Y., Everett, K. E., Le Priol, R., Lacoste, A., and Lacoste-Julien, S. (2022). Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ICA. In First Conference on Causal Learning and Reasoning.
Lippe, P., Magliacane, S., Löwe, S., Asano, Y. M., Cohen, T., and Gavves, E. (2022). CITRIS: Causal identifiability from temporal intervened sequences. arXiv preprint arXiv:2202.03169.
Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., and Bachem, O. (2019). Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, pages 4114-4124. PMLR.
Locatello, F., Poole, B., Rätsch, G., Schölkopf, B., Bachem, O., and Tschannen, M. (2020). Weakly-supervised disentanglement without compromises. In International Conference on Machine Learning, pages 6348-6359. PMLR.
Miladinović, Đ., Gondal, M. W., Schölkopf, B., Buhmann, J. M., and Bauer, S. (2019). Disentangled state space representations. arXiv preprint arXiv:1906.03255.
Mityagin, B. (2015). The zero set of a real analytic function. arXiv preprint arXiv:1512.07276.
Pearl, J. (2009). Causality. Cambridge University Press.
Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. (2021). Learning transferable visual models from natural language supervision. CoRR, abs/2103.00020.
Schölkopf, B., Locatello, F., Bauer, S., Ke, N.
R., Kalchbrenner, N., Goyal, A., and Bengio, Y. (2021). Toward causal representation learning. Proceedings of the IEEE, 109(5):612-634.
Schölkopf, B. and von Kügelgen, J. (2022). From statistical to causal learning. arXiv preprint arXiv:2204.00607.
Shinners, P. (2011). PyGame. http://pygame.org/.
Szegedy, C., Ioffe, S., and Vanhoucke, V. (2016). Inception-v4, Inception-ResNet and the impact of residual connections on learning. CoRR, abs/1602.07261.
Von Kügelgen, J., Sharma, Y., Gresele, L., Brendel, W., Schölkopf, B., Besserve, M., and Locatello, F. (2021). Self-supervised learning with data augmentations provably isolates content from style. Advances in Neural Information Processing Systems, 34.
Yao, W., Sun, Y., Ho, A., Sun, C., and Zhang, K. (2021). Learning temporally causal latent processes from general temporal data. arXiv preprint arXiv:2110.05428.
Zimmermann, R. S., Sharma, Y., Schneider, S., Bethge, M., and Brendel, W. (2021). Contrastive learning inverts the data generating process. In International Conference on Machine Learning, pages 12979-12990. PMLR.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] Each of the main contributions listed in the introduction corresponds to a theorem in Section 3, and the final claim corresponds to the experiments described in Section 4.
(b) Did you describe the limitations of your work? [Yes] See Section 5.
(c) Did you discuss any potential negative societal impacts of your work? [No] We do not foresee any potential negative societal impacts specific to this work.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes] See Section 3.
(b) Did you include complete proofs of all theoretical results? [Yes] See the Appendix (Section A.1) in the Supplementary Material.
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We have provided the code in the Supplementary Material.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See the Appendix (Section A.3).
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] For the low-dimensional synthetic experiments, which were computationally less demanding, we ran five seeds. For the image-based experiments, the run time for each case was on average 12 hours; since we had several such cases, running several seeds was not feasible given the compute we had at our disposal.
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See the Appendix (Section A.3).
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes] See the Appendix (Section A.3).
(b) Did you mention the license of the assets? [Yes] See the Appendix (Section A.3).
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes] The code can be found at https://github.com/ahujak/WSRL.
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating?
[N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]