# provably_learning_objectcentric_representations__17ade5d4.pdf

Provably Learning Object-Centric Representations

Jack Brady * 1 2 Roland S. Zimmermann * 1 2 Yash Sharma 2 3 Bernhard Sch olkopf 1

Julius von K ugelgen 1 4 Wieland Brendel 1 2

Learning structured representations of the visual world in terms of objects promises to significantly improve the generalization abilities of current machine learning models. While recent efforts to this end have shown promising empirical progress, a theoretical account of when unsupervised objectcentric representation learning is possible is still lacking. Consequently, understanding the reasons for the success of existing object-centric methods as well as designing new theoretically grounded methods remains challenging. In the present work, we analyze when object-centric representations can provably be learned without supervision. To this end, we first introduce two assumptions on the generative process for scenes comprised of several objects, which we call compositionality and irreducibility. Under this generative process, we prove that the ground-truth object representations can be identified by an invertible and compositional inference model, even in the presence of dependencies between objects. We empirically validate our results through experiments on synthetic data. Finally, we provide evidence that our theory holds predictive power for existing object-centric models by showing a close correspondence between models compositionality and invertibility and their empirical identifiability.1

1 Introduction

Human intelligence exhibits an unparalleled ability to generalize from a limited amount of experience to a wide range

*Equal contribution Shared last author 1MPI for Intelligent Systems, T ubingen 2T ubingen AI Center, T ubingen 3University of T ubingen, T ubingen, Germany 4Department of Engineering, University of Cambridge, Cambridge, United Kingdom. Correspondence to: Jack Brady, Wieland Brendel <first.last@tue.mpg.de>.

Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). 1Code/Website: brendel-group.github.io/objects-identifiability

of novel situations (Tenenbaum et al., 2011). To build machines with similar capabilities, a fundamental question is what types of abstract representations of sensory inputs enable such generalization (Goyal & Bengio, 2022). Research in cognitive psychology suggests that one key abstraction is the ability to represent visual scenes in terms of individual objects (Spelke, 2003; Spelke & Kinzler, 2007; Dehaene, 2020; Peters & Kriegeskorte, 2021). Such object-centric representations are thought to facilitate core cognitive abilities such as compositional generalization (Fodor & Pylyshyn, 1988; Lake et al., 2017; Battaglia et al., 2018; Greff et al.,

2020) and causal reasoning over discrete concepts (Marcus, 2001; Gopnik et al., 2004; Gerstenberg & Tenenbaum, 2017; Gerstenberg et al., 2021).

Significant effort has thus gone into endowing machine learning models with the capacity to learn object-centric representations from raw visual input. While initial approaches were mostly supervised (Ronneberger et al., 2015; He et al., 2017; Chen et al., 2017), a recent wave of new methods explore learning object-centric representations without direct supervision (Greff et al., 2019; Burgess et al., 2019; Lin et al., 2020; Kipf et al., 2020; Locatello et al., 2020; Weis et al., 2021; Biza et al., 2023). These methods have begun exhibiting impressive results, showing potential to scale to complex visual scenes (Caron et al., 2021; Singh et al., 2022a; Sajjadi et al., 2022; Seitzer et al., 2023) and real-world video datasets (Kipf et al., 2022; Singh et al., 2022b; Elsayed et al., 2022).

Yet, despite this empirical progress, we still lack a theoretical understanding of when unsupervised object-centric representation learning is possible. This makes it challenging to isolate the reasons underlying the success and failure of existing object-centric models and to develop principled ways to improve them. Furthermore, it is currently not possible to design novel object-centric methods that are theoretically grounded and not solely based on heuristics, many of which break down in more realistic settings (Karazija et al., 2021; Papa et al., 2022; Yang & Yang, 2022).

In the present work, we aim to address this deficiency by investigating when object-centric representations can provably be learned without any supervision. To this end, we first specify a data-generating process for multi-object scenes as

Provably Learning Object-Centric Representations

Unknown Generator f

Encoder Decoder f

Reconstruct Ground-Truth Latents

minimizing Compositional Contrast

Observed Multi-Object Scene x

z 1=h1 -1(z2)

z 2=h2 -1(z3)

z 3=h3 -1(z1)

Inferred Latents

#invertible #compositional #irreducible

#invertible #compositional

Figure 1. When can unsupervised object-centric representations provably be learned? We assume that observed scenes x comprising K objects are rendered by an unknown generator f from multiple ground-truth latent slots z1, ..., z K (here, K = 3). We assume that this generative model has two key properties, which we call compositionality (Defn. 1) and irreducibility (Defn. 5). Under this model, we prove (Thm. 1): An invertible inference model with a compositional inverse yields latent slots ˆzi which identify the ground-truth slots up to permutation and slot-wise invertible functions hi (slot identifiability, Defn. 6). To measure violations of compositionality in practice, we introduce a contrast function (Defn. 7) which is zero if and only if a function is compositional, while to measure invertibility, we rely on the reconstruction loss in an auto-encoder framework.

a structured latent variable model in which each object is described by a subset of latents, or a latent slot. We then study the identifiability of object-centric representations under this model, i.e., we investigate under which conditions an inference model will be guaranteed to recover the subset of ground-truth latents for each object.

Because identifying the ground-truth latent variables is impossible without further assumptions on the generative process (Hyv arinen & Pajunen, 1999; Locatello et al., 2019), previous identifiability results primarily rely on distributional assumptions on the latents (Hyv arinen & Morioka, 2016; 2017; Hyv arinen et al., 2019; Khemakhem et al., 2020a;b; Klindt et al., 2021; Zimmermann et al., 2021). In contrast, we make no such assumptions, thus allowing for arbitrary statistical and causal dependencies between objects.

Structure and Main Contributions. In the present work, we instead take the position that the object-centric nature of the problem imposes a very specific structure on the generator function that renders scenes from latent slots ( 2). Specifically, we define two key properties that this function should satisfy: compositionality (Defn. 1) and irreducibility (Defn. 5). Informally, these properties imply that every pixel can only correspond to one object and that information is shared across different parts of the same object but not between parts of different objects inspired by the principle of independent causal mechanisms (Peters et al., 2017). Under this generative model, we then prove in 3 our main theoretical result: the ground-truth latent slots can be identified without supervision by an invertible inference model with a compositional inverse (Thm. 1). To quantify compo-

sitionality, we introduce a contrast function (Defn. 7) that is zero if and only if a function is compositional; to quantify invertibility, we rely on reconstruction error. We validate on synthetic data that inference models which maximize invertibility and compositionality indeed identify the ground-truth latent slots, even with dependencies between latents ( 5.1). Finally, we examine existing object-centric learning models on image data and find a close correspondence between models compositionality and invertibility and their success in identifying the ground-truth latent slots ( 5.2).

To the best of our knowledge, the present work provides the first identifiability result for object-centric representations. We hope that this lays the groundwork for a better understanding of success and failure in unsupervised objectcentric learning, and that future work can build on these insights to develop more effective learning methods.

Notation. Bold lowercase z denotes vectors, bold uppercase J denotes matrices. For n N, let [n] denote the set { 1, . . . , n }. Additionally, if f is a function with n component functions, let f S denote the restriction of f to the component functions indexed by S [n], i.e., f S := (fs)s S.

2 Generative Model

While humans have a clear intuition for what constitutes an object, formalizing this notion mathematically is not straightforward. Indeed, there is no universally agreed-upon definition of an object; various formalizations based upon distinct criteria co-exist (Green, 2019; Spelke, 1990; Koffka, 1936; Greff et al., 2020). We approach the problem by

Provably Learning Object-Centric Representations

A | Compositional Generator Two slots never affect the same pixel

The Jacobian of f has disjoint slot-wise blocks

Slot 1 Slot 2 Latents

Pixels (Ordered to expose blocks)

Slot 1 Slot 2

Pixels (Ordered to expose blocks)

B | Non-Compositional Generator Two slots can affect the same pixels

The Jacobian of f has non-disjoint slot-wise blocks

Generator f

Generator f

Figure 2. Difference between a compositional and a non-compositional generator. (A) For a compositional generator f, every pixel is affected by at most one latent slot. As a result, there always exists an ordering of the pixels such that the generator s Jacobian Jf consists of disjoint blocks, one for each latent slot (bottom). Note that both the pixel ordering and the specific structure of the Jacobian are not fixed across scenes and might depend on the latent input z. (B) For a non-compositional generator, there exists no pixel ordering that exposes such a structure in the Jacobian, since the same pixel can be affected by more than one latent slot.

defining multi-object scenes in terms of a latent variable model (see Fig. 1 for an overview) and argue that the objectcentric nature of the problem necessitates a very specific structure on the generator, which we leverage in 3 to prove our identifiability result.

As a starting point, we assume that observed data samples x of multi-object scenes are generated from a set of latent random vectors z through a diffeomorphism2 f : Z X, mapping from a latent space Z to an observation space X,

z pz, x = f(z). (1)

The only assumption we place on pz is that it is fully supported on Z. In particular, we do not require independence and allow for arbitrary dependencies between components of z, motivated by the fact that the presence or properties of certain objects may be correlated with those of other objects.

2.1 Slots and Compositionality

We think of an object in a scene as being encoded not by a single latent component zi but instead by a group of latents zk which specify its properties. For a scene comprised of K objects, we thus assume that the latent space Z factorizes into K subspaces Zk, which we refer to as slots. Each slot is assumed to have dimension M, representing, e.g., M distinct object properties. More precisely, k [K] : Zk = RM, and Z = Z1 ZK = RKM.

Let zk be the latent vector of the kth slot. The full KMdimensional latent scene representation vector z is then

2a differentiable bijection with differentiable inverse

given by the concatenation of the latents from all slots,

z = (z1, . . . , z K) . (2)

We would like to ensure that each latent slot zk is responsible for encoding a distinct object in a scene. To this end, the latent scene representation z should be rendered by f such that each slot generates exactly one object (see Fig. 1). If f is an arbitrary function with no additional constraints, however, this will generally not be the case.

First, f lacks any structure which ensures that an object is not generated by more than one latent slot. To see this, let Ik(z) [N] denote the subset of pixels in an image generated from scene representation z that functionally depend on slot k,

Ik(z) := n [N] : fn

zk (z) = 0 . (3)

Note the dependence on z, which encodes that an object may appear in different places across different scenes.

Without further constraints on f, the pixel subsets Ik(z) and Ij(z) can overlap for any k = j such that latent slots k, j can affect the same pixels and thus contribute to generating the same object (see Fig. 2B, top). To avoid this, we impose the following structure on f, which we call compositionality. Definition 1 (Compositionality). Let f : Z X be differentiable. f is said to be compositional if

z Z : k = j = Ik(z) Ij(z) = . (4)

Compositionality implies that each pixel is a function of at most one latent slot and thus imposes a local sparsity

Provably Learning Object-Centric Representations

A | Reducible Mechanism I Different latents affect different groups of pixels, S1 & S2

B | Reducible Mechanism II C | Irreducible Mechanism All latents affect S1 & S2, but information decomposes across S1 & S2 Information does not decompose across any pixel subsets S & S

Sub-mechanism 2

Sub-mechanism 1 S1 S2

Generator f

Generator f

Generator f

Figure 3. (Ir)reducible mechanisms. (A) A simple example of a reducible mechanism is one for which disjoint subsets of latents from the same slot render pixel groups S1 and S2 separately such that they form independent sub-mechanisms according to Defn. 4. This independence between sub-mechanisms is indicated by the difference in colors. (B) Not all reducible mechanisms look as simple as panel A: here, S1 and S2 depend on every latent component in the slot, but the information in S1 S2 still decomposes across S1 and S2 as sub-mechanisms 1 and 2 are independent. (C) In contrast, for an irreducible mechanism, the information does not decompose across any pixel partition S, S , and so it is impossible to separate it into independent sub-mechanisms.

structure on the Jacobian matrix Jf = fi

ij of f, which is visualized in Fig. 2, bottom. Intuitively, the Jacobian of a compositional generator can always be brought into block structure through an appropriate permutation of the pixels. However, this block structure is local in that the required permutation may differ across scene representations z.

2.2 Mechanisms and Irreducibility

While compositionality ensures that different latent slots do not generate the same object, we need an additional constraint on f to ensure that each slot generates only one object, rather than something humans would regard as multiple objects. To see this, consider the example depicted in Fig. 3A, where f maps the first half of the latent slot to the pixels denoted S1 and the second half to S2. It is clear that for humans, these groups of pixels would likely be considered as distinct objects. On the other hand, it is not immediately clear what formal criteria would give rise to such a distinction.

Intuitively, the issue with the two sub-objects S1 and S2 in Fig. 3A appears to be that they are independent of each other in some sense. To avoid such splitting of objects within slots, we would thus like to enforce that pixels belonging to the same object are dependent on one another. But what is a meaningful notion of such instance-level independence of objects? Since we are dealing with a single scene sampled according to Eq. (1), it cannot be statistical in nature. Instead, our intuition is more aligned with the notion of algorithmic independence of objects (Janzing & Sch olkopf, 2010), a formalization3 of the principle of independent causal mechanisms (ICM) which posits that physical generative processes consist of autonomous mod-

3albeit an impractical one formulated in terms of Kolmogorov complexity (algorithmic information), which is not computable

ules that do not inform or influence each other (Peters et al., 2017). The two subsets of pixels S1 and S2 in Fig. 3A are independent of each other in precisely this sense: they arise from autonomous processes that do not share information.

In the following, we therefore draw inspiration from prior implementations of the ICM principle (Daniusis et al., 2010; Janzing et al., 2012; Gresele et al., 2021, see 4 for more details) to formalize our intuitions about independence of objects. First, we define the mapping which locally renders information from the kth latent slot to the affected pixels Ik(z) which we refer to as a mechanism. Definition 2 (Mechanism). z Z, k [K], we define the kth mechanism of f at z as the Jacobian matrix Jf Ik(z).

The kth mechanism can be understood as the sub-matrix of the Jacobian of f whose rows correspond to the pixels Ik(z) affected by slot k. Further, we define a sub-mechanism as the restriction to a subset of the affected pixels. Definition 3 (Sub-Mechanism). Jf S(z) is said to be a submechanism of Jf Ik(z), if S Ik(z) and S is nonempty.

In light of these definitions, Fig. 3A consists of two submechanism, Jf S1(z) and Jf S2(z), which generate pixels S1 and S2. To characterize the level of dependence between sets of pixels and their associated sub-mechanisms, we propose to use the matrix rank, which can be seen as a nonstatistical measure of information as it locally characterizes the latent capacity used to generate the corresponding pixels. Definition 4 (Independent/Dependent Sub-Mechanisms). Let S1, S2 [N] and z Z. The sub-mechanisms Jf S1(z) and Jf S2(z) are said to be independent if:

rank (Jf S1 S2(z)) = rank (Jf S1(z)) + rank (Jf S2(z)) . (5) Conversely, they are said to be dependent if:

rank (Jf S1 S2(z)) < rank (Jf S1(z)) + rank (Jf S2(z)) .

Provably Learning Object-Centric Representations

Intuitively, two sub-mechanisms Jf S1(z) and Jf S2(z) are independent according to Defn. 4 if the information content of pixels S1 S2 decomposes across S1 and S2 in the sense that the latent capacity required to jointly generate S1 S2 (LHS of Eq. (5)) is the same as that required to generate S1 and S2 separately (RHS of Eq. (5)). Such a decomposition will occur when the rows of the sub-mechanism Jf S1(z) do not lie in the row-space of the sub-mechanism Jf S2(z) and vice-versa. This will be the case in Fig. 3A where Jf S1(z) and Jf S2(z) affect different pixels since the rows of the Jacobian for pixels S1 and S2 will never have non-zero entries for the same column. As shown in Fig. 3B, however, it could also be the case that all latents within a slot affect pixels in both S1 and S2, yet the information content of S1 S2 still decomposes across S1 and S2 since the rows of Jf S1(z) and Jf S2(z) could span linearly independent subspaces.

To enforce that each slot generates only one object, we now finally place the condition on the mechanisms of f that they cannot be partitioned into independent sub-mechanisms (see Fig. 3C). We refer to this property as irreducibility. Definition 5 (Irreducibility). f is said to have irreducible mechanisms, or is irreducible, if for all z Z, k [K] and any partition of Ik(z) into S1 and S2, the sub-mechanisms Jf S1(z) and Jf S2(z) are dependent in the sense of Defn. 4.

3 Theory: Slot Identifiability

Given multi-object scenes sampled from the generative model outlined in 2, we now seek to understand under what conditions an inference model ˆg : X Z will provably identify the ground-truth object representations. Ideally, we would like ˆg to recover the true inverse g := f 1, but that is generally only possible up to certain irresolvable ambiguities. In our multi-object setting, the objective is to separate the object representations such that each inferred slot captures one and only one ground-truth slot. We refer to this notion as slot identifiability and define it as follows. Definition 6 (Slot Identifiability). Let f : Z X be a diffeomorphism. An inference model ˆg : X Z is said to slot-identify z = g(x) via ˆz = ˆg(x) = ˆg(f(z)) if for all k [K] there exist a unique j [K] and a diffeomorphism hk : Zk Zj such that ˆzj = hk(zk) for all z Z.

We are now in a position to state our main theoretical result (all complete proofs are provided in Appx. A). Theorem 1. Let f : Z X be a diffeomorphism that is compositional (Defn. 1) with irreducible mechanisms (Defn. 5). If an inference model ˆg : X Z is (i) a diffeomorphism with (ii) compositional inverse ˆf = ˆg 1, then ˆg slot-identifies z = g(x) in the sense of Defn. 6.

Proof Sketch. Irreducibility of f ensures that information is shared across different parts of an object, and compositionality of f that this information is not shared with other

objects. This creates an asymmetry in the latent capacity required to encode the entirety of one object compared to parts of different objects. When ˆg satisfies (i) and (ii), this asymmetry can be leveraged to show that each inferred slot ˆzj maps to one and only ground-truth slot zk by a proof by contradiction. Namely, suppose that ˆg maps pixels of two distinct objects to the same slot j. If ˆg were to encode all latent information required to generate these pixels in slot j, there would not be sufficient total latent capacity to recover the entire scene, leading to a violation of (i) invertibility. Hence, information for at least one of the pixels needs to be distributed across multiple slots, violating (ii) compositionality of ˆf = ˆg 1.

Implications for Object-Centric Learning. Thm. 1 highlights important conceptual points for object-centric representation learning. First, it shows that distributional assumptions on the latents z are not necessary for slot identifiability; instead, it suffices to enforce structure on the generator f. This falls in line with state-of-the-art (SOTA) object-centric learning methods (Locatello et al., 2020; Singh et al., 2022b; Seitzer et al., 2023; Elsayed et al., 2022), which are based on an auto-encoding framework, thus imposing no additional structure on pz. However, while these models directly enforce invertibility through the reconstruction objective, it is less clear whether and to what extent they also enforce compositionality. Specifically, compositionality is not explicitly optimized in any object-centric methods. Yet, the success of SOTA models in practice suggests that it may be implicitly enforced to some extent through additional inductive biases in the model. We explore this point empirically (see Fig. 6) and leave a more theoretical exploration for future work.

Thm. 1 also emphasizes that using a restricted latent bottleneck plays an important role in achieving slot identifiability. Specifically, Thm. 1 is predicated on dim(z)=dim(ˆz) and would no longer hold in its current form if dim(z)<dim(ˆz). The importance of restricting the latent capacity of objectcentric models was emphasized empirically by Engelcke et al. (2020a). Yet, the most successful object-centric models in practice often use dim(z)<dim(ˆz) (Dittadi et al., 2022; Locatello et al., 2020; Sajjadi et al., 2022). A potential explanation for this discrepancy is that SOTA object-centric models do encode information from multiple objects in each latent slot, but this additional information is ignored by the decoder during reconstruction such that image-level segmentations remain accurate. We provide some evidence for this hypothesis through experiments with existing object-centric models in 5.2.

Measuring Compositionality. While Thm. 1 reveals properties an inference function should satisfy to achieve slot identifiability, it presents these properties in an abstract mathematical form. If we seek to leverage Thm. 1 to assess the performance of existing object-centric models or inform

Provably Learning Object-Centric Representations

new training objectives for object-centric learning, we require a way to quantify whether an inference model is (i) a diffeomorphism and (ii) compositional. Regarding (i), one clear choice is to train an auto-encoder with differentiable encoder ˆg and decoder ˆf and minimize reconstruction loss to enforce invertibility. Regarding (ii), on the other hand, it is much less obvious how to quantify compositionality. To this end, we introduce the following contrast function, which we prove to be zero if and only if a function is compositional:

Definition 7 (Compositional Contrast). Let f : Z X be differentiable. The compositional contrast of f at z is

Ccomp(f, z) =

zj (z) . (6)

For a given scene representation z and generator f, the contrast function in Eq. (6) computes the sum over all pixels n of all pairwise products of the (L2) norms of those pixels gradients with respect to any two distinct slots k = j. As such, it is a non-negative quantity that can only be zero if every pixel is affected by at most one slot (i.e., f is compositional), for otherwise there would be a pair of slots k = j for which the gradient norms are both non-zero resulting in their product being non-zero.

We leverage this characterization of compositionality to provide our second result, which can be viewed as an optimization-based perspective on Thm. 1.

Theorem 2. Let f : Z X be a diffeomorphism that is compositional (Defn. 1) with irreducible mechanisms (Defn. 5). If an encoder ˆg : X Z and decoder ˆf : Z X are both differentiable and solve the following functional equation

ˆf(ˆg(x)) x 2

2 + λCcomp ˆf, ˆg(x) = 0, (7)

for λ > 0, then ˆg slot-identifies z in the sense of Defn. 6.

4 Related Work

Object-Centric Generative Models. Prior works have also formulated generative models for multi-object scenes based on latent slots (Roux et al., 2011; Heess, 2012; Greff et al., 2015; 2017; 2019; van Steenkiste et al., 2018; von K ugelgen et al.; Engelcke et al., 2020b; 2021), though without studying identifiability. Our assumptions on the generative model ( 2) bear intuitive similarity to some of these prior works, but they also differ in several fundamental ways. First, compositionality (Defn. 1) is stated as a desideratum for nearly all object-centric generative models. Yet, this constraint is not actually enforced by most existing approaches, particularly those based on spatial mixture models in which every slot may affect every pixel (Greff et al.,

2015; 2017; 2019; van Steenkiste et al., 2018; Engelcke et al., 2020b; 2021). More closely related is a dead-leaves model approach, in which a scene is sequentially generated by layering objects such that each pixel is affected by at most one slot (Roux et al., 2011; von K ugelgen et al.; Tangemann et al., 2023). In contrast, we define compositionality directly through assumptions on the structure of the (Jacobian of the) generator. Second, our irreducibility criterion (Defns. 4 and 5) bears conceptual similarity to prior works, which assume that different objects do not share information whereas parts of the same object do (Hyv arinen & Perki o, 2006; Greff et al., 2015; 2017; van Steenkiste et al., 2018). Importantly, however, these works formalize this intuition using statistical criteria such as statistical independence between pixels from different objects and dependence between pixels from the same object. However, this leads to an incorrect characterization of objects: e.g., the presence of a coffee cup should increase the likelihood that a table is also present, despite these being separate objects (Tr auble et al., 2021; Sch olkopf et al., 2021). Here, we instead formulate independence/dependence between objects in a non-statistical sense, inspired by algorithmic independence of mechanisms.

Objects and Causal Mechanisms. In causal modelling (Spirtes et al., 2001; Pearl, 2009), a mechanism typically refers to a function that determines the value of an effect variable from its direct causes and possibly a noise term, leading to a conditional distribution of effect given causes. Thus, we could view objects as the effects of the latent variables that cause them. While the causal variables are generally not independent, it has been argued that the mechanisms producing them should be (Sch olkopf et al., 2012; Peters et al., 2017). Since this is an independence between functions or conditionals rather than between random variables, it is non-trivial to formalize it statistically (Janzing & Sch olkopf, 2010; Guo et al., 2022). Hence, various implementations of the principle have been proposed (Daniusis et al., 2010; Janzing et al., 2010; 2012; Shajarisales et al., 2015; Locatello et al., 2018; Besserve et al., 2018; 2021; Janzing, 2021), typically for settings in which both cause and effect are observed. Our notion of independent sub-mechanisms is most closely related to work by Gresele et al. (2021), who also study representation learning and define mechanisms more broadly in terms of the Jacobian Jf: they assume independent latents and formalize mechanism independence as column-orthogonality of the Jacobian. In contrast, our rank condition (Eq. (5)) is inspired by objectcentric representation learning with dependent latents.

Identifiable Representation Learning. As this is the first identifiability study of unsupervised object-centric representations, our problem setting differs from existing work both in terms of the assumptions we make on the generative process and the type of identifiability that we aim to achieve.

Provably Learning Object-Centric Representations

First, prior work on identifiable representation learning commonly places assumptions on the latent distribution, such as conditional independence given an auxiliary variable (Hyv arinen & Morioka, 2016; 2017; Hyv arinen et al., 2019; Khemakhem et al., 2020a; H alv a & Hyv arinen, 2020; H alv a et al., 2021) or access to views arising from pairs of similar latents (Gresele et al., 2019; Klindt et al., 2021; Zimmermann et al., 2021; von K ugelgen et al., 2021), while leaving the generator f completely unconstrained. In contrast, we place no assumptions on pz and instead impose structure on (the Jacobian of) the generator f. Recent works have also leveraged assumptions on Jf such as orthogonality (Gresele et al., 2021; Zheng et al., 2022; Reizinger et al., 2022; Buchholz et al., 2022), unit determinant (Yang et al., 2022), or a fixed sparsity structure (Moran et al., 2021; Lachapelle et al., 2021; Lachapelle & Lacoste-Julien, 2022). While the latter relates to our definition of compositionality (Defn. 1), we crucially allow the sparsity pattern on Jf to vary with z (in line with the basic notion that objects are not fixed in space), and impose sparsity with respect to slots rather than individual latents. Secondly, existing work typically aims to identify individual latent components zi up to permutations (or linear transformations). However, this is inappropriate for object-centric representation learning, where we aim to capture and isolate the subsets of latents corresponding to each object in well-defined slots. Identifying such groups of latents is similar to efforts in independent subspace analysis (ISA; Hyv arinen & Hoyer, 2000). However, results for ISA are generally restricted to linear models and independent groups, whereas we allow for nonlinear models and dependence. Our notion of slot identifiability is most closely related to that of block-identifiability introduced by (von K ugelgen et al., 2021) and can be seen as an extension or generalization thereof to a setting with multiple blocks.

5 Experiments

Thm. 2 states that inference models which minimize reconstruction loss Lrec and compositional contrast Ccomp achieve slot identifiability (Defn. 6). This provides a concrete way to empirically test our main theoretical result. To do so, we perform two main sets of experiments. First, in 5.1 we generate controlled synthetic data according to the process specified in 2 and train an inference model on this data which directly optimizes Lrec and Ccomp jointly. Second, in 5.2 we seek to better understand the relationship between Lrec, Ccomp, and slot identifiability in existing object-centric models. To this end, we analyze a set of models trained on a multi-object sprites dataset.

Quantifying Slot Identifiability. To assess whether a model is slot identifiable in practice, we first establish a metric to measure slot identifiability. Specifically, we want to measure if there exists an invertible function between each

ground-truth and exactly one inferred latent slot. To this end, we first fit nonlinear models between inferred and groundtruth slots and measure their quality by the R2 coefficient of determination. To properly measure this R2 score, we must first match each ground-truth slot to its corresponding inferred slot as permutations could exist between slots. For our experiments in 5.1, this permutation will be global i.e. the same for all inferred latents, thus we use the Hungarian Algorithm (Kuhn, 1955) to find the optimal matching based on the R2 scores for models fit between every pair of slots. For our experiments on image data in 5.2, however, such a permutation will be local due to the permutation invariance of the generator. To resolve this, we follow a procedure similar to that of Locatello et al. (2020) and Dittadi et al. (2022) using online matching when fitting models between slots. Specifically, at every training iteration, we compute a matching loss for each sample for all possible pairings of ground-truth and inferred slots and use the Hungarian algorithm to find the optimal assignment for minimizing this loss. After resolving permutations, the R2 scores for the matched slots tell us how much information about each ground-truth slot is contained in one inferred slot. We also need to ensure, however, that inferred slots only contain information about one ground-truth slot and not multiple. To this end, we correct this score by subtracting the maximum R2 score from models fit between each inferred latent slot and the ground-truth slots that it was not previously matched with. Taking the mean of this score across all slots yields the final score, which we refer to as the slot identifiability score (SIS). Further details on the metric are given in Appx. B.4.

5.1 Synthetic Data

Experimental Setup. To generate synthetic data according to 2, we first sample a KM-dimensional latent vector from a normal distribution pz = N(0, Σ), where we consider scenarios with both statistically independent latents (Σ = I) and dependent latents (Σ Wishart KM(I, KM)). We then partition the latent vector into K slots, each with dimension M, and apply the same multi-layer perceptron (MLP) to each of the K slots separately. The MLP has 2 layers, uses Leaky Re LU non-linearities, and is chosen to lead to invertibility almost surely by following the settings used in previous work (Hyv arinen & Morioka, 2016; 2017; Zimmermann et al., 2021). Observations x are obtained by concatenating the slot-wise MLP outputs such that the generator is compositional according to Defn. 1 as well as invertible.4 We train models with a number of slots K {2, 3, 5} and λ {10 7, 10 5, 10 2, 0, 1, 10} (see Thm. 2) each across 10 random seeds (180 models in total). In all cases, we use slot-dimension M = 3 and slotoutput dimension of 20 such that dim(x) = K 20. Further details on this setup may be found in Appx. B.1.

4Regarding enforcing irreducibility, see Appx. B.1.

Provably Learning Object-Centric Representations

10 6 10 5 10 4 10 3 10 2

Reconstruction Error

Compositional Contrast

Slot Identifiability Score

Reconstruction Error

Compositional Contrast

MONet AE SA

Slot Identifiability Score

Figure 4. (A) Experimental validation of Thm. 2. We trained models on synthetic data generated according to 2 with 2, 3, 5 independent latent slots (see 5.1). The color coding indicates the level of identifiability achieved by the model, measured by the Slot Identifiability Score (SIS), where higher values correspond to more identifiable models. As predicted by our theory, if a model sufficiently minimizes both reconstruction error and compositional contrast, then it identifies the ground-truth latent slots. (B) Application of Thm. 2 to existing object-centric models. We train 3 existing object-centric architectures MONet, Slot Attention (SA), and an additive auto-encoder (AE) on image data and visualize their SIS as a function of both reconstruction error and compositional contrast. We see across models that, in general, SIS increases as reconstruction error and compositional contrast are minimized.

Results. In Fig. 4A, we visualize the SIS as a function of the reconstruction error and compositional contrast for independent latents for all K {2, 3, 5}. We normalize Lrec and Ccomp to ensure that their scores are comparable across different K, which we discuss in further detail in Appx. B.3. As predicted by Thm. 2, we can see that all models that minimize both objectives jointly yield high SIS, whereas models that fail to minimize, e.g., the compositional contrast achieve subpar identifiability. Results for dependent latents yield a similar trend which can be seen in Fig. 5.

5.2 Existing Object-Centric Models

Experimental Setup. We now aim to understand the predictions made by our theory in the context of existing objectcentric models trained on image data. To this end, we consider image data generated by the Spriteworld renderer (Watters et al., 2019). Specifically, we generate images with 2 to 4 objects, each described by 4 continuous (size, color, x/y position) and 1 discrete (shape) independent latent factors. Samples of this dataset are shown in Fig. 8. We investigate three object-centric approaches on this data: Slot Attention (Locatello et al., 2020), MONet (Burgess et al., 2019), and an additive auto-encoder. We train all models with 4 latent slots, each with dimension 16, leading to an inferred latent dimension larger than the ground-truth. This discrepancy between inferred and ground-truth latent dimensionality is ubiquitous in existing object-centric models. However, it violates our theoretical assumptions which require equal dimensions. See Appx. B.2 for further experimental details.

Results. SIS as a function of reconstruction error and compositional contrast is shown in Fig. 4B. Similar to Fig. 4A,

SIS tends to increase as Lrec and Ccomp are minimized, highlighting that our theory holds predictive power for slot identifiability in existing object-centric models. Notably, this is in spite of our theoretical assumptions not being exactly met due to the inferred latent dimension exceeding the ground-truth. This mismatch in dimension does seem to have an effect on SIS, however, which can be seen in Fig. 7. Here, we can see that the subtracted R2 score in the SIS computation is non-zero across models suggesting that these models are using their additional latent capacity to encode information from multiple objects, despite the decoder presumably not using this information during reconstruction.

6 Discussion

Limitations of Experiments. We emphasize that the main goal of this work is to create a theoretical foundation for object-centric learning. Hence, we focus our experiments on validating Thm. 2 ( 5.1) and exploring our theoretical predictions in existing object-centric models ( 5.2). While our experiments in 5.2 provide evidence that existing models which minimize Lrec and Ccomp achieve higher SIS, scaling up these experiments to more models and datasets would lead to a more comprehensive understanding of the exact extent to which the performance of existing models can be understood from our theory. We leave such a larger empirical study for future work.

Limitations of Theory. While we believe that our theoretical assumptions capture the essence of important concepts in object-centric learning, they will be violated to various degrees in practical scenarios. For example, the assumption of compositionality (Defn. 1) on the generator f is broken

Provably Learning Object-Centric Representations

by translucency/reflection, as a single pixel can then be affected by multiple latent slots. Additionally, occlusions are not yet fully covered by our theory, as pixels at the border of occluding objects would be affected by multiple latent slots. Additionally, it is common to assume in practice that the generator f is invariant to permutations of the latent slots it acts on. This permutation invariance leads to a lack of invertibility of f, however, as permuted latents will give rise to the same observation. We anticipate that our theoretical results can be adapted to incorporate such a permutation invariant generator but leave this for future work.

Relationship to Existing Definitions of Objects. Under our framework, groups of pixels corresponding to an object have the property that the latent capacity needed to encode partitions of these pixels separately exceeds the latent capacity needed to encode the pixels as a whole (Defn. 5). Intuitively, this implies that there is latent information shared across different parts of an object. By considering the location of objects as one such latent information, our definition relates to the Gestalt law of common fate (Koffka, 1936; Tangemann et al., 2023) and the concept of a Spelke Object (Spelke, 1990; Chen et al., 2022) which posit that pixels belonging to the same object move together. Furthermore, by considering color or texture as shared latent information, our definition relates to the Gestalt law of similarity (Koffka, 1936) that posits that items sharing visual features tend to be grouped together as a single object.

Extensions of Theory. While our theoretical results provide relatively general conditions under which object-centric representations can be identified, there are several potential ways our results could be extended. First, we hypothesize that the reverse implication of our main result may hold as well, i.e., given the generative model in 3, compositionality and invertibility are not only sufficient but also necessary conditions for slot identifiability. A formal proof of this conjecture would further highlight the importance of these properties. Additionally, it would be interesting to aim to extend our theoretical approach to identifying not just objects but also abstractions such as part-whole hierarchies (Hinton, 2021) or individual object attributes. In this case, our notion of compositionality would need to be adjusted to account for abstractions that interact during generation. Lastly, it would be interesting to extend our results to leverage weakly-supervised information, such as motion, which has been shown empirically to be helpful for object-centric learning (Tangemann et al., 2023; Kipf et al., 2022; Elsayed et al., 2022; Chen et al., 2022).

Optimizing Ccomp in Object-Centric Models. While creating a new method for object-centric learning is not the focus of this work, one question based on Thm. 2 is whether Ccomp can be optimized directly in object-centric models on image data to improve slot identifiability. In this

setting, explicitly optimizing Ccomp, as was done in 5.1, is challenging as the contrast in its current form is based on Jacobians. Thus, naively optimizing it through gradient descent corresponds to second-order optimization, which creates computational challenges for larger models and data dimensionalities. As previously noted, it could also be the case that there exist implicit ways to enforce that Ccomp is minimized, which could be occurring to some extent through inductive biases in existing object-centric models. We leave finding computationally efficient ways to minimize Ccomp, whether explicit or implicit, for future work.

Concluding Remarks. Representing scenes in terms of objects is a key aspect of visual intelligence and an important component of generalization in humans. While empirical object-centric learning methods are increasingly successful, we have thus far been lacking a precise theoretical understanding of what properties of the data and model are sufficient to provably learn object-centric representations. To the best of our knowledge, this work is the first to provide such a theoretical understanding. Along with invertibility, two intuitive assumptions on the generator compositionality and irreducibility are sufficient to identify the ground-truth object representations. By extending identifiability theory towards object-centric learning, we hope to facilitate a deeper understanding of existing object-centric models as well as provide a solid foundation for the next generation of models to build upon.

Author Contributions. JB developed the theory with technical help from RSZ, insight from YS, and advising from WB and Jv K. JB implemented and executed the experiments with help from RSZ and YS, while RSZ implemented the Ccomp and SIS metrics on image data. JB and Jv K led the writing of the manuscript with help from WB, BS, and RSZ. WB and RSZ created all figures in the manuscript.

Acknowledgments. We thank: the anonymous reviewers for helpful suggestions which led to improvements in the manuscript, Andrea Dittadi for helpful discussions regarding experiments, Attila Juhos for pointing out an issue with Thm. 1, Amin Charusaie, Michel Besserve, and Simon Buchholz for helpful technical discussions, and Zac Cranko for theoretical efforts in the early stages of the project.

This work was supported by the German Federal Ministry of Education and Research (BMBF): T ubingen AI Center, FKZ: 01IS18039A, 01IS18039B. WB acknowledges financial support via an Emmy Noether Grant funded by the German Research Foundation (DFG) under grant no. BR 6382/1-1 and via the Open Philantropy Foundation funded by the Good Ventures Foundation. WB is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 Project number 390727645. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting RSZ and YS.

Provably Learning Object-Centric Representations

Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez Gonzalez, A., Zambaldi, V. F., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., C aglar G ulc ehre, Song, H. F., Ballard, A. J., Gilmer, J., Dahl, G. E., Vaswani, A., Allen, K. R., Nash, C., Langston, V., Dyer, C., Heess, N. M. O., Wierstra, D., Kohli, P., Botvinick, M. M., Vinyals, O., Li, Y., and Pascanu, R. Relational inductive biases, deep learning, and graph networks. Ar Xiv, abs/1806.01261, 2018. [Cited on page 1.]

Besserve, M., Shajarisales, N., Sch olkopf, B., and Janzing, D. Group invariance principles for causal generative models. In AISTATS, volume 84 of Proceedings of Machine Learning Research, pp. 557 565, 2018. [Cited on page 6.]

Besserve, M., Sun, R., Janzing, D., and Sch olkopf, B. A theory of independent mechanisms for extrapolation in generative models. In AAAI, pp. 6741 6749, 2021. [Cited on page 6.]

Biza, O., van Steenkiste, S., Sajjadi, M. S., Elsayed, G. F., Mahendran, A., and Kipf, T. Invariant slot attention: Object discovery with slot-centric reference frames. Ar Xiv preprint, abs/2302.04973, 2023. [Cited on page 1.]

Buchholz, S., Besserve, M., and Sch olkopf, B. Function classes for identifiable nonlinear independent component analysis. In Neur IPS, 2022. [Cited on page 7.]

Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A. Understanding disentangling in β-vae. Ar Xiv preprint, abs/1804.03599, 2018. [Cited on page 22.]

Burgess, C. P., Matthey, L., Watters, N., Kabra, R., Higgins, I., Botvinick, M., and Lerchner, A. Monet: Unsupervised scene decomposition and representation. Ar Xiv preprint, abs/1901.11390, 2019. [Cited on pages 1 and 8.]

Caron, M., Touvron, H., Misra, I., J egou, H., Mairal, J., Bojanowski, P., and Joulin, A. Emerging properties in self-supervised vision transformers. In ICCV, pp. 9630 9640, 2021. [Cited on page 1.]

Chen, H., Venkatesh, R. M., Friedman, Y., Wu, J., Tenenbaum, J. B., Yamins, D. L. K., and Bear, D. Unsupervised segmentation in real-world images via spelke object inference. In European Conference on Computer Vision, 2022. [Cited on page 9.]

Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis

and machine intelligence, 40(4):834 848, 2017. [Cited on page 1.]

Daniusis, P., Janzing, D., Mooij, J. M., Zscheischler, J., Steudel, B., Zhang, K., and Sch olkopf, B. Inferring deterministic causal relations. In UAI 2010, Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, Catalina Island, CA, USA, July 8-11, 2010, pp. 143 150, 2010. [Cited on pages 4 and 6.]

Dehaene, S. How We Learn: Why Brains Learn Better Than Any Machine... for Now. 2020. [Cited on page 1.]

Dittadi, A., Papa, S. S., Vita, M. D., Sch olkopf, B., Winther, O., and Locatello, F. Generalization and robustness implications in object-centric learning. In ICML, volume 162 of Proceedings of Machine Learning Research, pp. 5221 5285, 2022. [Cited on pages 5, 7, 22, and 23.]

Elsayed, G. F., Mahendran, A., van Steenkiste, S., Greff, K., Mozer, M. C., and Kipf, T. Savi++: Towards end-toend object-centric learning from real-world videos. In Neur IPS, 2022. [Cited on pages 1, 5, and 9.]

Engelcke, M., Jones, O. P., and Posner, I. Reconstruction bottlenecks in object-centric generative models. Ar Xiv preprint, abs/2007.06245, 2020a. [Cited on page 5.]

Engelcke, M., Kosiorek, A. R., Jones, O. P., and Posner, I. GENESIS: generative scene inference and sampling with object-centric latent representations. In ICLR, 2020b. [Cited on page 6.]

Engelcke, M., Jones, O. P., and Posner, I. GENESIS-V2: inferring unordered object representations without iterative refinement. In Neur IPS, pp. 8085 8094, 2021. [Cited on page 6.]

Fodor, J. A. and Pylyshyn, Z. W. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1-2): 3 71, 1988. [Cited on page 1.]

Gerstenberg, T. and Tenenbaum, J. B. Intuitive theories. Oxford handbook of causal reasoning, pp. 515 548, 2017. [Cited on page 1.]

Gerstenberg, T., Goodman, N. D., Lagnado, D. A., and Tenenbaum, J. B. A counterfactual simulation model of causal judgments for physical events. Psychological Review, 128(5):936, 2021. [Cited on page 1.]

Gopnik, A., Glymour, C., Sobel, D. M., Schulz, L. E., Kushnir, T., and Danks, D. A theory of causal learning in children: causal maps and bayes nets. Psychological review, 111(1):3, 2004. [Cited on page 1.]

Goyal, A. and Bengio, Y. Inductive biases for deep learning of higher-level cognition. Proceedings of the Royal Society A, 478(2266):20210068, 2022. [Cited on page 1.]

Provably Learning Object-Centric Representations

Green, E. J. A theory of perceptual objects. Philosophy and Phenomenological Research, 99(3):663 693, 2019. [Cited on page 2.]

Greff, K., Srivastava, R. K., and Schmidhuber, J. Binding via reconstruction clustering. Ar Xiv preprint, abs/1511.06418, 2015. [Cited on page 6.]

Greff, K., van Steenkiste, S., and Schmidhuber, J. Neural expectation maximization. In NIPS, pp. 6691 6701, 2017. [Cited on page 6.]

Greff, K., Kaufman, R. L., Kabra, R., Watters, N., Burgess, C., Zoran, D., Matthey, L., Botvinick, M. M., and Lerchner, A. Multi-object representation learning with iterative variational inference. In ICML, volume 97 of Proceedings of Machine Learning Research, pp. 2424 2433, 2019. [Cited on pages 1 and 6.]

Greff, K., Van Steenkiste, S., and Schmidhuber, J. On the binding problem in artificial neural networks. Ar Xiv preprint, abs/2012.05208, 2020. [Cited on pages 1 and 2.]

Gresele, L., Rubenstein, P. K., Mehrjou, A., Locatello, F., and Sch olkopf, B. The incomplete rosetta stone problem: Identifiability results for multi-view nonlinear ICA. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2019, Tel Aviv, Israel, July 22-25, 2019, volume 115 of Proceedings of Machine Learning Research, pp. 217 227, 2019. [Cited on page 7.]

Gresele, L., von K ugelgen, J., Stimper, V., Sch olkopf, B., and Besserve, M. Independent mechanism analysis, a new concept? In Neur IPS, pp. 28233 28248, 2021. [Cited on pages 4, 6, and 7.]

Guo, S., T oth, V., Sch olkopf, B., and Husz ar, F. Causal de Finetti: On the identification of invariant causal structure in exchangeable data. Ar Xiv preprint, abs/2203.15756, 2022. [Cited on page 6.]

H alv a, H. and Hyv arinen, A. Hidden markov nonlinear ICA: unsupervised learning from nonstationary time series. In Proceedings of the Thirty-Sixth Conference on Uncertainty in Artificial Intelligence, UAI 2020, virtual online, August 3-6, 2020, volume 124 of Proceedings of Machine Learning Research, pp. 939 948, 2020. [Cited on page 7.]

H alv a, H., Corff, S. L., Leh ericy, L., So, J., Zhu, Y., Gassiat, E., and Hyv arinen, A. Disentangling identifiable features from noisy data with structured nonlinear ICA. In Neur IPS, pp. 1624 1633, 2021. [Cited on page 7.]

He, K., Gkioxari, G., Doll ar, P., and Girshick, R. B. Mask RCNN. In ICCV, pp. 2980 2988, 2017. [Cited on page 1.]

Heess, N. M. O. Learning generative models of mid-level structure in natural images. Ph D thesis, The University of Edinburgh, 2012. [Cited on page 6.]

Hinton, G. E. How to represent part-whole hierarchies in a neural network. Neural computation, pp. 1 40, 2021. [Cited on page 9.]

Horn, R. A. and Johnson, C. R. Matrix analysis. 2012. [Cited on page 16.]

Hyv arinen, A. and Hoyer, P. O. Emergence of phaseand shift-invariant features by decomposition of natural images into independent feature subspaces. Neural Comput., 12(7):1705 1720, 2000. [Cited on page 7.]

Hyv arinen, A. and Morioka, H. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In NIPS, pp. 3765 3773, 2016. [Cited on pages 2 and 7.]

Hyv arinen, A. and Morioka, H. Nonlinear ICA of temporally dependent stationary sources. In AISTATS, volume 54 of Proceedings of Machine Learning Research, pp. 460 469, 2017. [Cited on pages 2 and 7.]

Hyv arinen, A. and Perki o, J. Learning to segment any random vector. The 2006 IEEE International Joint Conference on Neural Network Proceedings, pp. 4167 4172, 2006. [Cited on page 6.]

Hyv arinen, A., Sasaki, H., and Turner, R. E. Nonlinear ICA using auxiliary variables and generalized contrastive learning. In AISTATS, volume 89 of Proceedings of Machine Learning Research, pp. 859 868, 2019. [Cited on pages 2 and 7.]

Hyv arinen, A. and Pajunen, P. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429 439, 1999. ISSN 0893-6080. [Cited on page 2.]

Janzing, D. Causal versions of maximum entropy and principle of insufficient reason. Journal of Causal Inference, 9(1):285 301, 2021. [Cited on page 6.]

Janzing, D. and Sch olkopf, B. Causal inference using the algorithmic markov condition. IEEE Transactions on Information Theory, 56(10):5168 5194, 2010. [Cited on pages 4 and 6.]

Janzing, D., Hoyer, P. O., and Sch olkopf, B. Telling cause from effect based on high-dimensional observations. In ICML, pp. 479 486, 2010. [Cited on page 6.]

Janzing, D., Mooij, J., Zhang, K., Lemeire, J., Zscheischler, J., Daniuˇsis, P., Steudel, B., and Sch olkopf, B. Information-geometric approach to inferring causal directions. Artificial Intelligence, 182:1 31, 2012. [Cited on pages 4 and 6.]

Provably Learning Object-Centric Representations

Kabra, R., Burgess, C., Matthey, L., Kaufman, R. L., Greff, K., Reynolds, M., and Lerchner, A. Multiobject datasets. https://github.com/deepmind/multiobject-datasets/, 2019. [Cited on page 22.]

Karazija, L., Laina, I., and Rupprecht, C. Clevrtex: A texture-rich benchmark for unsupervised multi-object segmentation. Ar Xiv preprint, abs/2111.10265, 2021. [Cited on page 1.]

Khemakhem, I., Kingma, D. P., Monti, R. P., and Hyv arinen, A. Variational autoencoders and nonlinear ICA: A unifying framework. In AISTATS, volume 108 of Proceedings of Machine Learning Research, pp. 2207 2217, 2020a. [Cited on pages 2 and 7.]

Khemakhem, I., Monti, R. P., Kingma, D. P., and Hyv arinen, A. Ice-beem: Identifiable conditional energy-based deep models based on nonlinear ICA. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 6-12, 2020, virtual, 2020b. [Cited on page 2.]

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR (Poster), 2015. [Cited on page 22.]

Kipf, T., Elsayed, G. F., Mahendran, A., Stone, A., Sabour, S., Heigold, G., Jonschkowski, R., Dosovitskiy, A., and Greff, K. Conditional object-centric learning from video. In ICLR, 2022. [Cited on pages 1 and 9.]

Kipf, T. N., van der Pol, E., and Welling, M. Contrastive learning of structured world models. In ICLR, 2020. [Cited on page 1.]

Klindt, D. A., Schott, L., Sharma, Y., Ustyuzhaninov, I., Brendel, W., Bethge, M., and Paiton, D. M. Towards nonlinear disentanglement in natural data with temporal sparse coding. In ICLR, 2021. [Cited on pages 2 and 7.]

Koffka, K. Principles Of Gestalt Psychology. 1936. [Cited on pages 2 and 9.]

Kuhn, H. W. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83 97, 1955. [Cited on pages 7 and 23.]

Lachapelle, S. and Lacoste-Julien, S. Partial disentanglement via mechanism sparsity. In UAI 2022 Workshop on Causal Representation Learning, 2022. [Cited on page 7.]

Lachapelle, S., Rodriguez, P., Sharma, Y., Everett, K. E., Le Priol, R., Lacoste, A., and Lacoste-Julien, S. Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ica. In First Conference on Causal Learning and Reasoning, 2021. [Cited on page 7.]

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. Building machines that learn and think like people. Behavioral and brain sciences, 40, 2017. [Cited on page 1.]

Lin, Z., Wu, Y., Peri, S. V., Sun, W., Singh, G., Deng, F., Jiang, J., and Ahn, S. SPACE: unsupervised objectoriented scene representation via spatial attention and decomposition. In ICLR, 2020. [Cited on page 1.]

Locatello, F., Vincent, D., Tolstikhin, I., R atsch, G., Gelly, S., and Sch olkopf, B. Competitive training of mixtures of independent deep generative models. Ar Xiv preprint, abs/1804.11130, 2018. [Cited on page 6.]

Locatello, F., Bauer, S., Lucic, M., R atsch, G., Gelly, S., Sch olkopf, B., and Bachem, O. Challenging common assumptions in the unsupervised learning of disentangled representations. In ICML, volume 97 of Proceedings of Machine Learning Research, pp. 4114 4124, 2019. [Cited on page 2.]

Locatello, F., Weissenborn, D., Unterthiner, T., Mahendran, A., Heigold, G., Uszkoreit, J., Dosovitskiy, A., and Kipf, T. Object-centric learning with slot attention. In Neur IPS, 2020. [Cited on pages 1, 5, 7, 8, and 22.]

Marcus, G. F. The Algebraic Mind: Integrating Connectionism and Cognitive Science. 2001. [Cited on page 1.]

Moran, G. E., Sridhar, D., Wang, Y., and Blei, D. M. Identifiable variational autoencoders via sparse decoding. Ar Xiv preprint, abs/2110.10804, 2021. [Cited on page 7.]

Papa, S., Winther, O., and Dittadi, A. Inductive biases for object-centric representations in the presence of complex textures. In UAI 2022 Workshop on Causal Representation Learning, 2022. [Cited on page 1.]

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., K opf, A., Yang, E. Z., De Vito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library. In Neur IPS, pp. 8024 8035, 2019. [Cited on page 22.]

Pearl, J. Causality. 2 edition, 2009. [Cited on page 6.]

Peters, B. and Kriegeskorte, N. Capturing the objects of vision with neural networks. Nature Human Behaviour, 5(9):1127 1144, 2021. [Cited on page 1.]

Peters, J., Janzing, D., and Sch olkopf, B. Elements of causal inference: foundations and learning algorithms. 2017. [Cited on pages 2, 4, and 6.]

Provably Learning Object-Centric Representations

Reizinger, P., Gresele, L., Brady, J., von K ugelgen, J., Zietlow, D., Sch olkopf, B., Martius, G., Brendel, W., and Besserve, M. Embrace the gap: Vaes perform independent mechanism analysis. In Neur IPS, 2022. [Cited on page 7.]

Ronneberger, O., Fischer, P., and Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pp. 234 241, 2015. [Cited on page 1.]

Roux, N. L., Heess, N. M. O., Shotton, J., and Winn, J. M. Learning a generative model of images by factoring appearance and shape. Neural Computation, 23:593 650, 2011. [Cited on page 6.]

Sajjadi, M. S. M., Duckworth, D., Mahendran, A., van Steenkiste, S., Pavetic, F., Lucic, M., Guibas, L. J., Greff, K., and Kipf, T. Object scene representation transformer. In Neur IPS, 2022. [Cited on pages 1 and 5.]

Sch olkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. M. On causal and anticausal learning. In ICML, 2012. [Cited on page 6.]

Sch olkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A., and Bengio, Y. Toward causal representation learning. Proceedings of the IEEE, 109(5): 612 634, 2021. [Cited on page 6.]

Seitzer, M., Horn, M., Zadaianchuk, A., Zietlow, D., Xiao, T., Simon-Gabriel, C.-J., He, T., Zhang, Z., Sch olkopf, B., Brox, T., and Locatello, F. Bridging the gap to real-world object-centric learning. In The Eleventh International Conference on Learning Representations, 2023. [Cited on pages 1 and 5.]

Shajarisales, N., Janzing, D., Sch olkopf, B., and Besserve, M. Telling cause from effect in deterministic linear dynamical systems. In ICML, volume 37 of JMLR Workshop and Conference Proceedings, pp. 285 294, 2015. [Cited on page 6.]

Singh, G., Deng, F., and Ahn, S. Illiterate DALL-E learns to compose. In ICLR, 2022a. [Cited on page 1.]

Singh, G., Wu, Y., and Ahn, S. Simple unsupervised objectcentric learning for complex and naturalistic videos. In Neur IPS, 2022b. [Cited on pages 1 and 5.]

Spelke, E. S. Principles of object perception. Cogn. Sci., 14: 29 56, 1990. [Cited on pages 2 and 9.]

Spelke, E. S. What makes us smart? core knowledge and natural language. Language in mind: Advances in the study of language and thought, pp. 277 311, 2003. [Cited on page 1.]

Spelke, E. S. and Kinzler, K. D. Core knowledge. Developmental science, 10(1):89 96, 2007. [Cited on page 1.]

Spirtes, P., Glymour, C., and Scheines, R. Causation, Prediction, and Search, volume 1. 2001. [Cited on page 6.]

Tangemann, M., Schneider, S., von K ugelgen, J., Locatello, F., Gehler, P., Brox, T., K ummerer, M., Bethge, M., and Sch olkopf, B. Unsupervised object learning via common fate. In 2nd Conference on Causal Learning and Reasoning (CLea R), 2023. [Cited on pages 6 and 9.]

Tenenbaum, J. B., Kemp, C., Griffiths, T. L., and Goodman, N. D. How to grow a mind: Statistics, structure, and abstraction. Science, 331:1279 1285, 2011. [Cited on page 1.]

Tr auble, F., Creager, E., Kilbertus, N., Locatello, F., Dittadi, A., Goyal, A., Sch olkopf, B., and Bauer, S. On disentangled representations learned from correlated data. In ICML, volume 139 of Proceedings of Machine Learning Research, pp. 10401 10412, 2021. [Cited on page 6.]

van Steenkiste, S., Chang, M., Greff, K., and Schmidhuber, J. Relational neural expectation maximization: Unsupervised discovery of objects and their interactions. In ICLR (Poster), 2018. [Cited on page 6.]

von K ugelgen, J., Ustyuzhaninov, I., Gehler, P., Bethge, M., and Sch olkopf, B. Towards causal generative scene models via competition of experts. In ICLR 2020 Workshop Causal Learning for Decision Making . [Cited on page 6.]

von K ugelgen, J., Sharma, Y., Gresele, L., Brendel, W., Sch olkopf, B., Besserve, M., and Locatello, F. Selfsupervised learning with data augmentations provably isolates content from style. In Neur IPS, pp. 16451 16467, 2021. [Cited on page 7.]

Watters, N., Matthey, L., Borgeaud, S., Kabra, R., and Lerchner, A. Spriteworld: A flexible, configurable reinforcement learning environment. https://github.com/deepmind/spriteworld/, 2019. [Cited on pages 8 and 22.]

Weis, M. A., Chitta, K., Sharma, Y., Brendel, W., Bethge, M., Geiger, A., and Ecker, A. S. Benchmarking unsupervised object representations for video sequences. J. Mach. Learn. Res., 22:183:1 183:61, 2021. [Cited on page 1.]

Yang, X., Wang, Y., Sun, J., Zhang, X., Zhang, S., Li, Z., and Yan, J. Nonlinear ICA using volume-preserving transformations. In ICLR, 2022. [Cited on page 7.]

Yang, Y. and Yang, B. Promising or elusive? unsupervised object segmentation from real-world single images. In Neur IPS, 2022. [Cited on page 1.]

Provably Learning Object-Centric Representations

Zheng, Y., Ng, I., and Zhang, K. On the identifiability of nonlinear ICA: Sparsity and beyond. In Advances in Neural Information Processing Systems, 2022. [Cited on page 7.]

Zimmermann, R. S., Sharma, Y., Schneider, S., Bethge, M., and Brendel, W. Contrastive learning inverts the data generating process. In ICML, volume 139 of Proceedings of Machine Learning Research, pp. 12979 12990, 2021. [Cited on pages 2 and 7.]

Provably Learning Object-Centric Representations

In this section, we present the proofs for the results presented in the main text. First, we recall our notation:

Notation. N will denote the dimensionality of observations x, K the number of latent slots, and M the dimensionality of each latent slot zk. For n N, [n] will denote the set of natural numbers from 1 to n, i.e., [n] := {1, . . . , n}. If f is a function with n component functions, then f S will denote the restriction of f to the component functions indexed by S [n], i.e. f S := (fs)s S where f S is ordered according to the natural ordering of the elements of S. Additionally, when restricting f to the component functions indexed by Ik(z), defined according to Eq. (3), we will drop the dependence on z for notional convenience i.e. f Ik(z) := f Ik(z)(z). For functions f,ˆf, we will use Ik(z), ˆIk(ˆz), respectively, to distinguish between the indices defined for each function according to Eq. (3). Lastly, we will slightly abuse notation and use 0 to denote both the zero vector and a matrix whose entries are all 0.

We begin by proving several lemmata which will be leveraged for our main theoretical result. We start with the intuitive result that sub-mechanisms from different latent slots are independent in the sense of Defn. 4.

Lemma 1 (Sub-Mechanisms of Distinct Mechanisms are Independent). Let f be a diffeomorphism that is compositional (Defn. 1), and let S1, S2 [N] be nonempty. z Z, k [K], if S1 Ik(z), S2 Ik(z) = , then sub-mechanisms Jf S1(z), Jf S2(z) are independent in the sense of Defn. 4. Proof. From the definition of Ik(z) in Eq. (3) it follows that:

n [N] : fn zk (z) = 0 = n Ik(z).

Since S1 Ik(z), we know that n S1 : fn

zk (z) = 0. Further, since S2 Ik(z) = it means that n S2 : fn

zk (z) = 0. Put differently, this means that rows of Jf S1(z) are non-zero for those rows where Jf S2(z) vanishes and vice versa. Therefore, one cannot represent any column of Jf S1(z) as a linear combination of those of Jf S2(z). Hence,

rank (Jf S1(z)) + rank (Jf S2(z)) = rank ([Jf S1(z); Jf S2(z)]) ,

where [ ; ] denotes vertical concatenation. Note that the RHS is equal to Jf S1 S2(z) up to permutations of rows (which do not change the rank). Thus, Eq. (5) holds for S1, S2 showing that Jf S1(z), Jf S2(z) are independent in the sense of Defn. 4.

We next show that the rank of each sub-mechanism is less than or equal to the latent slot-dimension dimension, M.

Lemma 2. Let f : Z X be a diffeomorphism that is compositional (Defn. 1). z Z, k [K], if S Ik(z) is non-empty:

rank (Jf S(z)) M. (8) Proof. Since S Ik(z), then by compositionality of f

z Z, s S, j [K] \ {k} : fs zj (z) = 0. (9)

Thus, Jf S(z) has at most M non-zero columns (those corresponding to the non-zero partials w.r.t. zk) which implies rank(Jf S(z)) M.

We now show that the rank of each mechanism is equal to the latent slot-dimension M.

Lemma 3. Let f : Z X be a diffeomorphism that is compositional (Defn. 1). Then z Z, k [K]:

rank(Jf Ik(z)) = M. Proof. First note f is a diffeomorphism and is thus invertible. Therefore, Jf must be invertible and thus have full columnrank, i.e., z Z : rank(Jf(z)) = MK.

Next, z Z, k [K], let IC k := [N] \ Ik denote the complement of Ik in [N] such that IC k Ik = . Thus, by Lemma 1, the corresponding sub-mechanisms are independent:

z Z, k [K] : rank(Jf(z)) = rank(Jf Ik(z)) + rank(Jf IC k (z)) = MK. (10)

Provably Learning Object-Centric Representations

By compositionality of f,

z Z, j [K] \ { k } : f Ik

zj (z) = 0. (11)

Thus, Jf Ik(z) has at most M non-zero columns implying that rank(Jf Ik(z)) M. Furthermore, by definition,

z Z : f IC k zk (z) = 0, (12)

which means that Jf IC k (z) has at most (K 1)M non-zero columns implying rank(Jf IC k (z)) (K 1)M. Inserting this result in Eq. (10) yields

z Z, k [K] : M MK rank(Jf IC k (z)) = rank(Jf Ik(z)) M, (13)

which can only be true if rank(Jf Ik(z)) = M.

Next, we show that for ground-truth generator f and inferred generator ˆf, the sub-mechanisms at a given point with respect to the same pixel subset S will be have the same rank. Lemma 4. Let f,ˆf : Z X be diffeomorphisms with inverses g, ˆg : X Z, respectively. Then z Z, S [N] s.t. S = , rank(Jf S(z)) = rank(Jˆf S(ˆz)), where ˆz := ˆg(f(z)). Proof. First, we introduce the function

h := ˆg f : Z Z s.t. ˆz := ˆg(f(z)) = h(z),

We can express f as f = ˆf ˆg f = ˆf h. Thus, if S [N], S = , f S = ˆf S h. Therefore,

z Z, rank(Jf S(z)) = rank(Jˆf S(ˆz)Jh(z)). (14)

Because h is a diffeomorphism, Jh(z) is invertible. Thus rank(AJh(z)) = rank(A) for any matrix A s.t. AJh(z) is defined (Horn & Johnson, 2012, Section 0.4.6). Therefore, by Eq. (14):

z Z, rank(Jf S(z)) = rank(Jˆf S(ˆz)). (15)

We now prove several propositions which will be used to build our main result (Thm. 1). Firstly, we show that each inferred latent slot depends on at least one ground-truth slot. Proposition 1. Let Z be a latent space, X an observation space, and f : Z X a diffeomorphism that is compositional (Defn. 1). Let ˆg : X Z be a diffeomorphism and ˆz := ˆg(f(z)), z Z. Then, z Z, i [K], j [K] : ˆzj

zi (z) = 0. Proof. We first define the function

h := ˆg f : Z Z s.t. ˆz := ˆg(f(z)) = h(z).

As ˆg and f are both diffeomorphisms, h is also a diffeomorphism.

Note that z Z, Jh(z) is a square matrix. Furthermore, because h is a diffeomorphism, it follows that z Z, Jh(z) is full rank. This implies Jh(z) must have all non-zero columns, which implies

z Z, i [K], j [K] : ˆzj

zi (z) = 0.

Next, we show that each inferred latent slot generates the same pixels as at most one ground-truth slot. Proposition 2. Let Z be a latent space and X an observation space defined as in 2. Let f : Z X be a diffeomorphism that is compositional (Defn. 1) with irreducible mechanisms (Defn. 5). Let ˆg : X Z be a diffeomorphism with inverse ˆf : Z X that is compositional (Defn. 1). Then z Z, j [K], there exists exactly one i [K] : ˆIj(ˆz) Ii(z) = , where ˆz := ˆg(f(z)) Proof. Our goal is to show that ˆf maps each inferred latent slot ˆzj to pixels generated by exactly one ground-truth latent slot zi.

Provably Learning Object-Centric Representations

Step 1 We will first show that ˆf maps each inferred latent slot ˆzj to pixels generated by at least one ground-truth latent slot zi. More precisely, we aim to show:

z Z, j [K], i [K] : ˆIj(ˆz) Ii(z) = . (16)

Suppose for a contradiction to Eq. (16) that:

z Z, j [K], i [K] : ˆIj(ˆz ) Ii(z ) = . (17)

We will show that this assumption leads to a contradiction and, hence, is false. Let, z denote the value for which Eq. (17) holds. Eq. (17) coupled with the definition of Ii(z ) in Eq. (3) imply that there exists pixels which depend on ˆz under ˆf but not on z under f. More precisely,

i ˆIj(ˆz ) : Jˆfi(ˆz ) = 0, i ˆIj(ˆz ) : Jfi(z ) = 0 (18)

This then implies that:

rank(Jˆf ˆ Ij(ˆz )) = 0, rank(Jf ˆ Ij(z )) = 0 (19)

which contradicts the equality of Jacobian ranks between f and ˆf stated in Lemma 4. Thus, our assumed contradiction in Eq. (17) cannot hold and we conclude that Eq. (16) must hold true.

Step 2 We will now show that ˆf maps each inferred latent slot ˆzj to pixels generated by at most one ground-truth latent slot zi. More precisely, for C := { P [K] : |P| > 1 } we aim to show:

z Z, j [K], P C : i P = ˆIj(ˆz) Ii(z) = . (20)

Suppose for a contradiction to Eq. (20) that:

z Z, j [K], P C : i P = ˆIj(ˆz ) Ii(z ) = . (21)

We will let z denote the value for which Eq. (21) holds and without loss of generality let j = 1.

Step 2.1 First, i P we define the sets:

Oi,1 := { j Ii(z ) | j ˆI1(ˆz ) }, Oi,2 := { j Ii(z ) | j / ˆI1(ˆz ) }, (22)

Intuitively, the set Oi,1 represents the pixels which are a function of both ground-truth latent slot z i and inferred slot ˆz 1, while Oi,2 represents the pixels which are a function of z i but not ˆz 1. Our aim is now to show that for all i P, the sets Oi,1, Oi,2 form a partition of Ii(z ).

By Eq. (22), i P, Oi,1 Oi,2 = Ii(z ), and Oi,1 Oi,2 = . We thus only need to show that Oi,1, Oi,2 = .

We first note that by our assumed contradiction in Eq. (21), there are pixels which are a function of both ground-truth slot z i and inferred slot ˆz 1 i.e.:

i P, j Ii(z ) : j ˆI1(ˆz ) = j Oi,1 = Oi,1 = . (23)

We will now show that i P, Oi,2 = . Suppose for a contradiction that

i P : Oi,2 = , (24)

This implies that Ii(z ) = Oi,1 as Ii(z ) = Oi,1 Oi,2 = Oi,1 . Further, Eq. (22) implies that Oi,1 ˆI1(ˆz ) thus Oi,1 = Ii(z ) ˆI1(ˆz ).

Next, consider another ground-truth slot z k where k = i P. As previously established, Ok,1 = . Moreover, by Eq. (22), Ok,1 ˆI1(ˆz ). Thus, A := Ii(z ) Ok,1 ˆI1(ˆz ). Now, note that because ˆf is compositional, Lemma 2 implies that the rank of the sub-mechanism defined by A M. When coupled with the equality of Jacobian ranks between f and ˆf stated in Lemma 4, we get: rank(Jf A(z )) = rank(Jˆf A(ˆz )) M. (25)

Provably Learning Object-Centric Representations

Moreover, according to Eq. (22), Ok,1 Ik(z ). By compositionality of f, it thus follows that Ok,1 Ii(z ) = since i = k. Therefore, by Lemma 1, we know the sub-mechanisms defined by Ii(z ) and Ok,1 are independent such that

rank(Jf A(z )) = rank(Jf Ii(z )) + rank(Jf Ok,1(z )). (26)

Leveraging Lemma 3 yields rank(Jf Ii(z )) = M. Inserting this in the previous equation yields

rank(Jf A(z )) = M + rank(Jf Ok,1(z )), (27)

which according to Eq. (25) must be M i.e.

M rank(Jf A(z )) = M + rank(Jf Ok,1(z )). (28)

Now, note that by the definition of Ik(z ) in Eq. (3), i Ik(z ), Jfi(z ) = 0. Because Ok,1 = and Ok,1 Ik(z ), it follows that Jf Ok,1(z ) = 0. This implies rank(Jf Ok,1(z )) > 0. However, this contradicts Eq. (28) and, hence, also the initial assumption in Eq. (24). Therefore, we conclude that i P, Oi,2 = .

Taken together, we have shown that i P, the sets Oi,1, Oi,2 are nonempty and form a partition of Ii(z ).

Step 2.2 Next, we first note that Lemma 3 implies that the rank of the mechanism Jf Ii(z ) is equal to M. Moreover, by assumption, Jf Ii(z ) is irreducible. Because Oi,1 and Oi,2 form a partition of Ii(z ), irreducibility then implies:

i P : rank(Jf Oi,1(z )) + rank(Jf Oi,2(z )) > M. (29)

Due to the equality of Jacobian ranks between f and ˆf stated in Lemma 4, Eq. (29) implies

i P : rank(Jˆf Oi,1(ˆz )) + rank(Jˆf Oi,2(ˆz )) > M. (30)

By the definition of Oi,1, Oi,2 in Eq. (22), i P : Oi,1 ˆI1(ˆz ), Oi,2 ˆI1(ˆz ) = . It thus follows from Lemma 1 that the sub-mechanisms defined by Oi,1 and Oi,2 are independent under ˆf in the sense of Defn. 4. Because Oi,1 and Oi,2 form a partition of Ii(z ), this independence, when coupled with Eq. (30), implies:

i P : rank(Jˆf Ii(ˆz )) = rank(Jˆf Oi,1(ˆz )) + rank(Jˆf Oi,2(ˆz )) > M. (31)

We know from Lemma 3 that the mechanism defined by Ii(z ) has rank M under f. The equality of Jacobian ranks between f and ˆf stated in Lemma 4 then implies:

rank(Jˆf Ii(ˆz )) = rank(Jf Ii(z )) = M, (32)

which contradicts Eq. (31), and, hence the initial assumption of this proof by contradiction in Eq. (21) cannot be correct and Eq. (20) must hold true.

We have now shown that z Z, j [K], there exists at least one and at most one i [K] : ˆIj(ˆz) Ii(z) = implying there exists exactly one, thus completing the proof.

We now provide a corollary to Prop. 2 stating that the result also holds when the roles of ˆIj(ˆz), Ii(z) are reversed.

Corollary 1. z Z, i [K], there exists exactly one j [K] : ˆIj(ˆz) Ii(z) = .

Proof. We will first prove that there exists at least one j [K] : ˆIj(ˆz) Ii(z) = . Assume, for a contradiction that:

z Z, i [K], j [K] : ˆIj(ˆz ) Ii(z ) = . (33)

This contradiction can be shown not to hold by exactly repeating the procedure in Step 1 of Prop. 2.

We thus only need to prove that there exists at most one j [K] : ˆIj(ˆz) Ii(z) = . Let C := { P [K] : |P| > 1 }. Suppose for a contradiction that:

z Z, i [K], P C : j P = ˆIj(ˆz ) Ii(z ) = . (34)

Provably Learning Object-Centric Representations

Let A := [K] \ P. We know by Prop. 2 that j A, there exists exactly one i [K] : ˆIj(ˆz ) Ii(z ) = . This implies that at least |[K]| |A| = |P| ground-truth latent slots generate pixels which do not overlap with the pixels generated by any inferred latent slots in A. In other words, there exists a set B [K] with cardinality |P| > 1 s.t.

i B, j A : ˆIj(ˆz ) Ii(z ) = (35)

Now consider the set P. We know by Eq. (34), that for all j P : ˆIj(ˆz ) Ii(z ) = . By Prop. 2, we know that for all j P, ˆIj(ˆz ) can intersect only with Ii(z ). Given that |B| > 1, this then implies

i B : j P : ˆIj(ˆz ) Ii(z ) = (36)

Now, by construction, [K] = A P. Thus, Eq. (35) and Eq. (36) together imply:

i B [K] : j [K] : ˆIj(ˆz ) Ii(z ) = (37)

We have already shown in the first part of this corollary, however, that Eq. (37) cannot be true by repeating the procedure in Step 1 of Prop. 2. Thus, our assumed contradiction in Eq. (34) cannot be true.

We have now shown that z Z, i [K], there exists at least one and at most one j [K] : ˆIj(ˆz) Ii(z) = implying there exists exactly one, thus completing the proof.

We now build upon Prop. 2 and Cor. 1, to show that all inferred latent slots depend on at most one ground-truth slot. Proposition 3. Let Z be a latent space and X an observation space. Let f : Z X be a diffeomorphism that is compositional (Defn. 1) with irreducible mechanisms (Defn. 5). Let ˆg : X Z be a diffeomorphism with inverse ˆf : Z X that is compositional (Defn. 1). Let ˆz := ˆg(f(z)), z Z. Then, z Z, i [K], there exists at most one j [K] : ˆzj

zi (z) = 0.

Proof. Our goal is to show that at most one ˆzj is a function of a given zi. More precisely, let C := { P [K] : |P| > 1 }. We aim to show that: z Z, i [K], P C : j P = ˆzj

zi (z) = 0. (38)

Suppose for a contradiction to Eq. (38) that:

z Z, i [K], P C : j P = ˆzj

zi (z ) = 0. (39)

Let z denote the value for which Eq. (39) holds and without loss of generality let i = 1.

We first introduce the function

h := ˆg f : Z Z s.t. ˆz := ˆg(f(z)) = h(z).

Note that f = ˆf ˆg f = ˆf h. Thus, S [N], f S = ˆf S h. Therefore,

z Z, j [K] : fˆIj

z1 (z) = ˆfˆIj

z1 (z) (40)

Due to the compositionality of ˆf, ˆfˆ Ij ˆzk (ˆz) = 0, k = j [K]. This implies that these columns can be ignored when taking the product in Eq. (40), s.t. ˆfˆIj

z1 (z) = ˆfˆIj

ˆzj (ˆz) ˆzj

z1 (z). (41)

Now by Cor. 1, there exists exactly one j P [K] s.t. ˆIj(ˆz ) I1(z ) = . By the definition of Ii(z) in Eq. (3), this

implies that there exists exactly one j P s.t. fˆ Ij z1 (z ) = 0. |P| > 1, thus there exists a j P s.t.

z1 (z ) = ˆfˆIj

ˆzj (ˆz ) ˆzj

z1 (z ) = 0 (42)

Provably Learning Object-Centric Representations

where we leveraged Eq. (40), Eq. (41) to get the first equality above. Now, we know by Lemma 3, that JˆfˆIj(ˆz ) is full

column-rank. By compositionality of ˆf, we also know that rank(JˆfˆIj(ˆz )) = rank( ˆfˆ Ij ˆzj (ˆz )) as these are the only non-zero

columns in JˆfˆIj(ˆz ). Thus, ˆfˆ Ij ˆzj (ˆz ) is also full column-rank. Now, Eq. (42) implies that all columns of ˆzj

z1 (z ) must be in

null( ˆfˆ Ij ˆzj (ˆz )). Because, ˆfˆ Ij ˆzj (ˆz ) is full-column rank, null( ˆfˆ Ij ˆzj (ˆz )) = 0. However, by Eq. (39) at least one column of

ˆzj z1 (z ) is non-zero. Thus, we obtain a contradiction and conclude that Eq. (38) must hold.

Building on top of the previous propositions, we now prove our main identifiability result:

Theorem 1. Let f : Z X be a diffeomorphism that is compositional (Defn. 1) with irreducible mechanisms (Defn. 5). If an inference model ˆg : X Z is (i) a diffeomorphism with (ii) compositional inverse ˆf = ˆg 1, then ˆg slot-identifies z = g(x) in the sense of Defn. 6.

Proof. According to Prop. 1 every inferred latent slot ˆzj depends on at least one ground-truth latent slot zi. At the same time, Prop. 3 states that every inferred latent slot depends on at most one ground-truth slot. Hence, every inferred latent slot depends on exactly one ground-truth slot.

This implies that the Jacobian Jh(z) of h = ˆg f : Z Z must be block diagonal up to permutation everywhere:

z Z : Jh(z) = P(z)B(z) (43)

where P(z) is a permutation matrix and B(z) a block-diagonal matrix.

Next, note that det(Jh(z)) = det(P(z)) det(B(z)) = det(B(z)) = 0 (44)

since h is diffeomorphic. Hence, B(z) is invertible with continuous inverse. We conclude that

P(z) = Jh(z)B 1(z) (45)

is continuous. At the same time, P(z) can only attain a finite set of values since it is a permutation. Hence, P(z) must be constant in z, that is, the same global permutation is used everywhere.5

Thus, for any j K, there exists a unique i K such that the function hj = ˆgj f : Z Zj is, in fact, constant in all slots except Zi, i.e., it can be written as a mapping hj : Zi Zj.

Finally, all such hj are diffeomorphic, since h is a diffeomorphism.

This concludes the proof that assumptions (i) and (ii) imply ˆg slot-identifies z.

We now show that the compositional contrast Ccomp introduced in Eq. (6) indicates whether a map is compositional:

Lemma 5. Let f : Z X be a differentiable function. f is compositional in the sense of Defn. 1 if and only if for all z Z:

Ccomp(f, z) = 0 .

Proof. ( ) We begin by analyzing Ccomp(f, z):

fn zk (z) 2

zj (z) 2 (46)

Since all summands are non-negative, the sum can only equal zero if every summand is zero z Z. Since j = k in the summand, this means:

z Z, n [N], k = j [K] : fn zk (z) 2

zj (z) 2 = 0 (47)

5Suppose for a contradiction that P(z) attains distinct values at some z A = z B in Z. Since Z is convex, the line connecting z A and z B is also in Z and P must change value somewhere along this line, leading to a discontinuity and thus a contradiction.

Provably Learning Object-Centric Representations

This relation can only be satisfied if one (or both) of the partial derivatives in the summand have a norm of zero, i.e. if they are zero. More precisely,

z Z, n [N], k = j [K] : fn

zk (z) = 0 fn

zj (z) = 0. (48)

According to Defn. 1 a map f is compositional if

z Z : k = j = Ik(z) Ij(z) = . (49)

By the definition of Ii(z) in Eq. (3), we can restate Eq. (49) as:

z Z, k = j, n [N] : fn

zk (z) = 0 fn

zj (z) = 0 (50)

which implies:

z Z, n [N], k = j : fn

zk (z) = 0 fn

zj (z) = 0 (51)

which is equivalent to Eq. (48). Hence, z Z : Ccomp(f, z) = 0 implies that f is compositional.

( ) We now prove the reverse direction i.e. that if f is compositional, then z Z : Ccomp(f, z) = 0. Note that the form of compositionality given in Eq. (50) implies that z Z, at least one term in the summand of Ccomp(f, z) in Eq. (51) will be zero. Thus, each summand is equal to zero. This then implies that z Z : Ccomp(f, z) = 0, completing the proof.

Finally, by leveraging Lemma 5, we can obtain Thm. 1 in a less abstract form.

Theorem 2. Let f : Z X be a diffeomorphism that is compositional (Defn. 1) with irreducible mechanisms (Defn. 5). If an encoder ˆg : X Z and decoder ˆf : Z X are both differentiable and solve the following functional equation

ˆf(ˆg(x)) x 2

2 + λCcomp ˆf, ˆg(x) = 0, (7)

for λ > 0, then ˆg slot-identifies z in the sense of Defn. 6. Proof. As both summands of the functional are non-negative, solving the functional equation means solving for each of the summands to be equal to zero. Thus, we can analyze both of them separately. Solving the first sub-functional equation, i.e.,

ˆf(ˆg(x)) x 2

implies that ˆf is an inverse of ˆg for every x px. Because pz is assumed to have full support over Z, and px is defined by applying a diffeomorphism f : Z X on pz, this implies that px has full support over X. This means that ˆf is an inverse of ˆg over the entire space X i.e. ˆf = ˆg 1. Since per assumption ˆg and ˆf are differentiable it follows that ˆg is a diffeomorphism.

Moreover, per Lemma 5, solving the second sub-functional equation for λ > 0, i.e.,

Ex px h λCcomp(ˆf, ˆg(x)) i = 0,

means that ˆf is compositional as px has full support over X and ˆg is a diffeomorphism between X and Z. From Thm. 1 it now follows that ˆg slot-identifies z, concluding the proof.

B Experimental Details

B.1 Synthetic Data 5.1

Enforcing Irreducibility We choose slot-output dimension, which we will denote dim(xs), to be greater than slotdimension M as this is required for irreducibility (Defn. 5). To see this, assume the number of rows in each mechanism (Defn. 2), equal in our case to dim(xs), were equal to M. Because mechanisms have rank = M (Lemma 3) and we have

Provably Learning Object-Centric Representations

M rows, this implies that no row is in the span of any others. Hence, the mechanism would be reducible. Beyond enforcing that the slot-output dimension, equal to 20 in this case, is greater than M = 3, we do not do anything further to ensure that our ground-truth generator is irreducible. This is because it is extremely unlikely that the generator, as we have constructed it, could be reducible. Specifically, if the generator were reducible, then as dim(xs) becomes larger than M, each new row in the Jacobian would need to lie in the span of some subset of the previous rows. As dim(xs) continues to increase relative to M, however, this becomes increasingly unlikely since the rows in the weight matrices of our MLP generator are randomly sampled i.e. entries are sampled uniformly from [ 10, 10].

Inference Model Training and Evaluation For our inference model, we use a 3 layer MLP with 80 hidden units in each layer and Leaky Re LU activation functions. We train on 75,000 samples and use 6,000 and 5,000 for validation and test sets, respectively. We train for 100 epochs with the Adam optimizer (Kingma & Ba, 2015) on batches of 64 with an initial learning rate of 10 3, which we decay by factor of 10 after 50 epochs. We use the validation set to find the optimal permutation for the Hungarian matching and then evaluate the SIS on the test set after applying this permutation to the slots. We compute the SIS for models every 4 epochs during training, all of which are plotted in Fig. 4. We trained all models using Py Torch (Paszke et al., 2019).

B.2 Existing Object-Centric Models 5.2

Data Generation We generate image data using the Spriteworld renderer (Watters et al., 2019). Images consist of 2 to 4 objects, each described by 4 continuous (size, color, x/y position) and 1 discrete (shape) independent latent factors. We sample all factors uniformly where size is sampled from [.1, .15] and x/y position both from [.1, .8]. We represent color using HSV and sample hue from [0, 1] while fixing saturation and value to 3 and 1, respectively. The dataset consists of 100,000 images, 90,000 of which are used for training and 10,000 for evaluation.

Inference Model Training and Evaluation We use the same Slot Attention model proposed by Locatello et al. (2020), with the changes being that we use 16 convolutional filters in the decoder opposed to 32 and do not use a learning rate warm-up. For MONet, we follow the setup used by Dittadi et al. (2022) on Multi-d Sprites (Kabra et al., 2019). For our additive autoencoder, we use the convolutional encoder/decoder architecture proposed by Burgess et al. (2018). The model decodes each slot separately to get slot-wise reconstructions and mask, applies the normalized mask to each slot-wise reconstruction, and then adds the results together to get the final reconstructed image. For all models, we use 4 slots with a slot-dimension of 16. We train all models for 500, 000 iterations (356 epochs) on batches of 64 with between 5 to 12 random seeds for each model. We train using the Adam optimizer (Kingma & Ba, 2015) with an initial learning rate of 10 4, which we decay throughout training for all models using the same decay scheduler as Locatello et al. (2020). We trained all models using Py Torch (Paszke et al., 2019).

B.3 Compositional Contrast Normalized Variants

When computing Ccomp in 5.1 and 5.2, we use a few different normalized variants of the contrast to overcome potential issues with the definition given in Defn. 7. Firstly, as the number of latent slots K increases, the contrast in Defn. 7 will scale by a factor K2 K. Thus, when comparing models across different numbers of slots in 5.1, we divide the contrast by this factor to ensure that comparisons remain meaningful across different values of K. Another issue with the contrast in Defn. 7, is that it is not scale invariant. Specifically, naively minimizing the norm of the gradients for each pixel across slots will also minimize the contrast, despite all slots having similar gradient norms for a given pixel. This scale invariance did not cause issues when optimizing Ccomp directly in 5.1. However, when evaluating the Ccomp of object-centric models in 5.2, we account for this invariance. Specifically, we divide the gradient norms for each pixel with respect to each slot by the mean gradient norm for this pixel across slots. This gradient normalization creates an additional problem, however: Pixels with a relatively small gradient norm, such as black background pixels, will be weighted equally to pixels with a larger gradient norm such as pixels corresponding to an object. To account for this, we weight each pixel s contribution to the contrast by the pixel s mean gradient across slots.

B.4 Slot Identifiability Score

We are interested in a metric measuring how much information about the ground-truth latent slots is contained in the inferred latent slots without mixing information about different ground-truth slots into the same inferred slot. Let S1, S2 [0, 1] denote scores that quantify how much information about each ground-truth slot can be extracted from the most and secondmost predictive inferred slot, respectively. The aforementioned metric can be computed by just subtracting the two scores,

Provably Learning Object-Centric Representations

S = S1 S2. (52)

Following previous work, we use the R2 coefficient of determination as a score for continuous factors of variation (which we restrict to be strictly non-negative) and the accuracy for categorical factors (Dittadi et al., 2022). We compute one S value for each type and take the weighted mean which we then average across all slots to get the final slot identifiability score (SIS).

Computing SIS on Synthetic Data 5.1 To compute the scores S1 and S2 defined in our experiments in 5.1, we must fit two inference models between ground-truth and inferred slots: one between the best-matching slots and one between the second-best-matching slots. In 5.1, we fit these models by first fitting a kernel ridge regression model between every pair of inferred and ground-truth slots and computing the R2 scores for the predictions given by each model. We then use the Hungarian algorithm (Kuhn, 1955) to match each ground-truth slot to its most predictive inferred slot based on these R2

scores, which gives us S1. To get S2, we take the highest R2 score for each inferred slot with respect to the ground-truth slots that it was not already matched with. For our experiments in Fig. 5 with dependent latent slots, S2 will inevitably be non-zero even if a model is perfectly identifiable. Thus, for these experiments, we only consider S1 and refer to this metric as the Slot MCC (Mean Correlation Coefficient).

Computing SIS on Image Data 5.2 When training models to compute S1 and S2 in our experiments on image data in 5.2, one issue that arises is that the permutation between inferred latent slots and ground-truth slots is not necessarily a global permutation but can also be a local permutation. This is due to the ground-truth generator function being permutation invariant. To resolve this, we take a similar approach to work by Dittadi et al. (2022) and perform an online matching during training of inferred latent slots to ground-truth slots using the training loss. Specifically, we compute the loss for every pairing of the ground-truth and inferred slots and use the Hungarian algorithm to pick the permutation that yields the lowest aggregate loss. As every slot can contain both continuous and categorical variables, we compute the mean squared error for continuous factors and cross-entropy for categorical variables and sum them up to obtain the training loss. In our experiments, we notice that the cross-entropy tends to yield unstable matching results. Therefore, we use the minimum probability margin 6 to compute the categorical loss to solve the matching problem. Before fitting the readout models, we standardized both the ground-truth and inferred latents. We parameterized the readout models as 5-layer MLPs with Leaky Re LU nonlinearity and a hidden dimensionality of 256, and trained them for up to 100 epochs using the Lion optimizer with a learning rate of 10 4. To prevent the network from locking in too early on a suboptimal solution, we add a small amount of noise (10 % of the maximum matching loss value) to the losses before determining the optimal matching. Finally, we suggest performing cross-validation and early stopping to prevent overfitting.

For training the model to compute S2, we proceed as for S1 but ensure that the model is not using the same permutation used for computing S1, i.e., it is trained on the second-best matching between ground-truth and inferred slots. Lastly, when computing S2, we aim to avoid scenarios in which the model finds a spurious permutation yielding a non-zero S2 despite the model being identifiable. To account for this, we compute S2 on the ground-truth latent slots, denoted Sgt 2 , using the same procedure for computing S2, and use this score to adjust our previous scores. Specifically, by adjusting the value range accordingly, we obtain a score of

S = S1 Sgt 2 1 Sgt 2 S2 Sgt 2 1 Sgt 2 , (53)

To ensure that the subtracting term is not increasing the final score, we restrict it to be positive, yielding the final score:

S = S1 Sgt 2 1 Sgt 2 max S2 Sgt 2 1 Sgt 2 , 0 . (54)

We may additionally be interested in considering the two terms on the RHS of Eq. (54) separately. Thus, we define them below as:

ˆS1 = S1 Sgt 2 1 Sgt 2 , ˆS2 = S2 Sgt 2 1 Sgt 2 , S = ˆS1 max( ˆS2, 0). (55)

6i.e., maxi pi py, where p denotes the predicted probability for different values of the categorical distribution and y the ground-truth value

Provably Learning Object-Centric Representations

C Additional Figures and Experiments

10 5 10 4 10 3 10 2

Reconstruction Error

Compositional Contrast

Figure 5. Experimental validation of Thm. 2 for statistically dependent slots. We trained models on synthetic data generated according to 2 with 2, 3, 5 dependent latent slots (see 5.1). The color coding indicates the level of identifiability achieved by the model, measured by the Slot Mean Correlation Coefficient (MCC), where higher values correspond to more identifiable models. As predicted by our theory, if a model sufficiently minimizes both reconstruction error and compositional contrast, then it identifies the ground-truth latent slots.

0 50 100 150 200 250 300 Training Epoch

Compositional Contrast

MONet AE SA

Figure 6. Compositional Contrast (Ccomp) throughout training. Here, we plot the compositional contrast (Ccomp) over the course of training for MONet, Slot Attention (SA) as well as an additive auto-encoder (AE), on image data. We can see that all models appear to be minimizing Ccomp to some extent despite it not being explicitly optimized for in any of these models.

Provably Learning Object-Centric Representations

101 102 Reconstruction Error

Compositional Contrast

MONet AE SA

Uncorrected Slot Identifiability Score S1

101 102 Reconstruction Error

104 MONet AE SA

Slot Identifiability Score Correction S2

0.2 0.3 0.4 0.5 0.6 0.7 0.8 Uncorrected Slot Identifiability Score S1

Slot Identifiability Score Correction S2

MONet AE SA

Figure 7. Analysis of Information Leakage Between Slots from Models Trained in 5.2. (A) Uncorrected Slot Identifiability Score ( ˆS1) vs. Correction ( ˆS2). We train 3 existing object-centric architectures MONet, Slot Attention (SA), and an additive auto-encoder (AE) on image data and investigate whether inferred latent slots encode information from multiple objects when using an inferred latent dimension greater than the ground-truth. To test this, we look at the R2 score for a model fit between each inferred slot and the second most predictive ground-truth slot for this slot. We refer to this score as the slot identifiability score correction, defined as ˆS2 in Appx. B.4. We plot this score against the uncorrected slot identifiability score i.e. the most predictive ground-truth slot, defined as ˆS1 in Appx. B.4. We can see that for all models, ˆS2 is non-zero, even as ˆS1 increases, suggesting that models are leveraging their additional latent capacity to encode information about multiple objects in the same latent slot. (B) and (C) Influence of Reconstruction Error and Compositional Contrast on ˆS1 and ˆS2. Here, we further visualize the slot identifiability score correction ( ˆS1) and the uncorrected score ( ˆS2) as a function of the reconstruction error and the compositional contrast in panels B and C, respectively. We can see in B that, similar to the SIS in Fig. 4, ˆS1 tends to increase as reconstruction loss and compositional contrast decrease. We can additionally see in C that, while ˆS2 decreases to some extent with Ccomp, there is generally less of a correlation between ˆS2 and these metrics. This suggests that the latent capacity must also be restricted to minimize ˆS2.

Figure 8. Samples from our multi-sprites dataset used in 5.2. Objects are described by five latent factors: shape, color, size, and x/y position. Occlusions are present in the dataset, as shown in the samples above (see the second and third images from the left).