
Identifiable Object Representations under Spatial Ambiguities

Avinash Kori 1 Francesca Toni 1 Ben Glocker 1

Modular object-centric representations are essential for human-like reasoning but are challenging to obtain under spatial ambiguities, e.g. due to occlusions and view ambiguities. Addressing these challenges presents both theoretical and practical difficulties. We introduce a novel multi-view probabilistic approach that aggregates view-specific slots to capture invariant content information while simultaneously learning disentangled global viewpoint-level information. Unlike prior single-view methods, our approach resolves spatial ambiguities, provides theoretical guarantees for identifiability, and requires no viewpoint annotations. Extensive experiments on standard benchmarks and novel complex datasets validate our method's robustness and scalability.

1 Introduction

The ability to capture the notion of objectness in learned representations is considered to be a critical aspect for developing situation-aware AI systems with human-like reasoning capabilities (Schölkopf & von Kügelgen, 2022; Lake et al., 2017). Objectness can be characterised as understanding the environment from the perspective of its building blocks. These can further be divided into object-part composition (Hinton, 1979; 2022), which might be a potential reason why humans generalise across environments with few examples to learn from (Tenenbaum et al., 2011). Recent advances in object-centric representation learning (OCL) have shown great potential in segregating objects in observed scenes (Locatello et al., 2020b; Kori et al., 2023; Löwe et al., 2024). Indeed, the goal of OCL is to enable agents to learn representations of objects in an observed scene in the context of their environment, as opposed to learning global representations as in the case of traditional generative models such as variational auto-encoders (Kingma & Welling, 2013). OCL approaches enable agents to learn spatially disentangled representations, which is an important step in compositional scene generation (Bengio et al., 2013; Lake et al., 2017; Battaglia et al., 2018; Greff et al., 2020) and understanding of causal (and physical) interactions between the objects (Marcus, 2003; Gerstenberg et al., 2021; Gopnik et al., 2004).

1 Department of Computing, Imperial College London, London, UK. Correspondence to: Avinash Kori <a.kori21@imperial.ac.uk>.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Figure 1. (a) Occlusion Ambiguity: the orange object, which is occluded by the blue object, could be any of the six plausible objects shown on the right. (b) View Ambiguity: the blue object is observed from two different viewpoints (represented with a red arrow and a dot), leading to a change in its overall shape. In general, identifiable representations resolve ambiguities by determining the most plausible object under occlusion and correct object properties in case of view transformation by leveraging information from multiple viewpoints.

Recent progress in OCL has been limited to learning scene representations from single viewpoints (Locatello et al., 2020b; Engelcke et al., 2021; Singh et al., 2021; Kori et al., 2023; Chang et al., 2022; Seitzer et al., 2022; Löwe et al., 2024). Although these approaches can learn meaningful object-specific representations, they encounter significant challenges stemming from spatial ambiguities such as occlusion and view ambiguities (see Fig. 1 for examples). Additionally, while it has been hypothesised that these models learn un-occluded object representations even in the case of occlusions, learning from a single viewpoint fails to capture effective object representations, due to the presence of multiple plausibilities of partially or fully occluded objects and the effects of view transformations, as demonstrated in Fig. 1 and highlighted by the results in Fig. 2 (we will revisit these results later in Section 6). Another example of spatial ambiguities can be observed in Fig. 3, where object $O_4$ in $x^1$ and $x^2$ can be interpreted as a cube, but only after considering $x^3$ can we conclude that it is a pyramid.

Figure 2. Identifiability across a number of views measured with Slot Mean Correlation Coefficient (SMCC).

A handful of approaches, including MULMON (Li et al., 2020), DYMON (Li et al., 2021), and OCLOC (Yuan et al., 2024), have considered multiple viewpoints for extracting object representations. Additionally, methods such as those of Liu et al. (2025), Chen et al. (2021), and Luo et al. (2024) effectively use NERF (Mildenhall et al., 2021) for constructing a 3D environment from multi-viewpoint images, where the occlusions are addressed by construction. Among these methods, MULMON, DYMON, and all NERF-based approaches assume that the viewpoint annotations are known, which simplifies the problem of learning to disentangle object representations conditioned on viewpoint information.

The problem setting in this work aligns with OCLOC in that our aim is to learn invariant object representations while simultaneously learning global view information with respect to an implicit global coordinate frame. This eliminates the requirement for paired viewpoint-image data. While OCLOC introduces an innovative approach for learning global view information independently of the scene, its primary focus is on achieving object consistency unconditional on views rather than explicitly learning view-invariant object representations. Additionally, learning global unconditional view representations does not guarantee learning identifiable view/object representations, which was not studied for OCLOC. In this work, we provide a novel model in which object representations satisfy view-invariance and view representations satisfy approximate equivariance properties, allowing us to exploit objects' inherent geometry and semantics to establish correspondences across views.

Figure 3. The figure illustrates a scene with four objects $\mathcal{O}^s = \{O_1, O_2, O_3, O_4\}$, observed from three different viewpoints, each described with a set of clearly visible objects: $\mathcal{O}^1 = \{O_3, O_4\}$, $\mathcal{O}^2 = \{O_1, O_3, O_4\}$, $\mathcal{O}^3 = \{O_1, O_2, O_3, O_4\}$. The corresponding images are passed through view and content encoders, and the sampled global view vector $v$ is used to estimate the transformation function $T_{\theta_v}$, given by parameters $\theta_v$ predicted using a localisation network. We apply a view-specific inverse $T_{\theta_v}^{-1}$ on the respective images, projecting them to an implicit space, which is used to learn view-conditioned slot posteriors corresponding to GMMs represented by $q(s^v \mid T_{\theta_v}^{-1}(x^v))$. These are further aggregated to marginalize viewpoint information, resulting in a content posterior, also a GMM, $q(c \mid \{s^1, \dots, s^V\})$, which is further accumulated across all samples, resulting in the optimal prior $p(c)$.

In single-view OCL, Kori et al. (2024), Brady et al. (2023), and Lachapelle et al. (2023) rigorously formalise the underpinning explicit and implicit assumptions and provide conditions under which models learn identifiable slot representations, leaving out ambiguous scenarios. Unlike them, our approach resolves spatial ambiguities, provides theoretical guarantees for identifiability, and requires no viewpoint annotations. To the best of our knowledge, this is the first work addressing explicit formalisations of assumptions and theory required for achieving identifiable object representations under occlusions with multi-view observational data. To this end, we make use of spatial Gaussian mixture models (GMMs) in the latent distribution across viewpoints to encourage identifiability without additional auxiliary data. Our main contributions in this work can be summarised as follows:

(i) We propose a probabilistic slot attention variant, View Invariant Slot Attention (VISA), for learning identifiable object-centric representations from multiple viewpoints, resolving spatial ambiguities such as occlusions and view ambiguities (Section 4).

(ii) We prove that our object-centric representations are identifiable in the case of partial or full occlusions without additional view information up to an equivalence relation with a mixture model specification (Section 5).

(iii) We provide conclusive evidence of our identifiability results, including visual verification on synthetic datasets; we also demonstrate the scalability of the proposed method on two new, carefully designed complex datasets MVMOVI-C and MVMOVI-D (Section 6).


2 Related Works

Identifiable Object-centric learning. Extending nonlinear Independent Component Analysis (ICA) from representation learning to object-specific representation learning has been heavily explored before (Burgess et al., 2019; Engelcke et al., 2019; Greff et al., 2019) by employing an iterative variational inference approach (Marino et al., 2018), whereas Van Steenkiste et al. (2020) and Lin et al. (2020) adopt more of a generative perspective and study the effect of object binding and scene composition empirically. Recently, the use of iterative attention mechanisms has gained significant interest (Locatello et al., 2020b; Engelcke et al., 2021; Singh et al., 2021; Wang et al., 2023; Singh et al., 2022; Emami et al., 2022). Most of these works operate in a single-view setting, which causes fundamental issues of viewpoint ambiguities in terms of occlusions and uncertainties in binding. Other methods (Eslami et al., 2018; Arsalan Soltani et al., 2017; Tobin et al., 2019; Wu et al., 2016) consider a single object from multiple views to tackle this particular problem. Additionally, Kosiorek et al. (2018), Hsieh et al. (2018), and Li et al. (2020) explore multi-object binding in videos and multiple views, tackling object binding issues across frames. Despite their empirical effectiveness, most of these works lack formal identifiability guarantees. In line with recent efforts analysing theoretical guarantees in object-centric representations (Lachapelle et al., 2023; Brady et al., 2023; Kori et al., 2024), we formally investigate the modelling assumptions and their implications for achieving identifiability guarantees in the context of multi-object, multi-view object-centric representation learning settings.

Multiview nonlinear ICA. It has been noted that addressing the challenge of nonlinear ICA can involve incorporating a learnable clustering task within the latent representations, thereby imposing asymmetry in the latent distribution (Willetts & Paige, 2021; Kivva et al., 2022). Moreover, Gresele et al. (2020) delve into multiview nonlinear ICA, particularly in scenarios involving corrupted observations, where they aim to recover invariant representations while accounting for certain ambiguities. Along similar lines, Daunhawer et al. (2023); Von Kügelgen et al. (2021) explore the concept of style-content identification using contrastive learning, focusing on addressing the multiview nonlinear ICA problem. Here, we work along similar lines by emphasising the learning of invariant content and identifiable object-centric representations. We achieve this by formulating a reconstruction objective where the enforced invariance and equivariance stem from the underlying probabilistic graphical model rather than relying on a contrastive learning objective. Similar to the noiseless setting in (Gresele et al., 2020), we demonstrate the recovery of invariant content representations using different subsets of viewpoints.

3 Preliminaries

Probabilistic Slot Attention (PSA), as introduced by Kori et al. (2024), presents a probabilistic interpretation of the slot attention algorithm (Locatello et al., 2020b). In PSA, a set of feature embeddings $z \in \mathbb{R}^{N \times d}$ per input $x$ is taken as input, and an iterative Expectation Maximization (EM) algorithm is applied over these embeddings. This process results in a Gaussian Mixture Model (GMM) characterized by mean ($\mu \in \mathbb{R}^{K \times d}$), variance ($\sigma^2 \in \mathbb{R}^{K \times d}$), and mixing coefficients ($\pi \in [0, 1]^{K \times 1}$). In summary, PSA starts from an initial mean sampled from the prior distribution and an initial variance set to a unit vector, then, for $T$ iterations, computes the assignment probabilities ($A_{nk}$) using Eqn. 1 and updates the mixing coefficients, mean, and variance accordingly as described in Eqns. 2 to 4.

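$$A_{nk} = \frac{\pi_k^{(t)}\, \mathcal{N}\big(z_n; \mu_k^{(t)}, \sigma_k^{(t)2}\big)}{\sum_{j=1}^{K} \pi_j^{(t)}\, \mathcal{N}\big(z_n; \mu_j^{(t)}, \sigma_j^{(t)2}\big)}; \tag{1}$$

$$\hat{A}_{nk} = A_{nk} \Big/ \sum_{l=1}^{N} A_{lk}; \qquad \pi_k^{(t+1)} = \sum_{n=1}^{N} A_{nk} / N; \tag{2}$$

$$\mu_k^{(t+1)} = \sum_{n=1}^{N} \hat{A}_{nk}\, z_n; \tag{3}$$

$$\sigma_k^{(t+1)2} = \sum_{n=1}^{N} \hat{A}_{nk}\, \big(z_n - \mu_k^{(t+1)}\big)^2 \tag{4}$$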
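The EM updates in Eqns. 1 to 4 can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `psa_em`, the uniform initial mixing coefficients, the log-space E-step, and the small `eps` stabiliser are all assumptions made for this example.

```python
import numpy as np

def psa_em(z, mu0, n_iters=5, eps=1e-8):
    """Run the EM updates of Eqns. 1-4 on features z (N, d) with K diagonal Gaussians.

    mu0 (K, d) plays the role of the initial mean sampled from the prior.
    Uniform initial mixing coefficients and `eps` are assumptions of this sketch.
    """
    N, d = z.shape
    K = mu0.shape[0]
    mu = mu0.copy()
    var = np.ones((K, d))        # initial variance set to a unit vector
    pi = np.full(K, 1.0 / K)     # assumed uniform initial mixing coefficients
    for _ in range(n_iters):
        # Eqn. 1: responsibilities A[n, k], computed in log space for stability
        diff = z[:, None, :] - mu[None, :, :]
        log_pdf = -0.5 * np.sum(diff**2 / var[None] + np.log(2 * np.pi * var[None]), axis=-1)
        log_w = np.log(pi + eps)[None, :] + log_pdf
        log_w -= log_w.max(axis=1, keepdims=True)
        A = np.exp(log_w)
        A /= A.sum(axis=1, keepdims=True)
        # Eqn. 2: normalise over tokens and update the mixing coefficients
        A_hat = A / (A.sum(axis=0, keepdims=True) + eps)
        pi = A.sum(axis=0) / N
        # Eqn. 3: mean update
        mu = A_hat.T @ z
        # Eqn. 4: diagonal variance update around the new mean
        diff = z[:, None, :] - mu[None, :, :]
        var = np.einsum("nk,nkd->kd", A_hat, diff**2) + eps
    return mu, var, pi, A
```

The returned triple `(mu, var, pi)` parameterises the per-input GMM, and `A` holds the final assignment probabilities of tokens to slots.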

Identifiable representations. A model is considered identifiable when different training runs yield consistent latent distributions, thereby resulting in identical model parameters (Khemakhem et al., 2020a;c). In the context of a parameter space $\Theta$ and a family of mixing functions $\mathcal{F}$, identifiability of the model on the dataset $\mathcal{X}$ is established if, for any $\theta_1, \theta_2 \in \Theta$ and $f_{\theta_1}, f_{\theta_2} \in \mathcal{F}$, the condition $p(f_{\theta_1}^{-1}(x)) = p(f_{\theta_2}^{-1}(x))$ holding for all $x \in \mathcal{X}$ implies $\theta_1 = \theta_2$. However, in practical scenarios, exact equality or strong identifiability is often unnecessary, as establishing relationships up to transformations which can be manually recovered proves equally effective. This results in a notion of weak identifiability, where relationships are recovered up to an affine transformation (Khemakhem et al., 2020c; Kivva et al., 2022). Similar identifiability relations have been elucidated for OCL in (Brady et al., 2023; Lachapelle et al., 2023; Kori et al., 2024; Mansouri et al., 2023). The notion of the $\sim_s$ equivalence relation is elaborated in Dfn. 3.1.

Definition 3.1. ($\sim_s$ equivalence (Kori et al., 2024)) Let $f_\theta : \mathcal{S} \to \mathcal{X}$ denote a mapping from slot representation space $\mathcal{S}$ to image space $\mathcal{X}$ (satisfying Assumption F.4). The equivalence relation $\sim_s$ w.r.t. parameters $\theta \in \Theta$ is defined as:

$$\theta_1 \sim_s \theta_2 \iff \exists\, P, H, a : f_{\theta_1}^{-1}(x; v) = P\big(f_{\theta_2}^{-1}(x; v)\, H + a\big), \quad \forall x \in \mathcal{X},$$

where $P \in \mathcal{P} \subseteq \{0, 1\}^{K \times K}$ is a permutation matrix, $H \in \mathbb{R}^{d \times d}$ is an affine matrix, and $a \in \mathbb{R}^{d}$.

Figure 4. Graphical model for multi-view probabilistic slot attention: for every image in a dataset, a view $v \in \mathbb{R}^{d_v} \sim p(v)$ is sampled, and this view is used to compute the transformation $T_{\theta_v}$. Similarly, the desired number ($< K$) of content representations $c \in \mathbb{R}^{N \times d_s}$ are sampled from the content distribution $p(c)$. Finally, the image $x$ is generated using the transformed content $T_{\theta_v}(c)$ and view $v$.

4 VISA Formalism

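Let $x^{1:V} = \{x^1, \dots, x^V\} \in \mathcal{X} = \mathcal{X}^1 \times \dots \times \mathcal{X}^V$ be $V$ views of the same scene observed from different viewpoints, with an observational space $\mathcal{X} \subseteq \mathbb{R}^{V \times H \times W \times C}$. We consider $[V]$ as a shorthand notation for $\{1, \dots, V\}$. Let $\mathcal{O}^e = \mathcal{O}^1 \cup \dots \cup \mathcal{O}^V$ correspond to an abstract notion of object sets of an environment, while $\mathcal{O}^v$, $v \in [V]$, is the set of objects present in a considered viewpoint $v$. Importantly, we consider that the number of objects per viewpoint can vary, i.e., $|\mathcal{O}^1 \cup \dots \cup \mathcal{O}^V| \geq |\mathcal{O}^v|\ \forall v \in [V]$, allowing for partial or full occlusion in some viewpoints. Let $v^{1:V} \in \mathcal{V} = \mathcal{V}^1 \times \dots \times \mathcal{V}^V \subseteq \mathbb{R}^{V \times d_v}$ be inferred viewpoint-specific information¹, while $s^{1:V}_{1:K} \in \mathcal{S} = \mathcal{S}^1 \times \dots \times \mathcal{S}^V \subseteq \mathbb{R}^{V \times K \times d_s}$ corresponds to a viewpoint-specific slot representation. Let $c_{1:K} \in \mathcal{C} \subseteq \mathbb{R}^{K \times d_c}$ capture the notion of an aggregate, effectively accumulating the object knowledge across viewpoints. For any subset $A$ of $[V]$, we represent scene observations as $x^A = \{x^i : i \in A\} \in \times_{i \in A} \mathcal{X}^i$. The inferred viewpoints and the view-specific slots are denoted as $v^A = \{v^i : i \in A\} \in \times_{i \in A} \mathcal{V}^i$ and $s^A_{1:K} = \{s^i_{1:K} : i \in A\} \in \times_{i \in A} \mathcal{S}^i$, respectively. We define $p^A(c)$ as the distribution of $c$ over $A$. A more comprehensive summary of notations and terminologies is provided in App. A.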

In modelling, w.l.o.g., we consider access to a certain subset $A \subseteq [V]$, ensuring the model's applicability across different scenarios. Furthermore, to simplify notation, we sometimes do not include the superscript denoting the full set of views, thereby using $x = x^A$, $s_{1:K} = s^A_{1:K}$, and $v = v^A$ interchangeably. Likewise, if we do not specify the subscripts for $c$ and $s$, it implies they represent the entire collection of objects, specifically $s = s^A_{1:K}$ and $c = c_{1:K}$. Lastly, for any function $f$ that operates on two distinct inputs, $x = f(z, v)$, its inverse is denoted by $z = f^{-1}(x; v)$, which signifies the reversal of $f$ conditioned on a variable $v$. In the rest of this section, we introduce all the components involved in our model. We also introduce assumptions, examples, and intuition wherever necessary. Considering the generative model in Eqn. 5, which is overviewed in the graphical model in Fig. 4, any scene $x$ is generated using view $v$ and content $c$. Here, both $c$ and $v$ are latent variables learned with variational inference (Kingma & Welling, 2013).

¹ We abuse the terminology by considering viewpoint, lighting, and object dimension to be encoded in a representation $v$. Note that $v$ is inferred by the model.

$$p(x) = \iint p\big(x^v \mid T_{\theta_v}(c), v^v\big)\, p(c)\, p(v^v)\, \mathrm{d}v^v\, \mathrm{d}c \tag{5}$$

View model. Given that the view property remains consistent across all objects, we treat the view as a global, image-level property, as opposed to Yuan et al. (2024), where the view is treated as an object-level property. Assuming access to a discrete set of viewpoints denoted by $A$, we consider the prior over the view distribution to be a GMM represented by $p(v) = \sum_{v=1}^{|A|} \pi_v\, \mathcal{N}(v; \mu_v, \sigma_v^2)$. To learn the parameters of this GMM, we consider a posterior of the form $q_\phi(v \mid x^v)\ \forall v \in A$². In both prior and posterior, we consider the covariance to be diagonal, implicitly making an ICA assumption (Khemakhem et al., 2020a). The sampled variable $v \sim q_\phi(v \mid x^v)$ is used to estimate transformation parameters $\theta_v \in \mathbb{R}^{3 \times 2}$ as in Jaderberg et al. (2015), which define an affine transformation map $T_{\theta_v}$ that is later applied on content $c$ and on view-specific slots $s$. It is important to note that we use the same set of parameters $\phi$ across all viewpoints in $A$ for inferring view information $v$.
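To make the role of $T_{\theta_v}$ and its inverse concrete, the sketch below applies a $3 \times 2$ affine parameter matrix to 2D coordinates and recovers the parameters of the inverse map. It is only an illustration under stated assumptions: the row-vector convention `[x, y, 1] @ theta` and the helper names are hypothetical, and the model itself applies the transformation to images and slots through a spatial-transformer-style module rather than to raw points.

```python
import numpy as np

def to_homogeneous(theta):
    """Embed a 3x2 affine parameter matrix (assumed row-vector convention) into 3x3."""
    T = np.eye(3)
    T[:, :2] = theta
    return T

def apply_affine(theta, pts):
    """Map 2D points of shape (N, 2) through [x, y, 1] @ theta."""
    ones = np.ones((pts.shape[0], 1))
    return np.hstack([pts, ones]) @ theta

def invert_affine(theta):
    """Return the 3x2 parameters of the inverse affine map."""
    return np.linalg.inv(to_homogeneous(theta))[:, :2]
```

Applying `invert_affine(theta)` after `theta` returns the original coordinates, which is the property the inverse view transformation relies on when projecting each view into a common implicit frame.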

Viewpoint specific slots. As illustrated in Fig. 3, the inference of $c$ depends on the view-specific slots $s$. For a considered image $x^v$, $v \in A$, we first apply an inverse view transformation $T_{\theta_v}^{-1}$ and model the slot distribution as a spatial mixture model represented by $q(s^A_{1:K} \mid T_{\theta_v}^{-1}(x^A))$. The inverse transformation ensures that the estimated object representations across all views in $A$ are in a common implicit representation space. As this is an intermediate variable which does not show up in our generative model in Eqn. 5, we update the corresponding parameters with closed-form equations via the expectation maximisation algorithm, as in (Kori et al., 2024). The resulting slot posterior is a conditional GMM as described in Eqn. 6, where $\tilde{x}^v = T_{\theta_v}^{-1}(x^v)$ is the transformed input and $(\mu_k(\tilde{x}^v), \sigma_k^2(\tilde{x}^v), \pi_k(\tilde{x}^v))$ are the mean, diagonal covariance, and mixing coefficients for the considered view and object.

$$q(s^v \mid \tilde{x}^v) = \sum_{k=1}^{K} \pi_k(\tilde{x}^v)\, \mathcal{N}\big(s^v_k; \mu_k(\tilde{x}^v), \sigma_k^2(\tilde{x}^v)\big) \tag{6}$$

² We consider the parametric form of $q$ to be Gaussian.


Representation matching. Given the permutation equivariance property of slot representations, we use a matching function with a permutation matrix $P^v$, $m_s : \mathcal{S}^A \to \mathcal{S}^A$, such that $m_s(s^v_{1:K}) = P^v s^v_{1:K}$, permuting the slot axis according to $P^v$. The permutation matrix $P^v$ is estimated by considering the slots of the first viewpoint $s^1$ as a base representation, and the other representations $s^v\ \forall v \in A$ are matched to align with it. We utilise Hungarian matching, as illustrated in (Locatello et al., 2020b; Wang et al., 2023), to estimate this permutation matrix $P^v$. To control the noise in the matching algorithm, we introduce a view-warmup strategy, which we detail in App. G.5.
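The alignment to a base view can be illustrated with a small sketch. Note that this is not Hungarian matching: to stay dependency-free it searches all $K!$ permutations, which gives the same optimal assignment for small $K$ but scales far worse. The Euclidean cost and the function name `match_to_base` are assumptions of this example.

```python
import itertools
import numpy as np

def match_to_base(s_base, s_view):
    """Return `perm` such that s_view[perm] is aligned with s_base.

    Brute-force stand-in for Hungarian matching (feasible only for small K).
    The pairwise Euclidean cost is an assumption of this sketch.
    """
    K = s_base.shape[0]
    # cost[i, j] = distance between base slot i and view slot j
    cost = np.linalg.norm(s_base[:, None, :] - s_view[None, :, :], axis=-1)
    best = min(
        itertools.permutations(range(K)),
        key=lambda p: sum(cost[i, p[i]] for i in range(K)),
    )
    return np.array(best)
```

Indexing `s_view[perm]` corresponds to applying the permutation matrix $P^v$ to the slots of view $v$.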

Content aggregator. We consider $g : \mathcal{S} \to \mathcal{C}$ as a content aggregator function, which marginalises the effect of view conditioning. To achieve this, we consider a convex combination of all the aligned slot representations (aligned to a base representation), using the mixing coefficients $\pi_k(x^v)$ in Eqn. 6 (we write $\pi^v_k$ for simplicity) as combination weights. The convex combination accounts for potential object occlusions, which may cause objects to be absent in particular views, ensuring that only active representations are combined (refer to the intuition below). This results in a content posterior $q(c \mid s)$, which is a GMM with mixing coefficients $\pi_k = \sum_{v=1}^{|A|} \pi^v_k / |A|$ and the parameters described in Eqn. 8 (w.l.o.g. we consider $s, \pi$ to represent aligned representations); refer to Lemma F.3 with $w_i = 1/|A|\ \forall i \in A$. Additionally, Algorithm 1 details the entire forward process.

Intuition: Content aggregation

Based on the illustrated example in Fig. 3, for images $x^1, x^2, x^3$, the resulting matched slots and mixing coefficients correspond to $s^1 = \{s^1_r, s^1_r, s^1_{O_3}, s^1_{O_4}, s^1_b\}$, $s^2 = \{s^2_{O_1}, s^2_r, s^2_{O_3}, s^2_{O_4}, s^2_b\}$, $s^3 = \{s^3_{O_1}, s^3_{O_2}, s^3_{O_3}, s^3_{O_4}, s^3_b\}$, where $s^v_{O_i}$, $s^v_r$, and $s^v_b$ correspond to the slot representation for object $O_i$, a random slot representation, and background information, respectively, with mixing coefficients $\pi^1 = \{0, 0, 1, 1, 1\}$, $\pi^2 = \{1, 0, 1, 1, 1\}$, and $\pi^3 = \{1, 1, 1, 1, 1\}$. The proposed aggregation merges the slots, ignoring the random slots $s^v_r$, resulting in $c_{O_1} = (s^2_{O_1} + s^3_{O_1})/2$, $c_{O_2} = s^3_{O_2}$, and so on.

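$$g\big(s^{1:V}_{1:K}, \pi^{1:V}\big) = \sum_{v=1}^{|A|} \frac{\pi^v_{1:K}}{\sum_{u=1}^{|A|} \pi^u_{1:K}}\, s^v_{1:K}; \tag{7}$$

$$\mu_k = \sum_{v=1}^{|A|} \frac{\pi^v_k}{\sum_{u=1}^{|A|} \pi^u_k}\, \mu_k(x^v); \qquad \sigma_k^2 = \sum_{v=1}^{|A|} \left(\frac{\pi^v_k}{\sum_{u=1}^{|A|} \pi^u_k}\right)^{2} \sigma_k^2(x^v); \tag{8}$$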
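A minimal NumPy sketch of the aggregation in Eqns. 7 and 8 follows, assuming the slots have already been aligned across views. The array layout `(V, K, d)`, the function name `aggregate_content`, and the `eps` guard against slots that are inactive in every view are assumptions of this example.

```python
import numpy as np

def aggregate_content(mu, var, pi, eps=1e-8):
    """Aggregate aligned view-specific slot GMM parameters into a content posterior.

    mu, var: (V, K, d) slot means and diagonal variances per view.
    pi:      (V, K) mixing coefficients per view.
    """
    # Convex weights over views for each slot; inactive views get weight ~0
    w = pi / (pi.sum(axis=0, keepdims=True) + eps)
    mu_c = np.einsum("vk,vkd->kd", w, mu)         # weighted mean (Eqn. 8)
    var_c = np.einsum("vk,vkd->kd", w**2, var)    # squared weights on variances (Eqn. 8)
    pi_c = pi.mean(axis=0)                        # pi_k = sum_v pi_k^v / |A|
    return mu_c, var_c, pi_c
```

With the mixing coefficients from the intuition above, the first slot averages views 2 and 3 only, and the second slot takes its value from view 3 alone.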

Mixing function and training objective. We consider both additive and non-additive (ref. Definition E.1) mixing functions $f_d : \mathcal{C} \times \mathcal{V}^v \to \mathcal{X}^v$. For additive decoders, we use spatial-broadcasting (Greff et al., 2019) and MLP decoders, and for the non-additive mixing function, we use auto-regressive transformers (Vaswani et al., 2017). We use the shared decoder $f_d$ for all views and objects, modelling the conditional distribution $p(x^v \mid T_{\theta_v}(c), v^v)$. To train our model in an end-to-end fashion, we maximise the log-likelihood of the joint $p(x^A)$, which results in the evidence lower bound (ELBO) in Eqn. 9; see Lemma F.1. Here, we consider the distribution form of $p(x^v \mid c, v^v)$ to be Gaussian with a learnable mean and isotropic covariance.

$$\mathbb{E}\big[\log p(x \mid T_{\theta_v}(c), v)\big] - \mathrm{KL}\big(q(v \mid x)\, \|\, p(v)\big) \tag{9}$$

Computational complexity: VISA achieves $\mathcal{O}(VTNKd)$, with an added complexity of $2\,\mathcal{O}(VNd)$ for the inverse and forward viewpoint transformations given by $T_\theta$, while it retains a per-view complexity of $\mathcal{O}(TNKd)$, which is the same as slot attention and probabilistic slot attention. Additionally, the representation matching function contributes $\mathcal{O}(VK^3 d)$; this term does not alter the dominant term in the general case when $K \ll N$. Similar to PSA, when VISA is combined with an additive decoder, the complexity of the decoder can be lowered due to the property of automatic relevance determination (ARD), eliminating the need to decode inactive slots.
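As a rough sanity check on which term dominates, the leading-order counts can be compared directly. The helper below and the sizes used in the usage note are purely illustrative assumptions (constants are dropped and no values are taken from the experiments).

```python
def visa_cost_terms(V, T, N, K, d):
    """Leading-order operation counts (constants dropped) for the terms named above."""
    return {
        "slot_attention": V * T * N * K * d,  # O(V T N K d)
        "view_transforms": 2 * V * N * d,     # inverse and forward T_theta
        "matching": V * K**3 * d,             # O(V K^3 d)
    }
```

For hypothetical sizes such as `visa_cost_terms(V=3, T=3, N=1024, K=7, d=64)`, the slot-attention term is well over an order of magnitude larger than the matching term, consistent with the statement that matching does not change the dominant cost when $K \ll N$.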

5 Theoretical Analysis

In this section, we leverage the properties of the proposed model to theoretically demonstrate the learning of identifiable representations under challenging spatial ambiguities. In this work, we consider our data-generating process to satisfy a viewpoint sufficiency assumption (Assumption 5.1).

Assumption 5.1. (Viewpoint sufficiency) For any set $A \subseteq [V]$, we consider $A$ to be viewpoint sufficient iff $|\mathcal{O}^A| = |\mathcal{O}^e|$, where $\mathcal{O}^A = \bigcup_{v \in A} \mathcal{O}^v$. This means that every object in the environment is visible in at least one of the considered views in $A$, even when an individual view may not contain all the objects.

Example: Viewpoint-sufficiency

Example 1. Based on the illustrated example in Figure 3, the scene is a composition of four objects $\mathcal{O}^e = \{O_1, O_2, O_3, O_4\}$. The viewpoint subset $A = [V] = \{1, 2, 3\}$ is considered to be viewpoint sufficient since $\bigcup_{v \in A} \mathcal{O}^v = \{O_3, O_4\} \cup \{O_1, O_3, O_4\} \cup \{O_1, O_2, O_3, O_4\} = \mathcal{O}^e$.
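The check in Example 1 can be written as a few lines of plain Python. The function name and the dictionary layout are hypothetical. Because the union of visible objects is always a subset of the environment, testing set equality is equivalent to the cardinality condition in Assumption 5.1.

```python
def is_viewpoint_sufficient(view_objects, subset, environment):
    """Return True if the views in `subset` jointly show every object in `environment`."""
    visible = set().union(*(view_objects[v] for v in subset))
    return visible == set(environment)


# Visible-object sets for the three views in Fig. 3
view_objects = {1: {"O3", "O4"}, 2: {"O1", "O3", "O4"}, 3: {"O1", "O2", "O3", "O4"}}
environment = {"O1", "O2", "O3", "O4"}
```

Under this layout, `{1, 2, 3}`, `{1, 3}`, and `{2, 3}` are viewpoint sufficient, whereas `{1, 2}` is not, because $O_2$ is only visible from the third view.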

Given that we learn the parameters of our view-specific spatial GMM with closed-form updates, we do not use an explicit prior minimising KL divergence. Instead, we rely on the fact that marginalising the effect of data points from the posterior (the aggregate posterior) is an optimal prior (Hoffman & Johnson, 2016; Kori et al., 2024), resulting in $p(c) = \iint q(c \mid s^A, x^A)\, \mathrm{d}s^A\, \mathrm{d}x^A$. Given that GMMs are universal density approximators given enough components (even GMMs with diagonal covariances), the resulting aggregate posterior $q(c) = p(c)$ is highly flexible and multi-modal. It often suffices to approximate it using a sufficiently large subset of the dataset if marginalising out the entire dataset becomes computationally restrictive.

Lemma 5.2 (Optimal Prior). For $A \subseteq [V]$, given a local content distribution $q(c_{1:K} \mid s^A_{1:K}, x^A)$ (per scene $x^A \in \{x^A_i\}_{i=1}^{M}$), which can be expressed as a GMM with $K$ components, the aggregate posterior $q(c)$ obtained by marginalizing out $x, s$ is a non-degenerate global Gaussian mixture with $MK$ components:

$$p(c) = q(c) = \frac{1}{M} \sum_{i=1}^{M} \sum_{k=1}^{K} \hat{\pi}_{ik}\, \mathcal{N}\big(c; \hat{\mu}_{ik}, \hat{\sigma}^2_{ik}\big). \tag{10}$$

Proof Sketch. The result is obtained by integrating the product of involved posterior densities q(c | s)q(s | x)p(x). Further, we verify if the mixing coefficients sum to one in the new mixture, proving the aggregate to be well-defined.
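The construction in Lemma 5.2 can be sketched numerically: stacking $M$ per-scene GMMs with $K$ components each and scaling the weights by $1/M$ yields a single $MK$-component mixture whose weights again sum to one. The function names and array shapes below are assumptions of this illustration.

```python
import numpy as np

def aggregate_posterior(pis, mus, variances):
    """Flatten M per-scene GMMs into one MK-component mixture (Eqn. 10).

    pis: (M, K) per-scene mixing coefficients, each row summing to one.
    mus, variances: (M, K, d) per-scene means and diagonal variances.
    """
    M, K = pis.shape
    weights = (pis / M).reshape(M * K)
    return weights, mus.reshape(M * K, -1), variances.reshape(M * K, -1)

def gmm_log_density(c, weights, means, variances):
    """Log density of a diagonal-covariance GMM at a single point c of shape (d,)."""
    diff = c[None, :] - means
    log_comp = -0.5 * np.sum(diff**2 / variances + np.log(2 * np.pi * variances), axis=-1)
    m = log_comp.max()
    return m + np.log(np.sum(weights * np.exp(log_comp - m)))
```

Checking that `weights.sum()` equals one mirrors the step in the proof sketch that verifies the aggregate mixture is well-defined.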

With this, we show three main results: firstly, we show that aggregate content representations ($c$) are identifiable without supervision (up to $\sim_s$). Secondly, we show that these representations are invariant to the choice of viewpoints under Assumption 5.1. Finally, we show that the model exhibits approximate view equivariance.

Theorem 5.3. (Affine Equivalence) For any subset $A \subseteq [V]$ such that $|A| > 0$, given a set of images $x^A \in \mathcal{X}^A$, a corresponding aggregate content $c \in \mathcal{C}$, and a non-degenerate content posterior $q(c \mid s^A)$, considering two mixing functions $f_d, \tilde{f}_d$ satisfying Assumption F.4 with a shared image, $c$ is identifiable up to $\sim_s$ equivalence.

Intuition: Affine equivalence

Considering Example 1 with two perfectly trained models $f_d$ and $\tilde{f}_d$, the resulting aggregate contents are described as $c = f_d^{-1}(x^A; v^A) = \{c_{O_1}, c_{O_2}, c_{O_3}, c_{O_4}, c_{O_b}\}$ and $\tilde{c} = \tilde{f}_d^{-1}(x^A; v^A) = \{\tilde{c}_{O_2}, \tilde{c}_{O_4}, \tilde{c}_{O_3}, \tilde{c}_{O_1}, \tilde{c}_{O_b}\}$ for $A = [V] = \{1, 2, 3\}$. $\sim_s$ equivalence states that there exists a permutation matrix $P$ which aligns the object order in $\tilde{c}$ to match that of $c$, and that there exists an invertible affine mapping $A$ such that $c_{O_k} = A\tilde{c}_{O_k}\ \forall k \in \{1, 2, 3, 4\}$.

Proof Sketch. To prove this result, we follow multiple steps as described below: (i) we demonstrate that the distribution $p(c)$ obtained as a result of Lemma 5.2 is non-degenerate and a valid distribution; (ii) with the above results, we demonstrate invertibility restrictions on mixing functions; (iii) finally, we constrain the subspace to affine, demonstrating $\sim_s$ equivalence of the aggregate content $c$.

Theorem 5.4. (Invariance of aggregate content) For any subsets $A, B \subseteq [V]$ such that $|A| > 0$, $|B| > 0$, and both $A, B$ satisfy Assumption 5.1, we consider aggregate content to be invariant if $f_A \sim_s f_B$ for data $\mathcal{X}^A$ and $\mathcal{X}^B$.

Intuition: Invariant slots

Considering Example 1 with $A = \{1, 3\}$ and $B = \{2, 3\}$, such that sets $A, B$ are viewpoint sufficient, let $f_A$ and $f_B$ be models trained on $\mathcal{X}^A$ and $\mathcal{X}^B$, respectively. This results in $c = f_A^{-1}(x^A; v^A) = \{c_{O_1}, c_{O_2}, c_{O_3}, c_{O_4}, c_{O_b}\}$ and $\tilde{c} = f_B^{-1}(x^B; v^B) = \{\tilde{c}_{O_2}, \tilde{c}_{O_4}, \tilde{c}_{O_3}, \tilde{c}_{O_1}, \tilde{c}_{O_b}\}$. Thm. 5.4 states that the representations $T_{\theta_B}^{-1}(\tilde{c}_{O_k})$ can be mapped to $T_{\theta_A}^{-1}(c_{O_k})$ by permuting object indices and applying an affine transformation.

Proof Sketch. To prove this, we extend the proof of Thm. 5.3 and establish that there exist two invertible affine functions $h_A, h_B$ for mixing functions $f_A, f_B : \mathcal{C} \times \mathcal{V} \to \mathcal{X}$ that map representations $c$ with a given view set $v^A$ to observations $x^A$. We then show that, in the case of invariance, an affine mapping exists from $h_A$ to $h_B$.

Theorem 5.5. (Approximate representational equivariance) For a given aggregate content $c$ and any two views $v, \tilde{v} \sim p^A(v)$, resulting in respective scenes $x \sim p^A(x \mid v, c)$ and $\tilde{x} \sim p^A(x \mid \tilde{v}, c)$, for any homeomorphic transformation $h_x \in \mathcal{H}_x$ such that $h_x(x) = \tilde{x}$, there exists another homeomorphic transformation $h_v \in \mathcal{H}_v$ such that $\mathcal{H}_v \subseteq \mathcal{H}_x \subseteq \mathbb{R}^{\dim(x)}$ and $v = h_v^{-1}\big(f_d^{-1}(h_x(x); c)\big)$.

Remark 5.6. Note that the theorem only says that the transformation function transforming the view representations v as an effect of the homeomorphic transformation of x lies in the same subspace of input transformations.

Intuition: Approximate equivariance

Consider a scenario where the cameras are positioned such that they have overlapping fields of view and their relative pose (rotation and translation) avoids degeneracies such as aligning on the same plane or mapping points to infinity. This results in the transformation between views being smooth, invertible, and consistent. If the scene is planar or depth variations are minimal, the homography can capture the transformation accurately without the need for inverse rendering. Notably, the cameras should have non-zero rotation and translation to avoid collapsing the scene, and their intrinsic parameters must be known or identical to prevent distortions. When the scenario satisfies all the above properties, the 2D homography transformation $H$ between two camera views can be learned as a homeomorphic transformation (Hartley & Zisserman, 2003).

Proof Sketch. We prove this result by following the steps in Thm. 5.4 over a view distribution $p(v)$, but for a fixed content vector $c$.


6 Empirical Evaluation

Given the work's theoretical focus, experimentally, we aim to provide strong empirical evidence of our identifiability, invariance, and equivariance claims in a multi-view setting. We also extend our experiments to standard imaging benchmarks, including CLEVR-MV, CLEVR-AUG, and GQN (Li et al., 2020); we additionally demonstrate the framework's scalability to highly diverse settings with GSO (Downs et al., 2022) and the proposed datasets MV-MOVIC and MV-MOVID, which are multi-view versions of the MOVi-C dataset with fixed and varying scene-specific cameras (Greff et al., 2022).

Experimental setup. To verify (i) the identifiability claim, we train our model on a given view subset $A \subseteq [V]$ and compare the view-averaged slot mean correlation coefficient (SMCC) measure as defined by Kori et al. (2024) ($\mathrm{SMCC}(s, \tilde{s}) := \frac{1}{Kd} \sum_{i=1}^{K} \sum_{j=1}^{d} \rho(s_{ij}, A\tilde{s}_{\tau(i)j})$ for some permutation map $\tau$ and affine transformation $A$). For (ii) the invariance claim, we train multiple models on different subsets of viewpoints $A, B \subseteq [V]$ and compare the aggregate content representations across models, quantifying the similarities with SMCC; we refer to this measure as invariant SMCC (INV-SMCC). Finally, for (iii) subspace equivariance, we consider a trained model with a view subset $A \subseteq [V]$ and compute the MCC of the view information $v$ by applying random homeomorphic transformations on samples $x^A \in \mathcal{X}^A$ (which can also be done by considering samples $x^B \in \mathcal{X}^B$, where the cameras' relative positions satisfy the constraints required by Thm. 5.5, and analysing $p(v^A)$ and $p(v^B)$).
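An SMCC-style score can be sketched as follows. This is an illustrative approximation and not the evaluation code of Kori et al. (2024): the affine map is fitted in-sample by least squares for each slot pair, the permutation $\tau$ is found by brute force over all $K!$ orderings, and the function names are hypothetical.

```python
import itertools
import numpy as np

def affine_fit_corr(target, source):
    """Fit target ~= [source, 1] @ W by least squares; return mean per-dim Pearson correlation."""
    X = np.hstack([source, np.ones((source.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X, target, rcond=None)
    pred = X @ W
    corrs = [np.corrcoef(target[:, j], pred[:, j])[0, 1] for j in range(target.shape[1])]
    return float(np.mean(corrs))

def smcc(s, s_hat):
    """SMCC-style score between slot sets s and s_hat of shape (M, K, d) over M scenes."""
    K = s.shape[1]
    # score[i, k]: agreement between slot i of s and slot k of s_hat after an affine fit
    score = np.array([[affine_fit_corr(s[:, i], s_hat[:, k]) for k in range(K)]
                      for i in range(K)])
    # Best permutation tau, brute-forced (small K only)
    return max(float(np.mean([score[i, p[i]] for i in range(K)]))
               for p in itertools.permutations(range(K)))
```

When `s_hat` is an exact permuted affine image of `s`, the score is close to one, which is the behaviour expected of $\sim_s$-equivalent representations; for unrelated representations it stays far lower.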

Models & baselines. We consider two ablations with two types of decoders: (i) additive, with MLPs and spatial broadcasting CNNs, and (ii) non-additive decoders, which include transformer models. In all cases, we use LeakyReLU activations to satisfy the weak injectivity conditions (Assumption F.4). In terms of object-centric learning baselines, we compare with standard additive autoencoder setups following (Brady et al., 2023), slot attention (SA) (Locatello et al., 2020b), probabilistic slot attention (PSA) (Kori et al., 2024), MULMON (Li et al., 2020), and OCLOC (Yuan et al., 2024).

Architectures: As detailed in the paper we use two different classes of decoder architectures: (i) additive and (ii) non-additive; within the additive architecture we use both spatial broadcasting and MLP decoders, for the non-additive architecture we use transformer decoders. Concretely, we follow SA (Locatello et al., 2020b) for spatial broadcasting decoders and DINOSAUR (Seitzer et al., 2022) for both MLP and transformer decoder. In detail we use:

1. Spatial broadcasting decoders. Input/Output: the generated slots are $s \in \mathbb{R}^{K \times d}$; each slot representation is broadcast onto a 2D grid of dimension $8 \times 8 \times d$ and augmented with position embeddings. Similar to slot attention, each such grid is decoded using a shared CNN to produce an output of size $W \times H \times 4$, where $W$ and $H$ are the width and height of the image, respectively. The output channels encode RGB color channels and an (unnormalized) alpha mask. We then normalize the alpha masks with a softmax and form convex combinations to obtain the reconstruction. Shared CNN architecture: $3 \times$ [Conv(kernel = $5 \times 5$, stride = 2), LeakyReLU(0.02)] + Conv(kernel = $3 \times 3$, stride = 1), LeakyReLU(0.02).

2. MLP decoders. Input/Output: similar to the spatial broadcasting decoder, each slot representation is broadcast onto $N$ tokens (resulting in $N \times d$) and augmented with position embeddings. Each slot representation is then transformed with a shared MLP decoder to generate a representation matching the feature dimension, along with an additional alpha mask, which is normalised with a softmax and used to form convex combinations to obtain the reconstruction. Shared MLP architecture: [Linear($d$, $d$, bias = False), LayerNorm($d$)] + $3 \times$ [Linear($d$, $d_{hidden}$), LeakyReLU(0.02)] + Linear($d_{hidden}$, $d + 1$).

3. Transformer decoders. Input/Output: the transformer takes the linearly transformed encoder output ($N \times d$) and the extracted slots ($K \times d$) as input, and returns the slot-conditioned features as output with dimension $N \times d$. Transformer architecture: four transformer blocks, where each block consists of self-attention over the input tokens, cross-attention with the set of slots, and a residual two-layer MLP with hidden size $4d$. Before the transformer blocks, both the initial input and the slots are linearly transformed to dimension $d$, followed by a layer norm.
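The spatial broadcasting and MLP decoders above both normalise per-slot alpha masks with a softmax over slots and then form a convex combination of the per-slot outputs. A minimal pure-Python sketch of that compositing step is given below. It is illustrative only: the names `softmax` and `composite` and the nested-list layout `[slot][pixel][channel]` are assumptions made for this example, not the authors' code.

```python
from math import exp


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def composite(slot_outputs):
    """Combine per-slot decoder outputs into one reconstruction.

    slot_outputs[k][p] = [R, G, B, alpha_logit] for slot k and flattened
    pixel p (hypothetical layout). Returns (image, masks) where image[p] is
    the RGB reconstruction and masks[k][p] is the normalised alpha of slot k.
    """
    K, P = len(slot_outputs), len(slot_outputs[0])
    image = []
    masks = [[0.0] * P for _ in range(K)]
    for p in range(P):
        # Softmax across slots turns alpha logits into convex weights.
        weights = softmax([slot_outputs[k][p][3] for k in range(K)])
        for k in range(K):
            masks[k][p] = weights[k]
        # Convex combination of the per-slot RGB predictions.
        image.append(
            [sum(weights[k] * slot_outputs[k][p][c] for k in range(K)) for c in range(3)]
        )
    return image, masks
```

Because the weights at each pixel sum to one, the reconstruction is additive in the slots, which is the structure assumed for the additive decoder class.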

CASE STUDY 1: ILLUSTRATION OF IDENTIFIABILITY. To demonstrate the validity of our identifiability claims (Thm. 5.3, Thm. 5.4, and Thm. 5.5), we created a synthetic unconfounded scenario for modelling. This provides us with two data modalities (Fig. 7): (i) projected point cloud data and (ii) the corresponding imagery data; we detail the point cloud illustrations in App. G.1. This dataset also provides the ground-truth object and viewpoint features for evaluation. To visualise the aggregate mixture, following Lemma F.2, we use the projected GMM to interpret the distribution of random variables in $\mathbb{R}^d$.

The data-generating process is explained in App. D.1. In Fig. 5, we display the marginalized aggregate content distribution q(c), comparing individual features and a mean feature across different runs. Across runs, these distributions are either scaled, shifted, or split (an increase in the number of modes), which reflects an affine transformation of the features. To quantify this, we


[Figure 5 plots: five columns, Run #0 to Run #4. Top row: per-dimension feature distributions (Dimension-0 to Dimension-6). Bottom row: combined feature distribution for each run.]

Figure 5. Identifiability of q(c). The top row shows the individual feature distributions across five different runs. The bottom row shows the mean feature distribution, which we use as a proxy for multi-dimensional features given Lemma F.2. As observed, the mean feature distribution across runs is either scaled, shifted, or split (an increase in the number of modes); this provides strong evidence of recovery of the latent space up to affine transformations, empirically supporting our claims in Thm. 5.3.

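Table 1. Comparing identifiability of q(s), q(c), and p(v) scores w.r.t. existing OCL methods.

| Method | CLEVR-MV SMCC | CLEVR-MV INV-SMCC | CLEVR-MV MCC | GQN SMCC | GQN INV-SMCC | GQN MCC | GSO SMCC | GSO INV-SMCC | GSO MCC |
|---|---|---|---|---|---|---|---|---|---|
| AE | 0.32 ± 0.02 | - | - | 0.29 ± 0.02 | - | - | 0.24 ± 0.08 | - | - |
| SA | 0.47 ± 0.03 | - | - | 0.38 ± 0.02 | - | - | 0.28 ± 0.06 | - | - |
| PSA | 0.49 ± 0.02 | - | - | 0.38 ± 0.02 | - | - | 0.30 ± 0.04 | - | - |
| MulMON | 0.61 ± 0.03 | 0.62 ± 0.02 | - | 0.59 ± 0.06 | 0.61 ± 0.02 | - | 0.56 ± 0.04 | 0.48 ± 0.06 | - |
| OCLOC | 0.63 ± 0.02 | 0.64 ± 0.01 | 0.48 ± 0.04 | 0.60 ± 0.03 | 0.60 ± 0.01 | 0.42 ± 0.08 | 0.58 ± 0.04 | 0.54 ± 0.03 | 0.46 ± 0.04 |
| VISA | 0.67 ± 0.01 | 0.66 ± 0.01 | 0.60 ± 0.04 | 0.59 ± 0.01 | 0.63 ± 0.01 | 0.52 ± 0.03 | 0.60 ± 0.03 | 0.61 ± 0.02 | 0.58 ± 0.03 |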

Table 2. Identifiability and generalisability analysis on MV-MOVIC dataset. ID: in-domain results; OOD: out-of-domain results.

| Method | ID mBO | ID SMCC | ID INV-SMCC | ID MCC | OOD mBO | OOD SMCC | OOD INV-SMCC | OOD MCC |
|---|---|---|---|---|---|---|---|---|
| SA-MLP | 0.28 ± 0.091 | 0.36 ± 0.004 | - | - | 0.26 ± 0.08 | 0.38 ± 0.006 | - | - |
| PSA-MLP | 0.30 ± 0.022 | 0.38 ± 0.002 | - | - | 0.30 ± 0.03 | 0.40 ± 0.005 | - | - |
| VISA-MLP | 0.28 ± 0.021 | 0.52 ± 0.021 | 0.61 ± 0.023 | 0.54 ± 0.026 | 0.27 ± 0.02 | 0.51 ± 0.029 | 0.58 ± 0.031 | 0.52 ± 0.021 |
| SA-TRANSFORMER | 0.34 ± 0.014 | 0.36 ± 0.016 | - | - | 0.33 ± 0.041 | 0.36 ± 0.043 | - | - |
| PSA-TRANSFORMER | 0.37 ± 0.021 | 0.38 ± 0.007 | - | - | 0.37 ± 0.033 | 0.39 ± 0.016 | - | - |
| VISA-TRANSFORMER | 0.38 ± 0.008 | 0.44 ± 0.003 | 0.46 ± 0.001 | 0.53 ± 0.011 | 0.36 ± 0.017 | 0.46 ± 0.033 | 0.46 ± 0.018 | 0.55 ± 0.082 |

[Figure 6 plots: three columns, Run #0 to Run #2. Top row: per-dimension feature distributions (Dimension-0 to Dimension-6). Bottom row: combined feature distribution for each run.]

Figure 6. Viewpoint invariance for q(c). The top and bottom rows show individual feature distributions and mean feature distributions, respectively. Each column shows the marginalised aggregate content distribution q(c) when trained with a different view pair: (blue, red), (green, blue), and (green, red), respectively. The resulting distributions vary only by an affine transformation across view pairs, providing strong evidence for Thm. 5.4.

computed SMCC and observed it to be 0.72 ± 0.04, empirically verifying Thm. 5.3. Furthermore, to illustrate the invariance of the distribution q(c) across viewpoints (Thm. 5.4), we consider three different viewpoints. We use all possible pairs to learn q(c) distributions, as illustrated in Fig. 6, where the distributions are described w.r.t. the viewpoint pairs {g, r}, {r, b}, and {g, b}, respectively. These distributions were also found to have similar properties as before, with an observed SMCC of 0.71 ± 0.11, further supporting the claims in Thm. 5.4. Additionally, Fig. 2 demonstrates the improvement in identifiability as the number of viewpoints increases.

CASE STUDY 2: IMAGING APPLICATIONS. We first evaluate the framework on standard benchmarks, specifically focusing on CLEVR-MV, CLEVR-AUG, GQN, and GSO with simple objects. Given that the true generative factors are unobserved, we derive our quantitative assessments from multiple runs. The results are shown in Table 1, confirming the validity of our theory on imaging datasets. For the baselines that use a single viewpoint, the INV-SMCC mirrors the SMCC by design (i.e., aggregating a set with a single element returns that element). Moreover, in the case of AE, SA, PSA, and MULMON, the models do not estimate view information but either treat views independently or use the observed view conditioning, rendering the MCC metric inapplicable. Fig. 12 shows how the number of viewpoints impacts the identifiability of the s, v, and c variables; the experiments show that performance increases with the number of views across all benchmark datasets.

Additionally, we demonstrate our methodology on the proposed complex datasets, MV-MOVIC and MV-MOVID; the latter enables us to examine how the model performs when Assumption 5.1 is not satisfied. To evaluate model behaviour in an environment with consistent objects but different viewpoints, we conducted in-domain and out-of-domain (OOD) evaluations. For the in-domain analysis, the model is trained and assessed on the same viewpoint group A = [1, 2, 3]. For the OOD evaluation, we take the previously trained model and test it against a new set of viewpoints B = [3, 4, 5]. The findings presented in Table 2 for the MV-MOVIC dataset reveal that the SMCC, INV-SMCC, and MCC metrics show similar performance across both domains. This indicates that the distributional characteristics remain unchanged when both the training and testing environments contain the same objects. The MV-MOVID dataset analysis can be found in App. G.

7 Conclusion & Discussion

Understanding when object-centric representations are both unambiguous and identifiable is essential for developing large-scale models with provable correctness guarantees. Unlike most existing work on identifiability, which largely focuses on single-view setups, we offer identifiability guarantees in multi-view scenarios. We use distributional assumptions for latent slot and view representations, drawing inspiration from mixture model-based structures. To achieve this, we propose a model that is viewpoint-agnostic and does not require additional view-conditioning information.

Our model specifically guarantees the identifiability of view-specific slot representations, viewpoint-invariant content representations, and view representations, all without the need for additional supervision (up to an equivalence relation). We visually validate our theoretical claims on an unconfounded synthetic dataset with illustrative 2D data plots. We then empirically demonstrate the model's identifiability properties on multiple object-centric benchmarks, highlighting its ability to resolve view ambiguities in imaging applications. Furthermore, we showcase the scalability of our approach using realistic large-scale datasets and transformer decoders, demonstrating its capacity to scale with both data volume and decoder complexity.

Limitations & future work. We recognize that our assumptions, particularly regarding the viewpoint sufficiency, are strong and may not always hold in practice. However, we did not observe limiting effects of this assumption on the proposed MV-MOVID dataset. A more extensive analysis of this assumption and its implications in real-world applications is left for future work. We would also highlight that the weak injectivity of the mixing function may not always hold for different types of architectures. While generally applicable, the piecewise-affine functions we use may not always capture valid assumptions for real-world problems, e.g., when the model is misspecified. Nevertheless, to the best of our knowledge, our theoretical results on multi-object, multi-view identifiability are unique and capture key concepts in object-centric representation learning, opening various new avenues for future research along the lines of generalisability, world-modelling, and planning.

Impact Statement

This paper proposes a view invariant slot attention algorithm, addressing spatial ambiguities with identifiability guarantees. The work extends theoretical advancements in the field of OCL, and as such, it has little immediate societal or ethical consequences. Our method might be a step towards interpretable, equivariant, and aligned models, which are desired properties of trustworthy AI.

Acknowledgements

A. Kori is supported by UKRI (grant number EP/S023356/1), as part of the UKRI Centre for Doctoral Training in Safe and Trusted AI. B. Glocker received support from the Royal Academy of Engineering as part of his Kheiron/RAEng Research Chair, and acknowledges the support of the UKRI AI programme, and the EPSRC, for CHAI - EPSRC Causality in Healthcare AI Hub (grant no. EP/Y028856/1). Additionally, we thank Fabio De Sousa Ribeiro for insightful discussions and providing feedback on the initial draft of the paper.

References

Ahuja, K., Wang, Y., Mahajan, D., and Bengio, Y. Interventional causal representation learning. arXiv preprint arXiv:2209.11924, 2022.

Arsalan Soltani, A., Huang, H., Wu, J., Kulkarni, T. D., and Tenenbaum, J. B. Synthesizing 3d shapes via modeling multi-view depth maps and silhouettes with deep generative networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1511–1519, 2017.

Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.

Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.

Brady, J., Zimmermann, R. S., Sharma, Y., Schölkopf, B., von Kügelgen, J., and Brendel, W. Provably learning object-centric representations. arXiv preprint arXiv:2305.14229, 2023.

Brehmer, J., De Haan, P., Lippe, P., and Cohen, T. S. Weakly supervised causal representation learning. Advances in Neural Information Processing Systems, 35:38319–38331, 2022.

Buchholz, S., Besserve, M., and Schölkopf, B. Function classes for identifiable nonlinear independent component analysis. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022.

Burgess, C. P., Matthey, L., Watters, N., Kabra, R., Higgins, I., Botvinick, M., and Lerchner, A. Monet: Unsupervised scene decomposition and representation. arXiv preprint arXiv:1901.11390, 2019.

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 9650–9660, 2021.

Chang, M., Griffiths, T. L., and Levine, S. Object representations as fixed points: Training iterative refinement algorithms with implicit differentiation. arXiv preprint arXiv:2207.00787, 2022.

Chen, C., Deng, F., and Ahn, S. Roots: Object-centric representation and rendering of 3d scenes. Journal of Machine Learning Research, 22(259):1–36, 2021.

Crawford, E. and Pineau, J. Exploiting spatial invariance for scalable unsupervised object tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 3684–3692, 2020.

Daunhawer, I., Bizeul, A., Palumbo, E., Marx, A., and Vogt, J. E. Identifiability results for multimodal contrastive learning. arXiv preprint arXiv:2303.09166, 2023.

Dilokthanakul, N., Mediano, P. A., Garnelo, M., Lee, M. C., Salimbeni, H., Arulkumaran, K., and Shanahan, M. Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016.

Downs, L., Francis, A., Koenig, N., Kinman, B., Hickman, R., Reymann, K., McHugh, T. B., and Vanhoucke, V. Google scanned objects: A high-quality dataset of 3d scanned household items. In 2022 International Conference on Robotics and Automation (ICRA), pp. 2553–2560. IEEE, 2022.

Eastwood, C. and Williams, C. K. I. A framework for the quantitative evaluation of disentangled representations. In International Conference on Learning Representations, 2018.

Emami, P., He, P., Ranka, S., and Rangarajan, A. Slot order matters for compositional scene understanding. arXiv preprint arXiv:2206.01370, 2022.

Engelcke, M., Kosiorek, A. R., Jones, O. P., and Posner, I. Genesis: Generative scene inference and sampling with object-centric latent representations. arXiv preprint arXiv:1907.13052, 2019.

Engelcke, M., Parker Jones, O., and Posner, I. Genesis-v2: Inferring unordered object representations without iterative refinement. Advances in Neural Information Processing Systems, 34:8085–8094, 2021.

Eslami, S. A., Jimenez Rezende, D., Besse, F., Viola, F., Morcos, A. S., Garnelo, M., Ruderman, A., Rusu, A. A., Danihelka, I., Gregor, K., et al. Neural scene representation and rendering. Science, 360(6394):1204–1210, 2018.

Gerstenberg, T., Goodman, N. D., Lagnado, D. A., and Tenenbaum, J. B. A counterfactual simulation model of causal judgments for physical events. Psychological review, 128(5):936, 2021.

Gopnik, A., Glymour, C., Sobel, D. M., Schulz, L. E., Kushnir, T., and Danks, D. A theory of causal learning in children: causal maps and bayes nets. Psychological review, 111(1):3, 2004.

Greff, K., Kaufman, R. L., Kabra, R., Watters, N., Burgess, C., Zoran, D., Matthey, L., Botvinick, M., and Lerchner, A. Multi-object representation learning with iterative variational inference. In International Conference on Machine Learning, pp. 2424–2433. PMLR, 2019.

Greff, K., Van Steenkiste, S., and Schmidhuber, J. On the binding problem in artificial neural networks. arXiv preprint arXiv:2012.05208, 2020.

Greff, K., Belletti, F., Beyer, L., Doersch, C., Du, Y., Duckworth, D., Fleet, D. J., Gnanapragasam, D., Golemo, F., Herrmann, C., et al. Kubric: A scalable dataset generator. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3749–3761, 2022.

Gresele, L., Rubenstein, P. K., Mehrjou, A., Locatello, F., and Schölkopf, B. The incomplete rosetta stone problem: Identifiability results for multi-view nonlinear ica. In Uncertainty in Artificial Intelligence, pp. 217–227. PMLR, 2020.

Hartley, R. and Zisserman, A. Multiple view geometry in computer vision. Cambridge university press, 2003.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.

Hinton, G. Some demonstrations of the effects of structural descriptions in mental imagery. Cognitive Science, 3(3):231–250, 1979.

Hinton, G. How to represent part-whole hierarchies in a neural network. Neural Computation, pp. 1–40, 2022.

Hoffman, M. D. and Johnson, M. J. Elbo surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS, volume 1, 2016.

Hsieh, J.-T., Liu, B., Huang, D.-A., Fei-Fei, L. F., and Niebles, J. C. Learning to decompose and disentangle representations for video prediction. Advances in neural information processing systems, 31, 2018.

Hyvärinen, A. and Pajunen, P. Nonlinear independent component analysis: Existence and uniqueness results. Neural networks, 12(3):429–439, 1999.

Hyvarinen, A., Sasaki, H., and Turner, R. Nonlinear ica using auxiliary variables and generalized contrastive learning. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89, pp. 859–868. PMLR, 2019.

Jaderberg, M., Simonyan, K., Zisserman, A., et al. Spatial transformer networks. Advances in neural information processing systems, 28, 2015.

Khemakhem, I., Kingma, D., Monti, R., and Hyvarinen, A. Variational autoencoders and nonlinear ica: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pp. 2207–2217. PMLR, 2020a.

Khemakhem, I., Monti, R., Kingma, D., and Hyvarinen, A. Ice-beem: Identifiable conditional energy-based deep models based on nonlinear ica. In Advances in Neural Information Processing Systems, volume 33, 2020b. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/962e56a8a0b0420d87272a682bfd1e53-Paper.pdf.

Khemakhem, I., Monti, R., Kingma, D., and Hyvarinen, A. Ice-beem: Identifiable conditional energy-based deep models based on nonlinear ica. Advances in Neural Information Processing Systems, 33:12768–12778, 2020c.

Kim, H. and Mnih, A. Disentangling by factorising. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, 2018.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Kipf, T., Elsayed, G. F., Mahendran, A., Stone, A., Sabour, S., Heigold, G., Jonschkowski, R., Dosovitskiy, A., and Greff, K. Conditional object-centric learning from video. arXiv preprint arXiv:2111.12594, 2021.

Kivva, B., Rajendran, G., Ravikumar, P., and Aragam, B. Identifiability of deep generative models without auxiliary information. Advances in Neural Information Processing Systems, 35:15687–15701, 2022.

Kori, A., Locatello, F., Ribeiro, F. D. S., Toni, F., and Glocker, B. Grounded object centric learning. arXiv preprint arXiv:2307.09437, 2023.

Kori, A., Locatello, F., Toni, F., Glocker, B., and Ribeiro, F. D. S. Identifiable object centric representations via probabilistic slot attention. arXiv preprint arXiv:2307.09437, 2024.

Kosiorek, A., Kim, H., Teh, Y. W., and Posner, I. Sequential attend, infer, repeat: Generative modelling of moving objects. Advances in Neural Information Processing Systems, 31, 2018.

Lachapelle, S., Mahajan, D., Mitliagkas, I., and Lacoste-Julien, S. Additive decoders for latent variables identification and cartesian-product extrapolation. arXiv preprint arXiv:2307.02598, 2023.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. Building machines that learn and think like people. Behavioral and brain sciences, 40:e253, 2017.


Li, N., Eastwood, C., and Fisher, R. Learning object-centric representations of multi-object scenes from multiple views. Advances in Neural Information Processing Systems, 33:5656–5666, 2020.

Li, N., Raza, M. A., Hu, W., Sun, Z., and Fisher, R. Object-centric representation learning with generative spatial-temporal factorization. Advances in neural information processing systems, 34:10772–10783, 2021.

Lin, Z., Wu, Y.-F., Peri, S. V., Sun, W., Singh, G., Deng, F., Jiang, J., and Ahn, S. Space: Unsupervised object-oriented scene representation via spatial attention and decomposition. arXiv preprint arXiv:2001.02407, 2020.

Liu, Y., Jia, B., Chen, Y., and Huang, S. Slotlifter: Slot-guided feature lifting for learning object-centric radiance fields. In European Conference on Computer Vision, pp. 270–288. Springer, 2025.

Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., and Bachem, O. Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, pp. 4114–4124. PMLR, 2019.

Locatello, F., Poole, B., Rätsch, G., Schölkopf, B., Bachem, O., and Tschannen, M. Weakly-supervised disentanglement without compromises. In International Conference on Machine Learning, pp. 6348–6359. PMLR, 2020a.

Locatello, F., Weissenborn, D., Unterthiner, T., Mahendran, A., Heigold, G., Uszkoreit, J., Dosovitskiy, A., and Kipf, T. Object-centric learning with slot attention. Advances in Neural Information Processing Systems, 33:11525–11538, 2020b.

Löwe, S., Lippe, P., Locatello, F., and Welling, M. Rotating features for object discovery. Advances in Neural Information Processing Systems, 36, 2024.

Luo, R., Yu, H.-X., and Wu, J. Unsupervised discovery of object-centric neural fields. arXiv preprint arXiv:2402.07376, 2024.

Mansouri, A., Hartford, J., Zhang, Y., and Bengio, Y. Object-centric architectures enable efficient causal representation learning. arXiv preprint arXiv:2310.19054, 2023.

Marcus, G. F. The algebraic mind: Integrating connectionism and cognitive science. MIT press, 2003.

Marino, J., Yue, Y., and Mandt, S. Iterative amortized inference. In International Conference on Machine Learning, pp. 3403–3412. PMLR, 2018.

Mathieu, E., Rainforth, T., Siddharth, N., and Teh, Y. W. Disentangling disentanglement in variational autoencoders. In Proceedings of the 36th International Conference on Machine Learning, 2019.

Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., and Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

Schölkopf, B. and von Kügelgen, J. From statistical to causal learning. Proceedings of the International Congress of Mathematicians, 2022.

Seitzer, M., Horn, M., Zadaianchuk, A., Zietlow, D., Xiao, T., Simon-Gabriel, C.-J., He, T., Zhang, Z., Schölkopf, B., Brox, T., et al. Bridging the gap to real-world object-centric learning. arXiv preprint arXiv:2209.14860, 2022.

Singh, G., Deng, F., and Ahn, S. Illiterate dall-e learns to compose. arXiv preprint arXiv:2110.11405, 2021.

Singh, G., Kim, Y., and Ahn, S. Neural block-slot representations. arXiv preprint arXiv:2211.01177, 2022.

Tenenbaum, J. B., Kemp, C., Griffiths, T. L., and Goodman, N. D. How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022):1279–1285, 2011.

Tobin, J., Zaremba, W., and Abbeel, P. Geometry-aware neural rendering. Advances in Neural Information Processing Systems, 32, 2019.

Van Steenkiste, S., Chang, M., Greff, K., and Schmidhuber, J. Relational neural expectation maximization: Unsupervised discovery of objects and their interactions. arXiv preprint arXiv:1802.10353, 2018.

Van Steenkiste, S., Kurach, K., Schmidhuber, J., and Gelly, S. Investigating object compositionality in generative adversarial networks. Neural Networks, 130:309–325, 2020.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017.

Von Kügelgen, J., Sharma, Y., Gresele, L., Brendel, W., Schölkopf, B., Besserve, M., and Locatello, F. Self-supervised learning with data augmentations provably isolates content from style. Advances in neural information processing systems, 34:16451–16467, 2021.

Wang, Y., Liu, L., and Dauwels, J. Slot-vae: Object-centric scene generation with slot attention. arXiv preprint arXiv:2306.06997, 2023.

Willetts, M. and Paige, B. I don't need u: Identifiable non-linear ica without side information. arXiv preprint arXiv:2106.05238, 2021.


Wu, J., Zhang, C., Xue, T., Freeman, B., and Tenenbaum, J. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. Advances in neural information processing systems, 29, 2016.

Yang, X., Wang, Y., Sun, J., Zhang, X., Zhang, S., Li, Z., and Yan, J. Nonlinear ICA using volume-preserving transformations. In International Conference on Learning Representations, 2022.

Yuan, J., Chen, T., Shen, Z., Li, B., and Xue, X. Unsupervised object-centric learning from multiple unspecified viewpoints. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

Zimmermann, R. S., Sharma, Y., Schneider, S., Bethge, M., and Brendel, W. Contrastive learning inverts the data generating process. In International Conference on Machine Learning, pp. 12979–12990. PMLR, 2021.


A Notations

$O_v$ : Abstract object set as observed from viewpoint $v$.

$[V] = \{1, \dots, V\}$ : Exhaustive set of viewpoints, representing all possible views.

$A, B \subseteq [V]$ : Subsets of viewpoints, used for training.

$\mathcal{X} = \prod_{v \in A} \mathcal{X}_v$ : Data space, formed by the Cartesian product of data spaces for each view in subset $A$.

$x_A = \{x_v : v \in A\} \in \mathcal{X}$ : Data sample, where $x_v$ is the data from view $v$, and $x_A$ represents the set of data across all views in $A$.

$f_e$ : Encoder model, which maps input data to a latent space or feature representation.

$z$ : Spatial latent features, representing inferred spatial properties from the data.

$\mathcal{S}$ : View-specific slot space, a space for features that are tied to particular viewpoints.

$\mathcal{C}$ : View-invariant content space, representing features that are constant across different viewpoints.

$s \in \mathcal{S}$ : Samples from the view-specific slot space, representing view-dependent latent features.

$c \in \mathcal{C}$ : Samples from the view-invariant content space, representing features that remain consistent across views.

$f_s, \hat{f}_s$ : Slot attention module, responsible for attending to and disentangling different parts of the input related to different views.

$f_d, \hat{f}_d$ : Mixing function, which combines view-specific and view-invariant features into a unified representation.

$\mathcal{V}$ : View information space, a space that encodes information specific to each viewpoint (e.g., angle, position).

$v \in \mathcal{V}$ : A sample from the view information space representing a specific view or camera configuration.

$f_v, \hat{f}_v$ : View extractor function, which extracts viewpoint-related information from the data.

$\mu_c, \mu_s, \mu_v$ : Mean of invariant content, view-specific slots, and view distributions.

$\sigma_c, \sigma_s, \sigma_v$ : Standard deviation of invariant content, view-specific slots, and view distributions.

$\pi_c, \pi_s, \pi_v$ : Mixing coefficients of invariant content, view-specific slots, and view distributions.

$A_{nk}$ : Assignment confidence of a slot $k$ getting mapped to token $n$.

$P \in \mathcal{P} \subseteq \{0, 1\}^{K \times K}$ : Permutation matrix.

$m_s$ : Matching function, used to align object representations across views.

$\Delta^K$ : Simplex in the space of dimension $K$.

$\mathcal{H}_x, \mathcal{H}_v$ : Space of homeomorphic transformations.


B Extended Related Works

Identifiable representation learning. Learning meaningful representations from unlabeled data has long been a primary objective of deep learning (Bengio et al., 2013). Several approaches, such as those proposed by (Higgins et al., 2017; Kim & Mnih, 2018; Eastwood & Williams, 2018; Mathieu et al., 2019), relied on independence assumptions between latent variables to learn disentangled representations. However, (Hyvärinen & Pajunen, 1999; Locatello et al., 2019) demonstrated the provable impossibility of unsupervised methods for learning independent latent representations from i.i.d. data. This limitation has been tackled by restricting mixing functions to conformal maps (Buchholz et al., 2022) or volume-preserving transformations (Yang et al., 2022), by making additional data assumptions (Zimmermann et al., 2021; Locatello et al., 2020a; Brehmer et al., 2022; Ahuja et al., 2022; Von Kügelgen et al., 2021), or by imposing structure in the latent space as in nonlinear Independent Component Analysis (ICA) (Hyvarinen et al., 2019; Khemakhem et al., 2020a;b), resulting in identifiable models. In the context of nonlinear ICA, (Dilokthanakul et al., 2016) introduced a VAE model with a GMM prior, and (Willetts & Paige, 2021) empirically demonstrated the effectiveness of the GMM prior, which was later rigorously proven by (Kivva et al., 2022). (Kori et al., 2024) use this notion of a latent GMM in the context of OCL, achieving identifiability guarantees for object-centric representations. Here, we use this notion in the context of multi-view object-centric representations, tackling the issues of spatial ambiguities and uncertainties in bindings.

Multi-view Object-centric learning. Recent progress in multi-view object-centric learning has seen notable contributions from methods like MULMON (Li et al., 2020), ROOTS (Chen et al., 2021), SLOTLIFTER (Liu et al., 2025), and UOCF (Luo et al., 2024), each offering distinct approaches to compositional representation learning. However, these methods rely heavily on viewpoint annotations, which limits their applicability in fully unsupervised settings. MULMON refines object representations iteratively using annotated viewpoint-image pairs, while ROOTS, SLOTLIFTER, and UOCF estimate 3D object positions by performing an inverse rendering operation within a grid and project them into image space via viewpoint transformations. In contrast, we propose a fully unsupervised framework that does not need viewpoint annotations while providing approximate viewpoint equivariance for object representations.

Temporal Object-centric learning. An alternative approach to bypass the need for viewpoint annotations leverages temporal information. Methods for learning from single-viewpoint video sequences, such as Relational N-EM (Van Steenkiste et al., 2018), SQAIR (Kosiorek et al., 2018), SILOT (Crawford & Pineau, 2020), and SAVI (Kipf et al., 2021), focus on modeling object motion, interactions, and identity tracking across frames, even under occlusion. However, these methods assume fixed viewpoints, making them unsuitable for multi-view scenarios where objects appear in different spatial configurations. Additionally, object motion affects individual objects independently, whereas a viewpoint change influences the entire scene. Recent advances such as DYMON (Li et al., 2021) extend multi-view approaches like MULMON (Li et al., 2020) to dynamic scenes by disentangling object motion and viewpoint changes, assuming one dominates in adjacent frames. However, DYMON relies on viewpoint annotations, limiting its utility in unsupervised settings. Temporal methods such as SIMONE (Luo et al., 2024) address this by leveraging temporal coherence across multi-view videos, using spatial and temporal positional embeddings to disentangle object and viewpoint representations. Yet, SIMONE's reliance on temporal continuity restricts its generalizability to scenarios where such coherence is absent. In contrast, our framework does not assume temporal dependencies.

C Algorithm

Here we illustrate all the steps involved in the proposed method, VISA; refer to Algorithm 1.

D.1 Illustrative dataset

To visually illustrate the effectiveness of our theory, we experiment with a two-dimensional illustrative dataset. Similar to (Kori et al., 2024), we define a K = 5 component GMM with differing mean parameters µ = {µ1, . . . , µ5} and shared isotropic covariances, which we use to sample locations for an object. For a given location, we randomly select one object from {cube, cylinder, pyramid, sphere} and generate 1000 random points uniformly covering the surface of the selected shape. To create a single data point, we randomly select three of the five locations and place a randomly selected object at each location. To include multiple viewpoints, we consider V camera locations and project the objects, creating V different scenes. We then fill each projection using a convex hull operation, resulting in projected images as illustrated in Fig. 7. To maintain uniformity, we only use the imaging modality in the main paper, while also demonstrating point cloud illustrations here in the appendix. We use different colours to represent different objects in Fig. 8, ?? and used 10,000 data points in total to train our toy models. Unlike existing benchmark datasets, here we remove all the confounding effects caused by lighting and depth. This provides an ideal test bed to validate all our theoretical claims.

Algorithm 1 View Invariant Slot Attention (VISA)

1: Input: $A \subseteq [V]$, $z^A = \{f_e(x^v)\ \forall v \in A\} \in \mathbb{R}^{|A| \times N \times d}$ ▷ input representations
2: View: $v^A = \{v^v \sim \mathcal{N}(v^v; \mu(z^v), \sigma^2(z^v))\ \forall v \in A\} \in \mathbb{R}^{|A| \times d}$ ▷ view representations
3: View Transformation: $\theta^A = \{\theta^v = \mathrm{STN}(v^v)\ \forall v \in A\} \in \mathbb{R}^{|A| \times 2 \times 3}$ ▷ transformation parameters
4: $\mathrm{key}^A \leftarrow W_k T^{-1}_{\theta^v}(z^A) \in \mathbb{R}^{|A| \times N \times d}$, $\mathrm{value}^A \leftarrow W_v T^{-1}_{\theta^v}(z^A) \in \mathbb{R}^{|A| \times N \times d}$ ▷ optional value := key
5: $s \leftarrow \emptyset$; $\hat{\pi} \leftarrow \emptyset$
6: for $v \in A$ do
7: &nbsp;&nbsp;$\forall k$: $\pi(0)_k \leftarrow 1/K$, $\mu(0)_k \sim \mathcal{N}(0, I_d)$, $\sigma(0)^2_k \leftarrow 1_d$
8: &nbsp;&nbsp;for $t = 0, \dots, T-1$ do
9: &nbsp;&nbsp;&nbsp;&nbsp;$A_{nk} \leftarrow \dfrac{\pi(t)_k\, \mathcal{N}(\mathrm{key}_n; W_q \mu(t)_k, \sigma(t)^2_k)}{\sum_{j=1}^{K} \pi(t)_j\, \mathcal{N}(\mathrm{key}_n; W_q \mu(t)_j, \sigma(t)^2_j)}$ ▷ compute attention
10: &nbsp;&nbsp;&nbsp;&nbsp;$\hat{A}_{nk} \leftarrow A_{nk} / \sum_{l=1}^{N} A_{lk}$ ▷ normalize attention
11: &nbsp;&nbsp;&nbsp;&nbsp;$\mu(t+1)_k \leftarrow \sum_{n=1}^{N} \hat{A}_{nk} \cdot \mathrm{value}_n$ ▷ update slot mean
12: &nbsp;&nbsp;&nbsp;&nbsp;$\sigma(t+1)^2_k \leftarrow \sum_{n=1}^{N} \hat{A}_{nk} \cdot (\mathrm{value}_n - \mu(t+1)_k)^2$ ▷ update slot variance
13: &nbsp;&nbsp;&nbsp;&nbsp;$\pi(t+1)_k \leftarrow \frac{1}{N} \sum_{n=1}^{N} A_{nk}$ ▷ update mixing coefficient
14: &nbsp;&nbsp;end for
15: &nbsp;&nbsp;$s \leftarrow s \cup \{(\mu_{1:K}(T), \sigma^2_{1:K}(T))\}$; $\hat{\pi} \leftarrow \hat{\pi} \cup \{\pi_{1:K}(T)\}$ ▷ slot collection
16: end for
17: return ConvexCombination($s$, $\hat{\pi}$) ▷ K view-invariant content vectors
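The inner loop of Algorithm 1 (lines 7 to 13) and the final aggregation (line 17) can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the view encoder and the STN transformation of lines 2 to 4 are omitted, the projections $W_q$, $W_k$, $W_v$ are taken to be identities, all views share the same random initialisation, and slots are assumed to be already aligned across views. The function names `probabilistic_slot_attention` and `convex_combination` are hypothetical, and the aggregation weights $\pi^v_k / \sum_v \pi^v_k$ follow the form used in the proof of Lemma 5.2.

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    # x: (N, d), mean/var: (K, d) -> (N, K) log density of a diagonal Gaussian
    diff = x[:, None, :] - mean[None, :, :]
    return -0.5 * np.sum(np.log(2 * np.pi * var)[None] + diff ** 2 / var[None], axis=-1)

def probabilistic_slot_attention(keys, values, K, T=5, seed=0, eps=1e-8):
    """One view of Algorithm 1 (lines 7-13); W_q is assumed to be the identity."""
    rng = np.random.default_rng(seed)
    N, d = keys.shape
    pi = np.full(K, 1.0 / K)              # line 7: uniform mixing coefficients
    mu = rng.standard_normal((K, d))      # line 7: mu ~ N(0, I_d)
    var = np.ones((K, d))                 # line 7: unit variances
    for _ in range(T):
        log_w = np.log(pi + eps)[None] + gaussian_logpdf(keys, mu, var)
        log_w -= log_w.max(axis=1, keepdims=True)         # numerical stability
        A = np.exp(log_w)
        A /= A.sum(axis=1, keepdims=True)                 # line 9: responsibilities
        A_hat = A / (A.sum(axis=0, keepdims=True) + eps)  # line 10: normalise over tokens
        mu = A_hat.T @ values                             # line 11: slot means
        diff = values[:, None, :] - mu[None, :, :]
        var = np.einsum("nk,nkd->kd", A_hat, diff ** 2) + eps  # line 12: slot variances
        pi = A.mean(axis=0)                               # line 13: mixing coefficients
    return mu, var, pi

def convex_combination(mus, vars_, pis, eps=1e-8):
    """Line 17: aggregate aligned view-specific slots into content parameters.

    mus, vars_: (|A|, K, d); pis: (|A|, K). Weights are normalised over views.
    """
    w = pis / (pis.sum(axis=0, keepdims=True) + eps)
    content_mu = np.einsum("vk,vkd->kd", w, mus)
    content_var = np.einsum("vk,vkd->kd", w ** 2, vars_)
    return content_mu, content_var
```

Calling `probabilistic_slot_attention` once per view and stacking the outputs gives the inputs expected by `convex_combination`. In practice the slot order can differ between views, so a matching step would be needed before aggregation.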

D.2 Proposed dataset

In this work, we introduce the MV-MOVI datasets, created using Kubric (Greff et al., 2022), which feature multi-view scenes with segmentation annotations. We propose two variants of the dataset: MV-MOVIC, where the camera locations for every viewpoint remain fixed across all scenes, and MV-MOVID, where the camera locations dynamically change for each scene.

Both MV-MOVIC and MV-MOVID consist of scenes generated by randomly selecting a background from a set of 458 available options and choosing K objects, where 3 ≤ K ≤ 6, from a pool of 930 objects. This allows a very large number of distinct scenes to be generated. For this work, we generate 72,000 scenes, each captured from 5 different viewpoints, with object segmentation masks for every view to facilitate the evaluation of model performance. In the case of MV-MOVIC, the locations of all five cameras are fixed across the 72,000 scenes, while in MV-MOVID, the camera positions are dynamically sampled and vary across scenes.

E Mask Generation

In the case of additive decoders, the decoder outputs K three-channel tensors along with K single-channel masks. We normalise these masks with a softmax transformation along the slot dimension, so that slots compete for every pixel and the mask weights at each pixel sum to one. The resulting softmaxed masks are used to compose the slots ($\mathrm{image} = \sum_k \mathrm{mask}_k \cdot \mathrm{image}_k$) and reconstruct an image for training. During inference, we normalise masks with a sigmoid transformation, allowing us to estimate occluded objects visually, resolving the spatial ambiguities with occluded objects. In a later section, we illustrate the results with both softmax and sigmoid transformations.
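The difference between the two mask normalisations can be sketched as follows. This is an illustrative NumPy example under assumed tensor shapes (K slots, H×W pixels); the function names are hypothetical and not taken from the paper's code.

```python
import numpy as np

def compose_with_softmax(rgb, mask_logits):
    """Training-time composition: masks compete across the slot axis.

    rgb: (K, H, W, 3) per-slot images; mask_logits: (K, H, W, 1).
    """
    shifted = mask_logits - mask_logits.max(axis=0, keepdims=True)  # stable softmax
    masks = np.exp(shifted) / np.exp(shifted).sum(axis=0, keepdims=True)
    image = (masks * rgb).sum(axis=0)   # image = sum_k mask_k * image_k
    return image, masks

def sigmoid_masks(mask_logits):
    """Inference-time masks: each slot is normalised independently,
    so overlapping (occluded) regions can be active in several slots."""
    return 1.0 / (1.0 + np.exp(-mask_logits))
```

With softmax the masks at each pixel sum to one, whereas sigmoid masks carry no such constraint, which is what allows an occluded object to keep a full-extent mask.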

E.1 Additivity Implications

Definition E.1. (Additive models) A function $f$ is considered to be an additive decoder if, for object decoders $f_{obj}$ and a masking mechanism $m_{obj}$, it can be expressed as:

$$f(z) = \sum_{k \in [K]} m_{obj}(z_k) \odot f_{obj}(z_k) \tag{11}$$

Figure 7. Data generating process: The figure illustrates 3D point cloud data in the first row, with camera locations highlighted by red, blue, and green arrows. The following rows show the projected images and point clouds as observed from the red, blue, and green cameras, respectively.

As pointed out in (Lachapelle et al., 2023), softmax-based masks do not truly fall under the category of additive decoders due to the competition between masks for groups of pixels. This implies that the additive decoders studied in (Lachapelle et al., 2023) are not expressive enough to represent the masked decoders typically employed in object-centric representation learning. The issue arises from the normalization of alpha masks, and care must be taken when extrapolating the findings from (Lachapelle et al., 2023) to the models used in practice.

Although sigmoid-based masks satisfy the condition of additivity during inference, it is important to note that the model is still trained using softmax normalization in our setting. The effect of using sigmoid masks during inference can be visually observed in App. G.

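Lemma F.1 (ELBO). With prior distributions p(v) and p(c) for view and content latent random variables, the likelihood p(x) can be maximised by maximising the following expression:

$$\log p(x) \geq \mathbb{E}_{c,v}\left[\log p(x \mid T_{\theta^v}(c), v)\right] - \mathrm{KL}\left(q(v \mid x)\,\|\,p(v)\right) := \mathrm{ELBO}(x) \tag{12}$$

Proof. Considering the generative model in Eqn. 5 respecting the graphical model in Fig. 4, we get:

$$p(x) = \iint p(x^A \mid T_{\theta^v}(c), v^A)\, p(c)\, p(v^A)\, dv\, dc \tag{13}$$

$$\log p(x) = \log \iint p(x^A \mid c_{1:K}, v^A)\, p(c_{1:K})\, p(v^A)\, \frac{q(v, c \mid x^A)}{q(v, c \mid x^A)}\, dv\, dc_{1:K} \tag{14}$$

$$\geq \iint q(v^A \mid x^A)\, q(c_{1:K} \mid T^{-1}_{\theta^v}(x^A)) \log\left[ p(x^A \mid T_{\theta^v}(c)_{1:K}, v^A)\, \frac{p(v^A)}{q(v^A \mid x^A)}\, \frac{p(c_{1:K})}{q(c_{1:K} \mid x^A)} \right] dv^A\, dc_{1:K} \tag{15}$$

$$= \sum_{v \in A} \iint q(v^v \mid x^v)\, q(c_{1:K} \mid T^{-1}_{\theta^v}(x^v)) \log\left[ p(x^v \mid T_{\theta^v}(c)_{1:K}, v^v)\, \frac{p(v^A)}{q(v^A \mid x^v)}\, \frac{p(c_{1:K})}{q(c_{1:K} \mid x^v)} \right] dv^v\, dc_{1:K} \tag{16}$$

Given the iterative update for c with the EM algorithm, ideally we expect the posterior over c to converge to the prior, which results in:

$$\log p(x) \geq \sum_{v \in A} \iint q(v^v \mid x^v)\, q(c_{1:K} \mid T^{-1}_{\theta^v}(x^v)) \log\left[ p(x^v \mid T_{\theta^v}(c)_{1:K}, v^v)\, \frac{p(v^A)}{q(v^A \mid x^v)} \right] dv^v\, dc_{1:K} \tag{17}$$

$$= \sum_{v \in A} \Big( \mathbb{E}_{c,v}\left[\log p(x^v \mid T_{\theta^v}(c), v)\right] - \mathrm{KL}\left(q(v \mid x^v)\,\|\,p(v)\right) \Big) \tag{18}$$

Given the subscript notation, the above expression can also be expressed as:

$$\mathbb{E}_{c,v}\left[\log p(x \mid T_{\theta^v}(c), v)\right] - \mathrm{KL}\left(q(v \mid x)\,\|\,p(v)\right) := \mathrm{ELBO}(x) \tag{19}$$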

Lemma F.2 (Mean GMM). Let $z \in \mathbb{R}^{N \times d}$ be a random variable drawn from a GMM with K components:

$$p(z) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(z; \mu_k, \Sigma_k), \tag{20}$$

where $\pi_k$ are the mixture weights, $\mu_k \in \mathbb{R}^d$ are the mean vectors, and $\Sigma_k \in \mathbb{R}^{d \times d}$ are the covariance matrices. Assume the mixture satisfies the ICA assumption, such that the components of z are statistically independent. Then the projected random variable $\bar{z}$, defined as the average over the dimensions of z,

$$\bar{z} = \frac{1}{d} \sum_{j=1}^{d} z_j, \tag{21}$$

is also distributed according to a GMM with K components, with appropriately transformed means and variances.

Proof. Given that the random variable z follows a GMM, its density can be expressed as:

$$p(z) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(z; \mu_k, \Sigma_k), \tag{22}$$

$$\mu_k = [\mu_{k,1}, \mu_{k,2}, \dots, \mu_{k,d}]^\top; \quad \Sigma_k = \mathrm{diag}([\sigma^2_{k,1}, \sigma^2_{k,2}, \dots, \sigma^2_{k,d}]). \tag{23}$$

The projection of z onto $\bar{z}$ is defined as:

$$\bar{z} = \frac{1}{d} \sum_{j=1}^{d} z_j. \tag{24}$$

Given the ICA assumption, the components $z_{:,j}$ are independent. For a fixed component k, the projected mean and variance of $\bar{z}$ can be derived as:

$$\mathbb{E}(\bar{z}) = \frac{1}{d} \sum_{j=1}^{d} \mu_{k,j}; \quad \mathrm{Var}(\bar{z}) = \frac{1}{d^2} \sum_{j=1}^{d} \sigma^2_{k,j}. \tag{25}$$

Since the projection $\bar{z}$ is a linear combination of independent Gaussian variables, $\bar{z}$ remains Gaussian for each component k. Thus, the overall distribution of $\bar{z}$ is also a GMM:

$$p(\bar{z}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\bar{z}; \mu_{\bar{z},k}, \sigma^2_{\bar{z},k}), \tag{26}$$

$$\mu_{\bar{z},k} = \frac{1}{d} \sum_{j=1}^{d} \mu_{k,j}; \quad \sigma^2_{\bar{z},k} = \frac{1}{d^2} \sum_{j=1}^{d} \sigma^2_{k,j}. \tag{27}$$

This concludes the proof.
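The per-component statement of Lemma F.2 can be checked numerically. The sketch below draws samples from a single diagonal Gaussian component and compares the empirical mean and variance of the averaged variable with the closed-form values; the standard result for an average of d independent variables gives a variance of $\frac{1}{d^2}\sum_j \sigma^2_{k,j}$. The helper name `projected_component` and the chosen parameters are illustrative assumptions.

```python
import numpy as np

def projected_component(mu_k, var_k):
    """Mean and variance of the average of d independent Gaussian coordinates."""
    d = mu_k.shape[0]
    return mu_k.mean(), var_k.sum() / d ** 2

rng = np.random.default_rng(0)
mu_k = np.array([1.0, -2.0, 0.5, 3.0])      # illustrative component mean
var_k = np.array([0.5, 1.0, 2.0, 0.25])     # illustrative diagonal variances

# Sample z from the component and average over its d dimensions.
samples = rng.normal(mu_k, np.sqrt(var_k), size=(200_000, 4))
z_bar = samples.mean(axis=1)

mean_pred, var_pred = projected_component(mu_k, var_k)
print(mean_pred, z_bar.mean())   # closed-form vs. empirical mean
print(var_pred, z_bar.var())     # closed-form vs. empirical variance
```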

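Lemma F.3 (Convex Combination of GMMs). Let $s^1 = \{s^1_1, \dots, s^1_K\}$ and $s^2 = \{s^2_1, \dots, s^2_K\}$ be two sets of K random vectors in $\mathbb{R}^d$, each distributed according to GMMs:

$$s^1 \sim \sum_{k=1}^{K} \pi_{1,k}\, \mathcal{N}(\mu_{1,k}, \Sigma_{1,k}); \quad s^2 \sim \sum_{k=1}^{K} \pi_{2,k}\, \mathcal{N}(\mu_{2,k}, \Sigma_{2,k}) \tag{28}$$

where $\mu_{i,k} \in \mathbb{R}^d$, $\Sigma_{i,k} \in \mathbb{R}^{d \times d}$, and $\pi_{i,k}$ are the means, covariances, and mixing coefficients respectively.

Then for any weights $w_1, w_2 \in \mathbb{R}_{\geq 0}$ such that $w_1 + w_2 = 1$, the convex combination $\bar{s} = w_1 s^1 + w_2 s^2$ is also distributed according to a GMM with K components.

Proof. Without loss of generality, assume the components of both GMMs are aligned and that $s^1$ and $s^2$ are independent. For each component k, we derive the parameters of the resulting mixture.

The mixing coefficients of the resulting GMM are weighted combinations of the original coefficients:

$$\bar{\pi}_k = w_1 \pi_{1,k} + w_2 \pi_{2,k} \tag{29}$$

For each component k, the convex combination of Gaussians results in a Gaussian distribution. The mean of the resulting Gaussian is:

$$\bar{\mu}_k = \frac{w_1 \pi_{1,k}\, \mu_{1,k} + w_2 \pi_{2,k}\, \mu_{2,k}}{\bar{\pi}_k} \tag{30}$$

The covariance of the resulting Gaussian for each component k can be derived as follows. First, recall that for a random variable X, the covariance is:

$$\mathrm{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])(X - \mathbb{E}[X])^\top] = \mathbb{E}[XX^\top] - \mathbb{E}[X]\mathbb{E}[X]^\top \tag{31}$$

First, let us compute $\mathbb{E}[\bar{s}_k \bar{s}_k^\top]$:

$$\mathbb{E}[\bar{s}_k \bar{s}_k^\top] = \mathbb{E}\left[ \left( \frac{w_1 \pi_{1,k} s^1_k + w_2 \pi_{2,k} s^2_k}{\bar{\pi}_k} \right) \left( \frac{w_1 \pi_{1,k} s^1_k + w_2 \pi_{2,k} s^2_k}{\bar{\pi}_k} \right)^{\!\top} \right] \tag{32}$$

$$= \frac{w_1^2 \pi_{1,k}^2\, \mathbb{E}[s^1_k (s^1_k)^\top] + w_2^2 \pi_{2,k}^2\, \mathbb{E}[s^2_k (s^2_k)^\top]}{(\bar{\pi}_k)^2} \tag{33}$$

$$\quad + \frac{w_1 w_2 \pi_{1,k} \pi_{2,k}}{(\bar{\pi}_k)^2}\, \mathbb{E}[s^1_k (s^2_k)^\top + s^2_k (s^1_k)^\top] \tag{34}$$

Then, substitute the known expectations:

$$\mathbb{E}[s^i_k (s^i_k)^\top] = \Sigma_{i,k} + \mu_{i,k} \mu_{i,k}^\top \tag{35}$$

$$\mathbb{E}[s^1_k (s^2_k)^\top] = \mu_{1,k} \mu_{2,k}^\top \tag{36}$$

Finally, subtracting $\mathbb{E}[\bar{s}_k]\mathbb{E}[\bar{s}_k]^\top = \bar{\mu}_k \bar{\mu}_k^\top$, we get the covariance:

$$\bar{\Sigma}_k = \frac{w_1^2 \pi_{1,k}^2 \Sigma_{1,k} + w_2^2 \pi_{2,k}^2 \Sigma_{2,k}}{(\bar{\pi}_k)^2} \tag{37}$$

To verify that this forms a non-degenerate GMM, we show that the mixing coefficients sum to 1:

$$\sum_{k=1}^{K} \bar{\pi}_k = \sum_{k=1}^{K} (w_1 \pi_{1,k} + w_2 \pi_{2,k}) \tag{38}$$

$$= w_1 \sum_{k=1}^{K} \pi_{1,k} + w_2 \sum_{k=1}^{K} \pi_{2,k} \tag{39}$$

$$= w_1 \cdot 1 + w_2 \cdot 1 = 1 \tag{40}$$

Therefore, the convex combination results in a valid Gaussian mixture model with K components, where each component has mean $\bar{\mu}_k$, covariance $\bar{\Sigma}_k$, and mixing coefficient $\bar{\pi}_k$.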
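The component-wise parameters derived in the proof can be computed directly. The sketch below is a minimal NumPy illustration for one aligned component k, assuming the two slot sets are independent; the function name `combine_component` and the example values are hypothetical.

```python
import numpy as np

def combine_component(w, pi, mu, cov):
    """Parameters of one aligned component of the combined mixture.

    w:   (2,)       convex weights, w[0] + w[1] = 1
    pi:  (2,)       mixing coefficients of component k in each GMM
    mu:  (2, d)     component means
    cov: (2, d, d)  component covariances
    """
    pi_bar = np.sum(w * pi)                          # combined mixing coefficient
    coef = w * pi / pi_bar                           # per-source scaling of s^1_k, s^2_k
    mu_bar = np.einsum("i,id->d", coef, mu)          # combined mean
    cov_bar = np.einsum("i,ide->de", coef ** 2, cov) # combined covariance
    return pi_bar, mu_bar, cov_bar

# Illustrative values for a single component in d = 2.
w = np.array([0.25, 0.75])
pi = np.array([0.3, 0.6])
mu = np.array([[0.0, 1.0], [2.0, -1.0]])
cov = np.stack([np.eye(2), 0.5 * np.eye(2)])
pi_bar, mu_bar, cov_bar = combine_component(w, pi, mu, cov)
```

Sampling the two sources independently and forming the scaled sum reproduces `mu_bar` and `cov_bar` empirically, which is a quick way to sanity-check the covariance expression.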

Lemma 5.2. (Optimal Content Mixture) For $A \subseteq [V]$, given a local content distribution $q(c_{1:K} \mid s^A_{1:K}, x^A)$ (per-scene $x^A \in \{x^A_i\}_{i=1}^{M}$), which can be expressed as a GMM with K components, the aggregate posterior q(c), obtained by marginalizing out x and s, is a non-degenerate global Gaussian mixture with MK components:

$$p(c) = q(c) = \frac{1}{M} \sum_{i=1}^{M} \sum_{k=1}^{K} \hat{\pi}_{ik}\, \mathcal{N}\left(c; \hat{\mu}_{ik}, \hat{\sigma}^2_{ik}\right). \tag{41}$$

Proof. We extend the proof in (Kori et al., 2024) by incorporating the hierarchical slot-to-aggregate-content formalisation. We begin by noting that the aggregate posterior q(c) is the optimal prior p(c) so long as our posterior approximation $q(c \mid s^A, x^A)$ is close enough to the true posterior $p(c \mid s^A, x^A)$. For a dataset $x^A \in \{x^A_i\}_{i=1}^{M}$, we start with $q(s^A \mid x^A)$; w.l.o.g., given that the viewpoint transformation is deterministic, we consider $x^A = T_{\theta^A}(x^A)$, and we have that:

$$p(s^A) = \int p(s^A \mid x^A)\, p(x^A)\, dx^A \tag{42}$$

$$= \mathbb{E}_{x^A \sim p(x^A)}\left[ p(s^A \mid x^A) \right] \tag{43}$$

$$\approx \frac{1}{M} \sum_{i=1}^{M} p(s^A \mid x^A_i) \quad \text{(empirical approximation)} \tag{44}$$

$$\approx \frac{1}{M} \sum_{i=1}^{M} q(s^A \mid x^A_i) \quad \text{(posterior approximation)} \tag{45}$$

$$=: q(s^A). \tag{46}$$

We further extend this to q(c). With the result from Lemma F.3, we know that $q(c \mid s^A)$ is a GMM with the same number of components as $q(s^v \mid x^v)$ for any $v \in [V]$, as follows:

$$p(c) = \int p(c \mid s^A)\, p(s^A)\, ds^A \tag{47}$$

$$= \mathbb{E}_{s^A \sim p(s^A)}\left[ p(c \mid s^A) \right] \tag{48}$$

$$\approx \frac{1}{M} \sum_{i=1}^{M} p(c \mid s^A_i) \tag{49}$$

$$\approx \frac{1}{M} \sum_{i=1}^{M} q(c \mid s^A_i) \tag{50}$$

$$=: q(c), \tag{51}$$

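where we approximated p(x) using the empirical distribution, then substituted in the approximate posterior, marginalizing x to get p(s); we then consider the distributional form of p(s) and marginalise $s^A$ to get p(c). This observation was first made by (Hoffman & Johnson, 2016) and was used in (Kori et al., 2024); we use it to motivate our setup. Our model fits a local GMM to each latent variable sampled from the approximate posterior: $z^A \sim q(z^A \mid x^A_i)$, for $i = 1, \dots, M$. Let $f_s(z^A)$ denote the (local) GMM density; its expectation is given by:

$$\mathbb{E}_{p(x^A),\, q(z^A \mid x^A)}\left[ f_s(z^A) \right] = \iint p(x^A)\, q(z^A \mid x^A)\, f_s(z^A)\, dx^A\, dz^A \tag{52}$$

$$\approx \iint \frac{1}{M} \sum_{i=1}^{M} \delta(x^A - x^A_i)\, q(z^A \mid x^A)\, f_s(z^A)\, dx^A\, dz^A \tag{53}$$

$$= \frac{1}{M} \sum_{i=1}^{M} \int q(z^A \mid x^A_i)\, f_s(z^A)\, dz^A \tag{54}$$

$$= \frac{1}{M} \sum_{i=1}^{M} \int \mathcal{N}\left(z^A; \mu(x^A_i), \sigma^2(x^A_i)\right) \cdot \sum_{k=1}^{K} \pi_k(x^A_i)\, \mathcal{N}\left(z^A; \mu_k(x^A_i), \sigma^2_k(x^A_i)\right) dz^A$$

$$\approx \frac{1}{M} \sum_{i=1}^{M} \int \delta\left(z^A - \mu(x^A_i)\right) \cdot \sum_{k=1}^{K} \pi_k(x^A_i)\, \mathcal{N}\left(z^A; \mu_k(x^A_i), \sigma^2_k(x^A_i)\right) dz^A \tag{55}$$

$$= \frac{1}{M} \sum_{i=1}^{M} \sum_{k=1}^{K} \pi_k(x^A_i)\, \mathcal{N}\left(z^A; \mu_k(x^A_i), \sigma^2_k(x^A_i)\right) \tag{56}$$

$$=: q(z^A), \tag{57}$$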

where we again used the empirical distribution approximation of p(x), and the following basic identity of the Dirac delta to simplify: $\int \delta(x - x')\, f_e(x)\, dx = f_e(x')$.

For the general case, however, we must instead compute the product of $q(z^A \mid x^A)$ and $f_s(z^A)$ rather than use a Dirac delta approximation as in Eqn. 55. To that end, we may proceed as follows w.r.t. each datapoint $x^A_i$:

$$q(z^A \mid x^A_i) \cdot f_s(z^A) = \mathcal{N}\left(z^A; \mu(x^A_i), \sigma^2(x^A_i)\right) \cdot \sum_{k=1}^{K} \pi_k(x^A_i)\, \mathcal{N}\left(z^A; \mu_k(x^A_i), \sigma^2_k(x^A_i)\right) \tag{58}$$

$$= \sum_{k=1}^{K} \pi_k(x^A_i)\, \mathcal{N}\left(z^A; \mu(x^A_i), \sigma^2(x^A_i)\right) \cdot \mathcal{N}\left(z^A; \mu_k(x^A_i), \sigma^2_k(x^A_i)\right) \tag{59}$$

Given that the means across all views are aligned, similar to Lemma F.3, we know the resulting combined GMM has the same number of components:

$$q(z^A \mid x^A_i) \cdot f_s(z^A) = \sum_{k=1}^{K} \tilde{\pi}^v_{ik}\, \mathcal{N}\left(z; \tilde{\mu}_{ivk}, \tilde{\sigma}^2_{ivk}\right), \tag{60}$$

In general, the product of GMMs is a GMM whose number of components equals the product of the numbers of components of the individual GMMs. In our setting, however, we consider all the components of the individual GMMs to be aligned across viewpoints, so that the resulting GMM retains K components. The posterior parameters of the resulting mixture are given in closed form by:

$$\tilde{\sigma}^2_{ivk} = \left( \frac{1}{\sigma^2_k(x^v_i)} + \frac{1}{\sigma^2(x^v_i)} \right)^{-1}, \quad \tilde{\mu}_{ivk} = \tilde{\sigma}^2_{ivk} \left( \frac{\mu(x^v_i)}{\sigma^2(x^v_i)} + \frac{\mu_k(x^v_i)}{\sigma^2_k(x^v_i)} \right) \tag{61}$$
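The closed-form parameters above are the standard result for the product of two Gaussian densities: precisions add, and the mean is the precision-weighted average. A short one-dimensional NumPy sketch, with hypothetical function names and illustrative values, is:

```python
import numpy as np

def gaussian_product(mu_a, var_a, mu_b, var_b):
    """Mean and variance of the Gaussian proportional to N(z; a) * N(z; b)."""
    var = 1.0 / (1.0 / var_a + 1.0 / var_b)      # precisions add
    mu = var * (mu_a / var_a + mu_b / var_b)     # precision-weighted mean
    return mu, var

def normal_pdf(z, mu, var):
    return np.exp(-0.5 * (z - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# The product of the two densities equals a constant times N(z; mu, var);
# the constant is N(mu_a; mu_b, var_a + var_b).
mu, var = gaussian_product(0.0, 1.0, 2.0, 0.5)
z = np.linspace(-1.0, 3.0, 9)
ratio = normal_pdf(z, 0.0, 1.0) * normal_pdf(z, 2.0, 0.5) / normal_pdf(z, mu, var)
```

The ratio is constant in `z`, which confirms that the product is an unnormalised Gaussian with the stated mean and variance; the constant is what reweights the mixing coefficients.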

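The resulting GMM is still defined on the view-specific slots. Aggregating these slots to obtain content vectors marginalises the viewpoint-level information through a convex combination of the parameters across all the considered viewpoints, as described in Lemma F.3, which results in:

$$\sum_{k=1}^{K} \tilde{\pi}^v_{ik}\, \mathcal{N}\left(z; \tilde{\mu}_{ivk}, \tilde{\sigma}^2_{ivk}\right) = \sum_{k=1}^{K} \hat{\pi}_{ik}\, \mathcal{N}\left(z; \hat{\mu}_{ik}, \hat{\sigma}^2_{ik}\right), \tag{62}$$

$$\hat{\sigma}^2_{ik} = g(\tilde{\sigma}_{ik}, \tilde{\pi}_{ik}) = \sum_{v \in A} \left( \frac{\tilde{\pi}^v_{ik}}{\sum_{v=1}^{|A|} \tilde{\pi}^v_{ik}} \right)^{2} \tilde{\sigma}^2_{ivk}, \tag{63}$$

$$\hat{\mu}_{ik} = g(\tilde{\mu}_{ik}, \tilde{\pi}_{ik}) = \sum_{v \in A} \frac{\tilde{\pi}^v_{ik}}{\sum_{v=1}^{|A|} \tilde{\pi}^v_{ik}}\, \tilde{\mu}_{ivk}, \tag{64}$$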

Now, to show that the resulting GMM is non-degenerate, we need to show $\sum_{k=1}^{K} \hat{\pi}_{ik} = 1$, for $i = 1, 2, \dots, M$. Based on Eqn. 56:

$$\frac{1}{M} \sum_{i=1}^{M} \sum_{k=1}^{K} \hat{\pi}_{ik} = \frac{1}{M|A|} \sum_{i=1}^{M} \sum_{v=1}^{|A|} \sum_{k=1}^{K} \tilde{\pi}^v_{ik} = \frac{1}{M|A|} \sum_{i=1}^{M} |A| = \frac{1}{M|A|}\, M|A| = 1, \tag{65}$$

$$\implies \sum_{k=1}^{K} \hat{\pi}_{ik} = 1. \tag{66}$$

Based on the above equation, the scaled sum of the mixing proportions of all K components in all M GMMs, when the components are aligned, must equal 1, showing that the resulting aggregate posterior is non-degenerate and a valid probability distribution.

Assumption F.4 (Weak Injectivity). Let $f_d : Z \to X$ be a mapping between latent space and image space, where $\dim(Z) \leq \dim(X)$. The mapping $f_d$ is weakly injective if there exists $x_0 \in X$ and $\delta > 0$ such that $|f_d^{-1}(\{x\})| = 1$, $\forall x \in B(x_0, \delta) \cap f_d(Z)$, and $\{x \in X : |f_d^{-1}(\{x\})| = \infty\} \subseteq f_d(Z)$ has measure zero with respect to the Lebesgue measure on $f_d(Z)$ (cf. (Kivva et al., 2022)).

Remark F.5. In words, Assumption F.4 says that a mapping fd is weakly injective if: (i) in a small neighbourhood around a specific point x0 X the mapping is injective meaning each point in this neighbourhood maps to exactly one point in the latent space Z; and (ii) while fd may not be globally injective, the set of points in X that map back to an infinite number of points in Z (non-injective points) is almost non-existent in terms of the Lebesgue measure on the image of Z under fd.

Theorem F.6 (Mixture of Concatenated Slots). Let $f_s$ denote a permutation equivariant probabilistic slot attention function such that $f_s(z^v, P s^v) = P f_s(z^v, s^v)$, where $P \in \{0, 1\}^{K \times K}$ is an arbitrary permutation matrix. Let $c = (g(s^A_1, \cdot), \dots, g(s^A_K, \cdot)) \in \mathbb{R}^{Kd}$ be the concatenation of K individual content vectors, where each vector is an aggregate of all the slots obtained from the considered viewpoints in a viewpoint set $A \subseteq [V]$ (cf. (Kori et al., 2024)). Due to the nature of the aggregator function, each individual content vector is Gaussian distributed within a K-component mixture: $c_k \sim \mathcal{N}(\mu_k, \Sigma_k) \in \mathbb{R}^d$, $\forall k \in \{1, \dots, K\}$. Then, c is also GMM distributed with K! mixture components:

$$p(c) = \sum_{p=1}^{K!} \pi_p\, \mathcal{N}(c; \mu_p, \Sigma_p), \quad \text{where } \pi \in \Delta^{K!-1},\ \mu_p \in \mathbb{R}^{Kd},\ \Sigma_p \in \mathbb{R}^{Kd \times Kd}. \tag{67}$$

We additionally borrow some theorems and definitions from (Kivva et al., 2022) which are essential for our proofs. First, we restate the definition of a generic point as outlined by (Kivva et al., 2022) below.

Definition F.7. (Generic point) A point $x \in f_d(\mathbb{R}^m) \subseteq \mathbb{R}^n$ is generic if there exists $\delta > 0$ such that $f_d : B(s, \delta) \to \mathbb{R}^n$ is affine for every $s \in f_d^{-1}(\{x\})$.

Theorem F.8 (Kivva et al., 2022). Given that $f_d : \mathbb{R}^m \to \mathbb{R}^n$ is a piecewise affine function such that $\{x \in \mathbb{R}^n : |f_d^{-1}(\{x\})| = \infty\} \subseteq f_d(\mathbb{R}^m)$ has measure zero with respect to the Lebesgue measure on $f_d(\mathbb{R}^m)$, this implies $\dim(f_d(\mathbb{R}^m)) = m$ and almost every point in $f_d(\mathbb{R}^m)$ (with respect to the Lebesgue measure on $f_d(\mathbb{R}^m)$) is generic with respect to $f_d$.

Theorem F.9 (Kivva et al., 2022). Consider a pair of finite GMMs in $\mathbb{R}^m$:

$$y \sim \sum_{j=1}^{J} \pi_j\, \mathcal{N}(y; \mu_j, \Sigma_j), \quad \text{and} \quad y' \sim \sum_{j=1}^{J'} \pi'_j\, \mathcal{N}(y'; \mu'_j, \Sigma'_j). \tag{68}$$

Assume that there exists a ball $B(x, \delta)$ such that $y$ and $y'$ induce the same measure on $B(x, \delta)$. Then $y \equiv y'$, and for some permutation $\tau$ we have that $\pi_i = \pi'_{\tau(i)}$ and $(\mu_i, \Sigma_i) = (\mu'_{\tau(i)}, \Sigma'_{\tau(i)})$.

Theorem F.10 (Kivva et al., 2022). Let $z \sim \sum_{i=1}^{J} \pi_i\, \mathcal{N}(z; \mu_i, \Sigma_i)$ and $z' \sim \sum_{j=1}^{J'} \pi'_j\, \mathcal{N}(z'; \mu'_j, \Sigma'_j)$, and let $f_d(z)$ and $\tilde{f}_d(z')$ be equally distributed. Assume that, for $x \in \mathbb{R}^n$ and $\delta > 0$, $f_d$ is invertible on $B(x, 2\delta) \cap f_d(\mathbb{R}^m)$. This implies that there exists $x_1 \in B(x, \delta)$ and $\delta_1 > 0$ such that both $f_d$ and $\tilde{f}_d$ are invertible on $B(x_1, \delta_1) \cap f_d(\mathbb{R}^m)$.

Theorem 5.3 (Affine Equivalence of aggregate content) For any subset $A \subseteq [V]$ such that $|A| > 0$, given a set of images $x^A \in X^A$, a corresponding aggregate content $c \in C$, and a non-degenerate content posterior $q(c \mid s^A)$, considering two mixing functions $f_d, \tilde{f}_d$ satisfying Assumption F.4 with a shared image, $c$ is identifiable up to $\sim_s$ equivalence.

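Proof. Based on the results of (Kori et al., 2024), we know that when p(s) is the aggregate posterior of q(s | x), p(s) is identifiable up to $\sim_s$ equivalence. Additionally, based on Lemma 5.2, we know that both q(s | x) and q(c | s) are non-degenerate GMMs and valid probability distributions. Using arguments similar to (Kori et al., 2024; Kivva et al., 2022), we show that p(c) and p(s) are identifiable up to $\sim_s$ equivalence. W.l.o.g., given that the viewpoint transformation is deterministic, we consider $x^A = T_{\theta^A}(x^A)$.

We know that

$$p(s^A) = \int q(s^A_{1:K} \mid x^A)\, p(x^A)\, dx^A \tag{69}$$

$$= \int \prod_{v \in A} q(s^v \mid x^v)\, p(x^v)\, dx^A \tag{70}$$

$$= \int \prod_{v \in A} \left( \sum_{k=1}^{K} \pi^v_k\, \mathcal{N}\left(s^v; \mu_k(x^v), \sigma^2_k(x^v)\right) \right) p(x^v)\, dx^A \tag{71}$$

$$\approx \int \prod_{v \in A} \left( \sum_{k=1}^{K} \pi^v_k\, \mathcal{N}\left(s^v; \mu_k(x^v), \sigma^2_k(x^v)\right) \right) \frac{1}{|X|} \sum_{i=1}^{|X|} \delta(x^v - x^v_i)\, dx^A \tag{72}$$

$$= \sum_{v \in A} \sum_{i=1}^{|X|} \sum_{k=1}^{K} \frac{1}{|X|}\, \hat{\pi}^v_{ik}\, \mathcal{N}\left(s^v; \hat{\mu}_{ivk}, \hat{\sigma}^2_{ivk}\right) \tag{73}$$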


A change of variables from s to c, to obtain the prior over the random variable c with matching function g, results in:

$$p(c_{1:K}) = \int p(s^A_{1:K})\, \delta\left(s^A_{1:K} - g(s^A_{1:K}, \pi^A_{1:K})\right) dc^A_{1:K} \tag{74}$$

Given that the transformation g is linear, the resulting distribution has mean given by:

$$\mathbb{E}_c(c_{1:K}) = \mathbb{E}_s\left[ g(s^A_{1:K}, \pi^A_{1:K}) \right] \tag{75}$$

$$= g\left( \mathbb{E}_s(s^A_{1:K}), \pi^A_{1:K} \right) \tag{76}$$

$$= \sum_{v \in A} \pi^v_{1:K}\, \mathbb{E}_s(s^A_{1:K}) \tag{77}$$

and the covariance follows the diagonal structure as in p(c), which can be described as follows:

$$\mathrm{Var}(c_{1:K}) = \sum_{v \in A} \left( \pi^v_{1:K} \right)^2 \mathrm{Var}_c(c^A_{1:K}) \tag{78}$$

Finally, the mixing coefficients can be expressed as:

$$\pi_{1:K} = \sum_{v \in A} \frac{\pi^v_{1:K}}{|A|} \tag{79}$$

With the distribution parameters described in equations 77, 78, and 79, we define the aggregate content distribution as a GMM expressed as follows:

$$p(c) = \sum_{v \in A} \sum_{k=1}^{|X|K} \frac{1}{|X|}\, \pi^v_k\, \mathcal{N}\left(c; \mathbb{E}(c)_k, \mathrm{Var}(c)_k\right) \tag{80}$$

Validity of p(c): The outer summation in equation 80 can be split into two, one over image samples and the other over the original mixing coefficients, which results in the equation:

$$p(c) = \sum_{v \in A} \sum_{i=1}^{|X|} \sum_{k=1}^{K} \frac{1}{|X|}\, \pi^v_{ik}\, \mathcal{N}\left(c; \mathbb{E}(c)_{ik}, \mathrm{Var}(c)_{ik}\right) \tag{81}$$

Based on this, we can observe that each component in our GMM corresponds to a particular slot for a given image in a given viewpoint. The triple describing each component is:

$$\left\{ \pi^v_{ik}, \mu_{vik}, \sigma^2_{vik} \right\}, \quad \text{for } v = 1, \dots, |A|,\ i = 1, 2, \dots, |X|, \text{ and } k = 1, 2, \dots, K. \tag{82}$$

To verify that p(c) is a non-degenerate mixture, we observe the following implication:

$$\sum_{i=1}^{|X|} \sum_{k=1}^{K} \frac{1}{|X|} \sum_{v \in A} \frac{\pi^v_{ik}}{|A|} = 1, \tag{83}$$

$$= \frac{1}{|X|} \frac{1}{|A|} \sum_{i=1}^{|X|} \sum_{v \in A} \sum_{k=1}^{K} \pi^v_{ik} = \frac{1}{|X|} \frac{1}{|A|}\, |X|\, |A| \cdot 1 = 1 \tag{84}$$

Similar to Lemma 5.2, this says that the scaled sum of the mixing proportions of all K components in all |X| GMMs must equal 1, proving that the associated aggregate posterior mixture p(c) is a well-defined and non-degenerate probability distribution.


Invertibility restrictions: Given two piecewise affine compositional functions $f_d, \tilde{f}_d : C \times V \to X$, for a given set of views $v^A$, let $c = (c_1, \dots, c_K)$, $c_k \sim \mathcal{N}(c_k; \mu_k, \Sigma_k)$ and $c' = (c'_1, \dots, c'_K)$, $c'_k \sim \mathcal{N}(c'_k; \mu'_k, \Sigma'_k)$ be a pair of aggregate content representations, obtained by sampling a concatenated higher-dimensional GMM distribution in $\mathbb{R}^{Kd}$, as shown in Theorem F.6 (Kori et al., 2024). Consider the case when $f_{d\sharp}(C, \{v^A\})$ and $\tilde{f}_{d\sharp}(C', \{v^A\})$³ are equally distributed. Now assume that there exists $x^A \in X$ and $\delta > 0$ such that $f_d$ and $\tilde{f}_d$ are invertible and piecewise affine on $B(x^A, \delta) \cap f_d(S)$, for a given set of views $v^A$, which implies $\dim f_d(C, \{v^A\}) = |C|$.

Affine subspace: We now restrict the space $B(x^A, \delta)$ to a subspace $B(x'^A, \delta')$, where $x'^A \in B(x^A, \delta)$, such that $f_d$ and $\tilde{f}_d$ are now invertible and affine on $B(x'^A, \delta') \cap f_d(C, \{v^A\})$. Let $L \subseteq X^A$ be a $|C|$-dimensional affine subspace (assuming $|X^A| \geq |C|$), such that $B(x'^A, \delta') \cap f_{d\sharp}(C, \{v^A\}) = B(x'^A, \delta') \cap L$. We also define $h_f, h_{\tilde{f}} : C \to L$ to be a pair of invertible affine functions where $h_f^{-1}(B(x'^A, \delta') \cap L) = f_d^{-1}(B(x'^A, \delta') \cap L; v^A)$ and $h_{\tilde{f}}^{-1}(B(x'^A, \delta') \cap L) = \tilde{f}_d^{-1}(B(x'^A, \delta') \cap L; v^A)$. This implies that $h_f(c)$ and $h_{\tilde{f}}(c')$ are finite GMMs that coincide on $B(x'^A, \delta') \cap L$, and $h_f(c) \equiv h_{\tilde{f}}(c')$ by Theorem F.9 (Kivva et al., 2022). Given $h = h_{\tilde{f}}^{-1} \circ h_f$ and $h_f(c) \equiv h_{\tilde{f}}(c')$, $h$ is an affine transformation such that $h(c) = c'$.

$\sim_s$ equivalence: Given Theorems F.8 and F.10, there exists a point $x \in f_{d\sharp}(C, \{v^A\})$ that is generic with respect to $f_d$ and $\tilde{f}_d$, with both functions invertible on $B(x, \delta) \cap f_{d\sharp}(C, \{v^A\})$. Having established that there is an affine transformation $h(c) = c'$ and the invertibility of the piecewise affine functions $f_d$ and $\tilde{f}_d$ on $B(x^A, \delta) \cap f_{d\sharp}(C, \{v^A\})$, this implies that $c$ is identifiable up to an affine transformation and permutation of $c_k \in c$, concluding our proof.

Remark: Given Theorem F.9, we know that each higher-dimensional mixture component in p(c) induces the same measure on $B(x^A, \delta)$, and hence for some permutation $\tau$ we have that $(\mu_{\pi(i)}, \Sigma_{\pi(i)}) = (\mu'_{\tau(\pi(i))}, \Sigma'_{\tau(\pi(i))})$. Therefore, each mixture component $c_{\pi(i)}$ is identifiable up to an affine transformation and permutation of the aggregate content representations in $c$. Now, given that sampling $c_k$ is equivalent to obtaining K samples from the GMM q(z) and concatenating them, this makes q(z) identifiable up to an affine transformation $h$ and permutation of the slot representations in $c$. It now trivially follows that each of the aggregate content representations $c_k \sim \mathcal{N}(c_k; \mu_k, \Sigma_k) \in \mathbb{R}^d$, $\forall k \in \{1, \dots, K\}$, is identifiable up to an affine transformation $h$, based on the following observed property of GMMs:

$$\sum_{k=1}^{K} \pi_k\, h_\sharp\left( \mathcal{N}(s_k; \mu_k, \Sigma_k) \right) \sim h_\sharp\left( \sum_{k=1}^{K} \pi_k\, \mathcal{N}(s'_k; \mu'_k, \Sigma'_k) \right). \tag{85}$$
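The GMM property used here rests on a standard fact: pushing a Gaussian $\mathcal{N}(\mu, \Sigma)$ through an affine map $h(s) = Ms + b$ gives $\mathcal{N}(M\mu + b, M \Sigma M^\top)$, so an affine map applied to a GMM acts component-wise and leaves the mixing coefficients unchanged. A minimal NumPy sketch, with a hypothetical function name and illustrative parameters, is:

```python
import numpy as np

def affine_pushforward(pi, mu, cov, M, b):
    """Parameters of h#(GMM) for the affine map h(s) = M s + b.

    pi: (K,), mu: (K, d), cov: (K, d, d), M: (d, d), b: (d,)
    """
    mu_new = mu @ M.T + b                               # M mu_k + b for every component
    cov_new = np.einsum("ij,kjl,ml->kim", M, cov, M)    # M Sigma_k M^T for every component
    return pi, mu_new, cov_new

# Illustrative two-component mixture in d = 2 with a rotation-and-scale map.
pi = np.array([0.4, 0.6])
mu = np.array([[0.0, 0.0], [3.0, -1.0]])
cov = np.stack([np.eye(2), np.diag([0.5, 2.0])])
M = np.array([[0.0, -2.0], [1.0, 0.0]])
b = np.array([1.0, 1.0])
pi_new, mu_new, cov_new = affine_pushforward(pi, mu, cov, M, b)
```

This is why recovered distributions that differ by rotation, skew, or mirroring remain GMMs with matching mixing coefficients.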

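Theorem 5.4 (Invariance of aggregate content) For any subsets $A, B \subseteq [V]$ such that $|A| > 0$, $|B| > 0$, and both A and B satisfy Assumption 5.1, we consider aggregate content to be invariant to viewpoints if $f_A \sim_s f_B$ for data $X^A \times X^B$.

Proof. Based on equation 80, $p_A(s)$ and $p_B(s)$ can be expressed as follows:

$$p_A(s) = \sum_{k=1}^{K} \sum_{v \in A} \frac{\pi^v_k}{|A|}\, \mathcal{N}\left( \sum_{v \in A} \pi^v_k \mu_{vk},\ \sum_{v \in A} (\pi^v_k)^2 \sigma^2_{vk} \right) \tag{86}$$

$$p_B(s) = \sum_{k=1}^{K} \sum_{u \in B} \frac{\pi^u_k}{|B|}\, \mathcal{N}\left( \sum_{u \in B} \pi^u_k \mu_{uk},\ \sum_{u \in B} (\pi^u_k)^2 \sigma^2_{uk} \right) \tag{87}$$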

Given the assumption of viewpoint sufficiency (Assumption 5.1), we know that the objects observed in viewpoint set A are the same as the objects observed in set B. Following the results of Theorem 5.3, we know that both $p_A(s)$ and $p_B(s)$ are independently identifiable up to $\sim_s$ equivalence, which means $f_A$ and $f_B$ are invertible for the given views $v^A$ and $v^B$, respectively.

³ $f_{d\sharp}$ corresponds to the push-forward operation, applying the function $f_d$ to all the elements of the given set.


Affine mapping. Without loss of generality, for a given set of views $v^A$, let $L \subseteq X^A$ be an $|S|$-dimensional affine subspace such that $B(x^A, \delta) \cap f_{A\sharp}(C, \{v^A\}) \cap f_{B\sharp}(C, \{v^A\}) = B(x^A, \delta) \cap L$. This implies there exists an affine map between $c = f_A^{-1}(x^A; v^A)$ and $c' = f_B^{-1}(x^B; v^A)$. Let $h_A : C \to L$ be an invertible affine function where $h_A^{-1}(B(x^A, \delta') \cap L) = f_A^{-1}(B(x^A, \delta') \cap L; v^A) = f_B^{-1}(B(x^B, \delta') \cap L; v^A)$, resulting in $h_A(c) = c'$. Similarly, we can show there exists an affine map between $\tilde{c} = f_A^{-1}(x^A; v^B)$ and $\tilde{c}' = f_B^{-1}(x^B; v^B)$, such that $h_B(\tilde{c}) = \tilde{c}'$.

Invariance setup. In the case when representations are invariant, $p_A(c)$ and $p_B(c)$ are equally distributed, which means the aggregate content domains in both cases are the same, $C_A = C_B$.

$$c' = h(\tilde{c}') \tag{88}$$

$$\implies h_A(c) = (h \circ h_B)(\tilde{c}) \tag{89}$$

$$\implies c = (h_A^{-1} \circ h \circ h_B)(\tilde{c}) \tag{90}$$

Given that the composition of affine maps is affine, the mapping $(h_A^{-1} \circ h \circ h_B)$ is affine, resulting in an $\sim_s$ equivalence between $f_A$ and $f_B$.

Theorem 5.5 (Approximate representational equivariance) For a given aggregate content $c$, for any two views $v, \tilde{v} \sim p_A(v)$, resulting in respective scenes $x \sim p_A(x \mid v, c)$ and $\tilde{x} \sim p_A(x \mid \tilde{v}, c)$, for any homeomorphic, monotonic transformation $h_x \in H_x$ such that $h_x(x) = \tilde{x}$, there exists another homeomorphic and monotonic transformation $h_v \in H_v$ such that $H_v \subseteq H_x \subseteq \mathbb{R}^{\dim(x)}$ and $\tilde{v} = h_v^{-1}\left( f_d^{-1}(h_x(x); c) \right)$.

Proof. For a given view $v$ and a mixing function $f_d$ that satisfies Assumption F.4 and is piecewise affine, from Theorem 5.3 we know that the latent view representations are identifiable up to $\sim_s$ equivalence for a given aggregate content vector. We know that p(v) is expressed as a GMM over the considered set of viewpoints, ideally learning one component for each viewpoint:

$$p(v) = \sum_{v=1}^{|A|} \pi_v\, \mathcal{N}(v; \mu_v, \sigma^2_v)$$

Following arguments similar to Theorem 5.3 and (Kivva et al., 2022), we can show that for a given content representation $c$ the view distribution p(v) is identifiable up to an affine transformation. This means that for any two considered models $f_d, \tilde{f}_d$ such that $f_{d\sharp}(V; \{c\})$ and $\tilde{f}_{d\sharp}(V; \{c\})$ are equally distributed, for any $x^A \in X$ with the corresponding content representations given by $c$, the views $v = f_d^{-1}(x^v; c)$ and $v' = \tilde{f}_d^{-1}(x^v; c)$ are related by an affine transformation $h(v) = v'$, which results in:

$$\sum_{v=1}^{|A|} \pi_v\, h_\sharp\left( \mathcal{N}(v; \mu_v, \sigma^2_v) \right) \sim h_\sharp\left( \sum_{v=1}^{|A|} \pi_v\, \mathcal{N}(v; \mu_v, \sigma^2_v) \right)$$

Without loss of generality, we can consider any function $f : C \times V \to X$ to be identifiable up to an affine transformation. With this, for given views $v, \tilde{v} \sim p(v)$ and for any object representations $c$, the resulting scenes, sampled from distributions learned with mixing function $f$, are given by $x \sim p_f(x \mid c, v)$ and $\tilde{x} \sim p_f(x \mid c, \tilde{v})$. As previously established, for some affine transformation $\tilde{h}$,

$$\tilde{h}(\tilde{v}) = f^{-1}(\tilde{x}; c) \implies \tilde{v} = \tilde{h}^{-1}\left( f^{-1}(\tilde{x}; c) \right) \tag{92}$$

Given $h_x(x) = \tilde{x}$, when combined with the above equation, we know $v = h^{-1}\left( f^{-1}(x; c) \right)$ and $\tilde{v} = \tilde{h}^{-1}\left( f^{-1}(h_x(x); c) \right)$, for some invertible affine transformations $h$ and $\tilde{h}$.


Figure 8. Identifiability of q(c) and q(s). Estimated marginalised slot distribution (q(s), blue contours) and marginalised content distribution (q(c), orange contours) across 4 runs of VISA. This provides strong evidence of recovery of the latent space up to affine transformations, empirically verifying our claims in Thm. 5.3.

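Given that $h_x$ is homeomorphic and monotonic, and $f$ is piecewise linear, the inverse can be transferred, resulting in $\tilde{v} = \tilde{h}^{-1}\left( \tilde{h}_v(f^{-1}(x; c)) \right)$; similarly, we can also swap $\tilde{h}^{-1}$ with $\tilde{h}_v$, resulting in $\tilde{v} = \tilde{h}_v\left( \tilde{h}^{-1}(f^{-1}(x; c)) \right)$. Additionally, combining the results from Theorem 5.3 and (Kivva et al., 2022), we know that $\tilde{h}^{-1} \circ h$ is an affine transformation $\hat{h}$. This results in:

$$\hat{h} = \tilde{h}^{-1} \circ h \tag{93}$$

$$\implies \tilde{v} = (\tilde{h}_v \circ \hat{h} \circ h)\left( f^{-1}(x; c) \right) \tag{94}$$

$$\implies \tilde{v} = h_v(v) \tag{95}$$

Given that an affine transformation preserves monotonicity and homeomorphism, the resulting transformation satisfies $h_v \in H_v$ and $h_v \in H_x$, concluding the proof.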

G Experiments

G.1 Toy Setting

Here, we repeat the experiments in CASE STUDY 1 with point clouds, giving us two-dimensional distributions which can be analysed visually. In Fig. 8, we display the marginalized aggregate content distribution q(c), comparing different runs that are either rotated, skewed, or mirrored with respect to each other, indicating identifiability up to affine transformation. To measure this quantitatively, we computed SMCC and observed it to be 0.95 ± 0.01, empirically verifying our Thm. 5.3. Furthermore, to illustrate the invariance of the distribution q(c) across viewpoints (Thm. 5.4), we consider three different views. We use all possible pairs to learn q(c) distributions as illustrated in Fig. 9, where the distributions from the second to the last sub-figures are learned with respect to the viewpoints described by {g, r}, {r, b}, and {g, b}, respectively. Similar to our previous findings, these distributions were also found to be rotated, skewed, or mirrored relative to each other, with an observed SMCC of 0.87 ± 0.11, further confirming the claims in Thm. 5.4.

G.2 Synthetic dataset results

Here, we illustrate visual results reflecting object binding in the case of view ambiguities. Table 3 demonstrates identifiability results on the CLEVR-AUG dataset. In Fig. 10, we demonstrate the results of VISA across three different views. We additionally highlight some of the occluded regions, which seem to be better captured by our proposed model; this can be attributed to the multi-view setting and the sigmoid mask.

Additionally, we illustrate the results on the CLEVR-MV dataset in Fig. 11.

G.3 Influence of Number of Views

Here, we demonstrate the influence of the number of views on the overall identifiability of object-centric representations. Similar to Fig. 2, in Fig. 12, we observe that increasing the number of views improves the overall results.

Figure 9. Viewpoint invariance for q(c) (panels: $q_{A=\{r,g\}}(c)$, $q_{A=\{r,b\}}(c)$, $q_{A=\{b,g\}}(c)$). Estimated marginalised aggregate content distributions q(c) when trained with different view pairs {(green, red), (red, blue), (green, blue)} are illustrated in the later sub-figures. The resulting distributions with different datasets only vary by an affine transformation, providing strong evidence for Thm. 5.4.

Table 3. Comparing identifiability scores of q(s), q(c), and p(v) with respect to existing OCL methods on the CLEVR-AUG dataset.

| METHOD | SMCC | INV-SMCC | MCC |
|---|---|---|---|
| AE | 0.26 ± .01 | - | - |
| SA | 0.45 ± .05 | - | - |
| PSA | 0.48 ± .03 | - | - |
| MULMON | 0.56 ± .04 | 0.57 ± .01 | - |
| OCLOC | 0.58 ± .02 | 0.60 ± .01 | 0.48 ± .04 |
| VISA | 0.64 ± .01 | 0.66 ± .01 | 0.57 ± .04 |

G.4 MVMOVI Results

Here, we discuss the results obtained from the proposed dataset. To reiterate, MVMOVI-C is a variant where fixed camera positions are maintained for all viewpoints across all scenes in the dataset. This setup helps assign a fixed type of viewpoint conditioning for all images captured from a particular camera.

The detection and binding quality of the different models is reported in Table 2. From these results, we observe that while the models demonstrate similar binding capabilities, the identifiability of object representations is improved in our proposed model. This suggests that the use of fixed camera positions in MVMOVI-C enhances the consistency and quality of object representation learning, leading to better detection performance across different viewpoints.

Figures 13 and 14 showcase the object discovery capabilities of VISA. In the MVMOVI-D variant of the dataset, we vary the camera position for each scene, making the dataset more dynamic and allowing for the potential violation of Assumption 5.1 in certain cases. Table 4 presents the binding and identifiability results for both in-domain and out-of-domain data, following a similar analysis as in Table 2. We observe consistent trends and behaviours, suggesting that the impact of the assumption is minimal. A more detailed analysis of the assumption's effects is left for future work.

G.5 View warm-up

Given the stochasticity during the initial phase of training, we use a view warm-up strategy to facilitate meaningful representations in the content aggregator function. For the initial 100,000 iterations, we use the view-specific slots for reconstruction instead of the invariant content with a probability of 0.5.

This primarily ensures that the feature extractor extracts meaningful representations before aggregation, which helps to stabilize the training process and allows the model to effectively bind and integrate information from different perspectives in later stages of training.
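
The warm-up rule above can be sketched in a few lines of Python. This is a minimal illustration rather than our training code: the function name `select_decoder_input`, the per-step (rather than per-sample) coin flip, and the placeholder slot values are assumptions; only the 100,000-iteration window and the probability of 0.5 come from the description above.

```python
import random

WARMUP_STEPS = 100_000  # length of the warm-up phase stated in the text
VIEW_SLOT_PROB = 0.5    # probability of decoding from view-specific slots


def select_decoder_input(step, view_slots, content_slots, rng=random):
    """Pick which representation is passed to the decoder at a training step.

    During warm-up, view-specific slots replace the aggregated invariant
    content with probability VIEW_SLOT_PROB; afterwards the aggregated
    content is always used. The coin is flipped once per step (assumption).
    """
    if step < WARMUP_STEPS and rng.random() < VIEW_SLOT_PROB:
        return view_slots
    return content_slots


# Usage with placeholder values standing in for slot tensors.
rng = random.Random(0)
early = [select_decoder_input(0, "view", "content", rng) for _ in range(2000)]
late = [select_decoder_input(WARMUP_STEPS, "view", "content", rng) for _ in range(100)]
print(early.count("view") / len(early), set(late))
```

Because the encoder receives a reconstruction signal directly through the view-specific slots in roughly half of the early steps, it is not forced to rely on a still-untrained aggregator.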

Figure 10. Visual illustrations of benchmark results on CLEVR-AUG dataset.

G.6 Hyperparameters

In Table 5, we detail all the hyperparameters used in our experiments. For the benchmark experiments, we use a trainable CNN encoder as in (Locatello et al., 2020b; Kori et al., 2023), while for the proposed MVMOVI datasets we use a DINO (Caron et al., 2021) encoder to extract image features and change our objective to reconstruct these features rather than the original image, as proposed in (Seitzer et al., 2022). For most hyperparameters, we use the values suggested by (Locatello et al., 2020b; Seitzer et al., 2022), based on their ablation results.

Figure 11. Visual illustrations of benchmark results on CLEVR-MV dataset.

Figure 12. Influence of the number of viewpoints on identifiability for synthetic datasets. Panels: CLEVR-AUG, CLEVR-MV, and GQN; x-axis: number of views (V = 1, 3, 5, 7); curves: SMCC, INV-SMCC, and MCC.

Figure 13. Visual illustrations of benchmark results on MVMOVI-C dataset with 2 views.

G.7 Computational Resources

We run all our experiments on a cluster with NVIDIA L40 48GB GPU cards. Training usually takes between eight hours and a couple of days, depending on the model and the dataset. Note that the speed might differ slightly depending on the system and the background processes. All experimental scripts will be made available on GitHub at a later stage.

Figure 14. Visual illustrations of benchmark results on MVMOVI-C dataset with 3 views.

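Table 4. Identifiability and generalisability analysis on the MVMOVI-D dataset.

| METHOD | IN-DOMAIN mBO | IN-DOMAIN SMCC | IN-DOMAIN INV-SMCC | IN-DOMAIN MCC | OUT-OF-DOMAIN mBO | OUT-OF-DOMAIN SMCC | OUT-OF-DOMAIN INV-SMCC | OUT-OF-DOMAIN MCC |
|---|---|---|---|---|---|---|---|---|
| SA-MLP | 0.24 ± 0.031 | 0.44 ± 0.005 | - | - | 0.24 ± 0.097 | 0.45 ± 0.008 | - | - |
| PSA-MLP | 0.26 ± 0.022 | 0.44 ± 0.006 | - | - | 0.25 ± 0.012 | 0.42 ± 0.006 | - | - |
| VISA-MLP | 0.24 ± 0.099 | 0.48 ± 0.009 | 0.46 ± 0.054 | 0.57 ± 0.021 | 0.25 ± 0.011 | 0.48 ± 0.006 | 0.51 ± 0.021 | 0.55 ± 0.021 |
| SA-TRANSFORMER | 0.34 ± 0.017 | 0.40 ± 0.041 | - | - | 0.34 ± 0.066 | 0.38 ± 0.031 | - | - |
| PSA-TRANSFORMER | 0.37 ± 0.021 | 0.38 ± 0.007 | - | - | 0.36 ± 0.024 | 0.36 ± 0.016 | - | - |
| VISA-TRANSFORMER | 0.39 ± 0.016 | 0.46 ± 0.001 | 0.48 ± 0.001 | 0.54 ± 0.032 | 0.37 ± 0.051 | 0.46 ± 0.022 | 0.45 ± 0.010 | 0.54 ± 0.029 |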

Table 5. Experimental details w.r.t. datasets.

| PARAMETERS | VALUES |
|---|---|
| No. Layers | 4 |
| No. Views | 10 (GSO: 8) |
| No. Slots | 7 |
| Training Epochs | 5000 |
| Batch Size | 32 |
| Optimizer | ADAM |
| Learning Rate | 0.0002 |
| Initial Slot µ | $\mathcal{N}(0, 1)$ |
| Initial Slot σ | $I$ |
| Warmup Steps | 10000 |
| Decoder | SPATIAL BROADCASTING-CNN |
| x likelihood | $\mathcal{N}(\mu_x, \sigma_x^2 I)$ |

| PARAMETERS | VALUES |
|---|---|
| No. Layers | 4 |
| No. Views | 10 |
| No. Slots | 4 |
| Training Epochs | 5000 |
| Batch Size | 64 |
| Optimizer | ADAM |
| Learning Rate | 0.0002 |
| Initial Slot µ | $\mathcal{N}(0, 1)$ |
| Initial Slot σ | $I$ |
| Warmup Steps | 10000 |
| Decoder | SPATIAL BROADCASTING-CNN |
| x likelihood | $\mathcal{N}(\mu_x, \sigma_x^2 I)$ |

MVMOVI-C, MVMOVI-D

| PARAMETERS | VALUES |
|---|---|
| No. Layers | 4 |
| No. Views | 5 |
| No. Slots | 7 |
| Training Epochs | 560 |
| Batch Size | 64 |
| Optimizer | ADAMW |
| Learning Rate | 0.0002 |
| Initial Slot µ | $\mathcal{N}(0, 1)$ |
| Initial Slot σ | $I$ |
| Warmup Steps | 10000 |
| Pretrained Encoder | DINO_VITB16 |
| Decoder | MLP, TRANSFORMER |
| x likelihood | $\mathcal{N}(\mu_x, I)$ |