# Object Scene Representation Transformer

Mehdi S. M. Sajjadi, Daniel Duckworth, Aravindh Mahendran, Sjoerd van Steenkiste, Filip Pavetić, Mario Lučić, Leonidas J. Guibas, Klaus Greff, Thomas Kipf
Google Research

A compositional understanding of the world in terms of objects and their geometry in 3D space is considered a cornerstone of human cognition. Facilitating the learning of such a representation in neural networks holds promise for substantially improving labeled data efficiency. As a key step in this direction, we make progress on the problem of learning 3D-consistent decompositions of complex scenes into individual objects in an unsupervised fashion. We introduce the Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis. OSRT scales to significantly more complex scenes with larger diversity of objects and backgrounds than existing methods. At the same time, it is multiple orders of magnitude faster at compositional rendering thanks to its light field parametrization and the novel Slot Mixer decoder. We believe this work will not only accelerate future architecture exploration and scaling efforts, but also serve as a useful tool for both the object-centric and the neural scene representation learning communities.

1 Introduction

As humans, we interact with a physical world that is composed of macroscopic objects¹ situated in 3D environments. The development of an object-centric, geometric understanding of the world is considered a cornerstone of human cognition [32]: we perceive scenes in terms of discrete objects and their parts, and our understanding of 3D scene geometry is essential for reasoning about relations between objects and for skillfully interacting with them. Replicating these capabilities in machine learning models has been a major focus in computer vision and related fields [12, 22, 34], yet the classical paradigm of supervised learning poses several challenges: explicit supervision requires carefully annotated data at a large scale, and is subject to obstacles such as rare or novel object categories. Further, obtaining accurate ground-truth 3D scene and object geometry is challenging and expensive. Learning about compositional geometry merely by observing scenes and occasionally interacting with them (in the simplest case, by moving a camera through a scene) without relying on direct supervision is an attractive alternative.

As objects in the physical world are situated in 3D space, there has been a growing interest in combining recent advances in 3D neural rendering [25] and representation learning [7, 20] with object-centric inductive biases to jointly learn to represent objects and their 3D geometry without direct supervision [26, 33, 40]. A particularly promising setting for learning both about objects and 3D scene geometry from RGB supervision alone is that of novel view synthesis (NVS), where the task is to predict a scene's appearance from unobserved viewpoints. This task not only encourages a model to learn a geometrically consistent representation of a scene, but also has the potential to serve as an additional inductive bias for discovering objects without supervision.

Website: osrt-paper.github.io. Correspondence: osrt@msajjadi.com. Equal technical contribution.

¹ We use the term object in a very broad sense, capturing other physical entities such as embodied agents.
Figure 1: OSRT overview. A set of input views of a novel scene are processed by the SRT Encoder, yielding the Set-Latent Scene Representation (SLSR). The Slot Attention module converts the SLSR into the object-centric Slot Scene Representation. Finally, arbitrary views with 3D-consistent object instance decompositions are efficiently rendered by the novel Slot Mixer. The entire model is trained end-to-end with an L2 loss and no additional regularizers. Details in Sec. 2.

Prior methods combining object-centric inductive biases with 3D rendering techniques for NVS [26, 33, 40], however, fail to generalize to scenes of high visual complexity and face a significant shortcoming in terms of computational cost and memory requirements: common object-centric models decode each object independently, adding a significant multiplicative factor to the already expensive volumetric rendering procedure, which requires hundreds of decoding steps. This requirement of executing thousands of decoding passes for each rendered pixel is a major limitation that prohibits scaling this class of methods to more powerful models and thus to more complex scenes.

In this work, we propose the Object Scene Representation Transformer (OSRT), an end-to-end model for object-centric 3D scene representation learning. The integrated Slot Attention [23] module allows it to learn a scene representation that is decomposed into slots, each representing an object or part of the background. OSRT is based on SRT [29], utilizing a light field parametrization of space to predict pixel values directly from a latent scene representation in a single forward pass. Instead of rendering each slot independently, we introduce the novel Slot Mixer, a highly efficient object-aware decoder that requires just a single forward pass per rendered pixel, irrespective of the number of slots in the model. In summary, OSRT allows for highly scalable learning of object-centric 3D scene representations without supervision. Our core contributions are as follows:

- We present OSRT, a model for 3D-centric representation learning, enabling efficient and scalable object discovery in 3D scenes from RGB supervision. The model is trained purely with the simple L2 loss and does not necessitate further regularizers or additional knowledge such as depth maps [33] or explicit background handling [40].
- As part of OSRT, we propose the novel Slot Mixer, a highly efficient object-aware decoder that scales to large numbers of objects in a scene with little computational overhead.
- We study several properties of the proposed method, including robustness studies and advantages of using NVS to facilitate scene decomposition in complex datasets.
- In a range of experiments, from easier previously proposed scenes to complex many-object scenes, we demonstrate that OSRT achieves state-of-the-art object decomposition, outperforming prior methods both quantitatively and in terms of efficiency.

2 Method

We begin this section with a description of the proposed Object Scene Representation Transformer (OSRT) shown in Fig. 1 and its novel Slot Mixer decoder, and conclude with a discussion of possible alternative design choices.
2.1 Novel view synthesis

The starting point for our investigations is the Scene Representation Transformer (SRT) [29] as a geometry-free novel view synthesis (NVS) backbone that provides instant novel-scene generalization and scalability to complex datasets. SRT is based on an encoder-decoder architecture. A data point consists of a set of RGB input images $\{I_i \in \mathbb{R}^{H \times W \times 3}\}$ from the same scene.² A convolutional network CNN independently encodes each image into a feature map, all of which are finally flattened and combined into a single set of tokens. An Encoder Transformer $E$ then performs self-attention on this feature set, ultimately yielding the Set-Latent Scene Representation (SLSR)

$$\{z_j \in \mathbb{R}^d\} = E\big(\{\mathrm{CNN}(I_i)\}\big). \quad (1)$$

Novel views are rendered using a 6D light-field parametrization $r = (o, d)$ of the scene. Each pixel to be rendered is described by the camera position $o$ and the normalized ray direction $d$ pointing from the camera through the center of the pixel in the image plane. As shown in Fig. 2 (left), the Decoder Transformer $D$ uses these rays $r$ as queries to attend into the SLSR, thereby aggregating localized information from the scene, and ultimately produces the RGB color prediction

$$C(r) = D\big(r \mid \{z_j\}\big). \quad (2)$$

Given a dataset of images $\{I^{gt}_{s,i}\}$ from different scenes indexed by $s$, the model is trained end-to-end using an L2 reconstruction loss for novel views:

$$\mathcal{L} = \sum_{s,i} \mathbb{E}_{r \sim I^{gt}_{s,i}} \big\| C(r) - I^{gt}_{s,i}(r) \big\|_2^2. \quad (3)$$

2.2 Scene decomposition

SRT's latent representation, the SLSR, has been shown to contain enough information to perform downstream tasks such as semi-supervised semantic segmentation [29]. However, the size of the SLSR is directly determined by the number and resolution of the input images, and there is no clear one-to-one correspondence between the SLSR tokens and objects in the scene.

To obtain an object-centric scene representation, we incorporate the Slot Attention [23] module into our architecture. Slot Attention converts the SLSR $\{z_j\}$ into the Slot Scene Representation (Slot SR), a set of object slots $\{s_n \in \mathbb{R}^h\}$. Different from the size of the SLSR, the size $N$ of the Slot SR is chosen by the user. We initialize the set of object slots using learned embeddings $\{\hat{s}_n \in \mathbb{R}^h\}$. The Slot Attention module then takes the following form for learned linear projections $W_v$, $W_z$, and $W_s$ and a learned update function $U$:

$$s_n = U\!\left(\hat{s}_n,\; \frac{\sum_{j=1}^{J} A_{n,j}\, W_v z_j}{\sum_{j=1}^{J} A_{n,j}}\right), \quad \text{with} \quad A_{n,j} = \frac{\exp\big((W_s \hat{s}_n)^\top W_z z_j\big)}{\sum_{l} \exp\big((W_s \hat{s}_l)^\top W_z z_j\big)}. \quad (4)$$

The attention matrix $A$ is used to aggregate input tokens using a weighted mean. Different from commonly used cross-attention [36], it is normalized over the output axis, i.e., the set of slots, instead of the input tokens. This enforces an exclusive grouping of input tokens into object slots, which serves as an inductive bias for decomposing the SLSR into individual per-object representations. Following Locatello et al. [23], we use a gated update for $U$ in the form of a GRU [6] followed by a residual MLP, and we apply Layer Norm [2] to both inputs and slots.
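To make the grouping step concrete, the following is a minimal PyTorch sketch of a Slot Attention module along the lines of Eq. (4). It is an illustrative reconstruction rather than the authors' implementation: the hidden sizes, the number of refinement iterations, the attention temperature, and the names (`SlotAttention`, `slots_init`, ...) are our own assumptions.

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """Minimal Slot Attention grouping step (cf. Eq. 4); illustrative sketch only."""
    def __init__(self, num_slots=11, slot_dim=128, token_dim=768, iters=3):
        super().__init__()
        self.slots_init = nn.Parameter(torch.randn(num_slots, slot_dim))  # learned init
        self.W_s = nn.Linear(slot_dim, slot_dim, bias=False)   # slot (query) projection
        self.W_z = nn.Linear(token_dim, slot_dim, bias=False)  # token (key) projection
        self.W_v = nn.Linear(token_dim, slot_dim, bias=False)  # token (value) projection
        self.norm_tokens = nn.LayerNorm(token_dim)
        self.norm_slots = nn.LayerNorm(slot_dim)
        self.gru = nn.GRUCell(slot_dim, slot_dim)               # gated update U ...
        self.mlp = nn.Sequential(nn.Linear(slot_dim, 4 * slot_dim), nn.ReLU(),
                                 nn.Linear(4 * slot_dim, slot_dim))  # ... plus residual MLP
        self.iters = iters  # assumption: iterative refinement as in Locatello et al.

    def forward(self, slsr):                          # slsr: [J, token_dim] SLSR tokens
        z = self.norm_tokens(slsr)
        k, v = self.W_z(z), self.W_v(z)               # [J, slot_dim]
        slots = self.slots_init                       # [N, slot_dim]
        for _ in range(self.iters):
            q = self.W_s(self.norm_slots(slots))      # [N, slot_dim]
            logits = q @ k.T / (q.shape[-1] ** 0.5)   # [N, J]; temperature is an assumption
            # Softmax over the slot axis (dim=0), not over tokens: slots compete for tokens.
            attn = torch.softmax(logits, dim=0)
            attn = attn / attn.sum(dim=1, keepdim=True).clamp(min=1e-8)  # weighted mean
            updates = attn @ v                        # [N, slot_dim]
            slots = self.gru(updates, slots)          # gated update
            slots = slots + self.mlp(slots)           # residual MLP
        return slots                                  # Slot Scene Representation [N, slot_dim]
```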
2.3 Efficient object-centric decoding

To be able to extract arbitrary-view object decompositions from the model, we propose the novel Slot Mixer (SM): a powerful, yet efficient object-centric decoder. The SM module is shown in Fig. 2 (center) and consists of three components: Allocation Transformer, Mixing Block, and Render MLP.

Allocation Transformer. The goal of the Allocation Transformer is to derive which slots are relevant for the given ray $r = (o, d)$. In essence, this module derives object locations and boundaries and resolves occlusions. Its architecture is similar to SRT's Decoder Transformer: it is a transformer that uses the target ray $r$ as the query to repeatedly attend into and aggregate features from the Slot SR:

$$x = D\big(r \mid \{s_n\}\big). \quad (5)$$

Most compute in the Allocation Transformer is spent on the query, allowing it to scale gracefully to large numbers of objects in the Slot SR. However, unlike in SRT, its output is not used directly for the RGB color estimate.

² The images may optionally contain pose information, which we omit here for notational clarity.

Figure 2: Decoder architectures. Comparison between SRT, the novel Slot Mixer (SM), and Spatial Broadcast (SB) decoders. The SRT decoder uses an efficient Transformer that scales gracefully to large numbers of objects, but it fails to produce novel-view object decompositions. The commonly used SB model decodes each slot independently, leading to high memory and computational requirements. The proposed SM decoder combines SRT's efficiency with SB's object decomposition capabilities. Details in Secs. 2.3 and 2.4.

Mixing Block. Instead, the resulting feature $x$ is passed to the Mixing Block, which computes a normalized dot-product similarity $w$ with the Slot SR matrix $S \in \mathbb{R}^{N \times h}$, i.e., the slots in arbitrary, but fixed order. This similarity is then used to compute a weighted mean of the original slots:

$$Q = W_Q x, \quad K = W_K S^\top, \quad w = \mathrm{softmax}(K^\top Q), \quad \bar{s} = w^\top S, \quad (6)$$

where $W_Q$ and $W_K$ are learned linear projections. Notably, the weight $w_i$ of each slot is a scalar, and unlike in standard attention layers, no linear maps are computed for the slots, i.e., the Allocation Transformer is solely responsible for mixing the slots, not for decoding them. The slot weights can be seen as novel-view object assignments that are useful for visual inspection of the learned representation and for quantitative evaluation, see Sec. 4.

Render MLP. Finally, the Render MLP decodes the original query ray $r$ conditioned on the weighted mean $\bar{s}$ of the Slot SR into the RGB color prediction:

$$C(r) = M\big(r \mid \bar{s}\big). \quad (7)$$

We call the resulting model, shown in Fig. 1 with the SRT encoder, the incorporated Slot Attention module, and the Slot Mixer decoder, the Object Scene Representation Transformer (OSRT). All parameters are trained end-to-end using the L2 reconstruction loss as in Eq. (3).

2.4 Alternative decoder architecture

Our novel Slot Mixer differs significantly from the standard choice in the literature, where Spatial Broadcast (SB) decoders [37] (Fig. 2, right) are most commonly used in conjunction with Slot Attention [23, 33, 40]. We present an adaptation thereof to OSRT as an alternative to the SM decoder. In SB decoders, the query ray is decoded for each slot $s_n$ independently using a shared MLP $M$:

$$C_{n,r}, \alpha_{n,r} = M\big(r \mid s_n\big). \quad (8)$$

For each ray $r$, this produces color estimates $C_r \in \mathbb{R}^{N \times 3}$ and logits $\alpha_r \in \mathbb{R}^N$, with each value corresponding to a different slot. The final RGB color estimate is then calculated by a normalized, weighted mean:

$$w = \mathrm{softmax}(\alpha_r), \quad C(r) = w^\top C_r. \quad (9)$$

In the SB decoder, the slots therefore compete for each pixel through a softmax operation. We note a major disadvantage of this popular decoder design: it does not scale gracefully to larger numbers of objects or slots, as the full decoder has to be run on each slot. This implies a linear increase in the memory and computational requirements of the entire decoder, which is often prohibitive, especially during training.
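To make the contrast between the two decoding paths concrete, the sketch below implements a Slot Mixer-style decoder (Eqs. 5-7) next to a Spatial Broadcast-style decoder (Eqs. 8-9) in PyTorch. This is our own illustrative reconstruction, not the authors' code: the single cross-attention layer standing in for the Allocation Transformer, the positional encoding, and all layer sizes are assumptions; only the overall structure follows the equations above. Note how the SB path runs its full MLP once per slot and per ray, whereas the SM path runs the Render MLP only once per ray.

```python
import torch
import torch.nn as nn

def posenc(x, num_octaves=8):
    """Simple sinusoidal encoding of rays r = (o, d); an assumption for illustration."""
    feats = [x]
    for i in range(num_octaves):
        feats += [torch.sin(2 ** i * x), torch.cos(2 ** i * x)]
    return torch.cat(feats, dim=-1)

class SlotMixerDecoder(nn.Module):
    """Slot Mixer path (Eqs. 5-7): one Render MLP pass per ray, regardless of #slots."""
    def __init__(self, slot_dim=128, ray_dim=6 * (1 + 2 * 8)):
        super().__init__()
        # Stand-in for the Allocation Transformer: ray queries cross-attend into the Slot SR.
        self.ray_embed = nn.Linear(ray_dim, slot_dim)
        self.alloc = nn.MultiheadAttention(slot_dim, num_heads=4, batch_first=True)
        self.W_Q = nn.Linear(slot_dim, slot_dim, bias=False)
        self.W_K = nn.Linear(slot_dim, slot_dim, bias=False)
        self.render_mlp = nn.Sequential(nn.Linear(slot_dim + ray_dim, 256), nn.ReLU(),
                                        nn.Linear(256, 3))

    def forward(self, rays, slots):                    # rays: [R, 6], slots: [N, slot_dim]
        r = posenc(rays)                               # [R, ray_dim]
        q = self.ray_embed(r).unsqueeze(0)             # [1, R, slot_dim]
        s = slots.unsqueeze(0)                         # [1, N, slot_dim]
        x, _ = self.alloc(q, s, s)                     # Eq. (5): per-ray feature x
        # Mixing Block (Eq. 6): one scalar weight per slot, then a weighted mean of slots.
        logits = self.W_Q(x) @ self.W_K(s).transpose(1, 2)     # [1, R, N]
        w = torch.softmax(logits, dim=-1)              # novel-view object assignments
        s_bar = w @ s                                  # [1, R, slot_dim]
        rgb = self.render_mlp(torch.cat([s_bar.squeeze(0), r], dim=-1))  # Eq. (7)
        return rgb, w.squeeze(0)                       # colors [R, 3], slot weights [R, N]

class SpatialBroadcastDecoder(nn.Module):
    """SB alternative (Eqs. 8-9): the full MLP runs once per slot and per ray."""
    def __init__(self, slot_dim=128, ray_dim=6 * (1 + 2 * 8)):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(slot_dim + ray_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 4))    # RGB + alpha logit per slot

    def forward(self, rays, slots):                    # rays: [R, 6], slots: [N, slot_dim]
        r = posenc(rays)                               # [R, ray_dim]
        inp = torch.cat([slots.unsqueeze(0).expand(r.shape[0], -1, -1),
                         r.unsqueeze(1).expand(-1, slots.shape[0], -1)], dim=-1)
        out = self.mlp(inp)                            # [R, N, 4]: N full decoder passes per ray
        rgb_n, alpha = out[..., :3], out[..., 3]
        w = torch.softmax(alpha, dim=-1)               # slots compete per pixel (Eq. 9)
        return (w.unsqueeze(-1) * rgb_n).sum(dim=1), w
```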
In practice, most pixels are fully explained by a single slot, i.e., almost all of the compute of the decoder is spent on resolving the most prominent slot, and the rest goes unused. The proposed Slot Mixer solves this by employing the scalable Allocation Transformer to blend between slots more efficiently, while only a single weighted mean of all slots must be fully decoded by the Render MLP. We further investigate the choice of decoders empirically in Sec. 4.2.

3 Related works

Neural rendering. Neural rendering is a large, promising field that investigates the use of machine learning for graphics applications [34]. Recently, NeRF [25] has sparked a renewed wave of interest in this field by optimizing an MLP to parameterize a single volumetric scene representation and demonstrating photo-realistic results on real-world scenes, later also for uncurated in-the-wild data [24]. Further methods based on NeRF are able to generalize across scenes by means of reprojecting 3D points into 2D feature maps [35, 39], though they lack a global 3D-based scene representation that could be readily used for downstream applications. Meanwhile, there have been early successes with global latent models [15, 31], though these rarely scale beyond simple datasets of single oriented objects on uniform backgrounds [20]. Alternative approaches produce higher-quality results by employing a computationally expensive auto-regressive generative mechanism that does not produce spatially or temporally consistent results [28]. Recently, the Scene Representation Transformer (SRT) [29] has been proposed as a global-latent model that efficiently scales to highly complex scenes by replacing the volumetric parametrization with a light field formulation.

Object-centric learning. Prior works on object-centric learning, such as MONet [3], IODINE [9], SPACE [21] and Slot Attention [23], have demonstrated that it is possible to learn models that decompose images into objects using simple unsupervised image reconstruction objectives. These methods typically use a structured latent space and employ dedicated inductive biases in their architectures and image decoders. To handle depth and occlusion in 3D scenes, these methods typically employ a generative model that handles occlusion via alpha-blending [9, 23] or by, e.g., generating objects ordered by their distance to the camera [1, 21]. We refer to the survey by Greff et al. [10] for an overview. Recent progress in this line of research applies these core inductive biases to images of more complex scenes [30] or video data [17, 19]. In our OSRT model, we make use of the ubiquitous Slot Attention [23] module because of its efficiency and effectiveness.

3D object-centric methods. Several recent works have extended self-supervised object-centric methods to 3D scenes [5, 33, 40]. One of the first such approaches is ROOTS [5], a probabilistic generative model that represents the scene in terms of a 3D feature map, where each 3D "patch" describes the presence (or absence) of an object. Each discovered object is independently rendered using a GQN [7] decoder and the final image is recomposed using Spatial Transformer Networks [14]. The prior works most relevant to our method are ObSuRF [33] and uORF [40], both combining learned volumetric representations [25] with Slot Attention [23] and Spatial Broadcast decoders [37]. Both methods have been shown to be capable of modeling 3D scenes of slightly higher complexity than CLEVR [16].
ObSuRF's architecture is based on NeRF-VAE [20]. It bypasses the extraordinary memory requirements of the Spatial Broadcast decoder combined with volumetric rendering during training by using ground-truth depth information, thereby needing only 2 samples per ray instead of hundreds. During inference, in the absence of ground-truth depth information, ObSuRF however suffers from the expected high computational cost of object-centric volumetric rendering. uORF is based on an encoder-decoder architecture that explicitly handles foreground (FG) and background (BG) separately through a set of built-in inductive biases and assumptions on the dataset. For instance, the dedicated BG slot is parameterized differently, and FG slots are encouraged to only produce density inside a manually specified area of the scene. The model is trained using a combination of perceptual losses, and optionally also with an adversarial loss [8]. In order to obtain the results for ObSuRF and uORF, we used the available official implementations, and consulted the respective authors of both methods to ensure correct adaptation to new datasets.

Finally, a separate line of work considers purely generative object-centric 3D rendering without the ability to render novel views of a specifically provided input scene. GIRAFFE [26] addresses this problem by combining volumetric rendering with a GAN-based [8] loss. It separately parameterizes object appearance and 3D pose for controllable image synthesis. Inspired by it, INFERNO [4] combines a generative model with a Slot Attention-based inference model to learn object-centric 3D scene representations. We do not explicitly compare to INFERNO as it is similar to ObSuRF and uORF, while only shown capable of modeling CLEVR-like scenes of lower visual complexity.

4 Experiments

To investigate OSRT's capabilities, we evaluate it on a range of datasets. After confirming that the proposed method outperforms existing methods on their comparably simple datasets, we move on to a more realistic, highly challenging dataset for all further investigations. We evaluate models by their novel view reconstruction quality and unsupervised scene decomposition capabilities, both qualitatively and quantitatively. We further investigate OSRT's computational requirements compared to the baselines and close this section with further analysis into which ingredients are crucial to enable OSRT's unsupervised scene decomposition capabilities in challenging settings.

Setting and evaluation metrics. The models are trained in a novel view synthesis (NVS) setup: on a dataset of scenes, we train the models to produce novel views of the same scene parameterized by target camera poses. For evaluation, we run the models on a held-out set of test scenes and render multiple novel views per scene. As quantitative metrics, we report PSNR for pixel-accurate reconstruction quality and adopt the standard foreground Adjusted Rand Index (FG-ARI) [13, 27] to measure object decomposition. Crucially, we compute FG-ARI on all rendered views together, such that object instances must be consistent between different views. This makes our metric sensitive to 3D inconsistencies that may especially plague light field models, which, in contrast to volumetric methods, do not explicitly enforce 3D consistency. We analyze and discuss this choice further in Sec. 4.2. For qualitative inspection of the inferred scene decompositions, we visualize for each rendered pixel the slot with the highest weight $w_i$ and color-code the slots accordingly.
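As a concrete illustration of this evaluation protocol, the sketch below computes PSNR and the multi-view FG-ARI, assuming that per-pixel ground-truth instance ids, predicted slot assignments, and a foreground mask are available for every rendered view; the function names are our own, not from the paper's code. The per-view variant at the end corresponds to the 2D-FG-ARI used for the consistency ratio in Sec. 4.2.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def psnr(pred, target):
    """PSNR in dB for images with values in [0, 1]."""
    mse = np.mean((pred - target) ** 2)
    return -10.0 * np.log10(mse)

def multiview_fg_ari(true_ids, pred_slot_ids, fg_mask):
    """FG-ARI over *all* rendered views of a scene at once.

    true_ids, pred_slot_ids, fg_mask: arrays of shape [num_views, H, W].
    Computing the ARI on the concatenation of all views makes the metric
    sensitive to 3D inconsistencies, e.g. a slot switching objects between views.
    """
    fg = fg_mask.astype(bool).ravel()
    return adjusted_rand_score(true_ids.ravel()[fg], pred_slot_ids.ravel()[fg])

def mean_2d_fg_ari(true_ids, pred_slot_ids, fg_mask):
    """2D-FG-ARI: the FG-ARI averaged over individual views (ignores cross-view consistency)."""
    scores = [adjusted_rand_score(t[m.astype(bool)], p[m.astype(bool)])
              for t, p, m in zip(true_ids, pred_slot_ids, fg_mask)]
    return float(np.mean(scores))
```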
Datasets. We run experiments on several datasets in increasing order of complexity.

Figure 3: Example views of scenes from CLEVR-3D (left) and MSN-Easy (right).

CLEVR-3D [33]. This is a recently proposed multi-camera variant of the CLEVR dataset, which is popular for evaluating object decomposition due to its simple structure and unambiguous objects. Each scene consists of 3-6 basic geometric shapes of 2 sizes and 8 colors, randomly positioned on a gray background. The dataset has 35k training and 100 test scenes, each with 3 fixed views: the two target views are the default CLEVR input view rotated by 120° and 240°, respectively.

MultiShapeNet-Easy (MSN-Easy) [33]. This dataset is similar in structure to CLEVR-3D; however, 2-4 upright ShapeNet objects sampled from the chair, table, and cabinet classes (for a total of 12k objects) now replace the geometric solids. The dataset has 70k training and 100 test scenes.

MultiShapeNet-Hard (MSN-Hard) [29]. MSN-Hard has been proposed as a highly challenging dataset for novel view synthesis. In each scene, 16-32 ShapeNet objects are scattered in random orientations. Realistic backgrounds and HDR environment maps are sampled from a total set of 382 assets. The cameras are randomly scattered on a half-sphere around the scene with varying distance to the objects. It is a highly demanding dataset due to its use of photo-realistic ray tracing [11], complex arrangements of tightly packed objects of varying size, challenging backgrounds, and nontrivial camera poses including almost horizontal views of the scene. The dataset has 1M training scenes, each with 10 views, and we use a test set of 1000 scenes. The 51k unique ShapeNet objects are taken from all classes, and they are separated into a train and test split, such that the test set not only contains novel arrangements, but also novel objects. We re-generated the dataset, as the original provided by Sajjadi et al. [29] does not include the ground-truth instance labels necessary for our quantitative evaluation. We will make the dataset publicly available.

Table 1: Quantitative results. OSRT outperforms ObSuRF across all metrics and datasets with the exception of CLEVR-3D, where FG-ARI is nearly identical. On MSN-Hard, OSRT is able to encode multiple input views to further improve object decomposition and reconstruction performance.

| Dataset | Model | PSNR | FG-ARI |
|---|---|---|---|
| CLEVR-3D [33] | ObSuRF | 33.69 | 0.978 |
| CLEVR-3D [33] | OSRT (1) | 39.98 | 0.976 |
| MSN-Easy [33] | ObSuRF | 27.41 | 0.940 |
| MSN-Easy [33] | OSRT (1) | 29.74 | 0.954 |
| MSN-Hard [29] | ObSuRF | 16.50 | 0.280 |
| MSN-Hard [29] | OSRT (1) | 20.52 | 0.619 |
| MSN-Hard [29] | OSRT (5) | 23.54 | 0.812 |

4.1 Comparison with prior work

Tab. 1 shows a comparison between OSRT and the strong ObSuRF [33] baseline. On the simpler datasets CLEVR-3D and MSN-Easy, ObSuRF produces reasonable reconstructions with accurate decomposition, though OSRT achieves significantly higher PSNR and similar FG-ARI. On the more realistic MSN-Hard, ObSuRF achieves only low PSNR and FG-ARI. OSRT, on the other hand, still performs solidly on both metrics. Furthermore, while ObSuRF can only be conditioned on a single image, OSRT is optionally able to ingest several. We report numbers for OSRT (5), our model with 5 input views, on the same dataset and see that it substantially improves both reconstruction and decomposition quality. At the same time, OSRT renders novel views at 32.5 fps (frames per second), more than 3000× faster than ObSuRF, which only achieves 0.01 fps, both measured on an Nvidia V100 GPU.
This speedup is the result of two multiplicative factors: OSRT's light field formulation is 100× faster than volumetric rendering, and the novel Slot Mixer is 30× faster than the SB decoder here. Finally, OSRT does not need ground-truth depth information during training.

Fig. 4 shows qualitative results for the realistic MSN-Hard dataset. It is evident that ObSuRF has reached its limits, producing blurry images with suboptimal scene decompositions. OSRT, on the other hand, still performs solidly, producing much higher-quality images and decomposing scenes reasonably well from a single input image. With more images, renders become sharper and decompositions more accurate.

As the second relevant prior method, we have performed experiments with uORF [40]. Despite guidance from the authors, uORF failed to scale meaningfully to the most interesting MSN-Hard dataset. This mainly resulted from a lack of model capacity. Additionally, the large number of objects in this setting led to prohibitive memory requirements for the method, forcing us to lower model capacity even further, or to run the model with fewer slots, to be able to fit it even on an Nvidia A100 GPU with 40 GB of VRAM. We describe this further with results in the appendix.

4.2 Ablations and model analysis

We conduct several studies into the behavior of the proposed model, including changes to the architecture, training setup, or data distribution. Unless stated otherwise, all investigations in this section are conducted on the MSN-Hard dataset with 5 input views. Further qualitative results for these experiments are provided in the appendix.

Table 2: OSRT decoder variants. Slot Mixer combines SRT's efficiency with SB's object decomposition abilities. Results on simpler datasets are shown in Tab. 4.

| Decoder (MSN-Hard) | PSNR | FG-ARI | FPS |
|---|---|---|---|
| SRT Decoder | 24.40 | 0.330 | 40.98 |
| Spatial Broadcast (SB) | 23.35 | 0.801 | 1.39 |
| Slot Mixer (SM) | 23.54 | 0.812 | 32.47 |

Decoder architecture. As described in Sec. 2.4 and shown in Fig. 2, the novel Slot Mixer decoder differs significantly from the default SRT decoder [29], as well as from the Spatial Broadcast (SB) decoder that is commonly used in conjunction with Slot Attention. To show the strengths of the SM decoder, we compare it with these alternative decoders swapped into the model, see Tab. 2.

Figure 4: Qualitative results on MSN-Hard. Comparison of our method with one and five input views with ObSuRF, which can only operate on a single image. OSRT produces sharper renders with better scene decompositions.

While the SRT decoder achieves slightly higher reconstruction quality due to the powerful transformer, it does not yield useful object decompositions as a result of the global information aggregation across all slots, disincentivizing object separation in the slots. It is the key design choice in Slot Mixer to only use a transformer for slot mixing, but not for decoding the slots, that leads to good decomposition. The SB decoder performs similarly to Slot Mixer both in terms of reconstruction and decomposition. However, due to the slot-wise decoding, it requires considerably more memory, which can often be prohibitive during training and hampers scalability to more complex datasets or large numbers of objects. For the same reason, it also requires significantly more compute at training and for inference.

Role of novel view synthesis. Our experiments have demonstrated OSRT's strengths in the default novel view synthesis (NVS) setup. We now investigate the role of NVS in scene decomposition on the complex MSN-Hard dataset.
To this end, we surgically remove the NVS component from the method, while keeping all other factors unchanged, by training OSRT with the input views equaling the target views to be reconstructed. This effectively turns OSRT into a multi-2D-image auto-encoder, albeit with 3D-centric poses rather than pure 2D positional encoding for the ray parametrization. The resulting method achieves a much better PSNR of 28.14, 4.60 dB higher than OSRT in the NVS setup (23.54). This is expected, as the model only needs to reconstruct the target images rather than generate arbitrary novel views of the scene. However, the model fails to decompose the scene into objects, only achieving an FG-ARI of 0.198 compared to OSRT's FG-ARI of 0.812, demonstrating the advantage of using NVS as an auxiliary task for unsupervised scene decomposition.

Figure 5: OSRT trained on MSN-Hard (in color) is evaluated with grayscale input images at test time. We observe that the model generalizes remarkably well to this out-of-distribution setting. This indicates that OSRT does not just rely on color for decomposition.

Robustness. We begin with a closer glance at the results obtained by OSRT (5) on MSN-Hard based on the number of objects in the scene. We find that the FG-ARI scores for scenes with the smallest (16) and largest (31) number of objects are within an acceptable range: 0.854 vs. 0.753. To explore whether OSRT mainly relies on RGB color for scene decomposition, we evaluate it on a grayscale version of the difficult test set of MSN-Hard. Note that the model was trained on RGB and has not encountered grayscale images at all during training. We find that OSRT generalizes remarkably well to this out-of-distribution setting, still producing satisfactory reconstructions and scene decompositions that are almost up to par with the colored test set, at an FG-ARI of 0.780 vs. 0.812 on the default RGB images. Fig. 5 shows an example result for this experiment.

We also consider to what extent OSRT is capable of leveraging additional context at test time in the form of additional input views. We train OSRT (3) with three input views and then evaluate it using three (PSNR: 22.75, FG-ARI: 0.794) and five input views (PSNR: 23.47, FG-ARI: 0.813). These results clearly indicate that additional input views can be used to boost performance at test time. Finally, we train OSRT in two setups inspired by SRT variants [29]: UpOSRT, trained without input view poses, and VOSRT, using volumetric rendering. We find that both of these achieve good reconstruction quality at PSNRs of 22.42 and 21.38 and meaningful scene decompositions at 0.798 and 0.767 FG-ARI, respectively.

Figure 6: Novel view (left) with a slot removed (center) or a slot added from another scene (right).

Scene editing. We investigate OSRT's ability to perform simple scene editing. Fig. 6 (left) shows a novel view rendered by OSRT. When the slot corresponding to the large object is removed from the Slot SR (center), the rendered image reveals previously occluded objects. We can go one step further by adding a slot from a different scene (right) to the Slot SR, leading to the object being rendered in place with correct occlusions.
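A minimal sketch of this kind of slot-level edit, assuming a trained model that exposes the Slot SR and a decoder with the interface of the Slot Mixer sketch from Sec. 2.3; the function and argument names are placeholders, not the authors' API.

```python
import torch

def edit_and_render(decoder, slots_a, slots_b, target_rays,
                    removed_slot_index=0, added_slot_index=0):
    """Slot-level scene edits as in Fig. 6 (hypothetical interface).

    decoder: follows the SlotMixerDecoder sketch, slots_a / slots_b: Slot SRs of two scenes.
    """
    # Remove one slot: previously occluded content behind that object reappears.
    keep = torch.arange(slots_a.shape[0]) != removed_slot_index
    rgb_removed, _ = decoder(target_rays, slots_a[keep])

    # Insert a slot from scene B: the object is rendered in place with correct occlusions.
    edited = torch.cat(
        [slots_a, slots_b[added_slot_index:added_slot_index + 1]], dim=0)
    rgb_added, _ = decoder(target_rays, edited)
    return rgb_removed, rgb_added
```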
3D consistency. An important question with regard to novel view synthesis is whether the resulting scene and object decompositions are 3D-consistent. Prior work has demonstrated SRT's spatial consistency despite its light field formulation [29]. Here, we investigate OSRT's 3D consistency with regard to the learned decomposition. To measure this quantitatively, we compute the ratio between the FG-ARI as reported previously (computed on all views together) and the 2D-FG-ARI, which is the average of the FG-ARI scores for each individual target view. While FG-ARI takes 3D consistency into account, 2D-FG-ARI is unaffected by 3D inconsistencies such as permuting slot assignments between views. By definition, 2D-FG-ARI is an upper bound on FG-ARI. Hence, an FG-ARI ratio of 1.0 indicates that the novel views are perfectly 3D-consistent, while lower values indicate that some inconsistency took place. We observe that our approach consistently achieves very high 3D consistency in scene decompositions, with an FG-ARI ratio of 0.940 on the challenging MSN-Hard dataset. With 5 input views, OSRT achieves an even higher ratio of 0.987. Despite its volumetric parametrization, we find that ObSuRF often fails to produce consistent assignments, with an FG-ARI ratio of only 0.707.

Table 3: OSRT generalization on CLEVR-3D. OSRT trained on scenes with 3-6 objects is tested on scenes with 3-6 and 7-10 objects, respectively. Both with learned and random slot initializations, OSRT generalizes reasonably to the presented out-of-distribution (OOD) setting with more objects than during training. The slight drop in performance is mostly a result of more pixels being covered by objects rather than the simpler background.

| Test setting | Slot initialization | PSNR | FG-ARI |
|---|---|---|---|
| 3-6 objects (IID) | Learned init. | 40.84 | 0.996 |
| 3-6 objects (IID) | Random init. | 38.14 | 0.988 |
| 7-10 objects (OOD) | Learned init. | 33.03 | 0.955 |
| 7-10 objects (OOD) | Random init. | 33.36 | 0.968 |

Out-of-distribution generalization. We investigate the ability of the model to generalize to more objects at test time than were observed at training time. To this end, we trained OSRT either with 7 randomly initialized slots, or with 11 slots using a learned initialization, on CLEVR-3D scenes containing up to 6 objects. At test time, we evaluate on scenes containing 7-10 objects by using 11 slots for both model variants. The results are shown in Tab. 3. We find that generalization performance in terms of PSNR and FG-ARI is similarly good for both models, with a slight advantage for the model variant with random slot initialization. In terms of in-distribution performance, learned initialization appears to have an advantage in both metrics.

4.3 Limitations

In our experimental evaluation of OSRT, we came across the following two limitations that are worth highlighting: 1) while the object segmentation masks produced by OSRT often tightly enclose the underlying object, this is not always the case, and we find that emergent masks can "bleed" into the background, and 2) OSRT can, for some architectural choices, fall into a failure mode in which it produces a 3D-spatial Voronoi tessellation of the scene instead of a clear object segmentation, resulting in a substantial drop in FG-ARI. While the alternative, yet much more inefficient, SB decoder does not seem to be affected by this, mask bleeding [9] effects and tessellation failure modes [18] are not uncommon in object-centric models. We show more examples and comparisons between the different architectures in the appendix.

5 Conclusion

We present the Object Scene Representation Transformer (OSRT), an efficient and scalable architecture for unsupervised neural scene decomposition and rendering. By leveraging recent advances in object-centric and neural scene representation learning, OSRT enables decomposition of complex visual scenes, far beyond what existing methods are able to address.
Moreover, its novel Slot Mixer decoder enables highly efficient novel view synthesis, making OSRT more than 3000× faster than prior works. Future work has the potential to elevate such methods beyond modeling static scenes, instead allowing for moving objects. We believe that our contributions will significantly support model design and scaling efforts for object-centric geometric scene understanding.

Acknowledgments

We thank Karl Stelzner and Hong-Xing Yu for their responsive feedback and guidance on the respective baselines, Mike Mozer for helpful feedback on the manuscript, Noha Radwan and Etienne Pot for help and guidance with datasets, and the anonymous reviewers for the helpful feedback.

References

[1] Titas Anciukevicius, Christoph H Lampert, and Paul Henderson. Object-centric image generation with factored depths, locations, and appearances. arXiv preprint arXiv:2004.00642, 2020.
[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. NeurIPS Deep Learning Symposium, 2016.
[3] Christopher P Burgess, Loic Matthey, Nicholas Watters, Rishabh Kabra, Irina Higgins, Matt Botvinick, and Alexander Lerchner. MONet: Unsupervised scene decomposition and representation. arXiv preprint arXiv:1901.11390, 2019.
[4] Lluis Castrejon, Nicolas Ballas, and Aaron Courville. INFERNO: Inferring object-centric 3D scene representations without supervision, 2021. URL https://openreview.net/forum?id=YVa8X_2I1b.
[5] Chang Chen, Fei Deng, and Sungjin Ahn. ROOTS: Object-centric representation and rendering of 3D scenes. J. Mach. Learn. Res., 22, 2021.
[6] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
[7] SM Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S Morcos, Marta Garnelo, Avraham Ruderman, Andrei A Rusu, Ivo Danihelka, Karol Gregor, et al. Neural scene representation and rendering. Science, 2018.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
[9] Klaus Greff, Raphaël Lopez Kaufman, Rishabh Kabra, Nick Watters, Christopher Burgess, Daniel Zoran, Loic Matthey, Matthew Botvinick, and Alexander Lerchner. Multi-object representation learning with iterative variational inference. In ICML, 2019.
[10] Klaus Greff, Sjoerd van Steenkiste, and Jürgen Schmidhuber. On the binding problem in artificial neural networks. arXiv preprint arXiv:2012.05208, 2020.
[11] Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, Thomas Kipf, Abhijit Kundu, Dmitry Lagun, Issam Laradji, Hsueh-Ti (Derek) Liu, Henning Meyer, Yishu Miao, Derek Nowrouzezahrai, Cengiz Oztireli, Etienne Pot, Noha Radwan, Daniel Rebain, Sara Sabour, Mehdi S. M. Sajjadi, Matan Sela, Vincent Sitzmann, Austin Stone, Deqing Sun, Suhani Vora, Ziyu Wang, Tianhao Wu, Kwang Moo Yi, Fangcheng Zhong, and Andrea Tagliasacchi. Kubric: A Scalable Dataset Generator. In CVPR, 2022.
[12] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge University Press, 2003.
[13] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 1985.
[14] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In NeurIPS, 2015.
[15] Wonbong Jang and Lourdes Agapito. CodeNeRF: Disentangled Neural Radiance Fields for Object Categories. In ICCV, 2021.
[16] Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, pages 2901-2910, 2017.
[17] Rishabh Kabra, Daniel Zoran, Goker Erdogan, Loic Matthey, Antonia Creswell, Matt Botvinick, Alexander Lerchner, and Chris Burgess. SIMONe: View-invariant, temporally-abstracted object representations via unsupervised video decomposition. In NeurIPS, volume 34, 2021.
[18] Laurynas Karazija, Iro Laina, and Christian Rupprecht. ClevrTex: A texture-rich benchmark for unsupervised multi-object segmentation. In NeurIPS Track on Datasets and Benchmarks, 2021.
[19] Thomas Kipf, Gamaleldin F Elsayed, Aravindh Mahendran, Austin Stone, Sara Sabour, Georg Heigold, Rico Jonschkowski, Alexey Dosovitskiy, and Klaus Greff. Conditional object-centric learning from video. In ICLR, 2022.
[20] Adam Kosiorek, Heiko Strathmann, Daniel Zoran, Pol Moreno, Rosalia Schneider, Sona Mokrá, and Danilo Rezende. NeRF-VAE: A Geometry Aware 3D Scene Generative Model. In ICML, 2021.
[21] Zhixuan Lin, Yi-Fu Wu, Skand Vishwanath Peri, Weihao Sun, Gautam Singh, Fei Deng, Jindong Jiang, and Sungjin Ahn. SPACE: Unsupervised object-oriented scene representation via spatial attention and decomposition. In ICLR, 2020.
[22] Li Liu, Wanli Ouyang, Xiaogang Wang, Paul Fieguth, Jie Chen, Xinwang Liu, and Matti Pietikäinen. Deep learning for generic object detection: A survey. IJCV, 2020.
[23] Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-centric learning with slot attention. In NeurIPS, 2020.
[24] Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan Barron, Alexey Dosovitskiy, and Daniel Duckworth. NeRF in the Wild: Neural Radiance Fields for Unconstrained Photo Collections. In CVPR, 2021.
[25] Ben Mildenhall, Pratul Srinivasan, Matthew Tancik, Jonathan Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV, 2020.
[26] Michael Niemeyer and Andreas Geiger. GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields. In CVPR, 2021.
[27] William M Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336), 1971.
[28] Robin Rombach, Patrick Esser, and Björn Ommer. Geometry-free view synthesis: Transformers and no 3D priors. In ICCV, 2021.
[29] Mehdi S. M. Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan, Suhani Vora, Mario Lucic, Daniel Duckworth, Alexey Dosovitskiy, Jakob Uszkoreit, Thomas Funkhouser, and Andrea Tagliasacchi. Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations. In CVPR, 2022.
[30] Gautam Singh, Fei Deng, and Sungjin Ahn. Illiterate DALL-E learns to compose. In ICLR, 2022.
[31] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations. In NeurIPS, 2019.
[32] Elizabeth S Spelke and Katherine D Kinzler. Core knowledge. Developmental Science, 10(1):89-96, 2007.
[33] Karl Stelzner, Kristian Kersting, and Adam R Kosiorek. Decomposing 3D scenes into objects via unsupervised volume segmentation. arXiv preprint arXiv:2104.01148, 2021.
[34] Ayush Tewari, Justus Thies, Ben Mildenhall, Pratul Srinivasan, Edgar Tretschk, Yifan Wang, Christoph Lassner, Vincent Sitzmann, Ricardo Martin-Brualla, Stephen Lombardi, et al. Advances in neural rendering. Eurographics, 2022.
[35] Alex Trevithick and Bo Yang. GRF: Learning a General Radiance Field for 3D Scene Representation and Rendering. In ICCV, 2021.
[36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[37] Nicholas Watters, Loic Matthey, Christopher P Burgess, and Alexander Lerchner. Spatial broadcast decoder: A simple architecture for learning disentangled representations in VAEs. arXiv preprint arXiv:1901.07017, 2019.
[38] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In ICML, 2020.
[39] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural Radiance Fields from One or Few Images. In CVPR, 2021.
[40] Hong-Xing Yu, Leonidas J Guibas, and Jiajun Wu. Unsupervised discovery of object radiance fields. In ICLR, 2022.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See the section titled Limitations.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] See appendix.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [N/A]
   (b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] The new MSN-Hard dataset is published on the website.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See the experimental section and appendix.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No] The experiments require a lot of compute, so we use only a single seed.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See the experimental section and appendix.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [Yes]
   (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] The new MSN-Hard dataset is published on the website.
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] We only use synthetically generated datasets of common household objects.
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]