Published as a conference paper at ICLR 2022

ILLITERATE DALL·E LEARNS TO COMPOSE

Gautam Singh¹, Fei Deng¹ & Sungjin Ahn²
¹Rutgers University  ²KAIST
Correspondence to singh.gautam@rutgers.edu and sjn.ahn@gmail.com.

ABSTRACT

Although DALL·E has shown an impressive ability of composition-based systematic generalization in image generation, it requires a dataset of text-image pairs, and the compositionality is provided by the text. In contrast, object-centric representation models like the Slot Attention model learn composable representations without a text prompt. However, unlike DALL·E, their ability to systematically generalize for zero-shot generation is significantly limited. In this paper, we propose a simple but novel slot-based autoencoding architecture, called SLATE, that combines the best of both worlds: learning object-centric representations that allow systematic generalization in zero-shot image generation without text. As such, this model can also be seen as an illiterate DALL·E model. Unlike the pixel-mixture decoders of existing object-centric representation models, we propose to use an Image GPT decoder conditioned on the slots to capture complex interactions among the slots and pixels. In experiments, we show that this simple, easy-to-implement architecture, which does not require a text prompt, achieves significant improvements in in-distribution and out-of-distribution (zero-shot) image generation, and yields slot-attention structure that is qualitatively comparable to or better than that of models based on mixture decoders.
https://sites.google.com/view/slate-autoencoder
The implementation is available at https://github.com/singhgautam/slate.

1 INTRODUCTION

Unsupervised learning of compositional representations is a core ability of human intelligence (Yuille & Kersten, 2006; Frankland & Greene, 2020). Observing a visual scene, we perceive it not simply as a monolithic entity but as a geometric composition of key components such as objects, borders, and space (Kulkarni et al., 2015; Yuille & Kersten, 2006; Epstein et al., 2017; Behrens et al., 2018). Furthermore, this understanding of the structure of the scene composition enables zero-shot imagination, i.e., composing novel, counterfactual, or systematically manipulated scenes that are significantly different from the training distribution. As such, realizing this ability has been considered a core challenge in building a human-like AI system (Lake et al., 2017).

DALL·E (Ramesh et al., 2021) has recently shown an impressive ability to systematically generalize for zero-shot image generation. Trained with a dataset of text-image pairs, it can generate plausible images even from an unfamiliar text prompt such as "avocado chair" or "lettuce hedgehog", a form of systematic generalization in the text-to-image domain (Lake & Baroni, 2018; Bahdanau et al., 2018). However, from the perspective of compositionality, this success is somewhat expected because the text prompt already brings the composable structure. That is, the text is already discretized into a sequence of concept modules, i.e., words, which are composable, reusable, and encapsulated. DALL·E then learns to produce an image with global consistency by smoothly stitching over the discrete concepts via an Image GPT (Chen et al., 2020) decoder. Building on the success of DALL·E, an important question is whether we can achieve such systematic generalization for zero-shot image generation from images alone, without text. This would require realizing an ability lacking in DALL·E: extracting a set of composable representations from an image to play a role similar to that of word tokens.
The most relevant direction towards this is object-centric representation learning (Greff et al., 2019; Locatello et al., 2020; Lin et al., 2020b; Greff et al., 2020b). While this approach can obtain a set of slots from an image by reconstructing the image from those slots, we argue that its ability to systematically generalize to arbitrary reconfigurations of slots is significantly limited due to the way it decodes the slots.

Figure 1: Overview of our proposed model with respect to prior works. Left: In DALL·E, words in the input text act as the composable units for generating the desired novel image. The generated images have global consistency because, owing to the transformer decoder, each pixel depends non-linearly on all previous pixels and on the input word embeddings. Middle: Unlike DALL·E, which requires text supervision, Slot Attention provides an autoencoding framework in which object slots act as the composable units inferred purely from raw images. However, during rendering, the object slots are composed via a simple weighted sum of pixels obtained without any dependency on the other pixels and slots, which harms image consistency and quality. Right: Our model, SLATE, combines the best of both models. Like Slot Attention, it is free of text-based supervision, and like DALL·E, it produces novel image compositions with global consistency.

In this work, we propose a slot-based autoencoder architecture, called SLot Attention TransformEr (SLATE). SLATE combines the best of DALL·E and object-centric representation learning. That is, compared to previous object-centric representation learning, our model significantly improves the ability of composition-based systematic generalization in image generation. However, unlike DALL·E, we achieve this by learning object-centric representations from images alone instead of taking a text prompt, resulting in a text-free DALL·E model. To do this, we first analyze how the existing pixel-mixture decoders for learning object-centric representations suffer from the slot-decoding dilemma and the pixel-independence problem, and thus show limitations in achieving compositional systematic generalization. Inspired by DALL·E, we then hypothesize that resolving these limitations requires not only learning composable slots but also a powerful and expressive decoder that can model complex interactions among the slots and the pixels. Based on this investigation, we propose the SLATE architecture, a simple but novel slot autoencoding architecture that uses a slot-conditioned Image GPT decoder. We also propose a method to build a library of visual concepts from the learned slots. Similar to the text prompt in DALL·E, this allows us to program a form of sentence made up of visual concepts to compose an image. Furthermore, the proposed model is simple, easy to implement, and can be plugged into many other models. In experiments, we show that this simple architecture achieves various benefits such as significantly improved generation quality in both the in-distribution and the out-of-distribution settings. This suggests that the proposed framework deserves more attention and investigation in future object-centric models.
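The full architecture is specified later in the paper; as a rough conceptual sketch of the idea stated above (slots inferred from an image and an Image GPT-style transformer decoder conditioned on those slots), the forward pass could look as follows in PyTorch. This is our illustration, not the authors' implementation: it assumes the image has already been converted into discrete tokens (as in Section 2.3), uses a single cross-attention step as a stand-in for Slot Attention, and all dimensions and module choices are toy assumptions.

```python
# Conceptual sketch (assumed, not the released SLATE code): tokens -> slots -> slot-conditioned
# autoregressive transformer that reconstructs the tokens, trained with next-token cross-entropy.
import torch
import torch.nn as nn

class SlateSketch(nn.Module):
    def __init__(self, vocab_size=512, d_model=192, num_slots=4, num_tokens=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)            # embed discrete image tokens
        self.pos_emb = nn.Parameter(torch.zeros(1, num_tokens, d_model))
        self.slots_init = nn.Parameter(torch.randn(1, num_slots, d_model))
        # One cross-attention step standing in for Slot Attention (which iterates with
        # competition among slots); slots act as queries over the token embeddings.
        self.slot_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)   # slot-conditioned decoder
        self.head = nn.Linear(d_model, vocab_size)                  # next-token prediction
        self.bos = nn.Parameter(torch.zeros(1, 1, d_model))         # beginning-of-sequence input

    def forward(self, tokens):
        # tokens: (B, T) discrete codes produced by an image tokenizer (see Section 2.3).
        B, T = tokens.shape
        emb = self.tok_emb(tokens) + self.pos_emb[:, :T]
        slots, _ = self.slot_attn(self.slots_init.expand(B, -1, -1), emb, emb)
        # Teacher forcing: shift targets right and decode autoregressively with a causal mask,
        # cross-attending to the slots so every pixel token depends on all slots and all
        # previously generated tokens.
        inp = torch.cat([self.bos.expand(B, -1, -1), emb[:, :-1]], dim=1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        out = self.decoder(inp, memory=slots, tgt_mask=causal)
        logits = self.head(out)
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), tokens.reshape(-1))

loss = SlateSketch()(torch.randint(0, 512, (2, 256)))  # toy usage with random token ids
loss.backward()
```

Gradients from the reconstruction loss flow through the decoder into the slots, so the same objective shapes both the representation and the generator.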
The main contributions of the paper are four-fold: (1) we achieve the first text-free DALL·E model; (2) we propose the first object-centric representation learning model based on a transformer decoder; (3) we show that this significantly improves the systematic generalization ability of object-centric representation models; and (4) while achieving better performance, the proposed model is much simpler than previous approaches.

2 PRELIMINARIES

2.1 OBJECT-CENTRIC REPRESENTATION LEARNING WITH PIXEL-MIXTURE DECODER

A common framework for learning object-centric representations is a form of autoencoder (Locatello et al., 2020; Burgess et al., 2019; Lin et al., 2020b). In this framework, an encoder takes an input image and returns a set of object representation vectors, or slot vectors, $\{s_1, \ldots, s_N\} = f_\phi(x)$. A decoder then reconstructs the image by composing the decodings $g_\theta(s_n)$ of each slot, $\hat{x} = f_{\text{compose}}(g_\theta(s_1), \ldots, g_\theta(s_N))$. To encourage the emergence of object concepts in the slots, the decoder usually uses an architecture implementing an inductive bias about the scene composition, such as the pixel-mixture decoders.

Pixel-mixture decoders are the most common approach to constructing an image from slots. This method assumes that the final image is constructed as a pixel-wise weighted mean of the slot images. Specifically, each slot $n$ generates its own slot image $\mu_n$ and its corresponding alpha mask $\pi_n$. Then, the slot images are weighted-summed with the alpha-mask weights as follows:
$$\hat{x}_i = \sum_{n=1}^{N} \pi_{n,i}\,\mu_{n,i}, \quad \text{where} \quad \mu_n = g_\theta^{\text{RGB}}(s_n) \quad \text{and} \quad \pi_n = \frac{\exp g_\theta^{\text{mask}}(s_n)}{\sum_{m=1}^{N} \exp g_\theta^{\text{mask}}(s_m)},$$
where $i \in [1, HW]$ is the pixel index, with $H$ and $W$ the image height and width, respectively.

2.2 LIMITATIONS OF PIXEL-MIXTURE DECODERS

The pixel-mixture decoder has two main limitations. The first is what we term the slot-decoding dilemma. As there is no explicit incentive for the encoder to make a slot focus on the pixel area corresponding to an object, it is usually necessary to employ a weak decoder for $g_\theta^{\text{RGB}}$, such as the Spatial Broadcast Network (Watters et al., 2019), to make object concepts emerge in the slots. Limiting the capacity of the slot decoder prevents a slot from modeling multiple objects or the entire image, and instead incentivizes the slot to focus on a local area that shows a statistical regularity, such as an object of a simple color. However, this comes with a side effect: the weak decoder can blur the details of the generated image. When we use a more expressive decoder such as a CNN to prevent this, object disentanglement tends to fail, e.g., by producing slots that capture the whole image, hence a dilemma (see Appendix E.1).

The second problem is pixel independence. In pixel-mixture decoders, each slot's contribution to a generated pixel, i.e., $\pi_{n,i}$ and $\mu_{n,i}$, is independent of the other slots and pixels. This may not be an issue if the aim of the slots is only to reconstruct the input image. However, it becomes an issue when we consider that an important desired property of object representations is to use them as concept modules, i.e., to be able to reuse them in an arbitrary configuration in the same way that word embeddings can be reused to compose an arbitrary image in DALL·E. The lack of flexibility of the decoder due to pixel independence, however, prevents the slots from having such reconfigurability.
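To make this concrete, the following is a minimal sketch (our illustration, not the authors' code) of the mixture composition defined in Section 2.1, assuming each slot has already been decoded into an RGB image $\mu_n$ and mask logits $g_\theta^{\text{mask}}(s_n)$; the comments mark where the pixel independence enters.

```python
# Minimal sketch of the pixel-mixture composition; the per-slot decoder g_theta
# (e.g., a Spatial Broadcast Network) is assumed to have run already and is omitted.
import torch

def mixture_compose(rgb, mask_logits):
    """rgb: (B, N, 3, H, W) per-slot RGB decodings mu_n.
    mask_logits: (B, N, 1, H, W) per-slot mask logits g_theta^mask(s_n).
    Returns the reconstructed image x_hat of shape (B, 3, H, W)."""
    # Alpha masks pi_n: softmax over the slot dimension, taken independently at every pixel.
    alpha = torch.softmax(mask_logits, dim=1)
    # Weighted sum over slots. Each output pixel is computed from the slots' values at that
    # pixel alone, with no dependency on other pixels or on the other slots' RGB values --
    # this is the "pixel independence" limitation discussed above.
    return (alpha * rgb).sum(dim=1)

# Toy usage with random per-slot decodings (N = 5 slots).
x_hat = mixture_compose(torch.rand(2, 5, 3, 64, 64), torch.randn(2, 5, 1, 64, 64))
print(x_hat.shape)  # torch.Size([2, 3, 64, 64])
```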
For example, when decoding an arbitrary set of slots taken from different images, the rendered image would look like a mere superposition of individual object patches without global semantic consistency, producing a Frankenstein-like image, as the examples in Section 5 shall show. In contrast, this zero-shot global semantic consistency for concept reconfiguration is achieved in DALL·E by using the Image GPT decoder.

2.3 IMAGE GPT AND DALL·E

Image GPT (Chen et al., 2020a) is an autoregressive generative model for images implemented using a transformer (Vaswani et al., 2017; Brown et al., 2020). For computational efficiency, Image GPT first down-scales the image of size $H \times W$ by a factor of $K$ using a VQ-VAE encoder (van den Oord et al., 2017). This turns an image $x$ into a sequence of discrete image tokens $\{z_i\}$, where $i$ indexes the tokens in raster-scan order from 1 to $HW/K^2$. The transformer is then trained to model an autoregressive distribution over this token sequence by maximizing the log-likelihood $\sum_i \log p_\theta(z_i \mid z_{<i})$.
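The following is a rough, self-contained sketch of this tokenization step (our illustration, not the DALL·E or Image GPT code): the tiny convolutional encoder, the codebook size, and the plain nearest-neighbor assignment are assumptions for illustration, and the VQ-VAE training details (codebook and commitment losses, straight-through gradients) are omitted.

```python
# Toy VQ-style tokenizer: down-scale an H x W image by a factor K, then replace each
# feature vector by the index of its nearest codebook entry, yielding HW / K^2 discrete
# tokens in raster-scan order.
import torch
import torch.nn as nn

class ToyTokenizer(nn.Module):
    def __init__(self, vocab_size=256, d=32, K=4):
        super().__init__()
        self.enc = nn.Conv2d(3, d, kernel_size=K, stride=K)   # (H, W) -> (H/K, W/K)
        self.codebook = nn.Embedding(vocab_size, d)            # discrete code vectors

    def forward(self, x):
        feats = self.enc(x)                                    # (B, d, H/K, W/K)
        B, d, h, w = feats.shape
        flat = feats.permute(0, 2, 3, 1).reshape(B, h * w, d)  # raster-scan order
        # Squared distance of every feature to every codebook vector; argmin gives z_i.
        dists = (flat.unsqueeze(2) - self.codebook.weight).pow(2).sum(-1)  # (B, hw, V)
        return dists.argmin(dim=-1)                            # (B, HW / K^2) token ids

tokens = ToyTokenizer()(torch.rand(2, 3, 64, 64))
print(tokens.shape)  # torch.Size([2, 256]): 64*64 / 4^2 tokens per image
```

An Image GPT-style transformer is then trained on these token sequences with a next-token cross-entropy objective, which maximizes $\sum_i \log p_\theta(z_i \mid z_{<i})$.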