# Exploiting Relationship for Complex-scene Image Generation

Tianyu Hua1, Hongdong Zheng1, Yalong Bai1, Wei Zhang1*, Xiao-Ping Zhang2, Tao Mei1
1JD AI Research, 2Ryerson University
patrickhua.ty@gmail.com, {hongdongzheng, ylbai}@outlook.com, wzhang.cu@gmail.com, xzhang@ee.ryerson.ca, tmei@live.com
*Corresponding author

The significant progress on Generative Adversarial Networks (GANs) has facilitated realistic single-object image generation based on language input. However, complex-scene generation (with various interactions among multiple objects) still suffers from messy layouts and object distortions, due to diverse configurations in layouts and appearances. Prior methods are mostly object-driven and ignore the inter-relations among objects, which play a significant role in complex-scene images. This work explores relationship-aware complex-scene image generation, where multiple objects are inter-related as a scene graph. With the help of relationships, we propose three major updates to the generation framework. First, reasonable spatial layouts are inferred by jointly considering the semantics and relationships among objects. Compared to standard location regression, we show that relative scales and distances serve as a more reliable target. Second, since the relations between objects significantly influence an object's appearance, we design a relation-guided generator to generate objects reflecting their relationships. Third, a novel scene graph discriminator is proposed to guarantee the consistency between the generated image and the input scene graph. Our method tends to synthesize plausible layouts and objects, respecting the interplay of multiple objects in an image. Experimental results on the Visual Genome and HICO-DET datasets show that our proposed method significantly outperforms prior arts in terms of the IS and FID metrics. Based on our user study and visual inspection, our method is more effective in generating logical layouts and appearances for complex scenes.

## Introduction

In the past few years, text-to-image generation has drawn extensive research attention for its potential applications in art generation, computer-aided design, image manipulation, etc. However, such success is restricted to simple image generation containing a single object from a small domain, such as flowers, birds, and faces (Reed et al. 2016; Bao et al. 2017). Complex-scene generation, on the other hand, targets synthesizing realistic scene images out of complex sentences depicting multiple objects as well as their interactions. Nevertheless, generating complex scenes on demand is still far from mature based on recent studies (Johnson, Gupta, and Fei-Fei 2018; Xu et al. 2018; Li et al. 2019; Hinz, Heinrich, and Wermter 2019).

Figure 1: Relationship matters for complex-scene image generation. The same object pair (e.g., man and board) could show different object shapes, scene layouts, and appearances under different relationships.
Scene graph, a structured language representation, captures objects and their relationships described in a sentence (Xu et al. 2017). Such a representation has proven effective for image-text cross-modal tasks, such as structural image retrieval (Johnson et al. 2015; Schuster et al. 2015; Johnson, Gupta, and Fei-Fei 2018), image captioning (Yang et al. 2019; Li and Jiang 2019; Li et al. 2018), and visual question answering (Teney, Liu, and van Den Hengel 2017; Norcliffe-Brown, Vafeias, and Parisot 2018). In this work, we focus on complex-scene image generation from scene graphs. Although extensive work has been done on generating scene graphs from images (Xu et al. 2017; Zellers et al. 2018; Li et al. 2017; Zhang et al. 2017a), i.e., image scene graphs, reversely generating a complex-scene image from a scene graph remains challenging, due to the polymorphic nature of the one-to-many mapping from a given scene graph to multiple reasonable images with different scene layouts.

A general pipeline for scene-graph-based image generation usually consists of two stages (Johnson, Gupta, and Fei-Fei 2018). The first stage learns to synthesize a rough layout prediction from the scene-graph input. Usually, the object features are encoded with a graph module (Johnson, Gupta, and Fei-Fei 2018; Ashual and Wolf 2019), followed by a direct regression of bounding-box locations. At the second stage, a position-aware feature tensor, which combines the object features and the layout generated in the first stage, is fed into an image decoder for generating the final image. For enhancing the object appearances in generated images, Ashual and Wolf (2019) separate the appearance embedding from the layout embedding.

However, previous works on complex-scene generation heavily suffer from two fundamental problems: messy layouts and object distortion. 1) Messy layouts. Image generation models are expected to figure out a reasonable layout from the scene-graph input. However, there exist infinitely many reasonable layouts for a given scene graph. Directly fitting one specific layout introduces huge confusion and limits the generalization ability. As a result, existing methods still struggle with messy layouts in practice. 2) Distortion in object appearance. Due to the high diversity in object categories, layouts, and relationship dependencies, objects are often distorted during generation. For each object, the texture and local appearance should be inferred, respecting both its category and its allocated spatial arrangement. Moreover, complex and varied relations among different objects in the scene graph can further increase the diversity of shapes and appearances. As shown in Fig. 1, even with the same object pair, different relationships can lead to totally different scene layouts and appearances. Some works (Ashual and Wolf 2019) simplify the task by considering only a few simple spatial relationships among objects (such as "left of", "right of", or "above") while ignoring more complex relationships (such as verbs). Meanwhile, to reduce the complexity, some works consider only one specific stage of this task, such as layout generation from a scene graph (Jyothi et al. 2019) or image generation from a layout (Zhao et al. 2018; Sun and Wu 2019). None of these works takes the semantics and complex relationships among objects into account, which limits their application prospects.

In this work, we explore relationships to mitigate the above issues. We take both simple spatial relationships and complex semantic relationships into consideration. We observe that, across different realistic images, the relative scale and distance ratios between two interrelated objects from the same subject-relation-object triplet usually conform to a normal distribution with low variance, as in Fig. 2.
Even though humans show various poses and the skateboard can be oriented in different directions, the scale ratio between the two bounding boxes naturally clusters with very low variance. Thus, we first introduce relative scale ratios and distances for measuring the rationality of layouts generated from the scene graph. This means that all the various reasonable layouts corresponding to one specific scene graph can be measured under a common standard and yield very similar results.

Figure 2: Distributions of relative scale and distance for "man riding skateboard" and "man sitting on bench".

Building on this observation, we propose a Pair-wise Spatial Constraint Module for assisting layout generation. The Spatial Constraint Module is influenced jointly by object pairs and their corresponding relation. The objective of this module is to correct the layout by fitting the relative scale ratio and relative distance ratio between each interrelated object pair, in addition to the absolute position coordinates of each object. In this way, the spatial commonsense of a complex scene with multiple objects can be modeled.

Moreover, to strengthen the influence of relations on object appearance generation, we propose a Relation-guided Appearance Generator and a novel Scene Graph Discriminator for guiding image generation. Unlike a traditional discriminator that only judges whether an image is fake or not, our proposed discriminator has two main functions. One is to determine whether the objects in the generated image correspond to the objects described in the textual scene graph; the other is to discriminate whether the relations predicted among objects in the generated image correspond to the relationships described in the input scene graph. By feeding back these strong discriminative signals, our Scene Graph Discriminator ensures that the generated object patches align not only with fine-grained single-object information but also with the relation discrepancies among objects.
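As a minimal illustration of these two discrimination roles (a sketch under our own assumptions, not the paper's implementation; feature dimensions, head shapes, and names are ours), such a discriminator could expose an object-classification head and a relation-classification head over features pooled from generated object patches:

```python
import torch
import torch.nn as nn

class SceneGraphDiscriminator(nn.Module):
    """Illustrative two-head discriminator: object relevance and relation correspondence.
    Sizes and architecture are assumptions for exposition, not taken from the paper."""

    def __init__(self, feat_dim: int, num_classes: int, num_relations: int):
        super().__init__()
        # Head 1: does the generated object patch match its category in the scene graph?
        self.obj_head = nn.Linear(feat_dim, num_classes)
        # Head 2: does a (subject, object) feature pair exhibit the stated relation?
        self.rel_head = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_relations),
        )

    def forward(self, obj_feats, subj_idx, obj_idx):
        # obj_feats: (num_objects, feat_dim) features pooled from generated object patches
        obj_logits = self.obj_head(obj_feats)                            # (num_objects, num_classes)
        pair = torch.cat([obj_feats[subj_idx], obj_feats[obj_idx]], -1)  # one row per triplet
        rel_logits = self.rel_head(pair)                                 # (num_triplets, num_relations)
        return obj_logits, rel_logits
```

Cross-entropy between these logits and the categories and relations stated in the input scene graph would then provide the feedback signal to the generator.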
The main contributions can be summarized as follows:

- A novel pair-wise spatial constraint module with supervision of relative scale and distance between objects, for learning relationship-aware spatial perception.
- A relation-guided appearance generator module followed by a scene graph discriminator, for generating reasonable object patches respecting fine-grained object information and relation discrepancies.
- A general framework for synthesizing scene layouts and images from scene graphs. Experimental results on Visual Genome (Krishna et al. 2017) and the human-object interaction dataset HICO-DET (Chao et al. 2018) demonstrate that the complex-scene images generated by our proposed method follow common sense.

## Related Work

Image Synthesis from Sentence is a conditional image generation task whose conditional signal is natural language. Textual descriptions are traditionally fed directly to a recurrent model for semantic information extraction. After that, a generative model produces the result conditioned on this vectorized sentence representation. Most of these approaches specialize in single-object image generation (Reed et al. 2016; Zhang et al. 2017b; Xu et al. 2018), where the layout is simple and the object is usually centered and covers a large area of the image. However, generating realistic multi-object images conditioned on text descriptions remains difficult, since it requires very complex scene-layout generation and varied object-appearance mapping, and both the scene layout and the object appearances are heavily influenced by the spatial and semantic relationships across objects. Furthermore, encoding all information, including multiple object categories and the interactions among them, into one vector usually loses critical details. Meanwhile, directly decoding images from such an encoded vector hurts the interpretability of the model.

Scene Graph (Xu et al. 2017) is a directed graph that represents the structured relationships among objects in an explicit manner. Scene graphs have been widely used in many tasks such as image retrieval (Johnson et al. 2015) and image captioning (Anderson et al. 2016), serving as a medium that bridges the gap between language and vision.

Image Synthesis from Scene Graph (Johnson, Gupta, and Fei-Fei 2018) is a derivative of sentence-based multi-object image generation. Since the conditional signals are scene graphs, graph models are usually applied to extract the essential information from the textual scene graph. After that, the extracted features are directly used for regressing the scene layout and then serve as input to an image decoder for generating the final image (Ashual and Wolf 2019). Such a framework is applicable to generating images that contain multiple objects with simple spatial interactions. However, it still struggles to model reasonable, commonsense-following scene layouts and appearances, due to the implicit semantic relationships in the scene graph. In this paper, we focus on image generation from the textual scene graph. Different from previous methods, we highlight the impact of relationships among objects for dealing with messy layouts and varied object appearances.

A scene graph is denoted as $G = \{C, R, E\}$, where $C = \{c_1, c_2, ..., c_n\}$ indicates the nodes in the graph, and each $c_i \in C$ denotes the category embedding of an object or instance. Note that we use words like "object" or "instance" to refer to a broad range of categories, from "human" and "tree" to "sky", "water", etc. The edges of the graph are extracted as a relationship embedding set R. Two related objects $c_j$ and $c_k$ are associated with one relationship $r_{jk} \in R$, which leads to a triplet $\langle c_j, r_{jk}, c_k \rangle$ in the directed edge set E. Given a scene graph G and its corresponding image I, a scene-graph-based image generation model aims to generate an image $\hat{I}$ according to G by minimizing $D(I, \hat{I})$, where $D(I, \hat{I})$ measures the differences between I and $\hat{I}$.
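To make the notation concrete, the following is a minimal sketch (our own illustration, not the authors' code) of how a scene graph $G = \{C, R, E\}$ might be represented; the class, field, and category names are ours:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SceneGraph:
    """Minimal container for G = {C, R, E}; illustrative only."""
    object_categories: List[str]                          # C: one category label per node
    triplets: List[Tuple[int, str, int]] = field(default_factory=list)
    # E: directed edges (subject_idx, relation, object_idx); the relations form R.

    def relations_of(self, node_idx: int) -> List[str]:
        """All relations touching a given object node (later used to condition its appearance)."""
        return [r for s, r, o in self.triplets if node_idx in (s, o)]

# Example: "man riding skateboard", "man wearing helmet"
g = SceneGraph(
    object_categories=["man", "skateboard", "helmet"],
    triplets=[(0, "riding", 1), (0, "wearing", 2)],
)
print(g.relations_of(0))  # ['riding', 'wearing']
```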
A standard scene-graph-to-image generation task can be formulated as two separate tasks: a scene-graph-to-layout generation task, which extracts object features with spatial constraints from the scene graph, and an image generation task, which generates images conditioned on the predicted object features and the learned layout constraints, as shown in Fig. 3 (left). In this paper, we extend the traditional framework by emphasizing the influence of the relationships R on both object layout and object appearance generation. As shown in Fig. 3 (right), three novel modules are proposed:

- Pair-wise Spatial Constraint Module: a module for constraining layout generation according to the semantic information extracted from E.
- Relation-guided Appearance Generator: for each object $c_i$, we introduce the semantic information of its connected relationships $\{r_j \mid \langle c_i, r_j, \cdot \rangle \in E\}$ to influence the shape and appearance of the generated image of $c_i$.
- Scene Graph Discriminator ($D_{sg}$): a novel discriminator for strengthening the relevance of the generated image $\hat{I}$ to the appearances of the objects C and the relationships R in the edge set E.

Figure 3: Illustrations of the standard (left) and our (right) framework for scene graph to image generation. Left: directly generating layout and image based on object features extracted from the scene graph. Right: our proposed framework with object pair-wise spatial constraints and appearance supervision respecting relationships among objects.

## Layout Generator

The layout generator aims to predict a bounding box $b_i = (x_i, y_i, w_i, h_i)$ for each object $o_i$ in the given scene graph G, where $x_i, y_i, w_i, h_i$ specify the normalized center coordinates, width, and height in the ground-truth image I. In previous work, object representations are usually extracted from the scene-graph input and then passed to a box regression network to obtain the bounding-box predictions $\hat{b}_i = (\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i)$. The box regression network is trained by minimizing the objective

$$\sum_{i=1}^{n} \left\| b_i - \hat{b}_i \right\|_2, \tag{1}$$

which penalizes the L2 difference between the ground-truth and predicted boxes, where n denotes the number of objects. Since various reasonable layouts exist, as previously stated, the scene-graph-to-layout task requires addressing a challenging one-to-many mapping. Directly regressing the layout to the offsets of one specific bounding box hurts the generalization ability of the layout generator and makes it difficult to converge. In order to generate reasonable layouts, we relax the constraint of bounding-box offset regression and propose a novel spatial constraint module for ensuring the rationality of the layout. Our Pair-wise Spatial Constraint Module introduces two novel metrics for measuring how realistic a layout is.

1. Relative Scale Invariance. The scale of an object is represented by the diagonal length of its bounding box. For any given $\langle c_j, r_{jk}, c_k \rangle$ triplet, the ratio between the scale of the subject and the scale of the object in different images is often roughly the same, as shown in Fig. 4 (left). We formulate the relative scale between the layouts $b_j$ and $b_k$ as

$$\sqrt{w_j^2 + h_j^2} \Big/ \sqrt{w_k^2 + h_k^2}. \tag{2}$$

2. Relative Distance Invariance. Similar to relative scale, the relative distance measures the distance between the two objects in a triplet, normalized by their scales. The relative distance of a related object pair $c_j$ and $c_k$ in realistic images is also naturally clustered around one specific value, and the distributions of relative distance for different triplets usually have low variance, as shown in Fig. 4 (right). Since horizontal flips of images rarely alter spatial relationship distributions, we relax this constraint by using the absolute value of the horizontal coordinate difference. Most importantly, we normalize the distance by the summed scales of the object pair so that the zooming effect of object depth can be canceled out. Therefore, the relative distance between the layouts $b_j$ and $b_k$ can be formulated as

$$d_{jk} = \Big[\, |x_j - x_k|,\; y_j - y_k \,\Big]^{T} \Big/ \left( \sqrt{w_j^2 + h_j^2} + \sqrt{w_k^2 + h_k^2} \right). \tag{3}$$

Figure 4: Distributions of the variance of relative scale and relative distance among the top-100 triplets in the VG and HICO-DET datasets. Low diversity of relative scale and distance is observed, following the property of commonsense knowledge.
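The relative scale (Eq. 2) and relative distance (Eq. 3) can be computed directly from normalized box coordinates as defined above. The following is a minimal sketch of these two quantities (our own helper names, not the authors' code):

```python
import torch

def diag_scale(box: torch.Tensor) -> torch.Tensor:
    """Diagonal length of boxes given as (..., 4) tensors of (x, y, w, h)."""
    w, h = box[..., 2], box[..., 3]
    return torch.sqrt(w ** 2 + h ** 2)

def relative_scale(box_j: torch.Tensor, box_k: torch.Tensor) -> torch.Tensor:
    """Eq. (2): ratio of the subject scale to the object scale."""
    return diag_scale(box_j) / diag_scale(box_k)

def relative_distance(box_j: torch.Tensor, box_k: torch.Tensor) -> torch.Tensor:
    """Eq. (3): center offset (|dx|, dy) normalized by the summed scales."""
    dx = (box_j[..., 0] - box_k[..., 0]).abs()   # absolute value makes it flip-invariant
    dy = box_j[..., 1] - box_k[..., 1]
    norm = diag_scale(box_j) + diag_scale(box_k)
    return torch.stack([dx, dy], dim=-1) / norm.unsqueeze(-1)
```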
We have observed that a relationship in semantic form carries an inherent spatial constraint that has not been fully explored by others. For example, the relationship "holding" implies that the object should be within arm's reach of the subject instead of miles away. The relationship "walking on" strongly constrains the relative vertical arrangement between subject and object, whether it is "man walking on street" or "dog walking on floor". In other words, the relative scale and relative distance between two objects heavily depend on the relationship or interaction between them. Therefore, we devise a training scheme that explicitly leverages this constraint.

In this work, the scene graph G is first converted to object feature vectors C and relation embeddings R, and then fed into a Graph Convolutional Network (GCN). The GCN outputs an updated object-level feature vector $o_i = T(c_i, C_i, R_i)$ aggregated with relation information, where T is the graph convolutional operation, $C_i$ is the set of object embeddings connected to $c_i$, and $R_i$ is the set of embeddings for the relations between $c_i$ and $C_i$. That is, the output vector $o_i$ for an object $c_i$ depends jointly on the representations of the relationships and of all objects connected via graph edges. After that, we apply the updated object representation to generate the layout for object $c_i$ via $\hat{b}_i = B(o_i)$, where B is a bounding-box offset regression network. We construct a scale-distance objective for our proposed spatial constraint module to assist the training of B.
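One plausible instantiation of such a scale-distance objective (a sketch under our own assumptions rather than the paper's exact formulation) penalizes the deviation of the predicted relative scale and relative distance from their ground-truth counterparts for every related pair, on top of the absolute box regression of Eq. (1). It reuses the `relative_scale` and `relative_distance` helpers sketched earlier; the L1 penalty is an assumption:

```python
import torch
import torch.nn.functional as F

def spatial_constraint_loss(pred_boxes: torch.Tensor,
                            gt_boxes: torch.Tensor,
                            pairs: torch.Tensor) -> torch.Tensor:
    """Hypothetical pair-wise scale-distance penalty (not the paper's exact objective).

    pred_boxes, gt_boxes: (num_objects, 4) tensors of (x, y, w, h).
    pairs: (num_triplets, 2) long tensor of (subject_idx, object_idx), one row per edge in E.
    """
    pj, pk = pred_boxes[pairs[:, 0]], pred_boxes[pairs[:, 1]]
    gj, gk = gt_boxes[pairs[:, 0]], gt_boxes[pairs[:, 1]]
    # Match predicted relative scale (Eq. 2) and relative distance (Eq. 3) to ground truth.
    scale_term = F.l1_loss(relative_scale(pj, pk), relative_scale(gj, gk))
    dist_term = F.l1_loss(relative_distance(pj, pk), relative_distance(gj, gk))
    return scale_term + dist_term
```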