# BIFRÖST: 3D-Aware Image Compositing with Language Instructions

Lingxiao Li^1, Kaixiong Gong^1, Weihong Li^1, Xili Dai^2, Tao Chen^3, Xiaojun Yuan^4, Xiangyu Yue^1
^1 MMLab, The Chinese University of Hong Kong  ^2 The Hong Kong University of Science and Technology (Guangzhou)  ^3 Fudan University  ^4 University of Electronic Science and Technology of China
https://github.com/lingxiao-li/Bifrost

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

[Figure 1 panels, with the instructions "Place the apple behind the burger", "Replace the cat with the dog", "Place the toy behind the bucket and in front of the box", "Place the backpack to the left of the vase"; bottom panels: Identity-Preserved Inpainting and Identity Transfer.]
Figure 1: Bifröst results on various personalized image compositing tasks. Top: Bifröst is adept at precise, arbitrary object placement and replacement in a background image given a reference object and a language instruction, and achieves 3D-aware, high-fidelity, harmonized compositing results. Bottom left: given a coarse mask, Bifröst can change the pose of the object to follow the shape of the mask. Bottom right: our model adapts the identity of the reference image to the target image without changing the pose.

Abstract

This paper introduces Bifröst (in Norse mythology, a burning rainbow bridge that transports anything between Earth and Asgard), a novel 3D-aware framework built upon diffusion models to perform instruction-based image composition. Previous methods concentrate on image compositing at the 2D level and fall short in handling complex spatial relationships (e.g., occlusion). Bifröst addresses these issues by training an MLLM as a 2.5D location predictor and integrating depth maps as an extra condition during the generation process to bridge the gap between 2D and 3D, which enhances spatial comprehension and supports sophisticated spatial interactions. Our method begins by fine-tuning the MLLM with a custom counterfactual dataset to predict 2.5D object locations in complex backgrounds from language instructions. The image-compositing model is then uniquely designed to process multiple types of input features, enabling it to perform high-fidelity image compositions that account for occlusion, depth blur, and image harmonization. Extensive qualitative and quantitative evaluations demonstrate that Bifröst significantly outperforms existing methods, providing a robust solution for generating realistically composited images in scenarios demanding intricate spatial understanding. This work not only pushes the boundaries of generative image compositing but also reduces reliance on expensive annotated datasets by effectively utilizing existing resources in innovative ways.

1 Introduction

Image generation has flourished alongside the advancement of diffusion models (Song et al., 2021; Ho et al., 2020; Rombach et al., 2022; Ramesh et al., 2022). Recent works (Saharia et al., 2022; Liu et al., 2023b; Brooks et al., 2023; Zhang et al., 2023b; Huang et al., 2023; Li et al., 2023; Chen et al., 2024; He et al., 2024) add conditional controls, e.g., text prompts, scribbles, and skeleton maps, to diffusion models, offering significant potential for controllable image editing. Among these methods for image editing, generative object-level image compositing (Yang et al., 2023; Song et al., 2023; Chen et al., 2024; Song et al., 2024) is a novel yet challenging task that aims to seamlessly inject an external reference object into a given background image at a specific location, creating a cohesive and realistic image.
This capability is in high demand in practical applications including e-commerce, effect-image rendering, poster making, and professional editing. Achieving arbitrary, personalized object-level image compositing necessitates a deep understanding of both the visual identity of the reference object and the spatial relations within the background image. To date, this task has not been well addressed. Paint-by-Example (Yang et al., 2023) and ObjectStitch (Song et al., 2023) use a target image as the template to edit a specific region of the background image, but they cannot generate ID-consistent content, especially for untrained categories. On the other hand, Chen et al. (2024) and Song et al. (2024) generate objects with ID (identity) preserved in the target scene, but they fall short in processing complicated 3D geometric relations (e.g., occlusion) because they only consider 2D-level composition. To sum up, previous methods either 1) fail to achieve both ID preservation and background harmony, or 2) do not explicitly take into account the geometry behind the background and fail to accurately composite objects and backgrounds in complex spatial relations.

We conclude that the root cause of the aforementioned issues is that image composition is conducted at the 2D level. Ideally, the composition operation should be performed in 3D space to capture precise 3D geometric relationships. However, accurately modeling a 3D scene from any given image, especially from only one view, is non-trivial and time-consuming (Liu et al., 2023b). To address these challenges, we introduce Bifröst, which offers a 3D-aware framework for image composition without explicit 3D modeling. We achieve this by leveraging depth to indicate the 3D geometric relationship between the object and the background. In detail, our approach leverages a multi-modal large language model (MLLM) as a 2.5D location predictor (i.e., it predicts a bounding box and a depth value for the object in the given background image). With the predicted bounding box and depth, our method yields a depth map for the composited image, which is fed into a diffusion model as guidance. This enables our method to achieve good ID preservation and background harmony simultaneously, as it is now aware of the spatial relations between object and background, and conflicts along the depth dimension are eliminated. In addition, the MLLM enables our method to composite images from text instructions, which broadens the application scenarios of Bifröst. Bifröst achieves significantly better visual results than previous methods, which in turn validates our conclusion.

We divide the training procedure into two stages. In the first stage, we fine-tune an MLLM (e.g., LLaVA (Liu et al., 2023a, 2024)) for 2.5D prediction of objects in complex scenes with language instructions. In the second stage, we train the image composition model. To composite images with complicated spatial relationships, we introduce depth maps as conditions for image generation. In addition, we leverage an ID extractor to generate discriminative ID tokens and a frequency-aware detail extractor to extract high-frequency 2D shape information (detail maps) for ID preservation. We unify the depth maps, ID tokens, and detail maps to guide a pre-trained text-to-image diffusion model to generate the desired image composition results. This two-stage training paradigm allows
us to utilize a number of existing 2D datasets for common vision tasks, e.g., segmentation and detection, avoiding the collection of large-volume text-image data specifically designed for arbitrary object-level image compositing.

Our main contributions can be summarized as follows: 1) We are the first to embed depth into the image composition pipeline, which improves ID preservation and background harmony simultaneously. 2) We carefully build a counterfactual dataset and fine-tune an MLLM as a powerful tool to predict the 2.5D location of an object in a given background image; further, the fine-tuned MLLM enables our approach to understand language instructions for image composition. 3) Our approach demonstrates exceptional performance through comprehensive qualitative assessments and quantitative analyses on image compositing and outperforms other methods; Bifröst allows us to generate images with better control of occlusion, depth blur, and image harmonization.

2 Related Work

2.1 Image Compositing with Diffusion Models

Image compositing, an essential technique in image editing, seamlessly integrates a reference object into a background image, aiming for realism and high fidelity. Traditional methods, such as image harmonization (Jiang et al., 2021; Xue et al., 2022; Guerreiro et al., 2023; Ke et al., 2022) and blending (Pérez et al., 2003; Zhang et al., 2020, 2021; Wu et al., 2019), primarily ensure color and lighting consistency but inadequately address geometric discrepancies. The introduction of diffusion models (Ho et al., 2020; Sohl-Dickstein et al., 2015; Song et al., 2021; Rombach et al., 2022) has shifted the focus towards comprehensive frameworks that address all facets of image compositing. Methods like (Yang et al., 2023; Song et al., 2023) often use CLIP-based adapters to leverage pre-trained models, yet they compromise the object's identity preservation, focusing mainly on high-level semantic representations. More recent studies prioritize maintaining appearance in generative object compositing. Notable developments in this field include AnyDoor (Chen et al., 2024) and ControlCom (Zhang et al., 2023a). AnyDoor integrates DINOv2 (Oquab et al., 2023) with a high-frequency filter, while ControlCom introduces a local enhancement module, both improving appearance retention. However, these approaches still face challenges in spatial correction capabilities. The most recent work, IMPRINT (Song et al., 2024), trains an ID-preserving encoder that enhances the visual consistency of the object while maintaining geometry and color harmonization. However, none of these works can handle composition involving occlusion and more complex spatial relations. In contrast, we propose a novel 3D-aware generative model that allows more accurate and complex image compositing while maintaining geometry and color harmonization.

2.2 LLM with Diffusion Models

The open-sourced LLaMA (Touvron et al., 2023; Chiang et al., 2023) has substantially advanced vision tasks that leverage Large Language Models (LLMs). Innovations such as LLaVA and MiniGPT-4 (Liu et al., 2023a; Zhu et al., 2024) have advanced image-text alignment through instruction tuning. While many MLLM-based studies have demonstrated effectiveness in text-generation tasks such as human-robot interaction, complex reasoning, and science question answering, GILL (Koh et al., 2023) acts as a conduit between MLLMs and diffusion models by enabling LLMs to process textual inputs and generate coherent images from them.
SEED (Ge et al., 2023b) introduces a novel image tokenizer that allows LLMs to handle and concurrently generate images and text, with SEED2 (Ge et al., 2023a) enhancing this process by better aligning generation embeddings with image embeddings from unCLIP-SD, thus improving the preservation of visual semantics and realistic image reconstruction. Emu (Sun et al., 2024), a multimodal generalist, is trained with a next-token-prediction objective. CM3Leon (Yu et al., 2023a), utilizing the CM3 multimodal architecture and training methods adapted from text-only models, excels in both text-to-image and image-to-text generation. Finally, SmartEdit (Huang et al., 2024) proposes a Bidirectional Interaction Module that enables comprehensive bidirectional information interaction between images and the MLLM output, allowing complex instruction-based image editing. Nevertheless, these methods require paired text-image data for training and support neither accurate spatial location prediction nor subject-driven image compositing.

3 Method

[Figure 2 diagram: from the instruction $c_T$ "Place the orange to the left and behind the apple, output the bounding box and the depth value of the center point", the MLLM outputs "BBox: [0.10, 0.32, 0.52, 0.74], Depth: 0.507"; a depth predictor and a depth-fusion module feed the diffusion model.]
Figure 2: Overview of the inference pipeline of Bifröst. Given a background image $I_{bg}$ and a text instruction $c_T$ that indicates the location at which the object should be composited into the background, the MLLM first predicts the 2.5D location, consisting of a bounding box and the depth of the object. A pre-trained depth predictor is then applied to estimate the depth of the given images. After that, the depth of the reference object is scaled to the depth value predicted by the MLLM and fused at the predicted location in the background depth map. Finally, the masked background image, the fused depth, and the reference object image are used as input to the compositing model, which generates an output image $I_{out}$ that satisfies the spatial relations in the text instruction $c_T$ and appears visually coherent and natural (e.g., with light and shadow consistent with the background image).

The overall pipeline of our Bifröst is elaborated in Fig. 2. Our method consists of two stages: 1) in stage 1, given the input images of the object and background, and a text instruction that indicates the location for compositing the object into the background, the MLLM is fine-tuned on our customized dataset to predict a 2.5D location, which provides a bounding box and a depth value for the object in the background; 2) in stage 2, Bifröst performs 3D-aware image compositing according to the predicted 2.5D location, the images of the object and background, and their depth maps estimated by a depth predictor. Because we divide the pipeline into two stages, we can adopt existing benchmarks collected for common vision problems and avoid the need to collect new, task-specific paired data. We detail our pipeline in the following sections. In Sec. 3.1 we discuss building our customized counterfactual dataset and fine-tuning multi-modal large language models to predict 2.5D locations given a background image. In Sec. 3.2, we introduce the 3D-aware image compositing pipeline that uses the spatial location predicted by the MLLM to seamlessly integrate the reference object into the background image. Finally, we discuss combining the two stages in Sec. 3.3 and show more application scenarios of our proposed method.
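To make the depth-fusion step of this pipeline concrete, the following is a minimal sketch (not the paper's released implementation) of how the MLLM-predicted bounding box and depth value from Fig. 2 could be combined with the estimated depth maps. The helper name `fuse_depth`, the use of OpenCV for resizing, and the mean-shift rescaling rule are illustrative assumptions; the actual fusion operator may differ.

```python
import numpy as np
import cv2  # used only for resizing; any resize routine would do


def fuse_depth(bg_depth: np.ndarray,   # (H, W) background depth in [0, 1]
               obj_depth: np.ndarray,  # (h, w) reference-object depth in [0, 1]
               obj_mask: np.ndarray,   # (h, w) binary object mask
               bbox: tuple,            # (x1, y1, x2, y2), normalized to [0, 1]
               target_depth: float) -> np.ndarray:
    """Paste the object's depth into the background depth map at the predicted
    bounding box, shifted so that its mean depth matches the MLLM prediction."""
    H, W = bg_depth.shape
    x1, y1, x2, y2 = bbox
    # Convert the normalized box to pixel coordinates.
    px1, py1 = int(x1 * W), int(y1 * H)
    px2, py2 = int(x2 * W), int(y2 * H)
    box_w, box_h = max(px2 - px1, 1), max(py2 - py1, 1)

    # Resize the object depth and mask to the box size.
    obj_d = cv2.resize(obj_depth, (box_w, box_h), interpolation=cv2.INTER_LINEAR)
    mask = cv2.resize(obj_mask.astype(np.float32), (box_w, box_h)) > 0.5

    # Rescale (here: shift) the object's depth so its mean over the mask equals
    # the predicted depth value, then clip back to the valid range. The exact
    # rescaling rule is an assumption for illustration.
    mean_d = obj_d[mask].mean() if mask.any() else obj_d.mean()
    obj_d = np.clip(obj_d - mean_d + target_depth, 0.0, 1.0)

    # Write the rescaled object depth into the background depth map.
    fused = bg_depth.copy()
    region = fused[py1:py1 + box_h, px1:px1 + box_w]  # view into `fused`
    region[mask] = obj_d[mask]
    return fused


# Example with the values shown in Fig. 2.
bg_depth = np.random.rand(512, 512).astype(np.float32)
obj_depth = np.random.rand(128, 128).astype(np.float32)
obj_mask = np.ones((128, 128), dtype=np.uint8)
fused = fuse_depth(bg_depth, obj_depth, obj_mask,
                   bbox=(0.10, 0.32, 0.52, 0.74), target_depth=0.507)
```

In the full pipeline, a fused map of this kind would be produced at the compositing model's input resolution and passed, together with the masked background image and the reference-object features, as guidance to the diffusion model described above.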
3.1 Fine-tuning MLLM to Predict 2.5D Location

Given a background image $I_{bg}$ and a text instruction $c_T$, which is tokenized as $H_T$, our goal is to obtain the 2.5D coordinate of the reference object we want to place. We denote the 2.5D coordinate as $l$, which consists of a 2D bounding box $b = [x_1, y_1, x_2, y_2]$ and an estimated depth value $d \in [0, 1]$. During the training stage, the majority of the parameters $\theta$ in the LLM are kept frozen, and we utilize LoRA (Hu et al., 2022) to carry out efficient fine-tuning. Subsequently, for a sequence of length $L$, we minimize the negative log-likelihood of the generated text tokens $X_A$, which can be formulated as:

$$\mathcal{L}_{\text{LLM}}(I_{bg}, c_T) = -\sum_{i=1}^{L} \log p_{\theta}\left(x_i \mid I_{bg}, c_T, X_{A,<i}\right),$$

where $X_{A,<i}$ denotes the answer tokens preceding the current prediction $x_i$.
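As a concrete illustration of the objective above, here is a minimal PyTorch sketch of the masked autoregressive negative log-likelihood over the answer tokens, together with a parser that recovers the bounding box and depth value from the MLLM's text output (using the answer format shown in Fig. 2). The label-masking convention (-100 for prompt and image positions) and the regex-based parser follow common LLaVA-style fine-tuning practice and are assumptions, not necessarily the paper's exact implementation.

```python
import re
import torch
import torch.nn.functional as F


def llm_nll_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood over the generated answer tokens only.

    logits: (B, L, V) next-token predictions from the LoRA-adapted MLLM.
    labels: (B, L) token ids, with prompt/image positions set to -100 so that
    only the answer tokens X_A contribute to the sum in the loss above.
    """
    # Shift so that position i predicts token i+1 (standard causal-LM setup).
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )


def parse_25d_location(answer: str):
    """Parse a bounding box and a depth value from the MLLM's text answer,
    e.g. "BBox: [0.10, 0.32, 0.52, 0.74] Depth: 0.507" (format assumed)."""
    nums = [float(x) for x in re.findall(r"\d*\.\d+|\d+", answer)]
    if len(nums) < 5:
        raise ValueError(f"Could not parse a 2.5D location from: {answer!r}")
    bbox, depth = nums[:4], nums[4]
    return bbox, depth


bbox, depth = parse_25d_location("BBox: [0.10, 0.32, 0.52, 0.74] Depth: 0.507")
print(bbox, depth)  # [0.1, 0.32, 0.52, 0.74] 0.507
```

Masking the prompt positions in the label tensor is what restricts the sum in the equation to the generated tokens $X_A$; the parsed bounding box and depth then feed the depth-fusion step sketched earlier.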