# Novel View Synthesis with Diffusion Models

Published as a conference paper at ICLR 2023

Daniel Watson (Google Research, Brain), William Chan (Google Research, Brain), Ricardo Martin-Brualla (Google Research), Jonathan Ho (Google Research, Brain), Andrea Tagliasacchi (Google Research, Brain), Mohammad Norouzi (Google Research, Brain)

## ABSTRACT

We present 3DiM, a diffusion model for 3D novel view synthesis, which is able to translate a single input view into consistent and sharp completions across many views. The core component of 3DiM is a pose-conditional image-to-image diffusion model, which is trained to take a source view and its pose as inputs, and generates a novel view for a target pose as output. 3DiM can then generate multiple views that are approximately 3D consistent using a novel technique called stochastic conditioning. At inference time, the output views are generated autoregressively; when generating each novel view, one selects a random conditioning view from the set of previously generated views at each denoising step. We demonstrate that stochastic conditioning significantly improves 3D consistency compared to a naïve sampler for an image-to-image diffusion model, which conditions on a single fixed view. We compare 3DiM to prior work on the SRN ShapeNet dataset, demonstrating that 3DiM's generated completions from a single view achieve much higher fidelity while being approximately 3D consistent. We also introduce a new evaluation methodology, 3D consistency scoring, which quantifies the 3D consistency of a generated object by training a neural field on the model's output views. 3DiM is geometry-free, does not rely on hyper-networks or test-time optimization for novel view synthesis, and allows a single model to easily scale to a large number of scenes.

Figure 1: Given a single input image on the left, 3DiM performs novel view synthesis and generates the four views on the right (left: input view; right: 3DiM outputs conditioned on different poses). We trained a single 471M parameter 3DiM on all of ShapeNet (without class-conditioning) and sample frames with 256 steps (512 score function evaluations with classifier-free guidance). See the Supplementary Website (https://3d-diffusion.github.io/) for video outputs.

## 1 INTRODUCTION

Diffusion Probabilistic Models (DPMs) (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020), also known simply as diffusion models, have recently emerged as a powerful family of generative models, achieving state-of-the-art performance on audio and image synthesis (Chen et al., 2020; Dhariwal & Nichol, 2021), while admitting better training stability than adversarial approaches (Goodfellow et al., 2014), as well as likelihood computation, which enables further applications such as compression and density estimation (Song et al., 2021; Kingma et al., 2021). Diffusion models have achieved impressive empirical results in a variety of image-to-image translation tasks, including text-to-image generation, super-resolution, inpainting, colorization, uncropping, and artifact removal (Song et al., 2020; Saharia et al., 2021a; Ramesh et al., 2022; Saharia et al., 2022).

One particular image-to-image translation problem where diffusion models have not been investigated is novel view synthesis: given a set of images of a 3D scene, the task is to infer how the scene looks from novel viewpoints.
Before the recent emergence of Scene Representation Networks (SRN) (Sitzmann et al., 2019) and Neural Radiance Fields (NeRF) (Mildenhall et al., 2020), state-of-the-art approaches to novel view synthesis were typically built on generative models (Sun et al., 2018) or more classical techniques based on interpolation or disparity estimation (Park et al., 2017; Zhou et al., 2018). Today, these models have been outperformed by NeRF-class models (Yu et al., 2021; Niemeyer et al., 2021; Jang & Agapito, 2021), where 3D consistency is guaranteed by construction, as images are generated by volume rendering of a single underlying 3D representation (a.k.a. geometry-aware models).

Still, these approaches have different limitations. Heavily regularized NeRFs for novel view synthesis with few images, such as RegNeRF (Niemeyer et al., 2021), produce undesired artifacts when given very few images and fail to leverage knowledge from multiple scenes (recall that NeRFs are trained on a single scene, i.e., one model per scene); yet given one or very few views of a novel scene, a reasonable model must extrapolate to complete the occluded parts of the scene. PixelNeRF (Yu et al., 2021) and VisionNeRF (Lin et al., 2022) address this by training NeRF-like models conditioned on feature maps that encode the input view(s). However, these approaches are regressive rather than generative, and as a result they cannot yield different plausible modes and are prone to blurriness. This type of failure has also been previously observed in regression-based models (Saharia et al., 2021b). Other works such as CodeNeRF (Jang & Agapito, 2021) and LoLNeRF (Rebain et al., 2021) instead employ test-time optimization to handle novel scenes, but still have issues with sample quality.

In recent literature, geometry-free approaches (i.e., methods without explicit geometric inductive biases like those introduced by volume rendering) such as Light Field Networks (LFN) (Sitzmann et al., 2021) and Scene Representation Transformers (SRT) (Sajjadi et al., 2021) have achieved results competitive with 3D-aware methods in the few-shot setting, where the number of conditioning views is limited (i.e., 1-10 images vs. dozens of images as in the usual NeRF setting). Similarly to our approach, EG3D (Chan et al., 2022) provides approximate 3D consistency by leveraging generative models. EG3D employs a StyleGAN (Karras et al., 2019) with volumetric rendering, followed by generative super-resolution (the latter being responsible for the approximation). Compared to this complex setup, we not only provide a significantly simpler architecture, but also a simpler hyper-parameter tuning experience than GANs, which are notoriously difficult to tune (Mescheder et al., 2018).

Motivated by these observations and the success of diffusion models in image-to-image tasks, we introduce 3D Diffusion Models (3DiMs). 3DiMs are image-to-image diffusion models trained on pairs of images of the same scene, where we assume the poses of the two images are known. Drawing inspiration from Scene Representation Transformers (Sajjadi et al., 2021), 3DiMs are trained to build a conditional generative model of one view given another view and their poses. Our key discovery is that we can turn this image-to-image model into a model that can produce an entire set of 3D-consistent frames through autoregressive generation, which we enable with our novel stochastic conditioning sampling algorithm.
We cover stochastic conditioning in more detail in Section 2.2 and provide an illustration in Figure 3. Compared to prior work, 3DiMs are generative (vs. regressive) geometry-free models, they allow training to scale to a large number of scenes, and they offer a simple end-to-end approach. We now summarize our core contributions:

1. We introduce 3DiM, a geometry-free image-to-image diffusion model for novel view synthesis.
2. We introduce the stochastic conditioning sampling algorithm, which encourages 3DiM to generate 3D-consistent outputs.
3. We introduce X-UNet, a new UNet architecture (Ronneberger et al., 2015) variant for 3D novel view synthesis, demonstrating that changes in architecture are critical for high fidelity results.
4. We introduce an evaluation scheme for geometry-free view synthesis models, 3D consistency scoring, that can numerically capture 3D consistency by training neural fields on model outputs.

Figure 2: Pose-conditional image-to-image training. Example training inputs and outputs for pose-conditional image-to-image diffusion models, presented in Section 2.1. Given two frames from a common scene and their poses (R, t), the training task is to undo the noise added to one of the two frames. (*) In practice, our neural network is trained to predict the Gaussian noise ϵ used to corrupt the original view; the predicted view is still just a linear combination of the noisy input and the predicted ϵ.

## 2 POSE-CONDITIONAL DIFFUSION MODELS

To motivate 3DiMs, let us consider the problem of novel view synthesis given few images from a probabilistic perspective. Given a complete description of a 3D scene S, for any pose p, the view x(p) at pose p is fully determined by S, i.e., views are conditionally independent given S. However, we are interested in modeling distributions of the form q(x_1, ..., x_m | x_{m+1}, ..., x_n) without S, where views are no longer conditionally independent. A concrete example is the following: given the back of a person's head, there are multiple plausible views for the front. An image-to-image model sampling front views given only the back should indeed yield different outputs for each front view, with no guarantees that they will be consistent with each other, especially if it learns the data distribution perfectly. Similarly, given a single view of an object that appears small, there is ambiguity about the pose itself: is the object small and close, or simply far away? Thus, given the inherent ambiguity in the few-shot setting, we need a sampling scheme where generated views can depend on each other in order to achieve 3D consistency. This contrasts with NeRF approaches, where query rays are conditionally independent given a 3D representation S, an even stronger condition than imposing conditional independence among frames. Such approaches try to learn the richest possible representation for a single scene S, while 3DiM avoids the difficulty of learning a generative model for S altogether.

### 2.1 IMAGE-TO-IMAGE DIFFUSION MODELS WITH POSE CONDITIONING

Given a data distribution q(x_1, x_2) of pairs of views from a common scene at poses p_1, p_2 ∈ SE(3), we define an isotropic Gaussian process that adds increasing amounts of noise to data samples as the log signal-to-noise ratio λ decreases, following Salimans & Ho (2022):

$$q\big(z_k^{(\lambda)} \mid x_k\big) := \mathcal{N}\!\left(z_k^{(\lambda)};\; \sigma(\lambda)^{\frac{1}{2}} x_k,\; \sigma(-\lambda) I\right) \qquad (1)$$

where σ(·) is the sigmoid function.
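For concreteness, the following is a minimal JAX sketch of this forward noising process; the function name, shapes, and the particular log-SNR value are our own illustrative assumptions rather than details from the paper, and the sketch also anticipates the reparametrized sampling of Equation (2) below.

```python
import jax
import jax.numpy as jnp

def noise_frame(x, log_snr, key):
    """Sample z^(lambda) ~ q(z | x) for a clean frame x at log-SNR `log_snr`.

    Following Eq. (1): the signal is scaled by sigmoid(log_snr)^0.5 and the
    noise variance is sigmoid(-log_snr), so low log-SNR means mostly noise.
    """
    signal_scale = jnp.sqrt(jax.nn.sigmoid(log_snr))
    noise_scale = jnp.sqrt(jax.nn.sigmoid(-log_snr))
    eps = jax.random.normal(key, x.shape)
    return signal_scale * x + noise_scale * eps, eps

# Illustrative usage: corrupt a stand-in 64x64 RGB target view at log-SNR -2.
x2 = jax.random.uniform(jax.random.PRNGKey(0), (64, 64, 3))  # placeholder frame
z2, eps = noise_frame(x2, log_snr=-2.0, key=jax.random.PRNGKey(1))
```

Large positive λ leaves the frame nearly clean, while large negative λ yields a sample that is almost pure standard Gaussian noise.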
We can apply the reparametrization trick (Kingma & Welling, 2013) and sample from these marginal distributions via

$$z_k^{(\lambda)} = \sigma(\lambda)^{\frac{1}{2}} x_k + \sigma(-\lambda)^{\frac{1}{2}} \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \qquad (2)$$

Then, given a pair of views, we learn to reverse this process in one of the two frames by minimizing the objective proposed by Ho et al. (2020), which has been shown to yield much better sample quality than maximizing the true evidence lower bound (ELBO):

$$L(\theta) = \mathbb{E}_{q(x_1, x_2)}\, \mathbb{E}_{\lambda, \epsilon} \left\| \epsilon_\theta\!\big(z_2^{(\lambda)}, x_1, \lambda, p_1, p_2\big) - \epsilon \right\|_2^2 \qquad (3)$$

where ϵ_θ is a neural network whose task is to denoise the frame z_2^{(λ)} given a different (clean) frame x_1, and λ is the log signal-to-noise ratio. To keep our notation legible, we slightly abuse notation and from now on simply write ϵ_θ(z_2^{(λ)}, x_1). We illustrate training in Figure 2.

Figure 3: Stochastic conditioning sampler. We illustrate our proposed inference procedure for 3DiM, outlined in Section 2.2. There are two main components to our sampling procedure: (1) the autoregressive generation of multiple frames (illustrated vertically as step 1, step 2, etc.), and (2) the denoising process that generates each individual frame (illustrated horizontally). When generating a new frame, we select a previous frame as the conditioning frame randomly at each denoising step (illustrated with the dice). Note that this is not part of 3DiM training; we also omit the pose inputs in the diagram to avoid overloading the figure.

### 2.2 3D CONSISTENCY VIA STOCHASTIC CONDITIONING

**Motivation.** We begin this section by motivating the need for our stochastic conditioning sampler. In the ideal situation, we would model our 3D scene frames using the chain rule decomposition $\prod_i p(x_i \mid x_{<i})$.
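To make the sampler in Figure 3 concrete, below is a hedged JAX sketch of stochastic conditioning under several assumptions of ours: `eps_fn` stands in for the trained pose-conditional denoiser, poses are represented as 4×4 matrices, the log-SNR schedule is a simple linear one, and the per-step update is a simplified deterministic DDIM-style rule rather than the paper's exact ancestral sampler (classifier-free guidance is omitted). What matters here is the structure: frames are generated autoregressively, and the conditioning frame is re-drawn at every denoising step.

```python
import jax
import jax.numpy as jnp

def stochastic_conditioning_sample(eps_fn, input_view, input_pose, target_poses,
                                   key, num_steps=256):
    """Autoregressively generate one frame per target pose from a single input view.

    At every denoising step, the conditioning frame is drawn uniformly at random
    from the input view and all frames generated so far (the dice in Figure 3).
    """
    # Assumed log-SNR schedule, traversed from noisiest to cleanest.
    log_snrs = jnp.linspace(-10.0, 10.0, num_steps + 1)
    views, poses = [input_view], [input_pose]

    for target_pose in target_poses:
        key, init_key = jax.random.split(key)
        z = jax.random.normal(init_key, input_view.shape)       # start from pure noise
        x_hat = z
        for t in range(num_steps):
            key, pick_key = jax.random.split(key)
            # Stochastic conditioning: pick a random previous frame to condition on.
            i = int(jax.random.randint(pick_key, (), 0, len(views)))
            lam, lam_next = log_snrs[t], log_snrs[t + 1]
            alpha = jnp.sqrt(jax.nn.sigmoid(lam))
            sigma = jnp.sqrt(jax.nn.sigmoid(-lam))
            eps_hat = eps_fn(z, views[i], lam, poses[i], target_pose)
            x_hat = (z - sigma * eps_hat) / alpha                # predicted clean frame
            # Deterministic DDIM-style move to the next (higher) log-SNR level.
            z = (jnp.sqrt(jax.nn.sigmoid(lam_next)) * x_hat
                 + jnp.sqrt(jax.nn.sigmoid(-lam_next)) * eps_hat)
        views.append(x_hat)
        poses.append(target_pose)
    return views[1:]

# Illustrative usage with a dummy denoiser standing in for the trained X-UNet.
dummy_eps = lambda z, cond, lam, p_cond, p_tgt: jnp.zeros_like(z)
frames = stochastic_conditioning_sample(
    dummy_eps, jnp.zeros((64, 64, 3)), jnp.eye(4), [jnp.eye(4)] * 4,
    key=jax.random.PRNGKey(0), num_steps=8)
```

The intent, following the motivation above, is that because a different conditioning frame may be chosen at each denoising step, every generated frame effectively depends on the whole growing set of views rather than on a single fixed one, which is what pushes the outputs toward 3D consistency.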