# CONDITIONAL IMAGE GENERATION BY CONDITIONING VARIATIONAL AUTO-ENCODERS

Published as a conference paper at ICLR 2022

William Harvey, Saeid Naderiparizi & Frank Wood
Department of Computer Science
University of British Columbia
Vancouver, Canada
{wsgh,saeidnp,fwood}@cs.ubc.ca

Frank Wood is also affiliated with the Montréal Institute for Learning Algorithms (Mila) and Inverted AI.

## ABSTRACT

We present a conditional variational auto-encoder (VAE) which, to avoid the substantial cost of training from scratch, uses an architecture and training objective capable of leveraging a foundation model in the form of a pretrained unconditional VAE. To train the conditional VAE, we only need to train an artifact to perform amortized inference over the unconditional VAE's latent variables given a conditioning input. We demonstrate our approach on tasks including image inpainting, for which it outperforms state-of-the-art GAN-based approaches at faithfully representing the inherent uncertainty. We conclude by describing a possible application of our inpainting model, in which it is used to perform Bayesian experimental design for the purpose of guiding a sensor.

## 1 INTRODUCTION

A major challenge with applying variational auto-encoders (VAEs) to high-dimensional data is the typically slow training times. For example, training a state-of-the-art VAE (Vahdat & Kautz, 2020; Child, 2020) on the 256×256 FFHQ dataset (Karras et al., 2019) takes on the order of 1 GPU-year, but a state-of-the-art generative adversarial network (GAN) (Lin et al., 2021; Karras et al., 2020) can be trained on the same dataset in a matter of GPU-weeks. One hypothesis for the cause of this disparity is that, whereas the mass-covering training objective for a VAE forces it to assign probability mass over the entirety of the data distribution, a GAN can cut corners by dropping modes (Arora & Zhang, 2017; Arora et al., 2017).

We focus on the problem of conditional generative modelling: given an input (e.g. a partially blanked-out image), we wish to map to a distribution over outputs (e.g. plausible completions of the image). Both conditional GANs (Zheng et al., 2019; Zhao et al., 2021) and conditional VAEs (Sohn et al., 2015; Ivanov et al., 2018) are applicable to this problem, with the same disparity in training times that we described for their unconditional counterparts.

Figure 1: Left column: Images with most pixels masked out. Rest: Completions from our method.

We present an approach based on the conditional VAE framework but, to mitigate the associated slow training times, we design the architecture so that we can incorporate pretrained unconditional VAEs. We show that re-using publicly available pretrained models in this way can lead to training times and sample quality competitive with GANs, while avoiding mode dropping. While requiring an existing pretrained model is a limitation, we note that:

(I) The unconditional VAE need not have been (pre-)trained on the same dataset as the conditional model; we show unconditional models trained on ImageNet are suitable for later use with various photo datasets.

(II) A single unconditional VAE can be used for later training of conditional VAEs on any desired conditional generation task (e.g. the same image model may later be used for image completion or image colourisation).
(III) There is an increasing trend in the machine learning community towards sharing large, expensively trained models (Wolf et al., 2020), sometimes referred to as foundation models (Bommasani et al., 2021). Most of the unconditional VAEs in our experiments use publicly available pretrained weights released by Child (2020). By presenting a use case for foundation models in image modelling, we hope to encourage even more sharing of pretrained weights in this domain.

We demonstrate our approach on several conditional generation tasks in the image domain but focus in particular on stochastic image completion: the problem of inferring the posterior distribution over images given the observation of a subset of pixel values. For some applications, such as photo editing, the implicit distribution defined by GANs is good enough. We argue that our approach has substantial advantages when image completion is used as part of a larger pipeline, and discuss one possible instance of this in Section 5: Bayesian optimal experimental design (BOED) for guiding a sensor or hard attention mechanism (Ma et al., 2018; Harvey et al., 2019; Rangrej & Clark, 2021). In this case, missing modes of the posterior over images are likely to lead to bad decisions. We show that our objective corresponds to the mass-covering KL divergence and so covers the posterior well. This is supported empirically by results indicating that, not only is the visual quality of our image completions (see Fig. 1) close to the state-of-the-art (Zhao et al., 2021), but our coverage of the true posterior over image completions is superior to that of any of our baselines.

**Contributions** We develop a method to cheaply convert pretrained unconditional VAEs into conditional VAEs. The resulting training times and sample quality are competitive with GANs, while the models avoid the mode-dropping behaviour associated with GANs. Finally, we showcase a possible application in Bayesian optimal experimental design that benefits from these capabilities.

## 2 VARIATIONAL AUTO-ENCODERS

We describe VAEs in terms of three components. (I) A decoder with parameters θ ∈ Θ maps from latent variables z to a distribution over data x, which we call pmodel(x|z; θ). (II) There is a prior over latent variables, pmodel(z; θ). This may have learnable parameters, which we consider to be part of θ. Together, the prior and decoder define a joint distribution, pmodel(z, x; θ). Finally, (III) an encoder with parameters φ ∈ Φ maps from data to an approximate posterior distribution over latent variables, q(z|x; φ) ≈ pmodel(z|x; θ).

Ideally, θ would be learned to maximise the log likelihood log pmodel(x; θ) = log ∫ pmodel(z, x; θ) dz, averaged over training examples. Since this is intractable, θ and φ are instead trained jointly to maximise an average of the evidence lower-bound (ELBO) over each training example x ∼ pdata(·):

$$
\mathbb{E}_{p_{\text{data}}(x)}\!\left[\mathrm{ELBO}(\theta, \phi, x)\right]
= \mathbb{E}_{p_{\text{data}}(x)}\,\mathbb{E}_{q(z|x;\phi)}\!\left[\log \frac{p_{\text{model}}(z;\theta)\, p_{\text{model}}(x|z;\theta)}{q(z|x;\phi)}\right]
= \mathbb{H}\!\left[p_{\text{data}}(x)\right] - \mathrm{KL}\!\left(p_{\text{data}}(x)\, q(z|x;\phi) \,\middle\|\, p_{\text{model}}(z,x;\theta)\right). \tag{2}
$$

The data distribution's entropy, H[pdata(x)], is typically a finite constant, and this is guaranteed in our experiments where x is an image with discrete pixel values. Maximising the above objective will therefore drive pmodel(z, x; θ) towards pdata(x)q(z|x; φ), and so the marginal pmodel(x; θ) towards pdata(x). The KL divergence shown leads to mass-covering behaviour from pmodel(z, x; θ) (Bishop, 2006), so pmodel(x; θ) should assign probability broadly over the data distribution pdata(x).
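To make the three components and the objective in Eq. (2) concrete, the following is a minimal single-latent-layer VAE sketch in PyTorch. It is not the paper's hierarchical architecture: the Gaussian encoder, standard-normal prior, Bernoulli decoder, layer sizes, and the binarised stand-in batch are all assumptions chosen only to illustrate how a one-sample estimate of log pmodel(z)pmodel(x|z) − log q(z|x) is computed and maximised.

```python
import torch
import torch.nn as nn
import torch.distributions as D

class ToyVAE(nn.Module):
    """Single-layer VAE: encoder q(z|x; phi), prior p(z), decoder p(x|z; theta)."""

    def __init__(self, x_dim=784, z_dim=32, h_dim=256):
        super().__init__()
        # Encoder outputs the mean and log-std of a diagonal Gaussian q(z|x).
        self.encoder = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(),
                                     nn.Linear(h_dim, 2 * z_dim))
        # Decoder outputs Bernoulli logits for each pixel of x.
        self.decoder = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                     nn.Linear(h_dim, x_dim))
        self.z_dim = z_dim

    def q_z_given_x(self, x):
        mean, log_std = self.encoder(x).chunk(2, dim=-1)
        return D.Independent(D.Normal(mean, log_std.exp()), 1)

    def p_z(self, batch_size, device):
        loc = torch.zeros(batch_size, self.z_dim, device=device)
        return D.Independent(D.Normal(loc, torch.ones_like(loc)), 1)

    def p_x_given_z(self, z):
        return D.Independent(D.Bernoulli(logits=self.decoder(z)), 1)

    def elbo(self, x):
        q = self.q_z_given_x(x)
        z = q.rsample()  # reparameterised sample, so gradients reach phi
        log_joint = self.p_z(x.shape[0], x.device).log_prob(z) \
                    + self.p_x_given_z(z).log_prob(x)
        return log_joint - q.log_prob(z)  # one-sample ELBO estimate per example

# Training maximises the ELBO averaged over x ~ p_data, i.e. minimises its negative.
model = ToyVAE()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(16, 784).round()  # stand-in batch of binarised "images"
loss = -model.elbo(x).mean()
loss.backward()
optimiser.step()
```

Using `rsample` rather than `sample` keeps the latent draw differentiable (the reparameterisation trick), which is what allows θ and φ to be trained jointly by gradient ascent on the ELBO.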
For notational simplicity in the rest of the paper, parameters θ and φ are not written when clear from the context.

Figure 2: A hierarchical VAE architecture with L = 2 layers of latent variables. (a) Estimating ELBO. (b) Estimating Ofor. (c) Sampling x ∼ pcond(·|y). Part (a) illustrates the computation of the ELBO for an unconditional VAE; part (b) illustrates the computation of our training objective Ofor; and part (c) illustrates the drawing of conditional samples. The encoder is shown in orange; the prior and the decoder (which maintains a deterministic hidden state hl) are both shown in black; and the partial encoder is shown in blue. The computation graph needed to sample z in each case is drawn with dashed lines, and the remainder of the computation graph is drawn with solid lines.

Hierarchical VAEs (Gregor et al., 2015; Kingma et al., 2016; Sønderby et al., 2016; Klushyn et al., 2019) partition the latent variables z in a way which has been found to improve the fidelity of the learned pmodel(x), especially for the image domain (Vahdat & Kautz, 2020; Child, 2020). In particular, they define z to consist of L disjoint groups, z1, . . . , zL. The prior for each zl can depend on the previous groups through the factorisation

$$
p_{\text{model}}(z) = \prod_{l=1}^{L} p_{\text{model}}(z_l \mid z_{<l}).
$$
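This factorisation can be read as ancestral sampling: draw z1 from its prior, then z2 conditioned on z1, and so on. The sketch below illustrates this with a toy top-down prior in which the dependence on earlier groups is summarised by a deterministic state h, loosely in the spirit of the hidden state mentioned in the Figure 2 caption; the fully connected networks, dimensions, and tanh update are placeholder assumptions for the example, not the architecture of Child (2020) or Vahdat & Kautz (2020).

```python
import torch
import torch.nn as nn
import torch.distributions as D

class ToyHierarchicalPrior(nn.Module):
    """Ancestral sampling from p(z) = prod_l p(z_l | z_{<l}) via a deterministic state h."""

    def __init__(self, num_layers=2, z_dim=16, h_dim=64):
        super().__init__()
        self.h0 = nn.Parameter(torch.zeros(h_dim))  # initial deterministic state
        # Maps the current state h to the mean and log-std of p(z_l | z_{<l}).
        self.prior_nets = nn.ModuleList(
            [nn.Linear(h_dim, 2 * z_dim) for _ in range(num_layers)])
        # Folds the sampled z_l back into the state so later groups can depend on it.
        self.update_nets = nn.ModuleList(
            [nn.Linear(h_dim + z_dim, h_dim) for _ in range(num_layers)])

    def sample(self, batch_size):
        h = self.h0.expand(batch_size, -1)
        zs = []
        for prior_net, update_net in zip(self.prior_nets, self.update_nets):
            mean, log_std = prior_net(h).chunk(2, dim=-1)
            z_l = D.Normal(mean, log_std.exp()).rsample()  # z_l ~ p(z_l | z_{<l})
            h = torch.tanh(update_net(torch.cat([h, z_l], dim=-1)))
            zs.append(z_l)
        return zs  # [z_1, ..., z_L]

zs = ToyHierarchicalPrior(num_layers=2).sample(batch_size=4)
print([z.shape for z in zs])  # two groups of 16-dimensional latents per example
```

In models such as those of Child (2020), each per-group conditional is parameterised by much larger convolutional networks operating at multiple resolutions, but the conditional, group-by-group sampling structure is the same.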