# Adaptive Density Estimation for Generative Models

Thomas Lucas (Inria), Konstantin Shmelkov (Noah's Ark Lab, Huawei; work done while at Inria), Cordelia Schmid (Inria), Karteek Alahari (Inria), Jakob Verbeek (Inria). The authors contributed equally. Inria affiliation: Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

Unsupervised learning of generative models has seen tremendous progress over recent years, in particular due to generative adversarial networks (GANs), variational autoencoders, and flow-based models. GANs have dramatically improved sample quality, but suffer from two drawbacks: (i) they mode-drop, i.e., do not cover the full support of the training data, and (ii) they do not allow for likelihood evaluations on held-out data. In contrast, likelihood-based training encourages models to cover the full support of the training data, but yields poorer samples. These mutual shortcomings can in principle be addressed by training generative latent variable models in a hybrid adversarial-likelihood manner. However, we show that commonly made parametric assumptions create a conflict between the two criteria, making successful hybrid models non-trivial. As a solution, we propose to use deep invertible transformations in the latent variable decoder. This approach allows for likelihood computations in image space, is more efficient than fully invertible models, and can take full advantage of adversarial training. We show that our model significantly improves over existing hybrid models: it offers GAN-like samples, IS and FID scores that are competitive with fully adversarial models, and improved likelihood scores.

1 Introduction

Successful recent generative models of natural images can be divided into two broad families, which are trained in fundamentally different ways. The first is trained using likelihood-based criteria, which ensure that all training data points are well covered by the model. This category includes variational autoencoders (VAEs) [25, 26, 39, 40], autoregressive models such as PixelCNNs [46, 53], and flow-based models such as Real-NVP [9, 20, 24]. The second category is trained based on a signal that measures to what extent (statistics of) samples from the model can be distinguished from (statistics of) the training data, i.e., based on the quality of samples drawn from the model. This is the case for generative adversarial networks (GANs) [2, 15, 22] and moment matching methods [28].

Despite tremendous recent progress, existing methods exhibit a number of drawbacks. Adversarially trained models such as GANs do not provide a density function, which poses a fundamental problem as it prevents assessment of how well the model fits held-out and training data. Moreover, adversarial models typically do not allow inference of the latent variables that underlie observed images. Finally, adversarial models suffer from mode collapse [2], i.e., they do not cover the full support of the training data. Likelihood-based models, on the other hand, are trained to put probability mass on all elements of the training set, but over-generalise and produce samples of substantially inferior quality compared to adversarial models. The models with the best likelihood scores on held-out data are autoregressive models [35], which suffer from the additional problem that they are extremely
inefficient to sample from [38], since images are generated pixel-by-pixel. This sampling inefficiency makes adversarial training of such models prohibitively expensive.

Figure 1: An invertible non-linear mapping fψ maps an image x to a vector fψ(x) in feature space; fψ is trained to adapt to modelling assumptions made by a trained density pθ in feature space. This induces a full covariance structure and a non-Gaussian density in image space.

Figure 2: Variational inference is used to train a latent variable generative model in feature space. The invertible mapping fψ maps back to image space, where adversarial training can be performed together with MLE.

Figure 3: Our model yields compelling samples, while the optimization of likelihood ensures coverage of all modes in the training support and thus sample diversity; shown here on LSUN churches (64×64).

In order to overcome these shortcomings, we seek to design a model that (i) generates high-quality samples typical of adversarial models, (ii) provides a likelihood measure on the entire image space, and (iii) has a latent variable structure to enable efficient sampling and to permit adversarial training. Additionally, we show that (iv) a successful hybrid adversarial-likelihood paradigm requires going beyond simplifying assumptions commonly made with likelihood-based latent variable models. These simplifying assumptions on the conditional distribution of data x given latents z, p(x|z), include full independence across the dimensions of x and/or simple parametric forms such as Gaussians [25], or the use of fully invertible networks [9, 24]. These assumptions create a conflict between achieving high sample quality and high likelihood scores on held-out data. Autoregressive models, such as PixelCNNs [46, 53], do not make factorization assumptions, but are extremely inefficient to sample from.

As a solution, we propose learning a non-linear invertible function fψ between the image space and an abstract feature space, as illustrated in Figure 1. Training a model with full support in this feature space induces a model in the image space that makes neither Gaussianity nor independence assumptions in the conditional density p(x|z). Trained by MLE, fψ adapts to the modelling assumptions made by pθ, so we refer to this approach as adaptive density estimation. We experimentally validate our approach on the CIFAR-10 dataset with an ablation study. Our model significantly improves over existing hybrid models, producing GAN-like samples, and IS and FID scores that are competitive with fully adversarial models, see Figure 3. At the same time, we obtain likelihoods on held-out data comparable to state-of-the-art likelihood-based methods, which requires covering the full support of the dataset. We further confirm these observations with quantitative and qualitative experimental results on the STL-10, ImageNet and LSUN datasets.

2 Related work

Mode collapse in GANs has received considerable attention, and stabilizing the training process as well as improved and bigger architectures have been shown to alleviate this issue [2, 17, 37]. Another line of work focuses on allowing the discriminator to access batch statistics of generated images, as pioneered by [22, 45], and further generalized by [29, 32]. This enables comparison of distributional statistics by the discriminator, rather than only individual samples.
Other approaches to encourage diversity among GAN samples include the use of maximum mean discrepancy [1], optimal transport [47], determinantal point processes [14], and Bayesian formulations of adversarial training [42] that allow model parameters to be sampled. In contrast to our work, these models lack an inference network, and do not define an explicit density over the full data support.

Another line of research has explored inference mechanisms for GANs. The discriminator of BiGAN [10] and ALI [12], given pairs (x, z) of images and latents, predicts whether z was encoded from a real image, or x was decoded from a sampled z. In [52] the encoder and the discriminator are collapsed into one network that encodes both real images and generated samples, and tries to spread their posteriors apart. In [6] a symmetrized KL divergence is approximated in an adversarial setup, and reconstruction losses are used to improve the correspondence between reconstructed and target variables for x and z. Similarly, in [41] a discriminator is used to replace the KL divergence term in the variational lower bound used to train VAEs with the density ratio trick. In [33] the KL divergence term in a VAE is replaced with a discriminator that compares latent variables from the prior and the posterior; this regularization is more flexible than the standard KL divergence. The VAE-GAN model [27] and the model in [21] use the intermediate feature maps of a GAN discriminator and of a classifier, respectively, as target space for a VAE. Unlike ours, these methods do not define a likelihood over the image space.

Figure 4: (Left) Maximum likelihood training pulls probability mass towards high-density regions of the data distribution, while adversarial training pushes mass out of low-density regions; mode-dropping is strongly penalized by maximum likelihood, and over-generalization is strongly penalized by adversarial training. (Right) Independence assumptions become a source of conflict in a joint training setting, making hybrid training non-trivial.

Likelihood-based models typically make modelling assumptions that conflict with adversarial training; these include strong factorization and/or Gaussianity. In our work we avoid these limitations by learning the shape of the conditional density on observed data given latents, p(x|z), beyond fully factorized Gaussian models. As in our work, Flow-GAN [16] also builds on invertible transformations to construct a model that can be trained in a hybrid adversarial-MLE manner, see Figure 2. However, Flow-GAN does not use the efficient non-invertible layers we introduce, and instead relies entirely on invertible layers. Other approaches combine autoregressive decoders with latent variable models to go beyond typical parametric assumptions in pixel space [7, 18, 31]. They, however, are not amenable to adversarial training due to the prohibitively slow sequential pixel sampling.

3 Preliminaries on MLE and adversarial training

Maximum likelihood and over-generalization. The de-facto standard approach to train generative models is maximum-likelihood estimation (MLE). It maximizes the probability of data sampled from an unknown data distribution p* under the model pθ, w.r.t. the model parameters θ. This is equivalent to minimizing the Kullback-Leibler (KL) divergence, DKL(p*||pθ), between p* and pθ.
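To make this equivalence explicit (a standard identity, stated here in the notation above), expanding the divergence gives

$$
D_{\mathrm{KL}}(p^* \,\|\, p_\theta) = \int_x p^*(x)\,\ln\frac{p^*(x)}{p_\theta(x)}\,dx = -\,\mathbb{E}_{x \sim p^*}\big[\ln p_\theta(x)\big] - H(p^*),
$$

so maximizing the expected log-likelihood of data drawn from p* is the same as minimizing DKL(p*||pθ), since the entropy H(p*) does not depend on θ.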
This yields models that tend to cover all the modes of the data, but put mass in spurious regions of the target space; a phenomenon known as over-generalization or zero-avoiding [4], manifested by unrealistic samples in the context of generative image models, see Figure 4. Over-generalization is inherent to the optimization of the KL divergence oriented in this manner. Real images are sampled from p*, and pθ is explicitly optimized to cover all of them. The training procedure, however, does not sample from pθ to evaluate the quality of such samples, ideally using the inaccessible p*(x) as a score. Therefore pθ may put mass in spurious regions of the space without being heavily penalized. We refer to this kind of training procedure as coverage-driven training (CDT). It optimizes a loss of the form

$$
\mathcal{L}_C(p_\theta) = \int_x p^*(x)\, s_c(x, p_\theta)\, dx,
$$

where the score sc(x, pθ) = ln pθ(x) evaluates how well a sample x is covered by the model. Any score sc for which LC is optimal exactly when pθ = p* is equivalent to the log-score, which forms a justification for MLE, on which we focus.

Explicitly evaluating sample quality is redundant in the regime of unlimited model capacity and training data. Indeed, putting mass on spurious regions takes it away from the support of p*, and thus reduces the likelihood of the training data. In practice, however, datasets and model capacity are finite, and models must put mass outside the finite training set in order to generalize. The maximum likelihood criterion, by construction, only measures how much mass goes off the training data, not where it goes. In classic MLE, generalization is controlled in two ways: (i) inductive bias, in the form of model architecture, controls where the off-dataset mass goes, and (ii) regularization controls to which extent this happens. An adversarial loss, by considering samples of the model pθ, can provide a second handle to evaluate and control where the off-dataset mass goes. In this sense, and in contrast to model architecture design, an adversarial loss provides a trainable form of inductive bias.

Adversarial models and mode collapse. Adversarially trained models produce samples of excellent quality. As mentioned, their main drawbacks are their tendency to mode-drop, and the lack of a measure to assess mode-dropping, or their performance in general. The reasons for this are two-fold. First, defining a valid likelihood requires adding volume to the low-dimensional manifold learned by GANs, so that training and test data have non-zero density. Second, computing the density of a data point under the defined probability distribution requires marginalizing out the latent variables, which is not trivial in the absence of an efficient inference mechanism.

When a human expert subjectively evaluates the quality of generated images, samples from the model are compared to the expert's implicit approximation of p*. This type of objective may be formalized as

$$
\mathcal{L}_Q(p_\theta) = \int_x p_\theta(x)\, s_q(x, p^*)\, dx,
$$

and we refer to it as quality-driven training (QDT). To see that GANs [15] use this type of training, recall that the discriminator is trained with the loss

$$
\mathcal{L}_{\mathrm{GAN}} = \int_x p^*(x) \ln D(x) + p_\theta(x) \ln\big(1 - D(x)\big)\, dx.
$$

It is easy to show that the optimal discriminator equals D*(x) = p*(x)/(p*(x) + pθ(x)). Substituting the optimal discriminator, L_GAN equals the Jensen-Shannon divergence,

$$
D_{\mathrm{JS}}(p^* \,\|\, p_\theta) = \tfrac{1}{2} D_{\mathrm{KL}}\!\big(p^* \,\|\, \tfrac{1}{2}(p_\theta + p^*)\big) + \tfrac{1}{2} D_{\mathrm{KL}}\!\big(p_\theta \,\|\, \tfrac{1}{2}(p_\theta + p^*)\big), \tag{1}
$$

up to additive and multiplicative constants [15].
This loss, approximated by the discriminator, is symmetric and contains two KL divergence terms. Note that DKL(p* || ½(pθ + p*)) is an integral over p*, so it is coverage-driven. The term that approximates it in L_GAN, i.e., ∫ p*(x) ln D(x) dx, is however independent of the generative model, and disappears when differentiating. Therefore, it cannot be used to perform coverage-driven training, and the generator is trained to minimize ln(1 − D(G(z))) (or to maximize ln D(G(z))), where G(z) is the deterministic generator that maps latent variables z to the data space. Assuming D = D*, this yields

$$
\int_z p(z) \ln\big(1 - D^*(G(z))\big)\, dz = \int_x p_\theta(x) \ln \frac{p_\theta(x)}{p_\theta(x) + p^*(x)}\, dx = D_{\mathrm{KL}}\big(p_\theta \,\|\, \tfrac{1}{2}(p_\theta + p^*)\big) - \ln 2, \tag{2}
$$

which is a quality-driven criterion, favoring sample quality over support coverage.

4 Adaptive Density Estimation and hybrid adversarial-likelihood training

We present a hybrid training approach that uses MLE to cover the full support of the training data, and adversarial training as a trainable inductive bias mechanism to improve sample quality. Using both criteria provides a richer training signal, but satisfying both is more challenging than satisfying each in isolation, for a given model complexity. In practice, model flexibility is limited by (i) the number of parameters, layers, and features in the model, and (ii) simplifying modeling assumptions, usually made for tractability. We show that these simplifying assumptions create a conflict between the two criteria, making successful joint training non-trivial. We introduce Adaptive Density Estimation as a solution to reconcile them.

Latent variable generative models, defined as pθ(x) = ∫ pθ(x|z) p(z) dz, typically make simplifying assumptions on pθ(x|z), such as full factorization and/or Gaussianity, see e.g. [11, 25, 30]. In particular, assuming full factorization of pθ(x|z) implies that any correlations not captured by z are treated as independent per-pixel noise. This is a poor model for natural images, unless z captures each and every aspect of the image structure. Crucially, this hypothesis is problematic in the context of hybrid MLE-adversarial training. If p* is too complex for pθ(x|z) to fit it accurately enough, MLE will lead to a high variance in a factored (Gaussian) pθ(x|z), as illustrated in Figure 4 (right). This leads to unrealistic blurry samples, easily detected by an adversarial discriminator, which then does not provide a useful training signal. Conversely, adversarial training will try to avoid these poor samples by dropping modes of the training data, and driving the noise level to zero. This in turn is heavily penalized by maximum likelihood training, and leads to poor likelihoods on held-out data.

Adaptive density estimation. The point of view of regression hints at a possible solution. For instance, with isotropic Gaussian model densities with fixed variance, solving the optimization problem max_θ ln pθ(x|z) is similar to solving min_θ ||µθ(z) − x||², i.e., ℓ2 regression, where µθ(z) is the mean of the decoder pθ(x|z). The Euclidean distance in RGB space is known to be a poor measure of similarity between images, non-robust to small translations or other basic transformations [34]. One can instead compute the Euclidean distance in a feature space, ||fψ(x1) − fψ(x2)||², where fψ is chosen so that the distance is a better measure of similarity. A popular way to obtain fψ is to use a CNN that learns a non-linear image representation that allows linear assessment of image similarity.
This is the idea underlying GAN discriminators, the FID evaluation measure [19], the reconstruction losses of VAE-GAN [27], and classifier-based perceptual losses as in [21]. Despite their flexibility, such similarity metrics are in general degenerate in the sense that they may discard information about the data point x. For instance, two different images x and y can collapse to the same point in feature space, i.e., fψ(x) = fψ(y). This limits the use of similarity metrics in the context of generative modeling for two reasons: (i) it does not yield a valid measure of likelihood over inputs, and (ii) points generated in the feature space of fψ cannot easily be mapped to images.

To resolve this issue, we choose fψ to be a bijection. Given a model pθ trained to model fψ(x) in feature space, a density in image space is computed using the change of variable formula, which yields

$$
p_{\theta,\psi}(x) = p_\theta\big(f_\psi(x)\big)\,\Big|\det \frac{\partial f_\psi(x)}{\partial x}\Big|.
$$

Image samples are obtained by sampling from pθ in feature space, and mapping to the image space through fψ⁻¹. We refer to this construction as Adaptive Density Estimation. If pθ provides efficient log-likelihood computations, the change of variable formula can be used to train fψ and pθ together by maximum likelihood, and if pθ provides fast sampling, adversarial training can be performed efficiently.

MLE with adaptive density estimation. To train a generative latent variable model pθ(x) which permits efficient sampling, we rely on amortized variational inference. We use an inference network qφ(z|x) to construct a variational evidence lower bound (ELBO),

$$
\mathcal{L}^{\psi}_{\mathrm{ELBO}}(x, \theta, \phi) = \mathbb{E}_{q_\phi(z|x)}\!\big[\ln p_\theta(f_\psi(x)\,|\,z)\big] - D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p_\theta(z)\big) \;\le\; \ln p_\theta(f_\psi(x)). \tag{3}
$$

Using this lower bound together with the change of variable formula, the mapping to the similarity space fψ and the generative model pθ can be trained jointly with the loss

$$
\mathcal{L}_C(\theta, \phi, \psi) = -\,\mathbb{E}_{x \sim p^*}\!\left[\mathcal{L}^{\psi}_{\mathrm{ELBO}}(x, \theta, \phi) + \ln\Big|\det \frac{\partial f_\psi(x)}{\partial x}\Big|\right] \;\ge\; -\,\mathbb{E}_{x \sim p^*}\big[\ln p_{\theta,\psi}(x)\big]. \tag{4}
$$

We use gradient descent to train fψ by optimizing LC(θ, φ, ψ) w.r.t. ψ. The ELBO term encourages the mapping fψ to maximize the density of points in feature space under the model pθ, so that fψ is trained to match modeling assumptions made in pθ. Simultaneously, the log-determinant term encourages fψ to maximize the volume of data points in feature space. This guarantees that data points cannot be collapsed to a single point in the feature space. We use a factored Gaussian form of the conditional pθ(·|z) for tractability, but since fψ can arbitrarily reshape the corresponding conditional image space, it still avoids simplifying assumptions in the image space. Therefore, the (invertible) transformation fψ avoids the conflict between the MLE and adversarial training mechanisms, and can leverage both.

Adversarial training with adaptive density estimation. To sample the generative model, we sample latents from the prior, z ∼ pθ(z), which are then mapped to feature space through µθ(z), and to image space through fψ⁻¹. We train our generator using the modified objective proposed by [50], combining both generator losses considered in [15], i.e., ln[(1 − D(G(z)))/D(G(z))], which yields

$$
\mathcal{L}_Q(p_{\theta,\psi}) = \mathbb{E}_{p_\theta(z)}\!\Big[\ln\big(1 - D(f_\psi^{-1}(\mu_\theta(z)))\big) - \ln D\big(f_\psi^{-1}(\mu_\theta(z))\big)\Big]. \tag{5}
$$

Assuming the discriminator D is trained to optimality at every step, it is easy to demonstrate that the generator is trained to optimize DKL(pθ,ψ||p*). The training procedure, written as an algorithm in Appendix H, alternates between (i) bringing LQ(pθ,ψ) closer to its optimal value L*Q(pθ,ψ) = DKL(pθ,ψ||p*), and (ii) minimizing LC(pθ,ψ) + LQ(pθ,ψ).
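To make this alternation concrete, the following is a minimal sketch of one hybrid training step in PyTorch-style code. The module interfaces are illustrative assumptions, not the paper's implementation: `f_psi(x)` is assumed to return the mapped point and its log-determinant and to expose an `inverse` method, `encoder(x)` returns the posterior mean and log-variance, `decoder` exposes `mean`, `log_prob` and `latent_dim`, and `discriminator` returns probabilities in (0, 1). The actual schedules, architectures and hyper-parameters follow Appendices B and H.

```python
import torch

def hybrid_training_step(x, f_psi, encoder, decoder, discriminator, opt_gen, opt_disc):
    """One alternation of the hybrid coverage/quality objective (sketch only)."""
    # ---- discriminator update: standard GAN loss on real vs. generated images ----
    with torch.no_grad():
        z_prior = torch.randn(x.size(0), decoder.latent_dim)
        x_fake = f_psi.inverse(decoder.mean(z_prior))        # f_psi^{-1}(mu_theta(z))
    d_real, d_fake = discriminator(x), discriminator(x_fake)  # probabilities in (0,1);
    loss_d = -(torch.log(d_real) + torch.log1p(-d_fake)).mean()  # a logits form is safer numerically
    opt_disc.zero_grad(); loss_d.backward(); opt_disc.step()

    # ---- coverage-driven loss L_C (Eq. 4): negative ELBO in feature space, minus log-det ----
    y, logdet = f_psi(x)                                      # y = f_psi(x), log|det df_psi/dx|
    mu_z, logvar_z = encoder(x)                               # diagonal Gaussian q_phi(z|x)
    z = mu_z + torch.randn_like(mu_z) * (0.5 * logvar_z).exp()
    log_py_z = decoder.log_prob(y, z)                         # ln p_theta(f_psi(x) | z)
    kl = -0.5 * (1 + logvar_z - mu_z.pow(2) - logvar_z.exp()).sum(dim=1)
    loss_c = -(log_py_z - kl + logdet).mean()

    # ---- quality-driven loss L_Q (Eq. 5): ln[(1 - D(G(z))) / D(G(z))] on fresh samples ----
    z_prior = torch.randn(x.size(0), decoder.latent_dim)
    d_gen = discriminator(f_psi.inverse(decoder.mean(z_prior)))
    loss_q = (torch.log1p(-d_gen) - torch.log(d_gen)).mean()

    # ---- joint update of theta, phi and psi (opt_gen holds their parameters) ----
    opt_gen.zero_grad(); (loss_c + loss_q).backward(); opt_gen.step()
    return loss_c.item(), loss_q.item(), loss_d.item()
```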
Assuming that the discriminator is trained to optimality at every step, the generative model is trained to minimize a bound on the symmetric sum of two KL divergences:

$$
\mathcal{L}_C(p_{\theta,\psi}) + \mathcal{L}^*_Q(p_{\theta,\psi}) \;\ge\; D_{\mathrm{KL}}(p^* \,\|\, p_{\theta,\psi}) + D_{\mathrm{KL}}(p_{\theta,\psi} \,\|\, p^*) + H(p^*),
$$

where the entropy of the data generating distribution, H(p*), is an additive constant independent of the generative model pθ,ψ. In contrast, MLE and GANs each optimize only one of these divergences.

5 Experimental evaluation

We present our evaluation protocol, followed by an ablation study to assess the importance of the components of our model (Section 5.1). We then show the quantitative and qualitative performance of our model, and compare it to the state of the art on the CIFAR-10 dataset in Section 5.2. We present additional results and comparisons on higher resolution datasets in Section 5.3.

| Model | fψ | Adv. | MLE | BPD | IS | FID |
|---|---|---|---|---|---|---|
| GAN | | ✓ | | [7.0] | 6.8 | 31.4 |
| VAE | | | ✓ | 4.4 | 2.0 | 171.0 |
| V-ADE | ✓ | | ✓ | 3.5 | 3.0 | 112.0 |
| AV-GDE | | ✓ | ✓ | 4.4 | 5.1 | 58.6 |
| AV-ADE | ✓ | ✓ | ✓ | 3.9 | 7.1 | 28.0 |

Table 1: Quantitative results. The parameter count is decreased by 1.4% to compensate for fψ (see Section 5.1). [Square brackets] denote that the value is approximated, see Section 5.

Figure 5: Samples from GAN and VAE baselines, and from our V-ADE, AV-GDE and AV-ADE models, all trained on CIFAR-10.

Evaluation metrics. We evaluate our models with complementary metrics. To assess sample quality, we report the Fréchet inception distance (FID) [19] and the inception score (IS) [45], which are the de facto standard metrics to evaluate GANs [5, 54]. Although these metrics focus on sample quality, they are also sensitive to coverage, see Appendix D for details. To specifically evaluate the coverage of held-out data, we use the standard bits per dimension (BPD) metric, defined as the negative log-likelihood on held-out data, averaged across pixels and color channels [9]. Due to their degenerate low-dimensional support, GANs do not define a density in the image space, which prevents measuring BPD on them. To endow a GAN with a full support and a likelihood, we train an inference network around it, while keeping the weights of the GAN generator fixed. We also train an isotropic noise parameter σ. For both GANs and VAEs, we use this inference network to compute a lower bound to approximate the likelihood, i.e., an upper bound on BPD. We evaluate all metrics using held-out data not used during training, which improves over common practice in the GAN literature, where training data is often used for evaluation.

5.1 Ablation study and comparison to VAE and GAN baselines

We conduct an ablation study on the CIFAR-10 dataset, using the standard split of 50k/10k train/test images of 32×32 pixels. Our GAN baseline uses the non-residual architecture of SNGAN [37], which is stable and quick to train, without spectral normalization. The same convolutional architecture is kept to build a VAE baseline. It produces the mean of a factorized Gaussian distribution; to ensure a valid density model we add a trainable isotropic variance σ. In the VAE model, some intermediate feature maps are treated as conditional latent variables, allowing for hierarchical top-down sampling (see Appendix B); experimentally, we find that similar top-down sampling is not effective for the GAN model. We train the generator for coverage by optimizing LC(pθ), for quality by optimizing LQ(pθ), and for both by optimizing the sum LQ(pθ) + LC(pθ). The model using Variational inference with Adaptive Density Estimation (ADE) is referred to as V-ADE. The addition of adversarial training is denoted AV-ADE, and hybrid training with a Gaussian decoder as AV-GDE. The bijective function fψ, implemented as a small Real-NVP with 1 scale, 3 residual blocks, and 2 layers per block, increases the number of weights by approximately 1.4% (a sketch of the kind of coupling layer that makes up such a Real-NVP is given below).
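The following is a minimal, self-contained sketch of a single affine coupling layer of the kind Real-NVP stacks to build an invertible mapping; the class name, the fully-connected conditioner, and the layer sizes are illustrative, not the exact convolutional architecture used in our experiments (see Appendix B).

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One Real-NVP-style affine coupling layer: invertible, with tractable log-det."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.half = dim // 2
        # small conditioner predicting a log-scale and shift for the second half
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x1).chunk(2, dim=1)
        log_s = torch.tanh(log_s)                   # keep the scales well conditioned
        y2 = x2 * log_s.exp() + t
        logdet = log_s.sum(dim=1)                   # log |det df/dx|, per sample
        return torch.cat([x1, y2], dim=1), logdet

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        log_s, t = self.net(y1).chunk(2, dim=1)
        log_s = torch.tanh(log_s)
        x2 = (y2 - t) * (-log_s).exp()
        return torch.cat([y1, x2], dim=1)

# Round-trip check: x -> f(x) -> f^{-1}(f(x)) recovers x (up to numerical error).
layer = AffineCoupling(dim=3 * 32 * 32)
x = torch.randn(8, 3 * 32 * 32)
y, logdet = layer(x)
x_rec = layer.inverse(y)
```

Stacking several such layers, alternating which half of the dimensions is transformed, yields an invertible fψ whose log-determinant is the sum of the per-layer terms; this is exactly the quantity that enters Eq. (4).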
We compensate for these additional parameters with a slight decrease in the width of the generator for fair comparison; this decrease is, however, too small to have a significant impact on the experimental results. See Appendix B for details.

Experimental results in Table 1 confirm that the GAN baseline yields better sample quality (IS and FID) than the VAE baseline, obtaining inception scores of 6.8 and 2.0, respectively. Conversely, the VAE achieves better coverage, with a BPD of 4.4, compared to 7.0 for the GAN. An identical generator trained for both quality and coverage, AV-GDE, obtains a sample quality that is in between that of the GAN and the VAE baselines, in line with the analysis in Section 4. Samples from the different models in Figure 5 confirm these quantitative results. Using fψ and training with LC(pθ) only, denoted by V-ADE in the table, leads to improved sample quality, with IS up from 2.0 to 3.0 and FID down from 171 to 112. Note that this quality is still below the GAN baseline and our AV-GDE model. When fψ is used with coverage- and quality-driven training, AV-ADE, we obtain improved IS and FID scores over the GAN baseline, with IS up from 6.8 to 7.1, and FID down from 31.4 to 28.0. The examples shown in the figure confirm the high quality of the samples generated by our AV-ADE model. Our model also achieves a better BPD than the VAE baseline.

These experiments demonstrate that our proposed bijective feature space substantially improves the compatibility of coverage- and quality-driven training. We obtain improvements over both VAE and GAN in terms of held-out likelihood, and improve VAE sample quality to, or beyond, that of the GAN. We further evaluate our model using the recent precision and recall approach of [43] and the classification framework of [48] in Appendix E. Additional results showing the impact of the number of layers and scales in the bijective similarity mapping fψ (Appendix F), and reconstructions qualitatively demonstrating the inference abilities of our AV-ADE model (Appendix G), are presented in the supplementary material.

5.2 Refinements and comparison to the state of the art

We now consider further refinements to our model, inspired by recent generative modeling literature. Four refinements are used: (i) adding residual connections to the discriminator [17] (rd), (ii) leveraging more accurate posterior approximations using inverse auto-regressive flow [26] (iaf), see Appendix B, (iii) training wider generators with twice as many channels (wg), and (iv) using a hierarchy of two scales to build fψ (s2), see Appendix F. Table 2 shows consistent improvements with these additions, in terms of BPD, IS, and FID.

| Model | BPD | IS | FID |
|---|---|---|---|
| GAN | [7.0] | 6.8 | 31.4 |
| GAN (rd) | [6.9] | 7.4 | 24.0 |
| AV-ADE | 3.9 | 7.1 | 28.0 |
| AV-ADE (rd) | 3.8 | 7.5 | 26.0 |
| AV-ADE (wg, rd) | 3.8 | 8.2 | 17.2 |
| AV-ADE (iaf, rd) | 3.7 | 8.1 | 18.6 |
| AV-ADE (s2) | 3.5 | 6.9 | 28.9 |

Table 2: Model refinements.

Table 3 compares our model to existing hybrid approaches and state-of-the-art generative models on CIFAR-10. In the category of hybrid models that define a valid likelihood over the data space, denoted by Hybrid (L) in the table, Flow-GAN (H) optimizes MLE and an adversarial loss, and Flow-GAN (A) is trained adversarially.
The AV-ADE model significantly outperforms these two variants both in terms of BPD, from 4.2 down to between 3.5 and 3.8, and in terms of quality, e.g., IS improves from 5.8 to 8.2. Compared to models that train an inference network adversarially, denoted by Hybrid (A), our model shows a substantial improvement in IS, from 7.0 to 8.2. Note that these models do not allow likelihood evaluation, thus BPD values are not defined. Compared to adversarial models, which are not optimized for support coverage, AV-ADE obtains better FID (17.2, down from 21.7) and similar IS (8.2 for both) compared to SNGAN with residual connections and hinge loss, despite training on 17% less data than the GANs (test split removed). The improvement in FID is likely due to this measure being more sensitive to support coverage than IS. Compared to models optimized with MLE only, we obtain a BPD between 3.5 and 3.7, comparable to 3.5 for Real-NVP, demonstrating good coverage of the support of held-out data. We computed IS and FID scores for MLE-based models using publicly released code, either with the provided parameters or by (re)training the models ourselves. Despite being smaller (for reference, Glow has 384 layers vs. at most 10 for our deeper generator), our AV-ADE model generates better samples, e.g., IS up from 5.5 to 8.2 (samples displayed in Figure 6), owing to quality-driven training controlling where the off-dataset mass goes. Additional samples from our AV-ADE model and comparisons to other models are given in Appendix A.

| Category | Model | BPD | IS | FID |
|---|---|---|---|---|
| Hybrid (L) | AV-ADE (wg, rd) | 3.8 | 8.2 | 17.2 |
| Hybrid (L) | AV-ADE (iaf, rd) | 3.7 | 8.1 | 18.6 |
| Hybrid (L) | AV-ADE (s2) | 3.5 | 6.9 | 28.9 |
| Hybrid (L) | Flow-GAN (A) [16] | 8.5 | 5.8 | – |
| Hybrid (L) | Flow-GAN (H) [16] | 4.2 | 3.9 | – |
| Hybrid (A) | AGE [52] | – | 5.9 | – |
| Hybrid (A) | ALI [12] | – | 5.3 | – |
| Hybrid (A) | SVAE [6] | – | 6.8 | – |
| Hybrid (A) | α-GAN [41] | – | 6.8 | – |
| Hybrid (A) | SVAE-r [6] | – | 7.0 | – |
| Adversarial | MMD-GAN [1] | – | 7.3 | 25.0 |
| Adversarial | SNGAN [37] | – | 7.4 | 29.3 |
| Adversarial | BatchGAN [32] | – | 7.5 | 23.7 |
| Adversarial | WGAN-GP [17] | – | 7.9 | – |
| Adversarial | SNGAN (R, H) | – | 8.2 | 21.7 |
| MLE | Real-NVP [9] | 3.5 | 4.5 | 56.8 |
| MLE | VAE-IAF [26] | 3.1 | 3.8 | 73.5 |
| MLE | PixelCNN++ [46] | 2.9 | 5.5 | – |
| MLE | Flow++ [20] | 3.1 | – | – |
| MLE | Glow [24] | 3.4 | 5.5 | 46.8 |

Table 3: Performance on CIFAR-10, without labels. MLE and Hybrid (L) models discard the test split. IS and FID values for MLE models were computed by us, using either the provided weights or the provided code to (re)train the models.

5.3 Results on additional datasets

To further validate our model we evaluate it on STL-10 (48×48), and on ImageNet and LSUN (both 64×64). We use a wide generator to account for the higher resolution, without IAF, a single scale in fψ, and no residual blocks (see Section 5.2). The architecture and training hyper-parameters are not tuned, besides adding one layer at resolution 64×64, which demonstrates the stability of our approach. On STL-10, Table 4 shows that our AV-ADE improves the inception score over SNGAN, from 9.1 up to 9.4, and is second best in FID. Our likelihood performance, between 3.8 and 4.4 BPD and close to that of Real-NVP at 3.7, demonstrates good coverage of the support of held-out data.

Figure 6: Samples from models trained on CIFAR-10: Glow at 3.35 BPD, Flow-GAN (H) at 4.21 BPD, and our AV-ADE (iaf, rd) at 3.7 BPD. Our AV-ADE spills less mass on unrealistic samples, owing to adversarial training, which controls where off-dataset mass goes.

On the ImageNet dataset, maintaining high sample quality while covering the full support is challenging, due to its very diverse support. Our AV-ADE model obtains a sample quality behind that of MMD-GAN, with IS/FID scores at 8.5/45.5 vs. 10.9/36.6. However, MMD-GAN is trained purely adversarially and does not provide a valid density across the data space, unlike our approach.
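For reference, the BPD numbers reported in Tables 3 and 4 are a per-dimension, base-2 rescaling of the negative log-likelihood (Section 5); a minimal sketch of that conversion, assuming the per-image NLL is available in nats and `num_dims` is the number of pixels times color channels (a hypothetical helper, not code from the paper):

```python
import math

def bits_per_dim(nll_nats: float, num_dims: int) -> float:
    """Convert a per-image negative log-likelihood (in nats) to bits per dimension."""
    return nll_nats / (num_dims * math.log(2))

# e.g., for 64x64 RGB LSUN images: bits_per_dim(nll, 64 * 64 * 3)
```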
Figure 7 shows samples from our generator, trained on a single GPU with 11 GB of memory, on LSUN classes. Our model yields more compelling samples than those of Glow, despite having fewer layers (7 vs. over 500). Additional samples on other LSUN categories are presented in Appendix A.

STL-10 (48×48):

| Model | BPD | IS | FID |
|---|---|---|---|
| AV-ADE (wg, wd) | 4.4 | 9.4 | 44.3 |
| AV-ADE (iaf, wd) | 4.0 | 8.6 | 52.7 |
| AV-ADE (s2) | 3.8 | 8.6 | 52.1 |
| Real-NVP | 3.7 | 4.8 | 103.2 |
| BatchGAN | – | 8.7 | 51 |
| SNGAN (Res-Hinge) | – | 9.1 | 40.1 |

ImageNet (64×64):

| Model | BPD | IS | FID |
|---|---|---|---|
| AV-ADE (wg, wd) | 4.90 | 8.5 | 45.5 |
| Real-NVP | 3.98 | – | – |
| Glow | 3.81 | – | – |
| Flow++ | 3.69 | – | – |
| MMD-GAN | – | 10.9 | 36.6 |

LSUN (64×64), reported as (BPD / FID):

| Class | Real-NVP | Glow | Ours |
|---|---|---|---|
| Bedroom | 2.72 / – | 2.38 / 208.8 | 3.91 / 21.1 |
| Tower | 2.81 / – | 2.46 / 214.5 | 3.95 / 15.5 |
| Church | 3.08 / – | 2.67 / 222.3 | 4.3 / 13.1 |
| Classroom | – | – | 4.6 / 20.0 |
| Restaurant | – | – | 4.7 / 20.5 |

Table 4: Results on the STL-10, ImageNet, and LSUN datasets. AV-ADE (wg, rd) is used for LSUN.

Figure 7: Samples from Glow [24] and from our AV-ADE (wg, rd), trained on LSUN churches (C) and bedrooms (B). Our AV-ADE model over-generalises less and produces more compelling samples. See Appendix A for more classes and samples.

6 Conclusion

We presented a generative model that leverages invertible network layers to relax the conditional independence assumptions commonly made in VAE decoders. It allows for efficient feed-forward sampling, and can be trained using a maximum likelihood criterion that ensures coverage of the data generating distribution, as well as an adversarial criterion that ensures high sample quality.

Acknowledgments. The authors would like to thank Corentin Tallec, Mathilde Caron, Adria Ruiz and Nikita Dvornik for useful feedback and discussions. Acknowledgments also go to our anonymous reviewers, who contributed valuable comments and remarks. This work was supported in part by the grants ANR16-CE23-0006 "Deep in France" and LabEx PERSYVAL-Lab (ANR-11-LABX0025-01), as well as the Indo-French project EVEREST (no. 5302-1) funded by CEFIPRA, and a grant from ANR (AVENUE project ANR-18-CE23-0011).

References

[1] M. Arbel, D. J. Sutherland, M. Binkowski, and A. Gretton. On gradient regularizers for MMD GANs. In NeurIPS, 2018.
[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
[3] P. Bachman. An architecture for deep, hierarchical generative models. In NeurIPS, 2016.
[4] C. Bishop. Pattern recognition and machine learning. Springer-Verlag, 2006.
[5] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.
[6] L. Chen, S. Dai, Y. Pu, E. Zhou, C. Li, Q. Su, C. Chen, and L. Carin. Symmetric variational autoencoder and connections to adversarial learning. In AISTATS, 2018.
[7] X. Chen, D. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel. Variational lossy autoencoder. In ICLR, 2017.
[8] H. De Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin, and A. Courville. Modulating early visual processing by language. In NeurIPS, 2017.
[9] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. In ICLR, 2017.
[10] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. In ICLR, 2017.
[11] G. Dorta, S. Vicente, L. Agapito, N. Campbell, and I. Simpson. Structured uncertainty prediction networks. In CVPR, 2018.
[12] V. Dumoulin, I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mastropietro, and A. Courville. Adversarially learned inference. In ICLR, 2017.
[13] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. In ICLR, 2017.
[14] M. Elfeki, C. Couprie, M. Riviere, and M. Elhoseiny. GDPP: Learning diverse generations using determinantal point processes. arXiv preprint arXiv:1812.00068, 2018.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, 2014.
[16] A. Grover, M. Dhar, and S. Ermon. Flow-GAN: Combining maximum likelihood and adversarial learning in generative models. In AAAI, 2018.
[17] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. In NeurIPS, 2017.
[18] I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taiga, F. Visin, D. Vazquez, and A. Courville. PixelVAE: A latent variable model for natural images. In ICLR, 2017.
[19] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
[20] J. Ho, X. Chen, A. Srinivas, Y. Duan, and P. Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design. In ICML, 2019.
[21] X. Hou, L. Shen, K. Sun, and G. Qiu. Deep feature consistent variational autoencoder. In WACV, 2017.
[22] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.
[23] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[24] D. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In NeurIPS, 2018.
[25] D. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[26] D. Kingma, T. Salimans, R. Józefowicz, X. Chen, I. Sutskever, and M. Welling. Improving variational autoencoders with inverse autoregressive flow. In NeurIPS, 2016.
[27] A. Larsen, S. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016.
[28] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In ICML, 2015.
[29] Z. Lin, A. Khetan, G. Fanti, and S. Oh. PacGAN: The power of two samples in generative adversarial networks. In NeurIPS, 2018.
[30] O. Litany, A. Bronstein, M. Bronstein, and A. Makadia. Deformable shape completion with graph convolutional autoencoders. In CVPR, 2018.
[31] T. Lucas and J. Verbeek. Auxiliary guided autoregressive variational autoencoders. In ECML, 2018.
[32] T. Lucas, C. Tallec, Y. Ollivier, and J. Verbeek. Mixed batches and symmetric discriminators for GAN training. In ICML, 2018.
[33] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow. Adversarial autoencoders. In ICLR, 2016.
[34] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
[35] J. Menick and N. Kalchbrenner. Generating high fidelity images with subscale pixel networks and multidimensional upscaling. In ICLR, 2019.
[36] T. Miyato and M. Koyama. cGANs with projection discriminator. In ICLR, 2018.
[37] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
[38] P. Ramachandran, T. Paine, P. Khorrami, M. Babaeizadeh, S. Chang, Y. Zhang, M. Hasegawa-Johnson, R. Campbell, and T. Huang. Fast generation for convolutional autoregressive models. In ICLR workshop, 2017.
[39] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In ICML, 2015.
[40] D. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
[41] M. Rosca, B. Lakshminarayanan, D. Warde-Farley, and S. Mohamed. Variational approaches for auto-encoding generative adversarial networks. arXiv preprint arXiv:1706.04987, 2017.
[42] Y. Saatchi and A. Wilson. Bayesian GAN. In NeurIPS, 2017.
[43] M. Sajjadi, O. Bachem, M. Lucic, O. Bousquet, and S. Gelly. Assessing generative models via precision and recall. In NeurIPS, 2018.
[44] T. Salimans and D. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NeurIPS, 2016.
[45] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NeurIPS, 2016.
[46] T. Salimans, A. Karpathy, X. Chen, and D. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In ICLR, 2017.
[47] T. Salimans, H. Zhang, A. Radford, and D. Metaxas. Improving GANs using optimal transport. In ICLR, 2018.
[48] K. Shmelkov, C. Schmid, and K. Alahari. How good is my GAN? In ECCV, 2018.
[49] C. Sønderby, T. Raiko, L. Maaløe, S. Sønderby, and O. Winther. Ladder variational autoencoders. In NeurIPS, 2016.
[50] C. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár. Amortised MAP inference for image super-resolution. In ICLR, 2017.
[51] H. Thanh-Tung, T. Tran, and S. Venkatesh. Improving generalization and stability of generative adversarial networks. In ICLR, 2019.
[52] D. Ulyanov, A. Vedaldi, and V. Lempitsky. It takes (only) two: Adversarial generator-encoder networks. In AAAI, 2018.
[53] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.
[54] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. In ICML, 2019.