FROM VARIATIONAL TO DETERMINISTIC AUTOENCODERS

Partha Ghosh*, Mehdi S. M. Sajjadi*, Antonio Vergari, Michael Black, Bernhard Schölkopf
Max Planck Institute for Intelligent Systems, Tübingen, Germany
{pghosh,msajjadi,black,bs}@tue.mpg.de
University of California, Los Angeles, USA
aver@cs.ucla.edu

*Equal contribution.

Variational Autoencoders (VAEs) provide a theoretically-backed and popular framework for deep generative models. However, learning a VAE from data still poses unanswered theoretical questions and considerable practical challenges. In this work, we propose an alternative framework for generative modeling that is simpler, easier to train, and deterministic, yet has many of the advantages of VAEs. We observe that sampling a stochastic encoder in a Gaussian VAE can be interpreted as simply injecting noise into the input of a deterministic decoder. We investigate how substituting this kind of stochasticity with other explicit and implicit regularization schemes can lead to an equally smooth and meaningful latent space without forcing it to conform to an arbitrarily chosen prior. To retrieve a generative mechanism to sample new data, we introduce an ex-post density estimation step that can also be readily applied to existing VAEs, improving their sample quality. We show, in a rigorous empirical study, that the proposed regularized deterministic autoencoders are able to generate samples that are comparable to, or better than, those of VAEs and more powerful alternatives when applied to images as well as to structured data such as molecules.¹

¹An implementation is available at: https://github.com/ParthaEth/Regularized_autoencoders-RAE-

1 INTRODUCTION

Generative models lie at the core of machine learning. By capturing the mechanisms behind the data generation process, one can reason about data probabilistically, access and traverse the low-dimensional manifold the data is assumed to live on, and ultimately generate new data. It is therefore not surprising that generative models have gained momentum in applications such as computer vision (Sohn et al., 2015; Brock et al., 2019), NLP (Bowman et al., 2016; Severyn et al., 2017), and chemistry (Kusner et al., 2017; Jin et al., 2018; Gómez-Bombarelli et al., 2018).

Variational Autoencoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014) cast learning representations for high-dimensional distributions as a variational inference problem. Learning a VAE amounts to the optimization of an objective balancing the quality of samples that are autoencoded through a stochastic encoder-decoder pair while encouraging the latent space to follow a fixed prior distribution. Since their introduction, VAEs have become one of the frameworks of choice among the different generative models. VAEs promise theoretically well-founded and more stable training than Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) and more efficient sampling mechanisms than autoregressive models (Larochelle & Murray, 2011; Germain et al., 2015).

However, the VAE framework is still far from delivering the promised generative mechanism, as there are several practical and theoretical challenges yet to be solved. A major weakness of VAEs is the tendency to strike an unsatisfying compromise between sample quality and reconstruction quality.
In practice, this has been attributed to overly simplistic prior distributions (Tomczak & Welling, 2018; Dai & Wipf, 2019) or, alternatively, to the inherent over-regularization induced by the KL divergence term in the VAE objective (Tolstikhin et al., 2017). Most importantly, the VAE objective itself poses several challenges, as it admits trivial solutions that decouple the latent space from the input (Chen et al., 2017; Zhao et al., 2017), leading to the posterior collapse phenomenon in conjunction with powerful decoders (van den Oord et al., 2017). Furthermore, due to its variational formulation, training a VAE requires approximating expectations through sampling at the cost of increased variance in gradients (Burda et al., 2015; Tucker et al., 2017), making initialization, validation, and annealing of hyperparameters essential in practice (Bowman et al., 2016; Higgins et al., 2017; Bauer & Mnih, 2019). Lastly, even after a satisfactory convergence of the objective, the learned aggregated posterior distribution rarely matches the assumed latent prior in practice (Kingma et al., 2016; Bauer & Mnih, 2019; Dai & Wipf, 2019), ultimately hurting the quality of generated samples. All in all, much of the attention around VAEs is still directed towards fixing the aforementioned drawbacks associated with them.

In this work, we take a different route: we question whether the variational framework adopted by VAEs is necessary for generative modeling and, in particular, for obtaining a smooth latent space. We propose to adopt a simpler, deterministic version of VAEs that scales better, is simpler to optimize, and, most importantly, still produces a meaningful latent space and equally good or better samples than VAEs or stronger alternatives, e.g., Wasserstein Autoencoders (WAEs) (Tolstikhin et al., 2017). We do so by observing that, under commonly used distributional assumptions, training a stochastic encoder-decoder pair in VAEs does not differ from training a deterministic architecture where noise is added to the decoder's input. We investigate how to substitute this noise injection mechanism with other regularization schemes in the proposed deterministic Regularized Autoencoders (RAEs), and we thoroughly analyze how this affects performance. Finally, we equip RAEs with a generative mechanism via a simple ex-post density estimation step on the learned latent space.

In summary, our contributions are as follows: i) we introduce the RAE framework for generative modeling as a drop-in replacement for many common VAE architectures; ii) we propose an ex-post density estimation scheme which greatly improves sample quality for VAEs, WAEs and RAEs without the need to retrain the models; iii) we conduct a rigorous empirical evaluation to compare RAEs with VAEs and several baselines on standard image datasets and on more challenging structured domains such as molecule generation (Kusner et al., 2017; Gómez-Bombarelli et al., 2018).

2 VARIATIONAL AUTOENCODERS

For a general discussion, we consider a collection of high-dimensional i.i.d. samples $\mathbf{X} = \{x_i\}_{i=1}^{N}$ drawn from the true data distribution $p_{data}(x)$ over a random variable $X$ taking values in the input space. The aim of generative modeling is to learn from $\mathbf{X}$ a mechanism to draw new samples $x_{new} \sim p_{data}$. Variational Autoencoders provide a powerful latent variable framework to infer such a mechanism.
The generative process of the VAE is defined as

$$z_{new} \sim p(Z), \qquad x_{new} \sim p_\theta(X \mid Z = z_{new}) \tag{1}$$

where $p(Z)$ is a fixed prior distribution over a low-dimensional latent space $Z$. A stochastic decoder

$$D_\theta(z) = x \sim p_\theta(x \mid z) = p(X \mid g_\theta(z)) \tag{2}$$

links the latent space to the input space through the likelihood distribution $p_\theta$, where $g_\theta$ is an expressive non-linear function parameterized by $\theta$.² As a result, a VAE estimates $p_{data}(x)$ as the infinite mixture model $p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$. At the same time, the input space is mapped to the latent space via a stochastic encoder

$$E_\phi(x) = z \sim q_\phi(z \mid x) = q(Z \mid f_\phi(x)) \tag{3}$$

where $q_\phi(z \mid x)$ is the posterior distribution given by a second function $f_\phi$ parameterized by $\phi$.

Computing the marginal log-likelihood $\log p_\theta(x)$ is generally intractable. One therefore follows a variational approach, maximizing the evidence lower bound (ELBO) for a sample $x$:

$$\log p_\theta(x) \geq \mathrm{ELBO}(\phi, \theta, x) = \mathbb{E}_{z \sim q_\phi(z \mid x)} \log p_\theta(x \mid z) - \mathrm{KL}(q_\phi(z \mid x)\,\|\, p(z)) \tag{4}$$

²With slight abuse of notation, we use lowercase letters for both random variables and their realizations, e.g., $p_\theta(x \mid z)$ instead of $p(X \mid Z = z)$, when it is clear how to discriminate between the two.

Maximizing Eq. 4 over data $\mathbf{X}$ w.r.t. the model parameters $\phi, \theta$ corresponds to minimizing the loss

$$\arg\min_{\phi,\theta}\; \mathbb{E}_{x \sim p_{data}}\, \mathcal{L}_{ELBO} = \mathbb{E}_{x \sim p_{data}} \big[ \mathcal{L}_{REC} + \mathcal{L}_{KL} \big] \tag{5}$$

where $\mathcal{L}_{REC}$ and $\mathcal{L}_{KL}$ are defined for a sample $x$ as follows:

$$\mathcal{L}_{REC} = -\mathbb{E}_{z \sim q_\phi(z \mid x)} \log p_\theta(x \mid z), \qquad \mathcal{L}_{KL} = \mathrm{KL}(q_\phi(z \mid x)\,\|\, p(z)) \tag{6}$$

Intuitively, the reconstruction loss $\mathcal{L}_{REC}$ takes into account the quality of autoencoded samples $x$ through $D_\theta(E_\phi(x))$, while the KL-divergence term $\mathcal{L}_{KL}$ encourages $q_\phi(z \mid x)$ to match the prior $p(z)$ for each $z$, which acts as a regularizer during training (Hoffman & Johnson, 2016).

2.1 PRACTICE AND SHORTCOMINGS OF VAES

To fit a VAE to data through Eq. 5, one has to specify the parametric forms for $p(z)$, $q_\phi(z \mid x)$, $p_\theta(x \mid z)$, and hence the deterministic mappings $f_\phi$ and $g_\theta$. In practice, the choice of the above distributions is guided by trading off computational complexity against model expressiveness. In the most commonly adopted formulation of the VAE, $q_\phi(z \mid x)$ and $p_\theta(x \mid z)$ are assumed to be Gaussian:

$$E_\phi(x) \sim \mathcal{N}(Z \mid \mu_\phi(x), \mathrm{diag}(\sigma_\phi(x))), \qquad D_\theta(E_\phi(x)) \sim \mathcal{N}(X \mid \mu_\theta(z), \mathrm{diag}(\sigma_\theta(z))) \tag{7}$$

with means $\mu_\phi, \mu_\theta$ and covariance parameters $\sigma_\phi, \sigma_\theta$ given by $f_\phi$ and $g_\theta$. In practice, the covariance of the decoder is set to the identity matrix for all $z$, i.e., $\sigma_\theta(z) = 1$ (Dai & Wipf, 2019). The expectation of $\mathcal{L}_{REC}$ in Eq. 6 must be approximated via $k$ Monte Carlo point estimates. It is expected that the quality of the Monte Carlo estimate, and hence convergence during learning and sample quality, increases for larger $k$ (Burda et al., 2015). However, only a 1-sample approximation is generally carried out (Kingma & Welling, 2014), since memory and time requirements are prohibitive for large $k$. With the 1-sample approximation, $\mathcal{L}_{REC}$ can be computed as the mean squared error between input samples and their mean reconstructions $\mu_\theta$ by a decoder that is deterministic in practice:

$$\mathcal{L}_{REC} = \| x - \mu_\theta(E_\phi(x)) \|_2^2 \tag{8}$$

Gradients w.r.t. the encoder parameters $\phi$ are computed through the expectation of $\mathcal{L}_{REC}$ in Eq. 6 via the reparametrization trick (Kingma & Welling, 2014), where the stochasticity of $E_\phi$ is relegated to an auxiliary random variable $\epsilon$ which does not depend on $\phi$:

$$E_\phi(x) = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I}) \tag{9}$$

where $\odot$ denotes the Hadamard product. An additional simplifying assumption involves fixing the prior $p(z)$ to be a $d$-dimensional isotropic Gaussian $\mathcal{N}(Z \mid \mathbf{0}, \mathbf{I})$. For this choice, the KL-divergence for a sample $x$ is given in closed form: $2\,\mathcal{L}_{KL} = \|\mu_\phi(x)\|_2^2 + \sum_{i=1}^{d} \big(\sigma_\phi(x)_i - \log \sigma_\phi(x)_i\big) - d$.
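To make these assumptions concrete, the following is a minimal PyTorch-style sketch of the resulting 1-sample training loss (Eqs. 8 and 9 together with the closed-form KL term). It is only an illustration under our own conventions — e.g., the encoder returning a log-variance and the particular batch reduction — and not the authors' implementation.

```python
import torch

def gaussian_vae_loss(x, encoder, decoder):
    """1-sample Monte Carlo estimate of the Gaussian VAE loss (Eqs. 5-9).

    `encoder(x)` is assumed to return the posterior mean and log-variance,
    `decoder(z)` the reconstruction mean mu_theta(z); both names are illustrative.
    """
    mu, logvar = encoder(x)                      # parameters of q_phi(z | x)
    eps = torch.randn_like(mu)                   # auxiliary noise, independent of phi
    z = mu + torch.exp(0.5 * logvar) * eps       # reparametrization trick (Eq. 9)
    x_rec = decoder(z)                           # deterministic decoder mean

    # L_REC (Eq. 8): squared error, summed over pixels and averaged over the batch
    l_rec = ((x - x_rec) ** 2).flatten(1).sum(-1).mean()
    # L_KL: closed-form KL divergence to the isotropic Gaussian prior
    l_kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1).mean()
    return l_rec + l_kl
```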
While the above assumptions make VAEs easy to implement, the stochasticity in the encoder and decoder is still problematic in practice (Makhzani et al., 2016; Tolstikhin et al., 2017; Dai & Wipf, 2019). In particular, one has to carefully balance the trade-off between the $\mathcal{L}_{KL}$ term and $\mathcal{L}_{REC}$ during optimization (Dai & Wipf, 2019; Bauer & Mnih, 2019). A too-large weight on the $\mathcal{L}_{KL}$ term can dominate $\mathcal{L}_{ELBO}$, having the effect of over-regularization. As this over-smooths the latent space, it can directly affect sample quality in a negative way. Heuristics to avoid this include manually fine-tuning or gradually annealing the importance of $\mathcal{L}_{KL}$ during training (Bowman et al., 2016; Bauer & Mnih, 2019). We also observe this trade-off in a practical experiment in Appendix A.

Even after employing the full array of approximations and tricks to reach convergence of Eq. 5 for a satisfactory set of parameters, there is no guarantee that the learned latent space is distributed according to the assumed prior distribution. In other words, the aggregated posterior distribution $q_\phi(z) = \mathbb{E}_{x \sim p_{data}}\, q_\phi(z \mid x)$ has been shown not to conform well to $p(z)$ after training (Tolstikhin et al., 2017; Bauer & Mnih, 2019; Dai & Wipf, 2019). This critical issue severely hinders the generative mechanism of VAEs (cf. Eq. 1), since latent codes sampled from $p(z)$ (instead of $q_\phi(z)$) might lead to regions of the latent space previously unseen by $D_\theta$ during training. This results in generating out-of-distribution samples. We refer the reader to Appendix H for a visual demonstration of this phenomenon on the latent space of VAEs. We analyze solutions to this problem in Section 4.

2.2 CONSTANT-VARIANCE ENCODERS

Before introducing our fully-deterministic take on VAEs, it is worth investigating intermediate flavors of VAEs with reduced stochasticity. Analogous to what is commonly done for decoders, as discussed in the previous section, one can fix the variance of $q_\phi(z \mid x)$ to be constant for all $x$. This simplifies the computation of $E_\phi$ from Eq. 9 to

$$E^{CV}_\phi(x) = \mu_\phi(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma \mathbf{I}) \tag{10}$$

where $\sigma$ is a fixed scalar. Then, the KL loss term in a Gaussian VAE simplifies (up to a constant) to $\mathcal{L}^{CV}_{KL} = \|\mu_\phi(x)\|_2^2$. We name this variant Constant-Variance VAEs (CV-VAEs). While CV-VAEs have been adopted in some applications such as variational image compression (Ballé et al., 2017) and adversarial robustness (Ghosh et al., 2019), to the best of our knowledge there is no systematic study of them in the literature. We will fill this gap in our experiments in Section 6.

Lastly, note that $\sigma$ in Eq. 10 is now not learned along with the encoder as in Eq. 9. Nevertheless, it can still be fitted as a hyperparameter, e.g., by cross-validation, to maximize the model likelihood. This highlights the possibility of estimating a better parametric form for the latent space distribution after training, or in an outer loop that includes training. We address this and provide a more flexible solution to deal with the prior structure over Z via ex-post density estimation in Section 4.
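For concreteness, the constant-variance encoder of Eq. 10 amounts to a one-line change with respect to the sketch above. Again, this is PyTorch-style pseudocode with illustrative names; `sigma` stands for the fixed scalar treated as a hyperparameter.

```python
import torch

def cv_encode(x, encoder_mean, sigma=1.0):
    """Constant-variance encoder (Eq. 10): only the posterior mean is learned."""
    mu = encoder_mean(x)                     # mu_phi(x)
    z = mu + sigma * torch.randn_like(mu)    # fixed, input-independent noise scale
    l_kl_cv = mu.pow(2).sum(-1).mean()       # simplified KL term, up to a constant
    return z, l_kl_cv
```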
3 DETERMINISTIC REGULARIZED AUTOENCODERS

Autoencoding in VAEs is defined in a probabilistic fashion: $E_\phi$ and $D_\theta$ map data points not to a single point, but rather to parameterized distributions (cf. Eq. 7). However, common implementations of VAEs as discussed in Section 2 admit a simpler, deterministic view of this probabilistic mechanism. A glance at the autoencoding mechanism of the VAE is revealing. The encoder deterministically maps a data point $x$ to a mean $\mu_\phi(x)$ and a variance $\sigma_\phi(x)$ in the latent space. The input to $D_\theta$ is then simply the mean $\mu_\phi(x)$ augmented with Gaussian noise scaled by $\sigma_\phi(x)$ via the reparametrization trick (cf. Eq. 9). In the CV-VAE, this relationship is even more obvious, as the magnitude of the noise is fixed for all data points (cf. Eq. 10). In this light, a VAE can be seen as a deterministic autoencoder where (Gaussian) noise is added to the decoder's input.

We argue that this noise injection mechanism is a key factor in having a regularized decoder. Using random noise injection to regularize neural networks is a well-known technique that dates back several decades (Sietsma & Dow, 1991; An, 1996). It implicitly helps to smooth the function learned by the network, at the price of increased variance in the gradients during training. In turn, decoder regularization is a key component in generalization for VAEs, as it improves random sample quality and achieves a smoother latent space. Indeed, from a generative perspective, regularization is motivated by the goal of learning a smooth latent space where similar data points $x$ are mapped to similar latent codes $z$, and small variations in $Z$ lead to reconstructions by $D_\theta$ that vary only slightly.

We propose to substitute noise injection with an explicit regularization scheme for the decoder. This entails the substitution of the variational framework in VAEs, which enforces regularization on the encoder posterior through $\mathcal{L}_{KL}$, with a deterministic framework that applies other flavors of decoder regularization. By removing noise injection from a CV-VAE, we are effectively left with a deterministic autoencoder (AE). Coupled with explicit regularization for the decoder, we obtain a Regularized Autoencoder (RAE). Training a RAE thus involves minimizing the simplified loss

$$\mathcal{L}_{RAE} = \mathcal{L}_{REC} + \beta \mathcal{L}^{RAE}_{Z} + \lambda \mathcal{L}_{REG} \tag{11}$$

where $\mathcal{L}_{REG}$ represents the explicit regularizer for $D_\theta$ (discussed in Section 3.1) and $\mathcal{L}^{RAE}_{Z} = \frac{1}{2}\|z\|_2^2$ (resulting from simplifying $\mathcal{L}^{CV}_{KL}$) is equivalent to constraining the size of the learned latent space, which is still needed to prevent unbounded optimization. Finally, $\beta$ and $\lambda$ are two hyperparameters that balance the different loss terms.

Note that for RAEs, no Monte Carlo approximation is required to compute $\mathcal{L}_{REC}$. This relieves the need for more samples from $q_\phi(z \mid x)$ to achieve better image quality (cf. Appendix A). Moreover, by abandoning the variational framework and the $\mathcal{L}_{KL}$ term, there is no need in RAEs for a fixed prior distribution over $Z$. Doing so, however, loses a clear generative mechanism for RAEs to sample from $Z$. We propose a method to regain random sampling ability in Section 4 by performing density estimation on $Z$ ex-post, a step that is otherwise still needed for VAEs to alleviate the posterior mismatch issue.

3.1 REGULARIZATION SCHEMES FOR RAES

Among possible choices for $\mathcal{L}_{REG}$, a first obvious candidate is Tikhonov regularization (Tikhonov & Arsenin, 1977), since it is known to be related to the addition of low-magnitude input noise (Bishop, 2006). Training a RAE within this framework thus amounts to adopting $\mathcal{L}_{REG} = \mathcal{L}_{L2} = \|\theta\|_2^2$, which effectively applies weight decay on the decoder parameters $\theta$.
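Put together, a minimal PyTorch-style sketch of the RAE-L2 objective (Eq. 11 with the weight-decay regularizer) could look as follows. The hyperparameter values and function names are placeholders, not the ones used in the paper.

```python
import torch

def rae_l2_loss(x, encoder, decoder, beta=1e-4, lam=1e-7):
    """Deterministic RAE objective (Eq. 11) with Tikhonov/L2 decoder regularization."""
    z = encoder(x)                                             # deterministic code, no sampling
    x_rec = decoder(z)
    l_rec = ((x - x_rec) ** 2).flatten(1).sum(-1).mean()       # L_REC
    l_z = 0.5 * z.pow(2).sum(-1).mean()                        # L_Z^RAE keeps codes bounded
    l_reg = sum(p.pow(2).sum() for p in decoder.parameters())  # L_L2: weight decay on theta
    return l_rec + beta * l_z + lam * l_reg
```

In practice, the same effect can typically be obtained by simply passing a `weight_decay` value for the decoder's parameter group to the optimizer instead of adding the penalty to the loss.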
Another option comes from the recent GAN literature, where regularization is a hot topic (Kurach et al., 2018) and where injecting noise into the input of the adversarial discriminator has led to improved performance in a technique called instance noise (Sønderby et al., 2017). To enforce Lipschitz continuity on adversarial discriminators, weight clipping has been proposed (Arjovsky et al., 2017), which is however known to significantly slow down training. More successfully, a gradient penalty on the discriminator can be used, similar to Gulrajani et al. (2017) and Mescheder et al. (2018), yielding the objective $\mathcal{L}_{REG} = \mathcal{L}_{GP} = \|\nabla D_\theta(E_\phi(x))\|_2^2$, which bounds the gradient norm of the decoder w.r.t. its input.

Additionally, spectral normalization (SN) has been successfully proposed as an alternative way to bound the Lipschitz norm of an adversarial discriminator (Miyato et al., 2018). SN normalizes each weight matrix $\theta_\ell$ in the decoder by an estimate of its largest singular value: $\theta^{SN}_\ell = \theta_\ell / s(\theta_\ell)$, where $s(\theta_\ell)$ is the current estimate obtained through the power method.

In light of the recent successes of deep networks without explicit regularization (Zagoruyko & Komodakis, 2016; Zhang et al., 2017), it is intriguing to question the need for explicit regularization of the decoder in order to obtain a meaningful latent space. The assumption here is that techniques such as dropout (Srivastava et al., 2014), batch normalization (Ioffe & Szegedy, 2015), or adding noise during training (An, 1996) implicitly regularize the networks enough. Therefore, as a natural baseline to the RAE objectives introduced above, we also consider the RAE framework without $\mathcal{L}_{REG}$ and $\mathcal{L}^{RAE}_{Z}$, i.e., a standard deterministic autoencoder optimizing $\mathcal{L}_{REC}$ only. To complete our autopsy of the VAE loss, we additionally investigate deterministic autoencoders with decoder regularization but without the $\mathcal{L}^{RAE}_{Z}$ term, as well as possible combinations of different regularizers in our RAE framework (cf. Table 3 in Appendix I).

Lastly, it is worth questioning whether it is possible to formally derive our RAE framework from first principles. We answer this affirmatively, and show how to augment the ELBO optimization problem of a VAE with an explicit constraint while not fixing a parametric form for $q_\phi(z \mid x)$. This indeed leads to a special case of the RAE loss in Eq. 11. Specifically, we derive a regularizer like $\mathcal{L}_{GP}$ for a deterministic version of the CV-VAE. Note that this derivation legitimates bounding the decoder's gradients, and as such it justifies the spectral norm regularizer as well, since the latter enforces the decoder's Lipschitzness. We provide the full derivation in Appendix B.
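To illustrate the two GAN-derived regularizers, the rough PyTorch-style sketch below computes a gradient penalty by differentiating a scalar reconstruction error with respect to the latent code (one common way to reduce the decoder's vector-valued Jacobian to a scalar; the exact reduction is an implementation choice we make here, not something the paper prescribes) and applies spectral normalization by wrapping a decoder layer with `torch.nn.utils.spectral_norm`.

```python
import torch
from torch.nn.utils import spectral_norm

def gradient_penalty(x, encoder, decoder):
    """Sketch of L_GP: penalize the decoder's gradient norm w.r.t. its input z."""
    z = encoder(x)
    rec_err = ((decoder(z) - x) ** 2).sum()        # scalar surrogate of the decoder output
    grad, = torch.autograd.grad(rec_err, z, create_graph=True)
    return grad.pow(2).flatten(1).sum(-1).mean()   # squared gradient norm, batch-averaged

# RAE-SN: divide each decoder weight by a power-iteration estimate of its largest
# singular value, bounding the decoder's Lipschitz norm (one layer shown as an example).
sn_layer = spectral_norm(
    torch.nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1))
```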
4 EX-POST DENSITY ESTIMATION

By removing stochasticity and, ultimately, the KL divergence term $\mathcal{L}_{KL}$ from RAEs, we have simplified the original VAE objective at the cost of detaching the encoder from the prior $p(z)$ over the latent space. This implies that i) we cannot ensure that the latent space $Z$ is distributed according to a simple distribution (e.g., an isotropic Gaussian) anymore and, consequently, ii) we lose the simple mechanism provided by $p(z)$ to sample from $Z$ as in Eq. 1. As discussed in Section 2.1, issue i) compromises the VAE framework in any case, as reported in several works (Hoffman & Johnson, 2016; Rosca et al., 2018; Dai & Wipf, 2019). To fix this, some works extend the VAE objective by encouraging the aggregated posterior to match $p(z)$ (Tolstikhin et al., 2017) or by utilizing more complex priors (Kingma et al., 2016; Tomczak & Welling, 2018; Bauer & Mnih, 2019).

To overcome both i) and ii), we instead propose to employ ex-post density estimation over $Z$. We fit a density estimator, denoted as $q_\delta(z)$, to $\{z = E_\phi(x) \mid x \in \mathbf{X}\}$. This simple approach not only fits our RAE framework well, but it can also be readily adopted for any VAE or variants thereof, such as the WAE, as a practical remedy to the aggregated posterior mismatch without adding any computational overhead to the costly training phase.

The choice of $q_\delta(z)$ needs to trade off expressiveness, to provide a good fit of an arbitrary space for $Z$, against simplicity, to improve generalization. For example, placing a Dirac distribution on each latent point $z$ would allow the decoder to output only training sample reconstructions, which have high quality but do not generalize. Striving for simplicity, we employ and compare a full-covariance multivariate Gaussian with a 10-component Gaussian mixture model (GMM) in our experiments.
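As an illustration, ex-post density estimation only requires a few lines on top of an already trained model. The sketch below uses scikit-learn's `GaussianMixture` as the density estimator $q_\delta(z)$; this is our own choice of tooling rather than the authors' implementation, and `decoder` stands for any callable mapping latent codes to images.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_ex_post_density(train_codes, n_components=10, seed=0):
    """Fit q_delta(z) to the training-set latent codes {z = E_phi(x)}."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                          max_iter=100, random_state=seed)
    return gmm.fit(train_codes)      # train_codes: array of shape (N, latent_dim)

def sample_new_data(gmm, decoder, n_samples=64):
    """Generate new data by sampling q_delta(z) and decoding deterministically."""
    z_new, _ = gmm.sample(n_samples)
    return decoder(np.asarray(z_new, dtype=np.float32))
```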
5 RELATED WORKS

Many works have focused on diagnosing the VAE framework, the terms in its objective (Hoffman & Johnson, 2016; Zhao et al., 2017; Alemi et al., 2018), and ultimately augmenting it to solve optimization issues (Rezende & Viola, 2018; Dai & Wipf, 2019). With RAE, we argue that a simpler deterministic framework can be competitive for generative modeling.

Deterministic denoising (Vincent et al., 2008) and contractive autoencoders (CAEs) (Rifai et al., 2011) have received attention in the past for their ability to capture a smooth data manifold. Heuristic attempts to equip them with a generative mechanism include MCMC schemes (Rifai et al., 2012; Bengio et al., 2013). However, they are hard to diagnose for convergence, require considerable effort in tuning (Cowles & Carlin, 1996), and have not scaled beyond MNIST, leading to them being superseded by VAEs. While computing the Jacobian for CAEs (Rifai et al., 2011) is close in spirit to $\mathcal{L}_{GP}$ for RAEs, the latter is much more computationally efficient.

Approaches to cope with the aggregated posterior mismatch involve fixing a more expressive form for $p(z)$ (Kingma et al., 2016; Bauer & Mnih, 2019), therefore altering the VAE objective and requiring considerable additional computational effort. Estimating the latent space of a VAE with a second VAE (Dai & Wipf, 2019) reintroduces many of the optimization shortcomings discussed for VAEs and is much more expensive in practice compared to fitting a simple $q_\delta(z)$ after training.

Adversarial Autoencoders (AAE) (Makhzani et al., 2016) add a discriminator to a deterministic encoder-decoder pair, leading to sharper samples at the expense of higher computational overhead and the introduction of instabilities caused by the adversarial nature of the training process.

[Figure 1 (columns: reconstructions, random samples, interpolations): Qualitative evaluation of sample quality for VAEs, WAEs, 2sVAEs, and RAEs on CelebA. RAE provides slightly sharper samples and reconstructions while interpolating smoothly in the latent space. Corresponding qualitative overviews for MNIST and CIFAR-10 are provided in Appendix F.]

Wasserstein Autoencoders (WAE) (Tolstikhin et al., 2017) have been introduced as a generalization of AAEs by casting autoencoding as an optimal transport (OT) problem. Both stochastic and deterministic models can be trained by minimizing a relaxed OT cost function employing either an adversarial loss term or the maximum mean discrepancy score between $p(z)$ and $q_\phi(z)$ as a regularizer in place of $\mathcal{L}_{KL}$. Within the RAE framework, we look at this problem from a different perspective: instead of explicitly imposing a simple structure on $Z$ that might impair the ability to fit high-dimensional data during training, we propose to model the latent space by an ex-post density estimation step.

The most successful VAE architectures for images and audio so far are variations of the VQ-VAE (van den Oord et al., 2017; Razavi et al., 2019). Despite the name, VQ-VAEs are neither stochastic nor variational, but deterministic autoencoders. VQ-VAEs are similar to RAEs in that they adopt ex-post density estimation. However, VQ-VAEs necessitate complex discrete autoregressive density estimators and a training loss that is non-differentiable due to quantizing $Z$.

Lastly, RAEs share some similarities with GLO (Bojanowski et al., 2018). However, differently from RAEs, GLO can be interpreted as a deterministic AE without an encoder, where the latent space is instead built on demand by optimization. On the other hand, RAEs augment deterministic decoders, as in GANs, with deterministic encoders.

6 EXPERIMENTS

Our experiments are designed to answer the following questions:

Q1: Are sample quality and latent space structure in RAEs comparable to VAEs?
Q2: How do different regularizations impact RAE performance?
Q3: What is the effect of ex-post density estimation on VAEs and its variants?

| Model | MNIST REC. | MNIST N | MNIST GMM | MNIST Interp. | CIFAR REC. | CIFAR N | CIFAR GMM | CIFAR Interp. | CELEBA REC. | CELEBA N | CELEBA GMM | CELEBA Interp. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VAE | 18.26 | 19.21 | 17.66 | 18.21 | 57.94 | 106.37 | 103.78 | 88.62 | 39.12 | 48.12 | 45.52 | 44.49 |
| CV-VAE | 15.15 | 33.79 | 17.87 | 25.12 | 37.74 | 94.75 | 86.64 | 69.71 | 40.41 | 48.87 | 49.30 | 44.96 |
| WAE | **10.03** | 20.42 | 9.39 | **14.34** | 35.97 | 117.44 | 93.53 | 76.89 | **34.81** | 53.67 | 42.73 | 40.93 |
| 2SVAE | 20.31 | **18.81** | – | 18.35 | 62.54 | 109.77 | – | 89.06 | 42.04 | 49.70 | – | 47.54 |
| RAE-GP | 14.04 | 22.21 | 11.54 | 15.32 | 32.17 | 83.05 | 76.33 | 64.08 | 39.71 | 116.30 | 45.63 | 47.00 |
| RAE-L2 | 10.53 | 22.22 | **8.69** | 14.54 | 32.24 | **80.80** | **74.16** | 62.54 | 43.52 | 51.13 | 47.97 | 45.98 |
| RAE-SN | 15.65 | 19.67 | 11.74 | 15.15 | **27.61** | 84.25 | 75.30 | 63.62 | 36.01 | **44.74** | **40.95** | **39.53** |
| RAE | 11.67 | 23.92 | 9.81 | 14.67 | 29.05 | 83.87 | 76.28 | 63.27 | 40.18 | 48.20 | 44.68 | 43.67 |
| AE | 12.95 | 58.73 | 10.66 | 17.12 | 30.52 | 84.74 | 76.47 | 61.57 | 40.79 | 127.85 | 45.10 | 50.94 |
| AE-L2 | 11.19 | 315.15 | 9.36 | 17.15 | 34.35 | 247.48 | 75.40 | **61.09** | 44.72 | 346.29 | 48.42 | 56.16 |

Table 1: Evaluation of all models by FID (lower is better, best value per column in bold). We evaluate each model by REC.: test sample reconstruction; N: random samples generated according to the prior distribution $p(z)$ (isotropic Gaussian for VAE / WAE, another VAE for 2SVAE) or by fitting a Gaussian to $q_\delta(z)$ (for the remaining models); GMM: random samples generated by fitting a mixture of 10 Gaussians in the latent space; Interp.: mid-point interpolation between random pairs of test reconstructions. The RAE models are competitive with or outperform previous models throughout the evaluation. Interestingly, interpolations do not suffer from the lack of explicit priors on the latent space in our models.

6.1 RAES FOR IMAGE MODELING

We evaluate all regularization schemes from Section 3.1: RAE-GP, RAE-L2, and RAE-SN. For a thorough ablation study, we also consider only adding the latent code regularizer $\mathcal{L}^{RAE}_{Z}$ to $\mathcal{L}_{REC}$ (RAE), and an autoencoder without any explicit regularization (AE).
We check the effect of applying one regularization scheme while not including the $\mathcal{L}^{RAE}_{Z}$ term in the AE-L2 model. As baselines, we employ the regular VAE, the constant-variance VAE (CV-VAE), the Wasserstein Autoencoder (WAE) with the MMD loss as a state-of-the-art method, and the recent 2-stage VAE (2sVAE) (Dai & Wipf, 2019), which performs a form of ex-post density estimation via another VAE. For a fair comparison, we use the same network architecture for all models. Further details about the architecture and training are given in Appendix C.

We measure the following quantities: held-out sample reconstruction quality, random sample quality, and interpolation quality. While reconstructions give us a lower bound on the best quality achievable by the generative model, random sample quality indicates how well the model generalizes. Finally, interpolation quality sheds light on the structure of the learned latent space. The evaluation of generative models is a nontrivial research question (Theis et al., 2016; Sajjadi et al., 2017; Lucic et al., 2018). We report here the ubiquitous Fréchet Inception Distance (FID) (Heusel et al., 2017) and we provide precision and recall scores (PRD) (Sajjadi et al., 2018) in Appendix E.

Table 1 summarizes our main results. All of the proposed RAE variants are competitive with the VAE, WAE and 2sVAE w.r.t. generated image quality in all settings. When sampling, RAEs achieve the best FIDs across all datasets when a modest 10-component GMM is employed for ex-post density estimation. Furthermore, even when a single Gaussian N is used as $q_\delta(z)$, RAEs rank first, with the exception of MNIST, where they compete for the second position with the VAE. Our best RAE FIDs are lower than the best results reported for VAEs in the large-scale comparison of Lucic et al. (2018), challenging even the best scores reported for GANs. While we are employing a slightly different architecture than theirs, our models underwent only modest fine-tuning instead of an extensive hyperparameter search.

A comparison of the different regularization schemes for RAEs (Q2) yields no clear winner across all settings, as all perform equally well. Striving for a simpler implementation, one may prefer RAE-L2 over the GP and SN variants. For completeness, we investigate applying multiple regularization schemes to our RAE models. We report the results of all possible combinations in Table 3, Appendix I. There, no significant boost in performance can be spotted when comparing to singly regularized RAEs.

Surprisingly, the implicitly regularized RAE and AE models are shown to be able to score impressive FIDs when $q_\delta(z)$ is fit through GMMs. FIDs for AEs decrease from 58.73 to 10.66 on MNIST and from 127.85 to 45.10 on CelebA, a value close to the state of the art. This is a remarkable result that follows a long series of recent confirmations that neural networks are surprisingly smooth by design (Neyshabur et al., 2017). It is also surprising that the lack of an explicitly fixed structure on the latent space of the RAE does not impede interpolation quality. This is further confirmed by the qualitative evaluation on CelebA reported in Fig. 1 and for the other datasets in Appendix F, where RAE-interpolated samples appear sharper than those of competitors and transitions smoother.

Our results further confirm and quantify the effect of the aggregated posterior mismatch. In Table 1, ex-post density estimation consistently improves sample quality across all settings and models.
A 10-component GMM halves FID scores from 20 to 10 for WAE and RAE models on MNIST and from 116 to 46 on CelebA. This is especially striking since this additional step is much cheaper and simpler than training a second-stage VAE as in 2sVAE (Q3). In summary, the results strongly support the conjecture that the simple deterministic RAE framework can challenge VAEs and stronger alternatives (Q1).

6.2 GRAMMARRAE: MODELING STRUCTURED INPUTS

We now evaluate RAEs for generating complex structured objects such as molecules and arithmetic expressions. We do this with a twofold aim: i) to investigate the latent space learned by RAEs for more challenging input spaces that abide by some structural constraints, and ii) to quantify the gain of replacing the VAE in a state-of-the-art generative model with a RAE.

To this end, we adopt the exact architectures and experimental settings of the Grammar VAE (GVAE) (Kusner et al., 2017), which has been shown to outperform other generative alternatives such as the Character VAE (CVAE) (Gómez-Bombarelli et al., 2018). As in Kusner et al. (2017), we are interested in traversing the latent space learned by our models to generate samples (molecules or expressions) that best fit some downstream metric. This is done via Bayesian optimization (BO), considering log(1 + MSE) (lower is better) for the generated expressions w.r.t. some ground-truth points, and the water-octanol partition coefficient (log P) (Pyzer-Knapp et al., 2015) (higher is better) in the case of molecules. A well-behaved latent space will not only generate molecules or expressions with better scores during the BO step, but it will also contain syntactically valid ones, i.e., samples that abide by a grammar of rules describing the problem.

| Problem | Model | % Valid | Avg. score |
|---|---|---|---|
| Expressions | GRAE | 1.00 ± 0.00 | 3.22 ± 0.03 |
| | GCVVAE | 0.99 ± 0.01 | 2.85 ± 0.08 |
| | GVAE | 0.99 ± 0.01 | 3.26 ± 0.20 |
| | CVAE | 0.82 ± 0.07 | 4.74 ± 0.25 |
| Molecules | GRAE | 0.72 ± 0.09 | -5.62 ± 0.71 |
| | GCVVAE | 0.76 ± 0.06 | -6.40 ± 0.80 |
| | GVAE | 0.28 ± 0.04 | -7.89 ± 1.90 |
| | CVAE | 0.16 ± 0.04 | -25.64 ± 6.35 |

| Model | # | Expression | Score |
|---|---|---|---|
| GRAE | 1 | sin(3) + x | 0.39 |
| | 2 | x + 1/exp(1) | 0.39 |
| | 3 | x + 1 + 2 sin(3 + 1 + 2) | 0.43 |
| GCVVAE | 1 | x + sin(3) 1 | 0.39 |
| | 2 | x/x/3 + x | 0.40 |
| | 3 | sin(exp(exp(1))) + x/2 2 | 0.43 |
| GVAE | 1 | x/1 + sin(x) + sin(x x) | 0.10 |
| | 2 | 1/2 + (x) + sin(x x) | 0.46 |
| | 3 | x/2 + sin(1) + (x/2) | 0.52 |
| CVAE | 1 | x 1 + sin(x) + sin(3 + x) | 0.45 |
| | 2 | x/1 + sin(1) + sin(2 2) | 0.48 |
| | 3 | 1/1 + (x) + sin(1/2) | 0.61 |

[Upper right: the three best molecules per model (molecule drawings omitted); the reported scores per row are 3.74, 3.52, 3.14; 3.22, 2.83, 2.63; 3.13, 3.10, 2.37; and 2.75, 0.82, 0.63.]

Figure 2: Generating structured objects by GVAE, CVAE and GRAE. (Upper left) Percentage of valid samples and their average score (see text, Section 6.2). The three best expressions (lower left) and molecules (upper right) and their scores are reported for all models.

Figure 2 summarizes our results over 5 trials of BO. Our GRAEs (Grammar RAEs) achieve better average scores than CVAEs and GVAEs in generating expressions and molecules. This is also visible for the three best samples and their scores for all models, with the exception of the first best expression of GVAE. We include in the comparison also the GCVVAE, the equivalent of a CV-VAE for structured objects, as an additional baseline. We can observe that while the GCVVAE delivers better average scores for the simpler task of generating equations (even though its three best equations are on par with GRAE), when generating molecules GRAEs deliver samples with much higher scores.
More interestingly, while GRAEs are almost equivalent to GVAEs for the easier task of generating expressions, the proportion of syntactically valid molecules for GRAEs greatly improves over GVAEs (from 28% to 72%). 7 CONCLUSION While the theoretical derivation of the VAE has helped popularize the framework for generative modeling, recent works have started to expose some discrepancies between theory and practice. We have shown that viewing sampling in VAEs as noise injection to enforce smoothness can enable one to distill a deterministic autoencoding framework that is compatible with several regularization techniques to learn a meaningful latent space. We have demonstrated that such an autoencoding framework can generate comparable or better samples than VAEs while getting around the practical drawbacks tied to a stochastic framework. Furthermore, we have shown that our solution of fitting a simple density estimator on the learned latent space consistently improves sample quality both for the proposed RAE framework as well as for VAEs, WAEs, and 2s VAEs which solves the mismatch between the prior and the aggregated posterior in VAEs. ACKNOWLEDGEMENTS We would like to thank Anant Raj, Matthias Bauer, Paul Rubenstein and Soubhik Sanyal for fruitful discussions. variatio delectat! Published as a conference paper at ICLR 2020 Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A Saurous, and Kevin Murphy. Fixing a broken ELBO. In ICML, 2018. Guozhong An. The effects of adding noise during backpropagation training on a generalization performance. In Neural computation, 1996. Martin Arjovsky, Soumith Chintala, and L eon Bottou. Wasserstein generative adversarial networks. In ICML, 2017. Johannes Ball e, Valero Laparra, and Eero P Simoncelli. End-to-end optimized image compression. In ICLR, 2017. M. Bauer and A. Mnih. Resampled priors for variational autoencoders. In AISTATS, 2019. Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized denoising auto-encoders as generative models. In Neur IPS, 2013. Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006. Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. In Co NLL, 2016. Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. In ICLR, 2019. Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. ar Xiv preprint ar Xiv:1509.00519, 2015. Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. In ICLR, 2017. Mary Kathryn Cowles and Bradley P Carlin. Markov chain Monte Carlo convergence diagnostics: a comparative review. In Journal of the American Statistical Association, 1996. Bin Dai and David Wipf. Diagnosing and enhancing VAE models. In ICLR, 2019. Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. Made: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pp. 881 889, 2015. Partha Ghosh, Arpan Losalka, and Michael J Black. Resisting adversarial attacks using Gaussian mixture variational autoencoders. In AAAI, 2019. Rafael G omez-Bombarelli, Jennifer N Wei, David Duvenaud, Jos e Miguel Hern andez-Lobato, Benjam ın S anchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Al an Aspuru-Guzik. 
Automatic chemical design using a data-driven continuous representation of molecules. In ACS central science, 2018. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Neur IPS, 2014. Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Neur IPS, 2017. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, G unter Klambauer, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a Nash equilibrium. In Neur IPS, 2017. Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. Beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017. Matthew D Hoffman and Matthew J Johnson. Elbo surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, Neur IPS, 2016. Published as a conference paper at ICLR 2020 Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015. Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular graph generation. ar Xiv preprint ar Xiv:1802.04364, 2018. Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014. Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improving variational inference with inverse autoregressive flow. In Neur IPS, 2016. Alex Krizhevsky and Geoffrey Hinton. Learning Multiple Layers of Features from Tiny Images, 2009. Karol Kurach, Mario Lucic, Xiaohua Zhai, Marcin Michalski, and Sylvain Gelly. The GAN landscape: Losses, architectures, regularization, and normalization. ar Xiv preprint ar Xiv:1807.04720, 2018. Matt J Kusner, Brooks Paige, and Jos e Miguel Hern andez-Lobato. Grammar variational autoencoder. In ICML, 2017. Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In AISTATS, 2011. Yann Le Cun, L eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In IEEE, 1998. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep Learning Face Attributes in the Wild. In ICCV, 2015. Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs created equal? A large-scale study. In Neur IPS, 2018. Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. In ICLR, 2016. Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In ICML, 2018. Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018. Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, and Nathan Srebro. Geometry of optimization and implicit regularization in deep learning. ar Xiv preprint ar Xiv:1705.03071, 2017. Edward O Pyzer-Knapp, Changwon Suh, Rafael G omez-Bombarelli, Jorge Aguilera-Iparraguirre, and Al an Aspuru-Guzik. What is high-throughput virtual screening? A perspective from organic materials discovery. Annual Review of Materials Research, 2015. Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. 
ar Xiv preprint ar Xiv:1906.00446, 2019. Danilo Jimenez Rezende and Fabio Viola. Taming VAEs. ar Xiv preprint ar Xiv:1810.00597, 2018. Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014. Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive autoencoders: Explicit invariance during feature extraction. In ICML, 2011. Salah Rifai, Yoshua Bengio, Yann Dauphin, and Pascal Vincent. A generative process for sampling contractive auto-encoders. In ICML, 2012. Mihaela Rosca, Balaji Lakshminarayanan, and Shakir Mohamed. Distribution matching in variational inference. ar Xiv preprint ar Xiv:1802.06847, 2018. Published as a conference paper at ICLR 2020 Mehdi S. M. Sajjadi, Bernhard Sch olkopf, and Michael Hirsch. Enhancenet: Single image superresolution through automated texture synthesis. In ICCV, 2017. Mehdi S. M. Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. Assessing generative models via precision and recall. In Neur IPS, 2018. Aliaksei Severyn, Erhardt Barth, and Stanislau Semeniuta. A hybrid convolutional variational autoencoder for text generation. In Empirical Methods in Natural Language Processing, 2017. Jocelyn Sietsma and Robert JF Dow. Creating artificial neural networks that generalize. In Neural networks. Elsevier, 1991. Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Neur IPS, 2015. Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Husz ar. Amortised MAP Inference for Image Super-resolution. In ICLR, 2017. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929 1958, 2014. Lucas Theis, A aron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. In ICLR, 2016. Andrey N Tikhonov and Vasilii Iakkovlevich Arsenin. Solutions of ill-posed problems, volume 14. Winston, Washington, DC, 1977. Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Sch olkopf. Wasserstein autoencoders. In ICLR, 2017. Jakub Tomczak and Max Welling. VAE with a Vamp Prior. In AISTATS, 2018. George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR: low-variance, unbiased gradient estimates for discrete latent variable models. In Neur IPS, 2017. Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In Neur IPS, 2017. Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008. Sergey Zagoruyko and Nikos Komodakis. Wide Residual Networks. In BMVC, 2016. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017. Shengjia Zhao, Jiaming Song, and Stefano Ermon. Towards deeper understanding of variational autoencoding models. ar Xiv preprint ar Xiv:1702.08658, 2017. Published as a conference paper at ICLR 2020 A RECONSTRUCTION AND REGULARIZATION TRADE-OFF We train a VAE on MNIST while monitoring the test set reconstruction quality by FID. Figure 3 (left) clearly shows the impact of more expensive k > 1 Monte Carlo approximations of Eq. 7 on sample quality during training. 
The commonly used 1-sample approximation is a clear limitation for VAE training. Figure 3 (right) depicts the inherent trade-off between reconstruction and random sample quality in VAEs. Enforcing structure and smoothness in the latent space of a VAE affects random sample quality in a negative way. In practice, a compromise needs to be made, ultimately leading to subpar performance.

Figure 3: (Left) Test reconstruction quality for a VAE trained on MNIST with different numbers of samples in the latent space as in Eq. 7, measured by FID (lower is better). Larger numbers of Monte Carlo samples clearly improve training; however, the increased accuracy comes with larger requirements for memory and computation. In practice, the most common choice is therefore k = 1. (Right) Reconstruction and random sample quality (FID, y-axis, lower is better) of a VAE on MNIST for different trade-offs between $\mathcal{L}_{REC}$ and $\mathcal{L}_{KL}$ (x-axis, see Eq. 5). Higher weights for $\mathcal{L}_{KL}$ improve random samples but hurt reconstruction. This is especially noticeable towards the optimality point ($\beta \approx 10^1$). This indicates that enforcing structure in the VAE latent space leads to a penalty in quality.

B A PROBABILISTIC DERIVATION OF REGULARIZATION

In this section, we propose an alternative view on enforcing smoothness on the output of $D_\theta$ by augmenting the ELBO optimization problem for VAEs with an explicit constraint. While we keep the Gaussianity assumptions over a stochastic $D_\theta$ and $p(z)$ for convenience, we are not yet fixing a parametric form for $q_\phi(z \mid x)$. We discuss next how some parametric restrictions over $q_\phi(z \mid x)$ lead to a variation of the RAE framework in Eq. 11, specifically to the introduction of $\mathcal{L}_{GP}$ as a regularizer of a deterministic version of the CV-VAE. To start, we augment Eq. 5 as:

$$\arg\min_{\phi,\theta}\; \mathbb{E}_{x \sim p_{data}(X)} \big[\mathcal{L}_{REC} + \mathcal{L}_{KL}\big] \quad \text{s.t.} \quad \|D_\theta(z_1) - D_\theta(z_2)\|_p < \epsilon \;\; \forall\, z_1, z_2 \sim q_\phi(z \mid x),\; \forall\, x \sim p_{data} \tag{12}$$

where $D_\theta(z) = \mu_\theta(E_\phi(x))$ and the constraint on the decoder encodes that the output has to vary, in the sense of an $L_p$ norm, only by a small amount $\epsilon$ for any two possible draws from the encoding of $x$.

Let $D_\theta(z): \mathbb{R}^{\dim(z)} \to \mathbb{R}^{\dim(x)}$ be given by a set of $\dim(x)$ functions $\{D_i(z): \mathbb{R}^{\dim(z)} \to \mathbb{R}\}$. We can upper bound the quantity $\|D_\theta(z_1) - D_\theta(z_2)\|_p$ by $\dim(x) \cdot \sup_i\{\|D_i(z_1) - D_i(z_2)\|_p\}$. Using the mean value theorem, $\|D_i(z_1) - D_i(z_2)\|_p \leq \|\nabla_t D_i((1-t)z_1 + t z_2)\|_p\, \|z_1 - z_2\|_p$. Hence, $\sup_i\{\|D_i(z_1) - D_i(z_2)\|_p\} \leq \sup_i\{\|\nabla_t D_i((1-t)z_1 + t z_2)\|_p\, \|z_1 - z_2\|_p\}$. Now, if we choose the domain of $q_\phi(z \mid x)$ to be isotropic, the contribution of $\|z_2 - z_1\|_p$ to the aforementioned quantity becomes a constant factor; loosely speaking, it is the radius of the bounding ball of the domain of $q_\phi(z \mid x)$. Hence, the above term simplifies to $\sup_i\{\|\nabla_t D_i((1-t)z_1 + t z_2)\|_p\}$. Recognizing that $z_1$ and $z_2$ are arbitrary lets us simplify this further to $\sup_i\{\|\nabla_z D_i(z)\|_p\}$.

From this form of the smoothness constraint, it is apparent why the choice of a parametric form for $q_\phi(z \mid x)$ can be impactful during training. For a compactly supported isotropic PDF $q_\phi(z \mid x)$, the extension of the support $\sup\{\|z_1 - z_2\|_p\}$ would depend on its entropy $H(q_\phi(z \mid x))$ through some functional $r$. For instance, a uniform posterior over a hypersphere in $z$ would give $r(H(q_\phi(z \mid x))) = e^{H(q_\phi(z \mid x))/n}$, where $n$ is the dimensionality of the latent space.
Intuitively, one would look for parametric distributions that do not favor overfitting, e.g., degenerating into Dirac deltas (minimal entropy and support) along any dimension. To this end, an isotropic nature of $q_\phi(z \mid x)$ favors such robustness against decoder overfitting. We can now rewrite the constraint as

$$r(H(q_\phi(z \mid x)))\, \sup\{\|\nabla D_\theta(z)\|_p\} < \epsilon \tag{13}$$

The $\mathcal{L}_{KL}$ term can be expressed in terms of $H(q_\phi(z \mid x))$ by decomposing it as $\mathcal{L}_{KL} = \mathcal{L}_{CE} - \mathcal{L}_H$, where $\mathcal{L}_H = H(q_\phi(z \mid x))$ and $\mathcal{L}_{CE} = H(q_\phi(z \mid x), p(z))$ represents a cross-entropy term. Therefore, the constrained problem in Eq. 12 can be written in a Lagrangian formulation by including Eq. 13:

$$\arg\min_{\phi,\theta}\; \mathbb{E}_{x \sim p_{data}} \big[\mathcal{L}_{REC} + \mathcal{L}_{CE} - \mathcal{L}_H + \lambda \mathcal{L}_{LANG}\big] \tag{14}$$

where $\mathcal{L}_{LANG} = r(H(q_\phi(z \mid x)))\, \|\nabla D_\theta(z)\|_p$. We argue that a reasonable simplifying assumption for $q_\phi(z \mid x)$ is to fix $H(q_\phi(z \mid x))$ to a single constant for all samples $x$. Intuitively, this can be understood as fixing the variance of $q_\phi(z \mid x)$, as we did for the CV-VAE in Section 2.2. With this simplification, Eq. 14 further reduces to

$$\arg\min_{\phi,\theta}\; \mathbb{E}_{x \sim p_{data}(X)} \big[\mathcal{L}_{REC} + \mathcal{L}_{CE} + \lambda \|\nabla D_\theta(z)\|_p\big] \tag{15}$$

We can see that $\|\nabla D_\theta(z)\|_p$ turns out to be the gradient penalty $\mathcal{L}_{GP}$ and $\mathcal{L}_{CE} = \|z\|_2^2$ corresponds to $\mathcal{L}^{RAE}_{Z}$, thus recovering our RAE framework as presented in Eq. 11.

C NETWORK ARCHITECTURE, TRAINING DETAILS AND EVALUATION

We follow the models adopted by Tolstikhin et al. (2017), with the difference that we consistently apply batch normalization (Ioffe & Szegedy, 2015). The latent space dimension is 16 for MNIST (LeCun et al., 1998), 128 for CIFAR-10 (Krizhevsky & Hinton, 2009) and 64 for CelebA (Liu et al., 2015). For all experiments, we use the Adam optimizer with a starting learning rate of 10⁻³ which is cut in half every time the validation loss plateaus. All models are trained for a maximum of 100 epochs on MNIST and CIFAR and 70 epochs on CelebA. We use a mini-batch size of 100 and pad MNIST digits with zeros to make the size 32 × 32.

We use the official train, validation and test splits of CelebA. For MNIST and CIFAR, we set aside 10k train samples for validation. For random sample evaluation, we draw samples from N(0, I) for VAE and WAE-MMD; for all remaining models, samples are drawn from a multivariate Gaussian whose parameters (mean and covariance) are estimated using training set embeddings. For the GMM density estimation, we also utilize the training set embeddings for fitting, and validation set embeddings to verify that the GMM models are not overfitting to the training embeddings. However, due to the very low number of mixture components (10), we did not encounter overfitting at this step. The GMM parameters are estimated by running EM for at most 100 iterations.

| | MNIST | CIFAR-10 | CELEBA |
|---|---|---|---|
| Encoder | x ∈ R^(32×32) → Conv128 → BN → ReLU → Conv256 → BN → ReLU → Conv512 → BN → ReLU → Conv1024 → BN → ReLU → Flatten → FC(16·M) | x → Conv128 → BN → ReLU → Conv256 → BN → ReLU → Conv512 → BN → ReLU → Conv1024 → BN → ReLU → Flatten → FC(128·M) | x → Conv128 → BN → ReLU → Conv256 → BN → ReLU → Conv512 → BN → ReLU → Conv1024 → BN → ReLU → Flatten → FC(64·M) |
| Decoder | z ∈ R^16 → FC(8×8×1024) → BN → ReLU → ConvT512 → BN → ReLU → ConvT256 → BN → ReLU → ConvT1 | z ∈ R^128 → FC(8×8×1024) → BN → ReLU → ConvT512 → BN → ReLU → ConvT256 → BN → ReLU → ConvT1 | z ∈ R^64 → FC(8×8×1024) → BN → ReLU → ConvT512 → BN → ReLU → ConvT256 → BN → ReLU → ConvT128 → BN → ReLU → ConvT1 |

Conv_n denotes a convolutional layer with n filters. All convolutions Conv_n and transposed convolutions ConvT_n have a filter size of 4 × 4 for MNIST and CIFAR-10 and 5 × 5 for CELEBA. They all have a stride of size 2, except for the last convolutional layer in the decoder. Finally, M = 1 for all models except for the VAE, which has M = 2, as the encoder has to produce both a mean and a variance for each input.
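For illustration, the MNIST column could be transcribed into PyTorch roughly as follows (M = 1, i.e., a deterministic RAE encoder). The table does not specify paddings or the kernel of the stride-1 output layer, so those details below are our own assumptions, chosen so that the 32 × 32 resolution is preserved.

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),   # halves resolution
        nn.BatchNorm2d(c_out), nn.ReLU())

def convt_block(c_in, c_out):
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),  # doubles resolution
        nn.BatchNorm2d(c_out), nn.ReLU())

# Encoder: 1x32x32 image -> 1024x2x2 feature map -> 16-d latent code
encoder = nn.Sequential(
    conv_block(1, 128), conv_block(128, 256), conv_block(256, 512), conv_block(512, 1024),
    nn.Flatten(), nn.Linear(1024 * 2 * 2, 16))

# Decoder: 16-d code -> 8x8x1024 feature map -> 1x32x32 image
decoder = nn.Sequential(
    nn.Linear(16, 8 * 8 * 1024), nn.Unflatten(1, (1024, 8, 8)),
    convt_block(1024, 512), convt_block(512, 256),
    nn.ConvTranspose2d(256, 1, kernel_size=3, stride=1, padding=1))  # stride-1 output (3x3 kernel is our choice)
```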
Finally, M = 1 for all models except for the VAE which has M = 2 as the encoder has to produce both mean and variance for each input. D EVALUATION SETUP We compute the FID of the reconstructions of random validation samples against the test set to evaluate reconstruction quality. For evaluating generative modeling capabilities, we compute the FID between the test data and randomly drawn samples from a single Gaussian that is either the isotropic p(z) fixed for VAEs and WAEs, a learned second stage VAE for 2s VAEs, or a single Gaussian fit to qδ(z) for CV-VAEs and RAEs. For all models, we also evaluate random samples from a 10-component Gaussian Mixture model (GMM) fit to qδ(z). Using only 10 components prevents us from overfitting (which would indeed give good FIDs when compared with the test set)3. For interpolations, we report the FID for the furthest interpolation points resulted by applying spherical interpolation to randomly selected validation reconstruction pairs. We use 10k samples for all FID and PRD evaluations. Scores for random samples are evaluated against the test set. Reconstruction scores are computed from validation set reconstructions against the respective test set. Interpolation scores are computed by interpolating latent codes of a pair of randomly chosen validation embeddings vs test set samples. The visualized interpolation samples are interpolations between two randomly chosen test set images. E EVALUATION BY PRECISION AND RECALL MNIST CIFAR-10 CELEBA N GMM N GMM N GMM VAE 0.96 / 0.92 0.95 / 0.96 0.25 / 0.55 0.37 / 0.56 0.54 / 0.66 0.50 / 0.66 CV-VAE 0.84 / 0.73 0.96 / 0.89 0.31 / 0.64 0.42 / 0.68 0.25 / 0.43 0.32 / 0.55 WAE 0.93 / 0.88 0.98 / 0.95 0.38 / 0.68 0.51 / 0.81 0.59 / 0.68 0.69 / 0.77 RAE-GP 0.93 / 0.87 0.97 / 0.98 0.36 / 0.70 0.46 / 0.77 0.38 / 0.55 0.44 / 0.67 RAE-L2 0.92 / 0.87 0.98 / 0.98 0.41 / 0.77 0.57 / 0.81 0.36 / 0.64 0.44 / 0.65 RAE-SN 0.89 / 0.95 0.98 / 0.97 0.36 / 0.73 0.52 / 0.81 0.54 / 0.68 0.55 / 0.74 RAE 0.92 / 0.85 0.98 / 0.98 0.45 / 0.73 0.53 / 0.80 0.46 / 0.59 0.52 / 0.69 AE 0.90 / 0.90 0.98 / 0.97 0.37 / 0.73 0.50 / 0.80 0.45 / 0.66 0.47 / 0.71 Table 2: Evaluation of random sample quality by precision / recall (Sajjadi et al., 2018) (higher numbers are better, best value for each dataset in bold). It is notable that the proposed ex-post density estimation improves not only precision, but also recall throughout the experiment. For example, WAE seems to have a comparably low recall of only 0.88 on MNIST which is raised considerably to 0.95 by fitting a GMM. In all cases, GMM gives the best results. Another interesting point is the low precision but high recall of all models on CIFAR-10 this is also visible upon inspection of the samples in Fig. 9. 3We note that fitting GMMs with up to 100 components only improved results marginally. Additionally, we provide nearest-neighbours from the training set in Appendix G to show that our models are not overfitting. Published as a conference paper at ICLR 2020 PRD ALL RAES PRD ALL TRADITIONAL VAES WAE VS RAE-SN VS WAE-GMM 0.00 0.25 0.50 0.75 1.00 Recall RAE-L2 RAE-SN RAE-GP 0.00 0.25 0.50 0.75 1.00 Recall VAE CV-VAE WAE AE 0.00 0.25 0.50 0.75 1.00 Recall WAE WAE-GMM RAE-SN Figure 4: PRD curves of all RAE methods (left), reflects a similar story as FID scores do. RAESN seems to perform the best in both precision and recall metric. PRD curves of all traditional VAE variants (middle). Similar to the conclusion predicted by FID scores there are no clear winner. 
PRD curves for the WAE (with an isotropic Gaussian prior), the WAE-GMM model with ex-post density estimation by a 10-component GMM, and RAE-SN-GMM (right). This finer-grained view shows how the WAE-GMM scores higher recall but lower precision than RAE-SN-GMM while scoring comparable FIDs. Note that ex-post density estimation greatly boosts the WAE model in both PRD and FID scores.

[Figure 5 shows one PRD plot per model: VAE, CV-VAE, WAE, RAE-GP, RAE-L2, RAE-SN.]

Figure 5: PRD curves of all methods on the image data experiments on MNIST. For each model, we show the PRD curve obtained with the fixed prior and with the one fitted by ex-post density estimation (XPDE). XPDE greatly boosts both precision and recall for all models.

[Figure 6 shows one PRD plot per model: VAE, CV-VAE, WAE, RAE-GP, RAE-L2, RAE-SN.]

Figure 6: PRD curves of all methods on the image data experiments on CIFAR-10. For each model, we show the PRD curve obtained with the fixed prior and with the one fitted by ex-post density estimation (XPDE). XPDE greatly boosts both precision and recall for all models.

[Figure 7 shows one PRD plot per model: VAE, CV-VAE, WAE, RAE-GP, RAE-L2, RAE-SN.]

Figure 7: PRD curves of all methods on the image data experiments on CELEBA. For each model, we show the PRD curve obtained with the fixed prior and with the one fitted by ex-post density estimation (XPDE). XPDE greatly boosts both precision and recall for all models.

F MORE QUALITATIVE RESULTS

[Figure 8 columns: reconstructions, random samples, interpolations.]

Figure 8: Qualitative evaluation of sample quality for VAEs, WAEs and RAEs on MNIST. Left: reconstructed samples (top row is ground truth). Middle: randomly generated samples. Right: spherical interpolations between two images (first and last column).

[Figure 9 columns: reconstructions, random samples, interpolations.]

Figure 9: Qualitative evaluation of sample quality for VAEs, WAEs and RAEs on CIFAR-10. Left: reconstructed samples (top row is ground truth). Middle: randomly generated samples. Right: spherical interpolations between two images (first and last column).

G INVESTIGATING OVERFITTING

[Figure 10 panels: MNIST, CIFAR-10, CELEBA.]

Figure 10: Nearest neighbors to generated samples (leftmost image, red box) from the training set. It seems that the models have generalized well, and fitting only 10 Gaussians to the latent space prevents overfitting.

H VISUALIZING EX-POST DENSITY ESTIMATION

To visualize that ex-post density estimation does indeed help reduce the mismatch between the aggregated posterior and the prior, we train a VAE on the MNIST dataset whose latent space is 2-dimensional. The unique advantage of this setting is that one can simply visualize the density of test samples in the latent space by plotting them as a scatter plot. As can be seen from Figure 11, an expressive density estimator effectively fixes the mismatch and this, as reported earlier, results in better sample quality.

[Figure 11 columns: N(0, I), N(µ, Σ), GMM(k = 10).]

Figure 11: Different density estimations of the 2-dimensional latent space of a VAE learned on MNIST. The blue points are 2000 test set samples, while the orange ones are drawn from the estimator indicated in each column: isotropic Gaussian (left), multivariate Gaussian with mean and covariance estimated on the training set (center), and a 10-component GMM (right). This clearly shows the aggregated posterior mismatch w.r.t.
the isotropic Gaussian prior imposed by VAEs and how ex-post density estimation can help fix the estimate.

In Figure 12, we perform the same visualization with all the models trained on the MNIST dataset, as employed in our large evaluation in Table 1. Clearly, every model exhibits a rather large mismatch between the aggregated posterior and the prior. Once again, the advantage of ex-post density estimation is clearly visible.

[Figure 12 columns: N(0, I), N(µ, Σ), GMM(k = 10).]

Figure 12: Different density estimations of the 16-dimensional latent spaces learned by all models on MNIST (see Table 1), here projected in 2D via t-SNE. The blue points are 2000 test set samples, while the orange ones are drawn from the estimator indicated in each column: isotropic Gaussian (left), multivariate Gaussian with mean and covariance estimated on the training set (center), and a 10-component GMM (right). Ex-post density estimation greatly improves sampling the latent space.

| Model | MNIST REC. | MNIST N | MNIST GMM | MNIST Interp. | CIFAR REC. | CIFAR N | CIFAR GMM | CIFAR Interp. | CELEBA REC. | CELEBA N | CELEBA GMM | CELEBA Interp. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RAE-GP | 14.04 | 22.21 | 11.54 | 15.32 | 32.17 | 83.05 | 76.33 | 64.08 | 39.71 | 116.30 | 45.63 | 47.00 |
| RAE-L2 | 10.53 | 22.22 | 8.69 | 14.54 | 32.24 | 80.80 | 74.16 | 62.54 | 43.52 | 51.13 | 47.97 | 45.98 |
| RAE-SN | 15.65 | 19.67 | 11.74 | 15.15 | 27.61 | 84.25 | 75.30 | 63.62 | 36.01 | 44.74 | 40.95 | 39.53 |
| RAE | 11.67 | 23.92 | 9.81 | 14.67 | 29.05 | 83.87 | 76.28 | 63.27 | 40.18 | 48.20 | 44.68 | 43.67 |
| AE | 12.95 | 58.73 | 10.66 | 17.12 | 30.52 | 84.74 | 76.47 | 61.57 | 40.79 | 127.85 | 45.10 | 50.94 |
| AE-L2 | 11.19 | 315.15 | 9.36 | 17.15 | 34.35 | 247.48 | 75.40 | 61.09 | 44.72 | 346.29 | 48.42 | 56.16 |
| RAE-GP-L2 | 9.70 | 72.64 | 9.07 | 16.07 | 33.25 | 187.07 | 79.03 | 62.48 | 47.06 | 72.09 | 51.55 | 50.28 |
| RAE-L2-SN | 10.67 | 50.63 | 9.42 | 15.73 | 24.17 | 240.27 | 74.10 | 61.71 | 39.90 | 180.39 | 44.39 | 42.97 |
| RAE-SN-GP | 17.00 | 139.61 | 13.12 | 16.62 | 33.04 | 284.36 | 75.23 | 62.86 | 63.75 | 299.69 | 71.05 | 68.87 |
| RAE-L2-SN-GP | 16.75 | 144.51 | 13.93 | 16.75 | 29.96 | 290.34 | 74.22 | 61.93 | 68.86 | 318.67 | 75.04 | 74.29 |

Table 3: Comparing multiple regularization schemes for RAE models. The performance in reconstruction, random sample quality and interpolated test samples is generally comparable to, but hardly better than, that of the singly regularized models. This can be explained by the fact that the additional regularization losses make tuning the hyperparameters more difficult in practice.

I COMBINING MULTIPLE REGULARIZATION TERMS

The rather intriguing fact that the AE without explicit decoder regularization performs reasonably well, as seen in Table 1, indicates that convolutional neural networks, when combined with gradient-based optimizers, inherit some implicit regularization. This motivates us to investigate a few different combinations of regularizations, e.g., regularizing the decoder of an autoencoder while dropping the regularization in the z space. The results of this experiment are reported in the row marked AE-L2 in Table 3. Furthermore, the recent GAN literature reports that a combination of regularizations often boosts the performance of neural networks. Following this, we combine multiple regularization techniques in our framework. Note, however, that this rather drastically increases the number of hyperparameters, makes the models harder to train, and goes against the core theme of this work, which strives for simplicity. Hence, we make only a modest effort to tune all the hyperparameters to see if this can provide a boost in performance, which seems not to be the case. These experiments are summarized in the second half of Table 3.