# Perceptual Generative Autoencoders

Zijun Zhang¹, Ruixiang Zhang², Zongpeng Li³, Yoshua Bengio², Liam Paull²

¹University of Calgary, Canada; ²MILA, Université de Montréal, Canada; ³Wuhan University, China. Correspondence to: Zijun Zhang, Ruixiang Zhang.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

## Abstract

Modern generative models are usually designed to match target distributions directly in the data space, where the intrinsic dimension of data can be much lower than the ambient dimension. We argue that this discrepancy may contribute to the difficulties in training generative models. We therefore propose to map both the generated and target distributions to a latent space using the encoder of a standard autoencoder, and train the generator (or decoder) to match the target distribution in the latent space. Specifically, we enforce the consistency in both the data space and the latent space with theoretically justified data and latent reconstruction losses. The resulting generative model, which we call a perceptual generative autoencoder (PGA), is then trained with a maximum likelihood or variational autoencoder (VAE) objective. With maximum likelihood, PGAs generalize the idea of reversible generative models to unrestricted neural network architectures and an arbitrary number of latent dimensions. When combined with VAEs, PGAs substantially improve over the baseline VAEs in terms of sample quality. Compared to other autoencoder-based generative models using simple priors, PGAs achieve state-of-the-art FID scores on CIFAR-10 and CelebA.

## 1. Introduction

Recent years have witnessed great interest in generative models, mainly due to the success of generative adversarial networks (GANs) (Goodfellow et al., 2014; Radford et al., 2016; Karras et al., 2018; Brock et al., 2019). Despite their prevalence, the adversarial nature of GANs can lead to a number of challenges, such as unstable training dynamics and mode collapse. Since the advent of GANs, substantial efforts have been devoted to addressing these challenges (Salimans et al., 2016; Arjovsky et al., 2017; Gulrajani et al., 2017; Miyato et al., 2018), while non-adversarial approaches that are free of these issues have also gained attention. Examples include variational autoencoders (VAEs) (Kingma & Welling, 2014), reversible generative models (Dinh et al., 2014; 2017; Kingma & Dhariwal, 2018), and Wasserstein autoencoders (WAEs) (Tolstikhin et al., 2018). However, non-adversarial approaches often have significant limitations. For instance, VAEs tend to generate blurry samples, while reversible generative models require restricted neural network architectures or solving neural differential equations (Grathwohl et al., 2019). Furthermore, to use the change of variables formula, the latent space of a reversible model must have the same dimension as the data space, which is unreasonable considering that real-world, high-dimensional data (e.g., images) tends to lie on low-dimensional manifolds, and thus results in redundant latent dimensions and variability. Intriguingly, recent research (Arjovsky et al., 2017; Dai & Wipf, 2019) suggests that the discrepancy between the intrinsic and ambient dimensions of data also contributes to the difficulties in training GANs and VAEs.
In this work, we present a novel framework for training autoencoder-based generative models, with non-adversarial losses and unrestricted neural network architectures. Given a standard autoencoder and a target data distribution, instead of matching the target distribution in the data space, we map both the generated and target distributions to a latent space using the encoder, while also minimizing the divergence between the mapped distributions. We prove, under mild assumptions, that by minimizing a form of latent reconstruction error, matching the target distribution in the latent space implies matching it in the data space. We call this framework perceptual generative autoencoder (PGA). We show that PGAs enable training generative autoencoders with maximum likelihood, without restrictions on architectures or latent dimensionalities. In addition, when combined with VAEs, PGAs can generate sharper samples than vanilla VAEs (code is available at https://github.com/zj10/PGA).

We summarize our main contributions as follows:

- A training framework, PGA, for generative autoencoders is developed to match the target distribution in the latent space, which, we prove, ensures correct matching in the data space.
- We combine PGA with the maximum likelihood objective, and remove the restrictions of reversible (flow-based) generative models on neural network architectures and latent dimensionalities.
- We combine PGA with the VAE objective, solving the VAE's issue of blurry samples without introducing any auxiliary models or sophisticated model architectures.

## 2. Related Work

Autoencoder-based generative models are trained by minimizing a data reconstruction loss with regularizations. As an early approach, denoising autoencoders (DAEs) (Vincent et al., 2008) are trained to recover the original input from an intentionally corrupted input; a generative model can then be obtained by sampling from a Markov chain (Bengio et al., 2013). To sample from a decoder directly, most recent approaches resort to mapping a simple prior distribution to a data distribution using the decoder. For instance, variational autoencoders (VAEs) directly match data distributions by maximizing the evidence lower bound. In contrast, adversarial autoencoders (AAEs) (Makhzani et al., 2016) and Wasserstein autoencoders (WAEs) (Tolstikhin et al., 2018) work in the latent space to match the aggregated posterior with the prior, either by adversarial training or by minimizing their Wasserstein distance. Inspired by AAEs and WAEs, we develop a principled approach to matching data distributions in the latent space, aiming to improve the generative performance of AAEs and WAEs (Rubenstein et al., 2018), as well as that of VAEs (Rezende & Viola, 2018; Dai & Wipf, 2019). While previous work has explored the use of a perceptual loss for a similar purpose (Hou et al., 2017), it relies on a VGG network pre-trained on ImageNet and provides no theoretical guarantees. In our work, the encoder of an autoencoder is jointly trained, such that matching the target distribution in the latent space guarantees the matching in the data space.

In a different line of work, reversible generative models (Dinh et al., 2014; 2017) are developed to enable exact inference. Consequently, by the change of variables theorem, the likelihood of each data sample can be exactly computed and optimized. Recent work shows that they are capable of generating realistic images (Kingma & Dhariwal, 2018).
However, to avoid expensive Jacobian determinant computations, reversible models can only be composed of restricted transformations, rather than general neural network architectures. While this restriction can be relaxed by utilizing recently developed neural ordinary differential equations (Chen et al., 2018; Grathwohl et al., 2019), they still rely on a shared dimensionality between the latent and data spaces, which remains an unnatural restriction. In this work, we use the proposed training framework to trade exact inference for unrestricted neural network architectures and arbitrary latent dimensionalities, generalizing maximum likelihood training to autoencoder-based models.

### 3.1. Perceptual Generative Model

Let $f_\phi: \mathbb{R}^D \to \mathbb{R}^H$ be the encoder parameterized by $\phi$, and $g_\theta: \mathbb{R}^H \to \mathbb{R}^D$ be the decoder parameterized by $\theta$. Our goal is to obtain a decoder-based generative model, which maps a simple prior distribution to a target data distribution, $\mathcal{D}$. Throughout this paper, we use $\mathcal{N}(0, I)$ as the prior distribution. This section introduces several related but different distributions, which are illustrated in Fig. 1a. A summary of notations is provided in the supplementary material.

For $z \in \mathbb{R}^H$, the output of the decoder, $g_\theta(z)$, lies in a manifold that is at most $H$-dimensional. Therefore, if we train the autoencoder to minimize

$$\mathcal{L}_r = \frac{1}{2}\mathbb{E}_{x \sim \mathcal{D}}\left[\|\hat{x} - x\|_2^2\right], \quad (1)$$

where $\hat{x} = g_\theta(f_\phi(x))$, then $\hat{x}$ can be seen as a projection of the input data, $x$, onto the manifold of $g_\theta(z)$. Let $\hat{\mathcal{D}}$ denote the reconstructed data distribution, i.e., $\hat{x} \sim \hat{\mathcal{D}}$. Given enough capacity of the encoder, $\hat{\mathcal{D}}$ is the best approximation to $\mathcal{D}$ (in terms of $\ell_2$-distance) that we can obtain from the decoder, and thus can serve as a surrogate target distribution for training the decoder-based generative model.

Due to the difficulty of directly matching the generated distribution with the data-space target distribution, $\hat{\mathcal{D}}$, we reuse the encoder to map $\hat{\mathcal{D}}$ to a latent-space target distribution, $\hat{\mathcal{H}}$. We then transform the problem of matching $\hat{\mathcal{D}}$ in the data space into matching $\hat{\mathcal{H}}$ in the latent space. In other words, we aim to ensure that for $z \sim \mathcal{N}(0, I)$, if $f_\phi(g_\theta(z)) \sim \hat{\mathcal{H}}$, then $g_\theta(z) \sim \hat{\mathcal{D}}$. In the following, we define $h = f_\phi \circ g_\theta$ for notational convenience. To this end, we minimize the following latent reconstruction loss w.r.t. $\phi$:

$$\mathcal{L}^{\phi}_{lr,\mathcal{N}} = \frac{1}{2}\mathbb{E}_{z \sim \mathcal{N}(0, I)}\left[\|h(z) - z\|_2^2\right]. \quad (2)$$

Let $\mathcal{Z}(x)$ be the set of all $z$'s that are mapped to the same $x$ by $g_\theta$; we have the following theorem:

Theorem 1. Assuming $\mathbb{E}[z \mid x] \in \mathcal{Z}(x)$ for all $x$ generated by $g_\theta$, and sufficient capacity of $f_\phi$; for $z \sim \mathcal{N}(0, I)$, if Eq. (2) is minimized and $h(z) \sim \hat{\mathcal{H}}$, then $g_\theta(z) \sim \hat{\mathcal{D}}$.

[Figure 1. Illustration of the training process of PGAs. (a) shows the distributions involved in training PGAs, where the dashed arrow points to the two latent-space distributions to be matched. The overall loss function consists of (b) the basic PGA losses, and either (c) the LPGA-specific losses or (d) the VPGA-specific losses. Circles indicate where the gradient is truncated, and dashed lines indicate where the gradient is ignored when updating parameters.]

We defer the proof to the supplementary material.
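To make the basic objective concrete, the following is a minimal PyTorch-style sketch of the data reconstruction loss (Eq. (1)) and the latent reconstruction loss on $\mathcal{N}(0, I)$ (Eq. (2)). The module and argument names are placeholders rather than the authors' implementation; the `detach()` call reflects the gradient truncation of Fig. 1b, under which the latent reconstruction loss updates only the encoder (as discussed below), and the analogous term on $\mathcal{H}$ (Eq. (3) below) can be added in the same way.

```python
import torch

def pga_basic_losses(encoder, decoder, x, latent_dim, alpha=1.0):
    """Hypothetical sketch of the basic PGA losses (Eqs. (1)-(2)).
    encoder: R^D -> R^H, decoder: R^H -> R^D, both torch.nn.Modules."""
    # Data reconstruction loss, Eq. (1): L_r = 1/2 E_x ||x_hat - x||_2^2.
    x_hat = decoder(encoder(x))
    l_r = 0.5 * (x_hat - x).pow(2).flatten(1).sum(dim=1).mean()

    # Latent reconstruction loss on N(0, I), Eq. (2), minimized w.r.t. the
    # encoder only; detaching the decoder output blocks gradients to theta.
    z = torch.randn(x.size(0), latent_dim, device=x.device)
    z_rec = encoder(decoder(z).detach())
    l_lr_n = 0.5 * (z_rec - z).pow(2).sum(dim=1).mean()

    return l_r + alpha * l_lr_n
```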
Note that Theorem 1 requires that different $x$'s generated by $g_\theta$ (from $\mathcal{N}(0, I)$ and $\mathcal{H}$, where $\mathcal{H}$ denotes the distribution of $f_\phi(x)$ for $x \sim \mathcal{D}$) are mapped to different $z$'s by $f_\phi$. In theory, minimizing Eq. (2) would suffice, since $\mathcal{N}(0, I)$ is supported on the whole $\mathbb{R}^H$. However, there can be $z$'s with low probability under $\mathcal{N}(0, I)$ but high probability under $\mathcal{H}$ that are not well covered by Eq. (2). Therefore, it is sometimes helpful to minimize another latent reconstruction loss on $\mathcal{H}$:

$$\mathcal{L}^{\phi}_{lr,\mathcal{H}} = \frac{1}{2}\mathbb{E}_{z \sim \mathcal{H}}\left[\|h(z) - z\|_2^2\right]. \quad (3)$$

In practice, we observe that $\mathcal{L}^{\phi}_{lr,\mathcal{H}}$ is often small without explicit minimization, which we attribute to its consistency with the minimization of $\mathcal{L}_r$. Moreover, minimizing the latent reconstruction losses w.r.t. $\theta$ is not required by Theorem 1, and doing so degrades performance empirically. In addition, the use of the $\ell_2$-norm in the reconstruction losses is not a necessity, and the framework can be easily extended to other norm definitions.

By Theorem 1, the problem of training the generative model reduces to training $h$ to map $\mathcal{N}(0, I)$ to $\hat{\mathcal{H}}$; we refer to $h$ as the perceptual generative model. The basic loss function of PGAs is given by

$$\mathcal{L}_{pga} = \mathcal{L}_r + \alpha\mathcal{L}^{\phi}_{lr,\mathcal{N}} + \beta\mathcal{L}^{\phi}_{lr,\mathcal{H}}, \quad (4)$$

where $\alpha$ and $\beta$ are hyperparameters to be tuned. Eq. (4) is also illustrated in Fig. 1b. In the subsequent subsections, we present a maximum likelihood approach, as well as a VAE-based approach, to train the perceptual generative model. To build intuition before delving into the details, we note that both approaches work by attracting the latent representations of data samples to the origin, while expanding the volume occupied by each sample in the latent space. These two tendencies together push $\mathcal{H}$ closer to $\mathcal{N}(0, I)$, such that $\mathcal{H}$ matches $\hat{\mathcal{H}}$. This observation further leads to a unified view of the two approaches.

### 3.2. A Maximum Likelihood Approach

We first assume the invertibility of $h$. For $\hat{x} \sim \hat{\mathcal{D}}$, let $\hat{z} = f_\phi(\hat{x}) = h(z) \sim \hat{\mathcal{H}}$. We can train $h$ directly with maximum likelihood using the change of variables formula:

$$\mathbb{E}_{\hat{z} \sim \hat{\mathcal{H}}}\left[\log p(\hat{z})\right] = \mathbb{E}_{z \sim \mathcal{H}}\left[\log p(z) - \log\left|\det\left(\frac{\partial h(z)}{\partial z}\right)\right|\right], \quad (5)$$

where $p(z)$ is the prior distribution, $\mathcal{N}(0, I)$. Since the actual generative model to be trained is the decoder (parameterized by $\theta$), we would like to maximize Eq. (5) only w.r.t. $\theta$. However, directly optimizing the first term in Eq. (5) requires computing $z = h^{-1}(\hat{z})$, which is usually unknown. Nevertheless, for $\hat{z} \sim \hat{\mathcal{H}}$, we have $h^{-1}(\hat{z}) = f_\phi(x)$ with $x \sim \mathcal{D}$, and thus we can minimize the following loss function w.r.t. $\phi$ instead:

$$\mathcal{L}^{\phi}_{nll} = -\mathbb{E}_{z \sim \mathcal{H}}\left[\log p(z)\right] = \frac{1}{2}\mathbb{E}_{x \sim \mathcal{D}}\left[\|f_\phi(x)\|_2^2\right] + \text{const}. \quad (6)$$

To avoid computing the Jacobian in the second term of Eq. (5), which is slow for unrestricted architectures, we approximate the Jacobian determinant and derive a loss function to be minimized w.r.t. $\theta$:

$$\mathcal{L}^{\theta}_{nll} = \frac{H}{2}\,\mathbb{E}_{z \sim \mathcal{H},\, \delta \sim S(\epsilon)}\left[\log\frac{\|h(z + \delta) - h(z)\|_2^2}{\|\delta\|_2^2}\right] \approx \mathbb{E}_{z \sim \mathcal{H}}\left[\log\left|\det\left(\frac{\partial h(z)}{\partial z}\right)\right|\right], \quad (7)$$

where $S(\epsilon)$ can be either $\mathcal{N}(0, \epsilon^2 I)$ or a uniform distribution on a small $(H-1)$-sphere of radius $\epsilon$ centered at the origin. The latter choice is expected to introduce slightly less variance. Note that if we also minimize Eq. (7) w.r.t. $\phi$, the encoder will be trained to ignore the difference between $g_\theta(z + \delta)$ and $g_\theta(z)$, in which case Theorem 1 no longer holds. Eqs. (6) and (7) are illustrated in Fig. 1c.

We show below that the approximation in Eq. (7) gives an upper bound when $\epsilon \to 0$.

Proposition 1. For $\epsilon \to 0$,

$$\log\left|\det\left(\frac{\partial h(z)}{\partial z}\right)\right| \leq \frac{H}{2}\,\mathbb{E}_{\delta \sim S(\epsilon)}\left[\log\frac{\|h(z + \delta) - h(z)\|_2^2}{\|\delta\|_2^2}\right]. \quad (8)$$

The inequality is tight if $h$ is a multiple of the identity function around $z$.

We defer the proof to the supplementary material. We note that while the approximation in Eq. (7) is derived from the change of variables formula, there is no direct usage of the latter. As a result, the invertibility of $h$ is not required by the resulting method. Indeed, when $h$ is invertible at some point $z$, the latent reconstruction loss ensures that $h$ is close to the identity function around $z$, and hence the tightness of the upper bound in Eq. (8). Otherwise, when $h$ is not invertible at some $z$, the logarithm of the Jacobian determinant at $z$ becomes infinite, in which case Eq. (5) cannot be optimized. Nevertheless, since $\|h(z + \delta) - h(z)\|_2^2$ is unlikely to be zero if the model is properly initialized, the approximation in Eq. (7) remains finite, and thus can be optimized regardless.
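As an illustration, below is a hedged sketch of the two maximum-likelihood terms (Eqs. (6) and (7)), using the spherical-noise variant of $S(\epsilon)$. The function names, the value of `eps`, and the use of `requires_grad_` to keep Eq. (7) from training the encoder are assumptions made for this sketch, not the authors' exact implementation.

```python
import torch

def lpga_ml_losses(encoder, decoder, x, eps=0.01):
    """Hypothetical sketch of the LPGA maximum-likelihood terms: L^phi_nll
    (Eq. (6)) updates the encoder, L^theta_nll (Eq. (7)) the decoder only."""
    # Eq. (6): attract the latent codes of real data toward the origin.
    z = encoder(x)                                   # z ~ H
    l_nll_phi = 0.5 * z.pow(2).sum(dim=1).mean()

    # Eq. (7): finite-difference surrogate of E[log|det(dh/dz)|], with delta
    # drawn uniformly from an (H-1)-sphere of radius eps.
    z_d = z.detach()                                 # gradient to phi via z is ignored
    delta = torch.randn_like(z_d)
    delta = eps * delta / delta.norm(dim=1, keepdim=True)
    encoder.requires_grad_(False)                    # Eq. (7) must not train the encoder
    h_z = encoder(decoder(z_d))
    h_z_delta = encoder(decoder(z_d + delta))
    encoder.requires_grad_(True)
    ratio = (h_z_delta - h_z).pow(2).sum(dim=1) / delta.pow(2).sum(dim=1)
    l_nll_theta = 0.5 * z.size(1) * torch.log(ratio + 1e-12).mean()

    return l_nll_phi, l_nll_theta
```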
To summarize, we train the autoencoder to obtain a generative model by minimizing the following loss function:

$$\mathcal{L}_{lpga} = \mathcal{L}_{pga} + \gamma\left(\mathcal{L}^{\phi}_{nll} + \mathcal{L}^{\theta}_{nll}\right). \quad (9)$$

We refer to this approach as maximum likelihood PGA (LPGA).

### 3.3. A VAE-based Approach

The original VAE is trained by maximizing the evidence lower bound on $\log p(x)$:

$$\log p(x) \geq \log p(x) - \mathrm{KL}\left(q(\tilde{z} \mid x)\,\|\,p(\tilde{z} \mid x)\right) = \mathbb{E}_{\tilde{z} \sim q(\tilde{z} \mid x)}\left[\log p(x \mid \tilde{z})\right] - \mathrm{KL}\left(q(\tilde{z} \mid x)\,\|\,p(\tilde{z})\right), \quad (10)$$

where $p(x \mid \tilde{z})$ is modeled with the decoder, and $q(\tilde{z} \mid x)$ is modeled with the encoder. Note that $\tilde{z}$ denotes the stochastic version of $z$, whereas $z$ remains deterministic for the basic PGA losses in Eqs. (2) and (3). In our case, we would like to modify Eq. (10) in a way that helps maximize $\log p(\hat{z})$, where $\hat{z} = h(z)$. Therefore, we replace $p(x \mid \tilde{z})$ on the r.h.s. of Eq. (10) with $p(\hat{z} \mid \tilde{z})$, and derive a lower bound on $\log p(\hat{z})$ as

$$\log p(\hat{z}) \geq \log p(\hat{z}) - \mathrm{KL}\left(q(\tilde{z} \mid x)\,\|\,p(\tilde{z} \mid \hat{z})\right) = \mathbb{E}_{\tilde{z} \sim q(\tilde{z} \mid x)}\left[\log p(\hat{z} \mid \tilde{z})\right] - \mathrm{KL}\left(q(\tilde{z} \mid x)\,\|\,p(\tilde{z})\right). \quad (11)$$

Similar to the original VAE, we make the assumption that $q(\tilde{z} \mid x)$ and $p(\hat{z} \mid \tilde{z})$ are Gaussian; i.e., $q(\tilde{z} \mid x) = \mathcal{N}\left(\tilde{z} \mid \mu_\phi(x), \mathrm{diag}\left(\sigma^2_\phi(x)\right)\right)$ and $p(\hat{z} \mid \tilde{z}) = \mathcal{N}\left(\hat{z} \mid \mu_{\theta,\phi}(\tilde{z}), \sigma^2 I\right)$. Here, $\mu_\phi(\cdot) = f_\phi(\cdot)$, $\mu_{\theta,\phi}(\cdot) = h(\cdot)$, and $\sigma > 0$ is a tunable scalar. Note that if $\sigma$ is fixed, the first term on the r.h.s. of Eq. (11) has a trivial maximum, where $z$, $\hat{z}$, and $\mu_{\theta,\phi}(\tilde{z})$ are all close to zero. To circumvent this, we set $\sigma$ proportional to the $\ell_2$-norm of $z$. The VAE variant is trained by minimizing

$$\mathcal{L}_{vae} = \mathcal{L}_{vr} + \mathcal{L}^{\phi}_{vkl} = \mathbb{E}_{x \sim \mathcal{D}}\left[-\mathbb{E}_{\tilde{z} \sim q(\tilde{z} \mid x)}\left[\log p\left(\hat{z} \mid \tilde{z}\right)\right] + \mathrm{KL}\left(q(\tilde{z} \mid x)\,\|\,p(\tilde{z})\right)\right], \quad (12)$$

where $\mathcal{L}_{vr}$ and $\mathcal{L}^{\phi}_{vkl}$ correspond, respectively, to the reconstruction and KL divergence losses of the VAE, as illustrated in Fig. 1d. In $\mathcal{L}_{vr}$, while the gradient through $\sigma^2_\phi(x)$ remains unchanged, we ignore the gradient passed directly from $\mathcal{L}_{vr}$ to the encoder, for a reason similar to that discussed for Eq. (7). Accordingly, the overall loss function is given by

$$\mathcal{L}_{vpga} = \mathcal{L}_{pga} + \eta\mathcal{L}_{vae}. \quad (13)$$

We refer to this approach as variational PGA (VPGA).
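For concreteness, here is a hedged PyTorch-style sketch of the VPGA losses $\mathcal{L}_{vr}$ and $\mathcal{L}^{\phi}_{vkl}$ (Eq. (12)). The separate `enc_logvar` head predicting $\log\sigma^2_\phi(x)$ and the `sigma_scale` constant are assumptions for illustration, and only part of the gradient truncation shown in Fig. 1d is reproduced (a single `detach()` on the target $\hat{z}$).

```python
import torch

def vpga_vae_losses(encoder, enc_logvar, decoder, x, sigma_scale=0.1):
    """Hypothetical sketch of the VPGA losses L_vr and L^phi_vkl (Eq. (12)).
    enc_logvar is an assumed second encoder head for log sigma_phi^2(x)."""
    mu = encoder(x)                          # mu_phi(x) = f_phi(x), i.e. z
    logvar = enc_logvar(x)                   # log sigma_phi^2(x)

    # Reparameterized sample z_tilde ~ q(z_tilde | x).
    z_tilde = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    # Target z_hat = h(z), treated here as the "data" of the latent-space VAE.
    z_hat = encoder(decoder(mu)).detach()
    # Model mean mu_{theta,phi}(z_tilde) = h(z_tilde).
    h_z_tilde = encoder(decoder(z_tilde))

    # L_vr: -log p(z_hat | z_tilde) up to an additive constant, with sigma
    # set proportional to the l2-norm of z to avoid the trivial solution.
    sigma = sigma_scale * mu.detach().norm(dim=1, keepdim=True) + 1e-8
    l_vr = ((z_hat - h_z_tilde).pow(2) / (2.0 * sigma.pow(2))).sum(dim=1).mean()

    # L^phi_vkl: KL(q(z_tilde | x) || N(0, I)).
    l_vkl = 0.5 * (logvar.exp() + mu.pow(2) - logvar - 1.0).sum(dim=1).mean()

    return l_vr, l_vkl
```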
### 3.4. A High-level View of the PGA Framework

We summarize what each loss term achieves, and explain at a high level how they work together.

- Data reconstruction loss (Eq. (1)): For Theorem 1 to hold, we need to use the reconstructed data distribution ($\hat{\mathcal{D}}$), instead of the original data distribution ($\mathcal{D}$), as the target distribution. Minimizing the data reconstruction loss ensures that this target distribution stays close to the data distribution.
- Latent reconstruction loss (Eqs. (2) and (3)): The encoder ($f_\phi$) is reused to map data-space distributions to the latent space. As shown by Theorem 1, minimizing the latent reconstruction loss (w.r.t. the parameters of the encoder) ensures that if the generated distribution and the target distribution are mapped to the same distribution ($\hat{\mathcal{H}}$) in the latent space by the encoder, then the generated distribution and the target distribution are the same.
- Maximum likelihood loss (Eqs. (6) and (7)) or VAE loss (Eq. (12)): The decoder ($g_\theta$) and encoder ($f_\phi$) together can be considered as a perceptual generative model ($f_\phi \circ g_\theta$), which is trained to map $\mathcal{N}(0, I)$ to the latent-space target distribution ($\hat{\mathcal{H}}$) by minimizing either the maximum likelihood loss or the VAE loss.

The first loss allows us to use the reconstructed data distribution as the target distribution. The second loss transforms the problem of matching the target distribution in the data space into matching it in the latent space. The latter problem is then solved by the third loss. Therefore, the three losses together ensure that the generated distribution is close to the data distribution.

### 3.5. A Unified Approach

While the loss functions of maximum likelihood and VAE seem completely different in their original forms, they share remarkable similarities when considered in the PGA framework (see Figs. 1c and 1d). Intuitively, observe that

$$\mathcal{L}^{\phi}_{vkl} = \mathcal{L}^{\phi}_{nll} + \frac{1}{2}\mathbb{E}_{x \sim \mathcal{D}}\left[\sum_i\left(\sigma^2_{\phi,i}(x) - \log\sigma^2_{\phi,i}(x) - 1\right)\right], \quad (14)$$

which means both $\mathcal{L}^{\phi}_{nll}$ and $\mathcal{L}^{\phi}_{vkl}$ tend to attract the latent representations of data samples to the origin. In addition, by minimizing $\log\left|\det\left(\partial h(z)/\partial z\right)\right|$, $\mathcal{L}^{\theta}_{nll}$ expands the volume occupied by each sample in the latent space, which can also be achieved by $\mathcal{L}_{vr}$ together with the second term of Eq. (14).

More concretely, we observe that both $\mathcal{L}^{\theta}_{nll}$ and $\mathcal{L}_{vr}$ minimize the difference between $h(z)$ and $h(z + \delta')$, where $\delta'$ is some additive zero-mean noise. However, they differ in that the variance of $\delta'$ is fixed for $\mathcal{L}^{\theta}_{nll}$ but trainable for $\mathcal{L}_{vr}$, and the distance between $h(z)$ and $h(z + \delta')$ is defined in two different ways. In fact, $\mathcal{L}_{vr}$ is a squared $\ell_2$-distance derived from the Gaussian assumption on $\hat{z}$, whereas $\mathcal{L}^{\theta}_{nll}$ can be derived similarly by assuming that $d^H = \|\hat{z} - h(z + \delta)\|_2^H$ follows a reciprocal distribution,

$$p\left(d^H; a, b\right) = \frac{1}{d^H\left(\log(b) - \log(a)\right)}, \quad (15)$$

where $a \leq d^H \leq b$ and $a > 0$. The exact values of $a$ and $b$ are irrelevant, as they only appear in an additive constant when we take the logarithm of $p\left(d^H; a, b\right)$. Since there is no obvious reason for assuming a Gaussian $\hat{z}$, we can instead assume $\hat{z}$ to follow the distribution defined in Eq. (15), and multiply $H$ by a tunable scalar, $\gamma'$, similar to $\sigma$. Furthermore, we can replace $\delta$ in Eq. (7) with $\delta \sim \mathcal{N}\left(0, \mathrm{diag}\left(\sigma^2_\phi(x)\right)\right)$, as it is defined for VPGA, with the subtle difference that here $\sigma^2_\phi(x)$ is constrained to be greater than $\epsilon^2$. As a result, LPGA and VPGA are unified into a single approach, with the combined loss function

$$\mathcal{L}_{lvpga} = \mathcal{L}_{pga} + \gamma'\mathcal{L}_{vr} + \gamma\mathcal{L}^{\phi}_{nll} + \eta\mathcal{L}^{\phi}_{vkl}. \quad (16)$$

When $\gamma = \gamma'$ and $\eta = 0$, Eq. (16) is equivalent to Eq. (9), considering that $\sigma^2_\phi(x)$ will be optimized to approach $\epsilon^2$. Similarly, when $\gamma = 0$, Eq. (16) is equivalent to Eq. (13). Interestingly, it also becomes possible to mix LPGA and VPGA by setting all three hyperparameters to positive values. This approach mainly serves to demonstrate the connection between LPGA and VPGA, and is less practical due to the extra hyperparameters. We refer to this approach as LVPGA.
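Tying the pieces together, one possible LPGA training step (Eq. (9)) is sketched below, reusing the hypothetical loss functions above. Keeping separate optimizers for the encoder and decoder, omitting the optional $\mathcal{L}^{\phi}_{lr,\mathcal{H}}$ term, and the particular learning-rate setup are simplifications for illustration.

```python
import torch

def lpga_train_step(encoder, decoder, enc_opt, dec_opt, x,
                    latent_dim, alpha=1.0, gamma=1.5e-2):
    """One hypothetical LPGA update (Eq. (9)); gamma = 1.5e-2 follows the
    value reported for MNIST/CIFAR-10 in Section 4."""
    enc_opt.zero_grad()
    dec_opt.zero_grad()

    l_pga = pga_basic_losses(encoder, decoder, x, latent_dim, alpha=alpha)
    l_nll_phi, l_nll_theta = lpga_ml_losses(encoder, decoder, x)

    # The detach()/requires_grad_ calls inside the loss sketches already
    # block the gradient paths that each term is not supposed to train.
    loss = l_pga + gamma * (l_nll_phi + l_nll_theta)
    loss.backward()
    enc_opt.step()  # e.g., torch.optim.SGD(encoder.parameters(), lr=..., momentum=0.9)
    dec_opt.step()  # e.g., torch.optim.SGD(decoder.parameters(), lr=..., momentum=0.9)
    return loss.item()
```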
## 4. Experiments

In this section, we evaluate the performance of LPGA and VPGA on three image datasets: MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky & Hinton, 2009), and CelebA (Liu et al., 2015). For CelebA, we employ the discriminator and generator architecture of DCGAN (Radford et al., 2016) for the encoder and decoder of PGA. We halve the number of filters (i.e., 64 filters for the first convolutional layer) for faster experiments, while more filters are observed to improve performance. Due to smaller input sizes, we reduce the number of convolutional layers accordingly for MNIST and CIFAR-10, and add a fully-connected layer of 1024 units for MNIST, as done in Chen et al. (2016). SGD with a momentum of 0.9 is used to train all models. Other hyperparameters are tuned heuristically, and could be improved by a more extensive grid search. For a fair comparison, $\sigma$ is tuned for both VAE and VPGA. All experiments are performed on a single GPU.

As shown in Fig. 2, the visual quality of the PGA-generated samples is significantly improved over that of VAEs. In particular, PGAs generate much sharper samples on CIFAR-10 and CelebA compared to vanilla VAEs. The results of LVPGA closely resemble those of either LPGA or VPGA, depending on the hyperparameter settings.

[Figure 2. Random samples generated by LPGA, VPGA, and VAE on MNIST, CIFAR-10, and CelebA (panels (a)-(i)). Note how LPGA and VPGA images are less blurry than those from the VAE.]

In addition, we use the Fréchet Inception Distance (FID) (Heusel et al., 2017) to evaluate the proposed methods, as well as VAE. For each model and each dataset, we take 5,000 generated samples to compute the FID score. The results (with standard errors of 3 or more runs) are summarized in Table 1. Compared to other autoencoder-based non-adversarial approaches (Tolstikhin et al., 2018; Kolouri et al., 2019; Ghosh et al., 2019), where similar but larger architectures are used, we obtain substantially better FID scores on all three datasets.

Table 1. FID scores of autoencoder-based generative models. The first block shows the results from Ghosh et al. (2019), where CV-VAE stands for constant-variance VAE, and RAE stands for regularized autoencoder. The second block shows our results for LPGA, VPGA, LVPGA, and VAE.

| Model | MNIST | CIFAR-10 | CelebA |
| --- | --- | --- | --- |
| VAE | 19.21 | 106.37 | 48.12 |
| CV-VAE | 33.79 | 94.75 | 48.87 |
| WAE | 20.42 | 117.44 | 53.67 |
| RAE-L2 | 22.22 | 80.80 | 51.13 |
| RAE-SN | 19.67 | 84.25 | 44.74 |
| VAE | 13.83 ± 0.06 | 115.74 ± 0.63 | 43.60 ± 0.33 |
| LPGA | 10.34 ± 0.15 | 55.87 ± 0.25 | 14.53 ± 0.52 |
| VPGA | 4.97 ± 0.07 | 51.51 ± 1.16 | 24.73 ± 1.25 |
| LVPGA | 6.32 ± 0.16 | 52.94 ± 0.89 | 13.80 ± 0.20 |

Note that the results from Ghosh et al. (2019) shown in Table 1 are obtained using slightly different architectures and evaluation protocols. Nevertheless, their results for VAE align well with ours, suggesting good comparability of the results. Interestingly, as a unified approach, LVPGA can indeed combine the best performances of LPGA and VPGA on different datasets. For CelebA, we show further results on 140x140 crops and latent space interpolations in the supplementary material. While PGA has largely bridged the performance gap between generative autoencoders and GANs, there is still a noticeable gap, especially on CIFAR-10. For instance, the FIDs of WGAN-GP and SN-GAN on CIFAR-10 using a similar architecture are 40.2 and 25.5, respectively (Miyato et al., 2018), as compared to 51.5 for VPGA.

Empirically, different PGA variants share the same optimal values of $\alpha$ and $\beta$ (Eq. (4)) when trained on the same dataset. For LPGA, $\gamma$ (Eq. (9)) tends to vary in a small range across datasets (e.g., 1.5e-2 for MNIST and CIFAR-10, and 1e-2 for CelebA). For VPGA, $\eta$ (Eq. (13)) can vary widely (e.g., 2e-2 for MNIST, 3e-2 for CIFAR-10, and 2e-3 for CelebA), and is thus slightly more difficult to tune.
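As a reference for the metric used above, FID is the Fréchet distance between two Gaussians fitted to Inception features of real and generated samples; the following NumPy/SciPy sketch shows only that final formula, with the Inception feature extraction (and any resizing or preprocessing) omitted.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feat_real, feat_fake):
    """Frechet distance between Gaussians fitted to two feature sets
    (rows are samples); Inception feature extraction is omitted here."""
    mu1, mu2 = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    cov1 = np.cov(feat_real, rowvar=False)
    cov2 = np.cov(feat_fake, rowvar=False)
    covmean = linalg.sqrtm(cov1 @ cov2)
    covmean = covmean.real  # discard tiny imaginary parts from numerical error
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))
```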
The training process of PGAs is stable in general, given the non-adversarial losses. As shown in Fig. 3a, the total losses change little after the initial rapid drops. This is due to the fact that the encoder and decoder are optimized towards different objectives, as can be observed from Eqs. (4), (9), and (12). In contrast, the corresponding FIDs, shown in Fig. 3b, tend to decrease monotonically during training. However, when trained on CelebA, there is a significant performance gap between LPGA and VPGA, and the FID of the latter starts to increase after a certain point of training. We suspect this phenomenon is related to the limited expressiveness of the variational posterior, which is not an issue for LPGA.

[Figure 3. Training curves of LPGA and VPGA on MNIST, CIFAR-10, and CelebA: (a) total loss over training iterations; (b) the corresponding FIDs.]

It is worth noting that stability issues can occur when batch normalization (Ioffe & Szegedy, 2015) is introduced, since both the encoder and decoder are fed with multiple batches drawn from different distributions. At convergence, different input distributions to the decoder (e.g., $\mathcal{H}$ and $\mathcal{N}(0, I)$) are expected to result in similar distributions of the internal representations, which, intriguingly, can be imposed to some degree by batch normalization. Therefore, it is observed that when batch normalization does not cause stability issues, it can substantially accelerate convergence and lead to slightly better generative performance. Furthermore, we observe that LPGA tends to be more stable than VPGA in the presence of batch normalization.

Finally, we conduct an ablation study. While the loss functions of LPGA and VPGA both consist of multiple components, they are all theoretically motivated and indispensable. Specifically, the data reconstruction loss minimizes the discrepancy between the input data and its reconstruction. Since the reconstructed data distribution serves as the surrogate target distribution, removing the data reconstruction loss would result in a random target. Moreover, removing the maximum likelihood loss of LPGA or the VAE loss of VPGA would leave the perceptual generative model untrained. In both cases, no valid generative model can be obtained. Nevertheless, it is interesting to see how the latent reconstruction loss contributes to the generative performance. Therefore, we retrain the LPGAs without the latent reconstruction losses and report the results in Fig. 4. Compared to Figs. 2a, 2d, 2g, and the results in Table 1, the performance degrades significantly both visually and quantitatively, confirming the importance of the latent reconstruction loss.

[Figure 4. Random samples generated by LPGA without the latent reconstruction losses (α = β = 0): (a) MNIST, FID = 16.99; (b) CIFAR-10, FID = 114.19; (c) CelebA, FID = 48.02. Compared to the samples in Fig. 2, we observe a clear degradation.]

## 5. Conclusion

We proposed a framework, PGA, for training autoencoder-based generative models, with non-adversarial losses and unrestricted neural network architectures. By matching target distributions in the latent space, PGAs trained with maximum likelihood generalize the idea of reversible generative models to unrestricted neural network architectures and arbitrary latent dimensionalities. In addition, the PGA framework improves the performance of VAEs when combined with them.
Under the PGA framework, we further show that maximum likelihood and VAE can be unified into a single approach. In principle, the PGA framework can be combined with any method that can train the perceptual generative model. While we have only considered non-adversarial approaches, an interesting direction for future work would be to combine it with an adversarial discriminator trained on latent representations. Moreover, the compatibility issue with batch normalization deserves further investigation.

## References

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning, 2017.

Bengio, Y., Yao, L., Alain, G., and Vincent, P. Generalized denoising auto-encoders as generative models. In Advances in Neural Information Processing Systems, pp. 899-907, 2013.

Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.

Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pp. 6571-6583, 2018.

Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172-2180, 2016.

Dai, B. and Wipf, D. Diagnosing and enhancing VAE models. In International Conference on Learning Representations, 2019.

Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. In International Conference on Learning Representations Workshop, 2014.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. In International Conference on Learning Representations, 2017.

Ghosh, P., Sajjadi, M. S. M., Vergari, A., Black, M., and Schölkopf, B. From variational to deterministic autoencoders. arXiv preprint arXiv:1903.12436, 2019.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.

Grathwohl, W., Chen, R. T., Bettencourt, J., Sutskever, I., and Duvenaud, D. FFJORD: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations, 2019.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767-5777, 2017.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626-6637, 2017.

Hou, X., Shen, L., Sun, K., and Qiu, G. Deep feature consistent variational autoencoder. In IEEE Winter Conference on Applications of Computer Vision, pp. 1133-1141. IEEE, 2017.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448-456, 2015.

Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10236-10245, 2018.
Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

Kolouri, S., Pope, P. E., Martin, C. E., and Rohde, G. K. Sliced Wasserstein auto-encoders. In International Conference on Learning Representations, 2019.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

LeCun, Y., Cortes, C., and Burges, C. J. C. The MNIST handwritten digit database, 1998.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In International Conference on Computer Vision, 2015.

Makhzani, A., Shlens, J., Jaitly, N., and Goodfellow, I. Adversarial autoencoders. In International Conference on Learning Representations Workshop, 2016.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.

Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations, 2016.

Rezende, D. J. and Viola, F. Taming VAEs. arXiv preprint arXiv:1810.00597, 2018.

Rubenstein, P. K., Schölkopf, B., and Tolstikhin, I. On the latent space of Wasserstein auto-encoders. arXiv preprint arXiv:1802.03761, 2018.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234-2242, 2016.

Tolstikhin, I., Bousquet, O., Gelly, S., and Schölkopf, B. Wasserstein auto-encoders. In International Conference on Learning Representations, 2018.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning, pp. 1096-1103. ACM, 2008.