# VARIATIONAL LOSSY AUTOENCODER

Published as a conference paper at ICLR 2017

Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, Pieter Abbeel
UC Berkeley, Department of Electrical Engineering and Computer Science
OpenAI
{peter,dpkingma,tim,rocky,prafulla,joschu,ilyasu,pieter}@openai.com

## ABSTRACT

Representation learning seeks to expose certain aspects of observed data in a learned representation that is amenable to downstream tasks like classification. For instance, a good representation for 2D images might be one that describes only global structure and discards information about detailed texture. In this paper, we present a simple but principled method to learn such global representations by combining Variational Autoencoder (VAE) with neural autoregressive models such as RNN, MADE and PixelRNN/CNN. Our proposed VAE model allows us to have control over what the global latent code can learn and, by designing the architecture accordingly, we can force the global latent code to discard irrelevant information such as texture in 2D images, so that the VAE only autoencodes data in a lossy fashion. In addition, by leveraging autoregressive models as both prior distribution p(z) and decoding distribution p(x|z), we can greatly improve generative modeling performance of VAEs, achieving new state-of-the-art results on MNIST, OMNIGLOT and Caltech-101 Silhouettes density estimation tasks as well as competitive results on CIFAR10.

## 1 INTRODUCTION

A key goal of representation learning is to identify and disentangle the underlying causal factors of the data, so that it becomes easier to understand the data, to classify it, or to perform other tasks (Bengio et al., 2013). For image data this often means that we are interested in uncovering the global structure that captures the content of an image (for example, the identity of objects present in the image) and its style, but that we are typically less interested in the local and high-frequency sources of variation such as specific textures or white noise patterns. A popular approach for learning representations is to fit a probabilistic latent variable model, an approach also known as analysis-by-synthesis (Yuille & Kersten, 2006; Nair et al., 2008). By learning a generative model of the data with the appropriate hierarchical structure of latent variables, it is hoped that the model will somehow uncover and untangle those causal sources of variation that we happen to be interested in. However, without further assumptions, representation learning via generative modeling is ill-posed: there are many different possible generative models with different (or no) kinds of latent variables that all encode the same probability density function on our observed data. Thus, the results we empirically get using this approach are highly dependent on the specific architectural and modeling choices that are made.
Moreover, the objective that we optimize is often completely disconnected from the goal of learning a good representation: an autoregressive model of the data may achieve the same log-likelihood as a variational autoencoder (VAE) (Kingma & Welling, 2013), but the structure learned by the two models is completely different: the latter typically has a clear hierarchy of latent variables, while the autoregressive model has no stochastic latent variables at all (although it is conceivable that the deterministic hidden units of the autoregressive models will have meaningful and useful representations). For this reason, autoregressive models have thus far not been popular for the purpose of learning representations, even though they are extremely powerful as generative models (see e.g. van den Oord et al., 2016a).

A natural question becomes: is it possible to have a model that is a powerful density estimator and at the same time has the right hierarchical structure for representation learning? A potential solution would be to use a hybrid model that has both the latent variable structure of a VAE, as well as the powerful recurrence of an autoregressive model. However, earlier attempts at combining these two kinds of models have run into the problem that the autoregressive part of the model ends up explaining all structure in the data, while the latent variables are not used (Fabius & van Amersfoort, 2014; Chung et al., 2015; Bowman et al., 2015; Serban et al., 2016; Fraccaro et al., 2016; Xu & Sun, 2016). Bowman et al. (2015) noted that weakening the autoregressive part of the model by, for example, dropout can encourage the latent variables to be used. We analyze why weakening is necessary, and we propose a principled solution that takes advantage of this property to control what kind of information goes into the latent variables. The model we propose performs well as a density estimator, as evidenced by state-of-the-art log-likelihood results on MNIST, OMNIGLOT and Caltech-101, and also has a structure that is uniquely suited for learning interesting global representations of data.

## 2 VAES DO NOT AUTOENCODE IN GENERAL

A VAE is frequently interpreted as a regularized autoencoder (Kingma & Welling, 2013; Zhang et al., 2016), but the conditions under which it is guaranteed to autoencode (reconstruction being close to the original datapoint) are not discussed. In this section, we discuss the often-neglected fact that VAEs do not always autoencode and give explicit reasons why previous attempts to apply VAE in sequence modeling found that the latent code is generally not used unless the decoder is weakened (Bowman et al., 2015; Serban et al., 2016; Fraccaro et al., 2016). Understanding when a VAE does autoencode will be an essential building block for VLAE.

### 2.1 TECHNICAL BACKGROUND

Let x be observed variables, z latent variables and let p(x, z) be the parametric model of their joint distribution, called the generative model defined over these variables. Given a dataset X = {x1, ..., xN} we wish to perform maximum likelihood learning of its parameters:

$$\log p(X) = \sum_{i=1}^{N} \log p(x^{(i)}), \qquad (1)$$

but in general this marginal likelihood is intractable to compute or differentiate directly for flexible generative models that have high-dimensional latent variables and flexible priors and likelihoods.
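To see concretely why the marginal likelihood in Eq. (1) is hard to work with, the short sketch below (a toy illustration of ours, not code from the paper) estimates log p(x) for a small latent-variable model with a standard-normal prior and a factorized Bernoulli likelihood whose logits come from a made-up linear map `W`. The naive Monte Carlo estimator samples z from the prior, and its variance becomes unmanageable as soon as the latent space is even moderately high-dimensional.

```python
# Toy illustration (not from the paper): naive Monte Carlo estimate of the
# marginal likelihood log p(x) = log E_{p(z)}[p(x|z)] for a small generative
# model with prior p(z) = N(0, I) and a factorized Bernoulli likelihood whose
# logits come from a hypothetical linear "decoder" W.
import numpy as np

rng = np.random.default_rng(0)
z_dim, x_dim = 8, 16
W = rng.normal(scale=0.5, size=(z_dim, x_dim))           # made-up decoder weights

def log_p_x_given_z(x, z):
    """Factorized Bernoulli log-likelihood: log p(x|z) = sum_i log p(x_i|z)."""
    logits = z @ W
    return np.sum(x * logits - np.logaddexp(0.0, logits), axis=-1)

def log_p_x_naive(x, num_samples=100_000):
    """Estimate log p(x) by averaging p(x|z_s) over prior samples z_s ~ p(z).

    The estimate of p(x) is unbiased, but almost all prior samples assign
    negligible likelihood to x, so the number of samples required explodes
    as the latent dimensionality grows -- hence "intractable in general".
    """
    z = rng.normal(size=(num_samples, z_dim))             # z_s ~ N(0, I)
    log_w = log_p_x_given_z(x, z)                         # log p(x | z_s)
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))         # stable log-mean-exp

x = (rng.random(x_dim) < 0.5).astype(float)               # dummy binary datapoint
print(log_p_x_naive(x))
```

Replacing the prior samples with samples from a learned, data-dependent proposal q(z|x) is precisely what the variational bound introduced next formalizes.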
A solution is to introduce q(z|x), a parametric inference model defined over the latent variables, and optimize the variational lower bound on the marginal log-likelihood of each observation x:

$$\log p(x) \ge \mathbb{E}_{q(z|x)}\left[\log p(x, z) - \log q(z|x)\right] = \mathcal{L}(x; \theta) \qquad (2)$$

where θ indicates the parameters of the p and q models. There are various ways to optimize the lower bound L(x; θ); for continuous z it can be done efficiently through a re-parameterization of q(z|x) (Kingma & Welling, 2013; Rezende et al., 2014). This way of optimizing the variational lower bound with a parametric inference network and re-parameterization of continuous latent variables is usually called VAE. The autoencoding terminology comes from the fact that the lower bound L(x; θ) can be re-arranged:

$$\begin{aligned}
\mathcal{L}(x; \theta) &= \mathbb{E}_{q(z|x)}\left[\log p(x, z) - \log q(z|x)\right] &(3) \\
&= \mathbb{E}_{q(z|x)}\left[\log p(x|z)\right] - D_{\mathrm{KL}}\left(q(z|x)\,\|\,p(z)\right) &(4)
\end{aligned}$$

where the first term can be seen as the expectation of negative reconstruction error and the KL divergence term can be seen as a regularizer, which as a whole can be seen as a regularized autoencoder loss with q(z|x) being the encoder and p(x|z) being the decoder. In the context of 2D image modeling, the decoding distribution p(x|z) is usually chosen to be a simple factorized distribution, i.e. $p(x|z) = \prod_i p(x_i|z)$, and this setup often yields a sharp decoding distribution p(x|z) that tends to reconstruct the original datapoint x exactly.

### 2.2 BITS-BACK CODING AND INFORMATION PREFERENCE

It is straightforward to see that having a more powerful p(x|z) will make the VAE's marginal generative distribution $p(x) = \int_z p(z)p(x|z)\,dz$ more expressive. This idea has been explored extensively in previous work applying VAE to sequence modeling (Fabius & van Amersfoort, 2014; Chung et al., 2015; Bowman et al., 2015; Serban et al., 2016; Fraccaro et al., 2016; Xu & Sun, 2016), where the decoding distribution is a powerful RNN with autoregressive dependency, i.e., $p(x|z) = \prod_i p(x_i|z, x_{<i})$.
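As a concrete, deliberately simplified companion to Eqs. (2)-(4), the sketch below (our own illustration, not the paper's code) computes a single-sample estimate of L(x; θ) with a reparameterized diagonal-Gaussian q(z|x) and a standard-normal prior, and contrasts the two decoder choices discussed above: a factorized Bernoulli $p(x|z) = \prod_i p(x_i|z)$ and an autoregressive Bernoulli $p(x|z) = \prod_i p(x_i|z, x_{<i})$ implemented with a masked weight matrix. All weights, shapes and function names are stand-ins of ours rather than the paper's architecture.

```python
# Minimal sketch of the lower bound in Eq. (4), assuming a reparameterized
# diagonal-Gaussian q(z|x), a standard-normal prior p(z), and two decoder
# choices: factorized Bernoulli p(x|z) = prod_i p(x_i|z), and autoregressive
# Bernoulli p(x|z) = prod_i p(x_i|z, x_{<i}).  All weights are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
x_dim, z_dim = 16, 8
W_enc = rng.normal(scale=0.1, size=(x_dim, 2 * z_dim))   # hypothetical encoder
W_dec = rng.normal(scale=0.1, size=(z_dim, x_dim))       # decoder: z -> logits
W_ar = np.tril(rng.normal(scale=0.1, size=(x_dim, x_dim)), k=-1)  # x_{<i} -> logit_i

def bernoulli_log_prob(x, logits):
    """sum_i log p(x_i) for Bernoulli variables parameterized by logits."""
    return np.sum(x * logits - np.logaddexp(0.0, logits))

def elbo(x, autoregressive=False):
    # Encoder: q(z|x) = N(mu, diag(sigma^2)).
    h = x @ W_enc
    mu, log_var = h[:z_dim], h[z_dim:]

    # Reparameterization: z = mu + sigma * eps with eps ~ N(0, I), which keeps
    # the sampling step differentiable w.r.t. the encoder parameters.
    eps = rng.normal(size=z_dim)
    z = mu + np.exp(0.5 * log_var) * eps

    # Reconstruction term E_q[log p(x|z)], estimated with the single sample z.
    logits = z @ W_dec
    if autoregressive:
        # Each logit_i additionally depends on x_{<i} through a strictly
        # lower-triangular weight matrix (MADE/PixelCNN-style masking).
        logits = logits + x @ W_ar.T
    recon = bernoulli_log_prob(x, logits)

    # KL(q(z|x) || p(z)), closed form for Gaussian q and standard-normal p.
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

    return recon - kl   # Eq. (4): reconstruction minus the KL regularizer

x = (rng.random(x_dim) < 0.5).astype(float)   # dummy binary datapoint
print("factorized decoder    :", elbo(x, autoregressive=False))
print("autoregressive decoder:", elbo(x, autoregressive=True))
```

In a real model the linear maps would be neural networks and the bound would be maximized by stochastic gradient ascent; the sketch only shows where the reconstruction and KL terms of Eq. (4) come from, and how an autoregressive decoder can model x with or without help from z.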