# PixelGAN Autoencoders

Alireza Makhzani, Brendan Frey
University of Toronto
{makhzani,frey}@psi.toronto.edu

In this paper, we describe the PixelGAN autoencoder, a generative autoencoder in which the generative path is a convolutional autoregressive neural network on pixels (PixelCNN) that is conditioned on a latent code, and the recognition path uses a generative adversarial network (GAN) to impose a prior distribution on the latent code. We show that different priors result in different decompositions of information between the latent code and the autoregressive decoder. For example, by imposing a Gaussian distribution as the prior, we can achieve a global vs. local decomposition, or by imposing a categorical distribution as the prior, we can disentangle the style and content information of images in an unsupervised fashion. We further show how the PixelGAN autoencoder with a categorical prior can be directly used in semi-supervised settings and achieve competitive semi-supervised classification results on the MNIST, SVHN and NORB datasets.

## 1 Introduction

In recent years, generative models that can be trained via direct back-propagation have enabled remarkable progress in modeling natural images. One of the most successful models is the generative adversarial network (GAN) [1], which employs a two-player min-max game. The generative model, G, samples the prior p(z) and generates the sample G(z). The discriminator, D(x), is trained to identify whether a point x is a sample from the data distribution or a sample from the generative model. The generator is trained to maximally confuse the discriminator into believing that generated samples come from the data distribution. The cost function of the GAN is

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))].$$

GANs can be considered within the wider framework of implicit generative models [2, 3, 4]. Implicit distributions can be sampled through their generative path, but their likelihood function is not tractable. Recently, several papers have proposed another application of GAN-style algorithms for approximate inference [2, 3, 4, 5, 6, 7, 8, 9]. These algorithms use implicit distributions to learn posterior approximations that are more expressive than the distributions with tractable densities that are often used in variational inference. For example, adversarial autoencoders [6] use a universal approximator posterior as the implicit posterior distribution and use adversarial training to match the aggregated posterior of the latent code to the prior distribution. Adversarial variational Bayes [3, 7] uses a more general amortized GAN inference framework within a maximum-likelihood learning setting. Another type of GAN inference technique is used in the ALI [8] and BiGAN [9] models, which have been shown to approximate maximum likelihood learning [3]. In these models, both the recognition and generative models are implicit and are jointly learnt by an adversarial training process.

Variational autoencoders (VAEs) [10, 11] are another state-of-the-art image modeling technique that use neural networks to parametrize the posterior distribution and pair it with a top-down generative network. Both networks are jointly trained to maximize a variational lower bound on the data log-likelihood. A different framework for learning density models is autoregressive neural networks such as NADE [12], MADE [12], PixelRNN [12] and PixelCNN [13]. Unlike variational autoencoders, which capture the statistics of the data in hierarchical latent codes, autoregressive models learn the image densities directly at the pixel level without learning a hierarchical latent representation.
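For concreteness, the following is a minimal PyTorch sketch of the two-player game described above. The MLP architectures, optimizer settings, and random stand-in data are illustrative assumptions rather than the models studied in this paper, and the generator step uses the common non-saturating variant of the objective.

```python
import torch
import torch.nn as nn

# Toy generator and discriminator; sizes and data are illustrative only.
z_dim, x_dim, B = 8, 784, 64
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

x_real = torch.rand(B, x_dim)   # stand-in for a data mini-batch
z = torch.randn(B, z_dim)       # samples from the prior p(z)

# Discriminator step: ascend log D(x) + log(1 - D(G(z))).
opt_d.zero_grad()
d_loss = bce(D(x_real), torch.ones(B, 1)) + bce(D(G(z).detach()), torch.zeros(B, 1))
d_loss.backward()
opt_d.step()

# Generator step: non-saturating variant, ascend log D(G(z)).
opt_g.zero_grad()
g_loss = bce(D(G(z)), torch.ones(B, 1))
g_loss.backward()
opt_g.step()
```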
In this paper, we present the PixelGAN autoencoder as a generative autoencoder that combines the benefits of latent variable models with autoregressive architectures. The PixelGAN autoencoder is a generative autoencoder in which the generative path is a PixelCNN that is conditioned on a latent variable. The latent variable is inferred by matching the aggregated posterior distribution to the prior distribution with an adversarial training technique similar to that of the adversarial autoencoder [6]. However, whereas in adversarial autoencoders the statistics of the data distribution are captured by the latent code, in the PixelGAN autoencoder they are captured jointly by the latent code and the autoregressive decoder. We show that imposing different distributions as the prior results in different factorizations of information between the latent code and the autoregressive decoder. For example, in Section 2.1, we show that by imposing a Gaussian distribution on the latent code, we can achieve a global vs. local decomposition of information. In this case, the global latent code no longer has to model all the irrelevant and fine details of the image, and can use its capacity to capture more relevant and global statistics of the image. Another type of decomposition of information that can be learnt by PixelGAN autoencoders is a discrete vs. continuous decomposition. In Section 2.2, we show that we can achieve this decomposition by imposing a categorical prior on the latent code using adversarial training. In this case, the categorical latent code captures the discrete underlying factors of variation in the data, such as class label information, and the autoregressive decoder captures the remaining continuous structure, such as style information, in an unsupervised fashion. We then show how PixelGAN autoencoders with categorical priors can be directly used in clustering and semi-supervised scenarios and achieve very competitive classification results on several datasets in Section 3. Finally, we present one of the main potential applications of PixelGAN autoencoders, learning cross-domain relations between two different domains, in Section 4.

## 2 PixelGAN Autoencoders

Figure 1: Architecture of the PixelGAN autoencoder.

Let x be a datapoint that comes from the distribution p_data(x) and z be the hidden code. The recognition path of the PixelGAN autoencoder (Figure 1) defines an implicit posterior distribution q(z|x) by using a deterministic neural function z = f(x, n) that takes the input x along with random noise n with a fixed distribution p(n) and outputs z. The aggregated posterior q(z) of this model is defined as

$$q(z) = \int_x q(z|x)\, p_{\text{data}}(x)\, dx.$$

This parametrization of the implicit posterior distribution was originally proposed in the adversarial autoencoder work [6] as the universal approximator posterior. We can sample from this implicit distribution q(z|x) by evaluating f(x, n) at different samples of n, but the density function of this posterior distribution is intractable. Appendix A.1 discusses the importance of the input noise in training PixelGAN autoencoders.
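A minimal sketch of this universal approximator posterior follows, assuming a small fully-connected encoder (the paper's actual encoder architectures differ). It shows how sampling from q(z|x) amounts to evaluating the deterministic function f(x, n) at fresh draws of the noise n.

```python
import torch
import torch.nn as nn

class ImplicitEncoder(nn.Module):
    """Universal approximator posterior: z = f(x, n) with a fixed noise distribution p(n)."""
    def __init__(self, x_dim=784, n_dim=16, z_dim=2):
        super().__init__()
        self.n_dim = n_dim
        self.net = nn.Sequential(nn.Linear(x_dim + n_dim, 512), nn.ReLU(),
                                 nn.Linear(512, z_dim))

    def forward(self, x):
        n = torch.randn(x.size(0), self.n_dim)     # n ~ p(n), redrawn on every call
        return self.net(torch.cat([x, n], dim=1))  # deterministic f applied to (x, n)

f = ImplicitEncoder()
x = torch.rand(5, 784)
# Sampling from q(z|x): evaluate f(x, n) at different draws of n.
z_samples = torch.stack([f(x) for _ in range(10)])  # 10 posterior samples per input
```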
The generative path p(x|z) is a conditional PixelCNN [13] that conditions on the latent vector z using an adaptive bias in the PixelCNN layers. The inference is done by an amortized GAN inference technique that was originally proposed in the adversarial autoencoder work [6]. In this method, an adversarial network is attached on top of the hidden code vector of the autoencoder and matches the aggregated posterior distribution, q(z), to an arbitrary prior, p(z). Samples from q(z) and p(z) are provided to the adversarial network as the negative and positive examples respectively, and the generator of the adversarial network, which is also the encoder of the autoencoder, tries to match q(z) to p(z) using the gradient that comes through the discriminative adversarial network.

The adversarial network, the PixelCNN decoder and the encoder are trained jointly in two phases, the reconstruction phase and the adversarial phase, executed on each mini-batch. In the reconstruction phase, the ground truth input x along with the hidden code z inferred by the encoder are provided to the PixelCNN decoder. The PixelCNN decoder weights are updated to maximize the log-likelihood of the input x. The encoder weights are also updated at this stage by the gradient that comes through the conditioning vector of the PixelCNN. In the adversarial phase, the adversarial network updates both its discriminative network and its generative network (the encoder) to match q(z) to p(z). Once the training is done, we can sample from the model by first sampling z from the prior distribution p(z), and then sampling from the conditional likelihood p(x|z) parametrized by the PixelCNN decoder.
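The two training phases can be sketched as follows. This is a schematic, not the paper's implementation: the conditional PixelCNN is replaced by a stand-in MLP that takes (x, z), and all layer sizes are placeholder assumptions; what matters here is the phase structure.

```python
import torch
import torch.nn as nn

# Stand-ins for the paper's components. The conditional PixelCNN decoder is
# replaced by an MLP conditioned on z, just to show the two-phase structure.
z_dim, x_dim, B = 2, 784, 64
encoder = nn.Sequential(nn.Linear(x_dim, 512), nn.ReLU(), nn.Linear(512, z_dim))
decoder = nn.Sequential(nn.Linear(x_dim + z_dim, 512), nn.ReLU(),
                        nn.Linear(512, x_dim), nn.Sigmoid())
disc = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())

opt_ae = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(encoder.parameters(), lr=1e-3)
bce = nn.BCELoss()

x = torch.rand(B, x_dim)  # stand-in mini-batch

# Reconstruction phase: update decoder and encoder to maximize log p(x|z).
opt_ae.zero_grad()
z = encoder(x)
recon = decoder(torch.cat([x, z], dim=1))  # the PixelCNN would condition on z via adaptive biases
nn.functional.binary_cross_entropy(recon, x).backward()
opt_ae.step()

# Adversarial phase (1): train the discriminator; samples from the prior p(z)
# are positives, samples from the aggregated posterior q(z) are negatives.
opt_d.zero_grad()
z_prior = torch.randn(B, z_dim)
z_post = encoder(x).detach()
d_loss = bce(disc(z_prior), torch.ones(B, 1)) + bce(disc(z_post), torch.zeros(B, 1))
d_loss.backward()
opt_d.step()

# Adversarial phase (2): update the encoder (the GAN generator) to fool the
# discriminator, pushing q(z) toward p(z).
opt_g.zero_grad()
g_loss = bce(disc(encoder(x)), torch.ones(B, 1))
g_loss.backward()
opt_g.step()
```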
We now establish a connection between the PixelGAN autoencoder cost and maximum likelihood learning using a decomposition of the aggregated evidence lower bound (ELBO) proposed in [14]:

$$\mathbb{E}_{x \sim p_{\text{data}}(x)}[\log p(x)] \;\geq\; -\,\mathbb{E}_{x \sim p_{\text{data}}(x)}\!\left[\mathbb{E}_{q(z|x)}[-\log p(x|z)]\right] \;-\; \mathbb{E}_{x \sim p_{\text{data}}(x)}\!\left[\mathrm{KL}(q(z|x)\,\|\,p(z))\right] \tag{1}$$

$$= \; -\,\underbrace{\mathbb{E}_{x \sim p_{\text{data}}(x)}\!\left[\mathbb{E}_{q(z|x)}[-\log p(x|z)]\right]}_{\text{reconstruction term}} \;-\; \underbrace{\mathrm{KL}(q(z)\,\|\,p(z))}_{\text{marginal KL}} \;-\; \underbrace{I(z;x)}_{\text{mutual info.}} \tag{2}$$

The first term in Equation 2 is the reconstruction term and the second term is the marginal KL divergence between the aggregated posterior and the prior distribution. The third term is the mutual information between the latent code z and the input x. This is a regularization term that encourages z and x to be decoupled by removing the information of the data distribution from the hidden code. If the training set has N examples, I(z; x) is bounded as follows (see [14]):

$$0 \;\leq\; I(z;x) \;\leq\; \log N \tag{3}$$

In order to maximize the ELBO, we need to minimize all three terms of Equation 2. We consider two cases for the decoder p(x|z):

Deterministic Decoder. If the decoder p(x|z) is deterministic or has very limited stochasticity, such as the simple factorized decoder of the VAE, the mutual information term acts in the complete opposite direction of the reconstruction term. This is because the only way to minimize the reconstruction error of x is to learn a hidden code z that is relevant to x, which results in maximizing I(z; x). Indeed, it can be shown that minimizing the reconstruction term maximizes a variational lower bound on I(z; x) [15, 16]. For example, in the case of the VAE trained on MNIST, since the reconstruction is precise, the mutual information term is close to its maximum value I(z; x) ≈ log N ≈ 11.00 nats [14].

Stochastic Decoder. If we use a powerful decoder such as the PixelCNN, the reconstruction term and the mutual information term no longer compete with each other, and the network can minimize both independently. In this case, the optimal solution for maximizing the ELBO is to model p_data(x) solely by p(x|z), thereby minimizing the reconstruction term, while at the same time minimizing the mutual information term by ignoring the latent code. As a result, even though the model achieves a high likelihood, the latent code does not learn any useful representation, which is undesirable. This problem has been observed in several previous works [17, 18], and different techniques, such as annealing the weight of the KL term [17] or weakening the decoder [18], have been proposed to make z and x more dependent. As suggested in [19, 18], we think that the maximum likelihood objective by itself is not a useful objective for representation learning, especially when a powerful decoder is used.

In PixelGAN autoencoders, in order to encourage learning more useful representations, we modify the ELBO (Equation 2) by removing the mutual information term from it, since this term explicitly encourages z to become independent of x. So our cost function only includes the reconstruction term and the marginal KL term. The reconstruction term is optimized by the reconstruction phase of training and the marginal KL term is approximately optimized by the adversarial phase.¹ Note that since the mutual information term is upper bounded by a constant (log N), we are still maximizing a lower bound on the log-likelihood of the data. However, this bound is weaker than the ELBO, which is the price that is paid for learning more useful latent representations by balancing the decomposition of information between the latent code and the autoregressive decoder.

¹The original GAN formulation optimizes the Jensen-Shannon divergence [1], but there are other formulations that optimize the KL divergence, e.g. [3].

Figure 2: (a) Samples of the PixelGAN autoencoder with 2D Gaussian code and limited receptive field of size 9. (b) Samples of the PixelCNN with the same limited receptive field. (c) Samples of the adversarial autoencoder with 2D code.

For implementing the conditioning adaptive bias in the PixelCNN decoder, we explore two different architectures [13]. In the location-invariant bias, for each PixelCNN layer, we use the latent code to construct a vector that is broadcasted within each feature map of the layer and then added as an adaptive bias to that layer. In the location-dependent bias, we use the latent code to construct a spatial feature map that is broadcasted across different feature maps and then added only to the first layer of the decoder as an adaptive bias. We will discuss the effect of these architectures on the learnt representation in Figure 3 of Section 2.1 and their implementation details in Appendix A.2.
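The two conditioning schemes can be sketched as follows, with placeholder layer sizes; the helper names to_channel_bias and to_spatial_bias are hypothetical, introduced only for illustration.

```python
import torch
import torch.nn as nn

# Hedged sketch of the two adaptive-bias conditioning schemes.
B, z_dim, C, H, W = 4, 30, 64, 28, 28
z = torch.randn(B, z_dim)
h = torch.randn(B, C, H, W)   # activations of some PixelCNN layer (stand-in)

# Location-invariant bias: z is mapped to one scalar per feature map, then
# broadcast over all spatial positions of that map (applied at every layer).
to_channel_bias = nn.Linear(z_dim, C)
h_cond = h + to_channel_bias(z).view(B, C, 1, 1)

# Location-dependent bias: z is mapped to a single spatial map that is
# broadcast across feature maps, and added only to the first decoder layer.
to_spatial_bias = nn.Linear(z_dim, H * W)
h_first = h + to_spatial_bias(z).view(B, 1, H, W)
```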
### 2.1 PixelGAN Autoencoders with Gaussian Priors

Here, we show that PixelGAN autoencoders with Gaussian priors can decompose the global and local statistics of the images between the latent code and the autoregressive decoder. Figure 2a shows the samples of a PixelGAN autoencoder model with the location-dependent bias trained on the MNIST dataset. For the purpose of better illustrating the decomposition of information, we have chosen a 2D Gaussian latent code and a limited receptive field of size 9 for the PixelGAN autoencoder. Figure 2b shows the samples of a PixelCNN model with the same limited receptive field size of 9, and Figure 2c shows the samples of an adversarial autoencoder with the 2D Gaussian latent code. The PixelCNN can successfully capture the local statistics, but fails to capture the global statistics due to the limited receptive field size. In contrast, the adversarial autoencoder, whose sample quality is very similar to that of the VAE, can successfully capture the global statistics, but fails to generate the details of the images. However, the PixelGAN autoencoder, with the same receptive field and code size, combines the best of both and generates sharp images with correct global statistics.

In PixelGAN autoencoders, both the PixelCNN depth and the conditioning architecture affect the decomposition of information between the latent code and the autoregressive decoder. We investigate these effects in Figure 3 by training a PixelGAN autoencoder on MNIST, where the code size is chosen to be 2 for visualization purposes. As shown in Figure 3a,b, when a shallow decoder is used, most of the information is encoded in the hidden code and there is a clean separation between the digit clusters. As we make the PixelCNN more powerful (Figure 3c,d), the hidden code is still used to capture some relevant information of the input, but the separation of digit clusters is not as sharp when the limited code size of 2 is used. In the next section, we will show that by using a larger code size (e.g., 30), we can get a much better separation of digit clusters even when a powerful PixelCNN is used.

Figure 3: The effect of the PixelCNN decoder depth and the conditioning architecture on the learnt representation of the PixelGAN autoencoder: (a) shallow PixelCNN, location-invariant bias; (b) shallow PixelCNN, location-dependent bias; (c) deep PixelCNN, location-invariant bias; (d) deep PixelCNN, location-dependent bias. (Shallow = 3 residual blocks, Deep = 12 residual blocks)

The conditioning architecture also affects the decomposition of information. In the case of the location-invariant bias, the hidden code is encouraged to learn the global information that is location-invariant (the "what" information and not the "where" information), such as the class label information. For example, we can see in Figure 3a,c that the network has learnt to use one of the axes of the 2D Gaussian code to explicitly encode the digit label even though a continuous prior is imposed. In this case, we can potentially get a much better separation if we impose a discrete prior. This makes this architecture suitable for the discrete vs. continuous decomposition, and we use it for our clustering and semi-supervised learning experiments. In the case of the location-dependent bias (Figure 3b,d), the hidden code is encouraged to learn the global information that is location-dependent, such as the low-frequency content of the image, similar to what the hidden code of an adversarial or variational autoencoder would learn (Figure 2c). This makes this architecture suitable for global vs. local decomposition experiments such as Figure 2a.

From Figure 3, we can see that the class label information is mostly captured by p(z), while the style information of the images is captured by both p(z) and p(x|z). This decomposition of information has also been studied in other works that combine latent variable models with autoregressive decoders, such as PixelVAE [20] and the variational lossy autoencoder (VLAE) [18].
For example, the VLAE model [18] proposes to use the depth of the PixelCNN decoder to control the decomposition of information. In their model, the PixelCNN decoder is designed to have a shallow depth (small local receptive field) so that the latent code z is forced to capture more global information. This approach is very similar to our example of the PixelGAN autoencoder in Figure 2. However, the question that has remained unanswered is whether it is possible to achieve a complete decomposition of content and style in an unsupervised fashion, where the class label or discrete structure information is encoded in the latent code z, and the remaining continuous structure such as style is captured by a powerful and deep PixelCNN decoder. This kind of decomposition is particularly interesting as it can be directly used for clustering and semi-supervised classification. In the next section, we show that we can learn this decomposition of content and style by imposing a categorical distribution on the latent representation z using adversarial training. Note that this discrete vs. continuous decomposition is very different from the global vs. local decomposition, because a continuous factor of variation such as style can have both global and local effects on the image. Indeed, in order to achieve the discrete vs. continuous decomposition, we have to use very deep and powerful PixelCNN decoders (up to 20 residual blocks) to capture both the global and local statistics of the style, while the discrete content of the image is captured by the categorical latent variable.

### 2.2 PixelGAN Autoencoders with Categorical Priors

In this section, we present an architecture of the PixelGAN autoencoder that can separate the discrete information (e.g., class label) from the continuous information (e.g., style) in the images. We then show how our architecture can be naturally adopted for semi-supervised settings. The architecture that we use is similar to Figure 1, with the difference that we impose a categorical distribution as the prior rather than the Gaussian distribution (Figure 4), and also use the location-invariant bias architecture. Another difference is that we use a convolutional network as the inference network q(z|x) to encourage the encoder to preserve the content and lose the style information of the image. The inference network has a softmax output and predicts a one-hot vector whose dimension is the number of discrete labels or categories that we wish the data to be clustered into. The adversarial network is trained directly on the continuous probability outputs of the softmax layer of the encoder.

Figure 4: Architecture of the PixelGAN autoencoder with the categorical prior. p(z) captures the class label, and p(x|z) is a multi-modal distribution that captures the style distribution of a digit conditioned on the class label of that digit.

Imposing a categorical distribution at the output of the encoder imposes two constraints. The first constraint is that the encoder has to make confident decisions about the class labels of the inputs. The adversarial training pushes the output of the encoder to the corners of the softmax simplex, by which it ensures that the autoencoder cannot use the latent vector z to carry any continuous style information. The second constraint imposed by the adversarial training is that the aggregated posterior distribution of z should match the categorical prior distribution with uniform outcome probabilities. This constraint enforces the encoder to evenly distribute the class labels across the corners of the softmax simplex. Because of these constraints, the latent variable will only capture the discrete content of the image, and all the continuous style information will be captured by the autoregressive decoder.
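The categorical-prior setup can be sketched as follows; the architectures are placeholder assumptions (the paper's inference network is convolutional). The discriminator receives one-hot samples from the uniform categorical prior as positives and the encoder's continuous softmax outputs as negatives.

```python
import torch
import torch.nn as nn

K, x_dim, B = 30, 784, 64
# Stand-in encoder ending in a softmax over K cluster heads.
encoder = nn.Sequential(nn.Linear(x_dim, 512), nn.ReLU(),
                        nn.Linear(512, K), nn.Softmax(dim=1))
disc = nn.Sequential(nn.Linear(K, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())

x = torch.rand(B, x_dim)
q_z = encoder(x)                                   # continuous points on the softmax simplex
one_hot = torch.eye(K)[torch.randint(0, K, (B,))]  # samples from the uniform categorical prior

bce = nn.BCELoss()
# Discriminator: one-hot prior samples are positives, softmax outputs are negatives.
d_loss = bce(disc(one_hot), torch.ones(B, 1)) + bce(disc(q_z.detach()), torch.zeros(B, 1))
# Encoder (generator) update: pushes q(z) toward the corners of the simplex,
# spread evenly across the K categories.
g_loss = bce(disc(q_z), torch.ones(B, 1))
```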
In order to better understand and visualize the effect of the adversarial training on shaping the hidden code distribution, we train a PixelGAN autoencoder on the first three digits of MNIST (18000 training and 3000 test points) and choose the number of clusters to be 3. Suppose z = [z1, z2, z3] is the hidden code, which in this case is the output probabilities of the softmax layer of the inference network. In Figure 5a, we project the 3D softmax simplex z1 + z2 + z3 = 1 onto a 2D triangle and plot the hidden codes of the training examples when no distribution is imposed on the hidden code. We can see from this figure that the network has learnt to use the surface of the softmax simplex to encode the style information of the digits, and thus the three corners of the simplex do not have any meaningful interpretation. Figure 5b corresponds to the code space of the same network when a categorical distribution is imposed using the adversarial training. In this case, we can see that the network has successfully learnt to encode the label information of the three digits in the three corners of the simplex, and all the style information has been separately captured by the autoregressive decoder. This network achieves an almost perfect test error-rate of 0.3% on the first three digits of MNIST, even though it is trained in a purely unsupervised fashion.

Figure 5: Effect of GAN regularization (categorical prior) on the code space of PixelGAN autoencoders: (a) without GAN regularization; (b) with GAN regularization.

Once the PixelGAN autoencoder is trained, its encoder can be used for clustering new points and its decoder can be used to generate samples from each cluster. Figure 6 illustrates the samples of the PixelGAN autoencoder trained on the full MNIST dataset. The number of clusters is set to 30, and each row corresponds to the conditional samples of one of the clusters (only 16 are shown). We can see that the discrete latent code of the network has learnt discrete factors of variation such as class label information and some discrete style information. For example, digit 1s are put in different clusters based on how much they are tilted. The network also assigns different clusters to digit 2s (based on whether they have a loop) and digit 7s (based on whether they have a dash in the middle). In Section 3, we will show that by using the encoder of this network, we can obtain about a 5% error rate in classifying digits in an unsupervised fashion, just by matching each cluster to a digit type.

Figure 6: Disentangling the content and style in an unsupervised fashion with PixelGAN autoencoders. Each row shows samples of the model from one of the learnt clusters.

Semi-Supervised PixelGAN Autoencoders. The PixelGAN autoencoder can be used in a semi-supervised setting. In order to incorporate the label information, we add a semi-supervised training phase. Specifically, we set the number of clusters to be the same as the number of class labels, and after executing the reconstruction and adversarial phases on an unlabeled mini-batch, the semi-supervised phase is executed on a labeled mini-batch by updating the weights of the encoder q(z|x) to minimize the cross-entropy cost. The semi-supervised cost also reduces the mode-missing behavior of the GAN training by enforcing the encoder to learn all the modes of the categorical distribution. In Section 3, we will evaluate the performance of PixelGAN autoencoders on semi-supervised classification tasks.
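A sketch of the semi-supervised phase under assumed shapes follows; the stand-in encoder here emits pre-softmax logits so that the standard cross-entropy loss can be applied directly.

```python
import torch
import torch.nn as nn

K, x_dim, B = 10, 784, 32
# Stand-in encoder emitting pre-softmax logits (cross_entropy applies the softmax).
encoder_logits = nn.Sequential(nn.Linear(x_dim, 512), nn.ReLU(), nn.Linear(512, K))
opt = torch.optim.Adam(encoder_logits.parameters(), lr=1e-3)

x_labeled = torch.rand(B, x_dim)        # stand-in labeled mini-batch
y = torch.randint(0, K, (B,))           # its class labels

# Semi-supervised phase: executed after the reconstruction and adversarial
# phases on an unlabeled mini-batch; only the encoder weights are updated.
opt.zero_grad()
loss = nn.functional.cross_entropy(encoder_logits(x_labeled), y)
loss.backward()
opt.step()
```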
## 3 Experiments

In this paper, we presented the PixelGAN autoencoder as a generative model, but the currently available metrics for evaluating the likelihood of GAN-based generative models, such as the Parzen window estimate, are fundamentally flawed [21]. So in this section, we only present the performance of the PixelGAN autoencoder on downstream tasks such as unsupervised clustering and semi-supervised classification. The details of all the experiments can be found in Appendix B.

Unsupervised Clustering. We trained a PixelGAN autoencoder in an unsupervised fashion on the MNIST dataset (Figure 6). We chose the number of clusters to be 30 and used the following evaluation protocol: once the training is done, for each cluster i, we find the validation example x_n that maximizes q(z_i|x_n), and assign the label of x_n to all the points in cluster i. We then compute the test error based on the class labels assigned to each cluster. As shown in the first column of Table 1, the performance of PixelGAN autoencoders is on par with other GAN-based clustering algorithms such as CatGAN [22], InfoGAN [16] and adversarial autoencoders [6].

Figure 7: Conditional samples of the semi-supervised PixelGAN autoencoder: (a) SVHN (1000 labels); (b) MNIST (100 labels); (c) NORB (1000 labels).

Figure 8: Semi-supervised error-rate of PixelGAN autoencoders over training epochs on the MNIST (20, 50 and 100 labels, and unsupervised with 30 clusters) and SVHN (500 and 1000 labels) datasets.

| Method | MNIST (Unsupervised) | MNIST (20 labels) | MNIST (50 labels) | MNIST (100 labels) | SVHN (500 labels) | SVHN (1000 labels) | NORB (1000 labels) |
|---|---|---|---|---|---|---|---|
| VAE [24] | - | - | - | 3.33 (±0.14) | - | 36.02 (±0.10) | 18.79 (±0.05) |
| VAT [25] | - | - | - | 2.33 | - | 24.63 | 9.88 |
| ADGM [26] | - | - | - | 0.96 (±0.02) | - | 22.86 | 10.06 (±0.05) |
| SDGM [26] | - | - | - | 1.32 (±0.07) | - | 16.61 (±0.24) | 9.40 (±0.04) |
| Adversarial Autoencoder [6] | 4.10 (±1.13) | - | - | 1.90 (±0.10) | - | 17.70 (±0.30) | - |
| Ladder Networks [27] | - | - | - | 0.89 (±0.50) | - | - | - |
| Convolutional CatGAN [22] | 4.27 | - | - | 1.39 (±0.28) | - | - | - |
| InfoGAN [16] | 5.00 | - | - | - | - | - | - |
| Feature Matching GAN [28] | - | 16.77 (±4.52) | 2.21 (±1.36) | 0.93 (±0.06) | 18.44 (±4.80) | 8.11 (±1.30) | - |
| Temporal Ensembling [23] | - | - | - | - | 7.05 (±0.30) | 5.43 (±0.25) | - |
| PixelGAN Autoencoders | 5.27 (±1.81) | 12.08 (±5.50) | 1.16 (±0.17) | 1.08 (±0.15) | 10.47 (±1.80) | 6.96 (±0.55) | 8.90 (±1.0) |

Table 1: Semi-supervised learning and clustering error-rate (%) on the MNIST, SVHN and NORB datasets.

Semi-supervised Classification. Table 1 and Figure 8 report the results of semi-supervised classification experiments on the MNIST, SVHN and NORB datasets. On the MNIST dataset with 20, 50 and 100 labels, our classification results are highly competitive. Note that the error-rate of unsupervised clustering on MNIST is lower than that of semi-supervised MNIST with 20 labels. This is because in the unsupervised case the number of clusters is 30, but in the semi-supervised case there are only 10 class labels, which makes it more likely to confuse two digits. On the SVHN dataset with 500 and 1000 labels, the PixelGAN autoencoder outperforms all the other methods except the recently proposed temporal ensembling work [23], which is not a generative model. On the NORB dataset with 1000 labels, the PixelGAN autoencoder outperforms all the other reported results. Figure 7 shows the conditional samples of the semi-supervised PixelGAN autoencoder on the MNIST, SVHN and NORB datasets. Each column of this figure presents sampled images conditioned on a fixed one-hot latent code. We can see from this figure that the PixelGAN autoencoder achieves a rather clean separation of style and content on these datasets with very few labeled data points.
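Returning to the clustering evaluation protocol described above, it can be written out as follows; all tensors here are random stand-ins for the actual q(z|x) outputs and dataset labels.

```python
import torch

# Each cluster i is assigned the label of the validation example that
# maximizes q(z_i | x_n); test points then inherit their cluster's label.
K, n_val, n_test = 30, 1000, 500
q_val = torch.rand(n_val, K)               # q(z|x) for validation examples (stand-in)
y_val = torch.randint(0, 10, (n_val,))     # validation labels (stand-in)
q_test = torch.rand(n_test, K)
y_test = torch.randint(0, 10, (n_test,))

best_example = q_val.argmax(dim=0)         # for each cluster i, argmax_n q(z_i|x_n)
cluster_label = y_val[best_example]        # label assigned to each cluster
y_pred = cluster_label[q_test.argmax(dim=1)]
error_rate = (y_pred != y_test).float().mean()
```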
## 4 Learning Cross-Domain Relations with PixelGAN Autoencoders

In this section, we discuss how the PixelGAN autoencoder can be viewed in the context of learning cross-domain relations between two different domains. We also describe how the problem of clustering or semi-supervised learning can be cast as the problem of finding a smooth cross-domain mapping from the data distribution to the categorical distribution.

Recently, several GAN-based methods have been developed to learn a cross-domain mapping between two different domains [29, 30, 31, 6, 32]. In [31], an unsupervised cost function called output distribution matching (ODM) is proposed to find a cross-domain mapping F between two domains D1 and D2 by imposing the following unsupervised constraint on uncorrelated samples x ~ D1 and y ~ D2:

$$\mathrm{Distr}[F(x)] = \mathrm{Distr}[y] \tag{4}$$

where Distr[z] denotes the distribution of the random variable z. Adversarial training is proposed as one of the methods for matching these distributions. If we have access to a few labeled pairs (x, y), then F can be further trained on them in a supervised fashion to satisfy F(x) = y. For example, in speech recognition, we want to find a cross-domain mapping from a sequence of phonemes to a sequence of characters. By optimizing the ODM cost function in Equation 4, we can find a smooth function F that takes phonemes at its input and outputs a sequence of characters that respects the language model. However, the main problem with this method is that the network can learn to ignore part of the input distribution and still satisfy the ODM cost function with its output distribution. This problem has also been observed in other works such as [29]. One way to avoid this problem is to add a reconstruction term to the ODM cost function by introducing a reverse mapping from the output of the encoder to the input domain. This is essentially the idea of the adversarial autoencoder (AAE) [6], which learns a generative model by finding a cross-domain mapping between a Gaussian distribution and the data distribution.

Using the ODM cost function along with a reconstruction term to learn cross-domain relations has been explored in several previous works. For example, InfoGAN [16] adds a mutual information term to the ODM cost function and optimizes a variational lower bound on this term. It can be shown that maximizing this variational bound is indeed minimizing the reconstruction cost of an autoencoder [15]. Similarly, in [32, 33], an AAE is used to learn the cross-domain relations of the vector representations of words from two different languages. The architectures of the recent DiscoGAN [29] and CycleGAN [30] works are also similar to an AAE in which the latent representation is enforced to have the distribution of the other domain.
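Under the adversarial instantiation of Equation 4, a minimal sketch looks like the following; the dimensions and data are placeholder assumptions, and F and the discriminator are hypothetical stand-in networks.

```python
import torch
import torch.nn as nn

# A discriminator compares F(x) for uncorrelated samples x ~ D1 against
# independent samples y ~ D2, pushing Distr[F(x)] toward Distr[y].
d1_dim, d2_dim, B = 64, 16, 128
F = nn.Sequential(nn.Linear(d1_dim, 128), nn.ReLU(), nn.Linear(128, d2_dim))
disc = nn.Sequential(nn.Linear(d2_dim, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())

x = torch.randn(B, d1_dim)   # uncorrelated samples from domain D1 (stand-in)
y = torch.randn(B, d2_dim)   # independent samples from domain D2 (stand-in)

bce = nn.BCELoss()
d_loss = bce(disc(y), torch.ones(B, 1)) + bce(disc(F(x).detach()), torch.zeros(B, 1))
f_loss = bce(disc(F(x)), torch.ones(B, 1))
# A few labeled pairs (x, y) can additionally train F with a supervised loss,
# e.g. nn.functional.mse_loss(F(x_pair), y_pair).
```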
Here we describe how our proposed PixelGAN autoencoder can potentially be used in all these application areas to learn better cross-domain relations. Suppose we want to learn a mapping from domain D1 to D2. In the architecture of Figure 1, we can use independent samples of x ~ D1 at the input and, instead of imposing a Gaussian distribution on the latent code, we can impose the distribution of the second domain using its independent samples y ~ D2. Unlike AAEs, the encoder of PixelGAN autoencoders does not have to retain all the input information in order to have a lossless reconstruction. So the encoder can use all its capacity to learn the most relevant mapping from D1 to D2, and at the same time, the PixelCNN can capture the remaining information that has been lost by the encoder.

We can adopt the ODM idea for semi-supervised learning by assuming D1 is the image domain and D2 is the label domain. Independent samples of D1 and D2 correspond to samples from the data distribution p_data(x) and the categorical distribution. The function F = q(y|x) can be parametrized by a neural network that is trained to satisfy the ODM cost function by matching the aggregated distribution

$$q(y) = \int q(y|x)\, p_{\text{data}}(x)\, dx$$

to the categorical distribution using adversarial training. The few labeled examples are used to further train F to satisfy F(x) = y. However, as explained above, the problem with this method is that the network can learn to generate the categorical distribution while ignoring some part of the input distribution. The AAE solves this problem by adding an inverse mapping from the categorical distribution to the data distribution. However, the main drawback of the AAE architecture is that, due to the reconstruction term, the latent representation now has to model all the underlying factors of variation in the image. For example, in the semi-supervised AAE architecture [6], while we are only interested in the one-hot label representation for semi-supervised learning, we also need to infer the style of the image so that we can have a lossless reconstruction of the image. The PixelGAN autoencoder solves this problem by enabling the encoder to only infer the factor of variation that we are interested in (i.e., the label information), while the remaining structure of the input (i.e., the style information) is automatically captured by the autoregressive decoder.

## 5 Conclusion

In this paper, we proposed the PixelGAN autoencoder, which is a generative autoencoder that combines a generative PixelCNN with a GAN inference network that can impose arbitrary priors on the latent code. We showed that imposing different distributions as the prior enables us to learn a latent representation that captures the type of statistics that we care about, while the remaining structure of the image is captured by the PixelCNN decoder. Specifically, by imposing a Gaussian prior, we were able to disentangle the low-frequency and high-frequency statistics of the images, and by imposing a categorical prior, we were able to disentangle the style and content of images and learn representations that are specifically useful for clustering and semi-supervised learning tasks. While the main focus of this paper was to demonstrate the application of PixelGAN autoencoders in downstream tasks such as semi-supervised learning, we discussed how these architectures have many other potential applications, such as learning cross-domain relations between two different domains.

Acknowledgments

We would like to thank Nathan Killoran for helpful discussions.
We also thank NVIDIA for GPU donations.

## References

[1] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[2] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.
[3] Ferenc Huszár. Variational inference using implicit distributions. arXiv preprint arXiv:1702.08235, 2017.
[4] Dustin Tran, Rajesh Ranganath, and David M. Blei. Deep and hierarchical implicit models. arXiv preprint arXiv:1702.08896, 2017.
[5] Rajesh Ranganath, Dustin Tran, Jaan Altosaar, and David Blei. Operator variational inference. In Advances in Neural Information Processing Systems, pages 496–504, 2016.
[6] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
[7] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. arXiv preprint arXiv:1701.04722, 2017.
[8] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
[9] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
[10] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR), 2014.
[11] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. International Conference on Machine Learning, 2014.
[12] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
[13] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.
[14] Matthew D. Hoffman and Matthew J. Johnson. ELBO surgery: yet another way to carve up the variational evidence lower bound. In NIPS 2016 Workshop on Advances in Approximate Bayesian Inference, 2016.
[15] David Barber and Felix V. Agakov. The IM algorithm: A variational approach to information maximization. In NIPS, pages 201–208, 2003.
[16] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
[17] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
[18] Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.
[19] Ferenc Huszár. Is maximum likelihood useful for representation learning? http://www.inference.vc/maximum-likelihood-for-representation-learning-2.
[20] Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, and Aaron Courville. PixelVAE: A latent variable model for natural images. arXiv preprint arXiv:1611.05013, 2016.
[21] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.
[22] Jost Tobias Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390, 2015.
[23] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
[24] Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
[25] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing with virtual adversarial training. stat, 1050:25, 2015.
[26] Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.
[27] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, pages 3532–3540, 2015.
[28] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2226–2234, 2016.
[29] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jungkwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.
[30] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
[31] Ilya Sutskever, Rafal Jozefowicz, Karol Gregor, Danilo Rezende, Tim Lillicrap, and Oriol Vinyals. Towards principled unsupervised learning. arXiv preprint arXiv:1511.06440, 2015.
[32] Antonio Valerio Miceli Barone. Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders. arXiv preprint arXiv:1608.02996, 2016.
[33] Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. Adversarial training for unsupervised bilingual lexicon induction.
[34] Daniel Jiwoong Im, Sungjin Ahn, Roland Memisevic, and Yoshua Bengio. Denoising criterion for variational auto-encoding framework. arXiv preprint arXiv:1511.06406, 2015.
[35] Casper Kaae Sønderby, Jose Caballero, Lucas Theis, Wenzhe Shi, and Ferenc Huszár. Amortised MAP inference for image super-resolution. arXiv preprint arXiv:1610.04490, 2016.
[36] Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. arXiv preprint arXiv:1610.00527, 2016.
[37] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
Software available from tensorflow.org.
[38] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
[39] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[40] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.