# Learning Disentangled Joint Continuous and Discrete Representations

Emilien Dupont
Schlumberger Software Technology Innovation Center
Menlo Park, CA, USA
dupont@slb.com

## Abstract

We present a framework for learning disentangled and interpretable jointly continuous and discrete representations in an unsupervised manner. By augmenting the continuous latent distribution of variational autoencoders with a relaxed discrete distribution and controlling the amount of information encoded in each latent unit, we show how continuous and categorical factors of variation can be discovered automatically from data. Experiments show that the framework disentangles continuous and discrete generative factors on various datasets and outperforms current disentangling methods when a discrete generative factor is prominent.

## 1 Introduction

Disentangled representations are defined as ones where a change in a single unit of the representation corresponds to a change in a single factor of variation of the data while being invariant to others (Bengio et al. (2013)). For example, a disentangled representation of 3D objects could contain a set of units each corresponding to a distinct generative factor such as position, color or scale. Most recent work on learning disentangled representations has focused on modeling continuous factors of variation (Higgins et al. (2016); Kim & Mnih (2018); Chen et al. (2018)). However, a large number of datasets contain inherently discrete generative factors which can be difficult to capture with these methods. In image data, for example, distinct objects or entities would most naturally be represented by discrete variables, while their position or scale might be represented by continuous variables.

Several machine learning tasks, including transfer learning and zero-shot learning, can benefit from disentangled representations (Lake et al. (2017)). Disentangled representations have also been applied to reinforcement learning (Higgins et al. (2017a)) and to learning visual concepts (Higgins et al. (2017b)). Further, in contrast to most representation learning algorithms, disentangled representations are often interpretable since they align with factors of variation of the data.

Different approaches have been explored for semi-supervised or supervised learning of factored representations (Kulkarni et al. (2015); Whitney et al. (2016); Yang et al. (2015); Reed et al. (2014)). These approaches achieve impressive results but either require knowledge of the underlying generative factors or other forms of weak supervision. Several methods also exist for unsupervised disentanglement, with the two most prominent being InfoGAN and β-VAE (Chen et al. (2016); Higgins et al. (2016)). These frameworks have shown promise in disentangling factors of variation in an unsupervised manner on a number of datasets.

InfoGAN (Chen et al. (2016)) is a framework based on Generative Adversarial Networks (Goodfellow et al. (2014)) which disentangles generative factors by maximizing the mutual information between a subset of latent variables and the generated samples. While this approach is able to model both discrete and continuous factors, it suffers from some of the shortcomings of Generative Adversarial Networks (GANs), such as unstable training and reduced sample diversity. Recent improvements in the training of GANs (Arjovsky et al. (2017); Gulrajani et al. (2017)) have mitigated some of these issues, but stable GAN training still remains a challenge (and this is particularly true for InfoGAN, as shown in Kim & Mnih (2018)).
β-VAE (Higgins et al. (2016)), in contrast, is based on Variational Autoencoders (Kingma & Welling (2013); Rezende et al. (2014)) and is stable to train. β-VAE, however, can only model continuous latent variables.

In this paper we propose a framework, based on Variational Autoencoders (VAEs), that learns disentangled continuous and discrete representations in an unsupervised manner. It comes with the advantages of VAEs, such as stable training, large sample diversity and a principled inference network, while having the flexibility to model a combination of continuous and discrete generative factors. We show how our framework, which we term JointVAE, discovers independent factors of variation on MNIST, Fashion-MNIST (Xiao et al. (2017)), CelebA (Liu et al. (2015)) and Chairs (Aubry et al. (2014)). For example, on MNIST, JointVAE disentangles digit type (discrete) from slant, width and stroke thickness (continuous). In addition, the model's learned inference network can infer various properties of data, such as the azimuth of a chair, in an unsupervised manner. The model can also be used for simple image editing, such as rotating a face in an image.

## 2 Analysis of β-VAE

We derive our approach by modifying the β-VAE framework and augmenting it with a joint latent distribution. β-VAEs model a joint distribution of the data x and a set of latent variables z and learn continuous disentangled representations by maximizing the objective

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta D_{KL}(q_\phi(z|x) \,\|\, p(z)) \tag{1}$$

where the posterior or encoder qφ(z|x) is a neural network with parameters φ mapping x into z, the likelihood or decoder pθ(x|z) is a neural network with parameters θ mapping z into x, and β is a positive constant. The loss is a weighted sum of a likelihood term, E_{qφ(z|x)}[log pθ(x|z)], which encourages the model to encode the data x into a set of latent variables z that can efficiently reconstruct the data, and a second term that encourages the distribution of the inferred latents z to be close to some prior p(z). When β = 1, this corresponds to the original VAE framework. However, when β > 1, it is theorized that the increased pressure on the posterior qφ(z|x) to match the prior p(z), combined with maximizing the likelihood term, gives rise to efficient and disentangled representations of the data (Higgins et al. (2016); Burgess et al. (2017)).

We can derive further insights by analyzing the role of the KL divergence term in the objective (1). During training, the objective will be optimized in expectation over the data x. The KL term then becomes (Makhzani & Frey (2017); Kim & Mnih (2018))

$$\mathbb{E}_{p(x)}[D_{KL}(q_\phi(z|x) \,\|\, p(z))] = I(x; z) + D_{KL}(q(z) \,\|\, p(z)) \geq I(x; z) \tag{2}$$

i.e., when taken in expectation over the data, the KL divergence term is an upper bound on the mutual information between the latents and the data (see appendix for proof and details). Thus, a mini-batch estimate of the mean KL divergence is an estimate of the upper bound on the information z can transmit about x. Penalizing the mutual information term improves disentanglement but comes at the cost of increased reconstruction error.
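To make equation (2) concrete, the quantity that is monitored in practice is the analytic KL divergence between the diagonal Gaussian posterior and the standard normal prior, averaged over a mini-batch. The following is a minimal PyTorch-style sketch; the tensor names and shapes are illustrative assumptions, not the released implementation.

```python
import torch

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions.

    mu, logvar: tensors of shape (batch_size, latent_dim) produced by the encoder.
    Returns a tensor of shape (batch_size,).
    """
    # Analytic KL divergence of a diagonal Gaussian from a unit Gaussian prior.
    kl_per_dim = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0)
    return kl_per_dim.sum(dim=1)

# A mini-batch estimate of the upper bound on I(x; z) from equation (2):
mu = torch.randn(64, 10)        # stand-ins for encoder outputs on a batch of 64 images
logvar = torch.randn(64, 10)
mi_upper_bound = gaussian_kl(mu, logvar).mean()   # approximates E_p(x)[ D_KL(q(z|x) || p(z)) ]
```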
Recently, several methods have been explored to improve the reconstruction quality without decreasing disentanglement (Burgess et al. (2017); Kim & Mnih (2018); Chen et al. (2018); Gao et al. (2018)). Burgess et al. (2017) in particular propose an objective where the upper bound on the mutual information is controlled and gradually increased during training. Denoting the controlled information capacity by C, the objective is defined as

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \gamma \, |D_{KL}(q_\phi(z|x) \,\|\, p(z)) - C| \tag{3}$$

where γ is a constant which forces the KL divergence term to match the capacity C. Gradually increasing C during training allows for control of the amount of information the model can encode. This objective has been shown to improve reconstruction quality as compared to (1) without reducing disentanglement (Burgess et al. (2017)).

## 3 JointVAE Model

We propose a modification to the β-VAE framework which allows us to model a joint distribution of continuous and discrete latent variables. Letting z denote a set of continuous latent variables and c denote a set of categorical or discrete latent variables, we define a joint posterior qφ(z, c|x), prior p(z, c) and likelihood pθ(x|z, c). The β-VAE objective then becomes

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z,c|x)}[\log p_\theta(x|z, c)] - \beta D_{KL}(q_\phi(z, c|x) \,\|\, p(z, c)) \tag{4}$$

where the latent distribution is now jointly continuous and discrete. Assuming the continuous and discrete latent variables are conditionally independent¹, i.e. qφ(z, c|x) = qφ(z|x)qφ(c|x), and similarly for the prior, p(z, c) = p(z)p(c), we can rewrite the KL divergence as

$$D_{KL}(q_\phi(z, c|x) \,\|\, p(z, c)) = D_{KL}(q_\phi(z|x) \,\|\, p(z)) + D_{KL}(q_\phi(c|x) \,\|\, p(c)) \tag{5}$$

i.e. we can separate the discrete and continuous KL divergence terms (see appendix for proof). Under this assumption, the loss becomes

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z,c|x)}[\log p_\theta(x|z, c)] - \beta D_{KL}(q_\phi(z|x) \,\|\, p(z)) - \beta D_{KL}(q_\phi(c|x) \,\|\, p(c)) \tag{6}$$

In our initial experiments, we found that directly optimizing this loss led to the model ignoring the discrete latent variables. Similarly, gradually increasing the channel capacity as in equation (3) leads to the model assigning all capacity to the continuous channels. To overcome this, we split the capacity increase: the capacities of the discrete and continuous latent channels are controlled separately, forcing the model to encode information in both the discrete and continuous channels. The final loss is then given by

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z,c|x)}[\log p_\theta(x|z, c)] - \gamma \, |D_{KL}(q_\phi(z|x) \,\|\, p(z)) - C_z| - \gamma \, |D_{KL}(q_\phi(c|x) \,\|\, p(c)) - C_c| \tag{7}$$

where Cz and Cc are gradually increased during training.

¹ β-VAE assumes the data is generated by a fixed number of independent factors of variation, so all latent variables are in fact conditionally independent. However, for the sake of deriving the JointVAE objective we only require conditional independence between the continuous and discrete latents.

### 3.1 Parametrization of continuous latent variables

As in the original VAE framework, we parametrize qφ(z|x) by a factorised Gaussian, i.e. qφ(z|x) = ∏ᵢ qφ(zᵢ|x) where qφ(zᵢ|x) = N(µᵢ, σᵢ²), and let the prior be a unit Gaussian p(z) = N(0, I). µ and σ² are both parametrized by neural networks.

### 3.2 Parametrization of discrete latent variables

Parametrizing qφ(c|x) is more difficult. Since qφ(c|x) needs to be differentiable with respect to its parameters, we cannot parametrize qφ(c|x) by a set of categorical distributions. Recently, Maddison et al. (2016) and Jang et al. (2016) proposed a differentiable relaxation of discrete random variables based on the Gumbel-Max trick (Gumbel (1954)). If c is a categorical variable with class probabilities α₁, α₂, ..., αₙ, then we can sample from a continuous approximation of the categorical distribution by sampling a set of gₖ ∼ Gumbel(0, 1) i.i.d. and applying the following transformation

$$y_k = \frac{\exp((\log \alpha_k + g_k)/\tau)}{\sum_i \exp((\log \alpha_i + g_i)/\tau)} \tag{8}$$

where τ is a temperature parameter which controls the relaxation. The sample y is a continuous approximation of the one-hot representation of c. The relaxed discrete distribution is called a Concrete or Gumbel-Softmax distribution and is denoted by g(α), where α is a vector of class probabilities. We can parametrize qφ(c|x) by a product of independent Gumbel-Softmax distributions, qφ(c|x) = ∏ᵢ qφ(cᵢ|x) where qφ(cᵢ|x) = g(α⁽ⁱ⁾) is a Gumbel-Softmax distribution with class probabilities α⁽ⁱ⁾. We let the prior p(c) be equal to a product of uniform Gumbel-Softmax distributions. This approach allows us to use the reparametrization trick (Kingma & Welling (2013); Rezende et al. (2014)) and efficiently train the discrete model.
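As a concrete illustration of equation (8), a Gumbel-Softmax sample can be computed in a few lines. This is a minimal sketch assuming unnormalized logits (proportional to log α) rather than any particular parametrization; PyTorch also ships an equivalent built-in, `torch.nn.functional.gumbel_softmax`.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=0.67, eps=1e-20):
    """Draw a relaxed one-hot sample from a categorical distribution, as in equation (8).

    logits: tensor of shape (batch_size, num_categories), proportional to log alpha.
    tau: temperature; lower values give samples closer to one-hot vectors.
    """
    # g_k ~ Gumbel(0, 1), via inverse transform sampling of uniform noise.
    uniforms = torch.rand_like(logits)
    gumbels = -torch.log(-torch.log(uniforms + eps) + eps)
    # Softmax of (log alpha + g) / tau, i.e. the relaxation in equation (8).
    return F.softmax((logits + gumbels) / tau, dim=-1)

# Example: five relaxed samples from a 10-way categorical distribution.
samples = gumbel_softmax_sample(torch.zeros(5, 10), tau=0.67)
```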
### 3.3 Architecture

The final architecture of the JointVAE model is shown in Fig. 1. We build the encoder to output the parameters of the continuous distribution, µ and σ², and of each of the discrete distributions, α⁽ⁱ⁾. We then sample zᵢ ∼ N(µᵢ, σᵢ²) and cᵢ ∼ g(α⁽ⁱ⁾) using the reparametrization trick and concatenate z and c into one latent vector which is passed as input to the decoder.

Figure 1: JointVAE architecture. The input x is encoded by qφ into the parameters of the latent distributions. Samples are drawn from each of the latent distributions using the reparametrization trick (indicated by dashed arrows in the diagram). The samples are then concatenated and decoded through pθ.

### 3.4 Choice and sensitivity of hyperparameters

The JointVAE loss in equation (7) depends on the hyperparameters γ, Cc and Cz. While the choice of these is ultimately empirical, there are various heuristics we can use to narrow the search. The value of γ, for example, is chosen to be large enough to maintain the capacity at the desired level (e.g. large improvements in reconstruction error should not come at the cost of breaking the capacity constraint). We found the model to be quite robust to changes in γ. As the capacity of a discrete channel is bounded, Cc is chosen to be the maximum capacity of the channel, encouraging the model to use all categories of the discrete distribution. Cz is more difficult to choose and is often chosen by experiment to be the largest value at which the representation is still disentangled (in a similar way that β is chosen as the lowest value at which the representation is still disentangled in β-VAE).
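To summarize Sections 3.3 and 3.4, the sketch below puts equation (7) and the capacity schedule together. It is a simplified illustration under stated assumptions, not the released implementation: the linear capacity schedule, the categorical KL helper and all hyperparameter values here are placeholders for the example.

```python
import torch
import torch.nn.functional as F

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return (0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0)).sum(dim=1)

def categorical_kl(logits):
    """KL between q(c|x) = softmax(logits) and a uniform prior over K categories."""
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    log_k = torch.log(torch.tensor(float(logits.shape[-1])))
    return (probs * (log_probs + log_k)).sum(dim=-1)

def linear_capacity(step, max_capacity, num_steps):
    """Linearly increase a capacity from 0 to max_capacity over num_steps iterations."""
    return min(max_capacity, max_capacity * step / num_steps)

def joint_vae_loss(x, recon_x, mu, logvar, disc_logits, step,
                   gamma=30.0, cz_max=5.0, cc_max=2.3, num_steps=25000):
    """Capacity-controlled JointVAE objective of equation (7), negated for minimization.

    The hyperparameter values are placeholders for illustration; the values actually used
    for each dataset are listed in the paper's appendix.
    """
    recon = F.binary_cross_entropy(recon_x, x, reduction='sum') / x.shape[0]
    kl_cont = gaussian_kl(mu, logvar).mean()        # continuous KL term of equation (7)
    kl_disc = categorical_kl(disc_logits).mean()    # discrete KL term against a uniform prior
    cz = linear_capacity(step, cz_max, num_steps)   # C_z, gradually increased during training
    cc = linear_capacity(step, cc_max, num_steps)   # C_c, bounded by log(K), about 2.3 nats for K = 10
    return recon + gamma * (kl_cont - cz).abs() + gamma * (kl_disc - cc).abs()
```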
## 4 Experiments

We perform experiments on several datasets including MNIST, Fashion-MNIST, CelebA and Chairs. We parametrize the encoder by a convolutional neural network and the decoder by the same network, transposed (for the full architecture and training details see appendix). The code, along with all experiments and trained models presented in this paper, is available at https://github.com/Schlumberger/joint-vae.

Disentanglement results and latent traversals for MNIST are shown in Fig. 2. The model was trained with 10 continuous latent variables and one discrete 10-dimensional latent variable. The model discovers several factors of variation in the data, such as digit type (discrete) and stroke thickness, angle and width (continuous), in an unsupervised manner. As can be seen from the latent traversals in Fig. 2, the trained model is able to generate realistic samples for a large variety of latent settings. Fig. 4a shows digits generated by fixing the discrete latent and sampling the continuous latents from the prior p(z) = N(0, I), which can be interpreted as sampling from a distribution conditioned on digit type. As can be seen, the samples are diverse, realistic and honor the conditioning.

For a large range of hyperparameters we were not able to achieve disentanglement using the purely continuous β-VAE framework (see Fig. 3). This is likely because MNIST has an inherently discrete generative factor (digit type), which β-VAE is unable to map onto a continuous latent variable. In contrast, the JointVAE approach allows us to disentangle the discrete factors while maintaining disentanglement of continuous factors. To the best of our knowledge, JointVAE is, apart from InfoGAN, the only framework which disentangles MNIST in a completely unsupervised manner, and it does so in a more stable way than InfoGAN.

Figure 2: Latent traversals of the model trained on MNIST with 10 continuous latent variables and 1 discrete latent variable. Each row corresponds to a fixed random setting of the latent variables and the columns correspond to varying a single latent unit. Each subfigure varies a different latent unit: (a) angle (continuous), (b) thickness (continuous), (c) digit type (discrete), (d) width (continuous). As can be seen, each of the varied latent units corresponds to an interpretable generative factor, such as stroke thickness or digit type.

Figure 3: Traversals of all latent dimensions on MNIST for JointVAE, β-VAE and β-VAE with controlled capacity increase (CCβ-VAE). JointVAE is able to disentangle digit type from continuous factors of variation like stroke thickness and angle, while digit type is entangled with continuous factors for both β-VAE and CCβ-VAE.

Figure 4: (a) Samples conditioned on digit type. Each row shows samples from pθ where the discrete latent variable is fixed and all other latent values are sampled from the prior. As can be seen, each row produces diverse samples of each digit. Note that digits which are similar, such as 5 and 8, are sometimes confused and not perfectly disentangled. (b) Samples conditioned on fashion item type. The samples are diverse and largely disentangled. (c) Latent traversals of the Fashion-MNIST model. The rows correspond to different settings of the discrete latent variable, while the columns correspond to a traversal of the most informative continuous latent variable. Various factors are discovered, such as sleeve length, bag handle size, ankle height and shoe opening.

**Fashion-MNIST** Latent traversals for Fashion-MNIST are shown in Fig. 4c. We also used 10 continuous and 1 discrete latent variable for this dataset. Fashion-MNIST is harder to disentangle as the generative factors for creating clothes are not as clear as the ones for drawing digits. However, JointVAE performs well and discovers interesting dimensions, such as sleeve length, heel size and shirt color. As some of the classes of Fashion-MNIST are very similar (e.g. shirt and t-shirt are two different classes), not all classes are discovered. However, a significant number of them are disentangled, including dress, t-shirt, trousers, sneakers, bag, ankle boot and so on (see Fig. 4b).
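The class-conditional samples in Fig. 4a and 4b are generated by fixing the discrete latent to a one-hot vector and drawing the continuous latents from the prior, as described above. A minimal sketch of this procedure is shown below; the decoder interface and latent sizes are illustrative assumptions.

```python
import torch

@torch.no_grad()
def sample_conditioned_on_class(decoder, class_index, num_samples=8,
                                cont_dim=10, num_classes=10):
    """Decode samples with the discrete latent fixed to a single category (as in Fig. 4a/4b).

    decoder: any module mapping a (batch, cont_dim + num_classes) latent vector to images.
    class_index: the category of the discrete latent variable to condition on.
    """
    z = torch.randn(num_samples, cont_dim)     # continuous latents drawn from p(z) = N(0, I)
    c = torch.zeros(num_samples, num_classes)
    c[:, class_index] = 1.0                    # discrete latent fixed to a one-hot vector
    latent = torch.cat([z, c], dim=1)          # concatenated latent vector, as in Fig. 1
    return decoder(latent)
```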
For CelebA we used a model with 32 continuous latent variables and one 10-dimensional discrete latent variable. As shown in Fig. 5, the JointVAE model discovers various factors of variation, including azimuth, age and background color, while being able to generate realistic samples. Different settings of the discrete variable correspond to different facial identities. While the samples are not as sharp as those produced by entangled models, we can still see details in the images such as distinct facial features and skin tones (the trade-off between disentanglement and reconstruction quality is a well known problem which is discussed in Higgins et al. (2016); Burgess et al. (2017); Kim & Mnih (2018); Chen et al. (2018)).

For the Chairs dataset we used a model with 32 continuous latent variables and 3 binary discrete latent variables. JointVAE discovers several factors of variation such as chair rotation, width and leg style. Furthermore, different settings of the discrete variables correspond to different chair types and colors.

Figure 5: Latent traversals of the model trained on CelebA: (a) azimuth, (b) background color. Each row corresponds to a fixed setting of the discrete latent variable and the columns correspond to varying a single continuous latent unit.

| Model | Score |
|---|---|
| β-VAE | 0.73 |
| FactorVAE | 0.82 |
| JointVAE | 0.69 |

Figure 6: Left: Disentanglement scores for various frameworks on the dSprites dataset. The scores are obtained by averaging scores over 10 different random seeds from the model with the best hyperparameters (removing outliers where the model collapsed to the mean). Right: Comparison of latent traversals on the dSprites dataset. There are 4 continuous factors and 1 discrete factor in the original dataset and only JointVAE is able to encode all information into 4 continuous and 1 discrete latent variables. Note that the final row of the JointVAE latent traversal corresponds to the discrete factor of dimension 3, which is why the patterns repeat with a period of 3.

While there is a well defined discrete generative factor for datasets like MNIST and Fashion-MNIST, it is less clear what exactly would constitute a discrete factor of variation in datasets like CelebA and Chairs. For example, for CelebA, JointVAE maps various facial identities onto the discrete latent variable. However, facial identity is not necessarily discrete and it is possible that such a factor of variation could also be mapped to a continuous latent variable. JointVAE has a clear advantage in disentangling datasets where discrete factors are prominent (as shown in Fig. 3), but when this is not the case, using frameworks that only disentangle continuous factors may be sufficient.

### 4.1 Quantitative evaluation

We quantitatively evaluate our model on the dSprites dataset using the metric recently proposed by Kim & Mnih (2018). Since the dataset is generated from 1 discrete factor (with 3 categories) and 4 continuous factors, we used a model with 6 continuous latent variables and one 3-dimensional discrete latent variable. The results are shown in Fig. 6 (left). Even though the discrete factor in this dataset is not prominent (in the sense that the different categories have very small differences in pixel space), our model is able to achieve scores close to the current best models. Further, as shown in Fig. 6, our model learns meaningful latent representations. In particular, for the discrete factor of variation, JointVAE is able to better separate the classes than other models.

### 4.2 Detecting disentanglement in latent distributions

As noted in Section 2, taken in expectation over the data, the KL divergence between the inferred latents qφ(z, c|x) and the priors upper bounds the mutual information between the latent units and the data. Motivated by this, we can plot the KL divergence values for each latent unit, averaged over a mini-batch of data, during training. As various factors of variation are discovered in the data, we would expect the KL divergence of the corresponding latent units to increase. This is shown in Fig. 7a. As the capacities Cz and Cc are increased, the model is able to encode more and more factors of variation. For MNIST, the first factor to be discovered is digit type, followed by angle and width. This is likely because encoding digit type results in the largest reduction in reconstruction error, followed by encoding angle, width and so on.

After training, we can also measure the KL divergence of each latent unit on test data and rank the latent units by their average KL values. This corresponds to ranking the latent units by how much information they transmit about x. Fig. 7b shows the ranked latent units for MNIST and Chairs along with a latent traversal of each unit. As can be seen, the latent units with large information content encode various aspects of the data, while latent units with approximately zero KL divergence do not affect the output.

Figure 7: (a) Increase of KL divergence during training. As the latent channel capacity is increased, different factors of variation are discovered. Most of the latent units have a KL divergence of approximately zero throughout training, meaning they do not encode any information about the data. As training progresses, however, some latent units start encoding more information about the data. Each latent unit can then be matched to a factor of variation of the data by visual inspection. (b, c) Each row corresponds to a latent traversal of a single latent unit. The column on the left shows the mean KL divergence value over a large number of examples (which corresponds to the amount of information encoded in that latent unit, in nats). The rows are ordered from the latent unit with the largest KL divergence to the lowest. As can be seen, large KL divergence values correspond to active latents which encode information about the data, whereas channels with low KL divergence values do not affect the data.
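The per-unit KL values behind plots and rankings like those in Fig. 7 can be computed along the following lines from the encoder outputs. This is a sketch under the same assumptions as the earlier snippets (diagonal Gaussian continuous latents, a single Gumbel-Softmax discrete latent) and is illustrative rather than the released code.

```python
import torch

def per_unit_gaussian_kl(mu, logvar):
    """KL against N(0, 1) for each continuous latent unit, averaged over a batch.

    mu, logvar: (batch_size, cont_dim) encoder outputs.
    Returns a (cont_dim,) tensor of per-unit KL values, in nats.
    """
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0)
    return kl.mean(dim=0)

# Example ranking of continuous latent units by information content, as in Fig. 7b.
# The discrete KL (computed against a uniform prior, as in the earlier sketch) can be
# appended as one additional "unit" before ranking.
mu, logvar = torch.randn(256, 10), torch.randn(256, 10)   # stand-ins for encoder outputs
unit_kl = per_unit_gaussian_kl(mu, logvar)
ranked_units = torch.argsort(unit_kl, descending=True)    # most informative units first
```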
### 4.3 The inference network

One of the advantages of JointVAE is that it comes with an inference network qφ(z, c|x). For example, on MNIST we can infer the digit type on test data with 88.7% accuracy by simply looking at the value of the discrete latent variable qφ(c|x). Of course, this is completely unsupervised, and the accuracy could likely be increased dramatically by using some label information.

Since we are learning several generative factors, the inference network can also be used to infer properties for which we do not have labels. For example, the latent unit corresponding to azimuth on the Chairs dataset correlates well with the actual azimuth of unseen chairs. After training a model on the Chairs dataset and identifying the latent unit corresponding to azimuth, we can test the inference network on images that were not used during training. This is shown in Fig. 8a. As can be seen, the latent unit corresponding to rotation infers the angle of the chair even though no labeled data was given (or available) for this task.

The framework can also be used to perform image editing or manipulation. If we wish to rotate the image of a face, we can encode the face with qφ, modify the latent corresponding to azimuth and decode the resulting vector with pθ. Examples of this are shown in Fig. 8b.
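The encode, modify, decode procedure described above can be expressed in a few lines. The sketch below is illustrative only: the encoder and decoder interfaces and the index of the azimuth unit are assumptions, since identifying that unit is done by visual inspection as described in Section 4.2.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def edit_latent_factor(encoder, decoder, images, unit_index, new_value):
    """Edit one generative factor of an image by manipulating its latent code.

    encoder: assumed to return (mu, logvar, disc_logits) for a batch of images.
    decoder: maps a concatenated (z, c) latent vector back to images.
    unit_index: index of the continuous latent unit to modify, e.g. the azimuth unit
                identified beforehand by inspecting latent traversals.
    """
    mu, logvar, disc_logits = encoder(images)
    z = mu.clone()                                       # use the posterior mean as the code
    c = F.one_hot(disc_logits.argmax(dim=-1),
                  num_classes=disc_logits.shape[-1]).float()
    z[:, unit_index] = new_value                         # overwrite the chosen factor
    return decoder(torch.cat([z, c], dim=1))             # decode the edited latent
```

In the same spirit, the digit-type accuracy quoted above can be read off from `disc_logits.argmax(dim=-1)`, presumably after matching each discrete category to its most frequent digit label.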
### 4.4 Robustness and sensitivity to hyperparameters

While our framework is robust with respect to different architectures and optimizers, it is, like most frameworks for unsupervised disentanglement, fairly sensitive to the choice of hyperparameters (all hyperparameters needed to reproduce the results in this paper are given in the appendix). Even with a good choice of hyperparameters, the quality of disentanglement may vary based on the random seed. In general, it is easy to achieve some degree of disentanglement for a large set of hyperparameters, but achieving completely clean disentanglement (e.g. perfectly separating digit type from other generative factors) can be difficult. It would be interesting to explore more principled approaches for choosing the latent capacities and how to increase them, but we leave this for future work. Further, as mentioned in Section 4, when a discrete generative factor is not present or important, the framework may fail to learn meaningful discrete representations. We have included some failure examples in Fig. 9.

Figure 8: (a) Inference of azimuth on test data of chairs. The first row shows images of chairs from a test set. The second row shows the inferred z for each of the images. As can be seen, the latent unit successfully identifies rotation. (b) Image editing with JointVAE. An image of a celebrity is encoded with qφ. In the encoded space, we can then rotate the face, change the background color or change the hair style by manipulating the latent unit corresponding to each factor. The bottom rows show the decoded images when each latent factor is changed. The samples are not as sharp as the original image, but these initial results show promise for using disentangled representations to edit images.

Figure 9: Failure examples. (a) Background color is entangled with azimuth and hair length. (b) Various clothing items are entangled with each other.

## 5 Conclusion

We have proposed JointVAE, a framework for learning disentangled continuous and discrete representations in an unsupervised manner. The framework comes with the advantages of VAEs, such as stable training and large sample diversity, while being able to model complex jointly continuous and discrete generative factors. We have shown that JointVAE disentangles factors of variation on several datasets while producing realistic samples. In addition, the inference network can be used to infer unlabeled quantities on test data and to edit and manipulate images.

In future work, it would be interesting to combine our approach with recent improvements of the β-VAE framework, such as FactorVAE (Kim & Mnih (2018)) or β-TCVAE (Chen et al. (2018)). Gaining a deeper understanding of how disentanglement depends on the latent channel capacities and how they are increased will likely provide insights to build more stable models. Finally, it would also be interesting to explore the use of other latent distributions, since the framework allows the use of any joint distribution of reparametrizable random variables.

## Acknowledgments

The author would like to thank Erik Burton, José Celaya, Suhas Suresha, Vishakh Hegde and the anonymous reviewers for helpful suggestions and comments that helped improve the paper.

## References

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

Mathieu Aubry, Daniel Maturana, Alexei A Efros, Bryan C Russell, and Josef Sivic. Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3762–3769, 2014.
Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

Christopher Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in beta-VAE. NIPS 2017 Disentanglement Workshop, 2017.

Tian Qi Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. arXiv preprint arXiv:1802.04942, 2018.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180, 2016.

Shuyang Gao, Rob Brekelmans, Greg Ver Steeg, and Aram Galstyan. Auto-encoding total correlation explanation. arXiv preprint arXiv:1802.05822, 2018.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5769–5779, 2017.

Emil Julius Gumbel. Statistical theory of extreme values and some practical applications. Nat. Bur. Standards Appl. Math. Ser. 33, 1954.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. ICLR 2017, 2016.

Irina Higgins, Arka Pal, Andrei A Rusu, Loic Matthey, Christopher P Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. DARLA: Improving zero-shot transfer in reinforcement learning. arXiv preprint arXiv:1707.08475, 2017a.

Irina Higgins, Nicolas Sonnerat, Loic Matthey, Arka Pal, Christopher P Burgess, Matthew Botvinick, Demis Hassabis, and Alexander Lerchner. SCAN: Learning abstract hierarchical compositional visual concepts. arXiv preprint arXiv:1707.03389, 2017b.

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.

Hyunjik Kim and Andriy Mnih. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pp. 2539–2547, 2015.

Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.

Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The Concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.

Alireza Makhzani and Brendan J Frey. PixelGAN autoencoders. In Advances in Neural Information Processing Systems, pp. 1972–1982, 2017.
Scott Reed, Kihyuk Sohn, Yuting Zhang, and Honglak Lee. Learning to disentangle factors of variation with manifold interaction. In International Conference on Machine Learning, pp. 1431–1439, 2014.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

William F Whitney, Michael Chang, Tejas Kulkarni, and Joshua B Tenenbaum. Understanding visual concepts with continuation learning. arXiv preprint arXiv:1602.06822, 2016.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Jimei Yang, Scott E Reed, Ming-Hsuan Yang, and Honglak Lee. Weakly-supervised disentangling with recurrent transformations for 3D view synthesis. In Advances in Neural Information Processing Systems, pp. 1099–1107, 2015.