Published as a conference paper at ICLR 2023

DISENTANGLING LEARNING REPRESENTATIONS WITH DENSITY ESTIMATION

Eric Yeats¹, Frank Liu², Hai Li¹
¹Department of Electrical and Computer Engineering, Duke University
²Computer Science and Mathematics Division, Oak Ridge National Laboratory
{eric.yeats, hai.li}@duke.edu, liufy@ornl.gov

ABSTRACT

Disentangled learning representations have promising utility in many applications, but they currently suffer from serious reliability issues. We present Gaussian Channel Autoencoder (GCAE), a method which achieves reliable disentanglement via flexible density estimation of the latent space. GCAE avoids the curse of dimensionality of density estimation by disentangling subsets of its latent space with the Dual Total Correlation (DTC) metric, thereby representing its high-dimensional latent joint distribution as a collection of many low-dimensional conditional distributions. In our experiments, GCAE achieves highly competitive and reliable disentanglement scores compared with state-of-the-art baselines.

1 INTRODUCTION

The notion of disentangled learning representations was introduced by Bengio et al. (2013): it is meant to be a robust approach to feature learning when trying to learn more about a distribution of data X or when downstream tasks for learned features are unknown. Since then, disentangled learning representations have proven extremely useful in applications such as natural language processing Jain et al. (2018), content and style separation John et al. (2018), drug discovery Polykovskiy et al. (2018); Du et al. (2020), fairness Sarhan et al. (2020), and more.

Density estimation of learned representations is an important ingredient of competitive disentanglement methods. Bengio et al. (2013) state that representations $z \in Z$ which are disentangled should maintain as much information of the input as possible while having components which are mutually invariant to one another. Mutual invariance motivates seeking representations of $Z$ which have independent components extracted from the data, necessitating some notion of $p_Z(z)$. Leading unsupervised disentanglement methods, namely β-VAE Higgins et al. (2016), FactorVAE Kim & Mnih (2018), and β-TCVAE Chen et al. (2018), all learn $p_Z(z)$ via the same variational Bayesian framework Kingma & Welling (2013), but they approach making $p_Z(z)$ independent from different angles. β-VAE indirectly promotes independence in $p_Z(z)$ by enforcing a low $D_{KL}$ between the representation and a factorized Gaussian prior, β-TCVAE encourages representations to have low Total Correlation (TC) via an ELBO decomposition and an importance-weighted sampling technique, and FactorVAE reduces TC with help from a monolithic neural network estimate. Other well-known unsupervised methods are Annealed β-VAE Burgess et al. (2018), which imposes careful relaxation of the information bottleneck through the VAE $D_{KL}$ term during training, and DIP-VAE I & II Kumar et al. (2017), which directly regularize the covariance of the learned representation. For a more in-depth description of related work, please see Appendix D.

While these VAE-based disentanglement methods have been the most successful in the field, Locatello et al. (2019) point out serious reliability issues shared by all.
In particular, increasing disentanglement pressure during training doesn't tend to lead to more independent representations, there currently aren't good unsupervised indicators of disentanglement, and no method consistently dominates the others across all datasets. Locatello et al. (2019) stress the need to find the right inductive biases in order for unsupervised disentanglement to truly deliver.

We seek to make disentanglement more reliable and high-performing by incorporating new inductive biases into our proposed method, Gaussian Channel Autoencoder (GCAE). We shall explain them in more detail in the following sections, but to summarize: GCAE avoids the challenge of representing high-dimensional $p_Z(z)$ via disentanglement with Dual Total Correlation (rather than TC), and the DTC criterion is augmented with a scale-dependent latent variable arbitration mechanism. This work makes the following contributions:

- Analysis of the TC and DTC metrics with regard to the curse of dimensionality, which motivates use of DTC and a new feature-stabilizing arbitration mechanism
- GCAE, a new form of noisy autoencoder (AE) inspired by the Gaussian Channel problem, which permits application of flexible density estimation methods in the latent space
- Experiments¹ which demonstrate competitive performance of GCAE against leading disentanglement baselines on multiple datasets using existing metrics

2 BACKGROUND AND INITIAL FINDINGS

To estimate $p_Z(z)$, we introduce a discriminator-based method which applies the density-ratio trick and the Radon-Nikodym theorem to estimate the density of samples from an unknown distribution. We shall demonstrate in this section the curse of dimensionality in density estimation and the necessity of representing $p_Z(z)$ as a collection of conditional distributions.

The optimal discriminator neural network introduced by Goodfellow et al. (2014a) satisfies:

$$D^* = \arg\max_{D(\cdot)} \; \mathbb{E}_{x_r \sim X_{real}}\left[\log D(x_r)\right] + \mathbb{E}_{x_f \sim X_{fake}}\left[\log\left(1 - D(x_f)\right)\right], \qquad D^*(x) = \frac{p_{real}(x)}{p_{real}(x) + p_{fake}(x)},$$

where $D(x)$ is a discriminator network trained to differentiate between real samples $x_r$ and fake samples $x_f$. Given the optimal discriminator $D^*(x)$, the density-ratio trick can be applied to yield $\frac{p_{real}(x)}{p_{fake}(x)} = \frac{D^*(x)}{1 - D^*(x)}$. Furthermore, the discriminator can be supplied conditioning variables to represent a ratio of conditional distributions Goodfellow et al. (2014b); Makhzani et al. (2015).

Consider the case where the real samples come from an unknown distribution $z \sim Z$ and the fake samples come from a known distribution $u \sim U$. Provided that both $p_Z(z)$ and $p_U(u)$ are finite and $p_U(u)$ is nonzero on the sample space of $p_Z(z)$, the optimal discriminator can be used to retrieve the unknown density $p_Z(z) = \frac{D^*(z)}{1 - D^*(z)}\, p_U(z)$. In our case, where $u$ is a uniformly distributed variable, this transfer of density through the optimal discriminator can be seen as an application of the Radon-Nikodym derivative of $p_Z(z)$ with reference to the Lebesgue measure. Throughout the rest of this work, we employ discriminators with uniform noise and the density-ratio trick in this way to recover unknown distributions.
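To make this concrete, the following is a minimal sketch (our own toy, not the paper's released code) of the density-ratio trick with a uniform reference: a small discriminator is trained to separate samples of an unknown distribution (here a standard Gaussian) from $\mathrm{Unif}(-4, 4)^m$ noise, and its output is converted into a density estimate of $p_Z$. The network shape and training schedule are illustrative assumptions.

```python
import torch
import torch.nn as nn

m, a, b = 2, -4.0, 4.0
disc = nn.Sequential(nn.Linear(m, 256), nn.SELU(), nn.Linear(256, 1), nn.Sigmoid())
opt = torch.optim.Adam(disc.parameters(), lr=2e-4)
bce = nn.BCELoss()

for step in range(2000):
    z = torch.randn(256, m)                   # "real" samples from the unknown distribution
    u = torch.empty(256, m).uniform_(a, b)    # "fake" samples from the known uniform reference
    loss = bce(disc(z), torch.ones(256, 1)) + bce(disc(u), torch.zeros(256, 1))
    opt.zero_grad()
    loss.backward()
    opt.step()

def estimate_density(x):
    # Radon-Nikodym transfer: p_Z(x) ~= D(x) / (1 - D(x)) * p_U(x), with p_U = (b - a)^-m on the support.
    with torch.no_grad():
        d = disc(x).clamp(1e-6, 1 - 1e-6)
    return d / (1.0 - d) / (b - a) ** m
```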
This technique can be employed to recover the probability density of an m-dimensional isotropic Gaussian distribution. While it works well in low dimensions (m ≤ 8), the method inevitably fails as m increases. Figure 1a depicts several experiments of increasing m in which the KL divergence between the true and estimated distributions is plotted against training iteration. When the number of data samples is finite and the dimension m exceeds a certain threshold, the probability of there being any uniform samples in the neighborhood of the Gaussian samples swiftly approaches zero, causing the density-ratio trick to fail. This is a well-known phenomenon called the curse of dimensionality of density estimation. In essence, as the dimensionality of a joint distribution increases, concentrated joint data quickly become isolated within an extremely large space. The limit of m ≈ 8 is consistent with the limits of other methods such as kernel density estimation (the Parzen-Rosenblatt window).

Fortunately, the same limitation does not apply to conditional distributions of many jointly distributed variables. Figure 1b depicts an experiment similar to the first in which m − 1 variables are independent Gaussian distributed, but the last variable $z_m$ follows the distribution $z_m \sim \mathcal{N}\!\left(\mu = (m-1)^{-\frac{1}{2}} \sum_{i=1}^{m-1} z_i,\ \sigma^2 = \tfrac{1}{m}\right)$ (i.e., the last variable is Gaussian distributed with its mean given by the scaled sum of observations of the other variables). The marginal distribution of each component is Gaussian, just like the previous example. While it takes more iterations to bring the KL divergence between the true and estimated conditional distributions to zero, it is not limited by the curse of dimensionality. Hence, we assert that conditional distributions can capture complex relationships between subsets of many jointly distributed variables while avoiding the curse of dimensionality.

Figure 1: Empirical KL divergence between the true and estimated distributions as training iteration and distribution dimensionality increase, for (a) joint distributions and (b) conditional distributions. Training parameters are kept the same between both experiments. We employ Monte-Carlo estimators of KL divergence, leading to transient negative values when KL is near zero.

¹ Code available at https://github.com/ericyeats/gcae-disentanglement
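For reference, the conditional toy distribution of Figure 1b can be generated in a few lines; this is our own sketch (the helper name is hypothetical). The conditional density $p(z_m \mid z_{\setminus m})$ would then be recovered by a discriminator that also receives $z_{\setminus m}$ as conditioning input, exactly as in the joint case above.

```python
import torch

def sample_conditional_gaussian(batch, m):
    # First m-1 components are independent standard Gaussians; the last component is
    # Gaussian with mean (m-1)^(-1/2) * sum of the others and variance 1/m.
    z_rest = torch.randn(batch, m - 1)
    mean = z_rest.sum(dim=1, keepdim=True) / (m - 1) ** 0.5
    z_m = mean + (1.0 / m) ** 0.5 * torch.randn(batch, 1)
    return torch.cat([z_rest, z_m], dim=1)

# A conditional discriminator D(u_m, z_rest) trained against Unif(-4, 4) noise on the last
# coordinate then recovers p(z_m | z_rest) via the same density-ratio conversion as above.
```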
3 METHODOLOGY

ANALYSIS OF DUAL TOTAL CORRELATION

Recent works encourage disentanglement of the latent space by reducing the Total Correlation (TC), either indirectly Higgins et al. (2016); Kumar et al. (2017) or explicitly Kim & Mnih (2018); Chen et al. (2018). TC is a metric of multivariate statistical independence that is non-negative and zero if and only if all elements of $z$ are independent.

$$TC(Z) = \mathbb{E}_{z}\left[\log \frac{p_Z(z)}{\prod_i p_{Z_i}(z_i)}\right] = \sum_i h(Z_i) - h(Z)$$

Locatello et al. (2019) evaluate many TC-based methods and conclude that minimizing their measures of TC during training often does not lead to VAE mean representations $\mu$ with low TC. We note that computing $TC(Z)$ requires knowledge of the joint distribution $p_Z(z)$, which can be very challenging to model in high dimensions. We hypothesize that the need for a model of $p_Z(z)$ is what leads to the observed reliability issues of these TC-based methods.

Consider another metric for multivariate statistical independence, Dual Total Correlation (DTC). Like TC, DTC is non-negative and zero if and only if all elements of $z$ are independent.

$$DTC(Z) = \mathbb{E}_{z}\left[\log \frac{\prod_i p_{Z_i}(z_i \mid z_{\setminus i})}{p_Z(z)}\right] = h(Z) - \sum_i h(Z_i \mid Z_{\setminus i})$$

We use $z_{\setminus i}$ to denote all elements of $z$ except the i-th element. At first glance, it appears that $DTC(Z)$ also requires knowledge of the joint density $p_Z(z)$. However, observe an equivalent form of DTC manipulated for the i-th variable:

$$DTC(Z) = h(Z) - h(Z_i \mid Z_{\setminus i}) - \sum_{j \neq i} h(Z_j \mid Z_{\setminus j}) = h(Z_{\setminus i}) - \sum_{j \neq i} h(Z_j \mid Z_{\setminus j}). \tag{1}$$

Here, the i-th variable only contributes to DTC through each set of conditioning variables $z_{\setminus j}$. Hence, when computing the derivative $\partial DTC(Z) / \partial z_i$, no representation of $p_Z(z)$ is required; only the conditional entropies $h(Z_j \mid Z_{\setminus j})$ are necessary. Hence, we observe that the curse of dimensionality can be avoided through gradient descent on the DTC metric, making it more attractive for disentanglement than TC.

However, while one only needs the conditional entropies to compute the gradient of DTC, the conditional entropies alone don't measure how close $z$ is to having independent elements. To overcome this, we define the summed information loss $\mathcal{L}_{\Sigma I}$:

$$\mathcal{L}_{\Sigma I}(Z) \triangleq \sum_i I(Z_i; Z_{\setminus i}) = \sum_i \left[h(Z_i) - h(Z_i \mid Z_{\setminus i})\right] + h(Z) - h(Z) = TC(Z) + DTC(Z). \tag{2}$$

If gradients of each $I(Z_i; Z_{\setminus i})$ are taken only with respect to $z_{\setminus i}$, then the gradients are equal to $\partial DTC(Z) / \partial z$, avoiding use of any derivatives of estimates of $p_Z(z)$. Furthermore, minimizing one metric is equivalent to minimizing the other: $DTC(Z) = 0 \Leftrightarrow TC(Z) = 0 \Leftrightarrow \mathcal{L}_{\Sigma I}(Z) = 0$. In our experiments, we estimate $h(Z_i)$ with batch estimates of $\mathbb{E}_{z_{\setminus i}}\, p_{Z_i}(z_i \mid z_{\setminus i})$, requiring no further hyperparameters. Details on the information functional implementation are available in Appendix A.1.

EXCESS ENTROPY POWER LOSS

We found it very helpful to stabilize disentangled features by attaching a feature-scale dependent term to each $I(Z_i; Z_{\setminus i})$. The entropy power of a latent variable $z_i$ is non-negative and grows analogously with the variance of $z_i$. Hence, we define the Excess Entropy Power loss:

$$\mathcal{L}_{EEP}(Z) \triangleq \frac{1}{2\pi e} \sum_i \left[ I(Z_i; Z_{\setminus i})\, e^{2 h(Z_i)} \right], \tag{3}$$

which weighs each component of the $\mathcal{L}_{\Sigma I}$ loss with the marginal entropy power of each i-th latent variable. Partial derivatives are taken with respect to the $z_{\setminus i}$ subset only, so the marginal entropy power only weighs each component. While $\nabla_\phi \mathcal{L}_{EEP} \neq \nabla_\phi \mathcal{L}_{\Sigma I}$ in most situations ($\phi$ is the set of encoder parameters), this inductive bias has been extremely helpful in consistently yielding high disentanglement scores. An ablation study with $\mathcal{L}_{EEP}$ can be found in Appendix C. The name Excess Entropy Power is inspired by DTC's alternative name, excess entropy.

GAUSSIAN CHANNEL AUTOENCODER

We propose Gaussian Channel Autoencoder (GCAE), composed of a coupled encoder $\phi : X \rightarrow Z_\phi$ and decoder $\psi : Z \rightarrow \hat{X}$, which extracts a representation of the data $x \in \mathbb{R}^n$ in the latent space $z \in \mathbb{R}^m$. We assume $m \ll n$, as is typical with autoencoder models. The output of the encoder has a bounded activation function, restricting $z_\phi \in (-3, 3)^m$ in our experiments. The latent space is subjected to Gaussian noise of the form $z = z_\phi + \nu_\sigma$, where $\nu_\sigma \sim \mathcal{N}(0, \sigma^2 I)$ and $\sigma$ is a controllable hyperparameter. The Gaussian noise has the effect of smoothing the latent space, ensuring that $p_Z(z)$ is continuous and finite, and it guarantees the existence of the Radon-Nikodym derivative. Our reference noise for all experiments is $u \sim \mathrm{Unif}(-4, 4)$. The loss function for training GCAE is:

$$\mathcal{L}_{GCAE} = \mathbb{E}_{x, \nu_\sigma}\left[\left\lVert x - \psi\!\left(\phi(x) + \nu_\sigma\right)\right\rVert_2^2\right] + \lambda\, \mathcal{L}_{EEP}(Z), \tag{4}$$

where $\lambda$ is a hyperparameter to control the strength of regularization, and $\nu_\sigma$ is the Gaussian noise injected in the latent space with the scale hyperparameter $\sigma$. The two terms have the following intuitions: the mean squared error (MSE) of reconstructions ensures $z$ captures information of the input, while $\mathcal{L}_{EEP}$ encourages representations to be mutually independent.

Figure 2: Depiction of the proposed method, GCAE. Gaussian noise with variance $\sigma^2$ is added to the latent space, smoothing the representations for gradient-based disentanglement with $\mathcal{L}_{EEP}$. Discriminators use the density-ratio trick to represent the conditional distributions of each latent element given observations of all other elements, capturing complex dependencies between subsets of the variables whilst avoiding the curse of dimensionality.
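For concreteness, here is a minimal sketch (ours, not the released implementation) of a single GCAE update following Eq. (4). `encoder`, `decoder`, and `leep_loss` are placeholder callables (the latter standing in for the discriminator-based $\mathcal{L}_{EEP}$ estimate), and the 3·softsign bounding is our stand-in for the paper's bounded output activation.

```python
import torch
import torch.nn.functional as F

def gcae_step(encoder, decoder, leep_loss, x, sigma=0.2, lam=0.2):
    # Bounded encoder output so that z_phi lies in (-3, 3) (3 * softsign is an assumption).
    z_phi = 3.0 * F.softsign(encoder(x))
    # Gaussian channel: z = z_phi + nu_sigma, with nu_sigma ~ N(0, sigma^2 I).
    z = z_phi + sigma * torch.randn_like(z_phi)
    x_hat = decoder(z)
    # Eq. (4): MSE reconstruction plus lambda-weighted L_EEP disentanglement term
    # (L_EEP evaluated on the noiseless z_phi, as described in Appendix A.1).
    mse = F.mse_loss(x_hat, x)
    return mse + lam * leep_loss(z_phi)
```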
4 EXPERIMENTS

We evaluate the performance of GCAE against the leading unsupervised disentanglement baselines β-VAE Higgins et al. (2016), FactorVAE Kim & Mnih (2018), β-TCVAE Chen et al. (2018), and DIP-VAE-II Kumar et al. (2017). We measure disentanglement using four popular supervised disentanglement metrics: Mutual Information Gap (MIG) Chen et al. (2018), Factor Score Kim & Mnih (2018), DCI Disentanglement Eastwood & Williams (2018), and Separated Attribute Predictability (SAP) Kumar et al. (2017). The four metrics cover the three major types of disentanglement metrics identified by Carbonneau et al. (2020) in order to provide a complete comparison of the quantitative disentanglement capabilities of the latest methods.

We consider two datasets which cover different data modalities. The Beamsynthesis dataset Yeats et al. (2022) is a collection of 360 time-series waveforms from a linear particle accelerator beamforming simulation. The waveforms are 1000 values long and are made of two independent data generating factors: duty cycle (continuous) and frequency (categorical). The dSprites dataset Matthey et al. (2017) is a collection of 737,280 synthetic images of simple white shapes on a black background. Each 64×64 pixel image consists of a single shape generated from the following independent factors: shape (categorical), scale (continuous), orientation (continuous), x-position (continuous), and y-position (continuous).

All experiments are run with the PyTorch framework Paszke et al. (2019) on 4 NVIDIA Tesla V100 GPUs, and all methods are trained for the same number of iterations. Hyperparameters such as network architecture and optimizer are held constant across all models in each experiment (with the exception of the dual latent parameters required by VAE models). The latent space dimension is fixed at m = 10 for all experiments, unless otherwise noted. More details are in Appendix B.

In general, increasing λ and σ led to lower $\mathcal{L}_{\Sigma I}$ but higher MSE at the end of training. Figure 3a depicts this relationship for Beamsynthesis and dSprites. Increasing σ shifts final loss values towards increased independence (according to $\mathcal{L}_{\Sigma I}$) but slightly worse reconstruction error. This is consistent with the well-known Gaussian channel: as the relative noise level increases, the information capacity of a power-constrained channel decreases. The tightly grouped samples in the lower right of the plot correspond to λ = 0, and incorporating any λ > 0 leads to a decrease in $\mathcal{L}_{\Sigma I}$ and an increase in MSE. As λ is increased further, the MSE increases slightly as the average $\mathcal{L}_{\Sigma I}$ decreases.

Figure 3b plots the relationship between final $\mathcal{L}_{\Sigma I}$ values and MIG evaluation scores for both Beamsynthesis and dSprites. Our experiments depict a moderate negative relationship with a correlation coefficient of −0.823. These results suggest that $\mathcal{L}_{\Sigma I}$ is a promising unsupervised indicator of successful disentanglement, which is helpful if one does not have access to the ground truth data factors.

Figure 3a: Scatter plot of log($\mathcal{L}_{\Sigma I}$) vs. MSE for GCAE on Beamsynthesis and dSprites. Higher σ and lower log($\mathcal{L}_{\Sigma I}$) (through increased disentanglement pressure) tend to increase MSE. However, the increase in MSE subsides as the model becomes disentangled.
Figure 3b: Scatter plot of log($\mathcal{L}_{\Sigma I}$) vs. MIG for GCAE on Beamsynthesis and dSprites (both marked with dots). There is a moderate relationship between log($\mathcal{L}_{\Sigma I}$) and MIG (r = −0.823), suggesting log($\mathcal{L}_{\Sigma I}$) is a promising indicator of (MIG) disentanglement.

Figure 3: Scatter plots of log($\mathcal{L}_{\Sigma I}$) vs. MSE and MIG, respectively, as σ is increased.

EFFECT OF λ AND σ ON DISENTANGLEMENT

Figure 4: Effect of λ and σ on different disentanglement metrics on (a) Beamsynthesis and (b) dSprites. λ is varied on the x-axis. Starting from the top left of each subfigure and moving clockwise, we report MIG, Factor Score, SAP, and DCI Disentanglement. Noise levels σ ∈ {0.2, 0.3} are preferable for reliable disentanglement performance. KEY: Dark lines - average scores. Shaded areas - one standard deviation.

In this experiment, we plot the disentanglement scores (average and standard deviation) of GCAE as the latent space noise level σ and disentanglement strength λ vary on Beamsynthesis and dSprites. In each figure, each dark line plots the average disentanglement score while the shaded area fills one standard deviation of reported scores around the average.

Figure 4a depicts the disentanglement scores of GCAE on the Beamsynthesis dataset. All σ levels exhibit relatively low scores when λ is set to zero (with the exception of Factor Score). In this situation, the model is well-fit to the data, but the representation is highly redundant and entangled, causing the gap (or separatedness, in SAP) for each factor to be low. However, whenever λ > 0 the disentanglement performance increases significantly, especially for MIG, DCI Disentanglement, and SAP with λ ∈ [0.1, 0.2]. There is a clear preference for higher noise levels, as σ = 0.1 generally has higher variance and lower disentanglement scores. Factor Score starts out very high on Beamsynthesis because there are just two factors of variation, making the task easy.

Figure 4b depicts the disentanglement scores of GCAE on the dSprites dataset. Similar to the previous experiment with Beamsynthesis, no disentanglement pressure leads to relatively low scores on all considered metrics (≈0.03 MIG, ≈0.47 Factor Score, ≈0.03 DCI, ≈0.08 SAP), but introducing λ > 0 significantly boosts performance on a range of scores: ≈0.35 MIG, ≈0.6 Factor Score, ≈0.37 SAP, and ≈0.45 DCI (for σ ∈ {0.2, 0.3}). Here, there is a clear preference for larger σ; σ ∈ {0.2, 0.3} reliably lead to high scores with little variance.

COMPARISON OF GCAE WITH LEADING DISENTANGLEMENT METHODS

We incorporate experiments with leading VAE-based baselines and compare them with GCAE at σ = 0.2. Each solid line represents the average disentanglement scores for each method and the shaded areas represent one standard deviation around the mean.

Figure 5: Disentanglement metric comparison of GCAE with VAE baselines on Beamsynthesis. GCAE λ is plotted on the lower axis, and VAE-based method regularization strength β is plotted on the upper axis. KEY: Dark lines - average scores. Shaded areas - one standard deviation.

Figure 5 depicts the distributional performance of all considered methods and metrics on Beamsynthesis. When no disentanglement pressure is applied, disentanglement scores for all methods are relatively low. When disentanglement pressure is applied (λ, β > 0), the scores of all methods increase. GCAE scores highest or second-highest on each metric, with low relative variance over a large range of λ.
β-TCVAE consistently scores second-highest on average, with moderate variance. FactorVAE and β-VAE tend to perform relatively similarly, but the performance of β-VAE appears highly sensitive to hyperparameter selection. DIP-VAE-II performs the worst on average.

Figure 6 shows a similar experiment for dSprites. Applying disentanglement pressure significantly increases disentanglement scores, and GCAE performs very well with relatively little variance when λ ∈ [0.1, 0.5]. β-VAE achieves high top scores with extremely little variance, but only for a very narrow range of β. β-TCVAE scores very high on average for a wide range of β but with large variance in scores. FactorVAE consistently scores highest on Factor Score and it is competitive on SAP. DIP-VAE-II tends to underperform compared to the other methods.

Figure 6: Disentanglement metric comparison of GCAE with VAE baselines on dSprites. GCAE λ is plotted on the lower axis, and VAE-based method regularization strength β is plotted on the upper axis. KEY: Dark lines - mean scores. Shaded areas - one standard deviation.

DISENTANGLEMENT PERFORMANCE AS Z DIMENSIONALITY INCREASES

We report the disentanglement performance of GCAE and FactorVAE on the dSprites dataset as m is increased. FactorVAE Kim & Mnih (2018) is the closest TC-based method: it uses a single monolithic discriminator and the density-ratio trick to explicitly approximate TC(Z). Computing TC(Z) requires knowledge of the joint density $p_Z(z)$, which is challenging to compute as m increases.

Figure 7 depicts an experiment comparing GCAE and FactorVAE when m = 20. The results for m = 10 are included for comparison. The average disentanglement scores for GCAE with m = 10 and m = 20 are very close, indicating that its performance is robust in m. This is not the case for FactorVAE: it performs worse on all metrics when m increases. Interestingly, FactorVAE with m = 20 seems to recover its performance on most metrics with higher β than is beneficial for FactorVAE with m = 10. Despite this, the difference suggests that FactorVAE is not robust to changes in m.

Figure 7: Comparison of GCAE with FactorVAE on dSprites as m increases. λ is plotted below, and β is plotted above. KEY: Dark lines - mean scores. Shaded areas - one standard deviation.

5 DISCUSSION

Overall, the results indicate that GCAE is a highly competitive disentanglement method. It achieves the highest average disentanglement scores on the Beamsynthesis and dSprites datasets, and it has relatively low variance in its scores when σ ∈ {0.2, 0.3}, indicating it is reliable. The hyperparameters are highly transferable, as λ ∈ [0.1, 0.5] works well on multiple datasets and metrics, and the performance does not change with m, contrary to the TC-based method FactorVAE. GCAE also used the same data preprocessing (mean and standard deviation normalization) across the two datasets. We also find that $\mathcal{L}_{\Sigma I}$ is a promising indicator of disentanglement performance.

While GCAE performs well, it has several limitations. In contrast to the VAE optimization process, which is very robust Kingma & Welling (2013), the optimization of m discriminators is sensitive to choices of learning rate and optimizer. Training m discriminators requires a lot of computation, and the quality of the learned representation depends heavily on the quality of the conditional densities stored in the discriminators. Increasing the latent space noise σ seems to make learning more
robust and generally leads to improved disentanglement outcomes, but it limits the corresponding information capacity of the latent space.

6 CONCLUSION

We have presented Gaussian Channel Autoencoder (GCAE), a new disentanglement method which employs Gaussian noise and flexible density estimation in the latent space to achieve reliable, high-performing disentanglement scores. GCAE avoids the curse of dimensionality of density estimation by minimizing the Dual Total Correlation (DTC) metric with a weighted information functional to capture disentangled data generating factors. The method is shown to consistently outcompete existing SOTA baselines on many popular disentanglement metrics on Beamsynthesis and dSprites.

ACKNOWLEDGEMENTS

This research is supported by grants from U.S. Army Research W911NF2220025 and U.S. Air Force Research Lab FA8750-21-1-1015. We would like to thank Cameron Darwin for our helpful conversations regarding this work. This research is supported, in part, by the U.S. Department of Energy, through the Office of Advanced Scientific Computing Research's Data-Driven Decision Control for Complex Systems (DnC2S) project. This research used resources of the Experimental Computing Laboratory (ExCL) at ORNL. This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

REFERENCES

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.

Marc-André Carbonneau, Julian Zaidi, Jonathan Boilard, and Ghyslain Gagnon. Measuring disentanglement: A review of metrics. arXiv preprint arXiv:2012.09276, 2020.

Ricky T. Q. Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of disentanglement in variational autoencoders. Advances in Neural Information Processing Systems, 31, 2018.

Yuanqi Du, Xiaojie Guo, Amarda Shehu, and Liang Zhao. Interpretable molecule generation via disentanglement learning. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp. 1–8, 2020.

Cian Eastwood and Christopher KI Williams. A framework for the quantitative evaluation of disentangled representations. In International Conference on Learning Representations, 2018.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014a.

Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014b.
Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. 2016.

Sarthak Jain, Edward Banner, Jan-Willem van de Meent, Iain J Marshall, and Byron C Wallace. Learning disentangled representations of texts with application to biomedical abstracts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, volume 2018, pp. 4683. NIH Public Access, 2018.

Vineet John, Lili Mou, Hareesh Bahuleyan, and Olga Vechtomova. Disentangled representation learning for non-parallel text style transfer. arXiv preprint arXiv:1808.04339, 2018.

Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In International Conference on Machine Learning, pp. 2649–2658. PMLR, 2018.

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 972–981, 2017.

Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. arXiv preprint arXiv:1711.00848, 2017.

Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, pp. 4114–4124. PMLR, 2019.

Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32:8026–8037, 2019.

Daniil Polykovskiy, Alexander Zhebrak, Dmitry Vetrov, Yan Ivanenkov, Vladimir Aladinskiy, Polina Mamoshina, Marine Bozdaganyan, Alexander Aliper, Alex Zhavoronkov, and Artur Kadurin. Entangled conditional adversarial autoencoder for de novo drug discovery. Molecular Pharmaceutics, 15(10):4398–4405, 2018.

Mhd Hasan Sarhan, Nassir Navab, Abouzar Eslami, and Shadi Albarqouni. Fairness by learning orthogonal disentangled representations. In European Conference on Computer Vision, pp. 746–761. Springer, 2020.

Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.

Eric Yeats, Frank Liu, David Womble, and Hai Li. NashAE: Disentangling representations through adversarial covariance minimization. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVII, pp. 36–51. Springer, 2022.
A IMPLEMENTATION

A.1 INFORMATION FUNCTIONAL

We estimate the information between each latent element and the remaining latent variables, $I(Z_i; Z_{\setminus i})$, used in $\mathcal{L}_{\Sigma I}$ with a uniform estimate of the information functional:

$$I(Z_i; Z_{\setminus i}) \approx (b - a)\, \mathbb{E}_{z_{\setminus i} \sim \phi_\sigma(X)}\, \mathbb{E}_{u_i \sim \mathrm{Unif}(a, b)}\left[ p_{Z_i}(u_i \mid z_{\setminus i})\left( \log p_{Z_i}(u_i \mid z_{\setminus i}) - \log p_{Z_i}(u_i) \right) \right],$$

where $(a, b)$ are the bounds of the uniform distribution (−4 and 4 in our experiments), and $p_{Z_i}(u_i \mid z_{\setminus i})$ is the conditional density of the i-th discriminator evaluated with noise from the uniform distribution. 50 uniform samples are taken per batch to estimate the functional in all experiments. Furthermore, we found it beneficial (in terms of disentanglement performance) to estimate the functional using $z_\phi$ (i.e., the noiseless form of $z$).² Gradient is only taken through the $p_{Z_i}(u_i \mid z_{\setminus i})$ term with respect to the $z_{\setminus i}$ variables. The marginal entropy $h(Z_i)$ upper bounds the conditional entropy $h(Z_i \mid Z_{\setminus i})$ with respect to the conditioning variables, so the information functional is a natural path to maximizing $h(Z_i \mid Z_{\setminus i})$ and thereby minimizing DTC.

² Our intuition is that each $z_{\setminus i}$ comes from one of the modes of the corresponding Gaussian-blurred distribution, ensuring that the loss is defined. This avoids the case where the learned conditional distribution is not defined when given a novel $z_{\setminus i}$.

B MAIN EXPERIMENT DETAILS

Each method uses the same architecture (besides the µ, log σ² heads for the VAE) and receives the same amount of data during training. In all experiments, the GCAE autoencoder and discriminator learning rates are 5e−5 and 2e−4, respectively. The VAE learning rate is 1e−4 and the FactorVAE discriminator learning rate is 2e−4. All methods use the Adam optimizer with (β1, β2) = (0.9, 0.999) for the AE subset of parameters and (β1, β2) = (0.5, 0.9) for the discriminator(s) subset of parameters (if applicable). The number of discriminator updates per AE update, k, is set to 5 when m = 10 and 10 when m = 20. All discriminators are warmed up with 500 batches before training begins to ensure they approximate a valid density. VAE architectures are equipped with a Gaussian decoder for Beamsynthesis and a Bernoulli decoder for dSprites. SELU refers to the SeLU activation function Klambauer et al. (2017).

Table 1: MLP architectures. Dataset-specific training settings are listed in the first column; the encoder/decoder layers are shared across datasets.

| Dataset (training settings) | GCAE Architecture | VAE Architecture |
| --- | --- | --- |
| Beamsynthesis: Batch Size = 64, Mean/STD Norm, 2000 Iterations | Linear(n, 1024), SELU | Linear(n, 1024), SELU |
| dSprites: Batch Size = 256, Mean/STD Norm (GCAE), 20000 Iterations | Linear(1024, 1024), SELU | Linear(1024, 1024), SELU |
| | Linear(1024, 512), SELU | Linear(1024, 512), SELU |
| | Linear(512, m), SoftSign | 2 × Linear(512, m) |
| | Linear(m, 512), SELU | Linear(m, 512), SELU |
| | Linear(512, 1024), SELU | Linear(512, 1024), SELU |
| | Linear(1024, 1024), SELU | Linear(1024, 1024), SELU |
| | Linear(1024, n) | Linear(1024, n) |

Table 2: Discriminator architectures. The FactorVAE architecture follows the suggestion of Kim & Mnih (2018). The GCAE discriminator is much smaller, but there are m of them compared to just one FactorVAE discriminator.

| GCAE Discriminator Architecture | FactorVAE Discriminator Architecture |
| --- | --- |
| Linear(m, 256), SELU | Linear(m, 1024), SELU |
| Linear(256, 256), SELU | Linear(1024, 1024), SELU |
| Linear(256, 1), Sigmoid | Linear(1024, 1024), SELU |
| - | Linear(1024, 1024), SELU |
| - | Linear(1024, 1024), SELU |
| - | Linear(1024, 1), Sigmoid |
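To connect the estimator of A.1 with the discriminator setup above, the following is a minimal sketch (our own simplification, not the released code) of a per-dimension conditional discriminator and the uniform estimate of $I(Z_i; Z_{\setminus i})$. `CondDiscriminator`, `cond_density`, and `info_functional` are hypothetical names, and the handling of gradients (detaching the marginal estimate) reflects our reading of A.1 rather than the exact implementation.

```python
import torch
import torch.nn as nn

A, B = -4.0, 4.0  # bounds of the uniform reference noise

class CondDiscriminator(nn.Module):
    # Outputs D(u_i, z_rest): probability that u_i is a "real" z_i given the other latents.
    def __init__(self, m, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(m, hidden), nn.SELU(),
            nn.Linear(hidden, hidden), nn.SELU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, u_i, z_rest):
        return self.net(torch.cat([u_i, z_rest], dim=1))

def cond_density(disc, u_i, z_rest):
    # Density-ratio conversion: p_{Z_i}(u_i | z_rest) ~= D / (1 - D) * p_U, with p_U = 1 / (B - A).
    d = disc(u_i, z_rest).clamp(1e-6, 1 - 1e-6)
    return d / (1.0 - d) / (B - A)

def info_functional(disc, z_phi, i, n_u=50):
    # Uniform estimate of I(Z_i; Z_{\i}) for one latent dimension i (Appendix A.1),
    # evaluated on the noiseless z_phi.
    batch = z_phi.shape[0]
    z_rest = torch.cat([z_phi[:, :i], z_phi[:, i + 1:]], dim=1)
    total = torch.zeros(())
    for _ in range(n_u):
        u_val = float(torch.empty(1).uniform_(A, B))
        u_i = torch.full((batch, 1), u_val)
        p_cond = cond_density(disc, u_i, z_rest)   # p(u_i | z_{\i}) for each batch element
        p_marg = p_cond.mean().detach()            # batch estimate of p(u_i) = E_{z_\i} p(u_i | z_\i)
        total = total + (p_cond * (p_cond.log() - p_marg.log())).mean()
    return (B - A) * total / n_u
```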
C ABLATION STUDY

Figure 8: Ablation study: comparison of MIG scores with and without $\mathcal{L}_{EEP}$. "$\mathcal{L}_{\Sigma I}$" corresponds to direct gradient descent on $\mathcal{L}_{\Sigma I}$.

Figure 8 depicts an ablation study for training with $\mathcal{L}_{EEP}$ vs. directly with $\mathcal{L}_{\Sigma I}$. We found that training directly with $\mathcal{L}_{\Sigma I}$ promotes independence between the latent variables, but the learned variables were not stable (i.e., their variance fluctuated significantly in training). The results indicate that $\mathcal{L}_{EEP}$ is a helpful inductive bias for aligning representations with interpretable data generating factors in a way that is stable throughout training.

D RELATED WORK

D.1 DISENTANGLEMENT METHODS

GCAE is an unsupervised method for disentangling learning representations; hence, the most closely related works are the state-of-the-art unsupervised VAE baselines: β-VAE Higgins et al. (2016), FactorVAE Kim & Mnih (2018), β-TCVAE Chen et al. (2018), and DIP-VAE-II Kumar et al. (2017). All methods rely on promoting some form of independence in $p_Z(z)$, and we cover them in more detail in the following sections.

The disentanglement approach of β-VAE Higgins et al. (2016) is to promote independent codes in Z by constraining the information capacity of Z. This is done with a VAE model by maximizing the expectation (on x) of the following loss:

$$\mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \beta\, D_{KL}\!\left(q_\phi(z|x)\,\|\,p_\theta(z)\right),$$

where $q_\phi(z|x)$ is the approximate posterior (inferential distribution of the encoder), $p_\theta(x|z)$ is the decoder distribution, $p_\theta(z)$ is the prior distribution (typically a spherical Gaussian), and β is a hyperparameter controlling the strength of the Information Bottleneck Tishby et al. (2000) induced on Z. Higher β is associated with improved disentanglement performance.

The authors of FactorVAE Kim & Mnih (2018) assert that the information bottleneck of β-VAE is too restrictive, and seek to improve the reconstruction error vs. disentanglement performance tradeoff by isolating the Total Correlation (TC) component of the $D_{KL}(q_\phi(z|x)\,\|\,p_\theta(z))$ term. They employ a large discriminator neural network, the density-ratio trick, and a data shuffling strategy to estimate the TC. FactorVAE maximizes the following loss:

$$\mathcal{L}_{\text{FactorVAE}} = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\!\left(q_\phi(z|x)\,\|\,p_\theta(z)\right) - \gamma\, TC_\rho(Z),$$

where $TC_\rho(Z)$ is the discriminator's estimate of TC(Z) and γ weights the TC penalty. The discriminator is trained to differentiate between real jointly distributed z and fake z in which all the elements have been shuffled across a batch.

β-TCVAE Chen et al. (2018) seeks to isolate TC(Z) via a batch estimate. They avoid significantly underestimating $p_Z(z)$ by constructing an importance-weighted estimate of h(Z):

$$\mathbb{E}_{q(z)}\left[\log q(z)\right] \approx \frac{1}{B}\sum_{i=1}^{B} \log\!\left[\frac{1}{B\,C}\sum_{j=1}^{B} q\!\left(\phi(x_i) \mid x_j\right)\right],$$

where $q(z)$ is an estimate of $p_Z(z)$, B is the minibatch size, C is the size of the dataset, $\phi(x_i)$ is a stochastic sample from the i-th x, and $q(\phi(x_i)\mid x_j)$ is the density of the posterior at $\phi(x_i)$ when $x = x_j$. This estimate is used to compute an estimate of TC(Z), and the following loss is maximized:

$$\mathcal{L}_{\beta\text{-TCVAE}} = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - I_q(Z; X) - \beta\, TC_\rho(Z) - \sum_{j=1}^{m} D_{KL}\!\left(q(z_j)\,\|\,p(z_j)\right),$$

where $I_q(Z; X)$ is the index-code mutual information, $TC_\rho(Z)$ is an estimate of TC(Z) computed with their estimate of $q(z)$, β is a hyperparameter controlling TC(Z) regularization, and $\sum_{j=1}^{m} D_{KL}(q(z_j)\,\|\,p(z_j))$ is a dimension-wise Kullback-Leibler divergence.
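A minimal sketch (ours, with hypothetical argument names) of that minibatch-weighted estimate of $\mathbb{E}_{q(z)}[\log q(z)]$ for a diagonal-Gaussian posterior; B is the batch size and `dataset_size` plays the role of C.

```python
import math
import torch

def log_qz_mws(z, mu, logvar, dataset_size):
    # z, mu, logvar: (B, m) tensors; z are posterior samples, (mu, logvar) are the per-example
    # diagonal-Gaussian posterior parameters. Returns a (B,)-shaped estimate of log q(z_i).
    B = z.shape[0]
    # log q(phi(x_i) | x_j) for every pair (i, j), summed over latent dimensions
    diff = z.unsqueeze(1) - mu.unsqueeze(0)            # (B, B, m)
    lv = logvar.unsqueeze(0)                           # (1, B, m)
    log_q_pair = (-0.5 * (diff ** 2 / lv.exp() + lv + math.log(2 * math.pi))).sum(dim=2)  # (B, B)
    # log q(z_i) ~= logsumexp_j log q(z_i | x_j) - log(B * C)
    return torch.logsumexp(log_q_pair, dim=1) - math.log(B * dataset_size)
```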
The approach of DIP-VAE-II is that the aggregate posterior of a VAE model should be factorized in order to promote disentanglement Kumar et al. (2017). This is done efficiently using batch estimates of the covariance matrix. The loss to be maximized for DIP-VAE-II is:

$$\mathcal{L}_{\text{DIP-VAE-II}} = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\!\left(q_\phi(z|x)\,\|\,p_\theta(z)\right) - \beta \sum_{i=1}^{m}\left(\left[\mathrm{Cov}(z)_{ii} - 1\right]^2 + \sum_{j \neq i}\left[\mathrm{Cov}(z)_{ij}\right]^2\right).$$

Hence, the covariance matrix of the sampled representation z should be equal to the identity matrix. β is a hyperparameter controlling regularization strength. We did not consider DIP-VAE-I since it implicitly assumes knowledge of how many data generating factors there are.

D.2 DISENTANGLEMENT METRICS

We evaluate GCAE and the leading VAE baselines with four metrics: Mutual Information Gap (MIG), Factor Score, Separated Attribute Predictability (SAP), and DCI Disentanglement.

MUTUAL INFORMATION GAP

MIG is introduced by Chen et al. (2018) as an axis-aligned, unbiased, and general detector of disentanglement. In essence, MIG measures the average gap in information between the latent feature which is most selective for a unique data generating factor and the next-most selective latent feature. MIG is a normalized metric on [0, 1], and higher scores indicate better capturing and disentanglement of the data generating factors. MIG is defined as follows:

$$\mathrm{MIG}(Z, V) \triangleq \frac{1}{K}\sum_{k=1}^{K} \frac{1}{H(V_k)}\left( I(Z_a; V_k) - I(Z_b; V_k) \right),$$

where K is the number of data generating factors, $H(V_k)$ is the discrete entropy of the k-th data generating factor, and $z_a \sim Z_a$ and $z_b \sim Z_b$ (where a ≠ b) are the latent elements which share the most and next-most information with $v_k \sim V_k$, respectively.

For Beamsynthesis, we calculate MIG on the full dataset using a histogram estimate of the latent space with 50 bins (evenly spaced from minimum to maximum). For dSprites, we calculate MIG using 10000 samples, and we use 20 histogram bins following Locatello et al. (2019).

FACTOR SCORE

Factor Score is introduced by Kim & Mnih (2018). The intuition is that a change in one dimension of Z should result in a change of at most one factor of variation. It starts by generating many batches of data in which one factor of variation is fixed for all samples in a batch. Then the variance of each latent dimension on each batch is calculated and normalized by its standard deviation (without interventions). The index of the latent dimension with the smallest variance and the index of the fixed factor of variation for the given batch are used as a training point for a majority-vote classifier. The score is the accuracy of the classifier on a test set of data.

For Beamsynthesis, we train the majority-vote classifier on 1000 training points and evaluate on 200 separate points. For dSprites, we train the majority-vote classifier on 5000 training points and evaluate on 1000 separate points.

SEPARATED ATTRIBUTE PREDICTABILITY

Separated Attribute Predictability (SAP) is introduced by Kumar et al. (2017). SAP involves creating an m × k score matrix, where the ij-th entry is the predictability of factor j from latent element i. For discrete factors, the score is the balanced classification accuracy of predicting the factor given knowledge of the i-th latent, and for continuous factors, the score is the R-squared value of the i-th latent in (linearly) predicting the factor. The resulting score is the difference in predictability of the most-predictive and second-most-predictive latents for a given factor, averaged over all factors.

For Beamsynthesis, we use a training size of 240 and a test size of 120. For dSprites, we use a training size of 5000 and a test size of 1000.
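For illustration, a minimal sketch (ours) of the SAP computation restricted to continuous factors, where the R-squared of a one-dimensional linear fit reduces to the squared Pearson correlation; discrete factors would instead use balanced classification accuracy as described above.

```python
import numpy as np

def sap_continuous(z, v):
    # z: (N, m) latent codes, v: (N, K) continuous ground-truth factors.
    # S[i, k] = R^2 of linearly predicting factor k from latent i alone.
    N, m = z.shape
    _, K = v.shape
    S = np.zeros((m, K))
    for i in range(m):
        for k in range(K):
            cov = np.cov(z[:, i], v[:, k])
            S[i, k] = cov[0, 1] ** 2 / (cov[0, 0] * cov[1, 1] + 1e-12)
    # SAP: for each factor, the gap between the best and second-best latent, averaged over factors.
    sorted_S = np.sort(S, axis=0)
    return float(np.mean(sorted_S[-1, :] - sorted_S[-2, :]))
```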
DCI DISENTANGLEMENT

DCI Disentanglement is introduced by Eastwood & Williams (2018). It complements the other metrics introduced by that paper: completeness and informativeness. The intuition is that each latent variable should capture at most one factor. k decision tree regressors are trained to predict each factor given the latent codes z. The absolute importance weights of each decision tree regressor are extracted and inserted as columns in an m × k importance matrix. The rows of the importance matrix are normalized, and the (discrete) k-entropy of each row is computed. The difference between one and each row's k-entropy is weighted by the relative importance of each row to compute the final score.

For Beamsynthesis, we use 240 training points and 120 testing points. For dSprites, we use 5000 training points and 1000 testing points.

E TRAINING TIME COMPARISON

Table 3: Comparison of training times of the discriminator-based disentanglement algorithms on Beamsynthesis. Latent space size is fixed to m = 10 and the number of discriminator training iterations is fixed to k = 5.

| Method | Average (s) | Standard Deviation (s) |
| --- | --- | --- |
| GCAE | 955.0 | 13.6 |
| FactorVAE | 1024.4 | 5.8 |