# Constraining Variational Inference with Geometric Jensen-Shannon Divergence

Jacob Deasy, Nikola Simidjievski, Pietro Liò
Department of Computer Science and Technology, University of Cambridge
{jd645,ns779,pl219}@cam.ac.uk
34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

We examine the problem of controlling divergences for latent space regularisation in variational autoencoders. Specifically, when aiming to reconstruct example x ∈ ℝ^m via latent space z ∈ ℝ^n (n ≪ m), while balancing this against the need for generalisable latent representations. We present a regularisation mechanism based on the skew-geometric Jensen-Shannon divergence JS^{Gα}. We find a variation in JS^{Gα}, motivated by limiting cases, which leads to an intuitive interpolation between forward and reverse KL in the space of both distributions and divergences. We motivate its potential benefits for VAEs through low-dimensional examples, before presenting quantitative and qualitative results. Our experiments demonstrate that skewing our variant of JS^{Gα}, in the context of JS^{Gα}-VAEs, leads to better reconstruction and generation when compared to several baseline VAEs. Our approach is entirely unsupervised and utilises only one hyperparameter which can be easily interpreted in latent space.

1 Introduction

The problem of controlling regularisation strength for generative models is often data-dependent and poorly understood [3, 7]. Post-hoc analysis of coefficients dictating regularisation strength is rarely carried out and even more rarely provides an intuitive explanation (e.g. β-VAE, [13]). Although evidence suggests that stronger regularisation in variational settings leads to desirable disentangled representations of latent factors and better generalisation [38], scaling factors remain opaque and unrelated to the task at hand.

To learn useful latent representations for reconstruction and generation of high-dimensional distributions, the variational inference problem can be addressed through the use of Variational Autoencoders (VAEs) [17, 34]. VAE learning requires optimisation of an objective balancing the quality of samples that are encoded and then decoded, with a regularisation term penalising latent space deviations from a fixed prior distribution. VAEs have favourable properties when compared with other families of generative models, such as Generative Adversarial Networks (GANs) [10] and autoregressive models [9, 20]. In particular, GANs are known to necessitate more stringent and problem-dependent training regimes, while autoregressive models are computationally expensive and inefficient to sample.

VAEs often assume latent variables to be parameterised by a multivariate Gaussian p_θ(z) = N(μ, σ²) with z, μ, σ ∈ ℝ^n, which is approximated by q_φ(z|x) with x ∈ ℝ^m and n ≪ m. In variational Bayesian methods, using the Evidence Lower BOund (ELBO) [4], the model can be naturally constrained to prevent overfitting by minimising the Kullback-Leibler (KL) divergence [19] to an isotropic unit Gaussian ball, KL(p_θ(z) ‖ N(0, I)). One line of work has sought to better understand this divergence term to induce disentanglement, robustness, and generalisation [5, 6].
Meanwhile, the broader framework of learning a VAE as a constrained optimisation problem [13] has allowed for increasing use of more exotic statistical divergences and distances for latent space regularisation [8, 12, 22, 37], such as the regularisation term in InfoVAE [38], the Maximum Mean Discrepancy (MMD) [11].

As regularisation terms increase in complexity, it is advantageous to maintain intuition as to how they operate in latent space and to avoid exponential hyperparameter search spaces on real-world problems. In order to properly capitalise on the advantages of each divergence, it is also desirable that the meaning of scaling factors remains clear when combining multiple divergence terms. For instance, as forward KL and reverse KL are known to have distinct beneficial properties (zero-avoidance, allowing for exploration of new areas in the latent space [3], and zero-forcing, more easily ignoring noise for sharper selection of strong modes [37], respectively), there are instances where favouring one over the other would be beneficial. Even better would be to balance the use of both properties at the same time in a comprehensible manner.

In this regard, we propose the skew-geometric Jensen-Shannon Variational Autoencoder (JS^{Gα}-VAE) as an unsupervised approach to learning strongly regularised latent spaces. More specifically, we make several contributions: we first discuss the skew-geometric Jensen-Shannon divergence JS^{Gα} (and its dual form) [30] in the context of the well-known KL and Jensen-Shannon (JS) divergences and outline its limited use. We proceed to propose an adjustment of the skew parameter, and show how its effect on an intermediate distribution in JS^{Gα} furnishes us with a more intuitive divergence and permits interpolation between forward and reverse KL divergence. We then study the skew-geometric Jensen-Shannon divergence in the wider context of latent space regularisation and use it to derive a loss function for the JS^{Gα}-VAE.

To test the utility of the proposed skew-geometric Jensen-Shannon adjustments, we investigate how JS^{Gα} operates on low-dimensional examples. We demonstrate that JS^{Gα} has beneficial properties for light-tailed posterior distributions and is a more useful (and tractable) intermediate divergence than standard JS. We further exhibit that JS^{Gα} for VAEs has a positive impact on test set reconstruction loss. Namely, we show that the dual form, JS^{Gα}*, consistently outperforms forward and reverse KL across several standard benchmark datasets and skew values (code is available at https://github.com/jacobdeasy/geometric-js).

2 JS^{Gα}-VAE derivation

Existing work suggests that there exists no tractable interpolation between forward and reverse KL for multivariate Gaussians. In this section, we will show that one can be found by adapting JS^{Gα}. We also exhibit how this interpolation, well-motivated in the space of distributions, reduces to a simple quadratic interpolation in the space of divergences.

2.1 The JS^{Gα} divergence family

Problems with KL and JS minimisation. For distributions P and Q of a continuous random variable X = [X1, . . . , Xn]^T, the Kullback-Leibler (KL) divergence [19] is defined as

$$\mathrm{KL}(P \,\|\, Q) = \int p(x) \log\frac{p(x)}{q(x)}\,\mathrm{d}x, \qquad (1)$$

where p and q are the probability densities of P and Q respectively, and x ∈ ℝ^n. In particular, Equation (1) is known as the forward KL divergence from P to Q, whereas reverse KL divergence refers to KL(Q ‖ P). Due to Gaussian distributions being the self-conjugate distributions of choice in variational learning, we are interested in using divergences to compare two multivariate normal distributions N1(μ1, Σ1) and N2(μ2, Σ2) with the same dimension n.
In this case, the KL divergence is

$$\mathrm{KL}(\mathcal{N}_1 \,\|\, \mathcal{N}_2) = \frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} + \mathrm{tr}\!\left(\Sigma_2^{-1}\Sigma_1\right) + (\mu_2-\mu_1)^T\Sigma_2^{-1}(\mu_2-\mu_1) - n\right]. \qquad (2)$$

This expression is well-known in variational inference and, for the case of reverse KL from a standard normal distribution N2(0, I) to a diagonal multivariate normal distribution N1(μ, diag(σ1², ..., σn²)), reduces to the expression

$$\mathrm{KL}(\mathcal{N}_1 \,\|\, \mathcal{N}_2) = \frac{1}{2}\sum_{i=1}^{n}\left(\sigma_i^2 + \mu_i^2 - \log\sigma_i^2 - 1\right), \qquad (3)$$

used as a regularisation term in variational models [13, 17, 27] and known to enforce zero-avoiding parameters on N1 when minimised [3, 26]. On the other hand, the forward KL divergence reduces to

$$\mathrm{KL}\big(\mathcal{N}_2(0, I) \,\|\, \mathcal{N}_1(\mu, \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2))\big) = \frac{1}{2}\sum_{i=1}^{n}\left(\frac{1 + \mu_i^2}{\sigma_i^2} + \log\sigma_i^2 - 1\right), \qquad (4)$$

and is known for its zero-forcing property [3, 26]. However, there exist well-known drawbacks of the KL divergence, such as no upper bound leading to unstable optimisation and poor approximation [12], as well as its asymmetric property KL(P ‖ Q) ≠ KL(Q ‖ P). Underdispersed approximations relative to the exact posterior also produce difficulties with light-tailed posteriors when the variational distribution has heavier tails [8].

One attempt at remedying these issues is the well-known symmetrisation, the Jensen-Shannon (JS) divergence [23]

$$\mathrm{JS}(p(x) \,\|\, q(x)) = \frac{1}{2}\,\mathrm{KL}\!\left(p \,\Big\|\, \frac{p+q}{2}\right) + \frac{1}{2}\,\mathrm{KL}\!\left(q \,\Big\|\, \frac{p+q}{2}\right). \qquad (5)$$

Although the JS divergence is bounded (in [0, 1] when using base 2), and offers some intuition through symmetry, it includes the problematic mixture distribution (p + q)/2. This term means that no closed-form expression exists for the JS divergence between two multivariate normal distributions using Equation (5).

Divergence families. To circumvent these problems, prior work has sought more general families of distribution divergence [29]. For example, JS is the special case λ = 1/2 of the more general family of λ-divergences, defined by

$$\mathrm{JS}^{\lambda}(p(x) \,\|\, q(x)) = \lambda\,\mathrm{KL}\big(p \,\|\, (1-\lambda)p + \lambda q\big) + (1-\lambda)\,\mathrm{KL}\big(q \,\|\, (1-\lambda)p + \lambda q\big), \qquad (6)$$

for λ ∈ [0, 1], which interpolates between forward and reverse KL, and provides control over the degree of divergence skew (how closely related the intermediate distribution is to p or q). Although λ-divergences do not prevent the intractable comparison to a mixture distribution, their broader goal is to measure weighted divergence to an intermediate distribution in the space of possible distributions over X. In the case of the JS divergence, this is the (arithmetic) mean divergence to the arithmetic mean distribution.

Recently, [30] and [32] have proposed a further generalisation of the JS divergence using abstract means (quasi-arithmetic means [28], also known as Kolmogorov-Nagumo means). By choosing the weighted geometric mean G_α(x, y) = x^{1−α} y^{α} for α ∈ [0, 1], and using the property that the weighted product of exponential family distributions (which includes the multivariate normal) stays in the exponential family [31], a new divergence family has arisen:

$$\mathrm{JS}^{G_\alpha}(p(x) \,\|\, q(x)) = (1-\alpha)\,\mathrm{KL}\big(p \,\|\, G_\alpha(p, q)\big) + \alpha\,\mathrm{KL}\big(q \,\|\, G_\alpha(p, q)\big). \qquad (7)$$

JS^{Gα}, the skew-geometric Jensen-Shannon divergence, between two multivariate Gaussians N1(μ1, Σ1) and N2(μ2, Σ2) then admits the closed form

$$\mathrm{JS}^{G_\alpha}(\mathcal{N}_1 \,\|\, \mathcal{N}_2) = (1-\alpha)\,\mathrm{KL}(\mathcal{N}_1 \,\|\, \mathcal{N}_\alpha) + \alpha\,\mathrm{KL}(\mathcal{N}_2 \,\|\, \mathcal{N}_\alpha) \qquad (8)$$

$$= \frac{1}{2}\left[\mathrm{tr}\!\left(\Sigma_\alpha^{-1}\big((1-\alpha)\Sigma_1 + \alpha\Sigma_2\big)\right) + \log\frac{|\Sigma_\alpha|}{|\Sigma_1|^{1-\alpha}|\Sigma_2|^{\alpha}} + (1-\alpha)(\mu_\alpha-\mu_1)^T\Sigma_\alpha^{-1}(\mu_\alpha-\mu_1) + \alpha(\mu_\alpha-\mu_2)^T\Sigma_\alpha^{-1}(\mu_\alpha-\mu_2) - n\right], \qquad (9)$$

with the equivalent dual divergence being

$$\mathrm{JS}^{G_\alpha}_{*}(\mathcal{N}_1 \,\|\, \mathcal{N}_2) = (1-\alpha)\,\mathrm{KL}(\mathcal{N}_\alpha \,\|\, \mathcal{N}_1) + \alpha\,\mathrm{KL}(\mathcal{N}_\alpha \,\|\, \mathcal{N}_2) \qquad (10)$$

$$= \frac{1}{2}\left[(1-\alpha)(\mu_\alpha-\mu_1)^T\Sigma_1^{-1}(\mu_\alpha-\mu_1) + \alpha(\mu_\alpha-\mu_2)^T\Sigma_2^{-1}(\mu_\alpha-\mu_2) + \log\frac{|\Sigma_1|^{1-\alpha}|\Sigma_2|^{\alpha}}{|\Sigma_\alpha|}\right], \qquad (11)$$

where N_α has parameters (the matrix harmonic barycenter)

$$\Sigma_\alpha = \left((1-\alpha)\Sigma_1^{-1} + \alpha\Sigma_2^{-1}\right)^{-1} \qquad (12)$$

and

$$\mu_\alpha = \Sigma_\alpha\left((1-\alpha)\Sigma_1^{-1}\mu_1 + \alpha\Sigma_2^{-1}\mu_2\right). \qquad (13)$$

Throughout this paper we explore how to incorporate these expressions into variational learning.
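To make the closed forms above concrete, the following is a minimal numerical sketch in PyTorch. It is our own illustration, not the authors' released code, and the helper names `geometric_mean_gaussian` and `js_geometric` are ours: it builds the geometric-mean intermediate of Equations (12)-(13) for diagonal Gaussians and composes Equations (8) and (10) from closed-form Gaussian KL terms.

```python
# Sketch of Eqs. (8), (10), (12)-(13) for diagonal Gaussians, assuming PyTorch.
import torch
from torch.distributions import Normal, kl_divergence


def geometric_mean_gaussian(mu1, var1, mu2, var2, alpha):
    """Weighted geometric mean G_alpha(N1, N2) proportional to N1^(1-alpha) N2^alpha.

    For exponential families the weighted geometric mean stays in the family:
    precisions and precision-weighted means combine linearly (Eqs. 12-13,
    the harmonic barycenter in the diagonal case).
    """
    precision = (1 - alpha) / var1 + alpha / var2
    var_a = 1.0 / precision
    mu_a = var_a * ((1 - alpha) * mu1 / var1 + alpha * mu2 / var2)
    return mu_a, var_a


def js_geometric(mu1, var1, mu2, var2, alpha, dual=False):
    """JS^{G_alpha}(N1 || N2) (Eq. 8) or its dual (Eq. 10), summed over dimensions."""
    mu_a, var_a = geometric_mean_gaussian(mu1, var1, mu2, var2, alpha)
    n1, n2 = Normal(mu1, var1.sqrt()), Normal(mu2, var2.sqrt())
    na = Normal(mu_a, var_a.sqrt())
    if dual:
        div = (1 - alpha) * kl_divergence(na, n1) + alpha * kl_divergence(na, n2)
    else:
        div = (1 - alpha) * kl_divergence(n1, na) + alpha * kl_divergence(n2, na)
    return div.sum(-1)


mu1, var1 = torch.tensor([1.0, -0.5]), torch.tensor([0.5, 2.0])
mu2, var2 = torch.zeros(2), torch.ones(2)
for alpha in (0.001, 0.5, 0.999):
    print(alpha, js_geometric(mu1, var1, mu2, var2, alpha).item())
# Both skew limits collapse towards zero, which motivates the reversal in Section 2.2.
```

For full covariance matrices the same composition applies with `torch.distributions.MultivariateNormal`, since PyTorch also provides a closed-form KL between multivariate normals.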
2.2 JS^{Gα} and JS^{Gα}* in variational neural networks

Interpolation between forward and reverse KL. Before applying JS^{Gα}, we note that although the mean distribution N_α can be intuitively understood, the limiting skew cases still seem to offer no insight, as

$$\lim_{\alpha\to 0}\mathrm{JS}^{G_\alpha}(\mathcal{N}_1 \,\|\, \mathcal{N}_2) = 0 \qquad (14)$$

$$\lim_{\alpha\to 1}\mathrm{JS}^{G_\alpha}(\mathcal{N}_1 \,\|\, \mathcal{N}_2) = 0. \qquad (15)$$

Therefore, we instead choose to consider the more useful intermediate mean distribution N'_α, obtained by exchanging the roles of α and 1 − α:

$$\mu'_\alpha = \Sigma'_\alpha\left(\alpha\Sigma_1^{-1}\mu_1 + (1-\alpha)\Sigma_2^{-1}\mu_2\right), \qquad \Sigma'_\alpha = \left(\alpha\Sigma_1^{-1} + (1-\alpha)\Sigma_2^{-1}\right)^{-1}. \qquad (16)$$

This is equivalent to simply reversing the geometric mean (using G_α(y, x) rather than G_α(x, y)) and trivially still permits a valid divergence as a weighted sum of valid divergences.

Proposition 1. The alternative divergence

$$\mathrm{JS}^{G'_\alpha}(\mathcal{N}_1 \,\|\, \mathcal{N}_2) = (1-\alpha)\,\mathrm{KL}(\mathcal{N}_1 \,\|\, \mathcal{N}'_\alpha) + \alpha\,\mathrm{KL}(\mathcal{N}_2 \,\|\, \mathcal{N}'_\alpha), \qquad (17)$$

and its dual JS^{G'_α}*, interpolate between forward and reverse KL, satisfying

$$\lim_{\alpha\to 0}\mathrm{JS}^{G'_\alpha} = \mathrm{KL}(\mathcal{N}_1 \,\|\, \mathcal{N}_2), \qquad \lim_{\alpha\to 1}\mathrm{JS}^{G'_\alpha} = \mathrm{KL}(\mathcal{N}_2 \,\|\, \mathcal{N}_1), \qquad (18)$$

$$\lim_{\alpha\to 0}\mathrm{JS}^{G'_\alpha}_{*} = \mathrm{KL}(\mathcal{N}_2 \,\|\, \mathcal{N}_1), \qquad \lim_{\alpha\to 1}\mathrm{JS}^{G'_\alpha}_{*} = \mathrm{KL}(\mathcal{N}_1 \,\|\, \mathcal{N}_2). \qquad (19)$$

The proof of this is given in Appendix A.1. Note that this is a special case of Definition 5 in [30]. Henceforth in the paper, unless explicitly stated, JS^{Gα} refers to JS^{G'_α} (without the prime).

Variational autoencoders. We can now introduce a new VAE loss function based on this finding by using the formulation of VAE optimisation as a constrained optimisation problem given in [13]. For generative models, a suitable objective to maximise is the marginal (log-)likelihood of the observed data x ∈ ℝ^m as an expectation over the whole distribution of latent factors z ∈ ℝ^n,

$$\mathbb{E}_{p_\theta(z)}\big[p_\theta(x|z)\big]. \qquad (20)$$

More generalisable latent representations can be achieved by imposing an isotropic unit Gaussian constraint on the prior p(z) = N(0, I), arriving at the constrained optimisation problem

$$\max_{\theta, \phi}\ \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] \quad \text{subject to} \quad D\big(q_\phi(z|x) \,\|\, p(z)\big) < \varepsilon, \qquad (21)$$

where ε dictates the strength of the constraint and D is a divergence. We can then re-write Equation (21) as a Lagrangian under the KKT conditions [15, 18], obtaining

$$\mathcal{F}(\theta, \phi, \lambda; x, z) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - \lambda\big(D(q_\phi(z|x) \,\|\, p(z)) - \varepsilon\big). \qquad (22)$$

By setting D(·) = JS^{Gα} or D(·) = JS^{Gα}*, we immediately note that our family of divergences includes the β-VAE by setting α = 1 and varying λ. In simple terms, a broader family of divergences using both α and β would dictate where, and with how much strength, to skew an intermediate distribution.

Before experimentation, in order to use JS^{Gα} and JS^{Gα}* as divergence measures in variational learning, we first simplify Equations (9) and (11).

Proposition 2. For a diagonal multivariate normal distribution N1(μ, diag(σ1², ..., σn²)) and a standard normal distribution N2(0, I), the skew-geometric Jensen-Shannon divergence JS^{Gα} (an intermediate of forward and reverse KL regularisation) and its dual JS^{Gα}* reduce to

$$\mathrm{JS}^{G_\alpha}(\mathcal{N}_1 \,\|\, \mathcal{N}_2) = \frac{1}{2}\sum_{i=1}^{n}\left[\frac{(1-\alpha)\sigma_i^2 + \alpha}{\sigma_{\alpha,i}^2} + \log\frac{\sigma_{\alpha,i}^2}{\sigma_i^{2(1-\alpha)}} + \frac{(1-\alpha)(\mu_{\alpha,i} - \mu_i)^2 + \alpha\,\mu_{\alpha,i}^2}{\sigma_{\alpha,i}^2} - 1\right] \qquad (23)$$

and

$$\mathrm{JS}^{G_\alpha}_{*}(\mathcal{N}_1 \,\|\, \mathcal{N}_2) = \frac{1}{2}\sum_{i=1}^{n}\left[\left(\frac{1-\alpha}{\sigma_i^2} + \alpha\right)\sigma_{\alpha,i}^2 + \log\frac{\sigma_i^{2(1-\alpha)}}{\sigma_{\alpha,i}^2} + \frac{(1-\alpha)(\mu_{\alpha,i} - \mu_i)^2}{\sigma_i^2} + \alpha\,\mu_{\alpha,i}^2 - 1\right] \qquad (24)$$

respectively, where

$$\sigma_{\alpha,i}^2 = \frac{\sigma_i^2}{\alpha + (1-\alpha)\sigma_i^2} \qquad \text{and} \qquad \mu_{\alpha,i} = \frac{\alpha\,\mu_i}{\alpha + (1-\alpha)\sigma_i^2}. \qquad (25)$$

The proof of this is given in Appendix A.2.
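As an illustration of Propositions 1 and 2 in code, the sketch below is a minimal example assuming PyTorch; `jsg_prime` and the surrounding names are ours, not the authors' implementation. It evaluates the reversed-mean regulariser for a diagonal posterior against the N(0, I) prior, checks the limits of Equations (18)-(19) numerically, and indicates where the term would replace the KL divergence in Equation (22).

```python
# Sketch of the Proposition 2 regulariser (diagonal posterior vs. N(0, I) prior),
# assuming PyTorch; names are illustrative only.
import torch
from torch.distributions import Normal, kl_divergence


def jsg_prime(mu, logvar, alpha, dual=False):
    """JS^{G_alpha}(N(mu, sigma^2) || N(0, I)) per sample, summed over latent dims."""
    var = logvar.exp()
    # Reversed geometric-mean intermediate N'_alpha (Eq. 16, specialised as in Eq. 25):
    var_a = var / (alpha + (1 - alpha) * var)   # sigma_{alpha,i}^2
    mu_a = alpha * mu * var_a / var             # mu_{alpha,i}
    q = Normal(mu, var.sqrt())                                # approximate posterior
    p = Normal(torch.zeros_like(mu), torch.ones_like(var))    # prior N(0, I)
    m = Normal(mu_a, var_a.sqrt())                            # intermediate N'_alpha
    if dual:
        div = (1 - alpha) * kl_divergence(m, q) + alpha * kl_divergence(m, p)
    else:
        div = (1 - alpha) * kl_divergence(q, m) + alpha * kl_divergence(p, m)
    return div.sum(-1)


# Numerical check of Eq. (18): alpha -> 0 recovers KL(q || p), alpha -> 1 recovers KL(p || q).
mu = torch.randn(4, 10, dtype=torch.double)
logvar = torch.randn(4, 10, dtype=torch.double)
q = Normal(mu, logvar.exp().sqrt())
p = Normal(torch.zeros_like(mu), torch.ones_like(mu))
assert torch.allclose(jsg_prime(mu, logvar, 1e-8), kl_divergence(q, p).sum(-1), atol=1e-4)
assert torch.allclose(jsg_prime(mu, logvar, 1 - 1e-8), kl_divergence(p, q).sum(-1), atol=1e-4)

# In a VAE training step (Eq. 22 with lambda = 1), this term would replace the usual KL:
#   loss = reconstruction_loss + jsg_prime(mu, logvar, alpha).mean()
```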
3 Experiments

Thus far we have discussed the JS^{Gα} divergence and its relationship to KL and, in particular, VAEs. In this section, we begin by offering a better understanding of where JS^{Gα} and its variants differ in distributional space. We then provide a quantitative and qualitative exploration, justifying the immediate benefit of skewing α away from 0 or 1, before finishing with an exploration of the effects this has on VAE reconstruction as well as on the generative capabilities. Note that, in the analyses that follow, we set λ = 1 for all variants of JS^{Gα}-VAEs (details on the influence of λ on the reconstructive performance of VAEs, with respect to JS^{Gα} and JS^{Gα}*, are given in Appendix E).

3.1 Characteristic behaviour of JS^{Gα}

To elucidate how JS^{Gα} will behave in the higher dimensional setting of variational inference, we highlight its properties in the case of one and two dimensions. In Figure 1, univariate Gaussians illustrate how the integrand for JS^{Gα} differs favourably from the intractable JS. As the intermediate distribution N_α in Figure 1a is a Gaussian, JS^{Gα} not only permits a closed-form integral, but also offers a more natural interpolation between p(z) and q(z|x), which raises questions about whether intuitive regularisation strength (relative to a known intermediate Gaussian) may be possible in variational settings. Moreover, Figure 1c demonstrates symmetry for α = 0.5, and both Figure 1b and Figure 1c depict the increased integrand in areas of low probability density, addressing the issues touched upon earlier, where KL struggles with light-tailed posteriors.

In Figure 2, we use two dimensions to depict the effect of changing divergence measures on optimisation. As the integral of JS divergence is not tractable (and to make comparison fair), we directly optimise a bivariate Gaussian via samples from the data for all divergences. We see that the example mixture of Gaussians leads to the zero-avoiding property of KL divergence in Figure 2a and zero-forcing (i.e. mode dropping) for reverse KL in Figure 2b. While JS divergence provides an intermediate solution in Figure 2c, there is still considerable unnecessary spreading and direct optimisation of the integral will not scale. Finally, JS^{Gα} with α naively set to the symmetric case α = 0.5 leads to a more reasonable intermediate distribution which both tends towards the dominant mode and offers localised exploration.

Figure 1: Comparison of mean distributions (green) for two univariate Gaussians (red and blue), as well as comparison of the arithmetic Jensen-Shannon integrand against the skew-geometric Jensen-Shannon integrand with α = 0.5 for univariate Gaussians. Panels: (a) mean comparison; (b) N(−2, 1) ‖ N(2, 2); (c) N(0, 0.5) ‖ N(0, 3).

Figure 2: Level sets for optimised bivariate Gaussians fit to data drawn from a mixture of Gaussians. JS^{Gα} with α naively set to the symmetric case α = 0.5 (d) leads to a more reasonable intermediate distribution, when compared to (a) forward KL, (b) reverse KL and (c) Jensen-Shannon divergence. JS^{Gα} tends both towards the dominant mode and offers localised exploration. Panels: (a) KL(p(z) ‖ q(z|x)); (b) KL(q(z|x) ‖ p(z)); (c) JS; (d) JS^{Gα} (α = 0.5).
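The two-dimensional comparison of Figure 2 can be reproduced in outline as follows. This is a minimal sketch under our own assumptions: the mixture parameters, sample sizes and optimiser settings are illustrative, and the Monte Carlo estimator of JS^{Gα}, which importance-samples the geometric-mean normaliser, is one simple choice rather than necessarily the paper's exact procedure. A single bivariate Gaussian is fit to mixture samples by directly minimising a sampled divergence estimate.

```python
# Sketch of the Figure 2 setup (assumed details): fit one bivariate Gaussian to
# mixture-of-Gaussians samples by minimising a Monte Carlo divergence estimate.
import math
import torch
from torch.distributions import Categorical, MixtureSameFamily, MultivariateNormal

p_data = MixtureSameFamily(                       # data distribution p (assumed parameters)
    Categorical(torch.tensor([0.7, 0.3])),
    MultivariateNormal(torch.tensor([[-2.0, 0.0], [2.0, 0.0]]),
                       covariance_matrix=0.5 * torch.eye(2).expand(2, 2, 2)),
)

mu = torch.zeros(2, requires_grad=True)           # learnable Gaussian q
log_std = torch.zeros(2, requires_grad=True)
optimiser = torch.optim.Adam([mu, log_std], lr=5e-2)

def q_dist():
    return MultivariateNormal(mu, covariance_matrix=torch.diag(log_std.exp() ** 2))

def mc_divergence(kind, alpha=0.5, n=2048):
    q = q_dist()
    x_p, x_q = p_data.sample((n,)), q.rsample((n,))
    if kind == "forward_kl":                      # E_p[log p - log q], zero-avoiding
        return (p_data.log_prob(x_p) - q.log_prob(x_p)).mean()
    if kind == "reverse_kl":                      # E_q[log q - log p], zero-forcing
        return (q.log_prob(x_q) - p_data.log_prob(x_q)).mean()
    # JS^{G_alpha}: KL terms to the normalised geometric mean; the log-normaliser
    # log Z_alpha = log E_q[(p/q)^(1-alpha)] is importance-sampled with q as proposal.
    log_z = torch.logsumexp((1 - alpha) * (p_data.log_prob(x_q) - q.log_prob(x_q)), 0) - math.log(n)
    log_g = lambda x: (1 - alpha) * p_data.log_prob(x) + alpha * q.log_prob(x) - log_z
    return ((1 - alpha) * (p_data.log_prob(x_p) - log_g(x_p)).mean()
            + alpha * (q.log_prob(x_q) - log_g(x_q)).mean())

for _ in range(500):
    optimiser.zero_grad()
    mc_divergence("jsg", alpha=0.5).backward()    # swap in "forward_kl" / "reverse_kl" to compare
    optimiser.step()
print(mu.detach(), log_std.exp().detach())
```

Switching the `kind` argument between the three estimators reproduces, qualitatively, the zero-avoiding spread of forward KL, the mode dropping of reverse KL, and the intermediate behaviour of JS^{G_{0.5}}.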
3.2 Variational autoencoder benchmarks

We present quantitative evaluation results following standard experimental protocols from the literature [5, 13, 38]. In this regard, VAEs are known to have a strong capacity to reproduce images when used in conjunction with convolutional encoders and decoders. For fair comparison, we follow Higgins et al. [13] in selecting a common neural architecture across experiments (the specific model details are given in Appendix C). Although the margin for error ε in Equation (21) will vary with dataset and architecture, the point here is to standardise comparison and isolate the effect of the new divergence measure, rather than searching within architecture and hyperparameter spaces for the best performing model by some metric.

Throughout our experiments we evaluate the reconstruction loss (mean squared error) on four standard benchmark datasets: MNIST, 28×28 black and white images of handwritten digits [21]; Fashion-MNIST, 28×28 black and white images of clothing [36]; Chairs, 64×64 black and white images of 3D chairs [1]; dSprites, 64×64 black and white images of 2D shapes procedurally generated from 6 ground truth independent latent factors [25].

Influence of skew coefficient. In Figure 3, we demonstrate several immediately useful properties of skewing our divergence away from α = 0 or α = 1. Firstly, intermediate skew values of JS^{Gα} do not compromise reconstruction loss and remain considerably below KL(p(z) ‖ q(z|x)), which we find to induce the expected mode collapse across datasets. Secondly, JS^{Gα} regularisation effectively generalises to unseen data, as can be seen by the small discrepancy between train and test set evaluation. Finally, there are ranges of α values which produce superior reconstructions when compared to either direction of KL for identical architectures.

Furthermore, Figure 3 indicates that JS^{Gα}* outperforms KL(q(z|x) ‖ p(z)) for nearly all values of α. We verify that this trend (JS^{Gα} outperforms traditional divergences for α < 0.3 and JS^{Gα}* performs even better for nearly all α) generalises across datasets in Table 1 and Supplementary Figures 7-9. In Figure 3c and 3d, we also include the corresponding divergence loss contributions to verify that JS^{Gα} does not simply minimise regularisation strength in order to improve reconstruction.

Figure 3: Reconstruction (top) and divergence (bottom) loss comparison for JS^{Gα} (left) and JS^{Gα}* (right) against KL(q(z|x) ‖ p(z)) (VAE) and KL(p(z) ‖ q(z|x)) on the MNIST dataset. Throughout this work, dashed or full lines represent evaluation (sampling the mean with no variance) on the training or test sets, respectively. The comparisons performed on the other three datasets are given in Appendix B. Panels: (a) JS^{Gα} reconstruction; (b) JS^{Gα}* reconstruction; (c) JS^{Gα} divergence; (d) JS^{Gα}* divergence.

In Table 1, we compare the naive symmetric case JS^{G_{0.5}} against the skew value with the lowest reconstruction loss (selected from {0.1, ..., 0.9}) for JS^{Gα} and JS^{Gα}*, as well as baseline regularisation terms: KL(q(z|x) ‖ p(z)), KL(p(z) ‖ q(z|x)), β-VAE (with β = 4; details on the performance of β-VAEs for varying β are given in Appendix F) and MMD (with λ = 500). JS^{Gα} is clearly stronger than all baselines across datasets. We reinforce this point in Figure 4, where KL divergence fails to capture sharper reconstructions (such as delineating trouser legs or the heel of high-heels in the case of Fashion-MNIST) and MMD produces blurred reconstructions (we also tested λ = 1000 from [38] to no avail; additional qualitative analyses are presented in Appendix G). More specifically, we sample each latent dimension at 10 equi-spaced points, while keeping the other 9 dimensions fixed in order to highlight the trends learnt by each dimension. As α → 1, the expected mode collapse occurs when approaching reverse KL across datasets, impeding reconstruction loss across more than a few modes. However, for α values close to 0, reverse KL images suffer from blur due to the aforementioned over-dispersion property.

Generative capacity. In Figure 5, we demonstrate the generative capabilities when skewing JS^{Gα} across different α values. More specifically, we present the model evidence (ME) estimates for JS^{Gα} in comparison to forward KL, reverse KL, and MMD. ME estimates are generated by Monte Carlo estimation of the marginal distribution p_θ(x), with mean and 95% confidence intervals bootstrapped from 1000 resamples of estimated batch evidence across 100 test set batches. We emphasise here that we are not looking for state-of-the-art results, but relative improvement which isolates the impact of the proposed regularisation and extends our analysis of JS^{Gα}.
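For concreteness, the following sketch shows one way to produce such estimates. It reflects our reading of the protocol and assumes PyTorch; `decoder` and `log_likelihood` are placeholder names for a trained decoder and its per-image log-likelihood, and estimating the marginal from prior samples (rather than by importance sampling with the encoder) is an assumption on our part.

```python
# Sketch of the model-evidence evaluation (assumed protocol and helper names).
import math
import torch

@torch.no_grad()
def batch_log_evidence(x, decoder, log_likelihood, n_samples=64, latent_dim=10):
    """Mean log p(x) over a batch: log p(x) ~ logsumexp_k log p(x | z_k) - log K, z_k ~ N(0, I)."""
    log_px_z = []
    for _ in range(n_samples):
        z = torch.randn(x.shape[0], latent_dim)          # prior samples, one per image
        log_px_z.append(log_likelihood(x, decoder(z)))   # shape (batch,)
    log_px_z = torch.stack(log_px_z)                     # (K, batch)
    return (torch.logsumexp(log_px_z, dim=0) - math.log(n_samples)).mean().item()

def bootstrap_ci(batch_estimates, n_boot=1000):
    """Mean and 95% confidence interval bootstrapped over per-batch estimates."""
    est = torch.tensor(batch_estimates)
    idx = torch.randint(len(est), (n_boot, len(est)))    # resample batches with replacement
    means = est[idx].mean(dim=1)
    lo, hi = torch.quantile(means, torch.tensor([0.025, 0.975]))
    return est.mean().item(), lo.item(), hi.item()

# Usage over, e.g., 100 held-out batches:
#   estimates = [batch_log_evidence(x, decoder, bernoulli_ll) for x, _ in test_loader]
#   mean, lo, hi = bootstrap_ci(estimates)
```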
Table 1: Final model reconstruction error, including the optimal α for JS^{Gα} and JS^{Gα}*. The reconstruction errors for different α values for JS^{Gα} and JS^{Gα}* are given in Appendix B.

| Divergence | MNIST | Fashion-MNIST | dSprites | Chairs |
|---|---|---|---|---|
| KL(q(z∣x) ‖ p(z)) | 8.46 | 11.98 | 13.55 | 12.27 |
| KL(p(z) ‖ q(z∣x)) | 11.61 | 14.42 | 14.18 | 19.88 |
| β-VAE (β = 4) | 11.75 | 13.32 | 10.51 | 20.79 |
| MMD (λ = 500) | 13.19 | 11.10 | 11.87 | 18.85 |
| JS^{G_{0.5}} | 9.87 | 11.29 | 9.89 | 13.57 |
| JS^{Gα} | 7.52 (α = 0.1) | 10.04 (α = 0.2) | 5.54 (α = 0.1) | 11.95 (α = 0.2) |
| JS^{Gα}* | 7.34 (α = 0.3) | 9.58 (α = 0.4) | 4.97 (α = 0.5) | 11.64 (α = 0.4) |

Figure 4: Latent space traversal for 5 of the 10 dimensions used for Fashion-MNIST. Each row represents a latent dimension and each column represents an equidistant point in the traversal. Panels include (b) KL(q(z|x) ‖ p(z)) and (c) MMD (λ = 500).

We see that in the case of MNIST (Figure 5a) the increased reconstructive power of JS^{Gα} does come at a cost to generative performance; however, this trend is not consistent in the noisier Fashion-MNIST dataset (Figure 5b). Nevertheless, note that the reconstruction error of JS^{Gα} for α > 0.8 and α > 0.6, in the case of MNIST and Fashion-MNIST respectively, is still lower than the benchmarks. We also find 0.15 < α < 0.4 for JS^{Gα} is competitive with or better than all alternatives on both datasets.

Taken all together, we make several pragmatic suggestions for selecting α values when using our variant of JS^{Gα} or its dual form. Firstly, when using JS^{Gα}, lower α values are to be preferred; this goes some way to explaining the poor performance of the initial attempts to use JS^{G_{0.5}} in the literature (see Section 4). Whereas for the dual divergence, although lower values (α ≤ 0.5) lead to the lowest reconstruction error, higher values (α > 0.6) exhibit better generative capabilities while having lower reconstruction error than the benchmarks. Therefore, the symmetric case is a reasonably strong choice. Moreover, the plots of reconstruction loss against α clearly demonstrate a strong correlation between train and test set performance. This can be applied in practice by selecting an optimal value of α using the training performance, circumventing the need for a separate validation set.

Figure 5: Estimated log model evidence and confidence intervals for JS^{Gα} and JS^{Gα}* across different α values compared against KL(q(z|x) ‖ p(z)), KL(p(z) ‖ q(z|x)) and MMD on the (a) MNIST and (b) Fashion-MNIST datasets.

4 Related work

JS^{Gα}-VAEs build upon traditional VAEs [17, 34], with a regularisation constraint inspired by recent work on closed-form expressions for statistical divergences [30, 32]. JS^{Gα}-VAEs offer simpler and more intuitive regularisation by skewing the intermediate distribution, allowing interpolation between forward and reverse KL divergence, and therefore combating the issue of posterior collapse [24]. In this regard, our work is related to approaches that address this issue through KL annealing during training [5, 14]. In a more general sense, this work is also related to other approaches that utilise various statistical divergences and distances for latent space regularisation as an alternative to the conventional KL divergence [8, 12, 22, 37, 38]. Since its recent introduction, [2] used JS^{G_{0.5}} as a plug-and-play replacement for JS divergence with little success, while [35] used JS^{G_{0.5}} to decompose and estimate a multimodal ELBO loss. In contrast to these papers, we do not overlook the potential of JS^{Gα}.
We reverse the intermediate distribution parameterisation, allowing a principled interpolation of forward and reverse KL; we simplify the subsequent closed-form loss to that needed for VAEs; and we demonstrate improved empirical performance against several baselines (application, rather than the theory, of [31]). Our more natural parameterisation and pragmatic advice on how to properly use the skew parameter ultimately lead to better image reconstruction. We are not aware of any prior work exploring the dual form JS^{Gα}*.

5 Conclusion

Prior work assumed that no tractable interpolation existed between forward and reverse KL for multivariate Gaussians. We have overcome this with our variant of JS^{Gα}, before translating it to the variational learning setting with the JS^{Gα}-VAE. The benefits of our variant of JS^{Gα} include symmetry (at α = 0.5) and having a closed-form expression. Alongside this, we have demonstrated that the advantages of its role in VAEs include quantitatively and qualitatively better reconstructions than several baselines. Although we accept that use of vanilla VAEs may not out-compete some of the leading flow and GAN-based architectures, we believe our regularisation mechanism addresses the trade-off between zero-avoidance and zero-forcing in latent space, which goes some way to bridge this gap while being intuitive in both divergence and distribution space. Our experiments demonstrate that the flexibility accorded to VAEs by skewing JS^{Gα} is worth considering across a broad range of applications.

Broader Impact

For the statistics community, our introduction of the alternative JS^{G'_α} and JS^{G'_α}*, rather than the "original" JS^{Gα} and JS^{Gα}*, immediately presents a benefit as a more intuitive interpolation through divergence and distribution space. As we have shown the benefits of such an interpolation on the task of image reconstruction, the first impact of our model lies in better image compression and generation from latent samples. However, in a more general setting, VAEs present multiple impactful opportunities. Applications include compression (of any data type), generation of new samples in fields with data paucity, as well as extraction of underlying relationships. As our exploration of the JS^{Gα} family of VAEs has improved performance, after translation to data types with other structures our VAE could be used for all of these applications. Our experiments also indicate strong regions for the skew parameter which could be used as a standard regularisation mechanism across variational learning.

In settings with sensitive data, all of these applications bear some risks. As VAEs provide a form of lossy compression, in healthcare and social settings there is the risk of misrepresenting personal information in latent space. In areas of data paucity, without additional constraints, VAEs may generate samples which are unrealistic and severely bias any downstream training. Finally, when using VAEs in science, to extract underlying associations, it remains important to analyse the true meaning of any independent components extracted, rather than taking these rules at face value.

Acknowledgments and Disclosure of Funding

We thank Cristian Bodnar, Cătălina Cangea, Ben Day, Felix Opolka, Emma Rocheteau, Ramon Viñas Torne and Duo Wang from the Department of Computer Science and Technology, University of Cambridge, for their helpful comments. We would like to also thank the reviewers for their constructive feedback and efforts towards improving our paper.
We acknowledge the support of The Mark Foundation for Cancer Research and Cancer Research UK Cambridge Centre [C9685/A25177] for N.S. The authors declare no competing interests.

References

[1] Mathieu Aubry, Daniel Maturana, Alexei Efros, Bryan Russell, and Josef Sivic. Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models. In CVPR, 2014.
[2] Vikash Balasubramanian, Ivan Kobyzev, Hareesh Bahuleyan, Ilya Shapiro, and Olga Vechtomova. Polarized-VAE: Proximity based disentangled representation learning for text generation. arXiv preprint arXiv:2004.10809, 2020.
[3] Christopher M Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[4] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
[5] Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.
[6] Tian Qi Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pages 2610–2620, 2018.
[7] Bin Dai and David Wipf. Diagnosing and enhancing VAE models. arXiv preprint arXiv:1903.05789, 2019.
[8] Adji Bousso Dieng, Dustin Tran, Rajesh Ranganath, John Paisley, and David Blei. Variational inference via χ upper bound minimization. In Advances in Neural Information Processing Systems, pages 2732–2741, 2017.
[9] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889, 2015.
[10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[11] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
[12] James Hensman, Max Zwießele, and Neil Lawrence. Tilted variational Bayes. In Artificial Intelligence and Statistics, pages 356–364, 2014.
[13] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
[14] Chin-Wei Huang, Shawn Tan, Alexandre Lacoste, and Aaron C Courville. Improving explorability in variational inference with annealed variational objectives. In Advances in Neural Information Processing Systems 31, pages 9701–9711. Curran Associates, Inc., 2018.
[15] William Karush. Minima of functions of several variables with inequalities as side constraints. M.Sc. Dissertation, Dept. of Mathematics, Univ. of Chicago, 1939.
[16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[17] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014. URL https://arxiv.org/abs/1312.6114.
[18] Harold W Kuhn and Albert W Tucker. Nonlinear programming. In Traces and Emergence of Nonlinear Programming, pages 247–258. Springer, 2014.
[19] Solomon Kullback and Richard A Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
[20] Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 29–37, 2011.
[21] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2010.
[22] Yingzhen Li and Richard E Turner. Rényi divergence variational inference. In Advances in Neural Information Processing Systems 29, pages 1073–1081. Curran Associates, Inc., 2016.
[23] Jianhua Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.
[24] James Lucas, George Tucker, Roger Baker Grosse, and Mohammad Norouzi. Understanding posterior collapse in generative latent variable models. In DGS@ICLR, 2019.
[25] Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.
[26] Kevin P Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[27] Radford M Neal. Bayesian Learning for Neural Networks, volume 118. Springer Science & Business Media, 2012.
[28] Constantin Niculescu and Lars-Erik Persson. Convex Functions and Their Applications. Springer.
[29] Frank Nielsen. A family of statistical symmetric divergences based on Jensen's inequality. arXiv preprint arXiv:1009.4004, 2010.
[30] Frank Nielsen. On the Jensen-Shannon symmetrization of distances relying on abstract means. Entropy, 21(5):485, 2019. ISSN 1099-4300. doi: 10.3390/e21050485. Extended version available at https://arxiv.org/abs/1904.04017.
[31] Frank Nielsen and Vincent Garcia. Statistical exponential families: A digest with flash cards. arXiv preprint arXiv:0911.4863, 2009.
[32] Tomohiro Nishiyama. Generalized Bregman and Jensen divergences which include some f-divergences. arXiv preprint arXiv:1808.06148, 2018.
[33] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[34] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 1278–1286. PMLR, 2014.
[35] Thomas Sutter, Imant Daunhawer, and Julia E Vogt. Multimodal generative learning utilizing Jensen-Shannon divergence. In Workshop on Visually Grounded Interaction and Language at the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), 2019.
[36] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.
[37] Mingtian Zhang, Thomas Bird, Raza Habib, Tianlin Xu, and David Barber. Variational f-divergence minimization. arXiv preprint arXiv:1907.11891, 2019.
[38] Shengjia Zhao, Jiaming Song, and Stefano Ermon. InfoVAE: Balancing learning and inference in variational autoencoders. In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019), pages 5885–5892. AAAI Press, 2019.