Why do Variational Autoencoders Really Promote Disentanglement?

Pratik Bhowal 1  Achint Soni 2  Sirisha Rambhatla 3

Abstract

Despite not being designed for this purpose, the use of variational autoencoders (VAEs) has proven remarkably effective for disentangled representation learning (DRL). Recent research attributes this success to certain characteristics of the loss function that prevent latent space rotation, or hypothesizes about the orthogonality properties of the decoder by drawing parallels with principal component analysis (PCA). This hypothesis, however, has only been tested experimentally for linear VAEs, and the theoretical justification still remains an open problem. Moreover, since real-world VAEs are often inherently non-linear due to the use of neural architectures, understanding the DRL capabilities of real-world VAEs remains a critical task. Our work takes a step towards understanding disentanglement in real-world VAEs to theoretically establish how the orthogonality properties of the decoder promote disentanglement in practical applications. Complementary to our theoretical contributions, our experimental results corroborate our analysis. Code is available at https://github.com/criticalml-uw/Disentanglement-in-VAE.

1. Introduction

Learning human-interpretable concepts in generative modeling is crucial for their reliable and controllable application in real-world scenarios (Voynov & Babenko, 2020; Härkönen et al., 2020). A promising avenue in this realm is Disentangled Representation Learning (DRL), which aims to uncover the hidden factors of variation in the observed data.

1NVIDIA, India. 2Department of Computer Science, University of Waterloo, Ontario, Canada. 3Department of Management Science and Engineering, University of Waterloo, Ontario, Canada. Correspondence to: Pratik Bhowal, Sirisha Rambhatla.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Figure 1. Varying a single latent variable in the dSprites dataset, keeping all the other latent variables the same, changes only the vertical position of the object in the images while keeping the rest of its attributes constant, demonstrating disentanglement.

A widely accepted definition of DRL posits that each latent variable should encode a single generative factor of the data, as depicted in Fig. 1 (Higgins et al., 2018)1. This characteristic makes Disentangled Representations (DRs) particularly useful for learning interpretable latent spaces (Shen et al., 2022; Nie et al., 2023; Han et al., 2022). Furthermore, this emphasis on interpretability makes DRs a key tool for latent space manipulation, which is essential not only in fields like computer vision (Stammer et al., 2022; Wang et al., 2023; He et al., 2023; Li et al., 2022; Ruan et al., 2022), but also in text-based media generation (Wu et al., 2023).

Variational Autoencoders (VAEs) are widely used to learn DRs, owing to their probabilistic encoder-decoder structure and a latent space well-suited for generative modeling. Many benchmark DRL architectures, like β-VAE (Higgins et al., 2016), DIP-VAE (Kumar et al., 2017), and β-TCVAE (Chen et al., 2018b), are rooted in the VAE framework. These architectures generally outperform GAN-based methods, such as InfoGAN (Chen et al., 2016) and DR-GAN (Tran et al., 2017), particularly in terms of stability and quality of generation.
Furthermore, recent works on promoting disentanglement in diffusion models also utilize this VAE-like probabilistic encoder-decoder model (Zhang et al., 2022; Yang et al., 2023), underscoring the need to understand the mechanisms by which DRL is promoted by VAEs. While VAEs have found success for DRL, somewhat surprisingly they were not designed for this representation learning task. Consequently, explaining the DRL properties of VAEs has been a focal point in recent research (Burgess et al., 2018; Chen et al., 2018b; Rolinek et al., 2019).

1Some works (Locatello et al., 2019; 2020) propose that DRL techniques use inherent biases present in the data and the disentanglement metrics to disentangle, as more than one set of generative factors can generate statistically identical data.

Figure 2. (a) Variational Autoencoder (VAE) architecture; (b) local approximations. Panel (a) illustrates a typical VAE setting. On the right, Panel (b) shows the local approximation-based VAE used in our analysis.

As shown in Fig. 2(a), a VAE comprises a probabilistic encoder, $\mathrm{Enc}_\phi : \mathcal{X} \to \mathcal{Z}$, and a decoder, $\mathrm{Dec}_\theta : \mathcal{Z} \to \mathcal{X}$. Here, $\mathcal{X} \subseteq \mathbb{R}^n$ and $\mathcal{Z} \subseteq \mathbb{R}^d$ represent the data space and latent space, respectively. In a setting with a dataset $\{x^{(i)}\}_{i=1}^N$ of $N$ elements where $x^{(i)} \in \mathcal{X}$, employing a fixed Gaussian prior $p(z^{(i)})$ over $\mathcal{Z}$ such that $z^{(i)} \sim \mathcal{N}(0, I)$, and the reconstructed data points $\hat{x}^{(i)}$ obtained using the decoder $\mathrm{Dec}_\theta(z^{(i)})$, Kingma & Welling (2013) introduced the marginalized log-likelihood as the idealized loss function, which needs to be maximized for training VAEs:

$$\sum_{i=1}^{N} \log\big(p(x^{(i)})\big). \tag{1}$$

It is interesting to note that the rotational symmetry inherent in this formulation (due to the Gaussian prior) is in fact detrimental for DRL. This is because disentangled latent spaces require precise alignment, which can be disrupted by any rotational invariance. The log-likelihood loss, however, is not tractable and is approximated by the evidence lower bound (ELBO) loss function, defined as follows:

$$\sum_{x^{(i)} \in \mathcal{X}} \underbrace{\mathbb{E}_{z^{(i)} \sim q(z^{(i)}|x^{(i)})}\big[\log p(x^{(i)}|z^{(i)})\big]}_{\mathcal{L}_{\mathrm{MLE}}} \;-\; \sum_{x^{(i)} \in \mathcal{X}} \underbrace{D_{\mathrm{KL}}\big(q(z^{(i)}|x^{(i)})\,\|\,p(z^{(i)})\big)}_{\mathcal{L}_{\mathrm{KL}}}, \tag{2}$$

where the first term is the log-likelihood loss $\mathcal{L}_{\mathrm{MLE}}$ (it acts as the reconstruction loss and is approximated by the squared error loss, $\mathcal{L}_{\mathrm{rec}}$, in most cases), and the second term is the KL divergence loss $\mathcal{L}_{\mathrm{KL}}$, which measures the similarity between the diagonal posterior generated by the encoder, $z^{(i)} \sim \mathrm{Enc}_\phi(x^{(i)}) \equiv q_\phi(z^{(i)}|x^{(i)}) = \mathcal{N}\big(\mu_\phi(x^{(i)}), \mathrm{diag}(\sigma^2_\phi(x^{(i)}))\big)$, and the symmetric Gaussian prior $p(z^{(i)})$; $\phi$ are neural network parameters. Since $q_\phi(\cdot)$ is not rotationally invariant, the ELBO loss function is not symmetric. Using this, Rolinek et al. (2019) demonstrate that optimizing the stochastic component of the squared error reconstruction loss ($\mathcal{L}_{\mathrm{rec}}$) in the ELBO can promote local orthogonality in the decoder. The authors further show that this induces PCA-like behavior in the decoder, which, along with the diagonal posterior, aids the VAE in learning DRs. However, the argument of Rolinek et al. (2019) is not entirely sufficient to equate linear VAEs with PCA, as noted by Zietlow et al. (2021). Following this insight, Zietlow et al. (2021) approximate the action of the VAE as being locally linear, providing a clearer association between PCA and linear VAEs.
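To make the objective in (2) concrete, the sketch below shows how the negative ELBO of a β-VAE is typically computed with a diagonal Gaussian posterior and a standard Gaussian prior. It is a minimal illustration rather than the architecture studied in this paper: the `TinyVAE` module, the layer sizes, and the use of PyTorch are assumptions for exposition only.

```python
# Minimal beta-VAE objective sketch (illustrative; PyTorch assumed).
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, n=64, d=10, hidden=128):
        super().__init__()
        # Encoder Enc_phi: X -> (mu, log sigma^2), i.e. a diagonal Gaussian posterior.
        self.enc = nn.Sequential(nn.Linear(n, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, d)
        self.logvar_head = nn.Linear(hidden, d)
        # Decoder Dec_theta: Z -> X.
        self.dec = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, n))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        eps = torch.randn_like(mu)                  # eps ~ N(0, I)
        z = mu + eps * torch.exp(0.5 * logvar)      # reparametrization: z = mu + sigma * eps
        return self.dec(z), mu, logvar

def neg_elbo(model, x, beta=1.0):
    x_hat, mu, logvar = model(x)
    # Squared-error surrogate for -L_MLE (the reconstruction term in (2)).
    rec = ((x_hat - x) ** 2).sum(dim=1).mean()
    # Closed-form KL between N(mu, diag(sigma^2)) and N(0, I).
    kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1.0).sum(dim=1).mean()
    return rec + beta * kl

x = torch.randn(32, 64)                             # toy batch standing in for x^(i)
loss = neg_elbo(TinyVAE(), x, beta=4.0)
loss.backward()
```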
Our key insight is that while the local linearity assumption of Zietlow et al. (2021) and the linearization of Rolinek et al. (2019) have provided a way to understand why VAEs work for DRL, these are restrictive, since real-world generation critically relies on non-linearity. With regard to these previous analyses, our inquiry centers on several pivotal questions that guide the focus of our research: beyond the diagonal posterior characteristic of VAEs, what other mechanisms actively contribute to disentanglement in DRL? Are the assumptions of local linearity, as discussed in Zietlow et al. (2021), and the linearization of the loss function, as in Rolinek et al. (2019), truly adequate for capturing the complexity of DRL? In what ways does the orthogonality within the local decoder matrix, $M_D^{(i)}$, facilitate the disentanglement process?

In summary, in this paper we investigate VAEs to explain disentanglement, and our contributions are as follows:

- Reassessing local linearity in VAEs: We show that the local linearity assumption and the linearization of the stochastic component of the reconstruction loss ($\mathcal{L}_{\mathrm{rec}}$) are not adequate for learning DRs in practice.

- Introducing local non-linearity in the VAE decoder: We present a novel approach by modeling the VAE decoder action as a composition of non-linear and linear ($M_D^{(i)}$) functions. Our analysis reveals that this modeling promotes orthogonality in $M_D^{(i)}$, a perspective not previously explored.

- Linking orthogonality and disentanglement: We provide theoretical and empirical evidence to confirm the role of orthogonality in $M_D^{(i)}$ for disentanglement.

2. Related Works

2.1. Disentangled representation learning

The need for interpretability, transparency, and causality in generative modeling and deep learning models at large has motivated recent works to argue for the need for human-like learning (Lake et al., 2017), representations and understanding of the world (Bengio et al., 2013), and causal inference (Peters et al., 2017). DRL methods (Gonzalez-Garcia et al., 2018; Jha et al., 2018; Achille & Soatto, 2018; Hu et al., 2018) try to mimic this type of representation learning while also focusing on an interpretable latent space. This interpretability causes DRL to be used in a number of domains, including adversarial training (Szabó et al., 2017; Mathieu et al., 2016), task-specific sparse predictors (Lachapelle et al., 2023), graph neural network frameworks that learn causal and bias substructures (Fan et al., 2022), temporally disentangled representation learning (Yao et al., 2022), constrained latent variables (Engel et al., 2017; Bojanowski et al., 2017), and machine learning applications including fair DRL (Creager et al., 2019; Song et al., 2019), interpretability of machine learning models (Adel et al., 2018; Bengio et al., 2013; Higgins et al., 2016), downstream tasks (Locatello et al., 2019; Gao et al., 2019), and computer vision tasks (Shu et al., 2017; Liao et al., 2020). Computer vision applications include DRL for one-shot talking head synthesis (Wang et al., 2023), zero-shot segmentation (He et al., 2023), disentanglement of geometric information for cross-view geo-localization (Zhang et al., 2023), point-of-interest recommendation (Qin et al., 2023), enhanced semantic alignment for image-based 3D model retrieval (Nie et al., 2023), unsupervised domain adaptation (Xie et al., 2022), person identification (Jia et al., 2022; Li et al., 2022), facial expression recognition (Ruan et al., 2022), and medical imaging (Han et al., 2022; Xie et al., 2022).
2.2. Variational autoencoder architectures for disentanglement

VAE (Kingma & Welling, 2013), a cornerstone architecture of DRL, has been modified into β-VAE (Higgins et al., 2016) by introducing a hyperparameter β to trade off between reconstruction and regularization, as shown in (2). To improve the architecture, Factor-VAE (Kim & Mnih, 2018) and β-TC-VAE (Chen et al., 2018a) introduced statistical independence in the latent space, while (Jeong & Song, 2019) decoupled jointly modeling the continuous and discrete factors of data. Recent developments include learning a controllable generative model in Guided-VAE (Ding et al., 2020), a sequential variational autoencoder under self-supervision for sequential data in S3VAE (Zhu et al., 2020), and Multi-VAE (Xu et al., 2021), a VAE-based multi-view clustering framework which uses disentangled representations. Subsequent works used factorized priors conditionally dependent on auxiliary variables (Khemakhem et al., 2020; Mita et al., 2021), commutative Lie groups (Zhu et al., 2021), and sparse temporal priors (Klindt et al., 2020).

2.3. Inner workings of VAE-based disentanglement architectures

The success of VAE-based architectures inspired researchers to understand their underlying principles. Whereas (Burgess et al., 2018) used the information bottleneck principle, (Kumar & Poole, 2020) studied the regularization effect of the variational family on the local geometry of the decoding model to explain β-VAEs. Following this, (Rolinek et al., 2019) showed the local orthogonality of the decoder matrix, and (Zietlow et al., 2021) showed that the local alignment of the latent space in a VAE is similar to that of PCA. However, these works are based on the linearization of the VAE decoder around a point, which does not explain practical scenarios. In our work, we introduce local non-linearity. Further, these works did not establish why the decoder's local orthogonality promotes disentanglement, but rather provided only experimental evidence. In our work, we answer this question.

3. From Local Linearity to Introducing Non-linearity

This section models the VAE locally as a composition of linear transformations (represented by matrices) and non-linear transformations (represented by non-linear functions). We demonstrate that minimizing the stochastic component of the reconstruction loss leads to orthogonality among the columns of the matrix representing the linear part of the local VAE decoder. Further, we show why orthogonality is key to ensuring disentanglement. Finally, we explain how the latent variables are selected for each generative factor.

3.1. The Problem Formulation

We start by defining the data points $\{x^{(i)}\}_{i=1}^N \subseteq \mathcal{X}$ and the latent variables $\{z^{(i)}\}_{i=1}^N \subseteq \mathcal{Z}$, such that $\mathcal{X} \subseteq \mathbb{R}^n$ and $\mathcal{Z} \subseteq \mathbb{R}^d$. The encoder function, denoted as $\mathrm{Enc}_\phi(x^{(i)})$, is modeled as a Gaussian distribution: $q_\phi(z^{(i)}|x^{(i)}) = \mathcal{N}\big(\mu_\phi(x^{(i)}), \mathrm{diag}(\sigma^2_\phi(x^{(i)}))\big)$. In this model, the latent space points $z^{(i)}$ are drawn from the distribution $q_\phi(z^{(i)}|x^{(i)})$. Through the reparametrization trick, we represent $z^{(i)}$ as $\mu_\phi(x^{(i)}) + \epsilon \odot \mathrm{diag}(\sigma_\phi(x^{(i)}))$, where $\epsilon$ is sampled from $\mathcal{N}(0, I)$. The reconstructed data points, $\hat{x}^{(i)}$, are obtained using the decoder $\mathrm{Dec}_\theta(z^{(i)})$. For simplicity of notation, we denote $\mu_\phi(x^{(i)})$ as $\mu^{(i)} \in \mathbb{R}^d$, $\mathrm{diag}(\sigma_\phi(x^{(i)}))$ as $\sigma^{(i)} \in \mathbb{R}^d$, and $\epsilon \odot \mathrm{diag}(\sigma_\phi(x^{(i)}))$ as $\epsilon^{(i)} \in \mathbb{R}^d$, with $\epsilon^{(i)}$ following a Gaussian distribution $\mathcal{N}(0, \sigma^{(i)2})$, which means that $z^{(i)} = \mu^{(i)} + \epsilon^{(i)}$. Lastly, we will use the notation $\mathbb{E}_i$ to denote expectations over the index $i$.
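As a quick illustration of this notation, the sketch below draws reparametrized latent samples for one data point and confirms that their covariance is (approximately) the diagonal matrix $\mathrm{diag}(\sigma^{(i)2})$; the dimensions and values are placeholders (numpy assumed), not quantities from the paper.

```python
# Reparametrized sampling in the notation of Sec. 3.1 (illustrative; numpy assumed).
import numpy as np

rng = np.random.default_rng(0)
d = 5
mu_i = rng.normal(size=d)                   # mu^(i): deterministic encoder output for one x^(i)
sigma_i = rng.uniform(0.01, 0.2, size=d)    # sigma^(i): per-dimension posterior std (placeholder)

eps = rng.normal(size=(10_000, d))          # eps ~ N(0, I)
eps_i = eps * sigma_i                       # eps^(i) ~ N(0, diag(sigma^(i)^2))
z_i = mu_i + eps_i                          # z^(i) = mu^(i) + eps^(i)

# Sample covariance of z^(i) is approximately diag(sigma^(i)^2): a diagonal posterior.
print(np.allclose(np.cov(z_i.T), np.diag(sigma_i ** 2), atol=1e-2))
```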
From (2), the total loss function is a combination of the KL-divergence loss ($\mathcal{L}_{\mathrm{KL}}$) and the reconstruction loss ($\mathcal{L}_{\mathrm{MLE}}$). It is expressed as follows:

$$\sum_{x^{(i)} \in \mathcal{X}} \Big[ \mathcal{L}^{(i)}_{\mathrm{MLE}} - \beta \mathcal{L}^{(i)}_{\mathrm{KL}} \Big]. \tag{3}$$

Since we model both $q_\phi(z^{(i)}|x^{(i)})$ and $p(z^{(i)})$ as Gaussian distributions, the KL-divergence loss is given by (detailed proof in Proposition 2 of Appendix A.2):

$$\mathcal{L}^{(i)}_{\mathrm{KL}} := \frac{1}{2} \sum_j \Big( \mu^{(i)2}_j + \sigma^{(i)2}_j - \log\big(\sigma^{(i)2}_j\big) - 1 \Big).$$

Again, assuming $p(x^{(i)}|z^{(i)})$ to be a Gaussian distribution, we define it as $p(x^{(i)}|z^{(i)}) = \mathcal{N}(\mathrm{Dec}_\theta(z^{(i)}), \Sigma_\theta)$, where $\Sigma_\theta = \mathrm{diag}(\sigma^2_\theta(z^{(i)}))$ and $\sigma^2_\theta(z^{(i)})$ is the variance of the decoder distribution for every $z^{(i)}$. The log-likelihood $\mathcal{L}^{(i)}_{\mathrm{MLE}}$ can then be written as follows (detailed proof in Proposition 3 of Appendix A.2):

$$\mathcal{L}^{(i)}_{\mathrm{MLE}} = -\frac{k}{2}\log(2\pi) - \frac{1}{2}\log(|\Sigma_\theta|) - \mathbb{E}_{\hat{x}^{(i)}}\Big[\tfrac{1}{2}\,\|x^{(i)} - \hat{x}\|^2_{\Sigma^{-1}_\theta}\Big],$$

where $\hat{x}^{(i)} = \mathrm{Dec}_\theta(z^{(i)})$. Following most previous works on VAEs before us (Rolinek et al., 2019; Zietlow et al., 2021), we approximate $\mathcal{L}^{(i)}_{\mathrm{MLE}}$ using the squared error loss ($\mathcal{L}^{(i)}_{\mathrm{rec}}$) as follows:

$$\mathcal{L}^{(i)}_{\mathrm{rec}} := \mathbb{E}_{\hat{x}^{(i)}}\Big[\|\hat{x}^{(i)} - x^{(i)}\|^2\Big].$$

Hence, we can write (3) as follows:

$$\sum_{x^{(i)} \in \mathcal{X}} -\Big[\, \mathbb{E}_{\hat{x}^{(i)}}\big[\|\hat{x}^{(i)} - x^{(i)}\|^2\big] + \frac{\beta}{2}\sum_j \big(\mu^{(i)2}_j + \sigma^{(i)2}_j - \log(\sigma^{(i)2}_j) - 1\big) \Big]. \tag{4}$$

Since the objective is to maximize Equation (4), we eliminate the negative signs from this equation to facilitate minimization. Consequently, the final loss function that we utilize can be stated as follows:

$$\mathcal{L} = \sum_{x^{(i)} \in \mathcal{X}} \Big[ \mathcal{L}^{(i)}_{\mathrm{rec}} + \beta \mathcal{L}^{(i)}_{\mathrm{KL}} \Big], \tag{5}$$

where $\mathcal{L}^{(i)}_{\mathrm{rec}} = \mathbb{E}_{\hat{x}^{(i)}}\big[\|\hat{x}^{(i)} - x^{(i)}\|^2\big]$ and $\mathcal{L}^{(i)}_{\mathrm{KL}} = \frac{1}{2}\sum_j \big(\mu^{(i)2}_j + \sigma^{(i)2}_j - \log(\sigma^{(i)2}_j) - 1\big)$. By substituting the value of $z^{(i)}$ as $\mu^{(i)} + \epsilon^{(i)}$ with $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^{(i)2})$, the reconstruction loss becomes:

$$\mathcal{L}^{(i)}_{\mathrm{rec}} := \mathbb{E}_{\epsilon^{(i)}}\Big[\|\mathrm{Dec}_\theta(\mu^{(i)} + \epsilon^{(i)}) - x^{(i)}\|^2\Big]. \tag{6}$$

Since this paper aims to show that minimizing the stochastic part of the reconstruction loss while fixing the deterministic part and the KL-divergence loss promotes orthogonality and consequently disentanglement, we decompose the reconstruction loss into a stochastic and a deterministic part using Prop. 1.

Proposition 1. Given $\mathcal{L}^{(i)}_{\mathrm{rec}} := \mathbb{E}_{\epsilon^{(i)}}[\|\mathrm{Dec}_\theta(\mu^{(i)} + \epsilon^{(i)}) - x^{(i)}\|^2]$, and assuming that the stochastic estimate $\mathrm{Dec}_\theta(\mu^{(i)}+\epsilon^{(i)})$ is unbiased around $\mathrm{Dec}_\theta(\mu^{(i)})$, $\mathcal{L}^{(i)}_{\mathrm{rec}}$ can be decomposed into deterministic and stochastic parts: $\mathcal{L}^{(i)}_{\mathrm{rec}} = \mathcal{L}^{\mu(i)}_{\mathrm{rec}} + \mathcal{L}^{\mathrm{stoch}(i)}_{\mathrm{rec}}$, where

$$\mathcal{L}^{\mathrm{stoch}(i)}_{\mathrm{rec}} := \mathbb{E}_{\epsilon^{(i)}}\|\mathrm{Dec}_\theta(\mathrm{Enc}_\phi(x^{(i)})) - \mathrm{Dec}_\theta(\mu^{(i)})\|^2, \qquad \mathcal{L}^{\mu(i)}_{\mathrm{rec}} := \|\mathrm{Dec}_\theta(\mu^{(i)}) - x^{(i)}\|^2. \tag{7}$$

To simplify the losses, we define the polarized regime as proposed by Rolinek et al. (2019) in Def. 3.1, and assume that the VAE operates in this polarized regime.

Definition 3.1. A Variational Autoencoder (VAE), having an encoder $\mathrm{Enc}_\phi$ and a decoder $\mathrm{Dec}_\theta$, is said to be operating in a polarized regime if the latent variables can be divided into a set of active ($V_a$) and passive ($V_p$) variables. Here:
- For passive variables $j \in V_p$, $\mu^2_j(x^{(i)}) \ll 1$ and $\sigma^2_j(x^{(i)}) \approx 1$, while for active variables $j \in V_a$, $\sigma^2_j(x^{(i)}) \ll 1$.
- The decoder ignores the passive latent components, i.e., $\frac{\partial\,\mathrm{Dec}_\theta(z^{(i)})}{\partial z^{(i)}_j} = 0$ for all $j \in V_p$.

This definition categorizes latent variables into two groups: passive latent variables, which provide minimal information about the data point, and active latent variables, which convey significant information about the data point.
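The decomposition in Prop. 1 can be checked numerically by Monte Carlo: with a small posterior standard deviation (mimicking active variables in the polarized regime of Def. 3.1), the reconstruction loss in (6) is close to the sum of its deterministic and stochastic parts. In the sketch below (numpy assumed), the decoder is an arbitrary smooth stand-in and all sizes are illustrative assumptions.

```python
# Monte Carlo sanity check of the decomposition in Prop. 1 (illustrative; numpy assumed).
import numpy as np

rng = np.random.default_rng(1)

def dec(z):
    # Stand-in for Dec_theta: any smooth map Z -> X (here a fixed random 2-layer net).
    return np.tanh(z @ W1) @ W2

d, n, hidden = 4, 16, 32
W1 = rng.normal(size=(d, hidden)) / np.sqrt(d)
W2 = rng.normal(size=(hidden, n)) / np.sqrt(hidden)

x_i = rng.normal(size=n)                 # one data point x^(i)
mu_i = rng.normal(size=d)                # encoder mean mu^(i)
sigma_i = np.full(d, 0.05)               # small sigma^(i): active variables in the polarized regime

eps = rng.normal(size=(100_000, d)) * sigma_i    # eps^(i) ~ N(0, diag(sigma^2))
L_rec = np.mean(np.sum((dec(mu_i + eps) - x_i) ** 2, axis=1))                    # (6)
L_mu = np.sum((dec(mu_i[None, :]) - x_i) ** 2)                                   # deterministic part
L_stoch = np.mean(np.sum((dec(mu_i + eps) - dec(mu_i[None, :])) ** 2, axis=1))   # stochastic part

print(L_rec, L_mu + L_stoch)  # close when Dec(mu+eps) is (approximately) unbiased around Dec(mu)
```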
In our further analysis, we make a crucial approximation regarding the local behaviour of the encoder and decoder in the VAE. We model the local decoder and the deterministic part of the local encoder as non-linear functions. These can be succinctly represented as $\mathrm{Dec}_\theta(z^{(i)}) = g^{(i)}_D(M^{(i)}_D z^{(i)})$ and $\mu^{(i)} = g^{(i)}_E(M^{(i)}_E x^{(i)})$. Here, $M^{(i)}_E$ and $M^{(i)}_D$ denote the linear transformations, while $g^{(i)}_E$ and $g^{(i)}_D$ represent the respective non-linearities in a local neighborhood around $x^{(i)}$ and in the $\sigma$-neighborhood of $\mu^{(i)}$. This approximation is visually contrasted in Fig. 2, where the standard VAE architecture is depicted on the left, and our locally approximated VAE model is shown on the right. The approximation hinges on all practical VAEs operating in a polarized regime, where the variances of active latent variables are significantly small ($\sigma^{(i)2}_j \ll 1$) and the local linear decoders $M^{(i)}_D$ are finite valued, so that $\|M^{(i)}_D \epsilon^{(i)}\| \ll 1$. Assuming $g_D(\cdot)$ to be the local decoder non-linearity, we can approximate the decoder $\mathrm{Dec}_\theta(\cdot)$ around $\mu^{(i)}$ using a Taylor series expansion as follows (proof in Prop. 4, App. A.2):

$$\mathrm{Dec}_\theta(z^{(i)}) = g^{(i)}_D(M^{(i)}_D z^{(i)}) \approx \mathrm{Dec}_\theta(\mu^{(i)}) + f^{(i)}_D(M^{(i)}_D \epsilon^{(i)}). \tag{8}$$

This expression, while an approximation, is grounded in the framework of polarized regimes and the practical behaviour of VAEs under these conditions. For the remainder of the paper, we will refer to $f^{(i)}_D$ simply as $f_D$ and $M^{(i)}_D$ as $M_D$. Next, $\mathcal{L}^{\mathrm{stoch}(i)}_{\mathrm{rec}}$ is further simplified according to Lem. 1.

Lemma 1. With the approximation of the decoder being locally non-linear such that it can be expressed as $g_D(M_D\epsilon^{(i)})$, $\mathcal{L}^{\mathrm{stoch}(i)}_{\mathrm{rec}}$ can be expressed as follows:

$$\mathcal{L}^{\mathrm{stoch}(i)}_{\mathrm{rec}} = \sum_j \Big( \mathrm{var}\big[f_D(M_{Dj}\epsilon^{(i)})\big] + f^2_D(0) + f_D(0)\,f''_D(0)\,\mathrm{var}\big[M_{Dj}\epsilon^{(i)}\big] \Big). \tag{9}$$

Lemma 2. Given the local decoder matrix $M_D = U_D \Sigma_D V_D^\top$, local encoder matrix $M_E = U_E \Sigma_E V_E^\top$, local decoder non-linearity $g_D$, and local encoder non-linearity $g_E$, the minimization of $\mathcal{L}^{\mu(i)}_{\mathrm{rec}}$ depends either only on $V_E$ or only on $U_D$ and $f_D$, i.e., fixing $\mathcal{L}^{\mu(i)}_{\mathrm{rec}}$ fixes $V_E$, $U_D$ and $f_D$.

We use Def. 3.1 and Lem. 2 to simplify $\mathcal{L}_{\mathrm{KL}}$ in Lem. 3.

Lemma 3. Fixing the deterministic part of the reconstruction loss ($\mathcal{L}^{\mu(i)}_{\mathrm{rec}}$) and assuming the VAE is operating in a polarized regime, the portion of $\mathcal{L}_{\mathrm{KL}} := \sum_{x^{(i)} \in \mathcal{X}} \mathcal{L}^{(i)}_{\mathrm{KL}}$ that is not fixed can be expressed as:

$$\mathcal{L}_{\mathrm{KL}} = -\sum_{x^{(i)} \in \mathcal{X}} \sum_{j \in V_a} \log\big(\sigma^{(i)2}_j\big).$$

We define the parameter $C_{\mathrm{KL}}$ to investigate its effect on the minimized stochastic reconstruction loss $\mathcal{L}^{\mathrm{stoch}}_{\mathrm{rec}}$. While minimizing $\mathcal{L}^{\mathrm{stoch}}_{\mathrm{rec}}$, we maintain $\mathcal{L}_{\mathrm{KL}} = C_{\mathrm{KL}}$.

3.2. Minimizing the stochastic part of the reconstruction loss promotes orthogonal columns in $M_D$

We propose that minimizing $\sum_{x^{(i)} \in \mathcal{X}} \mathcal{L}^{\mathrm{stoch}(i)}_{\mathrm{rec}}$ while fixing $\sum_{x^{(i)} \in \mathcal{X}} \mathcal{L}^{\mu(i)}_{\mathrm{rec}}$ and $\sum_{x^{(i)} \in \mathcal{X}} \mathcal{L}^{(i)}_{\mathrm{KL}}$ promotes the columns of $M_D$ to be orthogonal. According to Lemma 2, fixing $\sum_{x^{(i)} \in \mathcal{X}} \mathcal{L}^{\mu(i)}_{\mathrm{rec}}$ fixes $V_E$, $U_D$, and $f_D$. Consequently, the minimization process can only be carried out by adjusting $V_D$ and $\Sigma_D$. Thus, based on (9), Lemmas 1, 2, and 3, and $C_{\mathrm{KL}}$, we formulate the optimization problem as presented in (10) and via the following result.

Theorem 1. Given independent data samples $x^{(i)}$, if we fix $\sum_{x^{(i)} \in \mathcal{X}} \mathcal{L}^{(i)}_{\mathrm{KL}}$ to a constant $C_{\mathrm{KL}}$, and fix $\sum_{x^{(i)} \in \mathcal{X}} \mathcal{L}^{\mu(i)}_{\mathrm{rec}}$, then the minimization of the VAE loss $\mathcal{L}$ in (5) reduces to the minimization of the stochastic reconstruction loss $\mathcal{L}^{\mathrm{stoch}(i)}_{\mathrm{rec}}$:

$$\min_{\sigma^{(i)}_j > 0,\, V_D} \;\sum_{x^{(i)} \in \mathcal{X}} \log \mathcal{L}^{\mathrm{stoch}(i)}_{\mathrm{rec}} \quad \text{s.t.} \quad \sum_{x^{(i)} \in \mathcal{X}} \mathcal{L}^{(i)}_{\mathrm{KL}} = C_{\mathrm{KL}}. \tag{10}$$

Then, the following hold for the local minima: (a) every local minimum is a global minimum; (b) in every global minimum, the columns of every $M_D$ are orthogonal. Further, the variance of a latent variable is inversely proportional to the norm of the corresponding column in the linear part of the local decoder:

$$\sigma^{(i)2}_j \propto \frac{1}{\|c_j\|^2},$$

where $c_j$ is the $j$-th column of $M_D$. Proof given in A.4. Note that, even though the columns of the linear part of the decoder $M_D$ are orthogonal, in general the columns of $f_D(M_D)$ are not.
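Theorem 1's predictions can be probed on a trained decoder. The sketch below uses the decoder Jacobian at $\mu^{(i)}$ as a stand-in for the linear part $M_D$ of the local decoder (an assumption made only for illustration, not the estimation procedure used in our experiments), measures how far its columns are from orthogonal, and checks whether $\sigma^{(i)2}_j \|c_j\|^2$ is roughly constant across latent dimensions; the decoder, $\mu$, and $\sigma^2$ below are placeholders.

```python
# Checking Theorem 1's predictions on a decoder (illustrative sketch; PyTorch assumed).
import torch

def column_orthogonality_gap(M):
    # Normalized off-diagonal mass of M^T M: 0 when the columns of M are mutually orthogonal.
    G = M.T @ M
    off = G - torch.diag(torch.diag(G))
    return (off.norm() / G.norm()).item()

# Placeholders standing in for a trained decoder and one sample's posterior parameters.
decoder = torch.nn.Sequential(torch.nn.Linear(6, 64), torch.nn.Tanh(), torch.nn.Linear(64, 100))
mu = torch.randn(6)                      # mu^(i)
sigma2 = torch.rand(6) * 0.1             # sigma^(i)^2

# Jacobian at mu used as a proxy for M_D; shape (n, d), columns c_j.
M_D = torch.autograd.functional.jacobian(decoder, mu)
print("orthogonality gap:", column_orthogonality_gap(M_D))

# Theorem 1 predicts sigma_j^2 proportional to 1 / ||c_j||^2 for active latent dimensions.
col_norms2 = (M_D ** 2).sum(dim=0)
print("sigma_j^2 * ||c_j||^2 (roughly constant if the prediction holds):", sigma2 * col_norms2)
```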
3.3. The principal axes of the latent space align with the standard basis

The latent variables, the SVD decomposition of the linear component of the decoder $M_D$, and the diagonal covariance of the posterior probability of the VAE align the principal axes (or curves) of the latent space with the standard basis vectors $e_i$ of the latent space. We consider the following lemma, whose proof is detailed in A.5.

Lemma 4. Given $M_D = U_D \Sigma_D V_D^\top$, such that the columns of $M_D$ are orthogonal and $M_D$ has unique non-zero singular values, the following hold: (a) $U_D$ is an orthogonal matrix, (b) the diagonal elements of $\Sigma_D$ are the norms of the columns of $M_D$, and (c) $V_D = I$.

The generative factors of data are closely linked to the principal axes or curves, which are characterized by maximum variance (experimental evidence in A.8.6).

Figure 3. The figure illustrates the non-linear and SVD decomposition components of both the encoder and the decoder for a 2D case. The latent space distribution is a Gaussian, with axes of variation of the input data aligned with the standard basis vectors $e_i$ of the latent space. It is noted that $V_D^\top$ equals the identity matrix $I$, indicating that $\Sigma_D$ is responsible for scaling the data along the standard basis vectors. For illustrative clarity, the representation is simplified to a 2D perspective, showing principal axes as straight lines instead of curves, to better convey the transformations enacted by the encoder and decoder on the data.

Techniques like PCA and Hierarchical Non-Linear PCA (h-NLPCA) (Scholz & Vigário, 2002) utilize these axes for data reconstruction by maximizing latent space variance. Similarly, symmetric NLPCA (s-NLPCA) (Kramer, 1991), which uses autoencoder architectures2, focuses on retaining components with low correlation and high variance. In the case of a VAE, as illustrated in Fig. 3, the latent space distribution would be very close to the Gaussian distribution (due to the Gaussian prior) with a diagonal covariance. Therefore, the principal axes of the latent space are aligned with the standard basis vectors $e_i$. Hence, we need $V_D^\top = I$, so that the distribution is not rotated and the axes are preserved. Moreover, from here we can see that, as noted by (Träuble et al., 2021), if we observe data that is correlated, two latent factors can change simultaneously, and it would be difficult to identify them (and disentangle them).

2Both s-NLPCA and h-NLPCA employ autoencoder architectures for generating compressed representations. Hence, in our analyses the decoder works to retain them for successful data reconstruction; see Fig. 3.
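Lemma 4 can be sanity-checked numerically by building a matrix with orthogonal columns of distinct, decreasing norms and inspecting its SVD; a small sketch follows (numpy assumed). Since singular vectors are only defined up to sign, the check compares $|V_D|$ with $I$.

```python
# Numerical sanity check of Lemma 4 (illustrative; numpy assumed).
import numpy as np

rng = np.random.default_rng(2)
n, d = 20, 4

# Build M_D with orthogonal columns whose norms are distinct and sorted in decreasing order.
Q, _ = np.linalg.qr(rng.normal(size=(n, d)))      # orthonormal columns
norms = np.array([4.0, 3.0, 2.0, 1.0])
M_D = Q * norms                                   # column j has norm norms[j]

U, S, Vt = np.linalg.svd(M_D, full_matrices=False)

print(np.allclose(U.T @ U, np.eye(d)))            # (a) U_D has orthonormal columns
print(np.allclose(S, norms))                      # (b) singular values = column norms
print(np.allclose(np.abs(Vt), np.eye(d)))         # (c) V_D = I (up to sign ambiguity)
```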
3.4. How does orthogonality influence disentanglement?

We now explore why orthogonality is instrumental in promoting disentanglement with the following lemma (proof in A.6). First, we show that given a fixed $\mathcal{L}^{\mathrm{stoch}(i)}_{\mathrm{rec}}$, orthogonality in $M_D$ promotes a lower $\mathcal{L}_{\mathrm{KL}}$.

Lemma 5. Given a fixed $\mathcal{L}^{\mathrm{stoch}(i)}_{\mathrm{rec}}$, orthogonality in the linear component of the decoder's transformation, $M_D$, promotes a lower $\mathcal{L}_{\mathrm{KL}}$.

Next, we see how a lower $\mathcal{L}_{\mathrm{KL}}$, and in turn the orthogonality of $M_D$, affects the latent space of the VAE. First, we establish that samples that are close in data space are also close in the latent space. Further, a lower KL divergence loss brings these samples closer.

Theorem 2. For a VAE, given $z^{(i)} \sim \mathrm{Enc}_\phi(x^{(i)})$ and $z^{(k)} \sim \mathrm{Enc}_\phi(x^{(k)})$, where the $x^{(k)}$ are the $k^{(i)}$ nearest neighbours of $x^{(i)}$, we define $\mathrm{Dist}(\mathcal{L}_{\mathrm{KL}})$ as follows:

$$\mathrm{Dist}(\mathcal{L}_{\mathrm{KL}}) = \mathbb{E}_{x^{(i)}, z^{(i)}, z^{(k)}} \Big[ \sum_{k=1}^{k^{(i)}} \|z^{(i)} - z^{(k)}\|^2 \Big].$$

The following hold: (a) Given that $\mathrm{Enc}_\phi(x^{(i)}) \sim q_\phi(z^{(i)}|x^{(i)})$ overlaps (is close) with $k^{(i)}$ posterior probabilities, they must be the posterior probabilities generated by the $k^{(i)}$ nearest neighbours of $x^{(i)}$ in $\mathcal{X}$, i.e., $\mathrm{Enc}_\phi(x^{(k)}) \sim q_\phi(z^{(k)}|x^{(k)})$. Here, for every $x^{(i)}$, we have $k^{(i)}$ points $x^{(k)}$ whose posterior probabilities $q_\phi(z^{(k)}|x^{(k)})$ overlap with the posterior probability $q_\phi(z^{(i)}|x^{(i)})$ in the latent space. (b) Given $\mathcal{L}'_{\mathrm{KL}} < \mathcal{L}_{\mathrm{KL}}$, we have $\mathrm{Dist}(\mathcal{L}'_{\mathrm{KL}}) < \mathrm{Dist}(\mathcal{L}_{\mathrm{KL}})$.

From this result, given any point in the latent space, $z^{(i)} = \sum_{j=1}^d z^{(i)}_j e_j$, where $z^{(i)}_j = z^{(i)} \cdot e_j$ (with $\cdot$ indicating the dot product), adjusting $z^{(i)}_l$ by $\Delta z^{(i)}_l$ for any $l \in \{1, \ldots, d\}$ (while keeping all other $z^{(i)}_j$'s constant) results in a new point $z^{(k)}$. Specifically, $z^{(k)} = z^{(i)} + \Delta z^{(i)}_l e_l$. Finally, using Sect. 3.3, Lem. 5, and Theorem 2, we observe that orthogonality in $M_D$ promotes disentanglement. Particularly, we observe: (a) the output derived from $z^{(k)}$ differs from that of $z^{(i)}$ solely in terms of the single generative factor linked to the latent variable $z^{(i)}_l$ in the $e_l$ direction; see Sect. 3.3. (b) Orthogonality promotes a lower $\mathcal{L}_{\mathrm{KL}}$; according to Theorem 2, as $\mathcal{L}_{\mathrm{KL}}$ decreases, samples close in data space come closer in latent space too. Hence, the variation between the two outputs is directly proportional to $\Delta z^{(i)}_l$, the change in the latent variable $z^{(i)}_l$; see Lem. 5. These findings substantiate that the orthogonality of the columns of $M_D$ promotes disentanglement; detailed proof in A.7.

3.5. How do σ and $\mathcal{L}_{\mathrm{KL}}$ relate to the data principal axes?

Moving on, we explain (a) the relationship between the principal axes (and consequently the generative factors) and $\bar{\sigma}^2_j = \mathbb{E}_i[\sigma^{(i)2}_j]$, and (b) how the $\mathcal{L}_{\mathrm{KL}}$ loss relates to the principal axes and generative factors. From Theorem 1, $\sigma^{(i)2}_j \propto \frac{1}{\|c_j\|^2}$ for all $i$. This implies that $\bar{\sigma}^2_j \propto \mathbb{E}_i\big[\frac{1}{\|c_j\|^2}\big] = \frac{1}{\|c_j\|^2}$. Referring to Lem. 4, the singular values of $M_D$ are given by $\|c_j\|$, and $\Sigma_D$ stretches the latent space distribution along the standard basis. Principal axes with higher variances correspond to the smaller singular values. Hence, principal axes vital for image generation are associated with columns of $M_D$ with greater $\|c_j\|$ and lower $\bar{\sigma}^2_j$. Given that $\sigma^{(i)}_j < 1$, we have $-\log(\sigma^{(i)2}_j) > 0$. As $\sigma^{(i)}_j$ decreases, significant principal axes contribute more to the $\mathcal{L}_{\mathrm{KL}}$ loss.
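The quantity $\mathrm{Dist}(\mathcal{L}_{\mathrm{KL}})$ in Theorem 2 can be estimated directly from encoder outputs by averaging squared latent distances to the $k$ nearest neighbours in data space. The sketch below is a finite-sample proxy with placeholder arrays standing in for data points and encoder means (numpy assumed).

```python
# Estimating Dist(L_KL) from Theorem 2 (illustrative sketch; numpy assumed).
import numpy as np

def dist_lkl(X, Z, k=5):
    """Average over samples of the summed squared latent distance to the latent codes
    of each sample's k nearest neighbours in data space (a proxy for Dist(L_KL))."""
    dx = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared data distances
    np.fill_diagonal(dx, np.inf)                          # exclude the point itself
    nn = np.argsort(dx, axis=1)[:, :k]                    # indices of the k nearest neighbours
    dz = ((Z[:, None, :] - Z[nn]) ** 2).sum(-1)           # squared latent distances to neighbours
    return dz.sum(axis=1).mean()

# Placeholder stand-ins for data points x^(i) and encoder means mu^(i) (or samples z^(i)).
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 32))
Z = rng.normal(size=(200, 8))
print(dist_lkl(X, Z, k=5))
```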
4. Experiments

In this section, we discuss the experimental setup and the results that verify our theoretical findings. We experimentally verify how introducing local non-linearity makes the VAE modeling more realistic. Furthermore, we show that this local non-linearity modeling technique is better than the linearization of $\mathcal{L}^{\mathrm{stoch}}_{\mathrm{rec}}$. We define a metric, the Orthogonality Deviation Score (OD-Score), to calculate the extent of orthogonality of the linear component of the local decoder matrix. Finally, we show that disentanglement (measured using MIG and MIG-Sup scores) is directly proportional to orthogonality (measured using the OD-Score). The code is available at https://github.com/criticalml-uw/Disentanglement-in-VAE.

4.1. Datasets

We study the VAE architectures using four widely used datasets, namely, dSprites, 3D Faces (Paysan et al., 2009), 3D Shapes (Burgess & Kim, 2018), and the MPI3D complex real-world shapes dataset (Gondal et al., 2019).

4.2. Metrics

For evaluating disentanglement, we use the MIG and MIG-Sup metrics introduced in Chen et al. (2018a) and Li et al. (2020), respectively. For analyzing the efficacy of the non-linearity $f_D$, we introduce an error function to compare the deviation of the analysis from the actual scenario with and without the non-linearity, and for calculating the orthogonality of the linear part of the local decoder, we devise a metric based on Lem. 4. Further, in A.8.6, we experimentally show that principal axes with the highest variance are associated with the generative factors most significant for reconstruction.

4.3. Models and Implementation Details

To evaluate the efficacy of our analysis we consider three VAE-based models, namely, VAE, β-VAE, and β-TCVAE. In the subsequent experiments, we show that our analysis holds for all the VAE-based architectures. In Appendix A.8.1, we summarize implementation details.

4.4. Main Experimental Findings

Contribution of the non-linearity f(·): From (8), $\mathrm{Dec}_\theta(\mathrm{Enc}_\phi(x^{(i)})) \approx \mathrm{Dec}_\theta(\mu^{(i)}) + f_D(M_D\epsilon^{(i)})$. On the other hand, in the work by Zietlow et al. (2021) the local decoder is approximated to be linear, so that $\mathrm{Dec}_\theta(\mathrm{Enc}_\phi(x^{(i)})) \approx \mathrm{Dec}_\theta(\mu^{(i)}) + f_{\mathrm{linear}}(M_D\epsilon^{(i)})$. To show that introducing non-linearity improves the approximation, in the experiment we compare the squared error between the non-linear approximation and the ground truth in (11), denoted δ, with the squared error between the linear approximation and the ground truth in (12), denoted $\delta_{\mathrm{linear}}$:

$$\delta = \|\mathrm{Dec}_\theta(\mu^{(i)} + \epsilon^{(i)}) - \mathrm{Dec}_\theta(\mu^{(i)}) - f_D(M_D\epsilon^{(i)})\|^2, \tag{11}$$

$$\delta_{\mathrm{linear}} = \|\mathrm{Dec}_\theta(\mu^{(i)} + \epsilon^{(i)}) - \mathrm{Dec}_\theta(\mu^{(i)}) - f_{\mathrm{linear}}(M_D\epsilon^{(i)})\|^2. \tag{12}$$

In the experiment, $f_{\mathrm{linear}}(M_D\epsilon^{(i)})$ is approximated by a neural network without non-linearity, while $f_D(M_D\epsilon^{(i)})$ is approximated by a neural network consisting of one non-linearity. We define $x^{(i)} \in \mathcal{X}_{\mathrm{val}}$, and for each $x^{(i)}$ we estimate the parameters of $f_D$ and $M_D$ (as these parameters are local for each $x^{(i)}$) using (11) as the loss function. For random $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^{(i)2})$ for each $x^{(i)}$, we calculate the error for each $x^{(i)}$ using (11) and finally take the average. In Table 1, we demonstrate using three different VAE-based architectures on four different datasets that the non-linearity makes the local decoder approximation much more accurate.

Table 1. Comparative analysis of approximation errors for the local decoder when modeled as linear versus non-linear.

Dataset | Error | β-TCVAE | β-VAE | VAE
dSprites | Zietlow et al. (2021) | 0.9209 | 0.8394 | 0.4693
dSprites | Ours | 0.8502 | 0.7829 | 0.4166
3DFaces | Zietlow et al. (2021) | 0.8679 | 0.8080 | 0.4956
3DFaces | Ours | 0.8088 | 0.7694 | 0.4236
3DShapes | Zietlow et al. (2021) | 0.8848 | 0.8244 | 0.4856
3DShapes | Ours | 0.8122 | 0.7944 | 0.4266
MPI3D | Zietlow et al. (2021) | 0.9646 | 0.5240 | 0.5024
MPI3D | Ours | 0.8624 | 0.8043 | 0.4832

Comparisons across different approximations: We compare two approximations: the linearization assumption made by Rolinek et al. (2019), which approximates the stochastic part of the reconstruction loss as $J\epsilon^{(i)}$, where $J$ is the Jacobian approximation of the decoder around $\mu^{(i)}$, and our modeling, where we assume the decoder to be non-linear.
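The comparison between (11) and (12) can be sketched for a single $x^{(i)}$ as follows: a frozen decoder stands in for $\mathrm{Dec}_\theta$, and two small local models, one purely linear and one with a single non-linearity, are fitted to $\mathrm{Dec}_\theta(\mu^{(i)}+\epsilon^{(i)}) - \mathrm{Dec}_\theta(\mu^{(i)})$. The decoder, layer sizes, and optimisation settings below are placeholder assumptions, not the exact protocol of A.8.2.

```python
# Minimal sketch of estimating (11) vs (12) for one x^(i) (illustrative; PyTorch assumed).
import torch
import torch.nn as nn

d, n = 6, 100
decoder = nn.Sequential(nn.Linear(d, 64), nn.Tanh(), nn.Linear(64, n)).eval()  # stand-in Dec_theta
for p in decoder.parameters():
    p.requires_grad_(False)

mu = torch.randn(d)                      # mu^(i) for the chosen x^(i)
sigma = 0.1 * torch.ones(d)              # sigma^(i) for the chosen x^(i)

def make_local(nonlinear: bool):
    # M_D followed by either a linear readout (f_linear) or a single non-linearity (f_D).
    layers = [nn.Linear(d, d, bias=False)]                        # plays the role of M_D
    layers += [nn.Tanh(), nn.Linear(d, n)] if nonlinear else [nn.Linear(d, n)]
    return nn.Sequential(*layers)

def fit_and_eval(nonlinear: bool, steps=2000, batch=256):
    local = make_local(nonlinear)
    opt = torch.optim.Adam(local.parameters(), lr=1e-3)
    for _ in range(steps):
        eps = sigma * torch.randn(batch, d)                       # eps^(i) ~ N(0, diag(sigma^2))
        target = decoder(mu + eps) - decoder(mu)                  # Dec(mu+eps) - Dec(mu)
        loss = ((local(eps) - target) ** 2).sum(dim=1).mean()     # (11) or (12) as fitting loss
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        eps = sigma * torch.randn(4096, d)
        return ((local(eps) - (decoder(mu + eps) - decoder(mu))) ** 2).sum(dim=1).mean().item()

print("delta (non-linear):", fit_and_eval(True))
print("delta_linear      :", fit_and_eval(False))
```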
Table 2 summarizes that modelling VAEs as non-linear incurs lower error on real-world datasets than the linearization approximation; details in A.8.2.

Table 2. Comparison of the error incurred in approximating the local decoder for evaluating $\mathcal{L}^{\mathrm{stoch}}_{\mathrm{rec}}$ using a linearized approach versus a non-linear approach. Note: *p<.05, **p<.01.

Dataset | Error | β-TCVAE | β-VAE | VAE
dSprites | Rolinek et al. (2019) | 0.4866 | 0.5243 | 0.6680
dSprites | Ours | 0.4532** | 0.4907* | 0.6643
3DFaces | Rolinek et al. (2019) | 0.5690 | 0.5799 | 0.6039
3DFaces | Ours | 0.5498* | 0.5296* | 0.5887**
3DShapes | Rolinek et al. (2019) | 0.7257 | 0.7215 | 0.7344
3DShapes | Ours | 0.6879* | 0.7010** | 0.7316
MPI3D | Rolinek et al. (2019) | 0.7836 | 0.8001 | 0.8042
MPI3D | Ours | 0.7818* | 0.7908* | 0.8014*

Disentanglement across VAE architectures: As explained above, we use the Mutual Information Gap (MIG) and MIG-Sup for quantifying the disentanglement of the different VAE architectures on the datasets mentioned. Panels (a)-(d) in Fig. 4 illustrate the MIG scores for the different VAE-based architectures, while Panels (e)-(h) illustrate the MIG-Sup scores. In Appendix A.8.4, we provide the scores in tabular form.

Understanding Orthogonality: We use Lem. 4 to calculate the distance between the linear component of the decoder function and the closest orthogonal matrix. We take the average of the normalized distances as follows:

$$d_U(U_D, \hat{U}_D) = \frac{\|U_D - \hat{U}_D\|^2_F}{\max_\phi \|U_D - \hat{U}_{D,\phi}\|^2_F}, \quad d_\Sigma(\Sigma_D, \hat{\Sigma}_D) = \frac{\|\Sigma_D - \hat{\Sigma}_D\|^2_F}{\max_\phi \|\Sigma_D - \hat{\Sigma}_{D,\phi}\|^2_F}, \quad d_V(V_D, \hat{V}_D) = \frac{\|V_D - \hat{V}_D\|^2_F}{\max_\phi \|V_D - \hat{V}_{D,\phi}\|^2_F},$$

where $\hat{U}_D$, $\hat{\Sigma}_D$, and $\hat{V}_D$ are calculated using Lem. 4 and Appendix A.8.3. The norms are Frobenius norms, and the distances are normalized across all VAE-based models used for the evaluation. The Orthogonality Deviation Score, denoted OD-Score($M_D$), is defined as:

$$\text{OD-Score}(M_D) = d_U(U_D, \hat{U}_D) + d_\Sigma(\Sigma_D, \hat{\Sigma}_D) + d_V(V_D, \hat{V}_D).$$

In Fig. 4, panels (i)-(l) show the MIG versus OD-Score($M_D$) scores for all the datasets for each of the three VAE-based architectures. Further, panels (m)-(p) show the MIG-Sup versus OD-Score($M_D$) scores. We note that as the MIG and MIG-Sup scores increase, the deviation from orthogonality as measured by OD-Score($M_D$) decreases (lower is better). This establishes that orthogonality promotes disentanglement. In Appendix A.8.5, we record the OD-Score($M_D$) values of the different VAE-based architectures for the datasets.

5. Discussion and Conclusion

In this work, we build on existing works which use local linearity for explaining the behavior of Variational Autoencoders (VAEs), to propose an analysis that incorporates local non-linearity. We provide theoretical analysis and extensive experimental evaluations to show that the stochastic part of the loss function promotes orthogonality among the columns of the linear component in the decoder's function. Furthermore, we establish both mathematically and empirically that this orthogonality is instrumental in promoting disentanglement, a link previously observed only through experimental evidence.

Figure 4. Performance of different VAE architectures. Panels (a)-(d) illustrate the MIG scores (a higher score promotes disentanglement) of different VAE architectures for dSprites, 3DFaces, 3DShapes and MPI3DComplex, respectively. Panels (e)-(h) illustrate the MIG-Sup scores (a higher score promotes disentanglement).
Panels (i)-(l) illustrate positive correlation between orthogonality, measured by OD-Score(MD) (a lower score promotes orthogonality) and MIG scores for specified models and datasets. Finally, Panels (m)-(p) illustrate the OD-Score(MD) vs the MIG-Sup score. Most previous studies suggest that the imposition of a diagonal posterior on the encoder is the primary driver for VAEs to learn disentangled representations. Our work expands on this notion by demonstrating that the reconstruction loss, when constrained by the KL-Divergence loss, also facilitates disentanglement. Nevertheless, the precise alignment of embeddings within the latent space remains an open question. Unraveling this aspect could significantly enhance the understanding of VAEs, and other generative models, for disentangled representation learning. Acknowledgement Sirisha Rambhatla would like to acknowledge support of the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant, RGPIN-2022-03512. Why do Variational Autoencoders Really Promote Disentanglement? Impact Statement As generative models grow in popularity and find applications across different disciplines, it is critical to equip them with the ability to break spurious correlations between features to be able to generate unbiased data. For instance, a dataset with correlations between a protected attribute and a feature (in ambient or the latent space) can learn to generate data that reinforces such patterns. Constructing generative models with distentanglement capability is therefore key for fair data generation to break historic biases in the data. Therefore, understanding why certain architectures inherently promote disentanglement is important to incorporate such properties in contemporary and future generative models. Achille, A. and Soatto, S. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1):1947 1980, 2018. Adel, T., Ghahramani, Z., and Weller, A. Discovering interpretable representations for both deep generative and discriminative models. In International Conference on Machine Learning, pp. 50 59. PMLR, 2018. Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798 1828, 2013. Bojanowski, P., Joulin, A., Lopez-Paz, D., and Szlam, A. Optimizing the latent space of generative networks. ar Xiv preprint ar Xiv:1707.05776, 2017. Burgess, C. and Kim, H. 3d shapes dataset. https://github.com/deepmind/3dshapes-dataset/, 2018. Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A. Understanding disentangling in β-vae. ar Xiv preprint ar Xiv:1804.03599, 2018. Chen, R. T., Li, X., Grosse, R., and Duvenaud, D. Isolating sources of disentanglement in vaes. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, volume 2615, pp. 2625, 2018a. Chen, R. T., Li, X., Grosse, R. B., and Duvenaud, D. K. Isolating sources of disentanglement in variational autoencoders. Advances in neural information processing systems, 31, 2018b. Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Advances in neural information processing systems, 29, 2016. Creager, E., Madras, D., Jacobsen, J.-H., Weis, M., Swersky, K., Pitassi, T., and Zemel, R. 
Flexibly fair representation learning by disentanglement. In International conference on machine learning, pp. 1436 1445. PMLR, 2019. Ding, Z., Xu, Y., Xu, W., Parmar, G., Yang, Y., Welling, M., and Tu, Z. Guided variational autoencoder for disentanglement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7920 7929, 2020. Engel, J., Hoffman, M., and Roberts, A. Latent constraints: Learning to generate conditionally from unconditional generative models. ar Xiv preprint ar Xiv:1711.05772, 2017. Fan, S., Wang, X., Mo, Y., Shi, C., and Tang, J. Debiasing graph neural networks via learning disentangled causal substructure. Advances in Neural Information Processing Systems, 35:24934 24946, 2022. Gao, L., Mao, Q., Dong, M., Jing, Y., and Chinnam, R. On learning disentangled representation for acoustic event detection. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 2006 2014, 2019. Gondal, M. W., Wuthrich, M., Miladinovic, D., Locatello, F., Breidt, M., Volchkov, V., Akpo, J., Bachem, O., Schölkopf, B., and Bauer, S. On the transfer of inductive bias from simulation to the real world: a new disentanglement dataset. Advances in Neural Information Processing Systems, 32, 2019. Gonzalez-Garcia, A., Van De Weijer, J., and Bengio, Y. Image-to-image translation for cross-domain disentanglement. Advances in neural information processing systems, 31, 2018. Han, L., Lyu, Y., Peng, C., and Zhou, S. K. Gan-based disentanglement learning for chest x-ray rib suppression. Medical Image Analysis, 77:102369, 2022. Härkönen, E., Hertzmann, A., Lehtinen, J., and Paris, S. Ganspace: Discovering interpretable gan controls. Advances in neural information processing systems, 33: 9841 9850, 2020. He, S., Ding, H., and Jiang, W. Primitive generation and semantic-related alignment for universal zero-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11238 11247, 2023. Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-vae: Why do Variational Autoencoders Really Promote Disentanglement? Learning basic visual concepts with a constrained variational framework. In International conference on learning representations, 2016. Higgins, I., Amos, D., Pfau, D., Racaniere, S., Matthey, L., Rezende, D., and Lerchner, A. Towards a definition of disentangled representations. ar Xiv preprint ar Xiv:1812.02230, 2018. Hu, Q., Szabó, A., Portenier, T., Favaro, P., and Zwicker, M. Disentangling factors of variation by mixing them. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3399 3407, 2018. Jeong, Y. and Song, H. O. Learning discrete and continuous factors of data via alternating disentanglement. In International Conference on Machine Learning, pp. 3091 3099. PMLR, 2019. Jha, A. H., Anand, S., Singh, M., and Veeravasarapu, V. R. Disentangling factors of variation with cycle-consistent variational auto-encoders. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 805 820, 2018. Jia, M., Cheng, X., Lu, S., and Zhang, J. Learning disentangled representation implicitly via transformer for occluded person re-identification. IEEE Transactions on Multimedia, 25:1294 1305, 2022. Khemakhem, I., Kingma, D., Monti, R., and Hyvarinen, A. Variational autoencoders and nonlinear ica: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pp. 2207 2217. PMLR, 2020. 
Kim, H. and Mnih, A. Disentangling by factorising. In International Conference on Machine Learning, pp. 2649 2658. PMLR, 2018. Kingma, D. P. and Welling, M. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013. Klindt, D., Schott, L., Sharma, Y., Ustyuzhaninov, I., Brendel, W., Bethge, M., and Paiton, D. Towards nonlinear disentanglement in natural data with temporal sparse coding. ar Xiv preprint ar Xiv:2007.10930, 2020. Kramer, M. A. Nonlinear principal component analysis using autoassociative neural networks. AICh E journal, 37(2):233 243, 1991. Kumar, A. and Poole, B. On implicit regularization in β-vaes. International Conference on Machine Learning, 2020. Kumar, A., Sattigeri, P., and Balakrishnan, A. Variational inference of disentangled latent concepts from unlabeled observations. ar Xiv preprint ar Xiv:1711.00848, 2017. Lachapelle, S., Deleu, T., Mahajan, D., Mitliagkas, I., Bengio, Y., Lacoste-Julien, S., and Bertrand, Q. Synergies between disentanglement and sparsity: Generalization and identifiability in multi-task learning. In International Conference on Machine Learning, pp. 18171 18206. PMLR, 2023. Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. Building machines that learn and think like people. Behavioral and brain sciences, 40, 2017. Li, H., Xu, K., Li, J., and Yu, Z. Dual-stream reciprocal disentanglement learning for domain adaptation person re-identification. Knowledge-Based Systems, 251:109315, 2022. Li, Z., Murkute, J. V., Gyawali, P. K., and Wang, L. Progressive learning and disentanglement of hierarchical representations. ar Xiv preprint ar Xiv:2002.10549, 2020. Liao, Y., Schwarz, K., Mescheder, L., and Geiger, A. Towards unsupervised learning of generative models for 3d controllable image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5871 5880, 2020. Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., and Bachem, O. Challenging common assumptions in the unsupervised learning of disentangled representations. In international conference on machine learning, pp. 4114 4124. PMLR, 2019. Locatello, F., Bauer, S., Lucic, M., Rätsch, G., Gelly, S., Schölkopf, B., and Bachem, O. A sober look at the unsupervised learning of disentangled representations and their evaluation. ar Xiv preprint ar Xiv:2010.14766, 2020. Mathieu, M. F., Zhao, J. J., Zhao, J., Ramesh, A., Sprechmann, P., and Le Cun, Y. Disentangling factors of variation in deep representation using adversarial training. Advances in neural information processing systems, 29, 2016. Mita, G., Filippone, M., and Michiardi, P. An identifiable double vae for disentangled representations. In International Conference on Machine Learning, pp. 7769 7779. PMLR, 2021. Nie, J., Zhang, T., Li, T., Yu, S., Li, X., and Wei, Z. Image-based 3d model retrieval via disentangled feature learning and enhanced semantic alignment. Information Processing & Management, 60(2):103159, 2023. Paysan, P., Knothe, R., Amberg, B., Romdhani, S., and Vetter, T. A 3d face model for pose and illumination invariant face recognition. In 2009 sixth IEEE international conference on advanced video and signal based surveillance, pp. 296 301. Ieee, 2009. Why do Variational Autoencoders Really Promote Disentanglement? Peters, J., Janzing, D., and Schölkopf, B. Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017. Qin, Y., Wang, Y., Sun, F., Ju, W., Hou, X., Wang, Z., Cheng, J., Lei, J., and Zhang, M. 
Disenpoi: Disentangling sequential and geographical influence for point-of-interest recommendation. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, pp. 508 516, 2023. Rolinek, M., Zietlow, D., and Martius, G. Variational autoencoders pursue pca directions (by accident). In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12406 12415, 2019. Ruan, D., Mo, R., Yan, Y., Chen, S., Xue, J.-H., and Wang, H. Adaptive deep disturbance-disentangled learning for facial expression recognition. International Journal of Computer Vision, 130(2):455 477, 2022. Scholz, M. and Vigário, R. Nonlinear pca: a new hierarchical approach. In Esann, pp. 439 444, 2002. Shen, X., Liu, F., Dong, H., Lian, Q., Chen, Z., and Zhang, T. Weakly supervised disentangled generative causal representation learning. The Journal of Machine Learning Research, 23(1):10994 11048, 2022. Shu, Z., Yumer, E., Hadap, S., Sunkavalli, K., Shechtman, E., and Samaras, D. Neural face editing with intrinsic image disentangling. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5541 5550, 2017. Song, J., Kalluri, P., Grover, A., Zhao, S., and Ermon, S. Learning controllable fair representations. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2164 2173. PMLR, 2019. Stammer, W., Memmel, M., Schramowski, P., and Kersting, K. Interactive disentanglement: Learning concepts by interacting with their prototype representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10317 10328, 2022. Szabó, A., Hu, Q., Portenier, T., Zwicker, M., and Favaro, P. Challenges in disentangling independent factors of variation. ar Xiv preprint ar Xiv:1711.02245, 2017. Tran, L., Yin, X., and Liu, X. Disentangled representation learning gan for pose-invariant face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1415 1424, 2017. Träuble, F., Creager, E., Kilbertus, N., Locatello, F., Dittadi, A., Goyal, A., Schölkopf, B., and Bauer, S. On disentangled representations learned from correlated data. In International conference on machine learning, pp. 10401 10412. PMLR, 2021. Voynov, A. and Babenko, A. Unsupervised discovery of interpretable directions in the gan latent space. In International conference on machine learning, pp. 9786 9796. PMLR, 2020. Wang, D., Deng, Y., Yin, Z., Shum, H.-Y., and Wang, B. Progressive disentangled representation learning for fine-grained controllable talking head synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17979 17989, 2023. Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., and Chang, S. Uncovering the disentanglement capability in text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1900 1910, 2023. Xie, Q., Li, Y., He, N., Ning, M., Ma, K., Wang, G., Lian, Y., and Zheng, Y. Unsupervised domain adaptation for medical image segmentation by disentanglement learning and self-training. IEEE Transactions on Medical Imaging, 2022. Xu, J., Ren, Y., Tang, H., Pu, X., Zhu, X., Zeng, M., and He, L. Multi-vae: Learning disentangled view-common and view-peculiar visual representations for multi-view clustering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9234 9243, 2021. Yang, T., Wang, Y., Lv, Y., and Zh, N. 
Disdiff: Unsupervised disentanglement of diffusion probabilistic models. ar Xiv preprint ar Xiv:2301.13721, 2023. Yao, W., Chen, G., and Zhang, K. Temporally disentangled representation learning. Advances in Neural Information Processing Systems, 35:26492 26503, 2022. Zhang, X., Li, X., Sultani, W., Zhou, Y., and Wshah, S. Cross-view geo-localization via learning disentangled geometric layout correspondence. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 3480 3488, 2023. Zhang, Z., Zhao, Z., and Lin, Z. Unsupervised representation learning from pre-trained diffusion probabilistic models. Advances in Neural Information Processing Systems, 35:22117 22130, 2022. Zhu, X., Xu, C., and Tao, D. Commutative lie group vae for disentanglement learning. In International Conference on Machine Learning, pp. 12924 12934. PMLR, 2021. Zhu, Y., Min, M. R., Kadav, A., and Graf, H. P. S3vae: Self-supervised sequential vae for representation Why do Variational Autoencoders Really Promote Disentanglement? disentanglement and data generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6538 6547, 2020. Zietlow, D., Rolinek, M., and Martius, G. Demystifying inductive biases for (beta-) vae based architectures. In International Conference on Machine Learning, pp. 12945 12954. PMLR, 2021. Why do Variational Autoencoders Really Promote Disentanglement? A. Appendix A.1. Background A.1.1. VARIATIONAL AUTOENCODERS Let {x(i)}N i=1 is a dataset consisting N elements, such that x(i) X = Rn. A VAE consists of a probabilistic encoder, Encϕ : X Z and a decoder Decθ : Z X, where Z = Rd is called the Latent Space. The distribution of the input, namely q(x(i)) is fixed as it is the actual data distribution. We also define a fixed prior distribution p(z(i)) over Z. The idealized loss function of the VAE is the marginalized log-likelihood which is defined as follows: i=1 log(p(x(i))) However, this loss function is not tractable and is approximated by its lower bound, the Evidence Lower Bound (ELBO) loss function defined as follows: Ez(i) q(z(i)|x(i))[log(p(x(i)|z(i)))] DKL(q(z(i)|x(i))||p(z(i))) where the first term is the reconstruction loss while the second term is KL divergence, which calculates the similarity between the probability distributions q(z(i)|x(i)) and p(z(i)). Hence, VAE performs a trade-off between reconstruction and the ability to mimic the prior probability distribution. All the probability distributions are assumed to be Gaussian with the prior probability distribution being defined as follows: p(z(i)) = N(0, I) (13) The encoder is defined as follows: Encϕ(x(i)) qϕ(z(i)|x(i)) = N(µϕ(x(i)), diag(σ2 ϕ(x(i)))) where µϕ and diag(σϕ) are the parameter ϕ dependent mappings. Parametrizing the distributions in this way allows for the use of the reparametrization trick to estimate gradients of the lower bound with respect to the parameters ϕ. The latent variables, hence are defined as follows z(i) qϕ(z(i)|x(i)) and hence can be reparametrized as follows: z(i) = µ(i) + σ(i)ϵ where ϵ N(0, I). Finally, it is to be noted that the posterior distribution qϕ(z(i)|x(i)) has a diagonal covariance matrix. Under the Gaussian assumptions, the KL divergence can be written in closed form as follows: L(i) KL = 1 j log(σ(i)2 j ) 1) (14) A.1.2. UNDERSTANDING DISENTANGLEMNT FOR LOG-LIKELIHOOD LOSS FUNCTION It is assumed in case of interpretable representation that some generating factors are responsible for the generation of data. 
For example in the case of the d Sprites dataset, data is generated from generative factors like position, shape, scale or size, rotational orientation etc. Disentangled representation is the situation, when, a single latent variable is responsible for the changes in a single generative factor and is non sensitive to the changes in other generative factors. In the case of unsupervised algorithms, the generative factors are not known and the methods rely on statistical procedures. One category of such methods is the VAE-based methods which we are analyzing. Rotation matrices, (U), are defined as as orthogonal matrices (U = U 1) with determinant equal to 1 (|U| = 1). We define rotational invariance or rotational symmetry for a probability as p(z) = p(Uz). It is important to note that Disentanglement is sensitive to rotations of the latent embedding. For example, consider a disentangle latent representation a b . When acted upon by the rotation matrix in 2-dimensions, namely cos θ sin θ sin θ cos θ , it becomes, a cos θ b sin θ a sin θ + b cos θ Why do Variational Autoencoders Really Promote Disentanglement? Hence, it can be seen that rotating disentangled latent representation essentially destroys the disentanglement property of the latent representation. Earlier works like (Rolinek et al., 2019) record in detail that the log-likelihood loss and the ELBO losses are rotationally invariant, meaning that if the prior probability is rotationally symmetric, then, even if a rotational matrix and its transpose are multiplied to the encoder and decoder respectively the log-likelihood objective and the ELBO objective do not change. However, this rotational invariance is disrupted by forcing a diagonal posterior on the encoder of the VAE. This gives rise to the KL Loss function that has been recorded in (14) A.2. Derivation of Loss Functions L(i) KL, L(i) MLE and Equation 8 Proposition 2. Assuming the probabilities, qϕ(z(i)|x(i))-s to be Gaussian distributions N(µϕ(x(i)), diag(σ2 ϕ(x(i))))-s, the KL divergence Loss function can be written as follows: L(i) KL = 1 j log(σ(i)2 j ) 1) (15) where, µ(i) j is the j-th element of µϕ(x(i)) and σ(i) j is the j-th element of diag(σϕ(x(i))) Proof. Given, p(z(i)) = N(0, I) and qϕ(z(i)|x(i)) = N(µϕ(x(i)), diag(σ2 ϕ(x(i)))). In short, we refer to N(µϕ(x(i)), diag(σ2 ϕ(x(i)))) as N(µ(i), Σ(i)) The multivariate normal distributions are given by, N(µ(i), Σ(i)) = 1 (2π) k 2 |Σ(i)| 1 2 exp( 1 2(z(i) µ(i)) Σ(i) 1(z(i) µ(i))) N(0, I) = 1 (2π) k 2 |I| 1 2 exp 1 We refer to N(µ(i), σ(i)) as a(z(i)) and N(0, I) as b(z(i)). We know that, L(i) KL = DKL(a||b) = Ea[log(a) log(b)] 2(z(i) µa) Σ 1 a (z(i) µa) + 1 2(z(i) µa) Σ 1 a (z(i) µa) + Ea 2z(i) z(i) (16) Now, since, (z(i) µa) Σ 1 a (z(i) µa) R, we have, (z(i) µa) Σ 1 a (z(i) µa) = Tr{z(i) µa) Σ 1 a (z(i) µa)}. Tr{z(i) µa) Σ 1 a (z(i) µa)} = Tr{z(i) µa) (z(i) µa)Σ 1 a } (17) Finally, since we can swap Tr and Ea, 2(z(i) µa) (z(i) µa)Σ 1 a 2(z(i) µa) (z(i) µa)Σ 1 a 2(z(i) µa) (z(i) µa) Σ 1 a Now, since, Ea h (z(i) µa) (z(i) µa) i = Σa, 2(z(i) µa) Σ 1 a (z(i) µa) i = tr{ΣaΣ 1 a } = Tr{Ij} = j (18) Why do Variational Autoencoders Really Promote Disentanglement? Again, we know that, E[x x] = Tr{Σ} + µ µ 2z(i) z(i)i = 1 2(Tr{Σa} + µ aµa) (19) 2 log( 1 |Σa|) i = 1 2 log(|Σa|) (20) Substituting 18, 19 and 20 in 16, we get, L(i) KL = 1 2 Tr{Σa} + µ aµa j log(|Σa| Simplifying, we get, L(i) KL = 1 j log(σ(i)2 Proposition 3. 
Given a VAE, with Gaussian distributions, the maximum likelihood estimate, L(i) MLE = Ez(i) q(z(i)|x(i))[log(p(x(i)|z(i)))] can be expressed as follows: L(i) MLE = log(2π) 2 log(|Σθ|) 2 Ez(i) q(z(i)|x(i)) h||x(i) x||2Σ 1 θ 2 Proof. Given that the distribution p(x(i)|z(i)) in the VAE is Gaussian, it can be expressed as: p(x(i)|z(i)) = N(Decθ(z(i)), Σθ) where Σθ = diag(σ2 θ(z(i))). The multivariate Gaussian distribution can be written as: N(Decθ(z(i)), Σθ) = 1 (2π) k 2 |Σθ| 1 2 exp 1 2(x(i) Decθ(z(i))) Σ 1 θ (x(i) Decθ(z(i))) Therefore, the log-likelihood is: log p(x(i)|z(i)) = k 2 log(2π) 1 2 log |Σθ| 1 2(x(i) Decθ(z(i))) Σ 1 θ (x(i) Decθ(z(i))) Since (x(i) Decθ(z(i))) Σ 1 θ (x(i) Decθ(z(i))) is a scalar, it can be represented as a trace: (x(i) Decθ(z(i))) Σ 1 θ (x(i) Decθ(z(i))) = Tr{(x(i) Decθ(z(i))) Σ 1 θ (x(i) Decθ(z(i)))} Using the cyclic property of the trace, we get: Tr{(x(i) Decθ(z(i))) Σ 1 θ (x(i) Decθ(z(i)))} = Tr{(x(i) Decθ(z(i)))(x(i) Decθ(z(i))) Σ 1 θ } Therefore, the maximum likelihood estimation (MLE) loss L(i) MLE can be written as: L(i) MLE = Ez(i) q(z(i)|x(i)) 2 log(2π) 1 2 log |Σθ| 1 2Tr{(x(i) Decθ(z(i)))(x(i) Decθ(z(i))) Σ 1 θ } Simplifying, we get: L(i) MLE = k 2 log(2π) 1 2 log |Σθ| Ez(i) q(z(i)|x(i)) 2||x(i) Decθ(z(i))||2Σ 1 θ Why do Variational Autoencoders Really Promote Disentanglement? Replacing Decθ(z(i)) with x, we obtain: L(i) MLE = k 2 log(2π) 1 2 log |Σθ| Ez(i) q(z(i)|x(i)) 2||x(i) x||2 Σ 1 θ Proposition 4. Given a VAE operating in the polarized regime, such that its decoder can be expressed as a combination of linear and non-linear transformation, Decθ(z(i)) = g(i) D (M (i) D z(i)) where M (i) D is a finite matrix, the local decoder can be approximately expressed as Decθ(z(i)) = g(i) D (M (i) D z(i)) Decθ(µ(i)) + f (i) D (M (i) D ϵ(i)) Proof. Since the VAE is operating in the polarized regime, for all active latent variables , σ(i)2 j 1. Given that the matrix M (i) D is finite, M (i) D ϵ(i) 1. Further, g(i) D (M (i) D z(i)) = g(i) D (M (i) D (µ(i) + ϵ(i))) Applying Taylor Series approximation around the point M (i) D µ(i), we have, g(i) D (M (i) D (µ(i) + ϵ(i))) = g(i) D (M (i) D µ(i)) + (M (i) D ϵ(i))g(i) D (M (i) D µ(i)) + (M (i) D ϵ(i))2 g(i) D (M (i) D µ(i)) 2! + ... = g(i) D (M (i) D µ(i)) + f (i) D (M (i) D ϵ(i)) f (i) D (M (i) D ϵ(i)) = (M (i) D ϵ(i))g(i) D (M (i) D µ(i)) + (M (i) D ϵ(i))2 g(i) D (M (i) D µ(i)) 2! + ... Again, g(i) D (M (i) D µ(i)) = Decθ(µ(i)) Hence, Decθ(z(i)) = g(i) D (M (i) D z(i)) Decθ(µ(i)) + f (i) D (M (i) D ϵ(i)) A.3. Proof of Proposition 1, and Lemmas 1, 2, 3 and 6 Proposition 1. Given L(i) rec := Eϵ(i)[||Decθ(µ(i) + ϵ(i)) x(i)||2], and assuming that the stochastic estimate, Decθ(µ(i) + ϵ(i)) is unbiased around Decθ(µ(i)), L(i) rec can be decomposed into deterministic and stochastic parts: L(i) rec = Lµ(i) rec + Lstoch(i) rec , where, Lstoch(i) rec := Eϵ(i)||Decθ(Encϕ(x(i))) Decθ(µ(i))||2, Lµ(i) rec := h ||Decθ(µ(i)) x(i)||2i (7) Proof. As defined, Encϕ(x(i)) = µ(i) + ϵ(i), where ϵ(i) is the Gaussian noise while µ(i) is the deterministic part of the encoder derived from x(i) as µ(i) = f E(MEx(i)). We write L(i) rec in the following way L(i) rec = Eϵ(i)[||Decθ(µ(i) + ϵ(i)) Decθ(µ(i)) + Decθ(µ(i)) x(i)||2] L(i) rec = Eϵ(i)[||Decθ(µ(i) + ϵ(i)) Decθ(µ(i))||2 + ||Decθ(µ(i)) x(i)||2+ 2||Decθ(µ(i) + ϵ(i)) Decθ(µ(i))||||Decθ(µ(i)) x(i)||] (21) Why do Variational Autoencoders Really Promote Disentanglement? 
A.3. Proof of Proposition 1, and Lemmas 1, 2, 3 and 6

Proposition 1. Given L^{(i)}_{rec} := E_{ϵ^{(i)}}[ ||Dec_θ(µ^{(i)} + ϵ^{(i)}) − x^{(i)}||² ], and assuming that the stochastic estimate Dec_θ(µ^{(i)} + ϵ^{(i)}) is unbiased around Dec_θ(µ^{(i)}), L^{(i)}_{rec} can be decomposed into deterministic and stochastic parts: L^{(i)}_{rec} = L^{µ(i)}_{rec} + L^{stoch(i)}_{rec}, where

L^{stoch(i)}_{rec} := E_{ϵ^{(i)}} ||Dec_θ(Enc_ϕ(x^{(i)})) − Dec_θ(µ^{(i)})||²,    L^{µ(i)}_{rec} := ||Dec_θ(µ^{(i)}) − x^{(i)}||².      (7)

Proof. As defined, Enc_ϕ(x^{(i)}) = µ^{(i)} + ϵ^{(i)}, where ϵ^{(i)} is the Gaussian noise while µ^{(i)} is the deterministic part of the encoder derived from x^{(i)} as µ^{(i)} = f_E(M_E x^{(i)}). We write L^{(i)}_{rec} in the following way:

L^{(i)}_{rec} = E_{ϵ^{(i)}}[ ||Dec_θ(µ^{(i)} + ϵ^{(i)}) − Dec_θ(µ^{(i)}) + Dec_θ(µ^{(i)}) − x^{(i)}||² ]
= E_{ϵ^{(i)}}[ ||Dec_θ(µ^{(i)} + ϵ^{(i)}) − Dec_θ(µ^{(i)})||² ] + E_{ϵ^{(i)}}[ ||Dec_θ(µ^{(i)}) − x^{(i)}||² ]
  + 2 E_{ϵ^{(i)}}[ ⟨Dec_θ(µ^{(i)} + ϵ^{(i)}) − Dec_θ(µ^{(i)}), Dec_θ(µ^{(i)}) − x^{(i)}⟩ ].      (21)

Simplifying the terms in (21), we have

E_{ϵ^{(i)}}[ ||Dec_θ(µ^{(i)}) − x^{(i)}||² ] = ||Dec_θ(µ^{(i)}) − x^{(i)}||²,      (22)

and

2 E_{ϵ^{(i)}}[ ⟨Dec_θ(µ^{(i)} + ϵ^{(i)}) − Dec_θ(µ^{(i)}), Dec_θ(µ^{(i)}) − x^{(i)}⟩ ] = 2 ⟨ E_{ϵ^{(i)}}[Dec_θ(µ^{(i)} + ϵ^{(i)})] − Dec_θ(µ^{(i)}), Dec_θ(µ^{(i)}) − x^{(i)} ⟩.

Again, since the stochastic estimate Dec_θ(µ^{(i)} + ϵ^{(i)}) is unbiased around Dec_θ(µ^{(i)}), E_{ϵ^{(i)}}[Dec_θ(µ^{(i)} + ϵ^{(i)})] = Dec_θ(µ^{(i)}). Hence, the cross term vanishes:

2 ⟨ E_{ϵ^{(i)}}[Dec_θ(µ^{(i)} + ϵ^{(i)})] − Dec_θ(µ^{(i)}), Dec_θ(µ^{(i)}) − x^{(i)} ⟩ = 0.      (23)

Plugging (22) and (23) into (21), we get

L^{(i)}_{rec} = ||Dec_θ(µ^{(i)}) − x^{(i)}||² + E_{ϵ^{(i)}} ||Dec_θ(Enc_ϕ(x^{(i)})) − Dec_θ(µ^{(i)})||².

Hence, L^{(i)}_{rec} = L^{µ(i)}_{rec} + L^{stoch(i)}_{rec}.

Lemma 1. With the decoder approximated as locally non-linear, so that its stochastic part can be expressed as f_D(M_D ϵ^{(i)}) (Proposition 4), L^{stoch(i)}_{rec} can be expressed as follows:

L^{stoch(i)}_{rec} = Σ_j ( var[f_D(M_{Dj} ϵ^{(i)})] + f_D²(0) + f_D(0) f''_D(0) var[M_{Dj} ϵ^{(i)}] ),      (9)

where M_{Dj} denotes the j-th row of M_D.

Proof. Under the assumption that the decoder of the VAE is locally non-linear, so that Dec_θ(µ^{(i)} + ϵ^{(i)}) ≈ Dec_θ(µ^{(i)}) + f_D(M_D ϵ^{(i)}), L^{stoch(i)}_{rec} can be expressed as

L^{stoch(i)}_{rec} = E_{ϵ^{(i)}}[ ||Dec_θ(µ^{(i)}) + f_D(M_D ϵ^{(i)}) − Dec_θ(µ^{(i)})||² ] = E_{ϵ^{(i)}} ||f_D(M_D ϵ^{(i)})||².

Further, E_{ϵ^{(i)}}[ ||f_D(M_D ϵ^{(i)})||² ] can be expressed as

E_{ϵ^{(i)}}[ ||f_D(M_D ϵ^{(i)})||² ] = Σ_j E_{ϵ^{(i)}}[ (f_D(M_{Dj} ϵ^{(i)}))² ].

Given a random variable o, var(o) = E[o²] − (E[o])². Hence,

L^{stoch(i)}_{rec} = Σ_j ( var[f_D(M_{Dj} ϵ^{(i)})] + (E_{ϵ^{(i)}}[f_D(M_{Dj} ϵ^{(i)})])² ).      (24)

From Lem. 6,

E_{ϵ^{(i)}}[f_D(M_{Dj} ϵ^{(i)})] = f_D(E_{ϵ^{(i)}}[M_{Dj} ϵ^{(i)}]) + ( f''_D(E_{ϵ^{(i)}}[M_{Dj} ϵ^{(i)}]) / 2 ) E_{ϵ^{(i)}}[ (M_{Dj} ϵ^{(i)} − E_{ϵ^{(i)}}[M_{Dj} ϵ^{(i)}])² ].      (25)

From Lem. 9, E_{ϵ^{(i)}}[M_{Dj} ϵ^{(i)}] = 0. Plugging it into (25), we have

E_{ϵ^{(i)}}[f_D(M_{Dj} ϵ^{(i)})] = f_D(0) + ( f''_D(0)/2 ) E_{ϵ^{(i)}}[ (M_{Dj} ϵ^{(i)})² ].      (26)

Again, E_{ϵ^{(i)}}[ (M_{Dj} ϵ^{(i)})² ] = var[M_{Dj} ϵ^{(i)}] + (E_{ϵ^{(i)}}[M_{Dj} ϵ^{(i)}])² = var[M_{Dj} ϵ^{(i)}]. Substituting this value in (26), we get

E_{ϵ^{(i)}}[f_D(M_{Dj} ϵ^{(i)})] = f_D(0) + ( f''_D(0)/2 ) var[M_{Dj} ϵ^{(i)}].

Substituting into (24), and ignoring the higher-order term, we have

L^{stoch(i)}_{rec} = Σ_j ( var[f_D(M_{Dj} ϵ^{(i)})] + f_D²(0) + f_D(0) f''_D(0) var[M_{Dj} ϵ^{(i)}] ).

Lemma 2. Given the local decoder matrix M_D = U_D Σ_D V_D⊤, local encoder matrix M_E = U_E Σ_E V_E⊤, local decoder non-linearity g_D, and local encoder non-linearity g_E, the minimization of L^{µ(i)}_{rec} depends either only on V_E or only on U_D and f_D, i.e., fixing L^{µ(i)}_{rec} fixes V_E, U_D and f_D.

Proof. From Proposition 1, the loss function is L^{µ(i)}_{rec} = ||Dec_θ(µ^{(i)}) − x^{(i)}||². Given M_D = U_D Σ_D V_D⊤, M_E = U_E Σ_E V_E⊤, g_D and g_E,

L^{µ(i)}_{rec} = ||x^{(i)} − g_D(M_D g_E(M_E x^{(i)}))||².

This can further be expressed as ||x^{(i)} − F_D(F_D^{−1}(x^{(i)}))||², where F_D is defined through

F_D^{−1}(x^{(i)}) = g_E(M_E x^{(i)}),      (27)

so that

F_D(x^{(i)}) = M_E^+ g_E^{−1}(x^{(i)}),      (28)

where M_E^+ is the pseudo-inverse of M_E. Hence, the SVD decomposition of M_E^+ is

M_E^+ = V_E Σ_E^+ U_E⊤.      (29)

Substituting (27), (28), (29) into F_D(F_D^{−1}(x^{(i)})), we get

M_E^+ g_E^{−1}(g_E(M_E x^{(i)})) = M_E^+ M_E x^{(i)} = V_E Σ_E^+ U_E⊤ U_E Σ_E V_E⊤ x^{(i)} = V_E (Σ_E^+ Σ_E) V_E⊤ x^{(i)} = V_{E_d} V_{E_d}⊤ x^{(i)},

where V_{E_d} consists of the columns of V_E corresponding to the non-zero singular values. Hence, replacing M_E with V_{E_d} does not affect the loss; further, the loss function is not dependent on U_E and Σ_E.

Again, expressing the loss function ||x^{(i)} − g_D(M_D g_E(M_E x^{(i)}))||² as ||x^{(i)} − F_D(F_D^{−1}(x^{(i)}))||², F_D can instead be defined as

F_D(x^{(i)}) = g_D(M_D x^{(i)}).      (30)

Hence,

F_D^{−1}(x^{(i)}) = M_D^+ g_D^{−1}(x^{(i)}),      (31)

where M_D^+ is the pseudo-inverse of M_D. Using the SVD decomposition of M_D^+ gives

M_D^+ = V_D Σ_D^+ U_D⊤.      (32)
Substituting the (30), (31), (32) into FD(F 1 D (x(i))), g D(MDM + Dg 1 D (x(i))) = g D(UDΣDV T D VDΣ+ DU T Dg 1 D (x(i))) = g D(UDIU T Dg 1 D (x(i))) = g D(UDIn d Id n U T Dg 1 D (x(i))) = g D(UDn U T Dng 1 D (x(i))) Hence, replacing MD with UDn, does not change the loss. Again, the loss function is only dependent on g D and UD and not on VD and ΣD. Again, since f D is a polynomial function of g D, the loss is dependent on f D too. Hence, the minimization of the loss function depends either only on VE or only on UD, g D and f D, i.e., fixing Lµ(i) rec fixes VE, UD and f D. Lemma 3. Fixing the deterministic part of the reconstruction loss (Lµ(i) rec ) and assuming the VAE is operating in a polarized regime, LKL can be expressed as: j Va log(σ(i)2 x(i) X L(i) KL Proof. Operating the VAE to operate in a polarized regime, the passive latent variables are ignored, and for the active latent variables, σ2 j (x(i)) log(σ(i)2 j ), which simplifies the KL loss function to: L(i) KL = 1 j Va (µ(i)2 j log(σ(i)2 j ) 1) (33) From Lem. 2, the Lµ(i) rec depends either only on VE or only on f D and UD, and not on the entire g E(MEx(i)). Hence, fixing it fixes VE, UD, and f D. The matrices VD, ΣD, UE and ΣE are not fixed and minimizing the stochastic loss under the constraint of L(i) KL forces constraints on these matrices. Fixing VE and UD, and for a fixed i, 33 can be written as follows: L(i) KL = ||µ(i)||2 + X j Va log(σ(i)2 Again, since UD is fixed, µ(i) can only be affected by VD. However, since, VD is orthogonal and hence norm-preserving, ||µ(i)||2 is fixed. As a result, the only portion of LKL that affects the minimization of the stochastic reconstruction loss can be expressed as: LKL = X j Va log(σ(i)2 Lemma 6. The expectation of a function f D(MDjϵ(i)), Eϵ(i)[f D(MDjϵ(i))] can be approximately written in terms of the expectation of MDjϵ(i) as follows: Eϵ(i)[f D(MDjϵ(i))] = f D(Eϵ(i)[MDjϵ(i)]) +f D(Eϵ(i)[MDjϵ(i)]) 2 Eϵ(i)[(MDjϵ(i) Eϵ(i)[MDjϵ(i)])2] Proof. Using Chebyshev s inequality, where P stands for probability, we have, P(|MDjϵ(i) Eϵ(i)[MDjϵ(i)]| > a) var(MDjϵ(i)) Why do Variational Autoencoders Really Promote Disentanglement? Hence, given any δ > 0, we can find an a, such that, P(MDjϵ(i) [E[MDjϵ(i)] a, E[MDjϵ(i)] + a] = P(|MDjϵ(i) E[MDjϵ(i)]| a) < 1 δ (35) Calculating E[f D(MDjϵ(i))], we have, E[f D(MDjϵ(i))] = Z |x E[MDjϵ(i)]| a f D(x)d FD(x)+ |x E[MDjϵ(i)]|>a f D(x)d FD(x) (36) where, FD(x) is the distribution function for x. Since the domain of the first integral, [E[MDjϵ(i)] a, E[MDjϵ(i)] + a], is bounded and closed, Taylor Expansion can be applied in that interval. Applying Taylor s expansion in the interval [E[MDjϵ(i)] a, E[MDjϵ(i)] + a], and considering only the first four terms, we have, f D(x) = f D(E[MDjϵ(i)]) + f D(E[MDjϵ(i)])(x E[MDjϵ(i)]) +f D(E[MDjϵ(i)]) 2 (x E[MDjϵ(i)])2 3! (x E[MDjϵ(i)])3 where U [E[MDjϵ(i)] a, E[MDjϵ(i)] + a]. Substituting (37) in (36), we get, E[f D(MDjϵ(i))] = Z |x E[MDjϵ(i)]| a f D(E[MDjϵ(i)]) + f D(E[MDjϵ(i)])(x E[MDjϵ(i)]) +f D(E[MDjϵ(i)]) 2 (x E[MDjϵ(i)])2 d FD(x) + Z |x E[MDjϵ(i)]| a 3! (x E[MDjϵ(i)])3d FD(x)+ |x E[MDjϵ(i)]|>a f D(x)d FD(x) The interval or the domain is increased by increasing the value of a, i.e., a . From (34) and (35), P(|MDjϵ(i) E[MDjϵ(i)]| > a) 0 and P(MDjϵ(i) [E[MDjϵ(i)] a, E[MDjϵ(i)] + a]) 1. This simplifies (38) as follows: E[f D(MDjϵ(i))] Z + f D(E[MDjϵ(i)]) + f D(E[MDjϵ(i)])(x E[MDjϵ(i)]) +f D(E[MDjϵ(i)]) 2 (x E[MDjϵ(i)])2 d FD(x) + Z + 3! (x E[MDjϵ(i)])3d FD(x) |x E[MDjϵ(i)]|>a f D(x)d FD(x) 0. 
Simplifying the terms in (39) further, we get, Z + f D(E[MDjϵ(i)])d FD(x) = f D(E[MDjϵ(i)]) Z + d FD(x) = f D(E[MDjϵ(i)]) (40) f D(E[MDjϵ(i)])(x E[MDjϵ(i)])d FD(x) = f D(E[MDjϵ(i)]) Z + xd FD(x) (E[MDjϵ(i)]) = 0 (41) f D(E[MDjϵ(i)]) 2 (x E[MDjϵ(i)])2d FD(x) = f D(E[MDjϵ(i)]) (x E[MDjϵ(i)])2d FD(x) = f D(E[MDjϵ(i)]) 2 E[(MDjϵ(i) E[MDjϵ(i)])2] Why do Variational Autoencoders Really Promote Disentanglement? Substituting (40), (41) and (42) in (39), we get, E[f D(MDjϵ(i))] = f D(E[MDjϵ(i)]) + f D(E[MDjϵ(i)]) 2 E[(MDjϵ(i) E[MDjϵ(i)])2] + RE (43) Ignoring the higher order integral, we get, E[f D(MDjϵ(i))] = f D(E[MDjϵ(i)]) + f D(E[MDjϵ(i)]) 2 E[(MDjϵ(i) E[MDjϵ(i)])2] (44) A.4. The Proof of the proposed Theorem-1 and the relevant Lemmas Theorem 1. Given independent data samples x(i), if we fix the P x(i) X L(i) KL for a constant C(i) KL, and P x(i) X Lµ(i) rec , then the minimization of the VAE loss L in (5) reduces to the minimization of the stochastic reconstruction loss Lstoch(i) rec : min σ(i) j >0,VD, x(i) X log Lstoch(i) rec s.t. X x(i) X L(i) KL = CKL. (10) Then, the following hold for the local minima: (a) Every local minimum is a global minimum. (b) In every global minimum, the columns of every MD are orthogonal. Further, the variance of a latent variable is inversely proportional to the norm of the corresponding column in the linear part of the local decoder: j 1 ||cj||2 i where cj is the j-th column of MD. Proof. Proof of part (b): From Lemma 1 we have that Lstoch(i) rec = n var[f D(MDjϵ(i))] + f 2 D(0) + f D(0)f D(0)var[MDjϵ(i)] o (45) where MDj is the j-th row and n is the total number of rows in MD. Then from Lemma 7, we know that var[f D(MDjϵ(i))] = f D(Eϵ(i)(MDjϵ(i))) 2 var[MDjϵ(i)] (46) Again, from Lemma 8, we know that var[MDjϵ(i)] = k=1 a2 j,kσ(i)2 where d is the total number of columns in MD. From Lemma 9, Ei[MDjϵ(i)] = 0 (48) Plugging (47) and (48) into (46), we have var[f D(MDjϵ(i))] = [f D(0)]2 d X k=1 a2 j,kσ(i)2 Why do Variational Autoencoders Really Promote Disentanglement? Plugging (49) into (45), we have Lstoch(i) rec = n [f D(0)]2 + f D(0)f D(0) o d X k=1 a2 j,kσ(i)2 k + nf 2 D(0) = n [f D(0)]2 + f D(0)f D(0) o n X k=1 a2 j,kσ(i)2 k + nf 2 D(0) Let ck be the k-th column of the MD matrix. We know that j=1 a2 j,k (51) Substituting (51) in (50), we have Lstoch(i) rec = [f D(0)]2 + f D(0)f D(0) d X k=1 ||ck||2σ(i)2 k + nf 2 D(0) Since nf 2 D(0) is a constant for a fixed x(i), minimizing Lstoch(i) rec does not depend on it. Hence, we minimize Dstoch(i) rec = Lstoch(i) rec nf 2 D(0). The value of the minima of the two functions would differ by nf 2 D(0). Using the AM-GM inequality on Dstoch(i) rec , we have Dstoch(i) rec = [f D(0)]2 + f D(0)f D(0) d X k=1 ||ck||2σ(i)2 [f D(0)]2 + f D(0)f D(0) d k=1 ||ck||2σ(i)2 with equality if and only if ||cj||2 ||ck||2 = σ(i)2 for any j and k in {1, . . . , d}. Taking the logarithm, we have log(Dstoch(i) rec ) = log [f D(0)]2 + f D(0)f D(0) + log(d) + 1 k=1 log(σ(i)2 Taking the summation over all the values of xi X, we have x(i) X log(Dstoch(i) rec ) = X x(i) X log [f D(0)]2 + f D(0)f D(0) + N log(d) CKL From Lemma 10, we have X x(i) X log(Dstoch(i) rec ) X x(i) X log [f D(0)]2 + f D(0)f D(0) + N log(d) CKL x(i) X log(Sing(MD)) (53) It can be seen that the RHS of the above equation does not depend on σ(i)2 k s. Also, it is independent of the orthogonal matrices VD as these do not influence the singular values of MD. Why do Variational Autoencoders Really Promote Disentanglement? 
However, according to Lemma 10, to minimize the LHS, we set the matrices VD such that the columns of MD are orthogonal to each other. Also, the values of σ(i)2 k s are chosen so as to achieve equality in the AM-GM inequality. Hence, in every global minimum, the columns of MD are orthogonal. Proof of part (c): Part (c) of the Theorem follows directly from (52). Proof of part (a): To prove part (a), we show that at least one small change can be made in the objective function that would minimize it until it has reached the global minimum, which is the RHS of (53). Since the RHS of (53) is dependent only on the AM-GM inequality and Lemma 10, it suffices to show that local minima do not exist for both of these inequalities and that at least one small perturbation would always improve the LHS of these inequalities until they become equal to their RHS, which is their global minimum and that no other local minima exist. We have shown this in Lemma 11 and Lemma 12. Lemma 7. The variance of a function f D(MDjϵ(i)), denoted as var[f D(MDjϵ(i))], can be expressed in terms of the variance of MDjϵ(i) as follows: var[f D(MDjϵ(i))] = (f D(Eϵ(i)[MDjϵ(i)]))2 var[MDjϵ(i)] Proof. From (43), in Lem. 6, we get, E[f D(MDjϵ(i))] = f D(E[MDjϵ(i)]) + f D(E[MDjϵ(i)]) 2 E[(MDjϵ(i) E[MDjϵ(i)])2] + RE where RE is the remaining integral. Again, f 2 D(x) = f 2 D(E[MDjϵ(i)]) + 2f D(E[MDjϵ(i)])f D(E[MDjϵ(i)])(x E[MDjϵ(i)])+ [(f D(E[MDjϵ(i)])2 + f D(E[MDjϵ(i)])f D(E[MDjϵ(i)])](x E[MDjϵ(i)])2+ 3! (x E[MDjϵ(i)])3 From Lem. 6, taking expectation over f 2 D, we have, E[f 2 D(MDjϵ(i))] = f 2 D(E[MDjϵ(i)]) + [(f D(E[MDjϵ(i)]))2 + f D(E[MDjϵ(i)])f D(E[MDjϵ(i)])] E[(MDjϵ(i) E[MDjϵ(i)])2] + RE As we know V ar(MDjϵ(i)) = E[(MDjϵ(i))2] (E[MDjϵ(i)])2. Evaluating, we have, V ar(f D(MDjϵ(i))) = (f D(E[MDjϵ(i)]))2V ar(MDjϵ(i)) + f D(E[MDjϵ(i)]) 4 V ar2(MDjϵ(i)) + TE Approximating to only the first term, we have, V ar(f D(MDjϵ(i))) (f D(E[MDjϵ(i)]))2V ar(MDjϵ(i)) Lemma 8. The variance of MDjϵ(i), var[MDjϵ(i)], can be expressed as follows : var[MDjϵ(i)] = k=1 a2 j,kσ(i)2 where MDj are the rows of the decoder matrix MD and aj,k is the element in the j-th row and k-th column of MD. Why do Variational Autoencoders Really Promote Disentanglement? Proof. We know that, var(c B) = c2var(B), where, c is a constant. var(MDjϵ(i)) = var( k=1 ajkϵ(i) k ) = k=1 a2 jkvar(ϵ(i) k ) = k=1 a2 jkσ(i)2 Lemma 9. The expectation of MDjϵ(i), Eϵ(i)(MDjϵ(i)) can be expressed as follows: Eϵ(i)(MDjϵ(i)) = 0 Proof. For a Gaussian distribution, D N(µ, σ2), E[D] = µ. Again, ϵ(i) k N(0, σ(i)2 k ), Eϵ(i) k [ϵ(i) k ] = 0. Hence, Eϵ(i)[MDjϵ(i)] = Eϵ(i)[( k=1 ajkϵ(i) k )] = k=1 ajk Eϵ(i) k [ϵ(i) k ] = 0 Proposition 5. For a matrix MD Rn d with SVD MD = UDΣDV D , the following statements are equivalent. a) The columns of MD are pairwise orthogonal. b) The matrix M DMD is diagonal. c) The columns of ΣDV D are pairwise orthogonal. Proof. The statements (a) and (b) are equivalent as the columns of MD, ci are orthogonal and hence c i cj = 0 i = j while c i ci = 0. The equivalence of statements (a) and (c) can be proved as follows. Suppose we define M D = ΣDV D . We can see that, M D M D = VDΣ DΣDV D = VDΣ DU DUDΣDV D = M DMD Since, the columns of MD are orthogonal, from the equivalence of (a) and (b), M DMD is a diagonal matrix which also implies that M D MD is diagonal. Again from the equivalence of (a) and (b) the columns of M D are orthogonal. Lemma 10. Let MD Rn d be a matrix where, d < n, be a matrix with column vectors c1...cd and non-zero singular vectors s1...sd. 
It can be claimed that

∏_{j=1}^{d} ||c_j|| ≥ Sing(M_D),

where Sing(M_D) is the product of the singular values of M_D. The condition for equality is that c_1, …, c_d are pairwise orthogonal.

Proof. Suppose M_D = U_D Σ_D V_D⊤. We first show that multiplying M_D on the left by U_D⊤ changes neither side of the inequality. For the RHS, the singular values of U_D⊤ M_D are the same as those of M_D. For the LHS, the c_j are the images of the canonical basis vectors e_j, since c_j = M_D e_j for j ∈ {1, …, d}. However, U_D being an orthogonal matrix and hence an isometry, we get ||U_D⊤ M_D e_j|| = ||M_D e_j|| = ||c_j|| for all j, and hence the column norms of U_D⊤ M_D are the same as the column norms of M_D.

Since both sides of the inequality are invariant to multiplication by U_D⊤, we restrict to matrices M_D that can be expressed as M_D = Σ_D V_D⊤. Since we assume d < n, the d × d upper-left submatrix Σ_{DD_d} of Σ_D contains all the non-zero singular values. We define M^{sq}_D = Σ_{DD_d} V_D⊤, which is the square matrix consisting of all the non-zero rows of the matrix M_D. This implies

(M^{sq}_D)⊤ M^{sq}_D = M_D⊤ M_D.

In particular, the column norms of M_D and M^{sq}_D are equal, and hence, according to Hadamard's inequality, we have

∏_{k=1}^{d} ||c_k|| = ∏_{k=1}^{d} ||c^{sq}_k|| ≥ |det(M^{sq}_D)| = Sing(M_D),

where the c_k are the columns of M_D and the c^{sq}_k are the columns of M^{sq}_D. Further, according to Hadamard's inequality, equality occurs only when the columns of M^{sq}_D are orthogonal, which in turn implies that the columns of M_D are orthogonal; by Proposition 5 this extends to all matrices M_D and not just the simplified ones assumed above. Hence, the equality holds when the columns of any matrix M_D are orthogonal.

Lemma 11. Given non-negative values a_1, a_2, …, a_N for which

(1/N) Σ_{i=1}^{N} a_i > ( ∏_{i=1}^{N} a_i )^{1/N},

there exists at least one perturbation a'_i of the a_i such that

Σ_{i=1}^{N} a'_i < Σ_{i=1}^{N} a_i   and   ( ∏_{i=1}^{N} a'_i )^{1/N} = ( ∏_{i=1}^{N} a_i )^{1/N}.      (54)

Proof. Since the arithmetic mean strictly exceeds the geometric mean, the a_i are not all equal, so we can select indices i ≠ j with a_i > a_j, and a δ with 0 < δ ≪ 1 small enough that a_i > a_j(1 + δ). We set a'_i = a_i/(1 + δ) and a'_j = a_j(1 + δ), and for all other k ∈ {1, …, N}, a'_k = a_k. We can see that

a_i + a_j − a'_i − a'_j = a_i δ/(1 + δ) − a_j δ = δ ( a_i − a_j(1 + δ) ) / (1 + δ) > 0.

Hence, we have a_i + a_j > a'_i + a'_j. Again, we can see that a'_i a'_j = a_i a_j, so the product, and hence the geometric mean, is unchanged. This ensures that at least one small perturbation of the a_i satisfies (54).

Lemma 12. For a matrix M_n ∈ R^{n×n} with SVD M_n = UΣV⊤ and column vectors c_1, c_2, …, c_n, for which

∏_{j=1}^{n} ||c_j|| > |det(M_n)|,      (55)

there exists at least one V', which is a small perturbation of V, such that the matrix M'_n = UΣV'⊤ with columns c'_1, c'_2, …, c'_n satisfies

∏_{j=1}^{n} ||c_j|| > ∏_{j=1}^{n} ||c'_j||.      (56)

Proof. We establish the proof by induction on n. For n = 2, we define a rotation matrix R_θ, where θ is the angle of rotation. By setting V' = V R_θ for a suitable small θ, we can verify that (56) is satisfied. For n > 2, we can see that (55) implies c_i⊤ c_j ≠ 0 for some i ≠ j. Let us assume that i = 1 and j = 2 in this case, and let c'_k = c_k for k > 2. We now consider the block-diagonal matrix

R^{2D}_θ = [ R_θ  0 ; 0  I_{n−2} ],

where R_θ is the rotation matrix defined previously. Further, since U can be set to I_n, as neither side of (55) is influenced by an isometry, we can reduce the above situation to the n = 2 case. Hence, by induction, the proof is complete.
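Lemma 10 is easy to verify numerically. The sketch below (illustrative only; Sing(M_D) is taken to be the product of the singular values, as in the lemma) compares the product of column norms with the product of singular values for a generic random matrix and for a matrix with orthogonal columns, where the bound is tight.

```python
# Numerical check of Lemma 10 (Hadamard-type inequality on column norms).
import numpy as np

rng = np.random.default_rng(2)
n, d = 8, 4

# Generic matrix: strict inequality expected.
M = rng.normal(size=(n, d))
col_norm_prod = np.prod(np.linalg.norm(M, axis=0))
sing_prod = np.prod(np.linalg.svd(M, compute_uv=False))
print(col_norm_prod >= sing_prod)              # True

# Matrix with orthogonal columns: the two products coincide.
Q, _ = np.linalg.qr(rng.normal(size=(n, d)))   # orthonormal columns
M_orth = Q * np.array([0.5, 1.0, 2.0, 3.0])    # rescale columns, keep them orthogonal
print(np.prod(np.linalg.norm(M_orth, axis=0)),
      np.prod(np.linalg.svd(M_orth, compute_uv=False)))
```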
A.5. Proof of Lem. 4

Lemma 4. Given M_D = U_D Σ_D V_D⊤, such that the columns of M_D are orthogonal, and M_D has unique non-zero singular values, the following hold: (a) U_D is an orthogonal matrix, (b) the diagonal elements of Σ_D are the norms of the columns of M_D, and (c) V_D = I.

Proof. From Proposition 5, M_D⊤ M_D is a diagonal matrix. Moreover, its diagonal elements are c_k⊤ c_k = ||c_k||²_2. Furthermore, from the characteristic equation of the matrix M_D⊤ M_D, the eigenvalues of M_D⊤ M_D are ||c_k||²_2 for k ∈ {1, …, d}. Hence the singular values of M_D are ||c_k||_2, which proves part (b) of the lemma.

Again, solving M_D⊤ M_D x_k = λ_k x_k, and given that the singular values are non-zero and unique, the solutions consist only of the respective canonical basis vectors e_k. Hence, the matrix V_D = I, which proves part (c) of the lemma.

Further, we use the equation M_D = U_D Σ_D V_D⊤. Since V_D⊤ = I, the RHS becomes U_D Σ_D, and the individual elements of the product matrix are u_{ij} ||c_j||_2, where the u_{ij} are the elements of the matrix U_D. Also, let the individual elements of the matrix M_D be a_{ij}. Hence u_{ij} = a_{ij} / ||c_j||_2, from which it is clear that the columns of U_D are normalized. Also, the columns of U_D are orthogonal to each other, as the columns of M_D are orthogonal to each other. Hence, the matrix U_D is orthogonal. This proves part (a) of the lemma.

A.6. Proof of Lem. 5

Lemma 5. Given a fixed L^{stoch(i)}_{rec}, orthogonality in the linear component of the decoder's function transformation, M_D, promotes a lower L_{KL}.

Proof. Let M_D be a non-orthogonal decoder matrix, and let M_{D_{ortho}} be the matrix with mutually orthogonal columns obtained as the solution to the optimization in Theorem 1 by setting the matrix V_D; M_D and M_{D_{ortho}} then share the same singular values, so Sing(M_D) = Sing(M_{D_{ortho}}). From (53), with D^{stoch(i)}_{rec} = L^{stoch(i)}_{rec} − n f²_D(0) as in Theorem 1, the lower bound on Σ_{x^{(i)}∈X} log(D^{stoch(i)}_{rec}) is attained with equality for M_{D_{ortho}}, since both the AM-GM inequality and Hadamard's inequality (Lemma 10) are tight in that case. Holding L^{stoch(i)}_{rec} fixed and comparing the two resulting expressions for the KL constants therefore gives

C_{KL} − C_{KL_{ortho}} = 2 ( Σ_{k=1}^{d} log ||c_k|| − log(Sing(M_{D_{ortho}})) ) ≥ 2 ( log(Sing(M_D)) − log(Sing(M_{D_{ortho}})) ),

where the c_k are the columns of M_D. Since M_D and M_{D_{ortho}} share the same singular values, the term on the RHS is non-negative, while the first expression is strictly positive whenever the columns of M_D are not orthogonal (Lemma 10). Also, due to the polarized regime, L_{KL} > 0, and hence C_{KL} and C_{KL_{ortho}} are positive. Hence, we have

C_{KL} − C_{KL_{ortho}} > 0.

Hence, the L_{KL_{ortho}} corresponding to the decoder M_{D_{ortho}} has a lower positive value than the L_{KL} corresponding to the decoder M_D, proving our lemma.

A.7. Proof of Theorem 2

Theorem 2. For a VAE, given z^{(i)} ∼ Enc_ϕ(x^{(i)}) and z^{(k)} ∼ Enc_ϕ(x^{(k)}), where the x^{(k)} are the k^{(i)} nearest neighbours of x^{(i)}, we define Dist(L_{KL}) as follows:

Dist(L_{KL}) = E_{x^{(i)}, z^{(i)}, z^{(k)}} [ Σ_k ||z^{(i)} − z^{(k)}||² ].

The following hold: (a) Given that Enc_ϕ(x^{(i)}) ∼ q_ϕ(z^{(i)}|x^{(i)}) overlaps (is close) with k^{(i)} posterior probabilities, they must be the posterior probabilities generated by the k^{(i)} nearest neighbours of x^{(i)} in X, i.e. Enc_ϕ(x^{(k)}) ∼ q_ϕ(z^{(k)}|x^{(k)}). Here, for every x^{(i)}, we have k^{(i)} points x^{(k)} whose posterior probabilities q_ϕ(z^{(k)}|x^{(k)}) overlap with the posterior probability q_ϕ(z^{(i)}|x^{(i)}) in the latent space. (b) Given L'_{KL} < L_{KL}, Dist(L'_{KL}) < Dist(L_{KL}).

Proof. For a VAE, the loss L is defined as L = L_{rec} + β L_{KL}, where L_{rec} and L_{KL} are defined as

L_{rec} = Σ_{x^{(i)}∈X} ||Dec_θ(Enc_ϕ(x^{(i)})) − x^{(i)}||² = Σ_{x^{(i)}∈X} ||x̃^{(i)} − x^{(i)}||²,      (57)

L_{KL} = Σ_{x^{(i)}∈X} (1/2) Σ_{j∈V_a} ( µ^{(i)2}_j − log(σ^{(i)2}_j) − 1 ).      (58)

To minimize the loss L for a given L_{KL}, the architecture aligns the latent space such that L_{rec} can be minimized. From (58), decreasing L_{KL} causes the means µ^{(i)}_j to approach 0 (decrease) and the variances σ^{(i)}_j to approach 1. The decreased means and the broadened variances cause the posterior probabilities q_ϕ(z^{(i)}|x^{(i)}) to overlap.
We consider a random point x(i) X, with posterior probability Encϕ(x(i)) qϕ(z(i)|x(i)). Considering posterior probabilities qϕ(z(j)|x(j))s which overlap with Encϕ(x(i)) qϕ(z(i)|x(i)), we define z as follows: Z = {z Encϕ(x(i))|z z(i) sam z(j) sam, z(i) sam qϕ(z(i)|x(i)), z(j) sam qϕ(z(j)|x(j)) j z(i) sam z(j) sam = ϕ} Hence, z can be represented in 2 different ways as follows: z = µ(i) + ϵ(i) = µ(j) + ϵ(j) Given an optimal decoder, Decθ(z) = x(i) or Decθ(z) = x(j)(i), where x(j)(i) is wrongly regenerated by Decθ as x(j) instead of x(i). Since, z Encϕ(x(i)), Decθ(z) = x(j)(i) generates reconstruction loss || x(j)(i) x(i)||2. From Proposition 6, the loss || x(j)(i) x(i)||2 would be minimum when x(j) is the nearest element to x(i) in X. In (57), given ideal decoder, || x(i) x(i)||2 = 0 only when z(i) z and Decθ(z(i)) = x(j)(i). Hence, (57) simplifies to x(i)) X x(j) =x(i) || x(j)(i) x(i)||2 Further considering that each x(i) is sampled multiple times, Lrec can be expressed as , x(i)) X x(j) =x(i) || x(j)(i) x(i)||2 = X x(i)) X x(j) =x(i) l=0 || x (j)(i) l x(i)||2 (59) Why do Variational Autoencoders Really Promote Disentanglement? where ki = |z(i) Z| x(i) X and x (j)(i) l is the l-th x(j)(i). From (59), given ideal Encϕ and Decθ, Lrec depends only on x (j)(i) l = Decθ(z(i)) where z(i) Z. From Lem. 13, X k = S|X| i=1 X (i) k(i) minimizes Lrec, where X (i) k(i) is the set of k(i) ki nearest elements to any given x(i) X. Hence, the overlap between qϕ(z(i)|x(i)) and the k(i) posterior probabilities, must be posterior probabilities generated by the k(i) nearest neighbours of x(i) in X. This proves part (a) of the Theorem. As LKL decreases, since the overlap between qϕ(z(i)|x(i)) and qϕ(z(k)|x(k)) k(i) nearest neighbours of x(i) increases, given z(i) qϕ(z(i)|x(i)) and z(k) qϕ(z(k)|x(k)), Ex(i),z(i),z(k)[P k ||z(i) z(k)||2] decreases. Hence, given LKL < LKL, Dist(LKL ) < Dist(LKL). This conclusively proves part (b) of the Theorem. Lemma 13. For x(i) X (data space), the number of elements in the dataset N = |X|, an ideal VAE consisting of Encϕ(x(i)), Decθ(z(i)), z(i) qϕ(z(i)|x(i)), x(i) = Decθ(z(i)) and a set of numbers ki < |X| where i = {1, 2...N}, given, x(j) l -s X s.t. x(j) l = x(i) for the optimization problem, l=0 || x(j) l x(i)||2 (60) the following hold: X k = S|X| i=1 X (i) k(i) is the solution set to the optimization, where X (i) k(i) is the set of k(i) ki nearest elements to x(i) in X. Proof. First, solve the optimization problem min x(j) l Pki l=0 || x(j) l x(i)||2 for a random x(i) X. Given an ideal VAE, x(i) = x(i) and hence, || x(i) x(i)||2 = 0. We use induction to prove that X (i) k(i) contains the k(i) nearest elements to x(i) in X. From Proposition 6, the element closest to x(i) in X (say x(nearest X)) has the lowest value for || x(j) l x(i)||2. We prove the rest by induction. Base Case: Removing x(nearest X) from X, we have X 1 = X \ x(nearest X). From Proposition 6, the element closest to x(i) in X 1 (say x(nearest X 1)) has the lowest value for || x(j) l x(i)||2. This is also the second nearest element to x(i) in X. Induction Hypothesis: Removing d nearest elements to x(i) from X, generates X d, where the (d + 1)-th nearest element to x(i), x(nearest X d) in X, generates the lowest value for || x(j) l x(i)||2. Induction Step: We remove the (d + 1)-th nearest element from X d, x(nearest X d) (from Induction Hypothesis) to generate X (d+1). From Proposition 6, the element closest to x(i) in X (d+1) (say x(nearest X (d+1))) has the lowest value for || x(j) l x(i)||2. 
This is also the (d + 1)-th nearest element to x^{(i)} in X. In the case when there are no repeated elements, i.e., x^{(nearest X)}, x^{(nearest X_1)}, …, x^{(nearest X_d)}, x^{(nearest X_{(d+1)})} are all distinct, these d nearest elements form the solution set X^{(i)}_d. However, in the case of repetition of a single element, i.e., x^{(nearest X_i)} = x^{(nearest X_j)} (say), there exists a solution set X^{(i)}_{d'} with d' < d. Hence, given any x^{(i)}, X^{(i)}_{k^{(i)}} with k^{(i)} ≤ k_i is the solution set for the optimization problem min Σ_{l=0}^{k_i} ||x̃^{(j)}_l − x^{(i)}||². Hence, X_k = ∪_{i=1}^{|X|} X^{(i)}_{k^{(i)}} is the solution set to the optimization problem (60).

Proposition 6. Given x^{(i)} ∈ X (the data space), an ideal encoder Enc_ϕ(x^{(i)}), an ideal decoder Dec_θ(z^{(i)}), z^{(i)} ∼ q_ϕ(z^{(i)}|x^{(i)}), x̃^{(i)} = Dec_θ(z^{(i)}), and an element x^{(k)} such that

||x^{(k)} − x^{(i)}||² < ||x^{(j)} − x^{(i)}||²   for all j ≠ k, j ≠ i,      (61)

then the following hold: (a) x^{(k)} is the element in X nearest to x^{(i)}; (b) ||x̃^{(k)} − x^{(i)}||² < ||x̃^{(j)} − x^{(i)}||².

Proof. Part (a) of the proposition follows directly from (61). Given an ideal encoder and decoder, ||x̃^{(k)} − x^{(k)}||² = 0, hence x̃^{(k)} = x^{(k)}; similarly, x̃^{(j)} = x^{(j)}. Hence, in (61) we can replace x^{(k)} by x̃^{(k)} and x^{(j)} by x̃^{(j)}, giving us part (b) of the proposition.

A.8. Appendix Related to the Experiments

A.8.1. NETWORK ARCHITECTURES USED IN EXPERIMENTAL SETUP

Table 3 summarizes the network architectures and the implementation details of the models that we trained for each of the datasets.

Table 3. The architectures of the VAE-based models used for the different datasets.

dSprites
  Encoder: 1200 (ReLU) - 1200 (ReLU) - Latent Space (ReLU)
  Decoder: 1200 (ReLU) - 1200 (ReLU) - 4096 (ReLU)
  β: 6
  Optimizer: Adam (lr = 10⁻³)

3DFaces
  Encoder: Conv [32, 4, 2, 1] (BN) (ReLU), [32, 4, 2, 1] (BN) (ReLU), [64, 4, 2, 1] (BN) (ReLU), [64, 4, 2, 1] (BN) (ReLU), [512, 4, 1, 0] (BN) (ReLU), [Latent Space, 1, 1] (BN) (ReLU)
  Decoder: ConvTrans [512, 1, 1, 0] (BN) (ReLU), [64, 4, 2, 1] (BN) (ReLU), [64, 4, 2, 1] (BN) (ReLU), [32, 4, 2, 1] (BN) (ReLU), [1, 4, 2, 1] (BN) (ReLU)
  β: 6
  Optimizer: Adam (lr = 10⁻³, betas = (0.9, 0.999))

3DShapes
  Encoder: Conv [32, 4, 2, 1] (ReLU), [32, 4, 2, 1] (ReLU), [64, 4, 2, 1] (ReLU), [64, 4, 2, 1] (BN) (ReLU), [256, 4, 1, 0] (BN) (ReLU), [Latent Space, 1, 1] (ReLU)
  Decoder: Conv [64, 1, 1, 0] (ReLU), ConvTrans [64, 4, 1, 0] (ReLU), [64, 4, 2, 1] (ReLU), [64, 4, 2, 1] (ReLU), [32, 4, 2, 1] (ReLU), [32, 4, 2, 1] (ReLU), [3, 4, 2, 1]
  β: 6
  Optimizer: Adam (lr = 10⁻³, betas = (0.9, 0.999))

MPI3D complex
  Encoder: Conv [32, 4, 2, 1] (ReLU), [32, 4, 2, 1] (ReLU), [64, 4, 2, 1] (ReLU), [64, 4, 2, 1] (BN) (ReLU), [256, 4, 1, 0] (BN) (ReLU), [Latent Space, 1, 1] (ReLU)
  Decoder: Conv [64, 1, 1, 0] (ReLU), ConvTrans [64, 4, 1, 0] (ReLU), [64, 4, 2, 1] (ReLU), [64, 4, 2, 1] (ReLU), [32, 4, 2, 1] (ReLU), [32, 4, 2, 1] (ReLU), [3, 4, 2, 1]
  β: 6
  Optimizer: Adam (lr = 10⁻³, betas = (0.9, 0.999))
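For reference, a PyTorch sketch of the fully connected dSprites model from Table 3 is given below. The layer widths follow the table; the latent dimensionality, the split of the encoder output into mean and log-variance heads, and the output reshaping are illustrative assumptions rather than details taken from the released code.

```python
# Sketch of the fully connected beta-VAE used for dSprites (Table 3).
import torch
import torch.nn as nn

LATENT = 10  # assumed latent dimensionality

class Encoder(nn.Module):
    def __init__(self, latent=LATENT):
        super().__init__()
        self.body = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 64, 1200), nn.ReLU(),
            nn.Linear(1200, 1200), nn.ReLU(),
        )
        # Table 3 lists a ReLU on the latent layer; here we use plain linear heads
        # for the Gaussian parameters (an assumption made for clarity).
        self.mu = nn.Linear(1200, latent)
        self.logvar = nn.Linear(1200, latent)

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.logvar(h)

class Decoder(nn.Module):
    def __init__(self, latent=LATENT):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(latent, 1200), nn.ReLU(),
            nn.Linear(1200, 1200), nn.ReLU(),
            nn.Linear(1200, 4096), nn.ReLU(),   # 4096 = 64 x 64 output pixels
        )

    def forward(self, z):
        return self.body(z).view(-1, 1, 64, 64)

enc, dec = Encoder(), Decoder()
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
beta = 6  # KL weight from Table 3
```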
A.8.2. COMPARING JACOBIAN APPROXIMATION OF STOCHASTIC LOSS WITH LOSS CALCULATED FROM APPROXIMATED LOCAL DECODER

In this experiment, we compare two approximations: the linearization approximation of (Rolinek et al., 2019), where L^{stoch(i)}_J = Jϵ^{(i)}, and our modeling, where L^{stoch(i)}_{dec} = g_D(M_D ϵ^{(i)}). The comparison focuses on determining which approximate loss is closer to the actual stochastic loss, denoted ˆL^{stoch(i)}_{rec}. A validation set x^{(i)} ∈ X_{val} is defined. For each x^{(i)}, g_D and M_D are estimated as neural networks, considering that these are local approximations unique to each x^{(i)}. Equation (11) is employed to train these networks, as detailed in Sect. 4.4. Subsequently, we compute ˆL^{stoch}_{rec} = Σ_{x^{(i)}∈X_{val}} ˆL^{stoch(i)}_{rec}, followed by the calculation of L^{stoch}_J = Σ_{x^{(i)}∈X_{val}} L^{stoch(i)}_J and L^{stoch}_{dec} = Σ_{x^{(i)}∈X_{val}} L^{stoch(i)}_{dec}. The final step involves comparing the squared errors δ_{dec} = ||L^{stoch}_{dec} − ˆL^{stoch}_{rec}||² and δ_J = ||L^{stoch}_J − ˆL^{stoch}_{rec}||² to ascertain the more accurate approximation. Table 2 summarizes the two differences and demonstrates that L^{stoch}_{dec} is a much better approximation than L^{stoch}_J.

A.8.3. CALCULATING THE VALUES ÛD, Σ̂D FOR DETERMINING THE OD-SCORE FOR DIFFERENT VAE ARCHITECTURES

Proposition 7. Given a matrix M_D, the nearest orthogonal matrix ˆM_D can be estimated as follows:

ˆM_D = M_D (M_D⊤ M_D)^{−1/2}.

Proof. We minimize ||M_D − ˆM_D||² subject to ˆM_D⊤ ˆM_D = I. Introducing a symmetric Lagrangian multiplier matrix Λ, we look for the stationary values of

e(ˆM_D, Λ) = Tr{ (M_D − ˆM_D)⊤ (M_D − ˆM_D) } + Tr{ Λ( ˆM_D⊤ ˆM_D − I ) }.

We differentiate e(ˆM_D, Λ) with respect to ˆM_D and set the derivative to 0. We know that ∂Tr{X⊤X}/∂X = 2X, hence

∂/∂ˆM_D Tr{ (M_D − ˆM_D)⊤ (M_D − ˆM_D) } = −2(M_D − ˆM_D).

Further, ∂Tr{ΛX⊤X}/∂X = X(Λ + Λ⊤), hence

∂/∂ˆM_D Tr{ Λ( ˆM_D⊤ ˆM_D − I ) } = ˆM_D(Λ + Λ⊤).

Combining the terms and setting the derivative to zero, we get

−2(M_D − ˆM_D) + ˆM_D(Λ + Λ⊤) = 0.

Since Λ⊤ = Λ, this gives −(M_D − ˆM_D) + ˆM_D Λ = 0, i.e.,

M_D = ˆM_D(I + Λ).

Calculating M_D⊤ M_D, we have

M_D⊤ M_D = (I + Λ)⊤ ˆM_D⊤ ˆM_D (I + Λ) = (I + Λ)⊤ (I + Λ).

Hence (I + Λ) = (M_D⊤ M_D)^{1/2}, and solving for ˆM_D, we have

ˆM_D = M_D (M_D⊤ M_D)^{−1/2}.

A.8.4. MIG AND MIG-SUP SCORES FOR DIFFERENT ARCHITECTURES

The MIG scores of the different VAE-based architectures on the datasets are summarized in Table 4, and the MIG-sup scores in Table 5.

Table 4. The MIG scores of the different VAE-based models for the datasets.
Dataset         β-TCVAE        β-VAE          VAE
dSprites        0.52 ± 0.10    0.25 ± 0.10    0.16 ± 0.05
3DFaces         0.58 ± 0.02    0.44 ± 0.05    0.19 ± 0.02
3DShapes        0.40 ± 0.10    0.46 ± 0.14    0.20 ± 0.08
MPI3DComplex    0.32 ± 0.10    0.24 ± 0.08    0.14 ± 0.06

Table 5. The MIG-Sup scores of the different VAE-based models for the datasets.
Dataset         β-TCVAE        β-VAE          VAE
dSprites        0.52 ± 0.08    0.17 ± 0.10    0.14 ± 0.02
3DFaces         0.60 ± 0.08    0.46 ± 0.04    0.20 ± 0.03
3DShapes        0.50 ± 0.10    0.60 ± 0.05    0.14 ± 0.04
MPI3DComplex    0.60 ± 0.15    0.46 ± 0.15    0.10 ± 0.06

A.8.5. OD-SCORE FOR DIFFERENT VAE ARCHITECTURES

Table 6 summarizes the OD-Score(M_D) for the different VAE-based architectures on the datasets.

Table 6. The OD-Score(M_D) of the different VAE-based models for the datasets.
Dataset         β-TCVAE           β-VAE             VAE
dSprites        0.9093 ± 0.02     0.9400 ± 0.02     0.9732 ± 0.02
3DFaces         0.9002 ± 0.02     0.9494 ± 0.01     0.9857 ± 0.01
3DShapes        0.9543 ± 0.02     0.9254 ± 0.01     0.9864 ± 0.02
MPI3DComplex    0.9086 ± 0.014    0.9254 ± 0.016    0.9466 ± 0.01
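The estimate in Proposition 7 is straightforward to compute. The sketch below (illustrative, with a random stand-in for M_D) forms M_D(M_D⊤M_D)^{−1/2} via an eigendecomposition and cross-checks it against the equivalent polar factor U_D V_D⊤ obtained from the thin SVD; how the OD-score is then derived from ˆM_D is not reproduced here.

```python
# Nearest matrix with orthonormal columns (Proposition 7), two equivalent routes.
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(8, 4))                    # stand-in for a local decoder matrix M_D

# M_hat = M (M^T M)^{-1/2}, via the eigendecomposition of the SPD matrix M^T M.
w, V = np.linalg.eigh(M.T @ M)
M_hat = M @ (V @ np.diag(w ** -0.5) @ V.T)

# Equivalent form: the orthogonal polar factor U V^T from the thin SVD M = U S V^T.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

print(np.allclose(M_hat, U @ Vt))              # True
print(np.allclose(M_hat.T @ M_hat, np.eye(4))) # columns of M_hat are orthonormal
```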
A.8.6. DEMONSTRATING THE CONNECTION BETWEEN GENERATIVE FEATURES AND PRINCIPAL AXES

In this section, we experimentally establish a connection between the generative factors and the principal components of the data. Further, we show that the principal components with the highest variance are associated with the generative factors most significant for the reconstruction.

Figure 5. The first image is the image subset input to the PCA. The second image is the reconstruction with the first 1500 principal axes, the third with the first 2000 principal axes, the fourth with the first 2500 axes, the fifth with the first 3000, and the sixth with the 2500th to 3000th axes.

Fig. 5 shows the original image (the first image) and the images reconstructed from the different sets of principal axes (PA), the details of which are provided in the caption of Fig. 5. Given that the second picture (PA = first 1500) is blank, that the third (PA = first 2000), fourth (PA = first 2500) and fifth (PA = first 3000) images capture both position and parts of the shape, and that the sixth image (PA = 2500th to 3000th) does not capture position, we infer that the first 1500 axes summarize just the position and not the shape. Given that the sixth image lacks both position and shape information while the fifth lacks intricate details, we conclude that the principal axes from 1500 to 2500 capture position, while 2500 to 3000 capture intricate position details. Table 7 summarizes the reconstruction error from the different sets of principal axes. While the error decreases as more axes are added, it increases rapidly when the initial high-variance axes, which convey greater information, are removed.

Table 7. Comparing the reconstruction error for the images generated from the different sets of principal axes.
Axes used             first 1500    first 2000    first 2500    first 3000    2000 to 3000    2500 to 3000
Reconstruction loss   21.23         12.43         6.41          0.54          12.31           17.64
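The principal-axis reconstructions above follow standard PCA. The sketch below is a scaled-down stand-in (random data in place of the dSprites image subset, and proportionally smaller axis bands), intended only to show how a reconstruction from a given band of principal axes is formed.

```python
# Reconstructing data from selected bands of principal axes (cf. Fig. 5 and Table 7).
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 512))               # stand-in for the flattened image subset
mean = X.mean(axis=0)
Xc = X - mean

# Principal axes = right singular vectors of the centred data, ordered by variance.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

def reconstruct(start, stop):
    """Project onto principal axes start..stop-1 and map back to data space."""
    V = Vt[start:stop].T                       # (n_features, n_axes)
    return mean + Xc @ V @ V.T

# Axis bands chosen analogously to Table 7, scaled to the smaller toy dimensionality.
for start, stop in [(0, 100), (0, 200), (0, 300), (200, 300)]:
    err = np.mean((X - reconstruct(start, stop)) ** 2)
    print(f"axes {start}-{stop}: mean squared reconstruction error {err:.3f}")
```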