Posterior Collapse and Latent Variable Non-identifiability

Yixin Wang, University of Michigan, yixinw@umich.edu
David M. Blei, Columbia University, david.blei@columbia.edu
John P. Cunningham, Columbia University, jpc2181@columbia.edu

Abstract

Variational autoencoders model high-dimensional data by positing low-dimensional latent variables that are mapped through a flexible distribution parametrized by a neural network. Unfortunately, variational autoencoders often suffer from posterior collapse: the posterior of the latent variables equals the prior, rendering the variational autoencoder useless as a means to produce meaningful representations. Existing approaches to posterior collapse often attribute it to the use of neural networks or to optimization issues arising from the variational approximation. In this paper, we consider posterior collapse as a problem of latent variable non-identifiability. We prove that the posterior collapses if and only if the latent variables are non-identifiable in the generative model. This fact implies that posterior collapse is not a phenomenon specific to the use of flexible distributions or approximate inference; rather, it can occur in classical probabilistic models even with exact inference, which we also demonstrate. Based on these results, we propose a class of latent-identifiable variational autoencoders, deep generative models that enforce identifiability without sacrificing flexibility. This model class resolves latent variable non-identifiability by leveraging bijective Brenier maps and parameterizing them with input convex neural networks, without special variational inference objectives or optimization tricks. Across synthetic and real datasets, latent-identifiable variational autoencoders outperform existing methods in mitigating posterior collapse and providing meaningful representations of the data.

1 Introduction

Variational autoencoders (VAE) are powerful generative models for high-dimensional data [28, 46]. Their key idea is to combine the inference principles of probabilistic modeling with the flexibility of neural networks. In a VAE, each datapoint is independently generated by a low-dimensional latent variable drawn from a prior and then mapped through a flexible distribution parametrized by a neural network.

Unfortunately, VAE often suffer from posterior collapse, an important and widely studied phenomenon where the posterior of the latent variables is equal to the prior [6, 8, 38, 62]. This phenomenon is also known as latent variable collapse, KL vanishing, and over-pruning. Posterior collapse renders the VAE useless as a means to produce meaningful representations, insomuch as its per-datapoint latent variables all have the exact same posterior.

Posterior collapse is commonly observed in VAE whose generative models are highly flexible, which has led to the common speculation that posterior collapse occurs because the VAE involves flexible neural networks in the generative model [11], or because it uses variational inference [59]. Based on these hypotheses, many of the proposed strategies for mitigating posterior collapse focus on modifying the variational inference objective (e.g., [44]), designing special optimization schemes for variational inference in VAE (e.g., [18, 25, 32]), or limiting the capacity of the generative model (e.g., [6, 16, 60]).
In this paper, we consider posterior collapse as a problem of latent variable non-identifiability. We prove that posterior collapse occurs if and only if the latent variable is non-identifiable in the generative model, which loosely means that the likelihood function does not depend on the latent variable [40, 42, 56]. Below, we formally establish this equivalence by appealing to recent results on Bayesian non-identifiability [40, 42, 43, 49, 58].

More broadly, the relationship between posterior collapse and latent variable non-identifiability implies that posterior collapse is not a phenomenon specific to the use of neural networks or variational inference. Rather, it can also occur in classical probabilistic models fitted with exact inference methods, such as Gaussian mixture models and probabilistic principal component analysis (PPCA).

This relationship also leads to a new perspective on existing methods for avoiding posterior collapse, such as the delta-VAE [44] or the β-VAE [19]. These methods heuristically adjust the approximate inference procedure embedded in the optimization of the model parameters. Though originally motivated by the goal of patching the variational objective, the results here suggest that these adjustments are useful because they help avoid parameters at which the latent variable is non-identifiable and, consequently, avoid posterior collapse.

The relationship between posterior collapse and non-identifiability points to a direct solution to the problem: we must make the latent variable identifiable. To this end, we propose the latent-identifiable VAE, a class of VAE that is as flexible as the classical VAE while also being identifiable. The latent-identifiable VAE resolves latent variable non-identifiability by leveraging Brenier maps [36, 39] and parameterizing them with input convex neural networks [2, 35]. Inference in the latent-identifiable VAE uses the standard variational inference objective, without special modifications or optimization tricks. Across synthetic and real datasets, we show that the latent-identifiable VAE mitigates posterior collapse without sacrificing fidelity to the data.

Related work. Existing approaches to avoiding posterior collapse often modify the variational inference objective, design new initialization or optimization schemes for VAE, or add neural network links between each datapoint and its latent variables [1, 3, 6, 8, 12, 15, 16, 17, 18, 21, 25, 27, 32, 34, 38, 44, 50, 51, 52, 55, 61, 62, 63]. Several recent papers also attempt to explain posterior collapse. Chen et al. [8] explain how an inexact variational approximation can lead to inefficient coding in VAE, which could lead to posterior collapse due to a form of information preference. Dai et al. [11] argue that posterior collapse can be partially attributed to local optima in training VAE with deep neural networks. Lucas et al. [33] show that posterior collapse is not specific to the variational inference training objective; absent a variational approximation, the log marginal likelihood of PPCA has bad local optima that can lead to posterior collapse. Yacoby et al. [59] discuss how the variational approximation can select an undesirable generative model when the generative model parameters are non-identifiable. In contrast to these works, we consider posterior collapse solely as a problem of latent variable non-identifiability, and not of optimization, variational approximations, or neural networks per se.
We use this result to propose the latent-identifiable VAE as a way to directly avoid posterior collapse.

Outside VAE, latent variable identifiability in probabilistic models has long been studied in the statistics literature [40, 42, 43, 49, 56, 58]. More recently, Betancourt [5] studies the effect of latent variable identifiability on Bayesian computation for Gaussian mixtures. Khemakhem et al. [23, 24] propose to resolve the non-identifiability of deep generative models by appealing to auxiliary data. Kumar & Poole [29] study how the variational family can help resolve the non-identifiability of VAE. These works address the identifiability issue for a different goal: they develop identifiability conditions for different subsets of VAE, aiming to recover the true causal factors of the data and to improve disentanglement or out-of-distribution generalization. Related to these papers, we demonstrate posterior collapse as an additional way that the concept of identifiability, though classical, can be instrumental in modern probabilistic modeling. Considering identifiability leads to new solutions to posterior collapse.

Contributions. We prove that posterior collapse occurs if and only if the latent variable in the generative model is non-identifiable. We then propose the latent-identifiable VAE, a class of VAE that are as flexible as classical VAE but have latent variables that are provably identifiable. Across synthetic and real datasets, we demonstrate that the latent-identifiable VAE mitigates posterior collapse without modifying VAE objectives or applying special optimization tricks.

2 Posterior collapse and latent variable non-identifiability

Consider a dataset x = (x_1, ..., x_n); each datapoint is m-dimensional. Positing n latent variables z = (z_1, ..., z_n), a variational autoencoder (VAE) assumes that each datapoint x_i is generated by a K-dimensional latent variable z_i:

    z_i ∼ p(z_i),    x_i | z_i ∼ p(x_i | z_i; θ) = EF(x_i | f_θ(z_i)),    (1)

where x_i follows an exponential family distribution with parameters f_θ(z_i); f_θ parameterizes the conditional likelihood. In a deep generative model, f_θ is a neural network. Classical probabilistic models like the Gaussian mixture model [45] and probabilistic PCA [10, 47, 48, 54] are also special cases of Eq. 1.
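As a concrete illustration, here is a minimal Python sketch of the generative process in Eq. 1, using a Bernoulli exponential family and a hypothetical two-layer network for f_θ; the dimensions K = 2 and m = 8 and the weights W1, W2 are illustrative placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, m = 2, 8
W1, W2 = rng.normal(size=(K, 16)), rng.normal(size=(16, m))

def f_theta(z):
    """A toy f_theta: maps a latent z to the Bernoulli means of x | z."""
    h = np.tanh(z @ W1)
    return 1.0 / (1.0 + np.exp(-h @ W2))  # sigmoid -> probabilities in (0, 1)

z = rng.normal(size=K)           # z_i ~ p(z_i), here a standard normal prior
x = rng.binomial(1, f_theta(z))  # x_i | z_i ~ EF(x_i | f_theta(z_i))
print(x)                         # one m-dimensional binary datapoint
```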
To fit the model, the VAE optimizes the parameters θ by maximizing a variational approximation of the log marginal likelihood. After finding an optimal θ̂, we can form a representation of the data using the approximate posterior q_φ̂(z | x), with variational parameters φ̂, or its expectation E_{q_φ̂(z|x)}[z | x].

Note that here we abstract away computational considerations and consider the ideal case where the variational approximation is exact. This choice is sensible: if the exact posterior suffers from posterior collapse, then so will the approximate posterior (a variational approximation cannot uncollapse a collapsed posterior). That said, there do exist situations in practice where variational inference alone can lead to posterior collapse. A notable example is when the variational approximating family is overly restrictive: it is then possible to have non-collapsing exact posteriors but collapsing approximate posteriors.

2.1 Posterior collapse ⇔ Latent variable non-identifiability

We first define posterior collapse and latent variable non-identifiability, and then prove their connection.

Definition 1 (Posterior collapse [6, 8, 38, 62]). Given a probability model p(x, z; θ), a parameter value θ = θ̂, and a dataset x = (x_1, ..., x_n), the posterior of the latent variables z collapses if

    p(z | x; θ̂) = p(z).    (2)

The posterior collapse phenomenon can occur in a variety of probabilistic models and with different latent variables. When the probability model is a VAE, it has only local latent variables z = (z_1, ..., z_n), and Eq. 2 is equivalent to the common definition of posterior collapse, p(z_i | x_i; θ̂) = p(z_i) for all i [12, 17, 33, 44]. Posterior collapse has also been observed in Gaussian mixture models [5]: the posterior of the latent mixture weights resembles their prior when the number of mixture components in the model is larger than that of the data-generating process. Regardless of the model, when posterior collapse occurs, it prevents the latent variable from providing a meaningful summary of the dataset.

Definition 2 (Latent variable non-identifiability [42, 56]). Given a likelihood function p(x | z; θ), a parameter value θ = θ̂, and a dataset x = (x_1, ..., x_n), the latent variable z is non-identifiable if

    p(x | z = z′; θ̂) = p(x | z = z; θ̂)    ∀ z′, z ∈ Z,    (3)

where Z denotes the domain of z, and z′, z refer to two arbitrary values the latent variable z can take. As a consequence, for any prior p(z) on z, the conditional likelihood equals the marginal,

    p(x | z = z; θ̂) = ∫ p(x | z; θ̂) p(z) dz = p(x; θ̂)    ∀ z ∈ Z.

Definition 2 says a latent variable z is non-identifiable when the likelihood of the dataset x does not depend on z. This notion is also known as practical non-identifiability [42, 56] and is closely related to the definition of z being conditionally non-identifiable (or conditionally uninformative) given θ̂ [40, 42, 43, 49, 58]. To enforce latent variable identifiability, it is sufficient to ensure that the likelihood p(x | z; θ) is an injective (a.k.a. one-to-one) function of z for all θ. If this condition holds, then

    z′ ≠ z  ⇒  p(x | z = z′; θ̂) ≠ p(x | z = z; θ̂).    (4)

Note that latent variable non-identifiability only requires Eq. 3 to hold for a given dataset x and parameter value θ̂. Thus a latent variable may be identifiable in a model given one dataset but not another, and at one θ but not another. See examples in Appendix A.

Latent variable identifiability (Definition 2) [42, 56] differs from model identifiability [41], a related notion that has also been cited as a contributing factor to posterior collapse [59]. Latent variable identifiability is a weaker requirement: it only requires that the latent variable z be identifiable at a particular parameter value θ = θ̂, while model identifiability requires that both z and θ be identifiable.

We now establish the equivalence between posterior collapse and latent variable non-identifiability.

Theorem 1 (Latent variable non-identifiability ⇔ Posterior collapse). Consider a probability model p(x, z; θ), a dataset x, and a parameter value θ = θ̂. The local latent variables z are non-identifiable at θ̂ if and only if the posterior of the latent variables z collapses, p(z | x; θ̂) = p(z).

Proof. To prove that non-identifiability implies posterior collapse, note that, by Bayes rule,

    p(z | x; θ̂) ∝ p(z) p(x | z; θ̂) = p(z) p(x; θ̂) ∝ p(z),    (5)

where the middle equality is due to the definition of latent variable non-identifiability. This implies p(z | x; θ̂) = p(z), as both are densities. To prove that posterior collapse implies latent variable non-identifiability, we again invoke Bayes rule. Posterior collapse implies that p(z) = p(z | x; θ̂) ∝ p(z) p(x | z; θ̂), which further implies that p(x | z; θ̂) is constant in z: if p(x | z; θ̂) depended nontrivially on z, then p(z) p(x | z; θ̂) would differ from p(z) as a function of z. □
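To make Theorem 1 concrete, the following is a minimal numerical sketch of both directions of the Bayes-rule argument, using a hypothetical three-valued discrete latent and made-up likelihood values; nothing here comes from the paper's experiments.

```python
import numpy as np

prior = np.array([0.2, 0.3, 0.5])  # p(z) over z in {0, 1, 2}

def posterior(likelihood):
    """Bayes rule: p(z | x) is proportional to p(z) p(x | z)."""
    unnorm = prior * likelihood
    return unnorm / unnorm.sum()

# Non-identifiable case: p(x | z) is constant in z, so the posterior collapses.
flat_lik = np.array([0.01, 0.01, 0.01])
print(posterior(flat_lik))          # [0.2 0.3 0.5], equal to the prior

# Identifiable case: p(x | z) varies with z, so the posterior moves.
informative_lik = np.array([0.30, 0.05, 0.01])
print(posterior(informative_lik))   # concentrates on z = 0
```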
The proof of Theorem 1 is straightforward, but the theorem has an important implication. It shows that the problem of posterior collapse mainly arises from the model and the data, rather than from inference or optimization. If the maximum likelihood parameters θ̂ of the VAE render the latent variable z non-identifiable, then we will observe posterior collapse. Theorem 1 also clarifies why posteriors may change from non-collapsed to collapsed (and back) while fitting a VAE: some parameter iterates may lead to posterior collapse; others may not.

Theorem 1 points to why existing approaches can help mitigate posterior collapse. Consider the β-VAE [19], the VAE with lagging encoder [18], and the semi-amortized VAE [25]. Though motivated by other perspectives, these methods modify the optimization objectives or algorithms of the VAE to avoid parameter values θ at which the latent variable is non-identifiable. The resulting posterior may not collapse, though the optimal parameters for these algorithms no longer approximate the maximum likelihood estimate.

Theorem 1 can also help us understand posterior collapse as observed in practice, which manifests as the phenomenon that the posterior is approximately (as opposed to exactly) equal to the prior, p(z | x; θ̂) ≈ p(z). In several empirical studies of VAE (e.g., [12, 18, 25]), we observe that the Kullback-Leibler (KL) divergence between the prior and the posterior is close to zero but not exactly zero, a property that stems from the likelihood p(x | z) being nearly constant in the latents z. In these cases, Theorem 1 provides the intuition that the latent variable is nearly non-identifiable, p(x | z′) ≈ p(x | z) ∀ z, z′, and so Eq. 2 holds approximately.

2.2 Examples of latent variable non-identifiability and posterior collapse

We illustrate Theorem 1 with three examples. Here we discuss the example of the Gaussian mixture VAE (GMVAE). See Appendix A for probabilistic principal component analysis (PPCA) and the Gaussian mixture model (GMM). The GMVAE [13, 51] is the following model:

    p(z_i) = Categorical(1/K),
    p(w_i | z_i; µ, Σ) = N(µ_{z_i}, Σ_{z_i}),
    p(x_i | w_i; f, σ) = N(f(w_i), σ² I_m),

where the µ_k are d-dimensional, the Σ_k are d × d, and the parameters are θ = (µ, Σ, f, σ²). Suppose the function f is fully flexible, so that f(w_i) can capture any distribution of the data. The latent variable of interest is the categorical z = (z_1, ..., z_n). If its posterior collapses, then p(z_i = k | x) = 1/K for all k = 1, ..., K.

Consider fitting a GMVAE model with K = 2 to a dataset of 5,000 samples. This dataset is drawn from a GMVAE also with K = 2 well-separated clusters; there is no model misspecification. A GMVAE is typically fit by maximizing the log marginal likelihood, θ̂ = argmax_θ log p(x; θ). Note there may be multiple values of θ that achieve the global optimum of this function. We focus on two likelihood maximizers. One provides latent variable identifiability, and the posterior of z_i does not collapse. The other does not provide identifiability; the posterior collapses.

1. The first likelihood-maximizing parameter θ̂_1 is the truth; the distributions of the K fitted clusters correspond to the K data-generating clusters.
Given this parameter, the latent variable z_i is identifiable because the K data-generating clusters are different; different cluster memberships z_i must result in different likelihoods p(x_i | z_i; θ̂_1). The posterior of z_i does not collapse.

2. In the second likelihood-maximizing parameter θ̂_2, however, all K fitted clusters share the same distribution, each of which is equal to the marginal distribution of the data. Specifically, (µ_k, Σ_k) = (0, I_d) for all k, and each fitted cluster is a mixture of the K original data-generating clusters, i.e., the marginal. At this parameter value, the model is still able to fully capture the mixture distribution of the data. However, all K mixture components are the same, so the latent variable z_i is non-identifiable; different cluster memberships z_i do not result in different likelihoods p(x_i | z_i; θ̂_2), and hence the posterior of z_i collapses. Figure 1a illustrates a fit of this (non-identifiable) GMVAE to the pinwheel data [22]. In Section 3, we construct a latent-identifiable VAE (LIDVAE) that avoids this collapse.

Latent variable identifiability is a function of both the model and the true data-generating distribution. Consider fitting the same GMVAE with K = 2 but to a different dataset of 5,000 samples, this one drawn from a GMVAE with only one cluster. (There is model misspecification.) One maximizing parameter value θ̂_3 is where both of the fitted clusters correspond to the true data-generating cluster. While this parameter value resembles the first maximizer θ̂_1 above (both correspond to the true data-generating clusters), this dataset leads to a different situation for latent variable identifiability. The two fitted clusters are the same, so different cluster memberships do not result in different likelihoods p(x_i | z_i; θ̂_3). The latent variable z_i is not identifiable, and its posterior collapses; a numerical sketch of this collapse appears at the end of this section.

Takeaways. The GMVAE example in this section (and the PPCA and GMM examples in Appendix A) illustrates different ways that a latent variable can be non-identifiable in a model and suffer from posterior collapse. It shows that even the true posterior, without variational inference, can collapse in non-identifiable models. It also illustrates that whether a latent variable is identifiable can depend on both the model and the data. Posterior collapse is an intrinsic problem of the model and the data, rather than one specific to the use of neural networks or variational inference. The equivalence between posterior collapse and latent variable non-identifiability in Theorem 1 also implies that, to mitigate posterior collapse, we should try to resolve latent variable non-identifiability. In the next section, we develop such a class of latent-identifiable VAE.
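The following is a minimal sketch of the misspecified case θ̂_3, assuming a one-dimensional Gaussian mixture as a stand-in for the GMVAE: data drawn from a single standard normal, and a K = 2 model whose two components are both set to the data marginal N(0, 1). The posterior responsibilities then equal the prior weights exactly.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=5000)       # data from a single cluster

weights = np.array([0.5, 0.5])            # prior p(z_i = k) = 1/K
means, stds = [0.0, 0.0], [1.0, 1.0]      # both components = the data marginal

# Posterior responsibilities p(z_i = k | x_i) via Bayes rule.
lik = np.stack([norm.pdf(x, m, s) for m, s in zip(means, stds)], axis=1)
resp = weights * lik
resp /= resp.sum(axis=1, keepdims=True)
print(resp[:3])   # every row is [0.5 0.5]: the posterior equals the prior
```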
3 Latent-identifiable VAE via Brenier maps

We now construct the latent-identifiable VAE, a class of VAE whose latent variables are guaranteed to be identifiable, and thus whose posteriors cannot collapse.

3.1 The latent-identifiable VAE

To construct the latent-identifiable VAE, we rely on a key observation: to guarantee latent variable identifiability, it is sufficient to make the likelihood function p(x_i | z_i; θ) injective for all values of θ. If the likelihood is injective, then, for any θ, each value of z_i will lead to a different distribution p(x_i | z_i; θ). In particular, this fact will be true for any optimized θ̂, so the latent z_i must be identifiable, regardless of the data. By Theorem 1, its posterior cannot collapse.

Constructing the latent-identifiable VAE thus amounts to constructing an injective likelihood function for the VAE. The construction is based on a few building blocks of linear and nonlinear injective functions, which are then composed into an injective likelihood p(x_i | z_i; θ) mapping from Z^K to X^m, where Z and X indicate the sets of values z_i and x_i can take. For example, if x_i is an m-dimensional binary vector, then X^m = {0,1}^m; if z_i is a K-dimensional real-valued vector, then Z^K = R^K.

The building blocks of LIDVAE: injective functions. For linear mappings from R^{d1} to R^{d2} (d2 ≥ d1), we consider multiplication by a d1 × d2 matrix β. For a d1-dimensional variable z, left multiplication by the matrix β^⊤ is injective when β has full column rank [53]. For example, a matrix with all ones on the main diagonal and all other entries zero has full column rank.

For nonlinear injective functions, we focus on Brenier maps [4, 37]. A d-dimensional Brenier map is the gradient of a convex function from R^d to R. That is, a Brenier map satisfies g = ∇T for some convex function T: R^d → R. Brenier maps are also known as monotone transport maps. They are guaranteed to be bijective [4, 37] because their derivative is the Hessian of a convex T, which must be positive semidefinite and has a nonnegative determinant [4].

To build a VAE with Brenier maps, we require a neural network parametrization of the Brenier map. As Brenier maps are gradients of convex functions, we begin with a neural network parametrization of convex functions, namely the input convex neural network (ICNN) [2, 35]. This parameterization of convex functions will enable Brenier maps to be parameterized as gradients of ICNN. An L-layer ICNN is a neural network mapping from R^d to R. Given an input u ∈ R^d, its l-th layer is

    z_0 = u,    z_{l+1} = h_l(W_l z_l + A_l u + b_l),    l = 0, ..., L−1,    (6)

where the last layer z_L must be a scalar, and the {W_l} are non-negative weight matrices with W_0 = 0. The functions h_l: R → R are convex and non-decreasing entry-wise activation functions for layer l; they are applied element-wise to the vector W_l z_l + A_l u + b_l. A common choice of h_0 is the square of a leaky ReLU, h_0(x) = (max(αx, x))² with α = 0.2; the remaining h_l are set to be leaky ReLUs, h_l(x) = max(αx, x). This neural network is called input convex because it is guaranteed to be a convex function of its input. Input convex neural networks can approximate any convex function on a compact domain in sup norm (Theorem 1 of Chen et al. [9]).

Given this neural network parameterization of convex functions, we can parametrize the Brenier map g_θ(·) as its gradient with respect to the input, g_θ(u) = ∂z_L/∂u. This neural network parameterization of the Brenier map is a universal approximator of all Brenier maps on a compact domain, because input convex neural networks are universal approximators of convex functions [9].

The latent-identifiable VAE (LIDVAE). We construct injective likelihoods for LIDVAE by composing two bijective Brenier maps with an injective matrix multiplication. As the composition of injective and bijective mappings must be injective, the resulting composition must be injective. Suppose g_{1,θ}: R^K → R^K and g_{2,θ}: R^D → R^D are two Brenier maps, and β is a K × D matrix (D ≥ K) with all main diagonal entries equal to one and all other entries zero. The matrix β^⊤ has full column rank, so multiplication by β^⊤ is injective. Thus the composition g_{2,θ}(β^⊤ g_{1,θ}(·)) must be an injective function from the low-dimensional space R^K to the high-dimensional space R^D.
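As an illustration of this construction, here is a minimal PyTorch sketch of an ICNN (Eq. 6) and of taking its input gradient as a Brenier map. The architecture sizes, initialization, and the helper names ICNN and brenier_map are our own hypothetical choices, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    """Input convex neural network (Eq. 6): convex in u because the W_l are
    non-negative and the activations h_l are convex and non-decreasing."""
    def __init__(self, dim, hidden=64, alpha=0.2):
        super().__init__()
        self.alpha = alpha
        # The A_l act directly on the input u; W_0 = 0, so layer 0 has no W.
        self.A = nn.ModuleList([nn.Linear(dim, hidden),
                                nn.Linear(dim, hidden),
                                nn.Linear(dim, 1)])
        self.W = nn.ParameterList([nn.Parameter(0.1 * torch.rand(hidden, hidden)),
                                   nn.Parameter(0.1 * torch.rand(1, hidden))])

    def forward(self, u):
        z = F.leaky_relu(self.A[0](u), self.alpha) ** 2   # h_0: squared leaky ReLU
        for W, A in zip(self.W, self.A[1:]):
            # clamp keeps W_l non-negative, preserving convexity in u
            z = F.leaky_relu(z @ W.clamp(min=0).T + A(u), self.alpha)
        return z   # shape (batch, 1): a scalar convex function of u

def brenier_map(icnn, u):
    """g(u) = dz_L/du: the gradient of a convex function, hence a Brenier map."""
    if not u.requires_grad:                 # leaf inputs need the flag switched on
        u = u.clone().requires_grad_(True)
    (grad,) = torch.autograd.grad(icnn(u).sum(), u, create_graph=True)
    return grad

T = ICNN(dim=2)
g = brenier_map(T, torch.randn(5, 2))   # map a batch of 5 points through g
print(g.shape)                          # torch.Size([5, 2])
```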
Definition 3 (Latent-identifiable VAE (LIDVAE) via Brenier maps). An LIDVAE via Brenier maps generates a D-dimensional datapoint x_i, i ∈ {1, ..., n}, by

    z_i ∼ p(z_i),    x_i | z_i ∼ EF(x_i | g_{2,θ}(β^⊤ g_{1,θ}(z_i))),    (7)

where EF stands for an exponential family distribution and z_i is a K-dimensional latent variable, discrete or continuous. The parameters of the model are θ = (g_{1,θ}, g_{2,θ}), where g_{1,θ}: R^K → R^K and g_{2,θ}: R^D → R^D are two continuous Brenier maps. The matrix β is a K × D matrix (D ≥ K) with all main diagonal entries equal to one and all other entries zero.

Contrasting the LIDVAE (Eq. 7) with the classical VAE (Eq. 1), the LIDVAE replaces the function f_θ: Z^K → X^D with the injective mapping g_{2,θ}(β^⊤ g_{1,θ}(·)), composed of the bijective Brenier maps g_{1,θ} and g_{2,θ} and a zero-one matrix β^⊤ with full column rank. As exponential family likelihoods are injective in their parameters, the likelihood function p(x_i | z_i; θ) = EF(x_i | g_{2,θ}(β^⊤ g_{1,θ}(z_i))) of the LIDVAE must be injective. Therefore, replacing an arbitrary function f_θ: Z^K → X^D with the injective mapping g_{2,θ}(β^⊤ g_{1,θ}(·)) plays a crucial role in enforcing identifiability of the latent variable z_i and avoiding posterior collapse in LIDVAE. As the latent z_i must be identifiable in LIDVAE, its posterior does not collapse.

Despite its injective likelihood, the LIDVAE is as flexible as the VAE; the use of Brenier maps and ICNN does not limit the capacity of the generative model. Loosely, the LIDVAE can model any distribution in R^D because Brenier maps can map any given non-atomic distribution in R^d to any other one in R^d [37]. Moreover, the ICNN parametrization is a universal approximator of Brenier maps [2]. We summarize the key properties of LIDVAE in the following proposition.

Proposition 2. The latent variable z_i is identifiable in LIDVAE, i.e., for all i ∈ {1, ..., n}, we have

    p(x_i | z_i = z′; θ) = p(x_i | z_i = z; θ)  ⇒  z′ = z,    ∀ z′, z, θ.    (8)

Moreover, for any VAE-generated data distribution, there exists an LIDVAE that can generate the same distribution. (The proof is in Appendix B.)

Figure 1: (a) Non-identifiable GMVAE; (b) LIDGMVAE; (c) accuracy; (d) log-likelihood. (a)-(b): The posterior of the classical GMVAE [13, 26, 51] collapses when fit to the pinwheel dataset; the latents predict the same value for all datapoints. The posteriors of the latent-identifiable Gaussian mixture VAE (LIDGMVAE), however, do not collapse and provide meaningful representations. (c)-(d): The latent-identifiable GMVAE produces posteriors that are substantially more informative than the GMVAE when fit to Fashion-MNIST. It also achieves higher test log-likelihood.

3.2 Inference in LIDVAE

Performing inference in LIDVAE is identical to inference in the classical VAE, as the two differ only in their parameter constraints. To fit an LIDVAE, we use the classical amortized inference algorithm of the VAE; we maximize the evidence lower bound (ELBO) of the log marginal likelihood [28]. In general, LIDVAE is a drop-in replacement for VAE. Both have the same capacity (Proposition 2) and share the same inference algorithm, but LIDVAE is identifiable and does not suffer from posterior collapse. The price we pay for LIDVAE is computational: the generative model (i.e., the decoder) is parametrized using the gradient of a neural network; its optimization thus requires calculating gradients of the gradient of a neural network, which increases the computational complexity of VAE inference and can sometimes challenge optimization. While fitting a classical VAE using stochastic gradient descent has O(kp) computational complexity, where k is the number of iterations and p is the number of parameters, fitting a latent-identifiable VAE may require O(kp²) computational complexity.
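Continuing the PyTorch sketch above, a hypothetical LIDVAE decoder composes the two Brenier maps with the zero-padding matrix β^⊤ as in Eq. 7; the dimensions K = 2 and D = 4 are illustrative only.

```python
# Reuses ICNN and brenier_map from the sketch in Section 3.1.
K, D = 2, 4
g1, g2 = ICNN(dim=K), ICNN(dim=D)
beta = torch.eye(K, D)        # beta: K x D, ones on the diagonal, zeros elsewhere

def decoder_mean(z):
    """Injective map g_2(beta^T g_1(z)) from R^K to R^D (Eq. 7)."""
    h = brenier_map(g1, z)    # bijective on R^K
    h = h @ beta              # rows are beta^T h_i: zero-pad R^K into R^D
    return brenier_map(g2, h) # bijective on R^D

z = torch.randn(3, K)
print(decoder_mean(z).shape)  # torch.Size([3, 4])
```

During training, decoder_mean would supply the exponential family parameters inside the standard ELBO, so the usual VAE training loop applies unchanged.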
3.3 Extensions of LIDVAE

The construction of LIDVAE reveals a general strategy for making the latent variables of generative models identifiable: replace nonlinear mappings with injective nonlinear mappings. We can employ this strategy to make the latent variables of many other VAE variants identifiable. Below we give two examples, the mixture VAE and the sequential VAE.

The mixture VAE, with the GMVAE as a special case, models the data with an exponential family mixture mapped through a flexible neural network. We develop its latent-identifiable counterpart using Brenier maps.

Example 1 (Latent-identifiable mixture VAE (LIDMVAE)). An LIDMVAE generates a D-dimensional datapoint x_i, i ∈ {1, ..., n}, by

    z_i ∼ Categorical(1/K),
    w_i | z_i ∼ EF(w_i | β_1^⊤ z_i),
    x_i | w_i ∼ EF(x_i | g_{2,θ}(β_2^⊤ g_{1,θ}(w_i))),    (9)

where z_i is a K-dimensional one-hot vector that indicates the cluster assignment. The parameters of the model are θ = (g_{1,θ}, g_{2,θ}), where the functions g_{1,θ}: R^M → R^M and g_{2,θ}: R^D → R^D are two continuous Brenier maps. The matrices β_1 and β_2 are a K × M matrix (M ≥ K) and an M × D matrix (D ≥ M), respectively, both with all main diagonal entries equal to one and all other entries zero.

The LIDMVAE differs from the classical mixture VAE in p(x_i | z_i), where we replace the neural network mapping with its injective counterpart, i.e., a composition of two Brenier maps and a matrix multiplication, g_{2,θ}(β_2^⊤ g_{1,θ}(·)). As a special case, setting both exponential families in Example 1 to be Gaussian gives the LIDGMVAE, which we will use to model images in Section 4. Next we derive the identifiable counterpart of the sequential VAE, which models the data with an autoregressive model conditional on the latents.

Example 2 (Latent-identifiable sequential VAE (LIDSVAE)). An LIDSVAE generates a D-dimensional datapoint x_i, i ∈ {1, ..., n}, by

    z_i ∼ p(z_i),    x_{it} | z_i, x_{i,<t} ∼ EF(x_{it} | g_{2,θ}(β^⊤ g_{1,θ}([z_i, f_θ(x_{i,<t})]))),

where f_θ is an autoregressive neural network that summarizes the preceding observations x_{i,<t}.

4 Empirical studies

Consider a PPCA model with x_i | z_i ∼ N(z_i^⊤ w, σ² I_5), where the value of σ² is known, the z_i are the latent variables of interest, and w is the only parameter of interest. When the noise σ² is set to a large value, the latent variable z_i may become nearly non-identifiable. The reason is that the likelihood function p(x_i | z_i) becomes slower-varying as σ² increases. For example, Figure 2 shows that the likelihood surface becomes flatter as σ² increases. Accordingly, the posterior becomes closer to the prior as σ² increases. When σ = 1.5, the posterior collapses. This non-identifiability argument provides an explanation for the closely related phenomenon described in Section 6.2 of [33].
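The flattening of the PPCA posterior can be checked in closed form. Below is a small sketch, assuming the standard PPCA parameterization x_i | z_i ∼ N(W z_i, σ² I) with prior z_i ∼ N(0, I_K) and a made-up loading matrix W; it computes the exact conjugate posterior and its KL divergence from the prior as σ² grows.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 2, 5
W = rng.normal(size=(M, K))   # hypothetical loadings
x = rng.normal(size=M)        # one observation

for sigma2 in [0.1, 1.0, 10.0, 100.0]:
    # Conjugate posterior: z | x ~ N(C W^T x / sigma2, C), C = (I + W^T W / sigma2)^{-1}
    C = np.linalg.inv(np.eye(K) + W.T @ W / sigma2)
    m = C @ W.T @ x / sigma2
    # KL(posterior || prior) with prior N(0, I); zero means exact collapse.
    kl = 0.5 * (np.trace(C) + m @ m - K - np.log(np.linalg.det(C)))
    print(f"sigma^2 = {sigma2:6.1f}   KL = {kl:.4f}")
```

As σ² increases, the KL shrinks toward zero: the exact posterior approaches the prior, matching the near-non-identifiability argument above.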
5 Discussion

In this work, we show that the posterior collapse phenomenon is a problem of latent variable non-identifiability. It is not specific to the use of neural networks or particular inference algorithms in VAE. Rather, it is an intrinsic issue of the model and the dataset. To address it, we propose the class of LIDVAE via Brenier maps, which resolves latent variable non-identifiability and mitigates posterior collapse. Across empirical studies, we find that LIDVAE outperforms existing methods in mitigating posterior collapse.

The latent variables of LIDVAE are guaranteed to be identifiable. However, this does not guarantee that the latent variables and the parameters of LIDVAE are jointly identifiable. In other words, the LIDVAE model may not be identifiable even though its latents are identifiable. This difference between latent variable identifiability and model identifiability may appear minor, but the tractability of resolving latent variable identifiability plays a key role in making non-identifiability a fruitful perspective on posterior collapse. To enforce latent variable identifiability, it is sufficient to ensure that the likelihood p(x | z; θ) is an injective function of z. In contrast, resolving model identifiability for the general class of VAE remains a long-standing open problem, with some recent progress relying on auxiliary variables [23, 24]. The tractability of resolving latent variable identifiability is a key catalyst of a principled solution to mitigating posterior collapse.

There are a few limitations of this work. One is that the theoretical argument focuses on the collapse of the exact posterior. The rationale is that, if the exact posterior collapses, then its variational approximation must also collapse, because a variational approximation cannot uncollapse a posterior. That said, a variational approximation may collapse a posterior, i.e., the exact posterior does not collapse but the variational approximate posterior does. The theoretical argument and algorithmic approaches developed in this work do not apply to this setting, which remains an interesting avenue for future work. A second limitation is that the latent-identifiable VAE developed in this work bears a higher computational cost than the classical VAE. While the latent-identifiable VAE ensures the identifiability of its latent variables and mitigates posterior collapse, it does come with a price in computation, because its generative model (i.e., the decoder) is parametrized using gradients of a neural network. Fitting the latent-identifiable VAE thus requires calculating gradients of gradients of a neural network, leading to much higher computational complexity than fitting the classical VAE. Developing computationally efficient variants of the latent-identifiable VAE is another interesting direction for future work.

Acknowledgments. We thank Taiga Abe and Gemma Moran for helpful discussions, and anonymous reviewers for constructive feedback that improved the manuscript. David Blei is supported by ONR N00014-17-1-2131, ONR N00014-15-1-2209, NSF CCF-1740833, DARPA SD2 FA8750-18-C-0130, Amazon, and the Simons Foundation. John Cunningham is supported by the Simons Foundation, McKnight Foundation, Zuckerman Institute, Grossman Center, and Gatsby Charitable Trust.

References

[1] Alemi, A. A., Poole, B., et al. (2017). Fixing a broken ELBO. arXiv preprint arXiv:1711.00464.
[2] Amos, B., Xu, L., & Kolter, J. Z. (2017). Input convex neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70 (pp. 146-155). JMLR.org.
[3] Asperti, A. (2019). Variational autoencoders and the variable collapse phenomenon. Sensors & Transducers, 234(6), 1-8.
[4] Ball, K. (2004). An elementary introduction to monotone transportation. In Geometric Aspects of Functional Analysis (pp. 41-52). Springer.
[5] Betancourt, M. (2017). Identifying Bayesian mixture models. https://mc-stan.org/users/documentation/case-studies/identifying_mixture_models. Accessed: 2021-05-04.
[6] Bowman, S., Vilnis, L., et al. (2016). Generating sentences from a continuous space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (pp. 10-21).
[7] Burda, Y., Grosse, R., & Salakhutdinov, R. (2015). Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.
[8] Chen, X., Kingma, D. P., et al. (2016). Variational lossy autoencoder. arXiv preprint arXiv:1611.02731.
[9] Chen, Y., Shi, Y., & Zhang, B. (2018). Optimal control via neural networks: A convex approach. arXiv preprint arXiv:1805.11835.
[10] Collins, M., Dasgupta, S., & Schapire, R. E. (2001). A generalization of principal components analysis to the exponential family. In NIPS, Volume 13 (p. 23).
[11] Dai, B., Wang, Z., & Wipf, D. (2019). The usual suspects? Reassessing blame for VAE posterior collapse. arXiv preprint arXiv:1912.10702.
[12] Dieng, A. B., Kim, Y., Rush, A. M., & Blei, D. M. (2018). Avoiding latent variable collapse with generative skip models. arXiv preprint arXiv:1807.04863.
[13] Dilokthanakul, N., Mediano, P. A., et al. (2016). Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648.
[14] Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016). Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.
[15] Fu, H., Li, C., et al. (2019). Cyclical annealing schedule: A simple approach to mitigating KL vanishing. arXiv preprint arXiv:1903.10145.
[16] Gulrajani, I., Kumar, K., et al. (2016). PixelVAE: A latent variable model for natural images. arXiv preprint arXiv:1611.05013.
[17] Havrylov, S. & Titov, I. (2020). Preventing posterior collapse with Levenshtein variational autoencoder. arXiv preprint arXiv:2004.14758.
[18] He, J., Spokoyny, D., Neubig, G., & Berg-Kirkpatrick, T. (2019). Lagging inference networks and posterior collapse in variational autoencoders. arXiv preprint arXiv:1901.05534.
[19] Higgins, I., Matthey, L., et al. (2016). β-VAE: Learning basic visual concepts with a constrained variational framework.
[20] Hochreiter, S. & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
[21] Hoffman, M. D. & Johnson, M. J. (2016). ELBO surgery: Yet another way to carve up the variational evidence lower bound.
[22] Johnson, M. J., Duvenaud, D. K., Wiltschko, A., Adams, R. P., & Datta, S. R. (2016). Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems (pp. 2946-2954).
[23] Khemakhem, I., Kingma, D. P., & Hyvärinen, A. (2019). Variational autoencoders and nonlinear ICA: A unifying framework. arXiv preprint arXiv:1907.04809.
[24] Khemakhem, I., Monti, R. P., Kingma, D. P., & Hyvarinen, A. (2020). ICE-BeeM: Identifiable conditional energy-based deep models based on nonlinear ICA.
[25] Kim, Y., Wiseman, S., Miller, A., Sontag, D., & Rush, A. (2018). Semi-amortized variational autoencoders. In International Conference on Machine Learning (pp. 2678-2687).
[26] Kingma, D. P., Mohamed, S., Rezende, D. J., & Welling, M. (2014). Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems (pp. 3581-3589).
[27] Kingma, D. P., Salimans, T., et al. (2016). Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems (pp. 4743-4751).
[28] Kingma, D. P. & Welling, M. (2014). Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), Volume 1.
[29] Kumar, A. & Poole, B. (2020). On implicit regularization in β-VAEs. In International Conference on Machine Learning (pp. 5480-5490). PMLR.
[30] Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266), 1332-1338.
[31] LeCun, Y., Cortes, C., & Burges, C. (2010). MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist.
[32] Li, B., He, J., Neubig, G., Berg-Kirkpatrick, T., & Yang, Y. (2019). A surprisingly effective fix for deep latent variable modeling of text. arXiv preprint arXiv:1909.00868.
[33] Lucas, J., Tucker, G., Grosse, R. B., & Norouzi, M. (2019). Don't blame the ELBO! A linear VAE perspective on posterior collapse. In Advances in Neural Information Processing Systems (pp. 9403-9413).
[34] Maaløe, L., Fraccaro, M., Liévin, V., & Winther, O. (2019). BIVA: A very deep hierarchy of latent variables for generative modeling. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in Neural Information Processing Systems, Volume 32. Curran Associates, Inc.
[35] Makkuva, A. V., Taghvaei, A., Oh, S., & Lee, J. D. (2019). Optimal transport mapping via input convex neural networks. arXiv preprint arXiv:1908.10962.
[36] McCann, R. J. et al. (1995). Existence and uniqueness of monotone measure-preserving maps. Duke Mathematical Journal, 80(2), 309-324.
[37] McCann, R. J. & Guillen, N. (2011). Five lectures on optimal transportation: Geometry, regularity and applications. In Analysis and Geometry of Metric Measure Spaces: Lecture Notes of the Séminaire de Mathématiques Supérieures (SMS) Montréal (pp. 145-180).
[38] Oord, A. v. d., Vinyals, O., & Kavukcuoglu, K. (2017). Neural discrete representation learning. arXiv preprint arXiv:1711.00937.
[39] Peyré, G., Cuturi, M., et al. (2019). Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6), 355-607.
[40] Poirier, D. J. (1998). Revising beliefs in nonidentified models. Econometric Theory, 14(4), 483-509.
[41] Rao, B. & Prakasa, R. (1992). Identifiability in Stochastic Models: Characterization of Probability Distributions. Probability and Mathematical Statistics. Academic Press.
[42] Raue, A., Kreutz, C., et al. (2009). Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood. Bioinformatics, 25(15), 1923-1929.
[43] Raue, A., Kreutz, C., Theis, F. J., & Timmer, J. (2013). Joining forces of Bayesian and frequentist methodology: A study for inference in the presence of non-identifiability. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1984), 20110544.
[44] Razavi, A., Oord, A. v. d., Poole, B., & Vinyals, O. (2019). Preventing posterior collapse with delta-VAEs. arXiv preprint arXiv:1901.03416.
[45] Reynolds, D. A. (2009). Gaussian mixture models. Encyclopedia of Biometrics, 741, 659-663.
[46] Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.
[47] Roweis, S. (1998). EM algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems (pp. 626-632).
[48] Roweis, S. & Ghahramani, Z. (1999). A unifying review of linear Gaussian models. Neural Computation, 11(2), 305-345.
[49] San Martín, E. & González, J. (2010). Bayesian identifiability: Contributions to an inconclusive debate. Chilean Journal of Statistics, 1(2), 69-91.
[50] Seybold, B., Fertig, E., Alemi, A., & Fischer, I. (2019).
Dueling decoders: Regularizing variational autoencoder latent spaces. arXiv preprint arXiv:1905.07478.
[51] Shu, R. (2016). Gaussian mixture VAE: Lessons in variational inference, generative models, and deep nets.
[52] Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., & Winther, O. (2016). How to train deep variational autoencoders and probabilistic ladder networks. In 33rd International Conference on Machine Learning (ICML 2016).
[53] Strang, G. (1993). Introduction to Linear Algebra, Volume 3. Wellesley-Cambridge Press, Wellesley, MA.
[54] Tipping, M. E. & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3), 611-622.
[55] Tomczak, J. M. & Welling, M. (2017). VAE with a VampPrior. arXiv preprint arXiv:1705.07120.
[56] Wieland, F.-G., Hauber, A. L., Rosenblatt, M., Tönsing, C., & Timmer, J. (2021). On structural and practical identifiability. Current Opinion in Systems Biology.
[57] Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747.
[58] Xie, Y. & Carlin, B. P. (2006). Measures of Bayesian learning and identifiability in hierarchical models. Journal of Statistical Planning and Inference, 136(10), 3458-3477.
[59] Yacoby, Y., Pan, W., & Doshi-Velez, F. (2020). Characterizing and avoiding problematic global optima of variational autoencoders.
[60] Yang, Z., Hu, Z., Salakhutdinov, R., & Berg-Kirkpatrick, T. (2017). Improved variational autoencoders for text modeling using dilated convolutions. In Proceedings of the 34th International Conference on Machine Learning, Volume 70 (pp. 3881-3890).
[61] Yeung, S., Kannan, A., Dauphin, Y., & Fei-Fei, L. (2017). Tackling over-pruning in variational autoencoders. arXiv preprint arXiv:1706.03643.
[62] Zhao, T., Lee, K., & Eskenazi, M. (2018). Unsupervised discrete sentence representation learning for interpretable neural dialog generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1098-1107).
[63] Zhao, Y., Yu, P., Mahapatra, S., Su, Q., & Chen, C. (2020). Discretized bottleneck in VAE: Posterior-collapse-free sequence-to-sequence learning. arXiv preprint arXiv:2004.10603.