Posterior Collapse and Latent Variable Non-identifiability

Yixin Wang, University of Michigan, yixinw@umich.edu
David M. Blei, Columbia University, david.blei@columbia.edu
John P. Cunningham, Columbia University, jpc2181@columbia.edu

Abstract

Variational autoencoders model high-dimensional data by positing low-dimensional latent variables that are mapped through a flexible distribution parametrized by a neural network. Unfortunately, variational autoencoders often suffer from posterior collapse: the posterior of the latent variables equals the prior, rendering the variational autoencoder useless as a means to produce meaningful representations. Existing approaches to posterior collapse often attribute it to the use of neural networks or to optimization issues arising from the variational approximation. In this paper, we consider posterior collapse as a problem of latent variable non-identifiability. We prove that the posterior collapses if and only if the latent variables are non-identifiable in the generative model. This fact implies that posterior collapse is not a phenomenon specific to the use of flexible distributions or approximate inference; rather, it can occur in classical probabilistic models even with exact inference, which we also demonstrate. Based on these results, we propose a class of latent-identifiable variational autoencoders, deep generative models that enforce identifiability without sacrificing flexibility. This model class resolves latent variable non-identifiability by leveraging bijective Brenier maps and parameterizing them with input convex neural networks, without special variational inference objectives or optimization tricks. Across synthetic and real datasets, latent-identifiable variational autoencoders outperform existing methods in mitigating posterior collapse and providing meaningful representations of the data.

1 Introduction

Variational autoencoders (VAE) are powerful generative models for high-dimensional data [28, 46]. Their key idea is to combine the inference principles of probabilistic modeling with the flexibility of neural networks. In a VAE, each datapoint is independently generated by a low-dimensional latent variable drawn from a prior and then mapped through a flexible distribution parametrized by a neural network.

Unfortunately, VAE often suffer from posterior collapse, an important and widely studied phenomenon where the posterior of the latent variables is equal to the prior [6, 8, 38, 62]. This phenomenon is also known as latent variable collapse, KL vanishing, and over-pruning. Posterior collapse renders the VAE useless as a means to produce meaningful representations, insomuch as its per-datapoint latent variables all have the exact same posterior.

Posterior collapse is commonly observed in VAE whose generative models are highly flexible, which has led to the common speculation that posterior collapse occurs because the VAE involves flexible neural networks in the generative model [11], or because it uses variational inference [59]. Based on these hypotheses, many of the proposed strategies for mitigating posterior collapse focus on modifying the variational inference objective (e.g., [44]), designing special optimization schemes for variational inference in VAE (e.g., [18, 25, 32]), or limiting the capacity of the generative model (e.g., [6, 16, 60]).
In this paper, we consider posterior collapse as a problem of latent variable non-identifiability. We prove that posterior collapse occurs if and only if the latent variable is non-identifiable in the generative model, which loosely means that the likelihood function does not depend on the latent variable [40, 42, 56]. Below, we formally establish this equivalence by appealing to recent results on Bayesian non-identifiability [40, 42, 43, 49, 58].

More broadly, the relationship between posterior collapse and latent variable non-identifiability implies that posterior collapse is not a phenomenon specific to the use of neural networks or variational inference. Rather, it can also occur in classical probabilistic models fitted with exact inference methods, such as Gaussian mixture models and probabilistic principal component analysis (PPCA).

This relationship also leads to a new perspective on existing methods for avoiding posterior collapse, such as the delta-VAE [44] or the β-VAE [19]. These methods heuristically adjust the approximate inference procedure embedded in the optimization of the model parameters. Though originally motivated by the goal of patching the variational objective, the results here suggest that these adjustments are useful because they help avoid parameters at which the latent variable is non-identifiable and, consequently, avoid posterior collapse.

The relationship between posterior collapse and non-identifiability points to a direct solution to the problem: we must make the latent variable identifiable. To this end, we propose the latent-identifiable VAE, a class of VAE that is as flexible as the classical VAE while also being identifiable. The latent-identifiable VAE resolves latent variable non-identifiability by leveraging Brenier maps [36, 39] and parameterizing them with input convex neural networks [2, 35]. Inference in the latent-identifiable VAE uses the standard variational inference objective, without special modifications or optimization tricks. Across synthetic and real datasets, we show that the latent-identifiable VAE mitigates posterior collapse without sacrificing fidelity to the data.

Related work. Existing approaches to avoiding posterior collapse often modify the variational inference objective, design new initialization or optimization schemes for VAE, or add neural network links between each datapoint and its latent variables [1, 3, 6, 8, 12, 15, 16, 17, 18, 21, 25, 27, 32, 34, 38, 44, 50, 51, 52, 55, 61, 62, 63]. Several recent papers also attempt to explain posterior collapse. Chen et al. [8] explain how an inexact variational approximation can lead to inefficient coding in VAE, which could lead to posterior collapse due to a form of information preference. Dai et al. [11] argue that posterior collapse can be partially attributed to local optima in training VAE with deep neural networks. Lucas et al. [33] show that posterior collapse is not specific to the variational inference training objective; absent a variational approximation, the log marginal likelihood of PPCA has bad local optima that can lead to posterior collapse. Yacoby et al. [59] discuss how the variational approximation can select an undesirable generative model when the generative model parameters are non-identifiable. In contrast to these works, we consider posterior collapse solely as a problem of latent variable non-identifiability, and not of optimization, variational approximations, or neural networks per se.
We use this result to propose the latent-identifiable VAE as a way to directly avoid posterior collapse.

Outside VAE, latent variable identifiability in probabilistic models has long been studied in the statistics literature [40, 42, 43, 49, 56, 58]. More recently, Betancourt [5] studies the effect of latent variable identifiability on Bayesian computation for Gaussian mixtures. Khemakhem et al. [23, 24] propose to resolve the non-identifiability of deep generative models by appealing to auxiliary data. Kumar & Poole [29] study how the variational family can help resolve the non-identifiability of VAE. These works address the identifiability issue for a different goal: they develop identifiability conditions for different subsets of VAE, aiming to recover the true causal factors of the data and to improve disentanglement or out-of-distribution generalization. Related to these papers, we demonstrate posterior collapse as an additional way that the concept of identifiability, though classical, can be instrumental in modern probabilistic modeling. Considering identifiability leads to new solutions to posterior collapse.

Contributions. We prove that posterior collapse occurs if and only if the latent variable in the generative model is non-identifiable. We then propose the latent-identifiable VAE, a class of VAE that are as flexible as classical VAE but have latent variables that are provably identifiable. Across synthetic and real datasets, we demonstrate that the latent-identifiable VAE mitigates posterior collapse without modifying VAE objectives or applying special optimization tricks.

2 Posterior collapse and latent variable non-identifiability

Consider a dataset x = (x_1, ..., x_n); each datapoint is m-dimensional. Positing n latent variables z = (z_1, ..., z_n), a variational autoencoder (VAE) assumes that each datapoint x_i is generated by a K-dimensional latent variable z_i:

    z_i ∼ p(z_i),    x_i | z_i ∼ p(x_i | z_i; θ) = EF(x_i | f_θ(z_i)),    (1)

where x_i follows an exponential family distribution with parameters f_θ(z_i); f_θ parameterizes the conditional likelihood. In a deep generative model, f_θ is a neural network. Classical probabilistic models like the Gaussian mixture model [45] and probabilistic PCA [10, 47, 48, 54] are also special cases of Eq. 1.
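As a concrete illustration, here is a minimal Python sketch of the generative process in Eq. 1, using a Bernoulli exponential family and a hypothetical two-layer network for f_θ; the dimensions K = 2 and m = 8 and the weights W1, W2 are illustrative placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, m = 2, 8
W1, W2 = rng.normal(size=(K, 16)), rng.normal(size=(16, m))

def f_theta(z):
    """A toy f_theta: maps a latent z to the Bernoulli means of x | z."""
    h = np.tanh(z @ W1)
    return 1.0 / (1.0 + np.exp(-h @ W2))  # sigmoid -> probabilities in (0, 1)

z = rng.normal(size=K)           # z_i ~ p(z_i), here a standard normal prior
x = rng.binomial(1, f_theta(z))  # x_i | z_i ~ EF(x_i | f_theta(z_i))
print(x)                         # one m-dimensional binary datapoint
```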
To fit the model, the VAE optimizes the parameters θ by maximizing a variational approximation of the log marginal likelihood. After finding an optimal θ̂, we can form a representation of the data using the approximate posterior q_φ̂(z | x), with variational parameters φ̂, or its expectation E_{q_φ̂(z|x)}[z | x].

Note that here we abstract away computational considerations and consider the ideal case where the variational approximation is exact. This choice is sensible: if the exact posterior suffers from posterior collapse, then so will the approximate posterior (a variational approximation cannot uncollapse a collapsed posterior). That said, there do exist situations in practice where variational inference alone can lead to posterior collapse. A notable example is when the variational approximating family is overly restrictive: it is then possible to have non-collapsing exact posteriors but collapsing approximate posteriors.

2.1 Posterior collapse ⇔ Latent variable non-identifiability

We first define posterior collapse and latent variable non-identifiability, and then prove their connection.

Definition 1 (Posterior collapse [6, 8, 38, 62]). Given a probability model p(x, z; θ), a parameter value θ = θ̂, and a dataset x = (x_1, ..., x_n), the posterior of the latent variables z collapses if

    p(z | x; θ̂) = p(z).    (2)

The posterior collapse phenomenon can occur in a variety of probabilistic models and with different latent variables. When the probability model is a VAE, it has only local latent variables z = (z_1, ..., z_n), and Eq. 2 is equivalent to the common definition of posterior collapse, p(z_i | x_i; θ̂) = p(z_i) for all i [12, 17, 33, 44]. Posterior collapse has also been observed in Gaussian mixture models [5]: the posterior of the latent mixture weights resembles their prior when the number of mixture components in the model is larger than that of the data-generating process. Regardless of the model, when posterior collapse occurs, it prevents the latent variable from providing a meaningful summary of the dataset.

Definition 2 (Latent variable non-identifiability [42, 56]). Given a likelihood function p(x | z; θ), a parameter value θ = θ̂, and a dataset x = (x_1, ..., x_n), the latent variable z is non-identifiable if

    p(x | z = z′; θ̂) = p(x | z = z; θ̂)    ∀ z′, z ∈ Z,    (3)

where Z denotes the domain of z, and z′, z refer to two arbitrary values the latent variable z can take. As a consequence, for any prior p(z) on z, the conditional likelihood equals the marginal,

    p(x | z = z; θ̂) = ∫ p(x | z; θ̂) p(z) dz = p(x; θ̂)    ∀ z ∈ Z.

Definition 2 says a latent variable z is non-identifiable when the likelihood of the dataset x does not depend on z. This notion is also known as practical non-identifiability [42, 56] and is closely related to the definition of z being conditionally non-identifiable (or conditionally uninformative) given θ̂ [40, 42, 43, 49, 58]. To enforce latent variable identifiability, it is sufficient to ensure that the likelihood p(x | z; θ) is an injective (a.k.a. one-to-one) function of z for all θ. If this condition holds, then

    z′ ≠ z  ⇒  p(x | z = z′; θ̂) ≠ p(x | z = z; θ̂).    (4)

Note that latent variable non-identifiability only requires Eq. 3 to hold for a given dataset x and parameter value θ̂. Thus a latent variable may be identifiable in a model given one dataset but not another, and at one θ but not another. See examples in Appendix A.

Latent variable identifiability (Definition 2) [42, 56] differs from model identifiability [41], a related notion that has also been cited as a contributing factor to posterior collapse [59]. Latent variable identifiability is a weaker requirement: it only requires that the latent variable z be identifiable at a particular parameter value θ = θ̂, while model identifiability requires that both z and θ be identifiable.

We now establish the equivalence between posterior collapse and latent variable non-identifiability.

Theorem 1 (Latent variable non-identifiability ⇔ Posterior collapse). Consider a probability model p(x, z; θ), a dataset x, and a parameter value θ = θ̂. The local latent variables z are non-identifiable at θ̂ if and only if the posterior of the latent variables z collapses, p(z | x; θ̂) = p(z).

Proof. To prove that non-identifiability implies posterior collapse, note that, by Bayes rule,

    p(z | x; θ̂) ∝ p(z) p(x | z; θ̂) = p(z) p(x; θ̂) ∝ p(z),    (5)

where the middle equality is due to the definition of latent variable non-identifiability. This implies p(z | x; θ̂) = p(z), as both are densities. To prove that posterior collapse implies latent variable non-identifiability, we again invoke Bayes rule. Posterior collapse implies that p(z) = p(z | x; θ̂) ∝ p(z) p(x | z; θ̂), which further implies that p(x | z; θ̂) is constant in z: if p(x | z; θ̂) depended nontrivially on z, then p(z) p(x | z; θ̂) would differ from p(z) as a function of z. □
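To make Theorem 1 concrete, the following is a minimal numerical sketch of both directions of the Bayes-rule argument, using a hypothetical three-valued discrete latent and made-up likelihood values; nothing here comes from the paper's experiments.

```python
import numpy as np

prior = np.array([0.2, 0.3, 0.5])  # p(z) over z in {0, 1, 2}

def posterior(likelihood):
    """Bayes rule: p(z | x) is proportional to p(z) p(x | z)."""
    unnorm = prior * likelihood
    return unnorm / unnorm.sum()

# Non-identifiable case: p(x | z) is constant in z, so the posterior collapses.
flat_lik = np.array([0.01, 0.01, 0.01])
print(posterior(flat_lik))          # [0.2 0.3 0.5], equal to the prior

# Identifiable case: p(x | z) varies with z, so the posterior moves.
informative_lik = np.array([0.30, 0.05, 0.01])
print(posterior(informative_lik))   # concentrates on z = 0
```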
The proof of Theorem 1 is straightforward, but the theorem has an important implication. It shows that the problem of posterior collapse mainly arises from the model and the data, rather than from inference or optimization. If the maximum likelihood parameters θ̂ of the VAE render the latent variable z non-identifiable, then we will observe posterior collapse. Theorem 1 also clarifies why posteriors may change from non-collapsed to collapsed (and back) while fitting a VAE: some parameter iterates may lead to posterior collapse; others may not.

Theorem 1 points to why existing approaches can help mitigate posterior collapse. Consider the β-VAE [19], the VAE with lagging encoder [18], and the semi-amortized VAE [25]. Though motivated by other perspectives, these methods modify the optimization objectives or algorithms of the VAE to avoid parameter values θ at which the latent variable is non-identifiable. The resulting posterior may not collapse, though the optimal parameters for these algorithms no longer approximate the maximum likelihood estimate.

Theorem 1 can also help us understand posterior collapse as observed in practice, which manifests as the phenomenon that the posterior is approximately (as opposed to exactly) equal to the prior, p(z | x; θ̂) ≈ p(z). In several empirical studies of VAE (e.g., [12, 18, 25]), we observe that the Kullback-Leibler (KL) divergence between the prior and the posterior is close to zero but not exactly zero, a property that stems from the likelihood p(x | z) being nearly constant in the latents z. In these cases, Theorem 1 provides the intuition that the latent variable is nearly non-identifiable, p(x | z′) ≈ p(x | z) ∀ z, z′, and so Eq. 2 holds approximately.

2.2 Examples of latent variable non-identifiability and posterior collapse

We illustrate Theorem 1 with three examples. Here we discuss the example of the Gaussian mixture VAE (GMVAE). See Appendix A for probabilistic principal component analysis (PPCA) and the Gaussian mixture model (GMM). The GMVAE [13, 51] is the following model:

    p(z_i) = Categorical(1/K),
    p(w_i | z_i; µ, Σ) = N(µ_{z_i}, Σ_{z_i}),
    p(x_i | w_i; f, σ) = N(f(w_i), σ² I_m),

where the µ_k are d-dimensional, the Σ_k are d × d, and the parameters are θ = (µ, Σ, f, σ²). Suppose the function f is fully flexible, so that f(w_i) can capture any distribution of the data. The latent variable of interest is the categorical z = (z_1, ..., z_n). If its posterior collapses, then p(z_i = k | x) = 1/K for all k = 1, ..., K.

Consider fitting a GMVAE model with K = 2 to a dataset of 5,000 samples. This dataset is drawn from a GMVAE also with K = 2 well-separated clusters; there is no model misspecification. A GMVAE is typically fit by maximizing the log marginal likelihood, θ̂ = argmax_θ log p(x; θ). Note there may be multiple values of θ that achieve the global optimum of this function. We focus on two likelihood maximizers. One provides latent variable identifiability, and the posterior of z_i does not collapse. The other does not provide identifiability; the posterior collapses.

1. The first likelihood-maximizing parameter θ̂_1 is the truth; the distributions of the K fitted clusters correspond to the K data-generating clusters.
Given this parameter, the latent variable z_i is identifiable because the K data-generating clusters are different; different cluster memberships z_i must result in different likelihoods p(x_i | z_i; θ̂_1). The posterior of z_i does not collapse.

2. In the second likelihood-maximizing parameter θ̂_2, however, all K fitted clusters share the same distribution, each of which is equal to the marginal distribution of the data. Specifically, (µ_k, Σ_k) = (0, I_d) for all k, and each fitted cluster is a mixture of the K original data-generating clusters, i.e., the marginal. At this parameter value, the model is still able to fully capture the mixture distribution of the data. However, all K mixture components are the same, so the latent variable z_i is non-identifiable; different cluster memberships z_i do not result in different likelihoods p(x_i | z_i; θ̂_2), and hence the posterior of z_i collapses. Figure 1a illustrates a fit of this (non-identifiable) GMVAE to the pinwheel data [22]. In Section 3, we construct a latent-identifiable VAE (LIDVAE) that avoids this collapse.

Latent variable identifiability is a function of both the model and the true data-generating distribution. Consider fitting the same GMVAE with K = 2 but to a different dataset of 5,000 samples, this one drawn from a GMVAE with only one cluster. (There is model misspecification.) One maximizing parameter value θ̂_3 is where both of the fitted clusters correspond to the true data-generating cluster. While this parameter value resembles the first maximizer θ̂_1 above (both correspond to the true data-generating clusters), this dataset leads to a different situation for latent variable identifiability. The two fitted clusters are the same, so different cluster memberships do not result in different likelihoods p(x_i | z_i; θ̂_3). The latent variable z_i is not identifiable, and its posterior collapses; a numerical sketch of this collapse appears at the end of this section.

Takeaways. The GMVAE example in this section (and the PPCA and GMM examples in Appendix A) illustrates different ways that a latent variable can be non-identifiable in a model and suffer from posterior collapse. It shows that even the true posterior, without variational inference, can collapse in non-identifiable models. It also illustrates that whether a latent variable is identifiable can depend on both the model and the data. Posterior collapse is an intrinsic problem of the model and the data, rather than one specific to the use of neural networks or variational inference. The equivalence between posterior collapse and latent variable non-identifiability in Theorem 1 also implies that, to mitigate posterior collapse, we should try to resolve latent variable non-identifiability. In the next section, we develop such a class of latent-identifiable VAE.
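The following is a minimal sketch of the misspecified case θ̂_3, assuming a one-dimensional Gaussian mixture as a stand-in for the GMVAE: data drawn from a single standard normal, and a K = 2 model whose two components are both set to the data marginal N(0, 1). The posterior responsibilities then equal the prior weights exactly.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=5000)       # data from a single cluster

weights = np.array([0.5, 0.5])            # prior p(z_i = k) = 1/K
means, stds = [0.0, 0.0], [1.0, 1.0]      # both components = the data marginal

# Posterior responsibilities p(z_i = k | x_i) via Bayes rule.
lik = np.stack([norm.pdf(x, m, s) for m, s in zip(means, stds)], axis=1)
resp = weights * lik
resp /= resp.sum(axis=1, keepdims=True)
print(resp[:3])   # every row is [0.5 0.5]: the posterior equals the prior
```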
3 Latent-identifiable VAE via Brenier maps

We now construct the latent-identifiable VAE, a class of VAE whose latent variables are guaranteed to be identifiable, and thus whose posteriors cannot collapse.

3.1 The latent-identifiable VAE

To construct the latent-identifiable VAE, we rely on a key observation: to guarantee latent variable identifiability, it is sufficient to make the likelihood function p(x_i | z_i; θ) injective for all values of θ. If the likelihood is injective, then, for any θ, each value of z_i will lead to a different distribution p(x_i | z_i; θ). In particular, this fact will be true for any optimized θ̂, so the latent z_i must be identifiable, regardless of the data. By Theorem 1, its posterior cannot collapse.

Constructing the latent-identifiable VAE thus amounts to constructing an injective likelihood function for the VAE. The construction is based on a few building blocks of linear and nonlinear injective functions, which are then composed into an injective likelihood p(x_i | z_i; θ) mapping from Z^K to X^m, where Z and X indicate the sets of values z_i and x_i can take. For example, if x_i is an m-dimensional binary vector, then X^m = {0,1}^m; if z_i is a K-dimensional real-valued vector, then Z^K = R^K.

The building blocks of LIDVAE: injective functions. For linear mappings from R^{d1} to R^{d2} (d2 ≥ d1), we consider multiplication by a d1 × d2 matrix β. For a d1-dimensional variable z, left multiplication by the matrix β^⊤ is injective when β has full column rank [53]. For example, a matrix with all ones on the main diagonal and all other entries zero has full column rank.

For nonlinear injective functions, we focus on Brenier maps [4, 37]. A d-dimensional Brenier map is the gradient of a convex function from R^d to R. That is, a Brenier map satisfies g = ∇T for some convex function T: R^d → R. Brenier maps are also known as monotone transport maps. They are guaranteed to be bijective [4, 37] because their derivative is the Hessian of a convex T, which must be positive semidefinite and has a nonnegative determinant [4].

To build a VAE with Brenier maps, we require a neural network parametrization of the Brenier map. As Brenier maps are gradients of convex functions, we begin with a neural network parametrization of convex functions, namely the input convex neural network (ICNN) [2, 35]. This parameterization of convex functions will enable Brenier maps to be parameterized as gradients of ICNN. An L-layer ICNN is a neural network mapping from R^d to R. Given an input u ∈ R^d, its l-th layer is

    z_0 = u,    z_{l+1} = h_l(W_l z_l + A_l u + b_l),    l = 0, ..., L−1,    (6)

where the last layer z_L must be a scalar, and the {W_l} are non-negative weight matrices with W_0 = 0. The functions h_l: R → R are convex and non-decreasing entry-wise activation functions for layer l; they are applied element-wise to the vector W_l z_l + A_l u + b_l. A common choice of h_0 is the square of a leaky ReLU, h_0(x) = (max(αx, x))² with α = 0.2; the remaining h_l are set to be leaky ReLUs, h_l(x) = max(αx, x). This neural network is called input convex because it is guaranteed to be a convex function of its input. Input convex neural networks can approximate any convex function on a compact domain in sup norm (Theorem 1 of Chen et al. [9]).

Given this neural network parameterization of convex functions, we can parametrize the Brenier map g_θ(·) as its gradient with respect to the input, g_θ(u) = ∂z_L/∂u. This neural network parameterization of the Brenier map is a universal approximator of all Brenier maps on a compact domain, because input convex neural networks are universal approximators of convex functions [9].

The latent-identifiable VAE (LIDVAE). We construct injective likelihoods for LIDVAE by composing two bijective Brenier maps with an injective matrix multiplication. As the composition of injective and bijective mappings must be injective, the resulting composition must be injective. Suppose g_{1,θ}: R^K → R^K and g_{2,θ}: R^D → R^D are two Brenier maps, and β is a K × D matrix (D ≥ K) with all main diagonal entries equal to one and all other entries zero. The matrix β^⊤ has full column rank, so multiplication by β^⊤ is injective. Thus the composition g_{2,θ}(β^⊤ g_{1,θ}(·)) must be an injective function from the low-dimensional space R^K to the high-dimensional space R^D.
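As an illustration of this construction, here is a minimal PyTorch sketch of an ICNN (Eq. 6) and of taking its input gradient as a Brenier map. The architecture sizes, initialization, and the helper names ICNN and brenier_map are our own hypothetical choices, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    """Input convex neural network (Eq. 6): convex in u because the W_l are
    non-negative and the activations h_l are convex and non-decreasing."""
    def __init__(self, dim, hidden=64, alpha=0.2):
        super().__init__()
        self.alpha = alpha
        # The A_l act directly on the input u; W_0 = 0, so layer 0 has no W.
        self.A = nn.ModuleList([nn.Linear(dim, hidden),
                                nn.Linear(dim, hidden),
                                nn.Linear(dim, 1)])
        self.W = nn.ParameterList([nn.Parameter(0.1 * torch.rand(hidden, hidden)),
                                   nn.Parameter(0.1 * torch.rand(1, hidden))])

    def forward(self, u):
        z = F.leaky_relu(self.A[0](u), self.alpha) ** 2   # h_0: squared leaky ReLU
        for W, A in zip(self.W, self.A[1:]):
            # clamp keeps W_l non-negative, preserving convexity in u
            z = F.leaky_relu(z @ W.clamp(min=0).T + A(u), self.alpha)
        return z   # shape (batch, 1): a scalar convex function of u

def brenier_map(icnn, u):
    """g(u) = dz_L/du: the gradient of a convex function, hence a Brenier map."""
    if not u.requires_grad:                 # leaf inputs need the flag switched on
        u = u.clone().requires_grad_(True)
    (grad,) = torch.autograd.grad(icnn(u).sum(), u, create_graph=True)
    return grad

T = ICNN(dim=2)
g = brenier_map(T, torch.randn(5, 2))   # map a batch of 5 points through g
print(g.shape)                          # torch.Size([5, 2])
```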
Definition 3 (Latent-identifiable VAE (LIDVAE) via Brenier maps). An LIDVAE via Brenier maps generates a D-dimensional datapoint x_i, i ∈ {1, ..., n}, by

    z_i ∼ p(z_i),    x_i | z_i ∼ EF(x_i | g_{2,θ}(β^⊤ g_{1,θ}(z_i))),    (7)

where EF stands for an exponential family distribution and z_i is a K-dimensional latent variable, discrete or continuous. The parameters of the model are θ = (g_{1,θ}, g_{2,θ}), where g_{1,θ}: R^K → R^K and g_{2,θ}: R^D → R^D are two continuous Brenier maps. The matrix β is a K × D matrix (D ≥ K) with all main diagonal entries equal to one and all other entries zero.

Contrasting the LIDVAE (Eq. 7) with the classical VAE (Eq. 1), the LIDVAE replaces the function f_θ: Z^K → X^D with the injective mapping g_{2,θ}(β^⊤ g_{1,θ}(·)), composed of the bijective Brenier maps g_{1,θ} and g_{2,θ} and a zero-one matrix β^⊤ with full column rank. As exponential family likelihoods are injective in their parameters, the likelihood function p(x_i | z_i; θ) = EF(x_i | g_{2,θ}(β^⊤ g_{1,θ}(z_i))) of the LIDVAE must be injective. Therefore, replacing an arbitrary function f_θ: Z^K → X^D with the injective mapping g_{2,θ}(β^⊤ g_{1,θ}(·)) plays a crucial role in enforcing identifiability of the latent variable z_i and avoiding posterior collapse in LIDVAE. As the latent z_i must be identifiable in LIDVAE, its posterior does not collapse.

Despite its injective likelihood, the LIDVAE is as flexible as the VAE; the use of Brenier maps and ICNN does not limit the capacity of the generative model. Loosely, the LIDVAE can model any distribution in R^D because Brenier maps can map any given non-atomic distribution in R^d to any other one in R^d [37]. Moreover, the ICNN parametrization is a universal approximator of Brenier maps [2]. We summarize the key properties of LIDVAE in the following proposition.

Proposition 2. The latent variable z_i is identifiable in LIDVAE, i.e., for all i ∈ {1, ..., n}, we have

    p(x_i | z_i = z′; θ) = p(x_i | z_i = z; θ)  ⇒  z′ = z,    ∀ z′, z, θ.    (8)

Moreover, for any VAE-generated data distribution, there exists an LIDVAE that can generate the same distribution. (The proof is in Appendix B.)

Figure 1: (a) Non-identifiable GMVAE; (b) LIDGMVAE; (c) accuracy; (d) log-likelihood. (a)-(b): The posterior of the classical GMVAE [13, 26, 51] collapses when fit to the pinwheel dataset; the latents predict the same value for all datapoints. The posteriors of the latent-identifiable Gaussian mixture VAE (LIDGMVAE), however, do not collapse and provide meaningful representations. (c)-(d): The latent-identifiable GMVAE produces posteriors that are substantially more informative than the GMVAE when fit to Fashion-MNIST. It also achieves higher test log-likelihood.

3.2 Inference in LIDVAE

Performing inference in LIDVAE is identical to inference in the classical VAE, as the two differ only in their parameter constraints. To fit an LIDVAE, we use the classical amortized inference algorithm of the VAE; we maximize the evidence lower bound (ELBO) of the log marginal likelihood [28]. In general, LIDVAE is a drop-in replacement for VAE. Both have the same capacity (Proposition 2) and share the same inference algorithm, but LIDVAE is identifiable and does not suffer from posterior collapse. The price we pay for LIDVAE is computational: the generative model (i.e., the decoder) is parametrized using the gradient of a neural network; its optimization thus requires calculating gradients of the gradient of a neural network, which increases the computational complexity of VAE inference and can sometimes challenge optimization. While fitting a classical VAE using stochastic gradient descent has O(kp) computational complexity, where k is the number of iterations and p is the number of parameters, fitting a latent-identifiable VAE may require O(kp²) computational complexity.
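Continuing the PyTorch sketch above, a hypothetical LIDVAE decoder composes the two Brenier maps with the zero-padding matrix β^⊤ as in Eq. 7; the dimensions K = 2 and D = 4 are illustrative only.

```python
# Reuses ICNN and brenier_map from the sketch in Section 3.1.
K, D = 2, 4
g1, g2 = ICNN(dim=K), ICNN(dim=D)
beta = torch.eye(K, D)        # beta: K x D, ones on the diagonal, zeros elsewhere

def decoder_mean(z):
    """Injective map g_2(beta^T g_1(z)) from R^K to R^D (Eq. 7)."""
    h = brenier_map(g1, z)    # bijective on R^K
    h = h @ beta              # rows are beta^T h_i: zero-pad R^K into R^D
    return brenier_map(g2, h) # bijective on R^D

z = torch.randn(3, K)
print(decoder_mean(z).shape)  # torch.Size([3, 4])
```

During training, decoder_mean would supply the exponential family parameters inside the standard ELBO, so the usual VAE training loop applies unchanged.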
3.3 Extensions of LIDVAE

The construction of LIDVAE reveals a general strategy for making the latent variables of generative models identifiable: replace nonlinear mappings with injective nonlinear mappings. We can employ this strategy to make the latent variables of many other VAE variants identifiable. Below we give two examples, the mixture VAE and the sequential VAE.

The mixture VAE, with the GMVAE as a special case, models the data with an exponential family mixture mapped through a flexible neural network. We develop its latent-identifiable counterpart using Brenier maps.

Example 1 (Latent-identifiable mixture VAE (LIDMVAE)). An LIDMVAE generates a D-dimensional datapoint x_i, i ∈ {1, ..., n}, by

    z_i ∼ Categorical(1/K),
    w_i | z_i ∼ EF(w_i | β_1^⊤ z_i),
    x_i | w_i ∼ EF(x_i | g_{2,θ}(β_2^⊤ g_{1,θ}(w_i))),    (9)

where z_i is a K-dimensional one-hot vector that indicates the cluster assignment. The parameters of the model are θ = (g_{1,θ}, g_{2,θ}), where the functions g_{1,θ}: R^M → R^M and g_{2,θ}: R^D → R^D are two continuous Brenier maps. The matrices β_1 and β_2 are a K × M matrix (M ≥ K) and an M × D matrix (D ≥ M), respectively, both with all main diagonal entries equal to one and all other entries zero.

The LIDMVAE differs from the classical mixture VAE in p(x_i | z_i), where we replace the neural network mapping with its injective counterpart, i.e., a composition of two Brenier maps and a matrix multiplication, g_{2,θ}(β_2^⊤ g_{1,θ}(·)). As a special case, setting both exponential families in Example 1 to be Gaussian gives the LIDGMVAE, which we will use to model images in Section 4. Next we derive the identifiable counterpart of the sequential VAE, which models the data with an autoregressive model conditional on the latents.

Example 2 (Latent-identifiable sequential VAE (LIDSVAE)). An LIDSVAE generates a D-dimensional datapoint x_i, i ∈ {1, ..., n}, by

    z_i ∼ p(z_i),    x_{it} | z_i, x_{i,<t} ∼ EF(x_{it} | g_{2,θ}(β^⊤ g_{1,θ}([z_i, f_θ(x_{i,<t})]))),

where f_θ is an autoregressive neural network that summarizes the preceding observations x_{i,<t}.

4 Empirical studies

Consider a PPCA model with x_i | z_i ∼ N(z_i^⊤ w, σ² I_5), where the value of σ² is known, the z_i are the latent variables of interest, and w is the only parameter of interest. When the noise σ² is set to a large value, the latent variable z_i may become nearly non-identifiable. The reason is that the likelihood function p(x_i | z_i) becomes slower-varying as σ² increases. For example, Figure 2 shows that the likelihood surface becomes flatter as σ² increases. Accordingly, the posterior becomes closer to the prior as σ² increases. When σ = 1.5, the posterior collapses. This non-identifiability argument provides an explanation for the closely related phenomenon described in Section 6.2 of [33].
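The flattening of the PPCA posterior can be checked in closed form. Below is a small sketch, assuming the standard PPCA parameterization x_i | z_i ∼ N(W z_i, σ² I) with prior z_i ∼ N(0, I_K) and a made-up loading matrix W; it computes the exact conjugate posterior and its KL divergence from the prior as σ² grows.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 2, 5
W = rng.normal(size=(M, K))   # hypothetical loadings
x = rng.normal(size=M)        # one observation

for sigma2 in [0.1, 1.0, 10.0, 100.0]:
    # Conjugate posterior: z | x ~ N(C W^T x / sigma2, C), C = (I + W^T W / sigma2)^{-1}
    C = np.linalg.inv(np.eye(K) + W.T @ W / sigma2)
    m = C @ W.T @ x / sigma2
    # KL(posterior || prior) with prior N(0, I); zero means exact collapse.
    kl = 0.5 * (np.trace(C) + m @ m - K - np.log(np.linalg.det(C)))
    print(f"sigma^2 = {sigma2:6.1f}   KL = {kl:.4f}")
```

As σ² increases, the KL shrinks toward zero: the exact posterior approaches the prior, matching the near-non-identifiability argument above.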
5 Discussion

In this work, we show that the posterior collapse phenomenon is a problem of latent variable non-identifiability. It is not specific to the use of neural networks or particular inference algorithms in VAE. Rather, it is an intrinsic issue of the model and the dataset. To address it, we propose the class of LIDVAE via Brenier maps, which resolves latent variable non-identifiability and mitigates posterior collapse. Across empirical studies, we find that LIDVAE outperforms existing methods in mitigating posterior collapse.

The latent variables of LIDVAE are guaranteed to be identifiable. However, this does not guarantee that the latent variables and the parameters of LIDVAE are jointly identifiable. In other words, the LIDVAE model may not be identifiable even though its latents are identifiable. This difference between latent variable identifiability and model identifiability may appear minor, but the tractability of resolving latent variable identifiability plays a key role in making non-identifiability a fruitful perspective on posterior collapse. To enforce latent variable identifiability, it is sufficient to ensure that the likelihood p(x | z; θ) is an injective function of z. In contrast, resolving model identifiability for the general class of VAE remains a long-standing open problem, with some recent progress relying on auxiliary variables [23, 24]. The tractability of resolving latent variable identifiability is a key catalyst of a principled solution to mitigating posterior collapse.

There are a few limitations of this work. One is that the theoretical argument focuses on the collapse of the exact posterior. The rationale is that, if the exact posterior collapses, then its variational approximation must also collapse, because a variational approximation cannot uncollapse a posterior. That said, a variational approximation may collapse a posterior, i.e., the exact posterior does not collapse but the variational approximate posterior does. The theoretical argument and algorithmic approaches developed in this work do not apply to this setting, which remains an interesting avenue for future work. A second limitation is that the latent-identifiable VAE developed in this work bears a higher computational cost than the classical VAE. While the latent-identifiable VAE ensures the identifiability of its latent variables and mitigates posterior collapse, it does come with a price in computation, because its generative model (i.e., the decoder) is parametrized using gradients of a neural network. Fitting the latent-identifiable VAE thus requires calculating gradients of gradients of a neural network, leading to much higher computational complexity than fitting the classical VAE. Developing computationally efficient variants of the latent-identifiable VAE is another interesting direction for future work.

Acknowledgments. We thank Taiga Abe and Gemma Moran for helpful discussions, and anonymous reviewers for constructive feedback that improved the manuscript. David Blei is supported by ONR N00014-17-1-2131, ONR N00014-15-1-2209, NSF CCF-1740833, DARPA SD2 FA8750-18-C-0130, Amazon, and the Simons Foundation. John Cunningham is supported by the Simons Foundation, McKnight Foundation, Zuckerman Institute, Grossman Center, and Gatsby Charitable Trust.

References

[1] Alemi, A. A., Poole, B., et al. (2017). Fixing a broken ELBO. arXiv preprint arXiv:1711.00464.
[2] Amos, B., Xu, L., & Kolter, J. Z. (2017). Input convex neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70 (pp. 146-155). JMLR.org.
[3] Asperti, A. (2019). Variational autoencoders and the variable collapse phenomenon. Sensors & Transducers, 234(6), 1-8.
[4] Ball, K. (2004). An elementary introduction to monotone transportation. In Geometric Aspects of Functional Analysis (pp. 41-52). Springer.
[5] Betancourt, M. (2017). Identifying Bayesian mixture models. https://mc-stan.org/users/documentation/case-studies/identifying_mixture_models. Accessed: 2021-05-04.
[6] Bowman, S., Vilnis, L., et al. (2016). Generating sentences from a continuous space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (pp. 10-21).
[7] Burda, Y., Grosse, R., & Salakhutdinov, R. (2015). Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.
[8] Chen, X., Kingma, D. P., et al. (2016). Variational lossy autoencoder. arXiv preprint arXiv:1611.02731.
[9] Chen, Y., Shi, Y., & Zhang, B. (2018). Optimal control via neural networks: A convex approach. arXiv preprint arXiv:1805.11835.
[10] Collins, M., Dasgupta, S., & Schapire, R. E. (2001). A generalization of principal components analysis to the exponential family. In NIPS, Volume 13 (p. 23).
[11] Dai, B., Wang, Z., & Wipf, D. (2019). The usual suspects? Reassessing blame for VAE posterior collapse. arXiv preprint arXiv:1912.10702.
[12] Dieng, A. B., Kim, Y., Rush, A. M., & Blei, D. M. (2018). Avoiding latent variable collapse with generative skip models. arXiv preprint arXiv:1807.04863.
[13] Dilokthanakul, N., Mediano, P. A., et al. (2016). Deep unsupervised clustering with Gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648.
[14] Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2016). Density estimation using Real NVP. arXiv preprint arXiv:1605.08803.
[15] Fu, H., Li, C., et al. (2019). Cyclical annealing schedule: A simple approach to mitigating KL vanishing. arXiv preprint arXiv:1903.10145.
[16] Gulrajani, I., Kumar, K., et al. (2016). PixelVAE: A latent variable model for natural images. arXiv preprint arXiv:1611.05013.
[17] Havrylov, S. & Titov, I. (2020). Preventing posterior collapse with Levenshtein variational autoencoder. arXiv preprint arXiv:2004.14758.
[18] He, J., Spokoyny, D., Neubig, G., & Berg-Kirkpatrick, T. (2019). Lagging inference networks and posterior collapse in variational autoencoders. arXiv preprint arXiv:1901.05534.
[19] Higgins, I., Matthey, L., et al. (2016). β-VAE: Learning basic visual concepts with a constrained variational framework.
[20] Hochreiter, S. & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
[21] Hoffman, M. D. & Johnson, M. J. (2016). ELBO surgery: Yet another way to carve up the variational evidence lower bound.
[22] Johnson, M. J., Duvenaud, D. K., Wiltschko, A., Adams, R. P., & Datta, S. R. (2016). Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems (pp. 2946-2954).
[23] Khemakhem, I., Kingma, D. P., & Hyvärinen, A. (2019). Variational autoencoders and nonlinear ICA: A unifying framework. arXiv preprint arXiv:1907.04809.
[24] Khemakhem, I., Monti, R. P., Kingma, D. P., & Hyvarinen, A. (2020). ICE-BeeM: Identifiable conditional energy-based deep models based on nonlinear ICA.
[25] Kim, Y., Wiseman, S., Miller, A., Sontag, D., & Rush, A. (2018). Semi-amortized variational autoencoders. In International Conference on Machine Learning (pp. 2678-2687).
[26] Kingma, D. P., Mohamed, S., Rezende, D. J., & Welling, M. (2014). Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems (pp. 3581-3589).
[27] Kingma, D. P., Salimans, T., et al. (2016). Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems (pp. 4743-4751).
[28] Kingma, D. P. & Welling, M. (2014). Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), Volume 1.
[29] Kumar, A. & Poole, B. (2020). On implicit regularization in β-VAEs. In International Conference on Machine Learning (pp. 5480-5490). PMLR.
[30] Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266), 1332-1338.
[31] LeCun, Y., Cortes, C., & Burges, C. (2010). MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist.
[32] Li, B., He, J., Neubig, G., Berg-Kirkpatrick, T., & Yang, Y. (2019). A surprisingly effective fix for deep latent variable modeling of text. arXiv preprint arXiv:1909.00868.
[33] Lucas, J., Tucker, G., Grosse, R. B., & Norouzi, M. (2019). Don't blame the ELBO! A linear VAE perspective on posterior collapse. In Advances in Neural Information Processing Systems (pp. 9403-9413).
[34] Maaløe, L., Fraccaro, M., Liévin, V., & Winther, O. (2019). BIVA: A very deep hierarchy of latent variables for generative modeling. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in Neural Information Processing Systems, Volume 32. Curran Associates, Inc.
[35] Makkuva, A. V., Taghvaei, A., Oh, S., & Lee, J. D. (2019). Optimal transport mapping via input convex neural networks. arXiv preprint arXiv:1908.10962.
[36] McCann, R. J. et al. (1995). Existence and uniqueness of monotone measure-preserving maps. Duke Mathematical Journal, 80(2), 309-324.
[37] McCann, R. J. & Guillen, N. (2011). Five lectures on optimal transportation: Geometry, regularity and applications. In Analysis and Geometry of Metric Measure Spaces: Lecture Notes of the Séminaire de Mathématiques Supérieures (SMS) Montréal (pp. 145-180).
[38] Oord, A. v. d., Vinyals, O., & Kavukcuoglu, K. (2017). Neural discrete representation learning. arXiv preprint arXiv:1711.00937.
[39] Peyré, G., Cuturi, M., et al. (2019). Computational optimal transport. Foundations and Trends in Machine Learning, 11(5-6), 355-607.
[40] Poirier, D. J. (1998). Revising beliefs in nonidentified models. Econometric Theory, 14(4), 483-509.
[41] Rao, B. & Prakasa, R. (1992). Identifiability in Stochastic Models: Characterization of Probability Distributions. Probability and Mathematical Statistics. Academic Press.
[42] Raue, A., Kreutz, C., et al. (2009). Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood. Bioinformatics, 25(15), 1923-1929.
[43] Raue, A., Kreutz, C., Theis, F. J., & Timmer, J. (2013). Joining forces of Bayesian and frequentist methodology: A study for inference in the presence of non-identifiability. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1984), 20110544.
[44] Razavi, A., Oord, A. v. d., Poole, B., & Vinyals, O. (2019). Preventing posterior collapse with delta-VAEs. arXiv preprint arXiv:1901.03416.
[45] Reynolds, D. A. (2009). Gaussian mixture models. Encyclopedia of Biometrics, 741, 659-663.
[46] Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.
[47] Roweis, S. (1998). EM algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems (pp. 626-632).
[48] Roweis, S. & Ghahramani, Z. (1999). A unifying review of linear Gaussian models. Neural Computation, 11(2), 305-345.
[49] San Martín, E. & González, J. (2010). Bayesian identifiability: Contributions to an inconclusive debate. Chilean Journal of Statistics, 1(2), 69-91.
[50] Seybold, B., Fertig, E., Alemi, A., & Fischer, I. (2019).
Dueling decoders: Regularizing variational autoencoder latent spaces. arXiv preprint arXiv:1905.07478.
[51] Shu, R. (2016). Gaussian mixture VAE: Lessons in variational inference, generative models, and deep nets.
[52] Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., & Winther, O. (2016). How to train deep variational autoencoders and probabilistic ladder networks. In 33rd International Conference on Machine Learning (ICML 2016).
[53] Strang, G. (1993). Introduction to Linear Algebra, Volume 3. Wellesley-Cambridge Press, Wellesley, MA.
[54] Tipping, M. E. & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3), 611-622.
[55] Tomczak, J. M. & Welling, M. (2017). VAE with a VampPrior. arXiv preprint arXiv:1705.07120.
[56] Wieland, F.-G., Hauber, A. L., Rosenblatt, M., Tönsing, C., & Timmer, J. (2021). On structural and practical identifiability. Current Opinion in Systems Biology.
[57] Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747.
[58] Xie, Y. & Carlin, B. P. (2006). Measures of Bayesian learning and identifiability in hierarchical models. Journal of Statistical Planning and Inference, 136(10), 3458-3477.
[59] Yacoby, Y., Pan, W., & Doshi-Velez, F. (2020). Characterizing and avoiding problematic global optima of variational autoencoders.
[60] Yang, Z., Hu, Z., Salakhutdinov, R., & Berg-Kirkpatrick, T. (2017). Improved variational autoencoders for text modeling using dilated convolutions. In Proceedings of the 34th International Conference on Machine Learning, Volume 70 (pp. 3881-3890).
[61] Yeung, S., Kannan, A., Dauphin, Y., & Fei-Fei, L. (2017). Tackling over-pruning in variational autoencoders. arXiv preprint arXiv:1706.03643.
[62] Zhao, T., Lee, K., & Eskenazi, M. (2018). Unsupervised discrete sentence representation learning for interpretable neural dialog generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1098-1107).
[63] Zhao, Y., Yu, P., Mahapatra, S., Su, Q., & Chen, C. (2020). Discretized bottleneck in VAE: Posterior-collapse-free sequence-to-sequence learning. arXiv preprint arXiv:2004.10603.