# Learning Manifold Dimensions with Conditional Variational Autoencoders

Yijia Zheng¹, Tong He², Yixuan Qiu³, David Wipf²

¹ Department of Statistics, Purdue University
² Amazon Web Services
³ School of Statistics and Management, Shanghai University of Finance and Economics

zheng709@purdue.edu, {htong, daviwipf}@amazon.com, qiuyixuan@sufe.edu.cn

*Work completed during an internship at the AWS Shanghai AI Labs.*

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

## Abstract

Although the variational autoencoder (VAE) and its conditional extension (CVAE) are capable of state-of-the-art results across multiple domains, their precise behavior is still not fully understood, particularly in the context of data (like images) that lie on or near a low-dimensional manifold. For example, while prior work has suggested that the globally optimal VAE solution can learn the correct manifold dimension, a necessary (but not sufficient) condition for producing samples from the true data distribution, this has never been rigorously proven. Moreover, it remains unclear how such considerations would change when various types of conditioning variables are introduced, or when the data support is extended to a union of manifolds (e.g., as is likely the case for MNIST digits and related). In this work, we address these points by first proving that VAE global minima are indeed capable of recovering the correct manifold dimension. We then extend this result to more general CVAEs, demonstrating practical scenarios whereby the conditioning variables allow the model to adaptively learn manifolds of varying dimension across samples. Our analyses, which have practical implications for various CVAE design choices, are also supported by numerical results on both synthetic and real-world datasets.

## 1 Introduction

Variational autoencoders (VAE) [6, 14] and conditional variants (CVAE) [17] are powerful generative models that produce competitive results in various domains such as image synthesis [5, 13, 21], natural language processing [16], time-series forecasting [9, 19], and trajectory prediction [8]. As a representative example, when equipped with an appropriate deep architecture, VAE models have recently achieved state-of-the-art performance in generating large-scale images [11]. And yet despite this success, there remain VAE/CVAE behaviors in certain regimes of interest where we lack a precise understanding or a supporting theoretical foundation.

In particular, when the data lie on or near a low-dimensional manifold, as occurs with real-world images [12], it is meaningful to have a model that learns the manifold dimension correctly. The latter can provide insight into core properties of the data and be viewed as a necessary, albeit not sufficient, condition for producing samples from the true distribution. Although it has been suggested in prior work [3, 4] that a VAE model can learn the correct manifold dimension when globally optimized, this has only been formally established under the assumption that the decoder is linear or affine [2].

The potential ability to learn the correct manifold dimension becomes even more nuanced when conditioning variables are introduced. In this regard, a set of discrete conditions (e.g., MNIST image digit labels) may correspond with different slices through the data space, with each inducing a manifold of potentially different dimension (intuitively, the manifold dimension of images labelled "1" is likely smaller than that of images labelled "5").
Alternatively, it is possible for the data to fill the ambient space and yet lie on a low-dimensional manifold once continuous conditioning variables are present. Such a situation can be trivially constructed by simply treating some data dimensions, or transformations thereof, as the conditioning variables. In both scenarios, the role of CVAE models remains under-explored.

Moreover, unresolved CVAE properties in the face of low-dimensional data structure extend to practical design decisions as well. For example, there has been ongoing investigation into the choice between a fixed VAE decoder variance and a learnable one [3, 4, 10, 15, 18], an issue of heightened significance when conditioning variables are involved. And there exists similar ambiguity regarding the commonly-adopted strategy of sharing weights between the prior and encoder/posterior in CVAEs [7, 17]. Although perhaps not obvious at first glance, in both cases these considerations are inextricably linked to the capability of learning data manifold dimensions.

Against this backdrop, our paper makes the following contributions:

- (i) In Section 2.1 we provide the first demonstration of general conditions under which VAE global minimizers provably learn the correct data manifold dimension.
- (ii) We then extend the above result in Section 2.2 to address certain classes of CVAE models with either continuous or discrete conditioning variables, the latter being associated with data lying on a union of manifolds.
- (iii) Later, Section 3 investigates common CVAE model designs and training practices, including the impact of strategies for handling the decoder variance, as well as the impact of weight sharing between conditional prior and posterior networks.
- (iv) Section 4 supports our theoretical conclusions and analysis with numerical experiments on both synthetic and real-world datasets.

## 2 Learning the Dimension of Data Manifolds

In this section we begin with analysis that applies to regular VAE models with no conditioning; we then extend these results to more general CVAE scenarios.

### 2.1 VAE Analysis

We begin with observed variables $x \in \mathcal{X} \subseteq \mathbb{R}^d$, where $\mathcal{X}$ is the ambient data space equipped with some ground-truth probability measure $\omega_{gt}$. Hence the probability mass of an infinitesimal $dx$ on $\mathcal{X}$ is $\omega_{gt}(dx)$ and $\int_{\mathcal{X}} \omega_{gt}(dx) = 1$. VAE models attempt to approximate this measure with a parameterized distribution $p_\theta(x)$, instantiated as a marginalization over latent variables $z \in \mathbb{R}^\kappa$ via $p_\theta(x) = \int p_\theta(x|z)\,p(z)\,dz$. Here $p(z) = \mathcal{N}(z|0, I)$ is a standardized Gaussian prior and $p_\theta(x|z)$ represents a parameterized likelihood function that is typically referred to as the decoder.

To estimate decoder parameters $\theta$, the canonical VAE training loss is formed as a bound on the average negative log-likelihood given by

$$
\mathcal{L}(\theta, \phi) = \int_{\mathcal{X}} \Big\{ -\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] + \mathrm{KL}\big[q_\phi(z|x) \,\|\, p(z)\big] \Big\}\, \omega_{gt}(dx) \;\ge\; -\int_{\mathcal{X}} \log p_\theta(x)\, \omega_{gt}(dx), \tag{1}
$$

where the latent posterior distribution $q_\phi(z|x)$ (or the VAE encoder) controls the tightness of the bound via trainable parameters $\phi$. Borrowing from [4], we package widely-adopted VAE modeling assumptions into the following definition:

**Definition 1 (κ-simple VAE)** A κ-simple VAE is a VAE model with $\dim[z] = \kappa$ latent dimensions, the Gaussian encoder $q_\phi(z|x) = \mathcal{N}(z \,|\, \mu_z(x;\phi), \mathrm{diag}\{\sigma_z^2(x;\phi)\})$, the Gaussian decoder $p_\theta(x|z) = \mathcal{N}(x \,|\, \mu_x(z;\theta), \gamma I)$, and the prior $p(z) = \mathcal{N}(z|0, I)$. Here $\gamma > 0$ is a trainable scalar included within $\theta$, while the mean functions $\mu_z(x;\phi)$ and $\mu_x(z;\theta)$ are arbitrarily-complex $L$-Lipschitz continuous functions; the variance function $\sigma_z^2(x;\phi)$ can be arbitrarily complex with no further constraint.
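To make Definition 1 concrete, below is a minimal PyTorch sketch (ours, not the paper's implementation) of a κ-simple VAE. The MLP architectures, layer widths, and the class name `KappaSimpleVAE` are illustrative assumptions; the definition only requires arbitrarily-complex $L$-Lipschitz mean functions and a trainable scalar decoder variance $\gamma$. The `loss` method is a single-sample Monte Carlo estimate of the integrand of (1).

```python
# Minimal sketch of a kappa-simple VAE: Gaussian encoder, Gaussian decoder with a
# single trainable scalar variance gamma, and a fixed N(0, I) prior.
import math
import torch
import torch.nn as nn

class KappaSimpleVAE(nn.Module):
    def __init__(self, d, kappa, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * kappa))   # -> mu_z, log sigma_z^2
        self.decoder = nn.Sequential(nn.Linear(kappa, hidden), nn.ReLU(),
                                     nn.Linear(hidden, d))           # -> mu_x
        self.log_gamma = nn.Parameter(torch.zeros(()))                # scalar decoder variance

    def loss(self, x):
        d = x.shape[-1]
        mu_z, log_var_z = self.encoder(x).chunk(2, dim=-1)
        z = mu_z + torch.exp(0.5 * log_var_z) * torch.randn_like(mu_z)  # reparameterization
        mu_x = self.decoder(z)
        gamma = self.log_gamma.exp()
        # -E_q[log p(x|z)] for an isotropic Gaussian decoder N(x | mu_x, gamma * I)
        recon = 0.5 * (((x - mu_x) ** 2).sum(-1) / gamma
                       + d * self.log_gamma + d * math.log(2 * math.pi))
        # KL[q(z|x) || N(0, I)] in closed form for diagonal Gaussians
        kl = 0.5 * (mu_z ** 2 + log_var_z.exp() - log_var_z - 1).sum(-1)
        return (recon + kl).mean()   # Monte Carlo estimate of the loss in (1)
```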
Our goal in this section will be to closely analyze the behavior of κ-simple VAE models when trained on data restricted to low-dimensional manifolds, defined as follows:

**Definition 2 (Data lying on a manifold)** Let $r$ and $d$ denote two positive integers with $r < d$. Then $\mathcal{M}_r$ is a simple $r$-Riemannian manifold embedded in $\mathbb{R}^d$ when there exists a diffeomorphism $\varphi$ between $\mathcal{M}_r$ and $\mathbb{R}^r$. Specifically, for every $x \in \mathcal{M}_r$, there exists a $u = \varphi(x) \in \mathbb{R}^r$, where $\varphi$ is invertible and both $\varphi$ and $\varphi^{-1}$ are differentiable.

As pointed out in [4], when training κ-simple VAEs on such manifold data, the optimal decoder variance satisfies $\gamma \to 0$ (i.e., the loss is unbounded from below). And as we will soon show, one effect of this phenomenon can be to selectively push the encoder variances along certain dimensions of $z$ towards zero as well, ultimately allowing these dimensions to pass sufficient information about $x$ through the latent space such that the decoder can produce reconstructions with arbitrarily small error. To formalize these claims, we require one additional definition:

**Definition 3 (Active VAE latent dimensions)** Let $\{\theta^*_\gamma, \phi^*_\gamma\}$ denote globally-optimal parameters of a κ-simple VAE model applied to (1) as a function of an arbitrary fixed $\gamma$. Then a dimension $j \in \{1, \ldots, \kappa\}$ of latent variable $z$ is defined as an active dimension (associated with sample $x$) if the corresponding optimal encoder variance satisfies $\sigma_z(x;\phi^*_\gamma)^2_j \to 0$ as $\gamma \to 0$.

We now arrive at the main result for this section:

**Theorem 1 (Learning the data manifold dimension using VAEs)** Suppose $\mathcal{X} = \mathcal{M}_r$ with $r < d$. Then for all $\kappa \geq r$, any globally-optimal κ-simple VAE model applied to (1) satisfies the following:

- (i) $\mathcal{L}(\theta^*_\gamma, \phi^*_\gamma) = (d - r)\log\gamma + O(1)$,
- (ii) the number of active latent dimensions almost surely equals $r$, and
- (iii) the reconstruction error almost surely satisfies $\mathbb{E}_{q_{\phi^*_\gamma}(z|x)}\big[\|x - \mu_x(z;\theta^*_\gamma)\|^2\big] = O(\gamma)$.

While all proofs are deferred to the appendices, we provide a high-level sketch here. First, we prove by contradiction that there must exist at least $r$ active dimensions with corresponding encoder variances tending to zero at a rate of $O(\gamma)$. If this is not the case, we show that the reconstruction term grows at a rate of $O(\tfrac{1}{\gamma})$, such that the overall loss diverges and cannot be globally optimal. Next, we obtain upper and lower bounds on (1), both of which scale as $(d - r)\log\gamma + O(1)$ when the number of active dimensions is $r$. And lastly, we pin down the exact number of active dimensions by showing that the inclusion of any unnecessary active dimension decreases the coefficient of the $\log\gamma$ scale factor, i.e., the factor $(d - r)$ uniquely achieves the minimal loss.

Overall, Theorem 1 provides a number of revealing insights into the VAE loss surface under the stated conditions. First, we observe that although the loss can in principle be driven to minus infinity via sub-optimal solutions as $\gamma \to 0$, globally-optimal solutions nonetheless achieve an optimal rate (i.e., the largest possible coefficient on the $\log\gamma$ factor) as well as the minimal number of active latent dimensions, which matches the ground-truth manifold dimension $r$. Moreover, this all occurs while maintaining a reconstruction error that tends to zero as desired.
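As a practical illustration of Definition 3 and Theorem 1(ii), the small diagnostic below (our own sketch, assuming the `KappaSimpleVAE` example above) estimates the number of active latent dimensions by checking which posterior variances have collapsed toward zero on held-out data; the cutoff `threshold` is a heuristic assumption, not something prescribed by the theory.

```python
# Count "active" latent dimensions (Definition 3): dimensions whose average encoder
# variance has been driven toward zero.
import torch

@torch.no_grad()
def count_active_dims(model, x, threshold=1e-2):
    _, log_var_z = model.encoder(x).chunk(2, dim=-1)
    mean_var = log_var_z.exp().mean(dim=0)      # average posterior variance per latent dim
    active = mean_var < threshold               # active dims: variance collapsed toward 0
    return int(active.sum()), mean_var
```

At a global optimum on manifold data, Theorem 1 predicts this count should concentrate at $r$ once $\gamma$ has become sufficiently small.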
Additionally, while we have thus far treated $\gamma$ as a manually controlled parameter tending to zero for analysis purposes, similar intuitions still apply when we transition to a trainable setting. More specifically, around global minimizers the corresponding optimal $\gamma$ scales as $\tfrac{L^2}{d}\,\|\sigma_z(x;\phi^*)_{1:r}\|^2$, where $\sigma_z(x;\phi^*)_{1:r}$ denotes the $r$ optimal latent posterior standard deviations associated with active dimensions. Hence both the decoder variance $\gamma$ and the latent posterior variances of active dimensions converge to zero in tandem at the same rate; see the proof for more details. In contrast, along the remaining inactive dimensions we show that $\lim_{\gamma \to 0} \sigma_z(x;\phi^*_\gamma)_j = \Omega(1)$, which optimizes the KL term without compromising the reconstruction accuracy.

In closing this section, we note that previous work [4] has demonstrated that global minima of VAE models can achieve zero reconstruction error for all samples lying on a data manifold. But it was not formally established in a general setting that this perfect reconstruction is possible using a minimal number of active latent dimensions, and hence it is conceivable for generated samples involving a larger number of active dimensions to stray from this manifold. In contrast, achieving perfect reconstruction using the minimal number of active latent dimensions, as we have demonstrated here under the stated assumptions, implies that generated samples must also lie on the manifold. Critically, the noisy signals from inactive dimensions are blocked by the decoder and therefore cannot produce deviations from the manifold.

### 2.2 Extension to Conditional VAEs

In this section we extend our analysis to include conditional models, progressing from VAEs to CVAEs. For this purpose, we introduce conditioning variables $c$ drawn from some set $\mathcal{C}$ with associated probability measure $\nu_{gt}$ such that $\int_{\mathcal{C}} \nu_{gt}(dc) = 1$. Moreover, for any $c \in \mathcal{C}$ there exists a subset $\mathcal{X}_c \subseteq \mathcal{X}$ with probability measure $\omega^c_{gt}$ satisfying $\int_{\mathcal{X}_c} \omega^c_{gt}(dx) = 1$. Collectively we also have $\int_{\mathcal{C}} \int_{\mathcal{X}_c} \omega^c_{gt}(dx)\,\nu_{gt}(dc) = 1$. Given these definitions, the canonical CVAE loss is defined as

$$
\mathcal{L}(\theta, \phi) = \int_{\mathcal{C}} \int_{\mathcal{X}_c} \Big\{ -\mathbb{E}_{q_\phi(z|x,c)}\big[\log p_\theta(x|z,c)\big] + \mathrm{KL}\big[q_\phi(z|x,c) \,\|\, p_\theta(z|c)\big] \Big\}\, \omega^c_{gt}(dx)\, \nu_{gt}(dc) \;\ge\; -\int_{\mathcal{C}} \int_{\mathcal{X}_c} \log p_\theta(x|c)\, \omega^c_{gt}(dx)\, \nu_{gt}(dc), \tag{2}
$$

which forms an upper bound on the conditional version of the expected negative log-likelihood. We may then naturally extend the definition of the κ-simple VAE model to the conditional regime as follows:

**Definition 4 (κ-simple CVAE)** A κ-simple CVAE is an extension of the κ-simple VAE with the revised conditional, parameterized prior $p_\theta(z|c) = \mathcal{N}(z \,|\, \mu_z(c;\theta), \mathrm{diag}\{\sigma^2_z(c;\theta)\})$, the Gaussian encoder $q_\phi(z|x,c) = \mathcal{N}(z \,|\, \mu_z(x,c;\phi), \mathrm{diag}\{\sigma^2_z(x,c;\phi)\})$, and the Gaussian decoder $p_\theta(x|z,c) = \mathcal{N}(x \,|\, \mu_x(z,c;\theta), \gamma I)$. The encoder/decoder mean functions $\mu_z(x,c;\phi)$ and $\mu_x(z,c;\theta)$ are arbitrarily-complex $L$-Lipschitz continuous functions, while the prior mean $\mu_z(c;\theta)$ and variance $\sigma^2_z(c;\theta)$,² as well as the encoder variance $\sigma^2_z(x,c;\phi)$, can all be arbitrarily-complex functions with no further constraint.

² While the parameters of the prior mean and variance functions are labeled as $\theta$, this is merely to follow standard convention and group all parameters from the generative pipeline together under the same heading; it is not meant to imply that the decoder and prior actually share the same parameters.
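Again purely for concreteness, here is a minimal PyTorch sketch (ours, not the paper's implementation) of a κ-simple CVAE matching Definition 4, with a learned conditional Gaussian prior, a Gaussian encoder over $(x, c)$, and a Gaussian decoder with a single trainable scalar variance $\gamma$. The layer widths and the class name `KappaSimpleCVAE` are illustrative assumptions; the `loss` method is a single-sample Monte Carlo estimate of the integrand of (2).

```python
# Minimal sketch of a kappa-simple CVAE: conditional Gaussian prior p(z|c),
# Gaussian encoder q(z|x,c), Gaussian decoder with scalar variance gamma.
import math
import torch
import torch.nn as nn

class KappaSimpleCVAE(nn.Module):
    def __init__(self, d, c_dim, kappa, hidden=256):
        super().__init__()
        self.prior = nn.Sequential(nn.Linear(c_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 2 * kappa))      # mu_z(c), log var_z(c)
        self.encoder = nn.Sequential(nn.Linear(d + c_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * kappa))    # mu_z(x,c), log var_z(x,c)
        self.decoder = nn.Sequential(nn.Linear(kappa + c_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, d))            # mu_x(z,c)
        self.log_gamma = nn.Parameter(torch.zeros(()))

    def loss(self, x, c):
        d = x.shape[-1]
        mu_p, log_var_p = self.prior(c).chunk(2, dim=-1)
        mu_q, log_var_q = self.encoder(torch.cat([x, c], dim=-1)).chunk(2, dim=-1)
        z = mu_q + torch.exp(0.5 * log_var_q) * torch.randn_like(mu_q)
        mu_x = self.decoder(torch.cat([z, c], dim=-1))
        gamma = self.log_gamma.exp()
        # -E_q[log p(x|z,c)] for the decoder N(x | mu_x(z,c), gamma * I)
        recon = 0.5 * (((x - mu_x) ** 2).sum(-1) / gamma
                       + d * self.log_gamma + d * math.log(2 * math.pi))
        # KL between the diagonal Gaussians q(z|x,c) and p(z|c)
        kl = 0.5 * (log_var_p - log_var_q
                    + (log_var_q.exp() + (mu_q - mu_p) ** 2) / log_var_p.exp() - 1).sum(-1)
        return (recon + kl).mean()   # Monte Carlo estimate of the loss in (2)
```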
Likewise, we may also generalize the definition of active latent dimensions, where we must explicitly account for the modified conditional prior distribution, which controls the relative scaling of the data.

**Definition 5 (Active CVAE latent dimensions)** Let $\{\theta^*_\gamma, \phi^*_\gamma\}$ denote globally-optimal parameters of a κ-simple CVAE model applied to (2) as a function of an arbitrary fixed $\gamma$. Then a dimension $j \in \{1, \ldots, \kappa\}$ of latent variable $z$ is defined as an active dimension (associated with sample pair $\{x, c\}$) if the corresponding $j$-th optimal encoder/prior variance ratio satisfies $\sigma_z(x,c;\phi^*_\gamma)^2_j \,/\, \sigma_z(c;\theta^*_\gamma)^2_j \to 0$ as $\gamma \to 0$.

Note that for a CVAE (or VAE) with a standardized (parameter-free) Gaussian prior, the prior variance equals one. Hence it follows that Definition 3 is a special case of Definition 5 where $\sigma^2_z(c;\theta) = I$.

Before proceeding to our main result for this section, there is one additional nuance to our manifold assumptions underlying conditional data. Specifically, if $c$ follows a continuous distribution, then it can reduce the number of active latent dimensions needed for obtaining perfect reconstructions. To quantify this effect, let $\mathcal{M}^c_r \subseteq \mathcal{M}_r$ denote the subset of $\mathcal{M}_r$ associated with $\mathcal{X}_c$. Intuitively then, depending on the information about $x$ contained in $c$, the number of active latent dimensions needed within $\mathcal{M}^c_r$ may be less than $r$. We quantify this reduction via the following definition:

**Definition 6 (Effective dimension of a conditioning variable)** Given an integer $t \in \{0, \ldots, r\}$, let $\mathcal{C}_t$ denote a subset of $\mathcal{C}$ with the following properties: (i) there exists a function $g : \mathcal{C}_t \to \mathbb{R}^t$ as well as $t$ dimensions of $\varphi(x)$, denoted $\varphi(x)_t$, such that $g(c) = \varphi(x)_t$ for all pairs $\{(c, x) : c \in \mathcal{C}_t,\, x \in \mathcal{M}^c_r\}$, where $\varphi$ is a diffeomorphism per Definition 2; and (ii) there does not exist such a function $g$ for $t + 1$. We refer to $t$ as the effective dimension of any conditioning variable $c \in \mathcal{C}_t$.

Loosely speaking, Definition 6 indicates that any $c \in \mathcal{C}_t$ can effectively be used to reconstruct $t \leq r$ dimensions of $x$ within $\mathcal{M}^c_r$. Incidentally, if $t = r$, then this definition implies that $x$ degenerates to a deterministic function of $c$. Given these considerations, Theorem 1 can be extended to CVAE models conditioned on continuous variables as follows:

**Theorem 2 (Learning the data manifold dimension using CVAEs)** Suppose $\mathcal{C} = \mathcal{C}_t$ and $\mathcal{X}_c = \mathcal{M}^c_r$ with $r \geq 1$, $t \geq 0$, and $r \geq t$. Then any globally-optimal κ-simple CVAE model applied to the loss (2), with $\kappa \geq r$, satisfies the following:

- (i) $\mathcal{L}(\theta^*_\gamma, \phi^*_\gamma) = (d - r + t)\log\gamma + O(1)$,
- (ii) the number of active latent dimensions almost surely equals $r - t$, and
- (iii) the reconstruction error almost surely satisfies $\mathbb{E}_{q_{\phi^*_\gamma}(z|x,c)}\big[\|x - \mu_x(z,c;\theta^*_\gamma)\|^2\big] = O(\gamma)$.

This result indicates that conditioning variables can further reduce the CVAE loss (by increasing the coefficient on the $\log\gamma$ term as $\gamma \to 0$ around optimal solutions). Moreover, conditioning can replace active latent dimensions; intuitively this occurs because using $c$ to reconstruct dimensions of $x$, unlike using dimensions of $z$, incurs no cost via the KL penalty term. Additionally, it is worth mentioning that even if the observed data itself is not strictly on a low-dimensional manifold (meaning $r = d$), once the conditioning variables $c$ are introduced, manifold structure can be induced on $\mathcal{X}_c$, i.e., with $t > 0$ it follows that $d - r + t > 0$ and the number of active latent dimensions satisfies $r - t < d$.
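Mirroring the earlier VAE diagnostic, the sketch below (again our own, assuming the `KappaSimpleCVAE` example) counts active CVAE latent dimensions via the encoder-to-prior variance ratio from Definition 5; the threshold remains a heuristic choice on our part.

```python
# Count active CVAE latent dimensions (Definition 5): dimensions whose
# encoder-to-prior variance ratio has collapsed toward zero.
import torch

@torch.no_grad()
def count_active_cvae_dims(model, x, c, threshold=1e-2):
    _, log_var_p = model.prior(c).chunk(2, dim=-1)
    _, log_var_q = model.encoder(torch.cat([x, c], dim=-1)).chunk(2, dim=-1)
    ratio = (log_var_q - log_var_p).exp().mean(dim=0)   # sigma_q^2 / sigma_p^2, batch-averaged
    return int((ratio < threshold).sum()), ratio
```

Per Theorem 2(ii), at a global optimum this count should approach $r - t$ as $\gamma$ becomes small.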
### 2.3 Adaptive Active Latent Dimensions

Thus far our analysis has been predicated on the existence of a single $r$-dimensional manifold underlying the data $x$, along with a conditioning variable $c$ that captures $t$ degrees-of-freedom within this manifold. More broadly though, it is reasonable to envision scenarios whereby the data instead lie on a union of manifolds, each with a locally-defined value of $r$, and possibly of $t$ for continuous conditioning variables. In such instances, both Theorems 1 and 2 can be naturally refined to reflect this additional flexibility. While we defer a formal treatment to future work, it is nonetheless worthwhile to consider CVAE behavior within two representative scenarios.

First, consider the case where $c$ is now a discrete random variable taking a value in some set/alphabet $\{\alpha_k\}_{k=1}^m$, such as the label of an MNIST digit [23] (whereby $m = 10$). It then becomes plausible to assume that $r = f(\alpha_k)$ for some function $f$, meaning that the manifold dimension itself may vary conditioned on the value of $c$ (e.g., the space of digits "1" is arguably simpler than the space of digits "8"). In principle then, a suitably-designed CVAE model trained on such data should be able to adaptively learn the dimensions of these regional manifolds. Later, in Section 4.4, we empirically demonstrate that when we include a specialized attention layer within the CVAE decoder, which allows it to selectively switch different latent dimensions on and off, the resulting model can indeed learn the underlying union of low-dimensional manifolds under certain circumstances. From a conceptual standpoint, the outcome is loosely analogous to a separate, class-conditional VAE being trained on the data associated with each $\alpha_k$.

As a second scenario, we may also consider the case where $t$ varies for different values of a continuous conditioning variable $c$. Extrapolating from Theorem 2, we would expect that the number of active latent dimensions $r - t$ will now vary across regions of the data space. Hence an appropriate CVAE architecture should be able to adaptively compress the active latent dimensions so as to align with the varying information contained in $c$. Again, Section 4.4 demonstrates that this is indeed possible.

## 3 On Common CVAE Model Design Choices

In this section, we review CVAE model designs and training practices that, while adopted in various prior CVAE use cases, may nonetheless have underappreciated consequences, especially within the present context of learning the underlying data manifold dimensionality.

### 3.1 On the Equivalence of Conditional and Unconditional Priors

Per the canonical CVAE design, it is common to include a parameterized, trainable prior $p_\theta(z|c)$ within CVAE architectures [1, 7, 8, 20, 24]. However, the strict necessity of doing so is at least partially called into question by the following remark:

**Remark 1 (Converting conditional to unconditional priors)** Consider a κ-simple CVAE model with prior $p_\theta(z|c)$, encoder $q_\phi(z|x,c)$, and decoder $p_\theta(x|z,c)$. We can always find another κ-simple CVAE model with prior $p(z) = \mathcal{N}(0, I)$, encoder $q_{\phi'}(z|x,c)$, and decoder $p_{\theta'}(x|z,c)$, such that $\mathcal{L}(\theta, \phi) = \mathcal{L}(\theta', \phi')$ and $p_\theta(x|c) = p_{\theta'}(x|c)$.

Remark 1 indicates that, at least in principle, a parameterized, conditional prior is not unequivocally needed. Specifically, as detailed in the appendix, we can always explicitly convert an existing κ-simple CVAE model with conditional prior $p_\theta(z|c)$ into another κ-simple CVAE model with fixed prior $p(z) = \mathcal{N}(z|0, I)$ without sacrificing any model capacity in the resulting generative process $p_{\theta'}(x|c) = \int p_{\theta'}(x|z,c)\,p(z)\,dz$; essentially the additional expressivity of the conditional prior is merely absorbed into a revised decoder.
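The idea behind Remark 1 can be sketched in code. Assuming the `KappaSimpleCVAE` example above, the wrapper below (our own illustrative construction, not the paper's appendix verbatim) absorbs the conditional prior into a revised decoder and a correspondingly whitened encoder, after which a fixed $\mathcal{N}(0, I)$ prior suffices; reconstruction and KL terms are unchanged because the same invertible affine map is applied to both the posterior and the prior.

```python
# Sketch of the conversion in Remark 1: absorb the conditional prior N(mu_p(c), var_p(c))
# into a revised decoder/encoder pair so that the prior becomes a fixed N(0, I).
import torch
import torch.nn as nn

class StandardPriorCVAE(nn.Module):
    def __init__(self, cvae):                       # wraps a KappaSimpleCVAE instance
        super().__init__()
        self.cvae = cvae

    def decode(self, z_std, c):
        # revised decoder: rescale a standard-normal latent by the old prior, then decode
        mu_p, log_var_p = self.cvae.prior(c).chunk(2, dim=-1)
        z = mu_p + torch.exp(0.5 * log_var_p) * z_std
        return self.cvae.decoder(torch.cat([z, c], dim=-1))

    def encode(self, x, c):
        # revised encoder: whiten the old posterior by the old prior's mean and scale
        mu_p, log_var_p = self.cvae.prior(c).chunk(2, dim=-1)
        mu_q, log_var_q = self.cvae.encoder(torch.cat([x, c], dim=-1)).chunk(2, dim=-1)
        mu_new = (mu_q - mu_p) * torch.exp(-0.5 * log_var_p)
        log_var_new = log_var_q - log_var_p
        return mu_new, log_var_new                  # posterior relative to the N(0, I) prior
```

Sampling is likewise unchanged: drawing `z_std` from $\mathcal{N}(0, I)$ and calling `decode(z_std, c)` reproduces the original generative distribution $p_\theta(x|c)$.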
Even so, there may nonetheless remain differences in the optimization trajectories followed during training, such that achievable local minima may at times lack this equivalence.

### 3.2 The Impact of γ Initialization on Model Convergence

As emphasized previously, VAE/CVAE models with sufficient capacity applied to manifold data achieve the minimizing loss when $\gamma \to 0$. However, directly fixing $\gamma$ near zero can be problematic for reasons discussed in [3], and more broadly, performance can actually be compromised when $\gamma$ is set to any fixed positive constant, as noted in [4, 10, 15, 18]. Even so, it remains less well understood how the initialization of a learnable $\gamma$ may impact the optimization trajectory during training. After all, the analysis we have provided is conditioned on finding global solutions, and yet it is conceivable that different $\gamma$ initializations could influence a model's ability to steer around bad local optima. Note that the value of $\gamma$ at the beginning of training arbitrates the initial trade-off between the reconstruction and KL terms, as well as the smoothness of the initial loss landscape, both of which are factors capable of influencing model convergence. We empirically study these factors in Section 4.5.

### 3.3 A Problematic Aspect of Encoder/Prior Model Weight Sharing

Presumably to stabilize training and/or avoid overfitting, one widely-adopted practice is to share CVAE weights between the prior and posterior/encoder modules [8, 17, 19]. For generic, fixed-size input data this may take the form of simply constraining the encoder as $q_\phi(z|x,c) = p_\theta(z|c)$ [17]. More commonly though, for sequential data both the prior and the encoder are instantiated as some form of recurrent network with shared parameters, where the only difference is the length of the input sequence [8, 19], i.e., the full sequence for the encoder versus the partial conditioning sequence for the prior.

More concretely with respect to the latter, assume sequential data $x = \{x_l\}_{l=1}^n$, where $l$ is a time index (for simplicity we assume a fixed length $n$ across all samples, although this can easily be generalized). Then associated with each time point $x_l$ within a sample $x$, we have a prior conditioning sequence $c_l = x_{<l}$ (this excludes strictly deterministic sequences). Moreover, given a κ-simple sequential CVAE model,³ assume that the prior is constrained to share weights with the encoder such that pθ(zl|x