# deep_generative_models_complexity_dimensionality_and_approximation__1d293a27.pdf Journal of Machine Learning Research 26 (2025) 1-37 Submitted 8/24; Revised 2/25; Published 6/25 Deep Generative Models: Complexity, Dimensionality, and Approximation Kevin Wang kcwang@unc.edu Department of Biostatistics University of North Carolina at Chapel Hill Hongqian Niu hong.niu@unc.edu Department of Biostatistics University of North Carolina at Chapel Hill Yixin Wang yixinw@umich.edu Department of Statistics University of Michigan Didong Li didongli@unc.edu Department of Biostatistics University of North Carolina at Chapel Hill Editor: Arindam Banerjee Generative networks have shown remarkable success in learning complex data distributions, particularly in generating high-dimensional data from lower-dimensional inputs. While this capability is well-documented empirically, its theoretical underpinning remains unclear. One common theoretical explanation appeals to the widely accepted manifold hypothesis, which suggests that many real-world datasets, such as images and signals, often possess intrinsic low-dimensional geometric structures. Under this manifold hypothesis, it is widely believed that to approximate a distribution on a d-dimensional Riemannian manifold, the latent dimension needs to be at least d or d + 1. In this work, we show that this requirement on the latent dimension is not necessary by demonstrating that generative networks can approximate distributions on d-dimensional Riemannian manifolds from inputs of any arbitrary dimension, even lower than d, taking inspiration from the concept of space-filling curves. This approach, in turn, leads to a super-exponential complexity bound of the deep neural networks through expanded neurons. Our findings thus challenge the conventional belief on the relationship between input dimensionality and the ability of generative networks to model data distributions. This novel insight not only corroborates the practical effectiveness of generative networks in handling complex data structures, but also underscores a critical trade-offbetween approximation error, dimensionality, and model complexity. Keywords: Approximation theory, generative adversarial networks, manifold hypothesis, space-filling curve. . To whom correspondence should be addressed c 2025 Kevin Wang, Hongqian Niu, Yixin Wang and Didong Li. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-1335.html. Wang, Niu, Wang and Li 1. Introduction Generative models, such as generative adversarial networks (GANs, Goodfellow et al., 2014) and variational auto-encoders (VAEs, Kingma and Welling, 2013), have become a central topic in machine learning. These models are versatile, addressing various problems ranging from image generation (Bao et al., 2017; Han et al., 2018), style transfer (Karras et al., 2019), anomaly detection (Xia et al., 2022), to data augmentation (Tran et al., 2021), achieving increasingly accurate results across disciplines. Despite these advances, GANs, in particular, face challenges such as mode collapse, vanishing gradients, and training instability, particularly when distributions are not continuous or have disjoint supports (Arjovsky and Bottou, 2017). While traditional GANs and VAEs primarily relied on the Kullback Leibler divergence (KL) to measure distances between input and output distributions, more recent approaches have greatly improved training performance. The shift from KL to Wasserstein distance (Villani et al., 2009) is a notable example, as exemplified in the Wasserstein GAN (WGAN, Arjovsky et al., 2017), and further extended to Wasserstein VAE (WVAE, Ambrogioni et al., 2018) and Wasserstein Auto Encoders (Tolstikhin et al., 2017). Beyond this, there have been substantial enhancements in model architecture, loss functions, and regularization techniques. Advances such as conditional GANs (c GAN, Mirza and Osindero, 2014) for targeted image generation, auxiliary classifier GANs (AC-GANs, Odena et al., 2017), and self-attention GANs (SA-GANs, Zhang et al., 2019) have enriched the versatility and effectiveness of GANs. Similarly, VAEs have seen improvements with techniques such as hierarchical latent variables (Vahdat and Kautz, 2020; Sønderby et al., 2016) and incorporation of normalizing flows (Kingma et al., 2016), refining their ability to model complex distributions. These developments collectively contribute to the generation of highquality, high-dimensional data from simpler, lower-dimensional distributions, with notable impact in fields such as photorealistic image generation (Wang et al., 2018; Sarkar et al., 2021). The theoretical advancements in deep generative models, paralleling their practical applications, have been substantial. In particular, the depth and width of network architectures in these models have been theoretically shown to significantly impact their approximation capabilities, as detailed in the works focused on neural network expressiveness (Eldan and Shamir, 2016). Furthermore, the challenges of mode collapse in GANs and the propensity of VAEs to produce overly smooth outputs have catalyzed research into improving their architectural designs and training methodologies (Goodfellow et al., 2014; Kingma and Welling, 2013). Collectively, these theoretical advancements have not only deepened our understanding of the mechanisms behind deep generative models but also guided the development of more refined and capable generative architectures. However, the theoretical investigation of the manifold hypothesis in deep generative models, particularly its implications in approximation theory, remains less developed. The manifold hypothesis, widely accepted as a crucial reason behind the success of these models, posits that real-world high-dimensional data often reside on lower-dimensional manifolds (Bronstein et al., 2017). This concept is critical because it suggests a fundamental reason why models such as GANs and VAEs, which operate in low-dimensional spaces, are able to capture complex data distributions effectively. Although empirical evidence sup- Deep Generative Models on Manifolds ports this hypothesis, a deeper theoretical understanding of how these models approximate and learn distributions on lower-dimensional manifolds within high-dimensional spaces is crucial. Further theoretical work in this area is essential, not only to validate the manifold hypothesis but also to enhance the design and efficiency of generative models in handling complex, high-dimensional data. Building on the previous discussion of the manifold hypothesis, Dahal et al. (2022) offers a noteworthy theoretical contribution in the field of deep generative models. Their work effectively addresses fundamental questions in approximation theory, demonstrating that distributions on a d-dimensional Riemannian manifold can be approximated by a deep generative network through the pushforward measure of an easily sampled distribution in d+1 dimensions, such as a uniform distribution in a cube [0, 1]d+1. Importantly, they provide an upper bound on the complexity of these models, revealing that the required number of layers and neurons is determined by both the approximation error and the manifold dimension d. This insight offers a direct theoretical justification for the models abilities to learn diverse probability distributions and suggesting avenues for efficiency improvements in practical applications. However, the research by Dahal et al. (2022) also brings to light some important, yet unanswered questions. Primarily, the necessity of using a d + 1 dimensional input space to approximate a d dimensional manifold warrants further investigation. This approach, integral for assembling local patches, raises the possibility of achieving similar outcomes with an input dimension that matches d. Furthermore, in real-world applications where d is not readily known, the choice of input dimension becomes a critical decision. If the chosen dimensionality is either greater or lesser than the actual d, how would it impact the model s performance and efficiency? Answering these questions is crucial for advancing the development of deep generative models, potentially leading to more adaptable and efficient solutions for complex data representation. In this work, we aim to understand how the input dimension relates to the manifold s intrinsic dimension in deep generative models. We adapt the space-filling curve theory to demonstrate a novel aspect of deep generative models: their ability to learn a target data distribution from easy-to-sample distributions of arbitrary dimensions. This includes the capacity to approximate these distributions even from an input as low as a one-dimensional uniform distribution on the unit interval [0, 1] (see Figure 1 for examples of curves filling 2-dimensional manifolds). Intuitively, if the input dimension is smaller than the true dimension, the network can learn to fill out the true manifold with an increasingly complex series of structures. However, in order to accurately approximate the higher dimensional structure, the approximating manifold must fold onto itself in increasingly irregular ways to fill the space. In addition, our work establishes the critical interplay among the input dimension, the true dimension of the manifold, and the approximation error. In particular, our findings highlight a trade-offtriangle in deep generative models, where it is unattainable to achieve low (underestimated) dimensionality, low approximation error, and low network complexity simultaneously, where network complexity is defined to be network width (number of neurons) in this paper. Specifically, when the input dimension is underestimated and the approximation error is low, we encounter one corner of this triangle: a significant increase in model complexity. This complexity, evident in the escalated number of neurons, grows Wang, Niu, Wang and Li Figure 1: Two cases demonstrating the idea of how sufficiently large neural networks can learn distributions of higher dimension than their input sampling distributions by filling out the space. Depicted here are 1-dimensional partial space-filling curves beginning to fill out a 2-dimensional unit square (left) and a 2-dimensional cylindrical surface (right). The blue points represent the training sample from the target distribution used to fit the neural network, while the orange points represent new points generated by the trained network. super-exponentially with approximation error, underscoring the complex trade-offs and interdependencies inherent in accurately gauging the intrinsic dimension of the data. This triangle metaphorically highlights the necessity of balancing these aspects when designing generative models, a central theme of our contributions. Our contribution also includes empirical studies to support these findings, which are overlooked in existing literature (Dahal et al., 2022). Notably, our proof techniques for the underestimated dimension case are novel, requiring in-depth exploration of mathematical literature in Monge-Amp ere equations from the 1990s (Caffarelli, 1990a,b, 1991, 1992a,b, 1996), beyond the scope of existing literature (Dahal et al., 2022; Villani et al., 2009). The structure of our paper is organized as follows: Section 2 lays the groundwork with preliminaries. Section 3 discusses our main theoretical contributions. In Section 4, we present simulations on toy cases for illustrative and visualization purposes. Section 5 offers a sketch of our proof, focusing primarily on the workflow, and is followed by a Discussion Section. Complete proofs and additional experimental details are provided in the Appendix. 2. Preliminaries This section covers the essential concepts underpinning our research, including generative models, optimal transport, Riemannian manifolds, and existing approximation theory for generative models on Riemannian manifolds. This groundwork is crucial for appreciating the theoretical advances we present in later sections and situates our research within the broader context of generative modeling. 2.1 Generative Models Generative models aim to learn the distribution of a dataset through sampling procedures. We consider our data following a distribution Q residing in an ambient space RD and an input distribution ρ over a simpler space S, such as the uniform distribution on the Deep Generative Models on Manifolds hypercube [0, 1]d. The goal is to construct a generator g : S RD that minimizes a discrepancy function between the generated (pushforward) distribution g (ρ) and the target distribution Q, expressed as min g G Discrepancy(g (ρ), Q). In this context, the function class G often comprises deep neural networks. A lower discrepancy signifies a closer match, meaning that the data generated from the simpler input space S closely resemble the true data represented by distribution Q. The essence of generative models is to produce outputs that are similar to real data, aligning the generated samples with the actual distribution. A larger class G usually yields a smaller discrepancy, and therefore, better performance. However, this enhancement comes at the cost of increased computational resources. As a result, understanding the balance between the size of G and the discrepancy is of the utmost importance. A practical challenge is that Q is often unknown, but instead, we only observe samples x1, , xn Q that are commonly believed to be independent and identically distributed (iid) following Q. As a result, we can replace Q by its empirical distribution Qn := 1 n Pn i=1 δxi where δx is the Dirac measure at x. The practical objective of generative modeling is to minimize the discrepancy between the learned distribution and this empirical distribution: min g G Discrepancy(g (ρ), Qn), where the minimizer is usually denoted by bgn. 2.2 Optimal Transport In the study of generative models, the choice of discrepancy function is of fundamental importance. The Wasserstein distance, a central concept in optimal transport theory (Villani et al., 2009), is commonly employed due to its effectiveness in measuring the cost of transforming one distribution into another. This aligns well with the objectives of generative modeling. Definition 1 The Wasserstein-p distance between two distributions Q and ν in domain M, is defined as Wp(Q, ν) = inf γ Γ(Q,ν) E(x,y) γ(c(x, y)p) 1 where Γ(Q, ν) is the set of all couplings of Q and ν, containing all joint distributions over M M with marginals Q and ν, and c : M M R 0 is called the cost function. We focus on the Wasserstein-1 distance with c(x, y) = x y , also known as the earth mover s distance. This specific case is especially relevant for our work in generative models, offering a robust framework for evaluating the similarity between generated and target distributions. Wasserstein-1 distance additionally admits a dual form that makes is particularly suitable for computation, which is a property that Wasserstein-p in general lacks. However, our theory on convergence is extendable to Wasserstein-p distance in general, which we investigate in Corollary 9. Wang, Niu, Wang and Li 2.3 Space-filling Curves In this paper, for a set S RD, we define a space-filling curve to be a continuous curve Mϵ RD such that every point in S is within ϵ of some point in Mϵ. Consequently, we can define a space-filling manifold to be a manifold in RD satisfying the same properties: Definition 2 A manifold Mϵ RD is said to be ϵ-space-filling manifold of S RD if d(S, Mϵ) = sup x S inf y Mϵ Mϵ is called a space-filling curve if it s a one-dimensional manifold. When S itself is a manifold, the existence of such ϵ-space-filling manifolds is shown in Lemma 10. We note that this definition of the space-filling curve/manifold differs from the standard definition in topology that demands the curve be surjective on the larger space S. These classical space-filling curves are typically taken as the limit of certain classes of curves: Mspace-filling = lim ϵi 0 Mϵi, where the sequence Mϵi is carefully chosen to preserve desired properties (such as non self-intersection) in the limit. However, for our purposes, we do not need the limit, only particular choices of Mϵ satisfying d(S, Mϵ) < ϵ, and define space-filling manifolds thusly. We use the term space-filling for intuition purposes, and do not demand a completely space-filling property from our manifolds. 2.4 Riemannian Manifold Although data often reside in high-dimensional space RD, there is substantial evidence suggesting that they lie on some low-dimensional manifolds (Bronstein et al., 2017). This concept underlies many manifold-based generative models, such as VAEs and GANs. In the same manner as most existing work in the literature, we assume that the data are distributed on a d-dimensional orientable compact Riemannian manifold M, isometrically embedded in the ambient space RD, with Riemannian metric g. A manifold is a locally Euclidean space, with each local neighbor, known as a local chart, diffeomorphic to Euclidean space Rd (Boothby, 1986). The Riemannian metric g defines a smoothly varying inner product (metric) in each tangent space Tx M. The geodesic distance d M(x, y) is defined as the length of the shortest path connecting x, y M. In addition, there exists a well-defined d-form, known as the Riemannian volume form d Vol M, which often serves as the analog of the Lebesgue measure on Riemannian manifold. This allows us to define density functions of probability distributions with respect to the volume measure. For more details, see Do Carmo and Flaherty Francis (1992). 2.5 Existing Work In previous research on generative models, there has been a focus on low-dimensional data structures, assuming that high-dimensional data are parametrized by low-dimensional latent parameters. This approach treats manifolds as globally homeomorphic to Euclidean space, Deep Generative Models on Manifolds implying a single-chart manifold model (Luise et al., 2020; Schreuder et al., 2021; Block et al., 2022; Chae et al., 2023). However, this assumption presents limitations in dealing with the complexity inherent in general manifolds with multiple charts. In contrast, Yang et al. (2022) and Huang et al. (2022) demonstrate that GANs can approximate any data distribution from a one-dimensional continuous distribution. This method does not assume a global chart; however, it heavily relies on GANs memorizing empirical data distributions, posing limitations in generating novel samples. To overcome the limitations of the above approaches, Dahal et al. (2022) represents a significant development in this context. By avoiding the single-chart assumption and the need for memorizing data, they construct an oracle transport map suitable for general manifolds with multiple charts. This advancement in approximating distributions on Riemannian manifolds with neural network pushforwards paves the way for more sophisticated and effective generative models. Their study relies on two mild assumptions: Assumption 3 M is a d-dimensional compact Riemannian manifold isometrically embedded in RD. As a consequence, there exists B > 0 such that x B, x M. Assumption 4 Q is supported on M and has a density function q with respect to the volume measure Vol M with a positive lower bound c: 0 < c q(x), x M. We consider the Re LU-type neural networks g(x) = WLσ(WL 1 σ(W1x + b1) + + b L 1) + b L, where σ is the Re LU activation function (Fukushima, 1969), W is the weight matrix, b is the bias vector. The function class, denoted by GNN(m, L, p, κ), contains neural networks with Re LU activation function, input dimension m, maximum depth L, maximum width p, and bounded weights and biases by κ: GNN(m, L, p, κ) := g = (g1, , g D) : Rm RD : gj is in form (2.5) with at most L layers and max width p, Wi κ, bi κ} . Under the above assumptions, the two main theorems in Dahal et al. (2022) are as follows: Firstly, Lemma 5 establishes that certain distributions on compact Riemannian manifolds can be approximated using a deep neural network and a simple input distribution. Crucially, the dimension of the input space is precisely one more than the manifold dimension. Furthermore, Lemma 5 provides detailed complexity bounds on the size of the network to achieve a specified level of accuracy measured by the Wasserstein loss. Lemma 5 (Theorem 1 of Dahal et al. 2022) Let ρ = Unif(0, 1)d+1, then there exists a constant 0 < α < 1 that is independent of D such that for any 0 < ϵ < 1, there exists a deep neural network g GNN(d + 1, L, p, κ) with L = O log 1 ϵ , p = O Dϵ d α , κ = B, that satisfies W1(g (ρ), Q) < ϵ. Next, Lemma 6 serves as the empirical counterpart of Lemma 5, addressing scenarios in which only finite samples following the distribution Q are observed. It demonstrates the Wang, Niu, Wang and Li theoretical possibility of identifying such generators using finite samples instead of direct access to the true distribution Q. In addition, this lemma links the network size to the sample size rather than the approximation error. Lemma 6 (Theorem 2 of Dahal et al. 2022) Under the same assumption as in Lemma 5, let x1, , xn iid Q be n iid samples from Q, then for any δ > 0, set ϵ = n 1 d+δ in Lemma 5 so that the network class GNN(d + 1, L, p, κ) has parameters L = O log n 1 d+δ , p = d α(d+δ) , κ = B. Then the empirical risk minimizer bgn has rate E [W1(bgn (ρ), Q)] Cn 1 d+δ , where C is a constant independent of n and D. These results shed light on how distributions on the manifold can be approximated by a deep neural network s pushforward of a low-dimensional easy-to-sample distribution. However, this approach has several limitations. First, the intrinsic dimension d of the manifold M is almost never known. Although there is an immense literature on estimating d (Levina and Bickel, 2004; Zheng et al., 2022; Horvat and Pfister, 2022; Brown et al., 2022), it has been proven to be an almost insurmountable problem due to its complexity (Fefferman et al., 2016). Second, even if d is known, the choice of input dimension d + 1 is unnatural, as it exceeds the true dimension. Technically, the extra dimension results from the need to connect local neighborhoods using a pasting algorithm. But this seems superfluous from an intrinsic perspective, considering partition of unity could potentially serve the same purpose. Third, in practice, the input dimension is often treated as a tuning parameter, or simply set based on historical experience or recommendations without rigorous optimization. It remains unclear how these decisions on input dimension affect approximation performance, especially when d is underestimated. To address these issues, our study examines the scenario where the input dimension is arbitrary, potentially as minimal as one, within the same framework. We will summarize our primary findings in the subsequent section. 3. Main Theory We begin with Dahal et al. (2022), which presents a method for approximating distributions on compact Riemannian manifolds using neural networks with an input dimension of d + 1. This raises a natural question: can the input dimension be reduced to d, the manifold s intrinsic dimension, or even lower, without compromising the approximation s effectiveness? In this work, we answer this question affirmatively, leveraging concepts of space-filling curves and insights from a series of work on Monge-Amp ere equations by Caffarelli (Caffarelli, 1990a,b, 1991, 1992a,b, 1996). Our theorem demonstrates the feasibility of using any input dimension m 1, which can be either smaller or larger than d, even down to 1. However, reducing the input dimension below the manifold s dimension introduces a super-exponential increase in complexity. Finally, we highlight that our approach adheres to the same foundational assumptions as Dahal et al. (2022). Our theorem, central to this discussion, asserts the approximation power of these neural networks under varying input dimensions. It establishes the conditions under which the networks can effectively approximate distributions on Riemannian manifolds, taking into account the dimensionality of the input and its impact on the network s complexity. Deep Generative Models on Manifolds Theorem 7 (Approximation Power of Deep Generative Models) Under Assumptions 3 and 4, and with ρ as the uniform measure on [0, 1]m, for any ϵ > 0, there exists a deep neu- ral network g GNN(m, L, p, κ) such that L = O log 1 m α(m,ϵ) m d κ = max{B, 1}, and W1(g (ρ), Q) < ϵ. Furthermore, when m d, limϵ 0 α(m, ϵ) = 0, leading to a super-exponential increase in the width p. The purpose of this theorem is two-fold. Firstly, it demonstrates that neural networks can approximate distributions on a manifold, with the flexibility to use an input distribution uniformly distributed over a hypercube [0, 1]m of any dimension. This means that the choice of m can be adapted as needed, whether it s smaller or larger than the manifold dimension d. We additionally note that the choice of our generator to be the uniform distribution on the unit hypercube is non-restrictive. These results can easily be extended to any input distribution with a density that is upper bounded away from infinity and lower bounded from zero on the unit cube, such as a truncated normal distribution or uniform distributions. Secondly, the theorem provides a quantitative link between the complexity of the network and the relationship between the input dimension m and the manifold dimension d. Notably, when m > d, the rates coincide with those found in Dahal et al. (2022). However, the situation becomes particularly intriguing when the input dimension m is smaller than the target dimension d. In this scenario, the width of the neural network no longer increases at a polynomial rate of d α, but at a super-exponential rate of m α(m,ϵ) where m α(m,ϵ) ϵ 0 . This phenomenon implies that, while it is theoretically feasible to approximate a broad class of distributions with low-dimensional inputs, choosing an appropriate input dimension is critical to avoid excessively complex networks; it is often more advantageous to overestimate d than to underestimate it. We now turn our attention to the empirical aspect of our theory, examining the statistical guarantees provided by our model in the presence of iid samples x1, , xn Q. Theorem 8 (Statistical Guarantees of Deep Generative Models) Given iid samples x1, , xn Q with empirical distribution Qn, let δ > 0, L = O log n 1 d+δ , m (m+δ)α(m,n,d,δ) m d d (d+δ)α m > d, κ = max{B, 1}, then the empirical risk minimizer bgn GNN(m, L, p, κ) satisfies E [W1(bgn (ρ), Q)] (1 + 2Cδ)n 1 d+δ , where Cδ is a constant, independent of n. Furthermore, when m d, limn α(m, n, d, δ) = 0, leading to a super-exponential increase in the width p. Similar to the population version, this theorem demonstrates that deep neural networks can approximate manifold distributions from n iid observations, with the approximation error diminishing at a rate of 1 d+δ. However, the complexity of the network, particularly Wang, Niu, Wang and Li the width, increases alongside n. Although a larger n yields a smaller approximation error, when m d, the width p grows super-exponentially with n. These two theorems collectively underscore a trade-offtriangle of model complexity, input dimension, and approximation accuracy. In the next section, we illustrate these concepts through simulations using m = 1, 2, 3 as case studies for proof of concept and visualization. Finally, we also show that Theorem 7 and Theorem 8 can be partially extended to Wasserstein-p distances in the following corollary. Corollary 9 Under the same setup as Theorem 7, for any ϵ > 0, there exists a deep neural network g GNN(m, L, p, κ) such that Wp(g (ρ), Q) < ϵ. Moreover, under the same setup as Theorem 8, the empirical risk minimizer bgn GNN(m, L, p, κ) satisfies E [Wp(bgn (ρ), Q)] (1 + 2Cδ)n 1 d+δ . Note that the bounds for L, p, κ are not provided in this case; a more detailed discussion is presented in the proof in Appendix B.6. 4. Simulation In this section, we provide empirical evidence on three toy examples for visualization purposes of these deep generative networks abilities to learn space-filling curve approximations. We demonstrate the ability of the standard Re LU network to map the uniform distribution on the unit hypercubes of various dimensions [0, 1]m to a variety of target distributions, where m is the dimension of the input distribution. Note that here we are using a standard feedforward neural network with a Wasserstein-1 loss function approximation provided in the Python Optimal Transport package (POT, Flamary et al., 2021). This loss function takes the role of the critic network in a typical GAN model, hence our fitted feedforward network under this Wasserstein-1 loss function is analogous to the generator in a Wasserstein GAN (Arjovsky et al., 2017). We also report an empirical fill distance which we measure by taking a set of samples x1, ..., xn from the true manifold, another set of samples y1, ..., y N from the generated manifold, and calculating 1 n Pn i=1 minyj {d(xi, yj)}. This approximates the mean distance over all points in the data manifold to the closest point on the approximating curve. Hence, this is a measurement of the distance between the manifold supports of distributions, while the Wasserstein loss (used to train the neural networks) offers a measurement of the distance between distributions. In Simulation 1, we show that the problem of finding a generator from the two dimensional square to the two dimensional square is computationally feasible (as in the required complexity is relatively small). Additionally, we show that the same problem with only a one dimensional input dimension is relatively harder, and also show the trajectory of training iterations to visualize how the neural network learns the distribution with an under-dimensioned input. In Simulation 2, we showcase the mapping of the same uniform distribution on [0, 1] to the uniform distribution on a 2-dimensional cylinder S2 R3 (m = 1, d = 2, D = 3). We also showcase that the same target distribution can be learned with much lower complexity when we match the input dimension to the manifold dimension, that is m = 2 = d. In Simulation 3, we take our target distribution to be the 3-dimensional Deep Generative Models on Manifolds uniform distribution on a unit cube [0, 1]3 (d = D = 3). As in Simulations 1 and 2, we train Re LU Networks using uniform distributions on the 1, 2, and 3 dimensional hypercubes (m = 1, 2, 3) and compare performance. Full implementation details and additional simulations including a study of network complexity and approximation error, the use of higher input dimension than target manifold dimension (m = d + 1) for each of the three simulation cases, and the case of non-uniform input or target distributions, along with a link to a Git Hub repository containing all code, can be found in Appendix C. Figure 2: Simulation 1: Training results (left) of a small 2 hidden layer, 10 node each, fully connected network mapping the 2-D uniform input to a 2-D uniform target data manifold. The orange surface is generated by the neural network and is filling in the data manifold (blue). Wasserstein loss (top right) and fill distance (bottom right) between generated and observed data per training iteration for a total of 10,000 iterations. 4.1 Simulation 1 In Simulation 1, we let the data distribution Q be a uniform distribution over the unit square [0, 1]2 and consider the cases where the input distribution ρ to the neural network is a uniform distribution over the unit square [0, 1]2 R2 and a uniform distribution over the unit interval [0, 1] R. In this case, the manifold dimension d = 2 is equal to the dimension of the ambient space D = 2. We first consider the case where the input and target dimensions are equal. Aligned with our theory, the required complexity in this case is expected to be significantly lower than when the target data dimension is underestimated. In Figure 2, we show that training a fully connected feed forward network with just 2 hidden layers with 10 nodes each for 10,000 iterations under Wasserstein loss, again from the Python Optimal Transport (OT) package, achieves low Wasserstein loss as well as low average empirical fill distance. Note that in this case, during training the input is taken to be a random sample on the unit square at each iteration, while the target data is a uniform grid on the unit square, so that the input sample is not identical to the target data at any iteration. Wang, Niu, Wang and Li Figure 3: Simulation 1: Training results of a 5 hidden layer, 200 node each, fully connected network for 10,000 iterations. Training trajectory (left), Wasserstein loss (top right), and fill distance (bottom right) of mapping a uniform distribution on the interval [0, 1] to a uniform distribution on the unit square [0, 1]2. It can be seen that as the loss decreases, the fitted curve fills more of the square as expected. Next we re-examine the target distribution from the case that the input dimension ρ is a uniform distribution over the unit interval [0, 1]. Figure 3 demonstrates how a standard feedforward, fully-connected neural network, using Re LU activation and trained with Wasserstein loss, effectively maps the unit interval into an increasingly complex, space-filling curve in a two-dimensional space. Note that now a much larger network size of 5 hidden layers with 200 nodes each is required to reach a similar level of approximation to the equal input and target dimensions case. For further simulations regarding the effect of different network sizes see Appendix C.1, and for the m = d + 1 case see Appendix C.2. 4.2 Simulation 2 Analogous to Simulation 1, in Simulation 2 we consider a situation where the data distribution Q is a uniform distribution over a cylinder embedded in R3 and again consider the cases where the input distribution ρ to the neural network is a uniform distribution over the unit square [0, 1]2 R2 and the unit interval [0, 1] R. We first consider when the input distribution (uniform) dimension matches the target data manifold dimension (also uniform), where m = d = 2. We show that, aligned with our theory, the required complexity in this case is observed to be much lower than when the target data dimension is underestimated. In Figure 4, we show the results of training a fully connected feed forward network with a total of 3 hidden layers with 25 nodes each for 8,000 iterations under Wasserstein loss and that the network is able to achieve low Wasserstein loss as well as low average fill distance. In contrast, we consider the same uniform distribution on a cylinder embedded in R3 but with an input distribution of only one dimension. Note again that in this example we Deep Generative Models on Manifolds Figure 4: Simulation 2: Training results of a small 3 hidden layer, 25 node each, fully connected network for 8,000 iterations mapping a 2-D uniform input to a 2-D uniform target data manifold on a cylinder embedded in R3. Here the multicolored surface is generated by the neural network (colored by z-axis height) and is filling in the data manifold (blue). have the manifold M with dimension d = 2, embedded in an ambient Euclidean space with dimension D = 3. We show again that a feedforward, fully-connected neural network under Wasserstein loss and Re LU nonlinearity is still able to learn space-filling curves properly on the manifold of lower dimension than the ambient space (albeit with a much larger network size of 7 hidden layers with 250 nodes each). Again, the neural network maps the input interval into an increasingly complex space-filling curve on the cylinder as the number of training iterations increases and the loss decreases respectively as seen in Figure 5. For full implementation details and code, as well as the m = d + 1 case, see Appendix C. 4.3 Simulation 3 In Simulation 3, we consider a uniform distribution on a unit cube [0, 1]3 in R3, while now taking as input one, two, and three-dimensional uniform distributions on [0, 1]m for m = 1, 2, 3 respectively. We show again that a feedforward, fully-connected neural network under Wasserstein loss and Re LU nonlinearity is still able to learn space-filling sheets on the manifold of higher dimension than the input. We again start with the case where the input dimension matches the target dimension (m = 3). In Figure 6, we show the results of training a neural network with 3 hidden layers of 128 neurons each. It can be seen that the network achieves both low Wasserstein loss as well as empirical fill distance in Figure 6. Note that in this case the input data is a randomly re-sampled each training iteration, while the target remains a fixed grid on the unit cube, such that the input and target data are not identical at any iteration. Now we reduce the input dimension by 1, mapping a uniform distribution of dimension m = 2 to a uniform distribution in a higher target dimension d = 3. We observe in Figure 7 Wang, Niu, Wang and Li Figure 5: Simulation 2: Training results of a 7 hidden layer, 250 node each, fully connected network over 5,000 iterations. Training trajectory (left), Wasserstein loss (top right), and fill distance (bottom right) of mapping a 1-D uniform distribution on the interval [0, 1] to a 2-D uniform distribution on a cylinder embedded in R3. As the loss decreases over the iterations, the fitted curve fills more of the surface. Figure 6: Simulation 3: Training results of a 3 hidden layer, 128 node each, fully connected network over 4,000 iterations mapping the 3-D uniform input to a 3-D uniform target data manifold on the unit cube. Here the multicolored curve is generated by the neural network (colored by z-axis height) and is filling in the data manifold (blue). that a larger neural network with 4 hidden layers of 256 nodes each achieves similar loss to the smaller model used when m = d = 3. Finally, we reduce the input dimension one more time for the case when the input dimension is m = 1 and the data manifold is uniform in d = 3 dimensions. We observe in Figure 8 Deep Generative Models on Manifolds Figure 7: Simulation 3: Training results of a 4 hidden layer, 256 node each, fully connected network over 2,000 iterations mapping a uniform distribution on the unit square [0, 1]2 to a uniform distribution on [0, 1]3, colored by height (z-axis). As the loss decreases over the iterations, the fitted surface fills more of the space in the cube. Figure 8: Simulation 3: Training results of a 5 hidden layer, 256 node each, fully connected network over 5,000 iterations mapping the 1-D uniform input on the interval [0,1] to a 3-D uniform target data manifold on the unit cube. Here the orange curve is generated by the neural network and is filling in the data manifold (blue). that again a larger network with 5 hidden layers of 256 neurons each is required to achieve similar loss to the m = 2 and m = 3 cases. Together, the three simulations underscore the networks ability to adaptively map lower-dimensional inputs onto higher-dimensional man- Wang, Niu, Wang and Li ifolds in a generalizable manner, effectively demonstrating our main theorems in practice. Again, see Appendix C for full implementation details and extension to m = d + 1. 5. Sketch of the Proof In this section, we provide a proof sketch of our main theorems, highlighting key steps while deferring technical details to the Appendix. We start with the proof of Theorem 7. 5.1 Sketch Proof of Theorem 7 The proof is divided into three cases, depending on the input dimension m: when m > d, when 2 m d, and when m = 1. We start with the simplest case: m > d. 5.1.1 Case 1: m > d In this case, since there are m d redundant input variables, we introduce a layer g1 : Rm Rd+1 that effectively ignores these redundant input variables: g1(x1, ..., xm) = (x1, ..., xd+1). g1 is a one-layer Re LU neural network with zero bias, and a weight matrix W = I(d+1) (d+1) 0(d+1) (m d 1) . This step simplifies the network s input to d + 1 dimensions, aligning with the settings of Lemma 5. We then define the pushforward measure of ρ via g1, defined on Rd+1, by g1 as ρ = g1 (ρ). By Lemma 5, for any ϵ > 0, there exists a neural network g GNN(d + 1, L , p , κ) such that W1(g (ρ ), Q) < ϵ, where L = O log 1 ϵ , p = O Dϵ d/α , κ = B. Now we let g := g g1, so we have W1(g (ρ), Q) = W1((g g1) (ρ), Q) = W1(g (g1 (ρ)), Q) = W1(g (ρ ), Q) < ϵ. Finally, observe that g GNN(m, L, p, κ), where L = L +1 = O log 1 ϵ , p = max{p , m} = O Dϵ d/α , κ = max{B, 1}, which finishes the proof of this case. 5.1.2 Case 2: 2 m d In this case, where the input dimension lies between 2 and d, our strategy comprises several steps. First, we find a m 1 dimensional compact Riemannian manifold M ϵ approximating M. Then, we project the original distribution Q onto M ϵ to obtain Q ϵ, an approximation of Q. Next, we smooth Q ϵ into Q ϵ to satisfy Assumption 4. Finally, we show that Q ϵ can be approximated by a deep generative model, as detailed step by step below. Firstly, the existence of such a m 1 dimensional manifold M ϵ is guaranteed by the following Lemma 10 leveraging the idea of a space-filling manifold, with proofs in Appendix B.1. Lemma 10 Let M be a d dimensional Riemannian manifold isometrically embedded in RD. Then, for any 1 q d and any ϵ > 0, there exists a q-dimensional manifold isometrically embedded in RD, denoted by M ϵ satisfying Deep Generative Models on Manifolds d(M, M ϵ) = sup x M inf y M ϵ Secondly, we define Q ϵ := πϵ (Qϵ), where πϵ : M M ϵ is the orthogonal projection onto M ϵ. The Wasserstein distance W1(Q, Q ) is then bounded by W1(Q, Q ϵ) d(M, M ϵ) < ϵ. Thirdly, since Q ϵ may not satisfy Assumption 4, we cannot directly apply Lemma 5. To address this, we find another distribution on M ϵ, denoted by Q ϵ that satisfies Assumption 4, with W1(Q ϵ, Q ϵ ) < ϵ, . The existence of such Q ϵ is supported by the following Lemma 11, with proofs given in Appendix B.2. Lemma 11 Let M be a m 1 dimensional Riemannian manifold isometrically embedded in RD, and let Q be a distribution on M . Then, for any ϵ > 0, there exists a distribution Q on M satisfying Assumption 4, with W1(Q , Q ) < ϵ. Finally, by Lemma 5, there exists a network g GNN(m, L, p, κ) such that W1(g (ρ), Q ϵ ) < ϵ, where L = O log 1 , p = O Dϵ m α(m,ϵ) , κ = B. By the triangular inequality, we have W1(g (ρ), Q) W1(g ρ, Q ϵ ) + W1(Q ϵ , Q ϵ) + W1(Q ϵ, Q) 3ϵ. The remaining step is to show that α(m, ϵ) 0 as ϵ 0 for any m, which leads to the super-exponential complexity. However, this proof is technical, so we summarize it in Lemma 12 and defer the proofs to Appendix B.3. Lemma 12 α(m, ϵ) 0 as ϵ 0 for any 2 m d. 5.1.3 Case 3: m = 1 For the case where m = 1, the approach for 2 m d is not directly applicable, as Lemma 5 requires the input dimension to be at least two. Thus, we adopt a different strategy to construct the approximation manifold. Notably, in this case, the input space is a unit interval [0, 1], with ρ = Unif(0, 1). First, we apply Lemma 10 again, with q = 1, to obtain an approximation manifold M ϵ, a one-dimensional space-filling curve. Subsequently, we define Q ϵ = πϵ (Q), where πϵ : M M ϵ is the orthogonal project. This yields Q ϵ as a distribution on M ϵ satisfying W1(Q, Q ϵ) ϵ. Lemma 13 guarantees the construction of a smooth parameterization for the curve M ϵ, given by ηϵ : [0, 1] M ϵ, with W1(ηϵ (ρ), Q ϵ) ϵ. The proof is in Appendix B.4. Lemma 13 Let M ϵ be a parametric curve of dimension 1, and let Q ϵ be a distribution on M ϵ, then there exists a reparameterization ηϵ : [0, 1] M ϵ such that W1(ηϵ (ρ), Q ϵ) ϵ. Wang, Niu, Wang and Li Next, Lemma 14 allows us to approximate η using a neural network g. Lemma 14 (Dahal et al. (2022) Lemma 3) If η(t) is α H older continuous and bounded, then for any ϵ > 0 there is a Re LU network g such that Z [0,1] g(t) η(t) dt ϵ, and g GNN(1, L, p, κ), where L = O log 1 ϵ , p = O ϵ 1 α , and κ = B. Then we obtain g with W1(ηϵ (ρ), g (ρ)) ϵ and W1(Q, g (ρ)) W1(Q, Q ϵ) + W1(Q ϵ, ηϵ (ρ)) + W1(ηϵ (ρ), g (ρ)) < 3ϵ. Finally, we node that α = α(ϵ) depends on ϵ through ηϵ, and we claim that α(ϵ) ϵ 0 0. This is summarized as Lemma 15 with proofs in Appendix B.5, which completes the proof of this m = 1 case. Lemma 15 α(ϵ) ϵ 0 0. 5.2 Sketch Proof of Theorem 8 Let x1, ..., xn be iid samples from Q, and let Qn denote the associated empirical distribution. By Theorem 7, for any ϵ > 0 ,there exists g GNN(m, L, p, κ) such that W1(Q, g (ρ)) < ϵ, where L = O log 1 ϵ , p = O Dϵ m α(m,ϵ) , and κ = B. For the empirical minimizer bgn, we can break down the Wasserstein distance W1(bgn (ρ), Q) as follows: W1(bgn (ρ), Q) W1(bgn (ρ), Qn) + W1(Qn, Q) W1(g (ρ), Qn) + W1(Qn, Q) W1(g (ρ), Q) + W1(Q, Qn) + W1(Qn, Q) ϵ + 2W1(Q, Qn). Given Assumptions 3 and 4, Lemma 5 of Dahal et al. (2022) implies that for any δ > 0, a constant Cδ exists such that E [W1(Q, Qn)] Cδn 1 d+δ , where Cδ is a constant independent of n. Setting ϵ = n 1 d+δ , we have E [W1(bgϵ (ρ), Q)] (1 + 2Cδ)n 1 d+δ . Thus, we have achieved the claimed expected Wasserstein distance with desired complexity of bgn: L = O log n 1 d+δ , O Dn m (m+δ)α(m,n,d,δ) , and κ = B. In particular, when m > d, α(m, n, d, δ) remains constant (α), while when m d, limn α(m, n, d, δ) = limϵ 0 α(m, ϵ) = 0 as per Theorem 7. Deep Generative Models on Manifolds 6. Discussion Using the theory of space-filling curves, in this work we establish key relationships between the input dimension, the true dimension of the data manifold, and the approximation error of fitting deep generative networks. Specifically, we have shown that deep generative networks can learn the manifold structure in a generalizable manner regardless of the input dimension, even when the input dimension is as low as one. Furthermore, we quantify the complexity trade-offamong network size, input dimension, and approximation error, showing that underestimating the input dimension below the manifold dimension introduces a super-exponential increase in the width of the network. Furthermore, these results did not require any additional assumptions compared to previous work (Dahal et al., 2022). Toy simulation studies on twoand three-dimensional manifolds provide empirical evidence of GANs learning such space-filling curves from various input dimensions, offering additional explanation to the success of GANs and similar models to learn distributions without explicit need for dimensionality estimation in practice. Based on these results, there are multiple potential directions for future work. First, although we have derived statistical guarantees for existence, there is no guarantee on being able to practically fit such models, especially when the input dimension is heavily underestimated. Second, while we have established upper bounds on the complexity necessary to achieve certain approximation errors, identifying lower bounds poses a formidable challenge and remains an open question. The absence of explicit lower bounds in existing literature highlights the complexity and novelty of this problem area. Establishing these bounds would not only deepen our understanding of the behavior of generative networks but also potentially guide the development of more efficient network architectures. Third, exploring the phase transition in model complexity versus approximation error as a potential method for estimating the intrinsic dimension of data manifolds could provide practical insights into dimensionality reduction and the effective training of generative models. Finally, although we have shown that neural networks are capable of learning distributions exhibiting low dimensional manifold structures, there remains situations in which the data do not lie on a manifold of a single intrinsic dimension throughout, as is the case in the union of manifolds hypothesis (Brown et al., 2022). Such extension would further elucidate the ability of these deep generative models to empirically approximate arbitrary datasets in practice. Acknowledgment DL acknowledges Wenjing Liao for a helpful discussion. KW was supported by National Institute of Health (NIH) grant R01 LM014407; HN was supported by NIH R01 CA253450; YW was supported in part by the Office of Naval Research under grant N00014-23-1-2590, the National Science Foundation under grant No. 2231174, No. 2310831, No. 2428059, No. 2435696, No. 2440954, and a Michigan Institute for Data Science Propelling Original Data Science (PODS) grant; DL was supported by NIH R01 LM014407, NIH P50 CA058223, NIH P42 ES031007. Wang, Niu, Wang and Li Appendix A. Code Availability All code for experiments and generating paper figures can be found in the Git Hub repository at https://github.com/hong-niu/dgm24. All experiments were run on a single machine with an Nvidia RTX 4080 GPU with 16 GB of memory and adapted from the Python Optimal Transport (POT) package Flamary et al. (2021) for computing the Wasserstein distance. Appendix B. Additional Proof Details In this section, we prove five lemmas used in the sketch of the proofs of Theorem 7 and Theorem 8, namely, Lemma 10, Lemma 11, Lemma 12, Lemma 13, and Lemma 15. Note that Lemma 14 is directly from Lemma 3 of Dahal et al. (2022), so we skip its proof. B.1 Proof of Lemma 10 We first prove Lemma 10, which shows that for a compact d dimensional manifold M, there exists a smaller dimensional manifold M ϵ that approximates the manifold M. Proof Since RD is a separable metric space, M RD is also separable. Let E M be a dense countable subset of M, enumerated E = {e1, e2, ....}. Because E is dense on M, we have lim n sup x M inf y {e1,...,en} x y 2 2 = 0. Suppose that we can find a sequence of manifolds Mn satisfying for all n N, En := {e1, ..., en} Mn. We have 0 sup x M inf y Mn x y 2 2 sup x M inf y {e1,...,en} Consequently: lim n sup x M inf y Mn x y 2 2 = 0. Therefore, for any ϵ > 0, there exists a large enough n such that sup x M inf y Mn x y 2 2 < ϵ. It suffices to show that for any finite set En, there exists a Riemannian Manifold Mn interpolating En, that is, En Mn. We know there exists a q dimensional subspace, represented by the associated projection matrix P, such that, for all i = j, Pei = Pej. We can choose a basis of this subspace to be the first q coordinates through a change of basis. For convenience, we thus claim without loss of generality that the projection onto the first q coordinates πq(x1, ..., xn) = (x1, ..., xq) is injective on En so that πq(ei) = πq(ej) when i = j. Then there exists infinitely many interpolants, f, satisfying yi = f(πq(ei)). Deep Generative Models on Manifolds Included in our options are smooth interpolants, such as polynomial interpolants, splines, variational minimizing surfaces, RBF interpolations, Backus-Gilbert (Wendland, 2004), etc. We choose the cubic polyharmonic/thin plate spline due to its simplicity (Duchon, 1977): f = min f W m 2 (πq(M)),yi=f(πq(ei)) Jd m(f), where Jd m(f) = P α1+...+αd=m m! α1!...αd! R ... R mf xα1 1 ... x αd d Qd j=1 dxj. This interpolant is a smooth function on a compact domain, with a bounded derivative and its image is a smooth manifold. Further, we consider the function: F : πq(M) Rn, (x1, ..., xq) 7 (x1, ..., xq, fq+1(x1, ..., xq), ..., fn(x1, ..., xq)). Note that the image F(πq(M)) is a Riemannian manifold that interpolates F(πq(ei)) = ei. Thus, E F(πq(ei)) and d(F(πq(ei)), M) d1(E, M) ϵ. Hence, for any d dimensional Riemannian manifold M and any ϵ > 0, there exists a q dimensional Riemannian manifold M , such that d(M, M ϵ) = sup x M inf y M ϵ B.2 Proof of Lemma 11 We now prove Lemma 11, which shows that for any distribution on a compact manifold, we may find a sufficiently close distribution that satisfies Assumption 4. Proof The objective of this lemma is to perturb Q slightly, creating a perturbed distribution Q that satisfies Assumption 4. The proof involves the following five steps: 1. Decomposition of the manifold: Partition the manifold M ϵ into a finite union of balls, with each as a coordinate chart derived from some Euclidean ball. 2. Local pull back distribution: Pull back the local distribution Q through the exponential map. 3. Smooth local distributions: Smooth each local pull-back distribution to ensure closeness to the original pull-back. 4. Pushforward and reweighting: Push these local distributions forward to M using their exponential maps. Paste them together to form a new distribution, and re-weight it to account for overlaps. Wang, Niu, Wang and Li 5. Density adjustment: Add a small uniform density to ensure a positive lower bound on density and show that this new distribution is sufficiently close to the original distribution and satisfies Assumptions 4. Since M is compact, the injectivity radius, denoted by ι, is strictly positive (Do Carmo and Flaherty Francis, 1992). Fix 0 < r < ι and denote Ux = expx(Bx(0, r)), where expx is the exponential map at x and Bx(0, r) is the Euclidean ball in the tangent space Tx M . Again, due to the compactness, there exists a finite covering {expxj(Bxj(0, r)) : j = 1, 2, ..., N}. Let Q xj be the restriction of Q onto Uxj, that is Q xj(S) = R S 1(x Uxj)d Q (x). Since r < ι, the exponential map expxj is invertible on Uxj, so the the pullback of Q xj through expxj( ), denoted by Pxj, a measure on Bxj(0, r), is well-defined. However, the pull-back measure Pxj may not be smooth enough. To address this, for δ > 0, we consider the function Hδ(x, y) = Cδ(x)1{ x y < δ}, where Cδ(x) is the normalizing constant 1 Cδ(x) = R Bxj (0,r) 1( x y < δ)dy. Then denote the locally smoothed measure by P δ xj, with density: pδ xj(y) = Z Bxj (0,r) Hδ(x, y)d Pxj. Since the transport cost between (Pxj, P δ xj) is bounded by sup Hδ(x,y)=1 x y δ, we have W1(Pxj, P δ xj) δ. Pushing P δ xj forward through the exponential map onto M yields a locally smoothed distribution on Uxj, expxj (P δ xj). We have: W1(expxj (Pxj), expxj (P δ xj)) = inf γ Γ(expxj (Pxj ),expxj (P δxj )) Uxj Uxj x y dγ = inf γ Γ(Pxj ,P δxj ) Bxj (0,r) Bxj (0,r) expxj(x) expxj(y) dγ Lxj inf γ Γ(Pxj ,P δxj )) Bxj (0,r) Bxj (0,r) x y dγ, where Γ is the coupling and Lxj is the Lipschitz constant of expxj on Bxj(0, r) (Do Carmo and Flaherty Francis, 1992). Note that as δ approaches 0, expxj(P δ xj) converges in distribution to Q xj. Now we recombine these locally smoothed measures back into a new measure: j=1 expxj (P δ xj). Deep Generative Models on Manifolds Then let N(x) to be the number of Uxj containing x, that is N(x) = PN j=1 1(x Uxj) so that: lim δ 0 Qδ,0 = lim δ 0 c=1 expxj (P δ xj) = Z M N(x)d Q (x). To compensate for overlap in the covering, we adjust Qδ,0 to define a new measure 1 N(x)d Qδ,0. It is clear that as δ 0, Qδ d Q . Due to the compactness of M , the cost function is bounded and thus lim δ 0 W1(Q , Qδ) 0. So far, however, the density of Qδ, denoted by qδ may not be positively lowered bounded, as required by Assumption 4. To fill in the final gap, we conduct the following surgery to Qδ. By the compactness of M , the uniform density, denoted by u(x) is well defined on M . Let θ > 0 be an arbitrary small number, we define q = (1 θ)qδ + θu. Let Q be the corresponding distribution, we observe that 1):Q satisfies Assumption 4 for any δ and θ; 2) Q converges in distribution to Q when δ, θ 0. That is, for any ϵ, there exists δ > 0 and θ > 0 such that W1(Q , Qδ) < ϵ 2 and W1(Q , Qδ) < ϵ W1(Q , Q ) W1(Q , Qδ) + W1(Qδ, Q ) < ϵ, which closes the proof. B.3 Proof of Lemma 12 We now prove Lemma 12, which shows that when the input dimension is smaller than the true dimension, the quantity α(m, ϵ) approaches 0 as ϵ approaches 0. Proof In existing literature, such as Dahal et al. (2022), α is often treated as a constant, since the manifold under consideration is fixed. However, in our framework, both the approximating space-filling manifold M ϵ and the approximating distribution Q ϵ change with ϵ. In this situation, it becomes imperative to keep track of the constant α while the manifold M ϵ and Q ϵ are evolving. This task is extremly challenging for several reasons. First, much of the existing literature, including Villani et al. (2009); Dahal et al. (2022) , treats α as a constant, without a in-depth discussion. Second, even when we scrutinize their proofs, they typically establish only the existence of α, without any explicit construction. Third, the intricacies of α are hidden within the study of Monge-Amp ere equations of the potentials of the density functions of Q ϵ, a distinct, although related, field of study. Our analysis begins with the recent contributions of Dahal et al. (2022), who claim that there exists a constant α so that the complexity p = O(Dϵ d/α). The foundation of their construction of optimal transport relies on Theorem 12.50 from Villani et al. (2009), also known as Caffarelli s regularity theory. Within this theorem, the constant α, denoted by β Wang, Niu, Wang and Li by Villani, is introduced ambiguously as for some β (0, 1) , without further elucidation provided. To address this, we examined a series of papers by Caffarelli (Caffarelli, 1990a,b, 1991, 1992a,b, 1996), where we discovered that the constant α depends on the infimum of the density function of Q ϵ, denoted by q ϵ. In essence, as ϵ 0, inf q ϵ 0, leading to the conclusion that α(m, ϵ) 0. We prove this claim step by step. 1. Prove Vol(M ϵ) ϵ 0 . First we prove that as ϵ 0, the volume of the space-filling manifold M ϵ goes to infinity. Note that this volume is the Riemannian volume of the m 1 dimensional manifold M ϵ, not the volume in the ambient space. By Lemma 10, the tube around M ϵ with radius 2ϵ, denoted by T2ϵ(M ϵ) := {x RD : d(x, M ϵ) < 2ϵ}, contains M. Then, by Weyl s Tube Theorem (Weyl, 1939; Gray, 2003), the volume of the tube is given by the following series expansion: Vol(T2ϵ(M ϵ)) = (π2ϵ2) D (m 1) 2(D (m 1))! [(m 1)/2] X k2l(M ϵ)(2ϵ)2l (D (m 1) + 2)(D (m 1) + 4) (D (m 1) + 2l), where k2l(M ϵ) is the integrated mean curvature of M ϵ and k0(M ϵ) = Vol(M ϵ) as a special case when l = 0. Focusing on the leading term, we have Vol(T2ϵ(M ϵ)) = (π2ϵ2) D (m 1) 2(D (m 1))! Vol(M ϵ) + O(ϵ2) . (1) Similarly, we estimate the volume of the tube around M, Tϵ(M) as: Vol(Tϵ(M)) = (πϵ2) D d Vol(M) + O(ϵ2) . (2) Now we show that T2ϵ(M ϵ) Tϵ(M). For any x Tϵ(M), there exists y M such that x y < ϵ. By the construction of M ϵ, there exists z M ϵ such that y z < ϵ, as a result, d(x, M ϵ) x z x y + y z 2ϵ, which implies x T2ϵ(M ϵ) and hence T2ϵ(M ϵ) Tϵ(M). Linking the two volumes in Equation (1) and Equation (2), we conclude that (π2ϵ2) D (m 1) 2(D (m 1))! Vol(M ϵ) + O(ϵ2) (πϵ2) D d Vol(M) + O(ϵ2) = Vol(M ϵ) Cϵ m 1 d where the last limit is due to m 1 d < 0. Deep Generative Models on Manifolds 2. Prove inf q ϵ ϵ 0 0. Since M ϵ is compact and q ϵ integrates to one, we have M ϵ q ϵd Vol M ϵ Vol(M ϵ) inf q ϵ. As a result, we conclude that inf q ϵ ϵ 0 0 since Vol(M ϵ) ϵ 0 . 3. Prove α(m, ϵ) ϵ 0 0. As stated in Theorem 12.50 of Villani et al. (2009), the optimal transport mapping, denoted by Fϵ, is represented by its potential Fϵ C1,α(m,ϵ) such that Fϵ = ψϵ Cα(m,ϵ). Moreover, we know that q ϵ( ψϵ) det Dijψϵ = q from Equation (1) of Caffarelli (1992b), where Dij represents the Hessian matrix. The existence of such α(m, ϵ) was first given by Caffarelli (1991), which replies on the key assumption that there exists constants λ1, λ2 such that 0 < λ1,ϵ det Dijψϵ λ2,ϵ < . (3) Then, the key quantity we are interested, is defined as α(m, ϵ) = log2 δ(ϵ), where δ(ϵ) < 1 is the constant such that hϵ,1/2(x x0) < δ(ϵ) < hϵ,1(x x0), where hϵ,β is the cone generated by x0 and the level surface ψϵ = β (see Lemma 2 of Caffarelli (1991) for more details). The existence of such δ(ϵ) is given by Lemma 2 of Caffarelli (1991). Since inf q ϵ ϵ 0 0, we know that λ2,ϵ = sup det Dijψϵ , which violates the key assumption in Caffarelli (1991), leading to the fact that ψϵ has infinite derivative and hence infinite local changes and inseparable level sets, so that δ(ϵ) ϵ 0 1. As a result α(m, ϵ) = log2 δ(ϵ) ϵ 0 0, as desired. B.4 Proof of Lemma 13 We now prove Lemma 13, which states that an approximating curve for a manifold may be reparameterized by a differentiable function such that the pushforwards of the uniform distribution is close to the true distribution. Proof We first introduce the arc-length parameterization as ξ(t) : [0, Lϵ] M ϵ, which makes ξ invertible with a constant speed ξ (t) 1. For simplicity, we assume the total length Lϵ = 1 in this proof. Let Pξ be the pull back of Q ϵ onto [0, 1] through ξ, that is, ξ (Pξ) = Q ϵ. Then for δ > 0, define Hδ(x, y) = Cδ(x)1{ x y < δ}, where Cδ(x) is the normalizing constant: 1 Cδ(x) = Z [0,1] 1{ x y < δ}dy. Wang, Niu, Wang and Li Then we define the following density function [0,1] Hδ(x, y)d Pξ Z [0,1] sup x [0,1] Cδ(x) d Pξ = sup x [0,1] Cδ(x) = 1 Denote the corresponding distribution as Fδ. Let X Pξ and define Y | X = x to be uniform on the interval [x δ, x + δ] [0, 1], so that (X, Y ) is a coupling in Γ(Pξ, Fδ). The cost of such a transport is bounded by E(X,Y ) X Y = EX [EY [ X Y | X]] EX [EY [δ | X]] = δ. By the definition of the Wasserstein distance, W1(Fδ, Pξ) < δ. Subsequently, we can bound W1(ξ (Fδ), Q ϵ) = inf γ Γ(ξ (Fδ),Q ϵ) M ϵ M ϵ x y dγ = inf γ Γ(ξ (Fδ),ξ (Pξ)) M ϵ M ϵ x y dγ = inf γ Γ(Fδ,Pξ) [0,1] [0,1] ξ(x) ξ(y) dγ inf γ Γ(Fδ,Pξ) [0,1] [0,1] x y dγ = W1(Fδ, Pξ) δ. If the density fδ(x) is lower bounded away from 0, then the Cumulative Distribution Function (CDF) of fδ, denoted by Cδ(x), is invertible. Then we use the inverse transform to approximate Q ϵ: W1((ξ C 1 δ ) (ρ), Q ϵ) δ. Thus, ηϵ(t) := ξ(F 1 δ (t)) is our desired parametrization. Here, we highlight its dependence on ϵ through the domain M ϵ, which is crucial for Lemma 15. If fδ is not lower bounded away from 0, we can adopt the same technique in step 5 in the proof of Lemma 11, where we perturb it by a small uniform distribution with density u: f δ(x) = (1 θ)fδ(x) + u, where θ > 0. Denoting the CDF of f δ as C δ, we define ηϵ(t) := ξ(C δ 1(t)). As a result, there exists δ, θ > 0 such that W1(η (ρ), Q ϵ) = W1((ξ C 1 δ ) (ρ), Q ϵ) < ϵ. B.5 Proof of Lemma 15 We now explain the proof of Lemma 15, which shows that limϵ 0 α(m, ϵ) = 0 when m = 1. Proof The proof follows the same structure as the proof of Lemma 12, with exactly the same 3 steps. In fact, the arguments presented in the proof of Lemma 12 are applicable for any dimension m. Deep Generative Models on Manifolds B.6 Proof of Corollary 9 In Dahal et al. (2022), the authors first prove the existence of an oracle map g that perfectly transports the input distribution to the target distribution, that is Q = g (ρ). They then approximate this function g via a neural network f such that f g L1(ρ) ϵ, with explicit complexity bound on f. Then, in Lemma 2 of their paper, they show that one can further bound the Wasserstein-1 distance by W1(f (ρ), g (ρ)) C f g L1(ρ) Cϵ. To generalize the setting to Wasserstein-p distances, we need two components. First we extend Lemma 2 of Dahal et al. (2022) to bound Wasserstein-p distances by f g Lp(ρ). Then we find a suitable neural network f to control f g Lp(ρ). We start with the Wasserstein-p distance: Wp(f (ρ), g (ρ)) = inf π Γ(f (ρ),g (ρ)) E(x,y) π( x y p 2)1/p. Because the Wasserstein-p distance is the infimum over all couplings, we may find an upper bound by using any choice of coupling. The most obvious choice of coupling is (f(z), g(z)) (f (ρ), g (ρ)) where z ρ, which yields the following upper bound: Wp(f (ρ), g (ρ))p = inf π Γ(f (ρ),g (ρ)) E(x,y) π x y p 2 Ez ρ f(z) g(z) p 2 Z f(x) g(x) p 2 dρ = f g p Lp(ρ). According to Dahal et al. (2022), the oracle g may be constructed so as to be α H older continuous, and thus continuous. The Universal Approximation Theorem of neural networks Hornik (1989) states that if g is continuous, we may construct a neural network f such that f g ϵ. Note that f g Lp(ρ) f g , so we have Wp(f (ρ), g (ρ)) f g Lp(ρ) f g ϵ, since ρ is a probability measure. For the empirical result, we follow the same line of argument as in the Wasserstein-1 case. Define Qn to be the empirical distribution and bgn to be the empirical risk minimizer. Because Q is bounded, we know it has a finite p-th moment, so we have Wp(bgn (ρ), Q) Wp(bgn (ρ), Qn) + Wp(Qn, Q) Wp(g (ρ), Qn) + Wp(Qn, Q) Wp(g (ρ), Q) + Wp(Q, Qn) + Wp(Qn, Q) ϵ + 2Wp(Q, Qn). From Weed and Bach (2019), we know that if the Minkowski dimension of Q is γ, then for all δ > 0, there exists Cδ > 0 such that E(Wp(Q, Qn)) Cδn 1 γ+δ . By assumption, Q is supported on a compact d dimensional Riemannian manifold and consequently its Minkowski dimension is d. Thus, by choosing ϵ = n 1 d+δ as we did before, we obtain E [Wp(bgn (ρ), Q)] (1 + 2Cδ)n 1 d+δ . Wang, Niu, Wang and Li While we have proved the existence of an approximating deep neural network for the Wasserstein-p distance, providing specific bounds on the network s complexity analogous to those established for the Wasserstein-1 distance in Theorem 7 and Theorem 8 remains unresolved. In the case of W1, the complexity bounds are intricately tied to the construction detailed in Lemma 2 of Dahal et al. (2022). Extending these bounds to Wp distances would require developing a similar foundational lemma tailored to Wp settings, a task that involves substantial theoretical innovation and is not immediately straightforward. Another potential approach involves leveraging results from Ohn and Kim (2019), which offer complexity bounds in the Lp norm; however, this approach introduces a new challenge as the bound on network weights, κ, becomes dependent on ϵ. Removing the dependency on ϵ in the bound for network weights is not straightforward, requiring further investigation. Whether we develop constructions similar to Lemma 2 of Dahal et al. (2022) for Wp distances, or apply theories like those in Ohn and Kim (2019), the challenges are nontrivial. We consider this an important direction for future research. Appendix C. Additional Experiments Below we provide additional simulations elucidating the behavior of the neural networks within the same toy scenarios based on the simulations of the main paper to further support our theoretical findings. In Section C.1, we present an additional experiment regarding neural network complexity and approximation error in the context of Simulation 1. In Section C.2, we demonstrate consistent findings with the known theory extending each of the 3 simulations of the main paper to the input m = d + 1 case as proven previously in (Dahal et al., 2022). In each case we find that a smaller or equivalently sized neural network can learn comparatively close approximations to the m = d cases presented in the main paper. Finally, in Section C.3, we show that using a normally distributed input, rather than uniform, leads again to equally strong approximation performance against a normally distributed target output. Again, complete implementation notebooks generating each of the paper figures can be found at: https://github.com/hong-niu/dgm24. C.1 Extensions to Simulation 1 (Unit Square), m = 1, d = 2 In this section, we provide an additional study of complexity size and approximation accuracy when varying the size of the network. We repeat the same setup as Simulation 1 of the main paper as a representative example, where the input dimension m = 1 and the target data dimension d = 2, with the following changes. We present the results of three different sized networks a) 2 hidden layers, 10 nodes each, b) 3 hidden layers, 100 nodes each, and c) 5 hidden layers, 200 nodes each, and also significantly increase the training epochs to a maximum of 20,000 iterations for each experiment. In each case, the input distribution is uniform in 1-dimension, and the data distribution is uniform in 2-dimensions. We observe that the approximation improves with increasing network complexity, and that aligned with our theory, a model as small as network a) with two hidden layers is sufficient for fitting the data manifold when the input dimension is equal to the data, but such network is insufficient for approximating the manifold to the same degree when the input dimension is underestimated. Deep Generative Models on Manifolds a) 2 layers, 10 nodes each b) 3 layers, 100 nodes each c) 5 layers, 200 nodes each Figure 9: Training results of 3 different network sizes on mapping the 1-dimensional uniform input to a higher 2-dimensional uniform target data manifold. Here the orange curve is generated by the neural network and is filling in the data manifold (blue). Networks a) and b) were trained to 20,000 iterations, while network c) only required 10,000 iterations to converge. We observe that as the network complexity increases, the model is able to better fit the target distribution with both lower Wasserstein loss and empirical fill distance against the true data. Note that the smallest network a) is sufficient when the input and target data dimension are equal, as was in Section 4.1, but is unable to achieve the same degree of approximation after 20,000 training iterations now that the data dimension has been underestimated, aligning with expectations from our presented theory. Again consistent with our theory, we find that a much larger network c) with 5 hidden layers of 200 nodes each is required to achieve a closer Wasserstein loss and fill distance to the results of Section 4.1 now that the data dimension has been strictly underestimated, with a final Wasserstein loss of 0.0016 and fill distance of 0.0213. Furthermore, network c) improves on the medium sized network in network b) which achieves a final loss of 0.0028 and fill distance of 0.0302. We also note that the largest network c) also consistently achieves lower losses throughout training than the middle sized network b), despite being trained for 10,000 fewer iterations. Wang, Niu, Wang and Li a) 2 layers, 10 nodes each b) 3 layers, 100 nodes each c) 5 layers, 200 nodes each Figure 10: [Top Row] Wasserstein loss per training iteration for each of the three networks presented in Section C.1. [Bottom Row] Fill distance between generated curve and data per iteration for each of the three networks presented in Section C.1. Deep Generative Models on Manifolds C.2 Extensions to the Case for m = d + 1 Next we show that in each of the cases of Simulation 1, 2 and 3 from the main paper, when the input dimension m is 1 greater than the dimension of the target manifold d, a neural network can be trained to approximate the manifold as well as the m = d case as is consistent with previous theory in (Dahal et al., 2022). We note that we were able to find such a neural network using fewer or equivalently many neurons than that of the m = d case in each of the three cases. In Figure 11, we show that a smaller neural network than the 2-D to 2-D analogue of Simulation 1 is able to learn a mapping between a 3-D uniform input distribution to a smaller 2-D target distribution. We note that in this case, the number of training samples used to train the network (900) is identical to the 2-D to 2-D case in Simulation 1 of the main paper (900), however the number of visualization points for the 3-D uniform input has been increased to maintain the density of the input grid as the 2-D case. Figure 11: Training results of a 1 hidden layer, 10 node, fully connected network for 10,000 iterations. Training trajectory (left), Wasserstein loss (top right), and fill distance (bottom right) of mapping a uniform distribution on the cube [0, 1]3 to a uniform distribution on the unit square [0, 1]2. It can be seen that as the loss decreases, the fitted curve fills more of the square as expected, and as well as the 2-D to 2-D analogue in Simulation 1. Next, we repeat for m = d + 1 input dimension analogues of Simulation 2 in Figure 12, and for Simulation 3 in Figure 13. Again, the number of training samples used in both of these simulations is equal to the respective m = d cases of the main paper, while the neural networks used in the m = d + 1 cases actually use fewer total neurons to learn similarly accurate approximations. Wang, Niu, Wang and Li Figure 12: Training results of a 3 hidden layer, 20 node each, fully connected network for 10,000 iterations. Training trajectory (left), Wasserstein loss (top right), and fill distance (bottom right) of mapping a uniform distribution on the unit cube [0, 1]3 to a 2-D uniform distribution on a cylinder. It can be seen that as the loss decreases, the fitted curve fills more of the cylindrical surface as expected, and as well as the 2-D to 2-D analogue in Simulation 2. Figure 13: Training results of a 2 hidden layer, 128 node each, fully connected network for 10,000 iterations. Training trajectory (left), Wasserstein loss (top right), and fill distance (bottom right) of mapping a uniform distribution on [0, 1]4 to a uniform distribution on the unit cube [0, 1]3. It can be seen that as the loss decreases, the fitted curve fills more of the cube as expected, and as well as the 3-D to 3-D analogue in Simulation 3. Deep Generative Models on Manifolds C.3 Extensions to Non-Uniform Distributions Here, we briefly consider the case when the distributions are no longer uniform. Here we present two simulations where the target manifold is a 2-D normally distributed sample, and show that keeping the neural network architecture identical and only varying using either a 1-D uniform or 1-D normal distribution as the input results in equally accurate approximations of the target manifold in terms of Wasserstein distance and fill distance as is consistent with our theoretical assumptions. Figure 14: Training results of a 5 hidden layer, 200 node each, fully connected network for 10,000 iterations. Training trajectory (left), Wasserstein loss (top right), and fill distance (bottom right) of mapping a uniform distribution on the interval [0, 1] to a 2-D normally distributed target sample. Figure 15: Training results of a 5 hidden layer, 200 node each, fully connected network for 10,000 iterations. Training trajectory (left), Wasserstein loss (top right), and fill distance (bottom right) of mapping a 1-D normally distributed input to a 2-D normally distributed target sample. Wang, Niu, Wang and Li Luca Ambrogioni, Umut G u cl u, Ya gmur G u cl ut urk, Max Hinne, Marcel A van Gerven, and Eric Maris. Wasserstein variational inference. Advances in Neural Information Processing Systems, 31, 2018. Martin Arjovsky and L eon Bottou. Towards principled methods for training generative adversarial networks. ar Xiv preprint ar Xiv:1701.04862, 2017. Martin Arjovsky, Soumith Chintala, and L eon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214 223. PMLR, 2017. Jianmin Bao, Dong Chen, Fang Wen, Houqiang Li, and Gang Hua. CVAE-GAN: finegrained image generation through asymmetric training. In Proceedings of the IEEE international conference on computer vision, pages 2745 2754, 2017. Adam Block, Zeyu Jia, Yury Polyanskiy, and Alexander Rakhlin. Intrinsic dimension estimation using Wasserstein distance. The Journal of Machine Learning Research, 23(1): 14124 14160, 2022. William M Boothby. An introduction to differentiable manifolds and Riemannian geometry. Academic press, 1986. Michael M Bronstein, Joan Bruna, Yann Le Cun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18 42, 2017. Bradley CA Brown, Anthony L Caterini, Brendan Leigh Ross, Jesse C Cresswell, and Gabriel Loaiza-Ganem. The union of manifolds hypothesis. In Neur IPS 2022 Workshop on Symmetry and Geometry in Neural Representations, 2022. Luis A Caffarelli. Interior W 2,p estimates for solutions of the Monge-Amp ere equation. Annals of Mathematics, pages 135 150, 1990a. Luis A Caffarelli. A localization property of viscosity solutions to the Monge-Amp ere equation and their strict convexity. Annals of mathematics, 131(1):129 134, 1990b. Luis A Caffarelli. Some regularity properties of solutions of Monge Amp ere equation. Communications on pure and applied mathematics, 44(8-9):965 969, 1991. Luis A Caffarelli. Boundary regularity of maps with convex potentials. Communications on pure and applied mathematics, 45(9):1141 1151, 1992a. Luis A Caffarelli. The regularity of mappings with a convex potential. Journal of the American Mathematical Society, 5(1):99 104, 1992b. Luis A Caffarelli. Boundary regularity of maps with convex potentials ii. Annals of mathematics, 144(3):453 496, 1996. Deep Generative Models on Manifolds Minwoo Chae, Dongha Kim, Yongdai Kim, and Lizhen Lin. A likelihood approach to nonparametric estimation of a singular distribution using deep generative models. Journal of Machine Learning Research, 24(77):1 42, 2023. Biraj Dahal, Alexander Havrilla, Minshuo Chen, Tuo Zhao, and Wenjing Liao. On deep generative models for approximation and estimation of distributions on manifolds. Advances in Neural Information Processing Systems, 35:10615 10628, 2022. Manfredo Perdigao Do Carmo and J Flaherty Francis. Riemannian geometry, volume 6. Springer, 1992. Jean Duchon. Splines minimizing rotation-invariant semi-norms in Sobolev spaces. In Constructive Theory of Functions of Several Variables: Proceedings of a Conference Held at Oberwolfach April 25 May 1, 1976, pages 85 100. Springer, 1977. Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on learning theory, pages 907 940. PMLR, 2016. Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983 1049, 2016. R emi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z Alaya, Aur elie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, et al. Pot: Python optimal transport. The Journal of Machine Learning Research, 22(1): 3571 3578, 2021. Kunihiko Fukushima. Visual feature extraction by a multilayered network of analog threshold elements. IEEE Transactions on Systems Science and Cybernetics, 5(4):322 333, 1969. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014. Alfred Gray. Tubes, volume 221. Springer Science & Business Media, 2003. Changhee Han, Hideaki Hayashi, Leonardo Rundo, Ryosuke Araki, Wataru Shimoda, Shinichi Muramatsu, Yujiro Furukawa, Giancarlo Mauri, and Hideki Nakayama. GANbased synthetic brain MR image generation. In 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018), pages 734 738. IEEE, 2018. Kurt Hornik. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359 366, 1989. Christian Horvat and Jean-Pascal Pfister. Intrinsic dimensionality estimation using normalizing flows. Advances in Neural Information Processing Systems, 35:12225 12236, 2022. Jian Huang, Yuling Jiao, Zhen Li, Shiao Liu, Yang Wang, and Yunfei Yang. An error analysis of generative adversarial networks for learning distributions. The Journal of Machine Learning Research, 23(1):5047 5089, 2022. Wang, Niu, Wang and Li Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401 4410, 2019. Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. ar Xiv preprint ar Xiv:1312.6114, 2013. Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. Advances in neural information processing systems, 29, 2016. Elizaveta Levina and Peter Bickel. Maximum likelihood estimation of intrinsic dimension. Advances in neural information processing systems, 17, 2004. Giulia Luise, Massimiliano Pontil, and Carlo Ciliberto. Generalization properties of optimal transport GANs with latent distribution learning. ar Xiv preprint ar Xiv:2007.14641, 2020. Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. ar Xiv preprint ar Xiv:1411.1784, 2014. Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. In International conference on machine learning, pages 2642 2651. PMLR, 2017. Ilsang Ohn and Yongdai Kim. Smooth function approximation by deep neural networks with general activation functions. Entropy, 21(7):627, 2019. Kripasindhu Sarkar, Lingjie Liu, Vladislav Golyanik, and Christian Theobalt. Human GAN: A generative model of human images. In 2021 International Conference on 3D Vision (3DV), pages 258 267. IEEE, 2021. Nicolas Schreuder, Victor-Emmanuel Brunel, and Arnak Dalalyan. Statistical guarantees for generative models without domination. In Algorithmic Learning Theory, pages 1051 1071. PMLR, 2021. Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. Advances in neural information processing systems, 29, 2016. Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. ar Xiv preprint ar Xiv:1711.01558, 2017. Ngoc-Trung Tran, Viet-Hung Tran, Ngoc-Bao Nguyen, Trung-Kien Nguyen, and Ngai-Man Cheung. On data augmentation for GAN training. IEEE Transactions on Image Processing, 30:1882 1897, 2021. Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. Advances in neural information processing systems, 33:19667 19679, 2020. C edric Villani et al. Optimal transport: old and new, volume 338. Springer, 2009. Deep Generative Models on Manifolds Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8798 8807, 2018. Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in wasserstein distance. 2019. Holger Wendland. Scattered data approximation: Conditionally positive definite functions. 2004. Hermann Weyl. On the volume of tubes. American Journal of Mathematics, 61(2):461 472, 1939. Xuan Xia, Xizhou Pan, Nan Li, Xing He, Lin Ma, Xiaoguang Zhang, and Ning Ding. GAN-based anomaly detection: A review. Neurocomputing, 493:497 535, 2022. Yunfei Yang, Zhen Li, and Yang Wang. On the capacity of deep generative networks for approximating distributions. Neural networks, 145:144 154, 2022. Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In International conference on machine learning, pages 7354 7363. PMLR, 2019. Yijia Zheng, Tong He, Yixuan Qiu, and David Wipf. Learning manifold dimensions with conditional variational autoencoders. 2022.