# deep_generative_models_complexity_dimensionality_and_approximation__1d293a27.pdf

Journal of Machine Learning Research 26 (2025) 1-37 Submitted 8/24; Revised 2/25; Published 6/25

Deep Generative Models: Complexity, Dimensionality, and Approximation

Kevin Wang kcwang@unc.edu Department of Biostatistics University of North Carolina at Chapel Hill

Hongqian Niu hong.niu@unc.edu Department of Biostatistics University of North Carolina at Chapel Hill

Yixin Wang yixinw@umich.edu Department of Statistics University of Michigan

Didong Li didongli@unc.edu Department of Biostatistics University of North Carolina at Chapel Hill

Editor: Arindam Banerjee

Generative networks have shown remarkable success in learning complex data distributions, particularly in generating high-dimensional data from lower-dimensional inputs. While this capability is well-documented empirically, its theoretical underpinning remains unclear. One common theoretical explanation appeals to the widely accepted manifold hypothesis, which suggests that many real-world datasets, such as images and signals, often possess intrinsic low-dimensional geometric structures. Under this manifold hypothesis, it is widely believed that to approximate a distribution on a d-dimensional Riemannian manifold, the latent dimension needs to be at least d or d + 1. In this work, we show that this requirement on the latent dimension is not necessary by demonstrating that generative networks can approximate distributions on d-dimensional Riemannian manifolds from inputs of any arbitrary dimension, even lower than d, taking inspiration from the concept of space-ﬁlling curves. This approach, in turn, leads to a super-exponential complexity bound of the deep neural networks through expanded neurons. Our ﬁndings thus challenge the conventional belief on the relationship between input dimensionality and the ability of generative networks to model data distributions. This novel insight not only corroborates the practical eﬀectiveness of generative networks in handling complex data structures, but also underscores a critical trade-oﬀbetween approximation error, dimensionality, and model complexity.

Keywords: Approximation theory, generative adversarial networks, manifold hypothesis, space-ﬁlling curve.

. To whom correspondence should be addressed

c 2025 Kevin Wang, Hongqian Niu, Yixin Wang and Didong Li.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/24-1335.html.

Wang, Niu, Wang and Li

1. Introduction

Generative models, such as generative adversarial networks (GANs, Goodfellow et al., 2014) and variational auto-encoders (VAEs, Kingma and Welling, 2013), have become a central topic in machine learning. These models are versatile, addressing various problems ranging from image generation (Bao et al., 2017; Han et al., 2018), style transfer (Karras et al., 2019), anomaly detection (Xia et al., 2022), to data augmentation (Tran et al., 2021), achieving increasingly accurate results across disciplines. Despite these advances, GANs, in particular, face challenges such as mode collapse, vanishing gradients, and training instability, particularly when distributions are not continuous or have disjoint supports (Arjovsky and Bottou, 2017). While traditional GANs and VAEs primarily relied on the Kullback Leibler divergence (KL) to measure distances between input and output distributions, more recent approaches have greatly improved training performance. The shift from KL to Wasserstein distance (Villani et al., 2009) is a notable example, as exempliﬁed in the Wasserstein GAN (WGAN, Arjovsky et al., 2017), and further extended to Wasserstein VAE (WVAE, Ambrogioni et al., 2018) and Wasserstein Auto Encoders (Tolstikhin et al., 2017). Beyond this, there have been substantial enhancements in model architecture, loss functions, and regularization techniques. Advances such as conditional GANs (c GAN, Mirza and Osindero, 2014) for targeted image generation, auxiliary classiﬁer GANs (AC-GANs, Odena et al., 2017), and self-attention GANs (SA-GANs, Zhang et al., 2019) have enriched the versatility and eﬀectiveness of GANs. Similarly, VAEs have seen improvements with techniques such as hierarchical latent variables (Vahdat and Kautz, 2020; Sønderby et al., 2016) and incorporation of normalizing ﬂows (Kingma et al., 2016), reﬁning their ability to model complex distributions. These developments collectively contribute to the generation of highquality, high-dimensional data from simpler, lower-dimensional distributions, with notable impact in ﬁelds such as photorealistic image generation (Wang et al., 2018; Sarkar et al., 2021). The theoretical advancements in deep generative models, paralleling their practical applications, have been substantial. In particular, the depth and width of network architectures in these models have been theoretically shown to signiﬁcantly impact their approximation capabilities, as detailed in the works focused on neural network expressiveness (Eldan and Shamir, 2016). Furthermore, the challenges of mode collapse in GANs and the propensity of VAEs to produce overly smooth outputs have catalyzed research into improving their architectural designs and training methodologies (Goodfellow et al., 2014; Kingma and Welling, 2013). Collectively, these theoretical advancements have not only deepened our understanding of the mechanisms behind deep generative models but also guided the development of more reﬁned and capable generative architectures. However, the theoretical investigation of the manifold hypothesis in deep generative models, particularly its implications in approximation theory, remains less developed. The manifold hypothesis, widely accepted as a crucial reason behind the success of these models, posits that real-world high-dimensional data often reside on lower-dimensional manifolds (Bronstein et al., 2017). This concept is critical because it suggests a fundamental reason why models such as GANs and VAEs, which operate in low-dimensional spaces, are able to capture complex data distributions eﬀectively. Although empirical evidence sup-

Deep Generative Models on Manifolds

ports this hypothesis, a deeper theoretical understanding of how these models approximate and learn distributions on lower-dimensional manifolds within high-dimensional spaces is crucial. Further theoretical work in this area is essential, not only to validate the manifold hypothesis but also to enhance the design and eﬃciency of generative models in handling complex, high-dimensional data. Building on the previous discussion of the manifold hypothesis, Dahal et al. (2022) oﬀers a noteworthy theoretical contribution in the ﬁeld of deep generative models. Their work eﬀectively addresses fundamental questions in approximation theory, demonstrating that distributions on a d-dimensional Riemannian manifold can be approximated by a deep generative network through the pushforward measure of an easily sampled distribution in d+1 dimensions, such as a uniform distribution in a cube [0, 1]d+1. Importantly, they provide an upper bound on the complexity of these models, revealing that the required number of layers and neurons is determined by both the approximation error and the manifold dimension d. This insight oﬀers a direct theoretical justiﬁcation for the models abilities to learn diverse probability distributions and suggesting avenues for eﬃciency improvements in practical applications. However, the research by Dahal et al. (2022) also brings to light some important, yet unanswered questions. Primarily, the necessity of using a d + 1 dimensional input space to approximate a d dimensional manifold warrants further investigation. This approach, integral for assembling local patches, raises the possibility of achieving similar outcomes with an input dimension that matches d. Furthermore, in real-world applications where d is not readily known, the choice of input dimension becomes a critical decision. If the chosen dimensionality is either greater or lesser than the actual d, how would it impact the model s performance and eﬃciency? Answering these questions is crucial for advancing the development of deep generative models, potentially leading to more adaptable and eﬃcient solutions for complex data representation. In this work, we aim to understand how the input dimension relates to the manifold s intrinsic dimension in deep generative models. We adapt the space-ﬁlling curve theory to demonstrate a novel aspect of deep generative models: their ability to learn a target data distribution from easy-to-sample distributions of arbitrary dimensions. This includes the capacity to approximate these distributions even from an input as low as a one-dimensional uniform distribution on the unit interval [0, 1] (see Figure 1 for examples of curves ﬁlling 2-dimensional manifolds). Intuitively, if the input dimension is smaller than the true dimension, the network can learn to ﬁll out the true manifold with an increasingly complex series of structures. However, in order to accurately approximate the higher dimensional structure, the approximating manifold must fold onto itself in increasingly irregular ways to ﬁll the space. In addition, our work establishes the critical interplay among the input dimension, the true dimension of the manifold, and the approximation error. In particular, our ﬁndings highlight a trade-oﬀtriangle in deep generative models, where it is unattainable to achieve low (underestimated) dimensionality, low approximation error, and low network complexity simultaneously, where network complexity is deﬁned to be network width (number of neurons) in this paper. Speciﬁcally, when the input dimension is underestimated and the approximation error is low, we encounter one corner of this triangle: a signiﬁcant increase in model complexity. This complexity, evident in the escalated number of neurons, grows

Wang, Niu, Wang and Li

Figure 1: Two cases demonstrating the idea of how suﬃciently large neural networks can learn distributions of higher dimension than their input sampling distributions by ﬁlling out the space. Depicted here are 1-dimensional partial space-ﬁlling curves beginning to ﬁll out a 2-dimensional unit square (left) and a 2-dimensional cylindrical surface (right). The blue points represent the training sample from the target distribution used to ﬁt the neural network, while the orange points represent new points generated by the trained network.

super-exponentially with approximation error, underscoring the complex trade-oﬀs and interdependencies inherent in accurately gauging the intrinsic dimension of the data. This triangle metaphorically highlights the necessity of balancing these aspects when designing generative models, a central theme of our contributions. Our contribution also includes empirical studies to support these ﬁndings, which are overlooked in existing literature (Dahal et al., 2022). Notably, our proof techniques for the underestimated dimension case are novel, requiring in-depth exploration of mathematical literature in Monge-Amp ere equations from the 1990s (Caﬀarelli, 1990a,b, 1991, 1992a,b, 1996), beyond the scope of existing literature (Dahal et al., 2022; Villani et al., 2009). The structure of our paper is organized as follows: Section 2 lays the groundwork with preliminaries. Section 3 discusses our main theoretical contributions. In Section 4, we present simulations on toy cases for illustrative and visualization purposes. Section 5 oﬀers a sketch of our proof, focusing primarily on the workﬂow, and is followed by a Discussion Section. Complete proofs and additional experimental details are provided in the Appendix.

2. Preliminaries

This section covers the essential concepts underpinning our research, including generative models, optimal transport, Riemannian manifolds, and existing approximation theory for generative models on Riemannian manifolds. This groundwork is crucial for appreciating the theoretical advances we present in later sections and situates our research within the broader context of generative modeling.

2.1 Generative Models

Generative models aim to learn the distribution of a dataset through sampling procedures. We consider our data following a distribution Q residing in an ambient space RD and an input distribution ρ over a simpler space S, such as the uniform distribution on the

Deep Generative Models on Manifolds

hypercube [0, 1]d. The goal is to construct a generator g : S RD that minimizes a discrepancy function between the generated (pushforward) distribution g (ρ) and the target distribution Q, expressed as

min g G Discrepancy(g (ρ), Q).

In this context, the function class G often comprises deep neural networks. A lower discrepancy signiﬁes a closer match, meaning that the data generated from the simpler input space S closely resemble the true data represented by distribution Q. The essence of generative models is to produce outputs that are similar to real data, aligning the generated samples with the actual distribution. A larger class G usually yields a smaller discrepancy, and therefore, better performance. However, this enhancement comes at the cost of increased computational resources. As a result, understanding the balance between the size of G and the discrepancy is of the utmost importance. A practical challenge is that Q is often unknown, but instead, we only observe samples x1, , xn Q that are commonly believed to be independent and identically distributed (iid) following Q. As a result, we can replace Q by its empirical distribution Qn := 1

n Pn i=1 δxi where δx is the Dirac measure at x. The practical objective of generative modeling is to minimize the discrepancy between the learned distribution and this empirical distribution: min g G Discrepancy(g (ρ), Qn),

where the minimizer is usually denoted by bgn.

2.2 Optimal Transport

In the study of generative models, the choice of discrepancy function is of fundamental importance. The Wasserstein distance, a central concept in optimal transport theory (Villani et al., 2009), is commonly employed due to its eﬀectiveness in measuring the cost of transforming one distribution into another. This aligns well with the objectives of generative modeling.

Deﬁnition 1 The Wasserstein-p distance between two distributions Q and ν in domain M, is deﬁned as

Wp(Q, ν) = inf γ Γ(Q,ν) E(x,y) γ(c(x, y)p) 1

where Γ(Q, ν) is the set of all couplings of Q and ν, containing all joint distributions over M M with marginals Q and ν, and c : M M R 0 is called the cost function.

We focus on the Wasserstein-1 distance with c(x, y) = x y , also known as the earth mover s distance. This speciﬁc case is especially relevant for our work in generative models, oﬀering a robust framework for evaluating the similarity between generated and target distributions. Wasserstein-1 distance additionally admits a dual form that makes is particularly suitable for computation, which is a property that Wasserstein-p in general lacks. However, our theory on convergence is extendable to Wasserstein-p distance in general, which we investigate in Corollary 9.

Wang, Niu, Wang and Li

2.3 Space-ﬁlling Curves

In this paper, for a set S RD, we deﬁne a space-ﬁlling curve to be a continuous curve Mϵ RD such that every point in S is within ϵ of some point in Mϵ. Consequently, we can deﬁne a space-ﬁlling manifold to be a manifold in RD satisfying the same properties:

Deﬁnition 2 A manifold Mϵ RD is said to be ϵ-space-ﬁlling manifold of S RD if

d(S, Mϵ) = sup x S inf y Mϵ

Mϵ is called a space-ﬁlling curve if it s a one-dimensional manifold.

When S itself is a manifold, the existence of such ϵ-space-ﬁlling manifolds is shown in Lemma 10. We note that this deﬁnition of the space-ﬁlling curve/manifold diﬀers from the standard deﬁnition in topology that demands the curve be surjective on the larger space S. These classical space-ﬁlling curves are typically taken as the limit of certain classes of curves: Mspace-ﬁlling = lim ϵi 0 Mϵi,

where the sequence Mϵi is carefully chosen to preserve desired properties (such as non self-intersection) in the limit. However, for our purposes, we do not need the limit, only particular choices of Mϵ satisfying d(S, Mϵ) < ϵ, and deﬁne space-ﬁlling manifolds thusly. We use the term space-ﬁlling for intuition purposes, and do not demand a completely space-ﬁlling property from our manifolds.

2.4 Riemannian Manifold

Although data often reside in high-dimensional space RD, there is substantial evidence suggesting that they lie on some low-dimensional manifolds (Bronstein et al., 2017). This concept underlies many manifold-based generative models, such as VAEs and GANs. In the same manner as most existing work in the literature, we assume that the data are distributed on a d-dimensional orientable compact Riemannian manifold M, isometrically embedded in the ambient space RD, with Riemannian metric g. A manifold is a locally Euclidean space, with each local neighbor, known as a local chart, diﬀeomorphic to Euclidean space Rd (Boothby, 1986). The Riemannian metric g deﬁnes a smoothly varying inner product (metric) in each tangent space Tx M. The geodesic distance d M(x, y) is deﬁned as the length of the shortest path connecting x, y M. In addition, there exists a well-deﬁned d-form, known as the Riemannian volume form d Vol M, which often serves as the analog of the Lebesgue measure on Riemannian manifold. This allows us to deﬁne density functions of probability distributions with respect to the volume measure. For more details, see Do Carmo and Flaherty Francis (1992).

2.5 Existing Work

In previous research on generative models, there has been a focus on low-dimensional data structures, assuming that high-dimensional data are parametrized by low-dimensional latent parameters. This approach treats manifolds as globally homeomorphic to Euclidean space,

Deep Generative Models on Manifolds

implying a single-chart manifold model (Luise et al., 2020; Schreuder et al., 2021; Block et al., 2022; Chae et al., 2023). However, this assumption presents limitations in dealing with the complexity inherent in general manifolds with multiple charts. In contrast, Yang et al. (2022) and Huang et al. (2022) demonstrate that GANs can approximate any data distribution from a one-dimensional continuous distribution. This method does not assume a global chart; however, it heavily relies on GANs memorizing empirical data distributions, posing limitations in generating novel samples. To overcome the limitations of the above approaches, Dahal et al. (2022) represents a signiﬁcant development in this context. By avoiding the single-chart assumption and the need for memorizing data, they construct an oracle transport map suitable for general manifolds with multiple charts. This advancement in approximating distributions on Riemannian manifolds with neural network pushforwards paves the way for more sophisticated and eﬀective generative models. Their study relies on two mild assumptions:

Assumption 3 M is a d-dimensional compact Riemannian manifold isometrically embedded in RD. As a consequence, there exists B > 0 such that x B, x M.

Assumption 4 Q is supported on M and has a density function q with respect to the volume measure Vol M with a positive lower bound c: 0 < c q(x), x M.

We consider the Re LU-type neural networks

g(x) = WLσ(WL 1 σ(W1x + b1) + + b L 1) + b L,

where σ is the Re LU activation function (Fukushima, 1969), W is the weight matrix, b is the bias vector. The function class, denoted by GNN(m, L, p, κ), contains neural networks with Re LU activation function, input dimension m, maximum depth L, maximum width p, and bounded weights and biases by κ:

GNN(m, L, p, κ) := g = (g1, , g D) : Rm RD :

gj is in form (2.5) with at most L layers and max width p, Wi κ, bi κ} .

Under the above assumptions, the two main theorems in Dahal et al. (2022) are as follows: Firstly, Lemma 5 establishes that certain distributions on compact Riemannian manifolds can be approximated using a deep neural network and a simple input distribution. Crucially, the dimension of the input space is precisely one more than the manifold dimension. Furthermore, Lemma 5 provides detailed complexity bounds on the size of the network to achieve a speciﬁed level of accuracy measured by the Wasserstein loss.

Lemma 5 (Theorem 1 of Dahal et al. 2022) Let ρ = Unif(0, 1)d+1, then there exists a constant 0 < α < 1 that is independent of D such that for any 0 < ϵ < 1, there exists a deep neural network g GNN(d + 1, L, p, κ) with L = O log 1

ϵ , p = O Dϵ d

α , κ = B, that satisﬁes W1(g (ρ), Q) < ϵ.

Next, Lemma 6 serves as the empirical counterpart of Lemma 5, addressing scenarios in which only ﬁnite samples following the distribution Q are observed. It demonstrates the

Wang, Niu, Wang and Li

theoretical possibility of identifying such generators using ﬁnite samples instead of direct access to the true distribution Q. In addition, this lemma links the network size to the sample size rather than the approximation error.

Lemma 6 (Theorem 2 of Dahal et al. 2022) Under the same assumption as in Lemma

5, let x1, , xn iid Q be n iid samples from Q, then for any δ > 0, set ϵ = n 1 d+δ in Lemma 5 so that the network class GNN(d + 1, L, p, κ) has parameters L = O log n 1 d+δ , p =

d α(d+δ) , κ = B. Then the empirical risk minimizer bgn has rate E [W1(bgn (ρ), Q)]

Cn 1 d+δ , where C is a constant independent of n and D.

These results shed light on how distributions on the manifold can be approximated by a deep neural network s pushforward of a low-dimensional easy-to-sample distribution. However, this approach has several limitations. First, the intrinsic dimension d of the manifold M is almost never known. Although there is an immense literature on estimating d (Levina and Bickel, 2004; Zheng et al., 2022; Horvat and Pﬁster, 2022; Brown et al., 2022), it has been proven to be an almost insurmountable problem due to its complexity (Feﬀerman et al., 2016). Second, even if d is known, the choice of input dimension d + 1 is unnatural, as it exceeds the true dimension. Technically, the extra dimension results from the need to connect local neighborhoods using a pasting algorithm. But this seems superﬂuous from an intrinsic perspective, considering partition of unity could potentially serve the same purpose. Third, in practice, the input dimension is often treated as a tuning parameter, or simply set based on historical experience or recommendations without rigorous optimization. It remains unclear how these decisions on input dimension aﬀect approximation performance, especially when d is underestimated. To address these issues, our study examines the scenario where the input dimension is arbitrary, potentially as minimal as one, within the same framework. We will summarize our primary ﬁndings in the subsequent section.

3. Main Theory

We begin with Dahal et al. (2022), which presents a method for approximating distributions on compact Riemannian manifolds using neural networks with an input dimension of d + 1. This raises a natural question: can the input dimension be reduced to d, the manifold s intrinsic dimension, or even lower, without compromising the approximation s eﬀectiveness? In this work, we answer this question aﬃrmatively, leveraging concepts of space-ﬁlling curves and insights from a series of work on Monge-Amp ere equations by Caffarelli (Caﬀarelli, 1990a,b, 1991, 1992a,b, 1996). Our theorem demonstrates the feasibility of using any input dimension m 1, which can be either smaller or larger than d, even down to 1. However, reducing the input dimension below the manifold s dimension introduces a super-exponential increase in complexity. Finally, we highlight that our approach adheres to the same foundational assumptions as Dahal et al. (2022). Our theorem, central to this discussion, asserts the approximation power of these neural networks under varying input dimensions. It establishes the conditions under which the networks can eﬀectively approximate distributions on Riemannian manifolds, taking into account the dimensionality of the input and its impact on the network s complexity.

Deep Generative Models on Manifolds

Theorem 7 (Approximation Power of Deep Generative Models) Under Assumptions 3 and 4, and with ρ as the uniform measure on [0, 1]m, for any ϵ > 0, there exists a deep neu-

ral network g GNN(m, L, p, κ) such that L = O log 1

m α(m,ϵ) m d

κ = max{B, 1}, and W1(g (ρ), Q) < ϵ. Furthermore, when m d, limϵ 0 α(m, ϵ) = 0, leading to a super-exponential increase in the width p.

The purpose of this theorem is two-fold. Firstly, it demonstrates that neural networks can approximate distributions on a manifold, with the ﬂexibility to use an input distribution uniformly distributed over a hypercube [0, 1]m of any dimension. This means that the choice of m can be adapted as needed, whether it s smaller or larger than the manifold dimension d. We additionally note that the choice of our generator to be the uniform distribution on the unit hypercube is non-restrictive. These results can easily be extended to any input distribution with a density that is upper bounded away from inﬁnity and lower bounded from zero on the unit cube, such as a truncated normal distribution or uniform distributions. Secondly, the theorem provides a quantitative link between the complexity of the network and the relationship between the input dimension m and the manifold dimension d. Notably, when m > d, the rates coincide with those found in Dahal et al. (2022). However, the situation becomes particularly intriguing when the input dimension m is smaller than the target dimension d. In this scenario, the width of the neural network no longer increases at a polynomial rate of d

α, but at a super-exponential rate of m α(m,ϵ) where m α(m,ϵ) ϵ 0 . This phenomenon implies that, while it is theoretically feasible to approximate a broad class of distributions with low-dimensional inputs, choosing an appropriate input dimension is critical to avoid excessively complex networks; it is often more advantageous to overestimate d than to underestimate it. We now turn our attention to the empirical aspect of our theory, examining the statistical guarantees provided by our model in the presence of iid samples x1, , xn Q.

Theorem 8 (Statistical Guarantees of Deep Generative Models) Given iid samples x1, , xn Q with empirical distribution Qn, let δ > 0, L = O log n 1 d+δ ,

m (m+δ)α(m,n,d,δ) m d

d (d+δ)α m > d,

κ = max{B, 1}, then the empirical risk minimizer bgn GNN(m, L, p, κ) satisﬁes

E [W1(bgn (ρ), Q)] (1 + 2Cδ)n 1 d+δ ,

where Cδ is a constant, independent of n. Furthermore, when m d, limn α(m, n, d, δ) = 0, leading to a super-exponential increase in the width p.

Similar to the population version, this theorem demonstrates that deep neural networks can approximate manifold distributions from n iid observations, with the approximation error diminishing at a rate of 1 d+δ. However, the complexity of the network, particularly

Wang, Niu, Wang and Li

the width, increases alongside n. Although a larger n yields a smaller approximation error, when m d, the width p grows super-exponentially with n. These two theorems collectively underscore a trade-oﬀtriangle of model complexity, input dimension, and approximation accuracy. In the next section, we illustrate these concepts through simulations using m = 1, 2, 3 as case studies for proof of concept and visualization. Finally, we also show that Theorem 7 and Theorem 8 can be partially extended to Wasserstein-p distances in the following corollary.

Corollary 9 Under the same setup as Theorem 7, for any ϵ > 0, there exists a deep neural network g GNN(m, L, p, κ) such that Wp(g (ρ), Q) < ϵ. Moreover, under the same setup as Theorem 8, the empirical risk minimizer bgn GNN(m, L, p, κ) satisﬁes E [Wp(bgn (ρ), Q)]

(1 + 2Cδ)n 1 d+δ .

Note that the bounds for L, p, κ are not provided in this case; a more detailed discussion is presented in the proof in Appendix B.6.

4. Simulation

In this section, we provide empirical evidence on three toy examples for visualization purposes of these deep generative networks abilities to learn space-ﬁlling curve approximations. We demonstrate the ability of the standard Re LU network to map the uniform distribution on the unit hypercubes of various dimensions [0, 1]m to a variety of target distributions, where m is the dimension of the input distribution. Note that here we are using a standard feedforward neural network with a Wasserstein-1 loss function approximation provided in the Python Optimal Transport package (POT, Flamary et al., 2021). This loss function takes the role of the critic network in a typical GAN model, hence our ﬁtted feedforward network under this Wasserstein-1 loss function is analogous to the generator in a Wasserstein GAN (Arjovsky et al., 2017). We also report an empirical ﬁll distance which we measure by taking a set of samples x1, ..., xn from the true manifold, another set of samples y1, ..., y N from the generated manifold, and calculating 1

n Pn i=1 minyj {d(xi, yj)}. This approximates the mean distance over all points in the data manifold to the closest point on the approximating curve. Hence, this is a measurement of the distance between the manifold supports of distributions, while the Wasserstein loss (used to train the neural networks) oﬀers a measurement of the distance between distributions. In Simulation 1, we show that the problem of ﬁnding a generator from the two dimensional square to the two dimensional square is computationally feasible (as in the required complexity is relatively small). Additionally, we show that the same problem with only a one dimensional input dimension is relatively harder, and also show the trajectory of training iterations to visualize how the neural network learns the distribution with an under-dimensioned input. In Simulation 2, we showcase the mapping of the same uniform distribution on [0, 1] to the uniform distribution on a 2-dimensional cylinder S2 R3

(m = 1, d = 2, D = 3). We also showcase that the same target distribution can be learned with much lower complexity when we match the input dimension to the manifold dimension, that is m = 2 = d. In Simulation 3, we take our target distribution to be the 3-dimensional

Deep Generative Models on Manifolds

uniform distribution on a unit cube [0, 1]3 (d = D = 3). As in Simulations 1 and 2, we train Re LU Networks using uniform distributions on the 1, 2, and 3 dimensional hypercubes (m = 1, 2, 3) and compare performance. Full implementation details and additional simulations including a study of network complexity and approximation error, the use of higher input dimension than target manifold dimension (m = d + 1) for each of the three simulation cases, and the case of non-uniform input or target distributions, along with a link to a Git Hub repository containing all code, can be found in Appendix C.

Figure 2: Simulation 1: Training results (left) of a small 2 hidden layer, 10 node each, fully connected network mapping the 2-D uniform input to a 2-D uniform target data manifold. The orange surface is generated by the neural network and is ﬁlling in the data manifold (blue). Wasserstein loss (top right) and ﬁll distance (bottom right) between generated and observed data per training iteration for a total of 10,000 iterations.

4.1 Simulation 1

In Simulation 1, we let the data distribution Q be a uniform distribution over the unit square [0, 1]2 and consider the cases where the input distribution ρ to the neural network is a uniform distribution over the unit square [0, 1]2 R2 and a uniform distribution over the unit interval [0, 1] R. In this case, the manifold dimension d = 2 is equal to the dimension of the ambient space D = 2. We ﬁrst consider the case where the input and target dimensions are equal. Aligned with our theory, the required complexity in this case is expected to be signiﬁcantly lower than when the target data dimension is underestimated. In Figure 2, we show that training a fully connected feed forward network with just 2 hidden layers with 10 nodes each for 10,000 iterations under Wasserstein loss, again from the Python Optimal Transport (OT) package, achieves low Wasserstein loss as well as low average empirical ﬁll distance. Note that in this case, during training the input is taken to be a random sample on the unit square at each iteration, while the target data is a uniform grid on the unit square, so that the input sample is not identical to the target data at any iteration.

Wang, Niu, Wang and Li

Figure 3: Simulation 1: Training results of a 5 hidden layer, 200 node each, fully connected network for 10,000 iterations. Training trajectory (left), Wasserstein loss (top right), and ﬁll distance (bottom right) of mapping a uniform distribution on the interval [0, 1] to a uniform distribution on the unit square [0, 1]2. It can be seen that as the loss decreases, the ﬁtted curve ﬁlls more of the square as expected.

Next we re-examine the target distribution from the case that the input dimension ρ is a uniform distribution over the unit interval [0, 1]. Figure 3 demonstrates how a standard feedforward, fully-connected neural network, using Re LU activation and trained with Wasserstein loss, eﬀectively maps the unit interval into an increasingly complex, space-ﬁlling curve in a two-dimensional space. Note that now a much larger network size of 5 hidden layers with 200 nodes each is required to reach a similar level of approximation to the equal input and target dimensions case. For further simulations regarding the eﬀect of diﬀerent network sizes see Appendix C.1, and for the m = d + 1 case see Appendix C.2.

4.2 Simulation 2

Analogous to Simulation 1, in Simulation 2 we consider a situation where the data distribution Q is a uniform distribution over a cylinder embedded in R3 and again consider the cases where the input distribution ρ to the neural network is a uniform distribution over the unit square [0, 1]2 R2 and the unit interval [0, 1] R. We ﬁrst consider when the input distribution (uniform) dimension matches the target data manifold dimension (also uniform), where m = d = 2. We show that, aligned with our theory, the required complexity in this case is observed to be much lower than when the target data dimension is underestimated. In Figure 4, we show the results of training a fully connected feed forward network with a total of 3 hidden layers with 25 nodes each for 8,000 iterations under Wasserstein loss and that the network is able to achieve low Wasserstein loss as well as low average ﬁll distance. In contrast, we consider the same uniform distribution on a cylinder embedded in R3

but with an input distribution of only one dimension. Note again that in this example we

Deep Generative Models on Manifolds

Figure 4: Simulation 2: Training results of a small 3 hidden layer, 25 node each, fully connected network for 8,000 iterations mapping a 2-D uniform input to a 2-D uniform target data manifold on a cylinder embedded in R3. Here the multicolored surface is generated by the neural network (colored by z-axis height) and is ﬁlling in the data manifold (blue).

have the manifold M with dimension d = 2, embedded in an ambient Euclidean space with dimension D = 3. We show again that a feedforward, fully-connected neural network under Wasserstein loss and Re LU nonlinearity is still able to learn space-ﬁlling curves properly on the manifold of lower dimension than the ambient space (albeit with a much larger network size of 7 hidden layers with 250 nodes each). Again, the neural network maps the input interval into an increasingly complex space-ﬁlling curve on the cylinder as the number of training iterations increases and the loss decreases respectively as seen in Figure 5. For full implementation details and code, as well as the m = d + 1 case, see Appendix C.

4.3 Simulation 3

In Simulation 3, we consider a uniform distribution on a unit cube [0, 1]3 in R3, while now taking as input one, two, and three-dimensional uniform distributions on [0, 1]m for m = 1, 2, 3 respectively. We show again that a feedforward, fully-connected neural network under Wasserstein loss and Re LU nonlinearity is still able to learn space-ﬁlling sheets on the manifold of higher dimension than the input.

We again start with the case where the input dimension matches the target dimension (m = 3). In Figure 6, we show the results of training a neural network with 3 hidden layers of 128 neurons each. It can be seen that the network achieves both low Wasserstein loss as well as empirical ﬁll distance in Figure 6. Note that in this case the input data is a randomly re-sampled each training iteration, while the target remains a ﬁxed grid on the unit cube, such that the input and target data are not identical at any iteration.

Now we reduce the input dimension by 1, mapping a uniform distribution of dimension m = 2 to a uniform distribution in a higher target dimension d = 3. We observe in Figure 7

Wang, Niu, Wang and Li

Figure 5: Simulation 2: Training results of a 7 hidden layer, 250 node each, fully connected network over 5,000 iterations. Training trajectory (left), Wasserstein loss (top right), and ﬁll distance (bottom right) of mapping a 1-D uniform distribution on the interval [0, 1] to a 2-D uniform distribution on a cylinder embedded in R3. As the loss decreases over the iterations, the ﬁtted curve ﬁlls more of the surface.

Figure 6: Simulation 3: Training results of a 3 hidden layer, 128 node each, fully connected network over 4,000 iterations mapping the 3-D uniform input to a 3-D uniform target data manifold on the unit cube. Here the multicolored curve is generated by the neural network (colored by z-axis height) and is ﬁlling in the data manifold (blue).

that a larger neural network with 4 hidden layers of 256 nodes each achieves similar loss to the smaller model used when m = d = 3. Finally, we reduce the input dimension one more time for the case when the input dimension is m = 1 and the data manifold is uniform in d = 3 dimensions. We observe in Figure 8

Deep Generative Models on Manifolds

Figure 7: Simulation 3: Training results of a 4 hidden layer, 256 node each, fully connected network over 2,000 iterations mapping a uniform distribution on the unit square [0, 1]2 to a uniform distribution on [0, 1]3, colored by height (z-axis). As the loss decreases over the iterations, the ﬁtted surface ﬁlls more of the space in the cube.

Figure 8: Simulation 3: Training results of a 5 hidden layer, 256 node each, fully connected network over 5,000 iterations mapping the 1-D uniform input on the interval [0,1] to a 3-D uniform target data manifold on the unit cube. Here the orange curve is generated by the neural network and is ﬁlling in the data manifold (blue).

that again a larger network with 5 hidden layers of 256 neurons each is required to achieve similar loss to the m = 2 and m = 3 cases. Together, the three simulations underscore the networks ability to adaptively map lower-dimensional inputs onto higher-dimensional man-

Wang, Niu, Wang and Li

ifolds in a generalizable manner, eﬀectively demonstrating our main theorems in practice. Again, see Appendix C for full implementation details and extension to m = d + 1.

5. Sketch of the Proof

In this section, we provide a proof sketch of our main theorems, highlighting key steps while deferring technical details to the Appendix. We start with the proof of Theorem 7.

5.1 Sketch Proof of Theorem 7

The proof is divided into three cases, depending on the input dimension m: when m > d, when 2 m d, and when m = 1. We start with the simplest case: m > d.

5.1.1 Case 1: m > d

In this case, since there are m d redundant input variables, we introduce a layer g1 : Rm Rd+1 that eﬀectively ignores these redundant input variables:

g1(x1, ..., xm) = (x1, ..., xd+1).

g1 is a one-layer Re LU neural network with zero bias, and a weight matrix W = I(d+1) (d+1) 0(d+1) (m d 1) . This step simpliﬁes the network s input to d + 1 dimensions, aligning with the settings of Lemma 5. We then deﬁne the pushforward measure of ρ via g1, deﬁned on Rd+1, by g1 as ρ = g1 (ρ). By Lemma 5, for any ϵ > 0, there exists a neural network g GNN(d + 1, L , p , κ) such that

W1(g (ρ ), Q) < ϵ,

where L = O log 1

ϵ , p = O Dϵ d/α , κ = B. Now we let g := g g1, so we have

W1(g (ρ), Q) = W1((g g1) (ρ), Q) = W1(g (g1 (ρ)), Q) = W1(g (ρ ), Q) < ϵ.

Finally, observe that g GNN(m, L, p, κ), where L = L +1 = O log 1

ϵ , p = max{p , m} = O Dϵ d/α , κ = max{B, 1}, which ﬁnishes the proof of this case.

5.1.2 Case 2: 2 m d

In this case, where the input dimension lies between 2 and d, our strategy comprises several steps. First, we ﬁnd a m 1 dimensional compact Riemannian manifold M ϵ approximating M. Then, we project the original distribution Q onto M ϵ to obtain Q ϵ, an approximation of Q. Next, we smooth Q ϵ into Q ϵ to satisfy Assumption 4. Finally, we show that Q ϵ can be approximated by a deep generative model, as detailed step by step below. Firstly, the existence of such a m 1 dimensional manifold M ϵ is guaranteed by the following Lemma 10 leveraging the idea of a space-ﬁlling manifold, with proofs in Appendix B.1.

Lemma 10 Let M be a d dimensional Riemannian manifold isometrically embedded in RD. Then, for any 1 q d and any ϵ > 0, there exists a q-dimensional manifold isometrically embedded in RD, denoted by M ϵ satisfying

Deep Generative Models on Manifolds

d(M, M ϵ) = sup x M inf y M ϵ

Secondly, we deﬁne Q ϵ := πϵ (Qϵ), where πϵ : M M ϵ is the orthogonal projection onto M ϵ. The Wasserstein distance W1(Q, Q ) is then bounded by W1(Q, Q ϵ) d(M, M ϵ) < ϵ. Thirdly, since Q ϵ may not satisfy Assumption 4, we cannot directly apply Lemma 5. To address this, we ﬁnd another distribution on M ϵ, denoted by Q ϵ that satisﬁes Assumption 4, with W1(Q ϵ, Q ϵ ) < ϵ, . The existence of such Q ϵ is supported by the following Lemma 11, with proofs given in Appendix B.2.

Lemma 11 Let M be a m 1 dimensional Riemannian manifold isometrically embedded in RD, and let Q be a distribution on M . Then, for any ϵ > 0, there exists a distribution Q on M satisfying Assumption 4, with

W1(Q , Q ) < ϵ.

Finally, by Lemma 5, there exists a network g GNN(m, L, p, κ) such that W1(g (ρ), Q ϵ ) < ϵ, where

L = O log 1

, p = O Dϵ m α(m,ϵ) , κ = B.

By the triangular inequality, we have

W1(g (ρ), Q) W1(g ρ, Q ϵ ) + W1(Q ϵ , Q ϵ) + W1(Q ϵ, Q) 3ϵ.

The remaining step is to show that α(m, ϵ) 0 as ϵ 0 for any m, which leads to the super-exponential complexity. However, this proof is technical, so we summarize it in Lemma 12 and defer the proofs to Appendix B.3.

Lemma 12 α(m, ϵ) 0 as ϵ 0 for any 2 m d.

5.1.3 Case 3: m = 1

For the case where m = 1, the approach for 2 m d is not directly applicable, as Lemma 5 requires the input dimension to be at least two. Thus, we adopt a diﬀerent strategy to construct the approximation manifold. Notably, in this case, the input space is a unit interval [0, 1], with ρ = Unif(0, 1). First, we apply Lemma 10 again, with q = 1, to obtain an approximation manifold M ϵ, a one-dimensional space-ﬁlling curve. Subsequently, we deﬁne Q ϵ = πϵ (Q), where πϵ : M M ϵ is the orthogonal project. This yields Q ϵ as a distribution on M ϵ satisfying W1(Q, Q ϵ) ϵ. Lemma 13 guarantees the construction of a smooth parameterization for the curve M ϵ, given by ηϵ : [0, 1] M ϵ, with W1(ηϵ (ρ), Q ϵ) ϵ. The proof is in Appendix B.4.

Lemma 13 Let M ϵ be a parametric curve of dimension 1, and let Q ϵ be a distribution on M ϵ, then there exists a reparameterization ηϵ : [0, 1] M ϵ such that

W1(ηϵ (ρ), Q ϵ) ϵ.

Wang, Niu, Wang and Li

Next, Lemma 14 allows us to approximate η using a neural network g.

Lemma 14 (Dahal et al. (2022) Lemma 3) If η(t) is α H older continuous and bounded, then for any ϵ > 0 there is a Re LU network g such that Z

[0,1] g(t) η(t) dt ϵ,

and g GNN(1, L, p, κ), where L = O log 1

ϵ , p = O ϵ 1

α , and κ = B.

Then we obtain g with W1(ηϵ (ρ), g (ρ)) ϵ and

W1(Q, g (ρ)) W1(Q, Q ϵ) + W1(Q ϵ, ηϵ (ρ)) + W1(ηϵ (ρ), g (ρ)) < 3ϵ.

Finally, we node that α = α(ϵ) depends on ϵ through ηϵ, and we claim that α(ϵ) ϵ 0 0. This is summarized as Lemma 15 with proofs in Appendix B.5, which completes the proof of this m = 1 case.

Lemma 15 α(ϵ) ϵ 0 0.

5.2 Sketch Proof of Theorem 8

Let x1, ..., xn be iid samples from Q, and let Qn denote the associated empirical distribution. By Theorem 7, for any ϵ > 0 ,there exists g GNN(m, L, p, κ) such that W1(Q, g (ρ)) < ϵ,

where L = O log 1

ϵ , p = O Dϵ

m α(m,ϵ) , and κ = B.

For the empirical minimizer bgn, we can break down the Wasserstein distance W1(bgn (ρ), Q) as follows:

W1(bgn (ρ), Q) W1(bgn (ρ), Qn) + W1(Qn, Q) W1(g (ρ), Qn) + W1(Qn, Q)

W1(g (ρ), Q) + W1(Q, Qn) + W1(Qn, Q) ϵ + 2W1(Q, Qn).

Given Assumptions 3 and 4, Lemma 5 of Dahal et al. (2022) implies that for any δ > 0, a constant Cδ exists such that

E [W1(Q, Qn)] Cδn 1 d+δ ,

where Cδ is a constant independent of n. Setting ϵ = n 1 d+δ , we have

E [W1(bgϵ (ρ), Q)] (1 + 2Cδ)n 1 d+δ .

Thus, we have achieved the claimed expected Wasserstein distance with desired complexity of bgn: L = O log n 1 d+δ , O Dn

m (m+δ)α(m,n,d,δ) , and κ = B. In particular, when m > d, α(m, n, d, δ) remains constant (α), while when m d, limn α(m, n, d, δ) = limϵ 0 α(m, ϵ) = 0 as per Theorem 7.

Deep Generative Models on Manifolds

6. Discussion

Using the theory of space-ﬁlling curves, in this work we establish key relationships between the input dimension, the true dimension of the data manifold, and the approximation error of ﬁtting deep generative networks. Speciﬁcally, we have shown that deep generative networks can learn the manifold structure in a generalizable manner regardless of the input dimension, even when the input dimension is as low as one. Furthermore, we quantify the complexity trade-oﬀamong network size, input dimension, and approximation error, showing that underestimating the input dimension below the manifold dimension introduces a super-exponential increase in the width of the network. Furthermore, these results did not require any additional assumptions compared to previous work (Dahal et al., 2022). Toy simulation studies on twoand three-dimensional manifolds provide empirical evidence of GANs learning such space-ﬁlling curves from various input dimensions, oﬀering additional explanation to the success of GANs and similar models to learn distributions without explicit need for dimensionality estimation in practice.

Based on these results, there are multiple potential directions for future work. First, although we have derived statistical guarantees for existence, there is no guarantee on being able to practically ﬁt such models, especially when the input dimension is heavily underestimated. Second, while we have established upper bounds on the complexity necessary to achieve certain approximation errors, identifying lower bounds poses a formidable challenge and remains an open question. The absence of explicit lower bounds in existing literature highlights the complexity and novelty of this problem area. Establishing these bounds would not only deepen our understanding of the behavior of generative networks but also potentially guide the development of more eﬃcient network architectures. Third, exploring the phase transition in model complexity versus approximation error as a potential method for estimating the intrinsic dimension of data manifolds could provide practical insights into dimensionality reduction and the eﬀective training of generative models. Finally, although we have shown that neural networks are capable of learning distributions exhibiting low dimensional manifold structures, there remains situations in which the data do not lie on a manifold of a single intrinsic dimension throughout, as is the case in the union of manifolds hypothesis (Brown et al., 2022). Such extension would further elucidate the ability of these deep generative models to empirically approximate arbitrary datasets in practice.

Acknowledgment

DL acknowledges Wenjing Liao for a helpful discussion. KW was supported by National Institute of Health (NIH) grant R01 LM014407; HN was supported by NIH R01 CA253450; YW was supported in part by the Oﬃce of Naval Research under grant N00014-23-1-2590, the National Science Foundation under grant No. 2231174, No. 2310831, No. 2428059, No. 2435696, No. 2440954, and a Michigan Institute for Data Science Propelling Original Data Science (PODS) grant; DL was supported by NIH R01 LM014407, NIH P50 CA058223, NIH P42 ES031007.

Wang, Niu, Wang and Li

Appendix A. Code Availability

All code for experiments and generating paper ﬁgures can be found in the Git Hub repository at https://github.com/hong-niu/dgm24. All experiments were run on a single machine with an Nvidia RTX 4080 GPU with 16 GB of memory and adapted from the Python Optimal Transport (POT) package Flamary et al. (2021) for computing the Wasserstein distance.

Appendix B. Additional Proof Details

In this section, we prove ﬁve lemmas used in the sketch of the proofs of Theorem 7 and Theorem 8, namely, Lemma 10, Lemma 11, Lemma 12, Lemma 13, and Lemma 15. Note that Lemma 14 is directly from Lemma 3 of Dahal et al. (2022), so we skip its proof.

B.1 Proof of Lemma 10

We ﬁrst prove Lemma 10, which shows that for a compact d dimensional manifold M, there exists a smaller dimensional manifold M ϵ that approximates the manifold M. Proof Since RD is a separable metric space, M RD is also separable. Let E M be a dense countable subset of M, enumerated E = {e1, e2, ....}. Because E is dense on M, we have lim n sup x M inf y {e1,...,en}

x y 2 2 = 0.

Suppose that we can ﬁnd a sequence of manifolds Mn satisfying for all n N,

En := {e1, ..., en} Mn.

We have 0 sup x M inf y Mn

x y 2 2 sup x M inf y {e1,...,en}

Consequently: lim n sup x M inf y Mn

x y 2 2 = 0.

Therefore, for any ϵ > 0, there exists a large enough n such that

sup x M inf y Mn

x y 2 2 < ϵ.

It suﬃces to show that for any ﬁnite set En, there exists a Riemannian Manifold Mn interpolating En, that is, En Mn. We know there exists a q dimensional subspace, represented by the associated projection matrix P, such that, for all i = j, Pei = Pej. We can choose a basis of this subspace to be the ﬁrst q coordinates through a change of basis. For convenience, we thus claim without loss of generality that the projection onto the ﬁrst q coordinates πq(x1, ..., xn) = (x1, ..., xq) is injective on En so that πq(ei) = πq(ej) when i = j.

Then there exists inﬁnitely many interpolants, f, satisfying yi = f(πq(ei)).

Deep Generative Models on Manifolds

Included in our options are smooth interpolants, such as polynomial interpolants, splines, variational minimizing surfaces, RBF interpolations, Backus-Gilbert (Wendland, 2004), etc. We choose the cubic polyharmonic/thin plate spline due to its simplicity (Duchon, 1977):

f = min f W m 2 (πq(M)),yi=f(πq(ei)) Jd m(f),

where Jd m(f) = P α1+...+αd=m m! α1!...αd! R ... R

mf xα1 1 ... x αd d

Qd j=1 dxj.

This interpolant is a smooth function on a compact domain, with a bounded derivative and its image is a smooth manifold. Further, we consider the function:

F : πq(M) Rn, (x1, ..., xq) 7 (x1, ..., xq, fq+1(x1, ..., xq), ..., fn(x1, ..., xq)).

Note that the image F(πq(M)) is a Riemannian manifold that interpolates F(πq(ei)) = ei. Thus, E F(πq(ei)) and

d(F(πq(ei)), M) d1(E, M) ϵ.

Hence, for any d dimensional Riemannian manifold M and any ϵ > 0, there exists a q dimensional Riemannian manifold M , such that

d(M, M ϵ) = sup x M inf y M ϵ

B.2 Proof of Lemma 11

We now prove Lemma 11, which shows that for any distribution on a compact manifold, we may ﬁnd a suﬃciently close distribution that satisﬁes Assumption 4. Proof The objective of this lemma is to perturb Q slightly, creating a perturbed distribution Q that satisﬁes Assumption 4. The proof involves the following ﬁve steps:

1. Decomposition of the manifold: Partition the manifold M ϵ into a ﬁnite union of balls, with each as a coordinate chart derived from some Euclidean ball.

2. Local pull back distribution: Pull back the local distribution Q through the exponential map.

3. Smooth local distributions: Smooth each local pull-back distribution to ensure closeness to the original pull-back.

4. Pushforward and reweighting: Push these local distributions forward to M using their exponential maps. Paste them together to form a new distribution, and re-weight it to account for overlaps.

Wang, Niu, Wang and Li

5. Density adjustment: Add a small uniform density to ensure a positive lower bound on density and show that this new distribution is suﬃciently close to the original distribution and satisﬁes Assumptions 4.

Since M is compact, the injectivity radius, denoted by ι, is strictly positive (Do Carmo and Flaherty Francis, 1992). Fix 0 < r < ι and denote Ux = expx(Bx(0, r)), where expx is the exponential map at x and Bx(0, r) is the Euclidean ball in the tangent space Tx M . Again, due to the compactness, there exists a ﬁnite covering {expxj(Bxj(0, r)) : j = 1, 2, ..., N}. Let Q xj be the restriction of Q onto Uxj, that is Q xj(S) = R

S 1(x Uxj)d Q (x). Since r < ι, the exponential map expxj is invertible on Uxj, so the the pullback of Q xj through expxj( ), denoted by Pxj, a measure on Bxj(0, r), is well-deﬁned. However, the pull-back measure Pxj may not be smooth enough. To address this, for δ > 0, we consider the function

Hδ(x, y) = Cδ(x)1{ x y < δ},

where Cδ(x) is the normalizing constant 1 Cδ(x) = R

Bxj (0,r) 1( x y < δ)dy.

Then denote the locally smoothed measure by P δ xj, with density:

pδ xj(y) = Z

Bxj (0,r) Hδ(x, y)d Pxj.

Since the transport cost between (Pxj, P δ xj) is bounded by sup Hδ(x,y)=1 x y δ, we have W1(Pxj, P δ xj) δ. Pushing P δ xj forward through the exponential map onto M yields a locally smoothed distribution on Uxj, expxj (P δ xj). We have:

W1(expxj (Pxj), expxj (P δ xj)) = inf γ Γ(expxj (Pxj ),expxj (P δxj ))

Uxj Uxj x y dγ

= inf γ Γ(Pxj ,P δxj )

Bxj (0,r) Bxj (0,r) expxj(x) expxj(y) dγ

Lxj inf γ Γ(Pxj ,P δxj ))

Bxj (0,r) Bxj (0,r) x y dγ,

where Γ is the coupling and Lxj is the Lipschitz constant of expxj on Bxj(0, r) (Do Carmo and Flaherty Francis, 1992). Note that as δ approaches 0, expxj(P δ xj) converges in distribution to Q xj. Now we recombine these locally smoothed measures back into a new measure:

j=1 expxj (P δ xj).

Deep Generative Models on Manifolds

Then let N(x) to be the number of Uxj containing x, that is N(x) = PN j=1 1(x Uxj) so that:

lim δ 0 Qδ,0 = lim δ 0

c=1 expxj (P δ xj) = Z

M N(x)d Q (x).

To compensate for overlap in the covering, we adjust Qδ,0 to deﬁne a new measure

1 N(x)d Qδ,0.

It is clear that as δ 0, Qδ d Q . Due to the compactness of M , the cost function is bounded and thus lim δ 0 W1(Q , Qδ) 0.

So far, however, the density of Qδ, denoted by qδ may not be positively lowered bounded, as required by Assumption 4. To ﬁll in the ﬁnal gap, we conduct the following surgery to Qδ. By the compactness of M , the uniform density, denoted by u(x) is well deﬁned on M . Let θ > 0 be an arbitrary small number, we deﬁne q = (1 θ)qδ + θu. Let Q be the corresponding distribution, we observe that 1):Q satisﬁes Assumption 4 for any δ and θ; 2) Q converges in distribution to Q when δ, θ 0. That is, for any ϵ, there exists δ > 0 and θ > 0 such that W1(Q , Qδ) < ϵ

2 and W1(Q , Qδ) < ϵ

W1(Q , Q ) W1(Q , Qδ) + W1(Qδ, Q ) < ϵ,

which closes the proof.

B.3 Proof of Lemma 12

We now prove Lemma 12, which shows that when the input dimension is smaller than the true dimension, the quantity α(m, ϵ) approaches 0 as ϵ approaches 0. Proof In existing literature, such as Dahal et al. (2022), α is often treated as a constant, since the manifold under consideration is ﬁxed. However, in our framework, both the approximating space-ﬁlling manifold M ϵ and the approximating distribution Q ϵ change with ϵ. In this situation, it becomes imperative to keep track of the constant α while the manifold M ϵ and Q ϵ are evolving. This task is extremly challenging for several reasons. First, much of the existing literature, including Villani et al. (2009); Dahal et al. (2022) , treats α as a constant, without a in-depth discussion. Second, even when we scrutinize their proofs, they typically establish only the existence of α, without any explicit construction. Third, the intricacies of α are hidden within the study of Monge-Amp ere equations of the potentials of the density functions of Q ϵ, a distinct, although related, ﬁeld of study. Our analysis begins with the recent contributions of Dahal et al. (2022), who claim that there exists a constant α so that the complexity p = O(Dϵ d/α). The foundation of their construction of optimal transport relies on Theorem 12.50 from Villani et al. (2009), also known as Caﬀarelli s regularity theory. Within this theorem, the constant α, denoted by β

Wang, Niu, Wang and Li

by Villani, is introduced ambiguously as for some β (0, 1) , without further elucidation provided. To address this, we examined a series of papers by Caﬀarelli (Caﬀarelli, 1990a,b, 1991, 1992a,b, 1996), where we discovered that the constant α depends on the inﬁmum of the density function of Q ϵ, denoted by q ϵ. In essence, as ϵ 0, inf q ϵ 0, leading to the conclusion that α(m, ϵ) 0. We prove this claim step by step.

1. Prove Vol(M ϵ) ϵ 0 .

First we prove that as ϵ 0, the volume of the space-ﬁlling manifold M ϵ goes to inﬁnity. Note that this volume is the Riemannian volume of the m 1 dimensional manifold M ϵ, not the volume in the ambient space. By Lemma 10, the tube around M ϵ with radius 2ϵ, denoted by T2ϵ(M ϵ) := {x RD : d(x, M ϵ) < 2ϵ}, contains M. Then, by Weyl s Tube Theorem (Weyl, 1939; Gray, 2003), the volume of the tube is given by the following series expansion: Vol(T2ϵ(M ϵ)) =

(π2ϵ2) D (m 1)

2(D (m 1))!

[(m 1)/2] X

k2l(M ϵ)(2ϵ)2l

(D (m 1) + 2)(D (m 1) + 4) (D (m 1) + 2l),

where k2l(M ϵ) is the integrated mean curvature of M ϵ and k0(M ϵ) = Vol(M ϵ) as a special case when l = 0. Focusing on the leading term, we have

Vol(T2ϵ(M ϵ)) = (π2ϵ2) D (m 1)

2(D (m 1))!

Vol(M ϵ) + O(ϵ2) . (1)

Similarly, we estimate the volume of the tube around M, Tϵ(M) as:

Vol(Tϵ(M)) = (πϵ2) D d

Vol(M) + O(ϵ2) . (2)

Now we show that T2ϵ(M ϵ) Tϵ(M). For any x Tϵ(M), there exists y M such that x y < ϵ. By the construction of M ϵ, there exists z M ϵ such that y z < ϵ, as a result,

d(x, M ϵ) x z x y + y z 2ϵ,

which implies x T2ϵ(M ϵ) and hence T2ϵ(M ϵ) Tϵ(M).

Linking the two volumes in Equation (1) and Equation (2), we conclude that

(π2ϵ2) D (m 1)

2(D (m 1))!

Vol(M ϵ) + O(ϵ2) (πϵ2) D d

Vol(M) + O(ϵ2)

= Vol(M ϵ) Cϵ m 1 d

where the last limit is due to m 1 d < 0.

Deep Generative Models on Manifolds

2. Prove inf q ϵ ϵ 0 0.

Since M ϵ is compact and q ϵ integrates to one, we have

M ϵ q ϵd Vol M ϵ Vol(M ϵ) inf q ϵ.

As a result, we conclude that inf q ϵ ϵ 0 0 since Vol(M ϵ) ϵ 0 .

3. Prove α(m, ϵ) ϵ 0 0.

As stated in Theorem 12.50 of Villani et al. (2009), the optimal transport mapping, denoted by Fϵ, is represented by its potential Fϵ C1,α(m,ϵ) such that Fϵ = ψϵ Cα(m,ϵ). Moreover, we know that q ϵ( ψϵ) det Dijψϵ = q from Equation (1) of Caﬀarelli (1992b), where Dij represents the Hessian matrix. The existence of such α(m, ϵ) was ﬁrst given by Caﬀarelli (1991), which replies on the key assumption that there exists constants λ1, λ2 such that

0 < λ1,ϵ det Dijψϵ λ2,ϵ < . (3)

Then, the key quantity we are interested, is deﬁned as α(m, ϵ) = log2 δ(ϵ), where δ(ϵ) < 1 is the constant such that hϵ,1/2(x x0) < δ(ϵ) < hϵ,1(x x0), where hϵ,β is the cone generated by x0 and the level surface ψϵ = β (see Lemma 2 of Caﬀarelli (1991) for more details). The existence of such δ(ϵ) is given by Lemma 2 of Caﬀarelli (1991). Since inf q ϵ ϵ 0 0, we know that λ2,ϵ = sup det Dijψϵ , which violates the key assumption in Caﬀarelli (1991), leading to the fact that ψϵ has inﬁnite derivative and hence inﬁnite local changes and inseparable level sets, so that δ(ϵ) ϵ 0 1. As a result α(m, ϵ) = log2 δ(ϵ) ϵ 0 0, as desired.

B.4 Proof of Lemma 13

We now prove Lemma 13, which states that an approximating curve for a manifold may be reparameterized by a diﬀerentiable function such that the pushforwards of the uniform distribution is close to the true distribution. Proof We ﬁrst introduce the arc-length parameterization as ξ(t) : [0, Lϵ] M ϵ, which makes ξ invertible with a constant speed ξ (t) 1. For simplicity, we assume the total length Lϵ = 1 in this proof. Let Pξ be the pull back of Q ϵ onto [0, 1] through ξ, that is, ξ (Pξ) = Q ϵ. Then for δ > 0, deﬁne

Hδ(x, y) = Cδ(x)1{ x y < δ},

where Cδ(x) is the normalizing constant:

1 Cδ(x) = Z

[0,1] 1{ x y < δ}dy.

Wang, Niu, Wang and Li

Then we deﬁne the following density function

[0,1] Hδ(x, y)d Pξ Z

[0,1] sup x [0,1] Cδ(x) d Pξ = sup x [0,1] Cδ(x) = 1

Denote the corresponding distribution as Fδ. Let X Pξ and deﬁne Y | X = x to be uniform on the interval [x δ, x + δ] [0, 1], so that (X, Y ) is a coupling in Γ(Pξ, Fδ). The cost of such a transport is bounded by

E(X,Y ) X Y = EX [EY [ X Y | X]] EX [EY [δ | X]] = δ.

By the deﬁnition of the Wasserstein distance, W1(Fδ, Pξ) < δ. Subsequently, we can bound

W1(ξ (Fδ), Q ϵ) = inf γ Γ(ξ (Fδ),Q ϵ)

M ϵ M ϵ x y dγ = inf γ Γ(ξ (Fδ),ξ (Pξ))

M ϵ M ϵ x y dγ

= inf γ Γ(Fδ,Pξ)

[0,1] [0,1] ξ(x) ξ(y) dγ inf γ Γ(Fδ,Pξ)

[0,1] [0,1] x y dγ

= W1(Fδ, Pξ) δ.

If the density fδ(x) is lower bounded away from 0, then the Cumulative Distribution Function (CDF) of fδ, denoted by Cδ(x), is invertible. Then we use the inverse transform to approximate Q ϵ: W1((ξ C 1 δ ) (ρ), Q ϵ) δ.

Thus, ηϵ(t) := ξ(F 1 δ (t)) is our desired parametrization. Here, we highlight its dependence on ϵ through the domain M ϵ, which is crucial for Lemma 15. If fδ is not lower bounded away from 0, we can adopt the same technique in step 5 in the proof of Lemma 11, where we perturb it by a small uniform distribution with density u: f δ(x) = (1 θ)fδ(x) + u, where θ > 0. Denoting the CDF of f δ as C δ, we deﬁne

ηϵ(t) := ξ(C δ 1(t)). As a result, there exists δ, θ > 0 such that

W1(η (ρ), Q ϵ) = W1((ξ C 1 δ ) (ρ), Q ϵ) < ϵ.

B.5 Proof of Lemma 15

We now explain the proof of Lemma 15, which shows that limϵ 0 α(m, ϵ) = 0 when m = 1. Proof The proof follows the same structure as the proof of Lemma 12, with exactly the same 3 steps. In fact, the arguments presented in the proof of Lemma 12 are applicable for any dimension m.

Deep Generative Models on Manifolds

B.6 Proof of Corollary 9

In Dahal et al. (2022), the authors ﬁrst prove the existence of an oracle map g that perfectly transports the input distribution to the target distribution, that is Q = g (ρ). They then approximate this function g via a neural network f such that f g L1(ρ) ϵ, with explicit complexity bound on f. Then, in Lemma 2 of their paper, they show that one can further bound the Wasserstein-1 distance by W1(f (ρ), g (ρ)) C f g L1(ρ) Cϵ. To generalize the setting to Wasserstein-p distances, we need two components. First we extend Lemma 2 of Dahal et al. (2022) to bound Wasserstein-p distances by f g Lp(ρ). Then we ﬁnd a suitable neural network f to control f g Lp(ρ). We start with the Wasserstein-p distance:

Wp(f (ρ), g (ρ)) = inf π Γ(f (ρ),g (ρ)) E(x,y) π( x y p 2)1/p.

Because the Wasserstein-p distance is the inﬁmum over all couplings, we may ﬁnd an upper bound by using any choice of coupling. The most obvious choice of coupling is (f(z), g(z)) (f (ρ), g (ρ)) where z ρ, which yields the following upper bound:

Wp(f (ρ), g (ρ))p = inf π Γ(f (ρ),g (ρ)) E(x,y) π x y p 2

Ez ρ f(z) g(z) p 2

Z f(x) g(x) p 2 dρ

= f g p Lp(ρ).

According to Dahal et al. (2022), the oracle g may be constructed so as to be α H older continuous, and thus continuous. The Universal Approximation Theorem of neural networks Hornik (1989) states that if g is continuous, we may construct a neural network f such that f g ϵ. Note that f g Lp(ρ) f g , so we have

Wp(f (ρ), g (ρ)) f g Lp(ρ) f g ϵ,

since ρ is a probability measure. For the empirical result, we follow the same line of argument as in the Wasserstein-1 case. Deﬁne Qn to be the empirical distribution and bgn to be the empirical risk minimizer. Because Q is bounded, we know it has a ﬁnite p-th moment, so we have

Wp(bgn (ρ), Q) Wp(bgn (ρ), Qn) + Wp(Qn, Q) Wp(g (ρ), Qn) + Wp(Qn, Q)

Wp(g (ρ), Q) + Wp(Q, Qn) + Wp(Qn, Q) ϵ + 2Wp(Q, Qn).

From Weed and Bach (2019), we know that if the Minkowski dimension of Q is γ, then for all δ > 0, there exists Cδ > 0 such that E(Wp(Q, Qn)) Cδn 1 γ+δ . By assumption, Q is supported on a compact d dimensional Riemannian manifold and consequently its Minkowski dimension is d. Thus, by choosing ϵ = n 1 d+δ as we did before, we obtain E [Wp(bgn (ρ), Q)] (1 + 2Cδ)n 1 d+δ .

Wang, Niu, Wang and Li

While we have proved the existence of an approximating deep neural network for the Wasserstein-p distance, providing speciﬁc bounds on the network s complexity analogous to those established for the Wasserstein-1 distance in Theorem 7 and Theorem 8 remains unresolved. In the case of W1, the complexity bounds are intricately tied to the construction detailed in Lemma 2 of Dahal et al. (2022). Extending these bounds to Wp distances would require developing a similar foundational lemma tailored to Wp settings, a task that involves substantial theoretical innovation and is not immediately straightforward. Another potential approach involves leveraging results from Ohn and Kim (2019), which oﬀer complexity bounds in the Lp norm; however, this approach introduces a new challenge as the bound on network weights, κ, becomes dependent on ϵ. Removing the dependency on ϵ in the bound for network weights is not straightforward, requiring further investigation. Whether we develop constructions similar to Lemma 2 of Dahal et al. (2022) for Wp distances, or apply theories like those in Ohn and Kim (2019), the challenges are nontrivial. We consider this an important direction for future research.

Appendix C. Additional Experiments

Below we provide additional simulations elucidating the behavior of the neural networks within the same toy scenarios based on the simulations of the main paper to further support our theoretical ﬁndings. In Section C.1, we present an additional experiment regarding neural network complexity and approximation error in the context of Simulation 1. In Section C.2, we demonstrate consistent ﬁndings with the known theory extending each of the 3 simulations of the main paper to the input m = d + 1 case as proven previously in (Dahal et al., 2022). In each case we ﬁnd that a smaller or equivalently sized neural network can learn comparatively close approximations to the m = d cases presented in the main paper. Finally, in Section C.3, we show that using a normally distributed input, rather than uniform, leads again to equally strong approximation performance against a normally distributed target output. Again, complete implementation notebooks generating each of the paper ﬁgures can be found at: https://github.com/hong-niu/dgm24.

C.1 Extensions to Simulation 1 (Unit Square), m = 1, d = 2

In this section, we provide an additional study of complexity size and approximation accuracy when varying the size of the network. We repeat the same setup as Simulation 1 of the main paper as a representative example, where the input dimension m = 1 and the target data dimension d = 2, with the following changes. We present the results of three diﬀerent sized networks a) 2 hidden layers, 10 nodes each, b) 3 hidden layers, 100 nodes each, and c) 5 hidden layers, 200 nodes each, and also signiﬁcantly increase the training epochs to a maximum of 20,000 iterations for each experiment. In each case, the input distribution is uniform in 1-dimension, and the data distribution is uniform in 2-dimensions. We observe that the approximation improves with increasing network complexity, and that aligned with our theory, a model as small as network a) with two hidden layers is suﬃcient for ﬁtting the data manifold when the input dimension is equal to the data, but such network is insuﬃcient for approximating the manifold to the same degree when the input dimension is underestimated.

Deep Generative Models on Manifolds

a) 2 layers, 10 nodes each b) 3 layers, 100 nodes each c) 5 layers, 200 nodes each

Figure 9: Training results of 3 diﬀerent network sizes on mapping the 1-dimensional uniform input to a higher 2-dimensional uniform target data manifold. Here the orange curve is generated by the neural network and is ﬁlling in the data manifold (blue). Networks a) and b) were trained to 20,000 iterations, while network c) only required 10,000 iterations to converge. We observe that as the network complexity increases, the model is able to better ﬁt the target distribution with both lower Wasserstein loss and empirical ﬁll distance against the true data. Note that the smallest network a) is suﬃcient when the input and target data dimension are equal, as was in Section 4.1, but is unable to achieve the same degree of approximation after 20,000 training iterations now that the data dimension has been underestimated, aligning with expectations from our presented theory.

Again consistent with our theory, we ﬁnd that a much larger network c) with 5 hidden layers of 200 nodes each is required to achieve a closer Wasserstein loss and ﬁll distance to the results of Section 4.1 now that the data dimension has been strictly underestimated, with a ﬁnal Wasserstein loss of 0.0016 and ﬁll distance of 0.0213. Furthermore, network c) improves on the medium sized network in network b) which achieves a ﬁnal loss of 0.0028 and ﬁll distance of 0.0302. We also note that the largest network c) also consistently achieves lower losses throughout training than the middle sized network b), despite being trained for 10,000 fewer iterations.

Wang, Niu, Wang and Li

a) 2 layers, 10 nodes each b) 3 layers, 100 nodes each c) 5 layers, 200 nodes each

Figure 10: [Top Row] Wasserstein loss per training iteration for each of the three networks presented in Section C.1. [Bottom Row] Fill distance between generated curve and data per iteration for each of the three networks presented in Section C.1.

Deep Generative Models on Manifolds

C.2 Extensions to the Case for m = d + 1

Next we show that in each of the cases of Simulation 1, 2 and 3 from the main paper, when the input dimension m is 1 greater than the dimension of the target manifold d, a neural network can be trained to approximate the manifold as well as the m = d case as is consistent with previous theory in (Dahal et al., 2022). We note that we were able to ﬁnd such a neural network using fewer or equivalently many neurons than that of the m = d case in each of the three cases.

In Figure 11, we show that a smaller neural network than the 2-D to 2-D analogue of Simulation 1 is able to learn a mapping between a 3-D uniform input distribution to a smaller 2-D target distribution. We note that in this case, the number of training samples used to train the network (900) is identical to the 2-D to 2-D case in Simulation 1 of the main paper (900), however the number of visualization points for the 3-D uniform input has been increased to maintain the density of the input grid as the 2-D case.

Figure 11: Training results of a 1 hidden layer, 10 node, fully connected network for 10,000 iterations. Training trajectory (left), Wasserstein loss (top right), and ﬁll distance (bottom right) of mapping a uniform distribution on the cube [0, 1]3 to a uniform distribution on the unit square [0, 1]2. It can be seen that as the loss decreases, the ﬁtted curve ﬁlls more of the square as expected, and as well as the 2-D to 2-D analogue in Simulation 1.

Next, we repeat for m = d + 1 input dimension analogues of Simulation 2 in Figure 12, and for Simulation 3 in Figure 13. Again, the number of training samples used in both of these simulations is equal to the respective m = d cases of the main paper, while the neural networks used in the m = d + 1 cases actually use fewer total neurons to learn similarly accurate approximations.

Wang, Niu, Wang and Li

Figure 12: Training results of a 3 hidden layer, 20 node each, fully connected network for 10,000 iterations. Training trajectory (left), Wasserstein loss (top right), and ﬁll distance (bottom right) of mapping a uniform distribution on the unit cube [0, 1]3 to a 2-D uniform distribution on a cylinder. It can be seen that as the loss decreases, the ﬁtted curve ﬁlls more of the cylindrical surface as expected, and as well as the 2-D to 2-D analogue in Simulation 2.

Figure 13: Training results of a 2 hidden layer, 128 node each, fully connected network for 10,000 iterations. Training trajectory (left), Wasserstein loss (top right), and ﬁll distance (bottom right) of mapping a uniform distribution on [0, 1]4 to a uniform distribution on the unit cube [0, 1]3. It can be seen that as the loss decreases, the ﬁtted curve ﬁlls more of the cube as expected, and as well as the 3-D to 3-D analogue in Simulation 3.

Deep Generative Models on Manifolds

C.3 Extensions to Non-Uniform Distributions

Here, we brieﬂy consider the case when the distributions are no longer uniform. Here we present two simulations where the target manifold is a 2-D normally distributed sample, and show that keeping the neural network architecture identical and only varying using either a 1-D uniform or 1-D normal distribution as the input results in equally accurate approximations of the target manifold in terms of Wasserstein distance and ﬁll distance as is consistent with our theoretical assumptions.

Figure 14: Training results of a 5 hidden layer, 200 node each, fully connected network for 10,000 iterations. Training trajectory (left), Wasserstein loss (top right), and ﬁll distance (bottom right) of mapping a uniform distribution on the interval [0, 1] to a 2-D normally distributed target sample.

Figure 15: Training results of a 5 hidden layer, 200 node each, fully connected network for 10,000 iterations. Training trajectory (left), Wasserstein loss (top right), and ﬁll distance (bottom right) of mapping a 1-D normally distributed input to a 2-D normally distributed target sample.

Wang, Niu, Wang and Li

Luca Ambrogioni, Umut G u cl u, Ya gmur G u cl ut urk, Max Hinne, Marcel A van Gerven, and Eric Maris. Wasserstein variational inference. Advances in Neural Information Processing Systems, 31, 2018.

Martin Arjovsky and L eon Bottou. Towards principled methods for training generative adversarial networks. ar Xiv preprint ar Xiv:1701.04862, 2017.

Martin Arjovsky, Soumith Chintala, and L eon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214 223. PMLR, 2017.

Jianmin Bao, Dong Chen, Fang Wen, Houqiang Li, and Gang Hua. CVAE-GAN: ﬁnegrained image generation through asymmetric training. In Proceedings of the IEEE international conference on computer vision, pages 2745 2754, 2017.

Adam Block, Zeyu Jia, Yury Polyanskiy, and Alexander Rakhlin. Intrinsic dimension estimation using Wasserstein distance. The Journal of Machine Learning Research, 23(1): 14124 14160, 2022.

William M Boothby. An introduction to diﬀerentiable manifolds and Riemannian geometry. Academic press, 1986.

Michael M Bronstein, Joan Bruna, Yann Le Cun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18 42, 2017.

Bradley CA Brown, Anthony L Caterini, Brendan Leigh Ross, Jesse C Cresswell, and Gabriel Loaiza-Ganem. The union of manifolds hypothesis. In Neur IPS 2022 Workshop on Symmetry and Geometry in Neural Representations, 2022.

Luis A Caﬀarelli. Interior W 2,p estimates for solutions of the Monge-Amp ere equation. Annals of Mathematics, pages 135 150, 1990a.

Luis A Caﬀarelli. A localization property of viscosity solutions to the Monge-Amp ere equation and their strict convexity. Annals of mathematics, 131(1):129 134, 1990b.

Luis A Caﬀarelli. Some regularity properties of solutions of Monge Amp ere equation. Communications on pure and applied mathematics, 44(8-9):965 969, 1991.

Luis A Caﬀarelli. Boundary regularity of maps with convex potentials. Communications on pure and applied mathematics, 45(9):1141 1151, 1992a.

Luis A Caﬀarelli. The regularity of mappings with a convex potential. Journal of the American Mathematical Society, 5(1):99 104, 1992b.

Luis A Caﬀarelli. Boundary regularity of maps with convex potentials ii. Annals of mathematics, 144(3):453 496, 1996.

Deep Generative Models on Manifolds

Minwoo Chae, Dongha Kim, Yongdai Kim, and Lizhen Lin. A likelihood approach to nonparametric estimation of a singular distribution using deep generative models. Journal of Machine Learning Research, 24(77):1 42, 2023.

Biraj Dahal, Alexander Havrilla, Minshuo Chen, Tuo Zhao, and Wenjing Liao. On deep generative models for approximation and estimation of distributions on manifolds. Advances in Neural Information Processing Systems, 35:10615 10628, 2022.

Manfredo Perdigao Do Carmo and J Flaherty Francis. Riemannian geometry, volume 6. Springer, 1992.

Jean Duchon. Splines minimizing rotation-invariant semi-norms in Sobolev spaces. In Constructive Theory of Functions of Several Variables: Proceedings of a Conference Held at Oberwolfach April 25 May 1, 1976, pages 85 100. Springer, 1977.

Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on learning theory, pages 907 940. PMLR, 2016.

Charles Feﬀerman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983 1049, 2016.

R emi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z Alaya, Aur elie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenﬂos, Kilian Fatras, Nemo Fournier, et al. Pot: Python optimal transport. The Journal of Machine Learning Research, 22(1): 3571 3578, 2021.

Kunihiko Fukushima. Visual feature extraction by a multilayered network of analog threshold elements. IEEE Transactions on Systems Science and Cybernetics, 5(4):322 333, 1969.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.

Alfred Gray. Tubes, volume 221. Springer Science & Business Media, 2003.

Changhee Han, Hideaki Hayashi, Leonardo Rundo, Ryosuke Araki, Wataru Shimoda, Shinichi Muramatsu, Yujiro Furukawa, Giancarlo Mauri, and Hideki Nakayama. GANbased synthetic brain MR image generation. In 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018), pages 734 738. IEEE, 2018.

Kurt Hornik. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359 366, 1989.

Christian Horvat and Jean-Pascal Pﬁster. Intrinsic dimensionality estimation using normalizing ﬂows. Advances in Neural Information Processing Systems, 35:12225 12236, 2022.

Jian Huang, Yuling Jiao, Zhen Li, Shiao Liu, Yang Wang, and Yunfei Yang. An error analysis of generative adversarial networks for learning distributions. The Journal of Machine Learning Research, 23(1):5047 5089, 2022.

Wang, Niu, Wang and Li

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401 4410, 2019.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. ar Xiv preprint ar Xiv:1312.6114, 2013.

Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive ﬂow. Advances in neural information processing systems, 29, 2016.

Elizaveta Levina and Peter Bickel. Maximum likelihood estimation of intrinsic dimension. Advances in neural information processing systems, 17, 2004.

Giulia Luise, Massimiliano Pontil, and Carlo Ciliberto. Generalization properties of optimal transport GANs with latent distribution learning. ar Xiv preprint ar Xiv:2007.14641, 2020.

Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. ar Xiv preprint ar Xiv:1411.1784, 2014.

Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classiﬁer GANs. In International conference on machine learning, pages 2642 2651. PMLR, 2017.

Ilsang Ohn and Yongdai Kim. Smooth function approximation by deep neural networks with general activation functions. Entropy, 21(7):627, 2019.

Kripasindhu Sarkar, Lingjie Liu, Vladislav Golyanik, and Christian Theobalt. Human GAN: A generative model of human images. In 2021 International Conference on 3D Vision (3DV), pages 258 267. IEEE, 2021.

Nicolas Schreuder, Victor-Emmanuel Brunel, and Arnak Dalalyan. Statistical guarantees for generative models without domination. In Algorithmic Learning Theory, pages 1051 1071. PMLR, 2021.

Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. Advances in neural information processing systems, 29, 2016.

Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. ar Xiv preprint ar Xiv:1711.01558, 2017.

Ngoc-Trung Tran, Viet-Hung Tran, Ngoc-Bao Nguyen, Trung-Kien Nguyen, and Ngai-Man Cheung. On data augmentation for GAN training. IEEE Transactions on Image Processing, 30:1882 1897, 2021.

Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. Advances in neural information processing systems, 33:19667 19679, 2020.

C edric Villani et al. Optimal transport: old and new, volume 338. Springer, 2009.

Deep Generative Models on Manifolds

Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8798 8807, 2018.

Jonathan Weed and Francis Bach. Sharp asymptotic and ﬁnite-sample rates of convergence of empirical measures in wasserstein distance. 2019.

Holger Wendland. Scattered data approximation: Conditionally positive deﬁnite functions. 2004.

Hermann Weyl. On the volume of tubes. American Journal of Mathematics, 61(2):461 472, 1939.

Xuan Xia, Xizhou Pan, Nan Li, Xing He, Lin Ma, Xiaoguang Zhang, and Ning Ding. GAN-based anomaly detection: A review. Neurocomputing, 493:497 535, 2022.

Yunfei Yang, Zhen Li, and Yang Wang. On the capacity of deep generative networks for approximating distributions. Neural networks, 145:144 154, 2022.

Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In International conference on machine learning, pages 7354 7363. PMLR, 2019.

Yijia Zheng, Tong He, Yixuan Qiu, and David Wipf. Learning manifold dimensions with conditional variational autoencoders. 2022.