# intrinsic_dimension_estimation_using_wasserstein_distance__a7aa51ff.pdf Journal of Machine Learning Research 23 (2022) 1-37 Submitted 12/21; Revised 5/22; Published 10/22 Intrinsic Dimension Estimation Using Wasserstein Distance Adam Block ablock@mit.edu Department of Mathematics Massachusetts Institute of Technology Zeyu Jia zyjia@mit.edu Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology Yury Polyanskiy yp@mit.edu Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology Alexander Rakhlin rakhlin@mit.edu Department of Brain & Cognitive Sciences Statistics and Data Science Center Massachusetts Institute of Technology Editor: Aryeh Kontorovich It has long been thought that high-dimensional data encountered in many practical machine learning tasks have low-dimensional structure, i.e., the manifold hypothesis holds. A natural question, thus, is to estimate the intrinsic dimension of a given population distribution from a finite sample. We introduce a new estimator of the intrinsic dimension and provide finite sample, non-asymptotic guarantees. We then apply our techniques to get new sample complexity bounds for Generative Adversarial Networks (GANs) depending only on the intrinsic dimension of the data. Keywords: Manifold Hypothesis, Dimension Estimation, Manifold Learning, Intrinsic Dimension, H older GANs 1. Introduction Recently, practical applications of machine learning involve a very large number of features, often many more than there are samples on which to train a model. Despite this imbalance, many modern machine learning models work astonishingly well. One of the more compelling explanations for this behavior is the manifold hypothesis, which posits that, though the data appear to the practitioner in a high-dimensional, ambient space, RD, they really lie on (or close to) a low dimensional space M of dimension d D, where we define dimension formally below. A good example to keep in mind is that of image data: each of thousands of pixels corresponds to three dimensions, but we expect that real images have some inherent structure that limits the true number of degrees of freedom in a realistic picture. This phenomenon has been thoroughly explored over the years, beginning with the linear case and moving into the more general, nonlinear regime, with such works as Niyogi et al. (2008, 2011); Belkin and Niyogi (2001); Bickel et al. (2007); Levina and Bickel (2004); 2022 Adam Block, Zeyu Jia, Yury Polyanskiy, and Alexander Rakhlin. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-1483.html. Block, Jia, Polyanskiy, and Rakhlin Kpotufe (2011); Kpotufe and Dasgupta (2012); Kpotufe and Garg (2013); Weed et al. (2019); Tenenbaum et al. (2000); Bernstein et al. (2000); Kim et al. (2019); Farahmand et al. (2007), among many, many others. Some authors have focused on finding representations for these lower dimensional sets (Niyogi et al., 2008; Belkin and Niyogi, 2001; Tenenbaum et al., 2000; Roweis and Saul, 2000; Donoho and Grimes, 2003), while other works have focused on leveraging the low dimensionality into statistically efficient estimators (Bickel et al., 2007; Kpotufe, 2011; Nakada and Imaizumi, 2020; Kpotufe and Dasgupta, 2012; Kpotufe and Garg, 2013; Ashlagi et al., 2021). In this work, our primary focus is on estimating the intrinsic dimension. To see why this is an important question, note that the local estimators of Bickel et al. (2007); Kpotufe (2011); Kpotufe and Garg (2013) and the neural network architecture of Nakada and Imaizumi (2020) all depend in some way on the intrinsic dimension. As noted in Levina and Bickel (2004), while a practitioner may simply apply cross-validation to select the optimal hyperparameters, this can be very costly unless the hyperparameters have a restricted range; thus, an estimate of intrinsic dimension is critical in actually applying the above works. In addition, for manifold learning, where the goal is to construct a representation of the data manifold in a lower dimensional space, the intrinsic dimension is a key parameter in many of the most popular methods (Tenenbaum et al., 2000; Belkin and Niyogi, 2001; Donoho and Grimes, 2003; Roweis and Saul, 2000). We propose a new estimator, based on distances between probability distributions, as well as provide rigorous, finite sample guarantees for the quality of the novel procedure. Recall that if µ, ν are two measures on a metric space (M, d M), then the Wasserstein-p distance between µ and ν is W M p (µ, ν)p = inf (X,Y ) Γ(µ,ν) E [d M(X, Y )p] (1) where Γ(µ, ν) is the set of all couplings of the two measures. If M RD, then there are two natural metrics to put on M: one is simply the restriction of the Euclidean metric to M while the other is the geodesic metric in M, i.e., the minimal length of a curve in M that joins the points under consideration. In the sequel, if the metric is simply the Euclidean metric, we leave the Wasserstein distance unadorned to distinguish it from the intrinsic metric. For a thorough treatment of such distances, see Villani (2008). We recall that the H older integral probability metric (H older IPM) is given by dβ,B(µ, ν) = sup f Cβ B(Ω) Eµ[f(X)] Eν[f(Y )] where Cβ B(Ω) is the H older ball defined in the sequel. When p = β = 1, the classical result of Kantorovich-Rubinstein says that the Wasserstein and H older distances agree. It has been known at least since Dudley (1969) that if a space M has dimension d, P is a measure with support M, and Pn is the empirical measure of n independent samples drawn from P, then W M 1 (Pn, P) n 1 d . More recently, Weed et al. (2019) has determined sharp rates for the convergence of this quantity for higher order Wasserstein distances in terms of the intrinsic dimension of the distribution. Below, we find sharp rates for the convergence of the empirical measure to the population measure with respect to the H older IPM: if β < d then dβ(Pn, P) n β d and if β > d 2 then dβ(Pn, P) n 1 2 . These sharp rates are intuitive in Intrinsic Dimension Estimation Using Wasserstein Distance that convergence to the population measure should only depend on the intrinsic complexity (i.e. dimension) without reference to the possibly much larger ambient dimension. The above convergence results are nice theoretical insights, but they have practical value, too. The results of Dudley (1969); Weed et al. (2019), as well as our results on the rate of convergence of the H older IPM, present a natural way to estimate the intrinsic dimension: take two independent samples, Pn, Pαn from P and consider the ratio of W M p (Pn, P)/W M p (Pαn, P) or dβ(Pn, P)/dβ(Pαn, P); as n , the first ratio should be about αd, while the second should be about α β d , and so d can be computed by taking the logarithm with respect to α. The first problem with this idea is that we do not know P; to address this, we instead compute the ratios using two independent samples. A more serious issue regards how large n must be in order for the asymptotic regime to apply. As we shall see below, the answer depends on the geometry of the supporting manifold. We define two estimators: one using the Euclidean distance and the other using the intrinsic distance: dn = log α log W1(Pn, P n) log W1(Pαn, P αn), edn = log α log W G 1 (Pn, P n) log W G 1 (Pαn, P αn), (2) where the primes indicate independent samples of the same size and G is a graph-based metric that approximates the intrinsic metric. Before we go into the details, we give an informal statement of our main theorem, which provides finite sample, non-asymptotic guarantees on the quality of the estimator:1 Theorem 1 (Informal version of Theorem 22) Let P be a measure on RD supported on a compact manifold of dimension d. Let τ be the reach of M, an intrinsic geometric quantity defined below. For any γ > 0, suppose we have N independent samples from P satisfying where ωd is the volume of a d-dimensional Euclidean unit ball. Then with probability at least 1 6ρ, the estimated dimension edn satisfies d 1 + 4γ edn (1 + 4γ)d. The same conclusion holds for dn. Although the guarantees for dn and f dn are similar, empirically edn is better, as explained below. The ambient dimension D never enters the statistical complexity given above. While the exponential dependence on the intrinsic dimension d is unfortunate, it is likely necessary as described below. While the reach, τ, determines the sample complexity of our dimension estimator, consideration of the injectivity radius, ι, is relevant for practical application. Both geometric quantities are defined formally in the following section, but, to understand the intuition, note that, as discussed above, there are two natural metrics we could be placing on M = supp P, 1. Explicit constants are given in the formal statement of Theorem 22 Block, Jia, Polyanskiy, and Rakhlin the Euclidean metric and the geodesic distance. The reach is, intuitively, the size of the largest ball with respect to the ambient metric such that we can treat points in M as if they were simply in Euclidean space; the injectivity radius is similar, except it treats neighborhoods with respect to the intrinsic metric. Considering that manifold distances are always at least as large as Euclidean distances, it is unsurprising that τ ι. Getting back to dimension estimation, specializing to the case of β = p = 1, and recalling (2), there are now two choices for our dimension estimator: we could use Wasserstein distance with respect to the Euclidean metric or Wasserstein distance with respect to the intrinsic metric (which we will denote by W M 1 ). We will see that if ι τ, then the two estimators induced by each of these distances behave similarly, but when ι τ, the latter is better. While we wish to use W M 1 (Pn, P n) to estimate the dimension, we do not know the intrinsic metric. As such, we use the k NN graph to approximate this intrinsic metric and introduce the measure W G 1 (Pn, P n). As we shall see below, if we had oracle access to geodesic distance d M, then the W M 1 -based estimator edn would only require ι d samples. However, our k NN estimator of d M, unfortunately, still requires the τ d samples. Nevertheless, there is a practical advantage of edn in that the metric estimator can leverage all N = 2(1 + α)n available samples, so that edn works if N τ d and only n ι d, whereas for dn we require n τ d itself. A natural question: is this more complicated approach necessary? i.e., is ι τ on real datasets? We believe that the answer is yes. To see this, consider the case of images of the digit 7 (for example) from MNIST (Le Cun and Cortes, 2010). As a demonstration, we sample images from MNIST in datasets of size ranging in powers of 2 from 32 to 2048, calculate the Wasserstein distance between these two samples, and plot the resulting trend. In the right plot, we pool all of the data to estimate the manifold distances, and then use these estimated distances to compute the Wasserstein distance between the empirical distributions. In order to better compare these two approaches, we also plot the residuals to the linear fit that we expect in the asymptotic regime. Looking at Figure 1, it is clear that we are not yet in the asymptotic regime if we simply use Euclidean distances; on the other hand, the trend using the manifold distances is much more clearly linear, suggesting that the slope of the best linear fit is meaningful. Thus we see that in order to get a meaningful dimension estimate from practical data sets, we cannot simply use W1 but must also estimate the geometry of the underlying distribution; this suggests that ι τ on this data manifold. More generally, we note that the injectivity radius, ι, is intrinsic to the geometry of the manifold and thus unaffected by the embedding; in contradistinction, the reach, τ, is extrinsic and thus can be made smaller by changing the embedding. In particular, when the obstruction to the reach being large is a bottleneck in the sense that the manifold is embedded in such a way as to place distant neighborhoods of the manifold close together in Euclidean distance (see Figure 2 for an example), we may expect τ ι. Intuitively, this matches the notion that the geometry of the data would be simple if we were to have access to the correct coordinate system and that the difficulty in understanding the geometry comes from its embedding in the ambient space. Intrinsic Dimension Estimation Using Wasserstein Distance Figure 1: Two log-log plots of comparing how W1(Pn, P n) decays to how W M 1 (Pn, P n) decays as n gets larger, as well as the residuals from a linear fit. The data are images of the digit 7 from MNIST with Wasserstein distances computed with the Sinkhorn algorithm (Cuturi, 2013). The manifold distances are approximated by a k-NN graph, as described in Section 3. We emphasize that, like many estimators of intrinsic dimension, we do not claim robustness to off-manifold noise (Levina and Bickel, 2004; Farahmand et al., 2007; Kim et al., 2019). Indeed, any fattening of the manifold will force any consistent estimator of intrinsic dimension to asymptotically grow to the full, ambient dimension as the number of samples grows. Various works have included off-manifold noise in different ways, often with the assumption that either the noise is known (Koltchinskii, 2000) or the manifold is linear (Niles-Weed and Rigollet, 2019). Methods that do not make these simplifying assumptions are often highly sensitive to scaling parameters that are required inputs in such methods as multi-scale, local SVD (Little et al., 2009). Extensions of our method to such noisy settings are a promising avenue of future research, particularly in understanding the effect of this noise on downstream applications as is done for Lipschitz classification in metric spaces and the resulting dimension-distortion tradeofffound in Gottlieb et al. (2016); in this work, however, we confine our theoretical study to the noiseless setting. The primary theoretical advantage of our estimator over that of Levina and Bickel (2004); Farahmand et al. (2007) is that we do not require the stringent regularity assumptions for our nonasymptotic rates to hold. We leave it for future empirical works whether this weakening of assumptions allows for a better practical estimator on real-world data sets. Our main contributions are as follows: Block, Jia, Polyanskiy, and Rakhlin In Section 3, we introduce a new estimator of intrinsic dimension. In Theorem 22 we prove non-asymptotic bounds on the quality of the introduced estimator. Moreover, unlike the MLE estimator of Levina and Bickel (2004) with non-asymptotic analysis in Farahmand et al. (2007), minimal regularity of the density of the population distribution is required for our guarantees and, unlike that suggested in Kim et al. (2019), our estimator is both computationally efficient and has sample complexity independent of the ambient dimension. In the course of proving Theorem 22, we adapt the techniques of Bernstein et al. (2000) to provide new, non-asymptotic bounds on the quality of k NN distance as an estimate of intrinsic distance in Proposition 24, with explicit sample complexity in terms of the reach of the underlying space. To our knowledge, these are the first such non-asymptotic bounds. We further note that the techniques we develop to prove the non-asymptotic bounds on our dimension estimator also serve to provide new statistical rates in learning Generative Adversarial Networks (GANs) with a H older discriminator class: We prove in Theorem 25 that if bµ is a H older GAN, then the distance between bµ and P, as measured by the H older IPM, is governed by rates dependent only on the intrinsic dimension of the data, independent of the ambient dimension or the dimension of the feature space. In particular, we prove in great generality that if P has intrinsic dimension d, then the rate of a Wasserstein GAN is n 1 d . This improves on the recent work of Schreuder et al. (2020). The work is presented in the order of the above listed contributions, preceded by a brief section on the geometric preliminaries and prerequisite results. We conclude the introduction by fixing notation and surveying some related work. Notation: We fix the following notation. We always let P be a probability distribution on RD and, whenever defined, we let d = dim supp P. We reserve X1, . . . , Xn for samples taken from P and we denote by Pn their empirical distribution. We reserve β for the smoothness of a H older class, Ω RD is always a bounded open domain, and is always the intrinsic diameter of a closed set. We also reserve M for a compact manifold. In general, we denote by S the support of a distribution P and we reuse M = supp P if we restrict ourselves to the case where S = M is a compact manifold, with Riemannian metric induced by the Euclidean metric. We denote by vol M the volume of the manifold with respect to its inherited metric and we reserve ωd for the volume of the unit ball in Rd. When a compact manifold manifold M can be assumed from context, we take the uniform measure on M to be the volume measure of M normalized so that M has unit measure. 1.1 Related Work Dimension Estimation There is a long history of dimension estimation, beginning with linear methods such as thresholding principal components (Fukunaga and Olsen, 1971), regressing k-Nearest-Neighbors (k NN) distances (Pettis et al., 1979), estimating packing numbers (K egl, 2002; Grassberger and Procaccia, 2004; Camastra and Vinciarelli, 2002), an estimator based solely on neighborhood (but not metric) information that was recently Intrinsic Dimension Estimation Using Wasserstein Distance proven consistent (Kleindessner and Luxburg, 2015), and many others. An exhaustive recent survey on the history of these techniques can be found in Camastra and Staiano (2016). Perhaps the most popular choice among current practitioners is the MLE estimator of Levina and Bickel (2004). The MLE estimator is constructed as the maximum likelihood of a parameterized Poisson process. As worked out in Levina and Bickel (2004), a local estimate of dimension for k 2 and x RD is given by j=1 log Tk(x) where Tj(x) is the distance between x and its jth nearest neighbor in the data set. The final estimate for fixed k is given by averaging bmk over the data points in order to reduce variance. While not included in the original paper, a similar motivation for such an estimator could be noting that if X is uniformly distributed on a ball of radius R in Rd, then E h log R ||X|| i = 1 d; the local estimator bmk(x) is the empirical version under the assumption that the density is smooth enough to be approximately constant on this small ball. The easy computation is included for the sake of completeness in Appendix E. In Farahmand et al. (2007), the authors examined a closely related estimator and provided non-asymptotic guarantees with an exponential dependence on the intrinsic dimension, albeit with stringent regularity conditions on the density. In addition to the estimators motivated by the volume growth of local balls discussed in the previous paragraph, Kim et al. (2019) proposed and analyzed a dimension estimator based on Travelling Salesman Paths (TSP). One major advantage to the TSP estimator is the lack of necessary regularity conditions on the density, requiring only an upper bound on the likelihood of the population density with respect to the volume measure on the manifold. On the other hand, the upper bound on sample complexity that that paper presents depends exponentially on the ambient dimension, which is pessimistic when the intrinsic dimension is substantially smaller. In addition, it is unclear how practical the estimator is due to the necessity of computing a solution to TSP; even ignoring this issue, Kim et al. (2019) note that practical tuning of the constants involved in their estimator is difficult and thus deploying their estimator as is on real-world datasets is unlikely. Manifold Learning The notion of reach was first introduced in Federer (1959), and subsequently used in the machine learning and computational geometry communities in such works as Niyogi et al. (2008, 2011); Aamari et al. (2019); Amenta and Bern (1999); Fefferman et al. (2016, 2018); Narayanan and Mitter (2010); Efimov et al. (2019); Boissonnat et al. (2019). Perhaps most relevant to our work, Narayanan and Mitter (2010); Fefferman et al. (2016) consider the problem of testing membership in a class of manifolds of large reach and derive tight bounds on the sample complexity of this question. Our work does not fall into the purview of their conclusions as we assume that the geometry of the underlying manifold is nice and estimate the intrinsic dimension. In the course of proving bounds on our dimension estimator, we must estimate the intrinsic metric of the data. We adapt the proofs of Tenenbaum et al. (2000); Bernstein et al. (2000); Niyogi et al. (2008) and provide Block, Jia, Polyanskiy, and Rakhlin tight bounds on the quality of a k-Nearest Neighbors (k NN) approximation of the intrinsic distance. Statistical Rates of GANs Since the introduction of Generative Adversarial Networks (GANs) in Goodfellow et al. (2014), there has been a plethora of empirical improvements and theoretical analyses. Recall that the basic GAN problem selects an estimated distribution bµ from a class of distributions P minimizing some adversarially learned distance between bµ and the empirical distribution Pn. Theoretical analyses aim to control the distance between the learned distribution bµ and the population distribution P from which the data comprising Pn are sampled. In particular statistical rates for a number of interesting discriminator classes have been proven including Besov balls (Uppal et al., 2019), balls in an RKHS (Liang, 2018), and neural network classes (Chen et al., 2020) among others. The latter paper, Chen et al. (2020) also considers GANs where the discriminative class is a H older ball, which includes the popular Wasserstein GAN framework of Arjovsky et al. (2017). They show that if bµ is the empirical minimizer of the GAN loss and the population distribution P Leb RD then E [dβ(bµ, P)] n β 2β+D up to factors polynomial in log n. Thus, in order to beat the curse of dimensionality, one requires β = Ω(D); note that the larger β is, the weaker the IPM is as the H older ball becomes smaller. In order to mitigate this slow rate, Schreuder et al. (2020) assume that both P and P are distributions arising from Lipschitz pushforwards of the uniform distribution on a d-dimensional hypercube; in this setting, they are able to remove dependence on D and show that E [dβ(bµ, P)] Ln β This last result beats the curse of dimensionality, but pays with restrictive assumptions on the generative model as well as dependence on the Lipschitz constant of the pushforward map. More importantly, the result depends exponentially not on the intrinsic dimension of P but rather on the dimension of the feature space used to represent P. In practice, stateof-the-art GANs used to produce images often choose d to be on the order of 128, which is much too large for the Schreuder et al. (2020) result to guarantee good performance. 2. Preliminaries 2.1 Geometry In this work, we are primarily concerned with the case of compact manifolds isometrically embedded in some large ambient space, RD. We note that this focus is largely in order to maintain simplicity of notation and exposition; extensions to more complicated, less regular sets with intrinsic dimension defined as the Minkowski dimension can easily be attained with our techniques. The key example to keep in mind is that of image data, where each pixel corresponds to a dimension in the ambient space, but, in reality, the distribution lives on a much smaller, embedded subspace. Many of our results can be easily extended to the non-compact case with additional assumptions on the geometry of the space and tails of the distribution of interest. Central to our study is the analysis of how complex the support of a distribution is. We measure complexity of a metric space by its entropy: Intrinsic Dimension Estimation Using Wasserstein Distance Definition 2 Let (X, d) be a metric space. The covering number at scale ε > 0, N(X, d, ε), is the minimal number s such that there exist points x1, . . . , xs such that X is contained in the union of balls of radius ε centred at the xi. The packing number at scale ε > 0, D(X, d, ε), is the maximal number s such that there exist points x1, . . . , xs X such that d(xi, xj) > ε for all i = j. The entropy is defined as log N(X, d, ε). We recall the classical packing-covering duality, proved, for example, in (van Handel, 2014, Lemma 5.12): Lemma 3 For any metric space X and scale ε > 0, D(X, d, 2ε) N(X, d, ε) D(X, d, ε). The most important geometric quantity that determines the complexity of a problem is the dimension of the support of the population distribution. There are many, often equivalent ways to define this quantity in general. One possibility, introduced in Assouad (1983) and subsequently used in Dasgupta and Freund (2008); Kpotufe and Dasgupta (2012); Kpotufe and Garg (2013) is that of doubling dimension: Definition 4 Let S RD be a closed set. For x S, the doubling dimension at x is the smallest d such that for all r > 0, the set Br(x) S can be covered by 2d balls of radius r 2, where Br(x) denotes the Euclidean ball of radius r centred at x. The doubling dimension of S is the supremum of the doubling dimension at x for all x S. This notion of dimension plays well with the entropy, as demonstrated by the following (Kpotufe and Dasgupta, 2012, Lemma 6): Lemma 5 ((Kpotufe and Dasgupta, 2012)) Let S have doubling dimension d and diameter . Then N(S, ε) We remark that a similar notion of dimension is that of the Minkowski dimension, which is defined as the asymptotic rate of growth of the entropy as the scale tends to zero. Recently, Nakada and Imaizumi (2020) examined the effect that an assumption of small Minkowski dimension has on learning with neural networks; their central statistical result can be recovered as an immediate consequence of our complexity bounds below. In order to develop non-asymptotic bounds, we need some understanding of the geometry of the support, M. We first recall the definition of the geodesic distance: Definition 6 Let S RD be closed. A piecewise smooth curve in S, γ, is a continuous function γ : I S, where I R is an interval, such that there exists a partition I1, , IJ of I such that γIj is smooth as a function to RD. The length of γ is induced by the embedding of S RD. For points p, q S, the intrinsic (or geodesic) distance is d S(p, q) = inf {length (γ)|γ(0) = p and γ(1) = q and γ is a piecewise smooth curve in S} . It is clear from the fact that straight lines are geodesics in RD that for any points p, q S, ||p q|| d S(p, q). We are concerned with two relevant geometric quantities, one extrinsic and the other intrinsic. Block, Jia, Polyanskiy, and Rakhlin Definition 7 Let S RD be a closed set. Let the medial axis Med(S) be defined as Med(S) = x RD| there exist p = q S such that ||p x|| = ||q x|| = d(x, S) . In other words, the medial axis is the set of points in RD that have at least two projections to S. Define the reach, τS of S as d(S, Med(S)), the minimal distance between a set and its medial axis. If S = M is a compact manifold with the induced Euclidean metric, we define the injectivity radius ι = ιM as the maximal r such that if p, q M such that d M(p, q) < r then there exists a unique length-minimizing geodesic connecting p to q in M. For more detail on the injectivity radius, see Lee (2018), especially Chapters 6 and 10. The difference between ιM and τM is in the choice of metric with which we equip M. We could choose to equip M with the metric induced by the Euclidean distance || || or we could choose to use the intrinsic metric d M defined above. The reach quantifies the maximal radius of a ball with respect to the Euclidean distance such that the intersection of this ball with M behaves roughly like Euclidean space. The injectivity radius, meanwhile, quantifies the maximal radius of a ball with respect to the intrinsic distance such that this ball looks like Euclidean space. While neither quantity is necessary for our dimension estimator, both figure heavily in the analysis. The final relevant geometric quantity is the sectional curvature. The sectional curvature of M at a point p M given two directions tangent to M at p is given by the Gaussian curvature at p of the image of the exponential map applied to a small neighborhood of the origin in the plane determined by the two directions. Intuitively, the sectional curvature measures how tightly wound the manifold is locally around each point. For an accessible introduction to the geometric notions mentioned here, see (Lee, 2018, Chapter 8). We now specialize to consider compact, dimension d manifolds M embedded in RD with the induced metric. One measure of size of the manifold M is the diameter, , with respect to the intrinsic distance defined above. Another notion of size is the volume measure, vol M. This measure can be defined intrinsically as integration with respect to the volume form, where the volume form can be thought of as the analogue of the Lebesgue differential in standard Euclidean space. In our setting, we could equivalently define the volume as the d-dimensional Hausdorffmeasure as in Aamari et al. (2019). Either way, when we refer to a measure µM that is uniform on the manifold, we consider the normalization such that µM(M) = 1, i.e., µM( ) = vol M( )/ vol(M). With the brief digression into volume concluded, we return to the notion of the reach, which encodes a number of local and global geometric properties. We summarize several of these in the following proposition: Proposition 8 Let M RD be a compact manifold isometrically embedded in RD. Suppose that τ = τM > 0. The following hold: (a) (Niyogi et al., 2008, Proposition 6.1)] The norm of the second fundamental form of M is bounded by 1 τ at all points p M. (b) (Aamari et al., 2019, Proposition A.1 (ii)) The injectivity radius of M is at least πτ. Intrinsic Dimension Estimation Using Wasserstein Distance Figure 2: Curve in R2 where τ ι. (c) (Boissonnat et al., 2019, Lemma 3) If p, q M such that ||p q|| 2τ then d M(p, q) 2τ arcsin ||p q|| A few remarks are in order. First, note that the Hopf-Rinow Theorem (Hopf and Rinow, 1931) guarantees that M is complete, which is fortuitous as completeness is a necessary, technical requirement for several of our arguments. Second, we note that (c) from Proposition 8 has a simple geometric interpretation: the upper bound on the right hand side is the length of the arc of a circle of radius τ containing points p, q; thus, the maximal distortion of the intrinsic metric with respect to the ambient metric is bounded by the circle of radius τ. Point (a) in the above proposition demonstrates that control of the reach leads to control of local distortion. From the definition, it is obvious that the reach provides an upper bound for the size of the global notion of a bottleneck, i.e., two points p, q M such that ||p q|| = 2τ < d M(p, q). Interestingly, these two local and global notions of distortion are the only ways that the reach of a manifold can be small, as (Aamari et al., 2019, Theorem 3.4) tells us that if the reach of a manifold M is τ, then either there exists a bottleneck of size 2τ or the norm of the second fundamental form is 1 τ at some point. Thus, in some sense, the reach is the correct measure of distortion. Note that while (b) above tells us that ιM τM, there is no comparable upper bound. To see this, consider Figure 2, which depicts a one-dimensional manifold embedded in R2. Note that the bottleneck in the center ensures that the reach of this manifold is very small; on the other hand, it is easy to see that the injectivity radius is given by half the length of the entire curve. As the curve can be extended arbitrarily, the reach can be arbitrarily small relative to the injectivity radius. We now proceed to bound the covering number of a compact manifold using the dimension and the injectivity radius. We note that upper bounds on the covering number with respect to the ambient metric were provided in Niyogi et al. (2008); Narayanan and Mitter (2010). A similar bound with less explicit constants can be found in (Kim et al., 2019, Lemma 4). Proposition 9 Let M RD be an isometrically embedded, compact, d-dimensional submanifold with injectivity radius ι > 0 such that the sectional curvatures are bounded above by κ1 0 and below by κ2 0. If ε < π 2 k1 ι then N(M, d M, ε) vol M Block, Jia, Polyanskiy, and Rakhlin If ε < 1 κ2 ι then vol M ωd d8 dε d D(M, d M, 2ε). Moreover, for all ε < ι, ωd dιd( κ2) d 2 e dι κ2ε d D(M, d M, ε). Thus, if ε < τ, where τ is the reach of M, then ωd d8 dε d D(M, d M, 2ε) N(M, d M, ε) vol M The proof of Proposition 9 can be found in Appendix A and relies on the Bishop-Gromov comparison theorem to leverage the curvature bounds from Proposition 8 into volume estimates for small intrinsic balls, a similar technique as found in Niyogi et al. (2008); Narayanan and Mitter (2010). The key point to note is that we have both upper and lower bounds for ε < ι, as opposed to just the upper bound guaranteed by Lemma 5. As a corollary, we are also able to derive bounds for the covering number with respect to the ambient metric: Corollary 10 Let M be as in Proposition 9. For ε < τ, we can control the covering numbers of M with respect to the Euclidean metric as ωd d16 dε d D(M, || || , 2ε) N(M, || || , ε) vol M The proof of Corollary 10 follows from Proposition 9 and the metric comparisons for small scales in Proposition 8; details can be found in Appendix A. 2.2 H older Classes and their Complexity In this section we make the elementary observation that complex function classes restricted to simple subsets can be much smaller than the original class. While such intuition has certainly appeared before, especially in designing esimators that can adapt to local intrinsic dimension, such as Bickel et al. (2007); Kpotufe and Dasgupta (2012); Kpotufe (2011); Kpotufe and Garg (2013); Dasgupta and Freund (2008); Steinwart et al. (2009); Nakada and Imaizumi (2020), we codify this approach below. To illustrate the above phenomenon at the level of empirical processes, we focus on H older functions in RD for some large D and let the simple subset be a subspace of dimension d where d D. We first recall the definition of a H older class: Definition 11 For an open domain Ω Rd and a function f : Ω R, define the β-H older norm as ||f||Cβ(Ω) = max 0 |γ| |α| sup x Ω |Dγf(x)| sup x,y Ω D β f(x) D β f(y) ||x y||β β . Define the H older ball of radius B, denoted by Cβ B(Ω), as the set of functions f : Ω R such that ||f||Cβ(Ω) B. If (M, g) is a Riemannian manifold of class C β +1 (see Lee (2018)), and f : M R we define the H older norm analogously, replacing |Dγf(x)| with || γf(x)||g, where is the covariant derivative. Intrinsic Dimension Estimation Using Wasserstein Distance It is a classical result of Kolmogorov and Tikhomirov (1993) that, for a bounded, open domain Ω RD, the entropy of a H older ball scales as log N Cβ B(Ω), || || , ε B as ε 0. As a consequence, we arrive at the following result, whose proof can be found in Appendix A for the sake of completeness. Proposition 12 Let S Ω Rd be a path-connected closed set contained in an open domain Ω. Let e F = Cβ B(Ω) and let F = e F|S. Then, β log D(F, || || , 2ε) log N(F, || || , ε) 3β2 log 2B Note that the content of the above result is really that of Kolmogorov and Tikhomirov (1993), coupled with the fact that restriction from Rd to M preserves smoothness. If we apply the easily proven volumetric bounds on covering and packing numbers for S a Euclidean ball to Proposition 12, we recover the classical result of Kolmogorov and Tikhomirov (1993). The key insight is that low-dimensional subsets can have covering numbers much smaller than those of a high-dimensional Euclidean ball: if the dimension of S is d, then we expect the covering number of S to scale like ε d. Plugging this into Proposition 12 tells us that the entropy of F, up to a factor logarithmic in 1 ε, scales like β . An immediate corollary of Lemma 5 and Proposition 12 is: Corollary 13 Let S RD be a closed set of diameter and doubling dimension d. Let S Ωopen and F be the restriction of Cβ B(Ω) to S. Then log N(F, || || , ε) 3β2 2B β Proof Combine the upper bound in Proposition 12 with the bound in Lemma 5. The conclusion of Corollary 13 is very useful for upper bounds as it tells us that the entropy for H older balls scales at most like ε d β as ε 0. If we desire comparable lower bounds, we require some of the geometry discussed above. Combining Proposition 12 and Corollary 10 yields the following bound: Corollary 14 Let M RD be an isometrically embedded, compact submanifold with reach τ > 0 and injectivity radius ι. Suppose Ω M is an open set and let F be the restriction of Cβ B(Ω) to M. Then for ε < τ, ωd d16 d 2B β log D(F , || || , 2ε) log N(F , || || , ε) 3β2 log 2B If we set F = Cβ B(M), then we have that for all ε < ι, ωd dιd( κ2) d 2 e dι κ2 B β log N(F, || || , ε) 3β2 log 2B Block, Jia, Polyanskiy, and Rakhlin In essence, Corollary 14 tells us that the rate of ε d β for the growth of the entropy of H older balls is sharp for sufficiently small ε. The key difference between the first and second statements is that the first is with respect to an ambient class of functions while the second is with respect to an intrinsic class. To better illustrate the difference, consider the case where β = B = 1, i.e., the class of Lipschitz functions on the manifold. In both cases, asymptotically, the entropy of Lipschitz functions scales like ε d; if we restrict to functions that are Lipschitz with respect to the ambient metric, then the above bound only applies for ε < τ; on the other hand, if we consider the larger class of functions that are Lipschitz with respect to the intrinsic metric, the bound applies for ε < ι. In the case where ι τ, this can be a major improvement. The observations in this section are undeniably simple; the real interest comes in the diverse applications of the general principle, some of which we detail below. As a final note, we remark that our guiding principle of simplifying function classes by restricting them to simple sets likely holds in far greater analysis than is explored here; in particular, Sobolev and Besov classes (see, for example, (Gin e and Nickl, 2016, 4.3)) likely exhibit similar behavior. 3. Dimension Estimation We outlined the intuition behind our dimension estimation in the introduction. In this section, we formally define the estimator and analyse its theoretical performance. We first apply standard empirical process theory and our complexity bounds in the previous section to upper bound the expected H older IPM (defined in (1)) between empirical and population distributions: Lemma 15 Let S RD be a compact set contained in a ball of radius R. Suppose that we draw n independent samples from a probability measure P supported on S and denote by Pn the corresponding empirical distribution. Let P n denote an independent identically distributed measure as Pn. Then we have E [dβ,B(Pn, P)] E dβ,B(Pn, P n) 16B inf δ>0 N(S, || || , ε)dε In particular, there exists a universal constant K such that if N(S, || || , ε) C1ε d for some C, d > 0, then E [dβ(Pn, P)] CβB 1 + p log n1{d=2β} n β holds with C = KC1. The proof uses the symmetrization and chaining technique and applies the complexity bounds of H older functions found above; the details can be found in Appendix E. We now specialize to the case where β = B = 1, due to the computational tractability of the resulting Wasserstein distance. Applying Kantorovich-Rubenstein duality (Kantorovich and Rubinshtein, 1958), we see that this special case of Lemma 15 recovers the special p = 1 case of Weed et al. (2019). From here on, we suppose that d > 2 and our metric on distributions is d1,1 = W1. Intrinsic Dimension Estimation Using Wasserstein Distance We begin by noting that if we have 2n, independent samples from P, then we can split them into two data sets of size n, and denote by Pn, P n the empirical distributions thus generated. We then note that Lemma 15 implies that if supp P M and M is of dimension d, then E W1(Pn, P n) CM,dn 1 If we were to establish a lower bound as well as concentration of W1(Pn, P n) about its mean, then we could consider the following estimator. Given a data set of size 2(α + 1)n, we can break the data into four samples, Pn, P n each of size n and Pαn, P αn of size αn. Then we would have logα W1(Pαn,P αn) W1(Pn,P n) = log α log W1(Pn, P n) log W1(Pαn, P αn) d. Which distance on M should be used to compute the Wasserstein distance, the Euclidean metric || || or the intrinsic metric d M( , )? As can be guessed from Corollary 14, asymptotically, both will work, but for finite sample sizes when ι τ, the latter is much better. One problem remains, however: because we are not assuming M to be known, we do not have access to d M and thus we cannot compute the necessary Wasserstein cost. In order to get around this obstacle, we recall the graph distance induced by a k NN graph: Definition 16 Let X1, . . . , Xn RD be a data set and fix ε > 0. We let G(X, ε) denote the weighted graph with vertices Xi and edges of weight ||Xi Xj|| between all vertices Xi, Xj such that ||Xi Xj|| ε. We denote by d G(X,ε) (or d G if X, ε are clear from context) the geodesic distance on the graph G(X, ε). We extend this metric to all of RD by letting d G(p, q) = ||p πG(p)|| + d G(πG(p), πG(q)) + ||q πG(q)|| where πG(p) argmin Xi ||p Xi||. We now have two Wasserstein distances, each induced by a different metric; to mitigate confusion, we introduce the following notation: Definition 17 Let X1, . . . , Xn, X 1, . . . , X n RD, sampled independently from P such that supp P M. Let Pn, P n be the empirical distributions associated to the data X, X . Let W1(Pn, P n) denote the Wasserstein cost with respect to the Euclidean metric and W M 1 (Pn, P n) denote the Wasserstein cost associated to the manifold metric, as in (1). For a fixed ε > 0, let W G 1 (Pn, P n) denote the Wasserstein cost associated to the metric d G(supp Pn supp P n,ε). Let dn, bdn, and edn denote the dimension estimators from (3) induced by each of the above metrics. Given sample distributions Pn, P n, we are able to compute W1(Pn, P n) and W G 1 (Pn, P n) for any fixed ε, but not W M 1 (Pn, P n) because we are assuming that the learner does not have access to the manifold M. On the other hand, adapting techniques from Weed et al. (2019), we are able to provide a non-asymptotic lower bound on W1(Pn, P n) and W M 1 (Pn, P n): Proposition 18 Suppose that P is a measure on RD such that supp P = M, where M is a d-dimensional, compact manifold with reach τ > 0 and injectivity radius ι > 0, such that Block, Jia, Polyanskiy, and Rakhlin the density of P with respect to the uniform measure on M is lower bounded by w > 0. Suppose that n > d vol M Then, almost surely, W1(Pn, P) 1 If we assume only that d( κ2) d 2 vol M 4wωd edι κ2 ! then, almost surely, W M 1 (Pn, P) 1 d ( κ2) 1 2 eι κ2n 1 An easy proof, based on the techniques (Weed et al., 2019, Proposition 6) can be found in Appendix E. Similarly, we can apply the same proof technique as Lemma 15 to establish the following upper bound: Proposition 19 Let M RD be a compact manifold with positive reach τ and dimension d > 2. Furthermore, suppose that P is a probability measure on RD with supp P M. Let X1, . . . , Xn, X 1, . . . , X n P be independent with corresponding empirical distributions Pn, P n. Then if diam M = , we have: E W M 1 (Pn, P) E W M 1 (Pn, P n) C vol M The full proof is in Appendix E and applies symmetrization and chaining, with an upper bound of Corollary 14. We note, as before, that a similar asymptotic rate is obtained by Weed et al. (2019) in a slightly different setting. We noted in the introduction that we required two facts to make our intuition regarding the dimension estimators precise. We have just shown that the first holds; we turn now to the second: concentration. To make this rigorous, we need one last technical concept: the T2-inequality. Definition 20 Let µ be a measure on a metric space (M, d). We say that µ satisfies a T2-inequality with constant c2 if for all measures ν µ, we have where D(ν||µ) = Eµ h log dν dµ i is the KL-divergence. The reason that the T2 inequality is useful for us is that Bobkov and G otze (1999) tell us that such an inequality implies, and is, by Gozlan et al. (2009), equivalent to Lipschitz concentration. We note further that W1(Pn, P n) is a Lipschitz function of the dataset Intrinsic Dimension Estimation Using Wasserstein Distance and thus concentrates about its mean. The constant in the T2 inequality depends on the measure µ and upper bounds for specific classes of measures are both well-known and remain an active area of research; for a more complete survey, see Bakry et al. (2014). We have the following bound: Proposition 21 Let P be a probability measure on RD that has density with respect to the (normalized) volume measure of M, lower bounded by w and upper bounded by W, where M is a d-dimensional manifold with reach τ > 0 and diam M = . Then we have: w exp d log 3 + 3d2 2 In order to bound the T2 constant in our case, we rely on the landmark result of Otto and Villani (2000) that relates c2 to another functional inequality, the log-Sobolev inequality (Bakry et al., 2014, Chapter 5). There are many ways to control the log-Sobolev constant in various situations, many of which are covered in Bakry et al. (2014). We use results from Wang (1997a), which incorporate the intrinsic geometry of the distribution, as our bound. A detailed proof can be found in Appendix B. We note that many other estimates with under slightly different conditions exits, such as that in Wang (1997b), which requires second-order control of the density of the population distribution with respect to the volume measure and the bound in Block et al. (2020), which provides control using a measure of nonconvexity. With added assumptions, we can gain much sharper control over c2; for example, if we assume a positive lower bound on the curvature of the support, we can apply the well-known Bakry Emery result (Bakry and Emery, 1985) and get dimension-free bounds. As another example, if we may assume stronger contol on the curvature of M beyond that guaranteed by the reach, we can remove the exponential dependence on the reach entirely. For the sake of simplicity and because we already admit an exponential dependence on the intrinsic dimension, we present only the more general bound here. We now provide a non-asymptotic bound on the quality of the estimator edn. Theorem 22 Let P be a probability measure on RD and suppose that P has a density with respect to the (normalized) volume measure of M lower bounded by w, where M is a ddimensional manifold with reach τ > 0 and injectivity radius ι > 0 such that d 3 and diam M = . Furthermore, suppose that P satisfies a T2 inequality with constant c2. Let γ > 0 and suppose α, n satisfy α max log 2 2γ nωd d , (48w) 1 γ , 3 d γ Suppose we have 2(α + 1)n samples drawn independently from P. Then, with probability at least 1 6ρ, we have d 1 + 3γ edn (1 + 3γ)d. Block, Jia, Polyanskiy, and Rakhlin If ι is replaced by τ above, we get the same bound with the vanilla estimator dn replacing edn. We note that we have not made every effort to minimize the constants in the statement above, with our emphasis being the dependence of these sample complexity bounds on the relevant geometric quantities. As an immediate consequence of Theorem 22, due to the fact that d is discrete, we can control the probability of error with sufficiently many samples. We may also apply Proposition 21 to replace c2 with our upper bound in terms of the reach. Corollary 23 Suppose we are in the situation of Theorem 22 and that P has density upper bounded by W with respect to the normalized uniform measure on M. Suppose further that α, n satisfy w exp d log 3 + 3d2 2 α max log2d2 nωd d , (48w)3d, 33d2 Then if we round edn to the nearest integer, and denote the resulting estimator by d n, we have with probability at least 1 6ρ, d n = d. Again, replacing ι by τ in the previous display yields the same result with bdn replaced by the vanilla estimator dn. Proof Note that because d N, if edn d 1 2, then rounding bdn to the nearest integer exactly recovers d. Setting γ < 1 4d, and plugging into the result of Theorem 22, along with an application of Proposition 21 to bound c2, concludes the proof. While the appearance of ι in Theorem 22 and Corollary 23 may seem minor, it is critical for any practical estimator. While αn = Ω τ d , we may take n as small as Ω ι d . Thus, using edn instead of the naive estimator dn allows us to leverage the entire data set in estimating the intrinsic distances, even on the small sub-samples. From the proof, it is clear that we want α to be as large as possible; thus if we have a total of N samples, we wish to make n as small as possible. If ι τ then we can make n much smaller (scaling like ι d) than if we were to simply use the Euclidean distance. As a result, on any data set where ι τ, the sample complexity of edn can be much smaller than that of dn. There are two parts to the proof of Theorem 22: first, we need to establish that our metric d G approximates d M with high probability and thus edn bdn; second, we need to show that bdn is, indeed, a good estimate of d. The second part follows from Propositions 19 and 18, and concentration; a detailed proof can be found in Appendix C. For the first part of the proof, in order to show that bdn edn, we demonstrate that d M d G in the following result: Intrinsic Dimension Estimation Using Wasserstein Distance Proposition 24 Let P be a probability measure on RD and suppose that supp P = M, a geodesically convex, compact manifold of dimension d and reach τ > 0. Suppose that we sample X1, . . . , Xn P independently. Let λ 1 2 and G = G(X, τλ). If for some ρ < 1, 1 log N M, d M, τλ2 where for any δ > 0 w B(δ) = inf p M P(BM δ (p)) with BM δ (p) the metric ball around p of radius δ. Then, with probability at least 1 ρ, for all x, y M, (1 λ) d M(x, y) d G(x, y) (1 + λ)d M(x, y). The proof of Proposition 24 follows the general outline of Bernstein et al. (2000), but is modified in two key ways: first, we control relevant geometric quantities by τ instead of by the quantities in Bernstein et al. (2000); second, we provide a quantitative, nonasymptotic bound on the number of samples needed to get a good approximation with high probability. The details are deferred to Appendix D. This result may be of interest in its own right as it provides a non-asymptotic version of the results from Tenenbaum et al. (2000); Bernstein et al. (2000). In particular, if we suppose that P has a density with respect to the uniform measure on M and this density is bounded below by a constant w > 0, then Proposition 24 combined with Proposition 9 tells us that if we have w τλ2 d log vol M samples, then we can recover the intrinsic distance of M with distortion λ. We further note that the dependence on τ, λ, d is quite reasonable in Proposition 24. The argument requires the construction of a τλ2-net on M and it is not difficult to see that one needs a covering at scale proportional to τλ in order to recover the intrinsic metric from discrete data points. For example, consider Figure 2; were a curve to be added to connect the points at the bottleneck, this would drastically decrease the intrinsic distance between the bottleneck points. In order to determine that the intrinsic distance between these points (without the connector) is actually quite large using the graph metric estimator, we need to set ε < τ, in which case these points are certainly only connected if there exists a point of distance less than τ to the bottleneck point, which can only occur with high probability if n = Ω τ 1 . We can extend this example to arbitrary dimension d by taking the product of the curve with r Sd 1 for r = Θ(τ); in this case, a similar argument holds and we now need Ω τ d points in order to guarantee with high probability that there exists a point of distance at most τ to one of the bottleneck points. In this way, we see that the τ d scaling is unavoidable in general. Note that the other estimators of intrinsic dimension mentioned in the introduction, in particular the MLE estimator of Levina and Bickel (2004), implicitly require the accuracy of the k NN distance for their estimation to hold; thus these estimators also suffer from the τ d sample complexity. Finally, we remark that Kim et al. (2019) presents a minimax lower bound for a related hypothesis testing problem and shows that minimax risk is bounded below by a local analogue of the reach raised to a power that depends linearly on the intrinsic dimension. Block, Jia, Polyanskiy, and Rakhlin 4. Application of Techniques to GANs In this section, we note that our techniques are not confined to the realm of dimension estimation and, in fact, readily apply to other problems. As an example, consider the unsupervised learning problem of generative modeling, where we suppose that there are samples X1, . . . , Xn P independent and we wish to produce a sample b X b P such that b P and P are close. Statistically, this problem can be expressed by fixing a class of distributions P and using the data to choose bµ P such that bµ is in some sense close to P. For computational reasons, one wishes P to contain distributions from which it is computationally efficient to sample; in practice, P is usually the class of pushforwards of a multi-variate Gaussian distribution by some deep neural network class G. While our statistical results include this setting, they are not restricted and apply for general classes of distributions P. In order to make the problem more precise, we require some notion of distance between distributions. We use the notion of the Integral Probability Metric (M uller, 1997; Sriperumbudur et al., 2012) associated to a H older ball Cβ B(Ω), as defined above. We suppose that supp P Ωand we abbreviate the corresponding IPM distance by dβ,B. Given the empirical distribution Pn, the GAN that we study can be expressed as bµ argmin µ P dβ,B(µ, Pn) = argmin µ P sup f Cβ B(Ω) Eµ[f] Pnf. In this section, we generalize the results of Schreuder et al. (2020). In particular, we derive new estimation rates for a GAN using a H older ball as a discriminating class, assuming that the population distribution P is low-dimensional; like Schreuder et al. (2020), we consider the noised and potentially contaminated setting. We have Theorem 25 Suppose that P is a probability measure on RD supported on a compact set S and suppose we have n independent Xi P with empirical distribution Pn. Let ηi be independent, centred random variables on RD such that E h ||ηi||2i σ2. Suppose we observe e Xi such that for at least (1 ε)n of the e Xi, we have e Xi = Xi+ηi; let the empirical distribution of the e Xi be e Pn. Let P be a known set of distributions and define bµ argmin µ P dβ,B(µ, e Pn). Then if there is some C1, d such that N(S, || || , δ) C1ε d, we have E [dβ,B(bµ, P)] inf µ P dβ,B(µ, P) + B(σ + 2ε) + CβB p where C is a constant depending linearly on C1. We note that the log n factor can be easily removed for all cases β = d 2 by paying slightly in order to increase the constants; for the sake of simplicity, we do not bother with this argument here. The proof of Theorem 25 is similar in spirit to that of Schreuder et al. (2020), which in turn follows Liang (2018), with details in Appendix E. The key step is in applying the bounds in Lemma 15 to the arguments of Liang (2018). We compare our result to the corresponding theorem (Schreuder et al., 2020, Theorem 2). In that work, the authors considered a setting where there is a known intrinsic dimension Intrinsic Dimension Estimation Using Wasserstein Distance d and the population distribution P = g#U [0, 1]d , the push-forward by an L-Lipschitz function g of the uniform distribution on a d-dimensional hypercube; in addition, they take P to be the set of push-forwards of U [0, 1]d by functions in some class F, all of whose elements are L-Lipschitz. Their result, (Schreuder et al., 2020, Theorem 2), gives an upper bound of E [dβ,1(bµ, P)] inf µ P dβ,1(µ, P) + L(σ + 2ε) + c L Note that our result is an improvement in two key respects. First, we do not treat the intrinsic dimension d as known, nor do we force the dimension of the feature space to be the same as the intrinsic dimension. Many of the state-of-the-art GAN architectures on datasets such as Image Net use a feature space of dimension 128 or 256 (Wu et al., 2019); the best rate that the work of Schreuder et al. (2020) can give, then would be n 1 128 . In our setting, even if the feature space is complex, if the true distribution lies on a much lower dimensional subspace, then it is the true, intrinsic dimension, that determines the rate of estimation. Secondly, note that the upper bound in (4) depends on the Lipschitz constant L; as the function classes used to determine the push-forwards are essentially all deep neural networks in practice, and the Lipschitz constants of such functions are exponential in depth, this can be a very pessimistic upper bound; our result, however, does not depend on this Lipschitz constant, but rather on properties intrinsic to the probability distribution P. This dependence is particularly notable in the noisy regime, where σ, ε do not vanish; the large multiplicative factor of L in this case would then make the bound useless. We conclude this section by considering the case most often used in practice: the Wasserstein GAN. Corollary 26 Suppose we are in the setting of Theorem 25 and S is contained in a ball of radius R for R 1 E [W1(bµ, P)] inf µ P W1(µ, P) + σ + 2Rε + CR p The proof of the corollary is almost immediate from Theorem 25. With additional assumptions on the tails of the ηi, we can turn our expectation into a high probability statement. In the special case with neither noise nor contamination, i.e. σ = ε = 0, we get that the Wasserstein GAN converges in Wasserstein distance at a rate of n 1 d , which we believe explains in large part the recent empirical success in modern Wasserstein-GANs. Acknowledgments We acknowledge support from the NSF through award DMS-2031883 and from the Simons Foundation through Award 814639 for the Collaboration on the Theoretical Foundations of Deep Learning. We acknowledge the support from NSF under award DMS-1953181, NSF Graduate Research Fellowship support under Grant No. 1122374, and support from the MIT-IBM Watson AI Lab. Block, Jia, Polyanskiy, and Rakhlin Appendix A. Proofs from Section 2 Proof [Proof of Proposition 12] We apply the method from the classic paper (Kolmogorov and Tikhomirov, 1993), following notation introduced there as applicable. For the sake of simplicity, we assume that β is an integer; the generalization to β N is analogous to that in Kolmogorov and Tikhomirov (1993). Let β = ε 2B and let x1, . . . , xs be a -connected net on S. For 0 k β and 1 i s, define where || || is the norm on tensors induced by the ambient (Euclidean) metric and Dk is the kth application of the covariant derivative. Let γ(f) = γk i (f) i,k be the matrix of all γk i (f) and let Uγ be the set of all f such that γ(f) = γ. Then the argument in the proof of (Kolmogorov and Tikhomirov, 1993, Theorem XIV) applies mutatis mutandis and we note that Uγ are 2ε neighborhoods in the H older norm. Thus it suffices to bound the number of possible γ. As in Kolmogorov and Tikhomirov (1993), we note that the number of possible values for γk 1 is at most 2B εk . Given the row γk i 0 k β, there are at most (4e + 2)β+1 values for the next row. Thus the total number of possible γ is bounded by (4e + 2)β+1 s β Y εk = (4e + 2)(β+1)s β Y β = (4e + 2)(β+1)s 2B By definition of the covering number and the fact that S is path-connected, we may take s = N(S, ) = N S, ε Taking logarithms and noting that log(4e + 2) 3 concludes the proof of the upper bound. The middle inequality is Theorem 3. For the lower bound, we again follow Kolmogorov and Tikhomirov (1993). Define ( a QD i=1 1 x2 i β 2 ||x|| 1 0 otherwise with a a constant to be set. Choose a 2 -separated set x1, . . . , xss with = ε β and consider the set of functions i=1 σi βϕ x xi where σi { 1} and σ varies over all possible sets of signs. The results of Kolmogorov and Tikhomirov (1993) guarantee that the gσ form a 2ε-separated set in F if a is chosen such that gσ F and there are 2s such combinations. By definition of packing numbers, we may choose Intrinsic Dimension Estimation Using Wasserstein Distance This concludes the proof of the lower bound. Proof [Proof of Proposition 9] We note first that the second statement follows from the first by applying (b) and (c) to Proposition 8 to control the curvature and injectivity radius in terms of the reach. Furthermore, the middle inequality in the last statement follows from Theorem 3. Thus we prove the first two statements. A volume argument yields the following control: N M, || ||g , r vol M infp M vol B ε 2 (p) is the ball around p of radius ε 2 with respect to the metric g. Thus it suffices to lower bound the volume of such a ball. Because ε < ι, we may apply the Bishop-Gromov comparison theorem (Gray, 2004, Theorem 3.17) to get that vol Bε(p) 2π d 2 !d 1 dt = ωd 2 1 sin (t κ1) d 1 dt where κ1 is an upper bound on the sectional curvature. We note that for t π 2 κ1 , we have πt κ1 and thus vol Bε(p) ωd πt d 1 dt = ωd The upper bound follows from control on the sectional curvature by τ, appearing in (Aamari et al., 2019, Proposition A.1), which, in turn, is an easy consequence of applying the Gauss formula to (a) of Proposition 8. We lower bound the packing number through an analogous argument as the upper bound for the covering number, this time with an upper bound on the volume of a ball of radius ε, again from (Gray, 2004, Theorem 3.17), but this time using a lower bound on the sectional curvature. In particular, we have for ε < ι, vol Bε(p) ωd !d 1 dt = ωd sinh (t κ2) κ2 where κ2 is a lower bound on the sectional curvature. Note that for t 1 κ2 , we have sinh (t κ2) κ2 cosh(2)t 4t. vol Bε(p) ωd 0 (4t)d 1dt = ωd The volume argument tells us that N M, || ||g , r vol M supp M vol Br(p) Block, Jia, Polyanskiy, and Rakhlin and the result follows. If we wish to extend the range of ε, we pay with a constant exponential exponential in d, reflecting the growth in volume of balls in negatively curved spaces. In particular, we can apply the same argument and note that as sinh(x) x is increasing, we have sinh (t κ2) κ2 sinh(ι κ2) ι κ2 t eι κ2 for all t < ι. Thus for all ε < ι. We have: N(M, || ||g , ε) vol M ωd dιd( κ2) d 2 e dι κ2ε d as desired. Proof [Proof of Theorem 10] Let BRD ε (p) be the set of points in RD with Euclidean distance to p less than ε and let BM ε (p) be the set of points in M with intrinsic (geodesic) distance to p less than ε. Then, if ε 2τ, combining the fact that straight lines are geodesics in RD and (d) from Proposition 8 gives BM ε (p) BRD ε (p) M BM 2τ arcsin( ε In particular, this implies N M, d M, 2τ arcsin ε N(M, || || , ε) N(M, d M, ε) D M, d M, 2τ arcsin ε D(M, || || , ε) D(M, d M, ε) whenever ε 2τ. Thus, applying Proposition 9, we have N(M, || || , ε) N(M, d M, ε) vol M and similarly, D(M, || || , 2ε) D M, d M, 2τ arcsin ε ωd d16 dε d using the fact that arcsin(x) 2x for x 0. The result follows. Appendix B. Proof of Proposition 21 As stated in the body, we bound the T2 constant c2 by the log-Sobolev constant of the same measure. We thus first define a log-Sobolev inequality: Definition 27 Let µ be a measure on M. We say that µ satisfies a log-Sobolev inequality with constant c LS if for all real valued, differentiable functions with mean 0 f : M R, we have: Z M f2 log(f2)dµ c LS M || f||2 dµ Intrinsic Dimension Estimation Using Wasserstein Distance where is the Levi-Civita connection and || || is the norm with respect to the Riemannian metric. While in the main body we cited Otto and Villani (2000) for the Otto-Villani theorem, we actually need a slight strengthening of this result. For technical reasons, Otto and Villani (2000) required the density of µ to have two derivatives; more recent works have eliminated that assumption. We have: Theorem 28 (Theorem 5.2 from Gigli and Ledoux (2013)) Suppose that µ satisfies the log-Sobolev inequality with constant c LS. Then µ satisfies the T2 inequality with constant c2 2c LS. We now recall the key estimate from Wang (1997a) that controls the log-Sobolev constant for the uniform measure on a compact manifold M2: Theorem 29 (Theorem 3.3 from Wang (1997a)) Let M be a compact, d-dimensional manifold with diameter . Suppose that Ric M K for some K R. Let µ be the uniform measure on M (i.e., the volume measure normalized so that µ(M) = 1). Then µ satisfies a log-Sobolev inequality with d e2K(d+1) 2 1 K e1+d 2K+. We are now ready to complete the proof. Proof [Proof of Proposition 21] By the Holly-Stroock perturbation theorem (Holley and Stroock, 1986), we know that if µ is the uniform measure on M normalized such that µ(M) = 1, and µ satisfies a log-Sobolev inequality with constant c LS then P satisfies a log Sobolev constant with c LS W w c LS. By (a) from Proposition 8, we have that the sectional curvatures of M are all bounded below by 2 τ 2 and thus Ric M (d 1) 2 τ 2 (for the relationship between the Ricci tensor and the sectional curvatures, see Lee (2018)). Noting that d+2 d 3 and plugging into the results of Theorem 29, we get that d 1 exp d log 3 + 3 2d2 Combining this with the Holly-Stroock result and Theorem 28 concludes the proof. Appendix C. Proof of Theorem 22 We first prove the following lemma on the concentration of W1(Pn, P n). Lemma 30 Suppose that P is a probability measure on (T, d) and that it satisfies a T2(c2)- inequality. Let X1, . . . , Xn, X 1, . . . , X n denote independent samples with corresponding empirical distributions Pn, P n. Then the following inequalities hold: P W1(Pn, P n) E W1(Pn, P n) t 2e nt2 P W1(Pn, P n) E W1(Pn, P n) t 2e nt2 2. We remark that some works, including Wang (1997a), define the log-Sobolev constant to be the inverse of our c LS. We translate their theorem into our terms by taking the reciprocol. Block, Jia, Polyanskiy, and Rakhlin Proof We note that by Gozlan et al. (2009), in particular the form of the main theorem stated in (van Handel, 2014, Theorem 4.31), it suffices to show that, as a function of the data, W1(Pn, P n) is 2 n-Lipschitz. Note that by symmetry, it suffices to show a one-sided inequality. By the triangle inequality, W1(Pn, P n) W1(Pn, µ) + W1(P n, µ) for any measure µ and thus it suffices to show that W1(Pn, µ) is 1 n-Lipschitz in the Xi. By (van Handel, 2014, Lemma 4.34), there exists a bijection between the set of couplings between Pn and µ and the set of ordered n-tuples of measures µ1, . . . , µn such that µ = 1 n P i µi. Thus we see that if X, e X are two data sets, then W1(Pn, µ) W1( e Pn, µ) sup 1 n Pn i=1 µi=µ Z d(Xi, y) d( e Xi, y) dµi(y) sup 1 n Pn i=1 µi=µ Z d(Xi, e Xi)dµi(y) X d(Xi, e Xi) i=1 d(Xi, e Xi)2 1 nd n(X, e X). The identical argument applies to W M 1 . We are now ready to show that bdn is a good estimator of d. Proposition 31 Suppose we are in the situation of Theorem 22 and we have α max log d 2γ nωd d Then with probability at least 1 4ρ, we have d 1 + 3γ bdn (1 + 3γ)d. Proof By Proposition 19 and Lemma 30, we have that with probability at least 1 e nt2 8c2 , we have W M 1 (Pn, P n) C vol M Intrinsic Dimension Estimation Using Wasserstein Distance By Proposition 18 and Lemma 30 and the left hand side of Proposition 19, we have that with probability at least 1 e αnt2 W M 1 (Pαn, P αn) 1 all under the assumption that n > d vol M Setting t = (αn) 5 4d , we see that, as α > 1, with probability at least 1 2e nt2 8c2 , we simultaneously have W M 1 (Pn, P n) C vol M W M 1 (Pαn, P αn) 1 Thus, in particular, W M 1 (Pn, P n) W M 1 (Pαn, P αn) C vol M log nωd d vol M 1 64 d vol M d Cw 1 d α 1 d Thus we see that bdn = log α log W1(Pn,P n) W1(Pαn,P αn) 1 d log α + + 1 d log w + 1 2 log log nωd d d vol M 1 + log(Cw)+ d 2 log log nωd d d , 8c2 2 2 log 1 α max log d 2γ nωd d Then with probability at least 1 2ρ, bdn d 1 + 2γ . Block, Jia, Polyanskiy, and Rakhlin An identical proof holds for the other side of the bound and thus the result holds. We are now ready to prove the main theorem using Proposition 31 and Proposition 24. Proof [Proof of Theorem 22] Note first that N M, d M, ιλ2 by Proposition 9. Setting λ = 1 2, we note that by Proposition 24, if the total number of samples d! 1 log vol M then with probability at least 1 ρ, we have 1 2d M(p, q) d G(p, q) 3 for all p, q M. Thus by the proof of Proposition 31 above, W M 1 (Pn, P n) W M 1 (Pαn, P αn) 1 + λ 1 λCw 1 d α 1 d Thus as long as α 1+λ 1 λ d γ = 3 d γ , then we have with probability at least 1 3ρ, edn d 1 + 3γ . A similar computation holds for the other bound. To prove the result for dn, note that if we replace the ιs by τ in (5) and (6), then the result still holds by the second part of Proposition 9. Then the identical arguments apply, mutatis mutandis, after skipping the step of approximating d M by d G. Appendix D. Metric Estimation Proofs In order to state our result, we need to consider the minimal amount of probability mass that P puts on any intrinsic ball of a certain radius in M. To formalize this notion, we define, for δ > 0, w B(δ) = inf p M P BM δ (p) . We need a few lemmata: Intrinsic Dimension Estimation Using Wasserstein Distance Lemma 32 Fix ε > 0 and a set of xi M and form G(x, ε). If the set of xi form a δ-net for M such that δ ε 4, then for all x, y M, d G(x, y) 1 + 4δ Proof This is a combination of (Bernstein et al., 2000, Proposition 1) and (Bernstein et al., 2000, Theorem 2). Lemma 33 Let 0 < λ < 1 and let x, y M such that ||x y|| 2τλ(1 λ). Then (1 λ)d M(x, y) ||x y|| d M(x, y). Proof Note that 2τλ(1 λ) τ 2 so we are in the situation of Proposition 8 (e). Let ℓ= d M(x, y). Rearranging the bound in Proposition 8 (e) yields Thus it suffices to show that ℓ 2τ λ. Again applying Proposition 8, we see that 1 2 ||x y|| Rearranging and plugging in ||x y|| 2τλ(1 λ) concludes the proof. The next lemma is a variant of (Niyogi et al., 2008, Lemma 5.1). Lemma 34 Let w B(δ) be as in Proposition 24 and let N(M, δ) be the covering number of M at scale δ. If we sample n w δ 2 1 log N(M, δ 2) ρ points independently from P, then with probability at least 1 ρ, the points form a δ-net of M. Proof Let y1, . . . , y N be a minimal δ 2-net of M. For each yi the probability that xi is not in B δ 2 (yi) is bounded by 1 w B δ 2 by definition. By independence, we have 2 (yi) 1 )Bw δ n e nw B( δ By a union bound, we have P i such that j xj B δ 2 (yi) N M, δ If n satisfies the bound in the statement then the right hand side (7) is controlled by ρ. Note that for any measure P, a simple union bound tells us that w B(δ) N (M, δ) 1 and Block, Jia, Polyanskiy, and Rakhlin that equality, up to a constant, is achieved for the uniform measure. This is within a log factor of the obvious lower bound given by the covering number on the number of points required to have a δ-net on M. With these lemmata, we are ready to conclude the proof: Proof [Proof of Proposition 24] Let ε = τλ 2τλ(1 λ) by λ 1 2. Let δ = λε 4 . By Lemma 34, with high probability, the xi form a δ-net on M; thus for the rest of the proof, we fix a set of xi such that this condition holds. Now we may apply Lemma 32 to yield the upper bound d G(x, y) (1 + λ)d M(x, y). For the lower bound, for any points p, q M there are points xj0, xjm such that d M(p, xj0) δ and d M(q, xjm) δ by the fact that the xi form a δ-net. Let xj1, . . . , xjm 1 be a geodesic in G between xj0 and xjm. By Lemma 33 and the fact that edges only exist for small weights, we have d M(p, q) d M (p, xj0) + d M (xjm, q) + i=1 d M xji 1, xji ||p xj0|| + ||xjm q|| + xji 1 xji ! = (1 λ) 1d G(p, q). Rearranging concludes the proof. Appendix E. Miscellany Proof [Proof of Lemma 15] By symmetrization and chaining, we have i=1 f(Xi) f(X i) i=1 εif(Xi) log N(F, || || , ε)dε log N F, || || , ε N(S, || || , ε)dε where the last step follows from Proposition 12. The first statement follows from noting ε is decreasing in ε, and thus allowing it to be pulled from the integral. If β > d the second statement follows from plugging in δ = 0 and recovering a rate of n 1 2 . If β < d then the second statement follows from plugging in δ = n β Proof [Proof of Proposition 18] We follow the proof of (Weed et al., 2019, Proposition 6) and use their notation. In particular, let = inf N(S, d M, ε)|S M and P(S) 1 Intrinsic Dimension Estimation Using Wasserstein Distance Applying a volume argument in the identical fashion to Proposition 9, but lower bounding the probability of a ball of radius ε by w multiplied by the volume of said small ball, we get that 2wωd d8 dε d if ε τ. Let 4wωd d8 d 1 and assume that n > vol M 4wωd d8 d (τ) d 1 i n BM ε 2 (Xi). Then because by our choice of ε, we have that P(S) < 1 2. Thus if X P then we have with probability at least 1 2, d M(X, {X1, . . . , Xn}) ε 2. Thus the Wasserstein distance between P and Pn is at least ε 4. The first result follows. We may apply the identical argument, instead using intrinsic covering numbers and the bound in Proposition 9 to recover the second statement. Proof [Proof of Proposition 19] By Kantorovich-Rubenstein duality and Jensen s inequality, we have E W M 1 (Pn, P) E i=1 f(Xi) E [f(Xi)] i=1 f(Xi) f(X i) = E W M 1 (Pn, P n) where F is the class of functions on M that are 1-Lipschitz with respect to d M. Note that, by translation invariance, we may take the radius of the H older ball F to be . By symmetrization and chaining, i=1 f(Xi) f(X i) i=1 εif(Xi) log N(F, || || , ε)dε where the last step comes from Corollary 14 and noting that after recentering, F contains functions f such that ||f||L (M) and || f||L (M) 1. Setting Block, Jia, Polyanskiy, and Rakhlin E W M 1 (Pn, P n) C vol M for some C 48, which concludes the proof. Proof [Proof of Theorem 25] By bounding the supremum of sums by the sum of suprema and the construction of bµ, dβ,B(bµ, P) dβ,B(bµ, e Pn) + dβ,B( e Pn, P) inf µ P dβ,B(µ, e Pn) + dβ,B( e Pn, P) inf µ P dβ,B(µ, P) + 2dβ,B( e Pn, P) inf µ P dβ,B(µ, P) + 2dβ,B( e Pn, Pn) + 2dβ,B(Pn, P). Taking expectations and applying Lemma 15 bounds the last term. The middle term can be bounded as follows: dβ,B( e Pn, Pn) = sup i=1 f(Xi) f( e Xi) sup i=1 f(Xi) f(Xi + ηi) + 2Bε i=1 B ||ηi|| + 2Bε where the first inequality follows from the fact that if f Cβ B(Ω) then ||f|| B and the contamination is at most ε. The second inequality follows from the fact that f is BLipschitz. Taking expectations and applying Jensen s inequality concludes the proof. Proof [Proof of Corollary 26] Applying Kantorovich-Rubenstein duality, the proof follows immediately from that of Theorem 25 by setting β = 1, with the caveat that we need to bound B and the Lipschitz constant separately. The Lipschitz constant is bounded by 1 by Kantorovich duality. The class is translation invariant, and so |||f|| E[f]| 2R by the fact that the Euclidean diameter of S is bounded by 2R. The result follows. Lemma 35 Let X be distributed uniformly on a centred (ℓ2) ball in Rd of radius R. Then, E log R ||X|| Proof Note that by scaling it suffices to prove the case R = 1. By changing to polar coordinates, E log 1 ||X|| S1 R 1 0 log 1 r rd 1drdθ R S1 R 1 0 rd 1drdθ 0 (log r) rd 1dr. Intrinsic Dimension Estimation Using Wasserstein Distance Substituting u = log r and applying integration by parts then gives 0 (log r) rd 1dr = rd as desired. Eddie Aamari, Jisu Kim, Fr ed eric Chazal, Bertrand Michel, Alessandro Rinaldo, Larry Wasserman, et al. Estimating the reach of a manifold. Electronic journal of statistics, 13 (1):1359 1399, 2019. Nina Amenta and Marshall Bern. Surface reconstruction by voronoi filtering. Discrete & Computational Geometry, 22(4):481 504, 1999. Martin Arjovsky, Soumith Chintala, and L eon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214 223. PMLR, 2017. Yair Ashlagi, Lee-Ad Gottlieb, and Aryeh Kontorovich. Functions with average smoothness: structure, algorithms, and learning. In Conference on Learning Theory, pages 186 236. PMLR, 2021. Patrice Assouad. Plongements lipschitziens dans rn. Bulletin de la Soci et e Math ematique de France, 111:429 448, 1983. Dominique Bakry and Michel Emery. Diffusions hypercontractives. In Seminaire de probabilit es XIX 1983/84, pages 177 206. Springer, 1985. Dominique Bakry, Ivan Gentil, and Michel Ledoux. Analysis and Geometry of Markov Diffusion Operators. Springer International Publishing, 2014. doi: 10.1007/978-3-319-00227-9. URL https://doi.org/10.1007/978-3-319-00227-9. Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In Nips, volume 14, pages 585 591, 2001. Mira Bernstein, Vin De Silva, John C. Langford, and Joshua B. Tenenbaum. Graph approximations to geodesics on embedded manifolds, 2000. Peter J Bickel, Bo Li, et al. Local polynomial regression on unknown manifolds. In Complex datasets and inverse problems, pages 177 186. Institute of Mathematical Statistics, 2007. Adam Block, Youssef Mroueh, Alexander Rakhlin, and Jerret Ross. Fast mixing of multiscale langevin dynamics underthe manifold hypothesis. ar Xiv preprint ar Xiv:2006.11166, 2020. Sergej G Bobkov and Friedrich G otze. Exponential integrability and transportation cost related to logarithmic sobolev inequalities. Journal of Functional Analysis, 163(1):1 28, 1999. Block, Jia, Polyanskiy, and Rakhlin Jean-Daniel Boissonnat, Andr e Lieutier, and Mathijs Wintraecken. The reach, metric distortion, geodesic convexity and the variation of tangent spaces. Journal of applied and computational topology, 3(1):29 58, 2019. Francesco Camastra and Antonino Staiano. Intrinsic dimension estimation: Advances and open problems. Information Sciences, 328:26 41, 2016. Francesco Camastra and Alessandro Vinciarelli. Estimating the intrinsic dimension of data with a fractal-based method. IEEE Transactions on pattern analysis and machine intelligence, 24(10):1404 1407, 2002. Minshuo Chen, Wenjing Liao, Hongyuan Zha, and Tuo Zhao. Statistical guarantees of generative adversarial networks for distribution estimation. 2020. Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292 2300, 2013. Sanjoy Dasgupta and Yoav Freund. Random projection trees and low dimensional manifolds. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 537 546, 2008. David L Donoho and Carrie Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100 (10):5591 5596, 2003. Richard Mansfield Dudley. The speed of mean glivenko-cantelli convergence. The Annals of Mathematical Statistics, 40(1):40 50, 1969. Kirill Efimov, Larisa Adamyan, and Vladimir Spokoiny. Adaptive nonparametric clustering. IEEE Transactions on Information Theory, 65(8):4875 4892, 2019. Amir Massoud Farahmand, Csaba Szepesv ari, and Jean-Yves Audibert. Manifold-adaptive dimension estimation. In Proceedings of the 24th international conference on Machine learning, pages 265 272, 2007. Herbert Federer. Curvature measures. Transactions of the American Mathematical Society, 93(3):418 491, 1959. Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983 1049, 2016. Charles Fefferman, Sergei Ivanov, Yaroslav Kurylev, Matti Lassas, and Hariharan Narayanan. Fitting a putative manifold to noisy data. In Conference On Learning Theory, pages 688 720. PMLR, 2018. Keinosuke Fukunaga and David R Olsen. An algorithm for finding intrinsic dimensionality of data. IEEE Transactions on Computers, 100(2):176 183, 1971. Nicola Gigli and Michel Ledoux. From log sobolev to talagrand: a quick proof. Discrete and Continuous Dynamical Systems-Series A, pages dcds 2013, 2013. Intrinsic Dimension Estimation Using Wasserstein Distance Evarist Gin e and Richard Nickl. Mathematical foundations of infinite-dimensional statistical models. Number 40. Cambridge University Press, 2016. Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. ar Xiv preprint ar Xiv:1406.2661, 2014. Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Adaptive metric dimensionality reduction. Theoretical Computer Science, 620:105 118, 2016. Nathael Gozlan et al. A characterization of dimension free concentration in terms of transportation inequalities. The Annals of Probability, 37(6):2480 2498, 2009. Peter Grassberger and Itamar Procaccia. Measuring the strangeness of strange attractors. In The Theory of Chaotic Attractors, pages 170 189. Springer, 2004. Alfred Gray. Tubes. Birkh auser Basel, 2004. doi: 10.1007/978-3-0348-7966-8. URL https: //doi.org/10.1007/978-3-0348-7966-8. Richard Holley and Daniel W Stroock. Logarithmic sobolev inequalities and stochastic ising models. 1986. Heinz Hopf and Willi Rinow. Uber den begriffder vollst andigen differentialgeometrischen fl ache. Commentarii Mathematici Helvetici, 3(1):209 225, 1931. Leonid Vitaliyevich Kantorovich and SG Rubinshtein. On a space of totally additive functions. Vestnik of the St. Petersburg University: Mathematics, 13(7):52 59, 1958. Bal azs K egl. Intrinsic dimension estimation using packing numbers. In NIPS, pages 681 688. Citeseer, 2002. Jisu Kim, Alessandro Rinaldo, and Larry Wasserman. Minimax rates for estimating the dimension of a manifold. Journal of Computational Geometry, 10(1), 2019. Matth aus Kleindessner and Ulrike Luxburg. Dimensionality estimation without distances. In Artificial Intelligence and Statistics, pages 471 479. PMLR, 2015. A. N. Kolmogorov and V. M. Tikhomirov. epsilon-entropy and epsilon-capacity of sets in functional spaces. In Mathematics and Its Applications, pages 86 170. Springer Netherlands, 1993. doi: 10.1007/978-94-017-2973-4 7. URL https://doi.org/10.1007/ 978-94-017-2973-4_7. Vladimir I Koltchinskii. Empirical geometry of multivariate data: a deconvolution approach. Annals of statistics, pages 591 629, 2000. Samory Kpotufe. k-nn regression adapts to local intrinsic dimension. In Proceedings of the 24th International Conference on Neural Information Processing Systems, pages 729 737, 2011. Samory Kpotufe and Sanjoy Dasgupta. A tree-based regressor that adapts to intrinsic dimension. Journal of Computer and System Sciences, 78(5):1496 1515, 2012. Block, Jia, Polyanskiy, and Rakhlin Samory Kpotufe and Vikas K Garg. Adaptivity to local smoothness and dimension in kernel regression. In NIPS, pages 3075 3083, 2013. Yann Le Cun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http: //yann.lecun.com/exdb/mnist/. John M Lee. Introduction to Riemannian manifolds. Springer, 2018. Elizaveta Levina and Peter Bickel. Maximum likelihood estimation of intrinsic dimension. Advances in neural information processing systems, 17:777 784, 2004. Tengyuan Liang. On how well generative adversarial networks learn densities: Nonparametric and parametric results. ar Xiv preprint ar Xiv:1811.03179, 2018. Anna V Little, Yoon-Mo Jung, and Mauro Maggioni. Multiscale estimation of intrinsic dimensionality of data sets. In 2009 AAAI Fall Symposium Series, 2009. Alfred M uller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, pages 429 443, 1997. Ryumei Nakada and Masaaki Imaizumi. Adaptive approximation and generalization of deep neural network with intrinsic dimensionality. Journal of Machine Learning Research, 21 (174):1 38, 2020. Hariharan Narayanan and Sanjoy Mitter. Sample complexity of testing the manifold hypothesis. In Proceedings of the 23rd International Conference on Neural Information Processing Systems-Volume 2, pages 1786 1794, 2010. Jonathan Niles-Weed and Philippe Rigollet. Estimation of wasserstein distances in the spiked transport model. ar Xiv preprint ar Xiv:1909.07513, 2019. Partha Niyogi, Stephen Smale, and Shmuel Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry, 39(1-3):419 441, 2008. Partha Niyogi, Stephen Smale, and Shmuel Weinberger. A topological view of unsupervised learning from noisy data. SIAM Journal on Computing, 40(3):646 663, 2011. Felix Otto and C edric Villani. Generalization of an inequality by talagrand and links with the logarithmic sobolev inequality. Journal of Functional Analysis, 173(2):361 400, 2000. Karl W Pettis, Thomas A Bailey, Anil K Jain, and Richard C Dubes. An intrinsic dimensionality estimator from near-neighbor information. IEEE Transactions on pattern analysis and machine intelligence, (1):25 37, 1979. Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. science, 290(5500):2323 2326, 2000. Nicolas Schreuder, Victor-Emmanuel Brunel, and Arnak Dalalyan. Statistical guarantees for generative models without domination. ar Xiv preprint ar Xiv:2010.09237, 2020. Intrinsic Dimension Estimation Using Wasserstein Distance Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Sch olkopf, Gert RG Lanckriet, et al. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550 1599, 2012. Ingo Steinwart, Don R Hush, Clint Scovel, et al. Optimal rates for regularized least squares regression. In COLT, pages 79 93, 2009. Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. science, 290(5500):2319 2323, 2000. Ananya Uppal, Shashank Singh, and Barnab as P oczos. Nonparametric density estimation & convergence rates for gans under besov ipm losses. ar Xiv preprint ar Xiv:1902.03511, 2019. Ramon van Handel. Probability in high dimension. Technical report, PRINCETON UNIV NJ, 2014. C edric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008. Feng-Yu Wang. On estimation of the logarithmic sobolev constant and gradient estimates of heat semigroups. Probability theory and related fields, 108(1):87 101, 1997a. Feng-Yu Wang. Logarithmic sobolev inequalities on noncompact riemannian manifolds. Probability theory and related fields, 109(3):417 424, 1997b. Jonathan Weed, Francis Bach, et al. Sharp asymptotic and finite-sample rates of convergence of empirical measures in wasserstein distance. Bernoulli, 25(4A):2620 2648, 2019. Yan Wu, JeffDonahue, David Balduzzi, Karen Simonyan, and Timothy Lillicrap. Logan: Latent optimisation for generative adversarial networks. ar Xiv preprint ar Xiv:1912.00953, 2019.