# On Convolutions, Intrinsic Dimension, and Diffusion Models

Published in Transactions on Machine Learning Research (10/2025)

Kin Kwan Leung, kk@layer6.ai, Layer 6 AI
Rasa Hosseinzadeh, rasa@layer6.ai, Layer 6 AI
Gabriel Loaiza-Ganem, gabriel@layer6.ai, Layer 6 AI

Reviewed on OpenReview: https://openreview.net/forum?id=xSzBf1te4s

Abstract

The manifold hypothesis asserts that data of interest in high-dimensional ambient spaces, such as image data, lies on unknown low-dimensional submanifolds. Diffusion models (DMs), which operate by convolving data with progressively larger amounts of Gaussian noise and then learning to revert this process, have risen to prominence as the most performant generative models, and are known to be able to learn distributions with low-dimensional support. For a given datum in one of these submanifolds, we should thus intuitively expect DMs to have implicitly learned its corresponding local intrinsic dimension (LID), i.e. the dimension of the submanifold it belongs to. Kamkari et al. (2024b) recently showed that this is indeed the case by linking this LID to the rate of change of the log marginal densities of the DM with respect to the amount of added noise, resulting in an LID estimator known as FLIPD. LID estimators such as FLIPD have a plethora of uses; among others, they quantify the complexity of a given datum, and can be used to detect outliers, adversarial examples, and AI-generated text. FLIPD achieves state-of-the-art performance at LID estimation, yet its theoretical underpinnings are incomplete, since Kamkari et al. (2024b) only proved its correctness under the highly unrealistic assumption of affine submanifolds. In this work we bridge this gap by formally proving the correctness of FLIPD under realistic assumptions. Additionally, we show that an analogous result holds when Gaussian convolutions are replaced with uniform ones, and discuss the relevance of this result.

1 Introduction

The manifold hypothesis (Bengio et al., 2013) states that high-dimensional data in $\mathbb{R}^D$ commonly lies on a low-dimensional submanifold, or a disjoint union thereof (Brown et al., 2023), embedded in $\mathbb{R}^D$; this hypothesis has been empirically found to hold for image data (Pope et al., 2021) and other modalities (Cresswell et al., 2022). Given a dataset obeying this hypothesis along with a query datum $x$, one might want to leverage the data to estimate the local intrinsic dimension (LID) of the datum, i.e. the dimension of the submanifold it belongs to, which we denote as $\mathrm{LID}(x)$. At first glance the problem of LID estimation might seem like a mere statistical or mathematical curiosity, yet LID estimators have found numerous important applications within machine learning: they effectively quantify image complexity (Kamkari et al., 2024b), are helpful in the detection of outliers (Houle et al., 2018; Anderberg et al., 2024; Kamkari et al., 2024a), AI-generated text (Tulchinskii et al., 2023), memorization (Ross et al., 2025), and adversarial examples (Ma et al., 2018), and LID estimates of neural network representations are predictive of generalization (Ansuini et al., 2019; Birdal et al., 2021; Brown et al., 2022; Magai & Ayzenberg, 2022).
However, traditional LID estimators rely on pairwise distances or similar quantities (Fukunaga & Olsen, 1971; Pettis et al., 1979; Grassberger & Procaccia, 1983; Levina & Bickel, 2004; MacKay & Ghahramani, 2005; Johnsson et al., 2014; Facco et al., 2017; Bac et al., 2021), and they thus typically suffer both from poor scaling in dataset size and from the curse of dimensionality. Generative models aim to learn the distribution $p$ that gave rise to observed data, and when the manifold hypothesis holds, $p$ has low-dimensional support. Research studying generative models through the lens of the manifold hypothesis has recently proliferated (Dai & Wipf, 2019; Zhang et al., 2020; Brehmer & Cranmer, 2020; Kim et al., 2020; Arbel et al., 2021; Caterini et al., 2021; Horvat & Pfister, 2021; Ross & Cresswell, 2021; Cunningham et al., 2022; Loaiza-Ganem et al., 2022a;b; Ross et al., 2023; Sorrenson et al., 2024; Loaiza-Ganem et al., 2024; Ventura et al., 2025), and an important finding from this line of research is that diffusion models (DMs; Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2021), which are state-of-the-art generative models, can successfully learn distributions with low-dimensional support (Pidstrigach, 2022; De Bortoli, 2022). When a generative model, such as a DM, correctly learns $p$ it must therefore learn the corresponding intrinsic dimensions of its support as well; indeed, various works have shown that LID estimators can be extracted from generative models and that these new estimators avoid the aforementioned pitfalls of traditional ones (Tempczyk et al., 2022; Zheng et al., 2022; Horvat & Pfister, 2024; Stanczuk et al., 2024; Kamkari et al., 2024b).

FLIPD is a state-of-the-art and highly scalable LID estimator based on DMs that was recently proposed by Kamkari et al. (2024b). FLIPD is grounded in a result showing that
$$\mathrm{LID}(x) = D + \lim_{\delta \to -\infty} \partial_\delta \log \varrho_{\mathcal{N}}(x, \delta), \tag{1}$$
where $\varrho_{\mathcal{N}}(\cdot, \delta)$ denotes the convolution between $p$ and Gaussian noise with log standard deviation $\delta$. DMs operate by adding Gaussian noise to data and estimating the corresponding score function; the right-hand side of Equation 1 can be written in terms of this score function, which then enables FLIPD to be computed by plugging in the score function learned by the DM. However, Kamkari et al. (2024b) only proved Equation 1 in the case where $p$ is supported on an affine submanifold of $\mathbb{R}^D$; this is a highly restrictive and unrealistic assumption for typical data of interest such as natural images.

In summary, LID estimation is a relevant problem in machine learning and the current best-performing estimator of LID lacks strong formal justification. In this work we remedy this situation by proving that Equation 1 actually holds when $p$ is supported on general disjoint unions of submanifolds of $\mathbb{R}^D$, thus fully justifying FLIPD. Additionally, we also prove that an analogous result holds when the convolution is carried out against a uniform distribution on a ball of log radius $\delta$ (rather than a Gaussian with log standard deviation $\delta$). Although this second result is less practically relevant since DMs do not use uniform noise, it is still of interest since it directly links $\mathrm{LID}(x)$ to the probability assigned by $p$ to a ball around $x$, which is a quantity that naturally appears implicitly and explicitly throughout machine learning (see e.g. Silverman, 1986; Naeem et al., 2020; Bhattacharjee et al., 2023; Kamkari et al., 2024a).
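The following is a minimal numerical sanity check of Equation 1 (ours, not from the paper): it draws synthetic samples uniformly from the unit circle embedded in $\mathbb{R}^3$, estimates $\varrho_{\mathcal{N}}(x, \delta)$ by Monte Carlo, and approximates the $\delta$-derivative with a finite difference at a moderately negative $\delta$. The sample size, query point, and value of $\delta_0$ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3                                    # ambient dimension; the circle below has LID = 1

# Sample from p: the uniform distribution on the unit circle embedded in R^3,
# a simple non-affine manifold.
n = 2_000_000
theta = rng.uniform(0.0, 2 * np.pi, size=n)
samples = np.stack([np.cos(theta), np.sin(theta), np.zeros(n)], axis=1)

x = np.array([1.0, 0.0, 0.0])            # query point on the manifold

def log_rho_gaussian(x, samples, delta):
    # Monte Carlo estimate of log rho_N(x, delta) = log E_{x'~p}[N_D(x - x'; 0, delta)],
    # computed with a log-sum-exp trick for numerical stability.
    sq_dists = np.sum((x - samples) ** 2, axis=1)
    log_kernel = -0.5 * D * np.log(2 * np.pi) - D * delta - 0.5 * sq_dists * np.exp(-2 * delta)
    m = log_kernel.max()
    return m + np.log(np.mean(np.exp(log_kernel - m)))

# Finite-difference approximation of the delta-derivative at a fairly negative delta.
delta0, h = -3.0, 1e-2
slope = (log_rho_gaussian(x, samples, delta0 + h) - log_rho_gaussian(x, samples, delta0 - h)) / (2 * h)
print("D + slope =", D + slope)           # should be close to LID(x) = 1
```

At $\delta_0 = -3$ the printed value should already be close to 1; pushing $\delta_0$ further toward $-\infty$ requires more samples, since the Gaussian kernel concentrates on an ever smaller neighbourhood of $x$.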
2 Setup and Background

2.1 Setup and Notation

Throughout our work, we will follow the definition of a $d$-dimensional manifold as being locally homeomorphic to $\mathbb{R}^d$ (Lee, 2012), meaning that the dimension of a manifold does not vary within the manifold. $\mathcal{M}$ will denote, depending on context, either an embedded $d$-dimensional submanifold of $\mathbb{R}^D$, or a countable disjoint union of embedded submanifolds of potentially varying dimensions. We will assume that data of interest is supported on $\mathcal{M}$ and generated according to a probability distribution admitting a density $p : \mathcal{M} \to \mathbb{R}$.¹ Writing $\mathcal{M}$ as $\mathcal{M} = \sqcup_j \mathcal{M}_j$, where $\mathcal{M}_j$ is $d_j$-dimensional and $\mathcal{M}_i \cap \mathcal{M}_j = \emptyset$ for $i \neq j$, the local intrinsic dimension of $x \in \mathcal{M}$ is defined as the dimension of the manifold that $x$ belongs to, i.e. $\mathrm{LID}(x) = d_i$ if $x \in \mathcal{M}_i$.

¹Note that $p$ is not a density with respect to the Lebesgue measure on $\mathbb{R}^D$, but rather with respect to the Riemannian measure (or volume form if it exists) on $\mathcal{M}$; see the work of Loaiza-Ganem et al. (2024) for a discussion about this setup.

We will make heavy use of isotropic Gaussian densities with log standard deviation $\delta \in \mathbb{R}$, i.e. with covariance matrix $e^{2\delta} I_D$; for a mean $\mu \in \mathbb{R}^D$, we will denote the corresponding density evaluated at $x \in \mathbb{R}^D$ as
$$\mathcal{N}_D(x; \mu, \delta) = C^{\mathcal{N}}_D\, e^{-D\delta} \exp\!\left(-\tfrac{1}{2}\|x-\mu\|^2 e^{-2\delta}\right), \quad \text{where } C^{\mathcal{N}}_D = (2\pi)^{-D/2}, \tag{2}$$
and where $\|\cdot\|$ will always refer to the Euclidean norm. We will sometimes find it convenient to write $\mathcal{N}_d(x; \mu, \delta)$ where $d < D$ and $x, \mu \in \mathbb{R}^D$, with the understanding that
$$\mathcal{N}_d(x; \mu, \delta) := C^{\mathcal{N}}_d\, e^{-d\delta} \exp\!\left(-\tfrac{1}{2}\|x-\mu\|^2 e^{-2\delta}\right) \tag{3}$$
is not a density as it does not integrate to 1. We will also heavily use the convolution of $p$ against a zero-centred Gaussian with log standard deviation $\delta$, which we denote as²
$$\varrho_{\mathcal{N}}(x, \delta) := \int_{\mathcal{M}} p(x')\,\mathcal{N}_D(x - x'; 0, \delta)\,\mathrm{d}x'. \tag{4}$$

²Note that, since $p$ is not a density with respect to the Lebesgue measure, the integral in Equation 4 is not a Lebesgue integral, and $\mathrm{d}x'$ should be understood as the Riemannian measure or volume form on $\mathcal{M}$. Nonetheless, $\varrho_{\mathcal{N}}(\cdot, \delta)$ is a Lebesgue density in $\mathbb{R}^D$ due to the added Gaussian noise.

We will treat uniform densities analogously to Gaussian ones, and will denote the density of a uniform random variable on a ball of log radius $\delta$ centered at $\mu \in \mathbb{R}^D$, evaluated at $x$, as
$$\mathcal{U}_D(x; \mu, \delta) = C^{\mathcal{U}}_D\, e^{-D\delta}\,\mathbb{1}\!\left(x \in B_D(\mu, e^\delta)\right), \quad \text{where } C^{\mathcal{U}}_D = \pi^{-D/2}\,\Gamma\!\left(\tfrac{D}{2} + 1\right), \tag{5}$$
and where $\mathbb{1}(\cdot)$ denotes an indicator function, $B_D(\mu, r)$ denotes a $D$-dimensional ball of radius $r$ centred at $\mu$, i.e. $B_D(\mu, r) := \{x \in \mathbb{R}^D : \|x - \mu\| < r\}$, and $\Gamma$ is the gamma function. In analogy to the Gaussian case, we will also use
$$\mathcal{U}_d(x; \mu, \delta) := C^{\mathcal{U}}_d\, e^{-d\delta}\,\mathbb{1}\!\left(x \in B_D(\mu, e^\delta)\right) \tag{6}$$
for $x, \mu \in \mathbb{R}^D$, and will also write
$$\varrho_{\mathcal{U}}(x, \delta) := \int_{\mathcal{M}} p(x')\,\mathcal{U}_D(x - x'; 0, \delta)\,\mathrm{d}x'. \tag{7}$$
As mentioned earlier, we will also prove Equation 1 when $\varrho_{\mathcal{N}}(x, \delta)$ is replaced by $\varrho_{\mathcal{U}}(x, \delta)$. Since
$$\varrho_{\mathcal{U}}(x, \delta) = \int_{\mathcal{M} \cap B_D(x, e^\delta)} p(x')\, C^{\mathcal{U}}_D\, e^{-D\delta}\,\mathrm{d}x' = C^{\mathcal{U}}_D\, e^{-D\delta}\,\mathbb{P}_{X \sim p}\!\left(X \in B_D(x, e^\delta)\right), \tag{8}$$
our result thus explicitly links $\mathrm{LID}(x)$ to the probability assigned by $p$ to a ball of log radius $\delta$ around $x$ through
$$\mathrm{LID}(x) = \lim_{\delta \to -\infty} \partial_\delta \log \mathbb{P}_{X \sim p}\!\left(X \in B_D(x, e^\delta)\right). \tag{9}$$
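As a quick illustration of Equation 9, the sketch below (ours, not from the paper) counts how many samples from a synthetic 2-sphere embedded in $\mathbb{R}^4$ fall in $B_D(x, e^\delta)$ for a small grid of $\delta$ values and regresses the log proportion against $\delta$; the slope should approach $\mathrm{LID}(x) = 2$. The manifold, sample size, and $\delta$ grid are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4                                    # ambient dimension; the 2-sphere below has LID = 2
n = 2_000_000
sphere = rng.normal(size=(n, 3))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)
samples = np.concatenate([sphere, np.zeros((n, 1))], axis=1)   # unit 2-sphere embedded in R^4
x = samples[0]                                                  # query point on the manifold

# Empirical version of P_{X~p}(X in B_D(x, e^delta)) on a grid of small delta,
# followed by a least-squares fit of its log against delta (Equation 9).
deltas = np.linspace(-3.0, -1.8, 7)
dists = np.linalg.norm(samples - x, axis=1)
log_probs = np.array([np.log(np.mean(dists < np.exp(d))) for d in deltas])
slope = np.polyfit(deltas, log_probs, 1)[0]
print("estimated LID:", slope)           # should be close to 2
```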
2.2 Diffusion Models and FLIPD

As already mentioned, the main goal of our paper is to prove Equation 1 under general and realistic assumptions. In the rest of this section we cover the background that makes this equation relevant, namely DMs and FLIPD. Our proofs do not rely on the content presented in this section, which can safely be skipped by readers who are already familiar with these topics, or by those who are only interested in our theoretical results. We will follow the stochastic differential equation (SDE) formulation of DMs of Song et al. (2021).

DMs are generative models whose goal is to learn $p$; they do this by first instantiating a (forward) noising process through an Itô SDE,
$$\mathrm{d}X_t := \alpha(X_t, t)\,\mathrm{d}t + \beta(t)\,\mathrm{d}W_t, \quad X_0 \sim p, \tag{10}$$
where $\alpha : \mathbb{R}^D \times [0, 1] \to \mathbb{R}^D$ and $\beta : [0, 1] \to \mathbb{R}$ are fixed functions which are specified as hyperparameters, and $W_t$ is a $D$-dimensional Brownian motion. We will denote the density of $X_t$ implied by Equation 10 as $p(\cdot, t)$, and note that $p(\cdot, 0) = p$.³ Here $t \in [0, 1]$ corresponds to an artificial time variable specifying how much noise has been added to the data through the SDE, with $t = 0$ corresponding to no noise, and with $t = 1$ corresponding to $p(\cdot, 1)$ being "almost pure noise". A surprising result by Anderson (1982) and Haussmann & Pardoux (1986) shows that the reverse process, $Y_t := X_{1-t}$, also obeys a (backward) SDE,
$$\mathrm{d}Y_t = \left[\beta^2(1-t)\,s(Y_t, 1-t) - \alpha(Y_t, 1-t)\right]\mathrm{d}t + \beta(1-t)\,\mathrm{d}\hat{W}_t, \quad Y_0 \sim p(\cdot, 1), \tag{11}$$
where $s$ is the (Stein) score function, i.e. $s(x, t) := \nabla_x \log p(x, t)$, and $\hat{W}_t$ is another $D$-dimensional Brownian motion. Equation 11 is at the core of DMs: by approximating the score function $s$ with a properly trained neural network $\hat{s}$ (and also approximating $p(\cdot, 1)$ with an appropriately chosen Gaussian distribution), numerically solving the resulting SDE until time $t = 1$ will produce samples from a DM; these samples will be approximately distributed according to $p$, and the better the approximations made throughout, the closer the distribution of these samples will be to $p$ (De Bortoli, 2022).

³Note that $p(\cdot, 0)$ and $p(\cdot, t)$ for $t > 0$ are not densities in the same sense; only the latter are Lebesgue densities.

The function $\alpha$ is often chosen as $\alpha(x, t) = \gamma(t)x$ for some function $\gamma : [0, 1] \to \mathbb{R}$. Under this choice, the transition kernel corresponding to Equation 10 is known (Särkkä & Solin, 2019), and it is given by
$$p_{t|0}(x_t \mid x_0) = \mathcal{N}_D(x_t; \psi(t)x_0, \log\sigma(t)), \tag{12}$$
where the functions $\psi, \sigma : [0, 1] \to \mathbb{R}$ are uniquely determined by $\gamma$ and $\beta$, and can be numerically evaluated for all common choices of $\gamma$ and $\beta$. Furthermore, under these standard choices the ratio $\lambda(t) := \sigma(t)/\psi(t)$ is injective and therefore admits a left inverse $\lambda^{-1}$, which can also be numerically evaluated. Recall that the defining property of the transition kernel is that it satisfies
$$p(x, t) = \int_{\mathcal{M}} p(x')\, p_{t|0}(x \mid x')\,\mathrm{d}x'. \tag{13}$$
The density in Equation 13 resulting from the SDE in Equation 10 is extremely similar to the convolution in Equation 4: both of them operate by adding Gaussian noise to data from $p$ (up to the rescaling of the mean by $\psi(t)$ in Equation 12). The main difference between these two ways of adding noise is simply how the amount of added noise is measured: in the DM formulation, no noise corresponds to $t = 0$, whereas this setting corresponds to the limit as $\delta \to -\infty$ for convolutions. Intuitively, it should then be the case that if one has access to a DM, then the right-hand side of Equation 1 can be approximated, thus obtaining an estimate of LID. Kamkari et al. (2024b) used this intuition to propose FLIPD. First, they showed that $\varrho_{\mathcal{N}}(\cdot, \delta)$ and $p(\cdot, t)$ are related through
$$\log \varrho_{\mathcal{N}}(x, \delta) = D\log\psi(t(\delta)) + \log p\!\left(\psi(t(\delta))\,x,\; t(\delta)\right), \tag{14}$$
where $t(\delta) := \lambda^{-1}(e^\delta)$.
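Before continuing with the derivation, the following sketch illustrates how $\psi$, $\sigma$, $\lambda$, and $t(\delta)$ can be evaluated in practice, assuming the common variance-preserving choice $\gamma(t) = -b(t)/2$ and $\beta(t) = \sqrt{b(t)}$ with a linear schedule $b(t)$; the schedule constants are illustrative, and other choices of $\gamma$ and $\beta$ lead to different functions.

```python
import numpy as np
from scipy.optimize import brentq

# A minimal sketch of psi(t), sigma(t), lambda(t) = sigma(t)/psi(t) and t(delta) = lambda^{-1}(e^delta)
# for the variance-preserving choice gamma(t) = -b(t)/2, beta(t) = sqrt(b(t)) with a linear
# schedule b(t) = b_min + t * (b_max - b_min); the constants below are illustrative.
b_min, b_max = 0.1, 20.0

def psi(t):
    # psi(t) = exp(int_0^t gamma(s) ds) = exp(-0.5 * int_0^t b(s) ds)
    integral = b_min * t + 0.5 * (b_max - b_min) * t ** 2
    return np.exp(-0.5 * integral)

def sigma(t):
    # For the variance-preserving SDE, sigma^2(t) = 1 - psi(t)^2.
    return np.sqrt(1.0 - psi(t) ** 2)

def lam(t):
    return sigma(t) / psi(t)

def t_of_delta(delta):
    # lambda is increasing in t, so the inverse can be found by bracketing on (0, 1).
    return brentq(lambda t: lam(t) - np.exp(delta), 1e-12, 1.0 - 1e-12)

print(t_of_delta(-5.0))   # a very negative delta maps to a small time t, i.e. little added noise
```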
Then, using the Fokker–Planck equation, which provides an explicit formula for $\partial_t p(x, t)$, along with the chain rule and Equation 14, Kamkari et al. (2024b) showed that
$$\partial_\delta \log \varrho_{\mathcal{N}}(x, \delta) = \sigma^2(t(\delta))\left[\mathrm{Tr}\!\left(\nabla s\!\left(\psi(t(\delta))\,x,\, t(\delta)\right)\right) + \left\|s\!\left(\psi(t(\delta))\,x,\, t(\delta)\right)\right\|^2\right] =: \nu(s, x, t(\delta)). \tag{15}$$
Importantly, evaluating $\nu$ requires access to $p(x, t)$ only through $s$. Lastly, by using the trained score function $\hat{s}$ to approximate the unknown $s$ and by setting a negative enough $\delta_0$, FLIPD is obtained by combining Equation 15 with Equation 1 as
$$\mathrm{FLIPD}(x; \delta_0) := D + \nu(\hat{s}, x, t(\delta_0)). \tag{16}$$
The correctness of FLIPD as an estimator of LID therefore hinges on Equation 1 holding in general, and as previously mentioned, Kamkari et al. (2024b) only proved Equation 1 in the case where $\mathcal{M}$ is an affine submanifold of $\mathbb{R}^D$. We refer the reader to Kamkari et al. (2024b) for a more thorough derivation of FLIPD, along with illustrative examples and empirical results.

3 Related Work

As mentioned in the introduction, statistical estimators of LID typically rely on the computation of all pairwise distances or angles on a given dataset sampled from $p$ (Fukunaga & Olsen, 1971; Pettis et al., 1979; Grassberger & Procaccia, 1983; Levina & Bickel, 2004; MacKay & Ghahramani, 2005; Johnsson et al., 2014; Facco et al., 2017; Bac et al., 2021), resulting in these estimators exhibiting poor scaling in dataset size and ambient dimension $D$. More related to our work is research aiming to leverage deep generative models for LID estimation. Stanczuk et al. (2024) proposed an estimator which also leverages DMs but is not based on Equation 1 and requires many more function evaluations of $\hat{s}$ than FLIPD. Horvat & Pfister (2024) proposed another method using DMs for LID estimation, but it requires altering the training procedure of the DM. As a result, these methods are not compatible with off-the-shelf state-of-the-art DMs such as Stable Diffusion (Rombach et al., 2022); to the best of our knowledge, FLIPD is the only estimator which scales to this setting, making it particularly relevant to properly justify it. Other works have used generative models beyond DMs; for example, Zheng et al. (2022) showed that the number of active latent dimensions in variational autoencoders (Kingma & Welling, 2014; Rezende et al., 2014) estimates LID. Lastly, Tempczyk et al. (2022) proved another convolution-related result which they leveraged for LID estimation with normalizing flows (Dinh et al., 2017; Durkan et al., 2019). More specifically, they showed that for $x \in \mathcal{M}$, as $\delta \to -\infty$,
$$\log \varrho_{\mathcal{N}}(x, \delta) = \delta\left(\mathrm{LID}(x) - D\right) + O(1). \tag{17}$$
By adding different levels of noise $\delta$ to data and training a normalizing flow for each noise level, Tempczyk et al. (2022) estimate $\log \varrho_{\mathcal{N}}(x, \delta)$ for various values of $\delta$; they then fit a linear regression predicting these estimates from $\delta$, and the resulting slope is an estimate of $\mathrm{LID}(x) - D$ thanks to Equation 17. When proposing FLIPD, Kamkari et al. (2024b) noted that Equation 17 provides informal intuition for why Equation 1 holds: if the $O(1)$ term were constant with respect to $\delta$, then Equation 17 would imply Equation 1. In this sense, proving Equation 1 boils down to showing that the $O(1)$ term indeed behaves like a constant as $\delta \to -\infty$, or more formally, that its derivative with respect to $\delta$ converges to 0 in this limit.
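To make the contrast with this multi-noise-level regression concrete, the following is a minimal sketch of a FLIPD evaluation (Equation 16) given a trained score network; `score_fn`, `psi`, `sigma`, and `t_of_delta` are assumed to be supplied by the user, for instance from a diffusion model and a noise schedule like the one sketched above, and the exact Jacobian trace is only practical for small $D$.

```python
import torch

def flipd(score_fn, x, delta0, psi, sigma, t_of_delta, D):
    # Equation 16: FLIPD(x; delta0) = D + nu(score_fn, x, t(delta0)), with nu given by
    # Equation 15. score_fn, psi, sigma and t_of_delta are user-supplied; x has shape (D,).
    t0 = t_of_delta(delta0)
    x_scaled = psi(t0) * x.detach()
    s = score_fn(x_scaled, t0)                                   # \hat{s}(psi(t0) x, t0)
    # Exact trace of the score Jacobian; for large D a Hutchinson-style stochastic
    # trace estimator is the usual alternative.
    jac = torch.autograd.functional.jacobian(lambda z: score_fn(z, t0), x_scaled)
    nu = sigma(t0) ** 2 * (torch.trace(jac) + s.pow(2).sum())
    return float(D + nu)
```

Note that, in contrast with the regression over many noise levels described above, this estimator only queries the score at a single noise level $t(\delta_0)$.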
Several works have studied the interplay between DMs and the manifold hypothesis in contexts beyond LID estimation: Pidstrigach (2022) first showed that if the error between $\hat{s}$ and $s$ is bounded, then DMs recover the correct support, thus showing that DMs can learn manifolds. De Bortoli (2022) refined this result with error bounds in Wasserstein distance, and various follow-up works have provided finite-sample estimation rates (Chen et al., 2023; Oko et al., 2023; Tang & Yang, 2024; Wang et al., 2024). Lastly, Chen et al. (2024) observed that, under the manifold hypothesis, the Jacobian of the score function has low rank, and they further leveraged this observation for image editing.

4 Result Statements and Discussion

In this section we state and discuss our results. Partial proofs that avoid differential geometry (Lee, 2012; 2018) are provided in Section 5; the remaining steps, which rely on these tools, are given in the appendices. In our first result, we establish that Equation 1 holds when $\mathcal{M}$ is an arbitrary submanifold of $\mathbb{R}^D$ and $p$ obeys continuity and finite second moment conditions.

Theorem 1. Let $\mathcal{M}$ be a smooth $d$-dimensional embedded submanifold of $\mathbb{R}^D$ and let $p$ be a probability density function on $\mathcal{M}$. Let $x \in \mathcal{M}$ be such that $p$ is continuous at $x$, $p(x) > 0$, and $C := \int_{\mathcal{M}} p(x')\,\mathrm{dist}^2_{\mathcal{M}}(x, x')\,\mathrm{d}x' < \infty$, where $\mathrm{dist}_{\mathcal{M}}$ denotes the geodesic distance on $\mathcal{M}$ obtained from the induced Riemannian metric on $\mathcal{M}$. Then,
$$\lim_{\delta \to -\infty} \partial_\delta \log \varrho_{\mathcal{N}}(x, \delta) = d - D. \tag{18}$$

Our continuity and second moment assumptions in Theorem 1 are mild and are completely analogous to corresponding assumptions made by Kamkari et al. (2024b). Nevertheless, we emphasize once again that Kamkari et al. (2024b) also assumed that $\mathcal{M}$ is an affine submanifold of $\mathbb{R}^D$, which is a much stronger and unrealistic assumption on the structure of $\mathcal{M}$. Watchful readers might have noticed that the finite second moment assumption, i.e. $\int_{\mathcal{M}} p(x')\,\mathrm{dist}^2_{\mathcal{M}}(x, x')\,\mathrm{d}x' < \infty$, implies that $\mathcal{M}$ must be connected (or more precisely, $p$ must assign all its probability mass to a single connected component of $\mathcal{M}$), which is likely not realistic in many settings of practical interest. However, this obstacle is minor since, as we show below, Theorem 1 can be straightforwardly extended to the case where $\mathcal{M}$ is a disjoint union of submanifolds of $\mathbb{R}^D$ of potentially varying dimensions.

Corollary 1. Let $\mathcal{M} = \sqcup_j \mathcal{M}_j$, where $\mathcal{M}_j$ is a smooth $d_j$-dimensional embedded submanifold of $\mathbb{R}^D$ with $\xi := \min_{i \neq j}\, \inf_{x_i \in \mathcal{M}_i,\, x_j \in \mathcal{M}_j} \|x_i - x_j\| > 0$. Let $p$ be a probability density function on $\mathcal{M}$, which we write as $p(x) = \pi_j p_j(x)$ for $x \in \mathcal{M}_j$,⁴ where for every $j$, $p_j$ is a probability density on $\mathcal{M}_j$ and $\pi_j > 0$, and $\sum_j \pi_j = 1$. Let $x \in \mathcal{M}_i$ for some $i$ be such that $p_i$ is continuous at $x$, $p_i(x) > 0$, and $\int_{\mathcal{M}_i} p_i(x')\,\mathrm{dist}^2_{\mathcal{M}_i}(x, x')\,\mathrm{d}x' < \infty$, where $\mathrm{dist}_{\mathcal{M}_i}$ denotes the geodesic distance on $\mathcal{M}_i$ obtained from the induced Riemannian metric on $\mathcal{M}_i$. Then,
$$\lim_{\delta \to -\infty} \partial_\delta \log \varrho_{\mathcal{N}}(x, \delta) = d_i - D. \tag{19}$$

⁴Note that, because $\xi > 0$, without loss of generality any density $p$ on $\mathcal{M}$ can be written as $p(x) = \pi_j p_j(x)$ when $x \in \mathcal{M}_j$.

Corollary 1 formally justifies FLIPD in the sense that it shows that, under mild regularity conditions,
$$\lim_{\delta \to -\infty} \left[D + \nu(s, x, t(\delta))\right] = \mathrm{LID}(x), \tag{20}$$
that is, when the score function is perfectly learned, i.e. $\hat{s} = s$, we have that $\mathrm{FLIPD}(x; \delta) \to \mathrm{LID}(x)$ as $\delta \to -\infty$.
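The following is a small numerical illustration of Corollary 1 (ours, not from the paper), mirroring the earlier Monte Carlo sketch: data are drawn from a disjoint union of a circle ($d_1 = 1$) and a 2-sphere ($d_2 = 2$) embedded in $\mathbb{R}^4$ and separated so that $\xi > 0$, and the quantity $D + \partial_\delta\log\varrho_{\mathcal{N}}(x, \delta)$ recovers a different LID on each component. All constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4
n = 1_000_000

# Component 1: unit circle in the first two coordinates (LID = 1).
theta = rng.uniform(0.0, 2 * np.pi, size=n)
circle = np.stack([np.cos(theta), np.sin(theta), np.zeros(n), np.zeros(n)], axis=1)

# Component 2: unit 2-sphere (LID = 2), translated far away so that xi > 0.
sphere = rng.normal(size=(n, 3))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)
sphere = np.concatenate([sphere, np.full((n, 1), 10.0)], axis=1)

data = np.concatenate([circle, sphere])          # mixture with pi_1 = pi_2 = 1/2

def lid_estimate(x, delta0=-2.5, h=1e-2):
    # D plus a finite-difference estimate of the delta-derivative of log rho_N(x, delta).
    def log_rho(delta):
        sq = np.sum((x - data) ** 2, axis=1)
        lk = -0.5 * D * np.log(2 * np.pi) - D * delta - 0.5 * sq * np.exp(-2 * delta)
        m = lk.max()
        return m + np.log(np.mean(np.exp(lk - m)))
    return D + (log_rho(delta0 + h) - log_rho(delta0 - h)) / (2 * h)

print(lid_estimate(circle[0]))                   # roughly 1
print(lid_estimate(sphere[0]))                   # roughly 2
```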
For the uniform case, analogously to the Gaussian one, we first prove the case where $\mathcal{M}$ is a $d$-dimensional submanifold, and then extend the result to disjoint unions of submanifolds. Note that our results with uniform distributions do not require the second moment conditions required in the Gaussian case.

Theorem 2. Let $\mathcal{M}$ be a smooth $d$-dimensional embedded submanifold of $\mathbb{R}^D$ and let $p$ be a probability density function on $\mathcal{M}$. Let $x \in \mathcal{M}$ be such that $p$ is continuous at $x$ and $p(x) > 0$. Then,
$$\lim_{\delta \to -\infty} \partial_\delta \log \varrho_{\mathcal{U}}(x, \delta) = d - D. \tag{21}$$

Corollary 2. Let $\mathcal{M} = \sqcup_j \mathcal{M}_j$, where $\mathcal{M}_j$ is a smooth $d_j$-dimensional embedded submanifold of $\mathbb{R}^D$ with $\xi := \min_{i \neq j}\, \inf_{x_i \in \mathcal{M}_i,\, x_j \in \mathcal{M}_j} \|x_i - x_j\| > 0$. Let $p$ be a probability density function on $\mathcal{M}$, which we write as $p(x) = \pi_j p_j(x)$ for $x \in \mathcal{M}_j$, where for every $j$, $p_j$ is a probability density on $\mathcal{M}_j$ and $\pi_j > 0$, and $\sum_j \pi_j = 1$. Let $x \in \mathcal{M}_i$ be such that $p_i$ is continuous at $x$ and $p_i(x) > 0$. Then,
$$\lim_{\delta \to -\infty} \partial_\delta \log \varrho_{\mathcal{U}}(x, \delta) = d_i - D. \tag{22}$$

As previously discussed, Corollary 2 immediately links $\mathrm{LID}(x)$ and $\mathbb{P}_{X \sim p}(X \in B_D(x, e^\delta))$ through Equation 9. Although Corollary 1 is more relevant due to its connection to DMs and FLIPD, Corollary 2 still succeeds at connecting two quantities of interest in machine learning despite lacking these connections. Some older estimators of LID take a given dataset sampled from $p$, regress the log proportion of datapoints in $B_D(x, e^\delta)$ against $\delta$, and then estimate $\mathrm{LID}(x)$ as the corresponding slope (Pettis et al., 1979; Grassberger & Procaccia, 1983). These works provided an intuitively correct but not rigorous mathematical justification behind this estimator, which Equation 9 rigorously justifies.

5 Proofs

5.1 Gaussian Case: Theorem 1 and Corollary 1

Proof of Theorem 1. The core idea behind the proof of Theorem 1 is to exploit the following relationship between $\mathcal{N}_D(x; \mu, \delta)$ and $\mathcal{N}_d(x; \mu, \delta)$:
$$\mathcal{N}_D(x - x'; 0, \delta) = (2\pi)^{\frac{d-D}{2}}\, e^{\delta(d-D)}\, \mathcal{N}_d(x - x'; 0, \delta). \tag{23}$$
Thus we have
$$\varrho_{\mathcal{N}}(x, \delta) = \int_{\mathcal{M}} p(x')\,\mathcal{N}_D(x - x'; 0, \delta)\,\mathrm{d}x' = (2\pi)^{\frac{d-D}{2}}\, e^{\delta(d-D)} \int_{\mathcal{M}} p(x')\,\mathcal{N}_d(x - x'; 0, \delta)\,\mathrm{d}x'. \tag{24}$$
To simplify notation, we let
$$p^{\mathcal{N}}_\delta(x) := \int_{\mathcal{M}} p(x')\,\mathcal{N}_d(x - x'; 0, \delta)\,\mathrm{d}x', \tag{25}$$
so that
$$\partial_\delta \log \varrho_{\mathcal{N}}(x, \delta) = d - D + \partial_\delta \log p^{\mathcal{N}}_\delta(x). \tag{26}$$
Thus it suffices to show that
$$\lim_{\delta \to -\infty} \partial_\delta \log p^{\mathcal{N}}_\delta(x) = 0. \tag{27}$$
First, note that we have
$$\partial_\delta \mathcal{N}_d(x - x'; 0, \delta) = \left(-d + \|x - x'\|^2 e^{-2\delta}\right)\mathcal{N}_d(x - x'; 0, \delta). \tag{28}$$
It follows that
$$\begin{aligned}
\lim_{\delta \to -\infty} \partial_\delta \log p^{\mathcal{N}}_\delta(x) &= \lim_{\delta \to -\infty} \frac{\partial_\delta p^{\mathcal{N}}_\delta(x)}{p^{\mathcal{N}}_\delta(x)} && (29)\\
&= \lim_{\delta \to -\infty} \frac{\partial_\delta \int_{\mathcal{M}} p(x')\,\mathcal{N}_d(x - x'; 0, \delta)\,\mathrm{d}x'}{\int_{\mathcal{M}} p(x')\,\mathcal{N}_d(x - x'; 0, \delta)\,\mathrm{d}x'} && (30)\\
&= \lim_{\delta \to -\infty} \frac{\int_{\mathcal{M}} p(x')\,\partial_\delta \mathcal{N}_d(x - x'; 0, \delta)\,\mathrm{d}x'}{\int_{\mathcal{M}} p(x')\,\mathcal{N}_d(x - x'; 0, \delta)\,\mathrm{d}x'} && (31)\\
&= \lim_{\delta \to -\infty} \frac{\int_{\mathcal{M}} p(x')\left(-d + \|x - x'\|^2 e^{-2\delta}\right)\mathcal{N}_d(x - x'; 0, \delta)\,\mathrm{d}x'}{\int_{\mathcal{M}} p(x')\,\mathcal{N}_d(x - x'; 0, \delta)\,\mathrm{d}x'} && (32)\\
&= -d + \lim_{\delta \to -\infty} \frac{e^{-2\delta}\int_{\mathcal{M}} p(x')\,\|x - x'\|^2\,\mathcal{N}_d(x - x'; 0, \delta)\,\mathrm{d}x'}{\int_{\mathcal{M}} p(x')\,\mathcal{N}_d(x - x'; 0, \delta)\,\mathrm{d}x'}. && (33)
\end{aligned}$$
Note that going from Equation 30 to Equation 31 involves moving the derivative inside the integral. We formally justify this step in Appendix A. We then leverage the following two propositions, whose proofs we also include in Appendix A.

Proposition 1. Under the assumptions of Theorem 1,
$$\lim_{\delta \to -\infty} \int_{\mathcal{M}} p(x')\,\mathcal{N}_d(x - x'; 0, \delta)\,\mathrm{d}x' = p(x). \tag{34}$$

Note that Proposition 1 is very similar to the limit of a Gaussian being a delta function, i.e. for a function $f : \mathbb{R}^d \to \mathbb{R}$ and $y \in \mathbb{R}^d$, we have that $\int_{\mathbb{R}^d} f(y')\,\mathcal{N}_d(y - y'; 0, \delta)\,\mathrm{d}y' \to f(y)$ as $\delta \to -\infty$. However, the differences here are that: (i) we are computing the convolution on a manifold $\mathcal{M}$, and (ii) the distance in the Gaussian is the distance of the ambient space, $\mathbb{R}^D$.
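The following is a small numerical check of Proposition 1 (ours, not from the paper) for the unit circle in $\mathbb{R}^3$, where $p \equiv 1/(2\pi)$ with respect to arc length: the Monte Carlo estimate of $p^{\mathcal{N}}_\delta(x)$ approaches $p(x)$ as $\delta$ decreases, even though the Gaussian is evaluated on ambient Euclidean distances.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1, 2_000_000
theta = rng.uniform(0.0, 2 * np.pi, size=n)
samples = np.stack([np.cos(theta), np.sin(theta), np.zeros(n)], axis=1)   # circle in R^3
x = np.array([1.0, 0.0, 0.0])

def p_N_delta(delta):
    # Monte Carlo estimate of p_delta^N(x) = E_{x'~p}[N_d(x - x'; 0, delta)],
    # where the norm inside N_d is the ambient Euclidean distance in R^3.
    sq = np.sum((x - samples) ** 2, axis=1)
    return np.mean((2 * np.pi) ** (-d / 2) * np.exp(-d * delta - 0.5 * sq * np.exp(-2 * delta)))

for delta in [-1.0, -2.0, -3.0, -4.0]:
    print(delta, p_N_delta(delta))    # approaches p(x) = 1/(2*pi), roughly 0.159, as delta decreases
```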
Proposition 2. Under the assumptions of Theorem 1,
$$\lim_{\delta \to -\infty} e^{-2\delta} \int_{\mathcal{M}} p(x')\,\|x - x'\|^2\,\mathcal{N}_d(x - x'; 0, \delta)\,\mathrm{d}x' = d\, p(x). \tag{35}$$

The expected squared norm of a centred Gaussian is well-known to be the trace of its covariance, i.e. $\int_{\mathbb{R}^d} \|y - y'\|^2\,\mathcal{N}_d(y - y'; 0, \delta)\,\mathrm{d}y' = d e^{2\delta}$. Intuitively, Proposition 2 can be understood as stating that this property still holds as $\delta \to -\infty$ under (i) and (ii), with $p(x')$ again leaving the integral as $p(x)$ because the Gaussian behaves like a delta function in this regime. Applying these two propositions to Equation 33 yields Equation 27, which in turn finishes the proof of Theorem 1.

Proof of Corollary 1. Under the conditions of Corollary 1, $\varrho_{\mathcal{N}}(x, \delta)$ is given by
$$\varrho_{\mathcal{N}}(x, \delta) = \sum_j \pi_j \int_{\mathcal{M}_j} p_j(x')\,\mathcal{N}_D(x - x'; 0, \delta)\,\mathrm{d}x'. \tag{36}$$
Following an analogous derivation to the one leading to Equation 33, we get
$$\lim_{\delta \to -\infty} \partial_\delta \log \varrho_{\mathcal{N}}(x, \delta) = -D + \lim_{\delta \to -\infty} \frac{\sum_j \pi_j \int_{\mathcal{M}_j} p_j(x')\,\|x - x'\|^2 e^{-2\delta}\,\mathcal{N}_D(x - x'; 0, \delta)\,\mathrm{d}x'}{\sum_j \pi_j \int_{\mathcal{M}_j} p_j(x')\,\mathcal{N}_D(x - x'; 0, \delta)\,\mathrm{d}x'}. \tag{37}$$
Using basic calculus to maximize $\mathcal{N}_D(x - x'; 0, \delta)$ and $\|x - x'\|^2 e^{-2\delta}\,\mathcal{N}_D(x - x'; 0, \delta)$ with respect to $\|x - x'\|$ subject to $\|x - x'\| \geq \xi$ yields that, when this constraint holds,
$$\mathcal{N}_D(x - x'; 0, \delta) \leq C^{\mathcal{N}}_D\, e^{-D\delta} \exp\!\left(-\tfrac{1}{2}\xi^2 e^{-2\delta}\right), \tag{38}$$
and that, if additionally $\delta \leq \log\xi - \tfrac{1}{2}\log 2$,
$$\|x - x'\|^2 e^{-2\delta}\,\mathcal{N}_D(x - x'; 0, \delta) \leq \xi^2 e^{-2\delta}\, C^{\mathcal{N}}_D\, e^{-D\delta} \exp\!\left(-\tfrac{1}{2}\xi^2 e^{-2\delta}\right). \tag{39}$$
The bound in Equation 38 decreases monotonically to 0 as $\delta \to -\infty$, and the bound in Equation 39 does so as well provided that $\delta < \log\xi - \tfrac{1}{2}\log(D + 2)$ (which makes the derivative of the log of the bound with respect to $\delta$ positive). It follows that $\mathcal{N}_D(x - x'; 0, \delta)$ and $\|x - x'\|^2 e^{-2\delta}\,\mathcal{N}_D(x - x'; 0, \delta)$ are both uniformly bounded over $x'$ and $\delta$ when $x' \in \mathcal{M}_j$ for $j \neq i$ and $\delta < \min\!\left(\log\xi - \tfrac{1}{2}\log 2,\; \log\xi - \tfrac{1}{2}\log(D + 2)\right)$. The dominated convergence theorem then guarantees that, for $j \neq i$:
$$\lim_{\delta \to -\infty} \int_{\mathcal{M}_j} p_j(x')\,\mathcal{N}_D(x - x'; 0, \delta)\,\mathrm{d}x' = 0 = \lim_{\delta \to -\infty} \int_{\mathcal{M}_j} p_j(x')\,\|x - x'\|^2 e^{-2\delta}\,\mathcal{N}_D(x - x'; 0, \delta)\,\mathrm{d}x'. \tag{40}$$
Lastly, since $x \in \mathcal{M}_i$, Proposition 1 ensures that $\lim_{\delta \to -\infty} \int_{\mathcal{M}_i} p_i(x')\,\mathcal{N}_D(x - x'; 0, \delta)\,\mathrm{d}x' > 0$, so that Equation 37 reduces to
$$\lim_{\delta \to -\infty} \partial_\delta \log \varrho_{\mathcal{N}}(x, \delta) = -D + \lim_{\delta \to -\infty} \frac{\int_{\mathcal{M}_i} p_i(x')\,\|x - x'\|^2 e^{-2\delta}\,\mathcal{N}_D(x - x'; 0, \delta)\,\mathrm{d}x'}{\int_{\mathcal{M}_i} p_i(x')\,\mathcal{N}_D(x - x'; 0, \delta)\,\mathrm{d}x'} \tag{41}$$
$$= \lim_{\delta \to -\infty} \partial_\delta \log \int_{\mathcal{M}_i} p_i(x')\,\mathcal{N}_D(x - x'; 0, \delta)\,\mathrm{d}x' = d_i - D, \tag{42}$$
where the last equality follows from Theorem 1.

5.2 Uniform Case: Theorem 2 and Corollary 2

Proof of Theorem 2 (informal). In analogy with Theorem 1, the proof of Theorem 2 hinges on the following relationship between $\mathcal{U}_D(x; \mu, \delta)$ and $\mathcal{U}_d(x; \mu, \delta)$:
$$\mathcal{U}_D(x; \mu, \delta) = C^{\mathcal{U}}_D (C^{\mathcal{U}}_d)^{-1}\, e^{\delta(d-D)}\, \mathcal{U}_d(x; \mu, \delta). \tag{43}$$
Thus we have
$$\varrho_{\mathcal{U}}(x, \delta) = \int_{\mathcal{M}} p(x')\,\mathcal{U}_D(x - x'; 0, \delta)\,\mathrm{d}x' = C^{\mathcal{U}}_D (C^{\mathcal{U}}_d)^{-1}\, e^{\delta(d-D)} \int_{\mathcal{M}} p(x')\,\mathcal{U}_d(x - x'; 0, \delta)\,\mathrm{d}x'. \tag{44}$$
To simplify notation once again, we let
$$p^{\mathcal{U}}_\delta(x) := \int_{\mathcal{M}} p(x')\,\mathcal{U}_d(x - x'; 0, \delta)\,\mathrm{d}x', \tag{45}$$
so that
$$\partial_\delta \log \varrho_{\mathcal{U}}(x, \delta) = d - D + \partial_\delta \log p^{\mathcal{U}}_\delta(x). \tag{46}$$
Thus it suffices to show that
$$\lim_{\delta \to -\infty} \partial_\delta \log p^{\mathcal{U}}_\delta(x) = 0. \tag{47}$$
Despite seeming extremely similar to the Gaussian case from Equation 27, the proof of Equation 47 is not completely analogous. We now provide some intuition as to why this equation holds. Note that $p^{\mathcal{U}}_\delta(x) = C^{\mathcal{U}}_d\, e^{-d\delta}\,\mathbb{P}_{X \sim p}(X \in B_D(x, e^\delta))$. Intuitively, since $p$ is continuous at $x$ it can be locally approximated by a constant, meaning that when $\delta \to -\infty$, we should expect $\mathbb{P}_{X \sim p}(X \in B_D(x, e^\delta))$ to behave like a constant times the volume (in $\mathcal{M}$) of $\mathcal{M} \cap B_D(x, e^\delta)$. Since $\mathcal{M}$ is $d$-dimensional, we should also expect $\mathcal{M} \cap B_D(x, e^\delta)$ to behave as a $d$-dimensional ball as $\delta \to -\infty$, i.e. to have a volume proportional to $e^{d\delta}$. Putting these intuitions together, we get that $p^{\mathcal{U}}_\delta(x)$ behaves like a constant as $\delta \to -\infty$, thus yielding Equation 47. We formalize the above intuition in Appendix B by providing a rigorous proof of Equation 47; this result completes the proof of Theorem 2.

Proof of Corollary 2. Under the assumptions of Corollary 2, $\varrho_{\mathcal{U}}(x, \delta)$ is given by
$$\varrho_{\mathcal{U}}(x, \delta) = \sum_j \pi_j \int_{\mathcal{M}_j} p_j(x')\,\mathcal{U}_D(x - x'; 0, \delta)\,\mathrm{d}x'. \tag{48}$$
Whenever $\delta < \log\xi$, if $x' \in \mathcal{M}_j$ for $j \neq i$, we have that $\mathcal{U}_D(x - x'; 0, \delta) = 0$ since $x \in \mathcal{M}_i$, in which case
$$\varrho_{\mathcal{U}}(x, \delta) = \pi_i \int_{\mathcal{M}_i} p_i(x')\,\mathcal{U}_D(x - x'; 0, \delta)\,\mathrm{d}x'. \tag{49}$$
By Theorem 2, it follows that
$$\lim_{\delta \to -\infty} \partial_\delta \log \varrho_{\mathcal{U}}(x, \delta) = \lim_{\delta \to -\infty} \partial_\delta \log \int_{\mathcal{M}_i} p_i(x')\,\mathcal{U}_D(x - x'; 0, \delta)\,\mathrm{d}x' = d_i - D. \tag{50}$$

6 Conclusions and Future Work

In this paper we established a formal link between LID and the rate of change of the densities of different convolutions. The Gaussian case is of particular interest as it provides formal justification for FLIPD, a state-of-the-art LID estimator. The uniform case provides an interesting connection between LID and the probability assigned to Euclidean balls, and rigorously justifies older LID estimators.

One avenue for future work would be extending our result beyond Gaussian and uniform distributions. For example, despite losing direct relevance in machine learning, finding sufficient conditions on the convolving distributions for our result to hold is an interesting mathematical problem. In particular, we hypothesize that, under adequate regularity conditions, Theorem 1 and Theorem 2 can be generalized to cases where the noise distribution that $p$ is convolved against obeys a decomposition like those of Equation 23 or Equation 43.

Another interesting avenue for future work is extending LID estimation to flow matching methods (Lipman et al., 2023; Albergo & Vanden-Eijnden, 2023; Liu et al., 2023; Tong et al., 2024). These methods aim to train a vector field $\hat{v}$ such that solving the ordinary differential equation
$$\mathrm{d}Y_t = \hat{v}(Y_t, 1 - t)\,\mathrm{d}t, \quad Y_0 \sim p(\cdot, 1), \tag{51}$$
results in $Y_1$ being distributed according to the data distribution $p(\cdot, 0)$. Flow matching methods differ in how they obtain such a $\hat{v}$. For example, when using the so-called linear interpolation method (Lipman et al., 2023), $\hat{v}$ is equivalent to an affine transformation of a score function $\hat{s}$ (see Appendix D.3 in Kingma & Gao (2023)), which means that FLIPD can be trivially applied in this setting. However, not all flow matching methods produce a $\hat{v}$ from which one can easily recover a corresponding score function $\hat{s}$ to plug into FLIPD; for example, we should expect this to be the case when using several rounds of rectified flows (Liu et al., 2023) or optimal transport regularization (Tong et al., 2024). We believe that producing a tractable estimate of $\mathrm{LID}(x)$ while given access only to a successfully trained $\hat{v}$ from any flow matching method is an interesting research problem.
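As a hedged illustration of the preceding point, the helper below (hypothetical, not from the paper or any library) recovers a score function from a velocity field under the assumption that $\hat{v}$ coincides with the probability-flow velocity of the SDE in Equation 10 with $\alpha(x, t) = \gamma(t)x$, in which case $v(x, t) = \gamma(t)x - \tfrac{1}{2}\beta^2(t)\,s(x, t)$; other flow matching parameterizations lead to different affine coefficients, and some admit no such simple inversion.

```python
def score_from_velocity(v_fn, gamma, beta):
    # Assumes v_fn(x, t) = gamma(t) * x - 0.5 * beta(t)**2 * s(x, t), i.e. the
    # probability-flow velocity of Equation 10 with alpha(x, t) = gamma(t) x.
    # v_fn, gamma and beta are user-supplied; works on NumPy arrays or torch tensors.
    def s_fn(x, t):
        return 2.0 * (gamma(t) * x - v_fn(x, t)) / beta(t) ** 2
    return s_fn
```

A score recovered this way could, in principle, be plugged into a FLIPD evaluation like the one sketched in Section 2.2; the open problem discussed above concerns vector fields for which no such affine relation is available.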
Lastly, although our results indeed justify FLIPD, they do so in the setting where the score function is perfectly learned, i.e. $\hat{s} = s$. Despite DMs being highly performant, they never perfectly recover the true score function, and discrepancies between $\hat{s}$ and $s$ can result in
$$\lim_{\delta \to -\infty} \mathrm{FLIPD}(x; \delta) = \mathrm{LID}(x) \tag{52}$$
being violated. Indeed, Kamkari et al. (2024b) reported that in some instances, FLIPD can produce negative estimates of LID. Since Kamkari et al. (2024b) only proved Equation 1 when $\mathcal{M}$ is an affine submanifold of $\mathbb{R}^D$, the possibility remained that this negativity was due to the result not holding more generally; a consequence of Corollary 1 is that negative estimates can only be caused by $\hat{s}$ having imperfectly learned $s$. We believe that quantifying how much $\mathrm{FLIPD}(x; \delta_0)$ differs from $\mathrm{LID}(x)$ as a function of the error incurred by $\hat{s}$ to approximate $s$ is another relevant problem. In particular, note that if we had bounds on both the error between $s$ and $\hat{s}$, and on the error between $\nabla s$ and $\nabla\hat{s}$, as $\delta \to -\infty$, we could trivially obtain a bound on $|\nu(s, x, t(\delta)) - \nu(\hat{s}, x, t(\delta))|$ (see Equation 15); in turn, this would provide a bound on the LID estimation error. While we are aware of learning theory work bounding the error between $s$ and $\hat{s}$ (Chen et al., 2023; Oko et al., 2023; Tang & Yang, 2024; Wang et al., 2024), we are unaware of any existing bound on the error between $\nabla s$ and $\nabla\hat{s}$; finding such a bound is another interesting direction for future work.

Acknowledgments

We thank Hamidreza Kamkari for useful conversations during the early stages of this research.

References

Michael S Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In International Conference on Learning Representations, 2023. Alastair Anderberg, James Bailey, Ricardo JGB Campello, Michael E Houle, Henrique O Marques, Miloš Radovanović, and Arthur Zimek. Dimensionality-aware outlier detection. In Proceedings of the 2024 SIAM International Conference on Data Mining, pp. 652-660, 2024. Brian DO Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313-326, 1982. Alessio Ansuini, Alessandro Laio, Jakob H Macke, and Davide Zoccolan. Intrinsic dimension of data representations in deep neural networks. In Advances in Neural Information Processing Systems, 2019. Michael Arbel, Liang Zhou, and Arthur Gretton. Generalized energy based models. In International Conference on Learning Representations, 2021. Jonathan Bac, Evgeny M Mirkes, Alexander N Gorban, Ivan Tyukin, and Andrei Zinovyev. Scikit-dimension: A python package for intrinsic dimension estimation. Entropy, 23(10), 2021. Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013. Robi Bhattacharjee, Sanjoy Dasgupta, and Kamalika Chaudhuri. Data-copying in generative models: a formal framework. In International Conference on Machine Learning, 2023. Tolga Birdal, Aaron Lou, Leonidas J Guibas, and Umut Simsekli. Intrinsic dimension, persistent homology and generalization in neural networks. In Advances in Neural Information Processing Systems, 2021. Johann Brehmer and Kyle Cranmer. Flows for simultaneous manifold learning and density estimation. In Advances in Neural Information Processing Systems, 2020. Bradley CA Brown, Jordan Juravsky, Anthony L Caterini, and Gabriel Loaiza-Ganem. Relating regularization and generalization through the intrinsic dimension of activations. arXiv:2211.13239, 2022. Bradley CA Brown, Anthony L Caterini, Brendan Leigh Ross, Jesse C Cresswell, and Gabriel Loaiza-Ganem. Verifying the union of manifolds hypothesis for image data. In International Conference on Learning Representations, 2023. Anthony L Caterini, Gabriel Loaiza-Ganem, Geoff Pleiss, and John P Cunningham. Rectangular flows for manifold learning. In Advances in Neural Information Processing Systems, 2021. Minshuo Chen, Kaixuan Huang, Tuo Zhao, and Mengdi Wang. Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data. In International Conference on Machine Learning, 2023. Siyi Chen, Huijie Zhang, Minzhe Guo, Yifu Lu, Peng Wang, and Qing Qu. Exploring low-dimensional subspace in diffusion models for controllable image editing.
In Advances in Neural Information Processing Systems, 2024. Jesse C Cresswell, Brendan Leigh Ross, Gabriel Loaiza-Ganem, Humberto Reyes-Gonzalez, Marco Letizia, and Anthony L Caterini. Calo Man: Fast generation of calorimeter showers with density estimation on learned manifolds. In Neur IPS Workshop on Machine Learning and the Physical Sciences, 2022. Edmond Cunningham, Adam D Cobb, and Susmit Jha. Principal component flows. In International Conference on Machine Learning, 2022. Bin Dai and David Wipf. Diagnosing and enhancing VAE models. In International Conference on Learning Representations, 2019. Valentin De Bortoli. Convergence of denoising diffusion models under the manifold hypothesis. Transactions on Machine Learning Research, 2022. Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In International Conference on Learning Representations, 2017. Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. In Advances in Neural Information Processing Systems, 2019. Elena Facco, Maria d Errico, Alex Rodriguez, and Alessandro Laio. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific Reports, 7(1):12140, 2017. Harley Flanders. Differentiation under the integral sign. The American Mathematical Monthly, 80(6):615 627, 1973. Keinosuke Fukunaga and David R Olsen. An algorithm for finding intrinsic dimensionality of data. IEEE Transactions on Computers, 100(2):176 183, 1971. Peter Grassberger and Itamar Procaccia. Measuring the strangeness of strange attractors. Physica D: Nonlinear Phenomena, 9(1-2):189 208, 1983. UG Haussmann and E Pardoux. Time reversal of diffusions. The Annals of Probability, 14(4):1188 1205, 1986. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020. Christian Horvat and Jean-Pascal Pfister. Denoising normalizing flow. In Advances in Neural Information Processing Systems, 2021. Christian Horvat and Jean-Pascal Pfister. On gauge freedom, conservativity and intrinsic dimensionality estimation in diffusion models. In International Conference on Learning Representations, 2024. Published in Transactions on Machine Learning Research (10/2025) Michael E Houle, Erich Schubert, and Arthur Zimek. On the correlation between local intrinsic dimensionality and outlierness. In Similarity Search and Applications: 11th International Conference, SISAP 2018, pp. 177 191. Springer, 2018. Kerstin Johnsson, Charlotte Soneson, and Magnus Fontes. Low bias local intrinsic dimension estimation from expected simplex skewness. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37 (1):196 202, 2014. Hamidreza Kamkari, Brendan Leigh Ross, Jesse C Cresswell, Anthony L Caterini, Rahul G Krishnan, and Gabriel Loaiza-Ganem. A geometric explanation of the likelihood OOD detection paradox. In International Conference on Machine Learning, 2024a. Hamidreza Kamkari, Brendan Leigh Ross, Rasa Hosseinzadeh, Jesse C Cresswell, and Gabriel Loaiza-Ganem. A geometric view of data complexity: Efficient local intrinsic dimension estimation with diffusion models. In Advances in Neural Information Processing Systems, 2024b. Hyeongju Kim, Hyeonseung Lee, Woo Hyun Kang, Joun Yeop Lee, and Nam Soo Kim. Softflow: Probabilistic framework for normalizing flow on manifolds. In Advances in Neural Information Processing Systems, 2020. Diederik P Kingma and Ruiqi Gao. 
Understanding diffusion objectives as the ELBO with simple data augmentation. In Advances in Neural Information Processing Systems, 2023. Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014. John M Lee. Introduction to Smooth Manifolds. Springer, 2nd edition, 2012. John M Lee. Introduction to Riemannian Manifolds. Springer, 2nd edition, 2018. Elizaveta Levina and Peter Bickel. Maximum likelihood estimation of intrinsic dimension. In Advances in Neural Information Processing Systems, 2004. Yaron Lipman, Ricky T Q Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In International Conference on Learning Representations, 2023. Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In International Conference on Learning Representations, 2023. Gabriel Loaiza-Ganem, Brendan Leigh Ross, Jesse C Cresswell, and Anthony L Caterini. Diagnosing and fixing manifold overfitting in deep generative models. Transactions on Machine Learning Research, 2022a. Gabriel Loaiza-Ganem, Brendan Leigh Ross, Luhuan Wu, John Patrick Cunningham, Jesse C Cresswell, and Anthony L Caterini. Denoising deep generative models. In Proceedings on "I Can t Believe It s Not Better! - Understanding Deep Learning Through Empirical Falsification" at Neur IPS 2022 Workshops, 2022b. Gabriel Loaiza-Ganem, Brendan Leigh Ross, Rasa Hosseinzadeh, Anthony L. Caterini, and Jesse C. Cresswell. Deep generative models through the lens of the manifold hypothesis: A survey and new connections. Transactions on Machine Learning Research, 2024. Xingjun Ma, Bo Li, Yisen Wang, Sarah M Erfani, Sudanthi Wijewickrema, Grant Schoenebeck, Dawn Song, Michael E Houle, and James Bailey. Characterizing adversarial subspaces using local intrinsic dimensionality. In International Conference on Learning Representations, 2018. David JC Mac Kay and Zoubin Ghahramani. Comments on Maximum likelihood estimation of intrinsic dimension by E. Levina and P. Bickel (2004). The Inference Group Website, Cavendish Laboratory, Cambridge University, 2005. German Magai and Anton Ayzenberg. Topology and geometry of data manifold in deep learning. ar Xiv:2204.08624, 2022. Published in Transactions on Machine Learning Research (10/2025) Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. Reliable fidelity and diversity metrics for generative models. In International Conference on Machine Learning, 2020. Kazusato Oko, Shunta Akiyama, and Taiji Suzuki. Diffusion models are minimax optimal distribution estimators. In International Conference on Machine Learning, 2023. Karl W Pettis, Thomas A Bailey, Anil K Jain, and Richard C Dubes. An intrinsic dimensionality estimator from near-neighbor information. IEEE Transactions on pattern analysis and machine intelligence, (1): 25 37, 1979. Jakiw Pidstrigach. Score-based generative models detect manifolds. In Advances in Neural Information Processing Systems, 2022. Phillip Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The intrinsic dimension of images and its impact on learning. In International Conference on Learning Representations, 2021. Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014. 
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022. Brendan Leigh Ross and Jesse C Cresswell. Tractable density estimation on learned manifolds with conformal embedding flows. In Advances in Neural Information Processing Systems, 2021. Brendan Leigh Ross, Gabriel Loaiza-Ganem, Anthony L Caterini, and Jesse C Cresswell. Neural implicit manifold learning for topology-aware generative modelling. Transactions on Machine Learning Research, 2023. Brendan Leigh Ross, Hamidreza Kamkari, Tongzi Wu, Rasa Hosseinzadeh, Zhaoyan Liu, George Stein, Jesse C Cresswell, and Gabriel Loaiza-Ganem. A geometric framework for understanding memorization in generative models. In International Conference on Learning Representations, 2025. Simo Särkkä and Arno Solin. Applied stochastic differential equations. Cambridge University Press, 2019. Bernard W Silverman. Density estimation for statistics and data analysis, 1986. Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, 2015. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. Peter Sorrenson, Felix Draxler, Armand Rousselot, Sander Hummerich, Lea Zimmerman, and Ullrich Köthe. Lifting architectural constraints of injective flows. In International Conference on Learning Representations, 2024. Jan Stanczuk, Georgios Batzolis, Teo Deveney, and Carola-Bibiane Schönlieb. Diffusion models encode the intrinsic dimension of data manifolds. In International Conference on Machine Learning, 2024. Rong Tang and Yun Yang. Adaptivity of diffusion models to manifold structures. In International Conference on Artificial Intelligence and Statistics, 2024. Piotr Tempczyk, Rafał Michaluk, Lukasz Garncarek, Przemysław Spurek, Jacek Tabor, and Adam Golinski. LIDL: Local intrinsic dimension estimation using approximate likelihood. In International Conference on Machine Learning, 2022. Published in Transactions on Machine Learning Research (10/2025) Alexander Tong, Kilian Fatras, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Guy Wolf, and Yoshua Bengio. Improving and generalizing flow-based generative models with minibatch optimal transport. Transactions on Machine Learning Research, 2024. Eduard Tulchinskii, Kristian Kuznetsov, Laida Kushnareva, Daniil Cherniavskii, Sergey Nikolenko, Evgeny Burnaev, Serguei Barannikov, and Irina Piontkovskaya. Intrinsic dimension estimation for robust detection of ai-generated texts. In Advances in Neural Information Processing Systems, 2023. Enrico Ventura, Beatrice Achilli, Gianluigi Silvestri, Carlo Lucibello, and Luca Ambrogioni. Manifolds, random matrices and spectral gaps: The geometric phases of generative diffusion. In International Conference on Learning Representations, 2025. Peng Wang, Huijie Zhang, Zekai Zhang, Siyi Chen, Yi Ma, and Qing Qu. Diffusion models learn lowdimensional distributions via subspace clustering. ar Xiv:2409.02426, 2024. Mingtian Zhang, Peter Hayes, Thomas Bird, Raza Habib, and David Barber. Spread divergence. In International Conference on Machine Learning, 2020. 
Yijia Zheng, Tong He, Yixuan Qiu, and David P Wipf. Learning manifold dimensions with conditional variational autoencoders. In Advances in Neural Information Processing Systems, 2022. Published in Transactions on Machine Learning Research (10/2025) A Proofs for the Gaussian Case The following claim provides justification for taking the derivative inside the integral from Equation 30 to Equation 31. Claim 1. Under the assumptions of Theorem 1, M p(x )Nd(x x ; 0, δ)dx = Z δ Nd(x x ; 0, δ)dx . (53) Proof. Let δ R, and let δa be such that δa < δ. We begin by bounding δNd(x x ; 0, δ): δ Nd(x x ; 0, δ)| = | d + x x 2e 2δ|CN d e dδ exp 1 2 x x 2e 2δ (54) (d + x x 2e 2δa)CN d e dδa. (55) This bound is integrable due to the finite second moment assumption: Z M p(x )(d + x x 2e 2δa)CN d e dδadx Z M p(x )(d + dist M(x, x )2e 2δa)CN d e dδadx < . (56) The result then follows from the Leibniz integral rule. In the rest of this section we prove Proposition 1 and Proposition 2. Note that we share notation through the rest of the section, i.e. some of the notation we introduce in the proof of Proposition 1 is also used after this proof. Proof of Proposition 1. Recall that we want to prove that M p(x )Nd(x x ; 0, δ)dx = p(x). (57) To compute this limit, without loss of generality, we perform a translation so that x = 0. Note that this translation preserves distance. Then perform an orthogonal transformation on RD such that the tangent plane of M at x = 0 is the d-subspace spanned by the first d coordinate vectors. Denote these coordinates by (y1, . . . , yd). As M is smooth, M can be parametrized by {y, u(y)} near x = 0, where y = (y1, . . . , yd) and u is a function from some open U Rd to RD d, such that u(0) = 0 and (Du)(0) = 0.5 In other words, (y1, . . . , yd) is a local coordinate system of M on U, with the induced Riemannian metric g. Note that g = Id at x = 0 as (Du)(0) = 0. This means that U is diffeomorphic to an open set in M via ϕ(x) = (x, u(x)). Shrink U if necessary such that U is bounded, i.e. R := supx ϕ(U) x < . Further shrink U if necessary such that BD(0, R) M = ϕ(U). In other words, x M, x < R x ϕ(U). We are going to compute the limit on some open V ϕ(U) in M and M \ V . Let RV := supx V x < . Then for x M \ V , for any a R, we have eaδNd( x ; 0, δ) = (2π) d/2e δ(d a) exp 1 2 x 2e 2δ (58) (2π) d/2e δ(d a) exp 1 2R2 V e 2δ , (59) which approaches zero as δ . This means that for any ϵ > 0, there exists KV,a such that δ < KV,a implies eaδNd( x ; 0, δ) < ϵ. 5To avoid using non-standard notation for derivatives, we overload notation and use D to refer to both the derivative operator and the ambient dimension. The meaning of D will always be clear from context. Published in Transactions on Machine Learning Research (10/2025) If x V , in the coordinate system (y1, . . . , yd), we have Nd( x ; 0, δ) represented by ˆN(y, δ) := (2π) d/2e δd exp 1 2( y 2 + u(y) 2)e 2δ . (60) For simplicity, let N(y, δ) := (2π) d/2e δd exp 1 2 y 2e 2δ . (61) Note that N(y, δ) is the standard multivariate gaussian distribution with covariance e2δId. It is clear that ˆN(y, δ) N(y, δ). Now, consider the function v(y) = u(y) y . Note that v(y) is continuous everywhere in U \ {0}. Since u and the derivatives of u vanish at y = 0, from the definition of derivative, we have 0 = lim y 0 u(y) u(0) (Du)(0)y y = lim y 0 u(y) y = lim y 0 v(y). Thus v(y) can be extended to a continuous function in U. Now let KV = max y ϕ 1(V ) v(y). Then we have u(y) 2 K2 V y 2. 
Thus ˆN(y, δ) = (2π) d/2e δd exp 1 2( y 2 + u(y) 2)e 2δ (62) (2π) d/2e δd exp 1 2 y 2(1 + K2 V )e 2δ (63) = edδ0(2π) d/2e (δ+δ0)d exp 1 2 y 2e 2(δ+δ0) (64) = (1 + K2 V ) d/2N(y, δ + δ0), (65) where δ0 = 1 2 log(1 + K2 V ). To summarize, we have (1 + K2 V ) d/2N(y, δ + δ0) ˆN(y, δ) N(y, δ). (66) Now, for any open V ϕ(U) in M, we have M p(x )Nd( x ; 0, δ)dx (67) M\V p(x )Nd( x ; 0, δ)dx + Z V p(x )Nd( x ; 0, δ)dx . (68) For any ϵ > 0, whenever δ < KV,0 we have, Z M\V p(x )Nd( x ; 0, δ)dx Z M\V p(x )ϵdx ϵ Z M p(x )dx = ϵ, (69) as p is a probability density on M. This shows that M\V p(x )Nd( x ; 0, δ)dx = 0. (70) Note that this is true for any open V ϕ(U) in M. On the other hand, in the local coordinate system (y1, . . . , yd), we have Z V p(x )Nd( x ; 0, δ)dx = Z ϕ 1(V ) p(y) p g(y) ˆN(y, δ)dy. (71) Published in Transactions on Machine Learning Research (10/2025) Here g(y) = det(gij(y)) is continuous on ϕ 1(V ) with g(0) = 1. From Equation 66, we have (1 + K2 V ) d/2 Z ϕ 1(V ) p(y) p g(y)N(y, δ + δ0)dy (72) ϕ 1(V ) p(y) p g(y) ˆN(y, δ)dy (73) ϕ 1(V ) p(y) p g(y)N(y, δ)dy. (74) Since N(y, δ) is the usual multivariate Gaussian distribution, it converges to the delta function. Thus, taking δ , we have (1+K2 V ) d/2p(0) lim inf δ ϕ 1(V ) p(y) p g(y) ˆN(y, δ)dy lim sup δ ϕ 1(V ) p(y) p g(y) ˆN(y, δ)dy p(0), (75) g(0) = 1. But we have lim inf δ p N δ (0) = lim inf δ M p(x )Nd( x ; 0, δ)dx (76) M\V p(x )Nd( x ; 0, δ)dx + lim inf δ V p(x )Nd( x ; 0, δ)dx (77) = lim inf δ V p(x )Nd( x ; 0, δ)dx (78) = lim inf δ ϕ 1(V ) p(y) p g(y) ˆN(y, δ)dy, (79) and similarly for lim sup. This means that (1 + K2 V ) d/2p(0) lim inf δ p N δ (0) lim sup δ p N δ (0) p(0). (80) Note that these inequalities hold for all V ϕ(U). First notice that the function f(z) := (1 + z2) d/2 is continuous at z = 0, that f(0) = 1, and that f(z) 1. For any ϵ > 0, these exists κ > 0 such that whenever |z| < κ, we have 1 (1 + z2) d/2 < ϵ. As v(y) is continuous at y = 0 and v(0) = 0, there exists κ > 0 such that whenever y < κ , we have v(y) < κ/2. Thus let V := ϕ(Bd(0, κ ) U), we have KV κ/2 < κ and thus (1 + K2 V ) d/2 > 1 ϵ. Thus we have p(0)(1 ϵ) lim inf δ p N δ (0) lim sup δ p N δ (0) p(0) (81) for every ϵ > 0. This shows that lim δ p N δ (0) = p(0). (82) Before proving Proposition 2, we need to establish some basic tools. Claim 2. Let Bd(0, r) be the closed ball in Rd centred at 0 of radius r. Then Rd\Bd(0,r) y 2e 2δN(y, δ)dy = 0. (83) Proof. Without loss of generality, assume δ < log r 1 2 log (d + 2), then the integrand will decrease as δ decreases (the derivative of the log of the integrand with respect to δ is positive) and it converges pointwise to 0. It follows that the integrand is bounded for negative enough δ, and the result follows from the dominated convergence theorem. Published in Transactions on Machine Learning Research (10/2025) Claim 3. Let W be an open and bounded subset of Rd with 0 W. Let f : W R continuous at 0 and bounded on W. Then we have W f(y) y 2e 2δN(y, δ)dy = f(0)d. (84) Proof. For simplicity, let H(y, δ) = y 2e 2δN(y, δ). First we note that N(y, δ) is the density of multivariate Gaussian distribution with covariance matrix e2δId. This means that Z Rd y 2N(y, δ)dy = ENd(y;0,δ)[ y 2] = Tr(e2δId) = de2δ. (85) As f is continuous at 0, for any ϵ > 0, there exists r > 0 such that y < r |f(y) f(0)| < ϵ. Let K be an upper bound of f in W. Using Claim 2, we can pick K such that when δ < K we have R Rd\Bd(0,r) H(y, δ) < ϵ. 
Then for δ < K , we have W f(y)H(y, δ)dy f(0)d (86) W f(y)H(y, δ)dy Z Rd f(0)H(y, δ)dy (87) Bd(0,r) (f(y) f(0))H(y, δ)dy + Z W \Bd(0,r) f(y)H(y, δ)dy Z Rd\Bd(0,r) f(0)H(y, δ)dy Bd(0,r) H(y, δ)dy + K Z W \Bd(0,r) H(y, δ)dy + |f(0)| Z Rd\Bd(0,r) H(y, δ)dy (89) Rd H(y, δ)dy + K Z Rd\Bd(0,r) H(y, δ)dy + |f(0)| Z Rd\Bd(0,r) H(y, δ)dy (90) dϵ + Kϵ + |f(0)|ϵ (91) =ϵ(d + K + |f(0)|). (92) This proves the claim. Proof of Proposition 2. Recall that we want to prove that lim δ e 2δ Z M p(x ) x x 2Nd(x x ; 0, δ)dx = d p(x). (93) We perform the same translation and orthogonal transformation as in the proof of Proposition 1, which ensures that x = 0 and that the tangent plane of M at x = 0 is the d-subspace spanned by the first d coordinate vectors. We once again consider an open V ϕ(U) in M such that 0 V , and break the integral on V and on M \ V . For any ϵ > 0, whenever δ < KV, 2 we have, Z M\V p(x ) x 2e 2δNd( x ; 0, δ)dx Z M\V p(x ) x 2ϵdx (94) M\V p(x )dist2 M(x , 0)dx (95) M p(x )dist2 M(x , 0)dx (96) Here we used the assumption that the expected squared distance from x = 0 is finite. This shows that M\V p(x ) x 2e 2δNd( x ; 0, δ)dx = 0. (98) Published in Transactions on Machine Learning Research (10/2025) Note that this is true for any open V ϕ(U) containing x = 0 in M. On the other hand, in the local coordinate system (y1, . . . , yd), we have V p(x ) x 2Nd( x ; 0, δ)dx = Z ϕ 1(V ) p(y) p g(y)( y 2 + u(y) 2) ˆN(y, δ)dy. (99) By Claim 3, we have that ϕ 1(V ) p(y) p g(y)( y 2 + u(y) 2) ˆN(y, δ)dy (100) ϕ 1(V ) p(y) p g(y) y 2(1 + v(y)2) ˆN(y, δ)dy (101) ϕ 1(V ) p(y) p g(y) y 2(1 + v(y)2)N(y, δ)dy (102) g(0)(1 + v(0)2)d (103) =p(0)d (104) as δ . Here we use the fact that f(y) := p(y) p g(y)(1 + v(y)) is bounded, because ϕ 1(V ) is compact. The lower bound will be ϕ 1(V ) p(y) p g(y)( y 2 + u(y) 2) ˆN(y, δ)dy (105) ϕ 1(V ) p(y) p g(y) y 2(1 + v(y)2) ˆN(y, δ)dy (106) (1 + K2 V ) d/2e 2δ Z ϕ 1(V ) p(y) p g(y) y 2(1 + v(y)2)N(y, δ + δ0)dy (107) (1 + K2 V ) d/2p(0) p g(0)(1 + v(0)2)d (108) =(1 + K2 V ) d/2p(0)d (109) as δ . As in the proof of Proposition 1, we can bound the limit inferior and limit superior and shrink V as we like, thus finishing the proof. B Proofs for the Uniform Case Proof of Theorem 2. Recall that it is enough to show that lim δ δ log p U δ (x) = 0. (110) We begin by performing the same rotation and translation as in Appendix A, and considering the same coordinate vectors, parameterization of M, and open set U. Note that we have p U δ (0) = Z M p(x )Ud( x ; 0, δ)dx = CU d e dδ Z M BD(0,eδ) p(x )dx . (111) Assume that δ is negative enough for M BD(x, eδ) ϕ(U), which is open in M. Thus, in the coordinate system (y1, . . . , yd), we have p U δ (0) = CU d e dδ Z g(y)dy, (112) Published in Transactions on Machine Learning Research (10/2025) where Vδ := ϕ 1(M BD(x, eδ)) = {y : y 2 + u(y) 2 < e2δ}, g(y) = det(gij(y)) is continuous on Vδ, and g(0) = 1. Now, we have δ log p U δ (0) = 1 p U δ (0) δ p U δ (0) (113) = 1 p U δ (0) dp U δ (0) + CU d e dδ g(y)dy (114) g(y)dy . (115) So, we must prove that g(y)dy = d. (116) Now, let Aδ := δ R Vδ dy be the derivative of V ol(Vδ) with respect to δ. Since p g is continuous at y = 0, we have g(y)dy = lim h 0 1 h Vδ+h p(y) p = lim h 0 1 h Vδ,h p(y) p g(y)dy, (118) where Vδ,h = {y : e2δ < y 2+ u(y) 2 < e2(δ+h)}. In what follows we will consider the case where h > 0; the case where h < 0 is analogous. 
Note that for small enough e2δ and h, Vδ,h is diffeomorphic to a d-dimensional annulus, as U is a local chart for ϕ(U), and for small enough open set, the intersection of the ball in RD with M is diffeomorphic to the ball in Rd. Now for any ϵ > 0, there exists κ > 0 such that |p(y) p g(y) p(0)| < ϵ whenever y < κ, since g(0) = 1. Choose K such that when δ < K we have Vδ,h Bd(0, κ). Note that a priori K depends on h. But since we are interested in the behaviour as h 0, we can choose K so that the condition holds for h < h0 for some h0 > 0. This means that for δ < K , we have 1 h Vδ,h (p(y) p g(y) p(0))dy Vδ,h |p(y) p g(y) p(0)|dy < ϵ Vδ,h dy. (119) This means that p(0) ϵ Vδ,h dy < 1 Vδ,h p(y) p g(y)dy < p(0) + ϵ Vδ,h dy. (120) Taking h 0, we have g(y)dy (p(0) + ϵ)Aδ. (121) Note that a similar result as above for the same κ and K applies to the denominator of Equation 116, i.e. without the derivatives, (p(0) ϵ)V ol(Vδ) < Z g(y)dy < (p(0) + ϵ)V ol(Vδ). (122) Putting Equation 121 and Equation 122 together, we get that, for δ < K : (p(0) ϵ)Aδ (p(0) + ϵ)V ol(Vδ) < g(y)dy < (p(0) + ϵ)Aδ (p(0) ϵ)V ol(Vδ). (123) Published in Transactions on Machine Learning Research (10/2025) To calculate Aδ, we use the Leibniz integral rule in its most general form (Flanders, 1973), Vδ ι Φddy + Z Vδ ι Φdy + Z Vδ ι Φdy, (124) as dy is closed and independent of δ. Here ι Φ denotes the interior product with Φ, which is the vector field of the velocity, i.e. if Φδ is a family of diffeormorphisms from a fixed domain V to Vδ, then Φ is the derivative of Φδ with respect to δ. Thus, in what follows we construct a family of diffeomorphisms from a fixed domain to Vδ so that we can then apply the Leibniz integral rule. We first establish some ground work for Vδ. Note that Vδ = {y : G(y) < δ}, where G(y) := 1 2 log( y 2 + u(y) 2). Then we have G yi = yi + P j ujuj,i y 2 + u 2 , (125) where we use the notation that any indices after a comma indicate a derivative, i.e. uj,i = uj yi . This means G = y + (Du)u y 2 + u 2 . (126) Proposition 3. There exists an open set W containing 0 such that G is well-defined and non-vanishing on W \ {0}. To show this proposition, first we establish the following facts: Claim 4. Let f : Rd R be smooth such that f(0) = 0. Then f(y) = P i yigi(y) for some smooth functions gi. Proof. Note that f(y) = f(y) f(0) = Z 1 tf(ty)dt = Z 1 i yi f yi f(ty)dt. (127) Thus gi = R 1 0 f yi f(ty)dt and gi is smooth. Corollary 3. With u(y) as defined above, we have u(y) = P i,j Aij(y)yiyj for some smooth functions Aij : Rd RD d. Proof. Since u(0) = 0, we have uk : Rd R such that uk(0) = 0. Thus, by Claim 4, uk(y) = P j gkj(y)yj for some smooth gkj. Since all the derivatives of u vanish at 0, for any k, l, we have 0 = uk,l(0) = gkl(0) + P j gkj,l(0) 0 = gkl(0). Thus, once again by Claim 4, gkl(y) = P j hklj(y)yj for smooth hklj. Thus we have uk(y) = P i gki(y)yi = P i,j hkij(y)yiyj. Thus (Aij)k = hkij, which completes the proof. Claim 5. The functions u u y y , y (Du)u y y , and u (Du) (Du)u y y can be extended to C1 functions on Vδ, such that each function and their first derivatives vanish at y = 0. Proof. Note that since u and y 2 are smooth, the only point at which we have to prove the above functions are C1 is at y = 0. Using Einstein summation notation, we first notice that since uj = Ajklykyl from Corollary 3, we have uj,i = Ajkl,iykyl + Ajkiyk + Ajilyl, (128) which is a polynomial in yi of degree at least 1. 
For u u, we have u u = uiui = Aijk Ailmyjykylym, (129) where every term in this polynomial in yi is of degree at least 4. Published in Transactions on Machine Learning Research (10/2025) For y (Du)u, we have y (Du)u = yiuj,iuj = Ajklyiykyl(Ajkl,iykyl + Ajkiyk + Ajilyl), (130) where every term in this polynomial in yi is also of degree at least 4. For u (Du) (Du)u, we have u (Du) (Du)u = ui,jujui,kuk = ui,jui,k Ajlmylym Akstysyt, (131) where again, every term in this polynomial in yi is of degree at least 4. Let P(y) be a polynomial containing no terms of degree less than 4 (with coefficients being smooth functions on Vδ), and Q(y) := P(y)/ y 2. It is easy to see that limy 0 Q(y) = 0. And thus Q(y) can be extended continuously to y = 0 with Q(0) = 0. As P(y) contains no terms whose degree is less than 4, we also have limy 0 |Q(y) Q(0)| y = 0. By the definition of differentiability, we have that Q is differentiable at y = 0 and Q(0) = 0. Now, for y = 0, we have Q y 4 , where P1(y) is a polynomial such that every term is of degree at least 5. Thus Q yi 0 as y 0. This shows that Q is C1. Proof of Proposition 3. This denominator of Equation 126 is y 2 + u 2 = y 2(1 + u u y y ). Since u u as y 0, there is an open set around y = 0 where 1 + u u y y = 0. Thus for any y in this open set such that y = 0, G is well-defined. The squared-norm of the numerator of Equation 126 is y + (Du)u 2 = (y + (Du)u) (y + (Du)u) = y y 1 + 2y (Du)u y y + u (Du) (Du)u Since, by Claim 5, both 2y (Du)u y y and u (Du) (Du)u y y approach 0 as y 0, there is an open set around y = 0 such that y +(Du)u 2 = 0 in this open set when y = 0. This means that there exists an open set W around y = 0 such that G is well-defined and non-vanishing on W \ {0}. As we are only interested in the behaviour of Aδ when δ , we can choose a specific δ small enough such that Vδ W. Now, for any δ < δ , we have Vδ Vδ , given by the sub-level sets of G. Now choose δa , δa, δb, δb such that δa < δa < δ < δb < δb < δ . Now, define the vector field X := ψ(y) G G 2 , where ψ is a smooth function satisfying ψ(y) = 1 if δa G(y) δb 0 if G(y) < δa or G(y) > δb (133) and 0 ψ(y) 1. Then X is a smooth vector field that vanishes outside a compact set, which implies that X is a complete vector field. This means that integrating X gives rise to a one-parameter group of differomophisms ϕt : Rd Rd such that ϕt = X. Now consider the map t 7 G(ϕt(y)). We have t G(ϕt(y)) = G ϕt(y) = G ψ(ϕt(y)) G G 2 = ψ(ϕt(y)). (134) As ψ(ϕt(y)) is always between 0 and 1, the map t 7 G(ϕt(y)) is a non-decreasing function of t. Claim 6. If t 0, then G(y) G(ϕt(y)) G(y) + t. (135) Proof. We have G(ϕt(y)) G(y) = G(ϕt(y)) G(ϕ0(y)) = Z t d dτ G(ϕτ(y))dτ = Z t 0 ψ(ϕτ(y))dτ. (136) Note that 0 ψ 1. Thus for t 0, we have 0 G(ϕt(y)) G(y) t. Published in Transactions on Machine Learning Research (10/2025) Claim 7. Let y be such that G(y) = δ [δa, δb]. Then for t [0, δb δ], we have G(ϕt(y)) = G(y) + t. Proof. By Claim 6, if t [0, δb δ], we have G(ϕt(y)) [G(y), G(y) + t] [δa, δb]. This means that ψ(ϕt(y)) = 1 for t [0, δb δ]. Thus we can replace the inequality with equality: G(ϕt(y)) G(y) = Z t 0 ψ(ϕτ(y))dτ = Z t 0 dτ = t. (137) Now, for any δ [δa, δb], we define Φδ := ϕδ δa. Proposition 4. Φδ is a diffeomorphism from Vδa to Vδ. Proof. Let y Vδa, i.e. G(y) < δa. By virtue of Claim 6, G(Φδ(y)) = G(ϕδ δa(y)) G(y) + δ δa < δ. Thus Φδ(y) Vδ. Conversely, let y Vδ and let y := ϕδa δ(y), so that Φδ(y ) = y. It suffices to show that y Vδa, i.e. G(y ) < δa. Assume the contrary, G(y ) δa. 
Note that G(y ) δ because t 7 G(ϕt(y)) is non-decreasing. Then we have G(y) = G(Φδ(y )) = G(ϕδ δa(y )) G(ϕδ G(y )(y )) = G(y ) + δ G(y ) = δ, (138) which is a contradiction. Here we used that t 7 G(ϕt(y)) is a non-decreasing function for the inequality, and Claim 7 for the second last equality. Proposition 4 establishes that {Φδ : δ [δa, δb]} is a family of diffeomorphisms from a fixed domain Vδa to Vδ. We can finally invoke the Leibniz integral rule from Equation 124 with Φ = G G 2 on Vδ. Note that G = y + (Du)u y 2 + u 2 , G 2 = y + (Du)u 2 ( y 2 + u 2)2 , G G 2 = H(y)(y + (Du)u), (139) where H(y) := y 2+ u 2 y+(Du)u 2 . Note that as of now, H(y) is defined on Vδ. We would like to extend H(y) to Vδ. We can rewrite H(y) as H(y) = y y + u u y y + 2y (Du)u + u (Du) (Du)u = 1 + u u y y 1 + y (Du)u y y + u (Du) (Du)u y y . (140) Thanks to Claim 5, H(y) can be extended to Vδ as a C1 function with H(0) = 1. This means that G G 2 can be extended to Vδ as a C1 vector field. Now we apply the generalized Stokes theorem, followed by the definition of divergence, to get Vδ ι Φdy = Z Vδ ι G G 2 dy = Z Vδ dι G G 2 dy = Z Vδ div G G 2 dy. (141) It suffices to calculate div G G 2 = div H(y)(y + (Du)u). Using Einstein summation notation, we have div H(y)(y + (Du)u) = [H(y)(yi + uj,iuj)],i = H(y)(d + uj,iiuj + uj,iuj,i) + Hi(y)(yi + uj,iuj), (142) which is continuous on Vδ. At y = 0, we have div G G 2 (0) = H(0)(d + 0 + 0) + Hi(0)(0 + 0) = d. (143) Thus for ϵ > 0, there exists κ > 0 such that if y < κ , then (div G G 2 )(y) d < ϵ. Choose K such that Vδ Bd(0, κ ) for δ < K . Then |Aδ d V ol(Vδ)| Z (y) d dy < ϵV ol(Vδ). (144) Published in Transactions on Machine Learning Research (10/2025) This means that (d ϵ)V ol(Vδ) < Aδ < (d + ϵ)V ol(Vδ). (145) Combining Equation 123 and Equation 145, we have (p(0) ϵ)(d ϵ) g(y)dy < (p(0) + ϵ)(d + ϵ) p(0) ϵ (146) for δ < K = min(K , K ). Letting ϵ 0, we have δ , which entails Equation 116 and thus concludes the proof of Theorem 2.