Journal of Machine Learning Research 26 (2025) 1-55 Submitted 1/21; Revised 2/25; Published 2/25

Manifold Fitting under Unbounded Noise

Zhigang Yao zhigang.yao@nus.edu.sg
Department of Statistics and Data Science, National University of Singapore, 21 Lower Kent Ridge Road, Singapore 117546

Yuqing Xia yuqing.xia@zufe.edu.cn
School of Data Sciences, Zhejiang University of Finance and Economics, Hangzhou 310018, China

Editor: Ji Zhu

(c) 2025 Zhigang Yao and Yuqing Xia. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/21-0039.html.

Abstract

In the field of non-Euclidean statistical analysis, a trend has emerged in recent times of attempting to recover a low-dimensional structure, namely a manifold, underlying high-dimensional data. Recovering the manifold requires the noise to be of a certain concentration, and prevailing methods address this requirement by constructing an approximated manifold based on the tangent space estimation at each sample point. Although theoretical convergence for these methods is guaranteed, the samples are assumed to be either noiseless or corrupted by bounded noise. However, if the noise is unbounded, as is commonplace, the tangent space estimation at the noisy samples will be blurred, which is undesirable because a manifold fitted from blurred tangent spaces suffers a greater loss of accuracy. In this paper, we introduce a new manifold-fitting method, whereby the output manifold is constructed by directly estimating the tangent spaces at the projected points on the latent manifold, rather than at the sample points, thus reducing the error caused by the noise. Assuming the noise is unbounded, our new method achieves theoretical convergence with high probability, in terms of an upper bound on the distance between the estimated and the latent manifold. The smoothness of the estimated manifold is also evaluated by bounding its second-order difference from above. Numerical simulations are conducted to validate our theoretical findings and to demonstrate the advantages of our method over other relevant manifold-fitting methods. Finally, our method is applied to real data examples.

Keywords: Manifold learning, Riemannian embedding, Convergence, Smoothness

1. Introduction

Linearity has been viewed as a cornerstone in the development of statistical methodology. For decades, prominent progress in statistics has been made with regard to linearizing the data and the way we analyze them. More recently, the phenomenon of high-throughput data, which share a high-dimensional characteristic in their varying forms, has become more commonplace. Although each data point usually represents itself as a long vector or a large matrix, in principle they all can be viewed as points on or near an intrinsic manifold. Moreover, modern data sets no longer comprise samples of real vectors in a real vector space but samples of much more complex structures, assuming values in spaces that are naturally not (Euclidean) vector spaces. We are witnessing an explosion in the volume of complex data with a geometric structure and, therefore, a growing need for statistical analysis that utilizes the nature of the data space. The manifold hypothesis has been carefully studied in Fefferman et al. (2016).
Here, we only present several relevant examples to convey that hypothesis intuitively: high-dimensional data samples tend to lie near a lower-dimensional manifold embedded in the ambient space. The classical Coil20 dataset (Nene et al., 1996), which contains images of 20 objects, may be used as an example. For each object, images are taken every 5 degrees as the object is rotated on a turntable, and each image is of size 32 × 32. In this case, the dimension of the ambient space is the number of pixels, which is 1024, while the latent intrinsic structure can be compactly described by the angle of rotation. In addition to Coil20, such a structure can be found in many other data collections. In seismology, two-dimensional coordinates of earthquake epicenters are located along a one-dimensional fault line. In face recognition, high-dimensional facial images depend on lighting conditions (Georghiades et al., 2002) or head orientations (Happy et al., 2012). Given this form of data collection, a natural problem arises: how can we fit a manifold to this data collection?

The aim of manifold fitting is to represent the latent manifold as an embedded sub-manifold of the ambient space. Once the latent manifold is learned, various types of analyses can be carried out based on it, such as denoising the observed samples by projecting them to the learned manifold (Gong et al., 2010), generating new data samples from the manifold (Radford et al., 2015), classifying samples according to the manifold (Yao and Zhang, 2020), and detecting fault lines for seismological purposes (Yao et al., 2024). These manifold-based techniques represent powerful tools for understanding and working with complex data structures.

In addition to manifold fitting, dimension reduction constitutes another crucial branch of manifold learning. Over the past two decades, a litany of dimension reduction methods have emerged, each aiming to uncover the intrinsic structure of data by identifying its lower-dimensional embedding, as discussed in the review by Ma and Fu (2011). Unlike manifold fitting, however, these methods primarily focus on mapping data from the ambient space to a lower-dimensional one. Consequently, the outputs of most dimension reduction methods consist of low-dimensional embeddings rather than points in the ambient space, although for applications such as denoising and data generation, relying solely on low-dimensional embeddings may not suffice.

The limitations of dimension reduction and the potential applications of manifold fitting underline the value of formulating the manifold-fitting problem, as follows. Suppose the observed data samples $\mathcal{X} = \{x_i \in \mathbb{R}^D\}_{i=1}^N$ are of the form $x_i = y_i + \xi_i$, where $y_1, \ldots, y_N$ are unobserved variables drawn from the uniform distribution supported on the latent manifold $\mathcal{M}$ with dimension $d < D$. Generally, $\mathcal{M}$ is assumed to be a compact and smooth sub-manifold embedded in the ambient space $\mathbb{R}^D$. The precise conditions on $\mathcal{M}$ will be detailed in Section 2.1. The assumption that the $y_i$ are sampled uniformly from $\mathcal{M}$ is the same as in the related works (Genovese et al., 2012c, 2014; Mohammed and Narayanan, 2017). Here, $\xi_1, \ldots, \xi_N$ are drawn from a distribution $G$. The assumptions about the noise distribution $G$ differ among the related works. The simplest assumption is that the observed samples are noiseless (Fefferman et al., 2016; Mohammed and Narayanan, 2017).
However, some literature assumes that the noise is distributed in a bounded region centered at the origin, which means that the observed samples are located in a tube centered at $\mathcal{M}$ (Genovese et al., 2012b). Other literature, such as Genovese et al. (2012c, 2014); Fefferman et al. (2018), assumes $G$ to be a Gaussian distribution supported on $\mathbb{R}^D$, whose density at $\xi$ is
$$\left(\frac{1}{2\pi\sigma^2}\right)^{D/2} \exp\left(-\frac{\|\xi\|_2^2}{2\sigma^2}\right). \tag{1.1}$$
The tail of the Gaussian distribution might make the theoretical analysis more challenging than in the previous two cases. Strictly speaking, previous manifold-fitting methods have not directly addressed this problem, nor have they proved the convergence of the fitted manifold under this assumption. In this paper, we are concerned with the Gaussian assumption on the noise distribution, denoting it as $G_\sigma$ hereafter to stress the deviation parameter $\sigma$. Under the above settings, the goal is to produce a smooth manifold $\mathcal{M}_{out}$ convergent to $\mathcal{M}$. Specifically, if $\sigma$ is sufficiently small, one could derive $\mathcal{M}_{out}$ such that $d(x, \mathcal{M}) \le O(\sigma)$ holds for any arbitrary $x \in \mathcal{M}_{out}$. In particular, $\mathcal{M}_{out}$ converges to $\mathcal{M}$ when $\sigma \to 0$. This convergence with respect to $\sigma$ is the setting on which Fefferman et al. (2018) is built, although its final theoretical result is expressed through sample complexity.

1.1 Related work

Methodological studies of manifold fitting can be traced back to works from several decades ago on the principal curve (Hastie and Stuetzle, 1989), with every point on the principal curve/surface defined as the conditional mean of the points in the orthogonal subspace of the principal curve. Based on Hastie and Stuetzle (1989), many other principal-curve algorithms have been proposed, such as those of Banfield and Raftery (1992); Stanford and Raftery (2000); Verbeek et al. (2002), each attempting to achieve lower estimation bias and improved robustness. More recently, Ozertem and Erdogmus (2011) describe the principal curve in a seemingly different, probabilistic way. In the work by Ozertem and Erdogmus (2011), every point on the principal curve/surface is the local maximum, not the expected value, of the probability density in the local orthogonal subspace. This definition of the principal curve/surface is formulated as a ridge of the probability density. Although it has been demonstrated that these proposed methods produce acceptably accurate estimates in many simulated cases, they do not provide a theoretical analysis of the estimation accuracy or of the curvature of the output manifold in general cases, with the exception of special cases such as elliptical distributions.

Recently, some works have focused on the theoretical analysis of manifold fitting. In particular, Genovese et al. (2012b) and Genovese et al. (2012c) establish upper bounds on the Hausdorff distance between the output and latent manifolds under various noise settings, although they do not offer any practical estimators. Genovese et al. (2012a) proposes an estimator which is computationally simple and whose convergence is guaranteed, but its conclusions hold only when the noise is supported on a compact set. Genovese et al. (2014) focuses on the ridge of the probability density introduced by Ozertem and Erdogmus (2011), and proposes a convergent algorithm. It is worth noting that the data in Genovese et al. (2014) were assumed to be blurred by homogeneous Gaussian noise, an assumption that is more general than that made in Genovese et al. (2012a).
Boissonnat and Ghosh (2014) propose an algorithm based upon Delaunay complexes, whose convergence was analyzed by Aamari and Levrard (2018). Aamari and Levrard (2019) presented an algorithm to estimate a point on the manifold, its tangent space and second fundamental form. Based on these, they approximated the latent manifold by a union of polynomial patches and gave convergence rates for noise-free and tubular noise models. Aizenbud and Sober (2021) present an algorithm with proven convergence to the manifold and its tangent bundle, even with tubular noise. However, none of the methods outlined above are guaranteed to output an actual $d$-dimensional manifold with certain smoothness.

To overcome this issue, some studies on manifold fitting have sought to determine how curved the output manifold is. In the spirit of Ozertem and Erdogmus (2011) and Genovese et al. (2014), Fefferman et al. (2016) and Mohammed and Narayanan (2017) also took the ridge set into consideration, the former focusing on theoretical analysis, the latter on practical algorithms. Specifically, rather than focusing on the probability density function, they both chose to work with approximate squared-distance functions (asdf), approximating the latent manifold by the ridge of the asdf. Theoretical bounds for manifold fitting have also been considered in Fefferman et al. (2016) and Mohammed and Narayanan (2017), but only for noiseless data; that is, as long as the asdf meets certain regularity conditions, the researchers show that the output of the algorithm is a manifold with bounded reach, and the output manifold is arbitrarily close in Hausdorff distance to the latent one. To deal with manifold fitting with noise, Fefferman et al. (2018) proposes a new approach to fit a putative manifold under Gaussian noise. Unlike other methods, which use the entire sample set, the method of Fefferman et al. (2018) involves subsampling first, such that the number of used samples can be bounded above by $e^{D}$. Under this constraint, the noise is supported on a bounded set with high probability. Given this, the approach of Fefferman et al. (2018) is feasible with bounded noise, although the constraint on the sample size is problematic in that the upper bound does not go to zero even when a sufficient number of samples is available and the variance of the Gaussian noise diminishes. Therefore, the problem is not essentially addressed when the support of the noise is unbounded, and the manifold-fitting problem remains open, especially on the theoretical side.

1.2 Motivation

In this paper, we attempt to evaluate the convergence and smoothness of $\mathcal{M}_{out}$. Of the aforementioned works, those of Mohammed and Narayanan (2017) and Fefferman et al. (2018) are most relevant to our study. This section explains these two methods geometrically and analyzes their limitations, which impels us to establish a more accurate manifold-fitting method.

The left panel of Figure 1 illustrates the method of Mohammed and Narayanan (2017), whose essence is to define an approximate squared-distance function (asdf) to $\mathcal{M}$ and to estimate $\mathcal{M}$ by the ridge set of the asdf. Specifically, for a given point $x$, the asdf at $x$ is defined as the weighted average of squared distances from $x$ to the discs (the dashed lines) centered at the sample points (the blue and red dots).

Figure 1: A toy example to illustrate the methods by Mohammed and Narayanan (2017) (left panel) and Fefferman et al. (2018) (right panel), where the black curve is a local part of $\mathcal{M}$, $x$ is a point off $\mathcal{M}$, and the dots $x_i$ and $x_j$ represent two samples in the neighborhood of $x$. Unlike those in the right panel, the samples in the left panel are on $\mathcal{M}$, as Mohammed and Narayanan (2017) focus on the noiseless case.

The right panel of Figure 1 illustrates the method of Fefferman et al. (2018).
Its key idea is to approximate the bias from $x$ to $\mathcal{M}$ for any arbitrary $x$ and to define the output manifold as the set of points with zero bias. To obtain the approximation of the bias from $x$ to $\mathcal{M}$, Fefferman et al. (2018) calculates the weighted average bias from $x$ to the discs (the dashed lines) centered at the sample points, and projects the average bias by the estimated orthogonal projection onto the normal space of $\mathcal{M}$ at $x$ (the gray solid line).

Due to the usage of discs centered at the sample points, the effectiveness of each method depends on just how accurately these discs capture the local structure of the manifold. However, if the sample points are significantly perturbed by unbounded noise and deviate significantly from the latent manifold $\mathcal{M}$, these discs also deviate far from $\mathcal{M}$. Hence, in this scenario with unbounded noise, the methods proposed by Mohammed and Narayanan (2017) and Fefferman et al. (2018) may encounter difficulties in fitting a manifold. Even if the sample points lie on the manifold, say $x_i \in \mathcal{M}$, the disc centered at $x_i$ captures the local manifold at $x_i$ rather than the local manifold at $x^*$. The deviation between $x_i$ and $x^*$ also introduces an approximation error in the distance/bias from $x$ to $\mathcal{M}$. As shown in Figure 1, both the solid red line and the solid red arrow are shorter than the distance/bias from $x$ to $\mathcal{M}$, and averaging the red and blue ones cannot address this issue. The limitations of the two aforementioned methods suggest that estimating the manifold based on discs centered at sample points may reduce the fitting accuracy, especially in scenarios with unbounded noise. It is this finding that compels us to develop a new manifold-fitting method.

1.3 Main contribution

From a statistical viewpoint, there is a pressing need for a practical estimator with theoretical bounds satisfying the following requirements simultaneously, thereby improving on the requirements of Fefferman et al. (2018):

- The support of the noise is unbounded.
- The estimator shares a similar geometric property to $\mathcal{M}$.
- For any arbitrary $x \in \mathcal{M}_{out}$, the distance between $x$ and $\mathcal{M}$ is bounded above provided $N$ is sufficiently large and $\sigma$ is sufficiently small. In particular, the distance goes to zero as the noise disappears.
- The smoothness of $\mathcal{M}_{out}$ is mathematically guaranteed.

In this paper, we propose a novel approximation $f(x)$ to the bias from any point $x$ to $\mathcal{M}$ and fit the latent manifold $\mathcal{M}$ in the ambient space as the set of points with $f(x) = \mathbf{0}$, where $\mathbf{0}$ denotes the zero vector. Practically, such an output manifold can be achieved by minimizing $\|f(x)\|_2^2$ via gradient descent. This paper provides two main contributions. The first is a theoretical analysis satisfying the four requirements above, as follows:

- The noise is assumed to be drawn from the Gaussian distribution $G_\sigma$ defined in (1.1).
- Any arbitrary neighborhood of $\mathcal{M}_{out}$ is a $d$-dimensional manifold.
- For any $x \in \mathcal{M}_{out}$, $d(x, \mathcal{M}) \le O(\sigma)$ given a large-enough dataset. Thus, $\mathcal{M}_{out}$ converges to $\mathcal{M}$ for an increasingly large sample size and diminishing noise.
- The second-order difference of $\mathcal{M}_{out}$ is bounded above by $O(1/\sigma)$.

The second important contribution of this paper is the performance of our estimator in practice.
As illustrated in Figures 1 and 2, the bias from a point $x$ to $\mathcal{M}$ is approximated better than by the other two relevant methods. Numerical results in Section 5 demonstrate the improved performance, which further suggests that our method outputs a better approximation to the latent manifold.

1.4 Organization

The rest of the paper is organized as follows. Section 2 includes the formulation of our approximation $\mathcal{M}_{out}$ to the latent manifold $\mathcal{M}$. After that, the convergence and smoothness of $\mathcal{M}_{out}$ are analyzed in Theorem 5 and Theorem 7, respectively. Section 3 studies the function $f$ defined in (2.6) and determines the properties of its kernel space and of its first and second derivatives. Based on these properties of $f$, the proofs of Theorem 5 and Theorem 7 are derived in Section 4, with numerical examples listed in Section 5.

2. Proposed method

2.1 Content and notations

Throughout this paper, the latent manifold is denoted as $\mathcal{M}$ and our approximation to $\mathcal{M}$ is denoted as $\mathcal{M}_{out}$. For a set $A \subset \mathbb{R}^D$ and a point $x \in \mathbb{R}^D$, $\Pi_A x$ denotes the projection of $x$ onto $A$, namely the nearest point in $A$ to $x$. Hence $\Pi_{\mathcal{M}} x$ is the projection of $x$ onto the latent manifold. If there is no ambiguity, we may write $x^*$ instead of $\Pi_{\mathcal{M}} x$ for simplicity. The distance between $x$ and $A$, denoted by $d(x, A)$, is the Euclidean distance between $\Pi_A x$ and $x$. For any $x^* \in \mathcal{M}$, $T_{x^*}\mathcal{M}$ denotes the tangent space of $\mathcal{M}$ at $x^*$ and $\Pi_{x^*}$ denotes the orthogonal projection onto the normal space of $\mathcal{M}$ at $x^*$. We will make frequent use of the lower-case symbols $c, c_0, c_1$, etc. and upper-case symbols $C, C_0, C_1$, etc. in the rest of this paper, with the lower-case symbols denoting generic constants less than 1, while the upper-case symbols denote generic constants greater than 1. Values of the generic constants may change from line to line. By constants, we mean they are independent of the radius $r$, the standard deviation $\sigma$, or $x$, while they may depend on other constants used to characterize the manifold, such as the reach of $\mathcal{M}$. We denote by $B_D(x, r)$ the Euclidean ball in $\mathbb{R}^D$ centered at $x$ of radius $r$, which defines a neighborhood of $x$. The index set $I_{x,r}$ is defined as the set of indices of the sample points in $B_D(x, r)$, and $|I_{x,r}|$ denotes the cardinality of $I_{x,r}$. As given in (1.1), $\sigma$ represents the standard deviation of the noise. Throughout this paper, we assume
$$r = O(\sqrt{\sigma}), \qquad \sigma < 1 \tag{2.1}$$
without loss of generality; otherwise the data could be rescaled so that $\sigma < 1$ holds. Here $r = O(\sqrt{\sigma})$ means that there exist constants $c$ and $C$ such that $c\sqrt{\sigma} \le r \le C\sqrt{\sigma}$. Noticing $\sigma < 1$, we obtain
$$r \le C\sqrt{\sigma} < C. \tag{2.2}$$
This means $r$ can be bounded above by a certain constant. In subsequent proofs, under the premise that it does not affect the final precision of the concluding upper bound, we will relax some occurrences of $r$ to $C$ in order to simplify the proofs.

The latent manifold $\mathcal{M}$ is assumed to be boundaryless, compact, $d$-dimensional, and twice differentiable, with its reach bounded below by $\tau > 0$. The concept of reach is a measure of the regularity of the manifold, first introduced by Federer (1959) as follows:

Definition 1 (Reach). Let $\mathcal{M}$ be a closed subset of $\mathbb{R}^D$. The reach of $\mathcal{M}$, denoted by $\mathrm{reach}(\mathcal{M})$, is the largest number $\tau$ with the property that any point at a distance $r < \tau$ from $\mathcal{M}$ has a unique nearest point in $\mathcal{M}$.

An important way to understand the reach is as a second-order quantity when the manifold is treated as a function. Specifically, if $\gamma$ is an arc-length parametrized geodesic of $\mathcal{M}$, then $\|\gamma''(t)\| \le 1/\tau$ for all $t$, according to Niyogi et al. (2008).
As such a second-order quantity, the reach is easily understood as describing how flat the manifold is locally. For example, the reach of a sharp cusp is zero, and the reach of a linear subspace is infinite. Thus, it is natural that the reach measures how close a manifold is to its tangent space locally. The following proposition by Federer (1959) makes this phenomenon precise:

Proposition 2.
$$\mathrm{reach}(\mathcal{M})^{-1} = \sup\left\{\frac{2\, d(y, T_x\mathcal{M})}{\|x - y\|_2^2} \,\Big|\, x, y \in \mathcal{M},\ x \neq y\right\}. \tag{2.3}$$

We emphasize that if $\mathrm{reach}(\mathcal{M}) > 0$, the error between $\mathcal{M}$ and $T_x\mathcal{M}$ at $y$ is of higher order than $\|x - y\|_2$. Thus, in a small-enough neighborhood of $x$, we can estimate $\mathcal{M}$ by $T_x\mathcal{M}$ with negligible error, which is the foundation of our approximation.

The approximation $\mathcal{M}_{out}$ is defined using the noisy sample points $\{x_i\}_{i=1}^N$. The number of sample points should be sufficiently large so that $B_D(x, r)$ contains enough sample points. Proposition 3 states the relationship between $|I_{x,r}|$ and $N$.

Proposition 3. Suppose $x$ satisfies $d(x, \mathcal{M}) \le cr$ with some $c < 1$. There exist constants $c'$ and $C$ such that $|I_{x,r}| \ge c' r^d N$ with probability at least $1 - C/\sqrt{\,\cdot\,}$.

The proof of Proposition 3 is given in Appendix A.1. Based on this proposition, the requirement on $|I_{x,r}|$ can be transformed into a requirement on $N$ for further analysis in later sections. Specifically, $N$ is required to be sufficiently large, of order $O(r^{-(d+2)})$.

2.2 Definition of the approximated manifold

This section introduces a novel method for estimating the bias $f(x)$ from a point $x$ to the latent manifold $\mathcal{M}$, and defines the approximated manifold $\mathcal{M}_{out}$ as the set of points satisfying $f(x) = \mathbf{0}$. Unlike the aforementioned methods, which rely on discs centered at sample points to estimate the bias $f(x)$, here we build upon the fact that a Riemannian manifold can be locally treated as an affine space, and calculate the bias from $x$ to $T_{x^*}\mathcal{M}$ as an equivalent measure of the bias $f(x)$ from $x$ to $\mathcal{M}$. Thus, the key to estimating such a bias is to find an affine space $\{x : \Psi^{\alpha}_{x}(x - b) = \mathbf{0}\}$ approximating $T_{x^*}\mathcal{M}$, where $\Psi^{\alpha}_{x}$ estimates the orthogonal projection onto the normal space at $x^*$ and $b$ estimates a point in $T_{x^*}\mathcal{M}$.

In order to approximate the orthogonal projection onto the normal space of $\mathcal{M}$ at $x^*$, $\Psi^{\alpha}_{x}$ is defined as the weighted average of $\{P_{x_i}\}_{i \in I_{x,r}}$, where $P_{x_i}$ is the orthogonal projection onto the subspace perpendicular to the first $d$ principal components in $B_D(x_i, r')$. Mathematically, $P_{x_i} = V_\perp V_\perp^T$, where $V_\perp$ is the orthogonal complement of $V$ and $V$ is the $D \times d$ matrix whose columns are the eigenvectors corresponding to the largest $d$ eigenvalues of $\sum_{j \in I_{x_i, r'}} (x_j - x_i)(x_j - x_i)^T$. The radius $r'$ should be sufficiently large so that the intersection of $B_D(x_i, r')$ and $\mathcal{M}$ is nonempty. Further analysis in Section 3.1 explains that we need $r' \ge 2r$. As the weighted average of $\{P_{x_i}\}_{i \in I_{x,r}}$,
$$\Psi^{\alpha}_{x} = \Pi_{hi}(A_x), \qquad A_x = \sum_{i \in I_{x,r}} \alpha_i(x)\, P_{x_i}. \tag{2.4}$$
Here, $\Pi_{hi}(A)$ denotes the projection onto the span of the eigenvectors corresponding to the largest $D - d$ eigenvalues of $A$. Specifically, $\Pi_{hi}(A) = VV^T$, where $V$ is a $D \times (D - d)$ matrix whose columns are the eigenvectors corresponding to the largest $D - d$ eigenvalues of $A$. Further, the weights $\alpha_i : \mathbb{R}^D \to \mathbb{R}$ in (2.4) are defined as follows:
$$\tilde{\alpha}_i(x) = \begin{cases} \left(1 - \dfrac{\|x - x_i\|_2^2}{r^2}\right)^{\beta}, & x \in B_D(x_i, r), \\ 0, & \text{otherwise}, \end{cases} \qquad \tilde{\alpha}(x) = \sum_i \tilde{\alpha}_i(x), \qquad \alpha_i(x) = \frac{\tilde{\alpha}_i(x)}{\tilde{\alpha}(x)}, \tag{2.5}$$
with $\beta \ge 2$ a fixed integer guaranteeing that $f(x)$ in (2.6) is twice differentiable.

Under the assumption that a manifold can be approximated well by an affine space locally, the samples in the neighborhood of $x$ lie close to $T_{x^*}\mathcal{M}$, up to the noise.
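To make the constructions in (2.4) and (2.5) concrete, the following is a minimal NumPy sketch of $P_{x_i}$, the weights $\alpha_i(x)$, and $\Psi^{\alpha}_{x}$, assuming the samples are the rows of an $N \times D$ array. It is an illustration under these assumptions rather than the authors' MATLAB implementation; the function names and the brute-force neighborhood search are our own choices.

```python
import numpy as np

def local_normal_projection(X, i, r_prime, d):
    """P_{x_i}: projection onto the orthogonal complement of the first d
    local principal components within B_D(x_i, r')."""
    xi = X[i]
    nbrs = X[np.linalg.norm(X - xi, axis=1) <= r_prime]  # samples in B_D(x_i, r')
    C = (nbrs - xi).T @ (nbrs - xi)                      # sum_j (x_j - x_i)(x_j - x_i)^T
    _, eigvecs = np.linalg.eigh(C)                       # eigenvalues in ascending order
    V = eigvecs[:, -d:]                                  # top-d principal directions
    return np.eye(X.shape[1]) - V @ V.T                  # P_{x_i} = I - V V^T

def weights(x, X, r, beta):
    """Normalized weights alpha_i(x) from (2.5); zero outside B_D(x, r).
    Assumes at least one sample lies in B_D(x, r)."""
    dist2 = np.sum((X - x) ** 2, axis=1)
    base = np.clip(1.0 - dist2 / r ** 2, 0.0, None)
    alpha_tilde = base ** beta
    return alpha_tilde / alpha_tilde.sum()

def Psi_alpha(x, X, r, r_prime, d, beta):
    """Psi^alpha_x = Pi_hi(A_x): projection onto the span of the top (D - d)
    eigenvectors of the weighted average A_x of the local projections P_{x_i}."""
    D = X.shape[1]
    alpha = weights(x, X, r, beta)
    A = sum(alpha[i] * local_normal_projection(X, i, r_prime, d)
            for i in np.nonzero(alpha)[0])               # i ranges over I_{x,r}
    _, eigvecs = np.linalg.eigh(A)
    V = eigvecs[:, -(D - d):]                            # top D - d eigenvectors of A_x
    return V @ V.T

# The estimated bias (2.6) is then f(x) = Psi_alpha(x, ...) @ (x - weights(x, ...) @ X).
```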
Therefore, a convex combination of these samples also lies close to $T_{x^*}\mathcal{M}$. Thus, we can estimate $b$ using the average of the sample points in the neighborhood of $x$. Recalling the weights in (2.5), we take $b = \sum_{i \in I_{x,r}} \alpha_i(x)\, x_i$, the weighted average of the sample points in the neighborhood of $x$. Then the bias from $x$ to the space $\{x : \Psi^{\alpha}_{x}(x - b) = \mathbf{0}\}$ is
$$f : \mathbb{R}^D \to \mathbb{R}^D, \qquad f(x) = \Psi^{\alpha}_{x}\Big(x - \sum_{i \in I_{x,r}} \alpha_i(x)\, x_i\Big). \tag{2.6}$$
Finally, the approximation is defined as
$$\mathcal{M}_{out} = \{x : d(x, \mathcal{M}) \le cr,\ f(x) = \mathbf{0}\}, \quad c < 1, \tag{2.7}$$
that is, the set of points with zero bias. By Definition 11 of Fefferman et al. (2016), the set $\{x : d(x, \mathcal{M}) \le cr\}$ is a manifold. Restricting $f$ to this set, when $\mathbf{0}$ is a regular value, the preimage $f^{-1}(\mathbf{0}) = \mathcal{M}_{out}$ is a smooth submanifold of it. We therefore call $\mathcal{M}_{out}$ the approximated manifold in this paper. A further characterization of the approximated manifold will be discussed in Theorem 4.

The definition of $\mathcal{M}_{out}$ is practical. Theorem 5 in the next section claims that $\mathcal{M}_{out}$ approximates $\mathcal{M}$ to order $O(r^2)$. This means that if we have an initial estimator of $\mathcal{M}$ with error $cr$, then we can achieve a better estimator of $\mathcal{M}$ using the definition of $\mathcal{M}_{out}$. In practice, we minimize $\|f(x)\|_2^2$ via gradient descent starting from the initial estimator, and the output of the gradient descent approximates $\mathcal{M}$ to order $O(r^2)$, which is better than the initial guess.

The advantages of our method are twofold. First, we introduce $\Psi^{\alpha}_{x}$ directly, thus capturing the local structure of the manifold at $x^*$, while the aforementioned methods capture the local structure of the manifold near the sample points. Second, we approximate the manifold using a space passing through $b$ instead of through any sample point. Benefitting from the mutual offset of the noise, $b$ rarely deviates far from $T_{x^*}\mathcal{M}$ even if the noise is unbounded. As a result, we can expect $\{x : \Psi^{\alpha}_{x}(x - b) = \mathbf{0}\}$ to be a better approximation to the local structure of the manifold at $x^*$, which guarantees that the bias from $x$ to $\{x : \Psi^{\alpha}_{x}(x - b) = \mathbf{0}\}$ is a better approximation to the bias from $x$ to $\mathcal{M}$. The toy example in Figure 2 illustrates the superiority of our method. The black arrow in Figure 2 is almost exactly the bias from $x$ to $\mathcal{M}$, while both the average length of the solid lines in the left panel of Figure 1 and the black arrow in the right panel of Figure 1 are shorter than the ideal one.

2.3 Convergence and smoothness of the approximated manifold

In Theorem 4, we prove that any arbitrary neighborhood of $\mathcal{M}_{out}$ is a $d$-dimensional manifold with high probability. In Theorem 5, we characterize the convergence of $\mathcal{M}_{out}$ with probability $\delta_0(1-\delta)^2$, where for convenience we denote
$$\delta_0 = 1 - d\,\exp\!\left\{-\frac{c\, r^{d+2} N}{2\ln 2}\right\}. \tag{2.8}$$

Figure 2: A toy example to illustrate our method. $\Psi^{\alpha}_{x}$ is used to estimate the orthogonal projection onto the normal space of $\mathcal{M}$ at $x^*$, and the black dot $b$ is used to estimate a point in $T_{x^*}\mathcal{M}$. Then the space $\{x : \Psi^{\alpha}_{x}(x - b) = \mathbf{0}\}$, illustrated as the black dashed line, approximates $T_{x^*}\mathcal{M}$, and the bias from $x$ to the black dashed line is the estimated bias from $x$ to $\mathcal{M}$, geometrically illustrated as the black arrow.

When $N$ is sufficiently large, as we assume, $\delta_0$ is a high probability. Theorem 5 tells us that if $r = O(\sqrt{\sigma})$ is sufficiently small, $\mathcal{M}_{out}$ is a good estimator of $\mathcal{M}$. Moreover, Corollary 6 tells us that the approximated manifold $\mathcal{M}_{out}$ converges to the latent manifold $\mathcal{M}$ as $\sigma \to 0$.

Theorem 4. Given $\delta > 0$ and any arbitrary $x \in \mathcal{M}_{out}$, there exists $\epsilon$ such that $\mathcal{M}_{out} \cap B_D(x, \epsilon)$ is a $d$-dimensional manifold with probability $\delta_0(1-\delta)^2\big(1 - (1 - cr^d)^N\big)$.

Theorem 5.
Given δ > 0, there exists a constant C such that d(x, M) Cr2 for any arbitrary x Mout with probability at least δ0(1 δ)2. We point out that Theorem 5 holds assuming σ < 1 and r = O( σ) as (2.1) claims. If we further assume σ 0, we achieve the following corollary: Corollary 6. For any arbitrary x Mout, d(x, M) 0 as σ 0 with probability at least δ0(1 δ)2. Proof Given r = O( σ), there exists C0 such that r = C0 σ. For any ε > 0, let σ = ε CC2 0 , and then d(x, M) Cr2 = CC2 0σ = ε. Given x, y M, the fraction d(y, Tx Mout)/ y x 2 2 characterizes the twice differentiable quantity that controls the local flatness of Mout. Therefore, the lower bound of d(y, Tx Mout)/ y x 2 2 guarantees the smoothness of Mout. Recalling Proposition 2, such a quantity is related to the reach of a manifold, which characterizes the smoothness of a manifold. Manifold Fitting under Unbounded Noise Theorem 7. Given δ > 0, there exists constant c0 < 1 and c < 1 such that z x 2 2 d(z, Tx Mout) c0r for any arbitrary x and z in Mout with probability at least δ2 0(1 δ)4 1 (1 crd)N . The proofs of Theorem 4, Theorem 5 and Theorem 7 are organized in the following way. First we explore the properties of Pxi for given xi through Theorem 11 in subsection 3.1, reveal the properties of weights {αi} through Proposition 12 and discuss the concentration phenomenon of Gaussian noise through Lemma 13 in subsection 3.2. Based on the conclusions above, we prove in Theorem 15 an upper bound on the approximation error of Ψα x, as a weighted sum of {Pxi}i Ix,r , to Πx . Subsequently, in subsection 3.3 and subsection 3.4, we obtain the upper bounds on f(x) 2 and the first and second derivative of f(x) through Theorem 16, Theorem 17 and Theorem 19. Finally, the main conclusions, namely Theorem 4, Theorem 5 and Theorem 7, are proved in Section 4, using the upper bounds regarding f( ) defined by (2.6). The dependency of the above theorems, lemmas, and propositions is demonstrated in Figure 3. Figure 3: The dependency of the core theorems, lemmas, and propositions. 3. Bounds regarding the function f To analyze Ψα x, we first explore the properties of Pxi, where xi is any arbitrary sample point in BD(x, r). Next, properties of Ψα x can be analyzed as the weighted average of {Pxi}i Ix,r. Finally, we successively bound f(x) 2, the first derivative of f(x) and the second derivative of f(x) above using bounds regarding Ψα x. Yao and Xia 3.1 Properties of Pxi To make the notations clearer, we replace xi with z in this section. Recalling the notations in Section 2.1, z is the closest point on M to z and Πz is the orthogonal projection onto the normal space of M at z . The aim of this section is to bound the error Pz Πz F . Figure 4 illustrates the variables used for the discussion of Pz and the related proof. The z (black dot) is an observed noisy point of the manifold M and the blue ball is BD(z, r ), centered at z with radius r . The subsequent proof requires BD(z, r ) M = , which is equal to d(z, M) r . Given d(x, M) cr and z BD(x, r), we have d(z, M) z x 2 z x 2 + x x 2 = z x 2 + d(x, M) (c + 1)r. Therefore, for any r 2r, d(z, M) < r . In this paper, we set r = 2r for convenience. The zi (red dot) is a noisy sample located in BD(z, r ), satisfying zi = yi + ξi, z i = ΠMzi, and pi is the projection of zi onto Tz M. The space T is the translation of Tz M passing z, Figure 4: Diagram of variables used for the discussion of Pz. and p i is the projection of zi onto T. Consider the symmetric matrix Λ = 1 |Iz,r | P i Iz,r (pi z )(pi z )T . 
Since both pi and z are located in Tz M, the spanning space of Λ is contained in Tz M. Thus, rank(Λ) dim(Tz M) = d and thereby the (d+1)-th largest eigenvalue of Λ is 0. Setting columns of U be the eigenvectors of Λ corresponding to the d largest eigenvalues, U is also a basis of Tz M. Let U be the orthogonal complement of U, and we have Πz = U UT . Recalling Pz = V V T , where V is the orthogonal component of V and columns of V are the eigenvectors corresponding to the d largest eigenvalues of ˆΛ = 1 |Iz,r | i Iz,r (zi z)(zi z)T , Pz Πz F = V V T U UT F = V V T UUT F 2 by the following Lemma: Manifold Fitting under Unbounded Noise Lemma 8. Let Λ, ˆΛ Rn n be symmetric, with eigenvalues λ1 λn and ˆλ1 ˆλn respectively. Let 1 d n and assume λd > 0, λd+1 = 0. Let U, ˆU Rn d be eigenvectors corresponding to the first d eigenvalues of Λ and ˆΛ, respectively. Then UUT ˆU ˆUT F = 2 sin θ( ˆU, U) F 2 by the Davis-Kahan sin θ theorem, where θ( ˆU, U) is the n n diagonal matrix, whose diagonal comprises the principal angles between the column spaces of ˆU and U, and sin θ( ˆU, U) is defined entrywise. We require the upper bound on ˆΛ Λ F and the lower bound on the d-th eigenvalue of Λ, deriving them both in Lemma 9 and Lemma 10 as follows: Lemma 9. Suppose r = O( σ) and d(z, M) cr with some c < 1. There exists C such that 1 |Iz,r | P i Iz,r (zi z)(zi z)T P i Iz,r (pi z )(pi z )T F is bounded above by ξi 4 2 + ξi 3 2 + ξi 2 2 + r ξi 2 + C r 3 + r z z 2 + z z 2 2 Lemma 10. The d-th eigenvalue of 1 |Iz,r | P i Iz,r (pi z )(pi z )T is bounded below by λd cr 2, with probability δ0. Proofs of Lemma 9 and 10 appear in Appendix A.2. Plugging the upper bound of Lemma 9 and the lower bound of Lemma 10 into (3.1), we can obtain the following theorem: Theorem 11. Suppose r = O( σ) and d(z, M) r . For any given δ ,there exists C such that the difference between Pz and Πz is bounded by r 2 1 |Iz,r | ξi 4 2 + ξi 3 2 + ξi 2 2 + r ξi 2 + C r + z z 2 r + z z 2 2 r 2 , in probability δ0. The term z z 2 2 in Theorem 11 tells us that Pz cannot approximate Πz well if z is distant from M. When the sample size N is sufficiently large and the sample points are blurred by Gaussian noise, there will always be several sample points that deviate far away from the latent manifold. If xi represents such a sample point, given Theorem 11, Pxi cannot effectively capture the local structure of the latent manifold. However, the next section will explain how the error caused by Pxi can be eliminated when we calculate a weighted average over {Pxi}i Ix,r, denoted as Ψα x. 3.2 Properties of Ψα x This section evaluates how Ψα x approximates Πx using the upper bound of Pxi Πx i F as derived in Theorem 11. As the weighted average of {Pxi}i Ix,r, Ψα x benefits from the mutual offset of the Gaussian noise. To mathematically clarify this phenomenon, Lemma 13 below bounds the weighted averages regarding ξi 2 above by the Berry-Esseen Theorem and the properties of the weights {αi(x)} stated in Proposition 12. Yao and Xia Proposition 12. For a point x satisfying d(x, M) cr, there exist constants c0 and c 0 such that (i) α(x) is bounded below by c0|Ix,r|, with probability 1 C0/ p (ii) α(x) is bounded below by a constant c 0 with probability 1 (1 crd)N = O(Nrd). Lemma 13. Suppose d(x, M) cr with some constant c < 1 and r = O( σ). 
For any given δ, there exist constants C, c0 and n0 such that if N n0r d, then α(x) c0|Ix,r| with probability at least (1 δ) and X i Ix,r αi(x) ξi k 2 Cσk and 1 |Ix,r|2 X i,j Ix,r ξi s 2 ξi t 2 Cσs+t (3.2) hold for k, s, t 4 with probability at least (1 δ)2. Lemma 14. Suppose x and y are two points on M , then Π x Π y 2 Π x Π y F C x y 2 The proof of Lemma 13 is shown in Appendix A.3. To evaluate how Ψα x approximates Πx , we first evaluate how the tangent space changes when the point of tangency changes as Lemma 14, which is also proved in Appendix A.3. Based on the theorem above and lemmas, we obtain the following theorem to evaluate Ψα x: Theorem 15. Suppose d(x, M) cr with some constant c < 1 and r = O( σ). For any given δ, there exist constants C and n0 such that if N n0r d, then Ψα x Πx 2 Ψα x Πx F Cr (3.4) holds with probability δ0(1 δ)2. Proof By definition of Ax, Ax Πx F = X i Ix,r αi(x)(Pxi Πx i ) + X i αi(x)(Πx i Πx ) F i Ix,r αi(x) Pxi Πx i F + X i Ix,r αi(x) Πx i Πx F . (3.5) Setting z in Theorem 11 to be xi and replacing r by r = 2r, we obtain the upper bound of Pxi Πx i F with probability (1 δ)2. Plugging the upper bound into the first term on the right-hand side of (3.5), we obtain X i Ix,r αi(x) Pxi Πx i F C αi(x) |Ixi,2r| ξj 4 2 + ξj 3 2 + ξj 2 2 + r ξj 2 P i Ix,r αi(x) xi x i 2 r + P i Ix,r αi(x) xi x i 2 2 r2 αi(x) |Ixi,2r| ξj 4 2 + ξj 3 2 + ξj 2 2 + r ξj 2 P i Ix,r αi(x) ξi 2 P i Ix,r αi(x) ξi 2 2 r2 Manifold Fitting under Unbounded Noise Plugging the upper bound of P i Ix,r αi(x) ξi k 2 into the last formula leads to i Ix,r αi(x) Pxi Πx i F C σ with probability δ0(1 δ)2, where the last inequality holds given r = O( σ). As for the second term on the right-hand side of (3.5), i Ix,r αi(x) Πx i Πx F C i Ix,r αi(x) x i x 2 i Ix,r αi(x) x i xi 2 + xi x 2 + x x 2 i Ix,r αi(x) x i xi 2 where the first inequality is by Lemma 14, the second-to-last inequality holds given r = O( σ), and the last inequality holds by (2.2). Since Ψα x is the closest (D d)-rank projection matrix to Ax, we have Ψα x Ax F Ax Πx F Cr, with probability δ0(1 δ)2. (3.6) Hence, Ψα x Πx F Ψα x Ax F + Ax Πx F Cr with probability δ0(1 δ)2. 3.3 A bound on f(x) This section examines how f(x) approximates the bias from x to M, which is achieved by calculating f(x) 2 for x M. If f approximates the bias well, such f(x) 2 should be bounded above by a small value with x M. Theorem 16. Suppose x M and r = O( σ). For any given δ, there exist constants C and n0 such that if N n0r d, then f(x) 2 Cr2 with probability δ0(1 δ)2. Proof It is clear that x = x when x M. Accordingly, we use x instead of x in the following discussion for convenience. First, we bound the distance between P i Ix,r αi(x)xi Yao and Xia and Tx M. By definition, i Ix,r αi(x)xi, Tx M = Π x X i Ix,r αi(x)xi x 2 i Ix,r αi(x) Π x(xi x) 2 i Ix,r αi(x) xi x i 2 + X i Ix,r αi(x) Π x(x i x) 2 i Ix,r αi(x) ξi 2 + X i Ix,r αi(x) x i x 2 2 τ i Ix,r αi(x)( x i xi 2 + xi x 2)2 C1σ + C2 (σ + r)2 where the second-to-last inequality holds by Lemma 13 with probability (1 δ)2. The parameter r is selected in the order of σ, namely C3 exists such that r = C3 σ > C3σ since σ < 1. So (σ + r)2 < ( 1 C3 + 1)2r2 and τ (σ + r)2 C1r2 C3 + 1)2r2 = Cr2. Hence, we obtain d(P i Ix,r αi(x)xi, Tx M) Cr2. We let a = P i Ix,r αi(x)xi and b be the projection of a onto Tx M. Then, we have a b 2 = Π x(a b) 2 = d X i Ix,r αi(x)xi, Tx M Cr2. According to the definition of f(x), f(x) = Ψα x(x a) = Π x(x b) + (Ψα x Π x)(x b) + Ψα x(b a), where Π x(x b) = 0, since x = x Tx M and b Tx M. 
Hence, we obtain f(x) 2 Ψα x Π x F x a 2 + a b 2 + Ψα x(a b) 2 Ψα x Π x F x a 2 + a b 2 + a b 2 C1r (r + r2) + C2r2 Cr2, where the second-to-last inequality holds by Theorem 11 with probability δ0 and the last inequality holds by (2.2). In summary, f(x) 2 Cr2 with probability δ0(1 δ)2. Manifold Fitting under Unbounded Noise 3.4 A bound on the first and second derivative of f(x) We now proceed to obtain an upper bound on vf(x) 2 with v 2 = 1, where vf(x) = lim t 0 f(x + tv) f(x) for any v RD. Theorem 17. Suppose d(x, M) cr and r = O( σ). For any given δ, there exist constants C and n0 such that if N n0r d, vf(x) Ψα xv 2 Cr, (3.7) with probability δ0(1 δ)2. The proof of Theorem 17 refers to Appendix A.4. This theorem claims the first derivative of f(x) approximates Ψα xv in the order of O(r). Taking v in Theorem 17 as e1, , e D, we achieve the following Corollary 18. Corollary 18. Suppose d(x, M) cr and r = O( σ). For any given δ, there exist constants C and n0 such that if N n0r d, Jf(x) Ψα x F Cr (3.8) with probability δ0(1 δ)2. Proof Let ei represent a D-dimensional vector where the i-th component is 1, and the other components are 0. The Jacobian matrix of function f can be represented as Jf(x) = e1f(x), , e Df(x) and Ψα x = Ψα xe1, , Ψα xe D . Hence, Jf(x) Ψα x F = e1f(x), , e Df(x) Ψα xe1, , Ψα xe D F i=1 eif(x) Ψαxei 2 2 i=1 (C2 1r2) = C1 The last inequality holds by Theorem 17, which concludes Jf(x) Ψα x F Cr with probability δ0(1 δ)2. We now proceed to obtain an upper bound on v uf(x) 2 with v 2 = u 2 = 1 in Theorem 19. This theorem proves that the second derivative of f(x) is bounded above by a certain constant, which indicates the smoothness of f(x). The proof of Theorem 19 refers to Appendix A.5. Theorem 19. Suppose d(x, M) cr with some constant c < 1 and r = O( σ). For any given δ, there exist constants C and n0 such that if N n0r d, then v uf(x) 2 C with probability δ0(1 δ)2. Yao and Xia 4. Proofs of Theorem 4, Theorem 5 and Theorem 7 Theorem 4 claims that the intersection of Mout and the neighborhood of x Mout is a d-dimensional manifold. To prove this conclusion, we first need to discuss the properties of the neighborhood of x, as stated in Proposition 20. Proposition 20. Let ϵ = min{ q α(x) |Ix,2r|2 r3 β , r} for given x, then Ψα x Ψα z 2 Cr, z BD(x, ϵ) with probability δ0(1 δ)2 1 (1 crd)N . The proof of Proposition 20 can be found in Appendix A.6. Based on Proposition 20, we construct an auxiliary function h to further characterize the neighborhood of x and obtain the proof of Theorem 4, as shown below: Proof of Theorem 4 Let h(z) : BD(x, ϵ) RD RD d, per h(z) = V T x f(z), (4.1) where Vx is the factor of Ψα x such that Ψα x = Vx V T x . Then h(z) = 0 if f(z) = 0. Assuming there exists z such that h(z) = 0 but f(z) = 0, we obtain Ψα x Ψα z 2 = max v =0 (Ψα x Ψα z )v 2 v 2 (Ψα x Ψα z )f(z) 2 f(z) 2 Ψα xf(z) Ψα z f(z) 2 f(z) 2 = Vxh(z) f(z) 2 f(z) 2 = 0 f(z) 2 f(z) 2 = 1. However, Ψα x Ψα z 2 Cr with probability δ0(1 δ)2 1 (1 crd)N via Proposition 20, which is contradictory to Ψα x Ψα z 2 1. Hence, f(z) = 0 if and only if h(z) = 0, equivalently h 1(0) = f 1(0) in BD(x, ϵ), with probability δ0(1 δ)2 1 (1 crd)N . For z BD(x, ϵ), Jh(z) = V T x Jf(z) = V T x (Jf(z) Jf(x)) + V T x (Jf(x) Ψα x) + V T x Ψα x. 
On the right hand side of the above equality, we have Jf(x) Ψα x F Cr by Corollary 18, V T x Ψα x = V T x and Jf(z) Jf(x) F C max D i=1 Jeif(z) Jeif(x) 2 C x z 2 Cϵ Cr, where the second inequality holds by Theorem 19, which implies Jh(z) V T x 2 Jh(z) V T x F = V T x (Jf(z) Jf(x)) + V T x (Jf(x) Ψα x) F Cr. Hence, the maximal difference between the singular values of Jh(z) and V T x is bounded by Cr. Let σ1 σD d be the singular values of Jh(z). We obtain |σD d 1| Cr since the singular values of V T x are 1, which implies σD d 1 Cr and rank Jh(z) = D d for any z BD(x, ϵ). This means the rank of h at z equals D d for any z BD(x, ϵ), and thus h 1(0) is a d-dimensional submanifold of BD(x, ϵ) RD. The equivalence between h 1(0) and f 1(0) in BD(x, ϵ) guarantees that f 1(0) is also a d-dimensional submanifold of BD(x, ϵ) RD. Manifold Fitting under Unbounded Noise The above proof is based on Proposition 20, Corollary 18 and Theorem 19, where Proposition 20 is proved based on Theorem 15 and Proposition 12(ii), and Corollary 18 is proved based on Theorem 17. Noting that Theorem 15, Theorem 17 and Theorem 19 are valid when Lemma 13 and Theorem 11 hold, we obtain P Mout BD(x, ϵ) is a d dimensional manifold P Proposition 12(ii), Lemma 13 and Theorem 11 hold for x δ0(1 δ)2 1 (1 crd)N . Proof of Theorem 5 For any fixed x Mout, we let Vx RD (D d) denote the orthonormal matrix such that Ψα x = Vx V T x , and let Ux denote the orthogonal complement of Vx. Then, we define F(z) = f(z) + Ux UT x z. Let x be the projection of x onto M, as done previously, Πx = V V T , and U be the orthogonal complement of V . The difference F(x ) F(x) 2 can be evaluated as F(x ) F(x) 2 = f(x ) + Ux UT x x f(x) Ux UT x x 2 = f(x ) + Ux UT x x Ux UT x x 2 f(x ) 2 + (Ux UT x U UT )(x x ) 2 + U UT (x x ) 2 = f(x ) 2 + (Ψα x Πx )(x x ) 2 + U UT (x x ) 2 f(x ) 2 + Ψα x Πx F x x 2 + U UT (x x ) 2 Cr2 The second equality holds because f(x) = 0 for x Mout while the last inequality holds because f(x ) 2 Cr2 via Theorem 16, Ψα x Πx F Cr via Theorem 15, x x = d(x, M) cr via the definition of Mout, and U UT x = U UT x , since x is the projection of x onto Tx M. The Jacobian matrix of F at z = x, denoted by JF (x) for simplicity, is JF (x) = Jf(x) + Ux UT x = ID + Jf(x) Ψα x By Corollary 18, each entry of the matrix Jf(x) Ψα x is bounded above by Cr. Hence, we obtain JF (x) = ID + O(r), which means that JF (x) approximates ID with precision O(r) and JF (x) is invertible. Moreover, JF (x) F C(1 + r) and its inversion is J 1 F (x) F C(1 + r). The changing rate of JF can also be bounded as follows: supposing x and x are two arbitrary points, we have JF (x ) JF (x ) F = Jf(x ) Jf(x ) F C x x 2 (4.2) Yao and Xia by the upper bound on the second derivative of f(x) in Theorem 19. Based on the conclusions that F(x) F(x ) 2 Cr2, JF (x) = ID+O(r), and JF (x ) JF (x ) F C x x 2, we could bound x x 2 via Theorem 2.9.4 (the inverse function theorem) in Hubbard and Hubbard (2001). Specifically, The above proof is based on Theorem 15, Theorem 16, Corollary 18 and Theorem 19, which are valid when Lemma 13 and Theorem 11 hold. Hence, the conclusion x x 2 Cr2 is drawn with probability δ0(1 δ)2. To prove the smoothness of the estimated manifold Mout, as stated in Theorem 7, we construct the following two auxiliary functions and clarify their properties. Proposition 21. For any fixed point x Mout, set Wx to be the basis of the spanning space of Jf(x)T and g(z) = W T x f(z). 
(4.3) The following two statements (i) g is a function from RD to RD (D d) (ii) Given z BD(x, rτ), g(z) = 0 if and only if f(z) = 0 hold simultaneously with probability at least δ2 0(1 δ)4 1 (1 crd)N . Proposition 21 claims that f 1(0) and g 1(0) describe the same set in the neighborhood of x whose proof can be found in Appendix A.6. By Wx, we reset the coordinate system for g. Specifically, the rows of Jf(x) are orthogonal to the contour surface at x, and Wx is also the basis of the normal space of Mout at x. Accordingly, we set the first d coordinates as the basis of Tx Mout and the last D d coordinates as the columns of Wx. In this coordinate system, we define an implicit function φ : Rd RD d based on g( ) using the implicit function theorem, such that ζ; φ(ζ) maps ζ Tx Mout to a point on the manifold Mout. Here, we let (η; ζ) denote the concatenation of column vectors η and ζ. The upper bound on the first and second derivatives of φ is given in Lemma 22, with its proof appearing in Appendix A.6. Lemma 22. Suppose function g is defined as (4.3). The implicit function φ : Rd RD d satisfying g , φ( ) = 0 exists, and its first and second derivatives are bounded above by sφ(ζ) C ζ; φ(ζ) x 2, t sφ(ζ) C, with probability at least δ0(1 δ)2 1 (1 crd)N , for any s 2 = t 2 = 1. Manifold Fitting under Unbounded Noise Proof of Theorem 7 Let x and z be two points on Mout, and Tx Mout be the tangent space to Mout at x. The proof is conducted with z x 2 > rτ and z x 2 rτ, respectively. First, when z x 2 > rτ, z x 2 2 d(z, Tx Mout) rτ (4.4) holds because z x 2 d(z, Tx Mout). Second, when z x 2 rτ, we have g(z) = f(z) = g(x) = f(x) = 0 by Proposition 21 with probability δ2 0(1 δ)4 1 (1 crd)N , since x and z are on Mout. Let ζx and ζz denote the first d coordinates of x and z, respectively. We have z = ζz; φ(ζz) , x = ζx; φ(ζx) , sφ(ζz) C z x 2 and t sφ(ζz) C with probability at least δ0(1 δ)2 by Lemma 22. So, d(z, Tx Mout) = φ(ζz) φ(ζx) 2 C z x 2 ζz ζx 2 + C ζz ζx 2 2 C z x 2 2 + C z x 2 2 C z x 2 2. As a result, z x 2 2 d(z, Tx Mout) z x 2 2 C z x 2 2 = 1 Combined with (4.4), we complete this proof. The above proof requires Proposition 21 and Lemma 22, which are valid when Theorem 4 and Theorem 5 hold simultaneously for x and when Theorem 15 holds for z. Given the dependency between the theorems as shown in Figure 3, we establish that the above theorems hold when Lemma 13 and Theorem 11 simultaneously hold for x, z, and when Proposition 12(ii) holds for x. Hence, we have P z x 2 2 d(z, Tx Mout) c0r P Proposition 12(ii) holds for x Lemma 13 and Theorem 11 hold for x, z δ2 0(1 δ)4 1 (1 crd)N . 5. Experimental Results This section comprises two parts. The first part provides numerical comparisons of the methods of Mohammed and Narayanan (2017), Fefferman et al. (2018), and Aizenbud and Sober (2021). Further, we apply relevant methods on several known manifolds, illustrate the output manifolds, and calculate the Hausdorffdistances between the output and latent manifolds. In the second part, we focus on real applications, and use our method to denoise facial images sampled from a lengthy video recording. The results of our method are then compared to the findings of each of the other aforementioned methods. Yao and Xia Algorithm 1: Project x onto Mout Input: a point x, noisy data X = [x1, , x N], bandwidth parameters r and r , a step length parameter a, a tolerance ϵ, and the maximal number of iteration T. Output: projection x of x onto Mout. 1. 
Calculate $P_{x_i} = I - VV^T$ for each $x_i \in \mathcal{X}$, where $V$ is the $D \times d$ matrix whose columns are the eigenvectors corresponding to the largest $d$ eigenvalues of $\sum_{j \in I_{x_i, r'}} (x_j - x_i)(x_j - x_i)^T$.
2. Set $t = 1$.
   (1) Calculate $\tilde{\alpha}_i(x)$ and $\alpha_i(x)$ for $i \in I_{x,r}$ by (2.5).
   (2) Plug $\{\tilde{\alpha}_i(x), \alpha_i(x), P_{x_i}\}_{i \in I_{x,r}}$ into (B.1) to obtain the gradient $\mathrm{grad}(x)$ of $\|f(x)\|_2^2$.
   (3) Update $t = t + 1$ and $x = x - a\,\mathrm{grad}(x)$.
   (4) Repeat (1) to (3) until the tolerance condition $\|f(x)\|_2^2 \le \epsilon$ or the maximal iteration $T$ is met.
3. Output $\hat{x} = x$.

Implementation: the MATLAB codes, together with all numerical examples used in this paper, are available at https://zhigang-yao.github.io/research.html, which contains a GitHub link under the code tab. We have also implemented the related methods from Mohammed and Narayanan (2017) and Fefferman et al. (2018), since the authors of both papers did not provide implementations, their work being purely theoretical.

5.1 Simulation

As explained in Subsection 1.2, by removing the unreliable discs centered at the sample points, as used in Mohammed and Narayanan (2017) and Fefferman et al. (2018), one would expect an improved performance compared to these two methods. Assuming the data points are sampled from a tubular neighborhood, Aizenbud and Sober (2021) denoise the sample points iteratively using local polynomial regression. As the degree increases, polynomial regression fits a manifold better when the noise is limited, but on the other hand it becomes sensitive once the noise increases. As a method designed for Gaussian noise, our method is expected to be more robust as the noise increases. To support this claim, we test the methods of Mohammed and Narayanan (2017) (marked by km17), Fefferman et al. (2018) (marked by cf18), and Aizenbud and Sober (2021) with polynomial degrees 1 and 2 (marked by ya21(deg=1) and ya21(deg=2)) on manifolds with both constant and inconstant curvature, namely: a circle embedded in $\mathbb{R}^2$, a sphere embedded in $\mathbb{R}^3$, and a torus embedded in $\mathbb{R}^3$. To ensure a traceable comparison, all the tests are conducted in the following way, similar to that of Mohammed and Narayanan (2017):

- Sample $N$ points from the latent manifold, blur the points with the Gaussian noise defined in (1.1) with a given $\sigma$, and use the noisy data $\mathcal{X} = [x_1, \ldots, x_N]$ to implicitly construct the output manifolds.
- Initialize a collection of points $P = [p_1, \ldots, p_{N_0}]$ around the latent manifold.
- Project each $p_i$ onto the constructed output manifolds via km17, cf18, ya21(deg=1), ya21(deg=2) and our method, respectively. We then obtain $\hat{P}$ as the projection of $P$ for each method.
- Calculate the Hausdorff distance between each $\hat{P}$ and $\mathcal{M}$ to estimate the Hausdorff distance between the corresponding $\mathcal{M}_{out}$ and $\mathcal{M}$.

As projections, the points in $\hat{P}$ lie on the corresponding $\mathcal{M}_{out}$, and the Hausdorff distance $H(\hat{P}, \mathcal{M})$ can estimate $H(\mathcal{M}_{out}, \mathcal{M})$ when $\hat{P}$ is dense enough. This motivates us to evaluate the approximation error of $\mathcal{M}_{out}$ to $\mathcal{M}$ by $H(\hat{P}, \mathcal{M})$. To project a point $p$ onto a manifold defined by (2.7), we design Algorithm 1. Taking $x = p$ and $f$ in Algorithm 1 as (2.6), we can project $p$ onto our output manifold. It should be noted that the difficulty of calculating such a gradient lies in calculating the gradient of an orthogonal projection, which can be addressed according to Shapiro and Fan (1995); the detailed formula is given in Appendix B, and a minimal numerical sketch of the projection loop follows below. Mohammed and Narayanan (2017) suggested a subspace-constrained gradient descent algorithm to project a point onto the $\mathcal{M}_{out}$ constructed by km17.
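As a concrete illustration of the projection loop in Algorithm 1, here is a minimal sketch in Python. It stands in for, rather than reproduces, the analytic gradient formula (B.1): the gradient of $\|f(x)\|_2^2$ is approximated by central finite differences, the function `f` is assumed to implement (2.6) (for instance via the sketch in Section 2.2), and the arguments `step`, `tol`, and `max_iter` mirror the inputs $a$, $\epsilon$, and $T$ of Algorithm 1.

```python
import numpy as np

def project_to_Mout(x, f, step=0.1, tol=1e-8, max_iter=500, h=1e-5):
    """Gradient descent on ||f(x)||_2^2 as in Algorithm 1, with a finite-difference
    gradient standing in for the analytic formula (B.1)."""
    x = np.asarray(x, dtype=float).copy()
    D = x.size

    def objective(z):
        v = f(z)
        return float(v @ v)                      # ||f(z)||_2^2

    for _ in range(max_iter):
        if objective(x) <= tol:                  # tolerance condition of Algorithm 1
            break
        grad = np.empty(D)
        for k in range(D):                       # central differences, coordinate by coordinate
            e = np.zeros(D)
            e[k] = h
            grad[k] = (objective(x + e) - objective(x - e)) / (2.0 * h)
        x = x - step * grad                      # update x <- x - a * grad(x)
    return x
```

In practice one would use the analytic gradient from Appendix B and tune the step length; the finite-difference version above is only meant to make the control flow of Algorithm 1 explicit.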
We adopt that subspace-constrained gradient descent algorithm to implement km17 in this simulation. Although Fefferman et al. (2018) did not consider this issue, we nevertheless implement their method too via Algorithm 1, treating $f(x)$ as the approximated bias at $x$ defined by Fefferman et al. (2018). The details of this simulation are as follows: we uniformly sample $N$ points, denoted by $y_1, \ldots, y_N$, from each target manifold and i.i.d. sample $\xi_1, \ldots, \xi_N$ from the Gaussian distribution (1.1) with a given standard deviation $\sigma$. Then, the noisy data $\mathcal{X} = \{x_i\}_{i=1}^N$ are constructed by $x_i = y_i + \xi_i$. The initial points $P$ are sampled from a tube centered at $\mathcal{M}$ whose radius is of the order of the noise level, so that $d(p_i, \mathcal{M})$ is comparable to $\sigma$ for each $p_i$. According to Theorem 5, $d(\hat{p}_i, \mathcal{M}) \le O(\sigma)$, which means the output points should be much closer to the latent manifold than the initial points. Again, we take $N_0 = N$ initial points for each test in the simulation. To implicitly construct the output manifolds, the methods km17, cf18, and our method each require a bandwidth parameter $r$. According to the theoretical analysis, $r = O(\sqrt{\sigma})$. So we take $r = \lambda\sqrt{\sigma}$ in this simulation, where $\lambda$ is tuned over a large range for each method and each $\sigma$. All the results reported in this section are the ones using the best $\lambda$. The method ya21 also requires a bandwidth parameter $h$, which is again selected as the best one tuned over a large range. In constructing $\alpha_i(x)$, our method requires $\beta \ge 2$. We take $\beta = d + 2$ in the simulation, as Fefferman et al. (2018) did.

5.1.1 Manifold with constant curvature

This part tests the manifold-fitting methods for the circle in $\mathbb{R}^2$ and the sphere embedded in $\mathbb{R}^3$. For the circle, we set $N = N_0 = 300$, while for the sphere, we set $N = N_0 = 1000$. The different sample-size settings guarantee comparable density in each case. Figure 5 illustrates the $\hat{P}$ (black dots) and their projections onto $\mathcal{M}$ (red dots) obtained by our method, cf18, km17, ya21(deg=1) and ya21(deg=2), from left to right.

Figure 5: The performance of our method, km17, cf18, ya21(deg=1) and ya21(deg=2) when fitting a circle (top row) and a sphere (bottom row), where black points represent points in $\hat{P}$ (black dots) and red points represent their projections onto $\mathcal{M}$.

The black dots and red dots can be treated as the discretized versions of $\mathcal{M}_{out}$ and $\mathcal{M}$, respectively. Thus, a larger overlap of the two sets of dots means the manifold is better fitted. For the circle embedded in $\mathbb{R}^2$, we show the entire space in the left column, while for the sphere embedded in $\mathbb{R}^3$, we show the view from the positive $z$ axis. Figure 5 shows that km17 clearly performs worse than the other methods in terms of fitting error. From the two estimated circles by ya21(deg=1) and ya21(deg=2), we observe that there are sharp corners both at the top left and at the bottom right, an observation that confirms that the estimator by ya21 is not smooth. From the right edge of the circle and the sphere, we can also observe that our method performs slightly better than cf18 in this experiment. To confirm the superiority of our method, we repeat each test for 20 trials, and list the results of $H(\hat{P}, \mathcal{M})$ using the different methods in Figure 6. Generally speaking, our method outperforms cf18, km17 and ya21(deg=1) in the compared cases, and although ya21(deg=2) performs slightly better than our method in instances of very low noise, it is much more sensitive than our method. As $\sigma$ increases, ya21(deg=2) fails to outperform the other methods. From Figure 6, $H(\mathcal{M}, \mathcal{M}_{out}) = O(\sigma)$ for our method, which supports Theorem 5.
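Before turning to the inconstant-curvature case, the following is a small sketch of the evaluation pipeline described at the beginning of Section 5.1 for the circle: noisy samples are drawn as in (1.1), and the Hausdorff distance between a set of projected points and the unit circle is estimated by combining the exact point-to-circle distance with a dense discretization of the circle. The helper names and the generic `project` routine in the usage comment are our own placeholders.

```python
import numpy as np

def sample_noisy_circle(N, sigma, rng):
    """y_i uniform on the unit circle, x_i = y_i + xi_i with isotropic Gaussian noise."""
    theta = rng.uniform(0.0, 2.0 * np.pi, N)
    Y = np.column_stack([np.cos(theta), np.sin(theta)])
    return Y + sigma * rng.standard_normal((N, 2))

def hausdorff_to_unit_circle(P_hat, n_grid=2000):
    """Estimate H(P_hat, M) when M is the unit circle."""
    d_points_to_M = np.abs(np.linalg.norm(P_hat, axis=1) - 1.0)   # exact distances to the circle
    phi = np.linspace(0.0, 2.0 * np.pi, n_grid, endpoint=False)
    M_grid = np.column_stack([np.cos(phi), np.sin(phi)])          # dense discretization of M
    d_M_to_points = np.min(
        np.linalg.norm(M_grid[:, None, :] - P_hat[None, :, :], axis=2), axis=1
    )
    return max(d_points_to_M.max(), d_M_to_points.max())

# usage (schematic):
#   rng = np.random.default_rng(0)
#   X = sample_noisy_circle(300, 0.04, rng)          # noisy training samples
#   P_hat = np.array([project(p, X) for p in P])     # P: initial points; project: any compared method
#   err = hausdorff_to_unit_circle(P_hat)
```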
5.1.2 Manifold with inconstant curvature

We also implement the compared methods in the torus case, which is a type of manifold with inconstant curvature. Figure 7 illustrates the case with $N = N_0 = 800$ and $\sigma = 0.04$, and the torus embedded in $\mathbb{R}^3$ is shown from the positive $z$ axis. Here, the sample points in $\hat{P}$ are marked by black dots and their projections onto $\mathcal{M}$ are marked by red dots. The five subfigures are obtained from our method, cf18, km17, ya21(deg=1) and ya21(deg=2), from left to right. From the top and right edges of the torus, we can observe that our method performs better than both cf18 and km17. From the fourth subfigure, we can identify a clear gap between the red and black dots around the edge of the torus, which means ya21(deg=1) failed to fit these points; using a second-degree polynomial, ya21(deg=2) achieves a better fit, as the rightmost subfigure shows.

Figure 6: The Hausdorff distance of fitting a circle (top row) and a sphere (bottom row) with $\sigma = 0.02$ (left column), $\sigma = 0.04$ (middle column) and $\sigma = 0.06$ (right column) using ya21(deg=2), our method, ya21(deg=1), cf18 and km17, respectively.

Figure 7: The performance of our method, km17, cf18, ya21(deg=1) and ya21(deg=2) when fitting a torus with $N = N_0 = 800$ and $\sigma = 0.04$, where black points represent points in $\hat{P}$ (black dots) and red points represent their projections onto $\mathcal{M}$.

We also repeat each test for 20 trials and list the results of $H(\hat{P}, \mathcal{M})$ using the different methods in Figure 8. When $\sigma = 0.02$ and $\sigma = 0.04$, our method performs better than cf18, km17 and ya21(deg=1), but as $\sigma$ increases to 0.06, the fitting problem becomes more difficult and the performances of km17, cf18 and our method are similar, which further demonstrates the sensitivity of ya21(deg=2). When $\sigma$ is small and the sample size is adequate, ya21(deg=2) outperforms the other methods, but when the sample size decreases and $\sigma$ increases, the performance of ya21(deg=2) deteriorates rapidly.

Figure 8: The Hausdorff distance of fitting a torus given 500 (top row) and 1000 (bottom row) samples with $\sigma = 0.02$ (left column), $\sigma = 0.04$ (middle column) and $\sigma = 0.06$ (right column) using ya21(deg=2), our method, ya21(deg=1), cf18 and km17, respectively.

5.2 Facial image denoising

This subsection considers a concrete case: denoising facial images selected from the video database in Happy et al. (2012). We select 1,000 images of an individual turning his head around, then blur those images via a Gaussian distribution with different standard deviations $\sigma$. In this experiment, $\sigma$ is set to be the average of all pixels in the 1,000 images multiplied by $\rho = 0.2$, 0.3, or 0.4. The size of each facial image is $80 \times 80$, which means $D = 6400$.
The dimension d of the latent manifold is tuned over {1, 5, 10, 15, 20, 50, 75, 100} for each method, and we choose d = 10 because it performs best. From the 1,000 facial images, we select 5 with different head orientations. The top row of Figure 9 displays these five original images, while the second row shows the same five images blurred with ρ = 0.3. The goal of this experiment is to denoise these five blurred images by projecting them onto the manifold learned from the remaining 995 blurred images, which are treated as the noisy samples. To achieve the denoising, we use km17, cf18, ya21(deg=1), ya21(deg=2) and our method to construct the output manifold from the 995 noisy samples, and project the five tested images onto each output manifold. When the output manifold correctly fits the latent one, projecting the blurred images onto it denoises these facial images.

In this experiment, we take β = 2 for our method when constructing αi(x). If cf18 used the αi(x) suggested by Fefferman et al. (2018), it would not work satisfactorily because of the overly large power d + 2 rather than β; therefore, we use the same αi(x) for cf18 as for our method to make the results comparable.

The last five rows of Figure 9 show the denoised images obtained by km17, cf18, ya21(deg=1), ya21(deg=2) and our method, respectively. The first and third facial images were not recovered by km17; although the faces in the other three images obtained by km17 can be distinguished, they are still very noisy. cf18 could not recover the third image either, although the other four images obtained by cf18 are better than the ones obtained by km17. Both ya21(deg=1) and ya21(deg=2) can recover these five faces; however, the faces obtained by ya21(deg=2) are somewhat fuzzy compared with those obtained by ya21(deg=1) and our method. Our method recovered all five faces, with the third face of much better quality than the corresponding faces from km17 and cf18.

Figure 9: Performance of facial image denoising with ρ = 0.3. The first row consists of original images while the second row consists of blurred images. The third to seventh rows contain deblurred images using km17, cf18, ya21(deg=1), ya21(deg=2) and our method, respectively.

The results with the settings ρ = 0.2 and ρ = 0.4 are listed in Figure 10 and Figure 11 (Appendix C). When we take ρ = 0.2, km17, cf18 and our method all provide fairly good results; however, the results from km17 are somewhat noisy, with the obtained faces darker than the original ones. When ρ = 0.4, km17 is barely able to recover the faces and cf18 fails at the first and third ones, but our method can still provide acceptable faces.

6. Discussion

We have proposed a new output manifold Mout to fit a data collection with Gaussian noise. The theoretical analysis of Mout has two main components: (1) the upper bound on d(x, M) for arbitrary x ∈ Mout, which guarantees that Mout approximates M well, and (2) the upper bound on the second-order difference of Mout, which guarantees the smoothness of Mout. To demonstrate the contribution of this paper, we compared our theoretical results with the relevant works of Mohammed and Narayanan (2017) and Fefferman et al. (2018). All three of these works aim to fit a data collection by a smooth manifold, and the difference among them lies in the assumptions about the noise. Mohammed and Narayanan (2017) requires the data to be noiseless, which is the strictest assumption of the three.
As mentioned in the Introduction, Fefferman et al. (2018) essentially requires the noise of the data to be bounded, that is, the data collection X satisfies H(X, M) ≤ O(r²), where H(·, ·) denotes the Hausdorff distance. If the noise of the data obeys a Gaussian distribution, the researchers would select a subset from the entire dataset, assume the noise of the subset is bounded, and implement their proof on this subset of the data. However, their sample selection step imposes a lower bound on r, meaning that the upper bound of H(M, Mout) cannot tend to 0. This paper, therefore, proposes a method to address the problem of Gaussian noise, which is commonly assumed but unsolved in the relevant works. Unlike the bounded-noise case, X with Gaussian noise is not required to satisfy H(X, M) ≤ O(r²), which increases the difficulty of manifold fitting.

According to the discussion in Subsection 1.2 and the experimental results, our method can achieve a smaller approximation error than the methods presented in Mohammed and Narayanan (2017) and Fefferman et al. (2018). One possible reason is that we use the weighted average \sum_{i \in I_{x,r}} \alpha_i(x) P_{x_i} to estimate \Pi_{x^*} rather than using each P_{x_i} separately. To explain this claim, we consider the following decomposition:

\sum_{i \in I_{x,r}} \alpha_i(x) P_{x_i} - \Pi_{x^*}
  = \sum_{i \in I_{x,r}} \alpha_i(x) \big( P_{x_i} - \Pi_{x_i^*} \big)
  + \Big( \sum_{i \in I_{x,r}} \alpha_i(x) \Pi_{x_i^*} - \Pi_{x^*} \Big).   (6.1)

For certain symmetric manifolds, the second term on the right-hand side of (6.1) might be much closer to the zero matrix than each individual difference \Pi_{x_i^*} - \Pi_{x^*}. A circle may be considered as an example. Suppose x, x1, and x2 are points on the circle satisfying ||x1 - x||_2 = ||x - x2||_2; then the average of the orthogonal projections onto the normal spaces at x1 and x2 agrees with the orthogonal projection onto the normal space at x up to a higher-order error, while the projection onto the normal space at x1 (or x2) differs from that at x by an error of the order of ||x - x1||_2 (or ||x - x2||_2) by Lemma 14. This phenomenon illustrates that the average of {P_{x_i}}, i ∈ I_{x,r}, approximates \Pi_{x^*} better than each individual P_{x_i} does for certain manifolds. We benefit from this fact by using \sum_{i \in I_{x,r}} \alpha_i(x) P_{x_i} to construct our output manifold, whereas Mohammed and Narayanan (2017) and Fefferman et al. (2018) use each P_{x_i} separately. Characterizing the symmetry property mentioned above and exploiting it in manifold-fitting methodology is an attractive and promising topic, and our work on it will continue.

Acknowledgments

ZY was supported by MOE Tier 1 A-0004809-00-00 and Tier 2 R-155-000-184-112 at the National University of Singapore. ZY is currently supported by MOE Tier 1 A-8002931-0000 and Tier 2 R-8001562-00-00. YX was supported by MOE Tier 1 A-0004809-00-00 and Tier 2 R-155-000-184-112 during her time as a Research Fellow at the National University of Singapore, and later by the Zhejiang Provincial Natural Science Foundation of China (Grant No. LQ21F020023) in the subsequent phase of her research. ZY thanks Professor Charles Fefferman and Professor Hariharan Narayanan for their helpful discussions on some details of Mohammed and Narayanan (2017) and Fefferman et al. (2018), which were very useful to this work. ZY also thanks Professor Shing-Tung Yau for his wise comments and the support from the Center of Mathematical Sciences and Applications at Harvard University.

Appendix A. Proofs

A.1 Proof of Proposition 3

Lemma 23. If d(x, M) ≤ cr with some c < 1 and c1 satisfies c < c1 ≤ 1, then there exists a constant c′ such that P(i ∈ I_{x, c1 r}) ≥ c′ r^d.
Proof Setting c2 be a constant satisfying c < c2 < c1, then P(i Ix,c1r) P yi M BD(x, c2r), ξi 2 (c1 c2)r = P yi M BD(x, c2r) P ξi 2 (c1 c2)r . In order to bound P(i Ix,c1r) below, we bound the two probability expressions P(yi M BD(x, c2r)) and P( ξi 2 (c1 c2)r), respectively. Since d(x, M) cr < c2r, there exists c3 such that P yi M BD(x, c2r) = Vol M BD(x, c2r) Vol(M) = c3rd. By the assumptions in (2.1), there exists 0 < c4 < 1 satisfying r c4 σ c4σ, which leads to r/σ c4 and P ξi 2 (c1 c2)r = P ξi 2 2 σ2 (c1 c2)2r2 P ξi 2 2 σ2 (c1 c2)2c2 4 Since ξi 2 2/σ2 obeys Chi-square distribution and the constant (c1 c2)2c2 4 > 0, c5 is positive. Calculating the product of P yi M BD(x, c2r) and P ξi 2 (c1 c2)r , we have P(i Ix,c1r) P yi M BD(x, c2r) P ξi 2 (c1 c2)r c3c5rd = c rd which, completes this proof. Proof of Proposition 3 Setting c1 = 1 in Lemma 23, we obtain P(i Ix,r) c rd. Hence, whether i Ix,r or not, can be treated as a Bernoulli distribution with the expectation of c rd. Applying the Berry-Esseen theorem to the N Bernoulli trials, there exists c < 1 such that |Ix,r| c rd N with probability 1 C/ A.2 Proof of Lemma 9 and Lemma 10 The following proof is derived from the notations illustrated in Figure 4 and the settings σ < 1, r = 2r and r = O( σ), which imply that there exist constants C and C independent on σ such that r < C and r < C by (2.2). Proof of Lemma 9 Let pi z = qi; then, p i z = pi z = qi. Considering zi z = zi p i+ p i z = zi p i+qi := δi+qi, we can rewrite P i Iz,r (zi z)(zi z)T P i Iz,r (pi z )(pi z )T i Iz,r qiδT i + δiq T i + δiδT i . (A.1) Manifold Fitting under Unbounded Noise To begin with, we bound δi 2. Recalling that the projection onto the normal space at z is Πz , δi 2 = Πz (zi z) 2 Πz (zi z i ) + (z i z ) + (z z) 2 Πz (z i z ) 2 + zi z i 2 + z z 2 z i z 2 2 τ + zi z i 2 + z z 2 ( z i zi 2 + zi z 2 + z z 2)2 τ + zi z i 2 + z z 2. The last but one inequality holds in accordance with Proposition 2. As established previously, each zi is generated as yi + ξi with yi M and ξi N(0, σ2ID). Then, ξi 2 = zi yi 2 zi z i 2 since z i is the projection of zi onto M. Thus, δi 2 can be bounded by δi 2 ( ξi 2 + r + z z 2)2 τ + ξi 2 + z z 2 C1 ξi 2 2 + ξi 2 + r 2 + z z 2 . The last inequality is achieved by replacing certain z z 2 by its upper bound r and replacing certain r by a constant independent on σ , since r < C by r = 2r and (2.2). Considering the average over Iz,r , we obtain i Iz,r δi 2 C1 ψ2 + ψ1 + r2 + z z 2 , and 1 |Iz,r | i Iz,r δi 2 2 C2 ψ4 + ψ3 + ψ2 + rψ1 + r4 + r2 z z 2 + z z 2 2 . where ψk := 1 |Iz,r | P i Iz,r ξi k 2, the above bounds are then plugged into the bound of (A.1) as follows: 1 |Iz,r | X i Iz,r (zi z)(zi z)T X i Iz,r (pi z )(pi z )T F qiδT i + δiq T i + δiδT i F 2 qi 2 δi 2 + δi 2 2 2r δi 2 + δi 2 2 C ψ4 + ψ3 + ψ2 + r ψ1 + r 3 + r z z 2 + z z 2 2 The last but one inequality holds since qi 2 zi z 2 r . Replacing ψk by corresponding summation completes the proof. Yao and Xia Lemma 24 (Theorem 21 in Mohammed and Narayanan (2017)). Let Λ1, , Λk be i.i.d. random positive semidefinite D D matrices with expected value E[Λi] = M µI and Λi I. Then for all ϵ [0, 1/2], i=1 Λi / [(1 ϵ)M, (1 + ϵ)M] i 2D exp ϵ2µk Here, the matrix interval A [B, C] means aij [bij, cij] holds for any i, j and the matrix ordering A B means A B is a positive semidefinite. Proof of Lemma 10 Before the proof of Lemma 10, we provide the useful notations and contents. 
For convenience, z is set to be the origin of the local coordinate system, and the coordinates in Tz M are set to be the first d coordinates of the D coordinates. We let Pd : RD RD be an operator, setting the last (D d) entries of a vector to be zeros, that is, Pd(v) = [v1, , vd, 0, , 0]T . We also let Pd be the operator, setting the first d entries of a vector to be zeros, that is, Pd = I Pd, with I being the identity operator. Notations v := Pd(v) and ˆv = Pd(v) are also used without confusion. Based on these notations, we calculate the useful bound on ˆη 2 for η M BD(z, r ). Using the definition of η, we obtain z η, z z = 0, z η, ˆη = 0, and, therefore r 2 z η 2 2 = (z z ) + (z η) ˆη 2 2 z z 2 2 2 z z 2 ˆη 2 + z η 2 2 + ˆη 2 2 = z z 2 2 2 z z 2 ˆη 2 + z η 2 2. Moreover, in accordance with Proposition 2, z η 2 2 2τ ˆη 2. Combining these two inequalities, we obtain r 2 z z 2 2 + 2 z z 2 ˆη 2 z η 2 2 2τ ˆη 2 and, hence, ˆη r 2 z z 2 2(τ z z ). (A.2) We are now ready to prove Lemma 10. Let λ1 λD be the eigenvalues of matrix 1 |Iz,r | P i Iz,r (pi z )(pi z )T and µ1 µD be the eigenvalues of the population covariance matrix M, that is, M := E[ 1 |Iz,r | i Iz,r (pi z )(pi z )T ]. We see that λd+1 = = λD = µd+1 = = µD = 0. Therefore, we need only a lower bound for λd, which can be obtained by relating its value to µd through a concentration inequality given in Lemma 24. Assuming the first d coordinates are aligned with the eigenvectors corresponding to the d largest eigenvalues of M, µd is the variance in the d-th direction. Clearly, the first d coordinates are located in Tz M. Let P be the probability measure on Tz M BD(z, r ). For any q Tz M BD(z, r ), we first bound P(q) above. Manifold Fitting under Unbounded Noise We set S(q) = {ζ : ζ = q} BD(z, r ), and ˆS(q) = ζ S(q){η : |η (i) ζ (i)| 3σ, i = 1, , D}, where η(i) and ζ(i) represent the i-th element of η and ζ, respectively. Then, we have q Tz M BD(z,r )S(q) BD(z, r ) and q Tz M BD(z,r ) ˆS(q) ζ BD(z,r ){η : |η (i) ζ (i)| 3σ, i = 1, , D} BD(z, r + 3σ The probability at q is P(q) = (2πσ) D/2 M e η ζ 2 2/2σ2dµM(η ) = (2πσ) D/2 M ˆS(q) e η ζ 2 2/2σ2dµM(η ) (A.3) + (2πσ) D/2 M\ ˆS(q) e η ζ 2 2/2σ2dµM(η ). (A.4) We bound P(q) above by bounding (A.3) and (A.4). (A.3) = (2πσ) D/2 M ˆS(q) e η q 2 2/2σ2e ˆη ˆζ 2 2/2σ2dµM(η ) M ˆS(q) e η q 2 2/2σ2 Z 0d RD d e ˆη ˆζ 2 2/2σ2dˆζ dµM(η ) = (2πσ) d/2 M ˆS(q) e η q 2 2/2σ2dµM(η ) = (2πσ) d/2 Pd(M ˆS(q)) e η q 2 2/2σ2q det I + J( η )T J( η ) d η 1 + C2(r + 3σ Rd 0D d e η q 2 2/2σ2d η 1 + C2(r + 3σ The last inequality holds since J( η ) F (C(r + 3σ D))/τ with η BD(z, r + 3σ D). According to the definition of S(q) and ˆS(q), we have for any η M \ ˆS(q) and ζ S(q) the formula |η(i) ζ(i) | 3σ, which implies (2πσ2) D/2 Z S(q) e η ζ /2σ2dζ (0.01)D η M \ ˆS(q). (A.4) = (2πσ) D/2 S(q) e η ζ 2 2/2σ2dζ dµM(η ) Vol(M \ ˆS(q)) Vol(M) (0.01)D (0.01)D. Yao and Xia In summary, we have P(q) 1 Vol(M) 1 + C2(r + 3σ d/2 + (0.01)D (A.5) for any q Tz M BD(z, r ). We consider only the lower bound for q in a subset of Tz M BD(z, r ), namely Tz M BD(z , r0), where r0 is set as r 2 r 2 z z 2 2 2(τ z z 2) + z z 2 + 3σ r 2 r 2 z z 2 2 2(τ z z 2) + z z 2 2 3σ For any q Tz M BD(z , r0) and η M BD(z, r ), we can verify the following conclusions via (A.2): (i) The d-dimensional cube {q : q (i) = 0 i d + 1, |q (j) q(j)| 3σ j d} d) Tz M { η : η M BD(z, r )}, (ii) The (D d)-dimensional cube {η : η (i) = 0 i d, |η (j) η(j)| 3σ j d + 1} {ˆζ : ζ S(q)}. Now, we are ready to bound P(q) below for any q BD(z , r0) Tz M. 
P(q) = (2πσ) D/2 M e η ζ 2 2/2σ2dµM(η ) M BD(z,r ) e η q 2 2/2σ2e ˆη ˆζ 2 2/2σ2dµM(η ) (0.99)D d(2πσ2) d/2 M BD(z,r ) e η q 2 2/2σ2dµM(η ) = (0.99)D d(2πσ2) d/2 Pd(M BD(z,r )) e η q 2 2/2σ2q det I + J( η )T J( η ) d η = (0.99)D d(2πσ2) d/2 Pd(M BD(z,r )) e η q 2 2/2σ2d η The last but one inequality holds since q det I + J( η )T J( η ) 1. Manifold Fitting under Unbounded Noise Since µd is the variance in the d-th direction, we have Tz M BD(z,r ) P(q )d Ld(q ) Tz M BD(z,r ) q2 d P(q)d Ld(q) α Vol(Bd(r )) Tz M BD(z ,r0) q2 dd Ld(q) α Vol(Bd(r )) ℓΠd 1 j=1φj 2 d V = Γ(d/2 + 1)α 0 ℓd+1 d 1 Y j=1 sind j+1 φjdℓ where α is the ratio between the lower bound and upper bound of P(q), namely, α = (0.99)D (1 + C2(r +3σ D)2 τ 2 )d/2 + (0.01)DVol(M) , and the third line follows with a change of coordinates. Substitute n q1 ℓcos φ1, q2 i d 1 ℓcos φiΠT j=1i 1 sin φj, qd ℓsin Πd 1 j=1φj o with φd 1 [0, 2π], φi d 2 [0, π], ℓ [0, r0], and let d V := ℓd 1Πd 2 j=1 sind j i φjdℓdφ1 dφd 1. The integral in the fourth line can be evaluated by noting that R r0 0 ℓd+1dℓ= r0d+2/(d + 2), R 2π 0 sin2 φd 1dφd 1 = π and R π 0 sind j+1 φjdφj = πΓ((d j+2)/2) Γ(1+(d j+1)/2 for 1 j d 2. Simplifying as Mohammed and Narayanan (2017) did, we get µd α d + 2 r0d+2 (r )d = c0 (0.99)D d + 2 (r0)d+2 According to Lemma 24, for any ϵ [0, 1/2], λd (1 ϵ)µd with probability 1 d exp{ ϵ2µd|Iz,r | 2 ln 2 }. Taking ϵ = 1/2, we have λd c0 (0.99)D d + 2 (r0)d+2 with probability 1 d exp{ ϵ2µd|Iz,r | 2 ln 2 }. Using r = O( σ) and z z (1 + c)r, we can simplify r0 and find c0 satisfying r0 c0r. Hence, there exists a constant c independent on r such that λd cr2, which completes this proof. Yao and Xia A.3 Proof of Proposition 12, Lemma 13 and Lemma 14 Proof of Proposition 12 To show that α(x) is bounded below by c0|Ix,r| is equivalent to showing that there exists constant c1 > c and c2 such that among the |Ix,r| samples there are c2|Ix,r| ones lying in BD(x, c1r), where c0 in the lower bound is c0 = c2(1 c2 1)β. To quantify the number of samples {xi}i Ix,r lying in BD(x, c1r), we bound the conditional probability P( xi x 2 c1r|i Ix,r) below by calculating the lower bound of P( xi x 2 c1r) and the upper bound of P(i Ix,r), respectively. By Lemma 23, we have P(i Ix,c1r) c3rd. For the probability P(i Ix,r), we have P(i Ix,r) = P( xi x 2 r, yi M \ BD(x, Cr)) + P( xi x 2 r, yi M BD(x, Cr)), (A.7) P( xi x 2 r, yi M BD(x, Cr)) P(yi M BD(x, Cr)) = Vol(M BD(x, C0r)) Vol(M) = Crd, P( xi x 2 r, yi M \ BD(x, Cr)) P( ξi 2 (C 1)r) where the second-last inequality holds by Chernoffbound, and the last inequality holds since r = O( σ) is sufficiently small. Plugging the above bounds into (A.7), we obtain P(i Ix,r) Crd. Hence, for any i Ix,r, we have xi x 1 c1r with probability ρ = (crd)/(Crd) < 1 for being a constant independent on r. Applying the Berry-Esseen theorem to the |Ix,r| Bernoulli trials, we conclude that there exists c2|Ix,r| i in Ix,r such that xi x 2 c1r with probability 1 C/ p |Ix,r|, which proves (i). To show (ii), we recall Lemma 23 that P(i Ix,c1r) crd. Thus there is a sample among N samples lying in BD(x, c1r) with probability 1 (1 crd)N = O(Nrd). Then, α(x) (1 c2 1)β := c 0 with the same probability. Lemma 25. 
Suppose ξ N(0, σ2ID); then we have, for any positive integer k: (i) E( ξ k 2) = C1σk (ii) Var( ξ k 2) = C2σ2k Manifold Fitting under Unbounded Noise (iii) E[ ξ k 2 E( ξ k 2) 3] = C3σ3k (iv) ξi k 2 and ξj k 2 are independent if ξi and ξj are independent, (v) E( ξi s 2 ξj t 2) = C4σs+t (vi) Var( ξi s 2 ξj t 2) = C5σ2(s+t) (vii) E[ ξi s 2 ξj t 2 E( ξi s 2 ξj t 2) 3] = C6σ3(s+t) where Cn, n = 1, , 6 are constants depending on D and k. Proof Letting the i-th element of ξ be denoted by ξ(i), we have the following qualities: E( ξ k 2) = 1 i=1 ξ(i)2)k/2e PD i=1 ξ(i)2 2σ2 dξ(1) dξ(D) r=0 rke r2/(2σ2)SD(r)dr r=0 r D+k 1e r2/(2σ2)dr r=0 r D+k 2e r2/(2σ2)dr2 2 (2σ2) D+k 2 1e z/(2σ2)d z 2 1e zdz = 2k/2Γ(D+k where Γ(t) = R + 0 st 1e sds is the Gamma function. Plugging the above equality into Var( ξ k 2) = E( ξ 2k 2 ) E( ξ k 2)2, and E[ ξ k 2 E( ξ k 2) 3] =E( ξ 3k 2 ) 3E( ξ 2k 2 )E( ξ k 2) + 3E( ξ k 2)E( ξ k 2)2 E( ξ k 2)3, will yield the variance and third moment. To show the independence, we set FX as the cumulative distribution function of X, St(ζ) = {ξt : ξt k 2 ζ} and ηt = ξt k 2 with t = i, j. Then Fηi,ηj(ζi, ζj) = P(ηi ζi, ηj ζj) = P(ξi Si(ζi), ξj Sj(ζj)) = P(ξi Si(ζi))P(ξj Sj(ζj)) = P(ηi ζi)P(ηj ζj) = Fηi(ζi)Fηj(ζj), Yao and Xia which completes the proof of independence by definition. Based on the independence, we obtain E( ξi s 2 ξj t 2) = E( ξi s 2)E( ξi s 2) = (C1σs) C1(σt) = C4σs+t. Plugging the above equality into Var( ξi s 2 ξj t 2) = E( ξi 2s 2 ξj 2t 2 ) E( ξi s 2 ξj t 2)2 E[ ξi s 2 ξj t 2 E( ξi s 2 ξj t 2) 3] =E( ξi 3s 2 ξj 3t 2 ) 3E( ξi 2s 2 ξj 2t 2 )E( ξi s 2 ξj t 2) + 3E( ξi s 2 ξj t 2)E( ξi s 2 ξj t 2)2 E( ξi s 2 ξj t 2)3, will produce the variance and the third moment. Proposition 26. Suppose {ξi}n i=1 are i.i.d. drawn from N(0, σ2ID), Pn i=1 αi = 1 and maxi {1, ,n} αi Cα/n with certain constant Cα. For any δ, there exist constants C, depending on D, k, δ, and n1, depending on δ and Cα such that if n n1, then i=1 αi ξi k 2 Cσk and 1 n2 j=1 ξi s 2 ξj t 2 Cσs+t holds for k, s, t 4 with probability at least 1 δ. Proof By Lemma 25, ξ1 k 2, , ξn k 2 are i.i.d. random variables drawn from a distribution whose expectation is E( ξ k 2) and variance is Var( ξ k 2). Using the Berry-Esseen Theorem, the cumulative distribution function of the variable i=1 αi ξi k 2 E( ξ k 2)) / (C1/2 2 σk denoted by Fn satisfies |Fn(t) Φ(t)| C0ρ Pn i=1 α3 i σ3k(Pn i=1 α2 i )3/2 = C 0 Pn i=1 α3 i (Pn i=1 α2 i )3/2 C 0 Pn i=1 α3 i 1 n(Pn i=1 αi)2 3/2 = C 0n 3 2 where Φ is the cumulative distribution function of standard normal distribution, ρ is the third moment of ξ k 2, which is in the order of σ3k according to Lemma 25(iii), and the last inequality holds in accordance with Cauchy s inequality. Since αi Cα/n, we obtain Pn i=1 α3 i n(Cα n )3 = C3 αn 2 and therefore |Fn(t) Φ(t)| C / n. So there exists a constant C depending on D, k, and δ such that i=1 αi ξi k 2 Cσk, Manifold Fitting under Unbounded Noise with probability 1 δ 2 C / n. Taking n1 = 4C δ2 , Pn i=1 αi ξi k 2 Cσk with probability at least 1 δ when n n1. Analogously, there exist C and n0 such that j=1 ξi s 2 ξi t 2 Cσs+t. with probability at least 1 δ when n n1. Proof of Lemma 13 By (2.2), we have r < C1. For any given δ, let n0 = max n4C2C2d 1 δ2 , max{n1, 4C2 0 δ2 } c where C and c are the two constants in Proposition 3, n1 is the constant in Proposition 26, and C0 is the constant in Proposition 12. Plugging N n0r d into Proposition 3, we obtain |Ix,r| max{n1, 4C2 0 δ2 } with probability at least 1 δ 2. 
Recalling Proposition 12 (i) and the definition of αi in (2.5), αi 1 α Cα |Ix,r| with probability at least 1 δ 2 and Cα = 1 c0 since 1 C0 2 by |Ix,r| 4C2 0 δ2 . As a result, conditions of Proposition 26 hold with probability at least (1 δ 2)2 1 δ. Using Proposition 3, we are able to complete the proof. Proof of Lemma 14 Corollary 12 of Boissonnat et al. (2019) shows that sin θ(Ux, Uy) 2 x y 2 2reach(M), where Ux and Uy are the basis of Tx M and Ty M, respectively. Letting the orthogonal complements of Ux and Uy be denoted by Vx and Vy, respectively, we obtain Πx = Vx V T x and Πy = Vy V T y . Then, in accordance with (ii) of Lemma 8, Πx Πy F = Vx V T x Vy V T y F = Ux Ux Uy Uy F C sin θ(Ux, Uy) 2 2C sin θ(Ux, Uy) A.4 Proof of Theorem 17 To prove Theorem 17, we first introduce two lemmas, namely Lemma 27 and Lemma 28. Lemma 27. Suppose d(x, M) cr with some constant c < 1 and r = O( σ). For any given δ, there exist constants C and n0 such that if N n0r d, then the following inequalities hold: (i) Pxi Πx i 2 i Ix,r 2 Cr|Ix,r| 1 2 with probability δ0(1 δ)2, Yao and Xia (ii) xi x i 2 Ix,r 2 Cr2|Ix,r| 1 2 with probability 1 δ, (iii) x i x 2 Ix,r 2 Cr|Ix,r| 1 2 with probability 1 δ. Proof We begin with (i). Plugging z = xi into Theorem 11, we obtain i Ix,r (A2 + 2 AB + B2) with probability δ0, where A = C r2 1 |Ixi,2r| P j Ixi,2r ξj 4 2 + ξj 3 2 + ξj 2 2 + r ξj 2 and B = C(r + ξi 2 r + ξi 2 2 r2 ). In accordance with Lemma 13, there exist C and n0 such that i Ix,r AB = C|Ix,r| r 1 |Ix,r| |Ixi,2r| ξj 4 + ξj 3 + ξj 2 + r ξj r3 1 |Ix,r| |Ixi,2r| ξj 4 ξi + ξj 3 ξi + ξj 2 ξi + r ξj ξi r4 1 |Ix,r| |Ixi,2r| ξj 4 ξi 2 + ξj 3 ξi 2 + ξj 2 ξi 2 + r ξj ξi 2 C|Ix,r|(r2 + r2 + r3) Cr2|Ix,r|, r4 1 |Ixi,2r|2 X ξj 4 ξk 4 + ξj 4 ξk 3 + + r2 ξj ξk k=3 σk + r2σ2 Cr2, i Ix,r B2 = X r2 + ξi 2 2 r2 + ξi 4 2 r4 + 2 ξi 2 + 2 ξi 2 2 r + 2 ξi 3 2 r3 with probability 1 δ/3 respectively. The above bounds amount to Pxi Πx i 2 Cr2|Ix,r|, which leads to Pxi Πx i 2 2 Cr|Ix,r| 1 2 , with probability δ0(1 δ)2. As for (ii), xi x i 2 i Ix,r xi x i 2 2 X i Ix,r ξi 2 2 |C|Ix,r|σ2, with probability 1 δ, which implies xi x i 2 Ix,r 2 C|Ix,r| 1 2 σ = Cr2|Ix,r| 1 2 . We derive (iii) based on x i x 2 x i xi 2 + xi x 2 + x x 2 ξi 2 + 2r. Manifold Fitting under Unbounded Noise Thus we have x i x 2 ξi 2 2 + 4r2 + 2r ξi 2 C|Ix,r| σ2 + 4r2 + 2rσ Cr2|Ix,r| with probability 1 δ, which implies x i x 2 2 Cr|Ix,r| 1 2 . Lemma 28. Suppose d(x, M) cr with some constant c < 1, r = O( σ) and β 2. For any given δ, there exist constants C and n0 such that if N n0r d, then 2 with probability 1 δ. Proof By Lemma 13, α(x) c0|Ix,r| with probability at least 1 δ. Based on this, we obtain the following inequalities given 0 αi(x) 1: 2 + ( v α(x)) αi(x) 2 + | v α(x) α2(x) | (1)i Ix,r 2 r |Ix,r| 1 2 P i Ix,r αi(x) r |Ix,r| 1 2 |Ix,r| Proof of Theorem17 We rewrite (2.6) as f(x) = Ψα x i Ix,r αi(x)(x xi), (A.8) and calculate the first derivative of f(x) as i Ix,r αi(x)Ψα x i Ix,r αi(x)( vΨα x)(x xi) i Ix,r ( vαi(x))Ψα x(x xi). Yao and Xia We deal with the three terms one by one. First, i Ix,r αi(x)Ψα x ( v(x xi) = X i Ix,r αi(x)Ψα xv = Ψα xv. To bound the second term of (A.9), we proceed to bound vΨα x 2. In accordance with (26) of Fefferman et al. 
(2018), we establish the relationship between vΨα x 2 and v Ax 2 as follows: vΨα x 2 8 v Ax 2 i vαi(x) (Pxi Πx i ) + (Πx i Πx ) + Πx v X i | vαi(x)| Pxi Πx i 2 + C i | vαi(x)| x i x 2 + 0 2|Ix,r| 1 2 , where the second to the last inequality holds by Cauchy-Schwarz inequality, and the last inequality holds by Lemma 27 and Lemma 28. As a result, vΨα x 2 8 v Ax 2 C. (A.10) Therefore, the second term of (A.9) is bounded as i Ix,r αi(x)( vΨα x)(x xi) 2 X i Ix,r αi(x) vΨα x 2 x xi 2 X i Ix,r αi(x)Cr = Cr. As for the last term in (A.9), we have i Ix,r vαi(x)Ψα x(x xi) 2 X i Ix,r vαi(x)Ψα x(x x i ) 2 Ψα x(x x ) X i Ix,r vαi(x) 2 i Ix,r vαi(x)Ψα x(x i xi) 2 i Ix,r vαi(x)Ψα x(x x i ) 2 + 0 i Ix,r vαi(x)Ψα x(x i xi) 2, Manifold Fitting under Unbounded Noise i Ix,r vαi(x)Ψα x(x x i ) 2 X i Ix,r | vαi(x)| Ψα x(x i x ) 2 Ψα x(x i x ) i Ix,r vαi(x)Ψα x(x i xi) 2 X i Ix,r | vαi(x)| Ψα x(x i xi) 2 Ψα x(x i xi) Ψα x(x i x ) 2 Ψα x Πx 2 x i x 2 + Πx (x i x ) 2 Cr2 + C x i x 2 2 τ Cr2, where the second inequality holds in probability via Theorem 15 and Proposition 2. The above bounds amount to the bound on the first derivative, that is, vf(x) Ψα xv 2 Cr. The above proof is based on Lemma 27, Lemma 28 and Theorem 15, which are valid when Lemma 13 and Theorem 11 hold. Hence, the conclusion obtained from the above proof is valid when Lemma 13 and Theorem 11 simultaneously hold, whose probability is at least δ0(1 δ)2. A.5 Proof of Theorem 19 To prove Theorem 19, we first introduce Lemma 29. Lemma 29. Suppose d(x, M) cr with some constant c < 1, r = O( σ) and β 2. For any given δ, there exist constants C and n0 such that if N n0r d, then r2 |Ix,r| 1 2 with probability 1 δ. 2 v u αi(x) 2 + v u α(x) α2(x) αi(x) + ( v αi(x))( u α(x)) 2 + ( u αi(x))( v α(x)) α(x) u α(x) Yao and Xia We bound these five terms one-by-one using α(x) c|Ix,r| which holds with probability 1 δ by Lemma 13 and 0 αi(x) 1. For the first term, β x xi 2 2 r4 + αi(x) r2 |Ix,r| 1 For the second term, α2(x) αi(x) (1)i Ix,r 2 2 (1)i Ix,r 2 2 r2 |Ix,r| 1|Ix,r| 1 2 |Ix,r| = C r2 |Ix,r| 1 The third and fourth terms are similar, where the third term is bounded by ( v αi(x))( u α(x)) i Ix,r 2 | u α(x) α(x) | v αi(x) i Ix,r 2 (1)i Ix,r 2 v αi(x) 2 |Ix,r| 1 2 C r2 |Ix,r| 1 and analogically, the fourth is bounded by ( u αi(x))( v α(x)) r2 |Ix,r| 1 Finally, the fifth term: α(x) u α(x) α(x) u α(x) r2 |Ix,r| 1 Summing the above five terms up amounts to the proof. Proof of Theorem19 Letting G(x) = P i Ix,r αi(x)(x xi), we obtain the following bound on the second derivative of f(x) v uf(x) 2 ( v uΨα x)G(x) 2 + ( vΨα x)( u G(x)) 2 + ( uΨα x)( v G(x)) 2 + Ψα x( v u G(x)) 2. (A.11) Manifold Fitting under Unbounded Noise For the first term, we have v uΨα x 2 C v Ax 2 u Ax 2 + v u Ax 2 i | v uαi(x)| Pxi Πx i 2 + Πx i Πx 2 + C Πx v u X C + C v uαi(x) + C v uαi(x) r2 |Ix,r| 1 2 Cr|Ix,r| 1 2 C where the second to the last inequality holds by Lemma 27 while the last inequality holds by Lemma 29, and therefore ( v uΨα x)G(x) 2 C r r = C. (A.12) For the second and third terms, v G(x) 2 = v + X i vαi(x)(xi x1) + X i vαi(x) x1 2 i Ix,r 2 (2r)i Ix,r 2 2 2r|Ix,r| 1 2 = 1 + C, and by (A.10) we obtain ( vΨα x) u G(x) 2 C, u G(x) ( vΨα x) 2 C. For the fourth term, we have Ψα x i ( v uαi(x))xi 2 i | v uαi(x)| xi x i 2 + X i | v uαi(x)| Ψα x(x i x ) 2 + 0 ( xi x i )i Ix,r 2 + ( Ψα x(x i x ) )i Ix,r 2 r2 |Ix,r| 1 2 (σ + r2)|Ix,r| 1 2 = C. The above proof is based on Lemma 27 and Lemma 29, which are valid when Lemma 13 and Theorem 11 simultaneously hold. 
Hence, v uf(x) 2 C when Lemma 13 and Theorem 11 simultaneously hold, whose probability is at least δ0(1 δ)2. Yao and Xia A.6 Proof of Proposition 20, Proposition 21 and Lemma 22 Proof of Proposition 20 Considering the function φ(d) = (1 d r2 )β for t 0, whose derivative is φ (d) = β r2 β 1, we obtain |φ (d)| β r2 . This implies | αi(x) αi(z)| β r2 x z 2 2 β r2 ϵ2 α(x) |Ix,2r|2 r r |Ix,2r|, where the last inequality holds since α(x) = P i Ix,r αi(x) P i Ix,r 1 = |Ix,r| |Ix,2r|. For any z BD(x, ϵ), we have z x 2 r and Iz,r Ix,2r. By the definition of αi(z), we have αi(z) = 0 for i / Iz,r, and therefore i Iz,r αi(z) = X i Ix,2r αi(z) αi(x) + αi(z) αi(x) = α(x) + X αj(z) αj(x) . Plug α(z) into the following denominator, |αi(z) αi(x)| = αi(z) ( αi(x) αi(z) αi(x) α(x) P j Ix,2r | αi(z) αi(x)| αi(x) αi(x) + α(x) |Ix,2r|2 r α(x) |Ix,2r| α(x) |Ix,2r|2 r αi(x) α(x) , αi(x) α(x) αi(x) α(x) |Ix,2r|2 r α(x) + |Ix,2r| α(x) ( αi(x) + r |Ix,2r| α(x) α(x) r |Ix,2r| αi(x) α(x) , αi(x) α(x) αi(x) r |Ix,2r| α(x) + α(x) r |Ix,2r| ( αi(x) + r |Ix,2r| (1 + C r |Ix,2r|) αi(x) α(x) , αi(x) αi(x) r |Ix,2r| (1 C r |Ix,2r|) αi(x)r + Cr + Cr2 α(x)|Ix,2r| r + Cr + Cr2 α(x)|Ix,2r| r + Cr + Cr2 c0|Ix,2r| = C r |Ix,2r|, the second-to-last inequality holds since αi(x) 1 while the last inequality holds with probability 1 (1 crd)N by Proposition 12(ii). Based on the upper bound of |αi(x) αi(z)|, we obtain Ax Az 2 = X i Ix,r αi(x)Pxi X i Iz,r αi(z)Pxi 2 = X i Ix,2r αi(x)Pxi X i Ix,2r αi(z)Pxi 2 i Ix,2r |αi(x) αi(z)| Pxi 2 X i Ix,2r C r |Ix,2r| 1 = C r. Noting Ψα x Ax 2 Cr with probability δ0(1 δ)2 by (3.6) in Theorem 15, we have Ψα z Az 2 Ψα x Az 2 Ψα x Ax 2 + Ax Az 2 Cr, Manifold Fitting under Unbounded Noise and hence Ψα x Ψα z Ψα x Az 2 + Az Ψα z 2 Cr with probability δ0(1 δ)2 1 (1 crd)N , which completes this proof. Proof of Proposition 21 We first prove (i). Since the rows of Jf(x) are orthogonal to the contour surface at x, as the basis of the spanning space of Jf(x)T , Wx is also the basis of the normal space of Mout at x and thereby Wx RD (D d) by Theorem 4. This implies P statement (i) holds = P Wx RD (D d) P Theorem 4 holds (A.13) Now we proceed to prove (ii). It is clear that g(z) = 0 if f(z) = 0. Thus, we only need to prove that g(z) = 0 implies f(z) = 0. To do this, we first assume the reverse, f(z) = 0 and g(z) = W T x f(z) = 0. Since W T x is the basis of span Jf(x)T , Jf(x) can be rewritten as Jf(x) = Y W T x and Jf(x)f(z) = Y W T x f(z) = Y g(z) = 0. By the definition of f(z) in equality (2.6), Ψα z f(z) = f(z). Hence, we obtain Jf(x) Ψα z 2 = max v =0 f(z) 2 f(z) 2 = 0 f(z) 2 f(z) 2 = 1. Jf(x) Ψα z 2 Jf(x) Ψα x 2 + Ψα x Πx F + Πx Πz F + Πz Ψα z F Cr where the first term is bounded by (3.8) in Corollary 18, the second and fourth terms are bounded by applying Theorem 15 for x and z, respectively, and the third term is bounded by Lemma 14. We conduct contradictory bounds of Jf(x) Ψα z 2. Hence, f(z) = 0 if g(z) = 0. The statement (ii) is proved when Corollary 18, Theorem 15 and Lemma 14 hold simultaneously. Noticing that Corollary 18 holds when Lemma 13 and Theorem 11 hold, Theorem 4 holds when Theorem 15 and Proposition 12(ii) hold, and Theorem 15 holds when Lemma 13 and Theorem 11 hold, we obtain P statement (i) and (ii) hold P Theorem 4 and Corollary 18 hold for x Theorem 15 holds for x, z P Proposition 12(ii) holds for x P Lemma 13 and Theorem 11 hold for x, z δ2 0(1 δ)4 1 (1 crd)N . The proof is, therefore, complete. Proposition 30. 
Letting σ1 σD be the singular values of Jf(x), then with probability at least δ0(1 δ)2, 1 + O(r) σ1 σD d 1 O(r). Proof Let Ψα x = Vx V T x and Jf(x) = UxΣx W T x be the thin singular value decomposition of Jf(x), where Ux, Wx RD (D d) and Σx R(D d) (D d) by Theorem 4. Yao and Xia To begin with, we bound σD d below. Let S1 = span(Vx) and S2 = span{w1, w D d 1}, where w1, w D d 1 are the first (D d 1) columns of Wx. Since dim(S1) > dim(S2), there exists η = 0 S1 S 2 , which implies Ψα xη = η and w T i η = 0 for i = 1, , D d 1. Hence, Ψα x Jf(x) η = η UxΣx W T x η = η u D dσD dw T D dη, where u D d is the (D d)-th column of Ux. This leads to Ψα x Jf(x) η 2 = η u D dσD dw T D dη 2 η 2 u D dσD dw T D dη 2 = |1 σD d| η 2. Cr Ψα x Jf(x) 2 Ψα x Jf(x) η 2 η 2 = |1 σD d|, where the first inequality holds by (3.8) in Corollary 18. So, σD d 1 O(r). Now, we turn to the upper bound of σ1. Let η = w1, then η 2 = 1 and w T i η = 0 for any i 2. Hence, Ψα x Jf(x) η = Ψα xη UxΣx W T x η = Ψα xη σ1u1. This leads to Cr Ψα x Jf(x) 2 Ψα x Jf(x) η 2 = Ψα xη σ1u1 2 Ψα xη 2 σ1 . So, σ1 Ψα xη 2+Cr 1+Cr. Note that the above proof relies on Theorem 4 and Corollary 18, where Theorem 4 and Corollary 18 hold when Lemma 13, Theorem 11 and Proposition 12(ii) hold. This proof is completed with probability at least δ0(1 δ)2 1 (1 crd)N . Proposition 31. W T x Ψα x W T x ID d 2 Cr with probability at least δ0(1 δ)2 1 (1 crd)N . Proof By Theorem 4, we obtain Wx RD (D d). Let the singular value decomposition of W T x Vx = PD d i=1 siaib T i , where si is the i-th singular value of W T x Vx and ai and bi are the singular vectors corresponding to si. Let η = Vxb D d, then Cr Ψα x Jf(x) 2 Ψα x Jf(x) η 2 = Vxb D d UxΣx W T x Vxb D d 2 i=1 siaib T i )b D d 2 = 1 s D d Σxa D d 2 , where the first inequality holds by (3.8) in Corollary 18. This leads to 1 Cr Σxa D d 2 s D d 1 + Cr Σxa D d 2 . Manifold Fitting under Unbounded Noise Noticing 1 O(r) σxa D d 2 1 + O(r) by Proposition 30, we conclude 1 O(r) s D d s1 1 since W T x Vx 2 1. So, W T x Ψα x Wx ID d 2 = W T x Vx V T x Wx ID d 2 = ASST AT AAT 2 = A(SST ID d)AT 2 Cr, where A = [a1, , a D d] and S is a diagonal matrix with (s1, , s D d) as the diagonal entries. Note that the above proof relies on Theorem 4 and Corollary 18, where Theorem 4 and Corollary 18 hold when Lemma 13, Theorem 11 and Proposition 12(ii) hold. This proof is completed with probability at least δ0(1 δ)2 1 (1 crd)N . Proof of Lemma 22 Under the settings that the first d coordinates are the basis of Tx Mout, and the last D d coordinates are the columns of Wx, Wx can be rewritten as Wx = (0, ID d)T . Hence, we obtain Jg(z)(0, ID d)T = W T x Jf(z)Wx =W T x Jf(z) Jf(x) Wx + W T x Jf(x) Ψα x Wx + W T x Ψα x Wx ID d + ID d. This leads to Jg(z)(0, ID d)T ID d 2 Jf(z) Jf(x) 2 + Jf(x) Ψα x 2 + W T x Ψα x Wx ID d 2 C1r + C2r + C3r Cr, where Jf(z) Jf(x) 2 Jf(z) Jf(x) F C1r by ( 4.2) in Theorem 5, Jf(x) Ψα x 2 Jf(x) Ψα x F C2r by Corollary 18 and W T x Ψα x Wx ID d 2 C3r by Proposition 31. Using Theorem 2.9.10 (the implicit function theorem) in Hubbard and Hubbard (2001), φ exits. 
Carrying out the first derivative on g ζ, φ(ζ) = 0, we obtain 0 = sg(ζ, φ(ζ)) = Jg(ζ, φ(ζ)) sζ sφ(ζ) = W T x Jf(ζ, φ(ζ)) Jf(x) sζ sφ(ζ) + W T x Jf(x) sζ sφ(ζ) = W T x Jf(ζ, φ(ζ)) Jf(x) sζ sφ(ζ) + W T x UxΣx(0, ID d) sζ sφ(ζ) This implies that sφ(ζ) = Σ 1 x W T x Ux 1W T x Jf(ζ, φ(ζ)) Jf(x) sζ sφ(ζ) Calculating ℓ2-norm of the two sides of the above equality, we obtain sφ(ζ) 2 = Σ 1 x W T x Ux 1W T x Jf(ζ, φ(ζ)) Jf(x) sζ sφ(ζ) 1 + O(r) C Jf(ζ, φ(ζ)) Jf(x) 2 C (ζ, φ(ζ)) x 2 Yao and Xia Carrying out the second derivative on g(ζ, φ(ζ)) = 0, we obtain 0 = t Jg(ζ, φ(ζ)) sζ sφ(ζ) + Jg(ζ, φ(ζ)) 0 t sφ(ζ) Letting ei denote the i-th column of ID and u = tζ tφ(ζ) the i-th column of t Jg(ζ, φ(ζ)) is t eig(ζ, φ(ζ)) = u 2 u u 2 eig(ζ, φ(ζ)) = u 2W T x u u 2 eif(ζ, φ(ζ)). In conjunction with u u 2 eif(ζ, φ(ζ)) 2 C, as proved in Theorem 19, t eig(ζ, φ(ζ)) C, and therefore t Jg(ζ, φ(ζ) 2 C. t sφ(ζ) = Σ 1 x W T x Ux 1 t Jg(ζ, φ(ζ)) sζ sφ(ζ) + W T x Jf(ζ, φ(ζ)) Jf(x) sζ sφ(ζ) which implies t sφ(ζ) 2 C(1 + O(r)) C + C z x 2 C. Note that the above proof relies on Theorem 5, Corollary 18, Proposition 31 and Theorem 19, which are valid when Lemma 13, Theorem 11 and Proposition 12(ii) hold. Hence, this proof is completed with probability at least δ0(1 δ)2 1 (1 crd)N . Appendix B. Gradient of f(x) 2 2 Let F(x) = f(x) 2 2, d denote the differential and G(x) = x P i Ix,r αi(x)xi, then d F(x) = 2 f(x), df(x) = 2 Ψα x G(x), d = 2 Ψα x G(x)G(x)T , dΨα x + 2 Ψα x G(x), d G(x) = 2 Ψα x G(x)G(x)T , dΨα x + 2 Ψα x G(x), dx X dαi(x) = d αi(x) α(x) αi(x)dα(x) α(x)2 = d αi(x) α(x) αi(x)dα(x) r2α(x) αi(x) d+1 d+2 (x xi) αi(x) X i αi(x) d+1 d+2 (x xi), dx Manifold Fitting under Unbounded Noise and dΨα x can be calculated as below. Let λ1 λn be the eigenvalues of Ax and µ1 > > µs are the different values of {λi}. Suppose λn d > λn d+1, and µ1 > > µt are the different values of λ1 λn d. Pi,x = Vi,x V T i,x is an orthogonal projection and columns of Vi,x are the eigenvectors corresponding to µi. Then, we have Ψα x = Pt i=1 Pi,x. By Shapiro and Fan (1995), 1 µj µi Pi,x(d Ax)Pj,x + Pj,x(d Ax)Pi,x, and thereby i=1 d Pi,x = 1 µj µi Pi,x(d Ax)Pj,x + Pj,x(d Ax)Pi,x Plug dΨα x into the first term of d F(x), Ψα x G(x)G(x)T , dΨα x = T, d Ax = X i Ix,r T, Pxi dαi(x) where T = Pt i=1 Ps j=t+1 1 µj µi Pi,x Ψα x G(x)G(x)T Pj,x+Pj,x Ψα x G(x)G(x)T Pi,x. Plugging dαi(x) into the second term of d F(x), we obtain Ψα x G(x), dx X dαi(x)xi = Ψα x G(x), dx X i Ix,r Ψα x G(x), xi dαi(x) As the summation of the first and second term, d F(x) = D 2 X T, Pxi + Ψα x G(x), xi dαi(x) dx + Ψα x G(x), dx E So the gradient of F(x) is grad(x) = 2 X T, Pxi + Ψα x G(x), xi dαi(x) dx + Ψα x G(x). (B.1) Appendix C. Results of Facial Image Denoising Yao and Xia Figure 10: Performance of facial image denoising with ρ = 0.2. The first row consists of original images while the second row features blurred images. The third to seventh rows contain deblurred images using km17, cf18, ya21(deg=1), ya21(deg=2) and our method, respectively. Manifold Fitting under Unbounded Noise Figure 11: Performance of facial image denoising with ρ = 0.4. The first row consists of original images while the second row again shows blurred images. The third to seventh rows contain deblurred images using km17, cf18, ya21(deg=1), ya21(deg=2) and our method, respectively. Yao and Xia Eddie Aamari and Cl ement Levrard. Stability and minimax optimality of tangential delaunay complexes for manifold reconstruction. Discrete & Computational Geometry, 59: 923 971, 2018. 
Eddie Aamari and Clément Levrard. Nonasymptotic rates for manifold, tangent space and curvature estimation. The Annals of Statistics, 47:177-204, 2019.
Yariv Aizenbud and Barak Sober. Non-parametric estimation of manifolds from noisy data. arXiv preprint arXiv:2105.04754, 2021.
Jeffrey Banfield and Adrian Raftery. Ice floe identification in satellite images using mathematical morphology and clustering about principal curves. Journal of the American Statistical Association, 87:7-16, 1992.
Jean-Daniel Boissonnat and Arijit Ghosh. Manifold reconstruction using tangential Delaunay complexes. Discrete & Computational Geometry, 51:221-267, 2014.
Jean-Daniel Boissonnat, André Lieutier, and Mathijs Wintraecken. The reach, metric distortion, geodesic convexity and the variation of tangent spaces. Journal of Applied and Computational Topology, 3:29-58, 2019.
Herbert Federer. Curvature measures. Transactions of the American Mathematical Society, 93:418-491, 1959.
Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29:983-1049, 2016.
Charles Fefferman, Sergei Ivanov, Yaroslav Kurylev, Matti Lassas, and Hariharan Narayanan. Fitting a putative manifold to noisy data. In Conference on Learning Theory, pages 688-720, 2018.
Christopher Genovese, Marco Perone-Pacifico, Isabella Verdinelli, and Larry Wasserman. The geometry of nonparametric filament estimation. Journal of the American Statistical Association, 107:788-799, 2012a.
Christopher Genovese, Marco Perone-Pacifico, Isabella Verdinelli, and Larry Wasserman. Minimax manifold estimation. Journal of Machine Learning Research, 13:1263-1291, 2012b.
Christopher Genovese, Marco Perone-Pacifico, Isabella Verdinelli, and Larry Wasserman. Manifold estimation and singular deconvolution under Hausdorff loss. The Annals of Statistics, 40:941-963, 2012c.
Christopher Genovese, Marco Perone-Pacifico, Isabella Verdinelli, and Larry Wasserman. Nonparametric ridge estimation. The Annals of Statistics, 42:1511-1545, 2014.
Athinodoros Georghiades, Peter Belhumeur, and David Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23:643-660, 2002.
Dian Gong, Fei Sha, and Gérard Medioni. Locally linear denoising on image manifolds. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 265-272, 2010.
S. L. Happy, Anirban Dasgupta, Anjith George, and Aurobinda Routray. A video database of human faces under near infra-red illumination for human computer interaction applications. In the Fourth International Conference on Intelligent Human Computer Interaction, pages 1-4, 2012.
Trevor Hastie and Werner Stuetzle. Principal curves. Journal of the American Statistical Association, 84:502-516, 1989.
John Hubbard and Barbara Hubbard. Vector Analysis, Linear Algebra, and Differential Forms: A Unified Approach. Matrix Editions, Ithaca, 2001.
Yunqian Ma and Yun Fu. Manifold Learning Theory and Applications. CRC Press, 2011.
Kitty Mohammed and Hariharan Narayanan. Manifold learning using kernel density estimation and local principal components analysis. arXiv preprint arXiv:1709.03615, 2017.
Sameer Nene, Shree Nayar, Hiroshi Murase, et al. Columbia object image library (COIL-20). Technical Report CUCS-006-96, 1996.
Partha Niyogi, Stephen Smale, and Shmuel Weinberger.
Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry, 39:419-441, 2008.
Umut Ozertem and Deniz Erdogmus. Locally defined principal curves and surfaces. Journal of Machine Learning Research, 12:1249-1286, 2011.
Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
Alexander Shapiro and Michael Fan. On eigenvalue optimization. SIAM Journal on Optimization, 5:552-569, 1995.
Derek Stanford and Adrian Raftery. Finding curvilinear features in spatial point patterns: principal curve clustering with noise. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:601-609, 2000.
Jakob J. Verbeek, Nikos Vlassis, and B. Kröse. A k-segments algorithm for finding principal curves. Pattern Recognition Letters, 23:1009-1017, 2002.
Zhigang Yao and Zhenyue Zhang. Principal boundary on Riemannian manifolds. Journal of the American Statistical Association, 115:1435-1448, 2020.
Zhigang Yao, Yuqing Xia, and Zengyan Fan. Random fixed boundary flows. Journal of the American Statistical Association, 119:2356-2368, 2024.