# structureadaptive_manifold_estimation__df830eb9.pdf Journal of Machine Learning Research 23 (2022) 1-62 Submitted 3/21; Revised 12/21; Published 2/22 Structure-Adaptive Manifold Estimation Nikita Puchkin npuchkin@hse.ru National Research University Higher School of Economics, Pokrovsky boulevard 11, 109028 Moscow, Russian Federation and Institute for Information Transmission Problems RAS, Bolshoy Karetny per. 19, build.1, 127051 Moscow, Russian Federation Vladimir Spokoiny spokoiny@wias-berlin.de Weierstrass Institute and Humboldt University, Mohrenstrasse 39, 10117 Berlin, Germany and National Research University Higher School of Economics, Pokrovsky boulevard 11, 109028 Moscow, Russian Federation and Institute for Information Transmission Problems RAS, Bolshoy Karetny per. 19, build.1, 127051 Moscow, Russian Federation Editor: Miguel Carreira-Perpinan We consider a problem of manifold estimation from noisy observations. Many manifold learning procedures locally approximate a manifold by a weighted average over a small neighborhood. However, in the presence of large noise, the assigned weights become so corrupted that the averaged estimate shows very poor performance. We suggest a structureadaptive procedure, which simultaneously reconstructs a smooth manifold and estimates projections of the point cloud onto this manifold. The proposed approach iteratively refines the weights on each step, using the structural information obtained at previous steps. After several iterations, we obtain nearly oracle weights, so that the final estimates are nearly efficient even in the presence of relatively large noise. In our theoretical study, we establish tight lower and upper bounds proving asymptotic optimality of the method for manifold estimation under the Hausdorffloss, provided that the noise degrades to zero fast enough. Keywords: manifold learning, manifold denoising, structural adaptation, adaptive procedures, minimax 1. Introduction We consider a problem of manifold learning, that is, to recover a low dimensional manifold from a cloud of points in a high dimensional space. This problem is of great theoretical and practical interest. For instance, if one deals with a problem of supervised or semi-supervised regression, the feature vectors, though lying in a very high-dimensional space, may occupy only a low-dimensional subset. In this case, one can hope to obtain a rate of prediction which depends on the intrinsic dimension of the data rather than on the ambient one and escape the curse of dimensionality. At the beginning of the century, the popularity of manifold 2022 Nikita Puchkin and Vladimir Spokoiny. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-0338.html. Puchkin and Spokoiny learning gave rise to several novel nonlinear dimension reduction procedures, such as Isomap (Tenenbaum et al., 2000), locally linear embedding (Roweis and Saul, 2000, LLE) and its modification (Zhang and Wang, 2006), Laplacian eigenmaps (Belkin and Niyogi, 2003), and t-SNE (van der Maaten and Hinton, 2008). More recent works include interpolation on manifolds via geometric multi-resolution analysis (Maggioni et al., 2016), local polynomial estimators (Aamari and Levrard, 2019) and numerical solution of PDE (Shi and Sun, 2017). It is worth mentioning that all these works assume that the data points either lie exactly on the manifold or in its very small vicinity (which shrinks as the sample size n tends to infinity), so the noise ε is so negligible that it may be ignored and put into a remainder term in Taylor s expansion. However, in practice, this assumption can be too resrictive. and the observed data do not exactly lie on a manifold. One may think of this situation as there are unobserved true features that lie exactly on the manifold and the learner observes its corrupted versions. Such noise corruption leads to a dramatic decrease in the quality of manifold reconstruction for those algorithms which misspecify the model and assume that the data lies exactly on the manifold. Therefore, one has to do a preliminary step, which is called manifold denoising (see e.g. (Hein and Maier, 2006; Wang and Carreira-Perpinan, 2010; Gong et al., 2010)), to first project the data onto the manifold. Such methods usually act locally, i.e. consider a set of small neighborhoods, determined by a smoothing parameter (e.g. a number of neighbors or a radius h), and construct local approximations based on these neighborhoods. The problem of this approach is that the size of the neighborhood must be large compared to the noise magnitude M, which may lead to a non-optimal choice of the smoothing parameter. The exclusion is the class of procedures, based on an optimization problem, such as mean-shift (Fukunaga and Hostetler, 1975; Cheng, 1995) and its variants (Wang and Carreira-Perpinan, 2010; Ozertem and Erdogmus, 2011; Genovese et al., 2014). The mean-shift algorithm may be viewed as a generalized EM algorithm applied to the kernel density estimate (see (Carreira-Perpinan, 2007)). This algorithm and its modifications were extensively studied in the literature (Comaniciu and Meer, 2002; Hein and Maier, 2006; Li et al., 2007; Genovese et al., 2014; Arias-Castro et al., 2016). For a comprehensive review on mean-shift algorithms, a reader is referred to (Carreira-Perpinan, 2015). Though mean-shift algorithm was initially proposed for mode seeking and clustering, it found its applications in manifold denoising (see e.g. (Hein and Maier, 2006; Wang and Carreira-Perpinan, 2010; Ozertem and Erdogmus, 2011; Genovese et al., 2014; Carreira Perpinan, 2015)). If the observations lie around a smooth manifold, then few iterations of the mean-shift algorithm move the data towards the manifold. However, since the mean shift algorithm and its variants (for example, subspace-constrained mean-shift (Ozertem and Erdogmus, 2011; Genovese et al., 2014) which is based on density ridges (Eberly et al., 1994)) approximate the true density of Y1, . . . , Yn by the kernel density estimate, they may suffer from the curse of dimensionality and the rates of convergence we found in the literature depend on the ambient dimension rather than on the intrinsic one in the noisy case. To our best knowledge, only papers (Genovese et al., 2012a,b) consider the case, when the noise magnitude does not tend to zero as n grows. However, the approach in (Genovese et al., 2012a,b) assumes that the noise distribution is known and has a very special structure. For instance, considered in (Genovese et al., 2012a), the noise has a uniform distribution in the direction orthogonal to the manifold tangent space. Without belittling a significant impact of this paper, the assumption about the uniform distribution is unlikely to hold Structure-adaptive Manifold Estimation in practice. Moreover, the authors point out that their goal was to establish minimax rates rather than propose a practical estimator. Thus, there are two well studied extremal situations in manifold learning. The first one corresponds to the case of totally unknown noise distribution but extremely small noise magnitude, and the other one corresponds to the case of large noise, which distribution is completely known. This paper aims at studying the problem of manifold recovery under weak and realistic assumptions on the noise. Below we focus on a model with additive noise. Suppose we are given an i.i.d. sample Yn = (Y1, . . . , Yn), where Yi are independent copies of a random vector Y in RD, generated from the model Y = X + ε. (1) Here X is a random element whose distribution is supported on a low-dimensional manifold M RD, dim(M ) = d < D, and ε is a full dimensional noise. The goal of a statistician is to recover the corresponding unobserved variables Xn = {X1, . . . , Xn}, which lie on the manifold M , and estimate M itself. Assumptions on the noise are crucial for the quality of estimation. One usually assumes that the noise is not too large, that is, ε M almost surely for some relatively small noise magnitude M. If the value M is smaller than the reach1 of the manifold then the noise can be naturally decomposed in a component aligned with the manifold tangent space and another component describing the departure from the manifold. It is clear that the impact of these two components is different, and it is natural to consider an anisotropic noise. For this purpose, we introduce a free parameter b which controls the norm of the tangent component of the noise; see (A3) for the precise definition. The pair of parameters (M, b) characterizes the noise structure more precisely than just the noise magnitude M and allows us to understand the influence of the noise anisotropy on the rates of convergence. In our work we are particularly interested in situations when b is of order 1 (non-orthogonal noise) and when b is small (nearly orthogonal noise) but our theoretical study is also valid for intermediate values of b (see Equation A4 below). We still have to assume that the noise magnitude M = M(n) tends to zero as n tends to infinity but aim at describing the best possible rate of convergence still ensuring a consistent estimation. More precisely, if b is sufficiently small we allow M to be of order n 2/(3d+8), which is much slower than, for instance, (log n/n)2/d and (log n/n)1/d, considered in (Aamari and Levrard, 2019) and (Aamari and Levrard, 2018), respectively (see the assumption (A4.1) for the precise statement). To the best of our knowledge, this is the first paper which provides a rigorous theoretical study in this setup as well as the setup for intermediate b. As already mentioned, most of the existing manifold denoising procedures involve some nonparametric local smoothing methods with a corresponding bandwidth. The use of isotropic smoothing leads to the constraint that the noise magnitude is significantly smaller than the width of local neighborhoods; see e.g. (Hein and Maier, 2006; Maggioni et al., 2016; Osher et al., 2017; Aamari and Levrard, 2019). Similar problem arises even the case of effective dimension reduction in regression corresponding to the case of linear manifolds. The use of anisotropic smoothing helps to overcome this difficulty and to build efficient and asymptotically optimal estimation procedures; see e.g. (Xia et al., 2002) or (Hristache et al., 2001a). This paper extends the idea of structural adaptation proposed in (Hristache et al., 2001b,a). In these papers, the authors suggested to use anisotropic elliptic neighborhoods 1. A reader is referred to Section 2 for the definition. Puchkin and Spokoiny with axes shinking in the direction of the estimated effective dimension reduction (e.d.r.) subspace and stretching in the orthogonal directions to estimate the e.d.r. subspace. As the shape of the local neighborhoods depends on the unknown e.d.r. structure, the procedure learns this structure from the data using iterations. This explains the name structural adaptation . The use of anisotropic smoothing allows to obtain semiparametrically efficient root-n consistent estimates of the e.d.r. space (Xia et al., 2002; Hristache et al., 2001a). In our method, we construct cylindric neighborhoods, which are stretched in a normal direction to the manifold. However, our paper is not a formal generalization of (Hristache et al., 2001b) and (Hristache et al., 2001a). Those papers considered a regression setup, while our study focuses on a special unsupervised learning problem. This requires to develop essentially different technique and use different mathematical tools for theoretical study and substantially modify of the procedure. Also to mention that a general manifold learning is much more involved than just linear dimension reduction, and a straightforward extension from the linear case is not possible. Now we briefly describe our procedure. Many manifold denoising procedures (see, for instance, (Hein and Maier, 2006; Genovese et al., 2014; Osher et al., 2017; Aamari and Levrard, 2018)) act in an iterative manner and our procedure is not an exception. We start with some guesses bΠ(0) 1 , . . . , bΠ(0) n of the projectors onto the tangent spaces of M at the points X1, . . . , Xn, respectively. These guesses may be very poor, in fact. Nevertheless, they give a bit of information, which can be used to construct initial estimates b X(0) 1 , . . . , b X(0) n . On the other hand, the estimates b X(0) 1 , . . . , b X(0) n help to construct the estimates bΠ(1) 1 , . . . , bΠ(1) n of the projectors onto the tangent spaces of M at the points X1, . . . , Xn, respectively, which are better than bΠ(0) 1 , . . . , bΠ(0) n . One can repeat these two steps to iteratively refine the estimates of X1, . . . , Xn and of the manifold M itself. We call this approach a structureadaptive manifold estimation (SAME). We show that SAME constructs such estimates b X1, . . . , b Xn of X1, . . . , Xn and a manifold estimate c M of M , such that max 1 i n b Xi Xi Mb Mh h2 D(h2 M2) log n nhd , (Theorem 1) d H( c M, M ) M2b2 D(h4/κ2 M2) log n nhd , (Theorem 2) provided that h (D log n/n)1/d (DM2κ2 log n/n)1/(d+4) and M and, possibly, b degrade to zero fast enough, and both inequalities hold with an overwhelming probability. Here h is the width of a cylindrical neighborhood, which we are able to control, κ is a lower bound for the reach of M (see Section 2 for the definition of reach). Moreover, our algorithm estimates projectors Π(X1), . . . , Π(Xn) onto tangent spaces at X1, . . . , Xn. It produces estimates bΠ1, . . . , bΠn, such that max 1 i n bΠi Π(Xi) h D(h4/κ2 M2) log n nhd (Theorem 1) with high probability. Here, for any matrix A, A denotes its spectral norm. The notation f(n) g(n) means that there exists a constant c > 0, which does not depend on n, such that f(n) cg(n). d H( , ) denotes the Hausdorffdistance and it is defined as follows: d H(M1, M2) = inf {ε > 0 : M1 M2 B(0, ε), M2 M1 B(0, ε)} , Structure-adaptive Manifold Estimation where stands for the Minkowski sum and B(0, r) is a Euclidean ball in RD of radius r. The optimal choice of h yields max 1 i n b Xi Xi Mb DM2κ2 log n d H( c M, M ) M2b2 DM2κ2 log n max 1 i n bΠi Π(Xi) 1 DM2κ2 log n Note that the optimal choice of h is much smaller than a possible value n 2/(3d+8) of the noise magnitude M. As pointed out in (Genovese et al., 2012b), the manifold estimation can be considered as a particular case of the error-in-variables regression problem. Then the rate (M2/n log n)2/(d+4) makes sense since it corresponds to an optimal accuracy of locally linear estimation with respect to -norm in a nonparametric regression problem (which is also (M2/n log n)2/(d+4)). Besides, we prove a lower bound inf c M sup M Ed H( c M, M ) M2b2 κ3 κ 1 M2κ2 log n 2 d+4 (Theorem 3) which has never appeared in the manifold learning literature. Here c M is an arbitrary estimate of M and M fulfills some regularity conditions, which are precisely specified in Theorem 3. Theorem 3, together with Theorem 1 from (Kim and Zhou, 2015), where the authors managed to obtain the lower bound inf c M sup M Ed H( c M, M ) (log n/n)2/d, claims optimality of our method. The rest of this paper is organized as follows. In Section 2, we formulate model assumptions and introduce notations. In Section 3, we provide our algorithm for manifold denoising and then illustrate its performance in Section 4. Finally, in Section 5, we give a theoretical justification of the algorithm and discuss its optimality. The proofs of the main results are collected in Section 6. Many technical details are contained in Appendix. 2. Model Assumptions Let us remind that we consider the model (1), where X belongs to the manifold M and the distribution of the error vector ε will be described a bit later in this section. First, we require regularity of the underlying manifold M . We assume that it belongs to a class M d κ of twice differentiable, compact, connected manifolds without a boundary, contained in a ball B(0, R), with a reach, bounded below by κ, and dimension d: M M d κ = M RD : M is a compact, connected manifold without a boundary, M C2, M B(0, R), (A1) reach (M) κ, dim(M) = d < D . Puchkin and Spokoiny The reach of a manifold M is defined as a supremum of such r that any point in M B(0, r) has a unique (Euclidean) projection onto M. Here stands for the Minkowski sum and B(0, r) is a Euclidean ball in RD of radius r. One can also use the following equivalent definition of the reach (see (Genovese et al., 2012a, Section 2.1)). For a point x M, let Tx M stand for a tangent space of M at x, i. e. a linear space spanned by the derivative vectors of smooth curves on the manifold passing through x, and define a fiber Fr(x) = {x} Tx M B(x, r), where (Tx M) is an orthogonal complement of Tx M. Then reach (M) is a supremum of such r > 0 that for any x, x M, x = x , the sets Fr(x) and Fr(x ) do not intersect: reach (M) = sup r > 0 : x, x M, x = x , Fr(x) Fr(x ) = . The requirement that the reach is bounded away from zero prevents M from having a large curvature. In fact, if the reach of M is at least κ, then the curvature of any geodesic on M is bounded by 1/κ (see (Genovese et al., 2012a, Lemma 3)). Second, the density p(x) of X (with respect to the d-dimensional Hausdorffmeasure on M ) meets the following condition: p1 p0 > 0 : x M p0 p(x) p1, (A2) L 0 : x, x M |p(x) p(x )| L x x Besides the aforementioned conditions on M and X, we require some properties of the noise ε. We suppose that, given X M , the conditional distribution (ε | X) fulfils the following assumption: there exist 0 M < κ and 0 b κ, such that E(ε | X) = 0, ε M < κ, (A3) κ P( | X)-almost surely, where Π(X) is the projector onto the tangent space TXM . The model with manifold M M d κ and the bounded noise has been extensively studied in literature (see (Genovese et al., 2012a; Maggioni et al., 2016; Aamari and Levrard, 2018, 2019; Trillos et al., 2019)). In (Fefferman et al., 2018), the authors consider the Gaussian noise, which is unbounded, but they restrict themselves on the event max 1 i n εi κ, which is essentially similar to the case of bounded noise. In our work, we introduce an additional parameter b [0, κ], which characterises maximal deviation in tangent direction. The pair of parameters (M, b) determines the noise structure more precisely than just the noise magnitude M. If b = 0, we deal with perpendicular noise, which was studied in (Genovese et al., 2012a; Aamari and Levrard, 2019). The case b = κ corresponds to the bounded noise, which is not constrained to be orthogonal. Such model was considered, for instance, in (Aamari and Levrard, 2018). In our work, we provide upper bounds on accuracy of manifold estimation for all pairs (M, b) satisfying the following conditions: M An 2 3d+8 , M3b2 ακ D log n d DM2κ2 log n n 4 d+4 , (A4) Structure-adaptive Manifold Estimation where A and α are some positive constants. Among all the pairs (M, b), satisfying (A4), we can highlight two cases. The first one is the case of maximal admissible magnitude: M = M(n) An 2 3d+8 , (A4.1) b = b(n) ακ d DM2κ2 log n The second one is the case of maximal admissible angle: b = κ, M = M(n) D4αd+4 1 3d+4 n 4 3d+4 . (A4.2) If (A4.1) holds, we deal with almost perpendicular noise. Note that in this case the condition (A3) ensures that X is very close to the projection πM (Y ) of Y onto M . Here and further in this paper, for a closed set M and a point x, πM(x) stands for a Euclidean projection of x onto M. Thus, estimating X1, . . . , Xn, we also estimate the projections of Y1, . . . , Yn onto M . Also, we admit that the noise magnitude M may decrease as slow as n 2/(3d+8). We discuss this condition in details in Section 5 after Theorem 1 and compare it with other papers to convince the reader that the assumption M An 2/(3d+8) is mild. In fact, to the best of our knowledge, only in (Genovese et al., 2012a,b) the authors impose weaker assumptions on the noise magnitude. At the first glance, the condition (A4.1) looks very similar to the case of orthogonal noise b = 0. However, our theoretical study reveals a surprising effect: the existing lower bounds for manifold estimation in the case of perpendicular noise are different from the rates we prove for the case of almost perpendicular noise satisfying (A4.1). We provide the detailed discussion in Section 5 below. Finally, if (A4.2) holds, the noise is not constrained to be orthogonal. However, in this case, we must impose more restrictive condition on the noise magnitude than in (A4.1). Nevertheless, under the condition (A4.2), we show that the result of Aamari and Levrard (2018), Theorem 2.7, where the authors also consider bounded noise, can be improved if one additionally assumes that the log-density log p(x) is Lipschitz. A more detailed discussion is provided in Section 5. 3. A Structure-adaptive Manifold Estimator (SAME) In this section we propose a novel manifold estimation procedure based on a nonparametric smoothing technique and structural adaptation idea. One of the most popular methods in nonparametric estimation is weighted averaging: b X(loc) i = j=1 w(loc) ij Yj j=1 w(loc) ij , 1 i n, (2) and w(loc) ij are the localizing weights defined by w(loc) ij = K Yi Yj 2 , 1 i, j n, Puchkin and Spokoiny where K( ) is a smoothing kernel and the bandwidth h = h(n) is a tuning parameter. In this paper, we consider the kernel K(t) = e t. Remark 1 Instead of K(t) = e t, one can take any two times differentiable, monotonously decreasing on R+ function such that it and its first and second derivatives have either exponential decay or finite support. We use K(t) = e t to avoid further complications of the proofs. The estimate (2) has an obvious limitation. Consider a pair on indices (i, j) such that Xi Xj < h and h = h(n) is of order (log n/n)1/d, which is known to be the optimal choice in the presence of small noise (see (Aamari and Levrard, 2018, Proposition 5.1) and (Aamari and Levrard, 2019, Theorem 6)). If the noise magnitude M is much larger than (log n/n)1/d (which is the case we also consider), then M > h and the weights w(loc) ij carry wrong information about the neighborhood of Xi, i.e. w(loc) ij can be very small even if the distance Xi Xj is smaller than h. This leads to a large variance of the estimate (2) when h is of order (log n/n)1/d, and one has to increase the bandwidth h, inevitably making the bias of the estimate larger. The argument in the previous paragraph leads to the conclusion that the weights w(loc) ij must be adjusted. Let us fix any i from 1 to n. Ideal localizing weights wij are such that they take into account only those indices j, for which the norm Xi Xj does not exceed the bandwidth h too much. Of course, we do not have access to compute the norms Xi Xj for all pairs but assume for a second that the projector Π(Xi) onto the tangent space TXi M was known. Then, instead of the weights w(loc) ij , one would rather use the ones of the form wij(Π(Xi)) = K Π(Xi)(Yi Yj) 2 to remove a large orthogonal component of the noise. The norm Π(Xi)(Yi Yj) turns out to be closer to Xi Xj than Yi Yj , especially if the ambient dimension is large. Thus, instead of the ball {Y : Y Yi h} around Yi, we consider a cylinder {Y : Πi(Yi Y ) h}, where Πi is a projector, which is assumed to be close to Π(Xi). One just has to ensure that the cylinder does not intersect M several times. For this purpose, we introduce the weights wij(Πi) = K Πi(Yi Yj) 2 1 ( Yi Yj τ) , 1 j n, (3) with a constant τ < κ. The adjusted weights (3) require a good guess Πi of the projector Π(Xi). The question is how to find this guess. We use the following strategy. We start with poor estimates bΠ1(0), . . . , bΠ(0) n of Π(X1), . . . , Π(Xn) and take a large bandwidth h0. Then we compute the weighted average estimates b X(1) 1 , . . . , b X(1) n with the adjusted weights (3) and the bandwidth h0. These estimates can be then used to construct estimates bΠ(1) 1 , . . . , bΠ(1) n of Π(X1), . . . , Π(Xn), which are better than bΠ(0) 1 , . . . , bΠ(0) n . After that, we repeat the described steps with a bandwidth h1 < h0. This leads us to an iterative procedure, which is given by Algorithm 1. Let us discuss the role of the parameter γ in Algorithm 1. After the computation of the estimates b X(k) 1 , . . . , b X(k) n , our goal is to use them to update the projectors bΠ(k) 1 , . . . , bΠ(k) n . Structure-adaptive Manifold Estimation Algorithm 1 Structure-adaptive manifold estimator (SAME) 1: The sample of noisy observations Yn = (Y1, . . . , Yn), the initial guesses bΠ(0) 1 , . . . , bΠ(0) n of Π(X1), . . . , Π(Xn), the number of iterations K + 1, an initial bandwidth h0, the threshold τ and constants a > 1 and γ > 0 are given. 2: for k from 0 to K do 3: Compute the weights w(k) ij according to the formula w(k) ij = K bΠ (k) i (Yi Yj) 2 1 ( Yi Yj τ) , 1 i, j n . 4: Compute the estimates j=1 w(k) ij Yj . n X j=1 w(k) ij , 1 i n . (4) 5: If k < K, for each i from 1 to n, define a set J (k) i = {j : b X(k) j b X(k) i γhk} and compute the matrices bΣ (k) i = X ( b X(k) j b X(k) i )( b X(k) j b X(k) i )T , 1 i n . 6: If k < K, for each i from 1 to n, define bΠ(k+1) i as a projector onto a linear span of eigenvectors of bΣ(k) i , corresponding to the largest d eigenvalues. 7: If k < K, set hk+1 = a 1hk. return the estimates b X1 = b X(K) 1 , . . . , b Xn = b X(K) n . However, our theoretical analysis (see Lemma 3 in the proof of Theorem 1 below) reveals that the weighted average (4) removes well an orthogonal component of the noise but causes a shift of b X(k) 1 , . . . , b X(k) n in tangent direction. An illustration is given in Figure 1. Even if the noise is nearly orthogonal, the tangent component of b X(k) i Xi is O(hk) while the orthogonal component is only O(h2 k/κ). This means that if we take such b X(k) j s that Xj is close to Xi we can get a good estimate of the projector bΠ(k) i . However, even if Xi and Xj are close, the distance between b X(k) i and b X(k) j can be as large as O(h K), because of the shift in the tangent direction. We take it into account and introduce an auxiliary parameter γ to construct a set of indices J (k) i which includes only those j s that Xj is close to Xi. The computational complexity of Algorithm 1 is O(n2D2K + n D3K). This includes O(n2D2) operations to update the weights w(k) ij , 1 i, j n, and the estimates b X(k) i and bΣ(k) i , 1 i n, on each iteration and O(n D3) operations to update the projectors bΠ(k) i , 1 i n, on each iteration. SAME requires slightly more time than, for instance, the manifold blurring mean shift algorithm ((Wang and Carreira-Perpinan, 2010, MBMS), see the pseudocode in Appendix H below). The complexity of MBMS is O(n2D +n(D +k)(D k)2) per iteration. Here k is the number of neighbors used by MBMS to perform local Puchkin and Spokoiny Yi = Xi + εi Figure 1: An illustration of how the weighted average estimate (4) induces a shift in a tangent direction. Even if the noise is nearly orthogonal and the observation lies in a thin blue cylinder around the point of interest, the weighted average estimate falls into the green rectangle stretched in the tangent directions with high probability. Nevertheless, the averaging successfully removes the orthogonal component of the noise. PCA. SAME needs more operations to update the weights w(k) ij , 1 i, j n, because of multiplication of the projectors bΠ(k) i , 1 i n, by the vectors (Yj Yi), 1 i, j n. If the parameter k in MBMS is greater than D, then SAME and MBMS require the same time to perform PCA-type procedures. 4. Numerical Experiments In this section, we carry out simulations to illustrate the performance of SAME. For convenience, theoretical results were obtained for manifolds without a boundary (which is a common assumption in the manifold learning literature) but we use some well-known surfaces with boundary in the experiments. The source code of all the numerical experiments described in this section is available on Git Hub (link). 4.1 Manifold Denoising and Dimension Reduction In this section, we present the performance of SAME on two widely known artificial data sets: Swiss Roll and S-shape. First, we show how our estimator denoises the manifold. We start with the description of the experiment with the Swiss Roll. We sampled n = 2500 points on a two-dimensional manifold in R3 and then embedded the surface into R20 adding 17 dummy coordinates. After that, we added a uniform noise with a magnitude 0.75 to each coordinate (thus, the noise magnitude M was equal to 0.75 20). In our algorithm, we initialized bΠ(0) i = I20 for all i from 1 to n and made 6 iterations with h2 k = h2 0 1.25 k, 0 k 5, τ = h0, and γ = 4. To choose the initial bandwidth h0, we took α = 0.015 and put h0 equal to the distance to the αn -th nearest neighbor of the first sample point. Structure-adaptive Manifold Estimation Data set MSE of SAME, 102 MSE of MBMS, 102 Swiss Roll 67.4 69.3 S-shape 3.9 4.7 Table 1: Mean squared errors (MSE, (5)) of SAME and MBMS algorithms. Best results are boldfaced. The parameter γ had minor influence on the behaviour of the algorithm and it was set to 4 in all the experiments. To quantify the performance of the algorithm, we used the mean squared error 1 n i=1 b Xi Xi 2. (5) The results are shown in Figure 2 (top) and in Table 1. We compare SAME with the manifold blurring mean shift algorithm (Wang and Carreira Perpinan, 2010, MBMS). We provide a pseudocode of MBMS in Appendix H below to make the paper self-contained. The parameters σ and k of MBMS as well as the number of iterations were chosen such that they minimized the mean squared error (5) over a range of parameters. The dimension d was set to 2. The smallest MSE was achieved with σ = 2.6/ 2, k = 150 and only 1 iteration. The results are given in Figure 2 (top, right column). We observed that the mean squared error of MBMS grew after 1 or 2 iterations. In contrast to SAME, MBMS required much less iterations. However, SAME recovered the surface better than MBMS. The reason for that is hidden in the localizing weights w(k) ij , 1 i, j n, 1 k K. Projection onto a tangent hyperplane removes a large orthogonal component of the noise and, hence, bΠ(k) i (Yj Yi) better approximates the distance between Xj and Xi than Yj Yi . Besides, a weighted average estimate induces a shift in tangent direction (see the discussion after Algorithm 1 in Section 3) which may be harmful on early iterations. As a consequence, b X(k) j b X(k) i may also be a bad estimate for Xj Xi . The experiment with the S-shape manifold was carried in a similar way. We took n = 1500 points on the manifold in R3, embedded the surface into R30, and added a uniform noise with a magnitude 0.2 to each coordinate (thus, M = 0.2 30). Again, we initialized bΠ(0) i = I30 for all i from 1 to n. Then we put h2 k = h2 0 1.25 k, 0 k 13, i. e. the algorithm ran 14 iterations. The initial bandwidth h0 was equal to the distance to the 0.1n -th nearest neighbor of the first sample point. The parameters τ and γ were equal to h0 and 4, respectively. For MBMS, we took σ = 0.45/ 2, k = 300, d = 2, and made 2 iterations. The tuning procedure of the parameters of MBMS was the same as in the example with the S-shape data set. The result of this experiment is displayed in Figure 2 (bottom) and in Table 1. Next, we show how the preliminary denoising step may improve a dimension reduction. We consider the modified locally linear procedure (Zhang and Wang, 2006, MLLE for short), which is often used in applications due to its quality and computational efficiency. MLLE takes high-dimensional vectors as an input and returns their low-dimensional representation. In the case of S-shape and Swiss Roll data sets, one can easily find this map by straightening the curved surfaces into a plane. In the noiseless case, MLLE solves this Puchkin and Spokoiny Figure 2: Perfomance of SAME and MBMS on the Swiss Roll data set (top) and on the S-shape data set (bottom). Column 1: noisy observations lying near a twodimensional manifold. Column 2: noisy observations (blue) and the true manifold (green). Column 3: noisy observations (blue) and the projections onto the manifold (red), recovered by SAME. Column 4: noisy observations (blue) and the projections onto the manifold (red), recovered by MBMS. task. However, as the other non-linear dimension reduction procedures based on Taylor s expansion, this algorithm deteriorates its performance in the presence of significant noise. In Figure 3 (center images) one can clearly observe that the MLLE procedure is not able to recognize a two-dimensional structure in the noisy data set. Instead of a rectangularlike shape, which would be a natural choice to represent the two-dimensional structure of the S-shape and Swiss Roll data sets, we have a curve. However, if one first uses SAME for manifold denoising and only after that applies MLLE for dimension reduction, then one obtains the desired result: both surfaces are straightened into planes. Of course, popular dimension reduction methods (e.g. Isomap (Tenenbaum et al., 2000), LLE (Roweis and Saul, 2000), MLLE (Zhang and Wang, 2006), Laplacian eigenmaps (Belkin and Niyogi, 2003), t-SNE (van der Maaten and Hinton, 2008)) still perform well in the presence of small noise. However, a researcher should consider an option of using preliminary manifold denoising before dimension reduction in the case of larger noise. 4.2 Manifold Denoising and Semi-supervised Learning Manifold denoising can be a preprocessing step in semi-supervised learning. In the problem of semi-supervised learning, a statistician is usually given small amount of labelled data and a lot of unlabelled data. The goal is to recover the labels of the unlabelled points (transductive semi-supervised learning) or to propose a rule for prediction of label of a test point x ( true semi-supervised learning). In semi-supervised learning, it is usually assumed that the unlabelled data carries useful information, which may be useful for prediction. The Structure-adaptive Manifold Estimation Figure 3: The role of manifold denoising in a successful dimension reduction for the S-shape data set (top) and Swiss Roll data set (Bottom). Column 1: noisy observations. Column 2: application of MLLE to the data set without denoising. Column 3: application of MLLE to the data set with a preliminary denoising via SAME. Column 4: application of MLLE to the data set with a preliminary denoising via MBMS. most popular assumptions is that the data has a cluster structure or data points lie in a vicinity of a low-dimensional manifold. In this section, we pursue the goal of recovering labels of unlabelled points. We take two artificial data sets g241c and g241n, which are described in (Chapelle et al., 2010). The data sets g241c and g241n have n = 1500 pairs (Yi, Zi), 1 i n, where Zi { 1, 1} is a binary label and Y RD, D = 241 is a high-dimensional feature vector. According to (Chapelle et al., 2010), the data sets g241c and g241n were generated in such a way that they have a cluster structure and do not have hidden manifold structure. However, in (Hein and Maier, 2006), the authors report that preliminary manifold denoising step, applied to these data sets, improves the quality of classification. In this section, we illustrate that preliminary denoising with SAME also improves classification error. We split the data sets into 100 train points and 1400 test points. We use k-nearest neighbors classifier as a baseline for two reasons. First, k-NN method is popular and often used in practice. Second, k-NN classifier is based on pairwise distances between feature vectors and should gain from the manifold denoising. For each of the data sets we perform the following procedure. First, we apply k-NN method without denoising. Then we make manifold denoising using SAME and apply k-NN to the denoised data set. In the case of g241c data set, we took d = 10, τ = 22, γ = 4 and hk = 20 1.2 k, 0 k 1. In the case of g241n data set, we took d = 6, τ = 21, γ = 4 and hk = 20 1.2 k, 0 k 2. The results are summarized in Table 2. We observe that preliminary denoising improves quality of prediction. Puchkin and Spokoiny Data set Best number of neighbors k without denoising k-NN error without denoising (%) Best number of neighbors k after denoising denoised k NN error (%) g241c 21 31.3 15 27.8 g241n 18 27.1 12 25.3 Table 2: Error rates with and without manifold denoising via SAME for k-NN method, applied to artificial data sets g241c and g241n. 5. Theoretical Properties of SAME This section states the main results. Here and everywhere in this paper, for any matrix A, A denotes its spectral norm. The notation f(n) g(n) means f(n) g(n) f(n). Theorem 1 Assume (A1), (A2), (A3), and (A4) . Let the initial guesses bΠ(0) 1 , . . . , bΠ(0) n of Π(X1), . . . , Π(Xn) be such that on an event with probability at least 1 n 1 it holds max 1 i n bΠ (0) i Π(Xi) h0 with a constant , such that h0 κ/4, and h0 = C0/ log n, where C0 > 0 is an absolute constant. Choose τ = 2C0/ log n and set any a (1, 2]. If n is larger than a constant N , depending on , and h K (D log n/n)1/d (DM2κ2 log n/n)1/(d+4) (with a sufficiently large hidden constant, which is greater than 1) then there exists a choice of γ, such that after K iterations Algorithm 1 produces estimates b X1, . . . , b Xn, such that, with probability at least 1 (5K + 4)/n, it holds max 1 i n b Xi Xi Mb Mh K (1 + ΦM,b,h K,κ)h2 K κ + D(h2 K M2) log n max 1 i n bΠ (K) i Π(Xi) ΨM,b,h K,κ D(h4 K/κ2 M2) log n ΦM,b,h K,κ = M3(1 + b/h K)2 h2 Kκ + M2(1 + b/h K + q log h 1 K ) κh K + Mh2 K κ3 α + o(1), n , ΨM,b,h K,κ = 1 + M(1 + b/h K) (1 + ΦM,b,h K,κ)h K d+1 (1 + ΦM,b,h K,κ) (6) (1 + α) 4d+1 + (2 α)d+1 . Structure-adaptive Manifold Estimation In particular, if one chooses the parameter a and the number of iterations K in such a way that h K (Dκ2 log n/n)1/(d+2) (DM2κ2 log n/n)1/(d+4) then max 1 i n b Xi Xi Mb DM2κ2 log n If h K (D log n/n)1/d (DM2κ2 log n/n)1/(d+4) then max 1 i n bΠ (K) i Π(Xi) 1 DM2κ2 log n Note that one has to take the number of iterations K of order log n since the sequence of bandwidths h1, . . . , h K decreases exponentially. In Theorem 1, we assume that bΠ(0) 1 , . . . , bΠ(0) n may depend on Y1, . . . , Yn. The natural question is how to construct the initial guesses bΠ(0) 1 , . . . , bΠ(0) n of the projectors Π(X1), . . . , Π(Xn). We propose a strategy for initialization of our procedure. One can use (Aamari and Levrard, 2018, Proposition 5.1) to get the estimates bΠ(0) 1 , . . . , bΠ(0) n . For each i from 1 to n introduce bΣ (0) i = 1 n 1 j =i (Yj Y i)(Yj Y i)T 1(Yj B(Yi, h0)), where Y i = 1 Ni P j =i Yj1(Yj B(Yi, h0)), Ni = |{j : Yj B(Yi, h)}|. Let bΠ(0) i be the projector onto the linear span of the d largest eigenvalues of bΣ(0) i . Then the following result holds. Proposition 1 (Aamari and Levrard (2018), Proposition 5.1) Assume (A1), (A2), (A3). Set h0 (log n/n)1/d for large enough hidden constant. Let M/h0 1/4 and let h0 = h0(n) = o(1), n . Then for n large enough, with probability larger than 1 n 1, it holds max 1 i n bΠ (0) i Π(Xi) h0 Remark 2 In (Aamari and Levrard, 2018), the authors take h0 (log n/n)1/d. Nevertheless, a careful reading of the proofs reveals that one can also take larger values of h0. Remark 3 One can also use local PCA procedure from (Cheng and Wu, 2013) with a more sophisticated choice of the neighborhood for initialization. The condition (A4) and the choice of h K in Theorem 1 yield that M = M(n) can decrease almost as slow as h2/3 K = h2/3 K (n). Thus, we admit the situation when the noise magnitude M is much larger than the smoothing parameter h K. For instance, in (Aamari and Levrard, 2019), the authors use local polynomial estimates and require M = O(h2) and h = h(n) n 1/d. In (Aamari and Levrard, 2018), the authors assume M λ(log n/n)1/d, and λ does not exceed a constant λd,p0,p1, depending on d, p0 and p1. In (Fefferman et al., 2018), the authors deal with Gaussian noise N(0, σ2ID) and get the accuracy of manifold Puchkin and Spokoiny estimation O(σ D) using O(σ d) samples. This means that σ = O(n 1/d), which yields that max 1 i n εi n 1/dp with overwhelming probability. A similar situation is observed in (Genovese et al., 2014), where the authors also consider the Gaussian noise N(0, σ2ID) and, using the kernel density estimate with bandwidth h, obtain the upper bound σ2 log σ 1 + h2 + on the Hausdorffdistance between M and their estimate. In order to balance the first and the second terms, one must take σ = O(h/ p log h 1), which means that max 1 i n εi h D log n log h 1 , while we allow max1 i n εi be as large as h2/3 K . Finally, in (Hein and Maier, 2006) the authors require M = O(h). So, we see that the condition (A4) is quite mild. Theorem 1 claims that, despite the relatively large noise, our procedure constructs consistent estimates of the projections of the sample points onto the manifold M . The accuracy of the projection estimation is a bit worse than the accuracy of manifold estimation, which we provide in Theorem 2 below. The reason for that is the fact that the estimate b Xi is significantly shifted with respect to Xi in a tangent direction, while the orthogonal component of ( b Xi Xi) is small. A similar phenomenon was already known in the problem of efficient dimension reduction. For instance, in (Hristache et al., 2001b,a) the authors managed to obtain the rate n 2/3 for the bias of the component, which is orthogonal to the efficient dimension reduction space, while the rate of the bias in the index estimation was only n 1/2. Moreover, the term Mh K in Theorem 1 appears because of the correlation between the weights w(k) ij and the sample points Yj. We proceed with upper bounds on the estimation of the manifold M . Theorem 2 Assume conditions of Theorem 1. Consider the piecewise linear manifold estimate c M = n b Xi + h K bΠ (K) i u : 1 i n, u B(0, 1) RDo , where bΠ(K) i is a projector onto d-dimensional space obtained on the K-th iteration of Algorithm 1. Then, as long as h K (D log n/n)1/d (DM2κ2 log n/n)1/(d+4) (with a sufficiently large hidden constant, which is greater than 1), on an event with probability at least 1 (5K + 5)/n, it holds d H( c M, M ) (1 + ΦM,b,h K,κ + ΨM,b,h K,κ)h2 K κ M2b2 D(h4 K/κ2 M2) log h 1 K nhd K , where ΦM,b,h K,κ and ΨM,b,h K,κ are defined in (6). In particular, if a and K are chosen such that h K (D log n/n)1/d (DM2κ2/n log n)1/(d+4) , then d H( c M, M ) M2b2 κ3 κ 1 D log n d κ 1 DM2κ2 log n Structure-adaptive Manifold Estimation Remark 4 For manifold reconstruction, one can also use a different technique, based on (Aamari and Levrard, 2018, Theorem 4.1) and tangential Delaunay complexes. We emphasize that the manifold estimate c M, used in Theorem 2, is not a manifold itself but a union of d-dimensional discs. Let us elaborate on the result of Theorem 2. First, let us discuss the case of bounded non-orthogonal noise, that is, the situation when (A4.2) holds. The model with bounded noise was considered in (Aamari and Levrard, 2018), where the authors assumed that M satisfies (A1) and the density of X fulfils 0 < p0 p(x) p1, x M , for some constants p0, p1. Note that this is a slightly more general setup, since we additionally assume that the log-density is Lipschitz. Under these assumptions, Aamari and Levrard (2018) proved (Theorem 2.7) the following upper bound on the Hausdorffdistance using the tangential Delaunay complex (TDC): d H( c MTDC, M ) log n 2/d + M2 log n provided that M λd,p0,p1(log n/n)1/d, where the constant λd,p0,p1 depends on d, p0 and p1. To the best of our knowledge, the situation, when (A1), (A2), (A3), and (A4.2) hold, was not studied in the manifold learning literature. One can observe that both TDC and SAME achieve the rate O (log n/n)2/d in the case of extremely small noise M (log n/n)2/d. However, if (log n/n)2/d M n 4/(3d+4) then the rate of convergence of SAME in the case of the density p(x) satisfying (A2) improves over the known rates of TDC in the case of bounded away from 0 and density p(x). Now, let us discuss the case of almost orthogonal noise, i.e. when (A4) holds. This model is completely new in the manifold learning literature. The most similar one considered in the prior work is the model with perpendicular noise studied in (Genovese et al., 2012a; Aamari and Levrard, 2019), so we find it useful to compare this more restrictive model with our upper bounds for the case of almost orthogonal noise. In (Genovese et al., 2012a), the authors obtain the rates O(log n/n)2/(d+2) assuming that, given X, the noise ε has a uniform distribution on B(X, M) (TXM ) . In their work, the authors do not assume that M tends to zero as n tends to infinity, however, they put a far more restrictive assumption on the noise distribution than we do. In (Aamari and Levrard, 2019, Theorem 6), the authors use local polynomial estimate c MLP to prove the upper bound d H( c MLP , M ) log n for the case when M is a Ck-manifold with dimension d and reach at least κ without a boundary. If M is a C2-manifold, this rate is minimax optimal for the case of extremely small noise M (log n/n)2/d but it can be improved when the noise magnitude exceeds (log n/n)2/d. Theorem 2 shows that our procedure achieves the classical nonparametric rate, where the bias and the variance terms correspond to the best one can hope for when deals with the Puchkin and Spokoiny locally linear estimator. In the case of small noise (M (log n/n)2/d), the result of Theorem 2 matches the lower bound obtained in (Kim and Zhou, 2015) and the upper bound from (Aamari and Levrard, 2019). It is not surprising, because when M is a C2-manifold, the local polynomial estimate considered in (Aamari and Levrard, 2019) becomes a piecewise linear estimate, based on local PCA, and achieves the optimal rate in the case of small noise. Our algorithm acts in a similar manner and the only significant difference is hidden in the weights. However, if the noise is very small, there is no need to adjust the weights, so local PCA and SAME behave comparably in this regime. The same concerns the projector estimates bΠ(K) 1 , . . . , bΠ(K) n from Theorem 1. In the case of small noise, we recover the minimax rate (log n/n)1/d obtained in (Aamari and Levrard, 2018, 2019). However, as the magnitude of the noise grows, our procedure shows superior performance, compared to the estimates in (Aamari and Levrard, 2018, 2019). The result of Theorem 2 cannot be improved for the case of general additive noise, which fulfils the assumption (A3) with b (log n/n)1/d (M2κ2 log n/n)1/(d+4) . We justify this discussion by the following theorem. Theorem 3 Suppose that the sample Yn = {Y1, . . . , Yn} is generated according to the model (1), where M M d κ, the density p(x) of X fulfils (A2) (with sufficiently large p1, L and sufficiently small p0) and the noise ε satisfies (A3). Then, for any estimate c M, it holds that sup M M d κ EM d H( c M, M ) M2b2 Moreover, if, in addition, n is sufficiently large, Mκ (log n/n)2/d, and the parameter b in (A3) is such that b (log n/n)1/d (M2κ2 log n/n)1/(d+4) , with a large enough hidden constant, then, for any estimate c M, it holds that sup M M d κ EM d H( c M, M ) κ 1 M2κ2 log n 2 d+4 . (8) Theorem 3 studies the case M (log n/n)2/d. In (Kim and Zhou, 2015), the authors proved the minimax lower bound inf c M sup M M d κ EM d H( c M, M ) log n for the noiseless case, which is also tight for M (log n/n)2/d. Theorem 3, together with (Kim and Zhou, 2015, Theorem 1) yields that SAME is minimax optimal in the model with almost orthogonal noise. The lower bounds (7) and (8) are completely new and are different from the currently known results on manifold estimation from (Genovese et al., 2012a) and (Aamari and Levrard, 2019), where the authors studied a perpendicular noise fulfilling (A3) with b = 0. In (Genovese et al., 2012a), the authors focused on the case of uniform noise Structure-adaptive Manifold Estimation and proved the lower bound (M/n)2/(d+2) when M fulfils (A1) and p0 p(x) p1 for all x M . In (Aamari and Levrard, 2019), the authors went further and proved that inf c M sup M Ed H( c M, M ) (M/n)k/(d+k) n k/d, where M runs over a class of compact, connected Ck-manifolds of dimension d without a boundary and reach (M ) κ. Theorem 3 reveals a surprising effect: if one allows small deviations of εi s in tangent directions then the problem of manifold estimation becomes harder and this fact is reflected in the minimax rates of convergence. Namely, if b (log n/n)1/d (M2κ2 log n/n)1/(d+4), the minimax rate is M2b2/κ3 M2 log n/n 2/(d+4) (log n/n)2/d which is different from the best known lower bound (M/n)2/(d+2) (log n/n)2/d for the case of perpendicular noise. This section collects the proof of the main results. 6.1 Proof of Theorem 1 The proof of Theorem 1 is given in several steps. First, we show that the adjusted weights wij(Πi) are informative, i.e. significant weights correspond only to points Xj, which are close to Xi. Lemma 1 Assume (A1), (A3). Let Πi be any projector, such that Πi Π(Xi) h κ . Assume that M κ/16, h κ/4, and M( + b/h) κ/4. Then for any i and j, such that Yi Yj 0.5κ, it holds 1 2 Πi(Yi Yj) 2M( h + b) κ Xi Xj 2 Πi(Yi Yj) + 4M( h + b) Lemma 1 quantifies the informal statement Πi(Yj Yi) Xj Xi, giving explicit error bound depending on the error in the guess of the projector. Note that if (A4) holds, h [h K, h0] and h0, h K satisfy the conditions of Theorem 1 then M κ/16, M( + b/h) κ/4, h κ/4 if n is large enough. The next step is to show that the cylinder {y : Πi(y Yi) h, y Yi τ} contains enough sample points. For this purpose, we prove the regularity of the design points in the following sense. Lemma 2 Assume (A1) (A3). Fix any i from 1 to n and let Πi be any projector, such that Πi Π(Xi) h κ , where (log n/n)1/d h < h0, h κ/4, M κ/16, and M( + b/h) κ/4. Suppose that h0 is chosen in a such way that h0 0.5τ, and n is sufficiently large. Then, on an event with probability at least 1 n 2, it holds j=1 wij(Πi) C nhd (9) with an absolute constant C > 0. Puchkin and Spokoiny Roughly speaking, the cylinder {y : Πi(y Yi) h, y Yi τ} contains nhd sample points with high probability. It is important, because the ball B(Yi, h) would contain nh D sample points if h < M. The sum of weights controls the variance of our estimates. From this point of view, the choice of cylindric neighborhoods instead of the balls yields much better rates. Now, we are ready to make the main step in the proof of Theorem 1. Lemma 3 Assume conditions of Theorem 1. Let Π1, . . . , Πn be any (possibly random) projectors, such that Πi Π(Xi) h κ almost surely, (log n/n)1/d h h0, h κ/4, M κ/16, and M( + b/h) κ/4. Let wij(Πi), 1 i, j n, be the localizing weights, computed according to wij(Πi) = K Πi(Yi Yj) 2 1 ( Yi Yj τ) , 1 j n, with a constant τ < 0.5κ. Then, conditionally on Π1, . . . , Πn, on an event with probability at least 1 2n 1, it simultaneously holds j=1 wij(Πi)(Yj Xi) + ΦM,b,h,κ, nhd+2 D(h2 M2)nhd log n, j=1 wij(Πi) (I Π(Xi))(Yj Xi) (1 + ΦM,b,h,κ, ) nhd+2 D(h4/κ2 M2)nhd log n, ΦM,b,h,κ, = M3(1 + + b/h)2 h2κ + M2( + b/h + p log h 1) κh + (1 + 4)Mh2 and the hidden constants do not depend on . The proof of Lemma 3 is moved to Appendix C. In Lemma 3, the assumption (A4) comes into play. The condition (A4) implies that ΦM,b,h,κ, α + o(1) as n . This follows from the fact that, under (A4), for any h = h(n) h K, it holds M3 = o(h2), n . Indeed, we have (M2/n)2/(d+4) (M2/n)2/(d+4) = M(3d+8)/(d+4) n2/(d+4) (M2/n)2/(d+4) A (M2/n)2/(d+4) h2 K A (log n)2/(d+4) 0, n . Structure-adaptive Manifold Estimation Moreover, under (A4), it holds d κ DM2κ2 log n Therefore, ΦM,b,h,κ, α + o(1) as n . Similarly, if (A4.2) holds, we have M3b2 = M3κ2 = o(h4 K), which also yields ΦM,b,h,κ, 0. We need one more auxiliary result. Lemma 1, 2 and 3 imply that if we have good guesses of the projectors Π(X1), . . . , Π(Xn) then we can get good estimates of X1, . . . , Xn even if the noise magnitude M is quite large. Now we have to show that these estimates of X1, . . . , Xn can be used to construct good estimates of Π(X1), . . . , Π(Xn) and then we can carry out the proof by induction. Our next lemma is devoted to this problem. Let us work on the event, on which (9) holds with h = hk. Note that b X(k) i Xi = j=1 w(k) ij Yj j=1 w(k) ij j=1 w(k) ij (Yj Xi) j=1 w(k) ij d( b X(k) i , {Xi} TXi M ) = b X(k) i Xi Π(Xi)( b X(k) i Xi) j=1 w(k) ij Yj Xi Π(Xi)(Yj Xi) j=1 w(k) ij Here we used the fact that the projection of a point x onto the tangent plane {Xi} TXi M is given by π{Xi} TXi M (x) = Xi + Π(Xi)(x Xi). Then Lemma 3 and Lemma 2 immediately yield that, if we have max 1 i n bΠ (k) i Π(Xi) hk/κ on the k-th iteration with probability at least 1 (5k + 1)/n, then max 1 i n b X(k) i Xi M( hk + b) (1 + ΦM,b,hk,κ, )h2 k κ 2h3 k κ2 D(h2 k M2) log h 1 k nhd k , max 1 i n d( b X(k) i , {Xi} TXi M ) (1 + ΦM,b,hk,κ, )h2 k κ + D(h4 k/κ2 M2) log h 1 k nhd k with probability at least 1 (5k + 1)/n 3/n = 1 (5k + 4)/n. It only remains to check that the projector estimates bΠ(k+1) 1 , . . . , bΠ(k+1) n also satisfy max 1 i n bΠ (k+1) i Π(Xi) hk+1 κ with high probability. The precise statement is given in the following lemma. Puchkin and Spokoiny Lemma 4 Assume conditions of Theorem 1. Let Ωk be an event, such that on this event it holds max 1 i n b X(k) i Xi β1 D(h2 k M2) log h 1 k nhd k max 1 i n d( b X(k) i , {Xi} TXi M ) β2 D(h4 k/κ2 M2) log h 1 k nhd k 2β2h2 k κ . (10) Then there exists γ 1 + β1 such that, with probability at least P(Ωk) 2n 1, it holds max 1 i n bΠ (k+1) i Π(Xi) γ(γ + 4β1)dβ2hk κ (1 + β1)d+1β2hk The proof of Lemma 4 can be found in Appendix D. From the derivations before Lemma 4, the event Ωk from Lemma 4 has probability at least 1 (5k + 4)/n. Note that, if hk (D log n/n)1/d (DM2κ2 log n/n)1/(d+4) with a hidden constant greater than 1, as given in the conditions of Theorem 1, then the bias terms in (10) are dominating, i.e. D(h2 k M2) log h 1 k nhd k 2hk = 2ahk+1 and h2 k κ + D(h4 k/κ2 M2) log h 1 k nhd k 2h2 k κ . Due to the discussion before Lemma 4, we can take β1 = M( + b/hk) (1 + ΦM,b,hk,κ, )hk β2 = 1 + ΦM,b,hk,κ, . Then Lemma 4 yields max 1 i n bΠ (k+1) i Π(Xi) a(1 + β1)d+1β2 hk+1 The proof of Theorem 1 goes by induction. Let C be the hidden constant in (11). Assume that on the k-th iteration max 1 i n bΠ (k) i Π(Xi) (1 + α)Ca(4d+1 + (2 α)d+1) with probability at least 1 (5k + 1)/n. Here α is the constant from (A4). Lemma 2, Lemma 3 and Lemma 4 imply that, with probability at least 1 (5(k + 1) + 1)/n, it holds max 1 i n bΠ (k+1) i Π(Xi) a C(1 + β1)d+1β2 hk+1 Structure-adaptive Manifold Estimation We have to check that (1 + β1)d+1β2 (1 + α) 4d+1 + (2 α)d+1 . From the definition of β1 and β2, we have β2 = 1 + ΦM,b,h,κ, 1 + α + o(1), n , β1 M((Ca(1 + α)(4d+1 + (2 α)d+1) ) + b/h K) (1 + ΦM,h,κ, )h0 κ (Ca(1 + α)(4d+1 + (2 α)d+1) )2h2 0 κ2 κh K + o(1), n Show that (A4) yields Mb/(κh K) α + o(1), as n . Then, if n is sufficiently large, i.e. n N , this will imply β1 α + 1, β2 2(1 + α). Due to (A4), d κ DM2κ2 log n If M h2 K/κ then Mb κh K h Kb κ = o(1), n . Otherwise, we have Mb κ2h2 K = M3b2 Thus, Mb/(κh K) α + o(1), n , and we have β1 α + 1, β2 2(1 + α) for n N . This yields (1 + β1)d+1β2 2(1 + α) 2 + α d+1 2(1 + α) 2d 2d+1 + α(d+1)/2 = (1 + α) 4d+1 + (2 α)d+1 . Thus, on the next iteration we have max 1 i n bΠ (k+1) i Π(Xi) (1 + α)Ca 4d+1 + (2 α)d+1 with probability at least 1 (5(k + 1) + 1)/n. The confidence level 1 (5K + 4)/n in the claim of Theorem 1 follows from the fact that we do not recompute the projectors on the final step. 6.2 Proof of Theorem 2 Fix any x c M By definition of c M, there exist i and u B(0, 1), such that x = b Xi + h K bΠ (K) i u. Puchkin and Spokoiny Lemma 2, Lemma 3, Lemma 4, and the union bound imply that, with probability at least 1 (5K + 4)/n, max 1 i n d( b Xi, {Xi} TXi M ) (1 + ΦM,b,h K,κ)h2 K κ + D(h4 K/κ2 M2) log h 1 K nhd K and max 1 i n bΠ (K) i Π(Xi) ΨM,b,h K,κ h K where ΦM,b,h K,κ and ΨM,b,h K,κ are defined in (6). Recall that πM (x) denotes the projection of x onto a closed set M. Using the result Lemma 4, we immediately obtain d(x, {Xi} TXi M ) = inf v RD b Xi + h K bΠ (K) i u Xi Π(Xi)v d( b Xi, {Xi} TXi M ) (12) π{Xi} TXi M b Xi + h K bΠ (K) i u Xi Π(Xi)v Since the vector π{Xi} TXi M ( b Xi) Xi belongs to TXi M , we have π{Xi} TXi M ( b Xi) Xi = Π(Xi) π{Xi} TXi M ( b Xi) Xi . Then, substituting v + Xi π{Xi} TXi M ( b Xi) by ev, we obtain that the last expression in (12) is equal to d( b Xi, {Xi} TXi M ) h K bΠ (K) i u Π(Xi) v + Xi π{Xi} TXi M b Xi = d( b Xi, {Xi} TXi M ) + inf ev RD h K bΠ (K) i u Π(Xi)ev d( b Xi, {Xi} TXi M ) + h K bΠ (K) i u h KΠ(Xi)u d( b Xi, {Xi} TXi M ) + h K bΠ (K) i Π(Xi) (1 + ΦM,b,h K,κ + ΨM,b,h K,κ)h2 K κ + D(h4 K/κ2 M2) log h 1 K nhd K . Next, note that, x b Xi h K and, due to Theorem 1, we have π{Xi} TXi M (x) Xi x Xi x b Xi + b Xi Xi Mb Mh K (1 + ΦM,b,h K,κ)h2 K κ + D(h2 K M2) log h 1 K nhd K D(h2 K M2) log h 1 K nhd K Mb Structure-adaptive Manifold Estimation The inequalities in the last line follow from the fact that M < κ, ΦM,b,h K,κ α + o(1), n , and h K (D log n/n)1/d (DM2κ2 log n/n)1/(d+4) due to the conditions of Theorem 1. Since M is a C2-manifold with a reach at least κ, it holds that d(π{Xi} TXi M (x) , M ) π{Xi} TXi M (x) Xi 2 κ h2 K κ M2b2 Finally, we obtain d(x, M ) d(x, {Xi} TXi M ) + d(π{Xi} TXi M (x) , M ) (1 + ΦM,b,h K,κ + ΨM,b,h K,κ)h2 K κ M2b2 D(h4 K/κ2 M2) log h 1 K nhd K . Thus, c M M B(0, r) with r (1 + ΦM,b,h K,κ + ΨM,b,h K,κ)h2 K κ M2b2 D(h4 K/κ2 M2) log h 1 K nhd K . It remains to prove that M c M B(0, r) with the same r. Fix x M . Note that there exist constants c1 and r0, such that PX (X B(x, r)) p0Vol(B(x, r) M ) c1p0rd, r < r0 PX (X / B(x, r)) 1 c1p0rd e c1p0rd, r < r0 Let Nε(M ) stand for an ε-net of M . It is known (see, for example, Genovese et al. (2012a), Lemma 3) that |Nε(M )| ε d. Then P ( x M : i Xi / B(x, 2ε)) P ( x Nε(M ) : i Xi / B(x, ε)) x Nε(M ) P ( i Xi / B(x, ε)) ε de c1p0nεd. This implies that with probability at least 1 1/n sup x M min 1 i n x Xi log n According to (Federer, 1959, Theorem 4.18), on the same event we have sup x M min 1 i n d(x, {Xi} TXi M ) κ 1 log n Now, fix any x M . Without loss of generality, assume that min1 i n d(x, {Xi} TXi M ) is attained with i = 1. Let π{X1} TX1M (x) be the projection of x onto the tangent plane {X1} TX1M . It is clear that π{X1} TX1M (x) X1 log n Puchkin and Spokoiny Then there exists u B(0, 1), such that b X1 + h K bΠ (K) 1 u π{X1} TX1M (x) (1 + ΦM,b,h K,κ + ΨM,b,h K,κ)h2 K κ M2b2 D(h4 K/κ2 M2) log h 1 K nhd K . d(x, c M) κ 1 log n d + (1 + ΦM,b,h K,κ + ΨM,b,h K,κ)h2 K κ M2b2 D(h4 K/κ2 M2) log h 1 K nhd K (1 + ΦM,b,h K,κ + ΨM,b,h K,κ)h2 K κ M2b2 D(h4 K/κ2 M2) log h 1 K nhd K . 6.3 Proof of Theorem 3 For the sake of convenience, the proof of Theorem 3 is divided into two steps. First, we use the following lemma to obtain the lower bound (7). Lemma 5 Suppose that the sample Yn = {Y1, . . . , Yn} is generated according to the model (1), where M M d κ, M B(0, R) with R = p κ2 + M2b2/κ4, ε satisfies (A3), and the density p(x) of X fulfils (A2) with L > 0, p0 ((d+1)ωd+1Rd) 1, p1 ((d+1)ωd+1Rd) 1, where ωd+1 is the volume of the unit Euclidean ball in Rd+1. Then, for any estimate c M, it holds that sup M M d κ EM d H( c M, M ) M2b2 The proof of Lemma 5 is moved to Appendix F.1. The construction used in the proof of Lemma 5 is extremely simple. We show that if (ε | X) is supported on a tangent space TXM0, where M0 is a d-dimensional sphere of radius κ, then a statistician cannot distinguish between M0 and a sphere M1 with greater radius. Second, we prove the lower bound (8) using Lemma 6 below. The proof of Lemma 6 is based on application of (Tsybakov, 2009, Theorem 2.5) to a family of smooth manifolds with small bumps at different points. Lemma 6 Suppose that the sample Yn = {Y1, . . . , Yn} is generated according to the model (1), where M M d κ, the density p(x) of X fulfils (A2) with sufficiently large p1, sufficiently small p0 > 0, and L 4p1/3. Let the noise ε satisfy (A3) with b (log n/n)1/d (M2κ2 log n/n)1/(d+4) , where the hidden constant is large enough. Then, for any estimate c M, if n is sufficiently large and Mκ (log n/n)2/d, it holds that sup M M d κ EM d H( c M, M ) κ 1 M2κ2 log n Structure-adaptive Manifold Estimation The proof of Lemma 6 is moved to Appendix F.2. The claim of Theorem 3 follows from Lemma 5 and Lemma 6. Acknowledgments The authors are grateful to the action editor and three anonymous reviewers for valuable suggestions and remarks. This work was partly supported by the German Ministry for Education and Research as BIFOLD. It was partly carried out within the framework of the HSE University Basic Research Program. Results of Section 5 have been obtained under support of the RSF grant No. 19-71-30020. Nikita Puchkin is a Young Russian Mathematics award winner and would like to thank its sponsors and jury. Appendix A. Proof of Lemma 1 Xj Xi Πi(Yj Yi) Xj Xi Πi(Xj Xi) + Πi(εj εi) Xj Xi Π(Xi)(Xj Xi) + Πi Π(Xi) Xj Xi + Πi(εj εi) . According to (Federer, 1959, Theorem 4.18), Xj Xi Π(Xi)(Xj Xi) Xj Xi 2 Consider the term Πi(εj εi) . It holds Πi(εj εi) (Πi Π(Xi)(εj εi) + Π(Xi)(εj εi) κ + Π(Xi)εi + Π(Xj)εj + (Π(Xi) Π(Xj))εj (13) κ + 3M Xj Xi where in the last inequality we used the condition (A3) and applied Proposition 3. Taking into account that Πi Π(Xi) h κ , we conclude Xj Xi Πi(Yj Yi) Xj Xi 2 2κ + h Xj Xi + 2M( h + b) κ + 3M Xi Xj Using the triangle inequality Xi Xj Πi(Yi Yj) Xj Xi Πi(Yj Yi) Puchkin and Spokoiny and solving the quadratic inequality Xi Xj Πi(Yi Yj) Xj Xi 2 + (3M + h) Xj Xi κ + 2M( h + b) with respect to Xi Xj , we obtain Xi Xj Πi(Yi Yj) + 2M( h + b)/κ 1 ( h + 3M)/κ 2 Πi(Yi Yj) + 4M( h + b) Here we used the fact that, due to condition of the lemma, On the other hand, from (14) we have Xi Xj Πi(Yi Yj) Xj Xi 2 κ (3M + h) Xi Xj 2κ + 3M + h κ Xi Xj + 2M( h + b) κ Πi(Yj Yi) then Xi Xj 0.5 Πi(Yj Yi) . Otherwise, it holds Xi Xj (3M + h) 2 + Πi(Yi Yj) 4M( h + b)/κ Introduce a function g(t) = a2 + t a, a > 0, t a2. The function g(t) is concave, increasing and g(0) = 0. Therefore, for any t0 and any t [0, t0] it holds g(t) g(t0) t Taking a = ( h + 3M)/κ and t0 = 1 4M( h + b)/κ2, we immediately obtain 2 + κ 4M h/κ Πi(Yi Yj) 4M( h + b) Now it is easy to see that, if M κ/16, h κ/4, and M( + b/h) κ/4 then 3M + h < κ/2 < κ and 1 4M( h + b) 4 (3M + h)2 Structure-adaptive Manifold Estimation The last inequality yields 2 + κ 4M h/κ which, in its turn, implies Xi Xj Πi(Yi Yj) 2 2M( h + b). Appendix B. Proof of Lemma 2 Show that for any Πi, such that Πi Π(Xi) h/κ, it holds E( i)wi1(Πi) C1hd. Here and further in this paper, E( i)( ) E( | (Xi, Yi)). Due to Lemma 1, we have Πi(Yi Y1) 2 Xi X1 + 4M( h + b) which yields Πi(Yi Y1) 2 8 Xi X1 2 + 32M2( h + b)2 E( i)wi1(Πi) = E( i)e Πi(Yi Y1) 2 h2 1 ( Yi Y1 τ) E( i)e 32M2( +α)2/κ2e 8 Xi X1 2 h2 1 ( Xi X1 τ 2M) M B(Xi,τ 2M) h2 p(x)d W(x) E 1(B(Xi,h)) e 8 EXi (p) EXi (0) 2 det g(v)dv. Here we used the fact that, due to conditions of the lemma, M( + b/h) κ/4 and, also, τ 2M 0.5τ h0 h, if h0 is chosen sufficiently small. Next, due to (Aamari and Levrard, 2019, Lemma 1), EXi(v) EXi(0) v C v 2 C κ v . It also holds det g(v) 1 2 for any v E 1(B(Xi, h)). Then there exists a constant C , depending on d, p0 and κ, such that E( i)wi1(Πi) 2C hd. Puchkin and Spokoiny Now, consider the sum n X j=1 wij(Πi) Given Yi, the weights wij(Πi) are conditionally independent and identically distributed. The Bernstein s inequality implies that j=1 wij(Πi) C hd 2σ2+2C hd/3 e C nhd , and e C nhd n 1 if h log n n 1/d (with a sufficiently large hidden constant). Therefore, with probability at least 1 n 2, it holds j=1 wij(Πi) C hd. Appendix C. Proof of Lemma 3 Fix any i from 1 to n and denote E( i)( ) E ( |(Xi, Yi)) and P( i)( ) P ( |(Xi, Yi)). Also, let Pi( h/κ) be a set of projectors Π onto d-dimensional space, such that Π Π(Xi)|| h/κ. First, we study the supremum of the empirical process sup Πi Pi( h/κ) j=1 wij(Πi)(Yj Xi) = sup u B(0,1) Πi Pi( h/κ) j=1 wij(Πi)u T (Yj Xi) . The rest of the proof can be summarized as follows. First, we fix u B(0, 1) and Πi Pi( h/κ) and bound the supremum of the expectation sup u B(0,1) Πi Pi( h/κ) j=1 wij(Πi)u T (Yj Xi) . Then we provide uniform bounds on E( i) sup u B(0,1) Πi Pi( h/κ) j=1 wij(Πi)u T (Yj Xi) E( i) n X j=1 wij(Πi)u T (Yj Xi) Finally, we derive large deviation results for sup u B(0,1) Πi Pi( h/κ) j=1 wij(Πi)u T (Yj Xi) E( i) sup u B(0,1) Πi Pi( h/κ) j=1 wij(Πi)u T (Yj Xi). As it was said earlier, we start with bounds on the expectation. The rigorous result is given in the next proposition. Structure-adaptive Manifold Estimation Proposition 2 Under conditions of Theorem 1 and Lemma 3, for any u B(0, 1) and Πi Pi( h/κ), it holds j=1 wij(Πi)u T (Xj Xi) M( + b/h) h 2h2 j=1 wij(Πi)u T (I Π(Xi))(Xj Xi) nhd+2 j=1 wij(Πi)u T εj ΦM,b,h,κ, nhd+2 ΦM,b,h,κ, = M3(1 + + b/h)2 h2κ + M2( + b/h + p log h 1) κh + (1 + 4)Mh2 and the hidden constants do not depend on . The proof of Proposition 2 relies on Taylor s expansion but it is quite technical. Therefore, it is moved to Appendix E. We continue with a uniform bound on the expectation E sup u B(0,1), Πi Π(Xi) h/κ wij(Πi)u T Yj Ewij(Πi)u T Yj . Introduce the class of functions Fi = f(y) = K Πi(Yi y) 2 1( Yi y τ)u T (y Xi) : Πi Π(Xi) h/κ, Yi B(Xi, M), u B(0, 1) . We use the same trick as in (Gin e and Koltchinskii, 2006, Section 4). Note that the class F(1) i = f1(y) = Πi(Yi y) : (15) Πi Π(Xi) h/κ, Yi B(Xi, M) is VC subgraph, because the stripe {y : Π(Yi y) t} is an intersection of a finite number of halfspaces. According to (van der Vaart and Wellner, 1996, Theorem 2.6.18 (viii)), the class e F(1) i = f1(y) = K Πi(Yi y) 2 Πi Π(Xi) h/κ, Yi B(Xi, M) is also VC subgraph, since K( ) monotonously decreases. The class of balls F(2) i = f2(y) = 1( Yi y τ) : Yi B(Xi, M) (16) Puchkin and Spokoiny and the class of hyperplanes F(3) i = f3(y) = u T (y Xi) : u B(0, 1) are VC subgraph. The functions from the classes e F(1) i , F(2) i and F(3) i are bounded by 1, 1 and R + M respectively. Then there exist constants A and ν, depending only on the VC characteristics of the classes e F(1) i , F(2) i and F(3) i , such that N(Fi, L2(P( i) n ), δ) A where N(Fi, L2(P( i) n ), δ) is the δ-covering number of Fi with respect to the L2(P( i) n ) metric. Theorem 6 in Recht et al. (2011) (see also Szarek (1998)) implies that we can take ν Dd and A to be an absolute constant, which does not depend on D, d or κ. Corollary 2.2 from Gin e and Guillou (2002) implies E( i) sup f Fi f(Yj) E( i)f(Yj) Rσ with an absolute constant R and σ2 supf F Varf(Y1). Lemma 1 yields Xi Xj 2 8 Πi(Yi Yj) 2 + 32M2( + b/h)2h2 Using this, we can derive E( i)e 2 Πi(Yj Yi) 2 h2 (u T (Yj Xi))2 κ2 e Xj Xi 2 4h2 Yj Xi 2 2e1/2E( i)e Xj Xi 2 4h2 Xj Xi 2 + 2e1/2E( i)e Xj Xi 2 2e1/2E( i)e Xj Xi 2 4h2 Xj Xi 2 + 2e1/2E( i)e Xj Xi 2 Here we used the fact that 4M( + α) κ. Next, due to Lemma 9, there exist absolute constants B1 and B2, such that E( i)e Xj Xi 2 4h2 Xj Xi 2 B1hd+2 , E( i)e Xj Xi 2 Therefore, we can take σ2 = B2(h2 M2)hd with an absolute constant B. Thus, there exists a constant CR,M,d, depending on R, M and d only (but not on ), such that E( i) sup f Fi f(Yj) E( i)f(Yj) D(h2 M2)nhd log A B2(h2 M2)hd . Structure-adaptive Manifold Estimation Finally, we use the Talagrand s concentration inequality (Talagrand, 1996) and obtain bounds on large deviations of sup u B(0,1), Πi Π(Xi) h/κ j=1 wij(Πi)u T (Yj Xi). More precisely, we use the version of Talagrand s inequality from (Bousquet, 2002), where a deviation bound with nice constants was derived. Denote Zi = sup f Fi j=1 (f(Yj) Ef(Yj)) . Then (Bousquet, 2002, Theorem 2.3) claims that, on an event with probability 1 n 2, it holds that 4v log n + 2 log n with v = nσ2 + 2EZi and the same σ as in (17). This, together with (a) and (17), yields sup u B(0,1), Πi Π(Xi) h/κ j=1 wij(Πi)u T (Yj Xi) M( + b/h) h 2h2 + ΦM,b,h,κ, nhd+2 D(h2 M2)nhd log n on an event with probability at least 1 n 2. The union bound implies that, on an event with probability at least 1 n 1, it holds that j=1 wij(Πi)(Yj Xi) M( + b/h) h 2h2 + ΦM,b,h,κ, nhd+2 D(h2 M2)nhd log n. j=1 wij(Πi) (I Π(Xi))(Yj Xi) (1 + ΦM,b,h,κ, )nhd+2/κ + q D(h4/κ2 M2)nhd log n is proven in a completely similar way. Proposition 2 yields sup u B(0,1), Πi Π(Xi) h/κ j=1 wij(Πi)u T (I Π(Xi))(Yj Xi) (1 + ΦM,b,h,κ, )nhd+2 Puchkin and Spokoiny Again, applying the VC subgraph argument and using (Gin e and Guillou, 2002, Corollary 2.2), we obtain E( i) sup u B(0,1), Πi Π(Xi) h/κ j=1 wij(Πi)u T (I Π(Xi))(Yj Xi) (1 + ΦM,b,h,κ, )nhd+2 The only difference is that we can take (σ )2 hd(h4/κ2 M2) in this case. The reason for that is (Federer, 1959, Theorem 4.18), which implies (I Π(Xi))(Xj Xi) Xj Xi 2 E( i)w2 ij(Πi) u T (I Π(Xi))(Yj Xi) 2 E( i)w2 ij(Πi) (I Π(Xi))(Yj Xi) 2 2E( i)w2 ij(Πi) (I Π(Xi))(Xj Xi) 2 + 2E( i)w2 ij(Πi) εj 2 nhd h4/κ2 M2 . E( i) sup u B(0,1), Πi Π(Xi) h/κ j=1 wij(Πi)u T (I Π(Xi))(Yj Xi) (1 + ΦM,b,h,κ, )nhd+2 Dnhd(h4/κ2 M2) log h 1. Finally, applying the Talagrand s concentration inequality (Bousquet, 2002, Theorem 2.3)), we obtain that, for a fixed i, with probability at least 1 n 2, j=1 wij(Πi)(I Π(Xi))(Yj Xi) (1 + ΦM,b,h,κ, )nhd+2/κ + q D(h4/κ2 M2)nhd log n . Applying the union bound, we conclude j=1 wij(Πi)(I Π(Xi))(Yj Xi) (1 + ΦM,b,h,κ, )nhd+2/κ + q D(h4/κ2 M2)nhd log n. Structure-adaptive Manifold Estimation Appendix D. Proof of Lemma 4 Throughout the proof of Lemma 4, we work on the event Ωk, on which (10) holds. Consider j=1 ( b X(k) j b X(k) i )( b X(k) j b X(k) i )T 1 b X(k) j b X(k) i γhk . Denote vij = 1 b X(k) j b X(k) i γhk . Let Zij = π{Xi} TXi M (Xj) , 1 j n, b Z(k) ij = π{Xi} TXi M b X(k) j , 1 j n, and introduce a matrix j=1 vij( b Z(k) ij b Z(k) ii )( b Z(k) ij b Z(k) ii )T . From the conditions of Lemma 4, we have max 1 j n b Z(k) ij b X(k) j β2 nhd k D log n 2β2h2 k κ . This yields bΣ (k) i bΞ (k) i = sup u B(0,1) j=1 vij h (u T ( b X(k) j b X(k) i ))2 (u T ( b Z(k) ij b Z(k) ii ))2i j=1 vij b X(k) j b X(k) i + b Z(k) ij b Z(k) ii b X(k) j b Z(k) ij + b X(k) i b Z(k) ii . Since TXi M is a convex set, then b Z(k) ij b Z(k) ii = π{Xi} TXi M b X(k) j π{Xi} TXi M b X(k) i b X(k) j b X(k) i . bΣ (k) i bΞ (k) i 8β2h2 k κ j=1 vij b X(k) j b X(k) i 8γβ2h3 k κ Next, we are going to prove that, with probability at least 1 n 2, j=1 1 ( Xj Xi (γ + 4β1)hk) (18) 2C n(γ + 4β1)dhd k. Puchkin and Spokoiny The first inequality follows from the fact that b X(k) i b X(k) j γhk implies Xi Xj (γ + 4β1)hk. Next, we have P( i) ( Xj Xi (γ + 4β1)hk) = Z M x Xi (γ+4β1)hk v (γ+4β1)hk det g(v)dv. Using the inequality | p det g(v) 1| d v 2/κ2 (see (Trillos et al., 2019, Equation 2.1)), we have that p det g(v) 1, provided that v (γ + 4β1)hk. Then P( i) ( Xj Xi (γ + 4β1)hk) Vol(B(0, (γ + 4β1)hk)) (γ + 4β1)dhd k. Thus, there exists a constant C, such that P( i) ( Xj Xi (γ + 4β1)hk) C(γ + 4β1)dhd k . Denote C = C 16/3. The Bernstein s inequality yields j=1 1 ( Xj Xi 2γhk) > 2C nhd k e (C nhd k)2 2 (C nhd k)+2/3 (C nhd k) = e 3C nhd k 8 e 2nhd k e 2nhd K 1 and then (18) holds. From now on, we are working on the event, on which (18) holds. On this event, we have bΣ (k) i bΞ (k) i 8γ(γ + 4β1)dβ2C nhd+3 k . (19) Consider the matrix bΞ (k) i . According to Lemma 7, we have the following guarantee on the spectral gap of bΞ (k) i : λd(bΞi) λd+1(bΞi) 1 2 c(γ 4β1)d (γ 4β1)d+2nhd+2 k 9C n 2/d(γ + 4β1)d+2nhd+2 k 16C β2 1(γ + 4β1)dnhd+2 k C (γ + 4β1)d+4nhd+4 k κ2 . Structure-adaptive Manifold Estimation with probability at least 1 n 2. Take γ satisfying the inequalities (γ 4β1)d 96C c2 , c 32(γ 4β1)d+2 C n 2/d(γ + 4β1)d+2, (20) c 32(γ 4β1)d+2 16C β2 1(γ + 4β1)d, c 32(γ 4β1)d+2 C (γ + 4β1)d+4h2 0 κ2 . Note that such γ always exists if n 2/d and h0 are sufficiently small. Then λd(bΞi) λd+1(bΞi) c 8nhd+2 k 3c 32nhd+2 k = c 32nhd+2 k . (21) The Davis-Kahan sin θ theorem (Davis and Kahan, 1970) and the inequalities (19), (21) imply that for a fixed i from 1 to n with probability at least 1 2n 2 it holds bΠ (k+1) i Π(Xi) 256γ(γ + 4β1)dβ2C nhd+3 k cnhd+2 k = e Chk with e C = (256γ(γ + 4β1)dβ2C )/c. Applying the union bound, we have that max 1 i n bΠ (k+1) i Π(Xi) e Chk with probability at least 1 2n 1, and the proof of Lemma 4 is finished. Appendix E. Proof of Proposition 2 The proof of Proposition 2 is divided into three parts for the sake of convenience. On each step we prove one of the inequalities (a), (b), (c). E.1 Proof of Proposition 2a First, consider the expression E( i)wij(Πi)u T (Xj Xi). Let rd = 4h p (d + 2) log h 1. Then E( i)wij(Πi)u T (Xj Xi) = E( i)wij(Πi)u T (Xj Xi)1 (Xj B(Xi, rd)) + E( i)wij(Πi)u T (Xj Xi)1 (Xj / B(Xi, rd)) . Due to Lemma 1, Xi Xj 2 8 Πi(Yi Yj) 2 + 32M2( + b/h)2h2 Puchkin and Spokoiny and, if Xj / B(Xi, rd), we conclude wij(Πi) e Xi Xj 2 8h2 + 4M2( +b/h)2 4M2( +b/h)2 κ2 Xi Xj 2+r2 d 16h2 e 1 4 Xi Xj 2 Here we used the fact that, due to the conditions of Theorem 1, M( + b/h) κ/4. Using the equality max t>0 te t2 we conclude E( i)wij(Πi)u T (Xj Xi)1 (Xj / B(Xi, rd)) hd+3. (23) We see that outside the ball B(Xi, rd), the weights wij(Πi) become very small. It remains to consider the event {Xj B(Xi, rd)}. We assume that h0 is sufficiently small, so it holds rd 2h0 q 2(d + 2) log h 1 0 τ/4. On this event Yi Yj 2M + rd < τ, which yields E( i)wij(Πi)u T (Xj Xi)1 (Xj B(Xi, rd)) = E( i)e Πi(Yj Yi) 2 h2 u T (Xj Xi)1 (Xj B(Xi, rd)) . Using the Taylor s expansion, one has e Πi(Yj Yi) 2 h2 = e Xj Xi 2 h2 Πi(Yj Yi) 2 Xj Xi 2 where ξ = θ(Xj Xi) + (1 θ)Πi(Yi Yj) for some θ (0, 1). Consider the expectation E( i)e Πi(Xj Xi) 2 h2 u T (Xj Xi)1 (Xj B(Xi, rd)) . According to Lemma 8, it does not exceed E( i)e Πi(Xj Xi) 2 h2 u T (Xj Xi)1 (Xj B(Xi, rd)) dhd+2 Next, consider the second term in (24). Note that ξ = θΠi(Yj Yi) + (1 θ)(Xj Xi) Xi Xj Πi(Yj Yi) (Xj Xi) . From the proof of Lemma 1 (see Equation 14) we know that Xj Xi Πi(Yj Yi) Xj Xi 2 2κ + h Xj Xi + 2M( h + b) κ + 3M Xi Xj Structure-adaptive Manifold Estimation Xi Xj Xi Xj 2 2κ 2M( h + b) 1 3M + h + rd Xi Xj 2M( h + b) Xi Xj 2M( h + b) 4 Xi Xj 2M( h + b) This yields Xi Xj 2 16 ξ 2 + 128M2( + b/h)2h2/κ2 16 ξ 2 + 8h2. (25) h2 Πi(Yj Yi) 2 Xj Xi 2 u T (Xj Xi)1 (Xj B(Xi, rd)) (26) e1/2E( i)e Xi Xj 2 16h2 Πi(Yj Yi) 2 Xj Xi 2 Xj Xi 1 (Xj B(Xi, rd)) Πi(Yj Yi) 2 Xj Xi 2 = Πi(εj εi) 2 + 2(εj εi)T Πi(Xj Xi) (I Πi)(Xj Xi) 2. Due to (Federer, 1959, Theorem 4.18), (I Πi)(Xj Xi) Πi Π(Xi) Xj Xi + (I Π(Xi))(Xj Xi) κ + Xj Xi 2 Using (13), we obtain Πi(Yj Yi) 2 Xj Xi 2 κ + 3M Xj Xi + 2 Xj Xi 2M( h + b) κ + 3M Xj Xi κ + Xj Xi 2 Puchkin and Spokoiny Next, it is useful to control the expectations of the form E( i)e Xi Xj 2 16h2 Xj Xi q1 (Xj B(Xi, rd)) . Lemma 9 yields E( i)e Xi Xj 2 16h2 Xj Xi q1 (Xj B(Xi, rd)) hq+d. The inequalities (26) and (27) and Lemma 9 yield that, up to a multiplicative constant, the left-hand side of (26) is bounded by M2( + b/h)2hd+1 κ2 + M2hd+1 κ2 + M( + b/h)hd+1 M( + b/h + 1)hd+1 κ + ( 2 + 1)hd+3 This and Lemma 8 imply E( i)wij(Πi)u T (Xj Xi) M( + b/h) h 2h2 with a hidden constant, which does not depend on . E.2 Proof of Proposition 2b Consider the expectation j=1 wij(Πi)u T (I Π(Xi))(Xj Xi). Since for each j = i the summand has the same conditional distribution with respect to (Xi, Yi), it is enough to prove that E( i)wij(Πi)u T (I Π(Xi))(Xj Xi) hd+2 for any distinct j. Again, we use the decomposition j=1 wij(Πi)u T (I Π(Xi))(Xj Xi) = E( i) n X j=1 wij(Πi)u T (I Π(Xi))(Xj Xi)1(Xj B(Xi, rd)) + E( i) n X j=1 wij(Πi)u T (I Π(Xi))(Xj Xi)1(Xj / B(Xi, rd)). Structure-adaptive Manifold Estimation From (23), the second term is of order hd+3 hd+2/κ. On the event {Xj B(Xi, rd)}, we can use (Federer, 1959, Theorem 4.18): E( i)wij(Πi)u T (I Π(Xi))(Xj Xi)1(Xj B(Xi, rd)) E( i)wij(Πi) (I Π(Xi))(Xj Xi) 1(Xj B(Xi, rd)) 2κE( i)wij(Πi) Xj Xi 21(Xj B(Xi, rd)). Using (22), we obtain wij(Πi) = e Πi(Yj Yi) 2 h2 e Xj Xi 2 8h2 + 4M2( +b/h)2 h2 e 1 2 Xj Xi 2 The assertion of Proposition 2b now follows from Lemma 9. E.3 Proof of Proposition 2c To complete the proof of Proposition 2, it remains to bound the expectation E( i)wij(Πi)u T εj. Outside the ball B(Xi, rd), we have E( i)wij(Πi)u T εj1(Xj / B(Xi, rd) Mhd+2, so it remains to control the expectation E( i)wij(Πi)u T εj1(Xj B(Xi, rd). Again, use the Taylor s expansion: e Πi(Yi Yj) 2 h2 = e Xi Xj 2 h2 + e Xi Xj 2 h2 Πi(Yi Yj) 2 Xi Xj 2 h2 Πi(Yi Yj) 2 Xi Xj 2 2 where ζ = ϑΠi(Yj Yi)+(1 ϑ)(Xj Xi) for some ϑ (0, 1). On the event { Xi Xj rd} it holds Yi Yj 2M + rd τ. This yields E( i)e Xi Xj 2 h2 1( Yi Yj τ)1( Xi Xj rd)u T εj (29) = E( i)e Xi Xj 2 h2 1( Xi Xj rd)u T εj = 0. Now, consider the term E( i)e Xi Xj 2 h2 Πi(Yi Yj) 2 Xi Xj 2 1( Xi Xj rd)u T εj. Puchkin and Spokoiny It is equal to E( i)e Xi Xj 2 h2 1( Xi Xj rd)u T εj Πi(εi εj) 2 + 2(Xi Xj)T Πi(εi εj) (I Πi)(Xi Xj) 2 First, note that E( i)e Xi Xj 2 h2 (I Πi)(Xi Xj) 2 h2 1( Xi Xj rd)u T εj = 0. (30) Next, (13) implies Πi(εi εj) 2 2M( h + b) κ + 3M Xj Xi Then, using the inequality εj M (due to (A3)) and Lemma 9, we obtain E( i)e Xi Xj 2 h2 1( Xi Xj rd)u T εj Πi(εi εj) 2 M3( + b/h)2hd Finally, consider the expectation E( i)e Xi Xj 2 h2 2(Xi Xj)T Πi(εi εj) h2 1( Xi Xj rd)u T εj. Denote vij = 2E Πi(εi εj)u T εj | Xi, Xj, Yi . According to (13), the norm of the vector vij is bounded by vij 2M Πi(εi εj) 2M 2M( h + b) κ + 3M Xj Xi On the event { Xi Xj rd}, we have vij 2M 2M( h + b) Applying Lemma 8 with the vector u = vij/ vij , we obtain E( i)e Xi Xj 2 h2 2(Xi Xj)T Πi(εi εj) h2 1( Xi Xj rd)u T εj Structure-adaptive Manifold Estimation Taking (30), (31) and (32) together, one obtains E( i)e Xi Xj 2 h2 Πi(Yi Yj) 2 Xi Xj 2 M3( + b/h)2hd To complete the proof of Proposition 2, it remains to bound Πi(Yi Yj) 2 Xi Xj 2 2 2h4 1( Xi Xj rd)u T εj, where ζ = ϑΠi(Yj Yi) + (1 ϑ)(Xj Xi) for some ϑ (0, 1). The same argument, as in the analysis of the vector ξ (see the proof of Proposition 1a, Inequality 25), yields Xi Xj 2 16 ζ 2 + 128M2( + b/h)2h2/κ2 16 ζ 2 + 8h2 h2 e1/2e Xi Xj 2 Due to (27), Πi(Yi Yj) 2 Xi Xj 2 2 " 2M( h + b) κ + 3M Xj Xi + 2 Xj Xi 2M( h + b) κ + 3M Xj Xi κ + Xj Xi 2 Applying the inequality (a + b + c)2 3a2 + 3b2 + c2 and Lemma 9, we obtain Πi(Yi Yj) 2 Xi Xj 2 2 2h4 1( Xi Xj rd)u T εj (34) Mhd 4 M4( + b/h)4h4 κ4 + M2( + b/h)2h4 M5(1 + + b/h)4hd κ4 + M3(1 + + b/h)2hd κ2 + (1 + 4)Mhd+4 The assertion of Proposition 2c follows from Inequalities 29, 33, 34 and the fact that, due to conditions of Theorem 1, M(1 + + b/h) Puchkin and Spokoiny Appendix F. Proofs Related to Theorem 3 F.1 Proof of Lemma 5 It is enough to consider the case D = d+1. For any x0 Rd+1 and r > 0, denote a sphere of radius r centered at x0 by B(x, r) = {x Rd+1 : x x0 = r}. Consider M0 = B(0, κ). Let a random element X have a uniform distribution on M0. Clearly, M0 satisfies (A1) and the distribution of X satisfies (A2) with L = 0, p0 = p1 = ((d + 1)ωd+1κd) 1, where ωd+1 is the volume of the Euclidean ball in Rd+1 with radius 1. Given X, let ε have a uniform distribution on TXM0 B(X, Mb/κ). Then E(ε | X) = 0, ε = Mb κ P( | X)-almost surely, and the assumption (A3) is fulfilled. Consider Y = X + ε. Since X is orthogonal to ε by the construction, we have Y 2 = X 2 + ε 2 = κ2 + M2b2 κ2 almost surely. Consequently, the random element Y is supported on the sphere B(0, R) with R = p κ2 + M2b2/κ2. By the spherical symmetry construction, Y has a uniform distribution on B(0, R). Let X have a uniform distribution on M1 = B(0, R) and, for any X , let (ε | X ) = 0 P( | X )-almost surely. Then X + ε and X + ε have the same distribution. However, d H(M1, M0) = R κ = κ Here we used the fact that, due to the concavity of 1 + x 1, it holds that 3, x [0, 1]. Thus, for any estimate c M, we have sup M M d κ d H( c M, M) 1 2d H(M1, M0) > M2b2 F.2 Proof of Lemma 6 Without loss of generality, we assume D = d + 1. We write a (d + 1)-dimensional vector as (u, v), where u Rd, v R. Let Z(0) RD be a d-dimensional C -manifold without a boundary with reach greater than 1 such that {(z, 0) : z Rd, z 1/2} Z(0). In (Aamari and Levrard, 2019), the authors claim that such a manifold can be constructed by flattering smoothly a unit sphere in RD. Let Z = 4κZ(0). Then Z is a C -manifold without a boundary. Moreover, its reach is at least 4κ and {(z, 0) : z Rd, z 2κ} Z. Structure-adaptive Manifold Estimation We construct manifolds in the following way. Let ψ : Rd R be a smooth function, such that maxu ψ(u) = ψ(0) = 1, ψ(u) = 0 for any u / B(0, 1) and supu 2ψ(u) Λ for an absolute constant Λ. Let (z1, 0), . . . , (z N, 0), where z1, . . . , z N Rd, be a 2h-packing of d-dimensional ball Z B(0, κ/2), N = (4h/κ) d, h < κ/4. For any j {1, . . . , N}, introduce a manifold ed+1 : (z, 0) Z B(0, κ) (Z\B(0, κ)) , where the vector ei is the i-th vector of the canonical basis in Rd+1 with the components e(j) i = 1(i = j). Let M0 be equal to Z. Notice that Mj, j {1, . . . , N} differs from M0 only on the set B((zj, 0), h), and for any k = j the balls B((zj, 0), h) and B((zk, 0), h) do not intersect. In other words, we consider a family of manifolds with a small bump in one of the points (z1, 0), . . . , (z N, 0). Show that the family of manifolds M κ = {Mj : 1 j N} with where c0 is a constant to be chosen later, is contained in the class M d κ introduced in (A1). It is clear that Mj is a compact, connected, smooth d-dimensional manifold without a boundary for any j from 1 to N. The most important part is to check that the reach of Mj is not less than κ. For this purpose, we use Theorem 4.18 from Federer (1959), which states that reach (M) κ if and only if for any x, x M it holds d(x , {x} Tx M) x x 2/(2κ). Fix arbitrary j {1, . . . , N} and introduce fj(z) = z 0 Then for any x Mj B(0, κ) the exists unique (z, 0) Z, such that x = fj(z). By the construction, the inverse function to fj(z), (z, 0) Z B(0, κ), is given by f 1 j (x) = x(1), . . . , x(d) T , where x Mj B(0, κ) and x(j) is the j-th component of the vector x. Moreover, the unit normal to Mj at the point x = fj(z) is given by νj(z) = C 1 h Fix arbitrary x = fj(z), x0 = fj(z0) Mj and check that νj(z0)T (x x0) = νj(z0)T (fj(z) fj(z0)) z z0 2 Puchkin and Spokoiny The last inequality is obvious, since (z z0) is a subvector of (x x0). It remains to check the second inequality. It holds that νj(z0)T (fj(z) fj(z0)) κΛ ψT z0 zj (z z0) + h2 κΛ ψT z0 zj (z z0) + h2 κΛ Λ z z0 2 2h2 = z z0 2 Here we used Taylor s expansion of ψ up to the second order and the fact that 2ψ Λ. Now, we are going to describe distributions of X and ε in the model (1). Let a random element Z have a uniform distribution on Z. For any fixed j {1, . . . , N}, we take X = fj(Z), where fj(z) = z 0 Denote a volume of the set Z by VZ. Then for any x, such that f 1 j (x) / B(zj, h), the density pj(x) is just V 1 Z . Otherwise, the density pj(x) of X is defined by the formula det fj(f 1 j (x)) 1 f 1 j (x) zj Since for any two vectors u, v Rd+1 it holds det(I + uv T ) = 1 + u T v, we have κΛe T d+1 ψ f 1 j (x) zj !! 1 . (36) Note that ψ(0) = 0 by construction. Taking into account that supu 2ψ(u) Λ, we conclude that ψ f 1 j (x) zj ! Λ f 1 j (x) zj h Λ x : f 1 j (x) B(zj, h). This and the fact that h < κ/4 yield p0 = 4 5VZ 1 VZ 1 + h κ pj(x) 1 VZ 1 h κ 4 3VZ = p1. Thus, the density of X is bounded from above and below by p1 and p0 respectively. Show that pj(x) has 4p1/(3κ)-Lipschitz derivative. Differentiating (36), we obtain pj(x) = VZp2 j(x) κΛ f 1 j (x) zj VZp2 1 κ 4p1 Structure-adaptive Manifold Estimation where Id,d+1 Rd (d+1) is the matrix of the first d rows of the identity matrix Id+1. Thus, for each j, the density qj(x) fulfils (A2). Next, we describe the conditional distribution of Y given X. We generate Y from the model Y = X + ξed+11(X B(0, κ)), X Mj, (37) where P(ξ = 0.5M X(d+1)|X) = η(X), P(ξ = 0.5M X(d+1)) = 1 η(X), η = η(X) = 1/2 + X(d+1)/M, and X(d+1) is the (d + 1)-th component of X. Note that Y belongs either to the set M+ = {(z, 0.5M) : (z, 0) B(0, κ) Rd+1}, or to the set M = {(z, 0.5M) : (z, 0) B(0, κ) Rd+1}, or to the set Z\B(0, B(0, κ)). It remains to check the condition (A3). Take where c0 is such that the condition h2/(MκΛ) 1/2 is fulfilled. Such c0 exists since M (log n/n)2/d. First, note that the noise magnitude is not greater than 0.5M +h2/(κΛ), which is less than M. Second, using the expression (35) of the unit normal to Mj at the point x = fj(z), we have ξΠ(fj(z))ed+1 2 = |ξ|2 |ξe T d+1νj(z)|2 = |ξ|2 |ξ|2 1 + h2 ψ(z/h) 2/(κΛ)2 = |ξ|2h2 ψ(z/h) 2 κ2Λ2 + h2 ψ z zj h 2 Λ2|ξ|2 h2 and (A3) holds with Here we used the fact that for any u B(0, 1) ψ(u) = ψ(u) ψ(0) max u B(0,1) 2ψ(u ) u L . We use (Tsybakov, 2009, Theorem 2.5) to prove the lower bound in Theorem 3. Let Pj, 0 j N, be the probability measure, generated by Y = X + ε, where X Mj. Then P0 is a dominating measure, i.e. Pj P0 for all j from 1 to N, and for any j = k we have d H(Mj, Mk) h2 Prove that, for sufficiently large n, it holds j=1 KL(P n j , P n 0 ) α log N, (38) Puchkin and Spokoiny where KL(P, Q) is the Kullback-Leibler divergence between P and Q, α is a constant from the interval (0, 1/8). Then Theorem 2.5 in Tsybakov (2009) yields inf c M sup M M d κ Ed H( c M, M ) κ 1 M2κ2 log n It remains to check (38). For any j from 1 to N, it holds KL(P n j , P n 0 ) = n Z log d Pj(y) d P0(y)d Pj(y). The density of Pj with respect to the Hausdorffmeasure on (Z\B(0, κ)) M+ M is given by the formula log pj(y) = log 1 + 2h2 MκΛψ ey zj h log(2VZ), y = (ey, 0.5M) M+, log 1 2h2 MκΛψ ey zj h log(2VZ), y = (ey, 0.5M) M , log(VZ), y Z\B(0, κ). The density of P0 with respect to the same measure is just log p0(y) = ( log(2VZ), y = (ey, M) M+ M , log(VZ), y Z\B(0, κ). KL(P n j , P n 0 ) log 1 + 2h2 Using the inequality (1 + t) log(1 + t) t + t2 2 , t ( 1, 1), we obtain that (1 + t) log(1 + t) + (1 t) log(1 t) t2, t ( 1, 1). Then, since 2h2 < MκΛ, we have KL(P n j , P n 0 ) n 2VZ M2κ2Λ2 ψ2 z dz = CΛ,Znhd+4 Structure-adaptive Manifold Estimation where CΛ,Z > 0 is a constant, depending on Z and Λ. Substituting h by c0 M2κ2 log n n 1/(d+4) , we obtain CΛ,Znhd+4 M2κ2 = cd+4 0 CΛ,Z log n. On the other hand, log N = log 4h d = d log κ 4 d log c0 + d d + 4 log n M2κ2 log n d log c0 + d 2(d + 4) log n, where in the last inequality we assumed that n is large, so it holds M2κ2 log n n1/4 and κ 4n 1/(4d+16). Choose any constant c0 > 0, satisfying the inequality 8cd+4 0 CL,Z log n + d log c0 < d 2(d + 4) log n. Such constant always exists. Thus, (38) is fulfilled, and (Tsybakov, 2009, Theorem 2.5) yields the claim of Theorem 3. Appendix G. Auxiliary Results This section contains some auxiliary results, which are used in the proofs. The results below often use technique concerning integration over manifolds. Therefore, we would like to start with a short background, which will help a reader follow the proofs. Given x M and v Tx M, let γ(t, x, v) be a geodesic starting at x, such that dγ(t, x, v) The exponential map of M at the point x Ex : Tx M M is defined as Ex(p) = γ(1, x, v). Note that d M(Ex(v), x) = v for v κ/4, where d M(x, x ) is the length of the shortest path on M between x and x . We extensively use integration over manifolds. In these cases, the exponential map is useful to apply the change of variables formula. For a small open set U, x U M, it holds Z f(x)d W(x) = Z det g(p)dp, where g(p) is the metric tensor. Without going deep into details, we just mention that the metric tensor allows a nice decomposition (see, for instance, (Trillos et al., 2019, Equation 2.1)), which is enough for our purposes: det g(v) 1 d v 2 Puchkin and Spokoiny Proposition 3 Let M M d κ and x, x M, x x 2κ. Let Π(x) and Π(x ) be the projectors onto the tangent spaces Tx M and Tx M respectively. Then the following inequalities hold: x x d M(x, x ) 2 x x , (a) Π(x) Π(x ) x x Proof Proposition 3a follows from (Boissonnat et al., 2019, Lemma 2.5) and the inequality π t, t (0, π/2). To prove Proposition 3b, we use (Golub and Van Loan, 2013, Theorem 2.5.1): Π(x) Π(x ) = (I Π(x))Π(x ) = sin (Tx M, Tx M). Then the claim of the proposition follows from (Boissonnat et al., 2019, Corollary 3.6): Π(x) Π(x ) = sin (Tx M, Tx M) 2 sin (Tx M, Tx M) Lemma 7 Fix any i from 1 to n. There are absolute constants c and C , such that, with probability at least 1 n 2, it holds λd(bΞi) λd+1(bΞi) 1 2 c(γ 4β1)d (γ 4β1)d+2nhd+2 k 9C n 2/d(γ + 4β1)d+2nhd+2 k 16C β2 1(γ + 4β1)dnhd+2 k C (γ + 4β1)d+4nhd+4 k κ2 . Proof Now, consider the matrix bΞ(k) i . It is clear that all the eigenvectors of bΞ(k) i belong to the linear space TXi M . Thus, bΞ(k) i has at most d non-zero eigenvalues. In what follows, we show that the d-th largest eigenvalue of bΞ(k) i is non-zero and give a lower bound on the spectral gap λd(bΞ(k) i ) λd+1(bΞ(k) i ). It holds that λd(bΞ (k) i ) λd+1(bΞ (k) i ) = min u TXi M , u =1 j=1 vij(u T ( b Z(k) ij b Z(k) ii ))2 . Using the inequality b Z(k) ij Xj b Z(k) ij Zij + Xj Zij b X(k) j Xj + Xj Zij 2β1hk + Xj Xi 2 Structure-adaptive Manifold Estimation we obtain that for any u it holds (u T ( b Z(k) ij b Z(k) ii ))2 1 2(u T (Xj Xi))2 8β2 1h2 k Xj Xi 4 which yields λd(bΞ (k) i ) λd+1(bΞ (k) i ) 1 2 min u B(0,1) TXi M j=1 vij(u T (Xj Xi))2 j=1 vij Xj Xi 4 2 min u B(0,1) TXi M j=1 vij(u T (Xj Xi))2 16C β2 1(γ + 4β1)dnhd+2 k C (γ + 4β1)d+4nhd+4 k κ2 . In the last inequality we used (18) and the fact that Xi Xj (γ + 4β1)hk, if b X(k) i b X(K) j γhk. It remains to provide a lower bound for the sum min u B(0,1) TXi M j=1 vij(u T (Xj Xi))2. Let Nε stand for a ε-net of the set B(0, 1) TXi M . It is known that |Nε| (3/ε)d. Here and further in this proof we will assume ε = εn = 3n 1/d. Then, for any t > 0, it holds min u B(0,1) TXi M j=1 vij(u T (Xj Xi))2 < t j=1 vij(u T (Xj Xi))2 < 2t + 2ε2 n X j=1 vij Xi Xj 2 j=1 vij(u T (Xj Xi))2 < 2t + 4C ε2(γ + 4β1)d+2nhd+2 k Fix any u Nε and consider j=1 vij(u T (Xj Xi))2 < 2t + 4C ε2(γ + 4β1)d+2nhd+2 k Note that n X j=1 vij(u T (Xj Xi))2 j=1 1 ( Xi Xj (γ 4β1)hk) (u T (Xj Xi))2, Puchkin and Spokoiny so it holds j=1 vij(u T (Xj Xi))2 < 2t + 4C ε2(γ + 4β1)d+2nhd+2 k j=1 1 ( Xi Xj (γ 4β1)hk) (u T (Xj Xi))2 < 2t + 4C ε2(γ + 4β1)d+2nhd+2 k Given Xi, the random variables 1 ( Xi Xj hk) (u T (Xj Xi))2, 1 j n, are conditionally independent and identically distributed, and expectation of each of them can be bounded below by E( i)1 ( Xi X1 (γ 4β1)hk) (u T (X1 Xi))2 M B(Xi,hk) {|u T (Xi x)| 1 x Xi 2d W(x) c(γ 4β1)d+2hd+2 k . At the same time, the variance of these random variables does not exceed E( i)1 ( Xi X1 (γ 4β1)hk) (u T (X1 Xi))4 (γ 4β1)4h4 k P( i) Xi X1 (γ 4β1)hk C(γ 4β1)d+4hd+4 k C (γ 4β1)d+4hd+4 k , where C = C 16/3. Again, using the Bernstein s inequality, we obtain that for any t it holds j=1 1 ( Xi Xj (γ 4β1)hk) (u T (Xj Xi))2 < n E( i)1 ( Xi Xj (γ 4β1)hk) (u T (Xj Xi))2 t 2n C (γ 4β1)d+4hd+4 k + 2(γ 4β1)2h2 k t/3 Take t = δn E( i)1 ( Xi Xj (γ 4β1)hk) (u T (Xj Xi))2. Then 2n C (γ 4β1)d+4hd+4 k + 2(γ 4β1)2h2 k t/3 c2n2δ2(γ 4β1)2d+4h2d+4 k 2n C (γ 4β1)d+4hd+4 k + 2cnδ 3 (γ 4β1)d+4hd+4 k nc2δ2(γ 4β1)dhd k Structure-adaptive Manifold Estimation Choose δ satisfying the inequality c2δ2(γ 4β1)d In particular, δ = 2 c(γ 4β1)d + is a suitable choice. Then nc2δ2(γ 4β1)dhd k 2C + 2cδ e 3 log n e 2 log n log |Nε| = 1 |Nε|n2 . Thus, with probability at least 1 (|Nε|n2) 1, it holds j=1 1 ( Xi Xj (γ 4β1)hk) (u T (Xj Xi))2 1 2 c(γ 4β1)d c(γ 4β1)d+2hd+2 k . By the union bound, on an event with probability at least 1 n 2 it holds j=1 1 ( Xi Xj (γ 4β1)hk) (u T (Xj Xi))2 (41) 1 2 c(γ 4β1)d cn(γ 4β1)d+2hd+2 k Then, due to (40) and (41), on this event min u B(0,1) TXi M j=1 vij(u T (Xj Xi))2 1 2 c(γ 4β1)d (γ 4β1)d+2nhd+2 k 2C ε2(γ + 4β1)d+2nhd+2 k , Puchkin and Spokoiny and, together with (39), this yields λd(bΞi) λd+1(bΞi) 1 2 c(γ 4β1)d (γ 4β1)d+2nhd+2 k C ε2(γ + 4β1)d+2nhd+2 k 16C β2 1(γ + 4β1)dnhd+2 k C (γ + 4β1)d+4nhd+4 k κ2 . The choice ε = 3n 1/d yields the claim of Lemma 7. Lemma 8 Let E( i) denote the conditional expectation E( |(Xi, Yi)) and let rd = 4h p (d + 2) log h 1. Then, for any i from 1 to n, it holds E( i)e Xj Xi 2 h2 u T (Xj Xi)1 (Xj B(Xi, rd)) dhd+2 Proof We have E( i)e Xj Xi 2 h2 u T (Xj Xi)1 (Xj B(Xi, rd)) h2 u T (x Xi)p(x)d W(x). Due to (A2), the last expression does not exceed h2 u T (x Xi)d W(x) h2 x Xi 2d W(x). Due to Lemma 9, h2 x Xi 2d W(x) hd+2 so it remains to prove that h2 u T (x Xi)d W(x) hd+2 Structure-adaptive Manifold Estimation Let EXi( ) be the exponential map of M at Xi and denote e B(Xi, rd) = E 1(M B(Xi, rd)). Note that EXi( ) is a bijection on e B(Xi, rd) (see, for instance, (Aamari and Levrard, 2019, Lemma 1)), because rd κ/4. Then h2 u T (x Xi)d W(x) e EXi (v) EXi (0) 2 h2 u T (EXi(v) EXi(0)) p det g(v)dv. Introduce functions ψXi(v) = EXi(v) EXi(0) v and ϕXi(v) = ψXi(v) 2 + 2v T ψXi(v). Due to (Aamari and Levrard, 2019, Lemma 1), it holds that EXi(v) EXi(0) v = ψXi(v) 5 p 2 which yields ψXi(v) = O v 2 κ , ϕXi(v) = O v 3 κ . Now, consider p det g(v). It is known (see, for instance, (Trillos et al., 2019, Equation 2.1)) that there exists an absolute constant C, such that p det g(v) 1 Cd v 2 Taking (42) and (43) into account, we obtain e EXi (v) EXi (0) 2 h2 u T (EXi(v) EXi(0)) p e v 2+ϕXi (v) h2 u T (v + ψXi(v)) p h2 1 + O v 3 u T v + O v 2 h2 u T vdv + Z Puchkin and Spokoiny For the last three terms, we get e EXi (v) EXi (0) 2 h2 u T (EXi(v) EXi(0)) p det g(v)dv (44) h2 u T vdv + O hd+2 and, in order to complete the proof, we have to show that h2 u T vdv = O hd+2 Note that, for any fixed u Rd, it holds h2 u T vdv = 0. h2 u T vdv = Z Rd\ e B(Xi,rd) Remind that e B(Xi, rd) = E 1 Xi (B(Xi, rd)) = {v : EXi(v) EXi(0) rd}. By the definition of the exponential map, EXi(v) EXi(0) d M (EXi(v), EXi(0)) = v . Structure-adaptive Manifold Estimation Then we conclude that e B(Xi, rd) B(0, rd). This yields Rd\ e B(Xi,rd) Rd\ e B(Xi,rd) h2 v dv e r2 d 2h2 Z e r2 d 2h2 Z By definition, rd = 2h p 2(d + 2) log h 1. This implies e r2 d 2h2 = h4(d+2). Moreover, 2h2 v dv hd+1. Thus, we conclude h2 u T vdv hd+1+4(d+2) hd+2 and (45) finishes the proof of Lemma 8. Lemma 9 Let E( i) denote the conditional expectation E( |(Xi, Yi)) and let rd = 4h p (d + 2) log h 1. Then, for any i from 1 to n, it holds E( i)e Xi Xj 2 16h2 Xj Xi q1 (Xj B(Xi, rd)) hq+d. Proof Using (A2), we obtain E( i)e Xi Xj 2 16h2 Xj Xi q1 (Xj B(Xi, rd)) 16h2 x Xi qp(x)d W(x) 16h2 x Xi qd W(x). Puchkin and Spokoiny Using the exponential map, we get Z 16h2 x Xi qd W(x) e EXi (v) EXi (0) 2 16h2 EXi(v) EXi(0) qp Taking into account that v = d M (EXi(v), EXi(0)) and applying (Boissonnat et al., 2019, Lemma 2.5), we conclude 2 EXi(v) EXi(0) v . On the other hand, (43) yields p det g(v) 1 for all v e B(Xi, rd). Thus, we obtain e EXi (v) EXi (0) 2 16h2 EXi(v) EXi(0) qp 32h2 v qdv Z 32h2 v qdv hd+q. Appendix H. Pseudocode of the Manifold Blurring Mean Shift Algorithm This section contains a pseudocode of the manifold blurring mean shift algorithm (Wang and Carreira-Perpinan, 2010, MBMS). Structure-adaptive Manifold Estimation Algorithm 2 Manifold blurring mean shift algorithm (with full graph), Wang and Carreira Perpinan (2010) 1: The sample of noisy observations Yn = (Y1, . . . , Yn), a bandwidth σ > 0, and positive integers k and d are given. 2: Initialize b X1 = Y1, . . . , b Xn = Yn. 4: Compute the increments b Xi = b Xi + j=1 K b Xj b Xi 2 j=1 K b Xj b Xi 2 2σ2 , 1 i n, where K(t) = e t. 5: For each i from 1 to n, find k nearest neighbors Ni of b Xi. 6: For all i from 1 to n, perform local PCA, that is compute b Xj, 1 i n, j Ni ( b Xj µi)( b Xj µi)T , 1 i n, and put Πi a projector onto a linear span of eigenvectors of bΣi, corresponding to the largest d eigenvalues. 7: Update the increments b Xi (I Πi) b Xi, 1 i n. 8: Update the estimates b Xi b Xi + Xi, 1 i n. 9: until stop 10: return the estimates b X1, . . . , b Xn. E. Aamari and C. Levrard. Stability and minimax optimality of tangential Delaunay complexes for manifold reconstruction. Discrete Comput. Geom., 59(4):923 971, 2018. E. Aamari and C. Levrard. Nonasymptotic rates for manifold, tangent space and curvature estimation. Ann. Statist., 47(1):177 204, 2019. E. Arias-Castro, D. Mason, and B. Pelletier. On the estimation of the gradient lines of a density and the consistency of the mean-shift algorithm. J. Mach. Learn. Res., 17(43): 1 28, 2016. Puchkin and Spokoiny M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput., 15(6):1373 1396, 2003. J.-D. Boissonnat, A. Lieutier, and M. Wintraecken. The reach, metric distortion, geodesic convexity and the variation of tangent spaces. J. Appl. and Comput. Topology, 3(1): 29 58, 2019. O. Bousquet. A Bennett concentration inequality and its application to suprema of empirical processes. C. R. Math. Acad. Sci. Paris, 334(6):495 500, 2002. M. A. Carreira-Perpinan. Gaussian mean-shift is an em algorithm. IEEE Trans. Pattern Anal. Mach. Intell., 29(5):767 776, 2007. M. A. Carreira-Perpinan. A review of mean-shift algorithms for clustering. Ar Xiv, abs/1503.00687, 2015. O. Chapelle, B. Schlkopf, and A. Zien. Semi-Supervised Learning. The MIT Press, 1st edition, 2010. M.-Y. Cheng and H.-T. Wu. Local linear regression on manifolds and its geometric interpretation. J. Amer. Statist. Assoc., 108(504):1421 1434, 2013. Y. Cheng. Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Anal. Mach. Intell., 17(8):790 799, 1995. D. Comaniciu and P. Meer. Mean shift: a robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell., 24(5):603 619, 2002. C. Davis and W. Kahan. The rotation of eigenvectors by a perturbation. iii. SIAM J. Numer. Anal., 7(1):1 46, 1970. D. Eberly, R. Gardner, B. Morse, S. Pizer, and C. Scharlach. Ridges for image analysis. J. Math. Imaging Vis., 4(4):353 373, 1994. H. Federer. Curvature measures. Trans. Amer. Math. Soc., 93:418 491, 1959. C. Fefferman, S. Ivanov, Y. Kurylev, M. Lassas, and H. Narayanan. Fitting a putative manifold to noisy data. COLT, Proc. Mach. Learn. Res., 75:688 720, 2018. K. Fukunaga and L. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans. Inf. Theory, 21:32 40, 1975. C. R. Genovese, M. Perone-Pacifico, I. Verdinelli, and L. Wasserman. Minimax manifold estimation. J. Mach. Learn. Res., 13:1263 1291, 2012a. C. R. Genovese, M. Perone-Pacifico, I. Verdinelli, and L. Wasserman. Manifold estimation and singular deconvolution under Hausdorffloss. Ann. Statist., 40(2):941 963, 2012b. C. R. Genovese, M. Perone-Pacifico, I. Verdinelli, and L. Wasserman. Nonparametric ridge estimation. Ann. Statist., 42(4):1511 1545, 2014. Structure-adaptive Manifold Estimation E. Gin e and A. Guillou. Rates of strong uniform consistency for multivariate kernel density estimators. Ann. Inst. H. Poincar e Probab. Statist., 38(6):907 921, 2002. E. Gin e and V. Koltchinskii. Empirical graph Laplacian approximation of Laplace-Beltrami operators: large sample results. IMS Lecture Notes Monogr. Ser., 51:238 259, 2006. G. H. Golub and C. F. Van Loan. Matrix computations. Johns Hopkins University Press, 4th edition, 2013. D. Gong, F. Sha, and G. Medioni. Locally linear denoising on image manifolds. AISTATS, Proc. Mach. Learn. Res., 9:265 272, 2010. M. Hein and M. Maier. Manifold denoising. NIPS, 19:561 568, 2006. M. Hristache, A. Juditsky, J. Polzehl, and V. Spokoiny. Structure adaptive approach for dimension reduction. Ann. Statist., 29(6):1537 1566, 2001a. M. Hristache, A. Juditsky, and V. Spokoiny. Direct estimation of the index coefficient in a single-index model. Ann. Statist., 29(3):595 623, 2001b. A. K. H. Kim and H. H. Zhou. Tight minimax rates for manifold estimation under Hausdorff loss. Electron. J. Stat., 9(1):1562 1582, 2015. J. Li, S. Ray, and B. G. Lindsay. A nonparametric statistical approach to clustering via mode identification. J. Mach. Learn. Res., 8(59):1687 1723, 2007. M. Maggioni, S. Minsker, and N. Strawn. Multiscale dictionary learning: non-asymptotic bounds and robustness. J. Mach. Learn. Res., 17(2):1 51, 2016. S. Osher, Z. Shi, and W. Zhu. Low dimensional manifold model for image processing. SIAM J. Img. Sci., 10:1669 1690, 2017. U. Ozertem and D. Erdogmus. Locally defined principal curves and surfaces. J. Mach. Learn. Res., 12:1249 1286, 2011. B. Recht, W. Xu, and B. Hassibi. Null space conditions and thresholds for rank minimization. Math. Program., 127(1, Ser. B):175 202, 2011. S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323 2326, 2000. Z. Shi and J. Sun. Convergence of the point integral method for laplace beltrami equation on point cloud. Res. Math. Sci., 4(1):22, 2017. S. J. Szarek. Metric entropy of homogeneous spaces. Banach Center Publ., 43:395 410, 1998. M. Talagrand. New concentration inequalities in product spaces. Invent. Math., 126(3): 505 563, 1996. J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319 2323, 2000. Puchkin and Spokoiny N. G. Trillos, D. Sanz-Alonso, and R. Yang. Local regularization of noisy point clouds: Improved global geometric estimates and data analysis. J. Mach. Learn. Res., 20(136): 1 37, 2019. A. B. Tsybakov. Introduction to nonparametric estimation. Springer Series in Statistics. Springer, New York, 2009. L. van der Maaten and G. Hinton. Visualizing data using t-SNE. J. Mach. Learn. Res., 9: 2579 2605, 2008. A. W. van der Vaart and J. A. Wellner. Weak convergence and empirical processes. Springer Series in Statistics. Springer-Verlag, New York, 1996. W. Wang and M. A. Carreira-Perpinan. Manifold blurring mean shift algorithms for manifold denoising. CVPR, pages 1759 1766, 2010. Y. Xia, H. Tong, W. Li, and L.-X. Zhu. An adaptive estimation of dimension reduction space. J. R. Stat. Soc., Ser. B, Stat. Methodol., 64(3):363 410, 2002. Z. Zhang and J. Wang. MLLE: Modified locally linear embedding using multiple weights. NIPS, 19:1593 1600, 2006.