# Data driven estimation of Laplace-Beltrami operator

Frédéric Chazal, Inria Saclay, Palaiseau, France, frederic.chazal@inria.fr
Ilaria Giulini, Inria Saclay, Palaiseau, France, ilaria.giulini@me.com
Bertrand Michel, Ecole Centrale de Nantes, Laboratoire de Mathématiques Jean Leray (UMR 6629 CNRS), Nantes, France, bertrand.michel@ec-nantes.fr

Approximations of Laplace-Beltrami operators on manifolds through graph Laplacians have become popular tools in data analysis and machine learning. These discretized operators usually depend on bandwidth parameters whose tuning remains a theoretical and practical problem. In this paper, we address this problem for the unnormalized graph Laplacian by establishing an oracle inequality that opens the door to a well-founded data-driven procedure for bandwidth selection. Our approach relies on recent results by Lacour and Massart [LM15] on the so-called Lepski's method.

1 Introduction

The Laplace-Beltrami operator is a fundamental and widely studied mathematical tool carrying a lot of intrinsic topological and geometric information about the Riemannian manifold on which it is defined. Its various discretizations, through graph Laplacians, have inspired many applications in data analysis and machine learning and led to popular tools such as Laplacian Eigenmaps [BN03] for dimensionality reduction, spectral clustering [VL07], and semi-supervised learning [BN04], just to name a few.

During the last fifteen years, many efforts, leading to a vast literature, have been made to understand the convergence of graph Laplacian operators built on top of (random) finite samples to Laplace-Beltrami operators. For example, pointwise convergence results have been obtained in [BN05] (see also [BN08]) and [HAL07], and a (uniform) functional central limit theorem has been established in [GK06]. Spectral convergence results have also been proved by [BN07] and [VLBB08]. More recently, [THJ11] analyzed the asymptotics of a large family of graph Laplacian operators by taking the diffusion process approach previously proposed in [NLCK06].

Graph Laplacians depend on scale or bandwidth parameters whose choice is often left to the user. Although many convergence results for various metrics have been established, little is known about how to rigorously and efficiently tune these parameters in practice. In this paper we address this problem in the case of the unnormalized graph Laplacian. More precisely, given a Riemannian manifold M of known dimension d and a function f : M → ℝ, we consider the standard unnormalized graph Laplacian operator defined by

$$\hat{\Delta}_h f(y) = \frac{1}{n h^{d+2}} \sum_{i=1}^{n} K\!\left(\frac{y - X_i}{h}\right) \big[ f(X_i) - f(y) \big], \quad y \in M,$$

where h is a bandwidth, X_1, ..., X_n is a finite point cloud sampled on M on which the values of f can be computed, and K is the Gaussian kernel: for y ∈ ℝ^m,

$$K(y) = \frac{1}{(4\pi)^{d/2}} \, e^{-\|y\|_m^2/4}, \qquad (1)$$

where $\|y\|_m$ is the Euclidean norm in the ambient space ℝ^m. In this case, previous results (see for instance [GK06]) typically say that the bandwidth parameter h in $\hat{\Delta}_h$ should be taken of the order of $n^{-\frac{1}{d+2+\alpha}}$ for some α > 0, but in practice, for a given point cloud, these asymptotic results are not sufficient to choose h efficiently.
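For concreteness, here is a minimal NumPy sketch (ours, not part of the paper) of the estimator $\hat\Delta_h f$ defined above; the sphere sample, the test function and the bandwidth used at the end are purely illustrative assumptions.

```python
import numpy as np

def gaussian_kernel_sq(u_sq, d):
    # K(u) = (4*pi)^(-d/2) * exp(-||u||^2 / 4), evaluated from the squared norm ||u||^2
    return np.exp(-u_sq / 4.0) / (4.0 * np.pi) ** (d / 2.0)

def graph_laplacian(X, f, y, h, d):
    """Unnormalized graph Laplacian estimate (hat Delta_h f)(y).

    X : (n, m) array of points sampled on M, embedded in the ambient space R^m
    f : callable, the smooth function M -> R, evaluated in ambient coordinates
    y : (m,) query point on M
    h : bandwidth
    d : intrinsic dimension of M
    """
    n = X.shape[0]
    sq_dists = np.sum((X - y) ** 2, axis=1)            # ||y - X_i||_m^2
    weights = gaussian_kernel_sq(sq_dists / h**2, d)   # K((y - X_i)/h)
    return np.sum(weights * (f(X) - f(y))) / (n * h ** (d + 2))

# Illustrative use: a uniform sample on the 2-sphere and f equal to the first coordinate.
rng = np.random.default_rng(0)
P = rng.normal(size=(5000, 3))
P /= np.linalg.norm(P, axis=1, keepdims=True)
coord = lambda x: x[..., 0]
print(graph_laplacian(P, coord, np.array([1.0, 0.0, 0.0]), h=0.3, d=2))
```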
In the context of neighbor graphs, [THJ11] proposes self-tuning graphs where h is chosen locally in terms of the distance to the k-th nearest neighbor; note however that k still needs to be chosen and, as far as we know, there is no guarantee that such a method is rate-optimal. More recently, a data-driven method for spectral clustering has been proposed in [Rie15]. Cross-validation [AC+10] is the standard approach for tuning parameters in statistics and machine learning. Nevertheless, the problem of choosing h in $\hat\Delta_h$ is not easy to rewrite as a cross-validation problem, in particular because there is no obvious contrast corresponding to the problem (see [AC+10]). The so-called Lepski's method is another popular approach for selecting the smoothing parameter of an estimator. It was introduced by Lepski [Lep92b, Lep93, Lep92a] for kernel estimators and local polynomials under various risks, and several improvements of the method have since been proposed, see [LMS97, GL09, GL+08].

In this paper we adapt Lepski's method for selecting h in the graph Laplacian estimator $\hat\Delta_h$. Our method is supported by mathematical guarantees: first we obtain an oracle inequality (Theorem 3.1), and second we recover the rate of convergence (Theorem 3.3) already established in the asymptotic studies of [BN05] and [GK06] for non data-driven choices of the bandwidth. Our approach follows the ideas recently proposed in [LM15], but for the specific problem of Laplacian operators on smooth manifolds. In this first work about the data-driven estimation of the Laplace-Beltrami operator, we focus, as in [BN05] and [GK06], on the pointwise estimation problem: we consider a smooth function f on M and the aim is to estimate $\Delta_P f$ in the $L_2$-norm $\|\cdot\|_{2,M}$ on M ⊂ ℝ^m. The data-driven method presented here may be adapted and generalized to other types of risks (uniform norms over functional families, convergence of the spectrum) and other types of graph Laplacian operators; this will be the subject of future work.

The paper is organized as follows: Lepski's method is introduced in Section 2. The main results are stated in Section 3 and a sketch of their proof is given in Section 4 (the complete proofs are given in the supplementary material). A numerical illustration and a discussion about the proposed method are given in Sections 5 and 6 respectively.

2 Lepski's procedure for estimating the Laplace-Beltrami operator

All the Riemannian manifolds considered in the paper are smooth compact d-dimensional submanifolds (without boundary) of ℝ^m endowed with the Riemannian metric induced by the Euclidean structure of ℝ^m. Recall that, given a compact d-dimensional smooth Riemannian manifold M with volume measure µ, its Laplace-Beltrami operator is the linear operator Δ defined on the space of smooth functions on M by Δ(f) = div(∇f), where ∇f is the gradient vector field and div the divergence operator. In other words, using Stokes' formula, Δ is the unique linear operator satisfying

$$\int_M \|\nabla f\|^2 \, d\mu = -\int_M f\, \Delta f \, d\mu.$$

Replacing the volume measure µ by a distribution P which is absolutely continuous with respect to µ, the weighted Laplace-Beltrami operator $\Delta_P$ is defined as

$$\Delta_P f = \Delta f + \frac{\langle \nabla p, \nabla f \rangle}{p}, \qquad (2)$$

where p is the density of P with respect to µ. The reader may refer to classical textbooks such as, e.g., [Ros97] or [Gri09] for a general and detailed introduction to Laplace operators on manifolds. In the following, we assume that we are given n points X_1, ..., X_n sampled on M according to the distribution P.
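As a small sanity check on definition (2), the following NumPy sketch (ours, not from the paper) verifies numerically on the unit circle $S^1$, where $\Delta g = g''$ in the arc-length coordinate, that $\Delta_P g = g'' + (p'/p)\, g'$ satisfies the integration-by-parts identity $\int \langle \nabla f, \nabla g\rangle \, dP = -\int f\, \Delta_P g \, dP$; the density and the test functions are arbitrary choices made for the example.

```python
import numpy as np

# Check the defining property of the weighted Laplace-Beltrami operator (2) on S^1,
# parametrized by arc length theta: Delta g = g'' and Delta_P g = g'' + (p'/p) g'
# should satisfy  int f' g' p dtheta = - int f (Delta_P g) p dtheta.
theta = np.linspace(0.0, 2.0 * np.pi, 4000, endpoint=False)
dt = theta[1] - theta[0]

p = 1.0 + 0.3 * np.cos(theta)                 # a positive (unnormalized) density on S^1
f = np.sin(2.0 * theta)
g = np.cos(theta) + 0.5 * np.sin(3.0 * theta)

def deriv(u):                                 # periodic central-difference derivative
    return (np.roll(u, -1) - np.roll(u, 1)) / (2.0 * dt)

lhs = np.sum(deriv(f) * deriv(g) * p) * dt              # int f' g' p dtheta
delta_P_g = deriv(deriv(g)) + deriv(p) / p * deriv(g)   # Delta_P g
rhs = -np.sum(f * delta_P_g * p) * dt                   # -int f (Delta_P g) p dtheta
print(lhs, rhs)                                         # agree up to discretization error
```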
Given a smooth function f on M, the aim is to estimate $\Delta_P f$ by selecting an estimator in a given finite family of graph Laplacians $(\hat\Delta_h f)_{h \in H}$, where H is a finite family of bandwidth parameters. Lepski's procedure is generally presented as a method for selecting a bandwidth in an adaptive way. More generally, this method can be seen as an estimator selection procedure.

2.1 Lepski's procedure

We first briefly explain the ideas behind Lepski's method. Consider a target quantity s, a collection of estimators $(\hat s_h)_{h \in H}$ and a loss function ℓ(·,·). A standard objective when selecting $\hat s_h$ is to minimize the risk $E[\ell(s, \hat s_h)]$ over the family of estimators. In most settings, the risk of an estimator can be decomposed into a bias part and a variance part. Of course, neither the risk, the bias nor the variance of an estimator is known in practice. However, in many cases the variance term can be controlled quite precisely. Lepski's method requires that the variance of each estimator $\hat s_h$ can be tightly upper bounded by a quantity v(h). In most cases, the bias can be written as $\ell(s, \bar s_h)$, where $\bar s_h$ corresponds to some (deterministic) averaged version of $\hat s_h$. It thus seems natural to estimate $\ell(s, \bar s_h)$ by $\ell(\hat s_{h'}, \hat s_h)$ for some h' smaller than h. The latter quantity incorporates some randomness while the bias does not. The idea is to remove the "random part" of the estimation by considering $[\ell(\hat s_{h'}, \hat s_h) - v(h) - v(h')]_+$, where $[\cdot]_+$ denotes the positive part. The bias term is estimated by considering all pairs of estimators $(\hat s_h, \hat s_{h'})$ through the quantity $\sup_{h' \le h} [\ell(\hat s_{h'}, \hat s_h) - v(h) - v(h')]_+$. Finally, the estimator minimizing the sum of the estimated bias and variance is selected, see eq. (3) below.

In our setting, the control of the variance of the graph Laplacian estimators $\hat\Delta_h$ is not tight enough to directly apply the method described above. To overcome this issue, we use a more flexible version of Lepski's method that involves multiplicative coefficients a and b introduced in the variance and bias terms. More precisely, let $V(h) = V_f(h)$ be an upper bound for $E[\|(E[\hat\Delta_h] - \hat\Delta_h) f\|_{2,M}^2]$. The bandwidth $\hat h$ selected by our Lepski's procedure is defined by

$$\hat h = \hat h_f = \arg\min_{h \in H} \big\{ B(h) + b\, V(h) \big\} \qquad (3)$$

where

$$B(h) = B_f(h) = \max_{h' \le h,\ h' \in H} \Big[ \|(\hat\Delta_{h'} - \hat\Delta_h) f\|_{2,M}^2 - a\, V(h') \Big]_+ \qquad (4)$$

with 0 < a ≤ b. The calibration of the constants a and b in practice is beyond the scope of this paper, but we suggest a heuristic procedure inspired from [LM15] in Section 5.

2.2 Variance of the graph Laplacian for smooth functions

In order to control the variance term, we consider in this paper the set F of smooth functions f : M → ℝ uniformly bounded up to the third order. For some constant $C_F > 0$, let

$$\mathcal{F} = \Big\{ f \in C^3(M, \mathbb{R}) \,:\, \|f^{(k)}\|_\infty \le C_F,\ k = 0, \dots, 3 \Big\}. \qquad (5)$$

We introduce some notation before giving the variance term for f ∈ F. Define

$$D_\alpha = \frac{1}{(4\pi)^{d}} \int \Big( \frac{C \|u\|_d^{\alpha+2}}{2} + C_1 \|u\|_d^{\alpha} \Big)\, e^{-\|u\|_d^2/4} \, du \qquad (6)$$

$$\tilde D_\alpha = \frac{1}{(4\pi)^{d/2}} \int \Big( \frac{C \|u\|_d^{\alpha+2}}{4} + C_1 \|u\|_d^{\alpha} \Big)\, e^{-\|u\|_d^2/8} \, du \qquad (7)$$

where C and $C_1$ are geometric constants that only depend on the metric structure of M (see Appendix). We also introduce the d-dimensional Gaussian kernel on ℝ^d,

$$K_d(u) = \frac{1}{(4\pi)^{d/2}}\, e^{-\|u\|_d^2/4}, \quad u \in \mathbb{R}^d,$$

and we denote by $\|\cdot\|_{p,d}$ the $L_p$-norm on ℝ^d. The next proposition provides an explicit bound V(h) on the variance term.

Proposition 2.1. Given h ∈ H, for any f ∈ F, we have

$$V(h) = \frac{C_F^2}{n h^{d+2}} \Big( 2\omega_d \|K_d\|_{2,d}^2 + \alpha_d(h) \Big),$$

where

$$\alpha_d(h) = h^2 \Big( 2 D_4 + \tilde D_4 + 3\omega_d \|K_d\|_{2,d}^2 \Big) + h^4\, \frac{D_6 + 3 \tilde D_6}{(4\pi)^{d/2}} \qquad (8)$$

and $\omega_d = 3 \cdot 2^{d/2 - 1}$.
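The selection rule (3)-(4) is straightforward to implement once the pairwise norms $\|(\hat\Delta_{h'} - \hat\Delta_h) f\|_{2,M}^2$ and the bound V(h) can be evaluated. The following schematic Python function is our own sketch of the rule, not the authors' code; in practice the norm is approximated on a validation sample (see Section 5), and the simplified variance bound at the end keeps only the leading behaviour of Proposition 2.1, with an assumed constant.

```python
import numpy as np

def lepski_select(bandwidths, sq_norm, V, a, b):
    """Bandwidth selection rule of eqs. (3)-(4).

    bandwidths : iterable of candidate bandwidths (the grid H)
    sq_norm    : callable (h_small, h_big) -> ||(hat Delta_{h'} - hat Delta_h) f||^2_{2,M}
    V          : callable h -> variance bound V(h) of Proposition 2.1
    a, b       : constants with 0 < a <= b
    """
    H = np.sort(np.asarray(list(bandwidths)))
    criterion = []
    for i, h in enumerate(H):
        # B(h): positive part of the worst pairwise comparison with smaller bandwidths
        B_h = max([0.0] + [sq_norm(hp, h) - a * V(hp) for hp in H[:i + 1]])
        criterion.append(B_h + b * V(h))
    return H[int(np.argmin(criterion))]

# A simplified variance bound keeping only the 1/(n h^(d+2)) behaviour of Proposition 2.1;
# the constant in front is an assumption (in the paper it depends on C_F, omega_d and K_d).
def V_bound(h, n, d, const=1.0):
    return const / (n * h ** (d + 2))
```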
3 Main results

We now give the main result of the paper: an oracle inequality for the estimator $\hat\Delta_{\hat h}$, in other words a bound on the risk showing that the performance of the selected estimator is almost as good as it would be if the risks of all the estimators were known. In particular it performs an (almost) optimal trade-off between the variance term V(h) and the approximation term

$$D(h) = D_f(h) = \max\Big\{ \|(\Delta_P - E[\hat\Delta_h]) f\|_{2,M},\ \sup_{h' \le h} \|(E[\hat\Delta_{h'}] - E[\hat\Delta_h]) f\|_{2,M} \Big\} \le 2 \sup_{h' \le h} \|(\Delta_P - E[\hat\Delta_{h'}]) f\|_{2,M}.$$

Theorem 3.1. With the notation introduced in the previous section, let $\epsilon = \sqrt{a}/2 - 1 > 0$ and, for $h' \le h$, define

$$\delta(h') = \max\Big( \exp\big(-\min\{\epsilon^2, \epsilon\}\, n\big),\ \exp\big(-c\,\epsilon^2\, \gamma_d(h')\, n\big) \Big),$$

where c > 0 is an absolute constant and

$$\gamma_d(h') = \frac{h'^{\,d}}{\|p\|_\infty}\, \frac{2\omega_d \|K_d\|_{2,d}^2 + \alpha_d(h')}{\big(2\omega_d \|K_d\|_{1,d} + \beta_d(h')\big)^2},$$

with $\alpha_d$ defined by (8) and

$$\beta_d(h) = 2h\,\omega_d \|K_d\|_{1,d} + h^2\big(2 D_3 + \tilde D_3\big) + h^3\big(D_4 + \tilde D_4\big). \qquad (9)$$

Given $f \in C^2(M, \mathbb{R})$, with probability at least $1 - 2\sum_{h' \in H} \delta(h')$,

$$\|(\Delta_P - \hat\Delta_{\hat h}) f\|_{2,M} \le \inf_{h \in H} \Big\{ 3 D(h) + (1 + \sqrt{2})\sqrt{b\, V(h)} \Big\}.$$

Broadly speaking, Theorem 3.1 says that there exists an event of large probability on which the estimator selected by Lepski's method is almost as good as the best estimator in the collection. Note that the size of the bandwidth family H has an impact on the probability term $1 - 2\sum_{h' \in H} \delta(h')$. If H is not too large, an oracle inequality for the risk of $\hat\Delta_{\hat h} f$ can easily be deduced from this result.

Henceforth we assume that f ∈ F. We first give a control on the approximation term D(h).

Proposition 3.2. Assume that the density p is $C^2$. It holds that

$$D(h) \le \gamma\, C_F\, h,$$

where $C_F$ is defined in eq. (5) and γ > 0 is a constant depending on p, on its first two derivatives, and on M.

We consider the following grid of bandwidths:

$$H = \big\{ e^{-k} \,:\, \log\log(n) \le k \le \log(n) \big\}.$$

The previous results lead to the pointwise rate of convergence of the graph Laplacian selected by Lepski's method:

Theorem 3.3. Assume that the density p is $C^2$. For any f ∈ F, we have

$$E\big[ \|(\Delta_P - \hat\Delta_{\hat h}) f\|_{2,M} \big] \lesssim n^{-\frac{1}{d+4}}. \qquad (10)$$

4 Sketch of the proof of Theorem 3.1

We observe that the following inequality holds:

$$\|(\Delta_P - \hat\Delta_{\hat h}) f\|_{2,M} \le D(h) + \|(E[\hat\Delta_h] - \hat\Delta_h) f\|_{2,M} + \sqrt{2\big(B(h) + b\,V(h)\big)}. \qquad (11)$$

Indeed, for h ∈ H,

$$\|(\Delta_P - \hat\Delta_{\hat h}) f\|_{2,M} \le \|(\Delta_P - E[\hat\Delta_h]) f\|_{2,M} + \|(E[\hat\Delta_h] - \hat\Delta_h) f\|_{2,M} + \|(\hat\Delta_h - \hat\Delta_{\hat h}) f\|_{2,M} \le D(h) + \|(E[\hat\Delta_h] - \hat\Delta_h) f\|_{2,M} + \|(\hat\Delta_h - \hat\Delta_{\hat h}) f\|_{2,M}.$$

By definition of B(h), for any $h' \le h$,

$$\|(\hat\Delta_{h'} - \hat\Delta_h) f\|_{2,M}^2 \le B(h) + a\,V(h'),$$

so that for any $h, h' \in H$, $\|(\hat\Delta_{h'} - \hat\Delta_h) f\|_{2,M}^2 \le B(\max\{h, h'\}) + a\,V(\min\{h, h'\})$. According to the definition of $\hat h$ in eq. (3) and recalling that a ≤ b,

$$\|(\hat\Delta_{\hat h} - \hat\Delta_h) f\|_{2,M}^2 \le \big[B(h) + b\,V(h)\big] + \big[B(\hat h) + b\,V(\hat h)\big] \le 2\big[B(h) + b\,V(h)\big],$$

which proves eq. (11). We now bound the terms appearing in eq. (11). The bound for D(h) is already given in Proposition 3.2, so in the following we focus on B(h) and $\|(E[\hat\Delta_h] - \hat\Delta_h) f\|_{2,M}$. More precisely, the bounds we present in the next two propositions are based on the following lemma from [LM15].

Lemma 4.1. Let $X_1, \dots, X_n$ be an i.i.d. sequence of variables. Let $\tilde S$ be a countable set of functions and let $\eta(s) = \frac{1}{n}\sum_i \big[ g_s(X_i) - E[g_s(X_i)] \big]$ for any $s \in \tilde S$. Assume that there exist constants θ and $v_g$ such that for any $s \in \tilde S$, $\|g_s\|_\infty \le \theta$ and $\mathrm{Var}[g_s(X)] \le v_g$. Denote $\bar H = E[\sup_{s \in \tilde S} \eta(s)]$. Then for any ε > 0 and any $H \ge \bar H$,

$$P\Big( \sup_{s \in \tilde S} \eta(s) \ge (1+\epsilon) H \Big) \le \max\Big( \exp\Big(-\frac{\epsilon^2 n H^2}{c_1 v_g}\Big),\ \exp\Big(-\frac{\min\{\epsilon, 1\}\,\epsilon\, n H}{c_2\, \theta}\Big) \Big)$$

for absolute constants $c_1, c_2 > 0$.

Proposition 4.2. Let $\epsilon = \sqrt{a}/2 - 1$. Given h ∈ H, define

$$\delta_1(h) = \sum_{h' \le h,\ h' \in H} \max\Big( \exp\big(-\min\{\epsilon^2, \epsilon\}\, n\big),\ \exp\big(-c\,\epsilon^2\, \gamma_d(h')\, n\big) \Big).$$

With probability at least $1 - \delta_1(h)$, $B(h) \le 2 D(h)^2$.

Proposition 4.3. Let $\tilde\epsilon = \sqrt{a} - 1$. Given h ∈ H, define

$$\delta_2(h) = \max\Big( \exp\big(-\min\{\tilde\epsilon^2, \tilde\epsilon\}\, n\big),\ \exp\big(-c\,\tilde\epsilon^2\, \gamma_d(h)\, n\big) \Big).$$

With probability at least $1 - \delta_2(h)$, $\|(E[\hat\Delta_h] - \hat\Delta_h) f\|_{2,M} \le \sqrt{a\,V(h)}$.

Combining the above propositions with eq. (11), we get that, for any h ∈ H, with probability at least $1 - (\delta_1(h) + \delta_2(h))$,

$$\|(\Delta_P - \hat\Delta_{\hat h}) f\|_{2,M} \le D(h) + \sqrt{a\,V(h)} + \sqrt{4 D(h)^2 + 2 b\,V(h)} \le 3 D(h) + (1 + \sqrt{2})\sqrt{b\,V(h)},$$

where we have used the fact that a ≤ b.
Taking a union bound over h ∈ H concludes the proof.

5 Numerical illustration

In this section we illustrate the results of the previous sections on a simple example. In Section 5.1, we describe a practical procedure for the case where the data set X is sampled according to the uniform measure on M. A numerical illustration is given in Section 5.2 when M is the unit 2-dimensional sphere in ℝ^3.

5.1 Practical application of Lepski's method

Lepski's method as presented in Section 2 cannot be directly applied in practice, for two reasons. First, we cannot compute the $L_2$-norm $\|\cdot\|_{2,M}$ on M, the manifold M being unknown. Second, the variance terms involved in Lepski's method are not completely explicit.

Regarding the first issue, we can approximate $\|\cdot\|_{2,M}$ by splitting the data into two samples: an estimation sample $X_1$ for computing the estimators and a validation sample $X_2$ for evaluating this norm. More precisely, given two estimators $\hat\Delta_h f$ and $\hat\Delta_{h'} f$ computed using $X_1$, the quantity $\|(\hat\Delta_h - \hat\Delta_{h'}) f\|_{2,M}^2 / \mu(M)$ is approximated by the averaged sum $\frac{1}{n_2} \sum_{x \in X_2} |\hat\Delta_h f(x) - \hat\Delta_{h'} f(x)|^2$, where $n_2$ is the number of points in $X_2$. We use these approximations to evaluate the bias terms B(h) defined by (4).

The second issue comes from the fact that the variance terms involved in Lepski's method depend on the metric properties of the manifold and on the sampling density, which are both unknown. These variance terms are thus only known up to a multiplicative constant. This situation contrasts with more standard frameworks for which a tight and explicit control on the variance terms can be proposed, as in [Lep92b, Lep93, Lep92a]. To address this second issue, we follow the calibration strategy recently proposed in [LM15] (see also [LMR16]). In practice we remove all the multiplicative constants from V(h): all these constants are passed into the terms a and b. This means that we rewrite Lepski's method as follows:

$$\hat h(a, b) = \arg\min_{h \in H} \Big\{ B(h) + b\, \frac{1}{n h^{d+2}} \Big\}, \qquad B(h) = \max_{h' \le h,\ h' \in H} \Big[ \|(\hat\Delta_{h'} - \hat\Delta_h) f\|_{2,M}^2 - a\, \frac{1}{n h'^{\,d+2}} \Big]_+ .$$

We choose a and b according to the following heuristic:

1. Take b = a and consider the sequence of selected models $\hat h(a, a)$.
2. Starting from large values of a, make a decrease and find the location $a_0$ of the main bandwidth jump in the step function $a \mapsto \hat h(a, a)$.
3. Select the model $\hat h(a_0, 2 a_0)$.

The justification of this calibration method is currently the subject of mathematical studies ([LM15]). Note that a similar strategy called "slope heuristic" has been proposed for calibrating $\ell_0$ penalties in various settings and is supported by strong mathematical results, see for instance [BM07, AM09, BMM12].

5.2 Illustration on the sphere

In this section we illustrate the complete method on a simple example with data points generated uniformly on the sphere $S^2$ in ℝ^3. In this case, the weighted Laplace-Beltrami operator is equal to the (non-weighted) Laplace-Beltrami operator on the sphere. We consider the function $f(x, y, z) = (x^2 + y^2 + z)\, \sin x\, \cos x$. The restriction of this function to the sphere has the following representation in spherical coordinates:

$$f(\theta, \varphi) = (\sin^2\varphi + \cos\varphi)\, \sin(\sin\varphi\cos\theta)\, \cos(\sin\varphi\cos\theta).$$

It is well known that the Laplace-Beltrami operator on the sphere satisfies (see Section 3 in [Gri09])

$$\Delta_{S^2} u = \frac{1}{\sin^2\varphi}\, \frac{\partial^2 u}{\partial \theta^2} + \frac{1}{\sin\varphi}\, \frac{\partial}{\partial \varphi}\!\Big( \sin\varphi\, \frac{\partial u}{\partial \varphi} \Big)$$

for any smooth polar function u. This allows us to derive an analytic expression of $\Delta_{S^2} f$. We sample $n_1 = 10^6$ points on the sphere for computing the graph Laplacians and we use $n_2 = 10^3$ points for approximating the norms $\|(\hat\Delta_h - \hat\Delta_{h'}) f\|_{2,M}^2$. We compute the graph Laplacians for bandwidths in a grid H between 0.001 and 0.8 (see Fig. 1).
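A compact end-to-end sketch of this experiment is given below. It is our own illustration rather than the authors' code: it uses much smaller sample sizes than the $n_1 = 10^6$ and $n_2 = 10^3$ of the paper for speed, a coarser bandwidth grid, and helper names of our choosing; the selection rule and the jump heuristic follow Section 5.1.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2                                                  # intrinsic dimension of S^2

def sample_sphere(n):
    x = rng.normal(size=(n, 3))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def f(x):                                              # f(x, y, z) = (x^2 + y^2 + z) sin(x) cos(x)
    return (x[:, 0]**2 + x[:, 1]**2 + x[:, 2]) * np.sin(x[:, 0]) * np.cos(x[:, 0])

def laplacian_on(X1, X2, h):
    """Evaluate (hat Delta_h f) on the validation points X2, using the estimation sample X1."""
    n1 = X1.shape[0]
    sq = np.sum(X2**2, 1)[:, None] + np.sum(X1**2, 1)[None, :] - 2.0 * X2 @ X1.T
    K = np.exp(-sq / (4.0 * h**2)) / (4.0 * np.pi)**(d / 2)       # K((y - X_i)/h)
    return (K @ f(X1) - K.sum(axis=1) * f(X2)) / (n1 * h**(d + 2))

X1, X2 = sample_sphere(20000), sample_sphere(500)      # estimation / validation samples
H = np.exp(np.linspace(np.log(0.05), np.log(0.8), 15)) # bandwidth grid
est = {h: laplacian_on(X1, X2, h) for h in H}

def V(h):                                              # variance term with constants removed
    return 1.0 / (X1.shape[0] * h**(d + 2))

def h_hat(a, b):
    crit = []
    for h in H:
        diffs = [np.mean((est[hp] - est[h])**2) - a * V(hp) for hp in H if hp <= h]
        crit.append(max([0.0] + diffs) + b * V(h))     # B(h) + b V(h)
    return H[int(np.argmin(crit))]

# Bandwidth-jump heuristic: sweep a, locate the main jump of a -> h_hat(a, a),
# then report the selected bandwidth h_hat(a0, 2 * a0).
for a in np.geomspace(0.1, 100.0, 20):
    print(f"a = {a:8.3f}   h_hat(a, a) = {h_hat(a, a):.3f}   h_hat(a, 2a) = {h_hat(a, 2 * a):.3f}")
```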
The risk of each graph Laplacian is estimated by a standard Monte Carlo procedure (see Fig. 2).

Figure 1: Choosing h is crucial for estimating $\Delta_{S^2} f$: a small bandwidth overfits $\Delta_{S^2} f$ whereas a large bandwidth leads to almost constant approximations of $\Delta_{S^2} f$.

Figure 2: Estimation of the risk of each graph Laplacian operator: the oracle Laplacian is obtained for approximately h = 0.15.

Figure 3 illustrates the calibration method. On this figure, the x-axis corresponds to the values of a and the y-axis represents the bandwidths. The blue step function represents the function $a \mapsto \hat h(a, a)$. The red step function gives the model selected by the rule $a \mapsto \hat h(a, 2a)$. Following the heuristic given in Section 5.1, one could take for this example the value $a_0 \approx 3.5$ (location of the bandwidth jump for the blue curve), which leads to selecting the model $\hat h(a_0, 2a_0) \approx 0.2$ (red curve).

Figure 3: Bandwidth jump heuristic: find the location of the jump (blue curve) and deduce the selected bandwidth from the red curve.

6 Discussion

This paper is a first attempt at a complete and well-founded data-driven method for inferring Laplace-Beltrami operators from data points. Our results suggest various extensions and raise some questions of interest. For instance, other versions of the graph Laplacian have been studied in the literature (see for instance [HAL07, BN08]), in particular when data is not sampled uniformly. It would be relevant to propose a bandwidth selection method for these alternative estimators as well. From a practical point of view, as explained in Section 5, there is a gap between the theory we obtain in the paper and what can be done in practice. To fill this gap, a first objective is to prove an oracle inequality in the spirit of Theorem 3.1 for a bias term defined in terms of the empirical norms computed in practice. A second objective is to propose mathematically well-founded heuristics for the calibration of the parameters a and b. Tuning bandwidths for the estimation of the spectrum of the Laplace-Beltrami operator is a difficult but important problem in data analysis. We are currently working on the adaptation of our results to the case of operator norms and spectrum estimation.

Appendix: the geometric constants C and C1

The following classical lemma (see, e.g., [GK06, Prop. 2.2 and Eq. 3.20]) relates the constants C and $C_1$ introduced in Equations (6) and (7) to the geometric structure of M.

Lemma 6.1. There exist constants $C, C_1 > 0$ and a positive real number r > 0 such that for any x ∈ M and any $v \in T_x M$ with $\|v\| \le r$,

$$\big|\det(g_{ij})(v) - 1\big| \le C_1 \|v\|_d^2 \quad\text{and}\quad \frac{1}{2}\|v\|_d^2 \le \|v\|_d^2 - C\|v\|_d^4 \le \|E_x(v) - x\|_m^2 \le \|v\|_d^2, \qquad (12)$$

where $E_x : T_x M \to M$ is the exponential map and $(g_{ij})_{i,j \in \{1,\dots,d\}}$ are the components of the metric tensor in any normal coordinate system around x. Although the proof of the lemma is beyond the scope of this paper, notice that one can indeed give explicit bounds on r and C in terms of the reach and injectivity radius of the submanifold M.

Acknowledgments

The authors are grateful to Pascal Massart for helpful discussions on Lepski's method. This work was supported by the ANR project TopData ANR-13-BS01-0008 and ERC Gudhi No. 339025.

References

[AC+10] Sylvain Arlot, Alain Celisse, et al. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40-79, 2010.
[AM09] Sylvain Arlot and Pascal Massart. Data-driven calibration of penalties for least-squares regression. The Journal of Machine Learning Research, 10:245-279, 2009.
[BM07] Lucien Birgé and Pascal Massart.
Minimal penalties for Gaussian model selection. Probability Theory and Related Fields, 138(1-2):33-73, 2007.
[BMM12] Jean-Patrick Baudry, Cathy Maugis, and Bertrand Michel. Slope heuristics: overview and implementation. Statistics and Computing, 22(2):455-470, 2012.
[BN03] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396, 2003.
[BN04] Mikhail Belkin and Partha Niyogi. Semi-supervised learning on Riemannian manifolds. Machine Learning, 56(1-3):209-239, 2004.
[BN05] Mikhail Belkin and Partha Niyogi. Towards a theoretical foundation for Laplacian-based manifold methods. In Learning Theory, pages 486-500. Springer, 2005.
[BN07] Mikhail Belkin and Partha Niyogi. Convergence of Laplacian eigenmaps. Advances in Neural Information Processing Systems, 19:129, 2007.
[BN08] Mikhail Belkin and Partha Niyogi. Towards a theoretical foundation for Laplacian-based manifold methods. Journal of Computer and System Sciences, 74(8):1289-1308, 2008.
[GK06] E. Giné and V. Koltchinskii. Empirical graph Laplacian approximation of Laplace-Beltrami operators: large sample results. In High Dimensional Probability, pages 238-259. IMS, 2006.
[GL+08] Alexander Goldenshluger, Oleg Lepski, et al. Universal pointwise selection rule in multivariate function estimation. Bernoulli, 14(4):1150-1190, 2008.
[GL09] A. Goldenshluger and O. Lepski. Structural adaptation via Lp-norm oracle inequalities. Probability Theory and Related Fields, 143(1-2):41-71, 2009.
[Gri09] Alexander Grigor'yan. Heat Kernel and Analysis on Manifolds, volume 47. American Mathematical Society, 2009.
[HAL07] M. Hein, J.-Y. Audibert, and U. von Luxburg. Graph Laplacians and their convergence on random neighborhood graphs. Journal of Machine Learning Research, 8(6), 2007.
[Lep92a] Oleg V. Lepski. On problems of adaptive estimation in white Gaussian noise. Topics in Nonparametric Estimation, 12:87-106, 1992.
[Lep92b] O. V. Lepskii. Asymptotically minimax adaptive estimation. I: Upper bounds. Optimally adaptive estimates. Theory of Probability & Its Applications, 36(4):682-697, 1992.
[Lep93] O. V. Lepskii. Asymptotically minimax adaptive estimation. II: Schemes without optimal adaptation. Adaptive estimators. Theory of Probability & Its Applications, 37(3):433-448, 1993.
[LM15] Claire Lacour and Pascal Massart. Minimal penalty for Goldenshluger-Lepski method. arXiv preprint arXiv:1503.00946, 2015.
[LMR16] Claire Lacour, Pascal Massart, and Vincent Rivoirard. Estimator selection: a new method with applications to kernel density estimation. arXiv preprint arXiv:1607.05091, 2016.
[LMS97] Oleg V. Lepski, Enno Mammen, and Vladimir G. Spokoiny. Optimal spatial adaptation to inhomogeneous smoothness: an approach based on kernel estimates with variable bandwidth selectors. The Annals of Statistics, pages 929-947, 1997.
[NLCK06] B. Nadler, S. Lafon, R. R. Coifman, and I. G. Kevrekidis. Diffusion maps, spectral clustering and reaction coordinates of dynamical systems. Applied and Computational Harmonic Analysis, 21(1):113-127, 2006.
[Rie15] Antonio Rieser. A topological approach to spectral clustering. arXiv preprint arXiv:1506.02633, 2015.
[Ros97] Steven Rosenberg. The Laplacian on a Riemannian Manifold: An Introduction to Analysis on Manifolds. Number 31. Cambridge University Press, 1997.
[THJ11] Daniel Ting, Ling Huang, and Michael Jordan. An analysis of the convergence of graph Laplacians. arXiv preprint arXiv:1101.5435, 2011.
[VL07] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, 2007.
[VLBB08] Ulrike von Luxburg, Mikhail Belkin, and Olivier Bousquet. Consistency of spectral clustering. The Annals of Statistics, pages 555-586, 2008.