# geometric_structure_of_graph_laplacian_embeddings__04d7eb6d.pdf Journal of Machine Learning Research 22 (2021) 1-55 Submitted 8/19; Revised 11/20; Published 3/21 Geometric structure of graph Laplacian embeddings Nicol as Garc ıa Trillos nicolasgarcia@stat.wisc.edu Department of Statistics University of Wisconsin-Madison Madison, WI 53706, USA Franca Hoffmann fkoh@caltech.edu Bamdad Hosseini bamdadh@caltech.edu Computing and Mathematical Sciences California Institute of Technology Pasadena, CA 91125, USA Editor: Ingo Steinwart We analyze the spectral clustering procedure for identifying coarse structure in a data set x1, . . . , xn, and in particular study the geometry of graph Laplacian embeddings which form the basis for spectral clustering algorithms. More precisely, we assume that the data are sampled from a mixture model supported on a manifold M embedded in Rd, and pick a connectivity length-scale ε > 0 to construct a kernelized graph Laplacian. We introduce a notion of a well-separated mixture model which only depends on the model itself, and prove that when the model is well separated, with high probability the embedded data set concentrates on cones that are centered around orthogonal vectors. Our results are meaningful in the regime where ε = ε(n) is allowed to decay to zero at a slow enough rate as the number of data points grows. This rate depends on the intrinsic dimension of the manifold on which the data is supported. Keywords: Unsupervised learning, spectral clustering, graph Laplacian, mixture models, continuum limit 1. Introduction The goal of this article is to analyze the spectral clustering procedure for identifying coarse structure in a data set by means of appropriate spectral embeddings. Specifically, we combine ideas from spectral geometry, metastability, optimal transport, and spectral analysis of elliptic operators to describe the geometry of the embeddings that form the basis of spectral clustering in the case where the data generating model is sufficiently well-separated . Throughout this article we assume the data set Mn = {x1, . . . , xn} consists of points that are distributed according to an underlying probability measure ν supported on a mdimensional manifold M embedded in some Euclidean space Rd. The first step in spectral clustering is to construct a similarity graph G := {Mn, Wn} with vertices Mn and edge weight matrix Wn whose entries indicate how similar two points in Mn are; typical constructions are proximity graphs, where a length scale ε > 0 is chosen and high weights are given to pairs of points that are within distance ε from each other. Once the graph has been constructed a graph Laplacian operator is introduced, and its first few eigenvectors are used c 2021 Nicol as Garc ıa Trillos, Franca Hoffmann, Bamdad Hosseini. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v22/19-683.html. Garc ıa Trillos, Hoffmann, Hosseini to define an embedding that maps the data points Mn into a low-dimensional Euclidean space. More precisely, the low-lying spectrum of the graph Laplacian induces an embedding Fn : {x1, . . . , xn} RN mapping a data point xi into a vector Fn(xi) in RN whose entries are the first N eigenfunctions of the graph Laplacian evaluated at xi. Here we refer to the map Fn as the Laplacian embedding. The embedding step is the crucial step in spectral clustering, which afterwards will typically proceed by running an algorithm like k-means so as to recover the clustering structure of the data set (see von Luxburg (2007)). While k-means is a standard choice in the literature, our analysis will be agnostic to the actual algorithm used to cluster the embedded data set, and we will only focus on the geometry of the embedded data set. We refer the reader to Section 1.2 and in particular to Ling and Strohmer (2019) for alternatives to k-means. How can one motivate the use of the previous spectral clustering algorithm? Consider the simple scenario where the manifold M has N connected components. For a data set consisting of uniform samples from such M one can build the proximity graph G (at least for n large enough) in such a way that its associated graph Laplacian has N connected components and whose corresponding first N eigenvectors coincide with rescaled versions of indicator functions of the N connected components of M restricted to Mn. The resulting Laplacian embedding Fn is seen to map the original data set into a set of N orthogonal vectors in RN. The simple geometric structure of the embedded data set allows one to readily identify the true coarse structure of the original data set. The key question that we attempt to answer in this work is the following: what is the geometric structure of graph Laplacian embeddings in more general settings where M is connected, but only weakly? We attempt to answer this question by: 1. Relaxing the notion of N connected components of M, and replacing it with a notion of well-separatedness for the mixture model ν as defined in terms of three quantities described in Section 2.2 . We highlight that our notion of well-separatedness only depends on the mixture model ν and not on extrinsic objects like a user chosen kernel in the construction of the graph G. 2. Describing the geometry of the Laplacian embedding in terms of what we will call an orthogonal cone structure, a generalization of the orthogonality condition of the embedded data set observed when M has N connected components. We consider this notion for arbitrary probability measures and not just for point clouds, thus extending the notion introduced in Schiebinger et al. (2015). In order to study the geometry of the embedded data set, we split our analysis into two parts. First, we study the geometry of a continuum analogue Laplacian embedding by using techniques from spectral theory of elliptic PDEs. The continuum Laplacian embedding can be thought as an ideal object that is recovered (spectrally) from the graph Laplacian in the limit n , ε = ε(n) 0 (at a slow enough rate). In the second part, we quantify the difference between the two embeddings, the graph based and its continuum analogue. To achieve this we use recent results quantifying the approximation error of the spectra of continuum Laplacian operators with the spectra of appropriate graph Laplacians. It is only in the second step that we study statistical properties of random objects. The first part of the paper is, to some extent, of independent interest in connection to theory of diffusion processes and PDE analysis. The structure of our analysis illustrates a more general Geometric structure of graph Laplacian embeddings approach for the analysis of learning methods that rely on the construction of proximity graphs. Namely, first study the ideal continuum problem and then use approximating results to show that the graph problem inherits properties from its continuum counterpart. Overall, our ideas, definitions, and results provide new theoretical insights on spectral clustering. They further justify the use of spectral clustering beyond the trivial setting where a manifold is disconnected. In particular, we show that when the ground-truth measure ν has well-separated components, the geometry of the embedded data set Fn(Mn) is simple enough so that an algorithm like k-means can successfully identify the coarse structure of the data set. 1.1 Overview of main results Let us make our setup more precise. We assume that M is a connected manifold and that ν is a mixture model with N components supported on M (see Section 2 for detailed assumptions on this model). We work in a non-parametric setting and assume that the components of ν have sufficiently regular densities with respect to the volume form of M. We will primarily work with graph Laplacian operators corresponding to kernelized Laplacians (see (4) for the definition and Section 4.2.2 for some reasons behind this choice where in particular we observe that the kernelized Laplacian already covers many other cases of interest). Before describing the geometry of the embedded data set Fn(Mn) we study a continuum limit analogue of spectral clustering which we refer to as measure clustering. This can be thought as an ideal clustering problem where an infinite amount of data is available. Studying this continuum analogue is motivated by recent results concerning the convergence of graph Laplacian operators to differential operators in the large data limit as n, the number of vertices, goes to infinity and ε, the connectivity length scale specifying G, goes to zero sufficiently slowly. The low-lying spectrum of the limiting differential operator induces an embedding F : M RN in a manner similar to Fn, but now at the continuum level. We characterize the geometry of the measure F ν (i.e. the push-forward of ν through the continuum limit Laplacian embedding F) which forms the basis of the measure clustering procedure. In particular, we identify conditions on the ground-truth model that ensure that the measure F ν has an orthogonal cone structure. This is the content of our first key result Theorem 10 which is informally stated below. Theorem 1 (Informal) If the mixture ν has well-separated components (see Definition 4), then F ν, the push forward of ν through the Laplacian embedding F, has an orthogonal cone structure, i.e., most of its mass is concentrated within disjoint cones centered at orthogonal vectors (see Definition 8). Let us now turn our attention to the geometry of the embedded point cloud Fn(Mn). As mentioned above, when constructing the graph G from the data Mn, one often scales ε with n in such a way that when n , ε 0 (i.e. the weight matrix Wn localizes) at a slow enough rate. Analyzing this regime is motivated by the desire to balance between graph connectivity and computational efficiency. Recent results in the literature provide explicit rates of convergence for the spectra of graph Laplacians towards the spectrum of an elliptic differential operator (see (6) and Section A.3 below) in such regimes. Moreover, the convergence of the eigenvectors is with respect to a strong enough topology, implying that the empirical measure of Fn(Mn) and the measure F ν get closer in the Wasserstein Garc ıa Trillos, Hoffmann, Hosseini sense as n grows. In turn, this convergence guarantees that if F ν has an orthogonal cone structure, then Fn(Mn) will have one as well (i.e., orthogonal cone structures are stable under Wasserstein perturbations). In Theorem 15 we make the previous discussion mathematically precise. For the moment we provide an informal statement of that result. Theorem 2 (Informal) Under suitable conditions, if F ν has an orthogonal cone structure then, with very high-probability, the empirical measure of the embedded cloud Fn(Mn) also has an orthogonal cone structure. 1.2 Related work In this paper we focus on clustering. More specifically, clustering by means of appropriate embeddings using the spectrum of a graph Laplacian operator. In the geometric graph setting (our setting) von Luxburg et al. (2008) were the first to rigorously establish a statistical consistency result for graph-based spectral embeddings when the number of data points n goes to infinity and the connectivity ε > 0 remains constant. Up until then most papers focused on proving pointwise convergence estimates (see Belkin and Niyogi (2008); Hein et al. (2005) and references within). The work of von Luxburg et al. (2008) however, did not address the question of determining scalings for ε := ε(n) 0 that implied spectral convergence towards differential operators in the continuum limit. Moreover, it was still a pending task to analyze the entire spectral clustering algorithm, and in particular study the geometry of the resulting spectral embedding. Garc ıa Trillos and Slepˇcev (2015) introduced a framework using notions from optimal transport and the calculus of variations to study large data limits of optimization problems defined on random geometric graphs. Many large data limits of relevant functionals for machine learning have been studied using similar approaches: Slepcˇev and Thorpe (2019); Dunlop et al. (2019); Garc ıa Trillos and Murray (2017); Garc ıa Trillos and Sanz Alonso (2018). In particular, the optimal transport framework was used by Garc ıa Trillos and Slepˇcev (2018) to analyze the entire spectral clustering procedure (embedding step and k-means step) and the authors were able to show statistical consistency for scalings log(n)pm/n1/m ε 1 where m is the intrinsic dimension of the manifold M and pm > 0 is an appropriate constant depending on m. Nevertheless, the spectral convergence was asymptotic and no rate was provided until Tao and Shi (2020). Burago et al. (2014) took a somewhat similar approach to Garc ıa Trillos and Slepˇcev (2018), and in a non-probabilistic setting, conducted a careful analysis which ultimately allowed to provide very precise rates of convergence for the spectrum of graph Laplacians. Garc ıa Trillos et al. (2020) extended this analysis to cover the relevant probabilistic setting where point clouds are samples from a ground truth measure supported on a manifold. The rates obtained by Garc ıa Trillos et al. (2020) were recently improved in Calder and Garc ıa Trillos (2019): these are to the best of our knowledge the best rates of spectral convergence available. The work of Calder and Garc ıa Trillos (2019) relies on a priori convergence rates (such as the ones in Garc ıa Trillos et al. (2020)) which are then cleverly improved by relying on pointwise consistency of graph Laplacians for smooth enough functions. In Appendix A.3 we revisit some of the results and ideas introduced by Garc ıa Trillos et al. (2020); as a matter of fact, the probabilistic estimates that we use in this paper are proved in a similar fashion to the work of Garc ıa Trillos et al. (2020) after making some slight modifications that are discussed in the appendix. Geometric structure of graph Laplacian embeddings In principle, we could pursue the improvement of these rates following the framework presented by Calder and Garc ıa Trillos (2019), but we decided not to do so in order to avoid making the already lengthy exposition much longer and tedious: this is technical work that in our opinion is out of the scope of this paper. All of the publications listed above focus on spectral convergence of graph Laplacians, but do not study the geometry of Laplacian embeddings themselves, i.e., what is the geometry of the embedded data set, and how such geometry may affect the subsequent clustering step. Schiebinger et al. (2015) formulates this problem in a precise mathematical setting, by introducing a notion of a well-separated data generating mixture model. Roughly speaking, the authors obtain a result that, in spirit, reads similarly to our Theorem 2. However, the notion of a well-separated model introduced in that work depends on extrinsic quantities such as the kernel used to construct the graph at the finite data level, and in particular considers the regime where n is thought to be large, but the length-scale ε remains constant. In other words, the average number of neighbors per point is comparable to the total number of points. The work of Schiebinger et al. (2015) has inspired and motivated our work. Different scenarios where one can define a notion of well-separated components were studied by Little et al. (2020) and Ling and Strohmer (2019). Both of these papers are closely related to our work, and we believe are complementary of each other. In both works, the respective notion of well-separated components is used to help validate proposed methodologies for clustering. Little et al. (2020) focus on understanding the stability of spectral embeddings in situations where the original data set has a geometric structure that is highly anisotropic. The authors propose the use of a general family of distances (other than the Euclidean one) to construct graphs which ultimately induce corresponding spectral embeddings. The model considered in that paper is more geometric in nature than the one we consider here. Ling and Strohmer (2019) introduce a notion of well-separated components at the finite data level and use it to validate a new algorithm for clustering which is based on a convex relaxation for a multi-way cut problem and does not require a k-means step, making it quite appealing. We believe further exploration of the connections between the various mathematical ideas and models presented by Little et al. (2020) and Ling and Strohmer (2019) with the ones we present here is a worthwhile avenue of future research. Finally, it is worth reiterating that this work assumes a random geometric graph model for the data, i.e. randomness comes from the fact that nodes in the graph are assumed to be random draws from some underlying distribution supported on a manifold, and edges are directly determined from the distance relationship between these nodes. This type of model should be contrasted to other random graph models such as Erd os-R enyi type graphs or planted partition models. Several authors have investigated spectral clustering in those settings (e.g., see Rohe et al. (2011); Chaudhuri et al. (2012) and references within). In a way, an analysis of spectral clustering for any random graph model relies on an application, whether direct or indirect, of a Davis-Kahan type argument (see Davis and Kahan (1970)); the only distinction being the operator norms that are feasible in each setting and the use of some specific tools to build a bridge between the sample and population settings. In our case for example, the theory developed by Garc ıa Trillos et al. (2020) allows us to provide a discrete-to-continuum link, and the Davis-Kahan like argument comes from analyzing a suitable elliptic PDE operator. Garc ıa Trillos, Hoffmann, Hosseini 1.3 Outline The rest of the paper is divided as follows. Section 2 covers the preliminaries required for our theory and introduces the precise set-up of the paper. In particular, in Section 2.2 we introduce our notion of well-separated mixture models. We present our main results in Section 3. Theorem 10 on the geometry of continuum Laplacian embeddings (i.e. the measure clustering embedding) and Theorem 15 describing the geometry of graph Laplacian embeddings, are presented in Subsections 3.1 and 3.2 respectively. The proofs of these results are postponed to the appendix. Section 4 is devoted to discussing our notion of well-separated mixture models. We present some illustrative examples, discuss the implications of our results, and finally comment on possible extensions and generalizations. Detailed proofs of our main results are presented in Appendix A. In particular, Appendix A.1 focuses on stability properties of orthogonal cone structures, allowing us to prove Theorem 10 in Appendix A.2 after showing several preliminary results. In Appendix A.3 we make the connection between the eigenspaces generated by the first N eigenvectors of the graph Laplacian and those of the continuum Laplacian. We conclude with Appendix A.4, where we present the proof of Theorem 15. Appendix B contains further technical lemmata and theorems that are used in the proofs of Appendix A and the examples in Subsection 4.1. 2. Preliminaries Here we collect some definitions and notation that are used throughout the article. Discrete and continuum limit graph Laplacian operators are introduced in Subsection 2.1; the notion of a well-separated mixture model is defined in Subsection 2.2. Let M be a smooth, connected, orientable, m-dimensional manifold embedded in Rd. For the moment we think of M as either a compact manifold without boundary or a subset of Rd. Let ρ1, . . . , ρN be a family of C1(M) probability density functions on M (densities with respect to the volume form of M) and let w1, . . . , w N be positive weights adding to one. The densities ρ1, . . . , ρN are weighted to produce the mixture model k=1 wkρk(x), x M. (1) We assume that ρ(x) > 0 for all x M and let ν be the probability measure on M induced by ρ, that is, dν(x) = ρ(x)dx, where in the above and in the remainder dx denotes integration with respect to M s volume form and P(M) denotes the space of complete Borel probability measures on M. Given a density ϱ on M we define the weighted function spaces L2(M, ϱ) := n u : M 7 R u, u ϱ < + o , equipped with the weighted inner product u, v ϱ := Z M u(x)v(x)ϱ(x)dx. Geometric structure of graph Laplacian embeddings Throughout the article we routinely write L2(ϱ) instead of L2(M, ϱ) when the domain of the function space is clear from the context. In Appendix A.3 we modify our notation slightly and use L2(µ) to denote L2(M, ϱ) where µ = ϱdx and use , L2(µ) to denote , ρ. We consider a point cloud Mn = {x1, . . . , xn} that are i.i.d. samples from ν, denote by νn the empirical measure associated to the samples x1, . . . , xn, i=1 δxi. (2) Let L2(νn) be the space of functions un : Mn R, and we identify un L2(νn) with a column vector (un(x1), . . . , un(xn))T in Rn. 2.1 Graph Laplacian operators We associate a weighted graph structure to Mn as follows. Let η : [0, ) [0, ) be a non-increasing Lipschitz function with compact support and let 0 η(r)rm+1dr, (3) where we recall m d is the intrinsic dimension of the manifold M (typically we think of m d for applications, but this is unimportant for the purposes of the results discussed here). We assume that the kernel has been normalized so that Z Rm η(|x|)dx = 1. For a given ε > 0 we define the weight matrix Wn with entries (Wn)ij, (Wn)ij := 2ηε(|xi xj|) αηε2(dε(xi))1/2 (dε(xj))1/2 , (4) measuring how similar points xi and xj are. Here, | | denotes the Euclidean distance in the ambient space Rd, ηε is defined as j=1 ηε(|y xj|) y M . Finally, we introduce the graph Laplacian operator n : L2(νn) L2(νn), n := Dn Wn, (5) where Dn is the degree matrix associated to the weight matrix Wn, that is, Dn is a diagonal matrix whose diagonal entries (Dn)ii are given by (Dn)ii = Pn j=1(Wn)ij. The dependence of n on ε has been suppressed for notational convenience. Garc ıa Trillos, Hoffmann, Hosseini The core idea in our analysis is the convergence of the spectrum of n to that of the differential operator ρ div(ρ u) = u 1 which is defined for smooth functions u : M 7 R. In the above, stands for the gradient operator on M, div is the divergence on M, and is the Laplace-Beltrami operator on M. The term ρ u evaluated at a point x M is the inner product of the vectors ρ(x), u(x) in the tangent space Tx M; seen as vectors in Rd, ρ(x) u(x) is just their inner product in Rd. We establish the convergence of n to ρ in Appendix A.3 when the number of samples n while the local connectivity parameter ε 0 at an appropriate rate. We notice that the weights in (4) have been appropriately rescaled so that the eigenvalues of the graph Laplacian n defined below approximate those of ρ. We also point out that (Wn)ij can be written as (Wn)ij = 2ηε(|xi xj|) αηnε2(dε(xi)/n)1/2 (dε(xj)/n)1/2 , and that dε( )/n is nothing but a kernel density estimator for the density ρ due to the fact that η is normalized. Remark 3 The normalization terms appearing in (4) induce the kernelized graph Laplacian n which in turn converges to ρ in the continuum limit. There are other possible normalizations that one can consider such as the ones in the parametric family introduced in Coifman and Lafon (2006). The effect of the different normalizations can be observed asymptotically (as n ) in the way the density ρ affects the differential operator ρ (which in general will differ from (6)). We expand our discussion on the choice of different normalizations in Section 4.2.2; for an overview of a more general family of normalizations see Section 5 of Hoffmann et al. (2019). In the context of this paper, it is enough to say that n and ρ (as defined in (6)) are closely related to each other following Theorem 30. Similarly to (6) we also define the differential operators ρk in association with the densities ρk, which for smooth functions u, are defined by ρk div(ρk u) = u 1 ρk ρk u. (7) 2.2 Well-separated mixture models Let us now introduce the notion of a well-separated mixture model. This is a notion defined in terms of three quantities associated to the model ν that we refer to as overlapping, coupling, and indivisibility parameters. For k = 1, . . . , N, define the functions qk := rwkρk Geometric structure of graph Laplacian embeddings The overlapping parameter of the mixture model ν is the quantity S := max i =j ρ = max i =j Note that S 0, 1 minj wj since for any k {1, ..., N} we have the bound (min j wj)ρk j=1 wjρj = ρ , ρ 1 minj wj . The coupling parameter of the mixture model is defined as C := max k=1,...,N Ck, (10) 2 ρkdx, k = 1, . . . , N. (11) This quantity is the Fisher information of the measure ρkdx with respect to the measure ρdx. We use the convention that if the density ρk(x) vanishes at some point x M, then we interpret the integrand in (11) as zero at that point; in other words we only integrate over the set {x M : ρk(x) > 0}. Finally, the indivisibility parameter for the model is defined as Θ := min k=1,...,N Θk, (12) Θk := min u 1 u, u ρk . (13) In (13), 1 stands for the function that is identically equal to one, and indicates orthogonality with respect to the inner product , ρk. This may as well be defined as the first non-trivial eigenvalue of the operator ρk. In Subsection 4.1 we discuss different interpretations of S, C, Θ and present some examples that illustrate the significance of these parameters. Intuitively, the overlapping and coupling parameters S and C are small whenever the mixture components ρk are individual bumps that are mostly concentrated on disjoint sets (see Figure 1) while the indivisibility parameter Θ is large when the components ρk cannot be further divided into smaller components. We are now ready to introduce the notion of a well-separated mixture model. Definition 4 (Informal) The mixture model (1) is well-separated if S 1, C Θ 1. These conditions are to be read as: the overlapping S is small enough, and the coupling parameter C is small in comparison to the indivisibility parameter Θ. Garc ıa Trillos, Hoffmann, Hosseini We highlight that the above definition is only qualitative in accordance with the informal version of our first main result (Theorem 1). However, this definition can be made rigorous in the relative sense. For example, given two mixture models ν and ν , the model for which S and C/Θ are smaller is the better separated model and in turn the Laplacian embedding F can cluster that model more effectively. This is a direct consequence of Theorem 10 that gives precise quantitative bounds on how tightly the measure F ν concentrates around orthogonal vectors depending on the quantities S and C/Θ. Remark 5 We emphasize that both of the conditions in Definition 4 are crucial for the mixture model to be well-separated and to exclude pathologies. Having S 1 ensures that the components ρk are not repeated while C Θ ensures that components that should have otherwise been merged or components that should have otherwise been split into smaller components are excluded. See also Examples 1 and 2 and Remark 16. Remark 6 We will use the functions q1, . . . , q N frequently in the proofs of Appendix A so it is worth mentioning a simple interpretation for them. Suppose that we wanted to obtain one sample from the mixture model introduced in (1): x ρ. One way to obtain x is to first choose a random number k {1, . . . , N} where P(k = k) = wk and then sample x from ρk. Conversely given the value x = x, we can check that: P(k = k|x = x) = wkρk(x) Thus, qk(x) is simply the square root of the likelihood of x being sampled from ρk. 3. Geometric structure of Laplacian embeddings In this section we discuss detailed versions of our main results. In Subsection 3.1 we present Theorem 10 as the rigorous version of Theorem 1. This theorem gives quantitative bounds on the concentration properties of the pushforward measure F ν around orthogonal cones in low dimensions as a function of S, C, and Θ. The map F (defined in (14)) is the continuum analogue of the spectral embedding commonly used in spectral clustering. Subsection 3.2 contains Theorem 15 as the quantitative analogue of Theorem 2 which extends Theorem 10 from the continuum to the discrete setting. More precisely, this theorem shows that the embedded point cloud Fn(Mn) inherits the concentration properties of F ν with very high probability provided that (n, ε) are within an appropriate regime. Here Fn is the usual discrete Laplacian embedding defined in (18). In summary, Theorems 10 and 15 together state that whenever the mixture ν has N well-separated components and (n, ε) are in the appropriate regime, the point cloud Fn(Mn) concentrates within N disjoint orthogonal cones with very high probability. Further, the better ν is separated, the smaller the angle of these cones, and the more mass lies within the cones. This means that in classification tasks, the embedded cloud Fn(Mn) can be classified easily using N 1 dimensional hypersurfaces in RN that separate the orthogonal cones. Geometric structure of graph Laplacian embeddings 3.1 The continuum setting In what follows we describe the geometry of the ground-truth spectral embedding (i.e. the measure clustering embedding in the continuum). First, let us collect assumptions on the mixture model ν. Assumption 7 The components ρ1, . . . , ρN of the mixture ν belong to C1(M) and are such that the operators ρk and ρ have discrete point spectrum with associated orthonormal eigenfunctions spanning L2(ρk) and L2(ρ) respectively. Now let u1, . . . , u N be the first N orthonormal (with respect to , ρ) eigenfunctions of ρ corresponding to its N smallest eigenvalues. Define the Laplacian embedding u1(x) ... u N(x) The push-forward of the measure ν by F, denoted µ := F ν, is a measure on RN which describes the distribution of points originally in M after transformation by the Laplacian map F. Our first main result asserts that if the mixture model (1) is well-separated, then the measure µ has an orthogonal cone structure; this is a geometric notion originally introduced for point clouds in Schiebinger et al. (2015) and extended here for arbitrary probability distributions. Definition 8 Let σ (0, π/4), δ [0, 1), and r > 0. We say that the probability measure µ P(Rk) has an orthogonal cone structure with parameters (σ, δ, r) if there exists an orthonormal basis for Rk, e1, . . . , ek, such that j=1 C(ej, σ, r) where C(ej, σ, r) is the set C(ej, σ, r) := z Rk : z ej |z| > cos(σ), |z| > r . In simple words, a measure µ P(Rk) has an orthogonal cone structure if it is concentrated on cones centered at vectors which form an orthonormal basis for Rk. The parameter σ is the opening angle of the cones measured from the axis ej, r is the radius of the ball centered at the origin that we remove from the cones to form the sets C(ej, σ, r), and δ is the amount of mass that is allowed to lie outside the cones. The condition σ < π/4 ensures that the sets C(ej, σ, r) do not overlap. Remark 9 Instead of the above definition, one could also consider a component-wise cone structure, namely the existence of parameters {(σj, δj, rj)}k j=1 such that Pk j=1 δj [0, 1) and µ(C(ej, σj, rj)) wj δj for all j {1, ..., k}. This notion of orthogonal cone structure implies the weaker notion in Definition 8. We do not introduce this component-wise Garc ıa Trillos, Hoffmann, Hosseini orthogonal cone structure here as our assumption are not strong enough to conclude such a structure holds; in particular, one would need a more detailed, pointwise notion of the overlapping parameter S. However, we believe that this component-wise cone structure can lead to stronger versions of our main theorems with individual bounds on the probability masses lying within each cone as opposed to the total mass outside of the union of the cones. We are now ready to state our first main result. Theorem 10 Let ν be a mixture model with density ρ of the form (1) which we assume satisfies Assumptions 7. For σ (0, π/4), suppose S < wmin(1 cos2(σ)) wmax cos2(σ)N2 . δ := wmax cos2(σ)N2S wmin(1 cos2(σ)) . Here, wmax := maxi=1,...,N wi and wmin := mini=1,...,N wi. Suppose that τ S > 0, τN < 1, (15) and s, t > 0 satisfy t sin(s) Nwmax !2 + 4N3/2 1 1 Nτ 1 , s + σ < π Then, the probability measure µ := F ν has an orthogonal cone structure with parameters σ + s, δ + t2, 1 sin(s) wmax for any δ [δ , 1). Here, F is the continuum Laplacian embedding defined in (14). Notice firstly that the smaller the quantities S and C/Θ are (i.e. the better separated the mixture model is), the smaller τ is. As a consequence the right-hand side of (16) is also small. This allows a choice of parameters s, t close to zero. Secondly, for small S, we can choose σ to be small, and if S is small compared to wmin(1 cos2(σ)) wmax cos2(σ)N2 , then δ will be small also. Following Definition 8, small choices of δ and t mean that a bigger proportion of the mass of the measure F ν can be found inside the cones, whereas if σ and s are small, then the cones are more concentrated around orthogonal vectors. The bottom line is that the smaller S and C/Θ are, the more concentrated around orthogonal vectors the measure F ν is. Note also that keeping all other parameters fixed, smaller choices of σ will force a bigger choice of δ. This reflects a trade-offbetween the total mass of the measure F ν inside the cones, and the opening angle of the cones. Geometric structure of graph Laplacian embeddings To see that the measure F ν has an orthogonal cone structure under the assumption of a well-separated mixture model, we first study the geometry of a related measure F Q ν where F Q is the map F Q : x M 7 q1(x) w1 ... q N(x) w N where we recall the functions qk are as in (8). Note that the functions qk are normalized in the sense that PN k=1 |qk(x)|2 = 1 for all x M, whereas the entries of F Q are normalized in L2(ρ) since F Q k 2 ρ = Z ρ(x) ρ(x)dx = 1 k {1, ..., N} . We will show that F Q ν has an orthogonal cone structure provided the overlapping parameter of the mixture model S is small enough (see Proposition 29). In Section A.1 we show that orthogonal cone structures are stable under Wasserstein perturbations. Theorem 10 is then proved by showing that the measures F Q ν and F ν are close to each other in the Wasserstein sense modulo an orthogonal transformation. To achieve this we will essentially show that q1, . . . , q N are close to spanning the subspace generated by the first N eigenfunctions of ρ. This statement is trivially true in the extreme case when M has N connected components and the ρkdx are the uniform distributions of the connected components. Indeed, in that case, the first N eigenvalues of ρ are zero, and the first N eigenfunctions of ρ are rescaled versions of the indicator functions of the components. In other words, the span of the ρk is exactly equal to the span of the first N eigenfunctions of ρ, and in particular up to a rotation we have F Q = F. 3.2 The discrete setting Having discussed spectral clustering at the continuum level, we now move to the discrete setting. Let Mn := {x1, . . . , xn} be i.i.d. samples from the probability distribution dν = ρ(x)dx and let n be the discrete graph Laplacian operator on L2(νn) as in (5). The spectrum of n induces a graph Laplacian embedding un,1(xi) ... un,N(xi) where un,1, . . . , un,N are the first N eigenvectors of n. Our second main result states that, with very high-probability, the measure Fn νn has an orthogonal cone structure if the mixture model ν is well-separated, the number of data points n is sufficiently large, and the connectivity parameter ε is sufficiently small. In order to state and prove this result we make the following assumption. Garc ıa Trillos, Hoffmann, Hosseini Assumption 11 We assume that M is a smooth, compact, orientable, connected m-dimensional manifold without a boundary. Furthermore, we assume that ρ C1(M), and that there exist positive numbers 0 < ρ < ρ+ such that ρ ρ(x) ρ+, x M. We denote by CLip > 0 the Lipschitz constant of ρ. We also assume that n is large enough and ε is small enough, so that λN + log(n)pm n1/mε min{C, 1 2(λN+1 λN)} , (19) where pm = 3/4 if m = 2 and pm = 1/m when m 3, C > 0 is a finite constant that depends only on M, and λN and λN+1 are the N-th and (N + 1)-th eigenvalues of ρ in increasing order. Remark 12 Several theoretical results deducing convergence rates for the spectra of graph Laplacians towards their continuum counterparts make similar assumptions to those in Assumption 11 (e.g Garc ıa Trillos et al. (2020); Garc ıa Trillos and Slepˇcev (2018)), and for the moment being we will only be able to state our second main result under the assumption that M is compact (see Section 4.2.1 where we discuss further extensions and generalizations). Regarding the assumptions on n and ε, we remark that requiring the quantity on the left hand side to be less than a constant C is the same condition used in the literature to quantify rates of convergence for the spectra of graph Laplacians (see for example Burago et al. (2014); Garc ıa Trillos et al. (2020)). On the other hand, requiring the left hand side of (19) to also be smaller than 1 2(λN+1 λN) is a condition that is easier to satisfy when the model is well-separated. Indeed, we will later see that it is possible to find an upper bound for λN (Lemma 38) and a lower bound for λN+1 (Proposition 27) in terms of the parameters of the mixture model. That is, one can restate our assumption and write an inequality in terms of the overlapping, indivisibility and coupling parameters. However, at this point we do not believe there is any benefit in developing this discussion any further, and we remark that these are simply concrete and technical conditions that guarantee that we have entered the regime where our theorems hold. Remark 13 Notice that when the manifold M is smooth and compact as in Assumption 11, Assumption 7 is automatically satisfied. Remark 14 Notice that the fact that ρ is bounded away from zero does not imply that the individual components ρk are as well. In particular, an individual component ρk may be arbitrarily close to zero (or even equal to zero) in certain region of M without contradicting Assumption 11. As a matter of fact, one precisely needs this to be the case in order for a model to be well-separated (for otherwise S would not typically be small). In Appendix A.3, we make the connection between the spectrum of the graph Laplacian n and the spectrum of the continuous operator ρ. The proof of the main technical result, Theorem 30, relies on a comparison between the Dirichlet forms associated to the operators n and ρ by way of careful interpolation and discretization of discrete and continuum Geometric structure of graph Laplacian embeddings functions. Only small modifications to the proofs in Burago et al. (2014); Garc ıa Trillos et al. (2020) are necessary and Section A.3 should be understood as a short guide on how to adapt the machinery developed there to the context of this paper. Finally, we leverage Theorems 10 and 30 to establish our second main result, which asserts that under Assumption 11 and provided the mixture model is well-separated, the measure Fn νn also has an orthogonal cone structure with very high probability. More precisely we have the following. Theorem 15 Let N 2 and β > 1. Suppose that M and ρ satisfy Assumption 11 for some ε > 0. Assume also that the parameters S, C, Θ of the mixture model ρ are such that the quantities τ, δ and σ from Theorem 10 satisfy (15) and (16). In addition, assume S < N 2. Let x1, . . . , xn be i.i.d. samples from the measure dν = ρdx with associated empirical measure νn, and let Fn be the Laplacian embedding defined in (18). Then, there exists a constant Cβ > 0 depending only on β, so that with probability at least 1 Cβn β, the probability measure µn := Fn νn has an orthogonal cone structure with parameters σ + s, δ + t2, 1 sin(s) wmax for any δ [δ , 1), and s, t > 0 satisfying t sin(s) Nwmax !2 + 4N3/2 1 Nφ(S, C, Θ, N, ε, n, m), φ(S, C, Θ, N, ε, n, m) = c M ε + log(n)pm !2 N 1 NS1/2 and c M > 0 is a constant depending on M, ρ , CLip, η and N. In the above we require S and C/Θ small enough so that all the inner terms in the definition of φ make sense and are positive. Notice that the only extra term in inequality (20) that does not appear in (16) is the term φ(S, C, Θ, N, ε, n, m). As we will see in Appendix A.4 (following the estimates in Appendix A.3) this term is an upper bound for the Wasserstein distance between the measures F ν and Fn νn (up to an orthogonal transformation). The assumptions in Theorem 15 imply that F ν has an orthogonal cone structure as described in Theorem 10, and hence in order to deduce that Fn νn has an orthogonal cone structure (as it is stated in Theorem 15) it is Garc ıa Trillos, Hoffmann, Hosseini enough to combine Proposition 22 with the estimates on the Wasserstein distance between Fn νn and F ν. We observe that asymptotically, if the parameter ε > 0 scales with n like then φ(S, C, Θ, N, ε, n, m) gets smaller as n , and we deduce that the geometry of the point cloud Fn(Mn) converges to that of F ν. Nevertheless, it is worth emphasizing that our results are meaningful in the finite (but possibly large) n case. As we discussed in Section 1.2 these rates can be improved with some straightforward but tedious computations following the ideas in the recent work of Calder and Garc ıa Trillos (2019). We opted for concreteness and simplicity in our estimates (as much as possible) in order to avoid detracting from the main analytical idea that we explore in this paper: study properties of continuum objects using PDE techniques and then connect to the finite data objects using some available approximation result. 4. Discussion The purpose of this section is to provide a more detailed account of the notions introduced above, as well as to discuss our main results and provide several remarks on extensions and generalizations. In Subsection 4.1 we discuss the overlapping, coupling, and indivisibility parameters and the notion of a well-separated mixture model from various view points and in the context of two concrete examples. Subsection 4.2 is then dedicated to various extensions and generalizations of our results to unbounded domains and k-NN graphs as well as different normalizations of the graph Laplacian. 4.1 Overlapping, coupling and indivisibility parameters We now give a more detailed description of the parameters S, C and Θ introduced in Section 2.2. The overlapping parameter S is small if the components ρk have well-separated mass in the sense that if at some point x M one component ρk(x) is large, then all other components ρi(x), i = k, are small at that point. In this case, at x M, we have ρ(x) wkρk(x), and so ρk(x)ρi(x) which is small. If ρk is small at x M on the other hand, then is also small. In other words, the parameter S being small requires the components not to overlap too much (hence the name given to it). If the indivisibility parameter Θ is small, this is an indication that our choice of mixture model ρ(x) = P k wkρk(x) for a given density ρ is not optimal in the sense that it could Geometric structure of graph Laplacian embeddings be split into smaller meaningful components. This property is captured by the spectrum of the operators ρk. For any k {1, ..., N}, and u, v C1(M), ρku, v ρk = Z M u v ρkdx = u, v ρk, and so the first eigenvalue is zero with constant eigenfunction 1. For the sake of illustration, suppose that the support of ρk has two disconnected components. In that case, one can construct an eigenfunction orthogonal to 1 that takes different constant values on the two different components, and so the second eigenvalue is zero. In light of the max-min theorem (see for example (Attouch et al., 2014, Thm. 8.4.2), also known as the Courant Fisher theorem) and definition (13), we observe that Θk is exactly the second eigenvalue of the operator ρk. It is well-known, in both the spectral geometry and machine learning literature, that the first non-trivial eigenvalue of the Laplacian associated to a geometric object (discrete or continuum) is intimately related to the problem of two-way clustering. Having a small first non-trivial eigenvalue indicates the presence of two clusters in the data set in the discrete setting. This is essentially because the eigenvalue problem can be interpreted as a relaxation of balanced cut minimization problems (see von Luxburg (2007) and also Cheeger s inequality for graphs (Chung, 1997, Sec. 2.3)). The more ρk concentrates on two separate components, the smaller the second eigenvalue will become (see Hoffmann et al. (2019) for precise estimates). This is why the second eigenfunction can be interpreted as an approximate identifier for clusters in the problem of two-way clustering, and it is usually referred to as the Fiedler vector. Intuitively, a low value of Θk indicates that the component ρk can be split into further significant components. For this reason, we require that Θ, the indivisibility parameter for the model to be large enough, guaranteeing that the components in the model cannot be divided into further meaningful components. This follows from the observation that if there exists a k {1, ..., N} such that ρk can be split into separate components, then the second eigenvalue of ρk will be small resulting in Θ being small as well. The coupling parameter C being small can be interpreted as a metastability condition on the relative entropy of measures ρkdx with respect to the measure ρdx. We define the relative entropy of a probability measure ϱdx with respect to ρdx by H(ϱ|ρ) := Z M ϱ (log ϱ log ρ) dx . Given a probability density ρ, consider ϱ(t, x) varying with time that satisfies the evolution equation tϱ = ϱ + div(ϱ log ρ) = div (ϱ (log ϱ log ρ)) , (21) with initial condition ϱ(0, x) = ρk(x) for a fixed k {1, ..., N}. Model (21) is a linear Fokker-Planck equation that is driven by the competition between linear diffusion and a confining drift term given by the potential log ρ. Observe that ρ is a stationary state of the Garc ıa Trillos, Hoffmann, Hosseini system, and the quantity H(ϱ(t)|ρ) gives a sense of how far ϱ(t, x) is from the stationary state ρ(x) at time t 0. In this context, the coupling parameter Ck can be interpreted as an initial rate of decay for the entropy. More precisely, following Boltzmann s H-theorem, the relative Fisher Information I(ϱ|ρ) is defined as the entropy dissipation along solutions to (21), d dt H(ϱ(t)|ρ) = Z M tϱ (log ϱ log ρ) dx M ϱ | (log ϱ log ρ)|2 dx =: I(ϱ(t)|ρ) , and so the definition of the coupling parameter (11) can be interpreted as 4I(ϱ(0)|ρ) = 1 Approximating the entropy for small initial times t > 0, H(ϱ(t)|ρ) = H(ρk|ρ) 4Ckt + O(t2) , we observe that if the coupling parameter Ck is small, then H(ϱ(t)|ρ) only varies very slowly in a neighborhood of t = 0, and so we can consider H(ρk|ρ) to be in a metastable state. For a well-separated mixture model, we expect the initial entropy H(ρk|ρ) to be large enough as small entropy would indicate that ρk takes values close to ρ (note that H(ϱ|ρ) = 0 if and only if ϱ = ρ almost everywhere). In other words, having a small overlapping parameter S indicates that the initial entropy H(ρk|ρ) cannot be too small. For well-separatedness, however, we require that Ck is small for all k {1, ..., N}, that is to say that the Fisher Information is small initially when starting the evolution process (21) at any of the components ρk. In this sense, we require the mixture model to be in a metastable state. Having discussed the parameters S, Θ and C, we now give some intuition on the notion of a well-separated mixture as introduced in Definition 4. Roughly speaking, a mixture model is well-separated if it has small coupling and overlapping parameters C and S, but a large indivisibility parameter Θ. A natural question is whether it is necessary to require both C and S to be small. Intuitively, S is small when the components ρj are concentrated on disjoint sets that are far apart. In such a setting one expects the coupling parameter C to also be small. However, we present two examples where C and S are not necessarily simultaneously small, demonstrating that it is not enough to only impose a small C (Example 1) or a small S (Example 2). The examples were chosen to illustrate how to estimate the parameters S, C and Θ in some concrete settings, and to see that our mathematical definitions do capture our intuitive interpretation of what well-separated components should look like. Example 1 (Mixture of two Gaussians) Consider a mixture of two standard Gaussian densities on R2 obtained by shifting the two densities. More precisely, let ρ = 1 Geometric structure of graph Laplacian embeddings Figure 1: Cross section of a 2D mixture model with Gaussian components. 2πe |x|2/2, ρ2(x) := 1 2πe |x γ|2/2 for a fixed vector γ R2. A cross section of the mixture model along the vector γ is illustrated in Figure 1. We estimate how the overlapping and coupling parameters of the model scale with the size of the off-set γ. Intuitively, the components become better separated as |γ| increases and we will show that both C and S become small in this regime. However, as |γ| becomes small the coupling parameter C decreases once more while the overlapping parameter S increases. This example demonstrates why we require both C and S to be small. Observe that the indivisibility parameter is unaffected by shifts in space of the components, i.e., Θ does not depend on γ. On the other hand, straightforward calculations show ρ1(x) = x, ρ2(x) ρ2(x) = (x γ), and that ρ(x) ρ(x) = x + γ ρ(x) = (x γ) γ In particular, ρ(x) , ρ(x) It then follows from (11) that C C1 + C2 = |γ|2 ρ2 (ρ1 + ρ2) dx = |γ|2 ρ dx = |γ|2 Thus, the coupling parameter C is controlled by the overlapping parameter S and |γ|2. It follows that when |γ| is close to zero, the overlapping parameter is close to one since both ρ1 and ρ2 are roughly equal to ρ, and that the coupling parameter is small of order |γ|2. On the other hand, for large |γ|, the overlapping parameter decays exponentially fast and the coupling parameter decays as well. From an intuitive point of view it is to be expected that for large |γ| the model is well separated. It is straightfoward to verify that, for large values of |γ| the parameter S α1 exp( β1|γ|2) and it follows from the above arguments that C/Θ α2 exp( β2|γ|2) for constants αi, βi > 0, i = 1, 2. Now consider the bound (16) and fix s, σ > 0. Then a formal calculation yields Garc ıa Trillos, Hoffmann, Hosseini t α3 exp( β3|γ|2) provided that |γ| is sufficiently large. Let F denote the Laplacian embedding obtained from the first two eigenfunctions of ρ and let µ = F ν. Since δ vanishes like S as well we infer the probability mass of µ outside of its orthogonal cones should vanish like α4 exp( β4|γ|2). We verify the sharpness of the above formal calculation, up to the constants α4, β4, for a mixture of Gaussian distributions in two dimensions. Let ρ = 1 2ρ1 + 1 2ρ2 where ρ1 = N(0, 0.04I) and ρ2 = N(γ, 0.04I) with I denoting the identity matrix in R2 2. We then generate a random graph by drawing n = 211 vertices from ρ and construct the graph Laplacian matrix n as in (5) by setting ε = 2(log n)3/4 n1/2 0.2. We then construct the discrete Laplacian embedding Fn using the first two eigenvectors of n and compute the number noutside of embedded points that fall outside the cones C((1, 1)T , π/4 1/4, 0) and C((1, 1)T , π/4 1/4, 0). We choose |γ| {0.7, 0.8, 0.9, 1, 1.1, 1.2} and repeat the experiment over 20 trials for each value of |γ| where the vertices are redrawn from the mixture. We report the averaged values of noutside/n, as an empirical approximation to the probability mass outside the cones, versus |γ|2 in Figure 2(a) indicating that indeed log(noutside/n) = O(|γ|2)) as expected. 0.4 0.6 0.8 1 1.2 1.4 1.6 noutside /n 0 0.1 0.2 0.3 0.4 0.5 0 noutside /n Figure 2: Empirical approximation of the probability mass of µ = F ν lying outside of the orthogonal cones C((1, 1)T , π/4 1/4, 0) and C((1, 1)T , π/4 1/4, 0). ν is a mixture model of two components and F is the Laplacian embedding. Weighted graphs with n = 211 vertices were constructed with the vertices distributed according to ν and F was approximated using the eigenvectors of the graph Laplacian matrix n constructed using as kernel η(r) = χB1(r)/π, the indicator function on the unit ball. (a) Here ν is a mixture of two Gaussians as in Example 1 and the figure shows the fraction noutside/n of embedded vertices that fall outside of the cones versus |γ|2, where |γ| denotes the distance between the component means. (b) ν is the uniform measure on a dumbbell shaped domain as in Example 2 with components as in Figure 3(a) and the figure shows the fraction noutside/n of embedded vertices that fall outside of the cones versus ϑ the width of the thin channel between the two components of the dumbbell. Geometric structure of graph Laplacian embeddings Example 2 (Partitioning a dumbbell shaped domain) We consider a probability measure ρ on a dumbbell shaped domain M R2. More precisely define the sets M0 = {(x1, x2) : ℓ/2 < x1 < ℓ/2, |x2| < ϑ}, M1 = {(x1, x2) : ℓ/2 < x1 < ℓ, |x2| < ℓ/2}, M2 = {(x1, x2) : ℓ< x1 < ℓ/2, |x2| < ℓ/2}, for ℓ, ϑ > 0 and define M := S2 j=0 Mj R2. We have in mind ϑ ℓso that M is a dumbbell shaped domain consisting of two rectangles connected by a thin rectangle. We assume |M| = 1, i.e. ℓ2+2ℓϑ = 1. We take ρ = 1 on M and then define the mixture elements ρk by partitioning M into subsets and taking the ρk to be a partition of unity for M with appropriate normalization. We present two choices of partitions and demonstrate that for small ϑ, one choice results in well-separated densities ρk while the other partition does not. Our first partition aims at cutting the dumbbell perpendicularly to its axis (see Figure 3(a)). For 0 < ε < l/2 define the function ψ : R R 1 if t 0 1 2 ε2 t2 if t [0, ε/2] 2(t ε)2 ε2 if t [ε/2, ε] 0 if t ε We let ρ1, ρ2 : M R , ρ1(x) := ψ(x1), ρ2(x) := ψ( x1), x = (x1, x2) M. The densities ρ1 and ρ2 are then defined by ρ1(x) := 2 ρ1(x) ρ1(x) + ρ2(x), ρ2(x) := 2 ρ2(x) ρ1(x) + ρ2(x), (22) from where we see that ρ(x) = 1 2ρ2(x) with weights w1 = w2 = 1 2 in the notation of (1). We now show that the above mixture model is well-separated according to Definition 4 for an appropriate choice of ε(ϑ) as ϑ 0. Let us begin by estimating the parameters S, C and Θ. Following (9) we have M ρ1(x)ρ2(x)dx M ρ1(x) ρ2(x)dx where in the first inequality we have used the fact that 1 ρ1(x) + ρ2(x) 2. Garc ıa Trillos, Hoffmann, Hosseini Thus S can be made arbitrarily small by either taking ϑ or ε to zero. Furthermore, by (11) one can see that C = max k 1 4 2 ρk(x)dx 4ϑ Z ε 2 ψ(t)dt 4ϑ This implies that if we pick ϑ ε, then we can make C as small as we want. It remains to show that the indivisibility parameter Θ remains bounded away from zero for the right choice of parameters. In order to find a lower bound for Θ it will be convenient to use Cheeger s inequality, which gives us a lower bound for Θ in terms of a geometric quantity that is much easier to understand and estimate. Namely, by Theorem 43, Θk = min u 1 M | u|2ρkdx R M u2ρkdx (h(M, ρk))2 where in the above means orthogonal with respect to the inner product , ρk and h(M, ρk) := min A M A M ρk(x)d S(x) M\A ρkdx o. The inequality is due to Cheeger Cheeger (1969), and for the convenience of the reader we provide in the Appendix a sketch of the proof that in particular highlights the specific structure of the Raleigh quotient associated to the operators ρk and ρ that makes the inequality work (see Section B.2). In particular we emphasize that for Cheeger s inequality to hold, both the numerator and denominator of the Raleigh quotient defining Θk must be weighted with the same measure (in our case both integrals are weighted by ρk). Due to symmetry we have Θ = Θ1 = Θ2. Thanks to Cheeger s inequality we can focus on obtaining a lower bound for h(M, ρ1). Now, from the fact that 2 ρ1(x) + ρ2(x) 1 we see that in order to get a lower bound for h(M, ρ1) we can alternatively get a lower bound for h(M, ρ1). After a moment of reflection, we can conclude that if we can find C(ϑ, ε) such that At M ρ1(x)d S(x) M\At ρ1dx o C(ϑ, ε), t [ l/2, ε], where At := {x M : x1 t}, then h(M, ρ1) min{c, C(ϑ, ε)}, where c is a constant that does not depend on ϑ nor ε and corresponds to the cost of a cut along the x1 axis. This is because other than the horizontal cut along the x1 axis, the only other natural cuts that are viable competitors for minimizing Cut (the objective function in the definition of h(M, ρ1)) are the ones perpendicular to the bottleneck of the dumbbell Geometric structure of graph Laplacian embeddings and along the x2 axis. A straight forward calculation shows that the cuts along the x1 axis result in a value of Cut that is bounded away from zero and so we focus on finding C(ϑ, ε) by analyzing vertical cuts across the bottleneck. Notice that we do not have to consider the case t > ε since the density ρ1 is equal to zero for x = (x1, x2) with x1 > ε. In other words we just have to look at non-trivial subsets of {x M : ρ1(x) > 0}. For all t [ l/2, ε] we have At M = {(x1, x2) M : x1 = t, |x2| < ϑ}, and so we can write Cut(At) = ψ(t) R ε t ψ(s)ds. Now, when ε is small, for every t [ l/2, ε/2] we have Cut(At) 1 2 R ε t ψ(s)ds 1. On the other hand, we notice that by the definition of ψ we have t ψ(s)ds + (ψ(t))2 = 4(ε t)4 3ε4 0, t [ε/2, ε], implying that the function Cut(At) is increasing between ε/2 and ε. In particular, it follows that for all t [ε/2, ε] we have Cut(At) Cut(Aε/2) = 1 2 R ε ε/2 ψ(s)ds 1 Hence C(ϑ, ε) = min{1, 1/ε}. We conclude that provided ε 1 we have that for some positive constant that does not depend on ε or ϑ. The bottom line is that the model is well-separated as long as ϑ ε and ε is sufficiently small. One can then consider a partition that aims at cutting the dumbbell along its axis (see Figure 3(b)). In particular we can introduce analogue densities ρ1 and ρ2 smoothening out the indicator functions for the sets in the new partitioning (in this case the smoothening happens in the x2 coordinate). Then using similar calculations as before we can show that the overlapping parameter can be made arbitrarily small (by selecting the parameter ε to be small), however, it will not be possible to control the indivisibility parameter from below as ϑ 0. In fact, we expect the second eigenvalue of ρk to go to zero with ϑ following Ann e (1995). Moreover, the coupling parameter will not be controlled by ϑ in this case, and in fact as ε 0 it will blow up. Therefore according to our definition this mixture model is not well-separated. Notice that this example has made explicit the fact that the notion of well separated mixture model does not depend on the density ρ, but on the actual decomposition as a mixture model. The example has also exhibited that our mathematical notion of well-separated mixture model does capture our intuition. Let us formally verify the lower bound (16) for this example similar to our numerical experiments in Example 1. Here, we consider a mixture model with the components defined Garc ıa Trillos, Hoffmann, Hosseini as indicated by Figure 3(a), see (22). We let ℓ= 1 and consider dumbbell shaped domains with parameter ϑ {.05, .1, .15, .2, .25, .3, .4, .5}. We then let ρ denote the uniform measure on the dumbbell shaped domain M and generate random graphs with n = 211 vertices sampled randomly from ρ. We choose the rest of the parameters in this experiment identical to the experiment in Example 1 and report the averaged values of noutside/n versus ϑ over 20 trials in Figure 2(b), indicating a linear relationship between ϑ and the probability mass outside the orthogonal cones. Let us now return to (16) with s fixed and split ρ into a mixture of two components as in (22) so that N = 2 and wmax = wmin = 1/2. Assuming our bounds on S, C in (23) and (24) are sharp we can proceed to simplify the lower bound on t up to leading order assuming that ϑ ε < 1. A straightforward calculation then yields that t αϑ1/2 for some constant α > 0 which in turn suggests that the probability mass of µ lying outside orthogonal cones of a fixed angle should vanish like ϑ1/2. Thus we observe a gap between the lower bound in (16) and our numerical experiments, which suggest the probability mass outside of the cones vanishes like ϑ. We conclude that either the lower bound (16) is not sharp in this case or our bounds on S, C in (23) and (24) can be improved. Figure 3: Two partitioning strategies for the dumbbell shaped domain of Example 2. The partition (a) leads to well-separated clusters in the limit ϑ 0, while the partition (b) does not. Clearly, the subsets in (b) can be further split into meaningful components. The previous two examples have illustrated some aspects of the parameters S, Θ, C which we now put together in a remark. Remark 16 The parameters S, C, Θ, can be interpreted as quantitative measures for the presence of three different types of degeneracies in the mixture model: Small S guarantees that there are no repeated components in the model. Notice that in Example 1 the two components of the model are the same when γ = 0. In that case, C would be zero and Θ would still be bounded away from zero. Geometric structure of graph Laplacian embeddings Large Θ guarantees that a component in the model can not be divided into further well separated components (e.g. as it would be the case in Figure 3(b) ). Small C guarantees that our components are not slices of what should be considered one big component. Consider for example a one-dimensional mixture model with two components given by the halves of a centered Gaussian, each component equal to the part of the Gaussian lying to either side of the origin. Then the overlap S is zero and Θ is bounded away from zero but the coupling C would be large since it is easy for entropy to flow from one component to the other. In summary, C is a measure of how well entropy flows from any one given component to the rest of the distribution, which is essentially a measure of between class separation. Conversely, Θ measures the within class separation. For a clustering method to work we intuitively expect that one should have that the within class similarity is much greater than the between class similarity: this is exactly saying that the ratio C/Θ should be small. Small S on the other hand guarantees that with high probability, data points are allocated to the correct cluster, which is a direct consequence of small overlap between components. 4.2 Extensions and generalizations of main results As outlined in Section A the proofs of our main results are divided into two main parts. 1. We prove that the measure F ν has an orthogonal cone structure provided the model is well-separated. 2. We show that after an orthogonal transformation, the measures F ν and Fn νn are close to each other in the Wasserstein sense with high probability. We emphasize that for part (1) we do not need to assume that M has compact support, nor that the density ρ is bounded bellow and above by positive constants. We only need Assumptions 7. In contrast, for part (2), we impose the extra conditions in Assumptions 11 to guarantee that the results of Burago et al. (2014); Garc ıa Trillos et al. (2020) on the spectral convergence of graph Laplacians are applicable. 4.2.1 Extensions to unbounded domains and k-NN graphs In this paper Theorem 30 links the graph Laplacian n to the differential operator ρ. Assumption 11 is only used when proving Theorem 30, and in particular to show the existence of transport maps Tn between ν and νn satisfying Tn Id := ess sup x M |Tn(x) x| 0, at an explicit rate in terms of n. For an unbounded manifold M, any transport map Tn pushing forward ν into νn, must transport mass over arbitrary long distances (even a tiny bit of mass transported over a large distance already results in a large -OT distance). It is therefore of relevance to investigate other forms of establishing the discrete-continuum link that do not involve using -OT maps; this is the topic of current investigation. Garc ıa Trillos, Hoffmann, Hosseini It is also of relevance to extend our analysis to settings where different graph constructions are used to extract the coarse structure of the point cloud. In this paper we have focused on proximity graphs constructed using a kernel η and a connectivity parameter ε > 0, but other graph constructions are perhaps more popular among practitioners and enjoy higher regularity properties; such is the case of k nearest neighbors (k-NN) graphs. Results quantifying spectral convergence rates for k-NN graph Laplacians are largely missing in the literature, and it is interesting to extend the analysis showed in this paper to the k-NN setting. In Garc ıa Trillos (2019), variational techniques have been used to study the statistical consistency of optimization problems defined on k-NN graphs. 4.2.2 Different weighted versions of the Laplacian The kernelized Laplacian considered in this paper is just one of many operators that can be used to construct the embedding F. In light of the notion of diffusion maps introduced in Coifman and Lafon (2006), it seems reasonable to ask what is the effect of different graph Laplacian normalizations on the geometry of the embedded data. For example, one may consider a family of renormalizations, where for a given γ [0, 1], we let (Wn)ij := ηε(|xi xj|) dγ i dγ j , di := j=1 ηε(|xi xj|) .1 The weights (Wn)ij induce the family of normalized graph Laplacians γ n := I D 1 n Wn where Dn = diag(di), di = P j(Wn)ij, denotes the degree matrix of the weight matrix Wn. Intuitively, the parameter γ controls the effect of the density ρ on the clustering: for γ close to one, the effect of ρ is minimal, and the limiting clustering is influenced only by the geometry of M; on the other hand, when γ is close to zero, the effect of the groundtruth density ρ is maximal (see Coifman and Lafon (2006)). This observation can be made rigorous using the ideas presented in the work Garc ıa Trillos and Slepˇcev (2018) which when applied would show that after appropriate rescaling, the spectrum of the graph Laplacian γ n approximates that of the formally defined differential operator γ ρu := 1 ρ2(1 γ) div(ρ2(1 γ) u). Furthermore, following the discussion in Section A.3 one can bootstrap the ideas presented in Garc ıa Trillos et al. (2020) and obtain quantitative error estimates relating the spectra of γ n and γ ρ analogous to the ones that appear in Theorem 30. We notice that the limiting differential operator γ ρ is of the same type as the one we have considered in this paper, but where now we think of the underlying density function as being proportional to ρ2(1 γ). That is, we can introduce the density eρ(x) := ρθ(x) R M ρθ(y) dy, 1. Formally, one could take any γ R, but we follow here the choice of γ as in Coifman and Lafon (2006). Geometric structure of graph Laplacian embeddings and notice that γ ρ = ρ, for θ = 2(1 γ). Based on the analysis presented in this paper, it is then possible to describe the geometry of the graph Laplacian embeddings constructed from γ n, assuming that the modified density eρ admits a representation as a well-separated mixture model. An immediate question that arises is to understand whether there is an interesting criterion under which one can choose the value of γ that produces the best clustering for a given data set. Alternatively, instead of looking at a single value of γ, it maybe worth working with the full ensemble of Laplacian embeddings (for all values of γ), and determine how to use it as best as possible. The formulation and analysis of these questions are the topic of current investigation. Remark 17 Note that the so called random walk graph Laplacian Shi and Malik (2000) coincides with 0 n (i.e. γ = 0). The limiting differential operator then takes the form ρ2 div(ρ2 u). Remark 18 Asymptotically, the kernelized graph Laplacian n studied here coincides with 1/2 n (i.e. γ = 1/2). In particular, there is no reason to work with 1/2 n (i.e. renormalizing twice). In addition, the limiting differential operator 1/2 ρ = ρ has the following properties. First, ρu, v ρ = Z is linear in ρ. Secondly, is also linear in ρ, and ρ is self-adjoint with respect to , ρ. Thirdly, under fairly general conditions, such as when ρ is a well-separated mixture, ρ has a spectral gap after the N-th eigenvalue. Acknowledgments The authors would like to thank Ulrike von Luxburg for pointing them to the paper Schiebinger et al. (2015) which was the starting point of this work. NGT was supported by NSF grant DMS 1912802. FH was partially supported by Caltech s von Karman postdoctoral instructorship. BH is supported in part by a postdoctoral fellowship granted by Natural Sciences and Engineering Research Council of Canada. Colette Ann e. A note on the generalized dumbbell problem. Proceedings of the American Mathematical Society, 123(8):2595 2599, 1995. doi: 10.1090/S0002-9939-1995-1257096-5. Garc ıa Trillos, Hoffmann, Hosseini Hedy Attouch, Giuseppe Buttazzo, and G erard Michaille. Variational analysis in Sobolev and BV spaces: Applications to PDEs and optimization. SIAM,Philadelphia, 2014. doi: 10.1137/1.9781611973488. Mikhail Belkin and Partha Niyogi. Towards a theoretical foundation for Laplacian-based manifold methods. Journal of Computer and System Sciences, 74(8):1289 1308, 2008. doi: 10.1016/j.jcss.2007.08.006. Vladimir I. Bogachev. Measure Theory, volume 1. Springer, New York, 2007. doi: 10.1007/ 978-3-540-34514-5. Dmitri Burago, Sergei Ivanov, and Yaroslav Kurylev. A graph discretization of the Laplace Beltrami operator. Journal of Spectral Theory, 4(4):675 714, 2014. doi: 10.4171/JST/83. JeffCalder and Nicol as Garc ıa Trillos. Improved spectral convergence rates for graph Laplacians on epsilon-graphs and k-NN graphs. ar Xiv preprint:1910.13476, 2019. URL https://arxiv.org/abs/1910.13476. Kamalika Chaudhuri, Fan Chung, and Alexander Tsiatas. Spectral clustering of graphs with general degrees in the extended planted partition model. In Proceedings of the 25th Annual Conference on Learning Theory, pages 35.1 35.23, 2012. URL http:// proceedings.mlr.press/v23/chaudhuri12.html. JeffCheeger. A lower bound for the smallest eigenvalue of the laplacian. In Proceedings of the Princeton conference in honor of Professor S. Bochner, pages 195 199, 1969. Fan R. K. Chung. Spectral graph theory. Number 92 in CBMS: Regional Conference Series in Mathematics. American Mathematical Society, Providence, 1997. doi: 10.1090/cbms/092. Ronald R. Coifman and St ephane Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5 30, 2006. doi: 10.1016/j.acha.2006.04.006. Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. iii. SIAM Journal on Numerical Analysis, 7(1):1 46, 1970. Matthew M Dunlop, Dejan Slepˇcev, Andrew M Stuart, and Matthew Thorpe. Large data and zero noise limits of graph-based semi-supervised learning algorithms. Applied and Computational Harmonic Analysis, 2019. doi: 10.1016/j.acha.2019.03.005. Nicol as Garc ıa Trillos. Variational limits of k-NN graph-based functionals on data clouds. SIAM Journal on Mathematics of Data Science, 1(1):93 120, 2019. doi: 10.1137/18M1188999. Nicol as Garc ıa Trillos and Ryan Murray. A new analytical approach to consistency and overfitting in regularized empirical risk minimization. European Journal of Applied Mathematics, 28(6):886 921, 2017. doi: 10.1017/S0956792517000201. Nicol as Garc ıa Trillos and Daniel Sanz-Alonso. Continuum limits of posteriors in graph Bayesian inverse problems. SIAM Journal on Mathematical Analysis, 50(4):4020 4040, 2018. doi: 10.1137/17M1138005. Geometric structure of graph Laplacian embeddings Nicol as Garc ıa Trillos and Dejan Slepˇcev. Continuum limit of total variation on point clouds. Archive for Rational Mechanics and Analysis, pages 1 49, 2015. doi: 10.1007/ s00205-015-0929-z. Nicol as Garc ıa Trillos and Dejan Slepˇcev. A variational approach to the consistency of spectral clustering. Applied and Computational Harmonic Analysis, 45(2):239 281, 2018. doi: 10.1016/j.acha.2016.09.003. Nicol as Garc ıa Trillos, Moritz Gerlach, Matthias Hein, and Dejan Slepˇcev. Error estimates for spectral convergence of the graph Laplacian on random geometric graphs toward the Laplace-Beltrami operator. Foundations of Computational Mathematics, 20(4):827 887, 2020. doi: 10.1007/s10208-019-09436-w. Matthias Hein, Jean-Yves Audibert, and Ulrike von Luxburg. From graphs to manifolds weak and strong pointwise consistency of graph Laplacians. In Learning theory, volume 3559 of Lecture Notes in Computer Science, pages 470 485. Springer, Berlin, 2005. doi: 10.1007/11503415 32. Nicholas J Higham. Matrix nearness problems and applications. In Applications of matrix theory, pages 1 27. Oxford University Press, 1989. doi: 10.2307/3619410. Franca Hoffmann, Bamdad Hosseini, Assad A Oberai, and Andrew M Stuart. Spectral analysis of weighted laplacians arising in data clustering. ar Xiv preprint:1909.06389, 2019. URL https://arxiv.org/abs/1909.06389. Shuyang Ling and Thomas Strohmer. Certifying global optimality of graph cuts via semidefinite relaxation: A performance guarantee for spectral clustering. Foundations of Computational Mathematics, pages 1 55, 2019. doi: 10.1007/s10208-019-09421-3. Anna V Little, Mauro Maggioni, and James M Murphy. Path-based spectral clustering: Guarantees, robustness to outliers, and fast algorithms. Journal of Machine Learning Research, 21(6):1 66, 2020. Karl Rohe, Sourav Chatterjee, and Bin Yu. Spectral clustering and the high-dimensional stochastic blockmodel. Annals of Statistics, 39(4):1878 1915, 08 2011. doi: 10.1214/ 11-AOS887. Geoffrey Schiebinger, Martin J. Wainwright, and Bin Yu. The geometry of kernelized spectral clustering. Annals of Statistics, 43(2):819 846, 04 2015. doi: 10.1214/14-AOS1283. Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888 905, August 2000. doi: 10.1109/34.868688. Dejan Slepcˇev and Matthew Thorpe. Analysis of p-laplacian regularization in semisupervised learning. SIAM Journal on Mathematical Analysis, 51(3):2085 2120, 2019. doi: 10.1137/17M115222X. Garc ıa Trillos, Hoffmann, Hosseini Wenqi Tao and Zuoqiang Shi. Convergence of Laplacian spectra from random samples. Journal of Computational Mathematics, 38(6):952 984, 2020. ISSN 1991-7139. doi: 10. 4208/jcm.2008-m2018-0232. C edric Villani. Topics in optimal transportation. American Mathematical Society, Providence, RI, 2003. doi: 10.1007/b12016. Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4): 395 416, 2007. doi: 10.1007/s11222-007-9033-z. Ulrike von Luxburg, Mikhail Belkin, and Olivier Bousquet. Consistency of spectral clustering. The Annals of Statistics, 36(2):555 586, 2008. doi: 10.1214/009053607000000640. Appendix A. Proofs of main results Here we collect the detailed proofs of the main theoretical results of this article. In Appendix A.1 we show stability of orthogonal cone structures under perturbations in the Wasserstein distance, proving that nearby probability measures have similar orthogonal cone structures. Appendix A.2 is dedicated to the proof of our first main result Theorem 10 stating that the pushforward F ν has an orthogonal cone structure when ν is well-separated. In proving this theorem we also obtain some auxiliary results regarding the spectrum of ρ that are interesting by themselves. In Appendix A.3 we collect convergence results on the spectrum of n and ρ and prove that as n and ε 0 in an appropriate regime then the spectrum of n converges to that of ρ and obtain quantitative rates. We then combine the above results in Appendix A.4 to prove Theorem 15, showing that the point cloud Fn(Mn) has an orthogonal cone structure when ν is well-separated. A.1 Stability of orthogonal cone structures In this section we prove some basic results on the stability of orthogonal cone structures. The first result follows immediately from the characterization of weak convergence of probability measures via Portmanteau s theorem (Bogachev, 2007, Theorem 8.4.7). Proposition 19 Let {µn}n N be a sequence of Borel probability measures on Rk that converge weakly to a measure µ P(Rk). If µ has an orthogonal cone structure with parameters (σ, δ, r), then for every ε (0, 1 δ), there exists K N such that for all n K , µn has a orthogonal cone structure with parameters (σ, δ + ε, r). Proof Consider the open set U := k j=1C(ej, σ, r), where the sets C(ej, σ, r) are as in Definition 8 on the orthogonal cone structure property for µ. Since µn w µ, Portmanteau s theorem implies that lim inf n µn(U) µ(U). In particular, this means that for every ε (0, 1 δ), there exists a K such that if n K then, µn(U) µ(U) ε 1 δ ε. Geometric structure of graph Laplacian embeddings An immediate corollary of the previous proposition is the following. Corollary 20 If a probability measure µ has an orthogonal cone structure with parameters (σ, δ, r) and µn is the empirical measure associated to n i.i.d. samples from µ, then ε (0, 1 δ), with probability one there exists K = K(ε, δ) N such that for every n K, µn has a (σ, δ + ε, r)-cone structure. The idea is now to improve the previous asymptotic result and instead study the stability of orthogonal cone structures under small perturbations of a measure; we use the Wasserstein distance to measure the distance between different probability measures. We recall its definition. Definition 21 Let µ1, µ2 be two probability measures on Rk with finite second moments. We define their Wasserstein distance by (W2(µ1, µ2))2 := min π Γ(µ1,µ2) Rk Rk |x y|2dπ(x, y), where Γ(µ1, µ2) stands for the set of transportation plans between µ1 and µ2, that is, the set of probability measures in P(Rk Rk) with first and second marginals equal to µ1 and µ2 respectively. Proposition 22 Let µ1, µ2 P(Rk) and suppose that µ1 has an orthogonal cone structure with parameters (σ, δ, r), where σ < π/4. Let s, t > 0 be such that k W2(µ1, µ2), (26) and such that σ + s < π/4. Then, µ2 has an orthogonal cone structure with parameters (σ + s, δ + t2, r(1 sin(s)). Proof Denote by C1, . . . , Ck the cones associated to µ1 and let e1, . . . , ek be the orthonormal vectors they are centered at. For s, t > 0 satisfying (26) and such that s + σ < π/4, define Cj := z Rk : z ej |z| > cos( σ), |z| > r , where σ := σ + s and r := r(1 sin(s)). Let π Γ(µ2, µ1) be an optimal transportation plan between µ2 and µ1 (which exists by (Villani, 2003, Proposition 2.1)). That is, Z Rk Rk |x y|2dπ(x, y) = (W2(µ1, µ2))2 . First, observe that from the definition of r, simple trigonometry shows that min v Ck,w Ck |v w| = r sin(s). Garc ıa Trillos, Hoffmann, Hosseini From this, it follows that r2 sin2(s)π(Cj, Rk \ Cj) Z Cj (Rk\ Cj) |x y|2dπ(x, y) (W2(µ1, µ2))2 . Hence, for every j = 1, . . . , k, µ2( Cj) π(Cj, Cj) = π(Cj, Rk) π(Cj, Rk \ Cj) µ1(Cj) (W2(µ1, µ2))2 The assumption σ < π/4 and σ + s < π/4 implies that the sets C1, . . . , Ck and respectively C1, . . . , Ck are pairwise disjoint and thus t2 1 (δ + t2). We finish this section by recalling that a map T : Rk Rk with µ2 = T µ1 (i.e. µ2 is the pushforward of µ1 by T) is called a transportation map between µ1 and µ2. Such a map induces a transportation plan πT Γ(µ1, µ2) given by πT := (Id T) µ1, where the map Id T is given by x Rk 7 (x, T(x)) Rk Rk. In particular, for any such T, we have (W2(µ1, µ2))2 Z Rk Rk |x y|2dπT (x, y) = Z Rk |x T(x)|2dµ1(x). (27) A.2 Proof of Theorem 10 In what follows we use u1, . . . , u N to represent the first N eigenfunctions of ρ and denote by U their span. Normalized in L2(ρ), the functions u1, . . . , u N form an orthonormal basis for U. We use ΠN : L2(ρ) U to denote the projection onto U. Denote by Q the span of the functions q1, . . . , q N. Our first proposition quantifies how close are the functions qk to their projections onto U. Proposition 23 For every k = 1, . . . , N we have 1 wk qk ΠN(qk) 2 ρ C λN+1 , (28) where ΠN stands for the projection onto U the span of the first N eigenfunctions of ρ and λN+1 is the (N + 1)-st eigenvalue of ρ, where the eigenvalues are assumed to be indexed in increasing order. Geometric structure of graph Laplacian embeddings Proof A direct computation using the definition of qk shows that ρqk, qk ρ = Z M | qk(x)|2ρ(x)dx = wk Ck, where Ck is as defined in (11). On the other hand, we can write qk in the orthonormal basis of eigenfunctions {u1, u2, . . . } of ρ as for some coefficients {alk}l N. Using this representation we can alternatively write ρqk, qk ρ as ρqk, qk ρ = l=1 a2 lkλl + l=N+1 a2 lkλl. We deduce that wk C wk Ck λN+1 l=N+1 a2 lk = λN+1 qk ΠN(qk) 2 ρ. We now focus on getting lower bounds for λN+1 in terms of the parameters of the mixture model. In particular, we show that when S and C are small in relation to Θ then λN+1 is large. We start with two preliminary results. Lemma 24 For every j {1, . . . , N}, 1 S qj, qj ρj 1 + S. Remark 25 It follows from the above lemma that qj ρj converges to one as S 0. In general however, S may be bigger than one, in which case the lower bound is trivial. Proof From the definition of qj we see that qj, qj ρj = 1 + Z On the other hand, M |wjρj ρ|ρj k =j wk ρkρj Garc ıa Trillos, Hoffmann, Hosseini From the above computations we obtain the desired inequality. Lemma 26 For every j {1, . . . , N}, inf v,qj ρj =0 M | v|2ρjdx v, v ρj Θ(1 S), (29) Proof Let us fix j {1, . . . , N} and pick v H1(M, ρj), such that v, qj ρj = 0, and v, v ρj = 1, where H1(M, ρj) := u L2(M, ρj)| Z | u|2 + |u|2 ρjdx < + . (30) Observe that Z M | v|2ρjdx = k=1 v, ej,k 2 ρjλj,k, where in the above {λj,k, ej,k} are the orthonormal (w.r.t. , ρj) eigenpairs of ρj. Since λj,1 = 0, the first eigenvector is given by ej,1 1, the function which is identically equal to one. Using the above equality and the Pythagorean theorem we conclude that Z M | v|2ρjdx λj,2 k=2 v, ej,k 2 ρj = λj,2 v, v 2 ρj v, ej,1 2 ρj = Θj 1 v, ej,1 2 ρj , (31) where we recall Θj was defined in (13), and corresponds to the second eigenvalue of the operator ρj by the max-min formula (Attouch et al., 2014, Thm. 8.4.2). We now find an upper bound for v, ej,1 2 ρj. Notice that v, ej,1 ρj = Z M vρjdx = Z M v(1 qj)ρjdx, where the second equality follows from the fact that v, qj ρj = 0. The Cauchy-Schwartz inequality implies that, v, ej,1 2 ρj v 2 ρj M (1 qj)2ρjdx = Z M (1 qj)2ρjdx = 1 + Z M (q2 j 2qj)ρjdx. From the fact that 0 qj 1, we conclude that (q2 j 2qj) q2 j , and therefore v, ej,1 2 ρj 1 Z M q2 j ρjdx = 1 qj, qj ρj 1 (1 S) = S, where the second inequality follows from Lemma 24. To conclude, the estimate (29) is a consequence of the above inequality and (31). We are now in a position to find a lower bound for the (N + 1)-th eigenvalue λN+1 of ρ. Geometric structure of graph Laplacian embeddings Proposition 27 (Lower bound for λN+1) Suppose that NS < 1. Then, p Proof Let us consider u H1(M, ρ) satisfying u, u ρ = 1, and u Q , i.e. u is orthogonal to the span of {q1, ..., q N} in terms of the inner product , ρ. From the fact that u, qj ρ = 0 for all j {1, ..., N}, we deduce that for every j = 1, . . . , N, u, qj ρj = Z M uqjρjdx = 1 M uqj (wjρj ρ) dx = 1 | u, qj ρj| 1 M |u|qjρkdx M u2ρkdx 1/2 Z M q2 j ρkdx 1/2 M u2ρkdx 1/2 where in the second inequality we have used the Cauchy-Schwartz inequality, in the fourth inequality we have used Jensen s inequality, and in the last inequality the fact that u, u ρ = 1. Define now the functions vj := u u, qj ρj qj, j = 1, . . . , N. The function vj is orthogonal to qj with respect to the inner product , ρj. In addition, Z M | vj|2ρjdx = Z M | u|2ρjdx 2 u, qj ρj M | qj|2ρjdx. Notice that, Z M | qj|2ρjdx = wj M | (ρj/ρ)|2ρdx = 1 2 ρj dx Cj, where in the last step we used the fact that 0 qj 1. From the above inequality, Lemma 24, (32) and Cauchy-Schwartz inequality, it follows that Z M | vj|2ρjdx Z M | u|2ρjdx + 2 M | u|2ρjdx 1/2 Garc ıa Trillos, Hoffmann, Hosseini According to Lemma 26, the left hand side of the above expression can be bounded from below by Θ(1 S) vj, vj ρj (since vj, qj ρj = 0). On the other hand, vj, vj ρj = u, u ρj u, qj 2 ρj qj, qj ρj u, u ρj (1 wj)S as it follows from (32) and Lemma 24. Therefore, Θ(1 S) u, u ρj (1 wj)ΘS M | u|2ρjdx + 2 M | u|2ρjdx 1/2 Multiplying both sides of the above inequality by wj and adding over j we deduce M | u|2ρdx + 2 M | u|2wjρjdx 1/2 + CNS (1 S)2 . Applying Jensen s inequality and using the fact that u, u ρ = 1 we have M | u|2ρdx + 2 M | u|2ρdx 1/2 + CNS (1 S)2 CNS (1 S) u ρ . Given that the above inequality holds for every u H1(M, ρ) with u Q and u, u ρ = 1, we conclude from the fact that Q has dimension N and from the max-min formula (Attouch et al., 2014, Theorem 8.4.2), M | u|2ρdx R M u2ρdx λN+1 Combining the previous proposition with Proposition 23 we obtain the following result. Geometric structure of graph Laplacian embeddings Corollary 28 For every k = 1, . . . , N we have 1 wk qk ΠN(qk) 2 ρ The above corollary shows that for a well-separated mixture model, that is, for S and C/Θ small, the right-hand side in (33) is small also, and so in that case, the functions q1, ..., q N are close to their projections onto the first N eigenfunctions of ρ. We will be able to conclude that F ν has an orthogonal cone structure provided we can show F Q ν has one. Proposition 29 The probability measure µQ := F Q ν with F Q defined in (17) has an orthogonal cone structure with parameters (σ, δ, r) for any σ (0, π/4), δ δ < 1 and r = w 1/2 max where δ := wmax cos2(σ)N2S wmin(1 cos2(σ)) . Here, wmax := maxi=1,...,N wi and wmin := mini=1,...,N wi. Proof For each k = 1, . . . , N, let Ck := z RN : zk |z| > cos(σ), |z| r , with r := 1 wmax and fixed σ (0, π/4). It follows that µQ(Ck) = F Q ν(Ck) = ν(Ak), where Ak is the preimage of Ck through F Q, i.e. Ak :=(F Q) 1(Ck) ρ(x) > cos(σ) From the definition of r we see that the condition is redundant and so we can write x M : ρk(x) > cos2(σ) Now, for an arbitrary x0 N j=1Ac j M we have ρk(x0) cos2(σ) j=1 ρj(x0), k = 1, . . . , N, Garc ıa Trillos, Hoffmann, Hosseini or equivalently, (1 cos2(σ))ρk(x0) cos2(σ) X j =k ρj(x0), k = 1, . . . , N. From the fact that, N X we know there exists a ˆk {1, . . . , N} (depending on x0) for which N2 (1 cos2(σ)) wˆkρˆk(x0) 2 cos2(σ)wmax ρ(x0) wˆkρˆk(x0) cos2(σ)wmax ρ(x0) wkρk(x0) Since this is true for every x0 N k=1Ac k, we conclude that N2 ν N k=1Ac k cos2(σ)wmax cos2(σ)wmax ρ dx cos2(σ)wmax ν N k=1 Ac k wmax cos2(σ)N2S wmin(1 cos2(σ)), and so we deduce that µQ N k=1Ck 1 wmax cos2(σ)N2S wmin(1 cos2(σ)) . This concludes the proof. Proof [Proof of Theorem 10] In order to show that the measure µ := F ν has an orthogonal cone structure, it is enough to show that the measure (OF) ν has an orthogonal cone structure where OF is the map: OF : x M 7 OF(x) RN, and where O is a conveniently chosen N N orthogonal matrix. We will construct O in such a way that the measures OF ν and F Q ν are close to each other in the Wasserstein Geometric structure of graph Laplacian embeddings sense. From this, Proposition 22 and Proposition 29 we will be able to conclude that OF ν has an orthogonal cone structure. Let us introduce vi := ΠN(qi) ΠN(qi) ρ , i = 1, . . . , N, where we recall ΠN : L2(ρ) U is the orthogonal projection onto U, the span of the first N eigenfunctions of ρ. From Corollary 28 it follows that qi wi vi ρ qi wi ΠN(qi) wi ρ + ΠN(qi) wi ΠN(qi) ΠN(qi) ρ = qi wi ΠN(qi) wi ρ + 1 wi | ΠN(qi) ρ wi| = qi wi ΠN(qi) wi ρ + 1 wi | ΠN(qi) ρ qi ρ| 2 qi wi ΠN(qi) wi In particular, for i = j | vi, vj ρ| = vi qi wi , vj ρ + qi wi , vj qj wj ρ + qi wi , qj wj ρ ρ + qj wj vj ! 1 + S1/2 = τ. The first inequality follows from an application of the Cauchy-Schwartz inequality for the first two terms (since vi ρ = 1 by definition) and Jensen s inequality for the last term. By assumption (15), we can then use Lemma 41 to conclude that there exists v1, . . . , v N, an orthonormal basis for (U, , ρ) for which vi vi 2 ρ N 1 1 Nτ 1 2 , i = 1, . . . , N. Therefore, for any i = 1, . . . , N, ρ = qi wi vi ρ + 2 vi vi, qi wi ρ vi + vi, vi vi ρ ρ + 4 vi vi ρ Garc ıa Trillos, Hoffmann, Hosseini 1 Nτ 1 . (35) Let F : M 7 RN be the map F(x) = PN j=1 vj(x)ej. From the fact that { v1, . . . , v N} and {u1, . . . , u N} are both orthonormal bases for (U, , ρ) we can conclude there exists an orthogonal matrix O RN RN for which Moreover, let π := (F Q F) ν. We notice that π is a coupling between F Q ν and F ν. From (35) we see that W 2 2 (F Q ν, F ν) Z RN |z z|2dπ(z, z) M |F Q(x) F(x)|2dν(x) !2 + 4N3/2 1 1 Nτ 1 . (36) It then follows from Proposition 22 and Proposition 29 that µ has an orthogonal cone structure with parameters σ + s, δ + t2, 1 sin(s) wmax with δ [δ , 1) as given in Proposition 29 for any s, t > 0 satisfying assumption (16). A.3 From discrete to continuum: convergence results for the spectrum of n In this section we connect the spectrum of the graph Laplacian n with the spectrum of ρ. For this purpose, we modify our notation slightly and write L2(ν) for L2(ρ) to allow for comparison with L2(νn) (recall (2)). Furthermore, we write u, v L2(µ) := R M uv dµ(x) for µ = ν or νn. Theorem 30 Let M, ρ, ε and n be as in Assumptions 11. Let un,1, . . . , un,N be the first N eigenvectors of the kernelized graph Laplacian n (i.e. the graph with weights defined in (4)). Then, for every β > 1 there exists a constant Cβ > 0 such that with probability at least 1 Cβn β, there exists a map Tn : M {x1, . . . , xn} (see Remark 31) satisfying ν(T 1 n ({xi})) = 1 n, i = 1, . . . , n, (37) gj un,j Tn 2 L2(ν) c M ε + log(n)pm Geometric structure of graph Laplacian embeddings for some orthonormal functions g1, . . . , g N L2(ν) belonging to U, the span of the first N eigenfunctions of ρ with respect to , ρ, and a constant c M > 0 depending only on M, N, ρ , CLip and η. In the above, λN and λN+1 denote the N-th and the (N + 1)-th eigenvalues of ρ. Remark 31 Relation (37) is simply saying that the map Tn is a transportation map from ν to νn. Tn here is as defined by Garc ıa Trillos et al. (2020), i.e. the -optimal transport ( -OT) map between ν and νn. It was proved by Garc ıa Trillos et al. (2020) that the -OT cost scales like log(n)pm n1/m . Notice that this is the only term on the right hand side of (38) that explicitly depends on n. The proof of Theorem 30 relies on a comparison between the Dirichlet forms associated to the operators n and ρ by way of careful interpolation and discretization of discrete and continuum functions. Only small modifications to the proofs in Burago et al. (2014); Garc ıa Trillos et al. (2020) are necessary and in what follows we give a thorough explanation of the steps that need to be adjusted. For simplicity, we largely follow the notation in Garc ıa Trillos et al. (2020). Let us summarize our strategy for the proof. The Dirichlet forms associated to n and ρ are respectively defined as bn(un) := 1 i,j Wij|un(xi) un(xj)|2, un L2(νn), (39) M | u|2ρ(x)dx, u H1(M, ρ), (40) where we recall Wij are the weights defined in (4). We will show that bn(Pf) (1 + er(n, ε))D(f), f L2(ν) , D(Ifn) (1 + er(n, ε))bn(fn), fn L2(νn). for appropriately defined discretization and interpolation maps P and I; the er(n, ε) term does not depend on f or fn, and is small for n large, and ε small but larger than a certain quantity that depends on n that we will introduce below. Together with the properties of the maps P and I (see Lemma 33), these inequalities allow us to compare the spectra of the discrete and continuum operators. To define the maps P and I we first introduce the transport maps Tn constructed in (Garc ıa Trillos et al., 2020, Theorem 2). Proposition 32 For a given β > 1, there exists a constant Cβ > 0 depending only on β so that with probability at least 1 Cβn β there exists a map Tn : M {x1, . . . , xn} for which: 1. ν(T 1 n (xi)) = 1/n for all i = 1, . . . , n, 2. δn := ess supx M d M(Tn(x), x) C log(n)pm Garc ıa Trillos, Hoffmann, Hosseini where C = C(M, ρ , ρ+, β) > 0, pm = 3/4 for m = 2 and pm = 1/m for m 3, and d M represents the geodesic distance in M. Given the map Tn, the discretization map P : L2(ν) L2(νn), is defined as the transformation that takes a given f L2(ν) and maps it into the discrete function Pf(xi) := n Z T 1 n ({xi}) f(x)ρ(x)dx, i = 1, . . . , n. For the interpolation map I, we first let P be the adjoint of P with respect to the L2(νn) inner product, i.e. the map that satisfies Pg, fn L2(νn) = g, P fn L2(ν), g L2(ν), fn L2(νn), which takes an arbitrary discrete function fn L2(νn) and returns a function P fn L2(ν) defined by P fn(x) = fn Tn(x), x M. We also introduce a smoothing operator Λε,n,0f(x) := Z 1 (ε 2δn)m ψ d M(x, y) f(y)dy, x M, f L2(ν), which is a convolution operator with radial kernel where we recall αη := R 0 η(r)rm+1dr was defined in (3), and where d M(x, y) is the geodesic distance between points x, y M. We normalize the operator Λε,n,0 to produce Λε,nf := 1 Λε,n,01Λε,n,0f, f L2(ν), where in the above Λε,n,01 denotes the application of Λε,n,0 to the function that is identically equal to one; the normalization is introduced so as to ensure that Λε,n leaves constant functions unchanged. Finally, I is defined as the composition of P with Λε,n. Namely, Iun := Λε,n P un, un L2(νn) . Note that for any fn L2(ν) it is guaranteed that Ifn H1(M, ρ), given that η was assumed Lipschitz (and so in particular ψ is C1). Lemma 33 (Discretization and interpolation errors) Under the same assumptions on M and ρ as in Theorem 30 it follows that: 1. For every f H1(M, ρ), Pf 2 L2(νn) f 2 L2(ν) C1 δn f L2(ν)D(f)1/2 + ε + δn Geometric structure of graph Laplacian embeddings 2. For every f H1(M, ρ), bn(Pf) 1 + C2 3. For every fn L2(νn) we have, Ifn 2 L2(ν) fn 2 L2(νn) C3 ε fn L2(νn)bn(fn)1/2 + ε + δn fn 2 L2(νn) 4. For every fn L2(νn) such that Ifn H1(M, ρ), D(Ifn) 1 + C4 Here the constants C1, C2, C3, C4 > 0 only depend on lower and upper bounds for ρ and its Lipschitz constant, and on M and η. The term δn was defined in Proposition 32. Before we prove Lemma 33 let us introduce an intermediate, non-local continuum Dirichlet energy |f(x) f(y)|2p ρ(y)dxdy, f L2(ν), where r > 0 is a length scale to be chosen later on. Remark 34 Observe that the Dirichlet energy D and the non-local Dirichlet energy Er can be written as D(f) = Z M | f|2 ρ2dx, f H1(M, ρ), |f(x) f(y)|2 ρ(x) ρ(y)dxdy, f L2(ν), where ρ = ρ. We notice that ρ C1(M) since it satisfies the same regularity condition as ρ given that ρ is bounded away from zero. Hence, our continuum Dirichlet energies D and Er take the same form as in Garc ıa Trillos et al. (2020). As a consequence we may use all of the bounds relating D and Er that were proven there. Remark 35 The only difference between the discrete Dirichlet form bn in (39) and the one introduced in Garc ıa Trillos et al. (2020) is the normalization by the terms p dε(xi)/n that appear in the definition of W, i.e., Garc ıa Trillos et al. (2020) uses a different normalization of the graph Laplacian. We notice that the term p dε( )/n uniformly approximates the density ρ. More precisely, given that the kernel η is assumed to be normalized, it follows from (Garc ıa Trillos et al., 2020, Lemma 18) that max i=1,...,n 1 ndε(xi) ρ(xi) C ε + δn Garc ıa Trillos, Hoffmann, Hosseini where we recall δn is the -OT distance between νn and ν given in Proposition 32, and the constant C > 0 depends only on ρ and geometric quantities associated to M. We highlight that the above is a non-optimal estimate on the error of approximation of a kernel density estimator, but has the advantage of only depending on the -OT distance between empirical and ground-truth measures. Remark 36 The maps P, P , Λr and I are defined identically to the way they were defined in Garc ıa Trillos et al. (2020) and hence we may use all properties proved there. Proof [Proof of Lemma 33] Throughout the proof we use C to denote a finite positive constant that depends only on the lower and upper bounds on ρ, the Lipschitz constant of ρ, and geometric quantities of M. This constant may change value from one instance to the next, unless its dependence is explicitly stated. The first statement follows directly from (Garc ıa Trillos et al., 2020, Lemma 13(i)) after noticing that ρ C ρ, given that ρ is bounded away from zero. The second statement follows almost exactly as in (Garc ıa Trillos et al., 2020, Lemma 13(ii)), after making some small modifications that we now describe. First, by Remark 35 and the assumptions on ρ it follows that for all i, j {1, . . . , n} and every x T 1 n ({xi}) and y T 1 n ({xj}) we have dε(xi)dε(xj) (1 + C(ε + δn/ε)) 1 p Using the definition of Pf and Proposition 32 (1) it follows that bn(Pf) = 2 αηnεm+2 T 1 n ({xi}) T 1 n ({xj}) (f(x) f(y))ρ(x)ρ(y)dxdy T 1 n ({xi}) T 1 n ({xj}) (f(x) f(y))2ρ(x)ρ(y)dxdy 2(1 + C(ε + δn/ε)) j=1 η |xi xj| T 1 n ({xi}) T 1 n ({xj}) (f(x) f(y))2 ρ(x) ρ(y)dxdy , where the first inequality follows by Jensen s inequality. Using the above bound together with Remark 34 and proceeding identically to the proof of (Garc ıa Trillos et al., 2020, Lemma 13(ii)) we can bound bn in terms of Er. More precisely, bn(Pf) 1 + C(ε + δn/ε) Eε +2δn(f) + Cδn ε E2(ε +2δn)(f) Geometric structure of graph Laplacian embeddings where ε = (1+C ε2)ε, C is a constant intrinsic to the manifold M, and C > 0 is a constant depending only on η. At this point we can follow the rest of the proof of (Garc ıa Trillos et al., 2020, Lemma 13(ii)) to obtain the desired result after bounding Er in terms of D: 1 εm+2 Eε +2δn(f) 1 + C ε + δn ε E2(ε +2δn)(f) C δn The third statement follows directly from (Garc ıa Trillos et al., 2020, Lemma 14 (i)) after noticing that thanks to Remark 35 we have p dε(xi)dε(xj)/n C, where C depends only on the upper bound for ρ. Indeed, this inequality allows us to upper bound the discrete Dirichlet energy introduced in Garc ıa Trillos et al. (2020) with a constant multiple of our discrete Dirichlet energy bn. For the fourth and final statement we proceed as in (Garc ıa Trillos et al., 2020, Lemma 14(ii)), but making a small modification. By Remark 35 we conclude that for all i, j {1, . . . , n} and every x T 1 n ({xi}) and y T 1 n ({xj}) we have ρ(x)ρ(y) n(1 + C(ε + δn/ε)) p dε(xi)dε(xj) . The above inequality is used to replace the degree term dε/n with the density ρ, allowing us to reverse the bound in Lemma 33(2) . Having in mind the inequality, the proof of our statement is exactly as in Garc ıa Trillos et al. (2020). Indeed, we can follow (Garc ıa Trillos et al., 2020, Lemma 14(i)) and show that Eε 2δn(P fn) 1 + C ε + δn εm+2bn(fn). (41) In turn, (Garc ıa Trillos et al., 2020, Lemma 14(ii)) (recall remark 34) gives D(Ifn) 1 + C ε + δn 1 εm+2 Eε 2δn(P fn), for all fn L2(νn). Combining (41) with the above inequality we deduce the desired result. With Lemma 33 at hand, we can obtain a precise relationship between eigenvalues of n and ρ. Lemma 37 (Convergence rate for eigenvalues ) Let λi be the i-th eigenvalue of ρ and let λn,i be the i-th eigenvalue of n. Let β > 1. Then there exist constants C, Cβ > 0 such that for sufficiently large n, with probability at least 1 Cβn β, we have |λn,i λi| C ε + δn λi, i = 1, . . . , N, where C > 0 depends only on M, η and ρ and Cβ > 0 depends only on β. Garc ıa Trillos, Hoffmann, Hosseini Proof Using Lemma 33, we can follow exactly the proof of (Garc ıa Trillos et al., 2020, Theorem 4). Now, we are ready to prove Theorem 30. Proof [Proof of Theorem 30] Throughout this proof we use C to denote a finite positive constant that depends only on M, N, ρ , CLip and η. This constant may change from one instance to the next. Let un,1, . . . , un,N be the first N eigenfunctions of n with unit norm. We can assume they form an orthonormal basis with respect to , νn, with corresponding eigenvalues We now put (Burago et al., 2014, Lemma 7.3) together with Lemmata 33 and 37, as well as Remark 36 to conclude that for every j = 1, . . . , N we have Iun,j ΠN(Iun,j) 2 L2(ν) CM,NλN λN+1 λN =: γ2 0, (42) where ΠN denotes the projection onto U, the span of the first N eigenfunctions of ρ and CM,N > 0 is a constant depending on M and N only. We need to show that the functions ΠN(Iun,1), . . . , ΠN(Iun,N) can be modified slightly to form an orthonormal basis for U. First, by (42) we have Iun,j L2(ν) γ0 ΠN(Iun,j) L2(ν) Iun,j L2(ν) + γ0 . Next, we find a bound on Iun,j L2(ν) using Lemma 33(3). To control the bn(un,j) term on the right hand side of Lemma 33(3) we make use of the bound on the eigenvalues provided in Lemma 37, bn(un,j) = un,j , nun,j L2(νn) = λn,j C 1 + ε + δn and so, since the un,j are normalized, we obtain Iun,j 2 L2(ν) 1 C ε p λN + ε + δn Combining with the estimate for ΠNIun,j L2(ν) yields (for ε and δn/ε small enough and n large enough so that the right-hand side in the last estimate is less or equal to 1) 1 γ1 ΠN(Iun,j) L2(ν) 1 + γ1 j = 1, . . . , N, (45) γ1 := C ε p λN + ε + δn Geometric structure of graph Laplacian embeddings Lemma 33(3) also allows us to bound the difference of inner products. For i = j we have Iun,j, Iun,i L2(ν) = 1 Iun,j 2 L2(ν) + Iun,i 2 L2(ν) Iun,j Iun,i 2 L2(ν) , 0 = un,j, un,i L2(νn) = 1 un,j 2 L2(νn) + un,i 2 L2(νn) un,j un,i 2 L2(νn) . Subtracting the above identities and using Lemma 33(3) we obtain | Iun,j, Iun,i L2(ν)| C ε p λN + ε + δn From this, and the fact that both Iun,j ΠNIun,j L2(ν) and Iun,i ΠNIun,i L2(ν) are smaller than γ0, we deduce that | ΠNIun,i, ΠNIun,j L2(ν)| | Iun,i, Iun,j L2(ν)| + | Iun,j, Iun,i ΠNIun,i L2(ν)| + | ΠNIun,i, Iun,j ΠNIun,j L2(ν)| λN + ε + δn + Cγ0 =: γ2, i = j, where the second inequality follows from Cauchy-Schwartz, the fact that ΠNu L2(ν) u L2(ν) for any u L2(ν), and from Iun,i L2(ν) C by (44). With (45), (46) and Assumptions 11 (guaranteeing the smallness of γ0, γ1 and γ2) we can use Lemma 41 and Remark 42, to deduce the existence of an orthonormal system g1, . . . , g N for U satisfying: ΠNIun,j gj L2(ν) N 1 1 Nγ2 1 , j = 1, . . . , N. Using (42) and expanding in Nγ2 we deduce Iun,j gj 2 L2(ν) Iun,j ΠNIun,j L2(ν) + ΠNIun,j gj L2(ν) 2 N 1 1 Nγ2 1 2 2 γ2 + 3N5/2 !2 =: γ3, j = 1, . . . , N. Iun,j un,j Tn 2 L2(ν) = Λε 2δn P un,j P un,j 2 L2(ν) (ε 2δn)m+2 Eε 2δn(P un,j) Cε2bn(un,j) Cε2λN =: γ4, Garc ıa Trillos, Hoffmann, Hosseini where we have used (Garc ıa Trillos et al., 2020, Lemma 8), assuming δn < ε/(m + 5) to obtain the first inequality and where the second and third inequalities follow from (41) and (43) respectively. Using triangle inequality and the above estimates we obtain un,j Tn gj 2 L2(ν) 2(γ3 + γ4). Plugging in the definitions of γ3 and γ4 and using Assumptions 11, we may collect the highest order terms and deduce that for all j = 1, . . . , N we have un,j Tn gj 2 L2(ν) C(1 + N3/2 + N3)CM,N (ε + δn/ε) . where CM,N is the constant in (42). Proposition 32(2) implies that un,j Tn gj 2 L2(ν) c M ε + log(n)pm where c M is a constant proportional to CM,N and C given in Proposition 32(2). This concludes the proof of the theorem. Our goal is now to replace the terms λN and λN+1 in (38) with quantities that only depend on the parameters of the mixture model. A lower bound for λN+1 was already obtained in Proposition 27, and now we focus on obtaining an upper bound for λN. Proposition 38 (Upper bound for λN) Suppose that S1/2N < 1 and let λN be the N-th eigenvalue of ρ. Then, λN NC 1 NS1/2 . Proof To get an upper bound for λN it is enough to use the min-max formula (Attouch et al., 2014, Theorem 8.4.2) for λN. In particular, since Q := span{q1, . . . , q N} is Ndimensional it follows that λN max u Q ρu, u ρ Let u Q for which u, u ρ = 1. Then, for some scalars ak satisfying k=1 a2 k qk 2 ρ + j =k akaj qk, qj ρ = k=1 a2 kwk + j =k akaj qk, qj ρ. j =k akaj qk, qj ρ j =k |ak||aj| wk wj S1/2 NS1/2 N X k=1 a2 kwk. Geometric structure of graph Laplacian embeddings k=1 a2 kwk 1 1 NS1/2 . In addition, M qk qjρdx . Recall that qk 2 ρ = wk Ck. Using H older s inequality, we obtain j=1 |akaj| Z M | qk|2ρdx 1/2 Z M | qj|2ρdx 1/2 k=1 a2 kwk NC 1 NS1/2 . We conclude that λN NC 1 NS1/2 . Remark 39 In the previous result we used the trace of the operator ρ restricted to Q to bound ΘN. This is not necessarily an optimal bound, but we remark that in the case when C/Θ is small enough (in particular smaller than 1/N), one can use this estimate to get a meaningful lower bound for the spectral gap λN+1 λN. More precisely, we obtain !2 NC 1 NS1/2 . While in general we think of N as a relatively small number (representing the number of meaningful components in a data set) we would like to remark that N in the estimates λN NC 1 NS1/2 , can be replaced with Neff where Neff is the effective number of components any given component intersects. That is, Neff = max k=1,...,N #{i = k s.t ν(supp(ρk) supp(ρi)) > 0}. Estimates with better dependence on N may be obtained in terms of a quantity analogue to C of the form: max k=1,...,N M | log(ρk/ρ)|3ρkdx. Garc ıa Trillos, Hoffmann, Hosseini Combining the lower bound for λN+1 from Proposition 27 and the upper bound for λN we immediately deduce the following. Corollary 40 In Theorem 30, inequality (38) can be replaced with: Z M |gj(x) un,j Tn(x)|2dν(x) = gj un,j Tn 2 L2(ν) φ(S, C, Θ, N, ε, n, m), (47) for all j = 1, . . . , N, where we recall φ(S, C, Θ, N, ε, n, m) := c M ε + log(n)pm !2 NC 1 NS1/2 and c M is a constant depending on M, ρ , CLip, η and N. A.4 Proof of Theorem 15 Proof [Proof of Theorem 15] Let β > 1. From Corollary 40, we know that with probability greater than 1 Cβn β, there exist a transportation map Tn : M {x1, . . . , xn} pushing forward ν into νn and an orthonormal set of functions g1, . . . , gn in U (the space generated by the first N eigenfunctions of ρ) satisfying 1. supx M d M(x, Tn(x)) c M log(n)pm M |gi(x) un,i(Tn(x))|2dν(x) φ(S, C, Θ, N, ε, n, m), where the function φ is defined in (48) and used throughout this proof for convenience of notation. Let G(x) := (g1(x), . . . , g N(x)) for x M. We deduce that M |Fn Tn(x) G(x)|2 dν(x) = M |gi(x) un,i(Tn(x))|2dν(x) Nφ(S, C, Θ, N, ε, n, m). We claim that the integral on the left hand side of the previous expression can be written as: Z M |Fn Tn(x) G(x)|2 dν(x) = Z RN RN |x y|2dπn(x, y), for a transportation plan πn P(RN RN) between the measures G ν and Fn νn. To see this, let πn := (G Fn Tn) ν. It is straightforward to check that πn is a transportation plan between G ν and Fn νn, and moreover, by the change of variables formula Z RN RN |x y|2dπn(x, y) = Z M |G(x) Fn Tn(x)|2dν(x). Geometric structure of graph Laplacian embeddings (W2(G ν, Fn νn))2 Z M |Fn Tn(x) G(x)|2 dν(x) Nφ(S, C, Θ, N, ε, n, m). (49) Since g1, . . . , g N is an orthonormal basis for U, we conclude that there exists an orthogonal matrix R such that for every x M we have G(x) = RF(x), where F is the continuum spectral embedding as defined in (14). In order to show that the measure µn := Fn νn has an orthogonal cone structure, it is enough to show that the pushforward of νn through any orthogonal transformation of Fn has an orthogonal cone structure. In particular, we will show that the measure (RO 1Fn) νn has an orthogonal cone structure, where R is defined as above, and O is the orthogonal transformation chosen in the proof of Theorem 10: OF(x) = F(x) = ΠN(qj)(x) ΠN(qj) ρ ej . We can now show that the measures OR 1F ν and F Q ν are close to each other in the Wasserstein sense. Since G = RO 1 F, W2(OR 1Fn νn, F Q ν) W2(OR 1Fn νn, F ν) + W2( F ν, F Q ν) = W2(Fn νn, RO 1 F ν) + W2( F ν, F Q ν) = W2(Fn νn, G ν) + W2( F ν, F Q ν) Nφ(S, C, Θ, N, ε, n, m) !2 + 4N3/2 1 where the last inequality follows from (49) and (36). Thanks to the Assumption (20) we can apply Proposition 29 and Proposition 22 to conclude that OR 1Fn νn has an orthogonal cone structure with the parameters as stated in Theorem 15, and hence so does Fn νn. Appendix B. Supplementary results B.1 Near-orthogonal vectors Lemma 41 Let V be a vector space of dimension N and let , be an inner product on V with associated norm . Suppose that v1, . . . , v N are linearly independent unit vectors in V such that | vj, vl | δ, j = l Garc ıa Trillos, Hoffmann, Hosseini for δ > 0 satisfying Nδ < 1. Then, there exists an orthonormal basis for V , { v1, . . . v N}, such that for every j = 1, . . . , N vj vj φ(N, δ), (50) where φ(N, δ) := Proof Without loss of generality assume V is the Euclidean space RN with the usual inner product. Define the square matrix V = v1|v2| |v N . and the residual matrix R = VT V I. Since v1, ...v N are linearly independent, the matrix VT V is invertible, and following Higham (1989) we have that the the nearest orthonormal approximation to V in the Frobeneous norm is the matrix V = V(VT V) 1 2 = V(I + R) 1 Using the series expansion of the square root gives Note that |Rij| δ following our assumptions on the vj. Furthermore, a straightforward calculation shows that |(Rk)ij| Nk 1δk. Thus, |(V V)ij| N sup ij |Vij| sup ij |Vij| ! h (1 Nδ) 1 2 1 i = φ(N, δ) It follows that i=1 |(V V)ij|2 !1/2 = φ(N, δ) , which proves inequality (50). Geometric structure of graph Laplacian embeddings Remark 42 With the same notation as in the above lemma, suppose that the unit vectors v1, . . . , v N satisfy | vi, vj | δ, i = j, where δ satisfies the slightly stronger condition We claim that in that case the vectors are linearly independent. Indeed, for the sake of contradiction suppose that they are not. Then we can find numbers a1, . . . , a N not all equal to zero, for which Normalizing the ai we can further assume that i=1 a2 i = 1. It follows from Jensen s inequality that i=1 a2 i + 2 j =i aiaj vi, vj 1 2δ which would contradict the hypothesis on δ. B.2 Cheeger s inequality Cheeger s inequality in manifolds was introduced in Cheeger (1969). For the convenience of the reader here we present a proof of Cheeger s inequality for sufficiently smooth and bounded functions using techniques that are developed in spectral geometry and spectral graph theory literature. We omit some technical details and highlight the specific structure of the operator ρ which allows us to deduce the inequality. As we will see below not every normalization of Laplacian operator will produce a similar result. We also note that our result can be generalized to functions in L2(M, ρ) via a density argument. Theorem 43 Let u L2(M, ρ) sufficiently smooth and bounded with Z M uρdx = 0. Then ρu, u ρ h(M, ρ) := min A M A M ρ(x)d S(x) Garc ıa Trillos, Hoffmann, Hosseini Proof We show that for any non-constant function u L2(M, ρ) sufficiently smooth and bounded with Z M uρdx = 0, we can find a set A M such that R(u) := ρu, u ρ M | u|2ρdx R A M ρd S(x) 4(h(M, ρ))2. To see this, we start by letting r be the smallest number for which Z M 1{u r}ρdx 1/2, and we define z(x) := u(x) r, x M. In other words, r is the median of u(x). Let m, M be the infimum and supremum of z. Then either m < 0 or M > 0. Notice that since z and u differ only by a constant we have ρu, u ρ = ρz, z ρ, and from the fact that R M uρdx = 0, we also have z, z ρ u, u ρ In particular, R(z) R(u). Now, the coarea formula states that for any L1(dx) function g : M R we have M g(x)| z(x)|dx = Z At M g(x)d S(x) dt, where At is the level set: At := {x M : z(x) t}. Taking g to be the function g(x) = 2ρ(x)|z(x)|, x M, we deduce that Z M 2|z(x)|| z(x)|ρ(x)dx = Z At M ρ(x)d S(x) 2|t|dt At M ρ(x)d S(x) 2|t|dt, (51) Geometric structure of graph Laplacian embeddings where in the first equality we have used the fact that at the boundary At the function z is equal to t. On the other hand, At ρ(x)dx, Z M\At ρ(x)dx At ρ(x)dx 2|t|dt + Z M M\At ρ(x)dx At ρ(x)dx tdt + Z M M\At ρ(x)dx m 1{z(x) t}(x)( 2t)dtρ(x)dx + Z 0 1{z(x)>t}(x)(2t)dtρ(x)dx z(x) 1{z(x) 0}(x)( 2t)dtρ(x)dx + Z 0 1{z(x)>0}(x)(2t)dtρ(x)dx M 1{z(x) 0}(x)z2ρ(x)dx + Z M 1{z(x)>0}(x)z2ρ(x)dx = Z M z(x)2ρ(x)dx, where in the third equality we used the Fubini-Tonelli theorem to switch the order of integrals. From the previous identity and (51) we notice that At M ρ(x)d S(x) 2|t|dt = K(z) Z M K(z) := 2 R M |z(x)|| z(x)|ρ(x)dx Now, Cauchy Schwartz-inequality shows that This is the point where it is important to have both numerator and denominator in the Raleigh quotient R to be weighted by ρ. We notice that with a different weighting we would have not gotten the square root of the Raleigh quotient in this last step. It follows that At M ρ(x)d S(x) from where we can see that there must exist some t (m, M) for which R At M ρd S(x) M\At ρdx o 2 p