# learning_by_unsupervised_nonlinear_diffusion__dc759b91.pdf

Journal of Machine Learning Research 20 (2019) 1-56 Submitted 12/18; Revised 7/19; Published 10/19

Learning by Unsupervised Nonlinear Diﬀusion

Mauro Maggioni mauro.maggioni@jhu.edu Department of Mathematics, Department of Applied Mathematics and Statistics, Mathematical Institute of Data Sciences, Institute of Data Intensive Engineering and Science, Johns Hopkins University, Baltimore, MD 21218, USA

James M. Murphy jm.murphy@tufts.edu Department of Mathematics Tufts University, Medford, MA 02155, USA

Editor: Aapo Hyv arinen

This paper proposes and analyzes a novel clustering algorithm, called learning by unsupervised nonlinear diﬀusion (LUND), that combines graph-based diﬀusion geometry with techniques based on density and mode estimation. LUND is suitable for data generated from mixtures of distributions with densities that are both multimodal and supported near nonlinear sets. A crucial aspect of this algorithm is the use of time of a data-adapted diﬀusion process, and associated diﬀusion distances, as a scale parameter that is diﬀerent from the local spatial scale parameter used in many clustering algorithms. We prove estimates for the behavior of diﬀusion distances with respect to this time parameter under a ﬂexible nonparametric data model, identifying a range of times in which the mesoscopic equilibria of the underlying process are revealed, corresponding to a gap between within-cluster and between-cluster diﬀusion distances. These structures may be missed by the top eigenvectors of the graph Laplacian, commonly used in spectral clustering. This analysis is leveraged to prove suﬃcient conditions guaranteeing the accuracy of LUND. We implement LUND and conﬁrm its theoretical properties on illustrative data sets, demonstrating its theoretical and empirical advantages over both spectral and density-based clustering.

Keywords: unsupervised learning, clustering, spectral graph theory, manifold learning, diﬀusion geometry

1. Introduction

Unsupervised learning is a central problem in machine learning, requiring that data be analyzed without a priori knowledge of any class labels. A common unsupervised problem is clustering, in which the data is to be partitioned into clusters so that each

c 2019 Mauro Maggioni and James M. Murphy.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/18-873.html.

Maggioni and Murphy

cluster contains similar points and distinct clusters are suﬃciently separated. Even with suitable deﬁnitions of similarity and separation , this problem is typically illposed, requiring various geometric, analytic, topological, and statistical assumptions on the data and measurement method be imposed to make it tractable. Feature extraction is often combined with these standard methods (e.g. K-means) to improve clustering performance. In particular, spectral methods construct graphs representing data, and use the spectral properties of the resulting graph Laplacian to produce structure-revealing features in the data. Graphs often encode pairwise similarities between points, typically at a local spatial scale, often determined by a parameter σ. For example only points xi, xj within distance 4σ of each other may be connected, with weight exp( xi xj 2 2/σ2). From the graph, global features on the data may be derived, for example by considering the eigenfunctions of the random walk on the graph. Alternatively, graphs may be used to introduce data-adaptive distances, such as diffusion distances, which are associated to random walks and diﬀusion processes on graphs. Diﬀusion distances do not depend only on the graph itself, but also on a time parameter t that determines a scale on the graph at which these distances are considered, related to the time of diﬀusion or random walk. Choosing σ in graphbased algorithms, and both σ and t in the case of diﬀusion distances, is important in both theory and applications. However, their role is well-understood only in certain regimes (e.g. asymptotically for σ, t 0+) which are of interest in some problems (e.g. manifold learning) but not necessarily for clustering.x We propose the Learning by Unsupervised Nonlinear Diﬀusion (LUND) scheme for clustering, which combines diﬀusion distances and density estimation to eﬃciently cluster data generated from a nonparametric model. At the same time, we advance the understanding of the relationship between the local spatial scale parameter σ and the diﬀusion time parameter t in the context of clustering, demonstrating how the role of t can be exploited to successfully cluster data sets for which K-means, spectral clustering, or density-based clustering methods fail. We provide quantitative bounds and guarantees on the performance of the proposed clustering algorithm for data that may be highly nonlinear (i.e. non-convex, elongated, ellipsoidal, etc.) and of variable density.

1.1 Major Contributions and Outline

This article makes two major contributions. First, explicit estimates on diﬀusion distances for nonparametric clustered data are proved: we obtain lower bounds for the diﬀusion distance (see Deﬁnition 2.1, or Coifman et al. (2005)) between clusters, and upper bounds on the diﬀusion distance within clusters, as a function of the time parameter t and suitable properties of the clusters. These bounds yield a mesoscopic not too small, not too large diﬀusion time-scale at which diﬀusion distances separate clusters clearly and cohere points in the same cluster. These results, among other things, show how the role of the time parameter, which controls the scale on the

Learning by Unsupervised Nonlinear Diffusion

data of the diﬀusion distances, is very diﬀerent from the commonly-used scaling parameter σ in the construction of the underlying graph, which is a local spatial scale measured in the ambient space. Relationships between t and σ are well-understood in the asymptotic case of n + , σ 0+ (at an appropriate rate with n; see Coifman et al. (2005), Lafon et al. (2006), and Von Luxburg (2007)) and t 0+ (essentially Varadhan s lemma applied to diﬀusions on a manifold; see Den Hollander (2008), Jones et al. (2008), and references therein). These asymptotic relationships at small scales imply that the choice of t is essentially irrelevant, since in these limits diﬀusion distances are essentially geodesic distances. However, the clustering phenomena we are interested in are far from this regime, and we show that the interplay between t, σ, and n becomes crucial. Second, the LUND clustering scheme is proposed and shown to enjoy performance guarantees for clustering of certain non-parametric mixture models. We prove suﬃcient conditions for LUND to correctly determine the number of clusters in the data and to have low clustering error. Computationally, we present an eﬃcient algorithm implementing LUND, which scales near-linearly in the number of points n, in the ambient dimension D, and exponentially in the intrinsic dimension of the data. We verify the properties of the LUND scheme and algorithm on synthetic data, studying the relationships between the diﬀerent parameters in LUND, in particular between σ and t, and compare with popular and related the graph-based spectral clustering and fast search and ﬁnd of density peaks clustering (FSFDPC) (Rodriguez and Laio, 2014) algorithms, illustrating weaknesses of these methods and corresponding advantages of LUND. LUND may be understood as a combination of these two methods, in that it integrates diﬀusion distances (which are graph-based) and an outlier robustness procedure into the FSFDPC framework, which uses Euclidean distances. Indeed, our experiments illustrate how LUND combines the beneﬁts of graph-based and density-based methods. The outline of the article is as follows. Background is presented in Section 2. In Section 3, motivational data sets and a summary of the theoretical results are presented and discussed. Theoretical comparisons with related clustering methods are also made in Section 3. Estimates on diﬀusion distances are proved in Section 4. Performance guarantees for the LUND algorithm are proved in Section 5. Numerical experiments and computational complexity are discussed in Section 6. Conclusions and future research directions are given in Section 7.

2. Background

2.1 Background on Clustering

Given the wide variety of data of interest to scientiﬁc practitioners, many approaches to clustering have been developed, whose performance is often wildly variable and data-dependent.

Maggioni and Murphy

2.1.1 K-Means

A classical and popular clustering algorithm is K-means (Steinhaus, 1957; Friedman et al., 2001) and its variants (Ostrovsky et al., 2006; Arthur and Vassilvitskii, 2007; Park and Jun, 2009), which is often used in conjunction with feature extraction methods. K-means partitions the data into K (a parameter) groups, {Ck}K k=1, chosen to minimize within-cluster dissimilarity: C = arg min{Ck}K k=1 PK k=1 P

x Ck x xk 2 2, where xk is the mean of the kth cluster (for a given partition, it is the minimizer of the least squares cost in the inner sum). While popular, K-means and its variants may perform poorly for data sets that are not the union of well-separated, near-spherical clusters, and are sensitive to outliers.

2.1.2 Hierarchical Clustering Methods

Hierarchical methods iteratively merge or split clusters in order to produce a multiscale family of partitions known as a dendrogram (Friedman et al., 2001). More precisely, a dendrogram for n data points is a family of clusterings {Ci}n i=1 such that C1 is the clustering of n singleton clusters, and Ci is related to Ci+1 in that the two clusters minimizing some linkage function in Ci are merged in Ci+1. Single linkage clustering (SLC) (Sneath, 1957; Gower and Ross, 1969; Friedman et al., 2001) is a particular hierarchical clustering method that iteratively merges clusters according to the linkage function LSLC(C1, C2) = minx1 C1,x2 C2 x1 x2 2; metrics other than the ℓ2 norm may be used. For clusters that are well-separated, single linkage clustering is known to perform well (Arias-Castro, 2011), despite lack of strong statistical consistency in dimensions greater than 1 (Hartigan, 1981).

2.1.3 Density and Mode-Based Methods

Density and mode-based clustering methods detect regions of high-density and lowdensity to determine clusters. The DBSCAN (Ester et al., 1996) and DBCLASD (Xu et al., 1998) algorithms assign to the same cluster points that are close and have many near neighbors, and ﬂag as outliers points that lie alone in low-density regions. The mean-shift algorithm and variants (Fukunaga and Hostetler, 1975; Comaniciu and Meer, 2002; Chac on, 2012; Genovese et al., 2016) push points towards regions of high-density, and associate clusters with these high-density points, sometimes called modes. Both DBSCAN and mean-shift clustering suﬀer from a lack of robustness to outliers and depend strongly on parameter choices. The fast search and ﬁnd of density peaks clustering algorithm (FSFDPC) (Rodriguez and Laio, 2014) proposes to address these weaknesses. This method characterizes cluster modes as points that are far in Euclidean distance from points of higher density. FSFDPC has been widely applied (Spitzer et al., 2015; Wiwie et al., 2015; Sun et al., 2015; Rossant et al., 2016; Wang et al., 2016; Jia et al., 2016), and correctly clusters the data in Figure 1. However, we show that the standard FSFDPC method, while very popular in scientiﬁc applications, does not correctly cluster data unless

Learning by Unsupervised Nonlinear Diffusion

strong geometric and statistical assumptions on the data are satisﬁed. The main reason is that Euclidean distances are used to ﬁnd modes, which is inappropriate for data drawn from mixtures of distributions supported near nonlinear sets (see, for example, Figure 18). Moreover, FSFDPC is not robust to outliers, which may be far from other points but be of low-density.

2.1.4 Spectral Methods

Spectral clustering methods compute features that reveal the structure of data that may deviate from the spherical, Gaussian shapes ideal for K-means, and in particular may be nonlinear or elongated in shape. This is done by building local connectivity graphs on the data that encode pairwise similarities between points, then computing a spectral decomposition of adjacency or random walk or Laplacian operators deﬁned on this graph. Let X = {xi}n i=1 RD be a set of points to cluster. Let G be a graph with vertices corresponding to points of X and edges stored in an n n symmetric weight matrix W. Often one chooses Wij = K(xi, xj) for some (symmetric, often radial and rapidly decaying) nonnegative kernel K : RD RD R, such as K(xi, xj) = exp( ||xi xj||2/σ2) for some choice of scaling parameter σ > 0. The graph G may be fully connected, or it may be a nearest neighbors graph with respect to some metric. Let D be the diagonal matrix Dii := Pn j=1 Wij. The graph Laplacian is constructed as L = D W. One then normalizes L to acquire either the random walk Laplacian LRW = D 1 L = I D 1W or the symmetric normalized Laplacian LSYM = D 1

2. We focus on LSYM in what follows. It can be shown that LSYM has real eigenvalues 0 = λ1 λn 2 and corresponding eigenvectors {φi}n i=1. The original data X can be clustered by clustering the embedded data xi 7 (φ1(xi), φ2(xi), . . . , φM(xi)) for an appropriate choice of M n. In this step typically K-means is used, though Gaussian mixture models may (and perhaps should) be used, as they enjoy, unlike K-means, a suitably-deﬁned statistical consistency guarantee in the inﬁnite sample limit (Athreya et al., 2017). Spectral clustering relaxes a graph-cut problem: for a collection of subsets X1, . . . , XK X, the corresponding normalized cut is Ncut(X1, . . . , XK) = PK k=1 cut(Xk, Xc k)/vol(Xk),where cut(A, B) = P

xi A,xj B Wij, vol(A) = P

xi A Pn j=1 Wij. Minimizing Ncut yields clusters that are simultaneously separated and balanced (Shi and Malik, 2000). This NP-hard problem may be relaxed by analyzing the ﬁrst K eigenvectors of LSYM (Shi and Malik, 2000; Ng et al., 2002), or via a semideﬁnite programming problem (Ling and Strohmer, 2018). Weaknesses of spectral clustering were scrutinized by Nadler and Galun (2007). They show the top eigenvectors of the random walk matrix P deﬁned on {xi}n i=1 sampled from p(x) proportional to e U(x)/2 for some potential function U(x) converge under a suitable scaling as n to the top eigenfunctions of the Fokker-Planck operator Lψ(x) = ψ ψ U = µψ(x). The characteristic time scales of the stochastic diﬀerential equation (SDE) x(t) = U(x)+

2 w(t), w a Wiener process, determine

Maggioni and Murphy

-14 -12 -10 -8 -6 -4 -2 0 2

(a) Data to Cluster (Nadler and Galun, 2007)

-14 -12 -10 -8 -6 -4 -2 0 2

(b) Spectral clustering (Shi and Malik, 2000)

-14 -12 -10 -8 -6 -4 -2 0 2

(c) Spectral clustering (Ng et al., 2002)

-16 -14 -12 -10 -8 -6 -4 -2 0 2 4 -8

(d) Eigenvector 2

-16 -14 -12 -10 -8 -6 -4 -2 0 2 4 -8

(e) Eigenvector 3

-16 -14 -12 -10 -8 -6 -4 -2 0 2 4 -8

(f) Eigenvector 4

-16 -14 -12 -10 -8 -6 -4 -2 0 2 4 -8

(g) Eigenvector 5

-16 -14 -12 -10 -8 -6 -4 -2 0 2 4 -8

(h) Eigenvector 6

Figure 1: In (a), three Gaussians of essentially the same density are shown. Results of spectral clustering are shown in (b) (Shi and Malik, 2000) and (c) (Ng et al., 2002). In (d) - (h), the ﬁrst ﬁve non-trivial eigenvectors are shown. As noted by Nadler and Galun (2007), the underlying density for this data yields a Fokker-Planck operator whose low-energy eigenfunctions cannot distinguish between the two smaller clusters, thus preventing spectral clustering from succeeding: higher energy eigenfunctions are required. For this example, the sixth non-trivial eigenvector localizes suﬃciently on the small clusters to allow for correct determination of the cluster structure; this eigenvector is not used in traditional spectral clustering algorithms. the structure of the leading eigenfunctions of L (Gardiner, 2009): they correspond to the time scales of the slowest transitions between diﬀerent clusters and the equilibrium times within clusters. The relationships between these quantities determine which eigenfunctions of L (or P) reveal the cluster structure in the data. Gavish and Nadler (2013) further analyze related connections between normalized cuts and cluster exit times. Nadler and Galun (2007) present data which cannot be clustered with spectral clustering (Shi and Malik, 2000; Ng et al., 2002); see Figure 1.

2.2 Background on Diﬀusion Distances

One of the primary tools in the proposed clustering algorithm is diﬀusion distances, a class of data-dependent distances computed by constructing Markov processes on data that capture its intrinsic structure (Coifman et al., 2005; Coifman and Lafon, 2006; Lafon et al., 2006; Coifman et al., 2008; Singer and Coifman, 2008; Rohrdanz et al., 2011; Zheng et al., 2011; Lederman and Talmon, 2018; Lederman et al., 2015; Czaja et al., 2016; Li et al., 2017). We consider diﬀusion on the point cloud X = {xi}n i=1 RD via a Markov chain (Levin et al., 2009) with state space X. Let P be the corresponding n n transition matrix. The following shall be referred to as the usual assumptions on P: P is reversible, irreducible, aperiodic, and therefore ergodic. A common construction for P, and the one we consider in the algorithmic sections of this article, is to ﬁrst compute a weight matrix W, where Wij = exp( xi xj 2 2/σ2), i =

Learning by Unsupervised Nonlinear Diffusion

j for some appropriate scale parameter σ (0, ). The parameter σ encodes the interaction radius of each point: σ large allows for long-range interactions between points that are ℓ2-far, while σ small allows only for short-range interactions. Then P = D 1W, where D is the diagonal degree matrix with Dii = Pn ℓ=1 Wij. This1

row-normalizes P: Pn j=1 Pij = 1, i = 1, . . . , n. Since it is ergodic, P has a unique stationary distribution π satisfying πP = π, given by πi = Dii/ Pn j=1 Djj. Diﬀusion processes on graphs lead to a data-dependent notion of distance, known as diﬀusion distance (Coifman et al., 2005; Coifman and Lafon, 2006). While the focus of the construction is on diﬀusion distances and the diﬀusion process itself, we mention that diﬀusion maps provide a way of eﬃciently computing and visualizing large-time diﬀusion distances in Euclidean space, and at the same time may be understood as a type of nonlinear dimension reduction, in which data in a high number of dimensions may be embedded in a low-dimensional space by a nonlinear coordinate transformation. In this regard, diﬀusion maps are related to nonlinear dimension reduction techniques such as Isomap (Tenenbaum et al., 2000), Laplacian eigenmaps (Belkin and Niyogi, 2003), and local linear embedding (Roweis and Saul, 2000), kernel PCA (with a data-adapted kernel), among many others.

Deﬁnition 2.1 Let X = {xi}n i=1 RD and let P be a Markov process on X satisfying the usual assumptions and with stationary distribution π. Let π0 be a probability distribution on X. For points xi, xj X, let pt(xi, xj) = (Pt)ij, for some t [0, ). The diﬀusion distance at time t between x, y X is deﬁned, for ν = π0/π, by

Dt(x, y) = s X

u X (pt(x, u) pt(y, u))2 ν(u) = ||pt(x, ) pt(y, )||ℓ2(ν) .

If the underlying graph is generated from data sampled from a low-dimensional manifold, then diﬀusion distance parametrizes this low-dimensional structure (Coifman et al., 2005; Jones et al., 2008; Singer et al., 2009; Singer and Wu, 2012, 2016; Talmon and Wu, 2018). Indeed, diﬀusion distances admit a formulation in terms of the (right) eigenfunctions of P:

ℓ=1 λ2t ℓ(ψℓ(x) ψℓ(y))2, (2.2)

where {(ψℓ, λℓ)}n ℓ=1 are the right eigenpairs of P, ordered so that 1 = λ1 > λ2 λ3 λn > 1, and noting that ψ1 is constant by construction. Diﬀusion distances are parametrized by t, which measures how long the diﬀusion process on G has run. Small t allows a small amount of diﬀusion, which may prevent the interesting geometry of X from being discovered, but provides detailed, ﬁne scale

1. Note that with some abuse of notation we denote the entries of P by Pij, reserving the notation Pij for block submatrices of P that will be introduced and used later.

Maggioni and Murphy

information. Large t allows the diﬀusion process to run for so long that the ﬁne geometry may be washed out, leaving only coarse scale information. We will relate properties of clustered data X to t.

3. Data Model and Overview of Main Results

Among the main results of this article are suﬃcient conditions for clustering certain discrete data X RD. The data X is modeled as a realization from a probability distribution

k=1 wkµk, wk 0,

k=1 wk = 1, (3.1)

where each µk is a probability measure. Intuitively, our results require separation and cohesion conditions on {µk}K k=1. That is, each µk is far from µk , k = k and connections are strong (in a suitable sense) within each µk. X = {xi}n i=1 is generated by drawing, for each i, one of the K clusters, say ki, according to the multinomial distribution with parameters (w1, . . . , w K), and then drawing xi from µki. The clusters in the data are deﬁned as the subsets of X whose samples were drawn from a particular µk, that is, we deﬁne the cluster Xk := {xi X : ki = k}. Given X, the goal of clustering is to estimate these Xk, and to estimate K. Throughout the theoretical analysis of this article, we will deﬁne the accuracy of a set of labels {Yi}n i=1, Yi {1, . . . , K}, learned from an unsupervised algorithm to be |{i | Yi = ki}|/n, i.e. the proportion of points correctly labeled. The model (3.1) is nonparametric and makes few explicit assumptions on µ. We will allow µk to be supported near a non-linear set (e.g. a nonconvex subset, or a submanifold in RD) and be multimodal (i.e. with multiple high-density regions). These features may cause prominent clustering methods to fail, e.g. K-means, which requires near-spherical or well-separated clusters (Dasgupta and Schulman, 2007; Mixon et al., 2017); Gaussian mixture models, which handle well spherical and ellipsoidal clusters, but may struggle with clusters exhibiting diﬀerent, non-elliptical geometries (for example those shown in Figure 2 (b)); spectral clustering, which can fail for highly elongated clusters or clusters of diﬀerent sizes and densities; density-based methods, which are sensitive to noise and clusters of very diﬀerent densities, and may require care in setting parameters or implementing adaptive parameters. Two simple, motivating examples are in Figure 2. They feature variable densities, variable levels of connectivity, both within and across clusters, and (for the second example) nonlinear cluster shapes. The estimates for the behavior of diﬀusion distances that we derive will then be leveraged to prove that the LUND clustering scheme correctly labels the points and also correctly estimates the number of clusters, while other clustering schemes fail to cluster these data sets correctly.

Learning by Unsupervised Nonlinear Diffusion

-1.5 -1 -0.5 0 0.5 1 1.5 -1

(a) Bottleneck data

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2 2.5 -2

(b) Nonlinear data

200 400 600 800 1000 1200 1400

(c) P for data in (a).

200 400 600 800 1000 1200 1400 1600

(d) P for data in (b).

Figure 2: Subﬁgures (a) and (b) show two data sets linear and nonlinear colored by log10(p(x)), where p(x) is the empirical density. In (c), (d), we show the corresponding Markov transition matrices P, with entry magnitudes shown in log10 scale. We sorted the rows and columns so that the structure in these matrices become apparent (of course, the algorithms are independent of the sorting). The Markov chains are ergodic, but close to being reducible. Indeed, these transition matrices show hierarchical structure, with large within-cluster transitions and uniformly small probabilities of transition between clusters. The analysis of Section 4 makes this intuition precise. The transition matrices were constructed using the Gaussian kernel as in Section 2.2, with Euclidean distances and σ = .25 3.1 Summary of Main Results

Our ﬁrst result shows that the within-cluster and between-cluster diﬀusion distances can be controlled, as soon as Pt is approximately block constant in some sense. Deﬁne the (worst-case) within-cluster and between-cluster diﬀusion distances as:

Din t = max k max x,y Xk Dt(x, y), Dbtw t = min k =k min x Xk,y Xk Dt(x, y) . (3.2)

Our results guarantee that Din t is small and Dbtw t is large in terms of a constant ϵ, for all t in some time interval T . The interval T depends on both data-driven quantities of the matrix P (which may be understood as geometrically intrinsic to the underlying cluster structure), as well as ϵ. This simpliﬁcation of Theorem 4.12 holds:

Theorem 3.3 Let X = SK k=1 Xk and let P be a corresponding Markov transition matrix on X, inducing diﬀusion distances Dt. Then there exist non-negative constants {Ci}5 i=1 such that the following holds: for any ϵ > 0 we have

< t < C3ϵ = Din t C4ϵ, Dbtw t C5 C4ϵ.

The constants {Ci}5 i=1 are deﬁned precisely in Section 4, but to get a sense of them, let S be an idealized version of P in which all edges between clusters are deleted and redirected back into the cluster; this will be made precise by the notion of stochastic complement in Deﬁnition 4.7. Let S be a block diagonal matrix consisting of K rank 1 blocks, corresponding to the equilibrium transition matrices of the the blocks of S. The constants {Ci}5 i=1 may be interpreted as follows:

C1, C2 are related to the compactness or irreducibility of the clusters Xk, in terms of the rank of each block of S. In particular, if S is constant on each

Maggioni and Murphy

diagonal block corresponding to a cluster, then C1 = 0 and C2 = O(1), independently of n.

C3 is related to the transition probabilities between clusters in P. In the idealized case that there are no transitions between any of the clusters, C3 = . More generally, if the probabilities of transitions between clusters are small, then C3 will be large.

C4 is related to the balance of S relative to P. If each row of S P is constant, then C4 = 1/ n.

C5 is related to how balanced the equilibrium distributions are on each cluster. If S is block constant with blocks of the same size, then C5 = 1/ n.

The time interval T := [C1 ln C2

ϵ , C3ϵ] depends on ϵ, and there is tension in the role of ϵ between the condition t T and the conclusion that Din t C4ϵ, Dbtw t C5 C4ϵ. For large ϵ, T may be quite wide, but the conclusion on diﬀusion distances is weak (or even trivial if ϵ > C5/C4). On the other hand, ϵ small induces strong separation between within-cluster diﬀusion distances and between-cluster diﬀusion distances, at the expense of shrinking T . Indeed, for 0 < C1, C2, C3 < ﬁxed, T shrinks to the empty set as ϵ 0+. Suppose indeed that the each cluster is compact (C1 small, C2 = O(1)), the clusters are well-separated (C3 large), and balanced in the sense that S P is constant and S is block-constant with blocks of the same size (C4 = C5 = 1/ n). Then for ϵ > 0,

, C3ϵ = Dbtw t Din t C5 C4ϵ

which suggests strong separation with diﬀusion distances independently of n when ϵ is small. In particular, as the clusters become more separated (C3 increases) the time interval on which the Dbtw t /Din t remains large widens on the right. Similarly, the more compact or irreducible the clusters become (C1 becomes smaller), the wider the interval becomes on the left. In the ideal case that C1 = 0 (clusters are localized at single points), C3 = (inﬁnite separation between clusters), the ratio Dbtw t /Din t = for all t, due to the fact that Dbtw t is bounded away from 0 for all t while Din t = 0. We remark that as long as ϵ can be taken suﬃciently small, due to the geometric properties of the data as determined by C1, C2, C3, the lower bound on Dbtw t is positive. In the idealized case when C4 = C5 = 1/ n, then the lower bound is positive as long as ϵ < 1. We note that C1, C2, C3 are close to geometrically intrinsic, as will be discussed in Section 4.1. The LUND scheme characterizes modes of the clusters {Xk}K k=1 as high-density points that are far in diﬀusion distance from other points of high-density, regardless of the shape of the support of the distribution. Once modes are learned, remaining points are subsequently assigned to a mode in an iterative fashion. To detect the modes,

Learning by Unsupervised Nonlinear Diffusion

we introduce two quantities: p(x) is related to data density, while ρt(x) is related to diﬀusion geometry. Let p be a kernel density estimator (KDE) on X, for example p(x) = 1

y NN(x) exp( x y 2 2/σ2 0), for some choice of σ0 and set of nearest neighbors NN(x), normalized by Z so that P

x X p(x) = 1. Given Dt deﬁned on X, let

min y X{Dt(x, y) | p(y) p(x)}, x = arg max y X p(y),

max y X Dt(x, y), x = arg max y X p(y). (3.4)

The function ρt measures the diﬀusion distance of a point to its Dt-nearest neighbor of higher empirical density. LUND proceeds by analyzing

Dt(x) = p(x)ρt(x),

which is large only for high-density points that are far from their nearest diﬀusion neighbor of higher density. The function Dt serves two important purposes. First, its decay estimates the number of clusters in the data. Indeed, as will be shown in Section 5, under a ﬂexible data model, Dt has K very large values with the rest very small, where K is the number of latent clusters. Second, the modes of the data are estimated as the maximizers of Dt. Once these modes have been learned, they are given unique labels. Then, in order of decreasing density, points are assigned the same label as their Dt-nearest neighbor of higher density. In this sense, the labels of the modes are distributed from high to low density to the rest of the data. The LUND algorithm is detailed in Algorithm 1. A simpler variant of LUND, when K is known a priori, is detailed in Algorithm 2. Functions of p(x), ρt(x) other than multiplication could be used to construct Dt(x). The primary reason to consider the multiplication of these factors is to gain robustness to outliers. Indeed, an outlying point xo may be very far from all other points in diﬀusion distance, simply because its Euclidean coordinates are very far from the rest of X. In this case, one would have ρt(xo) very large, and p(xo) very small. By constructing Dt as the product of the density and diﬀusion geometric measurements, Dt(xo) = p(xo)ρt(xo) is not large (under a suitable regime of variation between p(xo) and ρt(xo)), ensuring that a far outlier is not be selected as a data mode. More precisely, suppose that diﬀusion distances and p(x) are computed with the same choice of scaling parameter σ and collection of nearest neighbors, so that the stationary distribution of P is equal to p: for all x X, p(x) = π(x). Suppose that X is ﬁxed except for the outlier point xo, which we assume to be the element of X with lowest empirical density. Letting x NN o be the nearest neighbor of xo of higher empirical density and π0 an arbitrary initial distribution,

Dt(xo) = ρt(xo)p(xo) =

ℓ=1 (pt(xo, xℓ) pt(x NN o , xℓ))2 π0(xℓ)π(xo)2

Maggioni and Murphy

Algorithm 1 Learning by Unsupervised Nonlinear Diﬀusion (LUND) Algorithm Input: X (data), σ0 (kernel density bandwidth), σ (diﬀusion scaling parameter), t (time parameter), τ (threshold) Output: Y (cluster assignments), ˆK (estimated number of clusters)

1: Build Markov transition matrix P using scale parameter σ. 2: Compute KDE p(x) for all x X using kernel bandwidth σ0. 3: Compute ρt(x) for all x X. 4: Compute Dt(x) = ρt(x)p(x) for all x X. 5: Sort X according to Dt(x) in descending order as {xmi}n i=1, n = |X|.

6: Compute ˆK = inf k Dt(xmk) Dt(xmk+1) > τ .

7: Assign Y (xmi) = i, i = 1, . . . , ˆK, and Y (xmi) = 0, i = ˆK + 1, . . . , n. 8: Sort X according to p(x) in decreasing order as {xℓi}n i=1. 9: for i = 1 : n do 10: if Y (xℓi)=0 then 11: Y (xℓi) = Y (x ), x = arg miny{Dt(xℓi, y) | p(y) p(xℓi) and y is labeled}. 12: end if 13: end for Noting lim xo 2 π(xo) = 0, it follows that lim xo 2 Dt(xo) = 0, so that as outliers move farther away from the rest of the data, they become less likely to be detected as modes. We remark that if outlier detection is performed on the data as a pre-processing step, this problem is less signiﬁcant, since densities become more comparable across points. In this case, other constructions for Dt may be suﬃciently robust, for example constructions that are additive in p, ρt. The LUND algorithm combines density estimation (as captured by p) with diﬀusion geometry (as captured by ρt). The crucial parameter of LUND is the time parameter, which determines the diﬀusion distance Dt used. Theorem 3.3 may be used to show that there is a range of t for which applying the proposed LUND algorithm is provably accurate. The ﬁrst concern is to understand conditions guaranteeing these modes are estimated accurately, the second that all other points are consequently labeled correctly. The following result summarizes Corollaries 5.4, 5.5, corresponding to the case when K is unknown a priori and must be estimated (as in Algorithm 1) or is known a priori (as in Algorithm 2).

Theorem 3.5 Suppose X = SK k=1 Xk as above. Let M be the set of cluster density maxima: M = {p(x) | k {1, 2, . . . , K} such that x = arg max y Xk p(y)}.

(a) Let {xmi}n i=1 be the points {xi}n i=1, sorted so that Dt(xm1) Dt(xm2) Dt(xmn). Then Algorithm 1 correctly estimates K and labels all points correctly

Learning by Unsupervised Nonlinear Diffusion

Algorithm 2 LUND Algorithm, K Known Input: X (data), σ0 (kernel density bandwidth), σ (diﬀusion scaling parameter), t (time parameter), K (number of clusters) Output: Y (cluster assignments)

1: Build Markov transition matrix P using scale parameter σ. 2: Compute a KDE p(x) for all x X using kernel bandwidth σ0. 3: Compute ρt(x) for all x X. 4: Compute Dt(x) = ρt(x)p(x) for all x X. 5: Sort X according to Dt(x) in descending order as {xmi}n i=1, n = |X|. 6: Assign Y (xmi) = i, i = 1, . . . , K, and Y (xmi) = 0, i = K + 1, . . . , n. 7: Sort X according to p(x) in decreasing order as {xℓi}n i=1. 8: for i = 1 : n do 9: if Y (xℓi)=0 then 10: Y (xℓi) = Y (x ), x = arg miny{Dt(xℓi, y) | p(y) p(xℓi) and y is labeled}. 11: end if 12: end for

for any τ satisfying

min(M) maxi=1, ,K ρt(xmi)

mini=1, ,K ρt(xmi) < τ < min(M)

max(M) Dbtw t Din t . (3.6)

(b) If K is known a priori, then Algorithm 2 labels all points accurately provided that

Din t Dbtw t < min(M)

max(M) . (3.7)

Theorem 3.3 suggests that the conditions (3.6), (3.7) will hold for a wide range of t, depending on max(M)/ min(M) and the underlying data geometry. Indeed, in the case that C4 = C5 = 1/ n, and the clusters are irreducible (C1 small) and wellseparated (C3 large), the time interval guaranteeing Din t /Dbtw t < ϵ is [C1 ln(C2

ϵ ), C3ϵ], which is wide even for ϵ small. Indeed, setting ϵ = min(M)/(2 max(M)) is suﬃcient to guarantee (3.7) holds for a wide range of t (always for C1 suﬃciently small and C3 suﬃciently large). Of course, the smaller the ratio min(M)/ max(M), the harder (3.7) is to satisfy. Together with Theorem 3.5, this implies the proposed method (Algorithm 1) correctly labels the data and estimates the number of clusters K correctly. Note that (3.7) implicitly relates the density of the separate clusters to their geometric properties. Indeed, if the clusters are well-separated and cohesive enough, then Din t /Dbtw t is very small, and a large discrepancy in the density of the clusters can be tolerated in inequality (3.7). Note that Din t , Dbtw t , min(M), and max(M) are invariant to increasing n, as long as the scale parameter in the kernel used for constructing diﬀusion distances and the KDE adjusts according to standard convergence results for graph

Maggioni and Murphy

Laplacians (Belkin and Niyogi, 2005, 2007; Garcia Trillos et al., 2016, 2018). In this sense these quantities are properties of the mixture model (3.1), and neither of n nor the scale parameter of the diﬀusion process σ. We note moreover that Theorem 3.5 suggests t must be taken in a mesoscopic range, that is, suﬃciently far from 0 but also bounded. Indeed, for t small, Din t is not necessarily small, as the Markov process has not mixed locally yet. For t large, Pt

is close to global stationarity, and the ratio Din t /Dbtw t is not necessarily small, since Dbtw t will be small. In this case, clusters would only be detectable based on density, requiring thresholding, which is susceptible to spurious identiﬁcation of regions around local density maxima as clusters. We remark that a LUND prototype adapted to image data was proposed for the empirical study of high-dimensional hyperspectral images by Murphy and Maggioni (2018a,b, 2019b), where it is shown to enjoy competitive performance with state-ofthe-art clustering algorithms on speciﬁc data sets. The LUND algorithm presented in the present work is more general and, in contrast with earlier related methods, appropriate for general point cloud data, not just images.

3.2 Comparisons with Related Clustering Algorithms

LUND combines graph-based methods with density-based methods, and it is therefore natural to compare it with spectral clustering and FSFDPC among other methods.

3.2.1 Comparison with Spectral Clustering

The normalized graph-cut problem in spectral clustering is related to the probability of transitioning between clusters in one time step (Meila and Shi, 2001). LUND uses intermediate time scales to separate clusters, namely the time scale at which the random walk has almost reached the stationary distribution conditioned on not leaving a cluster, and has not yet transitioned (with sizable probability) to a diﬀerent cluster. Spectral clustering enjoys performance guarantees under a range of model assumptions (Chen and Lerman, 2009a,b; Arias-Castro, 2011; Arias-Castro et al., 2011; Vidal, 2011; Zhang et al., 2012; Elhamifar and Vidal, 2013; Wang et al., 2015; Soltanolkotabi et al., 2014; Arias-Castro et al., 2017; Little et al., 2017). Under nonparametric assumptions on (3.1) with K = 2, Shi et al. (2009) show that the principal eigenfunctions and eigenvalues of the associated kernel operator K(f)(x) = R K(x, y)f(y)dµ(y) are closely approximated by the principal spectra of the kernel operators Ki(f)(x) = R K(x, y)f(y)dµi(y), i = 1, 2, possibly mixed up, depending on the spectra of K1, K2 and the weights w1, w2. This allows for the number of classes to be estimated accurately in some situations, and for points to be labeled by determining which distribution certain eigenvectors come from. The related work of Schiebinger et al. (2015) provides suﬃcient conditions under the nonparametric model (3.1) for the low-dimensional embedding of spectral clus-

Learning by Unsupervised Nonlinear Diffusion

tering to map well-separated, coherent regions in input space to approximately orthogonal regions in the embedding space. This in turn implies that K-means clustering succeeds with high probability, thereby yielding guarantees on the accuracy of spectral clustering. These results depend on two quantities: with µ as in (3.1) and K a kernel, they deﬁne separation and cohesion quantities, respectively, as S(µ) = maxi =j S(µi, µj), Γmin(µ) = mini=1,...,K Γ(µi), where

S(µi, µj) = 1 p(X)

X K(x, y)dµi(x)dµj(y), Γ(µi) = inf S X p(X) p(S)p(Sc)

Sc K(x, y)dµi(x)dµi(y),

X K(x, y)dµ(x)dµ(y). A major result of Schiebinger et al. (2015) is that spectral clustering is accurate with high probability depending on a conﬁdence parameter β and the number of data samples n if p

K(S(µ) + C(µ))

mini=1,...K wi + 1 n + β Γ4 min(µ) , (3.8)

where C(µ) is a coupling parameter that is not germane to the present discussion. Condition (3.8) holds when the within-cluster coherence Γmin(µ) is large relative to the similarity between clusters S(µ). Fixing the separation Γmin(µ), (3.8) is more likely to hold if the clusters are relatively spherical in shape. For example, in Figure 3 we represent two data sets, each consisting of two clusters, with comparable S(µ), but substantially diﬀerent Γmin(µ). Also note that in the ﬁnite sample case when 1 n in (3.8) is non-negligible, the importance of Γmin being not too small increases. The geometric parameters S(µ), Γmin are comparable to C3 and C1 in Theorem 3.3.

-4 -2 0 2 4 6 8

(a) S(µ) = 0.0533, Γmin(µ) 0.9550.

-10 -5 0 5 10 15 -10

(b) S(µ) = 0.0523, Γmin(µ) 0.2560.

Figure 3: In (a) and (b) two diﬀerent mixtures of Gaussians are shown. The two mixtures have roughly the same measure of between-cluster distance S(µ), but signiﬁcantly diﬀerent within-cluster coherence Γmin(µ). Spectral clustering will enjoy much stronger performance guarantees, according to Schiebinger et al. (2015), for the data in (a) compared to the data in (b), for a range of relevant choices of the parameter σ.

Maggioni and Murphy

It is of related interest to compare LUND to spectral clustering by recalling (2.2). In the generic case that λ2 > λ3, the (ψ2(x) ψ2(y))2 term dominates asymptotically as t . Hence, as t , LUND bears resemblance to spectral clustering with the second eigenvector alone (Shi and Malik, 2000). On the other extreme, for t = 0, diﬀusion distances depend on all eigenvectors equally. Using the ﬁrst K or the 2nd

through (K + 1)st eigenvectors ψl is the basis for many spectral clustering algorithms (Ng et al., 2002; Schiebinger et al., 2015), and is comparable to LUND for t = 0, combined with a truncation of (2.2). Note that clustering with the kernel K alone relates to using all eigenvectors and t = 1. By allowing t to be a tunable parameter, LUND interpolates between the extremes of the K principal eigenvectors equally (t = 0 and truncating the eigendecomposition after the Kth or (K +1)st eigenvector), using the kernel matrix (t = 1), and using only the second eigenvector (t ). The results of Section 6 validate the importance of this ﬂexibility. An additional challenge when using spectral clustering is to robustly estimate K. The eigengap ˆK = arg maxi λi+1 λi is a commonly used heuristic, but is often ineﬀective when Euclidean distances are used in the case of non-spherical clusters (Arias-Castro, 2011; Little et al., 2017). In contrast, Theorem 3.5 suggests LUND can robustly estimate K, which is shown empirically for synthetic data in Section 6. It is also of interest to compare the guarantees of Theorem 3.3 to the analysis of spectral clustering of Nadler and Galun (2007); see Section 2.1.4. In particular, the guarantees of LUND require balancing two quantities: the within cluster mixing times and the between-cluster transition probabilities. These are analyzed precisely in Section 4, and are quantiﬁed in Theorem 3.3 by C1 (within cluster mixing propensity) and C3 (between cluster transition propensity). In the framework of continuous Fokker Planck equations, these notions are closely related to relaxation time and ﬁrst exit time, respectively. Nadler and Galun (2007) argue that as long as the ﬁrst passage exit time is greater than the relaxation time within a cluster, for all clusters, then spectral clustering has a hope of achieving good results. The LUND algorithm relies on a similar fundamental observation, and the delicate balance between these two notions (within cluster mixing and between cluster transitions) are analyzed in a precise, quantitative sense for discrete Markov chains in Section 4, leading in Theorem 4.12 to a guarantee on the behavior of diﬀusion distances in terms of these quantities. Computationally, LUND and spectral methods are essentially the same, with the bottleneck in complexity being either the spectral decomposition of a dense n n matrix (O(Mn2) where M is the number of eigenvectors sought), or the computation of nearest neighbors when using a sparse diﬀusion operator or Laplacian (using an indexing structure for a fast nearest neighbors search, this is O(Cd Dn log(n)), where d is the intrinsic dimension of the data).

3.2.2 Comparison With Local Graph Cutting Algorithms

The LUND algorithm bears some resemblance to local graph cutting algorithms (Spielman and Teng, 2004; Andersen et al., 2006, 2008; Andersen and Peres, 2009;

Learning by Unsupervised Nonlinear Diffusion

Spielman and Teng, 2013, 2014; Yin et al., 2017; Fountoulakis et al., 2017). These methods compute a cluster C around a given vertex v such that the conductance of C is high (see Deﬁnition 4.3), and which can be computed in sublinear time with respect to the total number of vertices in the graph n, and in linear time with respect to |C|. In order to avoid an algorithm that scales linearly (or worse) in n, global features such as eigenvectors of a Markov transition matrix or graph Laplacian deﬁned on the data must be avoided. The Nibble algorithm (Spielman and Teng, 2013) and related methods (Andersen and Peres, 2009) compute approximate random walks for points nearby v, and truncate steps that take the random walker too far from already explored points. This accounts for the most important steps a random walker would take, and avoids considering all n vertices of the graph. In this sense, Nibble and related methods focus on local diﬀusion in order to compute a local cluster around a prioritized vertex v, while LUND focuses on both ﬁnding good starting points and a globally consistent partition of the whole graph, using nonlinear, typically large-time diﬀusion to uncover multitemporal structure.

3.2.3 Comparison with FSFDPC

The FSFDPC algorithm (Rodriguez and Laio, 2014) learns the modes of clusters in a manner similar to the method proposed in this article. In FSFDPC, the diﬀusion distance-based quantity ρt is replaced with a corresponding Euclidean distance-based quantity:

min y X{ x y 2 | p(y) p(x)}, x = arg max y X p(y),

max y X x y 2, x = arg max y X p(y).

Moreover, the modes are estimated using only ρEuc(x), rather than Dt(x) = p(x)ρt(x) as proposed in the LUND algorithm. As in LUND, FSFDPC iteratively assigns points the same label as their nearest Euclidean neighbor (LUND uses diﬀusion nearest neighbor) of higher density. The diﬀerences between LUND and FSFDPC are fundamental. Theoretical guarantees for the FDFDPC using Euclidean distances do not accommodate a rich class of distributions and the guarantees proved in this article fail when using ρEuc(x) (as in Rodriguez and Laio (2014)) or Deuc(x) = p(x)ρEuc(x) for computing modes. This is because for clusters that are multimodal or supported near non-spherical sets, there is no reason for high-density regions of one cluster to be well-separated in Euclidean distance. In Section 6, we shall see FSFDPC fails for the motivating data in Section 3. Moreover, the use of the product Dt(x) = p(x)ρt(x) to determine modes gives LUND robustness to outliers that FSFDPC lacks. Indeed, outlying points may admit very large ρt(x) or ρEuc(x) values, but very small p(x) values. In this sense, the density factor in Dt ensures that outlying points are not labeled as modes. This tension between geometry and density is highlighted in Theorem 3.5.

Maggioni and Murphy

In addition, the LUND algorithm is able to correctly estimate the number of clusters in the data, even for nonlinear or elongated clusters, using the ratio (or decay) of Dt(x). A similar criterion for FSFDPC is not available for clusters that are nonlinear or elongated, due to the fact that high density regions connected by many paths in the data may be very far apart in Euclidean distance, leading heuristics based on the decay of ρEuc(x) or Deuc(x) to fail; see Section 6. We remark that for simple, spherical data sets, using these heuristics for FSFDPC may work well; this is observed for isotropic Gaussian data in Section 6.3.

3.2.4 Comparison with Single Linkage Clustering

LUND is related to SLC in the sense that the underlying density of the data is an important determinant in the clusterings. However, LUND also incorporates geometric structure in the data when determining clusters, which can be especially powerful when density is uninformative. In Figure 4, data with constant density but signiﬁcant geometric structure is analyzed. LUND succeeds in learning an accurate clustering with respect to the latent data geometry, while SLC fails. A theoretical analysis of the bottleneck phenomenon is presented in Section 4.3.

4. Analysis of Diﬀusion Processes on Data

In this section, we derive estimates for diﬀusion distances. Let Din t , Dbtw t be as in (3.2). The main result of this section is to show there exists a time interval T [0, ] so that t T , Dbtw t > Din t , that is, for t T , within cluster diﬀusion distance is smaller than between cluster diﬀusion distance. Showing that within-cluster distances are small and between-cluster distances are large is essential for any clustering problem. The beneﬁt of using diﬀusion distance is its adaptability to the geometry of the data: it is possible that within cluster diﬀusion distance is less than between cluster diﬀusion distance, even in the case that the clusters are highly elongated and nonlinear. This property does not hold when points are compared with Euclidean distances or many other data-independent distances. Compared to existing methods for analyzing diﬀusion distances, the proposed method does not require an analysis of the localization properties of eigenfunctions of P, which may be challenging for small or elongated clusters (Shi et al., 2009). The proposed method has analogy to the analysis of Fokker-Planck equations of Nadler and Galun (2007), in that both approaches attempt to characterize the tension between within-cluster similarity and between-cluster separation for clustering with spectral methods. Unlike that analysis of the continuum stochastic PDE, our analysis provides explicit estimates on the behaviors of the discrete diﬀusion distances for ﬁnite data. Our analysis is based on the notions of near-reducibility and stochastic complement, which make precise the tension between within-cluster mixing and between-cluster transitions.

Learning by Unsupervised Nonlinear Diffusion

-1 0 1 2 3 4 5 6 7 8 9 -5

-1 0 1 2 3 4 5 6 7 8 9 -5

(b) Learned LUND Labels

200 400 600 800 1000 1200 1400 1600 1800

(c) Diﬀusion Distance Matrix

-1 0 1 2 3 4 5 6 7 8 9 -5

(d) Modes Learned with LUND

-1 0 1 2 3 4 5 6 7 8 9 -5

(e) Learned SLC Labels

10 20 30 40 50 60 70 80 90 100 Number of Leaves

(f) SLC Purity

Figure 4: Geometric data with roughly constant empirical density is shown in (a). The spherical clusters are connected with thin bottlenecks. In (d), the modes learned by LUND are shown in red, indicating that one mode per cluster is learned. This is because although the density is roughly constant, the thin bottlenecks force the four circular clusters to be far apart in diﬀusion distance. The corresponding labels learned by LUND are in (b), showing that the cluster structure is accurately inferred. In (c), the matrix of pairwise diﬀusion distances is shown with diﬀusion time t = 103 and diﬀusion scaling parameter σ = .5. The four blue blocks on the diagonal indicate that within-cluster diﬀusion distances are small, while between cluster diﬀusion distances are large. The connecting bottlenecks correspond to the ﬁnal rows and columns of this matrix, where the diﬀusion distances are less informative. In (e), labels learned from pruning the single linkage dendrogram at the third highest merge (producing four clusters) is shown, indicating this method is not appropriate for this data. Indeed, in (f), the purity of the hierarchical clustering is shown as a function of how many leaves (i.e. clusters) are used from the dendrogram. The purity remains low until 25 clusters are used, indicating that the single linkage dendrogram is unable to eﬃciently separate the four spherical clusters until 25 clusters are used. The fundamental reason for the failure of SLC on this data is the fact that the density is essentially constant and the data set is connected, which are the properties driving SLC. Unlike LUND, SLC does not incorporate geometric information, e.g. bottlenecks, which is quite discriminative for this data set. 4.1 Near Reducibility of Diﬀusion Processes

Let P be a Markov chain deﬁned on points X satisfying the usual assumptions with unique stationary distribution π. We will sometimes consider π as a function with domain X, other times as a vector with indices {1, . . . , |X|}.

Maggioni and Murphy

For any initial distribution π0, limt π0Pt = π and moreover for any choice of ν = π0/π, Dt(x, y) 0 uniformly as t . One can quantify the rate of this convergence by estimating the convergence rate of P to its stationary distribution.

Deﬁnition 4.1 For a discrete Markov chain with transition matrix P and stationary distribution π, the relative pointwise distance to π at time t is (t) = max i,j {1,...,n} |P t ij

The decay of (t) is regulated by the spectrum of P (Jerrum and Sinclair, 1989; Sinclair and Jerrum, 1989). Indeed, let 1 = λ1 > λ2 λn > 1 be the eigenvalues of P; note that λ2 < 1 follows from P irreducible and λn > 1 follows from P aperiodic (Chung, 1997). Let λ = maxi=2,...,n |λi| = max(|λ2|, |λn|), πmin = minx X π(x).

Theorem 4.2 (Jerrum and Sinclair, 1989; Sinclair and Jerrum, 1989) Let P be the transition matrix of a Markov chain on state space X satisfying the usual assumptions. Then (t) λt /πmin.

Instead of analyzing λ , the conductance of X may be used to bound (t).

Deﬁnition 4.3 Let G be a weighted graph on X and let S X. The conductance of S is ΦX(S) = P

xi S,xj Sc πi Pij/ min P

xi Sc πi . The conductance of G

is Φ(P) = min S G ΦX(S).

Methods for estimating the conductance of certain graphs include Poincar e estimates (Diaconis and Stroock, 1991; Diaconis and Saloﬀ-Coste, 1993) and the method of canonical paths (Jerrum and Sinclair, 1989; Sinclair and Jerrum, 1989; Aldous and Fill, 2002). These approaches estimate Φ(P) by showing that certain simple paths may be used as surrogates for generic paths in the graph. The conductance is related to λ2; see Chung (1997):

Theorem 4.4 (Cheeger s Inequality) Let G be a weighted, undirected graph with transition matrix P. Then the second eigenvalue λ2 of P satisﬁes Φ(P)2/2 1 λ2 2Φ(P).

Combining Theorem 4.2 and Cheeger s inequality relates (t) to Φ(P).

Theorem 4.5 (Jerrum and Sinclair, 1989; Sinclair and Jerrum, 1989) Let P be the transition matrix for a Markov chain on X satisfying the usual assumptions. Suppose Pii 1

2, i = 1, . . . , n. Then (t) (1 1

2Φ(P)2)t/πmin.

Learning by Unsupervised Nonlinear Diffusion

Note that any Markov chain can be made to satisfy Pii 1

2, i = 1, . . . , n, simply by replacing P with 1

2(P + I). This keeps the same stationary distribution and reduces the conductance by a factor of 1

2. Whether Theorem 4.2 or 4.5 is used, the convergence of P towards its stationary distribution is exponential, with rate determined by λ or Φ(P), that is, to how close to being reducible the chain is. This yields estimates on diﬀusion distances; indeed, for x, y X and any initial distribution π0,

Dt(x, y) =||pt(x, ) pt(y, )||ℓ2(ν) ||pt(x, ) π( )||ℓ2(ν) + ||pt(y, ) π( )||ℓ2(ν)

u X max z X |pt(z, u) π(u)|2

π(u)2 π(u)π0(u)

u X π(u)π0(u) 2 (t) 2(1 1

Thus, as t , Dt 0 uniformly at an exponential rate depending on the conductance of the underlying graph; a similar result holds for λ in place of Φ(P). This gives a global estimate on the diﬀusion distance in terms of λ and Φ(P). Note that a similar conclusion holds by analyzing (2.2), recalling that ψ1 is constant and λ2 = maxi =1 |λi| = λ . Unfortunately, a global estimate on diﬀusion distances may be too coarse for unsupervised clustering. To obtain the desired separation of Din t , Dbtw t , we need to study not the global mixing time, but the mesoscopic mixing times, corresponding to the time it takes for convergence of points in each cluster towards their mesoscopic equilibria, before reaching the global equilibrium. For this purpose, we use results from the theory of nearly reducible Markov processes (Simon and Ando, 1961; Meyer, 1989). Suppose the matrix P is irreducible; write P, possibly after a permutation of the indices of the points, in block decomposition as

P11 P12 . . . P1m P21 P22 . . . P2m ... ... ... ... Pm1 Pm2 . . . Pmm

where each Pii is square and m n. Let Ii be the indices of the points corresponding to Pii. Recall that if the graph corresponding to P is disconnected, then P is a reducible Markov chain. Recall that A = maxi P

j |Aij| is the maximal row sum of A = (Aij). Suppose that Pij is small but nonzero for i = j, that is, most of the interactions for points in Ii are contained within Pii. This suggests diﬀusion on the blocks Pii have dynamics that converge to their own, mesoscopic equilibria before the entire chain converges to a global equilibrium, depending on the weakness of connection between blocks. Interpreting the support sets Ii as corresponding to the clusters of X, this suggests there will be a time range for which points within

Maggioni and Murphy

each cluster are close in diﬀusion distance but far in diﬀusion distance from points in other clusters; such a state corresponds to a mesoscopic equilibrium. To make this precise, consider the notion of stochastic complement.

Deﬁnition 4.7 Let P be an n n irreducible Markov matrix partitioned into square block matrices as in (4.6). For a given index i {1, . . . , m}, let Pi denote the principal block submatrix generated by deleting the ith row and ith column of blocks from (4.6), and let P i = P1i P2i . . . Pi 1,i Pi+1,i . . . Pmi T and Pi = Pi1 Pi2 . . . Pi,i 1 Pi,i+1 . . . Pim . The stochastic complement of Pii is the matrix Sii = Pii + Pi (I Pi) 1P i.

One can interpret the stochastic complement Sii as the transition matrix for a reduced Markov chain obtained from the original chain, but in which transitions into or out of Ii are masked. More precisely, in the reduced chain Sii, a transition is either direct in Pii or indirect by moving ﬁrst through points outside of Pii, then back into Pii at some future time. Indeed, the term Pi (I Pi) 1P i in the deﬁnition of Sii accounts for leaving Ii (the factor Pi ), traveling for some time in Ic i (the factor (I Pi) 1), then re-entering Ii (the factor P i). Note that the factor (I Pi) 1 may be expanded in Neumann sum as (I Pi) 1 = P t=0 Pt i, showing that it accounts for exiting from Ii and returning to it after an arbitrary number of steps outside of it. The notion of stochastic complement quantiﬁes the interplay between the mesoscopic and global equilibria of P. We say P is primitive if it is non-negative, irreducible and aperiodic. The following theorem indicates how P may be analyzed when it is derived from cluster data {Xk}K k=1 sampled according to (3.1); a proof appears in the Appendix for completeness. This result, which produces estimates related to the diﬀusion operator P in the ℓ1 norm, is used to prove results on diﬀusion distances, which are deﬁned in an ℓ2 sense, partially in order to take advantage of spectral decompositions for fast computations. This discrepancy will be discussed and controlled in Section 4.2.

Theorem 4.8 (Meyer, 1989) Let P be an n n irreducible row-stochastic matrix partitioned into K2 square block matrices, and let S be the reducible row-stochastic matrix consisting of the stochastic complements of the diagonal blocks of P:

P11 P12 . . . P1K P21 P22 . . . P2K ... ... ... ... PK1 PK2 . . . PKK

S11 0 . . . 0 0 S22 . . . 0 ... ... ... ... 0 0 . . . SKK

Suppose each Sii is primitive, so that the eigenvalues of S satisfy λ1 = λ2 = = λK = 1 > λK+1 λK+2 > 1. Let Z diagonalize S, and let

S = lim t St =

1π1 0 . . . 0 0 1π2 . . . 0 ... ... ... ... 0 0 . . . 1πK

Learning by Unsupervised Nonlinear Diffusion

where πi is the stationary distribution for Sii. Then Pt S tδ + κ|λK+1|t, where δ = 2 maxi Pi and κ = Z Z 1 . Moreover, for any initial distribution π0 and s = limt π0St = π0S , π0Pt s 1 tδ + κ|λK+1|t.

Note that this result does not require the Markov chain to be reversible, and hence applies to diﬀusion processes deﬁned on directed graphs. The assumption that S is diagonalizable is not strictly necessary, and similar estimates hold more generally (Meyer, 1989). The estimate tδ + κ|λK+1|t consists of two terms. The tδ term corresponds to Pt St , which accounts for the approximation of Pt by the reducible Markov chain St. In the context of clustering, this term accounts for the between-cluster connections in P. The term κ|λK+1|t corresponds to St S , which accounts for propensity of mixing within a cluster. In the clustering context, this term quantiﬁes the withincluster distances. It follows from Theorem 4.8 that, given ϵ suﬃciently large, there is a range of t for which the dynamics of Pt are ϵ-close to the dynamics of the reducible, low-rank Markov chain S .

Corollary 4.9 Let λK+1, δ, κ be as in Theorem 4.8. Suppose that for some ϵ > 0, ln 2κ

ϵ /ln 1 |λK+1| < t < ϵ 2δ. Then Pt S < ϵ, and for every initial distribution

π0, π0Pt s 1 < ϵ.

In contrast with t, the values λK+1, δ, κ may be understood as ﬁxed geometric parameters of the data set which determine the range of times t at which mesoscopic equilibria are reached. More precisely, as n , δ, κ converge to natural continuous quantities independent of n, and Garcia Trillos et al. (2018) proved that as n , there is a natural scaling for σ 0+ in which the (random) empirical eigenvalues of P converge in a precise sense to the (deterministic) eigenvalues of a corresponding continuous operator deﬁned on the support of µ as in (3.1). Thus, the parameters of Theorem 4.9 may be understood as random ﬂuctuations of geometrically intrinsic quantities depending on µ. In the context of the proposed data model, these quantities may be interpreted as follows:

λK+1 is the largest eigenvalue of S not equal to 1. Since S is block diagonal and each Skk is primitive, it follows that λK+1 = maxk=1,...,K λ2(Skk). As discussed above, {λ2(Skk)}K k=1 is related to the conductance Φ(Skk) and the mixing time of the random walk restricted to Skk. If the entries of Skk are very close to the entries of Pkk, then a perturbative argument yields λ2(Skk) λ2(Pkk).

The quantity δ = 2 maxk=1,...,K Pk is controlled by the largest interaction between clusters. If the separation between the {Xk}K k=1 is large enough, δ will be small. To get a sense of this parameter, let C1, C2 be clusters with n/2 points uniformly sampled from the balls B1(0, 0), B1(2 + η, 0) R2. Then, for

Maggioni and Murphy

n suﬃciently large, dist(C1, C2) = minx1 C1,x2 C2 x1 x2 2 η. Note that the point in B1(0, 0) nearest B1(2 + η, 0) is (1, 0). Then modulo the variance from the random sampling,

B1(2+η,0) exp( (x, y) (1, 0) 2 2/σ2)dxdy Z

B1(2+η,0) exp( (x, y) (1, 0) 2 2/σ2)dxdy + Z

B1(0,0) exp( (x, y) (1, 0) 2 2/σ2)dxdy

2 exp( η2/σ2) exp( (η2 + 4η)/σ2) + 1.

Figure 5 illustrates empirically how δ depends on σ and dist(C1, C2) for such data.

-1 -0.5 0 0.5 1 1.5 2 2.5 3 -2

(a) Uniform, spherical clusters C1, C2 with dist(C1, C2) .1

0.01 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

(b) δ as a function of dist(C1, C2) and σ.

Figure 5: In (a), data drawn uniformly at random from the union of two balls in R2 is shown. In (b), it is shown that for such equal sized, spherical, constant density clusters, the separation parameter exhibits rapid decay as σ 0+, for ﬁxed separation. The experiments conﬁrm that the decay of δ is essentially logistic in (η2/σ2), as estimated in (4.10).

The quantity κ = Z Z 1 , with Z = (φ1| . . . |φn), is a measure of the condition number of diagonalizing S. If Z, Z 1 are orthogonal matrices, then each row of Z, Z 1 have ℓ2 norm 1, hence κ n. We remark that κ is bounded independently of n in the case that all the data live on a common manifold, using convergence of heat kernels and low-frequency eigenfunctions together with heat kernel estimates on manifolds. In the clustering setting, if each cluster is a manifold, similar results would hold in this case, albeit this analysis is a topic of ongoing research.

Learning by Unsupervised Nonlinear Diffusion

4.2 Diﬀusion Distance Estimates

Returning to the proposed data model X = SK k=1 Xk µ as per (3.1), let P be a corresponding Markov chain on X satisfying the usual assumptions. We estimate the dependence of diﬀusion distances on the parameters δ, λK+1, κ above. We also introduce a balance quantity that quantiﬁes the diﬀerence between the ℓ1 norm (the setting of Theorem 4.8) and the ℓ2 norm (the setting of diﬀusion distances). Throughout this section, let pt(xi, xj) = Pt ij, s (xi, xj) = S ij .

Deﬁnition 4.11 Let P, S Rn n be as in Theorem 4.8. Deﬁne

γ(t) = max x X

|pt(x, u) s (x, u)| pt(x, ) s (x, ) 2 1 n

Botelho-Andrade et al. (2019) show that for any vector v Rn, v 2 = cv n v 1, where

|vi| v 2 1 n

In this sense, γ(t) measures how the ℓ1 norm diﬀers from the ℓ2 norm across all rows of Pt S . In particular, when each row of Pt S is close to uniform, γ(t) is close to 1; when some row of Pt S concentrates all its mass around one index, then γ(t) = n. Note that 1 γ(t) n for all t.

Theorem 4.12 Let X = SK k=1 Xk and let P be a corresponding Markov transition matrix on X. Let δ, λK+1, κ, S be as in Theorem 4.8. Let Dt be the diﬀusion distance associated to P and counting measure ν. If t, ϵ satisfy

(a) Din t 2 ϵ nγ(t).

(b) Dbtw t 2 min y X s (y, ) ℓ2(ν) 2 ϵ nγ(t).

Proof By Corollary 4.9, Pt S < ϵ, that is, maxx X P

u X |pt(x, u) s (x, u)|ν(u) < ϵ. To see (a), let k be arbitrary and let x, y Xk. Then:

||pt(x, ) pt(y, )||ℓ2(ν) ||pt(x, ) s (x, )||ℓ2(ν) + ||pt(y, ) s (y, )||ℓ2(ν) + ||s (y, ) s (x, )||ℓ2(ν)

Maggioni and Murphy

|pt(x, u) s (x, u)| pt(x, ) s (x, ) ℓ2(ν) 1 n

2! 1 ||pt(x, ) s (x, )||ℓ1(ν)

|pt(y, u) s (y, u)| pt(y, ) s (y, ) ℓ2(ν) 1 n

2! 1 ||pt(y, ) s (y, )||ℓ1(ν)

+||s (y, ) s (x, )||ℓ2(ν)

2ϵ n max x X

|pt(x, u) s (x, u)| pt(x, ) s (x, ) ℓ2(ν) 1 n

2! 1 + ||s (y, ) s (x, )||ℓ2(ν) ,

where t satisﬁes ln 2κ

ϵ / ln 1 λK+1

< t < ϵ/(2δ). The line relating the norm in

ℓ1(ν) and ℓ2(ν) follows from Theorem 1 in Botelho-Andrade et al. (2019). Note that S has constant columns on each cluster, and in particular for x, y Xk, s (x, u) = s (y, u) = πk(u) for all u X, so that ||s (y, ) s (x, )||ℓ2(ν) = 0. Statement (a) follows. To see (b), suppose that x Xk, y Xℓ, k = ℓ. Then

pt(x, ) pt(y, ) ℓ2(ν) = pt(x, ) s (x, ) + s (x, ) s (y, ) + s (y, ) pt(y, ) ℓ2(ν) s (x, ) s (y, ) ℓ2(ν) pt(x, ) s (x, ) ℓ2(ν) pt(y, ) s (y, ) ℓ2(ν) = s (x, ) s (y, ) ℓ2(ν)

|pt(x, u) s (x, u)| pt(x, ) s (x, ) ℓ2(ν) 1 n

2! 1 ||pt(x, ) s (x, )||ℓ1(ν)

|pt(y, u) s (y, u)| pt(y, ) s (y, ) ℓ2(ν) 1 n

2! 1 ||pt(y, ) s (y, )||ℓ1(ν)

s (x, ) s (y, ) ℓ2(ν) ϵ n

|pt(x, u) s (x, u)| pt(x, ) s (x, ) ℓ2(ν) 1 n

|pt(y, u) s (y, u)| pt(y, ) s (y, ) ℓ2(ν) 1 n

s (x, ) s (y, ) ℓ2(ν) 2 ϵ n max z Xℓ Xk

|pt(z, u) s (z, u)| pt(z, ) s (z, ) ℓ2(ν) 1 n

2 min w Xk Xℓ s (w, ) ℓ2(ν) 2 ϵ n max z Xℓ Xk

|pt(z, u) s (z, u)| pt(z, ) s (z, ) ℓ2(ν) 1 n

where in the last step, to lower bound the ﬁrst term we used that s (y, ) = πl( ), s (x, ) = πk( ), and recalled that since k = l the supports of s (x, ) and s (y, )

Learning by Unsupervised Nonlinear Diffusion

are disjoint. Minimizing this lower bound over all clusters Xk, Xℓyields the desired result.

Heuristically, if ϵ is small and the reduced equilibrium distribution s is roughly constant on each cluster, there will be a range of t for which Din t Dbtw t . The notion of s being roughly constant on each cluster is equivalent to nodes in the same cluster having roughly constant degree. These theoretical estimates are compared to empirical bounds computed numerically in Section 6. If P is very close to S in Frobenius norm, then pt(x, y) is very close to s (x, y) and ϵ may be taken close to 0. In particular, for the ideal case ϵ = 0, the estimates of Theorem 4.12 reduce to

Din t = 0 , Dbtw t 2 min y X s (y, ) ℓ2(ν). (4.13)

One can deﬁne a natural notion of diﬀusion distance between disjoint clusters in a reducible Markov chain as the sum of the ℓ2 norms of their respective stationary distribution, which agrees with both the deﬁnition of diﬀusion distances upon taking the limit t + and with the lower bound (b) in Theorem 4.12 when ϵ 0+. Hence, while the estimates in the proof of Theorem 4.12 may not be optimal, they are quite natural for ϵ 0+. Away from the asymptotic regime ϵ 0+, the estimates of Theorem 4.12 may be further simpliﬁed by placing additional assumptions on the data. Indeed, if the equilibrium distributions in S are balanced and uniform, the following result holds:

Corollary 4.14 Suppose that s is uniform on each Xk, and the cardinality of each Xk is n/K. Then for any t, ϵ satisfying ln 2κ

ϵ ln 1 λK+1

< t < ϵ 2δ,

Din t 2 nϵγ(t) , Dbtw t 2 n

Proof If S has constant rows on each cluster (i.e. the stationary distribution on each cluster of the reduced Markov chain is uniform), and the clusters are of constant size n/K, then 2 miny X s (y, ) ℓ2(ν) = 2 p

K/n. Then Theorem 4.12 yields

Dbtw t 2 min y X s (y, ) ℓ2(ν) 2 ϵ nγ(t)

n 2 ϵ nγ(t)

Maggioni and Murphy

In particular, if ϵ

K/(2γ(t)), within cluster distances will be small since Din t p

K/n, and also there will be clear separation between clusters since Dbtw t = Ω( p

K/n). Note that when Pt S is balanced, γ(t) is O(1) with respect to n, so that the assumption on ϵ is independent of n. We remark that in general if S and Pt S are balanced in the sense of having approximately uniform rows, then miny X s (y, ) ℓ2(ν) scales like 1/ n and γ(t) = O(1), respectively. In this (generic) case, the bounds of Theorem 4.12 are on the order 1/ n. This is natural, since diﬀusion distances are computed in the ℓ2 sense, while Pt and S have rows with ℓ1 norm equal to 1.

4.3 Relaxing Separation Between Clusters

The analysis of diﬀusion distances in Section 4.2 depends on δ, a parameter characterizing separation between clusters. When δ is large, the estimates on Din t and Dbtw t from Theorem 4.12 are poor. However, δ is deﬁned as the maximum over all points of the probability of transitioning to a new cluster in one time step. The formulation in terms of the maximum is convenient for analysis, but is pessimistic in the case of clusters cores that are separated by bottlenecks or low density noise regions. Indeed, for these data, δ may be large there are some points quite close to the cluster boundaries but the probability of transition between clusters cores is low. The conservative estimates based on δ may be improved by considering some data points not as part of a cluster, but rather as noise or transition points. Let X = SK k=1 Xk N be a decomposition of X into cluster points (SK k=1 Xk) and noise points (N). Decompose P as

P11 P12 . . . P1K P1,N P21 P22 . . . P2K P2,N ... ... ... ... ... PK1 PK2 . . . PKK PK,N PN,1 PN,2 . . . PN,K PN,N

in a manner analogous to (4.6). Interpreting the ﬁnal block rows and columns as noise or bottleneck points that are not part of clusters, Theorem 4.12 may be generalized to allow for clusters that are not well-separated, by accounting for the geometric properties of noise and bottlenecks which do not admit transitions between clusters over short time scales. This more generalized analysis does not make use of the reduced Markov chain S, and some slightly new notation is required.

Deﬁnition 4.16 For P as in (4.15) deﬁne

γmin(t) = min x,y X,x =y

|pt(x, u) pt(y, u)| pt(x, ) pt(y, ) 2 1 n

Learning by Unsupervised Nonlinear Diffusion

γmax(t) = max x,y X

|pt(x, u) pt(y, u)| pt(x, ) pt(y, ) 2 1 n

The following result is more general than Theorem 4.12, in that it allows for bottlenecks in the data, but is not interpretable in terms of the near reducibility of P. Let Din t , Dbtw t be as in (3.2); note that these quantities ignore N and only consider distances within and between the clusters {Xk}K k=1.

Theorem 4.17 Let X = SK k=1 Xk N and let P be a corresponding Markov transition matrix on X, decomposed as in (4.15), and let ν be the counting measure. Let

δ(t) =2 max 1 k K (Pt)k , where Pk = [Pk1Pk2 . . . Pk,k 1Pk,k+1 . . . Pk K],

α(t) = max 1 k K max x,y Xk

u Xk |pt(x, u) pt(y, u)|,

β(t) =2 min 1 k K min x Xk

u Xk |pt(x, u)|,

ζ(t) = max 1 k K max x,y Xk

u N |pt(x, u) pt(y, u)|.

(a) Din t γmax(t)(α(t) + δ(t) + ζ(t)).

(b) Dbtw t γmin(t)(β(t) δ(t)).

Proof To see (a), suppose x, y Xk. Then

pt(x, ) pt(y, ) ℓ2(ν) γmax(t) pt(x, ) pt(y, ) ℓ1(ν)

u Xk |pt(x, u) pt(y, u)| + X

u Xℓ |pt(x, u) pt(y, u)| + X

u N |pt(x, u) pt(y, u)|

γmax(t)(α(t) + δ(t) + ζ(t)).

On the other hand, suppose x Xk, y Xℓ, k = ℓ. Then

pt(x, ) pt(y, ) ℓ2(ν) γmin(t) pt(x, ) pt(y, ) ℓ1(ν)

u Xk |pt(x, u) pt(y, u)| + X

u Xℓ |pt(x, u) pt(y, u)|

u Xk |pt(x, u)| |pt(y, u)| + X

u Xℓ |pt(y, u)| |pt(x, u)|

Maggioni and Murphy

γmin(t)(β(t) δ(t)).

The quantities δ(t), α(t), β(t) in Theorem 4.17 are analogous to those in Theorem 4.12, albeit without the aid of the analysis in terms of S . The quantity ζ(t) pertains to the noise N, and hence has no analogy in Theorem 4.12.

The quantity δ(t) is an upper bound on the probability mass that transitions between clusters at time t. It does not consider the mass transfer from clusters to the noise region. In Theorem 4.8, when the clusters are well-separated and N = , δ(t) δ(1)t = δt is a worst-case, though intuitive, upper bound. Generally, δ(t) may depend on time nonlinearly and in particular it may stay small for a long time, before the global equilibrium time of the underlying Markov chain P is approached, at which point it grows essentially linearly in t before stabilizing around the global mixing time. This occurs, for example, in data with bottlenecks between clusters. We remark that considering δ(t) in these more general terms is possible even in the noiseless case of Theorem 4.8. Indeed, this may allow for a better estimate than the worst-case estimate of δ(t) δ(1)t = δt, though at some loss of simplicity.

The quantity α(t) controls how close the diagonal blocks are to having constant rows, that is, how close to mesoscopic equilibria the random walker is on each individual cluster. In the case that all individual clusters are near their mesoscopic equilibria by time t, α(t) is close to 0. In the simpler context of Theorem 4.8, α(t) is governed by λt K+1, that is, by the second largest eigenvalue of the blocks of the stochastic complement of P. This is because the second smallest eigenvalue on each block controls the rate of convergence to local equilibrium of that block in the stochastic complement S.

The quantity β(t) measures the row sums of the transition matrices localized to each cluster. It is analogous to the term involving the norm of the stationary distribution in the lower bound on Dbtw t in Theorem 4.12.

The quantity ζ(t) is a measure of the impact of the noise region in the computation of diﬀusion distances. When the number of noise points is small relative to the number of cluster points, this term is negligible. It is also negligible when the noise region impacts a cluster in a uniform way, in particular when the cluster has reached a mesoscopic equilibrium.

Theorem 4.17 suggests that for data in which, with high probability, the clusters reach mesoscopic equilibria before the random walker exits the clusters and crosses the noise, there exists a range of parameters t such that Din t is small and Dbtw t is large. At this time scale, α(t) is small since the individual clusters are near mesoscopic equilibria,

Learning by Unsupervised Nonlinear Diffusion

and δ(t) is small and β(t) large because most of the mass is still localized on the distinct clusters. An important class of examples exhibiting this behavior is clusters connected by narrow bottlenecks, as shown for example in Figure 4 (a). In that example, consider the four balls as clusters {Xk}4 k=1 and the bottlenecks as N. The behavior of Din t and Dbtw t as a function of t, as well as several transition matrices for diﬀerent t values are shown in Figure 6.

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 -6

(a) Dbtw t , Din t versus t.

200 400 600 800 1000 1200 1400 1600 1800

(b) Diﬀusion distances, t = 1

200 400 600 800 1000 1200 1400 1600 1800

(c) Diﬀusion distances, t = 103

200 400 600 800 1000 1200 1400 1600 1800

(d) Diﬀusion distances, t = 106

Figure 6: Plots illustrating that the bottleneck data in Figure 4 exhibits mesoscopic equilibria, despite the lack of any strict separation between the four clusters. Indeed, (a) shows that for a large time range, Din t is much less (note that the vertical axis is in log10 scale) than Dbtw t . This is further illustrated by the fact that for diﬀerent ranges of time t, the blocks in the diﬀusion distance matrix corresponding to the cluster sets become essentially all 0 see (b), (c). For large time, the diﬀusion distances start to converge uniformly to 0, as shown in (d). Note that the last rows and columns of P correspond to the bottleneck points. Theorem 4.12 does not provide useful estimates for this bottleneck data, because of the lack of strict separation between the clusters. Indeed, if bottleneck points are assigned to their nearest clusters, then δ is nearly 1, rendering Theorem 4.12 essentially useless. However, Theorem 4.17 supplies useful estimates on the diﬀusion distances within and between clusters (e.g. that Din t Dbtw t ). This is because δ(t) is small and β(t) large for long time scales, until the random walk has explored from one cluster to another. That it takes a long time for the random walk to reach global stationarity due to the bottleneck can be argued in terms of graph conductances (Chung, 1997). We sketch a combinatorial argument below, in order to observe how the data dimensionality aﬀects the bottlenecks. For simplicity, we consider just two clusters. Consider the following model of bottleneck data in RD. Let L > 0 and X = X1 X2 N, where X1 = [0, 1]D, X2 = [0, 1]D + (L, 0, 0, . . . , 0) are clusters connected at the centers of faces by a cylindrical tube N of length L and D 1 dimensional radius ϵ. Let X be generated by sampling uniformly at random from X. To simplify the analysis, consider a random walk on X such that x X transitions to points in Bσ(x) with equal probability, and not to any other points, for some σ > 0. For ϵ < σ, a point x at the boundary interface between X1 and N is such that most of its near neighbors lie inside the cluster, rather than in the bottleneck. Indeed, up to geometric constants, vol D(X1 Bσ(x )) σD, while vol D(N Bσ(x ))

Maggioni and Murphy

σϵD 1. In particular, the probability of a random walker transitioning from x into the bottleneck in one time step is upper bounded by (ϵ/σ)D 1, so that the expected number of times a random walk reaches the boundary interface before exiting X1 is lower bounded by (σ/ϵ)D 1. Supposing that the random walker spends time T exploring X1 before returning to the boundary interface, the expected time it takes for a random walker to exit from X1 into the bottleneck is (σ/ϵ)D 1T. Moreover, once the random walker transitions from X1 to N, the expected number of steps it takes to cross the bottleneck to get to X2 may be lower bounded by L/σ . Indeed, if all the one-step transitions are optimal in the sense of being distance σ exactly in the direction parallel to (1, 0, . . . , 0), then at least L/σ steps are required to cross; better estimates on the expected number of steps to cross are also possible (Levin et al., 2009). In particular, the random walker will, with high probability, exit X1 then return to X1 Ω( L/σ ) times, before exiting X1 and making it all the way to X2. Thus, the expected time to transition from an arbitrary starting point in X1 to any point in X2 may be crudely estimated to be Ω( L/σ (σ/ϵ)D 1T), which is large when L is large and ϵ is small, i.e. the bottleneck is long and thin. In particular, Dbtw t will remain large for t L/σ (σ/ϵ)D 1T. Noting that T is comparable to the mixing time on X1 (and thus X2) alone, this suggests that there are time scales at which the random walker is at equilibrium on the distinct clusters (albeit with small leakage into the bottleneck at each time step) but at which global equilibrium is not reached. At such time scales, diﬀusion distances within a cluster are small, and between the two clusters are large. We remark that this property of diﬀusion distances for bottlenecks also allows LUND to correctly learn the clusters in the data, since the density is constant on this data set. Indeed, in Figure 4, LUND learns four modes in the four distinct clusters, and consequently labels all points correctly. It is interesting to note that depending on the random sample, LUND may learn three modes in distinct clusters and a fourth mode either in the fourth cluster (as in Figure 4 (a)) or at the intersection of the bottlenecks (a possibility not shown in Figure 4). Whether the fourth mode is in a cluster or in the middle of the bottleneck depends on random ﬂuctuations in the empirical density of the data. Regardless of if fourth mode is in the bottleneck region or in a cluster, the consequent LUND labeling of the data is correct. Indeed, if one of the modes is the bottleneck, then the cluster without a mode will have points assigned the label associated to the bottleneck mode, leading all the clusters to be given correct labels.

4.4 Example: A Simple Gaussian Mixture Model

The major parameters controlling the estimates in Theorem 4.12 are δ, λK+1, and κ. To illustrate the key quantities of this theorem, we consider the simple example of a mixture of Gaussians µG = 1

2N(x1, Σ) + 1

2N(x2, Σ) in R2 with diagonal isotropic covariance matrix Σ = 1 10I. We construct the diﬀusion transition matrix P using the Gaussian kernel with σ = .2, as described in Section 2.2.

Learning by Unsupervised Nonlinear Diffusion

As x1 x2 2 increases, Theorem 4.12 becomes more informative. In Figure 7, samples are drawn from µG with diﬀerent amounts of separation (and hence diﬀerent δ, λK+1, κ values) and we show the dependence on ϵ of the bounds

Din t = 2 ϵ nγ(t), Dbtw t = 2 min y X s (y, ) ℓ2(ν) 2 ϵ nγ(t).

and the permissible time interval [ln 2κ

ϵ / ln 1 λK+1

2δ]. For Theorem 4.12 to be

meaningful, ϵ must be such that simultaneously Din t < Dbtw t and ln 2κ

ϵ / ln 1 λK+1

ϵ 2δ. As ϵ 0+, Din t < Dbtw t holds if the clusters are internally well-connected and

separated, as articulated in (4.13), while ln 2κ

ϵ / ln 1 λK+1

and ϵ 2δ 0+; a similar but reversed dichotomy occurs as ϵ . Figure 7 illustrates this tension between the hypotheses of Theorem 4.12 and the strength of its conclusion.

4.5 Robustness to Geometric Deformations

LUND depends mainly on the intrinsic geometric constants δ and λK+1, so clustering performance with LUND is robust to small amounts of geometric deformation, for example the action of a bi-Lipschitz map. This is illustrated in Figure 8, in which two circular clusters in R2 are distorted by stretching the second coordinate. The results of LUND as a function of t and the amount of stretching are shown. It can be seen that the more the clusters are stretched, the larger t needs to be to achieve good accuracy. This is because as the clusters are stretched, the parameter λK+1 increases towards 1, meaning that t must be larger to ensure that mesoscopic equilibria on the distinct clusters are reached. Note that the relationship between domain geometry and the eigenvalues of the Laplacian is well-studied in classical settings (Szeg o, 1954; Weinberger, 1956), and its role in graph-based clustering is noted in contexts other than LUND (Ng et al., 2002; Arias-Castro, 2011; Schiebinger et al., 2015; Little et al., 2017). We note that transformations that push the clusters closer together can strongly impact LUND, since this will dramatically change the δ value. This is true of all clustering algorithms; the primary beneﬁt of LUND compared to a range of existing methods is its robustness to distorting the shape of the clusters.

4.6 Relationship Between Time and Scaling Parameter in Diﬀusion Distances

The Markov chain underlying diﬀusion distances is typically constructed using the heat kernel K(x, y) = exp( x y 2 2/σ2), for some choice of (spatial) scale parameter σ. Once P = Pσ is constructed, the time parameter t enters. For data sampled from a common manifold, there exists an asymptotic relationship between t and σ as n :

lim σ 0 PT0/σ σ = e T0 , (4.18)

Maggioni and Murphy

-1.5 -1 -0.5 0 0.5 1 1.5 2 2.5 3 -1

2, 0), x2 = (0,

-15 -13 -11 -9 -7 -5 -3

(b) [Din t , Dbtw t ].

-15 -13 -11 -9 -7 -5 -3 -1

ϵ / ln 1 λK+1

-1 -0.5 0 0.5 1 1.5 2 2.5 3 -1.5

(d) x1 = (2, 0), x2 = (0, 2)

-15 -13 -11 -9 -7 -5 -3

(e) [Din t , Dbtw t ].

-15 -13 -11 -9 -7 -5 -3 -1

ϵ / ln 1 λK+1

Figure 7: 2000 data points sampled from µG, for two sets of means are shown in (a), (d). In (b), (e), the range between Din t and Dbtw t as a function of ϵ is shown. Plots (c), (f) show the t interval guaranteed by Theorem 4.12, indicating the existence of a non-empty range of t for which the conclusions of Theorem 4.12 apply whenever the red curve is above the blue curve. As the means x1, x2 move apart, the time interval in which Theorem 4.12 guarantees good separation between the clusters expands. This is makes sense intuitively, since as the clusters move apart, the unsupervised learning problem becomes easier. We remark that the separation for the data in (a) is insuﬃcient for Theorem 4.12 to guarantee a large time range in which Din t < Dbtw t . This is because the δ constant is large for this data set, since δ is determined by the points in distinct clusters that are nearest, and in particular is a worst-case bound. Relaxing the δ separation condition is discussed in Section 4.3. where e T0 is the inﬁnitesimal generator corresponding to continuous diﬀusion with canonical time T0 (Lafon et al., 2006). So, asymptotically as n , and requiring σ 0+ and t = T0

σ , using (t, σ) is equivalent to using (Ct, σ/C) for any constant C > 0. This suggests that asymptotically as n , the performance of LUND with respect to σ, t should be constant if tσ is constant as σ 0+. As we shall see observe in Section 6, working with ﬁnite data in the cluster setting, rather than the asymptotic regime on a common manifold, may lead to more subtle relationships between t and σ.

Learning by Unsupervised Nonlinear Diffusion

-2 -1 0 1 2 3 4 5 -4

(a) Circular Data

-2 -1 0 1 2 3 4 5 -4

(b) Geometrically Deformed Data

1 1.25 1.5 1.75 2 2.25 2.5 2.75 3

(c) LUND Accuracy

Figure 8: In (a), uniform spherical data is shown. In (b), the data in (a) has been geometrically deformed by dilating the second coordinate by a factor of 3. We consider the robustness of LUND to the natural geometric transformation (x, y) 7 (x, αy), where α is understood as a stretching parameter. In (c), the results of LUND are shown as a function of t and the dilation factor α. As α increases, the cluster stretches and taking larger t allows LUND to accurately label the data. This is because the random walker requires more time to explore a high elongated cluster than a compact, spherical one. 5. Performance Guarantees for Unsupervised Learning

We consider now how the LUND algorithm (Algorithm 1) performs on data X = SK k=1 Xk. Let p(x) be a KDE for x X, let ρt be as in (3.4), and recall Dt(x) = p(x)ρt(x). The LUND algorithm sets the maximizers of Dt to be the modes of the clusters. Requiring potential modes to have large ρt values enforces that modes should be far in diﬀusion distance from other high-density points, and incorporating p(x) downweighs outliers, which may be far in diﬀusion distance from their nearest neighbor of higher empirical density.

Theorem 5.1 Suppose X = SK k=1 Xk. If Din t /Dbtw t < min(M)/ max(M), then the K maximizers {x i }K i=1 of Dt(x) are such that x i is a highest empirical density point of Xki for some permutation (k1, . . . , k K) of (1, . . . , K).

Proof We proceed by induction on 1 m K. Clearly x 1 = arg maxy X p(y) is a highest empirical density point of some Xk. Then suppose x 1, . . . , x m, m < K, have been determined, and are highest empirical density points of distinct classes Xk1, . . . , Xkm. We show that x m+1 must be a highest density point among the remaining Xk, k / {k1, . . . , km}. First, suppose x Xkr for some r {1, . . . , m} is any point in the classes already discovered, not of maximal within-class density. Then ρt(x) Din t , since x r Xkr has p(x) < p(x r), and hence for any x a highest density point in a cluster not already discovered,

Dt(x) < p(x kr)ρt(x) max(M)Din t max(M)

min(M) Din t Dbtw t p(x)ρt(x) < p(x)ρt(x) = Dt(x).

Maggioni and Murphy

Thus, x m+1 = x. Now, suppose x Xk, k = k1, . . . km. If x = x an empirical density maximizer of Xk, then:

Dt(x) = p(x)ρt(x) < p(x)ρt(x) p(x)Din t < p(x)Dbtw t p(x)ρt(x) = Dt(x).

Hence, x m+1 = x, and thus x m+1 must be among the classwise empirical density maximizers of Xk, k / {k1, . . . , km}.

A similar method proves that the ratios of the sorted Dt values determine the number of clusters K. The problem of estimating the number of clusters is a crucial one, but few methods admit theoretical guarantees; see Little and Byrd (2015) for an overview.

Corollary 5.2 Let {xmi}n i=1 be the points {xi}n i=1, sorted in non-increasing order: Dt(xm1) Dt(xm2) Dt(xmn). Then:

Dt(xmj ) Dt(xmj+1) max(M)

min(M) maxi=1, ,K ρt(xmi)

mini=1, ,K ρt(xmi) for j < K.

Dt(xm K ) Dt(xm K+1) min(M)

max(M) Dbtw t Din t .

Proof Statement (a) is immediate from the deﬁnition. To see (b), we compute

Dt(xm K) min(M)Dbtw t = min(M)

max(M) Dbtw t Din t Din t max(M) min(M)

max(M) Dbtw t Din t Dt(xm K+1).

Hence if Din t Dbtw t

maxi=1, ,K ρt(xmi)

mini=1, ,K ρt(xmi) min(M) max(M) 2 , there will be a sharp drop-oﬀin the values of Dt after the ﬁrst K maximizers. Note that the ratio mini=1, K ρt(xmi)/ maxi=1, K ρt(xmi) will be insigniﬁcant unless the clusters are arranged at diﬀerent scales (i.e. some clusters are very close to each other but far from others). Similarly, (min(M)/max(M))2

will be nearly 1 if the maximal densities of the clusters are comparable. These observations suggest to estimate the number of clusters by selecting a cutoﬀτ and setting

ˆK = inf k Dt(xmk) Dt(xmk+1) > τ , (5.3)

as in Algorithm 1. Once the modes have been learned correctly, points may be clustered simply by labeling each mode as belonging to its own class, then requiring that every point has the same label as its nearest neighbor in diﬀusion distance of higher density.

Learning by Unsupervised Nonlinear Diffusion

Corollary 5.4 Suppose X = SK k=1 Xk. Let {xmi}n i=1 be the points {xi}n i=1, sorted so that Dt(xm1) Dt(xm2) Dt(xmn). Then Algorithm 1 labels all points correctly for any τ satisfying

min(M) maxi=1, ,K ρt(xmi)

mini=1, ,K ρt(xmi) < τ < min(M)

max(M) Dbtw t Din t .

Proof By Corollary 5.2, the algorithm correctly estimates ˆK. Then, by Theorem 5.1, the algorithm correctly learns the empirical density maximizers of each of the {Xk}K k=1. It remains to show that the subsequent labeling of all points is accurate. For an unlabeled point x Xk, its nearest diﬀusion neighbor of higher density, x , must be in the same cluster Xk, since Din t < Dbtw t . Moreover, that point is already labeled as Y (x ) = k, since p(x ) p(x). Hence, Y (x) = k and by induction, all points are labeled correctly.

The dependence on τ is somewhat unsatisfying, and in practice, this quantity can be removed from the inputs of Algorithm 1 by instead setting ˆK = arg maxk Dt(xmk)/Dt(xmk+1). This provably detects K accurately by noting that the ratios Dt(xmj+1)/Dt(xmj) are small for j > K under a range of reasonable assumptions, for example the assumptions that the density of each cluster is bounded away from 0 and the ratio of the minimal and maximal within-cluster diﬀusion distance is bounded. Algorithm 2 describes the LUND algorithm in the simpler case that K is known a priori. This algorithm achieves perfect accuracy under milder conditions than Algorithm 1.

Corollary 5.5 Suppose X = SK k=1 Xk and K is known. If Din t Dbtw t < min(M) max(M), then Algorithm 2, labels all points correctly.

Proof This follows from Theorem 5.1, along with Din t < Dbtw t .

6. Numerical Experiments

We return to the motivating data sets of Section 3. The diﬀusion distances are computed by truncating (2.2) to sum only over the largest (in terms of modulus of the eigenvalues) M = 100 n eigenpairs, and the KDE p(x) uses 100 nearest neighbors with σ0 = 1. We compute a number of statistics on the data to test our theoretical estimates and to verify the eﬃcacy of the proposed algorithm. For the ﬁrst two data sets we examine, we plot Din t , Dbtw t as functions of t, to observe the multitemporal nature of our clustering algorithm. We also compute the theoretical estimates on Pt S as guaranteed by Theorem 4.8. The tightness of the theoretical estimates is evaluated

Maggioni and Murphy

by comparing to the empirical values. We moreover plot the diﬀusion distances from a ﬁxed point for a variety of t values, to illustrate the multitemporal behavior of these distances. After these evaluations, we cluster the data with the proposed LUND algorithm and compute the accuracy, comparing with spectral clustering and the FSFDPC algorithm. We moreover compute the estimates of K with both the proposed method ˆK = arg maxk Dt(xmk)/Dt(xmk+1) where {xmi}n i=1 are the points {xi}n i=1 sorted so that Dt(xm1) Dt(xm2) Dt(xmn), and spectral clustering eigengap ˆK = arg maxi λi+1 λi, as a function of the crucial parameters of the respective algorithms. In particular, we evaluate the robustness of spectral clustering methods with respect to σ, and LUND with respect to σ and t. For spectral clustering, we consider the variant in which just the second eigenvector ψ2 is used (Shi and Malik, 2000), as well as the variant in which the ﬁrst K eigenvectors {ψi}K i=1 are used (Ng et al., 2002). All experiments are conducted on randomly generated data, with results averaged over 100 trials.

6.1 Bottleneck Data

We ﬁrst analyze the linear, multimodal data set of Figure 2, in which two of the clusters feature two high-density regions, connected by a lower density bottleneck region. Theorem 4.8 upper bounds Pt S < ϵ in terms of δ, λK+1, κ, which for this data have values δ = 6.2697 10 8, λK+1 = 1 1.7563 10 4, κ = 2.6738 102

when P is constructed with σ = .15. As shown in Figure 9, the theoretical estimate correctly illustrates the overall behavior of the transition from initial distribution, to mesoscopic equilibria, then to a global equilibrium. The distance from a high-density point across time scales appears in Figure 10. For small time values, the diﬀusion distance scales similarly to Euclidean distance. However, by time t = 108, a mesoscopic equilibrium has been reached, and all points in the cluster are rather close together. By t = 1016, a global equilibrium has been reached.

6.1.1 Bottleneck Data Clustering Evaluation

Comparisons with spectral clustering appear in Figures 11 and 12. In Figure 12 (a), it is shown that for all values of the spatial scale parameter σ, the eigengap estimated number of clusters ˆK is 1, i.e. always incorrect. On the other hand, Figure 12 (b) shows that there is a range of (σ, t) values mesoscopic in t for which LUND achieves perfect accuracy. Indeed, after an initial phase in which the number of clusters is estimated as 1, the LUND estimate for K is decreasing in t, corresponding to the mixing of diﬀerent clusters over time. The LUND algorithm and FSFDPC are compared in Figure 13. Due to the nonspherical shapes of the clusters, FSFDPC is unable to learn the modes of the data correctly, and consequently assigns modes to the same cluster: the modes learned

Learning by Unsupervised Nonlinear Diffusion

0 5 10 15 -12

(a) Plot of Dbtw t , Din t against t.

0 5 10 15 -10

(b) Quantities related to mesoscopic and global equilibria.

Figure 9: In (a), Dbtw t , Din t are plotted against t. For t < 104.5, Din t > Dbtw t , since for small time, Dt is essentially the same as Euclidean distance. Around t = 104.5, there is a transition, in which Din t Dbtw t . This corresponds to the Markov chain reaching mesoscopic equilibria in which the chain is well-mixed on each cluster, but not well-mixed globally. The approximate times of convergence towards the mesoscopic and global equilibrium are denoted with dotted black vertical lines. In (b), we plot three quantities against t: the theoretical bound on Pt S guaranteed by Theorem 4.8; the empirical quantity Pt S ; and the empirically computed quantity λ2(P)t/πmin, which estimates the distance to the stationary distribution. Notice that Pt S gets small, both the theoretical bound and the empirical value, around t = 104.5. It then increases. Around t = 1013.75, λ2(P)t/πmin decays exponentially to 0, indicating that the global equilibrium has been reached. Note that the theoretical estimate on Pt S is not tight, though it accurately captures the overall behavior of the quantity with respect to t. Moreover, the global decay estimate λ2(P)t/πmin is somewhat conservative in estimating the mixing time of towards the equilibrium distribution, as can be seen by comparing the second vertical lines in (a) and (b). In order to keep the plots at the same scale, the plots in (a) are extended to be constant for roughly t > 12. by FSFDPC and subsequent labels appear in subﬁgures (a), (b), respectively. In contrast, LUND learns one mode from each cluster, as shown in (c). Consequently, all points are labeled correctly, as shown in (d).

-1 -0.5 0 0.5 1

-1 -0.5 0 0.5 1

(b) t = 104

-1 -0.5 0 0.5 1

(c) t = 108

-1 -0.5 0 0.5 1

(d) t = 1016

Figure 10: A high-density point is shown in red, and all other points are colored by the Dt distance from this point in log10 scale. The transition from initial distribution (a), to mesoscopic equilibrium (c), to global equilibrium (d), is illustrated as t grows.

Maggioni and Murphy

0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55

(a) Accuracy spectral clustering

0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55

(b) Accuracy of LUND

Figure 11: Accuracy of two variations of spectral clustering compared to LUND as functions of σ. While spectral clustering with ψ2 performs nearly perfectly for σ < .4, its performance degrades as σ increases. Classical spectral clustering using ψ2, ψ2, ψ4 achieves perfect clustering of the data for roughly σ < .35. LUND is able to achieve perfect clustering accuracy for a wide range of (σ, t) pairs, mainly for those σ values which allows spectral clustering with just ψ2 to succeed. As σ increases, the mesoscopic regime in which perfect accuracy is achieved shrinks before disappearing entirely around σ = .45. In this data, spectral clustering with just ψ2 performs about as well as LUND in terms of accuracy, assuming K is known.

0.3 0.4 0.5 0

(a) Estimates of ˆ K using eigengap statistic.

0.3 0.4 0.5 -2

(b) Over/underestimation of ˆ K using eigengap statistic.

0.3 0.4 0.5

(c) Estimates of ˆ K using Dt statistic.

0.3 0.4 0.5

(d) Over/underestimation of ˆ K using Dt statistic.

Figure 12: In (a), we see the estimates of ˆK using the eigengap statistic, as a function of spatial scale parameter σ. The eigengap consistently estimates ˆK = 1 < 3 = K, indicating that the multimodal nature of this data is too complicated for the spectral clustering eigengap to handle. A quantized version of these estimates is shown in (b), in which entry 0 indicates correct estimation, -1 indicates ˆK < K and 1 indicates ˆK > K. There is a regime of (σ, t) values in which LUND correctly estimates the number of clusters, as shown in (c) and (d). After an initial time, this regime is essentially monotonic decreasing in t, and the mesoscopic region in which ˆK = K is decreasing in σ. 6.2 Nonlinear Data

We now consider the nonlinear multimodal data of Figure 2 (b). The innermost circle is ﬁlled-in, and has only one high-density region. It is surrounded by two circles, each with two high-density regions connected by low-density regions. The paths connecting

Learning by Unsupervised Nonlinear Diffusion

-1.5 -1 -0.5 0 0.5 1 1.5 -1

(a) Learned modes, FSFDPC

-1.5 -1 -0.5 0 0.5 1 1.5 -1

(b) Learned labels, FSFDPC

-1.5 -1 -0.5 0 0.5 1 1.5 -1

(c) Learned modes, LUND

-1.5 -1 -0.5 0 0.5 1 1.5 -1

(d) Learned labels, LUND

Figure 13: Comparison of FSFDPC to LUND. In (a), the modes learned from FSFDPC with Euclidean distances are plotted. Due to the eloganted, non-spherical nature of the data, the modes are learned incorrectly. The subsequent labels, shown in (b), illustrate FSFDPC is not able to capture the structure of this data. In (c), the modes learned from LUND are shown. One mode is learned from each cluster, which allows for a correct labeling of all data points with LUND, as shown in (d). LUND used parameters (σ, t) = (.15, 106) for these data. antipodal points on the outer circles are long, which suggests these sets will have low conductance. In the context of Theorem 4.8, the parameters for this data have values δ = 1.7225 10 4, 1 λK+1 = 6.8350 10 5, κ = 2.655 102 with σ = .2. Comparison of theoretical and empirical estimates appear in Figure 14, and the diﬀusion distances from one of the high-density points appear in Figure 15, illustrating the transition from initial distribution, to mesoscopic equilibrium, to global equilibrium.

0 1 2 3 4 5 6 7 8 9 10 -12

(a) Plot of Dbtw t , Din t against t.

0 1 2 3 4 5 6 7 8 9 10 -6

(b) Quantities related to mesoscopic and global equilibria.

Figure 14: In (a), we plot Dbtw t , Din t against t. For roughly t < 104.25, Din t > Dbtw t ; around t = 104.25, there is a transition, in which Din t Dbtw t . This corresponds to the Markov chain reaching mesoscopic equilibria in which the chain is well-mixed on each cluster, but not well-mixed globally. The approximate times of convergence towards the mesoscopic and global equilibrium are denoted with dotted black vertical lines. In (b), we plot three quantities against t: the theoretical bound on Pt S guaranteed by Theorem 4.8; the empirically computed quantity Pt S ; and the empirically computed quantity λ2(P)t/πmin, which estimates the distance to the stationary distribution. Notice that Pt S gets small, both the theoretical bound and the empirical value, around t = 104.25. It then increases. Around t = 109, λ2(P)t/πmin decays exponentially to 0, indicating that the global equilibrium has been reached.

Maggioni and Murphy

-1.5 -1 -0.5 0 0.5 1 1.5 2

-1.5 -1 -0.5 0 0.5 1 1.5 2

(b) t = 104

-1.5 -1 -0.5 0 0.5 1 1.5 2

(c) t = 108

-1.5 -1 -0.5 0 0.5 1 1.5 2

(d) t = 1016

Figure 15: A high-density point is shown in red, and all other points are colored by Dt distance from this point in log10 scale. The transition from initial distribution (a), to mesoscopic equilibrium (c), to global equilibrium (d), is illustrated as t grows. 6.2.1 Nonlinear Data Clustering Evaluation

LUND is compared with the two spectral clustering variants in Figures 16 and 17. In terms of overall accuracy, LUND with the correct choice of t outperforms both methods of spectral clustering using ψ2 only and using ψ2, ψ3, ψ4 for a range of σ values. The strong performance of LUND in the mesoscopic range, away from t = 0, t = , conﬁrms the theoretical results, and demonstrates LUND s ﬂexibility compared to classical spectral methods. Beyond accuracy, the LUND estimator for K is empirically eﬀective for a range of (σ, t) values, while the eigengap is much less eﬀective.

0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55

(a) Accuracy spectral clustering

0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55

(b) Accuracy of LUND

Figure 16: Accuracy of two variations of spectral clustering compared to LUND as functions of σ. While spectral clustering with ψ2 performs well for very small σ, its performance degrades as σ increases; classical spectral clustering using ψ2, ψ3, ψ4 performs similarly though for a smaller range of σ. LUND is able to achieve perfect clustering accuracy for a wide range of (σ, t) pairs, in particular for pairs (σ, t) such that spectral clustering fails. As σ increases, the mesoscopic regime in which perfect accuracy is achieved shrinks before disappearing entirely around σ = .4. LUND outperforms spectral clustering with (ψ2, ψ3, ψ4) roughly for σ (.3, .4), and outperforms spectral clustering with ψ2 alone roughly for σ (.35, .4).

In Figure 18, LUND is compared to FSFDPC. LUND correctly learns the modes of the data and labels points correctly, as shown in (c), (d). FSFDPC, however, fails to

Learning by Unsupervised Nonlinear Diffusion

0.3 0.4 0.5 1

(a) Estimates of ˆ K using eigengap statistic.

0.3 0.4 0.5 -1

(b) Over/underestimation of ˆ K using eigengap statistic.

0.3 0.4 0.5

(c) Estimates of ˆ K using Dt statistic.

0.3 0.4 0.5

(d) Over/underestimation of ˆ K using Dt statistic.

Figure 17: In (a) and (b), we see the eigengap consistently misestimates the number of clusters in the data, ﬁrst estimating ˆK = 1, before oscillating between various numbers of clusters, including ending on ˆK = 3 for one value of σ. LUND is able to achieve an accurate estimate for ˆK for a range of (σ, t) values, with generally more t values yielding a correct estimate for smaller σ. We note that the vertical lines in (c), (d) around σ = .45 and σ = .5 that deviate from the surrounding values are due to using the mode as the summary statistic across the trials. If the mean is used, a much smoother plot emerges, though one that is less meaningful in terms of estimating the number of clusters. learn the modes of the data correctly, leading to erroneous labeling see subﬁgures (a), (b) respectively.

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2 2.5 -2

(a) Learned modes, FSFDPC

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2 2.5 -2

(b) Learned labels, FSFDPC

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2 2.5 -2

(c) Learned modes, LUND

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2 2.5 -2

(d) Learned labels, LUND

Figure 18: In (a), the modes learned from FSFDPC with Euclidean distances are plotted. The nonlinear nature of the data causes the modes to be learned incorrectly. The subsequent labels, shown in (b), illustrate FSFDPC is not able to capture the structure of this data. In (c), the modes learned from LUND are shown. One mode is learned from each cluster, which allows for a correct labeling of all data points with LUND, as shown in (d). LUND used parameters (σ, t) = (.175, 105) for these data.

6.3 Gaussian Data

As a ﬁnal synthetic example, we consider the Gaussians of Figure 1, which were constructed by Nadler and Galun (2007) to be data on which both variants of spectral clustering fail. These data are not suﬃciently separated for Theorem 4.12 to apply, but LUND still is able to perform well, owing to the incorporation of density, which allows to easily estimate the modes of the data. Comparisons to spectral clustering in terms of overall accuracy are reported in Figure 19. It is also enlightening to consider

Maggioni and Murphy

performances of LUND and spectral clustering in terms of average accuracy, in which the overall accuracy on each of the clusters is computed separately, and these classwise accuracies are then averaged. Compared to the overall accuracy measure, the average accuracy measure discounts large clusters and increases the signiﬁcance of small clusters. Results for average accuracy are in Figure 20.

0.75 0.875 1 1.125 1.25 1.375 1.5 1.625 1.75 1.875 2

(a) Accuracy of spectral clustering

0.75 0.875 1 1.125 1.25 1.375 1.5 1.625 1.75 1.875 2

(b) Accuracy of LUND

Figure 19: The overall accuracy of the two spectral clustering variants, as well as LUND, are shown for the Gaussian data. In terms of overall accuracy, LUND is able to achieve near-perfect results for a range of parameter values. Nearly all errors made were due to a point being generated from one Gaussian and landing very close to another Gaussian, which is essentially an unavoidable identiﬁability issue from which any unsupervised clustering method would suﬀer. Neither of the spectral clustering methods is able to match LUND s performance, which can be attributed to fundamental issues with the use of only the ﬁrst small number of eigenvectors when performing spectral clustering, as shown by Nadler and Galun (2007) and illustrated in Figure 1. In Figure 21, LUND is compared to FSFDPC. LUND correctly learns the modes of the data and labels points correctly, as shown in (c), (d). FSFDPC also learns the modes correctly, due to the unimodality of the Gaussian clusters and their isotropic covariance matrices.

6.4 Experimental Conclusions

In all three synthetic examples, LUND performs well. On the bottleneck data, it gives the same accuracy as spectral clustering with ψ2 but better estimates on ˆK; on the nonlinear data, it gives the best range of accuracies with respect to σ, while also giving the best estimates of ˆK; on the Gaussian data, LUND performs as well as FSFDPC while both spectral methods fail. These results suggest that LUND combines the best properties of spectral clustering with density-based clustering, while enjoying theoretical guarantees. We remark that extensive experiments with a variant of LUND adapted to high-dimensional images were performed on real hyperspectral image data (Murphy and Maggioni, 2018a,b, 2019b), demonstrating the competitiveness of LUND with a range of benchmark and state-of-the-art clustering algorithms.

Learning by Unsupervised Nonlinear Diffusion

0.75 0.875 1 1.125 1.25 1.375 1.5 1.625 1.75 1.875 2

(a) Accuracy of spectral clustering

0.75 0.875 1 1.125 1.25 1.375 1.5 1.625 1.75 1.875 2

(b) Accuracy of LUND

Figure 20: The average accuracy of the two spectral clustering variants, as well as LUND, are shown for the Gaussian data. The results are qualitatively similar to overall accuracy, but with reduced performance for spectral clustering, since most of the errors made by the spectral clustering variants are on the small cluster, which is washed out by spectral clustering. LUND achieves essentially perfect performance for a range of parameter values, excepting identiﬁability issues.

-14 -12 -10 -8 -6 -4 -2 0 2 4 -8

(a) Learned modes, FSFDPC

-14 -12 -10 -8 -6 -4 -2 0 2 4 -8

(b) Learned labels, FSFDPC

-14 -12 -10 -8 -6 -4 -2 0 2 4 -8

(c) Learned modes, LUND

-14 -12 -10 -8 -6 -4 -2 0 2 4 -8

(d) Learned labels, LUND

Figure 21: For the Gaussian data, both LUND and FSFDPC learn the data modes correctly, and are subsequently able to label the data with high accuracy. The lack of diﬀerence in their comparative performances is attributed to the fact that the data in this case are Gaussians with isotropic covariance matrices, and in particular have simple spherical supports, which confers diﬀusion distances little advantage compared to Euclidean distances. LUND used parameters (σ, t) = (1, 103) for these data. As shown in Figures 11, 12, 16, 17, 19, 20, the relationship between σ and t is not as simple as suggested by equation (4.18). Indeed, in the non-asymptotic case, and in particular in the case of well-separated clusters, the relationship between t and σ does not obey a strict exponential relationship, as suggested by (4.18). Instead, t appears to interact with scales locally on each cluster, as can be seen by the bifurcations in these plots. Gaining a complete understanding of the relationship between σ and t in the cluster case is a topic of ongoing research.

Maggioni and Murphy

6.5 Computational Complexity

The proposed algorithm enjoys essentially linear computational complexity. This is achieved through the indexing structure cover trees (Beygelzimer et al., 2006), which allows for eﬃcient nearest neighbor searches under the assumption that data has lowdimensional structure. Indeed, for data {xi}n i=1 RD, the computation of each of the n data point s knn nearest neighbors can be achieved at a cost of O(knn DCdn log(n)), where d is the intrinsic dimension of the data. This allows for the computation of each point s empirical density estimate p(x) in O(knn DCdn log(n)), where knn is the number of nearest neighbors used in the density estimate. In addition, if the Markov chain is computed not on a fully connected graph, but on a knn nearest neighbors graph, the cost of computing diﬀusion distances with the eigenvector approximation is O(knn DCdn log(n) + knn M 2n), where M eigenvectors are used in the approximation. In the case that knn = O(log(n)), M = O(1), these complexities simplify to O(DCdn log(n)2). Computing all ρt(x) values is O(n log(n)), under the assumption that, except for the O(log(n)) class modes, each point has among its O(log(n)) nearest neighbors a point of higher empirical density. Subsequent sorting of p and Dt are O(n log(n)), so the overall algorithm is O(DCdn log2(n)), which is linear in n, up to logarithmic factors.

6.6 Discussion and Tuning of Parameters

The LUND algorithm depends crucially on the parameters σ, t, and also potentially on the cutoﬀparameter τ used to estimate K as in (5.3). Our experiments have demonstrated that LUND is robust to σ and t, and that the parameter τ may be dispensed with by estimating ˆK = arg maxk Dt(xmk)/Dt(xmk+1). We remark that σ may be set locally in an automated way by replacing a global σ with a point-dependent σ based on distances to near neighbors (Zelnik-Manor and Perona, 2005). This approach modiﬁes the construction of P by setting Wij = exp( xi xj 2 2/σiσj), where σi = xi NN(xi, k NN) 2, where NN(xi, k NN) is the kth NN Euclidean nearest neighbor of xi. Which nearest neighbor to use, k NN, is a function of the data dimension, but the approach is practically quite robust to this parameter. This has the eﬀect of adapting to the local density near each point. Moreover, it suggests that if the underlying density is smooth and bounded away from 0, then for all xi, there is a natural scaling σi 0+ as n , with rate depending on the data dimensionality. Regarding t, this parameter determines the time scale at which the diﬀusion is run. The theoretical results guarantee that if a certain clustering of the data admits suitable separation and geometric properties, then there will be choices of t for which LUND accurately labels the data. In practice, choosing t without supervision can be performed in several ways. One approach is to choose the t which maximizes the

Learning by Unsupervised Nonlinear Diffusion

ratio between successive values of Dt(x):

t = arg max t

arg max k Dt(xk)/Dt(xk+1) .

This is analogous to the method for selecting the optimal parameters for ultrametric spectral clustering in Little et al. (2017). A diﬀerent approach is to select t such that clustering with LUND at time t is stable (Wan and Meila, 2016; Meila, 2019) with respect to small perturbations in the underlying data. Relatedly, it is natural to select t such that the clustering results are stable to small perturbations in t. We remark that in some sense, multiple choices of t, leading to diﬀerent clusterings, may be reasonable. This is because there may be hierarchical cluster structure in the data, parametrized by t. Developing extensions and theoretical analyses of LUND in this context is the subject of ongoing research.

7. Conclusions and Future Work

In this article, new methods for bounding diﬀusion distances, based on nearly-reducible Markov chains, are deployed to provide suﬃcient conditions under which clustering of data can be guaranteed. The theoretical results rigorously show that diﬀusion distances exhibit multitemporal behavior, even in the case that clusters have multiple regions of high-density or nonlinear support. These estimates on diﬀusion distance allow to prove performance guarantees on the LUND algorithm. This may be interpreted as a critique of the popular FSFDPC algorithm, for which theoretical guarantees require unrealistic assumptions on the shapes of the clusters. Numerical experiments on bottleneck, nonlinear, and Gaussian data indicate that the theoretical results correspond with empirical performance, and that LUND enjoys advantages of both spectral clustering and FSFDPC while tempering their weaknesses. While the results presented in this paper for diﬀusion distances are, we believe, novel and useful for developing performance guarantees for unsupervised learning, they fall short of a full ﬁnite sample analysis. It is of interest to understand how the estimates on δ and λK+1, which govern the estimates of Theorem 4.12, and consequently the performance guarantees for clustering, scale with the number of sample points. Developing such precise estimates would require new mathematical methods for analyzing the spectra of random operators on graphs. Such an analysis is suggested by recent works in discrete-to-continuum spectral analysis (Garcia Trillos et al., 2018), though handling the factor (I Pii) 1 may provide for new challenges. As remarked in Section 6, diﬀusion distances are Euclidean distances in a new coordinate basis, given by the (right) eigenvectors of P. A diﬀerent approach to proving the clustering properties of Dt with respect to time would be to show that diﬀerent eigenvectors localize on particular clusters, and show that there are gaps in the eigenvalues λℓwhich account for the emergence of mesoscopic equilibria. This approach is related to the analysis of eigenvectors corresponding to small eigenvalues for the symmetric Laplacian (Shi et al., 2009), and may provide new insights.

Maggioni and Murphy

In Section 4.6, the asymptotic relationship between σ and t for diﬀusion distances is discussed (Lafon et al., 2006). As shown in Section 6, LUND exhibits a more delicate relationship between σ, t in the ﬁnite sample case. This is potentially due to the incorporation of empirical density into the diﬀusion geometry framework that LUND uses to estimate the data modes. Analyzing this phenomenon mathematically is a topic of future research. In LUND, the modes are learned sequentially, by selecting the K maximizers of Dt. It is of interest to consider whether jointly selecting the modes, based on a criterion that optimizes over K points simultaneously, leads to diﬀerent and in some cases improved clustering results in particular situations. Such a formulation may force the modes to be better spread throughout the data, and in particular would force the modes in the bottleneck data in Figure 4 to localize on the clusters, regardless of the ﬂuctuations in empirical density. For this data, it does not aﬀect the clustering accuracy if the fourth mode is in the middle of the bottleneck or in a cluster, but it is intuitive that the modes should localize on the four cluster cores rather than anywhere along the bottleneck. Developing a joint mode selection criterion may enforce this. The proposed method also lends itself to the semisupervised setting of active learning (Chapelle et al., 2006; Dasgupta, 2011), in which the user is allowed to query a small number of points for labels. By estimating which points are most likely to be modes of clusters, the LUND algorithm presents natural candidates to query for labels. Recent work has suggested this approach is feasible (Murphy and Maggioni, 2018b, 2019a; Maggioni and Murphy, 2019), and it is an ongoing research eﬀort to optimally integrate diﬀusion geometry into an active sampling scheme.

Acknowledgments

We thank the three anonymous reviewers, whose comments and insights greatly improved the manuscript. This project was partially funded by NSF-DMS-125012, NSFDMS-1724979, NSF-DMS-1708602, NSF-ATD-1737984, AFOSR FA9550-17-1-0280, NSF-IIS-1546392, NSF-DMS 1912737, and NSF-DMS 1924513.

Appendix A. Proof of Theorem 4.8

Notice Pt S Pt St + St S . For all t 0, Pt St = Pt i=1 St i(P S)Pi 1, so that

i=1 St i(P S)Pi 1

i=1 St i (P S) Pi 1 = t (P S) tδ.

To bound St S , notice that after diagonalizing S,

St = Z IK 0 0 Dt

Z 1, S = Z IK 0 0 0

Learning by Unsupervised Nonlinear Diffusion

where D is a diagonal matrix with λK+1, λK+2, . . . , λn on the diagonal. Hence, St S Z λt K+1 Z 1 = κλt K+1, as desired. The second result of the theorem follows similarly.

D. Aldous and J. Fill. Reversible Markov chains and random walks on graphs, 2002.

R. Andersen and Y. Peres. Finding sparse cuts locally using evolving sets. In Symposium on Theory of Computing (STOC), pages 235 244. ACM, 2009.

R. Andersen, F. Chung, and K. Lang. Local graph partitioning using pagerank vectors. In Symposium on Foundations of Computer Science (FOCS), pages 475 486. IEEE, 2006.

R. Andersen, F. Chung, and K. Lang. Local partitioning for directed graphs using pagerank. Internet Mathematics, 5(1-2):3 22, 2008.

E. Arias-Castro. Clustering based on pairwise distances when the data is of mixed dimensions. IEEE Transactions on Information Theory, 57(3):1692 1706, 2011.

E. Arias-Castro, G. Chen, and G. Lerman. Spectral clustering based on local linear approximations. Electronic Journal of Statistics, 5:1537 1587, 2011.

E. Arias-Castro, G. Lerman, and T. Zhang. Spectral clustering based on local PCA. Journal of Machine Learning Research, 18(9):1 57, 2017.

D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Symposium on Discrete Algorithms (SODA), pages 1027 1035. Society for Industrial and Applied Mathematics, 2007.

A. Athreya, D.E. Fishkind, M. Tang, C.E. Priebe, Y. Park, J.T. Vogelstein, K. Levin, V. Lyzinski, and Y. Qin. Statistical inference on random dot product graphs: a survey. The Journal of Machine Learning Research, 18(1):8393 8484, 2017.

M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373 1396, 2003.

M. Belkin and P. Niyogi. Towards a theoretical foundation for Laplacian-based manifold methods. In International Conference on Computational Learning Theory (COLT), pages 486 500. Springer, 2005.

M. Belkin and P. Niyogi. Convergence of Laplacian eigenmaps. In Advances in Neural Information Processing Systems (NIPS), pages 129 136, 2007.

A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In International Conference on Machine Learning (ICML), pages 97 104. ACM, 2006.

Maggioni and Murphy

S. Botelho-Andrade, P.G. Casazza, D. Cheng, and R. Tran. The exact constant for the ℓ1 ℓ2 norm inequality. Mathematical Inequalities & Applications, 22(1):59 64, 2019.

J.E. Chac on. Clusters and water ﬂows: a novel approach to modal clustering through morse theory. ar Xiv preprint ar Xiv:1212.1384, 2012.

O. Chapelle, B. Scholkopf, and A. Zien. Semi-supervised Learning. MIT Press, 2006.

G. Chen and G. Lerman. Foundations of a multi-way spectral clustering framework for hybrid linear modeling. Foundations of Computational Mathematics, 9(5):517 558, 2009a.

G. Chen and G. Lerman. Spectral curvature clustering (SCC). International Journal of Computer Vision, 81(3):317 330, 2009b.

F. Chung. Spectral Graph Theory, volume 92. American Mathematical Soc., 1997.

R. Coifman and S. Lafon. Diﬀusion maps. Applied and Computational Harmonic Analysis, 21(1):5 30, 2006.

R.R Coifman, S. Lafon, A.B. Lee, M. Maggioni, B. Nadler, F. Warner, and S.W. Zucker. Geometric diﬀusions as a tool for harmonic analysis and structure deﬁnition of data: Diﬀusion maps. Proceedings of the National Academy of Sciences of the United States of America, 102(21):7426 7431, 2005.

R.R. Coifman, I.G. Kevrekidis, S. Lafon, M. Maggioni, and B. Nadler. Diﬀusion maps, reduction coordinates, and low dimensional representation of stochastic systems. Multiscale Modeling & Simulation, 7(2):842 864, 2008.

D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5): 603 619, 2002.

W. Czaja, B. Manning, L. Mc Lean, and J.M. Murphy. Fusion of aerial gamma-ray survey and remote sensing data for a deeper understanding of radionuclide fate after radiological incidents: examples from the Fukushima Dai-Ichi response. Journal of Radioanalytical and Nuclear Chemistry, 307(3):2397 2401, 2016.

S. Dasgupta. Two faces of active learning. Theoretical Computer Science, 412(19): 1767 1781, 2011.

S. Dasgupta and L. Schulman. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. Journal of Machine Learning Research, 8(Feb):203 226, 2007.

F. Den Hollander. Large Deviations, volume 14. American Mathematical Soc., 2008.

Learning by Unsupervised Nonlinear Diffusion

P. Diaconis and L. Saloﬀ-Coste. Comparison theorems for reversible Markov chains. The Annals of Applied Probability, pages 696 730, 1993.

P. Diaconis and D. Stroock. Geometric bounds for eigenvalues of Markov chains. The Annals of Applied Probability, pages 36 61, 1991.

E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35 (11):2765 2781, 2013.

M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In International Conference on Knowledge Discovery and Data Mining (KDD), volume 96, pages 226 231, 1996.

K. Fountoulakis, D.F., and M.W. Mahoney. An optimization approach to locallybiased graph algorithms. Proceedings of the IEEE, 105(2):256 272, 2017.

J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer series in Statistics Springer, Berlin, 2001.

K. Fukunaga and L. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21(1):32 40, 1975.

N Garcia Trillos, D. Slepcev, J. Von Brecht, T. Laurent, and X. Bresson. Consistency of Cheeger and ratio graph cuts. Journal of Machine Learning Research, 17(181): 1 46, 2016.

N. Garcia Trillos, M. Gerlach, M. Hein, and D. Slepcev. Error estimates for spectral convergence of the graph Laplacian on random geometric graphs towards the Laplace Beltrami operator. ar Xiv preprint ar Xiv:1801.10108, 2018.

C. Gardiner. Stochastic Methods, volume 4. springer Berlin, 2009.

M. Gavish and B. Nadler. Normalized cuts are approximately inverse exit times. SIAM Journal on Matrix Analysis and Applications, 34(2):757 772, 2013.

C.R. Genovese, M. Perone-Paciﬁco, I. Verdinelli, and L. Wasserman. Non-parametric inference for density modes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(1):99 126, 2016.

J.C. Gower and G.J.S. Ross. Minimum spanning trees and single linkage cluster analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics), 18 (1):54 64, 1969.

J.A. Hartigan. Consistency of single linkage for high-density clusters. Journal of the American Statistical Society, 76(374):388 394, 1981.

Maggioni and Murphy

M. Jerrum and A. Sinclair. Approximating the permanent. SIAM Journal on Computing, 18(6):1149 1178, 1989.

S. Jia, G. Tang, J. Zhu, and Q. Li. A novel ranking-based clustering approach for hyperspectral band selection. IEEE Transactions on Geoscience and Remote Sensing, 54(1):88 102, 2016.

P.W. Jones, M. Maggioni, and R. Schul. Manifold parametrizations by eigenfunctions of the Laplacian and heat kernels. Proceedings of the National Academy of Sciences, 105(6):1803 1808, 2008.

S. Lafon, Y. Keller, and R.R. Coifman. Data fusion and multicue data matching by diﬀusion maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11):1784 1797, 2006.

R.R. Lederman and R. Talmon. Learning the geometry of common latent variables using alternating-diﬀusion. Applied and Computational Harmonic Analysis, 44(3): 509 536, 2018.

R.R. Lederman, R. Talmon, H. Wu, Y.-L. Lo, and R.R. Coifman. Alternating diffusion for common manifold learning with application to sleep stage assessment. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5758 5762. IEEE, 2015.

D.A. Levin, Y. Peres, and E.L. Wilmer. Markov chains and mixing times. American Mathematical Soc., 2009.

R. Li, M.G. Frasch, and H. Wu. Eﬃcient fetal-maternal ECG signal separation from two channel maternal abdominal ECG via diﬀusion-based channel selection. Frontiers in physiology, 8:277, 2017.

S. Ling and T. Strohmer. Certifying global optimality of graph cuts via semideﬁnite relaxation: A performance guarantee for spectral clustering. Foundations of Computational Mathematics, pages 1 55, 2018.

A. Little and A. Byrd. A multiscale spectral method for learning number of clusters. In International Conference on Machine Learning and Applications (ICMLA), pages 457 460. IEEE, 2015.

A. Little, M. Maggioni, and J.M. Murphy. Path-based spectral clustering: Guarantees, robustness to outliers, and fast algorithms. Ar Xiv:1712.06206, 2017.

M. Maggioni and J.M. Murphy. Learning by active nonlinear diﬀusion. Foundations of Data Science, 1(3):271 291, 2019.

M. Meila. Good (k-means) clusterings are unique (up to small perturbations). Journal of Multivariate Analysis, 173:1 17, 2019.

Learning by Unsupervised Nonlinear Diffusion

M. Meila and J. Shi. Learning segmentation by random walks. In Advances in Neural Information Processing Systems (NIPS), pages 873 879, 2001.

C.D Meyer. Stochastic complementation, uncoupling Markov chains, and the theory of nearly reducible systems. SIAM Review, 31(2):240 272, 1989.

D.G. Mixon, S. Villar, and R. Ward. Clustering subgaussian mixtures by semideﬁnite programming. Information and Inference: A Journal of the IMA, 6(4):389 415, 2017.

J.M. Murphy and M. Maggioni. Diﬀusion geometric methods for fusion of remotely sensed data. In Algorithms and Technologies for Multispectral, Hyperspectral, and Ultraspectral Imagery XXIV, volume 10644, page 106440I. International Society for Optics and Photonics, 2018a.

J.M. Murphy and M. Maggioni. Iterative active learning with diﬀusion geometry for hyperspectral images. In Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), pages 1 5. IEEE, 2018b.

J.M. Murphy and M. Maggioni. Spectral-spatial diﬀusion geometry for hyperspectral image clustering. ar Xiv preprint ar Xiv:1902.05402, 2019a.

J.M. Murphy and M. Maggioni. Unsupervised clustering and active learning of hyperspectral images with nonlinear diﬀusion. IEEE Transactions on Geoscience and Remote Sensing, 57(3):1829 1845, 2019b.

B. Nadler and M. Galun. Fundamental limitations of spectral clustering. In Advances in Neural Information Processing Systems (NIPS), pages 1017 1024, 2007.

A.Y. Ng, M.I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 2:849 856, 2002.

R. Ostrovsky, Y. Rabani, L.J. Schulman, and C. Swamy. The eﬀectiveness of Lloydtype methods for the k-means problem. In Symposium on Foundations of Computer Science (FOCS), pages 165 176. IEEE, 2006.

H.-S. Park and C.-H. Jun. A simple and fast algorithm for k-medoids clustering. Expert Systems with Applications, 36(2):3336 3341, 2009.

A. Rodriguez and A. Laio. Clustering by fast search and ﬁnd of density peaks. Science, 344(6191):1492 1496, 2014.

M.A. Rohrdanz, W. Zheng, M. Maggioni, and C. Clementi. Determination of reaction coordinates via locally scaled diﬀusion map. The Journal of Chemical Physics, 134 (12):03B624, 2011.

Maggioni and Murphy

C. Rossant, S.N. Kadir, D.F.M. Goodman, J. Schulman, M.L.D. Hunter, A.N. Saleem, A. Grosmark, M. Belluscio, G.H. Denﬁeld, A.S. Ecker, et al. Spike sorting for large, dense electrode arrays. Nature Neuroscience, 19(4):634, 2016.

S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323 2326, 2000.

G. Schiebinger, M.J. Wainwright, and B. Yu. The geometry of kernelized spectral clustering. The Annals of Statistics, 43(2):819 846, 2015.

J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888 905, 2000.

T. Shi, M. Belkin, and B. Yu. Data spectroscopy: Eigenspaces of convolution operators and clustering. The Annals of Statistics, pages 3960 3984, 2009.

H.A. Simon and A. Ando. Aggregation of variables in dynamic systems. Econometrica: Journal of the Econometric Society, pages 111 138, 1961.

A. Sinclair and M. Jerrum. Approximate counting, uniform generation and rapidly mixing Markov chains. Information and Computation, 82(1):93 133, 1989.

A. Singer and R.R. Coifman. Non-linear independent component analysis with diﬀusion maps. Applied and Computational Harmonic Analysis, 25(2):226 239, 2008.

A. Singer and H. Wu. Vector diﬀusion maps and the connection Laplacian. Communications on Pure and Applied Mathematics, 65(8):1067 1144, 2012.

A. Singer and H. Wu. Spectral convergence of the connection Laplacian from random samples. Information and Inference: A Journal of the IMA, 6(1):58 123, 2016.

A. Singer, R. Erban, I.G. Kevrekidis, and R.R. Coifman. Detecting intrinsic slow variables in stochastic dynamical systems by anisotropic diﬀusion maps. Proceedings of the National Academy of Sciences, 106(38):16090 16095, 2009.

P.H.A. Sneath. The application of computers to taxonomy. Microbiology, 17(1): 201 226, 1957.

M. Soltanolkotabi, E. Elhamifar, and E.J. Candes. Robust subspace clustering. Annals of Statistics, 42(2):669 699, 2014.

D.A. Spielman and S.-H. Teng. Nearly-linear time algorithms for graph partitioning, graph sparsiﬁcation, and solving linear systems. In Symposium on Theory of Computing (STOC), pages 81 90. ACM, 2004.

D.A. Spielman and S.-H. Teng. A local clustering algorithm for massive graphs and its application to nearly linear time graph partitioning. SIAM Journal on Computing, 42(1):1 26, 2013.

Learning by Unsupervised Nonlinear Diffusion

D.A. Spielman and S.-H. Teng. Nearly linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. SIAM Journal on Matrix Analysis and Applications, 35(3):835 885, 2014.

M.H. Spitzer, P.F. Gherardini, G.K. Fragiadakis, N. Bhattacharya, R.T. Yuan, A.N. Hotson, R. Finck, Y. Carmi, E.R. Zunder, W.J. Fantl, et al. An interactive reference framework for modeling a dynamic immune system. Science, 349(6244):1259425, 2015.

H. Steinhaus. Sur la division des corps mat eriels en parties. Bull. Acad. Polon. Sci., 4(12):801 804, 1957.

K. Sun, X. Geng, and L. Ji. Exemplar component analysis: A fast band selection method for hyperspectral imagery. IEEE Geoscience and Remote Sensing Letters, 12(5):998 1002, 2015.

G. Szeg o. Inequalities for certain eigenvalues of a membrane of given area. Journal of Rational Mechanics and Analysis, 3:343 356, 1954.

R. Talmon and H. Wu. Latent common manifold learning with alternating diﬀusion: analysis and applications. Applied and Computational Harmonic Analysis, 2018.

J.B. Tenenbaum, V. De Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319 2323, 2000.

R. Vidal. Subspace clustering. IEEE Signal Processing Magazine, 28(2):52 68, 2011.

U. Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4): 395 416, 2007.

Y. Wan and M. Meila. Graph clustering: block-models and model free results. In Advances in Neural Information Processing Systems (NIPS), pages 2478 2486, 2016.

P. Wang, B. Xu, J. Xu, G. Tian, C.-L. Liu, and H. Hao. Semantic expansion using word embedding clustering and convolutional neural network for improving short text classiﬁcation. Neurocomputing, 174:806 814, 2016.

X. Wang, K. Slavakis, and G. Lerman. Multi-manifold modeling in non-Euclidean spaces. In International Conference on Artiﬁcial Intelligence and Statistics (AISTATS), pages 1023 1032, 2015.

H.F. Weinberger. An isoperimetric inequality for the n-dimensional free membrane problem. Journal of Rational Mechanics and Analysis, 5(4):633 636, 1956.

C. Wiwie, J. Baumbach, and R. R ottger. Comparing the performance of biomedical clustering methods. Nature Methods, 12(11):1033, 2015.

Maggioni and Murphy

X. Xu, M. Ester, H.-P. Kriegel, and J. Sander. A distribution-based clustering algorithm for mining in large spatial databases. In International Conference on Data Engineering (ICDE), pages 324 331. IEEE, 1998.

H. Yin, A.R. Benson, J. Leskovec, and D.F. Gleich. Local higher-order graph clustering. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 555 564. ACM, 2017.

L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In Advances in Neural Information Processing Systems (NIPS), pages 1601 1608, 2005.

T. Zhang, A. Szlam, Y. Wang, and G. Lerman. Hybrid linear modeling via local best-ﬁt ﬂats. International Journal of Computer Vision, 100(3):217 240, 2012.

W. Zheng, M.A. Rohrdanz, M. Maggioni, and C. Clementi. Polymer reversal rate calculated via locally scaled diﬀusion map. The Journal of Chemical Physics, 134 (14):144109, 2011.