Statistical Inference for Cluster Trees

Jisu Kim (Department of Statistics, Carnegie Mellon University, Pittsburgh, USA; jisuk1@andrew.cmu.edu)
Yen-Chi Chen (Department of Statistics, University of Washington, Seattle, USA; yenchic@uw.edu)
Sivaraman Balakrishnan (Department of Statistics, Carnegie Mellon University, Pittsburgh, USA; siva@stat.cmu.edu)
Alessandro Rinaldo (Department of Statistics, Carnegie Mellon University, Pittsburgh, USA; arinaldo@stat.cmu.edu)
Larry Wasserman (Department of Statistics, Carnegie Mellon University, Pittsburgh, USA; larry@stat.cmu.edu)

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Abstract

A cluster tree provides a highly interpretable summary of a density function by representing the hierarchy of its high-density clusters. It is estimated using the empirical tree, which is the cluster tree constructed from a density estimator. This paper addresses the basic question of quantifying our uncertainty by assessing the statistical significance of topological features of an empirical cluster tree. We first study a variety of metrics that can be used to compare different trees, analyze their properties, and assess their suitability for inference. We then propose methods to construct and summarize confidence sets for the unknown true cluster tree. We introduce a partial ordering on cluster trees which we use to prune some of the statistically insignificant features of the empirical tree, yielding interpretable and parsimonious cluster trees. Finally, we illustrate the proposed methods on a variety of synthetic examples and furthermore demonstrate their utility in the analysis of a Graft-versus-Host Disease (GvHD) data set.

1 Introduction

Clustering is a central problem in the analysis and exploration of data. It is a broad topic, with several existing distinct formulations, objectives, and methods. Despite the extensive literature on the topic, a common aspect of the clustering methodologies that has hindered their widespread scientific adoption is the dearth of methods for statistical inference in the context of clustering. Methods for inference broadly allow us to quantify our uncertainty, to discern true clusters from finite-sample artifacts, and to rigorously test hypotheses related to the estimated cluster structure.

In this paper, we study statistical inference for the cluster tree of an unknown density. We assume that we observe an i.i.d. sample $\{X_1, \ldots, X_n\}$ from a distribution $P_0$ with unknown density $p_0$. Here, $X_i \in \mathcal{X} \subset \mathbb{R}^d$. The connected components $C(\lambda)$ of the upper level set $\{x : p_0(x) \geq \lambda\}$ are called high-density clusters. The set of high-density clusters forms a nested hierarchy which is referred to as the cluster tree (also called the density tree or the level-set tree) of $p_0$, which we denote as $T_{p_0}$.

Methods for density clustering fall broadly in the space of hierarchical clustering algorithms, and inherit several of their advantages: they allow for extremely general cluster shapes and sizes, and in general do not require the pre-specification of the number of clusters. Furthermore, unlike flat clustering methods, hierarchical methods are able to provide a multi-resolution summary of the underlying density. The cluster tree, irrespective of the dimensionality of the input random variable, is displayed as a two-dimensional object, which makes it an ideal tool to visualize data.
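To make the notion of a high-density cluster concrete, here is a minimal sketch (ours, not from the paper) that extracts the connected components of the upper level set $\{x : p(x) \geq \lambda\}$ for a density evaluated on a one-dimensional grid; in one dimension these components are simply maximal runs of grid points at or above the level. The density and levels below are illustrative assumptions.

```python
import numpy as np

def high_density_clusters(grid, density, level):
    """Connected components of the upper level set {x : p(x) >= level},
    approximated as maximal runs of consecutive grid points on which
    the density is at least the level (valid for a 1-d grid)."""
    above = density >= level
    clusters, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i                                    # a component opens
        elif not flag and start is not None:
            clusters.append((grid[start], grid[i - 1]))  # a component closes
            start = None
    if start is not None:
        clusters.append((grid[start], grid[-1]))
    return clusters

# A bimodal density: one cluster at low levels, two above the saddle.
grid = np.linspace(-4, 4, 801)
p = np.exp(-0.5 * (grid + 1.5) ** 2) + np.exp(-0.5 * (grid - 1.5) ** 2)
p /= np.sum(p) * (grid[1] - grid[0])          # normalize on the grid
print(high_density_clusters(grid, p, 0.05))   # one component
print(high_density_clusters(grid, p, 0.15))   # two components
```

Scanning these components across all levels $\lambda$ produces the nested hierarchy that the cluster tree $T_{p_0}$ encodes.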
In the context of statistical inference, density clustering has another important advantage over other clustering methods: the object of inference, the cluster tree of the unknown density $p_0$, is clearly specified. In practice, the cluster tree is estimated from a finite sample, $\{X_1, \ldots, X_n\} \sim p_0$. In a scientific application, we are often most interested in reliably distinguishing topological features genuinely present in the cluster tree of the unknown $p_0$ from topological features that arise due to random fluctuations in the finite sample $\{X_1, \ldots, X_n\}$.

In this paper, we focus our inference on the cluster tree of the kernel density estimator, $T_{\hat{p}_h}$, where $\hat{p}_h$ is the kernel density estimator,

$$\hat{p}_h(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right), \qquad (1)$$

where $K$ is a kernel and $h$ is an appropriately chosen bandwidth (we address computing the tree $T_{\hat{p}_h}$ and the choice of bandwidth in more detail in what follows).

To develop methods for statistical inference on cluster trees, we construct a confidence set for $T_{p_0}$, i.e. a collection of trees that will include $T_{p_0}$ with some (pre-specified) probability. A confidence set can be converted to a hypothesis test, and a confidence set conveys both statistical and scientific significance, while a hypothesis test can only convey statistical significance [23, p. 155]. To construct and understand the confidence set, we need to resolve a few technical and conceptual issues. The first issue is that we need a metric on trees, in order to quantify the collection of trees that are in some sense close enough to $T_{\hat{p}_h}$ to be statistically indistinguishable from it. We use the bootstrap to construct tight data-driven confidence sets. However, only some metrics are sufficiently regular to be amenable to bootstrap inference, which guides our choice of a suitable metric on trees. On the basis of a finite sample, the true density is indistinguishable from a density with additional infinitesimal perturbations. This leads to the second technical issue: our confidence set invariably contains infinitely complex trees. Inspired by the idea of one-sided inference [9], we propose a partial ordering on the set of all density trees with which to define simple trees. To find simple representative trees in the confidence set, we prune the empirical cluster tree by removing statistically insignificant features. The pruned trees retain the statistical guarantee of the confidence set, and are simpler than the empirical cluster tree in the proposed partial ordering.

Our contributions: We begin by considering a variety of metrics on trees, studying their properties and discussing their suitability for inference. We then propose a method for constructing confidence sets and for visualizing trees in this set. This allows us to distinguish aspects of the estimated tree that correspond to real features (those present in the cluster tree $T_{p_0}$) from noise features. Finally, we apply our methods to several simulations, and to a Graft-versus-Host Disease (GvHD) data set, to demonstrate the usefulness of our techniques and the role of statistical inference in clustering problems.
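Before turning to related work, the following is a minimal sketch (our illustration, not the authors' code) of the estimator in Equation (1) with a Gaussian kernel; the sample, grid, and bandwidth are illustrative assumptions.

```python
import numpy as np

def kde(x_eval, data, h):
    """Kernel density estimator of Equation (1) with a Gaussian kernel:
    p_hat_h(x) = (1 / (n h^d)) * sum_i K((x - X_i) / h)."""
    n, d = data.shape
    diffs = (x_eval[:, None, :] - data[None, :, :]) / h        # shape (m, n, d)
    K = np.exp(-0.5 * np.sum(diffs ** 2, axis=2)) / (2 * np.pi) ** (d / 2)
    return K.sum(axis=1) / (n * h ** d)

rng = np.random.default_rng(0)
sample = np.vstack([rng.normal(-1.5, 0.5, (300, 1)),
                    rng.normal(1.5, 0.5, (300, 1))])           # n = 600, d = 1
grid = np.linspace(-4, 4, 401)[:, None]
p_hat = kde(grid, sample, h=0.3)   # the estimate on the grid; its tree is T_{p_hat}
```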
Related work: There is a vast literature on density trees (see for instance the book by Klemelä [16]), and we focus our review on the works most closely aligned with our paper. The formal definition of the cluster tree, and notions of consistency in estimating the cluster tree, date back to the work of Hartigan [15]. Hartigan studied the efficacy of single-linkage in estimating the cluster tree and showed that single-linkage is inconsistent when the input dimension $d > 1$. Several fixes to single-linkage have since been proposed (see for instance [21]). The paper of Chaudhuri and Dasgupta [4] provided the first rigorous minimax analysis of density clustering and provided a computationally tractable, consistent estimator of the cluster tree. The papers [1, 5, 12, 17] propose various modifications and analyses of estimators for the cluster tree. While the question of estimation has been extensively addressed, to our knowledge our paper is the first concerning inference for the cluster tree. There is a literature on inference for phylogenetic trees (see the papers [13, 10]), but the object of inference and the hypothesized generative models are typically quite different. Finally, in our paper, we also consider various metrics on trees. Several recent works in the computational topology literature have considered different metrics on trees. The most relevant to our own work are the papers [2, 18] that propose the functional distortion metric and the interleaving distance on trees. These metrics, however, are NP-hard to compute in general. In Section 3, we consider a variety of computationally tractable metrics and assess their suitability for inference.

Figure 1: Examples of density trees. Black curves are the original density functions and the red trees are the associated density trees.

2 Background and Definitions

We work with densities defined on a subset $\mathcal{X} \subset \mathbb{R}^d$, and denote by $\|\cdot\|$ the Euclidean norm on $\mathcal{X}$. Throughout this paper we restrict our attention to cluster tree estimators that are specified in terms of a function $f: \mathcal{X} \to [0, \infty)$, i.e. we have the following definition:

Definition 1. For any $f: \mathcal{X} \to [0, \infty)$, the cluster tree of $f$ is a function $T_f: \mathbb{R} \to 2^{\mathcal{X}}$, where $2^{\mathcal{X}}$ is the set of all subsets of $\mathcal{X}$, and $T_f(\lambda)$ is the set of the connected components of the upper-level set $\{x \in \mathcal{X} : f(x) \geq \lambda\}$. We define the collection of connected components $\{T_f\}$ as $\{T_f\} = \bigcup_{\lambda \in \mathbb{R}} T_f(\lambda)$.

As will become clearer in what follows, working only with cluster trees defined via a function $f$ simplifies our search for metrics on trees, allowing us to use metrics specified in terms of the function $f$. With a slight abuse of notation, we will use $T_f$ to denote also $\{T_f\}$, and write $C \in T_f$ to signify $C \in \{T_f\}$. The cluster tree $T_f$ indeed has a tree structure, since for every pair $C_1, C_2 \in T_f$, either $C_1 \subset C_2$, $C_2 \subset C_1$, or $C_1 \cap C_2 = \emptyset$ holds. See Figure 1 for a graphical illustration of a cluster tree. The formal definition of the tree requires some topological theory; these details are in Appendix B.

In the context of hierarchical clustering, we are often interested in the height at which two points or two clusters merge in the clustering. We introduce the merge height from [12, Definition 6]:

Definition 2. For any two points $x, y \in \mathcal{X}$, any $f: \mathcal{X} \to [0, \infty)$, and its tree $T_f$, their merge height $m_f(x, y)$ is defined as the largest $\lambda$ such that $x$ and $y$ are in the same density cluster at level $\lambda$, i.e.

$$m_f(x, y) = \sup\{\lambda \in \mathbb{R} : \text{there exists } C \in T_f(\lambda) \text{ such that } x, y \in C\}.$$

We refer to the function $m_f: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ as the merge height function. For any two clusters $C_1, C_2 \in \{T_f\}$, their merge height $m_f(C_1, C_2)$ is defined analogously,

$$m_f(C_1, C_2) = \sup\{\lambda \in \mathbb{R} : \text{there exists } C \in T_f(\lambda) \text{ such that } C_1, C_2 \subset C\}.$$
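In one dimension the merge height of Definition 2 has a simple closed form: $x$ and $y$ lie in a common component of $\{f \geq \lambda\}$ exactly when $f$ stays at or above $\lambda$ on the segment between them, so $m_f(x, y) = \min_{t \in [x, y]} f(t)$. A small grid-based sketch of this observation (ours, with an illustrative density):

```python
import numpy as np

def merge_height_1d(grid, density, x, y):
    """Merge height m_f(x, y) for a 1-d density evaluated on a grid:
    the largest level at which x and y share a cluster, which in one
    dimension is the minimum of f on the segment [x, y]."""
    lo, hi = min(x, y), max(x, y)
    mask = (grid >= lo) & (grid <= hi)
    return density[mask].min()

grid = np.linspace(-4, 4, 801)
p = np.exp(-0.5 * (grid + 1.5) ** 2) + np.exp(-0.5 * (grid - 1.5) ** 2)
# The two modes merge at the saddle between them.
print(merge_height_1d(grid, p, -1.5, 1.5))   # approx. 2 * exp(-1.125)
```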
One of the contributions of this paper is to construct valid confidence sets for the unknown true tree, and to develop methods for visualizing the trees contained in this confidence set. Formally, we assume that we have samples $\{X_1, \ldots, X_n\}$ from a distribution $P_0$ with density $p_0$.

Definition 3. An asymptotic $(1 - \alpha)$ confidence set, $C_\alpha$, is a collection of trees with the property that

$$P_0(T_{p_0} \in C_\alpha) = 1 - \alpha + o(1).$$

We also provide non-asymptotic upper bounds on the $o(1)$ term in the above definition. Additionally, we provide methods to summarize the confidence set above. In order to summarize the confidence set, we define a partial ordering on trees.

Definition 4. For any $f, g: \mathcal{X} \to [0, \infty)$ and their trees $T_f, T_g$, we say $T_f \preceq T_g$ if there exists a map $\Phi: \{T_f\} \to \{T_g\}$ such that for any $C_1, C_2 \in T_f$, we have $C_1 \subset C_2$ if and only if $\Phi(C_1) \subset \Phi(C_2)$.

With Definitions 3 and 4, we describe the confidence set succinctly via some of the simplest trees in the confidence set in Section 4. Intuitively, these are trees without statistically insignificant splits. It is easy to check that the partial order in Definition 4 is reflexive (i.e. $T_f \preceq T_f$) and transitive (i.e. $T_{f_1} \preceq T_{f_2}$ and $T_{f_2} \preceq T_{f_3}$ implies $T_{f_1} \preceq T_{f_3}$). However, to argue that $\preceq$ is a partial order, we need to show antisymmetry, i.e. that $T_f \preceq T_g$ and $T_g \preceq T_f$ together imply that $T_f$ and $T_g$ are equivalent in some sense. In Appendices A and B, we show an important result: for an appropriate topology on trees, $T_f \preceq T_g$ and $T_g \preceq T_f$ implies that $T_f$ and $T_g$ are topologically equivalent.

Figure 2: Three illustrations of the partial order in Definition 4. In each case, in agreement with our intuitive notion of simplicity, the tree on the top ((a), (b), and (c)) is lower than the corresponding tree on the bottom ((d), (e), and (f)) in the partial order, i.e. for each example $T_p \preceq T_q$.

The partial order in Definition 4 matches intuitive notions of the complexity of the tree for several reasons (see Figure 2). Firstly, $T_f \preceq T_g$ implies (number of edges of $T_f$) $\leq$ (number of edges of $T_g$) (compare Figure 2(a) and (d), and see Lemma 6 in Appendix B). Secondly, if $T_g$ is obtained from $T_f$ by adding edges, then $T_f \preceq T_g$ (compare Figure 2(b) and (e), and see Lemma 7 in Appendix B). Finally, the existence of a topology-preserving embedding from $\{T_f\}$ to $\{T_g\}$ implies the relationship $T_f \preceq T_g$ (compare Figure 2(c) and (f), and see Lemma 8 in Appendix B).

3 Tree Metrics

In this section, we introduce some natural metrics on cluster trees and study the properties that determine their suitability for statistical inference. We let $p, q: \mathcal{X} \to [0, \infty)$ be nonnegative functions and let $T_p$ and $T_q$ be the corresponding trees.

3.1 Metrics

We consider three metrics on cluster trees; the first is the standard $\ell_\infty$ metric, while the second and third are metrics that appear in the work of Eldridge et al. [12].

$\ell_\infty$ metric: The simplest metric is

$$d_\infty(T_p, T_q) = \|p - q\|_\infty = \sup_{x \in \mathcal{X}} |p(x) - q(x)|.$$

We will show in what follows that, in the context of statistical inference, this metric has several advantages over other metrics.

Merge distortion metric: The merge distortion metric intuitively measures the discrepancy between the merge height functions (Definition 2) of two trees. We consider the merge distortion metric [12, Definition 11] defined by

$$d_M(T_p, T_q) = \sup_{x, y \in \mathcal{X}} |m_p(x, y) - m_q(x, y)|.$$

The merge distortion metric we consider is a special case of the metric introduced by Eldridge et al. [12], who further allow flexibility in taking the supremum over a subset of $\mathcal{X}$. The merge distortion metric was introduced by Eldridge et al. [12] to study the convergence of cluster tree estimators. They establish several interesting properties of the merge distortion metric: in particular, the metric is stable to perturbations in $\ell_\infty$, and further, convergence in the merge distortion metric strengthens previous notions of convergence of cluster trees.
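As a concrete (and deliberately brute-force) illustration, the sketch below approximates both $d_\infty$ and $d_M$ for one-dimensional densities on a grid, using the fact that in one dimension $m_f(x, y)$ is the minimum of $f$ on $[x, y]$; Lemma 1 below shows the two metrics coincide for continuous densities, which this sketch can be used to check numerically. The densities are illustrative assumptions.

```python
import numpy as np

def d_infty(p_vals, q_vals):
    """The ell-infinity metric between two densities on a shared grid."""
    return np.max(np.abs(p_vals - q_vals))

def d_merge(grid, p_vals, q_vals, step=8):
    """Brute-force approximation of the merge distortion metric
    d_M = sup_{x,y} |m_p(x,y) - m_q(x,y)| for 1-d densities, where
    m_f(x, y) is the minimum of f on the segment [x, y]."""
    idx = range(0, len(grid), step)
    best = 0.0
    for i in idx:
        for j in idx:
            lo, hi = min(i, j), max(i, j)
            gap = abs(p_vals[lo:hi + 1].min() - q_vals[lo:hi + 1].min())
            best = max(best, gap)
    return best

grid = np.linspace(-4, 4, 401)
p = np.exp(-0.5 * (grid + 1.5) ** 2) + np.exp(-0.5 * (grid - 1.5) ** 2)
q = np.exp(-0.5 * grid ** 2)
print(d_infty(p, q), d_merge(grid, p, q))   # approximately equal (Lemma 1(i))
```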
Modified merge distortion metric: We also consider the modified merge distortion metric given by

$$d_{MM}(T_p, T_q) = \sup_{x, y \in \mathcal{X}} |d_{T_p}(x, y) - d_{T_q}(x, y)|, \quad \text{where} \quad d_{T_p}(x, y) = p(x) + p(y) - 2 m_p(x, y)$$

corresponds to the (pseudo)-distance between $x$ and $y$ along the tree. The metric $d_{MM}$ is used in various proofs in the work of Eldridge et al. [12]. It is sensitive both to distortions of the merge heights in Definition 2 and to distortions of the underlying densities. Since the metric captures the distortion of distances between points along the tree, it is in some sense most closely aligned with the cluster tree. Finally, it is worth noting that unlike the interleaving distance and the functional distortion metric [2, 18], the three metrics we consider in this paper are quite simple to approximate to high precision.

3.2 Properties of the Metrics

The following lemma gives some basic relationships between the three metrics $d_\infty$, $d_M$, and $d_{MM}$. We define $p_{\inf} = \inf_{x \in \mathcal{X}} p(x)$, and $q_{\inf}$ analogously, and

$$a = \inf_{x \in \mathcal{X}} \{p(x) + q(x)\} - 2 \min\{p_{\inf}, q_{\inf}\}.$$

Note that when the Lebesgue measure $\mu(\mathcal{X})$ is infinite, then $p_{\inf} = q_{\inf} = a = 0$.

Lemma 1. For any densities $p$ and $q$, the following relationships hold:
(i) When $p$ and $q$ are continuous, then $d_\infty(T_p, T_q) = d_M(T_p, T_q)$.
(ii) $d_{MM}(T_p, T_q) \leq 4 d_\infty(T_p, T_q)$.
(iii) $d_{MM}(T_p, T_q) \geq d_\infty(T_p, T_q) - a$, where $a$ is defined as above. Additionally, when $\mu(\mathcal{X}) = \infty$, then $d_{MM}(T_p, T_q) \geq d_\infty(T_p, T_q)$.

The proof is in Appendix F. From Lemma 1, we see that under a mild assumption (continuity of the densities), $d_\infty$ and $d_M$ are equivalent. We note again that the work of Eldridge et al. [12] actually defines a family of merge distortion metrics, while we restrict our attention to a canonical one. We can also see from Lemma 1 that while the modified merge metric is not equivalent to $d_\infty$, it is usually multiplicatively sandwiched by $d_\infty$.

Our next line of investigation is aimed at assessing the suitability of the three metrics for the task of statistical inference. Given the strong equivalence of $d_\infty$ and $d_M$, we focus our attention on $d_\infty$ and $d_{MM}$. Based on prior work (see [7, 8]), the large-sample behavior of $d_\infty$ is well understood. In particular, $d_\infty(T_{\hat{p}_h}, T_{p_0})$ converges to the supremum of an appropriate Gaussian process, on the basis of which we can construct confidence intervals for the $d_\infty$ metric. The situation for the metric $d_{MM}$ is substantially more subtle. One of our eventual goals is to use the non-parametric bootstrap to construct valid estimates of the confidence set. In general, a way to assess the amenability of a functional to the bootstrap is via Hadamard differentiability [24]. Roughly speaking, Hadamard differentiability is a type of statistical stability, ensuring that the functional under consideration is stable to perturbations in the input distribution. In Appendix C, we formally define Hadamard differentiability and prove that $d_{MM}$ is not point-wise Hadamard differentiable. This does not completely rule out the possibility of finding a way to construct confidence sets based on $d_{MM}$, but doing so would be difficult, and so far we know of no way to do it.

In summary: based on computational considerations, we eliminate the interleaving distance and the functional distortion metric [2, 18]; we eliminate the $d_{MM}$ metric based on its unsuitability for statistical inference; and we focus the rest of our paper on the $d_\infty$ (or, equivalently, $d_M$) metric, which is both computationally tractable and has well-understood statistical behavior.
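The sandwich relations of Lemma 1 can be checked numerically. Below is a small sketch (ours, with illustrative densities) that computes $d_{MM}$ by brute force over grid pairs, using the one-dimensional form of the merge height, and verifies the bound $d_{MM} \leq 4 d_\infty$ of Lemma 1(ii) on an example.

```python
import numpy as np

def tree_dist(vals, i, j):
    """Tree (pseudo-)distance in 1-d: d_T(x, y) = f(x) + f(y) - 2 m_f(x, y),
    with the merge height m_f taken as the minimum of f on [x, y]."""
    lo, hi = (i, j) if i <= j else (j, i)
    return vals[i] + vals[j] - 2 * vals[lo:hi + 1].min()

def d_mm(p_vals, q_vals, step=8):
    """Brute-force sup over grid pairs of |d_Tp(x, y) - d_Tq(x, y)|."""
    idx = range(0, len(p_vals), step)
    return max(abs(tree_dist(p_vals, i, j) - tree_dist(q_vals, i, j))
               for i in idx for j in idx)

grid = np.linspace(-4, 4, 401)
p = np.exp(-0.5 * (grid + 1.5) ** 2) + np.exp(-0.5 * (grid - 1.5) ** 2)
q = np.exp(-0.5 * grid ** 2)
d_inf = np.max(np.abs(p - q))
print(d_mm(p, q), 4 * d_inf)   # Lemma 1(ii): the first value is smaller
```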
4 Confidence Sets

In this section, we consider the construction of valid confidence sets centered around the kernel density estimator, defined in Equation (1). We first observe that a fixed bandwidth for the KDE gives a dimension-free rate of convergence for estimating a cluster tree. For estimating a density in high dimensions, the KDE has a poor rate of convergence, because the bandwidth must shrink with $n$ to simultaneously control the bias and the variance of the KDE. When estimating a cluster tree, however, the bias of the KDE does not affect the estimated cluster tree: intuitively, the cluster tree is a shape characteristic of a function, which is not affected by the bias. Defining the biased density $p_h(x) = \mathbb{E}[\hat{p}_h(x)]$, the cluster trees of $p_h$ and of the true density $p_0$ are equivalent with respect to the topology in Appendix A, provided $h$ is small enough and $p_0$ is regular enough:

Lemma 2. Suppose that the true unknown density $p_0$ has no degenerate critical points (i.e. the Hessian of $p_0$ at every critical point is non-degenerate; such functions are known as Morse functions). Then there exists a constant $h_0 > 0$ such that for all $0 < h \leq h_0$, the two cluster trees $T_{p_0}$ and $T_{p_h}$ have the same topology in Appendix A.

From Lemma 2, proved in Appendix G, a fixed bandwidth for the KDE can be used to give a dimension-free rate of convergence for estimating the cluster tree. Instead of decreasing the bandwidth $h$ and inferring the cluster tree of the true density, $T_{p_0}$, at rate $O_P(n^{-2/(4+d)})$, Lemma 2 implies that we can fix $h > 0$ and infer the cluster tree of the biased density, $T_{p_h}$, at rate $O_P(n^{-1/2})$, independently of the dimension. Hence a fixed bandwidth crucially enhances the convergence rate of the proposed methods in high-dimensional settings.

4.1 A data-driven confidence set

We recall that we base our inference on the $d_\infty$ metric, and we recall the definition of a valid confidence set (see Definition 3). As a conceptual first step, suppose that for a specified value $\alpha$ we could compute the $1 - \alpha$ quantile of the distribution of $d_\infty(T_{\hat{p}_h}, T_{p_h})$, and denote this value $t_\alpha$. Then a valid confidence set for the unknown $T_{p_h}$ is $C_\alpha = \{T : d_\infty(T, T_{\hat{p}_h}) \leq t_\alpha\}$.

To estimate $t_\alpha$, we use the bootstrap. Specifically, we generate $B$ bootstrap samples $\{\tilde{X}_1^1, \ldots, \tilde{X}_n^1\}, \ldots, \{\tilde{X}_1^B, \ldots, \tilde{X}_n^B\}$ by sampling with replacement from the original sample. On each bootstrap sample, we compute the KDE and the associated cluster tree. We denote the resulting cluster trees $\{\tilde{T}^1_{p_h}, \ldots, \tilde{T}^B_{p_h}\}$. Finally, we estimate $t_\alpha$ by $\hat{t}_\alpha = \hat{F}^{-1}(1 - \alpha)$, where

$$\hat{F}(s) = \frac{1}{B} \sum_{i=1}^{B} I\left(d_\infty(\tilde{T}^i_{p_h}, T_{\hat{p}_h}) < s\right).$$

Then the data-driven confidence set is $\hat{C}_\alpha = \{T : d_\infty(T, T_{\hat{p}_h}) \leq \hat{t}_\alpha\}$. Using techniques from [8, 7], the following can be shown (proof omitted):

Theorem 3. Under mild regularity conditions on the kernel (see Appendix D.1 for details), the constructed confidence set is asymptotically valid and satisfies

$$P\left(T_{p_h} \in \hat{C}_\alpha\right) = 1 - \alpha + O\left(\left(\frac{\log^7 n}{n h^d}\right)^{1/6}\right).$$

Hence our data-driven confidence set is consistent at a dimension-independent rate. When $h$ is a fixed small constant, Lemma 2 implies that $T_{p_0}$ and $T_{p_h}$ have the same topology, and Theorem 3 guarantees that the non-parametric bootstrap is consistent at a dimension-independent $O(((\log n)^7 / n)^{1/6})$ rate. For reasons explained in [8], this rate is believed to be optimal.
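The following is a minimal sketch of the bootstrap construction of Section 4.1 in one dimension (our illustration; the sample, bandwidth, grid, and number of bootstrap replicates are assumptions). By Lemma 1(i), $d_\infty$ between the trees equals the sup-norm distance between the corresponding densities, which we approximate on a grid.

```python
import numpy as np

def kde_1d(x_eval, data, h):
    """Gaussian KDE on a 1-d grid (a stand-in for the estimator p_hat_h)."""
    z = (x_eval[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

def bootstrap_t_alpha(data, grid, h, alpha=0.05, B=500, seed=0):
    """Bootstrap estimate of t_alpha: the (1 - alpha) quantile of
    d_infty(T_tilde, T_hat), approximated by the sup-norm distance
    between each bootstrap KDE and the original KDE on the grid."""
    rng = np.random.default_rng(seed)
    p_hat = kde_1d(grid, data, h)
    dists = np.empty(B)
    for b in range(B):
        resample = rng.choice(data, size=len(data), replace=True)
        dists[b] = np.max(np.abs(kde_1d(grid, resample, h) - p_hat))
    return np.quantile(dists, 1 - alpha)

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-1.5, 0.5, 300), rng.normal(1.5, 0.5, 300)])
grid = np.linspace(-4, 4, 401)
t_alpha = bootstrap_t_alpha(data, grid, h=0.3)
# The confidence set is then {T : d_infty(T, T_hat) <= t_alpha}.
```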
4.2 Probing the Confidence Set

The confidence set $\hat{C}_\alpha$ is an infinite set with a complex structure. Infinitesimal perturbations of the density estimate are in our confidence set, and so this set contains very complex trees. One way to understand the structure of the confidence set is to focus attention on simple trees in the confidence set. Intuitively, these trees contain only topological features (splits and branches) that are sufficiently strongly supported by the data. We propose two pruning schemes to find trees that are simpler than the empirical tree $T_{\hat{p}_h}$ and that lie in the confidence set. Pruning the empirical tree aids visualization, and de-noises the empirical tree by eliminating some features that arise solely due to the stochastic variability of the finite sample. The algorithms are (see Figure 3):

1. Pruning only leaves: Remove all leaves of length less than $2\hat{t}_\alpha$ (Figure 3(b)).
2. Pruning leaves and internal branches: In this case, we first prune the leaves as above. This yields a new tree. Now we again prune (using cumulative length) any leaf of length less than $2\hat{t}_\alpha$. We continue iteratively until all remaining leaves are of cumulative length larger than $2\hat{t}_\alpha$ (Figure 3(c)).

Figure 3: Illustrations of our two pruning strategies. (a) shows the empirical tree. In (b), leaves that are insignificant are pruned, while in (c), insignificant internal branches are further pruned top-down.

In Appendix D.2 we formally define the pruning operation and show the following. The remaining tree $\tilde{T}$ after either of the above pruning operations satisfies: (i) $\tilde{T} \preceq T_{\hat{p}_h}$, (ii) there exists a function $f$ whose tree is $\tilde{T}$, and (iii) $\tilde{T} \in \hat{C}_\alpha$ (see Lemma 10 in Appendix D.2). In other words, we have identified a valid tree, carrying a statistical guarantee, that is simpler than the original estimate $T_{\hat{p}_h}$; intuitively, some of the statistically insignificant features have been removed from $T_{\hat{p}_h}$. We should point out, however, that there may exist other trees in $\hat{C}_\alpha$ that are simpler than $T_{\hat{p}_h}$. Ideally, we would like to have an algorithm that identifies all trees in the confidence set that are minimal with respect to the partial order in Definition 4. This is an open question that we will address in future work.
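To make the first pruning scheme concrete, here is a toy sketch (ours; the Branch structure and threshold are illustrative, and the cumulative-length scheme 2 is omitted) that removes every leaf whose branch is shorter than a given threshold, playing the role of $2\hat{t}_\alpha$.

```python
from dataclasses import dataclass, field

@dataclass
class Branch:
    """Toy representation of a cluster-tree branch (our illustration, not
    the paper's data structure): a branch lives on the density interval
    [birth, top], where birth is the level at which it splits off and
    top is the level at which it terminates."""
    birth: float
    top: float
    children: list = field(default_factory=list)

def prune_leaves(node, threshold):
    """Pruning scheme 1: remove every leaf whose branch length
    (top - birth) is below the threshold, 2 * t_alpha in the paper.
    (The full procedure of Section 4.2 additionally re-prunes using
    cumulative length; that iterative step is omitted here.)"""
    node.children = [prune_leaves(c, threshold) for c in node.children]
    node.children = [c for c in node.children
                     if c.children or (c.top - c.birth) >= threshold]
    return node

# A root splitting into one tall leaf and one spurious short leaf.
root = Branch(0.0, 0.3, [Branch(0.3, 0.9), Branch(0.3, 0.34)])
pruned = prune_leaves(root, threshold=0.1)
print(len(pruned.children))   # 1: the short leaf was pruned away
```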
5 Experiments

In this section, we demonstrate the techniques we have developed for inference on synthetic data, as well as on a real dataset.

5.1 Simulated data

We consider three simulations: the ring data (Figure 4(a) and (d)), the Mickey Mouse data (Figure 4(b) and (e)), and the yingyang data (Figure 4(c) and (f)). The smoothing bandwidth is chosen by the Silverman reference rule [20] and we pick the significance level $\alpha = 0.05$.

Figure 4: Simulation examples. (a) and (d) are the ring data; (b) and (e) are the Mickey Mouse data; (c) and (f) are the yingyang data. The solid lines are the pruned trees; the dashed lines are leaves (and edges) removed by the pruning procedure. A bar of length $2\hat{t}_\alpha$ is at the top right corner. The pruned trees recover the actual structure of connected components.

Example 1: The ring data (Figure 4(a) and (d)). The ring data consists of two structures: an outer ring and a center node. The outer circle consists of 1000 points and the central node contains 200 points. To construct the tree, we used $h = 0.202$.

Example 2: The Mickey Mouse data (Figure 4(b) and (e)). The Mickey Mouse data has three components: the top-left and top-right uniform circles (400 points each) and the center circle (1200 points). In this case, we select $h = 0.200$.

Example 3: The yingyang data (Figure 4(c) and (f)). This data has 5 connected components: the outer ring (2000 points), the two moon-shaped regions (400 points each), and the two nodes (200 points each). We choose $h = 0.385$.

Figure 4 shows these data ((a), (b), and (c)) along with the pruned density trees (the solid parts in (d), (e), and (f)). Before pruning (both solid and dashed parts), the trees have more leaves than the actual number of connected components. But after pruning (only the solid parts), every leaf corresponds to an actual connected component. This demonstrates the power of a good pruning procedure.

5.2 GvHD dataset

Now we apply our method to the GvHD (Graft-versus-Host Disease) dataset [3]. GvHD is a complication that may occur when transplanting bone marrow or stem cells from one subject to another [3]. We obtained the GvHD dataset from the R package mclust. There are two subsamples: the control sample and the positive (treatment) sample. The control sample consists of 9083 observations and the positive sample contains 6809 observations on 4 biomarker measurements ($d = 4$). By the normal reference rule [20], we pick $h = 39.1$ for the positive sample and $h = 42.2$ for the control sample. We set the significance level $\alpha = 0.05$.

Figure 5: The GvHD data. (a) The positive (treatment) sample; (b) the control sample. The solid brown lines are the remaining branches after pruning; the blue dashed lines are the pruned leaves (or edges). A bar of length $2\hat{t}_\alpha$ is at the top right corner.

Figure 5 shows the density trees of both samples. The solid brown parts are the components of the density trees remaining after pruning, and the dashed blue parts are the branches removed by pruning. As can be seen, the pruned density tree of the positive sample (Figure 5(a)) is quite different from the pruned tree of the control sample (Figure 5(b)). The density function of the positive sample has fewer bumps (2 significant leaves) than that of the control sample (3 significant leaves). By comparing the pruned trees, we can see how the two distributions differ from each other.

6 Discussion

There are several open questions that we will address in future work. First, it would be useful to have an algorithm that can find all trees in the confidence set that are minimal with respect to the partial order $\preceq$; these are the simplest trees consistent with the data. Second, we would like to find a way to derive valid confidence sets using the metric $d_{MM}$, which we view as an appealing metric for tree inference. Finally, we have used the Silverman reference rule [20] for choosing the bandwidth, but we would like to find a bandwidth selection method that is more targeted to tree inference.

References

[1] S. Balakrishnan, S. Narayanan, A. Rinaldo, A. Singh, and L. Wasserman. Cluster trees on manifolds. In Advances in Neural Information Processing Systems, 2012.
[2] U. Bauer, E. Munch, and Y. Wang. Strong equivalence of the interleaving and functional distortion metrics for Reeb graphs. In 31st International Symposium on Computational Geometry (SoCG 2015), volume 34, pages 461-475. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2015.
[3] R. R. Brinkman, M. Gasparetto, S.-J. J. Lee, A. J. Ribickas, J. Perkins, W. Janssen, R. Smiley, and C. Smith.
High-content flow cytometry and temporal data analysis for defining a cellular signature of graft-versus-host disease. Biology of Blood and Marrow Transplantation, 13(6):691-700, 2007.
[4] K. Chaudhuri and S. Dasgupta. Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems, pages 343-351, 2010.
[5] K. Chaudhuri, S. Dasgupta, S. Kpotufe, and U. von Luxburg. Consistent procedures for cluster tree estimation and pruning. IEEE Transactions on Information Theory, 2014.
[6] F. Chazal, B. T. Fasy, F. Lecci, B. Michel, A. Rinaldo, and L. Wasserman. Robust topological inference: Distance to a measure and kernel distance. arXiv preprint arXiv:1412.7197, 2014.
[7] Y.-C. Chen, C. R. Genovese, and L. Wasserman. Density level sets: Asymptotics, inference, and visualization. arXiv preprint arXiv:1504.05438, 2015.
[8] V. Chernozhukov, D. Chetverikov, and K. Kato. Central limit theorems and bootstrap in high dimensions. Annals of Probability, 2016.
[9] D. Donoho. One-sided inference about functionals of a density. The Annals of Statistics, 16(4):1390-1420, 1988.
[10] B. Efron, E. Halloran, and S. Holmes. Bootstrap confidence levels for phylogenetic trees. Proceedings of the National Academy of Sciences, 93(23), 1996.
[11] U. Einmahl and D. M. Mason. Uniform in bandwidth consistency of kernel-type function estimators. The Annals of Statistics, 33(3):1380-1403, 2005.
[12] J. Eldridge, M. Belkin, and Y. Wang. Beyond Hartigan consistency: Merge distortion metric for hierarchical clustering. In Proceedings of The 28th Conference on Learning Theory, pages 588-606, 2015.
[13] J. Felsenstein. Confidence limits on phylogenies, a justification. Evolution, 39, 1985.
[14] C. R. Genovese, M. Perone-Pacifico, I. Verdinelli, and L. Wasserman. Nonparametric ridge estimation. The Annals of Statistics, 42(4):1511-1545, 2014.
[15] J. A. Hartigan. Consistency of single linkage for high-density clusters. Journal of the American Statistical Association, 1981.
[16] J. Klemelä. Smoothing of multivariate data: density estimation and visualization, volume 737. John Wiley & Sons, 2009.
[17] S. Kpotufe and U. V. Luxburg. Pruning nearest neighbor cluster trees. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 225-232, 2011.
[18] D. Morozov, K. Beketayev, and G. Weber. Interleaving distance between merge trees. Discrete and Computational Geometry, 49:22-45, 2013.
[19] D. W. Scott. Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons, 2015.
[20] B. W. Silverman. Density estimation for statistics and data analysis, volume 26. CRC Press, 1986.
[21] W. Stuetzle and R. Nugent. A generalized single linkage method for estimating the cluster tree of a density. Journal of Computational and Graphical Statistics, 19(2), 2010.
[22] L. Wasserman. All of Nonparametric Statistics. Springer Science & Business Media, 2006.
[23] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media, 2010. ISBN 1441923225, 9781441923226.
[24] J. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Science & Business Media, 2013.