Journal of Machine Learning Research 24 (2023) 1-56. Submitted 12/21; Revised 8/23; Published 8/23

Adaptive Clustering Using Kernel Density Estimators

Ingo Steinwart (ingo.steinwart@mathematik.uni-stuttgart.de), University of Stuttgart, Department of Mathematics, D-70569 Stuttgart, Germany
Bharath K. Sriperumbudur (bks18@psu.edu), Pennsylvania State University, Department of Statistics, University Park, PA 16802, USA
Philipp Thomann (philipp.thomann@d-one.ai), D ONE Solutions AG, Sihlfeldstrasse 58, 8003 Zürich, Switzerland

Editor: Sivan Sabato

Abstract. We derive and analyze a generic, recursive algorithm for estimating all splits in a finite cluster tree as well as the corresponding clusters. We further investigate statistical properties of this generic clustering algorithm when it receives level set estimates from a kernel density estimator. In particular, we derive finite sample guarantees, consistency, rates of convergence, and an adaptive data-driven strategy for choosing the kernel bandwidth. For these results we do not need continuity assumptions on the density such as Hölder continuity, but only require intuitive geometric assumptions of non-parametric nature. In addition, we compare our results to other guarantees found in the literature and also present some experiments comparing our algorithm to k-means and hierarchical clustering.

Keywords: cluster analysis, kernel density estimation, consistency, rates, adaptivity

1. Introduction

A widely acknowledged problem in cluster analysis is the definition of a learning goal that describes a conceptually and mathematically convincing definition of clusters. One such definition, which goes back to Hartigan (1975) and is known as density-based clustering, assumes i.i.d. data D = (x_1, . . . , x_n) generated by some unknown distribution P on X ⊂ R^d.
Given some level ρ ≥ 0, the clusters of P are then defined to be the connected components of the level set {h ≥ ρ} := {x ∈ X : h(x) ≥ ρ}, where h is the density of P with respect to the Lebesgue measure. This single level approach has been studied, for example, by Hartigan (1975); Cuevas and Fraiman (1997); Rigollet (2007); Maier et al. (2009); Rinaldo and Wasserman (2010). However, one of the conceptual drawbacks of the single level approach is that different values of ρ may lead to different (numbers of) clusters, see Figure 1, and in addition, there is no general rule for choosing ρ. To address this conceptual shortcoming, one often considers the so-called cluster tree approach instead, which tries to consider all levels and the corresponding connected components simultaneously, see the left picture in Figure 2 for an illustration. There exists a variety of articles investigating properties of the cluster tree approach, see e.g. (Hartigan, 1975; Stuetzle, 2003; Chaudhuri and Dasgupta, 2010; Stuetzle and Nugent, 2010; Kpotufe and von Luxburg, 2011; Chaudhuri et al., 2014; Wang et al., 2019) for details.

© 2023 Ingo Steinwart, Bharath K. Sriperumbudur, and Philipp Thomann. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-1532.html.

Figure 1: The single level approach applied to a probability density h on [0, 1]. Left. The level set {h ≥ ρ_1}, displayed by the orange horizontal line, consists of one connected component, which results in one cluster for ρ_1. For illustrational purposes, the part of the graph of h that belongs to this level set is also colored in orange. Right. For a slightly larger ρ_2, the level set {h ≥ ρ_2} consists of two connected components, which are indicated by the green and purple horizontal lines. As a result, we obtain two clusters for this level ρ_2.
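To make the single level approach concrete, the following small sketch (our own illustration, not code from the paper) computes the connected components of {h ≥ ρ} for a bimodal density on a one-dimensional grid and reproduces the effect shown in Figure 1: one component at a low level, two at a higher one. The particular density and levels are hypothetical choices.

```python
import numpy as np

def level_set_components(xs, h, rho):
    """Connected components (maximal runs of grid points) of {h >= rho} on a 1-d grid."""
    mask = h >= rho
    comps, current = [], []
    for x, m in zip(xs, mask):
        if m:
            current.append(x)
        elif current:
            comps.append((current[0], current[-1]))
            current = []
    if current:
        comps.append((current[0], current[-1]))
    return comps

xs = np.linspace(0.0, 1.0, 2001)
# bimodal density shape on [0, 1]; normalization is irrelevant for counting components
h = np.exp(-((xs - 0.3) / 0.15) ** 2) + 0.9 * np.exp(-((xs - 0.7) / 0.15) ** 2)

print(len(level_set_components(xs, h, 0.2)))  # low level rho_1: one component
print(len(level_set_components(xs, h, 0.6)))  # higher level rho_2: two components
```

Raising ρ past the valley between the two modes is exactly what splits one cluster into two, which is why the choice of a single ρ is so delicate.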
For example, Chaudhuri and Dasgupta (2010) show, under some assumptions on h, that a modified single linkage algorithm recovers this tree in the sense of Hartigan (1981), and Kpotufe and von Luxburg (2011); Chaudhuri et al. (2014) obtain similar results for an underlying k-NN density estimator. In addition, Kpotufe and von Luxburg (2011); Chaudhuri et al. (2014) propose a simple pruning strategy that removes connected components that artificially occur because of finite sample variability. However, the notion of recovery taken from Hartigan (1981) only focuses on the correct estimation of the cluster tree structure and not on the estimation of the clusters themselves, cf. the discussion by Steinwart (2011). Finally, the most recent paper (Wang et al., 2019) establishes guarantees including rates of convergence for each fixed level set, provided that a kernel density estimator is used to produce level set estimates and the density has a certain smoothness such as α-Hölder continuity.

A third approach, taken by Steinwart (2011); Sriperumbudur and Steinwart (2012); Steinwart (2015a), tries to estimate both the first split level ρ* in the cluster tree and the corresponding clusters, see the right picture in Figure 2 for an illustration. As in the previously discussed papers, finite sample bounds are derived, which in (Steinwart, 2015a) are extended to learning rates. For example, these learning rates for estimating ρ* are of the probabilistic form

P^n(D ∈ X^n : 0 ≤ ρ*_D − ρ* ≤ K a_n) → 1,

where ρ*_D is the sample-based estimate constructed by the considered algorithm, (a_n) is a null sequence, and K is a constant depending on some assumptions on P. In other words, with high probability, the algorithm estimates ρ* up to precision K a_n. The learning rates for estimating the resulting clusters are of similar probabilistic nature. Moreover, Steinwart (2015a) shows that these learning rates can also be obtained by an adaptive, fully data-driven hyper-parameter selection strategy. Unfortunately, however, Steinwart (2011, 2015a) only considers the simplest possible density estimator, namely a histogram approach, and Sriperumbudur and Steinwart (2012) restrict their considerations to compactly supported moving window density estimates for α-Hölder-continuous densities. In addition, the method by Sriperumbudur and Steinwart (2012) requires the user to know α, and hence it is not data-driven. Finally, all three papers completely ignore the behavior of the considered algorithm for single cluster distributions, i.e. for distributions that do not have a split in the cluster tree, and for multi cluster distributions, i.e. for distributions that have more than one split in the cluster tree.

Figure 2: A density h on [0, 1] and different density-based clustering approaches. Left. The cluster tree approach considers various, or ideally even all, levels (in grey) simultaneously, making it superfluous to choose one level in advance. However, for a given algorithm the difficulty of detecting connected components may change with the considered level, since, for example, the distances of related connected components, i.e. clusters having the same color, change with the level, as can be seen for the second and third level from below. Right. The split tree approach tries to estimate both the levels at which a split in the infinite, ideal cluster tree of h occurs and the resulting clusters at these levels. For the depicted h this leads to the problem of estimating the split levels ρ*_1, ρ*_2 (drawn in yellow) as well as the resulting clusters for ρ*_1 (drawn in green and purple) and for ρ*_2 (drawn in blue and dark pink). Our previous work, see (Steinwart, 2011; Sriperumbudur and Steinwart, 2012; Steinwart, 2015a), only considered the first split ρ*_1 and its clusters.
As a consequence, it remained unclear whether and how a suitably modified version of this algorithm can be used to estimate the split tree, i.e. the combination of all levels at which a split in the cluster tree occurs together with the resulting clusters at these splits. We refer to Figure 2 for an illustration of such a split tree. The goal of this paper is to address the discussed issues of Steinwart (2011); Sriperumbudur and Steinwart (2012); Steinwart (2015a). To be more precise, compared to these articles, we establish the following new results:

i) For single cluster distributions, we propose a new set of regularity assumptions for levels ρ at which the level set {h ≥ ρ} is small. For example, for bounded densities, these assumptions roughly speaking guarantee that the level sets do not frazzle for levels ρ close to the maximum ∥h∥_∞ of the density h. Such assumptions were missing in (Steinwart, 2011; Sriperumbudur and Steinwart, 2012; Steinwart, 2015a).

Figure 3: Guarantees provided in previous papers such as Steinwart (2015a) and in this paper. Left. Steinwart (2015a) etc. only provide guarantees for estimating the first split level ρ*_1 and the corresponding clusters. In other words, only the light blue area below the algorithm's split level estimate ρ_{1,out} is covered. Right. In this paper we provide guarantees for estimating the entire split tree. To be more specific, Part ii) of Theorem 9 ensures that the generic cluster algorithm works correctly in the green area between ρ_{1,out} and a distribution dependent value ρ**_1. On the left hand side, in the yellow area, Theorem 7 then guarantees that the cluster algorithm correctly recognizes that there is no further split in the split tree. In contrast, on the right hand side, Part iii) of Theorem 9 makes it possible to reuse the guarantees of Steinwart (2015a) after some minor technical modifications. As a consequence, it can be ensured in the gray-blue area on the right hand side that the algorithm outputs an estimate ρ_{2,out} of the second split level ρ*_2 together with estimates for the corresponding clusters. Then, in the green area on top of this, Part ii) of Theorem 9 again guarantees that the cluster algorithm works correctly up to some distribution dependent ρ**_2. Finally, Theorem 7 ensures in the two yellow areas at the very top that the cluster algorithm correctly recognizes that there is no further split in the cluster tree.

ii) We present a simple modification of the output behavior of the generic cluster algorithm of Steinwart (2015a) to deal with distributions that do not have a split in the cluster tree. Based on our new regularity assumptions in i) and the ones from Steinwart (2015a), we then show that this new cluster algorithm is able to: a) provide an estimate ρ_out of the first split level ρ* if there is one; b) correctly detect distributions for which there is no such split level; c) construct estimates B_i of the clusters A*_i occurring at the first split level. Note that from a technical side, both a) with finite sample bounds on |ρ_out − ρ*| and c) with finite sample bounds on λ^d(B_i △ A*_i) directly follow from Steinwart (2015a), since our modification of the generic cluster algorithm scans through candidate levels ρ in exactly the same way as the original algorithm of Steinwart (2015a) does. Therefore, the surprising, and compared to (Steinwart, 2015a) new, part of our finite sample guarantees is the fact that this scanning procedure does not need to be changed for correctly detecting single cluster distributions in b). Note that a highly beneficial side-effect of this fact is that our analysis in b), as well as in iii) and iv) below, can rely on the extensive set of tools developed by Steinwart (2015a).

iii) We then show how the results of ii) can be used to estimate the entire split tree by recursively applying the new generic cluster algorithm.
While from a higher perspective this result does not seem too surprising, it turns out that there are still a couple of serious technical difficulties involved. In a nutshell, these difficulties relate to the fact that the generic algorithm may return an estimate ρ_out for ρ* for which the connected components of {h ≥ ρ_out} are not yet sufficiently far apart from each other. While such an estimate ρ_out for ρ* is desirable, it also prohibits a direct recursive application of the results of ii). To address this issue, we analyze the behavior of the generic cluster algorithm above the returned level ρ_out. In this analysis, which also goes beyond (Steinwart, 2011; Sriperumbudur and Steinwart, 2012; Steinwart, 2015a), it turns out that the algorithm behaves correctly until it reaches a level ρ** for which the connected components of {h ≥ ρ**} are sufficiently far apart from each other. Above this level ρ**, we further show that the results of ii) can then be recursively applied, leading to guarantees for the entire split tree. We refer to Figure 3 for a detailed description of how the different guarantees can be combined and how these guarantees differ from our previous work (Steinwart, 2011; Sriperumbudur and Steinwart, 2012; Steinwart, 2015a). Besides these improvements for the generic cluster algorithm, we additionally present the following results:

iv) We show that the new generic cluster algorithm does not only work with an underlying histogram density estimator (HDE) as in (Steinwart, 2011, 2015a), but also with a variety of kernel density estimators (KDEs). Here it turns out that the results of Steinwart (2015a), including those for the adaptive, fully data-driven hyper-parameter selection strategy, remain valid for the resulting new clustering algorithm, provided that the kernel has bounded support.
Moreover, if the kernel has an exponential tail behavior, then the results remain true modulo an extra logarithmic term, while in the case of even heavier tails, we show that the rates become worse by a polynomial factor. Note that compared to Steinwart (2015a), all the results for KDEs are new. Moreover, the results for KDEs substantially extend the results of Sriperumbudur and Steinwart (2012), since there a) only moving window kernels were treated, and b) only α-Hölder continuous densities with known α were considered. In contrast, our new results do not even require continuous densities, and for this reason, we also obtain significantly more general results than the currently best results for KDE-based clustering achieved by Wang et al. (2019). The latter improvement is partially made possible because we can rely on the tools of Steinwart (2015a). However, compared to the HDEs in (Steinwart, 2015a), considering KDEs still requires significant technical efforts such as finite sample bounds for the ∥·∥_∞-distance between a KDE and its population version.

v) We discuss in some detail the differences between our results and those of Chaudhuri et al. (2014) and Wang et al. (2019), which in some sense are the articles closest to ours. Here it turns out that, depending on the set of assumptions, sometimes the learning rates for estimating the split levels obtained by Wang et al. (2019) are better and sometimes the ones obtained by us are better. However, adaptivity, for example, is not achieved by Wang et al. (2019). The latter is also true for the paper by Chaudhuri et al. (2014), but their rates are worse than ours, and in addition, their set of assumptions is more restrictive.

vi) We present some experiments comparing our algorithm, run with either the moving window kernel or the Epanechnikov kernel, to both k-means and hierarchical clustering.
Here, it turns out that our algorithm outperforms the latter two as soon as the sample size is sufficiently large. In addition, our algorithm is less sensitive to clusters having different spatial scales.

This paper is organized as follows: In Section 2 we recall the key concepts of Steinwart (2015a). In Section 3 we first introduce the new regularity assumptions mentioned in i). We then introduce and analyze the new generic cluster algorithm as described in ii). Moreover, the recursive approach described in iii) is analyzed in detail. Section 4 then presents the key guarantees for level sets generated by KDEs, and Section 5 contains the material mentioned in iv), namely finite sample bounds as well as consistency results, rates of convergence, and an adaptive data-driven strategy for choosing the kernel bandwidth. In Section 6 we present the comparison to the articles by Chaudhuri et al. (2014) and Wang et al. (2019), and Section 7 contains the experiments. All proofs can be found in Section 8.

2. Preliminaries

In this section we recall the setup for defining density-based clusters in a general context from Steinwart (2015a). To this end, let ∥·∥ be a norm on R^d. Then we denote the closed unit ball of this norm by B_∥·∥ and write B_∥·∥(x, δ) := x + δB_∥·∥. If the norm is known from the context, we usually write B(x, δ) instead. Moreover, the Euclidean norm on R^d is denoted by ∥·∥_2, and for the Lebesgue volume of its unit ball we write vol_d. Finally, ∥·∥_∞ denotes the supremum norm for functions. Let us now assume that we have some A ⊂ X ⊂ R^d as well as some norm ∥·∥ on R^d. Then, for δ ≥ 0, we define the δ-tube and δ-trim of A in X by

A^{+δ} := A^{+δ}_X := {x ∈ X : d(x, A) ≤ δ}  and  A^{−δ} := A^{−δ}_X := X \ (X \ A)^{+δ},

where d(x, A) := inf_{x′ ∈ A} ∥x − x′∥. We refer to Figure 4 for an illustration of these concepts as well as to some possible topological changes when going from A to either A^{+δ} or A^{−δ}. We further write int A for the interior of A and cl A for the closure of A. Moreover, ∂A := cl A \ int A denotes the boundary of A.
Obviously, we have A^{+0} = A, and hence also A^{−0} = A. Furthermore, since x ↦ d(x, A) is continuous, A^{+δ} is always closed in X and A^{−δ} is always open in X. Note that if A is bounded, we always find compact and convex X ⊂ R^d with A^{+δ}_X = A^{+δ}_{R^d} and A^{−δ}_X = A^{−δ}_{R^d}. Based on this observation and the fact that we usually consider δ ∈ (0, 1] in combination with some suitably chosen X, see Assumption P below for details, we often ignore the surrounding set X. In addition to these notations, we denote the inradius and diameter of a bounded A ⊂ R^d by inrad A and diam A, respectively, that is

inrad A := sup{r > 0 : ∃x ∈ A with B(x, r) ⊂ A},
diam A := sup{∥x − x′∥ : x, x′ ∈ A}.

Some interesting properties of these quantities can be found in (Steinwart et al., 2021, Lemma 8.1 and Lemma 8.2). Throughout this work, 1_A denotes the indicator function of a set A, and A △ B the symmetric difference of two sets A and B.

Figure 4: Left. A set A in blue together with a δ-tube A^{+δ} of it in green and a δ-trim A^{−δ} of it in orange, where in both cases we considered the supremum norm in R^2. All three sets are connected. Middle. A set A in blue that consists of two connected components. This time, however, its δ-tube A^{+δ} is connected. Right. A connected set A in blue, for which its δ-trim A^{−δ} has two connected components. In summary, we see that the connected component structure of A does not necessarily relate to those of A^{+δ} and A^{−δ}.

Let us now assume that P is a probability measure on a closed X ⊂ R^d that is absolutely continuous with respect to the Lebesgue measure λ^d. Then P has a λ^d-density h, and one could define the clusters of P to be the connected components of the level set {h ≥ ρ}, where ρ ≥ 0 is some user-defined threshold. Unfortunately, however, this notion leads to serious issues if there is no canonical choice of h such as a continuous version, see the illustrations in (Steinwart, 2015a, Section 2.1).
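The δ-tube and δ-trim operations above are easy to experiment with on a finite grid. The following sketch (our own toy setup, not from the paper) uses the supremum norm on a 10×10 integer grid standing in for X, and checks A^{+0} = A, the sandwich A^{−δ} ⊂ A ⊂ A^{+δ}, and that a thin set can have an empty δ-trim, in line with Lemma 4 below.

```python
from itertools import product

def d_inf(x, A):
    """Distance from point x to a non-empty finite set A in the supremum norm."""
    return min(max(abs(xi - ai) for xi, ai in zip(x, a)) for a in A)

def tube(A, X, delta):
    """delta-tube A^{+delta} within the ambient grid X."""
    return {x for x in X if d_inf(x, A) <= delta}

def trim(A, X, delta):
    """delta-trim A^{-delta} = X \\ (X \\ A)^{+delta}."""
    return set(X) - tube(set(X) - set(A), X, delta)

X = set(product(range(10), range(10)))
# two separated 2x2 blobs, loosely mirroring the middle panel of Figure 4
A = {(x, y) for (x, y) in X if (2 <= x <= 3 or 6 <= x <= 7) and 4 <= y <= 5}

assert tube(A, X, 0) == A                   # A^{+0} = A
assert trim(A, X, 1) <= A <= tube(A, X, 1)  # A^{-delta} subset A subset A^{+delta}
print(len(trim(A, X, 1)), len(A), len(tube(A, X, 1)))
```

Because each blob is only two cells thick, its 1-trim is already empty, while the 1-tube grows each blob by one cell in every direction.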
To address this issue, Steinwart (2015a) considered, for ρ ≥ 0, the measures

µ_ρ(A) := λ^d(A ∩ {h ≥ ρ}),  A ∈ B(R^d).

Since µ_ρ is independent of the choice of h := dP/dλ^d, the set

M_ρ := supp µ_ρ,

where supp µ_ρ denotes the support of the measure µ_ρ, is independent of this choice, too. For any λ^d-density h of P, the definition immediately gives

λ^d({h ≥ ρ} \ M_ρ) = λ^d({h ≥ ρ} ∩ (R^d \ M_ρ)) = µ_ρ(R^d \ M_ρ) = 0,  (1)

i.e. modulo λ^d-zero sets, the level sets {h ≥ ρ} are not larger than M_ρ. In fact, M_ρ turns out to be the smallest closed set satisfying (1), and it is shown in (Steinwart, 2015b, Lemma A.1.2) that M_ρ is wedged between the interior and the closure of {h ≥ ρ}, that is, int{h ≥ ρ} ⊂ M_ρ ⊂ cl{h ≥ ρ}, and consequently M_ρ \ {h ≥ ρ} ⊂ ∂{h ≥ ρ}.

Figure 5: Examples of injective, surjective, and bijective CRMs, which also illustrate partitions generated by connected components. Left. For A := A′ ∪ A′′, P(A) := {A′, A′′}, and P(B) := {B}, the CRM ζ : P(A) → P(B) is surjective but not injective. Middle. For B := B′ ∪ B′′ and the obvious definitions of P(A) and P(B), the CRM ζ : P(A) → P(B) is injective but not surjective. Right. Analogously, for A := A′ ∪ A′′ and B := B′ ∪ B′′, the CRM ζ : P(A) → P(B) is bijective and hence P(A) is persistent in P(B). In this case the structures of the cells in A and B have a clear one-to-one relation to each other. Note that in all cases we further have P(A) = C(A) and P(B) = C(B).

In order to ensure inclusions that are inverse to (1), Steinwart (2015a) said that P is normal at level ρ if there exist two λ^d-densities h_1 and h_2 of P such that

λ^d(M_ρ \ {h_1 ≥ ρ}) = λ^d({h_2 > ρ} \ M_ρ) = 0.

In particular, if P is normal at level ρ, then M_ρ and {h_1 ≥ ρ} only differ by a Lebesgue zero set. Moreover, it is shown in (Steinwart, 2015b, Lemma A.1.3)¹ that P is normal at every level if it has both an upper semi-continuous λ^d-density h_1 and a lower semi-continuous λ^d-density h_2. Furthermore, if P has a λ^d-density h such that λ^d(∂{h ≥ ρ}) = 0, then the same lemma shows that P is normal at level ρ. This implies that essentially all P one usually thinks of are normal.
Finally, if the conditions of normality at level ρ are satisfied for some λ^d-densities h_1 and h_2 of P, then they are actually satisfied for all λ^d-densities h of P and, as mentioned above, we have λ^d(M_ρ △ {h ≥ ρ}) = 0. The next assumption collects the concepts introduced so far.

Assumption P. We have a λ^d-absolutely continuous probability measure P that is normal at every level. In addition, supp P is compact, and X ⊂ R^d is a compact and connected set with (supp P)^{+2}_{R^d} ⊂ X.

Note that the assumption (supp P)^{+2}_{R^d} ⊂ X ensures both A^{+δ}_X = A^{+δ}_{R^d} and A^{−δ}_X = A^{−δ}_{R^d} for all δ ∈ (0, 1] and A ⊂ supp P. In this case, we therefore usually ignore the surrounding X. Let us now recall the definition of clusters from Steinwart (2015a). We begin with the following definition.

Definition 1. Let P(A) and P(B) be partitions of A and B, respectively, with ∅ ≠ A ⊂ B. Then P(A) is comparable to P(B), in which case we write P(A) ⊑ P(B), if, for all A′ ∈ P(A), there is a B′ ∈ P(B) with A′ ⊂ B′.

Informally speaking, P(A) is comparable to P(B) if no cell A′ ∈ P(A) is broken into pieces in P(B). In particular, if P_1 and P_2 are two partitions of A, then P_1 ⊑ P_2 if and only if P_1 is finer than P_2. Now assume that P(A) and P(B) are two partitions with P(A) ⊑ P(B). Then (Steinwart, 2015b, Lemma A.2.1) shows that there exists a unique map ζ : P(A) → P(B) with

A′ ⊂ ζ(A′),  A′ ∈ P(A).

Following Steinwart (2011, 2015a), we call ζ the cell relating map (CRM) between A and B. Moreover, if ζ is bijective, we say that P(A) is persistent in P(B). We refer to Figure 5 for illustrations of persistent and non-persistent partitions. The first examples of comparable partitions come from connected components. To be more precise, let A ⊂ R^d be a closed subset and C(A) be the collection of its connected components.

1. In this lemma, the term upper normal at level ρ means that λ^d(M_ρ \ {h_1 ≥ ρ}) = 0 for some density h_1 := dP/dλ^d, while lower normal at level ρ means λ^d({h_2 > ρ} \ M_ρ) = 0 for some density h_2 := dP/dλ^d.
By definition, C(A) forms a partition of A, and if B ⊂ R^d is another closed subset with A ⊂ B and |C(B)| < ∞, then we have C(A) ⊑ C(B), see (Steinwart, 2015b, Lemma A.2.3) as well as Figure 5. Following Steinwart (2015a), another class of partitions arises from a discrete notion of path-connectivity. To recall the latter, we fix a τ > 0, an A ⊂ R^d, and a norm ∥·∥ on R^d. Then x, x′ ∈ A are τ-connected in A if there exist x_1, . . . , x_n ∈ A such that x_1 = x, x_n = x′ and ∥x_i − x_{i+1}∥ < τ for all i = 1, . . . , n − 1. Clearly, being τ-connected gives an equivalence relation on A. We write C_τ(A) for the resulting partition and call its cells the τ-connected components of A. It has been shown in (Steinwart, 2015b, Lemma A.2.7) that C_τ(A) ⊑ C_τ(B) for all A ⊂ B and τ > 0. Moreover, if |C(A)| < ∞, then C(A) = C_τ(A) for all sufficiently small τ > 0, see (Steinwart, 2015a, Section 2.2) for details. Following Steinwart (2015a), we can now describe probability measures that can be clustered.

Definition 2. Let Assumption P be satisfied. Then P can be clustered between ρ* ≥ 0 and ρ** > ρ*, if for all ρ ∈ [0, ρ**] we have |C(M_ρ)| ∈ {1, 2} and the following two conditions are met:

i) If we have |C(M_ρ)| = 1, then ρ ≤ ρ*.
ii) If we have |C(M_ρ)| = 2, then ρ > ρ* and C(M_{ρ**}) is persistent in C(M_ρ).

Using the CRMs ζ_ρ : C(M_{ρ**}) → C(M_ρ), we then define the clusters of P by

A*_i := ⋃_{ρ ∈ (ρ*, ρ**]} ζ_ρ(A_i),  i ∈ {1, 2},

where A_1 and A_2 are the topologically connected components of M_{ρ**}. Finally, we define

τ*(ε) := (1/3) d(ζ_{ρ*+ε}(A_1), ζ_{ρ*+ε}(A_2)),  ε ∈ (0, ρ** − ρ*].  (2)

Definition 2 ensures that the level sets below ρ* are connected, while for a certain range above ρ* the level sets have exactly two components, which, in addition, are assumed to be persistent, see the right picture in Figure 6 for an illustration. Thus, the topological structure of M_ρ between the split level ρ* and an anchor level ρ** of it equals that of M_{ρ**}. Consequently, we can use the connected components of M_{ρ**} to number the connected components of M_ρ for ρ ∈ (ρ*, ρ**]. This is done in the definition of the clusters A*_i as well as in the definition of the function τ*, which essentially measures the distance between the two connected components at level ρ* + ε. We again refer to Figure 6 for some visual impressions.

Figure 6: Left. A density h on [0, 1] that can be clustered between ρ*_1 and ρ**_1. Below the split level ρ*_1, all level sets M_ρ only have one connected component. Between the split level ρ*_1 and an anchor level ρ**_1 of it, that is, for all ρ ∈ (ρ*_1, ρ**_1], all level sets M_ρ have two connected components (in green and purple). Moreover, we have persistence of C(M_{ρ**_1}) in C(M_ρ), as indicated by the same color of the related cells. Taking their unions results in the two clusters A*_1 and A*_2 at the split level ρ*_1, which are again indicated in green and purple. Finally, note that τ*(ε) equals the distance of the green and purple components at level ρ*_1 + ε modulo the factor 3. Middle. A connected set B with δ-trim B^{−δ} = A′ ∪ A′′. The resulting CRM ζ : C(B^{−δ}) → C(B) is only surjective, and since the bridge in the middle narrows faster than linearly, the thickness function has an exponent γ < 1. Right. Again, a connected set B with δ-trim B^{−δ} = A′ ∪ A′′, for which the resulting CRM ζ : C(B^{−δ}) → C(B) is only surjective. This time, however, the thickness function has an exponent γ = 1.

The major goal of Steinwart (2011, 2015a) was to design an algorithm that is able to asymptotically estimate both the correct value of ρ* and the clusters A*_1 and A*_2. However, this algorithm required that the level sets do not have bridges or cusps that are too thin. To make this precise, let us recall that for a closed A ⊂ R^d, Steinwart (2011, 2015a) considered the function ψ_A : (0, ∞) → [0, ∞] defined by

ψ_A(δ) := sup_{x ∈ A} d(x, A^{−δ}),  δ > 0.

Roughly speaking, ψ_A(δ) describes the smallest radius ε needed to recover A from A^{−δ} in the sense of A ⊂ (A^{−δ})^{+ε}, see (Steinwart, 2015b, Section A.5) for this and various other results on ψ_A.
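The τ-connected components just defined are straightforward to compute for a finite point set by chaining points at distance less than τ. The following sketch (our own illustration, not the paper's implementation) does this with a depth-first search under the supremum norm.

```python
def tau_connected_components(points, tau, dist=None):
    """Partition a finite point set into its tau-connected components:
    x and x' share a component iff there is a chain x = x_1, ..., x_n = x'
    within the set with dist(x_i, x_{i+1}) < tau for all i."""
    if dist is None:
        dist = lambda p, q: max(abs(a - b) for a, b in zip(p, q))  # sup-norm
    points = list(points)
    seen, comps = set(), []
    for p in points:
        if p in seen:
            continue
        stack, comp = [p], set()
        while stack:
            q = stack.pop()
            if q in comp:
                continue
            comp.add(q)
            stack.extend(r for r in points if r not in comp and dist(q, r) < tau)
        seen |= comp
        comps.append(comp)
    return comps

pts = [(0.0,), (0.4,), (0.8,), (3.0,), (3.3,)]
print([sorted(c) for c in tau_connected_components(pts, tau=0.5)])
```

With τ = 0.5, the three left points form one component (consecutive gaps 0.4 < τ) and the two right points form another, since the gap of 2.2 between the groups exceeds τ. This is the discrete notion of connectivity Algorithm 1 below relies on.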
In particular, we have ψ_A(δ) ≥ δ for all δ > 0, and ψ_A(δ) = ∞ if A^{−δ} = ∅. Moreover, ψ_A behaves linearly if bridges and cusps of A are not too thin, and even thinner cusps and bridges can be included by considering sets with ψ_A(δ) ≤ cδ^γ for some fixed c ≥ 1, γ ∈ (0, 1], and all sufficiently small δ > 0. However, the discussion in (Steinwart, 2015b, Section A.5) also showed that γ = 1 can be viewed as the most normal case. Finally, we refer to Figure 6 for some examples of linear and nonlinear behavior of ψ_A that is relevant for the following definition taken from Steinwart (2015a).

Definition 3. Let Assumption P be satisfied. Then we say that P has thick level sets of order γ ∈ (0, 1] up to the level ρ** > 0 if there exist constants c_thick ≥ 1 and δ_thick ∈ (0, 1] such that, for all δ ∈ (0, δ_thick] and ρ ∈ [0, ρ**], we have

ψ_{M_ρ}(δ) ≤ c_thick δ^γ.

In this case, we call ψ(δ) := 3c_thick δ^γ the thickness function of P.

Now, the following assumption describes the probability measures we wish to cluster.

Assumption M. The probability measure P can be clustered between ρ* and ρ**, see Definition 2. In addition, P has thick level sets of order γ ∈ (0, 1] up to the level ρ**. We denote the corresponding thickness function by ψ and write τ* for the function defined in (2).

The theory developed by Steinwart (2011); Sriperumbudur and Steinwart (2012); Steinwart (2015a) focused on the question whether it is possible to estimate ρ* and the resulting clusters for distributions that can be clustered. To this end, it was assumed that we have an estimation algorithm that constructs, for a given data set D, level set estimates L_{D,ρ} for all ρ > 0. Moreover, it was assumed that these level set estimates satisfy the guarantees

M^{−δ}_{ρ+ε} ⊂ L_{D,ρ} ⊂ M^{+δ}_{ρ−ε}  (3)

for all ρ ∈ [0, ρ**] and some ε, δ > 0.
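A simple way to produce level set estimates L_{D,ρ} of the kind required in (3) is to threshold a density estimator on the sample points. The sketch below (our own toy construction; the mixture data, bandwidth, and level are hypothetical choices) uses a one-dimensional moving window (uniform kernel) density estimate, one of the kernel choices discussed later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def moving_window_kde(data, x, bandwidth):
    """Uniform-kernel ("moving window") density estimate at the points x (1-d):
    fraction of the data within the window, divided by the window length."""
    diffs = np.abs(x[:, None] - data[None, :])
    return (diffs <= bandwidth).mean(axis=1) / (2 * bandwidth)

def level_set_estimate(data, rho, bandwidth):
    """L_{D,rho}: the sample points whose density estimate is at least rho."""
    hD = moving_window_kde(data, data, bandwidth)
    return data[hD >= rho]

# mixture of two uniform "clusters" on [0, 0.3] and [0.7, 1.0]
data = np.concatenate([rng.uniform(0.0, 0.3, 300), rng.uniform(0.7, 1.0, 300)])
L = level_set_estimate(data, rho=0.8, bandwidth=0.05)
print(len(L))
```

Since the estimate decreases in ρ, the family (L_{D,ρ})_{ρ>0} is automatically decreasing, as required by the generic algorithm below; how close it comes to satisfying (3) depends on the bandwidth and the tail behavior of the kernel, which is exactly what Sections 4 and 5 quantify.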
Based on these assumptions, a generic cluster algorithm was developed, where generic refers to the fact that the cluster algorithm does not need to know specifics about the construction of L_{D,ρ}. Instead, only the guarantees (3) with known values for ε and δ are needed. The key result (Steinwart, 2015a, Theorem 2.9) then specified in terms of ε and δ how well this algorithm estimates both ρ* and the clusters A*_1 and A*_2. Here the main difficulty of establishing this result arose from the fact that no pair of the sets M_{ρ+ε}, M_{ρ−ε}, M^{−δ}_{ρ+ε}, and M^{+δ}_{ρ−ε} needs to have the same topological structure in terms of their connected components. Indeed, parts of these incompatibilities can already be seen in Figures 1 and 4. What is missing in this analysis, however, is an investigation of the behavior of the generic cluster algorithm in situations in which P cannot be clustered because all level sets are connected. Now observe that the reason for this gap was the notion of thickness: Indeed, if P is a single-cluster probability measure, i.e. |C(M_ρ)| ≤ 1 for all ρ ≥ 0, and P has thick level sets of order γ up to the level ρ* := sup{ρ ≥ 0 : |C(M_ρ)| = 1}, then the proof of (Steinwart, 2015a, Theorem 2.9) can be easily extended to show that at each visited level ρ the algorithm correctly detects exactly one connected component. Unfortunately, however, the assumption of having thick level sets up to the height ρ* of the peak of h is too unrealistic, as it requires M^{−δ}_ρ ≠ ∅ for all ρ ∈ [0, ρ*] and δ ∈ (0, δ_thick], that is, the peak needs to be a plateau that contains a ball of radius δ_thick, as the following lemma shows.

Lemma 4. Let A ⊂ R^d be bounded and δ > 0. Then A^{−δ}_{R^d} ≠ ∅ if and only if δ < inrad A.

3. A Generic Algorithm for Estimating the Split Tree

In this section we present a generic algorithm for estimating the entire split tree.
To this end, we first introduce a new set of assumptions for single-cluster distributions that rules out irregular behavior of the level sets in the vicinity of the peak of the density. Unlike the naïve approach we have discussed at the end of Section 2, this new set of assumptions includes a variety of realistic behaviors. In the second step we then present a generic cluster algorithm, whose only difference to the one in (Steinwart, 2015a) is its output behavior in situations in which no split has been detected. We then show that this new cluster algorithm, like its predecessor in (Steinwart, 2015a), correctly identifies a split in the cluster tree. Moreover, we demonstrate that, unlike the one in (Steinwart, 2015a), the new cluster algorithm also correctly identifies single-cluster distributions. Finally, we combine these insights to develop a new generic algorithm for estimating the entire cluster tree. Let us begin by introducing a new assumption for dealing with single-cluster distributions.

Assumption S. Assumption P is satisfied and there are ρ* ≥ 0, γ ∈ (0, 1], c_thick ≥ 1, and δ_thick ∈ (0, 1] such that for all ρ ≥ ρ* and δ ∈ (0, δ_thick] we have |C(M_ρ)| ≤ 1 as well as:

i) If M^{−δ}_ρ ≠ ∅, then ψ_{M_ρ}(δ) ≤ c_thick δ^γ.
ii) If M^{−δ}_ρ = ∅, then, for all ∅ ≠ A ⊂ M^{+δ}_ρ and τ > 2c_thick δ^γ, we have |C_τ(A)| = 1.

Note that |C(M_ρ)| ≤ 1 simply means that the level sets of P above ρ* are either empty or connected. If they are empty, there is nothing more to assume, and in the other case we can either have M^{−δ}_ρ ≠ ∅ or M^{−δ}_ρ = ∅. If M^{−δ}_ρ ≠ ∅, then condition i) ensures that the level set M_ρ is still thick in the sense of Definition 3, while in the other case, M^{−δ}_ρ = ∅, condition ii) guarantees that the larger sets M^{+δ}_ρ cannot have multiple τ-connected components as long as we choose τ in the way that is required in the case of multiple clusters, too.
In this respect note that (Steinwart et al., 2021, Lemma 8.3) shows that for all δ ∈ (0, δ_thick], there exists a ρ ≥ ρ* with M^{−δ}_ρ = ∅, and therefore dealing with case ii) cannot be avoided. In the case γ = 1, condition ii) can also be interpreted in terms of the inradius and the diameter, as the following lemma shows:

Lemma 5 For compact M ⊆ R^d and c ≥ 1 the following statements are equivalent:

i) For all δ > 0 with M^{−δ} = ∅ and all ∅ ≠ A ⊆ M^{+δ} and τ > 2cδ, we have |C_τ(A)| = 1.

ii) We have 2(c − 1) inrad M ≥ diam M.

Since by Lemma 4 there only exists some δ ∈ (0, δ_thick] with M^{−δ}_ρ = ∅ if inrad M_ρ < δ_thick, we thus see that for γ = 1, condition ii) is equivalent to

2(c_thick − 1) inrad M_ρ ≥ diam M_ρ (4)

for all ρ ≥ ρ* with inrad M_ρ < δ_thick. Inequality (4) essentially states that the diameter-to-inradius ratio must remain bounded for increasing ρ. Consequently, (4) is satisfied for c_thick := 2 if all M_ρ above ρ* are balls with respect to the considered norm, since in this case we have 2 inrad M_ρ = diam M_ρ.

Figure 7: Contour lines of two continuous densities. The levels ρ are chosen such that we have inrad M_ρ ∈ {0, 1, 2, 3, 4, 5} with respect to the supremum norm. Left. Assumption S is satisfied for γ = 1, since, for increasing ρ, diam M_ρ decays faster than inrad M_ρ does, and hence (4) holds. Right. Assumption S is not satisfied for γ = 1, since the highest level set satisfies inrad M_ρ = 0 and diam M_ρ > 0, which violates (4).

Algorithm 1 Clustering with the help of a generic level set estimator
Require: Some ε, τ > 0 and a start level ρ0 ≥ 0. A decreasing family (L_ρ)_{ρ≥0} of subsets of X.
Ensure: An estimate of ρ* and the corresponding clusters, or the start level ρ0 and L_{ρ0}.
1: ρ ← ρ0
2: repeat
3: Identify the τ-connected components B1, . . . , BM of L_ρ satisfying Bi ∩ L_{ρ+2ε} ≠ ∅.
4: ρ ← ρ + ε
5: until M ≠ 1
6: ρ ← ρ + 2ε
7: Identify the τ-connected components B1, . . . , BM of L_ρ satisfying Bi ∩ L_{ρ+2ε} ≠ ∅.
8: if M > 1 then
9: return ρ_out = ρ and the sets Bi for i = 1, . . . , M.
10: else
11: return ρ_out = ρ0 and the set L_{ρ0}.
12: end if
Moreover, by increasing c_thick we see that (4) remains true if we distort these balls by bi-Lipschitz continuous maps whose constants can be bounded independently of ρ. Analogously, (4) is satisfied if the M_ρ are balls with respect to a fixed norm that is different from the one used for the δ-tubes and δ-trims in condition ii). In addition, if we have a flat plateau at the highest level ρ_max := sup{ρ > 0 : M_ρ ≠ ∅} < ∞, that is, inrad M_{ρ_max} > 0, then (4) is always satisfied for some c_thick > 1 because of diam M_ρ ≤ diam X < ∞. In contrast, (4) is violated if, for example, inrad M_{ρ_max} = 0 and diam M_{ρ_max} > 0. Finally, for d = 1, we always have 2 inrad M_ρ = diam M_ρ for all non-empty level sets M_ρ, since for these |C(M_ρ)| = 1 ensures that M_ρ is an interval. Consequently, for d = 1 Assumption S is satisfied with γ = 1, c_thick = 2, and δ_thick = 1 whenever Assumption P is satisfied. For two-dimensional examples we refer to Figure 7 for an illustration.

The next task is to formulate a generic algorithm that is able to estimate ρ* and the resulting clusters if P can be clustered in the sense of Assumption M, and that is also able to detect distributions that only have one cluster in the sense of Assumption S. We will see in the following that Algorithm 1 is such an algorithm. Before we present the corresponding results, we first note that the only differences between Algorithm 1 and the algorithm considered by Steinwart (2015a) are the more flexible start level ρ0, compared to ρ0 = 0 in (Steinwart, 2015a), and the modified output in Lines 8-12. Indeed, the algorithm in (Steinwart, 2015a) always produces the return values of Line 9. In contrast, Algorithm 1 distinguishes between the cases M > 1 and M = 0. While for M > 1 the outputs of both algorithms coincide, the new Algorithm 1 now returns ρ0 and L_{ρ0} in the case M = 0. We will see in Theorem 7 that the latter case typically occurs for P satisfying Assumption S.
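To make the loop structure of Algorithm 1 concrete, here is a minimal Python sketch. It is our own illustration, not the authors' implementation: `level_set(rho)` is a hypothetical callable returning the finite level set estimate L_ρ as a list of points, and τ-connected components are computed by chaining points at distance at most τ with a union-find pass.

```python
import numpy as np

def tau_components(points, tau):
    """Partition a finite point set into tau-connected components:
    two points share a component if they are linked by a chain of
    points with consecutive distances <= tau (union-find)."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(points[i] - points[j]) <= tau:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(points[i])
    return list(groups.values())

def algorithm1(level_set, eps, tau, rho0):
    """Sketch of Algorithm 1: climb the levels in steps of eps until the
    number of surviving tau-connected components differs from one, then
    step up by 2*eps and confirm."""
    def surviving(rho):
        comps = tau_components(level_set(rho), tau)
        upper = level_set(rho + 2 * eps)
        # keep only the components B_i intersecting the higher level set
        return [B for B in comps
                if any(any(np.array_equal(x, y) for y in upper) for x in B)]

    rho = rho0
    while len(surviving(rho)) == 1:
        rho += eps
    rho += 2 * eps
    comps = surviving(rho)
    if len(comps) > 1:                    # a split was detected
        return rho, comps
    return rho0, level_set(rho0)          # single-cluster case
```

On a one-dimensional toy example with a density dip between two groups of points, the sketch climbs past the dip level and returns two clusters; with a unimodal sample it falls through to the single-cluster branch, mirroring the distinction between Theorems 6 and 7 below.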
In this respect recall that L_{ρ0} can be viewed as an estimate of M_{ρ0}, and therefore returning L_{ρ0} makes sense for such distributions. The next theorem adapts (Steinwart, 2015a, Theorem 2.9) to Algorithm 1. Since the proof of (Steinwart, 2015a, Theorem 2.9) can easily be adapted to arbitrary start levels ρ0 ≥ 0, and this proof also shows that the case M = 0 does not occur under the assumptions of this theorem, we omit the proof of Theorem 6.

Theorem 6 Let Assumption M be satisfied. Furthermore, let ε* ≤ (ρ** − ρ*)/9, δ ∈ (0, δ_thick], τ ∈ (ψ(δ), τ*(ε*)], ε ∈ (0, ε*], and ρ0 ≤ ρ*. In addition, let (L_ρ)_{ρ≥0} be a decreasing family satisfying (3) for all ρ ≥ ρ0. Then we have:

i) The level ρ_out returned by Algorithm 1 satisfies ρ_out ∈ [ρ* + 2ε, ρ* + ε* + 5ε] and ψ(δ) < τ ≤ 3τ*(ρ_out − ρ* + ε).

ii) Algorithm 1 returns two sets B1 and B2, and these can be ordered such that we have

Σ_{i=1}^{2} λ_d(B_i △ A*_i) ≤ Σ_{i=1}^{2} λ_d(A*_i \ (A^i_{ρ_out+ε})^{−δ}) + λ_d(M^{+δ}_{ρ_out−ε} \ {h > ρ*}).

Here, A^i_{ρ_out+ε} ∈ C(M_{ρ_out+ε}) are ordered in the sense of A^i_{ρ_out+ε} ⊆ A*_i.

Theorem 6 shows that Algorithm 1 is still able to estimate ρ* and the corresponding clusters if P can be clustered in the sense of Assumption M. The main motivation for this section was, however, to have an algorithm that also behaves correctly for P that only have one cluster in the sense of Assumption S. The next theorem ensures such a behavior.

Theorem 7 Let Assumption S be satisfied and (L_ρ)_{ρ≥0} be a decreasing family of sets L_ρ ⊆ X such that (3) holds for some fixed ε, δ > 0 and all ρ ≥ ρ0. If ρ0 ≥ ρ*, δ ∈ (0, δ_thick], and τ > 2 c_thick δ^γ, then Algorithm 1 returns ρ0 and L_{ρ0}.

Note that Theorem 6 requires τ > ψ(δ) = 3 c_thick δ^γ, while Theorem 7 even holds under the milder assumption τ > 2 c_thick δ^γ. Consequently, if we choose a τ with τ > 3 c_thick δ^γ, then the corresponding assumptions of both theorems are satisfied. Moreover, the additional assumption τ ≤ τ*(ε*) in Theorem 6 is actually more an assumption on ε* than on τ, as we will see later.
Now assume that the assumptions of Theorem 6 are satisfied and that Algorithm 1 returned ρ_out and the cluster estimates B1, B2. Our goal is to apply Algorithm 1 to the two detected clusters B1 and B2 separately, see Algorithm 2. To this end, we define the new level set estimates

L_{i,ρ} := L_ρ ∩ B_i , i = 1, 2, ρ ≥ ρ_out,

and let Algorithm 1 run on both families of level set estimates separately.

Algorithm 2 Estimating the split-tree with the help of a generic level set estimator
Require: Some τ > 0, ε > 0 and a start level ρ0 ≥ 0. A decreasing family (L_ρ)_{ρ≥0} of subsets of X.
Ensure: Estimates of all splits of the cluster tree and the corresponding clusters.
1: Call Algorithm 1 with ρ0 and (L_ρ)_{ρ≥0}
2: if ρ_out > ρ0 then
3: Store the return values of Algorithm 1 in the split-tree
4: Call Algorithm 2 with ρ_out + ε and (L_{1,ρ})_{ρ≥0}
5: Call Algorithm 2 with ρ_out + ε and (L_{2,ρ})_{ρ≥0}
6: else if ρ_out = 0 then
7: Store the return values of Algorithm 1 in the split-tree
8: end if
9: return split-tree

Of course, we want to use our insights into Algorithm 1, and for this reason, we need to replace (3) by a suitable new horizontal and vertical control. To find such a new control, let us assume that Assumption M is satisfied and that we have fixed a level ρ̆ ∈ (ρ*, ρ**]. Moreover, let A_{1,ρ̆} and A_{2,ρ̆} be the two connected components of M_{ρ̆}. We then define two new children distributions P1 and P2 by

P_i(B) := P(B ∩ A_{i,ρ̆}) / P(A_{i,ρ̆}) , i = 1, 2, (5)

for all measurable B ⊆ X. Moreover, for ρ ≥ 0 we denote the level sets of P1 and P2 by M_{1,ρ} and M_{2,ρ}, respectively. We can now introduce distributions having a finite split-tree.

Definition 8 Let P be a distribution satisfying Assumption P and |C(M_ρ)| < ∞ for all ρ ≥ 0. Moreover, assume that there is a ρ_max > 0 such that M_ρ = ∅ for all ρ ≥ ρ_max. Then P has a finite split-tree with minimal step size ϵ > 0, if one of the following two conditions is satisfied:

i) P satisfies Assumption S.
ii) P satisfies Assumption M with ρ** − ρ* ≥ ϵ, and for ρ̆ := (ρ* + ρ**)/2 the two probability measures P1 and P2 defined by (5) have a finite split-tree with minimal step size ϵ > 0.

For example, the distribution shown repeatedly in the introduction, see e.g. Figure 2 for the split-tree of this distribution, can be clustered. Our next goal is to show that Algorithm 2 can be used to estimate the split-tree for distributions having a finite split-tree with some unknown minimal step size ϵ > 0. To this end, we need the rather technical Theorem 9 below, whose formulation requires the sets of τ-connected components of L_ρ that are identified in Lines 3 and 7 of Algorithm 1, that is, the sets

Ĉ_τ(L_ρ) := { B ∈ C_τ(L_ρ) : B ∩ L_{ρ+2ε} ≠ ∅ } , ρ ≥ 0.

Theorem 9 Let Assumption M be satisfied. Furthermore, let ε* ≤ (ρ** − ρ*)/16, δ ∈ (0, δ_thick], τ ∈ (ψ(δ), τ*(ε*)], ε ∈ (0, ε*], and ρ0 ≤ ρ*. In addition, let (L_ρ)_{ρ≥0} be a decreasing family satisfying (3) for all ρ ≥ ρ0. Finally, let ρ_out be the estimate of ρ* and B1, B2 be the cluster estimates returned by Algorithm 1, and let ρ̆ ∈ (ρ*, ρ**] be the level fixed above. Then the following statements are true:

i) We have |C_τ(M^{−δ}_{ρ̆})| = 2, and the sets V1 := (A_{1,ρ̆})^{−δ} and V2 := (A_{2,ρ̆})^{−δ} are the two τ-connected components of M^{−δ}_{ρ̆}.

ii) For all ρ ∈ [ρ_out, ρ̆ − 3ε] we have |Ĉ_τ(L_ρ)| = 2. Moreover, we can order the two elements B^ρ_1 and B^ρ_2 of Ĉ_τ(L_ρ) such that

V_i ⊆ B^ρ_i ⊆ B_i , i = 1, 2. (6)

iii) If ρ̆ ∈ [ρ* + ε* + 6ε, ρ** − 5ε], then for all ρ ≥ ρ̆ + 4ε we have

L_{i,ρ} ⊆ B^{ρ̆+2ε}_i and M^{−δ}_{i,ρ+ε} ⊆ L_{i,ρ} ⊆ M^{+δ}_{i,ρ−ε}. (7)

To illustrate Theorem 9, we now take ρ̆ := (ρ* + ρ**)/2 and assume ε* ≤ (ρ** − ρ*)/16. For ε ∈ (0, ε*] we then find ρ̆ ∈ [ρ* + ε* + 6ε, ρ** − 5ε]. Moreover, we have ρ̆ − 3ε > ρ̆ − 4ε ≥ ρ̆ − (ρ** − ρ*)/4, and part ii) of Theorem 9 thus shows (6) for all ρ ∈ [ρ_out, ρ̆ − (ρ** − ρ*)/4]. Consequently, Algorithm 1, when working with the level sets (L_{i,ρ}), identifies exactly one connected component in its Line 3 for such ρ. In other words, the loop between its Lines 2 and 5 is not left for such ρ.
Moreover, for ρ ≥ ρ̆ + 4ε, part iii) of Theorem 9 ensures (7). Consequently, Theorems 6 and 7 can be applied to Algorithm 1 when working with the level sets (L_{i,ρ})_{ρ≥ρ̆} for the distribution P_i. We refer to Figure 8 for a detailed description of how these guarantees work together. In summary, these considerations show that Algorithm 2 can be recursively analyzed with the help of Theorems 6 and 7, so that Algorithm 2 indeed estimates the split-tree for all distributions P having a finite split-tree with some unknown minimal step size ϵ > 0. In particular, for all quantitative guarantees it actually suffices to describe the behavior of Algorithm 1 for distributions satisfying Assumption S and Assumption M. This insight will be adopted later in the statistical analysis of Section 5.

Figure 8: A density of a P that has a finite split-tree with minimal step size ϵ > 0 and the resulting guarantees. Left. Situation at the lowest split level ρ*_1. The distribution P can be clustered between ρ*_1 and ρ**_1, and the (unnormalized) densities of the children distributions P_{1,1} (green) and P_{1,2} (purple) above ρ̆_1 := (ρ*_1 + ρ**_1)/2 are shown. Assumption S is satisfied by P_{1,1}, while P_{1,2} satisfies Assumption M. Algorithm 2 initialized with ρ0 := 0 calls Algorithm 1 in its Line 1, which in turn returns an estimate ρ_{1,out} of ρ*_1 with ρ_{1,out} > ρ*_1 and two corresponding clusters denoted by B_{1,1} and B_{1,2}, as guaranteed by Theorem 6. Algorithm 2 stores these in its Line 3 and continues by calling a second instance of Algorithm 2 with start level ρ_{1,out} + ε and the family (L_{1,ρ})_{ρ≥0} in Line 4. This instance of Algorithm 2 in turn calls Algorithm 1 with ρ_{1,out} + ε and the family (L_{1,ρ})_{ρ≥0}. For ρ ∈ [ρ_{1,out}, ρ⋄_1] with ρ⋄_1 := 0.75ρ*_1 + 0.25ρ**_1, that is, for levels between the two cyan lines, part ii) of Theorem 9 ensures that Algorithm 1 only detects one cluster B^ρ_1 in its loop, see also the calculations following Theorem 9.
Consequently, this loop is not left below ρ⋄_1. Moreover, for ρ ≥ ρ⋄_1, part iii) of Theorem 9 ensures M^{−δ}_{1,ρ+ε} ⊆ L_{1,ρ} ⊆ M^{+δ}_{1,ρ−ε}, and hence Theorem 7 shows that Algorithm 1 returns its start level ρ_{1,out} + ε and L_{ρ_{1,out}+ε}. As a consequence, both if-clauses of the second instance of Algorithm 2 are not satisfied, and the overall program continues at Line 5 of the first instance of Algorithm 2. There, a third instance of Algorithm 2 is called with ρ_{1,out} + ε and (L_{2,ρ})_{ρ≥0}, which in turn begins by calling Algorithm 1 with the same values. Again, Theorem 9 ensures that for ρ ∈ [ρ_{1,out}, ρ⋄_1] Algorithm 1 only detects one cluster and that for ρ ≥ ρ⋄_1, the crucial inclusions M^{−δ}_{2,ρ+ε} ⊆ L_{2,ρ} ⊆ M^{+δ}_{2,ρ−ε} are satisfied. Theorem 6 hence ensures that Algorithm 1 returns an estimate ρ_{2,out} of the second split level ρ*_2 with ρ_{2,out} > ρ*_2 > ρ_{1,out} + ε and corresponding clusters denoted by B_{2,1} and B_{2,2} to the third instance of Algorithm 2. These values are then stored in the split-tree, and a fourth and fifth instance of Algorithm 2 are called with ρ_{2,out} + ε and the newly defined (L_{(2,1),ρ})_{ρ≥0}, respectively (L_{(2,2),ρ})_{ρ≥0}, given by L_{(2,i),ρ} := L_{2,ρ} ∩ B_{2,i}. Right. Assumption S is satisfied for both children measures occurring at the split ρ*_2. As for B1 in the left image, the combination of Theorem 9 and Theorem 7 ensures that Algorithm 1, called by the fourth and fifth instance of Algorithm 2 with ρ_{2,out} + ε and (L_{(2,i),ρ})_{ρ≥0}, returns these values to these instances of Algorithm 2. Thus, the fourth and fifth instance of Algorithm 2 return to the third instance of Algorithm 2 without any further action, and in turn the third instance returns to the first instance of Algorithm 2. This instance then reaches its Line 9, and therefore the overall program terminates with (ρ_{1,out}, B1, B2) and (ρ_{2,out}, B_{2,1}, B_{2,2}) being stored in its split-tree.

4. Uncertainty Control for Kernel Density Estimators

The results of Section 3 provide guarantees as soon as the input level sets satisfy (3).
Steinwart (2015a) has shown that guarantees of the form (3) can be established for the level sets of histogram-based density estimators. The goal of this section is to show that (3) can also be established for a variety of kernel density estimators. Our first definition introduces the considered kernels.

Definition 10 A bounded, measurable function K : R^d → [0, ∞) is called a symmetric kernel if K(x) > 0 in some neighborhood of 0, K(x) = K(−x) for all x ∈ R^d, and

∫_{R^d} K(x) dλ_d(x) = 1. (8)

For δ > 0 we write K_δ := δ^{−d} K(δ^{−1} ·), and for r > 0 and a norm ‖·‖ on R^d we define

κ_1(r) := ∫_{R^d \ B(0,r)} K(x) dλ_d(x) , κ_∞(r) := sup_{x ∈ R^d \ B(0,r)} K(x).

We call κ_1(·) and κ_∞(·) tail functions. Finally, we say that K has bounded support if supp K ⊆ B, where B is the closed unit ball of the considered norm, and that K has an exponential tail behavior if there exists a constant c > 0 such that

K(x) ≤ c exp(−‖x‖_2) , x ∈ R^d. (9)

Recall that the integrability condition (8) is standard for kernel density estimators. Moreover, kernels of the form K(x) = k(‖x‖) are always symmetric, and if the representing k : [0, ∞) → [0, ∞) is bounded and measurable, so is K. Moreover, if k(r) > 0 for all r ∈ [0, ϵ), where ϵ > 0 is some constant, then K(x) > 0 in some neighborhood of 0. In particular, for k = c1_{[0,1]} we obtain the rectangular window kernel, which is a symmetric kernel with bounded support, and if k is of the form k(r) = c exp(−r²) or k(r) = c exp(−r), then we obtain a symmetric kernel with exponential tail behavior. Examples of the latter are Gaussian kernels, while the triangular, the Epanechnikov, the quartic, the triweight, and the tricube kernels are further examples of symmetric kernels with bounded support. Finally note that each symmetric kernel with bounded support also has exponential tail behavior, since we always assume that K is bounded. Given a kernel function K, it is standard in kernel density estimation to consider the modified versions K_δ of K with δ → 0 for increasing sample sizes n, see for example (Devroye and Lugosi, 2001).
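The normalization (8), the mass-preserving rescaling K_δ, and the decay of the tail function κ_1 can be checked numerically. The following snippet is our own sanity check for a Gaussian-type kernel k(r) = c exp(−r²) in d = 1; the grid and tolerances are arbitrary choices.

```python
import numpy as np

# Symmetric Gaussian-type kernel in d = 1, normalized so that (8) holds:
# k(r) = c * exp(-r^2) with c = 1/sqrt(pi).
def K(x):
    return np.exp(-x ** 2) / np.sqrt(np.pi)

def K_delta(x, delta):
    # rescaled version K_delta = delta^{-d} K(delta^{-1} x), here d = 1
    return K(x / delta) / delta

xs = np.linspace(-30.0, 30.0, 600001)
dx = xs[1] - xs[0]

# every rescaled version K_delta still integrates to one
for delta in (1.0, 0.5, 0.1):
    mass = float(np.sum(K_delta(xs, delta))) * dx
    assert abs(mass - 1.0) < 1e-5

# tail function kappa_1(r): kernel mass outside the ball B(0, r);
# for exponential tail behavior it decays at an exponential rate
def kappa1(r):
    return float(np.sum(np.where(np.abs(xs) > r, K(xs), 0.0))) * dx

assert kappa1(1.0) > kappa1(2.0) > kappa1(3.0) > 0.0
```

The same check with a small δ illustrates why K_δ is "more narrow" than K: the rescaled mass concentrates on a window of width proportional to δ while the total mass stays at one.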
Roughly speaking, this is because one can show that the infinite-sample kernel density estimator h_{P,δ} defined below in (13) converges to the density h for δ → 0. Also note that for the kernels discussed above and δ ∈ (0, 1), these versions K_δ are more narrow than the original function K. Before we proceed with our main goal of establishing (3), let us briefly discuss a couple of simple properties of symmetric kernels K in the sense of Definition 10. To this end, we first note that the properties of the Lebesgue measure λ_d ensure that

∫_{R^d} K_δ(x − y) dλ_d(y) = ∫_{R^d} K(x − y) dλ_d(y) = ∫_{R^d} K(y − x) dλ_d(y) = 1 (10)

for all x ∈ R^d and δ > 0, and then by an analogous calculation we obtain

∫_{R^d \ B(x,σ)} K_δ(x − y) dλ_d(y) = ∫_{R^d \ B(0,σ/δ)} K(y) dλ_d(y) = κ_1(σ/δ). (11)

In addition, we always have κ_1(r) → 0 for r → ∞, and if K has bounded support, then the tail functions with respect to this norm satisfy

κ_1(r) = κ_∞(r) = 0 , r ≥ 1. (12)

Moreover, for kernels with exponential tail, (Steinwart et al., 2021, Lemma 4.2) shows that the behavior of the tail functions can be bounded by κ_1(r) ≤ c d 2^d vol_d e^{−r} r^{d−1} and κ_∞(r) ≤ c e^{−r}. Now, let K be a symmetric kernel on R^d and P be a distribution on R^d. For δ > 0 we then define the infinite-sample kernel density estimator h_{P,δ} : R^d → [0, ∞) by

h_{P,δ}(x) := δ^{−d} ∫ K(δ^{−1}(x − y)) dP(y) , x ∈ R^d. (13)

It is easy to see that h_{P,δ} ≥ 0 is a bounded measurable function with ‖h_{P,δ}‖_∞ ≤ δ^{−d} ‖K‖_∞. Moreover, a quick application of Tonelli's theorem together with (10) yields ‖h_{P,δ}‖_{L1(λ_d)} = 1, and hence h_{P,δ} is a Lebesgue probability density. Moreover, if P has a Lebesgue density h, then it is well known, see e.g. (Devroye and Lugosi, 2001, Theorem 9.1), that ‖h_{P,δ} − h‖_{L1(λ_d)} → 0 for δ → 0. In addition, if this density is bounded, then (10) yields

‖h_{P,δ}‖_∞ = sup_{x ∈ R^d} δ^{−d} ∫ K(δ^{−1}(x − y)) h(y) dλ_d(y) ≤ ‖h‖_∞ sup_{x ∈ R^d} ∫_{R^d} K_δ(x − y) dλ_d(y) = ‖h‖_∞.

Clearly, if D = (x1, . . . , xn) ∈ X^n is a data set, we can consider the corresponding empirical measure (1/n) Σ_{i=1}^n δ_{x_i}, where δ_x denotes the Dirac measure at x.
In a slight abuse of notation we also denote this empirical measure by D. The resulting function h_{D,δ} : R^d → R, called the kernel density estimator (KDE), can then be computed by

h_{D,δ}(x) := (1/(n δ^d)) Σ_{i=1}^{n} K(δ^{−1}(x − x_i)).

One way to define level set estimates with the help of h_{D,δ} is a simple plug-in approach, that is,

L_{D,ρ} := {h_{D,δ} ≥ ρ}. (14)

While from a statistical perspective this level set estimator is perfectly fine, it is also computationally intractable. For example, if h_{D,δ} is a moving window estimator, that is K(x) = c 1_{[0,1]}(‖x‖) for x ∈ R^d, then the up to 2n different level sets (14) are generated by intersections of balls around the samples, and the structure of these intersections may be too complicated to compute τ-connected components in Algorithm 1. For this reason, we consider level set estimates of the form

L_{D,ρ} := {x ∈ D : h_{D,δ}(x) ≥ ρ}^{+σ} , (15)

where σ > 0. Note that computing connected components of (15) is indeed feasible, since it amounts to computing the connected components of the neighborhood graph, in which two vertices x_i and x_j with i ≠ j have an edge if ‖x_i − x_j‖ ≤ σ + τ. In particular, DBSCAN can be viewed as such a strategy for the moving window kernel. With these preparations we can now present our first result that establishes a sort of uncertainty control (3) for level set estimates of the form (15).

Theorem 11 Let ‖·‖ be some norm on R^d, K : R^d → [0, ∞) be a symmetric kernel, and κ_1(·) and κ_∞(·) be its associated tail functions. Moreover, let P be a distribution for which Assumption P is satisfied, and D be a data set such that the corresponding KDE satisfies ‖h_{D,δ} − h_{P,δ}‖_∞ < ε for some ε > 0 and δ > 0. For ρ ≥ 0 and σ > 0 we define L_{D,ρ} := {x ∈ D : h_{D,δ}(x) ≥ ρ}^{+σ} and ϵ := max{ρ κ_1(σ/δ), δ^{−d} κ_∞(σ/δ)}. Then, for all ρ ≥ δ^{−d} κ_∞(σ/δ), we have

M^{−2σ}_{ρ+ε+ϵ} ⊆ L_{D,ρ} ⊆ M^{+2σ}_{ρ−ε−ϵ}.
(16)

Moreover, if P has a bounded density h, then (16) also holds for ϵ := ‖h‖_∞ κ_1(σ/δ). If K has bounded support for the norm considered in Theorem 11, Equation (12) shows that (16) actually holds for ϵ = 0 and all ρ ≥ 0 and all σ ≥ δ. Therefore, we indeed have (3) with δ replaced by 2σ. In general, however, we have an additional horizontal uncertainty ϵ that affects the guarantees of Theorem 6. To control this influence, our strategy will be to ensure that ϵ ≤ ε, which in view of ϵ = ‖h‖_∞ κ_1(σ/δ) means that we need an upper bound on κ_1(·) and σ. Theorem 11 tells us that the uncertainty control (16) is satisfied as soon as we have a data set D with ‖h_{D,δ} − h_{P,δ}‖_∞ < ε. Recall that rates for ‖h_{D,δ} − h_{P,δ}‖_∞ → 0 have already been proven by Giné and Guillou (2002). However, these rates only hold for n ≥ n0, where n0 may depend on D. In addition, one needs to choose a sequence (δ_n) of bandwidths a priori, which excludes adaptivity, as we will see below. Finally, the theory developed by Steinwart (2015a) requires bounds of the form ‖h_{D,δ} − h_{P,δ}‖_∞ < ε(δ, n, ς) that hold with probability not smaller than 1 − e^{−ς}. Thus, the results of Giné and Guillou (2002) are not suitable for our purposes. To establish suitable bounds, we need to recall some notions first.

Definition 12 Let E be a Banach space and A ⊆ E be a bounded subset. Then, for all ε > 0, the covering numbers of A are defined by

N(A, ‖·‖_E, ε) := inf{ n ≥ 1 : ∃ x_1, . . . , x_n ∈ E such that A ⊆ ∪_{i=1}^{n} (x_i + ε B_E) },

where inf ∅ := ∞. Furthermore, we use the shorthand notation N(A, E, ε) := N(A, ‖·‖_E, ε).

We now introduce the kind of covering number bound we will use in our analysis.

Definition 13 Let (Z, A) be a measurable space and G be a set of measurable functions from Z to R for which there is a B > 0 with ‖g‖_∞ ≤ B for all g ∈ G. Then G is called a uniformly bounded VC-class, if there are A > 0 and ν > 0 such that, for every distribution P on Z, we have

N(G, L2(P), ϵ) ≤ (A B / ϵ)^ν , 0 < ϵ ≤ B.
(17)

Let us briefly look at two important sufficient criteria for ensuring that the set of functions

K_δ := { K_δ(x − ·) : x ∈ X } (18)

is a uniformly bounded VC-class. The first such result considers moving window kernels.

Lemma 14 Consider the kernel K = c 1_B, where B is the closed unit ball of ‖·‖, and ‖·‖ is either the Euclidean or the supremum norm. Then for all δ > 0 the set K_δ defined by (18) is a uniformly bounded VC-class with B := δ^{−d} ‖K‖_∞ = δ^{−d} c and A and ν being independent of δ.

The next lemma shows that Hölder continuous kernels also induce a uniformly bounded VC-class K_δ, provided that the input space X in (18) is compact. For its formulation we need to recall that for every norm ‖·‖ on R^d and every compact subset X ⊆ R^d there exists a finite constant C(X) > 0 such that for all 0 < ϵ ≤ diam(X) we have

N(X, ‖·‖, ϵ) ≤ C(X) ϵ^{−d}. (19)

Lemma 15 Let K : R^d → [0, ∞) be a symmetric, α-Hölder continuous kernel and ‖·‖ be a norm on R^d. We write |K|_α for the corresponding α-Hölder constant. Moreover, let X ⊆ R^d be a compact subset and K_δ be defined by (18). Then for all δ > 0 with δ ≤ (|K|_α ‖K‖_∞^{−1})^{1/α} diam(X), all 0 < ϵ ≤ B := δ^{−d} ‖K‖_∞, and all distributions P on R^d we have

N(K_δ, L2(P), ϵ) ≤ C(X) ( |K|_α δ^{−(d+α)} / ϵ )^{d/α}.

In other words, K_δ is a uniformly bounded VC-class with ν := d/α and constant A := (C(X))^{α/d} |K|_α ‖K‖_∞^{−1} δ^{−α}.

Now, the second main result of this section establishes a finite sample bound on the norm ‖h_{D,δ} − h_{P,δ}‖_∞. Later we will combine it with Theorem 11.

Theorem 16 Let X ⊆ R^d and P be a distribution on X that has a Lebesgue density h ∈ L1(R^d) ∩ Lp(R^d) for some p ∈ (1, ∞]. We write 1/p + 1/p′ = 1. Moreover, let K : R^d → [0, ∞) be a symmetric kernel for which there is a δ0 ∈ (0, 1] such that for all δ ∈ (0, δ0] the set K_δ defined in (18) is a uniformly bounded VC-class with constants of the form B_δ = δ^{−d} ‖K‖_∞, A_δ = A_0 δ^{−a}, and A_0 > 0, a ≥ 0, ν ≥ 1 being independent of δ, that is,

N(K_δ, L2(Q), ϵ) ≤ ( A_0 ‖K‖_∞ δ^{−(d+a)} / ϵ )^ν

holds for all δ ∈ (0, δ0], all ϵ ∈ (0, B_δ], and all distributions Q on R^d.
Then there exists a C > 0 only depending on d, p, and K such that, for all n ≥ 1, δ > 0, and ς ≥ 1 satisfying

δ ≤ min{δ0, 4p ‖K‖_∞} and |log δ| ≤ n δ^{d/p′} ‖h‖_p , (22)

we have

P^n( D : ‖h_{D,δ} − h_{P,δ}‖_{ℓ∞(X)} < C ( ‖h‖_p |log δ| ς / (n δ^{d/p′}) )^{1/2} ) ≥ 1 − e^{−ς}. (23)

For bounded densities, Theorem 16 recovers the same rates as Giné and Guillou (2002). However, Giné and Guillou (2002) established the rates in an almost sure asymptotic form, whereas Theorem 16 provides a finite sample bound. Moreover, unlike Giné and Guillou (2002), Theorem 16 also yields rates for unbounded densities.

5. Statistical Analysis of KDE-based Clustering

In this section we combine the generic results of Section 3 with the uncertainty control for level set estimates obtained from kernel density estimators in Section 4. As a result, we present finite sample guarantees, consistency results, and rates for estimating split levels ρ* and the corresponding clusters. In this respect recall the discussion following Theorem 9, which showed that for deriving guarantees for estimating the split-tree with the help of Algorithm 2 it actually suffices to analyze the behavior of Algorithm 1 for distributions satisfying Assumption S and Assumption M. Following this insight, we will focus on such guarantees for Algorithm 1. Our first result presents finite sample bounds for estimating both ρ* and the single or multiple clusters with the help of Algorithm 1. To treat kernels with bounded and unbounded support simultaneously, we restrict ourselves to the case of bounded densities, but at least for kernels with bounded support an adaptation to p-integrable densities is straightforward, as discussed by Steinwart et al. (2021).

Theorem 17 Let P be a distribution with bounded Lebesgue density and with Assumption P being satisfied. Moreover, let K be a symmetric kernel with exponential tail behavior, for which the assumptions of Theorem 16 hold. For fixed δ ∈ (0, e^{−1}] and τ > 0, we choose a σ > 0 with

σ ≥ δ if supp K ⊆ B , and σ ≥ δ |log δ|² otherwise,
(24)

and we further assume that this σ satisfies both σ ≤ δ_thick/2 and τ > ψ(2σ). Moreover, for fixed ς ≥ 1 and n ≥ 1 satisfying the assumptions (22), we pick an ε > 0 satisfying the bound

ε ≥ C ( ‖h‖_∞ |log δ| ς / (n δ^d) )^{1/2} (25)

and, if K does not have bounded support, also

ε ≥ max{1, 2d 2^d vol_d} c δ^{|log δ|} |log δ|^d . (26)

Now assume that for each data set D ∈ X^n sampled from P^n, we feed Algorithm 1 with the level set estimators (L_{D,ρ})_{ρ≥0} given by (15), the parameters τ and ε, and a start level ρ0 ≥ ε. Then the following statements are true:

i) If P satisfies Assumption S and ρ0 ≥ ρ*, then with probability P^n not less than 1 − e^{−ς}, Algorithm 1 returns ρ0 and L_{D,ρ0}, and with M̂_{ρ*} := ∪_{ρ>ρ*} M_ρ we have

λ_d(L_{D,ρ0} △ M̂_{ρ*}) ≤ λ_d(M^{+2σ}_{ρ0−ε} \ M̂_{ρ*}) + λ_d(M̂_{ρ*} \ M^{−2σ}_{ρ0+ε}). (27)

ii) If P satisfies Assumption M and we have an

ε* ≥ ε + inf{ ε′ ∈ (0, ρ** − ρ*] : τ*(ε′) ≥ τ } (28)

with 9ε* ≤ ρ** − ρ*, then with probability P^n not less than 1 − e^{−ς}, we have a D ∈ X^n such that the following statements are true for Algorithm 1:

(a) The returned level ρ_{D,out} satisfies both ρ_{D,out} ∈ [ρ* + 2ε, ρ* + ε* + 5ε] and ψ(2σ) < τ ≤ 3τ*(ρ_{D,out} − ρ* + ε).

(b) Two sets B1(D) and B2(D) are returned and these can be ordered such that for A^i_{ρ_{D,out}+ε} ∈ C(M_{ρ_{D,out}+ε}) ordered in the sense of A^i_{ρ_{D,out}+ε} ⊆ A*_i we have

Σ_{i=1}^{2} λ_d(B_i(D) △ A*_i) ≤ Σ_{i=1}^{2} λ_d(A*_i \ (A^i_{ρ_{D,out}+ε})^{−2σ}) + λ_d(M^{+2σ}_{ρ_{D,out}−ε} \ {h > ρ*}). (29)

For our subsequent asymptotic analysis we note that the assumptions δ ∈ (0, e^{−1}] and ς ≥ 1 of Theorem 17 show that (26) is satisfied if

max{1, 2d 2^d vol_d} c δ^{|log δ|+d/2} √n ≤ C , (30)

and if we choose δ in terms of n, i.e. δ = δ_n, then (30) is satisfied for large n if δ_n ∈ O(n^{−a}) for some small a > 0. We shall see below that such rates for δ_n are typical. While (24) only provides a lower bound on possible values for σ, Theorem 17 actually indicates that σ should not be chosen significantly larger than these lower bounds, either.
Indeed, the choice of σ also implies a minimal value for τ via the condition τ > ψ(2σ), which in turn influences ε* by (28). Namely, larger values of σ lead to larger τ and thus to larger ε*. As a result, the guarantees in (a) become weaker, and in addition, larger values of σ also lead to weaker guarantees in (b). For a similar reason we do not consider kernels K with heavier tails than (9). Indeed, if K only has a polynomial upper bound for its tail, i.e., there are constants c and α > d with

K(x) ≤ c ‖x‖_2^{−α} , x ∈ R^d ,

then κ_1(r) ≲ r^{−α+d} and κ_∞(r) ≲ r^{−α}. Now, if we picked σ = δ |log δ|^b for some b > 0, then we would need to replace (26) by a bound of the form c δ^{−d} |log δ|^{−αb} ≤ ε, and this would rule out ε → 0 for δ → 0. As a result, no rates would be possible. Now, one could address this by choosing σ := δ^b for some b ∈ (0, 1), which in turn would require a bound of the form c′ δ^{α(1−b)−d} ≤ ε instead of (26). Arguing as around (30), this is guaranteed if c′ δ^{α(1−b)−d/2} √n ≤ C, and for δ → 0 the latter would require b < 1 − d/(2α). In particular, b would be strictly bounded away from 1. However, such a choice for σ would significantly weaken the guarantees given in (a) and (b) as explained above, and as a consequence, the rates obtained below would be worse. Note that from a high-level perspective this phenomenon is not surprising: indeed, heavier tails smooth out the infinite-sample density estimator h_{P,δ}, and as a consequence, the uncertainty guarantees (16) become worse in the horizontal direction, i.e., we get more blurry estimates L_{D,ρ} of M_ρ. However, for the detection of connected components at a level ρ, less blurry estimates are preferable. In the remainder of this section, we illustrate how the finite sample guarantee of Theorem 17 can be used to derive both consistency and rates. We begin with the following result.

Corollary 18 Let P be a distribution with bounded Lebesgue density and with Assumption P being satisfied.
Moreover, let K be a symmetric kernel with exponential tail behavior, for which the assumptions of Theorem 16 hold. Let (δ_n) be a positive sequence with δ_n ∈ O(n^{−a}) for some a > 0, and pick a null sequence (σ_n) satisfying (24) for all sufficiently large n. Moreover, let (ε_n) and (τ_n) be positive null sequences with ψ(2σ_n) < τ_n for all sufficiently large n, and

lim_{n→∞} (log δ_n^{−1}) / (n ε_n² δ_n^d) = 0 .

Now assume that for each data set D ∈ X^n sampled from P^n, we feed Algorithm 1 with the level set estimators (L_{D,ρ})_{ρ≥0} given by (15), the parameters τ_n and ε_n, and the start level ρ0 := ε_n. Then the following statements are true:

i) If P satisfies Assumption S with ρ* = 0, then for all ϵ > 0 we have

lim_{n→∞} P^n( D ∈ X^n : 0 < ρ_{D,out} ≤ ϵ ) = 1 ,

and if λ_d(cl{h > 0} \ {h > 0}) = 0 we also have

lim_{n→∞} P^n( D ∈ X^n : λ_d(L_{D,ρ_{D,out}} △ {h > 0}) ≤ ϵ ) = 1 .

ii) If P satisfies Assumption M, then, for all ϵ > 0, we have

lim_{n→∞} P^n( D ∈ X^n : 0 < ρ_{D,out} − ρ* ≤ ϵ ) = 1 ,

and, if λ_d(cl(A*_1 ∪ A*_2) \ (A*_1 ∪ A*_2)) = 0, we also have, for B1(D), B2(D) as in (29):

lim_{n→∞} P^n( D ∈ X^n : λ_d(B1(D) △ A*_1) + λ_d(B2(D) △ A*_2) ≤ ϵ ) = 1 .

Our next goal is to establish rates of convergence for estimating ρ* and the clusters. We begin with a result providing a rate for ρ_{D,out} − ρ* → 0. To this end we need to recall the following definition from Steinwart (2015a) that describes how well the clusters are separated above ρ*.

Definition 19 Let Assumption M be satisfied. Then the clusters of P have a separation exponent κ ∈ (0, ∞], if there is a constant c_sep > 0 such that for all ε ∈ (0, ρ** − ρ*] we have

τ*(ε) ≥ c_sep ε^{1/κ} .

Moreover, the separation exponent κ is exact, if there is a c′_sep > 0 such that

τ*(ε) ≤ c′_sep ε^{1/κ} , ε ∈ (0, ρ** − ρ*] .

The separation exponent describes how fast the connected components of M_ρ approach each other for ρ → ρ*. The best separation exponent is κ = ∞, and in this case we have d(A*_1, A*_2) ≥ c_sep, i.e. the clusters A*_1 and A*_2 do not touch each other. The separation exponent makes it possible to find a good value for ε* in Theorem 17.
Indeed, the proof of (Steinwart, 2015a, Theorem 4.3) shows that the value ε* := ε + (τ/c_sep)^κ satisfies (28) as soon as we have 9ε* ≤ ρ** − ρ*. Consequently, the bound in part ii) (a) of Theorem 17 becomes

2ε ≤ ρ_{D,out} − ρ* ≤ 6ε + (τ/c_sep)^κ (31)

if we have a separation exponent κ ∈ (0, ∞]. Moreover, if the separation exponent κ ∈ (0, ∞) is exact and we choose τ ≥ 2ψ(2σ), then (31) can be improved to

(τ/(3c′_sep))^κ − ε ≤ ρ_{D,out} − ρ* ≤ 6ε + (τ/c_sep)^κ

as the proof of (Steinwart, 2015a, Theorem 4.3) shows. To establish rates, it thus suffices to find null sequences (ε_n), (δ_n), (σ_n), and (τ_n) that satisfy (24) and (25), and additionally δ_n ∈ O(n^{−a}) for some a > 0 if K does not have bounded support. The following corollary presents the resulting rates, which are, modulo logarithmic terms, the best ones we can obtain from this approach.

Corollary 20 Let P be a distribution for which Assumption M is satisfied and whose Lebesgue density is bounded. Moreover, consider a symmetric kernel K with exponential tail behavior, for which the assumptions of Theorem 16 hold. In addition, assume that the clusters of P have separation exponent κ ∈ (0, ∞). Furthermore, let (ε_n), (δ_n), (σ_n), and (τ_n) be sequences with

ε_n ≍ ( (log n)³ log log n / n )^{γκ/(2γκ+d)} , δ_n ≍ ( (log n)³ log log n / n )^{1/(2γκ+d)} , σ_n ≍ (log n)² δ_n , τ_n ≍ ( (log n)³ log log n / n )^{γ/(2γκ+d)} ,

and assume that, for n ≥ 1 and D ∈ X^n sampled from P^n, we feed Algorithm 1 with the level set estimators (L_{D,ρ})_{ρ≥0} given by (15), the parameters τ_n and ε_n, and the start level ρ0 := ε_n. Then there exists a K ≥ 1 such that for all sufficiently large n we have

P^n( D ∈ X^n : 0 ≤ ρ_{D,out} − ρ* ≤ K ε_n ) ≥ 1 − 1/n . (32)

Moreover, if the separation exponent κ is exact, there exists another constant K ≥ 1 such that for all sufficiently large n we have

P^n( D ∈ X^n : K^{−1} ε_n ≤ ρ_{D,out} − ρ* ≤ K ε_n ) ≥ 1 − 1/n .

Finally, if κ = ∞ and supp K ⊆ B, then (32) holds for all sufficiently large n, if σ_n = δ_n and

ε_n ≍ ( (log n) (log log n)² / n )^{1/2} , δ_n ≍ (log log n)^{−1/(2d)} , and τ_n ≍ (log log n)^{−γ/(2d)} .

Note that the rates obtained in Corollary 20 only differ by the factor (log n)² from the rates in (Steinwart, 2015a, Corollary 4.4).
Moreover, if K has bounded support, then an easy modification of the above corollary yields exactly the same rates as in (Steinwart, 2015a, Corollary 4.4). Our next goal is to establish rates for λ_d(B_i(D) △ A*_i) → 0. Since this is a modified level set estimation problem, we need to recall some assumptions used in this context. The first assumption in this direction is a one-sided variant of a well-known condition introduced by Polonik (1995).

Definition 21 Let P be a distribution on X ⊆ R^d that has a Lebesgue density h. For a level ρ ≥ 0, we say that P has flatness exponent ϑ ∈ (0, ∞], if there is a c_flat > 0 with

λ_d( {0 < h − ρ < s} ) ≤ (c_flat s)^ϑ,   s > 0.

Note that the larger ϑ is, the steeper h must approach ρ from above. In particular, for ϑ = ∞, the density h is allowed to take the value ρ but is otherwise bounded away from ρ. The second definition describes, in some sense, the roughness of the boundary of the clusters.

Definition 22 Let Assumption M be satisfied and α ∈ (0, 1]. Then the clusters have an α-smooth boundary, if there is a c_bound > 0 such that, for all ρ ∈ (ρ*, ρ**] and δ ∈ (0, δ_thick], we have

λ_d( (A^i_ρ)^{+δ} \ (A^i_ρ)^{−δ} ) ≤ c_bound δ^α,

where i ∈ {1, 2} and A^1_ρ and A^2_ρ denote the two connected components of the level set M_ρ.

Note that in R^d, considering α > 1 does not make sense, and for an A ⊆ R^d with rectifiable boundary we always have α = 1, see (Steinwart, 2015b, Lemma A.10.4). The following assumption collects the different properties of P we have introduced.

Assumption R. Assumption M is satisfied and P has a bounded Lebesgue density h. Moreover, P has flatness exponent ϑ ∈ (0, ∞] at level ρ*, its clusters have an α-smooth boundary for some α ∈ (0, 1], and its clusters have separation exponent κ ∈ (0, ∞].

With the help of Assumption R we can now establish rates for estimating the clusters, that is, for λ_d(B_i(D) △ A*_i) → 0.

Corollary 23 Let Assumption R be satisfied, let K be as in Corollary 20, and write ϱ := min{α, ϑγκ}.
Furthermore, let (ε_n), (δ_n), (σ_n), and (τ_n) be sequences with

ε_n ≍ ( (log n)³ / n )^{ϱ/(2ϱ+ϑd)} (log log n)^{ϑd/(8ϱ+4ϑd)},   δ_n ≍ (log n · log log n) σ_n,   σ_n ≍ ( (log n)³ log log n / n )^{ϑ/(2ϱ+ϑd)},   τ_n ≍ ( (log n)³ (log log n)² / n )^{γϑ/(2ϱ+ϑd)}.

Assume that, for n ≥ 1, we feed Algorithm 1 as in Corollary 20. Then there is a constant K ≥ 1 such that, for all n ≥ 1 and the ordering as in (29), we have

P^n( D ∈ X^n : Σ_{i=1}^2 λ_d(B_i(D) △ A*_i) ≤ K ( (log n)³ (log log n)² / n )^{ϑϱ/(2ϱ+ϑd)} ) ≥ 1 − 1/n.

Again, the rates obtained in Corollary 23 only differ by the factor (log n)² from the rates in (Steinwart, 2015a, Corollary 4.8). Moreover, if K has bounded support, then an easy modification of Corollary 23 again yields exactly the same rates as in (Steinwart, 2015a, Corollary 4.8).

Our final goal is to modify the adaptive parameter selection strategy for the histogram-based clustering algorithm of Steinwart (2015a) to our KDE-based clustering algorithm. To this end, let ∆ ⊆ (0, 1] be finite and n ≥ 1, ς ≥ 1. For δ ∈ ∆, we fix σ_{δ,n} > 0 and τ_{δ,n} > 0 such that (24) and τ_{δ,n} ≥ 2ψ(2σ_{δ,n}) are satisfied. In addition, we define

ε_{δ,n} := C_u ( |log δ| (ς + log |∆|) log log n / (δ^d n) )^{1/2} + max{1, 2^{d/2} vol_d} · c · δ |log δ|^d,   (33)

where C_u ≥ 1 is some user-specified constant and the second term can be omitted if the used kernel K has bounded support. Now assume that, for each δ ∈ ∆, we run Algorithm 1 with the parameters ε_{δ,n} and τ_{δ,n}, with the start level ρ_0 := ε_{δ,n}, and with the level set estimators (L_{D,ρ})_{ρ≥0} given by (15). Let us consider a width δ_{D,∆} that achieves the smallest returned level, i.e.

δ_{D,∆} ∈ argmin_{δ∈∆} ρ_{D,δ,out}.   (34)

In general, this width may not be unique, and hence we assume in the following that we have a well-defined choice, e.g. the smallest δ satisfying (34). Moreover, we write

ρ*_{D,∆} := min_{δ∈∆} ρ_{D,δ,out}

for the smallest returned level. Note that unlike the width δ_{D,∆}, the level ρ*_{D,∆} is always unique. Finally, we define ε_{D,∆} := ε_{δ_{D,∆},n} and τ_{D,∆} := τ_{δ_{D,∆},n}. With these preparations we can now present the following finite sample bound for ρ*_{D,∆}.
Theorem 24 Let P be a distribution for which Assumption M is satisfied and whose Lebesgue density is bounded. Moreover, consider a symmetric kernel K with exponential tail behavior, for which the assumptions of Theorem 16 hold. In addition, assume that the two clusters of P have separation exponent κ ∈ (0, ∞]. For a fixed finite ∆ ⊆ (0, e^{−1}], and n ≥ 1, ς ≥ 1, and C_u ≥ 1, we define ε_{δ,n} by (33) and σ_{δ,n} > 0 and τ_{δ,n} > 0 such that (24), τ_{δ,n} ≥ 2ψ(2σ_{δ,n}), and 2σ_{δ,n} ≤ δ_thick are satisfied for all δ ∈ ∆. Furthermore, assume that 4C_u² log log n ≥ C ‖h‖_∞, where C is the constant in (23), and that ε_{δ,n} + (τ_{δ,n}/c_sep)^κ ≤ (ρ** − ρ*)/9 for all δ ∈ ∆. Then we have

P^n( D ∈ X^n : ε_{D,∆} < ρ*_{D,∆} − ρ* ≤ min_{δ∈∆} { (τ_{δ,n}/c_sep)^κ + 6ε_{δ,n} } ) ≥ 1 − e^{−ς}.

Moreover, if the separation exponent κ is exact and κ < ∞, then we even have

P^n( D ∈ X^n : min_{δ∈∆} { c_1 τ_{δ,n}^κ + ε_{δ,n} } < ρ*_{D,∆} − ρ* ≤ min_{δ∈∆} { c_2 τ_{δ,n}^κ + 6ε_{δ,n} } ) ≥ 1 − e^{−ς},

where c_1 := (1/4)(6c̄_sep)^{−κ} and c_2 := c_sep^{−κ}, and similarly

P^n( D ∈ X^n : c_1 τ_{D,∆}^κ + ε_{D,∆} < ρ*_{D,∆} − ρ* ≤ c_2 τ_{D,∆}^κ + 6ε_{D,∆} ) ≥ 1 − e^{−ς}.

For an adaptive parameter selection strategy, it thus suffices to define appropriate ∆, σ_{δ,n}, and τ_{δ,n}. Here we proceed as in (Steinwart, 2015a, Section 5). Namely, for n ≥ 16, we consider the interval

I_n := [ ( log n (log log n)² / n )^{1/d}, 1/log log n ]

and fix some n^{−1/d}-net ∆_n ⊆ I_n of I_n with |∆_n| ≤ n. Furthermore, for some fixed C_u ≥ 1 and n ≥ 16, we define σ_{δ,n} by (24), write τ_{δ,n} := σ_{δ,n}^γ log log log n, and define ε_{δ,n} by (33) for all δ ∈ ∆_n and ς = log n. Following the ideas of the proofs of (Steinwart, 2015a, Corollaries 5.2 and 5.3) we then obtain a constant K such that for all sufficiently large n ≥ 16 we have

P^n( D ∈ X^n : ρ*_{D,∆_n} − ρ* ≤ K ( (log n)³ (log log n)² / n )^{γκ/(2γκ+d)} ) ≥ 1 − 1/n.   (36)

Here, (36) holds if P has separation exponent κ ∈ (0, ∞), and if the kernel K has bounded support, it remains true for κ = ∞. In addition, the upper bound in (36) can be matched by a lower bound that only differs by a double logarithmic factor provided that the separation exponent κ ∈ (0, ∞) is exact.
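The selection rule (34) simply runs Algorithm 1 once per candidate width and keeps a width achieving the smallest returned level. A minimal sketch of this loop, in which `returned_level` is a hypothetical stand-in for a full run of Algorithm 1 rather than the actual estimator:

```python
import math

def returned_level(delta, n=5000):
    # hypothetical surrogate for rho_{D,delta,out}: too small a width inflates
    # the stochastic term, too large a width inflates the smoothing term
    return 0.05 + delta ** 2 + math.log(n) / (n * delta ** 2)

# geometrically spaced candidate widths, as in the experiments of Section 7
candidates = [0.01 * 1.25 ** k for k in range(25)]

levels = {delta: returned_level(delta) for delta in candidates}
delta_star = min(levels, key=levels.get)   # a width chosen by (34)
rho_star = levels[delta_star]              # the smallest returned level
```

Note that `delta_star` may be non-unique in general, mirroring the remark after (34), while `rho_star` is always unique.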
Finally, if Assumption R is satisfied, we further find

P^n( D ∈ X^n : Σ_{i=1}^2 λ_d(B_i(D) △ A*_i) ≤ K̂ ( (log n)³ (log log n)² / n )^{ϑγκ/(2γκ+ϑd)} ) ≥ 1 − 1/n

for all sufficiently large n ≥ 16, where K̂ is another constant independent of n. We finally mention that the presented adaptive strategy works optimally as long as all split levels have the same exact separation exponent κ. Indeed, as described above, our strategy detects, modulo some logarithmic terms, an asymptotically optimal choice δ_{D,∆} for the first split level, and since for all other split levels this choice is also asymptotically optimal as long as they have the same exact separation exponent κ, we see that Algorithm 2 is adaptive when using δ_{D,∆}.

6. Comparisons

In this section we compare our findings to the most closely related papers, namely (Wang et al., 2019) and (Chaudhuri et al., 2014). In particular, we discuss the different assumptions as well as the obtained statistical guarantees. To begin with, let us have a look at the assumptions on P, respectively its density h, made by Wang et al. (2019). Here, we first note that h is assumed to be α-smooth for some α > 0, that is, h is s := ⌈α⌉ − 1 times continuously differentiable and all partial derivatives of order s are (α − s)-Hölder continuous. In general, h does not need to have compact support; instead, Wang et al. (2019) only assume that {h ≥ ρ} is compact for all ρ > ρ*, where ρ* is the smallest split level, see their Assumptions C and C′, respectively. In this respect we note that for P having a continuous density their notion of split levels is closely related to ours and that their Assumption S(κ) equals our separation exponent κ. Finally, Wang et al. (2019) do not impose a thickness assumption; instead a so-called inner cone condition is considered.
Recall that this cone condition assumes that constants ε_I > 0, c_I > 0, and r_I > 0 exist such that for all split levels ρ* and all ρ ∈ (ρ* − ε_I, ρ* + ε_I) we have

λ_d( B(x, r) ∩ {h ≥ ρ} ) ≥ c_I r^d,   x ∈ {h ≥ ρ}, r ∈ (0, r_I].

In general, it seems unclear how this condition relates to our thickness assumption, but in many natural situations the inner cone condition and the thickness assumption with γ = 1 are simultaneously satisfied. For details, we refer to the discussion in (Steinwart, 2015b, Appendix A.5), the rather generic examples considered in (Steinwart, 2015c, Appendix B.2), and the discussion in (Wang et al., 2019, page 15). In the following comparison we therefore assume that the inner cone condition and the thickness assumption with γ = 1 are simultaneously satisfied. In addition, we assume that h has finitely many split levels.

Like ours, the clustering algorithm of Wang et al. (2019) is also based on a kernel density estimator. However, their central Algorithm 2 uses so-called α-valid kernels, whose KDEs enjoy the ideal approximation error behavior ‖h_{P,δ} − h‖_∞ ≼ δ^α for δ → 0 and α-smooth densities h. We refer to (Wang et al., 2019, page 9) for details. Finally, their split level estimator uses a verification strategy that is similar to Line 3 of our Algorithm 1, see (Wang et al., 2019, Definition 6). To compare the split level guarantee of Wang et al. (2019) to ours, we assume that h is α-smooth, that all split levels have separation exponent α, and that h satisfies the additional assumptions discussed above. Also, we assume that their Algorithm 2 uses an α-valid kernel. Then (Wang et al., 2019, Proposition 3) shows that, modulo some logarithmic factors, all split levels can be estimated with rate n^{−α/(2α+d)}. Since we can choose κ = α and γ = 1, this rate coincides with ours established in Corollary 20 if we again ignore logarithmic factors.
Given any κ > 0, however, our results do not require h to be κ-smooth, while (Wang et al., 2019, Proposition 3) only holds for κ-smooth densities. Moreover, note that there are α-smooth densities with κ > α; in fact, we may have α = 1 and κ = ∞, and for such densities our rates are significantly better than those of Wang et al. (2019). In addition, our clustering algorithm does not need to use kernels that are aligned with the smoothness of h, and unlike Wang et al. (2019) we also present a way to make our algorithm adaptive to the unknown κ. Finally, we would like to emphasize that apart from Proposition 3, (Wang et al., 2019) contains several other interesting results, which are, however, not comparable to ours.

Unlike the comparison to (Wang et al., 2019), the comparison to (Chaudhuri et al., 2014) turns out to be much more involved, as the latter paper uses assumptions that are quite different from ours. Providing a detailed comparison is therefore beyond the scope of this paper and we refer the interested reader to (Steinwart et al., 2021, Appendix A) for a detailed comparison. To summarize the key points of this comparison, we show that, up to logarithmic factors, the best possible convergence rate achieved by the central Theorem VII.5 of Chaudhuri et al. (2014) for estimating the split levels is n^{−α/(3α+d)}, while our algorithm achieves a rate of n^{−α/(2α+d)}, which is strictly better for all dimensions, with α being the Hölder exponent of the underlying density. In particular, our rates can be achieved without knowing α, while Chaudhuri et al. (2014) do not offer such adaptivity. In addition, our results can handle discontinuous densities, e.g. step functions on rectangles with mutually positive distances, while their Theorem VII.5 does not provide any guarantee at all for such densities.
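The claimed improvement can be read off the exponents alone: α/(2α + d) > α/(3α + d) for every α > 0 and d ≥ 1. A trivial check:

```python
def exponent_ours(alpha, d):
    # split-level rate n^{-alpha/(2*alpha+d)} from Corollary 20 (log factors ignored)
    return alpha / (2 * alpha + d)

def exponent_cdks(alpha, d):
    # best possible rate n^{-alpha/(3*alpha+d)} from Theorem VII.5 of
    # Chaudhuri et al. (2014), as discussed above
    return alpha / (3 * alpha + d)

for d in (1, 2, 10, 100):
    assert exponent_ours(1.0, d) > exponent_cdks(1.0, d)

print(exponent_ours(1.0, 2), exponent_cdks(1.0, 2))  # 0.25 0.2
```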
In particular, the consistency results for their unpruned algorithm require mild uniform continuity conditions, see the end of Section III.B in (Chaudhuri et al., 2014), and the guarantees for the pruned algorithm stated in their Theorem VII.5 explicitly require control on the uniform modulus of continuity. Finally, apart from split level estimation, our results also provide guarantees for the corresponding clusters in measure, while no such guarantees are provided in (Chaudhuri et al., 2014).

7. Experiments

In this section, we illustrate the behavior of our generic KDE-based clustering algorithm on a few artificial data sets for which the ground truth clustering can be computed. In addition, we compare its performance to k-means and hierarchical clustering.

Data. We consider six cluster problems, which are based on two-dimensional distributions cut down to [0, 1]², with different degrees of difficulty: The first distribution, see Figure 9, is a mixture of 15 Gaussian distributions and a uniform background distribution. In addition, 12 out of the 15 well-separated distributions have a covariance of the form λI₂, where I₂ is the 2 × 2 identity matrix and λ is some scaling factor, while the remaining 3 Gaussians have a different covariance matrix. Since this distribution was inspired by the S2 data set of Fränti and Virmajoki (2006) we will call it S2 in the following. The second distribution, see Figure 10, is a modification of the first. Namely, two of the 15 clusters have been shrunken and moved towards each other, while the remaining clusters have been left unchanged. In the following we call it S2 modified. The third distribution, called toy 3G, is a mixture of 3 Gaussian distributions with similar but not identical covariances, see Figure 11. Unlike in the first data set, however, the Gaussians are less separated and less concentrated, making this distribution slightly more difficult.
The fourth distribution, see Figure 12, is based upon the third distribution. To be more precise, two Gaussians with close-by centers and with small variances have been added. In the following we call it toy 5G.

Figure 9: Left. A sample data set (top) of size n = 5000 from S2 and its ground truth clustering (bottom). Middle-Left. The 25th (bottom) and 75th (top) best run of kmeans on the S2 data sets. Middle. Corresponding 25th (bottom) and 75th (top) best runs of hclust. Middle-Right and Right. The 25th (bottom) and 75th (top) best runs of our algorithm for the moving window kernel (middle-right) and the Epanechnikov kernel (right). All algorithms perform well.

The fifth distribution, called bananas, is a variant of the classical bananas, or two-moon, data distribution, which is often used to assess the clustering performance on non-centroid-like clusters. Unlike its usual form, however, our variant consists of two very fuzzy bananas, which makes even a visual identification of the two clusters not immediately obvious, see Figure 13. As a result of this noise, both clusters can be almost perfectly separated by a diagonal line, and therefore it is in principle possible to find a decent clustering with the help of two suitably chosen centroids. The sixth distribution, see Figure 14, is another mixture of Gaussians. In this distribution, however, each of its three clusters corresponds to a mixture of two Gaussians that have the same mean but different variances. In addition, the clusters merge into each other, and as a result, this data set can be viewed as the most challenging one from a visual perspective. In the following, we call it crosses. For each of the six cluster problems described above and each of the sample sizes

n ∈ {2500, 3000, 3500, 4200, 5000, 6000, 7000, 8200, 10000, 14000, 20000}

we generated 100 data sets.
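A toy sampler in the spirit of these distributions — a mixture of isotropic Gaussians plus a uniform background, cut down to [0, 1]² — can be sketched as follows; the means, weights, and variances are illustrative placeholders, not the parameters actually used for S2:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, means, cov_scale=0.001, background=0.1):
    """Draw n points from a mixture of isotropic Gaussians plus a uniform
    background, rejecting points outside [0, 1]^2 (cut-down distribution)."""
    pts = []
    while len(pts) < n:
        if rng.random() < background:            # uniform background point
            p = rng.random(2)
        else:                                    # point from a random component
            m = means[rng.integers(len(means))]
            p = rng.normal(m, np.sqrt(cov_scale))
        if 0.0 <= p[0] <= 1.0 and 0.0 <= p[1] <= 1.0:
            pts.append(p)
    return np.array(pts)

means = [(0.2, 0.2), (0.8, 0.3), (0.5, 0.8)]     # illustrative cluster centers
D = sample_mixture(5000, means)
print(D.shape)  # (5000, 2)
```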
In addition, we also computed the true densities of the 6 distributions on a 1000 × 1000 grid of [0, 1]² to find a high-resolution approximation of the ground truth clustering. Applying these ground truth clusters to the data sets makes it possible to assess how well the different algorithms work.

Performance Measures. Besides visual inspection we consider two different ways of comparing clustering algorithms. To describe these comparisons, let us assume that we have a data set D, ground truth clusters C₁, . . . , C_k ⊆ D, and estimated clusters Ĉ₁, . . . , Ĉ_m ⊆ D.

Figure 10: Left. A sample data set (top) of size n = 5000 from S2 modified and its ground truth clustering (bottom). Middle-Left. The 25th (bottom) and 75th (top) best run of kmeans on the S2 modified data sets. Middle. Corresponding 25th (bottom) and 75th (top) best runs of hclust. Middle-Right and Right. The 25th (bottom) and 75th (top) best runs of our algorithm for the moving window kernel (middle-right) and the Epanechnikov kernel (right). Even in the better runs kmeans and hclust cannot separate the two modified clusters in the bottom-right, while for our algorithms a problem occurs only for the moving window kernel in the worse run. Namely, in this case one cluster remained undetected.

In all experiments we always observed m ≤ k. Since k-means and hierarchical clustering produce clusterings with Ĉ₁ ∪ · · · ∪ Ĉ_m = D, while the ground truth clusters do not have this property for our data sets, we restrict the considered data set to the samples that occur in both a true and an estimated cluster, that is

D_perf := { x ∈ D : x ∈ (C₁ ∪ · · · ∪ C_k) ∩ (Ĉ₁ ∪ · · · ∪ Ĉ_m) }.

Since in general the jth estimated cluster does not relate to the jth ground truth cluster, we first needed to find a suitable matching, that is, an injective map Φ : {1, . . . , m} → {1, . . . , k}. To this end, we define the matching error to be

E(Φ) := |{ x ∈ D_perf : x ∉ C_{Φ(j(x))} }| / |D_perf|,   (37)

where for all x ∈ D_perf, we denote by j(x) the unique index j with x ∈ Ĉ_j.
In other words, E(Φ) equals the fraction of samples x in D_perf for which Φ does not provide a correct matching of the cluster Ĉ_{j(x)} to C_{Φ(j(x))}. Note that if Φ is a perfect matching in the sense of Ĉ_j ∩ D_perf = C_{Φ(j)} ∩ D_perf for all j = 1, . . . , m, then E(Φ) = 0. For this reason, we determined a matching Φ* that minimizes E(·), where we note that even in the case of the first two cluster problems this was computationally feasible with the help of a greedy approach followed by a brute-force calculation over a restricted set of permutations. The corresponding optimal matching errors E(Φ*), averaged over all 100 repetitions of each cluster problem and each considered algorithm, are reported in Figure 15.

Figure 11: Left. A sample data set (top) of size n = 5000 from toy 3G and its ground truth clustering (bottom). Middle-Left. The 25th (bottom) and 75th (top) best run of kmeans on the toy 3G data sets. Middle. Corresponding 25th (bottom) and 75th (top) best runs of hclust. Middle-Right and Right. The 25th (bottom) and 75th (top) best runs of our algorithm for the moving window kernel (middle-right) and the Epanechnikov kernel (right). All algorithms perform well.

At this point let us briefly discuss alternative ways to define the (optimal) matching error. To this end, we define C₀ := D \ (C₁ ∪ · · · ∪ C_k) and Ĉ₀ := D \ (Ĉ₁ ∪ · · · ∪ Ĉ_m), that is, we have D_perf = D \ (C₀ ∪ Ĉ₀). Of course, we could also consider e.g. D_perf := D \ Ĉ₀ in the definition of (37) by setting Φ(0) := 0 for all considered maps Φ : {1, . . . , m} → {1, . . . , k}. In this alternative, all samples x ∈ C₀ \ Ĉ₀ are additionally counted for the matching error. Now kmeans and hclust both assign all samples to some estimated cluster, that is, we have Ĉ₀ = ∅. In other words, all samples x ∈ C₀ are automatically counted as an error for both algorithms.
This effect is clearly not desirable in a fair comparison, in particular since for some distributions, such as bananas and crosses, the set C₀ is substantial, see Figures 13 and 14. Conversely, if we consider D_perf := D \ C₀, then (37) additionally counts all samples x ∈ Ĉ₀ \ C₀. For kmeans and hclust, this set is empty by design, while our algorithms usually satisfy |Ĉ₀ \ C₀| > 0. Consequently, we would again have an undesirable effect. The final choice D_perf := D inherits both issues of the previously considered alternatives.

For the second numerical comparison, we again consider Φ*. For this Φ*, we counted the number of ground truth clusters not found by the considered algorithm, that is k − m, and added the number of non-covering clusters, that is, the number of clusters Ĉ_j that do not cover at least 50% of the samples in the matched ground truth cluster C_{Φ*(j)} ∩ D_perf.

Figure 12: Left. A sample data set (top) of size n = 5000 from toy 5G and its ground truth clustering (bottom). Middle-Left. The 25th (bottom) and 75th (top) best run of kmeans on the toy 5G data sets. Middle. Corresponding 25th (bottom) and 75th (top) best runs of hclust. Middle-Right and Right. The 25th (bottom) and 75th (top) best runs of our algorithm for the moving window kernel (middle-right) and the Epanechnikov kernel (right). Both kmeans and hclust cannot separate the two spatially smaller clusters on the right, and as a result, they are forced to cut through one of the spatially larger clusters. In contrast, our algorithms perform well.

The averages over the resulting identification errors

I(Φ*) := k − m + |{ j : |Ĉ_j ∩ C_{Φ*(j)} ∩ D_perf| / |C_{Φ*(j)} ∩ D_perf| < 1/2 }|

are reported in Figure 16. Finally, for visual inspection we ordered the 100 results of each of the algorithms described below on each of the six cluster problems with n = 5000 with the help of the joint error E(Φ*). We then plotted the clusterings of the 25th and 75th best run, see Figures 9 to 14.
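Both performance measures can be sketched as follows, assuming label vectors already restricted to D_perf (ground truth labels 0, ..., k−1 and estimated labels 0, ..., m−1); the exhaustive search for Φ* below is feasible only for small k, which is why the paper combines a greedy step with a brute force over a restricted set of permutations:

```python
from itertools import permutations

def matching_error(true_lab, est_lab, phi):
    """E(Phi) of (37): fraction of samples whose estimated cluster is
    matched to a ground truth cluster not containing them."""
    wrong = sum(phi[j] != c for c, j in zip(true_lab, est_lab))
    return wrong / len(true_lab)

def best_matching(true_lab, est_lab, k, m):
    """Brute-force minimiser Phi* of E over injective maps {0..m-1} -> {0..k-1}."""
    cands = (dict(enumerate(p)) for p in permutations(range(k), m))
    return min(cands, key=lambda phi: matching_error(true_lab, est_lab, phi))

def identification_error(true_lab, est_lab, phi, k, m):
    """I(Phi*): missed ground truth clusters (k - m) plus matched clusters
    covering less than 50% of their ground truth cluster."""
    non_covering = 0
    for j in range(m):
        target = [i for i, c in enumerate(true_lab) if c == phi[j]]
        hit = sum(1 for i in target if est_lab[i] == j)
        if hit / len(target) < 0.5:
            non_covering += 1
    return (k - m) + non_covering

true_lab = [0, 0, 0, 1, 1, 2, 2, 2]
est_lab  = [1, 1, 1, 0, 0, 0, 0, 0]   # only two estimated clusters, labels permuted
phi = best_matching(true_lab, est_lab, k=3, m=2)
print(matching_error(true_lab, est_lab, phi))                 # 0.25
print(identification_error(true_lab, est_lab, phi, k=3, m=2)) # 1
```

In the example, the best matching sends estimated cluster 1 to ground truth cluster 0 and estimated cluster 0 to ground truth cluster 2, so two of the eight samples are mismatched and one ground truth cluster goes undetected.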
These visualizations help to understand the sources of the typical errors made by the individual algorithms.

Algorithms. We implemented an iterative version of Algorithm 2, in which for k = 0, 1, . . . we first computed the τ-connected components of L_{ρ+kε} that do not vanish at level L_{ρ+(k+2)ε}, see also Line 3 of Algorithm 1. With the help of these connected components we then generated the corresponding cluster tree estimate, where we note that we skipped Line 7 of Algorithm 1 since this line has only been inserted into Algorithm 1 to simplify the form of the statistical guarantees. We then called the resulting cluster tree estimator for different KDE level set estimators, that is, for different values of δ. The first split in the cluster tree was then obtained by choosing the δ that resulted in the smallest first split level. If all split levels have the same separation exponent κ, then it asymptotically suffices to consider this δ for the entire cluster tree. In general, however, splits further up in the tree may have either a different κ or at least a different constant c_sep. In these cases, choosing a different δ at different splits may be beneficial.

Figure 13: Left. A sample data set (top) of size n = 5000 from bananas and its ground truth clustering (bottom). Middle-Left. The 25th (bottom) and 75th (top) best run of kmeans on the bananas data sets. Middle. Corresponding 25th (bottom) and 75th (top) best runs of hclust. Middle-Right and Right. The 25th (bottom) and 75th (top) best runs of our algorithm for the moving window kernel (middle-right) and the Epanechnikov kernel (right). While the implicit bias of kmeans seems to consistently direct it towards imperfect centroids, the performance of hclust seems to be rather sensitive to the instance of the data set. In comparison, our algorithms are less sensitive and perform sufficiently well even in their worse runs.
We incorporated this insight by considering, after each detected split level ρ, those δ whose estimated cluster tree has another split level within one of the clusters emerging at ρ. Among those δ we then again chose the one resulting in the smallest next split level. We considered 500 geometrically spaced candidate values of δ between c(ln(n)/n)^{1/d} and c(ln n)^{−1/d}, where in the experiments the factor c was determined by an estimate of the median mutual distance between the samples of the considered data set. Notice that modulo this factor c and some (double) logarithmic terms, this setup coincides with the theoretically derived one, see (35). Moreover, we considered both a plain moving window kernel and the Epanechnikov kernel, where in both cases the underlying norm was the Euclidean distance. Since both kernels have bounded support, we simply chose σ := δ, see (24), and

ε := 3 ( ‖h_{D,δ}‖_∞ n^{−1} δ^{−d} )^{1/2}

for each candidate value δ. Modulo some logarithmic terms, this choice for ε follows our theoretical insights, see (33). Finally, we decided to focus on thickness guarantees with the most natural choice γ := 1, see the detailed discussion in (Steinwart, 2015b, Appendix A.5); that is, we do not expect the algorithm to correctly keep clusters together that have thinner cusps or bridges. Based on this decision, we chose τ := (2 + ϵ)δ with ϵ = 0.00001, where we note that our theoretical findings actually hold

Figure 14: Left. A sample data set (top) of size n = 5000 from crosses and its ground truth clustering (bottom). Middle-Left. The 25th (bottom) and 75th (top) best run of kmeans on the crosses data sets. Middle. Corresponding 25th (bottom) and 75th (top) best runs of hclust. Middle-Right and Right. The 25th (bottom) and 75th (top) best runs of our algorithm for the moving window kernel (middle-right) and the Epanechnikov kernel (right).
While the implicit bias of kmeans seems to consistently direct it towards centroids with sufficiently benign Voronoi partitions, the behavior of hclust seems to be rather sensitive to the instance of the data set. In comparison, our algorithm with the moving window kernel is less sensitive and performs sufficiently well even in its worse runs, while the Epanechnikov kernel performs less reliably.

true for each value τ > 2δ, if one carefully tracks all constants. In addition, this choice makes it possible that the estimated clusters can be as close as ϵδ to each other. Besides our methods we also considered k-means and hierarchical clustering. To this end, we used the functions kmeans, kmeans++, and hclust of R. Both types of algorithms have some hyper-parameters, which ideally would be chosen in a data-driven approach. To ensure fairness for kmeans, kmeans++, and hclust, such data-driven approaches would need to be calibrated to at least one of our two performance measures E(Φ*) or I(Φ*). However, we are not aware of any method to reliably estimate these performance measures, or some calibrated surrogates for them, from data alone, and hence we proceeded differently. Namely, kmeans, kmeans++, and hclust received the correct number k of ground truth clusters as an input parameter, and kmeans was repeated with 100 random initializations using the parameter nstart = 100. In addition, we first considered the 3 different versions of the kmeans function on our data sets and identified the best one with the help of our performance metrics. Here it turned out that all three versions led to very similar results, with Lloyd having a marginal advantage over the other two. In addition, all three versions produced better results than kmeans++, probably because we used 100 random initializations for kmeans. Similarly, we considered the 8 different versions of hclust. It turned out
It turned out Adaptive Clustering Using Kernel Density Estimators Figure 15: Average matching errors E(Φ ) for optimal matching Φ , see (37) for the different cluster algorithms and for different sample sizes. While kmeans and hclust are essentially not influenced by the sample size, our methods clearly become better with increasing sample size n. that ward.D and ward.D2 performed best with a slight advantage of ward.D2, which we then used in the comparisons. In the experiments both kmeans and hclust were thus privileged in two ways: a) the correct number of clusters were given to them, while our algorithms had no information at all about the cluster problems at hand; and b) the best performing version of kmeans and hclust were chosen in hindsight, and with respect to our ground-truth performance measures. In contrast, our algorithms had to choose their hyper-parameters in a fully adaptive way and without access to ground truth performance measures. Results. Figures 15 and 16 clearly show that the performance of kmeans and hclust is almost independent of the data set size n, while our two algorithms heavily depend on n. For example, on five of the six cluster problems, namely on all but toy 5G, the matching errors achieved by our algorithms for n = 2500 are substantially worse than those of kmeans and hclust and the same is true for the identification errors. For n = 5000, however, the matching errors of our algorithms are at least as good as those of kmeans and hclust and on at least four of cluster problems our algorithms clearly outperform kmeans and hclust. For the identification errors the situation is still mixed, but this is not really surprising, as the first term k m in the computation of I(Φ ) vanishes for kmeans and hclust since the privileged information given to these two algorithms always ensures k = m. 
Despite this advantage, our algorithms always achieve identification errors that are at least as good

Figure 16: Average identification errors I(Φ*) for the optimal matching Φ*. Both kmeans and hclust have difficulties on the two modified data sets. Our methods become better with increasing sample sizes and in all cases, no errors are made for n ≥ 10000.

as those of kmeans and hclust as soon as we have n ≥ 8200. Finally, for n ≥ 10000 our algorithms work almost error free with respect to both performance measures on all six cluster problems. In terms of identification errors, both kmeans and hclust have substantial difficulties with the two modifications S2 modified and toy 5G, while on the remaining four cluster problems they perform almost flawlessly. A closer visual inspection reveals that both cluster algorithms cannot identify the two spatially smaller, close-by clusters in the modifications, see Figures 10 and 12, and this phenomenon occurs on at least most of the data sets of, e.g., size n = 5000. In comparison, our algorithm with the moving window kernel has some problems identifying one cluster in S2 modified, see Figure 10, but this phenomenon does not occur as often as the problems of kmeans and hclust. The same observation can be made for our algorithm working with the Epanechnikov kernel on crosses. Unsurprisingly, the poor identification error performance of kmeans and hclust on the two modified cluster problems S2 modified and toy 5G directly translates into poor matching errors. In addition, both algorithms have problems on bananas and on crosses, see Figure 15. The reasons for these problems are different: While kmeans constantly chooses imperfect centroids on the bananas data sets, see Figure 13, hclust seems to be sensitive to the particular instance of the data set. The same behavior of both algorithms can be observed on crosses, see Figure 14.
In comparison, a poor matching error performance of our algorithms seems to be directly related to the identification errors made in some cases.

8. Proofs

Proof of Lemma 4: Let us first assume that A^{−δ} ≠ ∅. Then there exists an x ∈ A^{−δ}, and since A^{−δ} is open there also exists an ε > 0 with B(x, ε) ⊆ A^{−δ}. Our first intermediate goal is to show

B(y, δ) ⊆ A   for all y ∈ B(x, ε).   (38)

To this end, we note that y ∈ B(x, ε) ⊆ A^{−δ} implies y ∉ (R^d \ A)^{+δ}. For z ∈ R^d \ A we thus find d(y, z) ≥ d(y, R^d \ A) > δ, which shows (38). Now observe that for the proof of δ < inrad A it clearly suffices to establish

B(x, δ + ε) ⊆ A.   (39)

Let us therefore fix a z ∈ B(x, δ + ε). By considering y := x in (38) we first note that in the case ‖x − z‖ ≤ δ we have z ∈ A. Hence it suffices to consider the case δ < ‖x − z‖ ≤ δ + ε. For t := δ/‖x − z‖ ∈ (0, 1) and y := (1 − t)z + tx, a quick calculation then shows both ‖y − z‖ = δ and ‖x − y‖ = ‖x − z‖ − δ ≤ ε. Applying (38) then yields (39).

Let us now assume that δ < inrad A. Then there exists an ε > 0 with δ + ε < inrad A, and hence we find an x ∈ A with B(x, δ + ε) ⊆ A. Clearly, it suffices to show that x ∈ A^{−δ}. To this end, we assume that x ∉ A^{−δ}, that is x ∈ (R^d \ A)^{+δ}. Since this means d(x, R^d \ A) ≤ δ, we then find a y ∈ R^d \ A with d(x, y) ≤ δ + ε. This contradicts B(x, δ + ε) ⊆ A.

8.1 Proofs for the Generic Algorithm in Section 3

Proof of Lemma 5: i) ⇒ ii). We choose a τ > 2c inrad M and δ := inrad M. By Lemma 4 we then know M^{−δ} = ∅, and our construction also gives τ > 2cδ. If |M^{+δ}| = 1 we obviously have diam M^{+δ} = 0 < τ. Moreover, if |M^{+δ}| > 1, there exist x, y ∈ M^{+δ} with ‖x − y‖ = diam M^{+δ}, since M^{+δ} is compact and ‖· − ·‖ : M^{+δ} × M^{+δ} → R is continuous. For A := {x, y} ⊆ M^{+δ} we then have |C_τ(A)| = 1, which shows ‖x − y‖ < τ. In summary, we therefore always have diam M^{+δ} < τ. Now a simple calculation shows diam M^{+δ} = 2δ + diam M, see also (Steinwart et al., 2021, Lemma 8.2), and thus we obtain τ > 2 inrad M + diam M. Since we initially chose τ > 2c inrad M arbitrarily, we conclude that 2c inrad M ≥ 2 inrad M + diam M. ii) ⇒ i).
Let us fix a δ > 0 with M^{−δ} = ∅. Lemma 4 then shows δ ≥ inrad M. We now fix a non-empty A ⊂ M^{+δ} and some τ > 2cδ. This yields

τ > 2(c − 1)δ + 2δ ≥ 2(c − 1) inrad M + 2δ ≥ diam M + 2δ = diam M^{+δ}.

For x, y ∈ A we thus find ‖x − y‖ ≤ diam A ≤ diam M^{+δ} < τ, and hence A is τ-connected, that is, |Cτ(A)| = 1.

Theorem 25 Let ρ∗ ≥ 0 and Assumption P be satisfied with |C(Mρ)| ≤ 1 for all ρ ≥ ρ∗. Moreover, let (Lρ)_{ρ≥0} be a decreasing family of sets Lρ ⊂ X such that

M^{−δ}_{ρ+ε} ⊂ Lρ ⊂ M^{+δ}_{ρ−ε} (40)

for some fixed ε, δ > 0 and all ρ ≥ ρ∗. For all ρ ≥ ρ∗ we then have:

i) If M^{−δ}_{ρ+3ε} ≠ ∅, then |Cτ(M^{−δ}_{ρ+ε})| = 1 for all τ > 2ψ∗_{Mρ+ε}(δ), and the corresponding CRM ζ : Cτ(M^{−δ}_{ρ+ε}) → Cτ(Lρ) satisfies

Cτ(Lρ) = ζ(Cτ(M^{−δ}_{ρ+ε})) ∪ {B′ ∈ Cτ(Lρ) : B′ ∩ Lρ+2ε = ∅}. (41)

ii) If M^{−δ}_{ρ+ε} = ∅, Assumption S is satisfied, and δ ∈ (0, δ_thick], then we have

|{B ∈ Cτ(Lρ) : B ∩ Lρ+2ε ≠ ∅}| ≤ 1, τ > 2c_thick δ^γ. (42)

Proof of Theorem 25: i). We first note that M^{−δ}_{ρ+3ε} ≠ ∅ implies M^{−δ}_{ρ+ε} ≠ ∅. Now, (Steinwart, 2015b, Lemma A.4.3) showed that for all bounded A ⊂ ℝ^d with |C(A)| < ∞, all δ > 0 with A^{−δ} ≠ ∅, and all τ > 2ψ∗_A(δ) we have |Cτ(A^{−δ})| ≤ |C(A)|. Thus we find |Cτ(M^{−δ}_{ρ+ε})| ≤ |C(Mρ+ε)| ≤ 1, and by the already observed M^{−δ}_{ρ+ε} ≠ ∅ we conclude that |Cτ(M^{−δ}_{ρ+ε})| = 1. To establish (41) we now write A := M^{−δ}_{ρ+ε} and B := ζ(A). Our first intermediate goal is to establish the following disjoint union:

Cτ(Lρ) = {B} ∪ {B′ ∈ Cτ(Lρ) \ {B} : B′ ∩ Lρ+2ε ≠ ∅} ∪ {B′ ∈ Cτ(Lρ) : B′ ∩ Lρ+2ε = ∅}. (43)

To this end, note that M^{−δ}_{ρ+3ε} ≠ ∅ and M^{−δ}_{ρ+3ε} ⊂ A together with A ⊂ ζ(A) = B implies

∅ ≠ M^{−δ}_{ρ+3ε} = A ∩ M^{−δ}_{ρ+3ε} ⊂ B ∩ Lρ+2ε,

and thus {B′ ∈ Cτ(Lρ) \ {B} : B′ ∩ Lρ+2ε = ∅} = {B′ ∈ Cτ(Lρ) : B′ ∩ Lρ+2ε = ∅}. This gives (43). Let us now show (41). To this end, we first observe that |Cτ(M^{−δ}_{ρ+ε})| = 1 implies Cτ(M^{−δ}_{ρ+ε}) = {A} and hence ζ(Cτ(M^{−δ}_{ρ+ε})) = {B}. In view of (43) it thus remains to show B′ ∩ Lρ+2ε = ∅ for all B′ ∈ Cτ(Lρ) with B′ ≠ B. Let us assume the converse, that is, there is a B′ ∈ Cτ(Lρ) with B′ ≠ B and B′ ∩ Lρ+2ε ≠ ∅. Since Lρ+2ε ⊂ M^{+δ}_{ρ+ε}, there then exists an x ∈ B′ ∩ M^{+δ}_{ρ+ε}.
By part i) of (Steinwart, 2015b, Lemma A.3.1) this gives an x Mρ+ε with d(x, x ) δ, and hence we obtain d(x , M δ ρ+ε) ψ Mρ+ε(δ) < τ From this inequality we conclude that there is an x M δ ρ+ε satisfying d(x , x ) < τ/2. The CRM property then yields x M δ ρ+ε = A B and hence δ ψ Mρ+ε(δ), see (Steinwart et al., 2021, Lemma 8.5), gives d(B , B) d(x, x ) d(x, x ) + d(x , x ) < δ + τ/2 ψ Mρ+ε(δ) + τ/2 < τ Adaptive Clustering Using Kernel Density Estimators However, B = B implies d(B , B ) τ by (Steinwart, 2015b, Lemma A.2.4), i.e. we have found a contradiction. ii). If Lρ+2ε = there is nothing to prove, and hence we may assume that Lρ+2ε = . Now assume that (42) is false. Then there exist B1, B2 Cτ(Lρ) with B1 = B2 and Bi Lρ+2ε = for i = 1, 2. For i = 1, 2 we consequently find xi Bi Lρ+2ε, and for these there exist Ai Cτ(Lρ+2ε) with xi Ai. Now recall from (Steinwart, 2015b, Lemma A.2.7) that Lρ+2ε Lρ implies Cτ(Lρ+2ε) Cτ(Lρ), and therefore we have a CRM ζ : Cτ(Lρ+2ε) Cτ(Lρ). Our construction then gives xi Ai Bi ζ(Ai) Bi , and thus ζ(Ai) Bi = for i = 1, 2. However, ζ(Ai) and Bi are both elements of the partition Cτ(Lρ) and hence we conclude ζ(Ai) = Bi for i = 1, 2. Moreover, ζ is a map, and therefore B1 = B2 implies A1 = A2. Let us write A := A1 A2. Since we know from (Steinwart, 2015b, Lemma A.2.4) that d(A1, A2) τ, we conclude by (Steinwart, 2015b, Lemma A.2.8) that Cτ(A) = {A1, A2}, and thus |Cτ(A)| = 2. However, we also have A Lρ+2ε M+δ ρ+ε, and since M δ ρ+ε = holds, Assumption S together with δ (0, δthick] and τ > 2cthickδγ ensures |Cτ(A)| = 1. Since this contradicts |Cτ(A)| = 2 we have proven (42). Proof of Theorem 7: For i 0 we write ρi := ρ0 + iε for the sequence of potential levels Algorithm 1 visits. Moreover, let i := max{i 0 : M δ ρi+3ε = }, where we note that this maximum is finite by (Steinwart et al., 2021, Lemma 8.3). For i = 0, . . . 
, i , part i) of Theorem 25 then shows that Algorithm 1 identifies exactly one component in its Line 3, and therefore it only identifies more than one component in Line 3, if i i + 1. If it finishes the loop at Line 5, we thus know that ρ ρi +2, and therefore the level ρ considered in Line 7 satisfies ρ ρi +4. Now the definition of i yields M δ ρi +1+3ε = , and since ρi +1 + 3ε = (i + 1)ε + 3ε = (i + 4)ε = ρi +4, we find M δ ρ = for the ρ considered in Line 7. This implies M δ ρ+ε = , and hence part ii) of Theorem 25 shows that Algorithm 1 identifies at most one component in Line 7. Lemma 26 Let Assumption M be satisfied, ρ (ρ , ρ ], ε := ρ ρ , and A1,ρ, and A2,ρ be the two connected components of Mρ, i.e. C(Mρ) = {A1,ρ, A2,ρ}. Then the following statements hold: i) For all 0 < τ 3τ (ε) we have Cτ(Mρ) = C(Mρ). ii) For all 0 < δ < τ τ (ε) we have Cτ(Mρ) Cτ(M+δ ρ ) = {A+δ 1,ρ, A+δ 2,ρ}. iii) For δ (0, δthick] and ψ(δ) < τ τ (ε) we have |Cτ(M δ ρ )| = 2 with Cτ(M δ ρ ) = {A δ 1,ρ, A δ 2,ρ}. Proof of Lemma 26: To adapt to the notation of Steinwart (2015a,b) we write τ Mρ := d(A1,ρ, A2,ρ). Note that this definition gives τ Mρ = 3τ (ρ ρ ) = 3τ (ε). i). The assertion directly follows part ii) from (Steinwart, 2015b, Proposition A.2.10). ii). The assertion has been shown in part iii) of (Steinwart, 2015b, Lemma A.4.1). Steinwart, Sriperumbudur, and Thomann iii). We first note that using part ii) of (Steinwart, 2015a, Theorem 2.7) with ε := ε and ρ = ρ + ε we find |Cτ(M δ ρ )| = |C(Mρ)| = 2. Moreover, we have d(A1,ρ, A2,ρ) = τ Mρ = 3τ (ε) τ > ψ(δ) = 3cthickδγ > ψ Mρ(δ) δ , (44) where the last inequality follows from (Steinwart et al., 2021, Lemma 8.5) since Assumption M implies Assumption P, and hence X is connected. Consequently, part v) of (Steinwart, 2015b, Lemma A.3.1) yields M δ ρ = A1,ρ A2,ρ δ = A δ 1,ρ A δ 2,ρ , (45) and we additionally note that (44) implies d(A δ 1,ρ, A δ 2,ρ) d(A1,ρ, A2,ρ) τ . (46) Now let A1 and A2 be the two τ-connected components of Cτ(M δ ρ ). 
Let us assume that A1 = A δ 1,ρ and A1 = A δ 2,ρ. Then (45) shows that there exist x A1 A δ 1,ρ and x A1 A δ 2,ρ. Since A1 is τ-connected, there further exist x1, . . . , xn A1 with x1 = x , xn = x and d(xi, xi+1) < τ for all i = 1, . . . , n 1. By x1 A δ 1,ρ, xn A δ 2,ρ, and (45) we conclude that there is an i = {1, . . . , n 1} with xi A δ 1,ρ and xi+1 A δ 2,ρ. This gives d(A δ 1,ρ, A δ 2,ρ) d(xi, xi+1) < τ, which clearly contradicts (46). Lemma 27 Let Assumption M be satisfied, and P1 and P2 be defined by (5) for some fixed ρ (ρ , ρ ]. Then for i = 1, 2 and ρ ρ we have Mi,ρ = Mρ Ai,ρ . (47) Proof of Lemma 27: We first note that since P is normal at all levels ρ > 0, we have λd(Mρ {h ρ}) = 0 for all λd-densities h of P and all ρ > 0. For a fixed ρ ρ > 0. we can thus find a λd-density h of P such that Mρ = {h ρ} and Mρ = {h ρ }. Let us define hi := 1Ai,ρ h. Then hi is a λd-density of Pi and we have {hi ρ} = Mρ Ai,ρ . (48) Moreover, we have Mi,ρ = supp λd {hi ρ} , and hence it suffices to show that Mi,ρ = {hi ρ}. For the proof of the inclusion we fix an x Mi,ρ and an open U X with x U. The definition of the support of a measure then yields λd(U Mρ) = λd U {h ρ} λd U {hi ρ} > 0 , which in turn implies x Mρ. This shows Mi,ρ Mρ. Moreover, Ai,ρ is closed by definition and we further have λd Ai,ρ {hi ρ} = λd({hi ρ}) = λd X {hi ρ} . Adaptive Clustering Using Kernel Density Estimators Since the support of a finite measure is also the smallest closed subset having full measure, we conclude that Mi,ρ Ai,ρ . Combining the two found inclusions Mi,ρ Mρ and Mi,ρ Ai,ρ with (48) we have thus found the desired Mi,ρ {hi ρ}. For the proof of the converse inclusion we fix an x {hi ρ} = Mρ Ai,ρ . Moreover, we fix an open U X with x U, so that it suffices to show λd(U {hi ρ}) > 0. To this end, we may assume without loss of generality that i = 1. Moreover, since d(A1,ρ , A2,ρ ) > 0 and x A1,ρ we may additionally assume that U A2,ρ = . Now, x Mρ implies λd(U Mρ) > 0. 
Let us write Ak := Mρ Ak,ρ = {hk ρ}. This yields Mρ = A1 A2, A1 A2 = , and λd(U A2) λd(U A2,ρ ) = 0 . Using the disjoint union U Mρ = (U A1) (U A2), we conclude that λd(U {h1 ρ}) = λd(U A1) = λd(U Mρ) > 0 . As mentioned above this shows x M1,ρ. Lemma 28 Let Assumption M be satisfied, and P1 and P2 be defined by (5) for some fixed ρ (ρ , ρ ]. Then, for i = 1, 2, the following statements are true: i) For ε := ρ ρ and all 0 < δ < τ (ε ) and ρ ρ we have M+δ i,ρ = M+δ ρ A+δ i,ρ . ii) For all δ > 0 and ρ ρ we have M δ i,ρ = M δ ρ A δ i,ρ . Proof of Lemma 28: i). Let ξ : C(Mρ) C(Mρ ) be the CRM and B1, . . . , Bn be the connected components of C(Mρ). Without loss of generality we may assume there is an m {0, . . . , n} such that ξ(Bj) A1,ρ for all j = 1, . . . , m and ξ(Bj) A2,ρ for all j = m + 1, . . . , n. We define A1 := B1 Bm and A2 := Bm+1 Bn. Clearly, this construction ensures Ak ξ(Ak) Ak,ρ , k = 1, 2. (49) Moreover, we have Mρ = A1 A2, and hence we find M+δ ρ = A+δ 1 A+δ 2 by part iv) of (Steinwart, 2015b, Lemma A.3.1). In view of (47), we consequently need to prove that (A1 A2) Ai,ρ +δ = A+δ 1 A+δ 2 A+δ i,ρ . (50) Be begin by observing that (A1 A2) Ai,ρ = (A1 Ai,ρ ) (A2 Ai,ρ ) = Ai , (51) where we used (49) and A1,ρ A2,ρ = . Similarly, the right-hand side of (50) can be written as A+δ 1 A+δ 2 A+δ i,ρ = A+δ 1 A+δ i,ρ A+δ 2 A+δ i,ρ . (52) Steinwart, Sriperumbudur, and Thomann In addition, (49) ensures A+δ i A+δ i,ρ , and by continuing (52) we thus find A+δ 1 A+δ 2 A+δ i,ρ = A+δ i A+δ k A+δ i,ρ , k {1, 2} \ {i} . (53) Moreover, by part ii) of Lemma 26 we know that A+δ 1,ρ and A+δ 2,ρ are the two τ-connected components of M+δ ρ for any τ with δ < τ < τ (ε ). Thus, we have A+δ 1,ρ A+δ 2,ρ = , and since (49) ensures A+δ k A+δ k,ρ we conclude that A+δ k A+δ i,ρ = . Inserting the latter into (53) gives A+δ 1 A+δ 2 A+δ i,ρ = A+δ i . (54) Now, (50) follows from combining (51) with (54). ii). This directly follows from combining (47) with (Steinwart et al., 2021, Lemma 8.7). 
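The τ-connected components Cτ(A) that these proofs manipulate have a simple operational meaning for finite sets: two points lie in the same component if they can be linked by a chain of points of A with consecutive distances below τ. A minimal sketch for a finite point set (our own illustration, not the paper's code):

```python
def tau_components(points, tau):
    """Partition a finite point set into tau-connected components:
    x, y are in the same component iff a chain x = z1, ..., zm = y
    inside the set satisfies |z_i - z_{i+1}| < tau for all i."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    unvisited = set(range(len(points)))
    components = []
    while unvisited:
        stack = [unvisited.pop()]        # start a new component
        comp = set(stack)
        while stack:                     # grow it by tau-chains
            i = stack.pop()
            close = {j for j in unvisited
                     if dist(points[i], points[j]) < tau}
            unvisited -= close
            comp |= close
            stack.extend(close)
        components.append(sorted(points[j] for j in comp))
    return components
```

As τ grows past the separation distance between two components, they merge into one, which is exactly the mechanism Lemma 26 and Theorem 25 quantify for the sets M^{−δ}_ρ and Lρ.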
Proof of Theorem 9: i). We first note that ε ε := ρ ρ implies τ (ε ) τ (ε ). Consequently, part iii) of Lemma 26 applied for ρ := ρ gives the assertion. ii). We begin by showing that the CRMs ξρ+ε : Cτ(M δ ρ ) Cτ(M δ ρ+ε) and ξ : Cτ(M δ ρ+ε) Cτ(M δ ρout+ε) are bijective. To this end we consider the following commutative diagram of CRMs: Cτ(M δ ρ ) Cτ(M δ ρout+ε) Cτ(M δ ρ+ε) - @ @ @ @@ R Now, part iv) of (Steinwart, 2015b, Theorem A.6.2) shows that ξρout+ε is bijective, and consequently, ξρ+ε is injective. Moreover, part i) of (Steinwart, 2015a, Theorem 2.7) shows that 1 |Cτ(M δ ρ+ε)| 2, and since we already know that |Cτ(M δ ρ )| = 2 and that ξρ+ε is injective, we conclude that |Cτ(M δ ρ+ε)| = 2 and that ξρ+ε is bijective. Using the diagram we then see that the CRM ξ is also bijective. Our next goal is to show that the CRM bξρ : Cτ(M δ ρ+ε) b Cτ(Lρ) is well-defined and bijective. To this end, we first recall that our assumption ρ ρ 3ε together with (Steinwart, 2015a, Theorem 2.8) gives the following disjoint union: Cτ(Lρ) = bξρ Cτ(M δ ρ+ε) B Cτ(Lρ) : B Lρ+2ε = . Consequently, we have bξρ Cτ(M δ ρ+ε) = b Cτ(Lρ), that is, we can view bξρ as a surjective CRM bξρ : Cτ(M δ ρ+ε) b Cτ(Lρ). Similarly, part i) of Theorem 6 ensures ρout ρ + ε + 5ε ρ + 6ε ρ 3ε , Adaptive Clustering Using Kernel Density Estimators and repeating the reasoning above we see that the CRM bξρout : Cτ(M δ ρout+ε) Cτ(Lρout) can be viewed as a surjective CRM bξρout : Cτ(M δ ρout+ε) b Cτ(Lρout). Finally, consider the CRM ξ : Cτ(Lρ) Cτ(Lρout). For B b Cτ(Lρ) we then have = B Lρ+2ε ξ(B) Lρ+2ε ξ(B) Lρout+2ε , i.e. ξ(B) b Cτ(Lρout). Consequently, the restriction ξ| b Cτ(Lρ) : b Cτ(Lρ) b Cτ(Lρout) is welldefined, and obviously also a CRM. In summary, we obtain the following commutative diagram of CRMs Cτ(M δ ρout+ε) b Cτ(Lρout) Cτ(M δ ρ+ε) b Cτ(Lρ) ξ ξ| b Cτ(Lρ) Now, we have already seen that ξ is bijective, and in addition, part ii) of (Steinwart, 2015b, Theorem A.6.2) shows that bξρout is injective. 
Moreover, our considerations above showed that bξρout is surjective, and hence it is bijective. Using the diagram we conclude that bξρ : Cτ(M δ ρ+ε) b Cτ(Lρ) is injective. Since we have already seen that it is surjective, we conclude that it is indeed bijective. With the help of these preparations, the first assertion now easily follows from i) and the bijectivity of bξρ and ξρ+ε, namely | b Cτ(Lρ)| = bξρ ξρ+ε (Cτ(M δ ρ ) = Cτ(M δ ρ ) = 2 . To show the second assertion, we write Bρ i := bξρ ξρ+ε(Vi). This immediately gives Vi Bρ i for i = 1, 2. Moreover, using the diagram we find Bρ i ξ| b Cτ(Lρ)(Bρ i ) = ξ| b Cτ(Lρ) bξρ ξρ+ε(Vi) = bξρout ξ ξρ+ε(Vi) = Bi , where the latter identity follows from part iii) of (Steinwart, 2015b, Theorem A.6.2). iii). We first observe that ε := ρ ρ satisfies ε ε and by (Steinwart et al., 2021, Lemma 8.5) we hence find δ ψ (δ) < τ (ε ) τ (ε ). Lemma 28 then shows M δ i,ρ = M δ ρ A δ i,ρ and M+δ i,ρ = M+δ ρ A+δ i,ρ . By the definition of Li,ρ we thus have to show the following two inclusions M δ ρ+ε A δ i,ρ Lρ Bi , (55) Lρ Bi M+δ ρ ε A+δ i,ρ . (56) We begin by proving (55). To this end, we first observe that (3) ensures M δ ρ+ε Lρ and hence it suffices to establish A δ i,ρ Bi. Now, we have already observed that τ τ (ε ) Steinwart, Sriperumbudur, and Thomann τ (ε ), and consequently part iii) of Lemma 26 shows that A δ 1,ρ and A δ 2,ρ are the two τ- connected components of M δ ρ . Moreover, part i) of Theorem 6 shows ρout ρ +ε +5ε ρ ε, and hence we have ρ ε [ρout, ρ 3ε]. Applying (Steinwart, 2015a, Theorem 2.8) and the already established part ii) to the level ρ ε we then obtain A δ i,ρ Bρ ε i Bi for i {1, 2}. Let us now establish Li,ρ Bρ +2ε i . Without loss of generality we may assume i = 1. Now, consider the CRM ξ : Cτ(Lρ B1) Cτ(Lρ +2ε B1), which is possible since ρ ρ +2ε. Let us assume that there was a B Cτ(Lρ B1) with ξ(B ) Bρ +2ε 1 . 
Since Bρ +2ε 1 is a τ-connected component of Lρ +2ε B1 by part ii) applied to the level ρ +2ε [ρout, ρ 3ε] and ξ(B ) is another such τ-connected component we conclude that ξ(B ) Bρ +2ε 1 = . Moreover, our construction and part ii) give ξ(B ) Bρ +2ε 2 B1 Bρ +2ε 2 B1 B2 = , and therefore part ii) shows ξ(B ) b Cτ(Lρ + 2ε). Together with ρ ρ + 4ε the latter implies B Lρ B Lρ +4ε ξ(B ) Lρ +4ε = . Consequently, we have found a contradiction, and therefore we have ξ(B ) Bρ +2ε 1 for all τ-connected components of L1,ρ = Lρ B1. Since B ξ(B ) we have thus found L1,ρ Bρ +2ε 1 . Let us now show (56). To this end, we note that (3) ensures Lρ M+δ ρ ε, and hence it suffices to prove Lρ Bi A+δ i,ρ . Moreover, we have already shown that Lρ Bi Bρ +2ε i , and therefore, it suffices to establish Bρ +2ε i A+δ i,ρ . To this end, recall that we have already observed τ τ (ε ) τ (ε ). Part ii) of Lemma 26 thus shows that A+δ 1,ρ and A+δ 2,ρ are the two τ-connected components of M+δ ρ . Now consider the CRM ξ : Cτ(Lρ +2ε) Cτ(M+δ ρ ). Then the τ-connected component Bρ +2ε 1 of Lρ +2ε satisfies Bρ +2ε 1 ξ(Bρ +2ε 1 ), and therefore, exactly one of the following two conditions is satisfied Bρ +2ε 1 A+δ 1,ρ , (57) Bρ +2ε 1 A+δ 2,ρ . (58) However, our construction ensures V1 A+δ 1,ρ , and part ii) gives V1 Bρ +2ε 1 . This gives = V1 Bρ +2ε 1 A+δ 1,ρ , and therefore we can exclude (58). Consequently (57) is true. The inclusion Bρ +2ε 2 A+δ 2,ρ can be shown analogously. Adaptive Clustering Using Kernel Density Estimators 8.2 Proofs for Section 4 Lemma 29 Let K : Rd [0, ) be a symmetric kernel with tail function κ1( ). Moreover, let P be a λd-absolutely continuous distribution on Rd that is normal at some level ρ 0. 
Then for all x Rd and σ > 0 with B(x, σ) Mρ and all δ > 0 we have h P,δ(x) ρ ρκ1(σ while for all x Rd and σ > 0 with B(x, σ) X \ Mρ and all δ > 0 we have h P,δ(x) < ρ + δ dκ (σ Finally, if P has a bounded density h, then the inequality (59) can be replaced by h P,δ(x) ρ κ1(σ whenever 0 ρ h and (60) can be replaced, for all ρ 0, by h P,δ(x) < ρ + κ1(σ δ ) h . (62) Proof of Lemma 29: Let h be a λd-density of P. For the proof of (59), we first observe that λd(B(x, σ) \ {h ρ}) λd(Mρ \ {h ρ}) = 0, since P is normal at level ρ. Therefore, we obtain Z B(x,σ) Kδ(x y) h(y) dλd(y) = Z B(x,σ) {h ρ} Kδ(x y) h(y) dλd(y) B(x,σ) {h ρ} Kδ(x y) dλd(y) B(x,σ) Kδ(x y) dλd(y) , (63) and this leads to h P,δ(x) = Z Rd Kδ(x y) h(y) dλd(y) B(x,σ) Kδ(x y) dλd(y) + Z Rd\B(x,σ) Kδ(x y) h(y) dλd(y) B(x,σ) Kδ(x y) dλd(y) + ρ Z Rd\B(x,σ) Kδ(x y) dλd(y) Rd\B(x,σ) Kδ(x y) dλd(y) + Z Rd\B(x,σ) Kδ(x y) h(y) dλd(y) Rd\B(x,σ) Kδ(x y) dλd(y) , where in the last step we used (10). Now, the assertion follows from (11), and for a bounded density h and ρ h , the inequality (61) is a direct consequence of (59). To show (60) we first note that (1) yields λd B(x, σ) \ {h < ρ} λd (Rd \ Mρ) \ {h < ρ} = λd {h ρ} \ Mρ = 0 . Steinwart, Sriperumbudur, and Thomann Analogously to (63) we then obtain Z B(x,σ) Kδ(x y) h(y) dλd(y) = Z B(x,σ) {h<ρ} Kδ(x y) h(y) dλd(y) B(x,σ) {h<ρ} Kδ(x y) dλd(y) B(x,σ) Kδ(x y) dλd(y) , where for the strict inequality we used our assumption that K is strictly positive in a neighborhood of 0. Adapting the last estimate of the proof of (59) we then find h P,δ(x) < ρ Z B(x,σ) Kδ(x y) dλd(y) + Z Rd\B(x,σ) Kδ(x y) h(y) dλd(y) B(x,σ) Kδ(x y) dλd(y) + ρ Z Rd\B(x,σ) Kδ(x y) dλd(y) Rd\B(x,σ) Kδ(x y) dλd(y) + Z Rd\B(x,σ) Kδ(x y) h(y) dλd(y) Rd\B(x,σ) Kδ(x y) h(y) dλd(y) . Now, for bounded h inequality (62) follows from (11), while in the general case the estimate Z Rd\B(x,σ) Kδ(x y) h(y) dλd(y) sup y Rd\B(x,σ) Kδ(x y) = δ dκ (σ leads to (60). 
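Lemma 29 controls the smoothed density h_{P,δ} = K_δ ∗ h; its empirical counterpart in the subsequent proofs is the kernel density estimator h_{D,δ}(x) = n^{−1} Σ_i δ^{−d} K((x − x_i)/δ). A minimal one-dimensional sketch (the Gaussian kernel is our illustrative choice; the results cover general symmetric kernels):

```python
import math

def kde(x, data, delta):
    """Kernel density estimate h_{D,delta}(x) for 1-D data:
    (1/n) * sum_i delta^{-1} K((x - x_i)/delta) with a
    Gaussian kernel K (one admissible symmetric kernel)."""
    def K(u):
        return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    n = len(data)
    return sum(K((x - xi) / delta) for xi in data) / (n * delta)
```

Replacing K by, e.g., the Epanechnikov or moving-window kernel only changes the function K; the δ^{−d} rescaling is the same.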
Proof of Theorem 11: To prove the first inclusion, we fix an x M 2σ ρ+ε+ϵ This means x / (Rd \ Mρ+ε+ϵ)+2σ, i.e., for all x Rd \ Mρ+ε+ε we have x x > 2σ. In other words, for all x Rd with x x 2σ, we have x Mρ+ε+ϵ, i.e., we have found B(x, 2σ) Mρ+ε+ϵ . (64) Let us now suppose that there exists a sample xi D such that h D,δ(xi) < ρ and x xi σ. By h D,δ h P,δ < ε we then find h P,δ(xi) < ρ + ε . (65) On the other hand, x xi σ together with the already shown (64) implies B(xi, σ) Mρ+ε+ϵ by a simple application of the triangle inequality. Consequently, (59) together with ϵ ρκ1(σ δ ) gives h P,δ(xi) ρ + ε, which contradicts (65). For all samples xi D, we thus have h D,δ(xi) ρ or x xi > σ. Let us assume that we have x xi > σ for all xi D. Then we find h D,δ(x) = 1 i=1 δ d K x xi i=1 δ dκ (σ δ ) = δ dκ (σ δ ) ρ . (66) Adaptive Clustering Using Kernel Density Estimators On the other hand, we have B(x, σ) B(x, 2σ) Mρ+ε+ϵ and therefore (59) together with ϵ ρκ1(σ δ ) gives h P,δ(x) ρ+ε. By h D,δ h P,δ < ε we conclude that h D,δ(x) > ρ, which contradicts (66). Therefore there does exist a sample xi D with x xi σ. Using the inclusion (64) together with the triangle inequality we then again find B(xi, δ) Mρ+ε+ϵ, and hence (59) yields h P,δ(xi) ρ + ε. This leads to h D,δ(xi) ρ, and hence we finally obtain x {x D : h D,δ(x ) ρ}+σ = LD,ρ . Finally, if h has a bounded density, then we have Mρ = for ρ > h and therefore M 2σ ρ+ε+ϵ LD,ρ is trivially satisfied. Moreover, to show the assertion for ρ h , we simply need to replace (59) with (61) in the proof above. To prove the second inclusion, we pick an x LD,ρ. By the definition of LD,ρ, there then exists an xi D such that x xi σ and h D,δ(xi) ρ. The latter implies h P,δ(xi) > ρ ε. Our first goal is to show that Mρ ε ϵ B(xi, σ) = . To this end, let us assume the converse, that is B(xi, σ) Rd \ Mρ ε ϵ. By (60) and ϵ δ dκ (σ δ ) we then find h P,δ(xi) < ρ ε, which contradicts the earlier established h P,δ(xi) > ρ ε. 
Consequently, there exists an x Mρ ε ϵ B(xi, σ), which in turn leads to d(x, Mρ ε ϵ) x x x xi + xi x 2σ . This shows the desired x M+2σ ρ ε ϵ. Finally, to show the assertion for bounded densities, we simply need to replace (60) with (62) in the proof above. For the proof of Lemma 14 we need to recall the following classical result, which is a reformulation of (van der Vaart and Wellner, 1996, Theorem 2.6.4). Theorem 30 Let A be a set of subsets of Z that has finite VC-dimension V . Then the set of indicator functions G := {1A : A A} is a uniformly bounded VC-class for which we have B = 1 and the constants A and ν in (17) only depend on V . We also need the next result, which investigates the effect of scaling in the input space. Lemma 31 Let G be set of measurable functions g : Rd R such that there exists a constant B 0 with g B for all g G. For δ > 0, we define gδ : Rd R by gδ(x) := g(x/δ), x Rd Furthermore, we write Gδ := {gδ : g G}. Then, for all ϵ (0, B] and all δ > 0, we have sup P N(G, L2(P), ϵ) = sup P N(Gδ, L2(P), ϵ) , where the suprema are taken over all probability measures P on Rd. Proof of Lemma 31: Because of symmetry we only prove . Let us fix ϵ, δ > 0 and a distribution P on Rd. We define a new distribution P on Rd by P (A) := P(1 δA) for all measurable A Rd. Furthermore, let C be an ϵ-net of Gδ with respect to L2(P ). For C := C 1/δ, we then have |C| = |C |, and hence it suffices to show that C is an ϵ-net of G with respect to L2(P). To this end, we fix a g G. Then gδ Gδ, and hence there exists an Steinwart, Sriperumbudur, and Thomann h C with gδ h L2(P ) ϵ. Moreover, we have h := h 1/δ C, and since the definition of P ensures EP fδ = EP f for all measurable f : Rd [0, ), we obtain g h L2(P) = gδ hδ L2(P ) = gδ h L2(P ) ϵ, i.e. C is an ϵ-net of G with respect to L2(P). Proof of Lemma 14: The set A := {x + B : x Rd} of has finite VC-dimension by (Devroye and Lugosi, 2001, Corollary 4.2) or (Devroye and Lugosi, 2001, Lemma 4.1), respectively. 
In both cases, Theorem 30 thus shows that for G := K(x ) : x Rd there are constants A and ν only depending on the VC-dimension of A such that N G, L2(P), K ϵ = N K 1 G, L2(P), ϵ A for all ϵ (0, 1] and all distributions P on Rd. Moreover, observe that Gδ = K(x δ 1 ) : x Rd = K x : x Rd = δd Kδ . Consequently, Lemma 31 leads to sup P N(Kδ, L2(P), δ dϵ) = sup P N(δd Kδ, L2(P), ϵ) = sup P N(G, L2(P), ϵ) A K for all ϵ (0, K ]. A simple variable transformation then yields the assertion. Proof of Lemma 15: We first recall that if A E is a compact subset of some Banach space E and T : A F is a α-H older continuous map into another Banach space F then we have N T(A), F , |T|αϵα N A, E, ϵ , ϵ > 0 , where |T|α is the α-H older constant of T. We now fix a 0 < δ |K|α K 1/α diam (X) and a probability measure P on Rd. For x X we now consider the map kx,δ : Rd [0, ] defined by kx,δ(y) := Kδ(x y) = δ d K x y Since K is bounded and measurable, so is kx,δ, and hence we obtain a map T : X L (P) defined by T(x) := kx,δ. Moreover, T is α-H older continuous, since for x, x X, we have T(x) T(x ) = sup y Rd δ (α+d)|K|α x x α , i.e. we have shown |T|α δ (α+d)|K|α. By our initial observation and (19) we then conclude that N Kδ, L2(P), |T|αϵα = N T(X), L2(P), |T|αϵα N X, , ϵ C (X)ϵ d Adaptive Clustering Using Kernel Density Estimators for all ϵ (0, diam (X)]. A variable transformation together with our bound on |T|α thus yields N Kδ, L2(P), ϵ C (X) |T|α d/α C (X) |K|α for all 0 < ϵ δ (α+d)|K|α diam (X). Since the assumed δ |K|α K 1/α diam (X) implies δ d K diam (X) αδ (α+d)|K|α we then see that (20) does hold for all 0 < ϵ δ d K . For the proof of Theorem 16 we quote a version of Talagrand s inequality due to Bousquet (2002) from (Steinwart and Christmann, 2008, Theorem 7.5). Theorem 32 Let (Z, P) be a probability space and G be a set of measurable functions from Z to R. Furthermore, let B 0 and σ 0 be constants such that EP g = 0, EP g2 σ2, and g B for all g G. 
For n 1, define G : Zn R by G(z) := sup g G , z = (z1, . . . , zn) Zn. Then, for all ς > 0, we have P n n z Zn : G(z) 4EP n G + For the proof of Theorem 16 we also need (Gin e and Guillou, 2001, Proposition 2.1), which bounds the expected suprema of empirical processes indexed by uniformly bounded VC-classes. The following theorem provides a slightly simplified version of that proposition. Theorem 33 Let (Z, P) be a probability space and G be a uniformly bounded VC-class on Z with constants A, B, and ν. Furthermore, let σ > 0 be a constant with σ B and EP g2 σ2 for all g G. Then there exists a universal constant C such that G defined in Theorem 32 satisfies We are now able to establish the following generalization of Theorem 16. Proposition 34 Let X Rd and P be a probability measure on X with Lebesgue density h L1(Rd) Lp(Rd) for some p (1, ]. Moreover, let 1 p = 1 and q := 1 2p = 1 2p and K be a symmetric kernel. Suppose further that the set Kδ defined in (18) satisfies (21) for all δ (0, δ0], where δ0 (0, 1]. Then, there exists a positive constant C only depending on d, p, and K such that, for all n 1, all δ (0, δ0] satisfying δ h p p 4p K , and all ς 1 we have P n D : h D,δ h P,δ ℓ (X) < Cς δa+dq h 1/2 p + C h pς δd(1+1/p)n log C δa+dq h 1/2 p Steinwart, Sriperumbudur, and Thomann Proof of Proposition 34: We define θ := 1 2p. Then K L1(Rd) L (Rd) leads to K 2p K 1 θ 1 K θ = K θ . We further define kx,δ := δ d K δ 1(x ) and fx,δ := kx,δ EP kx,δ. Then we have EP fx,δ = 0 and fx,δ 2 K δ d for all x X and δ > 0. Moreover, we have EP f2 x,δ EP k2 x,δ and thus EP f2 x,δ = δ 2d Z h(y) dλd(y) δ 2d h p δ d(1+1/p) h p K 2θ =: σ2 δ for all x X and δ > 0. In addition, for all D Xn we have i=1 fx,δ(xi) = h D,δ(x) h P,δ(x) . Applying Theorem 32 to G := {fx,δ : x X}, we thus see, for all δ > 0, ς > 0, and n 1, that h D,δ h P,δ ℓ (X) < 4ED P n h D ,δ h P,δ ℓ (X) + 2ς 2ς h p K 2θ nδd(1+1/p) (67) holds with probability P n not smaller than 1 e ς. 
It thus remains to bound the term ED P n h D ,δ h P,δ ℓ (X) = ED P n sup x X To this end, we first note that |EP kx,δ| kx,δ = δ d K =: Bδ. Consequently, we have Fδ := {fx,δ : x X} kx,δ b : kx,δ Kδ, |b| Bδ , and since N([ Bδ, Bδ], | |, ϵ) 2Bδϵ 1 we conclude that for A := max{1, A0} we have sup Q N Fδ, L2(Q), ϵ 2 A0 K δ (d+a) ϵ 2 A K δ (d+a) for all δ (0, δ0], ϵ (0, Bδ], where the supremum runs over all distributions Q on X. Now, our very first estimates showed fx,δ 2Bδ and EP f2 x,δ σ2 δ, and since σδ 2Bδ is equivalent to δ 4p K p (2 2θ) h p p = 4p K h p p , Theorem 33 together with 2θ = 1 + 1/p thus yields ED P n sup x X nδd log 2 A K (ν+1) h p K 1+1/p 2δd(1+1/p)n log 2 A K Adaptive Clustering Using Kernel Density Estimators for such δ. Moreover, we have σδδd+a = log 2 A K δ d(1+1/p)/2 h 1/2 p K θ δd+a = log 2 A K q δa+dq h 1/2 p , and hence the previous estimate can be simplified to ED P n sup x X nδd log 2 A K q δa+dq h 1/2 p + v u u tν h p K 1+1/p δd(1+1/p)n log 2 A K q δa+dq h 1/2 p Combining this with (67) gives h D,δ h P,δ ℓ (X) < 4ED P n h D ,δ h P,δ ℓ (X) + 2ς 2ς h p K 1+1/p nδd(1+1/p) nδd log 2 A K q δa+dq h 1/2 p + v u u tν h p K 1+1/p δd(1+1/p)n log 2 A K q δa+dq h 1/2 p 2ς h p K 1+1/p nδd(1+1/p) Cς nδd log C δa+dq h 1/2 p + v u u t C h pς δd(1+1/p)n log C δa+dq h 1/2 p with probability P n not smaller than 1 e ς. Proof of Theorem 16: By Proposition 34, it suffices to find a constant C such that Cς nδd log C δa+dq h 1/2 p + C h pς δd(1+1/p)n log C δa+dq h 1/2 p C r h p |log δ| ς nδd(1+1/p) . (68) To this end, we first observe that δa+dq h 1/2 p C 1 implies Cδ a dq h 1/2 p δ 2a 2dq, and thus we obtain log C δa+dq h 1/2 p (2a + 2dq) log δ 1. For C := (2a + 2dq)C we therefore Cς nδd log C δa+dq h 1/2 p + C h pς δd(1+1/p)n log C δa+dq h 1/2 p C ς nδd log δ 1 + C h pς nδd(1+1/p) log δ 1 . 
Moreover, it is easy to check that the assumption |log δ| C ς ensures that nδd log δ 1 C h pς nδd(1+1/p) log δ 1 , and from the latter we conclude that (68) holds for C := 2 C . The assertion now follows for the constant C := max{C, C , C }. Steinwart, Sriperumbudur, and Thomann 8.3 Proofs for the KDE-Based Clustering in Section 5 Proof of Theorem 17: Let us fix a D Xn with h D,δ h P,δ < ε/2. By (23) we see that the probability P n of such a D is not smaller than 1 e ς. We define ϵ := h κ1(σ δ ). In the case of supp K B this leads to ϵ = 0 δ dκ (σ δ ) = 0 ρ0 as noted after Theorem 11. Furthermore, in the case of (9), (Steinwart et al., 2021, Lemma 4.2) shows δ ) h κ1(| log δ|2) cd2 vold e | log δ|2| log δ|2d 2 cd2 vold δ| log δ| d ε/2, where in the third to last step we used 0 < δ 1 and in the second to last step we used (Steinwart et al., 2021, Lemma 8.17). In addition, we have δ dκ (σ δ ) cδ de | log δ|2 = cδ| log δ| d ε ρ0. Consequently, Theorem 11 shows, for all ρ ρ0, that M 2σ ρ+ε LD,ρ M+2σ ρ ε . (69) i). The assertion follows from Theorem 7 applied in the case d := 2σ. Indeed, we have just seen that (40) holds for all ρ ρ0, if we replace δ by δ, and our assumptions guarantee δ (0, δthick], ρ0 ρ , and τ > ψ( δ) = 3cthick δγ > 2cthick δγ. Moreover, (27) follows from (69). ii). We check that the assumptions of Theorem 6 are satisfied for δ := 2σ, if ε (ρ ρ )/9. Clearly, we have δ (0, δthick], ε (0, ε ], and ψ( δ) < τ. To show τ τ (ε ) we write E := {ε (0, ρ ρ ] : τ (ε ) τ}. Since we assumed ε < , we obtain E = by the definition of ε . There thus exists an ε E with ε inf E + ε ε . Using the monotonicity of τ established in (Steinwart, 2015a, Theorem A.4.2) we then conclude that τ τ (ε ) τ (ε ), and hence all assumptions of (Steinwart, 2015a, Theorem 2.9) are indeed satisfied with δ replaced by δ. The assertions now immediately follow from this theorem. 
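Theorem 17 combines two steps: threshold the kernel density estimator at a level ρ and split the surviving sample points into τ-connected components. A self-contained one-dimensional sketch of this pipeline (our own illustration with a Gaussian kernel; the σ-inflation of the estimated level set and the paper's adaptive choice of δ and ρ are omitted):

```python
import math

def level_set_clusters(data, delta, rho, tau):
    """Keep the sample points x_i with h_{D,delta}(x_i) >= rho and
    split them into tau-connected components (1-D data only)."""
    def K(u):
        return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    def h(x):   # kernel density estimate at x
        return sum(K((x - y) / delta) for y in data) / (len(data) * delta)
    kept = sorted(x for x in data if h(x) >= rho)
    clusters = []
    for x in kept:  # in 1-D, consecutive gaps < tau stay in one component
        if clusters and x - clusters[-1][-1] < tau:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    return clusters
```

The full algorithm analyzed above additionally inflates the retained level set by σ-balls, as in the definition of L_{D,ρ} used in Theorem 11, and scans over levels ρ to recover the cluster tree splits.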
Proof of Corollary 18: Using Theorem 17, the proof of ii) is a literal copy of the proof of (Steinwart, 2015a, Theorem 4.1), and the proof of i) is an easy adaptation of this proof. Proof of Corollary 20: Using Theorem 17, the proof is a simple combination and adaptation of the proofs of (Steinwart, 2015a, Theorem 4.3) and (Steinwart, 2015a, Corollary 4.4). Proof of Corollary 23: Using Theorem 17, the proof is a simple combination and adaptation of the proofs of (Steinwart, 2015a, Theorem 4.7) and (Steinwart, 2015a, Corollary 4.8). Proof of Theorem 24: The definition of εδ,n in (33) together with 4C2 u log log n C h ensures that (25) and (26) are satisfied for all δ. In addition, the assumptions of Theorem 24 ensure that the remaining conditions of Theorem 17 are satisfied. Now the assertion follows by some union bound arguments, which are analogous to those of the proof of (Steinwart, 2015a, Theorem 5.1).

Acknowledgment

The authors profusely thank the action editor and the reviewers for their comments and suggestions that significantly improved the readability of the paper. The work of I. Steinwart and P. Thomann was funded by DFG Grant STE 1074/2-1. I. Steinwart was also funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy - EXC 2075 - 390740016. B. K. Sriperumbudur thanks the National Science Foundation for support under grant NSF-DMS-1713011.

References

O. Bousquet. Concentration inequalities and empirical processes theory applied to the analysis of learning algorithms. PhD thesis, École Polytechnique, 2002.

K. Chaudhuri and S. Dasgupta. Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems 23, pages 343-351, 2010.

K. Chaudhuri, S. Dasgupta, S. Kpotufe, and U. von Luxburg. Consistent procedures for cluster tree estimation and pruning. IEEE Trans. Inf. Theory, 60:7900-7912, 2014.

A. Cuevas and R. Fraiman.
A plug-in approach to support estimation. Ann. Statist., 25:2300-2312, 1997.

L. Devroye and G. Lugosi. Combinatorial Methods in Density Estimation. Springer, New York, 2001.

P. Fränti and O. Virmajoki. Iterative shrinking method for clustering problems. Pattern Recognition, 39:761-765, 2006.

E. Giné and A. Guillou. On consistency of kernel density estimators for randomly censored data: Rates holding uniformly over adaptive intervals. Ann. Inst. H. Poincaré Probab. Statist., 37:503-522, 2001.

E. Giné and A. Guillou. Rates of strong uniform consistency for multivariate kernel density estimators. Ann. Inst. H. Poincaré Probab. Statist., 38:907-921, 2002.

J. A. Hartigan. Clustering Algorithms. John Wiley & Sons, New York, 1975.

J. A. Hartigan. Consistency of single linkage for high-density clusters. J. Amer. Statist. Assoc., 76:388-394, 1981.

S. Kpotufe and U. von Luxburg. Pruning nearest neighbor cluster trees. In Proceedings of the 28th International Conference on Machine Learning, pages 225-232, 2011.

M. Maier, M. Hein, and U. von Luxburg. Optimal construction of k-nearest neighbor graphs for identifying noisy clusters. Theoret. Comput. Sci., 410:1749-1764, 2009.

W. Polonik. Measuring mass concentrations and estimating density contour clusters: an excess mass approach. Ann. Statist., 23:855-881, 1995.

P. Rigollet. Generalization error bounds in semi-supervised classification under the cluster assumption. J. Mach. Learn. Res., 8:1369-1392, 2007.

A. Rinaldo and L. Wasserman. Generalized density clustering. Ann. Statist., 38:2678-2722, 2010.

B. K. Sriperumbudur and I. Steinwart. Consistency and rates for clustering with DBSCAN. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pages 1090-1098, 2012.

I. Steinwart. Adaptive density level set clustering. In Proceedings of the 24th Conference on Learning Theory, pages 703-738, 2011.

I. Steinwart.
Fully adaptive density-based clustering. Ann. Statist., 43:2132-2167, 2015a.

I. Steinwart. Supplement A to "Fully adaptive density-based clustering". Ann. Statist., 43, 2015b.

I. Steinwart. Supplement B to "Fully adaptive density-based clustering". Ann. Statist., 43, 2015c.

I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008.

I. Steinwart, B. K. Sriperumbudur, and P. Thomann. Adaptive clustering using kernel density estimators. Technical report, 2021. http://arxiv.org/abs/1708.05254v3.

W. Stuetzle. Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. J. Classification, 20:25-47, 2003.

W. Stuetzle and R. Nugent. A generalized single linkage method for estimating the cluster tree of a density. J. Comput. Graph. Statist., 19:397-418, 2010.

A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, 1996.

D. Wang, X. Lu, and A. Rinaldo. DBSCAN: Optimal rates for density-based cluster estimation. J. Mach. Learn. Res., 20:1-50, 2019.