Journal of Machine Learning Research 24 (2023) 1-56. Submitted 12/21; Revised 8/23; Published 8/23

Adaptive Clustering Using Kernel Density Estimators

Ingo Steinwart (ingo.steinwart@mathematik.uni-stuttgart.de), University of Stuttgart, Department of Mathematics, D-70569 Stuttgart, Germany
Bharath K. Sriperumbudur (bks18@psu.edu), Pennsylvania State University, Department of Statistics, University Park, PA 16802, USA
Philipp Thomann (philipp.thomann@d-one.ai), D ONE Solutions AG, Sihlfeldstrasse 58, 8003 Zürich, Switzerland

Editor: Sivan Sabato

Abstract. We derive and analyze a generic, recursive algorithm for estimating all splits in a finite cluster tree as well as the corresponding clusters. We further investigate statistical properties of this generic clustering algorithm when it receives level set estimates from a kernel density estimator. In particular, we derive finite sample guarantees, consistency, rates of convergence, and an adaptive data-driven strategy for choosing the kernel bandwidth. For these results we do not need continuity assumptions on the density such as Hölder continuity, but only require intuitive geometric assumptions of non-parametric nature. In addition, we compare our results to other guarantees found in the literature and also present some experiments comparing our algorithm to k-means and hierarchical clustering.

Keywords: cluster analysis, kernel density estimation, consistency, rates, adaptivity

1. Introduction

A widely acknowledged problem in cluster analysis is the definition of a learning goal that describes a conceptually and mathematically convincing definition of clusters. One such definition, which goes back to Hartigan (1975) and is known as density-based clustering, assumes i.i.d. data D = (x_1, . . . , x_n) generated by some unknown distribution P on X ⊂ R^d.
Given some level ρ ≥ 0, the clusters of P are then defined to be the connected components of the level set {h ≥ ρ} := {x ∈ X : h(x) ≥ ρ}, where h is the density of P with respect to the Lebesgue measure. This single level approach has been studied, for example, by Hartigan (1975); Cuevas and Fraiman (1997); Rigollet (2007); Maier et al. (2009); Rinaldo and Wasserman (2010). However, one of the conceptual drawbacks of the single level approach is that different values of ρ may lead to different (numbers of) clusters, see Figure 1, and in addition, there is no general rule for choosing ρ. To address this conceptual shortcoming, one often considers the so-called cluster tree approach instead, which tries to consider all levels and the corresponding connected components simultaneously, see the left picture in Figure 2 for an illustration. There exists a variety of articles investigating properties of the cluster tree approach, see e.g. (Hartigan, 1975; Stuetzle, 2003; Chaudhuri and Dasgupta, 2010; Stuetzle and Nugent, 2010; Kpotufe and von Luxburg, 2011; Chaudhuri et al., 2014; Wang et al., 2019) for details.

© 2023 Ingo Steinwart, Bharath K. Sriperumbudur, and Philipp Thomann. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/21-1532.html.

Figure 1: The single level approach applied to a probability density h on [0, 1]. Left. The level set {h ≥ ρ_1}, displayed by the orange horizontal line, consists of one connected component, which results in one cluster for ρ_1. For illustrational purposes, the part of the graph of h that belongs to this level set is also colored in orange. Right. For a slightly larger ρ_2, the level set {h ≥ ρ_2} consists of two connected components, which are indicated by the green and purple horizontal lines. As a result, we obtain two clusters for this level ρ_2.
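To make the single level approach concrete, the following small sketch (our own illustration, not code from the paper) computes the connected components of {h ≥ ρ} for a bimodal density on a one-dimensional grid and reproduces the effect shown in Figure 1: one component at a low level, two at a higher one. The particular density and levels are hypothetical choices.

```python
import numpy as np

def level_set_components(xs, h, rho):
    """Connected components (maximal runs of grid points) of {h >= rho} on a 1-d grid."""
    mask = h >= rho
    comps, current = [], []
    for x, m in zip(xs, mask):
        if m:
            current.append(x)
        elif current:
            comps.append((current[0], current[-1]))
            current = []
    if current:
        comps.append((current[0], current[-1]))
    return comps

xs = np.linspace(0.0, 1.0, 2001)
# bimodal density shape on [0, 1]; normalization is irrelevant for counting components
h = np.exp(-((xs - 0.3) / 0.15) ** 2) + 0.9 * np.exp(-((xs - 0.7) / 0.15) ** 2)

print(len(level_set_components(xs, h, 0.2)))  # low level rho_1: one component
print(len(level_set_components(xs, h, 0.6)))  # higher level rho_2: two components
```

Raising ρ past the valley between the two modes is exactly what splits one cluster into two, which is why the choice of a single ρ is so delicate.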
For example, Chaudhuri and Dasgupta (2010) show, under some assumptions on h, that a modified single linkage algorithm recovers this tree in the sense of Hartigan (1981), and Kpotufe and von Luxburg (2011); Chaudhuri et al. (2014) obtain similar results for an underlying k-NN density estimator. In addition, Kpotufe and von Luxburg (2011); Chaudhuri et al. (2014) propose a simple pruning strategy that removes connected components that artificially occur because of finite sample variability. However, the notion of recovery taken from Hartigan (1981) only focuses on the correct estimation of the cluster tree structure and not on the estimation of the clusters themselves, cf. the discussion by Steinwart (2011). Finally, the most recent paper (Wang et al., 2019) establishes guarantees including rates of convergence for each fixed level set, provided that a kernel density estimator is used to produce level set estimates and the density has a certain smoothness such as α-Hölder continuity.

A third approach, taken by Steinwart (2011); Sriperumbudur and Steinwart (2012); Steinwart (2015a), tries to estimate both the first split level ρ* in the cluster tree and the corresponding clusters, see the right picture in Figure 2 for an illustration. As in the previously discussed papers, finite sample bounds are derived, which in (Steinwart, 2015a) are extended to learning rates. For example, these learning rates for estimating ρ* are of the probabilistic form

P^n(D ∈ X^n : 0 ≤ ρ*_D − ρ* ≤ K a_n) → 1,

where ρ*_D is the sample-based estimate constructed by the considered algorithm, (a_n) is a null sequence, and K is a constant depending on some assumptions on P. In other words, with high probability, the algorithm estimates ρ* up to precision K a_n. The learning rates for estimating the resulting clusters are of similar probabilistic nature. Moreover, Steinwart (2015a) shows that these learning rates can also be obtained by an adaptive, fully data-driven hyper-parameter selection strategy. Unfortunately, however, Steinwart (2011, 2015a) only considers the simplest possible density estimator, namely a histogram approach, and Sriperumbudur and Steinwart (2012) restrict their considerations to compactly supported moving window density estimates for α-Hölder-continuous densities. In addition, the method by Sriperumbudur and Steinwart (2012) requires the user to know α, and hence it is not data-driven. Finally, all three papers completely ignore the behavior of the considered algorithm for single cluster distributions, i.e. for distributions that do not have a split in the cluster tree, and for multi cluster distributions, i.e. for distributions that have more than one split in the cluster tree.

Figure 2: A density h on [0, 1] and different density-based clustering approaches. Left. The cluster tree approach considers various, or ideally even all, levels (in grey) simultaneously, making it superfluous to choose one level in advance. However, for a given algorithm the difficulty of detecting connected components may change with the considered level, since, for example, the distances of related connected components, i.e. clusters having the same color, change with the level, as can be seen for the second and third level from below. Right. The split tree approach tries to estimate both the levels at which a split in the infinite, ideal cluster tree of h occurs and the resulting clusters at these levels. For the depicted h this leads to the problem of estimating the split levels ρ*_1, ρ*_2 (drawn in yellow) as well as the resulting clusters for ρ*_1 (drawn in green and purple) and for ρ*_2 (drawn in blue and dark pink). Our previous work, see (Steinwart, 2011; Sriperumbudur and Steinwart, 2012; Steinwart, 2015a), only considered the first split ρ*_1 and its clusters.
As a consequence, it remained unclear whether and how a suitably modified version of this algorithm can be used to estimate the split tree, i.e. the combination of all levels at which a split in the cluster tree occurs together with the resulting clusters at these splits. We refer to Figure 2 for an illustration of such a split tree. The goal of this paper is to address the discussed issues of Steinwart (2011); Sriperumbudur and Steinwart (2012); Steinwart (2015a). To be more precise, compared to these articles, we establish the following new results:

i) For single cluster distributions, we propose a new set of regularity assumptions for levels ρ at which the level set {h ≥ ρ} is small. For example, for bounded densities, these assumptions roughly speaking guarantee that the level sets do not frazzle for levels ρ close to the maximum ∥h∥_∞ of the density h. Such assumptions were missing in (Steinwart, 2011; Sriperumbudur and Steinwart, 2012; Steinwart, 2015a).

Figure 3: Guarantees provided in previous papers such as Steinwart (2015a) and in this paper. Left. Steinwart (2015a) etc. only provide guarantees for estimating the first split level ρ*_1 and the corresponding clusters. In other words, only the light blue area below the algorithm's split level estimate ρ_{1,out} is covered. Right. In this paper we provide guarantees for estimating the entire split tree. To be more specific, Part ii) of Theorem 9 ensures that the generic cluster algorithm works correctly in the green area between ρ_{1,out} and a distribution dependent value ρ**_1. On the left hand side, in the yellow area, Theorem 7 then guarantees that the cluster algorithm correctly recognizes that there is no further split in the split tree. In contrast, on the right hand side, Part iii) of Theorem 9 makes it possible to reuse the guarantees of Steinwart (2015a) after some minor technical modifications. As a consequence, it can be ensured in the gray-blue area on the right hand side that the algorithm outputs an estimate ρ_{2,out} of the second split level ρ*_2 together with estimates for the corresponding clusters. Then, in the green area on top of this, Part ii) of Theorem 9 again guarantees that the cluster algorithm works correctly up to some distribution dependent ρ**_2. Finally, Theorem 7 ensures in the two yellow areas at the very top that the cluster algorithm correctly recognizes that there is no further split in the cluster tree.

ii) We present a simple modification of the output behavior of the generic cluster algorithm of Steinwart (2015a) to deal with distributions that do not have a split in the cluster tree. Based on our new regularity assumptions in i) and the ones from Steinwart (2015a), we then show that this new cluster algorithm is able to: a) provide an estimate ρ_out of the first split level ρ* if there is one; b) correctly detect distributions for which there is no such split level; c) construct estimates B_i of the clusters A*_i occurring at the first split level. Note that from a technical side, both a) with finite sample bounds on |ρ_out − ρ*| and c) with finite sample bounds on λ^d(B_i △ A*_i) directly follow from Steinwart (2015a), since our modification of the generic cluster algorithm scans through candidate levels ρ in exactly the same way as the original algorithm of Steinwart (2015a) does. Therefore, the surprising, and compared to (Steinwart, 2015a) new, part of our finite sample guarantees is the fact that this scanning procedure does not need to be changed for correctly detecting single cluster distributions in b). Note that a highly beneficial side-effect of this fact is that our analysis in b), as well as in iii) and iv) below, can rely on the extensive set of tools developed by Steinwart (2015a).

iii) We then show how the results of ii) can be used to estimate the entire split tree by recursively applying the new generic cluster algorithm.
While from a higher perspective this result does not seem too surprising, it turns out that there are still a couple of serious technical difficulties involved. In a nutshell, these difficulties relate to the fact that the generic algorithm may return an estimate ρ_out for ρ* for which the connected components of {h ≥ ρ_out} are not yet sufficiently far apart from each other. While such an estimate ρ_out for ρ* is desirable, it also prohibits a direct recursive application of the results of ii). To address this issue, we analyze the behavior of the generic cluster algorithm above the returned level ρ_out. In this analysis, which also goes beyond (Steinwart, 2011; Sriperumbudur and Steinwart, 2012; Steinwart, 2015a), it turns out that the algorithm behaves correctly until it reaches a level ρ** for which the connected components of {h ≥ ρ**} are sufficiently far apart from each other. Above this level ρ**, we further show that the results of ii) can then be recursively applied, leading to guarantees for the entire split tree. We refer to Figure 3 for a detailed description of how the different guarantees can be combined and how these guarantees differ from our previous work (Steinwart, 2011; Sriperumbudur and Steinwart, 2012; Steinwart, 2015a). Besides these improvements for the generic cluster algorithm, we additionally present the following results:

iv) We show that the new generic cluster algorithm does not only work with an underlying histogram density estimator (HDE) as in (Steinwart, 2011, 2015a), but also with a variety of kernel density estimators (KDEs). Here it turns out that the results of Steinwart (2015a), including those for the adaptive, fully data-driven hyper-parameter selection strategy, remain valid for the resulting new clustering algorithm, provided that the kernel has bounded support.
Moreover, if the kernel has an exponential tail behavior, then the results remain true modulo an extra logarithmic term, while in the case of even heavier tails, we show that the rates become worse by a polynomial factor. Note that compared to Steinwart (2015a), all the results for KDEs are new. Moreover, the results for KDEs substantially extend the results of Sriperumbudur and Steinwart (2012), since there a) only moving window kernels were treated, and b) only α-Hölder continuous densities with known α were considered. In contrast, our new results do not even require continuous densities, and for this reason, we also obtain significantly more general results than the currently best results for KDE-based clustering achieved by Wang et al. (2019). The latter improvement is partially made possible because we can rely on the tools of Steinwart (2015a). However, compared to the HDEs in (Steinwart, 2015a), considering KDEs still requires significant technical efforts such as finite sample bounds for the ∥·∥_∞-distance between a KDE and its population version.

v) We discuss in some detail the differences between our results and those of Chaudhuri et al. (2014) and Wang et al. (2019), which in some sense are the articles closest to ours. Here it turns out that, depending on the set of assumptions, sometimes the learning rates for estimating the split levels obtained by Wang et al. (2019) are better and sometimes the ones obtained by us are better. However, adaptivity, for example, is not achieved by Wang et al. (2019). The latter is also true for the paper by Chaudhuri et al. (2014), but their rates are worse than ours, and in addition, their set of assumptions is more restrictive.

vi) We present some experiments comparing our algorithm, run with either the moving window kernel or the Epanechnikov kernel, to both k-means and hierarchical clustering.
Here, it turns out that our algorithm outperforms the latter two as soon as the sample size is sufficiently large. In addition, our algorithm is less sensitive to clusters having different spatial scales.

This paper is organized as follows: In Section 2 we recall the key concepts of Steinwart (2015a). In Section 3 we first introduce the new regularity assumptions mentioned in i). We then introduce and analyze the new generic cluster algorithm as described in ii). Moreover, the recursive approach described in iii) is analyzed in detail. Section 4 then presents the key guarantees for level sets generated by KDEs, and Section 5 contains the material mentioned in iv), namely finite sample bounds as well as consistency results, rates of convergence, and an adaptive data-driven strategy for choosing the kernel bandwidth. In Section 6 we present the comparison to the articles by Chaudhuri et al. (2014) and Wang et al. (2019), and Section 7 contains the experiments. All proofs can be found in Section 8.

2. Preliminaries

In this section we recall the setup for defining density-based clusters in a general context from Steinwart (2015a). To this end, let ∥·∥ be a norm on R^d. Then we denote the closed unit ball of this norm by B_∥·∥ and write B_∥·∥(x, δ) := x + δB_∥·∥. If the norm is known from the context, we usually write B(x, δ) instead. Moreover, the Euclidean norm on R^d is denoted by ∥·∥_2, and for the Lebesgue volume of its unit ball we write vol_d. Finally, ∥·∥_∞ denotes the supremum norm for functions. Let us now assume that we have some A ⊂ X ⊂ R^d as well as some norm ∥·∥ on R^d. Then, for δ ≥ 0, we define the δ-tube and δ-trim of A in X by

A^{+δ} := A^{+δ}_X := {x ∈ X : d(x, A) ≤ δ}  and  A^{−δ} := A^{−δ}_X := X \ (X \ A)^{+δ},

where d(x, A) := inf_{x′ ∈ A} ∥x − x′∥. We refer to Figure 4 for an illustration of these concepts as well as to some possible topological changes when going from A to either A^{+δ} or A^{−δ}. We further write int A for the interior of A and cl A for the closure of A. Moreover, ∂A := cl A \ int A denotes the boundary of A.
Obviously, we have A^{+0} = A, and hence also A^{−0} = A. Furthermore, since x ↦ d(x, A) is continuous, A^{+δ} is always closed in X and A^{−δ} is always open in X. Note that if A is bounded, we always find compact and convex X ⊂ R^d with A^{+δ}_X = A^{+δ}_{R^d} and A^{−δ}_X = A^{−δ}_{R^d}. Based on this observation and the fact that we usually consider δ ∈ (0, 1] in combination with some suitably chosen X, see Assumption P below for details, we often ignore the surrounding set X. In addition to these notations, we denote the inradius and diameter of a bounded A ⊂ R^d by inrad A and diam A, respectively, that is

inrad A := sup{r > 0 : ∃x ∈ A with B(x, r) ⊂ A},
diam A := sup{∥x − x′∥ : x, x′ ∈ A}.

Some interesting properties of these quantities can be found in (Steinwart et al., 2021, Lemma 8.1 and Lemma 8.2). Throughout this work, 1_A denotes the indicator function of a set A, and A △ B the symmetric difference of two sets A and B.

Figure 4: Left. A set A in blue together with a δ-tube A^{+δ} of it in green and a δ-trim A^{−δ} of it in orange, where in both cases we considered the supremum norm in R^2. All three sets are connected. Middle. A set A in blue that consists of two connected components. This time, however, its δ-tube A^{+δ} is connected. Right. A connected set A in blue, for which its δ-trim A^{−δ} has two connected components. In summary, we see that the connected component structure of A does not necessarily relate to those of A^{+δ} and A^{−δ}.

Let us now assume that P is a probability measure on a closed X ⊂ R^d that is absolutely continuous with respect to the Lebesgue measure λ^d. Then P has a λ^d-density h, and one could define the clusters of P to be the connected components of the level set {h ≥ ρ}, where ρ ≥ 0 is some user-defined threshold. Unfortunately, however, this notion leads to serious issues if there is no canonical choice of h such as a continuous version, see the illustrations in (Steinwart, 2015a, Section 2.1).
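The δ-tube and δ-trim operations above are easy to experiment with on a finite grid. The following sketch (our own toy setup, not from the paper) uses the supremum norm on a 10×10 integer grid standing in for X, and checks A^{+0} = A, the sandwich A^{−δ} ⊂ A ⊂ A^{+δ}, and that a thin set can have an empty δ-trim, in line with Lemma 4 below.

```python
from itertools import product

def d_inf(x, A):
    """Distance from point x to a non-empty finite set A in the supremum norm."""
    return min(max(abs(xi - ai) for xi, ai in zip(x, a)) for a in A)

def tube(A, X, delta):
    """delta-tube A^{+delta} within the ambient grid X."""
    return {x for x in X if d_inf(x, A) <= delta}

def trim(A, X, delta):
    """delta-trim A^{-delta} = X \\ (X \\ A)^{+delta}."""
    return set(X) - tube(set(X) - set(A), X, delta)

X = set(product(range(10), range(10)))
# two separated 2x2 blobs, loosely mirroring the middle panel of Figure 4
A = {(x, y) for (x, y) in X if (2 <= x <= 3 or 6 <= x <= 7) and 4 <= y <= 5}

assert tube(A, X, 0) == A                   # A^{+0} = A
assert trim(A, X, 1) <= A <= tube(A, X, 1)  # A^{-delta} subset A subset A^{+delta}
print(len(trim(A, X, 1)), len(A), len(tube(A, X, 1)))
```

Because each blob is only two cells thick, its 1-trim is already empty, while the 1-tube grows each blob by one cell in every direction.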
To address this issue, Steinwart (2015a) considered, for ρ ≥ 0, the measures

µ_ρ(A) := λ^d(A ∩ {h ≥ ρ}),  A ∈ B(R^d).

Since µ_ρ is independent of the choice of h := dP/dλ^d, the set

M_ρ := supp µ_ρ,

where supp µ_ρ denotes the support of the measure µ_ρ, is independent of this choice, too. For any λ^d-density h of P, the definition immediately gives

λ^d({h ≥ ρ} \ M_ρ) = λ^d({h ≥ ρ} ∩ (R^d \ M_ρ)) = µ_ρ(R^d \ M_ρ) = 0,  (1)

i.e. modulo λ^d-zero sets, the level sets {h ≥ ρ} are not larger than M_ρ. In fact, M_ρ turns out to be the smallest closed set satisfying (1), and it is shown in (Steinwart, 2015b, Lemma A.1.2) that M_ρ is wedged between the interior and the closure of {h ≥ ρ}, that is, int{h ≥ ρ} ⊂ M_ρ ⊂ cl{h ≥ ρ}, and consequently M_ρ \ {h ≥ ρ} ⊂ ∂{h ≥ ρ}.

Figure 5: Examples of injective, surjective, and bijective CRMs, which also illustrate partitions generated by connected components. Left. For A := A′ ∪ A′′, P(A) := {A′, A′′}, and P(B) := {B}, the CRM ζ : P(A) → P(B) is surjective but not injective. Middle. For B := B′ ∪ B′′ and the obvious definitions of P(A) and P(B), the CRM ζ : P(A) → P(B) is injective but not surjective. Right. Analogously, for A := A′ ∪ A′′ and B := B′ ∪ B′′, the CRM ζ : P(A) → P(B) is bijective and hence P(A) is persistent in P(B). In this case the structures of the cells in A and B have a clear one-to-one relation to each other. Note that in all cases we further have P(A) = C(A) and P(B) = C(B).

In order to ensure inclusions that are inverse to (1), Steinwart (2015a) said that P is normal at level ρ if there exist two λ^d-densities h_1 and h_2 of P such that

λ^d(M_ρ \ {h_1 ≥ ρ}) = λ^d({h_2 > ρ} \ M_ρ) = 0.

In particular, if P is normal at level ρ, then M_ρ and {h_1 ≥ ρ} only differ by a Lebesgue zero set. Moreover, it is shown in (Steinwart, 2015b, Lemma A.1.3)¹ that P is normal at every level if it has both an upper semi-continuous λ^d-density h_1 and a lower semi-continuous λ^d-density h_2. Furthermore, if P has a λ^d-density h such that λ^d(∂{h ≥ ρ}) = 0, then the same lemma shows that P is normal at level ρ. This implies that essentially all P one usually thinks of are normal.
Finally, if the conditions of normality at level ρ are satisfied for some λ^d-densities h_1 and h_2 of P, then they are actually satisfied for all λ^d-densities h of P and, as mentioned above, we have λ^d(M_ρ △ {h ≥ ρ}) = 0. The next assumption collects the concepts introduced so far.

Assumption P. We have a λ^d-absolutely continuous probability measure P that is normal at every level. In addition, supp P is compact, and X ⊂ R^d is a compact and connected set with (supp P)^{+2}_{R^d} ⊂ X.

Note that the assumption (supp P)^{+2}_{R^d} ⊂ X ensures both A^{+δ}_X = A^{+δ}_{R^d} and A^{−δ}_X = A^{−δ}_{R^d} for all δ ∈ (0, 1] and A ⊂ supp P. In this case, we therefore usually ignore the surrounding X. Let us now recall the definition of clusters from Steinwart (2015a). We begin with the following definition.

Definition 1. Let P(A) and P(B) be partitions of A and B, respectively, with ∅ ≠ A ⊂ B. Then P(A) is comparable to P(B), in which case we write P(A) ⊑ P(B), if, for all A′ ∈ P(A), there is a B′ ∈ P(B) with A′ ⊂ B′.

Informally speaking, P(A) is comparable to P(B) if no cell A′ ∈ P(A) is broken into pieces in P(B). In particular, if P_1 and P_2 are two partitions of A, then P_1 ⊑ P_2 if and only if P_1 is finer than P_2. Now assume that P(A) and P(B) are two partitions with P(A) ⊑ P(B). Then (Steinwart, 2015b, Lemma A.2.1) shows that there exists a unique map ζ : P(A) → P(B) with

A′ ⊂ ζ(A′),  A′ ∈ P(A).

Following Steinwart (2011, 2015a), we call ζ the cell relating map (CRM) between A and B. Moreover, if ζ is bijective, we say that P(A) is persistent in P(B). We refer to Figure 5 for illustrations of persistent and non-persistent partitions. The first examples of comparable partitions come from connected components. To be more precise, let A ⊂ R^d be a closed subset and C(A) be the collection of its connected components.

1. In this lemma, the term upper normal at level ρ means that λ^d(M_ρ \ {h_1 ≥ ρ}) = 0 for some density h_1 := dP/dλ^d, while lower normal at level ρ means λ^d({h_2 > ρ} \ M_ρ) = 0 for some density h_2 := dP/dλ^d.
By definition, C(A) forms a partition of A, and if B ⊂ R^d is another closed subset with A ⊂ B and |C(B)| < ∞, then we have C(A) ⊑ C(B), see (Steinwart, 2015b, Lemma A.2.3) as well as Figure 5. Following Steinwart (2015a), another class of partitions arises from a discrete notion of path-connectivity. To recall the latter, we fix a τ > 0, an A ⊂ R^d, and a norm ∥·∥ on R^d. Then x, x′ ∈ A are τ-connected in A if there exist x_1, . . . , x_n ∈ A such that x_1 = x, x_n = x′ and ∥x_i − x_{i+1}∥ < τ for all i = 1, . . . , n − 1. Clearly, being τ-connected gives an equivalence relation on A. We write C_τ(A) for the resulting partition and call its cells the τ-connected components of A. It has been shown in (Steinwart, 2015b, Lemma A.2.7) that C_τ(A) ⊑ C_τ(B) for all A ⊂ B and τ > 0. Moreover, if |C(A)| < ∞, then C(A) = C_τ(A) for all sufficiently small τ > 0, see (Steinwart, 2015a, Section 2.2) for details. Following Steinwart (2015a), we can now describe probability measures that can be clustered.

Definition 2. Let Assumption P be satisfied. Then P can be clustered between ρ* ≥ 0 and ρ** > ρ*, if for all ρ ∈ [0, ρ**] we have |C(M_ρ)| ∈ {1, 2} and the following two conditions are met:

i) If we have |C(M_ρ)| = 1, then ρ ≤ ρ*.
ii) If we have |C(M_ρ)| = 2, then ρ > ρ* and C(M_{ρ**}) is persistent in C(M_ρ).

Using the CRMs ζ_ρ : C(M_{ρ**}) → C(M_ρ), we then define the clusters of P by

A*_i := ⋃_{ρ ∈ (ρ*, ρ**]} ζ_ρ(A_i),  i ∈ {1, 2},

where A_1 and A_2 are the topologically connected components of M_{ρ**}. Finally, we define

τ*(ε) := (1/3) d(ζ_{ρ*+ε}(A_1), ζ_{ρ*+ε}(A_2)),  ε ∈ (0, ρ** − ρ*].  (2)

Definition 2 ensures that the level sets below ρ* are connected, while for a certain range above ρ* the level sets have exactly two components, which, in addition, are assumed to be persistent, see the right picture in Figure 6 for an illustration. Thus, the topological structure of M_ρ between the split level ρ* and an anchor level ρ** of it equals that of M_{ρ**}. Consequently, we can use the connected components of M_{ρ**} to number the connected components of M_ρ for ρ ∈ (ρ*, ρ**]. This is done in the definition of the clusters A*_i as well as in the definition of the function τ*, which essentially measures the distance between the two connected components at level ρ* + ε. We again refer to Figure 6 for some visual impressions.

Figure 6: Left. A density h on [0, 1] that can be clustered between ρ*_1 and ρ**_1. Below the split level ρ*_1, all level sets M_ρ only have one connected component. Between the split level ρ*_1 and an anchor level ρ**_1 of it, that is, for all ρ ∈ (ρ*_1, ρ**_1], all level sets M_ρ have two connected components (in green and purple). Moreover, we have persistence of C(M_{ρ**_1}) in C(M_ρ), as indicated by the same color of the related cells. Taking their unions results in the two clusters A*_1 and A*_2 at the split level ρ*_1, which are again indicated in green and purple. Finally, note that τ*(ε) equals the distance of the green and purple components at level ρ*_1 + ε modulo the factor 3. Middle. A connected set B with δ-trim B^{−δ} = A′ ∪ A′′. The resulting CRM ζ : C(B^{−δ}) → C(B) is only surjective, and since the bridge in the middle narrows faster than linearly, the thickness function has an exponent γ < 1. Right. Again, a connected set B with δ-trim B^{−δ} = A′ ∪ A′′, for which the resulting CRM ζ : C(B^{−δ}) → C(B) is only surjective. This time, however, the thickness function has an exponent γ = 1.

The major goal of Steinwart (2011, 2015a) was to design an algorithm that is able to asymptotically estimate both the correct value of ρ* and the clusters A*_1 and A*_2. However, this algorithm required that the level sets do not have bridges or cusps that are too thin. To make this precise, let us recall that for a closed A ⊂ R^d, Steinwart (2011, 2015a) considered the function ψ_A : (0, ∞) → [0, ∞] defined by

ψ_A(δ) := sup_{x ∈ A} d(x, A^{−δ}),  δ > 0.

Roughly speaking, ψ_A(δ) describes the smallest radius ε needed to recover A from A^{−δ} in the sense of A ⊂ (A^{−δ})^{+ε}, see (Steinwart, 2015b, Section A.5) for this and various other results on ψ_A.
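The τ-connected components just defined are straightforward to compute for a finite point set by chaining points at distance less than τ. The following sketch (our own illustration, not the paper's implementation) does this with a depth-first search under the supremum norm.

```python
def tau_connected_components(points, tau, dist=None):
    """Partition a finite point set into its tau-connected components:
    x and x' share a component iff there is a chain x = x_1, ..., x_n = x'
    within the set with dist(x_i, x_{i+1}) < tau for all i."""
    if dist is None:
        dist = lambda p, q: max(abs(a - b) for a, b in zip(p, q))  # sup-norm
    points = list(points)
    seen, comps = set(), []
    for p in points:
        if p in seen:
            continue
        stack, comp = [p], set()
        while stack:
            q = stack.pop()
            if q in comp:
                continue
            comp.add(q)
            stack.extend(r for r in points if r not in comp and dist(q, r) < tau)
        seen |= comp
        comps.append(comp)
    return comps

pts = [(0.0,), (0.4,), (0.8,), (3.0,), (3.3,)]
print([sorted(c) for c in tau_connected_components(pts, tau=0.5)])
```

With τ = 0.5, the three left points form one component (consecutive gaps 0.4 < τ) and the two right points form another, since the gap of 2.2 between the groups exceeds τ. This is the discrete notion of connectivity Algorithm 1 below relies on.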
In particular, we have ψ_A(δ) ≥ δ for all δ > 0, and ψ_A(δ) = ∞ if A^{−δ} = ∅. Moreover, ψ_A behaves linearly if bridges and cusps of A are not too thin, and even thinner cusps and bridges can be included by considering sets with ψ_A(δ) ≤ cδ^γ for some fixed c ≥ 1, γ ∈ (0, 1], and all sufficiently small δ > 0. However, the discussion in (Steinwart, 2015b, Section A.5) also showed that γ = 1 can be viewed as the most normal case. Finally, we refer to Figure 6 for some examples of linear and nonlinear behavior of ψ_A that is relevant for the following definition taken from Steinwart (2015a).

Definition 3. Let Assumption P be satisfied. Then we say that P has thick level sets of order γ ∈ (0, 1] up to the level ρ** > 0 if there exist constants c_thick ≥ 1 and δ_thick ∈ (0, 1] such that, for all δ ∈ (0, δ_thick] and ρ ∈ [0, ρ**], we have

ψ_{M_ρ}(δ) ≤ c_thick δ^γ.

In this case, we call ψ(δ) := 3c_thick δ^γ the thickness function of P.

Now, the following assumption describes the probability measures we wish to cluster.

Assumption M. The probability measure P can be clustered between ρ* and ρ**, see Definition 2. In addition, P has thick level sets of order γ ∈ (0, 1] up to the level ρ**. We denote the corresponding thickness function by ψ and write τ* for the function defined in (2).

The theory developed by Steinwart (2011); Sriperumbudur and Steinwart (2012); Steinwart (2015a) focused on the question whether it is possible to estimate ρ* and the resulting clusters for distributions that can be clustered. To this end, it was assumed that we have an estimation algorithm that constructs, for a given data set D, level set estimates L_{D,ρ} for all ρ > 0. Moreover, it was assumed that these level set estimates satisfy the guarantees

M^{−δ}_{ρ+ε} ⊂ L_{D,ρ} ⊂ M^{+δ}_{ρ−ε}  (3)

for all ρ ∈ [0, ρ**] and some ε, δ > 0.
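A simple way to produce level set estimates L_{D,ρ} of the kind required in (3) is to threshold a density estimator on the sample points. The sketch below (our own toy construction; the mixture data, bandwidth, and level are hypothetical choices) uses a one-dimensional moving window (uniform kernel) density estimate, one of the kernel choices discussed later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def moving_window_kde(data, x, bandwidth):
    """Uniform-kernel ("moving window") density estimate at the points x (1-d):
    fraction of the data within the window, divided by the window length."""
    diffs = np.abs(x[:, None] - data[None, :])
    return (diffs <= bandwidth).mean(axis=1) / (2 * bandwidth)

def level_set_estimate(data, rho, bandwidth):
    """L_{D,rho}: the sample points whose density estimate is at least rho."""
    hD = moving_window_kde(data, data, bandwidth)
    return data[hD >= rho]

# mixture of two uniform "clusters" on [0, 0.3] and [0.7, 1.0]
data = np.concatenate([rng.uniform(0.0, 0.3, 300), rng.uniform(0.7, 1.0, 300)])
L = level_set_estimate(data, rho=0.8, bandwidth=0.05)
print(len(L))
```

Since the estimate decreases in ρ, the family (L_{D,ρ})_{ρ>0} is automatically decreasing, as required by the generic algorithm below; how close it comes to satisfying (3) depends on the bandwidth and the tail behavior of the kernel, which is exactly what Sections 4 and 5 quantify.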
Based on these assumptions, a generic cluster algorithm was developed, where generic refers to the fact that the cluster algorithm does not need to know specifics about the construction of L_{D,ρ}. Instead, only the guarantees (3) with known values for ε and δ are needed. The key result (Steinwart, 2015a, Theorem 2.9) then specified in terms of ε and δ how well this algorithm estimates both ρ* and the clusters A*_1 and A*_2. Here the main difficulty of establishing this result arose from the fact that no pair of the sets M_{ρ+ε}, M_{ρ−ε}, M^{−δ}_{ρ+ε}, and M^{+δ}_{ρ−ε} needs to have the same topological structure in terms of their connected components. Indeed, parts of these incompatibilities can already be seen in Figures 1 and 4. What is missing in this analysis, however, is an investigation of the behavior of the generic cluster algorithm in situations in which P cannot be clustered because all level sets are connected. Now observe that the reason for this gap was the notion of thickness: Indeed, if P is a single-cluster probability measure, i.e. |C(M_ρ)| ≤ 1 for all ρ ≥ 0, and P has thick level sets of order γ up to the level ρ* := sup{ρ ≥ 0 : |C(M_ρ)| = 1}, then the proof of (Steinwart, 2015a, Theorem 2.9) can be easily extended to show that at each visited level ρ the algorithm correctly detects exactly one connected component. Unfortunately, however, the assumption of having thick level sets up to the height ρ* of the peak of h is too unrealistic, as it requires M^{−δ}_ρ ≠ ∅ for all ρ ∈ [0, ρ*] and δ ∈ (0, δ_thick], that is, the peak needs to be a plateau that contains a ball of radius δ_thick, as the following lemma shows.

Lemma 4. Let A ⊂ R^d be bounded and δ > 0. Then A^{−δ}_{R^d} ≠ ∅ if and only if δ < inrad A.

3. A Generic Algorithm for Estimating the Split Tree

In this section we present a generic algorithm for estimating the entire split tree.
To this end, we first introduce a new set of assumptions for single-cluster distributions that rules out irregular behavior of the level sets in the vicinity of the peak of the density. Unlike the naïve approach we have discussed at the end of Section 2, this new set of assumptions includes a variety of realistic behaviors. In the second step we then present a generic cluster algorithm, whose only difference to the one in (Steinwart, 2015a) is its output behavior in situations in which no split has been detected. We then show that this new cluster algorithm, like its predecessor in (Steinwart, 2015a), correctly identifies a split in the cluster tree. Moreover, we demonstrate that, unlike the one in (Steinwart, 2015a), the new cluster algorithm also correctly identifies single-cluster distributions. Finally, we combine these insights to develop a new generic algorithm for estimating the entire cluster tree. Let us begin by introducing a new assumption for dealing with single-cluster distributions.

Assumption S. Assumption P is satisfied and there are ρ* ≥ 0, γ ∈ (0, 1], c_thick ≥ 1, and δ_thick ∈ (0, 1] such that for all ρ ≥ ρ* and δ ∈ (0, δ_thick] we have |C(M_ρ)| ≤ 1 as well as:

i) If M^{−δ}_ρ ≠ ∅, then ψ_{M_ρ}(δ) ≤ c_thick δ^γ.
ii) If M^{−δ}_ρ = ∅, then, for all ∅ ≠ A ⊂ M^{+δ}_ρ and τ > 2c_thick δ^γ, we have |C_τ(A)| = 1.

Note that |C(M_ρ)| ≤ 1 simply means that the level sets of P above ρ* are either empty or connected. If they are empty, there is nothing more to assume, and in the other case we can either have M^{−δ}_ρ ≠ ∅ or M^{−δ}_ρ = ∅. If M^{−δ}_ρ ≠ ∅, then condition i) ensures that the level set M_ρ is still thick in the sense of Definition 3, while in the other case, M^{−δ}_ρ = ∅, condition ii) guarantees that the larger sets M^{+δ}_ρ cannot have multiple τ-connected components as long as we choose τ in the way that is required in the case of multiple clusters, too.
In this respect note that (Steinwart et al., 2021, Lemma 8.3) shows that for all δ ∈ (0, δ_thick], there exists a ρ ≥ ρ* with M^{−δ}_ρ = ∅, and therefore dealing with case ii) cannot be avoided. In the case γ = 1, condition ii) can also be interpreted in terms of the inradius and the diameter, as the following lemma shows:

Lemma 5 For compact M ⊆ R^d and c ≥ 1 the following statements are equivalent:

i) For all δ > 0 with M^{−δ} = ∅ and all ∅ ≠ A ⊆ M^{+δ} and τ > 2cδ, we have |C_τ(A)| = 1.

ii) We have 2(c − 1) inrad M ≥ diam M.

Since by Lemma 4 there only exists some δ ∈ (0, δ_thick] with M^{−δ}_ρ = ∅ if inrad M_ρ < δ_thick, we thus see that for γ = 1, condition ii) is equivalent to

2(c_thick − 1) inrad M_ρ ≥ diam M_ρ (4)

for all ρ ≥ ρ* with inrad M_ρ < δ_thick. Inequality (4) essentially states that the diameter-to-inradius ratio must remain bounded for increasing ρ. Consequently, (4) is satisfied for c_thick := 2 if all M_ρ above ρ* are balls with respect to the considered norm, since in this case we have 2 inrad M_ρ = diam M_ρ.

Figure 7: Contour lines of two continuous densities. The levels ρ are chosen such that we have inrad M_ρ ∈ {0, 1, 2, 3, 4, 5} with respect to the supremum norm. Left. Assumption S is satisfied for γ = 1, since, for increasing ρ, diam M_ρ decays faster than inrad M_ρ does, and hence (4) holds. Right. Assumption S is not satisfied for γ = 1, since the highest level set satisfies inrad M_ρ = 0 and diam M_ρ > 0, which violates (4).

Algorithm 1 Clustering with the help of a generic level set estimator
Require: Some ε, τ > 0 and a start level ρ0 ≥ 0. A decreasing family (L_ρ)_{ρ≥0} of subsets of X.
Ensure: An estimate of ρ* and the corresponding clusters, or the start level ρ0 and L_{ρ0}.
1: ρ ← ρ0
2: repeat
3: Identify the τ-connected components B1, . . . , BM of L_ρ satisfying Bi ∩ L_{ρ+2ε} ≠ ∅.
4: ρ ← ρ + ε
5: until M ≠ 1
6: ρ ← ρ + 2ε
7: Identify the τ-connected components B1, . . . , BM of L_ρ satisfying Bi ∩ L_{ρ+2ε} ≠ ∅.
8: if M > 1 then
9: return ρ_out = ρ and the sets Bi for i = 1, . . . , M.
10: else
11: return ρ_out = ρ0 and the set L_{ρ0}.
12: end if
Moreover, by increasing c_thick we see that (4) remains true if we distort these balls by bi-Lipschitz continuous maps whose constants can be bounded independently of ρ. Analogously, (4) is satisfied if the M_ρ are balls with respect to a fixed norm that is different from the one used for the δ-tubes and δ-trims in condition ii). In addition, if we have a flat plateau at the highest level ρ_max := sup{ρ > 0 : M_ρ ≠ ∅} < ∞, that is, inrad M_{ρ_max} > 0, then (4) is always satisfied for some c_thick > 1 because of diam M_ρ ≤ diam X < ∞. In contrast, (4) is violated if, for example, inrad M_{ρ_max} = 0 and diam M_{ρ_max} > 0. Finally, for d = 1, we always have 2 inrad M_ρ = diam M_ρ for all non-empty level sets M_ρ, since for these |C(M_ρ)| = 1 ensures that M_ρ is an interval. Consequently, for d = 1 Assumption S is satisfied with γ = 1, c_thick = 2, and δ_thick = 1 whenever Assumption P is satisfied. For two-dimensional examples we refer to Figure 7 for an illustration.

The next task is to formulate a generic algorithm that is able to estimate ρ* and the resulting clusters if P can be clustered in the sense of Assumption M, and that is also able to detect distributions that only have one cluster in the sense of Assumption S. We will see in the following that Algorithm 1 is such an algorithm. Before we present the corresponding results, we first note that the only differences between Algorithm 1 and the algorithm considered by Steinwart (2015a) are the more flexible start level ρ0, compared to ρ0 = 0 in (Steinwart, 2015a), and the modified output in Lines 8-12. Indeed, the algorithm in (Steinwart, 2015a) always produces the return values of Line 9. In contrast, Algorithm 1 distinguishes between the cases M > 1 and M = 0. While for M > 1 the outputs of both algorithms coincide, the new Algorithm 1 now returns ρ0 and L_{ρ0} in the case M = 0. We will see in Theorem 7 that the latter case typically occurs for P satisfying Assumption S.
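To make the loop structure of Algorithm 1 concrete, here is a minimal Python sketch. It is our own illustration, not the authors' implementation: `level_set(rho)` is a hypothetical callable returning the finite level set estimate L_ρ as a list of points, and τ-connected components are computed by chaining points at distance at most τ with a union-find pass.

```python
import numpy as np

def tau_components(points, tau):
    """Partition a finite point set into tau-connected components:
    two points share a component if they are linked by a chain of
    points with consecutive distances <= tau (union-find)."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(points[i] - points[j]) <= tau:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(points[i])
    return list(groups.values())

def algorithm1(level_set, eps, tau, rho0):
    """Sketch of Algorithm 1: climb the levels in steps of eps until the
    number of surviving tau-connected components differs from one, then
    step up by 2*eps and confirm."""
    def surviving(rho):
        comps = tau_components(level_set(rho), tau)
        upper = level_set(rho + 2 * eps)
        # keep only the components B_i intersecting the higher level set
        return [B for B in comps
                if any(any(np.array_equal(x, y) for y in upper) for x in B)]

    rho = rho0
    while len(surviving(rho)) == 1:
        rho += eps
    rho += 2 * eps
    comps = surviving(rho)
    if len(comps) > 1:                    # a split was detected
        return rho, comps
    return rho0, level_set(rho0)          # single-cluster case
```

On a one-dimensional toy example with a density dip between two groups of points, the sketch climbs past the dip level and returns two clusters; with a unimodal sample it falls through to the single-cluster branch, mirroring the distinction between Theorems 6 and 7 below.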
In this respect recall that L_{ρ0} can be viewed as an estimate of M_{ρ0}, and therefore returning L_{ρ0} makes sense for such distributions. The next theorem adapts (Steinwart, 2015a, Theorem 2.9) to Algorithm 1. Since the proof of (Steinwart, 2015a, Theorem 2.9) can easily be adapted to arbitrary start levels ρ0 ≥ 0, and this proof also shows that the case M = 0 does not occur under the assumptions of this theorem, we omit the proof of Theorem 6.

Theorem 6 Let Assumption M be satisfied. Furthermore, let ε* ≤ (ρ** − ρ*)/9, δ ∈ (0, δ_thick], τ ∈ (ψ(δ), τ*(ε*)], ε ∈ (0, ε*], and ρ0 ≤ ρ*. In addition, let (L_ρ)_{ρ≥0} be a decreasing family satisfying (3) for all ρ ≥ ρ0. Then we have:

i) The level ρ_out returned by Algorithm 1 satisfies ρ_out ∈ [ρ* + 2ε, ρ* + ε* + 5ε] and ψ(δ) < τ ≤ 3τ*(ρ_out − ρ* + ε).

ii) Algorithm 1 returns two sets B1 and B2, and these can be ordered such that we have

Σ_{i=1}^{2} λ_d(B_i △ A*_i) ≤ Σ_{i=1}^{2} λ_d(A*_i \ (A^i_{ρ_out+ε})^{−δ}) + λ_d(M^{+δ}_{ρ_out−ε} \ {h > ρ*}).

Here, A^i_{ρ_out+ε} ∈ C(M_{ρ_out+ε}) are ordered in the sense of A^i_{ρ_out+ε} ⊆ A*_i.

Theorem 6 shows that Algorithm 1 is still able to estimate ρ* and the corresponding clusters if P can be clustered in the sense of Assumption M. The main motivation for this section was, however, to have an algorithm that also behaves correctly for P that only have one cluster in the sense of Assumption S. The next theorem ensures such a behavior.

Theorem 7 Let Assumption S be satisfied and (L_ρ)_{ρ≥0} be a decreasing family of sets L_ρ ⊆ X such that (3) holds for some fixed ε, δ > 0 and all ρ ≥ ρ0. If ρ0 ≥ ρ*, δ ∈ (0, δ_thick], and τ > 2 c_thick δ^γ, then Algorithm 1 returns ρ0 and L_{ρ0}.

Note that Theorem 6 requires τ > ψ(δ) = 3 c_thick δ^γ, while Theorem 7 even holds under the milder assumption τ > 2 c_thick δ^γ. Consequently, if we choose a τ with τ > 3 c_thick δ^γ, then the corresponding assumptions of both theorems are satisfied. Moreover, the additional assumption τ ≤ τ*(ε*) in Theorem 6 is actually more an assumption on ε* than on τ, as we will see later.
Now assume that the assumptions of Theorem 6 are satisfied and that Algorithm 1 returned ρ_out and the cluster estimates B1, B2. Our goal is to apply Algorithm 1 to the two detected clusters B1 and B2 separately, see Algorithm 2. To this end, we define the new level set estimates

L_{i,ρ} := L_ρ ∩ B_i , i = 1, 2, ρ ≥ ρ_out,

and let Algorithm 1 run on both families of level set estimates separately.

Algorithm 2 Estimating the split-tree with the help of a generic level set estimator
Require: Some τ > 0, ε > 0 and a start level ρ0 ≥ 0. A decreasing family (L_ρ)_{ρ≥0} of subsets of X.
Ensure: Estimates of all splits of the cluster tree and the corresponding clusters.
1: Call Algorithm 1 with ρ0 and (L_ρ)_{ρ≥0}
2: if ρ_out > ρ0 then
3: Store the return values of Algorithm 1 in the split-tree
4: Call Algorithm 2 with ρ_out + ε and (L_{1,ρ})_{ρ≥0}
5: Call Algorithm 2 with ρ_out + ε and (L_{2,ρ})_{ρ≥0}
6: else if ρ_out = 0 then
7: Store the return values of Algorithm 1 in the split-tree
8: end if
9: return split-tree

Of course, we want to use our insights into Algorithm 1, and for this reason, we need to replace (3) by a suitable new horizontal and vertical control. To find such a new control, let us assume that Assumption M is satisfied and that we have fixed a level ρ̆ ∈ (ρ*, ρ**]. Moreover, let A_{1,ρ̆} and A_{2,ρ̆} be the two connected components of M_{ρ̆}. We then define two new children distributions P1 and P2 by

P_i(B) := P(B ∩ A_{i,ρ̆}) / P(A_{i,ρ̆}) , i = 1, 2, (5)

for all measurable B ⊆ X. Moreover, for ρ ≥ 0 we denote the level sets of P1 and P2 by M_{1,ρ} and M_{2,ρ}, respectively. We can now introduce distributions having a finite split-tree.

Definition 8 Let P be a distribution satisfying Assumption P and |C(M_ρ)| < ∞ for all ρ ≥ 0. Moreover, assume that there is a ρ_max > 0 such that M_ρ = ∅ for all ρ ≥ ρ_max. Then P has a finite split-tree with minimal step size ϵ > 0, if one of the following two conditions is satisfied:

i) P satisfies Assumption S.
ii) P satisfies Assumption M with ρ** − ρ* ≥ ϵ, and for ρ̆ := (ρ* + ρ**)/2 the two probability measures P1 and P2 defined by (5) have a finite split-tree with minimal step size ϵ > 0.

For example, the distribution shown repeatedly in the introduction, see e.g. Figure 2 for the split-tree of this distribution, can be clustered. Our next goal is to show that Algorithm 2 can be used to estimate the split-tree for distributions having a finite split-tree with some unknown minimal step size ϵ > 0. To this end, we need the rather technical Theorem 9 below, whose formulation requires the sets of τ-connected components of L_ρ that are identified in Lines 3 and 7 of Algorithm 1, that is, the sets

Ĉ_τ(L_ρ) := { B ∈ C_τ(L_ρ) : B ∩ L_{ρ+2ε} ≠ ∅ } , ρ ≥ 0.

Theorem 9 Let Assumption M be satisfied. Furthermore, let ε* ≤ (ρ** − ρ*)/16, δ ∈ (0, δ_thick], τ ∈ (ψ(δ), τ*(ε*)], ε ∈ (0, ε*], and ρ0 ≤ ρ*. In addition, let (L_ρ)_{ρ≥0} be a decreasing family satisfying (3) for all ρ ≥ ρ0. Finally, let ρ_out be the estimate of ρ* and B1, B2 be the cluster estimates returned by Algorithm 1, and let ρ̆ ∈ (ρ*, ρ**] be the level fixed above. Then the following statements are true:

i) We have |C_τ(M^{−δ}_{ρ̆})| = 2, and the sets V1 := (A_{1,ρ̆})^{−δ} and V2 := (A_{2,ρ̆})^{−δ} are the two τ-connected components of M^{−δ}_{ρ̆}.

ii) For all ρ ∈ [ρ_out, ρ̆ − 3ε] we have |Ĉ_τ(L_ρ)| = 2. Moreover, we can order the two elements B^ρ_1 and B^ρ_2 of Ĉ_τ(L_ρ) such that

V_i ⊆ B^ρ_i ⊆ B_i , i = 1, 2. (6)

iii) If ρ̆ ∈ [ρ* + ε* + 6ε, ρ** − 5ε], then for all ρ ≥ ρ̆ + 4ε we have

L_{i,ρ} ⊆ B^{ρ̆+2ε}_i and M^{−δ}_{i,ρ+ε} ⊆ L_{i,ρ} ⊆ M^{+δ}_{i,ρ−ε}. (7)

To illustrate Theorem 9, we now take ρ̆ := (ρ* + ρ**)/2 and assume ε* ≤ (ρ** − ρ*)/16. For ε ∈ (0, ε*] we then find ρ̆ ∈ [ρ* + ε* + 6ε, ρ** − 5ε]. Moreover, we have ρ̆ − 3ε > ρ̆ − 4ε ≥ ρ̆ − (ρ** − ρ*)/4, and part ii) of Theorem 9 thus shows (6) for all ρ ∈ [ρ_out, ρ̆ − (ρ** − ρ*)/4]. Consequently, Algorithm 1, when working with the level sets (L_{i,ρ}), identifies exactly one connected component in its Line 3 for such ρ. In other words, the loop between its Lines 2 and 5 is not left for such ρ.
Moreover, for ρ ≥ ρ̆ + 4ε, part iii) of Theorem 9 ensures (7). Consequently, Theorems 6 and 7 can be applied to Algorithm 1 when working with the level sets (L_{i,ρ})_{ρ≥ρ̆} for the distribution P_i. We refer to Figure 8 for a detailed description of how these guarantees work together. In summary, these considerations show that Algorithm 2 can be recursively analyzed with the help of Theorems 6 and 7, so that Algorithm 2 indeed estimates the split-tree for all distributions P having a finite split-tree with some unknown minimal step size ϵ > 0. In particular, for all quantitative guarantees it actually suffices to describe the behavior of Algorithm 1 for distributions satisfying Assumption S and Assumption M. This insight will be adopted later in the statistical analysis of Section 5.

Figure 8: A density of a P that has a finite split-tree with minimal step size ϵ > 0 and the resulting guarantees. Left. Situation at the lowest split level ρ*_1. The distribution P can be clustered between ρ*_1 and ρ**_1, and the (unnormalized) densities of the children distributions P_{1,1} (green) and P_{1,2} (purple) above ρ̆_1 := (ρ*_1 + ρ**_1)/2 are shown. Assumption S is satisfied by P_{1,1}, while P_{1,2} satisfies Assumption M. Algorithm 2 initialized with ρ0 := 0 calls Algorithm 1 in its Line 1, which in turn returns an estimate ρ_{1,out} of ρ*_1 with ρ_{1,out} > ρ*_1 and two corresponding clusters denoted by B_{1,1} and B_{1,2}, as guaranteed by Theorem 6. Algorithm 2 stores these in its Line 3 and continues by calling a second instance of Algorithm 2 with start level ρ_{1,out} + ε and the family (L_{1,ρ})_{ρ≥0} in Line 4. This instance of Algorithm 2 in turn calls Algorithm 1 with ρ_{1,out} + ε and the family (L_{1,ρ})_{ρ≥0}. For ρ ∈ [ρ_{1,out}, ρ⋄_1] with ρ⋄_1 := 0.75ρ*_1 + 0.25ρ**_1, that is, for levels between the two cyan lines, part ii) of Theorem 9 ensures that Algorithm 1 only detects one cluster B^ρ_1 in its loop, see also the calculations following Theorem 9.
Consequently, this loop is not left below ρ⋄_1. Moreover, for ρ ≥ ρ⋄_1, part iii) of Theorem 9 ensures M^{−δ}_{1,ρ+ε} ⊆ L_{1,ρ} ⊆ M^{+δ}_{1,ρ−ε}, and hence Theorem 7 shows that Algorithm 1 returns its start level ρ_{1,out} + ε and L_{ρ_{1,out}+ε}. As a consequence, both if-clauses of the second instance of Algorithm 2 are not satisfied, and the overall program continues at Line 5 of the first instance of Algorithm 2. There, a third instance of Algorithm 2 is called with ρ_{1,out} + ε and (L_{2,ρ})_{ρ≥0}, which in turn begins by calling Algorithm 1 with the same values. Again, Theorem 9 ensures that for ρ ∈ [ρ_{1,out}, ρ⋄_1] Algorithm 1 only detects one cluster and that for ρ ≥ ρ⋄_1, the crucial inclusions M^{−δ}_{2,ρ+ε} ⊆ L_{2,ρ} ⊆ M^{+δ}_{2,ρ−ε} are satisfied. Theorem 6 hence ensures that Algorithm 1 returns an estimate ρ_{2,out} of the second split level ρ*_2 with ρ_{2,out} > ρ*_2 > ρ_{1,out} + ε and corresponding clusters denoted by B_{2,1} and B_{2,2} to the third instance of Algorithm 2. These values are then stored in the split-tree, and a fourth and fifth instance of Algorithm 2 are called with ρ_{2,out} + ε and the newly defined (L_{(2,1),ρ})_{ρ≥0}, respectively (L_{(2,2),ρ})_{ρ≥0}, given by L_{(2,i),ρ} := L_{2,ρ} ∩ B_{2,i}. Right. Assumption S is satisfied for both children measures occurring at the split ρ*_2. As for B1 in the left image, the combination of Theorem 9 and Theorem 7 ensures that Algorithm 1, called by the fourth and fifth instance of Algorithm 2 with ρ_{2,out} + ε and (L_{(2,i),ρ})_{ρ≥0}, returns these values to these instances of Algorithm 2. Thus, the fourth and fifth instance of Algorithm 2 return to the third instance of Algorithm 2 without any further action, and in turn the third instance returns to the first instance of Algorithm 2. This instance then reaches its Line 9, and therefore the overall program terminates with (ρ_{1,out}, B1, B2) and (ρ_{2,out}, B_{2,1}, B_{2,2}) being stored in its split-tree.

4. Uncertainty Control for Kernel Density Estimators

The results of Section 3 provide guarantees as soon as the input level sets satisfy (3).
Steinwart (2015a) has shown that guarantees of the form (3) can be established for the level sets of histogram-based density estimators. The goal of this section is to show that (3) can also be established for a variety of kernel density estimators. Our first definition introduces the considered kernels.

Definition 10 A bounded, measurable function K : R^d → [0, ∞) is called a symmetric kernel if K(x) > 0 in some neighborhood of 0, K(x) = K(−x) for all x ∈ R^d, and

∫_{R^d} K(x) dλ_d(x) = 1. (8)

For δ > 0 we write K_δ := δ^{−d} K(δ^{−1} ·), and for r > 0 and a norm ‖·‖ on R^d we define

κ_1(r) := ∫_{R^d \ B(0,r)} K(x) dλ_d(x) , κ_∞(r) := sup_{x ∈ R^d \ B(0,r)} K(x).

We call κ_1(·) and κ_∞(·) tail functions. Finally, we say that K has bounded support if supp K ⊆ B, where B is the closed unit ball of the considered norm, and that K has an exponential tail behavior if there exists a constant c > 0 such that

K(x) ≤ c exp(−‖x‖_2) , x ∈ R^d. (9)

Recall that the integrability condition (8) is standard for kernel density estimators. Moreover, kernels of the form K(x) = k(‖x‖) are always symmetric, and if the representing k : [0, ∞) → [0, ∞) is bounded and measurable, so is K. Moreover, if k(r) > 0 for all r ∈ [0, ϵ), where ϵ > 0 is some constant, then K(x) > 0 in some neighborhood of 0. In particular, for k = c1_{[0,1]} we obtain the rectangular window kernel, which is a symmetric kernel with bounded support, and if k is of the form k(r) = c exp(−r²) or k(r) = c exp(−r), then we obtain a symmetric kernel with exponential tail behavior. Examples of the latter are Gaussian kernels, while the triangular, the Epanechnikov, the quartic, the triweight, and the tricube kernels are further examples of symmetric kernels with bounded support. Finally note that each symmetric kernel with bounded support also has exponential tail behavior, since we always assume that K is bounded. Given a kernel function K, it is standard in kernel density estimation to consider the modified versions K_δ of K with δ → 0 for increasing sample sizes n, see for example (Devroye and Lugosi, 2001).
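The normalization (8), the mass-preserving rescaling K_δ, and the decay of the tail function κ_1 can be checked numerically. The following snippet is our own sanity check for a Gaussian-type kernel k(r) = c exp(−r²) in d = 1; the grid and tolerances are arbitrary choices.

```python
import numpy as np

# Symmetric Gaussian-type kernel in d = 1, normalized so that (8) holds:
# k(r) = c * exp(-r^2) with c = 1/sqrt(pi).
def K(x):
    return np.exp(-x ** 2) / np.sqrt(np.pi)

def K_delta(x, delta):
    # rescaled version K_delta = delta^{-d} K(delta^{-1} x), here d = 1
    return K(x / delta) / delta

xs = np.linspace(-30.0, 30.0, 600001)
dx = xs[1] - xs[0]

# every rescaled version K_delta still integrates to one
for delta in (1.0, 0.5, 0.1):
    mass = float(np.sum(K_delta(xs, delta))) * dx
    assert abs(mass - 1.0) < 1e-5

# tail function kappa_1(r): kernel mass outside the ball B(0, r);
# for exponential tail behavior it decays at an exponential rate
def kappa1(r):
    return float(np.sum(np.where(np.abs(xs) > r, K(xs), 0.0))) * dx

assert kappa1(1.0) > kappa1(2.0) > kappa1(3.0) > 0.0
```

The same check with a small δ illustrates why K_δ is "more narrow" than K: the rescaled mass concentrates on a window of width proportional to δ while the total mass stays at one.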
Roughly speaking, this is because one can show that the infinite-sample kernel density estimator h_{P,δ} defined below in (13) converges to the density h for δ → 0. Also note that for the kernels discussed above and δ ∈ (0, 1), these versions K_δ are more narrow than the original function K. Before we proceed with our main goal of establishing (3), let us briefly discuss a couple of simple properties of symmetric kernels K in the sense of Definition 10. To this end, we first note that the properties of the Lebesgue measure λ_d ensure that

∫_{R^d} K_δ(x − y) dλ_d(y) = ∫_{R^d} K(x − y) dλ_d(y) = ∫_{R^d} K(y − x) dλ_d(y) = 1 (10)

for all x ∈ R^d and δ > 0, and then by an analogous calculation we obtain

∫_{R^d \ B(x,σ)} K_δ(x − y) dλ_d(y) = ∫_{R^d \ B(0,σ/δ)} K(y) dλ_d(y) = κ_1(σ/δ). (11)

In addition, we always have κ_1(r) → 0 for r → ∞, and if K has bounded support, then the tail functions with respect to this norm satisfy

κ_1(r) = κ_∞(r) = 0 , r ≥ 1. (12)

Moreover, for kernels with exponential tail, (Steinwart et al., 2021, Lemma 4.2) shows that the behavior of the tail functions can be bounded by κ_1(r) ≤ c d 2^d vol_d e^{−r} r^{d−1} and κ_∞(r) ≤ c e^{−r}. Now, let K be a symmetric kernel on R^d and P be a distribution on R^d. For δ > 0 we then define the infinite-sample kernel density estimator h_{P,δ} : R^d → [0, ∞) by

h_{P,δ}(x) := δ^{−d} ∫ K(δ^{−1}(x − y)) dP(y) , x ∈ R^d. (13)

It is easy to see that h_{P,δ} ≥ 0 is a bounded measurable function with ‖h_{P,δ}‖_∞ ≤ δ^{−d} ‖K‖_∞. Moreover, a quick application of Tonelli's theorem together with (10) yields ‖h_{P,δ}‖_{L1(λ_d)} = 1, and hence h_{P,δ} is a Lebesgue probability density. Moreover, if P has a Lebesgue density h, then it is well known, see e.g. (Devroye and Lugosi, 2001, Theorem 9.1), that ‖h_{P,δ} − h‖_{L1(λ_d)} → 0 for δ → 0. In addition, if this density is bounded, then (10) yields

‖h_{P,δ}‖_∞ = sup_{x ∈ R^d} δ^{−d} ∫ K(δ^{−1}(x − y)) h(y) dλ_d(y) ≤ ‖h‖_∞ sup_{x ∈ R^d} ∫_{R^d} K_δ(x − y) dλ_d(y) = ‖h‖_∞.

Clearly, if D = (x1, . . . , xn) ∈ X^n is a data set, we can consider the corresponding empirical measure (1/n) Σ_{i=1}^n δ_{x_i}, where δ_x denotes the Dirac measure at x.
In a slight abuse of notation we also denote this empirical measure by D. The resulting function h_{D,δ} : R^d → R, called the kernel density estimator (KDE), can then be computed by

h_{D,δ}(x) := (1/(n δ^d)) Σ_{i=1}^{n} K(δ^{−1}(x − x_i)).

One way to define level set estimates with the help of h_{D,δ} is a simple plug-in approach, that is,

L_{D,ρ} := {h_{D,δ} ≥ ρ}. (14)

While from a statistical perspective this level set estimator is perfectly fine, it is also computationally intractable. For example, if h_{D,δ} is a moving window estimator, that is K(x) = c 1_{[0,1]}(‖x‖) for x ∈ R^d, then the up to 2n different level sets (14) are generated by intersections of balls around the samples, and the structure of these intersections may be too complicated to compute τ-connected components in Algorithm 1. For this reason, we consider level set estimates of the form

L_{D,ρ} := {x ∈ D : h_{D,δ}(x) ≥ ρ}^{+σ} , (15)

where σ > 0. Note that computing connected components of (15) is indeed feasible, since it amounts to computing the connected components of the neighborhood graph, in which two vertices x_i and x_j with i ≠ j have an edge if ‖x_i − x_j‖ ≤ σ + τ. In particular, DBSCAN can be viewed as such a strategy for the moving window kernel. With these preparations we can now present our first result that establishes a sort of uncertainty control (3) for level set estimates of the form (15).

Theorem 11 Let ‖·‖ be some norm on R^d, K : R^d → [0, ∞) be a symmetric kernel, and κ_1(·) and κ_∞(·) be its associated tail functions. Moreover, let P be a distribution for which Assumption P is satisfied, and D be a data set such that the corresponding KDE satisfies ‖h_{D,δ} − h_{P,δ}‖_∞ < ε for some ε > 0 and δ > 0. For ρ ≥ 0 and σ > 0 we define L_{D,ρ} := {x ∈ D : h_{D,δ}(x) ≥ ρ}^{+σ} and ϵ := max{ρ κ_1(σ/δ), δ^{−d} κ_∞(σ/δ)}. Then, for all ρ ≥ δ^{−d} κ_∞(σ/δ), we have

M^{−2σ}_{ρ+ε+ϵ} ⊆ L_{D,ρ} ⊆ M^{+2σ}_{ρ−ε−ϵ}.
(16)

Moreover, if P has a bounded density h, then (16) also holds for ϵ := ‖h‖_∞ κ_1(σ/δ). If K has bounded support for the norm considered in Theorem 11, Equation (12) shows that (16) actually holds for ϵ = 0 and all ρ ≥ 0 and all σ ≥ δ. Therefore, we indeed have (3) with δ replaced by 2σ. In general, however, we have an additional horizontal uncertainty ϵ that affects the guarantees of Theorem 6. To control this influence, our strategy will be to ensure that ϵ ≤ ε, which in view of ϵ = ‖h‖_∞ κ_1(σ/δ) means that we need an upper bound on κ_1(·) and σ. Theorem 11 tells us that the uncertainty control (16) is satisfied as soon as we have a data set D with ‖h_{D,δ} − h_{P,δ}‖_∞ < ε. Recall that rates for ‖h_{D,δ} − h_{P,δ}‖_∞ → 0 have already been proven by Giné and Guillou (2002). However, these rates only hold for n ≥ n0, where n0 may depend on D. In addition, one needs to choose a sequence (δ_n) of bandwidths a priori, which excludes adaptivity, as we will see below. Finally, the theory developed by Steinwart (2015a) requires bounds of the form ‖h_{D,δ} − h_{P,δ}‖_∞ < ε(δ, n, ς) that hold with probability not smaller than 1 − e^{−ς}. Thus, the results of Giné and Guillou (2002) are not suitable for our purposes. To establish suitable bounds, we need to recall some notions first.

Definition 12 Let E be a Banach space and A ⊆ E be a bounded subset. Then, for all ε > 0, the covering numbers of A are defined by

N(A, ‖·‖_E, ε) := inf{ n ≥ 1 : ∃ x_1, . . . , x_n ∈ E such that A ⊆ ∪_{i=1}^{n} (x_i + ε B_E) },

where inf ∅ := ∞. Furthermore, we use the shorthand notation N(A, E, ε) := N(A, ‖·‖_E, ε).

We now introduce the kind of covering number bound we will use in our analysis.

Definition 13 Let (Z, A) be a measurable space and G be a set of measurable functions from Z to R for which there is a B > 0 with ‖g‖_∞ ≤ B for all g ∈ G. Then G is called a uniformly bounded VC-class, if there are A > 0 and ν > 0 such that, for every distribution P on Z, we have

N(G, L2(P), ϵ) ≤ (A B / ϵ)^ν , 0 < ϵ ≤ B.
(17)

Let us briefly look at two important sufficient criteria for ensuring that the set of functions

K_δ := { K_δ(x − ·) : x ∈ X } (18)

is a uniformly bounded VC-class. The first such result considers moving window kernels.

Lemma 14 Consider the kernel K = c 1_B, where B is the closed unit ball of ‖·‖, and ‖·‖ is either the Euclidean or the supremum norm. Then for all δ > 0 the set K_δ defined by (18) is a uniformly bounded VC-class with B := δ^{−d} ‖K‖_∞ = δ^{−d} c and A and ν being independent of δ.

The next lemma shows that Hölder continuous kernels also induce a uniformly bounded VC-class K_δ, provided that the input space X in (18) is compact. For its formulation we need to recall that for every norm ‖·‖ on R^d and every compact subset X ⊆ R^d there exists a finite constant C(X) > 0 such that for all 0 < ϵ ≤ diam(X) we have

N(X, ‖·‖, ϵ) ≤ C(X) ϵ^{−d}. (19)

Lemma 15 Let K : R^d → [0, ∞) be a symmetric, α-Hölder continuous kernel and ‖·‖ be a norm on R^d. We write |K|_α for the corresponding α-Hölder constant. Moreover, let X ⊆ R^d be a compact subset and K_δ be defined by (18). Then for all δ > 0 with δ ≤ (|K|_α ‖K‖_∞^{−1})^{1/α} diam(X), all 0 < ϵ ≤ B := δ^{−d} ‖K‖_∞, and all distributions P on R^d we have

N(K_δ, L2(P), ϵ) ≤ C(X) ( |K|_α δ^{−(d+α)} / ϵ )^{d/α}.

In other words, K_δ is a uniformly bounded VC-class with ν := d/α and constant A := (C(X))^{α/d} |K|_α ‖K‖_∞^{−1} δ^{−α}.

Now, the second main result of this section establishes a finite sample bound on the norm ‖h_{D,δ} − h_{P,δ}‖_∞. Later we will combine it with Theorem 11.

Theorem 16 Let X ⊆ R^d and P be a distribution on X that has a Lebesgue density h ∈ L1(R^d) ∩ Lp(R^d) for some p ∈ (1, ∞]. We write 1/p + 1/p′ = 1. Moreover, let K : R^d → [0, ∞) be a symmetric kernel for which there is a δ0 ∈ (0, 1] such that for all δ ∈ (0, δ0] the set K_δ defined in (18) is a uniformly bounded VC-class with constants of the form B_δ = δ^{−d} ‖K‖_∞, A_δ = A_0 δ^{−a}, and A_0 > 0, a ≥ 0, ν ≥ 1 being independent of δ, that is,

N(K_δ, L2(Q), ϵ) ≤ ( A_0 ‖K‖_∞ δ^{−(d+a)} / ϵ )^ν

holds for all δ ∈ (0, δ0], all ϵ ∈ (0, B_δ], and all distributions Q on R^d.
Then there exists a C > 0 only depending on d, p, and K such that, for all n ≥ 1, δ > 0, and ς ≥ 1 satisfying

δ ≤ min{δ0, 4p ‖K‖_∞} and |log δ| ≤ n δ^{d/p′} ‖h‖_p , (22)

we have

P^n( D : ‖h_{D,δ} − h_{P,δ}‖_{ℓ∞(X)} < C ( ‖h‖_p |log δ| ς / (n δ^{d/p′}) )^{1/2} ) ≥ 1 − e^{−ς}. (23)

For bounded densities, Theorem 16 recovers the same rates as Giné and Guillou (2002). However, Giné and Guillou (2002) established the rates in an almost sure asymptotic form, whereas Theorem 16 provides a finite sample bound. Moreover, unlike Giné and Guillou (2002), Theorem 16 also yields rates for unbounded densities.

5. Statistical Analysis of KDE-based Clustering

In this section we combine the generic results of Section 3 with the uncertainty control for level set estimates obtained from kernel density estimators in Section 4. As a result, we present finite sample guarantees, consistency results, and rates for estimating split levels ρ* and the corresponding clusters. In this respect recall the discussion following Theorem 9, which showed that for deriving guarantees for estimating the split-tree with the help of Algorithm 2 it actually suffices to analyze the behavior of Algorithm 1 for distributions satisfying Assumption S and Assumption M. Following this insight, we will focus on such guarantees for Algorithm 1. Our first result presents finite sample bounds for estimating both ρ* and the single or multiple clusters with the help of Algorithm 1. To treat kernels with bounded and unbounded support simultaneously, we restrict ourselves to the case of bounded densities, but at least for kernels with bounded support an adaptation to p-integrable densities is straightforward, as discussed by Steinwart et al. (2021).

Theorem 17 Let P be a distribution with bounded Lebesgue density and with Assumption P being satisfied. Moreover, let K be a symmetric kernel with exponential tail behavior, for which the assumptions of Theorem 16 hold. For fixed δ ∈ (0, e^{−1}] and τ > 0, we choose a σ > 0 with

σ ≥ δ if supp K ⊆ B , and σ ≥ δ |log δ|² otherwise,
(24)

and we further assume that this σ satisfies both σ ≤ δ_thick/2 and τ > ψ(2σ). Moreover, for fixed ς ≥ 1 and n ≥ 1 satisfying the assumptions (22), we pick an ε > 0 satisfying the bound

ε ≥ C ( ‖h‖_∞ |log δ| ς / (n δ^d) )^{1/2} (25)

and, if K does not have bounded support, also

ε ≥ max{1, 2d 2^d vol_d} c δ^{|log δ|} |log δ|^d . (26)

Now assume that for each data set D ∈ X^n sampled from P^n, we feed Algorithm 1 with the level set estimators (L_{D,ρ})_{ρ≥0} given by (15), the parameters τ and ε, and a start level ρ0 ≥ ε. Then the following statements are true:

i) If P satisfies Assumption S and ρ0 ≥ ρ*, then with probability P^n not less than 1 − e^{−ς}, Algorithm 1 returns ρ0 and L_{D,ρ0}, and with M̂_{ρ*} := ∪_{ρ>ρ*} M_ρ we have

λ_d(L_{D,ρ0} △ M̂_{ρ*}) ≤ λ_d(M^{+2σ}_{ρ0−ε} \ M̂_{ρ*}) + λ_d(M̂_{ρ*} \ M^{−2σ}_{ρ0+ε}). (27)

ii) If P satisfies Assumption M and we have an

ε* ≥ ε + inf{ ε′ ∈ (0, ρ** − ρ*] : τ*(ε′) ≥ τ } (28)

with 9ε* ≤ ρ** − ρ*, then with probability P^n not less than 1 − e^{−ς}, we have a D ∈ X^n such that the following statements are true for Algorithm 1:

(a) The returned level ρ_{D,out} satisfies both ρ_{D,out} ∈ [ρ* + 2ε, ρ* + ε* + 5ε] and ψ(2σ) < τ ≤ 3τ*(ρ_{D,out} − ρ* + ε).

(b) Two sets B1(D) and B2(D) are returned and these can be ordered such that for A^i_{ρ_{D,out}+ε} ∈ C(M_{ρ_{D,out}+ε}) ordered in the sense of A^i_{ρ_{D,out}+ε} ⊆ A*_i we have

Σ_{i=1}^{2} λ_d(B_i(D) △ A*_i) ≤ Σ_{i=1}^{2} λ_d(A*_i \ (A^i_{ρ_{D,out}+ε})^{−2σ}) + λ_d(M^{+2σ}_{ρ_{D,out}−ε} \ {h > ρ*}). (29)

For our subsequent asymptotic analysis we note that the assumptions δ ∈ (0, e^{−1}] and ς ≥ 1 of Theorem 17 show that (26) is satisfied if

max{1, 2d 2^d vol_d} c δ^{|log δ|+d/2} √n ≤ C , (30)

and if we choose δ in terms of n, i.e. δ = δ_n, then (30) is satisfied for large n if δ_n ∈ O(n^{−a}) for some small a > 0. We shall see below that such rates for δ_n are typical. While (24) only provides a lower bound on possible values for σ, Theorem 17 actually indicates that σ should not be chosen significantly larger than these lower bounds, either.
Indeed, the choice of σ also implies a minimal value for τ via the condition τ > ψ(2σ), which in turn influences ε* by (28). Namely, larger values of σ lead to larger τ and thus to larger ε*. As a result, the guarantees in (a) become weaker, and in addition, larger values of σ also lead to weaker guarantees in (b). For a similar reason we do not consider kernels K with heavier tails than (9). Indeed, if K only has a polynomial upper bound for its tail, i.e., there are constants c and α > d with

K(x) ≤ c ‖x‖_2^{−α} , x ∈ R^d ,

then κ_1(r) ≲ r^{−α+d} and κ_∞(r) ≲ r^{−α}. Now, if we picked σ = δ |log δ|^b for some b > 0, then we would need to replace (26) by a bound of the form c δ^{−d} |log δ|^{−αb} ≤ ε, and this would rule out ε → 0 for δ → 0. As a result, no rates would be possible. Now, one could address this by choosing σ := δ^b for some b ∈ (0, 1), which in turn would require a bound of the form c′ δ^{α(1−b)−d} ≤ ε instead of (26). Arguing as around (30), this is guaranteed if c′ δ^{α(1−b)−d/2} √n ≤ C, and for δ → 0 the latter would require b < 1 − d/(2α). In particular, b would be strictly bounded away from 1. However, such a choice for σ would significantly weaken the guarantees given in (a) and (b) as explained above, and as a consequence, the rates obtained below would be worse. Note that from a high-level perspective this phenomenon is not surprising: indeed, heavier tails smooth out the infinite-sample density estimator h_{P,δ}, and as a consequence, the uncertainty guarantees (16) become worse in the horizontal direction, i.e., we get more blurry estimates L_{D,ρ} of M_ρ. However, for the detection of connected components at a level ρ, less blurry estimates are preferable. In the remainder of this section, we illustrate how the finite sample guarantee of Theorem 17 can be used to derive both consistency and rates. We begin with the following result.

Corollary 18 Let P be a distribution with bounded Lebesgue density and with Assumption P being satisfied.
Moreover, let K be a symmetric kernel with exponential tail behavior, for which the assumptions of Theorem 16 hold. Let (δ_n) be a positive sequence with δ_n ∈ O(n^{−a}) for some a > 0, and pick a null sequence (σ_n) satisfying (24) for all sufficiently large n. Moreover, let (ε_n) and (τ_n) be positive null sequences with ψ(2σ_n) < τ_n for all sufficiently large n, and

lim_{n→∞} (log δ_n^{−1}) / (n ε_n² δ_n^d) = 0 .

Now assume that for each data set D ∈ X^n sampled from P^n, we feed Algorithm 1 with the level set estimators (L_{D,ρ})_{ρ≥0} given by (15), the parameters τ_n and ε_n, and the start level ρ0 := ε_n. Then the following statements are true:

i) If P satisfies Assumption S with ρ* = 0, then for all ϵ > 0 we have

lim_{n→∞} P^n( D ∈ X^n : 0 < ρ_{D,out} ≤ ϵ ) = 1 ,

and if λ_d(cl{h > 0} \ {h > 0}) = 0 we also have

lim_{n→∞} P^n( D ∈ X^n : λ_d(L_{D,ρ_{D,out}} △ {h > 0}) ≤ ϵ ) = 1 .

ii) If P satisfies Assumption M, then, for all ϵ > 0, we have

lim_{n→∞} P^n( D ∈ X^n : 0 < ρ_{D,out} − ρ* ≤ ϵ ) = 1 ,

and, if λ_d(cl(A*_1 ∪ A*_2) \ (A*_1 ∪ A*_2)) = 0, we also have, for B1(D), B2(D) as in (29):

lim_{n→∞} P^n( D ∈ X^n : λ_d(B1(D) △ A*_1) + λ_d(B2(D) △ A*_2) ≤ ϵ ) = 1 .

Our next goal is to establish rates of convergence for estimating ρ* and the clusters. We begin with a result providing a rate for ρ_{D,out} − ρ* → 0. To this end we need to recall the following definition from Steinwart (2015a) that describes how well the clusters are separated above ρ*.

Definition 19 Let Assumption M be satisfied. Then the clusters of P have a separation exponent κ ∈ (0, ∞], if there is a constant c_sep > 0 such that for all ε ∈ (0, ρ** − ρ*] we have

τ*(ε) ≥ c_sep ε^{1/κ} .

Moreover, the separation exponent κ is exact, if there is a c′_sep > 0 such that

τ*(ε) ≤ c′_sep ε^{1/κ} , ε ∈ (0, ρ** − ρ*] .

The separation exponent describes how fast the connected components of M_ρ approach each other for ρ → ρ*. The best separation exponent is κ = ∞, and in this case we have d(A*_1, A*_2) ≥ c_sep, i.e. the clusters A*_1 and A*_2 do not touch each other. The separation exponent makes it possible to find a good value for ε* in Theorem 17.
Indeed, the proof of (Steinwart, 2015a, Theorem 4.3) shows that the value ε* := ε + (τ/c_sep)^κ satisfies (28) as soon as we have 9ε* ≤ ρ** − ρ*. Consequently, the bound in part ii) (a) of Theorem 17 becomes

2ε ≤ ρ_{D,out} − ρ* ≤ 6ε + (τ/c_sep)^κ (31)

if we have a separation exponent κ ∈ (0, ∞]. Moreover, if the separation exponent κ ∈ (0, ∞) is exact and we choose τ ≥ 2ψ(2σ), then (31) can be improved to

(τ/(3c′_sep))^κ − ε ≤ ρ_{D,out} − ρ* ≤ 6ε + (τ/c_sep)^κ

as the proof of (Steinwart, 2015a, Theorem 4.3) shows. To establish rates, it thus suffices to find null sequences (ε_n), (δ_n), (σ_n), and (τ_n) that satisfy (24) and (25), and additionally δ_n ∈ O(n^{−a}) for some a > 0 if K does not have bounded support. The following corollary presents the resulting rates, which are, modulo logarithmic terms, the best ones we can obtain from this approach.

Corollary 20 Let P be a distribution for which Assumption M is satisfied and whose Lebesgue density is bounded. Moreover, consider a symmetric kernel K with exponential tail behavior, for which the assumptions of Theorem 16 hold. In addition, assume that the clusters of P have separation exponent κ ∈ (0, ∞). Furthermore, let (ε_n), (δ_n), (σ_n), and (τ_n) be sequences with

ε_n ≍ ( (log n)³ log log n / n )^{γκ/(2γκ+d)} , δ_n ≍ ( (log n)³ log log n / n )^{1/(2γκ+d)} , σ_n ≍ (log n)² δ_n , τ_n ≍ ( (log n)³ log log n / n )^{γ/(2γκ+d)} ,

and assume that, for n ≥ 1 and D ∈ X^n sampled from P^n, we feed Algorithm 1 with the level set estimators (L_{D,ρ})_{ρ≥0} given by (15), the parameters τ_n and ε_n, and the start level ρ0 := ε_n. Then there exists a K ≥ 1 such that for all sufficiently large n we have

P^n( D ∈ X^n : 0 ≤ ρ_{D,out} − ρ* ≤ K ε_n ) ≥ 1 − 1/n . (32)

Moreover, if the separation exponent κ is exact, there exists another constant K ≥ 1 such that for all sufficiently large n we have

P^n( D ∈ X^n : K^{−1} ε_n ≤ ρ_{D,out} − ρ* ≤ K ε_n ) ≥ 1 − 1/n .

Finally, if κ = ∞ and supp K ⊆ B, then (32) holds for all sufficiently large n, if σ_n = δ_n and

ε_n ≍ ( (log n) (log log n)² / n )^{1/2} , δ_n ≍ (log log n)^{−1/(2d)} , and τ_n ≍ (log log n)^{−γ/(2d)} .

Note that the rates obtained in Corollary 20 only differ by the factor (log n)² from the rates in (Steinwart, 2015a, Corollary 4.4).
Moreover, if K has bounded support, then an easy modification of the above corollary yields exactly the same rates as in (Steinwart, 2015a, Corollary 4.4). Our next goal is to establish rates for λ_d(B_i(D) △ A*_i) → 0. Since this is a modified level set estimation problem, we need to recall some assumptions used in this context. The first assumption in this direction is a one-sided variant of a well-known condition introduced by Polonik (1995).

Definition 21 Let P be a distribution on X ⊆ R^d that has a Lebesgue density h. For a level ρ ≥ 0, we say that P has flatness exponent ϑ ∈ (0, ∞], if there is a c_flat > 0 with

λ_d( {0 < h − ρ < s} ) ≤ (c_flat s)^ϑ,   s > 0.

Note that the larger ϑ is, the steeper h must approach ρ from above. In particular, for ϑ = ∞, the density h is allowed to take the value ρ but is otherwise bounded away from ρ. The second definition describes, in some sense, the roughness of the boundary of the clusters.

Definition 22 Let Assumption M be satisfied and α ∈ (0, 1]. Then the clusters have an α-smooth boundary, if there is a c_bound > 0 such that, for all ρ ∈ (ρ*, ρ**] and δ ∈ (0, δ_thick], we have

λ_d( (A^i_ρ)^{+δ} \ (A^i_ρ)^{−δ} ) ≤ c_bound δ^α,

where i ∈ {1, 2} and A^1_ρ and A^2_ρ denote the two connected components of the level set M_ρ.

Note that in R^d, considering α > 1 does not make sense, and for an A ⊆ R^d with rectifiable boundary we always have α = 1, see (Steinwart, 2015b, Lemma A.10.4). The following assumption collects the different properties of P we have introduced.

Assumption R. Assumption M is satisfied and P has a bounded Lebesgue density h. Moreover, P has flatness exponent ϑ ∈ (0, ∞] at level ρ*, its clusters have an α-smooth boundary for some α ∈ (0, 1], and its clusters have separation exponent κ ∈ (0, ∞].

With the help of Assumption R we can now establish rates for estimating the clusters, that is, for λ_d(B_i(D) △ A*_i) → 0.

Corollary 23 Let Assumption R be satisfied, let K be as in Corollary 20, and write ϱ := min{α, ϑγκ}.
Furthermore, let (ε_n), (δ_n), (σ_n), and (τ_n) be sequences with

ε_n ≍ ( (log n)³ / n )^{ϱ/(2ϱ+ϑd)} (log log n)^{ϑd/(8ϱ+4ϑd)},   δ_n ≍ (log n · log log n) σ_n,   σ_n ≍ ( (log n)³ log log n / n )^{ϑ/(2ϱ+ϑd)},   τ_n ≍ ( (log n)³ (log log n)² / n )^{γϑ/(2ϱ+ϑd)}.

Assume that, for n ≥ 1, we feed Algorithm 1 as in Corollary 20. Then there is a constant K ≥ 1 such that, for all n ≥ 1 and the ordering as in (29), we have

P^n( D ∈ X^n : Σ_{i=1}^2 λ_d(B_i(D) △ A*_i) ≤ K ( (log n)³ (log log n)² / n )^{ϑϱ/(2ϱ+ϑd)} ) ≥ 1 − 1/n.

Again, the rates obtained in Corollary 23 only differ by the factor (log n)² from the rates in (Steinwart, 2015a, Corollary 4.8). Moreover, if K has bounded support, then an easy modification of Corollary 23 again yields exactly the same rates as in (Steinwart, 2015a, Corollary 4.8).

Our final goal is to modify the adaptive parameter selection strategy for the histogram-based clustering algorithm of Steinwart (2015a) to our KDE-based clustering algorithm. To this end, let ∆ ⊆ (0, 1] be finite and n ≥ 1, ς ≥ 1. For δ ∈ ∆, we fix σ_{δ,n} > 0 and τ_{δ,n} > 0 such that (24) and τ_{δ,n} ≥ 2ψ(2σ_{δ,n}) are satisfied. In addition, we define

ε_{δ,n} := C_u ( |log δ| (ς + log |∆|) log log n / (δ^d n) )^{1/2} + max{1, 2^{d/2} vol_d} · c · δ |log δ|^d,   (33)

where C_u ≥ 1 is some user-specified constant and the second term can be omitted if the used kernel K has bounded support. Now assume that, for each δ ∈ ∆, we run Algorithm 1 with the parameters ε_{δ,n} and τ_{δ,n}, with the start level ρ_0 := ε_{δ,n}, and with the level set estimators (L_{D,ρ})_{ρ≥0} given by (15). Let us consider a width δ_{D,∆} that achieves the smallest returned level, i.e.

δ_{D,∆} ∈ argmin_{δ∈∆} ρ_{D,δ,out}.   (34)

In general, this width may not be unique, and hence we assume in the following that we have a well-defined choice, e.g. the smallest δ satisfying (34). Moreover, we write

ρ*_{D,∆} := min_{δ∈∆} ρ_{D,δ,out}

for the smallest returned level. Note that unlike the width δ_{D,∆}, the level ρ*_{D,∆} is always unique. Finally, we define ε_{D,∆} := ε_{δ_{D,∆},n} and τ_{D,∆} := τ_{δ_{D,∆},n}. With these preparations we can now present the following finite sample bound for ρ*_{D,∆}.
Theorem 24 Let P be a distribution for which Assumption M is satisfied and whose Lebesgue density is bounded. Moreover, consider a symmetric kernel K with exponential tail behavior, for which the assumptions of Theorem 16 hold. In addition, assume that the two clusters of P have separation exponent κ ∈ (0, ∞]. For a fixed finite ∆ ⊆ (0, e^{−1}], and n ≥ 1, ς ≥ 1, and C_u ≥ 1, we define ε_{δ,n} by (33) and σ_{δ,n} > 0 and τ_{δ,n} > 0 such that (24), τ_{δ,n} ≥ 2ψ(2σ_{δ,n}), and 2σ_{δ,n} ≤ δ_thick are satisfied for all δ ∈ ∆. Furthermore, assume that 4C_u² log log n ≥ C ‖h‖_∞, where C is the constant in (23), and that ε_{δ,n} + (τ_{δ,n}/c_sep)^κ ≤ (ρ** − ρ*)/9 for all δ ∈ ∆. Then we have

P^n( D ∈ X^n : ε_{D,∆} < ρ*_{D,∆} − ρ* ≤ min_{δ∈∆} { (τ_{δ,n}/c_sep)^κ + 6ε_{δ,n} } ) ≥ 1 − e^{−ς}.

Moreover, if the separation exponent κ is exact and κ < ∞, then we even have

P^n( D ∈ X^n : min_{δ∈∆} { c_1 τ_{δ,n}^κ + ε_{δ,n} } < ρ*_{D,∆} − ρ* ≤ min_{δ∈∆} { c_2 τ_{δ,n}^κ + 6ε_{δ,n} } ) ≥ 1 − e^{−ς},

where c_1 := (1/4)(6c̄_sep)^{−κ} and c_2 := c_sep^{−κ}, and similarly

P^n( D ∈ X^n : c_1 τ_{D,∆}^κ + ε_{D,∆} < ρ*_{D,∆} − ρ* ≤ c_2 τ_{D,∆}^κ + 6ε_{D,∆} ) ≥ 1 − e^{−ς}.

For an adaptive parameter selection strategy, it thus suffices to define appropriate ∆, σ_{δ,n}, and τ_{δ,n}. Here we proceed as in (Steinwart, 2015a, Section 5). Namely, for n ≥ 16, we consider the interval

I_n := [ ( log n (log log n)² / n )^{1/d}, 1/log log n ]

and fix some n^{−1/d}-net ∆_n ⊆ I_n of I_n with |∆_n| ≤ n. Furthermore, for some fixed C_u ≥ 1 and n ≥ 16, we define σ_{δ,n} by (24), write τ_{δ,n} := σ_{δ,n}^γ log log log n, and define ε_{δ,n} by (33) for all δ ∈ ∆_n and ς = log n. Following the ideas of the proofs of (Steinwart, 2015a, Corollaries 5.2 and 5.3) we then obtain a constant K such that for all sufficiently large n ≥ 16 we have

P^n( D ∈ X^n : ρ*_{D,∆_n} − ρ* ≤ K ( (log n)³ (log log n)² / n )^{γκ/(2γκ+d)} ) ≥ 1 − 1/n.   (36)

Here, (36) holds if P has separation exponent κ ∈ (0, ∞), and if the kernel K has bounded support, it remains true for κ = ∞. In addition, the upper bound in (36) can be matched by a lower bound that only differs by a double logarithmic factor provided that the separation exponent κ ∈ (0, ∞) is exact.
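The selection rule (34) simply runs Algorithm 1 once per candidate width and keeps a width achieving the smallest returned level. A minimal sketch of this loop, in which `returned_level` is a hypothetical stand-in for a full run of Algorithm 1 rather than the actual estimator:

```python
import math

def returned_level(delta, n=5000):
    # hypothetical surrogate for rho_{D,delta,out}: too small a width inflates
    # the stochastic term, too large a width inflates the smoothing term
    return 0.05 + delta ** 2 + math.log(n) / (n * delta ** 2)

# geometrically spaced candidate widths, as in the experiments of Section 7
candidates = [0.01 * 1.25 ** k for k in range(25)]

levels = {delta: returned_level(delta) for delta in candidates}
delta_star = min(levels, key=levels.get)   # a width chosen by (34)
rho_star = levels[delta_star]              # the smallest returned level
```

Note that `delta_star` may be non-unique in general, mirroring the remark after (34), while `rho_star` is always unique.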
Finally, if Assumption R is satisfied, we further find

P^n( D ∈ X^n : Σ_{i=1}^2 λ_d(B_i(D) △ A*_i) ≤ K̂ ( (log n)³ (log log n)² / n )^{ϑγκ/(2γκ+ϑd)} ) ≥ 1 − 1/n

for all sufficiently large n ≥ 16, where K̂ is another constant independent of n. We finally mention that the presented adaptive strategy works optimally as long as all split levels have the same exact separation exponent κ. Indeed, as described above, our strategy detects, modulo some logarithmic terms, an asymptotically optimal choice δ_{D,∆} for the first split level, and since for all other split levels this choice is also asymptotically optimal as long as they have the same exact separation exponent κ, we see that Algorithm 2 is adaptive when using δ_{D,∆}.

6. Comparisons

In this section we compare our findings to the most closely related papers, namely (Wang et al., 2019) and (Chaudhuri et al., 2014). In particular, we discuss the different assumptions as well as the obtained statistical guarantees. To begin with, let us have a look at the assumptions on P, respectively its density h, made by Wang et al. (2019). Here, we first note that h is assumed to be α-smooth for some α > 0, that is, h is s := ⌈α⌉ − 1 times continuously differentiable and all partial derivatives of order s are (α − s)-Hölder continuous. In general, h does not need to have compact support; instead, Wang et al. (2019) only assume that {h ≥ ρ} is compact for all ρ > ρ*, where ρ* is the smallest split level, see their Assumptions C and C′, respectively. In this respect we note that for P having a continuous density their notion of split levels is closely related to ours and that their Assumption S(κ) equals our separation exponent κ. Finally, Wang et al. (2019) do not impose a thickness assumption; instead a so-called inner cone condition is considered.
Recall that this cone condition assumes that constants ε_I > 0, c_I > 0, and r_I > 0 exist such that for all split levels ρ* and all ρ ∈ (ρ* − ε_I, ρ* + ε_I) we have

λ_d( B(x, r) ∩ {h ≥ ρ} ) ≥ c_I r^d,   x ∈ {h ≥ ρ}, r ∈ (0, r_I].

In general, it seems unclear how this condition relates to our thickness assumption, but in many natural situations the inner cone condition and the thickness assumption with γ = 1 are simultaneously satisfied. For details, we refer to the discussion in (Steinwart, 2015b, Appendix A.5), the rather generic examples considered in (Steinwart, 2015c, Appendix B.2), and the discussion in (Wang et al., 2019, page 15). In the following comparison we therefore assume that the inner cone condition and the thickness assumption with γ = 1 are simultaneously satisfied. In addition, we assume that h has finitely many split levels.

Like ours, the clustering algorithm of Wang et al. (2019) is also based on a kernel density estimator. However, their central Algorithm 2 uses so-called α-valid kernels, whose KDEs enjoy the ideal approximation error behavior ‖h_{P,δ} − h‖_∞ ≼ δ^α for δ → 0 and α-smooth densities h. We refer to (Wang et al., 2019, page 9) for details. Finally, their split level estimator uses a verification strategy that is similar to Line 3 of our Algorithm 1, see (Wang et al., 2019, Definition 6). To compare the split level guarantee of Wang et al. (2019) to ours, we assume that h is α-smooth, that all split levels have separation exponent α, and that h satisfies the additional assumptions discussed above. Also, we assume that their Algorithm 2 uses an α-valid kernel. Then (Wang et al., 2019, Proposition 3) shows that, modulo some logarithmic factors, all split levels can be estimated with rate n^{−α/(2α+d)}. Since we can choose κ = α and γ = 1, this rate coincides with ours established in Corollary 20 if we again ignore logarithmic factors.
Given any κ > 0, however, our results do not require h to be κ-smooth, while (Wang et al., 2019, Proposition 3) only holds for κ-smooth densities. Moreover, note that there are α-smooth densities with κ > α; in fact, we may have α = 1 and κ = ∞, and for such densities our rates are significantly better than those of Wang et al. (2019). In addition, our clustering algorithm does not need to use kernels that are aligned with the smoothness of h, and unlike Wang et al. (2019) we also present a way to make our algorithm adaptive to the unknown κ. Finally, we would like to emphasize that apart from Proposition 3, (Wang et al., 2019) contains several other interesting results, which are, however, not comparable to ours.

Unlike the comparison to (Wang et al., 2019), the comparison to (Chaudhuri et al., 2014) turns out to be much more involved, as the latter paper uses assumptions that are quite different from ours. Providing a detailed comparison is therefore beyond the scope of this paper and we refer the interested reader to (Steinwart et al., 2021, Appendix A) for a detailed comparison. To summarize the key points of this comparison, we show that, up to logarithmic factors, the best possible convergence rate achieved by the central Theorem VII.5 of Chaudhuri et al. (2014) for estimating the split levels is n^{−α/(3α+d)}, while our algorithm achieves a rate of n^{−α/(2α+d)}, which is strictly better for all dimensions, with α being the Hölder exponent of the underlying density. In particular, our rates can be achieved without knowing α, while Chaudhuri et al. (2014) do not offer such adaptivity. In addition, our results can handle discontinuous densities, e.g. step functions on rectangles with mutually positive distances, while their Theorem VII.5 does not provide any guarantee at all for such densities.
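The claimed improvement can be read off the exponents alone: α/(2α + d) > α/(3α + d) for every α > 0 and d ≥ 1. A trivial check:

```python
def exponent_ours(alpha, d):
    # split-level rate n^{-alpha/(2*alpha+d)} from Corollary 20 (log factors ignored)
    return alpha / (2 * alpha + d)

def exponent_cdks(alpha, d):
    # best possible rate n^{-alpha/(3*alpha+d)} from Theorem VII.5 of
    # Chaudhuri et al. (2014), as discussed above
    return alpha / (3 * alpha + d)

for d in (1, 2, 10, 100):
    assert exponent_ours(1.0, d) > exponent_cdks(1.0, d)

print(exponent_ours(1.0, 2), exponent_cdks(1.0, 2))  # 0.25 0.2
```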
In particular, the consistency results for their unpruned algorithm require mild uniform continuity conditions, see the end of Section III.B in (Chaudhuri et al., 2014), and the guarantees for the pruned algorithm stated in their Theorem VII.5 explicitly require control on the uniform modulus of continuity. Finally, apart from split level estimation, our results also provide guarantees for the corresponding clusters in measure, while no such guarantees are provided in (Chaudhuri et al., 2014).

7. Experiments

In this section, we illustrate the behavior of our generic KDE-based clustering algorithm on a few artificial data sets for which the ground truth clustering can be computed. In addition, we compare its performance to k-means and hierarchical clustering.

Data. We consider six cluster problems, which are based on two-dimensional distributions cut down to [0, 1]², with different degrees of difficulty: The first distribution, see Figure 9, is a mixture of 15 Gaussian distributions and a uniform background distribution. In addition, 12 out of the 15 well-separated distributions have a covariance of the form λI₂, where I₂ is the 2 × 2 identity matrix and λ is some scaling factor, while the remaining 3 Gaussians have a different covariance matrix. Since this distribution was inspired by the S2 data set of Fränti and Virmajoki (2006) we will call it S2 in the following. The second distribution, see Figure 10, is a modification of the first. Namely, two of the 15 clusters have been shrunken and moved towards each other, while the remaining clusters have been left unchanged. In the following we call it S2 modified. The third distribution, called toy 3G, is a mixture of 3 Gaussian distributions with similar but not identical covariances, see Figure 11. Unlike in the first data set, however, the Gaussians are less separated and less concentrated, making this distribution slightly more difficult.
The fourth distribution, see Figure 12, is based upon the third distribution. To be more precise, two Gaussians with close-by centers and with small variances have been added. In the following we call it toy 5G.

Figure 9: Left. A sample data set (top) of size n = 5000 from S2 and its ground truth clustering (bottom). Middle-Left. The 25th (bottom) and 75th (top) best run of kmeans on the S2 data sets. Middle. Corresponding 25th (bottom) and 75th (top) best runs of hclust. Middle-Right and Right. The 25th (bottom) and 75th (top) best runs of our algorithm for the moving window kernel (middle-right) and the Epanechnikov kernel (right). All algorithms perform well.

The fifth distribution, called bananas, is a variant of the classical bananas, or two-moon, data distribution, which is often used to assess the clustering performance on non-centroid-like clusters. Unlike its usual form, however, our variant consists of two very fuzzy bananas, which makes even a visual identification of the two clusters not immediately obvious, see Figure 13. As a result of this noise, both clusters can be almost perfectly separated by a diagonal line, and therefore it is in principle possible to find a decent clustering with the help of two suitably chosen centroids. The sixth distribution, see Figure 14, is another mixture of Gaussians. In this distribution, however, each of its three clusters corresponds to a mixture of two Gaussians that have the same mean but different variances. In addition, the clusters merge into each other, and as a result, this data set can be viewed as the most challenging one from a visual perspective. In the following, we call it crosses. For each of the six cluster problems described above and each of the sample sizes

n ∈ {2500, 3000, 3500, 4200, 5000, 6000, 7000, 8200, 10000, 14000, 20000}

we generated 100 data sets.
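A toy sampler in the spirit of these distributions — a mixture of isotropic Gaussians plus a uniform background, cut down to [0, 1]² — can be sketched as follows; the means, weights, and variances are illustrative placeholders, not the parameters actually used for S2:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, means, cov_scale=0.001, background=0.1):
    """Draw n points from a mixture of isotropic Gaussians plus a uniform
    background, rejecting points outside [0, 1]^2 (cut-down distribution)."""
    pts = []
    while len(pts) < n:
        if rng.random() < background:            # uniform background point
            p = rng.random(2)
        else:                                    # point from a random component
            m = means[rng.integers(len(means))]
            p = rng.normal(m, np.sqrt(cov_scale))
        if 0.0 <= p[0] <= 1.0 and 0.0 <= p[1] <= 1.0:
            pts.append(p)
    return np.array(pts)

means = [(0.2, 0.2), (0.8, 0.3), (0.5, 0.8)]     # illustrative cluster centers
D = sample_mixture(5000, means)
print(D.shape)  # (5000, 2)
```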
In addition, we also computed the true densities of the 6 distributions on a 1000 × 1000 grid of [0, 1]² to find a high-resolution approximation of the ground truth clustering. Applying these ground truth clusters to the data sets makes it possible to assess how well the different algorithms work.

Performance Measures. Besides visual inspection we consider two different ways of comparing clustering algorithms. To describe these comparisons, let us assume that we have a data set D, ground truth clusters C₁, . . . , C_k ⊆ D, and estimated clusters Ĉ₁, . . . , Ĉ_m ⊆ D.

Figure 10: Left. A sample data set (top) of size n = 5000 from S2 modified and its ground truth clustering (bottom). Middle-Left. The 25th (bottom) and 75th (top) best run of kmeans on the S2 modified data sets. Middle. Corresponding 25th (bottom) and 75th (top) best runs of hclust. Middle-Right and Right. The 25th (bottom) and 75th (top) best runs of our algorithm for the moving window kernel (middle-right) and the Epanechnikov kernel (right). Even in the better runs kmeans and hclust cannot separate the two modified clusters in the bottom-right, while for our algorithms a problem occurs only for the moving window kernel in the worse run. Namely, in this case one cluster remained undetected.

In all experiments we always observed m ≤ k. Since k-means and hierarchical clustering produce clusterings with Ĉ₁ ∪ · · · ∪ Ĉ_m = D, while the ground truth clusters do not have this property for our data sets, we restrict the considered data set to the samples that occur in both a true and an estimated cluster, that is

D_perf := { x ∈ D : x ∈ (C₁ ∪ · · · ∪ C_k) ∩ (Ĉ₁ ∪ · · · ∪ Ĉ_m) }.

Since in general the jth estimated cluster does not relate to the jth ground truth cluster, we first needed to find a suitable matching, that is, an injective map Φ : {1, . . . , m} → {1, . . . , k}. To this end, we define the matching error to be

E(Φ) := |{ x ∈ D_perf : x ∉ C_{Φ(j(x))} }| / |D_perf|,   (37)

where for all x ∈ D_perf, we denote by j(x) the unique index j with x ∈ Ĉ_j.
In other words, E(Φ) equals the fraction of samples x in D_perf for which Φ does not provide a correct matching of the cluster Ĉ_{j(x)} to C_{Φ(j(x))}. Note that if Φ is a perfect matching in the sense of Ĉ_j ∩ D_perf = C_{Φ(j)} ∩ D_perf for all j = 1, . . . , m, then E(Φ) = 0. For this reason, we determined a matching Φ* that minimizes E(·), where we note that even in the case of the first two cluster problems this was computationally feasible with the help of a greedy approach followed by a brute-force calculation over a restricted set of permutations. The corresponding optimal matching errors E(Φ*), averaged over all 100 repetitions of each cluster problem and each considered algorithm, are reported in Figure 15.

Figure 11: Left. A sample data set (top) of size n = 5000 from toy 3G and its ground truth clustering (bottom). Middle-Left. The 25th (bottom) and 75th (top) best run of kmeans on the toy 3G data sets. Middle. Corresponding 25th (bottom) and 75th (top) best runs of hclust. Middle-Right and Right. The 25th (bottom) and 75th (top) best runs of our algorithm for the moving window kernel (middle-right) and the Epanechnikov kernel (right). All algorithms perform well.

At this point let us briefly discuss alternative ways to define the (optimal) matching error. To this end, we define C₀ := D \ (C₁ ∪ · · · ∪ C_k) and Ĉ₀ := D \ (Ĉ₁ ∪ · · · ∪ Ĉ_m), that is, we have D_perf = D \ (C₀ ∪ Ĉ₀). Of course, we could also consider e.g. D_perf := D \ Ĉ₀ in the definition of (37) by setting Φ(0) := 0 for all considered maps Φ : {1, . . . , m} → {1, . . . , k}. In this alternative, all samples x ∈ C₀ \ Ĉ₀ are additionally counted for the matching error. Now kmeans and hclust both assign all samples to some estimated cluster, that is, we have Ĉ₀ = ∅. In other words, all samples x ∈ C₀ are automatically counted as an error for both algorithms.
This effect is clearly not desirable in a fair comparison, in particular since for some distributions, such as bananas and crosses, the set C₀ is substantial, see Figures 13 and 14. Conversely, if we consider D_perf := D \ C₀, then (37) additionally counts all samples x ∈ Ĉ₀ \ C₀. For kmeans and hclust, this set is empty by design, while our algorithms usually satisfy |Ĉ₀ \ C₀| > 0. Consequently, we would again have an undesirable effect. The final choice D_perf := D inherits both issues of the previously considered alternatives.

For the second numerical comparison, we again consider Φ*. For this Φ*, we counted the number of ground truth clusters not found by the considered algorithm, that is k − m, and added the number of non-covering clusters, that is, the number of clusters Ĉ_j that do not cover at least 50% of the samples in the matched ground truth cluster C_{Φ*(j)} ∩ D_perf.

Figure 12: Left. A sample data set (top) of size n = 5000 from toy 5G and its ground truth clustering (bottom). Middle-Left. The 25th (bottom) and 75th (top) best run of kmeans on the toy 5G data sets. Middle. Corresponding 25th (bottom) and 75th (top) best runs of hclust. Middle-Right and Right. The 25th (bottom) and 75th (top) best runs of our algorithm for the moving window kernel (middle-right) and the Epanechnikov kernel (right). Both kmeans and hclust cannot separate the two spatially smaller clusters on the right, and as a result, they are forced to cut through one of the spatially larger clusters. In contrast, our algorithms perform well.

The averages over the resulting identification errors

I(Φ*) := k − m + |{ j : |Ĉ_j ∩ C_{Φ*(j)} ∩ D_perf| / |C_{Φ*(j)} ∩ D_perf| < 1/2 }|

are reported in Figure 16. Finally, for visual inspection we ordered the 100 results of each of the algorithms described below on each of the six cluster problems with n = 5000 with the help of the joint error E(Φ*). We then plotted the clusterings of the 25th and 75th best run, see Figures 9 to 14.
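Both performance measures can be sketched as follows, assuming label vectors already restricted to D_perf (ground truth labels 0, ..., k−1 and estimated labels 0, ..., m−1); the exhaustive search for Φ* below is feasible only for small k, which is why the paper combines a greedy step with a brute force over a restricted set of permutations:

```python
from itertools import permutations

def matching_error(true_lab, est_lab, phi):
    """E(Phi) of (37): fraction of samples whose estimated cluster is
    matched to a ground truth cluster not containing them."""
    wrong = sum(phi[j] != c for c, j in zip(true_lab, est_lab))
    return wrong / len(true_lab)

def best_matching(true_lab, est_lab, k, m):
    """Brute-force minimiser Phi* of E over injective maps {0..m-1} -> {0..k-1}."""
    cands = (dict(enumerate(p)) for p in permutations(range(k), m))
    return min(cands, key=lambda phi: matching_error(true_lab, est_lab, phi))

def identification_error(true_lab, est_lab, phi, k, m):
    """I(Phi*): missed ground truth clusters (k - m) plus matched clusters
    covering less than 50% of their ground truth cluster."""
    non_covering = 0
    for j in range(m):
        target = [i for i, c in enumerate(true_lab) if c == phi[j]]
        hit = sum(1 for i in target if est_lab[i] == j)
        if hit / len(target) < 0.5:
            non_covering += 1
    return (k - m) + non_covering

true_lab = [0, 0, 0, 1, 1, 2, 2, 2]
est_lab  = [1, 1, 1, 0, 0, 0, 0, 0]   # only two estimated clusters, labels permuted
phi = best_matching(true_lab, est_lab, k=3, m=2)
print(matching_error(true_lab, est_lab, phi))                 # 0.25
print(identification_error(true_lab, est_lab, phi, k=3, m=2)) # 1
```

In the example, the best matching sends estimated cluster 1 to ground truth cluster 0 and estimated cluster 0 to ground truth cluster 2, so two of the eight samples are mismatched and one ground truth cluster goes undetected.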
These visualizations help to understand the sources of the typical errors made by the individual algorithms.

Algorithms. We implemented an iterative version of Algorithm 2, in which for k = 0, 1, . . . we first computed the τ-connected components of L_{ρ+kε} that do not vanish at level L_{ρ+(k+2)ε}, see also Line 3 of Algorithm 1. With the help of these connected components we then generated the corresponding cluster tree estimate, where we note that we skipped Line 7 of Algorithm 1 since this line has only been inserted into Algorithm 1 to simplify the form of the statistical guarantees. We then called the resulting cluster tree estimator for different KDE level set estimators, that is, for different values of δ. The first split in the cluster tree was then obtained by choosing the δ that resulted in the smallest first split level. If all split levels have the same separation exponent κ, then it asymptotically suffices to consider this δ for the entire cluster tree. In general, however, splits further up in the tree may have either a different κ or at least a different constant c_sep. In these cases, choosing a different δ at different splits may be beneficial.

Figure 13: Left. A sample data set (top) of size n = 5000 from bananas and its ground truth clustering (bottom). Middle-Left. The 25th (bottom) and 75th (top) best run of kmeans on the bananas data sets. Middle. Corresponding 25th (bottom) and 75th (top) best runs of hclust. Middle-Right and Right. The 25th (bottom) and 75th (top) best runs of our algorithm for the moving window kernel (middle-right) and the Epanechnikov kernel (right). While the implicit bias of kmeans seems to consistently direct it towards imperfect centroids, the performance of hclust seems to be rather sensitive to the instance of the data set. In comparison, our algorithms are less sensitive and perform sufficiently well even in their worse runs.
We incorporated this insight by considering, after each detected split level ρ, those δ whose estimated cluster tree has another split level within one of the clusters emerging at ρ. Among those δ we then again chose the one resulting in the smallest next split level. We considered 500 geometrically spaced candidate values of δ between c(ln(n)/n)^{1/d} and c(ln n)^{−1/d}, where in the experiments the factor c was determined by an estimate of the median mutual distance between the samples of the considered data set. Notice that modulo this factor c and some (double) logarithmic terms, this setup coincides with the theoretically derived one, see (35). Moreover, we considered both a plain moving window kernel and the Epanechnikov kernel, where in both cases the underlying norm was the Euclidean distance. Since both kernels have bounded support, we simply chose σ := δ, see (24), and

ε := 3 ( ‖h_{D,δ}‖_∞ n^{−1} δ^{−d} )^{1/2}

for each candidate value δ. Modulo some logarithmic terms, this choice for ε follows our theoretical insights, see (33). Finally, we decided to focus on thickness guarantees with the most natural choice γ := 1, see the detailed discussion in (Steinwart, 2015b, Appendix A.5); that is, we do not expect the algorithm to correctly keep clusters together that have thinner cusps or bridges. Based on this decision, we chose τ := (2 + ϵ)δ with ϵ = 0.00001, where we note that our theoretical findings actually hold

Figure 14: Left. A sample data set (top) of size n = 5000 from crosses and its ground truth clustering (bottom). Middle-Left. The 25th (bottom) and 75th (top) best run of kmeans on the crosses data sets. Middle. Corresponding 25th (bottom) and 75th (top) best runs of hclust. Middle-Right and Right. The 25th (bottom) and 75th (top) best runs of our algorithm for the moving window kernel (middle-right) and the Epanechnikov kernel (right).
While the implicit bias of kmeans seems to consistently direct it towards centroids with sufficiently benign Voronoi partitions, the behavior of hclust seems to be rather sensitive to the instance of the data set. In comparison, our algorithm with the moving window kernel is less sensitive and performs sufficiently well even in its worse runs, while the Epanechnikov kernel performs less reliably.

true for each value τ > 2δ, if one carefully tracks all constants. In addition, this choice makes it possible that the estimated clusters can be as close as ϵδ to each other. Besides our methods we also considered k-means and hierarchical clustering. To this end, we used the functions kmeans, kmeans++, and hclust of R. Both types of algorithms have some hyper-parameters, which ideally would be chosen in a data-driven approach. To ensure fairness for kmeans, kmeans++, and hclust, such data-driven approaches would need to be calibrated to at least one of our two performance measures E(Φ*) or I(Φ*). However, we are not aware of any method to reliably estimate these performance measures, or some calibrated surrogates for them, from data alone, and hence we proceeded differently. Namely, kmeans, kmeans++, and hclust received the correct number k of ground truth clusters as an input parameter, and kmeans was repeated with 100 random initializations using the parameter nstart = 100. In addition, we first considered the 3 different versions of the kmeans function on our data sets and identified the best one with the help of our performance metrics. Here it turned out that all three versions led to very similar results, with Lloyd having a marginal advantage over the other two. In addition, all three versions produced better results than kmeans++, probably because we used 100 random initializations for kmeans. Similarly, we considered the 8 different versions of hclust. It turned out
It turned out Adaptive Clustering Using Kernel Density Estimators Figure 15: Average matching errors E(Φ ) for optimal matching Φ , see (37) for the different cluster algorithms and for different sample sizes. While kmeans and hclust are essentially not influenced by the sample size, our methods clearly become better with increasing sample size n. that ward.D and ward.D2 performed best with a slight advantage of ward.D2, which we then used in the comparisons. In the experiments both kmeans and hclust were thus privileged in two ways: a) the correct number of clusters were given to them, while our algorithms had no information at all about the cluster problems at hand; and b) the best performing version of kmeans and hclust were chosen in hindsight, and with respect to our ground-truth performance measures. In contrast, our algorithms had to choose their hyper-parameters in a fully adaptive way and without access to ground truth performance measures. Results. Figures 15 and 16 clearly show that the performance of kmeans and hclust is almost independent of the data set size n, while our two algorithms heavily depend on n. For example, on five of the six cluster problems, namely on all but toy 5G, the matching errors achieved by our algorithms for n = 2500 are substantially worse than those of kmeans and hclust and the same is true for the identification errors. For n = 5000, however, the matching errors of our algorithms are at least as good as those of kmeans and hclust and on at least four of cluster problems our algorithms clearly outperform kmeans and hclust. For the identification errors the situation is still mixed, but this is not really surprising, as the first term k m in the computation of I(Φ ) vanishes for kmeans and hclust since the privileged information given to these two algorithms always ensures k = m. 
Despite this advantage, our algorithms always achieve identification errors that are at least as good

Figure 16: Average identification errors I(Φ*) for the optimal matching Φ*. Both kmeans and hclust have difficulties on the two modified data sets. Our methods become better with increasing sample sizes and in all cases, no errors are made for n ≥ 10000.

as those of kmeans and hclust as soon as we have n ≥ 8200. Finally, for n ≥ 10000 our algorithms work almost error free with respect to both performance measures on all six cluster problems. In terms of identification errors, both kmeans and hclust have substantial difficulties with the two modifications S2 modified and toy 5G, while on the remaining four cluster problems they perform almost flawlessly. A closer visual inspection reveals that both cluster algorithms cannot identify the two spatially smaller, close-by clusters in the modifications, see Figures 10 and 12, and this phenomenon occurs on at least most of the data sets of, e.g., size n = 5000. In comparison, our algorithm with the moving window kernel has some problems identifying one cluster in S2 modified, see Figure 10, but this phenomenon does not occur as often as the problems of kmeans and hclust. The same observation can be made for our algorithm working with the Epanechnikov kernel on crosses. Unsurprisingly, the poor identification error performance of kmeans and hclust on the two modified cluster problems S2 modified and toy 5G directly translates into poor matching errors. In addition, both algorithms have problems on bananas and on crosses, see Figure 15. The reasons for these problems are different: While kmeans constantly chooses imperfect centroids on the bananas data sets, see Figure 13, hclust seems to be sensitive to the particular instance of the data set. The same behavior of both algorithms can be observed on crosses, see Figure 14.
In comparison, a poor matching error performance of our algorithms seems to be directly related to the identification errors made in some cases.

8. Proofs

Proof of Lemma 4: Let us first assume that A^{−δ} ≠ ∅. Then there exists an x ∈ A^{−δ}, and since A^{−δ} is open there also exists an ε > 0 with B(x, ε) ⊆ A^{−δ}. Our first intermediate goal is to show

B(y, δ) ⊆ A   for all y ∈ B(x, ε).   (38)

To this end, we note that y ∈ B(x, ε) ⊆ A^{−δ} implies y ∉ (R^d \ A)^{+δ}. For z ∈ R^d \ A we thus find d(y, z) ≥ d(y, R^d \ A) > δ, which shows (38). Now observe that for the proof of δ < inrad A it clearly suffices to establish

B(x, δ + ε) ⊆ A.   (39)

Let us therefore fix a z ∈ B(x, δ + ε). By considering y := x in (38) we first note that in the case ‖x − z‖ ≤ δ we have z ∈ A. Hence it suffices to consider the case δ < ‖x − z‖ ≤ δ + ε. For t := δ/‖x − z‖ ∈ (0, 1) and y := (1 − t)z + tx, a quick calculation then shows both ‖y − z‖ = δ and ‖x − y‖ = ‖x − z‖ − δ ≤ ε. Applying (38) then yields (39).

Let us now assume that δ < inrad A. Then there exists an ε > 0 with δ + ε < inrad A, and hence we find an x ∈ A with B(x, δ + ε) ⊆ A. Clearly, it suffices to show that x ∈ A^{−δ}. To this end, we assume that x ∉ A^{−δ}, that is x ∈ (R^d \ A)^{+δ}. Since this means d(x, R^d \ A) ≤ δ, we then find a y ∈ R^d \ A with d(x, y) ≤ δ + ε. This contradicts B(x, δ + ε) ⊆ A.

8.1 Proofs for the Generic Algorithm in Section 3

Proof of Lemma 5: i) ⇒ ii). We choose a τ > 2c inrad M and δ := inrad M. By Lemma 4 we then know M^{−δ} = ∅, and our construction also gives τ > 2cδ. If |M^{+δ}| = 1 we obviously have diam M^{+δ} = 0 < τ. Moreover, if |M^{+δ}| > 1, there exist x, y ∈ M^{+δ} with ‖x − y‖ = diam M^{+δ}, since M^{+δ} is compact and ‖· − ·‖ : M^{+δ} × M^{+δ} → R is continuous. For A := {x, y} ⊆ M^{+δ} we then have |C_τ(A)| = 1, which shows ‖x − y‖ < τ. In summary, we therefore always have diam M^{+δ} < τ. Now a simple calculation shows diam M^{+δ} = 2δ + diam M, see also (Steinwart et al., 2021, Lemma 8.2), and thus we obtain τ > 2 inrad M + diam M. Since we initially chose τ > 2c inrad M arbitrarily, we conclude that 2c inrad M ≥ 2 inrad M + diam M. ii) ⇒ i).
Let us fix a δ > 0 with M^{−δ} = ∅. Lemma 4 then shows δ ≥ inrad M. We now fix a non-empty A ⊂ M^{+δ} and some τ > 2cδ. This yields

τ > 2(c − 1)δ + 2δ ≥ 2(c − 1) inrad M + 2δ ≥ diam M + 2δ = diam M^{+δ}.

For x, y ∈ A we thus find ‖x − y‖ ≤ diam A ≤ diam M^{+δ} < τ, and hence A is τ-connected, that is, |Cτ(A)| = 1.

Theorem 25 Let ρ∗ ≥ 0 and Assumption P be satisfied with |C(Mρ)| ≤ 1 for all ρ ≥ ρ∗. Moreover, let (Lρ)_{ρ≥0} be a decreasing family of sets Lρ ⊂ X such that

M^{−δ}_{ρ+ε} ⊂ Lρ ⊂ M^{+δ}_{ρ−ε} (40)

for some fixed ε, δ > 0 and all ρ ≥ ρ∗. For all ρ ≥ ρ∗ we then have:

i) If M^{−δ}_{ρ+3ε} ≠ ∅, then |Cτ(M^{−δ}_{ρ+ε})| = 1 for all τ > 2ψ∗_{Mρ+ε}(δ), and the corresponding CRM ζ : Cτ(M^{−δ}_{ρ+ε}) → Cτ(Lρ) satisfies

Cτ(Lρ) = ζ(Cτ(M^{−δ}_{ρ+ε})) ∪ {B′ ∈ Cτ(Lρ) : B′ ∩ Lρ+2ε = ∅}. (41)

ii) If M^{−δ}_{ρ+ε} = ∅, Assumption S is satisfied, and δ ∈ (0, δ_thick], then we have

|{B ∈ Cτ(Lρ) : B ∩ Lρ+2ε ≠ ∅}| ≤ 1, τ > 2c_thick δ^γ. (42)

Proof of Theorem 25: i). We first note that M^{−δ}_{ρ+3ε} ≠ ∅ implies M^{−δ}_{ρ+ε} ≠ ∅. Now, (Steinwart, 2015b, Lemma A.4.3) showed that for all bounded A ⊂ ℝ^d with |C(A)| < ∞, all δ > 0 with A^{−δ} ≠ ∅, and all τ > 2ψ∗_A(δ) we have |Cτ(A^{−δ})| ≤ |C(A)|. Thus we find |Cτ(M^{−δ}_{ρ+ε})| ≤ |C(Mρ+ε)| ≤ 1, and by the already observed M^{−δ}_{ρ+ε} ≠ ∅ we conclude that |Cτ(M^{−δ}_{ρ+ε})| = 1. To establish (41) we now write A := M^{−δ}_{ρ+ε} and B := ζ(A). Our first intermediate goal is to establish the following disjoint union:

Cτ(Lρ) = {B} ∪ {B′ ∈ Cτ(Lρ) \ {B} : B′ ∩ Lρ+2ε ≠ ∅} ∪ {B′ ∈ Cτ(Lρ) : B′ ∩ Lρ+2ε = ∅}. (43)

To this end, note that M^{−δ}_{ρ+3ε} ≠ ∅ and M^{−δ}_{ρ+3ε} ⊂ A together with A ⊂ ζ(A) = B implies

∅ ≠ M^{−δ}_{ρ+3ε} = A ∩ M^{−δ}_{ρ+3ε} ⊂ B ∩ Lρ+2ε,

and thus {B′ ∈ Cτ(Lρ) \ {B} : B′ ∩ Lρ+2ε = ∅} = {B′ ∈ Cτ(Lρ) : B′ ∩ Lρ+2ε = ∅}. This gives (43). Let us now show (41). To this end, we first observe that |Cτ(M^{−δ}_{ρ+ε})| = 1 implies Cτ(M^{−δ}_{ρ+ε}) = {A} and hence ζ(Cτ(M^{−δ}_{ρ+ε})) = {B}. In view of (43) it thus remains to show B′ ∩ Lρ+2ε = ∅ for all B′ ∈ Cτ(Lρ) with B′ ≠ B. Let us assume the converse, that is, there is a B′ ∈ Cτ(Lρ) with B′ ≠ B and B′ ∩ Lρ+2ε ≠ ∅. Since Lρ+2ε ⊂ M^{+δ}_{ρ+ε}, there then exists an x ∈ B′ ∩ M^{+δ}_{ρ+ε}.
By part i) of (Steinwart, 2015b, Lemma A.3.1) this gives an x Mρ+ε with d(x, x ) δ, and hence we obtain d(x , M δ ρ+ε) ψ Mρ+ε(δ) < τ From this inequality we conclude that there is an x M δ ρ+ε satisfying d(x , x ) < τ/2. The CRM property then yields x M δ ρ+ε = A B and hence δ ψ Mρ+ε(δ), see (Steinwart et al., 2021, Lemma 8.5), gives d(B , B) d(x, x ) d(x, x ) + d(x , x ) < δ + τ/2 ψ Mρ+ε(δ) + τ/2 < τ Adaptive Clustering Using Kernel Density Estimators However, B = B implies d(B , B ) τ by (Steinwart, 2015b, Lemma A.2.4), i.e. we have found a contradiction. ii). If Lρ+2ε = there is nothing to prove, and hence we may assume that Lρ+2ε = . Now assume that (42) is false. Then there exist B1, B2 Cτ(Lρ) with B1 = B2 and Bi Lρ+2ε = for i = 1, 2. For i = 1, 2 we consequently find xi Bi Lρ+2ε, and for these there exist Ai Cτ(Lρ+2ε) with xi Ai. Now recall from (Steinwart, 2015b, Lemma A.2.7) that Lρ+2ε Lρ implies Cτ(Lρ+2ε) Cτ(Lρ), and therefore we have a CRM ζ : Cτ(Lρ+2ε) Cτ(Lρ). Our construction then gives xi Ai Bi ζ(Ai) Bi , and thus ζ(Ai) Bi = for i = 1, 2. However, ζ(Ai) and Bi are both elements of the partition Cτ(Lρ) and hence we conclude ζ(Ai) = Bi for i = 1, 2. Moreover, ζ is a map, and therefore B1 = B2 implies A1 = A2. Let us write A := A1 A2. Since we know from (Steinwart, 2015b, Lemma A.2.4) that d(A1, A2) τ, we conclude by (Steinwart, 2015b, Lemma A.2.8) that Cτ(A) = {A1, A2}, and thus |Cτ(A)| = 2. However, we also have A Lρ+2ε M+δ ρ+ε, and since M δ ρ+ε = holds, Assumption S together with δ (0, δthick] and τ > 2cthickδγ ensures |Cτ(A)| = 1. Since this contradicts |Cτ(A)| = 2 we have proven (42). Proof of Theorem 7: For i 0 we write ρi := ρ0 + iε for the sequence of potential levels Algorithm 1 visits. Moreover, let i := max{i 0 : M δ ρi+3ε = }, where we note that this maximum is finite by (Steinwart et al., 2021, Lemma 8.3). For i = 0, . . . 
, i , part i) of Theorem 25 then shows that Algorithm 1 identifies exactly one component in its Line 3, and therefore it only identifies more than one component in Line 3, if i i + 1. If it finishes the loop at Line 5, we thus know that ρ ρi +2, and therefore the level ρ considered in Line 7 satisfies ρ ρi +4. Now the definition of i yields M δ ρi +1+3ε = , and since ρi +1 + 3ε = (i + 1)ε + 3ε = (i + 4)ε = ρi +4, we find M δ ρ = for the ρ considered in Line 7. This implies M δ ρ+ε = , and hence part ii) of Theorem 25 shows that Algorithm 1 identifies at most one component in Line 7. Lemma 26 Let Assumption M be satisfied, ρ (ρ , ρ ], ε := ρ ρ , and A1,ρ, and A2,ρ be the two connected components of Mρ, i.e. C(Mρ) = {A1,ρ, A2,ρ}. Then the following statements hold: i) For all 0 < τ 3τ (ε) we have Cτ(Mρ) = C(Mρ). ii) For all 0 < δ < τ τ (ε) we have Cτ(Mρ) Cτ(M+δ ρ ) = {A+δ 1,ρ, A+δ 2,ρ}. iii) For δ (0, δthick] and ψ(δ) < τ τ (ε) we have |Cτ(M δ ρ )| = 2 with Cτ(M δ ρ ) = {A δ 1,ρ, A δ 2,ρ}. Proof of Lemma 26: To adapt to the notation of Steinwart (2015a,b) we write τ Mρ := d(A1,ρ, A2,ρ). Note that this definition gives τ Mρ = 3τ (ρ ρ ) = 3τ (ε). i). The assertion directly follows part ii) from (Steinwart, 2015b, Proposition A.2.10). ii). The assertion has been shown in part iii) of (Steinwart, 2015b, Lemma A.4.1). Steinwart, Sriperumbudur, and Thomann iii). We first note that using part ii) of (Steinwart, 2015a, Theorem 2.7) with ε := ε and ρ = ρ + ε we find |Cτ(M δ ρ )| = |C(Mρ)| = 2. Moreover, we have d(A1,ρ, A2,ρ) = τ Mρ = 3τ (ε) τ > ψ(δ) = 3cthickδγ > ψ Mρ(δ) δ , (44) where the last inequality follows from (Steinwart et al., 2021, Lemma 8.5) since Assumption M implies Assumption P, and hence X is connected. Consequently, part v) of (Steinwart, 2015b, Lemma A.3.1) yields M δ ρ = A1,ρ A2,ρ δ = A δ 1,ρ A δ 2,ρ , (45) and we additionally note that (44) implies d(A δ 1,ρ, A δ 2,ρ) d(A1,ρ, A2,ρ) τ . (46) Now let A1 and A2 be the two τ-connected components of Cτ(M δ ρ ). 
Let us assume that A1 = A δ 1,ρ and A1 = A δ 2,ρ. Then (45) shows that there exist x A1 A δ 1,ρ and x A1 A δ 2,ρ. Since A1 is τ-connected, there further exist x1, . . . , xn A1 with x1 = x , xn = x and d(xi, xi+1) < τ for all i = 1, . . . , n 1. By x1 A δ 1,ρ, xn A δ 2,ρ, and (45) we conclude that there is an i = {1, . . . , n 1} with xi A δ 1,ρ and xi+1 A δ 2,ρ. This gives d(A δ 1,ρ, A δ 2,ρ) d(xi, xi+1) < τ, which clearly contradicts (46). Lemma 27 Let Assumption M be satisfied, and P1 and P2 be defined by (5) for some fixed ρ (ρ , ρ ]. Then for i = 1, 2 and ρ ρ we have Mi,ρ = Mρ Ai,ρ . (47) Proof of Lemma 27: We first note that since P is normal at all levels ρ > 0, we have λd(Mρ {h ρ}) = 0 for all λd-densities h of P and all ρ > 0. For a fixed ρ ρ > 0. we can thus find a λd-density h of P such that Mρ = {h ρ} and Mρ = {h ρ }. Let us define hi := 1Ai,ρ h. Then hi is a λd-density of Pi and we have {hi ρ} = Mρ Ai,ρ . (48) Moreover, we have Mi,ρ = supp λd {hi ρ} , and hence it suffices to show that Mi,ρ = {hi ρ}. For the proof of the inclusion we fix an x Mi,ρ and an open U X with x U. The definition of the support of a measure then yields λd(U Mρ) = λd U {h ρ} λd U {hi ρ} > 0 , which in turn implies x Mρ. This shows Mi,ρ Mρ. Moreover, Ai,ρ is closed by definition and we further have λd Ai,ρ {hi ρ} = λd({hi ρ}) = λd X {hi ρ} . Adaptive Clustering Using Kernel Density Estimators Since the support of a finite measure is also the smallest closed subset having full measure, we conclude that Mi,ρ Ai,ρ . Combining the two found inclusions Mi,ρ Mρ and Mi,ρ Ai,ρ with (48) we have thus found the desired Mi,ρ {hi ρ}. For the proof of the converse inclusion we fix an x {hi ρ} = Mρ Ai,ρ . Moreover, we fix an open U X with x U, so that it suffices to show λd(U {hi ρ}) > 0. To this end, we may assume without loss of generality that i = 1. Moreover, since d(A1,ρ , A2,ρ ) > 0 and x A1,ρ we may additionally assume that U A2,ρ = . Now, x Mρ implies λd(U Mρ) > 0. 
Let us write Ak := Mρ Ak,ρ = {hk ρ}. This yields Mρ = A1 A2, A1 A2 = , and λd(U A2) λd(U A2,ρ ) = 0 . Using the disjoint union U Mρ = (U A1) (U A2), we conclude that λd(U {h1 ρ}) = λd(U A1) = λd(U Mρ) > 0 . As mentioned above this shows x M1,ρ. Lemma 28 Let Assumption M be satisfied, and P1 and P2 be defined by (5) for some fixed ρ (ρ , ρ ]. Then, for i = 1, 2, the following statements are true: i) For ε := ρ ρ and all 0 < δ < τ (ε ) and ρ ρ we have M+δ i,ρ = M+δ ρ A+δ i,ρ . ii) For all δ > 0 and ρ ρ we have M δ i,ρ = M δ ρ A δ i,ρ . Proof of Lemma 28: i). Let ξ : C(Mρ) C(Mρ ) be the CRM and B1, . . . , Bn be the connected components of C(Mρ). Without loss of generality we may assume there is an m {0, . . . , n} such that ξ(Bj) A1,ρ for all j = 1, . . . , m and ξ(Bj) A2,ρ for all j = m + 1, . . . , n. We define A1 := B1 Bm and A2 := Bm+1 Bn. Clearly, this construction ensures Ak ξ(Ak) Ak,ρ , k = 1, 2. (49) Moreover, we have Mρ = A1 A2, and hence we find M+δ ρ = A+δ 1 A+δ 2 by part iv) of (Steinwart, 2015b, Lemma A.3.1). In view of (47), we consequently need to prove that (A1 A2) Ai,ρ +δ = A+δ 1 A+δ 2 A+δ i,ρ . (50) Be begin by observing that (A1 A2) Ai,ρ = (A1 Ai,ρ ) (A2 Ai,ρ ) = Ai , (51) where we used (49) and A1,ρ A2,ρ = . Similarly, the right-hand side of (50) can be written as A+δ 1 A+δ 2 A+δ i,ρ = A+δ 1 A+δ i,ρ A+δ 2 A+δ i,ρ . (52) Steinwart, Sriperumbudur, and Thomann In addition, (49) ensures A+δ i A+δ i,ρ , and by continuing (52) we thus find A+δ 1 A+δ 2 A+δ i,ρ = A+δ i A+δ k A+δ i,ρ , k {1, 2} \ {i} . (53) Moreover, by part ii) of Lemma 26 we know that A+δ 1,ρ and A+δ 2,ρ are the two τ-connected components of M+δ ρ for any τ with δ < τ < τ (ε ). Thus, we have A+δ 1,ρ A+δ 2,ρ = , and since (49) ensures A+δ k A+δ k,ρ we conclude that A+δ k A+δ i,ρ = . Inserting the latter into (53) gives A+δ 1 A+δ 2 A+δ i,ρ = A+δ i . (54) Now, (50) follows from combining (51) with (54). ii). This directly follows from combining (47) with (Steinwart et al., 2021, Lemma 8.7). 
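The τ-connected components Cτ(A) that these proofs manipulate have a simple operational meaning for finite sets: two points lie in the same component if they can be linked by a chain of points of A with consecutive distances below τ. A minimal sketch for a finite point set (our own illustration, not the paper's code):

```python
def tau_components(points, tau):
    """Partition a finite point set into tau-connected components:
    x, y are in the same component iff a chain x = z1, ..., zm = y
    inside the set satisfies |z_i - z_{i+1}| < tau for all i."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    unvisited = set(range(len(points)))
    components = []
    while unvisited:
        stack = [unvisited.pop()]        # start a new component
        comp = set(stack)
        while stack:                     # grow it by tau-chains
            i = stack.pop()
            close = {j for j in unvisited
                     if dist(points[i], points[j]) < tau}
            unvisited -= close
            comp |= close
            stack.extend(close)
        components.append(sorted(points[j] for j in comp))
    return components
```

As τ grows past the separation distance between two components, they merge into one, which is exactly the mechanism Lemma 26 and Theorem 25 quantify for the sets M^{−δ}_ρ and Lρ.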
Proof of Theorem 9: i). We first note that ε ε := ρ ρ implies τ (ε ) τ (ε ). Consequently, part iii) of Lemma 26 applied for ρ := ρ gives the assertion. ii). We begin by showing that the CRMs ξρ+ε : Cτ(M δ ρ ) Cτ(M δ ρ+ε) and ξ : Cτ(M δ ρ+ε) Cτ(M δ ρout+ε) are bijective. To this end we consider the following commutative diagram of CRMs: Cτ(M δ ρ ) Cτ(M δ ρout+ε) Cτ(M δ ρ+ε) - @ @ @ @@ R Now, part iv) of (Steinwart, 2015b, Theorem A.6.2) shows that ξρout+ε is bijective, and consequently, ξρ+ε is injective. Moreover, part i) of (Steinwart, 2015a, Theorem 2.7) shows that 1 |Cτ(M δ ρ+ε)| 2, and since we already know that |Cτ(M δ ρ )| = 2 and that ξρ+ε is injective, we conclude that |Cτ(M δ ρ+ε)| = 2 and that ξρ+ε is bijective. Using the diagram we then see that the CRM ξ is also bijective. Our next goal is to show that the CRM bξρ : Cτ(M δ ρ+ε) b Cτ(Lρ) is well-defined and bijective. To this end, we first recall that our assumption ρ ρ 3ε together with (Steinwart, 2015a, Theorem 2.8) gives the following disjoint union: Cτ(Lρ) = bξρ Cτ(M δ ρ+ε) B Cτ(Lρ) : B Lρ+2ε = . Consequently, we have bξρ Cτ(M δ ρ+ε) = b Cτ(Lρ), that is, we can view bξρ as a surjective CRM bξρ : Cτ(M δ ρ+ε) b Cτ(Lρ). Similarly, part i) of Theorem 6 ensures ρout ρ + ε + 5ε ρ + 6ε ρ 3ε , Adaptive Clustering Using Kernel Density Estimators and repeating the reasoning above we see that the CRM bξρout : Cτ(M δ ρout+ε) Cτ(Lρout) can be viewed as a surjective CRM bξρout : Cτ(M δ ρout+ε) b Cτ(Lρout). Finally, consider the CRM ξ : Cτ(Lρ) Cτ(Lρout). For B b Cτ(Lρ) we then have = B Lρ+2ε ξ(B) Lρ+2ε ξ(B) Lρout+2ε , i.e. ξ(B) b Cτ(Lρout). Consequently, the restriction ξ| b Cτ(Lρ) : b Cτ(Lρ) b Cτ(Lρout) is welldefined, and obviously also a CRM. In summary, we obtain the following commutative diagram of CRMs Cτ(M δ ρout+ε) b Cτ(Lρout) Cτ(M δ ρ+ε) b Cτ(Lρ) ξ ξ| b Cτ(Lρ) Now, we have already seen that ξ is bijective, and in addition, part ii) of (Steinwart, 2015b, Theorem A.6.2) shows that bξρout is injective. 
Moreover, our considerations above showed that bξρout is surjective, and hence it is bijective. Using the diagram we conclude that bξρ : Cτ(M δ ρ+ε) b Cτ(Lρ) is injective. Since we have already seen that it is surjective, we conclude that it is indeed bijective. With the help of these preparations, the first assertion now easily follows from i) and the bijectivity of bξρ and ξρ+ε, namely | b Cτ(Lρ)| = bξρ ξρ+ε (Cτ(M δ ρ ) = Cτ(M δ ρ ) = 2 . To show the second assertion, we write Bρ i := bξρ ξρ+ε(Vi). This immediately gives Vi Bρ i for i = 1, 2. Moreover, using the diagram we find Bρ i ξ| b Cτ(Lρ)(Bρ i ) = ξ| b Cτ(Lρ) bξρ ξρ+ε(Vi) = bξρout ξ ξρ+ε(Vi) = Bi , where the latter identity follows from part iii) of (Steinwart, 2015b, Theorem A.6.2). iii). We first observe that ε := ρ ρ satisfies ε ε and by (Steinwart et al., 2021, Lemma 8.5) we hence find δ ψ (δ) < τ (ε ) τ (ε ). Lemma 28 then shows M δ i,ρ = M δ ρ A δ i,ρ and M+δ i,ρ = M+δ ρ A+δ i,ρ . By the definition of Li,ρ we thus have to show the following two inclusions M δ ρ+ε A δ i,ρ Lρ Bi , (55) Lρ Bi M+δ ρ ε A+δ i,ρ . (56) We begin by proving (55). To this end, we first observe that (3) ensures M δ ρ+ε Lρ and hence it suffices to establish A δ i,ρ Bi. Now, we have already observed that τ τ (ε ) Steinwart, Sriperumbudur, and Thomann τ (ε ), and consequently part iii) of Lemma 26 shows that A δ 1,ρ and A δ 2,ρ are the two τ- connected components of M δ ρ . Moreover, part i) of Theorem 6 shows ρout ρ +ε +5ε ρ ε, and hence we have ρ ε [ρout, ρ 3ε]. Applying (Steinwart, 2015a, Theorem 2.8) and the already established part ii) to the level ρ ε we then obtain A δ i,ρ Bρ ε i Bi for i {1, 2}. Let us now establish Li,ρ Bρ +2ε i . Without loss of generality we may assume i = 1. Now, consider the CRM ξ : Cτ(Lρ B1) Cτ(Lρ +2ε B1), which is possible since ρ ρ +2ε. Let us assume that there was a B Cτ(Lρ B1) with ξ(B ) Bρ +2ε 1 . 
Since Bρ +2ε 1 is a τ-connected component of Lρ +2ε B1 by part ii) applied to the level ρ +2ε [ρout, ρ 3ε] and ξ(B ) is another such τ-connected component we conclude that ξ(B ) Bρ +2ε 1 = . Moreover, our construction and part ii) give ξ(B ) Bρ +2ε 2 B1 Bρ +2ε 2 B1 B2 = , and therefore part ii) shows ξ(B ) b Cτ(Lρ + 2ε). Together with ρ ρ + 4ε the latter implies B Lρ B Lρ +4ε ξ(B ) Lρ +4ε = . Consequently, we have found a contradiction, and therefore we have ξ(B ) Bρ +2ε 1 for all τ-connected components of L1,ρ = Lρ B1. Since B ξ(B ) we have thus found L1,ρ Bρ +2ε 1 . Let us now show (56). To this end, we note that (3) ensures Lρ M+δ ρ ε, and hence it suffices to prove Lρ Bi A+δ i,ρ . Moreover, we have already shown that Lρ Bi Bρ +2ε i , and therefore, it suffices to establish Bρ +2ε i A+δ i,ρ . To this end, recall that we have already observed τ τ (ε ) τ (ε ). Part ii) of Lemma 26 thus shows that A+δ 1,ρ and A+δ 2,ρ are the two τ-connected components of M+δ ρ . Now consider the CRM ξ : Cτ(Lρ +2ε) Cτ(M+δ ρ ). Then the τ-connected component Bρ +2ε 1 of Lρ +2ε satisfies Bρ +2ε 1 ξ(Bρ +2ε 1 ), and therefore, exactly one of the following two conditions is satisfied Bρ +2ε 1 A+δ 1,ρ , (57) Bρ +2ε 1 A+δ 2,ρ . (58) However, our construction ensures V1 A+δ 1,ρ , and part ii) gives V1 Bρ +2ε 1 . This gives = V1 Bρ +2ε 1 A+δ 1,ρ , and therefore we can exclude (58). Consequently (57) is true. The inclusion Bρ +2ε 2 A+δ 2,ρ can be shown analogously. Adaptive Clustering Using Kernel Density Estimators 8.2 Proofs for Section 4 Lemma 29 Let K : Rd [0, ) be a symmetric kernel with tail function κ1( ). Moreover, let P be a λd-absolutely continuous distribution on Rd that is normal at some level ρ 0. 
Then for all x Rd and σ > 0 with B(x, σ) Mρ and all δ > 0 we have h P,δ(x) ρ ρκ1(σ while for all x Rd and σ > 0 with B(x, σ) X \ Mρ and all δ > 0 we have h P,δ(x) < ρ + δ dκ (σ Finally, if P has a bounded density h, then the inequality (59) can be replaced by h P,δ(x) ρ κ1(σ whenever 0 ρ h and (60) can be replaced, for all ρ 0, by h P,δ(x) < ρ + κ1(σ δ ) h . (62) Proof of Lemma 29: Let h be a λd-density of P. For the proof of (59), we first observe that λd(B(x, σ) \ {h ρ}) λd(Mρ \ {h ρ}) = 0, since P is normal at level ρ. Therefore, we obtain Z B(x,σ) Kδ(x y) h(y) dλd(y) = Z B(x,σ) {h ρ} Kδ(x y) h(y) dλd(y) B(x,σ) {h ρ} Kδ(x y) dλd(y) B(x,σ) Kδ(x y) dλd(y) , (63) and this leads to h P,δ(x) = Z Rd Kδ(x y) h(y) dλd(y) B(x,σ) Kδ(x y) dλd(y) + Z Rd\B(x,σ) Kδ(x y) h(y) dλd(y) B(x,σ) Kδ(x y) dλd(y) + ρ Z Rd\B(x,σ) Kδ(x y) dλd(y) Rd\B(x,σ) Kδ(x y) dλd(y) + Z Rd\B(x,σ) Kδ(x y) h(y) dλd(y) Rd\B(x,σ) Kδ(x y) dλd(y) , where in the last step we used (10). Now, the assertion follows from (11), and for a bounded density h and ρ h , the inequality (61) is a direct consequence of (59). To show (60) we first note that (1) yields λd B(x, σ) \ {h < ρ} λd (Rd \ Mρ) \ {h < ρ} = λd {h ρ} \ Mρ = 0 . Steinwart, Sriperumbudur, and Thomann Analogously to (63) we then obtain Z B(x,σ) Kδ(x y) h(y) dλd(y) = Z B(x,σ) {h<ρ} Kδ(x y) h(y) dλd(y) B(x,σ) {h<ρ} Kδ(x y) dλd(y) B(x,σ) Kδ(x y) dλd(y) , where for the strict inequality we used our assumption that K is strictly positive in a neighborhood of 0. Adapting the last estimate of the proof of (59) we then find h P,δ(x) < ρ Z B(x,σ) Kδ(x y) dλd(y) + Z Rd\B(x,σ) Kδ(x y) h(y) dλd(y) B(x,σ) Kδ(x y) dλd(y) + ρ Z Rd\B(x,σ) Kδ(x y) dλd(y) Rd\B(x,σ) Kδ(x y) dλd(y) + Z Rd\B(x,σ) Kδ(x y) h(y) dλd(y) Rd\B(x,σ) Kδ(x y) h(y) dλd(y) . Now, for bounded h inequality (62) follows from (11), while in the general case the estimate Z Rd\B(x,σ) Kδ(x y) h(y) dλd(y) sup y Rd\B(x,σ) Kδ(x y) = δ dκ (σ leads to (60). 
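Lemma 29 controls the smoothed density h_{P,δ} = K_δ ∗ h; its empirical counterpart in the subsequent proofs is the kernel density estimator h_{D,δ}(x) = n^{−1} Σ_i δ^{−d} K((x − x_i)/δ). A minimal one-dimensional sketch (the Gaussian kernel is our illustrative choice; the results cover general symmetric kernels):

```python
import math

def kde(x, data, delta):
    """Kernel density estimate h_{D,delta}(x) for 1-D data:
    (1/n) * sum_i delta^{-1} K((x - x_i)/delta) with a
    Gaussian kernel K (one admissible symmetric kernel)."""
    def K(u):
        return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    n = len(data)
    return sum(K((x - xi) / delta) for xi in data) / (n * delta)
```

Replacing K by, e.g., the Epanechnikov or moving-window kernel only changes the function K; the δ^{−d} rescaling is the same.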
Proof of Theorem 11: To prove the first inclusion, we fix an x M 2σ ρ+ε+ϵ This means x / (Rd \ Mρ+ε+ϵ)+2σ, i.e., for all x Rd \ Mρ+ε+ε we have x x > 2σ. In other words, for all x Rd with x x 2σ, we have x Mρ+ε+ϵ, i.e., we have found B(x, 2σ) Mρ+ε+ϵ . (64) Let us now suppose that there exists a sample xi D such that h D,δ(xi) < ρ and x xi σ. By h D,δ h P,δ < ε we then find h P,δ(xi) < ρ + ε . (65) On the other hand, x xi σ together with the already shown (64) implies B(xi, σ) Mρ+ε+ϵ by a simple application of the triangle inequality. Consequently, (59) together with ϵ ρκ1(σ δ ) gives h P,δ(xi) ρ + ε, which contradicts (65). For all samples xi D, we thus have h D,δ(xi) ρ or x xi > σ. Let us assume that we have x xi > σ for all xi D. Then we find h D,δ(x) = 1 i=1 δ d K x xi i=1 δ dκ (σ δ ) = δ dκ (σ δ ) ρ . (66) Adaptive Clustering Using Kernel Density Estimators On the other hand, we have B(x, σ) B(x, 2σ) Mρ+ε+ϵ and therefore (59) together with ϵ ρκ1(σ δ ) gives h P,δ(x) ρ+ε. By h D,δ h P,δ < ε we conclude that h D,δ(x) > ρ, which contradicts (66). Therefore there does exist a sample xi D with x xi σ. Using the inclusion (64) together with the triangle inequality we then again find B(xi, δ) Mρ+ε+ϵ, and hence (59) yields h P,δ(xi) ρ + ε. This leads to h D,δ(xi) ρ, and hence we finally obtain x {x D : h D,δ(x ) ρ}+σ = LD,ρ . Finally, if h has a bounded density, then we have Mρ = for ρ > h and therefore M 2σ ρ+ε+ϵ LD,ρ is trivially satisfied. Moreover, to show the assertion for ρ h , we simply need to replace (59) with (61) in the proof above. To prove the second inclusion, we pick an x LD,ρ. By the definition of LD,ρ, there then exists an xi D such that x xi σ and h D,δ(xi) ρ. The latter implies h P,δ(xi) > ρ ε. Our first goal is to show that Mρ ε ϵ B(xi, σ) = . To this end, let us assume the converse, that is B(xi, σ) Rd \ Mρ ε ϵ. By (60) and ϵ δ dκ (σ δ ) we then find h P,δ(xi) < ρ ε, which contradicts the earlier established h P,δ(xi) > ρ ε. 
Consequently, there exists an x Mρ ε ϵ B(xi, σ), which in turn leads to d(x, Mρ ε ϵ) x x x xi + xi x 2σ . This shows the desired x M+2σ ρ ε ϵ. Finally, to show the assertion for bounded densities, we simply need to replace (60) with (62) in the proof above. For the proof of Lemma 14 we need to recall the following classical result, which is a reformulation of (van der Vaart and Wellner, 1996, Theorem 2.6.4). Theorem 30 Let A be a set of subsets of Z that has finite VC-dimension V . Then the set of indicator functions G := {1A : A A} is a uniformly bounded VC-class for which we have B = 1 and the constants A and ν in (17) only depend on V . We also need the next result, which investigates the effect of scaling in the input space. Lemma 31 Let G be set of measurable functions g : Rd R such that there exists a constant B 0 with g B for all g G. For δ > 0, we define gδ : Rd R by gδ(x) := g(x/δ), x Rd Furthermore, we write Gδ := {gδ : g G}. Then, for all ϵ (0, B] and all δ > 0, we have sup P N(G, L2(P), ϵ) = sup P N(Gδ, L2(P), ϵ) , where the suprema are taken over all probability measures P on Rd. Proof of Lemma 31: Because of symmetry we only prove . Let us fix ϵ, δ > 0 and a distribution P on Rd. We define a new distribution P on Rd by P (A) := P(1 δA) for all measurable A Rd. Furthermore, let C be an ϵ-net of Gδ with respect to L2(P ). For C := C 1/δ, we then have |C| = |C |, and hence it suffices to show that C is an ϵ-net of G with respect to L2(P). To this end, we fix a g G. Then gδ Gδ, and hence there exists an Steinwart, Sriperumbudur, and Thomann h C with gδ h L2(P ) ϵ. Moreover, we have h := h 1/δ C, and since the definition of P ensures EP fδ = EP f for all measurable f : Rd [0, ), we obtain g h L2(P) = gδ hδ L2(P ) = gδ h L2(P ) ϵ, i.e. C is an ϵ-net of G with respect to L2(P). Proof of Lemma 14: The set A := {x + B : x Rd} of has finite VC-dimension by (Devroye and Lugosi, 2001, Corollary 4.2) or (Devroye and Lugosi, 2001, Lemma 4.1), respectively. 
In both cases, Theorem 30 thus shows that for G := K(x ) : x Rd there are constants A and ν only depending on the VC-dimension of A such that N G, L2(P), K ϵ = N K 1 G, L2(P), ϵ A for all ϵ (0, 1] and all distributions P on Rd. Moreover, observe that Gδ = K(x δ 1 ) : x Rd = K x : x Rd = δd Kδ . Consequently, Lemma 31 leads to sup P N(Kδ, L2(P), δ dϵ) = sup P N(δd Kδ, L2(P), ϵ) = sup P N(G, L2(P), ϵ) A K for all ϵ (0, K ]. A simple variable transformation then yields the assertion. Proof of Lemma 15: We first recall that if A E is a compact subset of some Banach space E and T : A F is a α-H older continuous map into another Banach space F then we have N T(A), F , |T|αϵα N A, E, ϵ , ϵ > 0 , where |T|α is the α-H older constant of T. We now fix a 0 < δ |K|α K 1/α diam (X) and a probability measure P on Rd. For x X we now consider the map kx,δ : Rd [0, ] defined by kx,δ(y) := Kδ(x y) = δ d K x y Since K is bounded and measurable, so is kx,δ, and hence we obtain a map T : X L (P) defined by T(x) := kx,δ. Moreover, T is α-H older continuous, since for x, x X, we have T(x) T(x ) = sup y Rd δ (α+d)|K|α x x α , i.e. we have shown |T|α δ (α+d)|K|α. By our initial observation and (19) we then conclude that N Kδ, L2(P), |T|αϵα = N T(X), L2(P), |T|αϵα N X, , ϵ C (X)ϵ d Adaptive Clustering Using Kernel Density Estimators for all ϵ (0, diam (X)]. A variable transformation together with our bound on |T|α thus yields N Kδ, L2(P), ϵ C (X) |T|α d/α C (X) |K|α for all 0 < ϵ δ (α+d)|K|α diam (X). Since the assumed δ |K|α K 1/α diam (X) implies δ d K diam (X) αδ (α+d)|K|α we then see that (20) does hold for all 0 < ϵ δ d K . For the proof of Theorem 16 we quote a version of Talagrand s inequality due to Bousquet (2002) from (Steinwart and Christmann, 2008, Theorem 7.5). Theorem 32 Let (Z, P) be a probability space and G be a set of measurable functions from Z to R. Furthermore, let B 0 and σ 0 be constants such that EP g = 0, EP g2 σ2, and g B for all g G. 
For n 1, define G : Zn R by G(z) := sup g G , z = (z1, . . . , zn) Zn. Then, for all ς > 0, we have P n n z Zn : G(z) 4EP n G + For the proof of Theorem 16 we also need (Gin e and Guillou, 2001, Proposition 2.1), which bounds the expected suprema of empirical processes indexed by uniformly bounded VC-classes. The following theorem provides a slightly simplified version of that proposition. Theorem 33 Let (Z, P) be a probability space and G be a uniformly bounded VC-class on Z with constants A, B, and ν. Furthermore, let σ > 0 be a constant with σ B and EP g2 σ2 for all g G. Then there exists a universal constant C such that G defined in Theorem 32 satisfies We are now able to establish the following generalization of Theorem 16. Proposition 34 Let X Rd and P be a probability measure on X with Lebesgue density h L1(Rd) Lp(Rd) for some p (1, ]. Moreover, let 1 p = 1 and q := 1 2p = 1 2p and K be a symmetric kernel. Suppose further that the set Kδ defined in (18) satisfies (21) for all δ (0, δ0], where δ0 (0, 1]. Then, there exists a positive constant C only depending on d, p, and K such that, for all n 1, all δ (0, δ0] satisfying δ h p p 4p K , and all ς 1 we have P n D : h D,δ h P,δ ℓ (X) < Cς δa+dq h 1/2 p + C h pς δd(1+1/p)n log C δa+dq h 1/2 p Steinwart, Sriperumbudur, and Thomann Proof of Proposition 34: We define θ := 1 2p. Then K L1(Rd) L (Rd) leads to K 2p K 1 θ 1 K θ = K θ . We further define kx,δ := δ d K δ 1(x ) and fx,δ := kx,δ EP kx,δ. Then we have EP fx,δ = 0 and fx,δ 2 K δ d for all x X and δ > 0. Moreover, we have EP f2 x,δ EP k2 x,δ and thus EP f2 x,δ = δ 2d Z h(y) dλd(y) δ 2d h p δ d(1+1/p) h p K 2θ =: σ2 δ for all x X and δ > 0. In addition, for all D Xn we have i=1 fx,δ(xi) = h D,δ(x) h P,δ(x) . Applying Theorem 32 to G := {fx,δ : x X}, we thus see, for all δ > 0, ς > 0, and n 1, that h D,δ h P,δ ℓ (X) < 4ED P n h D ,δ h P,δ ℓ (X) + 2ς 2ς h p K 2θ nδd(1+1/p) (67) holds with probability P n not smaller than 1 e ς. 
It thus remains to bound the term ED P n h D ,δ h P,δ ℓ (X) = ED P n sup x X To this end, we first note that |EP kx,δ| kx,δ = δ d K =: Bδ. Consequently, we have Fδ := {fx,δ : x X} kx,δ b : kx,δ Kδ, |b| Bδ , and since N([ Bδ, Bδ], | |, ϵ) 2Bδϵ 1 we conclude that for A := max{1, A0} we have sup Q N Fδ, L2(Q), ϵ 2 A0 K δ (d+a) ϵ 2 A K δ (d+a) for all δ (0, δ0], ϵ (0, Bδ], where the supremum runs over all distributions Q on X. Now, our very first estimates showed fx,δ 2Bδ and EP f2 x,δ σ2 δ, and since σδ 2Bδ is equivalent to δ 4p K p (2 2θ) h p p = 4p K h p p , Theorem 33 together with 2θ = 1 + 1/p thus yields ED P n sup x X nδd log 2 A K (ν+1) h p K 1+1/p 2δd(1+1/p)n log 2 A K Adaptive Clustering Using Kernel Density Estimators for such δ. Moreover, we have σδδd+a = log 2 A K δ d(1+1/p)/2 h 1/2 p K θ δd+a = log 2 A K q δa+dq h 1/2 p , and hence the previous estimate can be simplified to ED P n sup x X nδd log 2 A K q δa+dq h 1/2 p + v u u tν h p K 1+1/p δd(1+1/p)n log 2 A K q δa+dq h 1/2 p Combining this with (67) gives h D,δ h P,δ ℓ (X) < 4ED P n h D ,δ h P,δ ℓ (X) + 2ς 2ς h p K 1+1/p nδd(1+1/p) nδd log 2 A K q δa+dq h 1/2 p + v u u tν h p K 1+1/p δd(1+1/p)n log 2 A K q δa+dq h 1/2 p 2ς h p K 1+1/p nδd(1+1/p) Cς nδd log C δa+dq h 1/2 p + v u u t C h pς δd(1+1/p)n log C δa+dq h 1/2 p with probability P n not smaller than 1 e ς. Proof of Theorem 16: By Proposition 34, it suffices to find a constant C such that Cς nδd log C δa+dq h 1/2 p + C h pς δd(1+1/p)n log C δa+dq h 1/2 p C r h p |log δ| ς nδd(1+1/p) . (68) To this end, we first observe that δa+dq h 1/2 p C 1 implies Cδ a dq h 1/2 p δ 2a 2dq, and thus we obtain log C δa+dq h 1/2 p (2a + 2dq) log δ 1. For C := (2a + 2dq)C we therefore Cς nδd log C δa+dq h 1/2 p + C h pς δd(1+1/p)n log C δa+dq h 1/2 p C ς nδd log δ 1 + C h pς nδd(1+1/p) log δ 1 . 
Moreover, it is easy to check that the assumption |log δ| C ς ensures that nδd log δ 1 C h pς nδd(1+1/p) log δ 1 , and from the latter we conclude that (68) holds for C := 2 C . The assertion now follows for the constant C := max{C, C , C }. Steinwart, Sriperumbudur, and Thomann 8.3 Proofs for the KDE-Based Clustering in Section 5 Proof of Theorem 17: Let us fix a D Xn with h D,δ h P,δ < ε/2. By (23) we see that the probability P n of such a D is not smaller than 1 e ς. We define ϵ := h κ1(σ δ ). In the case of supp K B this leads to ϵ = 0 δ dκ (σ δ ) = 0 ρ0 as noted after Theorem 11. Furthermore, in the case of (9), (Steinwart et al., 2021, Lemma 4.2) shows δ ) h κ1(| log δ|2) cd2 vold e | log δ|2| log δ|2d 2 cd2 vold δ| log δ| d ε/2, where in the third to last step we used 0 < δ 1 and in the second to last step we used (Steinwart et al., 2021, Lemma 8.17). In addition, we have δ dκ (σ δ ) cδ de | log δ|2 = cδ| log δ| d ε ρ0. Consequently, Theorem 11 shows, for all ρ ρ0, that M 2σ ρ+ε LD,ρ M+2σ ρ ε . (69) i). The assertion follows from Theorem 7 applied in the case d := 2σ. Indeed, we have just seen that (40) holds for all ρ ρ0, if we replace δ by δ, and our assumptions guarantee δ (0, δthick], ρ0 ρ , and τ > ψ( δ) = 3cthick δγ > 2cthick δγ. Moreover, (27) follows from (69). ii). We check that the assumptions of Theorem 6 are satisfied for δ := 2σ, if ε (ρ ρ )/9. Clearly, we have δ (0, δthick], ε (0, ε ], and ψ( δ) < τ. To show τ τ (ε ) we write E := {ε (0, ρ ρ ] : τ (ε ) τ}. Since we assumed ε < , we obtain E = by the definition of ε . There thus exists an ε E with ε inf E + ε ε . Using the monotonicity of τ established in (Steinwart, 2015a, Theorem A.4.2) we then conclude that τ τ (ε ) τ (ε ), and hence all assumptions of (Steinwart, 2015a, Theorem 2.9) are indeed satisfied with δ replaced by δ. The assertions now immediately follow from this theorem. 
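Theorem 17 combines two steps: threshold the kernel density estimator at a level ρ and split the surviving sample points into τ-connected components. A self-contained one-dimensional sketch of this pipeline (our own illustration with a Gaussian kernel; the σ-inflation of the estimated level set and the paper's adaptive choice of δ and ρ are omitted):

```python
import math

def level_set_clusters(data, delta, rho, tau):
    """Keep the sample points x_i with h_{D,delta}(x_i) >= rho and
    split them into tau-connected components (1-D data only)."""
    def K(u):
        return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    def h(x):   # kernel density estimate at x
        return sum(K((x - y) / delta) for y in data) / (len(data) * delta)
    kept = sorted(x for x in data if h(x) >= rho)
    clusters = []
    for x in kept:  # in 1-D, consecutive gaps < tau stay in one component
        if clusters and x - clusters[-1][-1] < tau:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    return clusters
```

The full algorithm analyzed above additionally inflates the retained level set by σ-balls, as in the definition of L_{D,ρ} used in Theorem 11, and scans over levels ρ to recover the cluster tree splits.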
Proof of Corollary 18: Using Theorem 17, the proof of ii) is a literal copy of the proof of (Steinwart, 2015a, Theorem 4.1), and the proof of i) is an easy adaptation of this proof. Proof of Corollary 20: Using Theorem 17, the proof is a simple combination and adaptation of the proofs of (Steinwart, 2015a, Theorem 4.3) and (Steinwart, 2015a, Corollary 4.4). Proof of Corollary 23: Using Theorem 17, the proof is a simple combination and adaptation of the proofs of (Steinwart, 2015a, Theorem 4.7) and (Steinwart, 2015a, Corollary 4.8). Proof of Theorem 24: The definition of εδ,n in (33) together with 4C2 u log log n C h ensures that (25) and (26) are satisfied for all δ. In addition, the assumptions of Theorem 24 ensure that the remaining conditions of Theorem 17 are satisfied. Now the assertion follows by some union bound arguments, which are analogous to those of the proof of (Steinwart, 2015a, Theorem 5.1).

Acknowledgment

The authors profusely thank the action editor and the reviewers for their comments and suggestions that significantly improved the readability of the paper. The work of I. Steinwart and P. Thomann was funded by DFG Grant STE 1074/2-1. I. Steinwart was also funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy - EXC 2075 - 390740016. B. K. Sriperumbudur thanks the National Science Foundation for support under grant NSF-DMS-1713011.

References

O. Bousquet. Concentration inequalities and empirical processes theory applied to the analysis of learning algorithms. PhD thesis, École Polytechnique, 2002.

K. Chaudhuri and S. Dasgupta. Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems 23, pages 343-351, 2010.

K. Chaudhuri, S. Dasgupta, S. Kpotufe, and U. von Luxburg. Consistent procedures for cluster tree estimation and pruning. IEEE Trans. Inf. Theory, 60:7900-7912, 2014.

A. Cuevas and R. Fraiman.
A plug-in approach to support estimation. Ann. Statist., 25:2300-2312, 1997.

L. Devroye and G. Lugosi. Combinatorial Methods in Density Estimation. Springer, New York, 2001.

P. Fränti and O. Virmajoki. Iterative shrinking method for clustering problems. Pattern Recognition, 39:761-765, 2006.

E. Giné and A. Guillou. On consistency of kernel density estimators for randomly censored data: Rates holding uniformly over adaptive intervals. Ann. Inst. H. Poincaré Probab. Statist., 37:503-522, 2001.

E. Giné and A. Guillou. Rates of strong uniform consistency for multivariate kernel density estimators. Ann. Inst. H. Poincaré Probab. Statist., 38:907-921, 2002.

J. A. Hartigan. Clustering Algorithms. John Wiley & Sons, New York, 1975.

J. A. Hartigan. Consistency of single linkage for high-density clusters. J. Amer. Statist. Assoc., 76:388-394, 1981.

S. Kpotufe and U. von Luxburg. Pruning nearest neighbor cluster trees. In Proceedings of the 28th International Conference on Machine Learning, pages 225-232, 2011.

M. Maier, M. Hein, and U. von Luxburg. Optimal construction of k-nearest neighbor graphs for identifying noisy clusters. Theoret. Comput. Sci., 410:1749-1764, 2009.

W. Polonik. Measuring mass concentrations and estimating density contour clusters: an excess mass approach. Ann. Statist., 23:855-881, 1995.

P. Rigollet. Generalization error bounds in semi-supervised classification under the cluster assumption. J. Mach. Learn. Res., 8:1369-1392, 2007.

A. Rinaldo and L. Wasserman. Generalized density clustering. Ann. Statist., 38:2678-2722, 2010.

B. K. Sriperumbudur and I. Steinwart. Consistency and rates for clustering with DBSCAN. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, pages 1090-1098, 2012.

I. Steinwart. Adaptive density level set clustering. In Proceedings of the 24th Conference on Learning Theory, pages 703-738, 2011.

I. Steinwart.
Fully adaptive density-based clustering. Ann. Statist., 43:2132-2167, 2015a.

I. Steinwart. Supplement A to "Fully adaptive density-based clustering". Ann. Statist., 43, 2015b.

I. Steinwart. Supplement B to "Fully adaptive density-based clustering". Ann. Statist., 43, 2015c.

I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008.

I. Steinwart, B. K. Sriperumbudur, and P. Thomann. Adaptive clustering using kernel density estimators. Technical report, 2021. http://arxiv.org/abs/1708.05254v3.

W. Stuetzle. Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. J. Classification, 20:25-47, 2003.

W. Stuetzle and R. Nugent. A generalized single linkage method for estimating the cluster tree of a density. J. Comput. Graph. Statist., 19:397-418, 2010.

A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer-Verlag, New York, 1996.

D. Wang, X. Lu, and A. Rinaldo. DBSCAN: Optimal rates for density-based cluster estimation. J. Mach. Learn. Res., 20:1-50, 2019.