Journal of Machine Learning Research 21 (2020) 1-54. Submitted 7/18; Revised 5/20; Published 6/20

Kernel-estimated Nonparametric Overlap-Based Syncytial Clustering

Israel A. Almodóvar-Rivera (israel.almodovar@upr.edu), Department of Biostatistics and Epidemiology, University of Puerto Rico at Medical Science Campus, San Juan, PR 00936-5067, USA

Ranjan Maitra (maitra@iastate.edu), Department of Statistics, Iowa State University, Ames, IA 50011-1090, USA

Editor: Miguel A. Carreira-Perpiñán

Abstract

Commonly-used clustering algorithms usually find ellipsoidal, spherical or other regular-structured clusters, but are more challenged when the underlying groups lack formal structure or definition. Syncytial clustering is the name that we introduce for methods that merge groups obtained from standard clustering algorithms in order to reveal complex group structure in the data. Here, we develop a distribution-free, fully-automated syncytial clustering algorithm that can be used with k-means and other algorithms. Our approach estimates the cumulative distribution function of the normed residuals from an appropriately fit k-groups model and calculates the estimated nonparametric overlap between each pair of clusters. Groups with high pairwise overlap are merged as long as the estimated generalized overlap decreases. Our methodology is always a top performer in identifying groups with regular and irregular structures in several datasets and can be applied to datasets with scatter or incomplete records. The approach is also used to identify the distinct kinds of gamma ray bursts in the Burst and Transient Source Experiment 4Br catalog and the distinct kinds of activation in a functional Magnetic Resonance Imaging study.
Keywords: BATSE, DEMP, DEMP+, DBSCAN*, density peaks algorithm, GRB, GSLNN, k-clips, k-means, km-means, kernel density estimation, KNOB-SynC, MixModCombi, MGHD, MSAL, overlap, PGMM, SDSS, spectral clustering, TiK-means

(c) 2020 Israel Almodóvar-Rivera and Ranjan Maitra. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/18-435.html.

1. Introduction

Cluster analysis (Ramey, 1985; McLachlan and Basford, 1988; Kaufman and Rousseeuw, 1990; Everitt et al., 2001; Melnykov and Maitra, 2010; Xu and Wunsch, 2009; Bouveyron et al., 2019) is an unsupervised learning method that partitions datasets into distinct groups of homogeneous observations. Finding such structure in the absence of group information can be challenging but is important in many applications, such as taxonomical classification (Michener and Sokal, 1957), market segmentation (Hinneburg and Keim, 1999), software management (Maitra, 2001) and so on. As such, a number of methods, ranging from the heuristic (Johnson, 1967; Everitt et al., 2001; Jain and Dubes, 1988; Forgy, 1965; MacQueen, 1967; Kaufman and Rousseeuw, 1990) to the more formal, model-based (Titterington et al., 1985; McLachlan and Peel, 2000; Melnykov and Maitra, 2010; McNicholas, 2016; Bouveyron et al., 2019) approaches have been proposed and implemented. Most common clustering algorithms, whether model-agnostic methods like k-means (MacQueen, 1967; Hartigan and Wong, 1979; Lloyd, 1982) or model-based approaches such as Gaussian mixture models (Fraley and Raftery, 2002; Melnykov and Maitra, 2010), yield clusters with regular dispersions or structure. For instance, the k-means algorithm is geared towards finding homogeneous spherical clusters, that is, spherically-dispersed groups of equal radius.
Such algorithms are not designed to find general-shaped or structured groups; therefore, many additional approaches have been suggested to identify irregularly-shaped groups (see, for example, Dhillon et al., 2004; Fred and Jain, 2005; von Luxburg, 2007; Baudry et al., 2010; Hennig, 2010; Melnykov, 2016; Peterson et al., 2018). Kernel k-means clustering (Dhillon et al., 2004) enhances the k-means algorithm by using a kernel function φ(·) that nonlinearly maps the original (input) space to a higher-dimensional feature space where it may be possible to linearly separate clusters that were not linearly separable in the original space. Spectral clustering (von Luxburg, 2007) uses k-means on the first few eigenvectors of a Laplacian of the similarity matrix of the data. Both methods need the number of clusters to be provided: in the case of spectral clustering, von Luxburg (2007) suggests estimating this number as the one with the highest gap between successive eigenvalues. A separate set of approaches modifies the distribution of the mixture components in model-based clustering (MBC) by replacing the commonly-used multivariate Gaussian component with other, more general distributions. Some of these approaches simply add dimension reduction in the form of factor models (Ghahramani and Hinton, 1997; McNicholas and Murphy, 2008) through parsimonious Gaussian mixture models (PGMM). More generally, Franczak et al. (2014) propose MBC using a mixture of shifted asymmetric Laplace distributions (MixSAL), while Browne and McNicholas (2015) suggest using a mixture of generalized hyperbolic distributions (MixGHD). These approaches more fully exploit MBC but can be CPU intensive and are somewhat limited in capturing complex structures.
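The eigengap heuristic mentioned above can be sketched in a few lines. The following is an illustrative NumPy implementation, not code from the papers cited; the Gaussian similarity with bandwidth `sigma` is an assumed, not prescribed, choice:

```python
import numpy as np

def eigengap_k(X, sigma=1.0, kmax=10):
    # Pairwise squared distances and Gaussian similarity matrix
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    lam = np.sort(np.linalg.eigvalsh(L))   # ascending eigenvalues
    gaps = np.diff(lam[: kmax + 1])        # gaps between successive eigenvalues
    return int(np.argmax(gaps)) + 1        # number of clusters at the largest gap
```

For two well-separated groups, the first two Laplacian eigenvalues are near zero and the largest gap sits between the second and third, so the heuristic returns 2.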
Evidence accumulation clustering, or EAC (Fred and Jain, 2005), combines results from multiple runs of the k-means algorithm with the underlying rationale that each partitioning provides independent evidence of structure, which is then extricated by cross-tabulating the relative frequencies (out of the multiple partitionings) with which each observation pair is in the same group. This relative frequency table serves as a similarity matrix for hierarchical clustering; however, implementation of this method can be computationally demanding in terms of CPU speed and memory. Stuetzle and Nugent (2010) developed a nonparametric clustering approach under the premise that each group corresponds to a mode of the estimated multivariate density of the observations. The high-density modes are located and hierarchically clustered, with dissimilarity between two modes calculated in terms of the lowest density or number of common points in each mode's domain of attraction. The density-based spatial clustering of applications with noise (DBSCAN) algorithm (Ester et al., 1996) groups together points in high-density regions while identifying points in low-density regions as outliers. A refinement (DBSCAN*, by Campello et al., 2013) follows the same principle but classifies so-called border observations as outliers. Both algorithms depend on the minimum cluster size and reachability distance, and also on a cut-off to determine the border and outlying observations. The authors suggest setting this cutoff at the knee of a plot of the k-nearest neighbor distances of the observations. In a similar vein, Rodriguez and Laio (2014) developed a fast Density Peaks (DP) algorithm to determine cluster centers and find outliers while considering the local density of each observation.
DP uses the estimated multivariate density in order to classify observations into outliers and does not rely on an explicit cut-off value, but other parameters need to be subjectively specified or estimated to graphically decide on the number of groups. These methods all rely on density estimates and are not immune from the ravages of the curse of dimensionality. More recent work (Baudry et al., 2010; Hennig, 2010; Melnykov, 2016; Peterson et al., 2018) proposed merging groups found using MBC or k-means. Such methods fall into the category of what we introduce in this paper as syncytial clustering algorithms, because they yield a cluster structure resembling a syncytium, a term that in cell biology refers to a multi-nucleated mass of cytoplasm inseparable into individual cells and that can arise from multiple fusions of uninuclear cells. Syncytial clustering algorithms are similar in that they merge or fuse groups that originally corresponded to mixture model components or k-means or other regular-structured groups. Resulting partitions have groups with potentially multiple well-defined and structured sub-groups. We outline a few such algorithms next. MBC is premised on the idea of a one-to-one correspondence between a mixture component of given density form and a group. Such injective mapping assumptions are not always tenable, so some authors (Baudry et al., 2010; Hennig, 2010; Melnykov, 2016) model each group as a mixture of (one or more) components. Operationally, we have a syncytial clustering framework where identified mixture components that are not very distinct from each other are merged (Baudry et al., 2010; Hennig, 2010; Melnykov, 2016) into a cluster. Baudry et al. (2010) successively merge the mixture component pair that results in the highest change in entropy, continuing for as long as the entropy criterion improves. This method, abbreviated here as MMC, is implemented in the R (R Development Core Team, 2018) package RMixModCombi (Baudry and Celeux, 2014).
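The entropy-merging criterion of Baudry et al. can be sketched as follows (a schematic re-implementation, not the RMixModCombi code): pooling two components' posterior probabilities can only sharpen the soft assignments, and the pair chosen is the one whose pooling lowers the entropy the most.

```python
import numpy as np

def soft_entropy(tau):
    # Entropy of the soft assignment matrix tau (n x K posterior probabilities)
    t = np.clip(tau, 1e-12, 1.0)
    return float(-(t * np.log(t)).sum())

def best_entropy_merge(tau):
    # Return the component pair whose pooling most reduces the entropy,
    # together with the entropy after that merge
    n, K = tau.shape
    best_pair, best_ent = None, soft_entropy(tau)
    for k in range(K):
        for l in range(k + 1, K):
            merged = np.delete(tau, l, axis=1)
            merged[:, k] = tau[:, k] + tau[:, l]   # pool posteriors of k and l
            ent = soft_entropy(merged)
            if ent < best_ent:
                best_pair, best_ent = (k, l), ent
    return best_pair, best_ent
```

Two thoroughly confused components (posteriors near 0.5/0.5) are the natural first merge: pooling them drives their assignment entropy to essentially zero.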
Hennig (2010) developed the directly estimated misclassification probabilities (DEMP) algorithm to identify candidate components for merging. The author argued that the best measure of group similarity should relate to the classification probability and so proposed that clusters with the highest pairwise misclassification probabilities be merged. The DEMP+ method (Melnykov, 2016) mimics DEMP but replaces the misclassification probabilities of DEMP with the overlap measure of Maitra and Melnykov (2010) for Gaussian mixture components. DEMP+ uses Monte Carlo simulation to determine pairwise overlap between merged components and uses thresholds on the maximum pairwise overlap to determine termination. It was empirically suggested that this sliding threshold be chosen inversely related to dimension. The MBC algorithms offer a principled approach to the partitioning of observations into groups, but they are more demanding in CPU time and perhaps unnecessary when the objective is simply to find the most appropriate grouping with no particular dogma regarding shape or structure; there, using k-means as a starting point for an initial clustering may be a fairly plausible but faster alternative. Perhaps recognizing this aspect, Melnykov (2016) contended that DEMP+ can be applied to k-means output by assuming equal mixing proportions and homogeneous spherical dispersions in the mixture model. The basis for this assertion is the framing of the k-means algorithm of Lloyd (1982) as a Classification Expectation-Maximization (CEM) algorithm (see Fraley and Raftery, 1998, for details). But k-means clustering makes hard assignments of each observation and, indeed, most commonly-used statistical software programs, such as R (R Development Core Team, 2018), use the efficient Hartigan and Wong (1979) algorithm, which handles computations quite differently from, and more sparingly than, Lloyd (1982). In this vein, Peterson et al.
(2018) provided the K-mH algorithm to merge poorer-separated k-means groups. Such groups are identified as per an easily-computed index that uses normal theory with spherical dispersion assumptions. However, the K-mH algorithm has a large number of settings and parameters: using default values and rules-of-thumb provided by the authors, we have found that this method performs well on many datasets but not as well on many others. Therefore, it would be worth investigating other syncytial clustering algorithms that use k-means groupings for clustering efficiency while also reducing the need to tune multiple parameter settings. A separate issue is the impact of Gaussian mixture model assumptions in methods such as DEMP+ when applied to regular-structured groups found using, say, the multivariate t-mixture or other appropriate models. A nonparametric method not taking recourse to such distributional assumptions would be desirable in addressing this shortcoming. This paper therefore proposes the Kernel-estimated Nonparametric Overlap-Based Syncytial Clustering (KNOB-SynC) algorithm that successively merges groups from a well-optimized k-means solution until some objective and nonparametric data-driven cluster overlap measure vanishes or is no longer reduced. This measure is calibrated through the generalized overlap (Maitra, 2010; Melnykov and Maitra, 2011; Melnykov et al., 2012) calculated using smooth estimation of the cumulative distribution function (CDF) developed in Section 2. Our algorithm is illustrated and comprehensively evaluated in Section 3. Although motivated using k-means, the method is general enough to apply to the output of other partitioning algorithms, such as clustering using the Mahalanobis (1936) distance, or in scenarios with scatter (Maitra and Ramler, 2009) or incomplete records (Lithio and Maitra, 2018).
Section 4 applies our methodology to two interesting settings: in the first case, we identify the different kinds of gamma ray bursts in the most recent Burst and Transient Source Experiment (BATSE) 4Br catalog. Our second application uses KNOB-SynC to identify activation from single replications of a functional Magnetic Resonance Imaging (fMRI) study obtained from a right-hand finger tapping experiment performed by a right-hand-dominant male. We find our results to be both interpretable and more reproducible than those of current methods. The paper concludes with some discussion. An appendix provides mathematical proofs of the theoretical properties we derive for smooth estimation of the CDF using asymmetric kernel density estimation, detailed graphical illustrations of experimental performance on two-dimensional (2D) datasets, and numerical summaries of performance on all datasets.

2. Methodological Development

2.1. Problem Setup

Let Ξ = {X₁, X₂, ..., Xₙ} be a random sample of n p-dimensional observations, with each

$X_i \sim \prod_{c=1}^{C} [f_c(x)]^{\zeta_{ic}},$ (1)

where C is the number of groups, $\zeta_{ic} = I(X_i \in C_c)$ with I(Z) = 1 if Z holds and 0 otherwise, $f_c(x)$ is the cluster-specific density of an observation in the cth cluster, and $C_c$ is the set of observations in the sample from that group. Our specification in (1) refers to a hard clustering framework: we marginally obtain a mixture model if we specify independent identical multinary prior distributions on each $\zeta_{ic}$. Our objective is to estimate the $\zeta_{ic}$'s (equivalently, the $C_c$'s) for each c = 1, 2, ..., C, with C possibly unknown. We also assume that for each c = 1, 2, ..., C, the density $f_c(x)$ for any $X_i \in C_c$ (i.e. $\zeta_{ic} = 1$) can be further described by

$f_c(x) = \prod_{k=1}^{k_c} \left[h(\|x - \mu^{C_c}_k\|)\right]^{\zeta^{C_c}_{ik}},$ (2)

where h(·) is defined on the positive half of the real line so that $h(\|x\|)$ is a zero-centered density in $\mathbb{R}^p$ with spherical level hyper-surfaces.
This means that each group in the dataset can be further decomposed into multiple homogeneous spherically-dispersed subgroups, and $\zeta^{C_c}_{ik} = 1$ if $X_i$ is in $C_c$ and in the kth subgroup inside $C_c$, and zero otherwise. That is, we can model $X_i \in \Xi$ as

$X_i \sim \prod_{c=1}^{C}\prod_{k=1}^{k_c}\left[h(\|x - \mu^{C_c}_k\|)\right]^{\zeta^{C_c}_{ik}}$, or equivalently as $X_i \sim \prod_{k=1}^{K}\left[h(\|x - \mu^{*}_k\|)\right]^{\zeta^{*}_{ik}},$ (3)

where $\zeta^{*}_{ik}$ and $\mu^{*}_k$ for k = 1, 2, ..., K are renumerations, respectively, of all the $\zeta^{C_c}_{ik}$ and $\mu^{C_c}_k$ for k = 1, 2, ..., $k_c$, c = 1, 2, ..., C. Therefore, $K = \sum_{c=1}^{C} k_c$, $\zeta_{ic} = \sum_{k=1}^{k_c}\zeta^{C_c}_{ik}$ for c = 1, 2, ..., C, and $\sum_{k=1}^{K}\zeta^{*}_{ik} \equiv \sum_{c=1}^{C}\sum_{k=1}^{k_c}\zeta^{C_c}_{ik} = 1$ (however, both K and C are also unknown). The reformulation of (1) in terms of (3) means that the k-means algorithm (Forgy, 1965; Lloyd, 1982; Hartigan and Wong, 1979) can be employed along with cluster-selection methods (for example, Krzanowski and Lai, 1988; Sugar and James, 2003; Maitra et al., 2012) to obtain a first-pass clustering of the dataset where the observations are partitioned into an estimated number ($\hat K$) of homogeneous spherically-dispersed groups. Our proposal is to develop methods for identifying the supersets of these k-means (homogeneous spherical) groups to obtain the clusters $\{C_c; c = 1, 2, \ldots, C\}$, with C also needing to be estimated. These supersets will reveal the general-shaped clustering structure in the data. From the $\hat K$-groups solution, define the ith residual (i = 1, 2, ..., n) as

$\hat\epsilon_i = X_i - \sum_{k=1}^{\hat K}\hat\mu^{*}_k\hat\zeta^{*}_{ik},$ (4)

where $\hat\mu^{*}_k$ is the multivariate mean vector of the observations in the kth group and $\hat\zeta^{*}_{ik} = I(X_i \in k\text{th } k\text{-means group})$. From (4), we obtain the normed residuals, that is,

$\hat\Psi_i = \sqrt{\hat\epsilon_i'\hat\epsilon_i} = \Big\|X_i - \sum_{k=1}^{\hat K}\hat\zeta^{*}_{ik}\hat\mu^{*}_k\Big\|$ (5)

for i = 1, 2, ..., n. These $\hat\Psi_1, \hat\Psi_2, \ldots, \hat\Psi_n$ may be viewed as a random sample with density function h(·) and CDF H(·), having support in [0, ∞). We now provide methods for estimating H(·) under assumptions of a smooth CDF.

2.2.
Smooth estimation of the CDF of the normed residuals

We first introduce a smooth estimator for a univariate CDF. Let $Y_1, Y_2, \ldots, Y_n$ be a random sample having CDF H(·) and probability density function (PDF) h(·). The natural and most common estimator is the empirical CDF (ECDF), defined as

$\hat H_n(y) = \frac{1}{n}\sum_{i=1}^{n} I(Y_i \le y).$ (6)

It is easy to see that $\hat H_n(y)$ is an unbiased estimator of H(y), that is, $E[\hat H_n(y)] = H(y)$. Further, it converges almost surely to the true CDF H(·). However, the ECDF is a step function for any finite n and so is inappropriate for a smooth continuous CDF, even though it is a smooth function in the limit as n → ∞ (Silverman, 1986). An alternative kernel estimator (Rosenblatt, 1956; Parzen, 1962; Silverman, 1986; Wand and Jones, 1995) for H(·) replaces the indicator function in (6) by its smooth cousin. Strictly speaking, kernel density estimation is most often employed in nonparametric contexts (Silverman, 1986) but can also be extended to smooth CDF estimation by integrating over the domain of the kernel. Let $G(y) = \int_{-\infty}^{y} K(u)\,du$ be the CDF of a kernel function K(·). The kernel CDF estimator is then defined as

$\hat H(y; b) = \frac{1}{n}\sum_{i=1}^{n} G\left(\frac{y - Y_i}{b}\right),$ (7)

where b is the bandwidth or smoothing parameter. Equation (7) makes the popular assumption of a symmetric kernel, the most common examples of which are the Gaussian and Epanechnikov (Epanechnikov, 1969; Azzalini, 1981; Reiss, 1981) kernels. However, using a symmetric kernel when the support of the distribution is not the entire real line (as is the case with our normed residuals) causes weight to be assigned outside the domain of the observations, resulting in boundary bias (Bouezmarni and Scaillet, 2005). So Chen (2000) proposed using an asymmetric kernel in (7) based on the gamma density, with behavior similar to the Gaussian kernel and a comparable rate of convergence in terms of the mean squared error.
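To make (6) and (7) concrete, here is a small sketch with a Gaussian kernel (so that G is the standard normal CDF); as b → 0 the smooth estimate approaches the step-function ECDF:

```python
import numpy as np
from math import erf, sqrt

def ecdf(y, sample):
    # Empirical CDF of eq. (6): proportion of observations at or below y
    return float(np.mean(np.asarray(sample) <= y))

def kernel_cdf(y, sample, b):
    # Smooth kernel CDF of eq. (7) with a Gaussian kernel:
    # the indicator 1(Y_i <= y) is replaced by G((y - Y_i)/b)
    G = [0.5 * (1.0 + erf((y - Yi) / (b * sqrt(2.0)))) for Yi in sample]
    return float(np.mean(G))
```

Note that a Gaussian kernel is used here purely for illustration; as the text goes on to explain, a symmetric kernel is a poor choice for the nonnegative normed residuals.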
However, Chen (2000)'s estimator is not a valid density for finite sample sizes (Jeon and Kim, 2013), so we consider the Reciprocal Inverse Gaussian (RIG) kernel density estimator (Scaillet, 2004)

$\hat h(y; b) = \frac{1}{n}\sum_{i=1}^{n} K(y; Y_i, b),$ (8)

with $K(y; Y_i, b) = \frac{1}{\sqrt{2\pi b Y_i}}\exp\left\{-\frac{1}{2bY_i}\left[Y_i - (y - b)\right]^2\right\}$. Since $K(y; Y_i, b)$ is a smooth function, the CDF estimate is defined as $G(y; Y_i, b) = \int_0^y K(t; Y_i, b)\,dt$. Then, integrating $\hat h(y; b)$ with respect to y yields the smooth CDF estimator

$\hat H(y; b) = \frac{1}{n}\sum_{i=1}^{n} G(y; Y_i, b) = \frac{1}{n}\sum_{i=1}^{n}\left[\Phi\left(\frac{Y_i + b}{\sqrt{Y_i b}}\right) - \Phi\left(\frac{Y_i - (y - b)}{\sqrt{Y_i b}}\right)\right],$ (9)

where Φ(·) is the standard Gaussian CDF. An added benefit of using the RIG kernel over the gamma kernel is that the estimated CDF is in closed form and can be readily evaluated using standard software. We now investigate some theoretical properties of the asymmetric RIG kernel CDF estimator. Before proceeding, however, we revisit the definitions of the Inverse Gaussian and RIG densities for the sake of completeness and to fix ideas.

Definition 1 A nonnegative random variable $U_{\mu,\lambda}$ is said to arise from the Inverse Gaussian distribution with parameters (µ, λ) if it has the density

$r(u; \mu, \lambda) = \sqrt{\frac{\lambda}{2\pi u^3}}\,\exp\left\{-\frac{\lambda}{2\mu}\left(\frac{u}{\mu} - 2 + \frac{\mu}{u}\right)\right\}$ for u > 0, and 0 otherwise. (10)

Notationally, we write $U_{\mu,\lambda} \sim IG(\mu, \lambda)$. Also, we have $E(U_{\mu,\lambda}) = \mu$ and $\mathrm{Var}(U_{\mu,\lambda}) = \mu^3/\lambda$.

Definition 2 A nonnegative random variable $V_{\mu,\lambda}$ is said to be from the Reciprocal Inverse Gaussian distribution with parameters (µ, λ) if it has the density

$s(v; \mu, \lambda) = \sqrt{\frac{\lambda}{2\pi v}}\,\exp\left\{-\frac{\lambda}{2\mu}\left(v\mu - 2 + \frac{1}{\mu v}\right)\right\}$ for v > 0, and 0 otherwise. (11)

Notationally, $V_{\mu,\lambda} \sim RIG(\mu, \lambda)$. Further, $V_{\mu,\lambda}$ is equivalent in law to $1/U_{\mu,\lambda}$ where $U_{\mu,\lambda} \sim IG(\mu, \lambda)$, and $E(V_{\mu,\lambda}) = 1/\mu + 1/\lambda$ while $\mathrm{Var}(V_{\mu,\lambda}) = (\lambda + 2\mu)/(\lambda^2\mu)$. We now develop some properties of the asymmetric RIG kernel for estimating the CDF.

Lemma 3 Let $Y_1, Y_2, \ldots, Y_n$ be independent identically distributed nonnegative-valued random variables with CDF H(y), and PDF h(y) that is infinitely differentiable.
Also, consider the RIG kernel density defined by $K(t; Y_i, b) = \varphi\left[(Y_i - (t - b))/\sqrt{bY_i}\right]/\sqrt{bY_i}$, where φ(z) is the standard normal density evaluated at z. Consider estimating H(y) using $\hat H(y; b)$ as defined in (9). Then, as b → 0,

$E[\hat H(y; b)] = H(y) + \frac{b}{2}\left[y h'(y) - h(y)\right] + o(b) = H(y) + O(b)$

and

$\mathrm{Var}[\hat H(y; b)] = \frac{H(y)(1 - H(y))}{n} - \frac{H(y)\sqrt{b}}{2n\sqrt{2\pi y}} + o(\sqrt{b}) = \frac{H(y)(1 - H(y))}{n} + O(\sqrt{b}).$

Proof See Appendix A.1.

Lemma 3 shows that $\hat H(y; b)$ has lower variance than the ECDF, with point-wise Mean Squared Error (MSE) at y given by $\mathrm{MSE}[\hat H(y; b)] = \mathrm{Var}[\hat H(y; b)] + [\mathrm{Bias}\{\hat H(y; b)\}]^2$.

2.2.1. Bandwidth selection

Scaillet (2004) minimized the Mean Integrated Squared Error (MISE) to provide a rule-of-thumb bandwidth selector for the RIG kernel density estimator of the form

$b^{*} = \left[\frac{2\int_0^\infty y^{-1/2} h(y)\,dy}{\sqrt{\pi}\int_0^\infty y^2\{h''(y)\}^2\,dy}\right]^{2/5} n^{-2/5}.$ (12)

However, (12) involves knowledge of the true density h(·) and is directly unusable. Scaillet (2004) proposed obtaining $\hat b$ by assuming an initial parametric density, say h(·; θ), for h(·) and estimating the parameters θ of the density from the sample. Exact derivations using a lognormal density for h(·; θ) were provided (Scaillet, 2004), but this approach has been found to produce estimates that are biased downwards. We therefore adopt Scaillet (2004)'s approach but use an initial gamma density $h(y; \vartheta, \tau) = \exp(-y/\tau)\,y^{\vartheta - 1}\tau^{-\vartheta}/\Gamma(\vartheta)$ for y > 0 and zero otherwise. Under this setup, $\int_0^\infty y^{-1/2} h(y; \vartheta, \tau)\,dy = \Gamma(\vartheta - 1/2)/\{\sqrt{\tau}\,\Gamma(\vartheta)\}$ and $\int_0^\infty y^2\{h''(y; \vartheta, \tau)\}^2\,dy = (6\vartheta - 4)(\vartheta - 1)\Gamma(2\vartheta)/\{4^{\vartheta}\tau^3\Gamma^2(\vartheta)(2\vartheta - 1)\}$. Therefore, we have

$\hat b = \left[\frac{2^{2\hat\vartheta + 1}\,\hat\tau^{5/2}\,(2\hat\vartheta - 1)\,\Gamma(\hat\vartheta - \tfrac12)\,\Gamma(\hat\vartheta)}{\sqrt{\pi}\,(6\hat\vartheta - 4)(\hat\vartheta - 1)\,\Gamma(2\hat\vartheta)}\right]^{2/5} n^{-2/5},$ (13)

with $\hat\vartheta$ and $\hat\tau$ estimated from the sample $Y_1, Y_2, \ldots, Y_n$ using, for example, the method of moments. This $\hat b$ is used in (9) to obtain our smoothed RIG-kernel CDF estimator. The development of this section, when applied to the normed residuals $\hat\Psi_1, \hat\Psi_2, \ldots, \hat\Psi_n$ (in place of $Y_1, Y_2, \ldots, Y_n$), yields a smooth nonparametric kernel-based estimator of their CDF.
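Putting (9) and (13) together, the following is a sketch of the smooth RIG-kernel CDF estimator with the gamma-based rule-of-thumb bandwidth (method-of-moments fit; it assumes the fitted shape $\hat\vartheta > 1$ so that all the gamma functions and the bandwidth constant are well defined):

```python
import numpy as np
from math import erf, sqrt, pi, gamma

def norm_cdf(z):
    # Standard Gaussian CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def rot_bandwidth(sample):
    # Rule-of-thumb bandwidth of eq. (13): gamma shape (th) and scale (tau)
    # fitted by the method of moments
    m, v = float(np.mean(sample)), float(np.var(sample))
    th, tau = m * m / v, v / m
    num = 2.0 ** (2 * th + 1) * tau ** 2.5 * (2 * th - 1) * gamma(th - 0.5) * gamma(th)
    den = sqrt(pi) * (6 * th - 4) * (th - 1) * gamma(2 * th)
    return (num / den) ** 0.4 * len(sample) ** (-0.4)

def rig_cdf(y, sample, b):
    # Smooth RIG-kernel CDF estimator of eq. (9)
    terms = [norm_cdf((Yi + b) / sqrt(Yi * b)) -
             norm_cdf((Yi - (y - b)) / sqrt(Yi * b)) for Yi in sample]
    return float(np.mean(terms))
```

The estimate is a proper, closed-form CDF on [0, ∞): each summand is nonnegative and increasing in y, so the estimator is monotone and bounded by 1.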
We use this kernel-estimated CDF in our development of the nonparametric estimation of the overlap measure between groups.

2.3. A nonparametric estimator of overlap between groups

Overlap between two groups is an indicator of the extent to which they are indistinguishable from each other. Maitra and Melnykov (2010) defined the pairwise overlap of two mixture components as the sum of the misclassification probabilities

$\omega_{lk} \equiv \omega_{kl} = \omega_{l|k} + \omega_{k|l}$, with $\omega_{l|k} = P[X \text{ is assigned to } C_l \mid X \text{ is truly in } C_k].$ (14)

For any two mixture components with densities $f(x; \theta_k)$ and $f(x; \theta_l)$ and mixing proportions $\pi_k$ and $\pi_l$, we have $\omega_{k|l} = P\left[\pi_l f(X_i; \theta_l) < \pi_k f(X_i; \theta_k) \mid X_i \sim f(\cdot; \theta_l)\right]$, where $\theta_k$ and $\theta_l$ are the parameter sets associated with the kth and lth mixture components. Maitra and Melnykov (2010) calculated (14) for Gaussian mixture densities, but the definition itself is general enough to include other clustering situations, including those as general as when we have cluster distributions given by densities of the type in (1). For an equal-proportioned mixture of homogeneous spherical Gaussian densities, Maitra and Melnykov (2010) showed that $\omega_{k|l} = \Phi(-\|\mu_l - \mu_k\|/(2\sigma))$ between the kth and the lth cluster, where Φ(·) is the standard Gaussian CDF, $\mu_k$ and $\mu_l$ are the kth and the lth cluster means, and σ is the common (homogeneous) standard deviation for each group, with $\sigma^2$ estimated unbiasedly as $\mathrm{WSS}_K/\{(n - K)p\}$, $\mathrm{WSS}_K$ being the optimized value of the within-sums-of-squares (WSS) of the K-groups solution. The sum of $\omega_{k|l}$ and $\omega_{l|k}$ reduces to $\omega_{kl} = 2\Phi(-\|\mu_l - \mu_k\|/(2\sigma))$. The k-means formulation of (3) can be viewed more generally (Maitra et al., 2012) and extends beyond the case of Gaussian-distributed groups, so we develop nonparametric methods for estimating the overlap measure.

2.3.1.
Pairwise overlap between two k-means groups

The pairwise overlap (14) between two groups can generally be calculated from $H_\Psi(\cdot)$ as

$\omega_{l|k} = P\left(\|X - \mu_l\| < \|X - \mu_k\| \mid X \in C_k\right) = 1 - P\left(\Psi_k < \Psi_{l(k)}\right),$ (15)

where $\Psi_k$ represents the normed residual obtained from the kth group, and $\Psi_{l(k)}$ represents the normed pseudo-residual, which we define as the norm of the remainder obtained by subtracting the lth cluster mean $\mu_l$ from an observation $X \in C_k$. Let $\hat H_\Psi(y; \hat b)$ be the RIG kernel-estimated smooth CDF obtained using the bandwidth selected as per (13). Then $P(\Psi_k < y)$ can be estimated using $\hat H_\Psi(y; \hat b)$ (where Ψ in the subscript of $\hat H(\cdot;\cdot)$ denotes that the estimated CDF uses the normed residuals). However, the calculation of $P(\Psi_k < \Psi_{l(k)})$ is not as straightforward, so we estimate it using a naïve average estimator

$\hat P\left(\Psi_k < \Psi_{l(k)}\right) = \frac{1}{n^{*}_k}\sum_{i=1}^{n} \hat\zeta^{*}_{ik}\,\hat H_\Psi(\|X_i - \hat\mu^{*}_l\|; \hat b),$ (16)

where $n^{*}_k = \sum_{i=1}^{n} \hat\zeta^{*}_{ik}$. The naïve estimator (16) can be considered an empirical estimator of $E[\hat H_\Psi(\|X_i - \hat\mu_l\|; \hat b) \mid \hat\mu_k, \hat\mu_l, X_i \in k\text{th spherically-dispersed group}]$. Similar estimates of $\omega_{k|l}$, and therefore $\omega_{kl}$, can be obtained. We call this estimated overlap $\hat\omega_{kl} \equiv \hat\omega_{lk}$.

2.3.2. Pairwise overlap between two composite groups

As described in (2), a composite group is one that can be further decomposed into subpopulations. We now extend the definition of the pairwise overlap for such groups. Let $\omega_{C_l|C_k}$ be defined as in (14) but for composite groups. That is, we use $\omega_{C_l|C_k}$ rather than $\omega_{l|k}$ in order to specify that the overlap measure is between composite clusters $C_l$ and $C_k$. Now $\omega_{C_l|C_k} = 1 - P[\min_{r \in C_k}\|X - \mu_r\| < \min_{j \in C_l}\|X - \mu_j\| \mid X \in C_k]$. Suppose now that $C^s_k$ is the sth spherical sub-cluster of $C_k$ with mean $\mu^{*}_s$, s = 1, 2, ..., $|C_k|$, with $|C_k|$ being the number of spherical sub-clusters in $C_k$. We assume that if $X \in C_k$, then $\mathrm{argmin}_{r \in \{1,2,\ldots,|C_k|\}}\|X - \mu^{*}_r\| = s$ implies that X is in the subgroup given by $C^s_k$.
Under this assumption, the density of X is defined through its (sth) sub-cluster, and so

$P\left(\min_{r \in C_k}\|X - \mu_r\| \le y \mid X \in C_k\right) = 1 - P\left(\min_{r \in C_k}\Psi_r > y\right) = 1 - \left[1 - P(\Psi_r \le y)\right]^{|C_k|},$ (17)

where $\Psi_r$ is a normed residual (obtained, for instance, from the k-means solution) for the rth spherically-dispersed subgroup in the kth cluster. We use the RIG kernel distribution estimator to obtain $P(\Psi_r \le y)$. From (14), and using the same ideas as in (16), we get the naïve estimator

$\hat\omega_{C_l|C_k} = 1 - \frac{1}{n_{C_k}}\sum_{i=1}^{n} \hat\zeta_{ik}\,\hat H_\Psi\left(\min_{r \in C_l}\|X_i - \hat\mu_r\|; \hat b\right),$ (18)

and similarly for $\hat\omega_{C_k|C_l}$, from where we calculate $\hat\omega_{C_l C_k} \equiv \hat\omega_{C_k C_l} = \hat\omega_{C_l|C_k} + \hat\omega_{C_k|C_l}$. Our definitions of the $C_k$'s and $\hat\omega_{C_k C_l}$ are consistent in the sense that if $C_k = \{k\}$ and $C_l = \{l\}$ are both k-means groups, then $\hat\omega_{C_k C_l} = \hat\omega_{kl}$. We use this equivalence in the description of our KNOB-SynC algorithm in Section 2.4 below.

2.3.3. Summarizing overlap in a partitioning

Our development so far has provided us with pairwise overlap measures for k-means-type (Section 2.3.1) and composite (Section 2.3.2) groups. For a K-groups (whether of the composite or k-means type) partitioning, we get $\binom{K}{2}$ pairwise overlap measures. Summarizing the pairwise overlap measures is important to provide a sense of clustering complexity, so Maitra and Melnykov (2010) originally proposed regulating $\check\omega$ (the maximum of all pairwise overlaps) and $\bar\omega$ (the average of all $\binom{K}{2}$ pairwise overlaps) and demonstrated (see Figures 2 and 3 of Maitra and Melnykov, 2010) the ability to summarize a wide range of cluster geometries. However, because specifying two measures simultaneously is cumbersome, later versions of the CARP (Melnykov and Maitra, 2011) and MixSim (Melnykov et al., 2012) software packages borrowed ideas from Maitra (2010) to obtain the generalized overlap $\ddot\omega = (\check\lambda_\Omega - 1)/(K - 1)$, where $\check\lambda_\Omega$ is the largest eigenvalue of the (symmetric) matrix Ω of pairwise overlaps $\omega_{lk}$ ($\omega_{C_k C_l}$ for composite groups) and with diagonal entries that are all 1.
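The generalized overlap just described reduces to a single eigen-computation; a minimal sketch:

```python
import numpy as np

def generalized_overlap(Omega):
    # (largest eigenvalue of Omega - 1) / (K - 1), where Omega is the
    # symmetric K x K pairwise-overlap matrix with unit diagonal
    K = Omega.shape[0]
    return float((np.linalg.eigvalsh(Omega)[-1] - 1.0) / (K - 1.0))
```

For the identity matrix (perfectly separated groups) the result is 0, and for the all-ones matrix (completely indistinguishable groups) it is 1, matching the extremes described next.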
$\ddot\omega$ lies in [0, 1], with zero indicating perfect separation between all group densities and 1 indicating indistinguishability between any of them. In this paper, we obtain the estimated generalized overlap $\hat{\ddot\omega}$ using the estimated matrix $\hat\Omega$ with off-diagonal entries given by the kernel-estimated pairwise overlaps $\hat\omega_{lk}$ or $\hat\omega_{C_k C_l}$, depending on whether we have simple k-means-type or composite groups.

2.4. The KNOB-SynC Algorithm

Having provided theoretical development for the machinery that we will use, we now describe our multi-phased KNOB-SynC algorithm:

1. The k-means phase: This phase finds the optimal partition of the dataset in terms of homogeneous spherically-dispersed groups and has the following steps:

(a) For each $K \in \{1, 2, \ldots, K_{\max}\}$, obtain K-means partitions initialized $\sqrt{n}Kp$ times, each with K distinct seeds randomly chosen from the dataset and run to termination. The best, in terms of the value of the objective function (WSS) at termination of each set of runs, is our putative optimal K-means partition for that $K \in \{1, 2, \ldots, K_{\max}\}$. We use $K_{\max} = \max\{\sqrt{n}, 50\}$.

(b) When n is small relative to p (operationally, $n < p^2$), use Krzanowski and Lai (1988)'s KL criterion to decide on the optimal K. Otherwise, for larger n, we use the jump statistic (Sugar and James, 2003) on the optimal K-means partitions ($K \in \{1, 2, \ldots, K_{\max}\}$) obtained in Step 1a to determine the optimal K (denoted by $\hat K$). In calculating the jump statistic, we have used y = p/2, which has become the default in most applications. We refer to Sugar and James (2003) for more detailed discussion of this choice of y. The corresponding $\hat K$-means solution is the optimal homogeneous spherically-dispersed partition of the dataset. This concludes the k-means phase of the algorithm.

2. The initial overlap calculation phase: This phase starts with the output of Step 1. That is, we start with a structural definition of the dataset in terms of $\hat K$ optimal homogeneous spherically-dispersed groups.
Our objective here is to calculate the overlap between each pair of these groups using nonparametric kernel estimation methods. We proceed as follows:

(a) For each observation $X_i$, i = 1, 2, ..., n, compute its normed residual $\hat\Psi_i = \sqrt{\hat\epsilon_i'\hat\epsilon_i}$, where $\hat\epsilon_i$ is defined as in (4). Also, obtain the normed pseudo-residuals $\hat\Psi_{i;l(k)} = \|X_i - \hat\mu_l\|$ for $X_i \in C_k$ and $l \ne k \in \{1, 2, \ldots, \hat K\}$.

(b) Using the set of normed residuals $\{\hat\Psi_i; i = 1, 2, \ldots, n\}$, obtain its RIG-kernel-estimated CDF using (9) with bandwidth determined as per (13).

(c) For any two groups $k \ne l \in \{1, 2, \ldots, \hat K\}$, estimate the pairwise overlap $\hat\omega_{lk} = \hat\omega_{l|k} + \hat\omega_{k|l}$, where $\hat\omega_{l|k}$ and $\hat\omega_{k|l}$ are calculated using (15) and (16). We obtain the estimated overlap matrix $\hat\Omega$ (with diagonal elements all equal to unity). For clarity, denote this overlap matrix as $\hat\Omega^{(1)}$ and the pairwise overlaps as $\hat\omega^{(1)}_{kl} \equiv \hat\omega^{(1)}_{C_k C_l}$.

(d) From the overlap matrix $\hat\Omega^{(1)}$, calculate the generalized overlap $\hat{\ddot\omega}$. Call it $\hat{\ddot\omega}^{(1)}$.

3. The merging phase: The merging phase is triggered only if some of the overlap measures between overlapping clusters are substantially larger than the others (operationally, if $4\hat{\ddot\omega}^{(1)} \le \check{\hat\omega}$, where $\check{\hat\omega}$ is the maximum of the estimated pairwise overlaps) or if $\hat{\ddot\omega}^{(1)}$ is not negligible (operationally, $\hat{\ddot\omega}^{(1)} \ge 10^{-5}$). In that case, this phase merges groups, provides pairwise overlap measures between newly-formed composite groups, the updated overlap matrix and the generalized overlap, continuing for as long as the generalized overlap keeps decreasing (by at least $10^{-5}$) or is not negligible. Specifically, this phase iteratively proceeds for ℓ = 1, 2, ... with the following steps:

(a) Merge the groups with the maximum overlap, along with every pair of groups whose individual pairwise overlap is substantially larger than the generalized overlap $\hat{\ddot\omega}^{(\ell)}$.
That is, merge every pair of groups Ck, Cl, k ≠ l, such that ˆω(ℓ)_{lk} = ˇˆω(ℓ) or ˆω(ℓ)_{lk} > κˆω(ℓ), for some κ as described in the comments below. Call the new merged group C_{min(k,l)} and decrease the index labels of the groups with indices greater than max(k, l). Decrement ˆK by 1 for every merged pair.

(b) Using (18), update the pairwise overlap measures that have changed as a result of the merges in Step 3a. Call the updated measures ˆω(ℓ+1)_{C_k C_l}. Obtain the updated overlap matrix (call it ˆΩ(ℓ+1)) and the updated generalized overlap ˆω(ℓ+1). Set ℓ ← ℓ + 1.

(c) The merging phase terminates if ˆω(ℓ) > ˆω(ℓ−1), ˆω(ℓ) ≈ 0, or ˆω(ℓ) = ˇˆω(ℓ). The terminating ˆK is the ˆC of (1).

4. Final clustering solution: The grouping {C1, C2, . . . , C_ˆC} at the end of the merging phase is the final partition of the dataset. This gives us a total of ˆC general-shaped groups in the dataset.

Comments: We provide some additional remarks on KNOB-SynC and relate it to other algorithms for finding general-shaped clusters and settings:

1. The k-means phase finds regular-structured (more specifically, homogeneous spherical) groups and, in this regard, is similar to the initial stages of K-mH (Peterson et al., 2018) and EAC (Fred and Jain, 2005). However, EAC repeats k-means with fixed K several times and is built upon the premise that each k-means run does not end up with the same clustering, especially when we do not have underlying homogeneous spherically-dispersed groups. On the other hand, K-mH uses a separability index built on Gaussian assumptions for each cluster and has a large number of user-specified parameters. KNOB-SynC uses nonparametric CDF estimation with a plug-in bandwidth selector and a naïve average estimator to calculate the overlap between spherically-dispersed groups, and a naïve estimator for the overlap between composite groups. Our methodology has one parameter (κ) that is chosen in a completely data-driven framework.
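One pass of the merging phase (Step 3a) can be sketched as below. This is an illustrative sketch, not the authors' code: it assumes, following Maitra (2010), that the generalized overlap is the normalized largest eigenvalue (λ_max(ˆΩ) − 1)/(K − 1) of the overlap matrix, and it uses a union-find structure so that several pairs can coalesce into one composite group in a single pass:

```python
import numpy as np

def generalized_overlap(Omega):
    """Generalized overlap of an overlap matrix with unit diagonal,
    assumed here to be (largest eigenvalue - 1) / (K - 1)."""
    K = Omega.shape[0]
    lam = np.linalg.eigvalsh(Omega)[-1]      # eigvalsh returns ascending order
    return (lam - 1.0) / (K - 1)

def merge_step(Omega, kappa=3.0):
    """One merging pass: merge every pair whose pairwise overlap equals
    the maximum or exceeds kappa times the generalized overlap.
    Returns the new group label for each of the K input groups."""
    K = Omega.shape[0]
    omega_gen = generalized_overlap(Omega)
    off = Omega - np.eye(K)                  # off-diagonal pairwise overlaps
    omega_max = off.max()
    parent = list(range(K))                  # union-find over group labels

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]    # path compression
            a = parent[a]
        return a

    for k in range(K):
        for l in range(k + 1, K):
            if off[k, l] == omega_max or off[k, l] > kappa * omega_gen:
                ra, rb = find(k), find(l)
                parent[max(ra, rb)] = min(ra, rb)
    labels = [find(k) for k in range(K)]
    relabel = {r: i for i, r in enumerate(sorted(set(labels)))}
    return [relabel[r] for r in labels]
```

In a full implementation this pass would be iterated, with the pairwise overlaps of the newly-formed composite groups updated as in Step 3b before re-checking the termination conditions of Step 3c.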
No parameter requires fine-tuning by the practitioner. Also, the number of general-structured groups is decided at a termination that is objectively declared whenever the generalized overlap vanishes or does not decrease further.

2. As with MMC, DEMP or DEMP+, the use of cluster distributions in the overlap calculations simplifies computations and keeps them practical even for large datasets. In contrast, EAC, DBSCAN, DBSCAN*, DP and K-mH require memory-intensive cross-tabulation of the entire dataset across multiple clusterings, because n × n frequency tables need to be calculated and/or stored.

3. KNOB-SynC uses a naïve estimator to update the overlap between composite groups, unlike DEMP+, which uses Monte Carlo simulations and is slower. Further, DEMP+ uses the maximum overlap, which is very sensitive to individual pairwise overlap measures, while KNOB-SynC uses the generalized overlap measure (Maitra, 2010) that provides a nonlinear summary of all the individual pairwise overlaps.

4. Unlike DEMP or DEMP+, the stopping criterion of KNOB-SynC is data-driven, thus allowing for the possibility of obtaining well-separated and less well-separated partitionings as supported by the data. Our algorithm also has the potential, unlike MMC, DEMP or DEMP+, to merge multiple pairs of groups in a step.

5. KNOB-SynC uses nonparametric CDF estimation but does so in univariate space, by exploiting the inherent spherically-dispersed structure (ellipsoidal in the case of clustering with the Mahalanobis distance) of the sub-clusters. Therefore, it has greater immunity against the curse of dimensionality that bedevils the multivariate density estimation used in algorithms such as DBSCAN and DP.

6. The parameter κ determines the types of composite groups that are formed. For larger values of κ, we have groups formed by merging a few pairs at each iteration, while smaller values of κ favor many simultaneous mergers. (For κ = ∞, no merging is possible.)
In the first case, we expect to have stringy groups, while in the second case we find clusters that are irregularly-shaped but less stringy. A data-driven approach to choosing κ, which we adopt, runs the algorithm with the different values κ = 1, 2, 3, 4, 5 and uses the final partitioning with the smallest terminating ˆω as the optimal clustering.

7. Unlike other syncytial clustering algorithms such as DEMP, DEMP+ or K-mH, KNOB-SynC allows for the possibility of multiple pairs of groups being merged at an iteration.

8. Our initial stage uses k-means for speed and efficiency, which also allows us to explore larger candidate values of K. However, the approach could very well have been used with clustering algorithms obtained using, say, the generalized Mahalanobis distance. The overlap calculations are then easily modified. To see this, suppose that the generalized Mahalanobis distance between two points x and y is given by $d_\Gamma(x, y) = (x - y)'\Gamma^-(x - y)$, where Γ is any appropriate nonnegative-definite matrix with (say, Moore-Penrose) inverse given by Γ⁻. (A positive-definite Γ leads to the usual Mahalanobis distance.) Under the generalized Mahalanobis distance framework, (14) reduces to

$$\omega_{l|k} = P\bigl(\lVert(\Gamma^-)^{1/2}(X - \mu_l)\rVert < \lVert(\Gamma^-)^{1/2}(X - \mu_k)\rVert \mid X \in k\text{th group}\bigr) = 1 - P\bigl(\lVert(\Gamma^-)^{1/2}(X - \mu_k)\rVert < \lVert(\Gamma^-)^{1/2}(X - \mu_l)\rVert \mid X \in k\text{th group}\bigr), \qquad (19)$$

which means that the problem reduces to the Euclidean case if we use what we here refer to as the normed Mahalanobis-free residuals (and pseudo-residuals). Operationally, this is equivalent to obtaining $\hat\varepsilon_i = (\Gamma^-)^{1/2}\bigl(X_i - \sum_{k=1}^{K}\hat\zeta_{ik}\hat\mu_k\bigr)$ and replacing the ˆϵi with ˆεi in the calculation of (5), proceeding as before. This framework also includes the case when we scale each variable before clustering, as happens when the features are on vastly different scales, or when we use principal components (PCs) as our clustering variables; we illustrate these scenarios in Sections 3.3.6 and 4.1.

9.
Our algorithm accommodates clustering scenarios in the presence of scatter, as provided, for instance, by the output of the k-clips algorithm of Maitra and Ramler (2009). Scatter observations are those that are unlike any other and may be considered as individual groups in their own right. KNOB-SynC incorporates these scatter observations as individual clusters in addition to the groups found from the output of the k-clips algorithm and proceeds with the overlap calculation and merging phases as described earlier in this section. We illustrate this scenario in Section 3.4.1.

10. Datasets often have incomplete records, with missing observations in some features. The km-means algorithm (Lithio and Maitra, 2018) provides a k-means-type algorithm for Euclidean-distance clustering in this setting. Then, instead of k-means, KNOB-SynC can incorporate results from km-means in the first stage. For the incompletely-observed records, we calculate the rescaled normed residual in the presence of missing information by removing the missing values from the calculation and re-weighting appropriately. Specifically, we calculate the ith rescaled normed residual as

$$\hat\Psi_i = \sqrt{\frac{p}{p_i}\sum_{l=1}^{p_i}\bigl(X_{ij_l} - \hat\mu_{kj_l}\bigr)^2}, \qquad (20)$$

where $X_{ij_1}, X_{ij_2}, \ldots, X_{ij_{p_i}}$ represent the pi available features for the ith record that has been assigned to the kth spherically-dispersed sub-group with estimated mean ˆµk. Similar arguments allow for the calculation of the rescaled normed pseudo-residual ˆΨ_{i;l(k)}. The use of the nonparametric CDF estimator in KNOB-SynC provides us with the flexibility to calculate the initial overlap estimates from these scaled-up normed residuals. The merging phase and termination criteria of our algorithm remain unchanged. Section 3.4.2 illustrates KNOB-SynC on a dataset with incomplete records.

11. The use of nonparametric methods in the overlap calculations means that a large number of methods may be used in the initial partitional phase.
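The rescaled normed residual for an incompletely-observed record can be sketched as below. This is an illustrative sketch under the reading of (20) in which the sum of squares over the pᵢ observed coordinates is scaled up by p/pᵢ before the square root; the function name is ours:

```python
import numpy as np

def rescaled_normed_residual(x, mu):
    """Rescaled normed residual for a record x (NaN marks missing
    features) assigned to a sub-group with estimated mean mu: the
    sum of squares over the p_i observed coordinates is scaled up
    by p / p_i before taking the square root."""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    obs = ~np.isnan(x)                     # observed coordinates j_1, ..., j_{p_i}
    p, p_i = x.size, int(obs.sum())
    ss = np.sum((x[obs] - mu[obs]) ** 2)   # sum over available features only
    return np.sqrt(ss * p / p_i)
```

For a completely-observed record (pᵢ = p) this reduces to the ordinary Euclidean normed residual.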
The method can also potentially be modified to apply to other kinds of datasets. For instance, the initial clustering can be done for categorical datasets using k-modes (Huang, 1997, 1998; Chaturvedi et al., 2001; Dorman and Maitra, 2020), and then the Generalized or Gaussianized Distributional Transform (Rüschendorf, 2013; Zhu et al., 2019) and copula models (Nelsen, 2006) can potentially be applied to each cluster to obtain numerical-valued residuals for use with our overlap estimation and calculations.

Having proposed our KNOB-SynC algorithm, we now illustrate and evaluate its performance in relation to a host of competing methods.

3. Performance Evaluations

We first illustrate the performance of KNOB-SynC on the 2D Aggregation dataset of Gionis et al. (2007) and then follow with more detailed performance evaluations on a large number of datasets commonly used to evaluate competing algorithms in the literature. These datasets range from two to many dimensions. We compare our method with a wide range of competitors. These rival methods are the syncytial clustering techniques of K-mH (Peterson et al., 2018), using author-supplied R code; MMC (Baudry et al., 2010), as implemented in the R package MixModCombi; DEMP (Hennig, 2010), using the R package fpc; and DEMP+ (Melnykov, 2016). We also evaluate performance with EAC (Fred and Jain, 2005) and GSL-NN (Stuetzle and Nugent, 2010), using publicly available author-supplied code. We also apply two common connectivity-based techniques, spectral clustering and kernel k-means. Both these methods need the number of groups to proceed: for spectral clustering, we decide this number to be the one with the highest gap in successive eigenvalues of the similarity matrix (von Luxburg, 2007).
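The eigengap rule just described can be sketched as below. This is a minimal illustration; the similarity function used to build the similarity matrix in the paper's experiments is not shown here and the function name is ours:

```python
import numpy as np

def eigengap_K(S, K_max=10):
    """Pick the number of spectral-clustering groups as the position of
    the largest gap in the successive (descending) eigenvalues of the
    symmetric similarity matrix S (von Luxburg, 2007)."""
    lam = np.sort(np.linalg.eigvalsh(S))[::-1][:K_max + 1]  # top eigenvalues
    gaps = lam[:-1] - lam[1:]              # gap after the K-th eigenvalue
    return int(np.argmax(gaps)) + 1        # K with the largest eigengap
```

For a similarity matrix that is nearly block-diagonal with C blocks, the top C eigenvalues are large and the rest are near zero, so the largest gap falls after the C-th eigenvalue.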
For kernel k-means, we set K to be the true value: we recognize that our evaluation of kernel k-means potentially provides this method with an unfair advantage; however, we proceed in this fashion in order to understand the best-case scenario of this competing method. Finally, we also compare our method's performance relative to DBSCAN* as implemented in the R package dbscan (Hahsler and Piekenbrock, 2018), DP clustering (Rodriguez and Laio, 2014) as implemented in the R package densityClust (Pedersen et al., 2017), PGMM (McNicholas and Murphy, 2008) using the R package pgmm (McNicholas et al., 2018), MSAL using the R package MixSAL (Franczak et al., 2018), and MGHD using the R package MixGHD (Tortora et al., 2019). Many of these algorithms have multiple parameters that need to be set; in our experiments, we use the default settings where guidance for choosing these parameters is not explicitly available. Also, the DBSCAN* and DP clustering algorithms identify scatter/outliers, which by definition are those observations that are unlike any other in the dataset, so we follow Maitra and Ramler (2009) in considering them as individual singleton clusters in our performance assessments.

Performance of each method is evaluated by Hubert and Arabie (1985)'s adjusted Rand index R, measured between the true partition and the estimated final partitioning. In general, R ≤ 1, with values closer to 1 indicating greater similarity between partitionings and good clustering performance. The index takes values farther from 1 as performance becomes poorer and is expected to take a value of zero for a random assignment. The index R can take arbitrarily negative values but, as very helpfully pointed out to us by a reviewer, the probability of observing substantially negative values of R is relatively small (see Steinley, 2004, for further discussion on the characteristics of this index).

3.1.
Illustrative Example: the Aggregation Dataset

The 2D Aggregation dataset (Gionis et al., 2007) has n = 788 observations from C = 7 groups of different characteristics. Figure 1 displays the results of the different phases and iterations of KNOB-SynC.

[Figure 1: Illustrating all three stages of the KNOB-SynC algorithm on the Aggregation dataset: results of (a) the k-means phase (ˆK = 14), (b) the first merging phase (ˆC = 8) and (c) the second (and final) merging phase (ˆC = 7). In these and all such subsequent figures, plotting character denotes true class membership while color indicates estimated class membership. (d)-(f) Estimated pairwise nonparametric overlap values corresponding to the partitions in (a), (b) and (c), with generalized overlaps ˆω = 0.0013, 0.0008 and 10⁻⁶, respectively.]

We display the stages of KNOB-SynC for κ = 1, which is where we have the lowest terminating ˆω (from among the candidate values of κ) for this example. The k-means phase of our algorithm identifies 14 clusters, with partitioning as in Figure 1a, and estimates the initial overlap matrix ˆΩ to be as in Figure 1d. The first merging phase yields the partitioning in Figure 1b with the updated ˆΩ of Figure 1e. The next merging phase combines only one pair of groups and is terminal, resulting in the final partitioning of the dataset as in Figure 1c. The overlap matrix (Figure 1f) indicates well-separated clusters, with only six mislabeled observations relative to the truth, and an R of 0.98 between the true and estimated classifications.
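The adjusted Rand index used throughout these comparisons can be computed from the contingency table of the two partitions; the following is the standard Hubert and Arabie (1985) construction, sketched here for completeness rather than taken from the paper's code:

```python
import numpy as np

def adjusted_rand_index(true, est):
    """Adjusted Rand index between two partitions: the Rand index
    corrected for chance agreement under a hypergeometric model."""
    true, est = np.asarray(true), np.asarray(est)
    cats_t, cats_e = np.unique(true), np.unique(est)
    n = true.size
    # contingency table n_ij of joint class/cluster counts
    tab = np.array([[np.sum((true == a) & (est == b)) for b in cats_e]
                    for a in cats_t])
    comb2 = lambda x: x * (x - 1) / 2.0    # "choose 2" applied elementwise
    sum_ij = comb2(tab).sum()
    sum_i = comb2(tab.sum(axis=1)).sum()   # row-margin term
    sum_j = comb2(tab.sum(axis=0)).sum()   # column-margin term
    expected = sum_i * sum_j / comb2(n)
    max_index = 0.5 * (sum_i + sum_j)
    return (sum_ij - expected) / (max_index - expected)
```

Identical partitions (up to label permutation) score 1, and a random assignment is expected to score near 0.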
[Figure 2: Clustering performance of each algorithm on the Aggregation dataset: True grouping; KNOB-SynC, R = 0.98; K-mH, R = 0.95; DEMP, R = 0.91; DEMP+, R = 0.91; MMC, R = 0.80; EAC, R = 0.90; GSL-NN, R = 0.81; Spectral Clustering, R = 0.59; Kernel k-means, R = 0.24; DBSCAN*, R = 0.58; Density Peaks, R = 0.26; PGMM, R = 0.64; MSAL, R = 0.82; MGHD, R = 0.47.]

The competing methods (Figure 2) all perform marginally to substantially worse. K-mH is the second-best performer (R = 0.95), finding ˆC = 9 groups but breaking the top-right cluster into two and also grouping a few other stray observations. Both DEMP and DEMP+ yield the same result (R = 0.91, ˆC = 6), but MMC (R = 0.80, ˆC = 8) has trouble with the largest group, splitting it into two sub-groups.
EAC breaks the top central and the large right-hand groups into many clusters, resulting in ˆC = 14 but R = 0.90. Thus, in spite of identifying a large number of groups, EAC is able to capture a fair bit of the complex group structure of this dataset. GSL-NN cannot distinguish between the groups on the right and also finds many other small groups elsewhere, ending with ˆC = 12 groups and R = 0.81. The performance of spectral clustering is worse: it finds ˆC = 12 groups and has an R = 0.59 with the true classification. Despite being provided with the true C = 7, kernel k-means, with R = 0.24, is the worst performer in this example, with DP (R = 0.26, ˆC = 26) only marginally better. DBSCAN*, at R = 0.58 and ˆC = 277, correctly finds the large circular group and one of the smallest groups, but the observations in the top-left group are almost all classified as outliers/scatter. Among the MBC methods for general-shaped clusters, MSAL, at R = 0.82, is the best performer, finding ˆC = 7 groups but having trouble with the larger group at the bottom. PGMM finds ˆC = 12 groups (R = 0.64), with the larger groups split further, while the worst-performing MBC method is MGHD (R = 0.47, ˆC = 15).

3.2. Additional 2D Experiments

3.2.1. Experimental Framework

Figure 3 displays the 12 additional 2D datasets used to evaluate the performance of KNOB-SynC and its competitors. Barring the first example, all these datasets have been used by other authors to demonstrate and evaluate the performance of their methods. The groups in these datasets have structure ranging from the regular (e.g., the dataset of 7 spherically-dispersed Gaussian clusters that is modeled on a similar example in Maitra, 2009a, and where sophisticated methods like KNOB-SynC are superfluous) to the widely-varying and complex. The Banana-Arcs dataset has n = 4515 observations clumped in four banana-shaped structures arced around each other.
The Banana-Clump and Bullseye datasets are from Stuetzle and Nugent (2010): the former has 200 observations, with one spherical group and another arced around it on the left like a banana, while the latter has 400 observations grouped, as its name implies, as a bullseye. The more complex-structured Bullseye-Cigarette dataset (Peterson et al., 2018) has three concentric-ringed groups, two elongated groups above two spherical groups on the left, and another group that is actually a superset of two overlapping spherical groups (n = 3025 and C = 8). The Compound dataset (Zahn, 1971) is very complex-structured, with n = 399 observations in C = 6 groups that are not just varied in shape but include a group that sits atop another on the right. The Half-ringed clusters dataset (Jain and Law, 2005) has 373 observations in two arc-shaped clusters, one of which is dense and the other very sparsely populated. The Path-based dataset (Chang and Yeung, 2008) has 300 observations in three groups, two of which are regular-shaped and surrounded by a widely-arcing third group. The Spiral dataset (Chang and Yeung, 2008) has 312 observations in three spiral groups that are very difficult for standard clustering algorithms to recover accurately. The SSS dataset has 5015 observations in three S-shaped groups of varying density and orientation, while the XXXX dataset has n = 415 observations distributed in four cross-shaped structures.
[Figure 3: Shape datasets used in the two-dimensional performance evaluations: 7-Spherical, Banana-Clump, Bananas-Arcs, Bullseye, Bullseye-Cigarette, Compound, Half-ringed clusters, Path-based, SCX-Bananas, Spiral, SSS and XXXX.]

3.2.2. Results

Figure 4 and Table 8 summarize the performance of all methods on the 2D experimental datasets. Detailed displays of the different methods on individual datasets are in Appendix B. The summaries indicate across-the-board good performance of KNOB-SynC, with it always being a top performer. In its worst case, KNOB-SynC gets R = 0.55 (on the Path-based dataset), where it terminates early (Figure 18), but even here it is the fourth-best performer, behind spectral clustering (R = 0.72), MGHD (R = 0.60) and PGMM (R = 0.59). The competing syncytial clustering methods do well in some cases, but not in others where other methods perform better. Among the syncytial clustering methods, K-mH performs better than DEMP, DEMP+ and MMC, whose performance can sometimes be poor (e.g., on the Bullseye, Half-ringed clusters and Spiral datasets; vide Figures 14, 17 and 20). It is on these datasets that the other methods (EAC, GSL-NN, and spectral clustering) do better.
The performance of kernel k-means, even with the known true number of groups, is varied, being very good sometimes (e.g., on the Banana-Clump dataset of Figure 13) but very poor in other cases (e.g., as seen before on the Aggregation dataset) where almost every other method does well.

[Figure 4: Performance, (a) by dataset and (b) by method, of KNOB-SynC (abbreviation: KNS), K-mH, DEMP (DM), DEMP+ (DM+), MMC, EAC, GSL-NN (GSN), spectral clustering (SpC), kernel k-means (kk-m), DBSCAN*, DP, PGMM, MSAL and MGHD on the 2D datasets, in terms of the adjusted Rand index.]

The three general MBC methods perform similarly, but PGMM does a bit better than MSAL and MGHD. DBSCAN* and, especially, DP generally perform poorly; however, DBSCAN* performs well in some cases. We quantify the performance of each method against its competitors in terms of its average deviation from the best performer. Specifically, for each dataset, we compute the deviation (or difference) in R of a method from that of the best performer for that dataset. The average deviation (D̄) of a method over all datasets is an overall indicator of its performance. Table 8 provides D̄ and the standard deviation (Dσ) of these differences. On the average, KNOB-SynC is the best performer (and with the lowest Dσ), followed by GSL-NN, K-mH and EAC. This conclusion of KNOB-SynC's superior overall performance is also supported by Figure 4(b). We surmise that KNOB-SynC does well across the different datasets because of its ability, by construction, to merge many or few components at a time, with the exact choice of merges and the termination objectively determined by the distinctiveness of the resulting partitioning as per ˆω.

3.3.
Higher-Dimensional Datasets

We also study the performance of KNOB-SynC and its competitors on higher-dimensional datasets. These datasets are modest- to higher-dimensional, with between 173 and 10993 records. For the higher-dimensional datasets (i.e., those with non-redundant dimension greater than 10), we find that all methods other than GSL-NN generally perform better when used on the first few (m) kernel principal components (KPCs) rather than on the raw data. For these methods and these datasets, we use the first m KPCs of each dataset, with m chosen as the first time after which the increases in the eigenvalues corresponding to successive KPCs are below 0.5%. GSL-NN is implemented on the original datasets. Further, KNOB-SynC, K-mH and EAC are built on k-means, whose results depend on the scale of the features. So, for these methods, we scaled each feature by its SD prior to analysis unless the features were all collected on a similar scale, as with the E. coli example of Section 3.3.2 or the log-transformed GRB dataset of Section 4.1. (Following the usual rule-of-thumb in multivariate statistics, we consider features to be on similar scales if the most variable feature has SD no more than four times that of the feature with the lowest variability.) Each dataset, and the performance of each method on it, is first described individually. A comprehensive summary of the performance of each method on each dataset follows in Section 3.3.11.

3.3.1. Simplex-7 Gaussian Clusters

This dataset, from Stuetzle and Nugent (2010), consists of Gaussian realizations of sizes 50, 60, 70, 80, 90, 100 and 110 from seven clusters with means set at the vertices of the seven-dimensional unit simplex and homogeneous spherical dispersions with common SD of 0.25 in each dimension. Like the 7-Spherical dataset, this dataset exemplifies a case where standard methods such as k-means or Gaussian MBC should be adequate.
Therefore, it is a test of whether our algorithm and its competitors are able to refrain from identifying spurious complexity. All methods, except for EAC, DBSCAN* and DP, identify seven groups and have good clustering performance. In particular, the syncytial methods have very good performance (R = 0.97); the other methods have R ∈ [0.92, 0.98]. EAC still performs well (R = 0.94) but finds ˆC = 6 groups. DBSCAN* is the worst performer (R = 0.03) on this dataset, finding many outliers ( ˆC = 508). DP's performance is middling, at R = 0.58 and with many outliers ( ˆC = 79).

3.3.2. E. coli Protein Localization

The E. coli dataset, publicly available from the University of California Irvine Machine Learning Repository (UCIMLR) (Newman et al., 1998), concerns the identification of protein localization sites for the E. coli bacterium (Nakai and Kanehisa, 1991). There are eight protein localization sites: cytoplasm, inner membrane without signal sequence, periplasm, inner membrane with an uncleavable signal sequence, outer membrane, outer membrane lipoprotein, inner membrane lipoprotein, and inner membrane with a cleavable signal sequence. Identifying these sites is an important early step for finding remedies (Nakai and Kanehisa, 1991). Each protein sequence has a number of numerical attributes; see Horton and Nakai (1996) for a listing and their detailed description. Two attributes are binary, but 326 of the 336 sequences have common values for these attributes. We restrict our investigation to these sequences and drop the two binary attributes from our list of variables. These 326 sequences have no representation from the inner membrane or outer membrane lipoproteins. Additionally, we also drop two sequences because they are the lone representatives from the inner membrane with cleavable signal sequence site (Maitra, 2002). Therefore, we have n = 324 observations from C = 5 true classes. KNOB-SynC identifies ˆC = 9 groups.
Table 1 presents the confusion matrix containing the number of times a protein from a localization site is assigned to each KNOB-SynC group. Ignoring stray assignments, the sites are fairly well-defined in the first four groups, with R = 0.72. Uncleavable signal sequences from the inner membrane site are difficult to distinguish from those that are also from there but have no signal sequence. Sequences from the other sites are better-clarified.

Table 1: Confusion matrix of the KNOB-SynC groups against the true E. coli localization sites.

                                                  1    2    3    4    5    6    7    8    9
  cytoplasm                                     138    0    0    4    0    0    0    0    1
  inner membrane, no signal sequence              7   62    0    0    0    3    3    1    0
  inner membrane, uncleavable signal sequence     1   32    0    0    0    0    0    1    0
  outer membrane                                  0    0   17    2    0    0    0    0    0
  periplasm                                       3    1    2   38    8    0    0    0    0

Among the alternative methods, EAC does slightly better (R = 0.77) but identifies 10 groups. The remaining methods all do slightly to substantially worse. DEMP, DEMP+ and K-mH each identify four groups, but with R ∈ [0.63, 0.70]. K-mH finds only two groups (R = 0.41), while the rest find more groups but disagree more strongly with the true localizations. DBSCAN* finds a large number of groups ( ˆC = 174) with relatively poor performance (R = 0.31). DP is marginally better (R = 0.39, ˆC = 12), while PGMM (R = 0.48) and MGHD (R = 0.60) improve on DP, each finding 8 groups. (MSAL did not converge to a solution.) Overall, EAC and KNOB-SynC are the top two performers, with DEMP and DEMP+ close behind.

3.3.3. Standard Wine Recognition

The standard wine recognition dataset (Forina et al., 1988; S. Aeberhard and de Vel, 1992), also available from the UCIMLR, contains p = 13 measurements on n = 178 wine samples obtained from their chemical analysis. There are 59, 71 and 48 wines of the Barolo, Grignolino and Barbera cultivars, so C = 3. Because p > 10 here, we use m = 17 KPCs.
KNOB-SynC is the best performer, finding ˆC = 3 groups with a clustering performance of R = 0.92. The first group contains all 59 wines from the Barolo cultivar and 2 Grignolino wines. The second group contains 66 wines, all exclusively from the Grignolino cultivar. The third group has 2 Grignolino and 48 Barbera wines. Thus, there is very good definition among the KNOB-SynC groups. On the other hand, only MMC (R = 0.67, ˆC = 5), K-mH (R = 0.62, ˆC = 6), PGMM (R = 0.66, ˆC = 5) and EAC (R = 0.60, ˆC = 9) perform modestly, while the others are substantially worse, with DBSCAN*, in particular, classifying all observations as outliers, resulting in ˆC = 178 and R = 0.

3.3.4. Extended Wine Recognition

A reviewer very helpfully pointed out that the dataset used in Section 3.3.3 is actually a reduced variant, and that a fuller version of the dataset with 27 variables is available in the R package pgmm. We used m = 26 KPCs in our experimental evaluations on this larger dataset. DEMP and DEMP+ (R = 1, ˆC = 3) show perfect classification, while MGHD (R = 0.95, ˆC = 3) and MMC (R = 0.91, ˆC = 4) also perform well. KNOB-SynC is a top performer and the best among the distribution-free methods, finding ˆC = 3 groups with a clustering performance of R = 0.93. The first group here has 58 Barolo and 2 Grignolino wines. The second group contains 68 wines from the Grignolino cultivar and the one Barolo wine that was not placed in the first group. The third group has the 48 Barbera wines and the one remaining Grignolino wine. Similar to the 13-dimensional case, we get good definition among the groups. The other methods not discussed here do moderately to substantially worse.

3.3.5. Olive Oils

The olive oils dataset (Forina and Tiscornia, 1982; Forina et al., 1983) has measurements on 8 chemical components for 572 samples of olive oil taken from 9 different areas of Italy that are in three regions: Sardinia, Northern Italy and Southern Italy.
This is an interesting dataset with sub-classes (areas) inside classes (regions). Indeed, Peterson et al. (2018) were able to identify, with one misclassification, sub-groups within the regions but not the areas (R = 0.67, ˆC = 11; we, however, get R = 0.56, ˆC = 8 using the authors' supplied code); they surmised that it may be more possible to identify characteristics of olive oils based on regions defined by physical geography rather than on areas demarcated by political geography. We therefore analyze performance on this dataset both in terms of how regions and how areas are recovered. KNOB-SynC identifies ˆC = 4 groups (Table 2), with oils from the Sardinian and Northern regions correctly classified into two of the groups. The Southern region oils are split into the two remaining groups, one containing all but 2 of the 25 North Apulian samples and 6 of the 36 Sicilian samples, and the other containing all the remaining Southern oils.

Table 2: Confusion matrix of the KNOB-SynC grouping of the Olive Oils dataset.

  Region    Area               1     2     3     4
  Sardinia  Coast-Sardinia    33     0     0     0
            Inland-Sardinia   65     0     0     0
  North     East-Liguria       0     0    50     0
            West-Liguria       0     0    50     0
            Umbria             0     0    51     0
  South     Calabria           0    56     0     0
            North-Apulia       0     2     0    23
            South-Apulia       0   206     0     0
            Sicily             0    30     0     6

In terms of clustering performance, KNOB-SynC gets R = 0.55 when compared to the true areal grouping but R = 0.87 when compared to the true regional grouping. For this dataset, DEMP (R = 0.85, ˆC = 7), DEMP+ (R = 0.82, ˆC = 12) and MSAL (R = 0.70, ˆC = 9) are the top performers with respect to the true areal grouping. The remaining methods all have middling performance. When compared with the true regional grouping, KNOB-SynC is by far the best performer. Overall, the clustering performance of KNOB-SynC for the regional grouping marginally trumps the performance of DEMP for the areal grouping, and so it may be considered to be more accurate in uncovering the group structure in the dataset.

3.3.6.
Image Segmentation

The image segmentation dataset, also available from the UCIMLR, consists of 19 attributes of the scene in each 3×3 image, manually classified as BRICKFACE, CEMENT, FOLIAGE, GRASS, PATH, SKY or WINDOW. (Thus, C = 7.) We combine the training and test datasets to obtain 330 instances of each scene, so n = 2310. There is a lot of redundancy in the attributes, so we reduce the dataset to 8 PCs that together explain at least 99.9% of the total variance in the dataset. The PCs are obtained from the correlation matrix because the 19 attributes have vastly different scales. The KNOB-SynC solution finds Ĉ = 12 clusters, with R = 0.55. The confusion matrix (Table 3) indicates that the SKY images are perfectly identified, while GRASS and, to a lesser extent, PATH and CEMENT are fairly well-identified. On the other hand, the partitioning struggles to distinguish between BRICKFACE, FOLIAGE and WINDOW.

Table 3: Confusion matrix of the KNOB-SynC grouping against the true grouping for the Image Segmentation dataset.

              1    2    3    4    5    6    7    8    9   10   11   12
BRICKFACE   330    0    0    0    0    0    0    0    0    0    0    0
CEMENT       42  257    0    4    0   27    0    0    0    0    0    0
FOLIAGE     300    5    0    0    0    5    2    7    3    1    3    4
GRASS         1    0  327    0    0    2    0    0    0    0    0    0
PATH          0    0    0  269    0   61    0    0    0    0    0    0
SKY           0    0    0    0  330    0    0    0    0    0    0    0
WINDOW      309   13    0    0    0    8    0    0    0    0    0    0

Among the other methods, only EAC (R = 0.59, Ĉ = 40) and MGHD (R = 0.56, Ĉ = 7) are modestly to marginally better than KNOB-SynC. Inspection of the EAC grouping indicates many small groups but also difficulty in separating FOLIAGE and WINDOW, which are placed together in one group. Further, BRICKFACE is split into five groups, four of which are predominantly of this kind, but the fifth of which is unable to distinguish 146 observations of BRICKFACE from 32, 52 and 62 observations of CEMENT, FOLIAGE and WINDOW, respectively.
The other methods all perform moderately to substantially worse (Table 9), with PGMM, MSAL and DBSCAN* unable to find clustering solutions.

3.3.7. Yeast Protein Localization

The yeast protein localization dataset (Nakai, 1996), also obtained from the UCIMLR, was used by Melnykov (2016) to illustrate the application of DEMP+. This dataset concerns the localization of yeast proteins into one of C = 10 sites and has two attributes (presence of the HDEL substring, and peroxisomal targeting signal in the C-terminus) that are essentially binary and trinary. Following Melnykov (2016), we drop these variables and use the other p = 6 variables, namely signal sequence recognition scores based on (a) McGeoch's and (b) von Heijne's methods, (c) the ALOM membrane spanning region prediction score, discriminant analysis scores of the amino acid content of (d) the N-terminal region (20 residues long) of mitochondrial and non-mitochondrial proteins and (e) vacuolar and extracellular proteins, and (f) discriminant scores of nuclear localization signals of nuclear and non-nuclear proteins. For this dataset, all methods perform poorly. KNOB-SynC (R = 0.226, Ĉ = 7) is the best performer; the other clustering methods essentially randomly allocate observations. Surprisingly, DEMP+ (Ĉ = 6, R = 0.01) performs very poorly. (Melnykov, 2016, only used the first five of our variables to illustrate the DEMP+ method: we find no appreciable improvement even then, with Ĉ = 7 and R = 0.009. Personal queries to the author did not successfully resolve this discrepancy.) It appears therefore that the yeast protein localization dataset may be difficult to partition accurately in a completely unsupervised framework.

3.3.8. Acute Lymphoblastic Leukemia

The Acute Lymphoblastic Leukemia (ALL) training dataset of Yeoh et al. (2002) was used by Stuetzle and Nugent (2010) to illustrate GSL-NN in a high-dimensional, small-sample-size framework.
We use the standardized dataset in Stuetzle and Nugent (2010) that measured the oligonucleotide expression levels of the 1000 highest-varying genes in 215 patients suffering from one of seven leukemia subtypes, namely T-ALL, E2A-PBX1, BCR-ABL, TEL-AML1, MLL rearrangement, Hyperdiploid > 50 chromosomes, or an unknown category labeled OTHER. Some subtypes have very few cases: for instance, only 9, 14 and 18 patients are of types BCR-ABL, MLL and E2A-PBX1, respectively. For this dataset, we use m = 42 KPCs for all methods but GSL-NN. The k-means stage of KNOB-SynC identifies six groups, none of which are merged in the merging phase, resulting in the best partitioning among all competing methods. Table 4 presents the confusion matrix containing the number of cases of each leukemia subtype assigned to each KNOB-SynC group.

Table 4: Confusion matrix of the KNOB-SynC grouping against the true leukemia subtypes for the ALL dataset.

                      1    2    3    4    5    6
Hyperdiploid > 50     0    0    2   35    5    0
E2A-PBX1              0   17    0    0    1    0
BCR-ABL               0    0    2    1    6    0
TEL-AML1              4    2    8    2   36    0
MLL                   0    3   10    0    1    0
T-ALL                 0    0    1    0    0   27
OTHER                51    0    0    0    1    0

We see that most leukemia subtypes are distinctively identified in the KNOB-SynC solution. The alternative methods perform mildly to substantially worse, with PGMM, spectral clustering, GSL-NN and DP having clustering solutions (R = 0.61, 0.55, 0.54 and 0.53, respectively) that are the next best after KNOB-SynC. Other methods generally do poorly, with MSAL and MGHD unable to find solutions, while DBSCAN* classifies all observations as outliers.

3.3.9. Zipcode Images

The zipcode images dataset (Stuetzle and Nugent, 2010) consists of n = 2000 16×16 images of handwritten Hindu-Arabic numerals and is our second higher-dimensional example. As in the ALL dataset of Section 3.3.8, we normalize the observations to have zero mean and unit variance so that the Euclidean distance between any two normalized images is negatively and linearly related to the correlation between their pixels.
We extract and use the first m = 33 KPCs for all algorithms but GSL-NN. KNOB-SynC identifies 9 groups and has the best clustering performance (R = 0.76). DP is the second-best performer (R = 0.58) but finds Ĉ = 53 groups (including singletons), followed by MGHD (R = 0.56, Ĉ = 6), K-mH (R = 0.55, Ĉ = 22), and GSL-NN and spectral clustering (both with R = 0.54 but Ĉ = 7 and 23, respectively). The other methods all perform moderately to substantially worse. Figure 5 displays the 9 KNOB-SynC groups.

Figure 5: KNOB-SynC groups, with colormap indicating group, of the Zipcode dataset.

While misclassifications abound in almost all groups, there is good agreement, with 0, 1, 2, the leaner 8s and (to a lesser extent) 3 and 6 largely correctly identified. The digit 2 is placed in two groups, of the leaner and the rounded versions. The group where 3 predominates also has some 5s and 8s, but the categorization makes visual sense. Another group is composed largely of 4s, 7s and 9s, but that placement also appears visually explainable. Clearer and straighter 7s and 9s are placed in a separate group. Our partitioning finds it harder to distinguish between 5 and 6, but here also the commonality of the strokes in the digits assigned to this group explains the categorization. Thus we see that KNOB-SynC is not only the best performer for this dataset but also provides interpretable results. We comment that our application of all methods to this dataset has been entirely unsupervised: methodologies that also account for spatial context and pixel neighborhoods may further improve the grouping but are outside the purview of this paper.

3.3.10. Handwritten Pen-digits

The Handwritten Pen-digits dataset (Alimoglu, 1996; Alimoglu and Alpaydin, 1996), available at the UCIMLR, is a larger dataset that has 16 attributes from 250 handwritten samples

Table 5: Confusion matrix for the Handwritten Pen-digits dataset.
     0     1     2     3     4     5     6     7     8     9    11    12    13    14    15
0  1099     1     0     0    19     0    21     0     0     2     0     1     0     0     0
1     0   657   358    34     1     0     2     2     0    89     0     0     0     0     0
2     0     2  1141     0     0     0     0     0     0     1     0     0     0     0     0
3     0     4     2  1046     1     0     0     0     0     2     0     0     0     0     0
4     0     5     1     2  1118     0     1     0     0    17     0     0     0     0     0
5     0     1     0   252     0   625     0     0     2   175     0     0     0     0     0
6     0     0     1     0     0     1  1054     0     0     0     0     0     0     0     0
7     0   144     5     2     0     0     0   914     0     0     0     0    77     0     0
8     4     0     0     3     0     1     0     1   461     0   139   321    48    24    53
9    24     9     0    72     3     0     0     0     1   714     0     0     0   232     0

of 30 writers. (There are n = 10992 records because eight samples are unavailable.) We use m = 18 KPCs in our analysis (Peterson et al., 2018, used the first 7 PCs and got R = 0.64 and Ĉ = 24). KNOB-SynC finds Ĉ = 15 groups and is the best performer (R = 0.723). It separates the digits 0, 2, 3, 4, 6 and, to a lesser extent, 7 fairly well, but identifying 1, 5 and 9 is a bit more challenging (Table 5). It also identifies multiple types of 8. MGHD finds the correct number of groups and is the next-best performer (R = 0.67). The other methods perform moderately to substantially worse, with MSAL unable to find a clustering solution.

3.3.11. Summary of Performance

Figure 6: Overall performance (adjusted Rand index) of all competing methods on all higher-dimensional datasets: (a) performance by dataset; (b) performance by method. Abbreviations are as in Figure 4.

Figure 6 and Table 9 summarize the performance of all methods on the higher-dimensional experiments. As in the 2D case, KNOB-SynC is almost always among the top performers for high-dimensional datasets. Indeed, KNOB-SynC has the lowest average difference in R from that of the best-performing method over all datasets (Table 9). The other methods generally perform worse, with EAC, PGMM and kernel-k-means (with the true number of groups) among the better ones. Thus, the results of our experiments on real and synthetic datasets indicate good performance of KNOB-SynC relative to its competitors.

3.4.
Extensions of KNOB-SynC

As indicated in Section 2, the development of our syncytial clustering methodology is based on the nonparametric estimation of the CDF of the residuals and so can be applied in other scenarios. We explore the performance of our methodology in two such settings.

3.4.1. KNOB-SynC in the Presence of Scatter

Maitra and Ramler (2009) provided the k-clips algorithm for k-means clustering in the presence of scatter, that is, observations that are unlike any other in the dataset. Our KNOB-SynC methodology and software readily incorporate k-clips results by replacing the k-means phase with that algorithm and proceeding with the scatter points included as individual singleton clusters. We illustrate our methodology on the first 100 images of the Olivetti faces database (Samaria and Harter, 1994) that were used by Rodriguez and Laio (2014) to illustrate their DP algorithm. The 100 images under our consideration comprise 10 faces each of 10 individuals, taken at different angles and under different lighting conditions. Therefore, each individual can be considered a group whose members are that person's 10 images. Each 112×92 image has a total of 10,304 pixels, so we use the first 37 KPCs. While this application does not have any true scatter points, we use it to illustrate KNOB-SynC with k-clips because it was used by Rodriguez and Laio (2014) to showcase DP, which finds scatter (outliers, in their parlance) in addition to clusters. The k-clips algorithm with the default Bayesian Information Criterion (BIC) (Schwarz, 1978) finds only two well-defined homogeneous spherical clusters and 68 scatter points. We use the trace of the within-sums-of-squares-and-products matrix, rather than its determinant (Maitra and Ramler, 2009), in our objective function in order to satisfy the condition of homogeneous spherical clusters around which our base KNOB-SynC algorithm is built. Thus, we have a total of 70 initial groups.
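The merging phase that these initial groups are fed into follows the rule described earlier: repeatedly merge the pair of groups with the highest estimated pairwise overlap, accepting a merge only while the estimated generalized overlap decreases. The control flow can be sketched as below, with several loudly-flagged stand-ins: the pairwise overlaps are supplied rather than kernel-estimated, the generalized overlap is approximated by the mean pairwise overlap (the paper uses the eigenvalue-based measure of Maitra and Melnykov, 2010), and a merged group's overlap with a third group is pooled as the larger of the two constituent overlaps.

```python
# Toy sketch of the syncytial merging rule; overlaps and the pooling rule
# are illustrative stand-ins, not the paper's kernel-based estimators.

def generalized_overlap(groups, omega):
    """Mean pairwise overlap over the current groups (a simple stand-in for
    the eigenvalue-based generalized overlap of Maitra and Melnykov, 2010)."""
    pairs = [(g, h) for i, g in enumerate(groups) for h in groups[i + 1:]]
    return sum(omega[frozenset(p)] for p in pairs) / len(pairs)

def merge_step(groups, omega):
    """Tentatively merge the pair of groups with the highest overlap."""
    g, h = max(((g, h) for i, g in enumerate(groups) for h in groups[i + 1:]),
               key=lambda p: omega[frozenset(p)])
    merged, kept = g + "+" + h, [k for k in groups if k not in (g, h)]
    new_omega = {frozenset((a, b)): omega[frozenset((a, b))]
                 for i, a in enumerate(kept) for b in kept[i + 1:]}
    for k in kept:  # pooled overlap of the merged group with each survivor
        new_omega[frozenset((merged, k))] = max(omega[frozenset((g, k))],
                                                omega[frozenset((h, k))])
    return kept + [merged], new_omega

def syncytial_merge(groups, omega):
    """Accept merges only while the generalized overlap strictly decreases."""
    while len(groups) > 2:
        cand_groups, cand_omega = merge_step(groups, omega)
        if generalized_overlap(cand_groups, cand_omega) >= \
                generalized_overlap(groups, omega):
            break
        groups, omega = cand_groups, cand_omega
    return groups
```

With one strongly overlapping pair among four groups, only that pair is merged before the stopping rule fires, mirroring how the data-driven criterion halts merging without user tuning.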
KNOB-SynC's merging phase ends with 9 large groups, 5 small groups and 1 scatter observation (so Ĉ = 16) and R = 0.902. The results are displayed in Table 6 and Figure 7; for comparison, the latter also displays the results reported in Rodriguez and Laio (2014), which found 9 clusters and 62 scatter points, resulting in Ĉ = 71 and R = 0.22. (The figure displays images assigned to a group by means of a distinctive sequential palette. Because there are not enough colors to also identify each scatter point with its own sequential palette, we use an individual randomized nominal palette for each scatter assignment.) KNOB-SynC identifies images from six individuals (Persons 2, 4, 6, 7, 8 and 9) perfectly, and the first, third and tenth individuals nearly so. The fifth individual is split into 5 smaller groups, including the case where one image is grouped together with the one misclassified image of the tenth person. The performance of our algorithm far exceeds that reported in Rodriguez and Laio (2014).

Table 6: Confusion matrix of the KNOB-SynC results for the Olivetti faces dataset.

             Assigned Groups
Individual    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
 1            9    0    0    0    0    0    0    0    0    0    0    0    1    0    0    0
 2            0   10    0    0    0    0    0    0    0    0    0    0    0    0    0    0
 3            0    0    8    0    0    0    0    0    0    0    2    0    0    0    0    0
 4            0    0    0   10    0    0    0    0    0    0    0    0    0    0    0    0
 5            0    0    0    0    3    0    0    0    0    0    0    2    0    2    1    2
 6            0    0    0    0    0   10    0    0    0    0    0    0    0    0    0    0
 7            0    0    0    0    0    0   10    0    0    0    0    0    0    0    0    0
 8            0    0    0    0    0    0    0   10    0    0    0    0    0    0    0    0
 9            0    0    0    0    0    0    0    0   10    0    0    0    0    0    0    0
10            0    0    0    0    0    0    0    0    0    9    0    0    0    0    1    0

Figure 7: Clusters of the first 100 images in the Olivetti database obtained by (a) KNOB-SynC (R = 0.90) and (b) DP (R = 0.22) as reported in Rodriguez and Laio (2014). Each group is represented by its own distinctive sequential palette. Scatter observations (i.e. singleton groups) are represented by individual randomized nominal palettes.
We note that we used the first 37 KPCs with our KNOB-SynC algorithm, while Rodriguez and Laio (2014) used the original images with the similarity metric of Sampat et al. (2009). Using DP (Ĉ = 83, R = 0.06) or DBSCAN* (Ĉ = 100, R = 0) with Euclidean similarity on the 37 KPCs gave us worse results.

3.4.2. KNOB-SynC with Incomplete Records

We now illustrate a scenario where KNOB-SynC is applied to a dataset with incomplete records. In this example, we replace the k-means phase with the km-means algorithm of Lithio and Maitra (2018), which modifies k-means to account for incomplete records. The authors also developed a modified jump statistic to select the number of groups. The km-means results are input into the merging phase of KNOB-SynC and the algorithm proceeds as usual. We illustrate our methodology on a subset (Wagstaff, 2004) of the Sloan Digital Sky Survey (SDSS) dataset that measures five features, namely brightness (psfCounts), size (petroRads), texture, and two measures of shape (M_e1 and M_e2, which we refer to as Shape1 and Shape2 in our analysis), on 1220 galaxies and 287 stars. Thus the true C = 2 and n = 1507. The dataset has some missing values for the shape measures of 42 galaxies. The km-means algorithm with the modified jump statistic of Lithio and Maitra (2018) finds K0 = 46 homogeneous spherically-dispersed groups. The initial overlap calculations of Step 2 of our algorithm yield ω̂ = 0.0297 and ω̌ = 0.381. The merging phase is triggered, and terminates with Ĉ = 4 groups.

Figure 8: (Top) Three views of 3D radial visualization displays of the KNOB-SynC groups found in the SDSS dataset. Only the completely observed records are displayed in the figures.

          KNOB-SynC Groups
             1    2    3    4
Galaxies  1159    2   56    3
Stars        0  287    0    0

Figure 8 provides a 3D radial visualization (Zhu et al.,
2019) of the clustering results and a confusion matrix of the obtained grouping vis-a-vis the true classification.

Figure 8 (Bottom): Confusion matrix between the true classifications of Galaxies and Stars and the KNOB-SynC grouping, which yielded R = 0.86.

We see that KNOB-SynC groups all 287 stars together, but also includes 2 galaxies with them. The remaining galaxies are partitioned into groups of 1159, 56 and 3 observations. The large galaxy group and the group with the stars are well-separated from the smaller galaxy groups. The second-largest KNOB-SynC galaxy group has larger-sized galaxies, while the three galaxies in the last group have larger Shape1 and brightness. This illustration demonstrates KNOB-SynC's ability to identify general-shaped clusters even in the presence of incomplete records. We note that some of the competing methods, such as K-mH or EAC, could be modified to incorporate km-means results, but such modifications to both methodology and software are outside the scope of this paper. Our experimental evaluations comprehensively demonstrate that our KNOB-SynC algorithm works very well in finding general-shaped clusters. Indeed, our methodology can also accommodate scenarios that allow for scatter or incomplete records in the dataset.

4. Real-World Applications

In this section we apply KNOB-SynC, first, to find the different kinds of Gamma Ray Bursts (GRBs) in an astronomy catalog and, second, to identify activation detected in fMRI experiments. The ground truth is unknown in both applications, so we compare our results with other available evidence in the literature.

4.1. Determining the Distinct Kinds of Gamma Ray Bursts

There is tremendous interest in understanding the source and nature of Gamma Ray Bursts (GRBs), which are the brightest electromagnetic events known to occur in space (Chattopadhyay et al., 2007; Piran, 2005).
Many researchers (Mazets et al., 1981; Norris et al., 1984; Dezalay et al., 1992) have hypothesized that GRBs are of several kinds, but the exact number and the descriptive properties of these groups remain an area of active research and investigation. Most analyses have traditionally focused on univariate and bivariate statistical and descriptive methods for classification and found two groups, but other authors (Mukherjee et al., 1998; Chattopadhyay et al., 2007) have found three different kinds of GRBs when using more variables in the clustering. Recent careful analyses (Chattopadhyay and Maitra, 2017, 2018) have conclusively established five ellipsoidally-shaped groups in the GRB dataset obtained from the BATSE 4Br catalog. Indeed, Chattopadhyay and Maitra (2018) established that all nine fields of the BATSE 4Br catalog carry important clustering information, using methods developed in Raftery and Dean (2006). These nine fields are the two duration variables (times by which 50% and 90% of the flux arrive), the four time-integrated fluences in the 20-50, 50-100, 100-300 and > 300 keV spectral channels, and the three measurements of peak flux in time bins of 64, 256 and 1024 milliseconds. The authors used multivariate t-mixtures MBC on the logarithm of the measurements, with BIC for model selection, to arrive at their result of five ellipsoidally-shaped groups. GRB datasets have typically been analyzed after a log10 transformation to remove skewness. This transformation is somewhat arbitrary, so Berry and Maitra (2019) used their Transformation-infused K-means (TiK-means) algorithm to alternately transform features and cluster skewed datasets.
A modification (Berry and Maitra, 2019) of the jump statistic that accounts for the use of transformations in the algorithm found five groups, characterized as long-intermediate-intermediate, short-faint-intermediate, short-faint-soft, long-bright-hard and long-intermediate-hard in terms of their duration (T90), total fluence (Ftotal) and spectral hardness (H321), which are the summaries used to characterize GRB groups (Mukherjee et al., 1998). The fields of the BATSE 4Br catalog are heavily correlated on the log scale. This has led many researchers to summarily ignore all but a few variables in their analyses. Here we explore the performance of KNOB-SynC on the first three PCs (accounting for 96.27% of the total variance) of the nine scaled log-transformed variables, which is equivalent to using KNOB-SynC with the generalized Mahalanobis distance. The k-means phase applied to the 3 PCs finds 5 groups. KNOB-SynC does not enter the merging stage at all, since the maximum and generalized overlaps are the same. Comparison of our results (Figure 9 and Table 7) with those of Berry and Maitra (2019) shows fairly good agreement in the confusion matrix (R = 0.868).

Figure 9: Summary of duration (T90), total fluence (Ftotal) and spectral hardness (H321) for the groups obtained using (a) KNOB-SynC with the generalized Mahalanobis distance (Ĉ = 5) and (b) TiK-means (Ĉ = 5).

The first and fourth groups have short burst durations (T90) and soft spectral hardness (H321), although the first group has fainter total fluence (Ftotal). Groups 2 and 3 have long durations and bright fluences but different spectral hardness.
Group 5 has long-duration GRBs with intermediate fluence and spectral hardness. The true number and kinds of GRB groups are not known, but our results show that the KNOB-SynC solution yields groups that are distinct, interpretable and in line with the newer results obtained by TiK-means (Berry and Maitra, 2019) or MBC (Chattopadhyay and Maitra, 2017, 2018).

Table 7: (a) Summary of duration (T90), fluence (Ftotal) and spectral hardness (H321) for each of the five groups obtained using KNOB-SynC with the generalized Mahalanobis distance (top block) and TiK-means (bottom block). (b) Confusion matrix of the TiK-means and KNOB-SynC solutions (R = 0.868).

(a)
k   nk    T90              Ftotal            H321
1   207   0.234 (0.003)    31.423 (0.007)    0.505 (10^-4)    short-intermediate-soft
2   187   1.619 (0.003)    24.492 (0.01)     0.497 (10^-4)    long-bright-soft
3   459   1.543 (0.001)    30.682 (0.003)    0.526 (10^-4)    long-intermediate-hard
4   318   0.32 (0.002)     35.381 (0.004)    0.503 (10^-4)    short-faint-soft
5   428   1.882 (0.001)    27.235 (0.003)    0.516 (10^-4)    long-bright-hard
1   197   0.272 (0.003)    31.349 (0.007)    0.505 (10^-4)    short-intermediate-soft
2   188   1.65 (0.003)     24.439 (0.01)     0.498 (10^-4)    long-bright-soft
3   429   1.531 (0.001)    30.727 (0.003)    0.526 (10^-4)    long-intermediate-hard
4   333   0.304 (0.002)    35.302 (0.004)    0.502 (10^-4)    short-faint-soft
5   452   1.867 (0.001)    27.362 (0.003)    0.517 (10^-4)    long-bright-hard

(b)
                KNOB-SynC
TiK-means k     1     2     3     4     5
1             188     2     2     3     2
2               0   181     0     0     7
3               1     0   420     2     6
4              13     0     7   313     0
5               5     4    30     0   413

4.2. Activation Detection in an fMRI Finger-Tapping Task Experiment

Our second application uses KNOB-SynC to identify activation in fMRI experiments. One objective of fMRI is to determine cerebral regions that respond to a task or a particular stimulus (Bandettini et al., 1993; Belliveau et al., 1991; Kwong et al., 1992; Ogawa et al., 1990).
A typical approach relates, after correction and pre-processing, the observed Blood Oxygen Level Dependent (BOLD) time course at each image voxel to the expected BOLD response (Friston et al., 1994; Glover, 1999; Lazar, 2008) by fitting a general linear model (Friston et al., 1995) and obtaining a test statistic (often a t-statistic) that tests for significance at that voxel. Thresholding methods (Forman et al., 1995; Genovese et al., 2002) are then often used on these t-statistics to determine activation. Attempts to use clustering algorithms have been made, but Thirion et al. (2014) found that, despite the advantages of speed and simplicity, k-means is not in general a good performer because it fits data idiosyncrasies and pathologies. We therefore explore whether KNOB-SynC can improve the k-means clustering solution on these datasets. Our dataset for this experiment is from a right-hand finger-tapping experiment on a right-hand-dominant male and was acquired over twelve regularly-spaced sessions in a two-month span. We choose only the 5 sessions that were identified in Maitra (2010) as the ones with the highest reliability. Because there is no known gold standard, our comparison here is of the five partitionings detected in each replication with each other. Each dataset was preprocessed and voxel-wise Z-scores were obtained that quantified the test statistic under the hypothesis of no activation at each voxel. We refer to Maitra et al. (2002) and Maitra (2009b) for imaging details. At each of the n = 179364 voxels, we compute the Z-score to test the hypothesis that the expected BOLD levels are significantly related to the right-hand tapping at that voxel. These Z-scores for each replication constitute our (one-dimensional) dataset. Because of the large size of the dataset, most competing methods are impractical to apply, so we only use KNOB-SynC here.
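The per-voxel test statistic construction described above can be illustrated in its simplest form: regress the observed BOLD series on the expected response (plus an intercept) and form a t-statistic for the slope. This is a bare-bones sketch only; drift regressors, autocorrelation corrections, hemodynamic convolution and the conversion to Z-scores used in the actual pipeline are all omitted.

```python
import math

def voxel_t_statistic(bold, expected):
    """t-statistic for the slope beta in bold_t = alpha + beta * expected_t + e_t,
    a minimal stand-in for the voxel-wise general linear model fit."""
    n = len(bold)
    xbar = sum(expected) / n
    ybar = sum(bold) / n
    sxx = sum((x - xbar) ** 2 for x in expected)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(expected, bold))
    beta = sxy / sxx                       # least-squares slope
    alpha = ybar - beta * xbar             # least-squares intercept
    rss = sum((y - alpha - beta * x) ** 2 for x, y in zip(expected, bold))
    se_beta = math.sqrt(rss / ((n - 2) * sxx))  # standard error of the slope
    return beta / se_beta

# A toy block design: 5 cycles of 2 rest scans and 2 task scans, with a small
# deterministic "noise" pattern; an active voxel tracks the design closely.
design = [0.0, 0.0, 1.0, 1.0] * 5
active = [2.0 * x + (0.1 if i % 2 == 0 else -0.1) for i, x in enumerate(design)]
t_active = voxel_t_statistic(active, design)
```

A voxel whose series tracks the expected response yields a large t; voxels unrelated to the task yield values near zero, which is what the subsequent thresholding or clustering operates on.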
(For computational reasons also, we do not estimate K0 in the k-means phase but set it at K0 = 50.) The 50 homogeneous k-means groups in each of the five replicates, when supplied to the merging phase, each terminated with Ĉ = 2 syncytial groups. For the first replicate, the largest group has 178307 (99.4%) voxels; this is essentially the region of no activation. The other replicates have 178898 (99.7%), 178129 (99.3%), 179087 (99.8%) and 178658 (99.6%) voxels in this group.

Figure 10: Observed Z-scores at the voxels in the smaller (activated) group for the right-hand finger-thumb opposition task experiments obtained using (a) KNOB-SynC and (b) AR-FAST. For each set of experiments, we display activation maps for the 18th, 19th, 20th and 21st slices in each column; the replications are represented by rows. Jaccard indices of activation between each pair of replicates using (c) KNOB-SynC and (d) AR-FAST.

Figure 10a displays the Z-scores at the activated voxels over four slices of the brain. The displayed slices comprise the ipsi- and contra-lateral pre-motor cortices (pre-M1), the primary motor cortex (M1), the pre-supplementary motor cortex (pre-SMA) and the supplementary motor cortex (SMA). We see broad agreement between the replications in each of the four slices. We compare our KNOB-SynC results with those of the robust adaptive smoothed thresholding (AR-FAST) algorithm of Almodovar-Rivera and Maitra (2019), implemented in the R package RFASTfMRI (Almodovar-Rivera and Maitra, 2019), which shows far less agreement among the 5 replicates in terms of detected activation in the four slices. Specifically, KNOB-SynC (Figure 10a) identifies activation in the left M1 and in the ipsi-lateral pre-M1 areas. There is also some identified activation in the contra-lateral pre-M1, pre-SMA and SMA voxels.
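The pairwise agreement between replicates used in Figures 10c and 10d is the Jaccard index: the number of voxels declared active in both replicates divided by the number declared active in either. A minimal sketch over binary activation maps:

```python
def jaccard_index(map_a, map_b):
    """Jaccard index |A intersection B| / |A union B| between two
    equal-length binary activation maps (1 = voxel declared active)."""
    inter = sum(1 for a, b in zip(map_a, map_b) if a and b)
    union = sum(1 for a, b in zip(map_a, map_b) if a or b)
    return inter / union if union else 1.0
```

Identical maps give 1 and maps with disjoint activation give 0, so higher values across replicate pairs indicate more reproducible detection.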
On the other hand, AR-FAST (Figure 10b) finds less activation in the left M1 and in the ipsi-lateral pre-M1 areas. Figure 10c displays the Jaccard (1901) index of the activation detected using KNOB-SynC between each pair of replications, and Figure 10d displays the analogous Jaccard indices for the activation detected using AR-FAST. The Jaccard indices are higher for the KNOB-SynC-detected activation for each pair of replications, indicating greater reproducibility. The summarized Jaccard index of Maitra (2010), which provides an overall measure of the reproducibility of activation detected across replicates, is 0.238 for KNOB-SynC and 0.102 for AR-FAST, which was shown (Almodovar-Rivera and Maitra, 2019) to be a top performer on this dataset. We comment that while the overall Jaccard indices are low for both methods, the value of 0.238 also reflects the challenge of activation detection in single-subject fMRI. Seen in this context, KNOB-SynC does quite well. This example illustrates the potential of KNOB-SynC to improve and refine clustering solutions, making it possible, for instance, to use k-means while alleviating some of the concerns raised in Thirion et al. (2014).

5. Discussion

This paper has proposed a syncytial clustering algorithm called KNOB-SynC that merges groups found by standard clustering algorithms such as k-means, and does so in a data-driven and fully objective way. An R package called SynClustR implements our method in the function KNOBSynC and the competing K-mH syncytial algorithm in the function kmH, and is publicly available at https://github.com/ialmodovar/SynClustR. Our method is distribution-free and applies to the results of many standard clustering algorithms. We use the overlap measure of Maitra and Melnykov (2010) for merging decisions but use kernel-based nonparametric methods to calculate this overlap.
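The kernel-smoothed CDF at the heart of these overlap computations has the form Ĥ(y; b) = n⁻¹ Σᵢ G(y; Yᵢ, b), with G the CDF of a kernel centred at each observation. A minimal sketch follows, with a Gaussian kernel used purely as an illustrative stand-in for the paper's asymmetric RIG kernel, and a user-supplied bandwidth in place of the plug-in MISE-minimizing choice:

```python
import math

def kernel_cdf(y, sample, b):
    """Kernel-smoothed CDF estimate H_hat(y; b) = (1/n) * sum_i G(y; Y_i, b).
    Here G is a Gaussian CDF centred at each observation Y_i with scale b,
    an illustrative stand-in for the RIG kernel used in the paper."""
    def normal_cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sum(normal_cdf((y - yi) / b) for yi in sample) / len(sample)
```

The estimate is a smooth, nondecreasing function taking values in [0, 1] that approaches the empirical CDF as b shrinks, which is what makes it usable for the misclassification-probability calculations behind the pairwise overlaps.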
Our algorithm has no parameters that require fine-tuning by the user and, as pointed out by a reviewer, shows robust performance, when compared against a host of other methods, across many datasets of many dimensions and of varying complexity. Further, our methodology is general enough to extend to situations with incomplete records or where clustering is done in the presence of scatter. Application of KNOB-SynC to data from the BATSE 4Br catalog provides further evidence of five kinds of GRBs. Our approach is also demonstrated to potentially make it possible to adapt k-means clustering for activation detection in fMRI. This paper also developed estimation methods for the CDF using the asymmetric RIG kernel. We used the plug-in bandwidth selector that minimizes the MISE as our bandwidth choice, but it would be good to develop and investigate more sophisticated approaches. Further, our development in this paper provides an opportunity to develop nonparametric methods for diagnostics in clustering. For instance, our kernel CDF estimator could be used to determine uncertainties in k-means classifications. A reviewer has also very kindly drawn our attention to the fact that the construction of composite clusters that underlies the idea behind syncytial clustering has also been used in the context of semi-supervised clustering (Śmieja and Wiercioch, 2017), where, instead of the estimated overlap used in this paper, available class labels are used in deciding whether to merge pairs of groups. We believe that such an approach may also benefit from our methodology, especially when not all classes have representation in the supervised portion of the dataset. Thus, although we have made an important contribution, a number of issues remain that would benefit from further attention.
Acknowledgments

The authors sincerely thank the Action Editor and three anonymous reviewers whose helpful and insightful comments on an earlier version of this article greatly improved its content. The authors also acknowledge partial support for this research from the National Institute of Biomedical Imaging and Bioengineering (NIBIB) of the National Institutes of Health (NIH) under its Award No. R21EB016212. I. Almodovar-Rivera also acknowledges receipt of a fellowship from Iowa State University's Alliance for Graduate Education and the Professoriate (AGEP) program for underrepresented graduate students in STEM fields. R. Maitra also acknowledges support from the United States Department of Agriculture (USDA) National Institute of Food and Agriculture (NIFA) Hatch project IOW03617. The content of this paper is, however, solely the responsibility of the authors and does not represent the official views of the NIBIB, the NIH, the NIFA or the USDA.

Appendix A.

A.1. Proof of Lemma 3

Proof. We have
$$\mathrm{E}[\hat{H}(y;b)] = \mathrm{E}\Big[\frac{1}{n}\sum_{i=1}^{n} G(y; Y_i, b)\Big] = \mathrm{E}[G(y; Y_1, b)] = \int_0^\infty G(y; t, b)\, h(t)\, dt = \int_0^\infty \Big\{\int_0^y K(t; w, b)\, dw\Big\} h(t)\, dt = \int_0^y \Big\{\int_0^\infty K(t; w, b)\, h(t)\, dt\Big\}\, dw = \int_0^y \mathrm{E}\big[h\big(V_{1/(w-b),\,1/b}\big)\big]\, dw,$$
where the random variable $V_{1/(w-b),\,1/b} \sim \mathrm{RIG}[1/(w-b), 1/b]$. The last equality holds because the inner integral $\int_0^\infty K(t; w, b)\, h(t)\, dt = \mathrm{E}[h\{V_{1/(w-b),\,1/b}\}]$. Then, expanding $h\{V_{1/(w-b),\,1/b}\}$ around the mean $w$ of $V_{1/(w-b),\,1/b}$ and using $\mathrm{Var}\{V_{1/(w-b),\,1/b}\} = b(w+b)$ yields
$$\int_0^y \mathrm{E}[h\{V_{1/(w-b),\,1/b}\}]\, dw = \int_0^y h(w)\, dw + \frac{1}{2}\int_0^y (bw + b^2)\, h''(w)\, dw + o(b^2) = H(y) + \frac{b}{2}\int_0^y w\, h''(w)\, dw + o(b) = H(y) + \frac{b}{2}\big[y\, h'(y) - h(y) + h(0)\big] + o(b) = H(y) + O(b).$$
For the variance, we have from the definition,
$$\mathrm{Var}[\hat{H}(y;b)] = \mathrm{Var}\Big[\frac{1}{n}\sum_{i=1}^{n} G(y; Y_i, b)\Big] = \frac{1}{n}\mathrm{Var}[G(y; Y_1, b)] = \frac{1}{n}\mathrm{E}\big[G^2(y; Y_1, b)\big] - \frac{1}{n}\big[\mathrm{E}(G(y; Y_1, b))\big]^2.$$
The second term is easily obtained from (21).
It remains to derive the second moment of the estimator, \(E[G^2(y;Y_1,b)] = \int_0^\infty G^2(y;t,b)h(t)\,dt\). Using a random variable \(V_{1/(w-b),1/b}\) and the same tactics as in the reductions leading to (21), this second moment can be recast as the sum of two terms, say (22): one involving \(\int_0^\infty G(y;t,b)\,\Phi\big(2\sqrt{b/t}\big)\,h(t)\,dt\) and one involving \(\int_0^y E\big[F(V_{1/(w-b),1/b};y,b)\big]\,dw\), where \(F(\cdot\,;y,b)\) is defined through \(\Phi\) and \(G\). Since \(t,b>0\), we have \(0 < 2\sqrt{b/t} < \infty\) and so \(\Phi\big(2\sqrt{b/t}\big) \le 1\). Therefore,
\[
\int_0^\infty G(y;t,b)\,\Phi\big(2\sqrt{b/t}\big)\,h(t)\,dt \;\le\; \int_0^\infty G(y;t,b)\,h(t)\,dt.
\]
But \(\int_0^\infty G(y;t,b)h(t)\,dt = E[G(y;Y_1,b)] = H(y) + b[yh'(y)-h(y)]/2 + o(b)\) and \(\Phi(2) = 0.97725\), so that
\[
\int_0^\infty G(y;t,b)\,\Phi\big(2\sqrt{b/t}\big)\,h(t)\,dt \;\le\; H(y) + \frac{b}{2}\big[yh'(y)-h(y)\big] + o(b). \tag{23}
\]
For the second term in (22), expanding \(V_{1/(w-b),1/b}\) around its mean \(w\) yields
\[
\int_0^y E\big[F(V_{1/(w-b),1/b};y,b)\big]\,dw
= \int_0^y F(w;y,b)\,dw + \frac{1}{2}\int_0^y \mathrm{Var}\big(V_{1/(w-b),1/b}\big)\,F''(w;y,b)\,dw + o(b)
= \Phi\big(\sqrt{b/y}\big)H(y) + o(\sqrt{b}). \tag{24}
\]
We now expand \(\Phi\big(\sqrt{b/y}\big)\) in a Taylor series around 0 to get
\[
\Phi\big(\sqrt{b/y}\big) = \frac{1}{2} + \sqrt{\frac{b}{2\pi y}} + O(b).
\]
Inserting this result into (24) and combining with (23) reduces (22) to an expression for \(E[G^2(y;Y_1,b)]\) in terms of \(H(y)\), from which the approximate expressions for the variance in Lemma 3 follow.

Appendix B. Detailed Experimental Evaluations

Figures 11–22 illustrate performance on 2D datasets obtained by KNOB-SynC, K-mH, DEMP, DEMP+, MMC, EAC, GSL-NN, spectral clustering, kernel k-means, DBSCAN*, density peaks, PGMM, MSAL and MGHD. In all cases, plotting character and color represent the true and estimated group indicators.
Tables 8 and 9 display performance, in terms of R and estimated Ĉ, on 2D and higher-dimensional datasets, of KNOB-SynC (denoted KNS in the tables), K-mH, DEMP (DM), DEMP+ (DM+), MMC, EAC, GSL-NN (GSN), spectral clustering (SpC), kernel k-means (k-km), DBSCAN* (D*), density peaks (DP), PGMM (PGM), MSAL (MSL) and MGHD (MHD). For k-km, Ĉ was set at the true C and not estimated. For each method, the absolute deviation of R for that method from the highest R for each dataset was obtained; the average and SD of these absolute deviations are also reported for each method in the last row.

[Figure 11: The Spherical-7 example, and clusterings obtained using the 14 methods. Panel R values: KNOB-SynC 0.99; K-mH 0.89; DEMP 1.00; DEMP+ 1.00; MMC 0.63; EAC 0.56; GSL-NN 0.63; spectral clustering 0.56; kernel k-means 0.72; DBSCAN* 0.53; density peaks 0.78; PGMM 1.00; MSAL 0.99; MGHD 0.64.]
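The summary rows of Tables 8 and 9 (per-dataset absolute deviation from the best R, then the mean and SD of those deviations per method) can be reproduced with a short sketch. The array below is illustrative, not the paper's data, and the population-SD convention is an assumption about how Dσ was computed.

```python
import numpy as np

def deviation_summary(R):
    """R: (datasets x methods) array of adjusted Rand indices.
    Returns, per method, the mean and SD of |R - best R on that dataset|."""
    dev = R.max(axis=1, keepdims=True) - R   # deviation from the per-dataset winner
    return dev.mean(axis=0), dev.std(axis=0)  # population SD assumed

# Illustrative toy values (not from Tables 8-9): three datasets, two methods
R = np.array([[1.00, 0.50],
              [0.80, 0.80],
              [0.60, 1.00]])
d_bar, d_sigma = deviation_summary(R)
```

A small average deviation means a method is consistently close to the best performer on every dataset, which is the sense in which KNOB-SynC is reported as a top performer.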
[Figure 12: The Bananas-Arcs example and groupings obtained using the 14 methods. Panel R values: KNOB-SynC 0.91; K-mH 1.00; DEMP 0.43; DEMP+ 0.50; MMC 0.11; EAC 1.00; GSL-NN 1.00; spectral clustering 1.00; kernel k-means 0.70; DBSCAN* 0.98; density peaks 0.48; PGMM 0.41; MSAL 0.39; MGHD 0.50.]

[Figure 13: The Banana-clump example and groups obtained using the 14 methods. Panel R values: KNOB-SynC 0.98; K-mH 1.00; DEMP 0.77; DEMP+ 1.00; MMC 0.67; EAC 1.00; GSL-NN 1.00; spectral clustering 1.00; kernel k-means 0.92; DBSCAN* 0.00; density peaks 0.19; PGMM 0.75; MSAL 0.37; MGHD 0.65.]
[Figure 14: The Bullseye example and clusters obtained using the 14 methods. Panel R values: KNOB-SynC 1.00; K-mH 1.00; DEMP 0.21; DEMP+ 0.31; MMC 0.15; EAC 0.01; GSL-NN 1.00; spectral clustering 1.00; kernel k-means 0.53; DBSCAN* 0.95; density peaks 0.09; PGMM 0.26; MSAL 0.00; MGHD 0.26.]

[Figure 15: The Bullseye-Cigarette example and groups obtained with the 14 methods. Panel R values: KNOB-SynC 0.91; K-mH 1.00; DEMP 0.62; DEMP+ 0.61; MMC 0.17; EAC 1.00; GSL-NN 1.00; spectral clustering 0.23; kernel k-means 0.44; DBSCAN* 0.99; density peaks 0.56; PGMM 0.53; MSAL 0.00; MGHD 0.44.]

[Figure 16: The Compound example and partitionings obtained using the 14 methods. Panel R values: KNOB-SynC 0.93; K-mH 0.50; DEMP 0.74; DEMP+ 0.74; MMC 0.59; EAC 0.81; GSL-NN 0.74; spectral clustering 0.37; kernel k-means 0.50; DBSCAN* 0.82; density peaks 0.42; PGMM 0.71; MSAL 0.59; MGHD 0.43.]
[Figure 17: The Half-Ringed clusters example and groups obtained with the 14 methods. Panel R values: KNOB-SynC 0.88; K-mH 0.95; DEMP 0.37; DEMP+ 0.37; MMC 0.12; EAC 1.00; GSL-NN 0.26; spectral clustering 0.20; kernel k-means 0.03; DBSCAN* 0.07; density peaks 0.10; PGMM 0.21; MSAL 0.27; MGHD 0.30.]
[Figure 18: The Path-based example and groups obtained with the 14 methods. Panel R values: KNOB-SynC 0.55; K-mH 0.42; DEMP 0.41; DEMP+ 0.41; MMC 0.00; EAC 0.41; GSL-NN 0.00; spectral clustering 0.72; kernel k-means 0.35; DBSCAN* 0.18; density peaks 0.23; PGMM 0.59; MSAL 0.44; MGHD 0.60.]

[Figure 19: The SCX-Bananas example and clusters obtained using the 14 methods. Panel R values: KNOB-SynC 0.95; K-mH 1.00; DEMP 0.81; DEMP+ 0.79; MMC 0.19; EAC 1.00; GSL-NN 1.00; spectral clustering 0.87; kernel k-means 0.23; DBSCAN* 0.99; density peaks 0.25; PGMM 0.44; MSAL 0.52; MGHD 0.38.]

[Figure 20: The Spiral example and groupings obtained using the 14 competing methods. Panel R values: KNOB-SynC 0.86; K-mH 0.01; DEMP 0.00; DEMP+ 0.00; MMC 0.00; EAC 0.14; GSL-NN 1.00; spectral clustering 0.35; kernel k-means 0.00; DBSCAN* 0.10; density peaks 0.42; PGMM 0.06; MSAL 0.00; MGHD 0.07.]

[Figure 21: The SSS example and clusters obtained with the 14 competing methods. Panel R values: KNOB-SynC 1.00; K-mH 0.77; DEMP 0.77; DEMP+ 0.84; MMC 0.63; EAC 1.00; GSL-NN 1.00; spectral clustering 0.20; kernel k-means 0.36; DBSCAN* 0.95; density peaks 0.16; PGMM 0.37; MSAL 0.94; MGHD 0.40.]
GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG DBSCAN*, R = 0.99 Density Peaks, R = 0.33 PGMM, R = 0.90 MSAL, R = 1.00 MGHD, R = 0.74 MMC, R = 1.00 EAC, R = 1.00 GSL NN, R = 1.00 Spectral Clustering, R = 0.29 Kernel k means, R = 0.99 True KNOB Syn C, R = 0.95 K m H, R = 0.96 DEMP, R = 1.00 DEMP+, R = 1.00 Figure 22: The XXXX example and groupings obtained using the 14 competing methods. Almod ovar-Rivera and Maitra Table 8: Performance, in terms of R and estimated ˆC, on 2D datasets, of competing methods. Dataset KNS K-m H DM DM+ MMC EAC GSN Sp C kk-m D* DP PGM MSL MHD (n, p, K) 7-Spherical 0.99 0.89 1 1 0.63 0.56 0.63 0.56 0.72 0.53 0.78 1 0.99 0.64 (500, 2, 7) 7 10 7 7 5 7 5 20 7 4 9 7 7 13 Aggregation 0.98 0.95 0.91 0.91 0.80 0.90 0.81 0.59 0.24 0.58 0.26 0.64 0.82 0.43 (788, 2, 7) 7 9 6 6 8 12 11 7 7 277 54 12 7 15 Banana-arcs 0.91 1 0.43 0.50 0.11 1 1 1 0.7 0.98 0.48 0.41 0.39 0.50 (4515, 2, 4) 5 4 11 10 52 4 4 4 4 47 460 9 4 8 Banana-clump 0.98 1 0.77 1 0.67 1 1 1 0.92 0 0.19 0.75 0.37 0.65 (200, 2, 2) 4 2 3 2 4 2 2 4 2 1 12 3 2 5 Bullseye 1 1 0.21 0.31 0.15 -0.01 1 1 0.53 0.95 0.09 0.26 -0.00 0.26 (400, 2, 2) 2 2 7 5 11 2 2 2 2 14 23 5 2 5 Bullseye-Cig 0.91 1 0.62 0.61 0.17 1 1 0.23 0.44 0.99 0.56 0.53 0 0.44 (3025, 2, 8) 7 6 7 7 5 9 14 16 8 17 115 4 1 4 Compound 0.93 0.5 0.74 0.74 0.59 0.81 0.74 0.37 0.5 0.82 0.42 0.71 0.59 0.43 (399, 2, 6) 19 13 5 5 5 6 6 13 6 117 9 6 6 12 Half-ringed 0.88 0.95 0.37 0.37 0.12 1 0.26 0.20 0.03 0.07 0.1 0.21 0.27 0.3 (373, 2, 2) 16 3 6 6 11 2 2 8 2 155 17 5 2 5 Path-based 0.55 0.42 0.41 0.41 0 0.41 0 0.72 0.35 0.18 0.23 0.59 0.44 0.60 (300, 2, 3) 21 2 2 2 3 15 3 3 3 217 9 7 3 7 SCX-Bananas 0.95 1 0.81 0.79 0.19 1 1 0.87 0.23 0.99 0.25 0.44 0.52 0.38 (3420, 2, 8) 
8 8 7 9 39 8 8 4 8 25 389 16 8 17 Spiral 0.86 0.01 0 0 0 0.14 1 0.35 -0 0.1 0.42 0.06 0 0.09 (312, 2, 3) 3 2 1 1 1 9 3 11 3 208 7 7 1 7 SSS 1 0.77 0.77 0.84 0.63 1 1 0.20 0.36 0.95 0.16 0.37 0.94 0.40 (5015, 2, 3) 3 5 5 4 27 4 3 19 3 29 385 7 3 7 XXXX 0.95 0.96 1 1 1 1 1 0.29 0.99 0.99 0.33 0.90 1 0.74 (415, 2, 4) 10 5 4 4 4 4 4 20 4 8 20 6 4 9 D 0.06 0.17 0.35 0.32 0.58 0.22 0.17 0.40 0.53 0.35 0.64 0.44 0.48 0.52 Dσ 0.06 0.28 0.31 0.31 0.33 0.35 0.27 0.34 0.31 0.39 0.20 0.29 0.38 0.21 Kernel-estimated Nonparametric Overlap-Based Syncytial Clustering Table 9: Performance, in terms of R and estimated ˆC, on high-dimensional datasets. (In the table, m displays the effective dimension of the dataset and is the number of coordinates, PCs or KPCs used as per the descriptions in Section 3.3.) Dataset KNS K-m H DM DM+ MMC EAC GSN Sp C kk-m DBSCAN* DP PGMM MSAL MGHD (n, p, K, m) Simplex-7 0.97 0.97 0.97 0.97 0.97 0.94 0.94 0.92 0.94 0.03 0.58 0.97 0.98 0.96 (560, 7, 7, 7) 7 7 7 7 7 6 7 7 7 508 79 7 7 7 E.coli 0.72 0.63 0.70 0.68 0.59 0.77 0.03 0.26 0.45 0.31 0.38 0.48 0 0.60 (336, 7, 7, 5) 9 4 4 4 8 10 16 15 7 174 12 8 1 8 Wines-13 0.92 0.62 0.52 0.5 0.67 0.6 0.38 0.43 0.6 0 0.26 0.66 0 0.16 (178, 13, 3, 17) 3 11 7 7 7 8 7 23 7 178 15 5 1 6 Wines-27 0.93 -0.01 1 1 0.91 0.62 0 0.35 0.88 0 0.56 0.85 0 0.95 (178, 27, 3, 26) 3 2 3 3 4 7 1 8 3 178 12 3 1 3 Olive Oils-Area 0.55 0.56 0.85 0.82 0.66 0.55 0.51 0.48 0.56 0.47 0.40 0.61 0.70 0.58 (572, 8, 9, 8) 4 8 7 12 11 14 5 18 9 201 27 5 9 7 Olive Oils-Region 0.89 0.69 0.45 0.47 0.22 0.46 0.67 0.23 0.4 0.44 0.41 0.59 0.58 0.54 (572, 8, 3, 8) 4 8 7 12 11 14 5 18 9 201 27 5 9 7 Image 0.54 0.48 0.49 0.46 0.28 0.59 0.10 0.22 0.52 0 0.38 0 0 0.56 (2310, 19, 7, 8) 12 17 18 17 46 40 48 45 7 1 56 1 1 7 Yeast 0.22 0.01 -0.01 -0.01 0.04 0.004 0.003 0.11 0.14 0 0.03 0 0 0.04 (1484, 8, 10, 6) 5 4 6 6 37 13 33 17 10 1 44 1 1 10 ALL 0.68 0.19 0.14 0.14 0.35 0.53 0.54 0.55 0.5 0 0.45 0.61 0 0 (215, 1000, 7, 42) 6 18 6 6 9 9 5 12 7 215 41 7 1 1 
Zipcode 0.76 0.55 0.35 0.33 0.01 0.21 0.54 0.54 0 0 0.58 0.52 0 0.56 (2000, 256, 10, 33) 9 22 36 33 1 45 7 23 10 2000 53 6 1 6 Pendigits 0.72 0.51 0.58 0.6 0.26 0.58 0.004 0.42 0.3 0.3 0.54 0.44 0 0.67 (10992, 16, 10, 18) 15 9 40 32 59 58 59 48 10 7 9 6 1 10 D 0.04 0.29 0.21 0.22 0.31 0.22 0.44 0.35 0.27 0.62 0.35 0.24 0.56 0.25 Dσ 0.09 0.27 0.20 0.20 0.23 0.17 0.29 0.21 0.21 0.26 0.16 0.15 0.33 0.26 Almod ovar-Rivera and Maitra F. Alimoglu. Combining multiple classifiers for pen-based handwritten digit recognition. Master s thesis, Institute of Graduate Studies in Science and Engineering, Bogazici University, 1996. F. Alimoglu and E. Alpaydin. Methods of combining multiple classifiers based on different representations for pen-based handwriting recognition. In Proceedings of the Fifth Turkish Artificial Intelligence and Artificial Neural Networks Symposium (TAINN 96), Istanbul, Turkey, 1996. I. Almod ovar-Rivera and R. Maitra. RFASTf MRI: Fast adaptive smoothing and thresholding for improved activation detection in low-signal f MRI, 2019. R Package, URL http://github.com/ialmodovar/RFASTf MRI. I. A. Almod ovar-Rivera and R. Maitra. FAST adaptive smoothed thresholding for improved activation detection in low-signal f MRI. IEEE Transactions on Medical Imaging, 38(12): 2821 2828, 2019. doi: 10.1109/TMI.2019.2915052. A. Azzalini. A note on the estimation of a distribution function and quantiles by a kernel method. Biometrika, 68(1):326 328, 1981. P. A. Bandettini, A. Jesmanowicz, E. C. Wong, and J. S. Hyde. Processing strategies for time-course data sets in functional mri of the human brain. Magnetic Resonance in Medicine, 30:161 173, 1993. J.-P. Baudry and G. Celeux. Rmixmod Combi: Combining Mixture Components for Clustering, 2014. URL https://CRAN.R-project.org/package=Rmixmod Combi. R package version 1.0. J.-P. Baudry, A. E. Raftery, G. Celeux, K. Lo, and R. Gottardo. Combining mixture components for clustering. 
Journal of Computational and Graphical Statistics, 19(2):332–353, 2010.
J. W. Belliveau, D. N. Kennedy, R. C. McKinstry, B. R. Buchbinder, R. M. Weisskoff, M. S. Cohen, J. M. Vevea, T. J. Brady, and B. R. Rosen. Functional mapping of the human visual cortex by magnetic resonance imaging. Science, 254:716–719, 1991.
N. S. Berry and R. Maitra. TiK-means: Transformation-infused K-means clustering for skewed groups. Statistical Analysis and Data Mining: The ASA Data Science Journal, 12(3):223–233, 2019.
T. Bouezmarni and O. Scaillet. Consistency of asymmetric kernel density estimators and smoothed histograms with application to income data. Econometric Theory, 21(2):390–412, 2005.
C. Bouveyron, G. Celeux, T. B. Murphy, and A. E. Raftery. Model-Based Clustering and Classification for Data Science: With Applications in R. Cambridge University Press, Cambridge Series in Statistical and Probabilistic Mathematics, 2019.
R. P. Browne and P. D. McNicholas. A mixture of generalized hyperbolic distributions. Canadian Journal of Statistics, 43(2):176–198, 2015.
R. J. Campello, D. Moulavi, and J. Sander. Density-based clustering based on hierarchical density estimates. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 160–172. Springer, 2013.
H. Chang and D.-Y. Yeung. Robust path-based spectral clustering. Pattern Recognition, 41(1):191–203, 2008. ISSN 0031-3203. doi: https://doi.org/10.1016/j.patcog.2007.04.010. URL http://www.sciencedirect.com/science/article/pii/S0031320307002038.
S. Chattopadhyay and R. Maitra. Gaussian-mixture-model-based cluster analysis finds five kinds of gamma-ray bursts in the BATSE catalogue. Monthly Notices of the Royal Astronomical Society, 469(3):3374–3389, 2017. doi: 10.1093/mnras/stx1024.
S. Chattopadhyay and R. Maitra. Multivariate t-mixture-model-based cluster analysis of BATSE catalogue establishes importance of all observed parameters, confirms five distinct ellipsoidal sub-populations of gamma-ray bursts. Monthly Notices of the Royal Astronomical Society, 481(3):3196–3209, 2018. ISSN 0035-8711. doi: 10.1093/mnras/sty1940.
T. Chattopadhyay, R. Misra, A. K. Chattopadhyay, and M. Naskar. Statistical evidence for three classes of gamma-ray bursts. Astrophysical Journal, 667(2):1017, 2007. doi: https://doi.org/10.1086/520317.
A. Chaturvedi, P. E. Green, and J. D. Carroll. K-modes clustering. Journal of Classification, 18:35–55, 2001.
S. X. Chen. Probability density function estimation using gamma kernels. Annals of the Institute of Statistical Mathematics, 52(3):471–480, 2000.
J.-P. Dezalay, C. Barat, R. Talon, R. Syunyaev, O. Terekhov, and A. Kuznetsov. Short cosmic events – A subset of classical GRBs? In W. S. Paciesas and G. J. Fishman, editors, American Institute of Physics Conference Series, volume 265 of American Institute of Physics Conference Series, pages 304–309, 1992.
I. Dhillon, Y. Guan, and B. Kulis. A unified view of kernel k-means, spectral clustering and graph cuts. Technical Report TR-04-25, University of Texas at Austin, 2004.
K. S. Dorman and R. Maitra. An efficient k-modes algorithm for clustering categorical datasets. arXiv e-prints:2006.03936, 2020.
V. A. Epanechnikov. Non-parametric estimation of a multivariate probability density. Theory of Probability and its Applications, 14:153–158, 1969. doi: 10.1137/1114019.
M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD-96, volume 96, pages 226–231, 1996.
B. S. Everitt, S. Landau, and M. Leese. Cluster Analysis (4th ed.). Hodder Arnold, New York, 2001.
E. Forgy. Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. Biometrics, 21:768–780, 1965.
M. Forina and E. Tiscornia. Pattern recognition methods in the prediction of Italian olive oil origin by their fatty acid content. Annali di Chimica, 72:143–155, 1982.
M. Forina, C. Armanino, S. Lanteri, and E. Tiscornia. Classification of olive oils from their fatty acid composition. In Food Research and Data Analysis, pages 189–214. Applied Science Publishers, London, 1983.
M. Forina, R. Leardi, and S. Lanteri. PARVUS – an extendible package for data exploration, classification and correlation, 1988.
S. D. Forman, J. D. Cohen, M. Fitzgerald, W. F. Eddy, M. A. Mintun, and D. C. Noll. Improved assessment of significant activation in functional magnetic resonance imaging (fMRI): Use of a cluster-size threshold. Magnetic Resonance in Medicine, 33:636–647, 1995.
C. Fraley and A. E. Raftery. How many clusters? Which cluster method? Answers via model-based cluster analysis. Computer Journal, 41:578–588, 1998.
C. Fraley and A. E. Raftery. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97:611–631, 2002.
B. C. Franczak, R. P. Browne, and P. D. McNicholas. Mixtures of shifted asymmetric Laplace distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6):1149–1157, 2014.
B. C. Franczak, R. P. Browne, P. D. McNicholas, and K. L. Burak. MixSAL: Mixtures of Multivariate Shifted Asymmetric Laplace (SAL) Distributions, 2018. URL https://CRAN.R-project.org/package=MixSAL. R package version 1.0.
A. L. Fred and A. K. Jain. Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):835–850, 2005.
K. J. Friston, P. Jezzard, and R. Turner. Analysis of functional MRI time-series. Human Brain Mapping, 1:153–171, 1994.
K. J. Friston, A. P. Holmes, K. J. Worsley, J.-B. Poline, C. D. Frith, and R. S. J. Frackowiak. Statistical parametric maps in functional imaging: A general linear approach.
Human Brain Mapping, 2:189–210, 1995.
C. R. Genovese, N. A. Lazar, and T. E. Nichols. Thresholding of statistical maps in functional neuroimaging using the false discovery rate. Neuroimage, 15:870–878, 2002.
Z. Ghahramani and G. E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto, Toronto, Canada, 1997.
A. Gionis, H. Mannila, and P. Tsaparas. Clustering aggregation. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):4, 2007.
G. H. Glover. Deconvolution of impulse response in event-related BOLD fMRI. Neuroimage, 9:416–429, 1999.
M. Hahsler and M. Piekenbrock. dbscan: Density Based Clustering of Applications with Noise (DBSCAN) and Related Algorithms, 2018. URL https://CRAN.R-project.org/package=dbscan. R package version 1.1-3.
J. A. Hartigan and M. A. Wong. A k-means clustering algorithm. Applied Statistics, 28:100–108, 1979.
C. Hennig. Methods for merging Gaussian mixture components. Advances in Data Analysis and Classification, 2010. doi: 10.1007/s11634-010-0058-3.
A. Hinneburg and D. Keim. Cluster discovery methods for large databases: from the past to the future. In Proceedings of the ACM SIGMOD International Conference on the Management of Data, 1999.
P. Horton and K. Nakai. A probabilistic classification system for predicting the cellular localization sites of proteins. Intelligent Systems for Molecular Biology, pages 109–115, 1996.
Z. Huang. Clustering large data sets with mixed numeric and categorical values. In Proceedings of the First Pacific Asia Knowledge Discovery and Data Mining Conference, pages 21–34, Singapore, 1997. World Scientific.
Z. Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2:283–304, 1998.
L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985.
P. Jaccard. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547–579, 1901.
A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ, 1988.
A. K. Jain and M. H. C. Law. Data clustering: A user's dilemma. In S. K. Pal, B. S., and B. S., editors, Pattern Recognition and Machine Intelligence. PReMI 2005, volume 3776 of Lecture Notes in Computer Science, pages 1–10, Berlin, Heidelberg, 2005. Springer.
Y. Jeon and J. H. T. Kim. A gamma kernel density estimation for insurance loss data. Insurance: Mathematics and Economics, 53:569–579, 2013. doi: http://dx.doi.org/10.1016/j.insmatheco.2013.08.009.
S. Johnson. Hierarchical clustering schemes. Psychometrika, 32(3):241–254, 1967.
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data. John Wiley & Sons, New York, 1990.
W. J. Krzanowski and Y. Lai. A criterion for determining the number of groups in a data set using sum-of-squares clustering. Biometrics, pages 23–34, 1988.
K. K. Kwong, J. W. Belliveau, D. A. Chesler, I. E. Goldberg, R. M. Weisskoff, B. P. Poncelet, D. N. Kennedy, B. E. Hoppel, M. S. Cohen, R. Turner, H.-M. Cheng, T. J. Brady, and B. R. Rosen. Dynamic magnetic resonance imaging of human brain activity during primary sensory stimulation. Proceedings of the National Academy of Sciences of the United States of America, 89:5675–5679, 1992.
N. A. Lazar. The Statistical Analysis of Functional MRI Data. Springer, 2008.
A. Lithio and R. Maitra. An efficient k-means-type algorithm for clustering datasets with incomplete records. Statistical Analysis and Data Mining: The ASA Data Science Journal, 11(6):296–311, 2018.
S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
J. MacQueen. Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium, 1:281–297, 1967.
P. C. Mahalanobis. On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India, 2(1):49–55, 1936.
R. Maitra. Clustering massive datasets with applications to software metrics and tomography. Technometrics, 43(3):336–346, 2001.
R. Maitra. A statistical perspective to data mining. Journal of the Indian Society of Probability and Statistics, 6:28–77, 2002.
R. Maitra. Initializing partition-optimization algorithms. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6:144–157, 2009a. doi: 10.1109/TCBB.2007.70244.
R. Maitra. Assessing certainty of activation or inactivation in test-retest fMRI studies. Neuroimage, 47(1):88–97, 2009b.
R. Maitra. A re-defined and generalized percent-overlap-of-activation measure for studies of fMRI reproducibility and its use in identifying outlier activation maps. Neuroimage, 50(1):124–135, 2010.
R. Maitra and V. Melnykov. Simulating data to study performance of finite mixture modeling and clustering algorithms. Journal of Computational and Graphical Statistics, 19(2):354–376, 2010. doi: 10.1198/jcgs.2009.08054.
R. Maitra and I. P. Ramler. Clustering in the presence of scatter. Biometrics, 65:341–352, 2009.
R. Maitra, S. R. Roys, and R. P. Gullapalli. Test-retest reliability estimation of functional MRI data. Magnetic Resonance in Medicine, 48:62–70, 2002.
R. Maitra, V. Melnykov, and S. Lahiri. Bootstrapping for significance of compact clusters in multi-dimensional datasets. Journal of the American Statistical Association, 107(497):378–392, 2012. doi: http://dx.doi.org/10.1080/01621459.2011.646935.
E. P. Mazets, S. V. Golenetskii, V. N. Ilinskii, V. N. Panov, R. L. Aptekar, I. A. Gurian, M. P. Proskura, I. A. Sokolov, Z. I. Sokolova, and T. V. Kharitonova. Catalog of cosmic gamma-ray bursts from the KONUS experiment data. I. Astrophysics and Space Science, 80:3–83, 1981. doi: 10.1007/BF00649140. G.
McLachlan and D. Peel. Finite Mixture Models. John Wiley and Sons, Inc., New York, 2000.
G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.
P. D. McNicholas. Mixture Model-Based Classification. Chapman and Hall/CRC, 2016.
P. D. McNicholas and T. B. Murphy. Parsimonious Gaussian mixture models. Statistics and Computing, 18(3):285–296, 2008.
P. D. McNicholas, A. El Sherbiny, A. F. McDaid, and T. B. Murphy. pgmm: Parsimonious Gaussian Mixture Models, 2018. URL https://CRAN.R-project.org/package=pgmm. R package version 1.2.3.
V. Melnykov. Merging mixture components for clustering through pairwise overlap. Journal of Computational and Graphical Statistics, 25(1):66–90, 2016.
V. Melnykov and R. Maitra. Finite mixture models and model-based clustering. Statistics Surveys, 4:80–116, 2010. ISSN 1935-7516. doi: 10.1214/09-SS053.
V. Melnykov and R. Maitra. CARP: Software for fishing out good clustering algorithms. Journal of Machine Learning Research, 12:69–73, 2011.
V. Melnykov, W.-C. Chen, and R. Maitra. MixSim: An R package for simulating data to study performance of clustering algorithms. Journal of Statistical Software, 51(12):1–25, 2012. URL http://www.jstatsoft.org/v51/i12/.
C. D. Michener and R. R. Sokal. A quantitative approach to a problem in classification. Evolution, 11:130–162, 1957.
S. Mukherjee, E. D. Feigelson, G. Jogesh Babu, F. Murtagh, C. Fraley, and A. Raftery. Three types of gamma-ray bursts. Astrophysical Journal, 508:314–327, 1998. doi: 10.1086/306386.
K. Nakai. UCI machine learning repository, 1996. URL http://archive.ics.uci.edu/ml.
K. Nakai and M. Kanehisa. Expert system for predicting protein localization sites in gram-negative bacteria. PROTEINS: Structure, Function, and Genetics, 11:95–110, 1991.
R. B. Nelsen. An Introduction to Copulas. Springer, New York, 2nd edition, 2006.
D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases, 1998.
J. P. Norris, T. L. Cline, U. D. Desai, and B. J. Teegarden. Frequency of fast, narrow gamma-ray bursts. Nature, 308:434, 1984. doi: 10.1038/308434a0.
S. Ogawa, T. M. Lee, A. S. Nayak, and P. Glynn. Oxygenation-sensitive contrast in magnetic resonance image of rodent brain at high magnetic fields. Magnetic Resonance in Medicine, 14:68–78, 1990.
E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962. doi: 10.1214/aoms/1177704472.
T. L. Pedersen, S. Hughes, and X. Qiu. densityClust: Clustering by Fast Search and Find of Density Peaks, 2017. URL https://CRAN.R-project.org/package=densityClust. R package version 0.3.
A. D. Peterson, A. P. Ghosh, and R. Maitra. Merging k-means with hierarchical clustering for identifying general-shaped groups. Stat, 7(1):e172, 2018. doi: 10.1002/sta4.172.
T. Piran. The physics of gamma-ray bursts. Reviews of Modern Physics, 76:1143–1210, 2005. doi: 10.1103/RevModPhys.76.1143.
R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2018. URL http://www.R-project.org. ISBN 3-900051-07-0.
A. E. Raftery and N. Dean. Variable selection for model-based clustering. Journal of the American Statistical Association, 101:168–178, 2006.
D. B. Ramey. Nonparametric clustering techniques. In Encyclopedia of Statistical Science, volume 6, pages 318–319. Wiley, New York, 1985.
R.-D. Reiss. Nonparametric estimation of smooth distribution functions. Scandinavian Journal of Statistics, pages 116–119, 1981.
A. Rodriguez and A. Laio. Clustering by fast search and find of density peaks. Science, 344(6191):1492–1496, 2014.
M. Rosenblatt. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, 27(3):832–837, 1956. doi: 10.1214/aoms/1177728190.
L. Rüschendorf. Mathematical Risk Analysis. Springer-Verlag, Berlin Heidelberg, 2013. doi: 10.1007/978-3-642-33590-7.
S. Aeberhard, D. Coomans, and O. de Vel. Comparison of classifiers in high dimensional settings. Technical Report 92-02, Department of Computer Science and Department of Mathematics and Statistics, James Cook University of North Queensland, 1992.
F. S. Samaria and A. C. Harter. Parameterization of a stochastic model for human face identification. In Proceedings of the Second IEEE Workshop on Applications of Computer Vision, pages 138–142, Sarasota, Florida, 1994.
M. P. Sampat, Z. Wang, S. Gupta, A. C. Bovik, and M. K. Markey. Complex wavelet structural similarity: A new image similarity index. IEEE Transactions on Image Processing, 18(11):2385–2401, 2009. doi: 10.1109/TIP.2009.2025923.
O. Scaillet. Density estimation using inverse and reciprocal inverse Gaussian kernels. Nonparametric Statistics, 16(1-2):217–226, 2004.
G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978.
B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, London, 1986. ISBN 0-412-24620-1.
M. Smieja and M. Wiercioch. Constrained clustering with a complex cluster structure. Advances in Data Analysis and Classification, 11(3):493–518, 2017.
D. Steinley. Properties of the Hubert-Arabie adjusted Rand index. Psychological Methods, 9(3):386, 2004.
W. Stuetzle and R. Nugent. A generalized single linkage method for estimating the cluster tree of a density. Journal of Computational and Graphical Statistics, 2010. doi: 10.1198/jcgs.2009.07049.
C. A. Sugar and G. M. James. Finding the number of clusters in a dataset. Journal of the American Statistical Association, 98(463), 2003.
B. Thirion, G. Varoquaux, E. Dohmatob, and J.-B. Poline. Which fMRI clustering gives good brain parcellations? Frontiers in Neuroscience, 8:167, 2014. ISSN 1662-453X. doi: 10.3389/fnins.2014.00167. URL https://www.frontiersin.org/article/10.3389/fnins.2014.00167.
D. Titterington, A. Smith, and U. Makov. Statistical Analysis of Finite Mixture Distributions. John Wiley & Sons, Chichester, U.K., 1985.
C. Tortora, A. El Sherbiny, R. P. Browne, B. C. Franczak, P. D. McNicholas, and D. D. Amos. MixGHD: Model Based Clustering, Classification and Discriminant Analysis Using the Mixture of Generalized Hyperbolic Distributions, 2019. URL https://CRAN.R-project.org/package=MixGHD. R package version 2.3.2.
U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
K. Wagstaff. Clustering with missing values: No imputation required. In Classification, Clustering, and Data Mining Applications, pages 649–658. Springer, 2004.
M. P. Wand and M. C. Jones. Kernel Smoothing. Chapman & Hall/CRC, London, 1995. ISBN 0-412-55270-1.
R. Xu and D. C. Wunsch. Clustering. John Wiley & Sons, Hoboken, NJ, 2009.
E.-J. Yeoh, M. E. Ross, S. A. Shurtleff, W. Williams, D. Patel, R. Mahfouz, F. G. Behm, S. C. Raimondi, M. V. Relling, A. Patel, C. Cheng, D. Campana, D. Wilkins, X. Zhou, J. Li, H. Liu, C.-H. Pui, W. E. Evans, C. Naeve, L. Wong, and J. R. Downing. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell, 1(2):133–143, 2002. ISSN 1535-6108. doi: https://doi.org/10.1016/S1535-6108(02)00032-6.
C. T. Zahn. Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, C-20(1):68–86, 1971.
Y. Zhu, F. Dai, and R. Maitra. Three-dimensional radial visualization of high-dimensional continuous or discrete datasets. arXiv e-prints:1905.09505, 2019.