Journal of Machine Learning Research 21 (2020) 1-54. Submitted 7/18; Revised 5/20; Published 6/20

Kernel-estimated Nonparametric Overlap-Based Syncytial Clustering

Israel A. Almodóvar-Rivera (israel.almodovar@upr.edu), Department of Biostatistics and Epidemiology, University of Puerto Rico at Medical Science Campus, San Juan, PR 00936-5067, USA

Ranjan Maitra (maitra@iastate.edu), Department of Statistics, Iowa State University, Ames, IA 50011-1090, USA

Editor: Miguel A. Carreira-Perpiñán

Abstract

Commonly-used clustering algorithms usually find ellipsoidal, spherical or other regular-structured clusters, but are more challenged when the underlying groups lack formal structure or definition. Syncytial clustering is the name that we introduce for methods that merge groups obtained from standard clustering algorithms in order to reveal complex group structure in the data. Here, we develop a distribution-free, fully-automated syncytial clustering algorithm that can be used with k-means and other algorithms. Our approach estimates the cumulative distribution function of the normed residuals from an appropriately fit k-groups model and calculates the estimated nonparametric overlap between each pair of clusters. Groups with high pairwise overlap are merged as long as the estimated generalized overlap decreases. Our methodology is always a top performer in identifying groups with regular and irregular structures in several datasets and can be applied to datasets with scatter or incomplete records. The approach is also used to identify the distinct kinds of gamma ray bursts in the Burst and Transient Source Experiment 4Br catalog and the distinct kinds of activation in a functional Magnetic Resonance Imaging study.
Keywords: BATSE, DEMP, DEMP+, DBSCAN*, density peaks algorithm, GRB, GSLNN, k-clips, k-means, km-means, kernel density estimation, KNOB-SynC, MixModCombi, MGHD, MSAL, overlap, PGMM, SDSS, spectral clustering, TiK-means

(c) 2020 Israel Almodóvar-Rivera and Ranjan Maitra. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/18-435.html.

1. Introduction

Cluster analysis (Ramey, 1985; McLachlan and Basford, 1988; Kaufman and Rousseeuw, 1990; Everitt et al., 2001; Melnykov and Maitra, 2010; Xu and Wunsch, 2009; Bouveyron et al., 2019) is an unsupervised learning method that partitions datasets into distinct groups of homogeneous observations. Finding such structure in the absence of group information can be challenging but is important in many applications, such as taxonomical classification (Michener and Sokal, 1957), market segmentation (Hinneburg and Keim, 1999), software management (Maitra, 2001) and so on. As such, a number of methods, ranging from the heuristic (Johnson, 1967; Everitt et al., 2001; Jain and Dubes, 1988; Forgy, 1965; MacQueen, 1967; Kaufman and Rousseeuw, 1990) to the more formal, model-based (Titterington et al., 1985; McLachlan and Peel, 2000; Melnykov and Maitra, 2010; McNicholas, 2016; Bouveyron et al., 2019) approaches have been proposed and implemented. Most common clustering algorithms, whether model-agnostic methods like k-means (MacQueen, 1967; Hartigan and Wong, 1979; Lloyd, 1982) or model-based approaches such as Gaussian mixture models (Fraley and Raftery, 2002; Melnykov and Maitra, 2010), yield clusters with regular dispersions or structure. For instance, the k-means algorithm is geared towards finding homogeneous spherical clusters, that is, spherically-dispersed groups of equal radius.
Such algorithms are not designed to find general-shaped or structured groups; therefore, many additional approaches have been suggested to identify irregularly-shaped groups (see, for example, Dhillon et al., 2004; Fred and Jain, 2005; von Luxburg, 2007; Baudry et al., 2010; Hennig, 2010; Melnykov, 2016; Peterson et al., 2018). Kernel k-means clustering (Dhillon et al., 2004) enhances the k-means algorithm by using a kernel function φ(·) that nonlinearly maps the original (input) space to a higher-dimensional feature space where it may be possible to linearly separate clusters that were not linearly separable in the original space. Spectral clustering (von Luxburg, 2007) uses k-means on the first few eigenvectors of a Laplacian of the similarity matrix of the data. Both methods need the number of clusters to be provided: in the case of spectral clustering, von Luxburg (2007) suggests estimating this number as the one with the highest gap between successive eigenvalues. A separate set of approaches modifies the distribution of the mixture components in model-based clustering (MBC) by replacing the commonly-used multivariate Gaussian component with other, more general distributions. Some of these approaches simply add dimension reduction in the form of factor models (Ghahramani and Hinton, 1997; McNicholas and Murphy, 2008) through parsimonious Gaussian mixture models (PGMM). More generally, Franczak et al. (2014) propose MBC using a mixture of shifted asymmetric Laplace distributions (MixSAL), while Browne and McNicholas (2015) suggest using a mixture of generalized hyperbolic distributions (MixGHD). These approaches more fully exploit MBC but can be CPU intensive and are somewhat limited in capturing complex structures.
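The eigengap heuristic mentioned above can be sketched in a few lines. The following is an illustrative NumPy implementation, not code from the papers cited; the Gaussian similarity with bandwidth `sigma` is an assumed, not prescribed, choice:

```python
import numpy as np

def eigengap_k(X, sigma=1.0, kmax=10):
    # Pairwise squared distances and Gaussian similarity matrix
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    lam = np.sort(np.linalg.eigvalsh(L))   # ascending eigenvalues
    gaps = np.diff(lam[: kmax + 1])        # gaps between successive eigenvalues
    return int(np.argmax(gaps)) + 1        # number of clusters at the largest gap
```

For two well-separated groups, the first two Laplacian eigenvalues are near zero and the largest gap sits between the second and third, so the heuristic returns 2.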
Evidence accumulation clustering, or EAC (Fred and Jain, 2005), combines results from multiple runs of the k-means algorithm with the underlying rationale that each partitioning provides independent evidence of structure, which is then extricated by cross-tabulating the relative frequencies (out of the multiple partitionings) with which each observation pair is in the same group. This relative frequency table serves as a similarity matrix for hierarchical clustering; however, implementation of this method can be computationally demanding in terms of CPU speed and memory. Stuetzle and Nugent (2010) developed a nonparametric clustering approach under the premise that each group corresponds to a mode of the estimated multivariate density of the observations. The high-density modes are located and hierarchically clustered, with dissimilarity between two modes calculated in terms of the lowest density or number of common points in each mode's domain of attraction. The density-based spatial clustering of applications with noise (DBSCAN) algorithm (Ester et al., 1996) groups together points in high-density regions while identifying points in low-density regions as outliers. A refinement (DBSCAN*, by Campello et al., 2013) follows the same principle but classifies so-called border observations as outliers. Both algorithms depend on the minimum cluster size and reachability distance, and also on a cut-off to determine the border and outlying observations. The authors suggest setting this cutoff at the knee of a plot of the k-nearest neighbor distances of the observations. In a similar vein, Rodriguez and Laio (2014) developed a fast Density Peaks (DP) algorithm to determine cluster centers and find outliers while considering the local density of each observation.
DP uses the estimated multivariate density in order to classify observations into outliers and does not rely on an explicit cut-off value, but other parameters need to be subjectively specified or estimated to graphically decide on the number of groups. These methods all rely on density estimates and are not immune from the ravages of the curse of dimensionality. More recent work (Baudry et al., 2010; Hennig, 2010; Melnykov, 2016; Peterson et al., 2018) proposed merging groups found using MBC or k-means. Such methods fall into the category of what we introduce in this paper as syncytial clustering algorithms, because they yield a cluster structure resembling a syncytium, a term that in cell biology refers to a multi-nucleated mass of cytoplasm inseparable into individual cells and that can arise from multiple fusions of uninuclear cells. Syncytial clustering algorithms are similar in that they merge or fuse groups that originally corresponded to mixture model components or k-means or other regular-structured groups. Resulting partitions have groups with potentially multiple well-defined and structured sub-groups. We outline a few such algorithms next. MBC is premised on the idea of a one-to-one correspondence between a mixture component of given density form and a group. Such injective mapping assumptions are not always tenable, so some authors (Baudry et al., 2010; Hennig, 2010; Melnykov, 2016) model each group as a mixture of (one or more) components. Operationally, we have a syncytial clustering framework where identified mixture components that are not very distinct from each other are merged (Baudry et al., 2010; Hennig, 2010; Melnykov, 2016) into a cluster. Baudry et al. (2010) successively merge the mixture component pair that results in the highest change in entropy, continuing for as long as the entropy criterion improves. This method, abbreviated here as MMC, is implemented in the R (R Development Core Team, 2018) package RMixModCombi (Baudry and Celeux, 2014).
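The entropy-merging criterion of Baudry et al. can be sketched as follows (a schematic re-implementation, not the RMixModCombi code): pooling two components' posterior probabilities can only sharpen the soft assignments, and the pair chosen is the one whose pooling lowers the entropy the most.

```python
import numpy as np

def soft_entropy(tau):
    # Entropy of the soft assignment matrix tau (n x K posterior probabilities)
    t = np.clip(tau, 1e-12, 1.0)
    return float(-(t * np.log(t)).sum())

def best_entropy_merge(tau):
    # Return the component pair whose pooling most reduces the entropy,
    # together with the entropy after that merge
    n, K = tau.shape
    best_pair, best_ent = None, soft_entropy(tau)
    for k in range(K):
        for l in range(k + 1, K):
            merged = np.delete(tau, l, axis=1)
            merged[:, k] = tau[:, k] + tau[:, l]   # pool posteriors of k and l
            ent = soft_entropy(merged)
            if ent < best_ent:
                best_pair, best_ent = (k, l), ent
    return best_pair, best_ent
```

Two thoroughly confused components (posteriors near 0.5/0.5) are the natural first merge: pooling them drives their assignment entropy to essentially zero.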
Hennig (2010) developed the directly estimated misclassification probabilities (DEMP) algorithm to identify candidate components for merging. The author argued that the best measure of group similarity should relate to the classification probability and so proposed that clusters with the highest pairwise misclassification probabilities be merged. The DEMP+ method (Melnykov, 2016) mimics DEMP but replaces the misclassification probabilities of DEMP with the overlap measure of Maitra and Melnykov (2010) for Gaussian mixture components. DEMP+ uses Monte Carlo simulation to determine pairwise overlap between merged components and uses thresholds on the maximum pairwise overlap to determine termination. It was empirically suggested that this sliding threshold be chosen inversely related to dimension. The MBC algorithms offer a principled approach to the partitioning of observations into groups, but they are more demanding in CPU time and perhaps unnecessary when the objective is simply to find the most appropriate grouping with no particular dogma regarding shape or structure; there, using k-means as a starting point for an initial clustering may be a fairly plausible but faster alternative. Perhaps recognizing this aspect, Melnykov (2016) contended that DEMP+ can be applied to k-means output by assuming equal mixing proportions and homogeneous spherical dispersions in the mixture model. The basis for this assertion is the framing of the k-means algorithm of Lloyd (1982) as a Classification Expectation-Maximization (CEM) algorithm (see Fraley and Raftery, 1998, for details). But k-means clustering makes hard assignments of each observation and, indeed, most commonly-used statistical software programs, such as R (R Development Core Team, 2018), use the efficient Hartigan and Wong (1979) algorithm, which handles computations quite differently from, and more sparingly than, Lloyd (1982). In this vein, Peterson et al.
(2018) provided the K-mH algorithm to merge poorer-separated k-means groups. Such groups are identified as per an easily-computed index that uses normal theory with spherical dispersion assumptions. However, the K-mH algorithm has a large number of settings and parameters: using default values and rules-of-thumb provided by the authors, we have found that this method performs well on many datasets but not as well on many others. Therefore, it would be worth investigating other syncytial clustering algorithms that use k-means groupings for clustering efficiency while also reducing the need to tune multiple parameter settings. A separate issue is the impact of Gaussian mixture model assumptions in methods such as DEMP+ when applied to regular-structured groups found using, say, the multivariate t-mixture or other appropriate models. A nonparametric method not taking recourse to such distributional assumptions would be desirable in addressing this shortcoming. This paper therefore proposes the Kernel-estimated Nonparametric Overlap-Based Syncytial Clustering (KNOB-SynC) algorithm that successively merges groups from a well-optimized k-means solution until some objective and nonparametric data-driven cluster overlap measure vanishes or is no longer reduced. This measure is calibrated through the generalized overlap (Maitra, 2010; Melnykov and Maitra, 2011; Melnykov et al., 2012) calculated using smooth estimation of the cumulative distribution function (CDF) developed in Section 2. Our algorithm is illustrated and comprehensively evaluated in Section 3. Although motivated using k-means, the method is general enough to apply to the output of other partitioning algorithms, such as clustering using the Mahalanobis (1936) distance, or in scenarios with scatter (Maitra and Ramler, 2009) or incomplete records (Lithio and Maitra, 2018).
Section 4 applies our methodology to two interesting settings: in the first case, we identify the different kinds of gamma ray bursts in the most recent Burst and Transient Source Experiment (BATSE) 4Br catalog. Our second application uses KNOB-SynC to identify activation from single replications of a functional Magnetic Resonance Imaging (fMRI) study obtained from a right-hand finger tapping experiment performed by a right-hand-dominant male. We find our results to be both interpretable and more reproducible than those of current methods. The paper concludes with some discussion. An appendix provides mathematical proofs of the theoretical properties we derive for smooth estimation of the CDF using asymmetric kernel density estimation, detailed graphical illustrations of experimental performance on two-dimensional (2D) datasets, and numerical summaries of performance on all datasets.

2. Methodological Development

2.1. Problem Setup

Let Ξ = {X₁, X₂, ..., Xₙ} be a random sample of n p-dimensional observations, with each

$X_i \sim \prod_{c=1}^{C} [f_c(x)]^{\zeta_{ic}},$ (1)

where C is the number of groups, $\zeta_{ic} = I(X_i \in C_c)$ with I(Z) = 1 if Z holds and 0 otherwise, $f_c(x)$ is the cluster-specific density of an observation in the cth cluster, and $C_c$ is the set of observations in the sample from that group. Our specification in (1) refers to a hard clustering framework: we marginally obtain a mixture model if we specify independent identical multinary prior distributions on each $\zeta_{ic}$. Our objective is to estimate the $\zeta_{ic}$'s (equivalently, the $C_c$'s) for each c = 1, 2, ..., C, with C possibly unknown. We also assume that for each c = 1, 2, ..., C, the density $f_c(x)$ for any $X_i \in C_c$ (i.e. $\zeta_{ic} = 1$) can be further described by

$f_c(x) = \prod_{k=1}^{k_c} \left[h(\|x - \mu^{C_c}_k\|)\right]^{\zeta^{C_c}_{ik}},$ (2)

where h(·) is defined on the positive half of the real line so that $h(\|x\|)$ is a zero-centered density in $\mathbb{R}^p$ with spherical level hyper-surfaces.
This means that each group in the dataset can be further decomposed into multiple homogeneous spherically-dispersed subgroups, and $\zeta^{C_c}_{ik} = 1$ if $X_i$ is in $C_c$ and in the kth subgroup inside $C_c$, and zero otherwise. That is, we can model $X_i \in \Xi$ as

$X_i \sim \prod_{c=1}^{C}\prod_{k=1}^{k_c}\left[h(\|x - \mu^{C_c}_k\|)\right]^{\zeta^{C_c}_{ik}}$, or equivalently as $X_i \sim \prod_{k=1}^{K}\left[h(\|x - \mu^{*}_k\|)\right]^{\zeta^{*}_{ik}},$ (3)

where $\zeta^{*}_{ik}$ and $\mu^{*}_k$ for k = 1, 2, ..., K are renumerations, respectively, of all the $\zeta^{C_c}_{ik}$ and $\mu^{C_c}_k$ for k = 1, 2, ..., $k_c$, c = 1, 2, ..., C. Therefore, $K = \sum_{c=1}^{C} k_c$, $\zeta_{ic} = \sum_{k=1}^{k_c}\zeta^{C_c}_{ik}$ for c = 1, 2, ..., C, and $\sum_{k=1}^{K}\zeta^{*}_{ik} \equiv \sum_{c=1}^{C}\sum_{k=1}^{k_c}\zeta^{C_c}_{ik} = 1$ (however, both K and C are also unknown). The reformulation of (1) in terms of (3) means that the k-means algorithm (Forgy, 1965; Lloyd, 1982; Hartigan and Wong, 1979) can be employed along with cluster-selection methods (for example, Krzanowski and Lai, 1988; Sugar and James, 2003; Maitra et al., 2012) to obtain a first-pass clustering of the dataset where the observations are partitioned into an estimated number ($\hat K$) of homogeneous spherically-dispersed groups. Our proposal is to develop methods for identifying the supersets of these k-means (homogeneous spherical) groups to obtain the clusters $\{C_c; c = 1, 2, \ldots, C\}$, with C also needing to be estimated. These supersets will reveal the general-shaped clustering structure in the data. From the $\hat K$-groups solution, define the ith residual (i = 1, 2, ..., n) as

$\hat\epsilon_i = X_i - \sum_{k=1}^{\hat K}\hat\mu^{*}_k\hat\zeta^{*}_{ik},$ (4)

where $\hat\mu^{*}_k$ is the multivariate mean vector of the observations in the kth group and $\hat\zeta^{*}_{ik} = I(X_i \in k\text{th } k\text{-means group})$. From (4), we obtain the normed residuals, that is,

$\hat\Psi_i = \sqrt{\hat\epsilon_i'\hat\epsilon_i} = \Big\|X_i - \sum_{k=1}^{\hat K}\hat\zeta^{*}_{ik}\hat\mu^{*}_k\Big\|$ (5)

for i = 1, 2, ..., n. These $\hat\Psi_1, \hat\Psi_2, \ldots, \hat\Psi_n$ may be viewed as a random sample with density function h(·) and CDF H(·), having support in [0, ∞). We now provide methods for estimating H(·) under assumptions of a smooth CDF.

2.2.
Smooth estimation of the CDF of the normed residuals

We first introduce a smooth estimator for a univariate CDF. Let $Y_1, Y_2, \ldots, Y_n$ be a random sample having CDF H(·) and probability density function (PDF) h(·). The natural and most common estimator is the empirical CDF (ECDF), defined as

$\hat H_n(y) = \frac{1}{n}\sum_{i=1}^{n} I(Y_i \le y).$ (6)

It is easy to see that $\hat H_n(y)$ is an unbiased estimator of H(y), that is, $E[\hat H_n(y)] = H(y)$. Further, it converges almost surely to the true CDF H(·). However, the ECDF is a step function for any finite n and so is inappropriate for a smooth continuous CDF, even though it is a smooth function in the limit as n → ∞ (Silverman, 1986). An alternative kernel estimator (Rosenblatt, 1956; Parzen, 1962; Silverman, 1986; Wand and Jones, 1995) for H(·) replaces the indicator function in (6) by its smooth cousin. Strictly speaking, kernel density estimation is most often employed in nonparametric contexts (Silverman, 1986) but can also be extended to smooth CDF estimation by integrating over the domain of the kernel. Let $G(y) = \int_{-\infty}^{y} K(u)\,du$ be the CDF of a kernel function K(·). The kernel CDF estimator is then defined as

$\hat H(y; b) = \frac{1}{n}\sum_{i=1}^{n} G\left(\frac{y - Y_i}{b}\right),$ (7)

where b is the bandwidth or smoothing parameter. Equation (7) makes the popular assumption of a symmetric kernel, the most common examples of which are the Gaussian and Epanechnikov (Epanechnikov, 1969; Azzalini, 1981; Reiss, 1981) kernels. However, using a symmetric kernel when the support of the distribution is not the entire real line (as is the case with our normed residuals) causes weight to be assigned outside the domain of the observations, resulting in boundary bias (Bouezmarni and Scaillet, 2005). So Chen (2000) proposed using an asymmetric kernel in (7) based on the gamma density, with behavior similar to the Gaussian kernel and a comparable rate of convergence in terms of the mean squared error.
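To make (6) and (7) concrete, here is a small sketch with a Gaussian kernel (so that G is the standard normal CDF); as b → 0 the smooth estimate approaches the step-function ECDF:

```python
import numpy as np
from math import erf, sqrt

def ecdf(y, sample):
    # Empirical CDF of eq. (6): proportion of observations at or below y
    return float(np.mean(np.asarray(sample) <= y))

def kernel_cdf(y, sample, b):
    # Smooth kernel CDF of eq. (7) with a Gaussian kernel:
    # the indicator 1(Y_i <= y) is replaced by G((y - Y_i)/b)
    G = [0.5 * (1.0 + erf((y - Yi) / (b * sqrt(2.0)))) for Yi in sample]
    return float(np.mean(G))
```

Note that a Gaussian kernel is used here purely for illustration; as the text goes on to explain, a symmetric kernel is a poor choice for the nonnegative normed residuals.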
However, Chen (2000)'s estimator is not a valid density for finite sample sizes (Jeon and Kim, 2013), so we consider the Reciprocal Inverse Gaussian (RIG) kernel density estimator (Scaillet, 2004)

$\hat h(y; b) = \frac{1}{n}\sum_{i=1}^{n} K(y; Y_i, b),$ (8)

with $K(y; Y_i, b) = \frac{1}{\sqrt{2\pi b Y_i}}\exp\left\{-\frac{1}{2bY_i}\left[Y_i - (y - b)\right]^2\right\}$. Since $K(y; Y_i, b)$ is a smooth function, the CDF estimate is defined as $G(y; Y_i, b) = \int_0^y K(t; Y_i, b)\,dt$. Then, integrating $\hat h(y; b)$ with respect to y yields the smooth CDF estimator

$\hat H(y; b) = \frac{1}{n}\sum_{i=1}^{n} G(y; Y_i, b) = \frac{1}{n}\sum_{i=1}^{n}\left[\Phi\left(\frac{Y_i + b}{\sqrt{Y_i b}}\right) - \Phi\left(\frac{Y_i - (y - b)}{\sqrt{Y_i b}}\right)\right],$ (9)

where Φ(·) is the standard Gaussian CDF. An added benefit of using the RIG kernel over the gamma kernel is that the estimated CDF is in closed form and can be readily evaluated using standard software. We now investigate some theoretical properties of the asymmetric RIG kernel CDF estimator. Before proceeding, however, we revisit the definitions of the Inverse Gaussian and RIG densities for the sake of completeness and to fix ideas.

Definition 1 A nonnegative random variable $U_{\mu,\lambda}$ is said to arise from the Inverse Gaussian distribution with parameters (µ, λ) if it has the density

$r(u; \mu, \lambda) = \sqrt{\frac{\lambda}{2\pi u^3}}\,\exp\left\{-\frac{\lambda}{2\mu}\left(\frac{u}{\mu} - 2 + \frac{\mu}{u}\right)\right\}$ for u > 0, and 0 otherwise. (10)

Notationally, we write $U_{\mu,\lambda} \sim IG(\mu, \lambda)$. Also, we have $E(U_{\mu,\lambda}) = \mu$ and $\mathrm{Var}(U_{\mu,\lambda}) = \mu^3/\lambda$.

Definition 2 A nonnegative random variable $V_{\mu,\lambda}$ is said to be from the Reciprocal Inverse Gaussian distribution with parameters (µ, λ) if it has the density

$s(v; \mu, \lambda) = \sqrt{\frac{\lambda}{2\pi v}}\,\exp\left\{-\frac{\lambda}{2\mu}\left(v\mu - 2 + \frac{1}{\mu v}\right)\right\}$ for v > 0, and 0 otherwise. (11)

Notationally, $V_{\mu,\lambda} \sim RIG(\mu, \lambda)$. Further, $V_{\mu,\lambda}$ is equivalent in law to $1/U_{\mu,\lambda}$ where $U_{\mu,\lambda} \sim IG(\mu, \lambda)$, and $E(V_{\mu,\lambda}) = 1/\mu + 1/\lambda$ while $\mathrm{Var}(V_{\mu,\lambda}) = (\lambda + 2\mu)/(\lambda^2\mu)$. We now develop some properties of the asymmetric RIG kernel for estimating the CDF.

Lemma 3 Let $Y_1, Y_2, \ldots, Y_n$ be independent identically distributed nonnegative-valued random variables with CDF H(y), and PDF h(y) that is infinitely differentiable.
Also, consider the RIG kernel density defined by $K(t; Y_i, b) = \varphi\left[(Y_i - (t - b))/\sqrt{bY_i}\right]/\sqrt{bY_i}$, where φ(z) is the standard normal density evaluated at z. Consider estimating H(y) using $\hat H(y; b)$ as defined in (9). Then, as b → 0,

$E[\hat H(y; b)] = H(y) + \frac{b}{2}\left[y h'(y) - h(y)\right] + o(b) = H(y) + O(b)$

and

$\mathrm{Var}[\hat H(y; b)] = \frac{H(y)(1 - H(y))}{n} - \frac{H(y)\sqrt{b}}{2n\sqrt{2\pi y}} + o(\sqrt{b}) = \frac{H(y)(1 - H(y))}{n} + O(\sqrt{b}).$

Proof See Appendix A.1.

Lemma 3 shows that $\hat H(y; b)$ has lower variance than the ECDF, with point-wise Mean Squared Error (MSE) at y given by $\mathrm{MSE}[\hat H(y; b)] = \mathrm{Var}[\hat H(y; b)] + [\mathrm{Bias}\{\hat H(y; b)\}]^2$.

2.2.1. Bandwidth selection

Scaillet (2004) minimized the Mean Integrated Squared Error (MISE) to provide a rule-of-thumb bandwidth selector for the RIG kernel density estimator of the form

$b^{*} = \left[\frac{2\int_0^\infty y^{-1/2} h(y)\,dy}{\sqrt{\pi}\int_0^\infty y^2\{h''(y)\}^2\,dy}\right]^{2/5} n^{-2/5}.$ (12)

However, (12) involves knowledge of the true density h(·) and is directly unusable. Scaillet (2004) proposed obtaining $\hat b$ by assuming an initial parametric density, say h(·; θ), for h(·) and estimating the parameters θ of the density from the sample. Exact derivations using a lognormal density for h(·; θ) were provided (Scaillet, 2004), but this approach has been found to produce estimates that are biased downwards. We therefore adopt Scaillet (2004)'s approach but use an initial gamma density $h(y; \vartheta, \tau) = \exp(-y/\tau)\,y^{\vartheta - 1}\tau^{-\vartheta}/\Gamma(\vartheta)$ for y > 0 and zero otherwise. Under this setup, $\int_0^\infty y^{-1/2} h(y; \vartheta, \tau)\,dy = \Gamma(\vartheta - 1/2)/\{\sqrt{\tau}\,\Gamma(\vartheta)\}$ and $\int_0^\infty y^2\{h''(y; \vartheta, \tau)\}^2\,dy = (6\vartheta - 4)(\vartheta - 1)\Gamma(2\vartheta)/\{4^{\vartheta}\tau^3\Gamma^2(\vartheta)(2\vartheta - 1)\}$. Therefore, we have

$\hat b = \left[\frac{2^{2\hat\vartheta + 1}\,\hat\tau^{5/2}\,(2\hat\vartheta - 1)\,\Gamma(\hat\vartheta - \tfrac12)\,\Gamma(\hat\vartheta)}{\sqrt{\pi}\,(6\hat\vartheta - 4)(\hat\vartheta - 1)\,\Gamma(2\hat\vartheta)}\right]^{2/5} n^{-2/5},$ (13)

with $\hat\vartheta$ and $\hat\tau$ estimated from the sample $Y_1, Y_2, \ldots, Y_n$ using, for example, the method of moments. This $\hat b$ is used in (9) to obtain our smoothed RIG-kernel CDF estimator. The development of this section, when applied to the normed residuals $\hat\Psi_1, \hat\Psi_2, \ldots, \hat\Psi_n$ (in place of $Y_1, Y_2, \ldots, Y_n$), yields a smooth nonparametric kernel-based estimator of their CDF.
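Putting (9) and (13) together, the following is a sketch of the smooth RIG-kernel CDF estimator with the gamma-based rule-of-thumb bandwidth (method-of-moments fit; it assumes the fitted shape $\hat\vartheta > 1$ so that all the gamma functions and the bandwidth constant are well defined):

```python
import numpy as np
from math import erf, sqrt, pi, gamma

def norm_cdf(z):
    # Standard Gaussian CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def rot_bandwidth(sample):
    # Rule-of-thumb bandwidth of eq. (13): gamma shape (th) and scale (tau)
    # fitted by the method of moments
    m, v = float(np.mean(sample)), float(np.var(sample))
    th, tau = m * m / v, v / m
    num = 2.0 ** (2 * th + 1) * tau ** 2.5 * (2 * th - 1) * gamma(th - 0.5) * gamma(th)
    den = sqrt(pi) * (6 * th - 4) * (th - 1) * gamma(2 * th)
    return (num / den) ** 0.4 * len(sample) ** (-0.4)

def rig_cdf(y, sample, b):
    # Smooth RIG-kernel CDF estimator of eq. (9)
    terms = [norm_cdf((Yi + b) / sqrt(Yi * b)) -
             norm_cdf((Yi - (y - b)) / sqrt(Yi * b)) for Yi in sample]
    return float(np.mean(terms))
```

The estimate is a proper, closed-form CDF on [0, ∞): each summand is nonnegative and increasing in y, so the estimator is monotone and bounded by 1.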
We use this kernel-estimated CDF in our development of the nonparametric estimation of the overlap measure between groups.

2.3. A nonparametric estimator of overlap between groups

Overlap between two groups is an indicator of the extent to which they are indistinguishable from each other. Maitra and Melnykov (2010) defined the pairwise overlap of two mixture components as the sum of the misclassification probabilities

$\omega_{lk} \equiv \omega_{kl} = \omega_{l|k} + \omega_{k|l}$, with $\omega_{l|k} = P[X \text{ is assigned to } C_l \mid X \text{ is truly in } C_k].$ (14)

For any two mixture components with densities $f(x; \theta_k)$ and $f(x; \theta_l)$ and mixing proportions $\pi_k$ and $\pi_l$, we have $\omega_{k|l} = P\left[\pi_l f(X_i; \theta_l) < \pi_k f(X_i; \theta_k) \mid X_i \sim f(\cdot; \theta_l)\right]$, where $\theta_k$ and $\theta_l$ are the parameter sets associated with the kth and lth mixture components. Maitra and Melnykov (2010) calculated (14) for Gaussian mixture densities, but the definition itself is general enough to include other clustering situations, including those as general as when we have cluster distributions given by densities of the type in (1). For an equal-proportioned mixture of homogeneous spherical Gaussian densities, Maitra and Melnykov (2010) showed that $\omega_{k|l} = \Phi(-\|\mu_l - \mu_k\|/(2\sigma))$ between the kth and the lth cluster, where Φ(·) is the standard Gaussian CDF, $\mu_k$ and $\mu_l$ are the kth and the lth cluster means, and σ is the common (homogeneous) standard deviation for each group, with $\sigma^2$ estimated unbiasedly as $\mathrm{WSS}_K/\{(n - K)p\}$, $\mathrm{WSS}_K$ being the optimized value of the within-sums-of-squares (WSS) of the K-groups solution. The sum of $\omega_{k|l}$ and $\omega_{l|k}$ reduces to $\omega_{kl} = 2\Phi(-\|\mu_l - \mu_k\|/(2\sigma))$. The k-means formulation of (3) can be viewed more generally (Maitra et al., 2012) and extends beyond the case of Gaussian-distributed groups, so we develop nonparametric methods for estimating the overlap measure.

2.3.1.
Pairwise overlap between two k-means groups

The pairwise overlap (14) between two groups can generally be calculated from $H_\Psi(\cdot)$ as

$\omega_{l|k} = P\left(\|X - \mu_l\| < \|X - \mu_k\| \mid X \in C_k\right) = 1 - P\left(\Psi_k < \Psi_{l(k)}\right),$ (15)

where $\Psi_k$ represents the normed residual obtained from the kth group, and $\Psi_{l(k)}$ represents the normed pseudo-residual, which we define as the norm of the remainder obtained by subtracting the lth cluster mean $\mu_l$ from an observation $X \in C_k$. Let $\hat H_\Psi(y; \hat b)$ be the RIG kernel-estimated smooth CDF obtained using the bandwidth selected as per (13). Then $P(\Psi_k < y)$ can be estimated using $\hat H_\Psi(y; \hat b)$ (where Ψ in the subscript of $\hat H(\cdot;\cdot)$ denotes that the estimated CDF uses the normed residuals). However, the calculation of $P(\Psi_k < \Psi_{l(k)})$ is not as straightforward, so we estimate it using a naïve average estimator

$\hat P\left(\Psi_k < \Psi_{l(k)}\right) = \frac{1}{n^{*}_k}\sum_{i=1}^{n} \hat\zeta^{*}_{ik}\,\hat H_\Psi(\|X_i - \hat\mu^{*}_l\|; \hat b),$ (16)

where $n^{*}_k = \sum_{i=1}^{n} \hat\zeta^{*}_{ik}$. The naïve estimator (16) can be considered an empirical estimator of $E[\hat H_\Psi(\|X_i - \hat\mu_l\|; \hat b) \mid \hat\mu_k, \hat\mu_l, X_i \in k\text{th spherically-dispersed group}]$. Similar estimates of $\omega_{k|l}$, and therefore $\omega_{kl}$, can be obtained. We call this estimated overlap $\hat\omega_{kl} \equiv \hat\omega_{lk}$.

2.3.2. Pairwise overlap between two composite groups

As described in (2), a composite group is one that can be further decomposed into subpopulations. We now extend the definition of the pairwise overlap for such groups. Let $\omega_{C_l|C_k}$ be defined as in (14) but for composite groups. That is, we use $\omega_{C_l|C_k}$ rather than $\omega_{l|k}$ in order to specify that the overlap measure is between composite clusters $C_l$ and $C_k$. Now $\omega_{C_l|C_k} = 1 - P[\min_{r \in C_k}\|X - \mu_r\| < \min_{j \in C_l}\|X - \mu_j\| \mid X \in C_k]$. Suppose now that $C^s_k$ is the sth spherical sub-cluster of $C_k$ with mean $\mu^{*}_s$, s = 1, 2, ..., $|C_k|$, with $|C_k|$ being the number of spherical sub-clusters in $C_k$. We assume that if $X \in C_k$, then $\mathrm{argmin}_{r \in \{1,2,\ldots,|C_k|\}}\|X - \mu^{*}_r\| = s$ implies that X is in the subgroup given by $C^s_k$.
Under this assumption, the density of X is defined through its (sth) sub-cluster, and so

$P\left(\min_{r \in C_k}\|X - \mu_r\| \le y \mid X \in C_k\right) = 1 - P\left(\min_{r \in C_k}\Psi_r > y\right) = 1 - \left[1 - P(\Psi_r \le y)\right]^{|C_k|},$ (17)

where $\Psi_r$ is a normed residual (obtained, for instance, from the k-means solution) for the rth spherically-dispersed subgroup in the kth cluster. We use the RIG kernel distribution estimator to obtain $P(\Psi_r \le y)$. From (14), and using the same ideas as in (16), we get the naïve estimator

$\hat\omega_{C_l|C_k} = 1 - \frac{1}{n_{C_k}}\sum_{i=1}^{n} \hat\zeta_{ik}\,\hat H_\Psi\left(\min_{r \in C_l}\|X_i - \hat\mu_r\|; \hat b\right),$ (18)

and similarly for $\hat\omega_{C_k|C_l}$, from where we calculate $\hat\omega_{C_l C_k} \equiv \hat\omega_{C_k C_l} = \hat\omega_{C_l|C_k} + \hat\omega_{C_k|C_l}$. Our definitions of the $C_k$'s and $\hat\omega_{C_k C_l}$ are consistent in the sense that if $C_k = \{k\}$ and $C_l = \{l\}$ are both k-means groups, then $\hat\omega_{C_k C_l} = \hat\omega_{kl}$. We use this equivalence in the description of our KNOB-SynC algorithm in Section 2.4 below.

2.3.3. Summarizing overlap in a partitioning

Our development so far has provided us with pairwise overlap measures for k-means-type (Section 2.3.1) and composite (Section 2.3.2) groups. For a K-groups (whether of the composite or k-means type) partitioning, we get $\binom{K}{2}$ pairwise overlap measures. Summarizing the pairwise overlap measures is important to provide a sense of clustering complexity, so Maitra and Melnykov (2010) originally proposed regulating $\check\omega$ (the maximum of all pairwise overlaps) and $\bar\omega$ (the average of all $\binom{K}{2}$ pairwise overlaps) and demonstrated (see Figures 2 and 3 of Maitra and Melnykov, 2010) the ability to summarize a wide range of cluster geometries. However, because specifying two measures simultaneously is cumbersome, later versions of the CARP (Melnykov and Maitra, 2011) and MixSim (Melnykov et al., 2012) software packages borrowed ideas from Maitra (2010) to obtain the generalized overlap $\ddot\omega = (\check\lambda_\Omega - 1)/(K - 1)$, where $\check\lambda_\Omega$ is the largest eigenvalue of the (symmetric) matrix Ω of pairwise overlaps $\omega_{lk}$ ($\omega_{C_k C_l}$ for composite groups) and with diagonal entries that are all 1.
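The generalized overlap just described reduces to a single eigen-computation; a minimal sketch:

```python
import numpy as np

def generalized_overlap(Omega):
    # (largest eigenvalue of Omega - 1) / (K - 1), where Omega is the
    # symmetric K x K pairwise-overlap matrix with unit diagonal
    K = Omega.shape[0]
    return float((np.linalg.eigvalsh(Omega)[-1] - 1.0) / (K - 1.0))
```

For the identity matrix (perfectly separated groups) the result is 0, and for the all-ones matrix (completely indistinguishable groups) it is 1, matching the extremes described next.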
$\ddot\omega$ lies in [0, 1], with zero indicating perfect separation between all group densities and 1 indicating indistinguishability between any of them. In this paper, we obtain the estimated generalized overlap $\hat{\ddot\omega}$ using the estimated matrix $\hat\Omega$ with off-diagonal entries given by the kernel-estimated pairwise overlaps $\hat\omega_{lk}$ or $\hat\omega_{C_k C_l}$, depending on whether we have simple k-means-type or composite groups.

2.4. The KNOB-SynC Algorithm

Having provided theoretical development for the machinery that we will use, we now describe our multi-phased KNOB-SynC algorithm:

1. The k-means phase: This phase finds the optimal partition of the dataset in terms of homogeneous spherically-dispersed groups and has the following steps:

(a) For each $K \in \{1, 2, \ldots, K_{\max}\}$, obtain K-means partitions initialized $\sqrt{n}Kp$ times, each with K distinct seeds randomly chosen from the dataset and run to termination. The best, in terms of the value of the objective function (WSS) at termination of each set of runs, is our putative optimal K-means partition for that $K \in \{1, 2, \ldots, K_{\max}\}$. We use $K_{\max} = \max\{\sqrt{n}, 50\}$.

(b) When n is small relative to p (operationally, $n < p^2$), use Krzanowski and Lai (1988)'s KL criterion to decide on the optimal K. Otherwise, for larger n, we use the jump statistic (Sugar and James, 2003) on the optimal K-means partitions ($K \in \{1, 2, \ldots, K_{\max}\}$) obtained in Step 1a to determine the optimal K (denoted by $\hat K$). In calculating the jump statistic, we have used y = p/2, which has become the default in most applications. We refer to Sugar and James (2003) for more detailed discussion of this choice of y. The corresponding $\hat K$-means solution is the optimal homogeneous spherically-dispersed partition of the dataset. This concludes the k-means phase of the algorithm.

2. The initial overlap calculation phase: This phase starts with the output of Step 1. That is, we start with a structural definition of the dataset in terms of $\hat K$ optimal homogeneous spherically-dispersed groups.
Our objective here is to calculate the overlap between each pair of these groups using nonparametric kernel estimation methods. We proceed as follows:

(a) For each observation $X_i$, i = 1, 2, ..., n, compute its normed residual $\hat\Psi_i = \sqrt{\hat\epsilon_i'\hat\epsilon_i}$, where $\hat\epsilon_i$ is defined as in (4). Also, obtain the normed pseudo-residuals $\hat\Psi_{i;l(k)} = \|X_i - \hat\mu_l\|$ for $X_i \in C_k$ and $l \ne k \in \{1, 2, \ldots, \hat K\}$.

(b) Using the set of normed residuals $\{\hat\Psi_i; i = 1, 2, \ldots, n\}$, obtain its RIG-kernel-estimated CDF using (9) with bandwidth determined as per (13).

(c) For any two groups $k \ne l \in \{1, 2, \ldots, \hat K\}$, estimate the pairwise overlap $\hat\omega_{lk} = \hat\omega_{l|k} + \hat\omega_{k|l}$, where $\hat\omega_{l|k}$ and $\hat\omega_{k|l}$ are calculated using (15) and (16). We obtain the estimated overlap matrix $\hat\Omega$ (with diagonal elements all equal to unity). For clarity, denote this overlap matrix as $\hat\Omega^{(1)}$ and the pairwise overlaps as $\hat\omega^{(1)}_{kl} \equiv \hat\omega^{(1)}_{C_k C_l}$.

(d) From the overlap matrix $\hat\Omega^{(1)}$, calculate the generalized overlap $\hat{\ddot\omega}$. Call it $\hat{\ddot\omega}^{(1)}$.

3. The merging phase: The merging phase is triggered only if some of the overlap measures between overlapping clusters are substantially larger than the others (operationally, if $4\hat{\ddot\omega}^{(1)} \le \check{\hat\omega}$, where $\check{\hat\omega}$ is the maximum of the estimated pairwise overlaps) or if $\hat{\ddot\omega}^{(1)}$ is not negligible (operationally, $\hat{\ddot\omega}^{(1)} \ge 10^{-5}$). In that case, this phase merges groups, provides pairwise overlap measures between newly-formed composite groups, the updated overlap matrix and the generalized overlap, continuing for as long as the generalized overlap keeps decreasing (by at least $10^{-5}$) or is not negligible. Specifically, this phase iteratively proceeds for ℓ = 1, 2, ... with the following steps:

(a) Merge the groups with the maximum overlap, along with every pair of groups whose individual pairwise overlap is substantially larger than the generalized overlap $\hat{\ddot\omega}^{(\ell)}$.
That is, merge every pair of groups Ck, Cl, k ≠ l, such that ˆω(ℓ)_{lk} = ˇˆω(ℓ) or ˆω(ℓ)_{lk} > κˆω(ℓ), for some κ as described in the comments below. Call the new merged group C_{min(k,l)} and decrease the index labels of the groups with indices greater than max(k, l). Decrement ˆK by 1 for every merged pair.

(b) Using (18), update the pairwise overlap measures that have changed as a result of the merges in Step 3a. Call the updated measures ˆω(ℓ+1)_{C_k C_l}. Obtain the updated overlap matrix (call it ˆΩ(ℓ+1)) and the updated generalized overlap ˆω(ℓ+1). Set ℓ ← ℓ + 1.

(c) The merging phase terminates if ˆω(ℓ) > ˆω(ℓ−1), ˆω(ℓ) ≈ 0, or ˆω(ℓ) = ˇˆω(ℓ). The terminating ˆK is the ˆC of (1).

4. Final clustering solution: The grouping {C1, C2, . . . , C_ˆC} at the end of the merging phase is the final partition of the dataset. This gives us a total of ˆC general-shaped groups in the dataset.

Comments: We provide some additional remarks on KNOB-SynC and relate it to other algorithms for finding general-shaped clusters and settings:

1. The k-means phase finds regular-structured (more specifically, homogeneous spherical) groups and, in this regard, is similar to the initial stages of K-mH (Peterson et al., 2018) and EAC (Fred and Jain, 2005). However, EAC repeats k-means with fixed K several times and is built upon the premise that each k-means run does not end up with the same clustering, especially when we do not have underlying homogeneous spherically-dispersed groups. On the other hand, K-mH uses a separability index built on Gaussian assumptions for each cluster and has a large number of user-specified parameters. KNOB-SynC uses nonparametric CDF estimation with a plug-in bandwidth selector and a naïve average estimator to calculate the overlap between spherically-dispersed groups, and a naïve estimator for the overlap between composite groups. Our methodology has one parameter (κ) that is chosen in a completely data-driven framework.
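One pass of the merging phase (Step 3a) can be sketched as below. This is an illustrative sketch, not the authors' code: it assumes, following Maitra (2010), that the generalized overlap is the normalized largest eigenvalue (λ_max(ˆΩ) − 1)/(K − 1) of the overlap matrix, and it uses a union-find structure so that several pairs can coalesce into one composite group in a single pass:

```python
import numpy as np

def generalized_overlap(Omega):
    """Generalized overlap of an overlap matrix with unit diagonal,
    assumed here to be (largest eigenvalue - 1) / (K - 1)."""
    K = Omega.shape[0]
    lam = np.linalg.eigvalsh(Omega)[-1]      # eigvalsh returns ascending order
    return (lam - 1.0) / (K - 1)

def merge_step(Omega, kappa=3.0):
    """One merging pass: merge every pair whose pairwise overlap equals
    the maximum or exceeds kappa times the generalized overlap.
    Returns the new group label for each of the K input groups."""
    K = Omega.shape[0]
    omega_gen = generalized_overlap(Omega)
    off = Omega - np.eye(K)                  # off-diagonal pairwise overlaps
    omega_max = off.max()
    parent = list(range(K))                  # union-find over group labels

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]    # path compression
            a = parent[a]
        return a

    for k in range(K):
        for l in range(k + 1, K):
            if off[k, l] == omega_max or off[k, l] > kappa * omega_gen:
                ra, rb = find(k), find(l)
                parent[max(ra, rb)] = min(ra, rb)
    labels = [find(k) for k in range(K)]
    relabel = {r: i for i, r in enumerate(sorted(set(labels)))}
    return [relabel[r] for r in labels]
```

In a full implementation this pass would be iterated, with the pairwise overlaps of the newly-formed composite groups updated as in Step 3b before re-checking the termination conditions of Step 3c.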
No parameter requires fine-tuning by the practitioner. Also, the number of general-structured groups is decided at a termination that is objectively declared whenever the generalized overlap vanishes or does not decrease further.

2. As with MMC, DEMP or DEMP+, the use of cluster distributions in the overlap calculations simplifies computations and keeps them practical even for large datasets. In contrast, EAC, DBSCAN, DBSCAN*, DP and K-mH require memory-intensive cross-tabulation of the entire dataset across multiple clusterings, because n × n frequency tables need to be calculated and/or stored.

3. KNOB-SynC uses a naïve estimator to update the overlap between composite groups, unlike DEMP+, which uses Monte Carlo simulations and is slower. Further, DEMP+ uses the maximum overlap, which is very sensitive to individual pairwise overlap measures, while KNOB-SynC uses the generalized overlap measure (Maitra, 2010) that provides a nonlinear summary of all the individual pairwise overlaps.

4. Unlike DEMP or DEMP+, the stopping criterion of KNOB-SynC is data-driven, thus allowing for the possibility of obtaining well-separated and less well-separated partitionings as supported by the data. Our algorithm also has the potential, unlike MMC, DEMP or DEMP+, to merge multiple pairs of groups in a step.

5. KNOB-SynC uses nonparametric CDF estimation but does so in univariate space, by exploiting the inherent spherically-dispersed structure (ellipsoidal in the case of clustering with the Mahalanobis distance) of the sub-clusters. Therefore, it has greater immunity against the curse of dimensionality that bedevils the multivariate density estimation used in algorithms such as DBSCAN and DP.

6. The parameter κ determines the types of composite groups that are formed. For larger values of κ, we have groups formed by merging a few pairs at each iteration, while smaller values of κ favor many simultaneous mergers. (For κ = ∞, no merging is possible.)
In the first case, we expect to have stringy groups, while in the second case we find clusters that are irregularly-shaped but less stringy. A data-driven approach to choosing κ, which we adopt, runs the algorithm with the different values κ = 1, 2, 3, 4, 5 and uses the final partitioning with the smallest terminating ˆω as the optimal clustering.

7. Unlike other syncytial clustering algorithms such as DEMP, DEMP+ or K-mH, KNOB-SynC allows for the possibility of multiple pairs of groups being merged at an iteration.

8. Our initial stage uses k-means for speed and efficiency, which also allows us to explore larger candidate values of K. However, the approach could very well have been used with clustering algorithms obtained using, say, the generalized Mahalanobis distance. The overlap calculations are then easily modified. To see this, suppose that the generalized Mahalanobis distance between two points x and y is given by $d_\Gamma(x, y) = (x - y)'\Gamma^-(x - y)$, where Γ is any appropriate nonnegative-definite matrix with (say, Moore-Penrose) inverse given by Γ⁻. (A positive-definite Γ leads to the usual Mahalanobis distance.) Under the generalized Mahalanobis distance framework, (14) reduces to

$$\omega_{l|k} = P\bigl(\lVert(\Gamma^-)^{1/2}(X - \mu_l)\rVert < \lVert(\Gamma^-)^{1/2}(X - \mu_k)\rVert \mid X \in k\text{th group}\bigr) = 1 - P\bigl(\lVert(\Gamma^-)^{1/2}(X - \mu_k)\rVert < \lVert(\Gamma^-)^{1/2}(X - \mu_l)\rVert \mid X \in k\text{th group}\bigr), \qquad (19)$$

which means that the problem reduces to the Euclidean case if we use what we here refer to as the normed Mahalanobis-free residuals (and pseudo-residuals). Operationally, this is equivalent to obtaining $\hat\varepsilon_i = (\Gamma^-)^{1/2}\bigl(X_i - \sum_{k=1}^{K}\hat\zeta_{ik}\hat\mu_k\bigr)$ and replacing the ˆϵi with ˆεi in the calculation of (5), proceeding as before. This framework also includes the case when we scale each variable before clustering, as happens when the features are on vastly different scales, or when we use principal components (PCs) as our clustering variables; we illustrate these scenarios in Sections 3.3.6 and 4.1.

9.
Our algorithm accommodates clustering scenarios in the presence of scatter, as provided, for instance, by the output of the k-clips algorithm of Maitra and Ramler (2009). Scatter observations are those that are unlike any other and may be considered as individual groups in their own right. KNOB-SynC incorporates these scatter observations as individual clusters in addition to the groups found from the output of the k-clips algorithm and proceeds with the overlap calculation and merging phases as described earlier in this section. We illustrate this scenario in Section 3.4.1.

10. Datasets often have incomplete records, with missing observations in some features. The km-means algorithm (Lithio and Maitra, 2018) provides a k-means-type algorithm for Euclidean-distance clustering in this setting. Then, instead of k-means, KNOB-SynC can incorporate results from km-means in the first stage. For the incompletely-observed records, we calculate the rescaled normed residual in the presence of missing information by removing the missing values from the calculation and re-weighting appropriately. Specifically, we calculate the ith rescaled normed residual as

$$\hat\Psi_i = \sqrt{\frac{p}{p_i}\sum_{l=1}^{p_i}\bigl(X_{ij_l} - \hat\mu_{kj_l}\bigr)^2}, \qquad (20)$$

where $X_{ij_1}, X_{ij_2}, \ldots, X_{ij_{p_i}}$ represent the pi available features for the ith record that has been assigned to the kth spherically-dispersed sub-group with estimated mean ˆµk. Similar arguments allow for the calculation of the rescaled normed pseudo-residual ˆΨ_{i;l(k)}. The use of the nonparametric CDF estimator in KNOB-SynC provides us with the flexibility to calculate the initial overlap estimates from these scaled-up normed residuals. The merging phase and termination criteria of our algorithm remain unchanged. Section 3.4.2 illustrates KNOB-SynC on a dataset with incomplete records.

11. The use of nonparametric methods in the overlap calculations means that a large number of methods may be used in the initial partitional phase.
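The rescaled normed residual for an incompletely-observed record can be sketched as below. This is an illustrative sketch under the reading of (20) in which the sum of squares over the pᵢ observed coordinates is scaled up by p/pᵢ before the square root; the function name is ours:

```python
import numpy as np

def rescaled_normed_residual(x, mu):
    """Rescaled normed residual for a record x (NaN marks missing
    features) assigned to a sub-group with estimated mean mu: the
    sum of squares over the p_i observed coordinates is scaled up
    by p / p_i before taking the square root."""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    obs = ~np.isnan(x)                     # observed coordinates j_1, ..., j_{p_i}
    p, p_i = x.size, int(obs.sum())
    ss = np.sum((x[obs] - mu[obs]) ** 2)   # sum over available features only
    return np.sqrt(ss * p / p_i)
```

For a completely-observed record (pᵢ = p) this reduces to the ordinary Euclidean normed residual.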
The method can also potentially be modified to apply to other kinds of datasets. For instance, the initial clustering can be done for categorical datasets using k-modes (Huang, 1997, 1998; Chaturvedi et al., 2001; Dorman and Maitra, 2020), and then the Generalized or Gaussianized Distributional Transform (Rüschendorf, 2013; Zhu et al., 2019) and copula models (Nelsen, 2006) can potentially be applied to each cluster to obtain numerical-valued residuals for use with our overlap estimation and calculations.

Having proposed our KNOB-SynC algorithm, we now illustrate and evaluate its performance in relation to a host of competing methods.

3. Performance Evaluations

We first illustrate the performance of KNOB-SynC on the 2D Aggregation dataset of Gionis et al. (2007) and then follow with more detailed performance evaluations on a large number of datasets commonly used to evaluate competing algorithms in the literature. These datasets range from two to many dimensions. We compare our method with a wide range of competitors. These rival methods are the syncytial clustering techniques of K-mH (Peterson et al., 2018), using author-supplied R code; MMC (Baudry et al., 2010), as implemented in the R package MixModCombi; DEMP (Hennig, 2010), using the R package fpc; and DEMP+ (Melnykov, 2016). We also evaluate performance with EAC (Fred and Jain, 2005) and GSL-NN (Stuetzle and Nugent, 2010), using publicly available author-supplied code. We also apply two common connectivity-based techniques, spectral clustering and kernel k-means. Both these methods need the number of groups to proceed: for spectral clustering, we decide this number to be the one with the highest gap in successive eigenvalues of the similarity matrix (von Luxburg, 2007).
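The eigengap rule just described can be sketched as below. This is a minimal illustration; the similarity function used to build the similarity matrix in the paper's experiments is not shown here and the function name is ours:

```python
import numpy as np

def eigengap_K(S, K_max=10):
    """Pick the number of spectral-clustering groups as the position of
    the largest gap in the successive (descending) eigenvalues of the
    symmetric similarity matrix S (von Luxburg, 2007)."""
    lam = np.sort(np.linalg.eigvalsh(S))[::-1][:K_max + 1]  # top eigenvalues
    gaps = lam[:-1] - lam[1:]              # gap after the K-th eigenvalue
    return int(np.argmax(gaps)) + 1        # K with the largest eigengap
```

For a similarity matrix that is nearly block-diagonal with C blocks, the top C eigenvalues are large and the rest are near zero, so the largest gap falls after the C-th eigenvalue.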
For kernel k-means, we set K to be the true value: we recognize that our evaluation of kernel k-means potentially provides this method with an unfair advantage; however, we proceed in this fashion in order to understand the best-case scenario of this competing method. Finally, we also compare our method's performance relative to DBSCAN* as implemented in the R package dbscan (Hahsler and Piekenbrock, 2018), DP clustering (Rodriguez and Laio, 2014) as implemented in the R package densityClust (Pedersen et al., 2017), PGMM (McNicholas and Murphy, 2008) using the R package pgmm (McNicholas et al., 2018), MSAL using the R package MixSAL (Franczak et al., 2018), and MGHD using the R package MixGHD (Tortora et al., 2019). Many of these algorithms have multiple parameters that need to be set; in our experiments, we use the default settings where guidance for choosing these parameters is not explicitly available. Also, the DBSCAN* and DP clustering algorithms identify scatter/outliers, which by definition are those observations that are unlike any other in the dataset, so we follow Maitra and Ramler (2009) in considering them as individual singleton clusters in our performance assessments.

Performance of each method is evaluated by Hubert and Arabie (1985)'s adjusted Rand index R, measured between the true partition and the estimated final partitioning. In general, R ≤ 1, with values closer to 1 indicating greater similarity between partitionings and good clustering performance. The index takes values farther from 1 as performance becomes poorer and is expected to take a value of zero for a random assignment. The index R can take arbitrarily negative values but, as very helpfully pointed out to us by a reviewer, the probability of observing substantially negative values of R is relatively small (see Steinley, 2004, for further discussion on the characteristics of this index).

3.1.
Illustrative Example: the Aggregation Dataset

The 2D Aggregation dataset (Gionis et al., 2007) has n = 788 observations from C = 7 groups of different characteristics. Figure 1 displays the results of the different phases and iterations of KNOB-SynC.

[Figure 1: Illustrating all three stages of the KNOB-SynC algorithm on the Aggregation dataset: results of (a) the k-means phase (ˆK = 14), (b) the first merging phase (ˆC = 8) and (c) the second (and final) merging phase (ˆC = 7). In these and all such subsequent figures, plotting character denotes true class membership while color indicates estimated class membership. (d)-(f) Estimated pairwise nonparametric overlap values corresponding to the partitions in (a), (b) and (c), with generalized overlaps ˆω = 0.0013, 0.0008 and 10⁻⁶, respectively.]

We display the stages of KNOB-SynC for κ = 1, which is where we have the lowest terminating ˆω (from among the candidate values of κ) for this example. The k-means phase of our algorithm identifies 14 clusters, with partitioning as in Figure 1a, and estimates the initial overlap matrix ˆΩ to be as in Figure 1d. The first merging phase yields the partitioning in Figure 1b with the updated ˆΩ of Figure 1e. The next merging phase combines only one pair of groups and is terminal, resulting in the final partitioning of the dataset as in Figure 1c. The overlap matrix (Figure 1f) indicates well-separated clusters, with only six mislabeled observations relative to the truth, and an R of 0.98 between the true and estimated classifications.
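The adjusted Rand index used throughout these comparisons can be computed from the contingency table of the two partitions; the following is the standard Hubert and Arabie (1985) construction, sketched here for completeness rather than taken from the paper's code:

```python
import numpy as np

def adjusted_rand_index(true, est):
    """Adjusted Rand index between two partitions: the Rand index
    corrected for chance agreement under a hypergeometric model."""
    true, est = np.asarray(true), np.asarray(est)
    cats_t, cats_e = np.unique(true), np.unique(est)
    n = true.size
    # contingency table n_ij of joint class/cluster counts
    tab = np.array([[np.sum((true == a) & (est == b)) for b in cats_e]
                    for a in cats_t])
    comb2 = lambda x: x * (x - 1) / 2.0    # "choose 2" applied elementwise
    sum_ij = comb2(tab).sum()
    sum_i = comb2(tab.sum(axis=1)).sum()   # row-margin term
    sum_j = comb2(tab.sum(axis=0)).sum()   # column-margin term
    expected = sum_i * sum_j / comb2(n)
    max_index = 0.5 * (sum_i + sum_j)
    return (sum_ij - expected) / (max_index - expected)
```

Identical partitions (up to label permutation) score 1, and a random assignment is expected to score near 0.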
[Figure 2: Clustering performance of each algorithm on the Aggregation dataset: True grouping; KNOB-SynC, R = 0.98; K-mH, R = 0.95; DEMP, R = 0.91; DEMP+, R = 0.91; MMC, R = 0.80; EAC, R = 0.90; GSL-NN, R = 0.81; Spectral Clustering, R = 0.59; Kernel k-means, R = 0.24; DBSCAN*, R = 0.58; Density Peaks, R = 0.26; PGMM, R = 0.64; MSAL, R = 0.82; MGHD, R = 0.47.]

The competing methods (Figure 2) all perform marginally to substantially worse. K-mH is the second-best performer (R = 0.95), finding ˆC = 9 groups but breaking the top-right cluster into two and also grouping a few other stray observations. Both DEMP and DEMP+ yield the same result (R = 0.91, ˆC = 6), but MMC (R = 0.80, ˆC = 8) has trouble with the largest group, splitting it into two sub-groups.
EAC breaks the top central and the large right-hand groups into many clusters, resulting in ˆC = 14 but R = 0.90. Thus, in spite of identifying a large number of groups, EAC is able to capture a fair bit of the complex group structure of this dataset. GSL-NN cannot distinguish between the groups on the right and also finds many other small groups elsewhere, ending with ˆC = 12 groups and R = 0.81. The performance of spectral clustering is worse: it finds ˆC = 12 groups and has an R = 0.59 with the true classification. Despite being provided with the true C = 7, kernel k-means, with R = 0.24, is the worst performer in this example, with DP (R = 0.26, ˆC = 26) only marginally better. DBSCAN*, at R = 0.58 and ˆC = 277, correctly finds the large circular group and one of the smallest groups, but the observations in the top-left group are almost all classified as outliers/scatter. Among the MBC methods for general-shaped clusters, MSAL, at R = 0.82, is the best performer, finding ˆC = 7 groups but having trouble with the larger group at the bottom. PGMM finds ˆC = 12 groups (R = 0.64), with the larger groups split further, while the worst-performing MBC method is MGHD (R = 0.47, ˆC = 15).

3.2. Additional 2D Experiments

3.2.1. Experimental Framework

Figure 3 displays the 12 additional 2D datasets used to evaluate the performance of KNOB-SynC and its competitors. Barring the first example, all these datasets have been used by other authors to demonstrate and evaluate the performance of their methods. The groups in these datasets have structure ranging from the regular (e.g., the dataset of 7 spherically-dispersed Gaussian clusters that is modeled on a similar example in Maitra, 2009a, and where sophisticated methods like KNOB-SynC are superfluous) to the widely-varying and complex. The Banana-Arcs dataset has n = 4515 observations clumped in four banana-shaped structures arced around each other.
The Banana-Clump and Bullseye datasets are from Stuetzle and Nugent (2010): the former has 200 observations, with one spherical group and another arced around it on the left like a banana, while the latter has 400 observations grouped, as its name implies, as a bullseye. The more complex-structured Bullseye-Cigarette dataset (Peterson et al., 2018) has three concentric-ringed groups, two elongated groups above two spherical groups on the left, and another group that is actually a superset of two overlapping spherical groups (n = 3025 and C = 8). The Compound dataset (Zahn, 1971) is very complex-structured, with n = 399 observations in C = 6 groups that are not just varied in shape but include a group that sits atop another on the right. The Half-ringed clusters dataset (Jain and Law, 2005) has 373 observations in two arc-shaped clusters, one of which is dense and the other very sparsely populated. The Path-based dataset (Chang and Yeung, 2008) has 300 observations in three groups, two of which are regular-shaped and surrounded by a widely-arcing third group. The Spiral dataset (Chang and Yeung, 2008) has 312 observations in three spiral groups that are very difficult for standard clustering algorithms to recover accurately. The SSS dataset has 5015 observations in three S-shaped groups of varying density and orientation, while the XXXX dataset has n = 415 observations distributed in four cross-shaped structures.
[Figure 3: Shape datasets used in the two-dimensional performance evaluations: 7-Spherical, Banana-Clump, Bananas-Arcs, Bullseye, Bullseye-Cigarette, Compound, Half-ringed clusters, Path-based, SCX-Bananas, Spiral, SSS and XXXX.]

3.2.2. Results

Figure 4 and Table 8 summarize the performance of all methods on the 2D experimental datasets. Detailed displays of the different methods on individual datasets are in Appendix B. The summaries indicate across-the-board good performance of KNOB-SynC, with it always being a top performer. In its worst case, KNOB-SynC gets R = 0.55 (on the Path-based dataset), where it terminates early (Figure 18), but even here it is the fourth-best performer, behind spectral clustering (R = 0.72), MGHD (R = 0.60) and PGMM (R = 0.59). The competing syncytial clustering methods do well in some cases, but not in others where other methods perform better. Among the syncytial clustering methods, K-mH performs better than DEMP, DEMP+ and MMC, whose performance can sometimes be poor (e.g., on the Bullseye, Half-ringed clusters and Spiral datasets; vide Figures 14, 17 and 20). It is on these datasets that the other methods (EAC, GSL-NN, and spectral clustering) do better.
The performance of kernel k-means, even with the known true number of groups, is varied, being very good sometimes (e.g., on the Banana-Clump dataset of Figure 13) but very poor in other cases (e.g., as seen before on the Aggregation dataset) where almost every other method does well.

[Figure 4: Performance, (a) by dataset and (b) by method, of KNOB-SynC (abbreviation: KNS), K-mH, DEMP (DM), DEMP+ (DM+), MMC, EAC, GSL-NN (GSN), spectral clustering (SpC), kernel k-means (kk-m), DBSCAN*, DP, PGMM, MSAL and MGHD on the 2D datasets, in terms of the adjusted Rand index.]

The three general MBC methods perform similarly, but PGMM does a bit better than MSAL and MGHD. DBSCAN* and, especially, DP generally perform poorly; however, DBSCAN* performs well in some cases. We quantify the performance of each method against its competitors in terms of its average deviation from the best performer. Specifically, for each dataset, we compute the deviation (or difference) in R of a method from that of the best performer for that dataset. The average deviation (D̄) of a method over all datasets is an overall indicator of its performance. Table 8 provides D̄ and the standard deviation (Dσ) of these differences. On the average, KNOB-SynC is the best performer (and with the lowest Dσ), followed by GSL-NN, K-mH and EAC. This conclusion of KNOB-SynC's superior overall performance is also supported by Figure 4(b). We surmise that KNOB-SynC does well across the different datasets because of its ability, by construction, to merge many or few components at a time, with the exact choice of merges and the termination objectively determined by the distinctiveness of the resulting partitioning as per ˆω.

3.3.
Higher-Dimensional Datasets

We also study the performance of KNOB-SynC and its competitors on higher-dimensional datasets. These datasets are modest- to higher-dimensional, with between 173 and 10993 records. For the higher-dimensional datasets (i.e., those with non-redundant dimension greater than 10), we find that all methods other than GSL-NN generally perform better when used on the first few (m) kernel principal components (KPCs) rather than on the raw data. For these methods and these datasets, we use the first m KPCs of each dataset, with m chosen as the first time after which the increases in the eigenvalues corresponding to successive KPCs are below 0.5%. GSL-NN is implemented on the original datasets. Further, KNOB-SynC, K-mH and EAC are built on k-means, whose results depend on the scale of the features. So, for these methods, we scaled each feature by its SD prior to analysis unless the features were all collected on a similar scale, as with the E. coli example of Section 3.3.2 or the log-transformed GRB dataset of Section 4.1. (Following the usual rule-of-thumb in multivariate statistics, we consider features to be on similar scales if the most variable feature has SD no more than four times that of the feature with the lowest variability.) Each dataset, and the performance of each method on it, is first described individually. A comprehensive summary of the performance of each method on each dataset follows in Section 3.3.11.

3.3.1. Simplex-7 Gaussian Clusters

This dataset, from Stuetzle and Nugent (2010), consists of Gaussian realizations of sizes 50, 60, 70, 80, 90, 100 and 110 from seven clusters with means set at the vertices of the seven-dimensional unit simplex and homogeneous spherical dispersions with common SD of 0.25 in each dimension. Like the 7-Spherical dataset, this dataset exemplifies a case where standard methods such as k-means or Gaussian MBC should be adequate.
Therefore, it is a test of whether our algorithm and its competitors are able to refrain from identifying spurious complexity. All methods, except for EAC, DBSCAN* and DP, identify seven groups and have good clustering performance. In particular, the syncytial methods have very good performance (R = 0.97); the other methods have R ∈ [0.92, 0.98]. EAC still performs well (R = 0.94) but finds ˆC = 6 groups. DBSCAN* is the worst performer (R = 0.03) on this dataset, finding many outliers ( ˆC = 508). DP's performance is middling, at R = 0.58 and with many outliers ( ˆC = 79).

3.3.2. E. coli Protein Localization

The E. coli dataset, publicly available from the University of California Irvine Machine Learning Repository (UCIMLR) (Newman et al., 1998), concerns the identification of protein localization sites for the E. coli bacterium (Nakai and Kanehisa, 1991). There are eight protein localization sites: cytoplasm, inner membrane without signal sequence, periplasm, inner membrane with an uncleavable signal sequence, outer membrane, outer membrane lipoprotein, inner membrane lipoprotein, and inner membrane with a cleavable signal sequence. Identifying these sites is an important early step for finding remedies (Nakai and Kanehisa, 1991). Each protein sequence has a number of numerical attributes; see Horton and Nakai (1996) for a listing and their detailed description. Two attributes are binary, but 326 of the 336 sequences have common values for these attributes. We restrict our investigation to these sequences and drop the two binary attributes from our list of variables. These 326 sequences have no representation from the inner membrane or outer membrane lipoproteins. Additionally, we also drop two sequences because they are the lone representatives from the inner membrane with cleavable signal sequence site (Maitra, 2002). Therefore, we have n = 324 observations from C = 5 true classes. KNOB-SynC identifies ˆC = 9 groups.
Table 1 presents the confusion matrix containing the number of times a protein from a localization site is assigned to each KNOB-SynC group. Ignoring stray assignments, the sites are fairly well-defined in the first four groups, with R = 0.72. Uncleavable signal sequences from the inner membrane site are difficult to distinguish from those that are also from there but have no signal sequence. Sequences from the other sites are better-clarified.

Table 1: Confusion matrix of the KNOB-SynC groups against the true E. coli localization sites.

                                                  1    2    3    4    5    6    7    8    9
  cytoplasm                                     138    0    0    4    0    0    0    0    1
  inner membrane, no signal sequence              7   62    0    0    0    3    3    1    0
  inner membrane, uncleavable signal sequence     1   32    0    0    0    0    0    1    0
  outer membrane                                  0    0   17    2    0    0    0    0    0
  periplasm                                       3    1    2   38    8    0    0    0    0

Among the alternative methods, EAC does slightly better (R = 0.77) but identifies 10 groups. The remaining methods all do slightly to substantially worse. DEMP, DEMP+ and K-mH each identify four groups, but with R ∈ [0.63, 0.70]. K-mH finds only two groups (R = 0.41), while the rest find more groups but disagree more strongly with the true localizations. DBSCAN* finds a large number of groups ( ˆC = 174) with relatively poor performance (R = 0.31). DP is marginally better (R = 0.39, ˆC = 12), while PGMM (R = 0.48) and MGHD (R = 0.60) improve on DP, each finding 8 groups. (MSAL did not converge to a solution.) Overall, EAC and KNOB-SynC are the top two performers, with DEMP and DEMP+ close behind.

3.3.3. Standard Wine Recognition

The standard wine recognition dataset (Forina et al., 1988; S. Aeberhard and de Vel, 1992), also available from the UCIMLR, contains p = 13 measurements on n = 178 wine samples obtained from their chemical analysis. There are 59, 71 and 48 wines of the Barolo, Grignolino and Barbera cultivars, so C = 3. Because p > 10 here, we use m = 17 KPCs.
KNOB-SynC is the best performer, finding ˆC = 3 groups with a clustering performance of R = 0.92. The first group contains all 59 wines from the Barolo cultivar and 2 Grignolino wines. The second group contains 66 wines, all exclusively from the Grignolino cultivar. The third group has 2 Grignolino and 48 Barbera wines. Thus, there is very good definition among the KNOB-SynC groups. On the other hand, only MMC (R = 0.67, ˆC = 5), K-mH (R = 0.62, ˆC = 6), PGMM (R = 0.66, ˆC = 5) and EAC (R = 0.60, ˆC = 9) perform modestly, while the others are substantially worse, with DBSCAN*, in particular, classifying all observations as outliers, resulting in ˆC = 178 and R = 0.

3.3.4. Extended Wine Recognition

A reviewer very helpfully pointed out that the dataset used in Section 3.3.3 is actually a reduced variant, and that a fuller version of the dataset with 27 variables is available in the R package pgmm. We used m = 26 KPCs in our experimental evaluations on this larger dataset. DEMP and DEMP+ (R = 1, ˆC = 3) show perfect classification, while MGHD (R = 0.95, ˆC = 3) and MMC (R = 0.91, ˆC = 4) also perform well. KNOB-SynC is a top performer and the best among the distribution-free methods, finding ˆC = 3 groups with a clustering performance of R = 0.93. The first group here has 58 Barolo and 2 Grignolino wines. The second group contains 68 wines from the Grignolino cultivar and the one Barolo wine that was not placed in the first group. The third group has the 48 Barbera wines and the one remaining Grignolino wine. Similar to the 13-dimensional case, we get good definition among the groups. The other methods not discussed here do moderately to substantially worse.

3.3.5. Olive Oils

The olive oils dataset (Forina and Tiscornia, 1982; Forina et al., 1983) has measurements on 8 chemical components for 572 samples of olive oil taken from 9 different areas of Italy that are in three regions: Sardinia, Northern Italy and Southern Italy.
This is an interesting dataset with sub-classes (areas) inside classes (regions). Indeed, Peterson et al. (2018) were able to identify, with one misclassification, sub-groups within the regions but not the areas (R = 0.67, ˆC = 11; we, however, get R = 0.56, ˆC = 8 using the authors' supplied code); they surmised that it may be more possible to identify characteristics of olive oils based on regions defined by physical geography rather than on areas demarcated by political geography. We therefore analyze performance on this dataset both in terms of how regions and how areas are recovered. KNOB-SynC identifies ˆC = 4 groups (Table 2), with oils from the Sardinian and Northern regions correctly classified into two of the groups. The Southern region oils are split into the two remaining groups, one containing all but 2 of the 25 North Apulian samples and 6 of the 36 Sicilian samples, and the other containing all the remaining Southern oils.

Table 2: Confusion matrix of the KNOB-SynC grouping of the Olive Oils dataset.

  Region    Area               1     2     3     4
  Sardinia  Coast-Sardinia    33     0     0     0
            Inland-Sardinia   65     0     0     0
  North     East-Liguria       0     0    50     0
            West-Liguria       0     0    50     0
            Umbria             0     0    51     0
  South     Calabria           0    56     0     0
            North-Apulia       0     2     0    23
            South-Apulia       0   206     0     0
            Sicily             0    30     0     6

In terms of clustering performance, KNOB-SynC gets R = 0.55 when compared to the true areal grouping but R = 0.87 when compared to the true regional grouping. For this dataset, DEMP (R = 0.85, ˆC = 7), DEMP+ (R = 0.82, ˆC = 12) and MSAL (R = 0.70, ˆC = 9) are the top performers with respect to the true areal grouping. The remaining methods all have middling performance. When compared with the true regional grouping, KNOB-SynC is by far the best performer. Overall, the clustering performance of KNOB-SynC for the regional grouping marginally trumps the performance of DEMP for the areal grouping, and so it may be considered to be more accurate in uncovering the group structure in the dataset.

3.3.6.
Image Segmentation

The image segmentation dataset, also available from the UCIMLR, consists of 19 attributes of the scene in each 3×3 image, manually classified as BRICKFACE, CEMENT, FOLIAGE, GRASS, PATH, SKY or WINDOW. (Thus, C = 7.) We combine the training and test datasets to obtain 330 instances of each scene, so n = 2310. There is a lot of redundancy in the attributes, so we reduce the dataset to 8 PCs that together explain at least 99.9% of the total variance in the dataset. The PCs are obtained from the correlation matrix because the 19 attributes have vastly different scales. The KNOB-SynC solution finds Ĉ = 12 clusters, with R = 0.55. The confusion matrix (Table 3) indicates that the SKY images are perfectly identified, while GRASS and, to a lesser extent, PATH and CEMENT are fairly well-identified. On the other hand, the partitioning struggles to distinguish between BRICKFACE, FOLIAGE and WINDOW.

Table 3: Confusion matrix of the KNOB-SynC grouping against the true grouping for the Image Segmentation dataset.

              1    2    3    4    5    6    7    8    9   10   11   12
BRICKFACE   330    0    0    0    0    0    0    0    0    0    0    0
CEMENT       42  257    0    4    0   27    0    0    0    0    0    0
FOLIAGE     300    5    0    0    0    5    2    7    3    1    3    4
GRASS         1    0  327    0    0    2    0    0    0    0    0    0
PATH          0    0    0  269    0   61    0    0    0    0    0    0
SKY           0    0    0    0  330    0    0    0    0    0    0    0
WINDOW      309   13    0    0    0    8    0    0    0    0    0    0

Among the other methods, only EAC (R = 0.59, Ĉ = 40) and MGHD (R = 0.56, Ĉ = 7) are modestly to marginally better than KNOB-SynC. Inspection of the EAC grouping indicates many small groups but also difficulty in separating FOLIAGE and WINDOW, which are placed together in one group. Further, BRICKFACE is split into five groups, four of which are predominantly of this kind, but the fifth of which is unable to distinguish 146 observations of BRICKFACE from 32, 52 and 62 observations of CEMENT, FOLIAGE and WINDOW, respectively.
The other methods all perform moderately to substantially worse (Table 9), with PGMM, MSAL and DBSCAN* unable to find clustering solutions.

3.3.7. Yeast Protein Localization

The yeast protein localization dataset (Nakai, 1996), also obtained from the UCIMLR, was used by Melnykov (2016) to illustrate the application of DEMP+. This dataset concerns the localization of yeast proteins into one of C = 10 sites and has two attributes (presence of the HDEL substring, and peroxisomal targeting signal in the C-terminus) that are essentially binary and trinary. Following Melnykov (2016), we drop these variables and use the other p = 6 variables, namely signal sequence recognition scores based on (a) McGeoch's and (b) von Heijne's methods, (c) the ALOM membrane spanning region prediction score, discriminant analysis scores of the amino acid content of (d) the N-terminal region (20 residues long) of mitochondrial and non-mitochondrial proteins and (e) vacuolar and extracellular proteins, and (f) discriminant scores of nuclear localization signals of nuclear and non-nuclear proteins. For this dataset, all methods perform poorly. KNOB-SynC (R = 0.226, Ĉ = 7) is the best performer; the other clustering methods essentially randomly allocate observations. Surprisingly, DEMP+ (Ĉ = 6, R = 0.01) performs very poorly. (Melnykov, 2016, only used the first five of our variables to illustrate the DEMP+ method: we find no appreciable improvement even then, with Ĉ = 7 and R = 0.009. Personal queries to the author did not successfully resolve this discrepancy.) It appears therefore that the yeast protein localization dataset may be difficult to partition accurately in a completely unsupervised framework.

3.3.8. Acute Lymphoblastic Leukemia

The Acute Lymphoblastic Leukemia (ALL) training dataset of Yeoh et al. (2002) was used by Stuetzle and Nugent (2010) to illustrate GSL-NN in a high-dimensional, small-sample-size framework.
We use the standardized dataset in Stuetzle and Nugent (2010) that measured the oligonucleotide expression levels of the 1000 highest-varying genes in 215 patients suffering from one of seven leukemia subtypes, namely T-ALL, E2A-PBX1, BCR-ABL, TEL-AML1, MLL rearrangement, Hyperdiploid > 50 chromosomes, or an unknown category labeled OTHER. Some subtypes have very few cases: for instance, only 9, 14 and 18 patients are of types BCR-ABL, MLL and E2A-PBX1, respectively. For this dataset, we use m = 42 KPCs for all methods but GSL-NN. The k-means stage of KNOB-SynC identifies six groups, none of which are merged in the merging phase, resulting in the best partitioning among all competing methods. Table 4 presents the confusion matrix containing the number of cases of each leukemia subtype assigned to each KNOB-SynC group.

Table 4: Confusion matrix of the KNOB-SynC grouping against the true leukemia subtypes for the ALL dataset.

                      1    2    3    4    5    6
Hyperdiploid > 50     0    0    2   35    5    0
E2A-PBX1              0   17    0    0    1    0
BCR-ABL               0    0    2    1    6    0
TEL-AML1              4    2    8    2   36    0
MLL                   0    3   10    0    1    0
T-ALL                 0    0    1    0    0   27
OTHER                51    0    0    0    1    0

We see that most leukemia subtypes are distinctively identified in the KNOB-SynC solution. The alternative methods perform mildly to substantially worse, with PGMM, spectral clustering, GSL-NN and DP having clustering solutions (R = 0.61, 0.55, 0.54 and 0.53, respectively) that are the next best after KNOB-SynC. Other methods generally do poorly, with MSAL and MGHD unable to find solutions, while DBSCAN* classifies all observations as outliers.

3.3.9. Zipcode Images

The zipcode images dataset (Stuetzle and Nugent, 2010) consists of n = 2000 16×16 images of handwritten Hindu-Arabic numerals and is our second higher-dimensional example. As in the ALL dataset of Section 3.3.8, we normalize the observations to have zero mean and unit variance so that the Euclidean distance between any two normalized images is negatively and linearly related to the correlation between their pixels.
We extract and use the first m = 33 KPCs for all algorithms but GSL-NN. KNOB-SynC identifies 9 groups and has the best clustering performance (R = 0.76). DP is the second-best performer (R = 0.58) but finds Ĉ = 53 groups (including singletons), followed by MGHD (R = 0.56, Ĉ = 6), K-mH (R = 0.55, Ĉ = 22), and GSL-NN and spectral clustering (both with R = 0.54 but Ĉ = 7 and 23, respectively). The other methods all perform moderately to substantially worse. Figure 5 displays the 9 KNOB-SynC groups.

Figure 5: KNOB-SynC groups, with colormap indicating group, of the Zipcode dataset.

While misclassifications abound in almost all groups, there is good agreement, with 0, 1, 2, the leaner 8s and (to a lesser extent) 3 and 6 largely correctly identified. The digit 2 is placed in two groups, of the leaner and the rounded versions. The group where 3 predominates also has some 5s and 8s, but the categorization makes visual sense. Another group is composed largely of 4s, 7s and 9s, but that placement also appears visually explainable. Clearer and straighter 7s and 9s are placed in a separate group. Our partitioning finds it harder to distinguish between 5 and 6, but here also the commonality of the strokes in the digits assigned to this group explains the categorization. Thus we see that KNOB-SynC is not only the best performer for this dataset but also provides interpretable results. We comment that our application of all methods to this dataset has been entirely unsupervised: methodologies that also account for spatial context and pixel neighborhoods may further improve the grouping but are outside the purview of this paper.

3.3.10. Handwritten Pen-digits

The Handwritten Pen-digits dataset (Alimoglu, 1996; Alimoglu and Alpaydin, 1996), available at the UCIMLR, is a larger dataset that has 16 attributes from 250 handwritten samples

Table 5: Confusion matrix for the Handwritten Pen-digits dataset.
     0     1     2     3     4     5     6     7     8     9    11    12    13    14    15
0  1099     1     0     0    19     0    21     0     0     2     0     1     0     0     0
1     0   657   358    34     1     0     2     2     0    89     0     0     0     0     0
2     0     2  1141     0     0     0     0     0     0     1     0     0     0     0     0
3     0     4     2  1046     1     0     0     0     0     2     0     0     0     0     0
4     0     5     1     2  1118     0     1     0     0    17     0     0     0     0     0
5     0     1     0   252     0   625     0     0     2   175     0     0     0     0     0
6     0     0     1     0     0     1  1054     0     0     0     0     0     0     0     0
7     0   144     5     2     0     0     0   914     0     0     0     0    77     0     0
8     4     0     0     3     0     1     0     1   461     0   139   321    48    24    53
9    24     9     0    72     3     0     0     0     1   714     0     0     0   232     0

of 30 writers. (There are n = 10992 records because eight samples are unavailable.) We use m = 18 KPCs in our analysis (Peterson et al., 2018, used the first 7 PCs and got R = 0.64 and Ĉ = 24). KNOB-SynC finds Ĉ = 15 groups and is the best performer (R = 0.723). It separates the digits 0, 2, 3, 4, 6 and, to a lesser extent, 7 fairly well, but identifying 1, 5 and 9 is a bit more challenging (Table 5). It also identifies multiple types of 8. MGHD finds the correct number of groups and is the next-best performer (R = 0.67). The other methods perform moderately to substantially worse, with MSAL unable to find a clustering solution.

3.3.11. Summary of Performance

Figure 6: Overall performance (adjusted Rand index) of all competing methods on all higher-dimensional datasets: (a) performance by dataset; (b) performance by method. Abbreviations are as in Figure 4.

Figure 6 and Table 9 summarize the performance of all methods on the higher-dimensional experiments. As in the 2D case, KNOB-SynC is almost always among the top performers for high-dimensional datasets. Indeed, KNOB-SynC has the lowest average difference in R from that of the best-performing method over all datasets (Table 9). The other methods generally perform worse, with EAC, PGMM and kernel-k-means (with the true number of groups) among the better ones. Thus, the results of our experiments on real and synthetic datasets indicate good performance of KNOB-SynC relative to its competitors.

3.4.
Extensions of KNOB-SynC

As indicated in Section 2, the development of our syncytial clustering methodology is based on the nonparametric estimation of the CDF of the residuals and so can be applied in other scenarios. We explore the performance of our methodology in two such settings.

3.4.1. KNOB-SynC in the Presence of Scatter

Maitra and Ramler (2009) provided the k-clips algorithm for k-means clustering in the presence of scatter, that is, observations that are unlike any other in the dataset. Our KNOB-SynC methodology and software readily incorporate k-clips results by replacing the k-means phase with that algorithm and proceeding with the scatter points included as individual singleton clusters. We illustrate our methodology on the first 100 images of the Olivetti faces database (Samaria and Harter, 1994) that were used by Rodriguez and Laio (2014) to illustrate their DP algorithm. The 100 images under our consideration comprise 10 faces each of 10 individuals, taken at different angles and under different lighting conditions. Therefore, each individual can be considered a group whose members are that person's 10 images. Each 112×92 image has a total of 10,304 pixels, so we use the first 37 KPCs. While this application does not have any true scatter points, we use it to illustrate KNOB-SynC with k-clips because it was used by Rodriguez and Laio (2014) to showcase DP, which finds scatter (outliers, in their parlance) in addition to clusters. The k-clips algorithm with the default Bayesian Information Criterion (BIC) (Schwarz, 1978) finds only two well-defined homogeneous spherical clusters and 68 scatter points. We use the trace of the within-sums-of-squares-and-products matrix, rather than its determinant (Maitra and Ramler, 2009), in our objective function in order to satisfy the condition of homogeneous spherical clusters around which our base KNOB-SynC algorithm is built. Thus, we have a total of 70 initial groups.
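The merging phase that these initial groups are fed into follows the rule described earlier: repeatedly merge the pair of groups with the highest estimated pairwise overlap, accepting a merge only while the estimated generalized overlap decreases. The control flow can be sketched as below, with several loudly-flagged stand-ins: the pairwise overlaps are supplied rather than kernel-estimated, the generalized overlap is approximated by the mean pairwise overlap (the paper uses the eigenvalue-based measure of Maitra and Melnykov, 2010), and a merged group's overlap with a third group is pooled as the larger of the two constituent overlaps.

```python
# Toy sketch of the syncytial merging rule; overlaps and the pooling rule
# are illustrative stand-ins, not the paper's kernel-based estimators.

def generalized_overlap(groups, omega):
    """Mean pairwise overlap over the current groups (a simple stand-in for
    the eigenvalue-based generalized overlap of Maitra and Melnykov, 2010)."""
    pairs = [(g, h) for i, g in enumerate(groups) for h in groups[i + 1:]]
    return sum(omega[frozenset(p)] for p in pairs) / len(pairs)

def merge_step(groups, omega):
    """Tentatively merge the pair of groups with the highest overlap."""
    g, h = max(((g, h) for i, g in enumerate(groups) for h in groups[i + 1:]),
               key=lambda p: omega[frozenset(p)])
    merged, kept = g + "+" + h, [k for k in groups if k not in (g, h)]
    new_omega = {frozenset((a, b)): omega[frozenset((a, b))]
                 for i, a in enumerate(kept) for b in kept[i + 1:]}
    for k in kept:  # pooled overlap of the merged group with each survivor
        new_omega[frozenset((merged, k))] = max(omega[frozenset((g, k))],
                                                omega[frozenset((h, k))])
    return kept + [merged], new_omega

def syncytial_merge(groups, omega):
    """Accept merges only while the generalized overlap strictly decreases."""
    while len(groups) > 2:
        cand_groups, cand_omega = merge_step(groups, omega)
        if generalized_overlap(cand_groups, cand_omega) >= \
                generalized_overlap(groups, omega):
            break
        groups, omega = cand_groups, cand_omega
    return groups
```

With one strongly overlapping pair among four groups, only that pair is merged before the stopping rule fires, mirroring how the data-driven criterion halts merging without user tuning.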
KNOB-SynC's merging phase ends with 9 large groups, 5 small groups and 1 scatter observation (so Ĉ = 16) and R = 0.902. The results are displayed in Table 6 and Figure 7; for comparison, the latter also displays the results reported in Rodriguez and Laio (2014), which found 9 clusters and 62 scatter points, resulting in Ĉ = 71 and R = 0.22. (The figure displays images assigned to a group by means of a distinctive sequential palette. Because there are not enough colors to also identify each scatter point with its own sequential palette, we use an individual randomized nominal palette for each scatter assignment.) KNOB-SynC identifies images from six individuals (Persons 2, 4, 6, 7, 8 and 9) perfectly, and the first, third and tenth individuals nearly so. The fifth individual is split into 5 smaller groups, including the case where one image is grouped together with the one misclassified image of the tenth person. The performance of our algorithm far exceeds that reported in Rodriguez and Laio (2014).

Table 6: Confusion matrix of the KNOB-SynC results for the Olivetti faces dataset.

             Assigned Groups
Individual    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
 1            9    0    0    0    0    0    0    0    0    0    0    0    1    0    0    0
 2            0   10    0    0    0    0    0    0    0    0    0    0    0    0    0    0
 3            0    0    8    0    0    0    0    0    0    0    2    0    0    0    0    0
 4            0    0    0   10    0    0    0    0    0    0    0    0    0    0    0    0
 5            0    0    0    0    3    0    0    0    0    0    0    2    0    2    1    2
 6            0    0    0    0    0   10    0    0    0    0    0    0    0    0    0    0
 7            0    0    0    0    0    0   10    0    0    0    0    0    0    0    0    0
 8            0    0    0    0    0    0    0   10    0    0    0    0    0    0    0    0
 9            0    0    0    0    0    0    0    0   10    0    0    0    0    0    0    0
10            0    0    0    0    0    0    0    0    0    9    0    0    0    0    1    0

Figure 7: Clusters of the first 100 images in the Olivetti database obtained by (a) KNOB-SynC (R = 0.90) and (b) DP (R = 0.22) as reported in Rodriguez and Laio (2014). Each group is represented by its own distinctive sequential palette. Scatter observations (i.e. singleton groups) are represented by individual randomized nominal palettes.
We note that we used the first 37 KPCs with our KNOB-SynC algorithm, while Rodriguez and Laio (2014) used the original images with the similarity metric of Sampat et al. (2009). Using DP (Ĉ = 83, R = 0.06) or DBSCAN* (Ĉ = 100, R = 0) with Euclidean similarity on the 37 KPCs gave us worse results.

3.4.2. KNOB-SynC with Incomplete Records

We now illustrate a scenario where KNOB-SynC is applied to a dataset with incomplete records. In this example, we replace the k-means phase with the km-means algorithm of Lithio and Maitra (2018), which modifies k-means to account for incomplete records. The authors also developed a modified jump statistic to select the number of groups. The km-means results are input into the merging phase of KNOB-SynC and the algorithm proceeds as usual. We illustrate our methodology on a subset (Wagstaff, 2004) of the Sloan Digital Sky Survey (SDSS) dataset that measures five features, namely brightness (psfCounts), size (petroRads), texture, and two measures of shape (M_e1 and M_e2, which we refer to as Shape1 and Shape2 in our analysis), on 1220 galaxies and 287 stars. Thus the true C = 2 and n = 1507. The dataset has some missing values for the shape measures of 42 galaxies. The km-means algorithm with the modified jump statistic of Lithio and Maitra (2018) finds K0 = 46 homogeneous spherically-dispersed groups. The initial overlap calculations of Step 2 of our algorithm yield ω̂ = 0.0297 and ω̌ = 0.381. The merging phase is triggered, and terminates with Ĉ = 4 groups.

Figure 8: (Top) Three views of 3D radial visualization displays of the KNOB-SynC groups found in the SDSS dataset. Only the completely observed records are displayed in the figures.

          KNOB-SynC Groups
             1    2    3    4
Galaxies  1159    2   56    3
Stars        0  287    0    0

Figure 8 provides a 3D radial visualization (Zhu et al.,
2019) of the clustering results and a confusion matrix of the obtained grouping vis-a-vis the true classification.

Figure 8 (Bottom): Confusion matrix between the true classifications of Galaxies and Stars and the KNOB-SynC grouping, which yielded R = 0.86.

We see that KNOB-SynC groups all 287 stars together, but also includes 2 galaxies with them. The remaining galaxies are partitioned into groups of 1159, 56 and 3 observations. The large galaxy group and the group with the stars are well-separated from the smaller galaxy groups. The second-largest KNOB-SynC galaxy group has larger-sized galaxies, while the three galaxies in the last group have larger Shape1 and brightness. This illustration demonstrates KNOB-SynC's ability to identify general-shaped clusters even in the presence of incomplete records. We note that some of the competing methods, such as K-mH or EAC, could be modified to incorporate km-means results, but such modifications to both methodology and software are outside the scope of this paper. Our experimental evaluations comprehensively demonstrate that our KNOB-SynC algorithm works very well in finding general-shaped clusters. Indeed, our methodology can also accommodate scenarios that allow for scatter or incomplete records in the dataset.

4. Real-World Applications

In this section we apply KNOB-SynC, first, to find the different kinds of Gamma Ray Bursts (GRBs) in an astronomy catalog and, second, to identify activation detected in fMRI experiments. The ground truth is unknown in both applications, so we compare our results with other available evidence in the literature.

4.1. Determining the Distinct Kinds of Gamma Ray Bursts

There is tremendous interest in understanding the source and nature of Gamma Ray Bursts (GRBs), which are the brightest electromagnetic events known to occur in space (Chattopadhyay et al., 2007; Piran, 2005).
Many researchers (Mazets et al., 1981; Norris et al., 1984; Dezalay et al., 1992) have hypothesized that GRBs are of several kinds, but the exact number and the descriptive properties of these groups remain an area of active research and investigation. Most analyses have traditionally focused on univariate and bivariate statistical and descriptive methods for classification and found two groups, but other authors (Mukherjee et al., 1998; Chattopadhyay et al., 2007) have found three different kinds of GRBs when using more variables in the clustering. Recent careful analyses (Chattopadhyay and Maitra, 2017, 2018) have conclusively established five ellipsoidally-shaped groups in the GRB dataset obtained from the BATSE 4Br catalog. Indeed, Chattopadhyay and Maitra (2018) established that all nine fields of the BATSE 4Br catalog carry important clustering information, using methods developed in Raftery and Dean (2006). These nine fields are the two duration variables (times by which 50% and 90% of the flux arrive), the four time-integrated fluences in the 20-50, 50-100, 100-300 and > 300 keV spectral channels, and the three measurements of peak flux in time bins of 64, 256 and 1024 milliseconds. The authors used multivariate t-mixtures MBC on the logarithm of the measurements, with BIC for model selection, to arrive at their result of five ellipsoidally-shaped groups. GRB datasets have typically been analyzed after a log10 transformation to remove skewness. This transformation is somewhat arbitrary, so Berry and Maitra (2019) used their Transformation-infused K-means (TiK-means) algorithm to alternately transform features and cluster skewed datasets.
A modification (Berry and Maitra, 2019) of the jump statistic that accounts for the use of transformations in the algorithm found five groups, characterized as long-intermediate-intermediate, short-faint-intermediate, short-faint-soft, long-bright-hard and long-intermediate-hard in terms of their duration (T90), total fluence (Ftotal) and spectral hardness (H321), which are the summaries used to characterize GRB groups (Mukherjee et al., 1998). The fields of the BATSE 4Br catalog are heavily correlated on the log scale. This has led many researchers to summarily ignore all but a few variables in their analyses. Here we explore the performance of KNOB-SynC on the first three PCs (accounting for 96.27% of the total variance) of the nine scaled log-transformed variables, which is equivalent to using KNOB-SynC with the generalized Mahalanobis distance. The k-means phase applied to the 3 PCs finds 5 groups. KNOB-SynC does not enter the merging stage at all, since the maximum and generalized overlaps are the same. Comparison of our results (Figure 9 and Table 7) with those of Berry and Maitra (2019) shows fairly good agreement in the confusion matrix (R = 0.868).

Figure 9: Summary of duration (T90), total fluence (Ftotal) and spectral hardness (H321) for the groups obtained using (a) KNOB-SynC with the generalized Mahalanobis distance (Ĉ = 5) and (b) TiK-means (Ĉ = 5).

The first and fourth groups have short burst durations (T90) and soft spectral hardness (H321), although the first group has fainter total fluence (Ftotal). Groups 2 and 3 have long durations and bright fluences but different spectral hardness.
Group 5 has long-duration GRBs with intermediate fluence and spectral hardness. The true number and kinds of GRB groups are not known, but our results show that the KNOB-SynC solution yields groups that are distinct, interpretable and in line with the newer results obtained by TiK-means (Berry and Maitra, 2019) or MBC (Chattopadhyay and Maitra, 2017, 2018).

Table 7: (a) Summary of duration (T90), fluence (Ftotal) and spectral hardness (H321) for each of the five groups obtained using KNOB-SynC with the generalized Mahalanobis distance (top block) and TiK-means (bottom block). (b) Confusion matrix of the TiK-means and KNOB-SynC solutions (R = 0.868).

(a)
k   nk    T90              Ftotal            H321
1   207   0.234 (0.003)    31.423 (0.007)    0.505 (10^-4)    short-intermediate-soft
2   187   1.619 (0.003)    24.492 (0.01)     0.497 (10^-4)    long-bright-soft
3   459   1.543 (0.001)    30.682 (0.003)    0.526 (10^-4)    long-intermediate-hard
4   318   0.32 (0.002)     35.381 (0.004)    0.503 (10^-4)    short-faint-soft
5   428   1.882 (0.001)    27.235 (0.003)    0.516 (10^-4)    long-bright-hard
1   197   0.272 (0.003)    31.349 (0.007)    0.505 (10^-4)    short-intermediate-soft
2   188   1.65 (0.003)     24.439 (0.01)     0.498 (10^-4)    long-bright-soft
3   429   1.531 (0.001)    30.727 (0.003)    0.526 (10^-4)    long-intermediate-hard
4   333   0.304 (0.002)    35.302 (0.004)    0.502 (10^-4)    short-faint-soft
5   452   1.867 (0.001)    27.362 (0.003)    0.517 (10^-4)    long-bright-hard

(b)
                KNOB-SynC
TiK-means k     1     2     3     4     5
1             188     2     2     3     2
2               0   181     0     0     7
3               1     0   420     2     6
4              13     0     7   313     0
5               5     4    30     0   413

4.2. Activation Detection in an fMRI Finger-Tapping Task Experiment

Our second application uses KNOB-SynC to identify activation in fMRI experiments. One objective of fMRI is to determine cerebral regions that respond to a task or a particular stimulus (Bandettini et al., 1993; Belliveau et al., 1991; Kwong et al., 1992; Ogawa et al., 1990).
A typical approach relates, after correction and pre-processing, the observed Blood Oxygen Level Dependent (BOLD) time course at each image voxel to the expected BOLD response (Friston et al., 1994; Glover, 1999; Lazar, 2008) by fitting a general linear model (Friston et al., 1995) and obtaining a test statistic (often a t-statistic) that tests for significance at that voxel. Thresholding methods (Forman et al., 1995; Genovese et al., 2002) are then often used on these t-statistics to determine activation. Attempts to use clustering algorithms have been made, but Thirion et al. (2014) found that, despite the advantages of speed and simplicity, k-means is not in general a good performer because it fits data idiosyncrasies and pathologies. We therefore explore whether KNOB-SynC can improve the k-means clustering solution on these datasets. Our dataset for this experiment is from a right-hand finger-tapping experiment on a right-hand-dominant male and was acquired over twelve regularly-spaced sessions in a two-month span. We choose only the 5 sessions that were identified in Maitra (2010) as the ones with the highest reliability. Because there is no known gold standard, our comparison here is of the five partitionings detected in each replication with each other. Each dataset was preprocessed and voxel-wise Z-scores were obtained that quantified the test statistic under the hypothesis of no activation at each voxel. We refer to Maitra et al. (2002) and Maitra (2009b) for imaging details. At each of the n = 179364 voxels, we compute the Z-score to test the hypothesis that the expected BOLD levels are significantly related to the right-hand tapping at that voxel. These Z-scores for each replication constitute our (one-dimensional) dataset. Because of the large size of the dataset, most competing methods are impractical to apply, so we only use KNOB-SynC here.
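The per-voxel test statistic construction described above can be illustrated in its simplest form: regress the observed BOLD series on the expected response (plus an intercept) and form a t-statistic for the slope. This is a bare-bones sketch only; drift regressors, autocorrelation corrections, hemodynamic convolution and the conversion to Z-scores used in the actual pipeline are all omitted.

```python
import math

def voxel_t_statistic(bold, expected):
    """t-statistic for the slope beta in bold_t = alpha + beta * expected_t + e_t,
    a minimal stand-in for the voxel-wise general linear model fit."""
    n = len(bold)
    xbar = sum(expected) / n
    ybar = sum(bold) / n
    sxx = sum((x - xbar) ** 2 for x in expected)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(expected, bold))
    beta = sxy / sxx                       # least-squares slope
    alpha = ybar - beta * xbar             # least-squares intercept
    rss = sum((y - alpha - beta * x) ** 2 for x, y in zip(expected, bold))
    se_beta = math.sqrt(rss / ((n - 2) * sxx))  # standard error of the slope
    return beta / se_beta

# A toy block design: 5 cycles of 2 rest scans and 2 task scans, with a small
# deterministic "noise" pattern; an active voxel tracks the design closely.
design = [0.0, 0.0, 1.0, 1.0] * 5
active = [2.0 * x + (0.1 if i % 2 == 0 else -0.1) for i, x in enumerate(design)]
t_active = voxel_t_statistic(active, design)
```

A voxel whose series tracks the expected response yields a large t; voxels unrelated to the task yield values near zero, which is what the subsequent thresholding or clustering operates on.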
(For computational reasons also, we do not estimate K0 in the k-means phase but set it at K0 = 50.) The 50 homogeneous k-means groups in each of the five replicates, when supplied to the merging phase, each terminated with Ĉ = 2 syncytial groups. For the first replicate, the largest group has 178307 (99.4%) voxels; this is essentially the region of no activation. The other replicates have 178898 (99.7%), 178129 (99.3%), 179087 (99.8%) and 178658 (99.6%) voxels in this group.

Figure 10: Observed Z-scores at the voxels in the smaller (activated) group for the right-hand finger-thumb opposition task experiments obtained using (a) KNOB-SynC and (b) AR-FAST. For each set of experiments, we display activation maps for the 18th, 19th, 20th and 21st slices in each column; the replications are represented by rows. Jaccard indices of activation between each pair of replicates using (c) KNOB-SynC and (d) AR-FAST.

Figure 10a displays the Z-scores at the activated voxels over four slices of the brain. The displayed slices comprise the ipsi- and contra-lateral pre-motor cortices (pre-M1), the primary motor cortex (M1), the pre-supplementary motor cortex (pre-SMA) and the supplementary motor cortex (SMA). We see broad agreement between the replications in each of the four slices. We compare our KNOB-SynC results with those of the robust adaptive smoothed thresholding (AR-FAST) algorithm of Almodovar-Rivera and Maitra (2019), implemented in the R package RFASTfMRI (Almodovar-Rivera and Maitra, 2019), which shows far less agreement among the 5 replicates in terms of detected activation in the four slices. Specifically, KNOB-SynC (Figure 10a) identifies activation in the left M1 and in the ipsi-lateral pre-M1 areas. There is also some identified activation in the contra-lateral pre-M1, pre-SMA and SMA voxels.
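The pairwise agreement between replicates used in Figures 10c and 10d is the Jaccard index: the number of voxels declared active in both replicates divided by the number declared active in either. A minimal sketch over binary activation maps:

```python
def jaccard_index(map_a, map_b):
    """Jaccard index |A intersection B| / |A union B| between two
    equal-length binary activation maps (1 = voxel declared active)."""
    inter = sum(1 for a, b in zip(map_a, map_b) if a and b)
    union = sum(1 for a, b in zip(map_a, map_b) if a or b)
    return inter / union if union else 1.0
```

Identical maps give 1 and maps with disjoint activation give 0, so higher values across replicate pairs indicate more reproducible detection.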
On the other hand, AR-FAST (Figure 10b) finds less activation in the left M1 and in the ipsi-lateral pre-M1 areas. Figure 10c displays the Jaccard (1901) index of the activation detected using KNOB-SynC between each pair of replications, and Figure 10d displays the analogous Jaccard indices for the activation detected using AR-FAST. The Jaccard indices are higher for the KNOB-SynC-detected activation for each pair of replications, indicating greater reproducibility. The summarized Jaccard index of Maitra (2010), which provides an overall measure of the reproducibility of activation detected across replicates, is 0.238 for KNOB-SynC and 0.102 for AR-FAST, which was shown (Almodovar-Rivera and Maitra, 2019) to be a top performer on this dataset. We comment that while the overall Jaccard indices are low for both methods, the value of 0.238 also reflects the challenge of activation detection in single-subject fMRI. Seen in this context, KNOB-SynC does quite well. This example illustrates the potential of KNOB-SynC to improve and refine clustering solutions, making it possible, for instance, to use k-means while alleviating some of the concerns raised in Thirion et al. (2014).

5. Discussion

This paper has proposed a syncytial clustering algorithm called KNOB-SynC that merges groups found by standard clustering algorithms such as k-means, and does so in a data-driven and fully objective way. An R package called SynClustR implements our method in the function KNOBSynC and the competing K-mH syncytial algorithm in the function kmH, and is publicly available at https://github.com/ialmodovar/SynClustR. Our method is distribution-free and applies to the results of many standard clustering algorithms. We use the overlap measure of Maitra and Melnykov (2010) for merging decisions but use kernel-based nonparametric methods to calculate this overlap.
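The kernel-smoothed CDF at the heart of these overlap computations has the form Ĥ(y; b) = n⁻¹ Σᵢ G(y; Yᵢ, b), with G the CDF of a kernel centred at each observation. A minimal sketch follows, with a Gaussian kernel used purely as an illustrative stand-in for the paper's asymmetric RIG kernel, and a user-supplied bandwidth in place of the plug-in MISE-minimizing choice:

```python
import math

def kernel_cdf(y, sample, b):
    """Kernel-smoothed CDF estimate H_hat(y; b) = (1/n) * sum_i G(y; Y_i, b).
    Here G is a Gaussian CDF centred at each observation Y_i with scale b,
    an illustrative stand-in for the RIG kernel used in the paper."""
    def normal_cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sum(normal_cdf((y - yi) / b) for yi in sample) / len(sample)
```

The estimate is a smooth, nondecreasing function taking values in [0, 1] that approaches the empirical CDF as b shrinks, which is what makes it usable for the misclassification-probability calculations behind the pairwise overlaps.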
Our algorithm has no parameters that require fine-tuning by the user and, as pointed out by a reviewer, shows robust performance, when compared against a host of other methods, across many datasets of many dimensions and of varying complexity. Further, our methodology is general enough to extend to situations with incomplete records or where clustering is done in the presence of scatter. Application of KNOB-SynC to data from the BATSE 4Br catalog provides further evidence of five kinds of GRBs. Our approach is also demonstrated to potentially make it possible to adapt k-means clustering for activation detection in fMRI. This paper also developed estimation methods for the CDF using the asymmetric RIG kernel. We used the plug-in bandwidth selector that minimizes the MISE as our bandwidth choice, but it would be good to develop and investigate more sophisticated approaches. Further, our development in this paper provides an opportunity to develop nonparametric methods for diagnostics in clustering. For instance, our kernel CDF estimator could be used to determine uncertainties in k-means classifications. A reviewer has also very kindly drawn our attention to the fact that the construction of composite clusters that underlies the idea behind syncytial clustering has also been used in the context of semi-supervised clustering (Śmieja and Wiercioch, 2017), where, instead of the estimated overlap used in this paper, available class labels are used in deciding whether to merge pairs of groups. We believe that such an approach may also benefit from our methodology, especially when not all classes have representation in the supervised portion of the dataset. Thus, although we have made an important contribution, a number of issues remain that would benefit from further attention.
Acknowledgments

The authors sincerely thank the Action Editor and three anonymous reviewers whose helpful and insightful comments on an earlier version of this article greatly improved its content. The authors also acknowledge partial support for this research from the National Institute of Biomedical Imaging and Bioengineering (NIBIB) of the National Institutes of Health (NIH) under its Award No. R21EB016212. I. Almodovar-Rivera also acknowledges receipt of a fellowship from Iowa State University's Alliance for Graduate Education and the Professoriate (AGEP) program for underrepresented graduate students in STEM fields. R. Maitra also acknowledges support from the United States Department of Agriculture (USDA) National Institute of Food and Agriculture (NIFA) Hatch project IOW03617. The content of this paper is, however, solely the responsibility of the authors and does not represent the official views of the NIBIB, the NIH, the NIFA or the USDA.

Appendix A.

A.1. Proof of Lemma 3

Proof. We have
$$\mathrm{E}[\hat{H}(y;b)] = \mathrm{E}\Big[\frac{1}{n}\sum_{i=1}^{n} G(y; Y_i, b)\Big] = \mathrm{E}[G(y; Y_1, b)] = \int_0^\infty G(y; t, b)\, h(t)\, dt = \int_0^\infty \Big\{\int_0^y K(t; w, b)\, dw\Big\} h(t)\, dt = \int_0^y \Big\{\int_0^\infty K(t; w, b)\, h(t)\, dt\Big\}\, dw = \int_0^y \mathrm{E}\big[h\big(V_{1/(w-b),\,1/b}\big)\big]\, dw,$$
where the random variable $V_{1/(w-b),\,1/b} \sim \mathrm{RIG}[1/(w-b), 1/b]$. The last equality holds because the inner integral $\int_0^\infty K(t; w, b)\, h(t)\, dt = \mathrm{E}[h\{V_{1/(w-b),\,1/b}\}]$. Then, expanding $h\{V_{1/(w-b),\,1/b}\}$ around the mean $w$ of $V_{1/(w-b),\,1/b}$ and using $\mathrm{Var}\{V_{1/(w-b),\,1/b}\} = b(w+b)$ yields
$$\int_0^y \mathrm{E}[h\{V_{1/(w-b),\,1/b}\}]\, dw = \int_0^y h(w)\, dw + \frac{1}{2}\int_0^y (bw + b^2)\, h''(w)\, dw + o(b^2) = H(y) + \frac{b}{2}\int_0^y w\, h''(w)\, dw + o(b) = H(y) + \frac{b}{2}\big[y\, h'(y) - h(y) + h(0)\big] + o(b) = H(y) + O(b).$$
For the variance, we have from the definition,
$$\mathrm{Var}[\hat{H}(y;b)] = \mathrm{Var}\Big[\frac{1}{n}\sum_{i=1}^{n} G(y; Y_i, b)\Big] = \frac{1}{n}\mathrm{Var}[G(y; Y_1, b)] = \frac{1}{n}\mathrm{E}\big[G^2(y; Y_1, b)\big] - \frac{1}{n}\big[\mathrm{E}(G(y; Y_1, b))\big]^2.$$
The second term is easily obtained from (21).
It remains to derive the second moment of the estimator, \(E[G^2(y;Y_1,b)] = \int_0^\infty G^2(y;t,b)h(t)\,dt\). Using a random variable \(V_{1/(w-b),1/b}\) and the same tactics as in the reductions leading to (21), this second moment can be recast as the sum of two terms, say (22): one involving \(\int_0^\infty G(y;t,b)\,\Phi\big(2\sqrt{b/t}\big)\,h(t)\,dt\) and one involving \(\int_0^y E\big[F(V_{1/(w-b),1/b};y,b)\big]\,dw\), where \(F(\cdot\,;y,b)\) is defined through \(\Phi\) and \(G\). Since \(t,b>0\), we have \(0 < 2\sqrt{b/t} < \infty\) and so \(\Phi\big(2\sqrt{b/t}\big) \le 1\). Therefore,
\[
\int_0^\infty G(y;t,b)\,\Phi\big(2\sqrt{b/t}\big)\,h(t)\,dt \;\le\; \int_0^\infty G(y;t,b)\,h(t)\,dt.
\]
But \(\int_0^\infty G(y;t,b)h(t)\,dt = E[G(y;Y_1,b)] = H(y) + b[yh'(y)-h(y)]/2 + o(b)\) and \(\Phi(2) = 0.97725\), so that
\[
\int_0^\infty G(y;t,b)\,\Phi\big(2\sqrt{b/t}\big)\,h(t)\,dt \;\le\; H(y) + \frac{b}{2}\big[yh'(y)-h(y)\big] + o(b). \tag{23}
\]
For the second term in (22), expanding \(V_{1/(w-b),1/b}\) around its mean \(w\) yields
\[
\int_0^y E\big[F(V_{1/(w-b),1/b};y,b)\big]\,dw
= \int_0^y F(w;y,b)\,dw + \frac{1}{2}\int_0^y \mathrm{Var}\big(V_{1/(w-b),1/b}\big)\,F''(w;y,b)\,dw + o(b)
= \Phi\big(\sqrt{b/y}\big)H(y) + o(\sqrt{b}). \tag{24}
\]
We now expand \(\Phi\big(\sqrt{b/y}\big)\) in a Taylor series around 0 to get
\[
\Phi\big(\sqrt{b/y}\big) = \frac{1}{2} + \sqrt{\frac{b}{2\pi y}} + O(b).
\]
Inserting this result into (24) and combining with (23) reduces (22) to an expression for \(E[G^2(y;Y_1,b)]\) in terms of \(H(y)\), from which the approximate expressions for the variance in Lemma 3 follow.

Appendix B. Detailed Experimental Evaluations

Figures 11–22 illustrate performance on 2D datasets obtained by KNOB-SynC, K-mH, DEMP, DEMP+, MMC, EAC, GSL-NN, spectral clustering, kernel k-means, DBSCAN*, density peaks, PGMM, MSAL and MGHD. In all cases, plotting character and color represent the true and estimated group indicators.
Tables 8 and 9 display performance, in terms of R and estimated Ĉ, on 2D and higher-dimensional datasets, of KNOB-SynC (denoted KNS in the tables), K-mH, DEMP (DM), DEMP+ (DM+), MMC, EAC, GSL-NN (GSN), spectral clustering (SpC), kernel k-means (k-km), DBSCAN* (D*), density peaks (DP), PGMM (PGM), MSAL (MSL) and MGHD (MHD). For k-km, Ĉ was set at the true C and not estimated. For each method, the absolute deviation of R for that method from the highest R for each dataset was obtained; the average and SD of these absolute deviations are also reported for each method in the last row.

[Figure 11: The Spherical-7 example, and clusterings obtained using the 14 methods. Panel R values: KNOB-SynC 0.99; K-mH 0.89; DEMP 1.00; DEMP+ 1.00; MMC 0.63; EAC 0.56; GSL-NN 0.63; spectral clustering 0.56; kernel k-means 0.72; DBSCAN* 0.53; density peaks 0.78; PGMM 1.00; MSAL 0.99; MGHD 0.64.]
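The summary rows of Tables 8 and 9 (per-dataset absolute deviation from the best R, then the mean and SD of those deviations per method) can be reproduced with a short sketch. The array below is illustrative, not the paper's data, and the population-SD convention is an assumption about how Dσ was computed.

```python
import numpy as np

def deviation_summary(R):
    """R: (datasets x methods) array of adjusted Rand indices.
    Returns, per method, the mean and SD of |R - best R on that dataset|."""
    dev = R.max(axis=1, keepdims=True) - R   # deviation from the per-dataset winner
    return dev.mean(axis=0), dev.std(axis=0)  # population SD assumed

# Illustrative toy values (not from Tables 8-9): three datasets, two methods
R = np.array([[1.00, 0.50],
              [0.80, 0.80],
              [0.60, 1.00]])
d_bar, d_sigma = deviation_summary(R)
```

A small average deviation means a method is consistently close to the best performer on every dataset, which is the sense in which KNOB-SynC is reported as a top performer.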
[Figure 12: The Bananas-Arcs example and groupings obtained using the 14 methods. Panel R values: KNOB-SynC 0.91; K-mH 1.00; DEMP 0.43; DEMP+ 0.50; MMC 0.11; EAC 1.00; GSL-NN 1.00; spectral clustering 1.00; kernel k-means 0.70; DBSCAN* 0.98; density peaks 0.48; PGMM 0.41; MSAL 0.39; MGHD 0.50.]

[Figure 13: The Banana-clump example and groups obtained using the 14 methods. Panel R values: KNOB-SynC 0.98; K-mH 1.00; DEMP 0.77; DEMP+ 1.00; MMC 0.67; EAC 1.00; GSL-NN 1.00; spectral clustering 1.00; kernel k-means 0.92; DBSCAN* 0.00; density peaks 0.19; PGMM 0.75; MSAL 0.37; MGHD 0.65.]
[Figure 14: The Bullseye example and clusters obtained using the 14 methods. Panel R values: KNOB-SynC 1.00; K-mH 1.00; DEMP 0.21; DEMP+ 0.31; MMC 0.15; EAC 0.01; GSL-NN 1.00; spectral clustering 1.00; kernel k-means 0.53; DBSCAN* 0.95; density peaks 0.09; PGMM 0.26; MSAL 0.00; MGHD 0.26.]

[Figure 15: The Bullseye-Cigarette example and groups obtained with the 14 methods. Panel R values: KNOB-SynC 0.91; K-mH 1.00; DEMP 0.62; DEMP+ 0.61; MMC 0.17; EAC 1.00; GSL-NN 1.00; spectral clustering 0.23; kernel k-means 0.44; DBSCAN* 0.99; density peaks 0.56; PGMM 0.53; MSAL 0.00; MGHD 0.44.]

[Figure 16: The Compound example and partitionings obtained using the 14 methods. Panel R values: KNOB-SynC 0.93; K-mH 0.50; DEMP 0.74; DEMP+ 0.74; MMC 0.59; EAC 0.81; GSL-NN 0.74; spectral clustering 0.37; kernel k-means 0.50; DBSCAN* 0.82; density peaks 0.42; PGMM 0.71; MSAL 0.59; MGHD 0.43.]
[Figure 17: The Half-Ringed clusters example and groups obtained with the 14 methods. Panel R values: KNOB-SynC 0.88; K-mH 0.95; DEMP 0.37; DEMP+ 0.37; MMC 0.12; EAC 1.00; GSL-NN 0.26; spectral clustering 0.20; kernel k-means 0.03; DBSCAN* 0.07; density peaks 0.10; PGMM 0.21; MSAL 0.27; MGHD 0.30.]
[Figure 18: The Path-based example and groups obtained with the 14 methods. Panel R values: KNOB-SynC 0.55; K-mH 0.42; DEMP 0.41; DEMP+ 0.41; MMC 0.00; EAC 0.41; GSL-NN 0.00; spectral clustering 0.72; kernel k-means 0.35; DBSCAN* 0.18; density peaks 0.23; PGMM 0.59; MSAL 0.44; MGHD 0.60.]

[Figure 19: The SCX-Bananas example and clusters obtained using the 14 methods. Panel R values: KNOB-SynC 0.95; K-mH 1.00; DEMP 0.81; DEMP+ 0.79; MMC 0.19; EAC 1.00; GSL-NN 1.00; spectral clustering 0.87; kernel k-means 0.23; DBSCAN* 0.99; density peaks 0.25; PGMM 0.44; MSAL 0.52; MGHD 0.38.]

[Figure 20: The Spiral example and groupings obtained using the 14 competing methods. Panel R values: KNOB-SynC 0.86; K-mH 0.01; DEMP 0.00; DEMP+ 0.00; MMC 0.00; EAC 0.14; GSL-NN 1.00; spectral clustering 0.35; kernel k-means 0.00; DBSCAN* 0.10; density peaks 0.42; PGMM 0.06; MSAL 0.00; MGHD 0.07.]

[Figure 21: The SSS example and clusters obtained with the 14 competing methods. Panel R values: KNOB-SynC 1.00; K-mH 0.77; DEMP 0.77; DEMP+ 0.84; MMC 0.63; EAC 1.00; GSL-NN 1.00; spectral clustering 0.20; kernel k-means 0.36; DBSCAN* 0.95; density peaks 0.16; PGMM 0.37; MSAL 0.94; MGHD 0.40.]
GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG GG G GG GG G G GGGGG DBSCAN*, R = 0.99 Density Peaks, R = 0.33 PGMM, R = 0.90 MSAL, R = 1.00 MGHD, R = 0.74 MMC, R = 1.00 EAC, R = 1.00 GSL NN, R = 1.00 Spectral Clustering, R = 0.29 Kernel k means, R = 0.99 True KNOB Syn C, R = 0.95 K m H, R = 0.96 DEMP, R = 1.00 DEMP+, R = 1.00 Figure 22: The XXXX example and groupings obtained using the 14 competing methods. Almod ovar-Rivera and Maitra Table 8: Performance, in terms of R and estimated ˆC, on 2D datasets, of competing methods. Dataset KNS K-m H DM DM+ MMC EAC GSN Sp C kk-m D* DP PGM MSL MHD (n, p, K) 7-Spherical 0.99 0.89 1 1 0.63 0.56 0.63 0.56 0.72 0.53 0.78 1 0.99 0.64 (500, 2, 7) 7 10 7 7 5 7 5 20 7 4 9 7 7 13 Aggregation 0.98 0.95 0.91 0.91 0.80 0.90 0.81 0.59 0.24 0.58 0.26 0.64 0.82 0.43 (788, 2, 7) 7 9 6 6 8 12 11 7 7 277 54 12 7 15 Banana-arcs 0.91 1 0.43 0.50 0.11 1 1 1 0.7 0.98 0.48 0.41 0.39 0.50 (4515, 2, 4) 5 4 11 10 52 4 4 4 4 47 460 9 4 8 Banana-clump 0.98 1 0.77 1 0.67 1 1 1 0.92 0 0.19 0.75 0.37 0.65 (200, 2, 2) 4 2 3 2 4 2 2 4 2 1 12 3 2 5 Bullseye 1 1 0.21 0.31 0.15 -0.01 1 1 0.53 0.95 0.09 0.26 -0.00 0.26 (400, 2, 2) 2 2 7 5 11 2 2 2 2 14 23 5 2 5 Bullseye-Cig 0.91 1 0.62 0.61 0.17 1 1 0.23 0.44 0.99 0.56 0.53 0 0.44 (3025, 2, 8) 7 6 7 7 5 9 14 16 8 17 115 4 1 4 Compound 0.93 0.5 0.74 0.74 0.59 0.81 0.74 0.37 0.5 0.82 0.42 0.71 0.59 0.43 (399, 2, 6) 19 13 5 5 5 6 6 13 6 117 9 6 6 12 Half-ringed 0.88 0.95 0.37 0.37 0.12 1 0.26 0.20 0.03 0.07 0.1 0.21 0.27 0.3 (373, 2, 2) 16 3 6 6 11 2 2 8 2 155 17 5 2 5 Path-based 0.55 0.42 0.41 0.41 0 0.41 0 0.72 0.35 0.18 0.23 0.59 0.44 0.60 (300, 2, 3) 21 2 2 2 3 15 3 3 3 217 9 7 3 7 SCX-Bananas 0.95 1 0.81 0.79 0.19 1 1 0.87 0.23 0.99 0.25 0.44 0.52 0.38 (3420, 2, 8) 
8 8 7 9 39 8 8 4 8 25 389 16 8 17 Spiral 0.86 0.01 0 0 0 0.14 1 0.35 -0 0.1 0.42 0.06 0 0.09 (312, 2, 3) 3 2 1 1 1 9 3 11 3 208 7 7 1 7 SSS 1 0.77 0.77 0.84 0.63 1 1 0.20 0.36 0.95 0.16 0.37 0.94 0.40 (5015, 2, 3) 3 5 5 4 27 4 3 19 3 29 385 7 3 7 XXXX 0.95 0.96 1 1 1 1 1 0.29 0.99 0.99 0.33 0.90 1 0.74 (415, 2, 4) 10 5 4 4 4 4 4 20 4 8 20 6 4 9 D 0.06 0.17 0.35 0.32 0.58 0.22 0.17 0.40 0.53 0.35 0.64 0.44 0.48 0.52 Dσ 0.06 0.28 0.31 0.31 0.33 0.35 0.27 0.34 0.31 0.39 0.20 0.29 0.38 0.21 Kernel-estimated Nonparametric Overlap-Based Syncytial Clustering Table 9: Performance, in terms of R and estimated ˆC, on high-dimensional datasets. (In the table, m displays the effective dimension of the dataset and is the number of coordinates, PCs or KPCs used as per the descriptions in Section 3.3.) Dataset KNS K-m H DM DM+ MMC EAC GSN Sp C kk-m DBSCAN* DP PGMM MSAL MGHD (n, p, K, m) Simplex-7 0.97 0.97 0.97 0.97 0.97 0.94 0.94 0.92 0.94 0.03 0.58 0.97 0.98 0.96 (560, 7, 7, 7) 7 7 7 7 7 6 7 7 7 508 79 7 7 7 E.coli 0.72 0.63 0.70 0.68 0.59 0.77 0.03 0.26 0.45 0.31 0.38 0.48 0 0.60 (336, 7, 7, 5) 9 4 4 4 8 10 16 15 7 174 12 8 1 8 Wines-13 0.92 0.62 0.52 0.5 0.67 0.6 0.38 0.43 0.6 0 0.26 0.66 0 0.16 (178, 13, 3, 17) 3 11 7 7 7 8 7 23 7 178 15 5 1 6 Wines-27 0.93 -0.01 1 1 0.91 0.62 0 0.35 0.88 0 0.56 0.85 0 0.95 (178, 27, 3, 26) 3 2 3 3 4 7 1 8 3 178 12 3 1 3 Olive Oils-Area 0.55 0.56 0.85 0.82 0.66 0.55 0.51 0.48 0.56 0.47 0.40 0.61 0.70 0.58 (572, 8, 9, 8) 4 8 7 12 11 14 5 18 9 201 27 5 9 7 Olive Oils-Region 0.89 0.69 0.45 0.47 0.22 0.46 0.67 0.23 0.4 0.44 0.41 0.59 0.58 0.54 (572, 8, 3, 8) 4 8 7 12 11 14 5 18 9 201 27 5 9 7 Image 0.54 0.48 0.49 0.46 0.28 0.59 0.10 0.22 0.52 0 0.38 0 0 0.56 (2310, 19, 7, 8) 12 17 18 17 46 40 48 45 7 1 56 1 1 7 Yeast 0.22 0.01 -0.01 -0.01 0.04 0.004 0.003 0.11 0.14 0 0.03 0 0 0.04 (1484, 8, 10, 6) 5 4 6 6 37 13 33 17 10 1 44 1 1 10 ALL 0.68 0.19 0.14 0.14 0.35 0.53 0.54 0.55 0.5 0 0.45 0.61 0 0 (215, 1000, 7, 42) 6 18 6 6 9 9 5 12 7 215 41 7 1 1 
Zipcode 0.76 0.55 0.35 0.33 0.01 0.21 0.54 0.54 0 0 0.58 0.52 0 0.56 (2000, 256, 10, 33) 9 22 36 33 1 45 7 23 10 2000 53 6 1 6 Pendigits 0.72 0.51 0.58 0.6 0.26 0.58 0.004 0.42 0.3 0.3 0.54 0.44 0 0.67 (10992, 16, 10, 18) 15 9 40 32 59 58 59 48 10 7 9 6 1 10 D 0.04 0.29 0.21 0.22 0.31 0.22 0.44 0.35 0.27 0.62 0.35 0.24 0.56 0.25 Dσ 0.09 0.27 0.20 0.20 0.23 0.17 0.29 0.21 0.21 0.26 0.16 0.15 0.33 0.26 Almod ovar-Rivera and Maitra F. Alimoglu. Combining multiple classifiers for pen-based handwritten digit recognition. Master s thesis, Institute of Graduate Studies in Science and Engineering, Bogazici University, 1996. F. Alimoglu and E. Alpaydin. Methods of combining multiple classifiers based on different representations for pen-based handwriting recognition. In Proceedings of the Fifth Turkish Artificial Intelligence and Artificial Neural Networks Symposium (TAINN 96), Istanbul, Turkey, 1996. I. Almod ovar-Rivera and R. Maitra. RFASTf MRI: Fast adaptive smoothing and thresholding for improved activation detection in low-signal f MRI, 2019. R Package, URL http://github.com/ialmodovar/RFASTf MRI. I. A. Almod ovar-Rivera and R. Maitra. FAST adaptive smoothed thresholding for improved activation detection in low-signal f MRI. IEEE Transactions on Medical Imaging, 38(12): 2821 2828, 2019. doi: 10.1109/TMI.2019.2915052. A. Azzalini. A note on the estimation of a distribution function and quantiles by a kernel method. Biometrika, 68(1):326 328, 1981. P. A. Bandettini, A. Jesmanowicz, E. C. Wong, and J. S. Hyde. Processing strategies for time-course data sets in functional mri of the human brain. Magnetic Resonance in Medicine, 30:161 173, 1993. J.-P. Baudry and G. Celeux. Rmixmod Combi: Combining Mixture Components for Clustering, 2014. URL https://CRAN.R-project.org/package=Rmixmod Combi. R package version 1.0. J.-P. Baudry, A. E. Raftery, G. Celeux, K. Lo, and R. Gottardo. Combining mixture components for clustering. 
Journal of Computational and Graphical Statistics, 19(2):332–353, 2010.
J. W. Belliveau, D. N. Kennedy, R. C. McKinstry, B. R. Buchbinder, R. M. Weisskoff, M. S. Cohen, J. M. Vevea, T. J. Brady, and B. R. Rosen. Functional mapping of the human visual cortex by magnetic resonance imaging. Science, 254:716–719, 1991.
N. S. Berry and R. Maitra. TiK-means: Transformation-infused K-means clustering for skewed groups. Statistical Analysis and Data Mining: The ASA Data Science Journal, 12(3):223–233, 2019.
T. Bouezmarni and O. Scaillet. Consistency of asymmetric kernel density estimators and smoothed histograms with application to income data. Econometric Theory, 21(2):390–412, 2005.
C. Bouveyron, G. Celeux, T. B. Murphy, and A. E. Raftery. Model-Based Clustering and Classification for Data Science: With Applications in R. Cambridge University Press, Cambridge Series in Statistical and Probabilistic Mathematics, 2019.
R. P. Browne and P. D. McNicholas. A mixture of generalized hyperbolic distributions. Canadian Journal of Statistics, 43(2):176–198, 2015.
R. J. Campello, D. Moulavi, and J. Sander. Density-based clustering based on hierarchical density estimates. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 160–172. Springer, 2013.
H. Chang and D.-Y. Yeung. Robust path-based spectral clustering. Pattern Recognition, 41(1):191–203, 2008. ISSN 0031-3203. doi: https://doi.org/10.1016/j.patcog.2007.04.010. URL http://www.sciencedirect.com/science/article/pii/S0031320307002038.
S. Chattopadhyay and R. Maitra. Gaussian-mixture-model-based cluster analysis finds five kinds of gamma-ray bursts in the BATSE catalogue. Monthly Notices of the Royal Astronomical Society, 469(3):3374–3389, 2017. doi: 10.1093/mnras/stx1024.
S. Chattopadhyay and R. Maitra. Multivariate t-mixture-model-based cluster analysis of BATSE catalogue establishes importance of all observed parameters, confirms five distinct ellipsoidal sub-populations of gamma-ray bursts. Monthly Notices of the Royal Astronomical Society, 481(3):3196–3209, 2018. ISSN 0035-8711. doi: 10.1093/mnras/sty1940.
T. Chattopadhyay, R. Misra, A. K. Chattopadhyay, and M. Naskar. Statistical evidence for three classes of gamma-ray bursts. Astrophysical Journal, 667(2):1017, 2007. doi: https://doi.org/10.1086/520317.
A. Chaturvedi, P. E. Green, and J. D. Carroll. K-modes clustering. Journal of Classification, 18:35–55, 2001.
S. X. Chen. Probability density function estimation using gamma kernels. Annals of the Institute of Statistical Mathematics, 52(3):471–480, 2000.
J.-P. Dezalay, C. Barat, R. Talon, R. Syunyaev, O. Terekhov, and A. Kuznetsov. Short cosmic events – A subset of classical GRBs? In W. S. Paciesas and G. J. Fishman, editors, American Institute of Physics Conference Series, volume 265 of American Institute of Physics Conference Series, pages 304–309, 1992.
I. Dhillon, Y. Guan, and B. Kulis. A unified view of kernel k-means, spectral clustering and graph cuts. Technical Report TR-04-25, University of Texas at Austin, 2004.
K. S. Dorman and R. Maitra. An efficient k-modes algorithm for clustering categorical datasets. arXiv e-prints:2006.03936, 2020.
V. A. Epanechnikov. Non-parametric estimation of a multivariate probability density. Theory of Probability and its Applications, 14:153–158, 1969. doi: 10.1137/1114019.
M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD-96, volume 96, pages 226–231, 1996.
B. S. Everitt, S. Landau, and M. Leese. Cluster Analysis (4th ed.). Hodder Arnold, New York, 2001.
E. Forgy. Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. Biometrics, 21:768–780, 1965.
M. Forina and E. Tiscornia. Pattern recognition methods in the prediction of Italian olive oil origin by their fatty acid content. Annali di Chimica, 72:143–155, 1982.
M. Forina, C. Armanino, S. Lanteri, and E. Tiscornia. Classification of olive oils from their fatty acid composition. In Food Research and Data Analysis, pages 189–214. Applied Science Publishers, London, 1983.
M. Forina, R. Leardi, and S. Lanteri. PARVUS – an extendible package for data exploration, classification and correlation, 1988.
S. D. Forman, J. D. Cohen, M. Fitzgerald, W. F. Eddy, M. A. Mintun, and D. C. Noll. Improved assessment of significant activation in functional magnetic resonance imaging (fMRI): Use of a cluster-size threshold. Magnetic Resonance in Medicine, 33:636–647, 1995.
C. Fraley and A. E. Raftery. How many clusters? Which cluster method? Answers via model-based cluster analysis. Computer Journal, 41:578–588, 1998.
C. Fraley and A. E. Raftery. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97:611–631, 2002.
B. C. Franczak, R. P. Browne, and P. D. McNicholas. Mixtures of shifted asymmetric Laplace distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6):1149–1157, 2014.
B. C. Franczak, R. P. Browne, P. D. McNicholas, and K. L. Burak. MixSAL: Mixtures of Multivariate Shifted Asymmetric Laplace (SAL) Distributions, 2018. URL https://CRAN.R-project.org/package=MixSAL. R package version 1.0.
A. L. Fred and A. K. Jain. Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):835–850, 2005.
K. J. Friston, P. Jezzard, and R. Turner. Analysis of functional MRI time-series. Human Brain Mapping, 1:153–171, 1994.
K. J. Friston, A. P. Holmes, K. J. Worsley, J.-B. Poline, C. D. Frith, and R. S. J. Frackowiak. Statistical parametric maps in functional imaging: A general linear approach.
Human Brain Mapping, 2:189–210, 1995.
C. R. Genovese, N. A. Lazar, and T. E. Nichols. Thresholding of statistical maps in functional neuroimaging using the false discovery rate. Neuroimage, 15:870–878, 2002.
Z. Ghahramani and G. E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto, Toronto, Canada, 1997.
A. Gionis, H. Mannila, and P. Tsaparas. Clustering aggregation. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):4, 2007.
G. H. Glover. Deconvolution of impulse response in event-related BOLD fMRI. Neuroimage, 9:416–429, 1999.
M. Hahsler and M. Piekenbrock. dbscan: Density Based Clustering of Applications with Noise (DBSCAN) and Related Algorithms, 2018. URL https://CRAN.R-project.org/package=dbscan. R package version 1.1-3.
J. A. Hartigan and M. A. Wong. A k-means clustering algorithm. Applied Statistics, 28:100–108, 1979.
C. Hennig. Methods for merging Gaussian mixture components. Advances in Data Analysis and Classification, 2010. doi: 10.1007/s11634-010-0058-3.
A. Hinneburg and D. Keim. Cluster discovery methods for large databases: from the past to the future. In Proceedings of the ACM SIGMOD International Conference on the Management of Data, 1999.
P. Horton and K. Nakai. A probabilistic classification system for predicting the cellular localization sites of proteins. Intelligent Systems for Molecular Biology, pages 109–115, 1996.
Z. Huang. Clustering large data sets with mixed numeric and categorical values. In Proceedings of the First Pacific Asia Knowledge Discovery and Data Mining Conference, pages 21–34, Singapore, 1997. World Scientific.
Z. Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2:283–304, 1998.
L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985.
P. Jaccard. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547–579, 1901.
A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ, 1988.
A. K. Jain and M. H. C. Law. Data clustering: A user's dilemma. In S. K. Pal, B. S., and B. S., editors, Pattern Recognition and Machine Intelligence. PReMI 2005, volume 3776 of Lecture Notes in Computer Science, pages 1–10, Berlin, Heidelberg, 2005. Springer.
Y. Jeon and J. H. T. Kim. A gamma kernel density estimation for insurance loss data. Insurance: Mathematics and Economics, 53:569–579, 2013. doi: http://dx.doi.org/10.1016/j.insmatheco.2013.08.009.
S. Johnson. Hierarchical clustering schemes. Psychometrika, 32(3):241–254, 1967.
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data. John Wiley & Sons, New York, 1990.
W. J. Krzanowski and Y. Lai. A criterion for determining the number of groups in a data set using sum-of-squares clustering. Biometrics, pages 23–34, 1988.
K. K. Kwong, J. W. Belliveau, D. A. Chesler, I. E. Goldberg, R. M. Weisskoff, B. P. Poncelet, D. N. Kennedy, B. E. Hoppel, M. S. Cohen, R. Turner, H.-M. Cheng, T. J. Brady, and B. R. Rosen. Dynamic magnetic resonance imaging of human brain activity during primary sensory stimulation. Proceedings of the National Academy of Sciences of the United States of America, 89:5675–5679, 1992.
N. A. Lazar. The Statistical Analysis of Functional MRI Data. Springer, 2008.
A. Lithio and R. Maitra. An efficient k-means-type algorithm for clustering datasets with incomplete records. Statistical Analysis and Data Mining: The ASA Data Science Journal, 11(6):296–311, 2018.
S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
J. MacQueen. Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium, 1:281–297, 1967.
P. C. Mahalanobis. On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India, 2(1):49–55, 1936.
R. Maitra. Clustering massive datasets with applications to software metrics and tomography. Technometrics, 43(3):336–346, 2001.
R. Maitra. A statistical perspective to data mining. Journal of the Indian Society of Probability and Statistics, 6:28–77, 2002.
R. Maitra. Initializing partition-optimization algorithms. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6:144–157, 2009a. doi: 10.1109/TCBB.2007.70244.
R. Maitra. Assessing certainty of activation or inactivation in test-retest fMRI studies. Neuroimage, 47(1):88–97, 2009b.
R. Maitra. A re-defined and generalized percent-overlap-of-activation measure for studies of fMRI reproducibility and its use in identifying outlier activation maps. Neuroimage, 50(1):124–135, 2010.
R. Maitra and V. Melnykov. Simulating data to study performance of finite mixture modeling and clustering algorithms. Journal of Computational and Graphical Statistics, 19(2):354–376, 2010. doi: 10.1198/jcgs.2009.08054.
R. Maitra and I. P. Ramler. Clustering in the presence of scatter. Biometrics, 65:341–352, 2009.
R. Maitra, S. R. Roys, and R. P. Gullapalli. Test-retest reliability estimation of functional MRI data. Magnetic Resonance in Medicine, 48:62–70, 2002.
R. Maitra, V. Melnykov, and S. Lahiri. Bootstrapping for significance of compact clusters in multi-dimensional datasets. Journal of the American Statistical Association, 107(497):378–392, 2012. doi: http://dx.doi.org/10.1080/01621459.2011.646935.
E. P. Mazets, S. V. Golenetskii, V. N. Ilinskii, V. N. Panov, R. L. Aptekar, I. A. Gurian, M. P. Proskura, I. A. Sokolov, Z. I. Sokolova, and T. V. Kharitonova. Catalog of cosmic gamma-ray bursts from the KONUS experiment data. I. Astrophysics and Space Science, 80:3–83, 1981. doi: 10.1007/BF00649140. G.
McLachlan and D. Peel. Finite Mixture Models. John Wiley and Sons, Inc., New York, 2000.
G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.
P. D. McNicholas. Mixture Model-Based Classification. Chapman and Hall/CRC, 2016.
P. D. McNicholas and T. B. Murphy. Parsimonious Gaussian mixture models. Statistics and Computing, 18(3):285–296, 2008.
P. D. McNicholas, A. El Sherbiny, A. F. McDaid, and T. B. Murphy. pgmm: Parsimonious Gaussian Mixture Models, 2018. URL https://CRAN.R-project.org/package=pgmm. R package version 1.2.3.
V. Melnykov. Merging mixture components for clustering through pairwise overlap. Journal of Computational and Graphical Statistics, 25(1):66–90, 2016.
V. Melnykov and R. Maitra. Finite mixture models and model-based clustering. Statistics Surveys, 4:80–116, 2010. ISSN 1935-7516. doi: 10.1214/09-SS053.
V. Melnykov and R. Maitra. CARP: Software for fishing out good clustering algorithms. Journal of Machine Learning Research, 12:69–73, 2011.
V. Melnykov, W.-C. Chen, and R. Maitra. MixSim: An R package for simulating data to study performance of clustering algorithms. Journal of Statistical Software, 51(12):1–25, 2012. URL http://www.jstatsoft.org/v51/i12/.
C. D. Michener and R. R. Sokal. A quantitative approach to a problem in classification. Evolution, 11:130–162, 1957.
S. Mukherjee, E. D. Feigelson, G. Jogesh Babu, F. Murtagh, C. Fraley, and A. Raftery. Three types of gamma-ray bursts. Astrophysical Journal, 508:314–327, 1998. doi: 10.1086/306386.
K. Nakai. UCI machine learning repository, 1996. URL http://archive.ics.uci.edu/ml.
K. Nakai and M. Kanehisa. Expert system for predicting protein localization sites in gram-negative bacteria. PROTEINS: Structure, Function, and Genetics, 11:95–110, 1991.
R. B. Nelsen. An Introduction to Copulas. Springer, New York, 2nd edition, 2006.
D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repository of machine learning databases, 1998.
J. P. Norris, T. L. Cline, U. D. Desai, and B. J. Teegarden. Frequency of fast, narrow gamma-ray bursts. Nature, 308:434, 1984. doi: 10.1038/308434a0.
S. Ogawa, T. M. Lee, A. S. Nayak, and P. Glynn. Oxygenation-sensitive contrast in magnetic resonance image of rodent brain at high magnetic fields. Magnetic Resonance in Medicine, 14:68–78, 1990.
E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962. doi: 10.1214/aoms/1177704472.
T. L. Pedersen, S. Hughes, and X. Qiu. densityClust: Clustering by Fast Search and Find of Density Peaks, 2017. URL https://CRAN.R-project.org/package=densityClust. R package version 0.3.
A. D. Peterson, A. P. Ghosh, and R. Maitra. Merging k-means with hierarchical clustering for identifying general-shaped groups. Stat, 7(1):e172, 2018. doi: 10.1002/sta4.172.
T. Piran. The physics of gamma-ray bursts. Reviews of Modern Physics, 76:1143–1210, 2005. doi: 10.1103/RevModPhys.76.1143.
R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2018. URL http://www.R-project.org. ISBN 3-900051-07-0.
A. E. Raftery and N. Dean. Variable selection for model-based clustering. Journal of the American Statistical Association, 101:168–178, 2006.
D. B. Ramey. Nonparametric clustering techniques. In Encyclopedia of Statistical Science, volume 6, pages 318–319. Wiley, New York, 1985.
R.-D. Reiss. Nonparametric estimation of smooth distribution functions. Scandinavian Journal of Statistics, pages 116–119, 1981.
A. Rodriguez and A. Laio. Clustering by fast search and find of density peaks. Science, 344(6191):1492–1496, 2014.
M. Rosenblatt. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, 27(3):832–837, 1956. doi: 10.1214/aoms/1177728190.
L. Rüschendorf. Mathematical Risk Analysis. Springer-Verlag, Berlin Heidelberg, 2013. doi: 10.1007/978-3-642-33590-7.
S. Aeberhard, D. Coomans, and O. de Vel. Comparison of classifiers in high dimensional settings. Technical Report 92-02, Department of Computer Science and Department of Mathematics and Statistics, James Cook University of North Queensland, 1992.
F. S. Samaria and A. C. Harter. Parameterization of a stochastic model for human face identification. In Proceedings of the Second IEEE Workshop on Applications of Computer Vision, pages 138–142, Sarasota, Florida, 1994.
M. P. Sampat, Z. Wang, S. Gupta, A. C. Bovik, and M. K. Markey. Complex wavelet structural similarity: A new image similarity index. IEEE Transactions on Image Processing, 18(11):2385–2401, 2009. doi: 10.1109/TIP.2009.2025923.
O. Scaillet. Density estimation using inverse and reciprocal inverse Gaussian kernels. Nonparametric Statistics, 16(1-2):217–226, 2004.
G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978.
B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, London, 1986. ISBN 0-412-24620-1.
M. Smieja and M. Wiercioch. Constrained clustering with a complex cluster structure. Advances in Data Analysis and Classification, 11(3):493–518, 2017.
D. Steinley. Properties of the Hubert-Arabie adjusted Rand index. Psychological Methods, 9(3):386, 2004.
W. Stuetzle and R. Nugent. A generalized single linkage method for estimating the cluster tree of a density. Journal of Computational and Graphical Statistics, 2010. doi: 10.1198/jcgs.2009.07049.
C. A. Sugar and G. M. James. Finding the number of clusters in a dataset. Journal of the American Statistical Association, 98(463), 2003.
B. Thirion, G. Varoquaux, E. Dohmatob, and J.-B. Poline. Which fMRI clustering gives good brain parcellations? Frontiers in Neuroscience, 8:167, 2014. ISSN 1662-453X. doi: 10.3389/fnins.2014.00167. URL https://www.frontiersin.org/article/10.3389/fnins.2014.00167.
D. Titterington, A. Smith, and U. Makov. Statistical Analysis of Finite Mixture Distributions. John Wiley & Sons, Chichester, U.K., 1985.
C. Tortora, A. El Sherbiny, R. P. Browne, B. C. Franczak, P. D. McNicholas, and D. D. Amos. MixGHD: Model Based Clustering, Classification and Discriminant Analysis Using the Mixture of Generalized Hyperbolic Distributions, 2019. URL https://CRAN.R-project.org/package=MixGHD. R package version 2.3.2.
U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
K. Wagstaff. Clustering with missing values: No imputation required. In Classification, Clustering, and Data Mining Applications, pages 649–658. Springer, 2004.
M. P. Wand and M. C. Jones. Kernel Smoothing. Chapman & Hall/CRC, London, 1995. ISBN 0-412-55270-1.
R. Xu and D. C. Wunsch. Clustering. John Wiley & Sons, Hoboken, NJ, 2009.
E.-J. Yeoh, M. E. Ross, S. A. Shurtleff, W. Williams, D. Patel, R. Mahfouz, F. G. Behm, S. C. Raimondi, M. V. Relling, A. Patel, C. Cheng, D. Campana, D. Wilkins, X. Zhou, J. Li, H. Liu, C.-H. Pui, W. E. Evans, C. Naeve, L. Wong, and J. R. Downing. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell, 1(2):133–143, 2002. ISSN 1535-6108. doi: https://doi.org/10.1016/S1535-6108(02)00032-6.
C. T. Zahn. Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, C-20(1):68–86, 1971.
Y. Zhu, F. Dai, and R. Maitra. Three-dimensional radial visualization of high-dimensional continuous or discrete datasets. arXiv e-prints:1905.09505, 2019.