# Minimum Density Hyperplanes

Journal of Machine Learning Research 17 (2016) 1-33. Submitted 6/15; Revised 9/16; Published 9/16.

Nicos G. Pavlidis (n.pavlidis@lancaster.ac.uk), Department of Management Science, Lancaster University, Lancaster, LA1 4YX, UK
David P. Hofmeyr (d.hofmeyr@lancaster.ac.uk), Department of Mathematics and Statistics, Lancaster University, Lancaster, LA1 4YF, UK
Sotiris K. Tasoulis (s.tasoulis@ljmu.ac.uk), Department of Applied Mathematics, Liverpool John Moores University, Liverpool, L3 3AF, UK

Editor: Andreas Krause

Abstract

Associating distinct groups of objects (clusters) with contiguous regions of high probability density (high-density clusters) is central to many statistical and machine learning approaches to the classification of unlabelled data. We propose a novel hyperplane classifier for clustering and semi-supervised classification which is motivated by this objective. The proposed minimum density hyperplane minimises the integral of the empirical probability density function along it, thereby avoiding intersection with high-density clusters. We show that the minimum density and the maximum margin hyperplanes are asymptotically equivalent, thus linking this approach to maximum margin clustering and semi-supervised support vector classifiers. We propose a projection pursuit formulation of the associated optimisation problem which allows us to find minimum density hyperplanes efficiently in practice, and evaluate its performance on a range of benchmark data sets. The proposed approach is found to be very competitive with state-of-the-art methods for clustering and semi-supervised classification.

Keywords: low-density separation, high-density clusters, clustering, semi-supervised classification, projection pursuit

1. Introduction

We study the following fundamental learning problem: given a random sample from an unknown probability distribution with no, or partial, label information, identify a separating hyperplane that avoids splitting any of the distinct groups (clusters) present in the sample. We adopt the cluster definition given by Hartigan (1975, chap. 11), in which a high-density cluster is defined as a maximally connected component of the level set of the probability density function, $p(x)$, at level $c \geq 0$,
$$\mathrm{lev}_c\, p(x) = \left\{ x \in \mathbb{R}^d \mid p(x) > c \right\}.$$

An important advantage of this approach over other methods is that it is well founded from a statistical perspective, in the sense that a well-defined population quantity is being estimated. However, since $p(x)$ is typically unknown, detecting high-density clusters necessarily involves estimates of this function, and standard approaches to nonparametric density estimation are reliable only in low dimensions. A number of existing density clustering algorithms approximate the level sets of the empirical density through a union of spheres around points whose estimated density exceeds a user-defined threshold (Walther, 1997; Cuevas et al., 2000, 2001; Rinaldo and Wasserman, 2010). The choice of this threshold affects both the shape and number of detected clusters, while an appropriate threshold is typically not known in advance. The performance of these methods deteriorates sharply as dimensionality increases, unless the clusters are assumed to be clearly discernible (Rinaldo and Wasserman, 2010).
An alternative is to consider the more specific problem of allocating observations to clusters, which shifts the focus to local properties of the density rather than its global approximation. The central idea underlying such methods is that if a pair of observations belong to the same cluster they must be connected through a path traversing only high-density regions. Graph theory is a natural choice to address this type of problem. Azzalini and Torelli (2007), Stuetzle and Nugent (2010) and Menardi and Azzalini (2014) have recently proposed algorithms based on this approach. Even these approaches, however, are limited to problems of low dimensionality by the standards of current applications (Menardi and Azzalini, 2014).

An equivalent formulation of the density clustering problem is to assume that clusters are separated by contiguous regions of low probability density; this is known as the low-density separation assumption. In both clustering and semi-supervised classification, identifying the hyperplane with the maximum margin is considered a direct implementation of the low-density separation approach. Motivated by the success of support vector machines (SVMs) in classification, maximum margin clustering (MMC) (Xu et al., 2004) seeks the maximum margin hyperplane to perform a binary partition (bi-partition) of unlabelled data. MMC can be equivalently viewed as seeking the binary labelling of the data sample that will maximise the margin of an SVM estimated using the assigned labels.

In a plethora of applications data can be collected cheaply and automatically, while labelling observations is a manual task that can be performed for only a small proportion of the data. Semi-supervised classifiers attempt to exploit the abundant unlabelled data to improve the generalisation error over using only the scarce labelled examples. Unlabelled data provide additional information about the marginal density, $p(x)$, but this is beneficial only insofar as it improves the inference of the class conditional density, $p(x|y)$. Semi-supervised classification relies on the assumption that a relationship between $p(x)$ and $p(x|y)$ exists. The most frequently assumed relationship is that high-density clusters are associated with a single class (cluster assumption), or equivalently that class boundaries pass through low-density regions (low-density separation assumption). The most widely used semi-supervised classifier based on the low-density separation assumption is the semi-supervised support vector machine (S3VM) (Vapnik and Sterin, 1977; Joachims, 1999; Chapelle and Zien, 2005). S3VMs implement the low-density separation assumption by partitioning the data according to the maximum margin hyperplane with respect to both labelled and unlabelled data.

Encouraging theoretical results for semi-supervised classification have been obtained under the cluster assumption. If $p(x)$ is a mixture of class conditional distributions, Castelli and Cover (1995, 1996) have shown that the generalisation error will be reduced exponentially in the number of labelled examples if the mixture is identifiable. More recently, Singh et al. (2009) showed that the mixture components can be identified if $p(x)$ is a mixture of a finite number of smooth density functions, and the separation between mixture components is large.
Rigollet (2007) considers the cluster assumption in a nonparametric setting, that is, in terms of density level sets, and shows that the generalisation error of a semi-supervised classifier decreases exponentially given a sufficiently large number of unlabelled data. However, the cluster assumption is difficult to verify with a limited number of labelled examples. Furthermore, the algorithms proposed by Rigollet (2007) and Singh et al. (2009) are difficult to implement efficiently even if the cluster assumption holds. This renders them impractical for real-world problems (Ji et al., 2012).

Although intuitive, the claim that maximising the margin over (labelled and) unlabelled data is equivalent to identifying the hyperplane that passes through regions with the lowest possible probability density has received surprisingly little attention. The work of Ben-David et al. (2009) is the only attempt we are aware of to theoretically investigate this claim. Ben-David et al. (2009) quantify the notion of a low-density separator by defining the density on a hyperplane as the integral of the probability density function along the hyperplane. They study the existence of universally consistent algorithms to compute the hyperplane with minimum density. The maximum hard margin classifier is shown to be consistent only in one-dimensional problems. In higher dimensions only a soft-margin algorithm is a consistent estimator of the minimum density hyperplane. Ben-David et al. (2009) do not provide an algorithm to compute low-density hyperplanes.

This paper introduces a novel approach to clustering and semi-supervised classification which directly identifies low-density hyperplanes in the finite sample setting. In this approach the density on a hyperplane criterion proposed by Ben-David et al. (2009) is directly minimised with respect to a kernel density estimator that employs isotropic Gaussian kernels. The density on a hyperplane provides a uniform upper bound on the value of the empirical density at points that belong to the hyperplane. This bound is tight and proportional to the density on the hyperplane. Therefore, the smallest upper bound on the value of the empirical density on a hyperplane is achieved by hyperplanes that minimise the density on a hyperplane criterion. An important feature of the proposed approach is that the density on a hyperplane can be evaluated exactly through a one-dimensional kernel density estimator, constructed from the projections of the data sample onto the vector normal to the hyperplane. This renders the computation of minimum density hyperplanes tractable even in high dimensional applications. We establish a connection between the minimum density hyperplane and the maximum margin hyperplane in the finite sample setting. In particular, as the bandwidth of the kernel density estimator is reduced towards zero, the minimum density hyperplane converges to the maximum margin hyperplane. An intermediate result establishes that there exists a positive bandwidth such that the partition of the data sample induced by the minimum density hyperplane is identical to that of the maximum margin hyperplane.

The remainder of the paper is organised as follows. The formulation of the minimum density hyperplane problem, as well as its basic properties, is presented in Section 2. Section 3 establishes the connection between minimum density hyperplanes and maximum margin hyperplanes.
Section 4 discusses the estimation of minimum density hyperplanes and the computational complexity of the resulting algorithm. Experimental results are presented in Section 5, followed by concluding remarks and future research directions in Section 6.

2. Problem Formulation

We study the problem of estimating a hyperplane to partition a finite data set, $\mathcal{X} = \{x_i\}_{i=1}^n \subset \mathbb{R}^d$, without splitting any of the high-density clusters present. We assume that $\mathcal{X}$ is an i.i.d. sample of a random variable $X$ on $\mathbb{R}^d$, with unknown probability density function $p : \mathbb{R}^d \to \mathbb{R}_+$. A hyperplane is defined as $H(v, b) := \{x \in \mathbb{R}^d \mid v^\top x = b\}$, where without loss of generality we restrict attention to hyperplanes with unit normal vector, i.e., those parameterised by $(v, b) \in S^{d-1} \times \mathbb{R}$, where $S^{d-1} = \{v \in \mathbb{R}^d \mid \|v\| = 1\}$. Following Ben-David et al. (2009) we define the density on the hyperplane $H(v, b)$ as the integral of the probability density function along the hyperplane,
$$I(v, b) := \int_{H(v,b)} p(x)\, dx. \qquad (1)$$
We approximate $p(x)$ through a kernel density estimator with isotropic Gaussian kernels,
$$\hat p(x \mid \mathcal{X}, h^2 I) = \frac{1}{n (2\pi h^2)^{d/2}} \sum_{i=1}^n \exp\left\{-\frac{\|x - x_i\|^2}{2h^2}\right\}. \qquad (2)$$
This class of kernel density estimators has the useful property that the integral in Equation (1) can be evaluated exactly by projecting $\mathcal{X}$ onto $v$, constructing a one-dimensional density estimator with Gaussian kernels and bandwidth $h$, and evaluating that density at $b$,
$$\hat I(v, b \mid \mathcal{X}, h^2 I) := \int_{H(v,b)} \hat p(x \mid \mathcal{X}, h^2 I)\, dx = \frac{1}{n\sqrt{2\pi h^2}} \sum_{i=1}^n \exp\left\{-\frac{(b - v^\top x_i)^2}{2h^2}\right\} = \hat p\left(b \,\middle|\, \{v^\top x_i\}_{i=1}^n, h^2\right). \qquad (3)$$
The univariate kernel estimator $\hat p(\cdot \mid \{v^\top x_i\}_{i=1}^n, h^2)$ approximates the projected density on $v$, that is, the density function of the random variable $X_v = X^\top v$. Henceforth we use $\hat I(v, b)$ to approximate $I(v, b)$. To simplify terminology we refer to $\hat I(v, b)$ as the density on $H(v, b)$, or the density integral on $H(v, b)$, rather than the empirical density, or the empirical density integral, respectively. For notational convenience we write $\hat I(v, b)$ for $\hat I(v, b \mid \mathcal{X}, h^2 I)$ when $\mathcal{X}$ and $h$ are apparent from context.

The following lemma, adapted from (Tasoulis et al., 2010, Lemma 3), shows that $\hat I(v, b)$ provides an upper bound for the maximum value of the empirical density at any point that belongs to the hyperplane.

Lemma 1 Let $\mathcal{X} = \{x_i\}_{i=1}^n \subset \mathbb{R}^d$, and let $\hat p(x \mid \mathcal{X}, h^2 I)$ be a kernel density estimator with isotropic Gaussian kernels. Then, for any $(v, b) \in S^{d-1} \times \mathbb{R}$,
$$\max_{x \in H(v,b)} \hat p(x \mid \mathcal{X}, h^2 I) \leq \left(2\pi h^2\right)^{\frac{1-d}{2}} \hat I(v, b). \qquad (4)$$

This lemma shows that a hyperplane, $H(v, b)$, cannot intersect level sets of the empirical density with level higher than $(2\pi h^2)^{\frac{1-d}{2}} \hat I(v, b)$. The proof of the lemma relies on the fact that projection contracts distances, and follows from simple algebra. In Equation (4) equality holds if and only if there exist $\tilde{x} \in H(v, b)$ and $c \in \mathbb{R}^n$ such that every $x_i \in \mathcal{X}$ can be written as $x_i = \tilde{x} + c_i v$. It is therefore not possible to obtain a uniform upper bound on the value of the empirical density at points that belong to $H(v, b)$ that is lower than $(2\pi h^2)^{\frac{1-d}{2}} \hat I(v, b)$ using only one-dimensional projections. Since the upper bound of Lemma 1 is tight and proportional to $\hat I(v, b)$, minimising the density on the hyperplane leads to the lowest upper bound on the maximum value of the empirical density along the hyperplane separator.
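Equation (3) means that the density on a hyperplane can be evaluated with a univariate kernel density estimator of the projected data, with no $d$-dimensional integration. The following minimal sketch (NumPy only, illustrative names; not the authors' implementation) evaluates $\hat I(v, b)$ and the corresponding upper bound of Lemma 1.

```python
import numpy as np

def density_on_hyperplane(X, v, b, h):
    """I_hat(v, b) of Equation (3): a 1-D Gaussian KDE of the projections v'x_i, evaluated at b."""
    proj = X @ v                                   # projections onto the unit normal vector v
    return np.mean(np.exp(-(b - proj) ** 2 / (2 * h ** 2))) / (h * np.sqrt(2 * np.pi))

def density_upper_bound(X, v, b, h):
    """Upper bound of Lemma 1 on the d-dimensional KDE at any point of H(v, b)."""
    d = X.shape[1]
    return (2 * np.pi * h ** 2) ** (-(d - 1) / 2) * density_on_hyperplane(X, v, b, h)

# Small usage example on a synthetic two-cluster sample (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (100, 2)), rng.normal(2.0, 1.0, (100, 2))])
v = np.array([1.0, 1.0]) / np.sqrt(2.0)            # candidate unit normal vector
print(density_on_hyperplane(X, v, 0.0, 0.5), density_upper_bound(X, v, 0.0, 0.5))
```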
To obtain hyperplane separators that are meaningful for clustering and semi-supervised classification, it is necessary to constrain the set of feasible solutions, because the density on a hyperplane can be made arbitrarily low by considering a hyperplane that intersects only the tail of the density. In other words, for any $v$, $\hat I(v, b)$ can be made arbitrarily low for sufficiently large $|b|$. In both problems the constraints restrict the feasible set to a subset of the hyperplanes that intersect the interior of the convex hull of $\mathcal{X}$. In detail, let $\mathrm{conv}\,\mathcal{X}$ denote the convex hull of $\mathcal{X}$, and assume $\mathrm{Int}(\mathrm{conv}\,\mathcal{X}) \neq \emptyset$. Define $C$ to be the set of hyperplanes that intersect $\mathrm{Int}(\mathrm{conv}\,\mathcal{X})$,
$$C = \left\{ H(v, b) \mid (v, b) \in S^{d-1} \times \mathbb{R},\ \exists z \in \mathrm{Int}(\mathrm{conv}\,\mathcal{X}) \text{ s.t. } v^\top z = b \right\}. \qquad (5)$$
Then denote by $\mathcal{F}$ the set of feasible hyperplanes, where $\mathcal{F} \subseteq C$. We define the minimum density hyperplane (MDH), $H(v^\star, b^\star) \in \mathcal{F}$, to satisfy
$$\hat I(v^\star, b^\star) = \min_{(v,b)\,:\,H(v,b) \in \mathcal{F}} \hat I(v, b). \qquad (6)$$
In the following subsections we discuss the specific formulations for clustering and semi-supervised classification in turn.

2.1 Clustering

Since high-density clusters are formed around the modes of $p(x)$, the convex hull of these modes would be a natural choice to define the set of feasible hyperplanes. Unfortunately, this convex hull is unknown and difficult to estimate. We instead propose to constrain the distance of hyperplanes from the origin, $b$. Such a constraint is inevitable since for any $v \in S^{d-1}$, $\hat I(v, b)$ can become arbitrarily close to zero for sufficiently large $|b|$. Obviously, such hyperplanes are inappropriate for the purposes of bi-partitioning as they assign all the data to the same partition. Rather than fixing $b$ to a constant, we constrain it to the interval
$$F(v) = \left[\mu_v - \alpha\sigma_v,\ \mu_v + \alpha\sigma_v\right], \qquad (7)$$
where $\mu_v$ and $\sigma_v$ denote the mean and standard deviation, respectively, of the projections $\{v^\top x_i\}_{i=1}^n$. The parameter $\alpha \geq 0$ controls the width of the interval, and has a probabilistic interpretation through Chebyshev's inequality. Smaller values of $\alpha$ favour more balanced partitions of the data, at the risk of excluding low density hyperplanes that separate clusters more effectively. On the other hand, increasing $\alpha$ increases the risk of separating out only a few outlying observations. We discuss in detail how to set this parameter in the experimental results section. If $\mathrm{Int}(\mathrm{conv}\,\mathcal{X}) \neq \emptyset$, then there exists $\alpha > 0$ such that the set of feasible hyperplanes for clustering, $\mathcal{F}_{CL}$, satisfies
$$\mathcal{F}_{CL} = \left\{ H(v, b) \mid (v, b) \in S^{d-1} \times \mathbb{R},\ b \in F(v) \right\} \subseteq C, \qquad (8)$$
where $C$ is the set of hyperplanes that intersect $\mathrm{Int}(\mathrm{conv}\,\mathcal{X})$, as defined in Equation (5). The minimum density hyperplane for clustering is the solution to the following constrained optimisation problem:
$$\min_{(v,b) \in S^{d-1} \times \mathbb{R}} \hat I(v, b), \qquad (9a)$$
$$\text{subject to: } b - \mu_v - \alpha\sigma_v \leq 0, \qquad (9b)$$
$$\mu_v - \alpha\sigma_v - b \leq 0. \qquad (9c)$$
Since the objective function and the constraints are continuously differentiable, MDHs can be estimated through constrained optimisation methods like sequential quadratic programming (SQP). Unfortunately the problem of local minima, due to the nonconvexity of the objective function, seriously hinders the effectiveness of this approach. To mitigate this we propose a parameterised optimisation formulation, which gives rise to a projection pursuit approach.

Projection pursuit methods optimise a measure of interestingness of a linear projection of a data sample, known as the projection index. For our problem the natural choice of projection index for $v$ is the minimum value of the projected density within the feasible region, $\min_{b \in F(v)} \hat I(v, b)$. This index gives the minimum density integral of feasible hyperplanes with normal vector $v$. To ensure the differentiability of the projection index we incorporate a penalty term into the objective function.
We define the penalised density integral as
$$f_{CL}(v, b) = \hat I(v, b) + \frac{L}{\eta^\epsilon} \max\left\{0,\ \mu_v - \alpha\sigma_v - b,\ b - \mu_v - \alpha\sigma_v\right\}^{1+\epsilon}, \qquad (10)$$
where $L = \left(e^{1/2} h^2 \sqrt{2\pi}\right)^{-1} \geq \sup_{b \in \mathbb{R}} \left|\partial \hat I(v, b)/\partial b\right|$, $\epsilon \in (0, 1)$ is a constant that ensures the penalty function is everywhere continuously differentiable, and $\eta \in (0, 1)$. Other penalty functions are possible, but we only consider the above due to its simplicity, and the fact that its parameters offer a direct interpretation: $L$ in terms of the derivative of the projected density on $v$, and $\eta$ in terms of the desired accuracy of the minimisers of $f_{CL}(v, b)$ relative to the minimisers of Equation (9), as discussed in the following proposition.

Proposition 2 For $v \in S^{d-1}$, define the sets of minimisers
$$B(v) = \arg\min_{b \in F(v)} \hat I(v, b), \qquad (11)$$
$$B_C(v) = \arg\min_{b \in \mathbb{R}} f_{CL}(v, b). \qquad (12)$$
For every $b^\star \in B(v)$ there exists $b^\star_C \in B_C(v)$ such that $|b^\star - b^\star_C| \leq \eta$. Moreover, there are no minimisers of $f_{CL}(v, b)$ outside the interval $[\mu_v - \alpha\sigma_v - \eta,\ \mu_v + \alpha\sigma_v + \eta]$, i.e., $B_C(v) \cap \left(\mathbb{R} \setminus [\mu_v - \alpha\sigma_v - \eta,\ \mu_v + \alpha\sigma_v + \eta]\right) = \emptyset$.

Proof Any minimiser in the interior of the feasible region, $b^\star \in B(v) \cap \mathrm{Int}(F(v))$, also minimises the penalised function, since $f_{CL}(v, b) = \hat I(v, b)$ for all $b \in \mathrm{Int}(F(v))$, hence $b^\star \in B_C(v)$. Next we consider the case when either or both of the boundary points of $F(v)$, $b_- = \mu_v - \alpha\sigma_v$ and $b_+ = \mu_v + \alpha\sigma_v$, are contained in $B(v)$. It suffices to show that $f_{CL}(v, b) > \hat I(v, b_-)$ for all $b < b_- - \eta$, and $f_{CL}(v, b) > \hat I(v, b_+)$ for all $b > b_+ + \eta$. We discuss only the case $b > b_+ + \eta$, as the treatment of $b < b_- - \eta$ is identical. Assume that $\hat I(v, b) < \hat I(v, b_+)$ (since in the opposite case the result follows immediately: $f_{CL}(v, b) > \hat I(v, b) \geq \hat I(v, b_+)$). From the mean value theorem there exists $\xi \in (b_+, b)$ such that
$$\hat I(v, b_+) = \hat I(v, b) - (b - b_+)\left.\frac{\partial \hat I(v, b)}{\partial b}\right|_{b=\xi} \leq \hat I(v, b) + (b - b_+) L < \hat I(v, b) + \frac{L}{\eta^\epsilon}(b - b_+)^{1+\epsilon} = f_{CL}(v, b).$$
In the above we used the following facts: $\left.\partial \hat I(v, b)/\partial b\right|_{b=\xi} < 0$, $L \geq \sup_{b \in \mathbb{R}} |\partial \hat I(v, b)/\partial b|$, and $b - b_+ > \eta$. ∎

We define the projection index for the clustering problem as the minimum of the penalised density integral,
$$\phi_{CL}(v) = \min_{b \in \mathbb{R}} f_{CL}(v, b). \qquad (13)$$
Since the optimisation problem of Equation (13) is one-dimensional it is simple to compute the set of global minimisers $B_C(v)$. As we discuss in Section 4, this is necessary to compute directional derivatives of the projection index, as well as to determine whether $\phi_{CL}$ is differentiable. We call the optimisation of $\phi_{CL}$ minimum density projection pursuit (MDP2). For each $v$, MDP2 considers only the optimal choice of $b$. This enables it to avoid local minima of $\hat I(v, \cdot)$. Most importantly, MDP2 is able to accommodate a discontinuous change in the location of the global minimiser(s), $\arg\min_{b \in \mathbb{R}} f_{CL}(v, b)$, as $v$ changes. Neither of the above can be achieved when the optimisation is jointly over $(v, b)$, as in the original constrained optimisation problem, Equation (9). The projection index $\phi_{CL}$ is continuous, but it is not guaranteed to be everywhere differentiable when $B_C(v)$ is not a singleton. The resulting optimisation problem is therefore nonsmooth and nonconvex.
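A minimal, self-contained sketch of Equations (10) and (13) follows. It is an illustration under two simplifying assumptions: the one-dimensional minimisation over $b$ is done by plain grid search rather than the grid-plus-bisection scheme described in Section 4, and the returned minimiser is assumed unique.

```python
import numpy as np

def penalised_integral_cl(X, v, b, h, alpha, eta=1e-2, eps=1 - 1e-6):
    """Penalised density integral f_CL(v, b) of Equation (10)."""
    proj = X @ v
    mu, sigma = proj.mean(), proj.std()
    # 1-D KDE of the projections evaluated at b (Equation (3))
    kde_at_b = np.mean(np.exp(-(b - proj) ** 2 / (2 * h ** 2))) / (h * np.sqrt(2 * np.pi))
    L = 1.0 / (np.exp(0.5) * h ** 2 * np.sqrt(2 * np.pi))   # bound on |d I_hat / d b|
    violation = max(0.0, (mu - alpha * sigma) - b, b - (mu + alpha * sigma))
    return kde_at_b + (L / eta ** eps) * violation ** (1 + eps)

def projection_index_cl(X, v, h, alpha, grid_size=200):
    """phi_CL(v) of Equation (13), approximated by a grid search over b."""
    proj = X @ v
    grid = np.linspace(proj.min(), proj.max(), grid_size)
    vals = np.array([penalised_integral_cl(X, v, b, h, alpha) for b in grid])
    i = int(np.argmin(vals))
    return vals[i], grid[i]          # index value and the corresponding split point b
```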
To illustrate the effectiveness of MDP2 in estimating MDHs, we compare this approach with a direct optimisation of the constrained problem in Equation (9) using SQP. To enable visualisation we consider the two-dimensional S1 data set (Fränti and Virmajoki, 2006), constructed by sampling from a Gaussian mixture distribution with fifteen components, where each component corresponds to a cluster. Figure 1 depicts the MDHs obtained over 100 random initialisations of SQP and MDP2. It is evident that SQP frequently yields hyperplanes that intersect regions with high probability density, thus splitting clusters. As SQP always converged in these experiments, the poor performance is solely due to convergence to local minima. In contrast, MDP2 converges to three different solutions over the 100 experiments, all of which induce high quality partitions, and none intersects a high-density cluster.

Figure 1: Binary partitions induced by 100 MDHs estimated through (a) SQP and (b) MDP2.

Figure 2: Projection index $\phi_{CL}(\theta)$ for the S1 data set over projection angles $\theta \in [0, \pi]$ (upper panel), and histograms of the solutions obtained through SQP and MDP2 (lower panel).

In polar coordinates any $v \in S^1$ can be parameterised through a single projection angle. Using this parameterisation, the upper plot of Figure 2 depicts the value of the projection index, $\phi_{CL}(v(\theta))$, for $\theta \in [0, \pi]$. The lower plot of the figure provides histograms of the distribution of the solutions (locally optimal projection angles) obtained over the 100 experiments with SQP (grey) and MDP2 (white). The figure shows that $\phi_{CL}(v)$ is continuous but not everywhere differentiable. The solution most frequently obtained through MDP2 corresponds to the global optimum, while the only other two solutions identified are the local minimisers with the next two lowest function values. In contrast, SQP converges to a much wider range of solutions. Note that this method is not guaranteed to identify the optimal value of $b$ for any $v(\theta)$, and this indeed occurs in this example. Therefore the value of $\phi_{CL}(v)$ is a lower bound for the function values of the minimisers identified through SQP.

2.2 Semi-Supervised Classification

In semi-supervised classification labels are available for a subset of the data sample. The resulting classifier needs to predict the labelled examples as accurately as possible, while avoiding intersection with high-density regions of the empirical density. The MDH formulation can readily accommodate partially labelled data by incorporating the linear constraints associated with the labelled data into the clustering formulation. Without loss of generality assume that the first $\ell$ examples are labelled by $y = (y_1, \ldots, y_\ell) \in \{-1, 1\}^\ell$. The MDH for semi-supervised classification is the solution to the problem
$$\min_{(v,b) \in S^{d-1} \times \mathbb{R}} \hat I(v, b), \qquad (14a)$$
$$\text{subject to: } y_i(v^\top x_i - b) \geq 0, \quad i = 1, \ldots, \ell, \qquad (14b)$$
$$b - \mu_v - \alpha\sigma_v \leq 0, \qquad (14c)$$
$$\mu_v - \alpha\sigma_v - b \leq 0, \qquad (14d)$$
where $\hat I(v, b)$, $\mu_v$, and $\sigma_v$ are computed over the entire data set. If the labelled examples are linearly separable the constraints in Equation (14) define a nonempty feasible set of hyperplanes,
$$\mathcal{F}_{LB} = \left\{ H(v, b) \mid (v, b) \in S^{d-1} \times \mathbb{R},\ b \in F(v),\ y_i(v^\top x_i - b) \geq 0,\ i \in \{1, \ldots, \ell\} \right\} \subseteq C. \qquad (15)$$
Equations (14c) and (14d) act as a balancing constraint which discourages MDHs that classify the vast majority of unlabelled data to a single class. Balancing constraints are included in the estimation of S3VMs for the same reason (Joachims, 1999; Chapelle and Zien, 2005). As in the case of clustering, the direct minimisation of Equation (14) frequently leads to locally optimal solutions. To mitigate this we again propose a projection pursuit formulation.
We define the penalised density integral for semi-supervised classification as
$$f_{SSC}(v, b) = f_{CL}(v, b) + \gamma \sum_{i=1}^{\ell} \max\left\{0,\ -y_i(v^\top x_i - b)\right\}^{1+\epsilon}, \qquad (16)$$
where $\gamma > 0$ is a user-defined constant which controls the trade-off between reducing the density on the hyperplane and misclassifying the labelled examples. The projection index is then defined as the minimum of the penalised density integral,
$$\phi_{SSC}(v) = \min_{b \in \mathbb{R}} f_{SSC}(v, b). \qquad (17)$$

3. Connection to Maximum Margin Hyperplanes

In this section we discuss the connection between MDHs and maximum (hard) margin hyperplane separators. The margin of a hyperplane $H(v, b)$ with respect to a data set $\mathcal{X}$ is defined as the minimum Euclidean distance between the hyperplane and its nearest datum,
$$\mathrm{margin}\, H(v, b) = \min_{x \in \mathcal{X}} |v^\top x - b|. \qquad (18)$$
The points whose distance to the hyperplane $H(v, b)$ is equal to the margin of the hyperplane, that is, $\arg\min_{x \in \mathcal{X}} |v^\top x - b|$, are called the support points of $H(v, b)$. Let $\mathcal{F}$ denote the set of feasible hyperplanes; then the maximum margin hyperplane (MMH), $H(v_m, b_m) \in \mathcal{F}$, satisfies
$$\mathrm{margin}\, H(v_m, b_m) = \max_{(v,b)\,:\,H(v,b) \in \mathcal{F}} \mathrm{margin}\, H(v, b). \qquad (19)$$
The main result of this section is Theorem 5, which states that as the bandwidth parameter, $h$, is reduced to zero the MDH converges to the MMH. An intermediate result, Lemma 4, shows that there exists a positive bandwidth, $h' > 0$, such that for all $h \in (0, h')$ the partition of the data set induced by the MDH is identical to that of the MMH.

We first discuss some assumptions which allow us to present the theoretical results of this section. As before we assume a fixed and finite data set $\mathcal{X} \subset \mathbb{R}^d$, and approximate its (assumed) underlying probability density function via a kernel density estimator using Gaussian kernels with isotropic bandwidth matrix $h^2 I$. We assume that the interior of the convex hull of the data, $\mathrm{Int}(\mathrm{conv}\,\mathcal{X})$, is non-empty, and define $C$ as the set of hyperplanes that intersect $\mathrm{Int}(\mathrm{conv}\,\mathcal{X})$, as in Equation (5). The set of feasible hyperplanes, $\mathcal{F}$, for either clustering or semi-supervised classification satisfies $\mathcal{F} \subseteq C$. By construction every $H(v, b) \in \mathcal{F}$ defines a hyperplane which partitions $\mathcal{X}$ into two non-empty subsets. Observe that if for each $v \in S^{d-1}$ the set $\{b \in \mathbb{R} \mid H(v, b) \in \mathcal{F}\}$ is compact, then by the compactness of $S^{d-1}$ a maximum margin hyperplane in $\mathcal{F}$ exists. For both the clustering and semi-supervised classification problems this compactness holds by construction.

For any $h > 0$, let $(v^\star_h, b^\star_h) \in S^{d-1} \times \mathbb{R}$ parameterise a hyperplane which achieves the minimal density integral over all hyperplanes in $\mathcal{F}$, for bandwidth matrix $h^2 I$. That is,
$$\hat I(v^\star_h, b^\star_h) = \min_{(v,b)\,:\,H(v,b) \in \mathcal{F}} \hat I(v, b). \qquad (20)$$
Following the approach of Tong and Koller (2000), we first show that as the bandwidth, $h$, is reduced towards zero, the density on a hyperplane is dominated by its nearest point. This is achieved by establishing that for all sufficiently small values of $h$, a hyperplane with non-zero margin has lower density integral than any other hyperplane with smaller margin.

Lemma 3 Take $H(v, b) \in \mathcal{F}$ with non-zero margin and $0 < \delta < \mathrm{margin}\, H(v, b) := M_{v,b}$. Then $\exists h' > 0$ such that $h \in (0, h')$ and $M_{w,c} := \mathrm{margin}\, H(w, c) \leq M_{v,b} - \delta$ imply $\hat I(v, b) < \hat I(w, c)$.

Proof Using Equation (3) it is easy to see that
$$\frac{\hat I(v, b)}{\inf\left\{\hat I(w, c) \mid M_{w,c} \leq M_{v,b} - \delta\right\}} \leq \frac{\frac{1}{h\sqrt{2\pi}}\exp\left\{-\frac{M_{v,b}^2}{2h^2}\right\}}{\frac{1}{nh\sqrt{2\pi}}\exp\left\{-\frac{(M_{v,b}-\delta)^2}{2h^2}\right\}},$$
so that
$$0 \leq \lim_{h \to 0^+} \frac{\hat I(v, b)}{\inf\left\{\hat I(w, c) \mid M_{w,c} \leq M_{v,b} - \delta\right\}} \leq \lim_{h \to 0^+} n \exp\left\{-\frac{M_{v,b}^2}{2h^2} + \frac{(M_{v,b}-\delta)^2}{2h^2}\right\} = 0.$$
Therefore $\exists h' > 0$ such that $h \in (0, h')$ implies $\hat I(v, b) \,/\, \inf\left\{\hat I(w, c) \mid M_{w,c} \leq M_{v,b} - \delta\right\} < 1$. ∎
An immediate corollary of Lemma 3 is that as $h$ tends to zero the margin of the MDH tends to the maximum margin. However, this does not necessarily ensure the stronger result that the sequence of MDHs converges to the MMH. To establish this we require two technical results, which describe some algebraic properties of the MMH, and are provided as part of the proof of Theorem 5, given in Appendix A.

The next lemma uses the previous result to show that there exists a positive bandwidth, $h' > 0$, such that an MDH estimated using $h \in (0, h')$ induces the same partition of $\mathcal{X}$ as the MMH. The result assumes that the MMH is unique. Notice that if $\mathcal{X}$ is a sample of realisations of a continuous random variable then this uniqueness holds with probability 1.

Lemma 4 Suppose there is a unique hyperplane in $\mathcal{F}$ with maximum margin, which can be parameterised by $(v_m, b_m) \in S^{d-1} \times \mathbb{R}$. Then $\exists h' > 0$ s.t. $h \in (0, h')$ implies that $H(v^\star_h, b^\star_h)$ induces the same partition of $\mathcal{X}$ as $H(v_m, b_m)$.

Proof Let $M = \mathrm{margin}\, H(v_m, b_m)$, and let $P$ be the collection of hyperplanes that induce the same partition of $\mathcal{X}$ as that induced by $H(v_m, b_m)$. Since $\mathcal{X}$ is finite and $H(v_m, b_m)$ is unique, $\exists \delta > 0$ s.t. $H(w, c) \notin P \Rightarrow \mathrm{margin}\, H(w, c) \leq M - \delta$. By Lemma 3, $\exists h' > 0$ s.t.
$$h \in (0, h') \Rightarrow H(v^\star_h, b^\star_h) \notin \left\{H(w, c) \mid \mathrm{margin}\, H(w, c) \leq M - \delta\right\},$$
therefore $H(v^\star_h, b^\star_h) \in P$. ∎

The next theorem is the main result of this section, and states that the MDH converges to the MMH as the bandwidth parameter is reduced to zero. Notice that by the non-unique representation of hyperplanes, the maximum margin hyperplane has two parameterisations in $C$, namely $(v_m, b_m)$ and $(-v_m, -b_m)$. Convergence to the maximum margin hyperplane is therefore equivalent to showing that
$$\min\left\{\|(v^\star_h, b^\star_h) - (v_m, b_m)\|,\ \|(v^\star_h, b^\star_h) + (v_m, b_m)\|\right\} \to 0 \text{ as } h \to 0^+.$$

Theorem 5 Suppose there is a unique hyperplane in $\mathcal{F}$ with maximum margin, which can be parameterised by $(v_m, b_m) \in S^{d-1} \times \mathbb{R}$. Then
$$\lim_{h \to 0^+} \min\left\{\|(v^\star_h, b^\star_h) - (v_m, b_m)\|,\ \|(v^\star_h, b^\star_h) + (v_m, b_m)\|\right\} = 0.$$

The set $\mathcal{F}$ used in Theorem 5 is generic, so it can capture the constraints associated with both clustering and semi-supervised classification, Equations (9) and (14) respectively. In the case of semi-supervised classification we must also assume that the labelled data are linearly separable. Theorem 5 is not directly applicable to the MDP2 formulations, as in this case the function being minimised is not the density on a hyperplane. The next two subsections establish this result for the MDP2 formulations of the clustering and semi-supervised classification problems.

3.1 MDP2 for Clustering

We have shown that for the constrained optimisation formulation the MDH converges to the MMH within the feasible set, $\mathcal{F}_{CL} \subseteq C$. In addition, for a fixed $v$, Proposition 2 bounds the distance between minimisers of the penalised function $f_{CL}$, $\arg\min_{b \in \mathbb{R}} f_{CL}(v, b)$, and the optimal $b$ of the constrained problem, $\arg\min_{b \in F(v)} \hat I(v, b)$. Combining these we can show that the optimal solution to the penalised MDP2 formulation converges to the maximum margin hyperplane in $\mathcal{F}_{CL}$, provided the parameters within the penalty term suitably depend on the bandwidth parameter, $h$. While the general case can be shown, for ease of exposition we make the simplifying assumption that the maximum margin hyperplane is strictly feasible, i.e., if $(v_m, b_m)$ parameterises the maximum margin hyperplane then $b_m \in (\mu_{v_m} - \alpha\sigma_{v_m},\ \mu_{v_m} + \alpha\sigma_{v_m})$. For $h, \eta, L > 0$ define $(v^\star_{h,\eta,L}, b^\star_{h,\eta,L})$ to be any global minimiser of $f_{CL}$, i.e.,
$$f_{CL}(v^\star_{h,\eta,L}, b^\star_{h,\eta,L}) = \min_{(v,b) \in S^{d-1} \times \mathbb{R}} f_{CL}(v, b).$$
Lemma 6 Suppose there is a unique hyperplane in $\mathcal{F}_{CL}$ with maximum margin, which can be parameterised by $(v_m, b_m) \in S^{d-1} \times \mathbb{R}$. Suppose further that $b_m \in (\mu_{v_m} - \alpha\sigma_{v_m},\ \mu_{v_m} + \alpha\sigma_{v_m})$. For $h > 0$, let $L(h) = \left(e^{1/2} h^2 \sqrt{2\pi}\right)^{-1}$ and $0 < \eta(h) \leq h$. Then
$$\lim_{h \to 0^+} \min\left\{\|(v^\star_{h,\eta(h),L(h)}, b^\star_{h,\eta(h),L(h)}) - (v_m, b_m)\|,\ \|(v^\star_{h,\eta(h),L(h)}, b^\star_{h,\eta(h),L(h)}) + (v_m, b_m)\|\right\} = 0.$$

Proof Let $M = \mathrm{margin}\, H(v_m, b_m)$ and, as in the proof of Lemma 4, let $\delta > 0$ be such that any hyperplane inducing a different partition from $H(v_m, b_m)$ has margin at most $M - \delta$. Consider the set $\mathcal{F}^\delta_{CL} := \{(v, b) \in S^{d-1} \times \mathbb{R} \mid b \in B_{\delta/2}(F(v))\}$, where we use the notation $B_{\delta/2}(F(v))$ to denote the neighbourhood of $F(v)$ given by $\{r \in \mathbb{R} \mid d(r, F(v)) < \delta/2\}$. The set $\mathcal{F}^\delta_{CL}$ enlarges the feasible set of hyperplanes by allowing $b$ to range over $B_{\delta/2}(F(v))$. For any fixed $v$, the maximum margin over all hyperplanes with normal vector $v$ can increase by at most $\delta/2$. Thus, any hyperplane inducing a different partition from $H(v_m, b_m)$ has margin at most $M - \delta/2$. Since $H(v_m, b_m)$ is strictly feasible, it therefore remains the unique maximum margin hyperplane in $\mathcal{F}^\delta_{CL}$. Observe now that for $0 < h < \delta/2$ we have $H(v^\star_{h,\eta(h),L(h)}, b^\star_{h,\eta(h),L(h)}) \in \mathcal{F}^\delta_{CL}$, by Proposition 2. In addition, by Theorem 5, we know that the minimisers of $\hat I(v, b)$ over $\mathcal{F}^\delta_{CL}$, say $H(v^\delta_h, b^\delta_h)$, satisfy
$$\lim_{h \to 0^+} \min\left\{\|(v^\delta_h, b^\delta_h) - (v_m, b_m)\|,\ \|(v^\delta_h, b^\delta_h) + (v_m, b_m)\|\right\} = 0.$$
Now, since $H(v_m, b_m)$ is strictly feasible, $\exists \epsilon > 0$ s.t. $(v, b) \in B_\epsilon(\{(v_m, b_m), (-v_m, -b_m)\}) \Rightarrow H(v, b) \in \mathcal{F}_{CL}$. Then for any $0 < \epsilon' < \epsilon$ there exists $h' > 0$ s.t. for $0 < h < h'$ both $(v^\delta_h, b^\delta_h) \in B_{\epsilon'}(\{(v_m, b_m), (-v_m, -b_m)\})$, so that $H(v^\delta_h, b^\delta_h) \in \mathcal{F}_{CL}$, and $H(v^\star_{h,\eta(h),L(h)}, b^\star_{h,\eta(h),L(h)}) \in \mathcal{F}^\delta_{CL}$. Now for $H(v, b) \in \mathcal{F}^\delta_{CL} \setminus \mathcal{F}_{CL}$ we know that $\hat I(v, b) < f_{CL}(v, b)$, whereas for $H(v, b) \in \mathcal{F}_{CL}$, $\hat I(v, b) = f_{CL}(v, b)$, and therefore the minimiser of $f_{CL}(v, b)$ must lie in the neighbourhood $B_{\epsilon'}(\{(v_m, b_m), (-v_m, -b_m)\})$, and the result follows. ∎

To illustrate the convergence of the MDH to the MMH we use the two-dimensional data set shown in Figure 3. The data is sampled from a mixture of two Gaussian distributions with equal covariance matrices. The MDH with respect to the true underlying density is $H((1, 1), 0)$. A large margin separator is artificially introduced by removing a few observations in a narrow margin around a hyperplane different from $H((1, 1), 0)$. The margin is intentionally small to ensure that identifying the MMH is non-trivial. Figure 3 illustrates the MDH solutions arising from the MDP2 method for a decreasing sequence of bandwidths, $h$. Initially the MDH approximately coincides with the optimal MDH with respect to the true density of the Gaussian mixture. As $h$ decreases, the MDH approaches the MMH, and for the smallest values of $h$ the two are indistinguishable.

Figure 3: Convergence of the MDH to the maximum margin hyperplane for a decreasing sequence of bandwidth parameters (MDHs shown for h = 0.43, 0.22, 0.11, 0.05, 0.03, 0.01, together with the MMH).

3.2 MDP2 for Semi-Supervised Classification

Denote the set of hyperplanes which correctly classify the labelled data by $\mathcal{F}_{LB}$. Under the assumption that there exists $H(v, b) \in \mathcal{F}_{LB} \cap \mathcal{F}_{CL}$ with non-zero margin, we can show that, provided the parameter $\gamma$ does not shrink too quickly with $h$, the hyperplane that minimises $f_{SSC}$ converges to the MMH contained in $\mathcal{F}_{LB} \cap \mathcal{F}_{CL}$, where as before we assume that such an MMH is strictly feasible.
To establish this result it is sufficient to show that there exists $h' > 0$ such that for all $h \in (0, h')$ the optimal hyperplane $H(v^\star_{h,\eta,L,\gamma}, b^\star_{h,\eta,L,\gamma})$ correctly classifies all the labelled examples. If this holds, then $f_{SSC}(v^\star_{h,\eta,L,\gamma}, b^\star_{h,\eta,L,\gamma}) = f_{CL}(v^\star_{h,\eta,L,\gamma}, b^\star_{h,\eta,L,\gamma})$ for all sufficiently small $h$, and hence Lemma 6 can be applied to establish the result. The proof relies on the fact that the penalty terms associated with the known labels in Equation (16) are polynomials in $b$. Provided that $\gamma$ is bounded below by a polynomial in $h$, the value of the penalty terms for hyperplanes that do not correctly classify the labelled data dominates the value of the density integral as $h$ approaches zero. Therefore the optimal hyperplane must correctly classify the labelled data for small values of $h$.

Lemma 7 Define $\mathcal{F}_{LB} = \{H(v, b) \mid y_i(v^\top x_i - b) > 0,\ i = 1, \ldots, \ell\}$ and $\mathcal{F}_{CL} = \{H(v, b) \mid \mu_v - \alpha\sigma_v \leq b \leq \mu_v + \alpha\sigma_v\}$, and assume that $\mathcal{F}_{SSC} = \mathcal{F}_{LB} \cap \mathcal{F}_{CL} \neq \emptyset$ and that there exists $H(v, b) \in \mathcal{F}_{SSC}$ with non-zero margin. For $h > 0$, let $L(h) = \left(e^{1/2} h^2 \sqrt{2\pi}\right)^{-1}$, $0 < \eta(h) \leq h$, and $\gamma(h) \geq h^r$ for some $r > 0$. Then $\exists h' > 0$ s.t. $h \in (0, h')$ implies $H(v^\star_{h,\eta(h),L(h),\gamma(h)}, b^\star_{h,\eta(h),L(h),\gamma(h)}) \in \mathcal{F}_{LB}$.

Proof Consider $H(v, b) \notin \mathcal{F}_{LB}$. Then
$$f_{SSC}(v, b) \geq \frac{1}{n\sqrt{2\pi}\,h}\exp\left(-\frac{\nu^{\star 2}}{2h^2}\right) + \gamma(h)\,\nu^{\star\,1+\epsilon} > \gamma(h)\,\nu^{\star\,1+\epsilon},$$
where $\nu^\star > 0$ minimises $\frac{1}{n\sqrt{2\pi}\,h}\exp\left(-\nu^2/2h^2\right) + \gamma(h)\nu^{1+\epsilon}$. Therefore $\nu^\star$ is the unique positive number satisfying
$$-\frac{\nu^\star}{n\sqrt{2\pi}\,h^3}\exp\left(-\frac{\nu^{\star 2}}{2h^2}\right) + (1+\epsilon)\gamma(h)\nu^{\star\,\epsilon} = 0
\;\Rightarrow\;
\nu^{\star\,1-\epsilon} = (1+\epsilon)\gamma(h)\,n\sqrt{2\pi}\,h^3\exp\left(\frac{\nu^{\star 2}}{2h^2}\right)
\;\Rightarrow\;
\nu^\star \geq \left((1+\epsilon)\gamma(h)\,n\sqrt{2\pi}\,h^3\right)^{1/(1-\epsilon)}.$$
We therefore have
$$f_{SSC}(v, b) > \gamma(h)\left((1+\epsilon)\gamma(h)\,n\sqrt{2\pi}\,h^3\right)^{\frac{1+\epsilon}{1-\epsilon}} = K\,\gamma(h)^{\frac{2}{1-\epsilon}}\,h^{\frac{3(1+\epsilon)}{1-\epsilon}},$$
where $K$ is a constant which can be chosen independently of $(v, b)$. Finally, for any $H(v', b') \in \mathcal{F}_{SSC}$ with non-zero margin, $\exists h' > 0$ s.t. $h \in (0, h')$ implies
$$f_{SSC}(v', b') = \hat I(v', b') < K\,\gamma(h)^{\frac{2}{1-\epsilon}}\,h^{\frac{3(1+\epsilon)}{1-\epsilon}} < f_{SSC}(v, b).$$
Since $K$ is independent of $(v, b)$, the result follows. The final set of inequalities holds since the hyperplane $H(v', b')$ is assumed to have non-zero margin, say $M_{v',b'} > 0$, and hence $\hat I(v', b') \leq \frac{1}{h\sqrt{2\pi}}\exp\left\{-M_{v',b'}^2/2h^2\right\}$, which tends to zero faster than any polynomial in $h$. ∎

4. Estimation of Minimum Density Hyperplanes

In this section we discuss the computation of MDHs. We first investigate the continuity and differentiability properties required to optimise the projection indices $\phi_{CL}(v)$ and $\phi_{SSC}(v)$. Since the domain of both projection indices is the unit sphere $S^{d-1}$ in $\mathbb{R}^d$, it is more convenient to express $v$ in terms of spherical coordinates,
$$v_i(\theta) = \begin{cases} \cos(\theta_i)\prod_{j=1}^{i-1}\sin(\theta_j), & i = 1, \ldots, d-1, \\ \prod_{j=1}^{d-1}\sin(\theta_j), & i = d, \end{cases} \qquad (21)$$
where $\theta \in \Theta = [0, \pi]^{d-2} \times [0, 2\pi]$ is called the projection angle. Using spherical coordinates renders the domain, $\Theta$, convex and compact, and reduces the dimensionality by one. As the following discussion applies to both $\phi_{CL}(v)$ and $\phi_{SSC}(v)$, we denote a generic projection index $\phi : \Theta \to \mathbb{R}$, and the associated set of minimisers, as
$$\phi(\theta) = \min_{b \in A} f(v(\theta), b), \qquad (22)$$
$$B(\theta) = \left\{ b \in A \mid f(v(\theta), b) = \phi(\theta) \right\}, \qquad (23)$$
where $f(v(\theta), b)$ is continuously differentiable, $A \subset \mathbb{R}$ is compact and convex, and the correspondence $B(\theta)$ gives the set of global minimisers of $f(v(\theta), b)$ for each $\theta$. The definition of $A$ is not critical in our formulation. Setting
$$A \supseteq \left[\min_{v \in S^{d-1}}\{\mu_v\} - \alpha\sigma_{pc1} - \eta,\ \max_{v \in S^{d-1}}\{\mu_v\} + \alpha\sigma_{pc1} + \eta\right], \qquad (24)$$
where $\sigma^2_{pc1}$ is the variance of the projections along the first principal component, ensures that the set of split points $b$ that satisfy the constraint of Equation (7) is a subset of $A$ for all $v$.
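The following short sketch illustrates the spherical-coordinate parameterisation of Equation (21), mapping a projection angle $\theta \in \Theta$ to a unit vector $v(\theta)$; it is an illustrative helper, not the authors' code.

```python
import numpy as np

def angle_to_unit_vector(theta):
    """Map a projection angle theta in [0, pi]^{d-2} x [0, 2*pi] to a unit vector v in R^d
    according to Equation (21), with d = len(theta) + 1."""
    theta = np.asarray(theta, dtype=float)
    d = theta.size + 1
    v = np.empty(d)
    sin_prod = 1.0
    for i in range(d - 1):
        v[i] = np.cos(theta[i]) * sin_prod    # cos(theta_i) * prod_{j<i} sin(theta_j)
        sin_prod *= np.sin(theta[i])
    v[d - 1] = sin_prod                        # prod_{j=1}^{d-1} sin(theta_j)
    return v

theta = np.array([np.pi / 3, np.pi / 4])       # a point in Theta for d = 3
v = angle_to_unit_vector(theta)
print(v, np.linalg.norm(v))                    # the norm equals 1 up to rounding error
```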
Berge's maximum theorem (Berge, 1963; Polak, 1987) establishes the continuity of $\phi(\theta)$ and the upper semicontinuity (u.s.c.) of the correspondence $B(\theta)$. Theorem 3.1 in (Polak, 1987) enables us to establish that $\phi(\theta)$ is locally Lipschitz continuous. Using Theorem 4.13 of Bonnans and Shapiro (2000) we can further show that $\phi(\theta)$ is directionally differentiable everywhere. The directional derivative at $\theta$ in the direction $\nu$ is given by
$$d\phi(\theta; \nu) = \min_{b \in B(\theta)} D_\theta f(v(\theta), b)\,\nu, \qquad (25)$$
where $D_\theta$ denotes the derivative with respect to $\theta$. It is clear from Equation (25) that $\phi(\theta)$ is differentiable if $D_\theta f(v(\theta), b)$ is the same for all $b \in B(\theta)$. If $B(\theta)$ is a singleton then this condition is trivially satisfied and $\phi(\theta)$ is continuously differentiable at $\theta$. It is possible to construct examples in which $B(\theta)$ is not a singleton. However, with the exception of contrived examples, our experience with real and simulated data sets indicates that when $h$ is set through standard bandwidth selection rules $B(\theta)$ is almost always a singleton over the optimisation path.

Proposition 8 Suppose $B(\theta)$ is a singleton for almost all $\theta \in \Theta$. Then $\phi(\theta)$ is continuously differentiable almost everywhere.

Proof The result follows immediately from the fact that if $B(\theta) = \{b\}$ is a singleton, then the derivative $D\phi(\theta) = D_\theta f(v(\theta), b)$, which is continuous. ∎

Wolfe (1972) provided early examples of how standard gradient-based methods can fail to converge to a local optimum when used to minimise nonsmooth functions. In the last decade a new class of nonsmooth optimisation algorithms has been developed based on gradient sampling (Burke et al., 2006). Gradient sampling methods use generalised gradient descent to find local minima. At each iteration, points are randomly sampled within a radius $\varepsilon$ of the current candidate solution, and the gradient at each point is computed. The convex hull of these gradients serves as an approximation of the $\varepsilon$-Clarke generalised gradient (Burke et al., 2002). The minimum-norm element in the convex hull of these gradients is a descent direction. The gradient sampling algorithm progressively reduces the sampling radius so that the convex hull approximates the Clarke generalised gradient. When the origin is contained in the Clarke generalised gradient there is no direction of descent, and hence the current candidate solution is a local minimum. Gradient sampling achieves almost sure global convergence for functions that are locally Lipschitz continuous and almost everywhere continuously differentiable. It is also well documented that it is an effective optimisation method for functions that are only locally Lipschitz continuous.

4.1 Computational Complexity

In this subsection we analyse the computational complexity of MDP2. At each iteration the algorithm projects the data sample onto $v(\theta)$, which involves $O(nd)$ operations. To compute the projection index, $\phi(\theta)$, we need to minimise the penalised density integral, $f(v(\theta), b)$. This can be achieved by first evaluating $f(v(\theta), b)$ on a grid of $m$ points, to bracket the location of the minimiser, and then applying bisection to compute the minimiser(s) to the desired accuracy. The main computational cost of this procedure is due to the first step, which involves $m$ evaluations of a kernel density estimator with $n$ kernels. Using the improved fast Gauss transform (Morariu et al., 2008) this can be performed in $O(m + n)$ operations, instead of $O(mn)$. Bisection requires $O(\log_2(1/\varepsilon))$ iterations to locate the minimiser with accuracy $\varepsilon$. If the minimiser of the penalised density integral, $b^\star = \arg\min_{b \in A} f(v(\theta), b)$, is unique, the projection index is continuously differentiable at $\theta$.
To obtain the derivative of the projection index it is convenient to define the projection function, $P(v) = (x_1^\top v, \ldots, x_n^\top v)^\top$. An application of the chain rule yields
$$D_\theta \phi = D_\theta f(v(\theta), b^\star) = D_P f(v(\theta), b^\star)\, D_v P\, D_\theta v, \qquad (26)$$
where the derivative of the projections of the data sample with respect to $v$ is equal to the data matrix, $D_v P = (x_1, \ldots, x_n)^\top$, and $D_\theta v$ is the derivative of $v$ with respect to the projection angle, a $d \times (d-1)$ matrix. The computation of the derivative therefore requires $O(d(n + d))$ operations.

The original gradient sampling algorithm requires $O(d)$ gradient evaluations at each iteration, which is costly. Curtis and Que (2013) have developed an adaptive gradient sampling algorithm that requires $O(1)$ gradient evaluations in each iteration. More recently, Lewis and Overton (2013) have strongly advocated that for the minimisation of nonsmooth, nonconvex, locally Lipschitz functions, a simple BFGS method using inexact line searches is much more efficient in practice than gradient sampling, although no convergence guarantees have been established for this method. BFGS requires a single gradient evaluation at each iteration and a matrix-vector operation to update the approximation of the Hessian matrix. In our experiments we use the BFGS algorithm.
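A sketch of the chain-rule gradient of Equation (26) is given below. It reuses the angle_to_unit_vector helper sketched earlier, assumes the minimiser $b^\star$ is unique and lies strictly inside the feasible interval (so that the penalty term contributes nothing to the gradient), and approximates $D_\theta v$ by central finite differences purely for brevity.

```python
import numpy as np

def projection_index_gradient(X, theta, b_star, h, fd_step=1e-6):
    """Chain-rule gradient of phi at theta (Equation (26)), under the assumptions stated above."""
    v = angle_to_unit_vector(theta)
    proj = X @ v
    n = X.shape[0]
    # d f / d p_i : derivative of the 1-D KDE evaluated at b_star w.r.t. each projection p_i.
    diff = b_star - proj
    grad_p = diff / h ** 2 * np.exp(-diff ** 2 / (2 * h ** 2)) / (n * h * np.sqrt(2 * np.pi))
    # D_theta v, approximated column by column with central finite differences.
    d_theta_v = np.empty((v.size, theta.size))
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = fd_step
        d_theta_v[:, j] = (angle_to_unit_vector(theta + step)
                           - angle_to_unit_vector(theta - step)) / (2 * fd_step)
    # Equation (26): D_theta phi = (D_P f) (D_v P) (D_theta v), with D_v P the data matrix.
    return grad_p @ X @ d_theta_v
```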
In the case of a tie the cluster is assigned to the smaller of the two partitions. We thus merge the true clusters into two aggregate clusters, C1 and C2. The first measure we use is the binary V-measure which is simply the V-measure (Rosenberg and Hirschberg, 2007) computed on C1, C2 with respect to the binary partition, which we denote Π1, Π2. The V-measure is the harmonic mean of homogeneity and completeness. For a data set containing clusters C1, . . . , Cc, partitioned as Π1, . . . , Πk, homogeneity is defined as the conditional entropy of the cluster distribution within each partition, Πi. Completeness is symmetric to homogeneity and measures the conditional entropy of each partition within each cluster, Cj. An important characteristic of the V-measure for evaluating binary partitions is that if the distribution of clusters within each partition is equal to the overall cluster distribution in the data set then the V-measure is equal to zero (Rosenberg and Hirschberg, 2007). This means that if an algorithm fails to distinguish the majority of any of the clusters from the remainder of the data, the binary V-measure returns zero performance. Other evaluation metrics for clustering, such as purity and the Rand index, can assign a high value to such partitions. To define the second performance measure we first determine the number of correctly and incorrectly classified samples. The error of a binary partition, E(Π1, Π2), given in Equation (27), is defined as the number of elements of each aggregate cluster which are not in the same partition as the majority of their original clusters. In contrast, the success of a partition, S(Π1, Π2), Equation (28), measures the number of samples which are in the same partition as the majority of their original clusters. The Success Ratio, SR(Π1, Π2), Equation (29), captures the extent to which the majority of at least one cluster is welldistinguished from the rest of the data. E(Π1, Π2) = min {|Π1 C1| + |Π2 C2|, |Π1 C2| + |Π2 C1|} , (27) S(Π1, Π2) = min {max {|Π1 C1|, |Π1 C2|} , max {|Π2 C1|, |Π2 C2|}} , (28) SR(Π1, Π2) = S(Π1, Π2) S(Π1, Π2) + E(Π1, Π2). (29) The Success Ratio takes the value zero if an algorithm fails to distinguish the majority of any cluster from the remainder of the data. 5.1.1 Parameter Settings for MDP2 The two most important settings for the performance of the proposed approach are the initial projection direction, and the choice of α, which controls the width of the interval F(v) within which the optimal hyperplane falls. Despite the ability of the MDP2 formulation to mitigate the effect of local minima of the projected density, the problem remains non-convex and local minima in the projection index can still lead to suboptimal performance. We have found that this effect is amplified in general when either or both the number of dimensions, and the number of high density clusters in the data set is large. To better handle the effect Minimum Density Hyperplanes of local optima, we use multiple initialisations and select the MDH that maximises the relative depth criterion, defined in Equation (30). The relative depth of an MDH, H(v, b), is defined as the smaller of the relative differences in the density on the MDH and its two adjacent modes in the projected density, Relative Depth(v, b) = min n ˆI(v, ml), ˆI(v, mr) o ˆI(v, b) ˆI(v, b) (30) where ml and mr are the two adjacent modes in the projected density on v. 
If an MDH does not separate the modes of the projected density, then its relative depth is set to zero, signalling a failure of MDP2 to identify a meaningful bi-partition. The relative depth is appealing because it captures the fact that a high quality separating hyperplane should have a low density integral, and separate well the modes of the projected density. Note also that the relative depth is equivalent to the inverse of a measure used to define cluster overlap in the context of Gaussian mixtures (Aitnouri et al., 2000). In all the reported experiments we initialise MDP2 to the first and second principal component and select the MDH with the largest relative depth. For the data sets listed above it was never the case that both initialisations led to MDHs with zero relative depth. The choice of α determines the trade-offbetween a balanced bi-partition and the ability to discover lower density hyperplanes. The difficulties associated with choosing this parameter are illustrated in Figure 4. In each sub-figure the horizontal axis is the candidate projection vector, v, while the right vertical axis is the direction of maximum variability orthogonal to v. Points correspond to projections of the data sample onto this two-dimensional space, while colour indicates cluster membership. The solid line depicts the projected density on v, while the dotted line depicts the penalised function, f CL(v, ). The scale of both functions is depicted on the left vertical axis. The solid vertical line indicates the MDH along v. Setting α to a large value can cause MDP2 to focus on hyperplanes that have low density because they partition only a small subset of the data set as shown in Figure 4(a). In contrast smaller values of α may cause the algorithm to disregard valid lower density hyperplane separators (see Figure 4(b)), or for the separating hyperplane to not be a local minimiser of the projected density (see Figure 4(c)). Rather than selecting a single value for α we recommend solving MDP2 repeatedly for an increasing sequence of values in the range {αmin, αmax}, where each implementation beyond the first is initialised using the solution to the previous. Setting αmin close to zero forces MDP2 to seek low density hyperplanes that induce a balanced data partition. This tends to find projections which display strong multimodal structure, yet prevents convergence to hyperplanes that have low density because they partition a few observations, as in the case shown in Figure 4(a). Increasing α progressively fine-tunes the location of the MDH. To avoid sensitivity to the value of αmax (set to 0.9) the output of the algorithm is the last hyperplane that corresponds to a minimiser of the projected density. Figure 5 illustrates this approach using the optical recognition of handwritten digits data set from the UCI machine learning repository (Lichman, 2013). Figure 5(a) depicts the projected density on the initial projection direction, which in this case is the second principal component. As shown, the density is unimodal and the clusters are not well separated along this vector. Although not shown, if a large value of α is used from the outset, MDP2 will identify a vector Pavlidis, Hofmeyr and Tasoulis 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 (a) MDH separating few observations 0.0 0.1 0.2 0.3 0.4 (b) Lower density hyperplane beyond feasible region 4 2 0 2 4 6 8 0.0 0.1 0.2 0.3 0.4 (c) MDH not a minimiser of the projected density Figure 4: Impact of choice of α on minimum density hyperplane. 
along which the projected density is unimodal and skewed. Figure 5(b) shows that after five iterations with α = 10 2 MDP2 has identified a projection vector with bimodal density. In subsequent iterations the two modes become more clearly separated, Figure 5(c), while increasing α enables MDP2 to locate an MDH that corresponds to a minimiser of ˆI(v, b), as illustrated in Figure 5(d). In all experiments we set the bandwidth parameter to h = 0.9ˆσpc1n 1/5, where ˆσpc1 is the estimated standard deviation of the data projected onto the first principal component. This bandwidth selection rule is recommended when the density being approximated is assumed to be multimodal (Silverman, 1986). The parameter η controls the distance between the minimisers of arg minb R f CL(v, b) and arg minb F(v) ˆI(v, b), while larger values of ϵ increase the smoothness of the penalised function f CL. Values of η close to zero affect the numerical stability of the one-dimensional optimisation problem, due to the term L ηϵ in f CL becoming very large. We used η = 10 2 and ϵ = 1 10 6 to avoid numerical instability. Beyond these numerical problems the values of η and ϵ do not affect the solutions obtained through MDP2. 5.1.2 Performance Evaluation We compare the performance of MDP2 for clustering with the following methods: 1. k-means++ (Arthur and Vassilvitskii, 2007), a version of k-means that is guaranteed to be O(log k)-competitive to the optimal k-means clustering. 2. The adaptive linear discriminant analysis guided k-means (LDA-km) (Ding and Li, 2007). LDA-km attempts to discover the most discriminative linear subspace for clustering by iteratively using k-means, to assign labels to observations, and LDA to identify the most discriminative subspace. 3. The principal direction divisive partitioning (PDDP) (Boley, 1998), and the densityenhanced PDDP (de PDDP) (Tasoulis et al., 2010). Both methods project the data onto the first principal component. PDDP splits at the mean of the projections, while de PDDP splits at the lowest local minimum of the one-dimensional density estimator. Minimum Density Hyperplanes 0.0 0.1 0.2 0.3 (a) Initial projection 6 4 2 0 2 4 6 0.0 0.1 0.2 0.3 0.4 (b) Projection after 5 iterations with α = 10 2 0.0 0.1 0.2 0.3 0.4 (c) Projection after 25 iterations with α = 10 2 6 4 2 0 2 4 6 0.0 0.1 0.2 0.3 0.4 0.5 (d) Final projection with α = αmax Figure 5: Evolution of the minimum density hyperplane through consecutive iterations. 4. The iterative support vector regression algorithm for MMC (Zhang et al., 2009) using the inner product and Gaussian kernel, i SVR-L and i SVR-G respectively. Both are initialised with the output of 2-means++. 5. Normalised cut spectral clustering (SCn) (Ng et al., 2002) using the Gaussian affinity function, and the automatic bandwidth selection method of Zelnik-Manor and Perona (2004). This choice of kernel and bandwidth produced substantially better performance than alternative choices considered. For data sets that are too large for the eigen decomposition of the Gram matrix to be feasible we employed the Nystr om method (Fowlkes et al., 2004). We also considered the density-based clustering algorithm Pdf Cluster (Menardi and Azzalini, 2014), but this algorithm could not be executed on the larger data sets and so its performance is not reported in this paper. With the exception of SCn and i SVR-G, the methods considered bi-partition the data through a hyperplane in the original feature Pavlidis, Hofmeyr and Tasoulis space. 
For the 2-means and LDA-2m algorithm the hyperplane separator bisects the line segment joining the two centroids. i SVR-L directly seeks the maximum margin hyperplane in the original space, while i SVR-G seeks the maximum margin hyperplane in the feature space defined by the Gaussian kernel. PDDP and de PDDP use a hyperplane whose normal vector is the first principal component. PDDP uses a fixed split point while de PDDP uses the hyperplane with minimum density along the fixed projection direction. Table 2 reports the performance of the considered methods with respect to the success ratio (SR) and the binary V-measure (V-m) on the fourteen data sets. In addition Figures 6(a) and 6(b) provide summaries of the overall performance on all data sets using boxplots of the raw performance measures as well as the associated regret. The regret of an algorithm on a given data set is defined as the difference between the best performance attained on this data set and the performance of this algorithm. By comparing against the best performing clustering algorithm regret accommodates for differences in difficulty between clustering problems, while also making use of the magnitude of performance differences between algorithms. The distribution of performance with respect to both SR and V-m is negatively skewed for most methods, and as a result the median is higher than the mean (indicated with a red dot). It is clear from Table 2 that no single method is consistently superior to all others, although MDP2 achieves the highest or tied highest performance on seven data sets (more than any other method). More importantly MDP2 is among the best performing methods in almost all cases. This fact is better captured by the regret distributions in Figure 6(b). Here we see that the average, median, and maximum regret of MDP2 is substantially lower than any of the competing methods. In addition MDP2 achieves the highest mean and median performance with respect to both SR and V-m, while also having much lower variability in performance when compared with most other methods. Pairwise comparisons between MDP2 and other methods reveal some less obvious facts. SCn achieves higher performance than MDP2 in more examples (six) than any other competing method, however it is much less consistent in its performance, obtaining very poor performance on five of the data sets. The i SVR maximum margin clustering approach is arguably the closest competitor to MDP2. i SVR-L and i SVR-G achieve the second and third highest average performance with respect to V-m and SR respectively. The PDDP algorithm is the second best performing method on average with respect to SR, but performs poorly with respect to V-m. The density enhanced variant, de PDDP, performs on average much worse than MDP2. This approach is similarly motivated by obtaining hyperplanes with low density integral, and its low average performance indicates the usefulness of searching for high quality projections as opposed to always using the first principal component. Finally, neither of the k-means variants appears to be competitive with MDP2 in general. 5.2 Semi-Supervised Classification In this section we evaluate MDHs for semi-supervised classification. We compare MDHs against three state-of-the-art semi-supervised classification methods: Laplacian Regularised Support Vector Machines (Lap SVM) (Belkin et al., 2006), Simple Semi-Supervised Learning (SSSL) (Ji et al., 2012), and Correlated Nystr om Views (XNV) (Mc Williams et al., 2013). 
                   MDP2        iSVR-L      iSVR-G      SCn         LDA-2m      2-means++   PDDP        dePDDP
Data set           SR    V-m   SR    V-m   SR    V-m   SR    V-m   SR    V-m   SR    V-m   SR    V-m   SR    V-m
banknote           0.79  0.55  0     0     0.35  0     0.46  0.10  0     0.01  0.37  0.01  0.40  0.03  0     0.03
br. cancer         0.91  0.79  0.73  0.56  0.73  0.56  0     0.13  0.87  0.71  0.87  0.72  0.91  0.78  0.90  0.77
forest             0.78  0.67  0.90  0.72  0.91  0.74  0.56  0.41  0.76  0.63  0.72  0.58  0.64  0.36  0     0
image seg.         0.89  0.72  0.82  0.59  0.88  0.71  0.92  0.87  0.78  0.58  0.78  0.71  0.87  0.67  1     1
ionosphere         0.48  0.13  0.47  0.13  0.47  0.13  0.55  0.22  0.47  0.12  0.47  0.12  0.47  0.12  0.42  0.09
optidigits         0.93  0.85  0.63  0.29  0.82  0.60  0     0     0.81  0.62  0.92  0.82  0.68  0.30  0     0
pendigits          0.74  0.39  0.79  0.55  0.88  0.68  0.80  0.68  0.79  0.55  0.78  0.57  0.79  0.54  0.61  0.42
satellite          0.89  0.75  0.73  0.40  0.73  0.40  0.92  0.86  0.73  0.40  0.87  0.81  0.71  0.37  0     0
seeds              0.88  0.73  0.71  0.53  0.71  0.53  0.89  0.76  0.96  0.90  0.86  0.70  0.75  0.59  0.73  0.60
smartphone         0.99  0.97  0.99  0.95  0.99  0.96  0.99  0.94  0.99  0.97  0.99  0.94  0.99  0.95  0     0
synth              0.98  0.94  0.94  0.83  0.94  0.83  1     1     0.88  0.76  1     1     0.69  0.51  1     1
voting             0.70  0.43  0.46  0.09  0     0     0     0.05  0.69  0.41  0     0     0.70  0.40  0.68  0.38
wine               0.77  0.61  0.70  0.52  0.69  0.50  0.67  0.48  0.66  0.48  0.68  0.49  0.65  0.46  0.68  0.49
yeast              0.92  0.76  0.89  0.68  0.91  0.72  0.84  0.61  0.86  0.63  0.91  0.73  0.87  0.65  0     0
Average improvement  -     -   0.13  0.18  0.12  0.14  0.22  0.16  0.10  0.11  0.10  0.08  0.11  0.18  0.40  0.32

Table 2: Performance on the task of binary partitioning. (Ties in best performance were resolved by considering more decimal places.)

[Figure 6: Performance and regret distributions for all methods considered, in terms of Success Ratio and V-measure: (a) raw performance measure and (b) regret.]

5.2 Semi-Supervised Classification

In this section we evaluate MDHs for semi-supervised classification. We compare MDHs against three state-of-the-art semi-supervised classification methods: Laplacian Regularised Support Vector Machines (LapSVM) (Belkin et al., 2006), Simple Semi-Supervised Learning (SSSL) (Ji et al., 2012), and Correlated Nyström Views (XNV) (McWilliams et al., 2013). For all methods the inner product kernel was used to render the resulting classifiers linear, and thereby comparable to our method. As the MDH is asymptotically equivalent to a linear S3VM, we also considered the continuous formulation for the estimation of an S3VM proposed by Chapelle and Zien (2005). These results are omitted as this method was not competitive on any of the considered data sets.

5.2.1 Parameter Settings for MDP2

The existence of a few labelled examples enables an informed initialisation of MDP2. We consider the first and second principal components as well as the weight vector of a linear SVM trained on the labelled examples only, and initialise MDP2 with the vector that minimises the value of the projection index, φ_SSC. The penalty parameter γ is first set to 0.1 and, with this setting, α is progressively increased in the same way as for clustering. After this, α is kept at α_max and γ is increased to 1 and then 10. Thus the emphasis is initially on finding a low density hyperplane with respect to the marginal density p̂(x). As the algorithm progresses the emphasis on correctly classifying the labelled examples increases, so as to obtain a hyperplane with low training error within the region of low density already determined.

5.2.2 Performance Evaluation

To assess the effect on performance of the number of labelled examples, ℓ, we consider a range of values. We compare the methods using the subset of data sets from the previous section in which the size of the smallest class exceeds 100; in total eight data sets are used. For each value of ℓ, 30 random partitions into labelled and unlabelled data are considered. As classes are balanced in the data sets considered, performance is measured only in terms of classification error on the unlabelled data. For data sets with more than two classes, all pairwise combinations of classes are considered and aggregate performance is reported.
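The evaluation protocol just described is simple to reproduce in outline. The sketch below (our own, purely illustrative; the classifier output is replaced by a random stand-in) shows one random labelled/unlabelled split, the error computed on the unlabelled portion only, and the median and interquartile range reported in the figures that follow.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_split(n, n_labelled):
    # one of the 30 random partitions into labelled and unlabelled indices
    idx = rng.permutation(n)
    return idx[:n_labelled], idx[n_labelled:]

def unlabelled_error(y_true, y_pred, unlabelled):
    # classification error measured on the unlabelled observations only
    return float(np.mean(y_true[unlabelled] != y_pred[unlabelled]))

def summarise(errors):
    # median and interquartile range, as plotted in Figures 7 and 8
    q1, med, q3 = np.percentile(errors, [25, 50, 75])
    return med, (q1, q3)

# Toy run: a binary problem with 400 observations and 20 labelled examples.
y = rng.integers(0, 2, size=400)
errors = []
for _ in range(30):
    labelled, unlabelled = random_split(len(y), n_labelled=20)
    y_pred = rng.integers(0, 2, size=len(y))  # stand-in for a semi-supervised classifier
    errors.append(unlabelled_error(y, y_pred, unlabelled))
print(summarise(errors))
# For multi-class data sets, the same errors are collected over every pairwise
# combination of classes before summarising.
```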
Figure 7 provides plots of the median and interquartile range of the classification error for values of ℓ between 5 and 100 for the four data sets with two classes. Overall MDP2 appears to be most competitive when the number of labelled examples is small. In addition, MDP2 is comparable with the best performing method in almost every case. The only exception is the ionosphere data set, where LapSVM outperforms MDP2 for all values of ℓ.

Figure 8 provides plots of the median and interquartile range of the aggregate classification error on data sets containing more than two classes. As these data sets are larger we consider up to 300 labelled examples. Note that the interquartile range for XNV is not depicted for the satellite data set: the variability of the performance of XNV on this data set was so high that including the interquartile range would obscure all other information in the figure. MDP2 exhibits the best performance overall, and obtains the lowest, or tied lowest, median classification error for all data sets and values of ℓ.

[Figure 7: Classification error for different numbers of labelled examples (5 to 100) for data sets with two clusters; panels include (b) banknote, (c) breast cancer, and (d) ionosphere. MDP2, LapSVM, SSSL, and XNV medians are shown, with corresponding interquartile ranges given by shaded regions.]

[Figure 8: Classification error for different numbers of labelled examples (10 to 300) over all pairwise combinations of classes: (a) pendigits, (b) optidigits, (c) satellite, (d) image segmentation. MDP2, LapSVM, SSSL, and XNV medians are shown, with corresponding interquartile ranges given by shaded regions.]

5.3 Summary of Experimental Results

We evaluated the performance of the MDP2 formulation for finding MDHs for both clustering and semi-supervised classification, on a large collection of benchmark data sets, and in comparison with state-of-the-art methods for both problems.

For clustering, we found that no single method was consistently superior to all others. This is a result of the vastly differing nature of the data sets in terms of size, dimensionality, and the number and shape of clusters. MDP2 achieved the best performance on more data sets than any of the competing methods, and importantly was competitive with the best performing method on almost every data set considered. All other methods performed poorly on at least as many examples. Boxplots of both the raw performance and the performance regret, which measures the difference between each method and the best performing method on each data set, allowed us to summarise the comparative performance of the different methods across data sets.
The mean and median raw performance of MDP2 are substantially higher than those of the next best performing method, and its regret is also substantially lower. In the case of semi-supervised classification it was apparent that MDP2 is extremely competitive when the number of labelled examples is (very) small, but that in some cases its performance does not improve as much as that of the other methods considered when labelled examples become more abundant. Our experiments suggest that, overall, MDP2 is very competitive with the state-of-the-art for semi-supervised classification problems.

6. Conclusions

We proposed a new hyperplane classifier for clustering and semi-supervised classification. The proposed approach is motivated by determining low density linear separators of the high-density clusters within a data set. This is achieved by minimising the integral of the empirical density along the hyperplane, which is computed through kernel density estimation. To the best of our knowledge this is the first direct implementation of the low density separation assumption that underlies high-density clustering and numerous influential semi-supervised classification methods. We show that the minimum density hyperplane is asymptotically connected with the maximum margin hyperplane, thereby establishing an important link between the proposed approach, maximum margin clustering, and semi-supervised support vector machines.

The proposed formulation allows us to evaluate the integral of the density on a hyperplane by projecting the data onto the vector normal to the hyperplane and constructing a univariate kernel density estimator. This enables us to apply our method effectively and efficiently on data sets of much higher dimensionality than is generally possible for density based clustering methods. To mitigate the problem of convergence to locally optimal solutions we proposed a projection pursuit formulation.

We evaluated the minimum density hyperplane approach on a large collection of benchmark data sets. The experimental results obtained indicate that the method is competitive with state-of-the-art methods for clustering and semi-supervised classification. Importantly, the performance of the proposed approach displays low variability across a variety of data sets, and is robust to differences in data size, dimensionality, and number of clusters. In the context of semi-supervised classification, the proposed approach shows especially good performance when the number of labelled data is small.

Acknowledgments

We would like to thank the reviewers for their insightful comments, which substantially improved this paper. We also thank Prof. David Leslie and Dr. Teemu Roos for valuable comments and suggestions on this work. Nicos Pavlidis would like to thank the Isaac Newton Institute for Mathematical Sciences, Cambridge, for support and hospitality during the programme "Inference for Change-Point and Related Processes", where part of the work on this paper was undertaken. David Hofmeyr gratefully acknowledges the support of the EPSRC funded EP/H023151/1 STOR-i centre for doctoral training, as well as the Oppenheimer Memorial Trust. The underlying code and data are openly available from the Lancaster University data repository at http://dx.doi.org/10.17635/lancaster/researchdata/97.
Appendix A. Proof of Theorem 5

Before proving Theorem 5 we require the following two technical lemmata, which establish some algebraic properties of the maximum margin hyperplane. The first lemma shows that any hyperplane orthogonal to the maximum margin hyperplane results in a different partition of the support points of the maximum margin hyperplane. The proof relies on the fact that if this statement does not hold then a hyperplane with a larger margin exists, which is a contradiction. Figure 9 provides an illustration of why this result holds.

[Figure 9: Two dimensional illustration of Lemma 9. (a) Any hyperplane orthogonal to the MMH generates a different partition of the support points of the MMH; e.g., the point highlighted in red in (b) is grouped with the lower three by the dotted line but with the upper two by the solid line, the MMH. If an orthogonal hyperplane could generate the same partition (c), then a hyperplane with a larger margin than the proposed MMH would exist (d). Legend: proposed MMH (solid), orthogonal hyperplane (dashed), hyperplane with larger margin (dashed), regular points, support points, differently assigned support point.]

Lemma 9 Suppose there is a unique hyperplane in F with maximum margin, which can be parameterised by (v_m, b_m) ∈ S^{d-1} × ℝ. Let M = margin H(v_m, b_m), C+ = {x ∈ X | v_m·x − b_m = M} and C− = {x ∈ X | b_m − v_m·x = M}. Then, for all w ∈ Null(v_m) and c ∈ ℝ, either min{w·x − c | x ∈ C+} ≤ 0, or max{w·x − c | x ∈ C−} ≥ 0.

Proof Suppose the result does not hold. Then there exist (w, c) with ‖w‖ = 1, w·v_m = 0, min{w·x − c | x ∈ C+} > 0 and max{w·x − c | x ∈ C−} < 0. Let m = min{|w·x − c| : x ∈ C+ ∪ C−} and define λ = m/(2M) < 1. Define

u = (λw + (1−λ)v_m) / √(λ² + (1−λ)²)  and  d = (λc + (1−λ)b_m) / √(λ² + (1−λ)²).

By construction ‖u‖ = 1. For any x+ ∈ C+ we have

u·x+ − d = [λ(w·x+ − c) + (1−λ)(v_m·x+ − b_m)] / √(λ² + (1−λ)²)
         ≥ [λm + (1−λ)M] / √(λ² + (1−λ)²)
         = (m² + 2M² − Mm) / √(m² + (2M − m)²)
         > M,

where the final inequality holds since (m² + 2M² − Mm)² − M²(m² + (2M − m)²) = m²((m − M)² + 2M²) > 0. Similarly one can show that d − u·x− > M for any x− ∈ C−, meaning that (u, d) achieves a larger margin on C+ and C− than (v_m, b_m), a contradiction.

The next lemma uses the above result to provide an upper bound on the distance between pairs of support points projected onto any vector, in terms of the angle between that vector and v_m.

Lemma 10 Suppose there is a unique hyperplane in F with maximum margin, which can be parameterised by (v_m, b_m) ∈ S^{d-1} × ℝ. Define M = margin H(v_m, b_m), C+ = {x ∈ X | v_m·x − b_m = M}, and C− = {x ∈ X | b_m − v_m·x = M}. There is no vector w ∈ ℝ^d for which w·x+ − w·x− > 2M(v_m·w) for all pairs x+ ∈ C+, x− ∈ C−.

Proof Suppose such a vector exists. Define w̃ = w − (v_m·w)v_m. By construction w̃ ∈ Null(v_m). For any pair x+ ∈ C+, x− ∈ C− we have

w̃·x+ − w̃·x− = w·x+ − (v_m·w)(v_m·x+) − w·x− + (v_m·w)(v_m·x−)
             > (v_m·w)(2M − v_m·x+ + b_m − b_m + v_m·x−)
             = (v_m·w)(2M − M − M) = 0.

Define c := ½(min{w̃·x+ | x+ ∈ C+} + max{w̃·x− | x− ∈ C−}). Then min{w̃·x+ − c | x+ ∈ C+} > 0 and max{w̃·x− − c | x− ∈ C−} < 0, contradicting Lemma 9.

We are now in a position to provide the main proof of this appendix. The theorem states that if the maximum margin hyperplane is unique, and can be parameterised by (v_m, b_m) ∈ S^{d-1} × ℝ, then

lim_{h→0+} min{ ‖(v_h, b_h) − (v_m, b_m)‖, ‖(v_h, b_h) + (v_m, b_m)‖ } = 0,

where {H(v_h, b_h)}_{h>0} is any collection of minimum density hyperplanes indexed by their bandwidth h > 0.

Proof of Theorem 5 Define M = margin H(v_m, b_m), C+ = {x ∈ X | v_m·x − b_m = M} and C− = {x ∈ X | b_m − v_m·x = M}. Let B = max{‖x‖ : x ∈ X}. Take any ε > 0 and choose δ > 0 to satisfy

(2δ/M)(1 + B²) + 2Bδ^{3/2}√(2/M) + δ² = ε².

Now, suppose (w, c) ∈ S^{d-1} × ℝ satisfies w·x+ − c > M − δ for all x+ ∈ C+ and c − w·x− > M − δ for all x− ∈ C−. By Lemma 10 we know that there exist x+ ∈ C+, x− ∈ C− such that w·x+ − w·x− ≤ 2M(v_m·w). Thus

v_m·w ≥ (w·x+ − w·x−)/(2M) = [(w·x+ − c) + (c − w·x−)]/(2M) > (2(M − δ))/(2M) = 1 − δ/M,

and hence ‖v_m − w‖² = 2 − 2(v_m·w) < 2δ/M. Now, for each x+ ∈ C+, v_m·x+ − b_m = M and for each x− ∈ C−, b_m − v_m·x− = M.
Thus for any such x+, x− we have

M − δ + w·x− < c < w·x+ − M + δ,
b_m − v_m·x− − δ + w·x− < c < w·x+ − v_m·x+ + b_m + δ,
b_m − δ − (v_m − w)·x− < c < b_m + δ + (w − v_m)·x+,
b_m − δ − B‖v_m − w‖ < c < b_m + δ + B‖w − v_m‖,

and therefore |c − b_m| < δ + B‖w − v_m‖. We can now bound the distance between (w, c) and (v_m, b_m):

‖(v_m, b_m) − (w, c)‖² = ‖v_m − w‖² + |b_m − c|²
                        < ‖v_m − w‖²(1 + B²) + 2Bδ‖v_m − w‖ + δ²
                        < (2δ/M)(1 + B²) + 2Bδ^{3/2}√(2/M) + δ² = ε².

We have shown that for any hyperplane H(w, c) that achieves a margin larger than M − δ on the support points of the maximum margin hyperplane, x ∈ C+ ∪ C−, the distance between (w, c) and (v_m, b_m) is less than ε. Equivalently, any hyperplane H(w, c) such that ‖(w, c) − (v_m, b_m)‖ > ε has a margin less than M − δ on these points, as min{|w·x − c| : x ∈ C+ ∪ C−} < M − δ. By symmetry, the same statements hold with (−v_m, −b_m) in place of (v_m, b_m).

By Lemma 4 there exists h₁ > 0 such that for all h ∈ (0, h₁), the minimum density hyperplane for bandwidth h, H(v_h, b_h), induces the same partition of X as the maximum margin hyperplane, H(v_m, b_m). By Lemma 3 there exists h₂ > 0 such that for all h ∈ (0, h₂), margin H(v_h, b_h) > M − δ. Therefore for h ∈ (0, min{h₁, h₂}),

min{ ‖(v_h, b_h) − (v_m, b_m)‖, ‖(v_h, b_h) + (v_m, b_m)‖ } < ε.

Since ε > 0 was arbitrarily chosen, this gives the result.

References

E. Aitnouri, S. Wang, and D. Ziou. On comparison of clustering techniques for histogram pdf estimation. Pattern Recognition and Image Analysis, 10(2):206–217, 2000.

D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1027–1035, 2007.

A. Azzalini and N. Torelli. Clustering via nonparametric density estimation. Statistics and Computing, 17(1):71–80, 2007.

M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006.

S. Ben-David, T. Lu, D. Pál, and M. Sotáková. Learning low-density separators. In D. van Dyk and M. Welling, editors, International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR Workshop and Conference Proceedings, pages 25–32, 2009.

C. Berge. Topological Spaces. Macmillan, New York, 1963.

D. Boley. Principal direction divisive partitioning. Data Mining and Knowledge Discovery, 2(4):325–344, 1998.

J. F. Bonnans and A. Shapiro. Perturbation Analysis of Optimization Problems. Springer Series in Operations Research. Springer, 2000.

J. V. Burke, A. S. Lewis, and M. L. Overton. Approximating subdifferentials by random sampling of gradients. Mathematics of Operations Research, 27(3):567–584, 2002.

J. V. Burke, A. S. Lewis, and M. L. Overton. A robust gradient sampling algorithm for nonsmooth, nonconvex optimization. SIAM Journal on Optimization, 15(3):751–779, 2006.

V. Castelli and T. M. Cover. On the exponential value of labeled samples. Pattern Recognition Letters, 16(1):105–111, 1995.

V. Castelli and T. M. Cover. The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Transactions on Information Theory, 42(6):2102–2117, 1996.

O. Chapelle and A. Zien. Semi-supervised classification by low density separation. In R. G. Cowell and Z. Ghahramani, editors, International Conference on Artificial Intelligence and Statistics (AISTATS), pages 57–64, 2005.

A. Cuevas, M. Febrero, and R. Fraiman. Estimating the number of clusters. Canadian Journal of Statistics, 28(2):367–382, 2000.

A. Cuevas, M. Febrero, and R. Fraiman. Cluster analysis: a further approach based on density estimation. Computational Statistics and Data Analysis, 36(4):441–459, 2001.
F. E. Curtis and X. Que. An adaptive gradient sampling algorithm for nonsmooth optimization. Optimization Methods and Software, 28(6):1302–1324, 2013.

C. Ding and T. Li. Adaptive dimension reduction using discriminant analysis and k-means clustering. In International Conference on Machine Learning (ICML), pages 521–528, 2007.

C. Fowlkes, S. Belongie, F. Chung, and J. Malik. Spectral grouping using the Nyström method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):214–225, 2004.

P. Fränti and O. Virmajoki. Iterative shrinking method for clustering problems. Pattern Recognition, 39(5):761–775, 2006.

J. A. Hartigan. Clustering Algorithms. Wiley Series in Probability and Mathematical Statistics. Wiley, New York, 1975.

M. Ji, T. Yang, B. Lin, R. Jin, and J. Han. A simple algorithm for semi-supervised learning with improved generalization error bound. In J. Langford and J. Pineau, editors, International Conference on Machine Learning (ICML), pages 1223–1230, 2012.

T. Joachims. Transductive inference for text classification using support vector machines. In International Conference on Machine Learning (ICML), volume 99, pages 200–209, 1999.

A. Lewis and M. Overton. Nonsmooth optimization via quasi-Newton methods. Mathematical Programming, 141:135–163, 2013.

M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

B. McWilliams, D. Balduzzi, and J. M. Buhmann. Correlated random features for fast semi-supervised learning. In Advances in Neural Information Processing Systems (NIPS), pages 440–448, 2013.

G. Menardi and A. Azzalini. An advancement in clustering via nonparametric density estimation. Statistics and Computing, 24(5):753–767, 2014.

V. I. Morariu, B. V. Srinivasan, V. C. Raykar, R. Duraiswami, and L. S. Davis. Automatic online tuning for fast Gaussian summation. In Advances in Neural Information Processing Systems (NIPS), pages 1113–1120, 2008.

A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: analysis and an algorithm. In T. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems (NIPS), volume 14, pages 849–856, 2002.

E. Polak. On the mathematical foundations of nondifferentiable optimization in engineering design. SIAM Review, 29(1):21–89, 1987.

P. Rigollet. Generalization error bounds in semi-supervised classification under the cluster assumption. Journal of Machine Learning Research, 8:1369–1392, 2007.

A. Rinaldo and L. Wasserman. Generalized density clustering. The Annals of Statistics, 38(5):2678–2722, 2010.

A. Rosenberg and J. Hirschberg. V-measure: A conditional entropy-based external cluster evaluation measure. In Empirical Methods in Natural Language Processing and Computational Natural Language Learning, volume 7, pages 410–420, 2007.

B. W. Silverman. Density Estimation for Statistics and Data Analysis, volume 26. CRC Press, 1986.

A. Singh, R. D. Nowak, and X. Zhu. Unlabeled data: Now it helps, now it doesn't. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems (NIPS), pages 1513–1520, 2009.

W. Stuetzle and R. Nugent. A generalized single linkage method for estimating the cluster tree of a density. Journal of Computational and Graphical Statistics, 19(2):397–418, 2010.

S. K. Tasoulis, D. K. Tasoulis, and V. P. Plagianakos. Enhancing principal direction divisive clustering. Pattern Recognition, 43(10):3391–3411, 2010.
S. Tong and D. Koller. Restricted Bayes optimal classifiers. In National Conference on Artificial Intelligence (AAAI), pages 658–664, 2000.

V. Vapnik and A. Sterin. On structural risk minimization or overall risk in a problem of pattern recognition. Automation and Remote Control, 10(3):1495–1503, 1977.

G. Walther. Granulometric smoothing. The Annals of Statistics, 25(6):2273–2299, 1997.

P. Wolfe. On the convergence of gradient methods under constraint. IBM Journal of Research and Development, pages 407–411, 1972.

L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems (NIPS), volume 17, pages 1537–1544, 2004.

L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In Advances in Neural Information Processing Systems (NIPS), pages 1601–1608, 2004.

K. Zhang, I. W. Tsang, and J. T. Kwok. Maximum margin clustering made practical. IEEE Transactions on Neural Networks, 20(4):583–596, 2009.