# Optimal Margin Distribution Clustering

Teng Zhang, Zhi-Hua Zhou
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210023, China
{zhangt, zhouzh}@lamda.nju.edu.cn

## Abstract

Maximum margin clustering (MMC), which borrows the large margin heuristic from the support vector machine (SVM), has achieved more accurate results than traditional clustering methods. The intuition is that, for a good clustering, when labels are assigned to different clusters, SVM can achieve a large minimum margin on the data. Recent studies, however, disclosed that maximizing the minimum margin does not necessarily lead to better performance; instead, it is crucial to optimize the margin distribution. In this paper, we propose a novel approach, ODMC (Optimal margin Distribution Machine for Clustering), which tries to cluster the data and achieve optimal margin distribution simultaneously. Specifically, we characterize the margin distribution by the first- and second-order statistics, i.e., the margin mean and variance, and extend a stochastic mirror descent method to solve the resultant minimax problem. Moreover, we prove theoretically that ODMC has the same convergence rate as state-of-the-art cutting-plane based algorithms but involves much less computational cost per iteration, so our method is much more scalable than existing approaches. Extensive experiments on UCI data sets show that ODMC is significantly better than the compared methods, which verifies the superiority of optimal margin distribution learning.

This research was supported by the 973 Program, NSFC (61333014), and the Program B for outstanding Ph.D. candidates of Nanjing University.
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

## Introduction

Clustering is an important research area in machine learning, data mining, and pattern recognition that aims at grouping data points which are similar. It arises in a wide range of domains including information retrieval, computer vision, bioinformatics, etc., and various clustering algorithms have been proposed over the past decades (Jain and Dubes 1988; Xu and Wunsch 2005; Jain 2010). A recently proposed method for clustering, referred to as maximum margin clustering (MMC), is based on the large margin heuristic of the support vector machine (SVM) (Cortes and Vapnik 1995; Vapnik 1995). The intuition is that, for a good clustering, when labels are assigned to different clusters, SVM can achieve a large minimum margin on the data. Since the resultant minimax problem involves labeling each instance from the set {+1, -1}, it is no longer a convex optimization problem but a mixed-integer program, which is much more difficult to handle. Since then, a lot of effort has been devoted to solving this problem, and the existing approaches can be roughly classified into two groups. The first group applies various convex relaxation techniques. Xu et al. (2005) first relaxed it into a convex semi-definite program (SDP), in which a positive semi-definite matrix with linear constraints is used to approximate the label outer-product matrix. Soon after, Valizadegan and Jin (2006) introduced a new formulation whose number of variables is significantly reduced, although it is still an SDP problem. Finally, Li et al.
(2009; 2013) proposed a tighter minimax relaxation than the SDP formulation, which can be solved by iteratively generating the most violated labels and then combining them via multiple kernel learning. The second group directly optimizes the original problem via variants of non-convex optimization. Examples include alternating optimization (Zhang, Tsang, and Kwok 2007; 2009), in which clustering is performed by alternately finding the labels and optimizing a support vector regression (SVR), and the constrained convex-concave procedure (CCCP) (Zhao, Wang, and Zhang 2008; Wang, Zhao, and Zhang 2010), in which the non-convex objective function or constraints are expressed as a difference of two convex functions, and the latter is further replaced by its linear approximation so that the whole problem becomes convex. Moreover, several researchers have also tried to extend MMC to more general learning settings. For example, Zhou et al. (2013) assumed that the data has latent variables and developed the LMMC framework. Niu et al. (2013) showed that an alternative principle to MMC, called maximum volume clustering (MVC), is theoretically more advantageous. An incremental version of MMC has also been proposed (Vijaya Saradhi and Charly Abraham 2016).

The aforementioned MMC algorithms are all based on the large margin principle, i.e., they try to maximize the minimum margin of the training instances. However, recent studies on margin theory (Gao and Zhou 2013) disclosed that maximizing the minimum margin does not necessarily lead to better performance; instead, it is crucial to optimize the margin distribution. Inspired by this recognition, Zhang and Zhou (2014; 2016; 2017) proposed the optimal margin distribution machine (ODM), which can achieve better generalization performance than large margin based methods. Later, Zhou and Zhou (2016) extended the idea to an approach which is able to exploit unlabeled data and handle unequal misclassification costs. The success of optimal margin distribution learning suggests that there may still be large room to further improve MMC.

In this paper, we propose a novel approach, ODMC (Optimal margin Distribution Machine for Clustering), which tries to cluster the data and achieve optimal margin distribution simultaneously. Specifically, we characterize the margin distribution by the first- and second-order statistics, i.e., the margin mean and variance, and then apply the minimax convex relaxation proposed in (Li et al. 2009), which is proven to be tighter than SDP relaxations, to get a convex reformulation. For the optimization of the resultant minimax problem, we propose a stochastic mirror descent method which converges quickly in practice. Moreover, we prove theoretically that ODMC has the same convergence rate as state-of-the-art cutting-plane based algorithms but involves much less computation per iteration, so our method is much more scalable than existing approaches. Extensive experiments on UCI data sets show that ODMC is significantly better than the compared methods, which verifies the superiority of optimal margin distribution learning.

The rest of this paper is organized as follows. We first introduce some preliminaries and then present the ODMC method. Next we report the experimental studies. Finally, we conclude the paper with future work.

## Preliminaries

We start with the simpler scenario of supervised learning. Denote by $\mathcal{X}$ the instance space and by $\mathcal{Y} = \{+1, -1\}$ the label set.
Let $\mathcal{D}$ be an unknown (underlying) distribution over $\mathcal{X} \times \mathcal{Y}$. A training set $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\} \in (\mathcal{X} \times \mathcal{Y})^m$ is drawn independently and identically (i.i.d.) according to $\mathcal{D}$. Let $\phi: \mathcal{X} \to \mathbb{H}$ be a feature mapping associated with some positive definite kernel $\kappa$. The hypothesis is defined based on the linear model $h(x) = w^\top \phi(x)$, and the predicted label of an instance $x$ is the sign of $h(x)$; the decision function then naturally leads to the definition of the margin for a labeled instance, i.e., $\gamma(x, y) = y\, w^\top \phi(x)$ (Cristianini and Shawe-Taylor 2000). Thus $h$ misclassifies $(x, y)$ if and only if it produces a negative margin. Given a hypothesis set $\mathcal{H}$ of functions mapping $\mathcal{X}$ to $\mathcal{Y}$ and the labeled training set $S$, our goal is to learn a function $h \in \mathcal{H}$ such that the generalization error $R(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\mathbf{1}_{\mathrm{sign}(h(x)) \neq y}]$ is small, where $\mathbf{1}_{(\cdot)}$ is the indicator function that returns 1 when the argument holds and 0 otherwise.

### Optimal margin distribution machine

It is well known that SVM employs the large margin principle to select $h$ and tries to maximize the minimum margin of the training data, i.e., the smallest distance from the instances to the decision boundary. As a result, the solution of SVM depends only on a small portion of the data, namely the support vectors (SVs), and the rest (non-SVs) are totally ignored, which may be misleading in some situations (Zhou 2014). A more robust strategy is to consider the whole data set, i.e., to optimize the margin distribution. To characterize the distribution, the two most straightforward statistics are the first- and second-order statistics, that is, the margin mean and variance. Moreover, a recent study on margin theory (Gao and Zhou 2013) proved that maximizing the margin mean and minimizing the margin variance simultaneously yields a tighter generalization bound, so we arrive at the following formulation:

$$
\min_{w,\bar\gamma,\xi_i,\epsilon_i}\ \frac{1}{2}\|w\|_{\mathbb{H}}^2 - \eta\bar\gamma + \lambda\sum_{i=1}^m(\xi_i^2 + \epsilon_i^2)
\quad \text{s.t.}\ \ \gamma(x_i, y_i) \ge \bar\gamma - \xi_i,\ \ \gamma(x_i, y_i) \le \bar\gamma + \epsilon_i,\ \forall i,
$$

where $\bar\gamma$ is the margin mean, $\eta$ and $\lambda$ are trade-off parameters, and $\xi_i$ and $\epsilon_i$ are the deviations of $\gamma(x_i, y_i)$ from the margin mean. It is evident that $\sum_{i=1}^m(\xi_i^2 + \epsilon_i^2)/m$ is exactly the margin variance.

First, by rescaling $w$, which does not affect the final classification results, the margin mean can be fixed as 1; then the deviation of $\gamma(x_i, y_i)$ from the margin mean is $|y_i w^\top \phi(x_i) - 1|$. Second, the hyperplane $y_i w^\top \phi(x_i) = 1$ divides the feature space into two parts, and no matter in which part an instance lies, it suffers a loss which is quadratic in the deviation. It is more reasonable to set different weights for the two kinds of deviations, because the instances lying in $\{x \mid y\, w^\top \phi(x) < 1\}$ are much easier to misclassify than the others. Third, according to the representer theorem (Schölkopf and Smola 2001), the optimal solution is spanned only by SVs. To achieve a sparse solution, we introduce a $\theta$-insensitive loss as in SVR, i.e., the instances whose deviation is smaller than $\theta$ are tolerated and only those whose deviation is larger than $\theta$ suffer a loss. Finally, we obtain the formulation of ODM:

$$
\min_{w,\xi_i,\epsilon_i}\ \frac{1}{2}\|w\|_{\mathbb{H}}^2 + \lambda\sum_{i=1}^m\frac{\xi_i^2 + \nu\epsilon_i^2}{(1-\theta)^2}
\quad \text{s.t.}\ \ y_i w^\top\phi(x_i) \ge 1 - \theta - \xi_i,\ \ y_i w^\top\phi(x_i) \le 1 + \theta + \epsilon_i,\ \forall i, \tag{1}
$$

where $\nu$ is a parameter for trading off the two kinds of deviations, $\theta$ is a parameter controlling the sparsity of the solution, and the $(1-\theta)^2$ in the denominator scales the second term to be a surrogate loss for the 0-1 loss.
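To make the role of the two kinds of deviations concrete, the following minimal sketch (not the authors' implementation; the function name, the plain linear model in the input space, and the default parameter values are illustrative assumptions) evaluates the ODM objective of Eq. (1) for a given $w$ with the slack variables eliminated in closed form: margins below $1-\theta$ contribute $\xi_i^2$, margins above $1+\theta$ contribute the down-weighted $\nu\epsilon_i^2$, and margins inside the $\theta$-insensitive band incur no loss.

```python
import numpy as np

def odm_objective(w, X, y, lam=1.0, nu=0.5, theta=0.2):
    """Evaluate the ODM objective (Eq. (1)) for a linear model w in the
    input space, with the slack variables solved in closed form.
    X is (m, d), y is a +/-1 vector of length m."""
    margins = y * (X @ w)                            # gamma(x_i, y_i) = y_i w^T x_i
    xi = np.maximum(0.0, (1.0 - theta) - margins)    # shortfall below the band
    eps = np.maximum(0.0, margins - (1.0 + theta))   # excess above the band
    loss = np.sum(xi ** 2 + nu * eps ** 2) / (1.0 - theta) ** 2
    return 0.5 * w @ w + lam * loss
```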
## Optimal margin distribution clustering

In the clustering setting, labels are no longer available and thus also need to be optimized. Let $\hat{y} = [\hat{y}_1, \ldots, \hat{y}_m]^\top \in \{+1, -1\}^m$ denote the vector of unknown labels. The basic idea of ODMC is to minimize the objective function in Eq. (1) w.r.t. both the labeling $\hat{y}$ and the decision function parameters $w, \xi_i, \epsilon_i$. Hence, Eq. (1) is extended to

$$
\min_{\hat{y}\in\mathcal{B}}\ \min_{w,\xi_i,\epsilon_i}\ \frac{1}{2}\|w\|_{\mathbb{H}}^2 + \lambda\sum_{i=1}^m\frac{\xi_i^2 + \nu\epsilon_i^2}{(1-\theta)^2}
\quad \text{s.t.}\ \ \hat{y}_i w^\top\phi(x_i) \ge 1 - \theta - \xi_i,\ \ \hat{y}_i w^\top\phi(x_i) \le 1 + \theta + \epsilon_i,\ \forall i, \tag{2}
$$

where $\mathcal{B}$ is a set of candidate label assignments obtained from some domain knowledge. For example, when the positive and negative instances are known to be approximately balanced, we can set $\mathcal{B} = \{\hat{y} \mid -\beta \le e^\top\hat{y} \le \beta\}$, where $\beta$ is a small constant controlling the class imbalance. When a set of must-link constraints $\mathcal{M}$, which requires two instances to be associated with the same cluster, or a set of cannot-link constraints $\mathcal{C}$, which requires two instances to be associated with different clusters, is provided, we can set $\mathcal{B} = \{\hat{y} \mid \hat{y}_i = \hat{y}_j,\ \hat{y}_j \ne \hat{y}_k,\ \forall (x_i, x_j)\in\mathcal{M},\ (x_j, x_k)\in\mathcal{C}\}$.

To avoid the curse of dimensionality, the inner minimization problem of Eq. (2) is usually cast in the dual form. Denote by $X$ the data matrix whose $i$-th column is $\phi(x_i)$, i.e., $X = [\phi(x_1), \ldots, \phi(x_m)]$. Introducing the dual variables $\alpha \ge 0$, the Lagrangian of Eq. (2) leads to

$$
\min_{\hat{y}\in\mathcal{B}}\ \max_{\alpha\ge 0}\ -\frac{1}{2}\alpha^\top
\begin{bmatrix} K\circ\hat{y}\hat{y}^\top & -K\circ\hat{y}\hat{y}^\top \\ -K\circ\hat{y}\hat{y}^\top & K\circ\hat{y}\hat{y}^\top \end{bmatrix}\alpha
- \frac{(1-\theta)^2}{4\lambda}\,\alpha^\top
\begin{bmatrix} I & 0 \\ 0 & \frac{1}{\nu}I \end{bmatrix}\alpha
- \alpha^\top\begin{bmatrix} (\theta-1)e \\ (\theta+1)e \end{bmatrix}, \tag{3}
$$

where $\alpha\in\mathbb{R}^{2m}$ stacks the multipliers of the two groups of constraints, $K = X^\top X$ is the kernel matrix, $\circ$ denotes the element-wise product, and $e$ stands for the all-one vector. Note that the objective function is a negative definite quadratic form whose maximizer cannot lie at infinity, so we can replace the constraint $\{\alpha \mid \alpha\ge 0\}$ by a bounded box $\mathcal{A} = \{\alpha \mid 0 \le \alpha \le \tau e\}$, where the auxiliary parameter $\tau$ is introduced for the sake of mathematical soundness. For a sufficiently large $\tau$, the new problem is equivalent to the original one.

To overcome the difficulty of this mixed-integer program, many relaxations have been proposed, among which the minimax convex relaxation proposed in (Li et al. 2009; 2013) is proven to be the tightest. So in this paper we also employ this method to deal with the mixed-integer problem, i.e., we interchange the order of $\max_{\alpha\in\mathcal{A}}$ and $\min_{\hat{y}\in\mathcal{B}}$ and obtain $\max_{\alpha\in\mathcal{A}}\min_{\hat{y}\in\mathcal{B}} G(\alpha, \hat{y})$, where $G(\alpha, \hat{y})$ is the objective function of Eq. (3). This can be further transformed into

$$
\max_{\alpha\in\mathcal{A}}\ \max_{\delta}\ \delta \quad \text{s.t.}\ \ G(\alpha, \hat{y}^k) \ge \delta,\ \forall\,\hat{y}^k\in\mathcal{B}. \tag{4}
$$

For the inner optimization over $\delta$ in Eq. (4), introduce the dual variables $\mu = [\mu_1, \ldots, \mu_{|\mathcal{B}|}]^\top \ge 0$; the Lagrangian leads to

$$
\min_{\mu\ge 0}\ \max_{\delta}\ \Big\{\delta + \sum_{k:\hat{y}^k\in\mathcal{B}}\mu_k\big(G(\alpha, \hat{y}^k) - \delta\big)\Big\}.
$$

By setting the partial derivative w.r.t. $\delta$ to zero, we obtain $\sum_{k:\hat{y}^k\in\mathcal{B}}\mu_k = 1$, and the dual turns into

$$
\min_{\mu\in\mathcal{M}}\ \sum_{k:\hat{y}^k\in\mathcal{B}}\mu_k\, G(\alpha, \hat{y}^k), \tag{5}
$$

where $\mathcal{M} = \{\mu\in\mathbb{R}_+^{|\mathcal{B}|} \mid e^\top\mu = 1\}$ is the simplex in $\mathbb{R}^{|\mathcal{B}|}$. By substituting Eq. (5) into Eq. (4) and denoting $\phi(\mu, \alpha) = \sum_{k:\hat{y}^k\in\mathcal{B}}\mu_k\, G(\alpha, \hat{y}^k)$, Eq. (4) can be rewritten as $\max_{\alpha\in\mathcal{A}}\min_{\mu\in\mathcal{M}}\phi(\mu, \alpha)$. Note that $\phi(\mu, \alpha)$ is a convex combination of negative definite quadratic forms, so it is convex in $\mu$ and concave in $\alpha$. According to Sion's minimax theorem (Sion 1958), there exists a saddle point $(\hat\mu, \hat\alpha)\in\mathcal{M}\times\mathcal{A}$ such that

$$
\min_{\mu\in\mathcal{M}}\max_{\alpha\in\mathcal{A}}\phi(\mu, \alpha)
\le \max_{\alpha\in\mathcal{A}}\phi(\hat\mu, \alpha)
= \phi(\hat\mu, \hat\alpha)
= \min_{\mu\in\mathcal{M}}\phi(\mu, \hat\alpha)
\le \max_{\alpha\in\mathcal{A}}\min_{\mu\in\mathcal{M}}\phi(\mu, \alpha). \tag{6}
$$

By combining this with the minimax inequality (Kim and Boyd 2008),

$$
\max_{\alpha\in\mathcal{A}}\min_{\mu\in\mathcal{M}}\phi(\mu, \alpha) \le \min_{\mu\in\mathcal{M}}\max_{\alpha\in\mathcal{A}}\phi(\mu, \alpha),
$$

we see that all the inequalities in Eq. (6) hold with equality, and we arrive at the final formulation of ODMC:

$$
\min_{\mu\in\mathcal{M}}\ \max_{\alpha\in\mathcal{A}}\ \phi(\mu, \alpha). \tag{7}
$$
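For reference, the sketch below evaluates $G(\alpha, \hat{y})$, the inner objective of Eq. (3), for a single candidate labeling. It assumes the block form written above (derived here from the primal in Eq. (2)); the function name and the default parameter values are illustrative.

```python
import numpy as np

def G(alpha, y_hat, K, lam=1.0, nu=0.5, theta=0.2):
    """Inner dual objective of Eq. (3) for a fixed labeling y_hat.

    alpha has length 2m and stacks the multipliers of the two constraint
    groups; K is the m x m kernel matrix; y_hat is a +/-1 vector."""
    m = K.shape[0]
    a, b = alpha[:m], alpha[m:]
    Q = K * np.outer(y_hat, y_hat)                  # K o (y_hat y_hat^T)
    d = a - b
    quad = 0.5 * d @ Q @ d                          # 1/2 (a - b)^T Q (a - b)
    reg = (1.0 - theta) ** 2 / (4.0 * lam) * (a @ a + (b @ b) / nu)
    lin = (1.0 - theta) * a.sum() - (1.0 + theta) * b.sum()
    return -quad - reg + lin
```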
## Optimization

In this section, we commence with a brief introduction to the minimax problem, followed by a stochastic mirror descent method for finding the saddle point. Finally, we give a convergence rate analysis.

### Minimax problem

Since $\phi(\cdot, \alpha)$ is convex and $\phi(\mu, \cdot)$ is concave, by the first-order conditions of convexity and concavity, for any pair $(\bar\mu, \bar\alpha)\in\mathcal{M}\times\mathcal{A}$ we have

$$
\phi(\bar\mu, \bar\alpha) - \phi(\mu, \bar\alpha) \le \nabla_\mu\phi(\bar\mu, \bar\alpha)^\top(\bar\mu - \mu), \quad \forall\mu\in\mathcal{M},
$$
$$
\phi(\bar\mu, \alpha) - \phi(\bar\mu, \bar\alpha) \le -\nabla_\alpha\phi(\bar\mu, \bar\alpha)^\top(\bar\alpha - \alpha), \quad \forall\alpha\in\mathcal{A}.
$$

Adding the two inequalities together, we have

$$
\phi(\bar\mu, \alpha) - \phi(\mu, \bar\alpha) \le g(\bar{w})^\top(\bar{w} - w), \quad \forall\mu\in\mathcal{M},\ \alpha\in\mathcal{A}, \tag{8}
$$

where $w = (\mu, \alpha)$, $\bar{w} = (\bar\mu, \bar\alpha)\in\mathcal{M}\times\mathcal{A}$ and $g(\bar{w}) = (\nabla_\mu\phi(\bar{w}),\ -\nabla_\alpha\phi(\bar{w}))$. Note that Eq. (8) holds for any $\mu$ and $\alpha$; in particular, taking $\alpha$ as the maximizer of $\phi(\bar\mu, \cdot)$ over $\mathcal{A}$ and $\mu$ as the minimizer of $\phi(\cdot, \bar\alpha)$ over $\mathcal{M}$, we have

$$
\max_{\alpha\in\mathcal{A}}\phi(\bar\mu, \alpha) - \min_{\mu\in\mathcal{M}}\phi(\mu, \bar\alpha) \le g(\bar{w})^\top(\bar{w} - w). \tag{9}
$$

The left-hand side is referred to as the duality gap, which can be decomposed into two parts:

$$
\begin{aligned}
\max_{\alpha\in\mathcal{A}}\phi(\bar\mu, \alpha) - \min_{\mu\in\mathcal{M}}\phi(\mu, \bar\alpha)
&= \max_{\alpha\in\mathcal{A}}\phi(\bar\mu, \alpha) - \phi(\hat\mu, \hat\alpha) + \phi(\hat\mu, \hat\alpha) - \min_{\mu\in\mathcal{M}}\phi(\mu, \bar\alpha)\\
&= \underbrace{\max_{\alpha\in\mathcal{A}}\phi(\bar\mu, \alpha) - \min_{\mu\in\mathcal{M}}\max_{\alpha\in\mathcal{A}}\phi(\mu, \alpha)}_{\text{primal gap}}
 + \underbrace{\max_{\alpha\in\mathcal{A}}\min_{\mu\in\mathcal{M}}\phi(\mu, \alpha) - \min_{\mu\in\mathcal{M}}\phi(\mu, \bar\alpha)}_{\text{dual gap}}.
\end{aligned}
$$

As can be seen, the primal gap and the dual gap are both nonnegative, and the closer the current point is to the saddle point, the smaller both gaps are. So the duality gap can be viewed as a measure of the closeness of the current point $(\bar\mu, \bar\alpha)$ to the saddle point $(\hat\mu, \hat\alpha)$.

### Stochastic mirror descent

For ODMC, the feasible sets of $\mu$ and $\alpha$ are a simplex and a bounded box, respectively, so the most suitable mirror maps for the two domains are $\Phi_{\mathcal{M}}(\mu) = \sum_k\mu_k\log\mu_k$ and $\Phi_{\mathcal{A}}(\alpha) = \|\alpha\|_2^2/2$. Introduce the joint map $\Phi(w) = a\Phi_{\mathcal{M}}(\mu) + b\Phi_{\mathcal{A}}(\alpha)$, where $a$ and $b$ are parameters to be specified later. It can be shown that $\nabla\Phi_{\mathcal{M}}(\mu) = \log\mu + e$, $\nabla\Phi_{\mathcal{A}}(\alpha) = \alpha$ and $\nabla\Phi(w) = (a\log\mu + ae,\ b\alpha)$. At the $t$-th iteration, we first map $w_t = (\mu_t, \alpha_t)$ into the dual space, $\nabla\Phi(w_t) = (a\log\mu_t + ae,\ b\alpha_t)$, followed by one step of stochastic gradient descent in the dual space,

$$
\nabla\Phi(\tilde{w}_{t+1}) = \nabla\Phi(w_t) - \eta\,\tilde{g}(w_t)
= \big(a\log\mu_t + ae - \eta\tilde\nabla_\mu\phi(\mu_t, \alpha_t),\ \ b\alpha_t + \eta\tilde\nabla_\alpha\phi(\mu_t, \alpha_t)\big),
$$

where $\tilde\nabla_\mu\phi$, $\tilde\nabla_\alpha\phi$ and $\tilde{g}$ are noisy unbiased estimates of $\nabla_\mu\phi$, $\nabla_\alpha\phi$ and $g$, respectively, and $\eta$ is the step size. Next, we map $\nabla\Phi(\tilde{w}_{t+1})$ back to the primal space, i.e., we find $\tilde{w}_{t+1} = (u_{t+1}, v_{t+1})$ such that

$$
a\log u_{t+1} + ae = a\log\mu_t + ae - \eta\tilde\nabla_\mu\phi(\mu_t, \alpha_t), \qquad
b v_{t+1} = b\alpha_t + \eta\tilde\nabla_\alpha\phi(\mu_t, \alpha_t),
$$

which implies $u_{t+1} = \mu_t\circ\exp(-\eta\tilde\nabla_\mu\phi(\mu_t, \alpha_t)/a)$ and $v_{t+1} = \alpha_t + \eta\tilde\nabla_\alpha\phi(\mu_t, \alpha_t)/b$. Finally, we project $(u_{t+1}, v_{t+1})$ back onto $\mathcal{M}\times\mathcal{A}$ based on the Kullback-Leibler divergence and the Euclidean distance, respectively, i.e., we solve the two optimization problems

$$
\mu_{t+1} = \operatorname*{argmin}_{\mu\in\mathcal{M}}\ \sum_k\mu_k\log\frac{\mu_k}{u_{t+1,k}}, \qquad
\alpha_{t+1} = \operatorname*{argmin}_{\alpha\in\mathcal{A}}\ \|\alpha - v_{t+1}\|_2^2.
$$

Fortunately, both problems have closed-form solutions. The latter projects $v_{t+1}$ onto the bounded box, so we have $\alpha_{t+1} = \max\{\min\{v_{t+1}, \tau e\}, 0\}$. For the former, the Lagrangian is $\sum_k\mu_k\log(\mu_k/u_{t+1,k}) + \zeta(e^\top\mu - 1)$, where $\zeta$ is the dual variable. Setting the partial derivative w.r.t. $\mu$ to zero, i.e., $\log(\mu/u_{t+1}) + e + \zeta e = 0$, gives $\mu_{t+1} = u_{t+1}\exp(-1-\zeta)$. Since $\mu_{t+1}$ belongs to the simplex, $1 = e^\top\mu_{t+1} = e^\top u_{t+1}\exp(-1-\zeta) = \|u_{t+1}\|_1\exp(-1-\zeta)$, which implies $\exp(-1-\zeta) = 1/\|u_{t+1}\|_1$; thus $\mu_{t+1} = u_{t+1}/\|u_{t+1}\|_1$. Figure 1 illustrates one iteration of this procedure.

The remaining question is how to obtain the stochastic gradients $\tilde\nabla_\mu\phi(\mu_t, \alpha_t)$ and $\tilde\nabla_\alpha\phi(\mu_t, \alpha_t)$. Note that $\phi(\mu, \alpha) = \sum_{k:\hat{y}^k\in\mathcal{B}}\mu_k G(\alpha, \hat{y}^k)$, so we have

$$
\nabla_\mu\phi(\mu_t, \alpha_t) = \big[G(\alpha_t, \hat{y}^1), \ldots, G(\alpha_t, \hat{y}^{|\mathcal{B}|})\big]^\top, \qquad
\nabla_\alpha\phi(\mu_t, \alpha_t) = \big[\nabla_\alpha G(\alpha_t, \hat{y}^1), \ldots, \nabla_\alpha G(\alpha_t, \hat{y}^{|\mathcal{B}|})\big]\,\mu_t.
$$

By uniformly choosing an index $i_t$ from $\{1, 2, \ldots, |\mathcal{B}|\}$, we obtain $\tilde\nabla_\mu\phi(\mu_t, \alpha_t, i_t) = [0, \ldots, |\mathcal{B}|G(\alpha_t, \hat{y}^{i_t}), \ldots, 0]^\top$. On the other hand, by randomly sampling an index $j_t$ according to the distribution $\mu_t$ on $\{1, 2, \ldots, |\mathcal{B}|\}$, we obtain $\tilde\nabla_\alpha\phi(\mu_t, \alpha_t, j_t) = \nabla_\alpha G(\alpha_t, \hat{y}^{j_t})$. It can be shown that

$$
\mathbb{E}[\tilde\nabla_\mu\phi(\mu_t, \alpha_t, i_t) \mid \mu_t, \alpha_t] = \nabla_\mu\phi(\mu_t, \alpha_t), \qquad
\mathbb{E}[\tilde\nabla_\alpha\phi(\mu_t, \alpha_t, j_t) \mid \mu_t, \alpha_t] = \nabla_\alpha\phi(\mu_t, \alpha_t),
$$

so $\tilde{g}(w_t) = (\tilde\nabla_\mu\phi(\mu_t, \alpha_t, i_t),\ -\tilde\nabla_\alpha\phi(\mu_t, \alpha_t, j_t))$ is an unbiased estimate of $g(w_t)$.
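The sketch below implements one iteration of the procedure just described. It assumes the candidate labelings are stored as the columns of an $m\times|\mathcal{B}|$ matrix `Y_hat`, reuses the hypothetical `G` helper from the previous sketch, and supplies an explicit gradient of that `G` (all names and defaults are illustrative): the entropic step for $\mu$ reduces to a multiplicative update followed by $\ell_1$ normalization, and the Euclidean step for $\alpha$ reduces to a gradient step followed by clipping to $[0, \tau]$.

```python
import numpy as np

def grad_G_alpha(alpha, y_hat, K, lam=1.0, nu=0.5, theta=0.2):
    """Gradient w.r.t. alpha of the G sketched earlier (same block form)."""
    m = K.shape[0]
    a, b = alpha[:m], alpha[m:]
    Qd = (K * np.outer(y_hat, y_hat)) @ (a - b)
    c = (1.0 - theta) ** 2 / (2.0 * lam)
    grad_a = -Qd - c * a + (1.0 - theta)
    grad_b = Qd - (c / nu) * b - (1.0 + theta)
    return np.concatenate([grad_a, grad_b])

def mirror_step(mu, alpha, Y_hat, K, eta, a_w, b_w, tau, G, rng):
    """One stochastic mirror descent iteration of ODMC (a sketch).

    Y_hat is m x |B| with candidate labelings as columns; a_w, b_w weight
    the entropic and Euclidean mirror maps, respectively."""
    n_cand = Y_hat.shape[1]
    # Unbiased mu-gradient estimate: one uniformly drawn labeling.
    i = rng.integers(n_cand)
    g_mu = np.zeros(n_cand)
    g_mu[i] = n_cand * G(alpha, Y_hat[:, i], K)
    # Unbiased alpha-gradient estimate: one labeling drawn from mu.
    j = rng.choice(n_cand, p=mu)
    g_alpha = grad_G_alpha(alpha, Y_hat[:, j], K)
    # Entropic (KL) step on the simplex, Euclidean step on the box.
    u = mu * np.exp(-eta * g_mu / a_w)
    v = alpha + eta * g_alpha / b_w
    mu_next = u / u.sum()                  # projection onto the simplex
    alpha_next = np.clip(v, 0.0, tau)      # projection onto [0, tau]^{2m}
    return mu_next, alpha_next
```

Note that each iteration touches only a single candidate labeling for each of the two gradient estimates, which is what keeps the per-iteration cost low compared with cutting-plane methods.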
Figure 1: Illustration of one iteration of stochastic mirror descent: the current point is mapped into the dual space by $\nabla\Phi$, updated there with the stochastic gradient, mapped back by $(\nabla\Phi)^{-1}$, and projected onto $\mathcal{M}\times\mathcal{A}$.

Algorithm 1 summarizes the pseudo-code of ODMC.

**Algorithm 1** Stochastic mirror descent for ODMC

1. **Input:** data set $X$, maximum iteration number $T$, ODM parameters $\lambda, \nu, \theta$, upper bound $\tau$, step size $\eta$, mirror-map weights $a, b$, stopping criterion $\iota$.
2. Initialize $\mu_0 \leftarrow [1/|\mathcal{B}|, \ldots, 1/|\mathcal{B}|]^\top$, $\alpha_0 \leftarrow 0$, $t \leftarrow 0$.
3. **while** $t < T$ **do**
4. Uniformly select $i_t$ from $\{1, 2, \ldots, |\mathcal{B}|\}$.
5. $\tilde\nabla_\mu\phi(\mu_t, \alpha_t, i_t) \leftarrow [0, \ldots, |\mathcal{B}|G(\alpha_t, \hat{y}^{i_t}), \ldots, 0]^\top$.
6. Select $j_t$ from $\{1, 2, \ldots, |\mathcal{B}|\}$ according to $\mu_t$.
7. $\tilde\nabla_\alpha\phi(\mu_t, \alpha_t, j_t) \leftarrow \nabla_\alpha G(\alpha_t, \hat{y}^{j_t})$.
8. $u_{t+1} \leftarrow \mu_t \circ \exp(-\eta\,\tilde\nabla_\mu\phi(\mu_t, \alpha_t, i_t)/a)$.
9. $v_{t+1} \leftarrow \alpha_t + \eta\,\tilde\nabla_\alpha\phi(\mu_t, \alpha_t, j_t)/b$.
10. $\mu_{t+1} \leftarrow u_{t+1}/\|u_{t+1}\|_1$.
11. $\alpha_{t+1} \leftarrow \max\{\min\{v_{t+1}, \tau e\}, 0\}$.
12. $t \leftarrow t + 1$.
13. **if** the duality gap is smaller than $\iota$ **then**
14. **break**.
15. **end if**
16. **end while**
17. **Output:** $\mu$, $\alpha$.

### Recovering the cluster assignment

Once the saddle point $(\hat\mu, \hat\alpha)$ is found, we can obtain the cluster assignment as $\mathrm{sign}\big(\sum_{k:\hat{y}^k\in\mathcal{B}}\hat\mu_k\,\hat{y}^k\big)$.
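A compact driver corresponding to Algorithm 1 could then look as follows (a sketch reusing the hypothetical `G` and `mirror_step` helpers from the previous snippets; the duality-gap check of steps 13-15 is omitted and the default parameter values are illustrative), with the cluster assignment recovered as $\mathrm{sign}(\sum_k\mu_k\hat{y}^k)$ from the returned $\mu$.

```python
import numpy as np

def odmc(K, Y_hat, T=1000, eta=None, a_w=1.0, b_w=1.0, tau=10.0, seed=0):
    """Run T stochastic mirror descent iterations (a sketch of Algorithm 1)
    and recover a cluster assignment from the final iterate (the analysis
    in the next subsection averages the iterates instead)."""
    rng = np.random.default_rng(seed)
    m, n_cand = Y_hat.shape
    eta = np.sqrt(2.0 / T) if eta is None else eta   # step size, cf. Theorem 1
    mu = np.full(n_cand, 1.0 / n_cand)               # mu_0: uniform over B
    alpha = np.zeros(2 * m)                          # alpha_0 = 0
    for _ in range(T):
        mu, alpha = mirror_step(mu, alpha, Y_hat, K, eta, a_w, b_w, tau, G, rng)
    labels = np.sign(Y_hat @ mu)                     # sign(sum_k mu_k y_hat^k)
    return labels, mu, alpha
```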
### Convergence rate

For the cutting-plane based algorithms, it has been proven that the time complexity is $O(1/\epsilon^2)$ (Zhao, Wang, and Zhang 2008), i.e., to get a solution with $\epsilon$ accuracy, the algorithm needs to run $O(1/\epsilon^2)$ iterations. In this section, we show that ODMC has the same convergence rate. Note, however, that in each iteration a cutting-plane based algorithm needs to find the most violated label and then train an SVM model, whereas ODMC just performs a random sampling and one step of stochastic gradient descent, followed by a projection with closed-form solutions, so our method is much more scalable.

**Theorem 1.** Assume $|G(\alpha, \hat{y}^k)|$ and $\|\nabla_\alpha G(\alpha, \hat{y}^k)\|_2$ are upper bounded by $L_1$ and $L_2$, respectively, for any $\hat{y}^k\in\mathcal{B}$ and $\alpha\in\mathcal{A}$. Let $a = |\mathcal{B}|L_1/\sqrt{\log|\mathcal{B}|}$, $b = \sqrt{2}L_2/(\sqrt{m}\,\tau)$ and $\eta = \sqrt{2/T}$. Then the expectation of the duality gap at the average point $\big(\sum_{t=1}^T\mu_t/T,\ \sum_{t=1}^T\alpha_t/T\big)$ satisfies

$$
\mathbb{E}\Big[\max_{\alpha\in\mathcal{A}}\phi\Big(\tfrac{1}{T}\textstyle\sum_{t=1}^T\mu_t, \alpha\Big) - \min_{\mu\in\mathcal{M}}\phi\Big(\mu, \tfrac{1}{T}\textstyle\sum_{t=1}^T\alpha_t\Big)\Big]
\le \frac{\sqrt{2}\,|\mathcal{B}|L_1\sqrt{\log|\mathcal{B}|} + L_2\tau\sqrt{m}}{\sqrt{T}}.
$$

*Proof.* Similar to the analysis of vanilla gradient descent, we can prove that the duality gap at the average point is upper bounded by the sum of inner products between the gradient $g(w_t)$ and the gap $w_t - w$:

$$
\begin{aligned}
\mathbb{E}\Big[\max_{\alpha\in\mathcal{A}}\phi\Big(\tfrac{1}{T}\textstyle\sum_t\mu_t, \alpha\Big) - \min_{\mu\in\mathcal{M}}\phi\Big(\mu, \tfrac{1}{T}\textstyle\sum_t\alpha_t\Big)\Big]
&\le \mathbb{E}\Big[\max_{\alpha\in\mathcal{A}}\frac{1}{T}\sum_{t=1}^T\phi(\mu_t, \alpha) - \min_{\mu\in\mathcal{M}}\frac{1}{T}\sum_{t=1}^T\phi(\mu, \alpha_t)\Big]\\
&\le \frac{1}{T}\sum_{t=1}^T\mathbb{E}\Big[\max_{\alpha\in\mathcal{A}}\phi(\mu_t, \alpha) - \min_{\mu\in\mathcal{M}}\phi(\mu, \alpha_t)\Big]\\
&\le \frac{1}{T}\sum_{t=1}^T\mathbb{E}\big[g(w_t)^\top(w_t - w)\big]
= \frac{1}{T}\sum_{t=1}^T\mathbb{E}\big[\mathbb{E}[\tilde{g}(w_t)^\top(w_t - w) \mid w_t]\big]
= \frac{1}{T}\sum_{t=1}^T\mathbb{E}\big[\tilde{g}(w_t)^\top(w_t - w)\big],
\end{aligned}
$$

where the first inequality holds since $\phi(\mu, \alpha)$ is convex in $\mu$ and concave in $\alpha$, the second inequality is owing to the sub-additivity of max, the third inequality is Eq. (9), and the final equalities follow from the unbiasedness of $\tilde{g}$ and the law of total expectation. Next we bound each term in the summation separately. With the update rule of stochastic mirror descent, we have

$$
\begin{aligned}
\tilde{g}(w_t)^\top(w_t - w)
&= \frac{1}{\eta}\big(\nabla\Phi(w_t) - \nabla\Phi(\tilde{w}_{t+1})\big)^\top(w_t - w)\\
&= \frac{1}{\eta}\big(\Delta_\Phi(w, w_t) + \Delta_\Phi(w_t, \tilde{w}_{t+1}) - \Delta_\Phi(w, \tilde{w}_{t+1})\big)\\
&\le \frac{1}{\eta}\big(\Delta_\Phi(w, w_t) + \Delta_\Phi(w_t, \tilde{w}_{t+1}) - \Delta_\Phi(w, w_{t+1}) - \Delta_\Phi(w_{t+1}, \tilde{w}_{t+1})\big),
\end{aligned}
$$

where $\Delta_\Phi(w, w_t) = \Phi(w) - \Phi(w_t) - \nabla\Phi(w_t)^\top(w - w_t)$ is the Bregman divergence. Since $w_{t+1}$ is the projection of $\tilde{w}_{t+1}$ onto the convex set $\mathcal{M}\times\mathcal{A}$, the final inequality holds for any $w\in\mathcal{M}\times\mathcal{A}$ according to the generalized triangle inequality. Note that the first and third terms lead to a telescoping sum when summing over $t = 1$ to $t = T$; it remains to bound the other two terms,

$$
\begin{aligned}
\Delta_\Phi(w_t, \tilde{w}_{t+1}) - \Delta_\Phi(w_{t+1}, \tilde{w}_{t+1})
&= \Phi(w_t) - \Phi(w_{t+1}) - \nabla\Phi(\tilde{w}_{t+1})^\top(w_t - w_{t+1})\\
&\le \eta\,\tilde{g}(w_t)^\top(w_t - w_{t+1}) - \frac{1}{2}\|w_t - w_{t+1}\|_\bullet^2\\
&\le \eta\,\|\tilde{g}(w_t)\|_*\,\|w_t - w_{t+1}\|_\bullet - \frac{1}{2}\|w_t - w_{t+1}\|_\bullet^2\\
&= -\frac{1}{2}\big(\|w_t - w_{t+1}\|_\bullet - \eta\|\tilde{g}(w_t)\|_*\big)^2 + \frac{1}{2}\big(\eta\|\tilde{g}(w_t)\|_*\big)^2
\ \le\ \frac{\eta^2}{2}\|\tilde{g}(w_t)\|_*^2,
\end{aligned}
$$

where $\|w\|_\bullet^2 = a\|\mu\|_1^2 + b\|\alpha\|_2^2$ and $\|w\|_*^2 = \|\mu\|_\infty^2/a + \|\alpha\|_2^2/b$ is the corresponding dual norm. The first inequality holds since $\Phi(\cdot)$ is a 1-strongly convex function w.r.t. the norm $\|\cdot\|_\bullet$, and the second inequality follows from Hölder's inequality. Note that $|G(\alpha, \hat{y}^k)| \le L_1$ and $\|\nabla_\alpha G(\alpha, \hat{y}^k)\|_2 \le L_2$, hence we have

$$
\|\tilde{g}(w_t)\|_*^2 = \frac{1}{a}\|\tilde\nabla_\mu\phi(w_t, i_t)\|_\infty^2 + \frac{1}{b}\|\tilde\nabla_\alpha\phi(w_t, j_t)\|_2^2
\le \frac{1}{a}|\mathcal{B}|^2 L_1^2 + \frac{1}{b}L_2^2
= |\mathcal{B}|L_1\sqrt{\log|\mathcal{B}|} + \frac{L_2\tau\sqrt{m}}{\sqrt{2}}.
$$

Further note that

$$
\Delta_\Phi(w, w_1) = \Phi(w) - \Phi(w_1) - \nabla\Phi(w_1)^\top(w - w_1)
= a\sum_k\mu_k\log\frac{\mu_k}{\mu_{1,k}} + \frac{b}{2}\|\alpha - \alpha_1\|_2^2
\le a\log|\mathcal{B}| + \frac{bm\tau^2}{2}
= |\mathcal{B}|L_1\sqrt{\log|\mathcal{B}|} + \frac{L_2\tau\sqrt{m}}{\sqrt{2}}.
$$

Combining the above, we have

$$
\frac{1}{T}\sum_{t=1}^T\mathbb{E}\big[\tilde{g}(w_t)^\top(w_t - w)\big]
\le \frac{1}{T}\sum_{t=1}^T\frac{\eta}{2}\,\mathbb{E}\big[\|\tilde{g}(w_t)\|_*^2\big] + \frac{\Delta_\Phi(w, w_1) - \Delta_\Phi(w, w_{T+1})}{\eta T}
\le \Big(\frac{\eta}{2} + \frac{1}{\eta T}\Big)\Big(|\mathcal{B}|L_1\sqrt{\log|\mathcal{B}|} + \frac{L_2\tau\sqrt{m}}{\sqrt{2}}\Big).
$$

By substituting $\eta = \sqrt{2/T}$ we conclude the proof.

This theorem shows that the duality gap decays at the rate $O(1/\sqrt{T})$. By setting $O(1/\sqrt{T}) = \epsilon$ we obtain the time complexity $T = O(1/\epsilon^2)$, which is the same as that of cutting-plane based methods.

## Empirical Study

In this section, we empirically evaluate the proposed method on 24 UCI data sets. Table 1 summarizes the statistics of these data sets. As can be seen, the number of instances ranges from 62 to 3,175 and the dimensionality ranges from 2 to 4,702, covering a broad range of properties.

### Evaluation criteria

We set the number of clusters equal to the true number of classes for all the methods. To evaluate their performance, we compare the generated clusters with the true classes by computing the following two performance measures (a short sketch of both measures is given after Table 1).

**Clustering Accuracy (Acc)** (Xu et al. 2005). First remove the labels of all instances, then predict the clusters by performing the clustering methods, and finally measure the classification accuracy according to the true labels.

**Rand Index (RI)** (Rand 1971). Let $C$ be the set of clustering results and denote by $L$ the set of true classes. The Rand index represents the frequency of agreements over all instance pairs, i.e., the probability that $C$ and $L$ agree on a randomly chosen pair of instances.

### Compared methods

ODMC is compared with the k-means (KM) method, the normalized cut (NC) method (Shi and Malik 2000), GMMC (Valizadegan and Jin 2006), IterSVR (Zhang, Tsang, and Kwok 2007), CPMMC (Zhao, Wang, and Zhang 2008) and LG-MMC (Li et al. 2009). MMC (Xu et al. 2005) is not chosen as a baseline since it cannot return results in a reasonable time for most data sets. For GMMC, IterSVR, CPMMC, LG-MMC and ODMC, the parameter $C$ or $\lambda$ is selected from $\{1, 10, 100, 1000\}$. For ODMC, $\nu$ and $\theta$ are selected from $\{0.2, 0.4, 0.6, 0.8\}$. For all data sets, both the linear and Gaussian kernels are used. In particular, the width $\sigma$ of the Gaussian kernel is picked from $\{0.25\gamma, 0.5\gamma, \gamma, 2\gamma, 4\gamma\}$, where $\gamma$ is the average distance between instances. The parameter of normalized cut is chosen from the same range as $\sigma$. The balance constraint is set in the same manner as in (Zhang, Tsang, and Kwok 2007), i.e., $0.03m$ for balanced data sets and $0.3m$ for imbalanced ones. All the experiments are repeated 10 times and the average performance with the best parameter setting is reported.

Table 1: Characteristics of the experimental data sets.

| ID | Data set | #Instance | #Feature |
|---:|----------|----------:|---------:|
| 1 | echocardiogram | 62 | 8 |
| 2 | dbworld | 64 | 4,702 |
| 3 | hepatitis | 80 | 19 |
| 4 | colic | 188 | 13 |
| 5 | house | 232 | 16 |
| 6 | heart-h | 261 | 10 |
| 7 | heart | 270 | 9 |
| 8 | heart-statlog | 270 | 13 |
| 9 | breast | 277 | 9 |
| 10 | cylinder-bands | 277 | 39 |
| 11 | heart-c | 296 | 13 |
| 12 | haberman | 306 | 14 |
| 13 | ionosphere | 351 | 33 |
| 14 | vehicle | 435 | 16 |
| 15 | credit-a | 653 | 15 |
| 16 | diabetes | 768 | 8 |
| 17 | fourclass | 862 | 2 |
| 18 | tic-tac-toe | 958 | 9 |
| 19 | credit-g | 1,000 | 20 |
| 20 | german | 1,000 | 59 |
| 21 | optdigits | 1,143 | 42 |
| 22 | svmguide3 | 1,284 | 22 |
| 23 | sick | 2,643 | 28 |
| 24 | splice | 3,175 | 60 |
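Concretely, the two evaluation measures described above can be computed as in the following sketch (hypothetical helper names; both functions assume two clusters encoded as $\pm 1$, and the accuracy takes the better of the two possible matchings between clusters and classes).

```python
import numpy as np
from itertools import combinations

def clustering_accuracy(pred, truth):
    """Two-cluster accuracy with +/-1 labels: best agreement over the two
    possible matchings between clusters and classes."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    return max(np.mean(pred == truth), np.mean(pred == -truth))

def rand_index(pred, truth):
    """Fraction of instance pairs on which the clustering and the true
    classes agree (grouped together in both, or separated in both)."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    agree, total = 0, 0
    for i, j in combinations(range(len(pred)), 2):
        agree += (pred[i] == pred[j]) == (truth[i] == truth[j])
        total += 1
    return agree / total
```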
### Performance

Table 2 summarizes the results on the 24 UCI data sets. GMMC did not return results on some data sets due to its high computational cost. As can be seen, for both measures, ODMC achieves the best performance on 17 data sets and shows significant improvement over existing MMC approaches on most data sets.

We also compare the average single-iteration time cost of our method with IterSVR, CPMMC and LG-MMC on some representative data sets. All the experiments are performed with MATLAB 2017b on a machine with 8 2.60 GHz CPUs and 32 GB main memory. As shown in Figure 2, our method achieves the lowest time cost on most data sets and is only slightly worse than the compared methods on the data set credit-g. Note that the sub-problems of IterSVR and LG-MMC are solved by LIBSVM (Chang and Lin 2011), which is a fast implementation of both SVR and SVM; this shows that our method is also computationally efficient.

Figure 2: Average single-iteration time cost (sec) of IterSVR, CPMMC, LG-MMC and ODMC on credit-g, german, optdigits, svmguide3, sick and splice.

## Conclusions

Maximum margin clustering (MMC), which employs the large margin heuristic from the support vector machine, has achieved more accurate results than traditional clustering methods. Recent studies disclosed that instead of the minimum margin, it is more crucial to optimize the margin distribution for SVM-style learning algorithms. Inspired by this recognition, we propose a novel approach, ODMC, for clustering by optimizing the margin distribution. To solve the resultant minimax problem, we extend a stochastic mirror descent method which is much more scalable than existing cutting-plane based approaches. Experimental results on various data sets show that our method achieves promising clustering performance, which further verifies the superiority of optimal margin distribution learning. In the future, we will apply importance sampling (Schmidt et al. 2015) to further accelerate our method and extend it to other learning settings, e.g., semi-supervised learning.

Table 2: Clustering Accuracy (Acc) and Rand Index (RI) comparisons on 24 UCI data sets. The best performance on each data set is bolded. The win/tie/loss counts of ODMC versus each compared method, based on paired t-tests at the 95% significance level, are summarized in the last two rows. GMMC did not return results on some data sets within two days (marked N/A).
| Data set | Measure | KM | NC | GMMC | IterSVR | CPMMC | LG-MMC | ODMC |
|----------|---------|----|----|------|---------|-------|--------|------|
| echocardiogram | Acc | 0.696 | 0.677 | 0.694 | 0.537 | 0.710 | 0.774 | **0.792** |
| | RI | 0.572 | 0.556 | 0.568 | 0.498 | 0.581 | 0.645 | **0.661** |
| dbworld | Acc | 0.599 | 0.625 | 0.578 | 0.840 | 0.547 | **0.859** | 0.857 |
| | RI | 0.534 | 0.524 | 0.504 | 0.743 | 0.497 | **0.754** | 0.748 |
| hepatitis | Acc | 0.693 | 0.525 | 0.575 | 0.638 | 0.838 | 0.838 | **0.841** |
| | RI | 0.576 | 0.495 | 0.505 | 0.534 | 0.724 | 0.724 | **0.731** |
| colic | Acc | 0.749 | 0.622 | 0.537 | 0.740 | 0.622 | 0.840 | **0.862** |
| | RI | 0.622 | 0.527 | 0.500 | 0.613 | 0.527 | 0.730 | **0.752** |
| house | Acc | 0.893 | 0.534 | 0.737 | 0.901 | 0.534 | **0.905** | 0.902 |
| | RI | 0.808 | 0.500 | 0.611 | 0.821 | 0.500 | **0.828** | 0.825 |
| heart-h | Acc | 0.742 | 0.536 | 0.617 | 0.780 | 0.625 | 0.789 | **0.795** |
| | RI | 0.636 | 0.501 | 0.525 | 0.662 | 0.529 | 0.666 | **0.669** |
| heart | Acc | 0.670 | 0.567 | 0.563 | 0.684 | 0.556 | 0.744 | **0.772** |
| | RI | 0.572 | 0.507 | 0.506 | 0.589 | 0.504 | 0.618 | **0.637** |
| heart-statlog | Acc | 0.765 | 0.504 | 0.741 | 0.761 | 0.556 | 0.796 | **0.811** |
| | RI | 0.649 | 0.498 | 0.614 | 0.645 | 0.504 | 0.674 | **0.681** |
| breast | Acc | 0.629 | 0.592 | 0.538 | 0.575 | 0.708 | **0.726** | **0.726** |
| | RI | 0.550 | 0.515 | 0.501 | 0.518 | 0.585 | **0.600** | **0.600** |
| cylinder-bands | Acc | 0.634 | 0.534 | 0.574 | 0.626 | 0.643 | 0.657 | **0.675** |
| | RI | 0.534 | 0.501 | 0.509 | 0.530 | 0.539 | 0.548 | **0.571** |
| heart-c | Acc | 0.662 | 0.551 | 0.588 | **0.780** | 0.541 | 0.777 | 0.775 |
| | RI | 0.559 | 0.503 | 0.514 | **0.657** | 0.502 | 0.652 | 0.652 |
| haberman | Acc | 0.604 | 0.735 | 0.582 | 0.520 | 0.735 | **0.739** | 0.736 |
| | RI | 0.530 | 0.609 | 0.512 | 0.500 | 0.609 | **0.613** | 0.611 |
| ionosphere | Acc | 0.708 | 0.541 | 0.721 | 0.691 | 0.641 | 0.738 | **0.754** |
| | RI | 0.586 | 0.502 | 0.596 | 0.572 | 0.538 | 0.612 | **0.636** |
| vehicle | Acc | 0.659 | 0.510 | 0.554 | 0.693 | 0.501 | 0.715 | **0.742** |
| | RI | 0.556 | 0.499 | 0.505 | 0.581 | 0.499 | 0.591 | **0.624** |
| credit-a | Acc | 0.730 | 0.576 | 0.522 | 0.771 | 0.547 | **0.772** | 0.770 |
| | RI | 0.636 | 0.511 | 0.500 | **0.669** | 0.504 | 0.647 | 0.662 |
| diabetes | Acc | 0.667 | 0.637 | 0.663 | 0.629 | 0.651 | 0.733 | **0.745** |
| | RI | 0.555 | 0.537 | 0.552 | 0.533 | 0.545 | 0.608 | **0.625** |
| fourclass | Acc | 0.659 | 0.624 | 0.515 | 0.640 | 0.644 | 0.763 | **0.788** |
| | RI | 0.552 | 0.530 | 0.500 | 0.546 | 0.541 | 0.638 | **0.666** |
| tic-tac-toe | Acc | 0.547 | 0.551 | 0.511 | 0.548 | 0.653 | 0.653 | **0.658** |
| | RI | 0.504 | 0.505 | 0.500 | 0.505 | 0.547 | 0.547 | **0.551** |
| credit-g | Acc | 0.539 | 0.683 | N/A | 0.534 | 0.700 | 0.704 | **0.710** |
| | RI | 0.507 | 0.567 | N/A | 0.509 | 0.580 | 0.583 | **0.585** |
| german | Acc | 0.569 | **0.700** | N/A | 0.546 | **0.700** | **0.700** | **0.700** |
| | RI | 0.510 | **0.580** | N/A | 0.505 | **0.580** | **0.580** | **0.580** |
| optdigits | Acc | 0.962 | 0.501 | N/A | **0.995** | 0.500 | 0.978 | 0.980 |
| | RI | 0.952 | 0.500 | N/A | **0.989** | 0.500 | 0.957 | 0.958 |
| svmguide3 | Acc | 0.593 | 0.657 | N/A | 0.572 | 0.738 | 0.739 | **0.741** |
| | RI | 0.518 | 0.549 | N/A | 0.511 | 0.613 | 0.614 | **0.618** |
| sick | Acc | 0.731 | 0.518 | N/A | 0.515 | **0.920** | **0.920** | **0.920** |
| | RI | 0.628 | 0.500 | N/A | 0.501 | **0.852** | **0.852** | **0.852** |
| splice | Acc | **0.665** | 0.519 | N/A | 0.630 | 0.519 | 0.641 | 0.655 |
| | RI | **0.554** | 0.501 | N/A | 0.534 | 0.501 | 0.539 | 0.550 |
| ODMC: w/t/l | Acc | 22/2/0 | 22/2/0 | 18/0/0 | 18/5/1 | 17/7/0 | 9/15/0 | |
| | RI | 22/2/0 | 22/2/0 | 18/0/0 | 19/4/1 | 17/7/0 | 9/15/0 | |

## References

Chang, C.-C., and Lin, C.-J. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2:27:1–27:27.

Cortes, C., and Vapnik, V. 1995. Support-vector networks. Machine Learning 20(3):273–297.

Cristianini, N., and Shawe-Taylor, J. 2000. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge, UK: Cambridge University Press.

Gao, W., and Zhou, Z.-H. 2013. On the doubt about margin explanation of boosting. Artificial Intelligence 203:1–18.

Jain, A. K., and Dubes, R. C. 1988. Algorithms for Clustering Data. Upper Saddle River, NJ: Prentice-Hall.

Jain, A. K. 2010. Data clustering: 50 years beyond k-means. Pattern Recognition Letters 31(8):651–666.

Kim, S.-J., and Boyd, S. 2008. A minimax theorem with applications to machine learning, signal processing, and finance. SIAM Journal on Optimization 19(3):1344–1367.

Li, Y.-F.; Tsang, I. W.; Kwok, J. T.; and Zhou, Z.-H. 2009. Tighter and convex maximum margin clustering. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, volume 5, 344–351.
Li, Y.-F.; Tsang, I. W.; Kwok, J. T.; and Zhou, Z.-H. 2013. Convex and scalable weakly labeled SVMs. Journal of Machine Learning Research 14(1):2151–2188.

Niu, G.; Dai, B.; Shang, L.; and Sugiyama, M. 2013. Maximum volume clustering: A new discriminative clustering approach. Journal of Machine Learning Research 14:2641–2687.

Rand, W. M. 1971. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66(336):846–850.

Schmidt, M.; Babanezhad, R.; Ahmed, M.; Defazio, A.; Clifton, A.; and Sarkar, A. 2015. Non-uniform stochastic average gradient method for training conditional random fields. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, 819–828. San Diego, CA: PMLR.

Schölkopf, B., and Smola, A. J. 2001. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA: MIT Press.

Shi, J., and Malik, J. 2000. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8):888–905.

Sion, M. 1958. On general minimax theorems. Pacific Journal of Mathematics 8(1):171–176.

Valizadegan, H., and Jin, R. 2006. Generalized maximum margin clustering and unsupervised kernel learning. In Proceedings of the 19th International Conference on Neural Information Processing Systems, 1417–1424. Cambridge, MA: MIT Press.

Vapnik, V. 1995. The Nature of Statistical Learning Theory. New York: Springer-Verlag.

Vijaya Saradhi, V., and Charly Abraham, P. 2016. Incremental maximum margin clustering. Pattern Analysis and Applications 19(4):1057–1067.

Wang, F.; Zhao, B.; and Zhang, C. 2010. Linear time maximum margin clustering. IEEE Transactions on Neural Networks 21(2):319–332.

Xu, R., and Wunsch, II, D. 2005. Survey of clustering algorithms. IEEE Transactions on Neural Networks 16(3):645–678.

Xu, L.; Neufeld, J.; Larson, B.; and Schuurmans, D. 2005. Maximum margin clustering. In Advances in Neural Information Processing Systems 17, 1537–1544. MIT Press.

Zhang, T., and Zhou, Z.-H. 2014. Large margin distribution machine. In Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 313–322.

Zhang, T., and Zhou, Z.-H. 2016. Optimal margin distribution machine. CoRR abs/1604.03348.

Zhang, T., and Zhou, Z.-H. 2017. Multi-class optimal margin distribution machine. In Proceedings of the 34th International Conference on Machine Learning, 4063–4071.

Zhang, K.; Tsang, I. W.; and Kwok, J. T. 2007. Maximum margin clustering made practical. In Proceedings of the 24th International Conference on Machine Learning, 1119–1126.

Zhang, K.; Tsang, I. W.; and Kwok, J. T. 2009. Maximum margin clustering made practical. IEEE Transactions on Neural Networks 20(4):583–596.

Zhao, B.; Wang, F.; and Zhang, C. 2008. Efficient maximum margin clustering via cutting plane algorithm. In Proceedings of the SIAM International Conference on Data Mining, 751–762.

Zhou, Y.-H., and Zhou, Z.-H. 2016. Large margin distribution learning with cost interval and unlabeled data. IEEE Transactions on Knowledge and Data Engineering 28(7):1749–1763.

Zhou, G.-T.; Lan, T.; Vahdat, A.; and Mori, G. 2013. Latent maximum margin clustering. In Proceedings of the 26th International Conference on Neural Information Processing Systems, 28–36. USA: Curran Associates Inc.
Zhou, Z.-H. 2014. Large margin distribution learning. In Proceedings of the 6th IAPR International Workshop on Artificial Neural Networks in Pattern Recognition, 1–11.