# Multi-Class Optimal Margin Distribution Machine

Teng Zhang, Zhi-Hua Zhou

National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China. Correspondence to: Zhi-Hua Zhou.

*Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).*

## Abstract

Recent studies disclose that maximizing the minimum margin, as support vector machines do, does not necessarily lead to better generalization performance; instead, it is crucial to optimize the margin distribution. Although it has been shown that, for binary classification, characterizing the margin distribution by the first- and second-order statistics can achieve superior performance, the question remains open for multi-class classification, and due to the complexity of the margin for multi-class classification, optimizing its distribution by mean and variance can also be difficult. In this paper, we propose mcODM (multi-class Optimal margin Distribution Machine), which can solve this problem efficiently. We also give a theoretical analysis for our method, which verifies the significance of the margin distribution for multi-class classification. Empirical study further shows that mcODM always outperforms all four versions of multi-class SVMs on all experimental data sets.

## 1. Introduction

Support vector machines (SVMs) and Boosting have been two mainstream learning approaches during the past decade. The former (Cortes & Vapnik, 1995) is rooted in statistical learning theory (Vapnik, 1995), with the central idea of searching for a large-margin separator, i.e., maximizing the smallest distance from the instances to the classification boundary in an RKHS (reproducing kernel Hilbert space). It is noteworthy that there is also a long history of applying margin theory to explain the latter (Freund & Schapire, 1995; Schapire et al., 1998), owing to its tendency to be empirically resistant to over-fitting (Reyzin & Schapire, 2006; Wang et al., 2011; Zhou, 2012).

Recently, the margin theory for Boosting has finally been defended (Gao & Zhou, 2013), and it discloses that the margin distribution, rather than a single margin, is more crucial to the generalization performance. This suggests that there may still exist large space to further enhance SVMs. Inspired by this recognition, Zhang & Zhou (2014; 2016) proposed a binary classification method that optimizes the margin distribution by characterizing it through the first- and second-order statistics, which achieves quite satisfactory experimental results. Later, Zhou & Zhou (2016) extended the idea to an approach which is able to exploit unlabeled data and handle unequal misclassification costs. A brief summary of this line of early research can be found in (Zhou, 2014).

Although it has been shown that, for binary classification, optimizing the margin distribution by maximizing the margin mean and minimizing the margin variance simultaneously can yield superior performance, the question remains open for multi-class classification. Moreover, the margin for multi-class classification is much more complicated than that for binary classification, which makes the resultant optimization a difficult non-differentiable non-convex programming problem. In this paper, we propose mcODM (multi-class Optimal margin Distribution Machine) to solve this problem efficiently.
For optimization, we relax mcODM into a series of convex quadratic programming (QP) problems, and extend the Block Coordinate Descent (BCD) algorithm (Tseng, 2001) to solve the dual of each QP. The sub-problem of each iteration of BCD is also a QP. By exploiting its special structure, we derive a sorting algorithm to solve it which is much faster than general QP solvers. We further provide a generalization error bound based on Rademacher complexity, and present an analysis of the relationship between the generalization error and the margin distribution for multi-class classification. Extensive experiments on twenty-two data sets show the superiority of our method to all four versions of multi-class SVMs.

The rest of this paper is organized as follows. Section 2 introduces some preliminaries. Section 3 formulates the problem. Section 4 presents the proposed algorithm. Section 5 discusses some theoretical analyses. Section 6 reports on our experimental studies and empirical observations. Finally, Section 7 concludes with future work.

## 2. Preliminaries

We denote by $\mathcal{X} \subseteq \mathbb{R}^d$ the instance space and $\mathcal{Y} = [k]$ the label set, where $[k] = \{1, \ldots, k\}$. Let $\mathcal{D}$ be an unknown (underlying) distribution over $\mathcal{X} \times \mathcal{Y}$. A training set $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\} \in (\mathcal{X} \times \mathcal{Y})^m$ is drawn identically and independently (i.i.d.) according to the distribution $\mathcal{D}$. Let $\phi : \mathcal{X} \mapsto \mathbb{H}$ be a feature mapping associated to some positive definite kernel $\kappa$. In the multi-class classification setting, the hypothesis is defined based on $k$ weight vectors $w_1, \ldots, w_k \in \mathbb{H}$, where each weight vector $w_l$, $l \in \mathcal{Y}$, defines a scoring function $x \mapsto w_l^\top \phi(x)$, and the label of an instance $x$ is the one resulting in the largest score, i.e., $h(x) = \operatorname{argmax}_{l \in \mathcal{Y}} w_l^\top \phi(x)$. This decision function naturally leads to the following definition of the margin for a labeled instance $(x, y)$:

$$\gamma_h(x, y) = w_y^\top \phi(x) - \max_{l \neq y} w_l^\top \phi(x).$$

Thus $h$ misclassifies $(x, y)$ if and only if it produces a negative margin for this instance.

Given a hypothesis set $\mathcal{H}$ of functions mapping $\mathcal{X}$ to $\mathcal{Y}$ and the labeled training set $S$, our goal is to learn a function $h \in \mathcal{H}$ such that the generalization error $R(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[1_{h(x) \neq y}]$ is small.

## 3. Formulation

To design an optimal margin distribution machine for multi-class classification, we need to understand how to optimize the margin distribution. Gao & Zhou (2013) proved that, to characterize the margin distribution, it is important to consider not only the margin mean but also the margin variance. Inspired by this idea, Zhang & Zhou (2014; 2016) proposed the optimal margin distribution machine for binary classification, which characterizes the margin distribution according to the first- and second-order statistics, that is, maximizing the margin mean and minimizing the margin variance simultaneously. Specifically, let $\bar\gamma$ denote the margin mean; then the optimal margin distribution machine can be formulated as

$$
\begin{aligned}
\min_{w, \bar\gamma, \xi_i, \epsilon_i}\ & \Omega(w) - \eta\bar\gamma + \frac{\lambda}{m}\sum_{i=1}^m (\xi_i^2 + \epsilon_i^2) \\
\text{s.t.}\ & \gamma_h(x_i, y_i) \geq \bar\gamma - \xi_i, \\
& \gamma_h(x_i, y_i) \leq \bar\gamma + \epsilon_i, \quad \forall i,
\end{aligned}
$$

where $\Omega(w)$ is a regularization term to penalize the model complexity, $\eta$ and $\lambda$ are trading-off parameters, and $\xi_i$ and $\epsilon_i$ are the deviations of the margin $\gamma_h(x_i, y_i)$ from the margin mean. It is evident that $\sum_{i=1}^m (\xi_i^2 + \epsilon_i^2)/m$ is exactly the margin variance.
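To make these margin statistics concrete, here is a minimal NumPy sketch (not from the paper; the function name and array layout are our own) that computes the multi-class margins defined in Section 2, together with the first- and second-order statistics that ODM optimizes:

```python
import numpy as np

def multiclass_margins(scores, y):
    """Margins gamma_h(x_i, y_i) = s_{y_i} - max_{l != y_i} s_l.

    scores: (m, k) array with scores[i, l] = w_l . phi(x_i)
    y:      (m,) integer labels in {0, ..., k-1}
    """
    m = scores.shape[0]
    true_scores = scores[np.arange(m), y]
    masked = scores.copy()
    masked[np.arange(m), y] = -np.inf      # exclude the true class
    runner_up = masked.max(axis=1)         # max_{l != y_i} score
    return true_scores - runner_up

# The first- and second-order statistics characterized by ODM:
# margins = multiclass_margins(scores, y)
# margin_mean, margin_var = margins.mean(), margins.var()
```

An instance is misclassified exactly when its entry in `margins` is negative.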
By rescaling $w$, which does not affect the final classification results, the margin mean can be fixed as 1; then the deviation of the margin of $(x_i, y_i)$ from the margin mean is $|\gamma_h(x_i, y_i) - 1|$, and the optimal margin distribution machine can be reformulated as

$$
\begin{aligned}
\min_{w, \xi_i, \epsilon_i}\ & \Omega(w) + \frac{\lambda}{m}\sum_{i=1}^m \frac{\xi_i^2 + \mu\epsilon_i^2}{(1-\theta)^2} \\
\text{s.t.}\ & \gamma_h(x_i, y_i) \geq 1 - \theta - \xi_i, \\
& \gamma_h(x_i, y_i) \leq 1 + \theta + \epsilon_i, \quad \forall i,
\end{aligned}
$$

where $\mu \in (0, 1]$ is a parameter to trade off the two different kinds of deviation (larger or smaller than the margin mean), $\theta \in [0, 1)$ is the parameter of the zero-loss band, which controls the number of support vectors, i.e., the sparsity of the solution, and $(1-\theta)^2$ in the denominator scales the second term to be a surrogate loss for the 0-1 loss.

For multi-class classification, let the regularization term be $\Omega(w) = \frac{1}{2}\sum_{l=1}^k \|w_l\|_{\mathbb H}^2$; combining it with the definition of the margin, we arrive at the formulation of mcODM:

$$
\begin{aligned}
\min_{w_l, \xi_i, \epsilon_i}\ & \frac{1}{2}\sum_{l=1}^k \|w_l\|_{\mathbb H}^2 + \frac{\lambda}{m}\sum_{i=1}^m \frac{\xi_i^2 + \mu\epsilon_i^2}{(1-\theta)^2} \qquad (1) \\
\text{s.t.}\ & w_{y_i}^\top \phi(x_i) - \max_{l \neq y_i} w_l^\top \phi(x_i) \geq 1 - \theta - \xi_i, \\
& w_{y_i}^\top \phi(x_i) - \max_{l \neq y_i} w_l^\top \phi(x_i) \leq 1 + \theta + \epsilon_i, \quad \forall i,
\end{aligned}
$$

where $\lambda$, $\mu$ and $\theta$ are the trading-off parameters described previously.

## 4. Optimization

Due to the max operator in the second constraint, mcODM is a non-differentiable non-convex programming problem, which is quite difficult to solve directly. In this section, we first relax mcODM into a series of convex quadratic programming (QP) problems, which are much easier to handle. Specifically, at each iteration, we recast the first constraint as $k - 1$ linear inequality constraints,

$$w_{y_i}^\top \phi(x_i) - w_l^\top \phi(x_i) \geq 1 - \theta - \xi_i, \quad \forall l \neq y_i,$$

and replace the second constraint with

$$w_{y_i}^\top \phi(x_i) - M_i \leq 1 + \theta + \epsilon_i,$$

where $M_i = \max_{l \neq y_i} \bar{w}_l^\top \phi(x_i)$ and $\bar{w}_l$ is the solution of the previous iteration. Then we can repeatedly solve the following convex QP problem until convergence:

$$
\begin{aligned}
\min_{w_l, \xi_i, \epsilon_i}\ & \frac{1}{2}\sum_{l=1}^k \|w_l\|_{\mathbb H}^2 + \frac{\lambda}{m}\sum_{i=1}^m \frac{\xi_i^2 + \mu\epsilon_i^2}{(1-\theta)^2} \qquad (2) \\
\text{s.t.}\ & w_{y_i}^\top \phi(x_i) - w_l^\top \phi(x_i) \geq 1 - \theta - \xi_i, \quad \forall l \neq y_i, \\
& w_{y_i}^\top \phi(x_i) - M_i \leq 1 + \theta + \epsilon_i, \quad \forall i.
\end{aligned}
$$

Introducing the Lagrange multipliers $\zeta_i^l \geq 0$, $l \neq y_i$, for the first $k-1$ constraints and $\beta_i \geq 0$ for the last constraint, the Lagrangian function of Eq. (2) is

$$
\begin{aligned}
L(w_l, \xi_i, \epsilon_i, \zeta_i^l, \beta_i) = \ & \frac{1}{2}\sum_{l=1}^k \|w_l\|_{\mathbb H}^2 + \frac{\lambda}{m}\sum_{i=1}^m \frac{\xi_i^2 + \mu\epsilon_i^2}{(1-\theta)^2} \\
& - \sum_{i=1}^m \sum_{l \neq y_i} \zeta_i^l \big(w_{y_i}^\top \phi(x_i) - w_l^\top \phi(x_i) - 1 + \theta + \xi_i\big) \\
& + \sum_{i=1}^m \beta_i \big(w_{y_i}^\top \phi(x_i) - M_i - 1 - \theta - \epsilon_i\big).
\end{aligned}
$$

By setting the partial derivatives with respect to the variables $\{w_l, \xi_i, \epsilon_i\}$ to zero, we have

$$
w_l = \sum_{i=1}^m \Big( \delta_{y_i,l} \sum_{s \neq y_i} \zeta_i^s - (1 - \delta_{y_i,l}) \zeta_i^l - \delta_{y_i,l} \beta_i \Big) \phi(x_i), \quad
\xi_i = \frac{m(1-\theta)^2}{2\lambda} \sum_{l \neq y_i} \zeta_i^l, \quad
\epsilon_i = \frac{m(1-\theta)^2}{2\lambda\mu} \beta_i, \qquad (3)
$$

where $\delta_{y_i,l}$ equals 1 when $y_i = l$ and 0 otherwise. We can further simplify the expression of $w_l$ as

$$w_l = \sum_{i=1}^m (\alpha_i^l - \delta_{y_i,l} \beta_i)\, \phi(x_i) \qquad (4)$$

by defining $\alpha_i^l \triangleq -\zeta_i^l$ for $l \neq y_i$ and $\alpha_i^{y_i} \triangleq \sum_{s \neq y_i} \zeta_i^s$. Substituting Eq. (4) and Eq. (3) into the Lagrangian function, we obtain the following dual problem:

$$
\begin{aligned}
\min_{\alpha_i^l, \beta_i}\ & \frac{1}{2}\sum_{l=1}^k \|w_l\|_{\mathbb H}^2 + \frac{m(1-\theta)^2}{4\lambda}\sum_{i=1}^m (\alpha_i^{y_i})^2 + \frac{m(1-\theta)^2}{4\lambda\mu}\sum_{i=1}^m \beta_i^2 \\
& + (1-\theta)\sum_{i=1}^m \sum_{l \neq y_i} \alpha_i^l + \sum_{i=1}^m (M_i + 1 + \theta)\beta_i \qquad (5) \\
\text{s.t.}\ & \sum_{l=1}^k \alpha_i^l = 0, \quad \forall i, \\
& \alpha_i^l \leq 0, \quad \forall i,\ l \neq y_i, \\
& \beta_i \geq 0, \quad \forall i.
\end{aligned}
$$

The objective function in Eq. (5) involves $m(k+1)$ variables in total, so it is not easy to optimize with respect to all the variables simultaneously. Note that all the constraints can be partitioned into $m$ disjoint sets, and the $i$-th set only involves $\alpha_i^1, \ldots, \alpha_i^k, \beta_i$, so the variables can be divided into $m$ decoupled groups, and an efficient block coordinate descent algorithm (Tseng, 2001) can be applied. Specifically, we sequentially select the group of $k+1$ variables $\alpha_i^1, \ldots, \alpha_i^k, \beta_i$ associated with instance $x_i$ to minimize over, while keeping the other variables as constants, and repeat this procedure until convergence. Algorithm 1 below details kernel mcODM.

**Algorithm 1** Kernel mcODM
1. **Input:** Data set $S$.
2. Initialize $\boldsymbol{\alpha} = [\alpha_1^1, \ldots, \alpha_1^k, \ldots, \alpha_m^1, \ldots, \alpha_m^k]$ and $\boldsymbol{\beta} = [\beta_1, \ldots, \beta_m]$ as zero vectors.
3. **while** $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ have not converged **do**
4. &nbsp;&nbsp;**for** $i = 1, \ldots, m$ **do**
5. &nbsp;&nbsp;&nbsp;&nbsp;$M_i \leftarrow \max_{l \neq y_i} \sum_{j=1}^m (\alpha_j^l - \delta_{y_j,l}\beta_j)\, \kappa(x_j, x_i)$.
6. &nbsp;&nbsp;**end for**
7. &nbsp;&nbsp;Solve Eq. (5) by the block coordinate descent method.
8. **end while**
9. **Output:** $\boldsymbol{\alpha}$, $\boldsymbol{\beta}$.
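A compact NumPy rendering of Algorithm 1 follows, as a sketch under our own assumptions: labels are 0-based, a fixed outer-iteration cap stands in for the convergence test, and `solve_subproblem` is the per-instance routine of Section 4.1 (sketched there), which assembles the constants of Eq. (6) for instance $i$ and applies the sorting solver:

```python
import numpy as np

def kernel_mcodm(K, y, k, solve_subproblem, n_outer=20):
    """Sketch of Algorithm 1. K is the m x m kernel matrix and y holds
    integer labels in {0, ..., k-1}."""
    m = K.shape[0]
    alpha = np.zeros((m, k))              # alpha[i, l]
    beta = np.zeros(m)                    # beta[i]
    for _ in range(n_outer):              # "while not converged" in the paper
        coef = alpha.copy()
        coef[np.arange(m), y] -= beta     # alpha[j, l] - delta_{y_j, l} beta[j]
        scores = K @ coef                 # scores[i, l] = w_l . phi(x_i)
        masked = scores.copy()
        masked[np.arange(m), y] = -np.inf
        M = masked.max(axis=1)            # M_i = max_{l != y_i} w_l . phi(x_i)
        for i in range(m):                # one BCD sweep over Eq. (5)
            alpha[i], beta[i] = solve_subproblem(K, y, alpha, beta, M, i)
    return alpha, beta
```

In the paper, step 7 runs BCD on Eq. (5) to convergence before the $M_i$ are refreshed; the single sweep above is a simplification.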
### 4.1. Solving the sub-problem

The sub-problem in step 7 of Algorithm 1 is also a convex QP with $k+1$ variables, which can be solved by standard QP solvers. However, by exploiting its special structure, i.e., the fact that only a small number of cross terms are involved, we can derive an algorithm that solves this sub-problem just by sorting, which can be much faster than general QP solvers. Note that all variables except $\alpha_i^1, \ldots, \alpha_i^k, \beta_i$ are fixed, so we have the following sub-problem:

$$
\begin{aligned}
\min_{\alpha_i^l, \alpha_i^{y_i}, \beta_i}\ & \frac{A}{2}\sum_{l \neq y_i} (\alpha_i^l)^2 + \sum_{l \neq y_i} B_l \alpha_i^l + \frac{D}{2}(\alpha_i^{y_i})^2 - A\alpha_i^{y_i}\beta_i + B_{y_i}\alpha_i^{y_i} + \frac{E}{2}\beta_i^2 + F\beta_i \qquad (6) \\
\text{s.t.}\ & \sum_{l=1}^k \alpha_i^l = 0, \\
& \alpha_i^l \leq 0, \quad l \neq y_i, \\
& \beta_i \geq 0,
\end{aligned}
$$

where $A = \kappa(x_i, x_i)$, $B_l = \sum_{j \neq i} \kappa(x_i, x_j)(\alpha_j^l - \delta_{y_j,l}\beta_j) + 1 - \theta$ for $l \neq y_i$, $B_{y_i} = \sum_{j \neq i} \kappa(x_i, x_j)(\alpha_j^{y_i} - \delta_{y_j,y_i}\beta_j)$, $D = A + \frac{m(1-\theta)^2}{2\lambda}$, $E = A + \frac{m(1-\theta)^2}{2\lambda\mu}$ and $F = M_i + 1 + \theta - B_{y_i}$.

The KKT conditions of Eq. (6) indicate that there exist scalars $\nu$, $\rho_l$ and $\eta$ such that

$$
\begin{aligned}
& \sum_{l=1}^k \alpha_i^l = 0, \qquad (7) \\
& \alpha_i^l \leq 0, \quad l \neq y_i, \qquad (8) \\
& \beta_i \geq 0, \qquad (9) \\
& \rho_l \alpha_i^l = 0,\ \rho_l \geq 0, \quad l \neq y_i, \qquad (10) \\
& A\alpha_i^l + B_l - \nu + \rho_l = 0, \quad l \neq y_i, \qquad (11) \\
& \eta\beta_i = 0,\ \eta \geq 0, \qquad (12) \\
& -A\alpha_i^{y_i} + E\beta_i + F - \eta = 0, \qquad (13) \\
& D\alpha_i^{y_i} - A\beta_i + B_{y_i} - \nu = 0. \qquad (14)
\end{aligned}
$$

According to Eq. (8), Eq. (10) and Eq. (11) are equivalent to

$$
\begin{aligned}
& A\alpha_i^l + B_l - \nu = 0, \quad \text{if } \alpha_i^l < 0,\ l \neq y_i, \qquad (15) \\
& B_l - \nu \leq 0, \quad \text{if } \alpha_i^l = 0,\ l \neq y_i. \qquad (16)
\end{aligned}
$$

In the same way, Eq. (12) and Eq. (13) are equivalent to

$$
\begin{aligned}
& -A\alpha_i^{y_i} + E\beta_i + F = 0, \quad \text{if } \beta_i > 0, \qquad (17) \\
& -A\alpha_i^{y_i} + F \geq 0, \quad \text{if } \beta_i = 0. \qquad (18)
\end{aligned}
$$

Thus the KKT conditions reduce to Eqs. (7)-(9) and Eqs. (14)-(18). Note that

$$\alpha_i^l = \min\Big(0,\ \frac{\nu - B_l}{A}\Big), \quad l \neq y_i, \qquad (19)$$

satisfies the KKT conditions Eq. (8) and Eqs. (15)-(16), and

$$\beta_i = \max\Big(0,\ \frac{A\alpha_i^{y_i} - F}{E}\Big) \qquad (20)$$

satisfies the KKT conditions Eq. (9) and Eqs. (17)-(18). By substituting Eq. (20) into Eq. (14), we obtain

$$D\alpha_i^{y_i} + B_{y_i} - \nu = \max\Big(0,\ \frac{A}{E}\big(A\alpha_i^{y_i} - F\big)\Big). \qquad (21)$$

Let us consider the following two cases in turn.

**Case 1:** $A\alpha_i^{y_i} \leq F$. According to Eq. (20) and Eq. (21), we have $\beta_i = 0$ and $\alpha_i^{y_i} = \frac{\nu - B_{y_i}}{D}$. Thus $A\,\frac{\nu - B_{y_i}}{D} \leq F$, which implies that $\nu \leq B_{y_i} + \frac{DF}{A}$.

**Case 2:** $A\alpha_i^{y_i} > F$. According to Eq. (20) and Eq. (21), we have $\beta_i = \frac{A\alpha_i^{y_i} - F}{E} > 0$ and $\alpha_i^{y_i} = \frac{E\nu - AF - EB_{y_i}}{DE - A^2}$. Thus $A\,\frac{E\nu - AF - EB_{y_i}}{DE - A^2} > F$, which implies that $\nu > B_{y_i} + \frac{DF}{A}$.

The remaining task is to find $\nu$ such that Eq. (7) holds. With Eq. (7) and Eq. (19), it can be shown that

$$
\nu = \frac{\sum_{l:\, \alpha_i^l < 0} B_l + \frac{A}{D} B_{y_i}}{\frac{A}{D} + |\{l \mid \alpha_i^l < 0\}|} \quad \text{in Case 1}, \qquad (22)
$$

$$
\nu = \frac{\sum_{l:\, \alpha_i^l < 0} B_l + \frac{AEB_{y_i} + A^2 F}{DE - A^2}}{\frac{AE}{DE - A^2} + |\{l \mid \alpha_i^l < 0\}|} \quad \text{in Case 2}. \qquad (23)
$$

In both cases, the optimal $\nu$ takes the form $(P + \sum_{l:\, \alpha_i^l < 0} B_l) / (Q + |\{l \mid \alpha_i^l < 0\}|)$, where $P$ and $Q$ are constants. Fan et al. (2008) showed that such a $\nu$ can be found by sorting $\{B_l\}$ for $l \neq y_i$ in decreasing order and then sequentially adding them into an initially empty set $\Phi$, until

$$\frac{P + \sum_{l \in \Phi} B_l}{Q + |\Phi|} \geq \max_{l \notin \Phi} B_l. \qquad (24)$$

Note that the Hessian matrix of the objective function of Eq. (6) is positive definite, which guarantees the existence and uniqueness of the optimal solution, so only one of Eq. (22) and Eq. (23) can hold. We can first compute $\nu$ according to Eq. (22) for Case 1 and check whether the constraint on $\nu$ is satisfied; if not, we further compute $\nu$ by Eq. (23) for Case 2. Algorithm 2 summarizes the pseudo-code for solving the sub-problem.

**Algorithm 2** Solving the sub-problem
1. **Input:** Parameters $A$, $B = \{B_1, \ldots, B_k\}$, $D$, $E$, $F$.
2. Initialize $\hat{B} \leftarrow B$, then swap $\hat{B}_1$ and $\hat{B}_{y_i}$, and sort $\hat{B} \setminus \{\hat{B}_1\}$ in decreasing order.
3. $i \leftarrow 2$, $\nu \leftarrow A B_{y_i} / D$.
4. **while** $i \leq k$ and $\nu / (i - 2 + A/D) < \hat{B}_i$ **do**
5. &nbsp;&nbsp;$\nu \leftarrow \nu + \hat{B}_i$.
6. &nbsp;&nbsp;$i \leftarrow i + 1$.
7. **end while**
8. $\nu \leftarrow \nu / (i - 2 + A/D)$.
9. **if** $\nu \leq B_{y_i} + DF/A$ **then**
10. &nbsp;&nbsp;$\alpha_i^l \leftarrow \min(0, (\nu - B_l)/A)$, $\forall l \neq y_i$.
11. &nbsp;&nbsp;$\alpha_i^{y_i} \leftarrow (\nu - B_{y_i})/D$.
12. &nbsp;&nbsp;$\beta_i \leftarrow 0$.
13. **else**
14. &nbsp;&nbsp;$i \leftarrow 2$, $\nu \leftarrow (AE\hat{B}_1 + A^2 F)/(DE - A^2)$.
15. &nbsp;&nbsp;**while** $i \leq k$ and $\nu / (i - 2 + AE/(DE - A^2)) < \hat{B}_i$ **do**
16. &nbsp;&nbsp;&nbsp;&nbsp;$\nu \leftarrow \nu + \hat{B}_i$.
17. &nbsp;&nbsp;&nbsp;&nbsp;$i \leftarrow i + 1$.
18. &nbsp;&nbsp;**end while**
19. &nbsp;&nbsp;$\nu \leftarrow \nu / (i - 2 + AE/(DE - A^2))$.
20. &nbsp;&nbsp;$\alpha_i^l \leftarrow \min(0, (\nu - B_l)/A)$, $\forall l \neq y_i$.
21. &nbsp;&nbsp;$\alpha_i^{y_i} \leftarrow (E\nu - AF - EB_{y_i})/(DE - A^2)$.
22. &nbsp;&nbsp;$\beta_i \leftarrow (A\alpha_i^{y_i} - F)/E$.
23. **end if**
24. **Output:** $\alpha_i^1, \ldots, \alpha_i^k, \beta_i$.
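The sorting solver translates almost line-for-line into code. Below is a NumPy sketch under our own naming and 0-based indexing; it is the core of the `solve_subproblem` callback assumed in the Algorithm 1 sketch, once the constants $A, B, D, E, F$ have been assembled for instance $i$ as defined after Eq. (6):

```python
import numpy as np

def solve_subproblem(A, B, D, E, F, yi):
    """Sorting-based solver for the sub-problem of Eq. (6) (Algorithm 2)."""
    others = np.sort(np.delete(B, yi))[::-1]   # {B_l : l != yi}, decreasing

    def find_nu(P, Q):
        # Grow Phi greedily until Eq. (24) holds, then return
        # nu = (P + sum_{l in Phi} B_l) / (Q + |Phi|).
        num, size = P, 0
        for b in others:
            if num / (Q + size) >= b:
                break
            num, size = num + b, size + 1
        return num / (Q + size)

    nu = find_nu(A / D * B[yi], A / D)         # try Case 1 first: beta_i = 0
    if nu <= B[yi] + D * F / A:                # constraint of Case 1 holds
        alpha = np.minimum(0.0, (nu - B) / A)  # Eq. (19)
        alpha[yi] = (nu - B[yi]) / D
        beta = 0.0
    else:                                      # Case 2: beta_i > 0
        det = D * E - A * A                    # positive, since D, E > A
        nu = find_nu((A * E * B[yi] + A * A * F) / det, A * E / det)
        alpha = np.minimum(0.0, (nu - B) / A)
        alpha[yi] = (E * nu - A * F - E * B[yi]) / det
        beta = (A * alpha[yi] - F) / E
    return alpha, beta
```

Since only a sort of $k-1$ numbers plus a linear scan is needed, each block update costs $O(k \log k)$, versus the polynomial cost of a generic QP solver.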
### 4.2. Speedup for the linear kernel

The method proposed in Section 4.1 can efficiently deal with kernel mcODM. However, the computation of $M_i$ in step 5 of Algorithm 1 and the computation of the parameters $B_l$ in Algorithm 2 both involve the kernel matrix, whose inherent computational cost is $O(m^2)$, so the method might be computationally prohibitive for large-scale problems.

When the linear kernel is used, these problems can be alleviated. According to Eq. (4), $w_l$ is spanned by the training instances, so it lies in a finite-dimensional space in this case. By storing $w_1, \ldots, w_k$ explicitly, the computational cost of $M_i = \max_{l \neq y_i} w_l^\top x_i$ can be much reduced. Moreover, note that for $l \neq y_i$,

$$B_l = \sum_{j \neq i} x_i^\top x_j (\alpha_j^l - \delta_{y_j,l}\beta_j) + 1 - \theta = \sum_{j=1}^m x_i^\top x_j (\alpha_j^l - \delta_{y_j,l}\beta_j) - x_i^\top x_i (\alpha_i^l - \delta_{y_i,l}\beta_i) + 1 - \theta = w_l^\top x_i - A(\alpha_i^l - \delta_{y_i,l}\beta_i) + 1 - \theta,$$

so $B_l$ can also be computed efficiently.

## 5. Analysis

In this section, we study the statistical properties of mcODM. To present the generalization bound of mcODM, we need to introduce the following loss function $\Phi$:

$$
\Phi(z) = \begin{cases} 1, & z \leq 0, \\ \dfrac{(z - 1 + \theta)^2}{(1-\theta)^2}, & 0 < z \leq 1 - \theta, \\ 0, & z > 1 - \theta. \end{cases}
$$

**Theorem 1.** Let $H = \{(x, y) \mapsto w_y^\top \phi(x) \mid \sum_{l=1}^k \|w_l\|_{\mathbb H}^2 \leq \Lambda^2\}$ be the hypothesis space, and suppose $\kappa(x, x) \leq r^2$ for all $x \in \mathcal{X}$. Then for any $\delta > 0$, with probability at least $1 - \delta$, the following generalization bound holds for any $h \in H$:

$$R(h) \leq \frac{1}{m}\sum_{i=1}^m \Phi(\gamma_h(x_i, y_i)) + \frac{16r\Lambda}{1-\theta}\sqrt{\frac{2\pi k}{m}} + 3\sqrt{\frac{\ln(2/\delta)}{2m}}.$$

*Proof.* Let $\tilde{H}_\theta$ be the family of hypotheses mapping $\mathcal{X} \times \mathcal{Y} \mapsto \mathbb{R}$ defined by $\tilde{H}_\theta = \{(x, y) \mapsto \gamma_{h,\theta}(x, y) : h \in H\}$, where $\gamma_{h,\theta}(x, y) = \min(\gamma_h(x, y), 1 - \theta)$ denotes the margin clipped at the zero-loss band. McDiarmid's inequality (McDiarmid, 1989) yields, with probability at least $1 - \delta$,

$$\mathbb{E}[\Phi(\gamma_{h,\theta}(x, y))] \leq \frac{1}{m}\sum_{i=1}^m \Phi(\gamma_{h,\theta}(x_i, y_i)) + 2\mathfrak{R}_S(\Phi \circ \tilde{H}_\theta) + 3\sqrt{\frac{\ln(2/\delta)}{2m}}, \quad \forall h \in H.$$

Note that $\Phi(\gamma_{h,\theta}) = \Phi(\gamma_h)$, that $R(h) = \mathbb{E}[1_{\gamma_h(x,y) \leq 0}] = \mathbb{E}[1_{\gamma_{h,\theta}(x,y) \leq 0}] \leq \mathbb{E}[\Phi(\gamma_{h,\theta}(x, y))]$, and that $\Phi(z)$ is a $\frac{2}{1-\theta}$-Lipschitz function. Using Talagrand's lemma (Mohri et al., 2012), we have

$$R(h) \leq \frac{1}{m}\sum_{i=1}^m \Phi(\gamma_h(x_i, y_i)) + \frac{4}{1-\theta}\,\mathfrak{R}_S(\tilde{H}_\theta) + 3\sqrt{\frac{\ln(2/\delta)}{2m}}.$$

According to Theorem 7 of Lei et al. (2015), $\mathfrak{R}_S(\tilde{H}_\theta) \leq 4r\Lambda\sqrt{2\pi k / m}$, which proves the stated result. $\square$

Theorem 1 shows that we obtain a tighter generalization bound for smaller $r\Lambda$ and smaller $\theta$. Since $\gamma_h(x, y) \leq 2r\Lambda$, the former can be viewed as an upper bound of the margin. Besides, $1 - \theta$ is the lower bound of the zero-loss band of mcODM. This verifies that a better margin distribution (i.e., a larger margin mean and a smaller margin variance) can yield better generalization performance, which is also consistent with the work of Gao & Zhou (2013).
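As a quick illustration (our own sketch, reusing `multiclass_margins` from the Section 2 snippet), the surrogate loss $\Phi$ and the empirical term of the bound can be computed as:

```python
import numpy as np

def phi_loss(z, theta):
    """Surrogate loss of Theorem 1: 1 on z <= 0, quadratic on the
    band (0, 1 - theta], and 0 past the zero-loss band."""
    z = np.asarray(z, dtype=float)
    quad = (z - 1 + theta) ** 2 / (1 - theta) ** 2
    return np.where(z <= 0, 1.0, np.where(z <= 1 - theta, quad, 0.0))

# Empirical term of the bound:
# phi_loss(multiclass_margins(scores, y), theta).mean()
```

Because `phi_loss` upper-bounds the 0-1 loss and vanishes beyond $1 - \theta$, the empirical term directly rewards margin distributions concentrated past the zero-loss band.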
## 6. Empirical Study

In this section, we empirically evaluate the effectiveness of our method on a broad range of data sets. We first introduce the experimental settings and the compared methods in Section 6.1, and then in Section 6.2 we compare our method with four versions of multi-class SVMs, i.e., mcSVM (Weston & Watkins, 1999; Crammer & Singer, 2001; 2002), one-versus-all SVM (ovaSVM), one-versus-one SVM (ovoSVM) (Kreßel, 1998) and error-correcting output code SVM (ecocSVM) (Dietterich & Bakiri, 1995). In addition, we also study the influence of the number of classes on the generalization performance and the margin distribution in Section 6.3. Finally, the computational cost is presented in Section 6.4.

### 6.1. Experimental Setup

We evaluate the effectiveness of our proposed method on twenty-two data sets. Table 1 summarizes the statistics of these data sets. The data set size ranges from 150 to more than 580,000, and the dimensionality ranges from 4 to more than 62,000. Moreover, the number of classes ranges from 3 to 1,000, so these data sets cover a broad range of properties. All features are normalized into the interval [0, 1]. For each data set, eighty percent of the instances are randomly selected as training data, and the rest are used as testing data. For each data set, experiments are repeated 10 times with random data partitions, and the average accuracies as well as the standard deviations are recorded.

Table 1. Characteristics of the experimental data sets.

| ID | Dataset | #Instance | #Feature | #Class | ID | Dataset | #Instance | #Feature | #Class |
|---|---|---|---|---|---|---|---|---|---|
| 1 | iris | 150 | 4 | 3 | 12 | sector | 9,619 | 55,197 | 105 |
| 2 | wine | 178 | 13 | 3 | 13 | pendigits | 10,992 | 16 | 10 |
| 3 | glass | 214 | 9 | 6 | 14 | news20 | 19,928 | 62,061 | 20 |
| 4 | svmguide2 | 391 | 20 | 3 | 15 | letter | 20,000 | 16 | 26 |
| 5 | svmguide4 | 612 | 10 | 6 | 16 | protein | 24,387 | 357 | 3 |
| 6 | vehicle | 846 | 18 | 4 | 17 | shuttle | 58,000 | 9 | 7 |
| 7 | vowel | 990 | 10 | 11 | 18 | connect-4 | 67,557 | 126 | 3 |
| 8 | segment | 2,310 | 19 | 7 | 19 | mnist | 70,000 | 780 | 10 |
| 9 | dna | 3,186 | 180 | 3 | 20 | aloi | 108,000 | 128 | 1,000 |
| 10 | satimage | 6,435 | 36 | 6 | 21 | rcv1 | 534,135 | 47,236 | 53 |
| 11 | usps | 9,298 | 256 | 10 | 22 | covtype | 581,012 | 54 | 7 |

mcODM is compared with four versions of multi-class SVMs, i.e., ovaSVM, ovoSVM, ecocSVM and mcSVM. These four methods can be roughly grouped into two categories. The first group contains the first three methods, which convert the multi-class classification problem into a set of binary classification problems. Specifically, ovaSVM learns $k$ scoring functions $h_l : \mathcal{X} \mapsto \{-1, +1\}$, $l \in \mathcal{Y}$, each seeking to discriminate one class $l \in \mathcal{Y}$ from all the others, so it needs to train $k$ SVM models. Alternatively, ovoSVM determines scoring functions for all combinations of class pairs, so it needs to train $k(k-1)/2$ SVM models. Finally, ecocSVM is a generalization of the former two methods. This technique assigns to each class $l \in \mathcal{Y}$ a code word of length $c$, which serves as a signature for this class. After training $c$ binary SVM models $h_1(\cdot), \ldots, h_c(\cdot)$, the class predicted for a testing instance $x$ is the one whose signature is closest to $[h_1(x), \ldots, h_c(x)]$ in Hamming distance (see the decoding sketch below). The weakness of these methods is that they may produce unclassifiable regions, and their computational costs are usually quite large in practice, as can be observed in the following experiments. On the other hand, mcSVM belongs to the second group: it directly determines all the scoring functions at once, so its time cost is usually less than that of the former methods. In addition, the unclassifiable regions are also resolved.

For all the methods, the regularization parameter $\lambda$ for mcODM, or $C$ for binary SVM and mcSVM, is selected by 5-fold cross-validation from $[2^0, 2^2, \ldots, 2^{20}]$. For mcODM, the parameters $\mu$ and $\theta$ are both selected from $[0.2, 0.4, 0.6, 0.8]$. For ecocSVM, the exhaustive codes strategy is employed, i.e., for each class we construct a code of length $2^{k-1} - 1$ as the signature. All parameter selections are performed on the training sets.
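For concreteness, here is a minimal sketch of the Hamming-distance decoding used by ecocSVM (our own naming; `preds` collects the $c$ binary outputs for one instance):

```python
import numpy as np

def ecoc_decode(preds, codebook):
    """Predict the class whose code word is closest in Hamming distance.

    preds:    (c,) vector in {-1, +1}, the outputs [h_1(x), ..., h_c(x)]
    codebook: (k, c) matrix in {-1, +1}; row l is the signature of class l
    """
    hamming = (codebook != preds).sum(axis=1)  # distance to each signature
    return int(np.argmin(hamming))             # predicted class
```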
### 6.2. Results

Table 2 summarizes the detailed results on the twenty-two data sets. As can be seen, the overall performance of our method is superior or highly competitive to the other compared methods. Specifically, mcODM performs significantly better than mcSVM/ovaSVM/ovoSVM/ecocSVM on 17/19/18/17 of the 22 data sets, respectively, and achieves the best accuracy on 20 data sets.

Table 2. Accuracy (mean±std.) comparison on twenty-two data sets. The linear kernel is used. The best accuracy on each data set is bolded. Significance is assessed by paired t-tests at the 95% significance level; the average rank is listed in the second row from the bottom, and the win/tie/loss counts for mcODM are summarized in the last row. ovoSVM and ecocSVM did not return results on some data sets within 48 hours (marked N/A).

| Dataset | mcSVM | ovaSVM | ovoSVM | ecocSVM | mcODM |
|---|---|---|---|---|---|
| iris | 84.7±5.0 | 82.0±5.0 | 73.0±7.1 | 73.3±4.2 | **87.3±3.4** |
| wine | 96.5±2.6 | 97.0±2.4 | 96.8±3.1 | 91.6±2.7 | **98.4±1.9** |
| glass | 62.0±7.6 | 57.8±2.1 | 62.4±6.0 | 60.4±4.0 | **68.2±7.3** |
| svmguide2 | 80.9±2.6 | 81.5±2.3 | 74.6±5.4 | 79.0±3.6 | **84.1±1.8** |
| svmguide4 | 86.6±3.2 | 77.2±3.1 | 77.1±6.9 | 74.2±4.3 | **87.8±3.5** |
| vehicle | 80.5±2.1 | 78.5±2.9 | 73.5±3.5 | 76.9±2.5 | **85.6±6.6** |
| vowel | 65.4±2.1 | 44.1±2.6 | **71.1±4.9** | 43.1±2.0 | 65.6±3.9 |
| segment | 95.3±0.9 | 92.3±0.7 | 93.2±2.2 | 90.7±0.8 | **96.7±1.3** |
| dna | 95.1±0.6 | 95.1±0.7 | 93.0±1.0 | 90.9±1.1 | **95.6±0.7** |
| satimage | 81.8±0.5 | 80.1±0.7 | 77.4±4.2 | 76.0±1.4 | **89.1±8.0** |
| usps | **95.1±0.3** | 94.2±0.3 | 94.5±0.6 | 92.3±0.7 | **95.1±0.6** |
| sector | 93.3±0.4 | 93.1±0.5 | 89.9±0.6 | N/A | **93.6±0.6** |
| pendigits | 94.8±0.3 | 91.8±0.5 | 95.7±0.8 | 87.6±0.7 | **96.7±0.9** |
| news20 | 83.3±0.3 | 83.0±0.3 | 69.4±0.4 | N/A | **83.6±0.3** |
| letter | 76.9±0.4 | 64.9±1.0 | 75.7±1.2 | N/A | **82.7±1.6** |
| protein | 68.7±0.3 | 68.7±0.4 | 68.1±0.3 | 66.5±0.5 | **70.2±2.3** |
| shuttle | 97.2±0.1 | 89.9±2.8 | 97.4±0.1 | 84.2±0.0 | **99.6±0.0** |
| connect-4 | 75.7±0.1 | 75.7±0.1 | 75.7±0.2 | 75.1±0.2 | **94.4±7.6** |
| mnist | 92.5±0.2 | 91.5±0.2 | 90.9±0.5 | 87.6±0.2 | **95.2±0.3** |
| aloi | 90.1±0.2 | 82.2±0.2 | N/A | N/A | **91.4±0.3** |
| rcv1 | 92.8±0.1 | 92.7±0.1 | 92.1±0.1 | N/A | **93.0±1.1** |
| covtype | 71.7±0.1 | 71.2±0.1 | 72.7±0.1 | 70.6±0.1 | **72.8±1.6** |
| Avg. Rank | 2.45 | 3.18 | 3.45 | 4.82 | 1.09 |
| mcODM: w/t/l | 17/5/0 | 19/3/0 | 18/3/0 | 17/0/0 | |

In addition, in comparison with the other four methods, which do not consider the margin distribution, the win/tie/loss counts show that mcODM is always better than or comparable to them, and almost never worse.

### 6.3. Influence of the Number of Classes

In this section, we study the influence of the number of classes on the generalization performance and the margin distribution, respectively.

#### 6.3.1. Generalization Performance

Figure 1 plots the generalization performance of all five methods on the data set segment; similar observations can be made on the other data sets. As can be seen, when the number of classes is less than four, all methods perform quite well. However, as the fifth class is added, the generalization performance of the other four methods decreases drastically. This might be attributable to the introduction of some noisy data, to which SVM-style algorithms are very sensitive since they optimize the minimum margin. On the other hand, our method considers the whole margin distribution, so it can be robust to noise and relatively more stable.

[Figure 1. Generalization performance of all five methods on the data set segment as the number of classes increases.]

#### 6.3.2. Margin Distribution

Figure 2 plots the frequency histogram of the margin distribution produced by mcSVM, ovaSVM and mcODM on the data set segment as the number of classes increases from two to seven. As can be seen, when the number of classes is less than four, all methods achieve a good margin distribution, whereas as more classes are added, the other two methods begin to produce negative margins. At the same time, the distribution of our method becomes sharper, which prevents instances from having small margins, so our method can still perform relatively well as the number of classes increases. This is also consistent with the observation from Figure 1.

[Figure 2. Frequency histogram of the margin distribution produced by mcSVM, ovaSVM and mcODM on the data set segment as the number of classes increases from two to seven.]
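A sketch of how such margin-frequency histograms can be produced (our own code, reusing `multiclass_margins` from the Section 2 snippet):

```python
import numpy as np

def margin_histogram(scores, y, bins=50):
    """Frequency histogram of the margin distribution (cf. Figure 2)."""
    margins = multiclass_margins(scores, y)  # Section 2 sketch
    freq, edges = np.histogram(margins, bins=bins)
    return freq, edges                       # bins below zero: misclassified
```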
### 6.4. Time Cost

We compare the single-iteration time cost of our method with mcSVM, ovaSVM and ovoSVM on all the data sets except aloi, on which ovoSVM could not return results within 48 hours. All experiments are performed with MATLAB 2012b on a machine with 8 × 2.60 GHz CPUs and 32GB main memory. The average CPU time (in seconds) on each data set is shown in Figure 3. The binary SVMs used in ovaSVM and ovoSVM, as well as mcSVM, are implemented with the LIBLINEAR (Fan et al., 2008) package. It can be seen that for small data sets the efficiency of all the methods is similar; however, for data sets with more than ten classes, e.g., sector and rcv1, mcSVM and mcODM, which learn all the scoring functions at once, are much faster than ovaSVM and ovoSVM, owing to the inefficiency of binary decomposition, as discussed in Section 6.1. Note that LIBLINEAR provides very fast implementations of binary SVM and mcSVM, so this also shows that our method is computationally efficient.

[Figure 3. Single-iteration time cost of mcSVM, ovaSVM, ovoSVM and mcODM on all the data sets except aloi.]

## 7. Conclusions

Recent studies disclose that, for binary classification, maximizing the minimum margin does not necessarily lead to better generalization performance; instead, it is crucial to optimize the margin distribution. However, the influence of the margin distribution for multi-class classification remained open. We try to answer this question in this paper. When maximizing the margin mean and minimizing the margin variance simultaneously, the resultant optimization is a difficult non-differentiable non-convex programming problem. We propose mcODM to solve this problem efficiently. Extensive experiments on twenty-two data sets validate the superiority of our method to four versions of multi-class SVMs. In the future, it will be interesting to extend mcODM to more general learning settings, e.g., multi-label learning and structured learning.

## Acknowledgements

This research was supported by the NSFC (61333014) and the Collaborative Innovation Center of Novel Software Technology and Industrialization. The authors want to thank the reviewers for helpful comments, and thank Dr. Wei Gao for reading a draft.

## References

Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

Crammer, K. and Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.
Crammer, K. and Singer, Y. On the learnability and design of output codes for multiclass problems. Machine Learning, 47(2):201–233, 2002.

Dietterich, T. G. and Bakiri, G. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, 1995.

Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.

Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. In Proceedings of the 2nd European Conference on Computational Learning Theory, pp. 23–37, Barcelona, Spain, 1995.

Gao, W. and Zhou, Z.-H. On the doubt about margin explanation of boosting. Artificial Intelligence, 203:1–18, 2013.

Kreßel, U. H.-G. Pairwise classification and support vector machines. In Schölkopf, B., Burges, C. J. C., and Smola, A. J. (eds.), Advances in Kernel Methods: Support Vector Learning, pp. 255–268. MIT Press, Cambridge, MA, 1998.

Lei, Y., Dogan, Ü., Binder, A., and Kloft, M. Multi-class SVMs: From tighter data-dependent generalization bounds to novel algorithms. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28, pp. 2026–2034. Curran Associates, Inc., 2015.

McDiarmid, C. On the method of bounded differences. Surveys in Combinatorics, 141(1):148–188, 1989.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press, Cambridge, MA, 2012.

Reyzin, L. and Schapire, R. E. How boosting the margin can also boost classifier complexity. In Proceedings of the 23rd International Conference on Machine Learning, pp. 753–760, Pittsburgh, PA, 2006.

Schapire, R. E., Freund, Y., Bartlett, P. L., and Lee, W. S. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5):1651–1686, 1998.

Tseng, P. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, 2001.

Vapnik, V. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.

Wang, L. W., Sugiyama, M., Yang, C., Zhou, Z.-H., and Feng, J. A refined margin analysis for boosting algorithms via equilibrium margin. Journal of Machine Learning Research, 12:1835–1863, 2011.

Weston, J. and Watkins, C. Multi-class support vector machines. In Proceedings of the European Symposium on Artificial Neural Networks, Brussels, Belgium, 1999.

Zhang, T. and Zhou, Z.-H. Large margin distribution machine. In Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 313–322, New York, NY, 2014.

Zhang, T. and Zhou, Z.-H. Optimal margin distribution machine. CoRR, abs/1604.03348, 2016.

Zhou, Y.-H. and Zhou, Z.-H. Large margin distribution learning with cost interval and unlabeled data. IEEE Transactions on Knowledge and Data Engineering, 28(7):1749–1763, 2016.

Zhou, Z.-H. Ensemble Methods: Foundations and Algorithms. CRC Press, Boca Raton, FL, 2012.

Zhou, Z.-H. Large margin distribution learning. In Proceedings of the 6th IAPR International Workshop on Artificial Neural Networks in Pattern Recognition, pp. 1–11, Montreal, Canada, 2014.