# Multi-Class Optimal Margin Distribution Machine

Teng Zhang, Zhi-Hua Zhou

National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China. Correspondence to: Zhi-Hua Zhou.

*Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).*

## Abstract

Recent studies disclose that maximizing the minimum margin, as support vector machines do, does not necessarily lead to better generalization performance; instead, it is crucial to optimize the margin distribution. Although it has been shown that, for binary classification, characterizing the margin distribution by the first- and second-order statistics can achieve superior performance, the question remains open for multi-class classification, and due to the complexity of the margin for multi-class classification, optimizing its distribution by mean and variance can also be difficult. In this paper, we propose mcODM (multi-class Optimal margin Distribution Machine), which can solve this problem efficiently. We also give a theoretical analysis for our method, which verifies the significance of the margin distribution for multi-class classification. Empirical study further shows that mcODM always outperforms all four versions of multi-class SVMs on all experimental data sets.

## 1. Introduction

Support vector machines (SVMs) and Boosting have been two mainstream learning approaches during the past decade. The former (Cortes & Vapnik, 1995) is rooted in statistical learning theory (Vapnik, 1995), with the central idea of searching for a large-margin separator, i.e., maximizing the smallest distance from the instances to the classification boundary in an RKHS (reproducing kernel Hilbert space). It is noteworthy that there is also a long history of applying margin theory to explain the latter (Freund & Schapire, 1995; Schapire et al., 1998), owing to its tendency to be empirically resistant to over-fitting (Reyzin & Schapire, 2006; Wang et al., 2011; Zhou, 2012).

Recently, the margin theory for Boosting has finally been defended (Gao & Zhou, 2013), and it discloses that the margin distribution, rather than a single margin, is more crucial to the generalization performance. This suggests that there may still exist large space to further enhance SVMs. Inspired by this recognition, Zhang & Zhou (2014; 2016) proposed a binary classification method that optimizes the margin distribution by characterizing it through the first- and second-order statistics, which achieves quite satisfactory experimental results. Later, Zhou & Zhou (2016) extended the idea to an approach which is able to exploit unlabeled data and handle unequal misclassification costs. A brief summary of this line of early research can be found in (Zhou, 2014).

Although it has been shown that, for binary classification, optimizing the margin distribution by maximizing the margin mean and minimizing the margin variance simultaneously can yield superior performance, the question remains open for multi-class classification. Moreover, the margin for multi-class classification is much more complicated than that for binary classification, which makes the resultant optimization a difficult non-differentiable non-convex programming problem. In this paper, we propose mcODM (multi-class Optimal margin Distribution Machine) to solve this problem efficiently.
For optimization, we relax mcODM into a series of convex quadratic programming (QP) problems, and extend the Block Coordinate Descent (BCD) algorithm (Tseng, 2001) to solve the dual of each QP. The sub-problem of each iteration of BCD is also a QP. By exploiting its special structure, we derive a sorting algorithm to solve it which is much faster than general QP solvers. We further provide a generalization error bound based on Rademacher complexity, and present an analysis of the relationship between the generalization error and the margin distribution for multi-class classification. Extensive experiments on twenty-two data sets show the superiority of our method to all four versions of multi-class SVMs.

The rest of this paper is organized as follows. Section 2 introduces some preliminaries. Section 3 formulates the problem. Section 4 presents the proposed algorithm. Section 5 discusses some theoretical analyses. Section 6 reports on our experimental studies and empirical observations. Finally, Section 7 concludes with future work.

## 2. Preliminaries

We denote by $\mathcal{X} \subseteq \mathbb{R}^d$ the instance space and $\mathcal{Y} = [k]$ the label set, where $[k] = \{1, \ldots, k\}$. Let $\mathcal{D}$ be an unknown (underlying) distribution over $\mathcal{X} \times \mathcal{Y}$. A training set $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\} \in (\mathcal{X} \times \mathcal{Y})^m$ is drawn identically and independently (i.i.d.) according to the distribution $\mathcal{D}$. Let $\phi : \mathcal{X} \mapsto \mathbb{H}$ be a feature mapping associated to some positive definite kernel $\kappa$. In the multi-class classification setting, the hypothesis is defined based on $k$ weight vectors $w_1, \ldots, w_k \in \mathbb{H}$, where each weight vector $w_l$, $l \in \mathcal{Y}$, defines a scoring function $x \mapsto w_l^\top \phi(x)$, and the label of an instance $x$ is the one resulting in the largest score, i.e., $h(x) = \operatorname{argmax}_{l \in \mathcal{Y}} w_l^\top \phi(x)$. This decision function naturally leads to the following definition of the margin for a labeled instance $(x, y)$:

$$\gamma_h(x, y) = w_y^\top \phi(x) - \max_{l \neq y} w_l^\top \phi(x).$$

Thus $h$ misclassifies $(x, y)$ if and only if it produces a negative margin for this instance.

Given a hypothesis set $\mathcal{H}$ of functions mapping $\mathcal{X}$ to $\mathcal{Y}$ and the labeled training set $S$, our goal is to learn a function $h \in \mathcal{H}$ such that the generalization error $R(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[1_{h(x) \neq y}]$ is small.

## 3. Formulation

To design an optimal margin distribution machine for multi-class classification, we need to understand how to optimize the margin distribution. Gao & Zhou (2013) proved that, to characterize the margin distribution, it is important to consider not only the margin mean but also the margin variance. Inspired by this idea, Zhang & Zhou (2014; 2016) proposed the optimal margin distribution machine for binary classification, which characterizes the margin distribution according to the first- and second-order statistics, that is, maximizing the margin mean and minimizing the margin variance simultaneously. Specifically, let $\bar\gamma$ denote the margin mean; then the optimal margin distribution machine can be formulated as

$$
\begin{aligned}
\min_{w, \bar\gamma, \xi_i, \epsilon_i}\ & \Omega(w) - \eta\bar\gamma + \frac{\lambda}{m}\sum_{i=1}^m (\xi_i^2 + \epsilon_i^2) \\
\text{s.t.}\ & \gamma_h(x_i, y_i) \geq \bar\gamma - \xi_i, \\
& \gamma_h(x_i, y_i) \leq \bar\gamma + \epsilon_i, \quad \forall i,
\end{aligned}
$$

where $\Omega(w)$ is a regularization term to penalize the model complexity, $\eta$ and $\lambda$ are trading-off parameters, and $\xi_i$ and $\epsilon_i$ are the deviations of the margin $\gamma_h(x_i, y_i)$ from the margin mean. It is evident that $\sum_{i=1}^m (\xi_i^2 + \epsilon_i^2)/m$ is exactly the margin variance.
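To make these margin statistics concrete, here is a minimal NumPy sketch (not from the paper; the function name and array layout are our own) that computes the multi-class margins defined in Section 2, together with the first- and second-order statistics that ODM optimizes:

```python
import numpy as np

def multiclass_margins(scores, y):
    """Margins gamma_h(x_i, y_i) = s_{y_i} - max_{l != y_i} s_l.

    scores: (m, k) array with scores[i, l] = w_l . phi(x_i)
    y:      (m,) integer labels in {0, ..., k-1}
    """
    m = scores.shape[0]
    true_scores = scores[np.arange(m), y]
    masked = scores.copy()
    masked[np.arange(m), y] = -np.inf      # exclude the true class
    runner_up = masked.max(axis=1)         # max_{l != y_i} score
    return true_scores - runner_up

# The first- and second-order statistics characterized by ODM:
# margins = multiclass_margins(scores, y)
# margin_mean, margin_var = margins.mean(), margins.var()
```

An instance is misclassified exactly when its entry in `margins` is negative.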
By rescaling $w$, which does not affect the final classification results, the margin mean can be fixed as 1; then the deviation of the margin of $(x_i, y_i)$ from the margin mean is $|\gamma_h(x_i, y_i) - 1|$, and the optimal margin distribution machine can be reformulated as

$$
\begin{aligned}
\min_{w, \xi_i, \epsilon_i}\ & \Omega(w) + \frac{\lambda}{m}\sum_{i=1}^m \frac{\xi_i^2 + \mu\epsilon_i^2}{(1-\theta)^2} \\
\text{s.t.}\ & \gamma_h(x_i, y_i) \geq 1 - \theta - \xi_i, \\
& \gamma_h(x_i, y_i) \leq 1 + \theta + \epsilon_i, \quad \forall i,
\end{aligned}
$$

where $\mu \in (0, 1]$ is a parameter to trade off the two different kinds of deviation (larger or smaller than the margin mean), $\theta \in [0, 1)$ is the parameter of the zero-loss band, which controls the number of support vectors, i.e., the sparsity of the solution, and $(1-\theta)^2$ in the denominator scales the second term to be a surrogate loss for the 0-1 loss.

For multi-class classification, let the regularization term be $\Omega(w) = \frac{1}{2}\sum_{l=1}^k \|w_l\|_{\mathbb H}^2$; combining it with the definition of the margin, we arrive at the formulation of mcODM:

$$
\begin{aligned}
\min_{w_l, \xi_i, \epsilon_i}\ & \frac{1}{2}\sum_{l=1}^k \|w_l\|_{\mathbb H}^2 + \frac{\lambda}{m}\sum_{i=1}^m \frac{\xi_i^2 + \mu\epsilon_i^2}{(1-\theta)^2} \qquad (1) \\
\text{s.t.}\ & w_{y_i}^\top \phi(x_i) - \max_{l \neq y_i} w_l^\top \phi(x_i) \geq 1 - \theta - \xi_i, \\
& w_{y_i}^\top \phi(x_i) - \max_{l \neq y_i} w_l^\top \phi(x_i) \leq 1 + \theta + \epsilon_i, \quad \forall i,
\end{aligned}
$$

where $\lambda$, $\mu$ and $\theta$ are the trading-off parameters described previously.

## 4. Optimization

Due to the max operator in the second constraint, mcODM is a non-differentiable non-convex programming problem, which is quite difficult to solve directly. In this section, we first relax mcODM into a series of convex quadratic programming (QP) problems, which are much easier to handle. Specifically, at each iteration, we recast the first constraint as $k - 1$ linear inequality constraints,

$$w_{y_i}^\top \phi(x_i) - w_l^\top \phi(x_i) \geq 1 - \theta - \xi_i, \quad \forall l \neq y_i,$$

and replace the second constraint with

$$w_{y_i}^\top \phi(x_i) - M_i \leq 1 + \theta + \epsilon_i,$$

where $M_i = \max_{l \neq y_i} \bar{w}_l^\top \phi(x_i)$ and $\bar{w}_l$ is the solution of the previous iteration. Then we can repeatedly solve the following convex QP problem until convergence:

$$
\begin{aligned}
\min_{w_l, \xi_i, \epsilon_i}\ & \frac{1}{2}\sum_{l=1}^k \|w_l\|_{\mathbb H}^2 + \frac{\lambda}{m}\sum_{i=1}^m \frac{\xi_i^2 + \mu\epsilon_i^2}{(1-\theta)^2} \qquad (2) \\
\text{s.t.}\ & w_{y_i}^\top \phi(x_i) - w_l^\top \phi(x_i) \geq 1 - \theta - \xi_i, \quad \forall l \neq y_i, \\
& w_{y_i}^\top \phi(x_i) - M_i \leq 1 + \theta + \epsilon_i, \quad \forall i.
\end{aligned}
$$

Introducing the Lagrange multipliers $\zeta_i^l \geq 0$, $l \neq y_i$, for the first $k-1$ constraints and $\beta_i \geq 0$ for the last constraint, the Lagrangian function of Eq. (2) is

$$
\begin{aligned}
L(w_l, \xi_i, \epsilon_i, \zeta_i^l, \beta_i) = \ & \frac{1}{2}\sum_{l=1}^k \|w_l\|_{\mathbb H}^2 + \frac{\lambda}{m}\sum_{i=1}^m \frac{\xi_i^2 + \mu\epsilon_i^2}{(1-\theta)^2} \\
& - \sum_{i=1}^m \sum_{l \neq y_i} \zeta_i^l \big(w_{y_i}^\top \phi(x_i) - w_l^\top \phi(x_i) - 1 + \theta + \xi_i\big) \\
& + \sum_{i=1}^m \beta_i \big(w_{y_i}^\top \phi(x_i) - M_i - 1 - \theta - \epsilon_i\big).
\end{aligned}
$$

By setting the partial derivatives with respect to the variables $\{w_l, \xi_i, \epsilon_i\}$ to zero, we have

$$
w_l = \sum_{i=1}^m \Big( \delta_{y_i,l} \sum_{s \neq y_i} \zeta_i^s - (1 - \delta_{y_i,l}) \zeta_i^l - \delta_{y_i,l} \beta_i \Big) \phi(x_i), \quad
\xi_i = \frac{m(1-\theta)^2}{2\lambda} \sum_{l \neq y_i} \zeta_i^l, \quad
\epsilon_i = \frac{m(1-\theta)^2}{2\lambda\mu} \beta_i, \qquad (3)
$$

where $\delta_{y_i,l}$ equals 1 when $y_i = l$ and 0 otherwise. We can further simplify the expression of $w_l$ as

$$w_l = \sum_{i=1}^m (\alpha_i^l - \delta_{y_i,l} \beta_i)\, \phi(x_i) \qquad (4)$$

by defining $\alpha_i^l \triangleq -\zeta_i^l$ for $l \neq y_i$ and $\alpha_i^{y_i} \triangleq \sum_{s \neq y_i} \zeta_i^s$. Substituting Eq. (4) and Eq. (3) into the Lagrangian function, we obtain the following dual problem:

$$
\begin{aligned}
\min_{\alpha_i^l, \beta_i}\ & \frac{1}{2}\sum_{l=1}^k \|w_l\|_{\mathbb H}^2 + \frac{m(1-\theta)^2}{4\lambda}\sum_{i=1}^m (\alpha_i^{y_i})^2 + \frac{m(1-\theta)^2}{4\lambda\mu}\sum_{i=1}^m \beta_i^2 \\
& + (1-\theta)\sum_{i=1}^m \sum_{l \neq y_i} \alpha_i^l + \sum_{i=1}^m (M_i + 1 + \theta)\beta_i \qquad (5) \\
\text{s.t.}\ & \sum_{l=1}^k \alpha_i^l = 0, \quad \forall i, \\
& \alpha_i^l \leq 0, \quad \forall i,\ l \neq y_i, \\
& \beta_i \geq 0, \quad \forall i.
\end{aligned}
$$

The objective function in Eq. (5) involves $m(k+1)$ variables in total, so it is not easy to optimize with respect to all the variables simultaneously. Note that all the constraints can be partitioned into $m$ disjoint sets, and the $i$-th set only involves $\alpha_i^1, \ldots, \alpha_i^k, \beta_i$, so the variables can be divided into $m$ decoupled groups, and an efficient block coordinate descent algorithm (Tseng, 2001) can be applied. Specifically, we sequentially select the group of $k+1$ variables $\alpha_i^1, \ldots, \alpha_i^k, \beta_i$ associated with instance $x_i$ to minimize over, while keeping the other variables as constants, and repeat this procedure until convergence. Algorithm 1 below details kernel mcODM.

**Algorithm 1** Kernel mcODM
1. **Input:** Data set $S$.
2. Initialize $\boldsymbol{\alpha} = [\alpha_1^1, \ldots, \alpha_1^k, \ldots, \alpha_m^1, \ldots, \alpha_m^k]$ and $\boldsymbol{\beta} = [\beta_1, \ldots, \beta_m]$ as zero vectors.
3. **while** $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ have not converged **do**
4. &nbsp;&nbsp;**for** $i = 1, \ldots, m$ **do**
5. &nbsp;&nbsp;&nbsp;&nbsp;$M_i \leftarrow \max_{l \neq y_i} \sum_{j=1}^m (\alpha_j^l - \delta_{y_j,l}\beta_j)\, \kappa(x_j, x_i)$.
6. &nbsp;&nbsp;**end for**
7. &nbsp;&nbsp;Solve Eq. (5) by the block coordinate descent method.
8. **end while**
9. **Output:** $\boldsymbol{\alpha}$, $\boldsymbol{\beta}$.
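A compact NumPy rendering of Algorithm 1 follows, as a sketch under our own assumptions: labels are 0-based, a fixed outer-iteration cap stands in for the convergence test, and `solve_subproblem` is the per-instance routine of Section 4.1 (sketched there), which assembles the constants of Eq. (6) for instance $i$ and applies the sorting solver:

```python
import numpy as np

def kernel_mcodm(K, y, k, solve_subproblem, n_outer=20):
    """Sketch of Algorithm 1. K is the m x m kernel matrix and y holds
    integer labels in {0, ..., k-1}."""
    m = K.shape[0]
    alpha = np.zeros((m, k))              # alpha[i, l]
    beta = np.zeros(m)                    # beta[i]
    for _ in range(n_outer):              # "while not converged" in the paper
        coef = alpha.copy()
        coef[np.arange(m), y] -= beta     # alpha[j, l] - delta_{y_j, l} beta[j]
        scores = K @ coef                 # scores[i, l] = w_l . phi(x_i)
        masked = scores.copy()
        masked[np.arange(m), y] = -np.inf
        M = masked.max(axis=1)            # M_i = max_{l != y_i} w_l . phi(x_i)
        for i in range(m):                # one BCD sweep over Eq. (5)
            alpha[i], beta[i] = solve_subproblem(K, y, alpha, beta, M, i)
    return alpha, beta
```

In the paper, step 7 runs BCD on Eq. (5) to convergence before the $M_i$ are refreshed; the single sweep above is a simplification.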
### 4.1. Solving the sub-problem

The sub-problem in step 7 of Algorithm 1 is also a convex QP with $k+1$ variables, which can be solved by standard QP solvers. However, by exploiting its special structure, i.e., the fact that only a small number of cross terms are involved, we can derive an algorithm that solves this sub-problem just by sorting, which can be much faster than general QP solvers. Note that all variables except $\alpha_i^1, \ldots, \alpha_i^k, \beta_i$ are fixed, so we have the following sub-problem:

$$
\begin{aligned}
\min_{\alpha_i^l, \alpha_i^{y_i}, \beta_i}\ & \frac{A}{2}\sum_{l \neq y_i} (\alpha_i^l)^2 + \sum_{l \neq y_i} B_l \alpha_i^l + \frac{D}{2}(\alpha_i^{y_i})^2 - A\alpha_i^{y_i}\beta_i + B_{y_i}\alpha_i^{y_i} + \frac{E}{2}\beta_i^2 + F\beta_i \qquad (6) \\
\text{s.t.}\ & \sum_{l=1}^k \alpha_i^l = 0, \\
& \alpha_i^l \leq 0, \quad l \neq y_i, \\
& \beta_i \geq 0,
\end{aligned}
$$

where $A = \kappa(x_i, x_i)$, $B_l = \sum_{j \neq i} \kappa(x_i, x_j)(\alpha_j^l - \delta_{y_j,l}\beta_j) + 1 - \theta$ for $l \neq y_i$, $B_{y_i} = \sum_{j \neq i} \kappa(x_i, x_j)(\alpha_j^{y_i} - \delta_{y_j,y_i}\beta_j)$, $D = A + \frac{m(1-\theta)^2}{2\lambda}$, $E = A + \frac{m(1-\theta)^2}{2\lambda\mu}$ and $F = M_i + 1 + \theta - B_{y_i}$.

The KKT conditions of Eq. (6) indicate that there exist scalars $\nu$, $\rho_l$ and $\eta$ such that

$$
\begin{aligned}
& \sum_{l=1}^k \alpha_i^l = 0, \qquad (7) \\
& \alpha_i^l \leq 0, \quad l \neq y_i, \qquad (8) \\
& \beta_i \geq 0, \qquad (9) \\
& \rho_l \alpha_i^l = 0,\ \rho_l \geq 0, \quad l \neq y_i, \qquad (10) \\
& A\alpha_i^l + B_l - \nu + \rho_l = 0, \quad l \neq y_i, \qquad (11) \\
& \eta\beta_i = 0,\ \eta \geq 0, \qquad (12) \\
& -A\alpha_i^{y_i} + E\beta_i + F - \eta = 0, \qquad (13) \\
& D\alpha_i^{y_i} - A\beta_i + B_{y_i} - \nu = 0. \qquad (14)
\end{aligned}
$$

According to Eq. (8), Eq. (10) and Eq. (11) are equivalent to

$$
\begin{aligned}
& A\alpha_i^l + B_l - \nu = 0, \quad \text{if } \alpha_i^l < 0,\ l \neq y_i, \qquad (15) \\
& B_l - \nu \leq 0, \quad \text{if } \alpha_i^l = 0,\ l \neq y_i. \qquad (16)
\end{aligned}
$$

In the same way, Eq. (12) and Eq. (13) are equivalent to

$$
\begin{aligned}
& -A\alpha_i^{y_i} + E\beta_i + F = 0, \quad \text{if } \beta_i > 0, \qquad (17) \\
& -A\alpha_i^{y_i} + F \geq 0, \quad \text{if } \beta_i = 0. \qquad (18)
\end{aligned}
$$

Thus the KKT conditions reduce to Eqs. (7)-(9) and Eqs. (14)-(18). Note that

$$\alpha_i^l = \min\Big(0,\ \frac{\nu - B_l}{A}\Big), \quad l \neq y_i, \qquad (19)$$

satisfies the KKT conditions Eq. (8) and Eqs. (15)-(16), and

$$\beta_i = \max\Big(0,\ \frac{A\alpha_i^{y_i} - F}{E}\Big) \qquad (20)$$

satisfies the KKT conditions Eq. (9) and Eqs. (17)-(18). By substituting Eq. (20) into Eq. (14), we obtain

$$D\alpha_i^{y_i} + B_{y_i} - \nu = \max\Big(0,\ \frac{A}{E}\big(A\alpha_i^{y_i} - F\big)\Big). \qquad (21)$$

Let us consider the following two cases in turn.

**Case 1:** $A\alpha_i^{y_i} \leq F$. According to Eq. (20) and Eq. (21), we have $\beta_i = 0$ and $\alpha_i^{y_i} = \frac{\nu - B_{y_i}}{D}$. Thus $A\,\frac{\nu - B_{y_i}}{D} \leq F$, which implies that $\nu \leq B_{y_i} + \frac{DF}{A}$.

**Case 2:** $A\alpha_i^{y_i} > F$. According to Eq. (20) and Eq. (21), we have $\beta_i = \frac{A\alpha_i^{y_i} - F}{E} > 0$ and $\alpha_i^{y_i} = \frac{E\nu - AF - EB_{y_i}}{DE - A^2}$. Thus $A\,\frac{E\nu - AF - EB_{y_i}}{DE - A^2} > F$, which implies that $\nu > B_{y_i} + \frac{DF}{A}$.

The remaining task is to find $\nu$ such that Eq. (7) holds. With Eq. (7) and Eq. (19), it can be shown that

$$
\nu = \frac{\sum_{l:\, \alpha_i^l < 0} B_l + \frac{A}{D} B_{y_i}}{\frac{A}{D} + |\{l \mid \alpha_i^l < 0\}|} \quad \text{in Case 1}, \qquad (22)
$$

$$
\nu = \frac{\sum_{l:\, \alpha_i^l < 0} B_l + \frac{AEB_{y_i} + A^2 F}{DE - A^2}}{\frac{AE}{DE - A^2} + |\{l \mid \alpha_i^l < 0\}|} \quad \text{in Case 2}. \qquad (23)
$$

In both cases, the optimal $\nu$ takes the form $(P + \sum_{l:\, \alpha_i^l < 0} B_l) / (Q + |\{l \mid \alpha_i^l < 0\}|)$, where $P$ and $Q$ are constants. Fan et al. (2008) showed that such a $\nu$ can be found by sorting $\{B_l\}$ for $l \neq y_i$ in decreasing order and then sequentially adding them into an initially empty set $\Phi$, until

$$\frac{P + \sum_{l \in \Phi} B_l}{Q + |\Phi|} \geq \max_{l \notin \Phi} B_l. \qquad (24)$$

Note that the Hessian matrix of the objective function of Eq. (6) is positive definite, which guarantees the existence and uniqueness of the optimal solution, so only one of Eq. (22) and Eq. (23) can hold. We can first compute $\nu$ according to Eq. (22) for Case 1 and check whether the constraint on $\nu$ is satisfied; if not, we further compute $\nu$ by Eq. (23) for Case 2. Algorithm 2 summarizes the pseudo-code for solving the sub-problem.

**Algorithm 2** Solving the sub-problem
1. **Input:** Parameters $A$, $B = \{B_1, \ldots, B_k\}$, $D$, $E$, $F$.
2. Initialize $\hat{B} \leftarrow B$, then swap $\hat{B}_1$ and $\hat{B}_{y_i}$, and sort $\hat{B} \setminus \{\hat{B}_1\}$ in decreasing order.
3. $i \leftarrow 2$, $\nu \leftarrow A B_{y_i} / D$.
4. **while** $i \leq k$ and $\nu / (i - 2 + A/D) < \hat{B}_i$ **do**
5. &nbsp;&nbsp;$\nu \leftarrow \nu + \hat{B}_i$.
6. &nbsp;&nbsp;$i \leftarrow i + 1$.
7. **end while**
8. $\nu \leftarrow \nu / (i - 2 + A/D)$.
9. **if** $\nu \leq B_{y_i} + DF/A$ **then**
10. &nbsp;&nbsp;$\alpha_i^l \leftarrow \min(0, (\nu - B_l)/A)$, $\forall l \neq y_i$.
11. &nbsp;&nbsp;$\alpha_i^{y_i} \leftarrow (\nu - B_{y_i})/D$.
12. &nbsp;&nbsp;$\beta_i \leftarrow 0$.
13. **else**
14. &nbsp;&nbsp;$i \leftarrow 2$, $\nu \leftarrow (AE\hat{B}_1 + A^2 F)/(DE - A^2)$.
15. &nbsp;&nbsp;**while** $i \leq k$ and $\nu / (i - 2 + AE/(DE - A^2)) < \hat{B}_i$ **do**
16. &nbsp;&nbsp;&nbsp;&nbsp;$\nu \leftarrow \nu + \hat{B}_i$.
17. &nbsp;&nbsp;&nbsp;&nbsp;$i \leftarrow i + 1$.
18. &nbsp;&nbsp;**end while**
19. &nbsp;&nbsp;$\nu \leftarrow \nu / (i - 2 + AE/(DE - A^2))$.
20. &nbsp;&nbsp;$\alpha_i^l \leftarrow \min(0, (\nu - B_l)/A)$, $\forall l \neq y_i$.
21. &nbsp;&nbsp;$\alpha_i^{y_i} \leftarrow (E\nu - AF - EB_{y_i})/(DE - A^2)$.
22. &nbsp;&nbsp;$\beta_i \leftarrow (A\alpha_i^{y_i} - F)/E$.
23. **end if**
24. **Output:** $\alpha_i^1, \ldots, \alpha_i^k, \beta_i$.
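The sorting solver translates almost line-for-line into code. Below is a NumPy sketch under our own naming and 0-based indexing; it is the core of the `solve_subproblem` callback assumed in the Algorithm 1 sketch, once the constants $A, B, D, E, F$ have been assembled for instance $i$ as defined after Eq. (6):

```python
import numpy as np

def solve_subproblem(A, B, D, E, F, yi):
    """Sorting-based solver for the sub-problem of Eq. (6) (Algorithm 2)."""
    others = np.sort(np.delete(B, yi))[::-1]   # {B_l : l != yi}, decreasing

    def find_nu(P, Q):
        # Grow Phi greedily until Eq. (24) holds, then return
        # nu = (P + sum_{l in Phi} B_l) / (Q + |Phi|).
        num, size = P, 0
        for b in others:
            if num / (Q + size) >= b:
                break
            num, size = num + b, size + 1
        return num / (Q + size)

    nu = find_nu(A / D * B[yi], A / D)         # try Case 1 first: beta_i = 0
    if nu <= B[yi] + D * F / A:                # constraint of Case 1 holds
        alpha = np.minimum(0.0, (nu - B) / A)  # Eq. (19)
        alpha[yi] = (nu - B[yi]) / D
        beta = 0.0
    else:                                      # Case 2: beta_i > 0
        det = D * E - A * A                    # positive, since D, E > A
        nu = find_nu((A * E * B[yi] + A * A * F) / det, A * E / det)
        alpha = np.minimum(0.0, (nu - B) / A)
        alpha[yi] = (E * nu - A * F - E * B[yi]) / det
        beta = (A * alpha[yi] - F) / E
    return alpha, beta
```

Since only a sort of $k-1$ numbers plus a linear scan is needed, each block update costs $O(k \log k)$, versus the polynomial cost of a generic QP solver.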
### 4.2. Speedup for the linear kernel

The method proposed in Section 4.1 can efficiently deal with kernel mcODM. However, the computation of $M_i$ in step 5 of Algorithm 1 and the computation of the parameters $B_l$ in Algorithm 2 both involve the kernel matrix, whose inherent computational cost is $O(m^2)$, so the method might be computationally prohibitive for large-scale problems.

When the linear kernel is used, these problems can be alleviated. According to Eq. (4), $w_l$ is spanned by the training instances, so it lies in a finite-dimensional space in this case. By storing $w_1, \ldots, w_k$ explicitly, the computational cost of $M_i = \max_{l \neq y_i} w_l^\top x_i$ can be much reduced. Moreover, note that for $l \neq y_i$,

$$B_l = \sum_{j \neq i} x_i^\top x_j (\alpha_j^l - \delta_{y_j,l}\beta_j) + 1 - \theta = \sum_{j=1}^m x_i^\top x_j (\alpha_j^l - \delta_{y_j,l}\beta_j) - x_i^\top x_i (\alpha_i^l - \delta_{y_i,l}\beta_i) + 1 - \theta = w_l^\top x_i - A(\alpha_i^l - \delta_{y_i,l}\beta_i) + 1 - \theta,$$

so $B_l$ can also be computed efficiently.

## 5. Analysis

In this section, we study the statistical properties of mcODM. To present the generalization bound of mcODM, we need to introduce the following loss function $\Phi$:

$$
\Phi(z) = \begin{cases} 1, & z \leq 0, \\ \dfrac{(z - 1 + \theta)^2}{(1-\theta)^2}, & 0 < z \leq 1 - \theta, \\ 0, & z > 1 - \theta. \end{cases}
$$

**Theorem 1.** Let $H = \{(x, y) \mapsto w_y^\top \phi(x) \mid \sum_{l=1}^k \|w_l\|_{\mathbb H}^2 \leq \Lambda^2\}$ be the hypothesis space, and suppose $\kappa(x, x) \leq r^2$ for all $x \in \mathcal{X}$. Then for any $\delta > 0$, with probability at least $1 - \delta$, the following generalization bound holds for any $h \in H$:

$$R(h) \leq \frac{1}{m}\sum_{i=1}^m \Phi(\gamma_h(x_i, y_i)) + \frac{16r\Lambda}{1-\theta}\sqrt{\frac{2\pi k}{m}} + 3\sqrt{\frac{\ln(2/\delta)}{2m}}.$$

*Proof.* Let $\tilde{H}_\theta$ be the family of hypotheses mapping $\mathcal{X} \times \mathcal{Y} \mapsto \mathbb{R}$ defined by $\tilde{H}_\theta = \{(x, y) \mapsto \gamma_{h,\theta}(x, y) : h \in H\}$, where $\gamma_{h,\theta}(x, y) = \min(\gamma_h(x, y), 1 - \theta)$ denotes the margin clipped at the zero-loss band. McDiarmid's inequality (McDiarmid, 1989) yields, with probability at least $1 - \delta$,

$$\mathbb{E}[\Phi(\gamma_{h,\theta}(x, y))] \leq \frac{1}{m}\sum_{i=1}^m \Phi(\gamma_{h,\theta}(x_i, y_i)) + 2\mathfrak{R}_S(\Phi \circ \tilde{H}_\theta) + 3\sqrt{\frac{\ln(2/\delta)}{2m}}, \quad \forall h \in H.$$

Note that $\Phi(\gamma_{h,\theta}) = \Phi(\gamma_h)$, that $R(h) = \mathbb{E}[1_{\gamma_h(x,y) \leq 0}] = \mathbb{E}[1_{\gamma_{h,\theta}(x,y) \leq 0}] \leq \mathbb{E}[\Phi(\gamma_{h,\theta}(x, y))]$, and that $\Phi(z)$ is a $\frac{2}{1-\theta}$-Lipschitz function. Using Talagrand's lemma (Mohri et al., 2012), we have

$$R(h) \leq \frac{1}{m}\sum_{i=1}^m \Phi(\gamma_h(x_i, y_i)) + \frac{4}{1-\theta}\,\mathfrak{R}_S(\tilde{H}_\theta) + 3\sqrt{\frac{\ln(2/\delta)}{2m}}.$$

According to Theorem 7 of Lei et al. (2015), $\mathfrak{R}_S(\tilde{H}_\theta) \leq 4r\Lambda\sqrt{2\pi k / m}$, which proves the stated result. $\square$

Theorem 1 shows that we obtain a tighter generalization bound for smaller $r\Lambda$ and smaller $\theta$. Since $\gamma_h(x, y) \leq 2r\Lambda$, the former can be viewed as an upper bound of the margin. Besides, $1 - \theta$ is the lower bound of the zero-loss band of mcODM. This verifies that a better margin distribution (i.e., a larger margin mean and a smaller margin variance) can yield better generalization performance, which is also consistent with the work of Gao & Zhou (2013).
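As a quick illustration (our own sketch, reusing `multiclass_margins` from the Section 2 snippet), the surrogate loss $\Phi$ and the empirical term of the bound can be computed as:

```python
import numpy as np

def phi_loss(z, theta):
    """Surrogate loss of Theorem 1: 1 on z <= 0, quadratic on the
    band (0, 1 - theta], and 0 past the zero-loss band."""
    z = np.asarray(z, dtype=float)
    quad = (z - 1 + theta) ** 2 / (1 - theta) ** 2
    return np.where(z <= 0, 1.0, np.where(z <= 1 - theta, quad, 0.0))

# Empirical term of the bound:
# phi_loss(multiclass_margins(scores, y), theta).mean()
```

Because `phi_loss` upper-bounds the 0-1 loss and vanishes beyond $1 - \theta$, the empirical term directly rewards margin distributions concentrated past the zero-loss band.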
## 6. Empirical Study

In this section, we empirically evaluate the effectiveness of our method on a broad range of data sets. We first introduce the experimental settings and the compared methods in Section 6.1, and then in Section 6.2 we compare our method with four versions of multi-class SVMs, i.e., mcSVM (Weston & Watkins, 1999; Crammer & Singer, 2001; 2002), one-versus-all SVM (ovaSVM), one-versus-one SVM (ovoSVM) (Kreßel, 1998) and error-correcting output code SVM (ecocSVM) (Dietterich & Bakiri, 1995). In addition, we also study the influence of the number of classes on the generalization performance and the margin distribution in Section 6.3. Finally, the computational cost is presented in Section 6.4.

### 6.1. Experimental Setup

We evaluate the effectiveness of our proposed method on twenty-two data sets. Table 1 summarizes the statistics of these data sets. The data set size ranges from 150 to more than 580,000, and the dimensionality ranges from 4 to more than 62,000. Moreover, the number of classes ranges from 3 to 1,000, so these data sets cover a broad range of properties. All features are normalized into the interval [0, 1]. For each data set, eighty percent of the instances are randomly selected as training data, and the rest are used as testing data. For each data set, experiments are repeated 10 times with random data partitions, and the average accuracies as well as the standard deviations are recorded.

Table 1. Characteristics of the experimental data sets.

| ID | Dataset | #Instance | #Feature | #Class | ID | Dataset | #Instance | #Feature | #Class |
|---|---|---|---|---|---|---|---|---|---|
| 1 | iris | 150 | 4 | 3 | 12 | sector | 9,619 | 55,197 | 105 |
| 2 | wine | 178 | 13 | 3 | 13 | pendigits | 10,992 | 16 | 10 |
| 3 | glass | 214 | 9 | 6 | 14 | news20 | 19,928 | 62,061 | 20 |
| 4 | svmguide2 | 391 | 20 | 3 | 15 | letter | 20,000 | 16 | 26 |
| 5 | svmguide4 | 612 | 10 | 6 | 16 | protein | 24,387 | 357 | 3 |
| 6 | vehicle | 846 | 18 | 4 | 17 | shuttle | 58,000 | 9 | 7 |
| 7 | vowel | 990 | 10 | 11 | 18 | connect-4 | 67,557 | 126 | 3 |
| 8 | segment | 2,310 | 19 | 7 | 19 | mnist | 70,000 | 780 | 10 |
| 9 | dna | 3,186 | 180 | 3 | 20 | aloi | 108,000 | 128 | 1,000 |
| 10 | satimage | 6,435 | 36 | 6 | 21 | rcv1 | 534,135 | 47,236 | 53 |
| 11 | usps | 9,298 | 256 | 10 | 22 | covtype | 581,012 | 54 | 7 |

mcODM is compared with four versions of multi-class SVMs, i.e., ovaSVM, ovoSVM, ecocSVM and mcSVM. These four methods can be roughly grouped into two categories. The first group contains the first three methods, which convert the multi-class classification problem into a set of binary classification problems. Specifically, ovaSVM learns $k$ scoring functions $h_l : \mathcal{X} \mapsto \{-1, +1\}$, $l \in \mathcal{Y}$, each seeking to discriminate one class $l \in \mathcal{Y}$ from all the others, so it needs to train $k$ SVM models. Alternatively, ovoSVM determines scoring functions for all combinations of class pairs, so it needs to train $k(k-1)/2$ SVM models. Finally, ecocSVM is a generalization of the former two methods. This technique assigns to each class $l \in \mathcal{Y}$ a code word of length $c$, which serves as a signature for this class. After training $c$ binary SVM models $h_1(\cdot), \ldots, h_c(\cdot)$, the class predicted for a testing instance $x$ is the one whose signature is closest to $[h_1(x), \ldots, h_c(x)]$ in Hamming distance (see the decoding sketch below). The weakness of these methods is that they may produce unclassifiable regions, and their computational costs are usually quite large in practice, as can be observed in the following experiments. On the other hand, mcSVM belongs to the second group: it directly determines all the scoring functions at once, so its time cost is usually less than that of the former methods. In addition, the unclassifiable regions are also resolved.

For all the methods, the regularization parameter $\lambda$ for mcODM, or $C$ for binary SVM and mcSVM, is selected by 5-fold cross-validation from $[2^0, 2^2, \ldots, 2^{20}]$. For mcODM, the parameters $\mu$ and $\theta$ are both selected from $[0.2, 0.4, 0.6, 0.8]$. For ecocSVM, the exhaustive codes strategy is employed, i.e., for each class we construct a code of length $2^{k-1} - 1$ as the signature. All parameter selections are performed on the training sets.
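For concreteness, here is a minimal sketch of the Hamming-distance decoding used by ecocSVM (our own naming; `preds` collects the $c$ binary outputs for one instance):

```python
import numpy as np

def ecoc_decode(preds, codebook):
    """Predict the class whose code word is closest in Hamming distance.

    preds:    (c,) vector in {-1, +1}, the outputs [h_1(x), ..., h_c(x)]
    codebook: (k, c) matrix in {-1, +1}; row l is the signature of class l
    """
    hamming = (codebook != preds).sum(axis=1)  # distance to each signature
    return int(np.argmin(hamming))             # predicted class
```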
### 6.2. Results

Table 2 summarizes the detailed results on the twenty-two data sets. As can be seen, the overall performance of our method is superior or highly competitive to the other compared methods. Specifically, mcODM performs significantly better than mcSVM/ovaSVM/ovoSVM/ecocSVM on 17/19/18/17 of the 22 data sets, respectively, and achieves the best accuracy on 20 data sets.

Table 2. Accuracy (mean±std.) comparison on twenty-two data sets. The linear kernel is used. The best accuracy on each data set is bolded. Significance is assessed by paired t-tests at the 95% significance level; the average rank is listed in the second row from the bottom, and the win/tie/loss counts for mcODM are summarized in the last row. ovoSVM and ecocSVM did not return results on some data sets within 48 hours (marked N/A).

| Dataset | mcSVM | ovaSVM | ovoSVM | ecocSVM | mcODM |
|---|---|---|---|---|---|
| iris | 84.7±5.0 | 82.0±5.0 | 73.0±7.1 | 73.3±4.2 | **87.3±3.4** |
| wine | 96.5±2.6 | 97.0±2.4 | 96.8±3.1 | 91.6±2.7 | **98.4±1.9** |
| glass | 62.0±7.6 | 57.8±2.1 | 62.4±6.0 | 60.4±4.0 | **68.2±7.3** |
| svmguide2 | 80.9±2.6 | 81.5±2.3 | 74.6±5.4 | 79.0±3.6 | **84.1±1.8** |
| svmguide4 | 86.6±3.2 | 77.2±3.1 | 77.1±6.9 | 74.2±4.3 | **87.8±3.5** |
| vehicle | 80.5±2.1 | 78.5±2.9 | 73.5±3.5 | 76.9±2.5 | **85.6±6.6** |
| vowel | 65.4±2.1 | 44.1±2.6 | **71.1±4.9** | 43.1±2.0 | 65.6±3.9 |
| segment | 95.3±0.9 | 92.3±0.7 | 93.2±2.2 | 90.7±0.8 | **96.7±1.3** |
| dna | 95.1±0.6 | 95.1±0.7 | 93.0±1.0 | 90.9±1.1 | **95.6±0.7** |
| satimage | 81.8±0.5 | 80.1±0.7 | 77.4±4.2 | 76.0±1.4 | **89.1±8.0** |
| usps | **95.1±0.3** | 94.2±0.3 | 94.5±0.6 | 92.3±0.7 | **95.1±0.6** |
| sector | 93.3±0.4 | 93.1±0.5 | 89.9±0.6 | N/A | **93.6±0.6** |
| pendigits | 94.8±0.3 | 91.8±0.5 | 95.7±0.8 | 87.6±0.7 | **96.7±0.9** |
| news20 | 83.3±0.3 | 83.0±0.3 | 69.4±0.4 | N/A | **83.6±0.3** |
| letter | 76.9±0.4 | 64.9±1.0 | 75.7±1.2 | N/A | **82.7±1.6** |
| protein | 68.7±0.3 | 68.7±0.4 | 68.1±0.3 | 66.5±0.5 | **70.2±2.3** |
| shuttle | 97.2±0.1 | 89.9±2.8 | 97.4±0.1 | 84.2±0.0 | **99.6±0.0** |
| connect-4 | 75.7±0.1 | 75.7±0.1 | 75.7±0.2 | 75.1±0.2 | **94.4±7.6** |
| mnist | 92.5±0.2 | 91.5±0.2 | 90.9±0.5 | 87.6±0.2 | **95.2±0.3** |
| aloi | 90.1±0.2 | 82.2±0.2 | N/A | N/A | **91.4±0.3** |
| rcv1 | 92.8±0.1 | 92.7±0.1 | 92.1±0.1 | N/A | **93.0±1.1** |
| covtype | 71.7±0.1 | 71.2±0.1 | 72.7±0.1 | 70.6±0.1 | **72.8±1.6** |
| Avg. Rank | 2.45 | 3.18 | 3.45 | 4.82 | 1.09 |
| mcODM: w/t/l | 17/5/0 | 19/3/0 | 18/3/0 | 17/0/0 | |

In addition, in comparison with the other four methods, which do not consider the margin distribution, the win/tie/loss counts show that mcODM is always better than or comparable to them, and almost never worse.

### 6.3. Influence of the Number of Classes

In this section, we study the influence of the number of classes on the generalization performance and the margin distribution, respectively.

#### 6.3.1. Generalization Performance

Figure 1 plots the generalization performance of all five methods on the data set segment; similar observations can be made on the other data sets. As can be seen, when the number of classes is less than four, all methods perform quite well. However, as the fifth class is added, the generalization performance of the other four methods decreases drastically. This might be attributable to the introduction of some noisy data, to which SVM-style algorithms are very sensitive since they optimize the minimum margin. On the other hand, our method considers the whole margin distribution, so it can be robust to noise and relatively more stable.

[Figure 1. Generalization performance of all five methods on the data set segment as the number of classes increases.]

#### 6.3.2. Margin Distribution

Figure 2 plots the frequency histogram of the margin distribution produced by mcSVM, ovaSVM and mcODM on the data set segment as the number of classes increases from two to seven. As can be seen, when the number of classes is less than four, all methods achieve a good margin distribution, whereas as more classes are added, the other two methods begin to produce negative margins. At the same time, the distribution of our method becomes sharper, which prevents instances from having small margins, so our method can still perform relatively well as the number of classes increases. This is also consistent with the observation from Figure 1.

[Figure 2. Frequency histogram of the margin distribution produced by mcSVM, ovaSVM and mcODM on the data set segment as the number of classes increases from two to seven.]
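A sketch of how such margin-frequency histograms can be produced (our own code, reusing `multiclass_margins` from the Section 2 snippet):

```python
import numpy as np

def margin_histogram(scores, y, bins=50):
    """Frequency histogram of the margin distribution (cf. Figure 2)."""
    margins = multiclass_margins(scores, y)  # Section 2 sketch
    freq, edges = np.histogram(margins, bins=bins)
    return freq, edges                       # bins below zero: misclassified
```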
### 6.4. Time Cost

We compare the single-iteration time cost of our method with mcSVM, ovaSVM and ovoSVM on all the data sets except aloi, on which ovoSVM could not return results within 48 hours. All experiments are performed with MATLAB 2012b on a machine with 8 × 2.60 GHz CPUs and 32GB main memory. The average CPU time (in seconds) on each data set is shown in Figure 3. The binary SVMs used in ovaSVM and ovoSVM, as well as mcSVM, are implemented with the LIBLINEAR (Fan et al., 2008) package. It can be seen that for small data sets the efficiency of all the methods is similar; however, for data sets with more than ten classes, e.g., sector and rcv1, mcSVM and mcODM, which learn all the scoring functions at once, are much faster than ovaSVM and ovoSVM, owing to the inefficiency of binary decomposition, as discussed in Section 6.1. Note that LIBLINEAR provides very fast implementations of binary SVM and mcSVM, so this also shows that our method is computationally efficient.

[Figure 3. Single-iteration time cost of mcSVM, ovaSVM, ovoSVM and mcODM on all the data sets except aloi.]

## 7. Conclusions

Recent studies disclose that, for binary classification, maximizing the minimum margin does not necessarily lead to better generalization performance; instead, it is crucial to optimize the margin distribution. However, the influence of the margin distribution for multi-class classification remained open. We try to answer this question in this paper. When maximizing the margin mean and minimizing the margin variance simultaneously, the resultant optimization is a difficult non-differentiable non-convex programming problem. We propose mcODM to solve this problem efficiently. Extensive experiments on twenty-two data sets validate the superiority of our method to four versions of multi-class SVMs. In the future, it will be interesting to extend mcODM to more general learning settings, e.g., multi-label learning and structured learning.

## Acknowledgements

This research was supported by the NSFC (61333014) and the Collaborative Innovation Center of Novel Software Technology and Industrialization. The authors want to thank the reviewers for helpful comments, and thank Dr. Wei Gao for reading a draft.

## References

Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

Crammer, K. and Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.
Crammer, K. and Singer, Y. On the learnability and design of output codes for multiclass problems. Machine Learning, 47(2):201–233, 2002.

Dietterich, T. G. and Bakiri, G. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, 1995.

Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.

Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. In Proceedings of the 2nd European Conference on Computational Learning Theory, pp. 23–37, Barcelona, Spain, 1995.

Gao, W. and Zhou, Z.-H. On the doubt about margin explanation of boosting. Artificial Intelligence, 203:1–18, 2013.

Kreßel, U. H.-G. Pairwise classification and support vector machines. In Schölkopf, B., Burges, C. J. C., and Smola, A. J. (eds.), Advances in Kernel Methods: Support Vector Learning, pp. 255–268. MIT Press, Cambridge, MA, 1998.

Lei, Y., Dogan, Ü., Binder, A., and Kloft, M. Multi-class SVMs: From tighter data-dependent generalization bounds to novel algorithms. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28, pp. 2026–2034. Curran Associates, Inc., 2015.

McDiarmid, C. On the method of bounded differences. Surveys in Combinatorics, 141(1):148–188, 1989.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press, Cambridge, MA, 2012.

Reyzin, L. and Schapire, R. E. How boosting the margin can also boost classifier complexity. In Proceedings of the 23rd International Conference on Machine Learning, pp. 753–760, Pittsburgh, PA, 2006.

Schapire, R. E., Freund, Y., Bartlett, P. L., and Lee, W. S. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5):1651–1686, 1998.

Tseng, P. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, 2001.

Vapnik, V. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.

Wang, L. W., Sugiyama, M., Yang, C., Zhou, Z.-H., and Feng, J. A refined margin analysis for boosting algorithms via equilibrium margin. Journal of Machine Learning Research, 12:1835–1863, 2011.

Weston, J. and Watkins, C. Multi-class support vector machines. In Proceedings of the European Symposium on Artificial Neural Networks, Brussels, Belgium, 1999.

Zhang, T. and Zhou, Z.-H. Large margin distribution machine. In Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 313–322, New York, NY, 2014.

Zhang, T. and Zhou, Z.-H. Optimal margin distribution machine. CoRR, abs/1604.03348, 2016.

Zhou, Y.-H. and Zhou, Z.-H. Large margin distribution learning with cost interval and unlabeled data. IEEE Transactions on Knowledge and Data Engineering, 28(7):1749–1763, 2016.

Zhou, Z.-H. Ensemble Methods: Foundations and Algorithms. CRC Press, Boca Raton, FL, 2012.

Zhou, Z.-H. Large margin distribution learning. In Proceedings of the 6th IAPR International Workshop on Artificial Neural Networks in Pattern Recognition, pp. 1–11, Montreal, Canada, 2014.