# Semi-Supervised Optimal Margin Distribution Machines

Teng Zhang and Zhi-Hua Zhou
National Key Lab for Novel Software Technology, Nanjing University, Nanjing 210023, China
{zhangt, zhouzh}@lamda.nju.edu.cn

This research was supported by the National Key R&D Program of China (2018YFB1004300), NSFC (61751306), and the Program B for Outstanding PhD Candidates of Nanjing University.

Abstract

Semi-supervised support vector machines extend standard support vector machines with unlabeled instances: the goal is to find a label assignment of the unlabeled instances such that the decision boundary has the maximal minimum margin on both the original labeled instances and the unlabeled instances. Recent studies, however, disclosed that maximizing the minimum margin does not necessarily lead to better performance; instead, it is crucial to optimize the margin distribution. In this paper, we propose a novel approach, ssODM (Semi-Supervised Optimal margin Distribution Machine), which tries to assign labels to the unlabeled instances and to achieve optimal margin distribution simultaneously. Specifically, we characterize the margin distribution by the first- and second-order statistics, i.e., the margin mean and variance, and extend a stochastic mirror prox method to solve the resultant saddle point problem. Extensive experiments on twenty UCI data sets show that ssODM is significantly better than the compared methods, which verifies the superiority of optimal margin distribution learning.

1 Introduction

Traditional supervised learning uses only labeled instances to train classifiers. However, labeled instances are often difficult or expensive to obtain since they require the effort of experienced human annotators. On the other hand, unlabeled instances are ubiquitous and relatively easy to collect. To exploit their value, many semi-supervised learning algorithms have been proposed, among which a very popular family is the semi-supervised support vector machines (S3VMs). Examples include the semi-supervised SVM [Bennett and Demiriz, 1999], the transductive SVM (TSVM) [Joachims, 1999], the Laplacian SVM [Belkin et al., 2006], and the S3VM using label means (MeanS3VM) [Li et al., 2009a], just to name a few.

Bennett and Demiriz's S3VM and the TSVM are built upon the cluster assumption and use the unlabeled instances to regularize the decision boundary. Specifically, these methods prefer decision boundaries that pass through low-density regions [Chapelle and Zien, 2005]. The Laplacian SVM is an S3VM that exploits the manifold structure of the instances via the graph Laplacian. It encodes both the labeled and unlabeled instances by a connected graph, where each instance is represented as a vertex and two vertices are connected by an edge if they are quite similar to each other. The goal is to find labels for the unlabeled instances such that their inconsistencies with both the labeled instances and the underlying graph structure are minimized. MeanS3VM is based on the observation that an S3VM with knowledge of the label means of the unlabeled instances is closely related to a supervised SVM with known labels on all the unlabeled instances. The approach thus consists of two steps: first estimate the label means of the unlabeled instances, and then build an SVM model.
Moreover, several works also extend S3VMs to other learning settings. For example, the CS4VM [Li et al., 2010] considers the situation where different misclassification errors are associated with unequal costs, and the safe S3VM (S4VM) [Li and Zhou, 2011] exploits many candidate low-density separators simultaneously to reduce the risk of identifying a single poor separator with unlabeled instances, so that it often performs better than standard S3VMs.

All the aforementioned S3VMs are based on the large margin principle, i.e., they try to maximize the minimum margin of the training instances. However, recent studies on margin theory [Gao and Zhou, 2013] disclosed that maximizing the minimum margin does not necessarily lead to better performance; instead, it is crucial to optimize the margin distribution. Inspired by this recognition, Zhang and Zhou (2014) proposed ODMs (optimal margin distribution machines), which can achieve better generalization performance than large-margin-based methods. Later, Zhang and Zhou (2017; 2018) extended the idea to multi-class learning and clustering. The success of optimal margin distribution learning suggests that there may still be large room to further improve S3VMs.

In this paper, we propose a novel approach, ssODM (semi-supervised optimal margin distribution machine), which tries to learn the label assignments of unlabeled instances and to achieve optimal margin distribution simultaneously. Specifically, we characterize the margin distribution by the first- and second-order statistics, i.e., the margin mean and variance, and then apply the minimax convex relaxation proposed in [Li et al., 2009b], which is proven to be tighter than SDP relaxations [Xu et al., 2005], to get a convex reformulation. For the optimization of the resultant saddle point problem, we propose a stochastic mirror prox method which converges more quickly in practice than general subgradient descent for non-smooth problems. Extensive experiments on twenty UCI data sets show that ssODM is significantly better than the compared methods, which verifies the superiority of optimal margin distribution learning.

The rest of this paper is organized as follows. We first introduce some preliminaries and then present the ssODM method. Next we show the experimental studies. Finally we conclude this paper with future work.

2 Preliminaries

We start with the traditional supervised learning setting. Denote $\mathcal{X}$ as the instance space and $\mathcal{Y} = \{+1, -1\}$ as the label set. Let $\mathcal{D}$ be an unknown (underlying) distribution over $\mathcal{X} \times \mathcal{Y}$. A training set $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\} \in (\mathcal{X} \times \mathcal{Y})^m$ is drawn independently and identically (i.i.d.) according to $\mathcal{D}$. Let $\phi : \mathcal{X} \mapsto \mathcal{H}$ be a feature mapping associated to some positive definite kernel $\kappa : \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$. The hypothesis is defined based on the linear model $h(x) = w^\top \phi(x)$, and the predicted label of an instance $x$ is the sign of $h(x)$. This decision function naturally leads to the definition of margin for a labeled instance, i.e., $\gamma(x, y) = y\, w^\top \phi(x)$ [Cristianini and Shawe-Taylor, 2000]. Thus the higher the margin value, the more confident we are that the label of $x$ is $y$, and $h$ misclassifies $(x, y)$ if and only if it produces a negative margin.
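To make the margin notion concrete, here is a minimal Python sketch (not code from the paper; the kernel width, the dual coefficients `beta`, and the toy data are hypothetical placeholders). It computes the functional margins $\gamma(x, y) = y\,h(x)$ for a kernelized linear model $h(x) = \sum_i \beta_i \kappa(x_i, x)$, together with the margin mean and variance that the rest of the paper builds on; a negative margin flags a misclassification.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def functional_margins(beta, X_train, X, y, sigma=1.0):
    """Margins y * h(x) for the kernel model h(x) = sum_i beta_i * k(x_i, x)."""
    h = rbf_kernel(X, X_train, sigma) @ beta   # decision values h(x)
    return y * h                               # negative entries are misclassified

# toy usage with synthetic data and hypothetical dual coefficients
rng = np.random.default_rng(0)
X_train = rng.normal(size=(10, 3))
beta = rng.normal(size=10)
X, y = rng.normal(size=(5, 3)), rng.choice([-1.0, 1.0], size=5)
gamma = functional_margins(beta, X_train, X, y)
print("margins:", gamma)
print("margin mean:", gamma.mean(), "margin variance:", gamma.var())
```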
Given a hypothesis set $\mathcal{H}$ of functions mapping $\mathcal{X}$ to $\mathcal{Y}$ and the labeled training set $S$, our goal is to learn a function $h \in \mathcal{H}$ such that the generalization error $R(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\mathbf{1}_{\mathrm{sign}(h(x)) \neq y}]$ is small, where $\mathbf{1}(\cdot)$ is the indicator function that returns 1 when the argument holds, and 0 otherwise.

2.1 Optimal Margin Distribution Machine

It is well known that SVM employs the large margin principle to select $h$ and tries to maximize the minimum margin of the training instances, i.e., the smallest distance from the instances to the decision boundary. As a result, the solution of SVM consists of only a small number of instances, namely the support vectors (SVs), while the rest (the non-SVs) are totally ignored, which may be misleading in some situations. See Figure 1 for an illustration [Zhou, 2014].

Figure 1: A simple illustration of linear separators optimizing the minimum margin and the margin distribution, respectively. Dotted ellipses are two underlying distributions, from which circles and squares are instances sampled. The solid circle and square are mean instances (not necessarily real instances in the training data). $h_{\min}$ and $h_{\mathrm{dist}}$ are the decision hyperplanes achieved by optimizing the minimum margin and the margin distribution, respectively.

A more robust strategy is to consider all the instances, i.e., to optimize the margin distribution. To characterize the distribution, the two most straightforward statistics are the first- and second-order statistics, that is, the margin mean and variance. Moreover, a recent study [Gao and Zhou, 2013] on margin theory proved that maximizing the margin mean and minimizing the margin variance simultaneously can yield a tighter generalization bound, so we arrive at the following formulation,

$$\min_{w,\bar\gamma,\xi_i,\epsilon_i} \ \frac{1}{2}\|w\|_{\mathcal{H}}^2 - \eta\bar\gamma + \lambda\sum_{i=1}^m (\xi_i^2 + \epsilon_i^2),$$
$$\text{s.t.} \quad \gamma(x_i, y_i) \ge \bar\gamma - \xi_i, \quad \gamma(x_i, y_i) \le \bar\gamma + \epsilon_i, \quad \forall i,$$

where $\bar\gamma$ is the margin mean, $\eta$ and $\lambda$ are trading-off parameters, and $\xi_i$ and $\epsilon_i$ are the deviations of $\gamma(x_i, y_i)$ from the margin mean. It is evident that $\sum_{i=1}^m(\xi_i^2 + \epsilon_i^2)/m$ is exactly the margin variance.

First, by scaling $w$, which does not affect the final classification results, the margin mean can be fixed as 1; then the deviation of $\gamma(x_i, y_i)$ from the margin mean is $|y_i w^\top \phi(x_i) - 1|$. Second, the hyperplane $y_i w^\top \phi(x_i) = 1$ divides the feature space into two parts, and each instance, no matter which part it lies in, suffers a loss that is quadratic in the deviation. It is more reasonable to set different weights for the two kinds of deviations, because the instances lying in $\{x \mid y w^\top \phi(x) < 1\}$ are much more likely to be misclassified than the others. Third, according to the representer theorem [Schölkopf and Smola, 2001], the optimal solution is spanned only by SVs. To achieve a sparse solution, we introduce a $\theta$-insensitive loss like that of SVR, i.e., the instances whose deviation is smaller than $\theta$ are tolerated and only those whose deviation is larger than $\theta$ suffer a loss. Finally, we obtain the formulation of ODM,

$$\min_{w,\xi_i,\epsilon_i} \ \frac{1}{2}\|w\|_{\mathcal{H}}^2 + \lambda\sum_{i=1}^m \frac{\xi_i^2 + \nu\epsilon_i^2}{(1-\theta)^2}, \tag{1}$$
$$\text{s.t.} \quad y_i w^\top \phi(x_i) \ge 1 - \theta - \xi_i, \quad y_i w^\top \phi(x_i) \le 1 + \theta + \epsilon_i, \quad \forall i,$$

where $\nu$ is a parameter for trading off the two kinds of deviations, $\theta$ is a parameter for controlling the sparsity of the solution, and the $(1-\theta)^2$ in the denominator scales the second term to be a surrogate loss for the 0-1 loss.
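To illustrate the loss implied by Eq. (1), here is a minimal Python sketch (not code from the paper; the parameter values are placeholders). At the optimum the slack variables reduce to $\xi_i = \max(0,\, 1-\theta - y_i w^\top\phi(x_i))$ and $\epsilon_i = \max(0,\, y_i w^\top\phi(x_i) - 1 - \theta)$, so the per-instance penalty is a $\theta$-insensitive, asymmetric quadratic function of the margin.

```python
import numpy as np

def odm_penalty(margin, theta=0.2, nu=0.5):
    """Per-instance penalty implied by Eq. (1).

    `margin` holds y_i * w^T phi(x_i). Deviations inside the theta-insensitive
    band [1 - theta, 1 + theta] cost nothing; shortfalls below the band are
    penalized quadratically, excesses above the band are down-weighted by nu.
    """
    xi  = np.maximum(0.0, (1.0 - theta) - margin)   # shortfall below 1 - theta
    eps = np.maximum(0.0, margin - (1.0 + theta))   # excess above 1 + theta
    return (xi ** 2 + nu * eps ** 2) / (1.0 - theta) ** 2

margins = np.array([-0.5, 0.7, 1.0, 1.5, 3.0])
print(odm_penalty(margins))   # only instances outside the band are penalized
```

Only the instances falling outside the band $[1-\theta, 1+\theta]$ are penalized, and the $\nu$ factor down-weights deviations on the "safe" side, which is exactly the asymmetry discussed above.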
It can be seen that the least square loss is a special case of ODM's loss function, obtained by ignoring the asymmetry; as a consequence, the least square loss is more sensitive to outliers and is only suitable for some specific distributions. In addition, without $\theta$-insensitivity, all the instances become support vectors, which results in unnecessary computational cost. In other words, ODM is more flexible and can characterize the margin distribution more adaptively.

3 The ssODM Method

In the semi-supervised learning setting, not all the training labels are known. Let $S_L = \{(x_i, y_i)\}_{i=1}^l$ and $S_U = \{x_j\}_{j=l+1}^m$ be the sets of labeled and unlabeled instances, respectively, and let $L = \{1, \ldots, l\}$ and $U = \{l+1, \ldots, m\}$ be the index sets of the labeled and unlabeled instances. In semi-supervised learning, unlabeled data are typically much more abundant than labeled data, that is, $m - l \gg l$. Hence, one can obtain a trivially "optimal" solution with infinite margin by assigning all the unlabeled examples to the same label. To prevent such a useless solution, Joachims (1999) introduced the balance constraint

$$\frac{e^\top \hat y_U}{m - l} = \frac{e^\top y_L}{l},$$

where $\hat y = [\hat y_1, \ldots, \hat y_m]^\top$ is the vector of learned labels on both labeled and unlabeled examples, $y_L = [y_1, \ldots, y_l]^\top$, $\hat y_U = [\hat y_{l+1}, \ldots, \hat y_m]^\top$, and $e$ stands for the all-one vector.

The basic idea of ssODM is to minimize the objective function in Eq. (1) with respect to both the labeling $\hat y$ and the decision function parameters $w, \xi_i, \epsilon_i$. Hence, Eq. (1) is extended to

$$\min_{\hat y \in \mathcal{B}} \ \min_{w,\xi_i,\epsilon_i} \ \frac{1}{2}\|w\|_{\mathcal{H}}^2 + \sum_{i=1}^m \lambda_i \frac{\xi_i^2 + \nu\epsilon_i^2}{(1-\theta)^2}, \tag{2}$$
$$\text{s.t.} \quad \hat y_i w^\top \phi(x_i) \ge 1 - \theta - \xi_i, \quad \hat y_i w^\top \phi(x_i) \le 1 + \theta + \epsilon_i, \quad \forall i,$$

where

$$\mathcal{B} = \Big\{\hat y \ \Big|\ \hat y = [\hat y_L; \hat y_U],\ \hat y_L = y_L,\ \hat y_U \in \{-1, +1\}^{m-l},\ \frac{e^\top \hat y_U}{m - l} = \frac{e^\top y_L}{l}\Big\}$$

is the set of candidate label assignments, and

$$\lambda_i = \frac{\lambda_1 (m - l) - \lambda_2 l}{l(m - l)}\,\mathbf{1}_{i \in L} + \frac{\lambda_2}{m - l},$$

i.e., $\lambda_i = \lambda_1/l$ for labeled instances and $\lambda_i = \lambda_2/(m-l)$ for unlabeled ones, so that $\lambda_1$ and $\lambda_2$ trade off the empirical losses on the labeled and unlabeled data, respectively.

To avoid handling the possibly infinite-dimensional feature mapping explicitly, the inner minimization problem of Eq. (2) is usually cast in its dual form. Denote $X$ as the data matrix whose $i$-th column is $\phi(x_i)$, i.e., $X = [\phi(x_1), \ldots, \phi(x_m)]$, and introduce the dual variables $\alpha = [\alpha_1; \alpha_2] \succeq 0$ with $\alpha_1, \alpha_2 \in \mathbb{R}^m$; the Lagrangian of Eq. (2) leads to

$$\min_{\hat y \in \mathcal{B}} \ \max_{\alpha \succeq 0} \ -\frac{1}{2}\alpha^\top \begin{bmatrix} K \circ \hat y \hat y^\top & -K \circ \hat y \hat y^\top \\ -K \circ \hat y \hat y^\top & K \circ \hat y \hat y^\top \end{bmatrix} \alpha \;-\; \frac{(1-\theta)^2}{4}\,\alpha^\top \begin{bmatrix} \mathrm{diag}(e \oslash \lambda) & 0 \\ 0 & \frac{1}{\nu}\,\mathrm{diag}(e \oslash \lambda) \end{bmatrix} \alpha \;-\; \alpha^\top \begin{bmatrix} (\theta - 1)e \\ (\theta + 1)e \end{bmatrix}, \tag{3}$$

where $K = X^\top X$ is the kernel matrix, $\lambda = [\lambda_1, \ldots, \lambda_m]^\top$, and $\circ$ and $\oslash$ denote the element-wise product and division, respectively. Note that the objective is a negative definite quadratic form in $\alpha$ whose stationary point cannot lie at infinity, so we can replace the constraint $\{\alpha \mid \alpha \succeq 0\}$ by a bounded box $\mathcal{A} = \{\alpha \mid 0 \preceq \alpha \preceq \tau e\}$, where the auxiliary parameter $\tau$ is introduced for the sake of mathematical soundness. For a sufficiently large $\tau$, the new problem is equivalent to the original one.

To overcome the difficulty of this mixed-integer programming problem, many relaxations have been proposed, among which the minimax convex relaxation proposed in [Li et al., 2009b; 2013] is proven to be the tightest. So in this paper, we also employ this method to deal with the mixed-integer problem, i.e., we interchange the order of $\max_{\alpha \in \mathcal{A}}$ and $\min_{\hat y \in \mathcal{B}}$ and obtain $\max_{\alpha \in \mathcal{A}} \min_{\hat y \in \mathcal{B}} G(\alpha, \hat y)$, where $G(\alpha, \hat y)$ is the objective function of Eq. (3). This can be further transformed into

$$\max_{\alpha \in \mathcal{A},\,\delta} \ \delta \quad \text{s.t.} \quad G(\alpha, \hat y_k) \ge \delta, \ \ \forall \hat y_k \in \mathcal{B}. \tag{4}$$
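For concreteness, the following Python sketch evaluates the inner objective $G(\alpha, \hat y)$. It is not code from the paper: the constant factors follow the reconstruction of Eq. (3) given above (derived from Eq. (2)) rather than the original typesetting, and the default values of `nu` and `theta` as well as the toy data are placeholders.

```python
import numpy as np

def G(alpha, y_hat, K, lam, nu=0.5, theta=0.2):
    """Inner dual objective G(alpha, y_hat) of Eq. (3).

    alpha : concatenation [alpha1; alpha2] in R^{2m}
    y_hat : candidate labeling in {-1, +1}^m
    K     : m x m kernel matrix
    lam   : per-instance weights lambda_i (lambda_1/l or lambda_2/(m-l))
    """
    m = K.shape[0]
    a1, a2 = alpha[:m], alpha[m:]
    d = y_hat * (a1 - a2)                     # corresponds to w = X (y_hat ∘ (a1 - a2))
    quad_w = 0.5 * d @ K @ d                  # (1/2) ||w||^2
    quad_a = (1 - theta) ** 2 / 4.0 * ((a1 ** 2 / lam).sum()
                                       + (a2 ** 2 / (nu * lam)).sum())
    linear = (1 - theta) * a1.sum() - (1 + theta) * a2.sum()
    return -quad_w - quad_a + linear

# toy usage on random data (hypothetical sizes)
rng = np.random.default_rng(0)
m = 6
Z = rng.normal(size=(m, 3)); K = Z @ Z.T      # a valid (linear) kernel matrix
lam = np.full(m, 1.0)
alpha = rng.uniform(0, 0.1, size=2 * m)
y_hat = rng.choice([-1.0, 1.0], size=m)
print(G(alpha, y_hat, K, lam))
```

The gradient $\nabla_\alpha G$ needed by the stochastic optimization in Section 4 can be obtained in closed form from the same quadratic.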
For the inner maximization over $\delta$ in Eq. (4), introduce the dual variables $\mu = [\mu_1, \ldots, \mu_{|\mathcal{B}|}]^\top \succeq 0$; the Lagrangian leads to

$$\min_{\mu \succeq 0} \ \max_{\delta} \ \ \delta + \sum_{k:\hat y_k \in \mathcal{B}} \mu_k \big(G(\alpha, \hat y_k) - \delta\big).$$

By setting the partial derivative with respect to $\delta$ to zero, we obtain $\sum_{k:\hat y_k \in \mathcal{B}} \mu_k = 1$, and the dual turns into

$$\min_{\mu \in \mathcal{M}} \ \sum_{k:\hat y_k \in \mathcal{B}} \mu_k \, G(\alpha, \hat y_k), \tag{5}$$

where $\mathcal{M} = \{\mu \in \mathbb{R}_+^{|\mathcal{B}|} \mid e^\top \mu = 1\}$ is the simplex in $\mathbb{R}^{|\mathcal{B}|}$. By substituting Eq. (5) into Eq. (4) and denoting $\varphi(\mu, \alpha) = \sum_{k:\hat y_k \in \mathcal{B}} \mu_k G(\alpha, \hat y_k)$, Eq. (4) can be rewritten as $\max_{\alpha \in \mathcal{A}} \min_{\mu \in \mathcal{M}} \varphi(\mu, \alpha)$.

Note that $\varphi(\mu, \alpha)$ is a convex combination of negative definite quadratic forms, so it is convex (indeed linear) in $\mu$ and concave in $\alpha$. According to Sion's minimax theorem [Sion, 1958], there exists a saddle point $(\mu^*, \alpha^*) \in \mathcal{M} \times \mathcal{A}$ such that

$$\min_{\mu \in \mathcal{M}} \max_{\alpha \in \mathcal{A}} \varphi(\mu, \alpha) \le \max_{\alpha \in \mathcal{A}} \varphi(\mu^*, \alpha) = \varphi(\mu^*, \alpha^*) = \min_{\mu \in \mathcal{M}} \varphi(\mu, \alpha^*) \le \max_{\alpha \in \mathcal{A}} \min_{\mu \in \mathcal{M}} \varphi(\mu, \alpha). \tag{6}$$

Combining this with the minimax inequality [Kim and Boyd, 2008],

$$\max_{\alpha \in \mathcal{A}} \min_{\mu \in \mathcal{M}} \varphi(\mu, \alpha) \le \min_{\mu \in \mathcal{M}} \max_{\alpha \in \mathcal{A}} \varphi(\mu, \alpha),$$

we see that all the relations in Eq. (6) hold with equality and arrive at the final formulation of ssODM:

$$\min_{\mu \in \mathcal{M}} \max_{\alpha \in \mathcal{A}} \varphi(\mu, \alpha). \tag{7}$$

4 Optimization

In this section, we commence with a brief introduction to the minimax problem, followed by a stochastic mirror prox method to find the saddle point.

4.1 Minimax Problem

Since $\varphi(\cdot, \alpha)$ is convex and $\varphi(\mu, \cdot)$ is concave, by the first-order conditions of convexity and concavity, for any pair $(\tilde\mu, \tilde\alpha) \in \mathcal{M} \times \mathcal{A}$ we have

$$\varphi(\tilde\mu, \tilde\alpha) - \varphi(\mu, \tilde\alpha) \le \nabla_\mu \varphi(\tilde\mu, \tilde\alpha)^\top (\tilde\mu - \mu), \quad \forall \mu \in \mathcal{M},$$
$$\varphi(\tilde\mu, \alpha) - \varphi(\tilde\mu, \tilde\alpha) \le -\nabla_\alpha \varphi(\tilde\mu, \tilde\alpha)^\top (\tilde\alpha - \alpha), \quad \forall \alpha \in \mathcal{A}.$$

Adding the two inequalities together gives

$$\varphi(\tilde\mu, \alpha) - \varphi(\mu, \tilde\alpha) \le g(u)^\top (u - w), \quad \forall \mu, \alpha, \tag{8}$$

where $w = (\mu, \alpha)$, $u = (\tilde\mu, \tilde\alpha) \in \mathcal{M} \times \mathcal{A}$, and $g(u) = (\nabla_\mu \varphi(u), -\nabla_\alpha \varphi(u))$, which plays a role similar to the gradient in general convex optimization. Since Eq. (8) holds for any $\mu$ and $\alpha$, taking the maximum over $\alpha \in \mathcal{A}$ and the minimum over $\mu \in \mathcal{M}$ on the left-hand side yields in particular

$$\max_{\alpha \in \mathcal{A}} \varphi(\tilde\mu, \alpha) - \min_{\mu \in \mathcal{M}} \varphi(\mu, \tilde\alpha) \le \max_{w \in \mathcal{M} \times \mathcal{A}} g(u)^\top (u - w). \tag{9}$$

The left-hand side is referred to as the duality gap, which can be decomposed into two parts, i.e.,

$$\max_{\alpha \in \mathcal{A}} \varphi(\tilde\mu, \alpha) - \min_{\mu \in \mathcal{M}} \varphi(\mu, \tilde\alpha) = \max_{\alpha \in \mathcal{A}} \varphi(\tilde\mu, \alpha) - \varphi(\mu^*, \alpha^*) + \varphi(\mu^*, \alpha^*) - \min_{\mu \in \mathcal{M}} \varphi(\mu, \tilde\alpha)$$
$$= \underbrace{\max_{\alpha \in \mathcal{A}} \varphi(\tilde\mu, \alpha) - \min_{\mu \in \mathcal{M}} \max_{\alpha \in \mathcal{A}} \varphi(\mu, \alpha)}_{\text{primal gap}} + \underbrace{\max_{\alpha \in \mathcal{A}} \min_{\mu \in \mathcal{M}} \varphi(\mu, \alpha) - \min_{\mu \in \mathcal{M}} \varphi(\mu, \tilde\alpha)}_{\text{dual gap}}.$$

As can be seen, the primal gap and the dual gap are both nonnegative, and the closer the current point is to the saddle point, the smaller both gaps become. So the duality gap can be viewed as a measure of the closeness of the current point $(\tilde\mu, \tilde\alpha)$ to the saddle point $(\mu^*, \alpha^*)$.

4.2 Stochastic Mirror Prox Method

For ssODM, the feasible sets of $\mu$ and $\alpha$ are a simplex and a bounded box, respectively, so the most suitable optimization method is mirror descent [Beck and Teboulle, 2003], with the corresponding mirror maps $\Phi_{\mathcal{M}}(\mu) = \sum_k \mu_k \log \mu_k$ and $\Phi_{\mathcal{A}}(\alpha) = \|\alpha\|_2^2/2$, respectively. Further, noting that the objective of the inner optimization is a smooth function, mirror descent can be accelerated to the rate $O(1/t)$ by applying the mirror prox technique of [Nemirovski, 2005].

Introduce the joint mirror map $\Phi(w) = a\Phi_{\mathcal{M}}(\mu) + b\Phi_{\mathcal{A}}(\alpha)$, where $a = 1/\sqrt{\log|\mathcal{B}|}$ and $b = 2/(\tau\sqrt{m})$. It can be shown that $\nabla\Phi_{\mathcal{M}}(\mu) = \log\mu + e$, $\nabla\Phi_{\mathcal{A}}(\alpha) = \alpha$, and $\nabla\Phi(w) = (a\log\mu + ae,\ b\alpha)$. At the $t$-th iteration, we first map $w_t = (\mu_t, \alpha_t)$ into the dual space, $\nabla\Phi(w_t) = (a\log\mu_t + ae,\ b\alpha_t)$, followed by one step of stochastic gradient descent in the dual space,

$$\nabla\Phi(u_t) = \nabla\Phi(w_t) - \eta\,\tilde g(w_t) = \big(a\log\mu_t + ae - \eta\,\tilde\nabla_\mu\varphi(\mu_t, \alpha_t),\ \ b\alpha_t + \eta\,\tilde\nabla_\alpha\varphi(\mu_t, \alpha_t)\big),$$

where $\tilde\nabla_\mu\varphi$, $\tilde\nabla_\alpha\varphi$, and $\tilde g$ are noisy unbiased estimates of $\nabla_\mu\varphi$, $\nabla_\alpha\varphi$, and $g$, respectively, and $\eta$ is the step size.
Next, we map $\nabla\Phi(u_t)$ back to the primal space, i.e., we find $u_t = (\hat\mu_t, \hat\alpha_t)$ such that

$$a\log\hat\mu_t + ae = a\log\mu_t + ae - \eta\,\tilde\nabla_\mu\varphi(\mu_t, \alpha_t), \qquad b\hat\alpha_t = b\alpha_t + \eta\,\tilde\nabla_\alpha\varphi(\mu_t, \alpha_t),$$

which implies that $\hat\mu_t = \mu_t \circ \exp(-\eta\,\tilde\nabla_\mu\varphi(\mu_t, \alpha_t)/a)$ and $\hat\alpha_t = \alpha_t + \eta\,\tilde\nabla_\alpha\varphi(\mu_t, \alpha_t)/b$. Finally, we project $(\hat\mu_t, \hat\alpha_t)$ back onto $\mathcal{M} \times \mathcal{A}$ based on the Kullback-Leibler divergence and the Euclidean distance, respectively, i.e., we solve the following two optimization problems:

$$\bar\mu_{t+1} = \arg\min_{\mu \in \mathcal{M}} \ \mu^\top \log(\mu \oslash \hat\mu_t), \qquad \bar\alpha_{t+1} = \arg\min_{\alpha \in \mathcal{A}} \ \|\alpha - \hat\alpha_t\|_2^2.$$

Fortunately, both problems have a closed-form solution. The latter is the projection of $\hat\alpha_t$ onto the bounded box, so we have $\bar\alpha_{t+1} = \max\{\min\{\hat\alpha_t, \tau e\}, 0\}$. For the former, the Lagrangian function leads to $\mu^\top \log(\mu \oslash \hat\mu_t) + \zeta(e^\top\mu - 1)$, where $\zeta$ is the dual variable. Setting the partial derivative with respect to $\mu$ to zero, i.e., $\log(\mu \oslash \hat\mu_t) + e + \zeta e = 0$, we have $\bar\mu_{t+1} = \hat\mu_t \exp(-1-\zeta)$. Since $\bar\mu_{t+1}$ lies on the simplex, $1 = e^\top\bar\mu_{t+1} = e^\top\hat\mu_t\exp(-1-\zeta) = \|\hat\mu_t\|_1\exp(-1-\zeta)$, which implies that $\exp(-1-\zeta) = 1/\|\hat\mu_t\|_1$; thus $\bar\mu_{t+1} = \hat\mu_t/\|\hat\mu_t\|_1$.

Once we have $y_{t+1} = (\bar\mu_{t+1}, \bar\alpha_{t+1})$, we run the above procedure from $w_t$ again, but this time using the gradient evaluated at $y_{t+1}$ instead of at $w_t$. In words, one iteration of mirror prox consists of two mirror descent steps starting from the same point but using gradients evaluated at different points. Figure 2 illustrates one iteration of the stochastic mirror prox method.

Figure 2: Illustration of one iteration of the stochastic mirror prox method (moving between the primal space and the dual space).

The remaining question is how to construct the stochastic gradients $\tilde\nabla_\mu\varphi(\mu_t, \alpha_t)$ and $\tilde\nabla_\alpha\varphi(\mu_t, \alpha_t)$. Note that $\varphi(\mu, \alpha) = \sum_{k:\hat y_k \in \mathcal{B}} \mu_k G(\alpha, \hat y_k)$, so we have

$$\nabla_\mu\varphi(\mu_t, \alpha_t) = [G(\alpha_t, \hat y_1), \ldots, G(\alpha_t, \hat y_{|\mathcal{B}|})]^\top, \qquad \nabla_\alpha\varphi(\mu_t, \alpha_t) = [\nabla_\alpha G(\alpha_t, \hat y_1), \ldots, \nabla_\alpha G(\alpha_t, \hat y_{|\mathcal{B}|})]\,\mu_t.$$

By uniformly choosing an index $i_t$ from $\{1, 2, \ldots, |\mathcal{B}|\}$, we obtain

$$\tilde\nabla_\mu\varphi(\mu_t, \alpha_t, i_t) = [0, \ldots, |\mathcal{B}|\,G(\alpha_t, \hat y_{i_t}), \ldots, 0]^\top.$$

On the other hand, by randomly sampling an index $j_t$ according to the distribution $\mu_t$ on $\{1, 2, \ldots, |\mathcal{B}|\}$, we obtain $\tilde\nabla_\alpha\varphi(\mu_t, \alpha_t, j_t) = \nabla_\alpha G(\alpha_t, \hat y_{j_t})$. It can be shown that

$$\mathbb{E}[\tilde\nabla_\mu\varphi(\mu_t, \alpha_t, i_t) \mid \mu_t, \alpha_t] = \nabla_\mu\varphi(\mu_t, \alpha_t), \qquad \mathbb{E}[\tilde\nabla_\alpha\varphi(\mu_t, \alpha_t, j_t) \mid \mu_t, \alpha_t] = \nabla_\alpha\varphi(\mu_t, \alpha_t),$$

and $\tilde g(w_t) = (\tilde\nabla_\mu\varphi(\mu_t, \alpha_t, i_t),\ -\tilde\nabla_\alpha\varphi(\mu_t, \alpha_t, j_t))$ is an unbiased estimate of $g(w_t)$. Algorithm 1 summarizes the pseudo-code of ssODM.

Algorithm 1 Stochastic mirror prox for ssODM
1: Input: maximum iteration number $T$, ODM parameters $\lambda_1, \lambda_2, \nu, \theta$, upper bound $\tau$, stopping criterion $\iota$.
2: Initialize $\mu_0 \leftarrow [1/|\mathcal{B}|, \ldots, 1/|\mathcal{B}|]^\top$, $\alpha_0 \leftarrow 0$, $t \leftarrow 0$.
3: while $t < T$ do
4:   Uniformly select $i_t$, $i'_t$ from $\{1, 2, \ldots, |\mathcal{B}|\}$.
5:   $\tilde\nabla_\mu\varphi \leftarrow [0, \ldots, |\mathcal{B}|\,G(\alpha_t, \hat y_{i_t}), \ldots, 0]^\top$.
6:   Select $j_t$ from $\{1, 2, \ldots, |\mathcal{B}|\}$ according to $\mu_t$.
7:   $\tilde\nabla_\alpha\varphi \leftarrow \nabla_\alpha G(\alpha_t, \hat y_{j_t})$.
8:   $\hat\mu_t \leftarrow \mu_t \circ \exp(-\eta\,\tilde\nabla_\mu\varphi/a)$.
9:   $\hat\alpha_t \leftarrow \alpha_t + \eta\,\tilde\nabla_\alpha\varphi/b$.
10:  $\bar\mu_{t+1} \leftarrow \hat\mu_t/\|\hat\mu_t\|_1$.
11:  $\bar\alpha_{t+1} \leftarrow \max\{\min\{\hat\alpha_t, \tau e\}, 0\}$.
12:  $\tilde\nabla_\mu\varphi \leftarrow [0, \ldots, |\mathcal{B}|\,G(\bar\alpha_{t+1}, \hat y_{i'_t}), \ldots, 0]^\top$.
13:  Select $j'_t$ from $\{1, 2, \ldots, |\mathcal{B}|\}$ according to $\bar\mu_{t+1}$.
14:  $\tilde\nabla_\alpha\varphi \leftarrow \nabla_\alpha G(\bar\alpha_{t+1}, \hat y_{j'_t})$.
15:  $\hat\mu_{t+1} \leftarrow \mu_t \circ \exp(-\eta\,\tilde\nabla_\mu\varphi/a)$.
16:  $\hat\alpha_{t+1} \leftarrow \alpha_t + \eta\,\tilde\nabla_\alpha\varphi/b$.
17:  $\mu_{t+1} \leftarrow \hat\mu_{t+1}/\|\hat\mu_{t+1}\|_1$.
18:  $\alpha_{t+1} \leftarrow \max\{\min\{\hat\alpha_{t+1}, \tau e\}, 0\}$.
19:  $t \leftarrow t + 1$.
20:  if the duality gap is smaller than $\iota$ then
21:    Break.
22:  end if
23: end while
24: Output: $\mu$, $\alpha$.

4.3 Recovering the Label Assignment

Once the saddle point $(\mu^*, \alpha^*)$ is found, we obtain the label assignment of the unlabeled instances as $\mathrm{sign}\big(\sum_{k:\hat y_k \in \mathcal{B}} \mu^*_k \hat y_k\big)$.
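To make the update mechanics of Algorithm 1 concrete, here is a minimal Python sketch; it is not the authors' implementation. The callables `grad_mu` and `grad_alpha` stand in for the stochastic gradient oracles $\tilde\nabla_\mu\varphi$ and $\tilde\nabla_\alpha\varphi$ built from randomly sampled candidate labelings, and the step size `eta` and the weights `a`, `b`, `tau` are placeholders.

```python
import numpy as np

def half_step(mu_t, alpha_t, g_mu, g_alpha, eta, a, b, tau):
    """One mirror-descent half-step (steps 8-11 / 15-18 of Algorithm 1):
    entropic update + simplex normalization for mu, Euclidean ascent step
    + box clipping onto [0, tau] for alpha."""
    mu_hat = mu_t * np.exp(-eta * g_mu / a)      # exponentiated-gradient (descent) step
    alpha_hat = alpha_t + eta * g_alpha / b      # gradient ascent step for the concave player
    return mu_hat / mu_hat.sum(), np.clip(alpha_hat, 0.0, tau)

def mirror_prox_iteration(mu_t, alpha_t, grad_mu, grad_alpha, eta, a, b, tau):
    """One full mirror-prox iteration: two half-steps from the same point (mu_t, alpha_t),
    the second one using stochastic gradients evaluated at the intermediate point."""
    mu_bar, alpha_bar = half_step(mu_t, alpha_t,
                                  grad_mu(mu_t, alpha_t), grad_alpha(mu_t, alpha_t),
                                  eta, a, b, tau)
    return half_step(mu_t, alpha_t,
                     grad_mu(mu_bar, alpha_bar), grad_alpha(mu_bar, alpha_bar),
                     eta, a, b, tau)

# toy usage with dummy gradient oracles standing in for the G-based estimates
rng = np.random.default_rng(0)
mu0, alpha0 = np.full(4, 0.25), np.zeros(6)
g_mu = lambda mu, al: rng.normal(size=mu.size)
g_al = lambda mu, al: rng.normal(size=al.size)
mu1, alpha1 = mirror_prox_iteration(mu0, alpha0, g_mu, g_al,
                                    eta=0.1, a=1.0, b=1.0, tau=10.0)

# Label recovery (Section 4.3): with Y_hat the |B| x (m - l) matrix of candidate
# labelings for the unlabeled part and mu_star the final weights, the assignment
# would be np.sign(Y_hat.T @ mu_star).
```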
5 Empirical Study

In this section, we empirically evaluate the proposed method on twenty UCI data sets. Table 1 summarizes the statistics of these data sets. As can be seen, the number of instances ranges from 62 to 72,309 and the dimensionality ranges from 6 to 20,958, covering a broad range of properties.

5.1 Setting

For each UCI data set, 75% of the examples are randomly chosen for training and the rest for testing. We investigate the performance of each approach with varying amounts of labeled data (namely, with 5% and 10% of the training examples labeled). The whole setup is repeated 10 times, and the average accuracies as well as the standard deviations on the test set are recorded.

| ID | Data set | #Instance | #Feature |
|----|-----------------|-----------|----------|
| 1 | echocardiogram | 62 | 8 |
| 2 | house | 232 | 16 |
| 3 | heart | 270 | 9 |
| 4 | heart-statlog | 270 | 13 |
| 5 | haberman | 306 | 14 |
| 6 | live-discorders | 345 | 6 |
| 7 | ionosphere | 351 | 33 |
| 8 | vehicle | 435 | 16 |
| 9 | house-votes | 435 | 16 |
| 10 | clean1 | 476 | 166 |
| 11 | wdbc | 569 | 14 |
| 12 | isolet | 600 | 51 |
| 13 | austra | 690 | 15 |
| 14 | australian | 690 | 42 |
| 15 | diabetes | 768 | 8 |
| 16 | german | 1,000 | 59 |
| 17 | optdigits | 1,143 | 42 |
| 18 | krvskp | 3,196 | 36 |
| 19 | sick | 2,643 | 28 |
| 20 | real-sim | 72,309 | 20,958 |

Table 1: Characteristics of the experimental data sets.

We compare our method with 1) the standard SVM (using labeled data only) [Cortes and Vapnik, 1995] and four state-of-the-art S3VMs, namely 2) the transductive SVM (TSVM) [Joachims, 1999]; 3) the Laplacian SVM [Belkin et al., 2006]; 4) UniverSVM (USVM) [Collobert et al., 2006]; and 5) S4VM [Li and Zhou, 2011]. Note that TSVM and USVM adopt the same objective but with different optimization strategies (local search and the constrained concave-convex procedure, respectively), so they may converge to different local minima. For all the methods, the parameters $C$, $\lambda_1$, $\lambda_2$ are selected from $\{1, 10, 100, 1000\}$. For ssODM, $\nu$ and $\theta$ are selected from $\{0.2, 0.4, 0.6, 0.8\}$. For all data sets, both the linear and Gaussian kernels are used. In particular, the width $\sigma$ of the Gaussian kernel is picked from $\{0.25\gamma, 0.5\gamma, \gamma, 2\gamma, 4\gamma\}$, where $\gamma$ is the average distance between instances. All the experiments are repeated 10 times and the average performance with the best parameter setting is reported.

5.2 Performance

Table 2 summarizes the results on the twenty UCI data sets. As can be seen, for the two settings, i.e., 5% labeled data and 10% labeled data, ssODM achieves the best performance on 14 and 13 data sets, respectively. In addition, compared with the S3VMs, which do not consider the margin distribution, the win/tie/loss counts show that ssODM is always better than or comparable to them, and almost never worse.
| Data set | Label | SVM | TSVM | LapSVM | USVM | S4VM | ssODM |
|----------|-------|-----|------|--------|------|------|-------|
| echocardiogram | 5% | .800±.071 | .741±.082 | .644±.221 | .801±.061 | .804±.078 | .819±.011 |
| | 10% | .812±.077 | .761±.087 | .684±.201 | .821±.063 | .824±.073 | .839±.015 |
| house | 5% | .900±.041 | .903±.056 | .906±.067 | .903±.068 | .909±.073 | .917±.014 |
| | 10% | .912±.047 | .921±.057 | .918±.171 | .911±.053 | .924±.066 | .923±.018 |
| heart | 5% | .700±.080 | .752±.062 | .733±.063 | .762±.063 | .772±.061 | .783±.054 |
| | 10% | .751±.049 | .783±.047 | .756±.041 | .779±.053 | .784±.056 | .798±.016 |
| heart-statlog | 5% | .730±.010 | .762±.061 | .753±.068 | .781±.053 | .791±.061 | .789±.054 |
| | 10% | .751±.008 | .792±.062 | .793±.058 | .791±.041 | .831±.056 | .824±.045 |
| haberman | 5% | .651±.071 | .614±.053 | .577±.112 | .743±.123 | .732±.121 | .737±.141 |
| | 10% | .683±.067 | .634±.047 | .601±.098 | .794±.113 | .787±.111 | .788±.011 |
| live-discorders | 5% | .568±.051 | .555±.053 | .556±.055 | .590±.052 | .530±.071 | .588±.048 |
| | 10% | .583±.067 | .584±.047 | .601±.088 | .642±.103 | .590±.091 | .639±.008 |
| ionosphere | 5% | .678±.061 | .822±.113 | .656±.058 | .770±.064 | .701±.064 | .791±.041 |
| | 10% | .691±.057 | .861±.047 | .681±.058 | .791±.043 | .761±.054 | .831±.015 |
| vehicle | 5% | .748±.041 | .751±.083 | .773±.052 | .741±.069 | .789±.062 | .791±.041 |
| | 10% | .761±.035 | .772±.062 | .791±.046 | .760±.060 | .819±.067 | .823±.035 |
| house-votes | 5% | .888±.031 | .891±.043 | .899±.032 | .901±.053 | .912±.041 | .925±.035 |
| | 10% | .897±.026 | .899±.031 | .901±.032 | .910±.045 | .925±.035 | .929±.031 |
| clean1 | 5% | .580±.061 | .621±.074 | .641±.065 | .623±.065 | .641±.041 | .661±.045 |
| | 10% | .591±.054 | .641±.054 | .649±.055 | .634±.542 | .651±.038 | .671±.035 |
| wdbc | 5% | .813±.064 | .803±.034 | .808±.048 | .820±.061 | .821±.048 | .829±.039 |
| | 10% | .831±.054 | .823±.031 | .818±.044 | .825±.060 | .829±.043 | .835±.033 |
| isolet | 5% | .970±.029 | .976±.031 | .980±.038 | .987±.037 | .988±.048 | .991±.040 |
| | 10% | .973±.025 | .978±.030 | .985±.032 | .989±.035 | .989±.043 | .992±.045 |
| austra | 5% | .770±.059 | .820±.030 | .766±.033 | .781±.045 | .775±.041 | .813±.040 |
| | 10% | .790±.051 | .850±.032 | .796±.031 | .788±.037 | .799±.036 | .843±.034 |
| australian | 5% | .672±.081 | .681±.034 | .782±.031 | .789±.041 | .788±.042 | .799±.035 |
| | 10% | .681±.075 | .692±.038 | .791±.030 | .792±.040 | .797±.041 | .805±.039 |
| diabetes | 5% | .679±.080 | .683±.039 | .761±.069 | .771±.049 | .768±.048 | .790±.049 |
| | 10% | .703±.071 | .703±.034 | .770±.063 | .776±.040 | .771±.044 | .798±.040 |
| german | 5% | .700±.030 | .703±.010 | .708±.021 | .710±.029 | .715±.041 | .726±.045 |
| | 10% | .700±.030 | .708±.010 | .709±.016 | .713±.023 | .718±.043 | .731±.037 |
| optdigits | 5% | .922±.020 | .891±.090 | .902±.050 | .913±.069 | .918±.041 | .919±.042 |
| | 10% | .925±.023 | .894±.091 | .912±.055 | .917±.064 | .913±.040 | .921±.040 |
| krvskp | 5% | .911±.023 | .899±.091 | .921±.054 | .916±.068 | .926±.042 | .932±.040 |
| | 10% | .919±.020 | .903±.090 | .927±.050 | .925±.069 | .929±.040 | .936±.045 |
| sick | 5% | .941±.021 | .932±.090 | .939±.034 | .935±.048 | .929±.045 | .948±.048 |
| | 10% | .946±.020 | .938±.088 | .941±.033 | .939±.041 | .934±.041 | .955±.043 |
| real-sim | 5% | .901±.022 | .913±.065 | .922±.035 | .930±.043 | .924±.049 | .922±.033 |
| | 10% | .909±.020 | .919±.062 | .929±.031 | .933±.040 | .928±.043 | .941±.028 |
| ssODM: w/t/l | 5% | 18/2/0 | 18/1/1 | 20/0/0 | 18/2/0 | 13/7/0 | |
| | 10% | 18/2/0 | 18/1/1 | 20/0/0 | 18/2/0 | 11/9/0 | |

Table 2: Accuracies (mean±std) on the twenty UCI data sets with 5% and 10% labeled instances. The win/tie/loss counts of ssODM versus each compared method, based on paired t-tests at the 95% significance level, are summarized in the last two rows.

6 Conclusions

Semi-supervised support vector machines (S3VMs), which employ the large margin heuristic of support vector machines, have achieved more accurate results than other semi-supervised methods.
Recent studies disclosed that, instead of the minimum margin, it is more crucial to optimize the margin distribution for SVM-style learning algorithms. Inspired by this recognition, we propose a novel approach, ssODM, for semi-supervised learning by optimizing the margin distribution. To solve the resultant saddle point problem, we extend a stochastic mirror prox method which converges more quickly in practice than general subgradient descent for non-smooth problems. Experimental results on various data sets show that our method achieves promising performance, which further verifies the superiority of optimal margin distribution learning. In the future, we will apply importance sampling [Schmidt et al., 2015] and variance reduction techniques [Johnson and Zhang, 2013] to further accelerate our method, and extend it to other learning settings, e.g., multi-instance multi-label learning.

References

[Beck and Teboulle, 2003] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167-175, 2003.
[Belkin et al., 2006] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399-2434, 2006.
[Bennett and Demiriz, 1999] K. P. Bennett and A. Demiriz. Semi-supervised support vector machines. In Advances in Neural Information Processing Systems 11, pages 368-374, 1999.
[Chapelle and Zien, 2005] O. Chapelle and A. Zien. Semi-supervised classification by low density separation. In Proceedings of the 8th International Conference on Artificial Intelligence and Statistics, pages 57-64, 2005.
[Collobert et al., 2006] R. Collobert, F. Sinz, J. Weston, and L. Bottou. Large scale transductive SVMs. Journal of Machine Learning Research, 7:1687-1712, 2006.
[Cortes and Vapnik, 1995] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.
[Cristianini and Shawe-Taylor, 2000] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge, UK, 2000.
[Gao and Zhou, 2013] W. Gao and Z.-H. Zhou. On the doubt about margin explanation of boosting. Artificial Intelligence, 203:1-18, 2013.
[Joachims, 1999] T. Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning, pages 200-209. Morgan Kaufmann, 1999.
[Johnson and Zhang, 2013] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315-323. Curran Associates, Inc., 2013.
[Kim and Boyd, 2008] Seung-Jean Kim and Stephen Boyd. A minimax theorem with applications to machine learning, signal processing, and finance. SIAM Journal on Optimization, 19(3):1344-1367, 2008.
[Li and Zhou, 2011] Y.-F. Li and Z.-H. Zhou. Towards making unlabeled data never hurt. In Proceedings of the 28th International Conference on Machine Learning, pages 1081-1088, 2011.
[Li et al., 2009a] Y.-F. Li, James T. Kwok, and Z.-H. Zhou. Semi-supervised learning using label mean. In Proceedings of the 26th International Conference on Machine Learning, pages 633-640, 2009.
[Li et al., 2009b] Y.-F. Li, Ivor W. Tsang, James T. Kwok, and Z.-H. Zhou. Tighter and convex maximum margin clustering. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, pages 344-351, 2009.
[Li et al., 2010] Y.-F. Li, J. T. Kwok, and Z.-H. Zhou. Cost-sensitive semi-supervised support vector machine. In Proceedings of the 24th AAAI Conference on Artificial Intelligence, pages 500-505. AAAI Press, 2010.
[Li et al., 2013] Yu-Feng Li, Ivor W. Tsang, James T. Kwok, and Zhi-Hua Zhou. Convex and scalable weakly labeled SVMs. Journal of Machine Learning Research, 14(1):2151-2188, 2013.
[Nemirovski, 2005] A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. Society for Industrial and Applied Mathematics, 2005.
[Schmidt et al., 2015] Mark Schmidt, Reza Babanezhad, Mohamed Ahmed, Aaron Defazio, Ann Clifton, and Anoop Sarkar. Non-uniform stochastic average gradient method for training conditional random fields. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 819-828, 2015.
[Schölkopf and Smola, 2001] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, 2001.
[Sion, 1958] Maurice Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171-176, 1958.
[Xu et al., 2005] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In Advances in Neural Information Processing Systems 17, pages 1537-1544. MIT Press, 2005.
[Zhang and Zhou, 2014] T. Zhang and Z.-H. Zhou. Large margin distribution machine. In Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 313-322, New York, NY, 2014.
[Zhang and Zhou, 2017] T. Zhang and Z.-H. Zhou. Multi-class optimal margin distribution machine. In Proceedings of the 34th International Conference on Machine Learning, pages 4063-4071, Sydney, NSW, Australia, 2017.
[Zhang and Zhou, 2018] T. Zhang and Z.-H. Zhou. Optimal margin distribution clustering. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence. AAAI Press, 2018.
[Zhou, 2014] Z.-H. Zhou. Large margin distribution learning. In Proceedings of the 6th IAPR International Workshop on Artificial Neural Networks in Pattern Recognition, pages 1-11, Montreal, Canada, 2014.