# learning_with_singleteacher_multistudent__2da66e30.pdf

Learning with Single-Teacher Multi-Student

Shan You, Chang Xu, Chao Xu, Dacheng Tao

Key Lab. of Machine Perception (MOE), Cooperative Medianet Innovation Center, School of EECS, Peking University, China UBTECH Sydney AI Centre, SIT, FEIT, University of Sydney, Australia youshan@pku.edu.cn, c.xu@sydney.edu.au xuchao@cis.pku.edu.cn, dacheng.tao@sydney.edu.au

In this paper we study a new learning problem deﬁned as Single-Teacher Multi-Student (STMS) problem, which investigates how to learn a series of student (simple and speciﬁc) models from a single teacher (complex and universal) model. Taking the multiclass and binary classiﬁcation for example, we focus on learning multiple binary classiﬁers from a single multiclass classiﬁer, where each of binary classiﬁer is responsible for a certain class. This actually derives from some realistic problems, such as identifying the suspect based on a comprehensive face recognition system. By treating the already-trained multiclass classiﬁer as the teacher, and multiple binary classiﬁers as the students, we propose a gated support vector machine (g SVM) as a solution. A series of g SVMs are learned with the help of single teacher multiclass classiﬁer. The teacher s help is two-fold; ﬁrst, the teacher s score provides the gated values for students decision; second, the teacher can guide the students to accommodate training examples with different difﬁculty degrees. Extensive experiments on real datasets validate its effectiveness.

Introduction Multiclass classiﬁcation (MC) (Aly 2005; Bishop 2006; Nie, Wang, and Huang 2017) considers that an instance can be afﬁliated to one of multiple ( 2) classes. Many real applications can be cast into MC problem. Taking the face recognition (He et al. 2017) for example, the task is required to distinguish multiple people based on their face photos, e.g. 126 individuals in AR Face Database (Martinez 1998). Other MC applications refer to image classiﬁcation (Guo 2017), text categorization (Xu et al. 2017), information retrieval (Yang et al. 2017), multi-view classiﬁcation (Xu, Tao, and Xu 2013) et.al. To accommodate multiple classes, the multiclass classiﬁers usually adopt a score-based prediction mechanism. In detail, they can generate a score vector in terms of all classes as the output, then the class with the maximum score is regarded as the prediction. Some of the widely-used multiclass classiﬁers actually follow this mechanism, such as One-vs-All (Vapnik 2013), structural SVM (Tsochantaridis et al. 2004), rank SVM (Joachims 2002), et.al. As a result, the multiclass classiﬁers usually have larger model size than the binary classiﬁers, and more expensive inference cost as well.

Copyright c 2018, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved.

Suppose we have a multiclass classiﬁer (e.g. structural SVM) which has been already trained well. It allows us to identify a class from multiple candidates. In some real circumstances, however, the task in hand may only require a binary classiﬁcation task, i.e. identifying whether a test instance belongs to one speciﬁc class instead of what class it belongs to. For example, in news categorization a multiclass classiﬁer is implemented to recognize what category a news item is related to, such as tech news, entertainment news, economic news, et.al. However, when a user only requests for the tech news, the system only needs to classify whether the news items are tech-related or not. Another example refers to the suspect tracking in criminal detection. Usually, a face recognition system has already existed whose function is to identify a person s ID according to his/her face. This system may be capable of distinguishing millions of people. Nevertheless, for suspect tracking, we only concern whether the testing faces belong to the suspect or not. Of course, we can still use the multiclass classiﬁer to directly recognize whether the instance is a tech news item (or the suspect). If the instance is related to other news categories (or persons), we think it is no tech news (or suspect). However, this practice will always involve the inference of all the other classes; thus for a binary classiﬁcation, the inference cost is too expensive. Or just using the training data to learn the binary classiﬁers is also an intuitive method. Nevertheless, this can not take advantage of the existing multiclass classiﬁer which may provide some useful information. Therefore, it is of the need to learn a new binary classiﬁer with much faster inference speed than the multiclass classiﬁer, and higher classiﬁcation accuracy than that of simple learning with training data. In our paper, we suggest learning the binary classiﬁers by virtue of the existing multiclass classiﬁer. In fact, this can be likened vividly to the basketball team training by a single teacher (or coach) (He, Eisner, and Daume 2012; Hinton, Vinyals, and Dean 2015; You et al. 2017b; Zhu, Liu, and Lopes 2017). An experienced teacher has extensive knowledge about how to be a good forward, center and guard. What the teacher wants is to train a team and each player has his/her own speciality; some specialize in the forward while others are more suitable for centers or guards. Our problem can also be regarded as learning from a single teacher (i.e. the multiclass classiﬁer) to obtain multiple students (i.e. binary

The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)

classiﬁers), each of which is responsible for a certain class. To the best of our knowledge, this is the ﬁrst attempt to study this type of problem, which we deﬁne as Single-Teacher Multi-Student (STMS) problem. Different from our prior work (You et al. 2017b) aiming at Multi-Teacher Single-Student problem where multiple teachers have the same functions, our work highlights multiple students and they possess different functions from each other (i.e. covering different binary classes). To solve the STMS problem, we propose a gated support vector machine (g SVM) to accomplish the learning of a series of binary classiﬁers with the help of a teacher multiclass classiﬁer. Treating the teacher s score as auxiliary information, we formulate the students learning as a skiping out of the gate game where the gate is set by the teacher. The gate (i.e. teacher s score) reﬂects how strongly the teacher has conﬁdence about an instance s belonging to the target class. If a student can not skip out of the gate, then he/she has to stick to the teacher s decision, and also thinks the instance belongs to the target class. On the contrary, if the student does skip out of the gate, then he/she can reject the opinion of the teacher, and votes against belonging to the target class. During the training, we encourage the magnitude order between student s prediction value and teacher s gated value match with the ground-truth class response. By further studying the teacher s score matrix on the training examples, we introduce a difﬁculty degree on each example to measure how difﬁcult the example is for the students. This prior difﬁculty information given by the teacher is to further guide and beneﬁt the students during the learning. We embed this privileged information (Du, Xu, and Tao 2017; You et al. 2017a) by rescaling the slack variables, and formulate the whole model in an SVM fashion, which we name gated SVM. A majorization-minimization algorithm is adopted to optimize the g SVM, with an iteratively-reweighted least-squares method presented. Experimental results on extensive benchmark datasets validate the effectiveness of our gated SVM.

Problem Formulation In this section, we detail our solution to the Single-Teacher Multi-Student problem. Focusing on classiﬁcation problems, we investigate how to use a multiclass classiﬁer (teacher) to develop a variety of binary classiﬁers (students) corresponding to different classes. For simplicity, we take the structural support vector machines (SSVMs) as our teacher model for example, but our solution can be extended to other scorebased multiclass classiﬁcation approaches.

Teacher Model: SSVM Typical SVM is designed for binary classiﬁcation problems with the output being 1 or -1. In contrast, multiclass classiﬁcation problems involve multiple classes (e.g. K classes), the output space Y changes from {1, 1} to {1, 2, ..., K}. To accomodate this output space or even more complicated one, structual support vector machine (SSVM) is developed (Tsochantaridis et al. 2004; Yu and Joachims 2009; Joachims, Finley, and Yu 2009). An SSVM is parameterized as a function of a problem-speciﬁc joint feature map

ψ(x, y) : X Y RM, where X is the feature space and Y is the output space (e.g. Y = {1, 2, ..., K} for multi-class classiﬁcation). Then the SSVM can be learned from the following model with training data {(xi, yi)}n i=1:

min w,ξ 1 2 w 2 2 + C

s.t. Δ(y, yi) + w, ψ(xi, y) w, ψ(xi, yi) ξi ξi 0, y Y, i = 1, ..., n,

where , is the Euclidean inner product of two vectors or matrixes and Δ( , ) is a distance measure for two outputs. In this way, for a new instance x, its prediction is determined to be the class with highest score, i.e.

ˆy = arg max y Y w , ψ(x, y) (2)

Note that this inference needs to traverse all classes for the maximization operator over all socres.

Student Model: Gated SVM Suppose we have a well-trained multiclass SSVM model T as the teacher, the aim is to learn a series of binary classiﬁers each of which is responsible for a certain class. Given the training dataset X = [x1, ..., xn] of n examples and their ground-truth labels y Yn, the prediction scores of all examples on all classes can form a score matrix S Rn K given by the teacher model T , where (i, j)-th element Sij refers to the i-th example s prediction score on the j-th class. For SSVM, the score is calculted as Eq.(2),

Sij = w , ψ(xi, j) . (3)

The score matrix reﬂects the teacher s acceptance conﬁdence on each of classes the examples belong to. Without the loss of generality, we focus on the learning for the y-th class in Y. Then the y-th column sy of S indicates the teacher T s conﬁdence on the class y. Considering that the teacher is well-trained and predict correctly in most times, then the scores of those with positive labels are usually the maximum values and vice versa. Then to identify the class y, an intuitive idea is to present the rejection conﬁdence, namely, how conﬁdently we think an instance does not belong to this class. Suppose the rejection conﬁdence of class y on instance x is parameterized as ϕ(x; θ) with parameter θ. The predicted result ˆzy on the class y follows

ˆzy = +1, if w , ψ(x, y) ϕ(x; θ) 1, otherwise (4)

where +1 indicates the instance x belongs to class y while 1 means the otherwise. Eq.(4) serves as the decision rule; what a student has to learn is the rejection conﬁdence function ϕ(x; θ). Once the ϕ(x; θ) is learned, the inference (testing) will go as following: given an example, a student estimates the rejection conﬁdence by calculating the function ϕ, then the student will compare it to the acceptance conﬁdence given by the teacher. If the student has a larger value than that of the teacher, then he/she will reject the teacher and predict the label to be negative.

Table 1: The difﬁculty of examples reﬂected by the differences of teacher s scores. indicates that the larger the value is, the more difﬁcult the example is. means the smaller the value is, the more difﬁcult the example is.

Case Relation Prediction Difference Fixed Difference (i) zy = +1; ˆzy = +1 correct Sy S2 0 zy(Sy S2) 0 (ii) zy = +1; ˆzy = 1 incorrect Sy S1 0 zy(Sy S1) 0 (iii) zy = 1; ˆzy = +1 incorrect Sy S2 0 zy(Sy S2) 0 (iv) zy = 1; ˆzy = 1 correct Sy S1 0 zy(Sy S1) 0

This prediction process resembles the student s skiping out of the gate game where the gate is set by the teacher. The gate (i.e. teacher s score) reﬂects how strongly the teacher has conﬁdence about an instance s belonging to the target class. And this gate actually serves as a threshold for the student s judgement. This is different from the SVM models where all examples have the same hard threshold 0, and it is more adaptive than SVM since each example has own different threshold given by the teacher. Besides, in SVM scores larger than the threshold are predicted to be positive while in Eq.(4) to cohere with the teacher s maximization prediction rule, scores smaller than the gate are considered to be with positive responses (i.e. labels). In this way, the prediction is mutually determined by the cooperation of the teacher and the student. Now we investigate how to learn a good rejection conﬁdence estimator for the students. Suppose the ground-truth of all training examples on class y is zy {+1, 1}n, then if y = ˆy in Eq.(2), zy i = +1; otherwise, zy i = 1, i = 1, ..., n. Assume the rejection conﬁdence estimator is a linear function, i.e. ϕ(x; θ) = θ, φ(x) , where φ(x) Rm is the feature vector of x. For each example xi, the rejection conﬁdence should observe the inequality

zy i θ, φ(xi) zy i w , ψ(xi, y) (5)

which means that if xi belongs to class y (zy i = +1), then we should have θ, φ(xi) w , ψ(xi, y) ; otherwise, θ, φ(xi) w , ψ(xi, y) . To expect a signiﬁcant gap (or margin) between the inequalities, we push the inequalies harder a bit by introducing a constant ε,

ε + zy i θ, φ(xi) zy i w , ψ(xi, y) (6)

This order inequality is consistent with the prediction rule Eq.(4); however, Eq.(6) serves as a hard constraint and may be easily destroyed by the outliers in examples or fatal errors from teacher. To enhance the robustness, we introduce a slack variable ξi 0 to control the tolerance for prediction mistakes, i.e.

ε + zy i θ, φ(xi) zy i w , ψ(xi, y) + ξi. (7)

Our goal is to minimize the prediction mistakes for the student, then the leanring model can be formulated in SVM style:

min θ,ξ 1 2 θ 2 2 + C

s.t. ε + zy i θ, φ(xi) zy i w , ψ(xi, y) + ξi. ξi 0, i = 1, 2, ..., n.

Algorithm 1 Difﬁculty Coding with Single-Teacher

Input: Teacher model (SSVM) with classiﬁer w and joint feature operator ψ(x, y), an example x with the goundtruth response zy for class y. 1: Calculate the teacher s prediction score vector S as Eq.(3), with the maximum score and second maximum score being S1 and S2, repectively. 2: Denote the score of class y as Sy, and calculate the teacher s prediction response ˆzy. 3: Calculate the difference δ as Eq.(9). 4: Calculate the difﬁculty degree d as Eq.(11). Output: Example x s difﬁculty degree d in Eq.(10).

The ﬁrst (1/2) θ 2 2 is a regularization term; the second term is actually the ℓ1 norm of slack variable vector ξ = [ξ1, ..., ξn]T to encourage the sparisity. A sparse ξ implies that only a fraction of examples make predition mistakes. C > 0 is a constant to balance both two terms.

Going Further: Difﬁculty among Examples

In Eq.(8) for each example, the teacher s score on the target class serves as the gated information for the student s prediction. In the following, we reveal that the teacher s scores on the other classes also have beneﬁcial information for the student s learning the target class. For other non-target classes, an intuition is that if their score values are much closer to that of target class, then we think the example is more difﬁcult for the teacher to distinguish whether it belongs to the target class or not. Moreover, since the teacher is not an Oracle, he/she also makes mistakes in the prediction. Thus with the ground-truth label responses, the incorrectly predicted examples are regarded more difﬁcult than those corrected predicted for the teacher. We shall measure the difﬁculty of examples from both two aspects in the sequel, i.e. the correctness and closeness. For target class y and an example x, its ground-truth response and the prediction response of the teacher T are denoted as zy and ˆzy, respectively. Denote the score vector on all K classes as S, with the score of target class as Sy. There are totally four cases for zy and ˆzy. (i) zy = +1; ˆzy = +1. In this case, the teacher predicted correctly, with the Sy being the maximum value of S. Then the difference between Sy and the second largest score S2 reﬂects the closeness. The smaller (closer) the difference Sy S2 0 is, the more difﬁcult the example x is for the teacher. (ii) zy = +1; ˆzy = 1. The teacher predicts incorrectly.

Algorithm 2 Training for gated SVM (g SVM) with Majorization Minimization optimization

Input: Training data: for each example xi its feature vector φ(xi) and their labels y Yn. A teacher model T . Learning parameters: C > 0. 1: Calculate the teacher T s prediction ˆy and the score matrix S. 2: for each y in Y do 3: For class y, calculate the ground-truth response zy, the teacher s prediction ˆzy and the each example xi s score Sy i in the score matrix S. 4: Calculate the difﬁculty degree di of each example according to Algorithm 1. 5: Initialize the classiﬁer θ for class y 6: while not convergence do 7: Calculate the residual ϑi = zy i φT xiθ zy i Sy i for each example xi.

8: Update the classiﬁer θ C

2nΦZy D 1|Υ| 1ZyΦT ) 1ΦZy D 1|Υ| 1(μy |ϑ|)

9: end while 10: Output the binary classiﬁer θy θ for class y. 11: end for

The maximum score of S is not Sy. Then the difference between Sy and the maximum score S1 mirrors the degree of the incorrectness. The larger the absolute of the difference Sy S1 0, the more difﬁcult the example is. (iii) zy = 1; ˆzy = +1. The teacher predicts incorrectly. The maximum score predicted by the teacher should not the in the top position. A larger difference Sy S2 0 indicates a more difﬁcult example. (iv) zy = 1; ˆzy = 1. The teacher predicts correctly. The difference Sy S1 0 corresponds to the closeness. The larger its abosolute is, the more difﬁclut the example is.

All four cases have their own depiction of the difﬁculty of examples. We clarify them in Table 1. In addition, it is our consensus that incorrectly predicted examples are assumed to be more difﬁcult than those correctly predicted. To make the differences consistent with this assumption, we introduce a ﬁxed difference in Table 1. As in Table 1, the correctly predicted examples always have non-negative differences while the incorrectly predicted examples have non-positive ones. Thus the differences of incorrect examples are smaller than those of correct examples. Moreover, the smaller the differences are, the more difﬁcult the examples are supposed to be. To make the differences more compact, we rewrite them into one expression,

δ(x; T ) = Sy ˆzy min(ˆzy S1, ˆzy S2)

zy|Sy| . (9)

which is consistent with the Table 1, and the absolute of Sy in the denominator serves as a normalization constant. The difference δ is capable of reﬂecting the difﬁculty of examples for the teacher. A student can use this difﬁculty information to regularize his/her learning the target class. In particular, the slack variable ξi in Eq.(7) indicates the tolerance for the mistakes. Generally speaking, if an example is more difﬁcult, then we have more tolerance for the mistake; otherwise, we have less tolerance. Thus we can embed the difﬁculty degree into the student s learning via rescaling the slack variable,

zy i θ, φ(xi) zy i w , ψ(xi, y) + diξi. (10)

where the di is the difﬁculty degree presented by the teacher. The difﬁcult examples are expected to have large difﬁculty degree d. However, the difference δ can be positive or negative, we propose to generate the difﬁculty degree d by directly mapping the difference in the interval [0,1]. In detail, a monotonically decreasing non-negative coding function f( ) is needed, for example, a ﬁlpped clip function, i.e.

1 if δ < 1 ϵ 1

2 , if 1 δ 1 ϵ, if δ > 1 (11)

where 0 ϵ 1 is a constant controlling the decay. We show the coding function in Figure 1 and elobarate the difﬁculty coding process in Algorithm 1. Embedding the difﬁculty degree into Eq.(8), we obtain the gated SVM, i.e.

min θ,ξ 1 2 θ 2 2 + C

s.t. ε + zy i θ, φ(xi) zy i w , ψ(xi, y) + diξi. ξi 0, i = 1, 2, ..., n.

As in Eq.(12), the more difﬁcult (lager di) examples will be allowed for larger mistake errors. In this way, the student can learn more adaptively with different examples. To distinguish Eq.(12) from Eq.(8) which is with no difﬁculty coding, we call Eq.(12) g SVM and Eq.(8) g SVM-0. Remark 1. Substituing diξi with υi, the objective of

Eq.(12) will be rewritten as 1

n n i=1 d 1 i υi in term of variables θ and υ. This formulation is similar with the idea of weighted SVM (Lapin, Hein, and Schiele 2014), where each of example is assigned an importance weight d 1 i . Remark 2. Since the teacher model has k times as many classiﬁers than a binary SVM, the computation cost is linear with the class number k. Note that the teacher model can be speeded up for its binary inference via a traverse technique: as soon as any class with a larger value than that of the target class is found, one can stop traversing the classes in Eq.(2).

-1.5 -1 -0.5 0 0.5 1 1.5 Difference

difficulty degree d

Figure 1: Difﬁculty coding function from the difference δ. The blue line is the original difference values, and the red line is the corresponding difﬁculty degrees. The green mark indicates the constant ϵ in Eq.(11).

But this only works on those which are predicted to be -1 by the teacher. For those +1 predicted examples, we still have to traverse all classes to make sure the target class has the largest score. Even for the -1 predicted examples. the worst case still needs to traverse all classes. We will show this in our experiments.

Optimization In this section, we elaborate how to solve the proposed gated SVM Eq.(12). g SVM is a convex problem, which can be efﬁciently solved by many off-the-shelf toolboxes, such as mosek (Mosek 2010) and cvx (Grant, Boyd, and Ye 2008). By dint of Eq.(12), we rewrite it into an empirical risk minimization problem,

min θ 1 2 θ 2 2+ C

i=1 d 1 i [ε+zy i θ, φ(xi) zy i Sy i ]+ (13)

where Sy i = w , ψ(xi, y) and [x]+ = max(x, 0) is a rectiﬁed linear unit (Re LU) function. Eq.(13) is unconstrained; however, the loss function is not smooth and nondifferentiable. Majorization minimization (MM) algorithms substitute a simple optimization problem for a difﬁcult optimization problem (Hunter and Lange 2004; Mairal 2015; Lange 2016). In the sequel, we present how to optimize Eq.(13) with an MM algorithm. First, we throw light on the basic paradigm of MM.

Majorization Minimization (MM) MM liberates the optimization of a non-comvex or/and nonsmooth objective function by iteratively minimizing a majorization suggogate function. During the optimization, it actively constructs a majorizer M(θ; θ(k)) at the corrent iterate θ(k), which is assumed to be easy to manipulate. Suppose the objective function is g(θ). The majorizer M is supposed to satisfy two properties:

g(θ) M(θ; θ(k)), θ (14)

θ(k) = arg min θ M(θ; θ(k)) g(θ). (15)

Then the next iterate is presented by minimizing the majorizer instead of the original objective:

θ(k+1) = arg min θ M(θ; θ(k)). (16)

As a result, we can have a sequence of iterates enabling the previous objective g(θ) to keep non-increasing.

Iterative Update The difﬁculty of optimizing Eq.(13) comes from the nonsmoothness of Re LU function. In our paper, we adopt a majorizer proposed in (Nguyen and Mc Lachlan 2017), presented in Lemma 1. Lemma 1 ((Nguyen and Mc Lachlan 2017)). If v = 0, then the function [u]+ is majorized at v by

M(u; v) = 1 4|v|(u + |v|)2. (17)

Besides, every function can be majorized by itself, including

the ﬁrst term 1

2 θ 2 2 in Eq.(13). Following Lemma 1 and

letting u = ε+zy i φT xiθ zy i Sy i , v = ε+zy i φT xiθ(k) zy i Sy i 1, then the majorizer is as

M(θ; θ(k)) = C

(ε + zy i φT xiθ zy i Sy i + |ϑ(k) i |)2

4di|ϑ(k) i | + 1

(18) where ϑ(k) i = ε + zy i φT xiθ(k) zy i Sy i . Let ϑ(k) Rn = [ϑ(k) 1 , ..., ϑ(k) n ]T , zy = [zy 1, ..., zy n]T , Zy = diag(zy), Φ = [φ(x1), ..., φ(xn)] Rm n, d = [d1, ..., dn]T , D = diag(d), Υ(k) = diag(ϑ(k)) and μy Rn with each element μy i = zy i Sy i . Note that we usually add a small positive number (e.g. 10 6) to the |ϑ(k) i | in the denominator in Eq.(18) for the numerical stability. Then Eq.(18) can be rewritten as

M(θ; θ(k)) = 1

2 θ 2 2 + C

4n(η(k))T D 1|Υ(k)| 1η(k)

with η(k)(θ) = ε + ZyΦT θ μy + |ϑ(k)| (20) where | | is element-wise absolute and ε is a vector with elements being all ε. Eq.(19) is a positive quadratic form. To minimize the M(θ; θ(k)), we zero the derivative of θ,

2nΦZy D 1|Υ(k)| 1(ε + ZyΦT θ μy + |ϑ(k)|)

= A(k)θ b(k) = 0 (21)

A(k) = I + C

2nΦZy D 1|Υ(k)| 1ZyΦT (22)

2nΦD 1|Υ(k)| 1ΦT (23)

2nΦZy D 1|Υ(k)| 1(μy |ϑ(k)| ε). (24)

1We use φxi to take place of φ(xi) for briefness.

Table 2: Data statistics. #train is the number of training examples while #test is that of test ones. #features and #classes are the number of features and classes, respectively.

Dataset #train #test #features #classes letter 15000 5000 16 26 protein 17766 6621 357 3 satimage 4435 2000 36 6 shuttle 43500 14500 9 7 vowel 528 462 10 11 yeast 1185 299 8 10

class1 class2 class3 class4 class5 class6

class1 class2 class3 class4 class5 class6 0.4

class1 class2 class3 class4 class5 class6 0.94

teacher g SVM g SVM-0 SVM

Figure 2: Performance of binary classiﬁers with respect to each class on satimage dataset. Note that for the visual clearity, some results of SVM s AUC are not shown since they are smaller than 0.94.

Due to A(k) s being positive deﬁnite, the next iterate θ(k+1) can be easily obtained by

θ(k+1) = (A(k)) 1b(k) (25)

The algorithm outline is presented in Algorithm 2. In addition, the majorizer Eq.(19) is also a least-squares with varying weight, and Algorithm 2 can also be regarded as an iterativelyreweighted least-squares (IRLS) algorithm (Navia-Vazquez et al. 2001). (Nguyen and Mc Lachlan 2017) demonstrated it is globally convergent to the minimum.

Experimental Results In this part, we experimentally investigate the effectiveness and efﬁciency of the proposed gated SVM in dealing with the STMS problem.

Conﬁguration We ﬁrst illustrate our experimental conﬁguration, including some preliminaries and the speciﬁc setting. Datasets. We used 6 benchmark multi-class datasets in our experiment, including letter (Hsu and Lin 2002), protein (Wang 2002), satimage (Hsu and Lin 2002), shuttle (Hsu and Lin 2002), vowel and yeast.2 The various statistics of all datasets are presented in Tabel 2.

2Datasets can be downloaded in https://www.csie.ntu.edu.tw/ cjlin/libsvmtools/datasets/multiclass.html

0 0.5 1 False positive rate

True positive rate

teacher g SVM g SVM-0 SVM

0 0.5 1 False positive rate

True positive rate

teacher g SVM g SVM-0 SVM

Figure 3: ROC curves for two classes on satimage dataset.

satimage (6 classes) vowel (11 classes) letter (26 classes) 0

relative time

teacher teacher-speed up g SVM SVM

Figure 4: Relative inference time to the teacher model, i.e. the speciﬁc time devided by the time of the teacher model. We do not show the time of g SVM-0 since it is the same with g SVM in inference stage.

Comparison methods. Since we ﬁrst deal with the STMS problem, there are few comparison methods, but the following can serve as solutions. 1) Just using the teacher model (denoted as teacher ). For the target class, we can directly use the teacher s prediction results; if other classes are predicted by the teacher, then we think the examples do not belong to the target class. 2) Directly training a binary classiﬁer for the target class. In our experiment, we use the widely-used SVM method ( denoted as SVM ). 3)To verify the effectiveness of the proposed difﬁculty coding from the teacher, we chose g SVM-0 (Eq.(8)) as a natural comparison method. Evaluation metrics. We select four evaluation metrics to comprehensively assess the classiﬁcation performance of the learned binary classiﬁers, including accuracy, F-measure, receiver operating characteristic curve (ROC curve) and area under the curve (AUC). Teacher model. SSVM is used for obtaining the teacher models. In particular, we use the joint feature map ψ(x, y) to be (I(y = 1)φ(x), ..., I(y = K)φ(x))T , where I( ) is an indicator function, and the distance measure is Hamming distance, i.e. Δ(y, yi) = I(y = yi).3

Experimental setting. For each dataset, with the trained teacher model (i.e. a multiclass classiﬁer), we train a binary classiﬁer for each of its class. The parameters C, ε and ϵ are tuned by 10-fold cross validation in

3Actually, this formulation also coheres with the form of Crammer-Singer multiclass SVM (Crammer and Singer 2001).

Table 3: The mean and median of classiﬁcation performance of all binary classiﬁers in term of all classes for each dataset.

mean median dataset method accuracy F-measure AUC accuracy F-measure AUC

teacher 99.72% 96.35% 99.95% 99.73% 96.43% 99.97% SVM 97.35% 43.47% 90.89% 96.90% 44.37% 93.40% g SVM-0 98.41% 81.75% 98.64% 98.47% 80.19% 98.70% g SVM 98.88% 82.26% 99.72% 99.16% 87.50% 99.81%

teacher 80.50% 67.55% 84.91% 81.03% 71.86% 84.34% SVM 78.31% 60.86% 81.41% 77.57% 64.26% 82.34% g SVM-0 78.67% 63.23% 82.54% 78.15% 63.79% 82.88% g SVM 81.01% 69.86% 85.03% 81.06% 68.47% 85.23%

teacher 97.05% 89.80% 98.76% 97.35% 92.13% 99.34% SVM 94.00% 74.28% 91.46% 93.65% 81.31% 96.81% g SVM-0 94.33% 81.95% 98.54% 94.12% 83.56% 98.95% g SVM 96.28% 86.76% 98.55% 96.55% 89.29% 99.08%

teacher 99.95% 82.20% 98.81% 99.96% 85.71% 98.99% SVM 97.47% 42.83% 82.61% 98.91% 43.33% 88.01% g SVM-0 98.13% 77.54% 94.86% 98.34% 73.66% 95.02% g SVM 99.57% 80.42% 99.28% 99.61% 83.17% 99.85%

teacher 91.93% 53.69% 92.12% 93.29% 55.38% 93.10% SVM 89.95% 22.01% 80.00% 89.91% 24.55% 85.11% g SVM-0 86.19% 43.03% 91.05% 85.50% 42.00% 90.04% g SVM 90.93% 52.49% 91.96% 92.21% 51.72% 91.92%

teacher 92.04% 63.59% 85.52% 96.49% 64.12% 83.28% SVM 90.47% 35.37% 81.86% 95.82% 36.67% 81.37% g SVM-0 91.44% 49.71% 83.00% 95.99% 53.08% 84.36% g SVM 91.71% 51.67% 83.10% 96.55% 55.14% 85.40%

the sets {10 3, ..., 100, ..., 103}, {0, 0.1, 0.2, ..., 1.5} and {0.1, 0.2, ..., 0.5}, respectively.

Results and Analysis

In Figures 2 and 3, we show the performance of accuracy, F-measure, AUC and ROC curve on dataset satimage for example. Since we trained binary classiﬁers for all classes in terms of each dataset, to comprehensively investigate the performance gap among all competing methods, we calculate the mean and median of all classiﬁers on each dataset and evaluation metric (Van Hasselt, Guez, and Silver 2016). All obtained results are summarized in Tabel 3. First, for verifying the effectiveness of the proposed difﬁculty coding, we can see that g SVM is better than g SVM-0 in most classes in Figure 2, and for the cases in Table 3, g SVM wins g SVM-0 in all cases. This implies that the proposed difﬁculty coding does play a positive role in enhancing the performance of classiﬁers. Second, it can be seen that the proposed g SVM is comparable to the teacher model in all classes in Figure 2 and achieves near results with the teacher in Tabel 3. Besides, their ROC curves are very close in Figure 3. The slight superiority of teacher model to a binary SVM in a binary inference may result from the advantages of a well-designed multiclass classiﬁer over the simple One-vs All strategy. However, since the teacher model has the much larger model size than g SVM, and its inference stage needs to involve all classes, its inference time would be much more than that of g SVM. We report the relative inference time to the teacher on three datasets in Figure 4. We can see that g SVM and SVM have the considerably less time than that

of teacher model (even with the speed-up technique). As the number of classes increases, the gap could be much larger. That implies that teacher model is not appropriate for the binary prediction when the number of classes is large. Third, g SVM and SVM have the competing inference time according to Figure 4, however, SVM s classiﬁcation performance is inferior to that of g SVM in Figures 2, Figure 3 and Table 3. Thus g SVM enjoys the superiority to SVM in classiﬁcation performance. To sum up, from empirical results g SVM has the comparable classiﬁcation performance with the teacher model, but much faster inference speed than the teacher model. As for SVM, it has the worse classiﬁcation performance than g SVM, though it has similar efﬁciency with g SVM.

In this paper, we illustrate a new problem that uses a pretrained teacher model to learn a series of student models. We deﬁne this problem as Single-Teacher Multi-Student (STMS) problem. By taking the multi-class and binary classiﬁcation as an example, we investigate how to use a welltrained multiclass classiﬁer to learn a series of binary classiﬁers with respect to each of its class. To solve this problem, we propose a gated SVM as the model for learning binary classiﬁers. In gated SVM, the students prediction is determined by both the teacher and themselves. Moreover, we use the teacher s score output to construct different difﬁculty degrees for various training examples, so the students training is guided adaptively by the teacher. Experimental results show the effectiveness of our method and validate our theory.

Acknowledgments We thank the constructive comments of all reviewers and the support from the NSFC 61375026 and 2015BAF15B00, and ARC Projects FL-170100117, DE-180101438, DP180103424, DP-140102164 and LP-150100671.

References Aly, M. 2005. Survey on multiclass classiﬁcation methods. Neural Netw 19. Bishop, C. M. 2006. Pattern recognition and machine learning. springer. Crammer, K., and Singer, Y. 2001. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of machine learning research 2(Dec):265 292. Du, Y.; Xu, C.; and Tao, D. 2017. Privileged matrix factorization for collaborative ﬁltering. In Proceedings of the Twenty-Sixth International Joint Conference on Artiﬁcial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, 1610 1616. Grant, M.; Boyd, S.; and Ye, Y. 2008. Cvx: Matlab software for disciplined convex programming. Guo, Y. 2017. Problems in large-scale image classiﬁcation. In AAAI, 5038 5039. He, R.; Wu, X.; Sun, Z.; and Tan, T. 2017. Learning invariant deep representation for nir-vis face recognition. In AAAI, 2000 2006. He, H.; Eisner, J.; and Daume, H. 2012. Imitation learning by coaching. In Advances in Neural Information Processing Systems, 3149 3157. Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. ar Xiv preprint ar Xiv:1503.02531. Hsu, C.-W., and Lin, C.-J. 2002. A comparison of methods for multiclass support vector machines. IEEE transactions on Neural Networks 13(2):415 425. Hunter, D. R., and Lange, K. 2004. A tutorial on mm algorithms. The American Statistician 58(1):30 37. Joachims, T.; Finley, T.; and Yu, C.-N. J. 2009. Cutting-plane training of structural svms. Machine Learning 77(1):27 59. Joachims, T. 2002. Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 133 142. ACM. Lange, K. 2016. MM Optimization Algorithms. SIAM. Lapin, M.; Hein, M.; and Schiele, B. 2014. Learning using privileged information: Svm+ and weighted svm. Neural Networks 53:95 108. Mairal, J. 2015. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization 25(2):829 855. Martinez, A. M. 1998. The ar face database. CVC technical report. Mosek, A. 2010. The mosek optimization software. Online at http://www. mosek. com 54.

Navia-Vazquez, A.; Perez-Cruz, F.; Artes-Rodriguez, A.; and Figueiras-Vidal, A. R. 2001. Weighted least squares training of support vector classiﬁers leading to compact and adaptive schemes. IEEE Transactions on Neural Networks 12(5): 1047 1059. Nguyen, H. D., and Mc Lachlan, G. J. 2017. Iterativelyreweighted least-squares ﬁtting of support vector machines: A majorization minimization algorithm approach. ar Xiv preprint ar Xiv:1705.04651. Nie, F.; Wang, X.; and Huang, H. 2017. Multiclass capped lp-norm svm for robust classiﬁcations. Tsochantaridis, I.; Hofmann, T.; Joachims, T.; and Altun, Y. 2004. Support vector machine learning for interdependent and structured output spaces. In Proceedings of the twenty-ﬁrst international conference on Machine learning, 104. ACM. Van Hasselt, H.; Guez, A.; and Silver, D. 2016. Deep reinforcement learning with double q-learning. In AAAI, 2094 2100. Vapnik, V. 2013. The nature of statistical learning theory. Springer science & business media. Wang, J.-Y. 2002. Application of support vector machines in bioinformatics. Taipei: Department of Computer Science and Information Engineering, National Taiwan University. Xu, W.; Sun, H.; Deng, C.; and Tan, Y. 2017. Variational autoencoder for semi-supervised text classiﬁcation. In AAAI, 3358 3364. Xu, C.; Tao, D.; and Xu, C. 2013. A survey on multi-view learning. ar Xiv preprint ar Xiv:1304.5634. Yang, E.; Deng, C.; Liu, W.; Liu, X.; Tao, D.; and Gao, X. 2017. Pairwise relationship guided deep hashing for crossmodal retrieval. In AAAI, 1618 1625. You, S.; Xu, C.; Wang, Y.; Xu, C.; and Tao, D. 2017a. Privileged multi-label learning. In Proceedings of the Twenty Sixth International Joint Conference on Artiﬁcial Intelligence, 3336 3342. IJCAI 2017, Melbourne, Australia, August 1925, 2017 You, S.; Xu, C.; Xu, C.; and Tao, D. 2017b. Learning from multiple teacher networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1285 1294. ACM. Yu, C.-N. J., and Joachims, T. 2009. Learning structural svms with latent variables. In Proceedings of the 26th annual international conference on machine learning, 1169 1176. ACM. Zhu, X.; Liu, J.; and Lopes, M. 2017. No learner left behind: On the complexity of teaching multiple learners simultaneously. In Proceedings of the Twenty-Sixth International Joint Conference on Artiﬁcial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, 3588 3594.