# learning_with_singleteacher_multistudent__2da66e30.pdf Learning with Single-Teacher Multi-Student Shan You, Chang Xu, Chao Xu, Dacheng Tao Key Lab. of Machine Perception (MOE), Cooperative Medianet Innovation Center, School of EECS, Peking University, China UBTECH Sydney AI Centre, SIT, FEIT, University of Sydney, Australia youshan@pku.edu.cn, c.xu@sydney.edu.au xuchao@cis.pku.edu.cn, dacheng.tao@sydney.edu.au In this paper we study a new learning problem defined as Single-Teacher Multi-Student (STMS) problem, which investigates how to learn a series of student (simple and specific) models from a single teacher (complex and universal) model. Taking the multiclass and binary classification for example, we focus on learning multiple binary classifiers from a single multiclass classifier, where each of binary classifier is responsible for a certain class. This actually derives from some realistic problems, such as identifying the suspect based on a comprehensive face recognition system. By treating the already-trained multiclass classifier as the teacher, and multiple binary classifiers as the students, we propose a gated support vector machine (g SVM) as a solution. A series of g SVMs are learned with the help of single teacher multiclass classifier. The teacher s help is two-fold; first, the teacher s score provides the gated values for students decision; second, the teacher can guide the students to accommodate training examples with different difficulty degrees. Extensive experiments on real datasets validate its effectiveness. Introduction Multiclass classification (MC) (Aly 2005; Bishop 2006; Nie, Wang, and Huang 2017) considers that an instance can be affiliated to one of multiple ( 2) classes. Many real applications can be cast into MC problem. Taking the face recognition (He et al. 2017) for example, the task is required to distinguish multiple people based on their face photos, e.g. 126 individuals in AR Face Database (Martinez 1998). Other MC applications refer to image classification (Guo 2017), text categorization (Xu et al. 2017), information retrieval (Yang et al. 2017), multi-view classification (Xu, Tao, and Xu 2013) et.al. To accommodate multiple classes, the multiclass classifiers usually adopt a score-based prediction mechanism. In detail, they can generate a score vector in terms of all classes as the output, then the class with the maximum score is regarded as the prediction. Some of the widely-used multiclass classifiers actually follow this mechanism, such as One-vs-All (Vapnik 2013), structural SVM (Tsochantaridis et al. 2004), rank SVM (Joachims 2002), et.al. As a result, the multiclass classifiers usually have larger model size than the binary classifiers, and more expensive inference cost as well. Copyright c 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. Suppose we have a multiclass classifier (e.g. structural SVM) which has been already trained well. It allows us to identify a class from multiple candidates. In some real circumstances, however, the task in hand may only require a binary classification task, i.e. identifying whether a test instance belongs to one specific class instead of what class it belongs to. For example, in news categorization a multiclass classifier is implemented to recognize what category a news item is related to, such as tech news, entertainment news, economic news, et.al. However, when a user only requests for the tech news, the system only needs to classify whether the news items are tech-related or not. Another example refers to the suspect tracking in criminal detection. Usually, a face recognition system has already existed whose function is to identify a person s ID according to his/her face. This system may be capable of distinguishing millions of people. Nevertheless, for suspect tracking, we only concern whether the testing faces belong to the suspect or not. Of course, we can still use the multiclass classifier to directly recognize whether the instance is a tech news item (or the suspect). If the instance is related to other news categories (or persons), we think it is no tech news (or suspect). However, this practice will always involve the inference of all the other classes; thus for a binary classification, the inference cost is too expensive. Or just using the training data to learn the binary classifiers is also an intuitive method. Nevertheless, this can not take advantage of the existing multiclass classifier which may provide some useful information. Therefore, it is of the need to learn a new binary classifier with much faster inference speed than the multiclass classifier, and higher classification accuracy than that of simple learning with training data. In our paper, we suggest learning the binary classifiers by virtue of the existing multiclass classifier. In fact, this can be likened vividly to the basketball team training by a single teacher (or coach) (He, Eisner, and Daume 2012; Hinton, Vinyals, and Dean 2015; You et al. 2017b; Zhu, Liu, and Lopes 2017). An experienced teacher has extensive knowledge about how to be a good forward, center and guard. What the teacher wants is to train a team and each player has his/her own speciality; some specialize in the forward while others are more suitable for centers or guards. Our problem can also be regarded as learning from a single teacher (i.e. the multiclass classifier) to obtain multiple students (i.e. binary The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18) classifiers), each of which is responsible for a certain class. To the best of our knowledge, this is the first attempt to study this type of problem, which we define as Single-Teacher Multi-Student (STMS) problem. Different from our prior work (You et al. 2017b) aiming at Multi-Teacher Single-Student problem where multiple teachers have the same functions, our work highlights multiple students and they possess different functions from each other (i.e. covering different binary classes). To solve the STMS problem, we propose a gated support vector machine (g SVM) to accomplish the learning of a series of binary classifiers with the help of a teacher multiclass classifier. Treating the teacher s score as auxiliary information, we formulate the students learning as a skiping out of the gate game where the gate is set by the teacher. The gate (i.e. teacher s score) reflects how strongly the teacher has confidence about an instance s belonging to the target class. If a student can not skip out of the gate, then he/she has to stick to the teacher s decision, and also thinks the instance belongs to the target class. On the contrary, if the student does skip out of the gate, then he/she can reject the opinion of the teacher, and votes against belonging to the target class. During the training, we encourage the magnitude order between student s prediction value and teacher s gated value match with the ground-truth class response. By further studying the teacher s score matrix on the training examples, we introduce a difficulty degree on each example to measure how difficult the example is for the students. This prior difficulty information given by the teacher is to further guide and benefit the students during the learning. We embed this privileged information (Du, Xu, and Tao 2017; You et al. 2017a) by rescaling the slack variables, and formulate the whole model in an SVM fashion, which we name gated SVM. A majorization-minimization algorithm is adopted to optimize the g SVM, with an iteratively-reweighted least-squares method presented. Experimental results on extensive benchmark datasets validate the effectiveness of our gated SVM. Problem Formulation In this section, we detail our solution to the Single-Teacher Multi-Student problem. Focusing on classification problems, we investigate how to use a multiclass classifier (teacher) to develop a variety of binary classifiers (students) corresponding to different classes. For simplicity, we take the structural support vector machines (SSVMs) as our teacher model for example, but our solution can be extended to other scorebased multiclass classification approaches. Teacher Model: SSVM Typical SVM is designed for binary classification problems with the output being 1 or -1. In contrast, multiclass classification problems involve multiple classes (e.g. K classes), the output space Y changes from {1, 1} to {1, 2, ..., K}. To accomodate this output space or even more complicated one, structual support vector machine (SSVM) is developed (Tsochantaridis et al. 2004; Yu and Joachims 2009; Joachims, Finley, and Yu 2009). An SSVM is parameterized as a function of a problem-specific joint feature map ψ(x, y) : X Y RM, where X is the feature space and Y is the output space (e.g. Y = {1, 2, ..., K} for multi-class classification). Then the SSVM can be learned from the following model with training data {(xi, yi)}n i=1: min w,ξ 1 2 w 2 2 + C s.t. Δ(y, yi) + w, ψ(xi, y) w, ψ(xi, yi) ξi ξi 0, y Y, i = 1, ..., n, where , is the Euclidean inner product of two vectors or matrixes and Δ( , ) is a distance measure for two outputs. In this way, for a new instance x, its prediction is determined to be the class with highest score, i.e. ˆy = arg max y Y w , ψ(x, y) (2) Note that this inference needs to traverse all classes for the maximization operator over all socres. Student Model: Gated SVM Suppose we have a well-trained multiclass SSVM model T as the teacher, the aim is to learn a series of binary classifiers each of which is responsible for a certain class. Given the training dataset X = [x1, ..., xn] of n examples and their ground-truth labels y Yn, the prediction scores of all examples on all classes can form a score matrix S Rn K given by the teacher model T , where (i, j)-th element Sij refers to the i-th example s prediction score on the j-th class. For SSVM, the score is calculted as Eq.(2), Sij = w , ψ(xi, j) . (3) The score matrix reflects the teacher s acceptance confidence on each of classes the examples belong to. Without the loss of generality, we focus on the learning for the y-th class in Y. Then the y-th column sy of S indicates the teacher T s confidence on the class y. Considering that the teacher is well-trained and predict correctly in most times, then the scores of those with positive labels are usually the maximum values and vice versa. Then to identify the class y, an intuitive idea is to present the rejection confidence, namely, how confidently we think an instance does not belong to this class. Suppose the rejection confidence of class y on instance x is parameterized as ϕ(x; θ) with parameter θ. The predicted result ˆzy on the class y follows ˆzy = +1, if w , ψ(x, y) ϕ(x; θ) 1, otherwise (4) where +1 indicates the instance x belongs to class y while 1 means the otherwise. Eq.(4) serves as the decision rule; what a student has to learn is the rejection confidence function ϕ(x; θ). Once the ϕ(x; θ) is learned, the inference (testing) will go as following: given an example, a student estimates the rejection confidence by calculating the function ϕ, then the student will compare it to the acceptance confidence given by the teacher. If the student has a larger value than that of the teacher, then he/she will reject the teacher and predict the label to be negative. Table 1: The difficulty of examples reflected by the differences of teacher s scores. indicates that the larger the value is, the more difficult the example is. means the smaller the value is, the more difficult the example is. Case Relation Prediction Difference Fixed Difference (i) zy = +1; ˆzy = +1 correct Sy S2 0 zy(Sy S2) 0 (ii) zy = +1; ˆzy = 1 incorrect Sy S1 0 zy(Sy S1) 0 (iii) zy = 1; ˆzy = +1 incorrect Sy S2 0 zy(Sy S2) 0 (iv) zy = 1; ˆzy = 1 correct Sy S1 0 zy(Sy S1) 0 This prediction process resembles the student s skiping out of the gate game where the gate is set by the teacher. The gate (i.e. teacher s score) reflects how strongly the teacher has confidence about an instance s belonging to the target class. And this gate actually serves as a threshold for the student s judgement. This is different from the SVM models where all examples have the same hard threshold 0, and it is more adaptive than SVM since each example has own different threshold given by the teacher. Besides, in SVM scores larger than the threshold are predicted to be positive while in Eq.(4) to cohere with the teacher s maximization prediction rule, scores smaller than the gate are considered to be with positive responses (i.e. labels). In this way, the prediction is mutually determined by the cooperation of the teacher and the student. Now we investigate how to learn a good rejection confidence estimator for the students. Suppose the ground-truth of all training examples on class y is zy {+1, 1}n, then if y = ˆy in Eq.(2), zy i = +1; otherwise, zy i = 1, i = 1, ..., n. Assume the rejection confidence estimator is a linear function, i.e. ϕ(x; θ) = θ, φ(x) , where φ(x) Rm is the feature vector of x. For each example xi, the rejection confidence should observe the inequality zy i θ, φ(xi) zy i w , ψ(xi, y) (5) which means that if xi belongs to class y (zy i = +1), then we should have θ, φ(xi) w , ψ(xi, y) ; otherwise, θ, φ(xi) w , ψ(xi, y) . To expect a significant gap (or margin) between the inequalities, we push the inequalies harder a bit by introducing a constant ε, ε + zy i θ, φ(xi) zy i w , ψ(xi, y) (6) This order inequality is consistent with the prediction rule Eq.(4); however, Eq.(6) serves as a hard constraint and may be easily destroyed by the outliers in examples or fatal errors from teacher. To enhance the robustness, we introduce a slack variable ξi 0 to control the tolerance for prediction mistakes, i.e. ε + zy i θ, φ(xi) zy i w , ψ(xi, y) + ξi. (7) Our goal is to minimize the prediction mistakes for the student, then the leanring model can be formulated in SVM style: min θ,ξ 1 2 θ 2 2 + C s.t. ε + zy i θ, φ(xi) zy i w , ψ(xi, y) + ξi. ξi 0, i = 1, 2, ..., n. Algorithm 1 Difficulty Coding with Single-Teacher Input: Teacher model (SSVM) with classifier w and joint feature operator ψ(x, y), an example x with the goundtruth response zy for class y. 1: Calculate the teacher s prediction score vector S as Eq.(3), with the maximum score and second maximum score being S1 and S2, repectively. 2: Denote the score of class y as Sy, and calculate the teacher s prediction response ˆzy. 3: Calculate the difference δ as Eq.(9). 4: Calculate the difficulty degree d as Eq.(11). Output: Example x s difficulty degree d in Eq.(10). The first (1/2) θ 2 2 is a regularization term; the second term is actually the ℓ1 norm of slack variable vector ξ = [ξ1, ..., ξn]T to encourage the sparisity. A sparse ξ implies that only a fraction of examples make predition mistakes. C > 0 is a constant to balance both two terms. Going Further: Difficulty among Examples In Eq.(8) for each example, the teacher s score on the target class serves as the gated information for the student s prediction. In the following, we reveal that the teacher s scores on the other classes also have beneficial information for the student s learning the target class. For other non-target classes, an intuition is that if their score values are much closer to that of target class, then we think the example is more difficult for the teacher to distinguish whether it belongs to the target class or not. Moreover, since the teacher is not an Oracle, he/she also makes mistakes in the prediction. Thus with the ground-truth label responses, the incorrectly predicted examples are regarded more difficult than those corrected predicted for the teacher. We shall measure the difficulty of examples from both two aspects in the sequel, i.e. the correctness and closeness. For target class y and an example x, its ground-truth response and the prediction response of the teacher T are denoted as zy and ˆzy, respectively. Denote the score vector on all K classes as S, with the score of target class as Sy. There are totally four cases for zy and ˆzy. (i) zy = +1; ˆzy = +1. In this case, the teacher predicted correctly, with the Sy being the maximum value of S. Then the difference between Sy and the second largest score S2 reflects the closeness. The smaller (closer) the difference Sy S2 0 is, the more difficult the example x is for the teacher. (ii) zy = +1; ˆzy = 1. The teacher predicts incorrectly. Algorithm 2 Training for gated SVM (g SVM) with Majorization Minimization optimization Input: Training data: for each example xi its feature vector φ(xi) and their labels y Yn. A teacher model T . Learning parameters: C > 0. 1: Calculate the teacher T s prediction ˆy and the score matrix S. 2: for each y in Y do 3: For class y, calculate the ground-truth response zy, the teacher s prediction ˆzy and the each example xi s score Sy i in the score matrix S. 4: Calculate the difficulty degree di of each example according to Algorithm 1. 5: Initialize the classifier θ for class y 6: while not convergence do 7: Calculate the residual ϑi = zy i φT xiθ zy i Sy i for each example xi. 8: Update the classifier θ C 2nΦZy D 1|Υ| 1ZyΦT ) 1ΦZy D 1|Υ| 1(μy |ϑ|) 9: end while 10: Output the binary classifier θy θ for class y. 11: end for The maximum score of S is not Sy. Then the difference between Sy and the maximum score S1 mirrors the degree of the incorrectness. The larger the absolute of the difference Sy S1 0, the more difficult the example is. (iii) zy = 1; ˆzy = +1. The teacher predicts incorrectly. The maximum score predicted by the teacher should not the in the top position. A larger difference Sy S2 0 indicates a more difficult example. (iv) zy = 1; ˆzy = 1. The teacher predicts correctly. The difference Sy S1 0 corresponds to the closeness. The larger its abosolute is, the more difficlut the example is. All four cases have their own depiction of the difficulty of examples. We clarify them in Table 1. In addition, it is our consensus that incorrectly predicted examples are assumed to be more difficult than those correctly predicted. To make the differences consistent with this assumption, we introduce a fixed difference in Table 1. As in Table 1, the correctly predicted examples always have non-negative differences while the incorrectly predicted examples have non-positive ones. Thus the differences of incorrect examples are smaller than those of correct examples. Moreover, the smaller the differences are, the more difficult the examples are supposed to be. To make the differences more compact, we rewrite them into one expression, δ(x; T ) = Sy ˆzy min(ˆzy S1, ˆzy S2) zy|Sy| . (9) which is consistent with the Table 1, and the absolute of Sy in the denominator serves as a normalization constant. The difference δ is capable of reflecting the difficulty of examples for the teacher. A student can use this difficulty information to regularize his/her learning the target class. In particular, the slack variable ξi in Eq.(7) indicates the tolerance for the mistakes. Generally speaking, if an example is more difficult, then we have more tolerance for the mistake; otherwise, we have less tolerance. Thus we can embed the difficulty degree into the student s learning via rescaling the slack variable, zy i θ, φ(xi) zy i w , ψ(xi, y) + diξi. (10) where the di is the difficulty degree presented by the teacher. The difficult examples are expected to have large difficulty degree d. However, the difference δ can be positive or negative, we propose to generate the difficulty degree d by directly mapping the difference in the interval [0,1]. In detail, a monotonically decreasing non-negative coding function f( ) is needed, for example, a filpped clip function, i.e. 1 if δ < 1 ϵ 1 2 , if 1 δ 1 ϵ, if δ > 1 (11) where 0 ϵ 1 is a constant controlling the decay. We show the coding function in Figure 1 and elobarate the difficulty coding process in Algorithm 1. Embedding the difficulty degree into Eq.(8), we obtain the gated SVM, i.e. min θ,ξ 1 2 θ 2 2 + C s.t. ε + zy i θ, φ(xi) zy i w , ψ(xi, y) + diξi. ξi 0, i = 1, 2, ..., n. As in Eq.(12), the more difficult (lager di) examples will be allowed for larger mistake errors. In this way, the student can learn more adaptively with different examples. To distinguish Eq.(12) from Eq.(8) which is with no difficulty coding, we call Eq.(12) g SVM and Eq.(8) g SVM-0. Remark 1. Substituing diξi with υi, the objective of Eq.(12) will be rewritten as 1 n n i=1 d 1 i υi in term of variables θ and υ. This formulation is similar with the idea of weighted SVM (Lapin, Hein, and Schiele 2014), where each of example is assigned an importance weight d 1 i . Remark 2. Since the teacher model has k times as many classifiers than a binary SVM, the computation cost is linear with the class number k. Note that the teacher model can be speeded up for its binary inference via a traverse technique: as soon as any class with a larger value than that of the target class is found, one can stop traversing the classes in Eq.(2). -1.5 -1 -0.5 0 0.5 1 1.5 Difference difficulty degree d Figure 1: Difficulty coding function from the difference δ. The blue line is the original difference values, and the red line is the corresponding difficulty degrees. The green mark indicates the constant ϵ in Eq.(11). But this only works on those which are predicted to be -1 by the teacher. For those +1 predicted examples, we still have to traverse all classes to make sure the target class has the largest score. Even for the -1 predicted examples. the worst case still needs to traverse all classes. We will show this in our experiments. Optimization In this section, we elaborate how to solve the proposed gated SVM Eq.(12). g SVM is a convex problem, which can be efficiently solved by many off-the-shelf toolboxes, such as mosek (Mosek 2010) and cvx (Grant, Boyd, and Ye 2008). By dint of Eq.(12), we rewrite it into an empirical risk minimization problem, min θ 1 2 θ 2 2+ C i=1 d 1 i [ε+zy i θ, φ(xi) zy i Sy i ]+ (13) where Sy i = w , ψ(xi, y) and [x]+ = max(x, 0) is a rectified linear unit (Re LU) function. Eq.(13) is unconstrained; however, the loss function is not smooth and nondifferentiable. Majorization minimization (MM) algorithms substitute a simple optimization problem for a difficult optimization problem (Hunter and Lange 2004; Mairal 2015; Lange 2016). In the sequel, we present how to optimize Eq.(13) with an MM algorithm. First, we throw light on the basic paradigm of MM. Majorization Minimization (MM) MM liberates the optimization of a non-comvex or/and nonsmooth objective function by iteratively minimizing a majorization suggogate function. During the optimization, it actively constructs a majorizer M(θ; θ(k)) at the corrent iterate θ(k), which is assumed to be easy to manipulate. Suppose the objective function is g(θ). The majorizer M is supposed to satisfy two properties: g(θ) M(θ; θ(k)), θ (14) θ(k) = arg min θ M(θ; θ(k)) g(θ). (15) Then the next iterate is presented by minimizing the majorizer instead of the original objective: θ(k+1) = arg min θ M(θ; θ(k)). (16) As a result, we can have a sequence of iterates enabling the previous objective g(θ) to keep non-increasing. Iterative Update The difficulty of optimizing Eq.(13) comes from the nonsmoothness of Re LU function. In our paper, we adopt a majorizer proposed in (Nguyen and Mc Lachlan 2017), presented in Lemma 1. Lemma 1 ((Nguyen and Mc Lachlan 2017)). If v = 0, then the function [u]+ is majorized at v by M(u; v) = 1 4|v|(u + |v|)2. (17) Besides, every function can be majorized by itself, including the first term 1 2 θ 2 2 in Eq.(13). Following Lemma 1 and letting u = ε+zy i φT xiθ zy i Sy i , v = ε+zy i φT xiθ(k) zy i Sy i 1, then the majorizer is as M(θ; θ(k)) = C (ε + zy i φT xiθ zy i Sy i + |ϑ(k) i |)2 4di|ϑ(k) i | + 1 (18) where ϑ(k) i = ε + zy i φT xiθ(k) zy i Sy i . Let ϑ(k) Rn = [ϑ(k) 1 , ..., ϑ(k) n ]T , zy = [zy 1, ..., zy n]T , Zy = diag(zy), Φ = [φ(x1), ..., φ(xn)] Rm n, d = [d1, ..., dn]T , D = diag(d), Υ(k) = diag(ϑ(k)) and μy Rn with each element μy i = zy i Sy i . Note that we usually add a small positive number (e.g. 10 6) to the |ϑ(k) i | in the denominator in Eq.(18) for the numerical stability. Then Eq.(18) can be rewritten as M(θ; θ(k)) = 1 2 θ 2 2 + C 4n(η(k))T D 1|Υ(k)| 1η(k) with η(k)(θ) = ε + ZyΦT θ μy + |ϑ(k)| (20) where | | is element-wise absolute and ε is a vector with elements being all ε. Eq.(19) is a positive quadratic form. To minimize the M(θ; θ(k)), we zero the derivative of θ, 2nΦZy D 1|Υ(k)| 1(ε + ZyΦT θ μy + |ϑ(k)|) = A(k)θ b(k) = 0 (21) A(k) = I + C 2nΦZy D 1|Υ(k)| 1ZyΦT (22) 2nΦD 1|Υ(k)| 1ΦT (23) 2nΦZy D 1|Υ(k)| 1(μy |ϑ(k)| ε). (24) 1We use φxi to take place of φ(xi) for briefness. Table 2: Data statistics. #train is the number of training examples while #test is that of test ones. #features and #classes are the number of features and classes, respectively. Dataset #train #test #features #classes letter 15000 5000 16 26 protein 17766 6621 357 3 satimage 4435 2000 36 6 shuttle 43500 14500 9 7 vowel 528 462 10 11 yeast 1185 299 8 10 class1 class2 class3 class4 class5 class6 class1 class2 class3 class4 class5 class6 0.4 class1 class2 class3 class4 class5 class6 0.94 teacher g SVM g SVM-0 SVM Figure 2: Performance of binary classifiers with respect to each class on satimage dataset. Note that for the visual clearity, some results of SVM s AUC are not shown since they are smaller than 0.94. Due to A(k) s being positive definite, the next iterate θ(k+1) can be easily obtained by θ(k+1) = (A(k)) 1b(k) (25) The algorithm outline is presented in Algorithm 2. In addition, the majorizer Eq.(19) is also a least-squares with varying weight, and Algorithm 2 can also be regarded as an iterativelyreweighted least-squares (IRLS) algorithm (Navia-Vazquez et al. 2001). (Nguyen and Mc Lachlan 2017) demonstrated it is globally convergent to the minimum. Experimental Results In this part, we experimentally investigate the effectiveness and efficiency of the proposed gated SVM in dealing with the STMS problem. Configuration We first illustrate our experimental configuration, including some preliminaries and the specific setting. Datasets. We used 6 benchmark multi-class datasets in our experiment, including letter (Hsu and Lin 2002), protein (Wang 2002), satimage (Hsu and Lin 2002), shuttle (Hsu and Lin 2002), vowel and yeast.2 The various statistics of all datasets are presented in Tabel 2. 2Datasets can be downloaded in https://www.csie.ntu.edu.tw/ cjlin/libsvmtools/datasets/multiclass.html 0 0.5 1 False positive rate True positive rate teacher g SVM g SVM-0 SVM 0 0.5 1 False positive rate True positive rate teacher g SVM g SVM-0 SVM Figure 3: ROC curves for two classes on satimage dataset. satimage (6 classes) vowel (11 classes) letter (26 classes) 0 relative time teacher teacher-speed up g SVM SVM Figure 4: Relative inference time to the teacher model, i.e. the specific time devided by the time of the teacher model. We do not show the time of g SVM-0 since it is the same with g SVM in inference stage. Comparison methods. Since we first deal with the STMS problem, there are few comparison methods, but the following can serve as solutions. 1) Just using the teacher model (denoted as teacher ). For the target class, we can directly use the teacher s prediction results; if other classes are predicted by the teacher, then we think the examples do not belong to the target class. 2) Directly training a binary classifier for the target class. In our experiment, we use the widely-used SVM method ( denoted as SVM ). 3)To verify the effectiveness of the proposed difficulty coding from the teacher, we chose g SVM-0 (Eq.(8)) as a natural comparison method. Evaluation metrics. We select four evaluation metrics to comprehensively assess the classification performance of the learned binary classifiers, including accuracy, F-measure, receiver operating characteristic curve (ROC curve) and area under the curve (AUC). Teacher model. SSVM is used for obtaining the teacher models. In particular, we use the joint feature map ψ(x, y) to be (I(y = 1)φ(x), ..., I(y = K)φ(x))T , where I( ) is an indicator function, and the distance measure is Hamming distance, i.e. Δ(y, yi) = I(y = yi).3 Experimental setting. For each dataset, with the trained teacher model (i.e. a multiclass classifier), we train a binary classifier for each of its class. The parameters C, ε and ϵ are tuned by 10-fold cross validation in 3Actually, this formulation also coheres with the form of Crammer-Singer multiclass SVM (Crammer and Singer 2001). Table 3: The mean and median of classification performance of all binary classifiers in term of all classes for each dataset. mean median dataset method accuracy F-measure AUC accuracy F-measure AUC teacher 99.72% 96.35% 99.95% 99.73% 96.43% 99.97% SVM 97.35% 43.47% 90.89% 96.90% 44.37% 93.40% g SVM-0 98.41% 81.75% 98.64% 98.47% 80.19% 98.70% g SVM 98.88% 82.26% 99.72% 99.16% 87.50% 99.81% teacher 80.50% 67.55% 84.91% 81.03% 71.86% 84.34% SVM 78.31% 60.86% 81.41% 77.57% 64.26% 82.34% g SVM-0 78.67% 63.23% 82.54% 78.15% 63.79% 82.88% g SVM 81.01% 69.86% 85.03% 81.06% 68.47% 85.23% teacher 97.05% 89.80% 98.76% 97.35% 92.13% 99.34% SVM 94.00% 74.28% 91.46% 93.65% 81.31% 96.81% g SVM-0 94.33% 81.95% 98.54% 94.12% 83.56% 98.95% g SVM 96.28% 86.76% 98.55% 96.55% 89.29% 99.08% teacher 99.95% 82.20% 98.81% 99.96% 85.71% 98.99% SVM 97.47% 42.83% 82.61% 98.91% 43.33% 88.01% g SVM-0 98.13% 77.54% 94.86% 98.34% 73.66% 95.02% g SVM 99.57% 80.42% 99.28% 99.61% 83.17% 99.85% teacher 91.93% 53.69% 92.12% 93.29% 55.38% 93.10% SVM 89.95% 22.01% 80.00% 89.91% 24.55% 85.11% g SVM-0 86.19% 43.03% 91.05% 85.50% 42.00% 90.04% g SVM 90.93% 52.49% 91.96% 92.21% 51.72% 91.92% teacher 92.04% 63.59% 85.52% 96.49% 64.12% 83.28% SVM 90.47% 35.37% 81.86% 95.82% 36.67% 81.37% g SVM-0 91.44% 49.71% 83.00% 95.99% 53.08% 84.36% g SVM 91.71% 51.67% 83.10% 96.55% 55.14% 85.40% the sets {10 3, ..., 100, ..., 103}, {0, 0.1, 0.2, ..., 1.5} and {0.1, 0.2, ..., 0.5}, respectively. Results and Analysis In Figures 2 and 3, we show the performance of accuracy, F-measure, AUC and ROC curve on dataset satimage for example. Since we trained binary classifiers for all classes in terms of each dataset, to comprehensively investigate the performance gap among all competing methods, we calculate the mean and median of all classifiers on each dataset and evaluation metric (Van Hasselt, Guez, and Silver 2016). All obtained results are summarized in Tabel 3. First, for verifying the effectiveness of the proposed difficulty coding, we can see that g SVM is better than g SVM-0 in most classes in Figure 2, and for the cases in Table 3, g SVM wins g SVM-0 in all cases. This implies that the proposed difficulty coding does play a positive role in enhancing the performance of classifiers. Second, it can be seen that the proposed g SVM is comparable to the teacher model in all classes in Figure 2 and achieves near results with the teacher in Tabel 3. Besides, their ROC curves are very close in Figure 3. The slight superiority of teacher model to a binary SVM in a binary inference may result from the advantages of a well-designed multiclass classifier over the simple One-vs All strategy. However, since the teacher model has the much larger model size than g SVM, and its inference stage needs to involve all classes, its inference time would be much more than that of g SVM. We report the relative inference time to the teacher on three datasets in Figure 4. We can see that g SVM and SVM have the considerably less time than that of teacher model (even with the speed-up technique). As the number of classes increases, the gap could be much larger. That implies that teacher model is not appropriate for the binary prediction when the number of classes is large. Third, g SVM and SVM have the competing inference time according to Figure 4, however, SVM s classification performance is inferior to that of g SVM in Figures 2, Figure 3 and Table 3. Thus g SVM enjoys the superiority to SVM in classification performance. To sum up, from empirical results g SVM has the comparable classification performance with the teacher model, but much faster inference speed than the teacher model. As for SVM, it has the worse classification performance than g SVM, though it has similar efficiency with g SVM. In this paper, we illustrate a new problem that uses a pretrained teacher model to learn a series of student models. We define this problem as Single-Teacher Multi-Student (STMS) problem. By taking the multi-class and binary classification as an example, we investigate how to use a welltrained multiclass classifier to learn a series of binary classifiers with respect to each of its class. To solve this problem, we propose a gated SVM as the model for learning binary classifiers. In gated SVM, the students prediction is determined by both the teacher and themselves. Moreover, we use the teacher s score output to construct different difficulty degrees for various training examples, so the students training is guided adaptively by the teacher. Experimental results show the effectiveness of our method and validate our theory. Acknowledgments We thank the constructive comments of all reviewers and the support from the NSFC 61375026 and 2015BAF15B00, and ARC Projects FL-170100117, DE-180101438, DP180103424, DP-140102164 and LP-150100671. References Aly, M. 2005. Survey on multiclass classification methods. Neural Netw 19. Bishop, C. M. 2006. Pattern recognition and machine learning. springer. Crammer, K., and Singer, Y. 2001. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of machine learning research 2(Dec):265 292. Du, Y.; Xu, C.; and Tao, D. 2017. Privileged matrix factorization for collaborative filtering. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, 1610 1616. Grant, M.; Boyd, S.; and Ye, Y. 2008. Cvx: Matlab software for disciplined convex programming. Guo, Y. 2017. Problems in large-scale image classification. In AAAI, 5038 5039. He, R.; Wu, X.; Sun, Z.; and Tan, T. 2017. Learning invariant deep representation for nir-vis face recognition. In AAAI, 2000 2006. He, H.; Eisner, J.; and Daume, H. 2012. Imitation learning by coaching. In Advances in Neural Information Processing Systems, 3149 3157. Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. ar Xiv preprint ar Xiv:1503.02531. Hsu, C.-W., and Lin, C.-J. 2002. A comparison of methods for multiclass support vector machines. IEEE transactions on Neural Networks 13(2):415 425. Hunter, D. R., and Lange, K. 2004. A tutorial on mm algorithms. The American Statistician 58(1):30 37. Joachims, T.; Finley, T.; and Yu, C.-N. J. 2009. Cutting-plane training of structural svms. Machine Learning 77(1):27 59. Joachims, T. 2002. Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 133 142. ACM. Lange, K. 2016. MM Optimization Algorithms. SIAM. Lapin, M.; Hein, M.; and Schiele, B. 2014. Learning using privileged information: Svm+ and weighted svm. Neural Networks 53:95 108. Mairal, J. 2015. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization 25(2):829 855. Martinez, A. M. 1998. The ar face database. CVC technical report. Mosek, A. 2010. The mosek optimization software. Online at http://www. mosek. com 54. Navia-Vazquez, A.; Perez-Cruz, F.; Artes-Rodriguez, A.; and Figueiras-Vidal, A. R. 2001. Weighted least squares training of support vector classifiers leading to compact and adaptive schemes. IEEE Transactions on Neural Networks 12(5): 1047 1059. Nguyen, H. D., and Mc Lachlan, G. J. 2017. Iterativelyreweighted least-squares fitting of support vector machines: A majorization minimization algorithm approach. ar Xiv preprint ar Xiv:1705.04651. Nie, F.; Wang, X.; and Huang, H. 2017. Multiclass capped lp-norm svm for robust classifications. Tsochantaridis, I.; Hofmann, T.; Joachims, T.; and Altun, Y. 2004. Support vector machine learning for interdependent and structured output spaces. In Proceedings of the twenty-first international conference on Machine learning, 104. ACM. Van Hasselt, H.; Guez, A.; and Silver, D. 2016. Deep reinforcement learning with double q-learning. In AAAI, 2094 2100. Vapnik, V. 2013. The nature of statistical learning theory. Springer science & business media. Wang, J.-Y. 2002. Application of support vector machines in bioinformatics. Taipei: Department of Computer Science and Information Engineering, National Taiwan University. Xu, W.; Sun, H.; Deng, C.; and Tan, Y. 2017. Variational autoencoder for semi-supervised text classification. In AAAI, 3358 3364. Xu, C.; Tao, D.; and Xu, C. 2013. A survey on multi-view learning. ar Xiv preprint ar Xiv:1304.5634. Yang, E.; Deng, C.; Liu, W.; Liu, X.; Tao, D.; and Gao, X. 2017. Pairwise relationship guided deep hashing for crossmodal retrieval. In AAAI, 1618 1625. You, S.; Xu, C.; Wang, Y.; Xu, C.; and Tao, D. 2017a. Privileged multi-label learning. In Proceedings of the Twenty Sixth International Joint Conference on Artificial Intelligence, 3336 3342. IJCAI 2017, Melbourne, Australia, August 1925, 2017 You, S.; Xu, C.; Xu, C.; and Tao, D. 2017b. Learning from multiple teacher networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1285 1294. ACM. Yu, C.-N. J., and Joachims, T. 2009. Learning structural svms with latent variables. In Proceedings of the 26th annual international conference on machine learning, 1169 1176. ACM. Zhu, X.; Liu, J.; and Lopes, M. 2017. No learner left behind: On the complexity of teaching multiple learners simultaneously. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, 3588 3594.