Teaching-to-Learn and Learning-to-Teach for Multi-Label Propagation

Chen Gong, Dacheng Tao, Jie Yang, and Wei Liu
Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University
Centre for Quantum Computation and Intelligent Systems, University of Technology Sydney
Didi Research, Beijing, China
{goodgongchen, jieyang}@sjtu.edu.cn, dacheng.tao@uts.edu.au, weiliu@didichuxing.com

Abstract

Multi-label propagation aims to transmit multi-label information from labeled examples to unlabeled examples based on a weighted graph. Existing methods ignore the specific propagation difficulty of different unlabeled examples and conduct the propagation in an imperfect sequence, leading to the error-prone classification of some difficult examples with uncertain labels. To address this problem, this paper associates each possible label with a teacher, and proposes a Multi-Label Teaching-to-Learn and Learning-to-Teach (ML-TLLT) algorithm, so that the entire propagation process is guided by the teachers and manipulated from simple examples to more difficult ones. In the teaching-to-learn step, the teachers select the simplest examples for the current propagation by investigating both the definitiveness of each possible label of the unlabeled examples and the dependencies between labels revealed by the labeled examples. In the learning-to-teach step, the teachers reversely learn from the learner's feedback to properly select the simplest examples for the next propagation. Thorough empirical studies show that, due to the optimized propagation sequence designed by the teachers, ML-TLLT yields generally better performance than seven state-of-the-art methods on typical multi-label benchmark datasets.

Introduction

Multi-Label Learning (MLL) refers to the problem in which an example can be assigned a set of different labels. So far, MLL has been intensively adopted in image annotation (Wang, Huang, and Ding 2009), text categorization (Schapire and Singer 2000), social behavior learning (Tang and Liu 2009), and other applications. Following the taxonomy presented in (Zhang and Zhou 2014), we classify the existing MLL algorithms into problem transformation methods and algorithm adaptation methods.

Problem transformation methods cast MLL into other well-studied scenarios. Representative approaches include Calibrated Label Ranking (Fürnkranz et al. 2008), which transforms MLL into a label ranking problem, and Binary Relevance (Boutell et al. 2004), which regards MLL as a series of binary classification tasks. Algorithm adaptation methods extend existing learning algorithms to multi-label cases. For example, (Zhang and Zhou 2007) adapt the traditional KNN classifier to multi-label KNN, (Clare and King 2001; Bi and Kwok 2011) deploy tree models to analyze the MLL problem, and (Elisseeff and Weston 2001; Xu, Li, and Zhou 2013; Xu, Tao, and Xu 2015) develop various multi-label SVMs by introducing the ranking loss, the PRO loss, and the causality between labels, respectively. Although the above methods differ from one another, they all focus on how to exploit label correlations to optimize learning performance. Since a graph is a simple yet powerful tool for modeling the relationship between labels or examples, several researchers have recently introduced graphs into the MLL problem.
Representative works include (Kong, Ng, and Zhou 2013; Chen et al. 2013; Wang, Huang, and Ding 2009; Chen et al. 2008; Jiang 2012; Kang, Jin, and Sukthankar 2006; Zha et al. 2008; Wang, Tu, and Tsotsos 2013). However, the above graph-based propagation methods often suffer from unsatisfactory performance due to the unexpected noise (e.g., outliers) in the sample space and the huge label search space (its size is $2^q$, where q is the number of possible labels). To make matters worse, they propagate the labels to unlabeled examples in an unfavorable sequence, without considering the individual propagation difficulty or reliability of these examples. For example, (Wang, Huang, and Ding 2009; Chen et al. 2008; Kong, Ng, and Zhou 2013) treat all the unlabeled examples equally and conduct a one-shot label propagation by minimizing a designed energy function, while (Jiang 2012; Wang, Tu, and Tsotsos 2013) iteratively transfer the label information to unlabeled examples as long as these examples are directly linked to the labeled examples on a graph.

Inspired by (Gong et al. 2015), we address the above problem by proposing a novel iterative multi-label propagation scheme called Multi-Label Teaching-to-Learn and Learning-to-Teach (ML-TLLT), which explicitly manipulates the propagation sequence so that the unlabeled examples are logically propagated from simple to difficult. This is beneficial to the propagation quality because the previously attained simple knowledge eases the learning burden for the subsequent difficult examples. Generally, an example is simple if its labels can be confidently decided.

Suppose we have $n = l + u$ examples $X = \{x_1, \dots, x_l, x_{l+1}, \dots, x_{l+u}\}$, where the first l elements constitute the labeled set L and the remaining u examples form the unlabeled set U. Each element $x_i \in X \subset \mathbb{R}^d$ is associated with a set of q possible labels encoded in a label vector $Y_i = (Y_{i1}, \dots, Y_{iq}) \in \{0, 1\}^q$, where $Y_{ir} = 1$ ($r = 1, \dots, q$) means that $x_i$ has the r-th label, and $Y_{ir} = 0$ otherwise. Our target is to iteratively propagate the labels $Y_1, \dots, Y_l$ from L to U based on the established KNN graph $G = \langle V, E \rangle$ (see Fig. 1(a)), where V is the node set corresponding to all n examples and E is the edge set representing the similarities between these nodes.

We associate each of the q labels with a teacher (see Fig. 1(b)). In the teaching-to-learn step of a single propagation, the r-th teacher ($r = 1, \dots, q$) estimates the difficulty of $x_i \in U$ by evaluating the definitiveness $M_r(x_i)$ of the r-th label from its own viewpoint. The correlations between labels are also considered by the teachers and encoded in the variables $Q_{rp} \in [0, 1]$ ($r, p = 1, \dots, q$). Based on the label-specific definitiveness and the pairwise label correlations, the simplest examples are determined by the individual teachers (recorded by the selection matrices $S^{(1)}, \dots, S^{(q)}$), after which the overall simplest examples agreed by all the teachers are placed into the curriculum set $S^*$. The state-of-the-art TRAnsductive Multi-label (TRAM) algorithm (Kong, Ng, and Zhou 2013) is adopted as the learner to reliably propagate the labels to the designated curriculum $S^*$ (see Fig. 1(c)). In the learning-to-teach step, the learner delivers learning feedback to the teachers to assist them in deciding the next suitable curriculum. The above ML-TLLT process iterates, with the labeled and unlabeled sets respectively updated by $L := L \cup S^*$ and $U := U \setminus S^*$, and terminates when $U = \emptyset$, at which point all the originally unlabeled examples have been assigned reliable labels $Y_{l+1}, \dots, Y_{l+u}$.
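To make the overall scheme concrete, the following Python sketch outlines one plausible organization of the iterative process described above. It is not the authors' implementation: select_curriculum, propagate, and learning_feedback are hypothetical callables standing in for the teaching-to-learn step, the TRAM learner, and the learning-to-teach feedback, and the candidate-set construction simply follows the later description of B (unlabeled examples directly connected to a labeled one).

```python
import numpy as np

def ml_tllt(Y_labeled, labeled_idx, unlabeled_idx, W,
            select_curriculum, propagate, learning_feedback):
    """Sketch of the ML-TLLT loop: labels are propagated from simple
    unlabeled examples to more difficult ones, one curriculum at a time.
    The three callables are hypothetical stand-ins for the teaching-to-learn
    step, the TRAM learner, and the learning-to-teach feedback."""
    L = list(labeled_idx)          # current labeled set (node indices)
    U = set(unlabeled_idx)         # current unlabeled set
    Y = {i: Y_labeled[k] for k, i in enumerate(labeled_idx)}  # known label vectors
    s = 1                          # initial curriculum size (chosen arbitrarily here)
    while U:
        # candidate set B: unlabeled nodes directly connected to a labeled node
        B = [j for j in U if W[j, L].sum() > 0]
        if not B:                  # disconnected remainder: take everything left
            B = list(U)
        # teaching-to-learn: the teachers agree on the s simplest candidates
        S_star = select_curriculum(B, L, Y, W, s)
        # the learner (TRAM) propagates labels to the curriculum examples
        Y_new, F_new = propagate(S_star, L, Y, W)
        Y.update(Y_new)
        # learning-to-teach: feedback decides the next curriculum size
        L += S_star
        U -= set(S_star)
        b_next = len([j for j in U if W[j, L].sum() > 0]) or len(U)
        s = max(1, int(np.ceil(b_next * learning_feedback(F_new, Y_new))))
    return Y
```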
Our work is different from active learning (Settles 2010), because active learning needs a human labeler to annotate the selected examples while our method does not. Our work also differs from curriculum learning (Bengio et al. 2009; Kumar, Packer, and Koller 2010; Jiang et al. 2015; Khan, Mutlu, and Zhu 2011) in that ML-TLLT requires the interaction between a teaching committee and a learner.

Our Approach

We denote by W the adjacency matrix of G, whose (i, j)-th element is $W_{ij} = \exp\left(-\|x_i - x_j\|^2 / (2\xi^2)\right)$ if $x_i$ and $x_j$ are linked by an edge, and $W_{ij} = 0$ otherwise. Here $\xi$ is the kernel width, set to the average Euclidean distance between all pairs of examples (Kong, Ng, and Zhou 2013). Based on W, we have the diagonal degree matrix with $D_{ii} = \sum_{j=1}^{n} W_{ij}$ and the graph Laplacian matrix $L = D - W$. We stack $Y_1, \dots, Y_n$ into a label matrix $Y = (Y_1^\top, \dots, Y_n^\top)^\top$, and the correlation between labels r and p is then computed by $Q_{rp} = \cos(Y_{L,r}, Y_{L,p}) = \frac{\langle Y_{L,r},\, Y_{L,p} \rangle}{\|Y_{L,r}\|\, \|Y_{L,p}\\|}$, where $Y_{L,r} = (Y_{1r}, \dots, Y_{lr})^\top$. Similarly, we define a label score matrix $F = (F_1^\top, \dots, F_n^\top)^\top$, with every element $F_{ir} \geq 0$ denoting the possibility of $x_i$ belonging to class r.

Teaching-to-learn Step

In each propagation, all the unlabeled examples that are directly connected to L are included in a candidate set B of size b, and the goal of the teachers is to pick the simplest curriculum examples $S^* = \{x^*_1, \dots, x^*_s\}$ from B, where s is the number of selected examples. Such a selection should consider both the definitiveness of each individual possible label and the correlation between pairs of labels.

An example $x_i \in B$ is simple in terms of the r-th label if $Y_{ir}$ is definitely 1 or 0. Let $C_r$ (resp. $\bar{C}_r$) be the set of all labeled examples whose r-th label is 1 (resp. 0). The definitiveness of $x_i$'s r-th label is then modeled by

$$M_r(x_i) = \left| \bar{T}(x_i, C_r) - \bar{T}(x_i, \bar{C}_r) \right|, \qquad (1)$$

where $\bar{T}(x_i, C_r)$ (resp. $\bar{T}(x_i, \bar{C}_r)$) represents the average commute time between $x_i$ and all the elements of $C_r$ (resp. $\bar{C}_r$), that is,

$$\bar{T}(x_i, C_r) = \frac{1}{|C_r|} \sum_{x_{i'} \in C_r} T(x_i, x_{i'}), \qquad \bar{T}(x_i, \bar{C}_r) = \frac{1}{|\bar{C}_r|} \sum_{x_{i'} \in \bar{C}_r} T(x_i, x_{i'}). \qquad (2)$$

In (2), the notation $|\cdot|$ computes the size of the corresponding set, and $T(x_i, x_{i'})$ denotes the commute time (Qiu and Hancock 2007) between $x_i$ and $x_{i'}$, which is

$$T(x_i, x_{i'}) = \sum_{k=1}^{n} h(\lambda_k)\, (u_{ki} - u_{ki'})^2, \qquad (3)$$

where $0 = \lambda_1 \leq \dots \leq \lambda_n$ are the eigenvalues of the Laplacian matrix L, $u_1, \dots, u_n$ are the associated eigenvectors, $u_{ki}$ denotes the i-th element of $u_k$, and $h(\lambda_k) = 1/\lambda_k$ if $\lambda_k \neq 0$ and $h(\lambda_k) = 0$ otherwise. The commute time $T(x_i, x_{i'})$ describes the time cost of starting from $x_i$, reaching $x_{i'}$, and then returning to $x_i$ again, so it can be leveraged to describe the closeness of two examples. The larger $M_r(x_i)$ is, the simpler $x_i$ is in terms of label r.
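As a concrete illustration of Eqs. (1)-(3), the numpy sketch below computes pairwise commute times from a dense eigendecomposition of the graph Laplacian and then the definitiveness $M_r(x_i)$. It assumes the graph is small enough for a dense eigendecomposition, and C_r / C_r_bar are hypothetical index lists of the labeled examples whose r-th label is 1 or 0; this is an illustrative sketch rather than the authors' code.

```python
import numpy as np

def commute_times(W):
    """Pairwise commute times T(x_i, x_j) as in Eq. (3), via a dense
    eigendecomposition of the unnormalized Laplacian L = D - W."""
    D = np.diag(W.sum(axis=1))
    Lap = D - W
    lam, U = np.linalg.eigh(Lap)          # ascending eigenvalues; columns of U are u_k
    # h(lambda_k) = 1/lambda_k if lambda_k != 0, and 0 otherwise (with a tolerance)
    h = np.where(lam > 1e-10, 1.0 / np.maximum(lam, 1e-10), 0.0)
    n = W.shape[0]
    T = np.zeros((n, n))
    for k in range(n):
        diff = U[:, k][:, None] - U[:, k][None, :]   # u_{ki} - u_{kj}
        T += h[k] * diff ** 2
    return T

def definitiveness(T, i, C_r, C_r_bar):
    """M_r(x_i) from Eqs. (1)-(2): the gap between the average commute time
    from x_i to the positive set C_r and to the negative set C_r_bar."""
    return abs(T[i, C_r].mean() - T[i, C_r_bar].mean())
```

For larger graphs one would truncate the eigendecomposition or use sparse solvers, but that is an implementation detail beyond the paper's description.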
To account for the correlations between labels, we force the teachers of labels r and p to generate similar curricula if the two labels are highly correlated over the labeled examples (i.e., $Q_{rp}$ is large). Based on the above considerations, we introduce a binary example selection matrix $S^{(r)} \in \{0, 1\}^{b \times s}$ for the r-th label ($r = 1, \dots, q$). The element $S^{(r)}_{ij} = 1$ means that the r-th teacher considers the i-th example to be simple, and that it should therefore be included as the j-th element of the curriculum set. The final selection agreed by all q teachers is indicated by the matrix $S^*$, which has the same definition as $S^{(r)}$. Therefore, the model for selecting examples is

$$\min_{S^{(1)},\dots,S^{(q)},S^*} \ \sum_{r=1}^{q} \mathrm{tr}\!\left(S^{(r)\top} M^{(r)-1} S^{(r)}\right) + \beta_0 \sum_{r,p=1}^{q} Q_{rp} \left\| S^{(r)} - S^{(p)} \right\|^2 + \beta_1 \sum_{r=1}^{q} \left\| S^{(r)} - S^* \right\|^2$$
$$\mathrm{s.t.}\ \ S^* \in \{0,1\}^{b\times s},\ S^{*\top} S^* = I_{s\times s};\ \ S^{(r)} \in \{0,1\}^{b\times s},\ S^{(r)\top} S^{(r)} = I_{s\times s},\ \ r = 1,\dots,q, \qquad (4)$$

where $M^{(r)}$ is a diagonal matrix with diagonal elements $M^{(r)}_{ii} = M_r(x_i)$ for every $x_i \in B$. The first (definitiveness) term in the objective investigates the definitiveness of all q labels of $x_i$ and drives the i-th row of $S^{(r)}$ towards zero if $M_r(x_i)$ is small. The second (label correlation) term exploits the label dependencies so that highly correlated labels produce similar selection matrices. The third (consistency) term integrates the selection matrices decided by the various teachers into a consistent result. The positive $\beta_0$ and $\beta_1$ are trade-off parameters. The orthogonality constraints $S^{(r)\top} S^{(r)} = I_{s\times s}$ (I denotes the identity matrix) and $S^{*\top} S^* = I_{s\times s}$ ensure that every example is selected only once in $S^{(r)}$ and $S^*$.

Figure 1: The framework of our algorithm. (a) illustrates the established graph, in which the red balls, blue squares, and yellow lines represent the labeled examples, unlabeled examples, and edges, respectively. In (b), each of the q possible labels $Y_{i1}, \dots, Y_{iq}$ is associated with a teacher, who evaluates the corresponding label definitiveness $M_r(x_i)$ (r takes a value in $1, \dots, q$) for every unlabeled $x_i$ (i = 1, 2, 3 in this figure). By incorporating the label correlations (magenta arrows) recorded by $Q_{rp}$ ($r, p = 1, \dots, q$), the individual decisions $S^{(1)}, \dots, S^{(q)}$ are made and then unified into an overall simplest curriculum set $S^* = \{x_1, x_3\}$. These curriculum examples are classified by the learner in (c). Lastly, a learning feedback on $S^*$ is generated to help the teachers decide the next suitable curriculum.

However, problem (4) is NP-hard due to the discrete $\{0, 1\}$ constraints on $S^{(r)}$ and $S^*$. We therefore relax these integer constraints to continuous nonnegative constraints, which makes (4) tractable:

$$\min_{S^{(1)},\dots,S^{(q)},S^*} \ \sum_{r=1}^{q} \mathrm{tr}\!\left(S^{(r)\top} M^{(r)-1} S^{(r)}\right) + \beta_0 \sum_{r,p=1}^{q} Q_{rp} \left\| S^{(r)} - S^{(p)} \right\|^2 + \beta_1 \sum_{r=1}^{q} \left\| S^{(r)} - S^* \right\|^2$$
$$\mathrm{s.t.}\ \ S^* \geq O_{b\times s},\ S^{*\top} S^* = I_{s\times s};\ \ S^{(r)} \geq O_{b\times s},\ S^{(r)\top} S^{(r)} = I_{s\times s},\ \ r = 1,\dots,q, \qquad (5)$$

where O denotes the all-zero matrix. We adopt alternating minimization to sequentially update $S^{(1)}, \dots, S^{(q)}, S^*$ with the other variables fixed, and thereby find a local solution to (5).

Updating $S^{(r)}$. To update $S^{(r)}$, where r takes a value in $1, \dots, q$, we fix $S^{(r')}$ ($r' \neq r$) and $S^*$, and solve the following $S^{(r)}$-subproblem:

$$\min_{S^{(r)}} \ \mathrm{tr}\!\left(S^{(r)\top} M^{(r)-1} S^{(r)}\right) + \beta_0 \sum_{p=1}^{q} Q_{rp} \left\| S^{(r)} - S^{(p)} \right\|^2 + \beta_1 \left\| S^{(r)} - S^* \right\|^2$$
$$\mathrm{s.t.}\ \ S^{(r)} \geq O_{b\times s},\ S^{(r)\top} S^{(r)} = I_{s\times s}. \qquad (6)$$

Note that (6) is a nonconvex optimization problem because of the orthogonality constraint. The feasible region lies on the Stiefel manifold, the set of all $m_1 \times m_2$ matrices satisfying the orthogonality constraint, i.e., $\mathrm{St}(m_1, m_2) = \{ X \in \mathbb{R}^{m_1 \times m_2} : X^\top X = I_{m_2 \times m_2} \}$. Consequently, we adopt the Partial Augmented Lagrangian Multiplier (PALM) method (Bertsekas 2014) to solve problem (6). Only the nonnegativity constraint is incorporated into the objective of the augmented Lagrangian expression, while the orthogonality constraint is explicitly retained and imposed on the subproblem for updating $S^{(r)}$. In this way, $S^{(r)}$ is updated on the Stiefel manifold, which can be accomplished effectively by the curvilinear search method (Wen and Yin 2013). The partial augmented Lagrangian function of problem (6) is therefore

$$L\!\left(S^{(r)}, \Lambda^{(r)}, T^{(r)}, \sigma_r\right) = \mathrm{tr}\!\left(S^{(r)\top} M^{(r)-1} S^{(r)}\right) + \beta_0 \sum_{p=1}^{q} Q_{rp} \left\| S^{(r)} - S^{(p)} \right\|^2 + \beta_1 \left\| S^{(r)} - S^* \right\|^2 + \mathrm{tr}\!\left(\Lambda^{(r)\top} \left(S^{(r)} - T^{(r)}\right)\right) + \frac{\sigma_r}{2} \left\| S^{(r)} - T^{(r)} \right\|^2, \qquad (7)$$

where $\Lambda^{(r)} \in \mathbb{R}^{b \times s}$ is the Lagrangian multiplier, $T^{(r)} \in \mathbb{R}^{b \times s}$ is an auxiliary nonnegative matrix, and $\sigma_r > 0$ is the penalty coefficient.
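The curvilinear search mentioned above relies on the Cayley transform of Wen and Yin (2013) to stay on the Stiefel manifold. The minimal numpy sketch below illustrates only this feasibility-preserving step: the skew-symmetric matrix A is built from a gradient, and the point P(τ) on the search curve remains orthogonal. The gradient and step size are arbitrary stand-ins, not part of the actual PALM solver.

```python
import numpy as np

def cayley_step(S, grad, tau):
    """One feasible point on the search curve P(tau): A is skew-symmetric by
    construction, and the Cayley transform (I + tau/2 A)^{-1} (I - tau/2 A) S
    preserves S^T S = I exactly."""
    b = S.shape[0]
    A = grad @ S.T - S @ grad.T                      # skew-symmetric matrix
    I = np.eye(b)
    return np.linalg.solve(I + (tau / 2.0) * A, (I - (tau / 2.0) * A) @ S)

# toy check on a random Stiefel point (b = 6, s = 2); grad is an arbitrary stand-in
rng = np.random.default_rng(0)
S, _ = np.linalg.qr(rng.standard_normal((6, 2)))     # S^T S = I
grad = rng.standard_normal((6, 2))
P = cayley_step(S, grad, tau=1e-3)
print(np.allclose(P.T @ P, np.eye(2)))               # True: orthogonality is preserved
```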
$S^{(r)}$ is therefore updated by minimizing (7) subject to $S^{(r)\top} S^{(r)} = I_{s\times s}$ via the curvilinear search method (Wen and Yin 2013); see Algorithm 1. In Algorithm 1, $\nabla L(S^{(r)})$ is the gradient of $L(S^{(r)}, \Lambda^{(r)}, T^{(r)}, \sigma_r)$ w.r.t. $S^{(r)}$, and $L'_P(\tau) = \mathrm{tr}\!\left(\nabla L(S^{(r)})^\top P'(\tau)\right)$ is the derivative of $L(S^{(r)}, \Lambda^{(r)}, T^{(r)}, \sigma_r)$ w.r.t. the stepsize $\tau$, in which

$$P'(\tau) = -\left(I + \tfrac{\tau}{2} A\right)^{-1} A\, \frac{S^{(r)} + P(\tau)}{2}.$$

Algorithm 1 works by computing the gradient of L in the tangent plane of the manifold at the point $S^{(r)}(\mathrm{iter})$ (Line 4), from which a curve on the manifold is obtained that proceeds along the projected negative gradient (Line 7). A curvilinear search is then performed along this curve towards the optimal $S^{(r)}(\mathrm{iter}+1)$. The key of Algorithm 1 for preserving the orthogonality constraint lies in the Cayley transformation $\left(I + \tfrac{\tau}{2} A\right)^{-1}\left(I - \tfrac{\tau}{2} A\right)$ based on the skew-symmetric matrix A, which maps $S^{(r)}$ to $P(\tau)$ and guarantees that $P(\tau)^\top P(\tau) = I$ always holds. In each iteration, the optimal stepsize $\tau$ is estimated by the Barzilai-Borwein method (Fletcher 2005). The auxiliary matrix $T^{(r)}$ in (7) forces $S^{(r)}$ to be nonnegative, and is updated by the conventional rule of the augmented Lagrangian method, i.e., $T^{(r)}_{ij} := \max\!\left(0,\ S^{(r)}_{ij} + \Lambda^{(r)}_{ij} / \sigma_r\right)$. The entire PALM algorithm for solving the $S^{(r)}$-subproblem (6) is outlined in Algorithm 2, and it is guaranteed to converge (Wen and Yin 2013).

Updating $S^*$. The $S^*$-subproblem is formulated as

$$\min_{S^*} \ \beta_1 \sum_{r=1}^{q} \left\| S^{(r)} - S^* \right\|^2 \quad \mathrm{s.t.}\ \ S^* \geq O_{b\times s},\ S^{*\top} S^* = I_{s\times s}. \qquad (8)$$

We also use PALM to solve the $S^*$-subproblem, in the same way as for updating $S^{(r)}$; the detailed explanation is omitted due to space limitations.

By alternately solving the $S^{(r)}$-subproblems and the $S^*$-subproblem, the objective value of (5) always decreases. This objective is lower bounded by 0, since the diagonal matrices $M^{(r)-1}$ ($r = 1, \dots, q$) are positive definite. Therefore, the entire alternating minimization process is guaranteed to converge, and the overall simplest examples (i.e., a curriculum) agreed by all q teachers are indicated by $S^*$. Based on $S^*$, the solution of problem (4) is obtained by discretizing the continuous $S^*$ into binary values. Specifically, we find the largest element of $S^*$ and record its row and column; from the unrecorded rows and columns we then search for the next largest element and mark it as well. This procedure is repeated until s elements have been found. The rows of these s elements indicate the examples selected for the current propagation.

Multi-label Propagation. By employing the normalized graph Laplacian $H = I - D^{-1} W$ and starting from $F_L := Y_L$, Kong et al. (Kong, Ng, and Zhou 2013) show that the label scores $F_{S^*,r}$ of the curriculum examples in $S^*$ on the r-th label ($r = 1, \dots, q$) can be obtained by solving

$$H_{S^*,S^*}\, F_{S^*,r} = -H_{S^*,L}\, F_{L,r}, \qquad (9)$$

where $H_{S^*,S^*}$ and $H_{S^*,L}$ are sub-matrices of H indexed by the corresponding subscripts, and $F_{S^*,r}$ and $F_{L,r}$ are the r-th column vectors of the label score matrices $F_{S^*}$ and $F_L$, respectively. Since the (i, r)-th element of $F_{S^*}$ conveys the possibility of $x_i \in S^*$ having the r-th label, (Kong, Ng, and Zhou 2013) propose setting $x_i$'s label vector $Y_i$ as $Y_{ir} = 1$ if $F_{ir}$ is among the $\theta_i$ largest elements of $F_i$, and $Y_{ir} = 0$ otherwise. By assuming that similar examples should have similar numbers of labels, they derive a linear system to find the suitable $\Theta_{S^*} = (\theta_1, \dots, \theta_s)^\top$, namely

$$H_{S^*,S^*}\, \Theta_{S^*} = -H_{S^*,L}\, \Theta_L, \qquad (10)$$

where $\Theta_L$ is a column vector sharing a similar definition with $\Theta_{S^*}$ but recording the known numbers of labels of the labeled examples. The number of labels for $x_i \in S^*$ is then decided by rounding $\theta_i$ to the nearest integer.
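A dense illustrative sketch of the propagation step in Eqs. (9)-(10) is given below. It follows the convention $H = I - D^{-1}W$ used above (so the right-hand sides carry a minus), and S_idx / L_idx are hypothetical index lists for the curriculum and labeled examples; this is not the TRAM authors' implementation.

```python
import numpy as np

def propagate_curriculum(W, F_L, Theta_L, S_idx, L_idx):
    """Label-score and label-count propagation for the curriculum S*,
    following Eqs. (9)-(10). F_L holds the label scores of the labeled set
    (l x q) and Theta_L the numbers of labels of the labeled examples (length l)."""
    n = W.shape[0]
    D_inv = 1.0 / W.sum(axis=1)
    H = np.eye(n) - D_inv[:, None] * W               # normalized graph Laplacian
    H_SS = H[np.ix_(S_idx, S_idx)]
    H_SL = H[np.ix_(S_idx, L_idx)]
    # Eq. (9): H_{S*,S*} F_{S*} = -H_{S*,L} F_L   (one right-hand side per label)
    F_S = np.linalg.solve(H_SS, -H_SL @ F_L)
    # Eq. (10): the analogous system decides how many labels each example receives
    theta_S = np.rint(np.linalg.solve(H_SS, -H_SL @ Theta_L)).astype(int)
    # assign label r if F_S[i, r] is among the theta_S[i] largest scores of row i
    Y_S = np.zeros_like(F_S, dtype=int)
    for i, k in enumerate(theta_S):
        if k > 0:
            Y_S[i, np.argsort(F_S[i])[::-1][:k]] = 1
    return Y_S, F_S
```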
Algorithm 1: The curvilinear search for minimizing (7)
1: Input: $S^{(r)}$ satisfying $S^{(r)\top} S^{(r)} = I$; $\varepsilon = 10^{-5}$, $\tau = 10^{-3}$, $\vartheta = 0.2$, $\eta = 0.85$, $h = 1$, $\nu = L(S^{(r)})$, iter = 0
2: repeat
3:   // Compute the skew-symmetric matrix A
4:   $A = \nabla L(S^{(r)})\, S^{(r)\top} - S^{(r)}\, \nabla L(S^{(r)})^\top$;
5:   // Define the search path $P(\tau)$ on the Stiefel manifold and find a suitable Barzilai-Borwein (BB) stepsize $\tau$
6:   repeat
7:     $P(\tau) = \left(I + \frac{\tau}{2} A\right)^{-1} \left(I - \frac{\tau}{2} A\right) S^{(r)}$;
8:     $\tau := \vartheta \tau$;
9:     // Check the BB condition
10:  until $L(P(\tau)) \leq \nu + \tau L'_P(0)$
11:  // Update variables
12:  $S^{(r)} := P(\tau)$;
13:  $h_{\mathrm{new}} := \eta h + 1$; $\nu := \left(\eta h \nu + L(S^{(r)})\right) / h_{\mathrm{new}}$; $h := h_{\mathrm{new}}$; iter := iter + 1;
14: until $\|\nabla L(S^{(r)})\| < \varepsilon$
15: Output: $S^{(r)}$ that minimizes (7)

Algorithm 2: PALM for solving the $S^{(r)}$-subproblem (6)
1: Input: $M^{(r)}$; an arbitrary initial $S^{(r)}$ satisfying $S^{(r)\top} S^{(r)} = I$; the all-one matrix $\Lambda^{(r)}$; $\beta_0$, $\beta_1$; $\sigma_r = 1$, $\rho = 1.2$, iter = 0
2: repeat
3:   // Update $T^{(r)}$
4:   $T^{(r)}_{ij} := \max\left(0,\ S^{(r)}_{ij} + \Lambda^{(r)}_{ij} / \sigma_r\right)$;
5:   // Update $S^{(r)}$ by minimizing Eq. (7) using Algorithm 1
6:   $S^{(r)} := \arg\min_{S^{(r)\top} S^{(r)} = I_{s\times s}} L\left(S^{(r)}, \Lambda^{(r)}, T^{(r)}, \sigma_r\right)$;
7:   // Update variables
8:   $\Lambda^{(r)}_{ij} := \max\left(0,\ \Lambda^{(r)}_{ij} + \sigma_r S^{(r)}_{ij}\right)$; $\sigma_r := \min(\rho \sigma_r, 10^{10})$; iter := iter + 1;
9: until convergence
10: Output: $S^{(r)}$ that minimizes (6)

Learning-to-teach Step

In the learning-to-teach step, the teachers also learn from the learner by absorbing its feedback to adjust the curriculum generation for the next propagation. Specifically, if the learner's performance in the t-th propagation is satisfactory, the teachers may assign more examples to it for the (t+1)-th propagation; otherwise, the teachers should allocate fewer examples to the learner. However, the true labels of the curriculum examples are not available, so we define a learning confidence $\mathrm{Conf}(S^*)$ to evaluate the learner's performance on $S^*$. Intuitively, if $x_i \in S^*$ is assigned label r, namely $Y_{ir} = 1$, the corresponding label score $F_{ir}$ should be as large as possible, because a large $F_{ir}$ indicates that assigning label r to $x_i$ is highly confident. Therefore, the propagation confidence on a single $x_i$ is defined as

$$\mathrm{Conf}(x_i) = \min_{r = 1,\dots,q:\ Y_{ir} = 1} F_{ir}, \qquad (11)$$

based on which we average the confidence over all $x_i \in S^*$ and define

$$\mathrm{Conf}(S^*) = \frac{1}{s} \sum_{x_i \in S^*} \mathrm{Conf}(x_i). \qquad (12)$$

Based on $\mathrm{Conf}(S^*)$, we utilize a sigmoid function to map $\mathrm{Conf}(S^*)$ to a nonlinear learning feedback:

$$\mathrm{Feedback} = \frac{\dfrac{1}{1+\exp\left(-2\alpha\left(\mathrm{Conf}(S^*) - 0.5\right)\right)} - \dfrac{1}{1+\exp(\alpha)}}{\dfrac{1}{1+\exp(-\alpha)} - \dfrac{1}{1+\exp(\alpha)}}. \qquad (13)$$

Eq. (13) has a number of desirable properties, such as being monotonically increasing and satisfying Feedback = 0, 0.5, and 1 when $\mathrm{Conf}(S^*)$ = 0, 0.5, and 1, respectively. The parameter $\alpha$ regulates the steepness of the curve of (13) and is set to 3 throughout this paper (see Fig. 2).

Figure 2: The curve of the learning feedback defined in (13) with different choices of $\alpha$. A larger $\alpha$ leads to a steeper curve.

Using the designed feedback (13), the number of simplest examples selected for the next propagation is $s^{(t+1)} = \left\lceil b^{(t+1)} \cdot \mathrm{Feedback} \right\rceil$, where $b^{(t+1)}$ is the size of the candidate set B in the (t+1)-th propagation and $\lceil \cdot \rceil$ rounds its argument up to the nearest integer.
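The learning-to-teach step reduces to a few lines of arithmetic; the sketch below implements Eqs. (11)-(13) and the curriculum-size rule $s^{(t+1)} = \lceil b^{(t+1)} \cdot \mathrm{Feedback} \rceil$, assuming the label scores in F_S lie in [0, 1]. The function and argument names are illustrative, not taken from the paper.

```python
import numpy as np

def learning_feedback(F_S, Y_S, alpha=3.0):
    """Eqs. (11)-(13): per-example confidence is the smallest score among the
    labels the example received, confidences are averaged over the curriculum,
    and a rescaled sigmoid maps the average into [0, 1] (alpha = 3 as in the paper)."""
    conf_each = [F_S[i, Y_S[i] == 1].min()           # Eq. (11)
                 for i in range(F_S.shape[0]) if Y_S[i].sum() > 0]
    conf = float(np.mean(conf_each)) if conf_each else 0.0   # Eq. (12)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    # Eq. (13): shifted and rescaled so Feedback(0) = 0, Feedback(0.5) = 0.5, Feedback(1) = 1
    return (sig(2 * alpha * (conf - 0.5)) - sig(-alpha)) / (sig(alpha) - sig(-alpha))

def next_curriculum_size(b_next, feedback):
    """s^{(t+1)} = ceil(b^{(t+1)} * Feedback)."""
    return int(np.ceil(b_next * feedback))
```

With alpha = 3 the mapping is gently S-shaped, so moderate changes in the learner's confidence translate into moderate changes of the next curriculum size rather than abrupt jumps.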
Experimental Results

This section first validates several critical steps of the proposed ML-TLLT, and then compares ML-TLLT with seven state-of-the-art methods on five benchmark datasets. Six evaluation metrics for MLL are adopted: ranking loss, average precision, hamming loss, one error, coverage, and Micro F1; their definitions can be found in (Zhang and Zhou 2014). All the adopted datasets come from the MULAN repository (http://mulan.sourceforge.net/datasets-mlc.html). The reported results of all algorithms on all datasets are produced by 5-fold cross validation.

Algorithm Validation

Three critical factors help to boost the performance of ML-TLLT: 1) the simple-to-difficult propagation sequence generated by the teaching-to-learn step; 2) the label correlation term developed in (4); and 3) the feedback (13) designed in the learning-to-teach step. To verify their benefits, we first replace the teaching-to-learn step by randomly selecting the curriculum examples (termed "Random") to highlight the importance of 1), then remove the label correlation term from (4) (termed "No Corr") to demonstrate the contribution of 2), and lastly set the feedback (13) to the constant value 0.5 (denoted "No FB") to show the effect of 3). The results on the Emotions dataset presented in Table 1 clearly indicate that the performance on every metric degrades when any of the three critical factors is removed; all three are therefore indispensable for ML-TLLT to achieve the improved results.

Table 1: Validation of the key steps in our ML-TLLT model. ↑ (↓) indicates that larger (smaller) values of the corresponding metric are better.

Metric | Random | No Corr | No FB | ML-TLLT
Ranking loss ↓ | 0.260±0.020 | 0.260±0.022 | 0.261±0.020 | 0.255±0.019
Average precision ↑ | 0.712±0.024 | 0.713±0.024 | 0.713±0.023 | 0.717±0.023
Hamming loss ↓ | 0.292±0.021 | 0.291±0.025 | 0.290±0.019 | 0.289±0.022
One error ↓ | 0.388±0.538 | 0.385±0.052 | 0.385±0.048 | 0.383±0.051
Coverage ↓ | 2.185±0.128 | 2.194±0.136 | 2.199±0.119 | 2.172±0.109
Micro F1 ↑ | 0.536±0.030 | 0.537±0.035 | 0.539±0.025 | 0.540±0.031

Comparison With Existing Methods

The employed baselines include Multi-Label KNN ["MLKNN", (Zhang and Zhou 2007)], Multi-Label SVM (Elisseeff and Weston 2001) with a linear kernel ["MLSVM (Linear)"] and an RBF kernel ["MLSVM (RBF)"], Multi-Label learning on Tensor Product Graph ["MLTPG", (Jiang 2012)], Semi-supervised Multi-label learning via Sylvester Equation (Chen et al. 2008) instantiated with Harmonic Functions ("SMSE-HF") and with Local and Global Consistency ("SMSE-LGC"), and TRAM (Kong, Ng, and Zhou 2013), which acts as the learner in our algorithm. Five MULAN datasets, Emotions, Yeast, Scene, Corel5K, and Bibtex, are used to test the performance of all the methods. For a fair comparison, we build the same graph for MLKNN, SMSE-HF, SMSE-LGC, TRAM, and ML-TLLT on every dataset, and the number of neighbors K is set to 10, 10, 10, 35, and 35 on Emotions, Yeast, Scene, Corel5K, and Bibtex, respectively. In ML-TLLT, the trade-off parameters $\beta_0$ and $\beta_1$ are set to 1 in all experiments. As suggested by (Chen et al. 2008), we set u = 1 and v = 0.15 in SMSE-HF, and $\beta = \gamma = 1$ in SMSE-LGC. The weighting parameter C in MLSVM (Linear) and MLSVM (RBF) is set to 1.

The results are shown in Table 2. We do not run MLSVM on Corel5K and Bibtex because MLSVM does not scale to these two datasets. The hamming loss and Micro F1 of SMSE are also not reported because (Chen et al. 2008) do not provide an explicit rule for deciding the number of assigned labels. Table 2 suggests that ML-TLLT generally outperforms the other baselines on all the datasets. In particular, ML-TLLT performs better than the plain learner TRAM in almost all cases, which demonstrates the effectiveness of our TLLT strategy.

Conclusion

This paper proposes a novel framework for multi-label propagation, termed Multi-Label Teaching-to-Learn and Learning-to-Teach (ML-TLLT).
As a result of the interactions between the teachers and the learner, all the unlabeled examples are carefully propagated from simple to difficult. They are consequently assigned trustworthy and accurate labels, leading to the superior performance of ML-TLLT over existing state-of-the-art methods.

Table 2: Experimental results of the compared methods on the benchmark datasets. ↑ (↓) indicates that larger (smaller) values of the corresponding metric are better.

Ranking loss ↓ | Emotions | Yeast | Scene | Corel5K | Bibtex
MLSVM (Linear) | 0.294±0.035 | 0.167±0.005 | 0.080±0.005 | - | -
MLSVM (RBF) | 0.415±0.023 | 0.195±0.007 | 0.302±0.020 | - | -
MLKNN | 0.262±0.016 | 0.170±0.006 | 0.076±0.009 | 0.278±0.003 | 0.137±0.003
MLTPG | 0.439±0.037 | 0.239±0.002 | 0.116±0.008 | 0.335±0.004 | 0.192±0.003
SMSE-HF | 0.262±0.015 | 0.166±0.007 | 0.080±0.004 | 0.265±0.004 | 0.127±0.004
SMSE-LGC | 0.273±0.026 | 0.180±0.007 | 0.080±0.007 | 0.251±0.003 | 0.139±0.003
TRAM | 0.263±0.022 | 0.178±0.008 | 0.080±0.006 | 0.297±0.009 | 0.133±0.003
ML-TLLT | 0.255±0.019 | 0.169±0.006 | 0.073±0.007 | 0.271±0.004 | 0.126±0.003

Average precision ↑ | Emotions | Yeast | Scene | Corel5K | Bibtex
MLSVM (Linear) | 0.678±0.023 | 0.761±0.005 | 0.852±0.005 | - | -
MLSVM (RBF) | 0.573±0.012 | 0.713±0.006 | 0.606±0.016 | - | -
MLKNN | 0.702±0.021 | 0.762±0.008 | 0.868±0.013 | 0.111±0.005 | 0.583±0.003
MLTPG | 0.580±0.034 | 0.683±0.006 | 0.831±0.009 | 0.073±0.001 | 0.529±0.003
SMSE-HF | 0.711±0.019 | 0.765±0.006 | 0.864±0.007 | 0.117±0.004 | 0.576±0.004
SMSE-LGC | 0.707±0.034 | 0.746±0.005 | 0.861±0.011 | 0.119±0.003 | 0.571±0.003
TRAM | 0.700±0.025 | 0.752±0.013 | 0.858±0.012 | 0.115±0.004 | 0.561±0.006
ML-TLLT | 0.717±0.023 | 0.765±0.005 | 0.859±0.012 | 0.123±0.002 | 0.575±0.005

Hamming loss ↓ | Emotions | Yeast | Scene | Corel5K | Bibtex
MLSVM (Linear) | 0.289±0.022 | 0.202±0.005 | 0.131±0.005 | - | -
MLSVM (RBF) | 0.330±0.013 | 0.227±0.002 | 0.171±0.002 | - | -
MLKNN | 0.264±0.004 | 0.194±0.003 | 0.087±0.006 | 0.016±0.000 | 0.033±0.000
MLTPG | 0.333±0.034 | 0.300±0.005 | 0.124±0.006 | 0.016±0.000 | 0.044±0.001
SMSE-HF | - | - | - | - | -
SMSE-LGC | - | - | - | - | -
TRAM | 0.306±0.018 | 0.211±0.009 | 0.092±0.007 | 0.030±0.000 | 0.037±0.001
ML-TLLT | 0.289±0.022 | 0.204±0.006 | 0.086±0.007 | 0.029±0.001 | 0.033±0.001

One error ↓ | Emotions | Yeast | Scene | Corel5K | Bibtex
MLSVM (Linear) | 0.481±0.041 | 0.229±0.013 | 0.253±0.004 | - | -
MLSVM (RBF) | 0.555±0.066 | 0.249±0.018 | 0.596±0.019 | - | -
MLKNN | 0.407±0.050 | 0.236±0.019 | 0.222±0.020 | 0.870±0.018 | 0.085±0.003
MLTPG | 0.553±0.068 | 0.249±0.018 | 0.269±0.015 | 0.931±0.009 | 0.088±0.004
SMSE-HF | 0.398±0.038 | 0.234±0.012 | 0.228±0.013 | 0.857±0.007 | 0.080±0.006
SMSE-LGC | 0.386±0.064 | 0.251±0.019 | 0.232±0.019 | 0.867±0.009 | 0.078±0.004
TRAM | 0.415±0.049 | 0.268±0.024 | 0.239±0.022 | 0.857±0.006 | 0.104±0.004
ML-TLLT | 0.383±0.051 | 0.228±0.011 | 0.220±0.021 | 0.853±0.006 | 0.083±0.003

Coverage ↓ | Emotions | Yeast | Scene | Corel5K | Bibtex
MLSVM (Linear) | 2.284±0.202 | 6.306±0.112 | 0.448±0.058 | - | -
MLSVM (RBF) | 2.970±0.283 | 6.608±0.136 | 1.577±0.099 | - | -
MLKNN | 2.266±0.113 | 6.340±0.137 | 0.448±0.058 | 226.031±2.620 | 74.259±1.434
MLTPG | 3.102±0.113 | 8.072±0.037 | 0.647±0.044 | 246.253±1.359 | 89.830±1.196
SMSE-HF | 2.208±0.137 | 6.223±0.152 | 0.470±0.030 | 215.497±1.695 | 71.533±1.957
SMSE-LGC | 2.259±0.142 | 6.213±0.136 | 0.464±0.037 | 209.655±0.956 | 74.935±1.441
TRAM | 2.202±0.151 | 6.341±0.122 | 0.460±0.027 | 226.254±2.552 | 70.123±1.840
ML-TLLT | 2.172±0.109 | 6.265±0.139 | 0.423±0.044 | 217.511±2.032 | 68.539±1.298

Micro F1 ↑ | Emotions | Yeast | Scene | Corel5K | Bibtex
MLSVM (Linear) | 0.510±0.047 | 0.651±0.008 | 0.632±0.019 | - | -
MLSVM (RBF) | 0.310±0.039 | 0.596±0.006 | 0.591±0.013 | - | -
MLKNN | 0.453±0.035 | 0.642±0.010 | 0.734±0.020 | 0.004±0.001 | 0.460±0.005
MLTPG | 0.368±0.026 | 0.520±0.018 | 0.645±0.015 | 0.002±0.000 | 0.482±0.007
SMSE-HF | - | - | - | - | -
SMSE-LGC | - | - | - | - | -
TRAM | 0.513±0.023 | 0.650±0.013 | 0.735±0.021 | 0.099±0.003 | 0.445±0.007
ML-TLLT | 0.540±0.031 | 0.663±0.007 | 0.753±0.021 | 0.107±0.003 | 0.476±0.004

Acknowledgments

This research is supported by NSFC, China (No. 61572315); the 973 Plan, China (No. 2015CB856004); and Australian Research Council Discovery Projects (No. DP-140102164 and No. FT-130101457).

References

Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In ICML, 41-48. ACM.
Bertsekas, D. 2014. Constrained Optimization and Lagrange Multiplier Methods. Academic Press.
Bi, W., and Kwok, J. 2011. Multi-label classification on tree- and DAG-structured hierarchies. In ICML, 17-24.
Boutell, M.; Luo, J.; Shen, X.; and Brown, C. 2004. Learning multi-label scene classification. Pattern Recognition 37(9):1757-1771.
Chen, G.; Song, Y.; Wang, F.; and Zhang, C. 2008. Semi-supervised multi-label learning by solving a Sylvester equation. In SDM, 410-419. SIAM.
Chen, X.; Mu, Y.; Liu, H.; Yan, S.; Rui, Y.; and Chua, T. 2013. Large-scale multilabel propagation based on efficient sparse graph construction. ACM Transactions on Multimedia Computing, Communications, and Applications 10(1):6.
Clare, A., and King, R. 2001. Knowledge discovery in multi-label phenotype data. In ECML-PKDD, 42-53. Springer.
Elisseeff, A., and Weston, J. 2001. A kernel method for multi-labelled classification. In NIPS, 681-687.
Fletcher, R. 2005. On the Barzilai-Borwein method. In Optimization and Control with Applications. Springer. 235-256.
Fürnkranz, J.; Hüllermeier, E.; Mencía, E.; and Brinker, K. 2008. Multilabel classification via calibrated label ranking. Machine Learning 73(2):133-153.
Gong, C.; Tao, D.; Liu, W.; Maybank, S.; Fang, M.; Fu, K.; and Yang, J. 2015. Saliency propagation from simple to difficult. In CVPR, 2531-2539.
Jiang, L.; Meng, D.; Zhao, Q.; Shan, S.; and Hauptmann, A. 2015. Self-paced curriculum learning. In AAAI.
Jiang, J. 2012. Multi-label learning on tensor product graph. In AAAI.
Kang, F.; Jin, R.; and Sukthankar, R. 2006. Correlated label propagation with application to multi-label learning. In CVPR, volume 2, 1719-1726. IEEE.
Khan, F.; Mutlu, B.; and Zhu, X. 2011. How do humans teach: On curriculum learning and teaching dimension. In NIPS, 1449-1457.
Kong, X.; Ng, M.; and Zhou, Z. 2013. Transductive multilabel learning via label set propagation. Knowledge and Data Engineering, IEEE Transactions on 25(3):704-719.
Kumar, M.; Packer, B.; and Koller, D. 2010. Self-paced learning for latent variable models. In NIPS, 1189-1197.
Qiu, H., and Hancock, E. 2007. Clustering and embedding using commute times. Pattern Analysis and Machine Intelligence, IEEE Transactions on 29(11):1873-1890.
Schapire, R., and Singer, Y. 2000. BoosTexter: A boosting-based system for text categorization. Machine Learning 39(2):135-168.
Settles, B. 2010. Active learning literature survey. Technical Report 1648, University of Wisconsin-Madison.
Tang, L., and Liu, H. 2009. Relational learning via latent social dimensions. In KDD, 817-826. ACM.
Wang, H.; Huang, H.; and Ding, C. 2009. Image annotation using multi-label correlated Green's function. In ICCV, 2029-2034. IEEE.
Wang, B.; Tu, Z.; and Tsotsos, J. 2013. Dynamic label propagation for semi-supervised multi-class multi-label classification. In ICCV, 425-432. IEEE.
Wen, Z., and Yin, W. 2013. A feasible method for optimization with orthogonality constraints. Mathematical Programming 142(1-2):397-434.
Xu, M.; Li, Y.; and Zhou, Z. 2013. Multi-label learning with PRO loss. In AAAI.
Xu, C.; Tao, D.; and Xu, C. 2015. Large-margin multi-label causal feature learning. In AAAI.
Zha, Z.; Mei, T.; Wang, J.; Wang, Z.; and Hua, X. 2008. Graph-based semi-supervised learning with multi-label. In ICME. IEEE.
Zhang, M., and Zhou, Z. 2007. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition 40(7):2038-2048.
Zhang, M., and Zhou, Z. 2014. A review on multi-label learning algorithms. Knowledge and Data Engineering, IEEE Transactions on 26(8):1819-1837.