# Label Enhancement for Label Distribution Learning

Ning Xu1,2, An Tao3 and Xin Geng1,2,*

1MOE Key Laboratory of Computer Network and Information Integration, China
2School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
3School of Information Science and Engineering, Southeast University, Nanjing 210096, China
{xning, taoan, xgeng}@seu.edu.cn
*Corresponding author

**Abstract.** Label distribution is more general than both single-label annotation and multi-label annotation. It covers a certain number of labels, representing the degree to which each label describes the instance. The learning process on instances labeled by label distributions is called label distribution learning (LDL). Unfortunately, many training sets only contain simple logical labels rather than label distributions, because it is difficult to obtain label distributions directly. One way to solve this problem is to recover the label distributions from the logical labels in the training set by leveraging the topological information of the feature space and the correlation among the labels. This process of recovering label distributions from logical labels is defined as label enhancement (LE), which reinforces the supervision information in the training set. This paper proposes a novel LE algorithm called Graph Laplacian Label Enhancement (GLLE). Experimental results on one artificial dataset and fourteen real-world datasets show clear advantages of GLLE over several existing LE algorithms.

## 1 Introduction

Learning with ambiguity is a hot topic in recent machine learning and data mining research. A learning process essentially builds a mapping from instances to labels. This paper focuses on the ambiguity at the label side of the mapping, i.e., one instance is not necessarily mapped to exactly one label. Multi-label learning (MLL) [Tsoumakas and Katakis, 2006] studies the problem where each example is represented by a single instance while associated with a set of labels simultaneously, and the task is to learn a multi-label predictor which maps an instance to a relevant label set [Gibaja and Ventura, 2015; Zhang and Zhou, 2014]. During the past decade, multi-label learning techniques have been widely employed to learn from data with rich semantics, such as text [Rubin et al., 2012], image [Cabral et al., 2011], audio [Lo et al., 2011], video [Wang et al., 2011], etc.

In most supervised data, an instance $x$ is assigned a value $l_x^y \in \{0, 1\}$ for each possible label $y$, representing whether $y$ describes $x$. In this paper, $l_x^y$ is called a logical label, as it reflects the logical relationship between the label and the instance. A logical label answers the essential question of whether a label can describe the instance, but it does not convey the explicit relative importance of each label. To solve this problem, a more natural way to label an instance $x$ is to assign a real number $d_x^y$ to each possible label $y$, representing the degree to which $y$ describes $x$. Without loss of generality, assume that $d_x^y \in [0, 1]$. Further suppose that the label set is complete, i.e., using all the labels in the set can always fully describe the instance. Then $\sum_y d_x^y = 1$. Such $d_x^y$ is called the description degree of $y$ to $x$. For a particular instance, the description degrees of all the labels constitute a real-valued vector called the label distribution, which describes the instance more comprehensively than logical labels.
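To make the distinction concrete, the following minimal Python sketch contrasts a logical label vector with a label distribution whose description degrees sum to one; the numeric values are purely hypothetical and chosen only for illustration.

```python
import numpy as np

# A hypothetical instance with c = 4 possible labels.
# Logical labels only say which labels apply (0/1).
logical = np.array([1, 0, 1, 0])

# A label distribution assigns each label a description degree
# d_x^y in [0, 1], with the degrees summing to 1.
distribution = np.array([0.6, 0.08, 0.3, 0.02])

assert set(np.unique(logical)) <= {0, 1}
assert np.isclose(distribution.sum(), 1.0)
assert np.all((distribution >= 0) & (distribution <= 1))

# The distribution carries strictly more information: it keeps the
# relative importance of the relevant labels (0.6 vs. 0.3) and the
# differing degrees of irrelevance of the others (0.08 vs. 0.02).
```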
The learning process on instances labeled by label distributions is therefore called label distribution learning (LDL) [Geng, 2016]. Label distribution is more general than logical labels in most supervised learning problems because the relevance or irrelevance of a label to an instance is essentially relative, in mainly three aspects:

- The differentiation between relevant and irrelevant labels is relative. A bipartite partition of the label set into relevant and irrelevant labels with respect to an instance is actually a simplification of the real problem. In many cases, the boundary between relevant and irrelevant labels is not clear. For example, in emotion analysis from facial expressions, a facial expression often conveys a complex mixture of basic emotions (e.g., happiness, sadness, surprise, anger, disgust and fear) [Zhou et al., 2015]. As shown in Fig. 1a, different basic emotions exhibit different intensities for the same expression. The partition between relevant and irrelevant emotions depends on the choice of a threshold, but there is no absolute criterion to determine that threshold.
- When multiple labels are associated with an instance, their relative importance is more likely to differ than to be exactly equal. For example, in Fig. 1b, a natural scene image may be annotated with the labels sky, water, building and cloud simultaneously, but the relative importance of each label to this image is different.
- The irrelevance of each irrelevant label may also be very different. For example, in Fig. 1c, for a car, the label airplane is more irrelevant than the label tank.

Figure 1: Three examples about the relevance or irrelevance of each label. (a) Emotion; (b) natural scene.

However, in most training sets, the label distribution is not explicitly available. It is difficult to obtain label distributions directly because the process of quantifying the description degrees is costly. Therefore, we need a way to recover the label distributions from the logical labels in the training set by leveraging the topological information of the feature space and the correlation among the labels. This process is called label enhancement (LE) in this paper. LE reinforces the supervision information in the training set by exploiting the relative importance of each label. After the label distributions are recovered, more effective supervised learning can be achieved by leveraging them [Li et al., 2015; Hou et al., 2016].

Note that although no explicit concept of LE has been defined in the existing literature, some methods with a similar function have been proposed. For example, logical labels are transformed into a discretized bivariate Gaussian label distribution centered at the coarse ground-truth label by using prior knowledge in head pose estimation [Geng and Xia, 2014] and facial age estimation [Geng et al., 2014]. Some works [Gayar et al., 2006; Jiang et al., 2006] build membership degrees to the labels, which can constitute a label distribution. Other works [Li et al., 2015; Hou et al., 2016] establish the relationship between instances and labels by a graph and transform logical labels into label distributions.

The rest of this paper is organized as follows. First, the formulation of LE and the details of the LE algorithms are presented in Section 2. After that, the results of the comparative experiments are reported in Section 3.
Finally, conclusions are drawn in Section 4.

## 2 Label Enhancement

### 2.1 Formulation of Label Enhancement

First of all, the main notations used in this paper are listed as follows. The instance variable is denoted by $x$, the particular $i$-th instance by $x_i$, the label variable by $y$, and the particular $j$-th label value by $y_j$. The logical label vector of $x_i$ is denoted by $\boldsymbol{l}_i = (l_{x_i}^{y_1}, l_{x_i}^{y_2}, \ldots, l_{x_i}^{y_c})^\top$, where $c$ is the number of possible labels. The description degree of $y$ to $x$ is denoted by $d_x^y$, and the label distribution of $x_i$ by $\boldsymbol{d}_i = (d_{x_i}^{y_1}, d_{x_i}^{y_2}, \ldots, d_{x_i}^{y_c})^\top$. Let $\mathcal{X} = \mathbb{R}^q$ denote the $q$-dimensional feature space. Then, the process of LE can be defined as follows. Given a training set $S = \{(x_i, \boldsymbol{l}_i) \mid 1 \le i \le n\}$, where $x_i \in \mathcal{X}$ and $\boldsymbol{l}_i \in \{0,1\}^c$, LE recovers the label distribution $\boldsymbol{d}_i$ of $x_i$ from the logical label vector $\boldsymbol{l}_i$, and thus transforms $S$ into an LDL training set $E = \{(x_i, \boldsymbol{d}_i) \mid 1 \le i \le n\}$.

### 2.2 Existing Label Enhancement Algorithms

#### Fuzzy Label Enhancement

The LE algorithm based on fuzzy clustering (FCM) [Gayar et al., 2006] employs fuzzy C-means clustering [Castillo and Melin, 2005], which attempts to cluster feature vectors by iteratively minimizing an objective function. Suppose that fuzzy C-means clustering divides the training set $S$ into $p$ clusters and $\mu_k$ denotes the $k$-th cluster prototype. Then, the membership degree of $x_i$ to the $k$-th cluster is calculated by

$$
m_{x_i}^{k} = \frac{1}{\sum_{j=1}^{p}\left(\frac{\mathrm{Dist}(x_i, \mu_k)}{\mathrm{Dist}(x_i, \mu_j)}\right)^{\frac{1}{\beta-1}}}, \tag{1}
$$

where $\beta > 1$ and $\mathrm{Dist}(\cdot,\cdot)$ is the Euclidean distance. Then, the matrix $A$ providing soft connections between classes and clusters is constructed by initializing a $c \times p$ zero matrix $A$ and updating each row $A_j$ through

$$
A_j = A_j + \boldsymbol{m}_{x_i}, \quad \text{if } l_{x_i}^{y_j} = 1, \tag{2}
$$

where $\boldsymbol{m}_{x_i} = (m_{x_i}^{1}, m_{x_i}^{2}, \ldots, m_{x_i}^{p})$. Then, the membership degree vector of $x_i$ to the labels is calculated by the fuzzy composition $\boldsymbol{d}_i = A \circ \boldsymbol{m}_{x_i}$. Finally, the label distribution corresponding to each instance is generated via the softmax normalization $d_{x_i}^{y} = \frac{\exp(d_{x_i}^{y})}{\sum_{y'} \exp(d_{x_i}^{y'})}$.

For each label $y_j$, the LE algorithm based on the kernel method (KM) [Jiang et al., 2006] divides $S$ into two sets, $C_{+}^{y_j}$ and $C_{-}^{y_j}$: $C_{+}^{y_j}$ contains the sample points $x_i$ with $l_{x_i}^{y_j} = 1$ and $C_{-}^{y_j}$ contains the sample points $x_i$ with $l_{x_i}^{y_j} = 0$. Then, the center of $C_{+}^{y_j}$ in the feature space is defined by $\psi_{y_j} = \frac{1}{n_+}\sum_{x_i \in C_{+}^{y_j}} \varphi(x_i)$, where $n_+$ is the number of samples in $C_{+}^{y_j}$ and $\varphi(x_i)$ is a nonlinear transformation of $x$ to a higher-dimensional feature space. Then, the radius of $C_{+}^{y_j}$ is calculated by

$$
r = \max_{x_i \in C_{+}^{y_j}} \left\| \psi_{y_j} - \varphi(x_i) \right\|. \tag{3}
$$

The distance between a sample $x_i \in C_{+}^{y_j}$ and the center of $C_{+}^{y_j}$ is calculated by

$$
s_i = \left\| \varphi(x_i) - \psi_{y_j} \right\|. \tag{4}
$$

The calculations involving $\varphi(x_i)$ can be obtained indirectly through the kernel function $K(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle$. Then, the membership degree of $x_i$ to each label can be calculated by

$$
d_{x_i}^{y_j} =
\begin{cases}
1 - \sqrt{\dfrac{s_i^2}{r^2 + \delta}} & \text{if } l_{x_i}^{y_j} = 1, \\[2mm]
0 & \text{if } l_{x_i}^{y_j} = 0,
\end{cases} \tag{5}
$$

where $\delta > 0$. Finally, the membership degrees are transformed into the label distribution via the softmax normalization.
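As a concrete illustration, here is a minimal Python sketch of the FCM-based LE step described above (Eqs. (1)-(2), fuzzy composition, softmax normalization). It is not the reference implementation of [Gayar et al., 2006]: the cluster prototypes are obtained with plain k-means as a stand-in for fuzzy C-means, and max-min composition is assumed for $A \circ \boldsymbol{m}_{x_i}$; the function name and array layout are ours.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def fcm_label_enhancement(X, L, p=10, beta=2.0):
    """Sketch of FCM-based LE. X: (n, q) features, L: (n, c) logical labels (0/1)."""
    n, c = L.shape
    # Cluster prototypes mu_k; k-means is used here only as a stand-in
    # for the prototypes produced by fuzzy C-means.
    mu = KMeans(n_clusters=p, n_init=10).fit(X).cluster_centers_
    dist = cdist(X, mu) + 1e-12                        # Dist(x_i, mu_k)
    # Eq. (1): membership degree of x_i to cluster k.
    ratio = dist[:, :, None] / dist[:, None, :]        # dist_k / dist_j
    M = 1.0 / np.sum(ratio ** (1.0 / (beta - 1.0)), axis=2)   # (n, p)
    # Eq. (2): soft class-cluster connection matrix A (c x p).
    A = np.zeros((c, p))
    for j in range(c):
        A[j] = M[L[:, j] == 1].sum(axis=0)
    # Fuzzy composition d_i = A o m_{x_i}; max-min composition is an assumption here.
    D = np.array([[np.max(np.minimum(A[j], M[i])) for j in range(c)]
                  for i in range(n)])
    # Softmax normalization so each row is a label distribution.
    E = np.exp(D - D.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)
```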
#### Graph-based Label Enhancement

The LE algorithm based on label propagation (LP) [Li et al., 2015] recovers the label distributions from logical labels by using an iterative label propagation technique [Zhu and Goldberg, 2009]. Let $G$ denote the fully-connected graph constructed over $S$; the $n \times n$ symmetric similarity matrix $A$ is then specified for $G$ as

$$
a_{ij} =
\begin{cases}
\exp\left(-\dfrac{\|x_i - x_j\|^2}{2}\right) & \text{if } i \ne j, \\[2mm]
0 & \text{if } i = j.
\end{cases} \tag{6}
$$

Correspondingly, the label propagation matrix $P$ is constructed from the similarity matrix through $P = \hat{A}^{-\frac{1}{2}} A \hat{A}^{-\frac{1}{2}}$, where $\hat{A}$ is the diagonal matrix with elements $\hat{a}_{ii} = \sum_{j=1}^{n} a_{ij}$. At the $t$-th iteration, the label distribution matrix $D = [\boldsymbol{d}_1, \boldsymbol{d}_2, \ldots, \boldsymbol{d}_n]$ is updated by propagating the labeling-importance information with the label propagation matrix $P$ as

$$
D^{(t)} = \alpha P D^{(t-1)} + (1-\alpha) L, \tag{7}
$$

where $L = [\boldsymbol{l}_1, \boldsymbol{l}_2, \ldots, \boldsymbol{l}_n]$ is the logical label matrix of the training set and the initial matrix is $D^{(0)} = L$. Specifically, $\alpha \in (0, 1)$ is the balancing parameter which controls the fraction of the information inherited from the label propagation and from the logical label matrix. Finally, $D^{(t)}$ converges to $D^{*}$, and the label distributions are normalized by using the softmax normalization.

The LE algorithm based on manifold learning (ML) [Hou et al., 2016] considers that the topological structure of the feature space can be represented by a graph $G$, with a weight matrix $W$ whose element $w_{ij}$ represents the weight of the relationship between $x_i$ and $x_j$. This method assumes that each data point can be optimally reconstructed by a linear combination of its neighbors [Roweis and Saul, 2000; Wang and Zhang, 2008]. The approximation of the feature manifold then leads to the minimization of

$$
\sum_{i=1}^{n} \left\| x_i - \sum_{j \ne i} w_{ij} x_j \right\|^2, \tag{8}
$$

where $w_{ij} = 0$ unless $x_j$ is one of $x_i$'s $K$-nearest neighbors, and $\sum_{j} w_{ij} = 1$. According to the smoothness assumption [Zhu et al., 2005], the topological structure of the feature space can be transferred to the label space local by local. The reconstruction of the label manifold then leads to the minimization of

$$
\sum_{i=1}^{n} \left\| \boldsymbol{d}_i - \sum_{j \ne i} w_{ij} \boldsymbol{d}_j \right\|^2 \quad \text{s.t.} \quad d_{x_i}^{y_j} \, l_{x_i}^{y_j} > \lambda, \; 1 \le i \le n, \; 1 \le j \le c, \tag{9}
$$

where $\lambda > 0$. The label distributions are generated from this optimization by a constrained quadratic programming process. Finally, $\boldsymbol{d}_i$ is normalized by using the softmax normalization.
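Of the two graph-based methods above, LP has a particularly compact update, so a short sketch may help. The following Python code (an illustrative sketch, not the implementation of [Li et al., 2015]) builds the similarity and propagation matrices of Eq. (6) and iterates Eq. (7) until convergence, followed by the softmax normalization; the convergence tolerance and iteration cap are our own assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def lp_label_enhancement(X, L, alpha=0.5, max_iter=1000, tol=1e-6):
    """Sketch of LP-based LE. X: (n, q) features, L: (n, c) logical labels (0/1)."""
    # Eq. (6): fully-connected similarity matrix with zero diagonal.
    A = np.exp(-cdist(X, X, "sqeuclidean") / 2.0)
    np.fill_diagonal(A, 0.0)
    # Label propagation matrix P = A_hat^{-1/2} A A_hat^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    P = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    # Eq. (7): D^(t) = alpha * P * D^(t-1) + (1 - alpha) * L, starting from D^(0) = L.
    D = L.astype(float).copy()
    for _ in range(max_iter):                     # iteration cap is an illustrative choice
        D_new = alpha * (P @ D) + (1.0 - alpha) * L
        if np.abs(D_new - D).max() < tol:         # tolerance is an illustrative choice
            D = D_new
            break
        D = D_new
    # Softmax normalization so each row is a label distribution.
    E = np.exp(D - D.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)
```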
### 2.3 The GLLE Algorithm

This section proposes a new LE algorithm named Graph Laplacian Label Enhancement (GLLE). Given a training set $S$, we construct the feature matrix $X = [x_1, x_2, \ldots, x_n]$ and the logical label matrix $L = [\boldsymbol{l}_1, \boldsymbol{l}_2, \ldots, \boldsymbol{l}_n]$. Our aim is to recover the label distribution matrix $D = [\boldsymbol{d}_1, \boldsymbol{d}_2, \ldots, \boldsymbol{d}_n]$ from the logical label matrix $L$. To this end, we consider the model

$$
\boldsymbol{d}_i = W^\top \varphi(x_i) + \boldsymbol{b} = \hat{W} \boldsymbol{\phi}_i, \tag{10}
$$

where $W = [\boldsymbol{w}_1, \ldots, \boldsymbol{w}_c]$ is a weight matrix, $\boldsymbol{b} \in \mathbb{R}^c$ is a bias vector, and $\varphi(x)$ is a nonlinear transformation of $x$ to a higher-dimensional feature space. For convenience of description, we set $\hat{W} = [W^\top, \boldsymbol{b}]$ and $\boldsymbol{\phi}_i = [\varphi(x_i); 1]$. Accordingly, the goal of our method is to determine the best parameter $\hat{W}$ that can generate a reasonable label distribution $\boldsymbol{d}_i$ given the instance $x_i$. The optimization problem then becomes

$$
\min_{\hat{W}} \; L(\hat{W}) + \lambda\, \Omega(\hat{W}), \tag{11}
$$

where $L$ is a loss function, $\Omega$ is the function used to mine the hidden label importance, and $\lambda$ is the parameter trading off the two terms. Note that LE is essentially a pre-processing step applied to the training set, which is different from standard supervised learning; therefore, our optimization does not need to consider the overfitting problem. Since the information in the label distributions is inherited from the initial logical labels, we choose the least squares (LS) loss function

$$
L(\hat{W}) = \sum_{i=1}^{n} \left\| \hat{W}\boldsymbol{\phi}_i - \boldsymbol{l}_i \right\|^2 = \mathrm{tr}\!\left[ (\hat{W}\Phi - L)^\top (\hat{W}\Phi - L) \right], \tag{12}
$$

where $\Phi = [\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_n]$.

In order to mine the hidden label importance from the training examples by leveraging the topological information of the feature space, we specify the local similarity matrix $A$ whose elements are calculated by

$$
a_{ij} =
\begin{cases}
\exp\left(-\dfrac{\|x_i - x_j\|^2}{2\sigma^2}\right) & \text{if } x_j \in N(i), \\[2mm]
0 & \text{otherwise},
\end{cases} \tag{13}
$$

where $N(i)$ is the set of $x_i$'s $K$-nearest neighbors, and $\sigma > 0$ is the width parameter for the similarity calculation, which is fixed to 1 in this paper. According to the smoothness assumption [Zhu et al., 2005], points close to each other are more likely to share a label. Intuitively, if $x_i$ and $x_j$ have a high degree of similarity, as measured by $a_{ij}$, then $\boldsymbol{d}_i$ and $\boldsymbol{d}_j$ should be near to one another. This intuition leads to the following function, which we wish to minimize:

$$
\Omega(\hat{W}) = \sum_{i,j} a_{ij} \left\| \boldsymbol{d}_i - \boldsymbol{d}_j \right\|^2 = \mathrm{tr}\!\left( \hat{W} \Phi G \Phi^\top \hat{W}^\top \right), \tag{14}
$$

where $G = \hat{A} - A$ is the graph Laplacian and $\hat{A}$ is the diagonal matrix whose elements are $\hat{a}_{ii} = \sum_{j=1}^{n} a_{ij}$.

Formulating the LE problem into an optimization framework over Eq. (12) and Eq. (14) yields the target function of $\hat{W}$:

$$
T(\hat{W}) = \mathrm{tr}\!\left[ (\hat{W}\Phi - L)^\top (\hat{W}\Phi - L) \right] + \lambda\, \mathrm{tr}\!\left( \hat{W} \Phi G \Phi^\top \hat{W}^\top \right). \tag{15}
$$

The optimization of Eq. (15) uses the effective quasi-Newton method BFGS [Nocedal and Wright, 2006]. For the target function $T(\hat{W})$, the computation of BFGS is mainly related to the first-order gradient, which can be obtained through

$$
\frac{\partial T(\hat{W})}{\partial \hat{W}} = 2\hat{W}\Phi\Phi^\top - 2L\Phi^\top + \lambda\,\hat{W}\Phi G^\top \Phi^\top + \lambda\,\hat{W}\Phi G \Phi^\top. \tag{16}
$$

When the best parameter $\hat{W}$ is determined, the label distribution $\boldsymbol{d}_i$ can be generated through Eq. (10). Finally, we normalize $\boldsymbol{d}_i$ by using the softmax normalization.

According to the representer theorem [Smola, 1999], under fairly general conditions, a learning problem can be expressed as a linear combination of the training examples in the feature space, i.e., $\boldsymbol{w}_j = \sum_i \theta_{ji}\,\varphi(x_i)$. If we substitute this expression into Eq. (15) and Eq. (16), the inner product $\langle \varphi(x_i), \varphi(x_j) \rangle$ appears, and then the kernel trick can be applied.
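To make the procedure concrete, here is a compact Python sketch of GLLE under a linear kernel, i.e., $\varphi(x) = x$ (an assumption made only to keep the example short; the paper uses a Gaussian kernel and the authors' implementation may differ). It builds the K-nearest-neighbor similarity matrix of Eq. (13), minimizes the target function of Eq. (15) with the gradient of Eq. (16) via SciPy's BFGS, and applies the softmax normalization.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

def glle_linear(X, L, lam=0.01, K=None, sigma=1.0):
    """Sketch of GLLE with phi(x) = x. X: (n, q) features, L: (n, c) logical labels."""
    n, q = X.shape
    c = L.shape[1]
    K = K or (c + 1)                               # number of neighbors, K = c + 1 in the paper
    # Eq. (13): local similarity matrix over K-nearest neighbors.
    dist = cdist(X, X, "sqeuclidean")
    A = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dist[i])[1:K + 1]        # skip the point itself
        A[i, nbrs] = np.exp(-dist[i, nbrs] / (2 * sigma ** 2))
    G = np.diag(A.sum(axis=1)) - A                 # graph Laplacian G = A_hat - A
    Phi = np.hstack([X, np.ones((n, 1))])          # phi_i = [x_i; 1], stored row-wise: (n, q+1)
    Lmat = L.astype(float)

    def target(w_flat):                            # Eq. (15)
        W_hat = w_flat.reshape(c, q + 1)
        R = Phi @ W_hat.T - Lmat                   # residuals of the LS term, Eq. (12)
        return np.trace(R.T @ R) + lam * np.trace(W_hat @ Phi.T @ G @ Phi @ W_hat.T)

    def grad(w_flat):                              # Eq. (16)
        W_hat = w_flat.reshape(c, q + 1)
        g = 2 * (W_hat @ Phi.T @ Phi - Lmat.T @ Phi) \
            + lam * W_hat @ Phi.T @ (G + G.T) @ Phi
        return g.ravel()

    res = minimize(target, np.zeros(c * (q + 1)), jac=grad, method="BFGS")
    D = Phi @ res.x.reshape(c, q + 1).T            # Eq. (10)
    E = np.exp(D - D.max(axis=1, keepdims=True))   # softmax normalization
    return E / E.sum(axis=1, keepdims=True)
```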
## 3 Experiments

### 3.1 Datasets

There are in total 15 datasets used in the experiments, including an artificial toy dataset and 14 real-world datasets¹. Some basic statistics about these 15 datasets are given in Table 1.

¹http://cse.seu.edu.cn/PersonalPage/xgeng/LDL/index.htm

| No. | Dataset | #Examples | #Features | #Labels |
|-----|---------------|-----------|-----------|---------|
| 1 | Artificial | 2,601 | 3 | 3 |
| 2 | SJAFFE | 213 | 243 | 6 |
| 3 | Natural Scene | 2,000 | 294 | 9 |
| 4 | Yeast-spoem | 2,465 | 24 | 2 |
| 5 | Yeast-spo5 | 2,465 | 24 | 3 |
| 6 | Yeast-dtt | 2,465 | 24 | 4 |
| 7 | Yeast-cold | 2,465 | 24 | 4 |
| 8 | Yeast-heat | 2,465 | 24 | 6 |
| 9 | Yeast-spo | 2,465 | 24 | 6 |
| 10 | Yeast-diau | 2,465 | 24 | 7 |
| 11 | Yeast-elu | 2,465 | 24 | 14 |
| 12 | Yeast-cdc | 2,465 | 24 | 15 |
| 13 | Yeast-alpha | 2,465 | 24 | 18 |
| 14 | SBU_3DFE | 2,500 | 243 | 6 |
| 15 | Movie | 7,755 | 1,869 | 5 |

Table 1: Statistics of the 15 datasets used in the experiments.

The first dataset is an artificial toy dataset generated to show, in a direct and visual way, whether the LE algorithms can recover the label distributions from the logical labels. In this dataset, the instance $x$ is three-dimensional and there are three labels. The label distribution $\boldsymbol{d} = (d_x^{y_1}, d_x^{y_2}, d_x^{y_3})^\top$ of $x = (x_1, x_2, x_3)^\top$ is created in the following way:

$$
t_i = a x_i + b x_i^2 + c x_i^3 + d, \quad i = 1, \ldots, 3, \tag{17}
$$

$$
\psi_1 = (\boldsymbol{h}_1^\top \boldsymbol{t})^2, \quad
\psi_2 = (\boldsymbol{h}_2^\top \boldsymbol{t} + \beta_1 \psi_1)^2, \quad
\psi_3 = (\boldsymbol{h}_3^\top \boldsymbol{t} + \beta_2 \psi_2)^2, \tag{18}
$$

$$
d_x^{y_i} = \frac{\psi_i}{\psi_1 + \psi_2 + \psi_3}, \quad i = 1, \ldots, 3, \tag{19}
$$

where $\boldsymbol{t} = (t_1, t_2, t_3)^\top$, $x_i \in [-1, 1]$, $a = 1$, $b = 0.5$, $c = 0.2$, $d = 1$, $\boldsymbol{h}_1 = (4, 2, 1)^\top$, $\boldsymbol{h}_2 = (1, 2, 4)^\top$, $\boldsymbol{h}_3 = (1, 4, 2)^\top$, and $\beta_1 = \beta_2 = 0.01$.

In order to show the results of the LE algorithms in a direct and visual way, the examples of the toy dataset are selected from a certain manifold in the feature space. The first two components of the instance $x$, i.e., $x_1$ and $x_2$, are located on a grid with interval 0.04 within the range $[-1, 1]$, so there are in total $51 \times 51 = 2601$ instances. The third component $x_3$ is calculated by

$$
x_3 = \sin((x_1 + x_2)\,\pi). \tag{20}
$$

Then, the label distribution $\boldsymbol{d}$ corresponding to each instance is calculated via Eqs. (17)-(19). The remaining fourteen datasets are real-world LDL datasets [Geng, 2016] collected from biological experiments on the yeast genes, facial expression images, natural scene images and movies, respectively.
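For reference, the following Python snippet reproduces the toy-dataset construction of Eqs. (17)-(20) described above; the vectorized layout and variable names are our own, but the constants are those given in the paper.

```python
import numpy as np

# Constants from Eqs. (17)-(19).
a, b, c, d = 1.0, 0.5, 0.2, 1.0
h1 = np.array([4.0, 2.0, 1.0])
h2 = np.array([1.0, 2.0, 4.0])
h3 = np.array([1.0, 4.0, 2.0])
beta1 = beta2 = 0.01

# Grid of the first two components with interval 0.04 on [-1, 1]; Eq. (20) gives x3.
grid = np.arange(-1.0, 1.0 + 1e-9, 0.04)                  # 51 points
x1, x2 = np.meshgrid(grid, grid)
x1, x2 = x1.ravel(), x2.ravel()                           # 51 * 51 = 2601 instances
x3 = np.sin((x1 + x2) * np.pi)
X = np.stack([x1, x2, x3], axis=1)                        # (2601, 3) feature matrix

# Eq. (17): elementwise polynomial transform of each component.
T = a * X + b * X ** 2 + c * X ** 3 + d                   # (2601, 3)

# Eq. (18): coupled quadratic forms.
psi1 = (T @ h1) ** 2
psi2 = (T @ h2 + beta1 * psi1) ** 2
psi3 = (T @ h3 + beta2 * psi2) ** 2

# Eq. (19): normalize to obtain the ground-truth label distributions.
Psi = np.stack([psi1, psi2, psi3], axis=1)
D = Psi / Psi.sum(axis=1, keepdims=True)                  # rows sum to 1
assert np.allclose(D.sum(axis=1), 1.0)
```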
### 3.2 Evaluation Measures

In order to compare the recovered label distribution with the ground truth, a natural choice of evaluation measure is the average distance or similarity between the recovered label distribution and the ground-truth label distribution. Following Geng's suggestion [Geng, 2016], we select six LDL measures, i.e., Chebyshev distance (Cheb), Clark distance (Clark), Canberra metric (Canber), Kullback-Leibler divergence (KL), cosine coefficient (Cosine) and intersection similarity (Intersec). The first four are distance measures and the last two are similarity measures. Due to the page limitation, we only show representative results on Cheb and Cosine; the results on the other evaluation measures are similar.

### 3.3 Methodology

The four algorithms described in Section 2.2, i.e., FCM [Gayar et al., 2006], KM [Jiang et al., 2006], LP [Li et al., 2015] and ML [Hou et al., 2016], together with our GLLE, are all applied to the 15 datasets shown in Table 1.

We consider the following LDL learning setting. With each instance, a label distribution is associated. The training set, however, contains for each instance not the actual distribution, but a set of labels. This set includes the labels with the highest weights in the distribution, and is the smallest set such that the sum of these weights exceeds a given threshold. This setting can model, for instance, the way in which users label images or add keywords to texts: it assumes that users add labels starting with the most relevant ones, until they feel the labeling is sufficiently complete. Therefore, the logical labels in the datasets can be binarized from the real label distributions as follows (see the sketch at the end of this subsection). For each instance $x$, the greatest description degree $d_x^{y_j}$ is found, and the corresponding label $y_j$ is set to be a relevant label, i.e., $l_x^{y_j} = 1$. Then, we calculate the sum of the description degrees of all the current relevant labels, $H = \sum_{y_j \in Y^{+}} d_x^{y_j}$, where $Y^{+}$ is the set of the current relevant labels. If $H$ is less than a predefined threshold $T$, we continue to find the greatest description degree among the labels not yet in $Y^{+}$ and add the corresponding label to $Y^{+}$. This process continues until $H > T$. Finally, the logical labels of the labels in $Y^{+}$ are set to 1, and all other logical labels are set to 0. In our experiments, $T = 0.5$.

There are two parts in the experiments. In the first part, we recover the label distributions from the logical labels via the LE algorithms, and then compare the recovered label distributions with the ground-truth label distributions. In the second part, in order to further test the effectiveness of LDL after the LE pre-process on logically labeled datasets, we first recover the label distributions from the logical labels via the LE algorithms, and then use the recovered label distributions for LDL training. Finally, the trained LDL models are tested on the new test dataset, and the label distribution predictions are compared with those made by the model directly trained on the ground-truth label distributions. Ten-fold cross validation is conducted for each algorithm.

For GLLE, the parameter $\lambda$ is chosen among $\{10^{-2}, 10^{-1}, \ldots, 100\}$ and the number of neighbors $K$ is set to $c + 1$. The kernel function in GLLE is the Gaussian kernel. The parameter $\alpha$ in LP is set to 0.5. The number of neighbors $K$ for ML is set to $c + 1$. The parameter $\beta$ in FCM is set to 2. The kernel function in KM is the Gaussian kernel. The LDL algorithm used in this paper is SA-BFGS [Geng, 2016].
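Below is a small Python sketch of the binarization procedure just described, which derives logical labels from a ground-truth label distribution with threshold $T = 0.5$; the function name and array layout are ours.

```python
import numpy as np

def binarize_distribution(d, T=0.5):
    """Greedily mark labels as relevant, in decreasing order of description
    degree, until the accumulated degree exceeds the threshold T."""
    d = np.asarray(d, dtype=float)
    order = np.argsort(-d)              # labels sorted by description degree
    logical = np.zeros_like(d, dtype=int)
    H = 0.0
    for j in order:
        logical[j] = 1                  # add the next most-described label to Y+
        H += d[j]
        if H > T:                       # stop once the mass of Y+ exceeds T
            break
    return logical

# Example: only the two most-described labels end up relevant.
print(binarize_distribution([0.45, 0.30, 0.15, 0.10]))   # -> [1 1 0 0]
```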
### 3.4 Experimental Results

#### Recovery Performance

In order to visually show the results of the LE algorithms on the artificial dataset, the description degrees of the three labels are regarded as the three channels of the RGB color space, respectively. In this way, the color of a point in the feature space visually represents its label distribution, and the label distributions recovered by the LE algorithms can be compared with the ground-truth label distribution by observing the color patterns on the manifold. For easier comparison, the images are visually enhanced by applying a decorrelation stretch process. The results are shown in Fig. 2.

Figure 2: Comparison between the ground-truth and recovered label distributions (regarded as RGB colors) on the artificial manifold. (a) Ground-Truth.

It can be seen that GLLE recovers color patterns almost identical to the ground truth. LP, ML and FCM can also recover color patterns similar to the ground truth. However, KM fails to obtain a reasonable result.

For quantitative analysis, Table 2 and Table 3 tabulate the results of the five LE algorithms on all the datasets evaluated by Cheb and Cosine, and the best performance on each dataset is highlighted in boldface. For each evaluation metric, ↓ indicates "the smaller the better" while ↑ indicates "the larger the better". Note that since each LE algorithm only runs once, there is no record of standard deviation.

| Datasets | FCM | KM | LP | ML | GLLE |
|---|---|---|---|---|---|
| Artificial | 0.188(3) | 0.260(5) | 0.130(2) | 0.227(4) | **0.108(1)** |
| SJAFFE | 0.132(3) | 0.214(5) | 0.107(2) | 0.190(4) | **0.100(1)** |
| Natural Scene | 0.368(5) | 0.306(4) | **0.275(1)** | 0.295(2) | 0.296(3) |
| Yeast-spoem | 0.233(3) | 0.408(5) | 0.163(2) | 0.400(4) | **0.108(1)** |
| Yeast-spo5 | 0.162(3) | 0.277(5) | 0.114(2) | 0.273(4) | **0.092(1)** |
| Yeast-dtt | 0.097(2) | 0.257(5) | 0.128(3) | 0.244(4) | **0.065(1)** |
| Yeast-cold | 0.141(3) | 0.252(5) | 0.137(2) | 0.242(4) | **0.093(1)** |
| Yeast-heat | 0.169(4) | 0.175(5) | 0.086(2) | 0.165(3) | **0.056(1)** |
| Yeast-spo | 0.130(3) | 0.175(5) | 0.090(2) | 0.171(4) | **0.067(1)** |
| Yeast-diau | 0.124(3) | 0.152(5) | 0.099(2) | 0.148(4) | **0.084(1)** |
| Yeast-elu | 0.052(3) | 0.078(5) | 0.044(2) | 0.072(4) | **0.030(1)** |
| Yeast-cdc | 0.051(3) | 0.076(5) | 0.042(2) | 0.071(4) | **0.038(1)** |
| Yeast-alpha | 0.044(3) | 0.063(5) | 0.040(2) | 0.057(4) | **0.033(1)** |
| SBU_3DFE | 0.135(2) | 0.238(5) | **0.123(1)** | 0.233(4) | 0.141(3) |
| Movie | 0.230(4) | 0.234(5) | 0.161(2) | 0.164(3) | **0.160(1)** |
| Avg. Rank | 3.13 | 4.93 | 1.93 | 3.73 | 1.27 |

Table 2: Recovery results (value(rank)) measured by Cheb ↓.

| Datasets | FCM | KM | LP | ML | GLLE |
|---|---|---|---|---|---|
| Artificial | 0.933(3) | 0.918(5) | 0.974(2) | 0.925(4) | **0.980(1)** |
| SJAFFE | 0.906(3) | 0.827(5) | 0.941(2) | 0.857(4) | **0.946(1)** |
| Natural Scene | 0.593(5) | 0.748(4) | **0.860(1)** | 0.818(2) | 0.769(3) |
| Yeast-spoem | 0.878(3) | 0.812(5) | 0.950(2) | 0.815(4) | **0.968(1)** |
| Yeast-spo5 | 0.922(3) | 0.882(5) | 0.969(2) | 0.884(4) | **0.974(1)** |
| Yeast-dtt | 0.959(2) | 0.759(5) | 0.921(3) | 0.763(4) | **0.983(1)** |
| Yeast-cold | 0.922(3) | 0.779(5) | 0.925(2) | 0.784(4) | **0.969(1)** |
| Yeast-heat | 0.883(3) | 0.779(5) | 0.932(2) | 0.783(4) | **0.980(1)** |
| Yeast-spo | 0.909(3) | 0.800(5) | 0.939(2) | 0.803(4) | **0.968(1)** |
| Yeast-diau | 0.882(3) | 0.799(5) | 0.915(2) | 0.803(4) | **0.939(1)** |
| Yeast-elu | 0.950(2) | 0.758(5) | 0.918(3) | 0.763(4) | **0.978(1)** |
| Yeast-cdc | 0.929(2) | 0.754(5) | 0.916(3) | 0.759(4) | **0.959(1)** |
| Yeast-alpha | 0.922(2) | 0.751(5) | 0.911(3) | 0.756(4) | **0.973(1)** |
| SBU_3DFE | 0.912(2) | 0.812(5) | **0.922(1)** | 0.815(4) | 0.900(3) |
| Movie | 0.773(5) | 0.880(4) | **0.929(1)** | 0.919(2) | 0.900(3) |
| Avg. Rank | 2.93 | 4.87 | 2.07 | 3.73 | 1.40 |

Table 3: Recovery results (value(rank)) measured by Cosine ↑.

The performances of the five LE algorithms evaluated by the six measures are ranked as GLLE ≻ LP ≻ FCM ≻ ML ≻ KM. GLLE ranks 1st in 86.7% of the cases and 2nd in 6.7% of the cases across the six evaluation measures. Thus, GLLE generally performs better than the other LE algorithms.

#### LDL Predictive Performance

In this experiment, "Ground-Truth" represents the predictions made by the LDL model directly trained on the ground-truth label distributions, while "FCM", "KM", "LP", "ML" and "GLLE" represent the predictions made by the LDL model trained on the label distributions recovered by each LE algorithm, respectively. All the algorithms are tested via ten-fold cross validation. The histograms of the LDL predictive performances are given in Fig. 3 and Fig. 4, and the average rank of each algorithm over all the datasets is shown in Table 4.

Figure 3: Comparison of the LDL after the LE pre-process against the direct LDL, measured by Cheb.

Figure 4: Comparison of the LDL after the LE pre-process against the direct LDL, measured by Cosine.

| Criterion | FCM | KM | LP | ML | GLLE |
|---|---|---|---|---|---|
| Cheb | 4.40 | 4.20 | 2.00 | 3.13 | 1.27 |
| Clark | 4.33 | 4.07 | 2.27 | 3.07 | 1.27 |
| Canber | 4.20 | 4.13 | 2.27 | 3.13 | 1.27 |
| KL | 4.37 | 4.30 | 2.00 | 3.13 | 1.20 |
| Cosine | 4.53 | 4.27 | 1.93 | 3.07 | 1.20 |
| Intersec | 4.40 | 4.20 | 1.93 | 3.13 | 1.33 |

Table 4: The average ranks of the five algorithms on the six measures.

Note that since "Ground-Truth" is regarded as an upper-bound performance in this experiment, we rank FCM, KM, LP, ML and GLLE without considering "Ground-Truth". Based on the experimental results, GLLE ranks 1st in 78.9% of the cases and 2nd in 16.7% of the cases across all the evaluation measures. Thus, GLLE achieves superior performance over the other LE algorithms. Note that in most cases GLLE is very close to "Ground-Truth", especially on the Natural Scene and Yeast-spoem datasets, while the difference between them is relatively larger on a few datasets (Yeast-cold, Yeast-diau and Yeast-alpha). This is because the description degrees constituting each ground-truth label distribution in these datasets are almost equal, so the binarization process used to generate the logical labels can become unstable, and it is hard to recover reasonable label distributions from the resulting logical labels. When the description degrees constituting each ground-truth label distribution are quite different (e.g., on the Natural Scene and Yeast-spoem datasets), the binarization process can easily differentiate the relevant labels from the irrelevant labels, which is helpful for recovering reasonable label distributions.
Compared with the second-best algorithm, on average, GLLE's distance to "Ground-Truth" is closer by 12.9% on Cheb, 16.9% on Clark, 17.0% on Canber, 18.0% on KL, 23.1% on Cosine, and 18.9% on Intersec, respectively. The results of the LDL predictive performances prove the effectiveness of LDL after the LE pre-process, using GLLE, on logically labeled training sets.

## 4 Conclusion

This paper introduces label enhancement, which reinforces the supervision information in the training sets. LE can recover the label distributions from the logical labels in the training sets by leveraging the topological information of the feature space and the correlation among the labels. In order to solve the LE problem, we introduce existing algorithms that can be used for LE and propose a novel method called GLLE. Extensive comparative studies clearly validate the advantage of GLLE over other LE algorithms and the effectiveness of LDL after the LE pre-process on logically labeled datasets. In the future, we will explore whether there exist better ways to recover the label distributions.

## Acknowledgements

This research was supported by the National Key Research & Development Plan of China (No. 2017YFB1002801), the National Science Foundation of China (61622203), the Jiangsu Natural Science Funds for Distinguished Young Scholar (BK20140022), the Collaborative Innovation Center of Novel Software Technology and Industrialization, and the Collaborative Innovation Center of Wireless Communications Technology.

## References

[Cabral et al., 2011] Ricardo S. Cabral, Fernando De la Torre, João P. Costeira, and Alexandre Bernardino. Matrix completion for multi-label image classification. In Advances in Neural Information Processing Systems, pages 190-198, Granada, Spain, 2011.

[Castillo and Melin, 2005] Oscar Castillo and Patricia Melin. Hybrid Intelligent Systems for Pattern Recognition Using Soft Computing: An Evolutionary Approach for Neural Networks and Fuzzy Systems. Springer, New York, 2005.

[Gayar et al., 2006] Neamat El Gayar, Friedhelm Schwenker, and Günther Palm. A study of the robustness of kNN classifiers trained using soft labels. In Proceedings of the 2nd International Conference on Artificial Neural Networks in Pattern Recognition, pages 67-80, Ulm, Germany, 2006.

[Geng and Xia, 2014] Xin Geng and Yu Xia. Head pose estimation based on multivariate label distribution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1837-1842, Columbus, OH, 2014.

[Geng et al., 2014] Xin Geng, Qin Wang, and Yu Xia. Facial age estimation by adaptive label distribution learning. In Proceedings of the 22nd International Conference on Pattern Recognition, pages 4465-4470, Stockholm, Sweden, 2014.

[Geng, 2016] Xin Geng. Label distribution learning. IEEE Transactions on Knowledge and Data Engineering, 28(7):1734-1748, 2016.

[Gibaja and Ventura, 2015] Eva Gibaja and Sebastian Ventura. A tutorial on multilabel learning. ACM Computing Surveys, 47(3):1-38, 2015.

[Hou et al., 2016] Peng Hou, Xin Geng, and Min-Ling Zhang. Multi-label manifold learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, pages 1680-1686, Phoenix, AZ, 2016.

[Jiang et al., 2006] Xiufeng Jiang, Zhang Yi, and Jian Cheng Lv. Fuzzy SVM with a new fuzzy membership function. Neural Computing & Applications, 15(3-4):268-276, 2006.

[Li et al., 2015] Yu-Kun Li, Min-Ling Zhang, and Xin Geng.
Leveraging implicit relative labeling-importance information for effective multi-label learning. In Proceedings of the 15th IEEE International Conference on Data Mining, pages 251-260, Atlantic City, NJ, 2015.

[Lo et al., 2011] Hung-Yi Lo, Ju-Chiang Wang, Hsin-Min Wang, and Shou-De Lin. Cost-sensitive multi-label learning for audio tag annotation and retrieval. IEEE Transactions on Multimedia, 13(3):518-529, 2011.

[Nocedal and Wright, 2006] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer, New York, 2006.

[Roweis and Saul, 2000] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, 2000.

[Rubin et al., 2012] Timothy N. Rubin, America Chambers, Padhraic Smyth, and Mark Steyvers. Statistical topic models for multi-label document classification. Machine Learning, 88(1-2):157-208, 2012.

[Smola, 1999] Alex J. Smola. Learning with Kernels. Ph.D. thesis, GMD, Birlinghoven, Germany, 1999.

[Tsoumakas and Katakis, 2006] Grigorios Tsoumakas and Ioannis Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3):1-13, 2006.

[Wang and Zhang, 2008] Fei Wang and Changshui Zhang. Label propagation through linear neighborhoods. IEEE Transactions on Knowledge and Data Engineering, 20(1):55-67, 2008.

[Wang et al., 2011] Jingdong Wang, Yinghai Zhao, Xiuqing Wu, and Xian-Sheng Hua. A transductive multi-label learning approach for video concept detection. Pattern Recognition, 44(10):2274-2286, 2011.

[Zhang and Zhou, 2014] Min-Ling Zhang and Zhi-Hua Zhou. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819-1837, 2014.

[Zhou et al., 2015] Ying Zhou, Hui Xue, and Xin Geng. Emotion distribution recognition from facial expressions. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 1247-1250, Brisbane, Australia, 2015.

[Zhu and Goldberg, 2009] Xiaojin Zhu and Andrew B. Goldberg. Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1):1-130, 2009.

[Zhu et al., 2005] Xiaojin Zhu, John Lafferty, and Ronald Rosenfeld. Semi-supervised Learning with Graphs. Carnegie Mellon University, Language Technologies Institute, School of Computer Science, 2005.