# partial_label_learning_by_semantic_difference_maximization__717c88e4.pdf

Partial Label Learning by Semantic Difference Maximization

Lei Feng1,2 and Bo An1

1School of Computer Science and Engineering, Nanyang Technological University, Singapore 2Alibaba-NTU Singapore Joint Research Institute, Singapore feng0093@e.ntu.edu.sg, boan@ntu.edu.sg

Partial label learning is a weakly supervised learning framework, in which each instance is provided with multiple candidate labels while only one of them is correct. Most of the existing approaches focus on leveraging the instance relationships to disambiguate the given noisy label space, while it is still unclear whether we can exploit potentially useful information in label space to alleviate the label ambiguities. This paper gives a positive answer to this question for the ﬁrst time. Speciﬁcally, if two instances do not share any common candidate labels, they cannot have the same groundtruth label. By exploiting such dissimilarity relationships from label space, we propose a novel approach that aims to maximize the latent semantic differences of the two instances whose groundtruth labels are deﬁnitely different, while training the desired model simultaneously, thereby continually enlarging the gap of label conﬁdences between two instances of different classes. Extensive experiments on artiﬁcial and real-world partial label datasets show that our approach signiﬁcantly outperforms state-of-the-art counterparts.

1 Introduction Partial label (PL) learning [Jin and Ghahramani, 2003; Cour et al., 2011] belongs to the family of weakly supervised learning frameworks. It aims to deal with the problem that each instance is provided with a set of candidate labels, only one of which is the ground-truth label. Partial label learning is also termed as ambiguous label learning [H ullermeier and Beringer, 2006; Zeng et al., 2013; Chen et al., 2014; Chen et al., 2018] and superset label learning [Liu and Dietterich, 2012; Liu and Dietterich, 2014; Gong et al., 2018]. As a result of the difﬁculty in collecting perfect data with completely correct labels in many real-world scenarios, partial label learning has been applied to various domains. Examples include automatic face naming [Zeng et al., 2013], object detection [Liu and Dietterich, 2012], and web mining [Luo and Orabona, 2010]. Formally speaking, let X = Rn be the n-dimensional feature space and Y = {1, 2, , l} be the label space includ-

ing l labels. Suppose the PL dataset is denoted by D = {(xi, Si)}m i=1 where xi X is an n-dimensional feature vector and Si Y is the corresponding candidate label set where the ground-truth label yi must be in this candidate label set, i.e., yi Si. Given such data, the goal of partial label learning is to train a multi-class classiﬁcation model f : X Y that tries to correctly predict the label of a test instance. Due to the semantic ambiguities conveyed by the label space, the key of partial label learning is to disambiguate the candidate label set, thereby targeting the ground-truth label. To achieve this, most of the existing disambiguation-based approaches normally follow two typical strategies, including the average-based strategy [H ullermeier and Beringer, 2006; Cour et al., 2011] and identiﬁcation-based strategy [Jin and Ghahramani, 2003; Liu and Dietterich, 2012; Zhang and Yu, 2015; Zhang et al., 2016; Tang and Zhang, 2017; Gong et al., 2018]. The average-based strategy treats each candidate label equally, and makes the ﬁnal prediction by averaging the modeling outputs of candidate labels. The identiﬁcation-based strategy aims to handle the candidate labels with discrimination, and usually employ an iterative process to gradually update the conﬁdence of each candidate label. By taking into account the different conﬁdences of candidate labels, the identiﬁcation-based strategy generally outperforms the average-based strategy, thereby having attracted increasing attention. Most approaches [Zhang and Yu, 2015; Zhang et al., 2016; Feng and An, 2018; Gong et al., 2018] following this strategy normally leverage the topological information in feature space to derive the conﬁdence of each candidate label. Speciﬁcally, the conjecture that nearby (similar) instances are supposed to have the same label is widely used by these approaches. However, it is still unclear whether we can directly extract useful information in label space to help with the derivation of the conﬁdences of candidate labels. This paper gives a positive answer to this question for the ﬁrst time. There is a key observation that if two instances do not share any common candidate labels, they cannot have the same ground-truth label. For example, suppose there are two instances x1 and x2 whose corresponding label vectors are given as y1 = [1, 1, 0, 0] and y2 = [0, 0, 1, 1]. Without knowing the ground-truth label of the each instance, we can still easily ﬁnd that x1 and x2 cannot have the same groundtruth label, since they do not share any common candidate labels.

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

By exploiting such dissimilarity relationships from label space, we propose a novel partial label learning approach called SDIM (Semantic DIfference Maximization), which aims to maximize the latent semantic differences of the two instances whose ground-truth labels are deﬁnitely different, while training the desired model simultaneously, thereby continually enlarging the gap of label conﬁdences between two instances of different classes. The effectiveness of SIDM is clearly demonstrated by extensive experiments on 4 artiﬁcial and 6 real-world PL datasets.

2 Related Work

The key of effective partial label learning is to disambiguate the candidate labels. Existing disambiguation-based approaches mainly follow two strategies: the average-based strategy and the identiﬁcation-based strategy. The average-based strategy assumes that each candidate label contributes equally to model training, and the ﬁnal prediction is made by averaging the modeling outputs of all the candidate labels. Following this strategy, instance-based approaches [H ullermeier and Beringer, 2006; Gong et al., 2018] predict the label of a test instance x by averaging the outputs of its nearest neighbors, i.e., arg maxy Y P xi N (x) I(y Si) where N(x) is the set of neighbors of x. In addition, parametric approaches aim to train a parametric model [Cour et al., 2011; Zhang et al., 2016] that is able to differentiate the average modeling output of candidate labels (F(xi, y; θ), y Si) from that of non-candidate labels (F(xi, ˆy; θ), ˆy ˆSi), where ˆSi is the set of non-candidate labels and ˆy Si is a non-candidate label. Obviously, the average-based strategy is simple and clear, while it may suffer from the problem that without discrimination of candidate labels, the ground-truth label could be overwhelmed by the other false positive labels. To address the drawback of the average-based strategy, the identiﬁcation-based strategy tries to handle the candidate labels with discrimination, and derive different conﬁdences of candidate labels. Following this strategy, conventional approaches aim to optimize the objective function according to the maximum likelihood criterion [Jin and Ghahramani, 2003] or the maximum margin criterion [Nguyen and Caruana, 2008]. Recently, there have been increasing interests [Zhang and Yu, 2015; Zhang et al., 2016; Feng and An, 2018; Gong et al., 2018] in leveraging the topological information in feature space to derive the conﬁdence of each candidate label. These approaches normally iteratively update the conﬁdences of candidate labels based on the widely used assumption that nearby (similar) instances are supposed to have the same label. One potential drawback of the identiﬁcationbased strategy lies in that if the differentiated label is a false positive label, it would have a dramatically malignant inﬂuence on the follow-up model training. In addition, because of the redundant and noisy features naturally exist in feature space, the extracted topological information may be misleading. Hence there is an important question, i.e., whether we can extract useful information from label space to help with the iterative process of updating the conﬁdences of candidate labels? In this paper, a novel partial label learning approach called

SDIM will be introduced, which provides a positive answer to the above question.

3 Preliminaries

Following the conventional notations used in Introduction, we denote the feature matrix and the label matrix given in the PL dataset by X = [x1, x2, , xm] Rm n and Y = [y1, y2, , ym] {0, 1}m l, respectively. Here, yij = 1 means that the j-th label is in the candidate label set of the instance xi (i.e., j Si), otherwise the j-th label is a noncandidate label of xi. In addition, we introduce the partial label conﬁdence matrix P = [p1, p2, , pm] [0, 1]m l where pi represents the label conﬁdence vector of xi. Note that many approaches [Zhang and Yu, 2015; Zhang et al., 2016; Feng and An, 2018; Gong et al., 2018; Feng and An, 2019] have introduced or deﬁned such partial label conﬁdence matrix P. As the partial label conﬁdence matrix P is not directly accessible from PL training examples, in this paper, we carefully illustrate some requirements for P. Intuitively, since each candidate label has the potential to be the ground-truth label, the conﬁdence of each candidate label should be in [0, 1]. While the conﬁdence of each noncandidate label should be strictly 0, as non-candidate labels can never be the ground-truth label. We can use the constraint 0 P Y to compactly represent such logic. Moreover, since there is only one ground-truth label for each instance, competitive relationships naturally exist in the candidate label set. Therefore, we also assume that each label conﬁdence vector pi should be normalized, i.e., P j pij = 1. Such constraint implicitly shows that once the conﬁdence of certain candidate label is enlarged, the conﬁdences of other labels would be decreased. Based on the above descriptions of P, we present the formal deﬁnition of partial label conﬁdence matrix as follows.

Deﬁnition 1 (Partial Label Conﬁdence Matrix). Given the label matrix Y in the partial label dataset, we deﬁne the partial label conﬁdence matrix P as:

candidacy: 0 pi yi, i [m]

normalization: P j pij = 1, i [m]

where [m] := {1, 2, , m}. By compact representation, we also deﬁne a partial label simplex as := {P [0, 1]m l : P Y, P1l = 1m} where 1m is a vector of size m with all of its elements equal to 1, thus P .

As stated before, SDIM aims to maximize the latent semantic differences of the two instances whose ground-truth labels are deﬁnitely different while model training. Formally, suppose xi and xj have different ground-truth labels, i.e., y i yj = 0, the gap between the label conﬁdence vectors pi and pj should be maximized. In this paper, we adopt the widely-used Euclidean distance. Therefore, our goal is to maximize pi pj 2 2, if y i yj = 0. Here, we dig more about why our proposed regularization approach can work. Suppose yi = [1, 1, 0, 0] and yj = [0, 0, 1, 1], we

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

have pi = [pi1, pi2, 0, 0] and pj = [0, 0, pj3, pj4] (according to Deﬁnition 1). If pi1, pi2 = 0.5, and pj3, pj4 = 0.5, we could get pi pj 2 2 = 1. While if pi1 = 1, pi2 = 0 and pj3 = 1, pj4 = 0 (the values could be exchanged), we would obtain pi pj 2 2 = 2. In this way, it would be clearly observed that pi pj 2 2 hits the lowest value when the conﬁdence of each candidate label is equal to 0.5. While pi pj 2 2 gradually increases when the conﬁdence of each candidate label approaches to 0 or 1. As a result, such regularization approach would enhance the discrimination abilities to disambiguate the candidate labels, and the obtained conﬁdence vectors pi and pj would be more conﬁdent, thereby reducing the label ambiguities. For better representation, we introduce an indicating matrix R = [rij]m m, indicating whether two instances deﬁnitely have different ground-truth labels:

rij = 1, if y i yj = 0 0, if y i yj = 0 (1)

In this way, we present our proposed regularization approach as follows:

j=1 rij pi pj 2 2 = max P tr(P LP) (2)

where L = diag(R1) R is the Laplacian matrix, and tr( ) is the trace operator. Note that our proposed regularization approach aims to maximize the convex objective, which is diametrically opposed to the common convex manifold regularization [Belkin et al., 2006] that minimizes the convex objective. In other words, problem (2) is not a convex problem. By integrating the proposed regularization term into the widely-used model, we obtain the ﬁnal optimization problem:

min W,P 1 2 XW P 2 F + λ

2 tr(P LP) (3)

where W Rn l is the model parameter. Note that in problem (3), we aim to learn from the partial label conﬁdence matrix P while updating P simultaneously. Such two tasks would be mutually promoted. Due to the difﬁculty in optimizing the two variables W and P together, we adopt the simple alternating optimization method, which enables us to iteratively optimize one variable with the other ﬁxed.

4.1 Model Training With P ﬁxed, problem (3) with respect to W reduces to:

min W 1 2 XW P 2 F + λ

2 W 2 F (4)

which is the common linear regression model. Simple closedform solution could be easily obtained:

W = (X X + λIn n) 1X P (5)

where In n is an identity matrix whose scale is n n. However, such linear model may not be able to deal with the complex nonlinear case. To solve this problem, we adopt a kernel

extension to train a kernel ridge regression model. Specifically, we resort to a feature mapping φ( ) : Rn RH, which maps the original feature space (X) to some higher dimensional Hilbert space (φ(X)). By representor theorem [Sch olkopf et al., 2002], the model parameter W can be represented by a linear combination of the input features φ(X), i.e., W = φ(X) A where A = [aij]m l is the matrix storing the weights. Hence φ(X)W = KA where K = φ(X)φ(X) Rm m is the kernel matrix with each element kij = φ(xi) φ(xj) = κ(xi, xj), and κ( , ) denotes the kernel function. By incorporating such kernel extension, problem (4) can be stated as:

min A 1 2 KA P 2 F + λ

2 tr(A KA) (6)

where we have used the property of the trace operator, i.e., W 2 F = tr(W W) = tr(A KA). Setting the gradient w.r.t. A to 0, the closed-form solution is reported as:

A = (K + λIm m) 1P (7)

In this paper, we adopt the popular Gaussian kernel function κ(xi, xj) = exp((xi xj)/2σ2) with σ set to the averaged pairwise Euclidean distances of instances.

4.2 Conﬁdence Updating

With A ﬁxed, we denote by the modeling output Q = φ(X)W = KA, thus problem (3) w.r.t. P reduces to:

min P 1 2 P Q 2 F β

2 tr(P LP) (8)

Note that in problem (8), the ﬁrst term is convex, while the last term is concave. Therefore, problem (8) is a constrained convex-concave problem [Yuille and Rangarajan, 2003; Sriperumbudur and Lanckriet, 2009]. Fortunately, since the constraints are linear, we can directly employ the Convex Con Cave Procedure (CCCP) [Yuille and Rangarajan, 2003] to update P. CCCP can be regarded as a majorizationminimization algorithm [Sriperumbudur et al., 2011; Gong et al., 2018] that optimizes the original nonconvex problem by solving a sequence of convex problems. Speciﬁcally, problem (8) can be regarded as the difference between two convex functions C1 = 1

2 P Q 2 F and C2 = β

2 tr(P LP). In each iteration, C2 is replaced by its ﬁrst order Taylor approximation C2, and problem (8) can be approximated by C1 C2, which becomes a convex problem. Theoretical analyses show that CCCP converges to a local minima [Sriperumbudur and Lanckriet, 2009]. For our problem, we denote by P(i) the updated value of P at the i-th iteration, and linearize C2 at P(i) by its Taylor approximation:

2 (tr(P(i) LP(i)) + 2(tr(P LP(i) P(i) LP(i)))

= βtr(P LP(i)) β

2 tr(P(i) LP(i)) (9)

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

Algorithm 1 The SDIM Algorithm

Inputs: D: the PL training set {(X, Y)} λ, β: the regularization parameters x: the unseen test instance Output: y: the predicted label for the test instance x

1: construct the indicating matrix R = [rij]m m by (1); 2: construct the kernel matrix K = [κ(xi, xj)]m m using Gaussian kernel function; 3: initialize P by (13); 4: repeat 5: update A = [aij]m l by (7); 6: update Q = KA; 7: update P by solving (12) with a general QP procedure; 8: until convergence or the maximum number of iterations. 9: return the predicted label y according to (14).

Since the variable is P, the last term in (9) can be removed, and the approximated objective function for updating P is:

P(i+1) = arg min P

1 2 P Q 2 F βtr(P LP(i)) (10)

Here, we will show that problem (10) is actually a standard Quadratic Programming (QP) problem. Let us vectorize P Rm l, Q Rm l, and Y Rm l to ˆp Rml, ˆq Rml, and ˆy Rml. To copy with the equality constraint using ˆp, we specially pick up the indices of ˆp by deﬁning a set Z = {Z0, Z1, , Zm 1} as follows:

j Zi, if j%m = i, j [ml] (11)

Thus problem (10) can be equivalent to:

ˆp(i+1) = arg min ˆp

1 2 ˆp Hˆp + f ˆp (12)

j Zi ˆpj = 1, Zi Z

where H = Iml ml, f = ˆq βGˆp(i), and G = Im m L where is the Kronecker product. In this way, any off-theshelf QP toolbox can be used to solve problem (12). By iteratively solving a sequence of QP problems, we can eventually get local optimal ˆp. Finally, by reshaping ˆp Rml to P Rm l, we can obtain the eventually updated partial label conﬁdence matrix P. Since the variables A and P are alternatively updated, we can simply always update P for one step (for one QP problem) after updating A. Thus the iterative process of updating P is kept, while the optimization efﬁciency is improved. For the 0-th iteration, P(0) = [p(0) ij ]m l is initialized as:

( 1 |Si|, if j Si 0, if j / Si (13)

After the completion of the whole optimization process, SDIM gives the predicted label y of the test instance x by:

y = arg max j [l]

i=1 aijκ(x, xi) (14)

The pseudo code of SDIM is presented in Algorithm 1.

5 Experiments

In this section, we conduct extensive experiments on artﬁcial and real-world datasets to demonstrate the effectiveness of our proposed approach.

5.1 Comparing Algorithms We compare our approach with six state-of-the-art partial label learning approaches, each conﬁgured with suggested hyperparameters in accordance with the respective literature:

PLKNN [H ullermeier and Beringer, 2006]: a k NN approach following the average-based strategy [default conﬁguration: k {5, 6, , 10}];

CLPL [Cour et al., 2011]: a convex approach following the average-based strategy [default conﬁguration: SVM with squared hinge loss];

IPAL [Zhang and Yu, 2015]: an approach following the identiﬁcation-based strategy that leverages the structural information in feature space [default conﬁguration: α {0, 0.1, , 1}, k {5, 6, , 10}];

PL-SVM [Nguyen and Caruana, 2008]: a maximum margin approach following the identiﬁcation-based strategy [default conﬁguration: λ {10 3, , 103}];

PLALOC [Wu and Zhang, 2018]: a disambiguation-free approach that adapts the binary decomposition [default conﬁguration: µ = 10]

ECOC [Zhang et al., 2017]: a disambiguation-free approach based on the coding-decoding procedure [default conﬁguration: L = log2(l)]

For our approach, λ is searched in {0.001, 0.005, , 0.5} and β is searched in {0.00001, 0.00005, 0.0001, , 0.1}. For all the approaches, parameters are selected by ﬁve-fold cross-validation on the training set. For each dataset, we perform ten-fold cross-validation, and report the resulting mean prediction accuracies and the standard deviations.

5.2 Experiments on Controlled UCI Datasets

Dataset ecoli dermatology vehicle usps Examples 336 366 846 9298 Features 7 34 18 256 Labels 8 6 4 10

Table 1: Characteristics of the controlled UCI datasets.

Table 1 reports the characteristics of four UCI datasets used in our experiments. Following the widely-used controlling protocol [Zhang and Yu, 2015; Zhang et al., 2016;

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

Dataset Examples Features Labels Avg. CLs Task Domain Lost 1122 108 16 2.23 automatic face naming [Panis and Lanitis, 2014] MSRCv2 1758 48 23 3.16 object classiﬁcation [Liu and Dietterich, 2012] Bird Song 4998 38 13 2.18 bird song classiﬁcation [Briggs et al., 2012] Soccer Player 17472 279 171 2.09 automatic face naming [Zeng et al., 2013] Yahoo! News 22991 163 219 1.91 automatic face naming [Guillaumin et al., 2010] FG-NET 1002 262 78 7.48 facial age estimation [Panis and Lanitis, 2014]

Table 2: Characteristics of real-world partial label datasets.

(b) dermatology

(c) vehicle

Figure 1: Classiﬁcation performance on controlled UCI datasets with p ranging from 0.1 to 0.7 (r = 1).

(b) dermatology

(c) vehicle

Figure 2: Classiﬁcation performance on controlled UCI datasets with p ranging from 0.1 to 0.7 (r = 2).

(b) dermatology

(c) vehicle

Figure 3: Classiﬁcation performance on controlled UCI datasets with p ranging from 0.1 to 0.7 (r = 3).

Tang and Zhang, 2017; Feng and An, 2018; Wu and Zhang, 2018], we generate the artiﬁcial partial label datasets by using two contorlling parameters p and r. Here, p controls the proportion of instances that have candidate labels, and r controls the number of false positive labels, in other words, |Si| = r+1. All the candidate labels are randomly generated.

Figure 1, Figure 2, and Figure 3 report the classiﬁcation accuracy of each approach as p ranges from 0.1 to 0.7 with step size 0.1, when r = 1, r = 2, and r = 3, respectively. As shown in these ﬁgures, SDIM outperforms other comparing algorithms in most cases. It is worth noting that candidate labels are randomly generated, there might be the case that many pairs of examples share at least one common candi-

date labels. In this case, Our proposed method may achieve mediocre performance. However, such case is really rare since experimental results demonstrate that our method can achieve the best performance in more than 75 cases out of the 84 cases.

5.3 Experiments on Real-World Datasets

Table 2 summaries the characteristics of the real-world partial label datasets from various task domains, which also includes the average number of candidate labels per instance (Avg. CLs). Note that the Bird Song dataset is normalized using the Z-scores by convention. The mean accuracy with standard deviation of each approach on each real-world dataset

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

SDIM PLKNN CLPL IPAL PLSVM PALOC ECOC Lost 0.801 0.031 0.432 0.051 0.742 0.038 0.678 0.053 0.729 0.042 0.629 0.056 0.703 0.052 MSRCv2 0.518 0.037 0.417 0.034 0.413 0.041 0.529 0.039 0.461 0.046 0.479 0.042 0.505 0.027 Bird Song 0.754 0.021 0.648 0.017 0.632 0.019 0.713 0.015 0.663 0.032 0.711 0.016 0.740 0.016 Soccer Player 0.577 0.016 0.495 0.018 0.368 0.010 0.541 0.016 0.464 0.011 0.537 0.015 0.537 0.020 Yahoo! News 0.663 0.013 0.483 0.011 0.462 0.009 0.609 0.011 0.629 0.010 0.625 0.005 0.662 0.010 FG-NET 0.076 0.019 0.039 0.018 0.063 0.027 0.054 0.030 0.063 0.029 0.065 0.019 0.040 0.025 FG-NET(MAE3) 0.466 0.022 0.269 0.045 0.458 0.022 0.362 0.034 0.356 0.022 0.435 0.018 0.251 0.029 FG-NET(MAE5) 0.621 0.024 0.438 0.053 0.596 0.017 0.540 0.033 0.479 0.016 0.609 0.043 0.354 0.038

Table 3: Classiﬁcation accuracy of each algorithm on the real-world datasets. Furthermore, / indicates whether SDIM is statistically superior/inferior to the comparing algorithm (t-test at 0.05 signiﬁcance level for two independent samples).

(a) Varying λ on Lost

(b) Varying β on Lost

(c) Convergence on Lost

(d) Convergence on MSRCv2

Figure 4: Parameter sensitivity and convergence analysis of SDIM on the real-world datasets Lost and MSRCv2.

is reported in Table 3. It is worthy noting that all the approaches achieve extremely poor performance on the FGNET dataset, because its Avg. CLs is very large. For better evaluation of such task, we resort to conventional mean absolute error (MAE). Speciﬁcally, for FG-NET (MAE3/MAE5), a test example is considered correctly classiﬁed if the MAE between the predicted age and the real age is no more than 3/5 years. The experimental results of FG-NET(MAE3) and FG-NET(MAE5) are also recorded in Table 3. From Table 3, we can observe that: 1) SDIM statistically outperforms PLKNN on all the real-world datasets; 2) Out of the 48 cases (6 comparing algorithms and 8 tasks), SDIM statistically outperforms other algorithms in 85.4% cases; 3) It is worthy noting that SDIM is never statistically inferior to any comparing algorithms.

Further Analysis There are two tradeoff parameters for SDIM. We conduct sensitivity analysis by varying one parameter, while keeping the other ﬁxed at the best setting. Figure 4(a) and 4(b) show the performance of SDIM w.r.t. λ and β, respectively. As can be seen from Figure 4(a), when performance of SDIM is relatively poor when λ is too small or too big. Such observation agrees with the intuition that it is important to control the model complexity for avoiding overﬁtting or underﬁtting. Note that β controls our proposed regularization term, i.e., the importance of semantic difference maximization. From Figure 4(b), we can also ﬁnd that when β is very small, the importance of semantic difference maximization is hardly considered, thus SDIM achieves relatively low prediction accuracy. As β increases, the importance of semantic difference maximization will gradually be taken into consideration, hence the prediction accuracy starts to increase. Such observation clearly conﬁrms the effectiveness of our SDIM approach. However, if β is overly large, the classiﬁcation accuracy will drop. This is because the objective function focuses too much

on maximizing the semantic differences while ignoring the importance of model training. It is also worth noting that when β is extremely small (even without the proposed regularization term), SDIM still achieves satisﬁed performance. Which means, the reduced version of SDIM (i.e., kernel regression with conﬁdence updating) provides a strong baseline. This observation also agrees with [Feng and An, 2018; Feng and An, 2019]. Figure 4(c) and Figure 4(d) illustrate the convergence of SDIM on Lost and MSRCv2, according to the difference of the value of the objective function (3) between two successive iterations. It can be easily observed that the loss converges within a few iterations. Hence the convergence of SDIM is demonstrated.

6 Conclusion

This paper gives the ﬁrst attempt to leverage the dissimilarity relationships from label space for dealing with label ambiguities. A novel partial label learning approach called SDIM is proposed to maximize the latent semantic differences while training the desired model simultaneously. Extensive experimental results demonstrate the effectiveness of SDIM. Contrary to conventional approaches that leverage the similarities in feature space to disambiguate the candidate labels, SDIM resorts to the dissimilarities in label space. Hence a question naturally arises on how similarities and dissimilarities can be combined together for enhancing partial label learning performance. We leave this as future work.

Acknowledgements

This work was supported by MOE, NRF, and NTU.

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)

References [Belkin et al., 2006] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(11):2399 2434, 2006. [Briggs et al., 2012] Forrest Briggs, Xiaoli Z Fern, and Raviv Raich. Rank-loss support instance machines for miml instance annotation. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 534 542, 2012. [Chen et al., 2014] Yi-Chen Chen, Vishal M Patel, Rama Chellappa, and P Jonathon Phillips. Ambiguously labeled learning using dictionaries. IEEE Transactions on Information Forensics and Security, 9(12):2076 2088, 2014. [Chen et al., 2018] Ching-Hui Chen, Vishal M Patel, and Rama Chellappa. Learning from ambiguously labeled face images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(7):1653 1667, 2018. [Cour et al., 2011] Timothee Cour, Ben Sapp, and Ben Taskar. Learning from partial labels. Journal of Machine Learning Research, 12(5):1501 1536, 2011. [Feng and An, 2018] Lei Feng and Bo An. Leveraging latent label distributions for partial label learning. In International Joint Conference on Artiﬁcial Intelligence, pages 2107 2113, 2018. [Feng and An, 2019] Lei Feng and Bo An. Partial label learning with self-guided retraining. In AAAI Conference on Artiﬁcial Intelligence, 2019. [Gong et al., 2018] Chen Gong, Tongliang Liu, Yuanyan Tang, Jian Yang, Jie Yang, and Dacheng Tao. A regularization approach for instance-based superset label learning. IEEE Transactions on Cybernetics, 48(3):967 978, 2018. [Guillaumin et al., 2010] Matthieu Guillaumin, Jakob Verbeek, and Cordelia Schmid. Multiple instance metric learning from automatically labeled bags of faces. Lecture Notes in Computer Science, 63(11):634 647, 2010. [H ullermeier and Beringer, 2006] Eyke H ullermeier and J urgen Beringer. Learning from ambiguously labeled examples. Intelligent Data Analysis, 10(5):419 439, 2006. [Jin and Ghahramani, 2003] Rong Jin and Zoubin Ghahramani. Learning with multiple labels. In Neural Information Processing Systems, pages 921 928, 2003. [Liu and Dietterich, 2012] Li-Ping Liu and Thomas G Dietterich. A conditional multinomial mixture model for superset label learning. In Neural Information Processing Systems, pages 548 556, 2012. [Liu and Dietterich, 2014] Li-Ping Liu and Thomas Dietterich. Learnability of the superset label learning problem. In International Conference on Machine Learning, pages 1629 1637, 2014. [Luo and Orabona, 2010] Jie Luo and Francesco Orabona. Learning from candidate labeling sets. In Neural Information Processing Systems, pages 1504 1512, 2010.

[Nguyen and Caruana, 2008] Nam Nguyen and Rich Caruana. Classiﬁcation with partial labels. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 551 559, 2008. [Panis and Lanitis, 2014] Gabriel Panis and Andreas Lanitis. An overview of research activities in facial age estimation using the fg-net aging database. In European Conference on Computer Vision, pages 737 750, 2014. [Sch olkopf et al., 2002] Bernhard Sch olkopf, Alexander J Smola, et al. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT press, 2002. [Sriperumbudur and Lanckriet, 2009] Bharath K Sriperumbudur and Gert RG Lanckriet. On the convergence of the concave-convex procedure. In Neural Information Processing Systems, pages 1759 1767, 2009. [Sriperumbudur et al., 2011] Bharath K Sriperumbudur, David A Torres, and Gert RG Lanckriet. A majorizationminimization approach to the sparse generalized eigenvalue problem. Machine Learning, 85(1-2):3 39, 2011. [Tang and Zhang, 2017] Cai-Zhi Tang and Min-Ling Zhang. Conﬁdence-rated discriminative partial label learning. In AAAI Conference on Artiﬁcial Intelligence, pages 2611 2617, 2017. [Wu and Zhang, 2018] Xuan Wu and Min-Ling Zhang. Towards enabling binary decomposition for partial label learning. In International Joint Conference on Artiﬁcial Intelligence, pages 2427 2436, 2018. [Yuille and Rangarajan, 2003] Alan L Yuille and Anand Rangarajan. The concave-convex procedure. Neural Computation, 15(4):915 936, 2003. [Zeng et al., 2013] Zi-Nan Zeng, Shi-Jie Xiao, Kui Jia, Tsung-Han Chan, Sheng-Hua Gao, Dong Xu, and Yi Ma. Learning by associating ambiguously labeled images. In IEEE Conference on Computer Vision and Pattern Recognition, pages 708 715, 2013. [Zhang and Yu, 2015] Min-Ling Zhang and Fei Yu. Solving the partial label learning problem: An instance-based approach. In International Joint Conference on Artiﬁcial Intelligence, pages 4048 4054, 2015. [Zhang et al., 2016] Min-Ling Zhang, Bin-Bin Zhou, and Xu-Ying Liu. Partial label learning via feature-aware disambiguation. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1335 1344, 2016. [Zhang et al., 2017] Min-Ling Zhang, Fei Yu, and Cai Zhi Tang. Disambiguation-free partial label learning. IEEE Transactions on Knowledge and Data Engineering, 29(10):2155 2167, 2017.

Proceedings of the Twenty-Eighth International Joint Conference on Artiﬁcial Intelligence (IJCAI-19)