# multilabel_learning_with_pairwise_relevance_ordering__0369e0a9.pdf

Multi-Label Learning with Pairwise Relevance Ordering

Ming-Kun Xie and Sheng-Jun Huang College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics MIIT Key Laboratory of Pattern Analysis and Machine Intelligence, Nanjing, 211106 {mkxie, huangsj}@nuaa.edu.cn

Precisely annotating objects with multiple labels is costly and has become a critical bottleneck in real-world multi-label classiﬁcation tasks. Instead, deciding the relative order of label pairs is obviously less laborious than collecting exact labels. However, the supervised information of pairwise relevance ordering is less informative than exact labels. It is thus an important challenge to effectively learn with such weak supervision. In this paper, we formalize this problem as a novel learning framework, called multi-label learning with pairwise relevance ordering (PRO). We show that the unbiased estimator of classiﬁcation risk can be derived with a cost-sensitive loss only from PRO examples. Theoretically, we provide the estimation error bound for the proposed estimator and further prove that it is consistent with respect to the commonly used ranking loss. Empirical studies on multiple datasets and metrics validate the effectiveness of the proposed method.

1 Introduction

Multi-label learning (MLL) solves problems where each object is assigned with multiple class labels simultaneously [Zhang and Zhou, 2013]. For example, an image may be annotated with labels building, street and person. The goal of multi-label learning is to train a classiﬁcation model that can predict all the relevant labels for unseen instances. A large number of recent works have witnessed the great successes that MLL has achieved in many real-world applications, e.g., image annotation [Chen et al., 2019], human attribute recognition [Li et al., 2016], user proﬁling [Liu et al., 2021], and protein function prediction [Elisseeff and Weston, 2002].

Traditional multi-label learning studies assume that each instance has been precisely annotated with all of its relevant labels. However, in many real-world scenarios, it is difficult and costly to collect precise annotations. Instead, each instance may be provided with the relative order of label pairs, where each label pair $y \succ y'$ (or $y \prec y'$) indicates that label $y$ is more relevant (or less relevant) than label $y'$ to instance $x$, i.e., $p(y = 1|x) > p(y' = 1|x)$ (or $p(y = 1|x) < p(y' = 1|x)$). Generally, deciding the relative order of label pairs is much easier than collecting precise annotations and is thus less costly. For example, in medical image analysis, only experts with rich experience can accurately identify a patient's disease from a medical image. In contrast, if the question is which of two given diseases the patient is more likely to suffer from, then even medical students with basic knowledge may easily provide the answer. While the annotation cost is significantly reduced with pairwise relevance ordering, the learning task becomes more challenging, since pairwise relevance ordering carries much less supervised information than exact labels.

We formalize this learning problem as a new framework called multi-label learning with pairwise relevance ordering (PRO). Specifically, PRO attempts to learn a classification model from multi-label examples with the relative order of label pairs, where each label pair $y \succ y'$ indicates three possible cases: 1) both $y$ and $y'$ are relevant to $x$, i.e., $y = 1, y' = 1$; 2) $y$ is relevant to $x$ while $y'$ is not, i.e., $y = 1, y' = 0$; 3) both $y$ and $y'$ are irrelevant to $x$, i.e., $y = 0, y' = 0$.

Correspondence to: Sheng-Jun Huang (huangsj@nuaa.edu.cn).

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

PRO is a novel learning framework that differs significantly from existing settings. For example, semi-supervised multi-label learning (SSMLL) learns a classifier by exploiting a few labeled examples together with a large number of unlabeled examples [Liu et al., 2006]; multi-label learning with missing labels (MLML) assumes that only a subset of labels is available for each instance [Sun et al., 2010, Yu et al., 2014]; partial multi-label learning (PML) assigns each instance a candidate label set [Xie and Huang, 2018, Zhang and Fang, 2020]; multi-label learning with noisy labels assumes that multiple class labels may be flipped simultaneously with their respective probabilities [Xie and Huang, 2021a]. However, none of these frameworks considers multi-label examples with pairwise relevance ordering, so they cannot be employed to solve PRO problems.

To deal with multi-label data with pairwise relevance ordering, we propose a cost-sensitive loss function for learning a multi-label classiﬁer with empirical risk minimization. Theoretically, we show that the unbiased estimator of classiﬁcation risk can be derived from only PRO examples if the surrogate loss function satisﬁes a mild condition, i.e., the symmetric condition. The estimation error bound is established for the unbiased estimator, showing that learning with PRO examples can be multi-label consistent to the commonly used ranking loss. Extensive experimental results on multiple datasets and evaluation metrics demonstrate the practical usefulness of the proposed method.

2 Related Works

There is a large body of literature on multi-label learning. As one of the earliest representative methods, Binary Relevance simply decomposes the multi-label learning task into a set of binary classification problems [Zhang and Zhou, 2013]. Nevertheless, such a method neglects label correlations, which are regarded as essential information for multi-label classification. Therefore, many studies try to learn a multi-label classifier by exploiting label correlations [Read et al., 2011]. Some of these works focus on pairwise correlation [Elisseeff and Weston, 2002, Li et al., 2017], while others consider high-order correlation among all labels [Chen et al., 2019].

To solve SSMLL problems, some works attempt to learn a multi-label classiﬁer based on the graph models [Kong et al., 2011] while some others utilize the low-rank assumption [Jing et al., 2015]. In addition to these methods, the co-training strategy [Zhan and Zhang, 2017] and matrix factorization [Liu et al., 2006] are also employed to solve SSMLL problems.

The pioneering MLML study [Sun et al., 2010] tries to construct a similarity graph for each label and the manifold regularization term is added to recover the missing labels. A linear classiﬁer with the low-rank constraint is proposed to deal with large scale data with missing labels [Yu et al., 2014]. In [Kanehira and Harada, 2016], authors solve the MLML problem by viewing it as a positive-unlabeled learning task. Some other techniques also employ the robust loss [Xu et al., 2019] or the group lasso regularizer [Bucak et al., 2011] to solve MLML problems.

In order to deal with partial-labeled data, the most commonly used strategy is disambiguation [Cour et al., 2011], which recovers ground-truth labeling information for candidate labels. Some methods perform disambiguation by estimating a confidence for each candidate label [Xie and Huang, 2018, Yu et al., 2018, 2020]. Other methods utilize the decomposition scheme [Sun et al., 2019] or adversarial training [Yan and Guo, 2021]. In [Xie and Huang, 2021b], the authors first solve PML problems by considering the generation process of noisy labels. A recent work utilizes the meta disambiguation strategy to deal with partial-labeled data [Xie et al., 2021].

Although they allow noisy labels hidden in the candidate set (e.g., PML) or some labels to be missing (e.g., SSMLL and MLML), the aforementioned frameworks all rely on label-level supervision, i.e., whether each label is relevant or not, which can still be costly to collect. Instead, our proposed PRO framework considers pairwise relevance ordering, which can be much easier for annotators and thus less costly. In [Huang et al., 2015], the authors propose a multi-label active learning framework called AURO, which queries the relevance ordering between two labels in every iteration. However, different from the proposed PRO framework, AURO directly asks annotators to assign each label pair to one of the three possible cases discussed in Section 1, which is a stronger form of supervision with higher cost. AURO therefore cannot be used to solve the PRO problem. Furthermore, some works consider pairwise supervision over instances and perform binary classification based on similar paired data and unlabeled examples [Bao et al., 2018, Shimada et al., 2021, Bao et al., 2020]. Another recent work trains a binary classifier under the supervision of pairwise confidence comparisons [Feng et al., 2021]. These methods focus on binary classification and cannot be directly applied to solve the PRO problem.

3 Preliminaries

In this section, before deriving our main results for solving the PRO problem, we introduce some notations and provide the necessary preliminaries.

In the traditional multi-label learning task, let $x \in \mathcal{X}$ be a feature vector and $y \in \mathcal{Y}$ its corresponding labels, where $\mathcal{X} = \mathbb{R}^d$ is the feature space and $\mathcal{Y} = \{0, 1\}^q$ is the target space with $q$ possible class labels. Here, $y_j = 1$ indicates that the $j$-th label is relevant to the instance, and $y_j = 0$ otherwise. Let $D = \{(x_i, y_i)\}_{i=1}^{n}$ be the given training examples, where each example is drawn i.i.d. according to the joint distribution $p(x, y)$.

In multi-label learning, many loss functions have been proposed to measure the performance of learning algorithms, such as ranking loss, hamming loss, coverage and average precision [Zhang and Zhou, 2013]. Among them, the ranking loss penalizes label pairs that are ordered reversely for an instance, and thus naturally considers pairwise label correlation. Given the decision function $f : \mathcal{X} \to \mathbb{R}^q$, the ranking loss can be defined as follows:

$$L(f(x), y) = \sum_{1 \le j < k \le q} \mathbb{I}(y_j = 1, y_k = 0)\,\ell(f_k, f_j) + \mathbb{I}(y_j = 0, y_k = 1)\,\ell(f_j, f_k), \tag{1}$$

where

$$\ell(f_j, f_k) = \mathbb{I}(f_j > f_k) + \tfrac{1}{2}\,\mathbb{I}(f_j = f_k). \tag{2}$$

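Here, $f_j$ is the $j$-th component of $f(x)$ and $\mathbb{I}(\cdot)$ is the indicator function, which outputs 1 if the condition holds and 0 otherwise. The goal of multi-label learning is to learn the optimal classifier $f$ by minimizing the following expected classification risk:

$$\begin{aligned}
R(f) &= \mathbb{E}_{p(x,y)}[L(f(x), y)] = \sum_{y \in \mathcal{Y}} p(y)\,\mathbb{E}_{p(x|y)}[L(f(x), y)] \\
&= \sum_{1 \le j < k \le q} p(y_j = 1, y_k = 0)\,\mathbb{E}_{p(x|y_j=1,y_k=0)}[\ell(f_k, f_j)] + p(y_j = 0, y_k = 1)\,\mathbb{E}_{p(x|y_j=0,y_k=1)}[\ell(f_j, f_k)] \\
&= \sum_{1 \le j < k \le q} \pi^{10}_{jk}\,\mathbb{E}_{p^{10}_{jk}(x)}[\ell(f_k, f_j)] + \pi^{01}_{jk}\,\mathbb{E}_{p^{01}_{jk}(x)}[\ell(f_j, f_k)],
\end{aligned} \tag{3}$$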

where $\pi^{10}_{jk} = p(y_j = 1, y_k = 0)$ (or $\pi^{01}_{jk} = p(y_j = 0, y_k = 1)$) denotes the positive-negative (or negative-positive) label pair prior probability, and $p^{10}_{jk}(x) = p(x|y_j = 1, y_k = 0)$ (or $p^{01}_{jk}(x) = p(x|y_j = 0, y_k = 1)$) denotes the class-conditional probability density of data given the positive-negative (or negative-positive) label pair. Accordingly, we define the minimal risk (also called the Bayes risk) as $R^* = \inf_f R(f)$.
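To make the ranking loss of Eq.(1) and Eq.(2) concrete, the following minimal sketch evaluates it for a single instance. It is an illustration only: the function names `pair_loss` and `ranking_loss` and the list-based encoding of scores and labels are assumptions introduced here, not notation from the paper.

```python
from itertools import combinations


def pair_loss(fa: float, fb: float) -> float:
    """Eq. (2): l(fa, fb) = I(fa > fb) + 0.5 * I(fa == fb)."""
    if fa > fb:
        return 1.0
    if fa == fb:
        return 0.5
    return 0.0


def ranking_loss(scores, labels) -> float:
    """Eq. (1): sum pair losses over label pairs j < k with different labels."""
    total = 0.0
    for j, k in combinations(range(len(labels)), 2):
        if labels[j] == 1 and labels[k] == 0:
            # Penalize when irrelevant label k is scored at or above relevant label j.
            total += pair_loss(scores[k], scores[j])
        elif labels[j] == 0 and labels[k] == 1:
            # Penalize when irrelevant label j is scored at or above relevant label k.
            total += pair_loss(scores[j], scores[k])
    return total


scores = [0.9, 0.2, 0.2, 0.7]
labels = [1, 0, 1, 0]
print(ranking_loss(scores, labels))  # one tie (0.5) plus one reversed pair (1.0)
```

Note that this follows the unnormalized form of Eq.(1); evaluation toolkits often divide by the number of relevant-irrelevant pairs.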

However, the loss function $L$ is highly discontinuous and its direct minimization is NP-hard, which makes the corresponding optimization problem hard to solve [Gao and Zhou, 2013]. In practice, a feasible solution is to consider instead a surrogate loss function $L$ that can be optimized efficiently. Accordingly, the $L$-risk with respect to $p(x, y)$ can be defined as:

$$R_L(f) = \mathbb{E}_{p(x,y)}[L(f(x), y)] = \sum_{1 \le j < k \le q} \pi^{10}_{jk}\,\mathbb{E}_{p^{10}_{jk}(x)}[\phi(f_j - f_k)] + \pi^{01}_{jk}\,\mathbb{E}_{p^{01}_{jk}(x)}[\phi(f_k - f_j)], \tag{4}$$

where $\phi$ is a surrogate loss function. A common choice is the hinge loss $\phi(t) = \max(0, 1 - t)$ used in [Elisseeff and Weston, 2002]. Accordingly, we define the minimal $L$-risk (also called the Bayes $L$-risk) as $R^*_L = \inf_f R_L(f)$.
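As a small illustration of why a surrogate such as the hinge loss is a reasonable stand-in in Eq.(4), the sketch below compares $\phi(f_j - f_k)$ with the 0-1 pair loss $\ell(f_k, f_j)$ for a relevant label $j$ and an irrelevant label $k$. The helper names and the sample score values are assumptions chosen for this example.

```python
def hinge(t: float) -> float:
    """Hinge surrogate phi(t) = max(0, 1 - t)."""
    return max(0.0, 1.0 - t)


def zero_one_pair_loss(f_rel: float, f_irr: float) -> float:
    """l(f_irr, f_rel) from Eq. (2): 1 if the irrelevant label wins, 0.5 on a tie."""
    if f_irr > f_rel:
        return 1.0
    if f_irr == f_rel:
        return 0.5
    return 0.0


# Each tuple is (score of the relevant label, score of the irrelevant label).
for f_rel, f_irr in [(2.0, 0.0), (0.5, 0.0), (0.0, 0.0), (0.0, 1.0)]:
    margin = f_rel - f_irr
    print(margin, zero_one_pair_loss(f_rel, f_irr), hinge(margin))
```

On every margin the hinge value is at least the 0-1 pair loss, while being convex and therefore easier to minimize.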

4 Learning with PRO

In this section, we ﬁrst formulate the problem of multi-label learning with pairwise relevance ordering (PRO). Then, the unbiased estimator is proposed for solving the PRO problem. Due to the page limit, most proofs for theorems in Section 4 and Section 5 are provided in the supplementary material.

4.1 The PRO Framework

In the PRO framework, each example $x$ is associated with the relevance ordering of $K$ label pairs $\tilde{y} = \{y_j \succ y_{j'}\}_{j=1}^{K}$, where $y_j \succ y_{j'}$ represents that label $y_j$ is more relevant than label $y_{j'}$, i.e., $p(y_j = 1|x) > p(y_{j'} = 1|x)$. Note that here $y_{j'}$ represents one of the other $q - 1$ labels (all labels except $y_j$), and we further use $y_k$ to denote $y_{j'}$ for notational simplicity. As discussed in Section 1, $y_j \succ y_k$ indicates three possible cases: 1) both $y_j$ and $y_k$ are relevant to $x$, i.e., $y_j = 1, y_k = 1$; 2) $y_j$ is relevant to $x$ while $y_k$ is not, i.e., $y_j = 1, y_k = 0$; 3) both $y_j$ and $y_k$ are irrelevant to $x$, i.e., $y_j = 0, y_k = 0$. This observation tells us that $y_j = 1, y_k = 0$ (or $y_j = 0, y_k = 1$) can occur only if $y_j \succ y_k$ (or $y_j \prec y_k$), i.e., $p(y_j = 1, y_k = 0 | y_j \prec y_k) = 0$ (or $p(y_j = 0, y_k = 1 | y_j \succ y_k) = 0$). We can utilize this prior knowledge to calibrate the ordinary risk Eq.(3). By taking it into consideration, the classification risk Eq.(3) can be calibrated as:

$$R(f) = \sum_{1 \le j < k \le q} \Pi^{10}_{jk}\,\mathbb{E}_{\tau^{10}_{jk}(x)}[\ell(f_k, f_j)] + \Pi^{01}_{jk}\,\mathbb{E}_{\tau^{01}_{jk}(x)}[\ell(f_j, f_k)], \tag{5}$$

where $\Pi^{10}_{jk} = p(y_j = 1, y_k = 0 | y_j \succ y_k)$ (or $\Pi^{01}_{jk} = p(y_j = 0, y_k = 1 | y_j \prec y_k)$) denotes the calibrated positive-negative (or negative-positive) label pair prior probability, and $\tau^{10}_{jk} = p(x | y_j = 1, y_k = 0, y_j \succ y_k)$ (or $\tau^{01}_{jk} = p(x | y_j = 0, y_k = 1, y_j \prec y_k)$) denotes the calibrated class-conditional probability density of data given the positive-negative (or negative-positive) label pair. Accordingly, the $L$-risk Eq.(4) can be calibrated as:

$$R_L(f) = \sum_{1 \le j < k \le q} \Pi^{10}_{jk}\,\mathbb{E}_{\tau^{10}_{jk}(x)}[\phi(f_j - f_k)] + \Pi^{01}_{jk}\,\mathbb{E}_{\tau^{01}_{jk}(x)}[\phi(f_k - f_j)]. \tag{6}$$

Our goal is to train a multi-label classifier based only on the observed examples $\tilde{D} = \{(x_i, \tilde{y}_i)\}_{i=1}^{n}$, drawn i.i.d. from the distribution $\tilde{p}(x, \tilde{y})$. The expected $L$-risk with respect to $\tilde{p}(x, \tilde{y})$ can be formulated as follows:

$$\mathbb{E}_{\tilde{p}(x, \tilde{y})}[L(f(x), \tilde{y})] = \sum_{1 \le j < k \le q} \tilde{\pi}^{10}_{jk}\,\mathbb{E}_{\tilde{p}^{10}_{jk}(x)}[\phi(f_j - f_k)] + \tilde{\pi}^{01}_{jk}\,\mathbb{E}_{\tilde{p}^{01}_{jk}(x)}[\phi(f_k - f_j)], \tag{7}$$

where $\tilde{\pi}^{10}_{jk} = p(y_j \succ y_k)$ (or $\tilde{\pi}^{01}_{jk} = p(y_j \prec y_k)$) denotes the positive (or negative) ordering label pair prior probability of PRO examples, and $\tilde{p}^{10}_{jk} = p(x | y_j \succ y_k)$ (or $\tilde{p}^{01}_{jk} = p(x | y_j \prec y_k)$) denotes the class-conditional probability density of PRO examples given the positive (or negative) ordering label pair. However, directly minimizing the estimator Eq.(7) to obtain the classifier usually suffers from over-fitting, which prevents the classifier from generalizing well.
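The ordering priors of PRO examples are defined on observable quantities, so a simple counting estimate is one plausible way to obtain them. The sketch below is a hedged illustration: the annotation encoding (one list of ordered tuples per example, where `(j, k)` means label `j` is more relevant than label `k`) and the function name `estimate_pair_priors` are assumptions made here; the paper does not prescribe a specific estimator.

```python
from collections import Counter


def estimate_pair_priors(annotations):
    """Return {(j, k): fraction of examples annotated with "j more relevant than k"}.

    `annotations` is a hypothetical encoding: one list per example holding
    ordered tuples (j, k), each meaning "label j is more relevant than label k".
    """
    n = len(annotations)
    counts = Counter()
    for pairs in annotations:
        counts.update(set(pairs))  # count each ordering at most once per example
    return {pair: c / n for pair, c in counts.items()}


annotations = [
    [(0, 1), (2, 1)],
    [(0, 1), (1, 2)],
    [(1, 0)],
    [(0, 1), (2, 1)],
]
priors = estimate_pair_priors(annotations)
print(priors[(0, 1)], priors[(1, 0)], priors[(2, 1)])
```

With this encoding, the key `(j, k)` for `j < k` plays the role of the positive-ordering prior and the key `(k, j)` the role of the negative-ordering prior for the same label pair.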

4.2 The Proposed Method

In this section, we derive an unbiased risk estimator for solving the PRO problem.

Based on the aforementioned discussions, each label pair yj yk indicates three possible cases, which motivates us to derive the following lemma.

Lemma 1. Each PRO example of $\tilde{D}$ is drawn i.i.d. according to a probability distribution with the following class-conditional density:

$$p(x | y_j \succ y_k) = \Pi^{10}_{jk}\,p(x | y_j = 1, y_k = 0, y_j \succ y_k) + \pi^{11}_{jk}\,p(x | y_j = 1, y_k = 1) + \pi^{00}_{jk}\,p(x | y_j = 0, y_k = 0). \tag{8}$$

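Based on the lemma, we derive the following theorem, which obtains the unbiased estimator of the classification risk only from the PRO examples.

Theorem 1. The classification risk Eq.(5) can be equivalently re-written, up to an additive constant that does not depend on $f$ (made explicit in the proof), as

$$\sum_{1 \le j < k \le q} \left[ \frac{1}{\tilde{\pi}^{10}_{jk}}\,\tilde{\pi}^{10}_{jk}\,\mathbb{E}_{\tilde{p}^{10}_{jk}(x)}[\ell(f_k, f_j)] + \frac{1}{\tilde{\pi}^{01}_{jk}}\,\tilde{\pi}^{01}_{jk}\,\mathbb{E}_{\tilde{p}^{01}_{jk}(x)}[\ell(f_j, f_k)] \right].$$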

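Proof. According to Eq.(8), we have

$$\mathbb{E}_{\tilde{p}^{10}_{jk}(x)}[\ell(f_k, f_j)] = \Pi^{10}_{jk}\,\mathbb{E}_{\tau^{10}_{jk}(x)}[\ell(f_k, f_j)] + \pi^{11}_{jk}\,\mathbb{E}_{p^{11}_{jk}(x)}[\ell(f_k, f_j)] + \pi^{00}_{jk}\,\mathbb{E}_{p^{00}_{jk}(x)}[\ell(f_k, f_j)],$$

$$\mathbb{E}_{\tilde{p}^{01}_{jk}(x)}[\ell(f_j, f_k)] = \Pi^{01}_{jk}\,\mathbb{E}_{\tau^{01}_{jk}(x)}[\ell(f_j, f_k)] + \pi^{11}_{jk}\,\mathbb{E}_{p^{11}_{jk}(x)}[\ell(f_j, f_k)] + \pi^{00}_{jk}\,\mathbb{E}_{p^{00}_{jk}(x)}[\ell(f_j, f_k)].$$

Then, the expected classification risk $R(f)$ can be expressed as follows:

$$\begin{aligned}
R(f) &= \sum_{1 \le j < k \le q} \Pi^{10}_{jk}\,\mathbb{E}_{\tau^{10}_{jk}(x)}[\ell(f_k, f_j)] + \Pi^{01}_{jk}\,\mathbb{E}_{\tau^{01}_{jk}(x)}[\ell(f_j, f_k)] \\
&= \sum_{1 \le j < k \le q} \mathbb{E}_{\tilde{p}^{10}_{jk}(x)}[\ell(f_k, f_j)] + \mathbb{E}_{\tilde{p}^{01}_{jk}(x)}[\ell(f_j, f_k)] - \pi^{11}_{jk}\,\mathbb{E}_{p^{11}_{jk}(x)}[\ell(f_k, f_j) + \ell(f_j, f_k)] - \pi^{00}_{jk}\,\mathbb{E}_{p^{00}_{jk}(x)}[\ell(f_k, f_j) + \ell(f_j, f_k)] \\
&= \sum_{1 \le j < k \le q} \frac{1}{\tilde{\pi}^{10}_{jk}}\,\tilde{\pi}^{10}_{jk}\,\mathbb{E}_{\tilde{p}^{10}_{jk}(x)}[\ell(f_k, f_j)] + \frac{1}{\tilde{\pi}^{01}_{jk}}\,\tilde{\pi}^{01}_{jk}\,\mathbb{E}_{\tilde{p}^{01}_{jk}(x)}[\ell(f_j, f_k)] - \pi^{11}_{jk} - \pi^{00}_{jk},
\end{aligned}$$

where the last equality is due to the fact that $\ell(f_j, f_k) + \ell(f_k, f_j) = 1$ by the definition of $\ell$.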
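The identity used in the last step follows directly from Eq.(2); as a short worked check, exactly one of the three events $f_j > f_k$, $f_k > f_j$, $f_j = f_k$ holds for any pair of scores:

```latex
\begin{aligned}
\ell(f_j, f_k) + \ell(f_k, f_j)
  &= \mathbb{I}(f_j > f_k) + \tfrac{1}{2}\,\mathbb{I}(f_j = f_k)
   + \mathbb{I}(f_k > f_j) + \tfrac{1}{2}\,\mathbb{I}(f_k = f_j) \\
  &= \mathbb{I}(f_j > f_k) + \mathbb{I}(f_k > f_j) + \mathbb{I}(f_j = f_k) = 1 .
\end{aligned}
```

Since the sum equals 1 for every $x$, its expectation under any density is also 1, which is why the terms weighted by $\pi^{11}_{jk}$ and $\pi^{00}_{jk}$ reduce to constants.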

However, as discussed in Section 3, it is difficult to optimize the loss function $\ell$ due to its high discontinuity. To solve this problem, the following corollary tells us that the unbiased estimator of the $L$-risk with respect to $p(x, y)$ can be established under a mild condition.

Corollary 1. The $L$-risk Eq.(6) can be equivalently re-written, up to an additive constant that does not depend on $f$, as

$$\sum_{1 \le j < k \le q} \left[ \frac{1}{\tilde{\pi}^{10}_{jk}}\,\tilde{\pi}^{10}_{jk}\,\mathbb{E}_{\tilde{p}^{10}_{jk}(x)}[\phi(f_j - f_k)] + \frac{1}{\tilde{\pi}^{01}_{jk}}\,\tilde{\pi}^{01}_{jk}\,\mathbb{E}_{\tilde{p}^{01}_{jk}(x)}[\phi(f_k - f_j)] \right], \tag{9}$$

if, for every $t$, the loss function $\phi$ satisfies

$$\phi(t) + \phi(-t) = 1, \tag{10}$$

where, for each label pair $y_j, y_k$, $\tilde{\pi}^{10}_{jk}\,\mathbb{E}_{\tilde{p}^{10}_{jk}(x)}[\phi(f_j - f_k)]$ (or $\tilde{\pi}^{01}_{jk}\,\mathbb{E}_{\tilde{p}^{01}_{jk}(x)}[\phi(f_k - f_j)]$) is the expected $L$-risk with respect to $\tilde{p}(x, \tilde{y})$ and can be directly estimated from PRO training examples with suitable surrogate loss functions. It is noteworthy that the symmetric condition Eq.(10) has been widely used in other weakly supervised learning frameworks [Du Plessis et al., 2014, Ishida et al., 2017].
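The symmetric condition Eq.(10) is easy to check numerically for a candidate surrogate. The sketch below tests three losses on a grid of margins. The sigmoid and hinge forms follow the definitions in the text; the ramp loss form used here, $\max(0, \min(1, (1 - t)/2))$, is an assumption (one common parameterization), as are the function names.

```python
import math


def sigmoid_loss(t: float) -> float:
    """phi(t) = 1 / (1 + exp(t))."""
    return 1.0 / (1.0 + math.exp(t))


def ramp_loss(t: float) -> float:
    """Assumed ramp form: clip (1 - t) / 2 to the interval [0, 1]."""
    return max(0.0, min(1.0, (1.0 - t) / 2.0))


def hinge_loss(t: float) -> float:
    """phi(t) = max(0, 1 - t)."""
    return max(0.0, 1.0 - t)


def is_symmetric(phi, grid, tol=1e-12) -> bool:
    """Check phi(t) + phi(-t) == 1 on every grid point (Eq. (10))."""
    return all(abs(phi(t) + phi(-t) - 1.0) <= tol for t in grid)


grid = [x / 10.0 for x in range(-50, 51)]
for name, phi in [("sigmoid", sigmoid_loss), ("ramp", ramp_loss), ("hinge", hinge_loss)]:
    print(name, is_symmetric(phi, grid))
```

A grid check is only a sanity test, not a proof; for the hinge loss it already fails at $t = 0$, where $\phi(0) + \phi(0) = 2$.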

With the theorem, we can train a multi-label classifier by minimizing the empirical approximation of $R_{\tilde{L}}(f)$ from PRO examples as follows:

$$\hat{R}_{\tilde{L}}(f) = \frac{1}{n} \sum_{i=1}^{n} \tilde{L}(f(x_i), \tilde{y}_i), \tag{11}$$

where

$$\tilde{L}(f(x_i), \tilde{y}_i) = \sum_{1 \le j < k \le q} \frac{1}{\tilde{\pi}^{10}_{jk}}\,\mathbb{I}(y_{ij} \succ y_{ik})\,\phi(f_j(x_i) - f_k(x_i)) + \frac{1}{\tilde{\pi}^{01}_{jk}}\,\mathbb{I}(y_{ij} \prec y_{ik})\,\phi(f_k(x_i) - f_j(x_i)). \tag{12}$$

Here, it is worth noting that, in contrast to previous cost-sensitive methods [Du Plessis et al., 2014], which often require extra assumptions or sophisticated techniques to obtain the cost coefficients, $\tilde{\pi}^{10}_{jk}$ (or $\tilde{\pi}^{01}_{jk}$) can be directly estimated from the observed PRO training data.
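The empirical risk of Eq.(11) with the cost-sensitive loss of Eq.(12) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: scores are plain Python lists, each PRO annotation is a list of ordered tuples `(a, b)` meaning label `a` is more relevant than label `b`, the prior dictionary is assumed to be given (for example from a counting estimate), and all function names are hypothetical.

```python
import math


def sigmoid_loss(t: float) -> float:
    """Symmetric surrogate phi(t) = 1 / (1 + exp(t))."""
    return 1.0 / (1.0 + math.exp(t))


def pro_example_loss(scores, pairs, priors, phi=sigmoid_loss) -> float:
    """Cost-sensitive loss of Eq. (12) for one example.

    scores: f(x_i), one real score per label.
    pairs:  hypothetical encoding of the PRO annotation, tuples (a, b) meaning
            "label a is more relevant than label b".
    priors: {(a, b): estimated probability of observing that ordering}.
    """
    total = 0.0
    for a, b in pairs:
        # For j < k, (a, b) = (j, k) yields phi(f_j - f_k) / prior of "j above k",
        # and (a, b) = (k, j) yields phi(f_k - f_j) / prior of "k above j".
        total += phi(scores[a] - scores[b]) / priors[(a, b)]
    return total


def empirical_risk(all_scores, annotations, priors, phi=sigmoid_loss) -> float:
    """Eq. (11): average the per-example losses over the n training examples."""
    n = len(annotations)
    return sum(
        pro_example_loss(s, p, priors, phi) for s, p in zip(all_scores, annotations)
    ) / n


all_scores = [[2.0, -1.0, 0.0], [0.5, 0.5, -2.0]]
annotations = [[(0, 1), (2, 1)], [(1, 0)]]
priors = {(0, 1): 0.5, (2, 1): 0.5, (1, 0): 0.5}
print(empirical_risk(all_scores, annotations, priors))
```

Orderings that are rare in the training data receive a larger weight through the division by their prior, which is the cost-sensitive part of the estimator. In practice the scores would come from a differentiable model so that the risk can be minimized by gradient-based optimization.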

5 Theoretical Analysis

In this section, we provide the estimation error bound for the proposed unbiased estimator and further prove that it is consistent with respect to ranking loss.

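Let $\sigma = \{\sigma_1, \ldots, \sigma_n\}$ be $n$ Rademacher variables, where each $\sigma_i$ is drawn independently and uniformly from $\{-1, +1\}$. Then, the Rademacher complexity with respect to function class $F$ and loss function $L$ can be formulated as follows:

$$\mathfrak{R}_n(L \circ F) = \mathbb{E}_{x, y, \sigma}\left[ \sup_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \sigma_i L(f(x_i), y_i) \right].$$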

Based on the deﬁnition, we can establish the following lemma.

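Lemma 2. Let $\mathfrak{R}_n(\tilde{L} \circ F)$ be the Rademacher complexity of the loss function $\tilde{L}$ and function class $F$ over $\tilde{D}$ of $n$ training examples drawn from $\tilde{p}(x, \tilde{y})$, which can be defined as

$$\mathfrak{R}_n(\tilde{L} \circ F) = \mathbb{E}_{\tilde{D}}\,\mathbb{E}_{\sigma}\left[ \sup_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \sigma_i \tilde{L}(f(x_i), \tilde{y}_i) \right].$$

Then, $\mathfrak{R}_n(\tilde{L} \circ F) \le 2 K \pi C_\phi \mathfrak{R}_n(F)$,

where $\pi = \max_{j,k} \frac{1}{\tilde{\pi}^{10}_{jk}}$, $j, k \in [q]$, and $C_\phi$ is the Lipschitz constant of $\phi$.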
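To give some intuition for the quantity $\mathfrak{R}_n(F)$ that appears in the bound, the sketch below estimates the empirical Rademacher complexity by Monte Carlo for one specific function class: linear scorers with a bounded Euclidean norm. This class, the function name, and the synthetic data are assumptions made for illustration; the lemma itself holds for a general class $F$.

```python
import math
import random


def empirical_rademacher_linear(xs, radius, n_draws=2000, seed=0) -> float:
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {x -> <w, x> : ||w||_2 <= radius} on the sample xs.

    For this class the supremum has a closed form (Cauchy-Schwarz):
    sup_w (1/n) sum_i s_i <w, x_i> = (radius / n) * ||sum_i s_i x_i||_2.
    """
    rng = random.Random(seed)
    n, d = len(xs), len(xs[0])
    total = 0.0
    for _ in range(n_draws):
        signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        summed = [sum(s * x[c] for s, x in zip(signs, xs)) for c in range(d)]
        total += radius / n * math.sqrt(sum(v * v for v in summed))
    return total / n_draws


data_rng = random.Random(1)
for n in (25, 100, 400):
    xs = [[data_rng.uniform(-1, 1) for _ in range(5)] for _ in range(n)]
    print(n, round(empirical_rademacher_linear(xs, radius=1.0, n_draws=200), 4))
```

For this bounded-norm class the estimate shrinks roughly like $1/\sqrt{n}$ as the sample grows, which is the behavior the consistency argument relies on.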

Based on Lemma 2, we can establish the uniform deviation bound of $\hat{R}_{\tilde{L}}(f)$ as follows:

Lemma 3. For the loss function $\phi$ bounded by $\Theta$ and any $\delta > 0$, with probability at least $1 - \delta$, we have

$$\max_{f \in F} \left| \hat{R}_{\tilde{L}}(f) - R_{\tilde{L}}(f) \right| \le 4 K \pi C_\phi \mathfrak{R}_n(F) + K \pi \Theta\,[\ldots]$$

Based on Lemma 3, we can derive the estimation error bound as follows, which further shows that learning from PRO examples can be multi-label consistent with respect to ranking loss.

Theorem 2. For any $\delta > 0$, with probability at least $1 - \delta$, we have

$$R_L(\hat{f}) - \min_{f \in F} R_L(f) \le 8 K \pi C_\phi \mathfrak{R}_n(F) + 2 K \pi \Theta\,[\ldots]$$

where $\hat{f}$ is trained by minimizing $\hat{R}_{\tilde{L}}(f)$. Furthermore, if $\phi$ is a differentiable and non-increasing function with $\phi'(0) < 0$ and $\phi(t) + \phi(-t) = 2\phi(0)$, then learning from PRO data with the modified loss function $\tilde{L}$ in Eq.(12) is consistent w.r.t. ranking loss, i.e., there exists a non-negative concave function $\xi$ with $\xi(0) = 0$ such that

$$R(\hat{f}) - R^* \le \xi\left( R_L(\hat{f}) - R^*_L \right).$$

Theorem 2 tells us that learning from PRO examples is consistent with respect to ranking loss. As $n \to \infty$, if $R_L(\hat{f}) = R^*_L$, then we have the consistency $R(\hat{f}) = R(f^*)$, since $\mathfrak{R}_n(F) \to 0$ for all parametric models with a bounded norm, such as deep networks trained with weight decay [Lu et al., 2018]. Based on the above discussion, the Sigmoid loss $\phi(t) = \frac{1}{1 + e^{t}}$ is a suitable surrogate loss function in our case, since it satisfies the symmetric condition Eq.(10) and has been proven to be consistent with respect to ranking loss [Gao and Zhou, 2013].
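As a short worked check of the conditions in Theorem 2 for the Sigmoid loss $\phi(t) = 1/(1 + e^{t})$, multiplying the second fraction by $e^{t}/e^{t}$ gives the symmetric condition, and differentiating gives the sign of the slope:

```latex
\begin{aligned}
\phi(t) + \phi(-t) &= \frac{1}{1 + e^{t}} + \frac{1}{1 + e^{-t}}
                    = \frac{1}{1 + e^{t}} + \frac{e^{t}}{e^{t} + 1} = 1 = 2\,\phi(0), \\
\phi'(t) &= -\frac{e^{t}}{(1 + e^{t})^{2}} < 0 \quad \text{for all } t,
\qquad \phi'(0) = -\tfrac{1}{4} .
\end{aligned}
```

Hence the Sigmoid loss is differentiable and non-increasing, has $\phi'(0) < 0$, and satisfies both Eq.(10) and $\phi(t) + \phi(-t) = 2\phi(0)$.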

6 Experiment

In this section, to validate the effectiveness of the proposed method, we perform the experiments on varied datasets with multiple evaluation metrics.

6.1 Experimental Settings

Figure 1: Comparison results with varying number of label pairs on Multi-MNIST, Multi-KMNIST and Multi-FMNIST. [Plots omitted. Methods: u-PRO, b-PRO, b-CCMN, u-CCMN. Panels: (a) Ranking Loss, (b) Hamming Loss, (c) Coverage, (d) Average Precision. x-axis: Number of Label Pairs (3 to 8).]

Datasets We evaluate our method on five multi-label datasets: Multi-MNIST² [Finn et al., 2017], Multi-Kuzushiji-MNIST (Multi-KMNIST for short), Multi-Fashion-MNIST³ (Multi-FMNIST for short), VOC2007⁴ [Everingham et al., 2010] and MSCOCO⁵ [Lin et al., 2014]. For the three Multi-MNIST-style datasets, we randomly sample 6,000 images for training and 4,000 images for testing. VOC2007 contains 9,963 images for 20 object categories, which are divided into train, val and test sets. Following [Chen et al., 2019, 2018], we use the trainval set to train the models, and evaluate the performance on the test set. MSCOCO contains 82,081 images as the training set and 40,504 images as the validation set. We randomly sample 20,000 images from the training set for training and 10,000 images from the validation set for testing. For each dataset, we randomly sample $K$ pairs of labels and assign their relevance ordering, where $K$ varies among {3, 4, 5, 6, 7, 8} for Multi-MNIST-style datasets, {6, 8, 10, 12, 14, 16} for VOC2007 and {5, 10, 15, 20, 25, 30} for MSCOCO. In particular, when two labels are both positive (or both negative), we decide their pairwise relevance ordering randomly, i.e., one of the two labels is randomly chosen to be the more relevant one. For each dataset, we repeat the experiments five times and report the average performance.

² See https://github.com/shaohua0116/MultiDigitMNIST for Multi-MNIST.
³ Similar to Multi-MNIST, we construct Multi-Kuzushiji-MNIST and Multi-Fashion-MNIST from two commonly used datasets, Kuzushiji-MNIST and Fashion-MNIST, respectively.
⁴ See http://host.robots.ox.ac.uk/pascal/VOC/voc2007/ for VOC2007.
⁵ See https://cocodataset.org for MSCOCO.
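The pair-sampling protocol described in the Datasets paragraph (sample $K$ label pairs, orient each by the ground-truth labels, and break ties at random) can be sketched as follows. This is an illustrative reading of the protocol: sampling pairs without replacement, the tuple encoding, and the function name `sample_pro_pairs` are assumptions not stated in the text.

```python
import random
from itertools import combinations


def sample_pro_pairs(labels, num_pairs, rng):
    """Sample `num_pairs` distinct label pairs and orient each one.

    Returns tuples (a, b) meaning "label a is more relevant than label b".
    A pair with one relevant and one irrelevant label is oriented by the labels;
    a pair with equal labels (both 1 or both 0) is oriented uniformly at random.
    """
    candidates = list(combinations(range(len(labels)), 2))
    chosen = rng.sample(candidates, num_pairs)  # assumption: no repeated pairs
    ordered = []
    for j, k in chosen:
        if labels[j] > labels[k]:
            ordered.append((j, k))
        elif labels[j] < labels[k]:
            ordered.append((k, j))
        else:
            ordered.append((j, k) if rng.random() < 0.5 else (k, j))
    return ordered


rng = random.Random(0)
labels = [1, 0, 0, 1, 0]
print(sample_pro_pairs(labels, num_pairs=4, rng=rng))
```

By construction, an irrelevant label is never placed above a relevant one, so every generated ordering falls into one of the three cases listed in Section 4.1.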

Metrics We evaluate the performance of the proposed method based on multiple standard multi-label criteria: ranking loss, hamming loss, coverage and average precision. For ranking loss, hamming loss and coverage, smaller values indicate better performance; for average precision, larger values indicate better performance. Details of these criteria can be found in [Zhang and Zhou, 2013].
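For readers unfamiliar with these criteria, the sketch below computes per-instance versions of hamming loss, coverage and average precision following their commonly used definitions. Several details are assumptions made here rather than taken from the paper: scores are thresholded at 0.5 for hamming loss, ties in the ranking are broken by label index, each instance is assumed to have at least one relevant label, and the function names are hypothetical.

```python
def ranks(scores):
    """Rank 1 = highest score; ties are broken by label index (an assumption)."""
    order = sorted(range(len(scores)), key=lambda j: (-scores[j], j))
    rank = [0] * len(scores)
    for position, j in enumerate(order, start=1):
        rank[j] = position
    return rank


def hamming_loss(scores, labels, threshold=0.5) -> float:
    """Fraction of labels whose thresholded prediction disagrees with the truth."""
    wrong = sum(int(s > threshold) != y for s, y in zip(scores, labels))
    return wrong / len(labels)


def coverage(scores, labels) -> int:
    """Steps down the ranked list needed to cover every relevant label."""
    rank = ranks(scores)
    return max(rank[j] for j, y in enumerate(labels) if y == 1) - 1


def average_precision(scores, labels) -> float:
    """Mean, over relevant labels, of the precision at that label's rank."""
    rank = ranks(scores)
    relevant = [j for j, y in enumerate(labels) if y == 1]
    total = 0.0
    for j in relevant:
        at_or_above = sum(1 for m in relevant if rank[m] <= rank[j])
        total += at_or_above / rank[j]
    return total / len(relevant)


scores = [0.9, 0.1, 0.6, 0.8]
labels = [1, 0, 1, 0]
print(hamming_loss(scores, labels), coverage(scores, labels), average_precision(scores, labels))
```

Dataset-level values are obtained by averaging these per-instance quantities over the test set.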

Methods Under the PRO framework, the proposed method that minimizes $\hat{R}_{\tilde{L}}(f)$ in Eq.(11) with the Sigmoid loss function is denoted by Unbiased-PRO (u-PRO for short). We compare with the baseline Biased-PRO (b-PRO for short), which minimizes the empirical approximation of the biased classification risk in Eq.(7) with the hinge loss function. Note that PRO is a new learning framework, and no existing method can be directly applied to PRO problems. We therefore employ a recently proposed framework called CCMN [Xie and Huang, 2021a] to transform the PRO problem into an MLL problem with class-conditional multi-label noise (CCMN) by regarding $y$ as the positive label and $y'$ as the negative label for each label pair $y \succ y'$. We then compare with the following methods: Biased-CCMN (b-CCMN for short), which directly learns a multi-label classifier with noisy labels; and Unbiased-CCMN (u-CCMN for short), which employs the unbiased estimator proposed in [Xie and Huang, 2021a] to solve the transformed CCMN problem. Note that for u-CCMN, the true noise rates (the probability of a positive (negative) label being flipped into a negative (positive) one) are given in the experiments.

Implementation For experiments on Multi-MNIST-style datasets, we train a linear model using the Adam optimizer [Kingma and Ba, 2015] with a learning rate of 0.001. We add an ℓ2-regularization term with a regularization parameter of 0.0001. For experiments on VOC2007 and MSCOCO, we use an AlexNet [Krizhevsky et al., 2012] and a ResNet-18 [He et al., 2016] pre-trained on the ILSVRC2012 dataset on the PyTorch platform [Paszke et al., 2019]. The AlexNet and ResNet-18 are trained using stochastic gradient descent (SGD) with a learning rate of 0.0001. An ℓ2-regularization term is added with a regularization parameter of 0.0001. The batch size for all datasets is set to 200. All the experiments are conducted on GeForce RTX 2080 GPUs.

Figure 2: Comparison results with varying number of label pairs for each instance on VOC2007. [Plots omitted. Methods: u-PRO, b-PRO, b-CCMN, u-CCMN. Panels: (a) Ranking Loss, (b) Hamming Loss, (c) Coverage, (d) Average Precision. x-axis: Number of Label Pairs (6 to 16).]

Figure 3: Comparison results with varying number of label pairs for each instance on MSCOCO. [Plots omitted. Methods: u-PRO, b-PRO, b-CCMN, u-CCMN. Panels: (a) Ranking Loss, (b) Hamming Loss, (c) Coverage, (d) Average Precision. x-axis: Number of Label Pairs (5 to 30).]

6.2 Performance Comparison

Figure 1 illustrates the performance curve of each compared method as the number of label pairs for each instance increases, in terms of four evaluation metrics on the three Multi-MNIST-style datasets. From the figures, we make the following observations: 1) b-PRO and b-CCMN achieve the worst performance, which indicates that neither directly minimizing the empirical biased classification risk nor simply conducting a binary classification transformation can solve PRO problems, since these two methods may suffer from over-fitting due to the bias of the risk estimation. 2) Compared to b-CCMN, u-CCMN achieves promising performance in most cases. This observation demonstrates that the CCMN framework is effective for tackling PRO problems to some extent. 3) Our proposed u-PRO method achieves the best performance in almost all cases and significantly outperforms u-CCMN. It is worth noting that u-CCMN utilizes the true noise rates, which are usually unavailable in practice, and thus the superiority of the proposed method would be even more significant in real-world settings.

Figure 2 and Figure 3 illustrate the performance curve of each compared method as the number of label pairs for each instance increases, in terms of four evaluation metrics on VOC2007 and MSCOCO, respectively. From the figures, it can be observed that our proposed u-PRO method achieves the best performance in all cases. It seems that u-CCMN performs unstably on VOC2007: it even obtains worse results than the baseline b-PRO in terms of hamming loss and average precision. One possible reason is that u-CCMN suffers from over-fitting when a complex model is used (in the experiments, AlexNet is used for VOC2007). These results validate that the proposed unbiased estimator can effectively solve PRO problems.

6.3 Ablation Study

In this section, we conduct some ablation experiments to provide empirical validations for the theoretical analysis proposed in the paper.

Figure 4: Comparison results with varying number of label pairs for each instance on Multi-MNIST, Multi-KMNIST and Multi-FMNIST in terms of ranking loss and hamming loss. [Plots omitted. Methods: u-PRO-Sigmoid, u-PRO-Ramp, u-PRO-Hinge, b-PRO-Hinge. Panels: (a) Multi-MNIST, (b) Multi-KMNIST, (c) Multi-FMNIST, each with a Ranking Loss plot and a Hamming Loss plot. x-axis: Number of Label Pairs (3 to 8).]

We first examine the unbiasedness of the proposed estimator. As discussed in Section 4.2, the unbiased estimator is composed of two components, i.e., the cost-sensitive estimator Eq.(9) and a symmetric surrogate loss function that satisfies Eq.(10). Here, u-PRO-Sigmoid and u-PRO-Ramp represent the empirical cost-sensitive estimator Eq.(11) with Sigmoid and Ramp losses, respectively, both of which satisfy the symmetric condition. u-PRO-Hinge and b-PRO-Hinge represent the cost-sensitive estimator Eq.(11) and the biased estimator Eq.(7), respectively, with the hinge loss, which does not satisfy the symmetric condition.

Due to the page limit, Figure 4 only reports the performance curves of these four estimators in terms of ranking loss and hamming loss on the Multi-MNIST, Multi-KMNIST and Multi-FMNIST datasets. From the figures, we make the following observations: 1) u-PRO-Sigmoid and u-PRO-Ramp achieve better performance than u-PRO-Hinge and b-PRO-Hinge in almost all cases, which indicates that both components, i.e., the cost-sensitive estimator and the symmetric surrogate loss function, contribute to obtaining the unbiased estimator for solving the PRO problem; 2) u-PRO-Hinge significantly outperforms b-PRO-Hinge, which indicates that the cost-sensitive estimator plays an important role in achieving unbiased risk estimation. This observation tells us that even without a symmetric surrogate loss function, we can obtain promising results in practice by utilizing the cost-sensitive estimator. Finally, from Figure 4, it can be observed that u-PRO-Sigmoid generally achieves better performance than u-PRO-Ramp in terms of ranking loss, which provides an empirical validation of Theorem 2, since the Sigmoid loss has been proven to be consistent with respect to ranking loss while the Ramp loss is not [Gao and Zhou, 2013].

7 Conclusion

In this paper, we study the problem of multi-label classification with pairwise relevance ordering, where each instance is assigned the relative order of label pairs. To solve PRO problems, we propose an empirical estimator of the classification risk based on a cost-sensitive loss. Theoretically, we show that the proposed estimator is unbiased if the surrogate loss function satisfies the symmetric condition. We derive the estimation error bound for the proposed method, and further prove that learning from PRO examples with the proposed unbiased estimator is consistent with respect to ranking loss. Finally, we experimentally examine the effectiveness of the proposed method on multiple datasets and evaluation metrics. In the future, we will study PRO problems by considering the data generation process.

8 Acknowledgments

This research was supported by the National Key R&D Program of China (2020AAA0107000), Natural Science Foundation of Jiangsu Province of China (BK20211517), and NSFC (62076128, 61732006).

Han Bao, Gang Niu, and Masashi Sugiyama. Classiﬁcation from pairwise similarity and unlabeled data. In International Conference on Machine Learning, pages 452 461. PMLR, 2018.

Han Bao, Takuya Shimada, Liyuan Xu, Issei Sato, and Masashi Sugiyama. Similarity-based classiﬁcation: Connecting similarity learning to binary classiﬁcation. ar Xiv preprint ar Xiv:2006.06207, 2020.

Serhat Selcuk Bucak, Rong Jin, and Anil K Jain. Multi-label learning with incomplete class assignments. In CVPR 2011, pages 2801 2808. IEEE, 2011.

Tianshui Chen, Zhouxia Wang, Guanbin Li, and Liang Lin. Recurrent attentional reinforcement learning for multi-label image recognition. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence, volume 32, 2018.

Zhao-Min Chen, Xiu-Shen Wei, Peng Wang, and Yanwen Guo. Multi-label image recognition with graph convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5177 5186, 2019.

Timothee Cour, Ben Sapp, and Ben Taskar. Learning from partial labels. The Journal of Machine Learning Research, 12:1501–1536, 2011.

Marthinus C Du Plessis, Gang Niu, and Masashi Sugiyama. Analysis of learning from positive and unlabeled data. Advances in neural information processing systems, 27:703–711, 2014.

André Elisseeff and Jason Weston. A kernel method for multi-labelled classification. In Advances in neural information processing systems, pages 681–687, 2002.

Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.

Lei Feng, Senlin Shu, Nan Lu, Bo Han, Miao Xu, Gang Niu, Bo An, and Masashi Sugiyama. Pointwise binary classification with pairwise confidence comparisons. In International Conference on Machine Learning, pages 3252–3262. PMLR, 2021.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.

Wei Gao and Zhi-Hua Zhou. On the consistency of multi-label learning. Artificial Intelligence, 199(1):22–44, 2013.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

Sheng-Jun Huang, Songcan Chen, and Zhi-Hua Zhou. Multi-label active learning: query type matters. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 946–952, 2015.

Takashi Ishida, Gang Niu, Weihua Hu, and Masashi Sugiyama. Learning from complementary labels. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 5644–5654, 2017.

Liping Jing, Liu Yang, Jian Yu, and Michael K Ng. Semi-supervised low-rank mapping learning for multi-label classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1483–1491, 2015.

Atsushi Kanehira and Tatsuya Harada. Multi-label ranking from positive and unlabeled data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5138–5146, 2016.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.

Xiangnan Kong, Michael K Ng, and Zhi-Hua Zhou. Transductive multilabel learning via label set propagation. IEEE Transactions on Knowledge and Data Engineering, 25(3):704–719, 2011.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097–1105, 2012.

Yining Li, Chen Huang, Chen Change Loy, and Xiaoou Tang. Human attribute recognition by deep hierarchical contexts. In European Conference on Computer Vision, pages 684–700. Springer, 2016.

Yuncheng Li, Yale Song, and Jiebo Luo. Improving pairwise ranking for multi-label image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3617–3625, 2017.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.

Weiwei Liu, Haobo Wang, Xiaobo Shen, and Ivor Tsang. The emerging trends of multi-label learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.

Yi Liu, Rong Jin, and Liu Yang. Semi-supervised multi-label learning by constrained non-negative matrix factorization. In AAAI, 2006.

Nan Lu, Gang Niu, Aditya Krishna Menon, and Masashi Sugiyama. On the minimal supervision for training any binary classifier from only unlabeled data. In International Conference on Learning Representations, 2018.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.

Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for multi-label classification. Machine learning, 85(3):333, 2011.

Takuya Shimada, Han Bao, Issei Sato, and Masashi Sugiyama. Classification from pairwise similarities/dissimilarities and unlabeled data via empirical risk minimization. Neural Computation, 33(5):1234–1268, 2021.

Lijuan Sun, Songhe Feng, Tao Wang, Congyan Lang, and Yi Jin. Partial multi-label learning by low-rank and sparse decomposition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5016–5023, 2019.

Yu-Yin Sun, Yin Zhang, and Zhi-Hua Zhou. Multi-label learning with weak label. In Proceedings of the 24th AAAI Conference on Artificial Intelligence, 2010.

Ming-Kun Xie and Sheng-Jun Huang. Partial multi-label learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pages 4302–4309, 2018.

Ming-Kun Xie and Sheng-Jun Huang. Ccmn: A general framework for learning with class-conditional multi-label noise. arXiv preprint arXiv:2105.07338, 2021a.

Ming-Kun Xie and Sheng-Jun Huang. Partial multi-label learning with noisy label identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021b.

Ming-Kun Xie, Feng Sun, and Sheng-Jun Huang. Partial multi-label learning with meta disambiguation. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 1904–1912, 2021.

Miao Xu, Yu-Feng Li, and Zhi-Hua Zhou. Robust multi-label learning with pro loss. IEEE Transactions on Knowledge and Data Engineering, 32(8):1610–1624, 2019.

Yan Yan and Yuhong Guo. Adversarial partial multi-label learning with label disambiguation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 10568–10576, 2021.

Guoxian Yu, Xia Chen, Carlotta Domeniconi, Jun Wang, Zhao Li, Zili Zhang, and Xindong Wu. Feature-induced partial multi-label learning. In 2018 IEEE International Conference on Data Mining (ICDM), pages 1398–1403. IEEE, 2018.

Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit Dhillon. Large-scale multi-label learning with missing labels. In International conference on machine learning, pages 593–601. PMLR, 2014.

Tingting Yu, Guoxian Yu, Jun Wang, Carlotta Domeniconi, and Xiangliang Zhang. Partial multi-label learning using label compression. In 2020 IEEE International Conference on Data Mining (ICDM), pages 761–770. IEEE, 2020.

Wang Zhan and Min-Ling Zhang. Inductive semi-supervised multi-label learning with co-training. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1305–1314, 2017.

Min-Ling Zhang and Jun-Peng Fang. Partial multi-label learning via credible label elicitation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

Min-Ling Zhang and Zhi-Hua Zhou. A review on multi-label learning algorithms. IEEE transactions on knowledge and data engineering, 26(8):1819–1837, 2013.