# pointwise_binary_classification_with_pairwise_confidence_comparisons__486aa395.pdf

Pointwise Binary Classiﬁcation with Pairwise Conﬁdence Comparisons

Lei Feng 1 Senlin Shu 2 Nan Lu 3 4 Bo Han 5 4 Miao Xu 6 4 Gang Niu 4 Bo An 7 Masashi Sugiyama 4 3

To alleviate the data requirement for training effective binary classiﬁers in binary classiﬁcation, many weakly supervised learning settings have been proposed. Among them, some consider using pairwise but not pointwise labels, when pointwise labels are not accessible due to privacy, conﬁdentiality, or security reasons. However, as a pairwise label denotes whether or not two data points share a pointwise label, it cannot be easily collected if either point is equally likely to be positive or negative. Thus, in this paper, we propose a novel setting called pairwise comparison (Pcomp) classiﬁcation, where we have only pairs of unlabeled data that we know one is more likely to be positive than the other. Firstly, we give a Pcomp data generation process, derive an unbiased risk estimator (URE) with theoretical guarantee, and further improve URE using correction functions. Secondly, we link Pcomp classiﬁcation to noisylabel learning to develop a progressive URE and improve it by imposing consistency regularization. Finally, we demonstrate by experiments the effectiveness of our methods, which suggests Pcomp is a valuable and practically useful type of pairwise supervision besides the pairwise label.

1. Introduction

Traditional supervised learning techniques have achieved great advances, while they require precisely labeled data. In many real-world scenarios, it may be too difﬁcult to collect such data. To alleviate this issue, a large number of

1College of Computer Science, Chongqing University, China 2College of Computer and Information Science, Southwest University, China 3The University of Tokyo, Japan 4RIKEN Center for Advanced Intelligence Project, Japan 5Department of Computer Science, Hong Kong Baptist University, China 6School of Information Technology and Electrical Engineering, The University of Queensland, Australia 7School of Computer Science and Engineering, Nanyang Technological University, Singapore. Correspondence to: Lei Feng <lfeng@cqu.edu.cn>.

Proceedings of the 38 th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

weakly supervised learning problems (Zhou, 2018) have been extensively studied, including semi-supervised learning (Zhu & Goldberg, 2009; Niu et al., 2013; Sakai et al., 2018), multi-instance learning (Zhou et al., 2009; Sun et al., 2016; Zhang & Zhou, 2017), noisy-label learning (Han et al., 2018; Xia et al., 2019; Wei et al., 2020), partiallabel learning (Zhang et al., 2017; Feng et al., 2020b; Lv et al., 2020), complementary-label learning (Ishida et al., 2017; Yu et al., 2018; Ishida et al., 2019; Feng et al., 2020a), positive-unlabeled classiﬁcation (Elkan & Noto, 2008; Niu et al., 2016; Gong et al., 2019; Chen et al., 2020), positive-conﬁdence classiﬁcation (Ishida et al., 2018), similar-unlabeled classiﬁcation (Bao et al., 2018), similardissimilar classiﬁcation (Shimada et al., 2020; Bao et al., 2020), unlabeled-unlabeled classiﬁcation (Lu et al., 2019; 2020), and triplet classiﬁcation (Cui et al., 2020).

Among these weakly supervised learning problems, some of them (Bao et al., 2018; Shimada et al., 2020; Bao et al., 2020) consider learning a binary classiﬁer with pairwise labels that indicate whether two instances belong to (similar) the same class or not (dissimilar), when pointwise labels are not accessible due to privacy, conﬁdentiality, or security reasons. However, if either of the two instances is equally likely to be positive or negative, it becomes difﬁcult for us to accurately collect the underlying pairwise label of them. This motivates us to consider using another type of pairwise supervision (instead of the pairwise label) for successfully learning a binary classiﬁer.

In this paper, we propose a novel setting called pairwise comparison (Pcomp) classiﬁcation, where we aim to perform pointwise binary classiﬁcation with only pairwise comparison data. A pairwise comparison (x, x ) represents that the instance x has a larger conﬁdence of belonging to the positive class than the instance x . Such weak supervision (pairwise conﬁdence comparison) could be much easier for people to collect than full supervision (pointwise label) in practice, especially for applications on sensitive or private matters. For example, it may be difﬁcult to collect sensitive or private data with pointwise labels, as asking for the true labels could be prohibited or illegal. In this case, it could be easier for people to collect other weak supervision like the comparison information between two examples.

It is also advantageous to consider pairwise conﬁdence com-

Pointwise Binary Classiﬁcation with Pairwise Conﬁdence Comparisons

parisons in pointwise binary classiﬁcation with class overlapping, where the labeling task is difﬁcult, and even experienced labelers may provide wrong pointwise labels. Let us denote the labeling standard of a labeler as p(y|x) and assume that an instance x1 is more positive than another instance x2. Facing the difﬁcult labeling task, different labelers may hold different labeling standards, p(y = +1|x1) > p(y = +1|x2) > 1/2, p(y = +1|x1) > 1/2 > p(y = +1|x2), and 1/2 > p(y = +1|x1) > p(y = +1|x2), thereby providing different pointwise labels: (+1, +1), (+1, 1), ( 1, 1). We can ﬁnd that different labelers may provide inconsistent pointwise labels, while pairwise conﬁdence comparisons are unanimous and accurate. One may argue that we could aggregate multiple labels of the same instance using crowdsourcing learning methods (Whitehill et al., 2009; Raykar et al., 2010). However, as not every instance will be labeled by multiple labelers, it is not always applicable to crowdsourcing learning methods. Therefore, our proposed Pcomp classiﬁcation is useful in this case.

Our main contributions can be summarized as follows:

We propose pairwise comparison (Pcomp) classiﬁcation, a novel weakly supervised learning setting, and present a mathematical formulation for the generation process of pairwise comparison data.

We prove that an unbiased risk estimator (URE) can be derived, propose an empirical risk minimization (ERM) based method, and present an improvement using correction functions (Lu et al., 2020) for alleviating overftting when complex models are used.

We start from the noisy-label learning perspective to introduce the Rank Pruning method (Northcutt et al., 2017) that holds a progressive URE for solving our proposed Pcomp classiﬁcation problem and improve it by imposing consistency regularization.

Extensive experimental results demonstrate the effectiveness of our proposed solutions for Pcomp classiﬁcation.

2. Preliminaries

Binary classiﬁcation with pairwise comparisons and extra pointwise labels has been studied (Xu et al., 2017; Kane et al., 2017), while our work focuses on a more challenging problem where only pairwise comparison examples are provided. To the best of our knowledge, we are the ﬁrst to investigate such a challenging problem. Unlike previous studies (Xu et al., 2017; Kane et al., 2017) that leverage some pointwise labels to differentiate the labels of pairwise comparisons, our methods are purely based on ERM with only pairwise comparisons. In the next, we brieﬂy introduce some notations and review related problem formulations.

Binary Classiﬁcation. Since our paper focuses on how to train a binary classiﬁer from pairwise comparison data,

we ﬁrst review the problem formulation of binary classiﬁcation. Let the feature space be X and the label space be Y = {+1, 1}. Suppose the collected dataset is denoted by D = {(xi, yi)}n i=1 where each example (xi, yi) is independently sampled from the joint distribution with density p(x, y), which includes an instance xi X and a label yi Y. The goal of binary classiﬁcation is to train an optimal classiﬁer f : X 7 R by minimizing the following (expected) classiﬁcation risk:

R(f) = Ep(x,y) ℓ(f(x), y)

= π+Ep+(x) ℓ(f(x), +1)

+ π Ep (x) ℓ(f(x), 1) , (1)

where ℓ: R Y 7 R+ denotes a binary loss function, π+ := p(y = +1) (or π := p(y = 1)) denotes the positive (or negative) class prior probability, and p+(x) := p(x|y = +1) (or p (x) := p(x|y = 1)) denotes the class-conditional probability density of the positive (or negative) data. ERM approximates the expectations over p+(x) and p (x) by the empirical averages of positive and negative data and the empirical risk is minimized with respect to the classiﬁer f.

Positive-Unlabeled (PU) Classiﬁcation. In some realworld scenarios, it may be difﬁcult to collect negative data, and only positive (P) and unlabeled (U) data are available. PU classiﬁcation aims to train an effective binary classiﬁer in this weakly supervised setting. Previous studies (du Plessis et al., 2014; 2015; Kiryo et al., 2017) showed that the classiﬁcation risk R(f) in Eq. (1) can be rewritten only in terms of positive and unlabeled data as

RPU(f) = π+Ep+(x) ℓ(f(x), +1) ℓ(f(x), 1)

+ Ep(x) ℓ(f(x), 1) , (2)

where p(x) = π+p+(x) + π p (x) denotes the density of unlabeled data. This risk expression immediately allows us to employ ERM in terms of positive and unlabeled data.

Unlabeled-Unlabeled (UU) Classiﬁcation. The recent studies (Lu et al., 2019; 2020) showed that it is possible to train a binary classiﬁer only from two unlabeled datasets with different class priors. Lu et al. (2019) showed that the classiﬁcation risk R(f) can be rewritten as

RUU(f) = Eptr(x) h(1 θ )π+

θ θ ℓ(f(x), +1)

θ θ ℓ(f(x), 1) i

+ Eptr (x ) hθ(1 π+)

θ θ ℓ(f(x ), 1)

θ θ ℓ(f(x ), +1) i , (3)

Pointwise Binary Classiﬁcation with Pairwise Conﬁdence Comparisons

where θ and θ are different class priors of two unlabeled datasets, and ptr(x) and ptr (x ) are the densities of two datasets of unlabeled data, respectively. This risk expression immediately allows us to employ ERM only from two sets of unlabeled data. For RUU(f) in Eq. (3), if we set θ = 1, θ = π+, and replace ptr(x) and ptr (x ) by p+(x) and p(x) respectively, then we can recover RPU(f) in Eq. (2). Therefore, UU classiﬁcation could be taken as a generalized framework of PU classiﬁcation in terms of URE. Besides, Eq. (3) also recovers a complicated URE of similarunlabeled classiﬁcation (Bao et al., 2018) by setting θ = π+ and θ = π2 +/(2π2 + 2π+ + 1).

To solve our proposed Pcomp classiﬁcation problem, we will present a mathematical formulation for the generation process of pairwise comparison data, based on which we will explore two UREs that are compatible with any model and optimizer to train a binary classiﬁer by ERM and establish the corresponding estimation error bounds.

3. Data Generation Process

In order to derive UREs for performing ERM, we ﬁrst formulate the underlying generation process of pairwise comparison data1, which consists of pairs of unlabeled data that we know which one is more likely to be positive. Suppose the provided dataset is denoted by e D = {(xi, x i)}n i=1 where (xi, x i) (with their unknown true labels (yi, y i)) is expected to satisfy p(yi = +1|xi) > p(y i = +1|x i).

It is clear that we could easily collect pairwise comparison data if the positive conﬁdence (i.e., p(y = +1|x)) of each instance could be obtained. However, such information is much harder to obtain than class labels in real-world scenarios. Therefore, unlike some studies (Ishida et al., 2018; Shinoda et al., 2020) that assume the positive conﬁdence of each instance is provided by the labeler, we only assume that the labeler has access to the labels of training data. Speciﬁcally, we adopt the assumption (Cui et al., 2020) that weakly supervised examples are ﬁrst sampled from the true data distribution, but the labels are only accessible to the labeler. Then, the labeler would provide us weakly supervised information (i.e., pairwise comparison information) according to the labels of sampled data pairs. That is, for any pair of unlabeled data (x, x ), the labeler would tell us whether (x, x ) could be collected as a pairwise comparison for Pcomp classiﬁcation, based on the labels (y, y ) rather than the positive conﬁdences (p(y = +1|x), p(y = +1|x )).

Now, the question becomes: how does the labeler consider (x, x ) as a pairwise comparison for Pcomp classiﬁcation, in terms of the labels (y, y )? As shown in our previous exam-

1In contrast to Xu et al. (2019) and Xu et al. (2020) that utilized pairwise comparison data to solve the regression problem, we focus on binary classiﬁcation.

ple of binary classiﬁcation with class overlapping, we could infer that the labels (y, y ) of our required pairwise comparison data (x, x ) for Pcomp classiﬁcation can only be one of the three cases {(+1, 1), (+1, +1), ( 1, 1)}, because the condition p(y = +1|x) p(y = +1|x ) is deﬁnitely violated if (y, y ) = ( 1, +1). Therefore, we assume that the labeler would take (x, x ) as a pairwise comparison example in the dataset e D, if the labels (y, y ) of (x, x ) belong to the above three cases. It is also worth noting that for a pair of data (x, x ) with labels (y, y ) = ( 1, +1), the labeler would take (x , x) as a pairwise comparison example. Because by exchanging the positions of (x, x ), (x , x) would be associated with labels (+1, 1), which belong to the three cases. In summary, we assume that pairwise comparison data are sampled from those pairs of data whose labels belong to the three cases {(+1, 1), (+1, +1), ( 1, 1)}. Based on the above described generation process of pairwise comparison data, we have the following theorem.

Theorem 1. According to the generation process of pairwise comparison data described above, let

ep(x, x ) = q(x, x ) π2 + + π2 + π+π , (4)

where q(x, x ) = π2 +p+(x)p+(x ) + π2 p (x)p (x ) + π+π p+(x)p (x ). Then, we can conclude that the collected pairwise comparison data are independently drawn from ep(x, x ), i.e., e D = {(xi, x i)}n i=1 i.i.d. ep(x, x ).

The proof is provided in Appendix A. Theorem 1 provides an explicit expression of the probability density of pairwise comparison data.

Next, we would like to extract pointwise information from pairwise information, since our goal is to perform pointwise binary classiﬁcation. Let us denote the pointwise data collected from e D = {(xi, x i)}n i=1 by breaking the pairwise comparison relation as e D+ = {xi}n i=1 and e D = {x i}n i=1. Then we can obtain the following theorem.

Theorem 2. Pointwise examples in e D+ and e D are independently drawn from ep+(x) and ep (x ), where

ep+(x) = π+ π2 + π+ p+(x) + π2 π2 + π+ p (x),

ep (x ) = π2 + π2 + + π p+(x ) + π π2 + + π p (x ).

The proof is provided in Appendix B. Theorem 2 shows the relationships between the pointwise densities and the class-conditional densities. Besides, it indicates that from pairwise comparison data, we can essentially obtain examples that are independently drawn from ep+(x) and ep (x ).

Pointwise Binary Classiﬁcation with Pairwise Conﬁdence Comparisons

4. The Proposed Methods

In this section, we explore two UREs to train a binary classiﬁer by ERM from only pairwise comparison data with the above generation process.

4.1. Corrected Pcomp Classiﬁcation

In Eq. (1), the classiﬁcation risk R(f) could be separately expressed as the expectations over p+(x) and p (x). Although we do not have access to the two class-conditional densities p+(x) and p (x), we can represent them by our introduced pointwise densities ep+(x) and ep (x).

Lemma 1. We can express p+(x) and p (x) in terms of ep+(x) and ep+(x) as

ep+(x) π ep (x) ,

ep (x) π+ep+(x) .

The proof is provided in Appendix C. As a result of Lemma 1, we can express the classiﬁcation risk R(f) using only pairwise comparison data sampled from ep+(x) and ep (x).

Theorem 3. The classiﬁcation risk R(f) can be equivalently expressed as

RPC(f) = Eep+(x) ℓ(f(x), +1) π+ℓ(f(x), 1) (5)

+ Eep (x ) ℓ(f(x ), 1) π ℓ(f(x ), +1) .

The proof is provided in Appendix D. In this way, we could train a binary classiﬁer by minimizing the following empirical approximation of RPC(f):

b RPC(f) = 1

i=1 ℓ(f(xi), +1) + ℓ(f(x i), 1) (6)

π+ℓ(f(xi), 1) π ℓ(f(x i), +1) .

Estimation Error Bound. Here, we establish an estimation error bound for the proposed URE. Let F = {f : X 7 R} be the model class, bf PC = arg minf F b RPC(f) be the empirical risk minimizer, and f = arg minf F R(f) be the true risk minimizer. Let e R+ n (F) and e R n (F) be the Rademacher complexities (Bartlett & Mendelson, 2002) of F with sample size n over ep+(x) and ep (x) respectively.

Theorem 4. Suppose the loss function ℓis ρ-Lipschitz with respect to the ﬁrst argument (0 ρ ), and all functions in the model class F are bounded, i.e., there exists a positive constant Cb such that f Cb for any f F. Let Cℓ:= supz Cb,t= 1 ℓ(z, t). Then for any δ > 0, with probability at least 1 δ, we have

R( bf PC) R(f ) (1 + π+)4ρe R+ n (F)

+ (1 + π )4ρe R n (F) + 6Cℓ q

The proof is provided in Appendix E. Theorem 4 shows that our proposed method is consistent, i.e., as n , R( bf PC) R(f ), since e R+ n (F), e R n (F) 0 for all parametric models with a bounded norm such as deep neural networks trained with weight decay (Golowich et al., 2017; Lu et al., 2019). Besides, e R+ n (F) and e R n (F) can be normally bounded by CF/ n for a positive constant CF. Hence, we can further see that the convergence rate is Op(1/ n) where Op denotes the order in probability. This order is the optimal parametric rate for ERM without additional assumptions (Mendelson, 2008).

Relation to UU Classiﬁcation. It is worth noting that the URE of UU classiﬁcation RUU(f) is quite general for binary classiﬁcation with weak supervision. Hence we also would like to show the relationships between our proposed estimator RPC(f) and RUU(f). We demonstrate by the following corollary that under some conditions, RUU(f) is equivalent to RPC(f).

Corollary 1. By setting ptr = ep+(x), p tr = ep (x), θ = π+/(1 π+ + π2 +), and θ = π2 +/(1 π+ + π2 +), Eq. (3) is equivalent to Eq. (5), which means that RUU(f) is equivalent to RPC(f).

We omit the proof of Corollary 1, since it is straightforward to derive Eq. (5) from Eq. (3) with required notations.

Empirical Risk Correction. As shown by Lu et al. (2020), directly minimizing b RPC(f) would suffer from overﬁtting when complex models are used due to the negative risk issue. More speciﬁcally, since negative terms are included in Eq. (6), the empirical risk can be negative even though the original true risk can never be negative. To ease this problem, they wrapped the terms in b RUU(f) that cause a negative empirical risk by certain consistent correction functions deﬁned in Lu et al. (2020), such as the rectiﬁed linear unit (Re LU) function g(z) = max(0, z) and absolute value function g(z) = |z|. These consistent correction functions could also be applied to b RPC for alleviating overﬁtting when complex models are used. In this way, we could obtain the following corrected empirical risk estimator:

b Rc PC(f) = g 1

ℓ(f(xi), +1) π ℓ(f(x i), +1)

ℓ(f(x i), 1) π+ℓ(f(xi), 1) . (7)

4.2. Progressive Pcomp Classiﬁcation

Here, we start from the noisy-label learning perspective to solve the Pcomp classiﬁcation problem. Intuitively, we could simply perform binary classiﬁcation by regarding the data from ep+(x) as (noisy) positive data and the data from ep (x) as (noisy) negative data. However, this naive solution

Pointwise Binary Classiﬁcation with Pairwise Conﬁdence Comparisons

could be inevitably affected by noisy labels. In this scenario, we denote the noise rates as ρ = p(ey = +1|y = 1) and ρ+ = p(ey = 1|y = +1) where ey is the observed (noisy) label and y is the true label, and denote the inverse noise rates as φ+ = p(y = 1|ey = +1) and φ = p(y = +1|ey = 1). According to the deﬁned generation process of pairwise comparison data, we have the following theorem. Theorem 5. The following equalities hold:

φ+ = π2 π2 + + π2 + π+π , ρ+ = π+ 1 + π+ ,

φ = π2 + π2 + + π2 + π+π , ρ = π 1 + π .

The proof is provided in Appendix F.

Theorem 5 shows that the noise rates can be obtained if we regard the Pcomp classiﬁcation problem as the noisy-label learning problem. With known noise rates for noisy-label learning, it was shown (Natarajan et al., 2013; Northcutt et al., 2017) that a URE could be derived. Here, we adopt the Rank Pruning method (Northcutt et al., 2017) because it holds a progressive URE by selecting conﬁdent examples using the learning model and achieves state-of-the-art performance. Speciﬁcally, we denote by the dataset composed of all the observed positive data e P, i.e., e P = {xi}n i=1, where xi is independently sampled from ep+(x). Similarly, the dataset composed of all the observed negative data is denoted by e N, i.e., e N = {x i}n i=1, where x i is independently sampled from ep (x ). Then, conﬁdent examples will be selected from e P and e N by ranking the outputs of the model f. We denote the selected positive data from e P as e Psel, and the selected negative data from e N as e Nsel:

e Psel = arg max P:|P|=(1 φ+)| e P|

x {P e P} f(x),

e Nsel = arg min N:|N|=(1 φ )| e N|

x {N e N} f(x).

Then we show that if the model f satisﬁes the separability condition, i.e., for any true positive instance xp and for any true negative instance xn, we have f(xp) > f(xn). In other words, if the model output of every true positive instance is always larger than that of every true negative instance, we could obtain a URE. We name it progressive URE, as the model f is progressively optimized. Theorem 6 (Theorem 5 in (Northcutt et al., 2017)). Assume that the model f satisﬁes the above separability condition, then the classiﬁcation risk R(f) can be equivalently expressed as

Rp PC(f) = Eep+(x) hℓ(f(x), +1)

1 ρ+ I[x e Psel] i

+ Eep (x ) hℓ(f(x ), 1)

1 ρ I[x e Nsel] i ,

where I[ ] denotes the indicator function.

In this way, we have the following empirical approximation of Rp PC:

b Rp PC(f) = 1

ℓ(f(xi), +1)

1 ρ+ I[xi e Psel]

+ ℓ(f(x i), 1) 1 ρ I[x i e Nsel] . (8)

Estimation Error Bound. It worth noting that Northcutt et al. (2017) did not prove the learning consistency for the Rank Pruning method. Here, we establish an estimation error bound for this method, which guarantees the learning consistency. Let bfp PC = arg minf F b Rp PC(f) be the empirical risk minimizer of the Rank Pruning method, then we have the following theorem.

Theorem 7. Suppose the loss function ℓis ρ-Lipschitz with respect to the ﬁrst argument (0 ρ ), and all functions in the model class F are bounded, i.e., there exists a positive constant Cb such that f Cb for any f F. Let Cℓ:= supz Cb,t= 1 ℓ(z, t). Then for any δ > 0, with probability at least 1 δ, we have

R( bfp PC) R(f ) 2 1 ρ+

2ρe R+ n (F) + Cℓ q

2ρe R n (F) + Cℓ q

The proof is provided in Appendix G. Theorem 7 shows that the above method is consistent and this estimation error bound also attains the optimal convergence rate without any additional assumptions (Mendelson, 2008).

Regularization. For the above Rank Pruning method, its URE is based on the assumption that the learning model could satisfy the separability condition. Thus, its performance heavily depends on the accuracy of the learning model. However, as the learning model is progressively updated, some of the selected conﬁdent examples may still contain label noise during the training process. As a result, the Rank Pruning method would be affected by incorrectly selected data. A straightforward improvement could be to improve the output quality of the learning model. Motivated by Mean Teacher used in semi-supervised learning (Tarvainen & Valpola, 2017), we also resort to a teacher model that is an exponential moving average of model snapshots, i.e., Θ t = αΘ t 1 + (1 α)Θt, where Θ

denotes the parameters of the teacher model, Θ denotes the parameters of the learning model, the subscript t denotes the training step, and α is a smoothing coefﬁcient hyper-parameter. Such a teacher model could guide the learning model to produce high-quality outputs. To learn from the teacher model, we leverage consistency regularization Ω(f) = Ex fΘ(x) fΘ (x) 2 (Laine & Aila,

Pointwise Binary Classiﬁcation with Pairwise Conﬁdence Comparisons

2016; Tarvainen & Valpola, 2017) to make the learning model consistent with the teacher model for improving the performance of the Rank Pruning method.

5. Experiments

In this section, we conduct extensive experiments to evaluate the performance of our proposed Pcomp classiﬁcation methods on various datasets using different models.

Datasets. We use four popular benchmark datasets, including MNIST (Le Cun et al., 1998), Fashion-MNIST (Xiao et al., 2017), Kuzushiji-MNIST (Clanuwat et al., 2018), and CIFAR-10 (Krizhevsky et al., 2009). We train a multilayer perceptron (MLP) model with three hidden layers of width 300 and Re LU activation functions (Nair & Hinton, 2010) and batch normalization (Ioffe & Szegedy, 2015) on the ﬁrst three datasets. We train a Res Net-34 model (He et al., 2016) on the CIFAR-10 dataset. We also use USPS and three datasets from the UCI machine learning repository (Blake & Merz, 1998) including Pendigits, Optdigits, and CNAE-9. We train a linear model on these datasets, since they are not large-scale datasets. The brief descriptions of all used datasets with the corresponding models are reported in Table 1. Since these datasets are specially used for multi-class classiﬁcation, we manually transformed them into binary classiﬁcation datasets (see Appendix H). As we have shown in Theorem 2, the pairwise comparison examples can be equivalently transformed into pointwise examples, which are more convenient to generate. Therefore, we generate pointwise examples in experiments. Speciﬁcally, as Theorem 5 discloses the noise rates in our deﬁned data generation process, we simply generate pointwise corrupted examples according to the derived noise rates.

Methods. For our proposed Pcomp classiﬁcation problem, we propose the following methods:

Pcomp-Unbiased, which denotes the proposed method that minimizes b RPC(f) in Eq. (6). Pcomp-Re LU, which denotes the proposed method that minimizes b Rc PC(f) in Eq. (7) using the Re LU function as the risk correction function. Pcomp-ABS, which denotes the proposed method that minimizes b Rc PC(f) in Eq. (7) using the absolute value function as the risk correction function. Pcomp-Teacher, which improves the Rank Pruning method by imposing consistency regularization to make the learning model consistent with a teacher model. Besides, we compare with the following baselines:

Binary-Biased, which conducts binary classiﬁcation by regarding the data from ep+(x) as positive data and the data from ep (x) as negative data. This is a straightforward method to handle the Pcomp classiﬁcation problem. In our setting, Binary-Biased reduces to the BER minimization method (Menon et al., 2015).

Table 1. Speciﬁcation of the used benchmark datasets and models. These datasets are specially processed (see Appendix H) for performing Pcomp classiﬁcation.

Dataset # Train # Test # Features # Classes Model

MNIST 60,000 10,000 784 10 MLP Fashion 60,000 10,000 784 10 MLP Kuzushiji 60,000 10,000 784 10 MLP CIFAR-10 50,000 10,000 3,072 10 Res Net-34

USPS 7,437 1,861 256 10 Linear Pendigits 8,793 2,199 16 10 Linear Optdigits 4,495 1,125 62 10 Linear CNAE-9 864 216 856 9 Linear

Noisy-Unbiased, which denotes the noisy-label learning method that minimizes the empirical approximation of the URE proposed by (Natarajan et al., 2013).

Rank Pruning, which denotes the noisy-label learning method (Northcutt et al., 2017) that minimizes the empirical risk b Rp PC(f) in Eq. (8).

For all learning methods, we take the logistic loss as the binary loss function ℓ(i.e., ℓ(z) = ln(1+exp( z))), for fair comparisons. The hyper-parameter settings of all learning methods are aslo reported in Appendix H. We implement our methods using Py Torch (Paszke et al., 2019) and use the Adam (Kingma & Ba, 2015) optimizer with mini-batch size set to 256 and the number of training epochs set to 200 for the four large-scale datasets and 100 for other four datasets. All the experiments are conducted on Ge Force GTX 1080 Ti GPUs. Codes are avaible in supplementary materials.

Experimental Setup. We evaluate the performance of all learning methods under different class prior settings, i.e., π+ is selected from {0.2, 0.5, 0.8}. It is worth noting that we could estimate π+ according to our described data generation process. Speciﬁcally, we can exactly estimate eπ by counting the fraction of collected pairwise comparison data in all the sampled pairs of data. Since eπ = π2 + + π = π2 ++1 π+, we have π+ = 1/2 p

eπ 3/4 (if π+ < π ) or π+ = 1/2 + p

eπ 3/4 (if π+ π ). Therefore, if we know whether π+ is larger than π , we could exactly estimate the true class prior π+. For simplicity, we assume that the class prior π+ is known for all the methods. We repeat the sampling-and-training process 5 times for all learning methods on all datasets and record the mean accuracy with standard deviation (mean std).

Experimental Results with Complex Models. Table 2 records the classiﬁcation accuracy of each method on the four benchmark datasets with different class priors. From Table 2, we have the following observations:

Binary-Biased always achieves the worst performance, which means that our Pcomp classiﬁcation problem cannot be simply solved by binary classiﬁcation.

Pointwise Binary Classiﬁcation with Pairwise Conﬁdence Comparisons

Table 2. Classiﬁcation accuracy (mean std) of each method on the four benchmark datasets with different class priors. The best performance is highlighted in bold.

Class Prior Methods MNIST Kuzushiji Fashion CIFAR-10

Noisy-Unbiased 0.806 0.026 0.712 0.014 0.865 0.050 0.665 0.119 Binary-Biased 0.282 0.018 0.584 0.021 0.371 0.067 0.499 0.170 Rank Pruning 0.933 0.005 0.810 0.006 0.938 0.005 0.840 0.005

Pcomp-ABS 0.893 0.013 0.849 0.007 0.898 0.008 0.833 0.008 Pcomp-Re LU 0.927 0.010 0.838 0.014 0.927 0.022 0.812 0.007 Pcomp-Unbiased 0.782 0.025 0.693 0.041 0.835 0.038 0.584 0.011 Pcomp-Teacher 0.958 0.007 0.835 0.015 0.954 0.004 0.841 0.006

Noisy-Unbiased 0.892 0.013 0.656 0.096 0.908 0.031 0.642 0.031 Binary-Biased 0.537 0.026 0.615 0.008 0.457 0.032 0.470 0.039 Rank Pruning 0.888 0.004 0.782 0.008 0.917 0.005 0.793 0.015

Pcomp-ABS 0.832 0.010 0.732 0.007 0.899 0.006 0.712 0.023 Pcomp-Re LU 0.875 0.009 0.729 0.019 0.918 0.016 0.746 0.030 Pcomp-Unbiased 0.877 0.012 0.690 0.084 0.906 0.037 0.631 0.031 Pcomp-Teacher 0.942 0.004 0.791 0.013 0.957 0.002 0.795 0.012

Noisy-Unbiased 0.800 0.036 0.749 0.038 0.815 0.058 0.591 0.019 Binary-Biased 0.299 0.066 0.558 0.005 0.444 0.020 0.371 0.113 Rank Pruning 0.939 0.004 0.830 0.012 0.939 0.004 0.843 0.010

Pcomp-ABS 0.815 0.009 0.825 0.011 0.879 0.014 0.827 0.013 Pcomp-Re LU 0.916 0.013 0.827 0.011 0.925 0.015 0.811 0.007 Pcomp-Unbiased 0.793 0.016 0.721 0.037 0.823 0.032 0.569 0.007 Pcomp-Teacher 0.958 0.007 0.836 0.015 0.955 0.004 0.844 0.014

Pcomp-Unbiased is inferior to Pcomp-ABS and Pcomp Re LU. This observation accords with what we have discussed, i.e., directly minimizing b RPC(f) would suffer from overﬁtting when complex models are used, because there are negative terms included in b RPC(f) and the empirical risk can be negative during the training process. In contrast, Pcomp-Re LU and Pcomp-ABS employ consistent correction functions on b RPC(f) so that the empirical risk will never be negative. Therefore, when complex models such as deep neural networks are used, Pcomp-Re LU and Pcomp-ABS are expected to outperform Pcomp-Unbiased.

Pcomp-Teacher achieves the best performance in most cases. This observation veriﬁes the effectiveness of the imposed consistency regularization, which makes the learning model consistent with a teacher model, to improve the quality of selected conﬁdent examples by the Rank Pruning method.

It is worth noting that the standard deviations of Binary Biased, Pcomp-Unbiased, and Noisy-Unbiased are sometimes higher than other methods. This is because the three methods suffer from overﬁtting when complex models are used, and the performance could be quite unstable in different trials.

Experimental Results with Simple Models. Table 3 reports the classiﬁcation accuracy of each method on the four UCI datasets with different class priors. From Table 3, we

have the following observations:

Binary-Biased achieves the worst performance in nearly all cases, which also implies that we need to develop other novel methods for Pcomp classiﬁcation.

Pcomp-Unbiased has comparable performance to its variants Pcomp-ABS and Pcomp-Re LU, because Pcomp Unbiased does not suffer from overﬁtting when the linear model is used, and it is not necessary to use consistent correction functions anymore.

Pcomp-Teacher is still better than Rank Pruning, while it is sometimes inferior to other Pcomp classiﬁcation methods. This is because the linear model is not as powerful as neural networks, thus the selected conﬁdent examples may not be so reliable.

Performance of Increasing Pairwise Comparisons. As shown by Theorem 4 and Theorem 7, the performance of our Pcomp classiﬁcation methods is expected to be improved if more pairwise comparisons are given. To empirically validate such theoretical ﬁndings, we further conduct experiments on Pendigits and Optdigits with class prior π+ = 0.5 and π+ = 0.8 by changing the fraction of pairwise comparisons (100% means that we use all the generated pairwise comparisons in the training process). As shown in Figure 1, the classiﬁcation accuracy of our methods generally increases given more pairwise comparisons. This observation is clearly in accordance with our derived estimation error

Pointwise Binary Classiﬁcation with Pairwise Conﬁdence Comparisons

Table 3. Classiﬁcation accuracy (mean std) of each method on the four UCI datasets with different class priors. The best performance is highlighted in bold.

Class Prior Methods USPS Pendigits Optdigits CNAE-9

Noisy-Unbiased 0.921 0.010 0.857 0.012 0.885 0.015 0.830 0.019 Binary-Biased 0.752 0.016 0.639 0.050 0.705 0.026 0.629 0.040 Rank Pruning 0.931 0.007 0.780 0.061 0.878 0.011 0.780 0.046

Pcomp-ABS 0.909 0.005 0.863 0.013 0.871 0.013 0.819 0.018 Pcomp-Re LU 0.917 0.006 0.868 0.012 0.875 0.011 0.835 0.011 Pcomp-Unbiased 0.922 0.007 0.867 0.015 0.881 0.018 0.806 0.023 Pcomp-Teacher 0.927 0.009 0.847 0.035 0.886 0.016 0.687 0.052

Noisy-Unbiased 0.911 0.007 0.826 0.014 0.838 0.011 0.766 0.039 Binary-Biased 0.913 0.007 0.761 0.049 0.834 0.013 0.773 0.051 Rank Pruning 0.925 0.008 0.840 0.019 0.864 0.027 0.672 0.068

Pcomp-ABS 0.912 0.008 0.839 0.010 0.841 0.013 0.749 0.047 Pcomp-Re LU 0.912 0.007 0.844 0.008 0.846 0.013 0.766 0.038 Pcomp-Unbiased 0.911 0.006 0.846 0.009 0.847 0.014 0.763 0.040 Pcomp-Teacher 0.926 0.008 0.840 0.019 0.869 0.021 0.787 0.047

Noisy-Unbiased 0.918 0.017 0.890 0.013 0.884 0.006 0.838 0.017 Binary-Biased 0.731 0.017 0.634 0.042 0.554 0.106 0.621 0.037 Rank Pruning 0.934 0.010 0.856 0.018 0.863 0.025 0.737 0.050

Pcomp-ABS 0.900 0.017 0.889 0.014 0.882 0.002 0.824 0.013 Pcomp-Re LU 0.921 0.018 0.894 0.012 0.888 0.004 0.839 0.015 Pcomp-Unbiased 0.899 0.014 0.883 0.006 0.874 0.008 0.840 0.013 Pcomp-Teacher 0.937 0.008 0.880 0.011 0.884 0.010 0.781 0.033

20% 40% 60% 80% 100% #pairwise comparisons

classification accuracy

Pendigits with π+ = 0.5

Pcomp-Unbiased Pcomp-Re LU Pcomp-ABS Pcomp-Teacher

20% 40% 60% 80% 100% #pairwise comparisons

classification accuracy

Pendigits with π+ = 0.8

Pcomp-Unbiased Pcomp-Re LU Pcomp-ABS Pcomp-Teacher

20% 40% 60% 80% 100% #pairwise comparisons

classification accuracy

Optdigits with π+ = 0.5

Pcomp-Unbiased Pcomp-Re LU Pcomp-ABS Pcomp-Teacher

20% 40% 60% 80% 100% #pairwise comparisons

classification accuracy

Optdigits with π+ = 0.8

Pcomp-Unbiased Pcomp-Re LU Pcomp-ABS Pcomp-Teacher

Figure 1. The classiﬁcation accuracy of our proposed Pcomp classiﬁcation methods when the number of pairwise comparisons increases.

bounds, because the estimation error would decrease as the number of pairwise comparisons increases.

6. Conclusion and Future Work

In this paper, we proposed a novel weakly supervised learning setting called pairwise comparison (Pcomp) classiﬁcation, where we aim to train a binary classiﬁer from only pairwise comparison data, i.e., two examples that we know one is more likely to be positive than the other, instead of pointwise labeled data. Pcomp classiﬁcation is useful for private classiﬁcation tasks where we are not allowed to directly access labels and subjective classiﬁcation tasks where labelers have different labeling standards. To solve the Pcomp classiﬁcation problem, we presented a mathematical formulation for the generation process of pairwise comparison data, based on which we explored two unbiased risk estimators (UREs) to train a binary classiﬁer by em-

pirical risk minimization and established the corresponding estimation error bounds. We ﬁrst proved that a URE can be derived and improved it using correction functions. Then, we started from the noisy-label learning perspective to introduce a progressive URE and improved it by imposing consistency regularization. Finally, experiments demonstrated the effectiveness of our proposed methods.

In future work, we will apply Pcomp classiﬁcation to solve some challenging real-world problems like binary classiﬁcation with class overlapping. In addition, we could also extend Pcomp classiﬁcation to the multi-class classiﬁcation setting by using the one-versus-all strategy. Suppose there are multiple classes, we are given pairs of unlabeled data that we know which one is more likely to belong to a speciﬁc class. Then, we can use the proposed methods in this paper to train a binary classiﬁer for each class. Finally, by comparing the outputs of these binary classiﬁers, the predicted class can be determined.

Pointwise Binary Classiﬁcation with Pairwise Conﬁdence Comparisons

Acknowledgements

This research was supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-RP-2019-0013), National Satellite of Excellence in Trustworthy Software Systems (Award No: NSOE-TSS2019-01), and NTU. NL was supported by MEXT scholarship No. 171536 and the MSRA D-CORE Program. BH was supported by the RGC Early Career Scheme No. 22200720, NSFC Young Scientists Fund No. 62006202, and HKBU CSD Departmental Incentive Grant. GN and MS were supported by JST AIP Acceleration Research Grant Number JPMJCR20U3, Japan. MS was also supported by the Institute for AI and Beyond, UTokyo.

Bao, H., Niu, G., and Sugiyama, M. Classiﬁcation from pairwise similarity and unlabeled data. In ICML, pp. 452 461, 2018.

Bao, H., Shimada, T., Xu, L., Sato, I., and Sugiyama, M. Similarity-based classiﬁcation: Connecting similarity learning to binary classiﬁcation. ar Xiv preprint ar Xiv:2006.06207, 2020.

Bartlett, P. L. and Mendelson, S. Rademacher and gaussian complexities: Risk bounds and structural results. JMLR, 3(11):463 482, 2002.

Blake, C. L. and Merz, C. J. Uci repository of machine learning databases, 1998. URL http://archive. ics.uci.edu/ml/index.php.

Chen, X., Chen, W., Chen, T., Yuan, Y., Gong, C., Chen, K., and Wang, Z. Self-pu: Self boosted and calibrated positive-unlabeled training. In ICML, pp. 1510 1519, 2020.

Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A., Yamamoto, K., and Ha, D. Deep learning for classical Japanese literature. ar Xiv preprint ar Xiv:1812.01718, 2018.

Cui, Z.-H., Charoenphakdee, N., Sato, I., and Sugiyama, M. Classiﬁcation from triplet comparison data. Neural Computation, 32(3):659 681, 2020.

du Plessis, M. C., Niu, G., and Sugiyama, M. Analysis of learning from positive and unlabeled data. In Neur IPS, pp. 703 711, 2014.

du Plessis, M. C., Niu, G., and Sugiyama, M. Convex formulation for learning from positive and unlabeled data. In ICML, pp. 1386 1394, 2015.

Elkan, C. and Noto, K. Learning classiﬁers from only positive and unlabeled data. In KDD, pp. 213 220, 2008.

Feng, L., Kaneko, T., Han, B., Niu, G., An, B., and Sugiyama, M. Learning with multiple complementary labels. In ICML, pp. in press, 2020a.

Feng, L., Lv, J.-Q., Han, B., Xu, M., Niu, G., Geng, X., An, B., and Sugiyama, M. Provably consistent partial-label learning. In Neur IPS, 2020b.

Golowich, N., Rakhlin, A., and Shamir, O. Size-independent sample complexity of neural networks. ar Xiv preprint ar Xiv:1712.06541, 2017.

Gong, C., Shi, H., Liu, T.-L., Zhang, C., Yang, J., and Tao, D.-C. Loss decomposition and centroid estimation for positive and unlabeled learning. IEEE TPAMI, 2019.

Han, B., Yao, Q.-M., Yu, X.-R., Niu, G., Xu, M., Hu, W.- H., Tsang, I., and Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Neur IPS, pp. 8527 8537, 2018.

He, K.-M., Zhang, X.-Y., Ren, S.-Q., and Sun, J. Deep residual learning for image recognition. In CVPR, pp. 770 778, 2016.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ar Xiv preprint ar Xiv:1502.03167, 2015.

Ishida, T., Niu, G., Hu, W., and Sugiyama, M. Learning from complementary labels. In Neur IPS, pp. 5644 5654, 2017.

Ishida, T., Niu, G., and Sugiyama, M. Binary classiﬁcation for positive-conﬁdence data. In Neur IPS, pp. 5917 5928, 2018.

Ishida, T., Niu, G., Menon, A. K., and Sugiyama, M. Complementary-label learning for arbitrary losses and models. In ICML, pp. 2971 2980, 2019.

Kane, D. M., Lovett, S., Moran, S., and Zhang, J. Active classiﬁcation with comparison queries. In FOCS, pp. 355 366, 2017.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.

Kiryo, R., Niu, G., du Plessis, M. C., and Sugiyama, M. Positive-unlabeled learning with non-negative risk estimator. In Neur IPS, pp. 1674 1684, 2017.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Laine, S. and Aila, T. Temporal ensembling for semisupervised learning. ar Xiv preprint ar Xiv:1610.02242, 2016.

Pointwise Binary Classiﬁcation with Pairwise Conﬁdence Comparisons

Le Cun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998.

Lu, N., Niu, G., Menon, A. K., and Sugiyama, M. On the minimal supervision for training any binary classiﬁer from only unlabeled data. In ICLR, 2019.

Lu, N., Zhang, T.-Y., Niu, G., and Sugiyama, M. Mitigating overﬁtting in supervised classiﬁcation from two unlabeled datasets: A consistent risk correction approach. In AISTATS, 2020.

Lv, J.-Q., Xu, M., Feng, L., Niu, G., Geng, X., and Sugiyama, M. Progressive identiﬁcation of true labels for partial-label learning. In ICML, 2020.

Mendelson, S. Lower bounds for the empirical minimization algorithm. IEEE TIT, 54(8):3797 3803, 2008.

Menon, A., Van Rooyen, B., Ong, C. S., and Williamson, B. Learning from corrupted binary labels via classprobability estimation. In ICML, pp. 125 134, 2015.

Nair, V. and Hinton, G. E. Rectiﬁed linear units improve restricted boltzmann machines. In ICML, 2010.

Natarajan, N., Dhillon, I. S., Ravikumar, P. K., and Tewari, A. Learning with noisy labels. In Neur IPS, pp. 1196 1204, 2013.

Niu, G., Jitkrittum, W., Dai, B., Hachiya, H., and Sugiyama, M. Squared-loss mutual information regularization: A novel information-theoretic approach to semi-supervised learning. In ICML, pp. 10 18, 2013.

Niu, G., Plessis, M. C. d., Sakai, T., Ma, Y., and Sugiyama, M. Theoretical comparisons of positive-unlabeled learning against positive-negative learning. In Neur IPS, pp. 1199 1207, 2016.

Northcutt, C. G., Wu, T., and Chuang, I. L. Learning with conﬁdent examples: Rank pruning for robust classiﬁcation with noisy labels. In UAI, 2017.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. In Neur IPS, pp. 8026 8037, 2019.

Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C., Bogoni, L., and Moy, L. Learning from crowds. JMLR, 11(4), 2010.

Sakai, T., Niu, G., and Sugiyama, M. Semi-supervised auc optimization based on positive-unlabeled learning. MLJ, 107(4):767 794, 2018.

Shimada, T., Bao, H., Sato, I., and Sugiyama, M. Classiﬁcation from pairwise similarities/dissimilarities and unlabeled data via empirical risk minimization. Neural Computation, 2020.

Shinoda, K., Kaji, H., and Sugiyama, M. Binary classiﬁcation from positive data with skewed conﬁdence. In IJCAI, pp. 3328 3334, 2020.

Sun, M., Han, T. X., Liu, M.-C., and Khodayari Rostamabad, A. Multiple instance learning convolutional neural networks for object recognition. In ICPR, pp. 3270 3275, 2016.

Tarvainen, A. and Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Neur IPS, pp. 1195 1204, 2017.

Wei, H.-X., Feng, L., Chen, X.-Y., and An, B. Combating noisy labels by agreement: A joint training method with co-regularization. In CVPR, pp. 13726 13735, 2020.

Whitehill, J., Wu, T.-f., Bergsma, J., Movellan, J. R., and Ruvolo, P. L. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Neur IPS, pp. 2035 2043, 2009.

Xia, X.-B., Liu, T.-L., Wang, N.-N., Han, B., Gong, C., Niu, G., and Sugiyama, M. Are anchor points really indispensable in label-noise learning? In Neur IPS, pp. 6835 6846, 2019.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: A novel image dataset for benchmarking machine learning algorithms. ar Xiv preprint ar Xiv:1708.07747, 2017.

Xu, L.-Y., Honda, J., Niu, G., and Sugiyama, M. Uncoupled regression from pairwise comparison data. In Neur IPS, pp. 3992 4002, 2019.

Xu, Y., Zhang, H., Miller, K., Singh, A., and Dubrawski, A. Noise-tolerant interactive learning using pairwise comparisons. In Neur IPS, pp. 2431 2440, 2017.

Xu, Y., Balakrishnan, S., Singh, A., and Dubrawski, A. Regression with comparisons: Escaping the curse of dimensionality with ordinal information. JMLR, 21(162): 1 54, 2020.

Yu, X.-Y., Liu, T.-L., Gong, M.-M., and Tao, D.-C. Learning with biased complementary labels. In ECCV, pp. 68 83, 2018.

Zhang, M.-L., Yu, F., and Tang, C.-Z. Disambiguation-free partial label learning. TKDE, 29(10):2155 2167, 2017.

Zhang, Y.-L. and Zhou, Z.-H. Multi-instance learning with key instance shift. In IJCAI, pp. 3441 3447, 2017.

Pointwise Binary Classiﬁcation with Pairwise Conﬁdence Comparisons

Zhou, Z.-H. A brief introduction to weakly supervised learning. National Science Review, 5(1):44 53, 2018.

Zhou, Z.-H., Sun, Y.-Y., and Li, Y.-F. Multi-instance learning by treating instances as non-iid samples. In ICML, pp. 1249 1256, 2009.

Zhu, X.-J. and Goldberg, A. B. Introduction to semisupervised learning. Synthesis Lectures on Artiﬁcial Intelligence and Machine Learning, 3(1):1 130, 2009.