# pointwise_binary_classification_with_pairwise_confidence_comparisons__486aa395.pdf Pointwise Binary Classification with Pairwise Confidence Comparisons Lei Feng 1 Senlin Shu 2 Nan Lu 3 4 Bo Han 5 4 Miao Xu 6 4 Gang Niu 4 Bo An 7 Masashi Sugiyama 4 3 To alleviate the data requirement for training effective binary classifiers in binary classification, many weakly supervised learning settings have been proposed. Among them, some consider using pairwise but not pointwise labels, when pointwise labels are not accessible due to privacy, confidentiality, or security reasons. However, as a pairwise label denotes whether or not two data points share a pointwise label, it cannot be easily collected if either point is equally likely to be positive or negative. Thus, in this paper, we propose a novel setting called pairwise comparison (Pcomp) classification, where we have only pairs of unlabeled data that we know one is more likely to be positive than the other. Firstly, we give a Pcomp data generation process, derive an unbiased risk estimator (URE) with theoretical guarantee, and further improve URE using correction functions. Secondly, we link Pcomp classification to noisylabel learning to develop a progressive URE and improve it by imposing consistency regularization. Finally, we demonstrate by experiments the effectiveness of our methods, which suggests Pcomp is a valuable and practically useful type of pairwise supervision besides the pairwise label. 1. Introduction Traditional supervised learning techniques have achieved great advances, while they require precisely labeled data. In many real-world scenarios, it may be too difficult to collect such data. To alleviate this issue, a large number of 1College of Computer Science, Chongqing University, China 2College of Computer and Information Science, Southwest University, China 3The University of Tokyo, Japan 4RIKEN Center for Advanced Intelligence Project, Japan 5Department of Computer Science, Hong Kong Baptist University, China 6School of Information Technology and Electrical Engineering, The University of Queensland, Australia 7School of Computer Science and Engineering, Nanyang Technological University, Singapore. Correspondence to: Lei Feng . Proceedings of the 38 th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s). weakly supervised learning problems (Zhou, 2018) have been extensively studied, including semi-supervised learning (Zhu & Goldberg, 2009; Niu et al., 2013; Sakai et al., 2018), multi-instance learning (Zhou et al., 2009; Sun et al., 2016; Zhang & Zhou, 2017), noisy-label learning (Han et al., 2018; Xia et al., 2019; Wei et al., 2020), partiallabel learning (Zhang et al., 2017; Feng et al., 2020b; Lv et al., 2020), complementary-label learning (Ishida et al., 2017; Yu et al., 2018; Ishida et al., 2019; Feng et al., 2020a), positive-unlabeled classification (Elkan & Noto, 2008; Niu et al., 2016; Gong et al., 2019; Chen et al., 2020), positive-confidence classification (Ishida et al., 2018), similar-unlabeled classification (Bao et al., 2018), similardissimilar classification (Shimada et al., 2020; Bao et al., 2020), unlabeled-unlabeled classification (Lu et al., 2019; 2020), and triplet classification (Cui et al., 2020). Among these weakly supervised learning problems, some of them (Bao et al., 2018; Shimada et al., 2020; Bao et al., 2020) consider learning a binary classifier with pairwise labels that indicate whether two instances belong to (similar) the same class or not (dissimilar), when pointwise labels are not accessible due to privacy, confidentiality, or security reasons. However, if either of the two instances is equally likely to be positive or negative, it becomes difficult for us to accurately collect the underlying pairwise label of them. This motivates us to consider using another type of pairwise supervision (instead of the pairwise label) for successfully learning a binary classifier. In this paper, we propose a novel setting called pairwise comparison (Pcomp) classification, where we aim to perform pointwise binary classification with only pairwise comparison data. A pairwise comparison (x, x ) represents that the instance x has a larger confidence of belonging to the positive class than the instance x . Such weak supervision (pairwise confidence comparison) could be much easier for people to collect than full supervision (pointwise label) in practice, especially for applications on sensitive or private matters. For example, it may be difficult to collect sensitive or private data with pointwise labels, as asking for the true labels could be prohibited or illegal. In this case, it could be easier for people to collect other weak supervision like the comparison information between two examples. It is also advantageous to consider pairwise confidence com- Pointwise Binary Classification with Pairwise Confidence Comparisons parisons in pointwise binary classification with class overlapping, where the labeling task is difficult, and even experienced labelers may provide wrong pointwise labels. Let us denote the labeling standard of a labeler as p(y|x) and assume that an instance x1 is more positive than another instance x2. Facing the difficult labeling task, different labelers may hold different labeling standards, p(y = +1|x1) > p(y = +1|x2) > 1/2, p(y = +1|x1) > 1/2 > p(y = +1|x2), and 1/2 > p(y = +1|x1) > p(y = +1|x2), thereby providing different pointwise labels: (+1, +1), (+1, 1), ( 1, 1). We can find that different labelers may provide inconsistent pointwise labels, while pairwise confidence comparisons are unanimous and accurate. One may argue that we could aggregate multiple labels of the same instance using crowdsourcing learning methods (Whitehill et al., 2009; Raykar et al., 2010). However, as not every instance will be labeled by multiple labelers, it is not always applicable to crowdsourcing learning methods. Therefore, our proposed Pcomp classification is useful in this case. Our main contributions can be summarized as follows: We propose pairwise comparison (Pcomp) classification, a novel weakly supervised learning setting, and present a mathematical formulation for the generation process of pairwise comparison data. We prove that an unbiased risk estimator (URE) can be derived, propose an empirical risk minimization (ERM) based method, and present an improvement using correction functions (Lu et al., 2020) for alleviating overftting when complex models are used. We start from the noisy-label learning perspective to introduce the Rank Pruning method (Northcutt et al., 2017) that holds a progressive URE for solving our proposed Pcomp classification problem and improve it by imposing consistency regularization. Extensive experimental results demonstrate the effectiveness of our proposed solutions for Pcomp classification. 2. Preliminaries Binary classification with pairwise comparisons and extra pointwise labels has been studied (Xu et al., 2017; Kane et al., 2017), while our work focuses on a more challenging problem where only pairwise comparison examples are provided. To the best of our knowledge, we are the first to investigate such a challenging problem. Unlike previous studies (Xu et al., 2017; Kane et al., 2017) that leverage some pointwise labels to differentiate the labels of pairwise comparisons, our methods are purely based on ERM with only pairwise comparisons. In the next, we briefly introduce some notations and review related problem formulations. Binary Classification. Since our paper focuses on how to train a binary classifier from pairwise comparison data, we first review the problem formulation of binary classification. Let the feature space be X and the label space be Y = {+1, 1}. Suppose the collected dataset is denoted by D = {(xi, yi)}n i=1 where each example (xi, yi) is independently sampled from the joint distribution with density p(x, y), which includes an instance xi X and a label yi Y. The goal of binary classification is to train an optimal classifier f : X 7 R by minimizing the following (expected) classification risk: R(f) = Ep(x,y) ℓ(f(x), y) = π+Ep+(x) ℓ(f(x), +1) + π Ep (x) ℓ(f(x), 1) , (1) where ℓ: R Y 7 R+ denotes a binary loss function, π+ := p(y = +1) (or π := p(y = 1)) denotes the positive (or negative) class prior probability, and p+(x) := p(x|y = +1) (or p (x) := p(x|y = 1)) denotes the class-conditional probability density of the positive (or negative) data. ERM approximates the expectations over p+(x) and p (x) by the empirical averages of positive and negative data and the empirical risk is minimized with respect to the classifier f. Positive-Unlabeled (PU) Classification. In some realworld scenarios, it may be difficult to collect negative data, and only positive (P) and unlabeled (U) data are available. PU classification aims to train an effective binary classifier in this weakly supervised setting. Previous studies (du Plessis et al., 2014; 2015; Kiryo et al., 2017) showed that the classification risk R(f) in Eq. (1) can be rewritten only in terms of positive and unlabeled data as RPU(f) = π+Ep+(x) ℓ(f(x), +1) ℓ(f(x), 1) + Ep(x) ℓ(f(x), 1) , (2) where p(x) = π+p+(x) + π p (x) denotes the density of unlabeled data. This risk expression immediately allows us to employ ERM in terms of positive and unlabeled data. Unlabeled-Unlabeled (UU) Classification. The recent studies (Lu et al., 2019; 2020) showed that it is possible to train a binary classifier only from two unlabeled datasets with different class priors. Lu et al. (2019) showed that the classification risk R(f) can be rewritten as RUU(f) = Eptr(x) h(1 θ )π+ θ θ ℓ(f(x), +1) θ θ ℓ(f(x), 1) i + Eptr (x ) hθ(1 π+) θ θ ℓ(f(x ), 1) θ θ ℓ(f(x ), +1) i , (3) Pointwise Binary Classification with Pairwise Confidence Comparisons where θ and θ are different class priors of two unlabeled datasets, and ptr(x) and ptr (x ) are the densities of two datasets of unlabeled data, respectively. This risk expression immediately allows us to employ ERM only from two sets of unlabeled data. For RUU(f) in Eq. (3), if we set θ = 1, θ = π+, and replace ptr(x) and ptr (x ) by p+(x) and p(x) respectively, then we can recover RPU(f) in Eq. (2). Therefore, UU classification could be taken as a generalized framework of PU classification in terms of URE. Besides, Eq. (3) also recovers a complicated URE of similarunlabeled classification (Bao et al., 2018) by setting θ = π+ and θ = π2 +/(2π2 + 2π+ + 1). To solve our proposed Pcomp classification problem, we will present a mathematical formulation for the generation process of pairwise comparison data, based on which we will explore two UREs that are compatible with any model and optimizer to train a binary classifier by ERM and establish the corresponding estimation error bounds. 3. Data Generation Process In order to derive UREs for performing ERM, we first formulate the underlying generation process of pairwise comparison data1, which consists of pairs of unlabeled data that we know which one is more likely to be positive. Suppose the provided dataset is denoted by e D = {(xi, x i)}n i=1 where (xi, x i) (with their unknown true labels (yi, y i)) is expected to satisfy p(yi = +1|xi) > p(y i = +1|x i). It is clear that we could easily collect pairwise comparison data if the positive confidence (i.e., p(y = +1|x)) of each instance could be obtained. However, such information is much harder to obtain than class labels in real-world scenarios. Therefore, unlike some studies (Ishida et al., 2018; Shinoda et al., 2020) that assume the positive confidence of each instance is provided by the labeler, we only assume that the labeler has access to the labels of training data. Specifically, we adopt the assumption (Cui et al., 2020) that weakly supervised examples are first sampled from the true data distribution, but the labels are only accessible to the labeler. Then, the labeler would provide us weakly supervised information (i.e., pairwise comparison information) according to the labels of sampled data pairs. That is, for any pair of unlabeled data (x, x ), the labeler would tell us whether (x, x ) could be collected as a pairwise comparison for Pcomp classification, based on the labels (y, y ) rather than the positive confidences (p(y = +1|x), p(y = +1|x )). Now, the question becomes: how does the labeler consider (x, x ) as a pairwise comparison for Pcomp classification, in terms of the labels (y, y )? As shown in our previous exam- 1In contrast to Xu et al. (2019) and Xu et al. (2020) that utilized pairwise comparison data to solve the regression problem, we focus on binary classification. ple of binary classification with class overlapping, we could infer that the labels (y, y ) of our required pairwise comparison data (x, x ) for Pcomp classification can only be one of the three cases {(+1, 1), (+1, +1), ( 1, 1)}, because the condition p(y = +1|x) p(y = +1|x ) is definitely violated if (y, y ) = ( 1, +1). Therefore, we assume that the labeler would take (x, x ) as a pairwise comparison example in the dataset e D, if the labels (y, y ) of (x, x ) belong to the above three cases. It is also worth noting that for a pair of data (x, x ) with labels (y, y ) = ( 1, +1), the labeler would take (x , x) as a pairwise comparison example. Because by exchanging the positions of (x, x ), (x , x) would be associated with labels (+1, 1), which belong to the three cases. In summary, we assume that pairwise comparison data are sampled from those pairs of data whose labels belong to the three cases {(+1, 1), (+1, +1), ( 1, 1)}. Based on the above described generation process of pairwise comparison data, we have the following theorem. Theorem 1. According to the generation process of pairwise comparison data described above, let ep(x, x ) = q(x, x ) π2 + + π2 + π+π , (4) where q(x, x ) = π2 +p+(x)p+(x ) + π2 p (x)p (x ) + π+π p+(x)p (x ). Then, we can conclude that the collected pairwise comparison data are independently drawn from ep(x, x ), i.e., e D = {(xi, x i)}n i=1 i.i.d. ep(x, x ). The proof is provided in Appendix A. Theorem 1 provides an explicit expression of the probability density of pairwise comparison data. Next, we would like to extract pointwise information from pairwise information, since our goal is to perform pointwise binary classification. Let us denote the pointwise data collected from e D = {(xi, x i)}n i=1 by breaking the pairwise comparison relation as e D+ = {xi}n i=1 and e D = {x i}n i=1. Then we can obtain the following theorem. Theorem 2. Pointwise examples in e D+ and e D are independently drawn from ep+(x) and ep (x ), where ep+(x) = π+ π2 + π+ p+(x) + π2 π2 + π+ p (x), ep (x ) = π2 + π2 + + π p+(x ) + π π2 + + π p (x ). The proof is provided in Appendix B. Theorem 2 shows the relationships between the pointwise densities and the class-conditional densities. Besides, it indicates that from pairwise comparison data, we can essentially obtain examples that are independently drawn from ep+(x) and ep (x ). Pointwise Binary Classification with Pairwise Confidence Comparisons 4. The Proposed Methods In this section, we explore two UREs to train a binary classifier by ERM from only pairwise comparison data with the above generation process. 4.1. Corrected Pcomp Classification In Eq. (1), the classification risk R(f) could be separately expressed as the expectations over p+(x) and p (x). Although we do not have access to the two class-conditional densities p+(x) and p (x), we can represent them by our introduced pointwise densities ep+(x) and ep (x). Lemma 1. We can express p+(x) and p (x) in terms of ep+(x) and ep+(x) as ep+(x) π ep (x) , ep (x) π+ep+(x) . The proof is provided in Appendix C. As a result of Lemma 1, we can express the classification risk R(f) using only pairwise comparison data sampled from ep+(x) and ep (x). Theorem 3. The classification risk R(f) can be equivalently expressed as RPC(f) = Eep+(x) ℓ(f(x), +1) π+ℓ(f(x), 1) (5) + Eep (x ) ℓ(f(x ), 1) π ℓ(f(x ), +1) . The proof is provided in Appendix D. In this way, we could train a binary classifier by minimizing the following empirical approximation of RPC(f): b RPC(f) = 1 i=1 ℓ(f(xi), +1) + ℓ(f(x i), 1) (6) π+ℓ(f(xi), 1) π ℓ(f(x i), +1) . Estimation Error Bound. Here, we establish an estimation error bound for the proposed URE. Let F = {f : X 7 R} be the model class, bf PC = arg minf F b RPC(f) be the empirical risk minimizer, and f = arg minf F R(f) be the true risk minimizer. Let e R+ n (F) and e R n (F) be the Rademacher complexities (Bartlett & Mendelson, 2002) of F with sample size n over ep+(x) and ep (x) respectively. Theorem 4. Suppose the loss function ℓis ρ-Lipschitz with respect to the first argument (0 ρ ), and all functions in the model class F are bounded, i.e., there exists a positive constant Cb such that f Cb for any f F. Let Cℓ:= supz Cb,t= 1 ℓ(z, t). Then for any δ > 0, with probability at least 1 δ, we have R( bf PC) R(f ) (1 + π+)4ρe R+ n (F) + (1 + π )4ρe R n (F) + 6Cℓ q The proof is provided in Appendix E. Theorem 4 shows that our proposed method is consistent, i.e., as n , R( bf PC) R(f ), since e R+ n (F), e R n (F) 0 for all parametric models with a bounded norm such as deep neural networks trained with weight decay (Golowich et al., 2017; Lu et al., 2019). Besides, e R+ n (F) and e R n (F) can be normally bounded by CF/ n for a positive constant CF. Hence, we can further see that the convergence rate is Op(1/ n) where Op denotes the order in probability. This order is the optimal parametric rate for ERM without additional assumptions (Mendelson, 2008). Relation to UU Classification. It is worth noting that the URE of UU classification RUU(f) is quite general for binary classification with weak supervision. Hence we also would like to show the relationships between our proposed estimator RPC(f) and RUU(f). We demonstrate by the following corollary that under some conditions, RUU(f) is equivalent to RPC(f). Corollary 1. By setting ptr = ep+(x), p tr = ep (x), θ = π+/(1 π+ + π2 +), and θ = π2 +/(1 π+ + π2 +), Eq. (3) is equivalent to Eq. (5), which means that RUU(f) is equivalent to RPC(f). We omit the proof of Corollary 1, since it is straightforward to derive Eq. (5) from Eq. (3) with required notations. Empirical Risk Correction. As shown by Lu et al. (2020), directly minimizing b RPC(f) would suffer from overfitting when complex models are used due to the negative risk issue. More specifically, since negative terms are included in Eq. (6), the empirical risk can be negative even though the original true risk can never be negative. To ease this problem, they wrapped the terms in b RUU(f) that cause a negative empirical risk by certain consistent correction functions defined in Lu et al. (2020), such as the rectified linear unit (Re LU) function g(z) = max(0, z) and absolute value function g(z) = |z|. These consistent correction functions could also be applied to b RPC for alleviating overfitting when complex models are used. In this way, we could obtain the following corrected empirical risk estimator: b Rc PC(f) = g 1 ℓ(f(xi), +1) π ℓ(f(x i), +1) ℓ(f(x i), 1) π+ℓ(f(xi), 1) . (7) 4.2. Progressive Pcomp Classification Here, we start from the noisy-label learning perspective to solve the Pcomp classification problem. Intuitively, we could simply perform binary classification by regarding the data from ep+(x) as (noisy) positive data and the data from ep (x) as (noisy) negative data. However, this naive solution Pointwise Binary Classification with Pairwise Confidence Comparisons could be inevitably affected by noisy labels. In this scenario, we denote the noise rates as ρ = p(ey = +1|y = 1) and ρ+ = p(ey = 1|y = +1) where ey is the observed (noisy) label and y is the true label, and denote the inverse noise rates as φ+ = p(y = 1|ey = +1) and φ = p(y = +1|ey = 1). According to the defined generation process of pairwise comparison data, we have the following theorem. Theorem 5. The following equalities hold: φ+ = π2 π2 + + π2 + π+π , ρ+ = π+ 1 + π+ , φ = π2 + π2 + + π2 + π+π , ρ = π 1 + π . The proof is provided in Appendix F. Theorem 5 shows that the noise rates can be obtained if we regard the Pcomp classification problem as the noisy-label learning problem. With known noise rates for noisy-label learning, it was shown (Natarajan et al., 2013; Northcutt et al., 2017) that a URE could be derived. Here, we adopt the Rank Pruning method (Northcutt et al., 2017) because it holds a progressive URE by selecting confident examples using the learning model and achieves state-of-the-art performance. Specifically, we denote by the dataset composed of all the observed positive data e P, i.e., e P = {xi}n i=1, where xi is independently sampled from ep+(x). Similarly, the dataset composed of all the observed negative data is denoted by e N, i.e., e N = {x i}n i=1, where x i is independently sampled from ep (x ). Then, confident examples will be selected from e P and e N by ranking the outputs of the model f. We denote the selected positive data from e P as e Psel, and the selected negative data from e N as e Nsel: e Psel = arg max P:|P|=(1 φ+)| e P| x {P e P} f(x), e Nsel = arg min N:|N|=(1 φ )| e N| x {N e N} f(x). Then we show that if the model f satisfies the separability condition, i.e., for any true positive instance xp and for any true negative instance xn, we have f(xp) > f(xn). In other words, if the model output of every true positive instance is always larger than that of every true negative instance, we could obtain a URE. We name it progressive URE, as the model f is progressively optimized. Theorem 6 (Theorem 5 in (Northcutt et al., 2017)). Assume that the model f satisfies the above separability condition, then the classification risk R(f) can be equivalently expressed as Rp PC(f) = Eep+(x) hℓ(f(x), +1) 1 ρ+ I[x e Psel] i + Eep (x ) hℓ(f(x ), 1) 1 ρ I[x e Nsel] i , where I[ ] denotes the indicator function. In this way, we have the following empirical approximation of Rp PC: b Rp PC(f) = 1 ℓ(f(xi), +1) 1 ρ+ I[xi e Psel] + ℓ(f(x i), 1) 1 ρ I[x i e Nsel] . (8) Estimation Error Bound. It worth noting that Northcutt et al. (2017) did not prove the learning consistency for the Rank Pruning method. Here, we establish an estimation error bound for this method, which guarantees the learning consistency. Let bfp PC = arg minf F b Rp PC(f) be the empirical risk minimizer of the Rank Pruning method, then we have the following theorem. Theorem 7. Suppose the loss function ℓis ρ-Lipschitz with respect to the first argument (0 ρ ), and all functions in the model class F are bounded, i.e., there exists a positive constant Cb such that f Cb for any f F. Let Cℓ:= supz Cb,t= 1 ℓ(z, t). Then for any δ > 0, with probability at least 1 δ, we have R( bfp PC) R(f ) 2 1 ρ+ 2ρe R+ n (F) + Cℓ q 2ρe R n (F) + Cℓ q The proof is provided in Appendix G. Theorem 7 shows that the above method is consistent and this estimation error bound also attains the optimal convergence rate without any additional assumptions (Mendelson, 2008). Regularization. For the above Rank Pruning method, its URE is based on the assumption that the learning model could satisfy the separability condition. Thus, its performance heavily depends on the accuracy of the learning model. However, as the learning model is progressively updated, some of the selected confident examples may still contain label noise during the training process. As a result, the Rank Pruning method would be affected by incorrectly selected data. A straightforward improvement could be to improve the output quality of the learning model. Motivated by Mean Teacher used in semi-supervised learning (Tarvainen & Valpola, 2017), we also resort to a teacher model that is an exponential moving average of model snapshots, i.e., Θ t = αΘ t 1 + (1 α)Θt, where Θ denotes the parameters of the teacher model, Θ denotes the parameters of the learning model, the subscript t denotes the training step, and α is a smoothing coefficient hyper-parameter. Such a teacher model could guide the learning model to produce high-quality outputs. To learn from the teacher model, we leverage consistency regularization Ω(f) = Ex fΘ(x) fΘ (x) 2 (Laine & Aila, Pointwise Binary Classification with Pairwise Confidence Comparisons 2016; Tarvainen & Valpola, 2017) to make the learning model consistent with the teacher model for improving the performance of the Rank Pruning method. 5. Experiments In this section, we conduct extensive experiments to evaluate the performance of our proposed Pcomp classification methods on various datasets using different models. Datasets. We use four popular benchmark datasets, including MNIST (Le Cun et al., 1998), Fashion-MNIST (Xiao et al., 2017), Kuzushiji-MNIST (Clanuwat et al., 2018), and CIFAR-10 (Krizhevsky et al., 2009). We train a multilayer perceptron (MLP) model with three hidden layers of width 300 and Re LU activation functions (Nair & Hinton, 2010) and batch normalization (Ioffe & Szegedy, 2015) on the first three datasets. We train a Res Net-34 model (He et al., 2016) on the CIFAR-10 dataset. We also use USPS and three datasets from the UCI machine learning repository (Blake & Merz, 1998) including Pendigits, Optdigits, and CNAE-9. We train a linear model on these datasets, since they are not large-scale datasets. The brief descriptions of all used datasets with the corresponding models are reported in Table 1. Since these datasets are specially used for multi-class classification, we manually transformed them into binary classification datasets (see Appendix H). As we have shown in Theorem 2, the pairwise comparison examples can be equivalently transformed into pointwise examples, which are more convenient to generate. Therefore, we generate pointwise examples in experiments. Specifically, as Theorem 5 discloses the noise rates in our defined data generation process, we simply generate pointwise corrupted examples according to the derived noise rates. Methods. For our proposed Pcomp classification problem, we propose the following methods: Pcomp-Unbiased, which denotes the proposed method that minimizes b RPC(f) in Eq. (6). Pcomp-Re LU, which denotes the proposed method that minimizes b Rc PC(f) in Eq. (7) using the Re LU function as the risk correction function. Pcomp-ABS, which denotes the proposed method that minimizes b Rc PC(f) in Eq. (7) using the absolute value function as the risk correction function. Pcomp-Teacher, which improves the Rank Pruning method by imposing consistency regularization to make the learning model consistent with a teacher model. Besides, we compare with the following baselines: Binary-Biased, which conducts binary classification by regarding the data from ep+(x) as positive data and the data from ep (x) as negative data. This is a straightforward method to handle the Pcomp classification problem. In our setting, Binary-Biased reduces to the BER minimization method (Menon et al., 2015). Table 1. Specification of the used benchmark datasets and models. These datasets are specially processed (see Appendix H) for performing Pcomp classification. Dataset # Train # Test # Features # Classes Model MNIST 60,000 10,000 784 10 MLP Fashion 60,000 10,000 784 10 MLP Kuzushiji 60,000 10,000 784 10 MLP CIFAR-10 50,000 10,000 3,072 10 Res Net-34 USPS 7,437 1,861 256 10 Linear Pendigits 8,793 2,199 16 10 Linear Optdigits 4,495 1,125 62 10 Linear CNAE-9 864 216 856 9 Linear Noisy-Unbiased, which denotes the noisy-label learning method that minimizes the empirical approximation of the URE proposed by (Natarajan et al., 2013). Rank Pruning, which denotes the noisy-label learning method (Northcutt et al., 2017) that minimizes the empirical risk b Rp PC(f) in Eq. (8). For all learning methods, we take the logistic loss as the binary loss function ℓ(i.e., ℓ(z) = ln(1+exp( z))), for fair comparisons. The hyper-parameter settings of all learning methods are aslo reported in Appendix H. We implement our methods using Py Torch (Paszke et al., 2019) and use the Adam (Kingma & Ba, 2015) optimizer with mini-batch size set to 256 and the number of training epochs set to 200 for the four large-scale datasets and 100 for other four datasets. All the experiments are conducted on Ge Force GTX 1080 Ti GPUs. Codes are avaible in supplementary materials. Experimental Setup. We evaluate the performance of all learning methods under different class prior settings, i.e., π+ is selected from {0.2, 0.5, 0.8}. It is worth noting that we could estimate π+ according to our described data generation process. Specifically, we can exactly estimate eπ by counting the fraction of collected pairwise comparison data in all the sampled pairs of data. Since eπ = π2 + + π = π2 ++1 π+, we have π+ = 1/2 p eπ 3/4 (if π+ < π ) or π+ = 1/2 + p eπ 3/4 (if π+ π ). Therefore, if we know whether π+ is larger than π , we could exactly estimate the true class prior π+. For simplicity, we assume that the class prior π+ is known for all the methods. We repeat the sampling-and-training process 5 times for all learning methods on all datasets and record the mean accuracy with standard deviation (mean std). Experimental Results with Complex Models. Table 2 records the classification accuracy of each method on the four benchmark datasets with different class priors. From Table 2, we have the following observations: Binary-Biased always achieves the worst performance, which means that our Pcomp classification problem cannot be simply solved by binary classification. Pointwise Binary Classification with Pairwise Confidence Comparisons Table 2. Classification accuracy (mean std) of each method on the four benchmark datasets with different class priors. The best performance is highlighted in bold. Class Prior Methods MNIST Kuzushiji Fashion CIFAR-10 Noisy-Unbiased 0.806 0.026 0.712 0.014 0.865 0.050 0.665 0.119 Binary-Biased 0.282 0.018 0.584 0.021 0.371 0.067 0.499 0.170 Rank Pruning 0.933 0.005 0.810 0.006 0.938 0.005 0.840 0.005 Pcomp-ABS 0.893 0.013 0.849 0.007 0.898 0.008 0.833 0.008 Pcomp-Re LU 0.927 0.010 0.838 0.014 0.927 0.022 0.812 0.007 Pcomp-Unbiased 0.782 0.025 0.693 0.041 0.835 0.038 0.584 0.011 Pcomp-Teacher 0.958 0.007 0.835 0.015 0.954 0.004 0.841 0.006 Noisy-Unbiased 0.892 0.013 0.656 0.096 0.908 0.031 0.642 0.031 Binary-Biased 0.537 0.026 0.615 0.008 0.457 0.032 0.470 0.039 Rank Pruning 0.888 0.004 0.782 0.008 0.917 0.005 0.793 0.015 Pcomp-ABS 0.832 0.010 0.732 0.007 0.899 0.006 0.712 0.023 Pcomp-Re LU 0.875 0.009 0.729 0.019 0.918 0.016 0.746 0.030 Pcomp-Unbiased 0.877 0.012 0.690 0.084 0.906 0.037 0.631 0.031 Pcomp-Teacher 0.942 0.004 0.791 0.013 0.957 0.002 0.795 0.012 Noisy-Unbiased 0.800 0.036 0.749 0.038 0.815 0.058 0.591 0.019 Binary-Biased 0.299 0.066 0.558 0.005 0.444 0.020 0.371 0.113 Rank Pruning 0.939 0.004 0.830 0.012 0.939 0.004 0.843 0.010 Pcomp-ABS 0.815 0.009 0.825 0.011 0.879 0.014 0.827 0.013 Pcomp-Re LU 0.916 0.013 0.827 0.011 0.925 0.015 0.811 0.007 Pcomp-Unbiased 0.793 0.016 0.721 0.037 0.823 0.032 0.569 0.007 Pcomp-Teacher 0.958 0.007 0.836 0.015 0.955 0.004 0.844 0.014 Pcomp-Unbiased is inferior to Pcomp-ABS and Pcomp Re LU. This observation accords with what we have discussed, i.e., directly minimizing b RPC(f) would suffer from overfitting when complex models are used, because there are negative terms included in b RPC(f) and the empirical risk can be negative during the training process. In contrast, Pcomp-Re LU and Pcomp-ABS employ consistent correction functions on b RPC(f) so that the empirical risk will never be negative. Therefore, when complex models such as deep neural networks are used, Pcomp-Re LU and Pcomp-ABS are expected to outperform Pcomp-Unbiased. Pcomp-Teacher achieves the best performance in most cases. This observation verifies the effectiveness of the imposed consistency regularization, which makes the learning model consistent with a teacher model, to improve the quality of selected confident examples by the Rank Pruning method. It is worth noting that the standard deviations of Binary Biased, Pcomp-Unbiased, and Noisy-Unbiased are sometimes higher than other methods. This is because the three methods suffer from overfitting when complex models are used, and the performance could be quite unstable in different trials. Experimental Results with Simple Models. Table 3 reports the classification accuracy of each method on the four UCI datasets with different class priors. From Table 3, we have the following observations: Binary-Biased achieves the worst performance in nearly all cases, which also implies that we need to develop other novel methods for Pcomp classification. Pcomp-Unbiased has comparable performance to its variants Pcomp-ABS and Pcomp-Re LU, because Pcomp Unbiased does not suffer from overfitting when the linear model is used, and it is not necessary to use consistent correction functions anymore. Pcomp-Teacher is still better than Rank Pruning, while it is sometimes inferior to other Pcomp classification methods. This is because the linear model is not as powerful as neural networks, thus the selected confident examples may not be so reliable. Performance of Increasing Pairwise Comparisons. As shown by Theorem 4 and Theorem 7, the performance of our Pcomp classification methods is expected to be improved if more pairwise comparisons are given. To empirically validate such theoretical findings, we further conduct experiments on Pendigits and Optdigits with class prior π+ = 0.5 and π+ = 0.8 by changing the fraction of pairwise comparisons (100% means that we use all the generated pairwise comparisons in the training process). As shown in Figure 1, the classification accuracy of our methods generally increases given more pairwise comparisons. This observation is clearly in accordance with our derived estimation error Pointwise Binary Classification with Pairwise Confidence Comparisons Table 3. Classification accuracy (mean std) of each method on the four UCI datasets with different class priors. The best performance is highlighted in bold. Class Prior Methods USPS Pendigits Optdigits CNAE-9 Noisy-Unbiased 0.921 0.010 0.857 0.012 0.885 0.015 0.830 0.019 Binary-Biased 0.752 0.016 0.639 0.050 0.705 0.026 0.629 0.040 Rank Pruning 0.931 0.007 0.780 0.061 0.878 0.011 0.780 0.046 Pcomp-ABS 0.909 0.005 0.863 0.013 0.871 0.013 0.819 0.018 Pcomp-Re LU 0.917 0.006 0.868 0.012 0.875 0.011 0.835 0.011 Pcomp-Unbiased 0.922 0.007 0.867 0.015 0.881 0.018 0.806 0.023 Pcomp-Teacher 0.927 0.009 0.847 0.035 0.886 0.016 0.687 0.052 Noisy-Unbiased 0.911 0.007 0.826 0.014 0.838 0.011 0.766 0.039 Binary-Biased 0.913 0.007 0.761 0.049 0.834 0.013 0.773 0.051 Rank Pruning 0.925 0.008 0.840 0.019 0.864 0.027 0.672 0.068 Pcomp-ABS 0.912 0.008 0.839 0.010 0.841 0.013 0.749 0.047 Pcomp-Re LU 0.912 0.007 0.844 0.008 0.846 0.013 0.766 0.038 Pcomp-Unbiased 0.911 0.006 0.846 0.009 0.847 0.014 0.763 0.040 Pcomp-Teacher 0.926 0.008 0.840 0.019 0.869 0.021 0.787 0.047 Noisy-Unbiased 0.918 0.017 0.890 0.013 0.884 0.006 0.838 0.017 Binary-Biased 0.731 0.017 0.634 0.042 0.554 0.106 0.621 0.037 Rank Pruning 0.934 0.010 0.856 0.018 0.863 0.025 0.737 0.050 Pcomp-ABS 0.900 0.017 0.889 0.014 0.882 0.002 0.824 0.013 Pcomp-Re LU 0.921 0.018 0.894 0.012 0.888 0.004 0.839 0.015 Pcomp-Unbiased 0.899 0.014 0.883 0.006 0.874 0.008 0.840 0.013 Pcomp-Teacher 0.937 0.008 0.880 0.011 0.884 0.010 0.781 0.033 20% 40% 60% 80% 100% #pairwise comparisons classification accuracy Pendigits with π+ = 0.5 Pcomp-Unbiased Pcomp-Re LU Pcomp-ABS Pcomp-Teacher 20% 40% 60% 80% 100% #pairwise comparisons classification accuracy Pendigits with π+ = 0.8 Pcomp-Unbiased Pcomp-Re LU Pcomp-ABS Pcomp-Teacher 20% 40% 60% 80% 100% #pairwise comparisons classification accuracy Optdigits with π+ = 0.5 Pcomp-Unbiased Pcomp-Re LU Pcomp-ABS Pcomp-Teacher 20% 40% 60% 80% 100% #pairwise comparisons classification accuracy Optdigits with π+ = 0.8 Pcomp-Unbiased Pcomp-Re LU Pcomp-ABS Pcomp-Teacher Figure 1. The classification accuracy of our proposed Pcomp classification methods when the number of pairwise comparisons increases. bounds, because the estimation error would decrease as the number of pairwise comparisons increases. 6. Conclusion and Future Work In this paper, we proposed a novel weakly supervised learning setting called pairwise comparison (Pcomp) classification, where we aim to train a binary classifier from only pairwise comparison data, i.e., two examples that we know one is more likely to be positive than the other, instead of pointwise labeled data. Pcomp classification is useful for private classification tasks where we are not allowed to directly access labels and subjective classification tasks where labelers have different labeling standards. To solve the Pcomp classification problem, we presented a mathematical formulation for the generation process of pairwise comparison data, based on which we explored two unbiased risk estimators (UREs) to train a binary classifier by em- pirical risk minimization and established the corresponding estimation error bounds. We first proved that a URE can be derived and improved it using correction functions. Then, we started from the noisy-label learning perspective to introduce a progressive URE and improved it by imposing consistency regularization. Finally, experiments demonstrated the effectiveness of our proposed methods. In future work, we will apply Pcomp classification to solve some challenging real-world problems like binary classification with class overlapping. In addition, we could also extend Pcomp classification to the multi-class classification setting by using the one-versus-all strategy. Suppose there are multiple classes, we are given pairs of unlabeled data that we know which one is more likely to belong to a specific class. Then, we can use the proposed methods in this paper to train a binary classifier for each class. Finally, by comparing the outputs of these binary classifiers, the predicted class can be determined. Pointwise Binary Classification with Pairwise Confidence Comparisons Acknowledgements This research was supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-RP-2019-0013), National Satellite of Excellence in Trustworthy Software Systems (Award No: NSOE-TSS2019-01), and NTU. NL was supported by MEXT scholarship No. 171536 and the MSRA D-CORE Program. BH was supported by the RGC Early Career Scheme No. 22200720, NSFC Young Scientists Fund No. 62006202, and HKBU CSD Departmental Incentive Grant. GN and MS were supported by JST AIP Acceleration Research Grant Number JPMJCR20U3, Japan. MS was also supported by the Institute for AI and Beyond, UTokyo. Bao, H., Niu, G., and Sugiyama, M. Classification from pairwise similarity and unlabeled data. In ICML, pp. 452 461, 2018. Bao, H., Shimada, T., Xu, L., Sato, I., and Sugiyama, M. Similarity-based classification: Connecting similarity learning to binary classification. ar Xiv preprint ar Xiv:2006.06207, 2020. Bartlett, P. L. and Mendelson, S. Rademacher and gaussian complexities: Risk bounds and structural results. JMLR, 3(11):463 482, 2002. Blake, C. L. and Merz, C. J. Uci repository of machine learning databases, 1998. URL http://archive. ics.uci.edu/ml/index.php. Chen, X., Chen, W., Chen, T., Yuan, Y., Gong, C., Chen, K., and Wang, Z. Self-pu: Self boosted and calibrated positive-unlabeled training. In ICML, pp. 1510 1519, 2020. Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A., Yamamoto, K., and Ha, D. Deep learning for classical Japanese literature. ar Xiv preprint ar Xiv:1812.01718, 2018. Cui, Z.-H., Charoenphakdee, N., Sato, I., and Sugiyama, M. Classification from triplet comparison data. Neural Computation, 32(3):659 681, 2020. du Plessis, M. C., Niu, G., and Sugiyama, M. Analysis of learning from positive and unlabeled data. In Neur IPS, pp. 703 711, 2014. du Plessis, M. C., Niu, G., and Sugiyama, M. Convex formulation for learning from positive and unlabeled data. In ICML, pp. 1386 1394, 2015. Elkan, C. and Noto, K. Learning classifiers from only positive and unlabeled data. In KDD, pp. 213 220, 2008. Feng, L., Kaneko, T., Han, B., Niu, G., An, B., and Sugiyama, M. Learning with multiple complementary labels. In ICML, pp. in press, 2020a. Feng, L., Lv, J.-Q., Han, B., Xu, M., Niu, G., Geng, X., An, B., and Sugiyama, M. Provably consistent partial-label learning. In Neur IPS, 2020b. Golowich, N., Rakhlin, A., and Shamir, O. Size-independent sample complexity of neural networks. ar Xiv preprint ar Xiv:1712.06541, 2017. Gong, C., Shi, H., Liu, T.-L., Zhang, C., Yang, J., and Tao, D.-C. Loss decomposition and centroid estimation for positive and unlabeled learning. IEEE TPAMI, 2019. Han, B., Yao, Q.-M., Yu, X.-R., Niu, G., Xu, M., Hu, W.- H., Tsang, I., and Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Neur IPS, pp. 8527 8537, 2018. He, K.-M., Zhang, X.-Y., Ren, S.-Q., and Sun, J. Deep residual learning for image recognition. In CVPR, pp. 770 778, 2016. Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ar Xiv preprint ar Xiv:1502.03167, 2015. Ishida, T., Niu, G., Hu, W., and Sugiyama, M. Learning from complementary labels. In Neur IPS, pp. 5644 5654, 2017. Ishida, T., Niu, G., and Sugiyama, M. Binary classification for positive-confidence data. In Neur IPS, pp. 5917 5928, 2018. Ishida, T., Niu, G., Menon, A. K., and Sugiyama, M. Complementary-label learning for arbitrary losses and models. In ICML, pp. 2971 2980, 2019. Kane, D. M., Lovett, S., Moran, S., and Zhang, J. Active classification with comparison queries. In FOCS, pp. 355 366, 2017. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015. Kiryo, R., Niu, G., du Plessis, M. C., and Sugiyama, M. Positive-unlabeled learning with non-negative risk estimator. In Neur IPS, pp. 1674 1684, 2017. Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009. Laine, S. and Aila, T. Temporal ensembling for semisupervised learning. ar Xiv preprint ar Xiv:1610.02242, 2016. Pointwise Binary Classification with Pairwise Confidence Comparisons Le Cun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278 2324, 1998. Lu, N., Niu, G., Menon, A. K., and Sugiyama, M. On the minimal supervision for training any binary classifier from only unlabeled data. In ICLR, 2019. Lu, N., Zhang, T.-Y., Niu, G., and Sugiyama, M. Mitigating overfitting in supervised classification from two unlabeled datasets: A consistent risk correction approach. In AISTATS, 2020. Lv, J.-Q., Xu, M., Feng, L., Niu, G., Geng, X., and Sugiyama, M. Progressive identification of true labels for partial-label learning. In ICML, 2020. Mendelson, S. Lower bounds for the empirical minimization algorithm. IEEE TIT, 54(8):3797 3803, 2008. Menon, A., Van Rooyen, B., Ong, C. S., and Williamson, B. Learning from corrupted binary labels via classprobability estimation. In ICML, pp. 125 134, 2015. Nair, V. and Hinton, G. E. Rectified linear units improve restricted boltzmann machines. In ICML, 2010. Natarajan, N., Dhillon, I. S., Ravikumar, P. K., and Tewari, A. Learning with noisy labels. In Neur IPS, pp. 1196 1204, 2013. Niu, G., Jitkrittum, W., Dai, B., Hachiya, H., and Sugiyama, M. Squared-loss mutual information regularization: A novel information-theoretic approach to semi-supervised learning. In ICML, pp. 10 18, 2013. Niu, G., Plessis, M. C. d., Sakai, T., Ma, Y., and Sugiyama, M. Theoretical comparisons of positive-unlabeled learning against positive-negative learning. In Neur IPS, pp. 1199 1207, 2016. Northcutt, C. G., Wu, T., and Chuang, I. L. Learning with confident examples: Rank pruning for robust classification with noisy labels. In UAI, 2017. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. In Neur IPS, pp. 8026 8037, 2019. Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C., Bogoni, L., and Moy, L. Learning from crowds. JMLR, 11(4), 2010. Sakai, T., Niu, G., and Sugiyama, M. Semi-supervised auc optimization based on positive-unlabeled learning. MLJ, 107(4):767 794, 2018. Shimada, T., Bao, H., Sato, I., and Sugiyama, M. Classification from pairwise similarities/dissimilarities and unlabeled data via empirical risk minimization. Neural Computation, 2020. Shinoda, K., Kaji, H., and Sugiyama, M. Binary classification from positive data with skewed confidence. In IJCAI, pp. 3328 3334, 2020. Sun, M., Han, T. X., Liu, M.-C., and Khodayari Rostamabad, A. Multiple instance learning convolutional neural networks for object recognition. In ICPR, pp. 3270 3275, 2016. Tarvainen, A. and Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Neur IPS, pp. 1195 1204, 2017. Wei, H.-X., Feng, L., Chen, X.-Y., and An, B. Combating noisy labels by agreement: A joint training method with co-regularization. In CVPR, pp. 13726 13735, 2020. Whitehill, J., Wu, T.-f., Bergsma, J., Movellan, J. R., and Ruvolo, P. L. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Neur IPS, pp. 2035 2043, 2009. Xia, X.-B., Liu, T.-L., Wang, N.-N., Han, B., Gong, C., Niu, G., and Sugiyama, M. Are anchor points really indispensable in label-noise learning? In Neur IPS, pp. 6835 6846, 2019. Xiao, H., Rasul, K., and Vollgraf, R. Fashion-mnist: A novel image dataset for benchmarking machine learning algorithms. ar Xiv preprint ar Xiv:1708.07747, 2017. Xu, L.-Y., Honda, J., Niu, G., and Sugiyama, M. Uncoupled regression from pairwise comparison data. In Neur IPS, pp. 3992 4002, 2019. Xu, Y., Zhang, H., Miller, K., Singh, A., and Dubrawski, A. Noise-tolerant interactive learning using pairwise comparisons. In Neur IPS, pp. 2431 2440, 2017. Xu, Y., Balakrishnan, S., Singh, A., and Dubrawski, A. Regression with comparisons: Escaping the curse of dimensionality with ordinal information. JMLR, 21(162): 1 54, 2020. Yu, X.-Y., Liu, T.-L., Gong, M.-M., and Tao, D.-C. Learning with biased complementary labels. In ECCV, pp. 68 83, 2018. Zhang, M.-L., Yu, F., and Tang, C.-Z. Disambiguation-free partial label learning. TKDE, 29(10):2155 2167, 2017. Zhang, Y.-L. and Zhou, Z.-H. Multi-instance learning with key instance shift. In IJCAI, pp. 3441 3447, 2017. Pointwise Binary Classification with Pairwise Confidence Comparisons Zhou, Z.-H. A brief introduction to weakly supervised learning. National Science Review, 5(1):44 53, 2018. Zhou, Z.-H., Sun, Y.-Y., and Li, Y.-F. Multi-instance learning by treating instances as non-iid samples. In ICML, pp. 1249 1256, 2009. Zhu, X.-J. and Goldberg, A. B. Introduction to semisupervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1):1 130, 2009.