# Learning from Similarity-Confidence Data

Yuzhou Cao 1, Lei Feng 2 3, Yitian Xu 1, Bo An 4, Gang Niu 3, Masashi Sugiyama 3 5

1 College of Science, China Agricultural University, Beijing, China. 2 College of Computer Science, Chongqing University, Chongqing, China. 3 RIKEN Center for Advanced Intelligence Project, Tokyo, Japan. 4 School of Computer Science and Engineering, Nanyang Technological University, Singapore. 5 The University of Tokyo, Tokyo, Japan. Correspondence to: Yitian Xu. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Weakly supervised learning has drawn considerable attention recently as a way to reduce the expensive time and labor consumption of labeling massive data. In this paper, we investigate a novel weakly supervised learning problem called learning from similarity-confidence (Sconf) data, where we aim to learn an effective binary classifier from only unlabeled data pairs equipped with confidence that indicates their degree of similarity (two examples are similar if they belong to the same class). To solve this problem, we propose an unbiased estimator of the classification risk that can be calculated from only Sconf data and show that its estimation error bound achieves the optimal convergence rate. To alleviate potential overfitting when flexible models are used, we further employ a risk correction scheme on the proposed risk estimator. Experimental results demonstrate the effectiveness of the proposed methods.

1. Introduction

In supervised classification, a vast quantity of exactly labeled data is required for training effective classifiers. However, the collection of massive data with exact supervision is laborious and expensive in many real-world problems. To overcome this bottleneck, weakly supervised learning (Zhou, 2018) has been proposed and explored under various settings, including but not limited to semi-supervised learning (Chapelle et al., 2006; Zhu & Goldberg, 2009; Niu et al., 2013; Li & Zhou, 2015; Sakai et al., 2017; Li & Liang, 2019; Guo et al., 2020), positive-unlabeled learning (Elkan & Noto, 2008; du Plessis et al., 2014; 2015; Kiryo et al., 2017; Sansone et al., 2019), noisy-label learning (Natarajan et al., 2013; Han et al., 2018b;a; Zhang et al., 2019; Xia et al., 2019; 2020), partial-label learning (Cour et al., 2011; Wang et al., 2019; Feng et al., 2020b; Lv et al., 2020), complementary-label learning (Ishida et al., 2017; Yu et al., 2018; Ishida et al., 2019; Feng et al., 2020a; Katsura & Uchida, 2020; Chou et al., 2020), similarity-unlabeled learning (Bao et al., 2018), and similarity-dissimilarity learning (Shimada et al., 2019).

In this paper, we consider a novel weakly supervised learning setting called similarity-confidence (Sconf) learning. Under this setting, we aim to train a binary classifier from only unlabeled data pairs equipped with similarity confidence that indicates the degree of their pairwise similarity, i.e., the probability that the two data points share the same label, without any ordinarily labeled data. The Sconf learning setting arises in many practical scenarios. Compared with ordinary class labels, similarity labels are more easily accessible in many applications (e.g., protein function prediction (Klein et al., 2002)) and can alleviate potentially biased decisions (Fisher, 1993).
However, such similarity labels could cause severe privacy leakage: for a data pair equipped with a similarity label, the disclosure of the class label of either of the two examples immediately reveals the class label of the other. When the collected data are sensitive (e.g., political opinions and religious orientations), such leakage can lead to serious consequences. In this scenario, similarity confidence is more favorable from a privacy-preserving perspective: given the similarity confidence of a data pair, one cannot be certain whether the two examples share the same label, because the confidence only gives the probability that they belong to the same class. For example, even if the similarity confidence is 70%, we still cannot assert that the data pair shares the same label, since the pair can also be dissimilar with probability 30%. As a result, the underlying similarity label of the data pair cannot be exactly determined from the similarity confidence alone. Another example is crowdsourcing (Howe, 2009). When the data are annotated by crowdworkers, it is difficult to always obtain high-quality crowdsourced labels (Wang & Zhou, 2015) due to the crowdworkers' lack of domain knowledge. When a data pair is annotated with both pairwise similarity and dissimilarity labels by different crowdworkers, we can generate the similarity confidence by averaging instead of taking the majority of the crowdsourced labels, which can alleviate noisy supervision. In these scenarios, Sconf learning makes it possible to learn an effective binary classifier from only unlabeled data pairs equipped with similarity confidence instead of hard labels.

Our main contributions in this paper are the following:

- We propose a novel Sconf learning framework (in Section 5) that allows the use of ERM by constructing an unbiased estimator of the classification risk with only unlabeled data pairs with similarity confidence, where any loss functions, models, and optimizers are applicable.
- In Section 5, we derive an estimation error bound for Sconf learning and show that it achieves the optimal parametric convergence rate. An analysis of the influence of noisy confidence further shows the robustness of our Sconf learning framework.
- We leverage an effective empirical risk correction scheme (Kiryo et al., 2017; Lu et al., 2020) to correct the obtained unbiased risk estimator, which alleviates potential overfitting, and further show the consistency of its risk minimizer (in Section 6).
- Extensive experiments on various datasets and deep neural networks clearly demonstrate the effectiveness of the proposed Sconf learning method and risk correction scheme (in Section 7).

2. Related Work

We illustrate Sconf learning and related problems in Figure 1. In what follows, we briefly review semi-supervised clustering and similarity-based learning.

The research on similarity-based learning was pioneered by the semi-supervised clustering (SSC) paradigm, where pairwise similarity/dissimilarity is utilized to enhance the clustering performance (Wagstaff et al., 2001; Basu et al., 2002; Xing et al., 2002; Niu et al., 2012; Yi et al., 2013). From the learning theory viewpoint, the SSC methods are confined to the clustering setting and have no generalization guarantee.

Recently, many studies have tried to solve the similarity-based learning problem by empirical risk minimization (ERM) with rigorous consistency analysis.
In Bao et al. (2018), it was shown that the classification risk can be recovered from similar data pairs and unlabeled data, which enables the use of ERM and analysis of the estimation error. However, dissimilar data pairs are ignored in this work and the collection of additional unlabeled data is inevitable. Later, Shimada et al. (2019) made it possible to learn from both similar and dissimilar data pairs by ERM, yet that approach is still confined to the hard-label setting. Bao et al. (2020) introduced a new performance metric for the binary discriminative model and developed a surrogate risk minimization framework with both similar and dissimilar data pairs. On the other hand, likelihood-based models (Hsu et al., 2019; Wu et al., 2020) were proposed to conduct similarity-based learning for multi-class classification tasks. The loss functions in these methods are fixed, and we cannot directly optimize the classification-oriented losses of Bao et al. (2020). Compared with these works, our proposed Sconf learning framework is assumption-free on models, loss functions, and optimizers, which makes it a flexible framework when we use deep learning.

Figure 1. Illustration of Sconf learning and related problems.

3. Preliminaries

In this section, we first briefly review the ordinary classification problem and then present our problem setting, where each unlabeled data pair is merely equipped with similarity confidence. Proofs are presented in the supplementary materials.

3.1. Ordinary Classification Problem

Suppose that the feature space is $\mathcal{X} \subseteq \mathbb{R}^d$ and the label space is $\mathcal{Y} = \{-1, +1\}$, and that an instance and its ordinary class label $(x, y)$ obey an unknown distribution with density $p(x, y)$. The goal is to find a decision function $g(\cdot): \mathcal{X} \to \mathbb{R}$ that minimizes the classification risk:

$$R(g) = \mathbb{E}_{p(x,y)}[\ell(g(x), y)], \qquad (1)$$

where $\ell(\cdot, \cdot): \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_+$ is a binary loss function, e.g., the 0-1 loss or the logistic loss. An equivalent expression of the classification risk (1) used in the following sections is:

$$R(g) = \underbrace{\pi_+ \mathbb{E}_+[\ell(g(x), +1)]}_{R_+(g)} + \underbrace{\pi_- \mathbb{E}_-[\ell(g(x), -1)]}_{R_-(g)},$$

where $\pi_+ = p(y = +1)$ and $\pi_- = p(y = -1)$ denote the class-prior probabilities. $\mathbb{E}_+[\cdot]$ and $\mathbb{E}_-[\cdot]$ are expectations over the class-conditional distributions with densities $p_+(x) = p(x \mid y = +1)$ and $p_-(x) = p(x \mid y = -1)$, respectively. The class-posterior probabilities are denoted by $r_+(x) = p(y = +1 \mid x)$ and $r_-(x) = p(y = -1 \mid x)$.

3.2. Generation of Similarity-Confidence Data

To conduct ERM with only Sconf data, we first give the underlying distributions of Sconf data pairs and then discuss the expression and properties of similarity confidence. In this setting, we only have access to unlabeled data pairs with similarity confidence: $\mathcal{S} = \{(x_i, x'_i), s_i\}_{i=1}^n$, where the similarity confidence $s_i = s(x_i, x'_i) = p(y_i = y'_i \mid x_i, x'_i)$ denotes the probability that $x_i$ and $x'_i$ share the same label, i.e., $y_i = y'_i$. Each unlabeled data pair in $\{(x_i, x'_i)\}_{i=1}^n$ is drawn independently from the pair distribution $U^2$ with density $p(x, x') = p(x)p(x')$, and we further denote by $S_n$ the unlabeled data pairs with similarity confidence, $\{(x_i, x'_i)\}_{i=1}^n \overset{\text{i.i.d.}}{\sim} p(x, x')$. This formulation implies that we can regard the decoupled unlabeled samples $\{x_i\}_{i=1}^n \cup \{x'_i\}_{i=1}^n$ as drawn independently from the marginal distribution $p(x)$, which can be easily implemented in real-world data collection. We also assume sample independence: $(x, y) \perp (x', y')$, which is an implicit assumption used in Bao et al. (2018).
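To make the data generation process concrete, the short sketch below simulates Sconf pairs from a pool of unlabeled samples when the class posterior $r_+(x)$ is available (as in the synthetic experiments of Section 7). Under the pair-independence assumption above, $s(x, x') = r_+(x)r_+(x') + r_-(x)r_-(x')$; Lemma 1 below gives an equivalent expression in terms of densities. The function and argument names (`make_sconf_pairs`, `posterior_fn`) are ours, not part of the paper.

```python
import numpy as np

def make_sconf_pairs(x_unlabeled, posterior_fn, rng=None):
    """Simulate Sconf data: pair up i.i.d. unlabeled samples and attach
    the similarity confidence s(x, x') = p(y = y' | x, x').

    `posterior_fn(x)` is a hypothetical helper returning r_+(x) = p(y=+1 | x).
    Under the independence assumption (x, y) ⊥ (x', y'),
    s(x, x') = r_+(x) r_+(x') + r_-(x) r_-(x').
    """
    rng = np.random.default_rng(rng)
    x = rng.permutation(x_unlabeled)       # shuffle, then split into two halves
    m = len(x) // 2
    x1, x2 = x[:m], x[m:2 * m]             # each half is i.i.d. from p(x)
    r1, r2 = posterior_fn(x1), posterior_fn(x2)
    s = r1 * r2 + (1.0 - r1) * (1.0 - r2)  # similarity confidence per pair
    return x1, x2, s
```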
Furthermore, we show that the similarity confidence has the following property:

Lemma 1. (Equivalent expression of similarity confidence)

$$s(x, x') = \frac{\pi_+^2\, p_+(x)\, p_+(x') + \pi_-^2\, p_-(x)\, p_-(x')}{p(x)\, p(x')}. \qquad (2)$$

4. Failure of Learning with a One-Sided Similarity Relation

As mentioned in Section 1, although we can recover the classification risk from both similar and dissimilar data (Shimada et al., 2019), such hard labels could cause serious privacy leakage, which makes them unfavorable when the data are sensitive. Such leakage may be alleviated by learning from only a one-sided similarity relation: we only have similar (dissimilar) data pairs and no dissimilar (similar) data pairs are provided. For example, when investigating political or religious orientations, people with dissimilar opinions may refuse to give an answer to avoid potential conflicts. In these scenarios, only similar data pairs are accessible. As reported in Hadsell et al. (2006), learning with only similar data pairs can lead to degenerate solutions. A natural idea is to combine the one-sided similarity relation with similarity confidence. Can we learn an effective classifier from only a one-sided similarity relation and similarity confidence? Unfortunately, the following experimental and theoretical results give a negative answer to this question. Due to the space limitation, we only provide the result for the case where we have only similar data pairs; a completely analogous result with only dissimilar data pairs is given in the supplementary materials.

Suppose we have a training set consisting of only similar data pairs, i.e., $S_n^S = \{(x_i, x'_i)\}_{i=1}^n \overset{\text{i.i.d.}}{\sim} p(x, x' \mid y = y')$, and their similarity confidences $\{s_i\}_{i=1}^n$. The following theorem shows that it is theoretically possible to conduct ERM with only the data provided above:

Theorem 1. With similar data pairs and their similarity confidence, assuming that $s(x, x') > 0$ for every pair $(x, x')$, we can obtain an unbiased estimator of the classification risk (1), i.e., $\mathbb{E}_{p(x, x' \mid y = y')}[\hat{R}_S(g)] = R(g)$, where

$$\hat{R}_S(g) = (\pi_+^2 + \pi_-^2) \sum_{i=1}^n \frac{(s_i - \pi_-)\big(\ell(g(x_i), +1) + \ell(g(x'_i), +1)\big)}{2n(\pi_+ - \pi_-)\, s_i} + (\pi_+^2 + \pi_-^2) \sum_{i=1}^n \frac{(\pi_+ - s_i)\big(\ell(g(x_i), -1) + \ell(g(x'_i), -1)\big)}{2n(\pi_+ - \pi_-)\, s_i}. \qquad (3)$$

It may seem that we can conduct ERM on the obtained unbiased risk estimator to get a binary classifier. Unfortunately, with only a one-sided similarity relation, we can only obtain collapsed solutions empirically. Denote the empirical risk minimizer of Eq. (3) by $\hat{g}_S$. Then we come to the following conclusion:

Theorem 2. Suppose $\pi_+ > \pi_-$ and the 0-1 loss is used. For similar data pairs, we assume that $s_i \geq \pi_+$ for $i = 1, \ldots, n$. Then $\hat{g}_S$ is a collapsed solution that classifies all the examples as positive.

A rough proof intuition for Theorem 2 is that the coefficients of the positive loss terms are always positive and those of the negative loss terms are always non-positive, so minimizing Eq. (3) is equivalent to minimizing the positive counterpart and maximizing the negative counterpart of the classification risk, which leads to the collapsed solution. We can conclude that although Eq. (3) is unbiased, it cannot well represent the classification risk in Eq. (1).
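The sign pattern behind Theorem 2 can be checked numerically. The toy sketch below evaluates the per-pair coefficients of Eq. (3) (up to the common positive factor $1/(2n)$) under the theorem's assumption $s_i \geq \pi_+ > \pi_-$; the prior and confidence values are chosen arbitrarily for illustration and are not from the paper.

```python
import numpy as np

# Toy check of the coefficient signs in Eq. (3) (values chosen arbitrarily).
pi_plus, pi_minus = 0.6, 0.4
s = np.array([0.62, 0.75, 0.90, 0.99])   # similar pairs: s_i >= pi_plus by assumption

scale = (pi_plus**2 + pi_minus**2) / (pi_plus - pi_minus)
coef_pos = scale * (s - pi_minus) / s     # multiplies the positive-class losses
coef_neg = scale * (pi_plus - s) / s      # multiplies the negative-class losses

print(coef_pos)   # all positive -> the positive risk is minimized
print(coef_neg)   # all <= 0     -> the negative risk is implicitly maximized
```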
To empirically illustrate the failure of learning with a one-sided similarity relation, we conducted experiments on a synthetic dataset and show the distribution of similarity confidence. The detailed statistics of the synthetic dataset are provided in the supplementary materials.

Figure 2. Decision boundaries and the distribution of similarity confidence.

According to the experimental results in Figure 2, the empirical minimizers of learning with only similar or only dissimilar data pairs yield collapsed results and their classification boundaries are severely biased, which aligns with Theorem 2. The distribution of similarity confidence also meets our assumption on $\{s_i\}_{i=1}^n$. As shown above, incorporating a one-sided similarity relation can lead to a collapsed solution due to the highly skewed distribution of similarity confidence. A potential remedy for this failure is an underlying non-skewed distribution of similarity confidence. Fortunately, we can achieve this with only unlabeled data pairs. In the following section, we show that given unlabeled data pairs with similarity confidence, hard similarity labels are completely unnecessary, which means that we can successfully train an effective classifier from only unlabeled data pairs with similarity confidence.

5. Learning from Similarity-Confidence Data

In this section, we propose an unbiased risk estimator for learning from only unlabeled data pairs with similarity confidence and show the consistency of the proposed estimator by giving its estimation error bound. We also propose an effective class-prior estimator for estimating $\pi_+$ when it is not given in advance, and analyze the influence of inaccurate similarity confidence by giving a high-probability bound.

5.1. Unbiased Risk Estimator with Sconf Data

In this section, we derive an unbiased estimator of the classification risk in Eq. (1) with only Sconf data and establish its risk minimization framework. Based on the setting in Section 3.2, we first establish a crucial lemma before deriving the unbiased estimator of the classification risk (1) from only Sconf data:

Lemma 2. The following equalities hold: $R_+(g) = \mathbb{E}_{U^2}[\hat{R}_+(g)]$ and $R_-(g) = \mathbb{E}_{U^2}[\hat{R}_-(g)]$, where

$$\hat{R}_+(g) = \sum_{i=1}^n \frac{(s_i - \pi_-)\big(\ell(g(x_i), +1) + \ell(g(x'_i), +1)\big)}{2n(\pi_+ - \pi_-)}, \qquad (4)$$

$$\hat{R}_-(g) = \sum_{i=1}^n \frac{(\pi_+ - s_i)\big(\ell(g(x_i), -1) + \ell(g(x'_i), -1)\big)}{2n(\pi_+ - \pi_-)}. \qquad (5)$$

According to Lemma 2, we obtain an unbiased estimator of each counterpart of the classification risk in Eq. (1). Then we can directly derive an unbiased estimator of the classification risk in Eq. (1) with only Sconf data:

Theorem 3. We can construct an unbiased estimator $\hat{R}(g)$ of the classification risk (1), i.e., $\mathbb{E}_{U^2}[\hat{R}(g)] = R(g)$, with only Sconf data as follows:

$$\hat{R}(g) = \hat{R}_+(g) + \hat{R}_-(g). \qquad (6)$$

Since there are no implicit assumptions on models, losses, or optimizers in our analysis, any convex/non-convex loss and any deep/linear model can be used for Sconf learning.
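A minimal PyTorch-style sketch of the estimator in Eqs. (4)-(6) is given below, assuming the class prior $\pi_+$ is known and the logistic loss is used; the function names (`sconf_unbiased_risk`, `logistic_loss`) and the choice of returning the two partial risks separately (useful for the correction scheme in Section 6) are ours.

```python
import torch
import torch.nn.functional as F

def logistic_loss(z, y):
    # ell(z, y) = log(1 + exp(-y z)) for y in {+1, -1}
    return F.softplus(-y * z)

def sconf_unbiased_risk(g_x, g_xp, s, pi_plus, loss=logistic_loss):
    """Unbiased Sconf risk (Eq. 6).

    g_x, g_xp: model outputs g(x_i), g(x'_i) for the two sides of each pair.
    s:         similarity confidences s_i.
    """
    pi_minus = 1.0 - pi_plus
    denom = 2.0 * (pi_plus - pi_minus)
    # Eq. (4): positive counterpart of the risk
    r_plus = ((s - pi_minus) * (loss(g_x, 1.0) + loss(g_xp, 1.0))).mean() / denom
    # Eq. (5): negative counterpart of the risk
    r_minus = ((pi_plus - s) * (loss(g_x, -1.0) + loss(g_xp, -1.0))).mean() / denom
    return r_plus + r_minus, r_plus, r_minus
```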
5.2. Estimation Error Bound

Here we show the consistency of the proposed risk estimator $\hat{R}(g)$ in Eq. (6) by giving an estimation error bound. To begin with, let $\mathcal{G}$ be our function class for ERM. Assume there exists $C_g > 0$ such that $\sup_{g \in \mathcal{G}} \|g\|_\infty \leq C_g$, and $C_\ell > 0$ such that $\sup_{|z| \leq C_g} \ell(z, y) \leq C_\ell$ holds for all $y$. Following the usual practice (Mohri et al., 2012), we assume $\ell(z, y)$ is Lipschitz continuous w.r.t. $z$ for all $|z| \leq C_g$ and all $y$, with Lipschitz constant $L_\ell$. Let $g^* = \arg\min_{g \in \mathcal{G}} R(g)$ be the minimizer of the classification risk in Eq. (1), and $\hat{g} = \arg\min_{g \in \mathcal{G}} \hat{R}(g)$ be the minimizer of the empirical risk in Eq. (6). Then we can derive the following estimation error bound for Sconf learning:

Theorem 4. For any $\delta > 0$, the following inequality holds with probability at least $1 - \delta$:

$$R(\hat{g}) - R(g^*) \leq \frac{2 L_\ell\, \mathfrak{R}_n(\mathcal{G})}{|\pi_+ - \pi_-|} + \frac{2 C_\ell}{|\pi_+ - \pi_-|} \sqrt{\frac{\ln(2/\delta)}{2n}},$$

where $\mathfrak{R}_n(\mathcal{G})$ is the Rademacher complexity of $\mathcal{G}$ for unlabeled data of size $n$ drawn from the marginal distribution with density $p(x)$.

The definition of the Rademacher complexity (Bartlett & Mendelson, 2001) is provided in the supplementary material. Note that the estimation error bound converges at the rate $\mathcal{O}_p(1/\sqrt{n})$ if we assume that $\mathfrak{R}_n(\mathcal{G}) \leq C_{\mathcal{G}}/\sqrt{n}$, where $\mathcal{O}_p$ denotes the order in probability and $C_{\mathcal{G}}$ is a non-negative constant determined by the model complexity. This is a natural assumption since many model classes (e.g., linear-in-parameter models and fully connected neural networks (Golowich et al., 2018)) satisfy this condition. We make this assumption in the rest of this paper. Theorem 4 shows that the empirical risk minimizer converges to the classification risk minimizer with high probability at the rate $\mathcal{O}_p(1/\sqrt{n})$. As shown in Mendelson (2008), this is the optimal parametric convergence rate without additional assumptions.

5.3. Class-Prior Estimation from Similarity Confidence

In our Sconf learning, the class prior $\pi_+$ plays an important role in the construction of the unbiased risk estimator. Compared with previous work (Bao et al., 2018; Shimada et al., 2019), we make a milder assumption on the data distribution, which aligns with the practical data collection process. However, when the class prior $\pi_+$ is not given, we cannot estimate it by mixture proportion estimation (Scott, 2015) as in Bao et al. (2018), since we only have data drawn from the single distribution $U^2$. In this section, we propose a simple yet effective class-prior estimator with only Sconf data. We have the following theorem for the sample average of similarity confidence:

Theorem 5. The sample average of similarity confidence is an unbiased estimator of $\pi_+^2 + \pi_-^2$:

$$\mathbb{E}_{S_n}\!\left[\frac{\sum_{i=1}^n s(x_i, x'_i)}{n}\right] = \pi_+^2 + \pi_-^2.$$

Furthermore, according to McDiarmid's inequality (McDiarmid, 1989), the sample average of similarity confidence is consistent and converges to $\pi_+^2 + \pi_-^2$ at the rate $\mathcal{O}_p(\exp(-n))$, which is the optimal rate according to the central limit theorem (Chung, 1974). Let us denote $\pi_+^2 + \pi_-^2$ by $\pi_S$. Assuming $\pi_+ > \pi_-$, we can calculate the class prior by

$$\pi_+ = \frac{\sqrt{2\pi_S - 1} + 1}{2}.$$

According to Theorem 5, we can approximate $\pi_+$ with the average of similarity confidence and the formula above.
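A small sketch of the class-prior estimator implied by Theorem 5 follows. Solving $\pi_+^2 + (1 - \pi_+)^2 = \pi_S$ under $\pi_+ > \pi_-$ gives the closed form above; the clipping of the sample mean into $[0.5, 1]$ is a numerical safeguard we add (an assumption, not part of the paper), justified because $\pi_S = \pi_+^2 + \pi_-^2$ always lies in that interval.

```python
import numpy as np

def estimate_class_prior(s):
    """Estimate pi_+ from similarity confidences via Theorem 5."""
    pi_s = float(np.mean(s))            # sample mean approximates pi_S = pi_+^2 + pi_-^2
    pi_s = min(max(pi_s, 0.5), 1.0)     # our safeguard: pi_S lies in [0.5, 1] by definition
    return (1.0 + np.sqrt(2.0 * pi_s - 1.0)) / 2.0
```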
5.4. Analysis with Noisy Similarity Confidence

In the previous sections, we assumed that accurate confidence is accessible. However, this assumption may not be realistic in some practical tasks. A natural question is how noisy similarity confidence affects the learning performance: is Sconf learning robust, or can even a slight amount of noise on the similarity confidence cause a catastrophic degradation of performance? In this section, we theoretically justify that the Sconf learning framework is robust to noise on the similarity confidence by bounding the estimation error of learning with noisy confidence. Suppose we are given noisy Sconf data pairs $\bar{S}_n = \{(x_i, x'_i), \bar{s}_i\}_{i=1}^n$, where $\bar{s}_i$ is the noisy similarity confidence and is not necessarily equal to $s(x_i, x'_i)$ (in fact, it can be any real number in $[0, 1]$). For simplicity, we replace the accurate confidences $\{s_i\}_{i=1}^n$ in Eq. (6) with the noisy ones $\{\bar{s}_i\}_{i=1}^n$ and denote the noisy empirical risk by $\bar{R}(g)$. The minimizer of the noisy risk is $\bar{g} = \arg\min_{g \in \mathcal{G}} \bar{R}(g)$. To quantify the influence of noisy similarity confidence, we deduce the following estimation error bound:

Theorem 6. For any $\delta > 0$, the following inequality holds with probability at least $1 - \delta$:

$$R(\bar{g}) - R(g^*) \leq \frac{4 L_\ell\, \mathfrak{R}_n(\mathcal{G})}{|\pi_+ - \pi_-|} + \frac{4 C_\ell}{|\pi_+ - \pi_-|} \sqrt{\frac{\ln(2/\delta)}{2n}} + \frac{2 C_\ell\, \sigma_n}{n\, |\pi_+ - \pi_-|},$$

where $\sigma_n = \sum_{i=1}^n |s_i - \bar{s}_i|$ is the summed deviation of the noisy similarity confidence.

The deduced estimation error bound directly quantifies the influence of noisy similarity confidence: the estimation error of $\bar{g}$ is affected only up to the mean absolute error of the noisy confidence, so the noise has limited influence on the performance of Sconf learning. If the summed noise $\sigma_n$ grows sublinearly with high probability, Sconf learning even remains consistent, which shows that our Sconf learning framework is robust to noisy confidence.

6. Consistent Risk Correction

In the previous section, we presented the unbiased risk estimator that recovers the classification risk in Eq. (1) from only Sconf data, with a rigorous consistency analysis. Notice that the positive and negative counterparts of the empirical risk, i.e., $\hat{R}_+(g)$ and $\hat{R}_-(g)$, are not bounded from below and can go negative, while their expectations are non-negative by definition. This mismatch can be problematic since, as in previous works (Kiryo et al., 2017; Lu et al., 2020), severe overfitting usually occurs when the empirical risk goes negative, especially when flexible models (e.g., deep models) are used. This phenomenon can also be observed in Sconf learning, as shown in Figure 3. The detailed settings of the optimization algorithm are provided in the supplementary materials. In this section, we alleviate this problem with a simple yet effective risk correction on the proposed estimator (6). We further show that the proposed corrected estimator preserves consistency by bounding its estimation error.

Figure 3. Illustration of the overfitting of unbiased Sconf learning. (a) Kuzushiji-MNIST, MLP (ReLU, d-500-500-1); (b) CIFAR-10, ResNet-34. Both Kuzushiji-MNIST (Clanuwat et al., 2018) and CIFAR-10 (Krizhevsky, 2012) are manually corrupted into binary classification datasets. In (a), we trained a 3-layer multi-layer perceptron (MLP) with ReLU (Nair & Hinton, 2010) on Kuzushiji-MNIST. In (b), ResNet-34 (He et al., 2016) was trained on CIFAR-10. Adam (Kingma & Ba, 2015) was chosen as the optimization algorithm, and the logistic loss was used as the loss function. The generation of similarity confidence and the details of the corrupted datasets are the same as those in Section 7. The occurrence of overfitting and negative empirical risk is almost simultaneous: when the empirical risk (red line) goes negative, the risk on the test set (blue line) stops dropping and increases rapidly.

6.1. General Risk Formulation

Can we alleviate the overfitting in Sconf learning without collecting more data or changing the model? Here we give a positive answer to this question by introducing a slightly modified empirical risk estimator. Since the overfitting is caused by negative empirical risk, a natural idea is to correct the risk estimator when it goes negative. This idea was first proposed in Kiryo et al. (2017), where the data that yield a negative risk are ignored by applying a non-negative risk estimator. Lu et al. (2020) further showed that the information in those data can be helpful for generalization and should not be dropped. Based on these previous works, we propose the consistently corrected risk estimator for Sconf learning to enforce non-negativity:
Definition 1 (Lu et al., 2020). A risk estimator $\widetilde{R}$ is called a consistently corrected risk estimator if it takes the following form:

$$\widetilde{R}(g) = f\big(\hat{R}_+(g)\big) + f\big(\hat{R}_-(g)\big), \qquad (7)$$

where

$$f(x) = \begin{cases} x, & x \geq 0, \\ k|x|, & x < 0, \end{cases}$$

and $k \geq 0$.

Denote the minimizer of the consistently corrected Sconf risk estimator (7) by $\tilde{g} = \arg\min_{g \in \mathcal{G}} \widetilde{R}(g)$, which can be obtained by ERM. Two representative correction functions are the Non-Negative correction (Kiryo et al., 2017) and the ABSolute function (Lu et al., 2020), with $k = 0$ and $k = 1$ respectively. Their explicit formulations are:

$$\widetilde{R}_{\mathrm{NN}}(g) = \max\big\{0, \hat{R}_+(g)\big\} + \max\big\{0, \hat{R}_-(g)\big\}, \qquad (8)$$

$$\widetilde{R}_{\mathrm{ABS}}(g) = \big|\hat{R}_+(g)\big| + \big|\hat{R}_-(g)\big|. \qquad (9)$$

In Section 7, we will experimentally show their effectiveness in alleviating overfitting.

6.2. Consistency Guarantee

Notice that $\widetilde{R}(g)$ is an upper bound of the unbiased risk estimator $\hat{R}(g)$ for any fixed classifier $g$, which means that $\widetilde{R}(g)$ is in general biased and is not covered by the consistency analysis in the previous section. Here we justify the use of ERM by analyzing the consistency of $\widetilde{R}(g)$ and its minimizer $\tilde{g}$. We first show that the corrected estimator is consistent and that its bias decays exponentially.

Theorem 7 (Consistency of $\widetilde{R}(g)$). Assume that there exist $\alpha > 0$ and $\beta > 0$ such that $R_+(g) \geq \alpha$ and $R_-(g) \geq \beta$. Under the assumptions of Theorem 4, the bias of $\widetilde{R}(g)$ decays exponentially as $n \to \infty$:

$$\mathbb{E}\big[\widetilde{R}(g)\big] - R(g) \leq \frac{(L_f + 1)\, C_\ell}{|\pi_+ - \pi_-|}\, \Delta \exp\!\big(-(\pi_+ - \pi_-)^2 n\big),$$

where $\Delta = \exp(\alpha^2) + \exp(\beta^2)$ and $L_f = \max\{1, k\}$ is the Lipschitz constant of $f(\cdot)$. Furthermore, with probability at least $1 - \delta$:

$$\big|\widetilde{R}(g) - R(g)\big| \leq \frac{L_\ell C_\ell}{|\pi_+ - \pi_-|} \sqrt{\frac{\ln(2/\delta)}{2n}} + \frac{(L_f + 1)\, C_\ell}{|\pi_+ - \pi_-|}\, \Delta \exp\!\big(-(\pi_+ - \pi_-)^2 n\big).$$

Based on Theorem 7, we show that the empirical risk minimizer $\tilde{g}$ obtained by ERM converges to $g^*$ at the same rate $\mathcal{O}_p(1/\sqrt{n})$.

Theorem 8 (Estimation error bound of $\tilde{g}$). Under the assumptions and notation above, with probability at least $1 - \delta$:

$$R(\tilde{g}) - R(g^*) \leq \frac{2 (L_f + 1)\, C_\ell}{|\pi_+ - \pi_-|}\, \Delta \exp\!\big(-(\pi_+ - \pi_-)^2 n\big) + \frac{2 (L_\ell + 1)\, C_\ell}{|\pi_+ - \pi_-|} \sqrt{\frac{\ln(2/\delta)}{2n}}.$$

Theorem 8 shows that learning with $\widetilde{R}(g)$ is also consistent, and that it has the same convergence rate as learning with $\hat{R}(g)$, since the additional exponential term is of lower order.

7. Experiments

In this section, we demonstrate the usefulness of the proposed methods on both synthetic and benchmark datasets, with the data generation process described in Section 5. Sconf-Unbiased, Sconf-ABS, and Sconf-NN are short for ERM with the risk estimators in Eqs. (6), (8), and (9), respectively.
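The sketch below shows how the three objectives differ in code, applying the correction of Definition 1 to the partial risks of Eqs. (4)-(5) (for instance, the `r_plus` and `r_minus` values returned by the `sconf_unbiased_risk` sketch given earlier). The function name, the `correction` argument, and the dispatch on strings are our own illustration, not the authors' implementation.

```python
import torch

def corrected_sconf_risk(r_plus, r_minus, correction="abs", k=1.0):
    """Apply the risk correction of Definition 1 to the partial risks.

    correction="none" -> Eq. (6), "nn" -> Eq. (8), "abs" -> Eq. (9);
    a general k >= 0 implements f(x) = x for x >= 0 and k|x| for x < 0.
    """
    def f(x, k):
        return torch.where(x >= 0, x, k * x.abs())

    if correction == "none":
        return r_plus + r_minus
    if correction == "nn":
        k = 0.0
    return f(r_plus, k) + f(r_minus, k)
```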
Table 1. Mean and standard deviation of the classification accuracy (in percentage) over 5 trials with linear models on various synthetic datasets. Std. is the standard deviation of the Gaussian noise. The best and comparable methods based on the paired t-test at the 5% significance level are highlighted in boldface.

| Setup | Sconf | Sconf (std. = 0.1) | Sconf (std. = 0.2) | Sconf (std. = 0.3) | Supervised |
|-------|-------|--------------------|--------------------|--------------------|------------|
| A | 89.91 ± 0.18 | 89.84 ± 0.03 | 89.55 ± 0.64 | 89.64 ± 0.33 | 89.66 ± 0.41 |
| B | 90.62 ± 0.28 | 90.34 ± 0.40 | 90.03 ± 0.33 | 89.49 ± 0.92 | 90.71 ± 0.17 |
| C | 88.05 ± 0.30 | 88.14 ± 0.14 | 87.92 ± 0.57 | 87.91 ± 0.38 | 88.14 ± 0.16 |
| D | 90.43 ± 0.14 | 90.29 ± 0.30 | 90.40 ± 0.15 | 90.20 ± 0.14 | 90.56 ± 0.16 |

Figure 4. Illustration of Sconf learning with different scales of noise and Gaussian distributions on a single trial. (a) Setup A; (b) Setup B; (c) Setup C; (d) Setup D.

7.1. Synthetic Experiments

We experimentally characterize the behavior of Sconf learning and show its robustness to noisy confidence on synthetic datasets.

Setup: We generated the positive and negative data according to 2-dimensional Gaussian distributions with different means and covariances for $p_+(x)$ and $p_-(x)$. The setups of the data generation distributions are provided in the supplementary material. 500 positive and 300 negative data points were generated independently from each distribution for training. We dropped the class labels for Sconf learning and generated the unlabeled data pairs according to the data generation process in Section 5. We then analytically computed the class-posterior probability $r_+(x)$ from the two Gaussian densities and equipped the unlabeled data pairs with the true similarity confidence obtained via Lemma 1. 1000 positive and 600 negative data points were generated in the same way for testing. The linear-in-input model $f(x) = \omega^\top x + b$ and the logistic loss were used. We trained the model with Adam for 100 epochs (full-batch) with the default momentum parameters. The learning rate was initially set to 0.1 and divided by 10 every 30 epochs. To generate noisy similarity confidence, we added zero-mean Gaussian noise with standard deviation chosen from {0.1, 0.2, 0.3} to the obtained similarity confidence. When the noisy similarity confidence was above 1 or below 0, we clipped it to 1 or 0, respectively. The results of fully supervised learning are also provided.

Experimental results: The results are shown in Table 1 and Figure 4. Compared with fully supervised learning, Sconf learning attains similar accuracy on all synthetic datasets. The decline in performance under different scales of noise is not significant, which shows the robustness of Sconf learning.

7.2. Benchmark Experiments

Here we conducted experiments with deep neural networks on more realistic benchmark datasets.

Datasets: We evaluated the performance of the proposed methods on six widely used benchmarks: MNIST (LeCun et al., 1998), Fashion-MNIST (Xiao et al., 2017), Kuzushiji-MNIST (Clanuwat et al., 2018), EMNIST (Cohen et al., 2017), SVHN (Netzer et al., 2011), and CIFAR-10 (Krizhevsky, 2012). Following Lu et al. (2020), we manually corrupted the multi-class datasets into binary classification datasets. The detailed statistics of the datasets are in the supplementary materials.

Baselines: We compared our methods with both statistical-learning-based and representation-learning-based similarity learning baselines, including similarity-dissimilarity (SD) learning (Shimada et al., 2019), the Siamese network (Koch et al., 2015), and the contrastive loss (Hadsell et al., 2006). Since the Siamese network and the contrastive loss yield vector representations rather than class labels, we adopted the one-shot setting of Koch et al. (2015) and randomly chose two samples with different labels as the prototypes. Prediction is then determined according to the similarity between each test instance and the prototypes.
Table 2. Mean and standard deviation of the classification accuracy over 5 trials in percentage with deep models. The first three columns are the proposed methods and the last three are baselines. The best and comparable methods based on the paired t-test at the 5% significance level are highlighted in boldface.

| Datasets | Sconf-Unbiased | Sconf-ABS | Sconf-NN | SD | Siamese | Contrastive |
|----------|----------------|-----------|----------|----|---------|-------------|
| MNIST | 87.22 ± 2.11 | 96.12 ± 2.31 | 96.04 ± 1.23 | 86.57 ± 0.78 | 55.08 ± 3.94 | 71.91 ± 2.39 |
| Kuzushiji-MNIST | 78.12 ± 3.08 | 89.25 ± 1.58 | 90.00 ± 0.55 | 76.42 ± 4.09 | 59.82 ± 6.15 | 67.18 ± 5.41 |
| Fashion-MNIST | 86.28 ± 7.03 | 91.44 ± 0.39 | 91.37 ± 0.30 | 83.61 ± 8.94 | 58.29 ± 4.42 | 64.97 ± 5.76 |
| EMNIST-Digits | 87.96 ± 1.67 | 96.62 ± 0.06 | 96.21 ± 0.11 | 76.18 ± 11.21 | 53.08 ± 2.55 | 66.37 ± 5.40 |
| EMNIST-Letters | 77.14 ± 3.71 | 86.32 ± 1.20 | 86.72 ± 1.39 | 76.18 ± 11.21 | 55.76 ± 3.95 | 60.29 ± 3.14 |
| EMNIST-Balanced | 68.61 ± 10.21 | 74.94 ± 2.92 | 74.83 ± 3.40 | 64.03 ± 14.66 | 52.61 ± 1.22 | 58.30 ± 2.19 |
| CIFAR-10 | 65.68 ± 5.03 | 84.71 ± 1.41 | 84.49 ± 1.14 | 60.39 ± 6.56 | 59.83 ± 2.75 | 54.38 ± 1.48 |
| SVHN | 72.88 ± 3.15 | 83.51 ± 0.65 | 82.37 ± 0.23 | 71.48 ± 5.43 | 60.90 ± 5.01 | 69.26 ± 2.97 |

Figure 5. Experimental results of the proposed methods. Dark colors show the mean accuracy and light colors show the standard deviation. (a) MNIST, MLP; (b) Kuzushiji-MNIST, MLP; (c) Fashion-MNIST, MLP; (d) EMNIST-Digits, MLP; (e) EMNIST-Letters, MLP; (f) EMNIST-Balanced, MLP; (g) SVHN, ResNet-18; (h) CIFAR-10, ResNet-34.

Experimental setup: We trained the proposed methods and the baseline methods with the same model on each dataset, using the logistic loss. Different models are used for different datasets, as summarized in Figure 5. Since the Siamese network and the contrastive loss are representation-learning-based, their output dimensions were set to 300 on MNIST, Kuzushiji-MNIST, Fashion-MNIST, and EMNIST, and increased to 1000 on CIFAR-10 and SVHN. For all methods, the optimization algorithm was Adam (Kingma & Ba, 2015); the detailed settings are listed in the supplementary materials. For the ERM-based methods (Sconf-Unbiased, Sconf-ABS, Sconf-NN, and SD), the validation accuracy was also calculated according to their empirical risk estimators on a validation set consisting of Sconf data, which means that we do not have to collect additional ordinarily labeled data for validation when using ERM-based methods. To simulate real similarity confidence, we generated the class-posterior probability $p(y = +1 \mid x)$ using logistic regression with the same network for each dataset, and obtained the similarity confidence according to Lemma 1. Since the baseline methods require data with hard similarity labels, we generated the similarity label for each data pair according to the Bernoulli distribution determined by its similarity confidence. Note that in real-world Sconf learning we would ask the labelers for similarity confidence values, whereas here we generated them through a probabilistic classifier. The class labels are only used for estimating the similarity confidence, and the test sets are not used in any part of the experiments except the testing steps. We implemented all the methods in PyTorch (Paszke et al., 2019) and conducted the experiments on NVIDIA Tesla P4 GPUs. Experimental results are reported in Figure 5 and Table 2.
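The supervision generation just described can be sketched as follows. For brevity we use scikit-learn's logistic regression as the probabilistic classifier, whereas the paper fits logistic regression with the same network as the model; the function name and arguments are ours, and the hard similarity labels are only needed by the SD, Siamese, and contrastive baselines.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def simulate_benchmark_supervision(x_train, y_train, x_pairs_a, x_pairs_b, rng=None):
    """Estimate p(y=+1|x), derive similarity confidence via Lemma 1, and
    sample hard similarity labels from a Bernoulli distribution."""
    rng = np.random.default_rng(rng)
    clf = LogisticRegression(max_iter=1000).fit(x_train, (y_train == 1).astype(int))
    r_a = clf.predict_proba(x_pairs_a)[:, 1]      # estimated r_+(x)
    r_b = clf.predict_proba(x_pairs_b)[:, 1]      # estimated r_+(x')
    s = r_a * r_b + (1 - r_a) * (1 - r_b)          # similarity confidence
    hard_similar = rng.binomial(1, s)              # hard labels for the baselines
    return s, hard_similar
```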
Experimental results: It can be observed from Table 2 that the proposed methods (Sconf-Unbiased, Sconf-ABS, and Sconf-NN) outperformed the baseline methods on all the datasets. Among all the methods, the Siamese network and the contrastive loss performed poorly since they are representation-learning-based rather than classification-oriented. Although the goal of SD learning is classification, it failed to compete with the proposed methods because it can only utilize the hard similarity labels degenerated from the similarity confidence, which causes a loss of supervision information.

The effectiveness of the risk correction schemes in mitigating overfitting is illustrated in Figure 5 and Table 2. For Sconf-Unbiased, the learning curves in Figure 5 show that when the empirical risk of Sconf-Unbiased (red solid line) goes negative, the test loss increases rapidly, which indicates the occurrence of overfitting. As a consequence, the performance of Sconf-Unbiased became catastrophic compared with Sconf-ABS and Sconf-NN, as shown in Table 2. Thanks to the risk correction schemes that enforce the non-negativity of the empirical risk, Sconf-ABS and Sconf-NN did not suffer from the overfitting caused by negative empirical risk and greatly outperformed Sconf-Unbiased and all the baseline methods. For Sconf-ABS and Sconf-NN, the learning curves of the test and training losses are consistent, i.e., minimizing the training loss corresponding to the corrected risk estimators also minimizes the classification risk. This observation indicates that the corrected risk estimators can represent the classification risk better than the unbiased risk estimator.

8. Conclusion

We proposed a novel weakly supervised learning setting and effective algorithms for learning from unlabeled data pairs equipped with similarity confidence, where no class labels or similarity labels are needed. We proposed an unbiased risk estimator based on unlabeled data pairs with similarity confidence and further improved its resistance to overfitting via a risk correction scheme. Furthermore, we proved the consistency of the minimizers of the risk estimator and the corrected risk estimators. Experimental results showed that the proposed methods outperform the baseline methods and that the proposed risk correction scheme can effectively mitigate the overfitting caused by negative empirical risk.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 12071475, 11671010), Beijing Natural Science Foundation (No. 4172035), the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-RP-2019-0013), National Satellite of Excellence in Trustworthy Software Systems (Award No: NSOE-TSS2019-01), and NTU. GN and MS were supported by JST AIP Acceleration Research Grant Number JPMJCR20U3, Japan. MS was also supported by the Institute for AI and Beyond, UTokyo.

References

Bao, H., Niu, G., and Sugiyama, M. Classification from pairwise similarity and unlabeled data. In ICML, pp. 461-470, 2018.

Bao, H., Shimada, T., Xu, L., Sato, I., and Sugiyama, M. Similarity-based classification: Connecting similarity learning to binary classification. CoRR, abs/2006.06207, 2020.

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. In COLT, pp. 224-240, 2001.

Basu, S., Banerjee, A., and Mooney, R. J. Semi-supervised clustering by seeding. In ICML, pp. 27-34, 2002.

Chapelle, O., Schölkopf, B., and Zien, A. (eds.). Semi-Supervised Learning. The MIT Press, 2006.

Chou, Y., Niu, G., Lin, H., and Sugiyama, M. Unbiased risk estimators can mislead: A case study of learning with complementary labels. In ICML, pp. 1929-1938, 2020.

Chung, K. A Course in Probability Theory. Academic Press, 1974.

Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A., Yamamoto, K., and Ha, D. Deep learning for classical Japanese literature. CoRR, abs/1812.01718, 2018.

Cohen, G., Afshar, S., Tapson, J., and van Schaik, A. EMNIST: An extension of MNIST to handwritten letters. CoRR, abs/1702.05373, 2017.
Cour, T., Sapp, B., and Taskar, B. Learning from partial labels. J. Mach. Learn. Res., 12:1501-1536, 2011.

du Plessis, M. C., Niu, G., and Sugiyama, M. Analysis of learning from positive and unlabeled data. In NeurIPS, pp. 703-711, 2014.

du Plessis, M. C., Niu, G., and Sugiyama, M. Convex formulation for learning from positive and unlabeled data. In ICML, pp. 1386-1394, 2015.

Elkan, C. and Noto, K. Learning classifiers from only positive and unlabeled data. In KDD, pp. 213-220, 2008.

Feng, L., Kaneko, T., Han, B., Niu, G., An, B., and Sugiyama, M. Learning from multiple complementary labels. In ICML, pp. 3072-3081, 2020a.

Feng, L., Lv, J., Han, B., Xu, M., Niu, G., Geng, X., An, B., and Sugiyama, M. Provably consistent partial-label learning. In ICML, 2020b.

Fisher, R. J. Social desirability bias and the validity of indirect questioning. Journal of Consumer Research, (2):303-315, 1993.

Golowich, N., Rakhlin, A., and Shamir, O. Size-independent sample complexity of neural networks. In COLT, pp. 297-299, 2018.

Guo, L., Zhou, Z., and Li, Y. Record: Resource constrained semi-supervised learning under distribution shift. In KDD, pp. 1636-1644, 2020.

Hadsell, R., Chopra, S., and LeCun, Y. Dimensionality reduction by learning an invariant mapping. In CVPR, pp. 1735-1742, 2006.

Han, B., Yao, J., Niu, G., Zhou, M., Tsang, I. W., Zhang, Y., and Sugiyama, M. Masking: A new perspective of noisy supervision. In NeurIPS, pp. 5841-5851, 2018a.

Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I. W., and Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, pp. 8536-8546, 2018b.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, pp. 463-469, 2016.

Howe, J. Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business. Crown Publishing Group, 2009.

Hsu, Y., Lv, Z., Schlosser, J., Odom, P., and Kira, Z. Multi-class classification without multi-class labels. In ICLR, 2019.

Ishida, T., Niu, G., Hu, W., and Sugiyama, M. Learning from complementary labels. In NeurIPS, pp. 5639-5649, 2017.

Ishida, T., Niu, G., Menon, A. K., and Sugiyama, M. Complementary-label learning for arbitrary losses and models. In ICML, pp. 2971-2980, 2019.

Katsura, Y. and Uchida, M. Bridging ordinary-label learning and complementary-label learning. In ACML, pp. 161-176, 2020.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.

Kiryo, R., Niu, G., du Plessis, M. C., and Sugiyama, M. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, pp. 1675-1685, 2017.

Klein, D., Kamvar, S. D., and Manning, C. D. From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. In ICML, pp. 307-314, 2002.

Koch, G., Zemel, R., and Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, 2015.

Krizhevsky, A. Learning multiple layers of features from tiny images. University of Toronto, 2012.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998.

Ledoux, M. and Talagrand, M. Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.

Li, Y. and Liang, D. Safe semi-supervised learning: A brief introduction. Frontiers Comput. Sci., 13(4):669-676, 2019.

Li, Y. and Zhou, Z.
Towards making unlabeled data never hurt. IEEE Trans. Pattern Anal. Mach. Intell., 37(1):175-188, 2015.

Lu, N., Zhang, T., Niu, G., and Sugiyama, M. Mitigating overfitting in supervised classification from two unlabeled datasets: A consistent risk correction approach. In AISTATS, pp. 1115-1125, 2020.

Lv, J., Xu, M., Feng, L., Niu, G., Geng, X., and Sugiyama, M. Progressive identification of true labels for partial-label learning. In ICML, pp. 6500-6510, 2020.

McDiarmid, C. On the method of bounded differences, pp. 148-188. London Mathematical Society Lecture Note Series. Cambridge University Press, 1989.

Mendelson, S. Lower bounds for the empirical minimization algorithm. IEEE Trans. Inf. Theory, 54(8):3797-3803, 2008.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2012.

Nair, V. and Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In ICML, pp. 807-814, 2010.

Natarajan, N., Dhillon, I. S., Ravikumar, P., and Tewari, A. Learning with noisy labels. In NeurIPS, pp. 1196-1204, 2013.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

Niu, G., Dai, B., Yamada, M., and Sugiyama, M. Information-theoretic semi-supervised metric learning via entropy regularization. In ICML, 2012.

Niu, G., Jitkrittum, W., Dai, B., Hachiya, H., and Sugiyama, M. Squared-loss mutual information regularization: A novel information-theoretic approach to semi-supervised learning. In ICML, pp. 10-18, 2013.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, pp. 8024-8035, 2019.

Sakai, T., du Plessis, M. C., Niu, G., and Sugiyama, M. Semi-supervised classification based on classification from positive and unlabeled data. In ICML, pp. 2998-3006, 2017.

Sansone, E., Natale, F. G. B. D., and Zhou, Z. Efficient training for positive unlabeled learning. IEEE Trans. Pattern Anal. Mach. Intell., 41(11):2584-2598, 2019.

Scott, C. A rate of convergence for mixture proportion estimation, with application to learning from noisy labels. In AISTATS, 2015.

Shimada, T., Bao, H., Sato, I., and Sugiyama, M. Classification from pairwise similarities/dissimilarities and unlabeled data via empirical risk minimization. CoRR, abs/1904.11717, 2019.

Wagstaff, K., Cardie, C., Rogers, S., and Schrödl, S. Constrained k-means clustering with background knowledge. In ICML, pp. 577-584, 2001.

Wang, Q., Li, Y., and Zhou, Z. Partial label learning with unlabeled data. In IJCAI, pp. 3755-3761, 2019.

Wang, W. and Zhou, Z. Crowdsourcing label quality: A theoretical analysis. Sci. China Inf. Sci., 58(11):1-12, 2015.

Wu, S., Xia, X., Liu, T., Han, B., Gong, M., Wang, N., Liu, H., and Niu, G. Multi-class classification from noisy-similarity-labeled data. CoRR, abs/2002.06508, 2020.

Xia, X., Liu, T., Wang, N., Han, B., Gong, C., Niu, G., and Sugiyama, M. Are anchor points really indispensable in label-noise learning? In NeurIPS, pp. 6835-6846, 2019.

Xia, X., Liu, T., Han, B., Wang, N., Gong, M., Liu, H., Niu, G., Tao, D., and Sugiyama, M.
Part-dependent label noise: Towards instance-dependent label noise. In NeurIPS, 2020.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017.

Xing, E. P., Ng, A. Y., Jordan, M. I., and Russell, S. J. Distance metric learning with application to clustering with side-information. In NeurIPS, pp. 505-512, 2002.

Yi, J., Zhang, L., Jin, R., Qian, Q., and Jain, A. K. Semi-supervised clustering by input pattern assisted pairwise similarity matrix completion. In ICML, pp. 1400-1408, 2013.

Yu, X., Liu, T., Gong, M., and Tao, D. Learning with biased complementary labels. In ECCV, pp. 69-85, 2018.

Zhang, Z., Zhao, P., Jiang, Y., and Zhou, Z. Learning from incomplete and inaccurate supervision. In KDD, pp. 1017-1025, 2019.

Zhou, Z. A brief introduction to weakly supervised learning. National Science Review, 5(1):44-53, 2018.

Zhu, X. and Goldberg, A. B. Introduction to Semi-Supervised Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2009.