# Classification from Pairwise Similarity and Unlabeled Data

Han Bao^{1,2}, Gang Niu^{2}, Masashi Sugiyama^{2,1}

^{1} The University of Tokyo, Japan. ^{2} RIKEN, Japan. Correspondence to: Han Bao.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018.

Abstract

Supervised learning needs a huge amount of labeled data, which can be a serious bottleneck when privacy is a concern or the labeling cost is high. To overcome this problem, we propose a new weakly-supervised learning setting, called SU classification, where only similar (S) data pairs (two examples belonging to the same class) and unlabeled (U) data points are needed instead of fully labeled data. We show that an unbiased estimator of the classification risk can be obtained from SU data alone, and that the estimation error of its empirical risk minimizer achieves the optimal parametric convergence rate. Finally, we demonstrate the effectiveness of the proposed method through experiments.

1. Introduction

In supervised classification, we need a vast amount of labeled data in the training phase. However, in many real-world problems, it is time-consuming and laborious to label a huge amount of unlabeled data. To deal with this problem, weakly-supervised classification (Zhou, 2018) has been explored in various setups, including semi-supervised classification (Chapelle & Zien, 2005; Belkin et al., 2006; Chapelle et al., 2010; Miyato et al., 2016; Laine & Aila, 2017; Sakai et al., 2017; Tarvainen & Valpola, 2017; Luo et al., 2018), multiple-instance classification (Li & Vasconcelos, 2015; Miech et al., 2017; Bao et al., 2018), and positive-unlabeled (PU) classification (Elkan & Noto, 2008; du Plessis et al., 2014; 2015; Niu et al., 2016; Kiryo et al., 2017).

Another line of research, from the clustering viewpoint, is semi-supervised clustering, where pairwise similarity and dissimilarity data (a.k.a. must-link and cannot-link constraints) are utilized to guide unsupervised clustering toward a desired solution. The common approaches are (i) constrained clustering (Wagstaff et al., 2001; Basu et al., 2002; 2004; Li & Liu, 2009), which utilizes pairwise links as constraints on clustering;
(ii) metric learning (Xing et al., 2002; Bilenko et al., 2004; Weinberger et al., 2005; Davis et al., 2007; Li et al., 2008; Niu et al., 2012), which performs (k-means) clustering on a learned metric; and (iii) matrix completion (Yi et al., 2013; Chiang et al., 2015), which recovers unknown entries in a similarity matrix.

Semi-supervised clustering and weakly-supervised classification are similar in that they do not use fully supervised data. However, they differ from the learning-theoretic viewpoint: weakly-supervised classification methods are justified as supervised learning methods, while semi-supervised clustering methods are still evaluated as unsupervised learning (see Table 1). Indeed, weakly-supervised learning methods based on empirical risk minimization (du Plessis et al., 2014; 2015; Niu et al., 2016; Sakai et al., 2017) were shown to have estimation errors achieving the optimal parametric convergence rate, while such a generalization guarantee is not available for semi-supervised clustering methods.

Table 1: Explanations of classification and clustering.

Classification: The goal is to minimize the true risk (given the zero-one loss) of an inductive classifier. To this end, an empirical risk (given a surrogate loss) on the training data is minimized for training the classifier. The training and testing phases can be clearly distinguished. Classification requires the existence of the underlying joint density.

Clustering: The goal is to partition the data at hand into clusters. To this end, density-/margin-/information-based measures are optimized for implementing the low-density separation based on the cluster assumption. Most clustering methods are designed for in-sample inference.^a Clustering does not need the underlying joint density.

^a Exceptions are discriminative clustering methods designed for out-of-sample inference, such as maximum margin clustering (Xu et al., 2005) and information-maximization clustering (Krause et al., 2010; Sugiyama et al., 2011).

The goal of this paper is to propose a novel weakly-supervised learning method called SU classification, where only similar (S) data pairs (two examples belonging to the same class) and unlabeled (U) data points are employed, in order to bridge these two paradigms. In SU classification, the information available for training a classifier is similar to that in semi-supervised clustering. However, our proposed method gives an inductive model, which learns a decision function from training data and can be applied to out-of-sample prediction (i.e., prediction on unseen test data). Furthermore, the proposed method can not only separate the two classes but also identify which class is positive (class identification) under certain conditions. SU classification is particularly useful for predicting people's sensitive matters such as religion, politics, and opinions on racial issues: people often hesitate to give explicit answers on these matters, but indirect questions might be easier to answer, e.g., "Which person do you have the same belief as?"^1

^1 This questioning can be regarded as one type of randomized response (indirect questioning) technique (Warner, 1965; Fisher, 1993), a survey method used to avoid social desirability bias.

For this SU classification problem, our contributions in this paper are three-fold:

1. We propose an empirical risk minimization method for SU classification (Section 2). This enables us to obtain an inductive classifier. Under certain loss conditions, together with the linear-in-parameter model, its objective function becomes convex in the parameters.
2. We theoretically establish an estimation error bound for our SU classification method (Section 4), showing that the proposed method achieves the optimal parametric convergence rate.
3. We experimentally demonstrate the practical usefulness of the proposed SU classification method (Section 5).

Related problem settings are summarized in Figure 1.

Figure 1: Illustrations of SU classification and other related problem settings.

2. Classification from Pairwise Similarity and Unlabeled Data

In this section, we propose a learning method to train a classifier from pairwise similarity and unlabeled data.

2.1. Preliminaries

We briefly formulate the standard binary classification problem. Let $\mathcal{X} \subset \mathbb{R}^d$ be a $d$-dimensional example space and $\mathcal{Y} = \{+1, -1\}$ be a binary label space. We assume that labeled data $(x, y) \in \mathcal{X} \times \mathcal{Y}$ are drawn from the joint probability distribution with density $p(x, y)$. The goal of binary classification is to obtain a classifier $f : \mathcal{X} \to \mathbb{R}$ that minimizes the classification risk defined as

$$R(f) := \mathbb{E}_{(X,Y)\sim p}\left[\ell(f(X), Y)\right], \tag{1}$$

where $\mathbb{E}_{(X,Y)\sim p}[\cdot]$ denotes the expectation over the joint density $p(x, y)$ and $\ell : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_+$ is a loss function. The loss $\ell(z, t)$ measures how well the true class label $t \in \mathcal{Y}$ is estimated by a classifier output $z \in \mathbb{R}$, generally yielding a small value if $t$ is well estimated by $z$ and a large value otherwise.
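To make Eq. (1) concrete, here is a minimal sketch (ours, not from the paper) of how the risk is approximated empirically with a surrogate loss; the two-Gaussian data generator and the fixed linear scorer are assumptions purely for illustration:

```python
import numpy as np

def logistic_loss(z, t):
    # surrogate loss ell(z, t) = log(1 + exp(-t z))
    return np.logaddexp(0.0, -t * z)

def empirical_risk(f, X, y, loss):
    # empirical counterpart of Eq. (1): mean of ell(f(x_i), y_i)
    return loss(f(X), y).mean()

rng = np.random.default_rng(0)
n = 1000
# toy labeled sample: two Gaussian classes with prior p(y=+1) = 0.7
y = rng.choice([-1.0, 1.0], size=n, p=[0.3, 0.7])
X = rng.normal(loc=y[:, None] * np.ones((n, 2)), scale=1.0)

f = lambda X: X @ np.array([0.5, 0.5])  # a fixed linear scorer
print("surrogate risk:", empirical_risk(f, X, y, logistic_loss))
print("zero-one risk :", np.mean(np.sign(f(X)) != y))
```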
In standard supervised classification scenarios, we are given positive and negative training data drawn independently from $p(x, y)$. Based on these training data, the classification risk (1) is empirically approximated and the empirical risk minimizer is obtained. However, in many real-world problems, collecting labeled training data is costly. The goal of this paper is to train a binary classifier only from pairwise similarity and unlabeled data, which are cheaper to collect than fully labeled data.

2.2. Pairwise Similarity and Unlabeled Data

First, we discuss the underlying distributions of similar data pairs and unlabeled data points, in order to perform empirical risk minimization.

Pairwise Similarity: If $x$ and $x'$ belong to the same class, they are said to be pairwise similar (S). We assume that similar data pairs are drawn from

$$p_S(x, x') = p(x, x' \mid y = y' = +1 \lor y = y' = -1) = \frac{\pi_+^2\, p_+(x)\, p_+(x') + \pi_-^2\, p_-(x)\, p_-(x')}{\pi_+^2 + \pi_-^2}, \tag{2}$$

where $\pi_+ := p(y = +1)$ and $\pi_- := p(y = -1)$ are the class-prior probabilities satisfying $\pi_+ + \pi_- = 1$, and $p_+(x) := p(x \mid y = +1)$ and $p_-(x) := p(x \mid y = -1)$ are the class-conditional densities. Eq. (2) means that two labeled data points are drawn independently from $p(x, y)$, and the pair is accepted if they belong to the same class and rejected if they belong to different classes.

Unlabeled Data: We assume that unlabeled (U) data points are drawn from the marginal density $p(x)$, which decomposes into the sum of the class-conditional densities as

$$p(x) = \pi_+\, p_+(x) + \pi_-\, p_-(x). \tag{3}$$

Our goal is to train a classifier only from SU data, which we call SU classification. We assume that we have a set of similar pairs $D_S$ and an unlabeled dataset $D_U$:

$$D_S := \{(x_{S,i}, x'_{S,i})\}_{i=1}^{n_S} \overset{\text{i.i.d.}}{\sim} p_S(x, x'), \qquad D_U := \{x_{U,i}\}_{i=1}^{n_U} \overset{\text{i.i.d.}}{\sim} p(x).$$

We also write $\widetilde{D}_S := \{\tilde{x}_{S,i}\}_{i=1}^{2n_S}$ for the pointwise similar data obtained by ignoring the pairwise relations in $D_S$.

Lemma 1. $\widetilde{D}_S = \{\tilde{x}_{S,i}\}_{i=1}^{2n_S}$ are independently drawn from

$$\tilde{p}_S(x) = \frac{\pi_+^2\, p_+(x) + \pi_-^2\, p_-(x)}{\pi_S}, \tag{4}$$

where $\pi_S := \pi_+^2 + \pi_-^2$.

A proof is given in Appendix A. Lemma 1 states that a similar data pair $(x_S, x'_S)$ is essentially symmetric: if the pair $(x_S, x'_S)$ is drawn from $p_S$, then $x_S$ and $x'_S$ can each be regarded as drawn independently from $\tilde{p}_S$. This perspective is important when we analyze the variance of the risk estimator (Section 2.4) and when we estimate the class-prior (Section 3.2).

2.3. Risk Expression with SU Data

Below, we express the classification risk (1) only in terms of SU data. Assume $\pi_+ \neq \frac{1}{2}$, and let $\tilde{\ell}(z)$, $L_{S,\ell}(z)$, and $L_{U,\ell}(z)$ be

$$\tilde{\ell}(z) := \ell(z, +1) - \ell(z, -1),$$
$$L_{S,\ell}(z) := \frac{1}{2\pi_+ - 1}\,\tilde{\ell}(z),$$
$$L_{U,\ell}(z) := \frac{-\pi_-}{2\pi_+ - 1}\,\ell(z, +1) + \frac{\pi_+}{2\pi_+ - 1}\,\ell(z, -1).$$

Then we have the following theorem.

Theorem 1. The classification risk (1) can be equivalently expressed as

$$R_{\mathrm{SU},\ell}(f) = \pi_S\,\mathbb{E}_{(X,X')\sim p_S}\!\left[\frac{L_{S,\ell}(f(X)) + L_{S,\ell}(f(X'))}{2}\right] + \mathbb{E}_{X\sim p}\left[L_{U,\ell}(f(X))\right].$$

A proof is given in Appendix B.
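As a numerical sanity check of Theorem 1 (this sketch is ours, not part of the paper), one can draw SU data from a simple model and compare the finite-sample SU risk with the ordinary supervised risk (1); the 1-D Gaussian class-conditionals and the fixed classifier below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
pi_p = 0.7                      # class-prior pi_+ (assumed known here)
pi_m = 1.0 - pi_p
pi_s = pi_p**2 + pi_m**2        # pi_S = pi_+^2 + pi_-^2

loss = lambda z, t: np.logaddexp(0.0, -t * z)   # logistic loss

def draw_labeled(n):
    y = rng.choice([1.0, -1.0], size=n, p=[pi_p, pi_m])
    x = rng.normal(loc=2.0 * y)  # 1-D Gaussian class-conditionals
    return x, y

def draw_similar_pairs(n_pairs):
    # rejection sampling of Eq. (2): draw labeled pairs, keep label matches
    xs, xps = np.empty(0), np.empty(0)
    while xs.size < n_pairs:
        x1, y1 = draw_labeled(4 * n_pairs)
        x2, y2 = draw_labeled(4 * n_pairs)
        keep = y1 == y2
        xs = np.concatenate([xs, x1[keep]])
        xps = np.concatenate([xps, x2[keep]])
    return xs[:n_pairs], xps[:n_pairs]

f = lambda x: 0.8 * x - 0.1      # a fixed classifier to evaluate

def L_S(z):  # (ell(z,+1) - ell(z,-1)) / (2 pi_+ - 1)
    return (loss(z, 1) - loss(z, -1)) / (2 * pi_p - 1)

def L_U(z):  # (-pi_- ell(z,+1) + pi_+ ell(z,-1)) / (2 pi_+ - 1)
    return (-pi_m * loss(z, 1) + pi_p * loss(z, -1)) / (2 * pi_p - 1)

x_s, x_s2 = draw_similar_pairs(200000)
x_u, _ = draw_labeled(200000)
r_su = pi_s * np.mean((L_S(f(x_s)) + L_S(f(x_s2))) / 2) + np.mean(L_U(f(x_u)))

x, y = draw_labeled(200000)
r_pn = np.mean(loss(f(x), y))    # ordinary supervised risk (1)
print(f"SU risk: {r_su:.4f}  supervised risk: {r_pn:.4f}")  # should nearly match
```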
According to Theorem 1, the following is a natural candidate for an unbiased estimator of the classification risk (1):

$$\widehat{R}_{\mathrm{SU},\ell}(f) = \frac{\pi_S}{n_S}\sum_{i=1}^{n_S}\frac{L_{S,\ell}(f(x_{S,i})) + L_{S,\ell}(f(x'_{S,i}))}{2} + \frac{1}{n_U}\sum_{i=1}^{n_U} L_{U,\ell}(f(x_{U,i})) = \frac{\pi_S}{2n_S}\sum_{i=1}^{2n_S} L_{S,\ell}(f(\tilde{x}_{S,i})) + \frac{1}{n_U}\sum_{i=1}^{n_U} L_{U,\ell}(f(x_{U,i})), \tag{5}$$

where the last expression uses the decomposed version $\widetilde{D}_S$ of the similar pairs instead of $D_S$, which is valid since the loss term is symmetric. $L_{S,\ell}$ and $L_{U,\ell}$ are illustrated in Figure 2.

2.4. Minimum-Variance Risk Estimator

Eq. (5) is only one candidate for an unbiased SU risk estimator. Indeed, due to the symmetry of $(x, x') \sim p_S(x, x')$, we have the following lemma.

Lemma 2. The first term of $R_{\mathrm{SU},\ell}(f)$, i.e.,

$$\pi_S\,\mathbb{E}_{(X,X')\sim p_S}\!\left[\frac{L_{S,\ell}(f(X)) + L_{S,\ell}(f(X'))}{2}\right], \tag{6}$$

can be equivalently expressed as

$$\pi_S\,\mathbb{E}_{(X,X')\sim p_S}\!\left[\alpha\,L_{S,\ell}(f(X)) + (1-\alpha)\,L_{S,\ell}(f(X'))\right],$$

where $\alpha \in [0, 1]$ is an arbitrary weight.

A proof is given in Appendix C.1. By Lemma 2,

$$\frac{\pi_S}{n_S}\sum_{i=1}^{n_S}\left[\alpha\,L_{S,\ell}(f(x_{S,i})) + (1-\alpha)\,L_{S,\ell}(f(x'_{S,i}))\right] + \frac{1}{n_U}\sum_{i=1}^{n_U} L_{U,\ell}(f(x_{U,i})) \tag{7}$$

is also an unbiased estimator of Eq. (6) for any $\alpha$. A natural question then arises: which $\alpha$ gives the best estimator? We answer this question with the following theorem.

Theorem 2. The estimator

$$\frac{\pi_S}{n_S}\sum_{i=1}^{n_S}\frac{L_{S,\ell}(f(x_{S,i})) + L_{S,\ell}(f(x'_{S,i}))}{2} \tag{8}$$

has the minimum variance among estimators of the form (7) with respect to $\alpha \in [0, 1]$.

A proof is given in Appendix C.2. Thus, the variance minimality (with respect to $\alpha$ in Eq. (7)) of the risk estimator (5) is guaranteed by Theorem 2. We use this risk estimator in the following sections.

Figure 2: $L_{S,\ell}$ and $L_{U,\ell}$ appearing in Eq. (5), illustrated for different loss functions and class-priors: (a) squared loss with $\pi_+ = 3/4$; (b) logistic loss with $\pi_+ = 3/4$; (c) squared loss with $\pi_+ = 1/4$; (d) logistic loss with $\pi_+ = 1/4$.

2.5. Practical Implementation

Here, we investigate the objective function when the linear-in-parameter model $f(x) = w^\top\phi(x) + w_0$ is employed as the classifier, where $w \in \mathbb{R}^b$ and $w_0 \in \mathbb{R}$ are parameters and $\phi : \mathbb{R}^d \to \mathbb{R}^b$ is a vector of basis functions. In general, the bias parameter $w_0$ can be ignored.^2 We formulate SU classification as the following empirical risk minimization problem, using Eq. (5) together with $\ell_2$ regularization:

$$\min_{w}\ \widehat{J}(w), \tag{9}$$

$$\widehat{J}(w) := \frac{\pi_S}{2n_S}\sum_{i=1}^{2n_S} L_{S,\ell}(w^\top\phi(\tilde{x}_{S,i})) + \frac{1}{n_U}\sum_{i=1}^{n_U} L_{U,\ell}(w^\top\phi(x_{U,i})) + \frac{\lambda}{2}\|w\|^2, \tag{10}$$

where $\lambda > 0$ is the regularization parameter. We need the class-prior $\pi_+$ (contained in $\pi_S$) to solve this optimization problem; we discuss how to estimate $\pi_+$ in Section 3.2.

^2 Let $\tilde{\phi}(x) := [\phi(x)^\top\ 1]^\top$ and $\tilde{w} := [w^\top\ w_0]^\top$; then $w^\top\phi(x) + w_0 = \tilde{w}^\top\tilde{\phi}(x)$.

Next, we investigate appropriate choices of the loss function $\ell$. From now on, we focus on margin loss functions (Mohri et al., 2012): $\ell$ is said to be a margin loss function if there exists $\psi : \mathbb{R} \to \mathbb{R}_+$ such that $\ell(z, t) = \psi(tz)$. In general, our objective function (10) is non-convex even if a convex loss function is used for $\ell$.^3 However, the next theorem, inspired by Natarajan et al. (2013) and du Plessis et al. (2015), states that certain loss functions yield a convex objective function.

^3 In general, $L_{U,\ell}$ is non-convex because one of $\frac{-\pi_-}{2\pi_+-1}\ell(\cdot,+1)$ and $\frac{\pi_+}{2\pi_+-1}\ell(\cdot,-1)$ is convex and the other is concave. $L_{S,\ell}$ is not always convex either, even if $\ell$ is convex.

Theorem 3. If the loss function $\ell(z, t)$ is a convex margin loss, twice differentiable in $z$ almost everywhere (for every fixed $t \in \{\pm 1\}$), and satisfies the condition

$$\ell(z, +1) - \ell(z, -1) = -z,$$

then $\widehat{J}(w)$ is convex.

A proof of Theorem 3 is given in Appendix D. Examples of margin loss functions satisfying the conditions of Theorem 3 are listed in Table 2 and illustrated in Figure 3.

Table 2: A selected list of margin loss functions satisfying the conditions in Theorem 3.

| Loss name | $\psi(m)$ |
|-----------|-----------|
| Squared loss | $\frac{1}{4}(m - 1)^2$ |
| Logistic loss | $\log(1 + \exp(-m))$ |
| Double-hinge loss | $\max(-m, \max(0, \frac{1}{2} - \frac{1}{2}m))$ |

Figure 3: Comparison of the loss functions: the zero-one, squared, logistic, and double-hinge losses.

Below, as special cases, we present the objective functions for the squared loss and the double-hinge loss; the detailed derivations are given in Appendix E.

Squared Loss: The squared loss is $\ell_{\mathrm{SQ}}(z, t) = \frac{1}{4}(tz - 1)^2$. Substituting $\ell_{\mathrm{SQ}}$ into Eq. (10), the objective function becomes

$$\widehat{J}_{\mathrm{SQ}}(w) = \frac{1}{4n_U}\, w^\top X_U^\top X_U\, w - \frac{1}{2\pi_+ - 1}\left(\frac{\pi_S}{2n_S}\mathbf{1}^\top X_S - \frac{1}{2n_U}\mathbf{1}^\top X_U\right) w + \frac{\lambda}{2}\|w\|^2 + \mathrm{const.},$$

where $\mathbf{1}$ is the vector whose elements are all ones, $X_S := [\phi(\tilde{x}_{S,1})\ \cdots\ \phi(\tilde{x}_{S,2n_S})]^\top$, and $X_U := [\phi(x_{U,1})\ \cdots\ \phi(x_{U,n_U})]^\top$. The minimizer of this objective function can be obtained analytically as

$$\widehat{w} = \frac{1}{2\pi_+ - 1}\left(X_U^\top X_U + 2\lambda n_U I\right)^{-1}\left(\frac{\pi_S n_U}{n_S}\, X_S^\top\mathbf{1} - X_U^\top\mathbf{1}\right),$$

where $I$ is the identity matrix. Thus, the optimization problem can be implemented easily and solved highly efficiently when the number of basis functions is not too large.
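A minimal NumPy sketch of this closed-form solution, assuming the identity feature map $\phi(x) = x$ (so $X_S$ and $X_U$ are the raw design matrices); the function name `su_squared_loss_fit` is ours, not from the authors' implementation:

```python
import numpy as np

def su_squared_loss_fit(X_s_tilde, X_u, pi_p, lam=1e-4):
    """Closed-form minimizer of Eq. (10) for the squared loss.

    X_s_tilde: (2 n_S, b) de-paired similar points, X_u: (n_U, b) unlabeled
    points (rows are feature vectors); pi_p is pi_+ (must differ from 1/2).
    """
    pi_m = 1.0 - pi_p
    pi_s = pi_p**2 + pi_m**2
    n_s = X_s_tilde.shape[0] // 2
    n_u, b = X_u.shape
    # (X_U^T X_U + 2 lambda n_U I) w = ((pi_S n_U / n_S) X_S^T 1 - X_U^T 1) / (2 pi_+ - 1)
    A = X_u.T @ X_u + 2.0 * lam * n_u * np.eye(b)
    rhs = (pi_s * n_u / n_s) * X_s_tilde.sum(axis=0) - X_u.sum(axis=0)
    return np.linalg.solve(A, rhs) / (2.0 * pi_p - 1.0)

# usage sketch: w = su_squared_loss_fit(Xs, Xu, pi_p=0.7); predict with sign(X @ w)
```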
Double-Hinge Loss: The hinge loss $\ell_{\mathrm{H}}(z, t) = \max(0, 1 - tz)$ does not satisfy the conditions of Theorem 3, so du Plessis et al. (2015) proposed the double-hinge loss $\ell_{\mathrm{DH}}(z, t) = \max(-tz, \max(0, \frac{1}{2} - \frac{1}{2}tz))$ as an alternative. Since $\ell_{\mathrm{DH}}$ satisfies $\ell_{\mathrm{DH}}(z, +1) - \ell_{\mathrm{DH}}(z, -1) = -z$, we have $L_{U,\ell_{\mathrm{DH}}}(z) = \ell_{\mathrm{DH}}(z, -1) + \frac{\pi_-}{2\pi_+ - 1}z$, so only the convex term $\ell_{\mathrm{DH}}(z, -1)$ needs to be bounded by slack variables. Substituting $\ell_{\mathrm{DH}}$ into Eq. (10), we can thus reformulate the optimization problem as

$$\min_{w,\,\xi}\ -\frac{\pi_S}{2n_S(2\pi_+ - 1)}\mathbf{1}^\top X_S w + \frac{\pi_-}{n_U(2\pi_+ - 1)}\mathbf{1}^\top X_U w + \frac{1}{n_U}\mathbf{1}^\top\xi + \frac{\lambda}{2}\|w\|^2$$
$$\text{s.t.}\quad \xi \succeq 0,\quad \xi \succeq X_U w,\quad \xi \succeq \tfrac{1}{2}(\mathbf{1} + X_U w),$$

where $\succeq$ for vectors denotes the element-wise inequality. This optimization problem is a quadratic program (QP); the transformation into the standard QP form is given in Appendix E.
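Instead of the QP form, one can also minimize the convex objective (10) for the double-hinge loss directly with a general-purpose solver. The sketch below takes this simpler route; the identity feature map and the name `su_double_hinge_fit` are our assumptions for illustration:

```python
import numpy as np
from scipy.optimize import minimize

def su_double_hinge_fit(X_s_tilde, X_u, pi_p, lam=1e-4):
    """Minimize the convex SU objective (10) with the double-hinge loss."""
    pi_m = 1.0 - pi_p
    pi_s = pi_p**2 + pi_m**2
    b = X_s_tilde.shape[1]

    def dh(z, t):  # double-hinge loss: max(-tz, max(0, (1 - tz)/2))
        return np.maximum(-t * z, np.maximum(0.0, 0.5 * (1.0 - t * z)))

    def J(w):
        z_s = X_s_tilde @ w
        z_u = X_u @ w
        L_s = (dh(z_s, 1.0) - dh(z_s, -1.0)) / (2.0 * pi_p - 1.0)  # = -z/(2 pi_+ - 1)
        L_u = (-pi_m * dh(z_u, 1.0) + pi_p * dh(z_u, -1.0)) / (2.0 * pi_p - 1.0)
        return pi_s * L_s.mean() + L_u.mean() + 0.5 * lam * (w @ w)

    # derivative-free solver, since the objective has kinks; a QP solver on
    # the reformulation above would be more efficient for large problems
    return minimize(J, np.zeros(b), method="Powell").x
```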
3. Relation between the Class-Prior and SU Classification

In Section 2, we assumed that the class-prior $\pi_+$ is given in advance. In this section, we first clarify how the behavior of the proposed method depends on prior knowledge of $\pi_+$, and then propose an algorithm to estimate $\pi_+$ when it is not given in advance.

Table 3: Behaviors of the proposed method on class identification and class separation, depending on prior knowledge of the class-prior.

| Case | Prior knowledge | Identification | Separation |
|------|-----------------|----------------|------------|
| 1 | exact $\pi_+$ | yes | yes |
| 2 | nothing | no | yes |
| 3 | $\operatorname{sign}(\pi_+ - \pi_-)$ | yes | yes |

3.1. Class-Prior-Dependent Behaviors of the Proposed Method

We discuss the following three cases of prior knowledge of $\pi_+$ (summarized in Table 3).

(Case 1) The class-prior is given: In this case, we can directly solve the optimization problem (9). The solution not only separates the data but also identifies the classes, i.e., determines which class is positive.

(Case 2) No prior knowledge of the class-prior is given: In this case, we need to estimate $\pi_+$ before solving (9). If we assume $\pi_+ > \pi_-$, the estimation method in Section 3.2 gives an estimate of $\pi_+$. Thus, we can regard the larger cluster as the positive class and solve the optimization problem (9). In this case the solution only separates the data, because we have no prior information for class identification.

(Case 3) The magnitude relation of the class-priors is given: Finally, consider the case where we know which class has the larger class-prior. In this case, we also need to estimate $\pi_+$, but, perhaps surprisingly, we can still identify the classes. For example, if the negative class has the larger class-prior, we first estimate the class-prior (let $\hat{\pi}$ be the estimate). Since Algorithm 1 given in Section 3.2 always estimates the class-prior of the larger class, the positive class-prior is obtained as $\hat{\pi}_+ = 1 - \hat{\pi}$. After that, the problem reduces to Case 1.

Remark: In all three cases above, our proposed method gives an inductive model, which is applicable to out-of-sample prediction without any modification. On the other hand, most unsupervised/semi-supervised clustering methods are designed for in-sample prediction and can only give predictions for the data at hand.

3.2. Class-Prior Estimation from Pairwise Similarity and Unlabeled Data

We propose a class-prior estimation algorithm that uses only SU data. First, let us connect the pairwise marginal distribution $p(x, x')$ with $p_S(x, x')$ when two examples $x$ and $x'$ are drawn independently:

$$p(x, x') = p(x)\,p(x') = \pi_+^2\, p_+(x)p_+(x') + \pi_-^2\, p_-(x)p_-(x') + \pi_+\pi_-\, p_+(x)p_-(x') + \pi_+\pi_-\, p_-(x)p_+(x') = \pi_S\, p_S(x, x') + \pi_D\, p_D(x, x'), \tag{11}$$

where Eq. (2) was used to derive the last equality, $\pi_D := 2\pi_+\pi_-$, and

$$p_D(x, x') := p(x, x' \mid (y = +1 \land y' = -1) \lor (y = -1 \land y' = +1)) = \frac{\pi_+\pi_-\, p_+(x)p_-(x') + \pi_+\pi_-\, p_-(x)p_+(x')}{\pi_D}.$$

Marginalizing out $x'$ in Eq. (11) as in Lemma 1, we obtain

$$p(x) = \pi_S\,\tilde{p}_S(x) + \pi_D\,\tilde{p}_D(x), \tag{12}$$

where $\tilde{p}_S$ is defined in Eq. (4) and $\tilde{p}_D(x) := (p_+(x) + p_-(x))/2$. Since we have the samples $D_U$ and $\widetilde{D}_S$ drawn from $p$ and $\tilde{p}_S$, respectively (see Eqs. (3) and (4)), we can estimate $\pi_S$ by mixture proportion estimation methods^4 (Scott, 2015; Ramaswamy et al., 2016; du Plessis et al., 2017). After estimating $\pi_S$, we can calculate $\pi_+$. By the discussion in Section 3.1, we assume $\pi_+ > \pi_-$. Then, from

$$2\pi_S - 1 = \pi_S - \pi_D = (\pi_+ - \pi_-)^2 = (2\pi_+ - 1)^2 \ge 0,$$

we obtain

$$\pi_+ = \frac{1 + \sqrt{2\pi_S - 1}}{2}.$$

We summarize this wrapper around mixture proportion estimation as Algorithm 1.

Algorithm 1: Prior estimation from SU data. CPE is a class-prior (mixture proportion) estimation algorithm.
Input: $D_U = \{x_{U,i}\}_{i=1}^{n_U}$ (samples from $p$) and $\widetilde{D}_S = \{\tilde{x}_{S,i}\}_{i=1}^{2n_S}$ (samples from $\tilde{p}_S$)
Output: class-prior estimate $\hat{\pi}_+$
1. $\hat{\pi}_S \leftarrow \mathrm{CPE}(D_U, \widetilde{D}_S)$
2. $\hat{\pi}_+ \leftarrow \frac{1}{2}\bigl(1 + \sqrt{2\hat{\pi}_S - 1}\bigr)$

^4 Given a distribution $F$ that is a convex combination of distributions $G$ and $H$, i.e., $F = (1 - \theta)G + \theta H$, the mixture proportion estimation problem is to estimate $\theta \in [0, 1]$ only from samples drawn from $F$ and $H$. In our case, $F$, $H$, and $\theta$ correspond to $p(x)$, $\tilde{p}_S(x)$, and $\pi_S$, respectively. See, e.g., Scott (2015).
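A thin wrapper mirroring Algorithm 1, where `cpe` stands for any mixture proportion estimator (e.g., the KM1/KM2 estimators of Ramaswamy et al. (2016)); the clipping step is our safeguard against estimation noise, not part of the algorithm as stated:

```python
import numpy as np

def estimate_class_prior(X_u, X_s_tilde, cpe):
    """Algorithm 1: estimate pi_+ from U data and de-paired S data.

    cpe : callable (samples_from_mixture, samples_from_component) -> proportion
    """
    pi_s = cpe(X_u, X_s_tilde)        # estimate pi_S in p = pi_S p~_S + pi_D p~_D
    pi_s = np.clip(pi_s, 0.5, 1.0)    # 2 pi_S - 1 >= 0 must hold; clip for safety
    return 0.5 * (1.0 + np.sqrt(2.0 * pi_s - 1.0))
```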
4. Estimation Error Bound

In this section, we establish an estimation error bound for the proposed method. Hereafter, let $\mathcal{F} \subset \mathbb{R}^{\mathcal{X}}$ be the function class of a specified model.

Definition 1. Let $n$ be a positive integer, $Z_1, \dots, Z_n$ be i.i.d. random variables drawn from a probability distribution with density $\mu$, $\mathcal{H} = \{h : \mathcal{Z} \to \mathbb{R}\}$ be a class of measurable functions, and $\sigma = (\sigma_1, \dots, \sigma_n)$ be Rademacher variables, i.e., random variables taking $+1$ and $-1$ with even probabilities. Then the (expected) Rademacher complexity of $\mathcal{H}$ is defined as

$$\mathfrak{R}(\mathcal{H}; n, \mu) := \mathbb{E}_{Z_1,\dots,Z_n \sim \mu}\,\mathbb{E}_{\sigma}\!\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} \sigma_i\, h(Z_i)\right].$$

In this section, we assume that, for any probability density $\mu$, our model class $\mathcal{F}$ satisfies

$$\mathfrak{R}(\mathcal{F}; n, \mu) \le \frac{C_{\mathcal{F}}}{\sqrt{n}} \tag{13}$$

for some constant $C_{\mathcal{F}} > 0$. This assumption is reasonable because many model classes, such as the linear-in-parameter model class $\mathcal{F} = \{f(x) = w^\top\phi(x) \mid \|w\| \le C_w,\ \|\phi\|_\infty \le C_\phi\}$ ($C_w$ and $C_\phi$ are positive constants), satisfy it (Mohri et al., 2012).

Subsequently, let $f^* := \arg\min_{f \in \mathcal{F}} R(f)$ be the true risk minimizer and $\hat{f} := \arg\min_{f \in \mathcal{F}} \widehat{R}_{\mathrm{SU},\ell}(f)$ be the empirical risk minimizer.

Theorem 4. Assume the loss function $\ell$ is $\rho$-Lipschitz with respect to the first argument ($0 < \rho < \infty$), and all functions in the model class $\mathcal{F}$ are bounded, i.e., there exists a constant $C_b$ such that $\|f\|_\infty \le C_b$ for any $f \in \mathcal{F}$. Let $C_\ell := \sup_{t \in \{\pm 1\}} \ell(C_b, t)$. For any $\delta > 0$, with probability at least $1 - \delta$,

$$R(\hat{f}) - R(f^*) \le C_{\mathcal{F},\ell,\delta}\left(\frac{1}{\sqrt{n_S}} + \frac{1}{\sqrt{n_U}}\right), \tag{14}$$

where $C_{\mathcal{F},\ell,\delta}$ is a constant determined by $C_{\mathcal{F}}$, $\rho$, $C_\ell$, and $\delta$ that contains $|2\pi_+ - 1|$ in its denominator.

A proof is given in Appendix F. Theorem 4 shows that, if $\pi_+$ is given in advance, our proposed method is consistent: $R(\hat{f}) \to R(f^*)$ as $n_S \to \infty$ and $n_U \to \infty$. The convergence rate is $O_p(1/\sqrt{n_S} + 1/\sqrt{n_U})$, where $O_p$ denotes the order in probability. This order is the optimal parametric rate for empirical risk minimization without additional assumptions (Mendelson, 2008).

5. Experiments

In this section, we empirically investigate the performance of class-prior estimation and of the proposed method for SU classification.

Datasets: Datasets are obtained from the UCI Machine Learning Repository (Lichman, 2013), LIBSVM (Chang & Lin, 2011), and the ELENA project.^5 We randomly subsample the original datasets so that similar pairs consist of positive and negative pairs with the ratio $\pi_+^2$ to $\pi_-^2$ (see Eq. (2)), while the ratios of unlabeled and test data are $\pi_+$ to $\pi_-$ (see Eq. (3)).

^5 https://www.elen.ucl.ac.be/neural-nets/Research/Projects/ELENA/elena.htm

5.1. Class-Prior Estimation

First, we study the empirical performance of class-prior estimation on benchmark datasets. Different dataset sizes $\{200, 400, 800, 1600\}$ are tested, where half of the data are S pairs and the other half are U data. Figure 4 plots the results for KM1 and KM2, the estimators proposed by Ramaswamy et al. (2016), which we use as CPE in Algorithm 1.^6 Since $\pi_S \ge 1/2$, we use an additional heuristic and set $\lambda_{\mathrm{left}} = 2$ in Algorithm 1 of Ramaswamy et al. (2016).

^6 We used the authors' implementation published at http://web.eecs.umich.edu/~cscott/code/kernel_CPE.zip.

Figure 4: Estimation error of the class-prior (absolute difference between the true and the estimated class-prior) from SU data over 100 trials (vertical axes), plotted against the dataset size $n_S + n_U \in \{200, 400, 800, 1600\}$. For all experiments, the true class-prior $\pi_+$ is set to 0.7.

5.2. Classification Complexity

We empirically investigate the relationship between the classification performance of our proposed method and the number of training data. We conduct experiments on benchmark datasets with a fixed number of S pairs ($n_S = 200$) and different numbers of U data $\{200, 400, 800, 1600\}$. The experimental results are shown in Figure 5: the classification error decreases as $n_U$ grows, which agrees well with our theoretical analysis in Theorem 4. Furthermore, we observe a tendency for the classification error to become smaller as the class-prior moves away from $\frac{1}{2}$. This is because $C_{\mathcal{F},\ell,\delta}$ in Eq. (14) has the term $|2\pi_+ - 1|$ in the denominator, which makes the upper bound looser when $\pi_+$ is close to $\frac{1}{2}$.

Figure 5: Average classification error (vertical axes) and standard error (shaded areas) over 50 trials. Different $n_U \in \{200, 400, 800, 1600\}$ are tested, while $n_S$ is fixed to 200. For each dataset, results with different class-priors ($\pi_+ \in \{0.1, 0.4, 0.7\}$), assumed to be known in advance, are plotted. Dataset phoneme has no plot for $\pi_+ = 0.1$ because the original dataset contains too few data points to subsample an SU dataset with $\pi_+ = 0.1$.

The detailed setup of the proposed method is described below. Our implementation is available at https://github.com/levelfour/SU_Classification.

Proposed Method (SU): We use the linear-in-input model $f(x) = w^\top x + b$. In Section 5.2, the squared loss is used and $\pi_+$ is given (Case 1 in Table 3). In Section 5.3, the squared loss and the double-hinge loss are used, and the class-prior is estimated by Algorithm 1 with KM2 (Ramaswamy et al., 2016) (Case 2 in Table 3). The regularization parameter $\lambda$ is chosen from $\{10^{-1}, 10^{-4}, 10^{-7}\}$. To choose hyperparameters, 5-fold cross-validation is used. Since we have no labeled data in the training phase, the validation error cannot be computed directly. Instead, Eq. (5) equipped with the zero-one loss $\ell_{01}(m) = \frac{1}{2}(1 - \operatorname{sign}(m))$ is used as a proxy for the validation error. In each experimental trial, the model with the minimum validation error is chosen.
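A sketch of this validation proxy, i.e., Eq. (5) evaluated with the zero-one loss on held-out SU data; the helper name and the linear model with identity features are our assumptions:

```python
import numpy as np

def su_validation_risk(w, X_s_tilde, X_u, pi_p):
    """Proxy validation error: Eq. (5) with the zero-one loss.

    Note: the SU risk estimator is unbiased but not guaranteed to be
    nonnegative on finite samples, so this value can be negative.
    """
    pi_m = 1.0 - pi_p
    pi_s = pi_p**2 + pi_m**2
    l01 = lambda z, t: 0.5 * (1.0 - np.sign(t * z))   # zero-one (margin) loss
    L_S = lambda z: (l01(z, 1) - l01(z, -1)) / (2 * pi_p - 1)
    L_U = lambda z: (-pi_m * l01(z, 1) + pi_p * l01(z, -1)) / (2 * pi_p - 1)
    return pi_s * L_S(X_s_tilde @ w).mean() + L_U(X_u @ w).mean()

# usage sketch: pick the lambda whose trained w minimizes su_validation_risk
# averaged over the 5 cross-validation folds
```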
5.3. Benchmark Comparison with Baseline Methods

We compare our proposed method with baseline methods on benchmark datasets. We conduct experiments on each dataset with 500 similar data pairs, 500 unlabeled data points, and 100 test data points. As can be seen from Table 4, our proposed method outperforms the baselines on many datasets. The baseline methods are described below.

Baseline 1 (KM): As a simple baseline, we consider k-means clustering (MacQueen, 1967). We ignore the pair information of the S data and apply k-means clustering with k = 2 to the U data.

Baseline 2 (ITML): Information-theoretic metric learning (Davis et al., 2007) is a metric learning method that regularizes the covariance matrix based on prior knowledge, using pairwise constraints. We use the identity matrix as the prior knowledge, and the slack-variable parameter is fixed to $\gamma = 1$, since cross-validation cannot be employed without any class-label information. Using the obtained metric, k-means clustering is applied to the test data.

Baseline 3 (SERAPH): The semi-supervised metric learning paradigm with hyper-sparsity (Niu et al., 2012) is another metric learning method, based on entropy regularization. The hyperparameter choice follows SERAPH^hyper. Using the obtained metric, k-means clustering is applied to the test data.

Baseline 4 (3SMIC): Semi-supervised SMI-based clustering (Calandriello et al., 2014) models class-posteriors and maximizes the mutual information between the unlabeled data at hand and their cluster labels. The penalty parameter $\gamma$ and the kernel parameter $t$ are chosen from $\{10^{-2}, 10^{0}, 10^{2}\}$ and $\{4, 7, 10\}$, respectively, via 5-fold cross-validation.

Baseline 5 (DIMC): Dirty IMC (Chiang et al., 2015) is a noisy version of inductive matrix completion, where the similarity matrix is recovered from a low-rank feature matrix. The similarity matrix $S$ is assumed to be expressed as $UU^\top$, where $U$ contains low-rank feature representations of the input data. After obtaining $U$, k-means clustering is conducted on $U$. The two hyperparameters $\lambda_M$ and $\lambda_N$ in Eq. (2) of Chiang et al. (2015) are set to $\lambda_M = \lambda_N = 10^{-2}$.

Baseline 6 (IMSAT): Information-maximizing self-augmented training (Hu et al., 2017) is an unsupervised learning method that trains a probabilistic classifier to map similar data to similar representations, combining information-maximization clustering with self-augmented training, which makes the predictions for perturbed data close to the predictions for the original data. Instead of data perturbation, self-augmented training can be applied to the S data to make the predictions within each pair similar. Here, the logistic regression model $p_\theta(y \mid x) = (1 + \exp(-\theta^\top x))^{-1}$ is used as the classification model, where $\theta$ is the parameter to learn. The trade-off parameter $\lambda$ is set to 1.

Remark: KM, ITML, and SERAPH rely on k-means, which is trained using only the training data; test prediction is based on the metric between the test data and the learned cluster centers. Among the baselines, DIMC can only handle in-sample prediction, so it is trained using both the training and test data at the same time.
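For reference, the KM baseline reduces to a few lines with scikit-learn; the arrays `X_u`, `X_test`, and the $\{0, 1\}$-valued `y_test` below are assumed to be given:

```python
import numpy as np
from sklearn.cluster import KMeans

# KM baseline: ignore the pair information and cluster the unlabeled data
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_u)
test_pred = km.predict(X_test)          # assign test points to the nearest center
# clustering accuracy is label-permutation invariant: 1 - min(r, 1 - r)
r = np.mean(test_pred != y_test)
accuracy = 1.0 - min(r, 1.0 - r)
```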
Table 4: Mean accuracy and standard error of SU classification on different benchmark datasets over 20 trials. For all experiments, the class-prior $\pi_+$ is set to 0.7. The proposed method does not receive the true $\pi_+$ in advance; it estimates it instead. Performance is measured by the clustering accuracy $1 - \min(r, 1 - r)$, where $r$ is the error rate. Boldface in the original paper indicates the outperforming methods, chosen by a one-sided t-test at the 5% significance level. The result of SERAPH with w8a is unavailable due to high dimensionality and memory constraints.

| Dataset | Dim | SU Squared | SU Double-hinge | KM | ITML | SERAPH | 3SMIC | DIMC | IMSAT (linear) |
|---------|-----|------------|-----------------|----|------|--------|-------|------|----------------|
| adult | 123 | 64.5 (1.2) | 84.5 (0.8) | 58.1 (1.1) | 57.9 (1.1) | 66.5 (1.7) | 58.5 (1.3) | 63.7 (1.2) | 69.8 (0.9) |
| banana | 2 | 67.5 (1.2) | 68.2 (1.2) | 54.3 (0.7) | 54.8 (0.7) | 55.0 (1.1) | 61.9 (1.2) | 64.3 (1.0) | 69.8 (0.9) |
| cod-rna | 8 | 82.8 (1.3) | 71.0 (0.9) | 63.1 (1.1) | 62.8 (1.0) | 62.5 (1.4) | 56.6 (1.2) | 63.8 (1.1) | 69.1 (0.9) |
| higgs | 28 | 55.1 (1.1) | 69.3 (0.9) | 66.4 (1.6) | 66.6 (1.3) | 63.4 (1.1) | 57.0 (0.9) | 65.0 (1.1) | 69.7 (1.4) |
| ijcnn1 | 22 | 65.5 (1.3) | 73.6 (0.9) | 54.6 (0.9) | 55.8 (0.7) | 59.8 (1.2) | 58.9 (1.3) | 66.2 (2.2) | 68.5 (1.1) |
| magic | 10 | 66.0 (2.0) | 69.0 (1.3) | 53.9 (0.6) | 54.5 (0.7) | 55.0 (0.9) | 59.1 (1.4) | 63.1 (1.1) | 70.0 (1.1) |
| phishing | 68 | 75.0 (1.4) | 91.3 (0.6) | 64.4 (1.0) | 61.9 (1.1) | 62.4 (1.1) | 60.1 (1.3) | 64.8 (1.4) | 69.4 (0.8) |
| phoneme | 5 | 67.8 (1.5) | 70.8 (1.0) | 65.2 (0.9) | 66.7 (1.4) | 69.1 (1.4) | 61.3 (1.1) | 64.5 (1.2) | 69.2 (1.1) |
| spambase | 57 | 69.7 (1.4) | 85.5 (0.5) | 60.1 (1.8) | 54.4 (1.1) | 65.4 (1.8) | 61.5 (1.3) | 63.6 (1.3) | 70.5 (1.1) |
| susy | 18 | 59.8 (1.3) | 74.8 (1.2) | 55.6 (0.7) | 55.4 (0.9) | 58.0 (1.0) | 57.1 (1.2) | 65.2 (1.0) | 70.4 (1.2) |
| w8a | 300 | 62.1 (1.5) | 86.5 (0.6) | 71.0 (0.8) | 69.5 (1.5) | N/A | 60.5 (1.5) | 65.0 (2.0) | 70.2 (1.2) |
| waveform | 21 | 77.8 (1.3) | 87.0 (0.5) | 56.1 (0.8) | 54.8 (0.7) | 56.5 (0.9) | 56.5 (0.9) | 65.0 (0.9) | 69.7 (1.1) |

6. Conclusion

In this paper, we proposed a novel weakly-supervised learning problem named SU classification, where only similar pairs and unlabeled data are needed. SU classification even becomes class-identifiable under certain conditions on the class-prior (see Table 3). Its optimization problem with the linear-in-parameter model becomes convex if certain loss functions, such as the squared loss and the double-hinge loss, are chosen. We established an estimation error bound for the proposed method and confirmed that the estimation error decreases at the optimal parametric rate as the numbers of similar pairs and unlabeled data grow. We also investigated the empirical performance and confirmed that our proposed method performs better than baseline methods.

Acknowledgements

This work was supported by JST CREST JPMJCR1403, including the AIP challenge program, Japan. We thank Ryuichi Kiryo for fruitful discussions on this work.

References

Bao, H., Sakai, T., Sato, I., and Sugiyama, M. Convex formulation of multiple instance learning from positive and unlabeled bags. Neural Networks, 105:132-141, 2018.

Basu, S., Banerjee, A., and Mooney, R. J. Semi-supervised clustering by seeding. In ICML, pp. 27-34, 2002.

Basu, S., Bilenko, M., and Mooney, R. J. A probabilistic framework for semi-supervised clustering. In SIGKDD, pp. 59-68, 2004.

Belkin, M., Niyogi, P., and Sindhwani, V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399-2434, 2006.
Bilenko, M., Basu, S., and Mooney, R. J. Integrating constraints and metric learning in semi-supervised clustering. In ICML, pp. 839-846, 2004.

Calandriello, D., Niu, G., and Sugiyama, M. Semi-supervised information-maximization clustering. Neural Networks, 57:103-111, 2014.

Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chapelle, O. and Zien, A. Semi-supervised classification by low density separation. In AISTATS, pp. 57-64, 2005.

Chapelle, O., Schölkopf, B., and Zien, A. Semi-Supervised Learning. MIT Press, 1st edition, 2010.

Chiang, K.-Y., Hsieh, C.-J., and Dhillon, I. S. Matrix completion with noisy side information. In NIPS, pp. 3447-3455, 2015.

Davis, J. V., Kulis, B., Jain, P., Sra, S., and Dhillon, I. S. Information-theoretic metric learning. In ICML, pp. 209-216, 2007.

du Plessis, M. C., Niu, G., and Sugiyama, M. Analysis of learning from positive and unlabeled data. In NIPS, pp. 703-711, 2014.

du Plessis, M. C., Niu, G., and Sugiyama, M. Convex formulation for learning from positive and unlabeled data. In ICML, pp. 1386-1394, 2015.

du Plessis, M. C., Niu, G., and Sugiyama, M. Class-prior estimation for learning from positive and unlabeled data. Machine Learning, 106(4):463-492, 2017.

Elkan, C. and Noto, K. Learning classifiers from only positive and unlabeled data. In SIGKDD, pp. 213-220, 2008.

Fisher, R. Social desirability bias and the validity of indirect questioning. Journal of Consumer Research, 20(2):303-315, 1993.

Hu, W., Miyato, T., Tokui, S., Matsumoto, E., and Sugiyama, M. Learning discrete representations via information maximizing self-augmented training. In ICML, pp. 1558-1567, 2017.

Kiryo, R., Niu, G., du Plessis, M. C., and Sugiyama, M. Positive-unlabeled learning with non-negative risk estimator. In NIPS, pp. 1674-1684, 2017.

Krause, A., Perona, P., and Gomes, R. Discriminative clustering by regularized information maximization. In NIPS, pp. 775-783, 2010.

Laine, S. and Aila, T. Temporal ensembling for semi-supervised learning. In ICLR, 2017.

Li, W. and Vasconcelos, N. Multiple instance learning for soft bags via top instances. In CVPR, pp. 4277-4285, 2015.

Li, Z. and Liu, J. Constrained clustering by spectral kernel learning. In ICCV, pp. 421-427, 2009.

Li, Z., Liu, J., and Tang, X. Pairwise constraint propagation by semidefinite programming for semi-supervised classification. In ICML, pp. 576-583, 2008.

Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Luo, Y., Zhu, J., Li, M., Ren, Y., and Zhang, B. Smooth neighbors on teacher graphs for semi-supervised learning. In CVPR, 2018.

MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, pp. 281-297, Berkeley, Calif., 1967. University of California Press.

Mendelson, S. Lower bounds for the empirical minimization algorithm. IEEE Transactions on Information Theory, 54(8):3797-3803, 2008.

Miech, A., Alayrac, J., Bojanowski, P., Laptev, I., and Sivic, J. Learning from video and text via large-scale discriminative clustering. In ICCV, pp. 5267-5276, 2017.
Miyato, T., Maeda, S., Koyama, M., Nakae, K., and Ishii, S. Distributional smoothing with virtual adversarial training. In ICLR, 2016.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press, 2012.

Natarajan, N., Dhillon, I. S., Ravikumar, P. K., and Tewari, A. Learning with noisy labels. In NIPS, pp. 1196-1204, 2013.

Niu, G., Dai, B., Yamada, M., and Sugiyama, M. Information-theoretic semi-supervised metric learning via entropy regularization. In ICML, pp. 1717-1762, 2012.

Niu, G., du Plessis, M. C., Sakai, T., Ma, Y., and Sugiyama, M. Theoretical comparisons of positive-unlabeled learning against positive-negative learning. In NIPS, pp. 1199-1207, 2016.

Ramaswamy, H. G., Scott, C., and Tewari, A. Mixture proportion estimation via kernel embedding of distributions. In ICML, pp. 2052-2060, 2016.

Sakai, T., du Plessis, M. C., Niu, G., and Sugiyama, M. Semi-supervised classification based on classification from positive and unlabeled data. In ICML, pp. 2998-3006, 2017.

Scott, C. A rate of convergence for mixture proportion estimation, with application to learning from noisy labels. In AISTATS, pp. 838-846, 2015.

Sugiyama, M., Yamada, M., Kimura, M., and Hachiya, H. On information-maximization clustering: Tuning parameter selection and analytic solution. In ICML, pp. 65-72, 2011.

Tarvainen, A. and Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS, pp. 1195-1204, 2017.

Wagstaff, K., Cardie, C., Rogers, S., and Schrödl, S. Constrained k-means clustering with background knowledge. In ICML, pp. 577-584, 2001.

Warner, S. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63-69, 1965.

Weinberger, K. Q., Blitzer, J., and Saul, L. K. Distance metric learning for large margin nearest neighbor classification. In NIPS, pp. 1473-1480, 2005.

Xing, E. P., Ng, A. Y., Jordan, M. I., and Russell, S. Distance metric learning, with application to clustering with side-information. In NIPS, pp. 521-528, 2002.

Xu, L., Neufeld, J., Larson, B., and Schuurmans, D. Maximum margin clustering. In NIPS, pp. 1537-1544, 2005.

Yi, J., Zhang, L., Jin, R., Qian, Q., and Jain, A. Semi-supervised clustering by input pattern assisted pairwise similarity matrix completion. In ICML, pp. 1400-1408, 2013.

Zhou, Z.-H. A brief introduction to weakly supervised learning. National Science Review, 5(1):44-53, 2018.