Uncoupled Regression from Pairwise Comparison Data

Liyuan Xu 1,2 (liyuan@ms.k.u-tokyo.ac.jp)   Junya Honda 1,2 (honda@stat.t.u-tokyo.ac.jp)
Gang Niu 2 (gang.niu@riken.jp)   Masashi Sugiyama 2,1 (sugi@k.u-tokyo.ac.jp)
1 The University of Tokyo   2 RIKEN

(Now at Gatsby Computational Neuroscience Unit.)
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

Uncoupled regression is the problem of learning a model from unlabeled data and the set of target values while the correspondence between them is unknown. Such a situation arises in predicting anonymized targets that involve sensitive information, e.g., one's annual income. Since existing methods for uncoupled regression often require strong assumptions on the true target function, which limits their range of applications, we introduce in this paper a novel framework that does not require such assumptions. Our key idea is to utilize pairwise comparison data, which consists of pairs of unlabeled data for which we know which one has the larger target value. Such pairwise comparison data is easy to collect, as typically discussed in the learning-to-rank scenario, and does not break the anonymity of the data. We propose two practical methods for uncoupled regression from pairwise comparison data and show that the learned regression model converges to the optimal model at the optimal parametric rate when the target variable is uniformly distributed. Moreover, we empirically show that for linear models the proposed methods are comparable to ordinary supervised regression with labeled data.

1 Introduction

In supervised regression, we need a vast amount of labeled data in the training phase, which is costly and laborious to collect in many real-world applications. To deal with this problem, weakly-supervised regression has been proposed in various settings, such as semi-supervised learning (see Kostopoulos et al. [17] for a survey), multiple instance regression [27, 34], and transductive regression [4, 5]. See [35] for a thorough review of weakly-supervised learning in binary classification, which can be extended to regression with slight modifications.

Uncoupled regression [2] is one variant of weakly-supervised learning. In ordinary coupled regression, pairs of features and targets are provided, and we aim to learn a model that minimizes a certain prediction error on test data. In the uncoupled regression problem, on the other hand, we only have access to unlabeled data and the set of target values, and thus we do not know the true target for each data point. Such a situation often arises when we aim to predict people's sensitive matters, such as one's annual salary or the total amount of a deposit, the data of which is often anonymized for privacy concerns. Note that it is impossible to conduct uncoupled regression without further assumptions, since no labeled data is provided.

Carpentier and Schlueter [2] showed that uncoupled regression is solvable when the feature is one-dimensional and the true target function is monotonic in it. Although their algorithm is of limited practical use due to its strong assumption, their work offers a valuable insight: a model is learnable from uncoupled data if we know the ranking in the dataset. In this paper, we show that, instead of imposing the monotonicity assumption, we can infer such ranking information from data to solve uncoupled regression.
We use pairwise comparison data as a source of ranking information; it consists of pairs of unlabeled data for which we know which data point has the larger target value. Note that pairwise comparison data is easy to collect even for sensitive matters such as one's annual earnings. Although people often hesitate to give explicit answers, it may be easier to answer an indirect question such as "Which person earns more than you?"², which yields exactly the pairwise comparison data that we need. The difficulty here is that the comparison is based on the target value, which contains noise; hence, the comparison data is also affected by this noise. Since we do not put any assumption on the true target function, our methods are applicable to many situations. A similar problem was considered in Bao et al. [1] as well.

One naive method for uncoupled regression with pairwise comparison data is to use a score-based ranking method [29], which learns a score function with the minimum number of inversions on the pairwise comparison data. With such a score function, we can match the unlabeled data and the set of target values, and then conduct supervised learning. However, as discussed in Rigollet and Weed [28], we cannot consistently recover the true target function even if we know the true order of the missing target values in the unlabeled data, due to the noise in them. In contrast, our methods directly minimize the regression risk. We first rewrite the regression risk so that it can be estimated from unlabeled and pairwise comparison data, and then learn a model through empirical risk minimization. Such an approach based on risk rewriting has been extensively studied in the classification scenario [7, 6, 23, 30, 18] and exhibits promising performance.

We propose two estimators of the risk defined based on the expected Bregman divergence [11], which is a natural choice of risk function. We show that if the marginal distribution of the target variable is uniform, the estimators are unbiased and the learned model converges to the optimal model at the optimal rate. In general, however, we prove that it is impossible to obtain such an unbiased estimator for arbitrary marginal distributions, and the learned model may not converge to the optimal one. Still, our empirical evaluations on synthetic data and benchmark datasets show that our methods exhibit performance similar to that of a model learned from coupled data by ordinary supervised regression.

The paper is structured as follows. After discussing related work in Section 2, we formulate the uncoupled regression problem with pairwise comparison data in Section 3. In Sections 4 and 5, we present two methods for uncoupled regression and derive estimation error bounds for each. Finally, we show empirical results in Section 6 and conclude the paper in Section 7.

2 Related work

Several methods have been proposed to match two independently collected data sources. In the context of data integration [3], the matching is conducted based on contextual data provided for both data sources. For example, Walter and Fritsch [31] used spatial information as contextual data to integrate two data sources. Other work evaluated the quality of a matching by an information criterion and found the best matching by maximizing that metric. This problem is called cross-domain object matching (CDOM), which was formulated in Jebara [15]. A number of methods have been proposed for CDOM, such as Quadrianto et al. [26], Yamada and Sugiyama [33], and Jitta and Klami [16].
Another line of related work on the uncoupled regression problem imposes an assumption on the true target function. For example, Carpentier and Schlueter [2] assumed that the true target function is monotonic in a single feature, and this approach was refined theoretically by Rigollet and Weed [28]. Another common assumption is that the true target function is exactly a linear function of the features, which was studied in Hsu et al. [14] and Pananjady et al. [24]. Although the models learned by these methods converge to the true target function with infinite uncoupled data, they are of limited practical use due to their strong assumptions. On the other hand, our methods do not require any assumptions on the target function and are applicable to wider scenarios.

² This questioning can be regarded as one type of randomized response (indirect questioning) technique [32], a survey method to avoid social desirability bias.

It is worth noting that some methods use uncoupled data to enhance the performance of semi-supervised learning. For example, in label regularization [19], uncoupled data is used to regularize a regression model so that the distribution of predictions on unlabeled data is close to the marginal distribution of target variables, which was reported to increase accuracy.

Pairwise comparison data was originally considered in the ranking problem [29, 22], which aims to learn a score function that ranks data correctly. In fact, we can apply ranking methods, such as rank SVM [13], to our problem. However, a naive application of them performs poorly compared to the proposed methods, as we show empirically, since our goal is not to order data correctly but to predict true target values.

3 Problem settings

In this section, we formulate the uncoupled regression problem and introduce pairwise comparison data.

3.1 Uncoupled regression problem

We first briefly formulate the standard regression problem. Let $\mathcal{X} \subseteq \mathbb{R}^d$ be a d-dimensional feature space and $\mathcal{Y} \subseteq \mathbb{R}$ be a target space. We denote by X, Y random variables on $\mathcal{X}$, $\mathcal{Y}$, respectively, and assume that they follow the joint distribution $P_{X,Y}$. The goal of the regression problem is to obtain a model $h : \mathcal{X} \to \mathcal{Y}$ in a hypothesis space $\mathcal{H}$ that minimizes the risk

  $R(h) = \mathbb{E}_{X,Y}\left[ l(h(X), Y) \right]$,   (1)

where $\mathbb{E}_{X,Y}$ denotes the expectation over $P_{X,Y}$ and $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ is a loss function. The loss $l(z, t)$ measures the closeness between a true target $t \in \mathcal{Y}$ and a model output $z \in \mathcal{Y}$, and generally grows as the prediction z gets far from the target t. In this paper, we mainly take $l(z, t)$ to be the Bregman divergence $d_\phi(t, z)$, defined as

  $d_\phi(t, z) = \phi(t) - \phi(z) - (t - z)\phi'(z)$

for some convex function $\phi : \mathbb{R} \to \mathbb{R}$, where $\phi'$ denotes the derivative of $\phi$. This is a natural choice of loss since the minimizer of the risk R is $\mathbb{E}_{Y|X=x}[Y]$ when the hypothesis space $\mathcal{H}$ is rich enough [11], where $\mathbb{E}_{Y|X=x}$ is the conditional expectation over the distribution of Y given X = x. Many common loss functions can be interpreted as Bregman divergences; for instance, when $\phi(x) = x^2$, $d_\phi(t, z)$ becomes the l2-loss, and when $\phi(x) = x\log x + (1 - x)\log(1 - x)$, $d_\phi(t, z)$ is the Kullback-Leibler divergence between the Bernoulli distributions with parameters t and z.
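To make these two special cases concrete, the following minimal Python sketch (an illustration, not code from the paper) implements $d_\phi$ and checks that $\phi(x) = x^2$ recovers the squared loss and that $\phi(x) = x\log x + (1 - x)\log(1 - x)$ recovers the Bernoulli KL divergence.

```python
import numpy as np

def bregman(phi, phi_prime, t, z):
    """Bregman divergence d_phi(t, z) = phi(t) - phi(z) - (t - z) * phi'(z)."""
    return phi(t) - phi(z) - (t - z) * phi_prime(z)

# phi(x) = x^2 recovers the squared (l2) loss.
sq = lambda x: x ** 2
sq_prime = lambda x: 2 * x
t, z = 0.7, 0.2
assert np.isclose(bregman(sq, sq_prime, t, z), (t - z) ** 2)

# phi(x) = x log x + (1 - x) log(1 - x) recovers the KL divergence
# between Bernoulli(t) and Bernoulli(z).
ent = lambda x: x * np.log(x) + (1 - x) * np.log(1 - x)
ent_prime = lambda x: np.log(x) - np.log(1 - x)
kl = t * np.log(t / z) + (1 - t) * np.log((1 - t) / (1 - z))
assert np.isclose(bregman(ent, ent_prime, t, z), kl)
```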
In the standard regression scenario, we are given labeled training data $D = \{(x_i, y_i)\}_{i=1}^{n}$ drawn independently and identically from $P_{X,Y}$. Then, based on the training data, we empirically estimate the risk R(h) and obtain a model $\hat h$ by minimizing the empirical risk. In uncoupled regression, on the other hand, what we are given are unlabeled data $D_U = \{x_i\}_{i=1}^{n_U}$ and target values $D_Y = \{y_i\}_{i=1}^{n_Y}$ without correspondence, where $n_U$ is the size of the unlabeled data. Furthermore, we denote the marginal distribution of the feature X by $P_X$ and its probability density function by $f_X$. Similarly, $P_Y$ stands for the marginal distribution of the target Y, and $f_Y$ is the density function of $P_Y$. We use $\mathbb{E}_{X,Y}$, $\mathbb{E}_X$ and $\mathbb{E}_Y$ to denote the expectations over $P_{X,Y}$, $P_X$, and $P_Y$, respectively.

Unlike Carpentier and Schlueter [2], we do not try to match the unlabeled data and the target values. In fact, our methods do not use each target value in $D_Y$ but only the density function $f_Y$ of the target, which can be estimated from $D_Y$. For simplicity, we assume that the true density function $f_Y$ is known. The case where we need to estimate $f_Y$ from $D_Y$ is discussed in Appendix B.

3.2 Pairwise comparison data

Here, we introduce pairwise comparison data. It consists of two random variables $(X^+, X^-)$, where the target value of $X^+$ is larger than that of $X^-$. Formally, $(X^+, X^-)$ are defined as

  $X^+ = \begin{cases} X & (Y \ge Y'), \\ X' & (Y < Y'), \end{cases}$   $X^- = \begin{cases} X' & (Y \ge Y'), \\ X & (Y < Y'), \end{cases}$   (2)

where $(X, Y)$ and $(X', Y')$ are two independent pairs of random variables following $P_{X,Y}$. We denote the joint distribution of $(X^+, X^-)$ by $P_{X^+,X^-}$ and the marginal distributions by $P_{X^+}$, $P_{X^-}$. The density functions $f_{X^+,X^-}$, $f_{X^+}$, $f_{X^-}$ and the expectations $\mathbb{E}_{X^+,X^-}$, $\mathbb{E}_{X^+}$, $\mathbb{E}_{X^-}$ are defined analogously. We assume that we have access to $n_R$ i.i.d. pairs of samples of $(X^+, X^-)$ as $D_R = \{(x_i^+, x_i^-)\}_{i=1}^{n_R}$, in addition to the unlabeled data $D_U$ and the density function $f_Y$ of the target variable Y. In the following sections, we show that uncoupled regression can be solved from this information alone. In fact, our methods only require samples of either one of $X^+$, $X^-$, which corresponds to the case where only the winner or only the loser of each comparison is observable.

One naive approach to uncoupled regression with $D_R$ would be to adopt a ranking method, which learns a ranker $r : \mathcal{X} \to \mathbb{R}$ that minimizes the expected ranking loss

  $R_R(r) = \mathbb{E}_{X^+,X^-}\big[\mathbf{1}\{r(X^+) - r(X^-) < 0\}\big]$,   (3)

where $\mathbf{1}\{\cdot\}$ is the indicator function. By minimizing the empirical estimate of (3) based on $D_R$, we can learn a ranker $\hat r$ that sorts data points by the target Y. Then, we can estimate the quantile of a test point by ranking it against $D_U$, which leads to a prediction through the inverse of the cumulative distribution function (CDF) of Y. Formally, if the test point $x_{\mathrm{test}}$ is ranked n'-th from the top in $D_U$, we predict its target value as

  $\hat h(x_{\mathrm{test}}) = F_Y^{-1}\!\left(1 - \frac{n'}{n_U}\right)$,   (4)

where $F_Y(t) = P(Y \le t)$ is the CDF of Y. This approach, however, is known to be highly sensitive to the randomness in the target variable, as discussed in Rigollet and Weed [28]. This is because noise in a single data point changes the ranking of all other data points and affects their predictions. As illustrated in Rigollet and Weed [28], even when we have a perfect ranker, i.e., we know the true order in $D_U$, the prediction (4) is still different from the expected target Y given the feature X in the presence of noise.
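To make the setting concrete, here is a minimal sketch of how the pairwise comparison data of (2) and the naive ranking-based predictor (4) could be simulated. It assumes numpy, a pool of coupled samples that only plays the role of the comparison oracle, and a user-supplied inverse CDF `F_Y_inv`; none of this is the implementation used in the experiments.

```python
import numpy as np

def make_pairwise(X, Y, n_R, rng):
    """Draw n_R comparison pairs (x+, x-) as in (2): two points are sampled
    from a labeled pool and ordered by their (noisy) target values, which
    are used only to answer the comparison query."""
    n = len(Y)
    i = rng.integers(0, n, size=n_R)
    j = rng.integers(0, n, size=n_R)
    winner = np.where(Y[i] >= Y[j], i, j)
    loser = np.where(Y[i] >= Y[j], j, i)
    return X[winner], X[loser]          # samples of X+ and X-

def naive_rank_predict(rank_from_top, n_U, F_Y_inv):
    """Naive predictor (4): a test point ranked n'-th from the top among the
    n_U unlabeled points is mapped to the corresponding quantile of Y."""
    return F_Y_inv(1.0 - rank_from_top / n_U)
```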
4 Empirical risk minimization by risk approximation

In this section, we propose a method to learn a model from pairwise comparison data $D_R$, unlabeled data $D_U$, and the density function $f_Y$ of the target variable Y. The method follows the empirical risk minimization principle, while the risk is approximated so that it can be empirically estimated from the available data. Therefore, we call this method the risk approximation (RA) method. Below, we present the approximated risk and derive its estimation error bound.

From the definition of the Bregman divergence, the risk function in (1) is expressed as

  $R(h) = \mathbb{E}_Y[\phi(Y)] - \mathbb{E}_X\big[\phi(h(X)) - h(X)\phi'(h(X))\big] - \mathbb{E}_{X,Y}\big[Y\phi'(h(X))\big]$.   (5)

In this decomposition, the last term is the only problematic part in uncoupled regression, since it requires an expectation over the joint distribution. Here, we consider approximating the last term based on the following expectations over the distributions of $X^+$, $X^-$.

Lemma 1. We have

  $\mathbb{E}_{X^+}\big[\phi'(h(X^+))\big] = 2\,\mathbb{E}_{X,Y}\big[F_Y(Y)\phi'(h(X))\big]$,
  $\mathbb{E}_{X^-}\big[\phi'(h(X^-))\big] = 2\,\mathbb{E}_{X,Y}\big[(1 - F_Y(Y))\phi'(h(X))\big]$.

The proof can be found in Appendix C.1. From Lemma 1, we can see that $\mathbb{E}_{X,Y}[Y\phi'(h(X))] = \mathbb{E}_{X^+}[\phi'(h(X^+))]/2$ if $F_Y(y) = y$, which corresponds to the case where the target variable Y is marginally uniform on [0, 1]. This leads us to consider an approximation of the form

  $\mathbb{E}_{X,Y}\big[Y\phi'(h(X))\big] \simeq w_1\,\mathbb{E}_{X^+}\big[\phi'(h(X^+))\big] + w_2\,\mathbb{E}_{X^-}\big[\phi'(h(X^-))\big]$   (6)

for some constants $w_1, w_2 \in \mathbb{R}$. The above uniform case corresponds to $(w_1, w_2) = (1/2, 0)$. More generally, if the target Y is marginally uniform on [a, b] for b > a, that is, $F_Y(y) = (y - a)/(b - a)$ for all $y \in [a, b]$, the approximation (6) becomes exact for $(w_1, w_2) = (b/2, a/2)$ by Lemma 1. In such a case, we can construct an unbiased estimator of the true risk R from unlabeled and pairwise comparison data. For non-uniform target marginal distributions, we choose $(w_1, w_2)$ to minimize an upper bound on the estimation error, which we discuss in detail later.

Since $\mathbb{E}_X[\phi'(h(X))] = \frac{1}{2}\mathbb{E}_{X^+}[\phi'(h(X^+))] + \frac{1}{2}\mathbb{E}_{X^-}[\phi'(h(X^-))]$ by Lemma 1, (6) can be rewritten as

  $\mathbb{E}_{X,Y}\big[Y\phi'(h(X))\big] \simeq \lambda\,\mathbb{E}_X\big[\phi'(h(X))\big] + \Big(w_1 - \tfrac{\lambda}{2}\Big)\mathbb{E}_{X^+}\big[\phi'(h(X^+))\big] + \Big(w_2 - \tfrac{\lambda}{2}\Big)\mathbb{E}_{X^-}\big[\phi'(h(X^-))\big]$   (7)

for arbitrary $\lambda \in \mathbb{R}$. Hence, by approximating (5) via (7), we obtain the approximated risk

  $R_{\mathrm{RA}}(h; \lambda, w_1, w_2) = C - \mathbb{E}_X\big[\phi(h(X)) - (h(X) - \lambda)\phi'(h(X))\big] - \Big(w_1 - \tfrac{\lambda}{2}\Big)\mathbb{E}_{X^+}\big[\phi'(h(X^+))\big] - \Big(w_2 - \tfrac{\lambda}{2}\Big)\mathbb{E}_{X^-}\big[\phi'(h(X^-))\big]$.

Here, $C = \mathbb{E}_Y[\phi(Y)]$ can be ignored in the optimization procedure. The empirical estimator of $R_{\mathrm{RA}}$ is

  $\hat R_{\mathrm{RA}}(h; \lambda, w_1, w_2) = C - \frac{1}{n_U}\sum_{i=1}^{n_U}\big(\phi(h(x_i)) - (h(x_i) - \lambda)\phi'(h(x_i))\big) - \frac{w_1 - \lambda/2}{n_R}\sum_{i=1}^{n_R}\phi'(h(x_i^+)) - \frac{w_2 - \lambda/2}{n_R}\sum_{i=1}^{n_R}\phi'(h(x_i^-))$,

which is the objective minimized in the RA method. Again, we emphasize that if the marginal distribution $P_Y$ is uniform on [a, b] and $(w_1, w_2)$ is set to (b/2, a/2), we have $R_{\mathrm{RA}} = R$ and $\hat R_{\mathrm{RA}}$ is an unbiased estimator of R for any $\lambda \in \mathbb{R}$.

From the definition of $\hat R_{\mathrm{RA}}$, we can see that by setting $\lambda$ to either $2w_1$ or $2w_2$, $\hat R_{\mathrm{RA}}$ becomes independent of either $X^+$ or $X^-$. This means that we can conduct uncoupled regression even if one of $X^+$, $X^-$ is missing from the data, which corresponds to the case where only winners or only losers of the comparisons are observed. Another advantage of tuning the free parameter $\lambda$ is that we can reduce the variance of the empirical risk $\hat R_{\mathrm{RA}}$, as discussed in Sakai et al. [30] and Bao et al. [1]. As in Sakai et al. [30], the optimal $\lambda$ that minimizes the variance of $\hat R_{\mathrm{RA}}$ as $n_U \to \infty$ is derived as follows.

Theorem 1. For a given model h, let $\sigma_+^2 = \mathrm{Var}_{X^+}[\phi'(h(X^+))]$ and $\sigma_-^2 = \mathrm{Var}_{X^-}[\phi'(h(X^-))]$, where $\mathrm{Var}_X[\cdot]$ is the variance with respect to the random variable X. Then, setting $\lambda$ to a value determined by $\sigma_+^2$ and $\sigma_-^2$, whose explicit form is derived in Appendix C.3, yields the estimator with the minimum variance among estimators of the form $\hat R_{\mathrm{RA}}$ as $n_U \to \infty$.

The proof can be found in Appendix C.3. From Theorem 1, we can see that the optimal $\lambda$ is in general nonzero, which means that we can reduce the variance of the empirical estimate by tuning $\lambda$ when a sufficient number of unlabeled data points is available. This situation is natural, since unlabeled data is easier to collect than pairwise comparison data, as discussed in Duh and Kirchhoff [9].
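For concreteness, here is a minimal sketch of what minimizing $\hat R_{\mathrm{RA}}$ could look like for the l2-loss ($\phi(x) = x^2$, so $\phi'(x) = 2x$) and a linear model; the variable names and the use of scipy.optimize are choices made for this illustration, not the implementation used in the experiments.

```python
import numpy as np
from scipy.optimize import minimize

def ra_risk(theta, X_u, X_pos, X_neg, w1, w2, lam=0.0):
    """Empirical RA risk for phi(x) = x^2 and a linear model h(x) = theta @ x.
    The constant C = E[phi(Y)] is dropped since it does not depend on theta."""
    h_u, h_pos, h_neg = X_u @ theta, X_pos @ theta, X_neg @ theta
    unlabeled_term = np.mean(-h_u ** 2 + 2 * lam * h_u)   # phi(h) - (h - lam) phi'(h)
    pos_term = (w1 - lam / 2) * np.mean(2 * h_pos)        # (w1 - lam/2) E[phi'(h(X+))]
    neg_term = (w2 - lam / 2) * np.mean(2 * h_neg)        # (w2 - lam/2) E[phi'(h(X-))]
    return -unlabeled_term - pos_term - neg_term

# If Y is (approximately) uniform on [a, b], Lemma 1 suggests (w1, w2) = (b/2, a/2), e.g.:
# a, b = 0.0, 1.0
# theta0 = np.zeros(X_u.shape[1])
# res = minimize(ra_risk, theta0, args=(X_u, X_pos, X_neg, b / 2, a / 2))
```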
Now, using the notion of the pseudo-dimension [12], we establish an upper bound on the estimation error, which is used to choose the weights $(w_1, w_2)$. Let $\hat h_{\mathrm{RA}}$ and $h^*$ be the minimizers of $\hat R_{\mathrm{RA}}$ and R in the hypothesis class $\mathcal{H}$, respectively. Then, we have the following theorem, which bounds the excess risk in terms of the parameters $(w_1, w_2)$.

Theorem 2. Suppose that the pseudo-dimensions of $\{x \mapsto \phi'(h(x)) \mid h \in \mathcal{H}\}$ and $\{x \mapsto h(x)\phi'(h(x)) - \phi(h(x)) \mid h \in \mathcal{H}\}$ are finite and that there exist constants m, M such that $|h(x)\phi'(h(x)) - \phi(h(x))| \le m$ and $|\phi'(h(x))| \le M$ for all $x \in \mathcal{X}$ and all $h \in \mathcal{H}$. Then,

  $R(\hat h_{\mathrm{RA}}) \le R(h^*) + O\!\left(\sqrt{\frac{\log(1/\delta)}{n_U}} + \sqrt{\frac{\log(1/\delta)}{n_R}}\right) + M\,\mathrm{Err}(w_1, w_2)$

holds with probability at least $1 - \delta$, where Err is defined as

  $\mathrm{Err}(w_1, w_2) = \mathbb{E}_Y\big[\,|Y - 2w_1 F_Y(Y) - 2w_2(1 - F_Y(Y))|\,\big]$.   (8)

The proof can be found in Appendix C.2. Note that the boundedness conditions on $|h(x)\phi'(h(x)) - \phi(h(x))|$ and $|\phi'(h(x))|$ hold for many losses, e.g., the l2-loss, when we consider a hypothesis space of bounded functions. From Theorem 2, we can learn a model with smaller excess risk by minimizing $\mathrm{Err}(w_1, w_2)$, which is easy to do since the density function $f_Y$ is known or can be estimated from $D_Y$. In particular, if the target Y is uniformly distributed on [a, b], we have $\mathrm{Err}(w_1, w_2) = 0$ by setting $(w_1, w_2) = (b/2, a/2)$. In that case, $\hat h_{\mathrm{RA}}$ is consistent, i.e., $R(\hat h_{\mathrm{RA}}) \to R(h^*)$ as $n_U, n_R \to \infty$. The convergence rate is $O(1/\sqrt{n_U} + 1/\sqrt{n_R})$, which is the optimal parametric rate for empirical risk minimization without additional assumptions when unlabeled and pairwise comparison data are provided jointly [21].

One important case where the target variable Y is uniformly distributed is when the target is a quantile value. For instance, suppose we are to build a screening system for credit cards. What we are interested in is "how creditworthy is an applicant relative to the population?", which means that we want to predict the quantile of the applicant's credit score in the marginal distribution. By definition, such a quantile value is uniformly distributed, and thus we can obtain a consistent model by minimizing $\hat R_{\mathrm{RA}}$.

In general, however, we may have $\mathrm{Err}(w_1, w_2) > 0$, in which case $\hat h_{\mathrm{RA}}$ is not consistent. Nevertheless, this is inevitable, as the following theorem suggests.

Theorem 3. There exists a pair of joint distributions $P_{X,Y}$ and $P'_{X,Y}$ that yield the same marginal distributions $P_X$ and $P_Y$ of the feature and the target, and the same distribution $P_{X^+,X^-}$ of the pairwise comparison data, but have different conditional expectations $\mathbb{E}_{Y|X=x}[Y]$.

Theorem 3 states that there exists a pair of distributions that cannot be distinguished from the available data. Considering that $h^*(x) = \mathbb{E}_{Y|X=x}[Y]$ when the hypothesis space $\mathcal{H}$ is rich enough [11], this theorem implies that we cannot always obtain a consistent model. Still, in Section 6, we show that $\hat h_{\mathrm{RA}}$ empirically attains accuracy similar to that of a model learned from ordinary coupled data.
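In practice, the weights can be chosen by minimizing a Monte-Carlo estimate of (8) over $(w_1, w_2)$ using samples of Y (drawn from the known $f_Y$ or taken from $D_Y$). The sketch below illustrates this; the exponential target and the use of a derivative-free optimizer are assumptions made for the illustration, not part of the paper's procedure.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

def err(w, y, F_Y):
    """Monte-Carlo estimate of Err(w1, w2) = E_Y[|Y - 2 w1 F_Y(Y) - 2 w2 (1 - F_Y(Y))|]."""
    w1, w2 = w
    u = F_Y(y)
    return np.mean(np.abs(y - 2 * w1 * u - 2 * w2 * (1 - u)))

# Example with a known (arbitrarily chosen) target distribution.
rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=50_000)
F_Y = stats.expon(scale=1.0).cdf

# Err is convex but non-smooth in (w1, w2), so a derivative-free method is a safe choice.
res = minimize(err, x0=np.array([0.5, 0.0]), args=(y, F_Y), method="Nelder-Mead")
w1_hat, w2_hat = res.x
# For a uniform target on [a, b] the minimum is exactly zero at (w1, w2) = (b/2, a/2).
```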
5 Empirical risk minimization by target transformation

In this section, we introduce another method for uncoupled regression with pairwise comparison data, called the target transformation (TT) method. Whereas the RA method minimizes an approximation of the original risk, the TT method transforms the target variable so that it is marginally uniform and minimizes an unbiased estimator of the risk defined on the transformed variable.

Although there are several ways to map Y to a uniformly distributed random variable, a natural candidate is the CDF $F_Y(Y)$, which leads to the following risk:

  $R_{\mathrm{TT}}(h) = \mathbb{E}_{X,Y}\big[d_\phi(F_Y(Y), F_Y(h(X)))\big]$.   (9)

Since $F_Y(Y)$ is uniform on [0, 1] by definition, we can construct the following unbiased estimator of $R_{\mathrm{TT}}$ by the same argument as in the previous section:

  $\hat R_{\mathrm{TT}}(h; \lambda) = C - \frac{1}{n_U}\sum_{i=1}^{n_U}\Big(\phi(F_Y(h(x_i))) + (\lambda - F_Y(h(x_i)))\phi'(F_Y(h(x_i)))\Big) - \frac{1 - \lambda}{2 n_R}\sum_{i=1}^{n_R}\phi'(F_Y(h(x_i^+))) + \frac{\lambda}{2 n_R}\sum_{i=1}^{n_R}\phi'(F_Y(h(x_i^-)))$,

where $\lambda$ is a hyper-parameter to be tuned. The TT method learns a model by minimizing $\hat R_{\mathrm{TT}}$. However, the learned model is, again, not always consistent in terms of the original risk R. This is because, in a rich enough hypothesis space $\mathcal{H}$, the minimizer $h^*_{\mathrm{TT}}(x) = F_Y^{-1}\big(\mathbb{E}_{Y|X=x}[F_Y(Y)]\big)$ of (9) differs from $\mathbb{E}_{Y|X=x}[Y]$, the minimizer of (1), unless the target Y is uniformly distributed. Hence, for a non-uniform target, we cannot always obtain a consistent model. We can still derive an estimation error bound if $h^*_{\mathrm{TT}} \in \mathcal{H}$ and the target variable Y is generated as

  $Y = h_{\mathrm{true}}(X) + \varepsilon$,   (10)

where $h_{\mathrm{true}} : \mathcal{X} \to \mathcal{Y}$ is the true target function and $\varepsilon$ is a zero-mean noise variable bounded in $[-\sigma, \sigma]$ for some constant $\sigma > 0$.

Theorem 4. Assume that the target variable Y is generated by (10) and $h^*_{\mathrm{TT}} \in \mathcal{H}$. If the pseudo-dimension of $\{x \mapsto \phi'(F_Y(h(x))) \mid h \in \mathcal{H}\}$ is finite and there exist constants P > p > 0 such that $p \le f_Y(y) \le P$ for all $y \in \mathcal{Y}$, then, for $\phi(x) = x^2$ and with probability at least $1 - \delta$, the excess risk $R(\hat h_{\mathrm{TT}}) - R(h_{\mathrm{true}})$ is bounded by a term of order $O(1/\sqrt{n_U} + 1/\sqrt{n_R})$ plus a term that depends on the noise level $\sigma$, where $\hat h_{\mathrm{TT}}$ is the minimizer of $\hat R_{\mathrm{TT}}$ in $\mathcal{H}$.

The proof can be found in Appendix C.5. From Theorem 4, we can see that $\hat h_{\mathrm{TT}}$ is not necessarily consistent; again, this is inevitable for the same reason as in the RA method. The error of the TT method depends explicitly on the noise, so the method is advantageous when the target contains little noise. In Section 6, we empirically compare the two methods and show that which method is more suitable differs from case to case.
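Mirroring the RA sketch above, the following is a rough sketch of minimizing $\hat R_{\mathrm{TT}}$ for $\phi(x) = x^2$ under a logistic-linked hypothesis class of the form $h(x) = F_Y^{-1}(\sigma(\theta^\top x))$ (the class used later in the experiments), in which $F_Y(h(x)) = \sigma(\theta^\top x)$ and the CDF never has to be evaluated during training. It is an illustration under these assumptions, not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tt_risk(theta, X_u, X_pos, X_neg, lam=0.0):
    """Empirical TT risk for phi(x) = x^2 under h(x) = F_Y^{-1}(sigmoid(theta @ x)),
    so that F_Y(h(x)) = sigmoid(theta @ x). The constant C is dropped."""
    q_u = sigmoid(X_u @ theta)       # F_Y(h(x_i)) on unlabeled data
    q_pos = sigmoid(X_pos @ theta)   # F_Y(h(x_i^+))
    q_neg = sigmoid(X_neg @ theta)   # F_Y(h(x_i^-))
    unlabeled_term = np.mean(-q_u ** 2 + 2 * lam * q_u)
    pos_term = (1 - lam) / 2 * np.mean(2 * q_pos)
    neg_term = -lam / 2 * np.mean(2 * q_neg)
    return -unlabeled_term - pos_term - neg_term

# theta0 = np.zeros(X_u.shape[1])
# res = minimize(tt_risk, theta0, args=(X_u, X_pos, X_neg))
# Prediction then maps back through the inverse CDF: h(x) = F_Y_inv(sigmoid(res.x @ x)).
```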
6 Experiments

In this section, we present the empirical performance of the proposed methods in experiments on synthetic data and benchmark data. We show that our proposed methods outperform the naive method described in (4) and an existing method [24]. Moreover, our methods are shown to perform similarly to a model learned by ordinary supervised learning with coupled data.

Before presenting the results, we describe the experimental procedure. In all experiments, we use the l2-loss $l(z, t) = (z - t)^2$, which corresponds to setting $\phi(x) = x^2$ in the Bregman divergence $d_\phi(t, z)$. The performance is evaluated by the mean squared error (MSE) on held-out test data. We repeat each experiment 100 times and report the mean and the standard deviation. We employ the hypothesis space of linear functions $\mathcal{H} = \{h(x) = \theta^\top x \mid \theta \in \mathbb{R}^d\}$ for the RA method. A slightly different hypothesis space $\mathcal{H}' = \{h(x) = F_Y^{-1}(\sigma(\theta^\top x)) \mid \theta \in \mathbb{R}^d\}$ is employed for the TT method in order to simplify the loss, where $\sigma$ is the logistic function $\sigma(x) = 1/(1 + \exp(-x))$. The procedure for hyper-parameter tuning in $\hat R_{\mathrm{RA}}$ and $\hat R_{\mathrm{TT}}$ can be found in Appendix A.

6.1 Comparison with baseline methods

We introduce two baseline methods. One is the naive application of a ranking method described in (4), for which we use SVMRank [13] with the linear kernel. The other is ordinary supervised linear regression (LR), in which we fit a linear model using the true labels of the unlabeled data $D_U$. Note that LR does not use the pairwise comparison data $D_R$.

Result for synthetic data. First, we show the results for synthetic data, for which we know the true marginal $P_Y$. We sample 5-dimensional unlabeled data $D_U$ from the normal distribution $N(0, I_d)$, where $I_d$ is the identity matrix. Then, we sample a true unknown parameter $\theta^*$ with $\|\theta^*\|_2 = 1$ uniformly at random. The target Y is generated as $Y = \theta^{*\top}X + \varepsilon$, where $\varepsilon$ is noise following $N(0, 0.01)$. Consequently, $P_Y$ corresponds to $N(0, 1.01)$, which is used in the proposed methods and the ranking baseline. The pairwise comparison data is generated according to (2): we first sample two features X, X' from $N(0, I_d)$ and then compare them based on the target values Y, Y' computed as $Y = \theta^{*\top}X + \varepsilon$. We fix $n_U$ to 100,000 and vary $n_R$ from 20 to 10,240 to see how the performance changes with the size of the pairwise comparison data.

The results are presented in Figure 1. With sufficient pairwise comparison data, the performance of our methods is significantly better than the SVMRank baseline and close to LR. This is remarkable since LR uses the true labels of $D_U$, while our methods do not. In this experiment, the RA method consistently performs better than the TT method, though this is not universal, as shown in the experiments on benchmark datasets. Note that the TT method is unstable when the size of the pairwise comparison data is small; we observed this phenomenon in all experiments. This is because we learn quantile values when we minimize $\hat R_{\mathrm{TT}}$, and this can be severely inaccurate when the size of the pairwise comparison data is small. On the other hand, $\hat R_{\mathrm{RA}}$ directly minimizes an approximation of the true risk R, which is less sensitive to the size of $D_R$.

Figure 1: MSE for synthetic data (x-axis: size of pairwise comparison data).
Figure 2: MSE for the housing dataset (x-axis: size of pairwise comparison data).

Table 1: MSE for benchmark datasets when $n_R$ is 5,000. The boldface marks the outstanding method among the uncoupled regression methods (SVMRank, RA and TT), chosen by the Welch t-test with significance level 5%. Note that LR does not solve uncoupled regression since it uses the labels in $D_U$.

Dataset      LR (supervised)    SVMRank           RA               TT
housing      24.5 (5.0)         110.3 (29.5)      29.5 (6.9)       22.5 (6.2)
diabetes     3041.9 (219.8)     8575.9 (883.1)    3087.3 (256.3)   3127.3 (278.8)
airfoil      23.3 (2.2)         62.1 (7.6)        23.7 (2.0)       22.7 (2.2)
concrete     109.5 (13.3)       322.9 (45.8)      111.7 (13.2)     139.1 (17.9)
powerplant   20.6 (0.9)         372.2 (34.8)      21.8 (1.1)       22.0 (1.0)
mpg          12.1 (2.04)        125 (15.1)        12.8 (2.16)      10.3 (2.08)
redwine      0.412 (0.0361)     1.28 (0.112)      0.442 (0.0473)   0.466 (0.0412)
whitewine    0.574 (0.0325)     1.58 (0.0691)     0.597 (0.0382)   0.644 (0.0414)
abalone      5.05 (0.375)       20.9 (1.44)       5.26 (0.372)     5.54 (0.424)

Result for benchmark datasets. We also conducted experiments on benchmark datasets, for which the true marginal $P_Y$ is unknown. The details of the benchmark datasets can be found in Appendix A. We use the original features as the unlabeled data $D_U$. The density function $f_Y$ is estimated from the target values in the dataset by kernel density estimation [25] with the Gaussian kernel, where the bandwidth is determined by cross-validation. The pairwise comparison data is constructed by comparing the true target values of two data points uniformly sampled from $D_U$. Figure 2 shows³ the performance of each method with respect to the size of the pairwise comparison data for the housing dataset. The proposed methods significantly outperform SVMRank and approach LR as $n_R$ increases.
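The density-estimation step can be sketched as follows; the paper specifies only Gaussian kernel density estimation with a cross-validated bandwidth, so the scikit-learn based recipe below is one reasonable implementation rather than the one actually used.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

def estimate_f_Y(y_values, bandwidths=np.logspace(-2, 1, 20)):
    """Gaussian kernel density estimate of the target density f_Y,
    with the bandwidth selected by cross-validated log-likelihood."""
    y = np.asarray(y_values).reshape(-1, 1)
    grid = GridSearchCV(KernelDensity(kernel="gaussian"),
                        {"bandwidth": bandwidths}, cv=5)
    grid.fit(y)
    kde = grid.best_estimator_
    # score_samples returns log-densities; exponentiate to obtain f_Y.
    return lambda t: np.exp(kde.score_samples(np.asarray(t).reshape(-1, 1)))
```

The CDF $F_Y$ and its inverse, which the TT method also needs, can similarly be approximated, e.g., from the empirical CDF of $D_Y$.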
The fact that our methods approach LR suggests that the estimation error in $f_Y$ has little impact on the performance. The results for various datasets when $n_R$ is 5,000 are presented in Table 1, in which both proposed methods show promising performance. Note that the method with the lower MSE differs by dataset, which means that we cannot easily judge which method is better.

³ From Figure 2, we can again see that the TT method performs unstably when $n_R$ is small for benchmark data, which causes the strange profile in the log plot. Since the standard deviation of the TT method is large, the mean MSE minus the standard deviation goes negative, which diverges to $-\infty$ in the plot.

6.2 Comparison with other uncoupled regression methods

Here, we show the results of an empirical comparison between our methods and the method proposed in Pananjady et al. [24], which is another uncoupled regression method. Note that Pananjady et al. [24] considered a different problem, since they assume that the true regression function is exactly a linear function of the features and they ignore the comparison data. Hence, we synthetically create data to which all three methods are applicable and conduct the comparison on it.

Figure 3: MSE for synthetic data following the normal distribution.
Figure 4: MSE for 1-dimensional synthetic data following the uniform distribution.

Their method learns the optimal coupling of the unlabeled data $D_U = \{x_i\}_{i=1}^{n_U}$ and the target values $D_Y = \{y_i\}_{i=1}^{n_U}$, assuming a linear relationship between them. Formally, the method finds the parameter $\hat\theta$ and permutation $\hat\sigma : [n_U] \to [n_U]$ that minimize the MSE:

  $(\hat\theta, \hat\sigma) = \mathop{\arg\min}_{\theta \in \mathbb{R}^d,\ \sigma \in \mathcal{P}_{n_U}} \frac{1}{n_U}\sum_{i=1}^{n_U}\big(y_{\sigma(i)} - \theta^\top x_i\big)^2$,

where $\mathcal{P}_n$ is the set of permutations of n items. Pananjady et al. [24] proved that this minimization is NP-hard in general but can be solved with $O(n_U \log n_U)$ computation when d = 1. Hence, we conduct the experiment on synthetic 1-dimensional data, generated as follows. The unlabeled data $D_U = \{x_i\}_{i=1}^{n_U}$ is sampled from a fixed distribution, with the data size fixed to $n_U = 100{,}000$; we use the normal distribution N(0, 1) and the uniform distribution on [-1, 1]. We set $\theta^* = 1$ and generate the target Y as $Y = \theta^* X + \varepsilon$, where $\varepsilon$ is noise following N(0, 0.25). We randomly shuffle the target values $y_i$ to build $D_Y$. The comparison data is constructed in the same way as for the synthetic data in Section 6.1. Note that the method of Pananjady et al. [24] ignores the comparison data, so its performance does not depend on the amount of comparison data.

The results⁴ are shown in Figures 3 and 4, which show the superiority of our methods. This is mainly due to the difference in the data used by each method: the method of Pananjady et al. [24] only uses the unlabeled data and the target values, while our methods utilize the comparison data as well. This additional information contributes greatly to the better performance.

7 Conclusions

In this paper, we proposed novel methods for uncoupled regression that utilize pairwise comparison data. We introduced two methods for the problem: the risk approximation (RA) method and the target transformation (TT) method. The RA method approximates the expected Bregman divergence by a linear combination of expectations over the given data, and the TT method learns a model for quantile values and uses the inverse of the CDF to predict the target.
We derived estimation error bounds for each method and showed that the learned model is consistent when the target variable is uniformly distributed. Furthermore, empirical evaluations on both synthetic data and benchmark datasets demonstrated the competitiveness of our methods. The empirical results also indicated the instability of the TT method when the size of the pairwise comparison data is small; some regularization scheme may be needed to prevent this, which is left for future work.

Acknowledgements

LX utilized the facility provided by the Masason Foundation. JH acknowledges support by KAKENHI 18K17998, and MS was supported by JST CREST Grant Number JPMJCR18A2.

⁴ The standard deviation of the method of Pananjady et al. [24] is large but finite. The mean minus the standard deviation diverges to $-\infty$ in the plot for the same reason as in Figure 2.

References

[1] H. Bao, G. Niu, and M. Sugiyama. Classification from pairwise similarity and unlabeled data. In Proceedings of the 35th International Conference on Machine Learning, 2018.
[2] A. Carpentier and T. Schlueter. Learning relationships between data obtained independently. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 2016.
[3] W. W. Cohen and J. Richman. Learning to match and cluster large high-dimensional data sets for data integration. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.
[4] C. Cortes and M. Mohri. On transductive regression. In Advances in Neural Information Processing Systems 19, 2007.
[5] C. Cortes, M. Mohri, D. Pechyony, and A. Rastogi. Stability of transductive regression algorithms. In Proceedings of the 25th International Conference on Machine Learning, 2008.
[6] M. du Plessis, G. Niu, and M. Sugiyama. Convex formulation for learning from positive and unlabeled data. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
[7] M. C. du Plessis, G. Niu, and M. Sugiyama. Analysis of learning from positive and unlabeled data. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, 2014.
[8] D. Dua and C. Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
[9] K. Duh and K. Kirchhoff. Semi-supervised ranking for document retrieval. Computer Speech and Language, 25(2):261-281, 2011.
[10] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407-499, 2004.
[11] A. Frigyik, S. Srivastava, and M. R. Gupta. Functional Bregman divergence and Bayesian estimation of distributions. IEEE Transactions on Information Theory, 54:5130-5139, 2008.
[12] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78-150, 1992.
[13] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 115-132, Cambridge, MA, 2000. MIT Press.
[14] D. J. Hsu, K. Shi, and X. Sun. Linear regression without correspondence. In Advances in Neural Information Processing Systems 30, pages 1531-1540, 2017.
[15] T. Jebara. Kernelizing sorting, permutation and alignment for minimum volume PCA. In Proceedings of the 17th Annual Conference on Learning Theory, 2004.
[16] A. Jitta and A. Klami. Few-to-few cross-domain object matching. In Proceedings of the 3rd International Workshop on Advanced Methodologies for Bayesian Networks, 2017.
[17] G. Kostopoulos, S. Karlos, S. Kotsiantis, and O. Ragos. Semi-supervised regression: A recent review. Journal of Intelligent and Fuzzy Systems, 35:1-18, 2018.
[18] N. Lu, G. Niu, A. K. Menon, and M. Sugiyama. On the minimal supervision for training any binary classifier from only unlabeled data. In Proceedings of the 7th International Conference on Learning Representations, 2019.
[19] G. S. Mann and A. McCallum. Generalized expectation criteria for semi-supervised learning with weakly labeled data. The Journal of Machine Learning Research, 11:955-984, 2010.
[20] P. Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, 18(3):1269-1283, 1990.
[21] S. Mendelson. Lower bounds for the empirical minimization algorithm. IEEE Transactions on Information Theory, 54:3797-3803, 2008.
[22] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
[23] G. Niu, M. C. du Plessis, T. Sakai, Y. Ma, and M. Sugiyama. Theoretical comparisons of positive-unlabeled learning against positive-negative learning. In Advances in Neural Information Processing Systems 29, 2016.
[24] A. Pananjady, M. J. Wainwright, and T. A. Courtade. Linear regression with shuffled data: Statistical and computational limits of permutation recovery. IEEE Transactions on Information Theory, 64(5):3286-3300, 2018.
[25] E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065-1076, 1962.
[26] N. Quadrianto, L. Song, and A. J. Smola. Kernelized sorting. In Advances in Neural Information Processing Systems 21, 2009.
[27] S. Ray and D. Page. Multiple instance regression. In Proceedings of the 18th International Conference on Machine Learning, 2001.
[28] P. Rigollet and J. Weed. Uncoupled isotonic regression via minimum Wasserstein deconvolution. Information and Inference: A Journal of the IMA, 2019.
[29] C. Rudin, C. Cortes, M. Mohri, and R. E. Schapire. Margin-based ranking meets boosting in the middle. In Proceedings of the 18th Annual Conference on Learning Theory, 2005.
[30] T. Sakai, M. C. du Plessis, G. Niu, and M. Sugiyama. Semi-supervised classification based on classification from positive and unlabeled data. In Proceedings of the 34th International Conference on Machine Learning, 2017.
[31] V. Walter and D. Fritsch. Matching spatial data sets: a statistical approach. International Journal of Geographical Information Science, 13(5):445-473, 1999.
[32] S. L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63-69, 1965.
[33] M. Yamada and M. Sugiyama. Cross-domain object matching with model selection. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011.
[34] Q. Zhang and S. A. Goldman. EM-DD: An improved multiple-instance learning technique. In Advances in Neural Information Processing Systems 15, 2002.
[35] Z. Zhou. A brief introduction to weakly supervised learning. National Science Review, 5(1):44-53, 2018.