# Noise-Tolerant Interactive Learning Using Pairwise Comparisons

Yichong Xu\*, Hongyang Zhang\*, Kyle Miller†, Aarti Singh\*, and Artur Dubrawski†

\*Machine Learning Department, Carnegie Mellon University, USA
†Auton Lab, Carnegie Mellon University, USA
{yichongx, hongyanz, aarti, awd}@cs.cmu.edu, mille856@andrew.cmu.edu

**Abstract.** We study the problem of interactively learning a binary classifier using noisy labeling and pairwise comparison oracles, where the comparison oracle answers which of two given instances is more likely to be positive. Learning from such oracles has multiple applications in which obtaining direct labels is hard but pairwise comparisons are easier, and the algorithm can leverage both types of oracles. In this paper, we attempt to characterize how access to an easier comparison oracle helps improve the label and total query complexity. We show that the comparison oracle reduces the learning problem to that of learning a threshold function. We then present an algorithm that interactively queries the label and comparison oracles, and we characterize its query complexity under Tsybakov and adversarial noise conditions for the comparison and labeling oracles. Our lower bounds show that our label and total query complexities are almost optimal.

## 1 Introduction

Given the high cost of obtaining labels for large datasets, interactive learning is gaining popularity in both the practice and theory of machine learning. On the practical side, there has been increasing interest in designing algorithms capable of engaging domain experts in two-way queries to facilitate more accurate and more effort-efficient learning systems (cf. [26, 31]). On the theoretical side, the study of interactive learning has led to significant advances, such as an exponential improvement of query complexity over passive learning under certain conditions (cf. [5, 6, 7, 15, 19, 27]).
While most of these approaches to interactive learning fix the form of the oracle, e.g., the labeling oracle, and explore the best way of querying it, recent work allows for multiple diverse forms of oracles [12, 13, 16, 33]. The focus of this paper is on this latter setting, also known as active dual supervision [4]. We investigate how to recover a hypothesis h that is a good approximation of the optimal classifier h\*, in terms of the expected 0/1 error Pr_X[h(X) ≠ h\*(X)], given limited access to labels on individual instances X ∈ 𝒳 and to pairwise comparisons about which of two given instances is more likely to belong to the +1/−1 class.

Our study is motivated by important applications in which comparisons are easier to obtain than labels, and the algorithm can leverage both types of oracles to improve label and total query complexity. For example, in material design, synthesizing materials for specific conditions requires expensive experimentation, but with an appropriate algorithm we can leverage the expertise of material scientists, for whom it may be hard to accurately assess the resulting material properties, but who can quickly compare different input conditions and suggest which ones are more promising. Similarly, in clinical settings, precise assessment of each individual patient's health status can be difficult, expensive, and/or risky (e.g., it may require application of invasive sensors or diagnostic surgeries), but comparing the relative statuses of two patients at a time may be relatively easy and accurate. In both scenarios we may have access to a modest amount of individually labeled data, but the bulk of more accessible training information is available via pairwise comparisons.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: Explanation of the work flow of ADGAC-based algorithms.
Left: the procedure of typical active learning algorithms. Right: the procedure of our proposed ADGAC-based interactive learning algorithm, which has access to both pairwise comparison and labeling oracles.

Table 1: Comparison of various methods for learning a generic hypothesis class (omitting log(1/ε) factors).

| Label Noise | Work | # Label | # Query | Tolcomp |
|---|---|---|---|---|
| Tsybakov (κ) | [18] | Õ((1/ε)^{2κ−2} dθ) | Õ((1/ε)^{2κ−2} dθ) | N/A |
| Tsybakov (κ) | Ours | Õ((1/ε)^{2κ−2}) | Õ((1/ε)^{2κ−2} + dθ) | O(ε^{2κ}) |
| Adversarial (ν = O(ε)) | [19] | Õ(dθ) | Õ(dθ) | N/A |
| Adversarial (ν = O(ε)) | Ours | Õ(1) | Õ(dθ) | O(ε²) |

There are many other examples where humans find it easier to perform pairwise comparisons than to provide direct labels, including content search [17], image retrieval [31], ranking [21], etc.

Despite many successful applications of comparison oracles, many fundamental questions remain. One of them is how to design noise-tolerant, cost-efficient algorithms that can approximate the unknown target hypothesis to arbitrary accuracy while having access to pairwise comparisons. On the one hand, while there is theoretical analysis of pairwise comparisons concerning the tasks of learning to rank [3, 22], estimating ordinal measurement models [28], and learning combinatorial functions [11], much remains unknown about how to extend these results to more generic hypothesis classes. On the other hand, although we have seen great progress on using single or multiple oracles with the same form of interaction [9, 16], classification using both comparison and labeling queries remains an interesting open problem. Independently of our work, Kane et al. [23] concurrently analyzed a similar setting of learning to classify using both label and comparison queries. However, their algorithms work only in the noise-free setting.

**Our Contributions:** Our work addresses the aforementioned issues by presenting a new algorithm, Active Data Generation with Adversarial Comparisons (ADGAC), which learns a classifier with both noisy labeling and noisy comparison oracles.
We analyze ADGAC under the Tsybakov (TNC) [30] and adversarial noise conditions for the labeling oracle, along with an adversarial noise condition for the comparison oracle. Our general framework can augment any active learning algorithm by replacing the batch sampling in these algorithms with ADGAC. Figure 1 presents the work flow of our framework.

- We propose the A²-ADGAC algorithm, which can learn an arbitrary hypothesis class. The label complexity of the algorithm is as small as that of learning a threshold function under both the TNC and adversarial noise conditions, independently of the structure of the hypothesis class. The total query complexity improves over the previous best-known results under TNC, which can only access the labeling oracle.
- We derive Margin-ADGAC to learn the class of halfspaces. This algorithm has the same label and total query complexity as A²-ADGAC, but is computationally efficient.
- We present lower bounds on the total query complexity for any algorithm that can access both labeling and comparison oracles, and a noise-tolerance lower bound for our algorithms. These lower bounds demonstrate that our analysis is nearly optimal.

An important quantity governing the performance of our algorithms is the adversarial noise level of comparisons: denote by Tolcomp(ε, δ, A) the adversarial noise tolerance level of comparisons that guarantees an algorithm A to achieve an error of ε with probability at least 1 − δ. Table 1 compares our results with previous work in terms of label complexity, total query complexity, and Tolcomp for a generic hypothesis class C with error ε. We see that our results significantly improve over prior work thanks to the extra comparison oracle.

Table 2: Comparison of various methods for learning halfspaces (omitting log(1/ε) factors).
| Label Noise | Work | # Label | # Query | Tolcomp | Efficient? |
|---|---|---|---|---|---|
| Massart | [8] | Õ(d) | Õ(d) | N/A | No |
| Massart | [5] | poly(d) | poly(d) | N/A | Yes |
| Massart | Ours | Õ(1) | Õ(d) | O(ε²) | Yes |
| Tsybakov (κ) | [19] | Õ((1/ε)^{2κ−2} dθ) | Õ((1/ε)^{2κ−2} dθ) | N/A | No |
| Tsybakov (κ) | Ours | Õ((1/ε)^{2κ−2}) | Õ((1/ε)^{2κ−2} + d) | O(ε^{2κ}) | Yes |
| Adversarial (ν = O(ε)) | [34] | Õ(d) | Õ(d) | N/A | No |
| Adversarial (ν = O(ε)) | [6] | Õ(d²) | Õ(d²) | N/A | Yes |
| Adversarial (ν = O(ε)) | Ours | Õ(1) | Õ(d) | O(ε²) | Yes |

Denote by d the VC dimension of C and by θ the disagreement coefficient. Table 2 compares the corresponding results for learning halfspaces under isotropic log-concave distributions. In both cases, our algorithms enjoy a small label complexity that is independent of θ and d; this is helpful when labels are very expensive to obtain. Our algorithms also enjoy better total query complexity under both the TNC and adversarial noise conditions for efficiently learning halfspaces.

## 2 Preliminaries

**Notations:** We study the problem of learning a classifier h : 𝒳 → 𝒴 = {−1, 1}, where 𝒳 and 𝒴 are the instance space and label space, respectively. Denote by P_{XY} the distribution over 𝒳 × 𝒴 and let P_X be the marginal distribution over 𝒳. A hypothesis class C is a set of functions h : 𝒳 → 𝒴. For any function h, define the error of h under a distribution D over 𝒳 × 𝒴 as err_D(h) = Pr_{(X,Y)∼D}[h(X) ≠ Y], and let err(h) = err_{P_{XY}}(h). Suppose that h\* ∈ C satisfies err(h\*) = inf_{h∈C} err(h); for simplicity, we assume that such an h\* exists in C. We apply the concept of the disagreement coefficient from Hanneke [18] for a generic hypothesis class. In particular, for any set V ⊆ C, denote by DIS(V) = {x ∈ 𝒳 : ∃h₁, h₂ ∈ V, h₁(x) ≠ h₂(x)} the disagreement region of V. The disagreement coefficient is defined as θ = sup_{r>0} Pr[DIS(B(h\*, r))]/r, where B(h\*, r) = {h ∈ C : Pr_{X∼P_X}[h(X) ≠ h\*(X)] ≤ r}.

**Problem Setup:** We analyze two kinds of noise conditions for the labeling oracle, namely, the adversarial noise condition and the Tsybakov noise condition (TNC). We formally define them as follows.

**Condition 1** (Adversarial Noise Condition for Labeling Oracle).
The distribution P_{XY} satisfies the adversarial noise condition for the labeling oracle with parameter ν ≥ 0 if ν = Pr_{(X,Y)∼P_{XY}}[Y ≠ h\*(X)].

**Condition 2** (Tsybakov Noise Condition for Labeling Oracle). The distribution P_{XY} satisfies the Tsybakov noise condition for the labeling oracle with parameters κ ≥ 1, µ > 0 if for all h : 𝒳 → {−1, 1}, err(h) − err(h\*) ≥ µ · Pr_{X∼P_X}[h(X) ≠ h\*(X)]^κ. Also, h\* is the Bayes optimal classifier, i.e., h\*(x) = sign(η(x) − 1/2),¹ where η(x) = Pr[Y = 1 | X = x]. The special case of κ = 1 is also called the Massart noise condition.

In the classic active learning scenario, the algorithm has access to an unlabeled pool drawn from P_X. The algorithm can then query the labeling oracle for any instance from the pool. The goal is to find an h ∈ C such that the error Pr[h(X) ≠ h\*(X)] ≤ ε.² The labeling oracle takes an input x ∈ 𝒳 and outputs y ∈ {−1, 1} according to P_{XY}. In our setting, however, an extra comparison oracle is available. This oracle takes as input a pair of instances (x, x′) ∈ 𝒳 × 𝒳 and returns a variable Z(x, x′) ∈ {−1, 1}, where Z(x, x′) = 1 indicates that x is more likely to be positive, and Z(x, x′) = −1 indicates the opposite. In this paper, we discuss an adversarial noise condition for the comparison oracle; we discuss dealing with TNC on the comparison oracle in the appendix.

**Condition 3** (Adversarial Noise Condition for Comparison Oracle). The distribution P_{XX′Z} satisfies adversarial noise with parameter ν′ ≥ 0 if ν′ = Pr[Z(X, X′)(h\*(X) − h\*(X′)) < 0].

¹The assumption that h\* is the Bayes optimal classifier can be relaxed if the approximation error of h\* can be quantified under assumptions on the decision boundary (cf. [15]).

²Note that we use the disagreement Pr[h(X) ≠ h\*(X)] instead of the excess error err(h) − err(h\*) used in some of the other literature. The two quantities can be linked by assuming a two-sided version of Tsybakov noise (see, e.g., Audibert 2004).

Table 3: Summary of notations.
| Notation | Meaning | Notation | Meaning |
|---|---|---|---|
| C | Hypothesis class | κ | Tsybakov noise level (labeling) |
| X, 𝒳 | Instance & instance space | ν | Adversarial noise level (labeling) |
| Y, 𝒴 | Label & label space | ν′ | Adversarial noise level (comparison) |
| Z, 𝒵 | Comparison & comparison space | err_D(h) | Error of h on distribution D |
| d | VC dimension of C | SClabel | Label complexity |
| θ | Disagreement coefficient | SCcomp | Comparison complexity |
| h\* | Optimal classifier in C | Tollabel | Noise tolerance (labeling) |
| g\* | Optimal scoring function | Tolcomp | Noise tolerance (comparison) |

Note that we do not make any assumptions on the randomness of Z: Z(X, X′) can be either random or deterministic as long as the joint distribution P_{XX′Z} satisfies Condition 3. For an interactive learning algorithm A, given error ε and failure probability δ, let SCcomp(ε, δ, A) and SClabel(ε, δ, A) be the comparison and label complexity, respectively. The query complexity of A is defined as the sum of its label and comparison complexity. Similarly to the definition of Tolcomp(ε, δ, A), define Tollabel(ε, δ, A) as the maximum ν such that algorithm A achieves an error of at most ε with probability 1 − δ. In summary, A learns an h such that Pr[h(X) ≠ h\*(X)] ≤ ε with probability 1 − δ using SCcomp(ε, δ, A) comparisons and SClabel(ε, δ, A) labels, provided ν ≤ Tollabel(ε, δ, A) and ν′ ≤ Tolcomp(ε, δ, A). We omit the parameters of SCcomp, SClabel, Tolcomp, and Tollabel when they are clear from context. We use O(·) to express sample complexity and noise tolerance, and Õ(·) to additionally hide log(·) factors. Table 3 summarizes the main notation used throughout the paper.

## 3 Active Data Generation with Adversarial Comparisons (ADGAC)

The hardness of learning from pairwise comparisons stems from the error of the comparison oracle: the comparisons are noisy, and can be asymmetric and intransitive, meaning that a human might give contradictory preferences such as x₁ ≻ x₂ ≻ x₁ or x₁ ≻ x₂ ≻ x₃ ≻ x₁ (here ≻ is some preference relation).
This makes traditional methods, e.g., defining a function class {h : h(x) = Z(x, x̂), x̂ ∈ 𝒳}, fail, because such a class may have infinite VC dimension. In this section, we propose a novel algorithm, ADGAC, to address this issue. Having access to both comparison and labeling oracles, ADGAC generates a labeled dataset using techniques inspired by group-based binary search. We show that ADGAC can be combined with any active learning procedure to obtain interactive algorithms that can utilize both labeling and comparison oracles, and we provide theoretical guarantees for ADGAC.

### 3.1 Algorithm Description

To illustrate ADGAC, we start with a general active learning framework in Algorithm 1. Many active learning algorithms can be adapted to this framework, such as A² [7] and margin-based active learning algorithms [5, 6]. Here U represents the querying space/disagreement region of the algorithm (i.e., we reject an instance x if x ∉ U), and V represents a version space consisting of potential classifiers. For example, the A² algorithm can be adapted to Algorithm 1 straightforwardly by keeping U as the sample space and V as the version space. More concretely, the A² algorithm [7] for adversarial noise can be characterized by U₀ = 𝒳, V₀ = C, f_V(U, V, W, i) = {h ∈ V : |W| · err_W(h) ≤ n_i ε_i}, and f_U(U, V, W, i) = DIS(V), where ε_i and n_i are parameters of the A² algorithm and DIS(V) = {x ∈ 𝒳 : ∃h₁, h₂ ∈ V, h₁(x) ≠ h₂(x)} is the disagreement region of V. Margin-based active learning [6] can also be fitted into Algorithm 1 by taking V as the halfspace that (approximately) minimizes the hinge loss and U as the region within the margin of that halfspace.

To apply the comparison oracle efficiently, we propose to replace step 4 in Algorithm 1 with a subroutine, ADGAC, that has access to both comparison and labeling oracles. Subroutine 2 describes ADGAC. It takes as input a dataset S and a sampling number k. ADGAC first runs Quicksort on S using feedback from the comparison oracle, which is of the form Z(x, x′).
Given that the comparison oracle Z(·, ·) might be asymmetric w.r.t. its two arguments, i.e., Z(x, x′) may not equal Z(x′, x), for each pair (x_i, x_j) we randomly choose (x_i, x_j) or (x_j, x_i) as the input to Z(·, ·). After Quicksort, the algorithm divides the data into multiple groups of size αm = Θ(ε|S̃|) and does group-based binary search: it samples k labels from each probed group and determines the label of the group by majority vote.

**Algorithm 1** Active Learning Framework

Input: ε, δ, a sequence of n_i, functions f_U, f_V.
1: Initialize U ← U₀ ← 𝒳, V ← V₀ ← C.
2: for i = 1, 2, ..., ⌈log(1/ε)⌉ do
3: &nbsp;&nbsp; Sample an unlabeled dataset S̃ of size n_i. Let S ← {x : x ∈ S̃, x ∈ U}.
4: &nbsp;&nbsp; Request the labels of x ∈ S and obtain W ← {(x_i, y_i) : x_i ∈ S}.
5: &nbsp;&nbsp; Update V ← f_V(U, V, W, i), U ← f_U(U, V, W, i).
Output: Any classifier ĥ ∈ V.

**Subroutine 2** Active Data Generation with Adversarial Comparison (ADGAC)

Input: Dataset S with |S| = m, and n, ε, k.
1: α ← εn/(2m).
2: Define the preference relation on S according to Z. Run Quicksort on S to rank its elements in increasing order, obtaining a sorted list S = (x₁, x₂, ..., x_m).
3: Divide S into groups of size αm: S_i = {x_{(i−1)αm+1}, ..., x_{iαm}}, i = 1, 2, ..., ⌈1/α⌉.
4: t_min ← 1, t_max ← ⌈1/α⌉.
5: while t_min < t_max do &nbsp;&nbsp; ⊳ binary search
6: &nbsp;&nbsp; t ← ⌊(t_min + t_max)/2⌋.
7: &nbsp;&nbsp; Sample k points uniformly without replacement from S_t and obtain their labels Y = {y₁, ..., y_k}.
8: &nbsp;&nbsp; If Σ_{i=1}^{k} y_i ≥ 0, then t_max ← t; else t_min ← t + 1.
9: For t′ > t and x_i ∈ S_{t′}, let ŷ_i ← 1.
10: For t′ < t and x_i ∈ S_{t′}, let ŷ_i ← −1.
11: For x_i ∈ S_t, let ŷ_i be the majority label of the points sampled from S_t.
Output: Predicted labels ŷ₁, ŷ₂, ..., ŷ_m.

For an active learning algorithm A, let A-ADGAC be the algorithm obtained by replacing step 4 of Algorithm 1 with ADGAC using parameters (S, n_i, ε_i, k_i), where ε_i and k_i are additional parameters of the algorithm. We establish results for two specific choices of A, namely A² and margin-based active learning, in Sections 4 and 5, respectively.
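As a concrete illustration, the following is a minimal Python sketch of the ADGAC idea under simplifying assumptions (our own illustration, not the authors' implementation): Python's built-in sort stands in for Quicksort, `compare` and `label` stand for the comparison and labeling oracles, and the boundary group is labeled by a fresh majority vote.

```python
import random
from functools import cmp_to_key

def adgac(S, n, eps, k, compare, label):
    """Sketch of ADGAC: sort S with the (possibly asymmetric) comparison
    oracle, split the sorted list into groups of size ~eps*n/2, then
    binary-search for the group where labels switch from -1 to +1."""
    m = len(S)

    def cmp(x, xp):
        # Query in a random argument order, since Z may be asymmetric.
        return compare(x, xp) if random.random() < 0.5 else -compare(xp, x)

    S = sorted(S, key=cmp_to_key(cmp))          # stand-in for Quicksort
    group = max(1, int(eps * n / 2))            # group size alpha*m
    groups = [S[i:i + group] for i in range(0, m, group)]

    lo, hi = 0, len(groups) - 1
    while lo < hi:                               # binary search over groups
        t = (lo + hi) // 2
        sample = random.sample(groups[t], min(k, len(groups[t])))
        if sum(label(x) for x in sample) >= 0:   # sampled majority is +1
            hi = t
        else:
            lo = t + 1

    yhat = {}
    for j, g in enumerate(groups):
        if j < lo:
            for x in g:
                yhat[x] = -1
        elif j > lo:
            for x in g:
                yhat[x] = 1
        else:                                    # boundary group: majority vote
            sample = random.sample(g, min(k, len(g)))
            v = 1 if sum(label(x) for x in sample) >= 0 else -1
            for x in g:
                yhat[x] = v
    return yhat
```

With noiseless oracles induced by a threshold function, this sketch recovers every label while spending only O(k log(1/ε)) label queries; Theorems 4 and 5 below bound the error of the full subroutine under the noise conditions above.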
### 3.2 Theoretical Analysis of ADGAC

Before combining ADGAC with active learning algorithms, we provide theoretical results for ADGAC itself. By the algorithmic procedure, ADGAC reduces the problem of labeling the whole dataset S to that of binary-searching a threshold on the sorted list S. One can show that there cannot be too many conflicting instances within each group S_i, and thus binary search performs well in our algorithm. We also use results from [3] to bound the error of Quicksort. These arguments yield the following result.

**Theorem 4.** Suppose that Conditions 2 and 3 hold for κ ≥ 1, ν′ ≥ 0, and n = Ω((1/ε)^{2κ−1} log(1/δ)). Assume a set S̃ with |S̃| = n is sampled i.i.d. from P_X and S ⊆ S̃ is an arbitrary subset of S̃ with |S| = m. There exist absolute constants C₁, C₂, C₃ such that if we run Subroutine 2 with ε < C₁, ν′ ≤ C₂ε^{2κ}δ, and k = k⁽¹⁾(ε, δ) := C₃ log(log(1/ε)/δ) · (1/ε)^{2κ−2}, it will output a labeling of S such that |{x_i ∈ S : ŷ_i ≠ h\*(x_i)}| ≤ εn with probability at least 1 − δ. The expected number of comparisons required is O(m log m), and the number of sample-label pairs required is SClabel(ε, δ) = O(log(m/(εn)) · log(1/δ) · (1/ε)^{2κ−2}).

Similarly, we analyze ADGAC under the adversarial noise condition w.r.t. the labeling oracle with ν = O(ε).

**Theorem 5.** Suppose that Conditions 1 and 3 hold for ν, ν′ ≥ 0, and n = Ω((1/ε) log(1/δ)). Assume a set S̃ with |S̃| = n is sampled i.i.d. from P_X and S ⊆ S̃ is an arbitrary subset of S̃ with |S| = m. There exist absolute constants C₁, C₂, C₃, C₄ such that if we run Subroutine 2 with ε < C₁, ν′ ≤ C₂ε²δ, k = k⁽²⁾(ε, δ) := C₃ log(log(1/ε)/δ), and ν ≤ C₄ε, it will output a labeling of S such that |{x_i ∈ S : ŷ_i ≠ h\*(x_i)}| ≤ εn with probability at least 1 − δ. The expected number of comparisons required is O(m log m), and the number of sample-label pairs required is SClabel(ε, δ) = O(log(m/(εn)) · log(log(1/ε)/δ)).

*Proof Sketch.* We call a pair (x_i, x_j) an inverse pair if Z(x_i, x_j) = 1, h\*(x_i) = −1, and h\*(x_j) = 1, and an anti-sort pair if h\*(x_i) = 1, h\*(x_j) = −1, and i < j.
We show that the expected number of inverse pairs is at most n(n−1)ν′. By the results in [3], the numbers of inverse pairs and anti-sort pairs have the same expectation, and the actual number of anti-sort pairs can be bounded by Markov's inequality. We then show that the majority label of the groups must be −1 from the beginning of the list up to some point, after which it changes to 1. With a careful choice of k, we obtain the true majority of each probed group from its k sampled labels under Tsybakov noise, so the binary search ends at the turning point of the list. The error is then bounded by the size of the groups. See the appendix for the complete proof.

Theorems 4 and 5 show that ADGAC labels a dataset with arbitrarily small error using a label complexity that is independent of the data size. Moreover, ADGAC is computationally efficient, since it only involves sorting and binary search. These properties of ADGAC lead to improved query complexity when we combine ADGAC with other active learning algorithms.

## 4 A²-ADGAC: Learning of a Generic Hypothesis Class

In this section, we combine ADGAC with the A² algorithm to learn a generic hypothesis class. We use the framework in Algorithm 1: let A²-ADGAC be the algorithm that replaces step 4 of Algorithm 1 with ADGAC with parameters (S, n_i, ε_i, k_i), where n_i, ε_i, k_i are parameters to be specified later. Under TNC, we have the following result.

**Theorem 6.** Suppose that Conditions 2 and 3 hold, and h\*(x) = sign(η(x) − 1/2). There exist global constants C₁, C₂ such that if we run A²-ADGAC with ε < C₁, δ, ν′ ≤ Tolcomp(ε, δ) = C₂ε^{2κ}δ, ε_i = 2^{−(i+2)}, n_i = Ω((1/ε_i) · d log(1/ε) + (1/ε_i)^{2κ−1} log(1/δ)), and k_i = k⁽¹⁾(ε_i, δ/(4 log(1/ε))) with k⁽¹⁾ as specified in Theorem 4, then with probability at least 1 − δ the algorithm returns a classifier ĥ with Pr[ĥ(X) ≠ h\*(X)] ≤ ε, with comparison and label complexity

E[SCcomp] = Õ(log²(1/ε) · (dθ + (1/ε)^{2κ−2}) · log(1/δ)), SClabel = Õ(log(1/ε) · log(1/δ) · (1/ε)^{2κ−2}).

The dependence on log²(1/ε) in SCcomp can be reduced to log(1/ε) under Massart noise. We can prove a similar result for the adversarial noise condition.
**Theorem 7.** Suppose that Conditions 1 and 3 hold. There exist global constants C₁, C₂, C₃ such that if we run A²-ADGAC with ε < C₁, δ, ν′ ≤ Tolcomp(ε, δ) = C₂ε²δ, ν ≤ Tollabel(ε, δ) = C₃ε, ε_i = 2^{−(i+2)}, n_i = Ω((1/ε_i) · d log(1/ε_i) · log(1/δ)), and k_i = k⁽²⁾(ε_i, δ/(4 log(1/ε))) with k⁽²⁾ as specified in Theorem 5, then with probability at least 1 − δ the algorithm returns a classifier ĥ with Pr[ĥ(X) ≠ h\*(X)] ≤ ε, with comparison and label complexity

E[SCcomp] = Õ(θd log(θd) log(1/ε)), SClabel = Õ(log(1/ε) log(1/δ)).

The proofs of Theorems 6 and 7 combine Theorems 4 and 5 with standard manipulations in VC theory. Theorems 6 and 7 show that access to even a biased comparison oracle can reduce the problem of learning a classifier in a high-dimensional space to that of learning a threshold classifier in one dimension, as the label complexity matches that of actively learning a threshold classifier. Given that comparisons are usually easier to obtain, A²-ADGAC can save substantially in practice due to its small label complexity. More importantly, we improve the total query complexity under TNC by separating the dependence on d and ε: the query complexity is now the sum of the two terms instead of their product. This observation shows the power of pairwise comparisons for learning classifiers. Such small label/query complexity is impossible without access to a comparison oracle, since the query complexity with only a labeling oracle is at least Ω(d(1/ε)^{2κ−2}) and Ω(d log(1/ε)) under the TNC and adversarial noise conditions, respectively [19]. Our results also match the lower bounds for learning with labeling and comparison oracles up to log factors (see Section 6). We note that Theorems 6 and 7 require a rather small Tolcomp, equal to O(ε^{2κ}δ) and O(ε²δ), respectively. We show in Section 6.3 that requiring Tolcomp = O(ε²) is necessary to obtain a classifier of error ε if we restrict the use of the labeling oracle to learning a threshold function only.
Such a restriction suffices to reach the near-optimal label complexity specified in Theorems 6 and 7.

## 5 Margin-ADGAC: Learning of Halfspaces

In this section, we combine ADGAC with margin-based active learning [6] to efficiently learn the class of halfspaces. Before proceeding, we first mention a naive idea for utilizing comparisons: sample pairs (x₁, x₂) i.i.d. from P_X × P_X and use Z(x₁, x₂) as the label of x₁ − x₂. However, this method cannot work well in our setting without additional assumptions on the noise condition of the labeling Z(x₁, x₂).

We assume that P_X is isotropic log-concave on ℝ^d, i.e., P_X has mean 0 and covariance I, and the logarithm of its density is a concave function [5, 6]. The hypothesis class of halfspaces can be represented as C = {h : h(x) = sign(w · x), w ∈ ℝ^d}. Write h\*(x) = sign(w\* · x) for some w\* ∈ ℝ^d. Define the hinge loss l_τ(w, x, y) = max(1 − y(w · x)/τ, 0) and its empirical average l_τ(w, W) = (1/|W|) Σ_{(x,y)∈W} l_τ(w, x, y). The expected hinge loss of w is L_τ(w, D) = E_{x∼D}[l_τ(w, x, sign(w\* · x))].

Margin-based active learning [6] is a concrete instance of Algorithm 1 obtained by taking V as (a singleton set containing) the hinge-loss minimizer and U as the margin region around that minimizer. More concretely, take U₀ = 𝒳 and V₀ = {w₀} for some w₀ such that θ(w₀, w\*) ≤ π/2. The algorithm works with constants M ≥ 2 and κ < 1/2 (here κ is an absolute constant, not the TNC exponent), and a set of parameters r_i, τ_i, b_i, z_i that are all Θ(M^{−i}) (see the proof in the appendix for the formal definitions of these parameters). V always contains a single hypothesis. Suppose V = {w_{i−1}} in iteration i − 1, and let v_i satisfy l_{τ_i}(v_i, W) ≤ min_{v : ‖v − w_{i−1}‖₂ ≤ r_i, ‖v‖₂ ≤ 1} l_{τ_i}(v, W) + κ/8, where w_i denotes the content of V in iteration i. We then set f_V(U, V, W, i) = {w_i} = {v_i/‖v_i‖₂} and f_U(U, V, W, i) = {x : |w_i · x| ≤ b_i}.
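The per-iteration hinge-loss step can be sketched as follows. This is an illustrative sketch, not the authors' implementation: projected subgradient descent stands in for the constrained hinge-loss minimization, and the function name `margin_update`, the step size, and the iteration count are our own choices.

```python
import numpy as np

def margin_update(w, X, y, tau, r, steps=500, lr=0.01):
    """One hinge-loss step of margin-based active learning (sketch):
    approximately minimize the tau-scaled hinge loss over
    {v : ||v - w|| <= r, ||v|| <= 1} by projected subgradient descent,
    then normalize the result."""
    v = w.copy()
    for _ in range(steps):
        margins = y * (X @ v) / tau
        active = margins < 1                     # points with nonzero hinge loss
        grad = -(y[active, None] * X[active]).sum(axis=0) / (tau * len(X))
        v = v - lr * grad
        d = v - w                                # project onto the ball around w,
        if np.linalg.norm(d) > r:                # then onto the unit ball
            v = w + r * d / np.linalg.norm(d)
        if np.linalg.norm(v) > 1:
            v = v / np.linalg.norm(v)
    return v / np.linalg.norm(v)
```

On separable data drawn around a target direction w\*, successive calls with shrinking τ_i, r_i, b_i rotate the iterate toward w\*; the full algorithm additionally restricts the sample to the margin region {x : |w_i · x| ≤ b_i}.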
Let Margin-ADGAC be the algorithm obtained by replacing the sampling step of margin-based active learning with ADGAC with parameters (S, n_i, ε_i, k_i), where n_i, ε_i, k_i are additional parameters to be specified later. We have the following results under the TNC and adversarial noise conditions, respectively.

**Theorem 8.** Suppose that Conditions 2 and 3 hold, and h\*(x) = sign(w\* · x) = sign(η(x) − 1/2). There are settings of M, κ, r_i, τ_i, b_i, ε_i, k_i, and constants C₁, C₂ such that for all ε ≤ C₁ and ν′ ≤ Tolcomp(ε, δ) = C₂ε^{2κ}δ, if we run Margin-ADGAC with w₀ such that θ(w₀, w\*) ≤ π/2 and n_i = Õ((1/ε_i) · d log³(d/δ) + (1/ε_i)^{2κ−1} log(1/δ)), it finds ŵ such that Pr[sign(ŵ · X) ≠ sign(w\* · X)] ≤ ε with probability at least 1 − δ. The comparison and label complexity are

E[SCcomp] = Õ(log²(1/ε) · (d log⁴(d/δ) + (1/ε)^{2κ−2} log(1/δ))), SClabel = Õ(log(1/ε) · log(1/δ) · (1/ε)^{2κ−2}).

The dependence on log²(1/ε) in SCcomp can be reduced to log(1/ε) under Massart noise.

**Theorem 9.** Suppose that Conditions 1 and 3 hold. There are settings of M, κ, r_i, τ_i, b_i, ε_i, k_i, and constants C₁, C₂, C₃ such that for all ε ≤ C₁, ν′ ≤ Tolcomp(ε, δ) = C₂ε²δ, and ν ≤ Tollabel(ε, δ) = C₃ε, if we run Margin-ADGAC with n_i = Õ((1/ε_i) · d log³(d/δ)) and w₀ such that θ(w₀, w\*) ≤ π/2, it finds ŵ such that Pr[sign(ŵ · X) ≠ sign(w\* · X)] ≤ ε with probability at least 1 − δ. The comparison and label complexity are

E[SCcomp] = Õ(log(1/ε) · d log⁴(d/δ)), SClabel = Õ(log(1/ε) log(1/δ)).

The proofs of Theorems 8 and 9 differ from the conventional analysis of margin-based active learning in two respects: (a) since the labels generated by ADGAC are not independently sampled from the distribution P_{XY}, we require new techniques that can deal with adaptive noise; (b) we improve the results of [6] in their dependence on d by a new Rademacher analysis. Theorems 8 and 9 enjoy better label and query complexity than previous results (see Table 2).
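To make the interplay of the two oracles concrete, the following toy sketch (our own illustration, not the paper's experiment) infers labels for a 2-D halfspace from comparison-sorted data with a handful of noisy label queries. We sort by the score w\*·x directly, which is what sorting with the induced comparison oracle Z(x, x′) = sign(w\*·x − w\*·x′) achieves, and we simulate Massart-style labels that flip with probability 0.1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy halfspace in 2-D: w* = (1, 0); labels flip with probability 0.1.
wstar = np.array([1.0, 0.0])
X = rng.standard_normal((400, 2))
true_y = np.where(X @ wstar >= 0, 1, -1)

label_queries = 0
def noisy_label(i):
    global label_queries
    label_queries += 1
    return -true_y[i] if rng.random() < 0.1 else true_y[i]

# Sorting by the comparison oracle Z(x, x') = sign(w*.x - w*.x') is
# equivalent to sorting by the score w*.x, which we do directly here.
order = np.argsort(X @ wstar)

# ADGAC-style group binary search for the -1/+1 boundary.
k, group = 15, 20
groups = [order[j:j + group] for j in range(0, len(order), group)]
lo, hi = 0, len(groups) - 1
while lo < hi:
    t = (lo + hi) // 2
    votes = sum(noisy_label(i) for i in rng.choice(groups[t], k, replace=False))
    if votes >= 0:
        hi = t          # sampled majority is +1: boundary is at or before t
    else:
        lo = t + 1

yhat = np.empty(len(X), dtype=int)
for j, g in enumerate(groups):
    yhat[g] = -1 if j < lo else 1

err = np.mean(yhat != true_y)
```

Only k labels per binary-search step are spent, i.e., O(k log(1/ε)) label queries in total; the remaining structure of the labeling comes entirely from comparisons.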
We mention that while Yan and Zhang [32] proposed a perceptron-like algorithm with label complexity as small as Õ(d log(1/ε)) under the Massart and adversarial noise conditions, their algorithm works only for uniform distributions over the instance space. In contrast, our algorithm Margin-ADGAC works for broad log-concave distributions. The label and total query complexity of Margin-ADGAC improve over those of traditional active learning, and the lower bounds in Section 6 show the optimality of our complexities.

## 6 Lower Bounds

In this section, we give lower bounds for learning with labeling and pairwise comparisons. In Section 6.1, we give a lower bound on the optimal label complexity SClabel. In Section 6.2 we use this result to give a lower bound on the total query complexity, i.e., the sum of the comparison and label complexity. Our two methods match these lower bounds up to log factors. In Section 6.3, we additionally give an information-theoretic bound on Tolcomp, which matches our algorithms in the cases of Massart and adversarial noise.

Following [19, 20], we assume that there is an underlying score function g\* such that h\*(x) = sign(g\*(x)). Note that g\* need not have any relation to η(x); we only require that g\*(x) represents how likely a given x is to be positive. For instance, in digit recognition, g\*(x) could represent how much an image looks like a 7 (or a 9); in the clinical setting, g\*(x) measures the health condition of a patient. Suppose that the distribution of g\*(X) is continuous, i.e., its probability density function exists and for every t ∈ ℝ, Pr[g\*(X) = t] = 0.

### 6.1 Lower Bound on Label Complexity

The definition of g\* naturally induces a comparison oracle Z with Z(x, x′) = sign(g\*(x) − g\*(x′)). We note that this oracle is invariant to shifts of g\*, i.e., g\* and g\* + t lead to the same comparison oracle. As a result, we cannot distinguish g\* from g\* + t without labels.
In other words, pairwise comparisons do not help improve the label complexity when we are learning a threshold function on ℝ, where all instances are already in their natural order. The label complexity of any algorithm is therefore lower-bounded by that of learning a threshold classifier, which we formally prove in the following theorem.

**Theorem 10.** For any algorithm A that can access both labeling and comparison oracles, sufficiently small ε, δ, and any score function g that takes at least two values on 𝒳, there exists a distribution P_{XY} satisfying Condition 2 such that the optimal function is of the form h\*(x) = sign(g(x) + t) for some t ∈ ℝ and

SClabel(ε, δ, A) = Ω((1/ε)^{2κ−2} log(1/δ)). &nbsp;&nbsp; (1)

If P_{XY} satisfies Condition 1 with ν = O(ε), then SClabel satisfies (1) with κ = 1.

The lower bound in Theorem 10 matches the label complexity of A²-ADGAC and Margin-ADGAC up to a log factor, so our algorithms are near-optimal.

### 6.2 Lower Bound on Total Query Complexity

We use Theorem 10 to give lower bounds on the total query complexity of any algorithm that can access both comparison and labeling oracles.

**Theorem 11.** For any algorithm A that can access both labeling and comparison oracles, and sufficiently small ε, δ, there exists a distribution P_{XY} satisfying Condition 2 such that

SCcomp(ε, δ, A) + SClabel(ε, δ, A) = Ω((1/ε)^{2κ−2} log(1/δ) + d log(1/ε)). &nbsp;&nbsp; (2)

If P_{XY} satisfies Condition 1 with ν = O(ε), then SCcomp + SClabel satisfies (2) with κ = 1.

The first term of (2) follows from Theorem 10, whereas the second term follows from transforming a lower bound for active learning with access to the labeling oracle only. The lower bounds in Theorem 11 match the performance of A²-ADGAC and Margin-ADGAC up to log factors.

### 6.3 Adversarial Noise Tolerance of Comparisons

Label queries are typically expensive in practice, so it is natural to ask the following question: what is the minimal requirement on ν′, given that we are only allowed the minimal label complexity of Theorem 10?
We study this problem in this section. More concretely, we study the requirement on ν′ when we learn a threshold function using labels. Suppose that the comparison oracle gives feedback based on a scoring function ĝ, i.e., Z(x, x′) = sign(ĝ(x) − ĝ(x′)), and has error ν′. We give a sharp minimax bound on the risk of the optimal classifier of the form h(x) = sign(ĝ(x) − t) for some t ∈ ℝ.

**Theorem 12.** Suppose that min{Pr[h\*(X) = 1], Pr[h\*(X) = −1]} ≥ ν′ and that both ĝ(X) and g\*(X) have probability density functions. If ĝ induces an oracle with error ν′, then min_t max_{ĝ,g\*} Pr[sign(ĝ(X) − t) ≠ h\*(X)] = Θ(√ν′).

The proof is technical and omitted. By Theorem 12, the condition ν′ = O(ε²) is necessary if labels from g\* are only used to learn a threshold over ĝ. This matches our choice of ν′ under the Massart and adversarial noise conditions for the labeling oracle (up to a factor of δ).

## 7 Conclusion

We presented a general algorithmic framework, ADGAC, for learning with both comparison and labeling oracles. We proposed two variants of the base algorithm, A²-ADGAC and Margin-ADGAC, to facilitate low query complexity under the Tsybakov and adversarial noise conditions. The performance of our algorithms matches lower bounds for learning with both oracles. Our analysis is relevant to a wide range of practical applications in which it is easier, less expensive, and/or less risky to obtain pairwise comparisons than labels.

There are multiple directions for future work. One improvement over our work would be to show complexity bounds for the excess risk err(h) − err(h\*) instead of the disagreement Pr[h ≠ h\*]. Also, our bound on the comparison complexity holds only in expectation due to the use of Quicksort; deriving concentration inequalities for the comparison complexity would be helpful. An adaptive algorithm that adjusts to the noise levels of labels and comparisons would also be interesting, i.e., one that uses labels when comparisons are noisy and comparisons when labels are noisy.
Other directions include using comparisons (or, more broadly, rankings) for other machine learning tasks such as regression or matrix completion.

Acknowledgments

This research is supported in part by AFRL grant FA8750-17-2-0212. We thank Chicheng Zhang for insightful ideas on improving the results in [6] using Rademacher complexity.

References

[1] S. Agarwal and P. Niyogi. Stability and generalization of bipartite ranking algorithms. In Annual Conference on Learning Theory, pages 32-47, 2005.
[2] S. Agarwal and P. Niyogi. Generalization bounds for ranking algorithms via algorithmic stability. Journal of Machine Learning Research, 10:441-474, 2009.
[3] N. Ailon and M. Mohri. An efficient reduction of ranking to classification. arXiv preprint arXiv:0710.2889, 2007.
[4] J. Attenberg, P. Melville, and F. Provost. A unified approach to active dual supervision for labeling features and examples. In Machine Learning and Knowledge Discovery in Databases, pages 40-55. Springer, 2010.
[5] P. Awasthi, M.-F. Balcan, N. Haghtalab, and H. Zhang. Learning and 1-bit compressed sensing under asymmetric noise. In Annual Conference on Learning Theory, pages 152-192, 2016.
[6] P. Awasthi, M.-F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. Journal of the ACM, 63(6):50, 2017.
[7] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 65-72. ACM, 2006.
[8] M.-F. Balcan, A. Broder, and T. Zhang. Margin based active learning. In Annual Conference on Learning Theory, pages 35-50, 2007.
[9] M.-F. Balcan and S. Hanneke. Robust interactive learning. In COLT, 2012.
[10] M.-F. Balcan and P. M. Long. Active and passive learning of linear separators under log-concave distributions. In Annual Conference on Learning Theory, pages 288-316, 2013.
[11] M.-F. Balcan, E. Vitercik, and C. White. Learning combinatorial functions from pairwise comparisons.
arXiv preprint arXiv:1605.09227, 2016.
[12] M.-F. Balcan and H. Zhang. Noise-tolerant life-long matrix completion via adaptive sampling. In Advances in Neural Information Processing Systems, pages 2955-2963, 2016.
[13] A. Beygelzimer, D. J. Hsu, J. Langford, and C. Zhang. Search improves label for active learning. In Advances in Neural Information Processing Systems, pages 3342-3350, 2016.
[14] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323-375, 2005.
[15] R. M. Castro and R. D. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339-2353, 2008.
[16] O. Dekel, C. Gentile, and K. Sridharan. Selective sampling and active learning from single and multiple teachers. Journal of Machine Learning Research, 13:2655-2697, 2012.
[17] J. Fürnkranz and E. Hüllermeier. Preference learning and ranking by pairwise comparison. In Preference Learning, pages 65-82. Springer, 2010.
[18] S. Hanneke. Adaptive rates of convergence in active learning. In COLT, 2009.
[19] S. Hanneke. Theory of active learning, 2014.
[20] S. Hanneke and L. Yang. Surrogate losses in passive and active learning. arXiv preprint arXiv:1207.3772, 2012.
[21] R. Heckel, N. B. Shah, K. Ramchandran, and M. J. Wainwright. Active ranking from pairwise comparisons and the futility of parametric assumptions. arXiv preprint arXiv:1606.08842, 2016.
[22] K. G. Jamieson and R. Nowak. Active ranking using pairwise comparisons. In Advances in Neural Information Processing Systems, pages 2240-2248, 2011.
[23] D. M. Kane, S. Lovett, S. Moran, and J. Zhang. Active classification with comparison queries. arXiv preprint arXiv:1704.03564, 2017.
[24] A. Krishnamurthy. Interactive Algorithms for Unsupervised Machine Learning. PhD thesis, Carnegie Mellon University, 2015.
[25] L. Lovász and S. Vempala. The geometry of logconcave functions and sampling algorithms.
Random Structures & Algorithms, 30(3):307-358, 2007.
[26] S. Maji and G. Shakhnarovich. Part and attribute discovery from relative annotations. International Journal of Computer Vision, 108(1-2):82-96, 2014.
[27] S. Sabato and T. Hess. Interactive algorithms: from pool to stream. In Annual Conference on Learning Theory, pages 1419-1439, 2016.
[28] N. B. Shah, S. Balakrishnan, J. Bradley, A. Parekh, K. Ramchandran, and M. Wainwright. When is it better to compare than to score? arXiv preprint arXiv:1406.6618, 2014.
[29] N. Stewart, G. D. Brown, and N. Chater. Absolute identification by relative judgment. Psychological Review, 112(4):881, 2005.
[30] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, pages 135-166, 2004.
[31] C. Wah, G. Van Horn, S. Branson, S. Maji, P. Perona, and S. Belongie. Similarity comparisons for interactive fine-grained categorization. In IEEE Conference on Computer Vision and Pattern Recognition, pages 859-866, 2014.
[32] S. Yan and C. Zhang. Revisiting perceptron: efficient and label-optimal active learning of halfspaces. arXiv preprint arXiv:1702.05581, 2017.
[33] L. Yang and J. G. Carbonell. Cost complexity of proactive learning via a reduction to realizable active learning. Technical Report CMU-ML-09-113, Carnegie Mellon University, 2009.
[34] C. Zhang and K. Chaudhuri. Beyond disagreement-based agnostic active learning. In Advances in Neural Information Processing Systems, pages 442-450, 2014.