# agnostic_pointwisecompetitive_selective_classification__7408a1b9.pdf

Journal of Artiﬁcial Intelligence Research 52 (2015) 179-201 Submitted 5/14; published 1/15

Agnostic Pointwise-Competitive Selective Classiﬁcation

Yair Wiener wyair@tx.technion.ac.il Ran El-Yaniv rani@cs.technion.ac.il Computer Science Department Technion Israel Institute of Technology Haifa 32000, Israel

A pointwise competitive classiﬁer from class F is required to classify identically to the best classiﬁer in hindsight from F. For noisy, agnostic settings we present a strategy for learning pointwise-competitive classiﬁers from a ﬁnite training sample provided that the classiﬁer can abstain from prediction at a certain region of its choice. For some interesting hypothesis classes and families of distributions, the measure of this rejected region is shown to be diminishing at rate β1 O (polylog(m) log(1/δ)/m)β2/2 , with high probability, where m is the sample size, δ is the standard conﬁdence parameter, and β1, β2 are smoothness parameters of a Bernstein type condition of the associated excess loss class (related to F and the 0/1 loss). Exact implementation of the proposed learning strategy is dependent on an ERM oracle that is hard to compute in the agnostic case. We thus consider a heuristic approximation procedure that is based on SVMs, and show empirically that this algorithm consistently outperforms a traditional rejection mechanism based on distance from decision boundary.

1. Introduction

Given a labeled training set and a class of models F, is it possible to select from F, based on a ﬁnite training sample, a model whose predictions are always identical to best model in hindsight? While classical results from statistical learning theory surely preclude such a possibility within the standard model, when changing the rules of the game it is possible. Indeed, consider a game where our classiﬁer is allowed to abstain from prediction without penalty in some region of its choice (a.k.a classiﬁcation with a reject option). For this game, and assuming a noise free realizable setting, it was shown by El-Yaniv and Wiener (2010) that one can train a perfect classiﬁer that never errs whenever it is willing to predict. While always abstaining will render such perfect classiﬁcation vacuous, it was shown that for a quite broad set of problems (each speciﬁed by an underlying distribution family and a hypothesis class), perfect realizable classiﬁcation is achievable with a rejection rate that diminishes quickly to zero with the training sample size. In general, perfect classiﬁcation cannot be achieved in a noisy setting. In this paper, our objective is to achieve pointwise competitiveness, a property ensuring that the prediction at every non-rejected test point is identical to the prediction of the best predictor in hindsight from the same class. Here we consider pointwise-competitive selective classiﬁcation and generalize the results of El-Yaniv and Wiener (2010) to the agnostic case. In particular, we show that pointwise-competitive classiﬁcation is achievable with high probability by a learning strategy called low error selective strategy (LESS). Given a training sample Sm

c 2015 AI Access Foundation. All rights reserved.

Wiener & El-Yaniv

and a hypothesis class F, LESS outputs a pointwise-competitive selective classiﬁer (f, g), where f(x) is a standard classiﬁer, and g(x) is a selection function that qualiﬁes some of the predictions as don t knows (see deﬁnitions in Section 2). The classiﬁer f is simply taken to be the empirical risk minimizer (ERM) classiﬁer, ˆf. Pointwise competitiveness is achieved through g as follows. Using standard concentration inequalities, we show that the true risk minimizer, f , achieves empirical error that is close that of ˆf. Thus, with high probability f belongs to the class of low empirical error hypotheses. Now all that is left to do is set g(x) such that it allows the prediction of the label of x, as ˆf(x), if and only if all the hypotheses in this low error class unanimously agree on the label of x. In the simpler, realizable setting (El-Yaniv & Wiener, 2010), this low error class simply reduces to the version space. The bulk of our analysis (in Sections 3, 4 and 5) concerns coverage bounds for LESS, namely, showing that the measure of the region where the classiﬁer (f, g) refuses to classify, diminishes quickly, with high probability, as the training sample size grows (see Section 2 for a formal deﬁnition). We provide several general and distribution-dependent coverage bounds. In particular, we show (in Corollaries 12 and 14, respectively) high probability bounds for the coverage Φ(f, g) of the classiﬁer (f, g, ) of the form,

Φ(f, g) 1 β1 O (polylog(m) log(1/δ)/m)β2/2 ,

for linear models under (unknown) distribution P(X, Y ), where X are feature space points and Y are labels, whose marginal P(X) is any ﬁnite mixture of Gaussians, and for axis aligned rectangles under P(X, Y ) whose marginal P(X) is a product distribution, where β1, β2 are Bernstein class smoothness parameters depending on the hypothesis class and the underlying distribution (and the loss function, 0/1 in our case). At the outset, eﬃcient implementation of LESS seems to be out of reach as we are required to track the supremum of the empirical error over a possibly inﬁnite hypothesis subset, which in general might be intractable. To overcome this computational diﬃculty, we propose a reduction of this problem to a problem of calculating (two) constrained ERMs. For any given test point x, we calculate the ERM over the training sample Sm with a constraint on the label of x (one positive label constraint and one negative). We show that thresholding the diﬀerence in empirical error between these two constrained ERMs is equivalent to tracking the supremum over the entire (inﬁnite) hypothesis subset. Based on this reduction we introduce in Section 6 a disbelief principle that motivates a heuristic implementation of LESS, which relies on constrained SVMs, and mimics the optimal behavior. In Section 7 we present some numerical examples over medical classiﬁcation problems and examine the empirical performance of the new algorithm and compare its performance with that of the widely used selective classiﬁcation method for rejection, based on distance from decision boundary.

2. Pointwise-Competitive Selective Classiﬁcation: Preliminary Deﬁnitions

Let X be some feature space, for example, d-dimensional vectors in Rd, and Y be some output space. In standard classiﬁcation, the goal is to learn a classiﬁer f : X Y, using

Agnostic Pointwise-Competitive Selective Classification

a ﬁnite training sample of m labeled examples, Sm = {(xi, yi)}m i=1, assumed to be sampled i.i.d. from some unknown underlying distribution P(X, Y ) over X Y. The classiﬁer is to be selected from some hypothesis class F. Let ℓ: Y Y [0, 1] be a bounded loss function. In selective classiﬁcation (El-Yaniv & Wiener, 2010), the learning algorithm receives Sm and is required to output a selective classiﬁer, deﬁned to be a pair (f, g), where f F is a classiﬁer, and g : X {0, 1} is a selection function, serving as qualiﬁer for f as follows. For any x X, (f, g)(x) = f(x) iﬀg(x) = 1. Otherwise, the classiﬁer outputs I don t know. The general performance of a selective predictor is characterized in terms of two quantities: coverage and risk. The coverage of (f, g) is Φ(f, g) EP [g(x)] . The true risk of (f, g), with respect to some loss function ℓ, is the average loss of f restricted to its region of activity as qualiﬁed by g, and normalized by its coverage, R(f, g) EP [ℓ(f(x), y) g(x)] /Φ(f, g). It is easy to verify that if g 1 (and therefore Φ(f, g) = 1), then R(f, g) reduces to the familiar risk functional R(f) EP [ℓ(f(x), y)]. For a classiﬁer f, let ˆR(f) 1

m Pm i=1 ℓ(f(xi), yi), the standard empirical error of f over the sample Sm. Let ˆf = argminf F ˆR(f) be the empirical risk minimizer (ERM), and let f = argminf F R(f) be the true risk minimizer with respect to unknown distribution P(X, Y ).1 Clearly, the true risk minimizer f is unknown. A selective classiﬁer (f, g) is called pointwise-competitive if for any x X, for which g(x) > 0, f(x) = f (x).

3. Low Error Selective Strategy (LESS)

For any hypothesis class F, hypothesis f F, distribution P, sample Sm, and real number r > 0, deﬁne the true and empirical low-error sets,

V(f, r) f F : R(f ) R(f) + r (1)

and ˆV(f, r) n f F : ˆR(f ) ˆR(f) + r o . (2)

Throughout the paper we denote by σ(m, δ, d) the slack of a standard uniform deviation bound, given in terms of the training sample size, m, the conﬁdence parameter, δ, and the VC-dimension, d, of the class F,

σ(m, δ, d) 2

The following theorem is a slight extension of the statement made by Bousquet, Boucheron, and Lugosi (2004, p. 184).

Theorem 1 (Bousquet et al., 2004). Let ℓbe the 0/1 loss function and F, a hypothesis class whose VC-dimension is d. For any 0 < δ < 1, with probability of at least 1 δ over the choice of Sm from P m, any hypothesis f F satisﬁes

R(f) ˆR(f) + σ(m, δ, d).

Similarly, ˆR(f) R(f) + σ(m, δ, d) under the same conditions.

1 More formally, f is a classiﬁer such that R(f ) = inff F R(f) and inff F P((x, y) : f(x) = f (x)) = 0. The existence of such a (measurable) f is guaranteed under suﬃcient considerations (see Hanneke, 2012, pp. 1511-2).

Wiener & El-Yaniv

Remark 2. The use of Theorem 1 and, in particular, VC bounds for classiﬁcation problems (0/1 loss) is not mandatory for developing the theory presented in this paper. Similar theories can be developed using other types of bounds (e.g., Rademacher or compression bounds) for other learning problems.

Let G F. The disagreement set (Hanneke, 2007a; El-Yaniv & Wiener, 2010) w.r.t. G is deﬁned as DIS(G) {x X : f1, f2 G s.t. f1(x) = f2(x)} . (4)

Let us now motivate the low-error selective strategy (LESS) whose pseudo-code appears in Strategy 1. The strategy is deﬁne whenever the empirical risk minimizer (ERM) exists, for example, in the case of the 0/1 loss. Using a standard uniform deviation bound, such as the one in Theorem 1, one can show that the training error of the true risk minimizer, f , cannot be too far from the training error of the empirical risk minimizer, ˆf. Therefore, we can guarantee, with high probability, that the empirical low error class ˆV ˆf, r (applied with appropriately chosen r) includes the true risk minimizer f . The selection function g is now constructed to accept a subset of the domain X, on which all hypotheses in the empirical low-error set unanimously agree. Strategy 1 formulates this idea. We call it a strategy rather then an algorithm because it lacks implementation details. Indeed, it is not clear at the outset if this strategy can be implemented.

Strategy 1 Agnostic low-error selective strategy (LESS) Input: Sm, m, δ, d Output: a pointwise-competitive selective classiﬁer (h, g) w.p. 1 δ

1: Set ˆf = ERM(F, Sm), i.e., ˆf is any empirical risk minimizer from F w.r.t. Sm

2: Set G = ˆV ˆf, 2σ(m, δ/4, d) (see Eq. (2) and (3))

3: Construct g such that g(x) = 1 x {X \ DIS (G)}

We now begin the analysis of LESS. The following lemma establishes its pointwise competitiveness. In Section 4 we develop general coverage bounds in terms of an undetermined disagreement coeﬃcient. Then, in Section 5 we present distribution-dependent bounds that do not rely on the disagreement coeﬃcient.

Lemma 3 (pointwise competitiveness). Let ℓbe the 0/1 loss function and F, a hypothesis class whose VC-dimension is d. Let δ > 0 be given and let (f, g) be the selective classiﬁer chosen by LESS. Then, with probability of at least 1 δ/2, (f, g) is a pointwise competitive selective classiﬁer.

Proof. By Theorem 1, with probability of at least 1 δ/4,

ˆR(f ) R(f ) + σ (m, δ/4, d) .

Clearly, since f minimizes the true error, R(f ) R( ˆf). Applying again Theorem 1, we know that with probability of at least 1 δ/4,

R( ˆf) ˆR( ˆf) + σ (m, δ/4, d) .

Agnostic Pointwise-Competitive Selective Classification

Using the union bound, it follows that with probability of at least 1 δ/2,

ˆR(f ) ˆR( ˆf) + 2σ (m, δ/4, d) .

Hence, with probability of at least 1 δ/2,

f ˆV ˆf, 2σ (m, δ/4, d) G.

By deﬁnition, LESS constructs the selection function g(x) such that it equals one iﬀx X \DIS (G) . Thus, for any x X, for which g(x) = 1, all the hypotheses in G agree, and in particular f and ˆf agree. Therefore (f, g) is pointwise-competitive with high probability.

4. General Coverage Bounds for LESS in Terms of the Disagreement Coeﬃcient

We require the following deﬁnitions to facilitate the coverage analysis. For any f F and r > 0, deﬁne the set B(f, r) of all hypotheses that reside in a ball of radius r around f,

B(f, r) f F : Pr X P f (X) = f(X) r .

For any G F, and distribution P, we denote by G the volume of the disagreement set of G (see (4)), G Pr {DIS(G)}. Let r0 0. The disagreement coeﬃcient (Hanneke, 2009) of the hypothesis class F with respect to the target distribution P is

θ(r0) θf (r0) = sup r>r0

The disagreement coeﬃcient will be utilized later on in our analysis. See also a discussion on its characteristics after Corollary 7. The associated excess loss class of the class F and the loss function ℓ(Massart, 2000; Mendelson, 2002; Bartlett, Mendelson, & Philips, 2004) is deﬁned as XL(F, ℓ)(x, y) {ℓ(f(x), y) ℓ(f (x), y) : f F} .

Whenever F and ℓare ﬁxed we abbreviate XL = XL(F, ℓ)(x, y). XL is said to be a (β1, β2)-Bernstein class with respect to P (where 0 < β2 1 and β1 1), if every h XL satisﬁes Eh2 β1(Eh)β2. (6)

Bernstein classes arise in many natural situations (see, e.g., Koltchinskii, 2006; Bartlett & Mendelson, 2006; Bartlett & Wegkamp, 2008). For example, if the conditional probability P(Y |X) is bounded away from 1/2, or it satisﬁes Tsybakov s noise conditions2, then the excess loss function is a Bernstein class (Bartlett & Mendelson, 2006; Tsybakov, 2004).3

2 If the data was generated from any unknown deterministic hypothesis with limited noise then P(Y |X) is bounded away from 1/2.

3 Speciﬁcally, for the 0/1 loss, Assumption A in Proposition 1 in the work of Tsybakov (2004), is equivalent to the Bernstein class condition of Equation (6) above with β2 = α 1+α, where α is the Tsybakov noise parameter.

Wiener & El-Yaniv

In the following sequence of lemmas and theorems we assume a binary hypothesis class F with VC-dimension d, an underlying distribution P over X { 1}, and that ℓis the 0/1 loss function. Also, XL denotes the associated excess loss class. Our results can be extended to loss functions other than 0/1 by similar techniques to those used by Beygelzimer, Dasgupta, and Langford (2009). In Figure 1 we schematically depict the hypothesis class F (the gray area), the target hypothesis (ﬁlled black circle outside F), and the best hypothesis in the class f . The distance of two points in the diagram relates to the distance between two hypothesis under the marginal distribution P(X). Our ﬁrst observation is that if the excess loss class is (β1, β2)-Bernstein class, then the set of low true error (depicted in Figure 1 (a)) resides within a larger ball centered around f (see Figure 1 (b)).

Figure 1: The set of low true error (a) resides within a ball around f (b).

Lemma 4. If XL is a (β1, β2)-Bernstein class with respect to P, then for any r > 0

V(f , r) B f , β1rβ2 .

Proof. If f V(f , r) then, by deﬁnition, E {I(f(X) = Y )} E {I(f (X) = Y )} + r. By linearity of expectation we have,

E {I(f(X) = Y ) I(f (X) = Y )} r. (7)

Since XL is (β1, β2)-Bernstein,

E {I(f(X) = f (X))} = E {|I(f(X) = Y ) I(f (X) = Y )|}

= E n (ℓ(f(X), Y ) ℓ(f (X), Y ))2o Eh2 β1(Eh)β2

β1 (E {I(f(X) = Y ) I(f (X) = Y )})β2 .

By (7), E {I(f(X) = f (X))} β1rβ2. Therefore, by deﬁnition, f B f , β1rβ2 .

Agnostic Pointwise-Competitive Selective Classification

So far we have seen that the set of low true error resides within a ball around f . Now we would like to prove that with high probability the set of low empirical error (depicted in Figure 2 (a)) resides within the set of low true error (see Figure 2 (b)). We emphasize that the distance between hypotheses in Figure 2 (a) is based on the empirical error, while the distance in Figure 2 (b) is based on the true error.

Figure 2: The set of low empirical error (a) resides within the set of low true error (b).

Lemma 5. For any r > 0, and 0 < δ < 1, with probability of at least 1 δ,

ˆV( ˆf, r) V (f , 2σ (m, δ/2, d) + r) .

Proof. If f ˆV( ˆf, r), then, by deﬁnition, ˆR(f) ˆR( ˆf)+r. Since ˆf minimizes the empirical error, we know that ˆR( ˆf) ˆR(f ). Using Theorem 1 twice, and applying the union bound, we see that with probability of at least 1 δ,

R(f) ˆR(f) + σ(m, δ/2, d) ˆR(f ) R(f ) + σ(m, δ/2, d).

Therefore, R(f) R(f ) + 2σ (m, δ/2, d) + r,

and f V (f , 2σ (m, δ/2, d) + r) .

We have shown that, with high probability, the set of low empirical error is a subset of a certain ball around f . Therefore, the probability that at least two hypotheses in the set of low empirical error will disagree with each other is bounded by the probability that at least two hypotheses in that ball around f will disagree with each other. Fortunately, the latter is bounded by the disagreement coeﬃcient as established in the following lemma.

Wiener & El-Yaniv

Lemma 6. For any r > 0 and 0 < δ < 1, with probability of at least 1 δ,

ˆV( ˆf, r) β1 (2σ (m, δ/2, d) + r)β2 θ(r0),

where θ(r0) is the disagreement coeﬃcient of F with respect to P, applied with r0 = (2σ(m, δ/2, d))β2 (see (5)).

Proof. Applying Lemmas 5 and 4 we get that with probability of at least 1 δ,

ˆV( ˆf, r) B f , β1 (2σ (m, δ/2, d) + r)β2 .

Therefore, ˆV( ˆf, r) B f , β1 (2σ (m, δ/2, d) + r)β2 .

By the deﬁnition of the disagreement coeﬃcient (5), for any r > r0, B(f , r ) θ(r0)r . Recalling that β1 1 and thus observing that r = β1 (2σ (m, δ/2, d) + r)β2 > (2σ (m, δ/2, d))β2 = r0, the proof is complete.

We are now in a position to state our ﬁrst coverage bound for the selective classiﬁer constructed by LESS. This bound is given in terms of the disagreement coeﬃcient.

Corollary 7. Let F be a hypothesis class as in Theorem 1, and assume that XL is a (β1, β2)-Bernstein class w.r.t. P. Let (f, g) be the selective classiﬁer constructed by LESS. Then, with probability of at least 1 δ, (f, g) is a pointwise competitive selective classiﬁer and Φ(f, g) 1 β1 (4σ (m, δ/4, d))β2 θ(r0),

where θ(r0) is the disagreement coeﬃcient of F with respect to P, and r0 = (2σ(m, δ/4, d))β2 .

Proof. By Lemma 3, with probability of at least 1 δ/2, (f, g) is pointwise-competitive. Set G ˆV ˆf, 2σ (m, δ/4, d) . By construction, f = ˆf, and the selection function g(x) equals one iﬀx X \ DIS (G). Thus, by the deﬁnition of coverage, Φ(f, g) = E{g(X)} = 1 G. Therefore, applications of Lemma 6 and the union bound imply that with probability of at least 1 δ, (f, g) is pointwise-competitive and its coverage satisﬁes,

Φ(f, g) = E{g(X)} = 1 G 1 β1 (4σ (m, δ/4, d))β2 θ(r0),

Noting that θ(r) is monotone non-increasing with r, we know that the coverage bound of Corollary 7 clearly applies with θ(0). The quantity θ(0) has been discussed in numerous papers and has been shown to be ﬁnite in various settings including thresholds in R under any distribution (θ(0) = 2) (Hanneke, 2009), linear separators through the origin in Rd under uniform distribution on the sphere (θ(0)

d) (Hanneke, 2009), and linear separators in Rd under smooth data distribution bounded away from zero (θ(0) c(f )d, where c(f ) is an unknown constant that depends on the target hypothesis) (Friedman, 2009). For these cases, an application of Corollary 7 is suﬃcient to guarantee pointwisecompetitiveness with bounded coverage that converges to one. Unfortunately for many hypothesis classes and distributions the disagreement coeﬃcient θ(0) is inﬁnite (Hanneke, 2009). Fortunately, if the disagreement coeﬃcient θ(r) grows slowly with respect to 1/r (as shown in Wang, 2011, under suﬃcient smoothness conditions), Corollary 7 is suﬃcient to guarantee bounded coverage.

Agnostic Pointwise-Competitive Selective Classification

5. More Distribution-Dependent Coverage Bounds for LESS

In this section we establish distribution-dependent coverage bounds for LESS. The starting point of these bounds is the following corollary.

Corollary 8. Let F be a hypothesis class as in Theorem 1, and assume that F has disagreement coeﬃcient θ(r0) = O (polylog (1/r0)) (8)

w.r.t. distribution P, and that XL is a (β1, β2)-Bernstein class w.r.t. the same distribution. Let (f, g) be the selective classiﬁer chosen by LESS. Then, with probability of at least 1 δ, (f, g) is pointwise competitive and its coverage satisﬁes,

Φ(f, g) 1 β1 O

Proof. Plugging in (8) in the coverage bound of Corollary 7 immediately yields the result.

Corollary 8 states fast coverage bounds for LESS in cases where the disagreement coeﬃcient grows slowly with respect to 1/r0.4 Recent results on disagreement-based active learning and selective prediction (Wiener et al., 2014; Wiener, 2013) established tight relations between the disagreement coeﬃcient and an empirical quantity called the version space compression set size. This quantity has been analyzed by El-Yaniv and Wiener (2010) in the context of realizable selective classiﬁcation, and there are known distribution-dependent bounds for it. Our plan for the rest of this section is to introduce the version space compression set size, discuss its relation to the disagreement coeﬃcient, and then show how to apply those results in the agnostic setting. While we are interested in solving the agnostic case, we will now consider for a moment the realizable setting and utilize known results that will be used in our analysis. Speciﬁcally, we now assume that f F with P(Y = f (x)|X = x) = 1 for all x X, where (X, Y ) P. Given a training sample Sm, let VSF,S be the induced version space, i.e., the set of all hypotheses consistent with the given sample Sm. The version space compression set size, denoted ˆn(Sm) = ˆn(F, Sm), is deﬁned to be the size of the smallest subset of Sm inducing the same version space (Hanneke, 2007b; El-Yaniv & Wiener, 2010). Being a function of Sm, clearly ˆn(Sm) is a random variable, and for any speciﬁc realization Sm its value is unique. For any m and δ (0, 1], deﬁne the version space compression set size minimal bound as

Bˆn(m, δ) min {b N : P(ˆn(Sm) b) 1 δ} . (9)

We rely on the following lemma (Wiener et al., 2014). For the sake of self-containment we provide its proof in the appendix.

4 When the disagreement coeﬃcient does not grow ploy-logarithmically with 1/r0 but is still o(1/r0), it is still possible to prove a lower bound on the coverage. Speciﬁcally, if θ(r0) = O ((1/r0)α) with α < 1, one can show that Φ(f, g) 1 O(1/( m)β2(1 α)).

Wiener & El-Yaniv

Lemma 9 (Wiener et al., 2014). In the realizable case, if Bˆn(m, δ) = O (polylog(m) log (1/δ)), or Bˆn m, 1

20 = O (polylog(m)), then θ(r0) = O polylog 1 r0

Obviously, the statement of Lemma 9 only holds (and is well deﬁned) within a realizable setting (the version space compression set size is only deﬁned for this setting). We now turn back to the agnostic setting and consider an arbitrary underlying distribution P over X Y. Recall that in the agnostic setting, we let f : X Y denote a (measurable) classiﬁer such that R(f ) = inff F R(f) and inff F P((x, y) : f(x) = f (x)) = 0, which is guaranteed to exist under suﬃcient assumptions (see Hanneke, 2012, Section 6.1); We call f an inﬁmal (best) hypothesis (of F, w.r.t. P). Clearly there can be several diﬀerent inﬁmal hypotheses. We note, however, that if XL is a (β1, β2)-Bernstein class with respect to P (as we assume in this paper), then Lemma 4 ensures that all inﬁmal hypotheses are identical up to measure zero. The deﬁnitions of version space and version space compression set size can be naturally generalized to the agnostic setting with respect to an inﬁmal hypothesis (Wiener et al., 2014) as follows. Let f be an inﬁmal hypothesis of F w.r.t. P. The agnostic version space of Sm is VSF,Sm,f {f F : (x, y) Sm, f(x) = f (x)}.

The agnostic version space compression set size, denoted ˆn(Sm) = ˆn(F, Sm, f ), is deﬁned to be the size of the smallest subset of Sm inducing the agnostic version space VSF,Sm,f . Finally, extend also the deﬁnition of the version space compression set minimal bound to the agnostic setting as follows.

Bˆn(m, δ, f ) min{b N : P(ˆn(F, Sm, f ) b) 1 δ}.

The key observation that allows for surprisingly easy utilization of Lemma 9 in the agnostic setting is that the disagreement coeﬃcient depends only on the hypothesis class F and the marginal distribution P(X). Using an inﬁmal hypothesis f we can therefore take any agnostic learning problem and consider its realizable projection, whereby points are labeled by f and it has the same marginal distribution P(X). These two problems will have (essentially) the same disagreement coeﬃcients. This idea was initially observed by Hanneke (2013) and Wiener (2013). Here we formulate it as a slight variation of the formulation in the work of Wiener, Hanneke, and El-Yaniv (2014). We deﬁne the disagreement in the agnostic setting as in (5) with respect to an inﬁmal hypothesis f . For any agnostic learning problem (F, P) we deﬁne its realizable projection (F , P ) as follows. Let F F {f } where f is an inﬁmal hypothesis of the agnostic problem. Deﬁne P to be a distribution with marginal P (X) = P(X), and P(Y = f (x)|X = x) = 1 for all x X. It is easy to verify that (F , P ) is a realizable learning problem, i.e., f F with PP (X,Y )(Y = f (x)|X = x) = 1 for all x X.

Lemma 10 (Realizable projection). Given any agnostic learning problem, (F, P), let (F , P ) be its realizable projection. Let θ(r0) and θ (r0) be the associated disagreement coefﬁcients of the agnostic and realizable projection problems, respectively. Then, θ(r0) θ (r0).

Proof. First note that θ and θ depend, respectively, on P and P only via f and the marginal distributions P(X) = P (X). Since F F {f } = F , we readily get that θ(r0) θ (r0).

Agnostic Pointwise-Competitive Selective Classification

Let us summarize the above derivation. Given an agnostic problem (F, P), consider its realizable projection (F , P ). If Bˆn(m, δ) = O (polylog(m) log (1/δ)) (or Bˆn(m, 1/20) = O (polylog(m))) for the realizable problem, then by Lemma 9, θ(r0) = O (polylog (1/r0)), which, by Lemma 10, also holds in the original agnostic problem. Therefore, Corollary 7 applies and we obtain a fast coverage bound for LESS w.r.t. (F, P). New agnostic coverage bounds for LESS are obtained using the following known bounds for the (realizable) version space compression set size. The ﬁrst one, by El-Yaniv and Wiener (2010), applies to the problem of learning linear separators under a mixture of Gaussian distributions. The following theorem is a direct application of Lemma 32 in the work of El-Yaniv and Wiener (2010).

Theorem 11 (El-Yaniv & Wiener, 2010). For any d, n N, let X Rd, F be the space of linear separators on Rd, and P be any distribution with marginal over Rd that is a mixture of n multivariate normal distributions. Then, there is a constant cd,n > 0 (depending on d, n, but otherwise independent of P) such that m 2,

Bˆn(m, 1/20) cd,n(log(m))d 1.

Applying Theorem 11, together with Lemma 10, Lemma 9 and Corollary 8, immediately yields the following result.

Corollary 12. Assume the conditions of Theorem 11. Assume also that XL is a (β1, β2)- Bernstein class w.r.t. P(X, Y ). Let (f, g) be the selective classiﬁer constructed by LESS. Then, with probability of at least 1 δ, (f, g) is a pointwise competitive selective classiﬁer and Φ(f, g) 1 β1 O (polylog(m) log(1/δ)/m)β2/2 .

The second version space compression set size bound concerns realizable learning of axis-aligned rectangles under product densities over Rn. Such bounds have been previously proposed by Wiener, Hanneke, and El-Yaniv (2014) and El-Yaniv and Wiener (2010, 2012). We now state (without proof) a recent bound (Wiener, Hanneke, & El-Yaniv, 2014) giving version space compression set size bound for this learning problem (whose positive class is bounded away from zero).

Theorem 13 (Wiener et al., 2014). For d, m N and λ, δ (0, 1), let X Rd. For any P with marginal distribution over Rd that is a product of densities over Rd with marginals having continuous CDFs, and for F the space of axis-aligned rectangles f on Rd with P((x, y) : f(x) = 1) λ,

Bˆn(m, δ) 8d

Here again, an application of Theorem 13, together with Lemma 10, Lemma 9 and Corollary 8 yields the following corollary.

Corollary 14. For d, m N and λ, δ (0, 1), let X Rd. Let P(X, Y ) be an underlying distribution with marginal P(X) that is a product of densities over Rd with marginals having continuous CDFs. Let F the space of axis-aligned rectangles f on Rd with P((x, y) : f(x) = 1) λ, Assume that XL is a (β1, β2)-Bernstein class w.r.t. P(X, Y ). Let (f, g) be the

Wiener & El-Yaniv

selective classiﬁer constructed by LESS. Then, with probability of at least 1 δ, (f, g) is a pointwise competitive selective classiﬁer and

Φ(f, g) 1 β1 O (polylog(m) log(1/δ)/m)β2/2 .

6. ERM Oracles and the Disbelief principle

At the outset, eﬃcient construction of the selection function g prescribed by LESS seems to be out of reach as we are required to verify, for each point x in question, whether all hypotheses in the low error class agree on its label. Moreover, g should be computed for the entire domain. Luckily, it is possible to compute g in a lazy manner and we now show how to compute g(x) by calculating (two) constrained ERMs. For any given test point x, we calculate the ERM over the training sample Sm with a constraint on the label of x (one positive label constraint and one negative). We show that thresholding the diﬀerence in empirical error between these two constrained ERMs is equivalent to tracking the supremum over the entire (inﬁnite) hypothesis subset. The following lemma establishes this reduction.

Lemma 15. Let (f, g) be a selective classiﬁer chosen by LESS after observing the training sample Sm. Let ˆf be an empirical risk minimizer over Sm. Let x be any point in X and deﬁne fx argmin f F

n ˆR(f) | f(x) = sign ˆf(x) o ,

i.e., an empirical risk minimizer forced to label x the opposite from ˆf(x). Then

g(x) = 0 ˆR( fx) ˆR( ˆf) 2σ (m, δ/4, d) . (10)

Proof. First note that according to the deﬁnition of ˆV (see Eq (2)),

ˆR( fx) ˆR( ˆf) 2σ (m, δ/4, d) fx ˆV ˆf, 2σ (m, δ/4, d) . (11)

To prove the ﬁrst direction ( =) of (10), assume that the RHS of (10) holds. By (11), we get that both ˆf, fx ˆV. However, by construction, ˆf(x) = fx(x), so x DIS(ˆV) and g(x) = 0. To prove the other direction (= ), assume that ˆR( fx) ˆR( ˆf) > 2σ (m, δ/4, d). Under this assumption, we will prove that for any f ˆV, f (x) = ˆf(x), and therefore, x X \ DIS(ˆV), entailing that g(x) = 1. Indeed, assume by contradiction that there exists f ˆV such that f (x) = fx(x) = ˆf(x). By construction, it holds that

ˆR(f ) ˆR( fx) > ˆR( ˆf) + 2σ (m, δ/4, d) ,

so f ˆV. Contradiction.

Lemma 15 tells us that in order to decide if point x should be rejected we need to measure the empirical error ˆR( fx) of a special empirical risk minimizer, fx, which is constrained to label x the opposite from ˆh(x). If this error is suﬃciently close to ˆR(ˆh), our classiﬁer cannot be too sure about the label of x and we must reject it. Thus, provided we can compute these ERMs, we can decide whether to predict or reject any individual test point x X,

Agnostic Pointwise-Competitive Selective Classification

without actually constructing g for the entire domain X. Figure 3 illustrates this principle for a 2-dimensional example. The hypothesis class is the class of linear classiﬁers in R2

and the source distribution is two normal distributions. Negative samples are represented by blue circles and positive samples by red squares. As usual, ˆf denotes the empirical

Figure 3: Constrained ERM.

risk minimizer. Let us assume that we want to classify point x1. This point is classiﬁed positive by ˆf. Therefore, we force this point to be negative and calculate the restricted ERM (depicted by doted line marked fx1). The diﬀerence between the empirical risk of ˆf and fx1 is not large enough, so point x1 will be rejected. However, if we want to classify point x2, the diﬀerence between the empirical risk of ˆf and fx2 is quite large and the point will be classiﬁed as positive. Equation (11) motivates the following deﬁnition of a disbelief index DF(x, Sm) for each individual point in X. Speciﬁcally, for any x X, deﬁne its disbelief index w.r.t. Sm and F,

D(x) DF(x, Sm) ˆR( fx) ˆR( ˆf).

Observe that D(x) is large whenever our model is sensitive to the label of x in the sense that when we are forced to bend our best model to ﬁt the opposite label of x, our model substantially deteriorates, giving rise to a large disbelief index. This large D(x) can be interpreted as our disbelief in the possibility that x can be labeled so diﬀerently. In this case we should deﬁnitely predict the label of x using our unforced model. Conversely, if D(x) is small, our model is indiﬀerent to the label of x and in this sense, is not committed to its label. In this case we should abstain from prediction at x. Notice that LESS is a speciﬁc application of thresholded disbelief index. We note that a similar technique of using an ERM oracle that can enforce an arbitrary number of example-based constraints was used by Dasgupta, Hsu, and Monteleoni (2007a) and Beygelzimer, Hsu, Langford, and Zhang (2010), in the context of active learning. As in our disbelief index, the diﬀerence between the empirical risk (or importance weighted empirical risk, see Beygelzimer et al., 2010) of two ERM oracles (with diﬀerent constraints) is used to estimate prediction conﬁdence.

Wiener & El-Yaniv

0.2 0.4 0.6 0.8 1 0

0.1 0.2 0.3 0.4 0.5 0.6 0.05

Figure 4: RC curve of our technique (depicted in red) compared to rejection based on distance from decision boundary (depicted in dashed green line). The RC curve in right ﬁgure zooms into the lower coverage regions of the left curve.

In practical applications of selective prediction it is desirable to allow for some control over the trade-oﬀbetween risk and coverage; in other words, it is desirable to be able to develop the entire risk-coverage (RC) curve for the classiﬁer at hand (see, e.g., El-Yaniv & Wiener, 2010) and let the user choose the cutoﬀpoint along this curve in accordance with other practical considerations and constraints. The disbelief index facilitates an exploration of the risk-coverage trade-oﬀcurve for our classiﬁer as follows. Given a pool of test points we can rank these test points according to their disbelief index, and points with low index should be rejected ﬁrst. Thus, this ranking provides the means for constructing a risk-coverage trade-oﬀcurve. Ignoring for the moment implementation details (which are discussed in Section 7), a typical RC curve generated by LESS is depicted in Figure 4 (red curve)5. The dashed green RC curve was computed using the traditional distance-based techniques for rejection (see discussion of this common technique in Section 8) The right graph is a zoom in section of the entire RC curve (depicted on the left graph). The dashed horizontal line is the test error of f on the entire domain and the dotted line is the Bayes error. While for high coverage values the two techniques are statistically indistinguishable, for any coverage less than 60% we get a signiﬁcant advantage for LESS. It is clear that in this case not only the estimation error was reduced, but also the test error goes signiﬁcantly below the optimal test error of f for low coverage values. Interestingly, the disbelief index generates rejection regions that are fundamentally different than those obtained by the traditional distance-based techniques for rejection (see Section 8). To illustrate this point (and still ignoring implementation details), consider Figure 5 where we depict the rejection regions for a training sample of 150 points sampled from a mixture of two identical normal distributions (centered at diﬀerent locations). The height map in this ﬁgure, which correspond to disbelief index magnitude (a), and distance from decision boundary (b), reﬂect the conﬁdence regions of each technique according to its own conﬁdence measure.

5 The learning problem is the same synthetic problem used for generating Figure 6.

Agnostic Pointwise-Competitive Selective Classification

Figure 5: Linear classiﬁer. Conﬁdence height map using (a) disbelief index; (b) distance from decision boundary.

Figure 6: SVM with polynomial kernel. Conﬁdence height map using (a) disbelief index; (b) distance from decision boundary.

To intuitively explain the height map of Figure 5(a), recall that the disbelief index is the diﬀerence between the empirical error of the ERM and the restricted ERM. If a test point resides in a high density region, we expect that forcing the wrong label for that point will result in a large increase of the training error. As a result, the denser the area is, the larger the disbelief index, and therefore, the higher the classiﬁcation conﬁdence. The second synthetic 2D source distribution we consider is even more striking. Here X is distributed uniformly over [0, 3π] [ 2, 2] and the labels are sampled according to the following conditional distribution

P(Y = 1|X = (x1, x2)) 0.95, x2 sin(x1); 0.05, else.

The thick red line depicts the decision boundary of the Bayes classiﬁer. The hight maps in Figure 6 depict the rejection regions obtained by (our approximation of) LESS and by

Wiener & El-Yaniv

the traditional (distance from decision boundary) technique for a training sample of 50 points sampled from this distribution (averaged over 100 iterations). Here the hypothesis class used for training was SVM with a polynomial kernel of degree 5. The qualitative diﬀerence between these two techniques, and in particular, the nice ﬁt of the disbelief principle technique compared to SVM is quite surprising.

Figure 7: RC curves for SVM with linear kernel. Our method in solid red, and rejection based on distance from decision boundary in dashed green. Horizontal axis (c) represents coverage.

7. Heuristic Procedure Using SVM and its Empirical Performance

The computation of a (constrained) ERM oracle can be eﬃciently achieved in the case of realizable learning with linear models (see, e.g., El-Yaniv & Wiener, 2010) and in the case of linear regression (Wiener & El Yaniv, 2012). However, in a noisy setting the computation of the linear ERM oracle can be reduced to a variant of the MAX FLS and C MAX FLS problems (with strict and non-strict inequalities) (Amaldi & Kann, 1995). Unfortunately,

Agnostic Pointwise-Competitive Selective Classification

MAX FLS is APX-complete (within a factor 2). C MAX FLS is MAX IND SET-hard, and cannot be approximated eﬃciently at all. Moreover, there are extensions of these results to other classes, including axis-aligned hyper-rectangles, showing that approximating ERM for these classes is NP-hard (Ben-David et al., 2003). While at present it is not known if these hardness results (and other related lower bounds) hold for half spaces under nice distributions such as Gaussian (mixtures), we note that Tauman Kalai et al. (2008) studied the problem of agnostically learning halfspaces under distributional assumptions. In particular, they showed that if the data distribution is uniform over the d-dimensional unit sphere (or hyper-cube, and other related distributions), then it is possible to agnostically learn ǫ-accurate halfspaces in time poly(d1/ǫ4). However, it is known that these particular distributions do not elicit eﬀective pointwise competitive learning. On the contrary, the uniform distribution over the unit sphere is among the worst possible distributions for pointwise-competitive classiﬁcation (and disagreement-based active learning) unless one utilizes homogeneous halfspaces (see discussion in, e.g., El-Yaniv & Wiener, 2010).

Figure 8: SVM with linear kernel. The maximum coverage for a distance-based rejection technique that allows the same error rate as our method with a speciﬁc coverage.

Wiener & El-Yaniv

Having discussed these computational hurdles, we should recall that much of applied machine learning research and many of its applications are doing quite well with heuristic approximations (rather than formal ones). When practical performance is the objective, clever heuristics and tricks can sometimes make the diﬀerence. At this point in the paper we therefore switch from theory to practice, aiming at implementing a rejection method inspired by the disbelief principle and see how well they work on real world problems. We approximate the ERM as follows. Using support vector machines (SVMs) we use a high C value (105 in our experiments) to penalize more on training errors than on small margin (see deﬁnitions of the SVM parameters in, e.g. Chang & Lin, 2011). In this way the solution to the optimization problem tend to get closer to the ERM. In order to estimate ˆR( fx) we have to restrict the SVM optimizer to only consider hypotheses that classify the point x in a speciﬁc way. To accomplish this we use a weighted SVM for unbalanced data. We add the point x as another training point with weight 10 times larger than the weight of all training points combined. Thus, the penalty for misclassiﬁcation of x is very large and the optimizer ﬁnds a solution that doesn t violate the constraint. Another problem we face is that the disbelief index is a noisy statistic that highly depends on the sample Sm. To overcome this noise we use robust statistics. First we generate an odd number k of diﬀerent samples (S1 m, S2 m, . . . Sk m) using bootstrap sampling (we used k = 11). For each sample we calculate the disbelief index for all test points and for each point take the median of these measurements as the ﬁnal index. We also note that for any ﬁnite training sample the disbelief index is a discrete variable. It is often the case that several test points share the same disbelief index. In those cases we can use any conﬁdence measure as a tie breaker. In our experiments we use distance from decision boundary to break ties. Focusing on SVMs with a linear kernel we compared the RC (Risk-Coverage) curves achieved by the proposed method with those achieved by SVM with rejection based on distance from decision boundary. This latter approach is very common in practical applications of selective classiﬁcation. For implementation we used LIBSVM (Chang & Lin, 2011). We tested our algorithm on standard medical diagnosis problems from the UCI repository, including all datasets used by Grandvalet, Rakotomamonjy, Keshet, and Canu (2008). We transformed nominal features to numerical ones in a standard way using binary indicator attributes. We also normalized each attribute independently so that its dynamic range is [0, 1]. No other preprocessing was employed. In each iteration we choose uniformly at random non-overlapping training set (100 samples) and test set (200 samples) for each dataset.6 The SVM was trained on the entire training set, and test samples were sorted according to conﬁdence (either using distance from decision boundary or disbelief index). Figure 7 depicts the RC curves of our technique (red solid line) and rejection based on distance from decision boundary (green dashed line) for linear kernel on all 6 datasets. All results are averaged over 500 iterations (error bars show standard error). With the exception of the Hepatitis dataset, in which both methods were statistically indistinguishable, in all other datasets the proposed method exhibits signiﬁcant advantage over the traditional approach. We would like to highlight the performance of the proposed method on the Pima dataset. While the traditional approach cannot achieve error less than 8% for any

6 Due to the size of the Hepatitis dataset the test set was limited to 29 samples.

Agnostic Pointwise-Competitive Selective Classification

Figure 9: RC curves for SVM with RBF kernel. Our method in solid red and rejection based on distance from decision boundary in dashed green.

rejection rate, in our approach the test error decreases monotonically to zero with rejection rate. Furthermore, a clear advantage for our method over a large range of rejection rates is evident in the Haberman dataset.7. For the sake of fairness, we note that the running time of our algorithm (as presented here) is substantially longer than the traditional technique. The performance of our algorithm can be substantially improved when many unlabeled samples are available. In this case the rejection function can be evaluated on the unlabeled samples to generate a new labeled sample. Then a new rejection classiﬁer can be trained on this sample. Figure 8 depicts the maximum coverage for a distance-based rejection technique that allows the same error rate as our method with a speciﬁc coverage. For example, let us assume that our method can have an error rate of 10% with coverage of 60% and the

7 The Haberman dataset contains survival data of patients who had undergone surgery for breast cancer. With estimated 207,090 new cases of breast cancer in the united states during 2010 (Society, 2010) an improvement of 1% aﬀects the lives of more than 2000 women.

Wiener & El-Yaniv

Figure 10: SVM with RBF kernel. The maximum coverage for a distance-based rejection technique that allows the same error rate as our method with a speciﬁc coverage.

distance-based rejection technique achieves the same error with maximum coverage of 40%. Then the point (0.6, 0.4) will be on the red line. Thus, if the red line is bellow the diagonal then our technique has an advantage over distance-based rejection and visa versa. As an example, consider the Haberman dataset, and observe that regardless of the rejection rate, distance-based technique cannot achieve the same error as our technique with coverage lower than 80%. Figures 9 and 10 depict the results obtained with RBF kernel. In this case a statistically signiﬁcant advantage for our technique was observed for all datasets.

8. Related Work

Pointwise-competitive classiﬁcation is a unique and extreme instance of classiﬁcation with an abstention option, an idea which emerged from the pattern recognition community, was ﬁrst proposed and studied 50 years ago by Chow (1957, 1970), and generated lots of interest

Agnostic Pointwise-Competitive Selective Classification

(Fumera et al., 2001; Tortorella, 2001; Santos-Pereira & Pires, 2005; Fumera & Roli, 2002; Pietraszek, 2005; Bounsiar et al., 2006; Landgrebe et al., 2006; Herbei & Wegkamp, 2006; Hellman, 1970; El-Yaniv & Pidan, 2011; Bartlett & Wegkamp, 2008; Wegkap, 2007; Freund et al., 2004). Taking a broader perspective, pointwise-competitive selective prediction (and in particular, classiﬁcation) is a particular instance of the broader concept of conﬁdencerated learning, whereby the learner must formally quantify conﬁdence in its prediction. Achieving eﬀective conﬁdence-rated prediction (including abstention) is a longstanding and challenging goal in a number of disciplines and research communities. Let us ﬁrst discuss some of the most prominent approaches to conﬁdence-rated prediction and note how they related to the present work. In the knows-what-it-knows (KWIK) framework studied in reinforcement-learning (Li, Littman, & Walsh, 2008; Strehl & Littman, 2007; Li & Littman, 2010) a similar notion to pointwise competitiveness is studied, and coverage rates are analyzed (Li et al., 2008; Li, 2009). However, KWIK was limited to the realizable model and is concerned with an adversarial setting where both the target hypothesis and the training data are selected by an adversary. While all positive results for the KWIK adversarial setting apply to the statistical pointwise-competitive prediction setting (where training examples are sampled i.i.d.), this adversarial setting precludes non trivial coverage for all the interesting hypothesis classes currently addressed by pointwise-competitive prediction. This deﬁciency comes as no surprise because the KWIK adversarial setting is much more challenging than the statistical pointwise-competitive prediction assumptions. The conformal prediction framework (Vovk, Gammerman, & Shafer, 2005) provides hedged predictions by allowing the possibility of multi-labeled predictions and guarantees a user-desired conﬁdence rate in an asymptotic sense. Conformal prediction is mainly concerned with an online probabilistic setting. Rather than predicting a single label for each sample point, a conformal predictor can assign multiple labels. Any user-deﬁned conﬁdence level ǫ for the error rate can be asymptotically guaranteed. When interpreting these multi-labeled predictions as rejection, we can compare it to pointwise-competitive prediction. In this sense, conformal prediction can construct online predictors with a reject option that have asymptotic performance guarantees. A few important diﬀerences between conformal predictions and pointwise-competitive prediction can be pointed out. While both approaches provide hedged predictions, they use diﬀerent notions of hedging. Whereas in pointwise-competitive prediction the goal is to guarantee that with high probability over the training sample our predictor agrees with the best predictor in the same class over all points in the accepted domain, the goal in conformal predictions is to provide guarantees for the average error rate, where the average is taken over all possible samples and test points.8 In this sense, conformal prediction cannot achieve pointwise competitiveness. In addition, conformal prediction also utilizes a diﬀerent notion of error than the one used in the pointwise-competitive model. While pointwise-competitive prediction is focused on performance guarantees for the error rate only on the covered (accepted) examples, conformal prediction provides a guarantee for all examples (including those that have multiple predictions or none at all). By increasing the multi-labeled prediction rate (uncertain prediction),

8 As noted by Vovk et al.: It is impossible to achieve the conditional probability of error equal to ǫ given the observed examples, but it is the unconditional probability of error that equals ǫ. Therefore, it implicitly involves averaging over diﬀerent data sequences... (Vovk et al., 2005, p. 295).

Wiener & El-Yaniv

the error rate can be decreased to any arbitrarily small value. This is not the case with the pointwise-competitive prediction error notion on the covered examples, which is bounded below by the Bayes error on the covered region. Finally, conformal prediction mentions a notion of eﬃciency, which is similar to coverage but, to the best of our knowledge, no ﬁnite sample results have been established. Another interesting scheme in the vicinity of conﬁdence-rated learning is the the guaranteed error machine (GEM) (Campi, 2010). In the GEM model the reject option is considered as a correct answer, which means that risk can be reduced arbitrarily (as in conformal prediction). Pointwise-competitive classiﬁcation is a special case of pointwise-competitive prediction (El-Yaniv & Wiener, 2010, 2011; Wiener & El Yaniv, 2012; El-Yaniv & Wiener, 2012; Wiener, 2013; Wiener et al., 2014). Pointwise-competitive selective classiﬁcation was ﬁrst addressed by El-Yaniv and Wiener (2010) where the realizable case was studied (in that paper pointwise-competitiveness was termed perfect classiﬁcation ). The present article extends pointwise-competitive classiﬁcation to noisy problems There are also a number of theoretical studies of (general) selective classiﬁcation (not pointwise-competitive). Freund et al. (2004) studied a simple ensemble method for binary classiﬁcation. Given a hypothesis class F, the method outputs a weighted average of all the hypotheses in F, where the weight of each hypothesis exponentially depends on its individual training error. Their algorithm abstains from prediction whenever the weighted average of all individual predictions is close to zero. They were able to bound the probability of misclassiﬁcation by 2R(f ) + ǫ(m) and, under some conditions, they proved a bound of 5R(f ) + ǫ(F, m) on the rejection rate. The LESS strategy can be viewed as an extreme variation of the Freund et al. method. We include in our ensemble only hypotheses with suﬃciently low empirical error and we abstain if the weighted average of all predictions is not deﬁnitive ( = 1). Our risk and coverage bounds are asymptotically tighter. Excess risk bounds were developed by Herbei and Wegkamp (2006) for a model where each rejection incurs a cost in [0, 1/2]. Their bound applies to any empirical risk minimizer over a hypothesis class of ternary hypotheses (whose output is in { 1, reject}). See also various extensions by Wegkap (2007) and Bartlett and Wegkamp (2008). A rejection mechanism for SVMs based on distance from decision boundary is perhaps the most widely known and used rejection technique. It is routinely used in medical applications (Mukherjee et al., 1998; Guyon et al., 2002; Mukherjee, 2003). Few papers proposed alternative techniques for rejection in the case of SVMs. Those include taking the reject area into account during optimization (Fumera & Roli, 2002), training two SVM classiﬁers with asymmetric cost (Sousa, Mora, & Cardoso, 2009), and using a hinge loss (Bartlett & Wegkamp, 2008). Grandvalet et al. (2008) proposed an eﬃcient implementation of SVM with a reject option using a double hinge loss. They empirically compared their results with two other selective classiﬁers: the one proposed by Bartlett and Wegkamp (2008) and the traditional rejection based on distance from decision boundary. In their experiments there was no statistically signiﬁcant advantage to either method compared to the traditional approach for high rejection rates. Pointwise selective classiﬁcation is strongly tied to disagreement-based active learning. For the realizable case, El-Yaniv and Wiener (2012) presented a reduction of stream-based active learning with the CAL algorithm of Cohn et al. (1994) to pointwise-competitive classiﬁcation. This reduction roughly states that if the rejection rate (the reciprocal of

Agnostic Pointwise-Competitive Selective Classification

coverage) of LESS is O(polylog(m/δ)/m) then the problem (F, P) is actively learnable by CAL with exponential speedup. A consequence of this reduction resulted in the ﬁrst exponential speedup bounds for CAL with general linear models under any ﬁnite mixture of Gaussians. The other direction, showing that exponential speedup for CAL implies the above rejection rate for LESS (in the realizable setting) was recently established by Wiener (2013) and by Wiener, Hanneke, and El-Yaniv (2014) (using two diﬀerent techniques). The version space compression set size, which is extensively utilized in the present work, has been introduced implicitly by Hanneke (2007b) as a special case of the extended teaching dimension, and in that context, the version space compression set is called the minimal specifying set. It was introduced explicitly by El-Yaniv and Wiener (2010) in the context of selective classiﬁcation, and was proved by El-Yaniv and Wiener (2012) to be a special case of the extended teaching dimension of Hanneke (2007b). Relations between the disagreement coeﬃcient and the version space compression set size were ﬁrst discussed El-Yaniv and Wiener (2012). Sharp ties between these two quantities, such as those stated in Lemma 9, and others were very recently developed by Wiener, Hanneke, and El-Yaniv (2014).

9. Concluding Remarks

We ﬁnd the existence of pointwise-competitive classiﬁcation quite fascinating. The striking feature of such a classiﬁer is that, by deﬁnition, a pointwise-competitive predictor is free of estimation error and cannot overﬁt. This means that our hypothesis class can be as expressive as we like and still we will be protected from overﬁtting. However, without eﬀective coverage bounds our pointwise-competitive classiﬁer may refuse to predict at all times. The current paper, and recent studies on both selective prediction (El-Yaniv & Wiener, 2015) and active learning (Wiener, Hanneke, & El-Yaniv, 2014), place the version space compression set size at the center of stage, as a leading quantity that can drive results and intuition in both domains. At present, this is the only known technique able to prove fast coverage for pointwise-competitive classiﬁcation and exponential label complexity speedup for disagreement-based active learning for both general linear models under a ﬁxed mixture of Gaussians and axis aligned rectangles under product distributions.. Is it possible to extend these results beyond linear classiﬁers and axis aligned rectangles under interesting distribution families? For example, it is plausible that existing results for axis-aligned rectangles can be extended to decision trees. The formal relationship between active learning and pointwise-competitive classiﬁcation (El-Yaniv & Wiener, 2012; Wiener, 2013; Wiener et al., 2014) created a powerful synergy that allows for migrating results between these two models. Currently, this formal connection is manifested via two links. The ﬁrst, within a realizable setting, is the equivalence of LESS-based classiﬁcation with fast coverage to CAL-based active learning with exponential speedup. The second link consists of bounds that relate the underlying complexity measures: the disagreement coeﬃcient in active learning, and version space compression set size in pointwise-competitive classiﬁcation. A number of other non-established relations that can signiﬁcantly substantiate the interaction between the two problems could be considered. For example, is it possible to prove a direct equivalence between LESS-based

Wiener & El-Yaniv

pointwise-competitive agnostic classiﬁcation with fast coverage rates and LESS-based active learning with exponential speedup? We expect that a resolution of this question will have various interesting implications. For example, such a relationship could potentially facilitate the migration of very interesting algorithms and techniques devised for active learning to the pointwise-competitive framework. An immediate candidate is the algorithm of Beygelzimer et al. (2010), which builds on ideas of Dasgupta et al. (2007b) and Beygelzimer et al. (2009). Resembling the implementation proposed for LESS via calls to (a constrained) ERM oracle, this algorithm works without tracking the version space for both the ﬁnal choice of the hypothesis as well as the querying component. Instead, for querying, it relies on an ERM oracle that enforces at most one example-based constrain. Thus, the importanceweighting technique on which it is based resembles the disbelief principle we outline here. In this regard, it will be very interesting to also consider and migrate ideas from active learning algorithms emerging from the online learning branch (Orabona & Cesa-Bianchi, 2011; Cesa-Bianchi et al., 2009; Dekel et al., 2010) while using, as required, online to batch conversion techniques (Zhang, 2005; Kakade & Tewari, 2009; Cesa-Bianchi & Gentile, 2008; Dekel, 2008).

The LESS strategy requires a unanimous vote among all hypotheses in a low empirical error subset of the hypothesis class. When considering, e.g., linear models, this subset of hypotheses is uncountable, and in any case (even if it is ﬁnite) its size can be huge. Clearly, LESS is an extremely radical and defensive strategy. An immediate question that arises is whether the LESS unanimity requirement can be relaxed to a majority vote. Can we achieve pointwise competitiveness with only a (strong) majority vote instead of unanimity? Besides the greater ﬂexibility of a general voting scheme, which may lead to diﬀerent types of interesting learning algorithms, such a relaxation can potentially ease the computational complexity of implementing LESS (which, as discussed above, is a bottleneck in agnostic classiﬁcation). For example, with a relaxed voting scheme we might utilize hypothesis sampling, for which a classical example in a related context is the celebrated query-bycommittee (QBC) strategy (Seung et al., 1992; Freund et al., 1997; Fine et al., 2002; Gilad Bachrach, 2007; Gilad-Bachrach et al., 2005). However, if strict pointwise competitiveness is advocated, it is easy to see any strong majority vote is not suﬃcient. Indeed, consider an f that diﬀers from all other hypotheses in F on a single point in X. Unless the probability of this point is very large (not the typical case), with high probability this point is not part of the training set Sm, and therefore, any majority vote (even very strong) will label it the opposite of f . Hence, in the worst case, even a strong majority is not suﬃcient for pointwise competitiveness. As a natural compromise for the pointwise competitiveness objective, one can revert to standard excess-risk bounds (Bartlett et al., 2006) whereby we compare the overall average performance of our predictor, R(f), to that of the optimal predictor, R(f ) (not pointwise). In this regard, the work of Freund, Mansour, and Schapire (2004) discussed in Section 8, is such a result with its excess-risk bound R(f, g) 2R(f ) + O 1/(m1/2 θ)

(θ is a hyper-parameter) and coverage bound Φ(f, g) 1 5R(f ) O ln |F|/

m1/2 θ . Considering excess-risk bounds against f , is it possible to beat the above risk and coverage bounds using a relaxed voting scheme for rejection? What would be the optimal bounds in a fully agnostic setting? Can better bounds be devised for speciﬁc distributions like Gaussian mixtures? We note that the Freund et al. strategy is also interesting because

Agnostic Pointwise-Competitive Selective Classification

the ﬁnal aggregated predictor is in general outside of F and can, in principle, signiﬁcantly outperform f F (the above bound does not elicit such a behavior). This emphasizes the potential usefulness of ensembles, applied not only in the rejection scheme, but also in the ﬁnal predictor. Recall that in the LESS strategy the ﬁnal predictor always belongs to F. Thus, when considering ensembles and allowing excess-risk bounds, there can be even more ambitious goals, such as strictly beating f on average.

Acknowledgments

We thank the anonymous referees for their good comments, and are grateful to Steve Hanneke for helpful and insightful discussions. Also, we warmly thank the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI), and Israel Science Foundation (ISF) for their generous support.

Appendix A. Some Proofs

The proof of Lemma 9 below relies on the following Lemma 16 (Wiener et al., 2014), whose proof is also provided here for the sake of self-containment.

Lemma 16 (Wiener et al., 2014). In the realizable case, for any r0 (0, 1),

θ(r0) max max r (r0,1) 16Bˆn

Proof. We will prove that, for any r (0, 1),

r max 16Bˆn

, 512 . (12)

The result then follows by taking the supremum of both sides over r (r0, 1).

Fix r (0, 1), let m = 1/r , and for i {1, . . . , m}, deﬁne Sm\i = Sm \ {(xi, yi)}. Also deﬁne Dm\i = DIS(VSF,Sm\i B(f , r)) and m\i = P(xi Dm\i|Sm\i) = P(Dm\i Y). If B(f , r)m 512, (12) clearly holds. Otherwise, suppose B(f , r)m > 512. If xi DIS(VSF,Sm\i), then we must have (xi, yi) ˆCSm. So

1 DIS(VSF,Sm\i)(xi).

Wiener & El-Yaniv

P {ˆn(Sm) (1/16) B(f , r)m}

1 DIS(VSF,Sm\i)(xi) (1/16) B(f , r)m

1 Dm\i(xi) (1/16) B(f , r)m

1 DIS(B(f ,r))(xi) 1 Dm\i(xi)

1 DIS(B(f ,r))(xi) (1/16) B(f , r)m

1 DIS(B(f ,r))(xi) 1 Dm\i(xi)

1 DIS(B(f ,r))(xi) 1

16 B(f , r)m,

1 DIS(B(f ,r))(xi) < 7

8 B(f , r)m

1 DIS(B(f ,r))(xi) 1 Dm\i(xi)

1 DIS(B(f ,r))(xi) 1

16 B(f , r)m,

1 DIS(B(f ,r))(xi) 7

8 B(f , r)m

1 DIS(B(f ,r))(xi) < (7/8) B(f , r)m

1 DIS(B(f ,r))(xi) 1 Dm\i(xi) (13/16) B(f , r)m

Since we are considering the case B(f , r)m > 512, a Chernoﬀbound implies

1 DIS(B(f ,r))(xi) < (7/8) B(f , r)m

exp { B(f , r)m/128} < e 4.

Furthermore, Markov s inequality implies

1 DIS(B(f ,r))(xi) 1 Dm\i(xi) (13/16) B(f , r)m

m B(f , r) E h Pm i=1

1 Dm\i(xi) i

(13/16)m B(f , r) .

Since the xi values are exchangeable,

i=1 E h E h

1 Dm\i(xi) Sm\i ii =

i=1 E m\i = m E m\m .

Agnostic Pointwise-Competitive Selective Classification

it can be shown (Hanneke, 2012) that this is at least

m(1 r)m 1 B(f , r).

In particular, when B(f , r)m > 512, we must have r < 1/511 < 1/2, which implies (1 r) 1/r 1 1/4, so that we have

(1/4)m B(f , r).

Altogether, we have established that

P (ˆn(Sm) (1/16) B(f , r)m) < m B(f , r) (1/4)m B(f , r)

(13/16)m B(f , r) + e 4

13 + e 4 < 19/20.

Thus, since ˆn(Sm) Bˆn m, 1

20 with probability at least 19/20, we must have that

> (1/16) B(f , r)m (1/16) B(f , r)

Proof of Lemma 9. Assuming that Bˆn(m, δ) = O polylog(m) log 1

δ holds, there exists a constant δ1 (0, 1/20) for which Bˆn(m, δ1) = O (polylog(m)). Because Bˆn(m, δ) is nonincreasing with δ, Bˆn m, 1

20 Bˆn(m, δ1), and thus Bˆn m, 1

20 = O (polylog(m)). Therefore,

max m 1/r0 Bˆn

= O max m 1/r0 polylog(m) = O polylog 1

and using Lemma 16 we have,

θ(r0) max max m 1/r0 16Bˆn

528 + 16 max m 1/r0 Bˆn

= O polylog 1

Amaldi, E., & Kann, V. (1995). The complexity and approximability of ﬁnding maximum feasible subsystems of linear relations. Theoretical computer science, 147(1), 181 210.

Bartlett, P. L., Jordan, M. I., & Mc Auliﬀe, J. D. (2006). Convexity, classiﬁcation, and risk bounds. Journal of the American Statistical Association, 101(473), 138 156.

Bartlett, P., & Mendelson, S. (2006). Discussion of 2004 IMS medallion lecture: Local rademacher complexities and oracle inequalities in risk minimization by V. koltchinskii. Annals of Statistics, 34, 2657 2663.

Wiener & El-Yaniv

Bartlett, P., Mendelson, S., & Philips, P. (2004). Local complexities for empirical risk minimization. In COLT: Proceedings of the Workshop on Computational Learning Theory, Morgan Kaufmann Publishers.

Bartlett, P., & Wegkamp, M. (2008). Classiﬁcation with a reject option using a hinge loss. Journal of Machine Learning Research, 9, 1823 1840.

Ben-David, S., Eiron, N., & Long, P. (2003). On the diﬃculty of approximately maximizing agreements. Journal of Computer and System Sciences, 66(3), 496 514.

Beygelzimer, A., Dasgupta, S., & Langford, J. (2009). Importance weighted active learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 49 56. ACM.

Beygelzimer, A., Hsu, D., Langford, J., & Zhang, T. (2010). Agnostic active learning without constraints. Advances in Neural Information Processing Systems 23.

Beygelzimer, A., Dasgupta, S., & Langford, J. (2009). Importance weighted active learning. In Proceedings of the 26th annual international conference on machine learning, pp. 49 56. ACM.

Beygelzimer, A., Hsu, D., Langford, J., & Zhang, T. (2010). Agnostic active learning without constraints. ar Xiv preprint ar Xiv:1006.2588.

Bounsiar, A., Grall, E., & Beauseroy, P. (2006). A kernel based rejection method for supervised classiﬁcation. International Journal of Computational Intelligence, 3, 312 321.

Bousquet, O., Boucheron, S., & Lugosi, G. (2004). Introduction to statistical learning theory. In Advanced Lectures on Machine Learning, Vol. 3176 of Lecture Notes in Computer Science, pp. 169 207. Springer.

Campi, M. (2010). Classiﬁcation with guaranteed probability of error. Mach. Learn., 80(1), 63 84.

Cesa-Bianchi, N., & Gentile, C. (2008). Improved risk tail bounds for on-line algorithms. Information Theory, IEEE Transactions on, 54(1), 386 390.

Cesa-Bianchi, N., Gentile, C., & Orabona, F. (2009). Robust bounds for classiﬁcation via selective sampling. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 121 128. ACM.

Chang, C., & Lin, C. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 27:1 27:27. Software available at http://www.csie.ntu.edu.tw/ cjlin/libsvm .

Chow, C. (1957). An optimum character recognition system using decision function. IEEE Trans. Computer, 6(4), 247 254.

Chow, C. (1970). On optimum recognition error and reject trade-oﬀ. IEEE Trans. on Information Theory, 16, 41 36.

Cohn, D., Atlas, L., & Ladner, R. (1994). Improving generalization with active learning. Machine Learning, 15(2), 201 221.

Dasgupta, S., Hsu, D., & Monteleoni, C. (2007a). A general agnostic active learning algorithm. In NIPS.

Agnostic Pointwise-Competitive Selective Classification

Dasgupta, S., Monteleoni, C., & Hsu, D. J. (2007b). A general agnostic active learning algorithm. In Advances in neural information processing systems, pp. 353 360.

Dekel, O. (2008). From online to batch learning with cutoﬀ-averaging.. NIPS.

Dekel, O., Gentile, C., & Sridharan, K. (2010). Robust selective sampling from single and multiple teachers.. In COLT, pp. 346 358.

El-Yaniv, R., & Pidan, D. (2011). Selective prediction of ﬁnancial trends with hidden markov models. In NIPS, pp. 855 863.

El-Yaniv, R., & Wiener, Y. (2010). On the foundations of noise-free selective classiﬁcation. Journal of Machine Learning Research, 11, 1605 1641.

El-Yaniv, R., & Wiener, Y. (2011). Agnostic selective classiﬁcation. In Neural Information Processing Systems (NIPS).

El-Yaniv, R., & Wiener, Y. (2012). Active learning via perfect selective classiﬁcation. Journal of Machine Learning Research, 13, 255 279.

El-Yaniv, R., & Wiener, Y. (2015). On the version space compression set size and its applications. In Vovk, V., Papadopoulos, H., & Gammerman, A. (Eds.), Measures of Complexity: Festschrift for Alexey Chervonenkis. Springer, Berlin.

Fine, S., Gilad-Bachrach, R., & Shamir, E. (2002). Query by committee, linear separation and random walks. Theoretical Computer Science, 284(1), 25 51.

Freund, Y., Mansour, Y., & Schapire, R. (2004). Generalization bounds for averaged classiﬁers. Annals of Statistics, 32(4), 1698 1722.

Freund, Y., Seung, H., Shamir, E., & Tishby, N. (1997). Selective sampling using the query by committee algorithm. Machine Learning, 28, 133 168.

Friedman, E. (2009). Active learning for smooth problems. In Proceedings of the 22nd

Annual Conference on Learning Theory.

Fumera, G., & Roli, F. (2002). Support vector machines with embedded reject option. In Pattern Recognition with Support Vector Machines: First International Workshop, pp. 811 919.

Fumera, G., Roli, F., & Giacinto, G. (2001). Multiple reject thresholds for improving classiﬁcation reliability. Lecture Notes in Computer Science, 1876.

Gilad-Bachrach, R. (2007). To PAC and Beyond. Ph.D. thesis, the Hebrew University of Jerusalem.

Gilad-Bachrach, R., Navot, A., & Tishby, N. (2005). Query by committee made real. In NIPS.

Grandvalet, Y., Rakotomamonjy, A., Keshet, J., & Canu, S. (2008). Support vector machines with a reject option. In NIPS, pp. 537 544. MIT Press.

Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classiﬁcation using support vector machines.. Machine Learning, 389 422.

Hanneke, S. (2007a). A bound on the label complexity of agnostic active learning. In ICML, pp. 353 360.

Wiener & El-Yaniv

Hanneke, S. (2007b). Teaching dimension and the complexity of active learning. In Proceedings of the 20th Annual Conference on Learning Theory (COLT), Vol. 4539 of Lecture Notes in Artiﬁcial Intelligence, pp. 66 81.

Hanneke, S. (2009). Theoretical Foundations of Active Learning. Ph.D. thesis, Carnegie Mellon University.

Hanneke, S. (2013). A statistical theory of active learning. Unpublished.

Hanneke, S. (2012). Activized learning: Transforming passive to active with improved label complexity. The Journal of Machine Learning Research, 98888, 1469 1587.

Hellman, M. (1970). The nearest neighbor classiﬁcation rule with a reject option. IEEE Trans. on Systems Sc. and Cyb., 6, 179 185.

Herbei, R., & Wegkamp, M. (2006). Classiﬁcation with reject option. The Canadian Journal of Statistics, 34(4), 709 721.

Kakade, S., & Tewari, A. (2009). On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems (NIPS), pp. 801 808.

Koltchinskii, V. (2006). 2004 IMS medallion lecture: Local rademacher complexities and oracle inequalities in risk minimization. Annals of Statistics, 34, 2593 2656.

Landgrebe, T., Tax, D., Pacl ık, P., & Duin, R. (2006). The interaction between classiﬁcation and reject performance for distance-based reject-option classiﬁers. Pattern Recognition Letters, 27(8), 908 917.

Li, L., & Littman, M. L. (2010). Reducing reinforcement learning to kwik online regression. Annals of Mathematics and Artiﬁcial Intelligence, 217 237.

Li, L., Littman, M., & Walsh, T. (2008). Knows what it knows: a framework for self-aware learning. In Proceedings of the 25th international conference on Machine learning, pp. 568 575. ACM.

Li, L. (2009). A unifying framework for computational reinforcement learning theory. Ph.D. thesis, Rutgers, The State University of New Jersey.

Massart, P. (2000). Some applications of concentration inequalities to statistics. In Annales de la Facult e des Sciences de Toulouse, Vol. 9, pp. 245 303. Universit e Paul Sabatier.

Mendelson, S. (2002). Improving the sample complexity using global data. Information Theory, IEEE Transactions on, 48(7), 1977 1991.

Mukherjee, S. (2003). Chapter 9. classifying microarray data using support vector machines. In of scientists from the University of Pennsylvania School of Medicine and the School of Engineering and Applied Science. Kluwer Academic Publishers.

Mukherjee, S., Tamayo, P., Slonim, D., Verri, A., Golub, T., Mesirov, J. P., & Poggio, T. (1998). Support vector machine classiﬁcation of microarray data. Tech. rep., AI Memo 1677, Massachusetts Institute of Technology.

Orabona, F., & Cesa-Bianchi, N. (2011). Better algorithms for selective sampling. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 433 440.

Agnostic Pointwise-Competitive Selective Classification

Pietraszek, T. (2005). Optimizing abstaining classiﬁers using ROC analysis. In Proceedings of the Twenty-Second International Conference on Machine Learning(ICML), pp. 665 672.

Santos-Pereira, C., & Pires, A. (2005). On optimal reject rules and ROC curves. Pattern Recognition Letters, 26(7), 943 952.

Seung, H., Opper, M., & Sompolinsky, H. (1992). Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning theory (COLT), pp. 287 294.

Society, A. C. (2010). Cancer facts & ﬁgures 2010..

Sousa, R., Mora, B., & Cardoso, J. (2009). An ordinal data method for the classiﬁcation with reject option. In ICMLA, pp. 746 750. IEEE Computer Society.

Strehl, A. L., & Littman, M. L. (2007). Online linear regression and its application to model-based reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1417 1424.

Tauman Kalai, A., Klivans, A., Mansour, Y., & Servedio, R. (2008). Agnostically learning halfspaces. SIAM J. Comput., 37(6), 1777 1805.

Tortorella, F. (2001). An optimal reject rule for binary classiﬁers. Lecture Notes in Computer Science, 1876, 611 620.

Tsybakov, A. (2004). Optimal aggregation of classiﬁers in statistical learning. Annals of Mathematical Statistics, 32, 135 166.

Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic Learning in a Random World. Springer, New York.

Wang, L. (2011). Smoothness, disagreement coeﬃcient, and the label complexity of agnostic active learning. JMLR, 2269 2292.

Wegkap, M. (2007). Lasso type classiﬁers with a reject option. Electronic Journal of Statistics, 1, 155 168.

Wiener, Y. (2013). Theoretical Foundations of Selective Prediction. Ph.D. thesis, Technion Israel Institute of Technology.

Wiener, Y., & El Yaniv, R. (2012). Pointwise tracking the optimal regression function. In Advances in Neural Information Processing Systems 25, pp. 2051 2059.

Wiener, Y., Hanneke, S., & El-Yaniv, R. (2014). A compression technique for analyzing disagreement-based active learning. ar Xiv preprint ar Xiv:1404.1504.

Zhang, T. (2005). Data dependent concentration bounds for sequential prediction algorithms. In Learning Theory, pp. 173 187. Springer.