# On the Consistency of AUC Pairwise Optimization

Wei Gao and Zhi-Hua Zhou
National Key Laboratory for Novel Software Technology, Nanjing University
Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210023, China
{gaow, zhouzh}@lamda.nju.edu.cn
Supported by NSFC (61333014, 61321491) and Jiangsu SF (BK20140613).

**Abstract.** AUC (Area Under ROC Curve) has been an important criterion widely used in diverse learning tasks. To optimize AUC, many learning approaches have been developed, most working with pairwise surrogate losses. Thus, it is important to study AUC consistency based on minimizing pairwise surrogate losses. In this paper, we introduce a generalized calibration for AUC optimization and prove that it is a necessary condition for AUC consistency. We then provide a sufficient condition for AUC consistency and show its usefulness in studying the consistency of various surrogate losses, as well as in the invention of new consistent losses. We further derive regret bounds for exponential and logistic losses, and present regret bounds for more general surrogate losses in the realizable setting. Finally, we prove regret bounds that disclose the equivalence between the pairwise exponential loss of AUC and the univariate exponential loss of accuracy.

## 1 Introduction

AUC (Area Under ROC Curve) has been an important criterion widely used in diverse learning tasks [Freund et al., 2003; Kotlowski et al., 2011; Flach et al., 2011; Zuva and Zuva, 2012]. Owing to its non-convexity and discontinuity, direct optimization of AUC often leads to NP-hard problems. To avoid these computational difficulties, many pairwise surrogate losses, e.g., the exponential loss [Freund et al., 2003; Rudin and Schapire, 2009], the hinge loss [Brefeld and Scheffer, 2005; Joachims, 2005; Zhao et al., 2011] and the least square loss [Gao et al., 2013], have been widely adopted in practical algorithms. It is therefore important to study the consistency of these pairwise surrogate losses; in other words, does the expected risk of learning with a surrogate loss converge to the Bayes risk? Here, consistency (also known as Bayes consistency) guarantees that optimizing a surrogate loss yields a solution attaining the Bayes risk in the limit of infinite sample.

This work presents a theoretical study of the consistency of AUC optimization based on minimizing pairwise surrogate losses. The main contributions include:

i) We introduce the generalized calibration and prove that it is necessary yet insufficient for AUC consistency (cf. Theorem 1). This is because, for pairwise surrogate losses, minimizing the expected risk over the whole distribution is not equivalent to minimizing the conditional risk on each pair of instances from different classes. For example, the hinge loss and the absolute loss are shown to be calibrated but inconsistent with AUC.

ii) We provide a sufficient condition for AUC consistency based on minimizing pairwise surrogate losses (cf. Theorem 2). From this finding, we prove that the exponential loss, the logistic loss and the distance-weighted loss are consistent with AUC. In addition, this result suggests the invention of new consistent surrogate losses such as the q-norm hinge loss and the general hinge loss.

iii) We present regret bounds for exponential and logistic losses (cf. Theorem 3 and Corollary 5). For general surrogate losses, we present regret bounds in the realizable setting (cf. Theorem 4).
iv) We provide regret bounds that disclose the equivalence (cf. Theorems 5 and 6) between the pairwise exponential surrogate loss of AUC and the univariate exponential surrogate loss of accuracy. As a result, the univariate exponential loss is consistent with AUC, and the pairwise exponential loss is consistent with accuracy by selecting a proper threshold. One direct consequence of this finding is the equivalence between AdaBoost and RankBoost in the limit of infinite sample.

**Related Work.** The study of AUC can be traced back to the 1970s in signal detection theory [Egan, 1975], and AUC has been an important performance measure for information retrieval and learning to rank, especially in bipartite ranking [Cohen et al., 1999; Freund et al., 2003; Rudin and Schapire, 2009].

Consistency has been an important issue. Zhang [2004b] and Bartlett et al. [2006] provided the fundamental analysis for binary classification, and many algorithms such as boosting and SVMs were proven to be consistent. The consistency of multi-class and multi-label learning has been addressed in [Zhang, 2004a; Tewari and Bartlett, 2007] and [Gao and Zhou, 2013], respectively. Much attention has also been paid to the consistency of learning to rank [Cossock and Zhang, 2008; Xia et al., 2008; Duchi et al., 2010].

It is noteworthy that previous consistency studies focus on univariate surrogate losses over single instances [Zhang, 2004b; Bartlett et al., 2006], whereas pairwise surrogate losses are defined on pairs of instances from different classes. This difference brings a challenge for studying AUC consistency: for univariate surrogate losses, it is sufficient to study the conditional risk; for pairwise surrogate losses, however, the whole distribution has to be considered (cf. Lemma 1), because minimizing the expected risk over the whole distribution is not equivalent to minimizing the conditional risk.

Duchi et al. [2010] explored the consistency of supervised ranking, which is different from our setting: they considered instances consisting of a query, a set of inputs and a weighted graph, and the goal is to order the inputs according to the weighted graph; we consider instances with positive or negative labels, and aim to rank positive instances higher than negative ones. Clémençon et al. [2008] studied the consistency of ranking and showed that calibration is a necessary and sufficient condition; we study the consistency of score functions under pairwise surrogate losses, for which calibration is necessary but insufficient (cf. Theorem 1). Kotlowski et al. [2011] studied AUC consistency based on minimizing univariate exponential and logistic losses, and this study was generalized to proper (composite) surrogate losses in [Agarwal, 2013]. These studies focused on univariate surrogate losses, whereas our work considers pairwise surrogate losses. Almost at the same time as our earlier version [Gao and Zhou, 2012], Uematsu and Lee [2012] provided a sufficient condition similar to Theorem 2 (as shown in Section 3.2), but with different proof techniques. Later, our Theorem 2 was extended by Menon and Williamson [2014]. Neither [Uematsu and Lee, 2012] nor [Menon and Williamson, 2014] provides the other three contributions (i.e., i, iii and iv in the previous section) of our work.

The rest of the paper is organized as follows. Section 2 introduces preliminaries. Section 3 presents consistency conditions.
Section 4 gives regret bounds. Section 5 discloses the equivalence of exponential losses. Section 6 concludes.

## 2 Preliminaries

Let $\mathcal{X}$ and $\mathcal{Y} = \{+1, -1\}$ be the input and output spaces, respectively. Suppose that $\mathcal{D}$ is an unknown distribution over $\mathcal{X} \times \mathcal{Y}$, and $\mathcal{D}_{\mathcal{X}}$ denotes the instance-marginal distribution over $\mathcal{X}$. Let $\eta(x) = \Pr[y = +1 \mid x]$ be the conditional probability at $x$. For a score function $f\colon \mathcal{X} \to \mathbb{R}$, the AUC w.r.t. distribution $\mathcal{D}$ is defined as
$$\mathbb{E}\big[\mathbb{I}[(y - y')(f(x) - f(x')) > 0] + \tfrac{1}{2}\mathbb{I}[f(x) = f(x')] \,\big|\, y \neq y'\big],$$
where $(x, y)$ and $(x', y')$ are drawn i.i.d. from $\mathcal{D}$, and $\mathbb{I}[\cdot]$ is the indicator function, which returns 1 if the argument is true and 0 otherwise. Maximizing the AUC is equivalent to minimizing the expected risk
$$R(f) = \mathbb{E}\big[\eta(x)(1 - \eta(x'))\,\ell(f, x, x') + \eta(x')(1 - \eta(x))\,\ell(f, x', x)\big], \tag{1}$$
where the expectation is taken over $x$ and $x'$ drawn i.i.d. from $\mathcal{D}_{\mathcal{X}}$, and
$$\ell(f, x, x') = \mathbb{I}[f(x) < f(x')] + \tfrac{1}{2}\mathbb{I}[f(x) = f(x')]$$
is called the ranking loss. Write the Bayes risk $R^* = \inf_f R(f)$; the set of Bayes optimal functions is then
$$\mathcal{B} = \{f \colon R(f) = R^*\} = \{f \colon (f(x) - f(x'))(\eta(x) - \eta(x')) > 0 \text{ if } \eta(x) \neq \eta(x')\}. \tag{2}$$

The ranking loss $\ell$ is non-convex and discontinuous, and optimizing it directly often leads to NP-hard problems. In practice, we consider pairwise surrogate losses of the form
$$\Psi(f, x, x') = \phi(f(x) - f(x')),$$
where $\phi$ is a convex function, e.g., the exponential loss $\phi(t) = e^{-t}$ [Freund et al., 2003; Rudin and Schapire, 2009], the hinge loss $\phi(t) = \max(0, 1 - t)$ [Brefeld and Scheffer, 2005; Joachims, 2005; Zhao et al., 2011], etc. We define the expected $\phi$-risk as
$$R_\phi(f) = \mathbb{E}_{x, x' \sim \mathcal{D}_{\mathcal{X}}^2}\big[\eta(x)(1 - \eta(x'))\phi(f(x) - f(x')) + \eta(x')(1 - \eta(x))\phi(f(x') - f(x))\big], \tag{3}$$
and denote $R_\phi^* = \inf_f R_\phi(f)$. Given two instances $x, x' \in \mathcal{X}$, we define the conditional $\phi$-risk as
$$C(x, x', \alpha) = \eta(x)(1 - \eta(x'))\phi(\alpha) + \eta(x')(1 - \eta(x))\phi(-\alpha), \tag{4}$$
where $\alpha = f(x) - f(x')$. For simplicity, we write $\eta = \eta(x)$ and $\eta' = \eta(x')$ when it is clear from the context.
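To make these definitions concrete, the following is a minimal numerical sketch (not part of the original paper) that evaluates the AUC risk of Eqn. (1) and the pairwise surrogate risk of Eqn. (3) on a hypothetical finite instance space. The marginal probabilities, conditional probabilities and score values below are illustrative assumptions, as are the helper names `auc_risk` and `phi_risk`.

```python
import numpy as np

# Hypothetical three-point instance space (illustrative values, not from the paper).
p   = np.array([1/3, 1/3, 1/3])    # marginal distribution D_X over {x1, x2, x3}
eta = np.array([0.4, 0.45, 0.6])   # conditional probabilities eta(x_i) = Pr[y = +1 | x_i]
f   = np.array([0.0, 0.3, 1.0])    # an arbitrary score function f(x_i)

def ranking_loss(fi, fj):
    """ell(f, x, x') = I[f(x) < f(x')] + 1/2 * I[f(x) = f(x')]."""
    return float(fi < fj) + 0.5 * float(fi == fj)

def auc_risk(f, eta, p, loss=ranking_loss):
    """Expected risk of Eqn. (1) on a discrete marginal distribution."""
    r = 0.0
    for i in range(len(p)):
        for j in range(len(p)):
            r += p[i] * p[j] * (eta[i] * (1 - eta[j]) * loss(f[i], f[j])
                                + eta[j] * (1 - eta[i]) * loss(f[j], f[i]))
    return r

def phi_risk(f, eta, p, phi=lambda t: np.exp(-t)):
    """Expected phi-risk of Eqn. (3) for a pairwise surrogate phi(f(x) - f(x'))."""
    return auc_risk(f, eta, p, loss=lambda fi, fj: phi(fi - fj))

print("AUC risk R(f)        :", auc_risk(f, eta, p))
print("exponential phi-risk :", phi_risk(f, eta, p))
print("hinge phi-risk       :", phi_risk(f, eta, p, phi=lambda t: max(0.0, 1.0 - t)))
```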
## 3 AUC Consistency

We first define AUC consistency as follows:

**Definition 1** The surrogate loss $\phi$ is said to be consistent with AUC if, for every sequence $\{f_n(x)\}_{n \geq 1}$ and every distribution $\mathcal{D}$ on $\mathcal{X} \times \mathcal{Y}$,
$$R_\phi(f_n) \to R_\phi^* \quad \text{implies} \quad R(f_n) \to R^*.$$

In binary classification, recall the notion of classification calibration, which is a necessary and sufficient condition for consistency w.r.t. the 0/1 error [Bartlett et al., 2006]. A surrogate loss $\phi$ is said to be classification-calibrated if, for every $x \in \mathcal{X}$ with $\eta(x) \neq 1/2$,
$$\inf_{f(x)(1 - 2\eta(x)) \geq 0} \{\eta(x)\phi(f(x)) + (1 - \eta(x))\phi(-f(x))\} > \inf_{f(x) \in \mathbb{R}} \{\eta(x)\phi(f(x)) + (1 - \eta(x))\phi(-f(x))\}.$$
We now generalize this notion to AUC calibration:

**Definition 2** The surrogate loss $\phi$ is said to be calibrated if $H^-(\eta, \eta') > H(\eta, \eta')$ for any $\eta \neq \eta'$, where
$$H^-(\eta, \eta') = \inf_{\alpha\colon \alpha(\eta - \eta') \leq 0} C(x, x', \alpha), \qquad H(\eta, \eta') = \inf_{\alpha \in \mathbb{R}} C(x, x', \alpha),$$
and $C(x, x', \alpha)$ is defined in Eqn. (4).

We first have
$$R_\phi^* = \inf_f R_\phi(f) \geq \mathbb{E}_{x, x' \sim \mathcal{D}_{\mathcal{X}}^2}\Big[\inf_\alpha C(\eta(x), \eta(x'), \alpha)\Big]. \tag{5}$$
It is noteworthy that equality does not hold for many surrogate losses, as shown by the following lemma:

**Lemma 1** For the hinge loss $\phi(t) = \max(0, 1-t)$, the least square hinge loss $\phi(t) = (\max(0, 1-t))^2$, the least square loss $\phi(t) = (1-t)^2$ and the absolute loss $\phi(t) = |1-t|$, we have
$$\inf_f R_\phi(f) > \mathbb{E}_{x, x' \sim \mathcal{D}_{\mathcal{X}}^2}\Big[\inf_\alpha C(\eta(x), \eta(x'), \alpha)\Big].$$

**Proof** We present a detailed proof for the hinge loss by contradiction; similar arguments apply to the other losses. Suppose that there exists a function $f^*$ s.t. $R_\phi(f^*) = \mathbb{E}_{x, x' \sim \mathcal{D}_{\mathcal{X}}^2}[\inf_\alpha C(\eta(x), \eta(x'), \alpha)]$. Consider three instances $x_1, x_2, x_3 \in \mathcal{X}$ s.t. $\eta(x_1) < \eta(x_2) < \eta(x_3)$. The conditional risk of the hinge loss is given by
$$C(x, x', \alpha) = \eta(x)(1 - \eta(x'))\max(0, 1 - \alpha) + \eta(x')(1 - \eta(x))\max(0, 1 + \alpha).$$
Minimizing $C(x, x', \alpha)$ gives $\alpha^* = -1$ if $\eta(x) < \eta(x')$. This yields $f^*(x_1) - f^*(x_2) = -1$, $f^*(x_1) - f^*(x_3) = -1$ and $f^*(x_2) - f^*(x_3) = -1$, which contradict each other (the first and third equalities force $f^*(x_1) - f^*(x_3) = -2$). □

From Lemma 1, the study of AUC consistency should focus on the expected $\phi$-risk over the whole distribution rather than on the conditional $\phi$-risk of each pair of instances. This is quite different from binary classification, where minimizing the expected risk over the whole distribution is equivalent to minimizing the conditional risk on each instance, and the analysis can thus focus on the conditional risk, as illustrated in [Zhang, 2004b; Bartlett et al., 2006].

### 3.1 Calibration is Necessary yet Insufficient for AUC Consistency

We first prove that calibration is a necessary condition for AUC consistency:

**Lemma 2** If the surrogate loss $\phi$ is consistent with AUC, then $\phi$ is calibrated; moreover, if $\phi$ is convex, then it is differentiable at $t = 0$ with $\phi'(0) < 0$.

**Proof** If $\phi$ is not calibrated, then there exist $\eta_0$ and $\eta_0'$ s.t. $\eta_0 > \eta_0'$ and $H^-(\eta_0, \eta_0') = H(\eta_0, \eta_0')$. This implies the existence of some $\alpha_0 \leq 0$ such that
$$\eta_0(1 - \eta_0')\phi(\alpha_0) + \eta_0'(1 - \eta_0)\phi(-\alpha_0) = \inf_{\alpha \in \mathbb{R}}\{\eta_0(1 - \eta_0')\phi(\alpha) + \eta_0'(1 - \eta_0)\phi(-\alpha)\}.$$
Consider the instance space $\mathcal{X} = \{x_1, x_2\}$ with marginal probabilities $\Pr[x_1] = \Pr[x_2] = 1/2$, $\eta(x_1) = \eta_0$ and $\eta(x_2) = \eta_0'$. We construct a sequence $\{f_n^*\}_{n \geq 1}$ by selecting $f_n^*(x_1) = f_n^*(x_2) + \alpha_0$; it is easy to see that $R_\phi(f_n^*) \to R_\phi^*$ yet $R(f_n^*) - R^* \geq (\eta_0 - \eta_0')/8 > 0$ as $n \to \infty$, which shows that calibration is a necessary condition.

To prove $\phi'(0) < 0$, consider the instance space $\mathcal{X} = \{x_1, x_2\}$ with $\Pr[x_1] = \Pr[x_2] = 1/2$, $\eta(x_1) = \eta_1$ and $\eta(x_2) = \eta_2$, where $\eta_1 > \eta_2$. Assume first that $\phi$ is differentiable at $t = 0$ with $\phi'(0) \geq 0$. For convex $\phi$, we have
$$\eta_1(1 - \eta_2)\phi(\alpha) + \eta_2(1 - \eta_1)\phi(-\alpha) \geq (\eta_1 - \eta_2)\alpha\phi'(0) + (\eta_1(1 - \eta_2) + \eta_2(1 - \eta_1))\phi(0) \geq (\eta_1(1 - \eta_2) + \eta_2(1 - \eta_1))\phi(0)$$
for $(\eta_1 - \eta_2)\alpha \geq 0$. It follows that
$$H(\eta_1, \eta_2) = \min\Big\{\inf_{(\eta_1 - \eta_2)\alpha \geq 0} C(\eta_1, \eta_2, \alpha),\; \inf_{(\eta_1 - \eta_2)\alpha \leq 0} C(\eta_1, \eta_2, \alpha)\Big\} = \inf_{(\eta_1 - \eta_2)\alpha \leq 0} C(\eta_1, \eta_2, \alpha) = H^-(\eta_1, \eta_2),$$
which is contrary to $H^-(\eta_1, \eta_2) > H(\eta_1, \eta_2)$.

Suppose now that $\phi$ is not differentiable at $t = 0$. Then there exist two subgradients $g_1 > g_2$ such that $\phi(t) \geq g_1 t + \phi(0)$ and $\phi(t) \geq g_2 t + \phi(0)$ for all $t \in \mathbb{R}$. If $g_1 > g_2 \geq 0$, select $\eta_1 = g_1/(g_1 + g_2)$ and $\eta_2 = g_2/(g_1 + g_2)$. It is obvious that $\eta_1 > \eta_2$, and for any $\alpha \geq 0$ we have
$$\eta_1(1 - \eta_2)\phi(\alpha) + \eta_2(1 - \eta_1)\phi(-\alpha) \geq \eta_1(1 - \eta_2)(g_2\alpha + \phi(0)) + \eta_2(1 - \eta_1)(-g_1\alpha + \phi(0)) \geq (\eta_1(1 - \eta_2) + \eta_2(1 - \eta_1))\phi(0).$$
In a similar manner, we can prove $\eta_1(1 - \eta_2)\phi(\alpha) + \eta_2(1 - \eta_1)\phi(-\alpha) \geq (\eta_1(1 - \eta_2) + \eta_2(1 - \eta_1))\phi(0)$ for the cases $g_1 \geq 0 > g_2$, $g_1 > 0 \geq g_2$ and $0 \geq g_1 > g_2$ with suitable $\eta_1 > \eta_2$, whenever $(\eta_1 - \eta_2)\alpha \geq 0$. It follows that $H(\eta_1, \eta_2) = H^-(\eta_1, \eta_2)$, which is contrary to the consistency of $\phi$. □
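As a concrete illustration of Definition 2 (a sketch that is not part of the paper), the calibration quantities $H$ and $H^-$ can be approximated on a grid of $\alpha$ values. The conditional probabilities, grid range and helper name `cond_risk` below are illustrative assumptions; the example uses the hinge loss, for which the next lemma shows that calibration alone is not enough.

```python
import numpy as np

# Hypothetical conditional probabilities with eta > eta' (illustrative values).
eta, eta_p = 0.7, 0.3
hinge = lambda t: np.maximum(0.0, 1.0 - t)

def cond_risk(alpha, phi):
    """Conditional phi-risk C(eta, eta', alpha) of Eqn. (4)."""
    return eta * (1 - eta_p) * phi(alpha) + eta_p * (1 - eta) * phi(-alpha)

alphas  = np.linspace(-5.0, 5.0, 100001)                 # coarse grid over alpha
H       = cond_risk(alphas, hinge).min()                  # H(eta, eta'): inf over all alpha
H_minus = cond_risk(alphas[alphas * (eta - eta_p) <= 0], hinge).min()  # wrong-sign alphas only

print(f"H(eta, eta')   ~ {H:.4f}")        # attained near alpha = 1: 2*eta'*(1 - eta) = 0.18
print(f"H^-(eta, eta') ~ {H_minus:.4f}")  # attained at alpha = 0: eta(1-eta') + eta'(1-eta) = 0.58
print("calibrated (H^- > H):", H_minus > H)
```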
For the converse direction, observe that the hinge loss and the absolute loss are convex with $\phi'(0) < 0$, and thus calibrated, yet they are inconsistent with AUC:

**Lemma 3** For the hinge loss $\phi(t) = \max(0, 1-t)$ and the absolute loss $\phi(t) = |1-t|$, the surrogate loss $\Psi(f, x, x') = \phi(f(x) - f(x'))$ is inconsistent with AUC.

**Proof** We present a detailed proof for the hinge loss; a similar proof applies to the absolute loss. Consider the instance space $\mathcal{X} = \{x_1, x_2, x_3\}$, and assume that, for $1 \leq i \leq 3$, the marginal probability is $\Pr[x_i] = 1/3$ and the conditional probabilities $\eta_i = \eta(x_i)$ satisfy $\eta_1 < \eta_2 < \eta_3$, $2\eta_2 < \eta_1 + \eta_3$ and $2\eta_1 > \eta_2 + \eta_1\eta_3$. Write $f_i = f(x_i)$ for $1 \leq i \leq 3$. Eqn. (3) gives
$$R_\phi(f) = \kappa_0 + \kappa_1 \sum_{j \neq i} \eta_i(1 - \eta_j)\max(0, 1 - f_i + f_j),$$
where $\kappa_0 > 0$ and $\kappa_1 > 0$ are constants independent of $f$. Minimizing $R_\phi(f)$ yields the optimal expected $\phi$-risk
$$R_\phi^* = \kappa_0 + \kappa_1(3\eta_1 + 3\eta_2 - 2\eta_1\eta_2 - 2\eta_1\eta_3 - 2\eta_2\eta_3),$$
attained at $f^* = (f_1^*, f_2^*, f_3^*)$ with $f_1^* = f_2^* = f_3^* - 1$. Note that $f' = (f_1', f_2', f_3')$ with $f_1' + 1 = f_2' = f_3' - 1$ is not an optimal solution w.r.t. the hinge loss, since
$$R_\phi(f') = \kappa_0 + \kappa_1(5\eta_1 + 2\eta_2 - 2\eta_1\eta_2 - 3\eta_1\eta_3 - 2\eta_2\eta_3) = R_\phi^* + \kappa_1(2\eta_1 - \eta_2 - \eta_1\eta_3) > R_\phi^*.$$
In particular, the $\phi$-optimal solution $f^*$ ties $x_1$ and $x_2$ and hence $f^* \notin \mathcal{B}$, with $R(f^*) = R^* + \kappa_1(\eta_2 - \eta_1)/2 > R^*$. This completes the proof. □

Together with Lemma 2 and Lemma 3, we have

**Theorem 1** Calibration is necessary yet insufficient for AUC consistency.

The study of AUC consistency is thus not parallel to that of binary classification, where classification calibration is necessary and sufficient for consistency w.r.t. the 0/1 error [Bartlett et al., 2006]. The main difference is that, for AUC consistency, minimizing the expected risk over the whole distribution is not equivalent to minimizing the conditional risk on each pair of instances, as shown by Lemma 1.

### 3.2 Sufficient Condition for AUC Consistency

We now present a sufficient condition for AUC consistency.

**Theorem 2** The surrogate loss $\Psi(f, x, x') = \phi(f(x) - f(x'))$ is consistent with AUC if $\phi\colon \mathbb{R} \to \mathbb{R}$ is a convex, differentiable and non-increasing function with $\phi'(0) < 0$.

**Proof** It suffices to prove $\inf_{f \notin \mathcal{B}} R_\phi(f) > \inf_f R_\phi(f)$ for convex, differentiable and non-increasing $\phi$ with $\phi'(0) < 0$. Assume $\inf_{f \notin \mathcal{B}} R_\phi(f) = \inf_f R_\phi(f)$, i.e., there is an optimal function $f^*$ with $R_\phi(f^*) = \inf_f R_\phi(f)$ and $f^* \notin \mathcal{B}$; that is, for some $x_1, x_2 \in \mathcal{X}$, we have $f^*(x_1) \leq f^*(x_2)$ yet $\eta(x_1) > \eta(x_2)$. Recall the definition of the $\phi$-risk in Eqn. (3):
$$R_\phi(f) = \int_{\mathcal{X} \times \mathcal{X}} \eta(x)(1 - \eta(x'))\phi(f(x) - f(x')) + \eta(x')(1 - \eta(x))\phi(f(x') - f(x))\, d\Pr(x)\, d\Pr(x').$$
Introduce the function $h_1$ with $h_1(x_1) = 1$ and $h_1(x) = 0$ for $x \neq x_1$, and write $g(\gamma) = R_\phi(f^* + \gamma h_1)$ for $\gamma \in \mathbb{R}$; thus $g$ is convex. For the optimal function $f^*$, we have $g'(0) = 0$, which implies
$$\int_{\mathcal{X} \setminus \{x_1\}} \eta(x_1)(1 - \eta(x))\phi'(f^*(x_1) - f^*(x)) - \eta(x)(1 - \eta(x_1))\phi'(f^*(x) - f^*(x_1))\, d\Pr(x) = 0. \tag{6}$$
Similarly, we have
$$\int_{\mathcal{X} \setminus \{x_2\}} \eta(x_2)(1 - \eta(x))\phi'(f^*(x_2) - f^*(x)) - \eta(x)(1 - \eta(x_2))\phi'(f^*(x) - f^*(x_2))\, d\Pr(x) = 0. \tag{7}$$
For convex, differentiable and non-increasing $\phi$, we have $\phi'(f^*(x_1) - f^*(x)) \leq \phi'(f^*(x_2) - f^*(x)) \leq 0$ if $f^*(x_1) \leq f^*(x_2)$. It follows that
$$\eta(x_1)\phi'(f^*(x_1) - f^*(x)) \leq \eta(x_2)\phi'(f^*(x_2) - f^*(x)) \tag{8}$$
for $\eta(x_1) > \eta(x_2)$. In a similar manner, we have
$$(1 - \eta(x_2))\phi'(f^*(x) - f^*(x_2)) \leq (1 - \eta(x_1))\phi'(f^*(x) - f^*(x_1)). \tag{9}$$
If $f^*(x_1) = f^*(x_2)$, then
$$\eta(x_1)(1 - \eta(x_2))\phi'(f^*(x_1) - f^*(x_2)) - \eta(x_2)(1 - \eta(x_1))\phi'(f^*(x_2) - f^*(x_1)) < 0$$
from $\phi'(0) < 0$ and $\eta(x_1) > \eta(x_2)$, which contradicts Eqns. (6) and (7) (subtract (7) from (6) and apply Eqns. (8) and (9)). If $f^*(x_1) < f^*(x_2)$, then $\phi'(f^*(x_1) - f^*(x_2)) \leq \phi'(0) < 0$, $\phi'(f^*(x_1) - f^*(x_2)) \leq \phi'(f^*(x_2) - f^*(x_1)) \leq 0$, and
$$\eta(x_1)(1 - \eta(x_2))\phi'(f^*(x_1) - f^*(x_2)) < \eta(x_2)(1 - \eta(x_1))\phi'(f^*(x_2) - f^*(x_1)),$$
which again contradicts Eqns. (6) and (7) when combined with Eqns. (8) and (9). The theorem follows as desired. □

From Theorem 2, we have

**Corollary 1** For the exponential loss $\phi(t) = e^{-t}$ and the logistic loss $\phi(t) = \ln(1 + e^{-t})$, the surrogate loss $\Psi(f, x, x') = \phi(f(x) - f(x'))$ is consistent with AUC.

Marron et al. [2007] introduced the distance-weighted discrimination method for high-dimensional data with small sample size; its loss was reformulated by Bartlett et al. [2006] as, for any $\epsilon > 0$, $\phi(t) = 1/t$ for $t \geq \epsilon$, and $\phi(t) = \frac{1}{\epsilon}\big(2 - \frac{t}{\epsilon}\big)$ otherwise.

**Corollary 2** For the distance-weighted loss, the surrogate loss $\Psi(f, x, x') = \phi(f(x) - f(x'))$ is consistent with AUC.

Lemma 3 proves the inconsistency of the hinge loss, and also shows the difficulty of achieving consistency without differentiability.
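As a small numerical sanity check on Lemma 3 and Theorem 2 (a sketch that is not part of the paper, assuming NumPy and SciPy are available), one can minimize the empirical pairwise $\phi$-risk over a three-point instance space and compare the hinge and exponential surrogates. The probabilities below are illustrative values chosen to satisfy the assumptions of Lemma 3, and `pairwise_risk` is a hypothetical helper.

```python
import numpy as np
from scipy.optimize import minimize

# Three-point instance space as in Lemma 3 (illustrative values satisfying
# eta1 < eta2 < eta3, 2*eta2 < eta1 + eta3 and 2*eta1 > eta2 + eta1*eta3).
p   = np.array([1/3, 1/3, 1/3])
eta = np.array([0.4, 0.45, 0.6])

def pairwise_risk(f, phi):
    """Expected phi-risk of Eqn. (3) on the discrete space."""
    r = 0.0
    for i in range(3):
        for j in range(3):
            r += p[i] * p[j] * (eta[i] * (1 - eta[j]) * phi(f[i] - f[j])
                                + eta[j] * (1 - eta[i]) * phi(f[j] - f[i]))
    return r

hinge = lambda t: max(0.0, 1.0 - t)
expo  = lambda t: np.exp(-t)

for name, phi in [("hinge", hinge), ("exponential", expo)]:
    res = minimize(lambda f: pairwise_risk(f, phi), x0=np.zeros(3),
                   method="Nelder-Mead", options={"xatol": 1e-9, "fatol": 1e-12})
    f1, f2, f3 = res.x
    print(f"{name:12s}  f2 - f1 = {f2 - f1:+.3f}   f3 - f2 = {f3 - f2:+.3f}")

# Expected outcome: the hinge minimizer ties x1 and x2 (f2 - f1 ~ 0, f3 - f2 ~ 1), so it
# is not Bayes optimal for AUC, whereas the exponential minimizer keeps the strict order
# f1 < f2 < f3, in line with Theorem 2 and Corollary 1.
```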
We now derive some variants of the hinge loss that are consistent. For example, the q-norm hinge loss $\phi(t) = (\max(0, 1-t))^q$ with $q > 1$ is consistent:

**Corollary 3** For the q-norm hinge loss, the surrogate loss $\Psi(f, x, x') = \phi(f(x) - f(x'))$ is consistent with AUC.

From this corollary, it is immediate to get the consistency of the least square hinge loss $\phi(t) = (\max(0, 1-t))^2$. For $\epsilon > 0$, define the general hinge loss as $\phi(t) = 1 - t$ for $t \leq 1 - \epsilon$; $\phi(t) = 0$ for $t \geq 1 + \epsilon$; and $\phi(t) = (t - 1 - \epsilon)^2/4\epsilon$ otherwise.

**Corollary 4** For the general hinge loss, the surrogate loss $\Psi(f, x, x') = \phi(f(x) - f(x'))$ is consistent with AUC.

The hinge loss is inconsistent with AUC, but we can use the general hinge loss to approach the hinge loss as $\epsilon \to 0$. In addition, it is interesting to derive other consistent surrogate losses under the guidance of Theorem 2.

## 4 Regret Bounds

We now present regret bounds for the exponential and logistic losses, and for general losses under the realizable setting.

### 4.1 Regret Bounds for Exponential and Logistic Losses

We begin with a special property:

**Proposition 1** For the exponential loss and the logistic loss, we have
$$\inf_f R_\phi(f) = \mathbb{E}_{x, x' \sim \mathcal{D}_{\mathcal{X}}^2}\Big[\inf_\alpha C(\eta(x), \eta(x'), \alpha)\Big].$$

**Proof** We provide a detailed proof for the exponential loss. For a fixed instance $x_0 \in \mathcal{X}$ and value $f(x_0)$, set
$$f(x) = f(x_0) + \frac{1}{2}\ln\frac{\eta(x)(1 - \eta(x_0))}{\eta(x_0)(1 - \eta(x))} \quad \text{for } x \neq x_0.$$
It holds that, for any $x_1, x_2 \in \mathcal{X}$,
$$f(x_1) - f(x_2) = \frac{1}{2}\ln\frac{\eta(x_1)(1 - \eta(x_2))}{\eta(x_2)(1 - \eta(x_1))},$$
which minimizes $C(\eta(x_1), \eta(x_2), \alpha)$ at $\alpha = f(x_1) - f(x_2)$. This completes the proof. □

Proposition 1 is specific to the exponential and logistic losses, and does not hold for the hinge loss, the absolute loss, etc. Based on this proposition, we study regret bounds for the exponential and logistic losses by focusing on the conditional risk:

**Theorem 3** For constants $\kappa_0 > 0$ and $0 < \kappa_1 \leq 1$, we have
$$R(f) - R^* \leq \kappa_0\big(R_\phi(f) - R_\phi^*\big)^{\kappa_1}$$
if $f^* \in \arg\inf_f R_\phi(f)$ satisfies, for $\eta(x) \neq \eta(x')$, $(f^*(x) - f^*(x'))(\eta(x) - \eta(x')) > 0$ and
$$|\eta(x) - \eta(x')| \leq \kappa_0\big(C(\eta(x), \eta(x'), 0) - C(\eta(x), \eta(x'), f^*(x) - f^*(x'))\big)^{\kappa_1}.$$

**Proof** This proof is partly motivated by [Zhang, 2004b]. From Eqns. (1) and (2), we have
$$R(f) - R^* = \mathbb{E}_{(\eta(x) - \eta(x'))(f(x) - f(x')) < 0}\big[|\eta(x) - \eta(x')|\big] + \tfrac{1}{2}\mathbb{E}_{f(x) = f(x')}\big[|\eta(x) - \eta(x')|\big] \leq \mathbb{E}_{(\eta(x) - \eta(x'))(f(x) - f(x')) \leq 0}\big[|\eta(x) - \eta(x')|\big],$$
which yields, by our assumption and Jensen's inequality,
$$R(f) - R^* \leq \kappa_0\Big(\mathbb{E}_{(\eta(x) - \eta(x'))(f(x) - f(x')) \leq 0}\big[C(\eta(x), \eta(x'), 0) - C(\eta(x), \eta(x'), f^*(x) - f^*(x'))\big]\Big)^{\kappa_1}$$
for $0 < \kappa_1 \leq 1$. It remains to prove that
$$\mathbb{E}\big[C(\eta(x), \eta(x'), 0) - C(\eta(x), \eta(x'), f^*(x) - f^*(x'))\big] \leq \mathbb{E}\big[C(\eta(x), \eta(x'), f(x) - f(x')) - C(\eta(x), \eta(x'), f^*(x) - f^*(x'))\big],$$
where both expectations are taken over $(\eta(x) - \eta(x'))(f(x) - f(x')) \leq 0$, since the right-hand side is at most $R_\phi(f) - R_\phi^*$ by Proposition 1. To see it, we consider the following cases: 1) For $\eta(x) = \eta(x')$ and convex $\phi$, we have $C(\eta(x), \eta(x'), 0) \leq C(\eta(x), \eta(x'), f(x) - f(x'))$. 2) For $f(x) = f(x')$, we have $C(\eta(x), \eta(x'), 0) = C(\eta(x), \eta(x'), f(x) - f(x'))$. 3) For $(\eta(x) - \eta(x'))(f(x) - f(x')) < 0$, the point $0$ lies between $f(x) - f(x')$ and $f^*(x) - f^*(x')$ by the assumption $(f^*(x) - f^*(x'))(\eta(x) - \eta(x')) > 0$, so for convex $\phi$,
$$C(\eta(x), \eta(x'), 0) \leq \max\big(C(\eta(x), \eta(x'), f(x) - f(x')),\, C(\eta(x), \eta(x'), f^*(x) - f^*(x'))\big) = C(\eta(x), \eta(x'), f(x) - f(x')).$$
The theorem follows as desired. □

Based on this theorem, we have

**Corollary 5** For the exponential loss and the logistic loss, the regret bounds are given, respectively, by
$$R(f) - R^* \leq \big(R_\phi(f) - R_\phi^*\big)^{1/2} \quad \text{and} \quad R(f) - R^* \leq \sqrt{2}\,\big(R_\phi(f) - R_\phi^*\big)^{1/2}.$$

**Proof** We present a detailed proof for the exponential loss; the logistic loss can be handled similarly. The optimal function $f^*$ satisfies
$$f^*(x) - f^*(x') = \frac{1}{2}\ln\frac{\eta(x)(1 - \eta(x'))}{\eta(x')(1 - \eta(x))}$$
by minimizing $C(\eta(x), \eta(x'), f(x) - f(x'))$.
It follows that $(f^*(x) - f^*(x'))(\eta(x) - \eta(x')) > 0$ for $\eta(x) \neq \eta(x')$,
$$C(\eta(x), \eta(x'), 0) = \eta(x)(1 - \eta(x')) + \eta(x')(1 - \eta(x)),$$
and
$$C(\eta(x), \eta(x'), f^*(x) - f^*(x')) = 2\sqrt{\eta(x)(1 - \eta(x))}\sqrt{\eta(x')(1 - \eta(x'))}.$$
Therefore, we have
$$C(\eta(x), \eta(x'), 0) - C(\eta(x), \eta(x'), f^*(x) - f^*(x')) = \Big(\sqrt{\eta(x)(1 - \eta(x'))} - \sqrt{\eta(x')(1 - \eta(x))}\Big)^2 = \frac{|\eta(x) - \eta(x')|^2}{\big(\sqrt{\eta(x)(1 - \eta(x'))} + \sqrt{\eta(x')(1 - \eta(x))}\big)^2} \geq |\eta(x) - \eta(x')|^2,$$
where the last inequality uses $\eta(x), \eta(x') \in [0, 1]$. We complete the proof by applying Theorem 3 with $\kappa_0 = 1$ and $\kappa_1 = 1/2$. □

### 4.2 Regret Bounds for the Realizable Setting

We define the realizable setting as follows:

**Definition 3** A distribution $\mathcal{D}$ is said to be realizable if $\eta(x)(1 - \eta(x)) = 0$ for each $x \in \mathcal{X}$.

This setting has been studied for bipartite ranking [Rudin and Schapire, 2009] and multi-class classification [Long and Servedio, 2013]. Under this setting, we have

**Theorem 4** For some $\kappa > 0$, if $R_\phi^* = 0$, then
$$R(f) - R^* \leq \kappa\big(R_\phi(f) - R_\phi^*\big)$$
whenever $\phi(t) \geq 1/\kappa$ for $t \leq 0$ and $\phi(t) \geq 0$ for $t > 0$.

**Proof** Let $\mathcal{D}_+$ and $\mathcal{D}_-$ denote the positive and negative instance distributions, respectively. Eqn. (1) gives
$$R(f) = \mathbb{E}_{x \sim \mathcal{D}_+,\, x' \sim \mathcal{D}_-}\big[\mathbb{I}[f(x) < f(x')] + \tfrac{1}{2}\mathbb{I}[f(x) = f(x')]\big],$$
and $R^* = \inf_f R(f) = 0$, attained by any $f$ with $f(x) > f(x')$ for positive $x$ and negative $x'$. From Eqn. (3), we get the $\phi$-risk $R_\phi(f) = \mathbb{E}_{x \sim \mathcal{D}_+,\, x' \sim \mathcal{D}_-}[\phi(f(x) - f(x'))]$. Then
$$R(f) - R^* = \mathbb{E}_{x \sim \mathcal{D}_+,\, x' \sim \mathcal{D}_-}\big[\mathbb{I}[f(x) < f(x')] + \tfrac{1}{2}\mathbb{I}[f(x) = f(x')]\big] \leq \mathbb{E}_{x \sim \mathcal{D}_+,\, x' \sim \mathcal{D}_-}\big[\kappa\,\phi(f(x) - f(x'))\big] = \kappa\big(R_\phi(f) - R_\phi^*\big),$$
which completes the proof. □

Based on this theorem, we have

**Corollary 6** For the exponential loss, hinge loss, general hinge loss, q-norm hinge loss and least square loss, we have
$$R(f) - R^* \leq R_\phi(f) - R_\phi^*,$$
and for the logistic loss we have
$$R(f) - R^* \leq \frac{1}{\ln 2}\big(R_\phi(f) - R_\phi^*\big).$$

The hinge loss is consistent with AUC under the realizable setting, yet inconsistent in the general case, as shown in Lemma 3. Corollary 5 gives regret bounds for the exponential and logistic losses in the general case, whereas the above corollary provides tighter regret bounds under the realizable setting.
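The exponential-loss bound of Corollary 5 can be checked numerically on a discrete distribution. The following sketch (not from the paper) uses a hypothetical instance space with illustrative probabilities and random score functions; `regrets` is an assumed helper that evaluates the AUC regret and the pairwise exponential surrogate regret, using the pointwise infimum of Proposition 1 for $R_\phi^*$. A similar loop over a realizable distribution (with $\eta \in \{0, 1\}$) could be used to check Corollary 6.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete distribution (illustrative values, not from the paper).
p   = np.full(4, 0.25)
eta = np.array([0.2, 0.4, 0.6, 0.8])

def ell(a, b):
    """Ranking loss ell(f, x, x') = I[f(x) < f(x')] + 1/2 I[f(x) = f(x')]."""
    return float(a < b) + 0.5 * float(a == b)

def regrets(f):
    """Return (R(f) - R*, R_phi(f) - R_phi*) for the pairwise exponential surrogate."""
    R = Rstar = Rphi = Rphistar = 0.0
    for i in range(len(p)):
        for j in range(len(p)):
            w = p[i] * p[j]
            a, b = eta[i] * (1 - eta[j]), eta[j] * (1 - eta[i])
            R        += w * (a * ell(f[i], f[j]) + b * ell(f[j], f[i]))
            Rstar    += w * min(a, b)                  # Bayes-optimal ordering per pair
            Rphi     += w * (a * np.exp(-(f[i] - f[j])) + b * np.exp(f[i] - f[j]))
            Rphistar += w * 2 * np.sqrt(a * b)         # pointwise infimum (Proposition 1)
    return R - Rstar, Rphi - Rphistar

# Corollary 5 (exponential loss): AUC regret <= (surrogate regret)^(1/2).
for _ in range(5):
    f = rng.normal(size=len(p))
    auc_reg, exp_reg = regrets(f)
    print(f"{auc_reg:.4f} <= {np.sqrt(exp_reg):.4f} :", auc_reg <= np.sqrt(exp_reg) + 1e-12)
```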
## 5 Equivalence Between AUC and Accuracy Optimization with Exponential Losses

In binary classification, we learn a score function $f\colon \mathcal{X} \to \mathbb{R}$ and make predictions based on $\mathrm{sgn}[f(x)]$. The goal is to improve the accuracy by minimizing
$$R_{\mathrm{acc}}(f) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\mathbb{I}[yf(x) < 0]\big] = \mathbb{E}_x\big[\eta(x)\mathbb{I}[f(x) < 0] + (1 - \eta(x))\mathbb{I}[f(x) > 0]\big].$$
Denote $R_{\mathrm{acc}}^* = \inf_f R_{\mathrm{acc}}(f)$; the set of optimal solutions for accuracy is
$$\mathcal{B}_{\mathrm{acc}} = \{f \colon f(x)(\eta(x) - 1/2) > 0 \text{ for } \eta(x) \neq 1/2\}.$$
The popular surrogate formulation is $\phi_{\mathrm{acc}}(f(x), y) = \phi(yf(x))$, where $\phi$ is a convex function such as the exponential loss [Freund and Schapire, 1997] or the logistic loss [Friedman et al., 2000]. We define the expected $\phi_{\mathrm{acc}}$-risk as
$$R_{\phi_{\mathrm{acc}}}(f) = \mathbb{E}_{(x, y) \sim \mathcal{D}}[\phi(yf(x))] = \mathbb{E}_x[C_{\mathrm{acc}}(\eta(x), f(x))] = \mathbb{E}_x\big[\eta(x)\phi(f(x)) + (1 - \eta(x))\phi(-f(x))\big],$$
and denote $R_{\phi_{\mathrm{acc}}}^* = \inf_f R_{\phi_{\mathrm{acc}}}(f)$.

**Theorem 5** For the exponential loss and any classifier $f$, we have
$$R_\phi(f) - R_\phi^* \leq 2R_{\phi_{\mathrm{acc}}}(f)\big(R_{\phi_{\mathrm{acc}}}(f) - R_{\phi_{\mathrm{acc}}}^*\big).$$

**Proof** For accuracy's exponential surrogate loss, we have
$$R_{\phi_{\mathrm{acc}}}(f) - R_{\phi_{\mathrm{acc}}}^* = \mathbb{E}_x\Big[\Big(\sqrt{\eta(x)e^{-f(x)}} - \sqrt{(1 - \eta(x))e^{f(x)}}\Big)^2\Big],$$
and for AUC's exponential surrogate loss, we have
$$R_\phi(f) - R_\phi^* = \mathbb{E}_{x, x'}\Big[\Big(\sqrt{\eta(x)(1 - \eta(x'))e^{-f(x) + f(x')}} - \sqrt{\eta(x')(1 - \eta(x))e^{f(x) - f(x')}}\Big)^2\Big].$$
By using $(ab - cd)^2 \leq 2a^2(b - d)^2 + 2d^2(a - c)^2$, we have
$$R_\phi(f) - R_\phi^* \leq 4\,\mathbb{E}_x\big[(1 - \eta(x))e^{f(x)}\big]\big(R_{\phi_{\mathrm{acc}}}(f) - R_{\phi_{\mathrm{acc}}}^*\big),$$
and, in a similar manner,
$$R_\phi(f) - R_\phi^* \leq 4\,\mathbb{E}_x\big[\eta(x)e^{-f(x)}\big]\big(R_{\phi_{\mathrm{acc}}}(f) - R_{\phi_{\mathrm{acc}}}^*\big).$$
Since $\mathbb{E}_x[\eta(x)e^{-f(x)}] + \mathbb{E}_x[(1 - \eta(x))e^{f(x)}] = R_{\phi_{\mathrm{acc}}}(f)$, taking the smaller of the two bounds yields $R_\phi(f) - R_\phi^* \leq 2R_{\phi_{\mathrm{acc}}}(f)(R_{\phi_{\mathrm{acc}}}(f) - R_{\phi_{\mathrm{acc}}}^*)$. □

For a ranking function $f$, we construct a classifier by selecting a proper threshold
$$t_f^* = \arg\min_{t \in (-\infty, +\infty)} \mathbb{E}_x\big[\eta(x)e^{-f(x) + t} + (1 - \eta(x))e^{f(x) - t}\big] = \frac{1}{2}\ln\frac{\mathbb{E}_x[(1 - \eta(x))e^{f(x)}]}{\mathbb{E}_x[\eta(x)e^{-f(x)}]}.$$
Based on this threshold, we have

**Theorem 6** For a ranking function $f$ and the exponential loss, by selecting the threshold $t_f^*$ defined above,
$$R_{\phi_{\mathrm{acc}}}(f - t_f^*) - R_{\phi_{\mathrm{acc}}}^* \leq \sqrt{2\big(R_\phi(f) - R_\phi^*\big)}.$$

**Proof** For the score function $f(x)$, we have
$$R_{\phi_{\mathrm{acc}}}(f - t_f^*) - R_{\phi_{\mathrm{acc}}}^* = 2\Big(\sqrt{\mathbb{E}_x[\eta(x)e^{-f(x)}]\,\mathbb{E}_x[(1 - \eta(x))e^{f(x)}]} - \mathbb{E}_x\big[\sqrt{\eta(x)(1 - \eta(x))}\big]\Big).$$
For the pairwise exponential loss of AUC, we have
$$R_\phi(f) - R_\phi^* = 2\Big(\mathbb{E}_x[\eta(x)e^{-f(x)}]\,\mathbb{E}_x[(1 - \eta(x))e^{f(x)}] - \big(\mathbb{E}_x\big[\sqrt{\eta(x)(1 - \eta(x))}\big]\big)^2\Big) \geq \tfrac{1}{2}\big(R_{\phi_{\mathrm{acc}}}(f - t_f^*) - R_{\phi_{\mathrm{acc}}}^*\big)^2,$$
where the last inequality uses $u^2 - v^2 \geq (u - v)^2$ for $u \geq v \geq 0$. This completes the proof. □

Together with Corollary 5, Theorems 5 and 6, and [Zhang, 2004b, Theorem 2.1], we have

**Theorem 7** For the exponential loss and a classifier $f$, we have
$$R(f) - R^* \leq \big(2R_{\phi_{\mathrm{acc}}}(f)(R_{\phi_{\mathrm{acc}}}(f) - R_{\phi_{\mathrm{acc}}}^*)\big)^{1/2} \quad \text{and} \quad R_{\mathrm{acc}}(f) - R_{\mathrm{acc}}^* \leq \sqrt{2}\,\big(R_{\phi_{\mathrm{acc}}}(f) - R_{\phi_{\mathrm{acc}}}^*\big)^{1/2}.$$
For the exponential loss and a ranking function $f$, we have
$$R(f) - R^* \leq \big(R_\phi(f) - R_\phi^*\big)^{1/2} \quad \text{and} \quad R_{\mathrm{acc}}(f - t_f^*) - R_{\mathrm{acc}}^* \leq 2\big(R_\phi(f) - R_\phi^*\big)^{1/4}.$$

This theorem discloses the asymptotic equivalence between the univariate exponential loss of accuracy and the pairwise exponential loss of AUC. As a result, AdaBoost and RankBoost are equivalent, i.e., both of them optimize AUC and accuracy simultaneously, because AdaBoost and RankBoost essentially optimize $\phi_{\mathrm{acc}}(f(x), y) = e^{-yf(x)}$ and $\Psi(f, x, x') = e^{-(f(x) - f(x'))}$, respectively. Rudin and Schapire [2009] established the equivalence between AdaBoost and RankBoost for finite training samples based on the assumption of equal contribution between the negative and positive classes. Our work does not make such an assumption, and the regret bounds show the equivalence between the pairwise and univariate exponential losses, providing a new explanation of the relation between AdaBoost and RankBoost.

In [Menon and Williamson, 2014], there is the following proposition:

**Proposition 10** Given any $D_{M,\eta}$ over $\mathcal{X} \times \{\pm 1\}$, a strictly proper composite loss $\ell$ with inverse link function $\Psi^{-1}(v) = 1/(1 + e^{-av})$ for some $a \in \mathbb{R} \setminus \{0\}$, and a scorer $s\colon \mathcal{X} \to \mathbb{R}$, there exists a convex function $F_\ell\colon [0, 1] \to \mathbb{R}_+$ such that
$$F_\ell\big(\mathrm{regret}^{D,\mathrm{Univ}}_{\mathrm{Bipart},01}(s)\big) \leq \mathrm{regret}^{D,\mathrm{Univ}}_{\mathrm{Bipart},\ell}(s),$$
where $\mathrm{regret}^{D,\mathrm{Univ}}_{\mathrm{Bipart},\ell}(s) = L^D_{\mathrm{Bipart},\ell}(\mathrm{Diff}(s)) - \inf_{t\colon \mathcal{X} \to \mathbb{R}} L^D_{\mathrm{Bipart},\ell}(\mathrm{Diff}(t))$, "Univ" means univariate loss, and for the remaining notation please refer to [Menon and Williamson, 2014].

This proposition shows that the univariate exponential loss is consistent with AUC optimization. In our Theorems 5 and 6, we show that the univariate exponential loss is equivalent to the pairwise exponential loss, for the consistency of optimizing all performance measures such as AUC, rank loss, precision-recall, etc. Note that the cited proposition does not involve pairwise losses, let alone the equivalence between pairwise and univariate losses; moreover, the cited proposition considers only AUC as the performance measure, whereas we consider all performance measures. (This explanation was added at the request of a reviewer.)
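The following sketch (not from the paper; the instance space, probabilities and helper names `acc_exp_regret` and `auc_exp_regret` are hypothetical) illustrates the threshold construction and checks the bounds of Theorems 5 and 6 numerically, using the closed-form regrets that appear in the proofs above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical discrete instance space (illustrative values, not from the paper).
p   = np.full(5, 0.2)
eta = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

def acc_exp_regret(f):
    """R_phi_acc(f) - R*_phi_acc for the univariate exponential loss."""
    risk  = np.sum(p * (eta * np.exp(-f) + (1 - eta) * np.exp(f)))
    bayes = np.sum(p * 2 * np.sqrt(eta * (1 - eta)))
    return risk - bayes

def auc_exp_regret(f):
    """R_phi(f) - R*_phi for the pairwise exponential loss (closed form from Section 5)."""
    A = np.sum(p * eta * np.exp(-f))
    B = np.sum(p * (1 - eta) * np.exp(f))
    C = np.sum(p * np.sqrt(eta * (1 - eta)))
    return 2 * (A * B - C ** 2)

f = rng.normal(size=5)                      # an arbitrary ranking function
A = np.sum(p * eta * np.exp(-f))
B = np.sum(p * (1 - eta) * np.exp(f))
t_star = 0.5 * np.log(B / A)                # threshold t_f^* from Section 5

lhs = acc_exp_regret(f - t_star)            # accuracy surrogate regret after thresholding
rhs = np.sqrt(2 * auc_exp_regret(f))        # bound of Theorem 6
print(f"Theorem 6: {lhs:.4f} <= {rhs:.4f}", lhs <= rhs + 1e-12)

# Theorem 5 in the other direction: pairwise regret bounded via the univariate regret.
risk_acc = np.sum(p * (eta * np.exp(-f) + (1 - eta) * np.exp(f)))
print(f"Theorem 5: {auc_exp_regret(f):.4f} <= {2 * risk_acc * acc_exp_regret(f):.4f}",
      auc_exp_regret(f) <= 2 * risk_acc * acc_exp_regret(f) + 1e-12)
```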
## 6 Conclusion

This work studies the consistency of AUC optimization based on minimizing pairwise surrogate losses. We first showed that calibration is necessary yet insufficient for AUC consistency. We then provided a new sufficient condition and established the consistency of the exponential loss, the logistic loss, the least square hinge loss, etc. Further, we derived regret bounds for the exponential and logistic losses, and obtained regret bounds for many surrogate losses under the realizable setting. Finally, we provided regret bounds that show the equivalence between the pairwise exponential loss of AUC and the univariate exponential loss of accuracy, with the direct consequence that AdaBoost and RankBoost are equivalent in the limit of infinite sample.

## References

- [Agarwal, 2013] S. Agarwal. Surrogate regret bounds for the area under the ROC curve via strongly proper losses. In COLT, pages 338-353, 2013.
- [Bartlett et al., 2006] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. J. Am. Stat. Assoc., 101(473):138-156, 2006.
- [Brefeld and Scheffer, 2005] U. Brefeld and T. Scheffer. AUC maximizing support vector learning. In ICML Workshop, 2005.
- [Clémençon et al., 2008] S. Clémençon, G. Lugosi, and N. Vayatis. Ranking and empirical minimization of U-statistics. Ann. Stat., 36(2):844-874, 2008.
- [Cohen et al., 1999] W. W. Cohen, R. E. Schapire, and Y. Singer. Learning to order things. Neural Comput., 10:243-270, 1999.
- [Cossock and Zhang, 2008] D. Cossock and T. Zhang. Statistical analysis of Bayes optimal subset ranking. IEEE T. Inform. Theory, 54(11):5140-5154, 2008.
- [Duchi et al., 2010] J. C. Duchi, L. W. Mackey, and M. I. Jordan. On the consistency of ranking algorithms. In ICML, pages 327-334, 2010.
- [Egan, 1975] J. Egan. Signal Detection Theory and ROC Curve. Series in Cognition and Perception. Academic Press, New York, 1975.
- [Flach et al., 2011] P. A. Flach, J. Hernández-Orallo, and C. F. Ramirez. A coherent interpretation of AUC as a measure of aggregated classification performance. In ICML, pages 657-664, 2011.
- [Freund and Schapire, 1997] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. JCSS, 55(1):119-139, 1997.
- [Freund et al., 2003] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. JMLR, 4:933-969, 2003.
- [Friedman et al., 2000] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting (with discussion). Ann. Stat., 28(2):337-407, 2000.
- [Gao and Zhou, 2012] W. Gao and Z.-H. Zhou. On the consistency of AUC optimization. CoRR, abs/1208.0645, 2012.
- [Gao and Zhou, 2013] W. Gao and Z.-H. Zhou. On the consistency of multi-label learning. AIJ, 199:22-44, 2013.
- [Gao et al., 2013] W. Gao, R. Jin, S. Zhu, and Z.-H. Zhou. One-pass AUC optimization. In ICML, pages 906-914, 2013.
- [Joachims, 2005] T. Joachims. A support vector method for multivariate performance measures. In ICML, pages 377-384, 2005.
- [Kotlowski et al., 2011] W. Kotlowski, K. Dembczynski, and E. Hüllermeier. Bipartite ranking through minimization of univariate loss. In ICML, pages 1113-1120, 2011.
- [Long and Servedio, 2013] P. Long and R. Servedio. Consistency versus realizable H-consistency for multiclass classification. In ICML, pages 801-809, 2013.
- [Marron et al., 2007] J. Marron, M. Todd, and J. Ahn. Distance-weighted discrimination. J. Am. Stat. Assoc., 102(480):1267-1271, 2007.
- [Menon and Williamson, 2014] A. K. Menon and R. C. Williamson. Bayes-optimal scorers for bipartite ranking. In COLT, pages 68-106, 2014.
- [Rudin and Schapire, 2009] C. Rudin and R. E. Schapire. Margin-based ranking and an equivalence between AdaBoost and RankBoost. JMLR, 10:2193-2232, 2009.
- [Tewari and Bartlett, 2007] A. Tewari and P. L. Bartlett. On the consistency of multiclass classification methods. JMLR, 8:1007-1025, 2007.
- [Uematsu and Lee, 2012] K. Uematsu and Y. Lee. On theoretically optimal ranking functions in bipartite ranking. Technical report, 2012.
- [Xia et al., 2008] F. Xia, T. Y. Liu, J. Wang, W. Zhang, and H. Li. Listwise approach to learning to rank: Theory and algorithm. In ICML, pages 1192-1199, 2008.
- [Zhang, 2004a] T. Zhang. Statistical analysis of some multicategory large margin classification methods. JMLR, 5:1225-1251, 2004.
- [Zhang, 2004b] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Stat., 32(1):56-85, 2004.
- [Zhao et al., 2011] P. Zhao, S. Hoi, R. Jin, and T. Yang. Online AUC maximization. In ICML, pages 233-240, 2011.
- [Zuva and Zuva, 2012] K. Zuva and T. Zuva. Evaluation of information retrieval systems. Int. J. Comput. Sci. Inform. Tech., 4:35-43, 2012.