The Coherent Loss Function for Classification

Wenzhuo Yang A0096049@NUS.EDU.SG
Department of Mechanical Engineering, National University of Singapore, Singapore 117576

Melvyn Sim DSCSIMM@NUS.EDU.SG
Department of Decision Sciences, National University of Singapore, Singapore 117576

Huan Xu MPEXUH@NUS.EDU.SG
Department of Mechanical Engineering, National University of Singapore, Singapore 117576

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Abstract

A prediction rule in binary classification that aims to achieve the lowest probability of misclassification involves minimizing over a nonconvex, 0-1 loss function, which is typically a computationally intractable optimization problem. To address the intractability, previous methods consider minimizing the cumulative loss, i.e., the sum of convex surrogates of the 0-1 loss of each sample. We revisit this paradigm and instead develop an axiomatic framework: we propose a set of salient properties for functions used in binary classification, and then propose the coherent loss approach, which is a tractable upper bound of the empirical classification error over the entire sample set. We show that the proposed approach yields a strictly tighter approximation to the empirical classification error than any convex cumulative loss approach while preserving the convexity of the underlying optimization problem, and this approach for binary classification also has a robustness interpretation which builds a connection to robust SVMs.

1. Introduction

The goal of supervised learning is to predict an unobserved output value $y$ from an observed input $x$. This is achieved by learning a functional relationship $y \approx f(x)$ from a set of observed training examples $\{(y_i, x_i)\}_{i=1}^m$. The quality of the predictor $f(\cdot)$ is often measured by some loss function $\ell(f(x), y)$. A typical statistical setup in machine learning assumes that all training data and testing samples are IID samples drawn from an unknown distribution $\mu$, and the goal is to find a predictor $f(\cdot)$ such that the expected loss $\mathbb{E}_{(y,x)\sim\mu}\,\ell(f(x), y)$ is minimized. Since $\mu$ is unknown, the expected loss is often replaced by the empirical loss
$$L_{\mathrm{emp}}(f) = \frac{1}{m}\sum_{i=1}^m \ell(f(x_i), y_i). \tag{1}$$
Minimizing $L_{\mathrm{emp}}(f)$, as well as numerous regularization-based variants of it, is one of the fundamental cornerstones of statistical machine learning (e.g., Vapnik & Lerner, 1963; Vapnik & Chervonenkis, 1991; Poggio et al., 2004).

This paper focuses on binary classification problems, where $y \in \{-1, +1\}$. A point $(y, x)$ is correctly predicted if $\mathrm{sign}(f(x)) = y$, and its classification error is given by the 0-1 loss $\ell(f(x), y) = 1(y \neq \mathrm{sign}(f(x))) = 1(yf(x) \le 0)$. Due to the non-convexity of the indicator function, minimizing the empirical classification error $\sum_i 1(y_i f(x_i) \le 0)$ is known to be NP-hard even to approximate (Arora et al., 1997; Ben-David et al., 2003). A number of methods have been proposed to mitigate this computational difficulty, all based on the idea of minimizing the cumulative loss, which is the sum of individual losses
$$\sum_{i=1}^m \phi(y_i f(x_i)),$$
where $\phi(\cdot)$ is a convex upper bound of the classification error $1(yf(x) \le 0)$. For example, AdaBoost (Freund & Schapire, 1997; Friedman et al., 2000; Schapire & Singer, 1999) employs the exponential loss function $\exp(-yf(x))$, and Support Vector Machines (SVMs) (Boser et al., 1992; Cortes & Vapnik, 1995) employ the hinge loss $\max\{1 - yf(x), 0\}$.
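To make the contrast concrete, the following small numerical sketch (ours, not part of the paper; the margin values are arbitrary illustrations) evaluates the 0-1 loss and the two surrogates just mentioned on a few margins $yf(x)$:

```python
import numpy as np

# Margins u = y*f(x): positive means correctly classified, negative means misclassified.
u = np.array([2.0, 0.5, -0.5, -3.0])

zero_one = (u <= 0).astype(float)       # 0-1 loss: 1 exactly when the sample is misclassified
exponential = np.exp(-u)                # AdaBoost surrogate exp(-y f(x))
hinge = np.maximum(1.0 - u, 0.0)        # SVM hinge surrogate max{1 - y f(x), 0}

# Each surrogate upper-bounds the 0-1 loss pointwise; the cumulative loss averages it over samples.
assert np.all(exponential >= zero_one) and np.all(hinge >= zero_one)
print(zero_one.mean(), exponential.mean(), hinge.mean())
```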
In this paper we revisit this paradigm, and introduce a notion termed coherent loss, as opposed to the cumulative loss used in the conventional approach. Briefly speaking, instead of using an upper bound of the individual classification error (the 0-1 loss), we propose to use a tractable upper bound of the total empirical classification error for the whole training set. That is, we look for $\Phi : \mathbb{R}^m \to \mathbb{R}$ such that
$$\Phi(c_1, \dots, c_m) \ge \frac{1}{m}\sum_{i=1}^m 1(c_i \le 0), \quad \forall (c_1, \dots, c_m) \in \mathbb{R}^m.$$
Intuitively, since coherent loss functions are more general than cumulative loss functions, one may expect to obtain a tighter and still tractable bound of the empirical classification error via a coherent loss function. We formalize this intuition in this paper. Specifically, our contributions include the following.

In Section 2, we consider a principled approach by formalizing the salient properties of functions, termed coherent classification loss functions, that could be used to quantify the performance of a classification rule. These functions have dual representations which enable us to identify the minimal coherent classification loss function, which, loosely speaking, is the coherent classification loss function that best approximates the 0-1 loss and achieves a tighter bound of the empirical classification error than any convex cumulative loss. We show that optimizing this function is equivalent to a convex optimization problem, and hence tractable.

In Section 3, we consider an equivalent form of the coherent loss function and then provide several applications of this loss function in classification problems. We remark that a tighter approximation of the 0-1 loss can potentially reduce the impact of outliers on the classification accuracy. A cumulative loss function may significantly deviate from the 0-1 loss when $c \ll 0$. Consequently, a misclassified outlier can incur a huge loss and prevent an otherwise perfect prediction rule from being selected. This sensitivity can be mitigated by a tighter approximation.

Section 4 provides a statistical interpretation of minimizing the coherent loss function. Section 5 reports experimental results which show that our classification method outperforms the standard SVM when additional constraints are imposed on the decision function.

Notations: We use boldface letters to represent column vectors, and capital letters for matrices. We reserve $e$ for special vectors: $e_i$ is the vector whose $i$-th entry is 1 and whose remaining entries are 0; $e_N$, where $N$ is an index set, is the vector whose entries indexed by $N$ equal 1 and whose remaining entries are 0; $e$ is the vector with all entries equal to 1. The $i$-th entry of a vector $x$ is denoted by $x_i$. We use $[c]_+$ to denote $\max\{0, c\}$ and $1[\cdot]$ to denote the indicator function, and let $\mathcal{P}_m$ be the set of all $m \times m$ permutation matrices and $I_p$ be the $p \times p$ identity matrix.

2. Coherent Classification Loss Function

We now propose the notion of coherent classification loss functions based on an axiomatic approach. Along the way, we show the existence of a tight coherent classification loss function which achieves a better approximation of the empirical classification error than any convex cumulative loss. The definition of the coherent classification loss function is motivated by analyzing the salient properties of functions used to quantify the performance of a classification rule. A natural approach is to elicit these properties from the classification error.
Specifically, given $u_1, \dots, u_m$, where $u_i$ is the decision value of the $i$-th sample, e.g., $u_i = y_i f(x_i)$, the classification error $\varrho : \mathbb{R}^m \to [0, 1]$ is given by
$$\varrho(u_1, \dots, u_m) = \frac{1}{m}\sum_{i=1}^m 1[u_i < 0]. \tag{2}$$
We next propose a set of properties; functions endowed with these properties are called coherent classification loss functions.

2.1. Salient Properties and Representation Theorem

We elicit the following five salient properties from the classification error. Consider $\rho(\cdot) : \mathbb{R}^m \to [0, 1]$.

Property 1 (Complete classification). $\rho(u) = 0$ if and only if $u \ge 0$.

Complete classification essentially says that if every sample is correctly classified, then the classification is optimal.

Property 2 (Misclassification avoidance). If $u < 0$, then $\rho(u) = 1$.

This property states that if all samples are misclassified, then it is the worst classification, and hence $\rho(\cdot)$ achieves its maximal value.

Property 3 (Monotonicity). If $u_1 \ge u_2$, then $\rho(u_1) \le \rho(u_2)$.

Monotonicity requires that if a decision better classifies every sample, then it is more desirable.

Property 4 (Order invariance). For any $P \in \mathcal{P}_m$, we have $\rho(u) = \rho(Pu)$.

Order invariance essentially states that the order of the samples does not matter. This is natural in the classification problem, since each sample is drawn IID and is treated equally.

Property 5 (Scale invariance). For all $\alpha > 0$, $\rho(\alpha u) = \rho(u)$.

Scale invariance is a property that the classification error function satisfies. It essentially means that changing the scale does not affect the preference between classifiers. While it may be debatable whether scale invariance is as necessary as the other properties, as we show later in this section, this property can be relaxed.
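As a quick numerical sanity check (a sketch in numpy, not part of the paper), the empirical classification error $\varrho$ of (2) indeed satisfies Properties 1-5 on example data; what it lacks, for $m \ge 2$, is quasi-convexity, which is what motivates the surrogate developed next:

```python
import numpy as np

def classification_error(u):
    """Empirical classification error (2): fraction of samples with negative decision value."""
    u = np.asarray(u, dtype=float)
    return np.mean(u < 0)

rng = np.random.default_rng(0)
u = rng.normal(size=10)
v = u + np.abs(rng.normal(size=10))       # v >= u componentwise: every sample classified "better"

# Property 1 (complete classification) and Property 2 (misclassification avoidance)
assert classification_error(np.abs(u)) == 0.0
assert classification_error(-np.abs(u) - 1.0) == 1.0
# Property 3 (monotonicity): improving every margin never increases the loss
assert classification_error(v) <= classification_error(u)
# Property 4 (order invariance): permuting the samples changes nothing
assert classification_error(rng.permutation(u)) == classification_error(u)
# Property 5 (scale invariance): rescaling all margins by a positive constant changes nothing
assert classification_error(3.7 * u) == classification_error(u)
```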
Definition 1 (Coherent Classification Loss). A function $\rho(\cdot) : \mathbb{R}^m \to [0, 1]$ is a coherent classification loss function (CCLF) if it satisfies Properties 1 to 5, and is quasi-convex and lower semi-continuous.

Here, quasi-convexity and lower semi-continuity are introduced for tractability. Our first result is a (dual) representation theorem for any CCLF. We need the following definition first.

Definition 2 (Admissible Class). A class of sets $V_k \subseteq \mathbb{R}^m$ parameterized by $k \in [0, 1]$ is called an admissible class if it satisfies the following properties:
1. For any $k \in [0, 1]$, $V_k$ is a closed, convex cone, and is order invariant. Here, being order invariant means that $v \in V_k$ implies $Pv \in V_k$ for any $P \in \mathcal{P}_m$;
2. $k \le k'$ implies $V_k \subseteq V_{k'}$;
3. $V_1 = \mathrm{cl}(\lim_{k \uparrow 1} V_k)$ and $V_0 = \lim_{k \downarrow 0} V_k$;
4. $V_1 = \mathbb{R}^m_+$;
5. For any $\lambda > 0$, we have $\lambda e \in V_0$.

Theorem 1 (Representation Theorem). A function $\rho(\cdot)$ is a CCLF if and only if it can be written as
$$\rho(u) = 1 - \sup\{k \in [0, 1] \mid \sup_{v \in V_k} (-v^\top u) \le 0\} \tag{3}$$
for an admissible class $\{V_k\}$. Here the sup over an empty set is taken as 0.

Proof. We sketch the proof and leave the details to the supplementary material. The "if" part is relatively easy, by checking that any function $\rho(u) = 1 - \sup\{k \in [0, 1] \mid \sup_{v \in V_k}(-v^\top u) \le 0\}$ for some admissible class $\{V_k\}$ satisfies all properties required of a CCLF. The "only if" part requires more work. We want to show that any given function $\rho(\cdot)$ which is a CCLF can be represented as (3) for some admissible class $\{V_k\}$. The proof consists of three steps. We first show that $\rho(\cdot)$ can be represented as $\rho(u) = 1 - \sup\{k \in [0, 1] \mid \sup_{v \in V_k}(-v^\top u) \le 0\}$ for some $\{V_k\}$, not necessarily admissible. This essentially follows from a result in Brown & Sim (2009). We then show that we can replace $V_k$ by a class of closed, convex, order-invariant cones $\bar V_k$. Specifically, we can pick $\bar V_k \triangleq \mathrm{cl}(\mathrm{cc}(\mathrm{or}(V_k)))$, where $\mathrm{or}(\cdot)$ (respectively $\mathrm{cc}(\cdot)$) is the minimal order-invariant (respectively, convex cone) superset. Finally, we show that $\{\bar V_k\}$ is admissible, by checking that all properties in Definition 2 are satisfied, which completes the proof.

2.2. Minimal coherent classification loss function

This section shows that among all CCLFs that upper-bound the classification error, there exists a minimal (i.e., best) one.

Theorem 2. Define $\bar\rho(\cdot) : \mathbb{R}^m \to [0, 1]$ as follows:
$$\bar\rho(u) = \frac{\max\{t : \sum_{i=1}^t u_{(i)} < 0\}}{m},$$
where $\{u_{(i)}\}$ is a permutation of $\{u_i\}$ in non-decreasing order, and the max over an empty set is taken as zero. Then the following holds.
1. $\bar\rho(\cdot)$ is a CCLF, and is an upper bound of the classification error, i.e., $\bar\rho(u) \ge \varrho(u)$ for all $u \in \mathbb{R}^m$.
2. Let $V_k \subseteq \mathbb{R}^m$ satisfy that if $k = 0$, then $V_k = \mathrm{conv}\{\lambda e \mid \lambda > 0\}$; and if $\frac{s}{m} < k \le \frac{s+1}{m}$ for $s = 0, \dots, m-1$, then $V_k = \mathrm{conv}\{\lambda e_N \mid \lambda > 0,\ N : |N| = m - s\}$, where $N$ is an index set. Then $\{V_k\}$ is an admissible class corresponding to $\bar\rho(\cdot)$.
3. $\bar\rho(\cdot)$ is the tightest CCLF bound. That is, if $\rho'(\cdot)$ is a CCLF and satisfies $\rho'(u) \ge \varrho(u)$ for all $u \in \mathbb{R}^m$, then $\rho'(u) \ge \bar\rho(u)$ for all $u \in \mathbb{R}^m$.

Proof. We provide a sketch of the proof and leave the details to the supplementary material. Claim 1 is relatively straightforward. It is also easy to see that $\{V_k\}$ is an admissible class, so one only needs to show that $\{V_k\}$ is the class corresponding to $\bar\rho(\cdot)$ to establish Claim 2. To show Claim 3, we let $\{V'_k\}$ be an admissible class corresponding to $\rho'(\cdot)$, and show that $\lambda e_N \in V'_k$, which further implies $V_k \subseteq V'_k$. This establishes Claim 3.

We next show that scale invariance can be relaxed. Indeed, for any quasi-convex upper bound of the classification error that satisfies the other properties, the minimal CCLF is a tighter bound.

Theorem 3. Let $\hat\rho : \mathbb{R}^m \to [0, 1]$ be a quasi-convex function that satisfies complete classification, misclassification avoidance, monotonicity, and order invariance, and such that $\hat\rho(u) \ge \varrho(u)$. Then there exists a CCLF $\rho(\cdot)$ such that
$$\varrho(u) \le \bar\rho(u) \le \rho(u) \le \hat\rho(u), \quad \forall u \in \mathbb{R}^m.$$

Proof. We sketch the proof. The main idea is to construct such a function $\rho(u) \triangleq \lim_{\epsilon \downarrow 0}[\min_{\gamma > 0} \hat\rho((u + \epsilon)/\gamma)]$, and show that $\rho(\cdot)$ is a CCLF and $\hat\rho(u) \ge \rho(u) \ge \varrho(u)$. Finally, since $\bar\rho(\cdot)$ is the minimal CCLF, this completes the proof.

One important property of $\bar\rho(\cdot)$ is that it achieves a better approximation of the empirical classification error than any convex cumulative loss.

Theorem 4. If $f(\cdot)$ is a convex function and an upper bound of the 0-1 loss function, then for any $u = (u_1, \dots, u_m)$, we have
$$\varrho(u) \le \bar\rho(u) \le \frac{1}{m}\sum_{i=1}^m f(u_i).$$

Proof. Without loss of generality, assume $(u_1, \dots, u_m)$ are in non-decreasing order. Let $p \triangleq \max\{i \mid u_i < 0\}$ and $q \triangleq \max\{t \mid \sum_{i=1}^t u_i < 0\}$, so that $\varrho(u) = p/m$, $\bar\rho(u) = q/m$, and
$$\sum_{i=1}^q u_i = \sum_{i=1}^p u_i + \sum_{i=p+1}^q u_i < 0.$$
Since $f(\cdot)$ is convex and $f(x) \ge 1[x \le 0]$, there exists $k \le 0$ such that $f(x) \ge \max\{kx + 1, 0\}$ (this can be done, for example, by taking $k$ as a subgradient of $f$ at $x = 0$). If $k = 0$, then $\frac{1}{m}\sum_{i=1}^m f(u_i) \ge 1 \ge \bar\rho(u)$, and the theorem holds. Otherwise $k < 0$, and we have
$$\sum_{i=1}^m f(u_i) \ \ge\ \sum_{i=1}^p f(u_i) + \sum_{i=p+1}^q f(u_i) \ \ge\ \sum_{i=1}^p (k u_i + 1) + \sum_{i=p+1}^q f(u_i) \ \ge\ p - k\sum_{i=p+1}^q u_i + \sum_{i=p+1}^q f(u_i) \ =\ p + \sum_{i=p+1}^q \big(f(u_i) - k u_i\big),$$
where the last inequality uses $k\sum_{i=1}^p u_i > -k\sum_{i=p+1}^q u_i$, which follows from $\sum_{i=1}^q u_i < 0$ and $k < 0$. Note that $u_i \ge 0$ for $i = p + 1, \dots, m$. If $u_i \ge -1/k$, then $f(u_i) - k u_i \ge -k u_i \ge 1$; otherwise $f(u_i) - k u_i \ge k u_i + 1 - k u_i = 1$. Hence $p + \sum_{i=p+1}^q (f(u_i) - k u_i) \ge p + (q - p) = q$, so $\frac{1}{m}\sum_{i=1}^m f(u_i) \ge q/m$. By the definition of $\bar\rho(u)$, the theorem holds.
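The minimal CCLF of Theorem 2 is easy to evaluate by sorting. The following sketch (ours; random test data) computes $\bar\rho$ and checks the sandwich of Theorem 4 against the cumulative hinge loss:

```python
import numpy as np

def rho_bar(u):
    """Minimal CCLF of Theorem 2: (largest t such that the sum of the t smallest
    entries of u is negative) divided by m; 0 if no such t exists."""
    m = len(u)
    partial = np.cumsum(np.sort(u))            # sums of the t smallest entries, t = 1..m
    neg = np.flatnonzero(partial < 0)
    return (neg[-1] + 1) / m if neg.size else 0.0

def zero_one(u):
    return np.mean(np.asarray(u) < 0)          # empirical classification error (2)

def hinge_avg(u):
    return np.mean(np.maximum(1.0 - np.asarray(u), 0.0))   # a convex cumulative loss

rng = np.random.default_rng(0)
for _ in range(1000):
    u = rng.normal(size=20) * rng.uniform(0.1, 5.0)
    # Theorem 2 (claim 1) and Theorem 4: 0-1 error <= rho_bar <= average hinge loss
    assert zero_one(u) <= rho_bar(u) <= hinge_avg(u) + 1e-12
```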
2.3. Optimization with the coherent loss function

We now discuss the computational issue of optimizing the minimal CCLF $\bar\rho(\cdot)$. Indeed, we show that this can be converted to a tractable convex optimization problem. Specifically, we consider the following problem in variables $(u, w)$:
$$\min_{u, w} \ \bar\rho(u) \quad \text{s.t.} \quad f_j(u, w) \le 0, \ j = 1, \dots, n, \tag{4}$$
where the $f_j(\cdot, \cdot)$ are convex functions. We have the following theorem.

Theorem 5. Assume that complete classification is not achievable, i.e., there is no feasible $(u, w)$ with $u \ge 0$. Let $(s^*, t^*, h^*)$ be an optimal solution to the following optimization problem:
$$\min_{h, s, t} \ \frac{1}{m}\sum_{i=1}^m [1 - s_i]_+ \quad \text{s.t.} \quad h f_j(s/h, t/h) \le 0, \ j = 1, \dots, n; \ h > 0. \tag{5}$$
Then $(s^*/h^*, t^*/h^*)$ is an optimal solution to Problem (4).

Proof. We provide a sketch of the proof. We first show that the level set of Problem (4), i.e.,
$$U_i \triangleq \{(u, w) \mid \bar\rho(u) \le 1 - i/m;\ f_j(u, w) \le 0, \ \forall j\}$$
for $i = 1, \dots, m$, equals the set
$$\{(u, w) \mid \exists d > 0 : \ \sum_{l=1}^m [d - u_l]_+ \le (m - i + 1)d;\ f_j(u, w) \le 0, \ \forall j\}.$$
This can be proved by applying Theorem 2 and then using linear programming duality. This set can further be shown to be equivalent to the feasible set of
$$\sum_{l=1}^m [1 - u_l/d]_+ \le m - i + 1, \quad f_j(u, w) \le 0, \ j = 1, \dots, n. \tag{6}$$
Thus, finding an optimal solution to Problem (4) is equivalent to solving the following problem:
$$\min_{u, w, d > 0} \ \sum_{l=1}^m [1 - u_l/d]_+ \quad \text{s.t.} \quad f_j(u, w) \le 0, \ j = 1, \dots, n.$$
Letting $h = 1/d$, $s = hu$ and $t = hw$, the theorem is established.

Notice that $h f_j(s/h, t/h)$ is the perspective function of $f_j(\cdot, \cdot)$, and is hence jointly convex in $(h, s, t)$ (Boyd & Vandenberghe, 2004). Thus, Problem (5) is equivalent to a tractable convex optimization problem.

3. Equivalent Formulation and Applications

From Theorem 5, when there is no $(u, w)$ such that $u \ge 0$ and $f_j(u, w) \le 0$ for $j = 1, \dots, n$, Problem (4) is equivalent to the following optimization problem:
$$\min_{u, w} \ \Phi(u) \quad \text{s.t.} \quad f_j(u, w) \le 0, \ j = 1, \dots, n, \tag{8}$$
where $\Phi(u)$ is defined by
$$\Phi(u) \triangleq \min_{\gamma > 0} \ \frac{1}{m}\sum_{i=1}^m [1 - u_i/\gamma]_+. \tag{9}$$
From this formulation, we also show, from another perspective, that minimizing the coherent loss function is equivalent to minimizing a tighter upper bound of the 0-1 loss function; in other words, the coherent loss function achieves a better approximation of the empirical classification error than any convex cumulative loss.

Theorem 6. Let $\phi : \mathbb{R} \to \mathbb{R}_+$ be a non-increasing, convex function that satisfies $\phi(c) \ge 1(c \le 0)$ for all $c \in \mathbb{R}$. Then for all $u \in \mathbb{R}^m$:
$$\frac{1}{m}\sum_{i=1}^m 1(u_i \le 0) \ \le\ \Phi(u) \ \le\ \frac{1}{m}\sum_{i=1}^m \phi(u_i).$$

Proof. Recall that the hinge loss $\phi_1(c) \triangleq [1 - c]_+$ is the tightest convex bound of the 0-1 loss which has derivative (or sub-gradient) $-1$ at $c = 0$ (e.g., Schölkopf & Smola, 2002). That is, if a convex function $\phi(\cdot)$ satisfies $\phi(c) \ge 1(c \le 0)$ for all $c$, and also satisfies $-1 \in \partial\phi(0)$, then $\phi_1(c) \le \phi(c)$ for all $c$. Similarly, $\phi_\gamma(c) \triangleq \max\{1 - c/\gamma, 0\}$ is the tightest convex bound of the 0-1 loss with derivative $-1/\gamma$ at $c = 0$. Since $\phi(\cdot)$ is non-increasing, it cannot have a positive derivative at $c = 0$. Thus, $\Phi(\cdot)$ is a tighter bound than any non-increasing, convex cumulative loss function.

Recall that (8) is equivalent to the following convex optimization problem
$$\min_{h, s, t} \ \frac{1}{m}\sum_{i=1}^m [1 - s_i]_+ \quad \text{s.t.} \quad h f_j(s/h, t/h) \le 0, \ j = 1, \dots, n; \ h > 0,$$
which can be solved efficiently. We now provide some applications of the proposed coherent loss function.

First, we illustrate with an example that the proposed coherent loss function can be more robust to outliers. Let $u_1, u_2 \in \mathbb{R}^{100}$ be the following:
$$u_1 = (-1000, 1000, 1000, \dots, 1000), \qquad u_2 = (+1, -1, +1, -1, \dots, +1, -1).$$
In this case, $u_2$ appears to be the less favorable classification, since 50% of the samples are misclassified. It is easy to check that $u_1$ incurs a much larger hinge loss than $u_2$, even though only one of its samples is misclassified. In contrast, the coherent loss of $u_1$ is no more than 0.02 (take $\gamma = 1000$), while that of $u_2$ is at least 0.5 (since 50% of the samples are misclassified, and the coherent loss is an upper bound). Thus, the coherent loss is more robust in this example, partly because it better approximates the 0-1 loss, and hence is less affected by large outliers. See Figure 1.

Figure 1. Illustration of the effect of outliers on the cumulative loss vs. the coherent loss. Here, $w_1$ has margin $u_1$, and $w_2$ has margin $u_2$. The cumulative loss approach will pick $w_2$, whereas the proposed method will pick $w_1$, which is a better classification.
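The numbers in the outlier example can be verified directly from definition (9); the sketch below (ours) approximates the minimization over $\gamma$ by a simple grid search:

```python
import numpy as np

def coherent_loss(u, gammas=np.logspace(-4, 4, 2000)):
    """Phi(u) of (9): min over gamma > 0 of (1/m) sum_i [1 - u_i/gamma]_+,
    approximated here by a grid search over gamma."""
    u = np.asarray(u, dtype=float)
    return min(np.mean(np.maximum(1.0 - u / g, 0.0)) for g in gammas)

def hinge_loss(u):
    return np.mean(np.maximum(1.0 - np.asarray(u, dtype=float), 0.0))

u1 = np.array([-1000.0] + [1000.0] * 99)   # one badly misclassified outlier
u2 = np.array([+1.0, -1.0] * 50)           # half of the samples misclassified

print(hinge_loss(u1), hinge_loss(u2))         # ~10.01 vs 1.0: the cumulative hinge loss prefers u2
print(coherent_loss(u1), coherent_loss(u2))   # ~0.02 vs 1.0: the coherent loss prefers u1
```

The coherent loss of $u_2$ evaluates to 1.0 here, consistent with the lower bound of 0.5 used in the text.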
EXAMPLE: LINEAR SVM

We illustrate the proposed method with the linear classification problem, and in particular the linear Support Vector Machine (SVM) (Boser et al., 1992; Cortes & Vapnik, 1995; Schölkopf & Smola, 2002). Given $m$ training samples $(y_i, x_i)_{i=1}^m$, the goal is to find a hyperplane that correctly classifies as many training samples as possible with a large margin, which leads to the following formulation:
$$\min_{w, b} \ \frac{1}{m}\sum_{i=1}^m 1[y_i(w^\top x_i + b) \le 0] \quad \text{s.t.} \quad \|w\|_2 \le C, \tag{11}$$
for a given $C > 0$. Since the objective function is nonconvex, Problem (11) is an intractable problem. Hence, the SVM uses the hinge loss $\phi_1(c) = [1 - c]_+$ as a convex surrogate. Following the proposed coherent loss function approach, we instead minimize the 0-1 loss function with margin $a \ge 0$:
$$\frac{1}{m}\sum_{i=1}^m 1[y_i(w^\top x_i + b) \le a],$$
and replace this objective function by the coherent loss function $\bar\rho(u)$ with $u_i = y_i(w^\top x_i + b) - a$ (the margin $a$ makes the condition in Theorem 5 hold, and approximating this 0-1 loss by the hinge loss still leads to the standard SVM). We then obtain the following formulation:
$$\min_{w, b, \gamma > 0} \ \frac{1}{m}\sum_{i=1}^m \big[1 - (y_i(w^\top x_i + b) - a)/\gamma\big]_+ \quad \text{s.t.} \quad \|w\|_2 \le C. \tag{12}$$
As discussed above, we can change variables $h = 1/\gamma$, $\hat w = w/\gamma$ and $\hat b = b/\gamma$, and simplify Formulation (12) as the following:
$$\min_{\hat w, \hat b, h > 0} \ \sum_{i=1}^m \big[1 + ah - y_i(\hat w^\top x_i + \hat b)\big]_+ \quad \text{s.t.} \quad \|\hat w\|_2 \le hC.$$
This is also equivalent to the robust formulation of SVM (Shivaswamy et al., 2006), in which each training sample imposes a chance constraint of the form
$$\inf_{\tilde x_i \sim (x_i, I)} \ \Pr\big[y_i(w^\top \tilde x_i + b) \ge 1 - \xi_i\big] \ge 1 - \kappa,$$
where $\tilde x_i \sim (x_i, I)$ denotes a family of distributions which have a common mean $x_i$ and covariance $I$, and $\kappa \triangleq a^2/(a^2 + C^2)$.

We next consider the case where one may wish to impose additional constraints on $w$. For instance, if the first feature is measured from a less reliable source, then an ideal classification rule should discount the importance of the first feature by imposing a constraint such as $|w_1| \le 0.001$. Thus, the linear classification problem becomes
$$\min_{w, b} \ \frac{1}{m}\sum_{i=1}^m 1[y_i(w^\top x_i + b) \le a] \quad \text{s.t.} \quad \|w\|_2 \le C, \ Aw \le d.$$
Using the coherent loss to replace the objective function, and simplifying the resulting formulation, we obtain the following second-order cone program:
$$\min_{\hat w, \hat b, h} \ \sum_{i=1}^m \big[1 + ah - y_i(\hat w^\top x_i + \hat b)\big]_+ \quad \text{s.t.} \quad \|\hat w\|_2 - Ch \le 0, \ A\hat w \le dh, \ h > 0.$$
Finally, we remark that the coherent loss approach can be kernelized, since a representer theorem (Schölkopf & Smola, 2002) still holds when the coherent loss function is used.
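For concreteness, here is a minimal sketch of the simplified coherent-loss linear SVM above, written with the cvxpy modeling package (this is our illustration, not the authors' code; the toy data and the values of C and a are arbitrary, and the open constraint h > 0 is relaxed to h >= 1e-6 for the solver):

```python
import numpy as np
import cvxpy as cp

# Toy data: m samples in n dimensions with labels in {-1, +1}.
rng = np.random.default_rng(0)
m, n = 80, 5
X = rng.normal(size=(m, n))
y = np.sign(X @ rng.normal(size=n) + 0.2 * rng.normal(size=m))

C, a = 10.0, 1.0                                   # illustrative values; the paper tunes a by cross-validation

w_hat, b_hat, h = cp.Variable(n), cp.Variable(), cp.Variable()
margins = cp.multiply(y, X @ w_hat + b_hat)
objective = cp.sum(cp.pos(1 + a * h - margins))    # sum_i [1 + a*h - y_i(w_hat'x_i + b_hat)]_+
constraints = [cp.norm(w_hat, 2) <= C * h,         # ||w_hat||_2 <= h*C
               h >= 1e-6]                          # stands in for h > 0
cp.Problem(cp.Minimize(objective), constraints).solve()

# Undo the change of variables from Theorem 5: (w, b) = (w_hat, b_hat) / h.
w, b = w_hat.value / h.value, b_hat.value / h.value
print("training error:", np.mean(y * (X @ w + b) <= 0))
```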
EXAMPLE: MULTI-CLASS SVM

The coherent loss function can also be applied to multi-class classification problems. The main idea of previous approaches to multi-class SVMs (Liu & Shen, 2006; Lee et al., 2004; Crammer & Singer, 2002) is to solve one single regularization problem by imposing a penalty on the values of $f_y(x) - f_z(x)$ for a sample $(x, y)$, where $f_y(\cdot)$ and $f_z(\cdot)$ are the decision functions for classes $y$ and $z$, respectively. Suppose that the training samples are drawn from $k$ different classes and the decision function is $f_y(x) = w_y^\top x + b_y$ for each $y = 1, \dots, k$. Consider the following 0-1 loss penalty formulation:
$$\min \ \frac{1}{m}\sum_{i=1}^m 1\Big[\min_{z \in [k], z \neq y_i}\{f_{y_i}(x_i) - f_z(x_i)\} \le a\Big] \quad \text{s.t.} \quad G_i(w_i) \le C, \ i = 1, \dots, k; \ \sum_{i=1}^k f_i = 0,$$
where $\sum_{i=1}^k f_i = (\sum_{i=1}^k w_i, \sum_{i=1}^k b_i)$, $G_i(\cdot)$ is convex (e.g., $G_i(\cdot) = \|\cdot\|_2$), and the margin $a \ge 0$. We can then apply the coherent loss function approach to make an approximation:
$$\min_{f_i, \gamma > 0} \ \frac{1}{m}\sum_{i=1}^m \Big[1 - \big(\min_{z \in [k], z \neq y_i}\{f_{y_i}(x_i) - f_z(x_i)\} - a\big)/\gamma\Big]_+ \quad \text{s.t.} \quad G_i(w_i) \le C, \ i = 1, \dots, k; \ \sum_{i=1}^k f_i = 0,$$
which can be simplified as the following:
$$\min_{\hat f_i, h > 0} \ \sum_{i=1}^m \Big[1 + ah + \max_{z \in [k], z \neq y_i}\{\hat f_z(x_i) - \hat f_{y_i}(x_i)\}\Big]_+ \quad \text{s.t.} \quad h\, G_i(\hat w_i/h) \le hC, \ i = 1, \dots, k; \ \sum_{i=1}^k \hat f_i = 0,$$
where $\hat f_i(x) = \hat w_i^\top x + \hat b_i$. Clearly, this is a convex optimization problem and can be solved efficiently.

4. Statistical Property

In this section, we provide a statistical interpretation of minimizing the coherent loss function. As is standard in learning theory, we assume that the training samples are drawn IID from an unknown distribution $P$, and the goal is to find a predictor $f(\cdot)$ such that the classification error of $f$, given below, is as small as possible:
$$L(f(\cdot)) = \mathbb{E}_{(\tilde x, \tilde y) \sim P}[I(f(\tilde x), \tilde y)].$$
Here $(\tilde x, \tilde y) \sim P$ means that the sample $(\tilde x, \tilde y)$ follows the distribution $P$, and $I(f(\tilde x), \tilde y) = 1[\tilde y f(\tilde x) \le 0]$. Recall that minimizing the coherent loss function is equivalent to minimizing the function
$$\Phi(u) \triangleq \min_{\gamma > 0} \ \frac{1}{m}\sum_{i=1}^m \phi_\gamma(u_i),$$
where $\phi_\gamma(u) = \max\{0, 1 - u/\gamma\}$. Let $\eta(x) = P[\tilde y = 1 \mid \tilde x = x]$; then the optimal Bayes error is $L^* = L(2\eta(\cdot) - 1)$.

We now develop an upper bound on the difference between $L(f(\cdot))$ and $L^*$ using techniques similar to those in Zhang (2004). For fixed $\gamma$, denote the expected loss of $f(\cdot)$ w.r.t. $\phi_\gamma(\cdot)$ by $Q_\gamma(f(\cdot)) = \mathbb{E}_{(\tilde x, \tilde y) \sim P}[\phi_\gamma(\tilde y f(\tilde x))]$, and define
$$Q_\gamma(\eta, f) = \eta\,\phi_\gamma(f) + (1 - \eta)\,\phi_\gamma(-f), \qquad \Delta Q_\gamma(\eta, f) = Q_\gamma(\eta, f) - Q_\gamma(\eta, f^*_\gamma(\eta)),$$
where $f^*_\gamma(\eta) = \arg\min_f Q_\gamma(\eta, f)$. Since $\phi_\gamma(u) = \max\{0, 1 - u/\gamma\}$, we have $f^*_\gamma(\eta) = \mathrm{sign}(2\eta - 1)\gamma$. Then we have the following lemma.

Lemma 1. For any $\gamma > 0$, we have $\Delta Q_\gamma(\eta, 0) = |2\eta - 1|$.

Proof. From the definitions of $Q_\gamma(\eta, f)$ and $\Delta Q_\gamma(\eta, f)$, we have
$$\Delta Q_\gamma(\eta, f) = \eta\big(\phi_\gamma(f) - \phi_\gamma(f^*_\gamma(\eta))\big) + (1 - \eta)\big(\phi_\gamma(-f) - \phi_\gamma(-f^*_\gamma(\eta))\big) = \eta\max\{0, 1 - f/\gamma\} + (1 - \eta)\max\{0, 1 + f/\gamma\} - \eta\max\{0, 1 - \mathrm{sign}(2\eta - 1)\} - (1 - \eta)\max\{0, 1 + \mathrm{sign}(2\eta - 1)\} = \eta\max\{0, 1 - f/\gamma\} + (1 - \eta)\max\{0, 1 + f/\gamma\} - 1 + |2\eta - 1|.$$
This implies that $\Delta Q_\gamma(\eta, 0) = |2\eta - 1|$.

By applying the lemma above, we can bound the excess classification error of $f(\cdot)$ in terms of $\mathbb{E}_{\tilde x}\,\Delta Q_\gamma(\eta(\tilde x), f(\tilde x))$.

Theorem 7. For any $\gamma > 0$ and any measurable function $f(x)$, we have
$$L(f(\cdot)) - L^* \ \le\ \mathbb{E}_{\tilde x}\,\Delta Q_\gamma(\eta(\tilde x), f(\tilde x)) \ =\ \mathbb{E}_{\tilde x}\big[Q_\gamma(\eta(\tilde x), f(\tilde x)) + |2\eta(\tilde x) - 1| - 1\big].$$

Proof. By the definition of $L(\cdot)$, it is easy to verify that
$$L(f(\cdot)) - L(2\eta(\cdot) - 1) = \mathbb{E}_{\eta(X) \ge 0.5,\, f(X) < 0}\big(2\eta(X) - 1\big) + \mathbb{E}_{\eta(X) < 0.5,\, f(X) \ge 0}\big(1 - 2\eta(X)\big) \ \le\ \mathbb{E}_{(2\eta(X) - 1) f(X) \le 0}\,|2\eta(X) - 1|.$$
From Lemma 1, $\Delta Q_\gamma(\eta, 0) = |2\eta - 1|$, so we have
$$L(f(\cdot)) - L^* \ \le\ \mathbb{E}_{(2\eta(\tilde x) - 1) f(\tilde x) \le 0}\,\Delta Q_\gamma(\eta(\tilde x), 0).$$
To complete the proof, since $\Delta Q_\gamma(\eta, f) = Q_\gamma(\eta, f) - Q_\gamma(\eta, f^*_\gamma(\eta))$, it suffices to show that $Q_\gamma(\eta(x), 0) \le Q_\gamma(\eta(x), f(x))$ for all $x$ such that $(2\eta(x) - 1) f(x) \le 0$. To see this, we consider three scenarios:
- $\eta > 0.5$: We have $f^*_\gamma(\eta) = \mathrm{sign}(2\eta - 1)\gamma > 0$. In addition, $(2\eta - 1)f \le 0$ implies that $f \le 0$. Since $0 \in [f, f^*_\gamma(\eta)]$ and by the convexity of $Q_\gamma(\eta, f)$ w.r.t. $f$, we have $Q_\gamma(\eta, 0) \le \max\{Q_\gamma(\eta, f), Q_\gamma(\eta, f^*_\gamma(\eta))\} = Q_\gamma(\eta, f)$.
- $\eta < 0.5$: In this case we have $f^*_\gamma(\eta) < 0$ and $f \ge 0$, which leads to $0 \in [f^*_\gamma(\eta), f]$, and hence $Q_\gamma(\eta, 0) \le \max\{Q_\gamma(\eta, f), Q_\gamma(\eta, f^*_\gamma(\eta))\} = Q_\gamma(\eta, f)$.
- $\eta = 0.5$: Note that $f^*_\gamma(\eta) = 0$, which implies that $Q_\gamma(\eta, 0) \le Q_\gamma(\eta, f)$ for all $f$.
From the proof of Lemma 1, we have $\Delta Q_\gamma(\eta, f) = Q_\gamma(\eta, f) + |2\eta - 1| - 1$. Hence the theorem holds.

Corollary 1. For any measurable function $f(x)$,
$$L(f(\cdot)) - L^* \ \le\ \min_{\gamma > 0} \ \mathbb{E}_{\tilde x}\big[Q_\gamma(\eta(\tilde x), f(\tilde x)) + |2\eta(\tilde x) - 1| - 1\big]. \tag{13}$$

Proof. Since Theorem 7 holds for any $\gamma > 0$, we obtain this corollary.
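Lemma 1 and the bound of Theorem 7 are easy to probe numerically. The following sketch (ours; the posterior η(x) and the predictor f are arbitrary choices made up for the check, not from the paper) verifies both by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 2.0
phi = lambda u: np.maximum(0.0, 1.0 - u / gamma)           # phi_gamma
Q = lambda eta, f: eta * phi(f) + (1.0 - eta) * phi(-f)    # Q_gamma(eta, f)

# Lemma 1: Delta Q_gamma(eta, 0) = |2*eta - 1|, using f*_gamma(eta) = sign(2*eta - 1)*gamma.
eta = rng.uniform(size=1000)
f_star = np.sign(2.0 * eta - 1.0) * gamma
assert np.allclose(Q(eta, 0.0) - Q(eta, f_star), np.abs(2.0 * eta - 1.0))

# Theorem 7: L(f) - L* <= E[ Q_gamma(eta(x), f(x)) + |2*eta(x) - 1| - 1 ].
x = rng.normal(size=200_000)
eta_x = 1.0 / (1.0 + np.exp(-2.0 * x))     # a hypothetical posterior P(y = 1 | x)
f_x = 0.5 * x - 0.3                        # an arbitrary measurable predictor
L_f = np.mean(np.where(f_x <= 0, eta_x, 1.0 - eta_x))    # classification error of f
L_star = np.mean(np.minimum(eta_x, 1.0 - eta_x))         # Bayes error
bound = np.mean(Q(eta_x, f_x) + np.abs(2.0 * eta_x - 1.0) - 1.0)
assert L_f - L_star <= bound + 1e-9
```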
For the training samples $\{x_i, y_i\}_{i=1}^m$, since $2\eta(x_i) - 1 = y_i \in \{-1, 1\}$, the empirical estimate of the bound in (13) is $\Phi(u) = \min_{\gamma > 0} \frac{1}{m}\sum_{i=1}^m \phi_\gamma(u_i)$ with $u_i = y_i f(x_i)$, which implies that minimizing the coherent loss function is equivalent to minimizing the empirical bound on the difference between $L(f(\cdot))$ and $L^*$.

5. Simulations

We report some numerical simulation results in this section to illustrate the proposed approach. Besides the regularization constraints (e.g., $\|w\| \le C$ for binary-class SVMs and $\|w_i\| \le C$, $i = 1, \dots, k$, for multi-class SVMs), we consider the case where additional linear constraints are also imposed on the coefficient vector $w$. For clarity, we choose a simple additional constraint $Aw \le T$ to compare the performance of the cumulative loss formulation (SVM) and our coherent loss formulation (CCLF) for binary-class and multi-class classification, where $A = [I_k, 0] \in \mathbb{R}^{k \times n}$. In other words, the constraint ensures that the maximum of the first $k$ elements of $w$ is bounded by $T$. We compare their performance under two cases: 1) $k$ is fixed and $T$ varies; 2) $T$ is fixed and $k$ varies.

Three binary-class datasets, Breast Cancer, Ionosphere and Diabetes, and two multi-class datasets, Wine and Iris, from UCI (Asuncion & Newman, 2007) are used, where we randomly pick 50% of the samples for training, 20% for validation, and the rest for testing. For the cumulative loss formulation approach, the parameter $C$ is determined by cross-validation. For the coherent loss formulation approach, the parameter $C$ is fixed while the parameter $a$ is determined by cross-validation. For each $T$, we repeated the experiments 20 times and computed the average classification errors. To solve the resulting optimization problems, we use CVX (Grant & Boyd, 2011; 2008) with Gurobi (Gurobi Optimization, 2013) as the solver.

Figure 2. Performance comparison of the cumulative loss approach vs. the coherent loss approach where the bound T is fixed and the fraction k/n varies from 0.0 to 1.0. The left and right columns report the classification errors for the two cases T = 0.1 and T = 0.3 (rows: Breast Cancer and Iris).

Figure 3. Performance comparison of the cumulative loss approach vs. the coherent loss approach. The left and right columns report the classification errors for the two cases k = 0.8n and k = n (recall that k and n are the numbers of rows and columns of the matrix A, respectively). The five rows, from top to bottom, report results for Breast Cancer, Ionosphere, Diabetes, Wine and Iris, respectively.

Figure 3 shows the simulation results under fixed $k$. Clearly, when additional constraints are imposed, it appears that the coherent loss approach consistently outperforms the cumulative loss approach. When $T$ is small, the cumulative loss approach performs much worse; when $T$ becomes large, its performance approaches that of the coherent loss approach. Figure 2 provides the results under fixed $T$, which shows that the coherent loss and cumulative loss approaches have similar performance when $k/n$ is small, but the coherent loss approach outperforms the cumulative loss approach when $k/n$ is large. We believe that these phenomena are due to the fact that the coherent loss is a better approximation of the empirical classification error.
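To make the experimental manipulation concrete, the sketch below (ours; cvxpy, with illustrative values of n, k, T and C) shows how the additional constraint enters each formulation: it is imposed directly on w in the cumulative-loss SVM, but scaled by h in the coherent-loss formulation because its variable is ŵ = hw (the objectives are as in the earlier linear-SVM sketch):

```python
import numpy as np
import cvxpy as cp

n, k, T, C = 10, 4, 0.1, 10.0                           # illustrative sizes and bounds
A = np.hstack([np.eye(k), np.zeros((k, n - k))])        # A = [I_k, 0]: selects the first k coordinates

# Cumulative-loss (hinge SVM) side: constrain w directly.
w, b = cp.Variable(n), cp.Variable()
svm_constraints = [cp.norm(w, 2) <= C, A @ w <= T]

# Coherent-loss side: the variables are (w_hat, b_hat, h) with w_hat = h*w,
# so the same constraint must be scaled by h.
w_hat, b_hat, h = cp.Variable(n), cp.Variable(), cp.Variable()
cclf_constraints = [cp.norm(w_hat, 2) <= C * h, A @ w_hat <= T * h, h >= 1e-6]
```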
6. Conclusion

In this paper, we revisit the standard cumulative-loss approach to dealing with the non-convexity of the 0-1 loss function in classification, namely minimizing the sum of convex surrogates for each sample. We propose the notion of coherent loss, which is a tractable upper bound of the total classification error for the entire sample set. This approach yields a strictly tighter approximation to the 0-1 loss than any cumulative loss, while preserving the tractability of the resulting optimization problem. The formulation obtained by applying the coherent loss to binary classification also has a robustness interpretation, which builds a strong connection between the coherent loss and robust SVMs. Finally, we remark that the coherent loss approach has favorable statistical properties, and the simulation results show that it can outperform the standard SVM when additional constraints are imposed.

Acknowledgments

This work is partially supported by the Ministry of Education of Singapore through AcRF Tier Two grant R-265000-443-112.

References

Arora, S., Babai, L., Stern, J., and Sweedyk, Z. The hardness of approximate optima in lattices, codes, and systems of linear equations. Journal of Computer and System Sciences, 54:317–331, 1997.

Asuncion, A. and Newman, D. J. UCI machine learning repository, 2007. URL http://www.ics.uci.edu/~mlearn/MLRepository.html.

Ben-David, S., Eiron, N., and Long, P. M. On the difficulty of approximately maximizing agreements. Journal of Computer and System Sciences, 66:496–513, 2003.

Boser, B. E., Guyon, I. M., and Vapnik, V. N. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pp. 144–152, New York, NY, 1992.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Brown, D. B. and Sim, M. Satisficing measures for analysis of risky positions. Management Science, 55(1):71–84, 2009.

Cortes, C. and Vapnik, V. N. Support vector networks. Machine Learning, 20:1–25, 1995.

Crammer, K. and Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2002.

Freund, Y. and Schapire, R. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

Friedman, J., Hastie, T., and Tibshirani, R. Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28:337–407, 2000.

Grant, M. and Boyd, S. Graph implementations for nonsmooth convex programs. In Recent Advances in Learning and Control, pp. 95–110. Springer-Verlag Limited, 2008.
Grant, M. and Boyd, S. CVX: Matlab software for disciplined convex programming, version 1.21. http://cvxr.com/cvx, 2011.

Gurobi Optimization, Inc. Gurobi optimizer reference manual, 2013. URL http://www.gurobi.com.

Lee, Y., Lin, Y., and Wahba, G. Multicategory support vector machines: theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99:67–81, 2004.

Liu, Y. and Shen, X. Multicategory ψ-learning. Journal of the American Statistical Association, 101(474):500–509, 2006.

Poggio, T., Rifkin, R., Mukherjee, S., and Niyogi, P. General conditions for predictivity in learning theory. Nature, 428(6981):419–422, 2004.

Schapire, R. E. and Singer, Y. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37:297–336, 1999.

Schölkopf, B. and Smola, A. J. Learning with Kernels. MIT Press, 2002.

Shivaswamy, P. K., Bhattacharyya, C., and Smola, A. J. Second order cone programming approaches for handling missing and uncertain data. Journal of Machine Learning Research, 7:1283–1314, July 2006.

Vapnik, V. N. and Chervonenkis, A. The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognition and Image Analysis, 1(3):260–284, 1991.

Vapnik, V. N. and Lerner, A. Pattern recognition using generalized portrait method. Automation and Remote Control, 24:744–780, 1963.

Zhang, T. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56–85, 2004.