The Coherent Loss Function for Classification

Wenzhuo Yang A0096049@NUS.EDU.SG
Department of Mechanical Engineering, National University of Singapore, Singapore 117576

Melvyn Sim DSCSIMM@NUS.EDU.SG
Department of Decision Sciences, National University of Singapore, Singapore 117576

Huan Xu MPEXUH@NUS.EDU.SG
Department of Mechanical Engineering, National University of Singapore, Singapore 117576

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Abstract

A prediction rule in binary classification that aims to achieve the lowest probability of misclassification involves minimizing over a nonconvex, 0-1 loss function, which is typically a computationally intractable optimization problem. To address the intractability, previous methods consider minimizing the cumulative loss, i.e., the sum of convex surrogates of the 0-1 loss of each sample. We revisit this paradigm and instead develop an axiomatic framework: we propose a set of salient properties for functions used in binary classification, and then propose the coherent loss approach, which is a tractable upper bound of the empirical classification error over the entire sample set. We show that the proposed approach yields a strictly tighter approximation to the empirical classification error than any convex cumulative loss approach while preserving the convexity of the underlying optimization problem, and this approach for binary classification also has a robustness interpretation which builds a connection to robust SVMs.

1. Introduction

The goal of supervised learning is to predict an unobserved output value $y$ from an observed input $x$. This is achieved by learning a functional relationship $y \approx f(x)$ from a set of observed training examples $\{(y_i, x_i)\}_{i=1}^m$. The quality of the predictor $f(\cdot)$ is often measured by some loss function $\ell(f(x), y)$. A typical statistical setup in machine learning assumes that all training data and testing samples are IID samples drawn from an unknown distribution $\mu$, and the goal is to find a predictor $f(\cdot)$ such that the expected loss $\mathbb{E}_{(y,x)\sim\mu}\,\ell(f(x), y)$ is minimized. Since $\mu$ is unknown, the expected loss is often replaced by the empirical loss
$$L_{\mathrm{emp}}(f) = \frac{1}{m}\sum_{i=1}^m \ell(f(x_i), y_i). \tag{1}$$
Minimizing $L_{\mathrm{emp}}(f)$, as well as numerous regularization-based variants of it, is one of the fundamental cornerstones of statistical machine learning (e.g., Vapnik & Lerner, 1963; Vapnik & Chervonenkis, 1991; Poggio et al., 2004).

This paper focuses on binary classification problems, where $y \in \{-1, +1\}$. A point $(y, x)$ is correctly predicted if $\mathrm{sign}(f(x)) = y$, and its classification error is given by the 0-1 loss $\ell(f(x), y) = 1(y \neq \mathrm{sign}(f(x))) = 1(yf(x) \le 0)$. Due to the non-convexity of the indicator function, minimizing the empirical classification error $\sum_i 1(y_i f(x_i) \le 0)$ is known to be NP-hard even to approximate (Arora et al., 1997; Ben-David et al., 2003). A number of methods have been proposed to mitigate this computational difficulty, all based on the idea of minimizing the cumulative loss, which is the sum of individual losses
$$\sum_{i=1}^m \phi(y_i f(x_i)),$$
where $\phi(\cdot)$ is a convex upper bound of the classification error $1(yf(x) \le 0)$. For example, AdaBoost (Freund & Schapire, 1997; Friedman et al., 2000; Schapire & Singer, 1999) employs the exponential loss function $\exp(-yf(x))$, and Support Vector Machines (SVMs) (Boser et al., 1992; Cortes & Vapnik, 1995) employ the hinge loss $\max\{1 - yf(x), 0\}$.
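To make the contrast concrete, the following small numerical sketch (ours, not part of the paper; the margin values are arbitrary illustrations) evaluates the 0-1 loss and the two surrogates just mentioned on a few margins $yf(x)$:

```python
import numpy as np

# Margins u = y*f(x): positive means correctly classified, negative means misclassified.
u = np.array([2.0, 0.5, -0.5, -3.0])

zero_one = (u <= 0).astype(float)       # 0-1 loss: 1 exactly when the sample is misclassified
exponential = np.exp(-u)                # AdaBoost surrogate exp(-y f(x))
hinge = np.maximum(1.0 - u, 0.0)        # SVM hinge surrogate max{1 - y f(x), 0}

# Each surrogate upper-bounds the 0-1 loss pointwise; the cumulative loss averages it over samples.
assert np.all(exponential >= zero_one) and np.all(hinge >= zero_one)
print(zero_one.mean(), exponential.mean(), hinge.mean())
```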
In this paper we revisit this paradigm, and introduce a notion termed coherent loss, as opposed to the cumulative loss used in the conventional approach. Briefly speaking, instead of using an upper bound of the individual classification error (the 0-1 loss), we propose to use a tractable upper bound of the total empirical classification error for the whole training set. That is, we look for $\Phi : \mathbb{R}^m \to \mathbb{R}$ such that
$$\Phi(c_1, \dots, c_m) \ge \frac{1}{m}\sum_{i=1}^m 1(c_i \le 0), \quad \forall (c_1, \dots, c_m) \in \mathbb{R}^m.$$
Intuitively, since coherent loss functions are more general than cumulative loss functions, one may expect to obtain a tighter and still tractable bound of the empirical classification error via a coherent loss function. We formalize this intuition in this paper. Specifically, our contributions include the following.

In Section 2, we consider a principled approach by formalizing the salient properties of functions, termed coherent classification loss functions, that could be used to quantify the performance of a classification rule. These functions have dual representations which enable us to identify the minimal coherent classification loss function, which, loosely speaking, is the coherent classification loss function that best approximates the 0-1 loss and achieves a tighter bound of the empirical classification error than any convex cumulative loss. We show that optimizing this function is equivalent to a convex optimization problem, and hence tractable.

In Section 3, we consider an equivalent form of the coherent loss function and then provide several applications of this loss function in classification problems. We remark that a tighter approximation of the 0-1 loss can potentially reduce the impact of outliers on the classification accuracy. A cumulative loss function may significantly deviate from the 0-1 loss when $c \ll 0$. Consequently, a misclassified outlier can incur a huge loss and prevent an otherwise perfect prediction rule from being selected. This sensitivity can be mitigated by a tighter approximation.

Section 4 provides a statistical interpretation of minimizing the coherent loss function. Section 5 reports experimental results which show that our classification method outperforms the standard SVM when additional constraints are imposed on the decision function.

Notations: We use boldface letters to represent column vectors, and capital letters for matrices. We reserve $e$ for special vectors: $e_i$ is the vector whose $i$-th entry is 1 and whose remaining entries are 0; $e_N$, where $N$ is an index set, is the vector whose entries indexed by $N$ equal 1 and whose remaining entries are 0; $e$ is the vector with all entries equal to 1. The $i$-th entry of a vector $x$ is denoted by $x_i$. We use $[c]_+$ to denote $\max\{0, c\}$ and $1[\cdot]$ to denote the indicator function, and let $\mathcal{P}_m$ be the set of all $m \times m$ permutation matrices and $I_p$ be the $p \times p$ identity matrix.

2. Coherent Classification Loss Function

We now propose the notion of coherent classification loss functions based on an axiomatic approach. Along the way, we show the existence of a tight coherent classification loss function which achieves a better approximation of the empirical classification error than any convex cumulative loss. The definition of the coherent classification loss function is motivated by analyzing the salient properties of functions used to quantify the performance of a classification rule. A natural approach is to elicit these properties from the classification error.
Specifically, given $u_1, \dots, u_m$, where $u_i$ is the decision value of the $i$-th sample, e.g., $u_i = y_i f(x_i)$, the classification error $\varrho : \mathbb{R}^m \to [0, 1]$ is given by
$$\varrho(u_1, \dots, u_m) = \frac{1}{m}\sum_{i=1}^m 1[u_i < 0]. \tag{2}$$
We next propose a set of properties; functions endowed with these properties are called coherent classification loss functions.

2.1. Salient Properties and Representation Theorem

We elicit the following five salient properties from the classification error. Consider $\rho(\cdot) : \mathbb{R}^m \to [0, 1]$.

Property 1 (Complete classification). $\rho(u) = 0$ if and only if $u \ge 0$.

Complete classification essentially says that if every sample is correctly classified, then the classification is optimal.

Property 2 (Misclassification avoidance). If $u < 0$, then $\rho(u) = 1$.

This property states that if all samples are misclassified, then it is the worst classification, and hence $\rho(\cdot)$ achieves its maximal value.

Property 3 (Monotonicity). If $u_1 \ge u_2$, then $\rho(u_1) \le \rho(u_2)$.

Monotonicity requires that if a decision better classifies every sample, then it is more desirable.

Property 4 (Order invariance). For any $P \in \mathcal{P}_m$, we have $\rho(u) = \rho(Pu)$.

Order invariance essentially states that the order of the samples does not matter. This is natural in the classification problem, since each sample is drawn IID and is treated equally.

Property 5 (Scale invariance). For all $\alpha > 0$, $\rho(\alpha u) = \rho(u)$.

Scale invariance is a property that the classification error function satisfies. It essentially means that changing the scale does not affect the preference between classifiers. While it may be debatable whether scale invariance is as necessary as the other properties, as we show later in this section, this property can be relaxed.
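As a quick numerical sanity check (a sketch in numpy, not part of the paper), the empirical classification error $\varrho$ of (2) indeed satisfies Properties 1-5 on example data; what it lacks, for $m \ge 2$, is quasi-convexity, which is what motivates the surrogate developed next:

```python
import numpy as np

def classification_error(u):
    """Empirical classification error (2): fraction of samples with negative decision value."""
    u = np.asarray(u, dtype=float)
    return np.mean(u < 0)

rng = np.random.default_rng(0)
u = rng.normal(size=10)
v = u + np.abs(rng.normal(size=10))       # v >= u componentwise: every sample classified "better"

# Property 1 (complete classification) and Property 2 (misclassification avoidance)
assert classification_error(np.abs(u)) == 0.0
assert classification_error(-np.abs(u) - 1.0) == 1.0
# Property 3 (monotonicity): improving every margin never increases the loss
assert classification_error(v) <= classification_error(u)
# Property 4 (order invariance): permuting the samples changes nothing
assert classification_error(rng.permutation(u)) == classification_error(u)
# Property 5 (scale invariance): rescaling all margins by a positive constant changes nothing
assert classification_error(3.7 * u) == classification_error(u)
```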
Definition 1 (Coherent Classification Loss). A function $\rho(\cdot) : \mathbb{R}^m \to [0, 1]$ is a coherent classification loss function (CCLF) if it satisfies Properties 1 to 5, and is quasi-convex and lower semi-continuous.

Here, quasi-convexity and lower semi-continuity are introduced for tractability. Our first result is a (dual) representation theorem for any CCLF. We need the following definition first.

Definition 2 (Admissible Class). A class of sets $V_k \subseteq \mathbb{R}^m$ parameterized by $k \in [0, 1]$ is called an admissible class if it satisfies the following properties:
1. For any $k \in [0, 1]$, $V_k$ is a closed, convex cone, and is order invariant. Here, being order invariant means that $v \in V_k$ implies $Pv \in V_k$ for any $P \in \mathcal{P}_m$;
2. $k \le k'$ implies $V_k \subseteq V_{k'}$;
3. $V_1 = \mathrm{cl}(\lim_{k \uparrow 1} V_k)$ and $V_0 = \lim_{k \downarrow 0} V_k$;
4. $V_1 = \mathbb{R}^m_+$;
5. For any $\lambda > 0$, we have $\lambda e \in V_0$.

Theorem 1 (Representation Theorem). A function $\rho(\cdot)$ is a CCLF if and only if it can be written as
$$\rho(u) = 1 - \sup\{k \in [0, 1] \mid \sup_{v \in V_k} (-v^\top u) \le 0\} \tag{3}$$
for an admissible class $\{V_k\}$. Here the sup over an empty set is taken as 0.

Proof. We sketch the proof and leave the details to the supplementary material. The "if" part is relatively easy, by checking that any function $\rho(u) = 1 - \sup\{k \in [0, 1] \mid \sup_{v \in V_k}(-v^\top u) \le 0\}$ for some admissible class $\{V_k\}$ satisfies all properties required of a CCLF. The "only if" part requires more work. We want to show that any given function $\rho(\cdot)$ which is a CCLF can be represented as (3) for some admissible class $\{V_k\}$. The proof consists of three steps. We first show that $\rho(\cdot)$ can be represented as $\rho(u) = 1 - \sup\{k \in [0, 1] \mid \sup_{v \in V_k}(-v^\top u) \le 0\}$ for some $\{V_k\}$, not necessarily admissible. This essentially follows from a result in Brown & Sim (2009). We then show that we can replace $V_k$ by a class of closed, convex, order-invariant cones $\bar V_k$. Specifically, we can pick $\bar V_k \triangleq \mathrm{cl}(\mathrm{cc}(\mathrm{or}(V_k)))$, where $\mathrm{or}(\cdot)$ (respectively $\mathrm{cc}(\cdot)$) is the minimal order-invariant (respectively, convex cone) superset. Finally, we show that $\{\bar V_k\}$ is admissible, by checking that all properties in Definition 2 are satisfied, which completes the proof.

2.2. Minimal coherent classification loss function

This section shows that among all CCLFs that upper-bound the classification error, there exists a minimal (i.e., best) one.

Theorem 2. Define $\bar\rho(\cdot) : \mathbb{R}^m \to [0, 1]$ as follows:
$$\bar\rho(u) = \frac{\max\{t : \sum_{i=1}^t u_{(i)} < 0\}}{m},$$
where $\{u_{(i)}\}$ is a permutation of $\{u_i\}$ in non-decreasing order, and the max over an empty set is taken as zero. Then the following holds.
1. $\bar\rho(\cdot)$ is a CCLF, and is an upper bound of the classification error, i.e., $\bar\rho(u) \ge \varrho(u)$ for all $u \in \mathbb{R}^m$.
2. Let $V_k \subseteq \mathbb{R}^m$ satisfy that if $k = 0$, then $V_k = \mathrm{conv}\{\lambda e \mid \lambda > 0\}$; and if $\frac{s}{m} < k \le \frac{s+1}{m}$ for $s = 0, \dots, m-1$, then $V_k = \mathrm{conv}\{\lambda e_N \mid \lambda > 0,\ N : |N| = m - s\}$, where $N$ is an index set. Then $\{V_k\}$ is an admissible class corresponding to $\bar\rho(\cdot)$.
3. $\bar\rho(\cdot)$ is the tightest CCLF bound. That is, if $\rho'(\cdot)$ is a CCLF and satisfies $\rho'(u) \ge \varrho(u)$ for all $u \in \mathbb{R}^m$, then $\rho'(u) \ge \bar\rho(u)$ for all $u \in \mathbb{R}^m$.

Proof. We provide a sketch of the proof and leave the details to the supplementary material. Claim 1 is relatively straightforward. It is also easy to see that $\{V_k\}$ is an admissible class, so one only needs to show that $\{V_k\}$ is the class corresponding to $\bar\rho(\cdot)$ to establish Claim 2. To show Claim 3, we let $\{V'_k\}$ be an admissible class corresponding to $\rho'(\cdot)$, and show that $\lambda e_N \in V'_k$, which further implies $V_k \subseteq V'_k$. This establishes Claim 3.

We next show that scale invariance can be relaxed. Indeed, for any quasi-convex upper bound of the classification error that satisfies the other properties, the minimal CCLF is a tighter bound.

Theorem 3. Let $\hat\rho : \mathbb{R}^m \to [0, 1]$ be a quasi-convex function that satisfies complete classification, misclassification avoidance, monotonicity, and order invariance, and such that $\hat\rho(u) \ge \varrho(u)$. Then there exists a CCLF $\rho(\cdot)$ such that
$$\varrho(u) \le \bar\rho(u) \le \rho(u) \le \hat\rho(u), \quad \forall u \in \mathbb{R}^m.$$

Proof. We sketch the proof. The main idea is to construct such a function $\rho(u) \triangleq \lim_{\epsilon \downarrow 0}[\min_{\gamma > 0} \hat\rho((u + \epsilon)/\gamma)]$, and show that $\rho(\cdot)$ is a CCLF and $\hat\rho(u) \ge \rho(u) \ge \varrho(u)$. Finally, since $\bar\rho(\cdot)$ is the minimal CCLF, this completes the proof.

One important property of $\bar\rho(\cdot)$ is that it achieves a better approximation of the empirical classification error than any convex cumulative loss.

Theorem 4. If $f(\cdot)$ is a convex function and an upper bound of the 0-1 loss function, then for any $u = (u_1, \dots, u_m)$, we have
$$\varrho(u) \le \bar\rho(u) \le \frac{1}{m}\sum_{i=1}^m f(u_i).$$

Proof. Without loss of generality, assume $(u_1, \dots, u_m)$ are in non-decreasing order. Let $p \triangleq \max\{i \mid u_i < 0\}$ and $q \triangleq \max\{t \mid \sum_{i=1}^t u_i < 0\}$, so that $\varrho(u) = p/m$, $\bar\rho(u) = q/m$, and
$$\sum_{i=1}^q u_i = \sum_{i=1}^p u_i + \sum_{i=p+1}^q u_i < 0.$$
Since $f(\cdot)$ is convex and $f(x) \ge 1[x \le 0]$, there exists $k \le 0$ such that $f(x) \ge \max\{kx + 1, 0\}$ (this can be done, for example, by taking $k$ as a subgradient of $f$ at $x = 0$). If $k = 0$, then $\frac{1}{m}\sum_{i=1}^m f(u_i) \ge 1 \ge \bar\rho(u)$, and the theorem holds. Otherwise $k < 0$, and we have
$$\sum_{i=1}^m f(u_i) \ \ge\ \sum_{i=1}^p f(u_i) + \sum_{i=p+1}^q f(u_i) \ \ge\ \sum_{i=1}^p (k u_i + 1) + \sum_{i=p+1}^q f(u_i) \ \ge\ p - k\sum_{i=p+1}^q u_i + \sum_{i=p+1}^q f(u_i) \ =\ p + \sum_{i=p+1}^q \big(f(u_i) - k u_i\big),$$
where the last inequality uses $k\sum_{i=1}^p u_i > -k\sum_{i=p+1}^q u_i$, which follows from $\sum_{i=1}^q u_i < 0$ and $k < 0$. Note that $u_i \ge 0$ for $i = p + 1, \dots, m$. If $u_i \ge -1/k$, then $f(u_i) - k u_i \ge -k u_i \ge 1$; otherwise $f(u_i) - k u_i \ge k u_i + 1 - k u_i = 1$. Hence $p + \sum_{i=p+1}^q (f(u_i) - k u_i) \ge p + (q - p) = q$, so $\frac{1}{m}\sum_{i=1}^m f(u_i) \ge q/m$. By the definition of $\bar\rho(u)$, the theorem holds.
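The minimal CCLF of Theorem 2 is easy to evaluate by sorting. The following sketch (ours; random test data) computes $\bar\rho$ and checks the sandwich of Theorem 4 against the cumulative hinge loss:

```python
import numpy as np

def rho_bar(u):
    """Minimal CCLF of Theorem 2: (largest t such that the sum of the t smallest
    entries of u is negative) divided by m; 0 if no such t exists."""
    m = len(u)
    partial = np.cumsum(np.sort(u))            # sums of the t smallest entries, t = 1..m
    neg = np.flatnonzero(partial < 0)
    return (neg[-1] + 1) / m if neg.size else 0.0

def zero_one(u):
    return np.mean(np.asarray(u) < 0)          # empirical classification error (2)

def hinge_avg(u):
    return np.mean(np.maximum(1.0 - np.asarray(u), 0.0))   # a convex cumulative loss

rng = np.random.default_rng(0)
for _ in range(1000):
    u = rng.normal(size=20) * rng.uniform(0.1, 5.0)
    # Theorem 2 (claim 1) and Theorem 4: 0-1 error <= rho_bar <= average hinge loss
    assert zero_one(u) <= rho_bar(u) <= hinge_avg(u) + 1e-12
```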
2.3. Optimization with the coherent loss function

We now discuss the computational issue of optimizing the minimal CCLF $\bar\rho(\cdot)$. Indeed, we show that this can be converted to a tractable convex optimization problem. Specifically, we consider the following problem in variables $(u, w)$:
$$\min_{u, w} \ \bar\rho(u) \quad \text{s.t.} \quad f_j(u, w) \le 0, \ j = 1, \dots, n, \tag{4}$$
where the $f_j(\cdot, \cdot)$ are convex functions. We have the following theorem.

Theorem 5. Assume that complete classification is not achievable, i.e., there is no feasible $(u, w)$ with $u \ge 0$. Let $(s^*, t^*, h^*)$ be an optimal solution to the following optimization problem:
$$\min_{h, s, t} \ \frac{1}{m}\sum_{i=1}^m [1 - s_i]_+ \quad \text{s.t.} \quad h f_j(s/h, t/h) \le 0, \ j = 1, \dots, n; \ h > 0. \tag{5}$$
Then $(s^*/h^*, t^*/h^*)$ is an optimal solution to Problem (4).

Proof. We provide a sketch of the proof. We first show that the level set of Problem (4), i.e.,
$$U_i \triangleq \{(u, w) \mid \bar\rho(u) \le 1 - i/m;\ f_j(u, w) \le 0, \ \forall j\}$$
for $i = 1, \dots, m$, equals the set
$$\{(u, w) \mid \exists d > 0 : \ \sum_{l=1}^m [d - u_l]_+ \le (m - i + 1)d;\ f_j(u, w) \le 0, \ \forall j\}.$$
This can be proved by applying Theorem 2 and then using linear programming duality. This set can further be shown to be equivalent to the feasible set of
$$\sum_{l=1}^m [1 - u_l/d]_+ \le m - i + 1, \quad f_j(u, w) \le 0, \ j = 1, \dots, n. \tag{6}$$
Thus, finding an optimal solution to Problem (4) is equivalent to solving the following problem:
$$\min_{u, w, d > 0} \ \sum_{l=1}^m [1 - u_l/d]_+ \quad \text{s.t.} \quad f_j(u, w) \le 0, \ j = 1, \dots, n.$$
Letting $h = 1/d$, $s = hu$ and $t = hw$, the theorem is established.

Notice that $h f_j(s/h, t/h)$ is the perspective function of $f_j(\cdot, \cdot)$, and is hence jointly convex in $(h, s, t)$ (Boyd & Vandenberghe, 2004). Thus, Problem (5) is equivalent to a tractable convex optimization problem.

3. Equivalent Formulation and Applications

From Theorem 5, when there is no $(u, w)$ such that $u \ge 0$ and $f_j(u, w) \le 0$ for $j = 1, \dots, n$, Problem (4) is equivalent to the following optimization problem:
$$\min_{u, w} \ \Phi(u) \quad \text{s.t.} \quad f_j(u, w) \le 0, \ j = 1, \dots, n, \tag{8}$$
where $\Phi(u)$ is defined by
$$\Phi(u) \triangleq \min_{\gamma > 0} \ \frac{1}{m}\sum_{i=1}^m [1 - u_i/\gamma]_+. \tag{9}$$
From this formulation, we also show, from another perspective, that minimizing the coherent loss function is equivalent to minimizing a tighter upper bound of the 0-1 loss function; in other words, the coherent loss function achieves a better approximation of the empirical classification error than any convex cumulative loss.

Theorem 6. Let $\phi : \mathbb{R} \to \mathbb{R}_+$ be a non-increasing, convex function that satisfies $\phi(c) \ge 1(c \le 0)$ for all $c \in \mathbb{R}$. Then for all $u \in \mathbb{R}^m$:
$$\frac{1}{m}\sum_{i=1}^m 1(u_i \le 0) \ \le\ \Phi(u) \ \le\ \frac{1}{m}\sum_{i=1}^m \phi(u_i).$$

Proof. Recall that the hinge loss $\phi_1(c) \triangleq [1 - c]_+$ is the tightest convex bound of the 0-1 loss which has derivative (or sub-gradient) $-1$ at $c = 0$ (e.g., Schölkopf & Smola, 2002). That is, if a convex function $\phi(\cdot)$ satisfies $\phi(c) \ge 1(c \le 0)$ for all $c$, and also satisfies $-1 \in \partial\phi(0)$, then $\phi_1(c) \le \phi(c)$ for all $c$. Similarly, $\phi_\gamma(c) \triangleq \max\{1 - c/\gamma, 0\}$ is the tightest convex bound of the 0-1 loss with derivative $-1/\gamma$ at $c = 0$. Since $\phi(\cdot)$ is non-increasing, it cannot have a positive derivative at $c = 0$. Thus, $\Phi(\cdot)$ is a tighter bound than any non-increasing, convex cumulative loss function.

Recall that (8) is equivalent to the following convex optimization problem
$$\min_{h, s, t} \ \frac{1}{m}\sum_{i=1}^m [1 - s_i]_+ \quad \text{s.t.} \quad h f_j(s/h, t/h) \le 0, \ j = 1, \dots, n; \ h > 0,$$
which can be solved efficiently. We now provide some applications of the proposed coherent loss function.

First, we illustrate with an example that the proposed coherent loss function can be more robust to outliers. Let $u_1, u_2 \in \mathbb{R}^{100}$ be the following:
$$u_1 = (-1000, 1000, 1000, \dots, 1000), \qquad u_2 = (+1, -1, +1, -1, \dots, +1, -1).$$
In this case, $u_2$ appears to be the less favorable classification, since 50% of the samples are misclassified. It is easy to check that $u_1$ incurs a much larger hinge loss than $u_2$, even though only one of its samples is misclassified. In contrast, the coherent loss of $u_1$ is no more than 0.02 (take $\gamma = 1000$), while that of $u_2$ is at least 0.5 (since 50% of the samples are misclassified, and the coherent loss is an upper bound). Thus, the coherent loss is more robust in this example, partly because it better approximates the 0-1 loss, and hence is less affected by large outliers. See Figure 1.

Figure 1. Illustration of the effect of outliers on the cumulative loss vs. the coherent loss. Here, $w_1$ has margin $u_1$, and $w_2$ has margin $u_2$. The cumulative loss approach will pick $w_2$, whereas the proposed method will pick $w_1$, which is a better classification.
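The numbers in the outlier example can be verified directly from definition (9); the sketch below (ours) approximates the minimization over $\gamma$ by a simple grid search:

```python
import numpy as np

def coherent_loss(u, gammas=np.logspace(-4, 4, 2000)):
    """Phi(u) of (9): min over gamma > 0 of (1/m) sum_i [1 - u_i/gamma]_+,
    approximated here by a grid search over gamma."""
    u = np.asarray(u, dtype=float)
    return min(np.mean(np.maximum(1.0 - u / g, 0.0)) for g in gammas)

def hinge_loss(u):
    return np.mean(np.maximum(1.0 - np.asarray(u, dtype=float), 0.0))

u1 = np.array([-1000.0] + [1000.0] * 99)   # one badly misclassified outlier
u2 = np.array([+1.0, -1.0] * 50)           # half of the samples misclassified

print(hinge_loss(u1), hinge_loss(u2))         # ~10.01 vs 1.0: the cumulative hinge loss prefers u2
print(coherent_loss(u1), coherent_loss(u2))   # ~0.02 vs 1.0: the coherent loss prefers u1
```

The coherent loss of $u_2$ evaluates to 1.0 here, consistent with the lower bound of 0.5 used in the text.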
EXAMPLE: LINEAR SVM

We illustrate the proposed method with the linear classification problem, and in particular the linear Support Vector Machine (SVM) (Boser et al., 1992; Cortes & Vapnik, 1995; Schölkopf & Smola, 2002). Given $m$ training samples $(y_i, x_i)_{i=1}^m$, the goal is to find a hyperplane that correctly classifies as many training samples as possible with a large margin, which leads to the following formulation:
$$\min_{w, b} \ \frac{1}{m}\sum_{i=1}^m 1[y_i(w^\top x_i + b) \le 0] \quad \text{s.t.} \quad \|w\|_2 \le C, \tag{11}$$
for a given $C > 0$. Since the objective function is nonconvex, Problem (11) is an intractable problem. Hence, the SVM uses the hinge loss $\phi_1(c) = [1 - c]_+$ as a convex surrogate. Following the proposed coherent loss function approach, we instead minimize the 0-1 loss function with margin $a \ge 0$:
$$\frac{1}{m}\sum_{i=1}^m 1[y_i(w^\top x_i + b) \le a],$$
and replace this objective function by the coherent loss function $\bar\rho(u)$ with $u_i = y_i(w^\top x_i + b) - a$ (the margin $a$ makes the condition in Theorem 5 hold, and approximating this 0-1 loss by the hinge loss still leads to the standard SVM). We then obtain the following formulation:
$$\min_{w, b, \gamma > 0} \ \frac{1}{m}\sum_{i=1}^m \big[1 - (y_i(w^\top x_i + b) - a)/\gamma\big]_+ \quad \text{s.t.} \quad \|w\|_2 \le C. \tag{12}$$
As discussed above, we can change variables $h = 1/\gamma$, $\hat w = w/\gamma$ and $\hat b = b/\gamma$, and simplify Formulation (12) as the following:
$$\min_{\hat w, \hat b, h > 0} \ \sum_{i=1}^m \big[1 + ah - y_i(\hat w^\top x_i + \hat b)\big]_+ \quad \text{s.t.} \quad \|\hat w\|_2 \le hC.$$
This is also equivalent to the robust formulation of SVM (Shivaswamy et al., 2006), in which each training sample imposes a chance constraint of the form
$$\inf_{\tilde x_i \sim (x_i, I)} \ \Pr\big[y_i(w^\top \tilde x_i + b) \ge 1 - \xi_i\big] \ge 1 - \kappa,$$
where $\tilde x_i \sim (x_i, I)$ denotes a family of distributions which have a common mean $x_i$ and covariance $I$, and $\kappa \triangleq a^2/(a^2 + C^2)$.

We next consider the case where one may wish to impose additional constraints on $w$. For instance, if the first feature is measured from a less reliable source, then an ideal classification rule should discount the importance of the first feature by imposing a constraint such as $|w_1| \le 0.001$. Thus, the linear classification problem becomes
$$\min_{w, b} \ \frac{1}{m}\sum_{i=1}^m 1[y_i(w^\top x_i + b) \le a] \quad \text{s.t.} \quad \|w\|_2 \le C, \ Aw \le d.$$
Using the coherent loss to replace the objective function, and simplifying the resulting formulation, we obtain the following second-order cone program:
$$\min_{\hat w, \hat b, h} \ \sum_{i=1}^m \big[1 + ah - y_i(\hat w^\top x_i + \hat b)\big]_+ \quad \text{s.t.} \quad \|\hat w\|_2 - Ch \le 0, \ A\hat w \le dh, \ h > 0.$$
Finally, we remark that the coherent loss approach can be kernelized, since a representer theorem (Schölkopf & Smola, 2002) still holds when the coherent loss function is used.
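For concreteness, here is a minimal sketch of the simplified coherent-loss linear SVM above, written with the cvxpy modeling package (this is our illustration, not the authors' code; the toy data and the values of C and a are arbitrary, and the open constraint h > 0 is relaxed to h >= 1e-6 for the solver):

```python
import numpy as np
import cvxpy as cp

# Toy data: m samples in n dimensions with labels in {-1, +1}.
rng = np.random.default_rng(0)
m, n = 80, 5
X = rng.normal(size=(m, n))
y = np.sign(X @ rng.normal(size=n) + 0.2 * rng.normal(size=m))

C, a = 10.0, 1.0                                   # illustrative values; the paper tunes a by cross-validation

w_hat, b_hat, h = cp.Variable(n), cp.Variable(), cp.Variable()
margins = cp.multiply(y, X @ w_hat + b_hat)
objective = cp.sum(cp.pos(1 + a * h - margins))    # sum_i [1 + a*h - y_i(w_hat'x_i + b_hat)]_+
constraints = [cp.norm(w_hat, 2) <= C * h,         # ||w_hat||_2 <= h*C
               h >= 1e-6]                          # stands in for h > 0
cp.Problem(cp.Minimize(objective), constraints).solve()

# Undo the change of variables from Theorem 5: (w, b) = (w_hat, b_hat) / h.
w, b = w_hat.value / h.value, b_hat.value / h.value
print("training error:", np.mean(y * (X @ w + b) <= 0))
```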
EXAMPLE: MULTI-CLASS SVM

The coherent loss function can also be applied to multi-class classification problems. The main idea of previous approaches to multi-class SVMs (Liu & Shen, 2006; Lee et al., 2004; Crammer & Singer, 2002) is to solve one single regularization problem by imposing a penalty on the values of $f_y(x) - f_z(x)$ for a sample $(x, y)$, where $f_y(\cdot)$ and $f_z(\cdot)$ are the decision functions for classes $y$ and $z$, respectively. Suppose that the training samples are drawn from $k$ different classes and the decision function is $f_y(x) = w_y^\top x + b_y$ for each $y = 1, \dots, k$. Consider the following 0-1 loss penalty formulation:
$$\min \ \frac{1}{m}\sum_{i=1}^m 1\Big[\min_{z \in [k], z \neq y_i}\{f_{y_i}(x_i) - f_z(x_i)\} \le a\Big] \quad \text{s.t.} \quad G_i(w_i) \le C, \ i = 1, \dots, k; \ \sum_{i=1}^k f_i = 0,$$
where $\sum_{i=1}^k f_i = (\sum_{i=1}^k w_i, \sum_{i=1}^k b_i)$, $G_i(\cdot)$ is convex (e.g., $G_i(\cdot) = \|\cdot\|_2$), and the margin $a \ge 0$. We can then apply the coherent loss function approach to make an approximation:
$$\min_{f_i, \gamma > 0} \ \frac{1}{m}\sum_{i=1}^m \Big[1 - \big(\min_{z \in [k], z \neq y_i}\{f_{y_i}(x_i) - f_z(x_i)\} - a\big)/\gamma\Big]_+ \quad \text{s.t.} \quad G_i(w_i) \le C, \ i = 1, \dots, k; \ \sum_{i=1}^k f_i = 0,$$
which can be simplified as the following:
$$\min_{\hat f_i, h > 0} \ \sum_{i=1}^m \Big[1 + ah + \max_{z \in [k], z \neq y_i}\{\hat f_z(x_i) - \hat f_{y_i}(x_i)\}\Big]_+ \quad \text{s.t.} \quad h\, G_i(\hat w_i/h) \le hC, \ i = 1, \dots, k; \ \sum_{i=1}^k \hat f_i = 0,$$
where $\hat f_i(x) = \hat w_i^\top x + \hat b_i$. Clearly, this is a convex optimization problem and can be solved efficiently.

4. Statistical Property

In this section, we provide a statistical interpretation of minimizing the coherent loss function. As is standard in learning theory, we assume that the training samples are drawn IID from an unknown distribution $P$, and the goal is to find a predictor $f(\cdot)$ such that the classification error of $f$, given below, is as small as possible:
$$L(f(\cdot)) = \mathbb{E}_{(\tilde x, \tilde y) \sim P}[I(f(\tilde x), \tilde y)].$$
Here $(\tilde x, \tilde y) \sim P$ means that the sample $(\tilde x, \tilde y)$ follows the distribution $P$, and $I(f(\tilde x), \tilde y) = 1[\tilde y f(\tilde x) \le 0]$. Recall that minimizing the coherent loss function is equivalent to minimizing the function
$$\Phi(u) \triangleq \min_{\gamma > 0} \ \frac{1}{m}\sum_{i=1}^m \phi_\gamma(u_i),$$
where $\phi_\gamma(u) = \max\{0, 1 - u/\gamma\}$. Let $\eta(x) = P[\tilde y = 1 \mid \tilde x = x]$; then the optimal Bayes error is $L^* = L(2\eta(\cdot) - 1)$.

We now develop an upper bound on the difference between $L(f(\cdot))$ and $L^*$ using techniques similar to those in Zhang (2004). For fixed $\gamma$, denote the expected loss of $f(\cdot)$ w.r.t. $\phi_\gamma(\cdot)$ by $Q_\gamma(f(\cdot)) = \mathbb{E}_{(\tilde x, \tilde y) \sim P}[\phi_\gamma(\tilde y f(\tilde x))]$, and define
$$Q_\gamma(\eta, f) = \eta\,\phi_\gamma(f) + (1 - \eta)\,\phi_\gamma(-f), \qquad \Delta Q_\gamma(\eta, f) = Q_\gamma(\eta, f) - Q_\gamma(\eta, f^*_\gamma(\eta)),$$
where $f^*_\gamma(\eta) = \arg\min_f Q_\gamma(\eta, f)$. Since $\phi_\gamma(u) = \max\{0, 1 - u/\gamma\}$, we have $f^*_\gamma(\eta) = \mathrm{sign}(2\eta - 1)\gamma$. Then we have the following lemma.

Lemma 1. For any $\gamma > 0$, we have $\Delta Q_\gamma(\eta, 0) = |2\eta - 1|$.

Proof. From the definitions of $Q_\gamma(\eta, f)$ and $\Delta Q_\gamma(\eta, f)$, we have
$$\Delta Q_\gamma(\eta, f) = \eta\big(\phi_\gamma(f) - \phi_\gamma(f^*_\gamma(\eta))\big) + (1 - \eta)\big(\phi_\gamma(-f) - \phi_\gamma(-f^*_\gamma(\eta))\big) = \eta\max\{0, 1 - f/\gamma\} + (1 - \eta)\max\{0, 1 + f/\gamma\} - \eta\max\{0, 1 - \mathrm{sign}(2\eta - 1)\} - (1 - \eta)\max\{0, 1 + \mathrm{sign}(2\eta - 1)\} = \eta\max\{0, 1 - f/\gamma\} + (1 - \eta)\max\{0, 1 + f/\gamma\} - 1 + |2\eta - 1|.$$
This implies that $\Delta Q_\gamma(\eta, 0) = |2\eta - 1|$.

By applying the lemma above, we can bound the excess classification error of $f(\cdot)$ in terms of $\mathbb{E}_{\tilde x}\,\Delta Q_\gamma(\eta(\tilde x), f(\tilde x))$.

Theorem 7. For any $\gamma > 0$ and any measurable function $f(x)$, we have
$$L(f(\cdot)) - L^* \ \le\ \mathbb{E}_{\tilde x}\,\Delta Q_\gamma(\eta(\tilde x), f(\tilde x)) \ =\ \mathbb{E}_{\tilde x}\big[Q_\gamma(\eta(\tilde x), f(\tilde x)) + |2\eta(\tilde x) - 1| - 1\big].$$

Proof. By the definition of $L(\cdot)$, it is easy to verify that
$$L(f(\cdot)) - L(2\eta(\cdot) - 1) = \mathbb{E}_{\eta(X) \ge 0.5,\, f(X) < 0}\big(2\eta(X) - 1\big) + \mathbb{E}_{\eta(X) < 0.5,\, f(X) \ge 0}\big(1 - 2\eta(X)\big) \ \le\ \mathbb{E}_{(2\eta(X) - 1) f(X) \le 0}\,|2\eta(X) - 1|.$$
From Lemma 1, $\Delta Q_\gamma(\eta, 0) = |2\eta - 1|$, so we have
$$L(f(\cdot)) - L^* \ \le\ \mathbb{E}_{(2\eta(\tilde x) - 1) f(\tilde x) \le 0}\,\Delta Q_\gamma(\eta(\tilde x), 0).$$
To complete the proof, since $\Delta Q_\gamma(\eta, f) = Q_\gamma(\eta, f) - Q_\gamma(\eta, f^*_\gamma(\eta))$, it suffices to show that $Q_\gamma(\eta(x), 0) \le Q_\gamma(\eta(x), f(x))$ for all $x$ such that $(2\eta(x) - 1) f(x) \le 0$. To see this, we consider three scenarios:
- $\eta > 0.5$: We have $f^*_\gamma(\eta) = \mathrm{sign}(2\eta - 1)\gamma > 0$. In addition, $(2\eta - 1)f \le 0$ implies that $f \le 0$. Since $0 \in [f, f^*_\gamma(\eta)]$ and by the convexity of $Q_\gamma(\eta, f)$ w.r.t. $f$, we have $Q_\gamma(\eta, 0) \le \max\{Q_\gamma(\eta, f), Q_\gamma(\eta, f^*_\gamma(\eta))\} = Q_\gamma(\eta, f)$.
- $\eta < 0.5$: In this case we have $f^*_\gamma(\eta) < 0$ and $f \ge 0$, which leads to $0 \in [f^*_\gamma(\eta), f]$, and hence $Q_\gamma(\eta, 0) \le \max\{Q_\gamma(\eta, f), Q_\gamma(\eta, f^*_\gamma(\eta))\} = Q_\gamma(\eta, f)$.
- $\eta = 0.5$: Note that $f^*_\gamma(\eta) = 0$, which implies that $Q_\gamma(\eta, 0) \le Q_\gamma(\eta, f)$ for all $f$.
From the proof of Lemma 1, we have $\Delta Q_\gamma(\eta, f) = Q_\gamma(\eta, f) + |2\eta - 1| - 1$. Hence the theorem holds.

Corollary 1. For any measurable function $f(x)$,
$$L(f(\cdot)) - L^* \ \le\ \min_{\gamma > 0} \ \mathbb{E}_{\tilde x}\big[Q_\gamma(\eta(\tilde x), f(\tilde x)) + |2\eta(\tilde x) - 1| - 1\big]. \tag{13}$$

Proof. Since Theorem 7 holds for any $\gamma > 0$, we obtain this corollary.
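Lemma 1 and the bound of Theorem 7 are easy to probe numerically. The following sketch (ours; the posterior η(x) and the predictor f are arbitrary choices made up for the check, not from the paper) verifies both by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 2.0
phi = lambda u: np.maximum(0.0, 1.0 - u / gamma)           # phi_gamma
Q = lambda eta, f: eta * phi(f) + (1.0 - eta) * phi(-f)    # Q_gamma(eta, f)

# Lemma 1: Delta Q_gamma(eta, 0) = |2*eta - 1|, using f*_gamma(eta) = sign(2*eta - 1)*gamma.
eta = rng.uniform(size=1000)
f_star = np.sign(2.0 * eta - 1.0) * gamma
assert np.allclose(Q(eta, 0.0) - Q(eta, f_star), np.abs(2.0 * eta - 1.0))

# Theorem 7: L(f) - L* <= E[ Q_gamma(eta(x), f(x)) + |2*eta(x) - 1| - 1 ].
x = rng.normal(size=200_000)
eta_x = 1.0 / (1.0 + np.exp(-2.0 * x))     # a hypothetical posterior P(y = 1 | x)
f_x = 0.5 * x - 0.3                        # an arbitrary measurable predictor
L_f = np.mean(np.where(f_x <= 0, eta_x, 1.0 - eta_x))    # classification error of f
L_star = np.mean(np.minimum(eta_x, 1.0 - eta_x))         # Bayes error
bound = np.mean(Q(eta_x, f_x) + np.abs(2.0 * eta_x - 1.0) - 1.0)
assert L_f - L_star <= bound + 1e-9
```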
For the training samples $\{x_i, y_i\}_{i=1}^m$, since $2\eta(x_i) - 1 = y_i \in \{-1, 1\}$, the empirical estimate of the bound in (13) is $\Phi(u) = \min_{\gamma > 0} \frac{1}{m}\sum_{i=1}^m \phi_\gamma(u_i)$ with $u_i = y_i f(x_i)$, which implies that minimizing the coherent loss function is equivalent to minimizing the empirical bound on the difference between $L(f(\cdot))$ and $L^*$.

5. Simulations

We report some numerical simulation results in this section to illustrate the proposed approach. Besides the regularization constraints (e.g., $\|w\| \le C$ for binary-class SVMs and $\|w_i\| \le C$, $i = 1, \dots, k$, for multi-class SVMs), we consider the case where additional linear constraints are also imposed on the coefficient vector $w$. For clarity, we choose a simple additional constraint $Aw \le T$ to compare the performance of the cumulative loss formulation (SVM) and our coherent loss formulation (CCLF) for binary-class and multi-class classification, where $A = [I_k, 0] \in \mathbb{R}^{k \times n}$. In other words, the constraint ensures that the maximum of the first $k$ elements of $w$ is bounded by $T$. We compare their performance under two cases: 1) $k$ is fixed and $T$ varies; 2) $T$ is fixed and $k$ varies.

Three binary-class datasets, Breast Cancer, Ionosphere and Diabetes, and two multi-class datasets, Wine and Iris, from UCI (Asuncion & Newman, 2007) are used, where we randomly pick 50% of the samples for training, 20% for validation, and the rest for testing. For the cumulative loss formulation approach, the parameter $C$ is determined by cross-validation. For the coherent loss formulation approach, the parameter $C$ is fixed while the parameter $a$ is determined by cross-validation. For each $T$, we repeated the experiments 20 times and computed the average classification errors. To solve the resulting optimization problems, we use CVX (Grant & Boyd, 2011; 2008) with Gurobi (Gurobi Optimization, 2013) as the solver.

Figure 2. Performance comparison of the cumulative loss approach vs. the coherent loss approach where the bound T is fixed and the fraction k/n varies from 0.0 to 1.0. The left and right columns report the classification errors for the two cases T = 0.1 and T = 0.3 (rows: Breast Cancer and Iris).

Figure 3. Performance comparison of the cumulative loss approach vs. the coherent loss approach. The left and right columns report the classification errors for the two cases k = 0.8n and k = n (recall that k and n are the numbers of rows and columns of the matrix A, respectively). The five rows, from top to bottom, report results for Breast Cancer, Ionosphere, Diabetes, Wine and Iris, respectively.

Figure 3 shows the simulation results under fixed $k$. Clearly, when additional constraints are imposed, it appears that the coherent loss approach consistently outperforms the cumulative loss approach. When $T$ is small, the cumulative loss approach performs much worse; when $T$ becomes large, its performance approaches that of the coherent loss approach. Figure 2 provides the results under fixed $T$, which shows that the coherent loss and cumulative loss approaches have similar performance when $k/n$ is small, but the coherent loss approach outperforms the cumulative loss approach when $k/n$ is large. We believe that these phenomena are due to the fact that the coherent loss is a better approximation of the empirical classification error.
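To make the experimental manipulation concrete, the sketch below (ours; cvxpy, with illustrative values of n, k, T and C) shows how the additional constraint enters each formulation: it is imposed directly on w in the cumulative-loss SVM, but scaled by h in the coherent-loss formulation because its variable is ŵ = hw (the objectives are as in the earlier linear-SVM sketch):

```python
import numpy as np
import cvxpy as cp

n, k, T, C = 10, 4, 0.1, 10.0                           # illustrative sizes and bounds
A = np.hstack([np.eye(k), np.zeros((k, n - k))])        # A = [I_k, 0]: selects the first k coordinates

# Cumulative-loss (hinge SVM) side: constrain w directly.
w, b = cp.Variable(n), cp.Variable()
svm_constraints = [cp.norm(w, 2) <= C, A @ w <= T]

# Coherent-loss side: the variables are (w_hat, b_hat, h) with w_hat = h*w,
# so the same constraint must be scaled by h.
w_hat, b_hat, h = cp.Variable(n), cp.Variable(), cp.Variable()
cclf_constraints = [cp.norm(w_hat, 2) <= C * h, A @ w_hat <= T * h, h >= 1e-6]
```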
6. Conclusion

In this paper, we revisit the standard cumulative-loss approach to dealing with the non-convexity of the 0-1 loss function in classification, namely minimizing the sum of convex surrogates for each sample. We propose the notion of coherent loss, which is a tractable upper bound of the total classification error for the entire sample set. This approach yields a strictly tighter approximation to the 0-1 loss than any cumulative loss, while preserving the tractability of the resulting optimization problem. The formulation obtained by applying the coherent loss to binary classification also has a robustness interpretation, which builds a strong connection between the coherent loss and robust SVMs. Finally, we remark that the coherent loss approach has favorable statistical properties, and the simulation results show that it can outperform the standard SVM when additional constraints are imposed.

Acknowledgments

This work is partially supported by the Ministry of Education of Singapore through AcRF Tier Two grant R-265000-443-112.

References

Arora, S., Babai, L., Stern, J., and Sweedyk, Z. The hardness of approximate optima in lattices, codes, and systems of linear equations. Journal of Computer and System Sciences, 54:317–331, 1997.

Asuncion, A. and Newman, D. J. UCI machine learning repository, 2007. URL http://www.ics.uci.edu/~mlearn/MLRepository.html.

Ben-David, S., Eiron, N., and Long, P. M. On the difficulty of approximately maximizing agreements. Journal of Computer and System Sciences, 66:496–513, 2003.

Boser, B. E., Guyon, I. M., and Vapnik, V. N. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pp. 144–152, New York, NY, 1992.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Brown, D. B. and Sim, M. Satisficing measures for analysis of risky positions. Management Science, 55(1):71–84, 2009.

Cortes, C. and Vapnik, V. N. Support vector networks. Machine Learning, 20:1–25, 1995.

Crammer, K. and Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2002.

Freund, Y. and Schapire, R. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

Friedman, J., Hastie, T., and Tibshirani, R. Additive logistic regression: A statistical view of boosting. Annals of Statistics, 28:337–407, 2000.

Grant, M. and Boyd, S. Graph implementations for nonsmooth convex programs. In Recent Advances in Learning and Control, pp. 95–110. Springer-Verlag Limited, 2008.
Grant, M. and Boyd, S. CVX: Matlab software for disciplined convex programming, version 1.21. http://cvxr.com/cvx, 2011.

Gurobi Optimization, Inc. Gurobi optimizer reference manual, 2013. URL http://www.gurobi.com.

Lee, Y., Lin, Y., and Wahba, G. Multicategory support vector machines: theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99:67–81, 2004.

Liu, Y. and Shen, X. Multicategory ψ-learning. Journal of the American Statistical Association, 101(474):500–509, 2006.

Poggio, T., Rifkin, R., Mukherjee, S., and Niyogi, P. General conditions for predictivity in learning theory. Nature, 428(6981):419–422, 2004.

Schapire, R. E. and Singer, Y. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37:297–336, 1999.

Schölkopf, B. and Smola, A. J. Learning with Kernels. MIT Press, 2002.

Shivaswamy, P. K., Bhattacharyya, C., and Smola, A. J. Second order cone programming approaches for handling missing and uncertain data. Journal of Machine Learning Research, 7:1283–1314, July 2006.

Vapnik, V. N. and Chervonenkis, A. The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognition and Image Analysis, 1(3):260–284, 1991.

Vapnik, V. N. and Lerner, A. Pattern recognition using generalized portrait method. Automation and Remote Control, 24:744–780, 1963.

Zhang, T. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56–85, 2004.