# Optimal Classification with Multivariate Losses

Nagarajan Natarajan (T-NANATA@MICROSOFT.COM), Microsoft Research, India
Oluwasanmi Koyejo (SANMI@ILLINOIS.EDU), Stanford University, CA & University of Illinois at Urbana-Champaign, IL, USA
Pradeep Ravikumar (PRADEEPR@CS.UTEXAS.EDU), The University of Texas at Austin, TX, USA
Inderjit S. Dhillon (INDERJIT@CS.UTEXAS.EDU), The University of Texas at Austin, TX, USA

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48.

## Abstract

Multivariate loss functions are extensively employed in several prediction tasks arising in information retrieval. Often, the goal in these tasks is to minimize the expected loss when retrieving relevant items from a presented set of items, where the expectation is with respect to the joint distribution over item sets. Our key result is that for most multivariate losses, the expected loss is provably optimized by sorting the items by the conditional probability of the label being positive and then selecting the top $k$ items. Such a result was previously known only for the F-measure. Leveraging this optimality characterization, we give an algorithm for estimating optimal predictions in practice, with runtime quadratic in the size of the item set for many losses. We provide empirical results on benchmark datasets, comparing the proposed algorithm to state-of-the-art methods for optimizing multivariate losses.

## 1. Introduction

A recent flurry of theoretical results and practical algorithms highlights a growing interest in understanding and optimizing general multivariate losses (Joachims, 2005; Petterson and Caetano, 2011; Dembczynski et al., 2011; Ye et al., 2012; Koyejo et al., 2014; Narasimhan et al., 2014). Conventional losses, such as the 0-1 loss (or the error) in binary or multiclass classification and the Hamming loss in multilabel learning, fall short in many applications with imbalanced and rare-event classification tasks, such as medical diagnosis, fraud detection, and information retrieval (Lewis and Gale, 1994; Drummond and Holte, 2005; Gu et al., 2009; He and Garcia, 2009). Practitioners instead employ multivariate loss functions such as the F-measure that capture non-linear trade-offs between the entries of the confusion matrix, namely true positives, false positives, true negatives and false negatives. Multivariate loss functions are defined on vector-valued predictions, such as label vectors in multilabel learning, subsets of classes in multiclass classification, and so on. The goal in these prediction tasks is to minimize the expected loss when retrieving relevant items from a presented set of items, where the expectation is with respect to the joint distribution over item sets (which models uncertainty in the data). Algorithmic approaches for minimizing expected multivariate losses, such as structured support vector machines, have been proposed (Joachims, 2005; Petterson and Caetano, 2011). Interestingly, the optimization can be performed efficiently for most multivariate losses that can be written as a function of the entries of the confusion matrix. However, there are no known consistency guarantees for these approaches; in fact, Dembczynski et al. (2013) recently showed that structured loss minimization is not consistent for the F-measure. On the other hand, an important theoretical question is if and when one can explicitly characterize the optimal solution for the loss minimization problem.
A key difficulty in the analysis of general multivariate losses is that they are often non-decomposable, i.e., the loss on prediction vectors does not decompose into a sum of losses over individual predictions. Two decades ago, Lewis (1995) showed that the optimal solution to minimizing the expected F-measure, under the assumption that the labels are conditionally independent, admits a simple form: in particular, it requires only knowledge of the marginals $P(Y_i \mid x)$. Later, Dembczynski et al. (2011) showed that the optimal solution for the F-measure can be described using $O(n^2)$ parameters for a general distribution $P$ over sets of $n$ items. Since then, there has been extensive work focusing on the binary and multilabel F-measure (Dembczynski et al., 2012; Ye et al., 2012; Dembczynski et al., 2013; Waegeman et al., 2014; Lipton et al., 2014). Yet the question of characterizing the optimal prediction for many other losses used in practice remains largely open.

In this paper, we provide a general theoretical analysis of expected loss minimization for general, non-decomposable, multivariate losses in binary, multiclass and multilabel prediction problems. Under a conditional independence assumption on the underlying distribution, we show that the optimal solution for most losses exhibits a simple form and depends only on the conditional probability of the label being positive. In multiclass classification, the joint distribution is a multinomial, and we show that a similar result holds. In particular, we identify a natural sufficient condition on the loss under which it admits such a simple characterization of the optimal: we require the loss to be monotonically decreasing in true positives. This condition is satisfied by most, if not all, loss functions used in practice, including the monotonic family studied by Narasimhan et al. (2014) and the linear-fractional family studied by Koyejo et al. (2014; 2015). As a special case of our analysis, we naturally recover the F-measure result of Lewis (1995) for binary classification, and the result of Coz Velasco et al. (2009) for multiclass classification.

Minimizing (and even evaluating) expected multivariate losses can involve exponential-time computation, even when given access to the exact label distribution. As we show, in light of our main result characterizing optimal predictions, and with careful implementation, these computations can be greatly simplified. We give an algorithm that runs in $O(n^3)$ time for a general loss, where $n$ is the size of the item set (the number of instances, labels or classes, as the case may be). For special cases such as $F_\beta$ and Jaccard, the algorithm can be implemented to run in time $O(n^2)$. We prove that our overall procedure for computing the optimal prediction in practice is consistent. We also support our theoretical results with experimental evaluation on synthetic and real-world datasets.

**Related Work:** We highlight some of the key results relating to prediction with general multivariate losses. Existing theoretical analysis has focused on two distinct approaches for characterizing the population version of multivariate losses, identified by Ye et al. (2012) as decision-theoretic analysis (DTA) and empirical utility maximization (EUM). In DTA, the goal is to minimize the expected loss of a classifier on sets of predictions, which is the setting of our work here, while in EUM, the goal is to minimize the loss applied to the population confusion matrix, whose entries are expected values.
We can interpret DTA as minimizing the average loss over an infinite collection of test sets, each of a fixed size, while EUM can be interpreted as minimizing the loss of a classifier over a single, infinitely large test set. More recently, there have been several theoretical and algorithmic advances relating to general performance measures (Parambath et al., 2014; Koyejo et al., 2014; Narasimhan et al., 2014; Kar et al., 2014; Narasimhan et al., 2015; Koyejo et al., 2015) used in binary, multiclass and multilabel classification in the EUM setting. In stark contrast, we know much less about the setting of this paper; several authors have proposed algorithms for empirical optimization of the expected $F_\beta$ measure, including Chai (2005), Jansche (2007) and Dembczynski et al. (2011). Ye et al. (2012) compare the DTA and EUM analyses for $F_\beta$, showing an asymptotic equivalence as the number of test samples goes to infinity. Quevedo et al. (2012) propose a cubic-complexity dynamic programming algorithm for computing the optimal labeling under a conditional label independence assumption, albeit without providing an optimality characterization or consistency guarantees.

## 2. Multivariate Loss Minimization

We consider the problem of making multivariate predictions $y = (y_1, y_2, \ldots, y_n) \in \{0,1\}^n$ for a given set of instances (described by their features) $x \in \mathcal{X}$. Let $X$ denote the random variable for instances, and $Y$ the random variable for label vectors of length $n$ (with $Y_i$ denoting the random variable for the $i$th label). Given a multivariate loss $L$, the goal is to minimize the expected loss with respect to the underlying joint distribution $P$ over $X$ and $Y$:
$$h^* = \arg\min_{h:\, x \mapsto y} \; \mathbb{E}_{(X,Y)\sim P}\, L(h(X), Y).$$
Note that $h^*$ optimizes the expected loss with respect to the conditional $P(Y \mid x)$ at each $x$; therefore it suffices to analyze the optimal predictions at a given $x$:
$$h^*(x) = \arg\min_{h:\, x \mapsto y} \; \mathbb{E}_{Y \sim P(Y \mid x)}\, L(h(x), Y). \quad (1)$$
The choices of $x$, $y$ and $L$ depend on the prediction task:

- In binary classification, $x = \{x_1, x_2, \ldots, x_n\}$ is a set of instances, $y_i \in \{0,1\}$ corresponds to the label of instance $x_i$, and $h_i(x) \in \{0,1\}$ denotes the predicted label for $x_i$. One may be interested in retrieving the positive instances with high precision or high recall (or a function of the two, such as the F-measure). We let $L : \{0,1\}^n \times \{0,1\}^n \to \mathbb{R}_+$.
- In multiclass classification with $n$ classes, $x = \{x\}$ is an instance and $y \in [n]$ corresponds to the class of instance $x$, whereas the prediction $h(x) \subseteq [n]$ corresponds to a subset of predicted classes (represented by a binary vector in $\{0,1\}^n$). Here we want to retrieve the true class within a predicted subset of small size $k \ll n$; so $L : [n] \times \{0,1\}^n \to \mathbb{R}_+$. This problem can be thought of as learning non-deterministic classifiers (Coz Velasco et al., 2009).
- In multilabel learning with $n$ labels, $x = \{x\}$ is an instance, and $y, h(x) \in \{0,1\}^n$ correspond to the sets of relevant and predicted labels, respectively, for instance $x$. Popular choices of loss functions include the multilabel F-measure and the Hamming loss. Here, $L : \{0,1\}^n \times \{0,1\}^n \to \mathbb{R}_+$.

### 2.1. Hardness of the general setting

The optimization problem (1) essentially entails a search over binary vectors of length $n$. Efficient search for the optimal solution, or obtaining a consistent estimator of it, can be notoriously hard depending on the loss function and the joint distribution.
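To make the exponential cost concrete, the following is a minimal Python illustration (not from the paper) of the objective in (1): evaluating the expectation requires summing over all $2^n$ label vectors, and searching for the minimizer adds another factor of $2^n$. The marginals `eta` below are hypothetical.

```python
# Brute-force evaluation of Eq. (1) for a tiny item set; cost is O(4^n).
from itertools import product

def f1_loss(s, y):
    """Non-decomposable F1 loss on binary prediction/label vectors."""
    tp = sum(si * yi for si, yi in zip(s, y))
    denom = sum(s) + sum(y)
    return 0.0 if denom == 0 else 1.0 - 2.0 * tp / denom  # convention: loss 0 if both empty

def p_cond_indep(y, eta):
    """P(y | x) under conditional independence with marginals eta_i = P(Y_i = 1 | x)."""
    prob = 1.0
    for yi, ei in zip(y, eta):
        prob *= ei if yi == 1 else 1.0 - ei
    return prob

def brute_force_optimal(loss, eta):
    """Exhaustive minimization of E_{Y ~ P(.|x)} loss(s, Y) over s in {0,1}^n."""
    n = len(eta)
    best_s, best_val = None, float("inf")
    for s in product([0, 1], repeat=n):
        val = sum(p_cond_indep(y, eta) * loss(s, y) for y in product([0, 1], repeat=n))
        if val < best_val:
            best_s, best_val = s, val
    return best_s, best_val

print(brute_force_optimal(f1_loss, eta=[0.9, 0.6, 0.2]))
```

Even for a decomposable loss the brute-force search has the same cost if applied naively; the point of the analysis below is that structure in the loss and the distribution removes the need for it.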
Of course, if the loss function is decomposable over instances (such as the Hamming loss in multilabel learning, or the 0-1 loss in binary classification), a consistent estimator of the optimal solution can be obtained via empirical risk minimization. Dembczynski et al. (2011) showed that for the F-measure and an arbitrary distribution $P$, the optimal solution to (1) can be obtained provided we can estimate $O(n^2)$ parameters of $P$; in particular, $P(Y_i = 1, \sum_{j=1}^n Y_j = s \mid x)$ for $(i, s) \in [n] \times [n]$. In the case of the subset 0/1 loss, defined as $L(h(x), y) = 1$ if $h(x) \neq y$ and $L(h(x), y) = 0$ otherwise, $h^*(x)$ is the mode of the distribution $P(Y \mid x)$, which is infeasible to estimate for arbitrary $P$. For general multivariate losses, one popular algorithmic approach is to employ structural support vector machines (Joachims, 2005; Petterson and Caetano, 2011), which optimize a convex upper bound of the expected loss on training data. However, there are no consistency results known for this approach; in fact, Dembczynski et al. (2013) show that structural SVMs are inconsistent for arbitrary $P$ in the case of the F-measure. More recently, Wang et al. (2015) study the multivariate loss minimization problem from an adversarial point of view and provide a game-theoretic solution. Inevitably, they require solving sub-problems of the form (1), which (as they discuss) can be worked out for a few specific losses (such as the F-measure) but is hard in general.

### 2.2. Conditional independence

Consider the setting where $P(Y \mid x)$ satisfies conditional independence. In binary classification, instances are typically assumed to be i.i.d., and therefore conditional independence $P(Y \mid x) = \prod_{i=1}^n P(Y_i \mid x_i)$ holds. In multilabel learning, conditional label independence $P(Y \mid x) = \prod_{i=1}^n P(Y_i \mid x)$ may be a strong assumption, as labels are likely to be correlated in practice. It has been known for a long time that for the F-measure, under conditional independence (Lewis, 1995), the optimal solution to (1) can be computed by simply sorting the instances according to $P(Y = 1 \mid x_i)$ and setting the labels of the top $k$ instances to 1 and the rest to 0 (for some $0 \le k \le n$). As a consequence, we only require estimates of the marginals to compute the optimal solution. For convenience, denote $h(x)$ by $s \in \{0,1\}^n$ (and the optimal solution to (1) by $s^*$).

**Theorem (Lewis (1995)).** Consider:
$$L_{F_\beta}(s, y) = 1 - \frac{(1+\beta^2)\sum_{i=1}^n s_i y_i}{\sum_{i=1}^n s_i + \beta^2 \sum_{i=1}^n y_i}. \quad (2)$$
Let the $x_i$ be sorted in decreasing order of the marginals $P(Y = 1 \mid x_i)$. Then the optimal prediction $s^*$ in (1) for the $F_\beta$ loss in (2) is given by $s^*_i = 1$ for $i \in [k^*]$ and $s^*_i = 0$ otherwise, for some $0 \le k^* \le n$ that may depend on $x$.

But such a characterization of optimality is not known for other multivariate losses used in practice.

### 2.3. Multinomial distributions

Note that in multiclass classification with $n$ classes $C_1, C_2, \ldots, C_n$, one can alternatively think of a joint distribution $P(Y \mid x)$ over label vectors $y \in \{0,1\}^n$ that is supported only on $y$ satisfying $\sum_{i=1}^n y_i = 1$. Letting $e_i \in \{0,1\}^n$ be the indicator vector for class $C_i$, define:
$$P(Y = e_i \mid x) = P(Y = C_i \mid x), \qquad P(Y = y \mid x) = 0 \text{ otherwise.} \quad (3)$$
Characterizing the optimal solution is simple in this setting as well. Coz Velasco et al. (2009) proved a result very similar to that of Lewis (1995) (although the proof turns out to be much simpler in this case), in the context of multiclass classification with the $F_\beta$-measure.

**Theorem (Coz Velasco et al. (2009)).** Consider:
$$s^* = \arg\min_{s \in \{0,1\}^n} \; \mathbb{E}_{Y \sim P(Y \mid x)}\, L_{F_\beta}(s, e_Y),$$
where $L_{F_\beta}$ is defined as in (2).
Let the classes $C_i$ be indexed in decreasing order of the conditionals $P(Y = C_i \mid x)$. Then $s^*$ satisfies $s^*_i = 1$ for $i \in [k^*]$ and $s^*_i = 0$ otherwise, for some $0 \le k^* \le n$ that may depend on $x$.

Next, we show that the above results indeed hold for general multivariate losses.

## 3. Main Results

We now show that for most multivariate losses $L$, the corresponding optimal predictions (1) can be explicitly characterized for multinomial distributions (as in multiclass classification) and for joint distributions satisfying conditional independence (as in binary classification, and in multilabel learning under the conditional label independence assumption). To this end, we consider general losses that are functions of the entries of the so-called confusion matrix, namely true positives, true negatives, false positives and false negatives. The empirical confusion matrix is computed as
$$\widehat{C}(s, y) = \begin{bmatrix} \widehat{\mathrm{TP}} & \widehat{\mathrm{FN}} \\ \widehat{\mathrm{FP}} & \widehat{\mathrm{TN}} \end{bmatrix}$$
with entries
$$\widehat{\mathrm{TP}} = \frac{1}{n}\sum_{i=1}^n s_i y_i, \quad \widehat{\mathrm{TN}} = \frac{1}{n}\sum_{i=1}^n (1 - s_i)(1 - y_i), \quad \widehat{\mathrm{FP}} = \frac{1}{n}\sum_{i=1}^n s_i(1 - y_i), \quad \widehat{\mathrm{FN}} = \frac{1}{n}\sum_{i=1}^n (1 - s_i) y_i.$$
Most multivariate loss functions used in practice (see Table 1) admit this form. For example, $L_{F_\beta}$ in (2) can be written as:
$$L_{F_\beta}(s, y) := L_{F_\beta}(\widehat{C}(s, y)) = 1 - \frac{(1+\beta^2)\,\widehat{\mathrm{TP}}}{(1+\beta^2)\,\widehat{\mathrm{TP}} + \beta^2\, \widehat{\mathrm{FN}} + \widehat{\mathrm{FP}}}.$$
Without loss of generality, we let $L : [0,1]^4 \to \mathbb{R}_+$ denote a multivariate loss function evaluated on the entries of the confusion matrix. We begin by observing the following equivalent representation of the loss $L$. All proofs in the manuscript are supplied in the Appendix due to limited space.

**Proposition 1.** Let $u = u(s, y) := \widehat{\mathrm{TP}}(s, y)$, $v = v(s) := \frac{1}{n}\sum_i s_i$ and $p = p(y) := \frac{1}{n}\sum_i y_i$. Then there exists $\Phi : [0,1]^3 \to \mathbb{R}_+$ such that $L(\widehat{C}(s, y)) = \Phi(u(s, y), v(s), p(y))$.

Next, we define a certain monotonicity property, which can be seen to be satisfied by popular loss functions.

**Definition 1 (TP Monotonicity).** A loss function $L$ is said to be TP monotonic if for any $v$, $p$, and $u_1 > u_2$, it holds that $\Phi(u_1, v, p) < \Phi(u_2, v, p)$.

In other words, $L$ satisfies TP monotonicity if the corresponding representation $\Phi$ (Proposition 1) is monotonically decreasing in its first argument. It is easy to verify, for instance, that $\Phi_{F_\beta}(u, v, p) = 1 - \frac{(1+\beta^2)u}{\beta^2 p + v}$ is monotonically decreasing in $u$. We are now ready to state our main result on minimizing expected losses with respect to distributions satisfying conditional independence.

**Theorem 1 (Binary Losses).** Assume the joint distribution $P$ satisfies conditional independence. Let $L$ be a multivariate loss that satisfies TP monotonicity. Then the optimal predictions $s^* := h^*(x)$ in (1) satisfy:
$$\min\{P(Y = 1 \mid x_i) : s^*_i = 1\} \;\ge\; \max\{P(Y = 1 \mid x_i) : s^*_i = 0\}.$$
In other words, after sorting the instances in decreasing order of $P(Y = 1 \mid x_i)$, we have $s^*_i = 1$ for all $i \in [k^*]$ and $s^*_i = 0$ for $i \notin [k^*]$, for some $0 \le k^* \le n$ that may depend on $x$.

Note that the above result also applies to multilabel classification, albeit under the conditional label independence assumption. Next, we give a result for multiclass classification, where the joint distribution is multinomial.

**Theorem 2 (Multiclass Losses).** Fix a multinomial distribution $P$ as in (3). Let $L$ be a multivariate loss that satisfies TP monotonicity. Let the classes $C_i$ be indexed in decreasing order of the conditionals $P(Y = C_i \mid x)$. Then the optimal prediction $s^* := h^*(x)$ in (1) is given by $s^*_i = 1$ for $i \in [k^*]$ and $s^*_i = 0$ otherwise, for some $0 \le k^* \le n$ that may depend on $x$.

The proof is essentially a direct consequence of TP monotonicity, and thus gives a generalization of the result in (Coz Velasco et al., 2009).
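As an illustration only (not part of the paper), the structure in Theorem 1 can be checked numerically on a tiny example: with conditionally independent labels and a TP-monotonic loss, the exact minimizer of (1) found by brute force is a "top-$k$ by marginal" prediction. The sketch below uses the Jaccard loss, $\Phi(u, v, p) = 1 - u/(v + p - u)$, as one TP-monotonic example, with arbitrary random marginals.

```python
from itertools import product
import random

def phi_jaccard(u, v, p):
    # Phi(u, v, p) = 1 - u / (v + p - u); decreasing in u, hence TP monotonic.
    denom = v + p - u
    return 0.0 if denom == 0 else 1.0 - u / denom

def expected_loss(s, eta, phi):
    """Exact E_{Y ~ P(.|x)} Phi(u, v, p) under conditionally independent labels."""
    n = len(eta)
    total = 0.0
    for y in product([0, 1], repeat=n):
        prob = 1.0
        for yi, ei in zip(y, eta):
            prob *= ei if yi else 1.0 - ei
        u = sum(si * yi for si, yi in zip(s, y)) / n
        total += prob * phi(u, sum(s) / n, sum(y) / n)
    return total

random.seed(0)
eta = sorted((random.random() for _ in range(5)), reverse=True)  # sorted marginals
s_star = min(product([0, 1], repeat=len(eta)),
             key=lambda s: expected_loss(s, eta, phi_jaccard))
# With eta sorted in decreasing order, s_star should be a prefix of ones.
print(s_star, list(s_star) == sorted(s_star, reverse=True))
```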
### 3.1. Recovered and New Results

It is clear that our results generalize those of Lewis (1995) and Coz Velasco et al. (2009) for expected loss minimization. We now draw attention to some recent optimality results in classification with general loss functions, where the objective is different from (1). The motivation of this section is to highlight the generality of our analysis and to place our contributions in the context of recent results in a closely related setting. For a binary classifier $h : \mathcal{X} \to \{0,1\}$ and a distribution $P$ over $\mathcal{X} \times \{0,1\}$, let $C(h; P) = \begin{bmatrix} \mathrm{TP} & \mathrm{FN} \\ \mathrm{FP} & \mathrm{TN} \end{bmatrix}$ represent the population confusion matrix with entries:
$$\mathrm{TP} = P(h(x) = 1, y = 1), \quad \mathrm{TN} = P(h(x) = 0, y = 0), \quad \mathrm{FP} = P(h(x) = 1, y = 0), \quad \mathrm{FN} = P(h(x) = 0, y = 1).$$
Koyejo et al. (2014) and Narasimhan et al. (2014) are interested in optimizing the following notion of expected loss:
$$h^* = \arg\min_{h : \mathcal{X} \to \{0,1\}} L(C(h; P)),$$
i.e., $L$ is applied to the population confusion matrix. Under mild assumptions on the distribution $P$, they show, for a large family of loss functions, that the optimal solution takes a thresholded form, i.e., $h^*(x) = \mathrm{sign}(P(Y = 1 \mid x) - \delta^*)$, where $\delta^*$ is a constant that depends only on the distribution and the loss itself. In contrast, for the expected loss minimization in (1), there need not in general be a threshold that minimizes the expected loss on a given $x$ (as also discussed in (Lewis, 1995)).

**The Fractional-Linear Family:** Koyejo et al. (2014; 2015) studied a large family $\mathcal{L}_{FL}$ of losses represented by:
$$\Phi_{FL}(\widehat{\mathrm{TP}}(s, y), v(s), p(y)) = \frac{c_0 + c_1 \widehat{\mathrm{TP}} + c_2 v + c_3 p}{d_0 + d_1 \widehat{\mathrm{TP}} + d_2 v + d_3 p} \quad (4)$$
for constants $c_i, d_i$, $i \in \{0, 1, 2, 3\}$. We identify a subclass of $\mathcal{L}_{FL}$ where our results apply. The following result can be proven by inspection and is stated without proof.

**Proposition 2.** If $c_1 < d_1$, then the loss in $\mathcal{L}_{FL}$ satisfies TP monotonicity.

Note that this essentially constitutes the most useful losses in $\mathcal{L}_{FL}$, where an increase in true positives (for a fixed total number of predictions) leads to a decrease in loss.

**Metrics from Narasimhan et al. (2014):** The following alternative three-parameter representation of losses $L$ was studied by Narasimhan et al. (2014). Let $p = p(y) := \frac{1}{n}\sum_i y_i$, $r_p = \widehat{\mathrm{TPR}}(s, y) := \frac{\widehat{\mathrm{TP}}(s, y)}{p(y)}$ and $r_n = \widehat{\mathrm{TNR}}(s, y) := \frac{\widehat{\mathrm{TN}}(s, y)}{1 - p(y)}$. Then there exists $\Gamma : [0,1]^3 \to \mathbb{R}_+$ such that:
$$L(\widehat{C}(s, y)) = \Gamma(\widehat{\mathrm{TPR}}(s, y), \widehat{\mathrm{TNR}}(s, y), p(y)). \quad (5)$$
As shown in Table 1, many losses used in practice are easily represented in this form. The representation of additional losses is simplified by including the empirical precision, given by $\widehat{\mathrm{Prec}}(s, y) = \frac{\widehat{\mathrm{TP}}(s, y)}{v(s)}$, where $v(s) := \frac{1}{n}\sum_i s_i = \widehat{\mathrm{TP}} + \widehat{\mathrm{FP}}$. Consider the following monotonicity property relevant to the representation (5).

**Definition 2 (TPR/TNR Monotonicity).** A loss $L$ is said to be TPR/TNR monotonic if, for fixed $p$, whenever $r_{p_1} > r_{p_2}$ and $r_{n_1} > r_{n_2}$, it holds that $\Gamma(r_{p_1}, r_{n_1}, p) < \Gamma(r_{p_2}, r_{n_2}, p)$.

In other words, $L$ satisfies TPR/TNR monotonicity if the corresponding $\Gamma$ in (5) is monotonically decreasing in its first two arguments. It can be shown that all the losses listed in Table 1 satisfy TPR/TNR monotonicity. The following proposition states that any loss function that satisfies TPR/TNR monotonicity also satisfies TP monotonicity.

**Proposition 3.** If $L$ satisfies TPR/TNR monotonicity, then $L$ satisfies TP monotonicity.

We can verify from the third column of Table 1 that each of the TPR/TNR monotonic losses $\Phi(u, v, p)$ is monotonically decreasing in $u$. So any loss that satisfies TPR/TNR monotonicity admits an optimal prediction for (1) of the form stated in Theorems 1 and 2.
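As a numerical spot check (not a proof, and not from the paper), the snippet below writes the $F_1$ loss in the form (4) with $c = (0, -2, 1, 1)$ and $d = (0, 0, 1, 1)$, so that $c_1 = -2 < d_1 = 0$, and verifies that the resulting $\Phi$ is strictly decreasing in $u$ on a grid, consistent with Proposition 2. The coefficient choices are ours, made purely for illustration.

```python
def phi_fractional_linear(u, v, p, c, d):
    """Phi_FL(u, v, p) = (c0 + c1*u + c2*v + c3*p) / (d0 + d1*u + d2*v + d3*p)."""
    num = c[0] + c[1] * u + c[2] * v + c[3] * p
    den = d[0] + d[1] * u + d[2] * v + d[3] * p
    return num / den

# F_1 in the form (4): Phi = (p + v - 2u) / (p + v)
c, d = (0.0, -2.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0)
v, p = 0.3, 0.2
us = [i * 0.01 for i in range(21)]                       # u ranges over [0, min(v, p)]
vals = [phi_fractional_linear(u, v, p, c, d) for u in us]
print(all(a > b for a, b in zip(vals, vals[1:])))        # True: strictly decreasing in u
```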
**The Area Under the ROC Curve (AUC):** AUC is an important special case, which reduces to $\Phi_{\mathrm{AUC}}$ (see Table 1) in the case of binary predictions (Joachims, 2005). The following proposition follows directly and is stated without proof:

**Proposition 4.** $\Phi_{\mathrm{AUC}}$ satisfies TP monotonicity.

While prior work on AUC has focused on optimizing the prediction of continuous scores, our approach is able to optimize explicit label predictions. Note that optimality results for continuous scores need not trivially extend to optimality results for discrete label decisions.

Table 1. Examples of TP monotonic losses.

| Loss | Definition | $\Phi(u, v, p)$ |
|---|---|---|
| AM | $1 - \frac{\widehat{\mathrm{TPR}} + \widehat{\mathrm{TNR}}}{2}$ | $1 - \frac{u + p(1 - v - p)}{2p(1-p)}$ |
| $F_\beta$ | $1 - \frac{(1+\beta^2)\widehat{\mathrm{TP}}}{(1+\beta^2)\widehat{\mathrm{TP}} + \beta^2\widehat{\mathrm{FN}} + \widehat{\mathrm{FP}}}$ | $1 - \frac{(1+\beta^2)u}{\beta^2 p + v}$ |
| G-TP/PR | $1 - \sqrt{\widehat{\mathrm{TPR}} \cdot \widehat{\mathrm{Prec}}}$ | $1 - \frac{u}{\sqrt{p\,v}}$ |
| G-Mean | $1 - \sqrt{\widehat{\mathrm{TPR}} \cdot \widehat{\mathrm{TNR}}}$ | $1 - \sqrt{\frac{u(1 - v - p + u)}{p(1-p)}}$ |
| H-Mean | $1 - 2\big/\big(\frac{1}{\widehat{\mathrm{TPR}}} + \frac{1}{\widehat{\mathrm{TNR}}}\big)$ | $1 - \frac{2u(1 - v - p + u)}{(1 - v - p)p + u}$ |
| AUC | $\frac{\widehat{\mathrm{FP}} \cdot \widehat{\mathrm{FN}}}{(\widehat{\mathrm{TP}} + \widehat{\mathrm{FN}})(\widehat{\mathrm{FP}} + \widehat{\mathrm{TN}})}$ | $\frac{(v-u)(p-u)}{p(1-p)}$ |

## 4. Algorithms

For multiclass classification, Theorem 2 immediately suggests an $O(n^2)$ algorithm for computing the optimal solution when $P(Y = C_i \mid x)$ is known: for each $k = 1, 2, \ldots, n$, evaluate the expected loss of selecting the top $k$ classes (classes are sorted by $P(Y = C_i \mid x)$), which can be done in $O(n)$ time. However, in the binary and multilabel learning scenarios, even when $P(Y_i = 1 \mid x)$ is known exactly, the characterization of optimality in Theorem 1 falls short in practice: it is not obvious how to compute the expectation in (1) without exhaustively enumerating the $2^n$ possible $y$ vectors. In this section, we present efficient algorithms for computing estimators of the optimal predictions given $x$ and a TP monotonic loss function $L$. We also prove the consistency of the proposed algorithms.

### 4.1. Computing the Optimal in Practice

We observe that by evaluating the loss $L$ through the function $\Phi$ (as defined in Proposition 1), we can generalize the approach suggested by Ye et al. (2012) for the $F_\beta$-measure to obtain a template algorithm for optimizing any TP monotonic $L$. Consider the vector $s \in \{0,1\}^n$ with the top $k$ values set to 1 and the rest to 0, and let $S_{i:j} := \sum_{l=i}^{j} y_l$. Note that for any $y \in \{0,1\}^n$ satisfying $S_{1:k} = k_1$ and $S_{k+1:n} = k_2$, $L(s, y)$ can simply be evaluated as $\Phi\big(\tfrac{k_1}{n}, \tfrac{k}{n}, \tfrac{k_1 + k_2}{n}\big)$. Thus $\sum_{y \in \{0,1\}^n} P(y \mid x)\, L(s, y)$ can be evaluated as a sum over possible values of $k_1$ and $k_2$, where the expectation is computed with respect to $P(S_{1:k} = k_1)\, P(S_{k+1:n} = k_2)$, with $0 \le k_1 \le k$ and $0 \le k_2 \le n - k$. It remains to compute $P(S_{1:k} = k_1)$ and $P(S_{k+1:n} = k_2)$ efficiently. Let $\eta_i = P(Y_i = 1 \mid x)$. A consistent estimate of this quantity may be obtained by minimizing a strongly proper loss function such as the logistic loss (Reid and Williamson, 2009). Using the conditional independence assumption, we can show that $P(S_{1:k} = k_1)$ and $P(S_{k+1:n} = k_2)$ are the coefficients of $z^{k_1}$ and $z^{k_2}$ in $\prod_{j=1}^{k} [\eta_j z + (1 - \eta_j)]$ and $\prod_{j=k+1}^{n} [\eta_j z + (1 - \eta_j)]$, respectively, each of which can be computed in time $O(n^2)$ for fixed $k$. Note that the loss $L$ itself can be evaluated in constant time. Let $L_k$ denote the expected loss with respect to the estimated conditionals $\eta_i$, corresponding to setting the predictions of the indices with the top $k$ highest conditional probabilities to 1. The resulting algorithm is presented in Algorithm 1.
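The Python sketch below is a reconstruction assembled from the description above, not the paper's Algorithm 1 verbatim; all function names and the $F_1$ example $\Phi$ are ours. Given estimated marginals $\eta_i$ and a TP-monotonic loss expressed as $\Phi(u, v, p)$, it evaluates $L_k$ for every $k$ and returns the best top-$k$ prediction, at overall cost $O(n^3)$.

```python
def poisson_binomial_pmf(etas):
    """P(S = m) for S = sum of independent Bernoulli(eta_j): the coefficient of z^m
    in prod_j [eta_j * z + (1 - eta_j)], via the O(len(etas)^2) recurrence."""
    pmf = [1.0]
    for e in etas:
        nxt = [0.0] * (len(pmf) + 1)
        for m, pr in enumerate(pmf):
            nxt[m] += pr * (1.0 - e)
            nxt[m + 1] += pr * e
        pmf = nxt
    return pmf

def expected_topk_loss(eta_sorted, k, phi):
    """E_y[Phi(u, v, p)] for the prediction with the top-k entries set to 1, using
    u = k1/n, v = k/n, p = (k1 + k2)/n, where k1 = S_{1:k} and k2 = S_{k+1:n}."""
    n = len(eta_sorted)
    top = poisson_binomial_pmf(eta_sorted[:k])    # P(S_{1:k} = k1)
    rest = poisson_binomial_pmf(eta_sorted[k:])   # P(S_{k+1:n} = k2)
    return sum(p1 * p2 * phi(k1 / n, k / n, (k1 + k2) / n)
               for k1, p1 in enumerate(top) for k2, p2 in enumerate(rest))

def optimal_topk_prediction(eta, phi):
    """Sort by eta, evaluate L_k for k = 0..n, and return the best top-k prediction."""
    order = sorted(range(len(eta)), key=lambda i: -eta[i])
    eta_sorted = [eta[i] for i in order]
    losses = [expected_topk_loss(eta_sorted, k, phi) for k in range(len(eta) + 1)]
    k_star = min(range(len(losses)), key=losses.__getitem__)
    s = [0] * len(eta)
    for i in order[:k_star]:
        s[i] = 1
    return s, losses[k_star]

# Example with the F1 loss, Phi(u, v, p) = 1 - 2u / (v + p):
phi_f1 = lambda u, v, p: 0.0 if v + p == 0 else 1.0 - 2.0 * u / (v + p)
print(optimal_topk_prediction([0.8, 0.3, 0.65, 0.1], phi_f1))
```

The early-exit heuristic discussed next corresponds to computing the losses incrementally and stopping the loop over $k$ at the first point where $L_k$ stops decreasing.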
Though the computational complexity of Algorithm 1 is $O(n^3)$, we find that in practice it suffices to run the outer iterations only until $k^*$, where $k^*$ is the first $k$ such that $L_k \le L_{k+1}$ (or $k^* = n$ if no such $k$ exists), because $L_k \ge L_{k^*}$ holds for all $k > k^*$. It is not obvious whether this is a property enjoyed by all TP monotonic loss functions under conditional independence, but it would be interesting to establish it theoretically. The improvement in runtime is significant, as multivariate losses are typically employed in scenarios where there is heavy imbalance in the distribution (in binary classification) or where only a small set of labels is relevant for a given instance (in multilabel learning), so $k^* \ll n$ for large $n$. Finally, we note that for a sub-family of the fractional-linear losses studied by Koyejo et al. (2014; 2015), we can obtain a faster algorithm that runs in time $O(n^2)$ using the trick of Ye et al. (2012). Due to limited space, the resulting algorithm is presented in Appendix B.1.

**Remark on multinomial distributions.** For multiclass classification, note that computing the expectation with respect to the multinomial distribution is straightforward. We simply replace steps 4-6 of Algorithm 1 with a single direct evaluation step.
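As a hedged illustration of why this is straightforward (an assumption about the form of that step, not the paper's exact formula): under the multinomial encoding (3), a label vector is one-hot, so $p = 1/n$, $v = k/n$, and $u = 1/n$ exactly when the true class lies in the top $k$. The expected loss of a top-$k$ prediction therefore reduces to a weighted sum of two $\Phi$ values, giving an $O(n)$ evaluation per $k$ and $O(n^2)$ overall.

```python
def multiclass_expected_topk_loss(probs_sorted, k, phi):
    """probs_sorted: class probabilities P(Y = C_i | x), sorted in decreasing order."""
    n = len(probs_sorted)
    hit = sum(probs_sorted[:k])  # P(true class is among the top k)
    return hit * phi(1.0 / n, k / n, 1.0 / n) + (1.0 - hit) * phi(0.0, k / n, 1.0 / n)

def multiclass_optimal_k(probs, phi):
    """Return the subset size k in {1, ..., n} minimizing the expected loss."""
    probs_sorted = sorted(probs, reverse=True)
    losses = [multiclass_expected_topk_loss(probs_sorted, k, phi)
              for k in range(1, len(probs) + 1)]
    return 1 + min(range(len(losses)), key=losses.__getitem__)

phi_f1 = lambda u, v, p: 1.0 - 2.0 * u / (v + p)   # v + p > 0 here since k >= 1
print(multiclass_optimal_k([0.5, 0.3, 0.1, 0.1], phi_f1))
```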