# Consistent Multilabel Classification

Oluwasanmi Koyejo (Department of Psychology, Stanford University), sanmi@stanford.edu
Nagarajan Natarajan (Department of Computer Science, University of Texas at Austin), naga86@cs.utexas.edu
Pradeep Ravikumar (Department of Computer Science, University of Texas at Austin), pradeepr@cs.utexas.edu
Inderjit S. Dhillon (Department of Computer Science, University of Texas at Austin), inderjit@cs.utexas.edu

**Abstract.** Multilabel classification is rapidly developing as an important aspect of modern predictive modeling, motivating study of its theoretical aspects. To this end, we propose a framework for constructing and analyzing multilabel classification metrics which reveals novel results on a parametric form for population optimal classifiers, and additional insight into the role of label correlations. In particular, we show that for multilabel metrics constructed as instance-, micro- and macro-averages, the population optimal classifier can be decomposed into binary classifiers based on the marginal instance-conditional distribution of each label, with a weak association between labels via the threshold. Thus, our analysis extends the state of the art from a few known multilabel classification metrics, such as Hamming loss, to a general framework applicable to many of the classification metrics in common use. Based on the population-optimal classifier, we propose a computationally efficient and general-purpose plug-in classification algorithm, and prove its consistency with respect to the metric of interest. Empirical results on synthetic and benchmark datasets are supportive of our theoretical findings.

## 1 Introduction

Modern classification problems often involve the prediction of multiple labels simultaneously associated with a single instance, e.g., image tagging by predicting multiple objects in an image. The growing importance of multilabel classification has motivated the development of several scalable algorithms [8, 12, 18] and has led to the recent surge in theoretical analysis [1, 3, 7, 16] which helps guide and understand practical advances. While recent results have advanced our knowledge of optimal population classifiers and consistent learning algorithms for particular metrics such as the Hamming loss and multilabel F-measure [3, 4, 5], a general understanding of learning with respect to multilabel classification metrics has remained an open problem. This is in contrast to the more traditional settings of binary and multiclass classification, where several recently established results have led to a rich understanding of optimal and consistent classification [9, 10, 11]. This manuscript constitutes a step towards establishing results for multilabel classification at the level of generality currently enjoyed only in these traditional settings.

Towards a generalized analysis, we propose a framework for multilabel sample performance metrics and their corresponding population extensions. A classification metric is constructed to measure the utility¹ of a classifier, as defined by the practitioner or end-user. The utility may be measured using the sample metric given a finite dataset, and further generalized to the population metric with respect to a given data distribution (i.e. with respect to infinite samples).

Equal contribution.
¹Equivalently, we may define the loss as the negative utility.
Two distinct approaches have been proposed for studying the population performance of a classifier in the classical settings of binary and multiclass classification, described by Ye et al. [17] as decision theoretic analysis (DTA) and empirical utility maximization (EUM). DTA population utilities measure the expected performance of a classifier on a fixed-size test set, while EUM population utilities are directly defined as a function of the population confusion matrix. However, state-of-the-art analysis of multilabel classification has so far lacked such a distinction. The proposed framework defines both EUM and DTA multilabel population utilities as generalizations of the aforementioned classic definitions. Using this framework, we observe that existing work on multilabel classification [1, 3, 7, 16] has exclusively focused on optimizing the DTA utility of (specific) multilabel metrics.

Averaging of binary classification metrics remains one of the most widely used approaches for defining multilabel metrics. Given a binary label representation, such metrics are constructed via averaging with respect to labels (instance-averaging), with respect to examples separately for each label (macro-averaging), or with respect to both labels and examples (micro-averaging). We consider a large sub-family of such metrics where the underlying binary metric can be constructed as a fraction of linear combinations of true positives, false positives, false negatives and true negatives [9]. Examples in this family include the ubiquitous Hamming loss, the averaged precision, the multilabel averaged F-measure, and the averaged Jaccard measure, among others. Our key result is that a Bayes optimal multilabel classifier for such metrics can be explicitly characterized in a simple form: the optimal classifier thresholds the label-wise conditional probability marginals, and the label dependence in the underlying distribution is relevant to the optimal classifier only through the threshold parameter. Further, the threshold is shared by all the labels when the metric is instance-averaged or micro-averaged. This result is surprising and is, to our knowledge, the first result to be shown at this level of generality for multilabel classification. The result also sheds additional insight into the role of label correlations in multilabel classification, answering prior conjectures by Dembczyński et al. [3] and others.

We provide a plug-in estimation based algorithm that is efficient as well as theoretically consistent, i.e. the true utility of the empirical estimator approaches the optimal (EUM) utility of the Bayes classifier (Section 4). We also present an experimental evaluation on synthetic and real-world benchmark multilabel datasets comparing different estimation algorithms (Section 5) for representative multilabel performance metrics selected from the studied family. The results observed in practice are supportive of what the theory predicts.

### 1.1 Related Work

We briefly highlight closely related theoretical results in the multilabel learning literature. Gao and Zhou [7] consider the consistency of multilabel learning with respect to the DTA utility, with a focus on two specific losses, the Hamming and rank losses (the corresponding measures are defined in Section 2). Surrogate losses are devised which result in consistent learning with respect to these metrics. In contrast, we propose a plug-in estimation based algorithm which directly estimates the Bayes optimal, without going through surrogate losses.
Dembczyński et al. [2] analyze the DTA population optimal classifier for the multilabel rank loss, showing that the Bayes optimal is independent of label correlations in the unweighted case, and construct certain weighted univariate losses which are DTA consistent surrogates in the more general weighted case. Perhaps the work most closely related to ours is by Dembczyński et al. [4], who propose a novel DTA consistent plug-in rule estimation based algorithm for the multilabel F-measure. Cheng et al. [1] consider optimizing popular losses in multilabel learning such as the Hamming, rank and subset 0/1 loss (which is the multilabel analog of the classical 0-1 loss). They propose a probabilistic version of classifier chains (first introduced by Read et al. [13]) for estimating the Bayes optimal with respect to the subset 0/1 loss, though without rigorous theoretical justification.

## 2 A Framework for Multilabel Classification Metrics

Consider multilabel classification with $M$ labels, where each instance is denoted by $x \in \mathcal{X}$. For convenience, we will focus on the common binary encoding, where the labels are represented by a vector $y \in \mathcal{Y} = \{0, 1\}^M$, so $y_m = 1$ iff the $m$th label is associated with the instance, and $y_m = 0$ otherwise. The goal is to learn a multilabel classifier $f : \mathcal{X} \to \mathcal{Y}$ that optimizes a certain performance metric with respect to $\mathbb{P}$, a fixed data generating distribution over the domain $\mathcal{X} \times \mathcal{Y}$, using a training set of instance-label pairs $(x^{(n)}, y^{(n)}),\ n = 1, 2, \ldots, N$ drawn (typically assumed i.i.d.) from $\mathbb{P}$. Let $X$ and $Y$ denote the random variables for instances and labels respectively, and let $\Psi$ denote the performance (utility) metric of interest.

Most classification metrics can be represented as functions of the entries of the confusion matrix. In the case of binary classification, the confusion matrix is specified by four numbers, i.e., true positives, true negatives, false positives and false negatives. Similarly, we construct the following primitives for multilabel classification:

$$
\widehat{\mathrm{TP}}(f)_{m,n} = \llbracket f_m(x^{(n)}) = 1, y^{(n)}_m = 1 \rrbracket, \qquad
\widehat{\mathrm{FP}}(f)_{m,n} = \llbracket f_m(x^{(n)}) = 1, y^{(n)}_m = 0 \rrbracket,
$$
$$
\widehat{\mathrm{TN}}(f)_{m,n} = \llbracket f_m(x^{(n)}) = 0, y^{(n)}_m = 0 \rrbracket, \qquad
\widehat{\mathrm{FN}}(f)_{m,n} = \llbracket f_m(x^{(n)}) = 0, y^{(n)}_m = 1 \rrbracket, \tag{1}
$$

where $\llbracket Z \rrbracket$ denotes the indicator function that is 1 if the predicate $Z$ is true and 0 otherwise. It is clear that most multilabel classification metrics considered in the literature can be written as a function of the $MN$ primitives defined in (1). In the following, we consider a construction which is of sufficient generality to capture all multilabel metrics in common use. Let

$$
A_k(f) : \{\widehat{\mathrm{TP}}(f)_{m,n}, \widehat{\mathrm{FP}}(f)_{m,n}, \widehat{\mathrm{TN}}(f)_{m,n}, \widehat{\mathrm{FN}}(f)_{m,n}\}_{m=1,n=1}^{M,N} \to \mathbb{R}, \quad k = 1, 2, \ldots, K
$$

represent a set of $K$ functions. Consider sample multilabel metrics constructed as functions $\Psi : \{A_k(f)\}_{k=1}^{K} \to [0, \infty)$. We note that the metric need not decompose over individual instances. Equipped with this definition of a sample performance metric $\Psi$, consider the population utility of a multilabel classifier $f$ defined as:

$$
U(f; \Psi, \mathbb{P}) = \Psi\big(\{\mathbb{E}[A_k(f)]\}_{k=1}^{K}\big), \tag{2}
$$

where the expectation is over i.i.d. draws from the joint distribution $\mathbb{P}$. Note that this can be seen as a multilabel generalization of the so-called Empirical Utility Maximization (EUM) style classifiers studied in binary [9, 10] and multiclass [11] settings. Our goal is to learn a multilabel classifier that maximizes $U(f; \Psi, \mathbb{P})$ for general performance metrics $\Psi$. Define the (Bayes) optimal multilabel classifier as:

$$
f^* = \operatorname*{argmax}_{f : \mathcal{X} \to \{0,1\}^M} U(f; \Psi, \mathbb{P}). \tag{3}
$$

We say that $\hat{f}$ is a consistent estimator of $f^*$ if $U(\hat{f}; \Psi, \mathbb{P}) \xrightarrow{p} U(f^*; \Psi, \mathbb{P})$.
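To make the construction in (1) concrete, the following is a minimal NumPy sketch (ours, not from the paper) that materializes the four $M \times N$ primitive arrays from predicted and true label matrices; any of the functions $A_k$, and hence a sample metric $\Psi$, can then be computed from these arrays. Variable and function names are illustrative.

```python
import numpy as np

def confusion_primitives(Y_pred, Y_true):
    """Compute the M x N primitives of Eq. (1).

    Y_pred, Y_true: {0,1} arrays of shape (N, M) holding f_m(x^(n)) and y^(n)_m.
    Returns TP, FP, TN, FN boolean arrays, each of shape (M, N).
    """
    Y_pred, Y_true = Y_pred.T, Y_true.T          # index as [m, n]
    TP = (Y_pred == 1) & (Y_true == 1)
    FP = (Y_pred == 1) & (Y_true == 0)
    TN = (Y_pred == 0) & (Y_true == 0)
    FN = (Y_pred == 0) & (Y_true == 1)
    return TP, FP, TN, FN

# Example: the Hamming-accuracy construction discussed below,
#   A_1(f) = (1/MN) * sum_{m,n} (FP_{m,n} + FN_{m,n}),   Psi_Ham = 1 - A_1.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y_true = rng.integers(0, 2, size=(100, 5))   # N = 100 examples, M = 5 labels
    Y_pred = rng.integers(0, 2, size=(100, 5))
    TP, FP, TN, FN = confusion_primitives(Y_pred, Y_true)
    A1 = (FP.sum() + FN.sum()) / FP.size
    print("Hamming accuracy:", 1.0 - A1)
```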
**Examples.** The averaged accuracy (1 − Hamming loss) used in multilabel classification corresponds to simply choosing

$$
A_1(f) = \frac{1}{MN}\sum_{m=1}^{M}\sum_{n=1}^{N}\left(\widehat{\mathrm{FP}}(f)_{m,n} + \widehat{\mathrm{FN}}(f)_{m,n}\right)
\quad\text{and}\quad
\Psi_{\mathrm{Ham}}(f) = 1 - A_1(f).
$$

The measure corresponding to the rank loss² can be obtained by choosing $A_k(f) = \sum_{m_1, m_2} \widehat{\mathrm{FP}}(f)_{m_1,k}\, \widehat{\mathrm{FN}}(f)_{m_2,k}$, for $k = 1, 2, \ldots, N$, and $\Psi_{\mathrm{Rank}} = 1 - \frac{1}{N}\sum_{k=1}^{N} A_k(f)$. Note that the choice of $\{A_k\}$, and therefore $\Psi$, is not unique.

²A subtle but important aspect of the definition of rank loss in the existing literature, including [2] and [7], is that the Bayes optimal is allowed to be a real-valued function and may not correspond to a label decision.

**Remark 1.** Existing results on multilabel classification have focused on decision-theoretic analysis (DTA) style classifiers, where the utility is defined as:

$$
U_{\mathrm{DTA}}(f; \Psi, \mathbb{P}) = \mathbb{E}\left[\Psi\big(\{A_k(f)\}_{k=1}^{K}\big)\right], \tag{4}
$$

and the expectation is over i.i.d. samples from $\mathbb{P}$. Furthermore, there are no theoretical results for consistency with respect to general performance metrics in this setting (see Appendix B.2).

For the remainder of this manuscript, we refer to $U(f; \Psi, \mathbb{P})$ as the utility defined in (2). We will also drop the argument $f$ (e.g. write $\widehat{\mathrm{TP}}(f)$ as $\widehat{\mathrm{TP}}$) when it is clear from the context.

### 2.1 A Framework for Averaged Binary Multilabel Classification Metrics

The most popular class of multilabel performance metrics consists of averaged binary performance metrics, which correspond to particular settings of $\{A_k(f)\}$ using certain averages, as described in the following. For the remainder of this subsection, the metric $\psi : [0, 1]^4 \to [0, \infty)$ will refer to a binary classification metric as is typically applied to a binary confusion matrix.

**Micro-averaging:** Micro-averaged multilabel performance metrics $\Psi_{\mathrm{micro}}$ are defined by averaging over both labels and examples. Let:

$$
\widehat{\mathrm{TP}}(f) = \frac{1}{MN}\sum_{m=1}^{M}\sum_{n=1}^{N}\widehat{\mathrm{TP}}(f)_{m,n}, \qquad
\widehat{\mathrm{FP}}(f) = \frac{1}{MN}\sum_{m=1}^{M}\sum_{n=1}^{N}\widehat{\mathrm{FP}}(f)_{m,n}, \tag{5}
$$

where $\widehat{\mathrm{TN}}(f)$ and $\widehat{\mathrm{FN}}(f)$ are defined similarly; then the micro-averaged multilabel performance metrics are given by:

$$
\Psi_{\mathrm{micro}}(\{A_k(f)\}_{k=1}^{K}) := \psi(\widehat{\mathrm{TP}}, \widehat{\mathrm{FP}}, \widehat{\mathrm{TN}}, \widehat{\mathrm{FN}}). \tag{6}
$$

Thus, for micro-averaging, one applies a binary performance metric to the confusion matrix defined by the averaged quantities described in (5).

**Macro-averaging:** The metric $\Psi_{\mathrm{macro}}$ measures average classification performance across labels. Define the averaged measures:

$$
\widehat{\mathrm{TP}}_m(f) = \frac{1}{N}\sum_{n=1}^{N}\widehat{\mathrm{TP}}(f)_{m,n}, \qquad
\widehat{\mathrm{FP}}_m(f) = \frac{1}{N}\sum_{n=1}^{N}\widehat{\mathrm{FP}}(f)_{m,n},
$$

where $\widehat{\mathrm{TN}}_m(f)$ and $\widehat{\mathrm{FN}}_m(f)$ are defined similarly. The macro-averaged performance metric is given by:

$$
\Psi_{\mathrm{macro}}(\{A_k(f)\}_{k=1}^{K}) := \frac{1}{M}\sum_{m=1}^{M}\psi(\widehat{\mathrm{TP}}_m, \widehat{\mathrm{FP}}_m, \widehat{\mathrm{TN}}_m, \widehat{\mathrm{FN}}_m). \tag{7}
$$

**Instance-averaging:** The metric $\Psi_{\mathrm{instance}}$ measures the average classification performance across examples. Define the averaged measures:

$$
\widehat{\mathrm{TP}}_n(f) = \frac{1}{M}\sum_{m=1}^{M}\widehat{\mathrm{TP}}(f)_{m,n}, \qquad
\widehat{\mathrm{FP}}_n(f) = \frac{1}{M}\sum_{m=1}^{M}\widehat{\mathrm{FP}}(f)_{m,n},
$$

where $\widehat{\mathrm{TN}}_n(f)$ and $\widehat{\mathrm{FN}}_n(f)$ are defined similarly. The instance-averaged performance metric is given by:

$$
\Psi_{\mathrm{instance}}(\{A_k(f)\}_{k=1}^{K}) := \frac{1}{N}\sum_{n=1}^{N}\psi(\widehat{\mathrm{TP}}_n, \widehat{\mathrm{FP}}_n, \widehat{\mathrm{TN}}_n, \widehat{\mathrm{FN}}_n). \tag{8}
$$

## 3 Characterizing the Bayes Optimal Classifier for Multilabel Metrics

We now characterize the optimal multilabel classifier for the large family of metrics outlined in Section 2.1 ($\Psi_{\mathrm{micro}}$, $\Psi_{\mathrm{macro}}$ and $\Psi_{\mathrm{instance}}$) with respect to the EUM utility. We begin by observing that while micro-averaging and instance-averaging seem quite different when viewed as sample averages, they are in fact equivalent at the population level. Thus, we need only focus on $\Psi_{\mathrm{micro}}$ to characterize $\Psi_{\mathrm{instance}}$ as well.

**Proposition 1.** For a given binary classification metric $\psi$, consider the averaged multilabel metrics $\Psi_{\mathrm{micro}}$ defined in (6) and $\Psi_{\mathrm{instance}}$ defined in (8). For any $f$, $U(f; \Psi_{\mathrm{micro}}, \mathbb{P}) = U(f; \Psi_{\mathrm{instance}}, \mathbb{P})$. In particular, $f^*_{\mathrm{micro}} = f^*_{\mathrm{instance}}$.
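As an illustration of the averaging schemes (5)-(8), the sketch below (ours, not from the paper) computes micro-, macro- and instance-averaged F1 from the sample primitives; on a finite sample the three averages generally differ, even though Proposition 1 shows that micro- and instance-averaging coincide at the population level. Names are illustrative.

```python
import numpy as np

def f1(tp, fp, tn, fn, eps=1e-12):
    # Binary F1 written in terms of averaged confusion quantities (cf. Eq. (9)).
    return 2 * tp / (2 * tp + fn + fp + eps)

def averaged_f1(Y_pred, Y_true):
    """Micro-, macro- and instance-averaged F1, per Eqs. (5)-(8).

    Y_pred, Y_true: {0,1} arrays of shape (N, M).
    """
    TP = ((Y_pred == 1) & (Y_true == 1)).astype(float)   # shape (N, M)
    FP = ((Y_pred == 1) & (Y_true == 0)).astype(float)
    TN = ((Y_pred == 0) & (Y_true == 0)).astype(float)
    FN = ((Y_pred == 0) & (Y_true == 1)).astype(float)

    micro = f1(TP.mean(), FP.mean(), TN.mean(), FN.mean())                 # Eq. (6)
    macro = f1(TP.mean(0), FP.mean(0), TN.mean(0), FN.mean(0)).mean()      # Eq. (7): average over labels
    instance = f1(TP.mean(1), FP.mean(1), TN.mean(1), FN.mean(1)).mean()   # Eq. (8): average over examples
    return micro, macro, instance

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y_true = rng.integers(0, 2, size=(200, 6))
    Y_pred = rng.integers(0, 2, size=(200, 6))
    print(averaged_f1(Y_pred, Y_true))
```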
We further restrict our study to metrics selected from the linear-fractional metric family, recently studied in the context of binary classification [9]. Any $\psi$ in this family can be written as:

$$
\psi(\widehat{\mathrm{TP}}, \widehat{\mathrm{FP}}, \widehat{\mathrm{FN}}, \widehat{\mathrm{TN}})
= \frac{a_0 + a_{11}\widehat{\mathrm{TP}} + a_{10}\widehat{\mathrm{FP}} + a_{01}\widehat{\mathrm{FN}} + a_{00}\widehat{\mathrm{TN}}}
       {b_0 + b_{11}\widehat{\mathrm{TP}} + b_{10}\widehat{\mathrm{FP}} + b_{01}\widehat{\mathrm{FN}} + b_{00}\widehat{\mathrm{TN}}},
$$

where $a_0, b_0, a_{ij}, b_{ij},\ i, j \in \{0, 1\}$ are fixed, and $\widehat{\mathrm{TP}}, \widehat{\mathrm{FP}}, \widehat{\mathrm{FN}}, \widehat{\mathrm{TN}}$ are defined as in Section 2.1. Many popular multilabel metrics can be derived using linear-fractional $\psi$. Some examples include³:

$$
\text{F}_\beta:\ \psi_{F_\beta} = \frac{(1 + \beta^2)\,\widehat{\mathrm{TP}}}{(1 + \beta^2)\,\widehat{\mathrm{TP}} + \beta^2\,\widehat{\mathrm{FN}} + \widehat{\mathrm{FP}}}, \qquad
\text{Hamming}:\ \psi_{\mathrm{Ham}} = \widehat{\mathrm{TP}} + \widehat{\mathrm{TN}},
$$
$$
\text{Jaccard}:\ \psi_{\mathrm{Jacc}} = \frac{\widehat{\mathrm{TP}}}{\widehat{\mathrm{TP}} + \widehat{\mathrm{FP}} + \widehat{\mathrm{FN}}}, \qquad
\text{Precision}:\ \psi_{\mathrm{Prec}} = \frac{\widehat{\mathrm{TP}}}{\widehat{\mathrm{TP}} + \widehat{\mathrm{FP}}}. \tag{9}
$$

³Note that Hamming is typically defined as the loss, given by $1 - \psi_{\mathrm{Ham}}$.

Define the population quantities $\pi = \frac{1}{M}\sum_{m=1}^{M} \mathbb{P}(Y_m = 1)$ and $\gamma(f) = \frac{1}{M}\sum_{m=1}^{M} \mathbb{P}(f_m(X) = 1)$. Let $\mathrm{TP}(f) = \mathbb{E}[\widehat{\mathrm{TP}}(f)]$, where the expectation is over i.i.d. draws from $\mathbb{P}$. From (5), it follows that $\mathrm{FP}(f) = \gamma(f) - \mathrm{TP}(f)$, $\mathrm{TN}(f) = 1 - \pi - \gamma(f) + \mathrm{TP}(f)$ and $\mathrm{FN}(f) = \pi - \mathrm{TP}(f)$. Now, the population utility (2) corresponding to $\Psi_{\mathrm{micro}}$ can be written succinctly as:

$$
U(f; \Psi_{\mathrm{micro}}, \mathbb{P}) = \psi(\mathrm{TP}(f), \mathrm{FP}(f), \mathrm{FN}(f), \mathrm{TN}(f))
= \frac{c_0 + c_1\,\mathrm{TP}(f) + c_2\,\gamma(f)}{d_0 + d_1\,\mathrm{TP}(f) + d_2\,\gamma(f)}, \tag{10}
$$

with the constants $c_0 = (a_{01} - a_{00})\pi + a_{00} + a_0$, $c_1 = a_{11} - a_{10} - a_{01} + a_{00}$, $c_2 = a_{10} - a_{00}$, and $d_0 = (b_{01} - b_{00})\pi + b_{00} + b_0$, $d_1 = b_{11} - b_{10} - b_{01} + b_{00}$, $d_2 = b_{10} - b_{00}$.

We assume that the joint $\mathbb{P}$ has a density $\mu$ that satisfies $d\mathbb{P} = \mu\,dx$, and define $\eta_m(x) = \mathbb{P}(Y_m = 1 \mid X = x)$. Our first main result characterizes the Bayes optimal multilabel classifier $f^*$.

**Theorem 2.** Given the constants $\{c_1, c_2, c_0\}$ and $\{d_1, d_2, d_0\}$, define:

$$
\delta^* = \frac{d_2\, U^*_{\mathrm{micro}} - c_2}{c_1 - d_1\, U^*_{\mathrm{micro}}},
\quad\text{where}\quad
U^*_{\mathrm{micro}} = U(f^*; \Psi_{\mathrm{micro}}, \mathbb{P}). \tag{11}
$$

The optimal Bayes classifier $f^* := f^*_{\mathrm{micro}}$ defined in (3) is given by:

1. When $c_1 > d_1 U^*_{\mathrm{micro}}$, $f^*$ takes the form $f^*_m(x) = \llbracket \eta_m(x) > \delta^* \rrbracket$, for $m \in [M]$.
2. When $c_1 < d_1 U^*_{\mathrm{micro}}$, $f^*$ takes the form $f^*_m(x) = \llbracket \eta_m(x) < \delta^* \rrbracket$, for $m \in [M]$.

The proof is provided in Appendix A.2, and applies equivalently to instance-averaging. Theorem 2 recovers existing results in the binary [9] setting (see Appendix B.1 for details), and is sufficiently general to capture many of the multilabel metrics used in practice. Our proof is closely related to the binary classification case analyzed in Theorem 2 of [9], but differs in the additional averaging across labels.

A key observation from Theorem 2 is that the optimal multilabel classifier can be obtained by thresholding the marginal instance-conditional probability $\mathbb{P}(Y_m = 1 \mid x)$ for each label and, importantly, that the optimal classifiers for all the labels share the same threshold $\delta^*$. Thus, the effect of the joint distribution enters only through the threshold parameter. We emphasize that while the presented results characterize the optimal population classifier, incorporating label correlations into the prediction algorithm may have other benefits with finite samples, such as statistical efficiency when there are known structural similarities between the marginal distributions [3]. Further analysis is left for future work.

The Bayes optimal for the macro-averaged population metric is straightforward to establish. We observe that the threshold is not shared in this case.

**Proposition 3.** For a given linear-fractional metric $\psi$, consider the macro-averaged multilabel metric $\Psi_{\mathrm{macro}}$ defined in (7). Let $c_1 > d_1 U^*_{\mathrm{macro}}$ and $f^* = f^*_{\mathrm{macro}}$. We have, for $m = 1, 2, \ldots, M$:

$$
f^*_m(x) = \llbracket \eta_m(x) > \delta^*_m \rrbracket,
$$

where $\delta^*_m \in [0, 1]$ is a constant that depends on the metric and the label-wise instance-conditional marginals of $\mathbb{P}$. Analogous results hold for $c_1 < d_1 U^*_{\mathrm{macro}}$.
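As a worked illustration of Theorem 2 (ours, not from the paper), consider the micro-averaged $F_1$ metric from (9), for which $a_{11} = 2$, $b_{11} = 2$, $b_{10} = b_{01} = 1$ and all remaining coefficients are zero. The constants of (10) are then $c_1 = 2$, $c_2 = 0$, $d_1 = 0$ and $d_2 = 1$, so (11) reduces to

$$
\delta^*_{F_1} = \frac{d_2\, U^*_{\mathrm{micro}} - c_2}{c_1 - d_1\, U^*_{\mathrm{micro}}} = \frac{U^*_{\mathrm{micro}}}{2},
$$

i.e. the shared optimal threshold is half the optimal micro-averaged $F_1$ utility. The same computation for Hamming accuracy ($a_{11} = a_{00} = 1$, $b_0 = 1$) gives $c_1 = 2$, $c_2 = -1$, $d_1 = d_2 = 0$, and hence $\delta^* = 1/2$, recovering the usual 0.5 threshold.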
**Remark 2.** It is clear that micro-, macro- and instance-averaging are equivalent at the population level when the metric $\psi$ is linear. This is a straightforward consequence of the observation that the corresponding sample utilities are the same. More generally, micro-, macro- and instance-averaging are equivalent whenever the optimal threshold is a constant independent of $\mathbb{P}$, such as for linear metrics, where $d_1 = d_2 = 0$ so $\delta^* = -c_2 / c_1$ (cf. Corollary 4 of Koyejo et al. [9]). Thus, our analysis recovers known results for Hamming loss [3, 7].

## 4 Consistent Plug-in Estimation Algorithm

Importantly, the Bayes optimal characterization points to a simple plug-in estimation algorithm that enjoys consistency, as follows. First, one obtains an estimate $\hat{\eta}_m(x)$ of the marginal instance-conditional probability $\eta_m(x) = \mathbb{P}(Y_m = 1 \mid x)$ for each label $m$ (see Reid and Williamson [14]) using a training sample. Then, the given metric $\Psi_{\mathrm{micro}}(f)$ is maximized on a validation sample. For the remainder of this manuscript, we assume w.l.o.g. that $c_1 > d_1 U^*$. Note that in order to maximize over $\{f_\delta : f_m(x) = \llbracket \hat{\eta}_m(x) > \delta \rrbracket\ \forall m = 1, 2, \ldots, M,\ \delta \in (0, 1)\}$, it suffices to optimize:

$$
\hat{\delta} = \operatorname*{argmax}_{\delta \in (0, 1)} \Psi_{\mathrm{micro}}(\hat{f}_\delta), \tag{12}
$$

where $\Psi_{\mathrm{micro}}$ is the micro-averaged sample metric defined in (6) (similarly for $\Psi_{\mathrm{instance}}$). Though the threshold search is over a continuous space $\delta \in (0, 1)$, the number of distinct values of $\Psi_{\mathrm{micro}}(\hat{f}_\delta)$ given a training sample of size $N$ is at most $NM$. Thus (12) can be solved efficiently on a finite sample.

**Algorithm 1:** Plug-in Estimator for $\Psi_{\mathrm{micro}}$ and $\Psi_{\mathrm{instance}}$
Input: Training examples $S = \{x^{(n)}, y^{(n)}\}_{n=1}^{N}$ and metric $\Psi_{\mathrm{micro}}$ (or $\Psi_{\mathrm{instance}}$).
for $m = 1, 2, \ldots, M$ do
  1. Select the training data for label $m$: $S_m = \{x^{(n)}, y^{(n)}_m\}_{n=1}^{N}$.
  2. Split the training data $S_m$ into two sets $S_{m1}$ and $S_{m2}$.
  3. Estimate $\hat{\eta}_m(x)$ using $S_{m1}$, and define $\hat{f}_m(x) = \llbracket \hat{\eta}_m(x) > \delta \rrbracket$.
end for
Obtain $\hat{\delta}$ by solving (12) on $S_2 = \cup_{m=1}^{M} S_{m2}$.
Return: $\hat{f}_{\hat{\delta}}$.

**Consistency of the proposed algorithm.** The following theorem shows that the plug-in procedure of Algorithm 1 results in a consistent classifier.

**Theorem 4.** Let $\Psi_{\mathrm{micro}}$ be a linear-fractional metric. If the estimates $\hat{\eta}_m(x)$ satisfy $\hat{\eta}_m \xrightarrow{p} \eta_m$ for all $m$, then the output multilabel classifier $\hat{f}_{\hat{\delta}}$ of Algorithm 1 is consistent.

The proof is provided in Appendix A.4. From Proposition 1, it follows that consistency holds for $\Psi_{\mathrm{instance}}$ as well. Additionally, in light of Proposition 3, we may apply the learning algorithms proposed by [9] for binary classification independently for each label to obtain a consistent estimator for $\Psi_{\mathrm{macro}}$.
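The following is a minimal sketch of Algorithm 1 under stated assumptions (ours, not the authors' code): per-label conditional probabilities are estimated with scikit-learn logistic regression on one split, and a single shared threshold is tuned on the held-out split by maximizing the micro-averaged sample F1 over the finitely many candidate thresholds given by the predicted probabilities. Function and variable names are illustrative, and the sketch assumes every label takes both values on the training split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def micro_f1(Y_pred, Y_true):
    # Micro-averaged sample F1 (Eq. (6) with the F1 metric of Eq. (9)).
    tp = np.sum((Y_pred == 1) & (Y_true == 1))
    fp = np.sum((Y_pred == 1) & (Y_true == 0))
    fn = np.sum((Y_pred == 0) & (Y_true == 1))
    return 2 * tp / max(2 * tp + fp + fn, 1)

def plugin_estimator(X, Y, val_fraction=0.5, seed=0):
    """Sketch of Algorithm 1: per-label probability estimates + one shared threshold.

    X: (N, d) features; Y: (N, M) binary label matrix.
    Returns the fitted per-label models and the tuned shared threshold.
    """
    rng = np.random.default_rng(seed)
    N, M = Y.shape
    idx = rng.permutation(N)
    n_val = int(val_fraction * N)
    val, train = idx[:n_val], idx[n_val:]

    # Steps 1-3: estimate eta_m on the training split, one model per label.
    models = [LogisticRegression(max_iter=1000).fit(X[train], Y[train, m]) for m in range(M)]
    # Predicted marginals on the validation split, shape (n_val, M).
    eta_hat = np.column_stack([mdl.predict_proba(X[val])[:, 1] for mdl in models])

    # Eq. (12): search over the (at most n_val * M) distinct candidate thresholds.
    best_delta, best_util = 0.5, -np.inf
    for delta in np.unique(eta_hat):
        util = micro_f1((eta_hat > delta).astype(int), Y[val])
        if util > best_util:
            best_delta, best_util = delta, util
    return models, best_delta
```

A prediction for a new instance $x$ is then obtained by thresholding each label's estimated marginal at the single tuned value, i.e. `models[m].predict_proba(x)[:, 1] > best_delta` for every $m$.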
## 5 Experiments

We present two sets of results. The first is an experimental validation on synthetic data with known ground truth probabilities; the results serve to verify our main result (Theorem 2) characterizing the Bayes optimal for averaged multilabel metrics. The second is an experimental evaluation of the plug-in estimator algorithms for micro-, instance-, and macro-averaged multilabel metrics on benchmark datasets.

### 5.1 Synthetic data: Verification of Bayes optimal

We consider the micro-averaged F1 metric in (9) for multilabel classification with 4 labels. We sample a set of five 2-dimensional vectors $x = \{x^{(1)}, x^{(2)}, \ldots, x^{(5)}\}$ from the standard Gaussian. The conditional probability $\eta_m$ for label $m$ is modeled using a sigmoid function, $\eta_m(x) = \mathbb{P}(Y_m = 1 \mid x) = \frac{1}{1 + \exp(-w_m^{\top} x)}$, using a vector $w_m$ sampled from the standard Gaussian. The Bayes optimal $f^*(x) \in \{0, 1\}^4$ that maximizes the micro-averaged F1 population utility is then obtained by exhaustive search over all possible label vectors for each instance.

In Figure 1(a)-(d), we plot the conditional probabilities (with respect to the sample index) for each label, the corresponding $f^*_m$ for each $x$, and the optimal threshold $\delta^*$ computed using (11). We observe that the optimal multilabel classifier indeed thresholds $\mathbb{P}(Y_m \mid x)$ for each label $m$, and furthermore, that the threshold is the same for all the labels, as stated in Theorem 2.

Figure 1: Bayes optimal classifier for multilabel F1 measure on synthetic data with 4 labels, and distribution supported on 5 instances. Plots from left to right show the Bayes optimal classifier prediction for instances, and for labels 1 through 4. Note that the optimal $\delta^*$ at which the label-wise marginal $\eta_m(x)$ is thresholded is shared, conforming to Theorem 2 (larger plots are included in Appendix C).

### 5.2 Benchmark data: Evaluation of plug-in estimators

We now evaluate the proposed plug-in estimation (Algorithm 1), which is consistent for micro- and instance-averaged multilabel metrics. We focus on two metrics, F1 and Jaccard, listed in (9). We compare Algorithm 1, designed to optimize micro-averaged (or instance-averaged) multilabel metrics, to two related plug-in estimation methods: (i) a separate threshold $\delta_m$ tuned for each label $m$ individually; this optimizes the utility corresponding to the macro-averaged metric, but is not consistent for micro-averaged or instance-averaged metrics, and is the most common approach in practice. We refer to this as Macro-Thres. (ii) a constant threshold 1/2 for all the labels; this is known to be optimal for averaged accuracy (equivalently, Hamming loss), but not for non-decomposable F1 or Jaccard metrics. We refer to this as Binary Relevance (BR) [15].

We use four benchmark multilabel datasets⁴ in our experiments: (i) SCENE, an image dataset consisting of 6 labels, with 1211 training and 1196 test instances, (ii) BIRDS, an audio dataset consisting of 19 labels, with 323 training and 322 test instances, (iii) EMOTIONS, a music dataset consisting of 6 labels, with 393 training and 202 test instances, and (iv) CAL500, a music dataset consisting of 174 labels, with 400 training and 100 test instances⁵. We perform logistic regression (with L2 regularization) on a separate subsample to obtain estimates $\hat{\eta}_m(x)$ of $\mathbb{P}(Y_m = 1 \mid x)$ for each label $m$ (as described in Section 4). All the methods we evaluate rely on obtaining a good estimator for the conditional probability, so we exclude labels that are associated with very few instances; in particular, we train and evaluate using labels associated with at least 20 instances in each dataset, for all the methods.

In Table 1, we report the micro-averaged F1 and Jaccard metrics on the test set for Algorithm 1, Macro-Thres and Binary Relevance. We observe that estimating a fixed threshold for all the labels (Algorithm 1) consistently performs better than estimating thresholds for each label (Macro-Thres) and than using threshold 1/2 for all labels (BR); this conforms to our main result in Theorem 2 and the consistency analysis of Algorithm 1 in Theorem 4. A similar trend is observed for the instance-averaged metrics computed on the test set, shown in Table 2. Proposition 1 shows that maximizing the population utilities of micro-averaged and instance-averaged metrics is equivalent; the result holds in practice as presented in Table 2. Finally, we report macro-averaged metrics computed on the test set in Table 3.
We observe that Macro-Thres is competitive in 3 out of 4 datasets; this conforms to Proposition 3, which shows that in the case of macro-averaged metrics it is optimal to tune a threshold specific to each label independently. Beyond consistency, we note that by using more samples, joint threshold estimation enjoys additional statistical efficiency, while separate threshold estimation enjoys greater flexibility. This trade-off may explain why Algorithm 1 achieves the best performance in three out of four datasets in Table 3, though it is not consistent for macro-averaged metrics.

⁴The datasets were obtained from http://mulan.sourceforge.net/datasets-mlc.html.
⁵The original CAL500 dataset does not provide splits; we split the data randomly into train and test sets.

| Dataset | BR (F1) | Algorithm 1 (F1) | Macro-Thres (F1) | BR (Jaccard) | Algorithm 1 (Jaccard) | Macro-Thres (Jaccard) |
|---|---|---|---|---|---|---|
| SCENE | 0.6559 | 0.6847 ± 0.0072 | 0.6631 ± 0.0125 | 0.4878 | 0.5151 ± 0.0084 | 0.5010 ± 0.0122 |
| BIRDS | 0.4040 | 0.4088 ± 0.0130 | 0.2871 ± 0.0734 | 0.2495 | 0.2648 ± 0.0095 | 0.1942 ± 0.0401 |
| EMOTIONS | 0.5815 | 0.6554 ± 0.0069 | 0.6419 ± 0.0174 | 0.3982 | 0.4908 ± 0.0074 | 0.4790 ± 0.0077 |
| CAL500 | 0.3647 | 0.4891 ± 0.0035 | 0.4160 ± 0.0078 | 0.2229 | 0.3225 ± 0.0024 | 0.2608 ± 0.0056 |

Table 1: Comparison of plug-in estimator methods on multilabel F1 and Jaccard metrics. Reported values correspond to the micro-averaged metric (F1 and Jaccard) computed on test data (with standard deviation, over 10 random validation sets for tuning thresholds). Algorithm 1 is consistent for micro-averaged metrics, and performs the best consistently across datasets.

| Dataset | BR (F1) | Algorithm 1 (F1) | Macro-Thres (F1) | BR (Jaccard) | Algorithm 1 (Jaccard) | Macro-Thres (Jaccard) |
|---|---|---|---|---|---|---|
| SCENE | 0.5695 | 0.6422 ± 0.0206 | 0.6303 ± 0.0167 | 0.5466 | 0.5976 ± 0.0177 | 0.5902 ± 0.0176 |
| BIRDS | 0.1209 | 0.1390 ± 0.0110 | 0.1390 ± 0.0259 | 0.1058 | 0.1239 ± 0.0077 | 0.1195 ± 0.0096 |
| EMOTIONS | 0.4787 | 0.6241 ± 0.0204 | 0.6156 ± 0.0170 | 0.4078 | 0.5340 ± 0.0072 | 0.5173 ± 0.0086 |
| CAL500 | 0.3632 | 0.4855 ± 0.0035 | 0.4135 ± 0.0079 | 0.2268 | 0.3252 ± 0.0024 | 0.2623 ± 0.0055 |

Table 2: Comparison of plug-in estimator methods on multilabel F1 and Jaccard metrics. Reported values correspond to the instance-averaged metric (F1 and Jaccard) computed on test data (with standard deviation, over 10 random validation sets for tuning thresholds). Algorithm 1 is consistent for instance-averaged metrics, and performs the best consistently across datasets.

| Dataset | BR (F1) | Algorithm 1 (F1) | Macro-Thres (F1) | BR (Jaccard) | Algorithm 1 (Jaccard) | Macro-Thres (Jaccard) |
|---|---|---|---|---|---|---|
| SCENE | 0.6601 | 0.6941 ± 0.0205 | 0.6737 ± 0.0137 | 0.5046 | 0.5373 ± 0.0177 | 0.5260 ± 0.0176 |
| BIRDS | 0.3366 | 0.3448 ± 0.0110 | 0.2971 ± 0.0267 | 0.2178 | 0.2341 ± 0.0077 | 0.2051 ± 0.0215 |
| EMOTIONS | 0.5440 | 0.6450 ± 0.0204 | 0.6440 ± 0.0164 | 0.3982 | 0.4912 ± 0.0072 | 0.4900 ± 0.0133 |
| CAL500 | 0.1293 | 0.2687 ± 0.0035 | 0.3226 ± 0.0068 | 0.0880 | 0.1834 ± 0.0024 | 0.2146 ± 0.0036 |

Table 3: Comparison of plug-in estimator methods on multilabel F1 and Jaccard metrics. Reported values correspond to the macro-averaged metric computed on test data (with standard deviation, over 10 random validation sets for tuning thresholds). Macro-Thres is consistent for macro-averaged metrics, and is competitive in three out of four datasets. Though not consistent for macro-averaged metrics, Algorithm 1 achieves the best performance in three out of four datasets.

## 6 Conclusions and Future Work

We have proposed a framework for the construction and analysis of multilabel classification metrics and corresponding population optimal classifiers.
Our main result is that for a large family of averaged performance metrics, the EUM optimal multilabel classifier can be explicitly characterized by thresholding of label-wise marginal instance-conditional probabilities, with weak label dependence via a shared threshold. We have also proposed efficient and consistent estimators for maximizing such multilabel performance metrics in practice. Our results are a step forward in the direction of extending the state-of-the-art understanding of learning with respect to general metrics in binary and multiclass settings. Our work opens up many interesting research directions, including the potential for further generalization of our results beyond averaged metrics, and generalized results for DTA population optimal classification, which is currently only well understood for the F-measure.

Acknowledgments: We acknowledge the support of NSF via CCF-1117055, CCF-1320746 and IIS-1320894, and NIH via R01 GM117594-01 as part of the Joint DMS/NIGMS Initiative to Support Research at the Interface of the Biological and Mathematical Sciences.

## References

[1] Weiwei Cheng, Eyke Hüllermeier, and Krzysztof J Dembczyński. Bayes optimal multilabel classification via probabilistic classifier chains. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 279-286, 2010.
[2] Krzysztof Dembczyński, Wojciech Kotlowski, and Eyke Hüllermeier. Consistent multilabel ranking through univariate losses. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1319-1326, 2012.
[3] Krzysztof Dembczyński, Willem Waegeman, Weiwei Cheng, and Eyke Hüllermeier. On label dependence and loss minimization in multi-label classification. Machine Learning, 88(1-2):5-45, 2012.
[4] Krzysztof Dembczyński, Arkadiusz Jachnik, Wojciech Kotlowski, Willem Waegeman, and Eyke Hüllermeier. Optimizing the F-measure in multi-label classification: Plug-in rule approach versus structured loss minimization. In Proceedings of the 30th International Conference on Machine Learning, pages 1130-1138, 2013.
[5] Krzysztof J Dembczyński, Willem Waegeman, Weiwei Cheng, and Eyke Hüllermeier. An exact algorithm for F-measure maximization. In Advances in Neural Information Processing Systems, pages 1404-1412, 2011.
[6] Luc Devroye. A Probabilistic Theory of Pattern Recognition, volume 31. Springer, 1996.
[7] Wei Gao and Zhi-Hua Zhou. On the consistency of multi-label learning. Artificial Intelligence, 199:22-44, 2013.
[8] Ashish Kapoor, Raajay Viswanathan, and Prateek Jain. Multilabel classification using Bayesian compressed sensing. In Advances in Neural Information Processing Systems, pages 2645-2653, 2012.
[9] Oluwasanmi O Koyejo, Nagarajan Natarajan, Pradeep K Ravikumar, and Inderjit S Dhillon. Consistent binary classification with generalized performance metrics. In Advances in Neural Information Processing Systems, pages 2744-2752, 2014.
[10] Harikrishna Narasimhan, Rohit Vaish, and Shivani Agarwal. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In Advances in Neural Information Processing Systems, pages 1493-1501, 2014.
[11] Harikrishna Narasimhan, Harish Ramaswamy, Aadirupa Saha, and Shivani Agarwal. Consistent multiclass algorithms for complex performance measures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2398-2407, 2015.
[12] James Petterson and Tibério S Caetano. Submodular multi-label learning. In Advances in Neural Information Processing Systems, pages 1512-1520, 2011.
[13] Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333-359, 2011.
[14] Mark D Reid and Robert C Williamson. Composite binary losses. Journal of Machine Learning Research, 11:2387-2422, 2010.
[15] Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, pages 667-685. Springer, 2010.
[16] Willem Waegeman, Krzysztof Dembczyński, Arkadiusz Jachnik, Weiwei Cheng, and Eyke Hüllermeier. On the Bayes-optimality of F-measure maximizers. Journal of Machine Learning Research, 15:3333-3388, 2014.
[17] Nan Ye, Kian Ming A Chai, Wee Sun Lee, and Hai Leong Chieu. Optimizing F-measures: a tale of two approaches. In Proceedings of the International Conference on Machine Learning, 2012.
[18] Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit Dhillon. Large-scale multi-label learning with missing labels. In Proceedings of the 31st International Conference on Machine Learning, pages 593-601, 2014.