# Consistent Multilabel Classification

Oluwasanmi Koyejo (Department of Psychology, Stanford University), sanmi@stanford.edu
Nagarajan Natarajan (Department of Computer Science, University of Texas at Austin), naga86@cs.utexas.edu
Pradeep Ravikumar (Department of Computer Science, University of Texas at Austin), pradeepr@cs.utexas.edu
Inderjit S. Dhillon (Department of Computer Science, University of Texas at Austin), inderjit@cs.utexas.edu

**Abstract.** Multilabel classification is rapidly developing as an important aspect of modern predictive modeling, motivating study of its theoretical aspects. To this end, we propose a framework for constructing and analyzing multilabel classification metrics which reveals novel results on a parametric form for population optimal classifiers, and additional insight into the role of label correlations. In particular, we show that for multilabel metrics constructed as instance-, micro- and macro-averages, the population optimal classifier can be decomposed into binary classifiers based on the marginal instance-conditional distribution of each label, with a weak association between labels via the threshold. Thus, our analysis extends the state of the art from a few known multilabel classification metrics, such as Hamming loss, to a general framework applicable to many of the classification metrics in common use. Based on the population-optimal classifier, we propose a computationally efficient and general-purpose plug-in classification algorithm, and prove its consistency with respect to the metric of interest. Empirical results on synthetic and benchmark datasets are supportive of our theoretical findings.

## 1 Introduction

Modern classification problems often involve the prediction of multiple labels simultaneously associated with a single instance, e.g., image tagging by predicting multiple objects in an image. The growing importance of multilabel classification has motivated the development of several scalable algorithms [8, 12, 18] and has led to the recent surge in theoretical analysis [1, 3, 7, 16] which helps guide and understand practical advances. While recent results have advanced our knowledge of optimal population classifiers and consistent learning algorithms for particular metrics such as the Hamming loss and multilabel F-measure [3, 4, 5], a general understanding of learning with respect to multilabel classification metrics has remained an open problem. This is in contrast to the more traditional settings of binary and multiclass classification, where several recently established results have led to a rich understanding of optimal and consistent classification [9, 10, 11]. This manuscript constitutes a step towards establishing results for multilabel classification at the level of generality currently enjoyed only in these traditional settings.

Towards a generalized analysis, we propose a framework for multilabel sample performance metrics and their corresponding population extensions. A classification metric is constructed to measure the utility¹ of a classifier, as defined by the practitioner or end-user. The utility may be measured using the sample metric given a finite dataset, and further generalized to the population metric with respect to a given data distribution (i.e. with respect to infinite samples).

Equal contribution.
¹Equivalently, we may define the loss as the negative utility.
Two distinct approaches have been proposed for studying the population performance of a classifier in the classical settings of binary and multiclass classification, described by Ye et al. [17] as decision theoretic analysis (DTA) and empirical utility maximization (EUM). DTA population utilities measure the expected performance of a classifier on a fixed-size test set, while EUM population utilities are directly defined as a function of the population confusion matrix. However, state-of-the-art analysis of multilabel classification has so far lacked such a distinction. The proposed framework defines both EUM and DTA multilabel population utilities as generalizations of the aforementioned classic definitions. Using this framework, we observe that existing work on multilabel classification [1, 3, 7, 16] has exclusively focused on optimizing the DTA utility of (specific) multilabel metrics.

Averaging of binary classification metrics remains one of the most widely used approaches for defining multilabel metrics. Given a binary label representation, such metrics are constructed via averaging with respect to labels (instance-averaging), with respect to examples separately for each label (macro-averaging), or with respect to both labels and examples (micro-averaging). We consider a large sub-family of such metrics where the underlying binary metric can be constructed as a fraction of linear combinations of true positives, false positives, false negatives and true negatives [9]. Examples in this family include the ubiquitous Hamming loss, the averaged precision, the multilabel averaged F-measure, and the averaged Jaccard measure, among others. Our key result is that a Bayes optimal multilabel classifier for such metrics can be explicitly characterized in a simple form: the optimal classifier thresholds the label-wise conditional probability marginals, and the label dependence in the underlying distribution is relevant to the optimal classifier only through the threshold parameter. Further, the threshold is shared by all the labels when the metric is instance-averaged or micro-averaged. This result is surprising and is, to our knowledge, the first result to be shown at this level of generality for multilabel classification. The result also sheds additional insight into the role of label correlations in multilabel classification, answering prior conjectures by Dembczyński et al. [3] and others.

We provide a plug-in estimation based algorithm that is efficient as well as theoretically consistent, i.e. the true utility of the empirical estimator approaches the optimal (EUM) utility of the Bayes classifier (Section 4). We also present an experimental evaluation on synthetic and real-world benchmark multilabel datasets comparing different estimation algorithms (Section 5) for representative multilabel performance metrics selected from the studied family. The results observed in practice are supportive of what the theory predicts.

### 1.1 Related Work

We briefly highlight closely related theoretical results in the multilabel learning literature. Gao and Zhou [7] consider the consistency of multilabel learning with respect to the DTA utility, with a focus on two specific losses, the Hamming and rank losses (the corresponding measures are defined in Section 2). Surrogate losses are devised which result in consistent learning with respect to these metrics. In contrast, we propose a plug-in estimation based algorithm which directly estimates the Bayes optimal, without going through surrogate losses.
Dembczyński et al. [2] analyze the DTA population optimal classifier for the multilabel rank loss, showing that the Bayes optimal is independent of label correlations in the unweighted case, and construct certain weighted univariate losses which are DTA consistent surrogates in the more general weighted case. Perhaps the work most closely related to ours is by Dembczyński et al. [4], who propose a novel DTA consistent plug-in rule estimation based algorithm for the multilabel F-measure. Cheng et al. [1] consider optimizing popular losses in multilabel learning such as the Hamming, rank and subset 0/1 loss (which is the multilabel analog of the classical 0-1 loss). They propose a probabilistic version of classifier chains (first introduced by Read et al. [13]) for estimating the Bayes optimal with respect to the subset 0/1 loss, though without rigorous theoretical justification.

## 2 A Framework for Multilabel Classification Metrics

Consider multilabel classification with $M$ labels, where each instance is denoted by $x \in \mathcal{X}$. For convenience, we will focus on the common binary encoding, where the labels are represented by a vector $y \in \mathcal{Y} = \{0, 1\}^M$, so $y_m = 1$ iff the $m$th label is associated with the instance, and $y_m = 0$ otherwise. The goal is to learn a multilabel classifier $f : \mathcal{X} \to \mathcal{Y}$ that optimizes a certain performance metric with respect to $\mathbb{P}$, a fixed data generating distribution over the domain $\mathcal{X} \times \mathcal{Y}$, using a training set of instance-label pairs $(x^{(n)}, y^{(n)}),\ n = 1, 2, \ldots, N$ drawn (typically assumed i.i.d.) from $\mathbb{P}$. Let $X$ and $Y$ denote the random variables for instances and labels respectively, and let $\Psi$ denote the performance (utility) metric of interest.

Most classification metrics can be represented as functions of the entries of the confusion matrix. In the case of binary classification, the confusion matrix is specified by four numbers, i.e., true positives, true negatives, false positives and false negatives. Similarly, we construct the following primitives for multilabel classification:

$$
\widehat{\mathrm{TP}}(f)_{m,n} = \llbracket f_m(x^{(n)}) = 1, y^{(n)}_m = 1 \rrbracket, \qquad
\widehat{\mathrm{FP}}(f)_{m,n} = \llbracket f_m(x^{(n)}) = 1, y^{(n)}_m = 0 \rrbracket,
$$
$$
\widehat{\mathrm{TN}}(f)_{m,n} = \llbracket f_m(x^{(n)}) = 0, y^{(n)}_m = 0 \rrbracket, \qquad
\widehat{\mathrm{FN}}(f)_{m,n} = \llbracket f_m(x^{(n)}) = 0, y^{(n)}_m = 1 \rrbracket, \tag{1}
$$

where $\llbracket Z \rrbracket$ denotes the indicator function that is 1 if the predicate $Z$ is true and 0 otherwise. It is clear that most multilabel classification metrics considered in the literature can be written as a function of the $MN$ primitives defined in (1). In the following, we consider a construction which is of sufficient generality to capture all multilabel metrics in common use. Let

$$
A_k(f) : \{\widehat{\mathrm{TP}}(f)_{m,n}, \widehat{\mathrm{FP}}(f)_{m,n}, \widehat{\mathrm{TN}}(f)_{m,n}, \widehat{\mathrm{FN}}(f)_{m,n}\}_{m=1,n=1}^{M,N} \to \mathbb{R}, \quad k = 1, 2, \ldots, K
$$

represent a set of $K$ functions. Consider sample multilabel metrics constructed as functions $\Psi : \{A_k(f)\}_{k=1}^{K} \to [0, \infty)$. We note that the metric need not decompose over individual instances. Equipped with this definition of a sample performance metric $\Psi$, consider the population utility of a multilabel classifier $f$ defined as:

$$
U(f; \Psi, \mathbb{P}) = \Psi\big(\{\mathbb{E}[A_k(f)]\}_{k=1}^{K}\big), \tag{2}
$$

where the expectation is over i.i.d. draws from the joint distribution $\mathbb{P}$. Note that this can be seen as a multilabel generalization of the so-called Empirical Utility Maximization (EUM) style classifiers studied in binary [9, 10] and multiclass [11] settings. Our goal is to learn a multilabel classifier that maximizes $U(f; \Psi, \mathbb{P})$ for general performance metrics $\Psi$. Define the (Bayes) optimal multilabel classifier as:

$$
f^* = \operatorname*{argmax}_{f : \mathcal{X} \to \{0,1\}^M} U(f; \Psi, \mathbb{P}). \tag{3}
$$

We say that $\hat{f}$ is a consistent estimator of $f^*$ if $U(\hat{f}; \Psi, \mathbb{P}) \xrightarrow{p} U(f^*; \Psi, \mathbb{P})$.
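To make the construction in (1) concrete, the following is a minimal NumPy sketch (ours, not from the paper) that materializes the four $M \times N$ primitive arrays from predicted and true label matrices; any of the functions $A_k$, and hence a sample metric $\Psi$, can then be computed from these arrays. Variable and function names are illustrative.

```python
import numpy as np

def confusion_primitives(Y_pred, Y_true):
    """Compute the M x N primitives of Eq. (1).

    Y_pred, Y_true: {0,1} arrays of shape (N, M) holding f_m(x^(n)) and y^(n)_m.
    Returns TP, FP, TN, FN boolean arrays, each of shape (M, N).
    """
    Y_pred, Y_true = Y_pred.T, Y_true.T          # index as [m, n]
    TP = (Y_pred == 1) & (Y_true == 1)
    FP = (Y_pred == 1) & (Y_true == 0)
    TN = (Y_pred == 0) & (Y_true == 0)
    FN = (Y_pred == 0) & (Y_true == 1)
    return TP, FP, TN, FN

# Example: the Hamming-accuracy construction discussed below,
#   A_1(f) = (1/MN) * sum_{m,n} (FP_{m,n} + FN_{m,n}),   Psi_Ham = 1 - A_1.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y_true = rng.integers(0, 2, size=(100, 5))   # N = 100 examples, M = 5 labels
    Y_pred = rng.integers(0, 2, size=(100, 5))
    TP, FP, TN, FN = confusion_primitives(Y_pred, Y_true)
    A1 = (FP.sum() + FN.sum()) / FP.size
    print("Hamming accuracy:", 1.0 - A1)
```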
**Examples.** The averaged accuracy (1 − Hamming loss) used in multilabel classification corresponds to simply choosing

$$
A_1(f) = \frac{1}{MN}\sum_{m=1}^{M}\sum_{n=1}^{N}\left(\widehat{\mathrm{FP}}(f)_{m,n} + \widehat{\mathrm{FN}}(f)_{m,n}\right)
\quad\text{and}\quad
\Psi_{\mathrm{Ham}}(f) = 1 - A_1(f).
$$

The measure corresponding to the rank loss² can be obtained by choosing $A_k(f) = \sum_{m_1, m_2} \widehat{\mathrm{FP}}(f)_{m_1,k}\, \widehat{\mathrm{FN}}(f)_{m_2,k}$, for $k = 1, 2, \ldots, N$, and $\Psi_{\mathrm{Rank}} = 1 - \frac{1}{N}\sum_{k=1}^{N} A_k(f)$. Note that the choice of $\{A_k\}$, and therefore $\Psi$, is not unique.

²A subtle but important aspect of the definition of rank loss in the existing literature, including [2] and [7], is that the Bayes optimal is allowed to be a real-valued function and may not correspond to a label decision.

**Remark 1.** Existing results on multilabel classification have focused on decision-theoretic analysis (DTA) style classifiers, where the utility is defined as:

$$
U_{\mathrm{DTA}}(f; \Psi, \mathbb{P}) = \mathbb{E}\left[\Psi\big(\{A_k(f)\}_{k=1}^{K}\big)\right], \tag{4}
$$

and the expectation is over i.i.d. samples from $\mathbb{P}$. Furthermore, there are no theoretical results for consistency with respect to general performance metrics in this setting (see Appendix B.2).

For the remainder of this manuscript, we refer to $U(f; \Psi, \mathbb{P})$ as the utility defined in (2). We will also drop the argument $f$ (e.g. write $\widehat{\mathrm{TP}}(f)$ as $\widehat{\mathrm{TP}}$) when it is clear from the context.

### 2.1 A Framework for Averaged Binary Multilabel Classification Metrics

The most popular class of multilabel performance metrics consists of averaged binary performance metrics, which correspond to particular settings of $\{A_k(f)\}$ using certain averages, as described in the following. For the remainder of this subsection, the metric $\psi : [0, 1]^4 \to [0, \infty)$ will refer to a binary classification metric as is typically applied to a binary confusion matrix.

**Micro-averaging:** Micro-averaged multilabel performance metrics $\Psi_{\mathrm{micro}}$ are defined by averaging over both labels and examples. Let:

$$
\widehat{\mathrm{TP}}(f) = \frac{1}{MN}\sum_{m=1}^{M}\sum_{n=1}^{N}\widehat{\mathrm{TP}}(f)_{m,n}, \qquad
\widehat{\mathrm{FP}}(f) = \frac{1}{MN}\sum_{m=1}^{M}\sum_{n=1}^{N}\widehat{\mathrm{FP}}(f)_{m,n}, \tag{5}
$$

where $\widehat{\mathrm{TN}}(f)$ and $\widehat{\mathrm{FN}}(f)$ are defined similarly; then the micro-averaged multilabel performance metrics are given by:

$$
\Psi_{\mathrm{micro}}(\{A_k(f)\}_{k=1}^{K}) := \psi(\widehat{\mathrm{TP}}, \widehat{\mathrm{FP}}, \widehat{\mathrm{TN}}, \widehat{\mathrm{FN}}). \tag{6}
$$

Thus, for micro-averaging, one applies a binary performance metric to the confusion matrix defined by the averaged quantities described in (5).

**Macro-averaging:** The metric $\Psi_{\mathrm{macro}}$ measures average classification performance across labels. Define the averaged measures:

$$
\widehat{\mathrm{TP}}_m(f) = \frac{1}{N}\sum_{n=1}^{N}\widehat{\mathrm{TP}}(f)_{m,n}, \qquad
\widehat{\mathrm{FP}}_m(f) = \frac{1}{N}\sum_{n=1}^{N}\widehat{\mathrm{FP}}(f)_{m,n},
$$

where $\widehat{\mathrm{TN}}_m(f)$ and $\widehat{\mathrm{FN}}_m(f)$ are defined similarly. The macro-averaged performance metric is given by:

$$
\Psi_{\mathrm{macro}}(\{A_k(f)\}_{k=1}^{K}) := \frac{1}{M}\sum_{m=1}^{M}\psi(\widehat{\mathrm{TP}}_m, \widehat{\mathrm{FP}}_m, \widehat{\mathrm{TN}}_m, \widehat{\mathrm{FN}}_m). \tag{7}
$$

**Instance-averaging:** The metric $\Psi_{\mathrm{instance}}$ measures the average classification performance across examples. Define the averaged measures:

$$
\widehat{\mathrm{TP}}_n(f) = \frac{1}{M}\sum_{m=1}^{M}\widehat{\mathrm{TP}}(f)_{m,n}, \qquad
\widehat{\mathrm{FP}}_n(f) = \frac{1}{M}\sum_{m=1}^{M}\widehat{\mathrm{FP}}(f)_{m,n},
$$

where $\widehat{\mathrm{TN}}_n(f)$ and $\widehat{\mathrm{FN}}_n(f)$ are defined similarly. The instance-averaged performance metric is given by:

$$
\Psi_{\mathrm{instance}}(\{A_k(f)\}_{k=1}^{K}) := \frac{1}{N}\sum_{n=1}^{N}\psi(\widehat{\mathrm{TP}}_n, \widehat{\mathrm{FP}}_n, \widehat{\mathrm{TN}}_n, \widehat{\mathrm{FN}}_n). \tag{8}
$$

## 3 Characterizing the Bayes Optimal Classifier for Multilabel Metrics

We now characterize the optimal multilabel classifier for the large family of metrics outlined in Section 2.1 ($\Psi_{\mathrm{micro}}$, $\Psi_{\mathrm{macro}}$ and $\Psi_{\mathrm{instance}}$) with respect to the EUM utility. We begin by observing that while micro-averaging and instance-averaging seem quite different when viewed as sample averages, they are in fact equivalent at the population level. Thus, we need only focus on $\Psi_{\mathrm{micro}}$ to characterize $\Psi_{\mathrm{instance}}$ as well.

**Proposition 1.** For a given binary classification metric $\psi$, consider the averaged multilabel metrics $\Psi_{\mathrm{micro}}$ defined in (6) and $\Psi_{\mathrm{instance}}$ defined in (8). For any $f$, $U(f; \Psi_{\mathrm{micro}}, \mathbb{P}) = U(f; \Psi_{\mathrm{instance}}, \mathbb{P})$. In particular, $f^*_{\mathrm{micro}} = f^*_{\mathrm{instance}}$.
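As an illustration of the averaging schemes (5)-(8), the sketch below (ours, not from the paper) computes micro-, macro- and instance-averaged F1 from the sample primitives; on a finite sample the three averages generally differ, even though Proposition 1 shows that micro- and instance-averaging coincide at the population level. Names are illustrative.

```python
import numpy as np

def f1(tp, fp, tn, fn, eps=1e-12):
    # Binary F1 written in terms of averaged confusion quantities (cf. Eq. (9)).
    return 2 * tp / (2 * tp + fn + fp + eps)

def averaged_f1(Y_pred, Y_true):
    """Micro-, macro- and instance-averaged F1, per Eqs. (5)-(8).

    Y_pred, Y_true: {0,1} arrays of shape (N, M).
    """
    TP = ((Y_pred == 1) & (Y_true == 1)).astype(float)   # shape (N, M)
    FP = ((Y_pred == 1) & (Y_true == 0)).astype(float)
    TN = ((Y_pred == 0) & (Y_true == 0)).astype(float)
    FN = ((Y_pred == 0) & (Y_true == 1)).astype(float)

    micro = f1(TP.mean(), FP.mean(), TN.mean(), FN.mean())                 # Eq. (6)
    macro = f1(TP.mean(0), FP.mean(0), TN.mean(0), FN.mean(0)).mean()      # Eq. (7): average over labels
    instance = f1(TP.mean(1), FP.mean(1), TN.mean(1), FN.mean(1)).mean()   # Eq. (8): average over examples
    return micro, macro, instance

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y_true = rng.integers(0, 2, size=(200, 6))
    Y_pred = rng.integers(0, 2, size=(200, 6))
    print(averaged_f1(Y_pred, Y_true))
```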
We further restrict our study to metrics selected from the linear-fractional metric family, recently studied in the context of binary classification [9]. Any $\psi$ in this family can be written as:

$$
\psi(\widehat{\mathrm{TP}}, \widehat{\mathrm{FP}}, \widehat{\mathrm{FN}}, \widehat{\mathrm{TN}})
= \frac{a_0 + a_{11}\widehat{\mathrm{TP}} + a_{10}\widehat{\mathrm{FP}} + a_{01}\widehat{\mathrm{FN}} + a_{00}\widehat{\mathrm{TN}}}
       {b_0 + b_{11}\widehat{\mathrm{TP}} + b_{10}\widehat{\mathrm{FP}} + b_{01}\widehat{\mathrm{FN}} + b_{00}\widehat{\mathrm{TN}}},
$$

where $a_0, b_0, a_{ij}, b_{ij},\ i, j \in \{0, 1\}$ are fixed, and $\widehat{\mathrm{TP}}, \widehat{\mathrm{FP}}, \widehat{\mathrm{FN}}, \widehat{\mathrm{TN}}$ are defined as in Section 2.1. Many popular multilabel metrics can be derived using linear-fractional $\psi$. Some examples include³:

$$
\text{F}_\beta:\ \psi_{F_\beta} = \frac{(1 + \beta^2)\,\widehat{\mathrm{TP}}}{(1 + \beta^2)\,\widehat{\mathrm{TP}} + \beta^2\,\widehat{\mathrm{FN}} + \widehat{\mathrm{FP}}}, \qquad
\text{Hamming}:\ \psi_{\mathrm{Ham}} = \widehat{\mathrm{TP}} + \widehat{\mathrm{TN}},
$$
$$
\text{Jaccard}:\ \psi_{\mathrm{Jacc}} = \frac{\widehat{\mathrm{TP}}}{\widehat{\mathrm{TP}} + \widehat{\mathrm{FP}} + \widehat{\mathrm{FN}}}, \qquad
\text{Precision}:\ \psi_{\mathrm{Prec}} = \frac{\widehat{\mathrm{TP}}}{\widehat{\mathrm{TP}} + \widehat{\mathrm{FP}}}. \tag{9}
$$

³Note that Hamming is typically defined as the loss, given by $1 - \psi_{\mathrm{Ham}}$.

Define the population quantities $\pi = \frac{1}{M}\sum_{m=1}^{M} \mathbb{P}(Y_m = 1)$ and $\gamma(f) = \frac{1}{M}\sum_{m=1}^{M} \mathbb{P}(f_m(X) = 1)$. Let $\mathrm{TP}(f) = \mathbb{E}[\widehat{\mathrm{TP}}(f)]$, where the expectation is over i.i.d. draws from $\mathbb{P}$. From (5), it follows that $\mathrm{FP}(f) = \gamma(f) - \mathrm{TP}(f)$, $\mathrm{TN}(f) = 1 - \pi - \gamma(f) + \mathrm{TP}(f)$ and $\mathrm{FN}(f) = \pi - \mathrm{TP}(f)$. Now, the population utility (2) corresponding to $\Psi_{\mathrm{micro}}$ can be written succinctly as:

$$
U(f; \Psi_{\mathrm{micro}}, \mathbb{P}) = \psi(\mathrm{TP}(f), \mathrm{FP}(f), \mathrm{FN}(f), \mathrm{TN}(f))
= \frac{c_0 + c_1\,\mathrm{TP}(f) + c_2\,\gamma(f)}{d_0 + d_1\,\mathrm{TP}(f) + d_2\,\gamma(f)}, \tag{10}
$$

with the constants $c_0 = (a_{01} - a_{00})\pi + a_{00} + a_0$, $c_1 = a_{11} - a_{10} - a_{01} + a_{00}$, $c_2 = a_{10} - a_{00}$, and $d_0 = (b_{01} - b_{00})\pi + b_{00} + b_0$, $d_1 = b_{11} - b_{10} - b_{01} + b_{00}$, $d_2 = b_{10} - b_{00}$.

We assume that the joint $\mathbb{P}$ has a density $\mu$ that satisfies $d\mathbb{P} = \mu\,dx$, and define $\eta_m(x) = \mathbb{P}(Y_m = 1 \mid X = x)$. Our first main result characterizes the Bayes optimal multilabel classifier $f^*$.

**Theorem 2.** Given the constants $\{c_1, c_2, c_0\}$ and $\{d_1, d_2, d_0\}$, define:

$$
\delta^* = \frac{d_2\, U^*_{\mathrm{micro}} - c_2}{c_1 - d_1\, U^*_{\mathrm{micro}}},
\quad\text{where}\quad
U^*_{\mathrm{micro}} = U(f^*; \Psi_{\mathrm{micro}}, \mathbb{P}). \tag{11}
$$

The optimal Bayes classifier $f^* := f^*_{\mathrm{micro}}$ defined in (3) is given by:

1. When $c_1 > d_1 U^*_{\mathrm{micro}}$, $f^*$ takes the form $f^*_m(x) = \llbracket \eta_m(x) > \delta^* \rrbracket$, for $m \in [M]$.
2. When $c_1 < d_1 U^*_{\mathrm{micro}}$, $f^*$ takes the form $f^*_m(x) = \llbracket \eta_m(x) < \delta^* \rrbracket$, for $m \in [M]$.

The proof is provided in Appendix A.2, and applies equivalently to instance-averaging. Theorem 2 recovers existing results in the binary [9] setting (see Appendix B.1 for details), and is sufficiently general to capture many of the multilabel metrics used in practice. Our proof is closely related to the binary classification case analyzed in Theorem 2 of [9], but differs in the additional averaging across labels.

A key observation from Theorem 2 is that the optimal multilabel classifier can be obtained by thresholding the marginal instance-conditional probability $\mathbb{P}(Y_m = 1 \mid x)$ for each label and, importantly, that the optimal classifiers for all the labels share the same threshold $\delta^*$. Thus, the effect of the joint distribution enters only through the threshold parameter. We emphasize that while the presented results characterize the optimal population classifier, incorporating label correlations into the prediction algorithm may have other benefits with finite samples, such as statistical efficiency when there are known structural similarities between the marginal distributions [3]. Further analysis is left for future work.

The Bayes optimal for the macro-averaged population metric is straightforward to establish. We observe that the threshold is not shared in this case.

**Proposition 3.** For a given linear-fractional metric $\psi$, consider the macro-averaged multilabel metric $\Psi_{\mathrm{macro}}$ defined in (7). Let $c_1 > d_1 U^*_{\mathrm{macro}}$ and $f^* = f^*_{\mathrm{macro}}$. We have, for $m = 1, 2, \ldots, M$:

$$
f^*_m(x) = \llbracket \eta_m(x) > \delta^*_m \rrbracket,
$$

where $\delta^*_m \in [0, 1]$ is a constant that depends on the metric and the label-wise instance-conditional marginals of $\mathbb{P}$. Analogous results hold for $c_1 < d_1 U^*_{\mathrm{macro}}$.
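As a worked illustration of Theorem 2 (ours, not from the paper), consider the micro-averaged $F_1$ metric from (9), for which $a_{11} = 2$, $b_{11} = 2$, $b_{10} = b_{01} = 1$ and all remaining coefficients are zero. The constants of (10) are then $c_1 = 2$, $c_2 = 0$, $d_1 = 0$ and $d_2 = 1$, so (11) reduces to

$$
\delta^*_{F_1} = \frac{d_2\, U^*_{\mathrm{micro}} - c_2}{c_1 - d_1\, U^*_{\mathrm{micro}}} = \frac{U^*_{\mathrm{micro}}}{2},
$$

i.e. the shared optimal threshold is half the optimal micro-averaged $F_1$ utility. The same computation for Hamming accuracy ($a_{11} = a_{00} = 1$, $b_0 = 1$) gives $c_1 = 2$, $c_2 = -1$, $d_1 = d_2 = 0$, and hence $\delta^* = 1/2$, recovering the usual 0.5 threshold.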
**Remark 2.** It is clear that micro-, macro- and instance-averaging are equivalent at the population level when the metric $\psi$ is linear. This is a straightforward consequence of the observation that the corresponding sample utilities are the same. More generally, micro-, macro- and instance-averaging are equivalent whenever the optimal threshold is a constant independent of $\mathbb{P}$, such as for linear metrics, where $d_1 = d_2 = 0$ so $\delta^* = -c_2 / c_1$ (cf. Corollary 4 of Koyejo et al. [9]). Thus, our analysis recovers known results for Hamming loss [3, 7].

## 4 Consistent Plug-in Estimation Algorithm

Importantly, the Bayes optimal characterization points to a simple plug-in estimation algorithm that enjoys consistency, as follows. First, one obtains an estimate $\hat{\eta}_m(x)$ of the marginal instance-conditional probability $\eta_m(x) = \mathbb{P}(Y_m = 1 \mid x)$ for each label $m$ (see Reid and Williamson [14]) using a training sample. Then, the given metric $\Psi_{\mathrm{micro}}(f)$ is maximized on a validation sample. For the remainder of this manuscript, we assume w.l.o.g. that $c_1 > d_1 U^*$. Note that in order to maximize over $\{f_\delta : f_m(x) = \llbracket \hat{\eta}_m(x) > \delta \rrbracket\ \forall m = 1, 2, \ldots, M,\ \delta \in (0, 1)\}$, it suffices to optimize:

$$
\hat{\delta} = \operatorname*{argmax}_{\delta \in (0, 1)} \Psi_{\mathrm{micro}}(\hat{f}_\delta), \tag{12}
$$

where $\Psi_{\mathrm{micro}}$ is the micro-averaged sample metric defined in (6) (similarly for $\Psi_{\mathrm{instance}}$). Though the threshold search is over a continuous space $\delta \in (0, 1)$, the number of distinct values of $\Psi_{\mathrm{micro}}(\hat{f}_\delta)$ given a training sample of size $N$ is at most $NM$. Thus (12) can be solved efficiently on a finite sample.

**Algorithm 1:** Plug-in Estimator for $\Psi_{\mathrm{micro}}$ and $\Psi_{\mathrm{instance}}$
Input: Training examples $S = \{x^{(n)}, y^{(n)}\}_{n=1}^{N}$ and metric $\Psi_{\mathrm{micro}}$ (or $\Psi_{\mathrm{instance}}$).
for $m = 1, 2, \ldots, M$ do
  1. Select the training data for label $m$: $S_m = \{x^{(n)}, y^{(n)}_m\}_{n=1}^{N}$.
  2. Split the training data $S_m$ into two sets $S_{m1}$ and $S_{m2}$.
  3. Estimate $\hat{\eta}_m(x)$ using $S_{m1}$, and define $\hat{f}_m(x) = \llbracket \hat{\eta}_m(x) > \delta \rrbracket$.
end for
Obtain $\hat{\delta}$ by solving (12) on $S_2 = \cup_{m=1}^{M} S_{m2}$.
Return: $\hat{f}_{\hat{\delta}}$.

**Consistency of the proposed algorithm.** The following theorem shows that the plug-in procedure of Algorithm 1 results in a consistent classifier.

**Theorem 4.** Let $\Psi_{\mathrm{micro}}$ be a linear-fractional metric. If the estimates $\hat{\eta}_m(x)$ satisfy $\hat{\eta}_m \xrightarrow{p} \eta_m$ for all $m$, then the output multilabel classifier $\hat{f}_{\hat{\delta}}$ of Algorithm 1 is consistent.

The proof is provided in Appendix A.4. From Proposition 1, it follows that consistency holds for $\Psi_{\mathrm{instance}}$ as well. Additionally, in light of Proposition 3, we may apply the learning algorithms proposed by [9] for binary classification independently for each label to obtain a consistent estimator for $\Psi_{\mathrm{macro}}$.
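The following is a minimal sketch of Algorithm 1 under stated assumptions (ours, not the authors' code): per-label conditional probabilities are estimated with scikit-learn logistic regression on one split, and a single shared threshold is tuned on the held-out split by maximizing the micro-averaged sample F1 over the finitely many candidate thresholds given by the predicted probabilities. Function and variable names are illustrative, and the sketch assumes every label takes both values on the training split.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def micro_f1(Y_pred, Y_true):
    # Micro-averaged sample F1 (Eq. (6) with the F1 metric of Eq. (9)).
    tp = np.sum((Y_pred == 1) & (Y_true == 1))
    fp = np.sum((Y_pred == 1) & (Y_true == 0))
    fn = np.sum((Y_pred == 0) & (Y_true == 1))
    return 2 * tp / max(2 * tp + fp + fn, 1)

def plugin_estimator(X, Y, val_fraction=0.5, seed=0):
    """Sketch of Algorithm 1: per-label probability estimates + one shared threshold.

    X: (N, d) features; Y: (N, M) binary label matrix.
    Returns the fitted per-label models and the tuned shared threshold.
    """
    rng = np.random.default_rng(seed)
    N, M = Y.shape
    idx = rng.permutation(N)
    n_val = int(val_fraction * N)
    val, train = idx[:n_val], idx[n_val:]

    # Steps 1-3: estimate eta_m on the training split, one model per label.
    models = [LogisticRegression(max_iter=1000).fit(X[train], Y[train, m]) for m in range(M)]
    # Predicted marginals on the validation split, shape (n_val, M).
    eta_hat = np.column_stack([mdl.predict_proba(X[val])[:, 1] for mdl in models])

    # Eq. (12): search over the (at most n_val * M) distinct candidate thresholds.
    best_delta, best_util = 0.5, -np.inf
    for delta in np.unique(eta_hat):
        util = micro_f1((eta_hat > delta).astype(int), Y[val])
        if util > best_util:
            best_delta, best_util = delta, util
    return models, best_delta
```

A prediction for a new instance $x$ is then obtained by thresholding each label's estimated marginal at the single tuned value, i.e. `models[m].predict_proba(x)[:, 1] > best_delta` for every $m$.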
## 5 Experiments

We present two sets of results. The first is an experimental validation on synthetic data with known ground truth probabilities; the results serve to verify our main result (Theorem 2) characterizing the Bayes optimal for averaged multilabel metrics. The second is an experimental evaluation of the plug-in estimator algorithms for micro-, instance-, and macro-averaged multilabel metrics on benchmark datasets.

### 5.1 Synthetic data: Verification of Bayes optimal

We consider the micro-averaged F1 metric in (9) for multilabel classification with 4 labels. We sample a set of five 2-dimensional vectors $x = \{x^{(1)}, x^{(2)}, \ldots, x^{(5)}\}$ from the standard Gaussian. The conditional probability $\eta_m$ for label $m$ is modeled using a sigmoid function, $\eta_m(x) = \mathbb{P}(Y_m = 1 \mid x) = \frac{1}{1 + \exp(-w_m^{\top} x)}$, using a vector $w_m$ sampled from the standard Gaussian. The Bayes optimal $f^*(x) \in \{0, 1\}^4$ that maximizes the micro-averaged F1 population utility is then obtained by exhaustive search over all possible label vectors for each instance.

In Figure 1(a)-(d), we plot the conditional probabilities (with respect to the sample index) for each label, the corresponding $f^*_m$ for each $x$, and the optimal threshold $\delta^*$ computed using (11). We observe that the optimal multilabel classifier indeed thresholds $\mathbb{P}(Y_m \mid x)$ for each label $m$, and furthermore, that the threshold is the same for all the labels, as stated in Theorem 2.

Figure 1: Bayes optimal classifier for multilabel F1 measure on synthetic data with 4 labels, and distribution supported on 5 instances. Plots from left to right show the Bayes optimal classifier prediction for instances, and for labels 1 through 4. Note that the optimal $\delta^*$ at which the label-wise marginal $\eta_m(x)$ is thresholded is shared, conforming to Theorem 2 (larger plots are included in Appendix C).

### 5.2 Benchmark data: Evaluation of plug-in estimators

We now evaluate the proposed plug-in estimation (Algorithm 1), which is consistent for micro- and instance-averaged multilabel metrics. We focus on two metrics, F1 and Jaccard, listed in (9). We compare Algorithm 1, designed to optimize micro-averaged (or instance-averaged) multilabel metrics, to two related plug-in estimation methods: (i) a separate threshold $\delta_m$ tuned for each label $m$ individually; this optimizes the utility corresponding to the macro-averaged metric, but is not consistent for micro-averaged or instance-averaged metrics, and is the most common approach in practice. We refer to this as Macro-Thres. (ii) a constant threshold 1/2 for all the labels; this is known to be optimal for averaged accuracy (equivalently, Hamming loss), but not for non-decomposable F1 or Jaccard metrics. We refer to this as Binary Relevance (BR) [15].

We use four benchmark multilabel datasets⁴ in our experiments: (i) SCENE, an image dataset consisting of 6 labels, with 1211 training and 1196 test instances, (ii) BIRDS, an audio dataset consisting of 19 labels, with 323 training and 322 test instances, (iii) EMOTIONS, a music dataset consisting of 6 labels, with 393 training and 202 test instances, and (iv) CAL500, a music dataset consisting of 174 labels, with 400 training and 100 test instances⁵. We perform logistic regression (with L2 regularization) on a separate subsample to obtain estimates $\hat{\eta}_m(x)$ of $\mathbb{P}(Y_m = 1 \mid x)$ for each label $m$ (as described in Section 4). All the methods we evaluate rely on obtaining a good estimator for the conditional probability, so we exclude labels that are associated with very few instances; in particular, we train and evaluate using labels associated with at least 20 instances in each dataset, for all the methods.

In Table 1, we report the micro-averaged F1 and Jaccard metrics on the test set for Algorithm 1, Macro-Thres and Binary Relevance. We observe that estimating a fixed threshold for all the labels (Algorithm 1) consistently performs better than estimating thresholds for each label (Macro-Thres) and than using threshold 1/2 for all labels (BR); this conforms to our main result in Theorem 2 and the consistency analysis of Algorithm 1 in Theorem 4. A similar trend is observed for the instance-averaged metrics computed on the test set, shown in Table 2. Proposition 1 shows that maximizing the population utilities of micro-averaged and instance-averaged metrics is equivalent; the result holds in practice as presented in Table 2. Finally, we report macro-averaged metrics computed on the test set in Table 3.
We observe that Macro-Thres is competitive in 3 out of 4 datasets; this conforms to Proposition 3, which shows that in the case of macro-averaged metrics it is optimal to tune a threshold specific to each label independently. Beyond consistency, we note that by using more samples, joint threshold estimation enjoys additional statistical efficiency, while separate threshold estimation enjoys greater flexibility. This trade-off may explain why Algorithm 1 achieves the best performance in three out of four datasets in Table 3, though it is not consistent for macro-averaged metrics.

⁴The datasets were obtained from http://mulan.sourceforge.net/datasets-mlc.html.
⁵The original CAL500 dataset does not provide splits; we split the data randomly into train and test sets.

| Dataset | BR (F1) | Algorithm 1 (F1) | Macro-Thres (F1) | BR (Jaccard) | Algorithm 1 (Jaccard) | Macro-Thres (Jaccard) |
|---|---|---|---|---|---|---|
| SCENE | 0.6559 | 0.6847 ± 0.0072 | 0.6631 ± 0.0125 | 0.4878 | 0.5151 ± 0.0084 | 0.5010 ± 0.0122 |
| BIRDS | 0.4040 | 0.4088 ± 0.0130 | 0.2871 ± 0.0734 | 0.2495 | 0.2648 ± 0.0095 | 0.1942 ± 0.0401 |
| EMOTIONS | 0.5815 | 0.6554 ± 0.0069 | 0.6419 ± 0.0174 | 0.3982 | 0.4908 ± 0.0074 | 0.4790 ± 0.0077 |
| CAL500 | 0.3647 | 0.4891 ± 0.0035 | 0.4160 ± 0.0078 | 0.2229 | 0.3225 ± 0.0024 | 0.2608 ± 0.0056 |

Table 1: Comparison of plug-in estimator methods on multilabel F1 and Jaccard metrics. Reported values correspond to the micro-averaged metric (F1 and Jaccard) computed on test data (with standard deviation, over 10 random validation sets for tuning thresholds). Algorithm 1 is consistent for micro-averaged metrics, and performs the best consistently across datasets.

| Dataset | BR (F1) | Algorithm 1 (F1) | Macro-Thres (F1) | BR (Jaccard) | Algorithm 1 (Jaccard) | Macro-Thres (Jaccard) |
|---|---|---|---|---|---|---|
| SCENE | 0.5695 | 0.6422 ± 0.0206 | 0.6303 ± 0.0167 | 0.5466 | 0.5976 ± 0.0177 | 0.5902 ± 0.0176 |
| BIRDS | 0.1209 | 0.1390 ± 0.0110 | 0.1390 ± 0.0259 | 0.1058 | 0.1239 ± 0.0077 | 0.1195 ± 0.0096 |
| EMOTIONS | 0.4787 | 0.6241 ± 0.0204 | 0.6156 ± 0.0170 | 0.4078 | 0.5340 ± 0.0072 | 0.5173 ± 0.0086 |
| CAL500 | 0.3632 | 0.4855 ± 0.0035 | 0.4135 ± 0.0079 | 0.2268 | 0.3252 ± 0.0024 | 0.2623 ± 0.0055 |

Table 2: Comparison of plug-in estimator methods on multilabel F1 and Jaccard metrics. Reported values correspond to the instance-averaged metric (F1 and Jaccard) computed on test data (with standard deviation, over 10 random validation sets for tuning thresholds). Algorithm 1 is consistent for instance-averaged metrics, and performs the best consistently across datasets.

| Dataset | BR (F1) | Algorithm 1 (F1) | Macro-Thres (F1) | BR (Jaccard) | Algorithm 1 (Jaccard) | Macro-Thres (Jaccard) |
|---|---|---|---|---|---|---|
| SCENE | 0.6601 | 0.6941 ± 0.0205 | 0.6737 ± 0.0137 | 0.5046 | 0.5373 ± 0.0177 | 0.5260 ± 0.0176 |
| BIRDS | 0.3366 | 0.3448 ± 0.0110 | 0.2971 ± 0.0267 | 0.2178 | 0.2341 ± 0.0077 | 0.2051 ± 0.0215 |
| EMOTIONS | 0.5440 | 0.6450 ± 0.0204 | 0.6440 ± 0.0164 | 0.3982 | 0.4912 ± 0.0072 | 0.4900 ± 0.0133 |
| CAL500 | 0.1293 | 0.2687 ± 0.0035 | 0.3226 ± 0.0068 | 0.0880 | 0.1834 ± 0.0024 | 0.2146 ± 0.0036 |

Table 3: Comparison of plug-in estimator methods on multilabel F1 and Jaccard metrics. Reported values correspond to the macro-averaged metric computed on test data (with standard deviation, over 10 random validation sets for tuning thresholds). Macro-Thres is consistent for macro-averaged metrics, and is competitive in three out of four datasets. Though not consistent for macro-averaged metrics, Algorithm 1 achieves the best performance in three out of four datasets.

## 6 Conclusions and Future Work

We have proposed a framework for the construction and analysis of multilabel classification metrics and corresponding population optimal classifiers.
Our main result is that for a large family of averaged performance metrics, the EUM optimal multilabel classifier can be explicitly characterized by thresholding of label-wise marginal instance-conditional probabilities, with weak label dependence via a shared threshold. We have also proposed efficient and consistent estimators for maximizing such multilabel performance metrics in practice. Our results are a step forward in the direction of extending the state-of-the-art understanding of learning with respect to general metrics in binary and multiclass settings. Our work opens up many interesting research directions, including the potential for further generalization of our results beyond averaged metrics, and generalized results for DTA population optimal classification, which is currently only well understood for the F-measure.

Acknowledgments: We acknowledge the support of NSF via CCF-1117055, CCF-1320746 and IIS-1320894, and NIH via R01 GM117594-01 as part of the Joint DMS/NIGMS Initiative to Support Research at the Interface of the Biological and Mathematical Sciences.

## References

[1] Weiwei Cheng, Eyke Hüllermeier, and Krzysztof J Dembczyński. Bayes optimal multilabel classification via probabilistic classifier chains. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 279-286, 2010.
[2] Krzysztof Dembczyński, Wojciech Kotlowski, and Eyke Hüllermeier. Consistent multilabel ranking through univariate losses. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1319-1326, 2012.
[3] Krzysztof Dembczyński, Willem Waegeman, Weiwei Cheng, and Eyke Hüllermeier. On label dependence and loss minimization in multi-label classification. Machine Learning, 88(1-2):5-45, 2012.
[4] Krzysztof Dembczyński, Arkadiusz Jachnik, Wojciech Kotlowski, Willem Waegeman, and Eyke Hüllermeier. Optimizing the F-measure in multi-label classification: Plug-in rule approach versus structured loss minimization. In Proceedings of the 30th International Conference on Machine Learning, pages 1130-1138, 2013.
[5] Krzysztof J Dembczyński, Willem Waegeman, Weiwei Cheng, and Eyke Hüllermeier. An exact algorithm for F-measure maximization. In Advances in Neural Information Processing Systems, pages 1404-1412, 2011.
[6] Luc Devroye. A Probabilistic Theory of Pattern Recognition, volume 31. Springer, 1996.
[7] Wei Gao and Zhi-Hua Zhou. On the consistency of multi-label learning. Artificial Intelligence, 199:22-44, 2013.
[8] Ashish Kapoor, Raajay Viswanathan, and Prateek Jain. Multilabel classification using Bayesian compressed sensing. In Advances in Neural Information Processing Systems, pages 2645-2653, 2012.
[9] Oluwasanmi O Koyejo, Nagarajan Natarajan, Pradeep K Ravikumar, and Inderjit S Dhillon. Consistent binary classification with generalized performance metrics. In Advances in Neural Information Processing Systems, pages 2744-2752, 2014.
[10] Harikrishna Narasimhan, Rohit Vaish, and Shivani Agarwal. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In Advances in Neural Information Processing Systems, pages 1493-1501, 2014.
[11] Harikrishna Narasimhan, Harish Ramaswamy, Aadirupa Saha, and Shivani Agarwal. Consistent multiclass algorithms for complex performance measures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2398-2407, 2015.
[12] James Petterson and Tibério S Caetano. Submodular multi-label learning. In Advances in Neural Information Processing Systems, pages 1512-1520, 2011.
[13] Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333-359, 2011.
[14] Mark D Reid and Robert C Williamson. Composite binary losses. Journal of Machine Learning Research, 11:2387-2422, 2010.
[15] Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, pages 667-685. Springer, 2010.
[16] Willem Waegeman, Krzysztof Dembczyński, Arkadiusz Jachnik, Weiwei Cheng, and Eyke Hüllermeier. On the Bayes-optimality of F-measure maximizers. Journal of Machine Learning Research, 15:3333-3388, 2014.
[17] Nan Ye, Kian Ming A Chai, Wee Sun Lee, and Hai Leong Chieu. Optimizing F-measures: a tale of two approaches. In Proceedings of the International Conference on Machine Learning, 2012.
[18] Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit Dhillon. Large-scale multi-label learning with missing labels. In Proceedings of the 31st International Conference on Machine Learning, pages 593-601, 2014.