# on_the_bayesoptimality_of_fmeasure_maximizers__71792be1.pdf Journal of Machine Learning Research 15 (2014) 3513-3568 Submitted 10/13; Revised 6/14; Published 11/14 On the Bayes-Optimality of F-Measure Maximizers Willem Waegeman willem.waegeman@ugent.be Department of Mathematical Modelling, Statistics and Bioinformatics Ghent University, Ghent 9000 Belgium Krzysztof Dembczy nski krzysztof.dembczynski@cs.put.poznan.pl Arkadiusz Jachnik arkadiusz.jachnik@cs.put.poznan.pl Institute of Computing Science Poznan University of Technology, Poznan 60-695 Poland Weiwei Cheng roywwcheng@gmail.com Amazon Development Center Germany, Berlin 10707 Germany Eyke H ullermeier eyke@upb.de Department of Computer Science University of Paderborn, Paderborn 33098 Germany Editor: Charles Elkan The F-measure, which has originally been introduced in information retrieval, is nowadays routinely used as a performance metric for problems such as binary classification, multi-label classification, and structured output prediction. Optimizing this measure is a statistically and computationally challenging problem, since no closed-form solution exists. Adopting a decision-theoretic perspective, this article provides a formal and experimental analysis of different approaches for maximizing the F-measure. We start with a Bayes-risk analysis of related loss functions, such as Hamming loss and subset zero-one loss, showing that optimizing such losses as a surrogate of the F-measure leads to a high worst-case regret. Subsequently, we perform a similar type of analysis for F-measure maximizing algorithms, showing that such algorithms are approximate, while relying on additional assumptions regarding the statistical distribution of the binary response variables. Furthermore, we present a new algorithm which is not only computationally efficient but also Bayesoptimal, regardless of the underlying distribution. To this end, the algorithm requires only a quadratic (with respect to the number of binary responses) number of parameters of the joint distribution. We illustrate the practical performance of all analyzed methods by means of experiments with multi-label classification problems. Keywords: F-measure, Bayes-optimal predictions, regret, statistical decision theory, multi-label classification, structured output prediction 1. Introduction Being rooted in information retrieval (van Rijsbergen, 1974), the so-called F-measure is nowadays routinely used as a performance metric for different types of prediction problems, including binary classification, multi-label classification (MLC), and certain applications of structured output prediction. Amongst others, examples of such applications include c 2014 Willem Waegeman, Krzysztof Dembczy nski, Arkadiusz Jachnik, Weiwei Cheng, Eyke H ullermeier. Waegeman et al. chunking or named entity recognition in natural language processing (Sang and De Meulder, 2003), image segmentation or edge detection in computer vision (Martin et al., 2004) and detection of geographic coincidence in social networks (Zhuang et al., 2012). Compared to measures like error rate in binary classification and Hamming loss in multilabel classification, the F-measure enforces a better balance between performance on the minority and the majority class, respectively. Therefore, it is more suitable in the case of imbalanced data, as it does not take the true negative rate into account. Given a prediction h = (h1, . . . , hm) {0, 1}m of an m-dimensional binary label vector y = (y1, . . . , ym) (e.g., the class labels of a test set of size m in binary classification or the label vector associated with a single instance in MLC or the binary vector indicating named entities in a text document in a structured output prediction task), the F-measure is defined as follows: F(y, h) = 2 Pm i=1 yihi Pm i=1 yi + Pm i=1 hi [0, 1] , (1) where 0/0 = 1 by definition. This measure essentially corresponds to the harmonic mean of precision prec and recall recl: prec(y, h) = Pm i=1 yihi Pm i=1 hi , recl(y, h) = Pm i=1 yihi Pm i=1 yi . One can generalize the F-measure to a weighted harmonic average of these two values, but for the sake of simplicity, we stick to the unweighted mean, which is often referred to as the F1-score or the F1-measure. Despite its popularity in experimental settings, very few theoretical studies of the Fmeasure can be found. This paper intends to bridge this gap by analyzing existing methods and, moreover, by presenting a new algorithm that exhibits the desirable property of statistical consistency. To this end, we will adopt a decision-theoretic viewpoint. Modeling the ground-truth as a random variable Y = (Y1, Y2, . . . , Ym), i.e., assuming an underlying probability distribution P over {0, 1}m, the prediction h that maximizes the expected F-measure is given by h F = arg max h {0,1}m E [F(Y , h)] = arg max h {0,1}m y {0,1}m P(y) F(y, h). (2) The corresponding optimization problem is non-trivial and cannot be solved in closed form. Moreover, a brute-force search is infeasible, as it would require checking all 2m combinations of prediction vector h and summing over an exponential number of terms in each combination. As a result, many researchers who report the F-measure in experimental studies rely on optimizing a surrogate loss as an approximation of (2). For problems such as multi-label classification and structured output prediction, the Hamming loss and the subset zero-one loss are immediate candidates for such surrogates. However, as will be shown in Section 3, these surrogates do not yield a statistically consistent model and, more importantly, they manifest a high regret. As an intermezzo, we present results for the Jaccard index, which has recently gained an increased popularity in areas such as multi-label classification. This measure is closely related to the F-measure, and its optimization appears to be even more difficult. On the Bayes-Optimality of F-Measure Maximizers Apart from optimizing surrogates, a few more specialized approaches for finding the F-measure maximizer (2) have been presented in the last decades (Lewis, 1995; Chai, 2005; Jansche, 2007; Ye et al., 2012; Quevedo et al., 2012). These algorithms will be revisited in Section 4. They typically require the assumption of independence of the Yi, i.e., i=1 pyi i (1 pi)1 yi , (3) with pi = P(Yi = 1). While being natural for problems like binary classification, this assumption is indeed not tenable in domains like MLC and structured output prediction. We will show in Section 4 that algorithms based on independence assumptions or marginal probabilities are not statistically consistent when arbitrary probability distributions P are considered. Moreover, we also show that the worst-case regret of these algorithms is very high. Looking at (2), it seems that information about the entire joint distribution P is needed to maximize the F-measure. Yet, as will be shown in this paper, the problem can be solved more efficiently. In Section 5, we present a general algorithm that requires only a quadratic instead of an exponential (with respect to m) number of parameters of the joint distribution. If these parameters are given, then, depending on their form, the exact solution can be obtained in quadratic or cubic time. This result holds regardless of the underlying distribution. In particular, unlike algorithms such as Chai (2005); Jansche (2007); Ye et al. (2012) and Quevedo et al. (2012), we do not require independence of the binary response variables (labels). Our theoretical results are specifically relevant for applications in multi-label classification and structured output prediction. In these application domains, three different aggregation schemes of the F-measure can be distinguished, namely instance-wise, microand macro-averaging. One should carefully distinguish these versions, since algorithms optimized with a given objective are usually performing suboptimally for other (target) evaluation measures (e.g., Dembczy nski et al., 2012a; Luaces et al., 2012). In Section 7, we present extensive experimental results to illustrate the practical usefulness of our findings. More specifically, all examined methods are compared for a series of multi-label classification problems. One particular data set originates from a recent data mining competition, in which we obtained the second place using some of the algorithms presented in this article. Let us anticipate that our experimental results will not determine a clear winner. This is not at all surprising: while enjoying the advantage of consistency, our algorithm requires the estimation of more parameters than existing approximate algorithms. As a consequence, exact optimization is not necessarily superior to approximate optimization. Instead, the relative performance of exact and approximate optimization depends on several factors, such as the sample size, the length of Y , the shape of the distribution P, etc. As mentioned above, we adopt a decision-theoretic point of view: assuming a probabilistic model to be given, the problem of F-measure maximization is interpreted as an inference problem. Before going into technical details, we like to stress that this is only one way of looking at F-measure maximization. A second, somewhat orthogonal approach is to optimize the F-measure during the training phase. This is sometimes referred to as empirical utility maximization. In general, optimality in this framework is different from our definition of optimality, but connections between the two paradigms have recently been Waegeman et al. discussed by Ye et al. (2012). These authors establish asymptotic equivalence results under the assumption of independence and infinitely large vectors Y . They focus on binary classification problems, for which such assumptions are realistic, because the vector Y then represents an entire test set of i.i.d. observations. The same assumptions are made in another recent work that provides an interesting theoretical analysis for binary classification problems (Zhao et al., 2013). However, in structured output prediction and multi-label classification, independence does not hold and the length of Y might be small, especially if the instance-wise F-measure needs to be optimized. Algorithms that optimize the F-measure during training will hence not be discussed further in this article. Nevertheless, we briefly mention some of them here for the sake of completeness. In binary classification, such algorithms are extensions of support vector machines (Musicant et al., 2003; Joachims, 2005), logistic regression (Jansche, 2005) or boosting (Kokkinos, 2010). However, the most popular methods, including that of Keerthi et al. (2007), rely on explicit threshold adjustment. A few specific algorithms have also been proposed for certain applications in structured output prediction (Tsochantaridis et al., 2005; Suzuki et al., 2006; Daum e III et al., 2009) and multi-label classification (Fan and Lin, 2007; Zhang et al., 2010; Petterson and Caetano, 2010, 2011). During training, some of these methods, especially those based on structured SVMs, need to solve an inference problem that is closely related but not identical to (2). In a recent paper, we have presented a theoretical and experimental comparison of approaches that optimize the F-measure during training or inference, respectively, in the context of multi-label classification (Dembczy nski et al., 2013). Since we focus on the decision-theoretic point of view in this work, we do not discuss the theoretical results obtained in that article, but for completeness we report some experimental results. Parts of this article have already been published in previous conference papers (Dembczy nski et al., 2011, 2013). Here, we summarize the results of these papers in a unifying framework, provide a much more detailed theoretical analysis, and complement them by additional formal results as well as novel experimental studies. The rest of the article is organized as follows. Section 2 gives a formal definition of the notion of regret, which will serve as a key element for showing that most existing algorithms are suboptimal. Section 3 contains a regret analysis of algorithms that optimize other loss functions than the F-measure. In Section 4, we perform a similar analysis for F-measure inference algorithms that have been proposed in the literature. Subsequently, our own algorithm is presented in Section 5. In Section 6, we test the inference algorithms on synthetic data, while practical considerations, applications and experimental results on benchmark data are further discussed in Section 7. The proofs of all theorems in Sections 3 and 4 can be found in an appendix. 2. Formal Presentation of our Mathematical Framework In order to show formally that many existing algorithms are sub-optimal w.r.t. optimising the F-measure, we will use a mathematical framework that is closely related to frameworks encountered in classical machine learning papers. However, our analysis will slightly differ from the analysis performed in such papers, as we are investigating inference algorithms instead of training algorithms. Let us start with a formal definition of what we will call the On the Bayes-Optimality of F-Measure Maximizers regret of an inference algorithm. Definition 2.1 Given a probability distribution P, the regret of a vector of predictions h w.r.t. the F-measure is formally defined as RF (h) = E F(Y , h F ) F(Y , h) = X F(y, h F ) F(y, h) P(y) , with h F the F-measure maximizer in (2). The above definition of regret (aka excess risk in the statistical literature) can be considered as a classical tool in the framework of Bayes-risk consistency (Devroye et al., 1997; Bartlett et al., 2006). However, let us emphasize that the theoretical analysis presented below differs from traditional techniques for investigating the consistency of machine learning algorithms. Typically, a training algorithm is considered consistent if its risk converges in probability to the Bayes risk as the training sample grows to infinity. Since many training algorithms optimize a convex and differentiable surrogate of the target loss of interest, such an analysis is often performed by bounding the risk of the target loss as a function of the surrogate φ-risk of the surrogate loss (e.g., Breiman, 2000; Steinwart, 2001; Zhang, 2004; Bartlett et al., 2006; Tewari and Bartlett, 2007; Duchi et al., 2010; Gao and Zhou, 2013). In this article, we start from a different perspective, since we are analyzing the Bayesoptimality of inference algorithms. As such, we call a given loss a surrogate loss for the F-measure if an inference algorithm optimizes this loss instead of the F-measure. This is different from, for example, the above papers, which analyse the consistency of surrogate losses during the training phase of a machine learning algorithm, using the surrogate loss as the internal training loss, i.e., as a convex and differentiable approximation of the target loss that is optimized during training. Furthermore, a second notable difference to other papers is that sample size convergence is less important in our analysis, as we are starting from a trained probabilistic model that is assumed to deliver consistent estimates. A similar analysis to the one presented here has been performed for the subset 0/1 loss and the Hamming loss in (Dembczy nski et al., 2012a). We will consider the regret of various types of loss functions and algorithms under any arbitrary probability distribution P. By searching for probability distributions that maximize the regret, we are mainly considering the worst-case scenario. In the case of surrogate losses, we restrict this search to probability distributions that deliver unique risk minimizers; the reasons for this restriction are of technical nature. Similar to the F-measure maximizer, let us introduce the risk minimizer of a loss L : {0, 1}m {0, 1}m R+ as h L = arg min h {0,1}m E [L(Y , h)] = arg min h {0,1}m y {0,1}m P(y) L(y, h). (4) This allows us to introduce the worst-case regret formally. Definition 2.2 Let L : {0, 1}m {0, 1}m R+ be a loss, let P be the set of all probability distributions over {0, 1}m, and let Pu L be the subset of P that delivers unique solutions to Waegeman et al. (4). Then, the worst-case regret is defined as sup P Pu L E F(Y , h F ) F(Y , h L) , (5) with h F and h L defined by (2) and (4), respectively. Note that, in the above definition, we restrict the worst-case analysis to probability distributions with a unique risk minimizer for L. Technically, the problem would otherwise become more difficult, as it would require the comparison of the F-measure maximizer with a set of risk minimizers HL instead of a unique minimizer h L. This could be done in different ways, for example, by looking at the most favorable case for L, leading to sup P P min h L HL E h F(Y , h F ) F(Y , h L) i , or the least favorable one, leading to sup P P max h L HL E h F(Y , h F ) F(Y , h L) i . To avoid a more or less arbitrary decision, we prefer to exclude these cases from the beginning. In any case, it is clear that the regret (5) provides a lower bound for any other definition of regret that maximizes over the entire set P of distributions. Furthermore, remark that non-uniqueness of the F-measure maximizer is unproblematic, since for the definition of the regret, only the value of the F-measure is important, not the maximizer itself. Using classical notions such as Fisher consistency (Wu et al., 2010), one can say that a sufficient condition for inconsistency is encountered if (5) does not evaluate to zero. However, since exact solutions or lower bounds will be derived, we are able to give much more precise information on the degree of incorrectness. 3. The F-Measure and Related Loss Functions Given the difficulty of maximizing the F-measure, we start our analysis by investigating a few related loss functions that have been used as surrogates in some multi-label classification and structured output prediction papers. We will analyze the Hamming loss, the subset zero-one loss and the Jaccard index. For the former two loss functions, we perform a regret analysis to show that optimizing these loss functions is not consistent if the F-measure is our performance metric of interest. For the Jaccard index, we derive a simple upper bound on the regret when optimizing the F-measure instead. 3.1 The Hamming Loss The Hamming loss can be considered as the most standard loss for multi-label classification problems (e.g., Schapire and Singer, 2000; Tsoumakas and Katakis, 2007; Hariharan et al., 2010). It is also widely used in many structured output prediction methods (e.g., Taskar et al., 2004; Daum e III et al., 2009; Finley and Joachims, 2008). Using the general notation On the Bayes-Optimality of F-Measure Maximizers that was introduced above, the Hamming loss simply corresponds to the error rate in binary classification, and it can be formally defined as follows:1 LH(y, h) = 1 i=1 Jyi = hi(x)K . (6) For the Hamming loss, the risk minimizer is h H = arg min h {0,1}m E [LH(Y , h)] = arg min h {0,1}m y {0,1}m P(y) LH(y, h). (7) This is obtained by h H(x) = (h H,1, . . . , h H,m), where h H,i(x) = arg max b {0,1} P(Yi = b) i {1, . . . , m}. (8) Thus, in order to optimize the Hamming loss, one should select the marginal modes of P. The following theorem presents our main result for the Hamming loss. Theorem 1 Let h H be a vector of predictions obtained by minimizing the Hamming loss, Then for m > 2 the worst-case regret is given by: sup P Pu LH E F(Y , h F ) F(Y , h H) = 0.5 , where the supremum is taken over all possible distributions P that result in a unique Hamming loss minimizer. In other words, the theorem indicates that optimizing the Hamming loss as a surrogate for the F-measure results in a prediction that is far from optimal. This claim will be further confirmed by experimental results in Sections 6 and 7. 3.2 The Subset Zero-One Loss The next multi-label loss function we analyze is the subset 0/1 loss, which generalizes the well-known 0/1 loss from the conventional to the multi-label setting: Ls(y, h) = Jy = h K (9) Admittedly, this loss function may appear overly stringent, especially in the case of many labels. Moreover, since making a mistake on a single label is punished as hardly as a mistake on all labels, it does not discriminate well between almost correct and completely wrong predictions. Still, a lot of existing frameworks for multi-label classification and structured output prediction optimize a convex upper bound on this loss in a direct or approximate manner. For example, conditional random fields optimize the log-loss as a surrogate for the subset zero-one loss (Lafferty et al., 2001), structured support vector machines consider the 1. We use Jc K to denote the indicator function, equal to 1 if predicate c holds and 0 otherwise. Waegeman et al. 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8 1.0 Figure 1: Plot of the worst-case regret for the subset zero-one loss (11) as a function of the number of labels m. structured hinge loss as a surrogate of the subset 0/1 loss when no margin/slack rescaling is performed (Tsochantaridis et al., 2005) and probabilistic classifier chains with logistic base classifiers optimize the log-loss approximately by means of pseudo-likelihood maximization (Dembczy nski et al., 2010, 2012b). Moreover, maximum a posteriori (MAP) estimation techniques in Bayesian statistics and graphical models are also known to minimize the subset 0/1 loss. As for any other 0/1 loss, the risk-minimizing prediction for (9) is simply given by the mode of the distribution: hs = arg max y {0,1}m P(y) (10) Thus, unlike the Hamming loss, looking at marginal probabilities does not suffice to minimize the subset 0/1 loss. When the independence assumption is violated, information about the joint distribution over labels is needed, similar as for the F-measure. Our interest in the subset 0/1 loss is primarily fueled by this connection. Summarized in the following theorem, we perform a similar type of regret analysis as for the Hamming loss. Theorem 2 Let hs be a vector of predictions obtained by minimizing the subset 0/1 loss, then for m > 2 the worst-case regret is given by: sup P Pu Ls E F(Y , h F ) F(Y , hs) = ( m 2 + 2m2)m (2m 1)(4 + m + m2) , (11) where the supremum is taken over all possible distributions P. Let us remark that the worst-case regret converges rapidly to one as a function of the number of labels m, as illustrated in Figure 1. Similar to the result for the Hamming loss, the above theorem confirms that using the subset zero-one loss as an alternative for the On the Bayes-Optimality of F-Measure Maximizers F-measure can potentially yield a high regret. Optimizing the subset zero-one loss might hence be not a valid alternative. Our experimental results in Sections 6 and 7 will indeed make clear that such an approach performs suboptimal for several data sets. 3.3 The Jaccard Index The F-measure was originally defined by set operators, as a measure for expressing the similarity of sets. In this literature, it is known as the dice coefficient (Dice, 1945). Another well-known measure for expressing similarity of sets is the Jaccard index. The two measures are very related, since both belong to a more general parametric family of similarity measures for sets (De Baets et al., 2009). The Jaccard index computes the ratio of intersection to union: J(y, h) = |{i | yi = 1 hi = 1, i = 1, . . . , m}| |{i | yi = 1 hi = 1, i = 1, . . . , m}| (12) Owing to a simple transformation, it can also be written as follows:2 J(y, h) = Pm i=1 yihi Pm i=1 yi + Pm i=1 hi Pm i=1 yihi (13) In recent years, the Jaccard index has gained popularity in the machine learning community. In the context of kernel methods, it is often used as an alternative to the linear kernel for binary feature vectors, such as fingerprints of molecules in cheminformatics and bioinformatics. In these application domains, one often speaks of the Tanimoto kernel (Swamidass et al., 2005). As a utility function the Jaccard index is often considered in multi-label classification. It remains an open question whether or not a closed-form solution for the risk minimizer of the Jaccard similarity exists, but the maximization is far from straightforward. Under the assumption of label independence, which allows one to transform many loss functions to a contingency table, Quevedo et al. (2012) have recently proposed an exact algorithm for maximizing the instance-wise Jaccard similarity, as well as other loss functions that can be computed from such a contingency table. However, without this assumption, one commonly believes that exact optimization is intractable (Chierichetti et al., 2010). Even though the F-measure and the Jaccard are monotonically related, it is not the case that the F-measure maximizer is necessarily also the Jaccard maximizer, because the summation in (4) in general breaks the monotonicity property. As a result, the analysis that we report for the Jaccard index differs from the one reported for the Hamming loss and the subset 0/1 loss. Given that the maximization of the Jaccard index is believed to be much harder, it does not make sense to use this measure as a surrogate for the F-measure during optimization. In contrast, one might think of maximizing the F-measure as a surrogate for the Jaccard index. The following result characterizes what we can lose with such a strategy. Theorem 3 Let h J and h F be vectors of predictions obtained by maximizing the Jaccard index and the F-measure, respectively. Let the utility of the F-measure maximizer be given 2. Similar as the F-measure, note that the denominator is 0 if yi = hi = 0 for all i = 1, . . . , m. In this case, the utility is 0 by definition. Waegeman et al. δ(P) = max h {0,1}m E [F(Y , h)] = max h {0,1}m X y {0,1}m P(y) F(y, h). The regret of the F-measure maximizer with respect to the Jaccard index is then upper bounded by E J(Y , h J) J(Y , h F ) 1 δ(P)/2 for all possible distributions P. Notwithstanding that the above upper bound on the regret remains rather loose, the observation is interesting because the upper bound on the regret decreases as a function of the utility of the F-measure maximizer δ(P). Due to this relationship, a high utility for the F-measure implies that optimizing this measure as a surrogate for the Jaccard similarity might be a reasonable thing to do. In other words, on data sets for which an F-measure maximizing algorithm gets good results, one can expect that this algorithm will also get good results in terms of the Jaccard index. 4. Existing Algorithms for F-Measure Maximization The previous section revealed that optimizing more conventional loss functions as surrogates for the F-measure might result in a poor predictive performance. In this section, we perform a similar type of analysis for more specialized algorithms that intend to solve (2). These algorithms make different types of assumptions to simplify the problem. First of all, the algorithms operate on a constrained hypothesis space, sometimes justified by theoretical arguments. Secondly, they only guarantee optimality for specific distributions P. 4.1 Algorithms Based on Label Independence By assuming independence of the random variables Y1, ..., Ym, optimization problem (2) can be substantially simplified. It has been shown independently in (Lewis, 1995) and (Jansche, 2007) that the optimal solution then always contains the labels with the highest marginal probabilities, or no labels at all. As a consequence, only a few hypotheses h (m+1 instead of 2m) need to be examined, and the computation of the expected F-measure can be performed in an efficient way. Theorem 4 [(Lewis, 1995)] Let Y1, Y2, . . . , Ym be independent Bernoulli variables with parameters p1, p2, . . . , pm respectively. Then, for all j, k {1, . . . , m}, h F,j = 1 and h F,k = 0 implies pj pk. On the Bayes-Optimality of F-Measure Maximizers In addition, Lewis (1995) showed that the expected F-measure can be approximated by the following expression under the assumption of independence:3 E [F(Y , h)] ( Qm i=1(1 pi), if h = 0m 2 Pm i=1 pihi Pm i=1 pi+Pm i=1 hi , if h = 0m This approximation is exact for h = 0m, while for h = 0m, an upper bound of the error can easily be determined (Lewis, 1995). However, Chai (2005), Jansche (2007) and Quevedo et al. (2012) have independently proposed exact procedures for computing the F-maximizer. To this end, independence is assumed and marginal probabilities p1, p2, . . . , pm serve as input for the algorithms. The method of Jansche runs in O(m4), while the other two approaches solve the same problem more efficiently in O(m3). As a starting point for explaining the three algorithms, notice that (2) can be solved via outer and inner maximization. Namely, (2) can be transformed into an inner maximization h(k) = arg max h Hk E [F(Y , h)] , (14) where Hk = {h {0, 1}m | Pm i=1 hi = k}, followed by an outer maximization h F = arg max h {h(0),...,h(m)} E [F(Y , h)] . (15) The outer maximization (15) can be done by simply checking all m + 1 possibilities. The main effort is then devoted to solving the inner maximization (14). According to Lewis theorem, to solve (14), one needs to check only one vector h for a given k, in which hi = 1 for the k labels with highest marginal probabilities pi. The remaining problem is the computation of the expected F-measure in (14). This expectation cannot be computed naively, as the sum is over exponentially many terms. But the F-measure is a function of integer counts that are bounded, so it can normally only assume a much smaller number of distinct values. It has been further shown that the expectation has a domain with a cardinality exponential in m; but since the cardinality of its range is polynomial in m, it can be computed in polynomial time. As a result, Jansche (2007) obtains an algorithm that is cubic in m for computing (14), resulting in an overall O(m4) time complexity. He also presents an approximate version of this procedure, reducing the complexity from cubic to quadratic. This approximation leads to an overall complexity of O(m3), but it does no longer guarantee optimality of the prediction. As a more efficient alternative, the procedure of Chai (2005) is based on ordering the labels according to the marginal probabilities. For h(k) Hk, thus hi = 1 for k labels with the greatest marginal probabilities, he derives the following expression: E h F(Y , h(k)) i = 2 i=1 (1 pi)I1(m) , 3. We use 0m and 1m to denote m-element vectors of all zeros or ones, respectively. Waegeman et al. where I1(m) is given by the following recurrence equations and boundary conditions: It(a) = It+1(a) + rt It+1(a + 1) + rt Jt+1(a + 1) Jt(a) = Jt+1(a) + rt Jt+1(a + 1) Ik+1(a) = 0 Jm+1(a) = a 1 with ri = pi/(1 pi). These equations suggest a dynamic programming algorithm of space O(m) and time O(m2) in computing the expected F-measure for given k. This yields an overall time complexity of O(m3). In a more recent follow-up paper, Ye et al. (2012) further improved the dynamic programming algorithm to an O(m2) complexity by additional sharing of internal representations. The old and the new version of the algorithm both rely on Lewis theorem and a factorization of the probability mass for constructing recurrence equations, hence leaving few hope for extending the algorithm to situations where label independence cannot be assumed. In another recent paper, Quevedo et al. (2012) propose a general inference procedure that utilizes similar recurrence equations and dynamic programming techniques. In contrast to Chai (2005), Jansche (2007) and Ye et al. (2012), the authors primarily address multi-label classification problems, focusing on a wide range of loss functions that can be computed from a contingency table in an instance-wise manner. As a result, the instance-wise F-measure is maximized as a special case, while assuming label independence. If the independence assumption is violated, none of the above methods is able to guarantee optimality. In the most general case, the F-maximizer needs to be computed by analyzing the joint distribution. The above methods rely on modeling or ordering marginal probabilities, which is not sufficient to compute the F-maximizer for many distributions. This is illustrated by the following example, in which two joint distributions with identical marginal probabilities have different F-measure maximizers: y P(y) 0001 0.1 0010 0.2 0100 0.2 1000 0.5 y P(y) 0000 0.5 1001 0.1 1010 0.2 1100 0.2 The non-specified configurations have zero probability mass. For both distributions, we have p1 = P(Y1 = 1) = 0.5, p2 = P(Y2 = 1) = 0.2, p3 = P(Y3 = 1) = 0.2, p4 = P(Y4 = 1) = 0.1, but one can easily check that the F-measure maximizers are h = (1000) and h = (0000), respectively. The regret is small for this simple example, but methods that assume independence may produce predictions being far away from the optimal one. The following result shows this concretely. Theorem 5 Let h I be a vector of predictions obtained by assuming label independence as defined in (3), then the worst-case regret is lower-bounded by: E F(Y , h F ) F(Y , h I) 2q 1, On the Bayes-Optimality of F-Measure Maximizers for all q [1/2, 1] satisfying Pm s=1 2m! (m s)!(s 1)!(m+s)qm s(1 q)s qm > 0 and the supremum taken over all possible distributions P. For increasing m, the condition is satisfied for q close to one (see the appendix for details). In such a scenario, the worst-case regret is lower bounded by Rq = 2q 1, so that limq 1,m Rq = 1. As a consequence, the lower bound becomes tight in the limit of m going to infinity, as summarized in the following corollary. Corollary 1 Let h I be a vector of predictions obtained by assuming independence, then the worst-case regret converges to 1 in the limit of m, i.e., lim m sup P E F(Y , h F ) F(Y , h I) = 1, where the supremum is taken over all possible distributions P. Again we will show by means of experiments in Sections 6 and 7 that algorithms based on label independence can be suboptimal on real-world data sets. 4.2 Algorithms Based on the Categorical Distribution Solving (2) becomes straightforward in the case of a specific distribution in which the probability mass is distributed over vectors y containing only a single positive label, i.e., Pm i=1 yi = 1, corresponding to the categorical distribution. This was studied by del Coz et al. (2009) in the setting of so-called non-deterministic classification. Theorem 6 [(del Coz et al., 2009)] Denote by y(i) a vector for which yi = 1 and all the other entries are zeros. Assume that P is a joint distribution such that P(Y = y(i)) = pi. The maximizer h of (2) consists of the k labels with the highest marginal probabilities, where k is the first integer for which k X j=1 pj (1 + k)pk+1; if there is no such integer, then h = 1m. The categorical distribution reflects the case of multi-class classification problems, as a special case of multi-label classification problems. The above approach is only applicable to such problems. 4.3 Algorithms that Use Both the Marginal and the Joint Distribution Since all the methods so far rely on the fact that the optimal solution contains ones for the labels with the highest marginal probabilities (or consists of a vector of zeros), one may expect that thresholding on the marginal probabilities (hi(θ) = 1 for pi θ, and hi(θ) = 0 otherwise) will provide a solution to (2) in general. Practically, despite using marginal probabilities for the thresholds, such a scenario does not assume label independence anymore, Waegeman et al. because also the joint probability distribution P must be provided.4 When the labels are ordered according to the marginal probabilities, thus pi pi+1 for all i {1, ..., m 1}, this approach resembles the following optimization problem: h T = arg max θ {1,p1,...,pm} E [F(Y , h(θ))] . Thus, to find an optimal threshold θ, access to the entire joint distribution is needed. However, this is not the main problem here, since in the next section, we will show that only a polynomial number of parameters of the joint distribution is needed. What is more interesting is the observation that the F-maximizer is in general not consistent with the order of marginal label probabilities. In fact, the regret can be substantial, as shown by the following result. Theorem 7 Let h T be a vector of predictions obtained by putting a threshold on sorted marginal probabilities, then the worst-case regret is lower bounded by E F(Y , h F ) F(Y , h T ) max 0, 1 where the supremum is taken over all possible distributions P. Finding the exact value of the supremum in the worst case is for the above formulation an interesting open question. The statement is a surprising result in light of the existence of many algorithms that rely on finding a threshold for maximizing the F-measure (Keerthi et al., 2007; Fan and Lin, 2007; Zhang et al., 2010; Lipton et al., 2014) remark that those methods rather seek for a threshold on scoring functions instead of marginal probabilities. While being justified by Theorems 4, 5 and 6 for specific applications, thresholding does not yield optimal predictions in general. Let us illustrate this with an example for which m = 12: y P(y) 000000000000 0.21 100000000000 0.39 011111100000 0.2 010000011111 0.2 The non-specified configurations have zero probability mass. The F-measure maximizer is given by (1000000000000); yet, not the first label but the second one exhibits the highest marginal probability. The regret remains rather low in this case, but higher values can be easily obtained by constructing more complicated examples from (41) see the appendix. 4. In fact most thresholding methods that optimize the F-measure during training do not use the joint distribution, and define a threshold based on marginals only. However, that is in practice the same as assuming independence, and resembles the same conclusions as in Section 4.1. In contrast, the regret will be lower when also the joint distribution is used to define the threshold. When the F-measure is optimized in an inference phase, starting from a trained probabilistic model, access to the joint distribution is of course needed. On the Bayes-Optimality of F-Measure Maximizers 5. An Exact Algorithm for F-Measure Maximization We now introduce an exact and efficient algorithm for computing the F-maximizer without using any additional assumption on the probability distribution P. While adopting the idea of decomposing the problem into an outer and an inner maximization, our algorithm differs in the way the inner maximization is solved.5 For convenience, let us introduce the following quantities: i=1 yi, ik = X 2P(y) sy + k. The first quantity gives the number of ones in the label vector y, while ik is a specific marginal value for i-th label, which for u = 1 corresponds to weighted true positives. Using these quantities, we show that only m2 + 1 parameters of the joint distribution P(y) are needed to compute the F-maximizer. Theorem 8 The solution of (2) can be computed by solely using P(y = 0m) and the values of ik, for i, k {1, . . . , m}, that constitute an m m matrix . Proof. The inner optimization problem (14) can be formulated as follows: h(k) = arg max h Hk E [F(Y , h)] = arg max h Hk y {0,1}m P(y)2 Pm i=1 yihi sy + k . The sums in arg max can be swapped, resulting in y {0,1}m P(y)2 Pm i=1 yihi sy + k = 2P(y) sy + k . Finally, we obtain h(k) = arg max h Hk i=1 hi ik . (16) As a result, one does not need the whole distribution to find the maximizer of the Fmeasure, but the values of ik, which can be given in the form of an m m matrix . For the special case of k = 0, we have h(k) = 0m and Ey P(y) [F(y, 0m)] = P(y = 0m). If the matrix is given, the solution of the F-measure maximization (2) is straightforward, since for each inner maximization the problem boils down to selecting the k labels with the highest ik. The resulting algorithm, referred to as General F-measure Maximizer (GFM), is summarized in Algorithm 1 and its time complexity is analyzed in the following theorem. Theorem 9 Algorithm 1 solves problem (2) in time O(m2) assuming that the matrix of m2 parameters and P(y = 0m) are given. 5. The description of the method slightly differs from the previous paper (Dembczy nski et al., 2011), and it is concordant with Dembczy nski et al. (2013). Waegeman et al. Algorithm 1 General F-measure Maximizer INPUT: matrix and probability P(y = 0m) for k = 1 to m do solve the inner optimization problem (14): h(k) = arg max h Hi by setting hi = 1 to k labels with the highest ik (in case of ties take any k top labels), and hi = 0 for the rest; store a value of E h F(Y , h(k)) i = i=1 h(k) i ik; end for define h(0) = 0m, and E [F(Y , 0m)] = P(y = 0m); solve the outer optimization problem (15): h F = arg max h {h(0),...,h(m)} E [F(Y , h)] ; return h F and E [F(Y , h F )]; Proof. To solve (16), it is enough to find the top k elements (i.e., the elements with the highest values) in the k-th column of matrix , which can be carried out in linear time (Cormen et al., 2001). This step has to be repeated for all k. Therefore, the overall complexity of the inner maximization is quadratic. The solution of the outer optimization problem (15) is then straight-forward and requires linear time. In light of combining the inference algorithm with particular training algorithms, like multinomial regression as we discuss it in Section 7.1.4, it could be reasonable to redefine the formulation in the following way. Consider the probabilities pis = P(yi = 1, sy = s), i, s {1, . . . , m} , (17) that constitute an m m matrix P.6 Let us also introduce an m m matrix W with elements wsk = 2 (s + k) , s, k {1, . . . , m} . (18) It can be easily shown that = PW, (19) 2P(y) sy + k = 2pis s + k. 6. We use capital letters to denote matrices. On the Bayes-Optimality of F-Measure Maximizers If the matrix P is taken as an input by the algorithm, then its complexity is dominated by the matrix multiplication (19) that is solved naively in O(m3), but faster algorithms working in O(m2.376) are known (Coppersmith and Winograd, 1990).7 Interestingly, the above results clearly suggest that the F-measure maximizer is more affected by the number of 1s in the y-vectors than by the interdependence between particular labels. In other words, modeling of pairwise or higher degree dependencies between labels is not necessary to obtain an optimal solution, but a proper estimation of marginal quantities ( ik, or pis) that take the number of co-occurring labels into account. In the reminder of this section, we discuss the properties of the GFM algorithm in comparison to the other algorithms discussed in Section 4. The methods presented by Chai (2005); Jansche (2007) and Ye et al. (2012) all assume label independence and produce exactly the same result, apart from small numerical instabilities that might always occur. These methods, contrary to GFM, will not deliver an exact F-maximizer if the assumption of independence is violated. On the other hand, the disadvantage of GFM is the quadratic number of parameters it requires as input, while the other methods only need m parameters. Since the estimation of a larger number of parameters is statistically more difficult, it is a priori unclear which method performs better in practice. We are facing here a common trade-offbetween an approximate method on better estimates (we need to estimate a smaller number of parameters from a given sample) and an exact method on potentially weaker estimates. Nonetheless, if the joint distribution is concentrated on a small number of different label combinations y, the estimates of or P can be as good as the estimates of the marginal probabilities pi. From the computational perspective, Jansche s method is characterized by a much higher time complexity, being respectively O(m4) and O(m3) for the exact and the approximate versions. The method of Chai has a cubic complexity, and the enhanced version presented in Ye et al. (2012) is more efficient, since it solves the problem in O(m2) time. The GFM algorithm is quite competitive, as its complexity is of O(m2) or O(m3), depending on the setting. Moreover, the cubic complexity of GFM, which follows from the matrix multiplication, can be further decreased if the number of distinct values of sy with non-zero probability mass is smaller than m. 6. Simulations In the previous sections, we gave theoretical results concerning the performance of different inference methods in the worst case scenarios. Here, we verify the methods empirically on synthetic data to check the difference in average performance on two large classes of distributions. The first class assumes independence of labels, while the second class uses a model with strong label dependencies. We test four inference methods optimal for different performance measures. The first one is suited for Hamming loss. It estimates the marginal probabilities by simple counting from a given sample and gives empirical marginal modes as output. We denote this method MM, since it estimates marginal modes. The second one is tailored for subset 0/1 loss. It 7. The complexity of the Coppersmith-Winograd algorithm (Coppersmith and Winograd, 1990) is more of theoretical significance, since practically this algorithm outperforms the na ıve method only for huge matrices. Waegeman et al. seeks for the joint mode by checking the frequencies of label combinations appearing in the sample. We refer to this method as JM, since it estimates the joint mode. The two remaining methods are suited for F-measure maximization. We use the dynamic programming method of Ye et al. (2012) that assumes label independence, denoted by FM, and the exact GFM method described in the previous section, which performs exact inference. All parameters required by these algorithms are also estimated from the sample by simple counting. We verify the performance of the inference methods by using Hamming loss, subset 0/1 loss, the Jaccard index, and the F-measure. We run these simulations, as well as the other experiments described later in this paper, on a Debian virtual machine with 8-core x64 processor and 5GB RAM. 6.1 Label Independence The independent data are generated according to: i=1 P(yi) , where the probabilities P(yi) are given by the logistic model: P(yi = 1) = 1 1 + exp( wi), where wi N(0, 3) . In experiments we set the number of labels to 25 and vary the number of observations using the following values {5, 10, 20, 30, 40, 50, 75, 100, 200, 500, 1000, 2000, 5000, 10000}. We repeat the experiment for 30 different models, i.e., sets of values wi. For each model, we use 50 different training sets of the same size, but to reduce the variance of the results we use for testing one set of 100,000 observations. The results are given in Figure 2. The right column presents the performance with respect to different measures as a function of the number of training observations. The left column gives the same results, but zoomed to the range from 5 to 100 training observations. We see from the plots that MM and JM get the best results for Hamming loss and subset 0/1 loss. Since the labels are independent, the marginal modes and joint mode are the same. Therefore, for large enough training samples, these two algorithms converge to the same prediction that should be optimal for Hamming and subset 0/1. However, JM converges much slower, since it directly estimates the joint mode, by checking the frequencies of label combinations in the training set. FM and GFM perform very similarly for each performance measure. Since the labels are independent, FM and GFM should converge to the same prediction, being optimal for the F-measure. GFM, however, may get slightly worse results for small sample sizes, since it needs to estimate a larger number of parameters than FM. We also see that algorithms maximizing the F-measure perform the best for Jaccard index. 6.2 Strong Label Dependence We perform a similar experiment for a data model with strong label dependencies. The data are generated by the chain rule of probability, i.e., i=1 P(yi | y1, . . . , yi 1), On the Bayes-Optimality of F-Measure Maximizers Hamming loss 0 20 40 60 80 100 0.17 0.18 0.19 0.20 0.21 0.22 0.23 0.2 # of observations Hamming loss MM JM FM GFM 0 2000 4000 6000 8000 10000 0.17 0.19 0.21 0.23 # of observations Hamming loss G G GGGGG G G G G G G GGGGGG G G G G G G GG GGGG G G G G G MM JM FM GFM subset 0/1 loss 0 20 40 60 80 100 0.990 0.992 0.994 0.996 0.998 # of observations subset 0/1 loss MM JM FM GFM 0 2000 4000 6000 8000 10000 0.990 0.992 0.994 0.996 0.998 # of observations subset 0/1 loss G G G G G G G G G G G G G G G G G G MM JM FM GFM 0 20 40 60 80 100 0.74 0.76 0.78 0.80 # of observations MM JM FM GFM 0 2000 4000 6000 8000 10000 0.74 0.76 0.78 0.80 # of observations GGG G G G G G G GGGGGG G G G G G G GGGGGG G G G G G MM JM FM GFM Jaccard index 0 20 40 60 80 100 0.60 0.62 0.64 0.66 0.68 0.70 # of observations Jaccard index MM JM FM GFM 0 2000 4000 6000 8000 10000 0.60 0.62 0.64 0.66 0.68 0.70 # of observations Jaccard index GGG G G G G G G GGGGGG G G G G G G GGGGGG G G G G G MM JM FM GFM Figure 2: Performance of inference methods in case of label independence as a function of the number of training observations. Left: the performance up to 100 training observations. Right: the performance up to 10000 training observations. The error bars show the standard error of the measured quantities. Waegeman et al. where the probabilities P(yi | y1, . . . , yi 1) are coming from a logistic model of the form: P(yi | y1, . . . , yi 1) = 1 1 + exp( Pi 1 j=1 2wij(yj 1 with all wij N(1, 3) and wi0 N(1, 3). This model tends to produce for a given label yi a value that appeared more often on previous labels. The results are presented in Figure 3. In this case, the marginal modes and the joint mode are not the same. Therefore MM performs the best for Hamming loss and JM for subset 0/1 loss. More importantly, we can see that FM performs suboptimally for F-measure, and a clear winner in this case is GFM. This result confirms our theoretical analysis and shows the benefits of the GFM inference method. Also GFM performs the best for the Jaccard index, followed by the JM. The FM method in this case performs the worst. Similarly to the F-measure we might expect here that methods that assume label independence will not get good results with respect to the Jaccard index. There is of course a price we have to pay for a good performance of GFM. Figure 4 presents running times of parameter estimation and inference of the algorithms as a function of the number of labels. GFM is the slowest method. The running times increase quadratically with the number of labels. The inference time of FM grows also quadratically, but with a lower rate. Moreover, this algorithm needs only to estimate marginal probabilities, therefore its estimation time is exactly the same as for the MM method. 7. Application to Multi-Label Classification Problems The inference methods for F-measure maximization can be used whenever an estimation of required parameters is possible. In this section, we focus on the application of the inference methods in the multi-label setting. Thus, we consider a task of predicting a vector y = (y1, y2, . . . , ym) {0, 1}m given another vector x = (x1, x2, . . . , xn) Rn as input attributes. To this end, we use a training set {(xi, yi)}N i=1 to estimate the required parameters and perform inference for a given test vector x so as to deliver an optimal prediction under the F-measure (1). Thus, we optimize the performance for each instance individually (instance-wise F-measure), in contrast to macroand micro-averaging of the F-measure (Yang, 1999; Tsoumakas et al., 2010). 7.1 Learning Algorithms The inference methods for F-measure maximization can be combined with many conventional learning approaches. In the following, we mainly focus on algorithms in which the final decision is made based on an empirical distribution, like in nearest-neighbors or decision trees. In these algorithms all parameters required by F-measure maximization methods are estimated from the empirical distribution. At the end of this section we also discuss another approach in which the parameters are obtained from a parametric model, for example, from logistic regression. We present the algorithms with the GFM method for F-measure maximization; however, it should be clear from the description how to obtain corresponding variants of the algorithms for the methods that assume label independence. On the Bayes-Optimality of F-Measure Maximizers Hamming loss 0 20 40 60 80 100 0.16 0.17 0.18 0.19 0.20 0.21 0.22 # of observations Hamming loss MM JM FM GFM 0 2000 4000 6000 8000 10000 0.16 0.17 0.18 0.19 0.20 0.21 0.22 # of observations Hamming loss G GGGGGG G G G G G GG G G G G G G G G G G G G G G G G G MM JM FM GFM subset 0/1 loss 0 20 40 60 80 100 0.70 0.75 0.80 # of observations subset 0/1 loss MM JM FM GFM 0 2000 4000 6000 8000 10000 0.70 0.75 0.80 # of observations subset 0/1 loss G G G G G G GGG G G G G G MM JM FM GFM 0 20 40 60 80 100 0.62 0.64 0.66 0.68 # of observations MM JM FM GFM 0 2000 4000 6000 8000 10000 0.62 0.64 0.66 0.68 # of observations G G G G G G G G G G G G MM JM FM GFM Jaccard index 0 20 40 60 80 100 0.53 0.54 0.55 0.56 0.57 0.58 0.59 # of observations Jaccard index MM JM FM GFM 0 2000 4000 6000 8000 10000 0.53 0.55 0.57 0.59 # of observations Jaccard index G G G G G G G G G G G G MM JM FM GFM Figure 3: Performance of inference methods in case of label dependence as a function of the number of training observations. Left: the performance up to 100 training observations. Right: the performance up to 10000 training observations. The error bars show the standard error of the measured quantities. Waegeman et al. Estimation and prediction time 0 50 100 150 200 250 0 50 100 150 200 250 300 # of labels estimation time [ms] G G G G G G G G G G G G G G G G G G G G G G G G MM JM FM GFM 0 50 100 150 200 250 0.0 0.5 1.0 1.5 # of labels inference time [ms] G G G G G G G G G G G G G G G G G MM JM FM GFM Figure 4: Running times in milliseconds of the inference methods for label dependent data as a function of the number of labels. Left plot shows estimation time of parameters required by the method. Right plot shows the inference time. The error bars show the standard error of the measured quantities. 7.1.1 Instance-Based Learning Several instance-based methods for multi-label classification have been proposed in the past (e.g., Zhang and Zhou, 2007; Cheng and H ullermeier, 2009), but none of these methods is tailored for optimizing the F-measure during an inference phase. Consider a query instance x X and let {yj, xj}l j=1 denote the l nearest neighbors of x with respect to a distance measure on X in the training set. The number l is a fixed parameter of the method. Instance-based learning can be extended for maximizing the F-measure in a straight-forward way, namely by replacing the distribution P(y) with the empirical distribution in the neighborhood of the query. Correspondingly, the values and P(y = 0m) are estimated through simple counting: syj + k , ˆP(y = 0m) = 1 j=1 Jyj = 0m K. By using these estimates in the GFM algorithm, we obtain an estimate of the F-measure maximizer. 7.1.2 Decision Trees Decision tree methods have been extensively studied in standard classification and regression settings, and have also been generalized to multi-output problems like MLC and multivariate regression (see e.g., Zhang, 1998; Lee, 2006). An adaptation of decision trees for maximizing the F-measure is slightly more complicated than for instance-based learning. The method we present here resembles the ideas used in predictive clustering trees (Vens et al., 2008). For simplicity, we only consider binary trees. Moreover, since decision tree induction is well-known in the machine learning field, we restrict our discussion to the main differences to conventional (classification or regression) tree learning. In a decision tree, each leaf node represents a (typically rectangular) part of the instance space X and is labeled with a local model. Typically, a local model consists of a single prediction, namely the prediction that On the Bayes-Optimality of F-Measure Maximizers minimizes the average loss among the associated training examples (e.g., the mean value in regression and the most frequent label in classification). Applying the same principle in the case of the F-measure in MLC comes down to computing the maximizer of this measure over the instances in a leaf node. This is similar to the case of instance-based learning, except that the examples used for estimating and P(y = 0m) originate from a rectangular region of the instance space X and not from the neighborhood of the query instance x. The more demanding part is the induction of the tree, i.e., finding optimal splits with respect to the F-measure. For a given node of the tree, we search over all attributes and possible split points, just like in regular decision tree algorithms. Let us denote by N a set of training examples {(yj, xj)}l j=1 in a node. The task is then to find a split of N into two subsets, NL and NR, that maximize some purity criterion. Our approach is analogous to the one of conventional decision trees, but based on the F-measure: #(N) F(NL) + #(NR) where #(A) is a cardinality of a set A, and F(A) = max h 1 #(A) yi A F(yi, h). In order to speed up computations of F( ), we notice that searching a split is usually performed in an example-by-example manner, which means that we can easily update our estimates of and P(y = 0m). Moreover, assuming that a given training example has a lower number of relevant labels, we do not have to recompute the whole matrix , but only update some of its rows and columns. Finally, the search for the top k elements in each column of can be made faster by checking local changes in the current rankings of labels. We repeat the above step recursively until a stop condition is reached, for example the F-measure becomes maximal or the number of examples in the leaf node falls below a threshold. Of course, more sophisticated approaches are conceivable. Let us also mention that it is possible to generalize bagging (Breiman, 1996) with these decision trees. Then, GFM can easily be applied over the bootstrap sample of the weak hypotheses returned by the trees. 7.1.3 Probabilistic Classifier Chains Probabilistic classifier chains (PCCs) (Dembczy nski et al., 2010) is an approach similar to maximum entropy Markov models (Mc Callum et al., 2000) and to conditional random fields (CRFs) (Lafferty et al., 2001; Ghamrawi and Mc Callum, 2005). All these approaches estimate the joint conditional distribution P(y | x). PCC has an additional advantage that one can easily sample from the estimated distribution. The underlying idea is to repeatedly apply the product rule of probability to the joint distribution of the labels: i=1 P(yi | x, y1, . . . , yi 1) (20) Learning in this framework can be considered as a simple procedure. According to (20), we decompose the joint distribution into a sequence of marginal distributions that depend Waegeman et al. on a subset of the labels. These marginal distributions can be learned by m functions fi : X {0, 1}i 1 [0, 1] on an augmented input space X {0, 1}i 1, taking y1, . . . , yi 1 as additional input attributes: fi : (x, y1, . . . , yi 1) 7 P(yi = 1 | x, y1, . . . , yi 1) (21) By plugging the log-linear model into (20), it can be shown that pairwise dependencies between labels yi and yj are modeled (see also (Kumar et al., 2013)). The algorithm is mainly suitable for subset 0/1 loss. Exploration of the structure of the chain in the inference phase boils down to search for the most probable label combination in a resulting probabilistic binary tree. A greedy algorithm follows only one path choosing always only the most probable label in each position in the chain (Read et al., 2009). This algorithm, however, may lead to suboptimal results. It has been shown (Dembczy nski et al., 2012b), however, that an exact method based on a variant of uniform-cost search with a cut-offlist finds the joint mode in a linear time of 1/pmax, where pmax is the probability of the joint mode. For reasonable values of pmax, this method works very fast. To optimize the response of PCC for other loss functions, we need to obtain a sample of observations from the conditional joint distribution P(y | x). To get a single observation, we can follow the chain and pick the value of label yi by tossing a biased coin with probabilities given by the i-th classifier. Such a procedure is sometimes referred to as ancestral sampling (Bishop, 2006, Chapter 8). From the sample of such observations, we can estimate all the parameters required by inference methods like GFM, similarly as in the case of nearest neighbors and decision trees. More precisely, let {yj}n j=1 denote a set of sampled observations for a given test example x. Then, the values and P(y = 0m) can be estimated through simple counting: syj + k , ˆP(y = 0m) = 1 j=1 Jyj = 0m K. By plugging these estimates into the GFM algorithm, we obtain for a given x a prediction optimized for the F-measure. 7.1.4 Parametric Models Alternatively to the approaches described above, which estimate the parameters required by the F-measure maximization methods on empirical distributions, we discuss here an approach in which the parameters are efficiently obtained from a parametric model (Dembczy nski et al., 2013). Unfortunately, there is no easy way to estimate directly the matrix , since the elements of this matrix do not correspond to a proper probability distribution. However, we can estimate matrix P, defined in (17), which elements are probabilities: pis = P(yi = 1, sy = s), i, s {1, . . . , m} . Multiplying the matrix P by a weight matrix W with elements (18) results in the estimate of , as shown in (19). To estimate the elements of P we can use a simple reduction to m multi-class probability estimation (i.e., multinomial regression) problems, each with at most m + 1 classes. We On the Bayes-Optimality of F-Measure Maximizers define one multinomial regression model for each row of matrix P. Let us observe that for t = (Jyi = 1K sy), t {0, . . . , m}, we have: X t {0,...,m} P(t | x) = 1 . Therefore, we can define the i-th multinomial regression problem as: fi : x 7 P(t | x), for t {0, . . . , m} . (22) In a similar way, we can estimate P(y = 0m | x) by performing an additional reduction to binary probability estimation with t = Jy = 0m K as an output variable: f0 : x 7 P(t | x) , and solving it via logistic regression. The decomposition of the original problem into independent multinomial regression tasks has computational advantages. Moreover, since the number of distinct values of sy is usually small, the number of classes in a single multinomial regression task is much smaller than m + 1; only in the worst case, we end up with a quadratic complexity in the number of labels m. Let us, however, remark that the elements of matrix P estimated across different tasks are not fully independent of each other (e.g., pim is the same for all i, since P(yi = 1, sy = m) = P(y = 1m)). Consequently, learning on a finite training set may lead to conflicting estimates that are not in agreement with any valid distribution. To avoid such conflicts, one may include additional constraints in the learning problem or calibrate the estimates afterwards. However, Dembczy nski et al. (2013) have shown that with the sample size growing to infinity this approach is statistically consistent. To summarize this approach we notice that learning of the probabilistic model has a time complexity that is at most quadratic in m. In the inference phase for a test instance x, we first get estimates of P(0m | x) and P from the probabilistic model, again in at most quadratic time. Then, we need to multiply matrices P and W to get . Finally, all the parameters are plugged into the GFM method. This approach has in the original paper (Dembczy nski et al., 2013) been referred to as Exact-F-measure-Plug-in classifier (EFP). Under the assumption of label independence, one can simplify this approach. Since we only need marginal probabilities pi, it is enough to reduce the problem to m binary probability estimation tasks that can be solved, for example, by logistic regression. Then, for each test instance x, we obtain a vector of marginal probabilities pi, to which one might apply, for example, the inference method of Ye et al. (2012). We refer to this approach as Label-independence F-measure-Plug-in classifier (LFP), similarly as in (Dembczy nski et al., 2013). 7.2 Experimental Results We test some of the algorithms described above on four commonly used multi-label benchmark data sets with known training and test sets. We take these data sets from the MU- Waegeman et al. LAN8 and Lib SVM9 repositories. Table 1 contains basic statistics of these data sets. We also relate the obtained results to the results of a variant of structured SVMs that moves the effort of maximizing the F-measure to the training phase. We run the experiments on the machine that was also used for the simulations described earlier, i.e., on a Debian virtual machine with 8-core x64 processor and 5GB RAM. data set #train #test m d Scene 1211 1196 6 294 Yeast 1500 917 14 103 Enron 1123 579 53 1001 Mediamill 30993 12914 101 120 Table 1: Data sets and their properties. The number of training and test observations is denoted by #train and #test, respectively, m is the number of labels, d is the number of features. 7.2.1 Instance-Based Learning We first present results of instance-based learning. We use a different number of nearest neighbors, l {10, 20, 50, 100}. For each test example, we seek for its nearest neighbors and apply different inference methods. We use exactly the same methods we applied in our experiments on synthetic data given in Section 6. The first method, MM, estimates the marginal modes, JM estimates the joint mode, FM approximates the F-measure by assuming label independence, and the introduced GFM performs exact inference for the F-measure. The nearest-neighbor search is performed by using the Weka (Hall et al., 2009) and Mulan (Tsoumakas et al., 2011) implementation of instance-based learning. The results are given in Table 2. Similarly as in Section 6, we report the performance in terms of Hamming loss, subset 0/1 loss, F-measure and Jaccard index. We can generally confirm our previous results on synthetic data: an inference method tailored for a given performance measure obtains the best results. This is clear for Hamming loss, for which MM has the smallest error throughout. JM performs the best for subset 0/1 loss on all data sets with some exceptions on Enron. Both methods tailored for F-measure maximization, FM and GFM, substantially outperform MM and JM on this performance criterion. FM seems to beat GFM on Enron, while the latter method produces better results on the other data sets. There are, however, no clear results for the Jaccard index. Let us underline that in the case of nearest neighbor methods, we are dealing with a specific trade-offbetween the size of the neighborhood and its volume. In general, increasing the sample size in the inference methods should improve the results. But in this case, by increasing the number of neighbors, we simultaneously increase the volume of the space that contains the neighbors. In other words, some of the neighbors can be far away from 8. This repository can be found at: http://mulan.sourceforge.net/datasets.html. 9. This repository can be found at: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ multilabel.html. On the Bayes-Optimality of F-Measure Maximizers Hamming Loss [%] Subset 0/1 Loss [%] F-measure [%] Jaccard [%] 10 20 50 100 10 20 50 100 10 20 50 100 10 20 50 100 MM 10.28 10.62 11.52 13.03 40.47 45.15 54.01 66.05 65.80 59.53 49.39 36.40 64.23 58.36 48.54 35.79 JM 11.19 10.99 11.97 12.28 36.54 36.29 39.13 40.05 68.26 68.73 65.89 64.97 67.06 67.47 64.63 63.71 FM 13.80 14.40 16.74 19.83 53.09 56.10 64.55 75.17 70.80 70.12 67.20 62.65 64.59 63.24 58.76 52.48 GFM 13.03 14.09 16.15 19.12 49.75 54.68 63.13 72.91 71.42 70.29 67.85 64.31 65.95 63.80 59.73 54.41 MM 20.72 20.00 20.02 20.38 81.57 81.03 82.77 85.71 62.88 61.93 60.31 58.26 52.31 51.49 49.74 47.47 JM 23.09 21.78 21.84 22.04 76.66 76.23 77.54 78.30 59.14 60.53 60.97 60.68 49.75 51.06 51.20 50.72 FM 22.94 23.26 22.83 23.21 83.21 83.21 85.93 88.11 65.29 65.06 65.23 64.85 54.14 53.74 53.71 53.10 GFM 22.94 23.17 22.87 23.61 82.99 83.97 86.37 88.66 65.49 65.47 65.75 64.98 54.31 54.09 54.11 53.08 MM 5.73 5.94 6.28 6.46 88.08 88.60 89.46 89.64 35.56 27.91 23.23 23.71 28.70 22.97 19.36 19.60 JM 6.56 6.51 6.70 6.73 86.18 87.39 89.29 89.46 33.38 29.52 25.52 24.68 27.68 24.51 21.05 20.38 FM 6.51 6.23 6.34 6.48 87.56 88.60 87.22 88.26 47.50 47.01 42.10 37.86 37.31 36.82 33.65 30.17 GFM 6.61 6.28 6.54 6.63 88.08 86.87 88.43 89.29 44.43 42.45 34.76 29.27 35.12 34.10 27.93 23.70 MM 3.29 3.18 3.16 3.19 89.17 89.66 90.30 91.13 54.98 53.92 52.68 51.81 43.32 43.34 43.10 42.74 JM 4.17 4.03 3.93 3.91 88.93 88.82 89.24 89.05 48.87 48.58 47.32 47.07 43.55 43.66 43.30 42.80 FM 3.97 3.81 3.72 3.68 91.30 91.99 92.64 93.39 55.47 55.65 55.51 55.24 43.60 42.64 41.49 40.58 GFM 3.87 3.74 3.66 3.65 90.48 91.15 91.99 92.81 55.52 55.80 55.54 55.16 38.56 38.40 37.39 37.28 Table 2: Empirical results on 4 benchmark data sets. Instance-based methods are used with a different number of neighbors (l {10, 20, 50, 100}) and with different inference methods: MM estimates marginal modes, JM the joint mode, FM approximates the F-measure maximizer by assuming independence of labels, and GFM computes the exact F-measure maximizer over the nearest neighbors. Results are reported for Hamming loss, subset 0/1 loss, F-measure and Jaccard index. The best results for a given number of neighbors and a performance measure are marked in bold. Waegeman et al. Inference time [s] Inference time [s] 10 20 50 100 10 20 50 100 Scene Yeast MM 7.622 8.110 9.078 10.064 MM 3.662 3.907 4.352 4.793 JM 7.519 8.052 9.093 10.010 JM 3.687 3.958 4.362 4.812 FM 7.488 8.023 9.106 10.126 FM 3.728 3.920 4.404 4.788 GFM 8.311 8.171 9.169 10.475 GFM 3.634 3.944 4.467 4.875 Enron Mediamill MM 1.601 1.618 1.978 2.172 MM 336.550 374.207 435.171 499.977 JM 1.477 1.607 1.929 2.236 JM 338.855 373.297 431.354 492.079 FM 1.483 1.645 1.931 2.145 FM 339.704 375.230 433.660 493.424 GFM 1.518 1.681 2.022 2.228 GFM 347.402 382.663 442.356 502.060 Table 3: Computation time in seconds for the instance-based methods for different l = {10, 20, 50, 100} and inference methods: MM, JM, FM, and GFM. Computation time includes searching time and inference time the test example, which usually deteriorates the quality of the estimates. This is usually the case for high-dimensional problems. This may partially explain the results on the Enron data set. The performance under the F-measure decreases substantially with the number of neighbors. Also when comparing the instance-based methods with other methods, presented later in this section, we see that the overall performance on this data set is much worse for the former methods. In Table 3 we present the computation time of the instance-based methods. The computation time includes both the search of nearest-neighbors and the inference. We can easily observe that the nearest-neighbor search dominates inference in terms of computational complexity, since there is only a small difference in computational times between the inference methods for a given data set and l. We do not present the inference times separately here, since their characteristics are exactly the same as presented in Section 6. 7.2.2 Probabilistic Classifier Chains In the next experiments, we use PCC. We train PCC by using linear regularized logistic regression. We use the implementation of logistic regression from Mallet (Mc Callum, 2002). We tune the regularization parameter for each base classifier independently by minimizing the logistic loss, hoping to thereby produce better probability estimates. We use 5-fold cross-validation and choose the regularization parameter from the following set of possible values {10 4, 10 3, . . . , 103}. Similarly as in the previous experiments, we use four different inference mechanisms. For MM, FM, and GFM methods, we obtain the estimates of the required parameters by performing ancestral sampling from the conditional joint distribution of each test example x. The JM method, instead of estimating the joint mode from the sample, applies the efficient exact search method (Dembczy nski et al., 2012b). Table 4 contains the results of the experiment. As before, we report the Hamming loss, subset 0/1 loss, F-measure and the Jaccard index. We also give the training and inference times. The training time concerns the entire procedure that consists of tuning On the Bayes-Optimality of F-Measure Maximizers of the regularization parameter in cross-validation and training of a model with the best value of the regularization parameter. The results for MM, FM, and GFM are given for sample size of 1000. From the results, we can clearly state that approaches tailored for the F-measure obtain better results on this performance criterion. It seems that GFM obtains slightly better results, but also needs a little bit more time. In Figure 5, we additionally present the F-measure and inference times of FM and GFM as a function of the number of observations obtained from ancestral sampling. These results are computed over 5 runs of the inference methods to decrease the impact of the randomness of the sampling method. The plots confirm our theoretical results concerning the predictive performance and time complexity. However, the inference times reported here include all three steps: sampling, estimation of parameters, and inference based on these parameters. As we can see in Table 4 and Figure 5 the differences between GFM, FM, and MM are not substantial here, since sampling is the most expensive step. The exact method used for joint mode estimation works much faster than the other methods based on sampling. From the other results in Table 4 we can also observe that MM is the best for Hamming loss, and JM for the subset 0/1 loss. The results for the Jaccard index show that maximization of the F-measure can be used as a proxy for this performance criterion, at least FM and GFM perform better than MM. 7.2.3 Parametric Models To complete the picture we also present the experimental results of parametric models that have been previously published in (Dembczy nski et al., 2013). We compare EFP and its simplified variant LFP. We also include to the comparison the binary relevance (BR) approach that learns and predicts for each label independently. Such a model should perform well under the Hamming loss. All the methods use linear models and are trained, similarly as PCC, by logistic regression. The regularization parameter and its tuning is exactly the same as in the experiment with PCC. The results are summarized in Table 5. We show the results for Hamming loss, Fmeasure and report training and inference times. Not surprisingly, BR achieves the best results for Hamming loss, but it is outperformed by all the other methods on the F-measure. EFP is the best method in this regard. BR is the most efficient in inference. Nevertheless, the inference times of LFP and EFP are quite comparable to those of BR, despite their quadratic (for LFP) and cubic (for EFP) complexity. Admittedly, however, the data sets used in the experiments only contain a small to moderate number of labels (up to 100). For data sets with thousands of labels, the difference is likely to become substantially larger. The training of BR and LFP (these are exactly the same procedures) is the most effective. Training of EFP leads to m multinomial regression models. One should note, however, that the number of classes in each multinomial regression models can be much less than the highest possible value m+1. Therefore, the training of EFP is still quite effective and takes only a few times longer than the training of LFP. The training time includes here also the tuning time, similarly as in the case of PCC. Waegeman et al. Hamming Subset 0/1 F-Measure [%] Jaccard [%] Training Inference Loss [%] Loss [%] Time [s] Time [s] MM 9.89 42.69 62.73 61.37 39 1.839 JM 10.40 34.87 70.93 69.48 39 0.255 FM 12.83 50.44 72.78 66.71 39 1.858 GFM 12.74 49.86 72.78 66.89 39 1.927 MM 19.56 81.98 61.78 51.32 32 7.120 JM 20.91 76.34 63.36 53.56 32 0.220 FM 22.31 84.22 65.53 54.29 32 7.161 GFM 22.54 84.73 65.63 54.32 32 7.513 MM 4.63 86.88 52.62 42.51 81 120.636 JM 4.81 82.56 55.67 45.80 81 1.974 FM 5.59 90.33 58.54 46.19 81 121.172 GFM 5.53 89.12 59.08 46.89 81 121.700 MM 3.18 92.44 51.21 39.60 6150 2226.455 JM 3.57 90.02 44.99 35.62 6150 28.856 FM 3.62 94.81 55.39 42.57 6150 2230.661 GFM 3.61 94.51 55.18 42.44 6150 2293.945 Table 4: Empirical results on 4 benchmark data sets. PCC is used with different inference methods: MM - estimates marginal modes, JM - the joint mode, FM - approximates the F-measure maximizer by assuming independence of labels, and GFM - estimates the exact F-measure maximizer over the conditional distribution obtained from the model. For MM, FM, and GFM, we sample 1000 observations from the conditional joint distribution for each test example. Results are reported for Hamming loss, subset 0/1 loss, F-measure and Jaccard distance. The best results for a performance measure are marked in bold. On the Bayes-Optimality of F-Measure Maximizers 0 200 400 600 800 1000 0.69 0.70 0.71 0.72 0.73 # of simulations 0 200 400 600 800 1000 0.5 1.0 1.5 2.0 # of observations Inference time [s] 0 200 400 600 800 1000 0.61 0.62 0.63 0.64 0.65 0.66 # of observations 0 200 400 600 800 1000 # of observations Inference time [s] 0 200 400 600 800 1000 0.565 0.575 0.585 0.595 # of observations 0 200 400 600 800 1000 0 20 40 60 80 100 120 # of observations Inference time [s] 0 200 400 600 800 1000 0.50 0.51 0.52 0.53 0.54 0.55 # of observations 0 200 400 600 800 1000 0 500 1000 1500 2000 # of observations Inference time [s] Figure 5: Performance of FM and GFM inference methods used in PCC with different number of observations obtained from ancestral sampling. The results are averaged over 5 runs of the inference methods to eliminate the randomness of sampling. Left plots show F-measure, while right plots inference times. The inference time includes sampling, estimation of parameters, and inference based on these parameters. The error bars show the standard error of the measured quantities. Waegeman et al. Hamming Loss [%] F-Measure [%] Training Time [s] Inference Time [s] BR 10.51 55.73 29 0.241 LFP 12.18 74.38 29 0.270 EFP 12.22 74.44 72 0.399 BR 20.03 60.59 26 0.128 LFP 22.24 65.02 26 0.146 EFP 22.82 65.47 101 0.367 BR 4.54 55.49 52 1.016 LFP 6.09 56.86 52 1.519 EFP 5.34 61.04 214 2.628 BR 3.19 51.21 3238 13 LFP 3.67 55.15 3238 20 EFP 3.63 55.16 24620 30 Table 5: Empirical results on 4 benchmark data sets of parametric models: BR, LFP, and EFP. We report the Hamming loss, F-measure and training and inference times in seconds. The best results for a performance measure are marked in bold. 7.2.4 Comparison with Structured Support Vector Machines In this subsection we gather results of all F-measure maximization methods presented so far and compare them with the results of a variant of structured SVMs (Tsochantaridis et al., 2005) in which the effort of maximizing the F-measure is moved to the training phase. Two methods of that kind, referred to as RML and SML, have been introduced by Petterson and Caetano (2010, 2011). We present here only the results of RML, previously published in (Dembczy nski et al., 2013).10 Basically, this method trains one model for each label, but in a way that the margin, appropriately rescaled by the F-measure, is maximized jointly over all labels. RML uses a variant of the cutting-plane algorithm for optimization. So, in each iteration a most violating constraint for each training example is generated. This step has quadratic complexity in terms of the number of labels. In the prediction phase the models are independently applied to corresponding labels. Surprisingly, this method produces usually better results than the more complex SML method (Petterson and Caetano, 2011), which additionally models pairwise label dependencies.11 In the experiment the RML method uses a linear model. The regularization parameter is tuned in 5-fold crossvalidation using a range of values corresponding to the one used for PCC, EFP, and LFP. The maximal number of iterations in the cutting-plane algorithm has been set to 1000. 10. The results were obtained by using the software available at http://users.cecs.anu.edu.au/ ~jpetterson/. 11. The comparison of these two methods is given in (Petterson and Caetano, 2011) and Dembczy nski et al. (2013). On the Bayes-Optimality of F-Measure Maximizers Table 6 present the results for F-measure, training and inference times. For instancebased methods we report the variant with the number of neighbors which achieves the best results for a given data set. Similarly for PCC, we take the number of sampled observations which leads to the best performance. Observing the results we can say that in general the methods that maximize the F-measure in the inference phase outperform the structured SVM approach. RML is only competitive on the Scene data set, on which it wins against instance-based methods and PCC, but loses from EFP and LFP. On Yeast and Mediamill it achieves the worst performance. On the latter data set the difference is the most substantial. On Enron it wins only against the instance-based method. As we already pointed out, on this data set all variants of nearest neighbors perform weakly, probably because of the high-dimensional feature space. The comparison of the running times between RML and the other methods should be interpreted with caution, due to the use of different programming languages (RML is implemented in C++, while the other algorithms in Java) and differences in the implementations (different data structures). Therefore, the evaluation times may not be fully comparable. For example, the inference times for BR (Table 5) and RML should basically be very similar, as in both cases there is a single linear model for each label. Yet, the implementation of RML is much more efficient. Nevertheless, we are still able to derive several important conclusions. Not surprisingly RML is the most efficient in inference. However, the cutting-plane algorithm and the constraint generation step therein slow down significantly the training of RML. This also makes tuning very costly. For PCC, EFP and LFP, the tuning can be performed independently for each base classifier. In that way we hope to obtain a good probabilistic model and there is no need to perform neither a costly training nor inference. Unfortunately, this is not the case of RML. For example, tuning on the Mediamill data set has not finished in reasonable time. The F-measure result reported in the table is the best one among those obtained on the test set for different values of the regularization parameter. Each of the methods maximizing the F-measure in the inference phase has its advantages and disadvantages. By comparing the results, we see that parametric models get the best results on Scene and Enron followed by PCC. On Yeast and Mediamill all the methods perform very similarly, but the best is the instance-based learning here. Learning of EFP is the most costly. PCC and LFP train a single model for each label, but in the case of the former algorithm the feature space is enhanced by the preceding labels, therefore, the training time is longer for this procedure. We do not report training time for instancebased methods, as in the simplest case there is no learning in this kind of methods. In general, however, one should consider the time needed for tuning the number of neighbors and optionally learning the metric, which we have not performed here. From the inference point of view, LFP seems to be the most efficient. As we already discussed in the previous subsection, EFP is still quite competitive in comparison with LFP, but for data sets with a large number of labels this difference shall be more substantial. Instance-based methods are also very efficient for data sets with a small number of training examples. Because of the sampling procedure applied for each testing examples, PCC is the most time demanding procedure. However, we can see from Figure 5 that the good performance with respect to the F-measure can be obtained with a smaller number of sampled observations. In many Waegeman et al. applications a set of 100 observations should be sufficient, resulting in a sample generation that is approximately 10 times faster. The main advantage of PCC is that once we have a trained model we can apply inference for many different performance measures without any additional training. F-Measure [%] Training Time [s] Inference Time [s] IB FM (l = 10) 70.80 - 3.746 IB GFM (l = 10) 71.42 - 3.749 PCC FM (n = 1000) 72.78 39 1.858 PCC GFM (n = 1000) 72.77 39 1.927 LFP 74.38 29 0.270 EFP 74.44 72 0.399 RML 73.92 73 0.118 IB FM (l = 10) 65.29 - 1.518 IB GFM (l = 50) 65.75 - 1.795 PCC FM (n = 1000) 65.53 39 7.161 PCC GFM (n = 1000) 65.63 39 7.513 LFP 65.02 26 0.146 EFP 65.47 101 0.367 RML 64.78 206 0.056 IB FM (l = 10) 47.50 - 0.787 IB GFM (l = 10) 44.43 - 0.810 PCC FM (n = 500) 58.75 81 75.677 PCC GFM (n = 500) 59.19 81 76.056 LFP 56.86 52 1.519 EFP 61.04 214 2.628 RML 57.69 3897 0.143 IB FM (l = 20) 55.65 - 164 IB GFM (l = 20) 55.80 - 167 PCC FM (n = 1000) 55.39 6150 2230 PCC GFM (n = 1000) 55.18 6150 2293 LFP 55.15 3238 20 EFP 55.16 24620 30 RML 49.35 - 7 Table 6: Comparison of RML, a variant of structured SVMs for F-measure maximization, with other F-measure maximization methods given in the previous tables: instance-based methods (IB), PCC, LFP and EFP. We present here the best variant of instance-based methods and PCC for a given data set. The number l of nearest neighbors for the best variant of IB is given in parentheses. Similarly the number of sampled observations in PCC is also given in parentheses. The F-measure, training and inferences time in seconds are reported. The best results are marked in bold. On the Bayes-Optimality of F-Measure Maximizers 7.3 Results in the JRS 2012 Data Mining Competition We used the algorithms maximizing the F-measure in the inference phase in our solution for the JRS 2012 Data Mining Competition (Janusz et al., 2012).12 This competition considered topical classification of bio-medical articles. In essence, it consisted of a multi-label learning problem, where the objective was to optimize the instance-based F-measure. We decided to participate in this competition to showcase the practical relevance of our theoretical findings regarding the F-measure maximization. Similar to many of our competitors, our final predictions in the competition were produced by a blend of several methods, and they achieved a very satisfactory result, namely the second place in the competition with more than 100 participants. In this paragraph, we briefly explain the methodology that led to this result. Our solution was mainly based on PCC with FM and GFM inference methods and the LFP algorithm. The methods were tuned and run in a similar way as described in the previous experiments (with small differences: we used 10-fold cross validation, and considered a wider range for the regularization parameter, namely {10 5, 10 4, . . . , 105}). At this time the EFP algorithm was not yet developed. We used neither instance-based nor decision tree methods. Our preprocessing on the competition data was quite straightforward. We simply deleted all the empty columns (i.e., zero vectors) in the training data, then the corresponding columns in the test data. The values of features were normalized to [0, 1]. The results of the methods are presented in Table 7. The F-measure is computed over the entire test set delivered by the organizers after the competition. This is a minor difference in comparison to the competition results, which are computed over 90% of test examples. The remaining 10% of test examples constitute a validation set that served for computing the scores for the leader board during the competition. The results of PCC we show for different sizes of samples generated from the conditional joint distribution of a given test example. In the last row in the table, we also give a result of the final method we used in the competition. It relies on averaging over all predictions we computed during the competition. These predictions were results of different parameterization of PCC and BR. In total we gathered 16 predictions that we aggregated via voting. In this voting procedure, we tested different thresholds on the validation set and selected the best one. The solution is described in more detail in (Cheng et al., 2012). From the results we can see that there is no big difference among the methods. The voting procedure improves only slightly over LFP and PCC with the GFM inference. The results of these methods would be enough to obtain at least the third place in the competition. It shows that a quite simple model, without any blending, but with an appropriate inference method suited for a given performance measure is enough for solving complex tasks. Interestingly, LFP performs here better than PCC with GFM, which suggests independence of the labels. However, one can also observe that PCC with FM loses against other methods. This may suggest that PCC with the sampling procedure has problems with accurate estimation of marginal probabilities. Increasing the sample size improves the results (for both, FM and GFM), but it still seems that LFP is the most appropriate method in this case. It is the most cheapest one, since it does not require additional sampling in the inference step as PCC does, and gives results only slightly worse than the voting method 12. More info can be found at: http://tunedit.org/challenge/JRS12Contest. Waegeman et al. Method F-measure PCC FM (n = 50) 0.48650 PCC FM (n = 200) 0.51979 PCC FM (n = 1000) 0.52995 PCC GFM (n = 50) 0.52286 PCC GFM (n = 200) 0.53005 PCC GFM (n = 1000) 0.53146 LFP 0.53279 Voting (final submission) 0.53327 Table 7: The results on the JRS 2012 Competition data set. The number n in parentheses denotes the number of sampled observations in PCC. that averages over many predictions. As we already said before, there is no clear answer which of the two inference methods, GFM or FM, will get better results on a given data set. GFM provides an exact solution, but needs to estimate more parameters, so FM may get better results, particularly in the case of no or weak label dependencies. 8. Discussion In contrast to other performance measures commonly used in experimental studies, such as error rate, squared loss, and AUC, the F-measure has been investigated less thoroughly from a theoretical point of view, and only few papers have been devoted to that kind of analysis so far (e.g. Lewis (1995); Chai (2005); Jansche (2007); Dembczy nski et al. (2011); Ye et al. (2012); Zhao et al. (2013)). In this paper, we analyzed the problem of optimal predictive inference from the joint distribution under the F-measure. While partial results were already known from the literature, we completed the picture by presenting the solution for the general case without any distributional assumptions and by analyzing the relations between F and other performance measures. Our GFM algorithm requires only a polynomial number of parameters of the joint distribution and delivers the exact solution in polynomial time. From a theoretical perspective, GFM should be preferred to existing approaches, which typically perform threshold maximization on marginal probabilities, often relying on the assumption of (conditional) independence of labels. Focusing on optimizing the instance-wise F-measure, empirical results on synthetic and real-world multi-label data sets show a competitive performance for our approach. The algorithms discussed here optimize the F-measure in the inference phase. Alternatively, one can move the effort of maximizing the F-measure to the training phase, as in structured SVMs (Tsochantaridis et al., 2005), SEARN (Daum e III et al., 2009), or in a specific variant of CRFs (Suzuki et al., 2006). These algorithms, however, are usually based on additional assumptions, and their original formulation does not directly concern multi-label problems. In the experiments, we performed a comparison to the adaptation of structured SVMs to the multi-label setting introduced by Petterson and Caetano (2010, On the Bayes-Optimality of F-Measure Maximizers 2011). This algorithm also maximizes the F-measure, but produces worse results than the approaches based on the GFM inference. However, its prediction time is much faster, giving an interesting alternative in time-critical applications. Let us also mention that the GFM algorithm can be easily tailored for maximizing the instance-wise F-measure in structured output prediction problems. If the structured output classifier is able to model the joint distribution, from which we can easily sample observations, then the use of the algorithm is straight-forward. An application of this kind is planned as future work. The GFM algorithm could also be considered for maximizing the macro F-measure, for example, in a similar setting as in (Zhang et al., 2010), where a specific Bayesian on-line model is used. In order to maximize the macro F-measure, the authors sample from the graphical model to find an optimal threshold. The GFM algorithm may solve this problem optimally, since, as stated by the authors, the independence of labels is lost after integrating out the model parameters. Theoretically, one may also consider a direct maximization of the micro F-measure with GFM, but the computational burden is rather high in this case. We would also like to emphasize that maximization of instance-based F-measure leads to suboptimal results for the micro F-measure. Despite being related to each other, these two measures coincide only in a specific case when Pm i=1(yi+hi) is constant for all test examples. The discrepancy between these measures strongly depends on the nature of the data and the classifier used. For high variability in Pm i=1(yi + hi), a significant difference between the values of these two measures is to be expected. Surprisingly, experimental results are quite often reported in terms of micro F-measure, although the algorithms maximize the instance-wise F-measure on the training set. The use of the GFM algorithm in binary classification seems to be superfluous, since in this case, the assumption of label independence is rather reasonable. The algorithm of Ye et al. (2012) seems to be an interesting alternative for probabilistic classifiers. Thresholding methods (Keerthi et al., 2007; Ye et al., 2012; Zhao et al., 2013) or learning algorithms optimizing the F-measure directly (Musicant et al., 2003; Joachims, 2005; Jansche, 2005; Ye et al., 2012) are also appropriate solutions here. Acknowledgments We thank Volkmar Welker for carefully checking the proofs of some theorems. Willem Waegeman was for this work supported as a postdoctoral fellow by the Research Foundation of Flanders (FWO Vlaanderen). Krzysztof Dembczy nski and Arkadiusz Jachnik were supported by the Foundation of Polish Science under the Homing Plus programme, co-financed by the European Regional Development Fund. Weiwei Cheng and Eyke H ullermeier were supported by the German Research Foundation. Waegeman et al. Appendix A. Proofs of Theorems Theorem 1 Let h H be a vector of predictions obtained by minimizing the Hamming loss, Then for m > 2 the worst-case regret is given by: sup P Pu LH E F(Y , h F ) F(Y , h H) = 0.5 , where the supremum is taken over all possible distributions P that result in a unique Hamming loss minimizer. Proof. For a fixed Hamming loss minimizer h H it follows from (8) that any probability distribution P Pu LH should satisfy the following constraint for all i {1, ..., m}: X y {0,1}m:yi =h H,i P(y) 0.5 ϵ with ϵ > 0. Practically, we will choose ϵ arbitrarily close to zero, implying that its contribution vanishes in the limit, but this construction allows to rewrite the constraint in traditional mathematical programming form. Let us also define ηy(h, h H) = F(y, h) F(y, h H) for all y {0, 1}m. Finding the supremum over all probability distributions then becomes equivalent to solving the following mixed integer nonlinear program: max h,h H,P Pu LH y {0,1}m ηy(h, h H)P(y) (23) P y {0,1}m P(y) = 1 , i {1, ..., m} : P y {0,1}m:yi =h H,i P(y) 0.5 ϵ , y {0, 1}m : 0 P(y) 1 , h, h H {0, 1}m By definition, the solution h for which the maximum is obtained corresponds to the Fmeasure maximizer in (2). To reduce the number of integer variables in the optimization problem, we introduce the following equivalence classes for the indices of labels: A = {i {1, ..., m} : hi = 1 h H,i = 0} , B = {i {1, ..., m} : hi = 0 h H,i = 1} , C = {i {1, ..., m} : hi = 1 h H,i = 1} , D = {i {1, ..., m} : hi = 0 h H,i = 0} . We also adopt the shorthand notation a = |A|, b = |B|, c = |C|, d = |D| and i=1 yi , s A y = X i A yi , s B y = X i B yi , s C y = X i C yi , s D y = X i D yi . (24) On the Bayes-Optimality of F-Measure Maximizers The coefficients in (23) can then be simplified to: ηy(h, h H) = ηy(a, b, c, d) = 2s A y (sy + b + c) 2s B y (sy + a + c) + 2s C y (b a) (sy + a + c)(sy + b + c) . (25) As a consequence, only four integer variables remain present in the optimization problem, which is for simplicity converted to a minimization problem in standard mixed-integer linear program form: min a,b,c,d,P X y {0,1}m ηy(a, b, c, d)P(y) P y {0,1}m P(y) = 1 , i {1, ..., m} : P y {0,1}m:yi =h H,i P(y) 0.5 ϵ , y {0, 1}m : 0 P(y) 1 , a + b + c + d = m , a, b, c, d N . This new optimization problem is a relaxation of (23), since the F-maximizer of the probability distribution found as solution will not necessarily comply with the definition of the sets A, B, C and D. However, this will not cause any trouble, because the oracle solution that is derived below will obey this additional constraint. One arrives at a standard linear program formulation by keeping the four integer variables fixed. As the key element of our proof, we show that for every allowed value of (a, b, c, d) a solution of the linear program is given by the following probability distribution: 0.5 ϵ if y = y A 0.5 (2d 1)ϵ if y = y BCD 2ϵ if y ΩD m 0 otherwise where y A is defined as a vector containing ones at positions i A and zeros at all other positions. Similarly, y BCD contains zeros at positions i A and ones at all other positions: y A i = 1 + y BCD i = 1, i A , y A i = 1 y BCD i = 0, i B C D . The set ΩD m is defined as: ΩD m = {y {0, 1}m | X i A yi = 0 X i B C yi = b + c X i D yi = d 1} , so this set contains d vectors, which differ only in one position with y BCD. We verify the Karush-Kuhn-Tucker (KKT) conditions to prove that the above probability distribution yields the optimum of the linear program for every (a, b, c, d). For Waegeman et al. linear programs, which represent a specific case of optimizing an invex function, the KKTconditions are not only necessary but also sufficient for optimality (Hanson, 1981). Let us define the primal Lagrangian as: y {0,1}m ηy(a, b, c, d)P(y) + ν X y {0,1}m:yi =h H,i P(y) 0.5 + ϵ y {0,1}m λ y P(y) + X y {0,1}m λ+ y (P(y) 1) . with ν, µi, λ+ y and λ y Lagrange multipliers. For the above-mentioned probability distribution, the complementary slackness conditions imply that λ+ y = 0 for all y {0, 1}m and λ y = 0 for all y contained in ΩD m {y A, y BCD}. Hence, the zero-gradient condition results in the following system of equations: ηy(a, b, c, d) = ν + X i:yi =h H,i µi , y ΩD m {y A, y BCD} , ηy(a, b, c, d) = ν + X i:yi =h H,i µi λ y , y / ΩD m {y A, y BCD} . First, we show that a solution always exists for this system, and additionally, we check the dual feasibility conditions µi 0, λ+ y 0 and λ y 0. For all y / ΩD m {y A, y BCD}, the equations have an individual variable λ y , which only occurs in one equation, so these equations do not impose any further restrictions, apart from the non-negativity constraint on the respective Lagrange multipliers. Furthermore, since the sum over µi is always positive, we have enough degrees of freedom to ignore those equations. For the other equations, one arrives at a new system of equations by solving for ν: X i:yi =h H,i µi X j:y j =h H,j µj = ηy(a, b, c, d) ηy (a, b, c, d) , y, y ΩD m {y A, y BCD} . We continue by writing out this system of equations more explicitly, resulting in four cases. For the pair corresponding to y = y A and y = y BCD we obtain X i A B C µi X i D µi = ηy(a, b, c, d) ηy (a, b, c, d) . (26) For all pairs having y = y A and y ΩD m, so d pairs in total with j D, we obtain X i A B C µi X i D\{j} µi = ηy(a, b, c, d) ηy (a, b, c, d) . (27) For all pairs having y = y BCD and y ΩD m, so again d pairs with i D, we obtain µi = ηy(a, b, c, d) ηy (a, b, c, d) . (28) On the Bayes-Optimality of F-Measure Maximizers For all pairs having y, y ΩD m, so d(d 1) pairs with i, j D, we obtain µi = µj . (29) Let us observe that the d equations (27) and the d equations (28) are equivalent due to (26). Moreover, the d(d 1) equations in (29) are trivially satisfied, resulting in a system that can be reduced to d + 1 equations and m variables. Hence, a solution always exists if the dual feasibility conditions are satisfied. Plugging (25) into (28) yields for i D: µi = 2c(b a) (m + c)(2b + 2c + d) 2b 2b + 2c + d 2c(b a) (m + c 1)(2b + 2c + d 1) + 2b 2b + 2c + d 1 2cb(m 2) (m + c)(m 1 + c)(2b + 2c + d) . So, µi 0 as soon as m > 1. Due to the denominator it is also required that 2b+2c+d > 1. For the µi having i / D we find: X i A B C µi ηy(a, b, c, d) ηy (a, b, c, d) = 2a 2a + c + 2b 2b + 2c + d 2c(b a) (m + c)(2b + 2c + d) 2mb (m + c)(2b + 2c + d) 0 . In other words, these are the conditions that remain for the µi for which i D. From the inequality it follows that the individual µi can be made greater than zero in this case, implying that also the active Lagrange multipliers λ y can be chosen in such a way that they are greater than zero. Consequently, when m > 1, all KKT conditions are satisfied for the oracle solution that we provide. Let us compare the value of the objective function for all allowed values of the four integer variables a, b, c and d. By omitting ϵ-dependent terms, which vanish when ϵ approaches zero, we obtain: X y ΩD m {y A,y BCD} ηy(a, b, c, d)P(y) = a (2a + c) ca (m + c)(2b + 2c + d) mb (m + c)(2b + 2c + d) . This function is decreasing in b and c, it is constant in a as soon as c = 0 and a > 0. It is constant in d when b + c = 0, thus its maximum of 0.5 is not unique. This maximum is for example obtained for (a = 1, b = 0, c = 0, d = m 1) and it corresponds to the worst-case regret mentioned in the theorem. Remark as well that the solution imposes as additional constraint m > 2, as a result of the previous constraint 2b + 2c + d > 1. Two cases were Waegeman et al. excluded from our analysis: (a = m, b = 0, c = 0, d = 0) and (a = m 1, b = 0, c = 0, d = 1). These cases do not deserve further attention, since they lead to a worst-case regret that is always upper bounded by 0.5. Theorem 2 Let hs be a vector of predictions obtained by minimizing the subset 0/1 loss, then for m > 2 the worst-case regret is given by: sup P Pu Ls E F(Y , h F ) F(Y , hs) = ( m 2 + 2m2)m (2m 1)(4 + m + m2) , where the supremum is taken over all possible distributions P. Proof. It follows from (9) that optimization problem (4) has a unique solution for the subset zero-one loss if and only if the underlying probability distribution has a unique mode. Translating this requirement into a mathematical programming formulation implies that any probability distribution P Pu Ls should satisfy the following constraint: P(y) + ϵ P(hs) , for all y {0, 1}m \ hs and any ϵ > 0. Practically, the contribution of ϵ will again vanish by choosing it arbitrarily close to zero. As a result, the supremum can be interpreted as the solution of a mixed integer nonlinear program: y {0,1}m (F(y, h) F(y, hs))P(y) (30) P y {0,1}m P(y) = 1 , y {0, 1}m \ hs : P(y) + ϵ P(hs) , y {0, 1}m : 0 P(y) 1 , hs, h F {0, 1}m ϵ 0 . Recall that by construction the solution for h again coincides with the F-measure maximizer in (2). The mixed integer nonlinear program contains 22m integer variables and 2m real variables. In what follows, we assume that ϵ is arbitrarily close to zero, so that all ϵ-dependent terms cancel out, while guaranteeing unique risk minimizers. The only consequence of this decision is that the presented solution acts as a supremum (the maximum is not reached). Despite a similar formulation as the previous theorem, our proving techniques will be quite different, because, unlike the previous theorem, it is impossible to derive an oracle solution for the entire mixed integer nonlinear program. Alternatively, we will look for a solution of the optimization problem that emerges when ϵ-dependent terms become zero. The only practical problem with this solution is the non-uniqueness of the corresponding subset zero-one loss minimizer, but we will show that this solution can be approximated by an arbitrarily close solution with a unique subset zero-one loss minimizer. This technique will suffice to prove the theorem. However, let us remark that the same technique would not On the Bayes-Optimality of F-Measure Maximizers work to prove Theorem 1, because there it would not be possible to approach the solution of the optimization problem without ϵ-terms by an ϵ-approximate extension with unique Hamming loss minimizer. This explains why the proving techniques of Theorem 1 and 2 are different. We start by providing an oracle solution for the linear program that is obtained by fixing the integer variables to h F = 1m and hs = 0m. The first part of the proof is a bit similar to a proof given for the regret of the subset 0/1 loss minimizer w.r.t. the Hamming loss (Dembczy nski et al., 2012a). While omitting ϵ-dependent terms, optimization problem (30) can be reformulated in standard linear program form as: y {0,1}m η(y)P(y) P y {0,1}m P(y) 1 = 0 , y {0, 1}m \ 0m : P(y) P(0m) 0 , y {0, 1}m : P(y) 0 , y {0, 1}m : P(y) 1 0 , ( 2sy sy+m if y = 0m , 1 if y = 0m . In the next paragraphs we will show that the following probability distribution corresponds to the solution of the linear program: PA(y) = 2 m2+m+4 if d(y, 0m) m 2 y = 0m , 0 otherwise , where d H(y, y ) = Pm i=1 |yi y i| denotes the Hamming distance. This solution represents a case where the subset zero-one loss minimizer is not unique, but it could be easily extended to an ϵ-approximate solution with unique subset zero-one loss minimizer: 2 m2+m+4 + m2+m+2 m2+m+4ϵ if y = 0m , 2 m2+m+4 2 m2+m+4ϵ if d(y, 0m) m 2 , 0 otherwise , We verify the KKT conditions to prove that the above probability distribution is indeed the solution of the optimization problem. The primal Lagrangian of the linear program can be defined as: y {0,1}m η(y)P(y) + ν X y =0m λ2 y P(y) P(0m) y {0,1}m λ0 y P(y) + X y {0,1}m λ1 y P(y) , Waegeman et al. with ν, λ0 y, λ1 y and λ2 y Lagrange multipliers. The stationarity condition for optimality leads to the following system of linear equations: η(y) + ν + λ1 y λ0 y + λ2 y = 0 y = 0m , (31) 1 + ν + λ1 y λ0 y X y =0m λ2 y = 0 y = 0m (32) Other conditions that need to be satisfied are dual feasibility y : λ0 y 0 , (33) y : λ1 y 0 , (34) y : λ2 y 0 , (35) and the complementary slackness conditions of our oracle solution PA(y): y Ωu {0m} : λ0 y = 0 , y : λ1 y = 0 , y / Ωu : λ2 y = 0 , where Ωu = Ω(m) Ω(m 1) Ω(m 2) and Ω(t) = {y {0, 1}m | d H(0m, y) = t}. Plugging the latter three conditions into (31) and (32) yields η(y) + v λ0 y = 0 , y / Ωu {0m} , η(y) + v + λ2 y = 0 , y Ωu , v = P y =0m λ2 y 1 . Solving the last equation for v results in v = ( m 2 + 2m2)m (2m 1)(4 + m + m2) . Subsequently, one can verify that this solution for v obeys the non-negativity conditions for all λy. The non-negativity of λ2 y turns out to be most restrictive for the equivalence class y Ω(m 2). In this case we obtain: λ2 y = 2m 4 2m 2 v = 2(3m2 10m + 4) (m 1)(2m 1)(4 + m + m2) . (36) Analyzing this function more thoroughly reveals that it is strictly positive in the interval [+3, + [. Similarly, we find that the most restrictive condition on λ0 y is obtained for the elements in the equivalence class y Ω(m 3), leading to the following equality: λ0 y = v 2m 6 2m 3 = 9m2 + 2m3 + 56m 24 (2m 3)(2m 1)(4 + m + m2) . (37) This function is also strictly positive in the interval [+3, + [, so all non-negativity conditions are satisfied. Plots of the functions are shown in Figure 6. On the Bayes-Optimality of F-Measure Maximizers 0.25 0.20 0.15 0.10 0.05 0.00 0.05 0.0 0.5 1.0 1.5 2.0 Figure 6: Plots of the functions λ2 y and λ0 y as defined by equations (36) and (37). One can observe that PA(y) yields the regret mentioned in the theorem. Thus, what we found so far is a lower bound on the worst-case regret. The tightness of the bound is further proven by showing that the supremum is always obtained by hs = 0m and h F = 1m as soon as m > 2. Since it is impossible to enumerate all solutions for the 22m possible values of the integer variables, we analyse the properties of the objective function to prove that the optimum is obtained for hs = 0m and h F = 1m. Similar to the previous theorem, let us introduce A = {i {1, ..., m} : hi = 1 hs,i = 0} , B = {i {1, ..., m} : hi = 0 hs,i = 1} , C = {i {1, ..., m} : hi = 1 hs,i = 1} , D = {i {1, ..., m} : hi = 0 hs,i = 0} . and the shorthand notations a = |A|, b = |B|, c = |C|, d = |D|, sy, s A y , s B y , s C y , s D y , as defined in (24). Optimization problem (30) can then be relaxed to the following standard mixed-integer nonlinear program form: min a,b,c,d,P X y {0,1}m ηy(a, b, c, d)P(y) P y {0,1}m P(y) = 1 , y {0, 1}m \ hs : P(y) P(hs) 0 , y {0, 1}m : 0 P(y) 1 , a + b + c + d = m , a, b, c, d N , Waegeman et al. writing down the coefficients of the objective function as: ηy(h, hs) = ηy(a, b, c, d) = s A y + s C y sy + a + c s B y + s C y sy + b + c . Recall that the relaxation again originates from the fact that the solution not necessarily complies with the definitions of the sets A, B, C and D. Now observe that the objective function of the mixed-integer nonlinear program is strictly decreasing in b, independent of the other variables, so we can fix b = 0. Subsequently, observe that for b = 0 the objective function is also strictly decreasing in c, independent of the other variables, so we also fix c = 0. As a result, the coefficients can be further simplified to: ηy(h, hs) = ηy(a, b, c, d) = ( s A y sy+a if y = 0m , 1 if y = 0m . To complete the proof we show by contradiction that the optimum is obtained for a = m. In order to construct the recurrence equations below let us introduce q {1, ..., m 1} and let Ω0 q = {y {0, 1}m | yq+1 = 0 y = 0m} , Ω1 q = {y {0, 1}m | yq+1 = 1 y = 01 q+1} , where 01 q+1 denotes a vector of m 1 zeros apart from a single one at position q + 1. Furthermore, let us introduce the mapping Ψq : {0, 1}m {0, 1}m, which, in any binary vector of length m, toggles the bit at position q + 1: a zero at that position becomes a one and vice versa. The mapping Ψq hence defines a unique correspondence between any element in Ω0 q and its sister element in Ω1 q. For a = q, the objective function can then be written as: y {0,1}m ηy(a, b, c, d)P(y) = P(0m) 2s A y sy + q P(y) + X y Ω1q + 2s A Ψq(y) sΨq(y) + q + 1P(y) while for a = q + 1, it can be written as: δq+1(P) = X y {0,1}m ηy(a, b, c, d)P(y) = P(0m) + 2 a + 1P(01 q+1) 2s A y sy + q + 1P(y) + X y Ω1q + 2s A Ψq(y) + 2 sΨq(y) + q + 2P(y) Let us assume that the global optimum is obtained for a < m. Furthermore, let P q(y) be the probability distribution that delivers this optimum for (a = q, d = m q). We construct a new probability distribution P q+1(y) as follows: P q+1(y) = P q(Ψq(y)) if (y Ω1 q P q(y) > P q(Ψq(y))) (y Ω0 q P q(y) < P q(Ψq(y))) , P(y) otherwise , On the Bayes-Optimality of F-Measure Maximizers If P q(y) is feasible, then P q+1(y) is feasible, too. It follows from 2s A Ψq(y) + 2 sΨq(y) + q + 2 2s A Ψq(y) sΨq(y) + q , y {0, 1}m that δq+1(P q+1) δq(P q) for all q. This is a contradiction. Consequently, the global optimum of the optimization problem is given by P A. Side note about the difference in proving technique for Theorems 1 and 2. In this paragraph we give some additional explanation why the proving techniques for Theorems 1 and 2 are different. The proof of Theorem 2 cannot proceed with ϵ-terms because we are not able to derive an oracle solution for the mixed integer linear program when ϵ-terms are considered. However, for Theorem 2 this is not needed because we are able to find a probability distribution that has unique risk minimizers, while being arbitrarily close to the oracle solution that we present. The claim of the theorem then immediately follows from taking the limit when ϵ approaches zero. Let us consider the example of m = 4, for which we obtain the following probability distribution: y P(y) 0000 1/12 1100 1/12 1010 1/12 1001 1/12 0110 1/12 0101 1/12 0011 1/12 1110 1/12 1011 1/12 1101 1/12 0111 1/12 1111 1/12 The subset zero-one loss minimizer is not unique in this case, but let us choose 0000 as subset zero-one loss minimizer. The F-measure maximizer is 1111. The regret should then be 13/24 - see fraction in Theorem 2. We can easily extend this to an ϵ-case with unique subset zero-one loss minimizer: Waegeman et al. y P(y) 0000 1/12 + 11ϵ 1100 1/12 ϵ 1010 1/12 ϵ 1001 1/12 ϵ 0110 1/12 ϵ 0101 1/12 ϵ 0011 1/12 ϵ 1110 1/12 ϵ 1011 1/12 ϵ 1101 1/12 ϵ 0111 1/12 ϵ 1111 1/12 ϵ So, this suffices as a mathematically correct proof. However, the same trick cannot be used for Theorem 1. If we would omit the ϵ-terms in the Lagrangian there, we would end up with a solution that cannot be ϵ-approached by a probability distribution with unique risk minimizer. As an example, let us again consider the case m = 4 and the following probability distribution: y P(y) 1000 .5 0111 .5 One of the risk minimizers for Hamming loss is vector 0000 with zero F-measure. Another vector, 1110, which is also a Hamming loss minimizer, gets an F-measure of 0.5833. This is a regret that is higher than the supremum mentioned in Theorem 1, but it is a value that cannot be ϵ-approached when restricting to unique Hamming loss minimizers. Theorem 3 Let h J and h F be vectors of predictions obtained by maximizing the Jaccard index and the F-measure, respectively. Let the utility of the F-measure maximizer be given by δ(P) = max h {0,1}m E [F(Y , h)] = max h {0,1}m X y {0,1}m P(y) F(y, h). The regret of the F-measure maximizer with respect to the Jaccard index is then upper bounded by E J(Y , h J) J(Y , h F ) 1 δ(P)/2 for all possible distributions P On the Bayes-Optimality of F-Measure Maximizers Proof. The proof follows immediately from the following double inequality: Pm i=1 yihi(x) Pm i=1 yi + Pm i=1 hi(x) Pm i=1 yihi(x) Pm i=1 yi + Pm i=1 hi(x) Pm i=1 yihi(x) 2 Pm i=1 yihi(x) Pm i=1 yi + Pm i=1 hi(x) , which results in F(y, h F ) 2 J(y, h F ) F(y, h F ) , for all y, h {0, 1}m. Theorem 5 Let h I be a vector of predictions obtained by assuming label independence as defined in (3), then the worst-case regret is lower-bounded by: E F(Y , h F ) F(Y , h I) 2q 1, for all q [1/2, 1] satisfying Pm s=1 2m! (m s)!(s 1)!(m+s)qm s(1 q)s qm > 0 and the supremum taken over all possible distributions P. Proof. To analyze the potential regret of methods that assume independence, it is sufficient to compare the F-maximizers and their corresponding F-measures for a joint distribution defined on independent random variables and a second joint distribution having the same marginal distributions, but no independence. Below, we analyze two families of probability distributions that are parameterized by a single parameter q, which is defined as q = P(Yi = 0) for all i = 1, ..., m. The first family resembles the case of independent random variables, for which the joint distribution is defined as the product of marginal probabilities: PA(y) = qm s(1 q)s where s = i=1 yi . (39) The second family of distributions captures one particular case of a very strong stochastic dependence: q if y = 0m 1 q if y = 1m 0 otherwise , If q > 0.5, then the F-measure maximizer of PB is given by 0m and its corresponding Fmeasure is q. It is less straightforward to find the F-measure maximizer of PA. Let us first introduce the equivalence classes Ωm(s) = {y {0, 1}m | i=1 yi = s} , Ωm(s, l) = {y {0, 1}m | i=1 yi = s yl = 1} , Waegeman et al. with s, l {1, ..., m}. The cardinality of these equivalence classes is given by |Ωm(s)| = m s = m! (m s)!s! , |Ωm(s, l)| = m 1 s 1 = (m 1)! (m s)!(s 1)! . Let hk be a series of predictions such that Pm i=1 hi = k. Without loss of generality, we can fix hk to a vector of k ones that are followed by m k zeros, because the distributions that we analyze are fully symmetric (i.e., all index permutations of {1, ..., m} in the label vectors yield the same values in probability mass). As a result, we can write the expected F-measure of hk with k > 0 as: EY PA [F(Y , hk)] = X y {0,1}m F(y, hk)PA(y) (40) Pk i=1 2yi sy + k PA(y) Pk i=1 2yi s + k PA(y) 2yi s + kqm s(1 q)s 2 s + kqm s(1 q)s (m 1)! (m s)!(s 1)! 2k s + kqm s(1 q)s This is an increasing function of k, which implies that the F-measure maximizer consists of a vector of solely ones (or solely zeros) for PA. In addition, the expected F-measure for a prediction vector of zeros is given by EY PA [F(Y , 0m)] = qm . Let us define δm as δm = EY PA [F(Y , 1m) F(Y , 0m)] (m 1)! (m s)!(s 1)! 2m m + sqm s(1 q)s qm . Then, assuming independence delivers the wrong maximizer for PB as soon as δm > 0. Corollary 1 Let h I be a vector of predictions obtained by assuming independence, then the worst-case regret converges to 1 in the limit of m, i.e., lim m sup P E F(Y , h F ) F(Y , h I) = 1, On the Bayes-Optimality of F-Measure Maximizers where the supremum is taken over all possible distributions P. Proof. For increasing m, the condition is satisfied for q close to one. The easiest way to observe this is computing the limit lim m qm = 0 , which implies m! (m s)!s!qm s(1 q)s = 1 . From this last limit and 2m m + s m s it follows that: (m 1)! (m s)!(s 1)! 2m m + sqm s(1 q)s 1 . By definition, (40) cannot exceed the upper bound of one, so this inequality must hold as an equality. In such a scenario, the worst-case regret is lower bounded by Rq = 2q 1, so that limq 1,m Rq = 1. As a consequence, the lower bound becomes tight in the limit of m going to infinity. Theorem 7 Let h T be a vector of predictions obtained by putting a threshold on sorted marginal probabilities, then the worst-case regret is lower bounded by E F(Y , h F ) F(Y , h T ) max 0, 1 where the supremum is taken over all possible distributions P. Proof. To analyze the regret of thresholding approaches, we have to construct a counterexample for which the F-measure is not consistent with the order of the marginal probabilities. The following family of distributions is such a counterexample: 1/2 ϵ if y1 = 1 Pm i=1 yi = 1 (1/2 + ϵ)/(2m 4) if y2 = 1 1 + Pm/2+1 i=3 yi = Pm i=3 yi = m/2 (1/2 + ϵ)/(2m 4) if y2 = 1 1 + Pm i=m/2+2 yi = Pm i=3 yi = m/2 0 otherwise where we consider for simplicity that m is even and ϵ [0, 1/2] represents a positive constant. For ϵ close to zero, one can easily show that the F-measure maximizer is given by a vector of predictions consisting of only zeros, apart from a single one at position one. The expected F-measure of this prediction vector is 1/2 ϵ. However, this prediction vector can never be returned by a method that relies on thresholding over marginal probabilities, because Waegeman et al. P(Y2 = 1) > P(Y1 = 1) in this particular case. By enumerating all candidate solutions examined by thresholding, one will find instead a prediction vector h11 consisting of zeros, apart from a one at the first two positions. The expected F-measure of this prediction vector is E [F(Y , h11)] = (1/2 ϵ)(2/3) + (1/2 + ϵ)(2/(2 + (m/2))) . As a consequence, this results in the above-mentioned regret when ϵ approaches zero. P. Bartlett, M. Jordan, and J. Mc Auliffe. Convexity, classification and risk bounds. Journal of the American Statistical Association, 101:138 156, 2006. C. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., 2006. L. Breiman. Bagging predictors. Machine Learning, 24:123 140, 1996. L. Breiman. Some infinity theory for predictor ensembles. Technical Report 577, Department of Statistics, University of California, Berkeley, 2000. K. Chai. Expectation of F-measures: Tractable exact computation and some empirical observations of its properties. In Proceedings of the International ACM Conference on Research and Development in Information Retrieval (SIGIR), 2005. W. Cheng and E. H ullermeier. Combining instance-based learning and logistic regression for multilabel classification. Machine Learning, 76(2-3):211 225, 2009. W. Cheng, K. Dembczy nski, W. Waegeman A. Jaroszewicz, and E. H ullermeier. F-measure maximization in topical classification. In Rough Sets and Current Trends in Computing, volume 7413 of Lecture Notes in Computer Science, pages 439 446, 2012. F. Chierichetti, R. Kumar, S. Pandey, and S. Vassilvitskii. Finding the Jaccard median. In Proceedings of ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 293 311, 2010. D. Coppersmith and S. Winograd. Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation, 3(9):251 280, 1990. T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms, 2nd edition. MIT Press, 2001. H. Daum e III, J. Langford, and D. Marcu. Search-based structured prediction. Machine Learning, 75:297 325, 2009. B. De Baets, S. Janssens, and H. De Meyer. On the transitivity of a parametric family of cardinality-based similarity measures. Journal of Approximate Reasoning, 50:104 116, 2009. J. del Coz, J. Diez, and A. Bahamonde. Learning nondeterministic classifiers. Journal of Machine Learning Research, 10:2273 2293, 2009. On the Bayes-Optimality of F-Measure Maximizers K. Dembczy nski, W. Cheng, and E. H ullermeier. Bayes optimal multilabel classification via probabilistic classifier chains. In Proceedings of the International Conference on Machine Learning (ICML), pages 279 286. Omnipress, 2010. K. Dembczy nski, W. Waegeman, W. Cheng, and E. H ullermeier. An exact algorithm for F-measure maximization. In Advances in Neural Information Processing Systems, volume 25, 2011. K. Dembczy nski, W. Waegeman, W. Cheng, and E. H ullermeier. On label dependence and loss minimization in multi-label classification. Machine Learning, 88:5 45, 2012a. K. Dembczy nski, W. Waegeman, W. Cheng, and E. H ullermeier. An analysis of chaining in multi-label classification. In Proceedings of the European Conference on Artificial Intelligence (ECAI), 2012b. K. Dembczy nski, A. Jachnik, W. Kotlowski, W. Waegeman, and E. H ullermeier. Optimizing the F-measure in multi-label classification: Plug-in rule approach versus structured loss minimization. In Proceedings of the International Conference on Machine Learning (ICML), 2013. L. Devroye, L. Gy orfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1997. L. Dice. Measures of the amount of ecologic associations between species. Ecology, 26: 297 302, 1945. J. Duchi, L. Mackey, and M. I. Jordan. On the consistency of ranking algorithms. In Proceedings of the International Conference on Machine Learning (ICML), pages 327 334, 2010. R. Fan and C. Lin. A study on threshold selection for multi-label classification. Technical report, Department of Computer Science, National Taiwan University, 2007. T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. In Proceedings of the International Conference on Machine Learning (ICML). Omnipress, 2008. Wei Gao and Zhi-Hua Zhou. On the consistency of multi-label learning. Artificial Intelligence, 199-200:22 44, 2013. N. Ghamrawi and A. Mc Callum. Collective multi-label classification. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM), pages 195 200, 2005. M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. Witten. The WEKA data mining software: An update. SIGKDD Explorations, 11, 2009. M.A. Hanson. On sufficiency of the Kuhn-Tucker conditions. Journal of Mathematical Analysis and Applications, 80:545 550, 1981. Waegeman et al. B. Hariharan, L. Zelnik-Manor, S. Vishwanathan, and M. Varma. Large scale max-margin multi-label classification with priors. In Proceedings of the International Conference on Machine Learning (ICML), pages 423 430. Omnipress, 2010. M. Jansche. Maximum expected F-measure training of logistic regression models. In Proceedings of the Human Language Technology Conference and the Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 736 743, 2005. M. Jansche. A maximum expected utility framework for binary sequence labeling. In Proceedings of the Annual Meetings of the Association for Computational Linguistics (ACL), pages 736 743, 2007. A. Janusz, H. Son Nguyen, D. Slezak, S. Stawicki, and A. Krasuski. JRS 2012 data mining competition: Topical classification of biomedical research papers. In Rough Sets and Current Trends in Computing, volume 7413 of Lecture Notes in Computer Science, pages 417 426. Springer, 2012. T. Joachims. A support vector method for multivariate performance measures. In Proceedings of the International Conference on Machine Learning (ICML), pages 377 384, 2005. S. Keerthi, V. Sindhwani, and O. Chapelle. An efficient method for gradient-based adaptation of hyperparameters in SVM models. In Advances in Neural Information Processing Systems, volume 19. MIT Press, 2007. I. Kokkinos. Boundary detection using F-measure-, filterand feature- (f3) boost. In Proceedings of the European Conference on Computer Vision (ECCV): Part II, pages 650 663, 2010. A. Kumar, S. Vembu, A.K. Menon, and C. Elkan. Beam search algorithms for multilabel learning. In Machine Learning, 2013. J. Lafferty, A. Mc Callum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning (ICML), pages 282 289. Morgan Kaufmann, 2001. S. Lee. On classification and regression trees for multiple responses and its application. Journal of Classification, 23:123 141, 2006. D. Lewis. Evaluating and optimizing autonomous text classification systems. In Proceedings of the International ACM Conference on Research and Development in Information Retrieval (SIGIR), pages 246 254, 1995. Z. Lipton, C. Elkan, and B. Naryanaswam. Optimal thresholding of classifiers to maximize F1 measure. In Proceedings of the European Conference on Machine Learning, 2014. O. Luaces, J. Diez, J. Barranquero, J. del Coz, and A. Bahamonde. Binary relevance efficacy for multilabel classification. Progress in Artificial Intelligence, 1(4):303 313, 2012. On the Bayes-Optimality of F-Measure Maximizers D. Martin, C. Fowlkes, and J. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26:530 549, 2004. A. K Mc Callum, 2002. URL http://mallet.cs.umass.edu. A. K. Mc Callum, D. Freitag, and F. (2000) Pereira. Maximum entropy Markov models for information extraction and segmentation. In Proceedings of the International ACM Conference on Research and Development in Information Retrieval (SIGIR), 2000. D. Musicant, V. Kumar, and A. Ozgur. Optimizing F-measure with support vector machines. In Proceedings of the International FLAIRS Conference, pages 356 360. Haller AAAI Press, 2003. J. Petterson and T. Caetano. Reverse multi-label learning. In Advances in Neural Information Processing Systems, volume 24, 2010. J. Petterson and T. Caetano. Submodular multi-label learning. In Advances in Neural Information Processing Systems, volume 25, 2011. J. Quevedo, O. Luaces, and A. Bahamonde. Multilabel classifiers with a probabilistic thresholding strategy. Pattern Recognition, 45:876 883, 2012. J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier chains for multi-label classification. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), pages 254 269. Springer, 2009. T. Sang and F. De Meulder. Introduction to the Co NLL-2003 shared task: Languageindependent named entity recognition. In Proceedings of the Human Language Technologies Conference and the Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pages 142 147, 2003. R. Schapire and Y. Singer. Boostexter: a boosting-based system for text categorization. Machine Learning, 39:135 168, 2000. I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67 93, 2001. J. Suzuki, E. Mc Dermott, and H. Isozaki. Training conditional random fields with multivariate evaluation measures. In Proceedings of the Annual Meetings of the Association for Computational Linguistics (ACL), pages 217 224, 2006. S. Swamidass, J. Chen, J. Bruand, P. Phung, L. Ralaivola, and P. Baldi. Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity. Bioinformatics, 21 Suppl 1(2):359 368, 2005. B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems, volume 16. MIT Press, 2004. Waegeman et al. A. Tewari and P. L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8:1007 1025, 2007. I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453 1484, 2005. G. Tsoumakas and I. Katakis. Multi label classification: An overview. International Journal of Data Warehousing and Mining, 3(3):1 13, 2007. G. Tsoumakas, I. Katakis, and I. Vlahavas. Mining multi-label data. In O. Maimon and L. Rokach, editors, Data Mining and Knowledge Discovery Handbook. Springer, 2010. G. Tsoumakas, E. Spyromitros-Xioufis, J. Vilcek, and I. Vlahavas. Mulan: A java library for multi-label learning. Journal of Machine Learning Research, 12:2411 2414, 2011. C. J. van Rijsbergen. Foundation of evaluation. Journal of Documentation, 30(4):365 373, 1974. C. Vens, J. Struyf, L. Schietgat, S. Dzeroski, and H. Blockeel. Decision trees for hierarchical multi-label classification. Machine Learning, 73(2):185 214, 2008. Y. Wu, H. Zhang, and Y. Liu. Robust model-free multiclass probability estimation. Journal of the American Statistical Association, 105:424 436, 2010. Y. Yang. An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1(1 2):67 88, 1999. N. Ye, K. Chai, W. Lee, and H. Chieu. Optimizing F-measures: a tale of two approaches. In Proceedings of the International Conference on Machine Learning, 2012. H. Zhang. Classification trees for multiple binary responses. Journal of the American Statistical Association, 93(441):180 193, 1998. M. Zhang and Z. Zhou. ML-k NN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038 2048, 2007. T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32:56 85, 2004. X. Zhang, T. Graepel, and R. Herbrich. Bayesian online learning for multi-label and multivariate performance measures. In Proceedings of the Conference on Artificial Intelligence and Statistics (AISTATS), pages 956 963, 2010. M. Zhao, N. Edakunni, A. Pocock, and G. Brown. Beyond Fano s inequality: Bounds on the optimal F-score, BER, and cost-sensitive risk and their implications. Journal of Machine Learning Research, 14:1033 1090, 2013. H. Zhuang, A. Chin, S. Wu, W. Wang, X. Wang, and J. Tang. Inferring geographic coincidence in ephemeral social networks. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), pages 1 16. Springer, 2012.