# extreme_fmeasure_maximization_using_sparse_probability_estimates__33ca1ae6.pdf

Extreme F-Measure Maximization using Sparse Probability Estimates

Kalina Jasinska KJASINSKA@CS.PUT.POZNAN.PL Krzysztof Dembczy nski KDEMBCZYNSKI@CS.PUT.POZNAN.PL Institute of Computing Science, Pozna n University of Technology, 60-965 Pozna n, Poland

R obert Busa-Fekete BUSAROBI@GMAIL.COM Karlson Pfannschmidt KIUDEE@MAIL.UPB.DE Timo Klerx TIMOK@UPB.DE Eyke H ullermeier EYKE@UPB.DE Department of Computer Science, Paderborn University, 33098 Paderborn, Germany

We consider the problem of (macro) F-measure maximization in the context of extreme multi-label classiﬁcation (XMLC), i.e., multi-label classiﬁcation with extremely large label spaces. We investigate several approaches based on recent results on the maximization of complex performance measures in binary classiﬁcation. According to these results, the F-measure can be maximized by properly thresholding conditional class probability estimates. We show that a na ıve adaptation of this approach can be very costly for XMLC and propose to solve the problem by classiﬁers that efﬁciently deliver sparse probability estimates (SPEs), that is, probability estimates restricted to the most probable labels. Empirical results provide evidence for the strong practical performance of this approach.

1. Introduction

Extreme classiﬁcation refers to multi-class and multi-label learning problems that involve hundreds of thousands (Deng et al., 2009; Partalas et al., 2015) or even millions of labels (Agrawal et al., 2013). Applications of that kind can be found in image classiﬁcation (Deng et al., 2011), text document classiﬁcation (Dekel & Shamir, 2010), on-line advertising (Beygelzimer et al., 2009a), and video recommendation (Weston et al., 2013). Several approaches for efﬁcient extreme classiﬁcation have recently been proposed, including FASTXML (Prabhu & Varma, 2014), LOMTREES (Choromanska & Langford, 2015),

Proceedings of the 33 rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

SLEEC (Bhatia et al., 2015), robust Bloom ﬁlters (Cisse et al., 2013), label partitioning (Weston et al., 2013), and fast label embeddings (Mineiro & Karampatziakis, 2015).

These algorithms have been mainly evaluated in terms of ranking-based measures such as Precision@K or NDCG@K. In this work, we focus on extreme multi-label classiﬁcation (XMLC) and turn our attention to the (macro) F-measure, which is a commonly used performance measure in multi-label classiﬁcation as well as other ﬁelds, such as natural language processing. A variant of this measure has also been used in the Kaggle s LSHTC competition (Partalas et al., 2015); see https://www.kaggle.com/c/lshtc.

We tackle the learning problem within the Empirical Utility Maximization (EUM) framework for F-measure maximization. As shown recently, maximizing the F-measure in EUM can be accomplished by properly tuning a threshold on class probability estimates (CPEs), that is, using a suitably chosen threshold to separate between positive and negative predictions (Koyejo et al., 2014; Narasimhan et al., 2014; Kotłowski & Dembczynski, 2015). In the extreme learning scenario, both probability estimation and threshold tuning are challenging tasks; in particular, tuning the threshold for each label through exhaustive search is computationally infeasible. In this paper, we aim at extending existing threshold tuning methods for optimizing the F-measure in binary classiﬁcation, so as to make them amenable to the optimization of the macro F-measure in the XMLC setting. To this end, we propose the idea of sparse probability estimates (SPEs).

Following a formal description of the macro F-measure and the EUM approach in Section 2 and 3, respectively, we adapt threshold tuning methods to the XMLC setting by exploting the idea of SPEs in Section 4. In Section 5, we discuss two concrete methods for delivering SPEs. The combination of these methods with threshold optimization approaches is evaluated experimentally in Section 6.

Extreme F-Measure Maximization using Sparse Probability Estimates

2. Macro-averaged F-measure

Consider an XMLC problem with m labels and denote the true and predicted label vector of the ith instance xi by yi = (yi,1, . . . , yi,m) 2 {0, 1}m and byi = (ˆyi,1, . . . , ˆyi,m) 2

{0, 1}m, respectively. For a test set {(xi, yi)}n

i=1 of size n, let Y be the n m matrix with entries yi,j, 1 i n, 1 j m, and denote by y j the jth column of Y ; the same notation is used for the collection of estimates b Y . The macro F-measure (or F-score) is then deﬁned as:

F(y j, by j)

i=1 yi,j ˆyi,j Pn

i=1 yi,j + Pn

To maximize this measure, it obviously sufﬁces to maximize the (standard) F-measure F(y j, by j) for each label j separately. In principle, threshold tuning methods for F-measure maximization in binary classiﬁcation can thus be used, simply by applying them independently for each label. However, a na ıve adaptation of these methods can be very costly for problems with extremely large label spaces. This is because the tuning methods require CPEs for all labels and instances of the set at hand; for example, at least 1010 predictions have to be produced and possibly stored for m > 105 labels and n > 105 instances.

Fortunately, a signiﬁcant reduction of complexity is possible thanks to important properties of the XMLC problem and the F-measure. First, the number of positive labels in XMLC is typically very small in comparison to the number of negative labels. Second, for computing the F-measure, only the true positive labels (yi,j = 1) and the predicted positive labels (ˆyi,j = 1) are needed, but neither the true negatives nor the predicted negatives. These properties motivate the idea of sparse probability estimates (SPEs), by which we mean probability estimates restricted to the top-labels, i.e., those labels with a sufﬁciently high probability other probabilities are not important for tuning the threshold. As will be seen, SPEs will signiﬁcantly reduce complexity of threshold tuning in extreme classiﬁcation.

3. Expected Utility Maximization (EUM)

Suppose a ﬁnite set Dn = {(xi, yi)}n

i=1 of labeled instances to be given. Moreover, denote by (x, j) the probability that the jth label is positive for the instance x, i.e., (x, j) = P(yj = 1 | x), and that estimates b (x, j) of these posteriors are produced by a probabilistic classiﬁer b : X [m] ! [0, 1]. For the time being, we are not interested in the training of this classiﬁer but focus on the F-measure optimization step.

The EUM principle is based on the empirical estimate of the F-score on the data Dn. Consider the F-score obtained by the

threshold classiﬁer b (x) = (b 1(x, 1), . . . , b m(x, m)), where b j(x, j) = Jb (x, j) j K:1

F ( ; b , Dn) = 1

i=1 yi,jb (xi, j) Pn

i=1 yi,j + Pn

i=1 b (xi, j) (2)

Obviously, with byi,j = b (xi, j), we have F( ; b , Dn) = FM(Y , b Y ), and the optimal threshold vector

2[0,1]m F( ; b , Dn) (3)

can be found by searching for the j in the ﬁnite set of candidates {b (xi, j)}n

i=1, independently for each label.

The problem of F-measure optimization has received increasing attention in recent years and has been tackled with different methods, including EUM but also alternatives like the so-called decision-theoretic approach (Ye et al., 2012). Although most of the contributions so far refer to binary classiﬁcation (i.e., the standard F-measure), many of the results can be readily extended to the case of the macro F-measure, thanks to the fact that the latter is an average of binary F-measures. More concretely, based on the result of Ye et al. (2012), F( ; b , Dn)

P ! F(b ) as n ! 1 for any 2 (0, 1)m, where F(b ) is the population level macro F-score of b deﬁned as

E [ (x, i)b i(x, i)] E [ (x, i)] + E [b i(x, i)] ,

and the expectation is taken with respect to the data distribution. Narasimhan et al. (2014) provide an even stronger result, which can be extended to the macro F-measure, too: If a classiﬁer b Dn is induced from Dn by an L1-consistent learner for every label and a threshold is obtained by maximizing (2) on an independent set D0

n, then F(b

P ! F( ) as n ! 1 (under mild assumptions on the data distribution), where = arg max 2[0,1]m F( ). Finally, based on (Ye et al., 2012) (see their Theorem 4), there is no classiﬁer of the form X [j] ! {0, 1} with a population level macro F-score better than F( ). This provides another justiﬁcation for using threshold classiﬁers in the multi-label scenario. For a more elaborate discussion on consistency of multi-label classiﬁcation with complex performance measures, see (Koyejo et al., 2015).

4. Macro F-measure maximization

This section presents three F-measure optimization methods, the ﬁrst two of which directly build on the EUM framework. Each of these methods seeks to optimize the threshold values in (3), typically in a label-wise manner. Sorting-based

1JPK = 1 if the predicate P is true and = 0 otherwise.

Extreme F-Measure Maximization using Sparse Probability Estimates

threshold optimization (STO) ﬁnds an optimal threshold on a validation set, while the ﬁxed thresholds approach (FTA) validates only a restricted set of predeﬁned candidates for the threshold. The third method, online F-measure optimization (OFO), tunes the threshold in an online setting. It can be used either in combination with an online learning method or, like the two previous methods, be applied on a validation set.

In the following, we denote the SPE of the posterior (x) = ( (x, 1), . . . , (x, m)) for an instance x and the corresponding index set by

b (x, j) : j 2 b (x)

b (x) = {j 2 [m] : b (x, j) > j} ,

where = ( 1, . . . , m) 2 [0, 1]m, and j is the threshold used for the jth label. If the threshold is the same for each label, i.e., 1 = = m = , we write bs (x) and b (x) instead of bs (x) and b (x). Concrete (tree-based) SPE techniques for computing these sets in a very efﬁcient way will be discussed in Section 5.

4.1. Search-based threshold optimization (STO)

An exact solution of (3) on a ﬁnite set Dn calls for computing all CPEs for all labels and, moreover, the F-score for all CPEs as possible thresholds. This can be implemented by sorting the CPEs for each label ﬁrst, and ﬁnding the threshold that yields the highest F-measure with a single pass over the sorted list afterward. Since n m posterior estimates are needed, and sorting requires time O(n log n), the overall computational complexity of this procedure is O(mn log n).

To reduce the cost, we can solve (3) by using SPEs. We ﬁrst compute and store the positive labels for which the posterior estimates are above ﬁxed thresholds , which is an input parameter of Algorithm 1. Next, the algorithm seeks to ﬁnd the optimal threshold for each label j based on an exhaustive search by computing the F-score for each b (xi, j) as a possible threshold. The cost of this search is much lower now, as it only needs to sort the SPE values provided j is not too close to 0, these should be much less than n.

Needless to say, the SPE thresholds j have to be chosen carefully and deﬁnitely smaller than the optimal thresholds in (3); otherwise, the latter cannot be found anymore. On the other side, the larger j, the more efﬁcient the search procedure will be. So what is a reasonable range of values for an optimal threshold , and hence for j?

To answer this question, we derive (theoretical) bounds for a threshold. Interestingly, the optimal solution on a population level (i.e., all probabilities are known) satisﬁes the following condition (Zhao et al., 2013):

F = F( ) = 2 (4)

Algorithm 1 STO(Dn, b , )

1: for j = 1 ! m do 2: Ij = ;

3: for i = 1 ! n do 4: Compute bs (xi) and b (xi) . SPE 5: for each j 2 b (xi) do 6: Ij = Ij [ {i} 7: pij = b (x1, j)(2 bs (xi))

8: for j = 1 ! m do 9: Find j based on {pij : i 2 Ij} and {yi,j : i 2 Ij} using sorting

That is, the optimal F-measure is twice the value of the optimal threshold. As an immediate consequence, the upper bound of the threshold is 0.5. In order to derive a lower bound, note that a classiﬁer which always predicts positive (i.e., which uses a threshold = 0) yields an F-measure of F = 2 /( + 1), where is the (prior) probability of the label being positive. Since > 0 can reasonably be assumed, F > = 0, and (4) is violated. Therefore, F must be smaller than the optimal value F , and

Our analysis thus suggests that, for each label j, the threshold =

j should be chosen from the range

( j/( j + 1), 0.5] . (5)

Remark that this result assumes the posteriors (x, j) (or at least perfect estimates ˆ (x, j) thereof) to be given.

4.2. Fixed thresholds approach (FTA)

Following the EUM principle, STO optimally tunes the threshold for each label on the data Dn. By doing so, it is of course prone to overﬁtting. Therefore, one can consider a different implementation of line 9 in Algorithm 1. FTA follows a simple modiﬁcation, in which a predeﬁned set of possible threshold values within the range (5) is tried. This approach can be simpliﬁed even further, by using the same threshold for all labels, which turns out to produce more stable results. As a side remark, we note that a common threshold for all labels is provably optimal for micro-averaged and instance-averaged F-measures (Koyejo et al., 2015).

Implementing FTA like Algorithm 1 requires computing and storing all SPEs for a validation set and checking the F-measure for each predeﬁned threshold. Alternatively, one can compute the F-measure for each threshold simultaneously by passing the validation set only once. In that case, SPEs do not need to be stored, but auxiliary variables for each of the predeﬁned thresholds have to be kept.

Extreme F-Measure Maximization using Sparse Probability Estimates

4.3. Online F-measure optimization (OFO)

The OFO algorithm by Busa-Fekete et al. (2015) optimizes the binary F-measure by tuning the threshold in an online fashion. Assuming instances to be observed sequentially, the algorithm seeks to maximize the so-called online F-measure, which is deﬁned for label j as

Fi,j = 2 Pi

=1 y ,jby ,j Pi

=1 y ,j + Pi

where we concisely write

y ,jby ,j, bi,j =

by ,j . (7)

The OFO algorithm simply sets the threshold to

i,j = ai 1,j

when processing the ith instance, and thus makes predictions

byi,j = Jb (xi, j) > j K. The threshold update is motivated by condition (4). Moreover, Busa-Fekete et al. (2015) have proven the consistency of this approach: The threshold and the online F-score computed by OFO converge in probability to the optimal F-score and threshold, respectively, as n goes to inﬁnity, provided the posterior estimates are coming from an L1 consistent classiﬁer.

Thanks to its online nature, OFO can be readily adapted to a large-scale learning regime. In contrast to STO, it allows the data to be processed in a sequential manner, without the need to store any predicted scores or labels from previous iterations. The pseudo-code of the algorithm is shown in Algorithm 2. First, the initial threshold is set based on a = (a1, . . . , am) and b = (b1, . . . , bm), which are input parameters (the impact of which will be investigated in our experimental study). Then, in each iteration, the set of predicted positive labels b (x) is computed (line 4), and the thresholds are updated according to (8) for all labels in (xi) [ b (xi), where (x) is the set of positive labels for x. Note that this update can be implemented in an efﬁcient way, since, according to the deﬁnition of ai,j and bi,j, if label j and the predicted label are both negative, then neither ai,j nor bi,j need to be updated. Finally, the memory requirement of the algorithm is O(m), since only two auxiliary arrays (a and b) and the array of thresholds need to be stored. The computational complexity is O(nm0), where m0 = Pn

i=1 | (xi) [ b (xi)|, which can be orders of magnitude smaller than m if both the labels and the predictions are sparse. In general, OFO can be either applied on a validation set or run simultaneously with training of the class probability model. For large validation sets, a single pass over the data is enough to obtain an accurate estimate of the threshold.

Algorithm 2 OFO(Dn, b , a, b)

1: for i = 1 ! m do 2: Set i = ai/bi . Initial value

3: for i = 1 ! n do 4: Compute b (xi) . Predicted positives 5: for all j 2 (xi) [ b (xi) do 6: aj = aj + Jj 2 (xi) \ b (xi)K

7: bj = bj + Jj 2 (xi)K + Jj 2 b (xi)K 8: j = aj

bj 9: return

5. Efﬁcient sparse probability estimators

In this section, we discuss extreme classiﬁcation algorithms for computing sparse probability estimates. We mainly focus on Probabilistic Label Trees (PLTs), which ideally ﬁt the idea of sparse probability estimation. Later, we recall FASTXML (Prabhu & Varma, 2014) and explain why this algorithm can be treated as an efﬁcient sparse probability estimator, too. We also discuss other large scale algorithms that do not deliver sparse probability estimates in an efﬁcient way.

5.1. Probabilistic label trees

PLTs share similarities with conditional probability estimation trees (Beygelzimer et al., 2009a) and probabilistic classiﬁer chains (Dembczy nski et al., 2010), while being suited to estimate marginal posterior probabilities (x, j). They are also similar to Homer (Tsoumakas et al., 2008), which transforms training examples in the same way but does not admit a probabilistic interpretation. Let us also remark that a similar concept is known in neural networks and natural language processing under the name of hierarchical softmax classiﬁers (Morin & Bengio, 2005); however, it has not been used in a multi-label scenario.

In a nutshell, PLTs are based on the label tree approach (Beygelzimer et al., 2009b; Bengio et al., 2010; Deng et al., 2011), in which each leaf node corresponds to one label. Classiﬁcation of a test example relies on a sequence of decisions made by node classiﬁers, leading the test example from the root to the leaves of the tree. Since PLTs are designed for multi-label classiﬁcation, each internal node classiﬁer decides whether or not to continue the path by moving to the child nodes. This is different from typical left/right decisions made in tree-based classiﬁers. Moreover, a leaf node classiﬁer needs to make a ﬁnal decision regarding the prediction of a label associated to this leaf. PLTs use a class probability estimator in each node of the tree, such that an estimate of the posterior probability of a label associated with a leaf is given by the product of the probability estimates on the path from the root to that leaf. Prediction then relies on traversing the tree from the root to the leaf nodes. Whenever

Extreme F-Measure Maximization using Sparse Probability Estimates

the intermediate value of the product of probabilities in an internal node is less than a given threshold, the subtree below this node is not explored anymore. This pruning strategy leads to a very efﬁcient classiﬁcation procedure.

5.2. Formal description of PLTs

To introduce PLTs more formally, denote a tree by T and the root of the tree r(T). In general, T can be of any form; here, we consider trees of height k and degree b. The leaves of T correspond to labels. We denote a set of leaves of a (sub)tree rooted in node t by L(t). The parent node of t is denoted by pa(t), and the set of child nodes by Ch(t). The path from the root r(T) to the jth leaf is denoted by Path(j).

PLTs use a path from a root to the jth leaf to estimate posteriors (x, j). With each node t and training instance x, we associate a label zt = JP

j2L(t) yj 1K. In the leaf nodes, the labels zj, j 2 L, correspond to the original labels yj.

Consider the leaf node j and the path from the root to this leaf node. Using the chain rule of probability, we can express (x, j) in the following way:

T (x, t) , (9)

where T (x, t) = P(zt = 1 | zpa(t) = 1, x) for all non-root nodes t, and T (x, t) = P(zt = 1 | x) if t is the root node. The correctness of (9) follows from the observation that zt = 1 implies zpa(t) = 1. A detailed derivation of the chain rule in this setup is given in Appendix A.

The training algorithm for PLTs is given in Algorithm 3. Let Dn = {(xi, yi)}n

i=1 be a training set of multi-label examples. To learn classiﬁers in all nodes of a tree T, we need to properly ﬁlter training examples to estimate T (x, t) (line 5). Moreover, we need to use a learning algorithm A that trains a class probability estimator ˆ T (x, t) for each node t in the tree. The training algorithm returns a set of probability estimation classiﬁers Q.

The learning time complexity of PLTs can be expressed in terms of the number of nodes in which an original training example (x, y) is used. Since the training example is used in a node t only if t is the root or zpa(t) = 1, this number is upper bounded by s b k + 1, where b and k denote the degree and height of the tree, respectively, and s is the number of positive labels in y. For sparse labels, this value is much lower than m. Note that learning can be performed simultaneously for all nodes, and each node classiﬁer can be trained using online methods, such as stochastic gradient descent (Bottou, 2010).

Interestingly, in the case of sparse features, the space complexity can be signiﬁcantly reduced as well. Admittedly, the number of models is the highest for binary trees and can be as high as 2m 1 (notice that the size of a tree with m leaves is upper bounded by 2m 1). This is twice

Algorithm 3 PLT.TRAIN(T, A, Dn)

1: Q = ; 2: for each node t 2 T do 3: D0 = ; 4: for i = 1 ! n do 5: if t is root or zpa(t) = 1 then 6: zt = JP

j2L(t) yij 1K 7: D0 = D0 [ (xi, zt)

8: ˆ T (x, t) = A(D0), Q = Q [ ˆ T (x, t)

9: return a set of probability estimation classiﬁers Q.

Algorithm 4 PLT.PREDICT(x, T, Q, )

1: ˆy = 0m, Q = ;, Q.add(r(T), ˆ T (x, r(T))) 2: while Q 6= ; do 3: (t, pt) = pop(Q) 4: if pt then 5: if t is a leaf node then 6: ˆyt = 1 7: else 8: for c 2 Ch(t) do 9: Q.add(c, pt ˆ T (x, c))

10: return ˆy.

the number of models in the simplest 1-vs-all approach. Paradoxically, the space complexity can be the lowest at this upper bound. This is because only the sibling nodes need to share the same features, while no other features are needed to build corresponding classiﬁers. Therefore, only those (few) features needed to describe the sibling labels have to be used in the models. If the space complexity still exceeds the available memory, one can always use feature hashing over all nodes (Weinberger et al., 2009).

Prediction with probabilistic label trees relies on estimating (9) by traversing the tree from the root to the leaf nodes. However, if the intermediate value of this product in node t, denoted by pt, is less than a given threshold , then the subtree starting in node t is no longer explored. For the sake of completeness, we shortly describe this procedure (see Algorithm 4). We start with setting ˆy = 0m. In order to traverse a tree, we initialize a queue Q to which we add the root node r T with its posterior estimate ˆ T (x, r(T)). In the while loop, we iteratively pop a node from Q and compute pt. If pt , we either set ˆyj = 1 if t is a leaf, or otherwise add its child nodes to Q with the value pt multiplied by their posterior estimates ˆ T (x, c), c 2 Ch(t). If Q is empty, we stop the search and return ˆy.

Traversing a label tree can be much cheaper than querying m independent classiﬁers, one for each label. If there is only one label exceeding the threshold, PLT ideally needs to call only bk + 1 classiﬁers (all classiﬁers on a path from

Extreme F-Measure Maximization using Sparse Probability Estimates

the root to a leaf plus all classiﬁers in siblings of the path nodes). Of course, in the worst-case scenario, the entire tree might be explored, but even then, no more than 2m 1 calls are required (with m leaves, the size of the tree is upper bounded by 2m 1). In the case of sparse label sets, PLTs can signiﬁcantly speed up the classiﬁcation procedure. The expected cost of prediction depends, however, on the tree structure and accuracy of node classiﬁers.

Note that PLTs can be used with any value as a threshold. Moreover, by considering a separate threshold t in each node t, one can use a different threshold for each label j on posterior estimates ˆ (x, j). It is enough to set a threshold in the parent node t as follows: t = minj2Ch(t) j. In this way, PLTs can efﬁciently obtain SPEs for any in STO and FTA (Algorithm 1), and any in OFO (Algorithm 2). PLTs can easily be tailored for Precision@K, too. To predict the top labels, it is enough to change Q to a priority queue and stop the prediction procedure after a given number of top labels.

Let us ﬁnally underline that PLTs obey strong theoretical guarantees. In Appendix A, we derive a surrogate regret bound showing that the overall error of PLTs is reduced by improving the node classiﬁers. For optimal node classiﬁers, we obtain optimal multi-label classiﬁers in terms of estimation of posterior probabilities (x, j).

5.3. Fast XML

As an example of another efﬁcient sparse probability estimator, we recall FASTXML, which adapts the idea of standard decision trees (Breiman et al., 1984) to the XMLC setting. To improve predictive performance, FASTXML uses an ensemble of decision trees. Internal nodes in the trees contain sparse linear classiﬁers, which are trained to optimize an n DCG-based ranking loss. The authors show that this optimization can be performed very efﬁciently. The most costly step is L1-regularized logistic regression solved by the new GLMNET algorithm (Yuan et al., 2012) implemented in the Lib Linear package (Fan et al., 2008). The height of trees in FASTXML scales logarithmically with the number of training examples (though it should be noted that different stopping rules for growing a tree can be used).

Sparse predictions of class probabilities, a major prerequistie for efﬁcient macro F-measure maximization, are produced by FASTXML in a quite natural way. This is because, during training, the nodes are recursively partitioned till each leaf contains only a small number of training examples. Thus, only a small number of active labels is assigned to each leaf. During prediction, a test example passes down the tree until it reaches a leaf node. The prediction procedure can then focus exclusively on the label distribution in the leaf node. FASTXML uses a standard prediction procedure for decision trees that relies on the relative frequencies of labels in the leaf node reached by the test example. These relative frequencies

can be treated as CPEs. Since FASTXML is an ensemble, the leaf node label distributions are averaged over all trees in the ensemble. Eventually, CPEs are thus only delivered for a small number of labels for a given test example. For the remaining labels, the probabilities are assumed to be 0. By using thresholds and in STO, FTA, and OFO, respectively, the CPEs returned by FASTXML can be treated as SPEs.

5.4. PLT vs. Fast XML

Both FASTXML and PLTs achieve fast prediction time thanks to exploiting a tree-structure. A PLT is a single label tree with linear classiﬁers in each node, while FASTXML makes use of an ensemble of regular trees with linear splits. Notice that the height of a PLT scales logarithmically with the number of labels, while the size of a tree in FASTXML is logarithmic in the number of training examples. Therefore, depending on the problem, the trees may differ in size.

For a single tree, FASTXML follows only a single path from the root to a leaf. Each leaf node corresponds to a region in the feature space and returns a label distribution for this region. By assuming that, in a certain region, only a small number of labels have non-zero conditional probability, FASTXML is able to efﬁciently deliver SPEs. PLTs in turn may explore several paths from the root to the leaves. By using a threshold on probability estimates, each internal node decides whether to go down the tree or stop exploration. Therefore, PLTs can efﬁciently deliver all labels the CPEs of which exceed given thresholds.

Another difference is that, while FASTXML is training the structure of the tree, PLT assumes a predeﬁned structure (or uses a random tree). Of course, just like conditional probability trees (Beygelzimer et al., 2009a), PLTs could in principle be combined with an online tree learning procedure. Let us remark, however, that a predeﬁned tree structure allows for training the node classiﬁers independently of each other, thereby making the whole training process amenable to massive parallelization.

In the XMLC setting, only a few methods suggest themselves for a combination with threshold tuning methods, such as STO, FTA and OFO. It seems that PLTs and FASTXML are among the best options. Some algorithms are also able to deliver SPEs, but not always efﬁciently. For example, the 1-vs-all approach scales linearly with m, which is too expen-

sive in many applications. Even methods that rely on fast training, like negative sampling (Collobert & Weston, 2008) or fast label embeddings (Mineiro & Karampatziakis, 2015), still exhibit linear cost for prediction. To deliver sparse probability estimates efﬁciently, additional structures, such as trees (Bengio et al., 2010; Weston et al., 2013) or locality sensitive hashing (Shrivastava & Li, 2014), are required.

Extreme F-Measure Maximization using Sparse Probability Estimates

6. Experiments

We carried out two sets of experiments. In the ﬁrst, we verify the effectiveness of PLTs in handling a large number of labels by comparing its performance to that of FASTXML in terms of Precision@K. In the second experiment, we combine PLTs and FASTXML with the threshold tuning methods, namely with FTA, STO and OFO, for maximizing the macro F-measure. In both experiments we used six large-scale datasets taken from the Extreme Classiﬁcation Repository2 with predeﬁned train/test splits (see main statistics of these datasets in Table 1).

To perform the experiments we implemented PLTs in Java.3 We used random complete b-ary trees (where b is a hyperparameter) with L2-logistic regression as a node classiﬁer, trained by a variant of stochastic gradient descent introduced by Duchi & Singer (2009) (see Appendix B for details). To deal with a large number of weights, we used feature hashing shared over all tree nodes with 230 hash values. In the ﬁrst experiment, we tuned the hyperparameters of PLTs in a simple 80/20 validation on the training set. We applied the off-the-shelf hyperparameter optimizer

SMAC (Hutter et al., 2011) with a wide range of parameters, reported in Appendix C. In the second experiment, we used the values of the hyperparameters found to be optimal in the ﬁrst experiment. For FASTXML, we used the C++ code delivered by Prabhu & Varma (2014), as well as the hyperparameters suggested by them.

6.1. Sparse probability estimators

In the ﬁrst experiment, we evaluate the algorithms in terms of Precision@K (for K = 1, 3, 5) and computational cost. The results are shown in Table 1. The performance of PLTs is on a par with that of FASTXML, getting better results on half of the datasets. It also seems that PLTs are more efﬁcient in handling very sparse labels, because their training time is typically smaller for those datasets where the number of labels per instance is small (i.e., < 6). This can be explained by the fact that only a very small part of the label tree is updated in case of only a few positive labels (see Algorithm 4). Notice, however, that the comparison is not completely fair as both algorithms are implemented using different technologies and coding styles. In the next subsection, we more carefully analyze test time complexity, focusing on the number of computed inner products per test example.

6.2. Extreme macro F-measure optimization

In the second experiment, we test the threshold tuning algorithms described in Section 4. We use 80% of each dataset for

2http://research.microsoft.com/enus/um/people/manik/downloads/XC/XMLRepository.html

3Get code at https://github.com/busarobi/XMLC.

training PLTs and FASTXML, and then run FTA, STO and OFO on the remaining 20%. The latter part of the training set is also used to validate the input parameters of the threshold tuning algorithms. For the vector , the input parameter of STO and FTA, we ﬁrst compute the lower bound j of the optimal threshold according to (5), i.e., j = b j/(b j + 1), with b j the prior probability estimate for label j. Then, each element of is set to max(1/c, j), where c 2 C = {10000, 1000, 200, 100, 50, 20, 10, 7, 5, 4, 3, 2}. Similarly, the input parameter b of OFO is tuned over the same set C, while its other input parameter a is constantly set to 1. We additionally carried out experiments for assessing the impact of parameter a (see results in Appendix D), which slightly improves the results. We also control the thresholds in OFO to be greater than the lower bound j.

In Figure 1, the validation and test macro F-scores of PLTs and FASTXML are plotted against various input parameter values in the form of 1/c. As one would expect, STO outperforms both FTA and OFO in terms of F-score on the validation set for high values of c (see the dashed lines on the plots). This is not surprising, since as j goes to 0, the probability estimates are getting less sparse. In particular, for j = 0, STO ﬁnds the optimal threshold on the validation set. Therefore, can be used to trade computational complexity off against approximation quality.

More surprisingly, the difference between the validation and test performance of STO is substantial. This can be explained by the fact that only 20% of the training data was used for tuning the thresholds, and this data is not guaranteed to contain positive instances for every label. In case of no positive examples for a label and no predicted positives either the F-score is set to 1 as suggested by Lewis (1995). In turn, FTA and OFO seem to be more efﬁcient in handling sparse probability estimate and there is no such difference in performance on validation and test sets.

The optimal parameters of the threshold tuning algorithms were chosen based on the validation performance (dashed lines in Figure 1). The corresponding test scores are shown in Table 2. Based on the results, STO is clearly outperformed by either FTA or OFO. We explain the poor performance of sorting-based optimization by its susceptibility to overﬁtting for sparse classes, whereas FTA and especially OFO seem to be more robust in the extreme learning regime.

In extreme classiﬁcation, the computational requirements for evaluating the model on test examples are not less important than the predictive performance of the model. Therefore, we also study the average (per example) wall-clock test times and number of inner products computed by the trained models. We believe that the number of inner products is a more objective measure of the computational complexity, because PLTs and FASTXML can be viewed as an ensemble of linear classiﬁers, whose main computational effort

Extreme F-Measure Maximization using Sparse Probability Estimates

Table 1. Main characteristics of the datasets used in the experiment and Precision@K scores with K = {1, 3, 5} for PLTs and FASTXML

along with their training time in minutes. The highest Precision@K values are in bold for every dataset.

Main statistics of datasets PLT FASTXML #labels #features #test #train inst./lab. lab./inst. P@1 P@3 P@5 Time P@1 P@3 P@5 Time

RCV1 2456 47236 155962 623847 1218.56 4.79 90.46 72.4 51.86 64 91.13 73.35 52.67 78 Amazon Cat 13330 203882 306782 1186239 448.57 5.04 91.47 75.84 61.02 96 92.95 77.50 62.51 561 Wiki10 30938 101938 6616 14146 8.52 18.64 84.34 72.34 62.72 290 81.71 66.67 56.70 16 Delicious 205443 782585 100095 196606 72.29 75.54 45.37 38.94 35.88 1327 42.81 38.76 36.34 458 Wiki LSHTC 325056 1617899 587084 1778351 17.46 3.19 45.67 29.13 21.95 653 49.35 32.69 24.03 724 Amazon 670091 135909 153025 490449 3.99 5.45 36.65 32.12 28.85 54 34.24 29.3 26.12 422

Table 2. The macro F-measure on the test set for the best hyperparameter values obtained on a validation set and the average (per example) wall-clock test time (in milliseconds) and number of inner products computed by PLTs and FASTXML on the test sets. The numbers in bold indicate the best macro F-score achieved on each dataset.

Macro F-score on test set Test time in millisec. Num. of inner products PLT FASTXML PLT FASTXML PLT FASTXML Dataset FTA STO OFO FTA STO OFO FTA STO OFO FTA STO OFO

RCV1 20.41 21.16 21.56 17.04 19.58 18.93 0.33 0.78 0.19 0.96 252 354 212 747 Amazon Cat 34.83 31.64 33.13 41.07 37.28 42.46 0.37 2.23 0.85 1.14 225 2272 711 871 Wiki10 29.98 24.02 30.28 29.88 28.26 29.51 1.69 2.18 1.87 3.00 534 13682 2518 541 Delicious-200K 11.12 10.96 11.20 11.18 10.83 11.19 0.61 17.41 4.11 3.86 216 29067 5610 739 Wiki LSHTC 12.31 16.22 14.00 21.24 20.41 20.84 0.22 13.46 4.54 1.17 175 28448 2875 900 Amazon 51.77 46.94 51.28 52.86 47.53 50.44 0.29 5.54 0.54 1.39 132 5557 125 796

1/c 10-4 10-3 10-2 10-1 100

Macro F-score

FTA test FTA valid STO test STO valid OFO test OFO valid

1/c 10-4 10-3 10-2 10-1 100

Macro F-score

0.7 Amazon Cat

FTA test FTA valid STO test STO valid OFO test OFO valid

1/c 10-4 10-3 10-2 10-1 100

Macro F-score

FTA test FTA valid STO test STO valid OFO test OFO valid

1/c 10-4 10-3 10-2 10-1 100

Macro F-score

0.4 Delicious-200K

FTA test FTA valid STO test STO valid OFO test OFO valid

1/c 10-4 10-3 10-2 10-1 100

Macro F-score

0.6 Wiki LSHTC

FTA test FTA valid STO test STO valid OFO test OFO valid

1/c 10-4 10-3 10-2 10-1 100

Macro F-score

FTA test FTA valid STO test STO valid OFO test OFO valid

1/c 10-4 10-3 10-2 10-1 100

Macro F-score

FTA test FTA valid STO test STO valid OFO test OFO valid

1/c 10-4 10-3 10-2 10-1 100

Macro F-score

0.7 Amazon Cat

FTA test FTA valid STO test STO valid OFO test OFO valid

1/c 10-4 10-3 10-2 10-1 100

Macro F-score

FTA test FTA valid STO test STO valid OFO test OFO valid

1/c 10-4 10-3 10-2 10-1 100

Macro F-score

0.4 Delicious-200K

FTA test FTA valid STO test STO valid OFO test OFO valid

1/c 10-4 10-3 10-2 10-1 100

Macro F-score

0.6 Wiki LSHTC

FTA test FTA valid STO test STO valid OFO test OFO valid

1/c 10-4 10-3 10-2 10-1 100

Macro F-score

FTA test FTA valid STO test STO valid OFO test OFO valid

Figure 1. The macro F-measure obtained by STO, FTA and OFO with PLTs (top) and FASTXML (bottom). The dashed and solid lines

represent the performance of the methods on the validation and test set for various initial parameters, respectively.

consists of computing inner products.

The average per test example wall-clock time and number of inner products are shown in Table 2. We used the set of thresholds that were found to be the best in the validation process described above. Remark that for FASTXML the number of inner products does not depend on a tunning method (i.e., for each method only one path from the root to a leaf node is visited). The results reveal some general trends. First, PLTs along with the thresholds found by FTA have the smallest computation time except on RCV1. Second, PLTs with thresholds found by OFO outperforms FASTXML on three datasets. Third, the computational time for STO is the highest in almost every case; since its performance is usually lower than that of OFO and FTA, this suggests that the thresholds found by STO are in general underestimated. Remark that FASTXML uses 50 trees, while in PLTs a single tree is explored. Moreover, the height of a tree in FASTXML

scales with the number of training examples. Therefore, a FASTXML tree can in some cases be larger than a PLT.

7. Conclusion

We addressed the problem of macro-F measure maximization in XMLC and combined three tuning methods (STO, FTA, OFO) with two efﬁcient sparse probability estimators (PLTs and FASTXML). It seems that online F-measure optimization (OFO) used with either of the two classiﬁers yields the most promising results. As a next step, we plan to generalize the results presented here to other complex performance measures and work further on different types of sparse probability estimators.

Extreme F-Measure Maximization using Sparse Probability Estimates

Acknowledgments

Kalina Jasinska and Krzysztof Dembczy nski are supported by the Polish National Science Centre under grant no. 2013/09/D/ST6/03917.

Agarwal, S. Surrogate regret bounds for bipartite ranking

via strongly proper losses. JMLR, 15:1653 1674, 2014.

Agrawal, R., Gupta, A., Prabhu, Y., and Varma, M. Multi-

label learning with millions of labels: Recommending advertiser bid phrases for web pages. In WWW, pp. 13 24. ACM, 2013.

Bengio, S., Weston, J., and Grangier, D. Label embedding

trees for large multi-class tasks. In NIPS 24, pp. 163 171, 2010.

Beygelzimer, A., Langford, J., Lifshits, Y., Sorkin, G., and

Strehl, A. Conditional probability tree estimation analysis and algorithms. In UAI, pp. 51 58, 2009a.

Beygelzimer, A., Langford, J., and Ravikumar, P. Error-

correcting tournaments. In ALT, pp. 247 262. Springer, 2009b.

Bhatia, K., Jain, H., Kar, P., Varma, M., and Jain, P. Sparse

local embeddings for extreme multi-label classiﬁcation. In NIPS 29, pp. 730 738, 2015.

Bottou, L. Large-scale machine learning with stochastic

gradient descent. In COMPSTAT, pp. 177 187. Springer, 2010.

Bottou, L. Stochastic gradient tricks. In Neural Networks,

Tricks of the Trade, Reloaded, pp. 430 445. Springer,

Breiman, L., Friedman, J., Olshen, R., and Stone, C.

Classiﬁcation and Regression Trees. Wadsworth and Brooks, Monterey, CA, 1984.

Busa-Fekete, R., Sz or enyi, B., Dembczy nski, K., and

H ullermeier, E. Online F-measure optimization. In NIPS 29, pp. 595 603, 2015.

Choromanska, A. and Langford, J. Logarithmic time online

multiclass prediction. In NIPS 29, pp. 55 63, 2015.

Cisse, M., Usunier, N., Arti eres, T., and Gallinari, P. Robust

Bloom ﬁlters for large multilabel classiﬁcation tasks. In NIPS 26, pp. 1851 1859, 2013.

Collobert, R. and Weston, J. A uniﬁed architecture for natural language processing: Deep neural networks with multitask learning. In ICML, pp. 160 167, 2008.

Dekel, O. and Shamir, O. Multiclass-multilabel classiﬁca-

tion with more classes than examples. In AISTATS, pp. 137 144, 2010.

Dembczy nski, K., Cheng, W., and H ullermeier, E. Bayes

optimal multilabel classiﬁcation via probabilistic classiﬁer chains. In ICML, pp. 279 286, 2010.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei,

L. Image Net: A large-scale hierarchical image database. In CVPR, pp. 248 255, 2009.

Deng, J., Satheesh, S., Berg, A. C., and Fei-Fei, L. Fast

and balanced: Efﬁcient label tree learning for large scale object recognition. In NIPS 24, pp. 567 575, 2011.

Duchi, J. and Singer, Y. Efﬁcient online and batch learning

using forward backward splitting. JMLR, 10:2899 2934, 2009.

Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin,

C.-J. LIBLINEAR: A library for large linear classiﬁcation. JMLR, 9:1871 1874, 2008.

Hutter, F., Hoos, H., and Leyton-Brown, K. Sequential

model-based optimization for general algorithm conﬁguration. In Learning and Intelligent Optimization, pp. 507 523. Springer, 2011.

Kotłowski, W. and Dembczynski, K. Surrogate regret bounds for generalized classiﬁcation performance metrics. In ACML, 2015.

Koyejo, S., Natarajan, N., Ravikumar, P., and Dhillon,

I. Consistent binary classiﬁcation with generalized performance metrics. In NIPS 27, pp. 2744 2752, 2014.

Koyejo, S., Natarajan, N., Ravikumar, P., and Dhillon, I.

Consistent multilabel classiﬁcation. In NIPS 29, pp. 3321 3329, 2015.

Lewis, D. Evaluating and optimizing autonomous text classiﬁcation systems. In SIGIR, pp. 246 254, 1995.

Mineiro, P. and Karampatziakis, N. Fast label embeddings

via randomized linear algebra. In ECML/PKDD 2015, pp. 37 51, 2015.

Morin, F. and Bengio, Y. Hierarchical probabilistic neural

network language model. In AISTATS, pp. 246 252, 2005.

Narasimhan, H., Vaish, R., and S., Agarwal. On the statistical

consistency of plug-in classiﬁers for non-decomposable performance measures. In NIPS 27, pp. 1493 1501, 2014.

Partalas, I., Kosmopoulos, A., Baskiotis, N., Arti eres, T.,

Paliouras, G., Gaussier, E., Androutsopoulos, I., Amini, M.-R., and Gallinari, P. LSHTC: A benchmark for large-scale text classiﬁcation. Co RR, 2015.

Extreme F-Measure Maximization using Sparse Probability Estimates

Prabhu, Y. and Varma, M. Fast XML: A fast, accurate and

stable tree-classiﬁer for extreme multi-label learning. In KDD, pp. 263 272. ACM, 2014.

Shrivastava, A. and Li, P. Asymmetric LSH (ALSH) for

sublinear time maximum inner product search (MIPS). In NIPS 27, pp. 2321 2329, 2014.

Tsoumakas, G., Katakis, I., and Vlahavas, I. Effective and

efﬁcient multilabel classiﬁcation in domains with large number of labels. In Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data, 2008.

Weinberger, K.Q., Dasgupta, A., Langford, J., Smola, A.,

and Attenberg, J. Feature hashing for large scale multitask learning. In ICML, pp. 1113 1120. ACM, 2009.

Weston, J., Makadia, A., and Yee, H. Label partitioning for

sublinear ranking. In ICML, pp. 181 189, 2013.

Ye, N., Chai, A., Lee, W., and Chieu, H. Optimizing F-measures: A tale of two approaches. In ICML, 2012.

Yuan, G.-X., Ho, C.-H., and Lin, C.-J. An improved GLMNET for L1-regularized logistic regression. JMLR, 13:1999 2030, 2012.

Zhao, M.-J., Edakunni, N., Pocock, A., and Brown, G.

Beyond Fano s inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications. JMLR, 14:1033 1090, 2013.