# Reliable Multilabel Classification: Prediction with Partial Abstention

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Vu-Linh Nguyen, Eyke Hüllermeier
Heinz Nixdorf Institute and Department of Computer Science, Paderborn University, Germany
vu.linh.nguyen@uni-paderborn.de, eyke@upb.de

**Abstract.** In contrast to conventional (single-label) classification, the setting of multilabel classification (MLC) allows an instance to belong to several classes simultaneously. Thus, instead of selecting a single class label, predictions take the form of a subset of all labels. In this paper, we study an extension of the setting of MLC in which the learner is allowed to partially abstain from a prediction, that is, to deliver predictions on some but not necessarily all class labels. We propose a formalization of MLC with abstention in terms of a generalized loss minimization problem and present first results, both theoretical and experimental, for the case of the Hamming loss, the rank loss, and the F-measure.

## 1 Introduction

In statistics and machine learning, classification with abstention, also known as classification with a reject option, is an extension of the standard setting of classification in which the learner is allowed to refuse a prediction for a given query instance. Research on this setting dates back to early work by Chow (1970) and Hellman (1970), and it remains an important topic today, most notably for binary classification (Bartlett and Wegkamp 2008; Cortes, DeSalvo, and Mohri 2016; Franc and Prusa 2019; Grandvalet et al. 2008). For the learner, the main reason to abstain is a lack of certainty about the corresponding outcome: refusing, or at least deferring, a decision might then be better than taking a high risk of a wrong decision.
Nowadays, there are many machine learning problems in which complex, structured predictions are sought (instead of scalar values, as in classification and regression). For such problems, the idea of abstaining from a prediction can be generalized toward partial abstention: instead of predicting the entire structure, the learner predicts only parts of it, namely those for which it is certain enough. This idea has already been realized, e.g., for the case where predictions are rankings (Cheng et al. 2010; 2012).

Another important example is multilabel classification (MLC), in which an outcome associated with an instance is a labeling in the form of a subset of an underlying reference set of class labels (Dembczyński et al. 2012; Tsoumakas, Katakis, and Vlahavas 2009; Zhang and Zhou 2014). In this paper, we study an extension of the setting of MLC in which the learner is allowed to partially abstain from a prediction, that is, to deliver predictions on some but not necessarily all class labels (or, more generally, to refuse committing to a single complete prediction). Although MLC has been studied extensively in the machine learning literature in the recent past, there is surprisingly little work on MLC with abstention so far; a notable exception is (Pillai, Fumera, and Roli 2013), to which we will return in Section 7.

Prediction with abstention is typically realized as a two-stage approach. First, the learner delivers a prediction that provides information about its uncertainty. Then, taking this uncertainty into account, a decision is made about whether or not to predict, or on which parts. In binary classification, for example, a typical approach is to produce probabilistic predictions and to abstain whenever the probability is close to 1/2.
We adopt a similar approach, in which we rely on probabilistic MLC, i.e., probabilistic predictions of labelings. In the next section, we briefly recall the setting of multilabel classification. The generalization toward MLC with (partial) abstention is then introduced and formalized in Section 3. Instantiations of the setting of MLC with abstention for the specific cases of the Hamming loss, rank loss, and F-measure are studied in Sections 4–6, respectively. Related work is discussed in Section 7. Finally, experimental results are presented in Section 8, prior to concluding the paper in Section 9. All formal results in this paper (propositions, remarks, corollaries) are stated without proofs, which are deferred to the supplementary material.

## 2 Multilabel Classification

In this section, we describe the MLC problem in more detail and formalize it within a probabilistic setting. Along the way, we introduce the notation used throughout the paper.

Let $\mathcal{X}$ denote an instance space, and let $\mathcal{L} = \{\lambda_1, \ldots, \lambda_m\}$ be a finite set of class labels. We assume that an instance $x \in \mathcal{X}$ is (probabilistically) associated with a subset of labels $\Lambda = \Lambda(x) \in 2^{\mathcal{L}}$; this subset is often called the set of relevant labels, while the complement $\mathcal{L} \setminus \Lambda$ is considered as irrelevant for $x$. We identify a set $\Lambda$ of relevant labels with a binary vector $y = (y_1, \ldots, y_m)$, where $y_i = [\![\lambda_i \in \Lambda]\!]$.¹ By $\mathcal{Y} = \{0, 1\}^m$ we denote the set of possible labelings.

We assume observations to be realizations of random variables generated independently and identically (i.i.d.) according to a probability distribution $p$ on $\mathcal{X} \times \mathcal{Y}$, i.e., an observation $y = (y_1, \ldots, y_m)$ is the realization of a corresponding random vector $Y = (Y_1, \ldots, Y_m)$. We denote by $p(Y \mid x)$ the conditional distribution of $Y$ given $X = x$, and by $p_i(Y_i \mid x)$ the corresponding marginal distribution of $Y_i$:

$$p_i(b \mid x) = \sum_{y \in \mathcal{Y}:\, y_i = b} p(y \mid x) . \quad (1)$$

A multilabel classifier $h$ is a mapping $\mathcal{X} \to \mathcal{Y}$ that assigns a (predicted) label subset to each instance $x \in \mathcal{X}$.
Thus, the output of a classifier $h$ is a vector $\hat{y} = h(x) = (h_1(x), \ldots, h_m(x))$. Given training data in the form of a finite set of observations $(x, y) \in \mathcal{X} \times \mathcal{Y}$, drawn independently from $\Pr(X, Y)$, the goal in MLC is to learn a classifier $h: \mathcal{X} \to \mathcal{Y}$ that generalizes well beyond these observations in the sense of minimizing the expected risk with respect to a specific loss function.

In the literature, various MLC loss functions have been proposed, including the Hamming loss, the subset 0/1 loss, the F-measure, the Jaccard measure, and the rank loss. The Hamming loss is given by

$$\ell_H(y, \hat{y}) = \sum_{i=1}^{m} [\![ y_i \neq \hat{y}_i ]\!] , \quad (2)$$

and the subset 0/1 loss by $\ell_S(y, \hat{y}) = [\![ y \neq \hat{y} ]\!]$. Thus, both losses generalize the standard 0/1 loss commonly used in classification, but in very different ways. Hamming and subset 0/1 are prototypical examples of what is called a (label-wise) decomposable and non-decomposable loss, respectively (Dembczyński et al. 2012). A decomposable loss can be expressed in the form

$$\ell(y, \hat{y}) = \sum_{i=1}^{m} \ell_i(y_i, \hat{y}_i) \quad (3)$$

with suitable binary loss functions $\ell_i: \{0,1\}^2 \to \mathbb{R}$, whereas a non-decomposable loss does not permit such a representation. It can be shown that, to produce optimal predictions $\hat{y} = h(x)$ minimizing expected loss, knowledge about the marginals $p_i(Y_i \mid x)$ is enough in the case of a decomposable loss, but not in the case of a non-decomposable loss (Dembczyński et al. 2012). Instead, if a loss is non-decomposable, higher-order probabilities are needed, and in the extreme case even the entire distribution $p(Y \mid x)$ (as in the case of the subset 0/1 loss). On an algorithmic level, this means that MLC with a decomposable loss can be tackled by what is commonly called binary relevance (BR) learning (i.e., learning one binary classifier for each label individually), whereas non-decomposable losses call for more sophisticated learning methods that are able to take label dependencies into account.

¹ $[\![\cdot]\!]$ is the indicator function, i.e., $[\![A]\!] = 1$ if the predicate $A$ is true and $= 0$ otherwise.
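The difference between decomposable and non-decomposable losses can be made concrete with a minimal sketch (the joint distribution below is hypothetical): the Hamming-optimal prediction follows from the marginals (1) alone, while the subset-0/1-optimal prediction is the mode of the joint distribution, and the two can disagree.

```python
from itertools import product

# A toy conditional distribution p(y | x) over Y = {0, 1}^2
# (hypothetical numbers, chosen so that the two risk minimizers differ).
joint = {(0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.3}
m = 2

# Marginals p_i(1 | x) via Eq. (1): sum p(y | x) over all y with y_i = 1.
marginals = [sum(p for y, p in joint.items() if y[i] == 1) for i in range(m)]

# Hamming loss is decomposable: the risk minimizer thresholds each marginal.
y_hamming = tuple(int(p_i > 0.5) for p_i in marginals)

# Subset 0/1 loss is non-decomposable: the risk minimizer is the joint mode.
y_subset = max(product([0, 1], repeat=m), key=lambda y: joint.get(y, 0.0))

print(marginals)   # approximately [0.6, 0.7]
print(y_hamming)   # (1, 1)
print(y_subset)    # (0, 1)
```

Here the marginals favor relevance for both labels, yet the single most probable labeling differs, illustrating why the subset 0/1 loss requires the entire distribution.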
## 3 MLC with Abstention

In our generalized setting of MLC with abstention, which is introduced in this section, the classifier is allowed to produce partial predictions

$$\hat{y} = h(x) \in \mathcal{Y}_{pa} := \{0, \bot, 1\}^m , \quad (4)$$

where $\hat{y}_i = \bot$ indicates an abstention on the label $\lambda_i$. We denote by $A(\hat{y}) \subseteq [m] := \{1, \ldots, m\}$ and $D(\hat{y}) := [m] \setminus A(\hat{y})$ the sets of indices $i$ for which $\hat{y}_i = \bot$ and $\hat{y}_i \in \{0, 1\}$, respectively, that is, the indices on which the learner abstains and decides to predict.

### 3.1 Risk Minimization

To evaluate a reliable multilabel classifier, a generalized loss function

$$L: \mathcal{Y} \times \mathcal{Y}_{pa} \to \mathbb{R}_+ \quad (5)$$

is needed, which compares a partial prediction $\hat{y}$ with a ground-truth labeling $y$. Given such a loss, and assuming a probabilistic prediction for a query instance $x$, i.e., a probability $p(\cdot \mid x)$ on the set of labelings (or at least an estimation thereof), the problem of risk minimization comes down to finding

$$\hat{y} \in \operatorname*{argmin}_{\hat{y} \in \mathcal{Y}_{pa}} \mathbf{E}\, L(y, \hat{y}) = \operatorname*{argmin}_{\hat{y} \in \mathcal{Y}_{pa}} \sum_{y \in \mathcal{Y}} L(y, \hat{y})\, p(y \mid x) . \quad (6)$$

The concrete form of this optimization problem, as well as its difficulty, depends on several choices, including the underlying MLC loss function $\ell$ and its extension $L$.

### 3.2 Generalized Loss Functions

On the basis of a standard MLC loss $\ell$, a generalized loss function (5) can be derived in different ways, also depending on how to penalize the abstention. Further below, we propose a generalization based on an additive penalty. Before doing so, we discuss some general properties that might be of interest for generalized losses.

As a first property, we should expect a generalized loss $L$ to reduce to its conventional version $\ell$ in the case of no abstention, i.e., $L(y, \hat{y}) = \ell(y, \hat{y})$ whenever $\hat{y}$ is a precise prediction $\hat{y} \in \mathcal{Y}$. Needless to say, this is a property that every generalized loss should obey.

**Monotonicity.**
Another reasonable property is monotonicity: the loss should only increase (or at least not decrease) when (i) turning a correct prediction on a label $\lambda_i$ into an abstention or an incorrect prediction, or (ii) turning an abstention into an incorrect prediction. This reflects the following chain of preferences: a correct prediction is better than an abstention, which in turn is better than an incorrect prediction. More formally, for a ground-truth labeling $y$ and a partial prediction $\hat{y}^1$, let $C_1, A_1 \subseteq \mathcal{L}$ denote the subsets of labels on which the prediction is correct and on which the learner abstains, respectively, and define $C_2, A_2 \subseteq \mathcal{L}$ analogously for a prediction $\hat{y}^2$. Then

$$(C_2 \subseteq C_1) \wedge (C_2 \cup A_2 \subseteq C_1 \cup A_1) \;\Rightarrow\; L(y, \hat{y}^1) \le L(y, \hat{y}^2) . \quad (7)$$

**Uncertainty-alignment.** Intuitively, when producing a partial prediction, an optimal prediction rule is supposed to abstain on the most uncertain labels. More formally, consider a generalized loss function $L$ and a prediction $\hat{y}$ which, for a query $x \in \mathcal{X}$, is a risk-minimizer (6). Moreover, denoting by $p_i = p_i(1 \mid x)$ the (marginal) probability that label $\lambda_i$ is relevant for $x$, it is natural to quantify the degree of uncertainty on this label in terms of

$$u_i = 1 - 2\,|p_i - 1/2| = 2 \min(p_i, 1 - p_i) , \quad (8)$$

or any other function symmetric around $1/2$. We say that $\hat{y}$ is uncertainty-aligned if

$$\forall\, i \in A(\hat{y}),\; j \in D(\hat{y}): \; u_i \ge u_j .$$

Thus, a prediction is uncertainty-aligned if the following holds: whenever the learner decides to abstain on label $\lambda_i$ and to predict on label $\lambda_j$, the uncertainty on $\lambda_j$ cannot exceed the uncertainty on $\lambda_i$. We then call a loss function $L$ uncertainty-aligned if it guarantees the existence of an uncertainty-aligned risk-minimizer, regardless of the probability $p = p(\cdot \mid x)$.

**Additive Penalty for Abstention.** Consider the case of a partial prediction $\hat{y}$, and denote by $\hat{y}_D$ and $\hat{y}_A$ the projections of $\hat{y}$ to the entries in $D(\hat{y})$ and $A(\hat{y})$, respectively.
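As a small illustration (with hypothetical probabilities, and a placeholder `ABSTAIN` symbol standing in for $\bot$), the uncertainty degrees (8) and the uncertainty-alignment check for a given partial prediction can be sketched as follows:

```python
# Uncertainty degrees u_i = 1 - 2|p_i - 1/2| = 2 * min(p_i, 1 - p_i), Eq. (8),
# and a check of the uncertainty-alignment property of a partial prediction.
ABSTAIN = None  # placeholder for the abstention symbol

def uncertainty(p_i):
    return 2 * min(p_i, 1 - p_i)

def is_uncertainty_aligned(y_hat, p):
    u = [uncertainty(p_i) for p_i in p]
    A = [i for i, v in enumerate(y_hat) if v is ABSTAIN]      # abstained indices
    D = [i for i, v in enumerate(y_hat) if v is not ABSTAIN]  # predicted indices
    # Aligned iff every abstained label is at least as uncertain as every
    # predicted label.
    return all(u[i] >= u[j] for i in A for j in D)

p = [0.9, 0.55, 0.1, 0.45]  # hypothetical marginals
print(is_uncertainty_aligned((1, ABSTAIN, 0, ABSTAIN), p))  # True
print(is_uncertainty_aligned((ABSTAIN, 1, 0, 0), p))        # False
```

The first prediction abstains exactly on the two labels closest to 1/2 and is therefore aligned; the second abstains on a near-certain label while predicting on an uncertain one.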
As a natural extension of the original loss $\ell$, we propose a generalized loss of the form

$$L(y, \hat{y}) = \ell(y_D, \hat{y}_D) + f(A(\hat{y})) , \quad (9)$$

with $\ell(y_D, \hat{y}_D)$ the original loss on the part on which the learner predicts, and $f(A(\hat{y}))$ a penalty for abstaining on $A(\hat{y})$. The latter can be seen as a measure of the loss of usefulness of the prediction $\hat{y}$ due to its partiality, i.e., due to having no predictions on $A(\hat{y})$.

An important instantiation of (9) is the case where the penalty is a counting measure, i.e., where $f$ only depends on the number of abstentions:

$$L(y, \hat{y}) = \ell(y_D, \hat{y}_D) + f\big(|A(\hat{y})|\big) . \quad (10)$$

A special case of (10) is to penalize each abstention $\hat{y}_i = \bot$ with the same constant $c \in [0, 1]$, which yields

$$L(y, \hat{y}) = \ell(y_D, \hat{y}_D) + |A(\hat{y})| \cdot c . \quad (11)$$

Of course, instead of a linear function $f$, more general penalty functions are conceivable. For example, a practically relevant penalty is a concave function of the number of abstentions: each additional abstention causes a cost, so $f$ is monotone increasing in $|A(\hat{y})|$, but the marginal cost of abstention is decreasing.

**Proposition 1.** Let the loss $\ell$ be decomposable in the sense of (3), and let $\hat{y}$ be a risk-minimizing prediction (for a query instance $x$). The minimization of the expected loss (10) is then accomplished by

$$\hat{y} = \operatorname*{argmin}_{1 \le d \le m} \mathbf{E}\big(\ell(y_D, \hat{y}^d)\big) + f(m - d) , \quad (12)$$

where the prediction $\hat{y}^d$ is specified by the index set

$$D_d(\hat{y}^d) := \{\pi(1), \ldots, \pi(d)\} , \quad (13)$$

and the permutation $\pi$ sorts the labels in increasing order of the label-wise expected losses $s_i = \min_{\hat{y}_i \in \{0,1\}} \mathbf{E}\big(\ell_i(y_i, \hat{y}_i)\big)$, i.e., $s_{\pi(1)} \le \cdots \le s_{\pi(m)}$.

As shown by the previous proposition, a risk-minimizing prediction for a decomposable loss can easily be found in time $O(m \log m)$, simply by sorting the labels according to their contribution to the expected loss, and then finding the optimal size $d$ of the prediction according to (12).

## 4 The Case of Hamming Loss

This section presents first results for the case of the Hamming loss function (2).
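The sorting procedure behind Proposition 1 can be sketched as follows; this is a minimal illustration, not the authors' implementation, instantiated with the Hamming contributions $s_i = \min(p_i, 1 - p_i)$ as label-wise expected losses and the linear penalty $f(k) = c \cdot k$ of (11) (the probabilities and $c$ are hypothetical):

```python
# O(m log m) risk minimization for a decomposable loss with abstention
# penalty f, following the sorting scheme of Proposition 1.
ABSTAIN = None  # placeholder for the abstention symbol

def minimize_decomposable(p, f):
    m = len(p)
    s = [min(p_i, 1 - p_i) for p_i in p]          # label-wise expected losses
    order = sorted(range(m), key=lambda i: s[i])  # permutation pi: s ascending
    best_d, best_risk = None, float("inf")
    prefix = 0.0
    for d in range(1, m + 1):
        prefix += s[order[d - 1]]   # expected loss on the d predicted labels
        risk = prefix + f(m - d)    # plus penalty for the m - d abstentions
        if risk < best_risk:
            best_d, best_risk = d, risk
    y_hat = [ABSTAIN] * m
    for i in order[:best_d]:
        y_hat[i] = int(p[i] > 0.5)  # predict the likelier value on D
    return y_hat, best_risk

p = [0.9, 0.55, 0.1, 0.45]  # hypothetical marginals
c = 0.2                     # hypothetical per-abstention cost
y_hat, risk = minimize_decomposable(p, lambda k: c * k)
print(y_hat)  # [1, None, 0, None]
```

With the linear penalty, the result abstains exactly on the labels with $\min(p_i, 1 - p_i) > c$, which anticipates the closed-form rule stated for the Hamming loss below.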
In particular, we analyze extensions of the Hamming loss according to (10) and address the corresponding problem of risk minimization.

Given a query instance $x$, assume conditional probabilities $p_i = p(y_i = 1 \mid x)$ are given or made available by an MLC predictor $h$. In the case of Hamming, the expected loss of a prediction $\hat{y}$ is then given by

$$\mathbf{E}\big(\ell_H(y, \hat{y})\big) = \sum_{i:\, \hat{y}_i = 1} (1 - p_i) + \sum_{i:\, \hat{y}_i = 0} p_i ,$$

and minimized by $\hat{y}$ such that $\hat{y}_i = 0$ if $p_i \le 1/2$ and $\hat{y}_i = 1$ otherwise.

In the setting of abstention, we call a prediction $\hat{y}$ a $d$-prediction if $|D(\hat{y})| = d$. Let $\pi$ be a permutation of $[m]$ that sorts labels according to the uncertainty degrees (8), i.e., such that $u_{\pi(1)} \le u_{\pi(2)} \le \cdots \le u_{\pi(m)}$. As a consequence of Proposition 1, we obtain the following result.

**Corollary 1.** In the case of Hamming loss, let $\hat{y}$ be a risk-minimizing prediction (for a query instance $x$). The minimization of the expected loss (10) is then accomplished by

$$\hat{y} = \operatorname*{argmin}_{1 \le d \le m} \mathbf{E}\big(\ell_H(y, \hat{y}^d)\big) + f(m - d) , \quad (14)$$

where the prediction $\hat{y}^d$ is specified by the index set

$$D_d(\hat{y}^d) = \{\pi(1), \ldots, \pi(d)\} . \quad (15)$$

**Corollary 2.** The extension (10) of the Hamming loss is uncertainty-aligned. In the case of the extension (11) of the Hamming loss, the optimal prediction is given by (15) with $d = |\{i \in [m] \mid \min(p_i, 1 - p_i) \le c\}|$.

**Remark 1.** The extension (10) of the Hamming loss is monotonic, provided $f$ is non-decreasing and such that $f(k + 1) - f(k) \le 1$ for all $k \in [m - 1]$.

## 5 The Case of Rank Loss

In the case of the rank loss, we assume predictions in the form of rankings instead of labelings. Ignoring the possibility of ties, such a ranking can be represented in the form of a permutation $\pi$ of $[m]$, where $\pi(i)$ is the index $j$ of the label $\lambda_j$ on position $i$ (and $\pi^{-1}(j)$ the position of label $\lambda_j$).
The rank loss then counts the number of incorrectly ordered label pairs, that is, the number of pairs $\lambda_i, \lambda_j$ such that $\lambda_i$ is ranked worse than $\lambda_j$ although $\lambda_i$ is relevant while $\lambda_j$ is irrelevant:

$$\ell_R(y, \pi) = \sum_{(i,j):\, y_i > y_j} [\![ \pi^{-1}(i) > \pi^{-1}(j) ]\!] .$$

For each pair $(i, j)$ with $y_i > y_j$, either $\pi$ or its reversal $\bar{\pi}$ incurs an error, but not both. Therefore, $\ell_R(y, \pi) + \ell_R(y, \bar{\pi}) = c(y)$, where $c(y)$ denotes the number of pairs $(i, j)$ with $y_i > y_j$, and

$$\ell_R(y, \pi) - \ell_R(y, \bar{\pi}) = 2\,\ell_R(y, \pi) - c(y) . \quad (20)$$

Since $c(y)$ is a constant that does not depend on $\pi$, minimizing $\ell_R(y, \pi)$ (in expectation) is equivalent to minimizing the difference $\ell_R(y, \pi) - \ell_R(y, \bar{\pi})$.
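The complementarity between a ranking and its reversal, which underlies (20), can be checked numerically with a brute-force sketch on a toy labeling (the data below is hypothetical):

```python
from itertools import permutations

# Rank loss of a ranking pi (a permutation of label indices, best position
# first), plus a check that l_R(y, pi) + l_R(y, reversed pi) = c(y), where
# c(y) is the number of relevant/irrelevant label pairs.
def rank_loss(y, pi):
    pos = {label: position for position, label in enumerate(pi)}
    return sum(1 for i in range(len(y)) for j in range(len(y))
               if y[i] > y[j] and pos[i] > pos[j])

y = (1, 0, 1, 0)                                  # toy ground-truth labeling
c_y = sum(1 for i in range(4) for j in range(4) if y[i] > y[j])  # c(y) = 4

# Every pair with y_i > y_j is mis-ordered by exactly one of pi, reversed pi.
for pi in permutations(range(4)):
    assert rank_loss(y, pi) + rank_loss(y, pi[::-1]) == c_y
print("identity holds for all permutations")
```

Exhausting all 24 permutations confirms that each relevant/irrelevant pair is counted exactly once between a ranking and its reversal.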