# Reliable Multilabel Classification: Prediction with Partial Abstention

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Vu-Linh Nguyen, Eyke Hüllermeier
Heinz Nixdorf Institute and Department of Computer Science, Paderborn University, Germany
vu.linh.nguyen@uni-paderborn.de, eyke@upb.de

**Abstract.** In contrast to conventional (single-label) classification, the setting of multilabel classification (MLC) allows an instance to belong to several classes simultaneously. Thus, instead of selecting a single class label, predictions take the form of a subset of all labels. In this paper, we study an extension of the setting of MLC in which the learner is allowed to partially abstain from a prediction, that is, to deliver predictions on some but not necessarily all class labels. We propose a formalization of MLC with abstention in terms of a generalized loss minimization problem and present first results, both theoretical and experimental, for the case of the Hamming loss, the rank loss, and the F-measure.

## 1 Introduction

In statistics and machine learning, classification with abstention, also known as classification with a reject option, is an extension of the standard setting of classification in which the learner is allowed to refuse a prediction for a given query instance. Research on this setting dates back to early work by Chow (1970) and Hellman (1970), and it remains an important topic today, most notably for binary classification (Bartlett and Wegkamp 2008; Cortes, DeSalvo, and Mohri 2016; Franc and Prusa 2019; Grandvalet et al. 2008). For the learner, the main reason to abstain is a lack of certainty about the corresponding outcome: refusing, or at least deferring, a decision might then be better than taking a high risk of a wrong decision.
Nowadays, there are many machine learning problems in which complex, structured predictions are sought (instead of scalar values, as in classification and regression). For such problems, the idea of abstaining from a prediction can be generalized toward partial abstention: instead of predicting the entire structure, the learner predicts only parts of it, namely those for which it is certain enough. This idea has already been realized, e.g., for the case where predictions are rankings (Cheng et al. 2010; 2012).

Another important example is multilabel classification (MLC), in which an outcome associated with an instance is a labeling in the form of a subset of an underlying reference set of class labels (Dembczyński et al. 2012; Tsoumakas, Katakis, and Vlahavas 2009; Zhang and Zhou 2014). In this paper, we study an extension of the setting of MLC in which the learner is allowed to partially abstain from a prediction, that is, to deliver predictions on some but not necessarily all class labels (or, more generally, to refuse committing to a single complete prediction). Although MLC has been studied extensively in the machine learning literature in the recent past, there is surprisingly little work on MLC with abstention so far; a notable exception is (Pillai, Fumera, and Roli 2013), to which we will return in Section 7.

Prediction with abstention is typically realized as a two-stage approach. First, the learner delivers a prediction that provides information about its uncertainty. Then, taking this uncertainty into account, a decision is made about whether or not to predict, or on which parts. In binary classification, for example, a typical approach is to produce probabilistic predictions and to abstain whenever the probability is close to 1/2.
We adopt a similar approach, in which we rely on probabilistic MLC, i.e., probabilistic predictions of labelings. In the next section, we briefly recall the setting of multilabel classification. The generalization toward MLC with (partial) abstention is then introduced and formalized in Section 3. Instantiations of the setting of MLC with abstention for the specific cases of the Hamming loss, rank loss, and F-measure are studied in Sections 4–6, respectively. Related work is discussed in Section 7. Finally, experimental results are presented in Section 8, prior to concluding the paper in Section 9. All formal results in this paper (propositions, remarks, corollaries) are stated without proofs, which are deferred to the supplementary material.

## 2 Multilabel Classification

In this section, we describe the MLC problem in more detail and formalize it within a probabilistic setting. Along the way, we introduce the notation used throughout the paper.

Let $\mathcal{X}$ denote an instance space, and let $\mathcal{L} = \{\lambda_1, \ldots, \lambda_m\}$ be a finite set of class labels. We assume that an instance $x \in \mathcal{X}$ is (probabilistically) associated with a subset of labels $\Lambda = \Lambda(x) \in 2^{\mathcal{L}}$; this subset is often called the set of relevant labels, while the complement $\mathcal{L} \setminus \Lambda$ is considered as irrelevant for $x$. We identify a set $\Lambda$ of relevant labels with a binary vector $y = (y_1, \ldots, y_m)$, where $y_i = [\![\lambda_i \in \Lambda]\!]$.¹ By $\mathcal{Y} = \{0, 1\}^m$ we denote the set of possible labelings.

We assume observations to be realizations of random variables generated independently and identically (i.i.d.) according to a probability distribution $p$ on $\mathcal{X} \times \mathcal{Y}$, i.e., an observation $y = (y_1, \ldots, y_m)$ is the realization of a corresponding random vector $Y = (Y_1, \ldots, Y_m)$. We denote by $p(Y \mid x)$ the conditional distribution of $Y$ given $X = x$, and by $p_i(Y_i \mid x)$ the corresponding marginal distribution of $Y_i$:

$$p_i(b \mid x) = \sum_{y \in \mathcal{Y}:\, y_i = b} p(y \mid x) . \quad (1)$$

A multilabel classifier $h$ is a mapping $\mathcal{X} \to \mathcal{Y}$ that assigns a (predicted) label subset to each instance $x \in \mathcal{X}$.
Thus, the output of a classifier $h$ is a vector $\hat{y} = h(x) = (h_1(x), \ldots, h_m(x))$. Given training data in the form of a finite set of observations $(x, y) \in \mathcal{X} \times \mathcal{Y}$, drawn independently from $\Pr(X, Y)$, the goal in MLC is to learn a classifier $h: \mathcal{X} \to \mathcal{Y}$ that generalizes well beyond these observations in the sense of minimizing the expected risk with respect to a specific loss function.

In the literature, various MLC loss functions have been proposed, including the Hamming loss, the subset 0/1 loss, the F-measure, the Jaccard measure, and the rank loss. The Hamming loss is given by

$$\ell_H(y, \hat{y}) = \sum_{i=1}^{m} [\![ y_i \neq \hat{y}_i ]\!] , \quad (2)$$

and the subset 0/1 loss by $\ell_S(y, \hat{y}) = [\![ y \neq \hat{y} ]\!]$. Thus, both losses generalize the standard 0/1 loss commonly used in classification, but in very different ways. Hamming and subset 0/1 are prototypical examples of what is called a (label-wise) decomposable and non-decomposable loss, respectively (Dembczyński et al. 2012). A decomposable loss can be expressed in the form

$$\ell(y, \hat{y}) = \sum_{i=1}^{m} \ell_i(y_i, \hat{y}_i) \quad (3)$$

with suitable binary loss functions $\ell_i: \{0,1\}^2 \to \mathbb{R}$, whereas a non-decomposable loss does not permit such a representation. It can be shown that, to produce optimal predictions $\hat{y} = h(x)$ minimizing expected loss, knowledge about the marginals $p_i(Y_i \mid x)$ is enough in the case of a decomposable loss, but not in the case of a non-decomposable loss (Dembczyński et al. 2012). Instead, if a loss is non-decomposable, higher-order probabilities are needed, and in the extreme case even the entire distribution $p(Y \mid x)$ (as in the case of the subset 0/1 loss). On an algorithmic level, this means that MLC with a decomposable loss can be tackled by what is commonly called binary relevance (BR) learning (i.e., learning one binary classifier for each label individually), whereas non-decomposable losses call for more sophisticated learning methods that are able to take label dependencies into account.

¹ $[\![\cdot]\!]$ is the indicator function, i.e., $[\![A]\!] = 1$ if the predicate $A$ is true and $= 0$ otherwise.
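The difference between decomposable and non-decomposable losses can be made concrete with a minimal sketch (the joint distribution below is hypothetical): the Hamming-optimal prediction follows from the marginals (1) alone, while the subset-0/1-optimal prediction is the mode of the joint distribution, and the two can disagree.

```python
from itertools import product

# A toy conditional distribution p(y | x) over Y = {0, 1}^2
# (hypothetical numbers, chosen so that the two risk minimizers differ).
joint = {(0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.3}
m = 2

# Marginals p_i(1 | x) via Eq. (1): sum p(y | x) over all y with y_i = 1.
marginals = [sum(p for y, p in joint.items() if y[i] == 1) for i in range(m)]

# Hamming loss is decomposable: the risk minimizer thresholds each marginal.
y_hamming = tuple(int(p_i > 0.5) for p_i in marginals)

# Subset 0/1 loss is non-decomposable: the risk minimizer is the joint mode.
y_subset = max(product([0, 1], repeat=m), key=lambda y: joint.get(y, 0.0))

print(marginals)   # approximately [0.6, 0.7]
print(y_hamming)   # (1, 1)
print(y_subset)    # (0, 1)
```

Here the marginals favor relevance for both labels, yet the single most probable labeling differs, illustrating why the subset 0/1 loss requires the entire distribution.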
## 3 MLC with Abstention

In our generalized setting of MLC with abstention, which is introduced in this section, the classifier is allowed to produce partial predictions

$$\hat{y} = h(x) \in \mathcal{Y}_{pa} := \{0, \bot, 1\}^m , \quad (4)$$

where $\hat{y}_i = \bot$ indicates an abstention on the label $\lambda_i$. We denote by $A(\hat{y}) \subseteq [m] := \{1, \ldots, m\}$ and $D(\hat{y}) := [m] \setminus A(\hat{y})$ the sets of indices $i$ for which $\hat{y}_i = \bot$ and $\hat{y}_i \in \{0, 1\}$, respectively, that is, the indices on which the learner abstains and decides to predict.

### 3.1 Risk Minimization

To evaluate a reliable multilabel classifier, a generalized loss function

$$L: \mathcal{Y} \times \mathcal{Y}_{pa} \to \mathbb{R}_+ \quad (5)$$

is needed, which compares a partial prediction $\hat{y}$ with a ground-truth labeling $y$. Given such a loss, and assuming a probabilistic prediction for a query instance $x$, i.e., a probability $p(\cdot \mid x)$ on the set of labelings (or at least an estimation thereof), the problem of risk minimization comes down to finding

$$\hat{y} \in \operatorname*{argmin}_{\hat{y} \in \mathcal{Y}_{pa}} \mathbf{E}\, L(y, \hat{y}) = \operatorname*{argmin}_{\hat{y} \in \mathcal{Y}_{pa}} \sum_{y \in \mathcal{Y}} L(y, \hat{y})\, p(y \mid x) . \quad (6)$$

The concrete form of this optimization problem, as well as its difficulty, depends on several choices, including the underlying MLC loss function $\ell$ and its extension $L$.

### 3.2 Generalized Loss Functions

On the basis of a standard MLC loss $\ell$, a generalized loss function (5) can be derived in different ways, also depending on how to penalize the abstention. Further below, we propose a generalization based on an additive penalty. Before doing so, we discuss some general properties that might be of interest for generalized losses.

As a first property, we should expect a generalized loss $L$ to reduce to its conventional version $\ell$ in the case of no abstention, i.e., $L(y, \hat{y}) = \ell(y, \hat{y})$ whenever $\hat{y}$ is a precise prediction $\hat{y} \in \mathcal{Y}$. Needless to say, this is a property that every generalized loss should obey.

**Monotonicity.**
Another reasonable property is monotonicity: the loss should only increase (or at least not decrease) when (i) turning a correct prediction on a label $\lambda_i$ into an abstention or an incorrect prediction, or (ii) turning an abstention into an incorrect prediction. This reflects the following chain of preferences: a correct prediction is better than an abstention, which in turn is better than an incorrect prediction. More formally, for a ground-truth labeling $y$ and a partial prediction $\hat{y}^1$, let $C_1, A_1 \subseteq \mathcal{L}$ denote the subsets of labels on which the prediction is correct and on which the learner abstains, respectively, and define $C_2, A_2 \subseteq \mathcal{L}$ analogously for a prediction $\hat{y}^2$. Then

$$(C_2 \subseteq C_1) \wedge (C_2 \cup A_2 \subseteq C_1 \cup A_1) \;\Rightarrow\; L(y, \hat{y}^1) \le L(y, \hat{y}^2) . \quad (7)$$

**Uncertainty-alignment.** Intuitively, when producing a partial prediction, an optimal prediction rule is supposed to abstain on the most uncertain labels. More formally, consider a generalized loss function $L$ and a prediction $\hat{y}$ which, for a query $x \in \mathcal{X}$, is a risk-minimizer (6). Moreover, denoting by $p_i = p_i(1 \mid x)$ the (marginal) probability that label $\lambda_i$ is relevant for $x$, it is natural to quantify the degree of uncertainty on this label in terms of

$$u_i = 1 - 2\,|p_i - 1/2| = 2 \min(p_i, 1 - p_i) , \quad (8)$$

or any other function symmetric around $1/2$. We say that $\hat{y}$ is uncertainty-aligned if

$$\forall\, i \in A(\hat{y}),\; j \in D(\hat{y}): \; u_i \ge u_j .$$

Thus, a prediction is uncertainty-aligned if the following holds: whenever the learner decides to abstain on label $\lambda_i$ and to predict on label $\lambda_j$, the uncertainty on $\lambda_j$ cannot exceed the uncertainty on $\lambda_i$. We then call a loss function $L$ uncertainty-aligned if it guarantees the existence of an uncertainty-aligned risk-minimizer, regardless of the probability $p = p(\cdot \mid x)$.

**Additive Penalty for Abstention.** Consider the case of a partial prediction $\hat{y}$, and denote by $\hat{y}_D$ and $\hat{y}_A$ the projections of $\hat{y}$ to the entries in $D(\hat{y})$ and $A(\hat{y})$, respectively.
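As a small illustration (with hypothetical probabilities, and a placeholder `ABSTAIN` symbol standing in for $\bot$), the uncertainty degrees (8) and the uncertainty-alignment check for a given partial prediction can be sketched as follows:

```python
# Uncertainty degrees u_i = 1 - 2|p_i - 1/2| = 2 * min(p_i, 1 - p_i), Eq. (8),
# and a check of the uncertainty-alignment property of a partial prediction.
ABSTAIN = None  # placeholder for the abstention symbol

def uncertainty(p_i):
    return 2 * min(p_i, 1 - p_i)

def is_uncertainty_aligned(y_hat, p):
    u = [uncertainty(p_i) for p_i in p]
    A = [i for i, v in enumerate(y_hat) if v is ABSTAIN]      # abstained indices
    D = [i for i, v in enumerate(y_hat) if v is not ABSTAIN]  # predicted indices
    # Aligned iff every abstained label is at least as uncertain as every
    # predicted label.
    return all(u[i] >= u[j] for i in A for j in D)

p = [0.9, 0.55, 0.1, 0.45]  # hypothetical marginals
print(is_uncertainty_aligned((1, ABSTAIN, 0, ABSTAIN), p))  # True
print(is_uncertainty_aligned((ABSTAIN, 1, 0, 0), p))        # False
```

The first prediction abstains exactly on the two labels closest to 1/2 and is therefore aligned; the second abstains on a near-certain label while predicting on an uncertain one.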
As a natural extension of the original loss $\ell$, we propose a generalized loss of the form

$$L(y, \hat{y}) = \ell(y_D, \hat{y}_D) + f(A(\hat{y})) , \quad (9)$$

with $\ell(y_D, \hat{y}_D)$ the original loss on the part on which the learner predicts, and $f(A(\hat{y}))$ a penalty for abstaining on $A(\hat{y})$. The latter can be seen as a measure of the loss of usefulness of the prediction $\hat{y}$ due to its partiality, i.e., due to having no predictions on $A(\hat{y})$.

An important instantiation of (9) is the case where the penalty is a counting measure, i.e., where $f$ only depends on the number of abstentions:

$$L(y, \hat{y}) = \ell(y_D, \hat{y}_D) + f\big(|A(\hat{y})|\big) . \quad (10)$$

A special case of (10) is to penalize each abstention $\hat{y}_i = \bot$ with the same constant $c \in [0, 1]$, which yields

$$L(y, \hat{y}) = \ell(y_D, \hat{y}_D) + |A(\hat{y})| \cdot c . \quad (11)$$

Of course, instead of a linear function $f$, more general penalty functions are conceivable. For example, a practically relevant penalty is a concave function of the number of abstentions: each additional abstention causes a cost, so $f$ is monotone increasing in $|A(\hat{y})|$, but the marginal cost of abstention is decreasing.

**Proposition 1.** Let the loss $\ell$ be decomposable in the sense of (3), and let $\hat{y}$ be a risk-minimizing prediction (for a query instance $x$). The minimization of the expected loss (10) is then accomplished by

$$\hat{y} = \operatorname*{argmin}_{1 \le d \le m} \mathbf{E}\big(\ell(y_D, \hat{y}^d)\big) + f(m - d) , \quad (12)$$

where the prediction $\hat{y}^d$ is specified by the index set

$$D_d(\hat{y}^d) := \{\pi(1), \ldots, \pi(d)\} , \quad (13)$$

and the permutation $\pi$ sorts the labels in increasing order of the label-wise expected losses $s_i = \min_{\hat{y}_i \in \{0,1\}} \mathbf{E}\big(\ell_i(y_i, \hat{y}_i)\big)$, i.e., $s_{\pi(1)} \le \cdots \le s_{\pi(m)}$.

As shown by the previous proposition, a risk-minimizing prediction for a decomposable loss can easily be found in time $O(m \log m)$, simply by sorting the labels according to their contribution to the expected loss, and then finding the optimal size $d$ of the prediction according to (12).

## 4 The Case of Hamming Loss

This section presents first results for the case of the Hamming loss function (2).
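The sorting procedure behind Proposition 1 can be sketched as follows; this is a minimal illustration, not the authors' implementation, instantiated with the Hamming contributions $s_i = \min(p_i, 1 - p_i)$ as label-wise expected losses and the linear penalty $f(k) = c \cdot k$ of (11) (the probabilities and $c$ are hypothetical):

```python
# O(m log m) risk minimization for a decomposable loss with abstention
# penalty f, following the sorting scheme of Proposition 1.
ABSTAIN = None  # placeholder for the abstention symbol

def minimize_decomposable(p, f):
    m = len(p)
    s = [min(p_i, 1 - p_i) for p_i in p]          # label-wise expected losses
    order = sorted(range(m), key=lambda i: s[i])  # permutation pi: s ascending
    best_d, best_risk = None, float("inf")
    prefix = 0.0
    for d in range(1, m + 1):
        prefix += s[order[d - 1]]   # expected loss on the d predicted labels
        risk = prefix + f(m - d)    # plus penalty for the m - d abstentions
        if risk < best_risk:
            best_d, best_risk = d, risk
    y_hat = [ABSTAIN] * m
    for i in order[:best_d]:
        y_hat[i] = int(p[i] > 0.5)  # predict the likelier value on D
    return y_hat, best_risk

p = [0.9, 0.55, 0.1, 0.45]  # hypothetical marginals
c = 0.2                     # hypothetical per-abstention cost
y_hat, risk = minimize_decomposable(p, lambda k: c * k)
print(y_hat)  # [1, None, 0, None]
```

With the linear penalty, the result abstains exactly on the labels with $\min(p_i, 1 - p_i) > c$, which anticipates the closed-form rule stated for the Hamming loss below.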
In particular, we analyze extensions of the Hamming loss according to (10) and address the corresponding problem of risk minimization.

Given a query instance $x$, assume conditional probabilities $p_i = p(y_i = 1 \mid x)$ are given or made available by an MLC predictor $h$. In the case of Hamming, the expected loss of a prediction $\hat{y}$ is then given by

$$\mathbf{E}\big(\ell_H(y, \hat{y})\big) = \sum_{i:\, \hat{y}_i = 1} (1 - p_i) + \sum_{i:\, \hat{y}_i = 0} p_i ,$$

and minimized by $\hat{y}$ such that $\hat{y}_i = 0$ if $p_i \le 1/2$ and $\hat{y}_i = 1$ otherwise.

In the setting of abstention, we call a prediction $\hat{y}$ a $d$-prediction if $|D(\hat{y})| = d$. Let $\pi$ be a permutation of $[m]$ that sorts labels according to the uncertainty degrees (8), i.e., such that $u_{\pi(1)} \le u_{\pi(2)} \le \cdots \le u_{\pi(m)}$. As a consequence of Proposition 1, we obtain the following result.

**Corollary 1.** In the case of Hamming loss, let $\hat{y}$ be a risk-minimizing prediction (for a query instance $x$). The minimization of the expected loss (10) is then accomplished by

$$\hat{y} = \operatorname*{argmin}_{1 \le d \le m} \mathbf{E}\big(\ell_H(y, \hat{y}^d)\big) + f(m - d) , \quad (14)$$

where the prediction $\hat{y}^d$ is specified by the index set

$$D_d(\hat{y}^d) = \{\pi(1), \ldots, \pi(d)\} . \quad (15)$$

**Corollary 2.** The extension (10) of the Hamming loss is uncertainty-aligned. In the case of the extension (11) of the Hamming loss, the optimal prediction is given by (15) with $d = |\{i \in [m] \mid \min(p_i, 1 - p_i) \le c\}|$.

**Remark 1.** The extension (10) of the Hamming loss is monotonic, provided $f$ is non-decreasing and such that $f(k + 1) - f(k) \le 1$ for all $k \in [m - 1]$.

## 5 The Case of Rank Loss

In the case of the rank loss, we assume predictions in the form of rankings instead of labelings. Ignoring the possibility of ties, such a ranking can be represented in the form of a permutation $\pi$ of $[m]$, where $\pi(i)$ is the index $j$ of the label $\lambda_j$ on position $i$ (and $\pi^{-1}(j)$ the position of label $\lambda_j$).
The rank loss then counts the number of incorrectly ordered label pairs, that is, the number of pairs $\lambda_i, \lambda_j$ such that $\lambda_i$ is ranked worse than $\lambda_j$ although $\lambda_i$ is relevant while $\lambda_j$ is irrelevant:

$$\ell_R(y, \pi) = \sum_{(i,j):\, y_i > y_j} [\![ \pi^{-1}(i) > \pi^{-1}(j) ]\!] .$$

For each pair $(i, j)$ with $y_i > y_j$, either $\pi$ or its reversal $\bar{\pi}$ incurs an error, but not both. Therefore, $\ell_R(y, \pi) + \ell_R(y, \bar{\pi}) = c(y)$, where $c(y)$ denotes the number of pairs $(i, j)$ with $y_i > y_j$, and

$$\ell_R(y, \pi) - \ell_R(y, \bar{\pi}) = 2\,\ell_R(y, \pi) - c(y) . \quad (20)$$

Since $c(y)$ is a constant that does not depend on $\pi$, minimizing $\ell_R(y, \pi)$ (in expectation) is equivalent to minimizing the difference $\ell_R(y, \pi) - \ell_R(y, \bar{\pi})$.
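The complementarity between a ranking and its reversal, which underlies (20), can be checked numerically with a brute-force sketch on a toy labeling (the data below is hypothetical):

```python
from itertools import permutations

# Rank loss of a ranking pi (a permutation of label indices, best position
# first), plus a check that l_R(y, pi) + l_R(y, reversed pi) = c(y), where
# c(y) is the number of relevant/irrelevant label pairs.
def rank_loss(y, pi):
    pos = {label: position for position, label in enumerate(pi)}
    return sum(1 for i in range(len(y)) for j in range(len(y))
               if y[i] > y[j] and pos[i] > pos[j])

y = (1, 0, 1, 0)                                  # toy ground-truth labeling
c_y = sum(1 for i in range(4) for j in range(4) if y[i] > y[j])  # c(y) = 4

# Every pair with y_i > y_j is mis-ordered by exactly one of pi, reversed pi.
for pi in permutations(range(4)):
    assert rank_loss(y, pi) + rank_loss(y, pi[::-1]) == c_y
print("identity holds for all permutations")
```

Exhausting all 24 permutations confirms that each relevant/irrelevant pair is counted exactly once between a ranking and its reversal.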