# Classifier Chains: A Review and Perspectives

Journal of Artificial Intelligence Research 70 (2021) 683-718. Submitted 08/2020; published 02/2021.

Jesse Read (jesse.read@polytechnique.edu), LIX, Ecole Polytechnique, Institut Polytechnique de Paris, Palaiseau 91120, France.

Bernhard Pfahringer (bernhard@cs.waikato.ac.nz), Geoff Holmes (geoff@cs.waikato.ac.nz), Eibe Frank (eibe@cs.waikato.ac.nz), University of Waikato, Hamilton, New Zealand.

Abstract. The family of methods collectively known as classifier chains has become a popular approach to multi-label learning problems. This approach involves chaining together off-the-shelf binary classifiers in a directed structure, such that individual label predictions become features for other classifiers. Such methods have proved flexible and effective and have obtained state-of-the-art empirical performance across many datasets and multi-label evaluation metrics. This performance has led to further studies of the underlying mechanism and efficacy, and to investigation into how it could be improved. Over the past decade, numerous studies have explored the theoretical underpinnings of classifier chains, and many improvements have been made to the training and inference procedures, such that this method remains among the best options for multi-label learning. Given this past and ongoing interest, which covers a broad range of applications and research themes, the goal of this work is to provide a review of classifier chains, a survey of the techniques and extensions provided in the literature, as well as perspectives for this approach in the domain of multi-label classification in the future. We conclude positively, with a number of recommendations for researchers and practitioners, as well as outlining key issues for future research.

1. Introduction

Interest in multi-label classification has grown at an explosive pace in the last 10 years, from only a few dozen explicit mentions in the scientific literature to hundreds of new papers per year, a significant collection of benchmark datasets, and a number of dedicated software frameworks. Applications are as diverse as those found in multi-class classification, and several families of methods have emerged. Reviews of the area are given by, for example, Zhang and Zhou (2014), and in the broader context of multi-output learning, by Waegeman et al. (2019).

The defining aspect of multi-label learning is the association of multiple binary class labels with a single instance. A multi-label dataset can be denoted $\mathcal{D} = \{(\mathbf{x}^{(i)}, \mathbf{y}^{(i)})\}_{i=1}^{N}$, consisting of $N$ examples. Each $i$-th instance $\mathbf{x}^{(i)}$ is associated with a label vector $\mathbf{y}^{(i)} = [y^{(i)}_1, \ldots, y^{(i)}_L]$; there are $L$ elements corresponding to $L$ label concepts, and each element $y^{(i)}_j \in \{0, 1\}$ indicates the relevance (if 1) or not (if 0) of the $j$-th concept to this instance. A multi-label model is tasked with providing predictions $\hat{\mathbf{y}} = [\hat{y}_1, \ldots, \hat{y}_L]$ for any given test instance $\mathbf{x}$. Note that traditional multi-class learning also considers $L$ concepts but, in contrast, only a single concept can be relevant to any particular instance. In both types of learning, each instance is typically multi-dimensional, described by $D$ features (also referred to as attributes). Figure 1a shows an example of a multi-label dataset. Figure 1b shows how this dataset can be naturally transformed, or reduced, into $L$ binary problems that are solved independently.
This approach, of applying independent binary classifiers, is known as the binary relevance method, which has become a typical baseline in multi-label studies.

Figure 1: Illustration of how independent classifiers can be applied to a multi-label classification problem by transformation (often explicitly called a reduction) into separate datasets. (a) A multi-label dataset, with test instance $\mathbf{x}$:

| $X$ | $Y_1$ | $Y_2$ | $Y_3$ | $Y_4$ |
|---|---|---|---|---|
| $\mathbf{x}^{(1)}$ | 0 | 1 | 1 | 0 |
| $\mathbf{x}^{(2)}$ | 1 | 0 | 0 | 0 |
| $\mathbf{x}^{(3)}$ | 0 | 1 | 0 | 0 |
| $\mathbf{x}^{(4)}$ | 1 | 0 | 0 | 1 |
| $\mathbf{x}^{(5)}$ | 0 | 0 | 0 | 1 |
| $\mathbf{x}$ | $\hat{y}_1$ | $\hat{y}_2$ | $\hat{y}_3$ | $\hat{y}_4$ |

(b) A transformation into $L$ two-class datasets, $(X, Y_1), \ldots, (X, Y_4)$, to which independent binary classifiers can be applied. Each instance is also a vector, not expanded for notational simplicity. The goal of a model is to predict all labels $\hat{y}_j$, $j = 1, \ldots, L$, for a given test instance $\mathbf{x}$.

The method of classifier chains was described by Read et al. (2009) (later, with an extended analysis, by Read et al., 2011), and also proposed contemporaneously by Kajdanowicz and Kazienko (2009). The idea is simple: connect binary classifiers in a chain, such that the output prediction of one classifier is appended as an additional feature to the input of all subsequent classifiers. This method is one of many approaches that seek to model relationships between labels, thus obtaining improved performance over the binary relevance approach. There are now dozens of variants and analyses of classifier chains, and the method has been involved in at least a hundred empirical evaluations, often proving among the most competitive, even in relatively recent comparisons (Moyano, Galindo, Cios, & Ventura, 2018).

On the grounds of such interest, much of which is still ongoing, the goal of this paper is to provide an overview of landmark developments and analyses, and also to discuss perspectives exemplified by various methods that have been proposed. For the purposes of our investigation, we define classifier chains under the following two properties: 1) one classifier per label, considered as a node in a chain, where 2) the chain is any directed acyclic structure in which the output of one classifier becomes input to the subsequent classifiers to which it is connected in that structure. This is a broader definition than in the initial work proposing classifier chains, which explicitly considered a fully-connected cascade; it affords us greater flexibility to follow more recent developments in the same context. Arguably, we could speak of directed acyclic graphs of classifiers, but we retain the terminology of a chain in line with the bulk of the related literature. Indeed, the flexibility of this method is certainly one of the main factors behind its popularity, and behind the large number of variants offered in the literature. If we consider chains which are not connected at all, then we recover the binary relevance method. We remark that binary relevance, although typically denoting independent classifiers, can be considered itself a family of methods that encompasses the full spectrum of classifier chains (Burkhardt & Kramer, 2015; Zhang, Li, Liu, & Geng, 2018), from independent classifiers to fully-connected chains, without conflict of terms.

An important concept is the hyperparametrization of base classifiers; here, any suitable binary classifier (e.g., logistic regression, decision trees, support vector machines) can be considered.
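To make the binary relevance baseline concrete, the following minimal sketch (our own illustration; Python, scikit-learn, and logistic regression are assumed choices, not prescribed by the literature discussed here) trains one independent classifier per label, as in Figure 1b:

```python
# A minimal sketch of the binary relevance baseline: one independent binary
# classifier per label. Library choices (scikit-learn, logistic regression)
# are illustrative assumptions.
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

class BinaryRelevance:
    def __init__(self, base=LogisticRegression()):
        self.base = base

    def fit(self, X, Y):
        # Y has shape (N, L); train one classifier per label column.
        self.models_ = [clone(self.base).fit(X, Y[:, j])
                        for j in range(Y.shape[1])]
        return self

    def predict(self, X):
        return np.column_stack([m.predict(X) for m in self.models_])

# Toy data in the spirit of Figure 1 (here with 2-dimensional inputs).
X = np.array([[0, 1], [1, 0], [0, 0], [1, 1], [1, 1]])
Y = np.array([[0, 1, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 1]])
print(BinaryRelevance().fit(X, Y).predict(X))
```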
In the deterministic sense, the $j$-th binary classifier, given an input instance $\mathbf{x}$, produces a prediction $\hat{y}_j \in \{0, 1\}$ indicating the relevance of the $j$-th label as it pertains to that instance. For probabilistic varieties of chains, discussed in Section 3.1, we should additionally assume that a base classifier provides a probabilistic interpretation of this decision. It is worth noting that the classifier chain then corresponds to an appropriately structured Bayesian network in which the base classifiers define the conditional probability distributions at each node of the network.

Graphically, the original formulation of classifier chains can be drawn as in Figure 2b, as a fully connected chain. One could also refer to the fully connected structure as a cascade or a fan, to distinguish it from a Markov chain, where each node is connected only to the previous and following nodes, which is also a possible configuration in this context (see Figure 2c, and, for example, Read, Martino, & Hollmén, 2017). Other variants are also possible, such as trees (for example, Figure 2d, and Enrique Sucar et al., 2014), and arbitrary directed acyclic graphs (DAGs) (for example, Figure 2e, and Zhang & Zhang, 2010). The mechanism of all these configurations is the same: the incoming edges to the $j$-th node represent features to the $j$-th classifier, and the outgoing edge its prediction. Figure 2a is equivalent to binary relevance (and thus, Figure 1b). Even though undirected graphical models are related (further discussion in Section 5), we consider those to belong to a separate class of model due to the different inference strategies involved.

Figure 2: Different classifier chain structures for a problem with 4 labels ($y_1, \ldots, y_4$), including unconnected nodes (a; binary relevance), a fully-connected chain (b), and sparser variants such as Markov chains, trees, and DAGs. Note the difference between (b) and (f) is the order of labels.

Figure 3 shows the dataset transformation corresponding to Figure 2b (which can be contrasted with that of independent models; Figure 2a and Figure 1b); one dataset per node/label. It is straightforward to make transformations for the other cases exemplified in Figure 2 in a related fashion. Any binary classifier can be employed on the transformed datasets.

Figure 3: The transformation of a dataset (that of Figure 1a) for the application of classifier chains (as shown in Figure 2b): the $j$-th dataset has inputs $[X, Y_1, \ldots, Y_{j-1}]$ and target $Y_j$, i.e., from $(X \rightarrow Y_1)$ through $([X, Y_1, Y_2, Y_3] \rightarrow Y_4)$. It is worth highlighting that (as in Figure 1) the 0/1 values shown in the tables correspond to the values $y^{(i)}_j$ found in the training data, whereas $\hat{y}_1, \ldots, \hat{y}_L$ (in the final row) are all predicted values, not available at training time.

Classifier chains have obtained state-of-the-art performance in many empirical evaluations, including a variety of datasets and evaluation metrics (see, for example, Madjarov et al., 2012; Moyano et al., 2018, and references therein). This strong off-the-shelf performance, their simplicity of implementation, and the open choice of base classifiers to fit many preferences and suit different domains, have led to their wide usage and ongoing development.
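The chain transformation of Figure 3 can be sketched in the same illustrative style (again assuming Python/scikit-learn): classifier $j$ is trained on the inputs augmented with the preceding *training* labels, and at test time earlier *predictions* are imputed greedily as features.

```python
# A minimal sketch of a fully-cascaded classifier chain (Figures 2b and 3).
# scikit-learn and logistic regression are illustrative assumptions.
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

class ClassifierChain:
    def __init__(self, base=LogisticRegression()):
        self.base = base

    def fit(self, X, Y):
        self.L_ = Y.shape[1]
        # Augment the inputs of the j-th model with the preceding true labels.
        self.models_ = [clone(self.base).fit(np.hstack([X, Y[:, :j]]), Y[:, j])
                        for j in range(self.L_)]
        return self

    def predict(self, X):
        Yhat = np.zeros((X.shape[0], self.L_), dtype=int)
        for j, m in enumerate(self.models_):
            # Earlier *predictions* (not true labels) are features at test time.
            Yhat[:, j] = m.predict(np.hstack([X, Yhat[:, :j]]))
        return Yhat

X = np.array([[0, 1], [1, 0], [0, 0], [1, 1], [1, 1]])
Y = np.array([[0, 1, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 1], [0, 0, 0, 1]])
print(ClassifierChain().fit(X, Y).predict(X))
```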
A search of the academic literature reveals that they have been used in diverse applications ranging from vision and natural language domains, to bioinformatics, health analytics, time series, and route forecasting. Although we do not explicitly tackle this setting in this manuscript, the method of classifier chains (unlike many other multi-label methods) generalizes easily to the case of having multiple (more than two) values per label, a task sometimes called multi-label multi-class, multi-dimensional, or multi-objective classification; see, for example, the work by Kocev et al. (2007) and Read et al. (2014).

In addition to attracting interest from practitioners on account of their performance, classifier chains have also raised many questions of a theoretical nature: How can their efficacy be explained? And what do the learning algorithms for these chains optimize? (We look at these issues in Section 3.) Tied in with this are further questions that have driven much related work: Is there an optimal chain order and, if so, how can it be found? (This is covered in Section 4.) Can chains be seen as a special case of other methods (or vice versa)? (See our discussion of related work in Section 5.) Following an attempt at addressing these questions, we provide perspectives and discuss open issues in Section 6, and give recommendations to practitioners in Section 7, before concluding (Section 8).

2. Preliminaries: Multi-label Learning

The general concept of multi-label classification and classifier chains has been introduced in Section 1 in terms of mapping instances $\mathbf{x}$ to label-vectors $\mathbf{y}$. Here we provide additional notation and concepts which are relevant to a more in-depth and theoretical discussion. Beginning with a binary single-label problem, we seek a classifier $h$ to map instances $\mathbf{x}$ to $y \in \{0, 1\}$. It should carry out this mapping, i.e., make a decision/classification, according to the loss function of interest. A common objective is to minimize the 0/1 loss

$$\ell_{0/1}(y, \hat{y}) = [\![y \neq \hat{y}]\!] = \begin{cases} 0 & y = \hat{y} \text{ (match exactly)} \\ 1 & \text{otherwise} \end{cases} \tag{1}$$

which is equivalent to maximizing classification accuracy. In this paper we use $J$ to denote a payoff function such as accuracy, where higher is better; note that $\ell_{0/1}(y, \hat{y}) = 1 - J_{\text{acc.}}(y, \hat{y})$. Minimizing the 0/1 loss implies that in a probabilistic setting

$$\hat{y} = h(\mathbf{x}) = \operatorname*{argmax}_{y \in \{0,1\}} P(y \mid \mathbf{x}) \tag{2}$$

i.e., the mode of the posterior distribution (the maximum a posteriori, or MAP, estimate). In multi-label classification, there are $2^L$ possible classifications (corresponding to distinct combinations/subsets of $L$ labels), rather than just 2, hence:

$$\hat{\mathbf{y}} = [\hat{y}_1, \ldots, \hat{y}_L] = h(\mathbf{x}) = \operatorname*{argmax}_{\mathbf{y} \in \{0,1\}^L} P(\mathbf{y} \mid \mathbf{x}) \tag{3}$$

We use the notation $P(\mathbf{y}|\mathbf{x}) \equiv P(\mathbf{Y} = \mathbf{y} \mid \mathbf{X} = \mathbf{x})$, corresponding to a single number, i.e., $P(\mathbf{y}|\mathbf{x}) \in [0, 1]$, to represent the probability that $y_1, \ldots, y_L$ are the label relevance indicators, given test instance $\mathbf{x}$. The 0/1 loss can be normalized, providing a value in $[0, 1]$ for $N$ training examples, as

$$\ell_{0/1}(\mathcal{D}) = \frac{1}{N} \sum_{i=1}^{N} \ell_{0/1}\big(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)}\big) \tag{4}$$

with training set $\mathcal{D}$, where $\ell_{0/1}(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)})$, referring to Eq. (1), returns 0 only if each single bit matches exactly (it is thus often referred to as exact match in payoff form; $\ell_{0/1}(\mathcal{D}) = 1 - J_{\text{exact-match}}(\mathcal{D})$). We may rewrite explicitly

$$\ell_{0/1}(\mathbf{y}, \hat{\mathbf{y}}) = [\![\mathbf{y} \neq \hat{\mathbf{y}}]\!] = \begin{cases} 0 & \sum_{j=1}^{L} [\![y_j = \hat{y}_j]\!] = L \\ 1 & \text{otherwise} \end{cases}$$

The sum over indices $i = 1, \ldots, N$ is allowed by the assumption of independent and identically distributed samples in the training set.
The choice of this loss coincides with the general intuition that we should model labels together, since a combination of labels that is incorrect to any degree gives the worst possible loss for that instance. As such it has motivated and justified many methods. It also implies exponential complexity (if all $2^L$ label combinations are to be modeled explicitly), therefore providing a major challenge to overcome. Classifier chains form one such family of methods.

There are numerous (more than a dozen) often-used multi-label loss metrics; some of them are drastically different from each other. For example, another loss frequently referred to in the multi-label literature is Hamming loss:

$$\ell_{H}(\mathcal{D}) = \frac{1}{NL} \sum_{i=1}^{N} \sum_{j=1}^{L} \ell_{0/1}\big(y^{(i)}_j, \hat{y}^{(i)}_j\big) = \frac{1}{L} \sum_{j=1}^{L} \left( \frac{1}{N} \sum_{i=1}^{N} \ell_{0/1}\big(y^{(i)}_j, \hat{y}^{(i)}_j\big) \right) \tag{5}$$

Even if expressed here in terms of (single-label) 0/1 loss, Hamming loss can be considered on the opposite end of the spectrum to 0/1 loss as a multi-label metric, and indeed it indicates that we can model all labels individually and independently, as we have explicitly expressed on the right-hand side; i.e., it can be tackled by independent binary classifiers via minimizing Eq. (1) on a per-label basis.

What does it mean to model labels together (or not)? Statistically, we can speak of modeling label dependence. We may speak of global dependence (or marginal/unconditional dependence) between two label variables $Y_1$ and $Y_2$ when

$$P(\mathbf{Y}) = P(Y_1)P(Y_2 \mid Y_1) = P(Y_2)P(Y_1 \mid Y_2) \neq P(Y_1)P(Y_2)$$

and conditional dependence (conditioned on input $\mathbf{x}$) implies that

$$P(\mathbf{Y} \mid \mathbf{x}) = P(Y_1 \mid \mathbf{x})P(Y_2 \mid Y_1, \mathbf{x}) = P(Y_2 \mid \mathbf{x})P(Y_1 \mid Y_2, \mathbf{x}) \neq P(Y_1 \mid \mathbf{x})P(Y_2 \mid \mathbf{x}) \tag{6}$$

(we use upper case $\mathbf{Y} = [Y_1, Y_2]$ here to make explicit that these inequivalences can hold for any combination $[y_1, y_2] \in \{0, 1\}^2$). Figure 4 provides visual intuition in the form of graphical models. Also note that an in-depth introduction to these concepts is given in the context of multi-label learning by Dembczyński et al. (2012a), and we refer the reader there for a thorough treatment.

Figure 4: Different types of dependence w.r.t. $Y_1$ and $Y_2$, shown here as Bayesian networks: (a) global independence, (b) global dependence, (c) conditional independence, (d) conditional dependence.

Note that in Eq. (6) the second and third terms are equivalent expansions of the joint full conditional (left term). This means that Figure 4d (also Figure 4b), showing dependence, can be equivalently drawn with the arrow pointing from $Y_2$ to $Y_1$. More generally, for $L$ labels, conditional dependence implies

$$P(\mathbf{Y} \mid \mathbf{x}) = P(Y_1 \mid \mathbf{x}) \prod_{j=2}^{L} P(Y_j \mid \mathbf{x}, Y_1, \ldots, Y_{j-1}) \neq \prod_{j=1}^{L} P(Y_j \mid \mathbf{x}) \tag{7}$$

and again here, any permutation of $j = 1, \ldots, L$ is valid. The expansion into conditional distributions is convenient for expression as a Bayesian network, such as those drawn in Figure 2, where directed edges show the conditional dependence. Eq. (7) specifically captures full dependence corresponding to either Figure 2b or Figure 2f (among other possible label orders). Sparser structures can model sparser dependence relations, by replacing $Y_1, \ldots, Y_{j-1}$ with the corresponding parents of node $Y_j$.

Thus far, we have provided the groundwork for motivating classifier chains: performance metrics such as 0/1 loss imply modeling labels together (i.e., linking them in the chain structure). Further, we have briefly reviewed label dependence, a theoretical mechanism through which to view such an approach.
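As a small worked illustration of the contrast between Eqs. (4) and (5) (our own sketch, not from the original paper):

```python
# Exact-match (0/1) loss vs. Hamming loss over a set of predictions.
import numpy as np

def zero_one_loss(Y, Yhat):
    # Eq. (4): an instance counts as wrong unless *all* L labels match.
    return np.mean(np.any(Y != Yhat, axis=1))

def hamming_loss(Y, Yhat):
    # Eq. (5): fraction of individual label mismatches, over N*L entries.
    return np.mean(Y != Yhat)

Y    = np.array([[0, 1, 1], [1, 0, 0]])
Yhat = np.array([[0, 1, 0], [1, 0, 0]])
print(zero_one_loss(Y, Yhat))  # 0.5: one instance is wrong in a single bit
print(hamming_loss(Y, Yhat))   # 1/6 ~= 0.167: only one of six bits is wrong
```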
Although Hamming loss entails the goal of high performance on each label individually and independently (rather than jointly), such a goal should also be a goal of a multi-label method and will typically correlate with it; sometimes the minimizer will even coincide with that of 0/1 loss (Dembczyński et al., 2012a); an obvious example is where classification can be carried out perfectly and all losses are 0.

Table 1: A toy training dataset with empirical distributions. In the conditional distributions specifically exemplified here, test instance $\tilde{\mathbf{x}} = [1, 1]$, and $\hat{y}_1 = 1$. Notice that either $\hat{\mathbf{y}} = [0, 1]$ or $\hat{\mathbf{y}} = [1, 0]$ fulfils Eq. (3) for minimizing 0/1 loss, yet minimizing Hamming loss suggests $[0, 0]$ according to Eq. (2) per label. By comparing $P(\cdot)$ and $P(\cdot \mid \tilde{\mathbf{x}})$ in tables (b), (c) and (d), we find both global dependence and conditional dependence under Eq. (6); but we must recall that these $P$ are only empirical estimations obtained via counting from training dataset (a) ($\mathcal{D} = \{\mathbf{x}^{(i)}, \mathbf{y}^{(i)}\}_{i=1}^{N}$ where $L = 2$, $N = 8$); specifically $P(\mathbf{y}) = \frac{1}{N}\sum_{i=1}^{N}[\![\mathbf{y}^{(i)} = \mathbf{y}]\!]$ and $P(\mathbf{y} \mid \tilde{\mathbf{x}}) = P(\mathbf{y}, \tilde{\mathbf{x}}) / P(\tilde{\mathbf{x}})$ where $P(\mathbf{y}, \tilde{\mathbf{x}}) = \frac{1}{N}\sum_{i=1}^{N}[\![\mathbf{y}^{(i)} = \mathbf{y}]\!][\![\mathbf{x}^{(i)} = \tilde{\mathbf{x}}]\!]$; e.g., $P([1, 0] \mid \tilde{\mathbf{x}}) = (2/8)/(4/8) = 0.5$. These estimates are not necessarily equivalent to either the underlying ground-truth or test data distributions. Note that the tables can be understood in conjunction with Figure 4.

(a) Training dataset:

| $X$ | $Y_1$ | $Y_2$ |
|---|---|---|
| [0, 1] | 1 | 1 |
| [0, 1] | 0 | 0 |
| [1, 0] | 0 | 0 |
| [0, 0] | 0 | 0 |
| [1, 1] | 1 | 0 |
| [1, 1] | 1 | 0 |
| [1, 1] | 0 | 1 |
| [1, 1] | 0 | 1 |

(b) Estimated distributions over $\mathbf{y}$:

| $\mathbf{y}$ | $P(\mathbf{y})$ | $P(\mathbf{y} \mid \tilde{\mathbf{x}})$ |
|---|---|---|
| [0, 0] | 0.375 | 0.0 |
| [0, 1] | 0.25 | 0.5 |
| [1, 0] | 0.25 | 0.5 |
| [1, 1] | 0.125 | 0.0 |

(c) Estimated distributions over $y_j$:

| $y_1$ | $P(y_1)$ | $P(y_1 \mid \tilde{\mathbf{x}})$ |
|---|---|---|
| 0 | 0.625 | 0.5 |
| 1 | 0.375 | 0.5 |

| $y_2$ | $P(y_2)$ | $P(y_2 \mid \tilde{\mathbf{x}})$ |
|---|---|---|
| 0 | 0.625 | 0.5 |
| 1 | 0.375 | 0.5 |

(d) Estimated distributions over $y_2 \mid \hat{y}_1$:

| $y_2$ | $P(y_2 \mid \hat{y}_1)$ | $P(y_2 \mid \tilde{\mathbf{x}}, \hat{y}_1)$ |
|---|---|---|
| 0 | 0.67 | 1.0 |
| 1 | 0.33 | 0.0 |

At this point, however, it is crucial to emphasise that the equivalence in Eq. (6), denoting label dependence, holds only for the ground-truth $P$. In real-world problems, we do not have the true underlying distributions $P$ and instead have to approximate them, typically in a data-driven fashion (learning from the training data). Table 1 shows example empirical distributions obtained by counting co-occurrence frequencies from training data (a minimal sketch of this counting appears at the end of this section). Different base classifiers will embody different approximations, but whichever model class we choose, it is unlikely that these estimated $P$ will be the ground truth. This departure (from an idealized ground-truth model) is furthermore exacerbated by approximate inference (which is increasingly needed for larger numbers of labels, given the exponential increase in label combinations). This issue has major implications regarding the design of classifier chains; it raises important questions regarding their performance and choice of chain structure, and lays the basis for a discussion of the complex relationship between chain structure and performance under different loss functions. We address these questions in the following sections of this article.
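The counting estimates of Table 1 can be reproduced with a few lines (our own illustrative sketch; the values match tables (a) and (b)):

```python
# Empirical joint and conditional label distributions by counting,
# as in Table 1. Purely illustrative.
from collections import Counter

X = [(0,1), (0,1), (1,0), (0,0), (1,1), (1,1), (1,1), (1,1)]
Y = [(1,1), (0,0), (0,0), (0,0), (1,0), (1,0), (0,1), (0,1)]
N = len(Y)

P_y = {y: c / N for y, c in Counter(Y).items()}          # P(y)
x_t = (1, 1)                                             # test instance x~
match = [y for x, y in zip(X, Y) if x == x_t]            # rows with x == x~
P_y_given_x = {y: c / len(match) for y, c in Counter(match).items()}

print(P_y)          # e.g. P([0,0]) = 0.375, P([1,1]) = 0.125
print(P_y_given_x)  # P([0,1]|x~) = P([1,0]|x~) = 0.5 (others are 0)
```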
3. How Classifier Chains Work

In this section, we explore the mechanism of classifier chains, looking to explain the reasons for their efficacy on multi-label problems. As with most machine learning methods, there are many angles from which to approach and view classifier chains. We elaborate on two of these in the following subsections (similarities to other methods will be approached in Section 5). We choose these two because they illustrate two aspects leading to the success of chaining. In particular, we choose the first (Section 3.1) on account of its appealing probabilistic interpretation and how this relates to strong predictive performance, particularly under 0/1 loss. The second (Section 3.2) is mainly about explaining the strong performance of chains under Hamming loss, and about improving performance in general, by producing a stronger feature representation via chained labels. Indeed, if a multi-label concept is well modeled, predictive performance will be strong across many loss metrics, or even (as mentioned in Section 2) equivalent. Metrics may also be blended or used as surrogates for each other (Park & Read, 2018).

In this section, we consider the fully-cascaded chain where each $j$-th base classifier provides prediction

$$\hat{y}_j = h_j(\mathbf{x}, \hat{y}_1, \ldots, \hat{y}_{j-1}) \tag{8}$$

but this can easily be generalized to any directed acyclic structure by replacing indices $1, \ldots, j-1$ with the indices corresponding to the parent nodes of $y_j$. To build $h_j$, one uses instances $\{[\mathbf{x}^{(i)}, y^{(i)}_1, \ldots, y^{(i)}_{j-1}]\}_{i=1}^{N}$ formed from the training data as input, and the corresponding $\{y^{(i)}_j\}_{i=1}^{N}$ as output. From this perspective, binary base classifiers can then be trained in an off-the-shelf manner. We remind the reader that, conversely, predictions $\hat{y}_j$ (or their estimated probabilities) are not used (or known) at training time (refer to Figure 3 for illustration); at prediction time they are imputed recursively by the model via Eq. (8).

It has been noticed by Senge et al. (2013), and is worth remarking here, that this common way of training classifier chains, using the training labels as inputs, breaks a common assumption of statistical learning, namely that the distributions of the training data and test data should be identical. This is because at test time the predicted labels are used as inputs, yet these are not from the same distribution as the training data, as that distribution is not known and can only be approximated by the predictive distribution used on test data. Further discussion of this special case is deferred to Section 5.3.

3.1 Classifier Chains as Probabilistic Models

The formalization of probabilistic classifier chains was proposed by Dembczyński et al. (2010). The training process is identical to the standard formulation, but an additional requirement of the base classifiers $h_j$ is that they have a probabilistic interpretation (at least in the loose sense of a prediction in $[0, 1]$ that could be understood as a confidence):

$$\hat{y}_j = h_j(\mathbf{x}) := \operatorname*{argmax}_{y_j \in \{0,1\}} P(y_j \mid \mathbf{x}, y_1, \ldots, y_{j-1}) \tag{9}$$

With these models (or, more precisely, their probabilistic components), inference can be phrased as a maximum a posteriori (MAP) estimate, where Eq. (3) is developed as follows:

$$\hat{\mathbf{y}} = \operatorname*{argmax}_{\mathbf{y} \in \{0,1\}^L} P(y_1 \mid \mathbf{x}) \prod_{j=2}^{L} P(y_j \mid \mathbf{x}, y_1, \ldots, y_{j-1}) \tag{10}$$

This corresponds to a minimization of the 0/1 loss (Eq. (4)), since the MAP estimate of Eq. (10) is the minimizer for this loss. Note, particularly, the expansion of the joint probability from Eq. (7) in correspondence with Figure 2b. Figure 5 illustrates this idea as a probability tree where each path from root to leaf represents one combination of labels $\mathbf{y} \in \{0, 1\}^L$, associated with probability $P(\mathbf{y}|\mathbf{x})$, factored across branches as per Eq. (10). Under this view, deterministic inference in classifier chains, as originally proposed by Read et al. (2009), follows a single path greedily through the tree. This is an inexpensive and arguably crude approximation of Eq. (10),
to the extent that one can only talk of the method being a mode-seeker (Dembczyński et al., 2010); that is to say, it seeks out the MAP estimate¹ in a greedy way, but is not guaranteed to find it. On the other extreme, exploring all $2^L$ paths provides a Bayes-optimal inference solution, corresponding to the highest payoff (minimal 0/1 loss). While being a faithful evaluation of Eq. (10), this exhaustive search is generally intractable, provoking the application of tree search methods to efficiently trial a subset of paths, thereby approximating the best solution at reduced cost: for example, Monte Carlo search, $\epsilon$-approximate search, beam search, or even A* search (the work by Mena et al., 2016 provides a comprehensive survey, with detailed references and empirical results). Note also that, although typically set up to minimize 0/1 loss, probabilistic classifier chains can be used with any loss function (Dembczyński, Waegeman, & Hüllermeier, 2012b), including Hamming loss, unlike the original greedy inference, which seeks out the 0/1-loss minimizer in a non-optimal way. We can also see that goodness-of-fit depends on chain structure, since if it is reduced to that of Figure 2a, chains will generally not be able to minimize 0/1 loss because label dependencies cannot be modeled. We defer a deeper discussion on chain structure to Section 4.

¹ One may remark that, with regard to the first label in the chain, $y_1$ (but only that one), it can be said to be minimizing Hamming loss.

Figure 5: A probability tree corresponding to a fully-connected classifier chain (like Figure 2b) over three labels. The marginal $p(y_j \mid \mathbf{x}, \ldots)$ is shown on branches from level $j-1$ to $j$, whereas $P(\mathbf{y}|\mathbf{x})$ is given for each full path. Note that in this example the nodes (labels) of the best path, $P([0, 1, 0]|\mathbf{x}) = 0.288$, are not the same as those taken by standard (greedy) classifier chains, which picks $P([1, 0, 1]|\mathbf{x}) = 0.268$. Note also that there are $2^L$ paths in total ($2^3 = 8$ in this case).

Although classifier chains may be configured to optimize different losses, by default when making a prediction (supposing adequate connectivity and no steps taken to the contrary) they are seeking a minimization of 0/1 loss (greedily, exhaustively, or at some trade-off) for that prediction, using the given structure and test instance. This explains the typically high performance under this and related metrics. Recall that, although the base models are induced independently (i.e., the training process for $h_j$ is not dependent on the outcome of training $h_1, \ldots, h_{j-1}$ and can be carried out separately and even in parallel, including for the probabilistic chains), this is not the case at test time, where a model's inference procedure is explicitly dependent on the inference of the previous labels. Moreover, training is done iteratively over labels, typically leading to a 0/1-loss minimization or some approximation thereof.

Given a particularly useful/powerful (and possibly large) set of input features $\mathbf{x}$, the additional predictive benefit of the extra label features may be negligible; this is an informal way of stating that the labels are approximately conditionally independent of each other. In such a scenario, the minimizers for 0/1 loss and Hamming loss are equivalent, and the multi-label nature of the problem ceases to exist. However, it also implies that the problem is easy; studies invariably measure some form of conditional label dependence, as we commented at the beginning of this section.
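The contrast between greedy and exhaustive (Bayes-optimal) inference can be sketched as follows; `prob(j, x, y_prev)` is a hypothetical stand-in for a trained base classifier's estimate of $P(y_j = 1 \mid \mathbf{x}, y_1, \ldots, y_{j-1})$, so this is an illustration rather than a complete implementation:

```python
# Greedy (single-path) vs. exhaustive (2^L-path, MAP) inference in a
# probabilistic classifier chain, following Eq. (10).
from itertools import product

def greedy(prob, x, L):
    y = []
    for j in range(L):
        p1 = prob(j, x, tuple(y))
        y.append(1 if p1 >= 0.5 else 0)   # follow the single most likely branch
    return tuple(y)

def exhaustive(prob, x, L):
    best, best_p = None, -1.0
    for y in product([0, 1], repeat=L):   # all 2^L label combinations
        p = 1.0
        for j in range(L):
            p1 = prob(j, x, y[:j])
            p *= p1 if y[j] == 1 else (1.0 - p1)
        if p > best_p:
            best, best_p = y, p
    return best, best_p                   # the MAP estimate: 0/1-loss minimizer
```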
3.2 Classifier Chains as Neural Networks

One might assume that if (probabilistic) classifier chains are set up to minimize 0/1 loss, they will not show statistically significant improvement compared to independent classifiers in cases where all label concepts are evaluated independently, such as under Hamming loss. However, it is widely shown that classifier chains do in fact often outperform independent classifiers when labels are evaluated independently, even when the base model class is identical (see, for example, the empirical studies by Madjarov et al., 2012; Moyano et al., 2018; Rivolli et al., 2020). This apparent contradiction is resolved under a different conceptualisation of the base classifiers. If we consider that models $h_1, \ldots, h_{j-1}$ are part of the $j$-th model $h_j$, then the performance gain can be explained in terms of earlier labels offering themselves as a feature-space expansion for later labels in the chain (Dembczyński et al., 2012b; Read & Hollmén, 2017). Figure 6 makes this explicit by showing classifier chains as a feed-forward multi-layer neural network (where base classifiers take the place of activation functions). The result is comparable to the way latent nodes behave inside a standard multi-layer neural network. One can recall the case of modeling the xor function, which is very well known in the neural network community (see, for example, Goodfellow, Bengio, & Courville, 2016) for requiring a hidden layer for correct modeling. Figure 7 contrasts a typical failed attempt of a linear classifier to achieve separability of points w.r.t. their xor class labels vs. a successful solution in a classifier chain, where separation is achieved by leveraging an earlier label prediction (the or label, in this example). We further point out that even a base classifier with a linear decision boundary provides a non-linearity via its decision function. For example, logistic regression has a sigmoid function, and uses a threshold to map into a discrete class output, $\hat{y}_j \in \{0, 1\}$.

Figure 6: Greedy inference in a classifier chain as a forward pass in a neural network. Base classifiers take the place of activation functions into $y_j$, and identity functions map to $z_k$, e.g., $z_3 = f(y_1) = y_1$ (these can also be seen as delay nodes). Note that $z$-variables are thus either vectors or scalars depending on input. There is no implication of back-propagation in the training process. Note that $y_1$ and $y_2$ can be viewed as part of the model for predicting $y_3$.

Figure 7: The xor label cannot be separated by a linear decision boundary in $\mathbf{x} \in \mathbb{R}^2$ space (left panel; model of $Y_{\text{xor}} \mid x_1, x_2$), but can be separated with an expansion to the feature space in the form of the or label (right panel; model of $Y_{\text{xor}} \mid Y_{\text{or}}, x_1, x_2$, with the extra label on the vertical axis). Note that jitter has been added to the points $\mathbf{x} \in \{0, 1\}^2$ for purposes of illustration.

If classifier chains may be considered as a deep neural network of $L + 1$ layers, this is only in respect of the forward pass. A full analogy to neural networks is limited by the fact that the nodes of a classifier chain are not latent nodes in the true sense; they are already exemplified in the training set as training labels. We cannot claim deep learning, as we do not back-propagate through the base classifiers.
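The xor example of Figure 7 can be made concrete with a small sketch (ours; scikit-learn is an assumed choice): logistic regression alone cannot fit $Y_{\text{xor}}$, but with the or label appended as a chained feature the expanded space is linearly separable.

```python
# The xor label is not linearly separable from x alone, but becomes
# separable once the "or" label is appended as a feature (Figure 7).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_or  = np.array([0, 1, 1, 1])
y_xor = np.array([0, 1, 1, 0])

flat = LogisticRegression().fit(X, y_xor)
print(flat.score(X, y_xor))                  # < 1.0: xor is not separable

X_aug = np.hstack([X, y_or.reshape(-1, 1)])  # chain: the or label as a feature
chained = LogisticRegression(C=1e4).fit(X_aug, y_xor)
print(chained.score(X_aug, y_xor))           # 1.0: separable in expanded space
```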
Hence, one may argue that classifier chains should be considered as feature-space expanders rather than neural networks; however, an analogy with elementary basis-function expansion is also limited, since the formation of these functions (in chains) is data driven rather than expert driven. A deep network has been leveraged and extended with synthetic/artificial labels and traditional hidden layers, for example, by Read and Hollmén (2017), and Cisse, Al-Shedivat, and Bengio (2016); this can be seen as developing a network that is deep in the label space (as opposed to in the input space). In particular, the idea of adding synthetic labels to a chain blurs the line with the concept of basis expansions (arbitrary non-linear functions, often polynomials or radial functions). We refer again to this discussion later, in Table 3.

It is also possible to pass probabilistic information from the marginal posterior on labels down the chain instead of a hard classification, where each node emits real-valued probabilities $P(Y_j = 1 \mid \mathbf{x}, \ldots) \in [0, 1]$ instead of $\hat{y}_j = h_j(\mathbf{x}) = \operatorname{argmax}_{y_j \in \{0,1\}} P(Y_j = y_j \mid \mathbf{x}, \ldots)$, according to the argument that this additional information on prediction uncertainty/confidence is more expressive than a binary value. Under the view as a forward pass through a neural network, this is a question of the activation function/non-linearity represented by each base classifier.

Activation functions in a neural network are inherently simplistic, since it is the greater network that embodies the necessary complexity. In a related way, a classifier chain (i.e., a network of base classifiers) is relatively less effective if we select a powerful non-linear model class for these classifiers, such as decision trees or even ensemble models (Read & Hollmén, 2017); the stronger base learners make much of the connectivity in the chain redundant. This is particularly so under Hamming loss, where the gain of chains compared to independent models may in theory be reduced to zero. However, in practice the effect varies greatly depending on the dataset and base classifier parametrization (Rivolli et al., 2020). This is an important issue in classifier-chain design. In any case, this question of connectivity leads us to the general question of how specifically to order or structure the chain around its label nodes.

4. The Question of Chain Structure

Chain structure is a major issue, because it affects major aspects of the classification model including, notably, predictive performance, computational scalability, and interpretation. Note that we consider a variation in the order of nodes a specific variation of structure (there is no need to consider them separately, as in some work). The full-chain factorization (as in Eq. (10)) is in fact valid and equivalent for any order of labels (see also Eq. (6) in Section 2); however, as explained earlier, this refers only to the case where $P$ is the ground-truth distribution and the inference is exhaustive. In practice (where these conditions do not hold) it is found that chain order has a significant effect. This has been established empirically in the literature (for example, by Read et al., 2014, among others) and we additionally illustrate it here with the results in Figure 8: the variability of performance among different chains is clearly exhibited.
Figure 8: Jaccard score on the Emotions data for classifier chains (base classifier: logistic regression) for the first 45 of the 720 possible chain orders compared (orders [1 2 3 4 5 6] through [1 3 6 4 2 5]). Also included are the Jaccard scores obtained using independent classifiers and an ensemble of all 45 chains, where the latter predicts the majority labels across all 45 chains.

Discussions of chain order frequently relate to the issue of error propagation (Senge, del Coz, & Hüllermeier, 2014), a phenomenon related to the challenge faced in hidden Markov models known as the label-bias problem (Dietterich, 2002), where it denotes the effect of high uncertainty at some part of the chain leading to an error, which in turn propagates down the chain, causing further errors. There are numerous factors implicated in this uncertainty. One obvious factor is the length of the chain, and here there are several aspects to notice: a longer chain is inherently more susceptible to propagating error (simply because of the greater length for an error to propagate over); yet conversely, if this chain is heavily cascaded/connected, it also offers a potentially more powerful expansion of the feature space (by offering an increasing number of input nodes to leverage; recall Section 3.2). Naturally, a longer chain is much more computationally challenging to manage, with the number of possible predictions increasing exponentially and the number of possible chain structures increasing hyper-exponentially (w.r.t. length/the number of labels $L$).

A sparse DAG is attractive because it simplifies the model, making it more efficient and scalable, and it can even improve predictive performance (because it may reduce error propagation by reducing propagation in general). Enforcing sparsity may also yield a more interpretable model. Unfortunately, although such a hypothetical DAG is efficient to use, finding it (as well as the chain order) is a question of combinatorial complexity, and the possible structures cannot be exhaustively trialed for more than a dozen or so labels: just the 6 labels of the benchmark Emotions data (considered in Figure 8 w.r.t. Jaccard score²) imply a total of 720 different chain orders, and 32,768 unique DAGs.

Due to the different factors involved (accuracy, efficiency, interpretation, ...), the motivation for structure search is often not clearly linked to any particular factor.

² Also known as the Jaccard index, or Jaccard similarity coefficient; considered alongside Hamming and 0/1 metrics by, for example, Park and Read (2018); higher is better.
In the following, we focus on predictive performance (which arguably has motivated the largest number of related studies on classifier chain order), additionally treating efficiency and interpretation with regard to it as a question of tradeoff, and we discuss different options for structure search. In order to facilitate this discussion and the different issues that arise, we consider the example of Figure 9, featuring a toy problem with three labels, each representing a logical operation on the two binary inputs (recall also the related Figure 7).

Figure 9: A toy problem, with two bits as input, $\mathbf{x} \in \{0, 1\}^2$, and three labels that represent logical operations on these bits. Shown are 3 of the 8 possible directed structures among the three nodes (note that the order changes). The suitability of each depends on the base classifier: a non-linear classifier like a decision tree will solve all labels on any structure, whereas logistic regression is only suitable on the middle structure with greedy inference, and also on the fully-connected structure (right) with exhaustive inference.

4.1 Random Ensembles

Using ensembles of randomly ordered chains was an early approach (Read et al., 2011) that proved effective in a similar sense to other methods that induce diversity among ensemble members and then combine predictions, such as bagging (Breiman, 1996), and a similar bias-variance tradeoff analysis applies. Bagging is typically carried out using an unstable learning algorithm such as a decision tree inducer. However, even if base models are not necessarily unstable, imposing a random order on each chain model helps to achieve this effect. This is related to, for example, imposing random structures on neural networks to make them more diverse when used in an ensemble. The averaging effect of the ensemble thus reduces the variance component of the error caused by this randomness. Note, however, that this strategy essentially mitigates potentially poor chain orders rather than identifying a good order (as seen in Figure 8, where the ensemble of chains is an average performer compared to the best and worst chain orders). A minimal sketch of this scheme is given below.
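The following sketch reuses the illustrative `ClassifierChain` class from Section 1; the per-label majority vote shown is one simple combination rule among several possibilities:

```python
# An ensemble of randomly ordered chains (Section 4.1), with per-label
# majority voting. Assumes the ClassifierChain sketch defined earlier.
import numpy as np

def fit_random_chain_ensemble(X, Y, M=10, seed=0):
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(M):
        order = rng.permutation(Y.shape[1])            # a random chain order
        members.append((order, ClassifierChain().fit(X, Y[:, order])))
    return members

def predict_ensemble(members, X, L):
    votes = np.zeros((X.shape[0], L))
    for order, chain in members:
        Yhat = chain.predict(X)
        votes[:, order] += Yhat                        # undo the permutation
    return (votes / len(members) >= 0.5).astype(int)   # majority per label
```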
4.2 Using a Label Hierarchy

Hierarchical classification has long been of interest in the multi-label community (Kiritchenko, Matwin, Nock, & Famili, 2006; Ramírez-Corona, Sucar, & Morales, 2014). A hierarchy designed by a domain expert will almost certainly incorporate some form of label dependence, and is available for use by an algorithm even prior to seeing the data. However, such hand-built hierarchies and other structures (if they are available) are usually designed for human interpretation of the data rather than to increase classifier performance. Indeed, a model class has typically not even been considered at the time of the hierarchy design. Top performance on hierarchical classification problems can be obtained with chain-based methods that completely ignore the hierarchy (Puurula et al., 2014), indicating that such a consideration is not always necessary. We do emphasise that results may differ if the hierarchy explicitly forms part of the loss metric, but this goes beyond the scope of this work. In other cases, making use of a hand-built/domain-expert hierarchy can be considered a special case of generating structure from label dependence, which we discuss in the following, Section 4.3 (in the context of learning the dependence from the data). Later, in Section 5.5, we discuss hierarchical tree-based methods in the context of related work.

4.3 Structure from Label Dependence

A large part of the classifier-chains literature promotes structuring the chain according to label dependencies discovered in the data. This idea is well founded; it would be inefficient to place a chain structure over a set of labels that are independent of each other. It is also attractive in the sense that measuring dependence between variables is a highly studied problem for which many statistical tools exist. A common recipe for classifier-chain-based methods (and other methods in the wider multi-label literature) has been to 1) measure label dependence, 2) use the dependence measurements to create a structure (often a sparse one, for efficiency reasons), and 3) deploy base classifiers and an inference method of choice. This approach is taken by Zhang and Zhang (2010), Read et al. (2015), Kajdanowicz and Kazienko (2013), Jun et al. (2019), and dozens more (many others are cited by those authors). However, although it may be intuitively obvious that dependence is connected to structure, exploiting this intuition successfully requires further thought.

A simple approach is to measure global label dependence by counting label co-occurrences, and to derive a structure from this, where high dependence favors connectivity. An initial inconvenience is the difficulty of obtaining any representative sparse structure, due to the almost ubiquitous inter-dependence of multi-label data. A second issue is that any global dependence detected among labels in the dataset does not imply a corresponding advantage in connecting these labels in a chain structure at testing time, where input instances are observed (the reader may recall the discussion on global versus conditional dependence in Section 2, and in particular Table 1).

Measuring conditional label dependence (dependence under observation of input $\mathbf{x}$) implies building models. Probabilistic classifiers $h_j$ (recall Eq. (9)) incorporate a mechanism for obtaining an approximation $P(\cdot \mid \mathbf{x}, \cdot)$ that can be used to check for conditional dependence against Eq. (7). This immediately raises two questions: firstly, of the computational complexity of building these models (as required for each measurement), and secondly, of the effectiveness of the chosen model class for the particular problem; different base classifiers will provide different approximations of the ground-truth $P$.

An efficient way of estimating conditional label dependence was described by Zhang and Zhang (2010). It requires building only one classifier per label (and is thus linear in $L$) and thereafter only incurs the computational overhead of computing pairwise statistical measurements of dependence between their errors. This is because the individual errors $\epsilon_j = \ell(y_j, \hat{y}_j)$ (of each $j$-th label, under loss function $\ell$) are conditioned on the instance via the prediction $\hat{y}_j = h_j(\mathbf{x})$. This approach is therefore relatively affordable; a sketch is given below.
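The following sketch is in the spirit of this approach; the use of error correlation as the pairwise dependence measure is our own simplification, not necessarily the exact statistic of Zhang and Zhang (2010):

```python
# Estimate conditional label dependence cheaply: fit one classifier per
# label, then correlate per-label errors pairwise. High correlation hints
# at conditional dependence. Purely illustrative.
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def error_dependence(X, Y, base=LogisticRegression()):
    L = Y.shape[1]
    # eps[:, j] = 1 where the independent model for label j errs (0/1 loss).
    eps = np.column_stack([
        (clone(base).fit(X, Y[:, j]).predict(X) != Y[:, j]).astype(float)
        for j in range(L)
    ])
    # Pairwise error correlation matrix (L x L). Note: a label predicted
    # perfectly gives a constant column and hence an undefined correlation;
    # held-out predictions would be preferable in practice.
    return np.corrcoef(eps, rowvar=False)
```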
The model class of the base classifiers has a huge effect on the ideal chain structure. Consider, for example, the toy problem in Figure 9, having logical operations as labels (relating also to Figure 7, as discussed in Section 3.2): if classifiers $h_j$ learn to predict their labels perfectly given any pair of input bits $\mathbf{x}$, then there is no point in connecting any labels in a chain (conditional independence holds). But of course, a perfect model cannot usually be obtained in practice. For example, logistic regression fails to model the $Y_{\text{xor}}$ concept correctly on its own, which may show up in measurements of dependence, and thus, according to the argument of structure from dependence, chain connectivity should be increased around this label.

However, we remark that the order of labels in the chain (its directionality) cannot be determined from measurements of statistical dependence. In Figure 9 it is apparent that the order in which or and xor are given may be crucial to the success of the problem (particularly when greedy inference is employed). Figure 8 also illustrates numerous cases where a minor change in label order has a significant effect on accuracy; for example, notice the considerable drop in accuracy from swapping labels 4 and 5 at the end of the chain. To take this aspect into account, Zaragoza et al. (2011) create a connected structure, obtained from measurements of label dependence, and then convert it into a number of trees with different directionality across each one (namely, one tree per label, with each label used once as a root node). As an ensemble method, it incurs the relative advantages and disadvantages of such an approach (see Section 4.1). In particular, having so many models can be unsatisfactory in terms of computational complexity and interpretability (notably, it does not shed any light on causality).

Overall, label dependence measurements can potentially provide interesting insights into interpreting the data (just as feature dependence is often measured in single-label data to obtain this information) from a correlation point of view. Conditional dependence can hint towards structures that might be useful for prediction, but this is ultimately dependent on the selection of base classifiers and inference schemes, both of which are often plugged in after a structure is set. As such, dependence cannot reveal causality, and it cannot unveil the most effective directionality (order of labels) across the chain.

4.4 Order via Base Classifier Accuracy

Heuristics can also be designed around the expected predictive performance of base classifiers, i.e., their accuracy: labels that are easier to model (i.e., have relatively higher predictive performance) can be placed near the beginning of the chain, supposing that these models are less likely to generate errors (which would be propagated down a relatively longer part of the chain). Therefore, under this view, weaker classifiers are placed nearer the end of the chain, where they have less influence. Studies have been carried out by Enrique Sucar et al. (2014) and Kajdanowicz and Kazienko (2013); this is discussed also by Senge et al. (2013). One should note that although this approach suggests an order across labels (unlike label dependence), it does not lead to a general solution for structure. Neither is there any consensus in the literature over whether this approach is effective. Indeed, very poor predictions for a particular label may potentially serve as excellent features for other predictions. In view of this, such labels should be placed nearer the beginning of the chain
to serve as good feature expansions (as implied in Section 3.2), even if their own predictions are poor.

4.5 General Search for Structure

Although good arguments are given in the literature for the above-discussed factors behind chain-order performance, in real-world data the different factors are convoluted and may interact with and counteract each other. This is apparent when we take a closer look at per-label accuracy, as in Table 2. Therefore a single heuristic may not be globally effective.

Table 2: A selection of models $\{m\}$ (indexed from left to right in Figure 8), with associated Jaccard score (also as in Figure 8) alongside the respective per-label accuracy ($A_j$) for all labels $j = 1, \ldots, 6$. Moving a label forwards or backwards in the chain may have either a positive or negative outcome for the labels involved, or make no difference at all; and per-label changes may counteract each other so as to be invisible under the overall score associated with a structure.

| $m$ | Chain Order | Jaccard | $A_1$ | $A_2$ | $A_3$ | $A_4$ | $A_5$ | $A_6$ |
|---|---|---|---|---|---|---|---|---|
| 5 | [1 2 3 6 4 5] | 0.52 | 0.81 | 0.70 | 0.73 | 0.93 | 0.82 | 0.74 |
| 6 | [1 2 3 6 5 4] | 0.54 | 0.81 | 0.70 | 0.73 | 0.93 | 0.86 | 0.74 |
| 17 | [1 2 5 4 6 3] | 0.53 | 0.81 | 0.70 | 0.73 | 0.94 | 0.81 | 0.73 |
| 18 | [1 2 5 6 3 4] | 0.53 | 0.81 | 0.70 | 0.71 | 0.91 | 0.81 | 0.77 |
| 20 | [1 2 6 3 5 4] | 0.57 | 0.81 | 0.70 | 0.75 | 0.95 | 0.84 | 0.77 |
| 42 | [1 3 5 6 4 2] | 0.57 | 0.81 | 0.72 | 0.74 | 0.94 | 0.83 | 0.76 |

If our aim is to find an optimal chain structure under a particular loss function, given a particular base classifier and inference configuration, at any cost, then we may approach the task directly as a trial-and-error search through the space of all possible structures. This task is already relevant in Bayesian network structure learning (Gasse, 2017). Finding the optimal structure this way is an NP-hard problem, but many options already exist for tackling it, for example: local search, simulated annealing, and other hill-climbing and evolutionary methods; and many of these have been adapted specifically to classifier chains, for example, by Kumar et al. (2013), Read et al. (2014), and Goncalves et al. (2013). Gasse (2017) gives a thorough treatment of the general application to multi-label learning. Figure 10 illustrates a simple example using tree search.

Formally, letting $\theta$ represent the parametrization of the structure of a chain classifier (for which the base model class is pre-selected), and $\mathcal{J}(\theta)$ the payoff (accuracy) incurred by this model, we are searching for

$$\theta^* = \operatorname*{argmax}_{\theta \in \Theta} \mathbb{E}[\mathcal{J}(\theta)] \tag{11}$$

where the space $\Theta$ contains all possible structures, of hyper-exponential size w.r.t. $L$, namely $2^{L(L-1)/2}$ DAGs (or $L!$ orders given a fixed structure). Note that, of course, this becomes an argmin when considering loss metrics $\ell$.

Aside from the size of the space, we note that each proposed step/trial $\theta \in \Theta$ in the search requires the approximation of an expectation, because we want parametrization $\theta$ to perform well in general on unseen test data.

Figure 10: An example of searching for chain order via tree search, over $L = 4$ labels (e.g., root $[1, 2, 3, 4]$: 0.48, expanding to children such as $[2, 1, 3, 4]$: 0.50 and $[1, 2, 4, 3]$: 0.55, down to $[1, 4, 3, 2]$: 0.60). At each node the chain order $\theta$ is shown with its associated expected payoff $\mathbb{E}[\mathcal{J}(\theta)]$ (as in Eq. (11)). Each branch is associated with a measurement of this estimated payoff, implying a slow expansion. Nevertheless, as is typical, a small search can render an important improvement in predictive performance, even if no clear pattern emerges with regard to chain order.
An average over internal $k$-fold cross-validation results using the training data can be used in its place; a task which requires building $k$ copies of a classifier chain (which itself consists of $L$ base classifiers) at each step, and which implies a difficult tradeoff: more cross-validation (i.e., more folds) may better approximate the expectation (and facilitate a more precise search), but using $k$ folds implies a factor of $k$ in running time. On the other hand, using fewer folds implies less computational expenditure, but higher variance in the estimated performance. This challenging scenario is known in other tasks, and many tools are available; Powell (2019), for example, provides a synthesis of techniques and applications.

Even with state-of-the-art tools and computational resources, the optimal $\theta^*$ (satisfying Eq. (11)) may never be found and we will have to make do with some approximation $\hat{\theta}$, such that (we hope) $\mathcal{J}(\hat{\theta}) \approx \mathcal{J}(\theta^*)$. Luckily, in practice, the surface of the function $\mathcal{J}(\theta)$ over $\theta$ is often undulating, with many local maxima that yield good results (Read et al., 2014), and therefore off-the-shelf searches yield a good local maximum $\hat{\theta}$ (corresponding to an effective chain) relatively quickly. Figure 8 also shows an indication of this. Performing multiple searches in parallel (from different points in the space) additionally yields an effective ensemble (where all members rest on or close to some local maxima).

In addition to off-the-shelf textbook methods, one can make particular adaptations of a search that are useful in the context of classifier chains. Namely, one can take advantage of the fact that, under greedy inference, a search in chain space can be made increasingly faster by freezing the chain along its directionality. In other words, a simulated-annealing type scheme where proposed structures differ from the ones trialled already only near the end of the chain, and increasingly so over time. This speed-up is on account of only having to retrain classifiers which are down-chain from the first modified node; the first part of the chain can be reused from earlier trials (Read et al., 2014). An illustration is given in Figure 11, and a sketch of a simple order search follows.

Figure 11: Searching for a chain structure or order involves trialing a series of proposals. For example, two labels can be swapped in their positions to propose a new order. This figure shows a distribution for selecting such labels $j \in \{1, \ldots, 6\}$ (under $\beta = 1$), which is modified over time steps $t = 0, 1, \ldots, 5, \ldots, 10$ of the search, at a speed proportional to $\beta$. The search becomes cheaper over time, since altering the chain from point $j$ only involves retraining base models $j + 1, \ldots, L$. Based on Read et al. (2014).
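The following sketch shows a simple hill-climbing variant of such a search; the `payoff` callable stands in for internal cross-validated accuracy (approximating $\mathbb{E}[\mathcal{J}(\theta)]$ in Eq. (11)), and the swap proposal with greedy acceptance is a simplifying assumption rather than the exact scheme of the cited works:

```python
# Local search over chain orders: propose pairwise swaps, score each
# proposal by (cross-validated) payoff, and keep improvements.
import numpy as np

def search_order(payoff, L, steps=50, seed=0):
    rng = np.random.default_rng(seed)
    order = list(range(L))
    best = payoff(order)
    for _ in range(steps):
        a, b = rng.choice(L, size=2, replace=False)
        proposal = list(order)
        proposal[a], proposal[b] = proposal[b], proposal[a]  # swap two labels
        score = payoff(proposal)          # approximation of E[J(theta)]
        if score > best:
            order, best = proposal, score # hill climbing: keep improvements
    return order, best
```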
By imposing a fixed structure, the search space is reduced to simply the order in which labels appear within that structure. Read et al. (2015), for example, use a relatively sparse trellis structure. Teisseyre (2017) employs L1 regularization to prune a fully-connected directed structure to a sparse one. The latter approach is particularly efficient, as the structure search is carried out as an integral part of the learning algorithm, albeit with a loss of flexibility, since this method does not work generally for all types of regularization and base classifiers.

The search for chain structure is often approached exclusively in the model-building (training) phase, prior to inference on test instances. After all, the structure is an integral part of a classifier chain model, and one can obtain a single structure which works globally well on the test set. On the other hand, it must be emphasised that even an exhaustive search will not necessarily uncover a single ground-truth representation $\theta^*$ (in Eq. (11)). In any case, $\theta^*$ is not necessarily unique and, moreover, may differ among evaluation metrics and types of instances. Since metrics $\mathcal{J}$, and particularly test instances, are not always observed at training time, it can also be appropriate to move the structure search to inference time, as follows, w.r.t. the MAP estimate, extended from Eq. (10):

$$\hat{\mathbf{y}} = \operatorname*{argmax}_{\mathbf{y} \in \{0,1\}^L,\ \theta \in \Theta} P(\mathbf{y} \mid \mathbf{x}; \theta) \tag{12}$$

where $P(\cdot \mid \cdot\,; \theta)$ implies traversing the probabilistic representation specified by the structure $\theta$ (a particular classifier chain arrangement).

Naturally, conducting the full search indicated by Eq. (12) is impractical even for modestly sized label sets, since it requires training multiple models per test instance. However, if we have already carried out a search at training time for a globally-performing structure, then we have already trialed a set of structures $S = \{\theta_1, \ldots, \theta_M\}$. Replacing $\Theta$ with $S$ in Eq. (12) yields a dynamic search that incurs only an additional factor $M$ of running time per test instance, the same as an ensemble of $M$ models, yet with often better results (Read et al., 2014; da Silva, Gonçalves, Plastino, & Freitas, 2014; Narassiguin, Elghazel, & Aussem, 2017; Pinto, Soares, & Mendes-Moreira, 2016). A sketch of this dynamic selection is given below.
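In the sketch, `map_path` is a hypothetical helper returning a chain's MAP prediction and its probability for one instance (e.g., via the inference sketch in Section 3.1); it is not part of any named library:

```python
# Dynamic structure selection at inference time (Eq. (12), with the search
# space restricted to M already-trained chains S = {theta_1, ..., theta_M}):
# each chain proposes its own MAP path, and the highest-probability one wins.
def predict_dynamic(chains, x, map_path):
    best_y, best_p = None, -1.0
    for chain in chains:                  # one trial per trained structure
        y, p = map_path(chain, x)         # (y_hat, P(y_hat | x; theta))
        if p > best_p:
            best_y, best_p = y, p
    return best_y
```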
Even though investing in structure search can pay off in terms of global and local predictive performance (regarding the test set, and individual test instances, respectively), the challenges and instability associated with this search have inspired researchers to avoid the issue altogether. Indeed, increasingly promising efforts have given rise to effective methods which are similar in approach to classifier chains, but avoid the question of chain structure. We look at some of these in the following section, and discuss the relative disadvantages that this approach incurs.

5. Related Methods

Having elaborated prediction with classifier chains as probabilistic inference (Section 3.1) and as a feed-forward pass of a neural network (Section 3.2), close connections to other methods in these and related areas inevitably come to light.

5.1 Probabilistic Graphical Models

The probabilistic view of classifier chains revealed that a classifier chain is in fact a type of probabilistic graphical model. It can be seen as a particular case of a conditional random field (Dembczyński et al., 2012b), or a maximum entropy Markov model or a hidden Markov model (Read et al., 2017), or another variety of graphical model, depending on the chain structure and inference method chosen. Note that Markov models specifically imply the Markov assumption, which is not a typical restriction in chains. Also, greedy inference is considered the standard option for classifier chains, i.e., a single directed pass, but typically more rigorous (even if still approximate) inference is carried out in, e.g., conditional random fields (CRFs), and specifically to minimize log loss.

A main point of departure from classifier chains, at least as we have defined them in this work, is when directionality is removed from (or, equivalently, bi-directionality is imposed upon) the edges connecting label nodes. Such a graph of binary classifiers (see Figure 12a) was proposed by Guo and Gu (2011). Compared to a directional chain, the training procedure is simplified: each binary classifier takes the output of all other classifiers as additional input, so the question of label order is no longer in consideration, nor is the computational expenditure required to resolve it. However, the prediction/inference phase is significantly more expensive: single-pass greedy inference is not possible, or at least not effective; rather, hundreds or thousands (or more) of iterations of Gibbs sampling may be required for each test instance. For larger labelsets, the question of structure (even if not directionality) is still relevant, since sparsity becomes necessary for tractability. Essentially, for each test instance x, this approach relies on taking samples

$$y_j^{[t]} \sim p(y_j \mid x, y_1^{[t]}, \ldots, y_{j-1}^{[t]}, y_{j+1}^{[t-1]}, \ldots, y_L^{[t-1]}) \qquad (13)$$

over iterations t = 1, 2, . . . until convergence. Then, a marginal mean or joint mode estimate (to minimize Hamming loss or 0/1 loss, respectively) can be obtained simply by averaging, or taking the mode, over samples.
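A minimal sketch of this Gibbs-sampling inference follows. It assumes each classifiers[j] is a fitted probabilistic binary classifier trained on the input x concatenated with the values of all *other* labels (matching the conditional in Eq. (13)); the iteration and burn-in counts are illustrative assumptions.

```python
# A hedged sketch of Gibbs sampling for the undirected model of Eq. (13).
# x is assumed to be a 1-D feature vector.
import numpy as np

def gibbs_inference(classifiers, x, n_iters=500, burn_in=100, seed=0):
    rng = np.random.default_rng(seed)
    L = len(classifiers)
    y = rng.integers(0, 2, size=L)            # random initial labelset
    samples = []
    for t in range(n_iters):
        for j in range(L):
            others = np.delete(y, j)          # all current labels except y_j
            features = np.concatenate([x, others]).reshape(1, -1)
            p = classifiers[j].predict_proba(features)[0, 1]
            y[j] = int(rng.random() < p)      # sample y_j as in Eq. (13)
        if t >= burn_in:
            samples.append(y.copy())
    samples = np.array(samples)
    marginal = (samples.mean(axis=0) >= 0.5).astype(int)   # for Hamming loss
    labelsets, counts = np.unique(samples, axis=0, return_counts=True)
    joint_mode = labelsets[counts.argmax()]                # for 0/1 loss
    return marginal, joint_mode
```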
Figure 12: Multi-label models related to classifier chains, exemplified for L = 4 labels: (a) an undirected model (shown as bi-directional) among labels, from Guo and Gu (2011), implying iterations at inference time; and (b) a stacking approach such as used by Cheng and Hüllermeier (2009), among others (note that often the second level of labels in (b) also takes the input x, but this is omitted for brevity).

5.2 Neural Networks

The view of classifier chains as neural networks (as we discussed in Section 3.2) raises obvious connections, not just to standard multi-layer perceptron architectures (which, as Nam et al., 2014, demonstrate, can be competitive multi-label models) but also to relatively recent architectures, for example, residual neural networks, or ResNets (He, Zhang, Ren, & Sun, 2016). In Figure 13, we illustrate a simple ResNet for single-label prediction, in a way that draws obvious comparison to the cascade of a classifier chain as shown in Figure 6; only the type of inner-layer variables differs, ResNets having been introduced to cascade inputs across hidden layers rather than label-space.

Figure 13: A ResNet for single-label prediction, using node notation similar to Figure 6 w.r.t. $z_l$. In this case, the $h_l$ are standard hidden layers in the network. Note the structural similarity to that of the classifier chain in Figure 6.

Further connections can be seen to variants of recurrent neural networks (RNNs) such as long short-term memory networks (LSTMs). Indeed, LSTMs have already found some success in application to the multi-label problem (Nam, Loza Mencía, Kim, & Fürnkranz, 2017). The main interest in this application is in the context of an extremely large number of labels, since potentially substantially fewer parameters are required; in fact, possibly fewer than the number of output labels, since with a fixed number of parameters (as chosen by the user as a hyperparameter) an RNN will continue producing labels until an end-of-sequence symbol is output. We remark that, unlike in usual applications of sequential models like RNNs, the order in which labels are output is not important for multi-label classification tasks; rather, they are considered as an unordered set.

5.3 Stacking

As noticed by Kajdanowicz and Kazienko (2013), Senge et al. (2013), and others, chaining can be viewed as a particular case of binary relevance stacking, and vice versa. Stacking approaches for multi-label classification have been presented and studied independently, for example, by Cheng and Hüllermeier (2009), Loza Mencía and Janssen (2016), and Kajdanowicz and Kazienko (2009). These methods typically use the predictions of independent binary classifiers as inputs to a second set of classifiers. The vanilla version of this approach is shown in Figure 12b, noting that often the input is additionally directed to the second layer of classifiers. The understanding is that the second layer corrects the predictions of the first layer by taking into account dependence among labels. A stacked prediction is made as follows for the j-th label:

$$\hat{y}_j = \hat{y}_j^{[2]} = h_j^{[2]}(x, \hat{y}_1^{[1]}, \ldots, \hat{y}_L^{[1]}) \qquad (14)$$

where [l] denotes the l-th layer, i.e., $h_j^{[2]}$ is the base classifier on layer 2 responsible for predicting the relevance of the j-th label.
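As a minimal sketch of this scheme (Eq. (14)), with names X, Y, base as illustrative assumptions: note that, per the training-phase distinction discussed below, this version trains the second layer on first-layer predictions rather than on the true labels.

```python
# A hedged sketch of stacked binary relevance: a first layer of independent
# binary classifiers, whose predictions become extra inputs to a second,
# correcting layer (here the input X is also passed through, as is common).
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def fit_stacked_br(X, Y, base=LogisticRegression(max_iter=1000)):
    L = Y.shape[1]
    layer1 = [clone(base).fit(X, Y[:, j]) for j in range(L)]
    # First-layer predictions are appended to the input, as in Figure 12b.
    Y1 = np.column_stack([h.predict(X) for h in layer1])
    layer2 = [clone(base).fit(np.hstack([X, Y1]), Y[:, j]) for j in range(L)]
    return layer1, layer2

def predict_stacked_br(layer1, layer2, X):
    Y1 = np.column_stack([h.predict(X) for h in layer1])
    Z = np.hstack([X, Y1])
    return np.column_stack([h.predict(Z) for h in layer2])
```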
The similarities with greedy inference (Eq. (8)) in a standard classifier chain are apparent, with the difference that in a classifier chain the j-th label is only predicted once. It is easy to imagine simple changes to cross from one form to another. For example, by passing over a chain twice, we obtain

$$\hat{y}_j = \hat{y}_j^{[2]} = h_j^{[2]}(x, \hat{y}_1^{[1]}, \ldots, \hat{y}_L^{[1]}, \hat{y}_1^{[2]}, \ldots, \hat{y}_{j-1}^{[2]}) \qquad (15)$$

for the prediction of the j-th label in the second pass/iteration: a general combination of chaining and stacking, of which both a standard chain and a stacked set of classifiers are special cases. As we add a third and fourth pass over the chain, and so on, we find close ties to inference for undirected chains (recall Figure 12a and, particularly, Eq. (13)), except that in that case nodes are chosen (i.e., samples are taken) stochastically, rather than taking the mode as prediction.

However, if we look at the training phase, there is an important distinction to make with regard to stacking. If each model $h_j^{[l]}$ is trained using the true labels $\{y_j^{(i)}\}$ as they are presented in the training dataset (as additional inputs alongside each $x^{(i)}$), this can indeed be seen as a particular case of classifier chains. On the other hand, if $h_j^{[l]}$ is trained using some predictions $\{\hat{y}_j^{(i)}\}$ as inputs, where $\hat{y}_j^{(i)} = h_j^{[l-1]}(x^{(i)}, \ldots)$ from a previous layer, then there is a qualitative difference; this form of training is typical of stacking. When classifier chains use this procedure of training models on predictions rather than the training-set labels, this is called nested stacking by Senge et al. (2013); those authors showed interesting results: nested stacking performs better under Hamming loss (presumably by leveraging predictions as a feature-space expansion), whereas standard classifier chains perform better under 0/1 loss (as they are able to model conditional label dependence). Nested stacking has also been recognized elsewhere (for example, by Li & Lin, 2014) under different names. Table 3 provides an explicit example and comparison among four methods including these, which can provide further intuition.

Table 3: A learning problem with regard to label y2, shown for (a) training and (b) testing, under related schemes, including a classifier chain. Here, φ is a basis function (e.g., φ1 = x²) and ẑ1 is a latent variable (not known at training time). All values with hats, e.g., ẑ, require a learning procedure to obtain; the others are available as (or can be calculated directly from) the dataset prior to learning. We have omitted superscripts of training examples (x ≡ x(i), y2 ≡ y2(i)) for brevity; x is explicitly a test example (with no known associated labels).

(a) Training
                    X1   X2        X3        Y2
Basis expansion     x    φ1        φ2        y2
Classifier chain    x    y1        --        y2
Stacking            x    ŷ1        ŷ2        y2
Hidden layer        x    --        --        y2

(b) Inference
                    X1   X2        X3        Y2
Basis expansion     x    φ1        φ2        ŷ2
Classifier chain    x    ŷ1        --        ŷ2
Stacking            x    ŷ1^[1]    ŷ2^[1]    ŷ2^[2]
Hidden layer        x    ẑ1        ẑ2        ŷ2

5.4 Label Powerset Methods

The large family of methods stemming from the so-called label powerset approach provides a well-known alternative to classifier chains. The RAkEL (random k-label subsets) method (Tsoumakas, Katakis, & Vlahavas, 2011) has been heavily cited and extended in the multi-label literature. Rather than binary base classifiers, label powerset methods consider multi-class base classifiers, where each label vector/combination is considered as a class value; thus the total number of possible classes is $2^L$. The method gets its name from set notation, where the set of all combinations of labels from a set of labels, e.g., {1, . . . , L}, is indeed the powerset $\mathcal{P}$, of cardinality $2^L$ (we can denote this by $\mathcal{P} = \{0,1\}^L$).
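A minimal sketch of this transformation follows (names X, Y, base are illustrative assumptions): each distinct label vector observed in training becomes one class of a multi-class problem, which, as discussed next, already bounds the number of classes far below $2^L$ in practice.

```python
# A hedged sketch of the label powerset transformation.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_label_powerset(X, Y, base=LogisticRegression(max_iter=1000)):
    # Map each observed labelset (row of Y) to a class index 0..K-1.
    labelsets, classes = np.unique(Y, axis=0, return_inverse=True)
    model = base.fit(X, classes)              # one multi-class problem
    return model, labelsets

def predict_label_powerset(model, labelsets, X):
    # Predicted class indices map back to full label vectors.
    return labelsets[model.predict(X)]
```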
However, in practice, the number of classes is much smaller than $2^L$: it is limited to the labelsets actually observed in the training data (of size N); only sets of k < L labels may be considered (as in RAkEL); and possibly the set is explicitly reduced to only the most frequently observed labelsets, as performed by Read et al. (2014). Although these methods were originally seen as a different approach from classifier chains, in a probabilistic setting they can all be observed as approximating the same optimization problem, namely minimizing 0/1 loss, as $k \to L$. Suppose indeed that k = L and that a set of labelsets $\mathcal{V} \subseteq \mathcal{P}$ has been selected by one of these methods, where $|\mathcal{V}| \ll 2^L$; then the inference task is to select which labelset maximizes the posterior joint conditional:

$$\hat{y} = \operatorname*{argmax}_{y \in \mathcal{V}} P(y \mid x) \qquad (16)$$

i.e., a MAP estimate. In comparison to Eq. (10), one sees that, whereas classifier chains provide an efficient (probabilistic tree-)search over the space of all $2^L$ possible predictions, label-powerset methods restrict the search space itself.

5.5 Tree-based Methods

In Section 4.2 above we discussed hierarchies, where a tree is modeled around a pre-defined dataset hierarchy. It is also clear that a tree, rather than a cascaded chain, is one kind of generalization that can be learned for classifier chains, i.e., yielding classifier trees. This approach has been studied by Dembczyński et al. (2016) in a probabilistic context. In this scenario, of trees built around class labels, we find close connections to other tree-based approaches, such as nested dichotomies (Frank & Kramer, 2004; Leathart et al., 2019); even if these are not necessarily designed for multi-label classification, having binary class-probability estimates facilitates efficient tree-search techniques for inference in many contexts. The same can be said with regard to hierarchical softmax (Wydmuch, Jasinska, Kuznetsov, Busa-Fekete, & Dembczynski, 2018). Condensed filter trees (Li & Lin, 2014) are a tree-based method for multi-label learning related to classifier chains, particularly in regard to having $2^L$ possible paths from root to leaf (see, particularly, Figure 5). The novelty in these trees is based on the fact that, since many paths are seldom taken, focus can be placed on the important ones, hence condensing the structure.

5.6 Structured-Output Prediction

Multi-label learning is a particular type of structured-output learning, for which typical examples involve sequence-to-sequence mapping or image segmentation (i.e., outputs with inherent local dependence), but structured-output learning also considers problems that require the output of more complex structures such as trees and graphs. The CRFs and Markov models already mentioned above are well-known models for this kind of learning. The interesting connection to classifier chains is in methods where the aspect of search is combined with learning, as well as the idea of a transformation (or reduction) to simpler problems. A notable example is the SEARN algorithm (Daumé, Langford, & Marcu, 2009), which reduces structured-output prediction to binary classification via cost-sensitive classification. These authors use the language of reinforcement learning, and talk of a policy h that is built for each training example, taking action $\hat{y}_j$ (making a prediction). However, one can get close to this view from Section 4.5, where the search and the learning become inextricably linked.
Furthermore, as with classifier chains, a main advantage (compared to many other structured-output methods) is being able to tackle different loss functions using any model class (i.e., what we refer to as a base classifier). In Section 3.1, we mentioned how probabilistic chains can be flexible with regard to the loss function, and this was further generalized by the hyper-parametrization of the loss function with regard to the chain-order search (Eq. (11)). In this sense, in their most generic formulations, SEARN and classifier chains are closely related, and could be formulated as special cases of each other, not forgetting that the respective originally-targeted areas (structured-output learning vs. multi-label classification for chains) do indeed lead to different formulations.

Whereas SEARN (like most varieties of classifier chain) uses classifiers to produce structured outputs through a single sequence of greedy decisions, HC-Search (Doppa, Fern, & Tadepalli, 2014a) makes use of recurrent classifiers. This approach has been shown effective vs. SEARN, as well as on multi-label problems (Doppa, Yu, Ma, Fern, & Tadepalli, 2014b). In this context, a heuristic (H) provides a search (e.g., a tree search, as mentioned above) across possible labelsets y, and then a cost function (C) evaluates these candidates. A novelty separating this method from chaining, as we have discussed it, is the idea that the search may be learned separately under a different loss, as opposed to employing an off-the-shelf textbook search method. The idea of evaluating (albeit not necessarily explicitly ranking) candidates y produced by a structure is identical in the probabilistic chain formulation.

5.7 Summary

A full elaboration of all related methods is beyond the scope of this paper. Instead, we can emphasise again the particular niche of classifier chains: a flexible hyper-parametrization of binary base classifiers trained on a transformed multi-label dataset with off-the-shelf tools and models, and a fast approximate inference over general DAG structures, leading to their strong out-of-the-box performance without the need for hand-crafted feature functions or hidden units.

6. Perspectives and Open Issues for Classifier Chain Methods

Given that over a decade has passed since the initial formulation of classifier chains as a method of multi-label classification, it is well worth asking whether they are still relevant and competitive in the rapidly evolving area of multi-label learning. In general, the constant flow of modifications and developments building on classifier chains that appear in the scientific literature suggests that interest is strong and ongoing. Of course, developments are driven by the general interest in multi-label learning. Several new implementations of classifier chains have appeared in major open-source software frameworks in recent years, including Scikit-Learn (Pedregosa et al., 2011) and derivatives (Szymański & Kajdanowicz, 2017; Montiel et al., 2018).

On the other hand, there are several limitations that are becoming increasingly apparent, for example computational complexity. A single fully-cascaded chain implies quadratic complexity in terms of feature-space expansion. This is negligible on datasets with only tens of labels, but in recent years the multi-label community has approached ever larger datasets, eventually including a class of extreme multi-label problems,
for example those of Liu et al. (2017), with tens or hundreds of thousands of label concepts. Many strategies can be taken to extend the usability and scalability of chains; for example, ensemble subspaces have been used successfully on datasets with many thousands of labels (Puurula et al., 2014), where chains are built on subsets of the labels and their votes are combined. This label-subspace methodology draws heavily from other approaches, such as the RAkEL method (Tsoumakas et al., 2011).

Particularly as datasets grow larger and computational power becomes cheaper and more widely available (especially GPUs, TPUs, etc.), it becomes increasingly difficult for classifier chain approaches to out-compete neural network architectures (such as those by Liu et al., 2017, but there are many such examples), for which maturing frameworks exist. Earlier, the chaining mechanism replaced to some extent the need for hidden nodes and learning their associated weights/parameters (see Section 3.2), providing an out-of-the-box advantage against data-hungry neural networks. But now data is increasingly available, and it is possible to build networks of millions of parameters, and train those parameters, with only a few lines of code and a few hours of GPU time. On the other hand, even though the largest multi-label datasets are becoming larger, creating a new trend in extreme classification, there is no shortage of new real-world applications associated with only modest numbers of labels in smaller tabular datasets. And this is likely to maintain interest in, and development of, chain methods. Besides that, we emphasise that neural and chain architectures are not by any means mutually exclusive (as already seen in Section 3.2), and neural architectures can further benefit from aspects found in classifier chains, as explored by Read and Hollmén (2017) and Cisse et al. (2016), among others.

Still, in cases of small data, where deep networks of latent nodes are not needed or suitable, there are often particular challenges for classifier chains that need further attention. For example, as a set of binary classifiers, chain methods are particularly vulnerable to problems of class imbalance, stemming from the sparsity of the label matrix of most multi-label datasets. There have been proposals to address this, for example, by Liu and Tsoumakas (2018) and Lin and Xu (2016). In some cases the sparsity may be linked additionally to the phenomenon of weak labels: a type of noise, due to lazy human labeling, where some relevant label values are missing in the training data.

Although interpretability has not been a motivating factor behind most work on classifier chains to date (or on multi-label learning in general), it can be seen as a key advantage offered by chaining. Unlike multi-layer neural networks or other multi-label methods where inter-label dependence is hidden in (i.e., modeled by) inner layers of the network, a classifier chain explicitly models this dependence in a structure in the output space, among the labels. In the multi-label literature, one can find different visualizations of label dependence, usually in the form of graphs or heatmaps, but the associated work for the most part does not verify such dependence relations with a domain expert, show them to be useful or offering insight on any real-world problem, or even demonstrate stability of the relations from one test set to another.
In addition, further development is needed to establish if, and in what contexts, dependence is useful and reliable for interpretation, particularly taking into account base models and chain structure. This seems a clear path requiring attention, especially with growing interest in interpretable machine learning (Molnar, 2019).

There have been increasingly advanced efforts to integrate feature selection into the chain, for example, by Tenenboim-Chekina, Rokach, and Shapira (2013) and Teisseyre (2017). Such an approach makes sense, as any conceptual boundary between feature and label variables is already inherently blurred by the chaining mechanism. Using so-called automatic machine learning (AutoML) could prove useful as a general means for selecting and calibrating base models, chain order, and so on. There is some recent and early work in this direction involving classifier chains (Wever, Tornede, Mohr, & Hüllermeier, 2020).

The choice of SEARN (mentioned in Section 5.6) to use the terminology of a policy is particularly interesting. If transferred to the area of classifier chains, one can easily imagine using many of the large variety of methods from the area of reinforcement learning, such as Monte Carlo tree search, for finding a classification in a non-greedy way.

Another interesting perspective on classifier chains is via transfer learning and concept-drift adaptation. In a sense, building classifier chains is transfer learning. In machine learning, if we become interested in a new class concept, we may want to adapt from (i.e., transfer) knowledge representations of an existing similar concept. The labels in a multi-label dataset are almost always related in some way (after all, they are part of the same dataset). Therefore, adding a new label to a classifier chain could be considered as transferring knowledge from existing concepts (i.e., labels) to learn the new one. Of course, in typical applications of transfer learning, such as adaptation to concept drift, older concepts are no longer useful and can eventually be discarded, unlike in the typical multi-label case, where all labels are, in most cases, considered equally relevant. One may then even more easily consider the adaptation of chaining mechanisms to multi-task learning: learning L tasks together. Multi-label learning is clearly a special case of multi-task learning. In the general case, however, multi-task learning allows more freedom with regard to different types and sizes of input spaces per concept. Furthermore, multi-task learning inherently includes (or at least does not exclude) regression tasks among the L tasks to be learned, which brings us to the final point of this section.

The development of chains in a regression context, where labels take on continuous values $y_j \in \mathbb{R}$, appears natural at first approach, yet it meets significant challenges. Despite early demonstrations of direct application (Borchani, Varando, Bielza, & Larrañaga, 2015), getting strong off-the-shelf performance (vs. independent models) from these so-called regressor chains has proven more difficult. This is mainly due to the different loss functions targeted in regression, and the lack of inherent non-linearity (for example, consider: logistic regression is linear in terms of its decision boundary, but classification via logistic regression is not, because each output is cast to either 0 or 1). A recent analysis is provided by Read and Martino (2020).
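For completeness, a minimal sketch of the direct application of chaining in the regression setting, via scikit-learn's RegressorChain; the dataset below is synthetic and purely illustrative, and this greedy cascade is exactly the form whose limitations are discussed above.

```python
# A hedged sketch of a regressor chain: each regressor receives the
# real-valued predictions of those before it in the chain.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.multioutput import RegressorChain

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                                  # toy inputs
Y = np.column_stack([X @ rng.normal(size=5) for _ in range(3)])  # 3 targets

chain = RegressorChain(Ridge(), order=[0, 1, 2]).fit(X, Y)
print(chain.predict(X[:2]))    # predictions for two instances
```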
Overall, regressor chains appear to be an interesting avenue for future research, but they behave so differently from their classifier homologues, and require a treatment so different, that we do not analyse them further in this paper.

7. Summary and Recommendations

In this work, we have catalogued the evolution of the family of classifier chain methods across many different analyses, and synthesized many of these methods and their respective advantages and disadvantages. We have not provided a large-scale empirical comparison of different methods, since the inherent flexibility of classifier chains makes it difficult to set up a fair but concise evaluation. Almost all varieties target some point on the spectrum of possible tradeoffs between predictive performance and computational expenditure, or address a particular challenge, and are therefore interesting for specific combinations of dataset and metric. Instead of such an evaluation, we make some general recommendations.

Figure 14 and Table 4 outline the main varieties to choose from and (in the table) their respective computational complexity, with regard to both training and testing phases. The complexity is considered relative to the size of the label set, L (thus we do not deal specifically with subsampling strategies that may affect the size of the input instance space). Clearly, as L becomes larger, more consideration must be given to computationally tractable training and inference. However, aside from this spectrum, there are other important aspects worth highlighting. If the metric of predictive performance evaluates labels independently of each other, as, for example, Hamming loss does, then less chain structure is necessary in general, but the base classifier should be sufficiently powerful and non-linear. On the other hand, a weak linear base classifier will almost always benefit from increased connectivity and more rigorous inference.

A number of classifier chain recipes are suggested in Table 5. These suggested configurations still leave room for finer-grained parametrization, such as the ϵ or beam-width of the search, hyper-parametrization of base classifiers, and so on. Indeed, each recipe does not necessarily correspond to a particular paper from the literature, although some specific example references are given (also in Table 4 and throughout the text of this paper). Rather, these can be seen as a way to roll the review material of this paper into a toolbox comprising a number of high-level recommendations suitable for many real-world problems. Even though particular configurations of classifier chains scale up to fairly large datasets, as discussed in Section 6, many large multi-label problems, especially of the extreme variety, are increasingly better served by neural network architectures, which may, of course, incorporate elements of classifier chains.

Table 4: Complexity of (a) inference and (b) structuring strategies for L labels and M trained ensemble members (M = 1 corresponds to a single model), supposing fixed input space and base classifier. In (a), S depends on the chosen search strategy, and can be anywhere in 1 ≤ S ≤ 2^L, i.e., ranging between greedy and brute-force complexity. Recall that, although measuring marginal dependence does not require training classifiers, computation is nevertheless required.
(a) Inference complexity (number of passes down the chain).

Inference    Iterations   Example reference
Greedy       O(M)         Read et al., 2009
Search       O(M · S)     Mena et al., 2016
Exhaustive   O(M · 2^L)   Dembczyński et al., 2010

(b) Training complexity, which involves making a number of marginal-dependence measurements (complexity denoted O_M, in terms of the number of measurements) and/or inducing models (complexity denoted O, in terms of the number of models). Note that we include the cost of the final models built for prediction (in addition to those built for measuring conditional dependence) in O. Three variations are listed for conditional dependence.

Structuring strategy        Complexity                  Example reference
Random (default)            O(M L)                      Read et al., 2011
Marginal dependence         O_M(L^2) + O(M L)
Cond. dependence            O(L^2) + O(M L)
Cond. dependence            O(L) + O_M(L^2) + O(M L)    Zhang & Zhang, 2010
Cond. dependence            O(L)                        Teisseyre, 2017
Based on accuracy           O(L) + O(M L)               See Section 4.4
Search (fixed structure)*   O(M · L!)                   Read et al., 2015
Search (free structure)*    O(2^(L^2))                  Gasse, 2017
* Actual complexity depends on the chosen search algorithm.

Figure 14: Configurations of classifier chains in terms of inference and single model vs. ensembles (left), chain order/structure (middle), and base classifier (right). In addition, the quality of the posterior probability/confidence of the base classifier should also be considered for non-greedy inference (not shown).

Table 5: Some suggested classifier chain recipes, combining the results of numerous papers (references given in Table 4). Obviously, a plethora of other options are also possible. Refer to Section 3 and Section 4 for details on inference and structure, respectively. If not specified, the base classifier is assumed to be selected and parametrized according to practitioner preference, problem domain, and dimensions.

Name                    Inference   Order/Structure           Base Classifier
The Baseline            Greedy      Ensemble, random          SVM or trees
    As by Read et al. (2009).
The Kaggler             Greedy      Ensemble, random/sparse   A mixture
    A large subspace-ensemble of random, diverse, sparse chains, with a variety of base models.
A Good Order            Search      Single                    Logistic regression
    Good probability estimates help with a search for one powerful fully-cascaded model.
Neural Net              Greedy      Single                    L2 logistic reg.
    A full cascade for predictive power, with regularization; essentially a neural network.
Neural Net Sparse       Greedy      Single                    L1 logistic reg.
    L1 regularization takes care of pruning, as in Teisseyre (2017).
Sparse & interpretable  Greedy      Single                    Decision trees
    Use a conditional-dependence heuristic (Zhang & Zhang, 2010) for sparse structure.
Expensive & effective   Search      Multiple-start            Linear, or a mix
    A multiple-start structure search; reuse models in a dynamic ensemble at inference time.

8. Conclusion

Over a decade after initial interest in classifier chains as a method for multi-label classification, novel developments, analyses, and fresh applications continue to appear in the literature. Particular variations of chaining continue to attain competitive and often state-of-the-art performance on many multi-label datasets. New mechanisms for training and inference have been developed, and have now also been adapted to other areas, such as multi-output regression. The rise of ubiquitous access to neural network frameworks and associated hardware acceleration has begun to overshadow the option of off-the-shelf classifier chains for very large datasets.
Nevertheless, as is also the case in many other areas, there can be mutual benefit and shared development between deep neural and chaining approaches. In addition, one should keep in mind that only a subset of newly emerging datasets can be considered better suited to treatment under deep neural architectures, and therefore we can expect classifier chaining to continue to be relevant, thereby justifying the review of the methodology which we have carried out in this paper. Furthermore, we may remark that there are many issues found in multi-label contexts that directly relate to classifier chains, such as weak labels, class imbalance, and the interpretability of discovered label relations (and how they relate to, and can provide insight on, the underlying application domain). These themes are far from solved, and new issues are coming to the forefront. We speculate that numerous papers will continue to appear to confront them.

Acknowledgements

Thanks to Tomasz Kajdanowicz and Willem Waegeman, who pointed out useful references and provided insightful discussion on classifier chains during the preparation of this paper. Thanks also to all five anonymous reviewers, each of whom provided important discussion and suggestions, and pointed out relevant material and references, which contributed towards the development of this paper.

References

Borchani, H., Varando, G., Bielza, C., & Larrañaga, P. (2015). A survey on multi-output regression. Wiley Int. Rev. Data Min. and Knowl. Disc., 5(5), 216–233.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Burkhardt, S., & Kramer, S. (2015). On the spectrum between binary relevance and classifier chains in multi-label classification. In SAC 2015: 30th ACM Symposium on Applied Computing, pp. 885–892. ACM.
Cheng, W., & Hüllermeier, E. (2009). Combining instance-based learning and logistic regression for multilabel classification. Machine Learning, 76(2-3), 211–225.
Cisse, M., Al-Shedivat, M., & Bengio, S. (2016). Adios: Architectures deep in output space. In Proceedings of The 33rd International Conference on Machine Learning, Vol. 48, pp. 2770–2779, New York, New York, USA. PMLR.
da Silva, P. N., Gonçalves, E. C., Plastino, A., & Freitas, A. A. (2014). Distinct chains for different instances: An effective strategy for multi-label classifier chains. In Calders, T., Esposito, F., Hüllermeier, E., & Meo, R. (Eds.), Machine Learning and Knowledge Discovery in Databases, pp. 453–468, Berlin, Heidelberg. Springer Berlin Heidelberg.
Daumé III, H., Langford, J., & Marcu, D. (2009). Search-based structured prediction. Machine Learning, 75(3), 297–325.
Dembczyński, K., Cheng, W., & Hüllermeier, E. (2010). Bayes optimal multilabel classification via probabilistic classifier chains. In ICML '10: 27th International Conference on Machine Learning, pp. 279–286, Haifa, Israel. Omnipress.
Dembczyński, K., Kotłowski, W., Waegeman, W., Busa-Fekete, R., & Hüllermeier, E. (2016). Consistency of probabilistic classifier trees. In ECML-PKDD 2016: Machine Learning and Knowledge Discovery in Databases, Vol. 9852, pp. 511–526. Springer.
Dembczyński, K., Waegeman, W., Cheng, W., & Hüllermeier, E. (2012a). On label dependence and loss minimization in multi-label classification. Machine Learning, 88(1-2), 5–45.
Dembczyński, K., Waegeman, W., & Hüllermeier, E. (2012b). An analysis of chaining in multi-label classification. In ECAI: European Conference of Artificial Intelligence, Vol.
242, pp. 294–299. IOS Press.
Dietterich, T. G. (2002). Machine learning for sequential data: A review. In Proceedings of the Joint IAPR International Workshop on Structural, Syntactic, and Statistical Pattern Recognition, pp. 15–30, London, U.K. Springer-Verlag.
Doppa, J. R., Fern, A., & Tadepalli, P. (2014a). HC-Search: A learning framework for search-based structured prediction. Journal of Artificial Intelligence Research, 50, 369–407.
Doppa, J. R., Yu, J., Ma, C., Fern, A., & Tadepalli, P. (2014b). HC-Search for multi-label prediction: An empirical study. In Brodley, C. E., & Stone, P. (Eds.), AAAI Conference on Artificial Intelligence, July 27–31, 2014, Québec City, Québec, Canada, pp. 1795–1801. AAAI Press.
Enrique Sucar, L., Bielza, C., Morales, E. F., Hernandez-Leal, P., Zaragoza, J. H., & Larrañaga, P. (2014). Multi-label classification with Bayesian network-based chain classifiers. Pattern Recognition Letters, 41(C), 14–22.
Frank, E., & Kramer, S. (2004). Ensembles of nested dichotomies for multi-class problems. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML '04, p. 39, New York, NY, USA. Association for Computing Machinery.
Gasse, M. (2017). Probabilistic Graphical Model Structure Learning: Application to Multi-Label Classification. PhD thesis, Université de Lyon.
Gonçalves, E. C., Plastino, A., & Freitas, A. A. (2013). A genetic algorithm for optimizing the label ordering in multi-label classifier chains. In 2013 IEEE 25th International Conference on Tools with Artificial Intelligence, pp. 469–476.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. The MIT Press.
Guo, Y., & Gu, S. (2011). Multi-label classification using conditional dependency networks. In IJCAI '11: 24th International Conference on Artificial Intelligence, pp. 1300–1305. IJCAI/AAAI.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
Jun, X., Lu, Y., Lei, Z., & Guolun, D. (2019). Conditional entropy based classifier chains for multi-label classification. Neurocomputing, 335, 185–194.
Kajdanowicz, T., & Kazienko, P. (2009). Hybrid repayment prediction for debt portfolio. In Nguyen, N. T., Kowalczyk, R., & Chen, S.-M. (Eds.), Computational Collective Intelligence. Semantic Web, Social Networks and Multiagent Systems, pp. 850–857, Berlin, Heidelberg. Springer Berlin Heidelberg.
Kajdanowicz, T., & Kazienko, P. (2013). Heuristic classifier chains for multi-label classification. In Larsen, H. L., Martin-Bautista, M. J., Vila, M. A., Andreasen, T., & Christiansen, H. (Eds.), Flexible Query Answering Systems, pp. 555–566, Berlin, Heidelberg. Springer Berlin Heidelberg.
Kiritchenko, S., Matwin, S., Nock, R., & Famili, F. (2006). Learning and evaluation in the presence of class hierarchies: Application to text categorization. In Proc. of the 19th Canadian Conference on Artificial Intelligence, pp. 395–406.
Kocev, D., Vens, C., Struyf, J., & Džeroski, S. (2007). Ensembles of multi-objective decision trees. In Proc. of the 18th European Conf. on Machine Learning, ECML '07, pp. 624–631, Berlin, Heidelberg. Springer-Verlag.
Kumar, A., Vembu, S., Menon, A., & Elkan, C. (2013). Beam search algorithms for multilabel learning. Machine Learning, 92(1), 65–89.
Leathart, T., Frank, E., Pfahringer, B., & Holmes, G. (2019). On calibration of nested dichotomies.
In Advances in Knowledge Discovery and Data Mining: 23rd Pacific-Asia Conference, PAKDD 2019, Macau, China, April 14–17, 2019, Proceedings, Part I, pp. 69–80.
Li, C.-L., & Lin, H.-T. (2014). Condensed filter tree for cost-sensitive multi-label classification. In Xing, E. P., & Jebara, T. (Eds.), Proceedings of the 31st International Conference on Machine Learning, Vol. 32 of Proceedings of Machine Learning Research, pp. 423–431, Beijing, China. PMLR.
Lin, W., & Xu, D. (2016). Imbalanced multi-label learning for identifying antimicrobial peptides and their functional types. Bioinformatics, 32(24), 3745–3752.
Liu, B., & Tsoumakas, G. (2018). Making classifier chains resilient to class imbalance. In Zhu, J., & Takeuchi, I. (Eds.), Proceedings of The 10th Asian Conference on Machine Learning, Vol. 95 of Proceedings of Machine Learning Research, pp. 280–295. PMLR.
Liu, J., Chang, W.-C., Wu, Y., & Yang, Y. (2017). Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '17, pp. 115–124, New York, NY, USA. ACM.
Loza Mencía, E., & Janssen, F. (2016). Learning rules for multi-label classification: a stacking and a separate-and-conquer approach. Machine Learning, 105(1), 77–126.
Madjarov, G., Kocev, D., Gjorgjevikj, D., & Džeroski, S. (2012). An extensive experimental comparison of methods for multi-label learning. Pattern Recognition, 45(9), 3084–3104.
Mena, D., Montañés, E., Quevedo, J. R., & del Coz, J. J. (2016). An overview of inference methods in probabilistic classifier chains for multilabel classification. Wiley Int. Rev. Data Min. and Knowl. Disc., 6(6), 215–230.
Molnar, C. (2019). Interpretable Machine Learning. https://christophm.github.io/interpretable-ml-book/.
Montiel, J., Read, J., Bifet, A., & Abdessalem, T. (2018). Scikit-Multiflow: A multi-output streaming framework. Journal of Machine Learning Research, 19(72), 1–5.
Moyano, J. M., Galindo, E. L. G., Cios, K. J., & Ventura, S. (2018). Review of ensembles of multi-label classifiers: Models, experimental study and prospects. Information Fusion, 44, 33–45.
Nam, J., Kim, J., Mencía, E. L., Gurevych, I., & Fürnkranz, J. (2014). Large-scale multi-label text classification: revisiting neural networks. In ECML-PKDD '14: 25th European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 437–452.
Nam, J., Loza Mencía, E., Kim, H. J., & Fürnkranz, J. (2017). Maximizing subset accuracy with recurrent neural networks in multi-label classification. In Advances in Neural Information Processing Systems 30, pp. 5413–5423.
Narassiguin, A., Elghazel, H., & Aussem, A. (2017). Dynamic ensemble selection with probabilistic classifier chains. In Ceci, M., Hollmén, J., Todorovski, L., Vens, C., & Džeroski, S. (Eds.), Machine Learning and Knowledge Discovery in Databases, pp. 169–186, Cham. Springer International Publishing.
Park, L. A. F., & Read, J. (2018). A blended metric for multi-label optimisation and evaluation. In ECML-PKDD 2018: 29th European Conference on Machine Learning, pp. 719–734.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Pinto, F., Soares, C., & Mendes-Moreira, J. (2016).
Chade: Metalearning with classifier chains for dynamic combination of classifiers. In Frasconi, P., Landwehr, N., Manco, G., & Vreeken, J. (Eds.), Machine Learning and Knowledge Discovery in Databases, pp. 410–425, Cham. Springer International Publishing.
Powell, W. B. (2019). A unified framework for stochastic optimization. European Journal of Operational Research, 275(3), 795–821.
Puurula, A., Read, J., & Bifet, A. (2014). Kaggle LSHTC4 winning solution. Tech. rep. Report on the winning solution to the LSHTC4 Kaggle competition.
Ramírez-Corona, M., Sucar, L. E., & Morales, E. F. (2014). Chained path evaluation for hierarchical multi-label classification.
Read, J., & Hollmén, J. (2017). Multi-label classification using labels as hidden nodes. Tech. rep. 1503.09022v3, arXiv.org.
Read, J., & Martino, L. (2020). Probabilistic regressor chains with Monte-Carlo methods. Neurocomputing, in press, 1–26.
Read, J., Martino, L., & Hollmén, J. (2017). Multi-label methods for prediction with sequential data. Pattern Recognition, 63(March), 45–55.
Read, J., Martino, L., & Luengo, D. (2014). Efficient Monte Carlo methods for multi-dimensional learning with classifier chains. Pattern Recognition, 47(3), 1535–1546.
Read, J., Martino, L., Olmos, P. M., & Luengo, D. (2015). Scalable multi-output label prediction: From classifier chains to classifier trellises. Pattern Recognition, 48(6), 2096–2109.
Read, J., Pfahringer, B., Holmes, G., & Frank, E. (2009). Classifier chains for multi-label classification. In ECML 2009: 20th European Conference on Machine Learning, pp. 254–269. Springer.
Read, J., Pfahringer, B., Holmes, G., & Frank, E. (2011). Classifier chains for multi-label classification. Machine Learning, 85(3), 333–359.
Read, J., Puurula, A., & Bifet, A. (2014). Multi-label classification with meta labels. In ICDM '14: IEEE International Conference on Data Mining, pp. 941–946. IEEE.
Rivolli, A., Read, J., Soares, C., Pfahringer, B., & de Carvalho, A. C. P. L. F. (2020). An empirical analysis of binary transformation strategies and base algorithms for multi-label learning. Machine Learning, in press, 1–55.
Scanagatta, M., Salmerón, A., & Stella, F. (2019). A survey on Bayesian network structure learning from data. Progress in Artificial Intelligence, pp. 425–439.
Senge, R., del Coz, J., & Hüllermeier, E. (2013). Rectifying classifier chains for multi-label classification. In Lernen, Wissen, Adaption (LWA) 2013, pp. 162–169.
Senge, R., del Coz, J. J., & Hüllermeier, E. (2014). On the problem of error propagation in classifier chains for multi-label classification. In Spiliopoulou, M., Schmidt-Thieme, L., & Janning, R. (Eds.), Data Analysis, Machine Learning and Knowledge Discovery, pp. 163–170, Cham. Springer International Publishing.
Szymański, P., & Kajdanowicz, T. (2017). A scikit-based Python environment for performing multi-label classification. arXiv e-prints.
Teisseyre, P. (2017). CCnet: Joint multi-label classification and feature selection using classifier chains and elastic net regularization. Neurocomputing, 235, 98–111.
Tenenboim-Chekina, L., Rokach, L., & Shapira, B. (2013). Ensemble of feature chains for anomaly detection. In Multiple Classifier Systems, Vol. 7872 of Lecture Notes in Computer Science, pp. 295–306. Springer Berlin Heidelberg.
Tsoumakas, G., Katakis, I., & Vlahavas, I. (2011). Random k-labelsets for multi-label classification.
IEEE Transactions on Knowledge and Data Engineering, 23(7), 1079–1089.
Waegeman, W., Dembczyński, K., & Hüllermeier, E. (2019). Multi-target prediction: a unifying view on problems and methods. Data Mining and Knowledge Discovery, 33(2), 293–324.
Wever, M., Tornede, A., Mohr, F., & Hüllermeier, E. (2020). LiBRe: Label-wise selection of base learners in binary relevance for multi-label classification. In Berthold, M. R., Feelders, A., & Krempl, G. (Eds.), Advances in Intelligent Data Analysis XVIII, pp. 561–573, Cham. Springer International Publishing.
Wydmuch, M., Jasinska, K., Kuznetsov, M., Busa-Fekete, R., & Dembczyński, K. (2018). A no-regret generalization of hierarchical softmax to extreme multi-label classification. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., & Garnett, R. (Eds.), Advances in Neural Information Processing Systems 31, pp. 6355–6366. Curran Associates, Inc.
Zaragoza, J. H., Sucar, L. E., Morales, E. F., Bielza, C., & Larrañaga, P. (2011). Bayesian chain classifiers for multidimensional classification. In 24th International Joint Conference on Artificial Intelligence (IJCAI '11), pp. 2192–2197.
Zhang, M.-L., Li, Y.-K., Liu, X.-Y., & Geng, X. (2018). Binary relevance for multi-label learning: an overview. Frontiers of Computer Science, 12(2), 191–202.
Zhang, M.-L., & Zhang, K. (2010). Multi-label learning by exploiting label dependency. In KDD '10: 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 999–1008. ACM.
Zhang, M.-L., & Zhou, Z.-H. (2014). A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8), 1819–1837.