# Multi-Label Causal Feature Selection

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)

Xingyu Wu,¹ Bingbing Jiang,² Kui Yu,³ Huanhuan Chen,¹ Chunyan Miao⁴
¹School of Computer Science and Technology, University of Science and Technology of China. ²Hangzhou Institute of Service Engineering, Hangzhou Normal University. ³School of Computer and Information, Hefei University of Technology. ⁴School of Computer Science and Engineering, Nanyang Technological University.
xingyuwu@mail.ustc.edu.cn, jiangbb@hznu.edu.cn, yukui@hfut.edu.cn, hchen@ustc.edu.cn, ascymiao@ntu.edu.sg
Corresponding author: Huanhuan Chen.

Multi-label feature selection has received considerable attention during the past decade. However, existing algorithms do not attempt to uncover the underlying causal mechanism, and they handle the different types of variable relationships separately, ignoring the mutual effects between them. Furthermore, these algorithms lack interpretability: they can only select features for all labels, but cannot explain the correlation between a selected feature and a specific label. To address these problems, this paper theoretically studies the causal relationships in multi-label data and proposes a novel Markov blanket based multi-label causal feature selection (MB-MCF) algorithm. MB-MCF first mines the causal mechanism of labels and features to obtain a complete representation of the information about labels. Based on the causal relationships, MB-MCF then selects predictive features and simultaneously distinguishes common features shared by multiple labels from label-specific features owned by single labels. Experiments on real-world data sets validate that MB-MCF can automatically determine the number of selected features and simultaneously achieve the best performance compared with state-of-the-art methods. An experiment on the Emotions data set further demonstrates the interpretability of MB-MCF.

Introduction

In machine learning research, multi-label learning focuses on the problem that each instance is associated with multiple class labels simultaneously (Zhang and Zhou 2006), which is ubiquitous in many real-world applications, such as image annotation (Liu et al. 2018), text categorization (Liu et al. 2017), and gene function classification (Fodeh and Tiwari 2018). Similar to single-label learning, high-dimensional data with an enormous amount of redundant features significantly increases the computational burden of multi-label learning, and can also lead to over-fitting and performance degradation of learning algorithms (Lin et al. 2015). Previous studies (Liu and Motoda 2007) have shown that only a subset of relevant features carries the most discriminative information. Thus, in recent years, many multi-label feature selection algorithms have been proposed to find a lower-dimensional representation of the original feature space; they can be broadly classified into transformation-based methods and direct methods (Pereira et al. 2018).

[Figure 1: Causal structures of two extreme cases. (a) Multiple irrelevant labels. (b) Multiple relevant labels. Labels are highlighted in red and features are highlighted in green.]

Multi-label data contains three types of variable relationships, i.e., relationships between labels, between features, and between labels and features.
Earlier feature selection methods, such as some transformation-based methods, only focus on the last two types of relationships, and transform a multi-label problem into one or several single-label problems with transformation techniques (Godbole and Sarawagi 2004; Read 2008). Some recent direct approaches instead revise traditional single-label feature selection algorithms to process multi-label data directly, such as sub-feature uncovering with sparsity (SFUS) (Ma et al. 2012) and multi-label informed feature selection (MIFS) (Jian et al. 2016), which take label correlation into consideration. However, most of these algorithms consider the three types of relationships individually. For example, MIFS uses a matrix to encode label correlations only. Solving the three types of relationships separately ignores the mutual effects between them, which limits the effectiveness of feature selection. Analyzing all the relationships simultaneously requires modeling the underlying mechanism, which existing methods do not attempt to uncover. Another problem is that existing methods lack interpretability, i.e., they cannot explain the correlation between a selected feature and a specific label. It is necessary to know which labels are influenced by a selected feature. For example, the recently presented topic of label-specific features (Zhang and Wu 2015), which addresses the phenomenon that different class labels may carry specific characteristics of their own, would benefit from an interpretable feature selection algorithm. Therefore, we need a method that not only selects predictive features, but also distinguishes the common features shared by multiple labels from the label-specific features owned by single labels.

To address the above challenges, this paper investigates multi-label feature selection from a causal perspective. The superiority of causality-aware methods is reflected in two aspects. First, the causal mechanism treats labels and features as ordinary variables, focusing on the underlying cause-effect relationships between all variables. Therefore, by mining the causal mechanism, we can simultaneously consider all types of relationships, and thereby solve the first challenge above. As depicted in Figure 1, using a directed acyclic graph (DAG) to represent the causal structure, we can intuitively read from the structure that the labels are all independent of each other in Figure 1(a) but are all relevant to each other in Figure 1(b). Similarly, the dependency between labels and features and the redundancy between different features can also be read out. As for interpretability, for any feature it is easy to locate the labels it influences in the causal structure, which helps to distinguish common features from label-specific features.

Given the superiority of causality, we need an effective means to represent the complex causal relationships. In this paper, the Markov blanket (MB) is chosen, which is amenable to representing the local causal structure of a variable and has been used in single-label causal feature selection (Yu, Liu, and Li 2018). In a faithful Bayesian network (BN), the MB of a variable consists of its parents (direct causes), children (direct effects), and spouses (other direct causes of its direct effects). Given the MB of a variable, all other variables are independent of this variable (Spirtes et al. 2000).
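To make the MB structure just described concrete, the following toy sketch (our own illustration, not code from the paper) extracts the Markov blanket of a target node from a known DAG using the networkx library; the function name `markov_blanket` and the toy graph are ours.

```python
import networkx as nx

def markov_blanket(dag: nx.DiGraph, target):
    """Parents, children, and spouses (other parents of children) of `target`."""
    parents = set(dag.predecessors(target))
    children = set(dag.successors(target))
    spouses = {p for c in children for p in dag.predecessors(c)} - {target}
    return parents | children | spouses

# Toy structure: feature F1 -> label T1 -> feature F2 <- feature F3
dag = nx.DiGraph([("F1", "T1"), ("T1", "F2"), ("F3", "F2")])
print(markov_blanket(dag, "T1"))  # {'F1', 'F2', 'F3'}
```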
In single-label learning, the MB of a label can be directly used as the selected feature set (Pellet and Elisseeff 2008). However, in the multi-label scenario, directly using the MB of multiple labels is inadvisable due to two problems. First, real-world data often violates the faithfulness condition, and accordingly some features will contain equivalent information (Statnikov et al. 2013) about labels, which might lead to more common features. Second, some relevant features might be excluded from the local causal structure when there is strong label relevance. In this paper, we take some DAGs as examples to elaborate these two problems and provide detailed theoretical analyses, which give insight into algorithm design. To solve these problems, we propose an MB-based multi-label causal feature selection algorithm (MB-MCF). MB-MCF first mines the local causal structure of each label to uncover the causal mechanism among both labels and features. Based on the causal relationships, MB-MCF then searches for common features between relevant labels and label-specific features for single labels. The main contributions of this paper are summarized as follows:

1. Some theoretical contributions facilitate multi-label causal feature selection research. 1) We analyze the causal structure in the multi-label scenario, and find that the causal relationships between a label and its relevant features might be blocked by other strongly relevant labels. 2) Given the fact that real-world data sets violate the faithfulness condition, we study the equivalent information phenomenon in multi-label data, and formally give the property of the common features of multiple labels.

2. Different from existing methods, the proposed MB-MCF has at least three practical benefits: 1) Based on causal learning, MB-MCF simultaneously considers all dependencies between both labels and features, and selects features that are not only predictive but also causally informative. 2) MB-MCF possesses interpretability. 3) MB-MCF does not require the number of selected features to be predetermined. To the best of our knowledge, it is the first multi-label causal feature selection algorithm.

Notations and Definitions

In this paper, capital letters (such as X) represent random variables and lower-case letters (such as x) represent their values; capital bold italic letters (such as Z) denote variable sets. Specifically, let U denote the set of all (discrete random) variables, and T = {T1, T2, ..., Tl} ⊆ U denote the label set. The symbol X ⊥̸ Y | Z (resp. X ⊥ Y | Z) denotes that variables X and Y are conditionally dependent (resp. independent) given a variable set Z. The symbol I(X, Y) denotes the mutual information between X and Y.

Definition 1 (Bayesian network) (Pearl 1998). Let P denote the joint probability distribution over a variable set U of a directed acyclic graph (DAG) G. The triplet ⟨U, G, P⟩ constitutes a BN if ⟨U, G, P⟩ satisfies the Markov condition: every variable is independent of any subset of its non-descendant variables given its parents in G.

Definition 2 (Faithfulness) (Spirtes et al. 2000). Given a BN ⟨U, G, P⟩, G is faithful to P if and only if every conditional independence present in P is entailed by G and the Markov condition. P is faithful if and only if there exists a DAG G such that G is faithful to P.

Definition 3 (Markov blanket) (Pearl 1998). In a faithful BN ⟨U, G, P⟩, the Markov blanket of a variable T in G is unique and consists of its parents, children, and spouses.
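As a compact restatement added here for readability (standard facts, not extra results from the paper), the Markov condition in Definition 1 is equivalent to the usual factorization of the joint distribution, and the Markov blanket in Definition 3 can be written set-wise:

$$P(X_1, \dots, X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr),$$

$$\mathrm{MB}(T) \;=\; \mathrm{Pa}(T) \,\cup\, \mathrm{Ch}(T) \,\cup\, \Bigl(\bigcup_{C \in \mathrm{Ch}(T)} \mathrm{Pa}(C) \setminus \{T\}\Bigr),$$

where Pa(·) and Ch(·) denote parents and children in G.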
Theorem 1 (Pearl 1998). Given MB(T), X ⊥ T | MB(T) for any X ∈ U \ (MB(T) ∪ {T}).

Tsamardinos and Aliferis (2003) proved that the MB is the theoretically optimal set of features if the faithfulness condition is satisfied, which confirms that we can transform the feature selection problem into the MB discovery of the class attribute in a faithful BN. To understand the intuition from the perspective of causal learning, note that the MB includes the direct causes (parents), direct effects (children), and other direct causes of direct effects (spouses) of the class attribute (Yu et al. 2019).

Definition 4 (Equivalent information) (Statnikov et al. 2013). Two subsets of variables X and Y from U contain equivalent information about a variable T iff the following conditions hold: T ⊥̸ X, T ⊥̸ Y, T ⊥ X | Y, and T ⊥ Y | X.

The equivalent information phenomenon occurs when the faithfulness condition is violated. It can be interpreted as I(T, X) = I(T, Y), where I denotes the mutual information; in other words, X and Y contain the same information about T. In real-world applications, the faithfulness condition is usually not satisfied, and thus the equivalent information phenomenon is common in these data sets. In this paper, we use this definition to help find the redundant features in multi-label scenarios.

[Figure 2: We discuss two problems for multi-label causal feature selection. (a) An example for Problem 1, i.e., some relevant features are ignored due to strong label relevance: when T2 is completely dependent (red arrow) on T1, the causal relationships (highlighted with a red X) between T2 and its causal features C and D might be blocked, making C and D excluded from the local causal structure of T2; to retrieve the ignored features, we present Theorem 2. (b) Two examples for Problem 2, i.e., common features might exist due to equivalent MBs (Example 1: no equivalent MB; Example 2: equivalent MBs exist): for Example 2, if {A, B, E} and {C, D, E} are equivalent MBs of T2, then A and B are possible common features of T1 and T2; to find more common features and decrease redundancy in the feature set, we present Theorem 3 to give the property of common features. (c) Overview of the proposed MB-MCF algorithm: Phase 1 mines the local causal structure of each label, Phase 2 retrieves the relevant features ignored due to strong label relevance (solving Problem 1 with Theorem 2), and Phase 3 distinguishes common features and label-specific features (solving Problem 2 with Theorem 3).]

Problem Discussion and Analyses

As previously mentioned, the MB is employed to represent the local causal structure of a variable. In single-label learning, the MB of a label is the minimal set that carries all the information about the class label, and is therefore the theoretically optimal feature subset if the faithfulness condition is satisfied (Pellet and Elisseeff 2008).
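Before moving to the multi-label case, the following small numerical sketch (our own toy illustration; the data and the helper functions `entropy` and `mi` are ours, not from the paper) makes the mutual-information reading of Definition 4 concrete: a label T2 is a deterministic copy of T1, and a feature C carries the same information, so the plug-in estimates show that {C} and {T2} contain equivalent information about T1 and that C appears conditionally independent of T1 given T2, which is exactly the kind of blocking sketched in Figure 2(a).

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

def entropy(cols):
    """Plug-in entropy (in bits) of the joint distribution of the given columns."""
    counts = Counter(map(tuple, np.column_stack(cols)))
    p = np.array(list(counts.values())) / sum(counts.values())
    return -np.sum(p * np.log2(p))

def mi(x_cols, y_cols):
    """Empirical mutual information I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(x_cols) + entropy(y_cols) - entropy(x_cols + y_cols)

n = 5000
T1 = rng.integers(0, 2, n)          # binary label
T2 = T1.copy()                      # deterministic relation: T2 always equals T1
C = T1.copy()                       # feature carrying the same information as T2
D = rng.integers(0, 2, n)           # an irrelevant feature, for contrast

print(mi([T1], [T2]))                       # ~1 bit: I(T1; T2) = H(T1)
print(mi([T1], [C]))                        # ~1 bit: {C} and {T2} are equivalent w.r.t. T1
print(mi([T1], [C, T2]) - mi([T1], [T2]))   # ~0: I(T1; C | T2) = 0, i.e. T1 ⊥ C | T2
print(mi([T1], [D]))                        # ~0: irrelevant feature
```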
The MB of multiple labels is the union of the MB of each label (Liu and Liu 2018), which can be formalized as:

$$\mathrm{MB}(\{T_1, T_2, \dots, T_l\}) \;=\; \bigcup_{i=1}^{l} \mathrm{MB}(T_i) \setminus \{T_1, T_2, \dots, T_l\}. \tag{1}$$

However, two problems make the MB of multiple labels unsuitable for direct use as a feature subset. We take some DAGs as examples to illustrate these two problems and give theoretical analyses to solve them.

Problem 1: Dependencies between a label and its relevant features might be blocked by other strongly relevant labels, which causes some relevant features to be discarded.

Using the MB to define variable relevance, two strongly relevant variables are included in each other's MB sets, while two weakly relevant variables are not, but are connected by a path (in the BN) (Tsamardinos and Aliferis 2003). A special case of strong relevance, called a deterministic relation, occurs when the presence (or absence) of one variable implies the presence (or absence) of the other, and vice versa. For example, the labels male and female for a person cannot co-occur in the same instance. Statnikov et al. (2013) proved that deterministic relations can result in the equivalent information phenomenon (see Definition 4). Due to this phenomenon, strongly relevant labels can block the dependencies between features and labels. Specifically, when a label and its relevant features contain equivalent information about another label, these relevant features might be excluded from the local causal structure during causal discovery, mainly because the label is independent of these features given the other label. For example, in Figure 2(a), assume that the relation between T1 and T2 is deterministic, and that {C, D} and T2 contain equivalent information about T1. If we run an MB discovery algorithm to find the local causal structure of T2, the dependencies between T2 and C, D will be tested as T2 ⊥ C | T1 and T2 ⊥ D | T1. Thus, C and D would be misjudged as non-parent-child variables and further discarded as redundant features instead of causal features. To solve this problem, we propose Theorem 2, which gives insight into how to retrieve these ignored relevant features.

Theorem 2. Labels T1 and T2 are strongly relevant to each other. Let MBi denote the MB of Ti (i ∈ {1, 2}), and S ⊆ MB2. If T2 and S contain equivalent information about T1, then:

$$I\bigl(T_1, (\mathrm{MB}_1 \setminus \{T_2\}) \cup S\bigr) \;=\; I(T_1, \mathrm{MB}_1). \tag{2}$$

Proof. Since labels T1 and T2 are strongly relevant to each other, T2 ∈ MB1. According to the chain rule of mutual information (Cover and Thomas 2012), we have:

$$\begin{aligned}
I\bigl(T_1, (\mathrm{MB}_1 \setminus \{T_2\}) \cup S\bigr) &= I\bigl(T_1, S \mid \mathrm{MB}_1 \setminus \{T_2\}\bigr) + I\bigl(T_1, \mathrm{MB}_1 \setminus \{T_2\}\bigr),\\
I(T_1, \mathrm{MB}_1) &= I\bigl(T_1, T_2 \mid \mathrm{MB}_1 \setminus \{T_2\}\bigr) + I\bigl(T_1, \mathrm{MB}_1 \setminus \{T_2\}\bigr).
\end{aligned} \tag{3}$$

Then, subtracting I(T1, MB1 \ {T2}) from both sides of Eq. (2), we obtain:

$$I\bigl(T_1, S \mid \mathrm{MB}_1 \setminus \{T_2\}\bigr) \;=\; I\bigl(T_1, T_2 \mid \mathrm{MB}_1 \setminus \{T_2\}\bigr). \tag{4}$$

Eq. (2) is proved by showing that Eq. (4) holds. Since T2 and S contain equivalent information about T1, I(T1, S) = I(T1, T2), and therefore Eq. (4) holds. This proves the theorem. ∎

In Theorem 2, the features in the subset S are the aforementioned discarded relevant features, i.e., {C, D} in Figure 2(a). According to Eq. (2), if we first remove the strongly relevant label (i.e., T2 in Eq. (2)) and then search the causal features of T1 again, the discarded relevant features (i.e., S in Eq. (2)) will be retrieved. The designed algorithm retrieves the discarded relevant features in this way.

Problem 2: More common features might exist due to equivalent multiple MBs.
Due to label relevance, some features are shared by multiple labels; these are called common features. For Example 1 in Figure 2(b), the common cause A and the common effect B are included in both MB(T1) and MB(T2). Obviously, A and B are common features of T1 and T2. In fact, since real-world data usually violates the faithfulness condition, a label might have multiple MBs, and common features therefore exist as long as the intersection of one of these MB sets with the MB of another label is non-empty. We illustrate the problem with Example 2 in Figure 2(b). In the conventional sense, the intersection of MB(T1) = {A, B} and MB(T2) = {C, D, E} is empty. However, common features might still exist if {A, B, E} is also an MB of T2, in which case {A, B} is a set of common features of T1 and T2. From the perspective of information theory, the phenomenon in Example 2 of Figure 2(b) arises mainly because {A, B} and {C, D} contain equivalent information about T2. Intuitively, if all feature subsets containing equivalent information could be found, we would not need to run the time-consuming process of finding all the MBs of a label. Nevertheless, no theoretical proof has been presented to guarantee that there is no loss of information if part of the MB features are replaced with their equivalent features. In the following, we propose Theorem 3 to settle this issue.

Theorem 3. Let T = {T1, T2, ..., Tk} denote a label subset, MBi denote the MB of Ti (i ∈ {1, 2, ..., k}), and Si ⊆ MBi. If, for every i ∈ {1, 2, ..., k}, S ⊆ U and Si contain equivalent information about Ti, then S satisfies:

$$I\Bigl(\bigcup_{i=1}^{k} \mathrm{MB}_i \cup S \setminus \bigcup_{i=1}^{k} S_i,\; \mathbf{T}\Bigr) \;=\; I\Bigl(\bigcup_{i=1}^{k} \mathrm{MB}_i,\; \mathbf{T}\Bigr), \tag{5}$$

and we call S the common feature set of the labels in T.

Proof. It suffices to prove the case T = {T1, T2}, since any multi-label case follows from the two-label case by induction on the number of variables in T. According to Definition 4, we have:

$$I(S, T_i) = I(S_i, T_i). \tag{6}$$

We first prove that I(MBi, Ti) = I(MBi ∪ S \ Si, Ti). According to the chain rule, we have:

$$I(\mathrm{MB}_i \cup S, T_i) = I\bigl((\mathrm{MB}_i \cup S \setminus S_i) \cup S_i, T_i\bigr) = I(\mathrm{MB}_i \cup S \setminus S_i, T_i) + I(S_i, T_i \mid \mathrm{MB}_i \cup S \setminus S_i), \tag{7}$$

where I(MBi ∪ S, Ti) = I(MBi, Ti) since S ⊥ Ti | MBi, and I(Si, Ti | MBi ∪ S \ Si) = 0 according to Eq. (6). Substituting these into Eq. (7), we get:

$$I(\mathrm{MB}_i \cup S \setminus S_i, T_i) = I(\mathrm{MB}_i, T_i). \tag{8}$$

Let Ai = MBi ∪ S \ Si. Then Eq. (5) can be rewritten as:

$$I(A_1 \cup A_2, T_1 \cup T_2) = I(\mathrm{MB}_1 \cup \mathrm{MB}_2, T_1 \cup T_2). \tag{9}$$

Using the chain rule, we expand the left-hand and right-hand sides of Eq. (9) as L and R, respectively:

$$\begin{aligned}
L &= I(A_1, T_1) + I(A_2, T_2 \mid A_1, T_1) + I(A_1, T_2 \mid T_1) + I(A_2, T_1 \mid A_1),\\
R &= I(\mathrm{MB}_1, T_1) + I(\mathrm{MB}_2, T_2 \mid \mathrm{MB}_1, T_1) + I(\mathrm{MB}_1, T_2 \mid T_1) + I(\mathrm{MB}_2, T_1 \mid \mathrm{MB}_1).
\end{aligned} \tag{10}$$

According to Eq. (8), we have I(A1, T1) = I(MB1, T1) and I(A2, T2 | A1, T1) = I(MB2, T2 | MB1, T1). According to the property of the MB, we have I(A2, T1 | A1) = I(MB2, T1 | MB1). We further expand the third term in Eq. (10) as follows:

$$\begin{aligned}
I(A_1, T_2 \mid T_1) &= I(\mathrm{MB}_1 \setminus S_1, T_2) + I(S, T_2 \mid \mathrm{MB}_1 \setminus S_1),\\
I(\mathrm{MB}_1, T_2 \mid T_1) &= I(\mathrm{MB}_1 \cup S_1 \setminus S_1, T_2) = I(\mathrm{MB}_1 \setminus S_1, T_2) + I(S_1, T_2 \mid \mathrm{MB}_1 \setminus S_1).
\end{aligned}$$

Therefore, I(A1, T2 | T1) ≥ I(MB1, T2 | T1) according to Eq. (6), and thereby L ≥ R. According to Eq. (1), the MB set of {T1, T2} is a subset of MB1 ∪ MB2. Thus, we can directly show that L ≤ R from Eq. (9) according to the property of the MB. Hence L = R and Eq. (5) is established. ∎

For a better understanding, we map the elements in Theorem 3 to Example 2 in Figure 2(b). The feature set S is {A, B} in Example 2, which contains the information of all labels, with S1 = {A, B} (note that S and Si need not be different) and S2 = {C, D}.
Thereby, Theorem 3 proves that if S and S1 contain equivalent information about T1, and S and S2 contain equivalent information about T2, then we can use S to simultaneously replace S1 and S2 without any information loss (as shown in Eq. (5)). Furthermore, the replacement process is interpretable: it points out the common features of multiple labels and the label-specific features of single labels. For example, in Figure 2(b), removing {C, D} and replacing it with {A, B}, we obtain the selected feature subset {A, B, E}, in which E is a specific feature of T2, while A and B are common features of {T1, T2}. The idea in Theorem 3 is used in the proposed algorithm.

Our Algorithm

In this section, we describe the MB-based multi-label causal feature selection algorithm (MB-MCF, Algorithm 1) in detail. As shown in Figure 2(c), MB-MCF has three phases: Phase 1 (lines 2-4) mines the causal mechanism in the data. Phase 2 (lines 5-14) retrieves the relevant features ignored due to strongly relevant labels, which solves the aforementioned Problem 1 with Theorem 2. Phase 3 (lines 15-23) finds the common features and the label-specific features, which uses Theorem 3 to solve Problem 2.

Phase 1 (lines 2-4). This phase employs an up-to-date divide-and-conquer MB discovery algorithm A (such as CCMB (Wu et al. 2019)) to get the local causal structure of each label. We only find the direct causes and effects (i.e., PCi) of each label Ti, since the direct causes and effects carry most of the information about the labels, while the spouse discovery process is time-consuming. Note that, in the discovery process, we do not distinguish labels from features but consider them all as ordinary variables. Thus, PCi provides the relationships not only between labels and features but also between different labels.

Phase 2 (lines 5-14). This phase retrieves the features ignored in Phase 1 due to strong label relevance, as described in Problem 1. Directly testing deterministic relations between labels is difficult. Nevertheless, testing strong label relevance is feasible, since it can be read from the local causal structure of a label (Ti ∈ PCj in line 5). According to Theorem 2, we can find the ignored features by removing the strongly relevant labels and re-testing the dependencies between the ignored features and the target label. Concretely, line 5 traverses all strongly relevant label pairs Ti, Tj, and line 6 traverses all possibly ignored features Fk. If a feature Fk is independent of the target label Tj given some conditioning set Z that includes the strongly relevant label Ti (line 7), then Fk might be an ignored feature. A further test is performed in line 8: if no conditioning set Z excluding Ti can block the dependency between the feature Fk and the target label Tj, then, according to Eq. (2), we can assert that Fk was ignored in Phase 1, and we retrieve it in line 9. Finally, in line 13, we remove the strongly relevant labels from the feature set in preparation for the next phase.

Phase 3 (lines 15-23). Based on the local causal structure obtained in the preceding phases, this phase finds common features and label-specific features. According to Theorem 3, feature sets containing equivalent information can be used to replace each other without any information loss. Thus, we first find the equivalent feature sets for each label in lines 15-21. A compact illustrative sketch of the three phases follows below, before the line-level details of this search continue.
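The outline below is a hypothetical Python-style skeleton of the three phases as we read them from Algorithm 1 (listed next); it is not the authors' implementation. The names `mb_mcf`, `find_pc`, `ci`, and `max_cond` are placeholders we introduce: `find_pc` stands for the parent-child discovery routine of algorithm A, `ci` for a conditional-independence test (the paper uses a G2 test), and Phase 3's equivalent-information tables and minimal-cover search are only indicated by a comment.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Set

# ci(x, y, z) should return True when x is conditionally independent of y given set z.
CITest = Callable[[str, str, FrozenSet[str]], bool]


def _subsets(pool: Set[str], max_size: int):
    """All subsets of `pool` with at most `max_size` elements (conditioning sets)."""
    for r in range(min(len(pool), max_size) + 1):
        for z in combinations(pool, r):
            yield frozenset(z)


def mb_mcf(labels: List[str], features: List[str],
           find_pc: Callable[[str, List[str]], Set[str]],
           ci: CITest, max_cond: int = 3) -> Set[str]:
    """Illustrative three-phase skeleton of MB-MCF (a sketch, not the authors' code)."""
    # Phase 1 (lines 2-4): parents/children of every label, treating labels and
    # features alike as ordinary variables.
    pc: Dict[str, Set[str]] = {
        t: find_pc(t, [v for v in labels + features if v != t]) for t in labels
    }

    # Phase 2 (lines 5-14): retrieve features hidden by strongly relevant labels (Theorem 2).
    for tj in labels:
        for ti in [t for t in labels if t != tj and t in pc[tj]]:
            for fk in features:
                if fk in pc[tj]:
                    continue
                # line 7: some conditioning set containing Ti renders Fk independent of Tj
                blocked = any(ci(fk, tj, z | {ti}) for z in _subsets(pc[tj] - {ti}, max_cond))
                # line 8: no conditioning set excluding Ti can block the dependency
                never_blocked = all(not ci(fk, tj, z) for z in _subsets(pc[tj] - {ti}, max_cond))
                if blocked and never_blocked:
                    pc[tj].add(fk)            # line 9: retrieve the ignored feature
            pc[tj].discard(ti)                # line 13: drop the strongly relevant label

    # Phase 3 (lines 15-22): build the equivalent-information tables EIF_i and greedily
    # search the minimal covering feature set; here we simply return the union of the
    # PC sets, which is the pool on which that substitution/search would operate.
    return set().union(*pc.values()) - set(labels)
```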
Algorithm 1: The MB-MCF Algorithm.
1: Input: label set T = {T1, T2, ..., Tl}, feature set F = {F1, F2, ..., Fm}, a divide-and-conquer-based MB discovery algorithm A, significance level α.
{Phase 1: Get the local causal structure of each label.}
2: for i = 1 ... l do
3:   PCi ← direct causes and effects of Ti found from T ∪ F \ {Ti} with the parent-child discovery process of A
4: end for
{Phase 2: Identify strong label relevance and retrieve the ignored features.}
5: for i, j = 1 ... l with Ti ∈ PCj do
6:   for k = 1 ... m do
7:     if ∃Z: {Ti} ⊆ Z ⊆ PCj s.t. Fk ⊥ Tj | Z then
8:       if ∀Z ⊆ PCj \ {Ti}, Fk ⊥̸ Tj | Z then
9:         PCj ← PCj ∪ {Fk}
10:      end if
11:    end if
12:  end for
13:  PCj ← PCj \ {Ti}
14: end for
{Phase 3: Find common features and label-specific features.}
15: for i = 1 ... l do
16:   for each Z ⊆ F \ PCi with Z ⊥̸ Ti do
17:     if ∃S ⊆ PCi s.t. Ti ⊥ Z | S and Ti ⊥ S | Z then
18:       EIFi ← EIFi ∪ {⟨S, Z⟩}
19:     end if
20:   end for
21: end for
22: Search for the minimal set Sel_Fea s.t. ∀i, ∃⟨S, Z⟩ ∈ EIFi, (PCi \ S ∪ Z) ⊆ Sel_Fea.
23: Output: selected feature subset Sel_Fea.

For each label (line 15), line 17 finds the feature set Z such that Z and S (a subset of PCi) contain equivalent information about label Ti. In practice, to improve efficiency, an upper limit should be set on the size of Z, since large Z can be derived from small Z. According to Definition 4, Z and S meeting the condition in line 17 can be considered equivalent feature sets with respect to the target label Ti, so they are recorded in the equivalent feature table EIFi of Ti in line 18. Once we have EIFi for all Ti, the common feature discovery problem can be transformed into a search problem, that is, searching for the minimal set Sel_Fea such that for ∀i, ∃⟨S, Z⟩ ∈ EIFi, (PCi \ S ∪ Z) ⊆ Sel_Fea (line 22). The constraint can be rephrased as follows: for label Ti, if we replace some of the features in PCi with an equivalent feature set, then at least one of the substituted PCi sets must be included in the selected feature subset Sel_Fea, to guarantee that no information about any label is lost. The minimal set satisfying the constraint is the optimal feature subset; a greedy algorithm can be used to find an optimal or suboptimal solution. Note that, in the process of searching Sel_Fea, we can record the relationship between each selected feature and each label. For example, in Figure 2(b), assuming EIF_{T2} = {⟨{A, B}, {C, D}⟩}, then {A, B} is recorded as the common features of {T1, T2} when {C, D} is replaced with {A, B} in line 22, and E is a label-specific feature of T2.

Experiments

Experimental Settings

In this section, we present experimental studies of the proposed MB-MCF algorithm on real-world data sets from diverse application domains. Table 1 lists the details of the five multi-label data sets, including domain, standard statistics, and the sizes of the training and test splits. Cardinality denotes the average number of labels per instance, and density normalizes the label cardinality by the number of labels.

Table 1: Details of the multi-label data sets.

| Data set | Domain | #Training | #Test | #Features | #Labels | Cardinality | Density |
|---|---|---|---|---|---|---|---|
| Birds | audio | 500 | 100 | 260 | 19 | 1.014 | 0.053 |
| CAL500 | music | 300 | 100 | 68 | 174 | 26.044 | 0.150 |
| EUR-Lex | text | 5000 | 2000 | 5000 | 201 | 2.213 | 0.011 |
| Mediamill | video | 1000 | 1000 | 120 | 101 | 4.376 | 0.043 |
| NUS-WIDE | images | 10000 | 5000 | 500 | 81 | 1.869 | 0.023 |

Five state-of-the-art multi-label feature selection algorithms are compared, including SFUS (Ma et al. 2012), CSFS (Chang et al. 2014), MIFS (Jian et al. 2016), CMFS (Braytee et al. 2017), and MCLS (Huang, Jiang, and Sun 2018).
In addition, we use the original data with no feature selection as a baseline in each experiment. To evaluate the effectiveness of the proposed method, we employ a representative multi-label classification algorithm, ML-kNN (Zhang and Zhou 2007), to compute the classification accuracy achieved with the selected features; the number of nearest neighbors k is set to 10 as the default setting. Due to space limitations, we choose one example-based metric, Hamming Loss, and two label-based metrics, FMacro and FMicro (macro-averaging and micro-averaging of the F1-measure), to measure the performance of the multi-label classifier with the features selected by each algorithm. Hamming Loss evaluates the fraction of misclassified instance-label pairs:

$$\mathrm{Hamming\ Loss} \;=\; \frac{1}{p}\sum_{i=1}^{p}\frac{1}{q}\,\lvert Z_i \,\Delta\, Y_i \rvert, \tag{11}$$

where p and q denote the number of instances and labels, respectively, Zi represents the predicted label set and Yi the correct label set of the i-th instance, and Δ stands for the symmetric difference between the two sets. FMicro can be considered a weighted average of the F1-measure over all q labels, while FMacro is an arithmetic average over all output labels; they can be calculated as:

$$F_{\mathrm{Macro}} \;=\; \frac{1}{q}\sum_{i=1}^{q}\frac{2\,TP_i}{2\,TP_i + FP_i + FN_i}, \tag{12}$$

$$F_{\mathrm{Micro}} \;=\; \frac{\sum_{i=1}^{q} 2\,TP_i}{\sum_{i=1}^{q}\bigl(2\,TP_i + FP_i + FN_i\bigr)}, \tag{13}$$

where TPi, FPi, and FNi denote the number of true positives, false positives, and false negatives for the i-th class label, respectively.

Performance Comparison

In this experiment, we first apply MB-MCF and the comparing algorithms to select features and then use ML-kNN to train a classifier with the selected features. Each experiment is repeated 10 times with different training and test data, and we report the average performance, i.e., Hamming Loss, FMacro, and FMicro. Since the comparing feature selection algorithms cannot determine the optimal number of features, we gradually increase the percentage of selected features from 2% to 20% with a step of 2%. For a fair comparison, the regularization parameters of all comparing algorithms are tuned over {0.01, 0.1, 0.3, ..., 0.9, 1} by grid search. For the proposed MB-MCF, we employ HITON-MB (Aliferis, Tsamardinos, and Statnikov 2003) as the MB discovery algorithm and use the G2 test (Pearl 1998) to implement the conditional independence tests.

Table 2: The number of features selected by MB-MCF.

| Data set | Birds | CAL500 | EUR-Lex | Mediamill | NUS-WIDE |
|---|---|---|---|---|---|
| #Features | 40 | 13 | 871 | 18 | 85 |

Figure 3 shows the average Hamming Loss, FMacro, and FMicro curves of the different multi-label feature selection algorithms with respect to the percentage of selected features. As shown in Table 2 and Figure 3, MB-MCF does not vary with the increasing selected-feature percentage and can automatically determine the number of selected features, unlike the other algorithms, which require the number of selected features to be predetermined. Clearly, MB-MCF consistently outperforms the other algorithms in terms of each metric under the same number of selected features. Moreover, compared with the best result of the state-of-the-art methods, MB-MCF still achieves better or very competitive performance. Specifically, on the Birds, CAL500, and Mediamill data sets, MB-MCF achieves significantly higher FMacro and FMicro and very competitive Hamming Loss compared with the state-of-the-art methods, which demonstrates that MB-MCF captures more effective features by considering the causal information between both features and labels.
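For reference (our addition, not part of the paper), the three evaluation metrics above correspond to standard scikit-learn routines for multi-label indicator matrices; the toy arrays `Y_true` and `Y_pred` are ours.

```python
import numpy as np
from sklearn.metrics import hamming_loss, f1_score

# Toy multi-label ground truth and predictions: p = 4 instances, q = 3 labels.
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
Y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 0, 1]])

print(hamming_loss(Y_true, Y_pred))               # Eq. (11): fraction of wrong instance-label pairs
print(f1_score(Y_true, Y_pred, average="macro"))  # Eq. (12): per-label F1 averaged over labels
print(f1_score(Y_true, Y_pred, average="micro"))  # Eq. (13): F1 over counts pooled across labels
```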
On the large-scale data set EUR-Lex, SFUS cannot be run within 16 GB of memory due to its high space complexity. From Figure 3 (c), (h), (m), we observe that the performance of the state-of-the-art methods varies at first and then tends to stabilize as the percentage of selected features increases, and the size of the feature set selected by MB-MCF falls right around this turning point, which demonstrates the effectiveness of MB-MCF in automatically determining the number of features. On the large-scale data set NUS-WIDE, most of the existing algorithms do not reach the performance of the baseline, since the insufficient training data hurts the effectiveness of feature selection. However, MB-MCF is the only algorithm that outperforms the baseline even when a small number of samples is used for training, which shows that MB-MCF is more data-efficient.

[Figure 3: The Hamming Loss, FMacro, and FMicro of MB-MCF and the other state-of-the-art algorithms, plotted against the percentage of selected features on each data set. The result of MB-MCF is a red dot instead of a curve, since MB-MCF automatically determines the number of selected features.]

Interpretability

Compared with traditional multi-label feature selection methods, MB-MCF selects useful features and simultaneously possesses interpretability. To illustrate the interpretability of MB-MCF, we run MB-MCF on the Emotions data set and provide the detailed relationships between labels and selected features obtained from lines 15-22 of MB-MCF. The Emotions data set contains 6 labels, namely amazed-surprised (L1), happy-pleased (L2), relaxing-calm (L3), quiet-still (L4), sad-lonely (L5), and angry-aggressive (L6). We employ MB-MCF to select features on Emotions with 500 training samples and record the relationship between each selected feature and each label, as shown in Figure 4. The selected features are listed at the top of the figure in the order of their serial numbers in the Emotions data set; each feature corresponds to a column of the grid and each label corresponds to a row. If a feature has an effect on a label, the corresponding cell is shaded.
From Figure 4, we observe that features F21 and F40 carry information about all labels, and F2 is a label-specific feature of label L2. From the distribution of shaded cells, we can conclude that the labels in the label pairs (L1, L4), (L2, L5), and (L3, L6) have similar common features, and thereby have stronger correlation with each other than the other label pairs, which is consistent with the Tellegen-Watson-Clark model of mood in a previous study (Tellegen, Watson, and Clark 1999), the model that was employed for labeling the Emotions data. On Emotions, MB-MCF achieves performance similar to existing methods (the details are not provided due to space limitations). However, MB-MCF can not only effectively select the relevant features containing discriminative information, but also simultaneously explain the relationships between variables (including both labels and features).

[Figure 4: The relation between each selected feature (corresponding to a column) and each label (corresponding to a row) in Emotions. A shaded cell indicates that the corresponding feature has an effect on the corresponding label.]

Conclusion

This paper investigates the multi-label feature selection problem from a causal perspective. We study the causal structure of multi-label data and discover that strong label relevance might block the dependencies between a label and its relevant features. Furthermore, we present the property of the common features shared by multiple labels. Based on these theoretical contributions, we propose a novel algorithm, MB-MCF, which mines the causal mechanism first and then finds common features and label-specific features. Compared with traditional multi-label feature selection algorithms, MB-MCF possesses interpretability and selects features that are not only predictive but also causally informative, while it does not require the number of selected features to be predetermined. Finally, we conduct extensive experiments to validate the effectiveness and superiority of the proposed MB-MCF. Future work could strengthen the experimental evaluation and develop a joint algorithm (Jiang et al. 2019) to simultaneously select features and learn a classifier with the selected causal features.

Acknowledgements

This research is supported in part by the National Key Research and Development Program of China under Grant No. 2016YFB1000905, and the National Natural Science Foundation of China under Grants No. 61876206, 91846111, and 91746209.

References

Aliferis, C. F.; Tsamardinos, I.; and Statnikov, A. 2003. HITON: a novel Markov blanket algorithm for optimal variable selection. In Proceedings of the American Medical Informatics Association Annual Symposium, 21-25.

Braytee, A.; Liu, W.; Catchpoole, D. R.; and Kennedy, P. J. 2017. Multi-label feature selection using correlation information. In Proceedings of the ACM Conference on Information and Knowledge Management, 1649-1656.

Chang, X.; Nie, F.; Yang, Y.; and Huang, H. 2014. A convex formulation for semi-supervised multi-label feature selection. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, 1171-1177.

Cover, T. M., and Thomas, J. A. 2012. Elements of Information Theory. John Wiley & Sons.

Fodeh, S. J., and Tiwari, A. 2018. Exploiting MEDLINE for gene molecular function prediction via NMF based multi-label classification. Journal of Biomedical Informatics 86(10):160-166.

Godbole, S., and Sarawagi, S. 2004. Discriminative methods for multi-labeled classification.
In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, 22-30.

Huang, R.; Jiang, W.; and Sun, G. 2018. Manifold-based constraint Laplacian score for multi-label feature selection. Pattern Recognition Letters 112(9):346-352.

Jian, L.; Li, J.; Shu, K.; and Liu, H. 2016. Multi-label informed feature selection. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, 1627-1633.

Jiang, B.; Wu, X.; Yu, K.; and Chen, H. 2019. Joint semi-supervised feature selection and classification through Bayesian approach. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence.

Lin, Y.; Hu, Q.; Liu, J.; and Duan, J. 2015. Multi-label feature selection based on max-dependency and min-redundancy. Neurocomputing 168(11):92-103.

Liu, X.-Q., and Liu, X.-S. 2018. Markov blanket and Markov boundary of multiple variables. Journal of Machine Learning Research 19(1):1658-1707.

Liu, H., and Motoda, H. 2007. Computational Methods of Feature Selection. CRC Press.

Liu, J.; Chang, W.-C.; Wu, Y.; and Yang, Y. 2017. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM Conference on Research and Development in Information Retrieval, 115-124.

Liu, Y.; Wen, K.; Gao, Q.; Gao, X.; and Nie, F. 2018. SVM based multi-label learning with missing labels for image annotation. Pattern Recognition 78(6):307-317.

Ma, Z.; Nie, F.; Yang, Y.; Uijlings, J. R.; and Sebe, N. 2012. Web image annotation via subspace-sparsity collaborated feature selection. IEEE Transactions on Multimedia 14(4):1021-1030.

Pearl, J. 1998. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.

Pellet, J.-P., and Elisseeff, A. 2008. Using Markov blankets for causal structure learning. Journal of Machine Learning Research 9(7):1295-1342.

Pereira, R. B.; Plastino, A.; Zadrozny, B.; and Merschmann, L. H. 2018. Categorizing feature selection methods for multi-label classification. Artificial Intelligence Review 49(1):57-78.

Read, J. 2008. A pruned problem transformation method for multi-label classification. In Proceedings of the New Zealand Computer Science Research Student Conference, 143-150.

Spirtes, P.; Glymour, C. N.; Scheines, R.; Heckerman, D.; Meek, C.; Cooper, G.; and Richardson, T. 2000. Causation, Prediction, and Search. MIT Press.

Statnikov, A.; Lytkin, N. I.; Lemeire, J.; and Aliferis, C. F. 2013. Algorithms for discovery of multiple Markov boundaries. Journal of Machine Learning Research 14(2):499-566.

Tellegen, A.; Watson, D.; and Clark, L. A. 1999. On the dimensional and hierarchical structure of affect. Psychological Science 10(4):297-303.

Tsamardinos, I., and Aliferis, C. F. 2003. Towards principled feature selection: relevancy, filters and wrappers. In Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics.

Wu, X.; Jiang, B.; Yu, K.; Miao, C.; and Chen, H. 2019. Accurate Markov boundary discovery for causal feature selection. IEEE Transactions on Cybernetics, to be published, doi:10.1109/TCYB.2019.2940509.

Yu, K.; Li, J.; Ding, W.; and Le, T. D. 2019. Multi-source causal feature selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, to be published, doi:10.1109/TPAMI.2019.2908373.

Yu, K.; Liu, L.; and Li, J. 2018. A unified view of causal and non-causal feature selection. [Online] Available: https://arxiv.org/abs/1802.05844.

Zhang, M.-L., and Wu, L. 2015. LIFT: multi-label learning with label-specific features.
IEEE Transactions on Pattern Analysis and Machine Intelligence 37(1):107-120.

Zhang, M.-L., and Zhou, Z.-H. 2006. Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering 18(10):1338-1351.

Zhang, M.-L., and Zhou, Z.-H. 2007. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition 40(7):2038-2048.