# Instance-wise Feature Grouping

Aria Masoomi1, Chieh Wu1, Tingting Zhao1, Zifeng Wang1, Peter Castaldi2, and Jennifer Dy1
masoomi.a@northeastern.edu, wu.chie@northeastern.edu, t.zhao@northeastern.edu, zifengwang@ece.neu.edu, repjc@channing.harvard.edu, jdy@ece.neu.edu
1Department of Electrical and Computer Engineering, Northeastern University, Boston, MA, US
2Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, US

In many learning problems, the domain scientist is often interested in discovering the groups of features that are redundant and are important for classification. Moreover, the features that belong to each group, and the important feature groups, may vary per sample. But what do we mean by feature redundancy? In this paper, we formally define two types of redundancy using information theory: Representation and Relevant redundancy. We leverage these redundancies to design a formulation for instance-wise feature group discovery and reveal a theoretical guideline to help discover the appropriate number of groups. We approximate mutual information via a variational lower bound and learn the feature group and selector indicators with Gumbel-Softmax in optimizing our formulation. Experiments on synthetic data validate our theoretical claims. Experiments on MNIST, Fashion MNIST, and gene expression datasets show that our method discovers feature groups with high classification accuracies.

1 Introduction

Data samples are typically represented by features that domain experts assume to be important for a learning problem; however, not all features are important. The goal of feature selection is to select which features are needed to improve learning performance. Moreover, knowing which features are important helps in understanding learning algorithms. Traditionally, feature selection algorithms find a global set of features for the entire data [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. While knowing the most important global features is useful, feature importance may vary across the population. For example, in images, while one set of pixels may help us identify a shoe, a vastly different set of pixels would be required to identify a shirt. From this observation, there is an additional need for feature selection to be performed on a case-by-case basis, an approach also known as Instance-wise Feature Selection, a concept that has only recently been investigated in the context of explaining black-box models [11, 12, 13, 14]. Learning saliency maps [15] also provides some form of instance-wise feature importance by highlighting (weighting) important pixels in an image. While Instance-wise Feature Selection focuses on each feature's relationship to the label, it ignores the interaction among features. Multiple features may be equally important and yet redundant in relation to each other. Traditional feature selection algorithms (such as LASSO [16]) tend to select just one of these redundant features. However, in some domains such as gene expression applications, we are interested not only in which genes (features) are important but also in which genes interact together for disease prediction. Therefore, in addition to Instance-wise Feature Selection, we wish to also group the features based on their relationship with each other and to the label.
There exist methods like group Lasso (GLasso) [17] that select which feature groups are important given a predefined grouping. Yet, in many applications the feature groups are unknown. Thus, methods that learn feature groups have been proposed [18, 19, 20, 21]. While these methods perform group feature selection, the groups are global and not instance-wise. In contrast to these approaches, this paper introduces an instance-wise method that can learn the feature group structure and identify its importance for prediction from an information-theoretic perspective. We refer to this approach as Instance-wise Feature Grouping.

Our Contribution. We introduce a novel method for learning instance-wise feature grouping, the group Interpreter (gI). Our formulation is made possible by our theoretical contribution of defining the concept of redundancy in this setting. Leveraging mutual information's ability to measure dependency, we formally define two types: Representation Redundancy captures the dependency between features, while Relevant Redundancy captures the dependency between features and their corresponding labels. We prove how these redundancies can be captured and describe the mechanisms by which information is preserved. Our analysis leads to a lower bound to identify the number of groups for each sample. Moreover, we provide a practical algorithm that approximates mutual information (MI) through a variational lower bound. The algorithm also learns a mapping function that identifies the most important feature groups on a sample-by-sample basis. Finally, our theories are experimentally verified on both synthetic and real data from ongoing research. Indeed, our method reveals differences in gene expression based on smoking status. We make the source code publicly available at https://github.com/ariahimself/Instance-wise-Feature-Grouping.

Related Work. Many traditional global feature selection methods utilize MI as the criterion for selection (as it is a natural criterion for measuring dependency among random variables). However, in global feature selection, the goal is to find the minimal subset of features relevant for prediction [4, 5, 6, 22, 1, 23]. One way to find this minimal subset is to maximize feature relevance while minimizing feature redundancy [24, 25]. Note that these methods wish to remove redundancy. In contrast, our goal is to learn which features group (i.e., cluster) together, where we define similarity of features based on redundancy. For example, mRMR [25] maximizes feature relevance while minimizing feature redundancy. If features F1 and F2 are highly dependent and relevant to prediction, only one will be chosen. In contrast, gI would select both as a group, highlighting to domain scientists that these two features are both relevant and redundant to each other. Unlike traditional feature clustering, our goal is to learn feature similarity not just based on their redundancy with each other (representation redundancy) but also on their redundancy based on their prediction ability (relevant redundancy). We formally define these concepts in this paper. Among other global feature group learning methods, Chormunge and Jena [19] learn feature groups based on k-means clustering and then apply GLasso; Bilevel Learning [20] learns the feature groups through a multi-task learning setting using bilevel optimization; OSCAR [18] automatically learns the feature groups by encouraging equality in the magnitude of each pair of variables.
All these group feature selection methods are global, whereas our proposed method gI learns feature groups instance-wise.

2 Ingredients of Feature Group Learning

Overall Framework. We summarize the overall network framework of our method in Figure 1, followed by a description of each component in this section. Given a data set $X \in \mathbb{R}^{n \times d}$ with $n$ samples and $d$ features, let $Y \in \mathbb{R}^n$ be its corresponding labels; the $i$th sample input and its label are denoted as $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. Our goal is to separate the features into $k$ non-overlapping groups and select the $m$ most important groups for each sample. We learn for each sample a matrix $G$ to indicate each feature's group membership. The $G$ matrix is specifically constrained such that $G \in \{0,1\}^{k \times d}$, where $G_{i,j} = 1$ if the $j$th feature of a sample belongs to the $i$th group. After compressing the features into $k$ groups, we also learn a vector $s \in \{0,1\}^k$, where $s_\mu = 1$ if the $\mu$th group is among the $m$ most important groups.

Figure 1: Flowchart of our instance-wise feature grouping framework.

The Group Membership Matrix G. We denote $G$ as a random variable over the set of all possible $G$ matrices, $\mathcal{G} = \{G \in \{0,1\}^{k \times d} \mid \sum_{i=1}^{k} G_{ij} = 1\}$. This allows us to generate instance-wise $G$ matrices for each sample by learning $P(G|X)$ and using its most likely outcome as $G$, where $G = \arg\max_{G} P(G|X)$. To learn $G$, we propose to train a non-traditional autoencoder $\psi_{\theta_G}$ that maps the data $X$ into a low-dimensional embedding $Z \in \mathbb{R}^{n \times k}$ with $\hat{X} \in \mathbb{R}^{n \times d}$ as its decoded output, where $\psi_{\theta_G}(X) = \hat{X}$. The encoder and decoder functions are denoted as $T_G : \mathbb{R}^d \to \mathbb{R}^k$ and $T_G^{+} : \mathbb{R}^k \to \mathbb{R}^d$, where for a given sample $i$, $z_i = T_G(x_i) = G x_i$, $\hat{x}_i = T_G^{+}(z_i) = G^{T} z_i$, and $\hat{x}_i = \psi_{\theta_G}(x_i) = T_G^{+} \circ T_G(x_i) = G^{T} G x_i$. Note that each feature of $z_i$ is a summation of only features of the same group; therefore, each feature encapsulates the characteristics of its corresponding group. Accordingly, we refer to them as characteristic features.

The Group Selector s. Each $G$ matrix is coupled with its own m-hot vector $s$ that indicates the $m$ most important groups. By defining the random variable $S \in \{s \in \{0,1\}^k \mid |s| = m\}$, we also learn $s$ indirectly by learning the distribution $P(S|Z)$, where $s = \arg\max_{S} P(S|Z)$. This is accomplished via a second autoencoder $\phi_{\theta_S,\theta_G}(x_i) = G^{T}(G x_i \odot s)$, which selects the $m$ characteristic features that correspond to the $m$ most important groups, where $\odot$ is the element-wise (Hadamard) product. Hence, given $X$ we have $\phi_{\theta_S,\theta_G}(X) = \tilde{X}$, where the $i$th row is $\tilde{x}_i$.
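To make these mappings concrete, the following minimal NumPy sketch (our own illustration with hypothetical sizes and variable names, not the released code) shows how a fixed group matrix $G$ and selector $s$ act on a single sample, following the definitions above.

```python
import numpy as np

d, k, m = 6, 3, 2  # number of features, groups, and selected groups (illustrative sizes)

# Hard group-membership matrix G in {0,1}^{k x d}: every column is one-hot,
# i.e., each feature belongs to exactly one group.
G = np.zeros((k, d))
G[[0, 0, 1, 1, 2, 2], np.arange(d)] = 1.0

# m-hot selector s in {0,1}^k keeping the m most important groups.
s = np.array([1.0, 0.0, 1.0])

x = np.random.randn(d)              # one sample x_i

z = G @ x                           # characteristic features: z = T_G(x) = G x
x_hat = G.T @ z                     # psi_{theta_G}(x) = G^T G x   (reconstruction path)
x_tilde = G.T @ (z * s)             # phi_{theta_S, theta_G}(x) = G^T (G x ⊙ s)  (selected path)

print(z.shape, x_hat.shape, x_tilde.shape)  # (3,), (6,), (6,)
```

In the actual model, $G$ and $s$ are not fixed but are sampled per instance from learned distributions, as described next.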
Defining Feature Redundancy. Intuitively, a feature can be redundant if it is highly dependent on another set of features; we call this Representation Redundancy. Simultaneously, features can also be redundant if their inclusion does not improve the data/label dependency, i.e., given the occurrence of a feature, additional features may not provide any extra label-predicting information; we call this Relevant Redundancy. Formally, let $X_j$ be a random variable representing the $j$th feature, and let $X = \{X_1, \dots, X_d\}$ be the set of all features with cardinality $|X|$. By leveraging mutual information (MI, $I$), we define the two redundancies below.

Definition 1. Feature $X_j$ is Representation Redundant with respect to a set of random variables $Z$ iff
$$I(X_j; X) \neq 0 \quad \text{and} \quad I(X_j; X \mid Z) = 0. \tag{1}$$

Definition 2. Feature $X_j$ is Relevant Redundant with respect to a set of random variables $Z$ iff
$$I(X_j; Y) \neq 0 \quad \text{and} \quad I(X_j; Y \mid Z) = 0. \tag{2}$$

Note that in Def. (1), while the condition $I(X_j; X) \neq 0$ is always true since $X_j \in X$, it is nevertheless included to preserve the symmetry with Def. (2).

Following these definitions, we present our method, the Group Interpreter (gI), which implicitly learns $G$ and $s$ by maximizing
$$\max_{\theta_G, \theta_S} \; I(\hat{X}; X) + \lambda I(\tilde{X}; Y), \quad \text{s.t. } \hat{X} = \psi_{\theta_G}(X), \; \tilde{X} = \phi_{\theta_S,\theta_G}(X). \tag{3}$$
This objective is theoretically motivated by Defs. 1 and 2. Indeed, the $\hat{X}$ that maximizes $I(\hat{X}; X)$ captures Representation Redundancy, while $I(\tilde{X}; Y)$ identifies the optimal $\tilde{X}$ to capture Relevant Redundancy. The control parameter $\lambda$ then balances the two criteria. We formally prove these claims in the following two theorems, with their proofs included in App. C.

Theorem 1. The maximum mutual information $I(\hat{X}; X)$ is achieved if and only if the characteristic features $Z$ induced by the model make $X$ representation redundant based on Def. (1), i.e.,
$$\max_{G} I(\hat{X}; X) = I(X; X) \;\Longleftrightarrow\; \min_{G} I(X; X \mid Z) = 0, \quad \text{s.t. } G \in \{0,1\}^{k \times d}, \; \textstyle\sum_{i=1}^{k} G_{ij} = 1, \; Z = T_G(X), \; \hat{X} = \psi_{\theta_G}(X). \tag{4}$$

Theorem 2. The maximum mutual information $I(\tilde{X}; Y)$ is achieved if and only if the $m$ selected characteristic features $Z \odot s$ induced by the model make $X$ relevant redundant based on Def. (2), i.e.,
$$\max_{G} I(\tilde{X}; Y) = I(X; Y) \;\Longleftrightarrow\; \min_{G} I(X; Y \mid Z \odot s) = 0, \quad \text{s.t. } G \in \{0,1\}^{k \times d}, \; \textstyle\sum_{i=1}^{k} G_{ij} = 1, \; Z = T_G(X), \; \tilde{X} = \phi_{\theta_S,\theta_G}(X), \; s \in \{0,1\}^k, \; |s| = m. \tag{5}$$

Approximating Mutual Information. Since the various distributions required to compute MI are difficult to obtain, we instead maximize MI's variational lower bound [11] as a surrogate. We provide here a summary of the key formulations while leaving the detailed derivations to App. G. We solve Eq. (3) by first simplifying it into expectations:
$$\max_{\theta_G, \theta_S} \; \mathbb{E}_{X, \hat{X}}[\log P(X \mid \hat{X})] + \lambda \, \mathbb{E}_{Y, \tilde{X}}[\log P(Y \mid \tilde{X})] \quad \text{s.t. } \hat{X} = \psi_{\theta_G}(X), \; \tilde{X} = \phi_{\theta_S,\theta_G}(X). \tag{6}$$
This objective can be approximated by computing its empirical estimate using samples from $P(X \mid \hat{X})$ and $P(Y \mid \tilde{X})$. We generate $\hat{X}$ and $\tilde{X}$ samples via ancestral sampling [26] from
$$P(\hat{X} \mid Z, G)\,P(Z \mid G, X)\,P(G \mid X)\,P(X), \tag{7}$$
$$P(\tilde{X} \mid Z \odot s, G)\,P(Z \odot s \mid S, Z)\,P(S \mid Z)\,P(Z \mid G, X)\,P(G \mid X)\,P(X). \tag{8}$$
However, since both $P(X \mid \hat{X})$ and $P(Y \mid \tilde{X})$ are unknown, we further use their variational lower bounds to approximate these distributions via two additional networks. Specifically, we use $Q_{\theta_R}(X \mid \psi_{\theta_G}(X))$ to approximate $P(X \mid \hat{X})$, and $Q_{\theta_P}(Y \mid \phi_{\theta_S,\theta_G}(X))$ for $P(Y \mid \tilde{X})$. This affords us the advantage of combining the four networks ($\psi_{\theta_G}$, $\phi_{\theta_S,\theta_G}$, $Q_{\theta_R}$, and $Q_{\theta_P}$) into one large network and jointly optimizing them via Stochastic Gradient Descent (SGD). The resulting formulation becomes
$$\min_{\theta_G, \theta_S, \theta_P, \theta_R} \; \sum_{i=1}^{n} \big\|x_i - Q_{\theta_R}(\psi_{\theta_G}(x_i))\big\|^2 \; - \; \lambda \sum_{i=1}^{n} p(y_i)\log Q_{\theta_P}\big(y_i \mid \phi_{\theta_S,\theta_G}(x_i)\big). \tag{9}$$
Solving Eq. (9) relies on drawing samples from $P(G|X)$ and $P(S|Z)$. However, since $G$ and $s$ are constrained to be indicators, how do we enforce the categorical constraint on the output of $\psi_{\theta_G}$ and $\phi_{\theta_S,\theta_G}$? We clarify how adding a Gumbel-Softmax layer [27] achieves this in the next section.

Gumbel-Softmax. Standard networks cannot perform backpropagation through samples. Gumbel-Softmax overcomes this obstacle by generating differentiable samples from a categorical distribution. Leveraging this technique, we sample a $k$-dimensional vector $\epsilon$ from a Gumbel distribution, where its $i$th element is sampled via $\epsilon_i = -\log(-\log u_i)$, $u_i \sim \mathrm{Uniform}(0,1)$. This enables us to apply the reparameterization trick [28], which consequently samples from a concrete distribution, $C \sim \mathrm{Concrete}(\log p_1, \dots, \log p_k)$, where the $i$th element is computed as
$$C_i = \frac{\exp\big((\log p_i + \epsilon_i)/\tau\big)}{\sum_{j=1}^{k} \exp\big((\log p_j + \epsilon_j)/\tau\big)} \quad \text{s.t.} \quad \lim_{\tau \to 0} P(C_i = 1) = \frac{p_i}{\sum_{j=1}^{k} p_j}. \tag{10}$$
The sharpness of the concrete distribution is controlled by $\tau$: as $\tau \to 0$, the concrete random variable approaches the categorical distribution as defined in Eq. (10).
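As a small illustration of Eq. (10), the sketch below draws one relaxed Concrete sample from given probabilities in plain NumPy; the function name and standalone setting are ours, whereas the actual model trains such layers end-to-end inside the network.

```python
import numpy as np

def gumbel_softmax_sample(log_p, tau=0.1, rng=np.random.default_rng()):
    """Draw one relaxed sample C ~ Concrete(log p_1, ..., log p_k), following Eq. (10)."""
    u = rng.uniform(size=log_p.shape)        # u_i ~ Uniform(0, 1)
    eps = -np.log(-np.log(u))                # Gumbel noise: eps_i = -log(-log u_i)
    logits = (log_p + eps) / tau             # temperature tau controls sharpness
    logits -= logits.max()                   # subtract max for numerical stability
    c = np.exp(logits)
    return c / c.sum()                       # softmax; approaches a one-hot vector as tau -> 0

# Example: group-membership probabilities of one feature over k = 3 groups.
print(gumbel_softmax_sample(np.log(np.array([0.2, 0.5, 0.3]))))
```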
Therefore, $\theta_G$ in Fig. 1 represents the combination of a network $Q_{\theta_G} : \mathbb{R}^d \to \mathbb{R}^{k \cdot d}$ with a Gumbel-Softmax layer. Since $Q_{\theta_G}$ outputs a vector of dimension $k \cdot d$, the output can be reorganized into $d$ columns of size-$k$ vectors, where the $i$th column represents the group-membership probabilities $[p_1, \dots, p_k]^T$ for the $i$th feature. By passing each column into the Gumbel-Softmax layer, it consequently generates a one-hot vector for each column of the $G$ matrix, representing samples from $P(G|X)$. Similarly, $\theta_S$ consists of $Q_{\theta_S} : \mathbb{R}^k \to \mathbb{R}^k$ with a Gumbel-Softmax layer. However, an m-hot vector is generated by repeating Gumbel-Softmax $m$ times. Specifically, letting each trial be $C^t$, $s$ is generated by
$$C^t \sim \mathrm{Concrete}(Q_{\theta_S}), \;\; t = 1, \dots, m, \qquad s = [s_1, \dots, s_k]^T, \quad s_j = \max_t C_j^t. \tag{11}$$
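Eq. (11) can likewise be sketched by repeating the relaxed sample $m$ times and taking an element-wise maximum; again, this is a hypothetical standalone NumPy illustration rather than the paper's TensorFlow layer.

```python
import numpy as np

def m_hot_selector(log_p, m, tau=0.1, rng=np.random.default_rng()):
    """Relaxed m-hot selector s as in Eq. (11): element-wise max over m Concrete samples."""
    k = log_p.shape[0]
    trials = np.empty((m, k))
    for t in range(m):                        # C^t ~ Concrete(Q_{theta_S}), t = 1..m
        u = rng.uniform(size=k)
        eps = -np.log(-np.log(u))
        logits = (log_p + eps) / tau
        logits -= logits.max()
        c = np.exp(logits)
        trials[t] = c / c.sum()
    return trials.max(axis=0)                 # s_j = max_t C^t_j

# Example: selecting m = 2 of k = 4 groups from the selector network's output probabilities.
print(m_hot_selector(np.log(np.array([0.1, 0.4, 0.3, 0.2])), m=2))
```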
Discovering the Number of Groups. Instead of randomly guessing the number of groups $k$, is there a theoretical guideline? We tackle this question from an information-theoretic perspective by asking whether there exists a minimum $k$ such that all relevant information is preserved. To state the question precisely: what is the minimum $m$ and $k$ such that $I(X; Y) = I(\phi_{\theta_S,\theta_G}(X); Y)$? Since we compress the original features into characteristic features and then remove the least important groups, how can information retention be possible? By studying the simpler case where all groups are kept, we identified a set of conditions under which this becomes possible and discovered a lower bound for $k$. Specifically, we simplify the problem by letting $m = k$ so that $\psi_{\theta_G} = \phi_{\theta_S,\theta_G}$, and then study whether $T_G^{+}$ and $T_G$ individually preserve information. Conceptually, since $\psi_{\theta_G} = T_G^{+} \circ T_G$, information is preserved if $T_G^{+}$ and $T_G$ both preserve information. This intuition is supported by Kraskov et al. [29]: they show that MI is invariant under diffeomorphism mappings. Therefore, we investigate whether $T_G$ and $T_G^{+}$ are diffeomorphisms and formalize these findings in the following two lemmas, with their proofs in App. B.

Lemma 1. The decoder $T_G^{+} : Z \to \mathrm{Im}(T_G^{+})$ is a diffeomorphism.

Lemma 2. If $k < d$, then the mapping $T_G : \mathbb{R}^d \to \mathbb{R}^k$ is not injective, and thus not a diffeomorphism.

Since $k$ is always less than $d$ when features are grouped together, our analysis proves that $\psi_{\theta_G}$ cannot be a diffeomorphism; a disappointing result. Yet, we note that while being a diffeomorphism guarantees information preservation, nothing is stated about non-diffeomorphism mappings. Indeed, by digging deeper, we found that information preservation is still possible under certain non-diffeomorphism conditions. Specifically, we prove that relevant information can still be preserved if $k$ is sufficiently large, i.e., larger than the number defined by Eq. (63). In fact, in these cases, we prove the existence of a matrix $G$ such that $I(\hat{X}; Y) = I(X; Y)$. We formally state this finding in Theorem 3; the proof can be found in App. D.

Theorem 3. Let $X = \{X_1, \dots, X_d\}$ be a random variable that consists of all features, let the relevant features $U$ be
$$U = \{X_j \mid I(X_j; Y) \neq 0 \;\vee\; \exists A \subseteq X,\; I(X_j; Y \mid A) \neq 0\}, \tag{12}$$
and let the irrelevant features be $U^c$. Then there exists $G \in \mathcal{G}$ such that $I(T_G(X); Y) = I(X; Y)$ if
$$k \;\geq\; |U| + \mathbb{1}(|U| \neq d), \tag{13}$$
where $\mathbb{1}(|U| \neq d)$ is an indicator function equal to one when $|U| \neq d$ and zero when $|U| = d$.

While Theorem 3 provides a theoretical bound for $k$, in practice the computation of $U$ assumes prior access to complex posterior distributions. Since this assumption is rarely true in practice, we provide an alternative bound that only requires the correlation coefficient between the features and the labels. We formally state this theorem and its corollaries below, with their proofs in App. E.

Theorem 4. Given $\rho$ as the correlation measure and $C = \{X_j \mid \rho(X_j; Y) \neq 0 \;\vee\; \exists A \subseteq X,\; \rho(X_j; Y \mid A) \neq 0\}$, then $|C| \leq |U|$.

Corollary 4.1. Theorems 3 and 4 yield a lower bound for $k$, where $|C| + \mathbb{1}(|C| \neq d) \leq k$.

Corollary 4.2. For Gaussian distributions, the inequality turns into an equality, i.e., $|C| = |U|$.

By leveraging Corollary 4.1, a more tractable set $C$ can be obtained in place of $U$ to bound $k$.
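As a rough illustration of how the bound in Corollary 4.1 might be estimated from data, the sketch below counts the features whose marginal correlation with the label exceeds a threshold; dropping the conditional-correlation term of Theorem 4 and the particular threshold value are simplifying assumptions of this sketch, not part of the paper's procedure.

```python
import numpy as np

def estimate_k_lower_bound(X, y, thresh=0.1):
    """Rough estimate of the Corollary 4.1 bound k >= |C| + 1(|C| != d).

    C is approximated here by the features whose marginal correlation with the
    label exceeds `thresh`; the conditional-correlation term of Theorem 4 is
    omitted for simplicity.
    """
    d = X.shape[1]
    corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(d)])
    c_size = int(np.sum(np.abs(corr) > thresh))
    return c_size + int(c_size != d)

# Toy check: 4 relevant features plus 6 pure-noise features, as in the synthetic setup below.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))
y = (X[:, :4].sum(axis=1) > 0).astype(float)
print(estimate_k_lower_bound(X, y))   # typically 5: |C| = 4 relevant features + one extra group
```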
Computational and Memory Complexities. Since our algorithm can be solved via SGD, gI has efficient memory and computational complexities of $O(kd^2)$ and $O(nkd^2)$, respectively. For a detailed derivation of these complexities, refer to App. H.

Feature Selection vs. Explaining Black-Box Models. Due to the common confusion between feature selection and black-box explanatory models (BEM), we emphasize that our focus is feature selection. Our method, gI, learns a classifier $Q_{\theta_P}(y \mid \phi_{\theta_S,\theta_G}(x))$ via feature grouping that approximates the true underlying posterior $P(Y|X)$. Note that one can easily extend gI to explain black-box models by changing $P(Y|X)$ to a complex black-box learned classifier $P_M(Y|X)$ (e.g., neural networks [30], random forests [31]), similar to Chen et al. [11]; here, $Q_{\theta_P}(y \mid \phi_{\theta_S,\theta_G}(x))$ approximates $P_M(Y|X)$ by learning from training data with $Y$ generated from the output of $P_M(Y|X)$ for each $x_i$. Although gI can be easily extended to BEM, we leave this extension for future research.

3 Experiments

Datasets. We validate the theoretical claims with nine synthetic datasets constructed from combinations of three Representation (D1, D2, D3) and three Relevance (R1, R2, R3) redundancy patterns, as shown in Table 1. Recall that $X_j$ indicates the $j$th feature. For Representation Redundancy (D patterns), the features within the same parentheses are correlated with each other. For Relevance Redundancy (R patterns), $P(Y = 1 \mid X)$ is directly proportional to a function of the features indicated. We generate 100,000 training, 1,000 validation, and 1,000 test samples for each combination. For each combination, we evaluate gI's ability to correctly identify the number of groups ($k$), the redundancy patterns, and classification results. We also evaluate our method on real-world gene expression data quantified by RNA sequencing from the COPDGene Study, an observational study to identify genomic markers associated with chronic obstructive pulmonary disease (COPD) [32]. The dataset is divided into a training and test set of 1500 and 407 patients, along with the expression of the 439 most relevant genes based on Gene Ontology categories [33]. We additionally test on the benchmark image datasets MNIST and Fashion MNIST (F-MNIST) [34, 35] to evaluate our method's ability to generate visual results.

| Pattern | Definition |
|---|---|
| D1 | $(X_1, X_2), (X_3, X_4)$ |
| D2 | $(X_1, X_3), (X_2, X_4)$ |
| D3 | $(X_1, X_3, X_4), (X_2)$ |
| R1 | $P(Y{=}1 \mid X) \propto e^{X_1 X_3}$ |
| R2 | $P(Y{=}1 \mid X) \propto e^{\sum_{i=1}^{4} X_i^2 - 4}$ |
| R3 | $P(Y{=}1 \mid X) \propto e^{\sin(2X_1) + 2\lvert X_2 \rvert + X_3 + \exp(-X_4 - 2.4)}$ |

Table 1: Synthetic data generation patterns.

| Model | MNIST-2 | MNIST-10 | F-MNIST |
|---|---|---|---|
| gI | 96.7 ± 0.2 | 91.6 ± 0.85 | 94.6 ± 0.6 |
| L2X | 97.1 ± 0.5 | 80.5 ± 2.5 | 96.0 ± 0.6 |
| SHAP | 99.24 ± 0.46 | 90.8 ± 1.9 | 94.45 ± 2.62 |
| INV | 91.23 ± 3.48 | 77.94 ± 2.35 | 89.63 ± 3.45 |
| Lasso | 96.01 ± 0.2 | 86.03 ± 0.02 | 96.7 ± 0.0 |
| Group Cluster | 94.36 ± 0.05 | 85.0 ± 0.09 | 92.04 ± 0.06 |
| OSCAR | 95.56 ± 0.31 | 90.94 ± 0.27 | 95.0 ± 0.3 |
| OWL | 95.8 ± 0.25 | 90.92 ± 0.31 | 94.90 ± 0.16 |
| LPA | 95.63 ± 0.56 | 87.72 ± 1.23 | 94.83 ± 0.97 |

Table 2: Image classification accuracy comparison (gI with m = 1, k = 2).

Experimental Settings. All experimental accuracies are reported as the mean and standard deviation over 10 runs. The experiments are implemented with Python, NumPy, Sklearn, and TensorFlow [36, 37, 38, 39] on a single NVIDIA GTX 1060Ti GPU. We use a neural network of width 100 and depth 2 to generate the probability inputs for the Gumbel-Softmax to obtain $G$ and $S$; the Gumbel temperature was set to 0.1. ReLU was used as the activation function, with softmax at the final layer for prediction. The Adam optimizer with a learning rate of 0.001 and hyperparameters $\beta_1 = 0.9$, $\beta_2 = 0.999$ was used without further tuning. All datasets are centered to 0 and normalized to have a standard deviation of 1. For all data, we used two fully connected layers of width 32 and 16. All $\lambda$ values are identified by maximizing the objective on a validation set.
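The following TensorFlow sketch wires the stated hyperparameters (width-100, depth-2 ReLU logits network; Gumbel temperature 0.1) into the group-sampling step of Section 2; it is a simplified, assumed reconstruction, not the released implementation, and omits the selector, decoder $Q_{\theta_R}$, and predictor $Q_{\theta_P}$ that are trained jointly in the full model.

```python
import tensorflow as tf

d, k, tau = 10, 3, 0.1   # feature count, number of groups, Gumbel temperature (0.1 as above)

# Width-100, depth-2 ReLU network producing the k*d group-membership logits.
logits_net = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(k * d),
])

def sample_group_matrix(x):
    """Sample a relaxed group-membership matrix G (batch, k, d) via column-wise Gumbel-Softmax."""
    logits = tf.reshape(logits_net(x), (-1, k, d))         # one k-vector of logits per feature
    u = tf.random.uniform(tf.shape(logits), 1e-8, 1.0)
    gumbel = -tf.math.log(-tf.math.log(u))                 # Gumbel(0, 1) noise
    return tf.nn.softmax((logits + gumbel) / tau, axis=1)  # each column relaxes to one-hot

x = tf.random.normal((32, d))                 # a dummy mini-batch
G = sample_group_matrix(x)
Z = tf.einsum("bkd,bd->bk", G, x)             # characteristic features z = G x per sample
x_hat = tf.einsum("bkd,bk->bd", G, Z)         # reconstruction path G^T G x
print(Z.shape, x_hat.shape)                   # (32, 3), (32, 10)
```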
Competing Methods. We compare gI against nine related feature selection and explainability methods. For all methods, we learn from samples of the true underlying posterior $P(Y|X)$ (i.e., ground-truth training data) to compare them fairly. Global feature selection: Lasso (Least Absolute Shrinkage and Selection Operator) [23] is a regression method that utilizes $\ell_1$ regularization to induce sparsity and effectively perform feature selection. GLasso (sparse Group Lasso) [40, 41] is a Lasso variant that assumes a feature grouping structure, enforces $\ell_1$ sparsity, and performs group selection with an $\ell_{1,2}$ regularizer. Deep instance-wise feature selection: SHAP (SHapley Additive exPlanations) [12] provides a unified framework for explaining models by identifying a class of additive feature importance measures for prediction; SHAP learns feature importance (Shapley values) based on a game-theoretic approach. L2X [11] performs instance-wise feature selection for explaining black-box models by maximizing the mutual information between the selected features and the response variable; in addition, L2X uses Gumbel-Softmax to learn a continuous relaxation of the feature selector. INV (INVASE) [14] is an extension of L2X without the need to specify the number of selected features in advance and is capable of discovering subsets of features with a different size per instance. LPA (Learn to Pay Attention) [15] is a visual-attention-based deep learning model for learning saliency maps from the original input images; we adapted the original model to a fully connected version based on the architecture used by all methods in this paper for a fair comparison. Note that these models cannot learn and do not use the feature grouping structure. CAE [42] is an end-to-end unsupervised global feature selection method that reconstructs the input data, with a Gumbel-Softmax layer as the encoder and a standard neural network as the decoder; as an unsupervised method, CAE is only applied to the visual MNIST and F-MNIST experiments. Global feature selection with group learning: OSCAR (octagonal shrinkage and clustering algorithm for regression) [18] learns feature groups in regression by regularizing the weights with $\ell_1$ and pairwise $\ell_\infty$ norms to encourage correlated predictors that have a similar effect on the response to form clusters represented by the same coefficient. OWL-Lasso [43] performs linear regression and group feature selection by utilizing a weighted $\ell_1$ regularization. Group Cluster groups the features based on hierarchical correlation clustering [44] followed by GLasso.

Results on Synthetic Data. We use the synthetic datasets to answer the following questions: (1) Can our model correctly identify the features that are highly dependent on each other? (2) Can our model correctly identify the most relevant features in predicting Y? (3) Is $k$ based on Theorems 3 and 4 a tight lower bound? (4) How does the accuracy of our method compare to existing interpretable methods?

Given all 9 redundancy combinations of (Di, Rj) plus six additional Gaussian noise features, Table 3 indicates that both gI(m = k) and gI(m < k) are capable of achieving high classification accuracy while learning the latent group structure (high representation (Rep) and relevant (Rel) accuracies), thereby confirming Thms. 1 and 2. Moreover, since the k value recommended by Thm. 4 is a lower bound, we investigated the bound by plotting the classification accuracy at each increment of k in Fig. 2 and circling the lower bound predicted by Thm. 4. As predicted by our theorem, once the number of groups passes the lower bound calculated by Thm. 4, preservation of the mutual information between X and Y becomes possible, and indeed beyond this lower bound there is no decline in classification accuracy.

| | Class Acc (gI, m = k) | k (gI, m = k) | Group Rep Acc (gI, m = k) | Class Acc (gI) | m (gI) | Group Rel Acc (gI) |
|---|---|---|---|---|---|---|
| R1 | 96.9 ± 1.5 / 100 ± 0 / 99.5 ± 1.5 | 3 / 2 / 2 | 99.8 ± 0.3 / 100 ± 0 / 100 ± 0 | 91.0 ± 3 / 99.6 ± 1 / 98.5 ± 2 | 1 / 1 / 1 | 100 ± 0 / 100 ± 0 / 100 ± 0 |
| R2 | 100 ± 0 / 100 ± 0 / 99.6 ± 0.4 | 3 / 3 / 3 | 100 ± 0 / 100 ± 0 / 99 ± 1.8 | 92 ± 3 / 95.4 ± 2 / 94 ± 1 | 2 / 2 / 2 | 100 ± 0 / 100 ± 0 / 100 ± 0 |
| R3 | 98.9 ± 0.8 / 97.6 ± 0.6 / 99.2 ± 0.8 | 3 / 3 / 3 | 100 ± 0 / 100 ± 0 / 99.5 ± 0.9 | 94.4 ± 3 / 94.7 ± 2 / 90.9 ± 5 | 2 / 2 / 2 | 100 ± 0 / 100 ± 0 / 100 ± 0 |

Table 3: Measuring gI's ability to identify the most relevant groups using the nine redundancy patterns; each cell lists the D1 / D2 / D3 values. Note that gI is capable of identifying the relevant groups while achieving a high classification accuracy.

Figure 2: Accuracy versus number of groups used. We circle the number of groups predicted by Thm. 4.

| Data | D1 | D2 | D3 | D1 + D2 |
|---|---|---|---|---|
| R1 | 98.4 ± 1 | 99.7 ± 0.46 | 95 ± 3.8 | 98.7 ± 1.3 |
| R2 | 100 ± 0 | 99.5 ± 0.5 | 100 ± 0 | 100 ± 0 |
| R3 | 98.8 ± 0.9 | 99.4 ± 0.6 | 99.4 ± 0.6 | 99.2 ± 0.4 |
| R1 | 85 ± 3.5 | 100 ± 0.0 | 85.7 ± 6 | 88 ± 4 |
| R2 | 95 ± 2 | 95 ± 1.4 | 95 ± 2.1 | 99.7 ± 0.5 |
| R3 | 94 ± 2.2 | 95 ± 1.1 | 87.7 ± 1 | 93 ± 1.3 |
| R1 | 70.25 ± 0.83 | 100 ± 0.0 | 100 ± 0.0 | 89.2 ± 0.97 |
| R2 | 88.0 ± 0.0 | 94 ± 0.0 | 82 ± 0.0 | 94.4 ± 0.48 |
| R3 | 94.6 ± 0.48 | 95 ± 0.0 | 95 ± 0.0 | 95.4 ± 0.48 |
| R1 | 87.2 ± 3 | 88.5 ± 3 | 87.2 ± 3 | 86 ± 2 |
| R2 | 73 ± 3 | 80.9 ± 4 | 68 ± 3.5 | 75 ± 4 |
| R3 | 74 ± 4 | 79 ± 2 | 73 ± 4 | 74 ± 4 |
| R1 | 49 ± 3 | 100 ± 0.0 | 100 ± 0.0 | 74 ± 1 |
| R2 | 66 ± 1 | 61 ± 1 | 67 ± 2 | 58 ± 2 |
| R3 | 75 ± 2 | 84 ± 2 | 59 ± 3 | 81 ± 8 |
| R1 | 49 ± 1 | 100 ± 0.0 | 100 ± 0.0 | 76.0 ± 0.4 |
| R2 | 64 ± 0.08 | 62 ± 0.4 | 49 ± 0.4 | 56 ± 0.4 |
| R3 | 74 ± 1 | 83 ± 0.5 | 68 ± 1.3 | 79 ± 2 |
| R1 | 49.0 ± 0.31 | 100 ± 0.0 | 100 ± 0.0 | 50.0 ± 0.0 |
| R2 | 50.0 ± 0.3 | 50.3 ± 0.11 | 50.0 ± 0.3 | 50.1 ± 0.05 |
| R3 | 74.7 ± 0.2 | 84.03 ± 0.15 | 66 ± 0.14 | 79.2 ± 0.13 |
| R1 | 49.0 ± 0.31 | 100 ± 0.0 | 100 ± 0.0 | 50.0 ± 0.0 |
| R2 | 50.0 ± 0.3 | 50.3 ± 0.11 | 50.0 ± 0.3 | 50.1 ± 0.05 |
| R3 | 74.7 ± 0.2 | 84.03 ± 0.15 | 66 ± 0.14 | 79.2 ± 0.13 |

Table 4: The classification prediction accuracy on the synthetic datasets; each block of three rows (R1–R3) corresponds to one of the compared methods.

In Table 4, we compare gI against the competing methods. In addition to mixing D and R redundancies together, we increase the data complexity by combining D1 relationships with D2 as D1 + D2, where half of the samples generated are randomly chosen to have D1 redundancies while the other half is set to have D2 redundancies. Since only our model performs classification based on instance-wise grouping of features, the D1 + D2 pattern is of particular interest for validating our advantage, i.e., given its correct assumption about the data, our model is expected to outperform all alternative methods. By marking the best results as bold in Table 4, we can see that gI is almost always the best performing classifier. As expected, the accuracy difference is particularly prominent with D1 + D2.
COPDGene Dataset. Given the effect of smoking on health, there is significant interest in its impact on gene expression. Specifically, how does exposure alter gene expression, and how do groups of genes exhibit coordinated changes given exposure? This data highlights the insufficiency of learning a single global group structure, because smokers and non-smokers may be characterized by completely different gene groups. We emphasize the importance of identifying this variability by applying gI to the COPDGene dataset to learn the most predictive groups of genes for smoking status. Instead of trying to pinpoint a single group of the most important genes, gI's instance-wise capability is designed to automatically identify multiple groups.

Figure 3: Gene expression (input features) of patients, $X^T$. No pattern is visually noticeable.
Figure 4: Prediction accuracy vs. number of features selected. gI consistently outperforms other methods.
Figure 5: The important genes selected by G and s. The selected genes (rows) are indicated by white pixels for each patient (column) and black when not selected.
Figure 6: Each column represents the characteristic features of each patient, i.e., $Z \odot s$. Note that, visually, smokers and non-smokers are clustered appropriately.

Fig. 4 compares the test accuracy of several competing methods given an increasing number of features; gI consistently achieves the highest accuracy. Moreover, note that Group Lasso represents the traditional approach of applying biologically predefined groups. Yet, even when domain knowledge is incorporated within Group Lasso, the instance-wise capability of gI identifies gene groups that achieve much higher predictive accuracy. We next studied the group structure produced by gI. First, notice that the original gene expression matrix in Fig. 3 lacks any visually noticeable patterns. We then plot the most relevant group of genes selected by G and s in Fig. 5, where the selected genes (rows) are indicated by white pixels for each patient (column). Even with the high variance between patients, a pattern emerges: gI has identified the groups of genes that are common across smokers and non-smokers, respectively. As suggested by our results, there exists a visual difference in gene expression between the two groups, and gI has identified the specific genes for each group. We next plot the characteristic features formed by each G matrix. As predicted by Thm. 4, this compressed representation of the original input features retained the most relevant information despite the compression. Since different genes tended to be selected in smokers compared to non-smokers, we performed Gene Set Enrichment Analysis (GSEA), as implemented in the GenePattern Cloud instance (https://cloud.genepattern.org/), using a set of curated immunologic gene signatures (the C7 set) from the Molecular Signatures Database. Immunologic signatures are well suited for the analysis of blood expression data since the majority of cells present in blood are immune cells. The analysis determines whether predefined gene sets are enriched in the extremes of the ranked list of genes, where ranking is based on each gene's likelihood of being selected within each of the two cohorts. In this analysis, using a 10 percent false discovery rate, 20 significantly enriched immunologic gene sets were identified among the genes most frequently selected for smokers, whereas no similar enrichment of immunologic signatures was observed among the genes most frequently selected for non-smokers.
Competing Method Performance on Image Datasets. Table 2 reports the classification accuracy of all methods on MNIST-2 (3 vs. 8), MNIST-10 (all 10 digits), and F-MNIST. Notice that while L2X and SHAP performed slightly better on the simpler F-MNIST and MNIST-2 datasets, gI performed better on the more complex MNIST-10 dataset. In Fig. 7, we compare the visual patterns generated by gI against several of the best performing deep models. For each image, each method identifies and displays the most informative pixel group in white; the top row is the original image, while the results of each method are displayed below. While L2X, LPA, and SHAP are all instance-wise and can achieve high predictive accuracy, it is not visually clear from their white pixels in Fig. 7 why those pixels are important. CAE outputs a discernible shape of an 8; however, its features are global, resulting in the same pixel choice across all samples. In contrast, gI discovered the pixels that are equally important, resulting in a visually compelling segmentation in the shape of the classification object. Our result suggests that capturing and identifying redundancies within the data produces visually interpretable explanations, highlighting the importance of combining group structure with instance-wise flexibility. While other methods struggle to identify the different digits and clothing, gI handled the complexity independently of the number of classes. An even more challenging task is to also capture the style variation within the same class. We highlight this ability with 10 digits of diverse shapes in Fig. 7 under INSTANCE-WISE MNIST STYLE; a larger collection showcasing a variety of style variation results can be found in App. F. For these results, notice how the explanatory pixels closely follow the style of the original image.

Figure 7: Comparing the most important pixels as identified by each competing algorithm.

4 Conclusion

Our theoretical contribution formally defines the concept of redundancy between features based on MI. This clarifies how features can be grouped together and how many groups should exist while retaining the most relevant information. It further enables us to formulate an objective (gI) that captures these redundancies on an instance-wise basis. Our theories are corroborated by both synthetic and real experimental results. We have applied our instance-wise feature group discovery and selection method to lung disease gene expression data, from which we discovered gene expression patterns common to smokers and non-smokers, respectively.

5 Broader Impacts

In this paper, we introduce a novel algorithm for instance-wise feature group discovery and selection. The algorithm learns mapping functions that identify the appropriate group membership of each feature along with each group's importance as an instance-wise label predictor. Namely, we have focused our paper on feature selection to model the features important for capturing the information in the underlying true posterior $P(Y|X)$. While we have focused on feature selection, there are also other strategies to define and approach interpretability [45]. Instead of estimating the posterior distribution $P(Y|X)$, one can apply our method to capture the information in trained black-box models $P_M(Y|X)$, e.g., deep neural networks and random forests. Consequently, the algorithm can be used to perform instance-wise group feature selection on the black-box model, learning the features which a given black-box model perceives as important.
In this approach to explainability, our method has the potential to make black-box models explainable in terms of knowing how the features were used during prediction. This gives rise to future research directions that can help data scientists check for bias, fairness, and vulnerabilities in the models they use [46, 47]. Although this paper focuses on the machine learning aspects of our discovery, our work is also relevant for its findings on the lung disease dataset. The feature selection results on the lung disease data allow us to discover the genes that interact together for predicting smoking and non-smoking status. This can potentially impact our understanding of lung disease, in particular by identifying cooperative relationships between genes that can delineate important aspects of their biological functions. However, making an impact on medical research would require further careful investigation to confirm the findings with appropriate medical collaborators. As a word of caution to our ML and data analyst colleagues: we encourage applying ML to applications that are beneficial to society, such as health; but, to do so properly, one needs to work closely with domain expert collaborators to make nontrivial contributions to their fields of research. Beyond applications to lung disease, learning the important features for prediction and the features that interact together is important for the genetic understanding of other diseases [48, 49]. In general, feature selection has been impactful in a variety of domains beyond medicine, for example, climate [50] and law [51]. Given the potential impact it can have, including on the most pressing diseases of today, we seek to widely disseminate this research and make our source code publicly available at https://github.com/ariahimself/Instance-wise-Feature-Grouping. Lastly, while our method is useful in identifying feature groups that interact together for prediction, we caution that this does not imply causation, which poses a potential misuse of our technique. Additionally, since our model is learned from a training set, its conclusions are limited by the quality and characteristics of the data it was trained on. Therefore, inherent biases that pre-existed in the data will lead to biased feature groups and conclusions. As with any supervised machine learning algorithm, our method can be applied to a variety of applications (e.g., health, climate, image analysis) with potential impact on multiple sectors of society. Our intent is to build such models for societal good, and we encourage others to do so as well.

Acknowledgements

The work described was supported in part by Award Numbers U01 HL089897, U01 HL089856, R01 HL124233, and R01 HL147326 from the National Heart, Lung, and Blood Institute, the FDA Center for Tobacco Products (CTP), and NSF IIS 1546428. The authors would also like to thank Stratis Ioannidis, Amirreza Farnoosh, Yale Chang, Davin Hill, and Zulqarnain Khan for helping to review the paper and providing fruitful discussions. Their insightful comments, in conjunction with the comments of the anonymous reviewers, have helped to improve this paper tremendously.

References

[1] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.
[2] Le Song, Alex Smola, Arthur Gretton, Justin Bedo, and Karsten Borgwardt. Feature selection via dependence maximization. Journal of Machine Learning Research, 13(May):1393–1434, 2012.
[3] Jason Weston, Sayan Mukherjee, Olivier Chapelle, Massimiliano Pontil, Tomaso Poggio, and Vladimir Vapnik. Feature selection for SVMs. In Advances in Neural Information Processing Systems, pages 668–674, 2001.
[4] Kenji Kira, Larry A. Rendell, et al. The feature selection problem: Traditional methods and a new algorithm. In AAAI, volume 2, pages 129–134, 1992.
[5] Manoranjan Dash and Huan Liu. Feature selection for classification. Intelligent Data Analysis, 1(3):131–156, 1997.
[6] Daphne Koller and Mehran Sahami. Toward optimal feature selection. Technical report, Stanford InfoLab, 1996.
[7] Xiaofei He, Deng Cai, and Partha Niyogi. Laplacian score for feature selection. In Advances in Neural Information Processing Systems, pages 507–514, 2006.
[8] Mahdokht Masaeli, Glenn Fung, and Jennifer G. Dy. From transformation-based dimensionality reduction to feature selection. In Proceedings of the International Conference on Machine Learning, pages 751–758, 2010.
[9] Mahdokht Masaeli, Yan Yan, Glenn Fung, and Jennifer G. Dy. Convex principal feature selection. In Proceedings of the SIAM International Conference on Data Mining, pages 619–628, 2010.
[10] Yale Chang, Yi Li, Adam Ding, and Jennifer Dy. A robust-equitable copula dependence measure for feature selection. In Artificial Intelligence and Statistics, pages 84–92, 2016.
[11] Jianbo Chen, Le Song, Martin Wainwright, and Michael Jordan. Learning to explain: An information-theoretic perspective on model interpretation. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 883–892, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/chen18j.html.
[12] Scott M. Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017.
[13] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. ICML, abs/1704.02685, 2017.
[14] Jinsung Yoon, James Jordon, and Mihaela van der Schaar. INVASE: Instance-wise variable selection using neural networks. ICLR, 2018.
[15] Saumya Jetley, Nicholas A. Lord, Namhoon Lee, and Philip H. S. Torr. Learn to pay attention. In International Conference on Learning Representations, 2018.
[16] Fadil Santosa and William W. Symes. Linear inversion of band-limited reflection seismograms. SIAM Journal on Scientific and Statistical Computing, 7(4):1307–1330, 1986.
[17] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
[18] Howard Bondell and Brian Reich. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64:115–123, March 2008.
[19] Smita Chormunge and Sudarson Jena. Correlation based feature selection with clustering for high dimensional data. Journal of Electrical Systems and Information Technology, 5(3):542–549, 2018.
[20] Jordan Frecon, Saverio Salzo, and Massimiliano Pontil. Bilevel learning of the group lasso structure. In Advances in Neural Information Processing Systems, pages 8301–8311, 2018.
[21] Xiangrong Zeng and Mário A. T. Figueiredo. Solving OSCAR regularization problems by fast approximate proximal splitting algorithms. Digital Signal Processing, 31:124–135, 2014.
[22] Ron Kohavi and George H. John. Wrappers for feature subset selection. Artificial Intelligence, 97:273–324, 1997.
[23] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
[24] Lei Yu and Huan Liu. Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research, 5:1205–1224.
[25] Hanchuan Peng, Fuhui Long, and Chris Ding. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226–1238, 2005.
[26] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[27] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. ICLR, 2016.
[28] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. ICLR, 2013.
[29] Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. Physical Review E, 69(6):066138, 2004.
[30] David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams, et al. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.
[31] T. K. Ho. Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, 1995.
[32] Aabida Saferali, Jeong H. Yun, Margaret M. Parker, Phuwanat Sakornsakolpat, Robert P. Chase, Andrew Lamb, Brian D. Hobbs, Marike H. Boezen, Xiangpeng Dai, Kim de Jong, et al. Analysis of genetically driven alternative splicing identifies FBXO38 as a novel COPD susceptibility gene. PLoS Genetics, 15(7):e1008229, 2019.
[33] Adrian Alexa and Jorg Rahnenfuhrer. topGO: enrichment analysis for Gene Ontology. R package, 2020.
[34] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
[35] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
[36] Guido Van Rossum and Fred L. Drake. The Python Language Reference Manual. Network Theory Ltd., 2011.
[37] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001. URL http://www.scipy.org/.
[38] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108–122, 2013.
[39] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
[40] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. A note on the group lasso and a sparse group lasso. arXiv preprint arXiv:1001.0736, 2010.
[41] Noah Simon, Jerome Friedman, Trevor Hastie, and Robert Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 22(2):231–245, 2013.
[42] Abubakar Abid, Muhammad Fatih Balin, and James Zou. Concrete autoencoders for differentiable feature selection and reconstruction. ICML, 2019.
[43] Małgorzata Bogdan, Ewout Van Den Berg, Chiara Sabatti, Weijie Su, and Emmanuel J. Candès. SLOPE: adaptive variable selection via convex optimization. The Annals of Applied Statistics, 9(3):1103, 2015.
[44] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. John Wiley & Sons, 2012.
[45] Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning. arXiv:1702.08608v2, 2017.
[46] Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador Garcia, Sergio Gil-Lopez, Daniel Molina, Richard Benjamins, Raja Chatila, and Francisco Herrera. Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58:82–115, June 2020.
[47] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. A survey of methods for explaining black box models. ACM Computing Surveys, 51(5):Article 93, August 2018.
[48] Rebecka Jörnsten and Bin Yu. Simultaneous gene clustering and subset selection for sample classification via MDL. Bioinformatics, 19(9):1100–1109, 2003.
[49] Mee Young Park, Trevor Hastie, and Robert Tibshirani. Averaged gene expressions for regression. Biostatistics, 8(2):212–227, 2007.
[50] Soumyadeep Chatterjee, Karsten Steinhaeuser, Arindam Banerjee, Snigdhansu Chatterjee, and Auroop Ganguly. Sparse group lasso: Consistency and climate applications. In Proceedings of the 2012 SIAM International Conference on Data Mining, pages 47–58. SIAM, 2012.
[51] Harry Surden. Machine learning and law. Wash. L. Rev., 89:87, 2014.