Learning Feature Engineering for Classification

Fatemeh Nargesian1, Horst Samulowitz2, Udayan Khurana2, Elias B. Khalil3, Deepak Turaga2
1University of Toronto, 2IBM Research, 3Georgia Institute of Technology
fnargesian@cs.toronto.edu, {samulowitz, ukhurana}@us.ibm.com, lyes@gatech.edu, turaga@us.ibm.com

Abstract

Feature engineering is the task of improving predictive modelling performance on a dataset by transforming its feature space. Existing approaches to automate this process rely either on exploring the transformed feature space through evaluation-guided search, or on explicitly expanding datasets with all transformed features followed by feature selection. Such approaches incur high computational costs in runtime and/or memory. We present a novel technique, called Learning Feature Engineering (LFE), for automating feature engineering in classification tasks. LFE is based on learning the effectiveness of applying a transformation (e.g., arithmetic or aggregate operators) to numerical features, from past feature engineering experiences. Given a new dataset, LFE recommends a set of useful transformations to be applied to features without relying on model evaluation or explicit feature expansion and selection. Using a collection of datasets, we train a set of neural networks, each of which aims at predicting whether a transformation impacts classification performance positively. Our empirical results show that LFE outperforms other feature engineering approaches for an overwhelming majority (89%) of the datasets from various sources while incurring a substantially lower computational cost.

1 Introduction

Feature engineering is a central task in data preparation for machine learning. It is the practice of constructing suitable features from given features that lead to improved predictive performance. Feature engineering involves the application of transformation functions, such as arithmetic and aggregate operators, to given features to generate new ones. Transformations help scale a feature or convert a non-linear relation between a feature and a target class into a linear relation, which is easier to learn. Feature engineering is usually conducted by a data scientist relying on domain expertise and an iterative process of trial and error with model building and evaluation.

To perform automated feature engineering, some existing approaches adopt guided search in the feature space using heuristic feature quality measures (such as information gain) and other surrogate measures of performance [Markovitch and Rosenstein, 2002; Fan et al., 2010]. Others perform greedy feature construction and selection based on model evaluation [Dor and Reich, 2012; Khurana et al., 2016]. Kanter et al. proposed the Data Science Machine (DSM), which treats the feature engineering problem as feature selection over the space of novel features. DSM relies on exhaustively enumerating all possible features that can be constructed from a dataset, given sequences generated from a set of transformations, and then performing feature selection on the augmented dataset [Kanter and Veeramachaneni, 2015]. Both evaluation-based approaches and exhaustive feature enumeration with selection result in high time and memory costs and may lead to overfitting due to brute-force generation of features.
Moreover, although deep neural networks (DNNs) allow useful meta-features to be learned automatically [Bengio et al., 2013], the learned features are not always interpretable, and DNNs are not effective learners in various application domains.

In this paper, we propose LFE (Learning Feature Engineering), a novel meta-learning approach to automatically perform interpretable feature engineering for classification, based on learning from past feature engineering experiences. By generalizing the impact of different transformations on the performance of a large number of datasets, LFE learns useful patterns among features, transformations, and targets that improve learning accuracy. We show that such patterns, generalized across thousands of features from hundreds of datasets, can be used to successfully predict suitable transformations for features in new datasets without actually applying the transformations or performing the time-consuming tasks of model building and validation.

LFE takes as input a dataset and recommends a set of paradigms for constructing new useful features. Each paradigm consists of a transformation and an ordered list of features on which the transformation is suitable. At the core of LFE is a set of Multi-Layer Perceptron (MLP) classifiers, one per transformation. Given a set of features and class labels, each classifier predicts whether its transformation can derive a more useful feature than the input features. LFE considers the notion of feature and class relevance in the context of a transformation as the measure of the usefulness of a pattern of feature value and class label distributions under that transformation.

Different datasets contain different numbers of features and different value ranges. One key challenge in generalizing across different datasets is to convert feature values and their class labels into a fixed-size feature vector representation that can be fed into the LFE classifiers. To characterize datasets, handcrafted meta-features, fixed-size stratified sampling, neural networks, and hashing methods have been used for different tasks [Michie et al., 1994; Kalousis, 2002; Feurer et al., 2015; Weinberger et al., 2009]. However, these representations do not directly capture the correlation between feature values and class labels. To capture such correlations, LFE constructs a stack of fixed-size representations of feature values per target class. We use a Quantile Data Sketch to represent the feature values of each class. Quantile sketches provide a fixed-size representation and achieve a reasonably accurate approximation of the distribution function induced by the data being sketched [Greenwald and Khanna, 2001].

LFE presents a computationally efficient and effective alternative to other automated feature engineering approaches by recommending suitable transformations for features in a dataset. To showcase the capabilities of LFE, we trained LFE on 85K features, extracted from 900 datasets, for 10 unary transformations, and on 122K feature pairs for 4 binary transformations, for two models: Random Forest and Logistic Regression. The transformations are listed in Table 1.
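As a concrete illustration of what a recommended paradigm might look like in code, the following minimal Python sketch (not from the paper; the field names and the example feature are hypothetical) captures the three components described above:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Paradigm:
    """One LFE recommendation: a transformation, the ordered features it
    should be applied to, and the confidence of that transformation's MLP
    as a usefulness score. Field names are illustrative, not from the paper."""
    transformation: str    # e.g., "log" (unary) or "sum" (binary)
    features: List[str]    # ordered list of input feature names
    usefulness: float      # confidence score of the transformation's classifier

# For example, Paradigm("log", ["income"], 0.91) suggests appending log(income)
# as a new feature; "income" is a made-up feature name.
```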
We empirically compare LFE with a suite of feature engineering approaches proposed in the literature or applied in practice (such as the Data Science Machine, evaluation-based approaches, random selection of transformations, and always applying the most popular transformation in the training data) on a subset of 50 datasets from the UCI repository [Lichman, 2013], OpenML [Vanschoren et al., 2014], and other sources. Our experiments show that, of the datasets that demonstrated any improvement through feature engineering, LFE was the most effective in 89% of the cases. As shown in Figure 2, similar results were observed for LFE trained with Logistic Regression. Moreover, LFE runs in significantly less time than the other approaches. This also enables interaction with a practitioner, since LFE recommends transformations for features within a short amount of time.

2 Related Work

The FICUS algorithm [Markovitch and Rosenstein, 2002] takes as input a dataset and a set of transformations, and performs a beam search over the space of possible features. FICUS's search for better features is guided by heuristic measures based on information gain in a decision tree, and other surrogate measures of performance. The more recent FCTree [Fan et al., 2010] also uses a decision tree to partition the data, using original or constructed features as splitting points.

The FEADIS algorithm of [Dor and Reich, 2012] relies on a combination of random feature generation and feature selection. FEADIS adds constructed features greedily, and as such requires many expensive performance evaluations. The Deep Feature Synthesis component of the Data Science Machine (DSM) [Kanter and Veeramachaneni, 2015] relies on exhaustively enumerating all possible new features, and then performing feature selection over the augmented dataset. Cognito [Khurana et al., 2016] recommends a series of transformations based on a greedy, hierarchical heuristic search. Cognito and DSM focus on sequences of feature transformations, which is outside the scope of this paper. In contrast to these approaches, we do not require expensive classifier evaluations to measure the impact of a transformation. ExploreKit [Katz et al., 2016] also generates all possible candidate features, but uses a learned ranking function to sort them. ExploreKit requires generating all possible candidate features when engineering features for a new (test) dataset; as such, the results reported for ExploreKit are obtained with a time limit of three days per dataset. In contrast, LFE can generate effective features within seconds, on average.

Several machine learning methods perform feature extraction or learning indirectly. While they do not explicitly work on input features and transformations, they generate new features as a means to solving another problem [Storcheus et al., 2015]. Methods of this type include dimensionality reduction, kernel methods, and deep learning. Kernel algorithms such as SVM [Shawe-Taylor and Cristianini, 2004] can be seen as embedded methods, where the learning and the (implicit) feature generation are performed jointly. This is in contrast to our setting, where feature engineering is a preprocessing step. Deep neural networks learn useful features automatically [Bengio et al., 2013] and have shown remarkable success on video, image, and speech data. However, in some domains feature engineering is still required.
Moreover, features derived by neural networks are often not interpretable, which is an important factor in certain application domains such as healthcare [Che et al., 2015].

3 Automated Feature Engineering Problem

Consider a dataset, D, with features, F = {f1, . . . , fn}, a target class, a set of transformations, T = {T1, . . . , Tm}, and a classification task, L. The feature engineering problem is to find the q best paradigms for constructing new features such that appending the new features to D maximizes the accuracy of L. Each paradigm consists of a candidate transformation Tc ∈ T of arity r, an ordered list of features [fi, . . . , fi+r−1], and a usefulness score.

For a dataset with n features and u unary transformations, O(u·n) new features can be constructed. With b binary transformations, there are O(b·P(n, 2)) possible new features, where P(n, 2) is the number of 2-permutations of n features. Given a fixed set of transformations, the number of new features and their combinations to explore, for an exact solution, grows exponentially. Hence, we make the case that mere enumeration and trial by model training and testing is not a computationally practical option, and a scalable solution to the problem must avoid this computational bottleneck.

LFE reduces the complexity of feature space exploration by providing an efficient approach for finding a particularly good transformation for a given set of features. Therefore, given n features, LFE performs O(n) transformation predictions for unary transformations and O(P(n, 2)) for binary transformations. In order to assess the relative impact of adding new features across different techniques, we add as many new features as there are originally in the data. For unary transformations, LFE predicts the most suitable transformation for each feature. For binary and higher-arity transformations, LFE considers a random sample of all combinations of features, finds the paradigm for each combination, and selects the top-k useful ones. In the following section, we describe how LFE learns and predicts useful transformations for features.

4 Transformation Recommendation

LFE models the problem of predicting a useful r-ary transformation Tc ∈ Tr (where Tr is the set of r-ary transformations in T) for a given list of features [f1, . . . , fr] as a multi-class classification problem, where the input is a representation of the features, R[f1,...,fr], and the output classes are the transformations in Tr. LFE takes a one-vs-rest approach [Rifkin and Klautau, 2004]. Each transformation is modelled as a Multi-Layer Perceptron (MLP) binary classifier with real-valued confidence scores as output. Recommending an r-ary transformation for r features involves applying all |Tr| MLPs on R[f1,...,fr]. If the highest confidence score obtained from the classifiers is above a given threshold, LFE recommends the corresponding transformation to be applied on the features. Let gk(R[f1,...,fr]) be the confidence score of the MLP corresponding to transformation Tk, and let γ be the threshold for confidence scores, which we determined empirically. LFE recommends the transformation Tc for features [f1, . . . , fr] as follows:

  c = arg max_k gk(R[f1,...,fr])
  recommend: Tc, if gc(R[f1,...,fr]) > γ; none, otherwise.    (1)

In the following sections, we describe how numerical features are represented as the input of the transformation MLPs.
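The recommendation rule of Equation (1) can be sketched in a few lines of Python. The snippet below is an illustration under the assumptions stated in its docstring (scikit-learn-style classifiers and a placeholder threshold), not the paper's implementation:

```python
import numpy as np

def recommend_transformation(representation, mlps, gamma=0.5):
    """Apply Eq. (1): score R_[f1,...,fr] with every r-ary transformation MLP,
    pick the transformation with the highest confidence, and recommend it only
    if that confidence exceeds the threshold gamma.

    `mlps` is assumed to map a transformation name to a trained binary
    classifier exposing a scikit-learn style predict_proba; gamma=0.5 is a
    placeholder, since the paper determines the threshold empirically.
    """
    rep = np.asarray(representation, dtype=float).reshape(1, -1)
    scores = {name: clf.predict_proba(rep)[0, 1]      # g_k(R)
              for name, clf in mlps.items()}
    best = max(scores, key=scores.get)                # c = argmax_k g_k(R)
    if scores[best] > gamma:
        return best, scores[best]                     # recommend T_c
    return None, scores[best]                         # no transformation recommended
```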
Next, we explain the process of collecting past feature engineering knowledge as training samples to train the MLPs.

4.1 Feature-Class Representation

Transformations are used to reveal and improve significant correlation or discriminative information between features and class labels. The more pronounced this correlation, the higher the chance that a model can achieve better predictive performance. Each LFE classifier learns the patterns of feature-class distributions for which the corresponding transformation has been effective in improving feature-class correlation. LFE represents feature f in a dataset with k classes as follows:

  Rf = [Q(1)f ; Q(2)f ; . . . ; Q(k)f]    (2)

where Q(i)f is a fixed-size representation of the values in f that are associated with class i. We call this representation the Quantile Sketch Array. Next, we describe how the feature values associated with class i are translated into the representation Q(i)f, which is meant to capture the distribution of those feature values.

Neural networks have been successful in learning representations for image and speech data [Bengio et al., 2013]. Others have proposed solutions for estimating a Probability Distribution Function (PDF) in an n-dimensional space [Garrido and Juste, 1998]. However, it is not clear how existing representation learning and PDF learning approaches may be applied in the context of raw numerical data. The main challenge is the high variability in the size and the range of feature values (e.g., from 10 to millions). In our setting, features are data points represented with varying numbers of dimensions (the number of distinct feature values). Hence, Random Projection [Bingham and Mannila, 2001] for dimensionality reduction is not applicable. Although Recurrent Neural Networks can deal with varying input sizes, we aim at determining a fixed-size representation that captures the correlation between features and target classes. Previous approaches have used hand-crafted meta-features, including information-theoretic and statistical meta-features, to represent datasets [Michie et al., 1994; Kalousis, 2002; Feurer et al., 2015]. Such meta-features aim at modelling the distribution of values in datasets. Performing fixed-size sampling of feature values is another approach for representing distributions. Samples extracted from features and classes are required to reflect the distribution of values in both the features and the target classes. While stratified sampling solves the issue for one feature, it becomes more complex for multiple features with high correlation [Skinner et al., 1994]. Feature hashing has been used to represent features of type string as feature vectors [Weinberger et al., 2009]. Although feature hashing can be generalized to numerical values, it is not straightforward to choose distance-preserving hash functions that map values within a small range to the same hash value. In Section 5, we empirically show how LFE transformation classifiers perform using each of the representations described above.

Quantile Sketch Array (QSA) uses a quantile data sketch [Wang et al., 2013] to represent the feature values associated with a class label. QSA is a non-parametric representation that enables characterizing the approximate Probability Distribution Function of values. QSA is closely related to the familiar concept of a histogram, where data is summarized into a small number of buckets. Naumann et al. used a quantile representation of numerical features to perform feature classification [Naumann et al., 2002].
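The construction of Rf in Equation (2) can be sketched as follows. This is a minimal Python illustration: the bin count r = 200 is an assumed default, and np.histogram stands in for the exact scaled, uniform-width quantile sketch described in the next paragraph.

```python
import numpy as np

def feature_class_representation(feature_values, labels, r=200):
    """Build the Quantile Sketch Array R_f of Eq. (2): one fixed-size sketch
    Q(i)_f per class, concatenated into a single vector."""
    feature_values = np.asarray(feature_values, dtype=float)
    labels = np.asarray(labels)
    sketches = []
    for cls in np.unique(labels):               # one sketch per target class i
        vals = feature_values[labels == cls]    # values of f associated with class i
        counts, _ = np.histogram(vals, bins=r)  # fixed-size summary of the class-conditional values
        sketches.append(counts / counts.sum())  # normalize to an approximate distribution
    return np.concatenate(sketches)             # R_f = [Q(1)_f; ...; Q(k)_f]
```

For an r-ary transformation, the representations of the individual input features would be concatenated in the given order to form R[f1,...,fr].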
We apply the exact brute-force approach to computing a quantile sketch for numerical values [Wang et al., 2013], described as follows. Let Vk be the bag of values in feature f that belong to the training data points with label ck, and let Q(k)f be the quantile sketch of Vk. First, we scale these values to a predefined range [lb, ub]. Generating Q(k)f involves bucketing all values in Vk into a set of bins. Given a fixed number of bins, r, the range [lb, ub] is partitioned into r disjoint bins of uniform width w = (ub − lb)/r. Assume the range [lb, ub] is partitioned into bins {b0, . . . , br−1}, where bin bj corresponds to the range [lb + j·w, lb + (j + 1)·w). The function B(vl) = bj associates the value vl in Vk with the bin bj. The function P(bj) returns the number of feature values that are bucketed in bj. Finally, I(bj) = P(bj) / Σ0≤m<r P(bm) is the normalized value of the bin bj.
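A minimal Python sketch of this procedure follows. The number of bins r, the target range [lb, ub], and the min-max scaling scheme are assumptions, since the text only states that values are scaled to a predefined range.

```python
import numpy as np

def quantile_sketch(values, r=200, lb=-10.0, ub=10.0):
    """Exact brute-force quantile sketch of one class's bag of values Vk
    (assumes a non-empty bag)."""
    values = np.asarray(values, dtype=float)
    vmin, vmax = values.min(), values.max()
    # Scale the values in Vk into the predefined range [lb, ub].
    if vmax > vmin:
        scaled = lb + (values - vmin) * (ub - lb) / (vmax - vmin)
    else:
        scaled = np.full_like(values, lb)
    # Bucket into r disjoint bins of uniform width w: B(vl) = bj.
    w = (ub - lb) / r
    bins = np.clip(((scaled - lb) // w).astype(int), 0, r - 1)
    # P(bj): number of values bucketed into bin bj.
    counts = np.bincount(bins, minlength=r).astype(float)
    # I(bj): normalized bin count, P(bj) / sum_m P(bm).
    return counts / counts.sum()
```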