# Ensemble-based Ultrahigh-dimensional Variable Screening

Wei Tu1, Dong Yang1, Linglong Kong1, Menglu Che2, Qian Shi1, Guodong Li3 and Guangjian Tian4
1Department of Mathematical and Statistical Sciences, University of Alberta
2Department of Statistics and Actuarial Science, University of Waterloo
3Department of Statistics and Actuarial Science, University of Hong Kong
4Huawei Noah's Ark Lab, Hong Kong, China
{wei.tu, dy2, lkong, qshi}@ualberta.ca, m3che@uwaterloo.ca, gdli@hku.hk, tian.guangjian@huawei.com
These authors contributed equally to this work. Corresponding author.

Since the sure independence screening (SIS) method by Fan and Lv [2008], many different variable screening methods have been proposed based on different measures under different models. However, most of these methods are designed for specific models. In practice, we often have very little information about the data generating process, and different methods can result in very different sets of features. The heterogeneity presented here motivates us to combine various screening methods simultaneously. In this paper, we introduce a general ensemble-based framework to efficiently combine results from multiple variable screening methods. The consistency and sure screening property of the proposed framework have been established. Extensive simulation studies confirm our intuition that the proposed ensemble-based method is more robust against model misspecification than using a single variable screening method. The proposed ensemble-based method is used to predict attention deficit hyperactivity disorder (ADHD) status using brain functional connectivity (FC).

1 Introduction

The evolution of data acquisition technologies and computing power has allowed researchers to collect and store data with high dimensionality and complex structure much more efficiently. Examples can be found in gene expression microarray data, single nucleotide polymorphism (SNP) data, magnetic resonance imaging (MRI) data, high-frequency financial data, and others. One common task is to extract useful variables from a high-dimensional feature space to explain or predict a response variable. Traditional variable selection methods such as forward selection, backward elimination and best subset selection become computationally expensive or even infeasible under these conditions. To address these problems, a family of penalized least squares based methods has been developed. Examples include Lasso and Adaptive Lasso ([Tibshirani, 1996; Zou, 2006]), SCAD [Fan and Li, 2001], elastic net [Zou and Hastie, 2005], and MCP [Zhang, 2010]. However, when the dimensionality p is much larger than the sample size n, or even grows exponentially with n, the aforementioned penalization methods can perform poorly or even become infeasible due to the simultaneous challenges of computational expediency, statistical accuracy and algorithmic stability [Fan et al., 2009]. For example, in MRI studies, images with dimension 1024 × 1024 × 200 can be acquired for each subject, and due to the high cost of MRI scanning, studies might contain fewer than 100 subjects. If we treat the signal from each voxel as a feature, the dimension of the feature space p is much higher than the sample size n.
A natural idea to address these challenges is to reduce the dimensionality p from a large scale to a relatively large scale d using a fast screening algorithm, so that the ultrahigh-dimensional problem is greatly simplified into a moderately high-dimensional one. Subsequently, standard penalized variable selection methods can be applied to the remaining variables. Fan and Lv [2008] first introduced sure independence screening (SIS), which ranks the marginal correlation between each covariate and the response. Its good numerical performance and novel theoretical properties have made SIS popular for ultrahigh-dimensional reduction. As a result, SIS and its extensions have been generalized to many important settings, including generalized linear models [Fan et al., 2010], multi-index semi-parametric models [Zhu et al., 2011], nonparametric regression [Fan et al., 2011], and quantile regression [He et al., 2013]. Other marginal screening methods based on different measures of association between predictors and the response have also been studied, such as Kendall's τ [Li et al., 2012a] and distance correlation [Li et al., 2012b]. We refer to [Liu et al., 2015] for a more comprehensive list of references.

Most feature screening methods are designed for specific models, and they all enjoy good theoretical and numerical performance under a set of conditions on the data generating process. However, in practice, we rarely know the actual relationship between the features and the response, and different covariates might have different relationships with the response. The heterogeneity presented here motivates us to consider various screening methods simultaneously in practice, so the natural question is how to aggregate the results from these screening methods to achieve more robust and accurate performance. To the authors' best knowledge, there is no existing work addressing this problem. In this paper, we propose an ensemble-based framework for combining results from multiple variable screening methods. It is well known that ensemble methods are usually more accurate than a single learner, and they have already achieved great success in many real-world tasks [Zhou, 2012]. In the area of variable selection, the idea of ensembles has been studied before ([Bach, 2008; Bolón-Canedo and Alonso-Betanzos, 2019]).

In section 2, we introduce the details of ensemble-based variable screening. In section 3, we establish the theoretical properties of the proposed method. In section 4, we conduct extensive simulation studies under different models. In section 5, we conduct a real data analysis using functional magnetic resonance imaging (fMRI) data to predict attention deficit hyperactivity disorder (ADHD) status. In summary, we have the following contributions and findings:

1. We propose an ensemble-based variable screening framework to efficiently combine results from different variable screening methods. The proposed framework is very flexible and can be parallelized.
2. We prove that the proposed ensemble-based variable screening method inherits the nice theoretical properties (e.g. consistency, sure screening property) of the base candidates.
3. We conduct extensive simulation studies and real data analysis to illustrate the numerical performance of the proposed framework.
The proposed ensemble-based method is more robust against model misspecification than using a single variable screening method; that is, even though the proposed ensemble-based screener may not outperform every base screener in a specific model setting, it has the most consistent and robust performance across different model settings.

2 Ensemble-based Variable Screening

2.1 A General Framework

We consider the problem of variable screening in an ultrahigh-dimensional feature space, where we observe a response variable $Y$ and the associated covariate vector $X = (X_1, \ldots, X_p)^T$. Consider the conditional distribution function of $Y$ given $X$, denoted by $F(y|X) = P(Y < y \mid X)$. Define two sets of variables:
$$\mathcal{A} = \{j : F(y|X) = P(Y < y \mid X) \text{ functionally depends on } X_j\},$$
$$\mathcal{I} = \{j : F(y|X) = P(Y < y \mid X) \text{ does not functionally depend on } X_j\}.$$
If $j \in \mathcal{A}$, $X_j$ is referred to as an active feature; otherwise it is an inactive feature. The goal is to reduce the dimensionality $p$ from a large scale to a moderate scale by a fast and efficient method, while including all the variables in the set $\mathcal{A}$.

The key idea of the marginal screening procedure is to rank all predictors using a utility measure between the response and each predictor, and then to retain the top variables for further investigation. It usually involves three steps:

1. Calculate the screening statistic vector $F = [f_1, \ldots, f_p]$. For example, $f_j = \mathrm{cor}(X_j, Y)$ is used in SIS.
2. Obtain the rank vector $R = [r_1, \ldots, r_p]$ by ranking $F$ from largest to smallest (usually a large $f_j$ indicates a stronger relationship between $X_j$ and $Y$).
3. Choose the $\gamma p$ top-ranked features as the active set $\hat{\mathcal{A}}$, where $\gamma \in (0, 1)$ is predetermined. To be specific, the vote vector is $V = [v_1, \ldots, v_p]$ with $v_j = I(r_j \le \gamma p)$.

Different screening methods usually result in different rankings, and in the next section we introduce the concept of variable screening ensembles to combine the results from different methods.

2.2 Constructing Variable Screening Ensembles

Generally, an ensemble is constructed in two steps: generating the base learners and combining them. To get a good ensemble, it is generally believed that the base learners should be as accurate as possible and as diverse as possible. Consider $K$ base variable screening candidates, each of which returns screening utilities for all $p$ variables. We introduce the variable screening ensemble matrix
$$W = \begin{bmatrix} W_1 \\ W_2 \\ \vdots \\ W_K \end{bmatrix} = \begin{bmatrix} w_{11} & w_{12} & w_{13} & \cdots & w_{1p} \\ w_{21} & w_{22} & w_{23} & \cdots & w_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ w_{K1} & w_{K2} & w_{K3} & \cdots & w_{Kp} \end{bmatrix},$$
where $W_i = [w_{i1}, \ldots, w_{ip}]$, $i = 1, \ldots, K$, denotes the screening utilities obtained using screening candidate $i$. $W_i$ can be the screening statistic vector $F_i$ of all $p$ features, for example, the marginal correlation between each $X_j$ and $Y$ in SIS. If a single base variable screening candidate has been used $K$ times through bootstrapping or parameter tuning, we obtain a homogeneous ensemble. On the other hand, if multiple screening candidates have been used, this leads to a heterogeneous ensemble. Since the screening statistics produced by different base candidates can have different scales and/or ranges, $W_i$ and $W_j$ may not be comparable. In this case, we can choose the rank vector $R_i$ of all $p$ features obtained using base candidate $i$ as $W_i$. Another choice is to use the binary vote vector $V_i$ introduced in the last section. Intuitively, using the rank $R$ will be more robust than directly using the screening statistic $F$, and it will be more efficient than using the binary vote $V$, since $V$ can be sensitive to the choice of the threshold parameter $\gamma$.
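To make the construction above concrete, the following is a minimal Python sketch of the three-step marginal screening procedure and of stacking several base screeners into an ensemble matrix. It is an illustration rather than the authors' implementation: the absolute Pearson correlation stands in for a generic screening utility, and the function names, the bootstrap-style second screener and the toy data are all hypothetical.

```python
import numpy as np

def screening_statistics(X, y):
    """Step 1: marginal screening utilities; here |Pearson correlation|, as in SIS."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(num / den)

def ranks_and_votes(f, gamma):
    """Steps 2-3: rank utilities from largest to smallest; vote for the top gamma*p."""
    p = f.size
    order = np.argsort(-f)                  # feature indices by decreasing utility
    r = np.empty(p, dtype=int)
    r[order] = np.arange(1, p + 1)          # r[j] = rank of feature j (1 = strongest)
    v = (r <= int(gamma * p)).astype(int)   # binary vote vector V
    return r, v

def ensemble_matrix(screeners, X, y, use="rank", gamma=0.1):
    """Stack the K base screeners' outputs (utilities, ranks, or votes) into a K x p matrix W."""
    rows = []
    for screener in screeners:
        f = screener(X, y)
        r, v = ranks_and_votes(f, gamma)
        rows.append({"stat": f, "rank": r, "vote": v}[use])
    return np.vstack(rows)

# Toy usage with two hypothetical base screeners (same utility on different sub-samples).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2000))
y = X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(200)
screeners = [
    screening_statistics,
    lambda X, y: screening_statistics(X[:100], y[:100]),  # bootstrap-style variant
]
W = ensemble_matrix(screeners, X, y, use="rank")
print(W.shape)  # (2, 2000): one rank vector per base screener
```

Using the rank rows rather than the raw utilities keeps base screeners with very different utility scales comparable, which is exactly the motivation for the rank-based ensemble matrix described above.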
In section 4, we further use numerical studies to confirm this intuition. After obtaining the ensemble matrix, we discuss how to efficiently combine the columns of the matrix.

2.3 Aggregation Functions

After constructing the ensemble matrix, the next step is to find appropriate combination functions to aggregate the results. A general combination function is a multivariate function that maps the $j$-th column of the ensemble matrix to a real number, that is, $f(w_{1j}, \ldots, w_{Kj}) \in \mathbb{R}$. There are many combination functions available in the literature [Zhou, 2012]. In our ensemble screening framework, we consider two major types of combination functions.

Mean Combination: Taking means of prediction results is a commonly adopted approach in the machine learning literature [Nguyen et al., 2018]. In our approach, the mean combination function combines the output of the screening algorithms by taking the mean of each column in the ensemble matrix. Given the screening utilities associated with the $j$-th column, $w_{1j}, \ldots, w_{Kj}$, the mean ensemble function is defined as
$$f(w_{1j}, \ldots, w_{Kj}) = \frac{1}{K} \sum_{i=1}^{K} w_{ij}, \quad j = 1, \ldots, p.$$

Median Combination: Instead of the mean of $w_{1j}, \ldots, w_{Kj}$, we can use the median, which is more robust when the screening utilities are skewed or contain outliers. For example, suppose we use the ranking vector $R_i$ obtained from base candidate $i$ as each row $W_i$ of the ensemble matrix. For an active variable $X_j$, if most base candidates return a small rank except for one or two "outliers", then the mean-aggregated result will be largely influenced by these "outliers", while the median-aggregated result will still be robust. Formally, we have
$$f(w_{1j}, \ldots, w_{Kj}) = \mathrm{Median}\{w_{1j}, \ldots, w_{Kj}\}, \quad j = 1, \ldots, p.$$

2.4 Variables Selected

Denote the combined information of the ensemble matrix obtained by applying the mean or median aggregation function as $\{E_1, \ldots, E_p\}$. For any given threshold $\gamma \in (0, 1)$, the $\gamma p$ top-ranked predictors are selected as the active set:
$$\hat{\mathcal{A}} = \{1 \le j \le p : |E_j| \text{ is among the first } \gamma p \text{ largest of all}\}.$$

3 Technical Results

As variable screening only serves as the first step in high-dimensional data analysis, the most important property as far as practical application is concerned is the sure screening property [Fan and Lv, 2008]. That is, with probability approaching 1, the screening algorithm keeps all of the true active variables:
$$P(\mathcal{A} \subseteq \hat{\mathcal{A}}) \to 1.$$
We require that the sure screening property holds for each screening algorithm. For the quantile-based variable screening algorithms, e.g. QaSIS, we require that the sure screening property holds at each quantile level $\tau_k$, which means the selected variables contain the true active set $\mathcal{A}_k$ with probability tending to one. Regarding the screening utilities, we require that all the screening algorithms in the ensemble have consistent screening utilities, that is, for $1 \le i \le K$,
$$P\Big(\max_{1 \le j \le p} |\hat{w}_{ij} - w_{ij}| > \delta_n\Big) \to 0,$$
where $w_{ij}$ is the screening utility for screening algorithm $i$ and variable $j$, $\hat{w}_{ij}$ is an estimator of $w_{ij}$, and $\delta_n$ is some threshold that is usually related to $n$. This requirement is reasonable, as it is shown to hold in most of the variable screening literature [Fan and Lv, 2008; Fan et al., 2011; He et al., 2013; Zhu et al., 2011].
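As a concrete illustration of the aggregation and selection steps in Sections 2.3 and 2.4, the sketch below (a hedged illustration, not the authors' code; the function names and the toy matrix are hypothetical) reduces a rank-based ensemble matrix to one score per feature and keeps the top $\gamma p$ features. One assumption is made explicit: for aggregated screening statistics a larger value indicates a stronger feature, while for aggregated ranks a smaller value does, so the selection direction is passed as a flag.

```python
import numpy as np

def aggregate(W, how="median"):
    """Combine the K rows of the ensemble matrix column-wise (Section 2.3)."""
    if how == "mean":
        return W.mean(axis=0)        # f(w_1j, ..., w_Kj) = (1/K) * sum_i w_ij
    if how == "median":
        return np.median(W, axis=0)  # robust to one or two 'outlier' screeners
    raise ValueError("how must be 'mean' or 'median'")

def select_top(E, gamma, larger_is_stronger=True):
    """Keep the gamma*p top-ranked features (Section 2.4).

    For aggregated screening statistics, a larger |E_j| means a stronger feature;
    for aggregated ranks, pass larger_is_stronger=False so the smallest values win.
    """
    p = E.size
    k = max(1, int(gamma * p))
    score = np.abs(E) if larger_is_stronger else -np.asarray(E, dtype=float)
    return set(map(int, np.argsort(-score)[:k]))

# Hypothetical 3 x 6 rank-based ensemble matrix; the third screener ranks feature 1 poorly.
W = np.array([[2, 1, 3, 4, 5, 6],
              [1, 2, 3, 4, 5, 6],
              [1, 6, 2, 3, 4, 5]])
E_mean = aggregate(W, "mean")    # [1.33, 3.0, 2.67, 3.67, 4.67, 5.67]
E_med = aggregate(W, "median")   # [1.,   2.,  3.,   4.,   5.,   6.  ]
print(select_top(E_mean, gamma=0.34, larger_is_stronger=False))  # {0, 2}: feature 1 dropped
print(select_top(E_med, gamma=0.34, larger_is_stronger=False))   # {0, 1}: feature 1 kept
```

On this toy matrix a single deviating ranking is enough to push the active feature out of the mean-aggregated top set, while the median-aggregated set still retains it, illustrating why the median combination is the more robust default.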
Lemma 3.1 (Consistency of aggregated screening utilities). Given $K$ screening algorithms based on screening utilities $w_{ij}$, $i = 1, \ldots, K$ and $j = 1, \ldots, p$, denote by $f$ the combination function. Assume that
$$P\Big(\max_{1 \le j \le p} |\hat{w}_{ij} - w_{ij}| > \delta_n\Big) \to 0,$$
$$|f(\hat{w}_{1j}, \ldots, \hat{w}_{Kj}) - f(w_{1j}, \ldots, w_{Kj})| \le \max_{1 \le i \le K} |\hat{w}_{ij} - w_{ij}|,$$
where $\delta_n$ is some threshold constant related to $n$. Then the aggregated screening utility is consistent:
$$P\Big(\max_{1 \le j \le p} |f(\hat{w}_{1j}, \ldots, \hat{w}_{Kj}) - f(w_{1j}, \ldots, w_{Kj})| > \delta_n\Big) \to 0.$$

Theorem 3.2 (Sure screening property). Denote by $w_j$ the $j$-th aggregated screening utility and by $\hat{w}_j$ its sample estimate. Denote by $\mathcal{A}$ the active variable set. Assume $\min_{j \in \mathcal{A}} w_j > 2\delta_n$ and that our screening method selects the variables with $\hat{w}_j > \delta_n$. Given the assumptions of the previous lemma, we have the sure screening property for our ensemble screening approach:
$$P(\mathcal{A} \subseteq \hat{\mathcal{A}}) \to 1,$$
where $\delta_n$ is some threshold related to $n$.

Remark: To guarantee that the sure screening property holds for our ensemble procedure, we just need $\delta_n$ to be the smallest among all the thresholds we combine. As long as all of the candidate methods have the sure screening property and the assumptions in Lemma 3.1 hold, our ensemble approach also enjoys the sure screening property.

Lemma 3.3 (Lower bound for simultaneous screening probability). Denote by $\hat{T}_1$ and $\hat{T}_2$ the sets selected by applying two different screening algorithms. Define $\hat{T}_{simu} = \hat{T}_1 \cap \hat{T}_2$ as the variable set selected by both screening algorithms. Define $\hat{\Pi}^1_K$, $\hat{\Pi}^2_K$, $\hat{\Pi}^{simu}_K$ as the probabilities that the selected set of screening algorithm 1, the selected set of screening algorithm 2 and the simultaneous set contain a variable set $K$, $K \subseteq \{1, \ldots, p\}$. Then we have
$$\hat{\Pi}^{simu}_K \ge 2 \min\big\{\hat{\Pi}^1_K, \hat{\Pi}^2_K\big\} - 1.$$

Lemma 3.4. Let $K \subseteq \{1, \ldots, p\}$ be a set of variables and $\hat{T}_i$ be the set of variables selected by screening algorithm $i$. If $\max\big\{P(K \subseteq \hat{T}_1), P(K \subseteq \hat{T}_2)\big\} \le \varepsilon$, then $P(\hat{\Pi}^{simu}_K \ge \xi) \le \varepsilon^2/\xi$.

Theorem 3.5 (Error control). Denote by $S = |T|$ and $N = |F|$ the numbers of underlying true important and unimportant variables. Correspondingly, denote by $\hat{S} = |T \cap \hat{T}_{simu}|$ and $\hat{N} = |F \cap \hat{F}_{simu}|$ the numbers of estimated important and unimportant variables. In addition, denote by $V = |F \cap \hat{T}_{simu}|$ the number of falsely selected variables in $\hat{T}_{simu}$. Assume exchangeability, that is, $P(k \in \hat{T}) = E(\hat{N})/N$ for $k \in \{1, \ldots, p\}$, and assume that the candidate variable screening process is not worse than random guessing. Given the screening threshold $T_n$ and the threshold of selection probability $\pi_{thr}$, we have
$$E(V) \le \frac{1}{2\pi_{thr} - 1} \cdot \frac{T_n^2}{p}.$$

Remark: If a variable is important, its underlying true ranking is higher than that of the unimportant variables. With a threshold $T_n$, the true important variables are expected to rank within $T_n$. If $T_n$ is a relatively small number, say 30, and the probability threshold $\pi_{thr}$ is decently greater than 50%, then given $p = 1000$ the expected number of falsely discovered variables is controlled at a small value; for instance, with $\pi_{thr} = 0.75$ the bound gives $E(V) \le \frac{1}{0.5} \cdot \frac{30^2}{1000} = 1.8$. Intuitively, when each candidate screening method is good enough and the threshold $T_n$ is small, the number of falsely selected variables can be controlled at a very low level.

4 Numerical Studies

4.1 Simulation Settings

In this section, we demonstrate the numerical performance of our proposed approach.
For the base screening candidates, we consider three popular methods: SIS [Fan and Lv, 2008], sure independence ranking and screening (SIRS) [Zhu et al., 2011], and quantile-adaptive sure independence screening (QaSIS) [He et al., 2013]. SIS is based on the linear model, while SIRS and QaSIS are both model-free. These three methods all have their advantages and disadvantages, and none of them uniformly performs better than the others across all models. For QaSIS, different quantile levels can be used for screening, so we first consider the ensemble of QaSIS over different quantile levels, which gives a more homogeneous scenario. Second, we consider a more heterogeneous case by combining all three methods.

We consider two evaluation metrics: the first one is R, the smallest number of features that we need to include to ensure that all the active variables are selected; the second one (denoted by S) is the proportion of active predictors selected when the threshold $T_n = n/\log(n)$ is adopted. An effective variable screening procedure is expected to have a value of R reasonably small compared to the number of active variables and a value of S close to one. All experiments are repeated 100 times, and the median and interquartile range (IQR) are reported for both R and S.

Depending on the choice of $W_i$ in the ensemble matrix and the aggregation function, different variable screening ensembles can be obtained. We use VSrE (Variable Screening Ensemble), VSrE-R and VSrE-V to denote the ensembles obtained by choosing W as the original screening statistic F, the rank R and the binary vote V, respectively (defined in section 2.1). For the aggregation function, mean and median are used. For example, the mean-based ensemble is denoted as VSrE(mean). Note that the median aggregation function for the vote-based VSrE may not be a good choice, since the median is not a good summary statistic for binary data.

[Figure 1: Comparisons of different ensemble-based methods via combining QaSIS with different quantile levels under Model 1. Each line represents a different choice of quantile set (T1(5), T2(10), T3(19), T4(37)), with the number in brackets giving the number of quantile levels used. For each method under each model, the median log(R) and its IQR (error bar) are presented.]

4.2 Model Settings

We consider the following four models, ranging from linear and nonlinear to heteroscedastic cases. The random error term $\varepsilon$ follows a standard normal distribution in all four models and is independent of $X$.

Model 1 (n = 200, p = 2000) [Fan and Lv, 2008]: This is a model of the form $Y = X\beta + \varepsilon$, in which all the columns of $X$ are generated from a standard normal distribution. There are in total eight true active predictors, generated as follows: set $a = 4\log(n)/\sqrt{n}$, and the coefficients corresponding to the active predictors are given by $(-1)^u (a + |z|)$, where $u$ follows a Bernoulli distribution with success probability 0.4 and $z$ is drawn from a standard normal distribution.

Model 2 (n = 200, p = 2000) [Zhu et al., 2011]: The data are generated from $Y = 2(X_1 + 0.8X_2 + 0.6X_3 + 0.4X_4 + 0.2X_5) + \exp(X_{20} + X_{21} + X_{22})\,\varepsilon$, where $X = (X_1, X_2, \ldots, X_{2000})$ follows a multivariate normal distribution with mean 0 and covariance $\Sigma = (\sigma_{ij})_{p \times p}$ with $\sigma_{ij} = \rho^{|i-j|}$. We consider $\rho = 0.8$ here. The number of active variables is 8.
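To show how these simulation settings translate into code, here is a minimal sketch generating data from Model 2 under our reading of the description above: AR(1)-correlated Gaussian predictors with ρ = 0.8 and a heteroscedastic error multiplier. This is an assumption-based reconstruction for illustration, not the authors' simulation code; the recursive AR(1) construction is a standard way to obtain $\Sigma_{ij} = \rho^{|i-j|}$ without forming the full p × p covariance matrix.

```python
import numpy as np

def ar1_gaussian(n, p, rho, rng):
    """Draw n rows of a p-dimensional N(0, Sigma) vector with Sigma_ij = rho^|i-j|.

    The recursion X_j = rho * X_{j-1} + sqrt(1 - rho^2) * Z_j reproduces exactly
    this correlation structure column by column.
    """
    Z = rng.standard_normal((n, p))
    X = np.empty((n, p))
    X[:, 0] = Z[:, 0]
    for j in range(1, p):
        X[:, j] = rho * X[:, j - 1] + np.sqrt(1.0 - rho ** 2) * Z[:, j]
    return X

def simulate_model2(n=200, p=2000, rho=0.8, seed=0):
    """Model 2 of Zhu et al. [2011] as described in Section 4.2 (our reconstruction)."""
    rng = np.random.default_rng(seed)
    X = ar1_gaussian(n, p, rho, rng)
    eps = rng.standard_normal(n)
    mean_part = 2.0 * (X[:, 0] + 0.8 * X[:, 1] + 0.6 * X[:, 2]
                       + 0.4 * X[:, 3] + 0.2 * X[:, 4])
    scale_part = np.exp(X[:, 19] + X[:, 20] + X[:, 21])  # X20 + X21 + X22 in 1-based indexing
    y = mean_part + scale_part * eps
    return X, y

X, y = simulate_model2()
print(X.shape, y.shape)  # (200, 2000) (200,)
```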
Model 3 (n = 400, p = 1000) [Fan et al., 2011]: First, define the following functions: $g_1(x) = x$; $g_2(x) = (2x - 1)^2$; $g_3(x) = \sin(2\pi x)/(2 - \sin(2\pi x))$; $g_4(x) = 0.1\sin(2\pi x) + 0.2\cos(2\pi x) + 0.3\sin(2\pi x)^2 + 0.4\cos(2\pi x)^3 + 0.5\sin(2\pi x)^3$. The data are generated from $Y = 5g_1(X_1) + 3g_2(X_2) + 4g_3(X_3) + 6g_4(X_4) + 1.74\,\varepsilon$, where $X = (X_1, X_2, \ldots, X_{1000})$ has a similar structure to that in Model 2 with $\rho = 0.4$.

Model 4 (n = 400, p = 5000) [He et al., 2013]: $Y = 2(X_1^2 + X_2^2) + [10^{-1}\exp(X_1 + X_2 + X_{18} + X_{19} + \cdots + X_{30})]\,\varepsilon$, where $X = (X_1, X_2, \ldots, X_{5000})$ has a similar structure to that in Model 2 with $\rho = 0.8$. The number of active variables is 15.

[Figure 2: Boxplots of log(R) (upper) and S (lower) for ensemble-based screening methods obtained by combining QaSIS with different quantile levels under Model 1. Quantile level set T2 with K = 10 is used here.]

4.3 Combining Quantile Levels

We first consider the VSrE constructed by applying QaSIS separately at different quantile levels. In order to show the effect of different choices of quantile sets, we report results for four quantile sets with 5, 10, 19 and 37 equally spaced quantile levels, respectively: T1 = {0.1, 0.3, ..., 0.9}, T2 = {0.05, 0.15, ..., 0.95}, T3 = {0.05, 0.1, ..., 0.95}, T4 = {0.05, 0.075, ..., 0.95}. As an illustration of the choice of VSrEs and combination functions, all five types of methods are applied. Model 1 is used for this simulation. The line chart in Figure 1 compares the single screening methods (QaSIS(0.5) and QaSIS(0.75)) with the proposed ensemble-based methods using the four different sets of quantile levels, while the boxplots in Figure 2 focus on the case where quantile level set T2 with K = 10 is used. From Figure 1 and Figure 2, we have the following observations:

1. All proposed ensemble-based methods outperform the single base screening method, with regard to both R and S.
2. The performance of the ensemble-based method improves as the number of base screening methods K increases, but the improvement becomes marginal once K reaches a certain level, while the computation time increases linearly with K.
3. When the screening statistic is comparable across ensemble candidates, both mean- and median-based VSrE and VSrE-R provide reasonably good results.

[Figure 3: Comparison of the proposed ensemble-based methods (solid lines) and single ensemble candidates (dashed lines) under different model settings. For each method under each model, the median log(R) and its IQR (error bar) are presented. Note that the ensemble-based VSrE-R here is obtained by combining QaSIS(0.75), SIRS and SIS.]

[Figure 4: Boxplots of log(R) (upper) and S (lower) for different variable screening methods under Model 2. Note that the ensemble-based VSrE-R here is obtained by combining QaSIS(0.75), SIRS and SIS.]

4.4 Combining Different Variable Screening Methods

Instead of constructing a VSrE by applying QaSIS at different quantile levels, we can apply different screening methods to the sample and combine the results.
In the following, we show the power of simply combining three commonly used variable screening methods: SIS, SIRS and QaSIS. Quantile level 0.75 is used for QaSIS. As the screening utilities of different methods may have very different magnitudes, it is more appropriate to use VSrE-R in order to avoid the situation where screening methods with relatively large screening utilities dominate the combined result.

Figure 3 compares the ensemble-based VSrE-R(mean) and VSrE-R(median) with the single screening candidates under Models 1 to 4. Figure 4 provides more detailed results on Model 2 for both R and S. Based on Figure 3 and Figure 4, we have the following observations:

1. The proposed VSrE-R(median) has the most robust performance across all four models.
2. Each ensemble candidate has its own advantage: SIS performs the best under Model 1, SIRS under Model 3 and QaSIS under Model 4. The ensemble-based methods inherit the good performance of the base screening candidates, while remaining robust to model misspecification as long as some of the base learners perform reasonably well.
3. For the aggregation function, the median is a more robust choice.

5 Real Data Analysis

Attention deficit hyperactivity disorder (ADHD) is a brain disorder marked by an ongoing pattern of inattention and hyperactivity-impulsivity that interferes with functioning or development. Recent developments in medical imaging such as functional magnetic resonance imaging (fMRI) and diffusion tensor imaging (DTI) show promising potential in predicting patients' outcomes and understanding the underlying pathophysiology of diseases [Greicius et al., 2007]. We use the ADHD-200 Consortium data, a publicly available resting-state fMRI (rs-fMRI) dataset [Milham et al., 2012], in this study. fMRI measures brain activity by detecting changes associated with blood flow [Huettel et al., 2004]. This dataset contains 120 subjects (n = 120) from the NYU site (New York University Child Study Center) of the ADHD-200 Consortium; 42 are typically developing children (ADHD negative) and 78 are diagnosed with ADHD. The Anatomical Automatic Labeling (AAL) atlas [Tzourio-Mazoyer et al., 2002] was used for the parcellation.

In this study, our goal is to find which brain connectivities between pairs of ROIs (regions of interest) contribute to ADHD status; hence each brain connectivity is treated as a variable. For each subject, there are 172 time points, and the AAL atlas has 116 ROIs. We obtain the mean time series for each of the 116 regions by averaging the fMRI time series over all voxels in the region, so initially we have $p = (116 \times 116 - 116)/2 = 6670$ predictors. Because of the large-p-small-n scenario (n = 120, p = 6670), a variable screening procedure is necessary to remove noisy brain connectivities before lower-dimensional variable selection approaches can be applied. We apply VSrE-R(median) obtained by combining SIS, SIRS and QaSIS with quantile level 0.75, and use QaSIS at quantile level 0.75 as a comparison. In the variable selection step, we employ both LASSO and SCAD. We then choose those ROI connectivities that are simultaneously selected by both LASSO and SCAD. For the classification, we adopt the support vector machine (SVM) classifier [Hearst et al., 1998] with a linear kernel.
All tuning parameters are selected by 10-fold cross-validation. Unlike in the simulation settings, we do not know the true active variables, so we cannot use metrics like R and S. Instead, we measure the usefulness of the selected features by their predictive power.

[Figure 5: Prediction accuracy of ADHD status using a single screening candidate (QaSIS(0.75)) and the proposed ensemble-based method. (a) Results based on the proposed ensemble-based VSrE; (b) results based on QaSIS.]

Figure 5 shows the prediction accuracy over 100 bootstrapped samples. The top panel is the histogram of accuracy using the ensemble-based method, while the bottom panel is the one using QaSIS with quantile level 0.75. We can see a substantial improvement using the proposed method. Also, some of the partial correlations screened out by the proposed ensemble-based method involve regions associated with the default-mode network, a large and robustly replicable network of brain regions associated with task-irrelevant mental processes and mind-wandering. Similar findings are reported in [Uddin et al., 2008; Tian et al., 2006].

6 Discussion

In this paper, we introduce a general ensemble-based framework to combine results from different variable screening methods. The simulation studies confirm our intuition that the ensemble-based methods indeed inherit the good performance of the candidates; that is, as long as some candidates have decent performance, the ensemble-based method will work reasonably well. This is very important in practice: when we do not have much information about the actual data generating process, the ensemble-based methods are a more robust and safer choice than relying on a single method. Also, we find that the median rank-based aggregation method has the most robust performance. It is also worth pointing out that the proposed framework can be easily implemented in parallel, so it does not add much computational burden.

References

[Bach, 2008] Francis R Bach. Bolasso: model consistent lasso estimation through the bootstrap. In Proceedings of the 25th International Conference on Machine Learning, pages 33–40. ACM, 2008.

[Bolón-Canedo and Alonso-Betanzos, 2019] Verónica Bolón-Canedo and Amparo Alonso-Betanzos. Ensembles for feature selection: A review and future trends. Information Fusion, 52:1–12, 2019.

[Fan and Li, 2001] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.

[Fan and Lv, 2008] Jianqing Fan and Jinchi Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849–911, 2008.

[Fan et al., 2009] Jianqing Fan, Richard Samworth, and Yichao Wu. Ultrahigh dimensional feature selection: beyond the linear model. Journal of Machine Learning Research, 10:2013–2038, December 2009.

[Fan et al., 2010] Jianqing Fan, Rui Song, et al. Sure independence screening in generalized linear models with NP-dimensionality. The Annals of Statistics, 38(6):3567–3604, 2010.

[Fan et al., 2011] Jianqing Fan, Yang Feng, and Rui Song. Nonparametric independence screening in sparse ultrahigh-dimensional additive models. Journal of the American Statistical Association, 106(494):544–557, 2011.
[Greicius et al., 2007] Michael D Greicius, Benjamin H Flores, Vinod Menon, Gary H Glover, Hugh B Solvason, Heather Kenna, Allan L Reiss, and Alan F Schatzberg. Resting-state functional connectivity in major depression: abnormally increased contributions from subgenual cingulate cortex and thalamus. Biological Psychiatry, 62(5):429–437, 2007.

[He et al., 2013] Xuming He, Lan Wang, and Hyokyoung Grace Hong. Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. The Annals of Statistics, 41(1):342–369, 2013.

[Hearst et al., 1998] Marti Hearst, Susan T. Dumais, E. Osman, John Platt, and Bernhard Schölkopf. Support vector machines. IEEE Intelligent Systems, 13(4):18–28, July 1998.

[Huettel et al., 2004] Scott A Huettel, Allen W Song, and Gregory McCarthy. Functional Magnetic Resonance Imaging, volume 1. Sinauer Associates, Sunderland, MA, 2004.

[Li et al., 2012a] Gaorong Li, Heng Peng, Jun Zhang, and Lixing Zhu. Robust rank correlation based screening. The Annals of Statistics, pages 1846–1877, 2012.

[Li et al., 2012b] Runze Li, Wei Zhong, and Liping Zhu. Feature screening via distance correlation learning. Journal of the American Statistical Association, 107(499):1129–1139, 2012.

[Liu et al., 2015] Jing-Yuan Liu, Wei Zhong, and Run-Ze Li. A selective overview of feature screening for ultrahigh-dimensional data. Science China Mathematics, 58(10):1–22, 2015.

[Milham et al., 2012] Michael P Milham, Damien Fair, Maarten Mennes, and Stewart Mostofsky. The ADHD-200 consortium: a model to advance the translational potential of neuroimaging in clinical neuroscience. Frontiers in Systems Neuroscience, 6:62, 2012.

[Nguyen et al., 2018] Tien Thanh Nguyen, Xuan Cuong Pham, Alan Wee-Chung Liew, and Witold Pedrycz. Aggregation of classifiers: a justifiable information granularity approach. IEEE Transactions on Cybernetics, pages 1–10, 2018.

[Tian et al., 2006] Lixia Tian, Tianzi Jiang, Yufeng Wang, Yufeng Zang, Yong He, Meng Liang, Manqiu Sui, Qingjiu Cao, Siyuan Hu, Miao Peng, et al. Altered resting-state functional connectivity patterns of anterior cingulate cortex in adolescents with attention deficit hyperactivity disorder. Neuroscience Letters, 400(1-2):39–43, 2006.

[Tibshirani, 1996] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

[Tzourio-Mazoyer et al., 2002] Nathalie Tzourio-Mazoyer, Brigitte Landeau, Dimitri Papathanassiou, Fabrice Crivello, Olivier Etard, Nicolas Delcroix, Bernard Mazoyer, and Marc Joliot. Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single-subject brain. NeuroImage, 15(1):273–289, 2002.

[Uddin et al., 2008] Lucina Q Uddin, A. M. Clare Kelly, Bharat B Biswal, Daniel S Margulies, Zarrar Shehzad, David Shaw, Manely Ghaffari, John Rotrosen, Lenard A Adler, F Xavier Castellanos, et al. Network homogeneity reveals decreased integrity of default-mode network in ADHD. Journal of Neuroscience Methods, 169(1):249–254, 2008.

[Zhang, 2010] Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.

[Zhou, 2012] Zhi-Hua Zhou. Ensemble Methods: Foundations and Algorithms. Chapman and Hall/CRC, 2012.

[Zhu et al., 2011] Li-Ping Zhu, Lexin Li, Runze Li, and Li-Xing Zhu. Model-free feature screening for ultrahigh-dimensional data. Journal of the American Statistical Association, 106(496):1464–1475, 2011.
[Zou and Hastie, 2005] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.

[Zou, 2006] Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.