Journal of Machine Learning Research 21 (2020) 1-32. Submitted 6/19; Revised 12/19; Published 02/20.

Distributed Feature Screening via Componentwise Debiasing

Xingxiang Li (xli396@uottawa.ca)
School of Mathematics and Statistics, Xi'an Jiaotong University, China
Department of Mathematics and Statistics, University of Ottawa, Canada

Runze Li (rzli@psu.edu)
Department of Statistics and The Methodology Center, The Pennsylvania State University, USA

Zhiming Xia (statxzm@nwu.edu.cn)
School of Mathematics, Northwest University, China

Chen Xu (cx3@uottawa.ca)
Department of Mathematics and Statistics, University of Ottawa, Canada

Editor: Garvesh Raskutti

Abstract

Feature screening is a powerful tool in processing high-dimensional data. When the sample size N and the number of features p are both large, the implementation of classic screening methods can be numerically challenging. In this paper, we propose a distributed screening framework for the big-data setup. In the spirit of divide-and-conquer, the proposed framework expresses a correlation measure as a function of several component parameters, each of which can be distributively estimated using a natural U-statistic from data segments. With the component estimates aggregated, we obtain a final correlation estimate that can be readily used for screening features. This framework enables distributed storage and parallel computing and is thus computationally attractive. Due to the unbiased distributive estimation of the component parameters, the final aggregated estimate achieves a high accuracy that is insensitive to the number of data segments m. Under mild conditions, we show that the aggregated correlation estimator is as efficient as the centralized estimator in terms of the probability convergence bound and the mean squared error rate; the corresponding screening procedure enjoys the sure screening property for a wide range of correlation measures.
The promising performance of the new method is supported by extensive numerical examples.

Keywords: feature screening, big data, divide-and-conquer, aggregated correlation, sure screening property

©2020 Xingxiang Li, Runze Li, Zhiming Xia, and Chen Xu. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/19-537.html.

1. Introduction

With the rapid development of data generation and acquisition, massive data sets with a huge number of features are frequently encountered in many scientific fields. High dimensionality poses simultaneous challenges of computational cost, statistical accuracy, and algorithmic stability for classic statistical methods (Fan et al., 2009). To facilitate the computing process, one natural strategy is to screen most irrelevant features out before an elaborate analysis. This procedure is referred to as feature screening. With the dimensionality reduced from high to low, analytical difficulties decrease drastically. A rich body of work exists in this area; in particular, correlation-based screening methods have attracted a great deal of attention. These methods conduct screening based on a correlation measure between the features and the response. Features with weak correlations are treated as irrelevant and are removed. Methods of this type can be conveniently implemented without strong model assumptions (they can even be model-free). Thus, they are commonly used for analyzing high-dimensional data with complex structures. For example, Fan and Lv (2008) proposed sure independence screening (SIS) based on the Pearson correlation. Zhu et al. (2011) proposed sure independent ranking and screening (SIRS) based on a utility measure that concerns the entire conditional distribution of the response given the predictors. Li et al.
(2012a) proposed a robust rank correlation screening (RRCS) procedure based on the Kendall τ rank correlation. Li et al. (2012b) developed a model-free sure independence screening procedure based on the distance correlation (DC-SIS). Wu and Yin (2015) proposed a distribution function sure independence screening (DF-SIS) approach, which uses a measure that tests the independence of two variables. Zhou et al. (2019) proposed a robust correlation measure to screen features containing extreme values. Feature screening has been demonstrated to be an attractive strategy in many applications.

Most existing methods are developed for the situation where the number of features p is large but the sample size N is moderate. However, in modern scientific research, it is increasingly common that data analysts have to deal with big data sets, where p and N are both huge. For example, in modern genome-wide genetic studies, millions of SNPs are genotyped on hundreds of thousands of participants. In Internet studies, an antivirus software may scan tens of thousands of keywords in millions of URLs per minute. When faced with large-p-large-N data, the direct implementation of classic screening methods can be numerically inefficient due to storage bottlenecks and algorithmic infeasibility. For example, for a data set with N = p = 10,000, the well-known DC-SIS needs about 60 hours to conduct a full screening on a computer with a 3.2 GHz CPU and 32 GB of memory. Developing computationally convenient methods for big-data screening is therefore desirable in practice.

When a data set is too large to be processed on a single computer, it is natural to consider a divide-and-conquer strategy, in which a large problem is first divided into smaller, manageable subproblems and the final output is obtained by combining the corresponding sub-outputs.
In this spirit, many machine learning and statistical methods have been rebuilt for processing big data (e.g., Chen and Xie, 2014; Xu et al., 2016; Jordan et al., 2019; Shi et al., 2018; Banerjee et al., 2019). These inspiring works motivate us to explore the feasibility of using this promising strategy for feature screening with big data.

In this paper, we propose a distributed feature screening framework based on aggregated correlation measures, and refer to it as aggregated correlation screening (ACS). In ACS, we express a correlation measure as a function of several component parameters, each of which can be distributively estimated using a natural U-statistic from data segments. With the unbiased component estimates combined, we obtain an aggregated correlation estimate, which can be readily used for feature screening. In the proposed ACS framework, a massive data set is split into and processed as m manageable segments, which can be stored on multiple computers, and the corresponding local estimations can be done by parallel computing. It thus provides a computationally attractive route for feature screening with large-p-large-N data. The framework is also suitable for setups where data are naturally stored in different locations (e.g., medical data at the hospital level). The U-statistic estimation of the component parameters serves as an effective and convenient debiasing technique, which ensures the high accuracy of the aggregated correlation estimator and the reliability of the corresponding screening procedure. Under mild conditions, we show that the aggregated correlation estimator is as efficient as the classic centralized estimator in terms of the probability convergence bound and the mean squared error (MSE) rate. This full efficiency is insensitive to the choice of m, which may be specified by the problem itself or determined by the users.
For a wide range of correlation measures, we further show that ACS enjoys the sure screening property without the need to specify a parametric model (i.e., it is model-free).

The proposed ACS has its roots in componentwise estimation. In the literature, this idea has been used to distributively recover a centralized estimator defined by smooth estimating equations that are separable across data segments (e.g., Chen et al., 2006; Lin and Xi, 2011). Unfortunately, these works are not directly applicable to correlation-based screening, as the centralized estimators of many commonly used correlation measures are not defined by estimating equations and are non-separable across data segments (e.g., SIRS and DC). The proposed ACS follows from the natural composition of a centralized correlation estimator; it does not seek to fully recover the centralized estimator but leads to an effective and computationally affordable alternative to it. Our results provide theoretical support for using this natural strategy for distributed feature screening. We demonstrate the computational advantages and promising screening accuracy of ACS in a series of numerical examples.

The rest of this paper is organized as follows. In Section 2, we formulate the research problem and introduce the ACS framework. In Section 3, we investigate the theoretical properties of ACS. In Section 4, we demonstrate the promising performance of ACS by Monte Carlo simulations and a real data example. Concluding remarks are given in Section 5, and the proofs of the theorems are provided in the Appendix.

2. Methodology

2.1. Feature screening with big data

Let D = {(Y_i, X_i)}_{i=1}^N be N independently and identically distributed (i.i.d.) copies of {Y, X}, where Y is a response variable with support Φ_y and X = (X_1, ..., X_p)^T is a p-dimensional covariate vector. We are interested in the situation where p and N are both large.
When a data set is massive and high-dimensional, it is often reasonable to assume that only a handful of covariates (features) are relevant to the response. Let F(y|X) be the conditional distribution function of Y given X. A feature X_j is considered relevant if F(y|X) functionally depends on X_j for some y ∈ Φ_y. We use M to denote the index set of the relevant features and define M^c = {1, ..., p} \ M. The goal of feature screening is to remove most irrelevant features X_j with j ∈ M^c before an elaborate analysis. One commonly used strategy is to first estimate a marginal correlation measure between the response and each feature, and then remove the features with weak correlations. Specifically, let ω_j ≥ 0 be a measure of the correlation strength between Y and X_j, and let ω̂_j be a centralized estimate of ω_j based on D. With a pre-specified threshold γ > 0, one may retain the features in

M̂ = {j : ω̂_j ≥ γ, j = 1, ..., p},

and remove the others. This classic approach is effective when the sample size N is moderate. However, when N and p are both huge, computing {ω̂_j}_{j=1}^p based on the full data set D can be numerically costly.

2.2. Aggregated correlation screening

Motivated by recent work in distributed learning, we adopt the idea of divide-and-conquer to tackle big-data feature screening. Without loss of generality, suppose that the original full data set D is equally partitioned into m manageable segments {D_l}_{l=1}^m, each of which contains n = N/m observations. Depending on the computational environment, these segments can be distributively stored on and processed by multiple computers, or they can be sequentially processed by a single computer. Let ω̂_{l,j} be the local correlation estimate between X_j and Y based on data segment D_l. One simple screening strategy is to compute the averaged correlation estimate

ω̄_j = (1/m) Σ_{l=1}^m ω̂_{l,j}   (1)

for 1 ≤ j ≤ p and remove the features with small ω̄_j values.
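To make the averaging in (1) concrete, below is a minimal pure-Python sketch of this simple-averaging strategy with the Pearson correlation as the local estimate. The helper names `pearson` and `sas_estimate` are ours, not from the paper, and a practical version would compute the m local estimates in parallel.

```python
from statistics import fmean, pstdev

def pearson(ys, xs):
    """Sample Pearson correlation computed from moments."""
    my, mx = fmean(ys), fmean(xs)
    cov = fmean((y - my) * (x - mx) for y, x in zip(ys, xs))
    return cov / (pstdev(ys) * pstdev(xs))

def sas_estimate(ys, xs, m):
    """Equation (1): split the N observations into m equal segments,
    compute a local correlation estimate (here: Pearson) on each
    segment, and average the m local estimates."""
    n = len(ys) // m
    return fmean(pearson(ys[l * n:(l + 1) * n], xs[l * n:(l + 1) * n])
                 for l in range(m))
```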
This approach is referred to as simple average screening (SAS), which is perhaps the most straightforward way to screen distributively. To speed up computation, a relatively large number of small segments is often preferred. However, when m is large, ω̄_j may differ substantially from the centralized estimator ω̂_j due to the accumulated bias inherited from the local estimators. As a result, its screening performance is often unstable in practice, as revealed in our numerical studies.

One way to improve SAS is to debias the ω̂_{l,j} before averaging them. Unfortunately, this is not straightforward for many commonly used correlation measures, which are nonlinear. Our idea is instead to express a correlation measure ω_j as a function of several component parameters and to conduct distributed unbiased estimation of those component parameters. In doing so, we carry out componentwise debiasing of the original ω̂_{l,j} in an effective but much easier way. With the unbiased component estimates naturally combined, we obtain an aggregated correlation estimate that can be readily used for feature screening.

To be more specific, suppose that a correlation measure between Y and X_j can be expressed as

ω_j = g(θ_{j,1}, ..., θ_{j,s}),   (2)

where g is a pre-specified function and θ_{j,1}, ..., θ_{j,s} are s component parameters from a compact space. For a given correlation measure, expression (2) may not be unique. We choose the form of g such that the corresponding component parameters can be conveniently estimated without bias. For ease of presentation, let θ̂_{j,h}(Z_{i_1 j}, ..., Z_{i_{k_h} j}) denote a basic unbiased estimator (kernel) of θ_{j,h} based on the minimal number k_h of i.i.d. copies of Z_j = {Y, X_j}, for h = 1, ..., s. Without loss of generality, we assume that θ̂_{j,h} is symmetric, so that its value is invariant to permutations of {Z_{i_1 j}, ..., Z_{i_{k_h} j}}.
Suppose that D is too big to be processed on a single computer and is equally partitioned into m segments {D_l}_{l=1}^m. We use S_l to denote the index set of the {Y, X} copies in D_l. With a pre-specified correlation measure ω_j, we propose to distributively screen features in the following framework.

1. Express ω_j in the form of (2) with an appropriate g.

2. On each data segment, estimate θ_{j,h} by a local U-statistic

U^l_{j,h} = (n choose k_h)^{-1} Σ_{{i_1,...,i_{k_h}} ⊂ S_l} θ̂_{j,h}(Z_{i_1 j}, ..., Z_{i_{k_h} j}),   (3)

where the summation is over all {Z_{i_1 j}, ..., Z_{i_{k_h} j}} combinations chosen from D_l.

3. Compute an aggregated correlation estimate between Y and X_j by

ω̃_j = g(Ū_{j,1}, ..., Ū_{j,s}),   (4)

where Ū_{j,h} = (1/m) Σ_{l=1}^m U^l_{j,h} for h = 1, ..., s.

4. With a user-specified threshold γ > 0, retain the features in

M̃ = {j : ω̃_j ≥ γ, j = 1, ..., p},

and remove the others.

We name the proposed screening framework aggregated correlation screening (ACS). Step 2 only requires information stored on the data segments, and thus it can be carried out by parallel or sequential processing. This makes ACS computationally suitable for the large-p-large-N situation. The use of U-statistics in step 2 further reduces the variances of the local unbiased estimators of the θ_{j,h} and enhances the stability of the method. The computational complexity of (3) is O(n^{k_h}), which can be conveniently handled with an appropriate m such that the local sample size n = N/m is moderate. Compared with SAS, ACS screens features based on a nonlinear aggregation of unbiased component estimates. This substantially reduces the bias of the final correlation estimate at a small cost in variance. The overall accuracy of the estimate of ω_j is therefore improved, which in turn leads to a more reliable screening result in the distributed setup.

2.3. Examples and extension

2.3.1. Examples

The proposed ACS framework is suitable for many commonly used correlation measures.
In this subsection, we provide a few concrete examples of using ACS. As a reference, we also list the corresponding expressions of ω̂_{l,j} used in SAS in the Appendix for interested readers. For convenience of presentation, let X_{ij} denote the jth entry of X_i defined in Section 2.1 for j = 1, ..., p.

1. Pearson correlation. The Pearson correlation measures the strength of the linear relationship between Y and X_j. Fan and Lv (2008) used it as a feature screening index for the linear model. When the Pearson correlation is used in ACS, ω_j can be expressed in the form of (2) as

ω_j = g(θ_{j,1}, ..., θ_{j,5}) = [E(X_j Y) − E(X_j)E(Y)] / {[E(X_j^2) − E^2(X_j)] [E(Y^2) − E^2(Y)]}^{1/2},

where θ_{j,1} = E(X_j Y), θ_{j,2} = E(X_j), θ_{j,3} = E(Y), θ_{j,4} = E(X_j^2), and θ_{j,5} = E(Y^2). In step 2 of ACS, U^l_{j,h} can be computed by (3) with k_h = 1 and

θ̂_{j,1} = X_{i_1 j} Y_{i_1}, θ̂_{j,2} = X_{i_1 j}, θ̂_{j,3} = Y_{i_1}, θ̂_{j,4} = X_{i_1 j}^2, θ̂_{j,5} = Y_{i_1}^2,

for i_1 ∈ S_l. The resulting Ū_{j,h} in (4) coincide with the classic moment estimates. When the data set is properly standardized, the expression of ω_j can be further simplified.

2. Kendall τ rank correlation. The Kendall τ rank correlation measures the ordinal association between Y and X_j. It was used in Li et al. (2012a) for feature screening in linear and transformation models. When this correlation measure is used in ACS, ω_j can be expressed as

ω_j = g(θ_{j,1}) = |E{I(X_j < X'_j) I(Y < Y')} − 1/4|,

where {X'_j, Y'} is an independent copy of {X_j, Y} and θ_{j,1} = E{I(X_j < X'_j) I(Y < Y')}. In step 2 of ACS, U^l_{j,1} can be computed by (3) with k_1 = 2 and

θ̂_{j,1} = (1/2) Σ_{(i_1,i_2)} I(X_{i_1 j} < X_{i_2 j}) I(Y_{i_1} < Y_{i_2}),

where {i_1, i_2} ⊂ S_l and the summation is over all permutations of (i_1, i_2).

3. SIRS correlation. The SIRS correlation can be used to detect nonlinear relationships between Y and X_j. It was proposed by Zhu et al. (2011) for feature screening in parametric and semiparametric models.
When this correlation is used in ACS, ω_j can be expressed as

ω_j = θ_{j,1} = E_{Y'}{E^2(X_j I(Y < Y'))},

where Y' is an independent copy of Y and the feature X_j is assumed to have zero mean and unit variance. In step 2 of ACS, U^l_{j,1} can be computed by (3) with k_1 = 3 and

θ̂_{j,1} = (1/6) Σ_{(i_1,i_2,i_3)} X_{i_1 j} X_{i_2 j} I(Y_{i_1} < Y_{i_3}) I(Y_{i_2} < Y_{i_3}),

where {i_1, i_2, i_3} ⊂ S_l and the summation is over all permutations of (i_1, i_2, i_3).

4. Distance correlation. The distance correlation (DC) can be used to measure the dependence between Y and X_j. Li et al. (2012b) used it as a model-free screening index. When DC is used in ACS, ω_j can be expressed as

ω_j = g(θ_{j,1}, ..., θ_{j,8}) = (θ_{j,1} + θ_{j,2} θ_{j,3} − 2θ_{j,4}) / {(θ_{j,5} + θ_{j,2}^2 − 2θ_{j,6})(θ_{j,7} + θ_{j,3}^2 − 2θ_{j,8})}^{1/2},

where

θ_{j,1} = E{|Y − Y'| |X_j − X'_j|}, θ_{j,2} = E|Y − Y'|, θ_{j,3} = E|X_j − X'_j|,
θ_{j,4} = E{E(|Y − Y'| | Y) E(|X_j − X'_j| | X_j)}, θ_{j,5} = E|Y − Y'|^2,
θ_{j,6} = E{E^2(|Y − Y'| | Y)}, θ_{j,7} = E|X_j − X'_j|^2, θ_{j,8} = E{E^2(|X_j − X'_j| | X_j)},

and (Y', X'_j) is an independent copy of (Y, X_j). In step 2 of ACS, U^l_{j,1} and U^l_{j,4} can be computed by (3) with k_1 = 2, k_4 = 3, and

θ̂_{j,1} = (1/2) Σ_{(i_1,i_2)} |Y_{i_1} − Y_{i_2}| |X_{i_1 j} − X_{i_2 j}|,   (5)

θ̂_{j,4} = (1/6) Σ_{(i_1,i_2,i_3)} |Y_{i_1} − Y_{i_3}| |X_{i_2 j} − X_{i_3 j}|.   (6)

The expressions of θ̂_{j,h} for h = 2, 3, 5, 7 are similar to (5); those for h = 6, 8 are similar to (6).

Remark: When the Pearson correlation is used, the aggregated estimator ω̃_j in (4) coincides with the centralized estimator ω̂_j, and the proposed ACS leads to the same screening result as the classic SIS. For the correlations in Examples 2-4, the computational cost of ω̃_j is substantially lower than that of ω̂_j. For DC, when the data segments are processed in parallel, ACS drastically reduces the cost of centralized screening from O(pN^3) to O(pN^3/m^3); this is also cheaper than feature-splitting-based screening, whose cost is O(pN^3/m) with each local computer evaluating p/m features based on the ω̂_j.
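The Kendall τ and SIRS examples above map directly onto steps 2-3: an order-2 (resp. order-3) symmetric kernel is averaged within each segment, and the segment values are then averaged. A minimal pure-Python sketch, with our own function names; the brute-force O(n^2) and O(n^3) per-segment loops are for illustration only, and X_j is assumed standardized for SIRS as stated above:

```python
from itertools import combinations, permutations
from math import comb

def kendall_kernel(z1, z2):
    """Symmetric order-2 kernel for theta_{j,1} = E{I(X_j < X'_j) I(Y < Y')};
    each z is an observed pair (y, x_j)."""
    (y1, x1), (y2, x2) = z1, z2
    return 0.5 * ((x1 < x2) * (y1 < y2) + (x2 < x1) * (y2 < y1))

def sirs_kernel(z1, z2, z3):
    """Symmetric order-3 kernel for E_{Y'}{E^2(X_j I(Y < Y'))},
    with X_j standardized to zero mean and unit variance."""
    total = 0.0
    for (ya, xa), (yb, xb), (yc, _xc) in permutations((z1, z2, z3)):
        total += xa * xb * (ya < yc) * (yb < yc)
    return total / 6.0

def u_bar(kernel, k, pairs, m):
    """Segment-wise U-statistics (3), averaged across segments (step 3)."""
    n = len(pairs) // m
    total = 0.0
    for l in range(m):
        seg = pairs[l * n:(l + 1) * n]
        total += sum(kernel(*c) for c in combinations(seg, k)) / comb(n, k)
    return total / m

def kendall_acs(pairs, m):
    """Aggregated Kendall tau correlation |theta_{j,1} - 1/4|."""
    return abs(u_bar(kendall_kernel, 2, pairs, m) - 0.25)

def sirs_acs(pairs, m):
    """Aggregated SIRS correlation omega_j = theta_{j,1}."""
    return u_bar(sirs_kernel, 3, pairs, m)
```

A serious implementation would replace the Kendall loop with an O(n log n) concordant-pair count and vectorize the SIRS kernel.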
The idea of componentwise debiasing in ACS provides a viable and effective route to estimating ω_j in a distributed manner. For commonly used correlation measures, form (2) can be naturally constructed. The simplicity and compatibility of ACS make it a user-friendly approach in practice.

2.3.2. Extension

When the data partition is done manually, one may further improve the stability of ACS with multiple partitions. Specifically, suppose that we repeat the random data partition R times. For each partition, we conduct unbiased estimation of the component parameters based on (3). We then carry out (4) with Ū_{j,h} replaced by

(1/R) Σ_{r=1}^R Ū^r_{j,h},

where Ū^r_{j,h} denotes the averaged U-statistic from the rth partition. By averaging over the R partitions, the variability of ω̃_j is further reduced; this leads to a reinforced ACS that is more reliable for feature screening.

When a correlation measure with large k_h is used and m is restricted to be small, one may further split a data segment D_l into smaller, manageable sub-segments. One can then conduct the U-statistic estimation of the component parameters sequentially on the sub-segments and use the averaged quantity in place of U^l_{j,h} in step 2 of ACS. If needed, a reinforced version of this strategy can also be used with multiple sub-partitions.

3. Theoretical Analysis

We now provide theoretical justification for ACS. Clearly, the screening performance of ACS relies on the accuracy of the aggregated correlation estimator ω̃_j in (4). We show that ω̃_j is an effective and efficient estimator of ω_j; this serves as the theoretical foundation of ACS. Our investigation is based on the following technical conditions.

C1. There exist two constants κ_0 and D_0 such that, for any 0 ≤ κ ≤ κ_0, E{exp(κ|θ̂_{j,h}|)} < D_0 for all h = 1, ..., s and j = 1, ..., p.
C2. In (2), g(·) is formed by finitely many operations of addition, subtraction, multiplication, division, absolute value, and square root, where the division and square root are taken over quantities uniformly bounded away from zero.

C3. There exist constants c > 0 and 0 < τ < 1/2 such that min_{j∈M} ω_j ≥ 2cN^{−τ}.

Condition C1 requires that |θ̂_{j,h}| has a regular distribution, in the sense that its moment generating function exists on [0, κ_0]. This is a mild condition for many correlation measures. For example, when the Kendall τ correlation is used with ACS, θ̂_{j,h} is bounded and thus C1 is automatically satisfied; when SIRS is used with ACS, C1 is implied by E{exp(ξX_j^2)} < D'_0 for some ξ > 0, D'_0 > 0, and all 1 ≤ j ≤ p. Condition C2 covers a variety of commonly used correlation measures, including the ones discussed in Section 2.3.1. We conjecture that ACS would remain effective with a more complicated g(·), but the corresponding theoretical justification is likely to be lengthy. Here, we aim to provide some theoretical understanding of the proposed screening framework and do not attempt to make this condition the weakest possible. Condition C3 requires that the marginal correlation between any relevant feature and the response not be too small. This is a natural feature-identifiability requirement, which has been widely used in the literature; see, for example, Condition 3 of Fan and Lv (2008), Condition 2 of Li et al. (2012b), and Condition 6 of Wu and Yin (2015).

Under these conditions, we first show that the averaged local U-statistics Ū_{j,h} enjoy the properties stated in the following proposition.

Proposition 1. Under Condition C1, we have

max_{1≤j≤p, 1≤h≤s} Var(Ū_{j,h}) = O(1/N) + O(m/N^2) + ... + O(m^{k_h−1}/N^{k_h}),   (7)

which is increasing in m. By (7), the largest Var(Ū_{j,h}) is of order O(N^{−1}) for both fixed and diverging m.
While a larger m leads to an increased variance of Ū_{j,h}, this information loss is insignificant in the sense that the asymptotic order of (7) remains unchanged. Since the Ū_{j,h} are unbiased, this high precision implies uniform second-moment consistency, which echoes Theorem 2 of Lin and Xi (2010). Proposition 1 indicates that the component parameters can be effectively estimated by aggregating the corresponding local U-statistics from the data segments. With the properties of Ū_{j,h}, we establish the effectiveness of ω̃_j in the following two theorems.

Theorem 2. Suppose that Conditions C1-C3 are satisfied and k = max{k_h, h = 1, ..., s} ≤ n. Then there exists a constant η > 0 such that

P( max_{1≤j≤p} |ω̃_j − ω_j| ≥ cN^{−τ} ) ≤ ηp(1 − N^{−2τ}/η)^{m⌊n/k⌋},

where ⌊n/k⌋ denotes the largest integer no larger than n/k.

Note that k is a constant depending on the choice of ω_j and g(·); thus, m⌊n/k⌋ is of the same order as N. Theorem 2 implies that the aggregated correlation estimators are uniformly consistent even when p grows exponentially in N^α for some 0 < α < 1. In the literature, it has been shown that the centralized estimator achieves the convergence bound |ω̂_j − ω_j| = O_p(N^{−τ}) for 0 < τ < 1/2 (Li et al., 2012a,b; Cui et al., 2015; Wu and Yin, 2015). Theorem 2 thus indicates that ω̃_j is as efficient as the centralized estimator ω̂_j in terms of the probability convergence bound.

For correlation measures admitting a Lipschitz continuous g(·), we further justify the corresponding ω̃_j in terms of the MSE. Specifically, recall that we express a correlation measure as ω_j = g(θ_j) with θ_j = (θ_{j,1}, ..., θ_{j,s}) from a compact space Θ ⊂ R^s. We say g(·) is Lipschitz continuous in θ_j if the following condition is satisfied.

C2′. There exists a positive constant L such that |g(θ_j) − g(θ'_j)| ≤ L‖θ_j − θ'_j‖ for any θ_j, θ'_j ∈ Θ, where ‖·‖ denotes the Euclidean norm.

Condition C2′ is naturally satisfied by many correlation measures.
For example, as shown in Section 2.3.1, we have g(θ_{j,1}) = |θ_{j,1} − 1/4| for the Kendall τ correlation and g(θ_{j,1}) = θ_{j,1} for SIRS, both of which are Lipschitz continuous. When component parameters appear in the denominator of g(·), Condition C2′ is often satisfied under mild requirements on (Y, X) that keep the denominator uniformly bounded away from zero. For example, when the Pearson correlation is used, Condition C2′ is satisfied when Var(Y) > 0 and Var(X_j) > c' for j = 1, ..., p with some c' > 0; when DC is used, Condition C2′ is satisfied when the distance variances of X_j and Y defined in Li et al. (2012b) are all bounded away from zero. In fact, when the parameter space Θ is compact, Condition C2′ is implied by Condition C2; the proof is similar to that of Lemma 6 in the Appendix. Based on this condition, we derive a uniform MSE bound for ω̃_j in the following theorem.

Theorem 3. Under Conditions C1 and C2′, if k ≤ n and (Ū_{j,1}, ..., Ū_{j,s}) ∈ Θ, we have

max_{1≤j≤p} MSE(ω̃_j) = E(ω̃_j − ω_j)^2 = O(1/N).

Theorem 3 shows that, when g(·) is smooth enough, the aggregated correlation estimator ω̃_j matches the optimal MSE rate achievable by a centralized estimator with access to the entire data set of size N. This result also indicates that aggregating U-statistic component estimates is an effective way to reduce the bias of the final correlation estimator. Different from the existing debiasing techniques developed for distributed M-estimation (e.g., Zhang et al., 2012; Battey et al., 2018), the idea of componentwise debiasing is built upon the natural composition of a centralized correlation estimator; instead of estimating the bias of each ω̂_{l,j} in (1) based on D_l, it addresses the debiasing task by improving the component estimators, for which the distributed U-statistic can be conveniently used.
Benefiting from the high precision of Ū_{j,h}, ω̃_j enjoys full estimation efficiency that is insensitive to the choice of m; this in turn leads to reliable feature screening. Admittedly, ω̃_j can still be biased due to the nonlinear aggregation in (4). For commonly used correlation measures, however, the function g is smooth and the number of component parameters s is finite. Consequently, aggregating the Ū_{j,h} through g is unlikely to introduce a significant bias into ω̃_j. In fact, Theorem 3 immediately implies that Bias(ω̃_j) = O(1/√N) regardless of the choice of m. In comparison, when SAS is used in our setup, we have Bias(ω̄_j) = Bias(ω̂_{l,j}), whose scale is mainly determined by the amount of data n = N/m stored in segment D_l. For example, when SIRS is used without local standardization, it can be shown that Bias(ω̂_{l,j}) = O(1/n) = O(m/N) (Zhu et al., 2011). When m is large, this bias can severely affect the accuracy of ω̄_j; in particular, when m/√N → ∞, MSE(ω̄_j) has a rate slower than O(1/N).

Next, we justify the proposed ACS framework with the following theorem.

Theorem 4. Under Conditions C1-C3, if k ≤ n and γ = cN^{−τ}, then there exists a constant η > 0 such that

P{M ⊂ M̃} ≥ 1 − ηd(1 − N^{−2τ}/η)^{m⌊n/k⌋},

where d is the cardinality of M.

We show in the Appendix that, when d = O(N), the probability bound in Theorem 4 goes to one as N → ∞. Thus, the proposed ACS enjoys the sure screening property in the sense of Fan and Lv (2008), even when the number of relevant features d is diverging. That is, when N is large, ACS removes most irrelevant features and retains all relevant features with overwhelming probability. This is a desired property for a good feature screening method. Note that the requirement n = N/m ≥ k is very mild in general; for many correlation measures, it can be satisfied with a liberal choice of m = O(N), which makes ACS a flexible and reliable approach.
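Step 4 of the framework and the sure screening event of Theorem 4 are straightforward to express in code. A minimal sketch, with our own names; `omega` holds the aggregated estimates ω̃_j indexed by feature:

```python
def screen(omega, gamma):
    """Step 4 of ACS: retain the features whose aggregated correlation
    estimate reaches the threshold gamma (the retained set M-tilde)."""
    return {j for j, w in enumerate(omega) if w >= gamma}

def sure_screening(relevant, omega, gamma):
    """The event bounded in Theorem 4: every relevant feature survives
    screening, i.e. M is a subset of M-tilde."""
    return relevant <= screen(omega, gamma)
```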
Although the asymptotic error bound of ω̃_j is largely unaffected by m, our empirical experience does show that a small m may help to improve the practical screening accuracy of ACS. However, an overly small m often leads to a high computational cost. In applications, one good strategy is to choose the smallest m for ACS within the computational budget.

4. Numerical Studies

We assess the finite-sample performance of ACS via simulations and a real data example. In particular, we compare ACS with the naive SAS in terms of screening accuracy and stability. All numerical experiments are conducted in MATLAB on Windows computers with 3.2 GHz CPUs and 32 GB of memory.

4.1. Simulations

4.1.1. Example 1

Clearly, effective screening relies on accurate estimates of the correlation strengths ω_j. Our first experiment checks whether the proposed aggregated correlation (AC) measure ω̃_j in (4) is an effective estimator of ω_j. To this end, we generate N = 2700 independent copies of (Y, X), where Y and X are two independent random variables following N(0, 1). Due to independence, the Kendall τ correlation, SIRS, and DC between Y and X are all zero. We randomly split the data into m = 45, 90, or 180 equal-sized segments and use the ω̃_j specified in Section 2.3.1 (with j = 1) to estimate the three aforementioned correlations between Y and X. We repeat the procedure T = 500 times and measure the accuracy of ω̃_j by the root-mean-squared error (RMSE). Specifically, let ω̃_j(t) denote the value of ω̃_j in the tth repetition. Since the true correlations are zero, the RMSE is computed as

RMSE(ω̃_j) = [ (1/T) Σ_{t=1}^T (ω̃_j(t))^2 ]^{1/2}.

For comparison, we report the corresponding RMSEs of the simple averaging (SA) estimators ω̄_j defined in (1) under the same settings of m. Moreover, we check the performance of the reinforced ω̃_j (r AC) using the multiple-partition strategy with R = 3, as discussed in Section 2.3.2. As a benchmark, we also report the RMSEs of the centralized estimators (m = 1).
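The Monte Carlo design above can be sketched as a small harness: draw T independent data sets under independence (so the true correlation is zero), apply a distributed estimator, and report the RMSE around zero. The function name is ours, and `estimator` stands for any map from (data pairs, m) to an estimate, such as a SAS or ACS routine; the defaults mirror the paper's setup but should be shrunk for quick runs.

```python
import random
from statistics import fmean

def rmse_under_independence(estimator, T=500, N=2700, m=45, seed=1):
    """Monte Carlo RMSE of a distributed correlation estimator in the
    Example 1 design: X and Y are independent N(0, 1), so the true
    correlation is 0 and RMSE = sqrt(mean_t estimate(t)^2)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(T):
        pairs = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(N)]
        estimates.append(estimator(pairs, m))
    return fmean(w * w for w in estimates) ** 0.5
```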
The results are summarized in Figure 1, with the corresponding computational times (in seconds) given in Table 1.

Figure 1: The RMSE of distributed correlation estimators with m = (1, 45, 90, 180), where SA, AC, and r AC stand for ω̄_j, ω̃_j, and the reinforced ω̃_j, respectively.

For all three tested correlations, we see that both ω̃_j and ω̄_j work well when m is small. As m increases, ω̄_j becomes less accurate. As discussed, this is mainly due to the non-negligible biases of the segmental estimates. In comparison, ω̃_j conducts componentwise debiasing and achieves high estimation accuracy over a wide range of m. Compared with the centralized estimators (the m = 1 case), the distributed estimators ω̃_j and ω̄_j are computationally much more attractive, in particular when m is large. As expected, the reinforced aggregated estimators further improve the estimation accuracy of ω̃_j at a higher computational cost.

4.1.2. Example 2

The promising performance of ω̃_j encourages us to further check whether the associated screening procedure ACS also works well. To this end, we generate N independent copies of X = (X_1, ..., X_p) from a multivariate normal distribution with zero mean. The corresponding response Y is generated from the following models:

(a) Y = β_1 X_1 + β_2 X_2 + ... + β_8 X_8 + ε,
(b) Y = β_1 X_1 + β_2 X_4 + β_3 X_7 + β_4 X_{10} + ε,
(c) Y = exp(β_1 X_1 + β_2 X_4 + β_3 X_7 + β_4 X_{10} + ε),
(d) Y = β_1 X_1 + β_2 X_4 + exp(|β_3| X_7 + |β_4| X_{10}) + ε,
(e) Y = β_1 X_1 + β_2 X_4^2 + β_3 I(X_7 > 0) + β_4 |X_{10}| + ε,
(f) Y = 2β_1 X_1 X_2 + 2β_2 I(X_{12} > 0) + 3β_3 X_{22} + ε,

Table 1: Mean computational time of distributed correlation estimators (in seconds) with fixed sample size N = 2700 and a varied number of data segments; the m = 1 case corresponds to the centralized estimates and the m > 1 cases correspond to the distributed estimates.
Correlation  Estimator  m = 1     m = 45    m = 90    m = 180
Kendall τ    SA         3.4×10⁻¹  1.4×10⁻⁴  4.3×10⁻⁵  2.2×10⁻⁵
             AC         3.4×10⁻¹  1.5×10⁻⁴  4.7×10⁻⁵  2.6×10⁻⁵
             r-AC       --        3.7×10⁻⁴  1.2×10⁻⁴  5.5×10⁻⁵
SIRS         SA         1.7×10⁻¹  1.1×10⁻⁴  5.0×10⁻⁵  4.0×10⁻⁵
             AC         1.7×10⁻¹  7.5×10⁻⁵  2.5×10⁻⁵  1.4×10⁻⁵
             r-AC       --        2.0×10⁻⁴  7.0×10⁻⁵  3.9×10⁻⁵
DC           SA         8.2×10⁻¹  1.4×10⁻⁴  4.7×10⁻⁵  3.3×10⁻⁵
             AC         7.8×10⁻¹  1.3×10⁻⁴  4.0×10⁻⁵  2.7×10⁻⁵
             r-AC       --        3.6×10⁻⁴  1.2×10⁻⁴  9.3×10⁻⁵

where ε ~ N(0, 1) is a noise term. Models (a) and (b) are two linear cases with different model sparsity and covariance structures. Models (c) and (d) are a transformation model and a multiple-index model, adopted from Li et al. (2012a) and Zhu et al. (2011) respectively. Models (e) and (f) are an additive model and an interaction model, both of which were discussed in Li et al. (2012b). In Model (a), cov(X) is set to be an identity matrix, while in Models (b)-(f) we set cov(Xj, Xr) = 0.5^{|j−r|} for j, r ∈ {1, . . . , p}, so that the features have an autoregressive correlation structure. In Models (a)-(f), the values of the model coefficients are generated by (−1)^W (2 + |V|), where W ~ Bernoulli(0.6) and V ~ N(0, 1). We apply the proposed ACS to these simulated data sets for feature screening. In each case, we split the data into m segments and assess the performance of ACS based on the Pearson, Kendall τ, SIRS, and DC correlations as discussed in Section 2.3.1. For each correlation scenario, we set the corresponding screening threshold by

γ = ρ · min_{j∈M} ω̂j,   (8)

where ω̂j is the centralized estimator of that correlation and ρ = 0.8, 0.6 is a scale parameter. The choice of γ in (8) guarantees that all relevant features will be retained by the classic screening method based on ω̂j; it purely serves to evaluate the proposed distributed method under a setup in which the classic method is likely to work well. In practice, a proper γ is usually determined by users based on their research goals as well as prior information about their data.
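For concreteness, the data-generating scheme above can be sketched as follows (a NumPy sketch of Model (b), not the paper's code; the AR(1) recursion reproduces cov(Xj, Xr) = 0.5^{|j−r|} exactly, and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 1200, 1500

# AR(1)-type features: X_1 = Z_1, X_j = 0.5 X_{j-1} + sqrt(1 - 0.25) Z_j,
# which yields cov(X_j, X_r) = 0.5**|j - r| with unit variances
Z = rng.normal(size=(N, p))
X = np.empty((N, p))
X[:, 0] = Z[:, 0]
for j in range(1, p):
    X[:, j] = 0.5 * X[:, j - 1] + np.sqrt(1.0 - 0.25) * Z[:, j]

# coefficients (-1)^W (2 + |V|), W ~ Bernoulli(0.6), V ~ N(0, 1)
W = rng.binomial(1, 0.6, size=4)
V = rng.normal(size=4)
beta = (-1.0) ** W * (2.0 + np.abs(V))

relevant = [0, 3, 6, 9]                      # features X1, X4, X7, X10
Y = X[:, relevant] @ beta + rng.normal(size=N)
```

Each |βj| lies in [2, ∞) with random sign, so every relevant feature carries a nontrivial signal.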
We will test ACS with a data-driven choice of γ in our next example. We evaluate the performance of ACS in terms of the successful screening rate (SSR), the screened model size (MS), the positive selection rate (PSR), and the false discovery rate (FDR). Specifically, let M̂(t) denote the index set of the features retained after screening in the t-th repetition. The aforementioned four indices are calculated as

SSR = T⁻¹ Σ_{t=1}^{T} I{M ⊆ M̂(t)},    MS = med_t ‖M̂(t)‖₀,
PSR = med_t ( ‖M ∩ M̂(t)‖₀ / ‖M‖₀ ),    FDR = med_t ( ‖M̂(t) \ M‖₀ / ‖M̂(t)‖₀ ),

where I{·} is an indicator function, med denotes the median of a series of values, and ‖·‖₀ denotes the number of elements in a set. For comparison, we report as well the screening outcomes of SAS, which is based on the simple averaging estimators (1). Also, we compare ACS with a subsampling-voting screening (SVS) method. In SVS, parallel screening is first conducted on m random subsamples of D, each of which is of size n = N/m; a feature is then retained if it is selected by more than 50% of the subsamples. To check the improving strategy in Section 2.3.2, we further run the reinforced ACS (r-ACS) with R = 3 on the data generated from Model (a). In our setup, the classic screening method based on ω̂j would have SSR = 1 and PSR = 1 in all cases and is likely to have small MS and FDR, as N is large. Yet, as will be revealed, its computational cost can be unaffordable in many cases. For computational convenience, we exclude the classic screening from the comparison and directly use the optimal values of the aforementioned indices as a benchmark. We summarize the simulation results in Tables 2-4 based on T = 100 repetitions. For Models (c)-(f), we exhibit only selected results due to the page limit. In the tables, Time_n and Time_N report the averaged computational time (in seconds) for a distributed screening and for the corresponding classic screening respectively, the latter based on a few pilot runs. The two values in the same column correspond to the two setups of ρ in (8). Std(MS) reports the sample standard deviation of the ‖M̂(t)‖₀'s, which measures the screening precision.

Table 2: Simulation results for Model (a) with N = 2400, p = 10000, ‖M‖₀ = 8, m = (40, 60). The two values a, b in the same column correspond to the ρ = 0.8, 0.6 cases.

m   Correlation  Method  SSR       MS            Std(MS)     PSR       FDR       Time_n  Time_N
40  Pearson      SAS     1.0, 1.0  8, 9          5, 531      1.0, 1.0  0.0, .11  0.012   0.176
                 SVS     .97, 1.0  8, 13         6, 256      1.0, 1.0  0.0, .36  0.012
                 ACS     1.0, 1.0  8, 8          0, 0        1.0, 1.0  0.0, 0.0  0.012
    Kendall τ    SAS     1.0, 1.0  8, 18         64, 935     1.0, 1.0  0.0, .56  0.678   1504
                 SVS     .97, 1.0  8, 28         34, 507     1.0, 1.0  0.0, .71  0.678
                 ACS     1.0, 1.0  8, 8          0, 0        1.0, 1.0  0.0, 0.0  0.678
                 r-ACS   1.0, 1.0  8, 8          0, 0        1.0, 1.0  0.0, 0.0  2.030
    SIRS         SAS     1.0, 1.0  10, 1042      1536, 3388  1.0, 1.0  .20, .99  0.013   1.672
                 SVS     .99, 1.0  8, 34         781, 1540   1.0, 1.0  0.0, .76  0.013
                 ACS     .69, .93  45, 176       122, 229    1.0, 1.0  .83, .95  0.013
                 r-ACS   .86, .99  8, 10         13, 46      1.0, 1.0  0.0, .20  0.033
    DC           SAS     1.0, 1.0  9996, 10000   2807, 715   1.0, 1.0  .99, .99  0.742   6625
                 SVS     1.0, 1.0  9491, 10000   3712, 1076  1.0, 1.0  .99, .99  0.742
                 ACS     .96, 1.0  8, 8          3, 20       1.0, 1.0  0.0, 0.0  0.742
                 r-ACS   .99, 1.0  8, 8          0, 0        1.0, 1.0  0.0, 0.0  2.223
60  Pearson      SAS     1.0, 1.0  8, 156        192, 1753   1.0, 1.0  0.0, .95  0.011   0.176
                 SVS     1.0, 1.0  8, 80         60, 840     1.0, 1.0  0.0, .90  0.011
                 ACS     1.0, 1.0  8, 8          0, 0        1.0, 1.0  0.0, 0.0  0.011
    Kendall τ    SAS     1.0, 1.0  8, 1293       675, 2772   1.0, 1.0  0.0, .99  0.329   1504
                 SVS     .98, 1.0  8, 313        265, 1429   1.0, 1.0  0.0, .97  0.329
                 ACS     1.0, 1.0  8, 8          0, 0        1.0, 1.0  0.0, 0.0  0.329
                 r-ACS   1.0, 1.0  8, 8          0, 0        1.0, 1.0  0.0, 0.0  0.982
    SIRS         SAS     1.0, 1.0  4096, 9916    3927, 2392  1.0, 1.0  .99, .99  0.010   1.672
                 SVS     1.0, 1.0  85, 4036      2479, 3752  1.0, 1.0  .90, .99  0.010
                 ACS     .61, .85  124, 378      200, 327    1.0, 1.0  .94, .98  0.010
                 r-ACS   .88, .98  8, 22         36, 94      1.0, 1.0  0.0, .63  0.025
    DC           SAS     1.0, 1.0  10000, 10000  158, 0      1.0, 1.0  .99, .99  0.479   6625
                 SVS     1.0, 1.0  10000, 10000  785, 0      1.0, 1.0  .99, .99  0.479
                 ACS     .93, 1.0  8, 8          9, 37       1.0, 1.0  0.0, 0.0  0.479
                 r-ACS   .98, 1.0  8, 8          0, 2        1.0, 1.0  0.0, 0.0  1.434

Table 3: Simulation results for Model (b) with N = 1200, p = 1500, ‖M‖₀ = 4, m = (20, 40).

m   Correlation  Method  SSR       MS        Std(MS)     PSR       FDR       Time_n  Time_N
20  Pearson      SAS     1.0, 1.0  6, 9      19, 109     1.0, 1.0  .33, .53  0.001   0.022
                 SVS     .96, 1.0  6, 9      16, 66      1.0, 1.0  .33, .53  0.001
                 ACS     1.0, 1.0  6, 9      2, 2        1.0, 1.0  .33, .53  0.001
    Kendall τ    SAS     1.0, 1.0  6, 9      37, 132     1.0, 1.0  .33, .53  0.105   51.20
                 SVS     .95, .98  6, 9      26, 85      1.0, 1.0  .33, .56  0.105
                 ACS     1.0, 1.0  6, 8      2, 2        1.0, 1.0  .33, .50  0.105
    SIRS         SAS     1.0, 1.0  6, 8      203, 217    1.0, 1.0  .33, .50  0.001   0.111
                 SVS     .97, .99  6, 7      149, 200    1.0, 1.0  .33, .43  0.001
                 ACS     .89, .98  6, 9      38, 50      1.0, 1.0  .33, .56  0.001
    DC           SAS     1.0, 1.0  8, 10     310, 510    1.0, 1.0  .50, .60  0.122   105.3
                 SVS     1.0, 1.0  7, 9      281, 419    1.0, 1.0  .43, .56  0.122
                 ACS     .98, .99  5, 7      6, 15       1.0, 1.0  .20, .43  0.122
40  Pearson      SAS     1.0, 1.0  6, 9      140, 219    1.0, 1.0  .33, .56  0.001   0.022
                 SVS     .97, .99  7, 9      89, 181     1.0, 1.0  .43, .56  0.001
                 ACS     1.0, 1.0  6, 9      2, 2        1.0, 1.0  .33, .53  0.001
    Kendall τ    SAS     1.0, 1.0  7, 9      171, 263    1.0, 1.0  .43, .56  0.035   51.20
                 SVS     .99, 1.0  7, 10     123, 210    1.0, 1.0  .43, .60  0.035
                 ACS     1.0, 1.0  6, 9      2, 2        1.0, 1.0  .33, .53  0.035
    SIRS         SAS     1.0, 1.0  8, 11     359, 509    1.0, 1.0  .50, .64  0.001   0.111
                 SVS     1.0, 1.0  7, 9      248, 361    1.0, 1.0  .43, .56  0.001
                 ACS     .81, .92  8, 21     59, 75      1.0, 1.0  .56, .80  0.001
    DC           SAS     1.0, 1.0  89, 1466  668, 646    1.0, 1.0  .95, .99  0.038   105.3
                 SVS     1.0, 1.0  19, 1116  606, 667    1.0, 1.0  .78, .99  0.038
                 ACS     .93, .99  6, 7      19, 31      1.0, 1.0  .29, .43  0.038

With the oracle choice of γ, we see that all methods perform well in terms of keeping relevant features; this is indicated by their high SSRs in most cases. Regarding the screening

Table 4: Simulation results for Models (c)-(f), with the case-specific (N, p, m) setup listed in the table.
m    Correlation  Method  SSR       MS            Std(MS)     PSR       FDR       Time_n  Time_N
Model (c), N = 2400, p = 10000, ‖M‖₀ = 4
40   Pearson      SAS     1.0, 1.0  10000, 10000  0, 0        1.0, 1.0  .99, .99  0.007   0.166
                  ACS     1.0, 1.0  1825, 3207    2333, 2350  1.0, 1.0  .99, .99  0.007
     Kendall τ    SAS     1.0, 1.0  6, 9          2, 141      1.0, 1.0  .33, .56  0.666   1429.2
                  ACS     1.0, 1.0  6, 8          2, 2        1.0, 1.0  .33, .50  0.666
80   Pearson      SAS     1.0, 1.0  10000, 10000  0, 0        1.0, 1.0  .99, .99  0.005   0.166
                  ACS     1.0, 1.0  1825, 3207    2333, 2350  1.0, 1.0  .99, .99  0.005
     Kendall τ    SAS     1.0, 1.0  7, 10         343, 2439   1.0, 1.0  .43, .60  0.220   1429.2
                  ACS     1.0, 1.0  6, 8          2, 2        1.0, 1.0  .33, .50  0.220
Model (d), N = 3600, p = 10000, ‖M‖₀ = 4
50   Pearson      SAS     1.0, 1.0  10000, 10000  0, 0        1.0, 1.0  .99, .99  0.009   0.259
                  ACS     1.0, 1.0  7478, 8128    2473, 2049  1.0, 1.0  .99, .99  0.009
     SIRS         SAS     1.0, 1.0  8, 10         270, 1377   1.0, 1.0  .50, .60  0.015   3.393
                  ACS     .95, 1.0  7, 9          22, 66      1.0, 1.0  .43, .56  0.015
100  Pearson      SAS     1.0, 1.0  10000, 10000  0, 0        1.0, 1.0  .99, .99  0.006   0.259
                  ACS     1.0, 1.0  7478, 8128    2473, 2049  1.0, 1.0  .99, .99  0.006
     SIRS         SAS     1.0, 1.0  11, 99        3167, 4192  1.0, 1.0  .64, .96  0.009   3.393
                  ACS     .83, .92  8, 13         79, 177     1.0, 1.0  .50, .69  0.009
Model (e), N = 4800, p = 10000, ‖M‖₀ = 4
60   Pearson      SAS     1.0, 1.0  10000, 10000  0, 0        1.0, 1.0  .99, .99  0.010   0.334
                  ACS     1.0, 1.0  1508, 2807    2596, 2578  1.0, 1.0  .99, .99  0.010
     DC           SAS     1.0, 1.0  10000, 10000  1813, 1188  1.0, 1.0  .99, .99  1.147   26221
                  ACS     .89, .95  6, 9          72, 162     1.0, 1.0  .33, .53  1.147
120  Pearson      SAS     1.0, 1.0  10000, 10000  0, 0        1.0, 1.0  .99, .99  0.006   0.334
                  ACS     1.0, 1.0  1508, 2807    2596, 2578  1.0, 1.0  .99, .99  0.006
     DC           SAS     1.0, 1.0  10000, 10000  0, 0        1.0, 1.0  .99, .99  0.489   26221
                  ACS     .86, .94  11, 73        175, 317    1.0, 1.0  .65, .95  0.489
Model (f), N = 10000, p = 10000, ‖M‖₀ = 4
100  Pearson      SAS     1.0, 1.0  10000, 10000  0, 0        1.0, 1.0  .99, .99  0.016   0.729
                  ACS     1.0, 1.0  8600, 8950    3365, 2946  1.0, 1.0  .99, .99  0.016
     DC           SAS     1.0, 1.0  10000, 10000  1809, 0     1.0, 1.0  .99, .99  1.923   213998
                  ACS     .98, 1.0  8, 8          1, 2        1.0, 1.0  .50, .50  1.923
250  Pearson      SAS     1.0, 1.0  10000, 10000  0, 0        1.0, 1.0  .99, .99  0.017   0.729
                  ACS     1.0, 1.0  8600, 8950    3365, 2946  1.0, 1.0  .99, .99  0.017
     DC           SAS     1.0, 1.0  10000, 10000  0, 0        1.0, 1.0  .99, .99  0.441   213998
                  ACS     .85, 1.0  8, 10         9, 50       1.0, 1.0  .50, .60  0.441

accuracy, SAS seems to be inferior, as it tends to keep too many irrelevant features after screening. This phenomenon is particularly severe for the non-linear correlation measures SIRS and DC under the large-m-small-γ setup. As an extreme case, when DC is used in Models (e) and (f), SAS suggests keeping all 10000 features, which completely fails the mission of screening. The over-selection of SAS here is a direct result of the inaccuracy of the corresponding simple averaging estimators ω̄j. When the Pearson and Kendall τ correlations are used, this issue is less severe, as the corresponding ω̄j's are less biased by nature. The proposed ACS, in comparison, is built upon the stable ω̃j's and thus achieves a reasonably high screening accuracy in most setups. For all four correlation choices, it is able to screen out most irrelevant features while keeping the relevant ones with high probability. Such a performance is very promising. Moreover, the SAS-based screening tends to have high variability in most cases, which makes it less trustworthy in practice. In Models (a) and (b), SVS seems to help reduce the variability of SAS, but the improvement is less significant compared with ACS or r-ACS. This is somewhat expected, as both SAS and SVS are based on the local estimators ω̂l,j, whose bias can be substantial when m is large. We observe that, when SIRS is used in Model (a) with m = 60, none of SAS, SVS, and ACS works very well if SSR and Std(MS) are considered jointly. This might be due to the relatively low sensitivity of SIRS in detecting linear correlations when n is small. When m = 40, all methods improve due to the increased accuracy of both ω̄j and ω̃j.
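For reference, the four indices reported in Tables 2-5 can be computed from the retained index sets as in the sketch below (a hypothetical helper, not the paper's code; `M_true` plays the role of M and `M_hat_list` collects the M̂(t)'s):

```python
import numpy as np

def screening_indices(M_true, M_hat_list):
    """SSR, median model size, PSR, and FDR over T repetitions of a screening run."""
    M_true = set(M_true)
    T = len(M_hat_list)
    ssr = sum(M_true <= set(Mh) for Mh in M_hat_list) / T            # P(M subset of M-hat)
    ms  = np.median([len(Mh) for Mh in M_hat_list])                  # median |M-hat|
    psr = np.median([len(M_true & set(Mh)) / len(M_true) for Mh in M_hat_list])
    fdr = np.median([len(set(Mh) - M_true) / max(len(Mh), 1) for Mh in M_hat_list])
    return ssr, ms, psr, fdr
```

For instance, `screening_indices({1, 2}, [[1, 2, 5], [1, 2], [1, 3]])` returns SSR = 2/3, MS = 2, PSR = 1.0, and FDR = 1/3.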
Apparently, the multiple-data-partition strategy in r-ACS helps a lot in this case, as indicated by its high SSR and low Std(MS) in both m setups. We also observe that, with the Pearson correlation, neither ACS nor SAS performs satisfactorily in Models (c)-(f). This is because the relevant features in those models mainly have non-linear correlations with the response. The superior screening accuracy of ACS is observed when the more advanced correlation measures are used in those models. Benefiting from its distributed framework, the proposed ACS enables parallel computing and enjoys a great numerical advantage over the classic screening procedures (i.e., the m = 1 case). As shown in Tables 2-4, the computational cost of ACS can be less than 1% of the traditional cost under a large m setup, while it still maintains a relatively high screening accuracy. This merit, together with its broad compatibility, makes ACS an attractive approach for screening large-N-large-p data.

4.1.3. Example 3

We further test ACS when the screening threshold γ is chosen by a data-driven method. Similar to Zhu et al. (2011), after the training data are generated, we additionally generate N independent copies of a q-dimensional auxiliary variable (X*₁, . . . , X*_q)ᵀ from a standard multivariate normal distribution. Note that the X*_j's are independent of the response Y and thus are all irrelevant. A reasonable screening threshold is therefore

γ = max_{j=1,...,q} ω̃*_j,

where ω̃*_j is the aggregated correlation estimator between Y and X*_j. We repeat our simulation of Model (b) in Example 2 with the new choice of γ and report the results in Table 5, where N = 2400, p = 5000, m = (80, 100), and q = (1000, 500). It is seen that the performances of ACS and r-ACS with R = 5 remain sharp with the non-oracle γ. In fact, the advantage of the proposed method is even more evident in this example, as SAS seems to be more sensitive to γ.
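The auxiliary-variable device above amounts to calibrating γ on features that are known to be null. A small sketch (the absolute Pearson correlation below is only a stand-in; any ω̃j-type estimator can be plugged in via the `corr` argument):

```python
import numpy as np

def data_driven_threshold(Y, q, corr, rng):
    """gamma = the largest correlation between Y and q freshly generated null features."""
    null_corrs = [corr(rng.normal(size=Y.shape[0]), Y) for _ in range(q)]
    return max(null_corrs)

rng = np.random.default_rng(2)
Y = rng.normal(size=2400)
abs_pearson = lambda x, y: abs(np.corrcoef(x, y)[0, 1])   # stand-in correlation measure
gamma = data_driven_threshold(Y, q=500, corr=abs_pearson, rng=rng)
```

Any real feature whose estimated correlation falls below this maximal null correlation is then treated as irrelevant.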
Table 5: Simulation results of Example 3: data are generated from Model (b) with N = 2400, p = 5000, ‖M‖₀ = 4; distributed screening is conducted based on a data-driven γ with m = (80, 100). The two values a, b in the same column correspond to the q = (1000, 500) cases.

m    Correlation  Method  SSR       MS          Std(MS)     PSR       FDR       Time_n  Time_N
80   Pearson      SAS     1.0, 1.0  5000, 5000  0, 0        1.0, 1.0  .99, .99  0.003   0.086
                  ACS     1.0, 1.0  15, 20      6, 12       1.0, 1.0  .73, .80  0.003
     Kendall τ    SAS     1.0, 1.0  5000, 5000  0, 0        1.0, 1.0  .99, .99  0.112   738.0
                  ACS     1.0, 1.0  15, 21      5, 10       1.0, 1.0  .73, .81  0.112
                  r-ACS   1.0, 1.0  14, 17      4, 7        1.0, 1.0  .71, .76  0.559
     SIRS         SAS     1.0, 1.0  17, 74      368, 769    1.0, 1.0  .76, .95  0.005   0.850
                  ACS     .84, .84  11, 13      6, 9        1.0, 1.0  .64, .71  0.005
                  r-ACS   .86, .88  7, 8        2, 2        1.0, 1.0  .43, .50  0.024
     DC           SAS     1.0, 1.0  5000, 5000  0, 0        1.0, 1.0  .99, .99  0.146   3240.6
                  ACS     .99, 1.0  12, 15      7, 12       1.0, 1.0  .67, .73  0.146
                  r-ACS   1.0, 1.0  9, 9        2, 2        1.0, 1.0  .56, .56  0.728
100  Pearson      SAS     1.0, 1.0  5000, 5000  0, 0        1.0, 1.0  .99, .99  0.002   0.086
                  ACS     1.0, 1.0  15, 19      5, 11       1.0, 1.0  .73, .79  0.002
     Kendall τ    SAS     1.0, 1.0  5000, 5000  0, 0        1.0, 1.0  .99, .99  0.086   738.0
                  ACS     1.0, 1.0  15, 18      5, 9        1.0, 1.0  .73, .78  0.086
                  r-ACS   1.0, 1.0  13, 15      4, 6        1.0, 1.0  .69, .73  0.428
     SIRS         SAS     1.0, 1.0  448, 1404   1183, 1738  1.0, 1.0  .99, .99  0.004   0.850
                  ACS     .80, .82  11, 15      6, 9        1.0, 1.0  .64, .75  0.004
                  r-ACS   .83, .85  7, 7        2, 2        1.0, 1.0  .43, .43  0.020
     DC           SAS     1.0, 1.0  5000, 5000  0, 0        1.0, 1.0  .99, .99  0.128   3240.6
                  ACS     .98, .98  12, 15      8, 13       1.0, 1.0  .67, .75  0.128
                  r-ACS   1.0, 1.0  9, 9        2, 2        1.0, 1.0  .56, .56  0.640

4.1.4. Example 4

In the previous examples, we have observed the promising performance of ACS on a few parametric models. We now extend our numerical assessment of ACS to a model-free learning framework, where the true model is unknown or may not even exist.
To be specific, given a data set D = {(Yi, Xi)}_{i=1}^{N}, we consider kernel ridge regression (KRR), where the goal is to find a predictive function f̂ by minimizing

f̂ = arg min_f Σ_{i=1}^{N} (Yi − f(Xi))² + λ‖f‖²_K,   (9)

where f has the form

f(X) = Σ_{j=1}^{N} βj K(X, Xj),   (10)

‖f‖²_K = Σ_{i,j=1}^{N} βi βj K(Xi, Xj) is the norm of f induced by a positive semi-definite kernel function K, and λ > 0 is a tuning parameter. Let K = {K(Xi, Xj), i, j = 1, . . . , N} be the working kernel matrix associated with K. The coefficient vector β = (β1, . . . , βN)ᵀ of f̂ in the form of (10) is estimated by β̂ = (K + NλI_N)⁻¹Y, where Y = (Y1, . . . , YN)ᵀ and I_N is the N × N identity matrix. Given a new input X_new, KRR predicts the corresponding response by Ŷ_new = f̂(X_new) = Σ_{j=1}^{N} β̂j K(X_new, Xj). When N is huge, it is often reasonable to assume that not all kernel atoms (features) Kj = K(·, Xj) are relevant for prediction; this amounts to assuming that some βj's in (10) are zero. When a Kj is irrelevant, one may also consider Xj to be less important in learning f̂, so it can be eliminated from (9). This suggests that one may remove the j-th column and the j-th row from K to reduce the computational cost of KRR. Thus, the goal of feature screening here is to screen out most irrelevant kernel atoms Kj before carrying out KRR. In KRR, the number of features equals the sample size N, and K has a natural dependence structure in both rows and columns. It is thus of interest to see how well ACS performs for feature screening in this challenging setup. To this end, we generate D from Model (b) in Example 2 with 6000 observations and p = 30, from which N = 4800 entries are randomly selected as a training set and the remaining 1200 are treated as a testing set. We then generate the N × N working kernel matrix K based on the training set using the Gaussian kernel K(Xi, Xj) = exp(−‖Xi − Xj‖²₂/100).
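The KRR steps above (Gaussian kernel, ridge coefficients β̂ = (K + NλI_N)⁻¹Y, and kernel-atom prediction) can be sketched as follows; the toy data, bandwidth, and λ are illustrative only:

```python
import numpy as np

def gaussian_kernel(A, B, s2=100.0):
    # K(a, b) = exp(-||a - b||^2 / s2)
    d2 = np.square(A[:, None, :] - B[None, :, :]).sum(axis=-1)
    return np.exp(-d2 / s2)

def krr_fit(X, Y, lam):
    """Solve (K + N*lam*I) beta = Y for the KRR coefficient vector."""
    N = X.shape[0]
    K = gaussian_kernel(X, X)
    return np.linalg.solve(K + N * lam * np.eye(N), Y)

def krr_predict(beta, Xtrain, Xnew):
    # f_hat(x_new) = sum_j beta_j K(x_new, X_j)
    return gaussian_kernel(Xnew, Xtrain) @ beta

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
beta = krr_fit(X, Y, lam=1e-3)
pred = krr_predict(beta, X, X)
```

After screening, the same fit and predict steps are applied to the refined kernel matrix, with the irrelevant rows and columns removed.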
We treat K as an input design matrix and randomly partition it by rows into m = 10, 40, 120, 160 sub-kernel matrices Kl, l = 1, . . . , m, each of dimension n × N with n = N/m. Those sub-kernel matrices Kl, together with the corresponding n × 1 sub-responses Yl, are then treated as m data segments Dl = (Yl, Kl), l = 1, . . . , m, on which we apply ACS with SIRS to screen out the irrelevant kernel atoms (i.e., the irrelevant columns of K). For comparison, we also run SAS and r-ACS with R = 3 for each choice of m. Since r-ACS shows superior accuracy in our previous examples, we set the screening threshold γ such that 500 important features are retained by r-ACS with m = 10; the same γ is then used for all screening methods in all m setups. In this example, the true model in the form of (10) is unknown. We therefore evaluate the screening performance in terms of prediction. Specifically, after screening, we obtain a refined kernel matrix by removing the irrelevant columns and the corresponding rows from K. We then carry out KRR to obtain a fitted f̂ based on this refined kernel matrix, with λ determined by 10-fold cross-validation. We assess the predictive power of f̂ on the testing set in terms of the prediction RMSE

RMSE(f̂) = [ n_test⁻¹ Σ_{i∈S_test} (Yi − f̂(Xi))² ]^{1/2},

where S_test and n_test denote the observation index set and the sample size of the testing set, respectively. In Figure 2, we report the mean prediction RMSE as well as the averaged model size (AMS) based on 100 repetitions. We also report the performance of f̂ based on the full KRR without screening. It is seen that all screening methods under consideration lead to a predictive accuracy comparable to the full KRR. When m is small, SAS and ACS tend to keep the same number of relevant features. As m increases, SAS becomes more liberal, retaining more features after screening, while ACS remains restrictive.
When m = 160, SAS suggests 1308 relevant features, about 2.6 times the number of features suggested by ACS.

Figure 2: Simulation results for Example 4. (a) The averaged model size (AMS) selected by SAS, ACS, and r-ACS with different choices of m. (b) The predictive root-mean-squared error of KRR based on the features retained by SAS, ACS, and r-ACS. The dashed line corresponds to the full KRR without feature screening.

Yet, as indicated by their prediction RMSEs, including a large number of features in f̂ does not significantly improve the predictive power. This implies that a large portion of the SAS-suggested features are actually redundant. In comparison, ACS is stable across all m setups and leads to more accurate screening in general. Admittedly, the current theoretical support for ACS does not directly apply to the non-i.i.d. KRR learning. Yet, we observe that ACS still performs relatively well in this challenging case. This further indicates that ACS can be a reliable method in practice.

4.2. A real data analysis

We apply the proposed ACS to a real data set¹, which contains 81 covariates extracted from 21,263 superconductors along with the associated critical temperature (response). Readers may refer to Hamidieh (2018) for a detailed description of this data set. It is of interest to predict the unknown response given a set of new values of the covariates. The covariates are likely linked to the response in a non-linear way. To avoid potential model mis-specification, we analyze this data set using the Gaussian KRR discussed in Example 4 of the previous subsection. We remove data entries with missing values in Xi, which leaves 20,877 available entries, from which we randomly select 20,000 entries as a training set and treat the remaining 877 as a testing set. This leads to a working kernel matrix K with N = 20,000 observations and 20,000 kernel atoms K(·, Xj).
Apparently, it is numerically costly to conduct the full KRR on K, which is likely to contain many redundant kernel atoms (features). It is therefore beneficial to conduct distributed feature screening before KRR. To this end, we randomly partition the training set into m = 10, 100, 200, 500 segments and run ACS, r-ACS, and SAS under the same setup as in Example 4, except that the screening threshold γ here is set to the 800-th largest ω̃j in r-ACS with m = 10. The results are summarized in Figure 3 based on 100 repetitions.

1. The data set is available at http://archive.ics.uci.edu/ml/datasets/Superconductivty+Data.

Figure 3: Analysis of the superconductor data. (a) The averaged model size (AMS) selected by SAS, ACS, and r-ACS with different choices of m. (b) The predictive root-mean-squared error of KRR based on the features retained by SAS, ACS, and r-ACS.

The pattern in the plots is similar to that of Figure 2. The screening accuracy and high stability of the proposed method are again observed.

5. Concluding Remarks

Technological innovations have made a profound impact on knowledge discovery. Extracting useful features from massive amounts of high-dimensional data is essential in many modern scientific areas. In this paper, we proposed a distributed framework (ACS) for feature screening with large-N-large-p data sets. In the spirit of divide-and-conquer, ACS enables distributed storage and parallel computing, and thus enjoys a great numerical advantage over the classic screening methods. The key to the success of ACS is that we express a correlation measure as a function of several component parameters and conduct distributed unbiased estimation for each of them. With the unbiased component estimates combined, we obtain an aggregated correlation estimate ω̃j, which is accurate and insensitive to the number of data segments used in the analysis.
This further leads to a computationally efficient and reliably performing screening procedure. Under mild conditions, we showed that ω̃j is as efficient as the classic centralized estimator, while drastically reducing the computational cost. The corresponding screening procedure is compatible with a broad range of correlation measures and enjoys the desirable sure screening property. It should be noted that our current discussion is based on the i.i.d. assumption on (Yi, Xi), which can be impractical when data segments are naturally stored at different locations. In such a scenario, it is likely that the data segments are of different sizes and qualities. To make the proposed ACS more adaptive, one may replace Ūj,h in (4) by a weighted average, where the weight is proportional to the inverse variance of the local component estimator U^l_{j,h}. The proposed ACS focuses on model-free feature screening, where most screening measures have a natural expression in (2). It would be promising to extend the distributive idea to other model-based screening methods, where a screening measure may not have a closed-form expression. One possible way is to conduct distributed optimization for estimating a model-based screening measure. We leave this interesting topic for future research.

Acknowledgments

Xu's research was supported by NSERC grant RGPIN-2016-05024 and NSFC grant 11690014. R. Li's research was supported by NSF grant DMS 1820702 and NIDA, NIH grant P50 DA039838. Xia's research was supported by NSFC grant 11771353. X. Li's research was supported by JSPSP grant 2019SJA2093. The content is solely the responsibility of the authors and does not necessarily represent the official views of the aforementioned funding agencies.

Appendix A. Expressions of ω̂l,j for the examples in Section 2.3.1

For comparison with the proposed ACS, we list the corresponding expressions of ω̂l,j used in SAS for the examples in Section 2.3.1.

1.
Pearson correlation:

ω̂l,j = [ n⁻¹ Σ_{i∈Dl} Xij Yi − (n⁻¹ Σ_{i∈Dl} Xij)(n⁻¹ Σ_{i∈Dl} Yi) ] / { [ n⁻¹ Σ_{i∈Dl} X²ij − (n⁻¹ Σ_{i∈Dl} Xij)² ] [ n⁻¹ Σ_{i∈Dl} Y²i − (n⁻¹ Σ_{i∈Dl} Yi)² ] }^{1/2}.

2. Kendall τ rank correlation:

ω̂l,j = (n(n−1))⁻¹ Σ_{i1≠i2∈Dl} I(Xi1j < Xi2j) I(Yi1 < Yi2) − 1/4.

3. SIRS correlation:

ω̂l,j = (n(n−1)(n−2))⁻¹ Σ_{i1, i2, i3 ∈ Dl distinct} Xi2j Xi3j I(Yi2 < Yi1) I(Yi3 < Yi1),

where the Xij, i ∈ Dl, are standardized for each feature j.

4. Distance correlation:

ω̂l,j = ( θ̂l,j,1 + θ̂l,j,2 θ̂l,j,3 − 2θ̂l,j,4 ) / { (θ̂l,j,5 + θ̂²l,j,2 − 2θ̂l,j,6)(θ̂l,j,7 + θ̂²l,j,3 − 2θ̂l,j,8) }^{1/2},

where

θ̂l,j,1 = n⁻² Σ_{i1,i2∈Dl} |Yi1 − Yi2| · |Xi1j − Xi2j|,   (11)

θ̂l,j,4 = n⁻³ Σ_{i1,i2,i3∈Dl} |Yi1 − Yi3| · |Xi2j − Xi3j|.   (12)

The expression of θ̂l,j,h for h = 2, 3, 5, 7 is similar to (11); the expression of θ̂l,j,h for h = 6, 8 is similar to (12).

In the above examples, the ω̂l,j's are biased, with the scale of the bias depending heavily on the local sample size n = N/m. When m is large, the simple average estimator ω̄j = m⁻¹ Σ_{l=1}^{m} ω̂l,j can be severely biased, as the bias cannot be eliminated by simple averaging. In comparison, the aggregated correlation estimator ω̃j conducts distributed unbiased estimation of the components within ωj and thus leads to a more accurate estimate that is insensitive to m.

Appendix B. Proof of Proposition 1

Let Ψj,h,q = E_q[θ̂j,h(Z_{i1 j}, . . . , Z_{iq j}, Z_{iq+1 j}, . . . , Z_{ikh j})] be the expectation of θ̂j,h with respect to (Z_{iq+1 j}, . . . , Z_{ikh j}) for 0 ≤ q ≤ kh − 1, and let ζj,h,q = Var(Ψj,h,q). By equation (5.13) of Hoeffding (1992), we have

Var(U^l_{j,h}) = C(n, kh)⁻¹ Σ_{q=1}^{kh} C(kh, q) C(n − kh, kh − q) ζj,h,q,

where C(·, ·) denotes the binomial coefficient. The variance of Ūj,h is therefore given by

Var(Ūj,h) = m⁻¹ Var(U^l_{j,h}) = m⁻¹ C(n, kh)⁻¹ Σ_{q=1}^{kh} C(kh, q) C(n − kh, kh − q) ζj,h,q.   (13)

By Theorem 5.1 of Hoeffding (1992), we have ζj,h,q ≤ ζj,h,q+1 for q = 1, . . . , kh − 1. This together with Condition C1 implies that ζj,h,1 ≤ . . . ≤ ζj,h,kh = Var(θ̂j,h) ≤ E(θ̂²j,h) < 2D0/κ²0 for all j and h. Since kh is a constant, the q-th summand in (13) is of order O(1/(m n^q)) = O(m^{q−1}/N^q), for q = 1, . . . , kh. Expression (7) is hence implied.
To show that Var(Ūj,h) is increasing in m, let Ūj,h[m, n] denote Ūj,h based on m data segments, each of size n. By Theorem 5.2 of Hoeffding (1992), N · Var(Ūj,h[1, N]) is a decreasing function of N. Suppose N = n1 m1 = n2 m2 with n1 > n2. Then we have

n1 Var(Ūj,h[1, n1]) ≤ n2 Var(Ūj,h[1, n2]),
(n1/N) Var(Ūj,h[1, n1]) ≤ (n2/N) Var(Ūj,h[1, n2]),
(1/m1) Var(Ūj,h[1, n1]) ≤ (1/m2) Var(Ūj,h[1, n2]),
Var(Ūj,h[m1, n1]) ≤ Var(Ūj,h[m2, n2]).

Since n1 > n2 corresponds to m1 < m2, the incremental property is proved.

Appendix C. Proof of Theorem 2

To prove Theorem 2, we first derive a probability convergence bound for the component estimator Ūj,h in the following lemma.

Lemma 5 Suppose Condition C1 is satisfied and ε ∈ (0, δ0] with an arbitrarily large δ0 > 0. There exists a sufficiently small c0 > 0 such that

P(|Ūj,h − θj,h| ≥ ε) ≤ 2(1 − c0ε²/2)^{m⌊n/kh⌋},

for j = 1, . . . , p and h = 1, . . . , s, where ⌊n/kh⌋ denotes the largest integer no larger than n/kh.

Proof of Lemma 5. Let θ̂j,h be a basis unbiased estimator of θj,h with degree kh. By Markov's inequality, we have

P(Ūj,h − θj,h ≥ ε) = P(exp{ν(Ūj,h − θj,h)} ≥ exp{νε}) ≤ exp{−νε} exp{−νθj,h} E[exp{ν Ūj,h}],   (14)

for any ε > 0 and 0 < ν ≤ κ0 m rh, with rh = ⌊n/kh⌋. Let Sl = {l1, . . . , ln} denote the index set of the {Y, X} copies in Dl, on which we can construct rh independent θ̂j,h's. We define an averaged estimator based on those θ̂j,h's by

Vj,h(Z_{l1 j}, . . . , Z_{ln j}) = rh⁻¹ Σ_{u=1}^{rh} θ̂j,h(Z_{l(u−1)kh+1 j}, . . . , Z_{l u·kh j}).

Then the local U-statistic in (3) can be expressed as

U^l_{j,h} = (n!)⁻¹ Σ_{(i1,...,in)∈Ω} Vj,h(Z_{l i1 j}, . . . , Z_{l in j}),

where Ω denotes the set of all permutations (i1, . . . , in) of {1, . . . , n} and the summation is over all such permutations within Dl. Consequently,

Ūj,h = m⁻¹ Σ_{l=1}^{m} U^l_{j,h} = (n!)⁻¹ Σ_{(i1,...,in)∈Ω} m⁻¹ Σ_{l=1}^{m} Vj,h(Z_{l i1 j}, . . . , Z_{l in j}).

Since the exponential function is convex, Jensen's inequality implies that

E[exp{ν Ūj,h}] = E[ exp{ (n!)⁻¹ Σ_{(i1,...,in)∈Ω} (ν/m) Σ_{l=1}^{m} Vj,h(Z_{l i1 j}, . . . , Z_{l in j}) } ]
≤ (n!)⁻¹ Σ_{(i1,...,in)∈Ω} E[ Π_{l=1}^{m} exp{ (ν/m) Vj,h(Z_{l i1 j}, . . . , Z_{l in j}) } ] = ψ^{m rh}_{j,h}(κ),   (15)

where κ = ν/(m rh) and ψj,h(κ) = E[exp{κ θ̂j,h}].
Combining (14) and (15), we have

P(Ūj,h − θj,h ≥ ε) ≤ [exp{−κε} exp{−κθj,h} ψj,h(κ)]^{m rh}.   (16)

Let V be a generic variable. By Taylor expansion, exp{κV} = 1 + κV + κ²V*/2, where 0 ≤ V* ≤ V² exp{κ1 V} for some κ1 ∈ (0, κ). Thus, the factor exp{−κθj,h} ψj,h(κ) in (16) can be bounded by

exp{−κθj,h} ψj,h(κ) = E[exp{κ(θ̂j,h − θj,h)}]
≤ E[ 1 + κ(θ̂j,h − θj,h) + κ² exp{κ1(θ̂j,h − θj,h)}(θ̂j,h − θj,h)²/2 ]
= 1 + κ² E[ (θ̂j,h − θj,h)² exp{κ1(θ̂j,h − θj,h)} ]/2
≤ 1 + κ² [ E(θ̂j,h − θj,h)⁴ · E exp{2κ1(θ̂j,h − θj,h)} ]^{1/2}/2,   (17)

where (17) is implied by Hölder's inequality. By Condition C1 and the compactness of Θ, (17) can be bounded by 1 + D1κ² with some D1 > 0 for all j = 1, . . . , p and h = 1, . . . , s. Also, when κε < 1, we have exp{−κε} ≤ 1 − εκ + D2ε²κ² with some D2 > 0. Thus, the base term in (16) is bounded by

exp{−κε} exp{−κθj,h} ψj,h(κ) ≤ (1 + D1κ²)(1 − εκ + D2ε²κ²)
= 1 − εκ + D2κ²ε² + D1κ² − D1κ³ε + D1D2κ⁴ε²
≤ 1 − εκ + D2κ²ε² + D1κ² + D1D2κ⁴ε² = 1 − εκ + E1,

where E1 = D2κ²ε² + D1κ² + D1D2κ⁴ε². By setting κ = c0ε, we have

E1/(κε) = D2c0ε² + D1c0 + D1D2c0³ε⁴ ≤ D2c0δ0² + D1c0 + D1D2c0³δ0⁴.   (18)

Note that, when c0 > 0 is small enough, we have κ ∈ (0, κ0), κε < 1, and (18) is bounded by 1/2, so that E1 ≤ εκ/2. Thus, the base term in (16) is further bounded by

exp{−κε} exp{−κθj,h} ψj,h(κ) ≤ 1 − εκ/2.   (19)

Combining (16) and (19), we have P(Ūj,h − θj,h ≥ ε) ≤ (1 − c0ε²/2)^{m rh}. Similarly, we can show that P(Ūj,h − θj,h ≤ −ε) ≤ (1 − c0ε²/2)^{m rh}. Therefore, we obtain P(|Ūj,h − θj,h| ≥ ε) ≤ 2(1 − c0ε²/2)^{m⌊n/kh⌋} under the conditions specified in the lemma. The proof is complete.

With Lemma 5, we prove Theorem 2 based on the following fact.

Lemma 6 Suppose that θh, h = 1, . . . , s, are bounded; that is, there exists a positive constant a > 0 such that |θh| < a. Let θ̃h be an estimator of θh. Suppose that, for any ε ∈ (0, c], there exists a constant c1 > 0 such that, for any h ∈ {1, . . . , s},

P(|θ̃h − θh| ≥ ε) ≤ c1(1 − ε²/c1)^{m⌊n/k⌋},   (20)

where k is a positive integer.
Then, there exists a positive constant c′ such that

P( | |θ̃h| − |θh| | ≥ ε ) ≤ c′(1 − ε²/c′)^{m⌊n/k⌋},   (21)
P( |(θ̃h1 + θ̃h2) − (θh1 + θh2)| ≥ ε ) ≤ c′(1 − ε²/c′)^{m⌊n/k⌋},   (22)
P( |(θ̃h1 − θ̃h2) − (θh1 − θh2)| ≥ ε ) ≤ c′(1 − ε²/c′)^{m⌊n/k⌋},   (23)
P( |θ̃h1 θ̃h2 − θh1 θh2| ≥ ε ) ≤ c′(1 − ε²/c′)^{m⌊n/k⌋},   (24)
P( |θ̃²h − θ²h| ≥ ε ) ≤ c′(1 − ε²/c′)^{m⌊n/k⌋}.   (25)

Moreover, suppose there exists a constant b > 0 such that |θh2| > b. Then we have

P( |θ̃h1/θ̃h2 − θh1/θh2| ≥ ε ) ≤ c′(1 − ε²/c′)^{m⌊n/k⌋}.   (26)

If we further assume θh > 0, then

P( |θ̃h^{1/2} − θh^{1/2}| ≥ ε ) ≤ c′(1 − ε²/c′)^{m⌊n/k⌋}.   (27)

Proof of Lemma 6. We prove the lemma by justifying (21)-(27) sequentially. The proof of (21) is straightforward: by (20), for any ε ∈ (0, c], we have

P( | |θ̃h| − |θh| | ≥ ε ) ≤ P( |θ̃h − θh| ≥ ε ) ≤ c1(1 − ε²/c1)^{m⌊n/k⌋}.

We now work on (22). For any ε ∈ (0, c], we have

P( |(θ̃h1 + θ̃h2) − (θh1 + θh2)| ≥ ε ) ≤ P(|θ̃h1 − θh1| ≥ ε/2) + P(|θ̃h2 − θh2| ≥ ε/2)
≤ 2c1(1 − ε²/(4c1))^{m⌊n/k⌋} ≤ c2(1 − ε²/c2)^{m⌊n/k⌋},

where c2 = 4c1. Similarly, we can show (23). To show (24), we first prove that the θ̃h's are bounded in probability. Specifically, since |θh| ≤ a, we have, for any ε ∈ (0, c],

P(|θ̃h| ≥ a + ε) ≤ P(|θ̃h − θh| + |θh| ≥ a + ε) ≤ P(|θ̃h − θh| ≥ ε) ≤ c1(1 − ε²/c1)^{m⌊n/k⌋}.   (28)

Also,

P( |θ̃h1 θ̃h2 − θh1 θh2| ≥ ε ) = P( |θ̃h1 θ̃h2 − θ̃h1 θh2 + θ̃h1 θh2 − θh1 θh2| ≥ ε )
≤ P( |θ̃h1| · |θ̃h2 − θh2| + |θh2| · |θ̃h1 − θh1| ≥ ε )
≤ P( |θ̃h1| · |θ̃h2 − θh2| ≥ ε/2 ) + P( |θh2| · |θ̃h1 − θh1| ≥ ε/2 ).   (29)

By (20) and (28), the first term of (29) can be bounded by

P( |θ̃h1| · |θ̃h2 − θh2| ≥ ε/2 )
= P( |θ̃h1| · |θ̃h2 − θh2| ≥ ε/2, |θ̃h1| ≥ a + ε ) + P( |θ̃h1| · |θ̃h2 − θh2| ≥ ε/2, |θ̃h1| < a + ε )
≤ P( |θ̃h1| ≥ a + ε ) + P( (a + ε)|θ̃h2 − θh2| ≥ ε/2 )
≤ c1(1 − ε²/c1)^{m⌊n/k⌋} + c1(1 − ε²/c3)^{m⌊n/k⌋},

where c3 = max{4(a + c)²c1, c1}. The second term of (29) can be bounded by

P( |θh2| · |θ̃h1 − θh1| ≥ ε/2 ) ≤ P( |θ̃h1 − θh1| ≥ ε/(2a) ) ≤ c1(1 − ε²/c4)^{m⌊n/k⌋},

with c4 = max{4a²c1, c1}. Then, by setting c5 = max{3c1, c3}, we have

P( |θ̃h1 θ̃h2 − θh1 θh2| ≥ ε ) ≤ 3c1(1 − ε²/c3)^{m⌊n/k⌋} ≤ c5(1 − ε²/c5)^{m⌊n/k⌋},

which proves (24). By setting θ̃h2 = θ̃h1 = θ̃h in (24), we immediately obtain result (25). To prove (26), let us first show that θ̃h2 is bounded away from 0 in probability.
Since $|\theta_{h_2}| > b > 0$, there exists a constant $\delta_1 \in (0, c)$ such that, for some $b' = b - \delta_1 > 0$,

$$P(|\tilde\theta_{h_2}| \le b') \le P(|\theta_{h_2}| - |\tilde\theta_{h_2} - \theta_{h_2}| \le b - \delta_1) \le P(|\tilde\theta_{h_2} - \theta_{h_2}| \ge \delta_1) \le c_1(1 - \delta_1^2/c_1)^{m\lfloor n/k\rfloor}.$$

Let $c_6 = c_1 c^2/\delta_1^2$. Then, for $\varepsilon \in (0, c)$, we have

$$P(|\tilde\theta_{h_2}| \le b') \le c_1(1 - \varepsilon^2/c_6)^{m\lfloor n/k\rfloor}. \quad (30)$$

Based on (30), we have

$$P(|\tilde\theta_{h_1}/\tilde\theta_{h_2} - \theta_{h_1}/\theta_{h_2}| \ge \varepsilon) = P(|\tilde\theta_{h_1}/\tilde\theta_{h_2} - \theta_{h_1}/\theta_{h_2}| \ge \varepsilon,\ |\tilde\theta_{h_2}| \le b') + P(|\tilde\theta_{h_1}/\tilde\theta_{h_2} - \theta_{h_1}/\theta_{h_2}| \ge \varepsilon,\ |\tilde\theta_{h_2}| > b')$$
$$\le P(|\tilde\theta_{h_2}| \le b') + P(|\tilde\theta_{h_1}/\tilde\theta_{h_2} - \theta_{h_1}/\theta_{h_2}| \ge \varepsilon,\ |\tilde\theta_{h_2}| > b')$$
$$\le c_1(1 - \varepsilon^2/c_6)^{m\lfloor n/k\rfloor} + P(|\tilde\theta_{h_1}/\tilde\theta_{h_2} - \theta_{h_1}/\theta_{h_2}| \ge \varepsilon,\ |\tilde\theta_{h_2}| > b'). \quad (31)$$

In (31), the second term can be bounded by

$$P(|\tilde\theta_{h_1}/\tilde\theta_{h_2} - \theta_{h_1}/\theta_{h_2}| \ge \varepsilon,\ |\tilde\theta_{h_2}| > b') \le P(|\tilde\theta_{h_1}/\tilde\theta_{h_2} - \theta_{h_1}/\tilde\theta_{h_2}| + |\theta_{h_1}/\tilde\theta_{h_2} - \theta_{h_1}/\theta_{h_2}| \ge \varepsilon,\ |\tilde\theta_{h_2}| > b')$$
$$\le P(|\tilde\theta_{h_1} - \theta_{h_1}|/b' \ge \varepsilon/2) + P\left(\frac{|\theta_{h_1}||\tilde\theta_{h_2} - \theta_{h_2}|}{|\tilde\theta_{h_2}||\theta_{h_2}|} \ge \varepsilon/2,\ |\tilde\theta_{h_2}| > b'\right)$$
$$\le c_1(1 - \varepsilon^2/c_7)^{m\lfloor n/k\rfloor} + c_1(1 - \varepsilon^2/c_8)^{m\lfloor n/k\rfloor}, \quad (32)$$

where $c_7 = \max\{4c_1/(b')^2, c_1\}$ and $c_8 = \max\{4a^2 c_1/(b'b)^2, c_1\}$. Let $c_9 = \max\{3c_1, c_6, c_7, c_8\}$; then we have

$$P(|\tilde\theta_{h_1}/\tilde\theta_{h_2} - \theta_{h_1}/\theta_{h_2}| \ge \varepsilon) \le c_9(1 - \varepsilon^2/c_9)^{m\lfloor n/k\rfloor}.$$

Let us now work on (27). Since $\theta_h > 0$, there exists a $b > 0$ such that $\theta_h > b$. Similar to (30)-(32), there exist two positive constants $b'$ and $c_{10}$ such that

$$P(|\sqrt{\tilde\theta_h} - \sqrt{\theta_h}| \ge \varepsilon) \le P(\tilde\theta_h \le b') + P(|\sqrt{\tilde\theta_h} - \sqrt{\theta_h}| \ge \varepsilon,\ \tilde\theta_h > b')$$
$$\le c_1(1 - \varepsilon^2/c_{10})^{m\lfloor n/k\rfloor} + P(|\tilde\theta_h - \theta_h| \ge \sqrt{b}\,\varepsilon)$$
$$\le c_1(1 - \varepsilon^2/c_{10})^{m\lfloor n/k\rfloor} + c_1(1 - \varepsilon^2/c_{11})^{m\lfloor n/k\rfloor},$$

where $c_{11} = \max\{c_1/(\sqrt{b})^2, c_1\}$. By setting $c_{12} = \max\{2c_1, c_{10}, c_{11}\}$, we obtain that

$$P(|\sqrt{\tilde\theta_h} - \sqrt{\theta_h}| \ge \varepsilon) \le c_{12}(1 - \varepsilon^2/c_{12})^{m\lfloor n/k\rfloor}.$$

Result (27) is therefore proved. Combining the results in (21)-(27), we prove Lemma 6 by setting $c' = \max\{c_1, c_2, c_5, c_9, c_{12}\}$.

Proof of Theorem 2. By Lemma 5, for any $\varepsilon \in (0, \delta_0]$, there exists a $c_0 > 0$ such that

$$P(|\bar U_{j,h} - \theta_{j,h}| \ge \varepsilon) \le 2(1 - c_0\varepsilon^2/2)^{m\lfloor n/k_h\rfloor} \le 2(1 - c_0\varepsilon^2/2)^{m\lfloor n/k\rfloor},$$

where $k = \max\{k_h, h = 1, \ldots, s\} \le n$. Since $\delta_0$ can be arbitrarily large, the inequality holds with $\varepsilon = cN^{-\tau} \in (0, c]$ for some $0 < \tau < 1/2$. Thus, we have

$$P(|\bar U_{j,h} - \theta_{j,h}| \ge cN^{-\tau}) \le 2(1 - c_{13}N^{-2\tau}/2)^{m\lfloor n/k\rfloor} \le c_{14}(1 - N^{-2\tau}/c_{14})^{m\lfloor n/k\rfloor}, \quad h = 1, \ldots, s, \quad (33)$$

where $c_{13} = c_0 c^2$ and $c_{14} = \max\{2, 2/c_{13}\}$. This implies that the results of Lemma 6 are applicable by setting $\tilde\theta_h = \bar U_{j,h}$. By Condition C2, $\tilde\omega_j = g(\bar U_{j,1}, \ldots, \bar U_{j,s})$ is constructed by a finite number of simple numerical operations, which serve as basic building blocks of $g(\cdot)$. For each building block, Lemma 6 can be used immediately to establish the corresponding convergence bound. With a finite combination of those building blocks, (33) further implies that

$$P(|\tilde\omega_j - \omega_j| \ge cN^{-\tau}) \le \eta(1 - N^{-2\tau}/\eta)^{m\lfloor n/k\rfloor} \quad (34)$$

for some generic positive constant $\eta$. Consequently, we have

$$P\left(\max_{1\le j\le p} |\tilde\omega_j - \omega_j| \ge cN^{-\tau}\right) \le \sum_{j=1}^p P(|\tilde\omega_j - \omega_j| \ge cN^{-\tau}) \le \eta p(1 - N^{-2\tau}/\eta)^{m\lfloor n/k\rfloor}.$$

The theorem is proved.

Appendix D. Proof of Theorem 3

When $\omega_j = g(\theta_j)$ is Lipschitz continuous in $\theta_j$, we have

$$\mathrm{MSE}(\tilde\omega_j) = \mathrm{MSE}(g(\bar U_{j,1}, \ldots, \bar U_{j,s})) = E\left(g(\bar U_{j,1}, \ldots, \bar U_{j,s}) - g(\theta_j)\right)^2$$
$$\le L^2 E\left(\sum_{h=1}^s (\bar U_{j,h} - \theta_{j,h})^2\right) = L^2 \sum_{h=1}^s E(\bar U_{j,h} - \theta_{j,h})^2 \le sL^2 \max_{j,h}\{\mathrm{Var}(\bar U_{j,h})\},$$

where $L$ is the Lipschitz constant defined in C2$'$ and the last step uses the unbiasedness of $\bar U_{j,h}$. Thus, the theorem is implied directly by Proposition 1.

Appendix E. Proof of Theorem 4

Note that $\gamma = cN^{-\tau}$. If $\mathcal{M} \not\subset \widetilde{\mathcal{M}}$, there must exist some $j \in \mathcal{M}$ such that $\tilde\omega_j < cN^{-\tau}$. Also, by Condition C3, we assume $\min_{j\in\mathcal{M}} \omega_j \ge 2cN^{-\tau}$. Thus, $\mathcal{M} \not\subset \widetilde{\mathcal{M}}$ implies $|\tilde\omega_j - \omega_j| > cN^{-\tau}$ for some $j \in \mathcal{M}$. Therefore, by (34), we have

$$P\{\mathcal{M} \subset \widetilde{\mathcal{M}}\} \ge P\left(\max_{j\in\mathcal{M}} |\tilde\omega_j - \omega_j| \le cN^{-\tau}\right) \ge 1 - P\left(\max_{j\in\mathcal{M}} |\tilde\omega_j - \omega_j| > cN^{-\tau}\right)$$
$$\ge 1 - \sum_{j\in\mathcal{M}} P(|\tilde\omega_j - \omega_j| > cN^{-\tau}) \ge 1 - d\eta(1 - N^{-2\tau}/\eta)^{m\lfloor n/k\rfloor},$$

where $d$ is the cardinality of $\mathcal{M}$. The theorem is proved.

To see that the above probability bound goes to 1 as $N \to \infty$, it suffices to show that $d(1 - N^{-2\tau}/\eta)^{m\lfloor n/k\rfloor} = o(1)$. Note that

$$(1 - N^{-2\tau}/\eta)^{m\lfloor n/k\rfloor} = \left\{(1 - 1/(\eta N^{2\tau}))^{\eta N^{2\tau}}\right\}^{m\lfloor n/k\rfloor/(\eta N^{2\tau})}.$$

Since $\lim_{N\to\infty}(1 - 1/(\eta N^{2\tau}))^{\eta N^{2\tau}} = 1/e$, there exists a positive integer $N_1$ such that, whenever $N > N_1$, we have $(1 - 1/(\eta N^{2\tau}))^{\eta N^{2\tau}} < 2/e < 1$. Also note that $0 < \tau < 1/2$ and $m\lfloor n/k\rfloor$ is of the same order as $N$; there exist a positive integer $N_2$ and a constant $\upsilon \in (0, 1 - 2\tau)$ such that, whenever $N > N_2$, we have $m\lfloor n/k\rfloor/(\eta N^{2\tau}) > N^{\upsilon}$. Therefore, when $N > \max\{N_1, N_2\}$, we have

$$\left\{(1 - 1/(\eta N^{2\tau}))^{\eta N^{2\tau}}\right\}^{m\lfloor n/k\rfloor/(\eta N^{2\tau})} < (2/e)^{N^{\upsilon}}.$$

It follows that, when $d = O(N)$ and $N$ is large enough, $d(1 - N^{-2\tau}/\eta)^{m\lfloor n/k\rfloor} \le d(2/e)^{N^{\upsilon}} = o(1)$. The probability bound in Theorem 4 thus goes to 1 as $N \to \infty$.

References

Moulinath Banerjee, Cécile Durot, Bodhisattva Sen, et al.
Divide and conquer in nonstandard problems and the super-efficiency phenomenon. The Annals of Statistics, 47(2):720-757, 2019.

Heather Battey, Jianqing Fan, Han Liu, Junwei Lu, and Ziwei Zhu. Distributed testing and estimation under sparse high dimensional models. The Annals of Statistics, 46(3):1352-1382, 2018.

Xueying Chen and Min-ge Xie. A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica, 24(4):1655-1684, 2014.

Yixin Chen, Guozhu Dong, Jiawei Han, Jian Pei, Benjamin W. Wah, and Jianyong Wang. Regression cubes with lossless compression and aggregation. IEEE Transactions on Knowledge and Data Engineering, 18(12):1585-1599, 2006.

Hengjian Cui, Runze Li, and Wei Zhong. Model-free feature screening for ultrahigh dimensional discriminant analysis. Journal of the American Statistical Association, 110(510):630-641, 2015.

Jianqing Fan and Jinchi Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849-911, 2008.

Jianqing Fan, Richard Samworth, and Yichao Wu. Ultrahigh dimensional feature selection: beyond the linear model. Journal of Machine Learning Research, 10(Sep):2013-2038, 2009.

Kam Hamidieh. A data-driven statistical model for predicting the critical temperature of a superconductor. Computational Materials Science, 154:346-354, 2018.

Wassily Hoeffding. A class of statistics with asymptotically normal distribution. In Breakthroughs in Statistics, pages 308-334. Springer, 1992.

Michael I. Jordan, Jason D. Lee, and Yun Yang. Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 114(526):668-681, 2019.

Gaorong Li, Heng Peng, Jun Zhang, Lixing Zhu, et al. Robust rank correlation based screening. The Annals of Statistics, 40(3):1846-1877, 2012a.

Runze Li, Wei Zhong, and Liping Zhu.
Feature screening via distance correlation learning. Journal of the American Statistical Association, 107(499):1129-1139, 2012b.

Nan Lin and Ruibin Xi. Fast surrogates of U-statistics. Computational Statistics & Data Analysis, 54(1):16-24, 2010.

Nan Lin and Ruibin Xi. Aggregated estimating equation estimation. Statistics and Its Interface, 4(1):73-83, 2011.

Chengchun Shi, Wenbin Lu, and Rui Song. A massive data framework for M-estimators with cubic-rate. Journal of the American Statistical Association, 113(524):1698-1709, 2018.

Yuanshan Wu and Guosheng Yin. Conditional quantile screening in ultrahigh-dimensional heterogeneous data. Biometrika, 102(1):65-76, 2015.

Chen Xu, Yongquan Zhang, Runze Li, and Xindong Wu. On the feasibility of distributed kernel regression for big data. IEEE Transactions on Knowledge and Data Engineering, 28(11):3041-3052, 2016.

Yuchen Zhang, John C. Duchi, and Martin Wainwright. Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14(1):3321-3363, 2013.

Tingyou Zhou, Liping Zhu, Runze Li, and Chen Xu. Model-free forward regression via cumulative divergence. Journal of the American Statistical Association, in press, 2019.

Li-Ping Zhu, Lexin Li, Runze Li, and Li-Xing Zhu. Model-free feature screening for ultrahigh-dimensional data. Journal of the American Statistical Association, 106(496):1464-1475, 2011.
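As a closing numerical remark, the rate argument in the proof of Theorem 4 can be checked directly: the failure bound $d(1 - N^{-2\tau}/\eta)^{m\lfloor n/k\rfloor}$ should vanish as $N$ grows and should be essentially insensitive to the number of segments $m$. The short Python sketch below evaluates this bound; the constants $\tau$, $\eta$, $k$ and the choice $d = N$ are arbitrary illustrative values, not quantities prescribed by the paper.

```python
# Numerical sanity check (illustration only): evaluate the Theorem 4 tail bound
# d * (1 - N^{-2*tau}/eta)^(m * floor(n/k)) with n = N/m and d = N, i.e. d = O(N).
# The constants tau, eta, k below are arbitrary illustrative choices.

def theorem4_bound(N, m, tau=0.25, eta=2.0, k=4):
    """Upper bound on the sure-screening failure probability for sample size N
    split into m segments, with minimal-subsample size k per U-statistic."""
    n = N // m                          # per-segment sample size
    base = 1.0 - N ** (-2.0 * tau) / eta
    return N * base ** (m * (n // k))   # d = N features in the active set scale

# The bound shrinks rapidly in N, while the segment count m barely matters,
# since the effective exponent m * floor(n/k) stays of order N/k.
for N in (10**3, 10**4, 10**5):
    print(N, theorem4_bound(N, m=10))
```

With these illustrative constants the printed bound drops by several orders of magnitude for each tenfold increase in $N$, and evaluating it with $m = 1$ versus $m = 100$ at fixed $N$ gives essentially the same value, mirroring the claim that the aggregated estimator is insensitive to the number of data segments.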