# Feature Selection using e-values

Subhabrata Majumdar 1 2   Snigdhansu Chatterjee 1

1 School of Statistics, University of Minnesota Twin Cities, Minneapolis, MN, USA. 2 Currently at Splunk. Correspondence to: Subhabrata Majumdar.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

In the context of supervised parametric models, we introduce the concept of e-values. An e-value is a scalar quantity that represents the proximity of the sampling distribution of parameter estimates in a model trained on a subset of features to that of the model trained on all features (i.e. the full model). Under general conditions, a rank ordering of e-values separates models that contain all essential features from those that do not. The e-values are applicable to a wide range of parametric models. We use data depths and a fast resampling-based algorithm to implement a feature selection procedure using e-values, providing consistency results. For a p-dimensional feature space, this procedure requires fitting only the full model and evaluating p + 1 models, as opposed to the traditional requirement of fitting and evaluating 2^p models. Through experiments across several model settings and synthetic and real datasets, we establish that the e-values method is a promising general alternative to existing model-specific methods of feature selection.

1. Introduction

In the era of big data, feature selection in supervised statistical and machine learning (ML) models helps cut through the noise of superfluous features, provides storage and computational benefits, and gives model interpretability. Model-based feature selection can be divided into two categories (Guyon & Elisseeff, 2003): wrapper methods that evaluate models trained on multiple feature sets (Schwarz, 1978; Shao, 1996), and embedded methods that combine feature selection and training, often through sparse regularization (Tibshirani, 1996; Zou, 2006; Zou & Hastie, 2005). Both categories are extremely well-studied for independent data models, but have their own challenges. Navigating an exponentially growing feature space using wrapper methods is NP-hard (Natarajan, 1995), and case-specific search strategies such as greedy search, branch-and-bound, and simulated annealing are needed. Sparse penalized embedded methods can tackle high-dimensional data, but have inferential and algorithmic issues, such as biased Lasso estimates (Zhang & Zhang, 2014) and the use of convex relaxations to compute approximate local solutions (Wang et al., 2013; Zou & Li, 2008).

Feature selection in dependent data models has received comparatively less attention. Existing implementations of wrapper and embedded methods have been adapted to dependent data scenarios, such as mixed effect models (Meza & Lahiri, 2005; Nguyen & Jiang, 2014; Peng & Lu, 2012) and spatial models (Huang et al., 2010; Lee & Ghosh, 2009). However, these suffer from the same issues as their independent data counterparts. If anything, the higher computational overhead of model training makes implementation of wrapper methods even harder in this setting. In this paper, we propose the framework of e-values as a common principle to perform best subset feature selection in a range of parametric models covering independent and dependent data settings.
In essence ours is a wrapper method, although it is able to determine the important features affecting the response by training a single model: the one with all features, or the full model. We achieve this by utilizing the information encoded in the distribution of model parameter estimates. Notwithstanding recent efforts (Lai et al., 2015; Singh et al., 2007), parameter distributions that are fully data-driven have generally been underutilized in data science. In our work, e-values score a candidate model to quantify the similarity between the sampling distributions of parameters in that model and in the full model. The sampling distribution is the distribution of a parameter estimate, induced by the random data samples the estimate is calculated from. We access sampling distributions using the efficient Generalized Bootstrap technique (Chatterjee & Bose, 2005), and utilize them using data depths, which are essentially point-to-distribution inverse distance functions (Tukey, 1975; Zuo, 2003; Zuo & Serfling, 2000), to compute e-values.

How e-values perform feature selection. Data depth functions quantify the inlyingness of a point in multivariate space with respect to a probability distribution or a multivariate point cloud, and have seen diverse applications in the literature (Jornsten, 2004; Narisetty & Nair, 2016; Rousseeuw & Hubert, 1998).

Figure 1.1. The e-value method. Solid and dotted ellipses represent Mahalanobis depth contours at some α > 0 for sample sizes n1, n2; n2 > n1.

As an example, consider the Mahalanobis depth, defined as D(x, F) = [1 + (x − µ_F)^T Σ_F^{−1} (x − µ_F)]^{−1}, for x ∈ R^p and F a p-dimensional probability distribution with mean µ_F and covariance matrix Σ_F. When x is close to µ_F, D(x, F) takes a high value, close to 1. On the other hand, as ∥x∥ → ∞, the depth at x approaches 0. Thus, D(x, F) quantifies the proximity of x to F. Points with high depth are situated deep inside the probability distribution F, while low-depth points sit at the periphery of F.

Given a depth function D, we define the e-value as the mean depth of the finite-sample estimate of a parameter β for a candidate model M, with respect to the sampling distribution of the full model estimate β̂:

e(M) = E D(β̂_M, [β̂]),

where β̂_M is the estimate of β assuming model M, [β̂] is the sampling distribution of β̂, and E(·) denotes expectation. For large sample size n, the index set S_select obtained by Algorithm 1 below recovers all non-zero features in the true parameter. We use the bootstrap to obtain multiple copies of β̂ and β̂_{−j} for calculating the e-values in steps 1 and 5.

Algorithm 1 Best subset selection using e-values
1: Obtain the full model e-value: e(M_full) = E D(β̂, [β̂]).
2: Set S_select = ∅.
3: for j in 1 : p
4:   Replace the j-th index of β̂ by 0; call the result β̂_{−j}.
5:   Obtain e(M_{−j}) = E D(β̂_{−j}, [β̂]).
6:   if e(M_{−j}) < e(M_full)
7:     Set S_select ← S_select ∪ {j}.

As an example, consider a linear regression with two features, Gaussian errors ϵ ∼ N(0, 1), and the following candidate models (Figure 1.1). We denote by Θ_i the domain of parameters considered in model M_i, i = 1, . . . , 4:

M1 : Y = X1 β1 + X2 β2 + ϵ;  Θ1 = R^2,
M2 : Y = X1 β1 + ϵ;  Θ2 = R × {0},
M3 : Y = X2 β2 + ϵ;  Θ3 = {0} × R,
M4 : Y = ϵ;  Θ4 = {(0, 0)^T}.

Let β0 = (5, 0)^T be the true parameter. The full model sampling distribution, β̂ ∼ N(β0, (X^T X)^{−1}), is more concentrated around β0 for higher sample sizes. Thus the depths at points along the (red) line β1 = 0, and at the origin (red point), become smaller, and the mean depths approach 0 for M3 and M4.
On the other hand, since depth functions are affine invariant (Zuo & Serfling, 2000), mean depths calculated over parameter spaces for M1 (blue line) and M2 (blue surface) remain the same, and do not vanish as n . Thus, e-values of the good models M1, M2 models the parameter spaces of which contain β0 separate from those of the bad models M3, M4 more and more as n grows. Algorithm 1 is the result of this separation, and a rank ordering of the good models based on how parsimonious they are (Theorem 5.1). 2. Related work Feature selection is a widely studied area in statistics and ML. The vast amount of literature on this topic includes classics like the Bayesian Information Criterion (Schwarz, 1978, BIC) or Lasso (Tibshirani, 1996), and recent advances such as Mixed Integer Optimization (Bertsimas et al., 2016, MIO). To deal with the increasing scale and complexity of data in recent applications, newer streams of work have also materialized such as nonlinear and model-independent methods (Song et al., 2012), model averaging (Fragoso et al., 2018), and dimension reduction techniques (Ma & Zhu, 2013). We refer the reader to Guyon & Elisseeff (2003) and the literature review in Bertsimas et al. (2016) for a broader overview of feature selection methods, and focus on the three lines of work most related to our formulation. Shao (1996) first proposed using bootstrap for feature selection, with the squared residual averaged over a large number of resampled parameter estimates from a m-out-ofn bootstrap as selection criterion. Leave-One-Covariate-Out (Lei et al., 2018, LOCO) is based on the idea of samplesplitting. The full model and leave-one-covariate-out models are trained using one part of the data, and the rest of the data are used to calculate the importance of a feature as median difference in predictive performance between the full model and a LOCO model. Finally, Barber & Candes (2015) introduced the powerful idea of Knockoff filters, where a knockoff version of the original input dataset imitating its correlation structure is constructed. This method is explicitly able to control False Discovery Rate. Feature Selection using e-values Our framework of e-values has some similarities with the above methods, such as the use of bootstrap (similar to Shao (1996)), feature-specific statistics (similar to LOCO and Knockoffs), and evaluation at p + 1 models (similar to LOCO). However, e-values are also significantly different. They do not suffer from the high computational burden of model refitting over multiple bootstrap samples (unlike Shao (1996)), are not conditional on data splitting (unlike LOCO), and have a very general theoretical formulation (unlike Knockoffs). Most importantly, unlike all the three methods discussed above, e-values require fitting only a single model, and work for dependent data models. Previously, Vander Weele & Ding (2017) have used the term E-Value in the context of sensitivity analysis. However, our and their definition of e-values are quite different. We see our e-values as a means to evaluate a feature with reference to a model. There are some parallels to the well known p-values used for hypothesis testing, but we see the e-value as a more general quantity that covers dependent data situations, as well as general estimation and hypothesis testing problems. 
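Before the formal development in the next section, the following minimal sketch revisits the two-feature example from Section 1 in Python: it scores the drop-one-feature models of Algorithm 1 by Mahalanobis depth against a bootstrap approximation of the full-model sampling distribution. The helper function, the plain pairs bootstrap, and the reuse of a single resample set are illustrative simplifications; the paper's implementation uses the generalized bootstrap of Section 4 with two independent resample sets.

```python
import numpy as np

rng = np.random.default_rng(0)

def mahalanobis_depth(points, cloud):
    """Mahalanobis depth of each row of `points` w.r.t. the empirical
    distribution of `cloud`: [1 + (x - mu)' S^{-1} (x - mu)]^{-1}."""
    mu = cloud.mean(axis=0)
    prec = np.linalg.inv(np.cov(cloud, rowvar=False))
    d = points - mu
    return 1.0 / (1.0 + np.einsum("ri,ij,rj->r", d, prec, d))

# Two-feature linear model with true coefficient beta0 = (5, 0)
n, beta0 = 200, np.array([5.0, 0.0])
X = rng.normal(size=(n, 2))
y = X @ beta0 + rng.normal(size=n)

# Pairs-bootstrap approximation of the full-model sampling distribution [beta_hat]
R = 1000
boot = np.empty((R, 2))
for r in range(R):
    idx = rng.integers(0, n, size=n)
    boot[r] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]

# Algorithm 1: compare mean depths of the full model and each drop-one-feature model
e_full = mahalanobis_depth(boot, boot).mean()
selected = []
for j in range(2):
    dropped = boot.copy()
    dropped[:, j] = 0.0                      # plug in the known constant 0 at index j
    e_drop = mahalanobis_depth(dropped, boot).mean()
    print(f"e-value without feature {j + 1}: {e_drop:.3f}  (full model: {e_full:.3f})")
    if e_drop < e_full:
        selected.append(j + 1)

print("Selected features:", selected)        # typically [1], since beta0 = (5, 0)
```

Zeroing the first coordinate pushes the resampled estimates far outside the depth contours around (5, 0), so its e-value collapses, while zeroing the second (truly null) coordinate leaves the mean depth essentially unchanged or higher; only feature 1 is retained.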
While the current paper is an application of e-values for feature selection, we plan to build up on the framework introduced here on other applications, including group feature selection, hypothesis testing, and multiple testing problems. 3. Preliminaries For a positive integer n 1, Let Zn = {Z1, . . . , Zn} be an observable array of random variables that are not necessarily independent. We parametrize Zn using a parameter vector θ Θ Rp, and energy functions {ψi(θ, Zi) : 1 i n}. We assume that there is a true unknown vector of parameters θ0, which is the unique minimizer of Ψn(θ) = E Pn i=1 ψi(θ, Zi). Models and their characterization Let S = θ Θ supp(θ) be the common non-zero support of all estimable parameters θ (denoted by supp(θ)). In this setup, we associate a model M with two quantities (a) The set of indices S S with unknown and estimable parameter values, and (b) an ordered vector of known constants C = (Cj : j / S) at indices not in S. Thus a generic parameter vector for the model M (S, C), denoted by θm Θm Θ = Q j Θj, consists of unknown indices and known constants θmj = unknown θmj Θj for j S, known Cj Θj for j / S. (3.1) Given the above formulation, we characterize a model. Definition 3.1. A model M is called adequate if P j / S |Cj θ0j| = 0. A model that is not adequate, is called an inadequate model. By definition the full model is always adequate, as is the model corresponding to the singleton set containing the true parameter. Thus the set of adequate models is non-empty by construction. Another important notion is the one of nested models. Definition 3.2. We consider a model M1 to be nested in M2, notationally M1 M2, if S1 S2 and C2 is a subvector of C1. If a model is adequate, then any model it is nested in is also adequate. In the context of feature selection this obtains a rank ordering: the most parsimonious adequate model, with S = supp(β0) and C = 0p |S|, is nested in all other adequate models. All models nested within it are inadequate. Estimators The estimator ˆθ of θ0 is obtained as a minimizer of the sample analog of Ψn(θ): ˆθ = arg min θ Ψn(θ) = arg min θ i=1 ψi θ, Zi . (3.2) Under standard regularity conditions on the energy functions, an(ˆθ θ ) converges to a p-dimensional Gaussian distribution as n , where an is a sequence of positive real numbers (Appendix A). Remark 3.3. For generic estimation methods based on likelihood and estimating equations, the above holds with an n1/2, resulting in the standard root-n asymptotics. The estimator in (3.2) corresponds to the full model M , i.e. the model where all indices in S are estimated. For any other model M, we simply augment entries of ˆθ at indices in S with elements of C elsewhere to obtain a modelspecific coefficient estimate ˆθm: ˆθmj = ˆθ j for j S, Cj for j / S. (3.3) The logic behind this plug-in estimate is simple: for a candidate model M, a joint distribution of its estimated parameters, i.e. [ˆθS], can actually be obtained from [ˆθ ] by taking its marginals at indices S. Depth functions Data depth functions (Zuo & Serfling, 2000) quantify the closeness of a point in multivariate space to a probability distribution or data cloud. Formally, let G denote the set of non-degenerate probability measures on Rp. We consider D : Rp G [0, ) to be a data depth function if it satisfies the following properties: (B1) Affine invariance: For any non-singular matrix A Rp p, and b Rp and random variable Y having distribution G G, D(x, G) = D(Ax + b, [AY + b]). 
(B2) Lipschitz continuity: For any G G, there exists δ > 0 and α (0, 1), possibly depending on G such that Feature Selection using e-values whenever |x y| < δ, we have |D(x, G) D(y, G)| < |x y|α. (B3) Consider random variables Yn Rp such that [Yn] Y G. Then D(y, [Yn]) converges uniformly to D(y, Y). So that, if Y Y, then limn ED(Yn, [Yn]) = ED(Y, Y) < . (B4) For any G G, lim x D(x, G) = 0. (B5) For any G G with a point of symmetry µ(G) Rp, D(., G) is maximum at µ(G): D(µ(G), G) = sup x Rp D(x, G) < . Depth decreases along any ray between µ(G) to x, i.e. for t (0, 1), x Rp, D(x, G) < D(µ(G) + t(x µ(G)), G) < D(µ(G), G). Conditions (B1), (B4) and (B5) are integral to the formal definition of data depth (Zuo & Serfling, 2000), while (B2) and (B3) implicitly arise for several depth functions (Mosler, 2013). We require only a subset of (B1)-(B5) for the theory in this paper, but use data depths throughout for simplicity. 4. The e-values framework The e-value of model M is the mean depth of ˆθM with respect to [ˆθ]: en(M) = ED(ˆθm, [ˆθ ]). 4.1. Resampling approximation of e-values Typically, the distributions of either of the random quantities ˆθm and ˆθ , are unknown, and have to be elicited from the data. Because of the plugin method in (3.3), only [ˆθ ] needs to be approximated. We do this using Generalized Bootstrap (Chatterjee & Bose, 2005, GBS). For an exchangeable array of non-negative random variables independent of the data as resampling weights: Wr = (Wr1, . . . , Wrn)T Rn, the GBS estimator ˆθr is the minimizer of i=1 Wriψi θ, Zi . (4.1) We assume the following conditions on the resampling weights and their interactions as n : EWr1 = 1, VWr1 = τ 2 n = o(a2 n) , EW 4 r1 < , EWr1Wr2 = O(n 1), EW 2 r1W 2 r2 1. (4.2) Many resampling schemes can be described as GBS, such as the m-out-of-n bootstrap (Bickel & Sakov, 2008) and scaleenhanced wild bootstrap (Chatterjee, 2019). Under fairly weak regularity conditions on the first two derivatives ψ i and ψ i of ψi, (an/τn)(ˆθr ˆθ ) and an(ˆθ θ0) converge to the same weak limit in probability (See Appendix A). Instead of repeatedly solving (4.1), we use model quantities computed while calculating ˆθ to obtain a first-order approximation of ˆθr . ˆθr = ˆθ τn i=1 ψ i (ˆθ , Zi) i=1 Wriψ i(ˆθ , Zi) + Rr, (4.3) where Er Rr 2 = o P (1), and Wri = (Wri 1)/τn. Thus only Monte Carlo sampling is required to obtain the resamples. Being an embarrassingly parallel procedure, this results in significant computational speed gains. To estimate en(M) we obtain two independent sets of weights {Wr; r = 1, . . . , R} and {Wr1; r1 = 1, . . . , R1} for large integers R, R1. We use the first set of resamples to obtain the distribution [ˆθr ] to approximate [ˆθ ], and the second set of resamples to get the plugin estimate ˆθr1m: ˆθr1mj = ˆθr1 j for j S, Cj for j / S. (4.4) Finally, the resampling estimate of a model e-value is: ern(M) = Er1D ˆθr1m, [ˆθr ] , where Er1 is expectation, conditional on the data, computed using the resampling r1. 4.2. Fast algorithm for best subset selection For best subset selection, we restrict to the model class M0 = {M : Cj = 0 j / S} that only allows zero constant terms. In this setup our fast selection algorithm consists of only three stages: (a) fit the full model and estimate its e-value, (b) replace each covariate by 0 and compute e-value of all such reduced models, and (c) collect covariates dropping which causes the e-value to go down. To fit the full model, we need to determine the estimable index set S . 
When n > p, the choice is simple: S* = {1, . . . , p}. When p > n, we need to ensure that p* = |S*| < n, so that θ̂* (properly centered and scaled) has a unique asymptotic distribution. Similar to past work on feature selection (Lai et al., 2015), we use a feature screening step before model fitting to achieve this (Section 5). After obtaining S* and θ̂*, for each of the p + 1 models under consideration — the full model and all drop-1-feature models — we follow the recipe in Section 4.1 to get bootstrap e-values. This gives us all the components for a sample version of Algorithm 1, which we present as Algorithm 2.

Algorithm 2 Best subset feature selection using e-values and GBS
1. Fix the resampling standard deviation τ_n.
2. Obtain two sets of GBS samples, T = {θ̂*_1, . . . , θ̂*_R} and T_1 = {θ̂*_1, . . . , θ̂*_{R_1}}, using (4.3).
3. Calculate ê_rn(M*) = (1/R_1) Σ_{r_1=1}^{R_1} D(θ̂*_{r_1}, [T]).
4. Set Ŝ_0 = ∅.
5. for j in 1 : p
6.   for r_1 in 1 : R_1
7.     Replace the j-th index of θ̂*_{r_1} by 0 to get θ̂_{r_1, −j}.
8.   Calculate ê_rn(M_{−j}) = (1/R_1) Σ_{r_1=1}^{R_1} D(θ̂_{r_1, −j}, [T]).
9.   if ê_rn(M_{−j}) < ê_rn(M*)
10.    Set Ŝ_0 ← Ŝ_0 ∪ {j}.

Choice of τ_n. The intermediate rate of divergence for the bootstrap standard deviation τ_n, namely τ_n → ∞ with τ_n/a_n → 0, is a necessary and sufficient condition for the consistency of GBS (Chatterjee & Bose, 2005), and for that of the bootstrap approximation of the population e-values (Theorem 5.4). We select τ_n using the following quantity, which we call the Generalized Bootstrap Information Criterion (GBIC):

GBIC(τ_n) = Σ_{i=1}^n ψ_i(θ̂(Ŝ_0, τ_n), Z_i) + |supp(θ̂(Ŝ_0, τ_n))|,   (4.5)

where θ̂(Ŝ_0, τ_n) is the refitted parameter vector on the index set Ŝ_0 selected by running Algorithm 2 with τ_n. We repeat this for a range of τ_n values, and choose the index set corresponding to the τ_n that gives the smallest GBIC(τ_n). For our synthetic data experiments we take τ_n proportional to log(n) and use GBIC, while for one real data example we use validation on a test set to select the optimal τ_n, both with favorable results.

Detection threshold for finite samples. In practice, especially for smaller sample sizes, sampling uncertainty may make it difficult to ensure that the full model e-value exactly separates the null from the non-null e-values, and small numbers of false positives or negatives may occur in the selected feature set Ŝ_0. One way to tackle this is by shifting the detection threshold in Algorithm 2 by a small δ: ê^δ_rn(M*) = (1 + δ) ê_rn(M*). To prioritize true positives or true negatives, δ can be set to be positive or negative, respectively. In one of our experiments (Section 6.3), setting δ = 0 results in a small number of false positives due to a few null features having e-values close to the full model e-value. Setting δ to small positive values gradually eliminates these false positives.

5. Theoretical results

We now investigate theoretical properties of e-values. Our first result separates inadequate models from adequate models at the population level, and gives a rank ordering of adequate models using their population e-values.

Theorem 5.1. Under conditions (B1)-(B5), for a finite sequence of adequate models M_1 ≺ · · · ≺ M_k and any inadequate models M_{k+1}, . . . , M_K, we have for large n

e_n(M_1) > · · · > e_n(M_k) > max_{j ∈ {k+1, . . . , K}} e_n(M_j).

As n → ∞, e_n(M_i) → E D(Y, [Y]) < ∞, with Y having an elliptical distribution, if i ≤ k; otherwise e_n(M_i) → 0.

We define the data-generating model as M_0 ∈ M with estimable indices S_0 = supp(θ_0) and constants C_0 = 0_{p − |S_0|}. Then we have the following.

Corollary 5.2. Assume the conditions of Theorem 5.1.
Consider the restricted class of candidate models M0 in Section 4.2, where all known coefficients are fixed at 0. Then for large enough n, M0 = arg max M M0[en(M)]. Thus, when only the models with known parameters set at 0 (the model set M0) are considered, e-value indeed maximizes at the true model at the population level. However there are still 2p possible models. This is where the advantage of using e-values their one-step nature comes through. Corollary 5.3. Assume the conditions of Corollary 5.2. Consider the models M j M0 with S j = {1, . . . , p} \ {j} for j = 1, . . . , p. Then covariate j is a necessary component of M0, i.e. M j is an inadequate model, if and only if for large n we have en(M j) < en(M ). Dropping an essential feature from the full model makes the model inadequate, which has very small e-value for large enough n, whereas dropping a non-essential feature increases the e-value (Theorem 5.1). Thus, simply collecting features dropping which cause a decrease in the e-value suffices for feature selection. Following the above results, we establish model selection consistency of Algorithm 2 at the sample level. This means that the probability that the one-step procedure is able to select the true model feature set goes to 1, when the procedure is repeated for a large number of randomly drawn datasets from the data-generating distribution. Theorem 5.4. Consider two sets of generalized bootstrap estimates of ˆθ : T = {ˆθr : r = 1, . . . , R} and T1 = {ˆθr1 : r1 = 1, . . . , R1}. Obtain sample e-values as: ˆern(M) = 1 r1=1 D(ˆθr1m, [T ]), (5.1) Feature Selection using e-values where [T ] is the empirical distribution of the corresponding bootstrap samples, and ˆθr1m are obtained as in Section 4.1. Consider the feature set ˆS0 = {ˆern(M j) < ˆern(M )}. Then as n, R, R1 , P2( ˆS0 = S0) 1, with P2 as probability conditional on data and bootstrap samples. Theorem 5.4 is contingent on the fact that the the true model M0 is a member of M0, that is S0 S . This is ensured trivially when n p. If p > n, M0 is the set of all possible models on the feature set selected by an appropriate screening procedure. In high-dimensional linear models, we use Sure Independent Screening (Fan & Lv, 2008, SIS) for this purpose. Given that S is selected using SIS, Fan & Lv (2008) proved that, for constants C > 0 and κ that depend on the minimum signal in θ0, P(M0 M0) 1 O exp[ Cn1 2κ] For more complex models, model-free filter methods (Koller & Sahami, 1996; Zhu et al., 2011) can be used to obtain S . For example, under mild conditions on the design matrix, the method of Zhu et al. (2011) is consistent: P(|S Sc 0| r) 1 r p + d with positive integers r and d: d = p being a good practical choice (Zhu et al., 2011, Theorem 3). Combining (5.2) or (5.3) with Corollary 5.4 as needed establishes asymptotically accurate recovery of S0 through Algorithm 2. 6. Numerical experiments We implement e-values using a GBS with scaled resample weights Wri Gamma(1, 1) 1, and resample sizes R = R1 = 1000. We use Mahalanobis depth for all depth calculations. Mahalanobis depth is much less computation-intensive than other depth functions (Dyckerhoff & Mozharovskyi, 2016; Liu & Zuo, 2014), but is not usually preferred in applications due to its non-robustness. However, we do not use any robustness properties of data depth, so are able to use it without any concern. For each replication for each data setting and method, we compute performance metrics on test datasets of the same dimensions as the respective training dataset. 
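To make the implementation above concrete, the sketch below puts the pieces together for least-squares energy functions ψ_i(θ) = (y_i − x_i^T θ)²/2: the one-step generalized bootstrap approximation (4.3) with centered Gamma(1, 1) weights, Mahalanobis depth, and the drop-one-feature comparison of Algorithm 2. It is a minimal illustration under our own simplifications (one scaling convention for the weights, a plain matrix inverse, default resample sizes), not a reproduction of the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)

def gbs_resamples(X, y, theta_hat, tau_n, R):
    """One-step GBS draws as in (4.3) for squared-error energy functions:
    psi_i'(theta) = -(y_i - x_i^T theta) x_i and sum_i psi_i'' = X^T X."""
    n, p = X.shape
    resid = y - X @ theta_hat
    hess_inv = np.linalg.inv(X.T @ X)
    # centered, unit-variance resampling weights; Gamma(1, 1) - 1 is one such choice
    W = rng.gamma(1.0, 1.0, size=(R, n)) - 1.0
    score = -(W * resid) @ X                 # row r: sum_i W_ri * psi_i'(theta_hat)
    return theta_hat - tau_n * score @ hess_inv

def mahalanobis_depth(points, cloud):
    mu = cloud.mean(axis=0)
    prec = np.linalg.inv(np.cov(cloud, rowvar=False))
    d = points - mu
    return 1.0 / (1.0 + np.einsum("ri,ij,rj->r", d, prec, d))

def evalue_select(X, y, tau_n, R=1000, R1=1000):
    """Algorithm 2: return indices whose removal lowers the bootstrap e-value."""
    theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    T = gbs_resamples(X, y, theta_hat, tau_n, R)     # approximates [theta_hat]
    T1 = gbs_resamples(X, y, theta_hat, tau_n, R1)   # independent set for plug-ins
    e_full = mahalanobis_depth(T1, T).mean()
    selected = []
    for j in range(X.shape[1]):
        T1_minus_j = T1.copy()
        T1_minus_j[:, j] = 0.0                       # plug in C_j = 0
        if mahalanobis_depth(T1_minus_j, T).mean() < e_full:
            selected.append(j)
    return selected

# small demonstration: 5 non-null features out of 20
n, p = 500, 20
beta0 = np.concatenate([np.ones(5), np.zeros(p - 5)])
X = rng.normal(size=(n, p))
y = X @ beta0 + rng.normal(size=n)
print(evalue_select(X, y, tau_n=np.log(n)))          # expected to recover [0, 1, 2, 3, 4]
```

The drop-one loop reuses the same two resample sets for every feature, which is what keeps the cost at a single model fit plus p + 1 depth computations.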
All our results are based on 1000 such replications.

6.1. Feature selection in linear regression

Given a true coefficient vector β_0, we use the model Z ≡ (Y, X), Y = Xβ_0 + ϵ, with ϵ ∼ N_n(0, σ²I_n), and n = 500 and p = 100. We generate the rows of X independently from N_p(0, Σ_X), where Σ_X has correlation structure (Σ_X)_{ij} = ρ^{|i−j|} with correlation coefficient ρ. Under this basic setup, we consider the following settings to evaluate the effect of different magnitudes of feature correlation in X, the sparsity level in β_0, and the error variance σ.

Setting 1 (effect of feature correlation): We repeat the above setup for ρ = 0.1, . . . , 0.9, fixing β_0 = (1.5, 0.5, 1, 1.5, 1, 0_{p−5}) and σ = 1;

Setting 2 (effect of sparsity level): We consider β_0 = (1_k, 0_{p−k}), with varying sparsity level k = 5, 10, 15, 20, 25. We fix ρ = 0.5, σ = 1;

Setting 3 (effect of noise level): We consider different values of the noise level (equivalent to different signal-to-noise ratios) by testing σ = 0.3, 0.5, . . . , 2.3, fixing β_0 = (1_5, 0_{p−5}) and ρ = 0.5.

Figure 6.1. Performance metrics for n = 500, p = 100 (panels plot log(Prediction Error) against ρ, k, and σ): top, middle and bottom rows indicate simulation settings 1, 2 and 3, respectively. Competing methods include Lasso- and SCAD-penalized regression, stepwise regression using BIC and backward deletion (Step), Knockoff filters (Barber & Candes, 2015), and Mixed Integer Optimization (Bertsimas et al., 2016, MIO).

To implement e-values, we repeat Algorithm 2 for τ_n ∈ {0.2, 0.6, 1, 1.4, 1.8} · log(n)√(log(p)), and select the model having the lowest GBIC(τ_n). Figure 6.1 summarizes the comparison results. Across all three settings, our method consistently produces the most predictive model, i.e. the model with the lowest prediction error. It also produces the sparsest model almost always. Among the competitors, SCAD performs closest to e-values in Setting 2 on both metrics. However, SCAD seems to be more severely affected by a high noise level (i.e. high σ) or high feature correlation (i.e. high ρ). Lasso and Step tend to select models with many null variables, and have higher prediction errors. Performance of the other two methods (MIO, Knockoffs) is middling.

Table 6.1. Mean computation time (in seconds) for all methods.

| Method | e-value | Lasso | SCAD | Step | Knockoff | MIO |
|---|---|---|---|---|---|---|
| Time (s) | 6.3 | 0.4 | 0.9 | 20.1 | 1.9 | 127.2 |

We present computation times for Setting 1, averaged across all values of ρ, in Table 6.1. All computations were performed on a Windows desktop with an 8-core Intel Core i7-6700K 4GHz CPU and 16GB RAM. For e-values, using a smaller number of bootstrap samples or running computations over bootstrap samples in parallel greatly reduces computation time with little to no loss of performance. However, we report its computation time over a single core, as for the other methods. Sparse methods like Lasso and SCAD are much faster, as expected. Our method has lower computation time than Step and MIO, and much better performance.

6.2. High-dimensional linear regression (p > n)

We generate data from the same setup as Section 6.1, but with n = 100, p = 500. We perform an initial screening of features using SIS, then apply e-values and the other methods on this SIS-selected predictor set. Figure 6.2 summarizes the results.
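The SIS screening referenced above can be as simple as ranking features by absolute marginal correlation with the response and retaining the top d of them; the sketch below uses d = n / log(n) as a default, which is one common convention rather than the exact choice made in our experiments. The retained columns can then be passed to the e-value selector (for example the evalue_select sketch shown earlier).

```python
import numpy as np

def sis_screen(X, y, d=None):
    """Sure Independence Screening: keep the d features with the largest
    absolute marginal correlation with y (d defaults to n / log n)."""
    n, p = X.shape
    if d is None:
        d = int(n / np.log(n))
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = (y - y.mean()) / y.std()
    marginal_corr = np.abs(Xc.T @ yc) / n
    keep = np.argsort(marginal_corr)[::-1][:d]
    return np.sort(keep)

# usage with the p > n setup of Section 6.2 (n = 100, p = 500):
# keep = sis_screen(X, y)          # |keep| < n, so the reduced full model is estimable
# selected = evalue_select(X[:, keep], y, tau_n=np.log(len(y)))
# selected_original_indices = keep[selected]
```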
In addition to the competing methods, we report metrics corresponding to the original SIS-selected predictor set as a baseline. Sparsity-wise, e-values produce the most improvement over the SIS baseline among all methods, and track the true sparsity level closely. Both e-values and Knockoffs produce sparser estimates as ρ grows higher (Setting 1). However, unlike Knockoffs, our method maintains good prediction performance even at high feature correlation. In general e-values maintain good prediction performance, although this advantage is less pronounced than in the low-dimensional case. Note that for k = 25 in Setting 2, SIS screening produces overly sparse feature sets, affecting prediction errors for all methods.

Figure 6.2. Performance metrics for n = 100, p = 500, laid out as in Figure 6.1.

6.3. Mixed effect model

We use the simulation setup from Peng & Lu (2012): Y = Xβ + RU + ϵ, where the data Z ≡ (Y, X, R) consist of m independent groups of observations with multiple (n_i) dependent observations in each group, R being the within-group random effects design matrix. We consider 9 fixed effects and 4 random effects, with the true fixed effect coefficient β_0 = (1_2, 0_7) and the random effect coefficient U drawn from N_4(0, Σ_U). The random effect covariance matrix Σ_U has elements (Σ_U)_{11} = 9, (Σ_U)_{21} = 4.8, (Σ_U)_{22} = 4, (Σ_U)_{31} = 0.6, (Σ_U)_{32} = (Σ_U)_{33} = 1, and (Σ_U)_{4j} = 0 for j = 1, . . . , 4, and the error variance of ϵ is set at σ² = 1. The goal here is to select covariates of the fixed effect. We use two scenarios for our study: Setting 1, where the number of groups is m = 30 and the number of observations in each group is n_i = 5, i = 1, . . . , m; and Setting 2, where m = 60 and n_i = 10. Finally, we generate 100 independent datasets for each setting.

Table 6.2. Performance comparison for mixed effect models. Left block: Setting 1 (n_i = 5, m = 30); right block: Setting 2 (n_i = 10, m = 60). Dashes indicate metrics not reported in the original papers.

| Method | FPR | FNR | Acc | MS | FPR | FNR | Acc | MS |
|---|---|---|---|---|---|---|---|---|
| e-value, δ = 0 | 9.4 | 0.0 | 76 | 2.33 | 0.0 | 0.0 | 100 | 2.00 |
| e-value, δ = 0.01 | 6.7 | 0.0 | 82 | 2.22 | 0.0 | 0.0 | 100 | 2.00 |
| e-value, δ = 0.05 | 1.0 | 0.0 | 97 | 2.03 | 0.0 | 0.0 | 100 | 2.00 |
| e-value, δ = 0.1 | 0.3 | 0.0 | 99 | 2.01 | 0.0 | 0.0 | 100 | 2.00 |
| e-value, δ = 0.15 | 0.0 | 0.0 | 100 | 2.00 | 0.0 | 0.0 | 100 | 2.00 |
| SCAD (Peng & Lu, 2012), BIC | 21.5 | 9.9 | 49 | 2.26 | 1.5 | 1.9 | 86 | 2.10 |
| SCAD (Peng & Lu, 2012), AIC | 17 | 11.0 | 46 | 2.43 | 1.5 | 3.3 | 77 | 2.20 |
| SCAD (Peng & Lu, 2012), GCV | 20.5 | 6 | 49 | 2.30 | 1.5 | 3 | 79 | 2.18 |
| SCAD (Peng & Lu, 2012), √(log n / n) | 21 | 15.6 | 33 | 2.67 | 1.5 | 4.1 | 72 | 2.26 |
| M-ALASSO (Bondell et al., 2010) | - | - | 73 | - | - | - | 83 | - |
| SCAD-P (Fan & Li, 2012) | - | - | 90 | - | - | - | 100 | - |
| rPQL (Hui et al., 2017) | - | - | 98 | - | - | - | 99 | - |

We compare e-values with a number of sparse penalized methods: (a) Peng & Lu (2012), which uses the SCAD penalty and different methods of selecting the regularization tuning parameter; (b) the adaptive lasso-based method of Bondell et al. (2010); (c) the SCAD-P method of Fan & Li (2012); and (d) regularized Penalized Quasi-Likelihood (Hui et al., 2017, rPQL). For comparison with Peng & Lu (2012), we present mean false positive (FPR) and false negative (FNR) rates, accuracy (Acc), and model size (MS), i.e. the number of non-zero fixed effects estimated. To compare with the other methods we only use Acc, since they did not report the remaining metrics. To implement e-values by minimizing GBIC(τ_n), we consider τ_n ∈ {1, 1.2, . . . , 4.8, 5}.
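For completeness, the sketch below shows one way to wrap the GBIC criterion (4.5) around the selection routine. It is written for squared-error energy functions with a least-squares refit, which matches the linear model experiments rather than the mixed model likelihood used in this subsection; the grid, the refitting step, and the selector signature are illustrative assumptions.

```python
import numpy as np

def gbic_select_tau(X, y, tau_grid, select_fn):
    """Run the e-value selector for each tau_n, refit on the selected support,
    and return the support minimizing GBIC (4.5): sum_i psi_i + |support|."""
    best = None
    for tau_n in tau_grid:
        support = select_fn(X, y, tau_n)
        if len(support) == 0:
            fitted = np.zeros(len(y))
        else:
            Xs = X[:, support]
            coef = np.linalg.lstsq(Xs, y, rcond=None)[0]
            fitted = Xs @ coef
        gbic = 0.5 * np.sum((y - fitted) ** 2) + len(support)
        if best is None or gbic < best[0]:
            best = (gbic, tau_n, support)
    return best  # (GBIC value, chosen tau_n, selected indices)

# e.g. a grid like the one used for the mixed model study, tau_n in {1, 1.2, ..., 5}:
# gbic, tau_opt, support = gbic_select_tau(X, y, np.arange(1.0, 5.01, 0.2), evalue_select)
```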
To tackle small sample issues in Setting 1 (Section 4.2), we repeat the model selection procedure using the shifted e-values ˆeδ rn( ) for δ {0, 0.01, 0.05, 0.1, 0.15}. Without shifted thresholds, e-values perform commendably in both settings. For Setting 2, it reaches the best possible performance across all metrics. However, we observed that in a few replications of setting 1, a small number of null input features had e-values only slightly lower than the full model e-value, resulting in increased FPR and model size. We experimented with lowered detection thresholds to mitigate this. As seen in Table 6.2, increasing δ gradually eliminates the null features, and e-values reach perfect performance across all metrics for δ = 0.15. 7. Real data examples Indian monsoon data To identify the driving factors behind precipitation during the Indian monsoon season using e-values, we obtain data on 35 potential covariates (see Appendix D) from National Climatic Data Center (NCDC) and National Oceanic and Atmospheric Administration (NOAA) repositories for 1978 2012. We consider annual medians of covariates as fixed effects, log yearly rainfall at a weather station as output feature, and include year-specific random intercepts. To implement e-values, we use projection depth (Zuo, 2003) and GBS resample sizes R = R1 = 1000. We train our model on data from the years 1978-2002, run e-values best subset selection for tuning parameters τn {0.05, 0.1, . . . , 1}. We consider two methods to select the best refitted model: (a) minimizing GBIC(τn), and (b) minimizing forecasting errors on samples from 2003 2012. Figure 7.1a plots the t-statistics for features from the best refitted models obtained by the above two methods. Minimizing for GBIC and test error selects 32 and 26 covariates, respectively. The largest contributors are maximum temperature and elevation, which are related to precipitation based on the Clausius-Clapeyron relation (Li et al., 2017; Singleton & Toumi, 2012). All other selected covariates have documented effects on Indian monsoon (Krishnamurthy & Kinter III, 2003; Moon et al., 2013). Reduced model forecasts obtained from a rolling validation scheme (i.e. i.e. use data from 1978 2002 for 2003, 1979-2003 for 2004 and so on) have less bias across testing years (Fig. 7.1b). Spatio-temporal dependence analysis in f MRI data This dataset is due to Wakeman & Henson (2015), where each of the 19 subjects go through 9 runs of consecutive visual tasks. Blood oxygen level readings are recorded across time as 3D images made of 64 64 33 (total 135,168) voxels. Here we use the data from a single run and task on subject 1, and aim to estimate dependence patterns of readings across 210 time points and areas of the brain. We fit separate regressions at each voxel (Appendix E), with second order autoregressive terms, neighboring voxel readings and one-hot encoded visual task categories in the design matrix. After applying the e-value feature selection, we compute the F-statistic at each voxel using selected coefficients only, and obtain their p-values. Fig. 7.1c highlights voxels with p-values < 0.05. Left and right visual cortex areas show high spatial dependence, with more dependence Feature Selection using e-values del_TT_Deg_Celsius Temp Anomaly t statistic Method GBIC Test Error 2004 2006 2008 2010 2012 3 2 1 0 1 2 3 Full model Reduced model Figure 7.1. (a) Plot of t-statitics for Indian Monsoon data for features selected by at least one of the methods. 
Dashed lines indicate standard Gaussian quantiles at tail probabilities 0.025 and 0.975, (b) Full vs. reduced model rolling forecast comparisons, and (c) f MRI data: smoothed surface obtained from p-values show high spatial dependence in right optic nerve, auditory nerves and auditory cortex (top left), left visual cortex (bottom right) and cerebellum (lower middle). on the left side. Signals from the right visual field obtained by both eyes are processed by the left visual cortex. The lopsided dependence pattern suggests that visual signals from the right side led to a higher degree of processing in our subject s brain. We also see activity in the cerebellum, the role of which in visual perception is well-known (Calhoun et al., 2010; Kirschen et al., 2010). 8. Conclusion In this work, we introduced a new paradigm of feature selection through e-values. The e-values can be particularly helpful in situations where model training is costly and potentially distributed across multiple servers (i.e. federated learning), so that a brute force parallelized approach of training and evaluating multiple models is not practical or even possible. There are three immediate extensions of the framework presented in this paper. Firstly, grouped e-values are of interest to leverage prior structural information on the predictor set. There are no conceptual difficulties in evaluating overlapping and non-overlapping groups of predictors using e-values in place of individual predictors. However technical conditions may be required to ensure a rigorous implementation. Secondly, our current formulation of evalues essentially relies upon the number of samples being more than the effective number of predictors for a unique covariance matrix of the parameter estimate to asymptotically exist. When p > n, using a sure screening method to filter out unnecessary variables works well empirically. However this needs theoretical validation. Thirdly, instead of using mean depth, other functionals of the (empirical) depth distribution such as quantiles can be used as e-values. Similar to stability selection (Meinshausen & Buhlmann, 2010), it may be possible to use the intersection of predictor sets obtained by using a number of such functionals in Algorithm 2 as the final selected set of important predictors. An effective implementation of e-values hinges on the choice of bootstrap method and tuning parameter τn. To this end, we see opportunity for enriching the research on empirical process methods in complex overparametrized models, such as Deep Neural Nets (DNN), which the e-values framework can build up on. Given the current push for interpretability and trustworthiness of DNN-based decision making systems, there is potential for tools to be developed within the general framework of e-values that provide local and global explanations of large-scale deployed model outputs in an efficient manner. Acknowledgements This work is part of the first author (SM) s Ph D thesis (Majumdar, 2017). He acknowledges the support of the University of Minnesota Interdisciplinary Doctoral Fellowship during his Ph D. The research of SC is partially supported by the US National Science Foundation grants 1737918, 1939916 and 1939956 and a grant from Cisco Systems. Reproducibility Code and data for the experiments in this paper are available at https://github.com/shubhobm/e-values. Barber, R. F. and Candes, E. J. Controlling the false discovery rate via knockoffs. Ann. Statist., 43:2055 2085, 2015. 
Feature Selection using e-values Bertsimas, D., King, A., and Mazumder, R. Best Subset Selection via a Modern Optimization Lens. Ann. Statist., 44(2):813 852, 2016. Bickel, P. and Sakov, A. On the Choice of m in the m Out of n Bootstrap and its Application to Confidence Bounds for Extrema. Stat. Sinica, 18:967 985, 2008. Bondell, H. D., Krishna, A., and Ghosh, S. K. Joint variable selection for fixed and random effects in linear mixedeffects models. Biometrics, 66(4):1069 1077, 2010. Calhoun, V. D., Adali, T., and Mcginty, V. B. f MRI activation in a visual-perception task: network of areas detected using the general linear model and independent components analysis. Neuroimage, 14:1080 1088, 2010. Chatterjee, S. The scale enhanced wild bootstrap method for evaluating climate models using wavelets. Statist. Prob. Lett., 144:69 73, 2019. Chatterjee, S. and Bose, A. Generalized bootstrap for estimating equations. Ann. Statist., 33(1):414 436, 2005. Dietz, L. R. and Chatterjee, S. Logit-normal mixed model for Indian monsoon precipitation. Nonlin. Processes Geophys., 21:939 953, sep 2014. Dietz, L. R. and Chatterjee, S. Investigation of Precipitation Thresholds in the Indian Monsoon Using Logit-Normal Mixed Models. In Lakshmanan, V., Gilleland, E., Mc Govern, A., and Tingley, M. (eds.), Machine Learning and Data Mining Approaches to Climate Science, pp. 239 246. Springer, 2015. Dyckerhoff, R. and Mozharovskyi, P. Exact computation of the halfspace depth. Comput. Stat. Data Anal., 98:19 30, 2016. Fan, J. and Lv, J. Sure Independence Screening for Ultrahigh Dimensional Feature Space. J. R. Statist. Soc. B, 70:849 911, 2008. Fan, Y. and Li, R. Variable selection in linear mixed effects models. Ann. Statist., 40(4):2043 2068, 2012. Fang, K. T., Kotz, S., and Ng, K. W. Symmetric multivariate and related distributions. Monographs on Statistics and Applied Probability, volume 36. Chapman and Hall Ltd., London, 1990. Fragoso, T. M., Bertoli, W., and Louzada, F. Bayesian model averaging: A systematic review and conceptual classification. International Statistical Review, 86(1): 1 28, 2018. Ghosh, S., Vittal, H., Sharma, T., et al. Indian Summer Monsoon Rainfall: Implications of Contrasting Trends in the Spatial Variability of Means and Extremes. PLo S ONE, 11:e0158670, 2016. Gosswami, B. N. The Global Monsoon System: Research and Forecast, chapter The Asian Monsoon: Interdecadal Variability. World Scientific, 2005. Guyon, I. and Elisseeff, A. An Introduction to Variable and Feature Selection. J. Mach. Learn. Res., 3:1157 1182, 2003. Huang, H.-C., Hsu, N.-J., Theobald, D. M., and Breidt, F. J. Spatial Lasso With Applications to GIS Model Selection. J. Comput. Graph. Stat., 19:963 983, 2010. Hui, F. K. C., Muller, S., and Welsh, A. H. Joint Selection in Mixed Models using Regularized PQL. J. Amer. Statist. Assoc., 112:1323 1333, 2017. Jornsten, R. Clustering and classification based on the L1 data depth. J. Mult. Anal., 90:67 89, 2004. Kirschen, M. P., Chen, S. H., and Desmond, J. E. Modality specific cerebro-cerebellar activations in verbal working memory: an f MRI study. Behav. Neurol., 23:51 63, 2010. Knutti, R., Furrer, R., Tebaldi, C., et al. Challenges in Combining Projections from Multiple Climate Models. J. Clim., 23:2739 2758, 2010. Koller, D. and Sahami, M. Toward optimal feature selection. In 13th International Conference on Machine Learning, pp. 284 292, 1996. Krishnamurthy, C. K. B., Lall, U., and Kwon, H.-H. Changing Frequency and Intensity of Rainfall Extremes over India from 1951 to 2003. 
J. Clim., 22:4737 4746, 2009. Krishnamurthy, V. and Kinter III, J. L. The Indian monsoon and its relation to global climate variability. In Rodo, X. and Comin, F. A. (eds.), Global Climate: Current Research and Uncertainties in the Climate System, pp. 186 236. Springer, 2003. Lahiri, S. N. Bootstrapping M-estimators of a multiple linear regression parameter. Ann. Statist., 20( ):1548 1570, 1992. Lai, R. C. S., Hannig, J., and Lee, T. C. M. Generalized Fiducial Inference for Ultrahigh-Dimensional Regression. J. Amer. Statist. Assoc., 110(510):760 772, 2015. Lee, H. and Ghosh, S. Performance of Information Criteria for Spatial Models. J. Stat. Comput. Simul., 79:93 106, 2009. Lei, J. et al. Distribution-Free Predictive Inference for Regression. J. Amer. Statist. Assoc., 113:1094 1111, 2018. Li, X., Wang, L., Guo, X., and Chen, D. Does summer precipitation trend over and around the Tibetan Plateau depend on elevation? Int. J. Climatol., 37:1278 1284, 2017. Feature Selection using e-values Liu, X. and Zuo, Y. Computing halfspace depth and regression depth. Commun. Stat. Simul. Comput., 43(5): 969 985, 2014. Ma, Y. and Zhu, L. A review on dimension reduction. International Statistical Review, 81(1):134 150, 2013. Majumdar, S. An Inferential Perspective on Data Depth. Ph D thesis, University of Minnesota, 2017. Meinshausen, N. and Buhlmann, P. Stability selection. J. R. Stat. Soc. B, 72:417 473, 2010. Meza, J. and Lahiri, P. A note on the Cp statistic under the nested error regression model. Surv. Methodol., 31: 105 109, 2005. Moon, J.-Y., Wang, B., Ha, K.-J., and Lee, J.-Y. Teleconnections associated with Northern Hemisphere summer monsoon intraseasonal oscillation. Clim. Dyn., 40(11-12): 2761 2774, 2013. Mosler, K. Depth statistics. In Becker, C., Fried, R., and Kuhnt, S. (eds.), Robustness and Complex Data Structures, pp. 17 34. Springer Berlin Heidelberg, 2013. ISBN 978-3-642-35493-9. Narisetty, N. N. and Nair, V. N. Extremal Depth for Functional Data and Applications. J. Amer. Statist. Assoc., 111:1705 1714, 2016. Natarajan, B. K. Sparse Approximate Solutions to Linear Systems. Siam. J. Comput., 24:227 234, 1995. Nguyen, T. and Jiang, J. Restricted fence method for covariate selection in longitudinal data analysis. Biostatistics, 13(2):303 314, 2014. Peng, H. and Lu, Y. Model selection in linear mixed effect models. J. Multivar. Anal., 109:109 129, 2012. Rousseeuw, P. J. and Hubert, M. Regression Depth. J. Amer. Statist. Assoc., 94:388 402, 1998. Schwarz, G. Estimating the dimension of a model. Ann. Statist., 6(2):461 464, 1978. Shao, J. Bootstrap model selection. J. Amer. Statist. Assoc., 91(434):655 665, 1996. Singh, K., Xie, M., and Strawderman, W. E. Confidence distribution (CD) distribution estimator of a parameter. In Complex Datasets and Inverse Problems: Tomography, Networks and Beyond, volume 54 of Lecture Notes Monograph Series, pp. 132 150. IMS, 2007. Singleton, A. and Toumi, R. Super-Clausius-Clapeyron scaling of rainfall in a model squall line. Quat. J. R. Met. Soc., 139(671):334 339, 2012. Song, L., Smola, A., Gretton, A., Bedo, J., and Borgwardt, K. Feature selection via dependence maximization. Journal of Machine Learning Research, 13(47):1393 1434, 2012. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B, 58(267 288), 1996. Trenberth, K. E. Changes in precipitation with climate change. Clim. Res., 47:123 138, 2011. Trenberth, K. E., Dai, A., Rasmussen, R. M., and Parsons, D. B. The changing character of precipitation. Bull. Am. Meteorol. 
Soc.,, 84:1205 1217, 2003. Tukey, J. W. Mathematics and picturing data. In James, R. (ed.), Proceedings of the International Congress on Mathematics, volume 2, pp. 523 531, 1975. Vander Weele, T. and Ding, P. Sensitivity Analysis in Observational Research: Introducing the E-Value. Ann Intern Med., 167(4):268 274, 2017. Wakeman, D. G. and Henson, R. N. A multi-subject, multimodal human neuroimaging dataset. Scientif. Data, 2: article 15001, 2015. Wang, B., Ding, Q., Fu, X., et al. Fundamental challenge in simulation and prediction of summermonsoon rainfall. Geophys. Res. Lett., 32:L15711, 2005. Wang, L., Kim, Y., and Li, R. Calibrating Nonconvex Penalized Regression in Ultra-high Dimension. Ann. Statist., 41:2505 2536, 2013. Zhang, C.-H. and Zhang, S. S. Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Statist. Soc. B, 76(1):217 242, 2014. Zhu, L.-P., Li, L., Li, R., and Zhu, L.-X. Model-Free Feature Screening for Ultrahigh-Dimensional Data. J. Amer. Statist. Assoc., 106(496):1464 1475, 2011. Zou, H. The Adaptive Lasso and Its Oracle Properties. J. Amer. Statist. Assoc., 101:1418 1429, 2006. Zou, H. and Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301 320, 2005. Zou, H. and Li, R. One-step sparse estimates in nonconcave penalized likelihood models. Ann. Statist., 36:1509 1533, 2008. Zuo, Y. Projection-based depth functions and associated medians. Ann. Statist., 31:1460 1490, 2003. Zuo, Y. and Serfling, R. General notions of statistical depth function. Ann. Statist., 28(2):461 482, 2000. Feature Selection using e-values A. Consistency of Generalized Bootstrap We first state a number of conditions on the energy functions ψi( , ), under which we state and prove two results to ensure consistency of the estimation and bootstrap approximation procedures. In the context of the main paper, the conditions and results here ensure that the full model parameter estimate ˆθ follows a Gaussian sampling distribution, and Generalized Bootstrap (GBS) can be used to approximate this sampling distribution. A.1. Technical conditions Note that several sets of alternative conditions can be developed (Chatterjee & Bose, 2005; Lahiri, 1992), many of which are amenable to our results. However, for the sake of brevity and clarity, we only address the case where the energy function ψi( , ) is smooth in the first argument. This case covers a vast number of models routinely considered in statistics and machine learning. We often drop the second argument from energy function, thus for example ψi θ ψi θ, Zi , and use the notation ˆψki for ψki(ˆθ ), for k = 0, 1, 2. Also, for any function h(θ) evaluated at the true parameter value θ , we use the notation h h(θ ). When A and B are square matrices of identical dimensions, the notation B < A implies that the matrix A B is positive definite. In a neighborhood of θ , we assume the functions ψi are thrice continuously differentiable in the first argument, with the successive derivatives denoted by ψki, k = 0, 1, 2. That is, there exists a δ > 0 such that for any θ = θ + t satisfying t < δ we have d dθψi(θ) := ψ0i(θ) Rp, and for the a-th element of ψ0i(θ), denoted by ψ0i(a)(θ), we have ψ0i(a)(θ) = ψ0i(a)(θ ) + ψ1i(a)(θ )t + 2 1t T ψ2i(a)(θ + ct)t, for a = 1, . . . p, and some c (0, 1) possibly depending on a. We assume that for each n, there is a sequence of σ-fields F1 F2 . . . Fn such that {Pj i=1 ψ0i(θ ), Fj} is a martingale. 
The spectral decomposition of Γ0n := Pn i=1 Eψ0iψT 0i is given by Γ0n = P0nΛ0n P T 0n, where P0n Rp Rp is an orthogonal matrix whose columns contain the eigenvectors, and Λ0n is a diagonal matrix containing the eigenvalues of Γ0n. We assume that Γ0n is positive definite, that is, all the diagonal entries of Λ0n are positive numbers. We assume that there is a constant δ0 > 0 such that λmin(Γ0n) > δ0 for sufficiently large n. Let Γ1i(θ ) be the p p matrix whose a-th row is Eψ1i(a); we assume this expectation exists. Define Γ1n = Pn i=1 Γ1i(θ ). We assume that Γ1n is nonsingular for each n. The singular value decomposition of Γ1n is given by Γ1n = P1nΛ1n QT 1n, where P1n, Q1n Rp Rp are orthogonal matrices, and Λ1n is a diagonal matrix. We assume that the diagonal entries of Λ1n are all positive, which implies that in the population, at the true value of the parameter the energy functional Ψn(θ ) actually achieves a minimal value. We define Λc kn for various real numbers c as diagonal matrices where the j-th diagonal entry of Λkn is raised to the power c, for k = 0, 1. Correspondingly, we define Γc 1n = P1nΛc 1n QT 1n. We assume that there is a constant δ1 > 0 such that λmax(ΓT 1nΓ1n) < δ1 for all sufficiently large n. Define the matrix An = Γ 1/2 0n Γ1n. We assume the following conditions: (C1) The minimum eigenvalue of AT n An tends to infinity. That is, there is a sequence an as n such that λmin Γ1nΓ 1 0n ΓT 1n a2 n. (C.1) (C2) There exists a sequence of positive reals {γn} that is bounded away from zero, such that λmax Γ 1 1n Γ2 0nΓ T 1n = o(γ 2 n ) as n . (C.2) i=1 ψ1i Γ1n A 1 n F = o(pγ 2 n ). (C.3) Feature Selection using e-values where A F denotes the Frobenius norm of matrix A. (C4) For the symmetric matrix ψ2i(a)(θ) and for some δ2 > 0, there exists a symmetric matrix M2i(a) such that sup θ θ <δ2 ψ2i(a)(θ) < M2i(a), i=1 Eλ2 max M2i(a) = o a6 nn 1pγ 2 n . (C.4) (C5) For any vector c Rp with c = 1, we define the random variable Zni = c T Γ 1/2 0n ψi for i = 1, . . . n. We assume that i=1 Z2 ni p 1, and E max i Zni 0. (C.5) (C6) Assume that λmax Γ1nΓ 1 0n ΓT 1n a2 n. (C.6) The technical conditions (C1)-(C5) are extremely broad, and allow for different rates of convergence of different parameter estimators. The additional condition (C6) is a natural condition that, coupled with (C1), ensures identical rate of convergence an for all the parameter estimators in a model. Standard regularity conditions on likelihood and estimating functions that have been routinely assumed in the literature are special cases of the framework above. In such cases, (C1)-(C6) hold with an n1/2, resulting in the standard root-n asymptotics. A.2. Results We first present the consistency and asymptotic normality of the estimation process in Theorem A.1 below. Theorem A.1. Assume conditions (C1)-(C5). Then ˆθ is a consistent estimator of θ , and An(ˆθ θ ) converges weakly to the p-dimensional standard Gaussian distribution. Under the additional condition (C6), we have that an(ˆθ θ ) converges weakly to a Gaussian distribution in p-dimension. Proof of Theorem A.1. We consider a generic point θ = θ + A 1 n t. From the Taylor series expansion, we have ψ0i(a)(θ) = ψ0i(a) + ψ1i(a)A 1 n t + 2 1t T A T n ψ2i(a)( θ )A 1 n t, for a = 1, . . . p, and θ = θ + c A 1 n t for some c (0, 1). Recall our convention that for any function h(θ) evaluated at the true parameter value θ , we use the notation h h(θ ). 
Also define the p-dimensional vector Rn( θ , t) whose a-th element is given by Rn(a)( θ , t) = t T A T n i=1 ψ2i(a)( θ )A 1 n t. Thus we have i=1 ψ0i(θ + A 1 n t) = p 1/2A 1 n i=1 ψ0i + p 1/2A 1 n i=1 ψ1i A 1 n t + 2 1p 1/2A 1 n Rn( θ , t) = p 1/2A 1 n i=1 ψ0i + p 1/2A 1 n Γ1n A 1 n t + p 1/2A 1 n n X i=1 ψ1i Γ1n A 1 n t + 2 1p 1/2A 1 n Rn( θ , t). Feature Selection using e-values Fix ϵ > 0. We first show that there exists a C0 > 0 such that P h p 1/2A 1 n i=1 ψ0i > C0 i < ϵ/2. (A.7) For this, we compute i=1 ψ0i 2 = p 1E i,j=1 ψT 0i A T n A 1 n ψ0j A T n A 1 n E i=1 ψ0iψT 0i = p 1tr A T n A 1 n Γ0n from assumption (C.2). Sn(t) = p 1/2A 1 n n X i=1 ψ0i(θ + A 1 n t) i=1 ψ0i p 1/2Γ 1 1n Γ0nt. We next show that for any C > 0, for all sufficiently large n, we have E h sup t C Sn(t) i2 = o(1). (A.8) This follows from (C.3) and (C.4). Note that Sn(t) = p 1/2A 1 n n X i=1 ψ1i Γ1n A 1 n t + 2 1p 1/2A 1 n Rn( θn, t). Sn(t) p 1/2 sup t C i=1 ψ1i Γ1n A 1 n t + 2 1p 1/2 sup t C A 1 n Rn( θ , t) . We consider each of these terms separately. For any matrix M Rp Rp, we have Mt = sup t C j=1 Mijtj 2i1/2 sup t C j=1 t2 j i1/2 = M F sup t C t = C M F . Using M = A 1 n Pn i=1 ψ1i Γ1n A 1 n and (C.3), we get one part of the result. For the other term, we similarly have h sup t C p 1/2A 1 n Rn( θ , t) i2 = p 1 sup t C A 1 n Rn( θ , t) 2 p 1λmax A T n A 1 n sup t C Rn( θ , t) 2 p 1λmax A 1 n A T n sup t C Rn( θ , t) 2 p 1a 2 n sup t C Rn( θ , t) 2. Feature Selection using e-values Note that sup t C Rn( θ , t) 2 = sup t C Rn( θ , t) 2. Rn( θ , t) 2 = R n(a)( θ , t) 2 i=1 ψ2i(a)( θ )A 1 n t 2 i,j=1 t T A T n ψ2i(a)( θ )A 1 n tt T A T n ψ2j(a)( θ )A 1 n t. Based on this, we have Rn( θ , t) 2 = sup t C i,j=1 t T A T n ψ2i(a)( θ )A 1 n tt T A T n ψ2 nj(a)( θ )A 1 n t i,j=1 t T A T n M2i(a)A 1 n tt T A T n M2j(a)A 1 n t A 1 n t 4 p X i=1 λmax M2i(a) 2 C4nλ2 max A T n A 1 n p X i=1 λ2 max M2i(a) . Putting all these together, we have E h sup t C p 1/2A 1 n Rn( θ , t) i2 = p 1E h sup t C A 1 n Rn( θ , t) i2 p 1a 2 n E h sup t C Rn( θ , t) i2 = O p 1a 2 n E h sup t C Rn( θ , t) i2 = O p 1na 6 n p X i=1 Eλ2 max M2ni(a) using (C.4). Since we have defined Sn(t) = p 1/2A 1 n n X i=1 ψ0i(θ + A 1 n t) i=1 ψ0i p 1/2Γ 1 1n Γ0nt, i=1 ψ0i(θ + p1/2A 1 n t) = Sn(t) + p 1/2A 1 n i=1 ψ0i + A 1 n Γ1n A 1 n t. Feature Selection using e-values n p 1/2t T Γ1n A 1 n i=1 ψ0i(θ + p1/2A 1 n t) o n t T Γ1n Sn(t) + p 1/2t T Γ1n A 1 n i=1 ψ0i + t T Γ1n A 1 n Γ1n A 1 n t o inf t =C t T Γ1n Sn(t) + p 1/2 inf t =C t T Γ1n A 1 n i=1 ψ0i + inf t =C t T Γ1n A 1 n Γ1n A 1 n t Cδ1/2 1 sup |t|=C |Sn(t)| Cδ1/2 1 p 1/2 A 1 n i=1 ψ0i + C2δ0. The last step above utilizes facts like a T b a b . Consequently, defining C1 = Cδ0/δ1/2 1 , we have P h inf t =C n t T Γ1n A 1 n i=1 ψ0i(θ + p1/2A 1 n t) o < 0 i P h sup t =C |Sn(t)| + |A 1 n i=1 ψ0i| > C1 i P h sup t =C Sn(t) > C1/2 i + P h A 1 n i=1 ψ0i > C1/2 i < ϵ, for all sufficiently large n, using (A.7) and (A.8). This implies that with a probability greater than 1 ϵ there is a root Tn of the equations Pn i=1 ψ0i(θ + A 1 n t) in the ball { t < C}, for some C > 0 and all sufficiently large n. Defining ˆθ = θ + A 1 n Tn, we obtain the desired result. Issues like dependence on ϵ and other technical details are handled using standard arguments, see Chatterjee & Bose (2005) for related arguments. Since we have sup t en(M2) for large enough n, when both M1 and M2 are adequate models and M1 M2. Suppose T0 = E(0p, Ip, g). 
Affine invariance implies invariant to rotational transformations, and since the evaluation functions we consider decrease along any ray from the origin because of (B5), E(θ, T0) is a monotonocally decreasing function of θ for any θ Rp. Now consider the models M0 1, M0 2 that have 0 in all indices outside S1 and S2, respectively. Take some θ10 Θ0 1, which is the parameter space corresponding to M0 1, and replace its (zero) entries at indices j S2 \S1 by some non-zero δ Rp |S2\S1|. Denote it by θ1δ. Then we shall have θT 1δθ1δ > θT 10θ10 D(θ10, T0) > D(θ1δ, T0) Es1D(θ10, T0) > Es1D(θ1δ, T0), where Es1 denotes the expectation taken over the marginal of the distributional argument T0 at indices S1. Notice now that by construction θ1δ Θ0 2, the parameter space corresponding to M0 2, and since the above holds for all possible δ, we can take expectation over indices S2 \ S1 in both sides to obtain Es1D(θ10, T0) > Es2D(θ20, T0), with θ20 denoting a general element in Θ20. Combining (A1) and (A2) we get an V 1/2 n (ˆθ θ0) T0. Denote Tn = [an V 1/2 n (ˆθ θ0)], and choose a positive ϵ < (Es1D(θ10, T0) Es2D(θ20, T0))/2. Then, for large enough n we shall have |D(θ10, Tn) D(θ10, T0)| < ϵ |Es1D(θ10, Tn) Es1D(θ10, T0)| < ϵ, following condition (B4). Similarly we have |Es2D(θ20, Tn) Es2D(θ20, T0)| < ϵ for the same n for which the above holds. This implies Es1D(θ10, Tn) > Es2D(θ20, Tn). Feature Selection using e-values Now apply the affine transformation t(θ) = V 1/2 n θ/an+θ0 to both arguments of the depth function above. This will keep the depths constant following affine invariance, i.e. D(t(θ10), [ˆθ ]) = D(θ10, Tn) and D(t(θ20), [ˆθ ]) = D(θ20, Tn). Since this transformation maps Θ0 1 to Θ1, the parameter space corresonding to M1, we get Es1D(t(θ10), [ˆθ ]) > Es2D(t(θ20), [ˆθ ]), i.e. en(M1) > en(M2). Proof of part 2. For the full model M , an(ˆθ θ0) Ep(0, V, g) by (A1). It follows now from a direct application of condition (B3) that en(M ) E(Y, [Y ] where Y E(0p, V, g). For any other adequate model M, we use (B1) property of D: D(ˆθm, [ˆθ ]) = D ˆθm θ0, ˆθ0 θ0 , (B.1) and decompose the first argument ˆθm θ0 = (ˆθm ˆθ ) + (ˆθ θ0). (B.2) We now have ˆθm = θm + a 1 n Tmn, ˆθ = θ + a 1 n T n, where Tmn is non-degenerate at the indices S, and T n E(0p, V, g). For the first summand of the right-hand side in (B.2) we then have ˆθm ˆθ = θm θ0 + Rn, (B.3) where E R2 n = O(a 2 n ), and θmj equals θ0j in indices j S and Cj elsewhere. Since M is adequate, θm = θ0. Thus, substituting the right-hand side in (B.2) we get D ˆθm θ0, h ˆθ θ0 i D ˆθ θ0, h ˆθ θ0 i Rn α, (B.4) from Lipschitz continuity of D( ) given in (B2). Part 2 now follows. Proof of Part 3. Since the depth function D is invariant under location and scale transformations, we have D(ˆθm, [ˆθ ]) = D an(ˆθm θ0), an(ˆθ θ0) . (B.5) Decomposing the first argument, an(ˆθm θ0) = an(ˆθm θm) + an(θm θ0). (B.6) Since M is inadequate, P j / S |Cj θ0j| > 0, i.e. θm and θ0 are not equal in at least one (fixed) index. Consequently as n , an(θm θ0) , thus en(M) 0 by condition (B4). Proof of part 4. For any inadequate model Mj, k < j K, suppose Nj is the integer such that en1(Mj1) < en1(M ) for all n1 > Nj. Part 3 above ensures that such an integer exists for every inadequate model. Now define N = maxk en(Mad) > en(Minad) for any adequate model Mad and inadequate model Minad in M0 and large enough n. Proof of Corollary 5.3. Consider j S0. Then θ0 / M j, hence M j is inadequate. 
By the choice of $n_1$ from Corollary 4.1, the e-values of all inadequate models are less than that of $M_*$, hence $e_{n_1}(M_{-j}) < e_{n_1}(M_*)$. On the other hand, suppose there exists a $j$ such that $e_{n_1}(M_{-j}) < e_{n_1}(M_*)$ but $j \notin S_0$. Now $j \notin S_0$ means that $M_{-j}$ is an adequate model. Since $M_{-j}$ is nested within $M_*$ for any $j$, and the full model is always adequate, we have $e_{n_1}(M_{-j}) > e_{n_1}(M_*)$ by Theorem 4.1, leading to a contradiction and thus completing the proof.

Proof of Theorem 5.4. Corollary 4.2 implies that $S_0 = \{ j : e_n(M_{-j}) < e_n(M_*) \}$. Now define $\tilde{S}_0 = \{ j : e_{rn}(M_{-j}) < e_{rn}(M_*) \}$. We also use an approximation result.

Lemma B.1. For any adequate model $M$, the following holds for fixed $n$ and an exchangeable array of GB resamples $W_r$ as in the main paper:
$$e_{rn}(M) = e_n(M) + R_r, \qquad E_r |R_r|^2 = o_P(1). \qquad (B.7)$$

Using Lemma B.1 for $M_{-j}$ and $M_*$ we now have
$$e_{rn}(M_{-j}) = e_n(M_{-j}) + R_{rj}, \qquad e_{rn}(M_*) = e_n(M_*) + R_{r*},$$
such that $E_r |R_{r*}|^2 = o_P(1)$ and $E_r |R_{rj}|^2 = o_P(1)$ for all $j$. Hence $P_1(\tilde{S}_0 = S_0) \to 1$ as $n \to \infty$, $P_1$ being the probability conditional on the data. Similarly, one can prove that the probability, conditional on the bootstrap samples, that $\tilde{S}_0 = \hat{S}_0$ goes to 1 as $R, R_1 \to \infty$, completing the proof.

B.2. Proofs of auxiliary results

Proof of Lemma B.1. We decompose the resampled depth function as
$$D\big( \hat\theta_{r_1 m}, [\hat\theta_{r*}] \big) = D\big( a_n(\hat\theta_{r_1 m} - \hat\theta_m) + a_n(\hat\theta_m - \hat\theta_*), \; [a_n(\hat\theta_{r*} - \hat\theta_*)] \big).$$
Conditional on the data, $(a_n/\tau_n)(\hat\theta_{r_1 m} - \hat\theta_m)$ has the same weak limit as $a_n(\hat\theta_m - \theta_m)$, and the same holds for $(a_n/\tau_n)(\hat\theta_{r*} - \hat\theta_*)$ and $a_n(\hat\theta_* - \theta_0)$. Now (B.3), combined with the conditions on $\tau_n$, gives $\hat\theta_m - \hat\theta_* \stackrel{P}{\to} 0$ as $n \to \infty$. Lemma B.1 now follows.

C. Details of experiments

Among the competing methods, stepwise regression requires no tuning. MIO requires specifying, at the outset, a range of desired sparsity levels and a time limit for the MIO solver to run at each sparsity level. We set the sparsity range to $\{1, 3, \ldots, 29\}$ in all settings, to cover the sparsity levels across the different simulation settings, and the time limit to 10 seconds. We select the optimal MIO sparsity level as the one for which the resulting estimate gives the lowest BIC. We use BIC to select the optimal regularization parameter for Lasso and SCAD as well. The Knockoff filter comes in two flavors: Knockoff and Knockoff+. We found that Knockoff+ hardly selects any features in our settings, so we use Knockoff for evaluation, setting its false discovery rate threshold at the default value of 0.1.

D. Details of Indian Monsoon data

Various studies indicate that our knowledge of the physical drivers of precipitation in India is incomplete; this is in addition to the known difficulties in modeling precipitation itself (Knutti et al., 2010; Trenberth et al., 2003; Trenberth, 2011; Wang et al., 2005). For example, Goswami (2005) discovered an upward trend in the frequency and magnitude of extreme rain events using daily central Indian rainfall data on a 10° × 12° grid, but a similar study on 1° × 1° gridded data by Ghosh et al. (2016) suggested that there are both increasing and decreasing trends in extreme rainfall events, depending on the location. Additionally, Krishnamurthy et al. (2009) reported increasing trends in exceedances of the 99th percentile
of daily rainfall; however, there is also a decreasing trend in exceedances of the 90th percentile in many parts of India. Significant spatial and temporal variability at various scales has also been discovered for the Indian monsoon (Dietz & Chatterjee, 2014; 2015). We attempt to identify the driving factors behind precipitation during the Indian monsoon season using our e-value based feature selection technique.

Data are obtained from the repositories of the National Climatic Data Center (NCDC) and the National Oceanic and Atmospheric Administration (NOAA) for the years 1978-2012. We obtained data on 35 potential predictors of Indian summer precipitation:

(A) Station-specific (from 36 weather stations across India): latitude, longitude, elevation, maximum and minimum temperature, tropospheric temperature difference (ΔTT), Indian Dipole Mode Index (DMI), Niño 3.4 anomaly;

(B) Global: u-wind and v-wind at 200, 600 and 850 mb; ten indices of the Madden-Julian Oscillation: 20E, 70E, 80E, 100E, 120E, 140E, 160E, 120W, 40W, 10W; teleconnections: North Atlantic Oscillation (NAO), East Atlantic (EA), West Pacific (WP), East Pacific/North Pacific (EPNP), Pacific/North American (PNA), East Atlantic/Western Russia (EAWR), Scandinavia (SCA), Tropical/Northern Hemisphere (TNH), Polar/Eurasia (POL); solar flux; Land-Ocean Temperature Anomaly (TA).

These covariates are all based on existing knowledge of, and conjectures about, the actual physics driving Indian summer precipitation. The references provided earlier in this section, and the references contained therein, may be consulted for background on the physical processes related to Indian monsoon rainfall, which after decades of study remains one of the most challenging problems in climate science.

Figure E.1. (Top) Plot of significant p-values at the 95% confidence level at the specified cross-sections; (bottom) a smoothed surface obtained from the p-values shows high spatial dependence in the right optic nerve, auditory nerves, auditory cortex and left visual cortex areas.

E. Details of fMRI data implementation

Typically, the brain is divided by a grid into three-dimensional array elements called voxels, and activity is measured at each voxel. More specifically, a series of three-dimensional images is obtained by measuring Blood Oxygen Level Dependent (BOLD) signals over a time interval as the subject performs several tasks at specific time points. A single fMRI image typically consists of voxels on the order of $10^5$, which makes even fitting the simplest statistical models computationally intensive when repeated across all voxels to generate inference, e.g. when investigating the differential activation of a brain region in response to a task.

The dataset we work with comes from a recent study involving 19 test subjects and two types of visual tasks (Wakeman & Henson, 2015). Each subject went through 9 runs, in which they were shown faces or scrambled faces at specific time points. In each run 210 images were recorded at 2-second intervals, and each 3D image had dimension 64 × 64 × 33, which means there were 135,168 voxels. Here we use the data from a single run on subject 1, and perform a voxelwise analysis to find the effect of time lags and of BOLD responses at neighboring voxels on the BOLD response at a voxel. Formally, we consider separate models at each voxel $i \in \{1, 2, \ldots, V\}$, with observations across time points $t \in \{1, 2, \ldots, T\}$.
Collecting the stimuli, drift, neighbor and autoregressive terms into a combined design matrix $X = (x(1)^T, \ldots, x(T)^T)^T$ and coefficient vector $\theta_i$, we can write
$$y_i(t) = x(t)^T \theta_i + \epsilon_i(t).$$
We estimate the set of non-zero coefficients in $\theta_i$ using the e-value method. Suppose this set is $R_i$, and its subsets containing the coefficients corresponding to neighbor and non-neighbor (i.e. stimuli and drift) terms are $S_i$ and $T_i$, respectively. To quantify the effect of neighbors we calculate the corresponding F-statistic,
$$F_i = \frac{\sum_{n \in S_i} (x_{i,n} \hat\theta_{i,n})^2 / |S_i|}{\sum_{n \in T_i} (x_{i,n} \hat\theta_{i,n})^2 / |T_i|},$$
and obtain its p-value, i.e. $P(F_{|S_i|, |T_i|} > F_i)$. Figure E.1 shows plots of the voxels with significant p-values from the above F-test, with darker colors associated with lower p-values, as opposed to the smoothed surface in the main paper. Most of the significant effects were due to coefficients corresponding to neighboring terms. A very small proportion of voxels (less than 1%) had any autoregressive effects selected, and most of these were in regions of the image outside the brain, indicating noise.

In future work, we aim to expand on these encouraging findings and repeat the procedure on the other individuals in the study. An interesting direction is to include subject-specific random effects and to correlate subjects' clinical outcomes (if any) with the observed spatial dependency patterns in their brains.
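To make the voxelwise test concrete, the following is a minimal sketch of how the F-statistic and its p-value could be computed once the e-value method has returned the selected index set at a voxel. This is an illustration under our reading of the formula above, not the implementation used for the results reported here; the function name `voxel_f_pvalue`, its arguments, and the synthetic inputs in the usage example are hypothetical, and `scipy.stats.f` supplies the reference F-distribution.

```python
import numpy as np
from scipy.stats import f as f_dist


def voxel_f_pvalue(X, theta_hat, selected, neighbor_cols):
    """Hypothetical sketch: F-test comparing the contribution of selected
    neighbor coefficients (S_i) against selected non-neighbor coefficients
    (T_i, e.g. stimuli and drift terms) at a single voxel.

    X             : (T, q) design matrix for the voxel (time points x terms)
    theta_hat     : (q,) estimated coefficient vector
    selected      : indices with non-zero coefficients (the set R_i)
    neighbor_cols : indices of X corresponding to neighboring-voxel terms
    """
    neighbor_set = set(neighbor_cols)
    S = [j for j in selected if j in neighbor_set]       # neighbor terms
    T = [j for j in selected if j not in neighbor_set]   # stimuli / drift terms
    if not S or not T:
        return np.nan  # F-ratio undefined without both groups

    # Sums of squared fitted contributions, term by term
    ss_S = np.sum((X[:, S] * theta_hat[S]) ** 2)
    ss_T = np.sum((X[:, T] * theta_hat[T]) ** 2)

    F = (ss_S / len(S)) / (ss_T / len(T))
    # p-value: P(F_{|S_i|, |T_i|} > F_i) under the reference F-distribution
    return f_dist.sf(F, dfn=len(S), dfd=len(T))


# Toy usage with synthetic numbers (purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(210, 12))        # 210 time points, 12 design terms
theta_hat = rng.normal(size=12)
p_val = voxel_f_pvalue(X, theta_hat,
                       selected=[0, 2, 3, 7, 9],
                       neighbor_cols=[6, 7, 8, 9, 10, 11])
print(f"voxelwise p-value: {p_val:.3f}")
```

In a full analysis such a routine would be applied at every voxel after the selection step, producing the map of significant p-values summarized in Figure E.1.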