# Discovering Conditionally Salient Features with Statistical Guarantees

Jaime Roquero Gimenez (1), James Zou (2)

(1) Department of Statistics, Stanford University, Stanford, USA. (2) Department of Biomedical Data Science, Stanford University, Stanford, USA. Correspondence to: James Zou.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

The goal of feature selection is to identify important features that are relevant to explain an outcome variable. Most of the work in this domain has focused on identifying globally relevant features, which are features that are related to the outcome using evidence across the entire dataset. We study a more fine-grained statistical problem: conditional feature selection, where a feature may be relevant depending on the values of the other features. For example, in genetic association studies, variant A could be associated with the phenotype in the entire dataset, but conditioned on variant B being present it might be independent of the phenotype. In this sense, variant A is globally relevant, but conditioned on B it is no longer locally relevant in that region of the feature space. We present a generalization of the knockoff procedure that performs conditional feature selection while controlling a generalization of the false discovery rate (FDR) to the conditional setting. By exploiting the feature/response model-free framework of the knockoffs, the quality of the statistical FDR guarantee is not degraded even when we perform conditional feature selection. We implement this method and present an algorithm that automatically partitions the feature space so as to enhance the differences between the sets selected in different regions, and we validate the theoretical statistical results with experiments.

1. Introduction

Given $d$ features $X = (X_1, \ldots, X_d)$ and a response variable $Y$, in many settings only a small subset of the $X_i$ are relevant for $Y$. The goal of global statistical feature selection is to identify those relevant features, while ideally providing some statistical control on the rate of false discoveries. This problem has been extensively studied in the statistics and machine learning literature (Benjamini & Hochberg, 1995; Holm, 1979; Benjamini & Yekutieli, 2001; Efron, 2012).

One scientific domain that requires extensive use of feature selection is genomics. There $Y$ corresponds to a phenotype of interest (e.g. whether a patient has a disease or not), and $X$ contains information about individual mutations at genomic sites with high variability, or single nucleotide polymorphisms (SNPs). Researchers are interested in identifying the small subset of features/SNPs that affect the disease risk based on information gathered from cohorts of participants with and without the phenotype of interest in genome-wide association studies (GWAS) (Hirschhorn & Daly, 2005; McCarthy et al., 2008). We call this global feature selection because its goal is to identify the relevant features for $Y$ regardless of their location in the feature space: GWAS looks for the subset of SNPs that best help to predict the phenotype, regardless of the actual values of the SNPs themselves. The purpose of this work is to extend the feature selection problem to a conditional feature selection problem, where we want to identify relevant features conditionally on being in a subregion of the feature space.
One motivation for this comes again from genomics. Given a SNP (or a set of SNPs) that we believe is relevant for understanding a disease based on prior biological knowledge, or on a previous GWAS study, we may want to run a GWAS conditionally on that SNP having a prescribed value. As an example, a few SNPs in a few loci have been identified as major potential causes of breast cancer, such as the mutations in the BRCA1/2 genes (O'Donovan & Livingston, 2010). We would like to run a GWAS conditioned on not having these variants, in order to identify the other sources of genetic variation that are related to breast cancer.

There are two main advantages to conditional feature selection. First, a feature that is selected through a global feature selection procedure may no longer be relevant on a local scale. In genomics, the presence or absence of a mutation can activate or deactivate different biological pathways; therefore another mutation can be relevant or not for a given phenotype depending on whether its pathway was activated by the first mutation. Being able to distinguish between globally and locally relevant features is therefore essential for understanding the complex interactions between features and outcome. Second, by focusing on a local region, one can increase the power to identify locally relevant features, which could otherwise remain undiscovered by a global feature selection procedure.

Global feature selection procedures are unable to do this kind of conditional feature selection while preserving statistical guarantees (Sun et al., 2010; Tang et al., 2014). Statistical models that allow for inference while conditioning on some value of the features are very limited, relying upon strong assumptions such as simple linear models that are generally not appropriate for the high-dimensional settings of interest. A second approach for conditional feature selection is to restrict the initial dataset to points lying in the region of interest, so that the outcomes of the selection procedure relate only to that region. However, this approach substantially reduces the amount of data available to fit the model, leading to more variance in the estimates. Furthermore, it undermines the validity of the assumptions upon which the construction of p-values relies, so that statistical false discovery guarantees are even less likely to hold.

Our Contributions. We develop a new method which extends the recently developed knockoff procedure to perform conditional feature selection while still guaranteeing that the local false discovery rate (defined in Sec. 2) is controlled. By using the whole dataset to fit our model of the feature distribution $P_X$ during the first step of the procedure, we make sure that the FDR control for our local knockoff method is as valid as in the usual global method. We extend the knockoff machinery so that we are able to localize the second step of the procedure, where we construct local importance scores for each feature. The selected sets are obtained conditionally on being in a subregion of the space. The main contributions of this paper are in laying out the new framework for conditional feature selection along with proposing a new knockoff algorithm with mathematical guarantees. We also validate the algorithm on experiments.
2. Local Null Features and Local False Discovery Rate

We refer to conditional feature selection as the task of identifying a relevant subset of features that explain the response conditionally on being in some subspace of the feature space. As such, we need to define a notion of a null feature conditionally on being in a subspace. We refer to this generalization of null features as local null features; we use local and conditional interchangeably to emphasize the conditioning on a local region of the feature space.

Preliminaries. Given a positive integer $d$, let $[d] = \{1, \ldots, d\}$ and let $\mathcal{P}([d])$ denote the collection of all subsets of $[d]$. For $x \in \mathbb{R}^d$ and $r > 0$, let $B(x, r)$ be the ball centered at $x$ of radius $r$ for the sup norm. For any two mappings $\phi, \psi : \mathbb{R}^d \to \mathcal{P}([d])$, we write $\phi \subseteq \psi$ (resp. $\phi \setminus \psi$) if $\phi(x) \subseteq \psi(x)$ (resp. $\phi(x) \setminus \psi(x)$) for all $x \in \mathbb{R}^d$.

We begin by introducing the usual setting of feature selection procedures. We consider the data $(\mathbb{X}, \mathbb{Y}) = (X_i, Y_i)_{1 \le i \le n}$ as a sequence of $n$ i.i.d. samples from some unknown joint distribution:

$$(X_i, Y_i) = (X_{i1}, \ldots, X_{id}, Y_i) \sim P_{XY}, \qquad i = 1, \ldots, n.$$

We then define the set of null features $\mathcal{H}_0 \subseteq [d]$ such that $j \in \mathcal{H}_0$ if $X_j \perp Y \mid X_{-j}$ (where the $-j$ subscript indicates all variables except the $j$th, bold letters indicate vectors, and double-struck letters the data matrix of stacked vectors). The set of non-null features $\mathcal{H}_1 = [d] \setminus \mathcal{H}_0$, also called alternatives, is important because these capture the truly influential effects, and the goal of selection procedures is to identify them. We denote by $(X, Y)$ the random variables whenever we make probabilistic statements.

Local Null Features. We now consider a local notion of null feature: we are interested in whether a given feature has an impact on the outcome conditionally on the feature vector $X$ being in the neighborhood of some point $x \in \mathbb{R}^d$. Indeed, the set of relevant features may depend on which region of the feature space we are in. We generalize the definition of a null feature as follows:

Definition 2.1. Define the mapping $\mathcal{H}_0^0 : \mathbb{R}^d \to \mathcal{P}([d])$, $x \mapsto \mathcal{H}_0^0(x)$, and $\mathcal{H}_1^0 = [d] \setminus \mathcal{H}_0^0$, such that $j \in \mathcal{H}_0^0(x)$ if $X_j \perp Y \mid X_{-j} = x_{-j}$. We say that the $j$th feature is a local null at $x \in \mathbb{R}^d$ if $j \in \mathcal{H}_0^0(x)$. For $r > 0$, define

$$\mathcal{H}_0^r : \mathbb{R}^d \to \mathcal{P}([d]), \qquad x \mapsto \mathcal{H}_0^r(x) = \bigcap_{z \in B(x, r)} \mathcal{H}_0^0(z).$$

We say that the $j$th feature is an $r$-local null at $x \in \mathbb{R}^d$ if $j \in \mathcal{H}_0^r(x)$. Finally, we define the set of $r$-local non-nulls at $x$ by $\mathcal{H}_1^r(x) = [d] \setminus \mathcal{H}_0^r(x)$.

A straightforward generalization of this definition consists in defining local nulls based on a general subset of $\mathbb{R}^d$, although we keep this simplified version in what follows. From now on we assume that the joint distribution of $(X, Y)$ has a positive density $p(x, y) > 0$ with respect to a base product measure, so that we can write the conditional distribution of $Y \mid X$ as $p(y \mid x)$. Saying $j$ is a (global) null (i.e. $X_j \perp Y \mid X_{-j}$) is equivalent to $p(y \mid x)$ not depending on $x_j$. In contrast, $j$ being a local null at $x$ indicates that, if we fix the values of $x_{-j}$, then the expression $p(y \mid x) = p(y \mid x_j, x_{-j})$, as a function of $y$ and $x_j$, does not depend on $x_j$. An immediate consequence is that the set of local nulls does not vary if we move along the null features. Also, identifying local non-nulls at smaller $r$ amounts to having more detailed information about the distribution $P_{XY}$. Our ability to recover the set of local non-nulls for smaller $r$ will be limited by the availability of data in the $r$-radius neighborhood of $x$.
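As a concrete illustration (a toy example of ours, not taken from the experiments below), suppose the features are independent and

$$Y = \mathbb{1}\{X_1 > 0\}\, X_2 + \epsilon, \qquad \epsilon \perp X.$$

Feature 2 is globally non-null. However, at any $x$ with $x_1 < -r$, every $z \in B(x, r)$ satisfies $z_1 < 0$; conditioning on $X_{-2} = z_{-2}$ then forces the indicator to vanish, so $Y = \epsilon$ is independent of $X_2$ and $2 \in \mathcal{H}_0^r(x)$. Feature 1 acts here as a "switch" feature, a structure we revisit in the experiments of Section 4.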
We present these results more formally as follows:

Proposition 2.1. For any $x, z \in \mathbb{R}^d$ such that $x_{[d] \setminus \mathcal{H}_0^0(x)} = z_{[d] \setminus \mathcal{H}_0^0(x)}$, we have $\mathcal{H}_0^0(x) = \mathcal{H}_0^0(z)$. If $r' > r$, then $\mathcal{H}_0^\infty = \mathcal{H}_0 \subseteq \mathcal{H}_0^{r'} \subseteq \mathcal{H}_0^{r}$, where $\mathcal{H}_0$ is the set of global nulls. Equivalently, $\mathcal{H}_1^{r} \subseteq \mathcal{H}_1^{r'} \subseteq \mathcal{H}_1^{\infty} = \mathcal{H}_1$.

Proof. If $j$ is a local null at $x$, then $p(y \mid x)$ does not depend on $x_j$ when fixing $x_{-j}$. Therefore $j$ is still a local null whenever we change its value provided the other coordinates are fixed. The first assertion is a consequence of this, extended to the whole set of nulls. The second assertion is immediate from the definition $\mathcal{H}_0^r(x) = \bigcap_{z \in B(x, r)} \mathcal{H}_0^0(z)$.

Local False Discovery Rate. The output of a global feature selection procedure is a data-dependent set of selected features $\hat{S} \subseteq \{1, \ldots, d\}$. One popular statistical criterion to evaluate the performance of the procedure is the False Discovery Rate (FDR), which is the expected value of the proportion of false discoveries:

$$\mathrm{FDR} = \mathbb{E}\left[\frac{|\hat{S} \cap \mathcal{H}_0|}{\max(|\hat{S}|, 1)}\right].$$

The ratio $|\hat{S} \cap \mathcal{H}_0| / \max(|\hat{S}|, 1)$ is also called the False Discovery Proportion (FDP). Given a prescribed target $q \in (0, 1)$, we want to guarantee that our procedure's FDR is upper bounded by $q$. If we now switch to a local/conditional perspective, we are looking for procedures that construct mappings $x \mapsto \hat{S}(x)$, where we want $\hat{S}(x)$ to match the set of local non-nulls at $x$. The criterion used to evaluate these new selection mapping procedures should also be point-dependent. We accordingly extend the definitions of FDP and FDR to the local setting.

Definition 2.2. For $r > 0$, and a given feature selection procedure that for $x \in \mathbb{R}^d$ outputs a data-dependent set of selected features $\hat{S}(x)$, we define the $r$-local False Discovery Proportion ($\mathrm{FDP}_r$) and $r$-local False Discovery Rate ($\mathrm{FDR}_r$) as the following functions of $x \in \mathbb{R}^d$:

$$\mathrm{FDP}_r(x) = \frac{|\hat{S}(x) \cap \mathcal{H}_0^r(x)|}{\max(|\hat{S}(x)|, 1)}, \qquad \mathrm{FDR}_r(x) = \mathbb{E}[\mathrm{FDP}_r(x)].$$

The randomness in the $\mathrm{FDP}_r$ expression comes from the fact that the construction of $\hat{S}(x)$ depends on the random dataset of observations $(\mathbb{X}, \mathbb{Y})$. The motivation for this criterion is the following: in a global feature selection procedure with FDR control, we want to construct $\hat{S}$ as close as possible to $\mathcal{H}_1^\infty$, while penalizing whenever $j \in \hat{S} \cap \mathcal{H}_0^\infty$. If we want a more granular approximation of the local non-nulls at $x$ by constructing $\hat{S}(x)$, we need to penalize whenever $j \in \hat{S}(x) \cap \mathcal{H}_0^r(x)$. Indeed, given that $\mathcal{H}_0^\infty \subseteq \mathcal{H}_0^r$ by Proposition 2.1, if we were to construct local selection sets while only controlling the global FDR, we would not penalize selecting a feature $j$ that is locally null in a neighborhood of $x$ even though it is non-null in a far away part of the feature space.

A new issue that arises from this statistical objective is that we no longer need to control a single real value, but a function $\mathbb{R}^d \to [0, 1]$. From now on, we focus on controlling the point-wise $\mathrm{FDR}_r$ at a finite number of distinct points far apart from one another, in an independent manner. A future research direction consists in defining a more comprehensive FDR objective across selection sets at several points in the space. The strength of the local procedure that we now present is that the validity of the statistical guarantees on the local FDR is as strong as for the global procedure. The ability to make true local discoveries at a given point will be limited by the availability of data in the neighborhood of that point; however, the rate of false positives will still be controlled.

3. Controlling Local FDR Through Knockoffs with Local Importance Scores

We now present the global knockoff procedure, and then define the generalization to a local setting through an extension of the importance scores. Our procedure requires the analyst to provide information on the choice of the partitioning of the feature space.
We end this section by providing two algorithms based on heuristics to automate this choice.

3.1. The Knockoff Procedure

Traditionally, feature selection is based on a parametric model that we assume captures the relationship between $X$ and $Y$. By fitting such a model to the available data, under the appropriate statistical assumptions we can assess the probability that each parameter differs from 0, and use it as a proxy for the likelihood that the corresponding feature is relevant for predicting $Y$ (Lippert et al., 2011; Miller, 2002; Tibshirani, 1996). These assumptions on the distribution of $Y \mid X$ are often inadequate, as the complexity of the distribution of the response given the features is seldom captured by a linear regression (or even by more complex models such as generalized linear models). This model mismatch can break down the statistical guarantees of the procedure, or drastically reduce its ability to find relevant features (Friedman et al., 2001). Furthermore, in high-dimensional settings the classic distributional results upon which p-values are constructed (the summary value for assessing the statistical relevance of a feature) are no longer valid, so the statistical guarantees for any selection based on those methods may completely break down (Sur & Candès, 2018).

The knockoff procedure is a recently introduced two-step procedure that avoids the pitfalls of model-based global feature selection by shifting the bulk of the statistical hypotheses from modeling $Y \mid X$ to modeling the distribution $P_X$ of $X$ (Barber et al., 2015; Candès et al., 2018). False Discovery Rate control (i.e. control of the number of false positives in expectation) is guaranteed by the first step of the procedure, which assumes only knowledge of $P_X$: during this step the analyst does not make any assumption on the behavior of $Y \mid X$, and FDR is still controlled no matter which choices are made downstream. This first step constructs what are called knockoff variables $\tilde{X}$ that are used as controls when running the second step of the procedure. That second step aims at getting a high number of true selections (referred to as the power of the procedure), which requires the analyst to fit some appropriate procedure to the joint data $(X, Y)$ among a large pool of options. To summarize, the knockoff procedure decouples the assumptions underlying the statistical guarantee (knowledge of $P_X$ to generate $\tilde{X}$) from the assumptions needed to have high power (fitting a model for $Y \mid X$).

We now introduce some notation for a more formal presentation of the knockoff procedure. For $x, \tilde{x} \in \mathbb{R}^d$, we define the concatenated vector $[x, \tilde{x}] \in \mathbb{R}^{2d}$, and the following swap operation: given a subset $S \subseteq [d]$ of indices, the vector $[x, \tilde{x}]_{\mathrm{swap}(S)} \in \mathbb{R}^d \times \mathbb{R}^d$ corresponds to $[x, \tilde{x}]$ where the coordinates indexed by $S$ are swapped between $x$ and $\tilde{x}$ (we use the same notation whenever the vectors are stacked in a matrix). Also, for any subset of indices $S \subseteq [d]$ we denote by $x_S \in \mathbb{R}^{|S|}$ the restriction of $x$ to the coordinates in $S$. For a set $A$, we denote by $\#A$ its size. Assuming we know the ground truth for the distribution $P_X$, the first step of the knockoff procedure is to obtain a knockoff sample $\tilde{X}$ that satisfies the following conditions (Candès et al., 2018):
Definition 3.1 (Knockoff sample). A knockoff sample $\tilde{X}$ of a $d$-dimensional random variable $X$ is a $d$-dimensional random variable such that two properties are satisfied:

- Conditional independence: $\tilde{X} \perp Y \mid X$
- Exchangeability: $[X, \tilde{X}]_{\mathrm{swap}(S)} \overset{d}{=} [X, \tilde{X}]$ for all $S \subseteq \{1, \ldots, d\}$,

where $\overset{d}{=}$ stands for equality in distribution. The first condition is immediately satisfied as long as knockoffs are sampled conditioned on the sample $X$ without considering any information about $Y$. Satisfying the exchangeability condition is more technical and has recently received widespread attention (Sesia et al., 2017; Jordon et al., 2019; Gimenez et al., 2018; Liu & Zheng, 2018; Romano et al., 2018). Indeed, satisfying the exchangeability condition is the cornerstone of the knockoff procedure upon which the FDR control relies, hence the importance of having accurate mechanisms to generate valid knockoffs.

The next step requires the analyst to apply some procedure to the dataset $([\mathbb{X}, \tilde{\mathbb{X}}], \mathbb{Y})$ that constructs importance scores $[T, \tilde{T}] \in \mathbb{R}^{2d}_{+}$: a large value of $T_j$ (or $\tilde{T}_j$) indicates that the procedure considers that the feature $X_j$ (or $\tilde{X}_j$) has a strong influence on $Y$. Notice that here we consider the full dataset of i.i.d. samples $\mathbb{X}$, and the corresponding dataset of knockoff variables $\tilde{\mathbb{X}}$, which are generated sample-wise from $\mathbb{X}$. The importance scores are random variables constructed based on the whole set of samples. Examples include the absolute values of the coefficients in a linear or logistic regression, the time of first entry of a feature in the LASSO path of an L1-regularized regression, or the drop in accuracy of a trained classifier when perturbing one feature at a time (Barber et al., 2015; Candès et al., 2018; Gimenez et al., 2018; Lu et al., 2018). The only crucial requirement for such importance scores to be valid (for the knockoff procedure) is that they are associated to their corresponding feature, and agnostic to the ordering of the remaining features. That is, for any $S \subseteq [d]$, if the analyst feeds the procedure with the dataset $([\mathbb{X}, \tilde{\mathbb{X}}]_{\mathrm{swap}(S)}, \mathbb{Y})$, then the resulting importance scores are $[T, \tilde{T}]_{\mathrm{swap}(S)}$. The strength and flexibility of the knockoff procedure is that there is no need to assume any model on the joint distribution of $(X, Y)$, and that we are free to choose any method that generates valid importance scores: the FDR control will hold because the knockoff importance scores $\tilde{T}_j$ serve as controls for the $T_j$. For a target FDR level $q \in (0, 1)$, the output of the procedure is the set $\hat{S} = \{j : T_j - \tilde{T}_j \ge \hat{\tau}\}$ based on the data-driven threshold

$$\hat{\tau} = \min\left\{ t > 0 : \frac{1 + \#\{j : T_j - \tilde{T}_j \le -t\}}{\max\big(\#\{j : T_j - \tilde{T}_j \ge t\}, 1\big)} \le q \right\}. \qquad (3)$$

The $j$th feature is selected when the difference between the importance scores of $X_j$ and $\tilde{X}_j$ is larger than $\hat{\tau}$.

Why subsetting the data first does not work. The first naive way to extend the knockoff procedure to a local setting is to run the whole procedure after restricting the initial dataset. For $x \in \mathbb{R}^d$, if we are interested in identifying the set of local non-nulls $\mathcal{H}_1^r(x)$, we could run the knockoff procedure starting with the set $\{(X_i, Y_i) \text{ s.t. } \|X_i - x\|_\infty \le r\}$. The reason why this approach is difficult to implement is that generating $\tilde{X}$ for the conditional distribution of $X$ given that it lies in a neighborhood of $x$ is generally not tractable. Restricting the amount of data harms how well we can fit our model of $P_X$ to the data. Moreover, conditioning on the values of $X$ requires a different process for sampling the knockoffs $\tilde{X} \mid X$ than what is currently available (Barber et al., 2015; Candès et al., 2018; Gimenez et al., 2018; Lu et al., 2018).
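As a concrete illustration of the selection rule in Equation (3), the following minimal Python sketch computes the statistics $W_j = T_j - \tilde{T}_j$ and the data-driven threshold $\hat{\tau}$. The function name `knockoff_select` and the toy scores are ours; the importance scores are assumed to have been produced by any valid procedure as described above.

```python
import numpy as np

def knockoff_select(T, T_tilde, q=0.2):
    """Select features from importance scores T (originals) and T_tilde (knockoffs)
    using the data-driven threshold of Equation (3)."""
    W = np.asarray(T) - np.asarray(T_tilde)      # contrast statistics W_j = T_j - T~_j
    for t in np.sort(np.abs(W[W != 0])):         # candidate thresholds t > 0, smallest first
        fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:                         # smallest t meeting the target gives tau_hat
            return np.flatnonzero(W >= t)        # selected set S_hat
    return np.array([], dtype=int)               # no threshold works: select nothing

# toy usage: 10 features, the first 3 with clearly larger original scores
rng = np.random.default_rng(0)
T = np.concatenate([rng.uniform(2, 3, 3), rng.uniform(0, 1, 7)])
T_tilde = rng.uniform(0, 1, 10)
print(knockoff_select(T, T_tilde, q=0.2))
```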
In order to construct local selection sets at a given point $x$, we need to keep the original knockoff generation process intact, but construct local importance scores. Keeping the statistical guarantee in place requires reformulating the whole knockoff procedure in a local fashion.

Algorithm 1: Knockoffs with Local Importance Scores Feature Selection Procedure
Input: dataset $(\mathbb{X}, \mathbb{Y})$, set of $L$ points $\mathcal{L} = \{z_l\}_{1 \le l \le L}$, radius $r > 0$.
Generate the dataset of knockoff variables $\tilde{\mathbb{X}}$.
for $l = 1$ to $L$ do
  Subsample the data: $(\mathbb{X}^{(l)}, \tilde{\mathbb{X}}^{(l)}, \mathbb{Y}^{(l)}) = \{(X, \tilde{X}, Y) \text{ such that } \|X - z_l\|_\infty \le r \text{ and } \|\tilde{X} - z_l\|_\infty \le r\}$
  Generate ($r$-local) importance scores $T^{(l)}, \tilde{T}^{(l)}$ from $(\mathbb{X}^{(l)}, \tilde{\mathbb{X}}^{(l)}, \mathbb{Y}^{(l)})$
  Construct $\hat{\tau}^{(l)}$ as in Equation (3) from $T^{(l)}, \tilde{T}^{(l)}$
  Construct $\hat{S}_l = \{j \in [d] : T^{(l)}_j - \tilde{T}^{(l)}_j \ge \hat{\tau}^{(l)}\}$
end for
Return: selected sets $(\hat{S}_l)_{1 \le l \le L}$

3.2. Knockoff with Local Importance Scores

Our extension of the knockoff procedure assumes that we have valid knockoffs available for the whole dataset: that is, the first step of constructing knockoffs is exactly the same as in the global knockoff procedure. By using the whole available dataset and making the same assumptions as in the global knockoff setting, we can reuse all the tools already developed for building valid knockoffs, and we do not leave out any useful information that could hurt the validity of the statistical guarantees.

Fix $r > 0$ and consider a set of $L$ points $z_1, \ldots, z_L$ whose pairwise distances are lower bounded by $2r$. We now present Algorithm 1, a procedure that constructs $L$ sets $\hat{S}_1, \ldots, \hat{S}_L \subseteq [d]$ of selected features, one at each point $z_l$, such that we control $\mathrm{FDR}_r$ at any desired target level $q \in (0, 1)$, uniformly over all sets. The crucial aspect of this procedure is that we are able to make a local statement about local FDR uniformly over several points.

The algorithm starts by constructing knockoffs $\tilde{\mathbb{X}}$ from $\mathbb{X}$ as in the global knockoff procedure. It then subsets the whole dataset into $L$ datasets $(\mathbb{X}^{(l)}, \tilde{\mathbb{X}}^{(l)}, \mathbb{Y}^{(l)})$ where, for each $1 \le l \le L$, the samples in the $l$th subset are such that the original vector $X$ and the knockoff sample $\tilde{X}$ are both in an $r$-ball neighborhood of $z_l$ (we denote by $X$ a generic sample among the samples in $\mathbb{X}^{(l)}$). We then construct $r$-local importance scores independently for each subset, which we use to construct a selected set of features as in the global knockoff procedure. As in the global knockoff procedure, a wise choice of importance scores may lead to higher power, i.e. a correct modelling of the relationship between the original features and the response for the dataset $(\mathbb{X}^{(l)}, \tilde{\mathbb{X}}^{(l)}, \mathbb{Y}^{(l)})$. As long as the importance scores satisfy the association requirement stated above, one can use a different method for each $l$: we can use the absolute coefficients of a logistic regression (Candès et al., 2018) for one of the subsets if we know it is a good model for that subset, and a more generic accuracy drop (Gimenez et al., 2018) for another region where we may lack such knowledge.

Theorem 3.1. The output of Algorithm 1 is such that $\mathrm{FDR}_r$ is controlled uniformly over the $z_l$. That is, denoting for each $1 \le l \le L$

$$\mathrm{FDP}_r(z_l) = \frac{|\hat{S}_l \cap \mathcal{H}_0^r(z_l)|}{\max(|\hat{S}_l|, 1)},$$

we have that $\mathbb{E}[\mathrm{FDP}_r(z_l)] \le q$ for all $1 \le l \le L$.
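The following Python sketch mirrors Algorithm 1 under simplifying assumptions: the knockoff matrix `X_tilde` is assumed to have been generated beforehand by any valid knockoff sampler, and the importance-score routine (here, absolute logistic-regression coefficients on the concatenated data, one of the options mentioned above) is only a placeholder. All names and defaults are ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def importance_scores(X, X_tilde, y):
    """Placeholder importance scores: absolute logistic-regression coefficients
    fitted on the concatenated matrix [X, X_tilde]."""
    coef = LogisticRegression(max_iter=1000).fit(np.hstack([X, X_tilde]), y).coef_.ravel()
    d = X.shape[1]
    return np.abs(coef[:d]), np.abs(coef[d:])

def local_knockoff_selection(X, X_tilde, y, centers, r, q=0.2):
    """Sketch of Algorithm 1: one selected set per center z_l, computed from the
    samples whose original and knockoff vectors both lie in the sup-norm ball B(z_l, r).
    Assumes each ball contains enough samples (of both classes) to fit the scores."""
    selected = []
    for z in centers:
        mask = (np.max(np.abs(X - z), axis=1) <= r) & \
               (np.max(np.abs(X_tilde - z), axis=1) <= r)
        T, T_tilde = importance_scores(X[mask], X_tilde[mask], y[mask])
        W = T - T_tilde
        S_hat = np.array([], dtype=int)
        for t in np.sort(np.abs(W[W != 0])):     # data-driven threshold of Equation (3)
            if (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1) <= q:
                S_hat = np.flatnonzero(W >= t)
                break
        selected.append(S_hat)
    return selected
```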
In order to prove this result we need to show that the usual knockoffs can be reused in a local way: this requires redefining and extending the distributional properties underlying the proof of FDR control for the global knockoff procedure in (Candès et al., 2018). We present the technical tools needed to prove this theorem in Section 5.

3.3. Algorithms for Partitioning the Feature Space into Meaningful Sub-regions

We now present the last element needed to obtain a reasonable local selection procedure. The implementation of Algorithm 1 requires the analyst to indicate the points at which the selection sets should be computed. Such knowledge may not be available: it even requires identifying some of the non-nulls that are needed to partition the space into sub-regions with different sets of non-nulls. We now present an approach that hierarchically identifies relevant "switch" features for the purpose of partitioning the feature space, together with the values that these switch features take at the boundaries of the partitions; a second approach is given in Appendix D. These heuristic algorithms only provide input values to Algorithm 1. As such, the statistical guarantees described in Theorem 3.1 always hold.

Importance Score-based Partitioning. Finding the best "switch" feature and its boundary value can be done in a greedy manner; see Algorithm 2 and the sketch that follows. The output will depend on the choice of importance scores used as a subroutine of this algorithm. Fix a subset $I \subseteq [d]$ of features and a set $\Lambda = (\lambda_1, \ldots, \lambda_L) \in \mathbb{R}^L$ of $L$ boundary values. For each pair $(c, \lambda)$ of feature and split value, we split the full dataset $(\mathbb{X}, \mathbb{Y})$ into two sets $(\mathbb{X}^{>,c,\lambda}, \mathbb{Y}^{>,c,\lambda})$ and $(\mathbb{X}^{<,c,\lambda}, \mathbb{Y}^{<,c,\lambda})$ based on the value of the $c$th covariate of each sample $X_i$, and compute importance scores $T^{>,c,\lambda}$ and $T^{<,c,\lambda}$ independently on each split (notice that importance scores do not require the presence of knockoffs: importance scores are constructed from a set of features agnostic to whether those features are original or knockoff).

Algorithm 2: One-Step Greedy Feature Space Partition
Input: subset of indices $I$, split values $\Lambda$, dataset $(\mathbb{X}, \mathbb{Y})$, radius $r$.
for $c$ in $I$ do
  for $\lambda$ in $\Lambda$ do
    Partition the data $(\mathbb{X}, \mathbb{Y})$ according to $X_c > \lambda$ or $X_c < \lambda$ into $(\mathbb{X}^{>,c,\lambda}, \mathbb{Y}^{>,c,\lambda})$ and $(\mathbb{X}^{<,c,\lambda}, \mathbb{Y}^{<,c,\lambda})$.
    Generate importance scores $T^{>,c,\lambda}$ and $T^{<,c,\lambda}$.
    Compute $\mathrm{gap}_{c,\lambda} = \|T^{>,c,\lambda} - T^{<,c,\lambda}\|_\infty$
  end for
end for
Select $c^*, \lambda^* = \arg\max_{c,\lambda} \mathrm{gap}_{c,\lambda}$.
if final recursive step then
  Return $z^<, z^>$ s.t. $z^<_{c^*} \le \lambda^* - r$ and $z^>_{c^*} \ge \lambda^* + r$.
else
  Return $(\mathbb{X}^{>,c^*,\lambda^*}, \mathbb{Y}^{>,c^*,\lambda^*})$ and $(\mathbb{X}^{<,c^*,\lambda^*}, \mathbb{Y}^{<,c^*,\lambda^*})$.
end if

As a method to generate importance scores, we train a classifier on the dataset and compute, for each feature, the drop in accuracy when re-evaluating the trained classifier on the same initial data with that one feature shuffled across samples. We then select the optimal split as the one with the highest gap between the importance scores of the two splits in the $L^\infty$ norm. We can then recursively, depending on the amount of data, run the algorithm on each of the splits, or, if it corresponds to the final step, return a set of vectors lying in distinct subregions, spaced by at least $2r$.

The main disadvantage of this method is its computational cost. Indeed, for a given subset of relevant features, identifying the optimal choice of switch feature and boundary value is quadratic in the size of the subset. We may therefore want to choose an appropriate subset $I$ and set $\Lambda$ as inputs for Algorithm 2 (any global feature selection procedure can be used to identify an appropriate set $I$). Given that we need to repeatedly fit the model, this method may become prohibitive whenever the number of relevant features is too high (i.e. non-sparse or high-dimensional settings). We present an alternative approach to overcome computational bottlenecks in Appendix D.
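A minimal sketch of the one-step greedy split, using the accuracy-drop scores described above as the subroutine; the choice of a random forest classifier, the `min_size` guard, and all helper names are ours and purely illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def accuracy_drop_scores(X, y):
    """Importance scores: drop in training accuracy when each feature is shuffled."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    base = clf.score(X, y)
    rng = np.random.default_rng(0)
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])      # break the link between feature j and y
        scores[j] = base - clf.score(Xp, y)
    return scores

def greedy_split(X, y, candidate_features, split_values, min_size=50):
    """One-step greedy partition (in the spirit of Algorithm 2): pick the (feature, value)
    pair that maximizes the sup-norm gap between importance scores on the two halves."""
    best = (None, None, -np.inf)
    for c in candidate_features:
        for lam in split_values:
            above, below = X[:, c] > lam, X[:, c] < lam
            if above.sum() < min_size or below.sum() < min_size:
                continue                          # skip splits leaving too little data
            gap = np.max(np.abs(accuracy_drop_scores(X[above], y[above])
                                - accuracy_drop_scores(X[below], y[below])))
            if gap > best[2]:
                best = (c, lam, gap)
    return best                                   # (switch feature c*, boundary lambda*, gap)
```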
4. Experiments

We now run experiments, and the first goal is to show that our main theorem holds. Among the many available distributions for $P_X$ that allow for efficient sampling of knockoffs, we choose a Hidden Markov Model (HMM) (Sesia et al., 2017) to emulate a GWAS study. We therefore sample the datasets $\mathbb{X}, \tilde{\mathbb{X}}$ so that $\mathbb{X}$ simulates the SNPs of a cohort of patients, i.e. a matrix of 0s, 1s, and 2s. We generate a complex response $\mathbb{Y}$ in the following way: we randomly pick a subset of $q$ features $S_0 = \{s_1, \ldots, s_q\}$ that correspond to the global non-nulls. However, we design the response in such a way that not all global non-nulls are local non-nulls at every point. Our design of the response begins by choosing switch features: based on the value of $X_{s_1}$ (greater or lower than 1.5), we select the first or second half of the remaining global non-null indices $\{s_2, \ldots, s_q\}$, and we repeat this process one more time (for example, whenever $X_{s_1} > 1.5$ we look at whether $X_{s_2} > 1.5$, in which case we end up picking the first half of $\{s_3, \ldots, s_{q/2}\}$). We end up, for each individual sample $X_i = (X_{i,1}, \ldots, X_{i,d})$ in the dataset $\mathbb{X}$, with a subset of indices $S_i \subseteq S_0 \subseteq [d]$ (in the case where $X_{i,s_1}, X_{i,s_2} > 1.5$, we then have $S_i = \{s_3, \ldots, s_{q/4}\}$). We then generate $Y_i$ as a Bernoulli variable whose probability is given by a linear combination of the values $X_{i,S_i}$ (i.e. it can be modeled through a logistic regression on that subset of the space). This way, the set of local non-nulls varies depending on the values of the switch features.

Figure 1. Local Power and Local FDR as a function of sample size (5,000 to 50,000 samples). The top panel reports average power and the bottom panel average FDR, for the global procedure (full space), a medium radius (2 partitions), and a small radius (4 partitions). For several partitions, the dots indicate the local power / local FDR (averaged over 20 runs) at each of the partitions (1, 2, or 4). The lines correspond to the averages across partitions and 20 runs, with estimates of the standard errors. The target FDR of 0.2 is shown as a horizontal line.

We now run Algorithm 1 at three "resolution levels", i.e. for $L = 1, 2$, and $4$ points $(z_l)_{1 \le l \le L}$ (so that we respectively construct 1, 2, and 4 selection sets $(\hat{S}_l)_{1 \le l \le L}$), each time using an optimal choice of points that we assume are given and correspond to the optimal partitions in our response-generating process. Our target FDR is $q = 0.2$, and locally we use as importance scores the absolute values of the coefficients of a logistic regression. We report the results in Figure 1, where the dots show (for each resolution level) the local FDR/power at each individual set of the partition. It is crucial to notice that the FDR/power scale here is tied to the procedure: as we refine the partition, we consider local FDR/power on a smaller scale. At the highest resolution, i.e. $L = 4$, for the point $z_l$ corresponding to the region $X_{s_1}, X_{s_2} > 1.5$, we evaluate the FDR and power of the output $\hat{S}_l$ of the procedure locally with respect to $S_{0l} = \{s_3, \ldots, s_{q/4}\}$.
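To make the construction of the sample-specific non-null sets $S_i$ concrete, here is a simplified sketch (ours) of the response-generating mechanism described above; the coefficient values are illustrative and the HMM sampling of $\mathbb{X}$ is omitted.

```python
import numpy as np

def generate_response(X, S0, rng, depth=2):
    """Simplified sketch of the simulated response. At each of `depth` levels the first
    remaining index of S0 acts as a switch (threshold 1.5) that selects the first or
    second half of the remaining global non-nulls; Y is then Bernoulli with a logistic
    link on the surviving subset, so the local non-nulls depend on the switch values."""
    beta = {j: rng.normal() for j in S0}          # illustrative coefficients
    y = np.empty(len(X), dtype=int)
    for i, x in enumerate(X):
        idx = list(S0)
        for _ in range(depth):
            switch, idx = idx[0], idx[1:]
            half = len(idx) // 2
            idx = idx[:half] if x[switch] > 1.5 else idx[half:]
        logit = sum(beta[j] * x[j] for j in idx)
        y[i] = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
    return y
```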
Our procedure is able, in every setting, to retrieve the correct subset of non-nulls at the corresponding scale while controlling FDR at that scale. The run with just one point corresponds to the usual global feature selection problem and selects features that are globally non-null. Whenever we choose 2 or 4 points, we output several selection sets that correspond to the local non-nulls in the corresponding regions. However, to appreciate the real interest of the local feature selection problem one needs to consider what happens to the features that are non-null in one subspace but null in another (for example, $X_{s_3}$ is non-null only when $X_{s_1}, X_{s_2} > 1.5$). The fact that our procedure controls local FDR indicates that, in a given subspace, it properly identifies the true local non-nulls, and does not select features that are null in that subspace, albeit non-null somewhere else. Figure 2 illustrates this phenomenon: we run the procedure again (for different values of $L$) but evaluate, for each run, the local FDR at the highest resolution (the four partitions), averaged across those four regions. This shows that, if we were only interested in a phenomenon in a subregion of the feature space, the output of the global feature selection procedure would report a large proportion of features that are actually irrelevant in the corresponding subregion. Notice also that our simulations were done in dimension 200, and we vary the size of the dataset to show that our method still controls local FDR even when the number of points in a given subregion is very limited: we indeed have lower power in these settings, but our statistical guarantees always hold.

Figure 2. Local FDR at the highest resolution ($L = 4$) for procedures run at different resolutions (global, 2-partition, and 4-partition procedures), as a function of the number of samples. Whenever we evaluate the results of a low-resolution procedure (global feature selection) with a high-resolution FDR (small radius with 4 partitions), the coarser methods cannot control the finer FDR.

5. Theoretical Analysis of Local Feature FDR

Our main result, presented in Theorem 3.1, holds regardless of the method used to partition the space. It is based upon a generalization of several notions defined in the first introductions of the knockoff procedure (Barber et al., 2015; Candès et al., 2018), which we now present.

Local Swaps and Local Exchangeability. We extend the definition of the swap to mappings from $\mathbb{R}^d \times \mathbb{R}^d$ to $\mathbb{R}^d \times \mathbb{R}^d$, where the set of indices $S$ used to swap depends on the inputs of the mapping.

Definition 5.1. We call a mapping $\sigma : \mathbb{R}^d \to \mathcal{P}([d])$ a generic swap, or simply a swap. In addition, we say that a swap $\sigma$ is a local swap if for any $x, z \in \mathbb{R}^d$,

$$x_{[d] \setminus \sigma(x)} = z_{[d] \setminus \sigma(x)} \;\Longrightarrow\; \sigma(x) = \sigma(z).$$

Given a mapping

$$[F, \tilde{F}] : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d \times \mathbb{R}^d, \qquad (x, \tilde{x}) \mapsto \big(F_1(x, \tilde{x}), \ldots, F_d(x, \tilde{x}), \tilde{F}_1(x, \tilde{x}), \ldots, \tilde{F}_d(x, \tilde{x})\big),$$

and a swap $\sigma$, define the operation $[F, \tilde{F}]_{\mathrm{swap}(\sigma)}$ as the mapping $[F, \tilde{F}]_{\mathrm{swap}(\sigma)} : (x, \tilde{x}) \mapsto \big([F, \tilde{F}](x, \tilde{x})\big)_{\mathrm{swap}(\sigma(x))}$. We develop this notion and additional notation further in Appendix A. Our goal now is to show that the exchangeability condition that defines a knockoff variable implies a stronger distributional result.

Proposition 5.1. Let $\sigma$ be a local swap. If $\tilde{X}$ is a knockoff random variable for $X$ (i.e. it satisfies exchangeability), then

$$[X, \tilde{X}]_{\mathrm{swap}(\sigma)} \overset{d}{=} [X, \tilde{X}], \qquad (4)$$

which we refer to as local exchangeability. If in addition $\sigma \subseteq \mathcal{H}_0^0$, then

$$\big([X, \tilde{X}]_{\mathrm{swap}(\sigma)}, Y\big) \overset{d}{=} \big([X, \tilde{X}], Y\big). \qquad (5)$$

We prove this result in Section B.1.
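To make the location-dependent swap of Definition 5.1 concrete, here is a small sketch (ours) that applies a swap $\sigma(x)$ row by row to the concatenated data $[\mathbb{X}, \tilde{\mathbb{X}}]$; the example $\sigma$ below is a local swap in the sense of Definition 5.1.

```python
import numpy as np

def apply_swap(X, X_tilde, sigma):
    """Row-wise swap: for each sample x_i, exchange the coordinates in sigma(x_i)
    between the original vector and its knockoff."""
    X_s, Xt_s = X.copy(), X_tilde.copy()
    for i, x in enumerate(X):
        S = sorted(sigma(x))                     # the swapped indices may depend on x
        X_s[i, S], Xt_s[i, S] = X_tilde[i, S], X[i, S]
    return X_s, Xt_s

# Example of a local swap: swap feature 2 only on the half-space {x_0 < 0}.
# If z agrees with x outside sigma(x), then z_0 = x_0, hence sigma(z) = sigma(x).
sigma = lambda x: {2} if x[0] < 0 else set()
```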
This result extends the exchangeability property and Lemma 3.2 in (Candès et al., 2018). Instead of swapping a fixed set of features, we now allow the swapped indices to depend on the features. Notice that the knockoffs $\tilde{X}$ are constructed as in the general case: local exchangeability does not require a different definition of knockoff variables. We extend the definition of a swap to probability distributions: for $\mu \in \mathrm{Pr}(\mathbb{R}^d \times \mathbb{R}^d)$, the set of probability measures on $\mathbb{R}^d \times \mathbb{R}^d$, we denote $\mu_{\mathrm{swap}(\sigma)} := \mathcal{L}([X, \tilde{X}]_{\mathrm{swap}(\sigma)})$ whenever $\mu = \mathcal{L}([X, \tilde{X}])$. Abusing notation, whenever $\mu = \mathcal{L}([X, \tilde{X}], Y)$ we still denote $\mu_{\mathrm{swap}(\sigma)} := \mathcal{L}([X, \tilde{X}]_{\mathrm{swap}(\sigma)}, Y)$.

Local Feature Statistics. The next step is to extend the construction of feature statistics to the local setting.

Definition 5.2. Define local importance scores as a mapping

$$\Phi : \mathrm{Pr}(\mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}) \to \big(\mathbb{R}^d \to \mathbb{R}^d \times \mathbb{R}^d\big), \qquad \mu \mapsto (T^\mu, \tilde{T}^\mu), \qquad (6)$$

$$(T^\mu, \tilde{T}^\mu) : z \mapsto (T^\mu(z), \tilde{T}^\mu(z)), \qquad (7)$$

such that, for any $S \subseteq [d]$, we have $\Phi(\mu_{\mathrm{swap}(S)}) = [\Phi(\mu)]_{\mathrm{swap}(S)}$. For $r > 0$, we say that such importance scores $\Phi$ are $r$-local if, for any $\mu \in \mathrm{Pr}(\mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R})$, $\Phi(\mu)(z) = (T^\mu(z), \tilde{T}^\mu(z))$ depends on $\mu$ only through the restriction of $\mu$ to $B(z, r) \times B(z, r) \times \mathbb{R}$. That is, if $\mu, \mu'$ are two probability measures on $\mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}$ that coincide on $B(z, r) \times B(z, r) \times \mathbb{R}$, then $(T^\mu(z), \tilde{T}^\mu(z)) = (T^{\mu'}(z), \tilde{T}^{\mu'}(z))$.

The next goal consists in translating the swap operation in $\mu_{\mathrm{swap}(\sigma)} = \mathcal{L}([X, \tilde{X}]_{\mathrm{swap}(\sigma)}, Y)$ into a swap of the scores, $[T^\mu, \tilde{T}^\mu]_{\mathrm{swap}(\sigma)}$. This step does not require $\tilde{X}$ to be a knockoff of $X$: in what follows we do not make any assumption on $\mu$. Notice that the swap operation has been defined (Definition 5.1) as a transformation of a mapping $\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d \times \mathbb{R}^d$, but it immediately extends to mappings $\mathbb{R}^d \to \mathbb{R}^d \times \mathbb{R}^d$. We are able to relate $[T^{\mu_{\mathrm{swap}(\sigma)}}, \tilde{T}^{\mu_{\mathrm{swap}(\sigma)}}]$ and $[T^\mu, \tilde{T}^\mu]_{\mathrm{swap}(\sigma)}$ if we assume locality constraints on the importance scores.

Definition 5.3. For a local swap $\sigma : \mathbb{R}^d \to \mathcal{P}([d])$ and $r > 0$, define $A_r^\sigma = \{z \in \mathbb{R}^d : \sigma(z) = \sigma(y) \text{ for all } y \in B(z, r)\}$, the set of points around which $\sigma$ is constant at radius $r$.

Proposition 5.2. Assume there exists $r > 0$ such that the importance scores are $r$-local. Then, assuming $A_r^\sigma$ is non-empty, for $z \in A_r^\sigma$ we have:

$$\big[(T^\mu(z), \tilde{T}^\mu(z))\big]_{\mathrm{swap}(\sigma(z))} = \big[T^{\mu_{\mathrm{swap}(\sigma)}}(z), \tilde{T}^{\mu_{\mathrm{swap}(\sigma)}}(z)\big].$$

We prove this result in Section B.2. Whenever $\mu = \mathcal{L}([X, \tilde{X}], Y)$ where $\tilde{X}$ is a knockoff of $X$, consider a local swap $\sigma$ such that $\sigma \subseteq \mathcal{H}_0^0$. By Proposition 5.1, we get that $\mu = \mu_{\mathrm{swap}(\sigma)}$, and therefore $(T^\mu, \tilde{T}^\mu) = (T^{\mu_{\mathrm{swap}(\sigma)}}, \tilde{T}^{\mu_{\mathrm{swap}(\sigma)}})$. In practice, instead of using $\mu$ as the input when constructing importance scores, we use $\hat{\mu}_n$, the empirical measure defined by the dataset of $n$ i.i.d. samples from $\mu$ that we feed to the algorithm. A consequence of Proposition 5.1, also taking into account any additional randomness when generating the importance scores, is then that for any $z \in \mathbb{R}^d$:

$$\big(T^{\hat{\mu}_n}(z), \tilde{T}^{\hat{\mu}_n}(z)\big) \overset{d}{=} \big(T^{\hat{\mu}_{n, \mathrm{swap}(\sigma)}}(z), \tilde{T}^{\hat{\mu}_{n, \mathrm{swap}(\sigma)}}(z)\big).$$

We now combine this equality with that of Proposition 5.2. Fix $r > 0$ such that we have $r$-local importance scores. Consider a set of $L$ points $z_1, \ldots, z_L \in \mathbb{R}^d$ that are pairwise $2r$ far apart, that is, for any $1 \le l, l' \le L$, $\|z_l - z_{l'}\|_\infty \ge 2r$. We can now conclude thanks to Proposition B.3 in the Appendix, which implies the flip-sign property first introduced in (Candès et al., 2018). This is all that is needed to construct an adaptive threshold $\hat{\tau}^{(l)}$ based on $T^{\hat{\mu}_n}(z_l), \tilde{T}^{\hat{\mu}_n}(z_l)$, and the corresponding selected set $\hat{S}_l$, such that FDR is controlled. That is, according to Theorem 3.4 in (Candès et al., 2018), the set $\hat{S}_l$ satisfies $\mathbb{E}\big[|\hat{S}_l \cap \mathcal{H}_0^r(z_l)| / \max(|\hat{S}_l|, 1)\big] \le q$, hence the result. We provide additional details in Appendix A.
6. Discussion

We develop a new framework for selecting features conditionally on being in a specific region of the feature space, for which we define the notions of local null features and local FDR. The fact that FDR is controlled even though the analyst does not need to model the relationship between the outcome and the feature variables makes the knockoff procedure stand out among feature selection procedures. We extend the usage of the knockoff procedure to deal with our local feature selection problem, in such a way that there is no need to strengthen the assumptions required for the statistical guarantees to hold.

Acknowledgments

J.R.G. is supported by a Stanford Graduate Fellowship. J.Z. is supported by a Chan Zuckerberg Biohub Investigator grant and National Science Foundation (NSF) Grant CRII 1657155.

References

Barber, R. F., Candès, E. J., et al. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055-2085, 2015.

Benjamini, Y. and Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), pp. 289-300, 1995.

Benjamini, Y. and Yekutieli, D. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, pp. 1165-1188, 2001.

Candès, E., Fan, Y., Janson, L., and Lv, J. Panning for gold: model-X knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2018.

1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature, 526(7571):68, 2015.

Efron, B. Large-scale inference: empirical Bayes methods for estimation, testing, and prediction, volume 1. Cambridge University Press, 2012.

Friedman, J., Hastie, T., and Tibshirani, R. The elements of statistical learning, volume 1. Springer Series in Statistics, New York, NY, USA, 2001.

Gimenez, J. R., Ghorbani, A., and Zou, J. Knockoffs for the mass: new feature importance statistics with false discovery guarantees. arXiv preprint arXiv:1807.06214, 2018.

Hirschhorn, J. N. and Daly, M. J. Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics, 6(2):95, 2005.

Holm, S. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, pp. 65-70, 1979.

Jordon, J., Yoon, J., and van der Schaar, M. KnockoffGAN: Generating knockoffs for feature selection using generative adversarial networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ByeZ5jC5YQ.

Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730, 2017.

Lippert, C., Listgarten, J., Liu, Y., Kadie, C. M., Davidson, R. I., and Heckerman, D. Fast linear mixed models for genome-wide association studies. Nature Methods, 8(10):833, 2011.

Lipton, Z. C. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.

Liu, Y. and Zheng, C. Auto-encoding knockoff generator for FDR controlled variable selection. arXiv preprint arXiv:1809.10765, 2018.

Lu, Y., Fan, Y., Lv, J., and Noble, W. S. DeepPINK: reproducible feature selection in deep neural networks. In Advances in Neural Information Processing Systems, pp. 8690-8700, 2018.

McCarthy, M. I., Abecasis, G. R., Cardon, L. R., Goldstein, D. B., Little, J., Ioannidis, J. P., and Hirschhorn, J. N. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature Reviews Genetics, 9(5):356, 2008.
Miller, A. Subset selection in regression. Chapman and Hall/CRC, 2002.

O'Donovan, P. J. and Livingston, D. M. BRCA1 and BRCA2: breast/ovarian cancer susceptibility gene products and participants in DNA double-strand break repair. Carcinogenesis, 31(6):961-967, 2010.

Romano, Y., Sesia, M., and Candès, E. J. Deep knockoffs. arXiv preprint arXiv:1811.06687, 2018.

Sesia, M., Sabatti, C., and Candès, E. J. Gene hunting with knockoffs for hidden Markov models. arXiv preprint arXiv:1706.04677, 2017.

Sun, Y., Todorovic, S., and Goodison, S. Local-learning-based feature selection for high-dimensional data analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1610-1626, 2010.

Sur, P. and Candès, E. J. A modern maximum-likelihood theory for high-dimensional logistic regression. arXiv preprint arXiv:1803.06964, 2018.

Tang, J., Alelyani, S., and Liu, H. Feature selection for classification: A review. Data Classification: Algorithms and Applications, pp. 37, 2014.

Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pp. 267-288, 1996.