Sobolev Independence Criterion

Youssef Mroueh, Tom Sercu, Mattia Rigotti, Inkit Padhi, Cicero Dos Santos
IBM Research & MIT-IBM Watson AI Lab
mroueh,mrigotti@us.ibm.com, inkit.padhi@ibm.com

Abstract

We propose the Sobolev Independence Criterion (SIC), an interpretable dependency measure between a high dimensional random variable X and a response variable Y. SIC decomposes into a sum of feature importance scores and hence can be used for nonlinear feature selection. SIC can be seen as a gradient-regularized Integral Probability Metric (IPM) between the joint distribution of the two random variables and the product of their marginals. We use sparsity-inducing gradient penalties to promote input sparsity of the critic of the IPM. In the kernel version we show that SIC can be cast as a convex optimization problem by introducing auxiliary variables that play an important role in feature selection, as they are normalized feature importance scores. We then present a neural version of SIC where the critic is parameterized as a homogeneous neural network, improving its representation power as well as its interpretability. We conduct experiments validating SIC for feature selection on synthetic and real-world datasets. We show that SIC enables reliable and interpretable discoveries when used in conjunction with the holdout randomization test and knockoffs to control the False Discovery Rate. Code is available at http://github.com/ibm/sic.

1 Introduction

Feature selection is an important problem in statistics and machine learning for interpretable predictive modeling and scientific discoveries. Our goal in this paper is to design a dependency measure that is interpretable and can be reliably used to control the False Discovery Rate in feature selection. The mutual information between two random variables X and Y is the most commonly used dependency measure.
The mutual information I(X; Y) is defined as the Kullback-Leibler divergence between the joint distribution p_xy of (X, Y) and the product of their marginals p_x p_y, i.e. I(X; Y) = KL(p_xy, p_x p_y). Mutual information is, however, challenging to estimate from samples, which motivated the introduction of dependency measures based on other f-divergences or Integral Probability Metrics [1] than the KL divergence. For instance, the Hilbert-Schmidt Independence Criterion (HSIC) [2] uses the Maximum Mean Discrepancy (MMD) [3] to assess the dependency between two variables, i.e. HSIC(X, Y) = MMD(p_xy, p_x p_y), which can be easily estimated from samples via kernel mean embeddings in a Reproducing Kernel Hilbert Space (RKHS) [4]. In this paper we introduce the Sobolev Independence Criterion (SIC), a form of gradient-regularized Integral Probability Metric (IPM) [5, 6, 7] between the joint distribution and the product of marginals. SIC relies on the statistics of the gradient of a witness function, or critic, for both (1) defining the IPM constraint and (2) finding the features that discriminate between the joint and the marginals. Intuitively, the magnitude of the average gradient with respect to a feature gives an importance score for each feature; hence, promoting its sparsity is a natural constraint for feature selection. The paper is organized as follows: we show in Section 2 how sparsity-inducing gradient penalties can be used to define an interpretable dependency measure that we name the Sobolev Independence Criterion (SIC). We devise an equivalent, computationally friendly formulation of SIC in Section 3 that gives rise to additional auxiliary variables η_j. These naturally define normalized feature importance scores that can be used for feature selection.

(Footnote: Tom Sercu is now with Facebook AI Research, and Cicero Dos Santos with Amazon AWS AI. The work was done when they were at IBM Research.)

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
In Section 4 we study the case where the SIC witness function f is restricted to an RKHS and show that it leads to an optimization problem that is jointly convex in f and the importance scores η. We show that in this case SIC decomposes into a sum of feature scores, which is ideal for feature selection. In Section 5 we introduce a neural version of SIC, which preserves the advantages in terms of interpretability when the witness function is parameterized as a homogeneous neural network, and which can be optimized using stochastic block coordinate descent. In Section 6 we show how SIC and conditional generative models can be used to control the False Discovery Rate using the recently introduced Holdout Randomization Test [8] and knockoffs [9]. We validate SIC and its FDR control on synthetic and real datasets in Section 8.

2 Sobolev Independence Criterion: Interpretable Dependency Measure

Motivation: Feature Selection. We start by motivating gradient-sparsity regularization in SIC as a means of selecting the features that maintain maximum dependency between two random variables X (the input) and Y (the response) defined on two spaces X ⊆ R^{d_x} and Y ⊆ R^{d_y} (in the simplest case d_y = 1). Let p_xy be the joint distribution of (X, Y) and p_x, p_y be the marginals of X and Y respectively. Let D be an Integral Probability Metric associated with a function space F, i.e. for two distributions p, q:

D(p, q) = sup_{f ∈ F} E_{x∼p} f(x) − E_{x∼q} f(x).

With p = p_xy and q = p_x p_y this becomes a generalized definition of mutual information. Instead of the usual KL divergence, the metric D with its witness function, or critic, f(x, y) measures the distance between the joint p_xy and the product of marginals p_x p_y. With this generalized definition of mutual information, the feature selection problem can be formalized as finding a sparse selector or gate w ∈ R^{d_x} such that D(p_{w⊙x,y}, p_{w⊙x} p_y) is maximal [10, 11, 12, 13], i.e.
sup_{w : ‖w‖_0 ≤ s} D(p_{w⊙x,y}, p_{w⊙x} p_y), where ⊙ is pointwise multiplication and ‖w‖_0 = #{j | w_j ≠ 0}. This problem can be written in the following penalized form:

(P): sup_{f ∈ F, w} E_{pxy} f(w⊙x, y) − E_{pxpy} f(w⊙x, y) − λ‖w‖_0.

We can relabel f̃(x, y) = f(w⊙x, y) and write (P) as: sup_{f̃ ∈ F̃} E_{pxy} f̃(x, y) − E_{pxpy} f̃(x, y), where F̃ = {f̃ | f̃(x, y) = f(w⊙x, y), f ∈ F, ‖w‖_0 ≤ s}. Observe that we have: ∂f̃/∂x_j = w_j ∂f/∂x_j(w⊙x, y). Since w is sparse, the gradient of f̃ is sparse on the support of p_xy and p_x p_y. Hence, we can reformulate the problem (P) as follows:

(SIC): sup_{f ∈ F} E_{pxy} f(x, y) − E_{pxpy} f(x, y) − λ Ω_S(f),

where Ω_S(f) is a penalty that controls the sparsity of the gradient of the witness function f on the support of the measures. Controlling the nonlinear sparsity of the witness function in (SIC) via its gradients is more general and powerful than the linear sparsity control suggested in the initial form (P), since it takes into account the nonlinear interactions with other variables. In the following section we formalize this intuition by theoretically examining sparsity-inducing gradient penalties [14].

Sparsity-Inducing Gradient Penalties. Gradient penalties have a long history in machine learning and signal processing. In image processing, the total variation norm is used for instance as a regularizer to induce smoothness. Splines in Sobolev spaces [15] and manifold learning exploit gradient regularization to promote smoothness and regularity of the estimator. In the context of neural networks, gradient penalties were made possible through double backpropagation, introduced in [16], and were shown to promote robustness and better generalization. Such smoothness penalties became popular in deep learning partly following the introduction of WGAN-GP [17], and were used as regularizers for distance measures between distributions in connection to optimal transport theory [5, 6, 7]. Let μ be a dominant measure of p_xy and p_x p_y; the most commonly used gradient penalty is Ω_{L2}(f) = E_{(x,y)∼μ} ‖∇_x f(x, y)‖².
While this penalty promotes smoothness, it does not control the desired sparsity as discussed in the previous section. We therefore elect to instead use the nonlinear sparsity penalty introduced in [14]: Ω_0(f) = #{j | E_{(x,y)∼μ} |∂f(x,y)/∂x_j|² ≠ 0}, and its relaxation:

Ω_S(f) = Σ_{j=1}^{d_x} √(E_{(x,y)∼μ} |∂f(x,y)/∂x_j|²).

As discussed in [14], E_{(x,y)∼μ} |∂f(x,y)/∂x_j|² = 0 implies that f is constant with respect to the variable x_j, if the function f is continuously differentiable and the support of μ is connected. These considerations motivate the following definition of the Sobolev Independence Criterion (SIC):

SIC(L1)²(p_xy, p_x p_y) = sup_{f ∈ F} E_{pxy} f(x, y) − E_{pxpy} f(x, y) − λ/2 (Ω_S(f))² − ρ/2 E_μ f²(x, y).

Note that we add an ℓ1-like penalty (Ω_S(f)) to ensure sparsity and an ℓ2-like penalty (E_μ f²(x, y)) to ensure stability. This is similar to practices with linear models such as the elastic net. Here we will consider μ = p_x p_y (although we could also use μ = (p_xy + p_x p_y)/2). Then, given samples {(x_i, y_i), i = 1, …, N} from the joint probability distribution p_xy and iid samples {(x̃_i, ỹ_i), i = 1, …, N} from p_x p_y, SIC can be estimated as follows:

SIĈ(L1)²(p_xy, p_x p_y) = sup_{f ∈ F} 1/N Σ_{i=1}^N f(x_i, y_i) − 1/N Σ_{i=1}^N f(x̃_i, ỹ_i) − λ/2 (Ω̂_S(f))² − ρ/2 · 1/N Σ_{i=1}^N f²(x̃_i, ỹ_i), where Ω̂_S(f) = Σ_{j=1}^{d_x} √(1/N Σ_{i=1}^N |∂f(x̃_i, ỹ_i)/∂x_j|²).

Remark 1. Throughout this paper we consider feature selection only on x, since y is thought of as the response. Nevertheless, in many other problems one can perform feature selection on x and y jointly, which can be simply achieved by also controlling the sparsity of ∇_y f(x, y) in a similar way.

3 Equivalent Forms of SIC with the η-trick

As just presented, the SIC objective is difficult to optimize in practice. First of all, the expectation appears inside the square root in the gradient penalty, resulting in a non-smooth term (since the derivative of the square root is not continuous at 0). Moreover, the fact that the expectation is inside the nonlinearity introduces a gradient estimation bias when the optimization of the SIC objective is performed using stochastic gradient descent (i.e. using mini-batches).
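To make the empirical penalty from Section 2 concrete, here is a minimal sketch (not the authors' code) that estimates Ω̂_S(f) for a black-box critic, approximating the partial derivatives by central finite differences; the toy critic f and all names here are illustrative assumptions.

```python
import numpy as np

def omega_S_hat(f, X, Y, h=1e-5):
    """Empirical sparsity-inducing gradient penalty:
    sum_j sqrt( (1/N) sum_i |d f(x_i, y_i)/d x_j|^2 ),
    with the partial derivatives approximated by central finite differences."""
    N, dx = X.shape
    grads = np.zeros((N, dx))
    for j in range(dx):
        e = np.zeros(dx)
        e[j] = h
        grads[:, j] = (f(X + e, Y) - f(X - e, Y)) / (2 * h)
    return np.sum(np.sqrt(np.mean(grads ** 2, axis=0)))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = rng.normal(size=(200, 1))

# Toy critic that only touches x_0 and x_1: coordinates 2..4 contribute
# (numerically) nothing to the penalty sum.
f = lambda X, Y: X[:, 0] * Y[:, 0] + X[:, 1] ** 2
penalty = omega_S_hat(f, X, Y)
```

Since ∂f/∂x_0 = y and ∂f/∂x_1 = 2x_1 for this toy critic, the penalty reduces to √E[y²] + √E[(2x_1)²]: only two of the five terms in the sum are nonzero, which is exactly the kind of sparsity the penalty rewards.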
We alleviate these problems (non-smoothness and biased expectation estimation) by making the expectation linear in the objective thanks to the introduction of auxiliary variables η_j that will end up playing an important role in this work. This is achieved thanks to a variational form of the square root that is derived from the following lemma (which was used for a similar purpose as ours when alleviating the non-smoothness of mixed norms encountered in multiple kernel learning and group sparsity norms):

Lemma 1 ([18],[19]). Let a_j > 0, j = 1, …, d. We have: (Σ_{j=1}^d √a_j)² = inf{Σ_{j=1}^d a_j/η_j : η_j > 0, Σ_{j=1}^d η_j = 1}, with the optimum achieved at η_j = √a_j / Σ_{k=1}^d √a_k.

We alleviate first the issue of non-smoothness of the square root by adding an ε ∈ (0, 1), and we define: Ω_{S,ε}(f) = Σ_{j=1}^{d_x} √(E_μ |∂f(x,y)/∂x_j|² + ε). Using Lemma 1, the nonlinear sparsity-inducing gradient penalty can be written as:

(Ω_{S,ε}(f))² = inf{Σ_{j=1}^{d_x} (E_μ |∂f(x,y)/∂x_j|² + ε)/η_j : η_j > 0, Σ_{j=1}^{d_x} η_j = 1},

where the optimum is achieved for η_{j,ε} = β_j / Σ_{k=1}^{d_x} β_k, with β_j² = E_μ |∂f(x,y)/∂x_j|² + ε. We refer to η_{j,ε} as the normalized importance score of feature j. Note that η is a distribution over the features and gives a natural ranking between the features. Hence, substituting Ω_S(f) with Ω_{S,ε}(f) in its equivalent form, we obtain the ε-perturbed SIC:

SIC(L1)²_ε(p_xy, p_x p_y) = −inf{L_ε(f, η) : f ∈ F, η_j > 0, Σ_{j=1}^{d_x} η_j = 1},

where L_ε(f, η) = Δ(f, p_xy, p_x p_y) + λ/2 Σ_{j=1}^{d_x} (E_μ |∂f(x,y)/∂x_j|² + ε)/η_j + ρ/2 E_{pxpy} f²(x, y), and Δ(f, p_xy, p_x p_y) = E_{pxpy} f(x, y) − E_{pxy} f(x, y). Finally, SIC can be empirically estimated as

SIĈ(L1)²_ε(p_xy, p_x p_y) = −inf{L̂_ε(f, η) : f ∈ F, η_j > 0, Σ_{j=1}^{d_x} η_j = 1},

where L̂_ε(f, η) = Δ̂(f, p_xy, p_x p_y) + λ/2 Σ_{j=1}^{d_x} (1/N Σ_{i=1}^N |∂f(x̃_i, ỹ_i)/∂x_j|² + ε)/η_j + ρ/2 · 1/N Σ_{i=1}^N f²(x̃_i, ỹ_i), and the main objective Δ̂(f, p_xy, p_x p_y) = 1/N Σ_{i=1}^N f(x̃_i, ỹ_i) − 1/N Σ_{i=1}^N f(x_i, y_i).

Remark 2 (Group Sparsity). We can similarly define nonlinear group sparsity if we would like our critic to depend on subsets of coordinates. Let G_k, k = 1, …, K be overlapping or non-overlapping groups: Ω^g_S(f) = Σ_{k=1}^K √(Σ_{j∈G_k} E_μ |∂f(x,y)/∂x_j|²). The η-trick applies naturally.
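Lemma 1 can be sanity-checked numerically: for any positive a_j, the infimum of Σ_j a_j/η_j over the simplex equals (Σ_j √a_j)², attained at η_j ∝ √a_j. A self-contained sketch (the values of a are arbitrary test data):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.uniform(0.1, 2.0, size=6)               # a_j > 0

# Closed form from Lemma 1: optimum eta_j = sqrt(a_j) / sum_k sqrt(a_k),
# with optimal value (sum_j sqrt(a_j))^2.
eta_star = np.sqrt(a) / np.sum(np.sqrt(a))
value = lambda eta: np.sum(a / eta)
closed_form = np.sum(np.sqrt(a)) ** 2

# Brute force: random points on the simplex never beat the closed form.
best_random = min(value(rng.dirichlet(np.ones(6))) for _ in range(5000))
```

This is the mechanism that turns the square root of an expectation into an expression that is linear in the expectation, at the price of the auxiliary simplex variables η.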
4 Convex Sobolev Independence Criterion in Fixed Feature Spaces

We will now specify the function space F in SIC and consider in this section critics of the form: F = {f | f(x, y) = ⟨u, Φ_ω(x, y)⟩, ‖u‖² ≤ γ}, where Φ_ω : X × Y → R^m is a fixed finite-dimensional feature map. We define the mean embeddings of the joint distribution p_xy and the product of marginals p_x p_y as follows: μ(p_xy) = E_pxy[Φ_ω(x, y)], μ(p_x p_y) = E_pxpy[Φ_ω(x, y)] ∈ R^m. Define the covariance embedding of p_x p_y as C(p_x p_y) = E_pxpy[Φ_ω(x, y) Φ_ω(x, y)^⊤] ∈ R^{m×m}, and finally define the Gramian of derivatives embedding for coordinate j as D_j(p_x p_y) = E_pxpy[(∂Φ_ω(x,y)/∂x_j)(∂Φ_ω(x,y)/∂x_j)^⊤] ∈ R^{m×m}. We replace the constraint ‖u‖² ≤ γ by the penalty term (γ/2)‖u‖². Define

L_ε(u, η) = ⟨u, μ(p_x p_y) − μ(p_xy)⟩ + 1/2 ⟨u, (λ Σ_{j=1}^{d_x} D_j(p_x p_y)/η_j + ρ C(p_x p_y) + γ I_m) u⟩ + λε/2 Σ_{j=1}^{d_x} 1/η_j.

Observe that: SIC(L1)²_ε(p_xy, p_x p_y) = −inf{L_ε(u, η) : u ∈ R^m, η_j > 0, Σ_{j=1}^{d_x} η_j = 1}.

We start by remarking that SIC is a form of gradient-regularized maximum mean discrepancy [3]. Previous MMD work comparing the joint and the product of marginals did not use the concept of nonlinear sparsity. For example, the Hilbert-Schmidt Independence Criterion (HSIC) [2] uses Φ_ω(x, y) = φ(x) ⊗ ψ(y) with a constraint ‖u‖² ≤ 1. CCA and related kernel measures of dependence [20, 21] use L² constraints on each function space separately.

Optimization Properties of Convex SIC. We analyze in this section the optimization properties of SIC. Theorem 1 shows that the SIC(L1)²_ε loss function is jointly strictly convex in (u, η) and hence admits a unique solution that solves a fixed point problem.

Theorem 1 (Existence of a Solution, Uniqueness, Convexity and Continuity). Note that L(u, η) = L_{ε=0}(u, η). The following properties hold for the SIC loss: 1) L(u, η) is differentiable and jointly convex in (u, η). L(u, η) is not continuous at η such that η_j = 0 for some j.
2) Smoothing, Perturbed SIC: For ε ∈ (0, 1), L_ε(u, η) = L(u, η) + λε/2 Σ_{j=1}^{d_x} 1/η_j is jointly strictly convex, has compact level sets on the probability simplex, and admits a unique minimizer (u*_ε, η*_ε). 3) The unique minimizer of L_ε(u, η) is a solution of the following fixed point problem:

u*_ε = (λ Σ_{j=1}^{d_x} D_j(p_x p_y)/η*_{j,ε} + ρ C(p_x p_y) + γ I_m)^{-1} (μ(p_xy) − μ(p_x p_y)), and η*_{j,ε} = √(⟨u*_ε, D_j(p_x p_y) u*_ε⟩ + ε) / Σ_{k=1}^{d_x} √(⟨u*_ε, D_k(p_x p_y) u*_ε⟩ + ε).

The following theorem shows that a solution of the unperturbed SIC problem can be obtained from the smoothed SIC(L1)²_ε in the limit ε → 0:

Theorem 2 (From Perturbed SIC to SIC). Consider a sequence ε_ℓ → 0 as ℓ → ∞, consider a sequence of minimizers (u*_{ε_ℓ}, η*_{ε_ℓ}) of L_{ε_ℓ}(u, η), and let (u*, η*) be the limit of this sequence; then (u*, η*) is a minimizer of L(u, η).

Interpretability of SIC. The following corollary shows that SIC can be written in terms of the importance scores of the features, since at the optimum the main objective is proportional to the constraint term. It is, to the best of our knowledge, the first dependency criterion that decomposes into a sum of contributions of each coordinate, and hence it is an interpretable dependency measure. Moreover, the η_j are normalized importance scores of each feature j, and their ranking can be used to assess feature importance.

Corollary 1 (Interpretability of Convex SIC). Let (u*, η*) be the limit defined in Theorem 2. Define f*(x, y) = ⟨u*, Φ_ω(x, y)⟩ and ‖f*‖_F = ‖u*‖. We have that

SIC(L1)²(p_xy, p_x p_y) = 1/2 (E_pxy f*(x, y) − E_pxpy f*(x, y)) = 1/2 (λ (Ω_{S,L1}(f*))² + ρ E_pxpy f*²(x, y) + γ ‖f*‖²_F),

where η*_j = √(E_pxpy |∂f*(x, y)/∂x_j|²) / Ω_{S,L1}(f*) and Σ_{j=1}^{d_x} η*_j = 1. The terms η*_j can be seen as quantifying how much of the dependency, as measured by SIC, can be explained by coordinate j. The ranking of the η*_j can be used to rank the influence of the coordinates.
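The fixed point of Theorem 1 can be illustrated by running the two updates alternately on empirical embeddings. The sketch below is a toy instance, not the paper's experiments: the interaction feature map Φ(x, y) = x·y (chosen so that D_j and C are trivial to form), the data, and all hyper-parameter values are assumptions. With y driven by x_0 only, η should concentrate on the first coordinate.

```python
import numpy as np

rng = np.random.default_rng(2)
N, dx = 2000, 3
X = rng.normal(size=(N, dx))
Y = X[:, 0] + 0.1 * rng.normal(size=N)         # y depends on x_0 only
Yp = rng.permutation(Y)                        # breaks dependency: samples ~ p_x p_y

# Illustrative feature map Phi(x, y) = x * y, so dPhi/dx_j = y * e_j.
Phi = X * Yp[:, None]
mu_joint = (X * Y[:, None]).mean(axis=0)       # mu(p_xy)
mu_prod = Phi.mean(axis=0)                     # mu(p_x p_y)
C = Phi.T @ Phi / N                            # covariance embedding C(p_x p_y)
D = [np.outer(e, e) * np.mean(Yp ** 2)         # Gramians of derivatives D_j(p_x p_y)
     for e in np.eye(dx)]

lam, rho, gam, eps = 1.0, 1.0, 1e-3, 1e-6
eta = np.full(dx, 1.0 / dx)
for _ in range(50):                            # alternating minimization
    A = lam * sum(Dj / ej for Dj, ej in zip(D, eta)) + rho * C + gam * np.eye(dx)
    u = np.linalg.solve(A, mu_joint - mu_prod)           # u-update (fixed point)
    beta = np.sqrt(np.array([u @ Dj @ u for Dj in D]) + eps)
    eta = beta / beta.sum()                              # eta-update (Lemma 1)
```

After a few iterations η places almost all of its mass on coordinate 0, the only coordinate through which x influences y in this toy model.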
Thanks to the joint convexity and the smoothness of the perturbed SIC, we can solve the convex empirical SIC using alternating minimization on u and η, or block coordinate descent using first-order methods such as gradient descent on u and mirror descent [22] on η, which are known to be globally convergent in this case (see Appendix A for more details).

5 Non-Convex Neural SIC with Deep ReLU Networks

While convex SIC enjoys a lot of theoretical properties, a crucial shortcoming is the need to choose a feature map Φ_ω, which essentially goes back to the choice of a kernel in classical kernel methods. As an alternative, we propose to learn the feature map as a deep neural network. The architecture of the network can be problem dependent, but we focus here on a particular architecture: deep ReLU networks with biases removed. As we show below, using our sparsity-inducing gradient penalties with such networks results in input sparsity at the level of the witness function f of SIC. This is desirable since it allows for an interpretable model: similar to the effect of the Lasso with linear models, our sparsity-inducing gradient penalties result in a nonlinear self-explainable witness function f [23], with explicit sparse dependency on the inputs.

Deep ReLU Networks with No Biases, Homogeneity and Input Sparsity via Gradient Penalties. We start by invoking the Euler theorem for homogeneous functions:

Theorem 3 (Euler Theorem for Homogeneous Functions). A continuously differentiable function f is defined as homogeneous of degree k if f(λx) = λ^k f(x), ∀λ > 0. The theorem states that f is homogeneous of degree k if and only if k f(x) = ⟨∇_x f(x), x⟩ = Σ_{j=1}^{d_x} (∂f(x)/∂x_j) x_j.

Now consider deep ReLU networks with biases removed, for any number of layers L: F_ReLU = {f | f(x, y) = ⟨u, Φ_ω(x, y)⟩, where Φ_ω(x, y) = σ(W_L … σ(W_2 σ(W_1 [x, y]))), u ∈ R^m, Φ_ω : R^{d_x+d_y} → R^m}, where σ(t) = max(t, 0) and the W_j are linear weights. Any f ∈ F_ReLU is clearly homogeneous of degree 1.
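Both properties (degree-1 positive homogeneity and the Euler identity) are easy to verify numerically for a bias-free one-hidden-layer ReLU network; a minimal sketch with hypothetical shapes and random weights:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 4, 16
W1 = rng.normal(size=(m, d))
u = rng.normal(size=m)

def f(z):
    """Bias-free one-hidden-layer ReLU network: f(z) = <u, relu(W1 z)>."""
    return u @ np.maximum(W1 @ z, 0.0)

def grad_f(z):
    """Analytic gradient: W1^T (u * 1[W1 z > 0]) (valid away from the kinks)."""
    return W1.T @ (u * (W1 @ z > 0))

z = rng.normal(size=d)
lam = 2.5
homogeneity_gap = abs(f(lam * z) - lam * f(z))   # degree-1 homogeneity
euler_gap = abs(f(z) - grad_f(z) @ z)            # Euler: f(z) = <grad f(z), z>
```

Both gaps are zero up to floating-point error; adding biases inside σ would break both identities, which is why the biases are removed.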
As an immediate consequence of the Euler theorem we then have: f(x, y) = ⟨∇_x f(x, y), x⟩ + ⟨∇_y f(x, y), y⟩. The first term is similar to a linear term in a linear model; the second term can be seen as a bias. Using our sparsity-inducing gradient penalties with such networks guarantees that, on average on the support of a dominant measure, the gradients with respect to x are sparse. Intuitively, the gradients w.r.t. x act like the weights in a linear model, and our sparsity-inducing gradient penalty acts like the ℓ1 regularization of the Lasso. The main advantage compared to the Lasso is that we have a highly nonlinear decision function, which is better able to capture dependencies between X and Y.

Non-Convex SIC with Stochastic Block Coordinate Descent (BCD). We define the empirical non-convex SIC(L1)² using this function space F_ReLU as follows: SIĈ(L1)²(p_xy, p_x p_y) = −inf{L̂(f_θ, η) : f_θ ∈ F_ReLU, η_j > 0, Σ_{j=1}^{d_x} η_j = 1}, where θ = (vec(W_1), …, vec(W_L), u) are the network parameters. Algorithm 3 in Appendix B summarizes our stochastic BCD algorithm for training the neural SIC. The algorithm consists of SGD updates to θ and mirror descent updates to η.

Boosted SIC. When training neural SIC, we can obtain different critics f and importance scores η by varying random seeds or hyper-parameters (architecture, batch size, etc.). Inspired by importance scores in random forests, we define Boosted SIC as the arithmetic mean or the geometric mean of the η obtained across runs.

6 FDR Control and the Holdout Randomization Test / Knockoffs

Controlling the False Discovery Rate (FDR) in feature selection is an important problem for reproducible discoveries. In a nutshell, for a feature selection problem, given the ground-truth set of features S and a feature selection method such as SIC that gives a candidate set Ŝ, our goal is to maximize the TPR (True Positive Rate, or power) and to keep the False Discovery Rate (FDR) under control.
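The mirror descent update on η used in the BCD algorithm of Section 5 is, on the simplex, the entropic (exponentiated-gradient) step. The sketch below applies it to the η-subproblem of Lemma 1, whose closed-form solution η_j ∝ √a_j it should recover; the values of a and the step size are illustrative assumptions.

```python
import numpy as np

def mirror_step(eta, grad, lr):
    """Entropic mirror descent on the simplex:
    eta_j <- eta_j * exp(-lr * grad_j), renormalized to sum to 1."""
    w = eta * np.exp(-lr * (grad - grad.max()))  # shift for numerical stability only
    return w / w.sum()

a = np.array([4.0, 1.0, 1.0])          # minimize sum_j a_j / eta_j over the simplex
eta = np.full(3, 1.0 / 3.0)
for _ in range(500):
    grad = -a / eta ** 2               # gradient of the objective in eta
    eta = mirror_step(eta, grad, lr=0.05)
```

The iterates converge to η = (√4, √1, √1)/(√4 + √1 + √1) = (0.5, 0.25, 0.25), matching Lemma 1; the multiplicative form of the update keeps η strictly positive and on the simplex at every step.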
TPR and FDR are defined as follows:

TPR = E[ #{i : i ∈ Ŝ ∩ S} / #{i : i ∈ S} ],  FDR = E[ #{i : i ∈ Ŝ \ S} / #{i : i ∈ Ŝ} ].

We explore in this paper two methods that provably control the FDR: 1) the Holdout Randomization Test (HRT) introduced in [8], which we specialize for SIC in Algorithm 4; 2) knockoffs, introduced in [9], which can be used with any basic feature selection method such as neural SIC and guarantee provable FDR control.

HRT-SIC. We are interested in measuring the conditional dependency between a feature x_j and the response variable y conditionally on the other features, noted x_{-j}. Hence we have the following null hypothesis: H0 : x_j ⟂ y | x_{-j} ⟺ p_xy = p_{x_j|x_{-j}} p_{y|x_{-j}} p_{x_{-j}}. In order to simulate the null hypothesis, we propose to use generative models for sampling from x_j|x_{-j} (see Appendix D). The principle of HRT [8], which we specify here for SIC in Algorithm 4 (given in Appendix B), is the following: instead of refitting SIC under H0, we evaluate the mean of the witness function of SIC on a holdout set sampled under H0 (using conditional generators, for R rounds). The deviation of the mean of the witness function under H0 from its mean on a holdout from the real distribution gives us p-values. We use the Benjamini-Hochberg [24] procedure on those p-values to achieve a target FDR. We apply HRT-SIC on a shortlist of features pre-selected per their ranking of η_j.

Knockoffs-SIC. Knockoffs [25] work by finding control variables called knockoffs x̃ that mimic the behavior of the real features x and provably control the FDR [9]. We use here Gaussian knockoffs [9] and train SIC on the concatenation [x, x̃], i.e. we train SIC([X, X̃], Y) and obtain an η that now has twice the dimension d_x: for each real feature j, there is the real importance score η_j and the knockoff importance score η_{j+d_x}. Knockoffs-SIC consists in using the statistics W_j = η_j − η_{j+d_x} and the knockoff filter [9] to select features based on the sign of W_j (see Alg. 5 in Appendix).
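The Benjamini-Hochberg step of HRT-SIC is straightforward to implement; below is a generic sketch (the p-values are made up for illustration: three clearly dependent features followed by nulls).

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.1):
    """BH procedure: with sorted p-values p_(1) <= ... <= p_(m), find the
    largest k with p_(k) <= q*k/m and reject the k smallest p-values."""
    m = len(pvals)
    order = np.argsort(pvals)
    below = np.nonzero(np.sort(pvals) <= q * np.arange(1, m + 1) / m)[0]
    if below.size == 0:
        return np.array([], dtype=int)
    return np.sort(order[:below[-1] + 1])      # indices of rejected hypotheses

# Hypothetical HRT p-values for 8 pre-selected features:
p = np.array([0.001, 0.002, 0.004, 0.4, 0.6, 0.9, 0.25, 0.8])
selected = benjamini_hochberg(p, q=0.1)
```

With q = 0.1 only the three small p-values clear their step-up thresholds q·k/m, so only those three features are declared discoveries.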
7 Relation to Previous Work

Kernel/Neural Measures of Dependency. As discussed earlier, SIC can be seen as a sparse gradient-regularized MMD [3, 7] and relates to the Sobolev discrepancy of [5, 6]. Feature selection with MMD was introduced in [10] and is based on backward elimination of features by recomputing MMD on the ablated vectors. SIC has the advantage of fitting one critic that has interpretable feature scores. Related to the MMD is the Hilbert-Schmidt Independence Criterion (HSIC) and other variants of kernel dependency measures introduced in [2, 21]. None of those criteria has a nonparametric sparsity constraint on its witness function that allows for explainability and feature selection. Other neural measures of dependency include MINE [26], which estimates the KL divergence using neural networks, and [27], which estimates a proxy to the Wasserstein distance using neural networks.

Interpretability, Sparsity, Saliency and Sensitivity Analysis. The Lasso and the elastic net [28] are interpretable linear models that exploit sparsity, but are limited to linear relationships. Random forests [29] have a heuristic for determining feature importance and are successful in practice, as they can capture nonlinear relationships similarly to SIC. We believe SIC can potentially leverage the deep learning toolkit to go beyond tabular data, where random forests excel, to more structured data such as time series or graph data. Finally, SIC relates to saliency-based post-hoc interpretation of deep models such as [30, 31, 32]. While those methods use the gradient information for a post-hoc analysis, SIC incorporates this information to guide the learning towards the important features. As discussed in Section 2, many recent works introduce deep networks with input sparsity control through a learned gate or a penalty on the weights of the network [11, 12, 13]. SIC exploits a stronger notion of sparsity that leverages the relationships between the different covariates.
8 Experiments

Synthetic Data Validation. We first validate our methods and compare them to baseline models in simulation studies on synthetic datasets where the ground truth is available by construction. For this we generate the data according to a model y = f(x) + ε, where the model f(·) and the noise ε define the specific synthetic dataset (see Appendix F.1). In particular, the value of y only depends on a subset of the features x_i, i = 1, …, p through f(·), and performance is quantified in terms of TPR and FDR in discovering them among the irrelevant features. We experiment with two datasets:

A) Complex multivariate synthetic data (SinExp), which is generated from a complex multivariate model proposed in [33] Sec 5.3, where 6 ground-truth features x_i out of 50 generate the output y through a nonlinearity involving the product and composition of the cos, sin and exp functions (see Appendix F.1). We therefore dub this dataset SinExp. To increase the difficulty even further, we introduce a pairwise correlation between all features of 0.5. In Fig. 1 we show results for datasets of 125 and 500 samples repeated 100 times, comparing the performance of our models with that of two baselines: Elastic Net (EN) and Random Forest (RF).

B) Liang Dataset. We show results on the benchmark dataset proposed by [34], specifically the generalized Liang dataset matching most of the setup from [8] Sec 5.1. We provide dataset details and results in Appendix F.1 (results in Figure 2).

Figure 1: SinExp synthetic dataset.
TPR and FDR of the Elastic Net (EN) and Random Forest (RF) baseline models (left panels) are compared to our methods: a 2-hidden-layer neural network with no biases trained to minimize an objective comprising an MSE cost and a Sobolev penalty term (MSE + Sobolev Penalty), and the same network trained to optimize the SIC criterion (right panels), for datasets of 125 samples (top panels) and 500 samples (bottom panels). For all models, TPR and FDR are computed by selecting the top 6 features in order of feature importance (which for EN is defined as the absolute value of the weight of a feature, for RF is the out-of-bag error associated with it (see [35]), and for our methods is the value of its η_j). Selecting the first 6 features is useful to compare models, but assumes oracle knowledge of the fact that there are 6 ground-truth features. We therefore also compute FDR and TPR after selecting features using the HRT method of [8] among the top 20 features. HRT estimates the importance of a feature by quantifying its effect on the distribution of y on a holdout set, replacing its values with samples from a conditional distribution (see Section 6). We use HRT to control the FDR at 10% (red horizontal dotted line). Standard box plots are generated over 100 repetitions of each simulation.

Feature Selection on the Drug Response Dataset. We consider as a real-world application the Cancer Cell Line Encyclopedia (CCLE) dataset [36], described in Appendix F.2. We study the result of using the normalized importance scores η_j from SIC for (heuristic) feature selection, against features selected by Elastic Net. Table 1 shows the heldout MSE of a predictor trained on selected features, averaged over 100 runs (each run: new randomized 90%/10% data split and NN initialization). The goal here is to quantify the predictiveness of features selected by SIC on its own, without the full randomized testing machinery.
The SIC critic and regressor NN were respectively the big_critic and regressor_NN described with training details in Appendix F.3, while the random forest is trained with default hyper-parameters from scikit-learn [37]. We can see that, with just η_j, informative features are selected for the downstream regression task, with performance comparable to those selected by Elastic Net, which was trained explicitly for this task. The features selected with high η_j values and their overlap with the features selected by Elastic Net are listed in Appendix F.2, Table 3.

Features                    NN             RF
All 7251 features           1.160 ± 3.990  0.783 ± 0.167
Elastic-Net1 [36] top-7     0.864 ± 0.432  0.931 ± 0.215
Elastic-Net2 [8] top-10     0.663 ± 0.161  0.830 ± 0.190
SIC top-7                   0.728 ± 0.166  0.856 ± 0.189
SIC top-10                  0.706 ± 0.158  0.817 ± 0.173
SIC top-15                  0.734 ± 0.168  0.859 ± 0.202

Table 1: CCLE results on the downstream regression task. Heldout MSE for drug PLX4720 prediction based on selected features. Columns: neural network (NN) and random forest (RF) regressors.

HIV-1 Drug Resistance with Knockoffs-SIC. The second real-world dataset that we analyze is the HIV-1 Drug Resistance dataset [38], where the task consists in detecting mutations associated with resistance to a drug type. For our experiments we use all three classes of drugs: Protease Inhibitors (PIs), Nucleoside Reverse Transcriptase Inhibitors (NRTIs), and Non-nucleoside Reverse Transcriptase Inhibitors (NNRTIs). We use the pre-processing of each dataset from the knockoff tutorial [39] made available by the authors. Concretely, we construct a dataset (X, X̃) as the concatenation of the real data and Gaussian knockoffs [9], and fit SIC([X, X̃], Y). As explained in Section 6, we use in the knockoff filter the statistics W_j = η_j − η_{j+d_x}, i.e. the difference of SIC importance scores between each feature and its corresponding knockoff. For SIC experiments, we use the small_critic architecture (see Appendix F.3 for training details).
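The knockoff filter applied to the statistics W_j can be sketched as follows; this is the generic "knockoff+" threshold of [9], and the W values below are made up for illustration, with six "true" features followed by near-zero nulls.

```python
import numpy as np

def knockoff_filter(W, q=0.1):
    """Knockoff+ filter: select {j : W_j >= tau}, where tau is the smallest
    t > 0 with (1 + #{j : W_j <= -t}) / max(1, #{j : W_j >= t}) <= q."""
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return np.nonzero(W >= t)[0]
    return np.array([], dtype=int)

# Hypothetical W_j = eta_j - eta_{j+dx}: large for true features, small for nulls.
W = np.array([0.9, 0.8, 0.8, 0.7, 0.6, 0.6,
              -0.02, 0.01, -0.03, 0.02, -0.01, 0.015])
selected = knockoff_filter(W, q=0.2)
```

The filter exploits the sign-symmetry of null W_j: the count of large negative statistics estimates the number of false positives among the large positive ones, which is what makes the FDR control provable.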
We use Boosted SIC by varying the batch size N ∈ {10, 30, 50} and computing the geometric mean of the η produced by those three setups as the feature importance needed for knockoffs. Results are summarized in Table 2.

Drug Class  Drug Type  Knockoff with GLM      Boosted SIC Knockoff
                       TD    FD    FDP        TD    FD    FDP
PI          APV        19    3     0.13       17    5     0.22
PI          ATV        22    8     0.26       19    1     0.05
PI          IDV        19    12    0.38       15    3     0.16
PI          LPV        16    1     0.05       14    2     0.12
PI          NFV        24    7     0.22       19    5     0.21
PI          RTV        19    8     0.29       12    2     0.20
PI          SQV        17    4     0.19       14    8     0.36
NRTI        X3TC       0     0     0          7     0     0
NRTI        ABC        10    1     0.09       11    1     0.08
NRTI        AZT        16    4     0.2        12    5     0.29
NRTI        D4T        6     1     0.14       8     0     0
NRTI        DDI        0     0     0          8     0     0
NNRTI       DLV        10    13    0.56       8     10    0.55
NNRTI       EFV        11    11    0.5        11    10    0.47
NNRTI       NVP        7     10    0.58       7     11    0.611

Table 2: Comparison of applying (knockoff filter + GLM) and (knockoff filter + Boosted SIC). For each we compare the True Discoveries (TD), False Discoveries (FD) and False Discovery Proportion (FDP). Knockoffs with Boosted SIC keep the FDP under control without compromising power, and succeed in making true discoveries that GLM with knockoffs doesn't find.

9 Conclusion

We introduced in this paper the Sobolev Independence Criterion (SIC), a dependency measure that gives rise to feature importance scores which can be used for feature selection and interpretable decision making. We laid down the theoretical foundations of SIC and showed how it can be used in conjunction with the Holdout Randomization Test and knockoffs to control the FDR, enabling reliable discoveries. We demonstrated the merits of SIC for feature selection in extensive synthetic and real-world experiments with controlled FDR.

References

[1] Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert R. G. Lanckriet. On integral probability metrics, φ-divergences and binary classification. 2009.
[2] A. Gretton, K. Fukumizu, CH. Teo, L. Song, B. Schölkopf, and AJ. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, 2008.
[3] Arthur Gretton, Karsten M. Borgwardt, Malte J.
Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. JMLR, 2012.
[4] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. Kernel mean embedding of distributions: A review and beyond. arXiv, 2017.
[5] Youssef Mroueh, Chun-Liang Li, Tom Sercu, Anant Raj, and Yu Cheng. Sobolev GAN. ICLR, 2018.
[6] Youssef Mroueh, Tom Sercu, and Anant Raj. Sobolev descent. In AISTATS, 2019.
[7] Michael Arbel, Dougal J. Sutherland, Mikolaj Binkowski, and Arthur Gretton. On gradient regularizers for MMD GANs. NeurIPS, 2018.
[8] W. Tansey, V. Veitch, H. Zhang, R. Rabadan, and D. M. Blei. The holdout randomization test: Principled and easy black box feature selection. arXiv preprint arXiv:1811.00645, 2018.
[9] Emmanuel Candes, Yingying Fan, Lucas Janson, and Jinchi Lv. Panning for gold: model-X knockoffs for high dimensional controlled variable selection. 2018.
[10] Le Song, Alex Smola, Arthur Gretton, Justin Bedo, and Karsten Borgwardt. Feature selection via dependence maximization. J. Mach. Learn. Res., 2012.
[11] Jean Feng and Noah Simon. Sparse-input neural networks for high-dimensional nonparametric regression and classification. 2017.
[12] Mao Ye and Yan Sun. Variable selection via penalized neural network: a drop-out-one loss approach. In Proceedings of the 35th International Conference on Machine Learning, 2018.
[13] Yutaro Yamada, Ofir Lindenbaum, Sahand Negahban, and Yuval Kluger. Deep supervised feature selection using stochastic gates. arXiv, 2018.
[14] Lorenzo Rosasco, Silvia Villa, Sofia Mosci, Matteo Santoro, and Alessandro Verri. Nonparametric sparsity and regularization. J. Mach. Learn. Res., 2013.
[15] Grace Wahba. Smoothing noisy data with spline functions. Numerische Mathematik, 24(4).
[16] Harris Drucker and Yann LeCun. Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks, 1992.
[17] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville.
Improved training of wasserstein gans. ar Xiv:1704.00028, 2017. [18] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Mach. Learn., 2008. [19] Francis Bach, Rodolphe Jenatton, and Julien Mairal. Optimization with Sparsity-Inducing Penalties (Foundations and Trends(R) in Machine Learning). Now Publishers Inc., Hanover, MA, USA, 2011. [20] H.D. Vinod. Canonical ridge and econometrics of joint production. Journal of Econometrics, [21] Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Schölkopf. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems 20. 2008. [22] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett., 2003. [23] David Alvarez Melis and Tommi Jaakkola. Towards robust interpretability with self-explaining neural networks. In Advances in Neural Information Processing Systems 31. 2018. [24] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: A Practical and powerful approach to multiple testing. J. Roy. Statist. Soc., 57:289 300, 1995. [25] Rina Foygel Barber, Emmanuel J Candès, et al. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055 2085, 2015. [26] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. Mine: Mutual information neural estimation, 2018. [27] Sherjil Ozair, Corey Lynch, Yoshua Bengio, Aaron van den Oord, Sergey Levine, and Pierre Sermanet. Wasserstein dependency measure for representation learning, 2019. [28] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer New York Inc., 2001. [29] Leo Breiman. Random forests. Mach. Learn., 2001. [30] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. 
In Proceedings of the 34th International Conference on Machine Learning, 2017. [31] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. International Conference on Learning Representations (Workshop Track)., 2014. [32] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLo S ONE, 2015. [33] Jean Feng and Noah Simon. Sparse-input neural networks for high-dimensional nonparametric regression and classification. ar Xiv preprint ar Xiv:1711.07592, 2017. [34] Faming Liang, Qizhai Li, and Lei Zhou. Bayesian neural networks for selection of drug sensitive genes. Journal of the American Statistical Association, 113(523), 2018. [35] Leo Breiman. Random forests. Machine learning, 45(1):5 32, 2001. [36] Jordi Barretina, Giordano Caponigro, Nicolas Stransky, Kavitha Venkatesan, Adam A Margolin, Sungjoon Kim, Christopher J Wilson, Joseph Lehár, Gregory V Kryukov, Dmitriy Sonkin, et al. The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature, 483(7391):603, 2012. [37] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikitlearn: Machine learning in python. Journal of machine learning research, 12(Oct):2825 2830, 2011. [38] Soo-Yon Rhee, Jonathan Taylor, Gauhar Wadhera, Asa Ben-Hur, Douglas L Brutlag, and Robert W Shafer. Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proceedings of the National Academy of Sciences, 103(46):17355 17360, 2006. [39] Matteo Sesia and Evan Patterson. R tutorial for knockoffs - 4. https://web.stanford.edu/ group/candes/knockoffs/software/knockoffs/tutorial-4-r.html, 2017. [40] P. Tseng. 
Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl., 109, 2001. [41] Meisam Razaviyayn, Mingyi Hong, and Zhi-Quan Luo. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 2013. [42] Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. In Advances in neural information processing systems, pages 4790 4798, 2016. [43] Ethan Perez, Harm de Vries, Florian Strub, Vincent Dumoulin, and Aaron Courville. Learning visual reasoning without strong priors. ar Xiv preprint ar Xiv:1707.03017, 2017. [44] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. [45] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary De Vito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017. [46] Yaniv Romano, Matteo Sesia, and Emmanuel Candès. Deep knockoffs. Journal of the American Statistical Association, pages 1 27, 2019.