# Post-Selection Inference with HSIC-Lasso

Tobias Freidling (1), Benjamin Poignard (2, 3), Héctor Climente-González (3), Makoto Yamada (3, 4)

(1) Department of Pure Mathematics and Mathematical Statistics, University of Cambridge, Cambridge, United Kingdom; (2) Graduate School of Economics, Osaka University, Osaka, Japan; (3) Center for Advanced Intelligence Project (AIP), RIKEN, Kyoto, Japan; (4) Graduate School of Informatics, Kyoto University, Kyoto, Japan. Correspondence to: Tobias Freidling.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract. Detecting influential features in non-linear and/or high-dimensional data is a challenging and increasingly important task in machine learning. Variable selection methods have thus been gaining much attention, as has post-selection inference. Indeed, inference on the selected features can be significantly flawed when the selection procedure is not accounted for. We propose a selective inference procedure using the model-free HSIC-Lasso, based on the framework of truncated Gaussians combined with the polyhedral lemma. We then develop an algorithm which allows for low computational costs and provides a selection of the regularisation parameter. The performance of our method is illustrated by experiments on both artificial and real-world data, which demonstrate tight control of the type-I error, even for small sample sizes.

1. Introduction

Choosing a relevant statistical model in light of the observations, prior to any statistical inference, is ubiquitous in statistics. It reduces computational costs and fosters parsimonious models. For example, in linear regression analysis, allowing only a subset of features to enter the model, i.e. the sparsity assumption, is particularly suited to high-dimensional data and potentially provides more robust predictions. Yet, should one follow the standard procedure, which assumes a model specified a priori, the data-driven nature of the selection procedure would be tacitly overlooked. Therefore, applying classical inference methods may entail seriously flawed results, as highlighted by Leeb & Pötscher (2005; 2006), among others.

The most straightforward remedy is sample splitting, which uses one part of the data for model selection and the other part for inference, cf. (Cox, 1975). However, this approach leaves room for improvement, as the selection data is outright disregarded at the inference step. Beyond this, two major paradigms regarding the treatment of model selection have evolved. Berk et al. (2013) developed a post-selection inference (PSI) method which circumvents the distorting effects inherent to any selection procedure. Yet, in practice this method often leads to overly conservative confidence intervals and entails high computational costs. In contrast to this work, Lee et al. (2016) and Tibshirani et al. (2016) only accounted for the actual selection outcome by conditioning on it at the inference step. This requires control over, or at least insight into, the selection procedure in order to characterise the selection event. Under the Gaussian assumption, it is possible to express this event as a restriction on the distribution of the inference target, using the so-called polyhedral lemma, and to derive a pivotal quantity.
Due to its comparatively low computational cost and general set-up, this method was applied with great success to a variety of model selection approaches, e.g. (Hyun et al., 2018) and (Taylor & Tibshirani, 2018). Variable selection methods typically include step-wise procedures, see (Hocking, 1976), information criteria such as the AIC and its extensions, see (Akaike, 1974), and regularisation methods, pioneered by Tibshirani's work on the Lasso (1996). Their limitations, especially regarding assumptions such as linearity or certain probability distributions, fostered flourishing research on model-free feature selection. The kernel-based Hilbert-Schmidt independence criterion (HSIC), proposed by Gretton et al. (2005), emerged as a key ingredient, as it quantifies the dependence between two random variables. This allows for simple feature selection: for instance, one can select a fixed number of covariates that exhibit the highest HSIC-estimates with the response, which is referred to as HSIC-ordering in the rest of the paper. HSIC-Lasso, proposed by Yamada et al. (2014), additionally accounts for the dependence structure among the covariates and selects features with an L1-penalty. Yamada et al. (2018) developed a selective inference approach for the former selection procedure; the analogous results for HSIC-Lasso, however, are still an important blind spot in PSI.

In this study, we are interested in the following problem: given $n$ observations of a response variable and potential features, select the relevant covariates using the HSIC-Lasso screening procedure; then, based on the polyhedral lemma, develop asymptotic inference given the Lasso selection procedure, in the same spirit as Lee et al. (2016). Our contributions are as follows. First, we derive a tailored, novel asymptotic pivot and characterise the selection event of HSIC-Lasso in an affine linear fashion; then we propose an algorithm that solves cumbersome issues arising in applications, such as high computational costs and hyper-parameter choice; finally, to illustrate our theoretical results and the relevance of the proposed method, we conduct an empirical analysis using artificial and real-world data. To the best of our knowledge, this work is the first approach to tackle the question of PSI with HSIC-Lasso.

2. Background

In this section, the two theoretical cornerstones on which our work is founded, namely PSI based on truncated Gaussians and the Hilbert-Schmidt independence criterion, are reviewed.

2.1. PSI with Truncated Gaussians

We first review the PSI approach pioneered by Lee et al. (2016), which considers the distribution of the quantity of interest, called the target, conditionally on a selection event at the inference step. We denote the set of potential models by $\mathcal{S}$ and the model estimator by $\hat{S}$, and suppose that the response $Y$ follows the distribution $N(\mu, \Sigma)$, where $\mu$ is unknown and $\Sigma$ is given. It is assumed that the inference target can be expressed as $\eta_S^T \mu$ for a vector $\eta_S$ that may depend on the previously chosen model $S$. Consequently, the distribution of interest is $\eta_S^T Y \mid \{\hat{S} = S\}$. If the selection event allows for an affine linear representation, i.e. $\{\hat{S} = S\} = \{AY \le b\}$ with $A$ and $b$ depending only on $S$ and the covariates, a specific model choice can be seen as a restriction on the distribution of the inference target. This is the object of the following result.
Lemma 1 (Polyhedral lemma). Let $Y \sim N(\mu, \Sigma)$ with $\mu \in \mathbb{R}^n$ and $\Sigma \in \mathbb{R}^{n \times n}$, $\eta \in \mathbb{R}^n$, $A \in \mathbb{R}^{k \times n}$ and $b \in \mathbb{R}^k$. Defining $C := (\eta^T \Sigma \eta)^{-1} \Sigma \eta$ and $Z := (\mathrm{Id} - C\eta^T)\, Y$, the identity
$$\{AY \le b\} = \{V^-(Z) \le \eta^T Y \le V^+(Z)\}$$
holds almost surely, with
$$V^-(Z) := \max_{j:\,(AC)_j < 0} \frac{b_j - (AZ)_j}{(AC)_j}, \qquad (1)$$
$$V^+(Z) := \min_{j:\,(AC)_j > 0} \frac{b_j - (AZ)_j}{(AC)_j}. \qquad (2)$$

Hence, selection restricts the values the inference target can take, which gives rise to the definition of truncated Gaussians.

Definition 2. Let $\mu \in \mathbb{R}$, $\sigma^2 > 0$ and $a, b \in \mathbb{R}$ such that $a < b$. Then the cumulative distribution function of a Gaussian distribution $N(\mu, \sigma^2)$ truncated to the interval $[a, b]$ is given by
$$F^{[a,b]}_{\mu,\sigma^2}(x) = \frac{\Phi\big(\tfrac{x-\mu}{\sigma}\big) - \Phi\big(\tfrac{a-\mu}{\sigma}\big)}{\Phi\big(\tfrac{b-\mu}{\sigma}\big) - \Phi\big(\tfrac{a-\mu}{\sigma}\big)},$$
where $\Phi$ denotes the cdf of $N(0, 1)$.

Concluding the line of thought, we are now able to state a pivotal quantity that can be used for inference.

Theorem 3. Under the assumptions of Lemma 1, it holds that
$$F^{[V^-(Z),\, V^+(Z)]}_{\eta^T\mu,\, \eta^T\Sigma\eta}(\eta^T Y) \,\big|\, \{AY \le b\} \sim \mathrm{Unif}(0, 1),$$
where $V^-$ and $V^+$ are given by (1) and (2), respectively. This result corresponds to Theorem 5.2 of (Lee et al., 2016), which characterises the distribution of $\eta^T Y$ conditionally on $Y$ belonging to the polyhedron $\{AY \le b\}$.

2.2. Hilbert-Schmidt Independence Criterion

Gretton et al. (2005) proposed the Hilbert-Schmidt independence criterion as a model-free measure of the dependence between two random variables. It relies on embedding probability measures $P$ into a reproducing kernel Hilbert space $\mathcal{H}_k$ with associated kernel function $k$. If $k$ is universal, i.e. $\mathcal{H}_k$ is dense in the space of continuous functions, then the embedding $P \mapsto \mu_k(P)$ is injective. Hence, the squared distance between $\mu_k(P)$ and $\mu_k(Q)$, the maximum mean discrepancy, defines a metric, so that $P = Q \iff \|\mu_k(P) - \mu_k(Q)\|^2_{\mathcal{H}_k} = 0$ for any probability measures $P$ and $Q$; see (Smola et al., 2007) for further details on RKHSs. The Gaussian kernel, which is universal, is probably the most commonly used kernel and is defined as
$$k(x, y) = \exp\Big(-\frac{\|x - y\|_2^2}{2\sigma^2}\Big).$$

One way to detect the dependence between two random variables $X$ and $Y$ is to compare the joint distribution $P_{X,Y}$ with the product of the marginals $P_X P_Y$ using the maximum mean discrepancy. This metric is precisely the Hilbert-Schmidt independence criterion, which can also be defined in terms of the involved kernels as follows.

Definition 4. Let $X$ and $Y$ be random variables, $X'$ and $Y'$ independent copies, and $k$ and $l$ bounded kernels. The Hilbert-Schmidt independence criterion $\mathrm{HSIC}_{k,l}$ is given by
$$\mathrm{HSIC}_{k,l}(X, Y) = \mathbb{E}_{X,X',Y,Y'}\big[k(X, X')\, l(Y, Y')\big] + \mathbb{E}_{X,X'}\big[k(X, X')\big]\, \mathbb{E}_{Y,Y'}\big[l(Y, Y')\big] - 2\, \mathbb{E}_{X,Y}\Big[\mathbb{E}_{X'}\big[k(X, X')\big]\, \mathbb{E}_{Y'}\big[l(Y, Y')\big]\Big].$$

Several approaches for estimating the Hilbert-Schmidt independence criterion from a data sample $\{(x_j, y_j)\}_{j=1}^n$ have been put forward. Gretton et al. (2005) proposed a V-statistic based estimator $\widehat{\mathrm{HSIC}}_b(X, Y)$, which is biased, whereas Song et al. (2012) provided an unbiased U-statistic version $\widehat{\mathrm{HSIC}}_u(X, Y)$. Regarding the asymptotic distribution, Zhang et al. (2018) established that both estimators, scaled by $n^{1/2}$, converge to a Gaussian distribution if $X$ and $Y$ are dependent. For $X \perp Y$, however, the asymptotic distribution is not normal. Since the true dependence between $X$ and $Y$ is unknown, we focus on estimators that are asymptotically normal in either case. In this respect, it is helpful to express the unbiased version of the HSIC-estimator as a U-statistic of degree 4 with kernel function $h$, provided, e.g., in Theorem 3 of Song et al. (2012).
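To make the estimation step concrete, the following sketch computes the biased V-statistic estimate from Gram matrices with a Gaussian kernel. It is a minimal illustration rather than the authors' implementation: the median-heuristic variant used for the bandwidth and the $1/n^2$ normalisation are assumptions made here for concreteness (some references normalise by $1/(n-1)^2$ instead).

```python
import numpy as np


def gaussian_gram(x, bandwidth=None):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).

    If no bandwidth is given, one common variant of the median heuristic is
    used (sigma^2 = median of the positive squared distances / 2); this is an
    assumption of the sketch, not a prescription of the paper.
    """
    x = np.atleast_2d(np.asarray(x, dtype=float)).reshape(len(x), -1)
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    if bandwidth is None:
        bandwidth = np.sqrt(np.median(sq_dists[sq_dists > 0]) / 2)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))


def hsic_biased(K, L):
    """Biased V-statistic estimate tr(Gamma K Gamma  Gamma L Gamma) / n^2,
    with the centring matrix Gamma = Id - (1/n) 1 1^T."""
    n = K.shape[0]
    gamma = np.eye(n) - np.ones((n, n)) / n
    return np.trace(gamma @ K @ gamma @ gamma @ L @ gamma) / n ** 2
```

For a sample `x`, `y`, the call `hsic_biased(gaussian_gram(x), gaussian_gram(y))` then gives a plug-in estimate of $\mathrm{HSIC}(X, Y)$.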
Zhang et al. (2018) and Lim et al. (2020) suggested the following estimators.

Definition 5. Let $B \in \mathbb{N}$ and subdivide the data into folds of size $B$, $\{\{(x_i^b, y_i^b)\}_{i=1}^{B}\}_{b=1}^{n/B}$. The block estimator $\widehat{\mathrm{HSIC}}_{\mathrm{block}}$ with block size $B$ is given by
$$\widehat{\mathrm{HSIC}}_{\mathrm{block}}(X, Y) = \frac{1}{n/B} \sum_{b=1}^{n/B} \widehat{\mathrm{HSIC}}_u(X^b, Y^b), \qquad (3)$$
where $\widehat{\mathrm{HSIC}}_u(X^b, Y^b)$ denotes the unbiased estimate on the data $\{(x_i^b, y_i^b)\}_{i=1}^{B}$. Let $S_{n,4}$ be the set of all 4-subsets of $\{1, \ldots, n\}$ and let $D$ be a multiset containing $m$ elements of $S_{n,4}$ randomly chosen with replacement. Further, suppose $m = O(n)$ and define $l := \lim_{n,m \to \infty} m/n$. The incomplete U-statistic estimator $\widehat{\mathrm{HSIC}}_{\mathrm{inc}}$ of size $l$ is defined by
$$\widehat{\mathrm{HSIC}}_{\mathrm{inc}}(X, Y) = \frac{1}{m} \sum_{(i,j,q,r) \in D} h(i, j, q, r). \qquad (4)$$

Both estimators are asymptotically normal.

Theorem 6. Let $\{(x_j^{(1)}, \ldots, x_j^{(p)}, y_j)\}_{j=1}^{n}$ be an i.i.d. data sample and define $H^0 = (\mathrm{HSIC}(X^{(1)}, Y), \ldots, \mathrm{HSIC}(X^{(p)}, Y))^T$, and $H^{\mathrm{block}}$ and $H^{\mathrm{inc}}$ accordingly. Assume that $B$ and $l$ are the same for all entries of $H^{\mathrm{block}}$ and $H^{\mathrm{inc}}$, respectively, let $n/B \to \infty$ and choose $D$ independently for all entries of $H^{\mathrm{inc}}$. Then
$$\sqrt{n/B}\, \big(H^{\mathrm{block}} - H^0\big) \xrightarrow{D} N(0, \Sigma^{\mathrm{block}}), \qquad \sqrt{m}\, \big(H^{\mathrm{inc}} - H^0\big) \xrightarrow{D} N(0, \Sigma^{\mathrm{inc}}),$$
with positive definite matrices $\Sigma^{\mathrm{block}}$ and $\Sigma^{\mathrm{inc}}$.

The first statement is a direct consequence of the multidimensional central limit theorem; to prove the second asymptotic result, we use the framework of U-statistics. The details are deferred to the supplementary material.

3. PSI for HSIC-Lasso

This section contains our main theoretical results for PSI with HSIC-Lasso and addresses the difficulties arising in practical applications.

3.1. Feature Selection with HSIC-Lasso

Yamada et al. (2014) introduced the model-free HSIC-Lasso feature selection method as follows.

Definition 7. Let $\{(x_j^{(1)}, \ldots, x_j^{(p)}, y_j)\}_{j=1}^{n}$ be an i.i.d. data sample, $k$ and $l$ be kernels, let $K^{(s)}$ and $L$ be given by $K^{(s)}_{ij} = k(x_i^{(s)}, x_j^{(s)})$ and $L_{ij} = l(y_i, y_j)$ for $i, j \in \{1, \ldots, n\}$, and set $\Gamma = \mathrm{Id} - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T$. Denoting $\bar{L} = \Gamma L \Gamma$ and $\bar{K}^{(s)} = \Gamma K^{(s)} \Gamma$, $\hat{\beta}$ is given by
$$\hat{\beta} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p_+} \; \frac{1}{2}\Big\| \bar{L} - \sum_{s=1}^{p} \beta_s \bar{K}^{(s)} \Big\|^2_{\mathrm{Frob}} + \lambda \|\beta\|_1,$$
where $\lambda > 0$ and $\mathbb{R}_+ = [0, \infty)$. The HSIC-Lasso selection procedure is defined by $\hat{S} = \{s : \hat{\beta}_s > 0\}$.

For our purposes, however, we consider the following alternative representation of $\hat{\beta}$:
$$\hat{\beta} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p_+} \; -\sum_{s=1}^{p} \beta_s\, \widehat{\mathrm{HSIC}}_b(X^{(s)}, Y) + \frac{1}{2} \sum_{s,r=1}^{p} \beta_s \beta_r\, \widehat{\mathrm{HSIC}}_b(X^{(s)}, X^{(r)}) + \lambda \|\beta\|_1. \qquad (5)$$
It becomes apparent that feature selection is driven by three competing components. Considering the first and third terms together, we notice that covariates with high dependence on the response achieve positive $\hat{\beta}$-values, whereas the $\hat{\beta}$-entries of non-influential features are forced to zero by the regularisation term. Moreover, the second term penalises the selection of covariates showing high dependence on other features.

Taking (5) as a starting point, we replace the biased V-statistic based estimators with asymptotically normal ones and allow for a more general weighted Lasso penalty.

Definition 8. Let $\widehat{\mathrm{HSIC}}$ be an asymptotically Gaussian HSIC-estimator and $\widetilde{\mathrm{HSIC}}$ be any HSIC-estimator, and define $H$ and $M$ by $H_s = \widehat{\mathrm{HSIC}}(X^{(s)}, Y)$ and $M_{sr} = \widetilde{\mathrm{HSIC}}(X^{(s)}, X^{(r)})$ for $s, r \in \{1, \ldots, p\}$. The normal weighted HSIC-Lasso selection procedure is given by $\hat{S} = \{s : \hat{\beta}_s > 0\}$ with
$$\hat{\beta} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p_+} \; -\beta^T H + \frac{1}{2} \beta^T M \beta + \lambda\, \beta^T w, \qquad (6)$$
where $w \in \mathbb{R}^p_+$ is a fixed weight vector.
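To make the selection step concrete, here is a minimal sketch of a cyclic coordinate-descent solver for the quadratic programme (6). The closed-form non-negative coordinate update, the stopping rule and the tolerance are implementation choices of this sketch, not taken from the paper; the experiments in Section 4 instead rely on the Lasso and LARS routines of scikit-learn.

```python
import numpy as np


def normal_weighted_hsic_lasso(H, M, lam, w=None, n_iter=500, tol=1e-8):
    """Cyclic coordinate descent for (6):
        min_{beta >= 0}  -beta^T H + 0.5 * beta^T M beta + lam * w^T beta.

    Each coordinate admits a closed-form non-negative minimiser; M is assumed
    to have a strictly positive diagonal (e.g. after the projection described
    in Section 3.3.1).
    """
    p = len(H)
    w = np.ones(p) if w is None else np.asarray(w, dtype=float)
    beta = np.zeros(p)
    for _ in range(n_iter):
        max_change = 0.0
        for s in range(p):
            # Partial derivative excluding the s-th quadratic contribution.
            residual = H[s] - lam * w[s] - M[s] @ beta + M[s, s] * beta[s]
            new = max(0.0, residual / M[s, s])
            max_change = max(max_change, abs(new - beta[s]))
            beta[s] = new
        if max_change < tol:
            break
    return beta  # selected model: {s : beta[s] > 0}
```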
Using the asymptotically normal response $H$, the framework of the polyhedral lemma can be applied; however, we need to provide an asymptotic pivot.

Theorem 9. Let $(H_n)_{n \in \mathbb{N}}$, $(M_n)_{n \in \mathbb{N}}$ and $(\widehat{\Sigma}_n)_{n \in \mathbb{N}}$ be sequences of random vectors and matrices, respectively, such that $H_n \to N(\mu, \Sigma)$ in distribution, $M_n \to M$ and $\widehat{\Sigma}_n \to \Sigma$ almost surely, and $\widehat{\Sigma}_n \perp (H_n, M_n)$. For a selected model $S \in \mathcal{S}$, let $\eta_S$, $A_S$ and $b_S$ be a.s. continuous functions of $M$, assume that the selection events are given by $\{A_S(M_n)\, H_n \le b_S(M_n)\}$ and that $\mathrm{int}\big(\{A_S(M)\, H \le b_S(M)\}\big) \neq \emptyset$. Then
$$F^{[V^-(Z_n),\, V^+(Z_n)]}_{\eta_n^T \mu,\, \eta_n^T \widehat{\Sigma}_n \eta_n}(\eta_n^T H_n) \,\big|\, \{A_n H_n \le b_n\} \xrightarrow{D} \mathrm{Unif}(0, 1) \qquad (7)$$
as $n \to \infty$, where $\eta_n = \eta_S(M_n)$, $A_n = A_S(M_n)$ and $b_n = b_S(M_n)$.

This statement is tailored to selection with the normal weighted HSIC-Lasso and generalises Theorem 3, as it relaxes the requirements of a normal response and a known covariance $\Sigma$: under the assumption of an asymptotically Gaussian response and a consistent estimator of $\Sigma$, we obtain an asymptotic pivot. To prove this result, we show that $(H_n, M_n, \widehat{\Sigma}_n) \mid \{A_n H_n \le b_n\} \to (H, M, \Sigma) \mid \{AH \le b\}$ in distribution and then apply the continuous mapping theorem to the a.s. continuous function $F$. The details of the proof can be found in the supplement.

3.2. Inference Targets and Selection Event

We now define the inference targets and characterise the selection events of the normal weighted HSIC-Lasso. To do so, we first introduce the following notation. For any matrix $B \in \mathbb{R}^{q \times q}$, vector $v \in \mathbb{R}^q$ and index sets $I, J \subseteq \{1, \ldots, q\}$, we define $I^c := \{1, \ldots, q\} \setminus I$. Moreover, $v_I$ contains all entries at positions in $I$, and $B_{IJ} \in \mathbb{R}^{|I| \times |J|}$ is given by the rows and columns of $B$ whose indices are in $I$ and $J$, respectively. For a selected model $S$ and $j \in S$, we consider the HSIC-target $H_j := e_j^T H$ and the partial target $\hat{\beta}^{\mathrm{par}}_{j,S} := e_j^T M_{SS}^{-1} H_S$, where $e_j$ denotes the $j$-th unit vector. The former target describes the dependence between the response $Y$ and the feature $X^{(j)}$. In the spirit of a partial regression coefficient, the latter can be interpreted as the degree of influence of $X^{(j)}$ on $Y$, adjusted for the dependence structure among the covariates. Expressing both targets in the form $\eta_S^T H$, the respective $\eta$-vectors for the HSIC- and partial target are $e_j$ and $(M_{SS}^{-1} \,|\, 0)^T e_j$. We notice that the HSIC-target is influenced by the selection information $\{j \in \hat{S}\}$ only, whereas the partial target is, by definition, affected by the entire chosen set of covariates $\{\hat{S} = S\}$. We characterise these selection events as follows.

Theorem 10. Assume the same framework as in Definition 8, suppose that $M$ is positive definite and let $\eta \in \mathbb{R}^p$. Then $\{\hat{S} = S\} = \{A\, (H_S, H_{S^c})^T \le b\}$ holds, where
$$A = \begin{pmatrix} -M_{SS}^{-1} & 0 \\ -M_{S^c S}\, M_{SS}^{-1} & \mathrm{Id} \end{pmatrix}, \qquad b = \lambda \begin{pmatrix} -M_{SS}^{-1}\, w_S \\ w_{S^c} - M_{S^c S}\, M_{SS}^{-1}\, w_S \end{pmatrix}, \qquad (8)$$
and $0$ denotes a matrix of size $|S| \times |S^c|$ filled with zeros. Moreover, $\{j \in \hat{S}\} = \{AH \le b\}$ holds for
$$A = -e_j^T, \qquad b = -e_j^T M \hat{\beta}_{-j} - \lambda w_j, \qquad (9)$$
where $\hat{\beta}_{-j}$ denotes $\hat{\beta}$ with the $j$-th entry set to zero.

To prove these statements, we use the Karush-Kuhn-Tucker (KKT) conditions, which characterise the solution of (6) by a set of inequalities. Manipulating these, we obtain the affine linear representations of $\{\hat{S} = S\}$ and $\{j \in \hat{S}\}$. The details can be found in the supplementary material. Theorem 10 is the key result that allows us to carry out post-selection inference with HSIC-Lasso. In summary, we have to consider the distributions $e_j^T H \mid \{j \in \hat{S}\}$ and $e_j^T (M_{SS}^{-1} \,|\, 0)\, H \mid \{\hat{S} = S\}$ for the HSIC- and partial target, respectively, in order to account for the selection. With the affine linear representations (8) and (9), we can apply Theorem 9 and obtain an asymptotic pivot for inference.
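To illustrate how (9), Lemma 1 and the pivot (7) combine in practice, the following sketch computes a p-value for the HSIC-target of a selected feature. It is a minimal illustration rather than the authors' implementation: the one-sided alternative (rejecting for large HSIC values), the plug-in of the estimated covariance and the handling of unbounded truncation sides are assumptions made here.

```python
import numpy as np
from scipy.stats import norm


def truncation_interval(A, b, Sigma, eta, H):
    """Truncation bounds V-(Z), V+(Z) of Lemma 1 for the event {A H <= b}."""
    eta_var = float(eta @ Sigma @ eta)
    c = Sigma @ eta / eta_var          # C = Sigma eta (eta^T Sigma eta)^{-1}
    z = H - c * (eta @ H)              # Z = (Id - C eta^T) H
    ac, az = A @ c, A @ z
    neg, pos = ac < 0, ac > 0
    v_minus = np.max((b[neg] - az[neg]) / ac[neg]) if np.any(neg) else -np.inf
    v_plus = np.min((b[pos] - az[pos]) / ac[pos]) if np.any(pos) else np.inf
    return v_minus, v_plus


def truncated_gaussian_cdf(x, mean, var, a, b):
    """Cdf of N(mean, var) truncated to [a, b] (Definition 2)."""
    sd = np.sqrt(var)
    num = norm.cdf((x - mean) / sd) - norm.cdf((a - mean) / sd)
    den = norm.cdf((b - mean) / sd) - norm.cdf((a - mean) / sd)
    return num / den


def hsic_target_pvalue(j, H, M, beta_hat, Sigma_hat, lam, w):
    """One-sided p-value for H0: HSIC(X^(j), Y) = 0 given {j in S_hat},
    using the affine representation (9) and the asymptotic pivot (7)."""
    p = len(H)
    eta = np.zeros(p)
    eta[j] = 1.0
    beta_no_j = beta_hat.copy()
    beta_no_j[j] = 0.0
    A = -eta.reshape(1, p)                               # A = -e_j^T
    b = np.array([-(M[j] @ beta_no_j) - lam * w[j]])     # b = -e_j^T M beta_{-j} - lam w_j
    v_minus, v_plus = truncation_interval(A, b, Sigma_hat, eta, H)
    cdf = truncated_gaussian_cdf(H[j], 0.0, eta @ Sigma_hat @ eta, v_minus, v_plus)
    return 1.0 - cdf
```

The partial target is handled analogously, with $\eta = (M_{SS}^{-1} \,|\, 0)^T e_j$ and the block matrices of (8) in place of the single constraint above.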
3.3. Practical Applications

Equipped with these theoretical results, we now propose an algorithm that handles the difficulties arising in practical applications.

3.3.1. POSITIVE DEFINITENESS

Theorem 10 requires $M$ to be positive definite. For the original version of HSIC-Lasso (5), this condition is always fulfilled by the structure of the biased HSIC-estimates. However, there is no guarantee for other estimation procedures. For this reason, we project $M$ onto the space of positive definite matrices, as proposed by Higham (1988): the spectral decomposition of $M$ is computed and all negative eigenvalues are replaced with a small positive value $\varepsilon > 0$.

3.3.2. COMPUTATIONAL COSTS

HSIC-Lasso is frequently applied to high-dimensional data where the number of covariates $p$ exceeds the sample size $n$. The resulting high computational costs are caused by the calculation of the HSIC-estimates, since $H$ grows as $O(p)$ and $M$ as $O(p^2)$. Therefore, we introduce an upstream screening stage that identifies a subset of potentially influential features, so that HSIC-Lasso only has to deal with these. Following the approach of Yamada et al. (2018), we compute the HSIC-estimates $\widehat{\mathrm{HSIC}}(Y, X^{(j)})$, $j \in \{1, \ldots, p\}$, and select a pre-fixed number $p' < p$ of the covariates having the highest estimates. We call this HSIC-ordering. In order to ensure valid inference results, we have to adjust for the screening step as well, because it affects feature selection. To do so, we split the data into two folds, one dedicated to screening, the other to HSIC-Lasso selection among the screened variables, cf. (Cox, 1975). Thus, potentially distorting effects of the screening step are separated from inference on the second fold. Moreover, unbiased HSIC-estimates can be used for screening, which are more precise than block or incomplete U-statistic estimates.

Remark. In future applications, random Fourier features could be used to speed up the kernel computation in the objective function (5), allowing for a larger $p'$. However, it is not immediately clear whether the theoretical guarantees of our method carry over to approximated kernel functions.

3.3.3. HYPER-PARAMETER CHOICE

In practice, a suitable choice of the regularisation parameter $\lambda$ and the weight vector $w$ is key for meaningful results. Since the data generating process is unknown, we have to estimate these hyper-parameters. In order to prevent this from affecting the inference results, we use the first fold for hyper-parameter selection. In doing so, we can apply any estimation method for $\lambda$, such as cross-validation, e.g. (Stone, 1974), or the AIC, cf. (Akaike, 1974), and obtain a valid procedure that is easy to implement. Moreover, we can employ Zou's adaptive Lasso penalty (2006), which uses the weight vector $w = 1/|\hat{\beta}|^\gamma$, where $\gamma$ is typically set to 1.5 or 2 and $\hat{\beta}$ is a $\sqrt{n}$-consistent estimator, e.g. the ordinary least squares estimator. Contrary to the vanilla Lasso, this method satisfies the oracle property, that is, the sparsity-based estimator recovers the true underlying sparse model and has an asymptotically normal distribution. Yet, this property was only proven for a covariance matrix of the form $\Sigma = \sigma^2 \mathrm{Id}$. Hence, we have to evaluate the usefulness of the adaptive Lasso in empirical simulations.
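As an illustration of the two adjustments just described, here is a minimal sketch of the eigenvalue-clipping projection of Section 3.3.1 and of the adaptive-Lasso weights of Section 3.3.3. The symmetrisation step, the default $\varepsilon$ and the small floor guarding against division by zero are assumptions of this sketch rather than choices specified in the paper.

```python
import numpy as np


def project_positive_definite(M, eps=1e-6):
    """Replace negative eigenvalues of the symmetrised M by a small eps > 0,
    following the projection described in Section 3.3.1 (cf. Higham, 1988)."""
    M_sym = (M + M.T) / 2
    eigval, eigvec = np.linalg.eigh(M_sym)
    eigval = np.maximum(eigval, eps)
    return eigvec @ np.diag(eigval) @ eigvec.T


def adaptive_weights(beta_init, gamma=1.5, floor=1e-10):
    """Adaptive-Lasso weights w = 1 / |beta_init|^gamma (Zou, 2006), where
    beta_init is a root-n-consistent pilot estimate computed on the first fold."""
    return 1.0 / np.maximum(np.abs(beta_init), floor) ** gamma
```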
3.3.4. ALGORITHM

We summarise our proposed PSI method for a normal (weighted) HSIC-Lasso selection procedure as follows. To begin with, we split the data into two subsets. On the first fold, we compute the HSIC-estimates between all covariates and the response, determine the most influential $p'$ features (screening) and estimate the hyper-parameters $\lambda$ and $w$ with the methods previously specified. On the second fold, we compute the estimates of $H$, $M$ and $\widehat{\Sigma}$ for the screened features and solve the optimisation problem (6). For all $j \in \{1, \ldots, p'\}$ such that $\hat{\beta}_j > 0$, we find the truncation points for the specified targets with Theorem 10 and Lemma 1, and test with the asymptotic pivot (7) whether the targets are significant at a given level $\alpha$. The supplementary material contains a more detailed description of the algorithm in pseudocode.

Remark. Although Theorem 9 requires independence between the covariance estimate and the $(H, M)$-estimate, we did not observe any detrimental influence when $\widehat{\Sigma}$ is computed as outlined above.

4. Experiments

In this section, we illustrate our theoretical contribution and the proposed algorithm on artificial and real-world data. The source code for the experiments was implemented in Python and relies on the mskernel package of Lim et al. (2020). We use the Lasso optimisation routines of scikit-learn, which implement the cyclical coordinate descent algorithm, cf. (Friedman et al., 2007), and the least angle regression algorithm (LARS) proposed by Efron et al. (2004). The source code for the following experiments is available on GitHub: tobias-freidling/hsic-lasso-psi.

4.1. Artificial Data

We examine the achieved type-I error of the proposed algorithm, compare its power to other approaches for post-selection inference and briefly discuss additional experiments. For continuous data, we use Gaussian kernels where the bandwidth parameter is chosen according to the median heuristic, cf. (Schölkopf & Smola, 2018); for discrete data with $n_c$ samples in category $c$, we apply the normalised delta kernel given by
$$l(y, y') = \begin{cases} 1/n_c, & \text{if } y = y' = c, \\ 0, & \text{otherwise}. \end{cases}$$
Moreover, we use a quarter of the data for the first fold, select the hyper-parameter $\lambda$ by 10-fold cross-validation with the MSE, use a non-adaptive Lasso penalty and do not conduct screening, as the number of considered features is already small enough. On the second fold, we estimate $M$ with the block estimator of size $B = 10$, as it is computationally less expensive than the unbiased estimator and leads to similar results. The covariance matrix $\Sigma$ of $H$ is estimated based on the summands of the block (3) and incomplete U-statistic (4) estimators, respectively. To this end, we use the oracle approximating shrinkage (OAS) estimator of Chen et al. (2010), which is particularly tailored to high-dimensional Gaussian data. We fix the significance level at $\alpha = 0.05$ and simulate 100 datasets for each considered sample size.

4.1.1. TYPE-I ERROR

In order to assess the achieved type-I error, we generate data from the toy models
(M1) $Y \sim \mathrm{Ber}\big(g\big(\sum_{i=1}^{10} X_i\big)\big)$, $X \sim N(0_{50}, \Xi)$, $g(x) = e^x/(1 + e^x)$,
(M2) $Y = \sum_{i=1}^{5} X_i X_{i+5} + \varepsilon$, $X \sim N(0_{50}, \Xi)$, $\varepsilon \sim N(0, \sigma^2)$,
where $0_{50} \in \mathbb{R}^{50}$ and $\Xi \in \mathbb{R}^{50 \times 50}$. These models are clearly non-linear and cover categorical and continuous responses. In (M2), we choose $\sigma^2$ such that the variance of $\varepsilon$ is a fifth of the variance of the $X$-dependent terms of $Y$, amounting to a noise-to-signal ratio of 0.2. As for the covariance matrix $\Xi$, two cases are considered: we either set $\Xi = \mathrm{Id}$ or use decaying correlation, i.e. $\Xi_{ij} = 0.5^{|i-j|}$. We simulate datasets with sample sizes $n \in \{400, 800, 1200, 1600\}$ for all combinations of models and covariance matrices, and estimate $H$ with block estimators of sizes 5 and 10 as well as with an incomplete U-statistic estimator of size $l = 1$.
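For concreteness, a minimal sketch of how data from (M1) and (M2) can be generated is given below. The construction of $\Xi$ and the way $\sigma^2$ is matched to the noise-to-signal ratio of 0.2 (via the empirical variance of the signal) are implementation choices of this sketch, not a description of the authors' simulation code.

```python
import numpy as np


def decaying_cov(p=50, rho=0.5):
    """Covariance matrix Xi with Xi_ij = rho^|i-j|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def simulate_m1(n, Xi, rng):
    """(M1): Y ~ Ber(g(sum_{i<=10} X_i)) with the logistic function g."""
    X = rng.multivariate_normal(np.zeros(Xi.shape[0]), Xi, size=n)
    logits = X[:, :10].sum(axis=1)
    Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))
    return X, Y


def simulate_m2(n, Xi, rng, noise_to_signal=0.2):
    """(M2): Y = sum_{i<=5} X_i X_{i+5} + eps, with Var(eps) equal to a fifth
    of the (empirical) variance of the X-dependent part."""
    X = rng.multivariate_normal(np.zeros(Xi.shape[0]), Xi, size=n)
    signal = np.sum(X[:, :5] * X[:, 5:10], axis=1)
    eps = rng.normal(0.0, np.sqrt(noise_to_signal * np.var(signal)), size=n)
    return X, signal + eps


# Example: one dataset of size n = 800 with decaying correlation.
# X, Y = simulate_m2(800, decaying_cov(), np.random.default_rng(0))
```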
Since the partial target depends on the entire set of selected variables and its value cannot be directly inferred from the data-generating mechanism, it is inherently hard to rigorously assess the type-I error of any given partial target; however, the false positive rate observed when testing different partial targets suggests that it is close to 0.05. For this reason, we concentrate on the HSIC-target in our analysis. For both models, $\mathrm{HSIC}(Y, X^{(j)}) = 0$ holds for $j \in \{11, \ldots, 50\}$, which allows us to estimate the type-I error as the ratio of null-hypothesis rejections to tests among the selected features with indices in $\{11, \ldots, 50\}$. If influential variables are correlated with uninfluential ones, the HSIC-value is not precisely zero; nonetheless, the use of decaying correlation renders the bias of this effect negligible. Figure 1 illustrates that the type-I error is close to 0.05 across all estimators and data generating mechanisms, even for small sample sizes.

Figure 1. Type-I error for the HSIC-target in different toy models (from left to right): (M1) with $\Xi = \mathrm{Id}$, (M1) with $\Xi_{ij} = 0.5^{|i-j|}$, (M2) with $\Xi = \mathrm{Id}$, (M2) with $\Xi_{ij} = 0.5^{|i-j|}$; block estimators with $B = 5$ and $B = 10$ and the incomplete U-statistic estimator with $l = 1$ are compared over sample sizes 400 to 1600.

4.1.2. POWER

In this set of experiments, we adapt the toy model (M1), replacing $X_1$ by $\theta X_1$ and setting $\Xi = \mathrm{Id}$, and denote it (M1'). Moreover, we introduce the following linear and modified linear models:
(M3) $Y = \theta X_1 + \sum_{i=2}^{10} X_i + \varepsilon$, $X \sim N(0_{50}, \mathrm{Id})$, $\varepsilon \sim N(0, \sigma^2)$,
(M4) $Y = \theta\, h(X_1) + \sum_{i=2}^{10} X_i + \varepsilon$, $X \sim N(0_{50}, \mathrm{Id})$, $h(x) = x - x^3$, $\varepsilon \sim N(0, \sigma^2)$,
where $\theta \in \mathbb{R}$. Our proposed algorithm is applied with both a block estimator, $B = 10$, and an incomplete U-statistic estimator, $l = 1$, and is compared with the so-called Multi PSI approach presented by Lim et al. (2020): we select $k$ features with HSIC-ordering and carry out inference for the HSIC-targets. (Since $M$ is not involved in the feature selection, we cannot define partial targets for HSIC-ordering.) It was empirically shown that multiscale bootstrapping, first presented by Shimodaira (2004) and abbreviated by Multi, is a more powerful PSI approach for HSIC-ordering than truncated Gaussians. In our simulations, we set $k = 15$ and applied Multi with a block estimator, $B = 10$, as well as an incomplete U-statistic estimator, $l = 1$. Additionally, we applied Lee et al.'s original PSI method (2016), which relies on Lasso regularisation and assumes a linear regression setting, to (M3) and (M4). The inference target in this case is the partial regression coefficient.

We simulate datasets for values of $\theta$ in $\{0.00, 0.33, 0.67, 1.00, 1.33, 1.67, 2.00, 2.33\}$ and a sample size of $n = 800$, and compute the ratio of rejections of the null hypothesis, i.e. that the respective inference target corresponding to $X_1$ is zero, to the number of tests carried out. Plotting the obtained ratios against $\theta$ does not correspond to the usual depiction of the power function, as $\theta$ is not the inference target for all considered procedures. However, it allows for an intuitive understanding of how strongly $X_1$ needs to influence $Y$ in order to be detected.

Figure 2. Power of detecting $X_1$ as an influential feature in different toy models (from left to right): (M1'), (M3), (M4); our proposal (block, $B = 10$; incomplete, $l = 1$) is compared with Multi (block, $B = 10$; incomplete, $l = 1$) and with linear-model PSI.

Figure 2 shows that the power of our proposed algorithm is similar to that of the Multi procedure, especially when using the block estimator. This confirms that, even without a manual choice of the number of selected features and without costly bootstrap sampling, it is possible to match the best performing model-free PSI methods. Moreover, we observe that regularised linear regression clearly outperforms both our procedure and Multi for small values of $\theta$ if the data-generating process is indeed linear. However, when $X_1$ influences the outcome $Y$ not only through a linear but also through a cubic term, as in (M4), PSI based on a linear model has no power at all, whereas the model-agnostic methods still achieve noticeable power. This exemplifies that PSI procedures built upon a certain model can only be confidently used if there is limited uncertainty about the underlying data generation.
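The ratios reported in Figures 1 and 2 can be aggregated from the per-dataset PSI output as in the following sketch. Pooling tests over simulated datasets follows the description above, while the exact bookkeeping (0-based indices, the handling of datasets without relevant tests) is an assumption of this sketch.

```python
import numpy as np


def rejection_rate(selected, p_values, targets, alpha=0.05):
    """Share of rejected null hypotheses among all tests of the features in
    `targets`, pooled over simulated datasets.

    selected / p_values: one entry per dataset, holding the selected feature
    indices and the corresponding p-values.  With targets = range(10, 50)
    (0-based indices of the uninfluential features in (M1)/(M2)) this yields
    the type-I error of Figure 1; with targets = {0} (the feature X_1) it
    yields the rejection ratios of Figure 2.
    """
    rejections, tests = 0, 0
    for sel, pv in zip(selected, p_values):
        for j, p in zip(sel, pv):
            if j in targets:
                tests += 1
                rejections += (p <= alpha)
    return rejections / tests if tests else float("nan")
```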
4.1.3. ADDITIONAL EXPERIMENTS

In our simulations, we focused on the statistical properties of the proposed approach but also started investigating the behaviour of the algorithm for different feature-selection set-ups. The conclusions we drew deserve a few comments.

Screening: HSIC-ordering includes the influential features in the screened set with high probability when $p'$ and the size of the first fold are sufficiently large. Nonetheless, screening is merely a method to reduce computational complexity and can potentially weaken the performance of the downstream HSIC-Lasso procedure. Therefore, the number of screened features $p'$ should be set as high as computational resources allow.

Regularisation: The use of an adaptive penalty term generally leads to fewer selected features than the vanilla Lasso regularisation. Moreover, cross-validation and the AIC often choose similar values for $\lambda$.

Dependence structure: In datasets with strong correlation between influential and uninfluential features, we observe that the partial target is capable of correctly detecting the uninfluential ones. How the dependence structure among influential features materialises in rejections of the null hypothesis, however, is a subtle and still open question.

Estimators: In general, block estimators are computationally less expensive than incomplete U-statistic estimators. For the latter, the calculation costs, but also the power, increase with the size $l$.

Split ratio: In the experiments with artificial data, the size of the first fold was set to a third of the second fold, as this already suffices to obtain a decent estimate of the regularisation parameter. For most datasets it is advisable to dedicate more data to the HSIC-Lasso procedure than to the hyper-parameter selection; however, a reliable heuristic for the split ratio remains subject to future research.

4.2. Benchmarks

We now apply our proposed algorithm to benchmark datasets from the UCI Repository and the Broad Institute's Single Cell Portal, respectively. We provide an additional, more in-depth experiment in the supplementary material.

4.2.1. TURKISH STUDENT DATASET

This dataset contains 5,820 course evaluation scores provided by students from Gazi University, Ankara, who answered 28 questions on a five-level Likert scale, see further (Gunduz & Fokoue, 2013). For our experiment, we use the perceived difficulty of the course as the response variable. This data was previously analysed by Yamada et al. (2018), who selected features with HSIC-ordering and used the familiar framework of the polyhedral lemma and truncated Gaussians for PSI, denoted by Poly. To accord with Yamada et al., we use the block estimator of size 10 and set $k = 10$ for the Multi approach, and we report their findings along with ours.
For our proposal, the first fold contains 20% of the data, we select $\lambda$ with 10-fold cross-validation and do not carry out screening, as the number of features ($p = 28$) is manageable. Table 1 summarises our findings.

Table 1. P-values of the HSIC-target for selected features in the Turkish student dataset.
Feature   Proposal   Multi     Poly
Q2        0.021      -         0.452
Q3        -          0.782     -
Q11       0.004      -         -
Q13       -          -         0.018
Q14       -          0.001     -
Q15       -          0.095     -
Q17       <0.001     <0.001    0.033
Q18       -          -         0.186
Q19       -          <0.001    -
Q20       -          0.004     0.463
Q21       -          0.032     0.033
Q22       -          <0.001    0.042
Q23       -          -         0.037
Q25       -          0.002     -
Q26       -          -         0.176
Q28       0.004      0.041     <0.001

First, we notice that Multi and Poly pick different features despite sharing the same selection procedure, which can probably be attributed to the randomisation carried out by Yamada et al. (2018). Moreover, we observe that HSIC-Lasso chooses a very parsimonious model with only four covariates, whose associated HSIC-targets are highly significant. Among the tested approaches, there is moderate agreement on the influential covariates: only Q17 ("The Instructor arrived on time for classes.") and Q28 ("The Instructor treated all students in a right and objective manner.") are unanimously chosen and found to be significant. Considering the partial targets for HSIC-Lasso, we find that only $\hat{\beta}^{\mathrm{par}}_{17,S}$ appears significant. The differing results may hint that the Turkish student dataset is intrinsically noisy or that methods based on HSIC-estimation and the polyhedral lemma are unstable. However, the Lasso selection of HSIC-Lasso, unlike Multi or Poly, penalises correlated features, which affects PSI as well. Hence, different results for the compared methods do not necessarily indicate incorrectness of either approach.

4.2.2. SINGLE-CELL RNASEQ DATA

Villani et al. (2017) isolated around 2,400 blood cells, enriched in two particular kinds of leukocytes: dendritic cells (DCs) and monocytes. They then measured the gene expression of every cell using single-cell RNAseq, aiming to describe the diversity between, and within, those two cell types based on their gene expression profiles. They end up defining 10 different subclasses: 6 types of DCs and 4 types of monocytes. In our experiment, we use 1,078 samples of this data, aspiring to find the genes that separate these 10 classes among the 26,593 genes. We standardise the single-cell RNAseq data gene-wise and impute missing gene expressions with MAGIC, see further (van Dijk et al., 2018). Since the response is categorical, we use the normalised delta kernel. Unlike the Turkish student dataset, we are now confronted with a considerably high-dimensional problem where the number of features greatly exceeds the sample size. Therefore, screening becomes more challenging, and we consequently split the data evenly into first and second fold. We screen 1,000 potentially influential features and apply the incomplete U-statistic estimator with a large size of 20, hoping to better capture the potentially involved dependence structure. The remaining parameters were set as in the previous experiment.

Table 2. P-values of the partial targets corresponding to selected features in the single-cell RNAseq dataset.
Gene        P-value    Gene        P-value
ACTB        0.961      IGJ         <0.001
CD14        0.026      LYZ         <0.001
FCER1A      <0.001     MTRNR2L2    0.420
FCGR3A      0.001      RPS3A       <0.001
FTL         0.968      TMSB4X      0.012
HLA-DPA1    <0.001     TVAS5       0.553
IFI30       0.002
For an in-depth analysis of the dataset, we recommend conducting a sensitivity analysis that investigates the behaviour of the method for different values of the parameters, such as the split ratio, the number of screened features or the size of the estimator. We find that HSIC-Lasso selects 13 features and that all of the associated HSIC-targets and most of the partial targets are significant, cf. Table 2. One of the traditionally defining characteristics of monocytes is the expression of the CD14 protein; encouragingly, HSIC-Lasso selected this gene as a discriminating feature. In fact, it also selected six other genes which Villani et al. used in multiple cell signatures: FCGR3A (DC4), FCER1A (DC2 and DC3), FTL (DC4), IFI30 (cDC-like), IGJ (pDC-like), and LYZ (cDC-like). Jointly, this shows the ability of HSIC-Lasso to recover multiple genes used to define the classes. More exciting, however, are the genes which were selected by HSIC-Lasso but not used in the original study. These might point to new molecular signatures and functions that differentiate these cell types. For instance, one of these genes plays a role in immunity: HLA-DPA1, which antigen-presenting cells like DCs use to present exogenous proteins to other immune cells. The proposed PSI framework adds nuance to this picture by providing a soft ranking of the selected genes. Notably, FTL and ACTB have p-values close to 1, which suggests that most of the information they provide is already captured by other selected genes. Hence, other genes should be prioritised in hypothetical downstream experiments. Our method contributes to the detection of discriminatory features inasmuch as it facilitates formulating confidence statements about the obtained results.

5. Discussion and Outlook

Post-selection inference was initially proposed for linear regression models with Lasso regularisation and subsequently extended to generalised regression settings and Lasso variants, see for example (Taylor & Tibshirani, 2018) and (Hyun et al., 2018). However, knowledge of an underlying model seemed to be necessary in order to properly account for the selection process. Yamada et al. (2018) overcame this limitation by capturing the unknown dependence between variables via the model-agnostic Hilbert-Schmidt independence criterion and embedding the estimates into the formerly proposed PSI framework using asymptotic normality. However, this approach still requires the user to decide how many features to select. Our method chooses features in a data-driven manner and correctly accounts for the selection process, and is thus ideal for situations with limited knowledge about the structure of the data.

Extending the theoretical framework to allow for a sequential application of HSIC-Lasso with different values of $\lambda$, similar to Tibshirani et al.'s (2016) treatment of the least angle regression algorithm, is an interesting and practically relevant step for future research. Moreover, Liu et al. (2018) hint that developing inference targets other than the partial and HSIC-target can be useful to reduce the length of confidence intervals. Lastly, incorporating the choice of $\lambda$ into the post-selection framework for HSIC-Lasso would allow the data to be analysed on a single fold and render the proposed framework fully in line with the PSI philosophy. Loftus (2015) and Markovic et al. (2017) took first steps in this direction; the application to HSIC-Lasso, however, is still an open issue.
From an algorithmic point of view, establishing heuristics for a good choice of hyper-parameters, such as the size of the estimators or the split ratio, can likewise be the object of future research.

Acknowledgments

Tobias Freidling thanks the Max Weber Program and Erasmus+ for supporting his stay at Kyoto University with a scholarship, and Mathias Drton for an insightful discussion on Theorem 9. Makoto Yamada was supported by MEXT KAKENHI 20H04243 and partly supported by MEXT KAKENHI 21H04874. Benjamin Poignard was supported by the Japanese Society for the Promotion of Science (Start Support Kakenhi 19K23193).

References

Akaike, H. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19:716-723, 1974.
Berk, R., Brown, L., Buja, A., Zhang, K., and Zhao, L. Valid post-selection inference. The Annals of Statistics, 41(2):802-837, 2013.
Chen, Y., Wiesel, A., Eldar, Y. C., and Hero, A. O. Shrinkage algorithms for MMSE covariance estimation. IEEE Transactions on Signal Processing, 58(10):5016-5029, 2010.
Cox, D. R. A note on data-splitting for the evaluation of significance levels. Biometrika, 62:441-444, 1975.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. Least angle regression (with discussion). The Annals of Statistics, 32(2):407-499, 2004.
Friedman, J., Hastie, T., Höfling, H., and Tibshirani, R. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302-332, 2007.
Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. Measuring statistical dependence with Hilbert-Schmidt norms. In Algorithmic Learning Theory (ALT 2005), pp. 63-77. Berlin: Springer, 2005.
Gunduz, N. and Fokoue, E. UCI machine learning repository, 2013.
Higham, N. J. Computing a nearest symmetric positive semidefinite matrix. Linear Algebra and its Applications, 103:103-118, 1988.
Hocking, R. R. A Biometrics invited paper. The analysis and selection of variables in linear regression. Biometrics, 32(1):1-49, 1976.
Hyun, S., G'Sell, M., and Tibshirani, R. J. Exact post-selection inference for the generalized lasso path. Electronic Journal of Statistics, 12(1):1053-1097, 2018.
Lee, J. D., Sun, D. L., Sun, Y., and Taylor, J. E. Exact post-selection inference, with application to the Lasso. The Annals of Statistics, 44(3):907-927, 2016.
Leeb, H. and Pötscher, B. M. Model selection and inference: facts and fiction. Econometric Theory, 21(1):21-59, 2005.
Leeb, H. and Pötscher, B. M. Can one estimate the conditional distribution of post-model-selection estimators? The Annals of Statistics, 34(5):2554-2591, 2006.
Lim, J. N., Yamada, M., Jitkrittum, W., Terada, Y., Matsui, S., and Shimodaira, H. More powerful selective kernel tests for feature selection. In Proceedings of Machine Learning Research, volume 108, pp. 820-830. PMLR, 2020.
Liu, K., Markovic, J., and Tibshirani, R. More powerful post-selection inference, with application to the Lasso. arXiv preprint, 1801.09037, 2018.
Loftus, J. R. Selective inference after cross-validation. arXiv preprint, 1511.08866, 2015.
Markovic, J., Xia, L., and Taylor, J. Unifying approach to selective inference with applications to cross-validation. arXiv preprint, 1703.06559, 2017.
Schölkopf, B. and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, MA: MIT Press, 2018.
Shimodaira, H. Approximately unbiased tests of regions using multistep-multiscale bootstrap resampling. The Annals of Statistics, 32(6):2616-2641, 2004.
Smola, A., Gretton, A., Song, L., and Schölkopf, B. A Hilbert space embedding for distributions. In Algorithmic Learning Theory (ALT 2007), pp. 13-31. Berlin: Springer, 2007.
Song, L., Smola, A., Gretton, A., Bedo, J., and Borgwardt, K. Feature selection via dependence maximization. Journal of Machine Learning Research, 13:1393-1434, 2012.
Stone, M. Cross-validatory choice and assessment of statistical predictions (with discussion). Journal of the Royal Statistical Society, Series B, 36:111-147, 1974.
Taylor, J. and Tibshirani, R. Post-selection inference for ℓ1-penalized likelihood models. The Canadian Journal of Statistics, 46(1):41-61, 2018.
Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.
Tibshirani, R. J., Taylor, J., Lockhart, R., and Tibshirani, R. Exact post-selection inference for sequential regression procedures. Journal of the American Statistical Association, 111(514):600-620, 2016.
van Dijk, D., Sharma, R., Nainys, J., Yim, K., Kathail, P., Carr, A. J., Burdziak, C., Moon, K. R., Chaffer, C. L., Pattabiraman, D., Bierie, B., Mazutis, L., Wolf, G., Krishnaswamy, S., and Pe'er, D. Recovering gene interactions from single-cell data using data diffusion. Cell, 174(3):716-729.e27, 2018.
Villani, A.-C., Satija, R., Reynolds, G., Sarkizova, S., Shekhar, K., Fletcher, J., Griesbeck, M., Butler, A., Zheng, S., Lazo, S., Jardine, L., Dixon, D., Stephenson, E., Nilsson, E., Grundberg, I., McDonald, D., Filby, A., Li, W., De Jager, P. L., Rozenblatt-Rosen, O., Lane, A. A., Haniffa, M., Regev, A., and Hacohen, N. Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors. Science, 356(6335), 2017.
Yamada, M., Jitkrittum, W., Sigal, L., Xing, E. P., and Sugiyama, M. High-dimensional feature selection by feature-wise kernelized Lasso. Neural Computation, 26(1):185-207, 2014.
Yamada, M., Umezu, Y., Fukumizu, K., and Takeuchi, I. Post selection inference with kernels. In Proceedings of Machine Learning Research, volume 84, pp. 152-160. PMLR, 2018.
Zhang, Q., Filippi, S., Gretton, A., and Sejdinovic, D. Large-scale kernel methods for independence testing. Statistics and Computing, 28(1):113-130, 2018.
Zou, H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418-1429, 2006.