# Selective inference for group-sparse linear models

Fan Yang (Department of Statistics, University of Chicago, fyang1@uchicago.edu) · Rina Foygel Barber (Department of Statistics, University of Chicago, rina@uchicago.edu) · Prateek Jain (Microsoft Research India, prajain@microsoft.com) · John Lafferty (Depts. of Statistics and Computer Science, University of Chicago, lafferty@galton.uchicago.edu)

We develop tools for selective inference in the setting of group sparsity, including the construction of confidence intervals and p-values for testing selected groups of variables. Our main technical result gives the precise distribution of the magnitude of the projection of the data onto a given subspace, and enables us to develop inference procedures for a broad class of group-sparse selection methods, including the group lasso, iterative hard thresholding, and forward stepwise regression. We give numerical results to illustrate these tools on simulated data and on health record data.

## 1 Introduction

Significant progress has recently been made on developing inference tools to complement the feature selection methods that have been intensively studied in the past decade [6, 5, 9]. The goal of selective inference is to make accurate uncertainty assessments for the parameters estimated using a feature selection algorithm, such as the lasso [12]. The fundamental challenge is that after the data have been used to select a set of coefficients to be studied, this selection event must then be accounted for when performing inference using the same data. A specific goal of selective inference is to provide p-values and confidence intervals for the fitted coefficients. Because the sparsity pattern is chosen using nonlinear estimators, the distribution of the estimated coefficients is typically non-Gaussian and multimodal, even under a standard Gaussian noise model, making classical techniques unusable for accurate inference. It is of particular interest to develop finite-sample, non-asymptotic results.

In this paper, we present new results for selective inference in the setting of group sparsity [15, 3, 10]. We consider the linear model $Y = X\beta + N(0, \sigma^2 I_n)$, where $X \in \mathbb{R}^{n \times p}$ is a fixed design matrix. In many applications, the $p$ columns or features of $X$ are naturally grouped into blocks $C_1, \dots, C_G \subseteq \{1, \dots, p\}$. In the high-dimensional setting, the working assumption is that only a few of the corresponding blocks of the coefficients $\beta$ contain nonzero elements; that is, $\beta_{C_g} = 0$ for most groups $g$. This group-sparse model can be viewed as an extension of the standard sparse regression model. Algorithms for fitting this model, such as the group lasso [15], extend well-studied methods for sparse linear regression to this grouped setting.

In the group-sparse setting, recent results of Loftus and Taylor [9] give a selective inference method for computing p-values for each group chosen by a model selection method such as forward stepwise regression; selection via cross-validation was studied in [7]. More generally, the inference technique of [9] applies to any model selection method whose outcome can be described in terms of quadratic conditions on $Y$. However, their technique cannot be used to construct confidence intervals for the selected coefficients or for the size of the effects of the selected groups.
Our main contribution in this work is to provide a tool for constructing confidence intervals as well as p-values for testing selected groups. In contrast to the (non-grouped) sparse regression setting, the confidence interval construction does not follow immediately from the p-value calculation, and requires a careful analysis of non-centered multivariate normal distributions. Our key technical result precisely characterizes the density of $\|P_L Y\|_2$ (the magnitude of the projection of $Y$ onto a given subspace $L$), conditioned on a particular selection event. This truncated projection lemma is the group-wise analogue of the polyhedral lemma of Lee et al. [5] for the lasso. This technical result enables us to develop inference tools for a broad class of model selection methods, including the group lasso [15], iterative hard thresholding [1, 4], and forward stepwise group selection [14].

In the following section we frame the problem of group-sparse inference more precisely, and present previous results in this direction. We then state our main technical results; the proofs of the results are given in the supplementary material. In Section 3 we show how these results can be used to develop inferential tools for three different model selection algorithms for group sparsity. In Section 4 we give numerical results to illustrate these tools on simulated data, as well as on the California county health data used in previous work [9]. We conclude with a brief discussion of our work.

## 2 Main results: selective inference over subspaces

To establish some notation, we will write $P_L$ for the projection onto any linear subspace $L \subseteq \mathbb{R}^n$, and $P_L^\perp$ for the projection onto its orthogonal complement. For $y \in \mathbb{R}^n$,
$$\mathrm{dir}_L(y) = \frac{P_L y}{\|P_L y\|_2} \in L \cap S^{n-1}$$
is the unit vector in the direction of $P_L y$. This direction is not defined if $P_L y = 0$.

We focus on the linear model $Y = X\beta + N(0, \sigma^2 I_n)$, where $X \in \mathbb{R}^{n \times p}$ is fixed and $\sigma^2 > 0$ is assumed to be known. More generally, our model is $Y \sim N(\mu, \sigma^2 I_n)$ with $\mu \in \mathbb{R}^n$ unknown and $\sigma^2$ known. For a given block of variables $C_g \subseteq [p]$, we write $X_g$ to denote the $n \times |C_g|$ submatrix of $X$ consisting of all features of this block. For a set $S \subseteq [G]$ of blocks, $X_S$ consists of all features that lie in any of the blocks in $S$.

When we refer to selective inference, we are generally interested in the distribution of subsets of parameters that have been chosen by some model selection procedure. After choosing a set of groups $S \subseteq [G]$, we would like to test whether the true mean $\mu$ is correlated with a group $X_g$ for each $g \in S$ after controlling for the remaining selected groups, i.e. after regressing out all the other groups, indexed by $S \setminus g$. Thus, the following question is central to selective inference:
$$\text{Question}_{g,S}: \text{ What is the magnitude of the projection of } \mu \text{ onto the span of } P^\perp_{X_{S\setminus g}} X_g? \quad (1)$$
In particular, we are interested in a hypothesis test to determine whether $\mu$ is orthogonal to this span, that is, whether block $g$ should be removed from the model with group-sparse support determined by $S$; this is the question studied by Loftus and Taylor [9], for which they compute p-values. Alternatively, we may be interested in a confidence interval on $\|P_L \mu\|_2$, where $L = \mathrm{span}(P^\perp_{X_{S\setminus g}} X_g)$. Since $S$ and $g$ are themselves determined by the data $Y$, any inference on these questions must be performed post-selection, by conditioning on the event that $S$ is the selected set of groups.
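As a concrete illustration of Question (1), the following R sketch (ours, not the authors' released code; all function and variable names are illustrative) builds an orthonormal basis for $L = \mathrm{span}(P^\perp_{X_{S\setminus g}} X_g)$ from a design matrix, a grouping of its columns, a selected set $S$, and a group $g \in S$, and then computes the observed quantities $\|P_L Y\|_2$ and $\mathrm{dir}_L(Y)$ used throughout the paper.

```r
# Hedged sketch (illustrative names, not the paper's code).
# groups: list of column-index vectors; S: indices of selected groups; g: group to test.

orth_basis <- function(A, tol = 1e-10) {
  # orthonormal basis for the column span of A (robust to rank deficiency)
  s <- svd(A)
  s$u[, s$d > tol * max(s$d), drop = FALSE]
}

subspace_L <- function(X, groups, S, g) {
  Xg <- X[, groups[[g]], drop = FALSE]
  others <- unlist(groups[setdiff(S, g)])
  if (length(others) > 0) {
    Q  <- orth_basis(X[, others, drop = FALSE])
    Xg <- Xg - Q %*% (t(Q) %*% Xg)      # apply P^perp_{X_{S \ g}} to X_g
  }
  orth_basis(Xg)                         # columns span L; dim(L) = number of columns
}

proj_stats <- function(B, y) {
  PLy <- B %*% (t(B) %*% y)              # P_L y
  nrm <- sqrt(sum(PLy^2))
  list(norm = nrm, dir = PLy / nrm)      # ||P_L y||_2 and dir_L(y)
}
```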
### 2.1 Background: The polyhedral lemma

In the more standard sparse regression setting without grouped variables, after selecting a set $S \subseteq [p]$ of features corresponding to columns of $X$, we might be interested in testing whether the column $X_j$ should be included in the model obtained by regressing $Y$ onto $X_{S\setminus j}$. We may want to test the null hypothesis that $X_j^\top P^\perp_{X_{S\setminus j}} \mu$ is zero, or to construct a confidence interval for this inner product. In the setting where $S$ is the output of the lasso, Lee et al. [5] and Tibshirani et al. [13] characterize the selection event as a polyhedron in $\mathbb{R}^n$: for any set $S \subseteq [p]$ and any signs $s \in \{\pm 1\}^S$, the event that the lasso (with a fixed regularization parameter $\lambda$) selects the given support with the given signs is equivalent to the event $Y \in \mathcal{A} = \{y : Ay < b\}$, where $A$ is a fixed matrix and $b$ is a fixed vector, both functions of $X, S, s, \lambda$. The inequalities are interpreted elementwise, yielding a convex polyhedron $\mathcal{A}$.

To test the regression question described above, one then tests $\eta^\top \mu$ for a fixed unit vector $\eta \propto P^\perp_{X_{S\setminus j}} X_j$. The "polyhedral lemma," found in [5, Theorem 5.2] and [13, Lemma 2], proves that the distribution of $\eta^\top Y$, after conditioning on $\{Y \in \mathcal{A}\}$ and on $P^\perp_\eta Y$, is given by a truncated normal distribution, with density
$$f(r) \propto \exp\!\big(-(r - \eta^\top \mu)^2 / 2\sigma^2\big) \cdot \mathbb{1}\{a_1(Y) \le r \le a_2(Y)\}. \quad (2)$$
The interval endpoints $a_1(Y), a_2(Y)$ depend on $Y$ only through $P^\perp_\eta Y$ and are defined to include exactly those values of $r$ that are feasible given the event $Y \in \mathcal{A}$; that is, the interval contains all values $r$ such that $r\eta + P^\perp_\eta Y \in \mathcal{A}$. Examining (2), we see that under the null hypothesis $\eta^\top\mu = 0$, this is a truncated zero-mean normal density, which can be used to construct a p-value testing $\eta^\top\mu = 0$. To construct a confidence interval for $\eta^\top\mu$, we can instead use (2) with nonzero $\eta^\top\mu$, which is a truncated noncentral normal density.

### 2.2 The group-sparse case

In the group-sparse regression setting, Loftus and Taylor [9] extend the work of Lee et al. [5] to questions where we would like to test $P_L\mu$, the projection of the mean $\mu$ onto some potentially multidimensional subspace, rather than simply testing $\eta^\top\mu$, which can be interpreted as a projection onto a one-dimensional subspace, $L = \mathrm{span}(\eta)$. For a fixed set $\mathcal{A} \subseteq \mathbb{R}^n$ and a fixed subspace $L$ of dimension $k$, Loftus and Taylor [9, Theorem 3.1] prove that, after conditioning on $\{Y \in \mathcal{A}\}$, on $\mathrm{dir}_L(Y)$, and on $P_L^\perp Y$, under the null hypothesis $P_L\mu = 0$, the distribution of $\|P_L Y\|_2$ is given by a truncated $\chi_k$ distribution,
$$\|P_L Y\|_2 \sim \big(\sigma \cdot \chi_k \text{ truncated to } R_Y\big), \quad \text{where } R_Y = \{r : r \cdot \mathrm{dir}_L(Y) + P_L^\perp Y \in \mathcal{A}\}. \quad (3)$$
In particular, this means that, if we would like to test the null hypothesis $P_L\mu = 0$, we can compute a p-value using the truncated $\chi_k$ distribution as our null distribution. To better understand this null hypothesis, suppose that we run a group-sparse model selection algorithm that chooses a set of blocks $S \subseteq [G]$. We might then want to test whether some particular block $g \in S$ should be retained in this model or removed. In that case, we would set $L = \mathrm{span}(P^\perp_{X_{S\setminus g}} X_g)$ and test whether $P_L\mu = 0$.
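For intuition, here is a minimal R sketch (ours, with illustrative names) of the p-value from the truncated $\chi_k$ law in (3) for the null $P_L\mu = 0$, under the assumption that $R_Y$ has already been obtained as a finite union of disjoint intervals; Section 3 describes how such intervals are computed for specific selection methods.

```r
# P(sigma * chi_k <= r), via the chi-squared CDF
pchi <- function(r, k, sigma) pchisq((r / sigma)^2, df = k)

# stat = ||P_L Y||_2; intervals = two-column matrix of R_Y endpoints (may include Inf)
truncated_chi_pvalue <- function(stat, intervals, k, sigma) {
  lo <- intervals[, 1]; hi <- intervals[, 2]
  denom <- sum(pchi(hi, k, sigma) - pchi(lo, k, sigma))
  num   <- sum(pmax(0, pchi(hi, k, sigma) - pchi(pmax(lo, stat), k, sigma)))
  num / denom
}

# e.g. truncated_chi_pvalue(stat = 2.6, intervals = rbind(c(0.5, 2), c(2.2, Inf)),
#                           k = 5, sigma = 1)
```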
Examining the parallels between this result and the work of Lee et al. [5], where (2) gives either a truncated zero-mean normal or a truncated noncentral normal distribution depending on whether the null hypothesis $\eta^\top\mu = 0$ is true or false, we might expect that the result (3) of Loftus and Taylor [9] extends in a straightforward way to the case where $P_L\mu \neq 0$. More specifically, we might expect that (3) would then be replaced by a truncated noncentral $\chi_k$ distribution, with its noncentrality parameter determined by $\|P_L\mu\|_2$. However, this turns out not to be the case. To understand why, observe that $\|P_L Y\|_2$ and $\mathrm{dir}_L(Y)$ are the length and the direction of the vector $P_L Y$; in the inference procedure of Loftus and Taylor [9], they need to condition on the direction $\mathrm{dir}_L(Y)$ in order to compute the truncation interval $R_Y$, and then they perform inference on $\|P_L Y\|_2$, the length. These two quantities are independent for a centered multivariate normal, and therefore if $P_L\mu = 0$ then $\|P_L Y\|_2$ follows a $\chi_k$ distribution even if we have conditioned on $\mathrm{dir}_L(Y)$. However, in the general case where $P_L\mu \neq 0$, we do not have independence between the length and the direction of $P_L Y$, and so while $\|P_L Y\|_2$ is marginally distributed as a noncentral $\chi_k$, this is no longer true after conditioning on $\mathrm{dir}_L(Y)$.

In this work, we consider the problem of computing the distribution of $\|P_L Y\|_2$ after conditioning on $\mathrm{dir}_L(Y)$, which is the setting that we require for inference. This leads to the main contribution of this work, where we are able to perform inference on $P_L\mu$ beyond simply testing the null hypothesis that $P_L\mu = 0$.

### 2.3 Key lemma: Truncated projections of Gaussians

Before presenting our key lemma, we introduce some further notation. Let $\mathcal{A} \subseteq \mathbb{R}^n$ be any fixed open set and let $L \subseteq \mathbb{R}^n$ be a fixed subspace of dimension $k$. For any $y \in \mathcal{A}$, consider the set
$$R_y = \{r > 0 : r \cdot \mathrm{dir}_L(y) + P_L^\perp y \in \mathcal{A}\} \subseteq \mathbb{R}_+.$$
Note that $R_y$ is an open subset of $\mathbb{R}_+$, and its construction does not depend on $\|P_L y\|_2$, but we see that $\|P_L y\|_2 \in R_y$ by definition.

Lemma 1 (Truncated projection). Let $\mathcal{A} \subseteq \mathbb{R}^n$ be a fixed open set and let $L \subseteq \mathbb{R}^n$ be a fixed subspace of dimension $k$. Suppose that $Y \sim N(\mu, \sigma^2 I_n)$. Then, conditioning on the values of $\mathrm{dir}_L(Y)$ and $P_L^\perp Y$ and on the event $Y \in \mathcal{A}$, the conditional distribution of $\|P_L Y\|_2$ has density (Footnote 1)
$$f(r) \propto r^{k-1} \exp\!\Big(-\tfrac{1}{2\sigma^2}\big(r^2 - 2r\,\langle \mathrm{dir}_L(Y), \mu\rangle\big)\Big) \cdot \mathbb{1}\{r \in R_Y\}.$$

Footnote 1: Here and throughout the paper, we ignore the possibility that $Y \perp L$, since this has probability zero.

We pause to point out two special cases that are treated in the existing literature.

**Special case 1: $k = 1$ and $\mathcal{A}$ is a convex polytope.** Suppose $\mathcal{A}$ is the convex polytope $\{y : Ay < b\}$ for fixed $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$. In this case, Lemma 1 almost exactly yields the polyhedral lemma of Lee et al. [5, Theorem 5.2]. Specifically, in their work they perform inference on $\eta^\top\mu$ for a fixed vector $\eta$; this corresponds to taking $L = \mathrm{span}(\eta)$ in our notation. Then, since $k = 1$, Lemma 1 yields a truncated Gaussian distribution, coinciding with Lee et al.'s result (2). The only difference relative to [5] is that our lemma implicitly conditions on $\mathrm{sign}(\eta^\top Y)$, which is not required in [5].

**Special case 2: the mean $\mu$ is orthogonal to the subspace $L$.** In this case, without conditioning on $\{Y \in \mathcal{A}\}$, we have $P_L Y = P_L(\mu + N(0, \sigma^2 I)) = P_L(N(0, \sigma^2 I))$, and so $\|P_L Y\|_2 \sim \sigma \cdot \chi_k$. Without conditioning on $\{Y \in \mathcal{A}\}$ (or equivalently, taking $\mathcal{A} = \mathbb{R}^n$), the resulting density is then $f(r) \propto r^{k-1} e^{-r^2/2\sigma^2}\, \mathbb{1}\{r > 0\}$, which is the density of the $\chi_k$ distribution (rescaled by $\sigma$), as expected. If we also condition on $\{Y \in \mathcal{A}\}$ then this is a truncated $\chi_k$ distribution, as proved in Loftus and Taylor [9, Theorem 3.1].
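To make Lemma 1 concrete, the following R sketch (our illustration; the names are ours) evaluates the conditional density on a grid, given $\theta = \langle \mathrm{dir}_L(Y), \mu\rangle$ and $R_Y$ as a union of intervals; taking $\theta = 0$ recovers the truncated $\sigma\cdot\chi_k$ density of special case 2.

```r
# r_grid: increasing, uniformly spaced grid of positive values covering R_Y
lemma1_density <- function(r_grid, intervals, k, sigma, theta) {
  in_RY <- rowSums(outer(r_grid, intervals[, 1], ">") &
                   outer(r_grid, intervals[, 2], "<")) > 0
  logf  <- (k - 1) * log(r_grid) - (r_grid^2 - 2 * r_grid * theta) / (2 * sigma^2)
  f     <- ifelse(in_RY, exp(logf - max(logf[in_RY])), 0)   # stabilized, unnormalized
  f / sum(f * mean(diff(r_grid)))                           # Riemann-sum normalization
}
```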
### 2.4 Selective inference on truncated projections

We now show how the key result in Lemma 1 can be used for group-sparse inference. In particular, we show how to compute a p-value for the null hypothesis $H_0: \mu \perp L$, or equivalently, $H_0: \|P_L\mu\|_2 = 0$. In addition, we show how to compute a one-sided confidence interval for $\|P_L\mu\|_2$, specifically, how to give a lower bound on the size of this projection.

Theorem 1 (Selective inference for projections). Under the setting and notation of Lemma 1, define
$$P = \frac{\int_{r \in R_Y,\, r > \|P_L Y\|_2} r^{k-1} e^{-r^2/2\sigma^2}\, dr}{\int_{r \in R_Y} r^{k-1} e^{-r^2/2\sigma^2}\, dr}. \quad (4)$$
If $\mu \perp L$ (or, more generally, if $\langle \mathrm{dir}_L(Y), \mu\rangle = 0$), then $P \sim \mathrm{Uniform}[0,1]$. Furthermore, for any desired error level $\alpha \in (0,1)$, there is a unique value $L_\alpha \in \mathbb{R}$ satisfying
$$\frac{\int_{r \in R_Y,\, r > \|P_L Y\|_2} r^{k-1} e^{-(r^2 - 2rL_\alpha)/2\sigma^2}\, dr}{\int_{r \in R_Y} r^{k-1} e^{-(r^2 - 2rL_\alpha)/2\sigma^2}\, dr} = \alpha, \quad (5)$$
and we have
$$\mathbb{P}\{\|P_L\mu\|_2 \ge L_\alpha\} \ge \mathbb{P}\{\langle \mathrm{dir}_L(Y), \mu\rangle \ge L_\alpha\} = 1 - \alpha.$$
Finally, the p-value and the confidence interval agree in the sense that $P < \alpha$ if and only if $L_\alpha > 0$.

From the form of Lemma 1, we see that we are actually performing inference on $\langle \mathrm{dir}_L(Y), \mu\rangle$. Since $\|P_L\mu\|_2 \ge \langle \mathrm{dir}_L(Y), \mu\rangle$, this means that any lower bound on $\langle \mathrm{dir}_L(Y), \mu\rangle$ also gives a lower bound on $\|P_L\mu\|_2$. For the p-value, the statement $\langle \mathrm{dir}_L(Y), \mu\rangle = 0$ is implied by the stronger null hypothesis $\mu \perp L$. We can also use Lemma 1 to give a two-sided confidence interval for $\langle \mathrm{dir}_L(Y), \mu\rangle$; specifically, $\langle \mathrm{dir}_L(Y), \mu\rangle$ lies in the interval $[L_{\alpha/2}, L_{1-\alpha/2}]$ with probability $1 - \alpha$. However, in general this cannot be extended to a two-sided interval for $\|P_L\mu\|_2$.

To compare to the main results of Loftus and Taylor [9], their work produces the p-value (4) testing the null hypothesis $\mu \perp L$, but does not extend to testing $P_L\mu$ beyond the null hypothesis, which the second part (5) of our main theorem is able to do. (Footnote 2)

Footnote 2: Their work furthermore considers the special case where the conditioning event, $Y \in \mathcal{A}$, is determined by a quadratic selection rule, that is, $\mathcal{A}$ is defined by a set of quadratic constraints on $y \in \mathbb{R}^n$. However, extending to the general case is merely a question of computation (as we explore below for performing inference for the group lasso), and this extension should not be viewed as a primary contribution of this work.

## 3 Applications to group-sparse regression methods

In this section we develop inference tools for three methods for group-sparse model selection: forward stepwise regression (also considered by Loftus and Taylor [9], with results on hypothesis testing), iterative hard thresholding (IHT), and the group lasso.

### 3.1 General recipe

With a fixed design matrix, the outcome of any group-sparse selection method is a function of $Y$. For example, a forward stepwise procedure determines a particular sequence of groups of variables. We call such an outcome a selection event, and assume that the set of all selection events forms a countable partition of $\mathbb{R}^n$ into disjoint open sets: $\mathbb{R}^n = \bigcup_e \mathcal{A}_e$. (Footnote 3) Each data vector $y \in \mathbb{R}^n$ determines a selection event, denoted $e(y)$, and thus $y \in \mathcal{A}_{e(y)}$. Let $S(y) \subseteq [G]$ be the set of groups selected for testing. This is assumed to be a function of $e(y)$, i.e. $S(y) = S_e$ for all $y \in \mathcal{A}_e$. For any $g \in S_e$, let $L_{e,g} = \mathrm{span}(P^\perp_{X_{S_e\setminus g}} X_g)$, the subspace of $\mathbb{R}^n$ indicating correlation with group $X_g$ beyond what can be explained by the other selected groups. Write
$$R_Y = \{r > 0 : r\,U + Y^\perp \in \mathcal{A}_{e(Y)}\}, \quad \text{where } U = \mathrm{dir}_{L_{e(Y),g}}(Y) \text{ and } Y^\perp = P^\perp_{L_{e(Y),g}} Y.$$
If we condition on the event $\{Y \in \mathcal{A}_e\}$ for some $e$, then as soon as we have calculated the region $R_Y \subseteq \mathbb{R}_+$, Theorem 1 will allow us to perform inference on the quantity of interest $\|P_{L_{e,g}}\mu\|_2$ by evaluating the expressions (4) and (5). In other words, we are testing whether $\mu$ is significantly correlated with the group $X_g$, after controlling for all the other selected groups, $S(Y)\setminus g = S_e \setminus g$.

Footnote 3: Since the distribution of $Y$ is continuous on $\mathbb{R}^n$, we ignore sets of measure zero without further comment.
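Once $R_Y$ has been computed as a finite union of intervals (Sections 3.2 and 3.3 explain how for forward stepwise and IHT), the expressions (4) and (5) can be evaluated numerically. The following is a hedged R sketch of our own, with names of our own choosing:

```r
# Ratio in (5) as a function of t; t = 0 gives the p-value (4).
# The factor exp(t^2 / (2 sigma^2)) is dropped since it cancels between
# numerator and denominator:
#   r^(k-1) exp(-(r^2 - 2rt)/(2 sigma^2))
#     = r^(k-1) exp(-(r - t)^2 / (2 sigma^2)) * exp(t^2 / (2 sigma^2)).
tilted_ratio <- function(t, stat, intervals, k, sigma) {
  dens  <- function(r) r^(k - 1) * exp(-(r - t)^2 / (2 * sigma^2))
  piece <- function(a, b) if (b > a) integrate(dens, a, b)$value else 0
  lo <- intervals[, 1]; hi <- intervals[, 2]
  num   <- sum(mapply(function(l, h) piece(max(l, stat), h), lo, hi))
  denom <- sum(mapply(piece, lo, hi))
  num / denom
}

# Lower confidence bound L_alpha of Theorem 1: the ratio is increasing in t,
# so (5) can be solved by one-dimensional root-finding.
selective_lower_bound <- function(stat, intervals, k, sigma, alpha = 0.1) {
  uniroot(function(t) tilted_ratio(t, stat, intervals, k, sigma) - alpha,
          interval = c(-10, 10) * sigma, extendInt = "yes")$root
}
```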
To evaluate these expressions accurately, ideally we would like an explicit characterization of the region $R_Y \subseteq \mathbb{R}_+$. To gain a better intuition for this set, define $z_Y(r) = r\,U + Y^\perp \in \mathbb{R}^n$ for $r > 0$, and note that $z_Y(r) = Y$ when we plug in $r = \|P_{L_{e(Y),g}} Y\|_2$. Then we see that
$$R_Y = \{r > 0 : e(z_Y(r)) = e(Y)\}. \quad (6)$$
In other words, we need to find the range of values of $r$ such that, if we replace $Y$ with $z_Y(r)$, this does not change the output of the model selection algorithm, i.e. $e(z_Y(r)) = e(Y)$. For the forward stepwise and IHT methods, we find that we can calculate $R_Y$ explicitly. For the group lasso, we cannot calculate $R_Y$ explicitly, but we can nonetheless compute the integrals required by Theorem 1 through numerical approximations. We now present the details for each of these methods.

### 3.2 Forward stepwise regression

Forward stepwise regression [2, 14] is a simple and widely used method. We will use the following version (Footnote 4): for design matrix $X$ and response $Y = y$,

1. Initialize the residual $\hat\epsilon_0 = y$ and the model $S_0 = \emptyset$.
2. For $t = 1, 2, \dots, T$:
   (a) Let $g_t = \arg\max_{g \in [G]\setminus S_{t-1}} \|X_g^\top \hat\epsilon_{t-1}\|_2$.
   (b) Update the model, $S_t = \{g_1, \dots, g_t\}$, and update the residual, $\hat\epsilon_t = P^\perp_{X_{S_t}} y$.

Footnote 4: In practice, we would add some correction for the scale of the columns of $X_g$ or for the number of features in group $g$; this can be accomplished with simple modifications of the forward stepwise procedure.

**Testing all groups at time $T$.** First we consider the inference procedure where, at time $T$, we would like to test each selected group $g_t$ for $t = 1, \dots, T$; inference for this procedure was also derived in [8]. Our selection event $e(Y)$ is the ordered sequence $g_1, \dots, g_T$ of selected groups. For a response vector $Y = y$, this selection event is equivalent to
$$\|X_{g_k}^\top P^\perp_{X_{S_{k-1}}} y\|_2 > \|X_g^\top P^\perp_{X_{S_{k-1}}} y\|_2 \quad \text{for all } k = 1, \dots, T \text{ and all } g \notin S_k. \quad (7)$$
Now we would like to perform inference on the group $g = g_t$, while controlling for the other groups in $S(Y) = S_T$. Define $U$, $Y^\perp$, and $z_Y(r)$ as before. Then, to determine $R_Y = \{r > 0 : z_Y(r) \in \mathcal{A}_{e(Y)}\}$, we check whether all of the inequalities in (7) are satisfied with $y = z_Y(r)$: for each $k = 1, \dots, T$ and each $g \notin S_k$, the corresponding inequality of (7) can be expressed as
$$r^2 \|X_{g_k}^\top P^\perp_{X_{S_{k-1}}} U\|_2^2 + 2r \big\langle X_{g_k}^\top P^\perp_{X_{S_{k-1}}} U,\; X_{g_k}^\top P^\perp_{X_{S_{k-1}}} Y^\perp\big\rangle + \|X_{g_k}^\top P^\perp_{X_{S_{k-1}}} Y^\perp\|_2^2$$
$$> \; r^2 \|X_g^\top P^\perp_{X_{S_{k-1}}} U\|_2^2 + 2r \big\langle X_g^\top P^\perp_{X_{S_{k-1}}} U,\; X_g^\top P^\perp_{X_{S_{k-1}}} Y^\perp\big\rangle + \|X_g^\top P^\perp_{X_{S_{k-1}}} Y^\perp\|_2^2.$$
Solving this quadratic inequality over $r \in \mathbb{R}_+$, we obtain a region $I_{k,g} \subseteq \mathbb{R}_+$ which is either a single interval or a union of two disjoint intervals, whose endpoints we can calculate explicitly with the quadratic formula. The set $R_Y$ is then given by all values $r$ that satisfy the full set of inequalities:
$$R_Y = \bigcap_{k=1,\dots,T}\ \bigcap_{g \in [G]\setminus S_k} I_{k,g}.$$
This is a union of finitely many disjoint intervals, whose endpoints are calculated explicitly as above.

**Sequential testing.** Now suppose we carry out a sequential inference procedure, testing group $g_t$ at its time of selection, controlling only for the previously selected groups $S_{t-1}$. In fact, this is a special case of the non-sequential procedure above, which shows how to test $g_T$ while controlling for $S_T \setminus g_T = S_{T-1}$. Applying this method at each stage of the algorithm yields a sequential testing procedure. (The method developed in [9] computes p-values for this problem, testing whether $\mu \perp P^\perp_{X_{S_{t-1}}} X_{g_t}$ at each time $t$.) See the supplementary material for detailed pseudo-code.
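An illustrative R sketch of this procedure (ours, not the authors' implementation, and ignoring the column-scale corrections mentioned in Footnote 4): the first function runs the group-wise forward stepwise selection, and the second solves a single quadratic inequality of the kind arising from (7), returning its feasible set in $r$ as a union of at most two intervals.

```r
# Forward stepwise group selection; groups is a list of column-index vectors
fs_group_select <- function(X, y, groups, T_steps) {
  S <- integer(0); resid <- y
  for (t in seq_len(T_steps)) {
    scores <- sapply(seq_along(groups), function(g) {
      if (g %in% S) return(-Inf)
      sum((t(X[, groups[[g]], drop = FALSE]) %*% resid)^2)   # ||X_g' resid||^2
    })
    S <- c(S, which.max(scores))
    XS <- X[, unlist(groups[S]), drop = FALSE]
    resid <- lm.fit(XS, y)$residuals                          # P^perp_{X_S} y
  }
  S                                                           # ordered selected groups
}

# Region {r > 0 : a r^2 + b r + c > 0}, where (a, b, c) are the differences of
# the quadratic coefficients on the two sides of an inequality such as (7).
quad_region <- function(a, b, c, tol = 1e-12) {
  if (abs(a) < tol) {                       # effectively linear in r
    if (abs(b) < tol) return(if (c > 0) rbind(c(0, Inf)) else NULL)
    r0 <- -c / b
    return(if (b > 0) rbind(c(max(0, r0), Inf)) else rbind(c(0, max(0, r0))))
  }
  disc <- b^2 - 4 * a * c
  if (disc <= 0) return(if (a > 0) rbind(c(0, Inf)) else NULL)
  rts <- sort((-b + c(-1, 1) * sqrt(disc)) / (2 * a))
  if (a > 0) rbind(c(0, max(0, rts[1])), c(max(0, rts[2]), Inf))
  else       rbind(c(max(0, rts[1]), max(0, rts[2])))
}
```

Intersecting these interval unions over all of the inequalities (and likewise for IHT in the next subsection) yields $R_Y$ as a finite union of intervals, which can then be passed to the numerical routines sketched after Theorem 1.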
### 3.3 Iterative hard thresholding (IHT)

The iterative hard thresholding algorithm finds a $k$-group-sparse solution to the linear regression problem, iterating gradient descent steps with hard thresholding to update the model choice as needed [1, 4]. Given $k \ge 1$, a number of iterations $T$, step sizes $\eta_t$, design matrix $X$, and response $Y = y$:

1. Initialize the coefficient vector, $\beta_0 = 0 \in \mathbb{R}^p$ (or any other desired initial point).
2. For $t = 1, 2, \dots, T$:
   (a) Take a gradient step, $\tilde\beta_t = \beta_{t-1} - \frac{\eta_t}{n} X^\top(X\beta_{t-1} - y)$.
   (b) Compute $\|(\tilde\beta_t)_{C_g}\|_2$ for each $g \in [G]$ and let $S_t \subseteq [G]$ index the $k$ largest norms.
   (c) Update the fitted coefficients $\beta_t$ via $(\beta_t)_j = (\tilde\beta_t)_j \cdot \mathbb{1}\{j \in \bigcup_{g \in S_t} C_g\}$.

Here we are typically interested in testing Question$_{g,S_T}$ for each $g \in S_T$. We condition on the selection event, $e(Y)$, given by the sequence of $k$-group-sparse models $S_1, \dots, S_T$ selected at each stage of the algorithm, which is characterized by the inequalities
$$\|(\tilde\beta_t)_{C_g}\|_2 > \|(\tilde\beta_t)_{C_h}\|_2 \quad \text{for all } t = 1, \dots, T, \text{ and all } g \in S_t,\ h \notin S_t. \quad (8)$$
Fixing a group $g \in S_T$ to test, determining $R_Y = \{r > 0 : z_Y(r) \in \mathcal{A}_{e(Y)}\}$ involves checking whether all of the inequalities in (8) are satisfied with $y = z_Y(r)$. First, with the response $Y$ replaced by $y = z_Y(r)$, we show that we can write $\tilde\beta_t = r\,c_t + d_t$ for each $t = 1, \dots, T$, where $c_t, d_t \in \mathbb{R}^p$ do not depend on $r$; in the supplementary material, we derive $c_t, d_t$ inductively as
$$c_1 = \tfrac{\eta_1}{n} X^\top U, \qquad d_1 = \big(I_p - \tfrac{\eta_1}{n} X^\top X\big)\beta_0 + \tfrac{\eta_1}{n} X^\top Y^\perp,$$
$$c_t = \big(I_p - \tfrac{\eta_t}{n} X^\top X\big) P_{S_{t-1}} c_{t-1} + \tfrac{\eta_t}{n} X^\top U, \qquad d_t = \big(I_p - \tfrac{\eta_t}{n} X^\top X\big) P_{S_{t-1}} d_{t-1} + \tfrac{\eta_t}{n} X^\top Y^\perp \quad \text{for } t \ge 2,$$
where $P_S$ denotes the coordinate projection that zeroes out all entries outside $\bigcup_{g \in S} C_g$, so that $\beta_{t-1} = P_{S_{t-1}}\tilde\beta_{t-1}$.

Now we compute the region $R_Y$. For each $t = 1, \dots, T$ and each $g \in S_t$, $h \notin S_t$, the corresponding inequality in (8), after writing $\tilde\beta_t = r\,c_t + d_t$, can be expressed as
$$r^2 \|(c_t)_{C_g}\|_2^2 + 2r \langle (c_t)_{C_g}, (d_t)_{C_g}\rangle + \|(d_t)_{C_g}\|_2^2 > r^2 \|(c_t)_{C_h}\|_2^2 + 2r \langle (c_t)_{C_h}, (d_t)_{C_h}\rangle + \|(d_t)_{C_h}\|_2^2.$$
As for the forward stepwise procedure, solving this quadratic inequality over $r \in \mathbb{R}_+$, we obtain a region $I_{t,g,h} \subseteq \mathbb{R}_+$ that is either a single interval or a union of two disjoint intervals whose endpoints we can calculate explicitly. Finally, we obtain
$$R_Y = \bigcap_{t=1,\dots,T}\ \bigcap_{g \in S_t,\ h \in [G]\setminus S_t} I_{t,g,h}.$$

### 3.4 The group lasso

The group lasso, first introduced by Yuan and Lin [15], is a convex optimization method for linear regression in which the form of the penalty is designed to encourage group-wise sparsity of the solution. It is an extension of the lasso method [12] for linear regression. The method is given by
$$\hat\beta = \arg\min_\beta \Big\{ \tfrac12 \|y - X\beta\|_2^2 + \lambda \sum_g \|\beta_{C_g}\|_2 \Big\},$$
where $\lambda > 0$ is a penalty parameter. The penalty $\sum_g \|\beta_{C_g}\|_2$ promotes sparsity at the group level. (Footnote 5)

Footnote 5: Our method can also be applied to a modification of the group lasso designed for overlapping groups [3] with a nearly identical procedure, but we do not give details here.

For this method, we perform inference on the group support $S$ of the fitted model $\hat\beta$. We would like to test Question$_{g,S}$ for each $g \in S$. In this setting, for groups of size $\ge 2$, we believe that it is not possible to analytically calculate $R_Y$, and furthermore, that there is no additional information that we can condition on to make this computation possible without losing all power to do inference. We thus propose a numerical approximation that circumvents the need for an explicit calculation of $R_Y$. Examining the calculation of the p-value $P$ and the lower bound $L_\alpha$ in Theorem 1, we see that we can write $P = f_Y(0)$ and can find $L_\alpha$ as the unique solution to $f_Y(L_\alpha) = \alpha$, where
$$f_Y(t) = \frac{\mathbb{E}_{r\sim\sigma\cdot\chi_k}\big[e^{rt/\sigma^2}\,\mathbb{1}\{r \in R_Y,\ r > \|P_L Y\|_2\}\big]}{\mathbb{E}_{r\sim\sigma\cdot\chi_k}\big[e^{rt/\sigma^2}\,\mathbb{1}\{r \in R_Y\}\big]},$$
where we treat $Y$ as fixed in this calculation and set $k = \dim(L)$, the rank of $P^\perp_{X_{S\setminus g}} X_g$.

Both the numerator and denominator can be approximated by taking a large number $B$ of samples $r \sim \sigma\cdot\chi_k$ and taking the empirical expectations. Checking whether $r \in R_Y$ is equivalent to running the group lasso with the response replaced by $y = z_Y(r)$ and checking whether the resulting selected model remains unchanged. This may be problematic, however, if $R_Y$ is in the tails of the $\sigma\cdot\chi_k$ distribution. We therefore implement an importance sampling approach, repeatedly drawing $r \sim \psi$ for some density $\psi$; we find that $\psi = \|P_L Y\|_2 + N(0, \sigma^2)$ works well in practice. Given samples $r_1, \dots, r_B \sim \psi$, we then estimate
$$f_Y(t) \approx \hat f_Y(t) := \frac{\sum_b \frac{\psi_{\sigma\chi_k}(r_b)}{\psi(r_b)}\, e^{r_b t/\sigma^2}\, \mathbb{1}\{r_b \in R_Y,\ r_b > \|P_L Y\|_2\}}{\sum_b \frac{\psi_{\sigma\chi_k}(r_b)}{\psi(r_b)}\, e^{r_b t/\sigma^2}\, \mathbb{1}\{r_b \in R_Y\}},$$
where $\psi_{\sigma\chi_k}$ is the density of the $\sigma\cdot\chi_k$ distribution. We then estimate $P$ by $\hat P = \hat f_Y(0)$. Finally, since $\hat f_Y(t)$ is continuous and strictly increasing in $t$, we estimate $L_\alpha$ by numerically solving $\hat f_Y(t) = \alpha$.
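A hedged R sketch of this importance-sampling scheme (our own reconstruction, not the released code; `refit_support` is a placeholder for a wrapper around any group lasso solver, for example one built on the `gglasso` package, that returns the selected group support for a given response vector):

```r
make_fY_hat <- function(stat, U, Yperp, S_obs, k, sigma, refit_support, B = 2000) {
  # stat = ||P_L Y||_2, U = dir_L(Y), Yperp = P_L^perp Y, S_obs = observed support
  rb <- stat + rnorm(B, sd = sigma)              # proposal psi = ||P_L Y||_2 + N(0, sigma^2)
  ok <- rb > 0
  w  <- numeric(B)                               # importance weights psi_{sigma chi_k} / psi,
  w[ok] <- exp((k - 1) * log(rb[ok]) - rb[ok]^2 / (2 * sigma^2) -    # up to a constant that
               dnorm(rb[ok], mean = stat, sd = sigma, log = TRUE))   # cancels in the ratio
  in_RY <- logical(B)                            # check r in R_Y by refitting the group lasso
  in_RY[ok] <- vapply(rb[ok], function(r)
    identical(refit_support(r * U + Yperp), S_obs), logical(1))
  function(t) {                                  # returns the estimate fhat_Y(t)
    e <- exp(rb * t / sigma^2)
    sum(w * e * (in_RY & rb > stat)) / sum(w * e * in_RY)
  }
}

# Usage: fY <- make_fY_hat(...); P_hat <- fY(0)
# For alpha = 0.1:
#   L_alpha_hat <- uniroot(function(t) fY(t) - 0.1,
#                          interval = c(-10, 10) * sigma, extendInt = "yes")$root
```

Reusing one fixed set of samples $r_1,\dots,r_B$ across values of $t$, as the closure above does, keeps $\hat f_Y(t)$ monotone in $t$ so that the root-finding step is well behaved.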
## 4 Experiments

In this section we present results from experiments on simulated and real data, performed in R [11]. (Footnote 6)

Footnote 6: Code reproducing the experiments: http://www.stat.uchicago.edu/~rina/group_inf.html

### 4.1 Simulated data

We fix sample size $n = 500$ and $G = 50$ groups, each of size 10. For each trial, we generate a design matrix $X$ with i.i.d. $N(0, 1/n)$ entries, set $\beta$ with its first 50 entries (corresponding to the first $s = 5$ groups) equal to $\tau$ and all other entries equal to 0, and set $Y = X\beta + N(0, I_n)$. We present the results for IHT here; the results for the other two methods can be found in the supplementary material. We run IHT to select $k = 10$ groups over $T = 5$ iterations, with step sizes $\eta_t = 2$ and initial point $\beta_0 = 0$.

For a moderate signal strength $\tau = 1.5$, we plot the p-values for each selected group in Figure 1; each group displays p-values only for those trials in which it was selected. The histograms of p-values for the $s$ true signals and for the $G - s$ nulls are also shown. We see that the distribution of p-values for the true signals concentrates near zero, while the null p-values are roughly uniform.

Figure 1: Iterative hard thresholding (IHT). For each group, we plot its p-value for each trial in which that group was selected, over 200 trials. Histograms of the p-values for true signals (left, red) and for nulls (right, gray) are attached.

Next we look at the confidence intervals given by our method, examining their empirical coverage across different signal strengths $\tau$ in Figure 2. We fix confidence level 0.9 (i.e. $\alpha = 0.1$) and check empirical coverage with respect to both $\|P_L\mu\|_2$ and $\langle \mathrm{dir}_L(Y), \mu\rangle$, with results shown separately for true signals and for nulls. For true signals, the confidence interval for $\|P_L\mu\|_2$ is somewhat conservative while the coverage for $\langle \mathrm{dir}_L(Y), \mu\rangle$ is right at the target level, as expected from our theory. As the signal strength $\tau$ increases, the gap is reduced for the true signals; this is because $\mathrm{dir}_L(Y)$ becomes an increasingly accurate estimate of $\mathrm{dir}_L(\mu)$, and so the gap in the inequality $\|P_L\mu\|_2 \ge \langle \mathrm{dir}_L(Y), \mu\rangle$ is reduced. For the nulls, if the set of selected groups contains the support of the true model, which is nearly always true for higher signal levels $\tau$, then the two are equivalent (as $\|P_L\mu\|_2 = \langle \mathrm{dir}_L(Y), \mu\rangle = 0$), and coverage is at the target level. At low signal levels $\tau$, however, a true group is occasionally missed, in which case $\|P_L\mu\|_2 > \langle \mathrm{dir}_L(Y), \mu\rangle$ strictly.

Figure 2: Iterative hard thresholding (IHT). Empirical coverage over 2000 trials as a function of signal strength $\tau$. "Norm" and "inner product" refer to coverage of $\|P_L\mu\|_2$ and $\langle \mathrm{dir}_L(Y), \mu\rangle$, respectively.
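For concreteness, the simulated-data setup above can be generated as follows (a hedged R sketch reconstructed from the description in the text; it is not the released experiment code):

```r
set.seed(1)
n <- 500; G <- 50; gsize <- 10; p <- G * gsize; s <- 5; tau <- 1.5
groups <- split(seq_len(p), rep(seq_len(G), each = gsize))   # 50 groups of 10 columns
X    <- matrix(rnorm(n * p, sd = sqrt(1 / n)), n, p)          # i.i.d. N(0, 1/n) entries
beta <- c(rep(tau, s * gsize), rep(0, p - s * gsize))         # first s groups are signals
Y    <- drop(X %*% beta) + rnorm(n)                           # noise level sigma = 1
# A group-sparse selector (e.g. the IHT of Section 3.3) is then run on (X, Y),
# and Theorem 1 is applied to each selected group with sigma = 1.
```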
### 4.2 California health data

We examine the 2015 California county health data (Footnote 7), which was also studied by Loftus and Taylor [9]. We fit a linear model where the response is the log-years of potential life lost and the covariates are the 34 predictors in this data set. We first let each predictor be its own group (i.e., group size 1) and run the three algorithms considered in Section 3. Next, we form a grouped model by expanding each predictor $X_j$ into a group using the first three non-constant Legendre polynomials, $\big(X_j,\ \tfrac12(3X_j^2 - 1),\ \tfrac12(5X_j^3 - 3X_j)\big)$. In each case we set parameters so that 8 groups are selected. The selected groups and their p-values are given in Table 1; interestingly, even when the same predictor is selected by multiple methods, its p-value can differ substantially across the different methods.

Footnote 7: Available at http://www.countyhealthrankings.org

| Group size | Forward stepwise (p-value / seq. p-value) | Iterative hard thresholding (p-value) | Group lasso (p-value) |
|---|---|---|---|
| 1 | 80th percentile income 0.116 / 0.000 | 80th percentile income 0.000 | 80th percentile income 0.000 |
| 1 | Injury death rate 0.000 / 0.000 | Injury death rate 0.000 | % Obese 0.007 |
| 1 | Violent crime rate 0.016 / 0.000 | % Smokers 0.004 | % Physically inactive 0.040 |
| 1 | % Receiving HbA1c 0.591 / 0.839 | % Single-parent household 0.009 | Violent crime rate 0.055 |
| 1 | % Obese 0.481 / 0.464 | % Children in poverty 0.332 | % Single-parent household 0.075 |
| 1 | Chlamydia rate 0.944 / 0.975 | Physically unhealthy days 0.716 | Injury death rate 0.235 |
| 1 | % Physically inactive 0.654 / 0.812 | Food environment index 0.807 | % Smokers 0.701 |
| 1 | % Alcohol-impaired 0.104 / 0.104 | Mentally unhealthy days 0.957 | Preventable hospital stays rate 0.932 |
| 3 | 80th percentile income 0.001 / 0.000 | Injury death rate 0.000 | 80th percentile income 0.000 |
| 3 | Injury death rate 0.044 / 0.000 | 80th percentile income 0.000 | Injury death rate 0.000 |
| 3 | Violent crime rate 0.793 / 0.617 | % Smokers 0.000 | % Single-parent household 0.038 |
| 3 | % Physically inactive 0.507 / 0.249 | % Single-parent household 0.005 | % Physically inactive 0.043 |
| 3 | % Alcohol-impaired 0.892 / 0.933 | Food environment index 0.057 | % Obese 0.339 |
| 3 | % Severe housing problems 0.119 / 0.496 | % Children in poverty 0.388 | % Alcohol-impaired 0.366 |
| 3 | Chlamydia rate 0.188 / 0.099 | Physically unhealthy days 0.713 | % Smokers 0.372 |
| 3 | Preventable hospital stays rate 0.421 / 0.421 | Mentally unhealthy days 0.977 | Violent crime rate 0.629 |

Table 1: Selective p-values for the California county health data experiment. The predictors obtained with forward stepwise are tested both simultaneously at the end of the procedure (first p-value shown) and sequentially (second p-value shown), and are displayed in the selected order.

## 5 Conclusion

We develop selective inference tools for group-sparse linear regression methods, where for a data-dependent selected set of groups $S$, we are able both to test each group $g \in S$ for inclusion in the model defined by $S$, and to form a confidence interval for the effect size of group $g$ in the model. Our theoretical results can be easily applied to a range of commonly used group-sparse regression methods, thus providing an efficient tool for finite-sample inference that correctly accounts for data-dependent model selection in the group-sparse setting.

## Acknowledgments

Research supported in part by ONR grant N00014-15-1-2379, and NSF grants DMS-1513594 and DMS-1547396.

## References
[1] Thomas Blumensath and Mike E. Davies. Sampling theorems for signals from the union of finite-dimensional linear subspaces. IEEE Transactions on Information Theory, 55(4):1872–1882, 2009.
[2] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, NY, USA, 2001.
[3] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 433–440. ACM, 2009.
[4] Prateek Jain, Nikhil Rao, and Inderjit S. Dhillon. Structured sparse regression via greedy hard-thresholding. CoRR, abs/1602.06042, 2016. URL http://arxiv.org/abs/1602.06042.
[5] Jason D. Lee, Dennis L. Sun, Yuekai Sun, and Jonathan E. Taylor. Exact post-selection inference with the lasso. arXiv preprint arXiv:1311.6238, 2013.
[6] Jason D. Lee and Jonathan E. Taylor. Exact post model selection inference for marginal screening. In Advances in Neural Information Processing Systems 27, pages 136–144, 2014.
[7] Joshua R. Loftus. Selective inference after cross-validation. arXiv preprint arXiv:1511.08866, 2015.
[8] Joshua R. Loftus and Jonathan E. Taylor. A significance test for forward stepwise model selection. arXiv preprint arXiv:1405.3920, 2014.
[9] Joshua R. Loftus and Jonathan E. Taylor. Selective inference in regression models with groups of variables. arXiv preprint arXiv:1511.01478, 2015.
[10] Sofia Mosci, Silvia Villa, Alessandro Verri, and Lorenzo Rosasco. A primal-dual algorithm for group sparse regularization with overlapping groups. In Advances in Neural Information Processing Systems 23, pages 2604–2612, 2010.
[11] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2016. URL https://www.R-project.org/.
[12] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.
[13] Ryan J. Tibshirani, Jonathan Taylor, Richard Lockhart, and Robert Tibshirani. Exact post-selection inference for sequential regression procedures. arXiv preprint arXiv:1401.3889, 2014.
[14] Joel A. Tropp and Anna C. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655–4666, 2007. doi: 10.1109/TIT.2007.909108. URL http://dx.doi.org/10.1109/TIT.2007.909108.
[15] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.