# Selective Inference with Distributed Data

Journal of Machine Learning Research 26 (2025) 1-44. Submitted 3/23; Revised 12/24; Published 1/25.

Sifan Liu (sfliu@stanford.edu), Department of Statistics, Stanford University, Stanford, CA 94305-4020, USA

Snigdha Panigrahi (psnigdha@umich.edu), Department of Statistics, University of Michigan, Ann Arbor, MI 48109-1107, USA

Editor: Po-Ling Loh

© 2025 Sifan Liu and Snigdha Panigrahi. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/23-0309.html.

When data are distributed across multiple sites or machines rather than centralized in one location, researchers face the challenge of extracting meaningful information without directly sharing individual data points. While there are many distributed methods for point estimation using sparse regression, few options are available for estimating uncertainties or conducting hypothesis tests based on the estimated sparsity. In this paper, we introduce a procedure for performing selective inference with distributed data. We consider a scenario where each local machine solves a lasso problem and communicates the selected predictors to a central machine. The central machine then aggregates these selected predictors to form a generalized linear model (GLM). Our goal is to provide valid inference for the selected GLM while reusing data that have been used in the model selection process. Our proposed procedure requires only low-dimensional summary statistics from local machines, thus keeping communication costs low and preserving the privacy of individual data sets. Furthermore, this procedure can be applied in scenarios where model selection is repeatedly conducted on randomly subsampled data sets, addressing the p-value lottery problem linked with model selection. We demonstrate the effectiveness of our approach through simulations and an analysis of a medical data set on ICU admissions.

Keywords: carving, data aggregation, generalized linear models, lasso, post-selection inference, selective inference, selective likelihood

1. Introduction

In recent years, analyzing decentralized data spread across various sites or machines has become increasingly important, particularly for collaborative research. This has created a growing demand for distributed methods that enable researchers to extract meaningful information from such data while respecting its decentralized nature, and to combine this information using summary statistics that preserve privacy within the individual data sets. One of the simplest and most popular approaches in the distributed framework is divide-and-conquer, also known as the split-and-merge or the one-shot approach; see, for example, the early work by Mcdonald et al. (2009); Zinkevich et al. (2010); Zhang et al. (2012). Most of these approaches use only one round of communication: each local machine estimates the unknown parameter using its subset of the training data and communicates the estimator to a central machine, which merges the local estimators to obtain a global estimator. In our paper, we focus on a sparse regression setting where only a few of the measured predictors affect the response.
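To make the one-shot recipe above concrete, here is a minimal sketch, assuming a plain least-squares loss at each machine; the helper names `local_estimate` and `one_shot_average` are illustrative and are not taken from the paper or its released code.

```python
import numpy as np

def local_estimate(X_k, y_k):
    # One machine's estimator on its own shard; plain least squares for illustration.
    return np.linalg.lstsq(X_k, y_k, rcond=None)[0]

def one_shot_average(shards):
    # Divide-and-conquer: merge the local estimators with a single round of communication.
    return np.mean([local_estimate(X_k, y_k) for X_k, y_k in shards], axis=0)

# Toy usage: three machines, each holding 200 samples from a 5-dimensional linear model.
rng = np.random.default_rng(0)
beta = np.array([1.0, 0.5, 0.0, 0.0, -0.5])
shards = []
for _ in range(3):
    X_k = rng.normal(size=(200, 5))
    shards.append((X_k, X_k @ beta + rng.normal(size=200)))
print(one_shot_average(shards))  # close to beta, without pooling any raw data
```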
The goal in this setup is usually two-fold: (i) select relevant predictors, and model the response by using the estimated sparsity, (ii) provide uncertainties or conduct hypothesis tests for regression parameters in the estimated model, and all this while, respect the distributed nature of data. Substantial progress has been made on the first goal. For example, Lee et al. (2015) average locally computed, debiased lasso estimators, and show that the averaged estimator achieves the same estimation rate as the full-sample lasso, as long as the number of machines is not too large. Chen and Xie (2014) prove that the models aggregated via majority voting, based on variables selected by local machines, are consistent under some conditions. More recently, Battey et al. (2018) provide a debiased approach for hypothesis testing in the distributed setting. However, these methods are limited to models that are fixed before the selection of relevant predictors. As a result, the communication cost in prior work scales with the number of original predictors, which can be unnecessarily large in sparse settings. We introduce a new procedure for conducting selective inference with distributed data. Selective inference tools ensure valid inference by accounting for the fact that the same data, used to select models, is reused when providing confidence intervals and p-values. Several ingenious methods have been developed to provide selective inference in sparse regression problems; see papers by Benjamini and Yekutieli (2005); Berk et al. (2013); Belloni et al. (2015); Lee et al. (2016); Tian and Taylor (2018); Charkhi and Claeskens (2018); Bachoc et al. (2019); Panigrahi et al. (2021). Our procedure in this paper reuses data from all machines to form an asymptotic selective likelihood . This likelihood delivers approximately-valid selective inference in a selected generalized linear model (GLM), which is based on the predictors selected across different machines. The proposed procedure has several important features: 1. It only requires low-dimensional summary statistics from each machine to conduct selective inference. Thus, the techniques developed are applicable to settings when data sets are distributed across different sites due to security, privacy, or ethical concerns, as seen in the areas of differential privacy (Balcan et al., 2012) and federated learning (Mc Mahan et al., 2017). These sites can merge summary information without sharing individual data, enabling inference in a GLM based on the estimated sparsity. 2. The communication cost of our inferential procedure is only linear in the dimension of the selected model, which can be substantially smaller than the initial dimension of the problem. 3. This procedure can be easily adapted to address the p-value lottery problem in model selection, which arises when the selection process is repeated on randomly subsampled data sets. We show that our procedure serves as an efficient alternative to multi-splitting (Meinshausen et al., 2009) and multi-carving (Schultheiss et al., 2021) without recourse to Markov chain Monte Carlo (MCMC) sampling. Selective Inference with Distributed Data The remaining paper is structured as follows. We provide a slightly more technical account of our contributions after outlining the problem setup, and review related work in Section 2. In Section 3, we introduce our procedure for selective inference with distributed data. 
In Section 4, we provide an asymptotic justification for our selective likelihood, which forms the basis of our approach. We explore how our procedure can be modified to address the p-value lottery problem in Section 5. Section 6 reports findings from the application of our method on simulated data sets and a publicly available, medical data set on intensive care unit (ICU) admissions. We conclude the paper with a discussion in Section 7. Proofs of our technical results are collected in the Appendix. 2. Problem setup and background In this section, we describe the distributed setup and introduce some background on selective inference with a single machine. Other related work is summarized at the end. 2.1 Problem setup We consider a distributed setup with K local machines, all connected to a central machine, referred to as machine 0. Suppose that we observe n i.i.d. observations (yi, xi,1, . . . , xi,p) Rp+1, i [n], where [n] = {1, 2, . . . , n} for n N. Let C(k) [n] denote the index set of the samples stored at machine k, for k {0} [K]. The index sets C(k) are disjoint and form a partition of [n]. Let nk = |C(k)| be the cardinality of C(k). Let ρk = nk n be the proportion of samples store at machine k. Then we have k=0 nk, 1 = Let D(k) = Y (k), X(k) denote the data stored at machine k, where X(k) Rnk p represents the feature matrix and Y (k) Rnk represents the response vector. Each local machine conducts variable selection with a loss function based on a GLMdensity. Let this loss function at machine k be denoted by ℓ(k)(β; D(k)) = 1 nρk i C(k) {A(x i β) yix i β} , where A( ) is the log-partition function of the GLM density. That is, machine k solves the following lasso problem: bβΛ,(k) = argmin β Rp ℓ(k)(β; D(k)) + Λ(k)β 1, (1) where Λ(k) = diag λ(k) 1 , . . . , λ(k) p is the diagonal matrix of regularization parameters. Although our procedure works when the regularization parameters differ across machines, for simplicity, we assume Λ(k) = Λ for k [K]. Liu and Panigrahi Let b E(k) = n j [p] : bβΛ,(k) j = 0 o . represent the set of predictors with non-zero lasso coefficients selected at machine k for 1 k K. We will denote the realized value of b E(k) with data D(k) as E(k). Let d(k) = |E(k)| denote the number of selected variables at machine k. Note, for a vector x Rp and E [p], x E represents the subvector of x with entries in the set E, and for a matrix X, XE is composed of the columns of X present in the set E. 2.2 Selected model As described above, the K local machines return as output the selected predictors, which are then communicated to the central machine. The central machine aggregates these selected predictors as b E = Aggregate n b E(k), k [K] o . (2) We let E be the observed value of b E and let d = |E|. Deferring the discussion on general aggregation rules to Appendix D, now consider the union aggregation rule as a concrete example. Given E, our response y is modeled using a selected GLM based on XE. The density of y is given by: f(y | x E, βE) = exp yx EβE A(x EβE) σ2 c(y; σ), (3) where βE, the regression coefficient vector, is the parameter that we want to infer about. We assume that the dispersion parameter, σ, is either known or can be consistently estimated. Some immediate questions arise when we seek inference for βE: Can the central machine reuse data from the local machines to conduct selective inference for βE? 
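As an illustration of the local selection step in (1) and the union aggregation rule in (2), the following sketch uses scikit-learn's `Lasso` with a Gaussian loss and a single scalar penalty as a stand-in for the GLM lasso solved at each machine; the function names are ours and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def select_on_machine(X_k, y_k, lam):
    # Local selection step: solve a lasso on machine k's shard and return its support E^(k).
    fit = Lasso(alpha=lam, fit_intercept=False).fit(X_k, y_k)
    return set(np.flatnonzero(fit.coef_))

def aggregate_union(supports):
    # Union aggregation rule of Equation (2): E is the union of the locally selected sets.
    return sorted(set().union(*supports))

# Example with two machines splitting 400 samples of a 20-dimensional design.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 20))
y = X[:, 0] - X[:, 3] + rng.normal(size=400)
supports = [select_on_machine(X[:200], y[:200], lam=0.1),
            select_on_machine(X[200:], y[200:], lam=0.1)]
E = aggregate_union(supports)  # indices of predictors defining the selected GLM X_E
```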
Of course, naive inference, which uses all the data without adjusting for the bias from the model selection process, falls short of coverage guarantees, as illustrated on one of our simulated instances in Figure 1. Section 6 provides the details of this simulation. A procedure for making selective inference must respect the distributed nature of data, as done at the time of selection. What information does the central machine require from the local machines, and how many communication exchanges does it involve?

2.3 Some background when K = 1

We provide some background in a rather simple setup with K = 1. Consider the special case of a Gaussian linear model, i.e., in (3), we have
$$A(x_E^\top \beta_E) = \tfrac{1}{2}(x_E^\top \beta_E)^2, \qquad c(y; \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{y^2}{2\sigma^2}\Big).$$

Figure 1: Coverage probabilities of Dist-SI (our procedure for selective inference with distributed data), the Splitting, and Naïve methods for a Gaussian linear model, with 2 local machines and a central machine. The 2 local machines and the central machine each have 1000 samples. The pre-specified level of coverage is 0.90, as indicated by the dotted horizontal line at 0.9.

Post selection, we want to infer for $\beta_E$. We begin by noting that the regression problem in (1), at machine 1, can be rewritten as
$$\hat{\beta}^{\lambda,(1)} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \ \frac{1}{n} \sum_{i \in [n]} \frac{1}{2}(y_i - x_i^\top \beta)^2 + \|\Lambda \beta\|_1 - \omega_n^\top \beta, \quad (4)$$
where
$$\omega_n = \frac{1}{n} \sum_{i \in [n]} x_i (x_i^\top \hat{\beta}^{\lambda,(1)} - y_i) - \frac{1}{n_1} \sum_{i \in C^{(1)}} x_i (x_i^\top \hat{\beta}^{\lambda,(1)} - y_i).$$
The term $\omega_n$ is what we call a randomization variable. Regression of the form (4), with an added randomization variable $\omega_n$, was termed the randomized lasso in Tian et al. (2016). Suppose we have n observations in our response vector that are drawn as independent Gaussian variables with the same variance. Further, suppose that we solved the randomized lasso with a Gaussian randomization variable $\sqrt{n}\,\omega_n \sim N(0_p, \Sigma_\Omega)$. A recent procedure by Panigrahi and Taylor (2023) constructs a selective likelihood by conditioning on a subset of the selection event $\{\widehat{E} = E\}$. This approach enables valid selective inference in the selected linear model while allowing for the re-use of data employed during model selection. This procedure centers interval estimates around the maximum likelihood estimator (MLE) of the selective likelihood and estimates its variance using the observed Fisher information matrix. If $\hat{\beta}^{(S)}_E$ and $\hat{I}^{(S)}_{E,E}$ denote the selective MLE and the observed Fisher information matrix, using the selective likelihood, a two-sided $100(1-\alpha)\%$ confidence interval for $\beta_{E,j}$ is constructed as
$$\hat{\beta}^{(S)}_{E,j} \pm z_{1-\alpha/2}\, \frac{\hat{\sigma}^{(S)}_j}{\sqrt{n}},$$
where $\hat{\sigma}^{(S)}_j = \sqrt{\big[(\hat{I}^{(S)}_{E,E})^{-1}\big]_{j,j}}$ is the estimated standard deviation of $\sqrt{n}\,\hat{\beta}^{(S)}_{E,j}$, and $z_{1-\alpha/2}$ is the $(1-\alpha/2)$-th quantile of a standard normal distribution. This procedure closely resembles classical inference via maximum likelihood, with the exception that the standard estimators are replaced by their selective analogs that adjust for bias from the model selection process.

2.4 Contributions and other related work

The contribution of the proposed method is threefold. First, we identify a simple representation of the selection event by developing a randomization framework. A significant challenge in conditional selective inference is the need for an explicit characterization of the selection event, which is particularly difficult in distributed settings. By leveraging the randomization framework, the selection event can be expressed in a simple closed form, making conditional selective inference feasible for GLMs with selected predictors.
Second, the proposed method is efficient in both computation and communication. It requires only lowdimensional summary statistics to be shared from local machines with a central machine. After aggregating these summary statistics, the central machine solves a convex optimization problem. Therefore, the proposed approach is suitable for scenarios where direct data sharing between sites is not possible or inter-site communication is costly. Third, the proposed method can be easily adapted to address the p-value lottery problem. In this sense, our procedure is related to the multi-carving approach in Schultheiss et al. (2021), which uses conditional techniques known as carving (Fithian et al., 2014; Panigrahi, 2023). However, existing carving and multi-carving implementations often depend on computationally expensive MCMC sampling, while our method provides a more efficient alternative. We conclude this section with some more related work. Within the distributed setting, much work has been devoted to the averaged M-estimator (Mcdonald et al., 2009; Zinkevich et al., 2010; Zhang et al., 2012; Rosenblatt and Nadler, 2016). Rosenblatt and Nadler (2016) show that the averaged M-estimator is first-order equivalent to the centralized Mestimator in the fixed-dimension setting. Dobriban and Sheng (2021) study the efficiency of an estimator based on weighted average in the linear regression setting, as dimensions grow with sample sizes. Some methods have taken a likelihood-centric approach, e.g., Jordan et al. (2019) propose a surrogate likelihood where higher-order derivatives in a Taylor-series expansion of the full log-likelihood are replaced by local approximations. Lin and Xi (2011) propose an aggregated estimator for generalized linear models (GLM), where the locally computed MLE and Hessian of the likelihood are merged for efficiency gains. In work by Huang and Gelman (2005); Neiswanger et al. (2014); Wang and Dunson (2013); Scott et al. (2016); Minsker et al. (2017); Srivastava et al. (2018), distributed MCMC algorithms combine local posterior samples to obtain a global posterior distribution. In the post-selection inference literature, a selective likelihood was appended to priors for Bayesian inference post selection in Panigrahi and Taylor (2018); Panigrahi et al. (2021). The focus in these settings was on a category of variable selection rules that can be written as a set of polyhedral constraints on data. For the same category of selection rules, a different body of work (Lee et al., 2016; Hyun et al., 2018; Le Duy and Takeuchi, 2022) considers a truncated Gaussian distribution for inference from normal data. Moving beyond normal data, Taylor and Tibshirani (2018) outline an asymptotic scheme to provide selective inference in GLMs. For blackbox selection, Liu et al. (2022) obtain the selective likelihood Selective Inference with Distributed Data by learning the selection event from bootstrapped data. We take a different approach in the distributed setup by casting the problem into a randomized framework, and provide an asymptotic likelihood function for the selected GLM. The use of a randomized framework for selective inference was explored in Tian et al. (2016); Panigrahi et al. (2017) to improve power. In Rasines and Young (2023); Panigrahi et al. (2022), a randomized framework for selective inference was investigated for more efficient uses of data than sample splitting. 
The connection between a randomized framework for selective inference and algorithmic stability was explored in Zrnic and Jordan (2023). 3. Selective inference with distributed data In this section, we first provide a characterization of the selection event with distributed data. Then we introduce a selective likelihood that takes into account the aggregation of predictors across machines to form the selected GLM. 3.1 The selection event Fixing some notations, for z Rn, let A(z) and 2A(z) denote the vectors in Rn whose ith coordinates are the first and second derivatives of A( ) at zi, respectively. We recast the regression problem (1) as a randomized lasso problem similar to (4). To do so, define the randomization variables n X ( A(X bβΛ,(k)) Y ) 1 nk X(k) ( A(X(k) bβΛ,(k)) Y (k)) (5) for k [K], where bβΛ,(k) is the solution to the lasso problem at machine k. Let Ω= (ω(1), , . . . , ω(K), ) RKp be the stack of the K randomization variables. As reviewed earlier, an equivalent expression for the lasso problem on machine k is bβΛ,(k) = argmin β Rp 1 n i [n] (A(x i β) yix i β) + Λβ 1 nω(k) β. (6) This is because the Karush-Kuhn-Tucker (KKT) conditions of problem (1) and problem (6) are the same. Note that the KKT conditions are given as follows: 1 n X A(X bβΛ,(k)) Y + γ(k) = nω(k), γ(k) E(k) = ΛE(k)s(k), s(k) = sign bβΛ,(k) E(k) , γ(k) E(k) = Λ E(k)z(k), z(k) 1, where the vector γ(k) represents the subgradient of the penalty Λβ 1 evaluated at the lasso solution. To perform inference for the selected GLM (3), we need some additional notations. Let dk = |E(k)|, d = |E|, and d = P k [K] dk. Let b(k) = bβΛ,(k) E(k) Rdk represent the lasso solution from machine k with only the active variables E(k), and let b B(k) denote the corresponding random vector. Let b B R d be the random vector formed by stacking b B(k) Liu and Panigrahi Local machine k Aggregated version Variable R.V. Obs. Dim R.V. Obs. Dim Selected variable set b E(k) E(k) [p] b E E [p] Active variables of lasso solution b B(k) b(k) dk b B b d Signs of active variables b S(k) s(k) dk b S s d Subgradients of inactive variables b Z(k) z(k) p dk b Z z p K d Subgradients of lasso penalty Λβ 1 bΓ(k) γ(k) p bΓ γ p K Randomization variables ω(k) p K Ω p K Table 1: Notations for the variables involved in selection. Columns 2-4 list the variables for the lasso problem on local machine k, while columns 5-7 show the aggregated or stacked versions. For the optimization variables (excluding selected variable sets and randomization variables), the corresponding random variables (R.V.) are denoted by uppercase letters with a hat, observed values (Obs.) by lowercase letters, and their dimensions (Dim) are also provided. The relationship among the variables is given in Equation (7). from all machines, and let b denote the realized value of b B. Following the same scheme, we denote random variables with uppercase letters, while lowercase letters will represent their realized values, and bold letters will represent a vector obtained by stacking these variables from all machines. To make it easier for readers, we have provided Table 1 that contains a list of notations used in our paper. Our method conditions on n bΓ(k) = γ(k) k [K] o , (8) a proper subset of { b E = E}, where bΓ(k) is the subgradient of the lasso penalty with realized value γ(k). In fact, event (8) is a subset of { b E(k) = E(k) k [K]}, and therefore a subset of the event { b E = E}. 
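For intuition about the quantities just introduced, the next sketch evaluates the randomization variable in (5) and the penalty subgradient for the Gaussian case, where the gradient of A is the identity. The randomization variable is a theoretical device built from the pooled data, so this sketch is purely illustrative; the helper names are ours. The identification of the subgradient with minus the gradient of machine k's local loss at its lasso solution follows from the KKT conditions of problem (1).

```python
import numpy as np

def local_subgradient(X_k, y_k, beta_lasso_k):
    # gamma^(k): subgradient of the lasso penalty at machine k's solution.
    # By the KKT conditions of problem (1) with Gaussian loss (grad A(z) = z),
    # it equals minus the gradient of the local loss at beta_hat^{Lambda,(k)}.
    n_k = X_k.shape[0]
    return -X_k.T @ (X_k @ beta_lasso_k - y_k) / n_k

def randomization_variable(X, y, X_k, y_k, beta_lasso_k):
    # omega^(k) from Equation (5) in the Gaussian case: the gap between the pooled-loss
    # gradient and machine k's local-loss gradient, evaluated at the local lasso solution.
    n, n_k = X.shape[0], X_k.shape[0]
    pooled = X.T @ (X @ beta_lasso_k - y) / n
    local = X_k.T @ (X_k @ beta_lasso_k - y_k) / n_k
    return pooled - local

def signs_and_inactive_subgradient(beta_lasso_k, gamma_k):
    # Split gamma^(k) into the sign vector s^(k) on the active set E^(k) and the
    # (penalty-scaled) subgradient on the inactive coordinates, as listed in Table 1.
    active = np.flatnonzero(beta_lasso_k)
    inactive = np.setdiff1d(np.arange(beta_lasso_k.size), active)
    return np.sign(beta_lasso_k[active]), gamma_k[inactive]
```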
Note that the conditional approach for selective inference remains valid even when we work with a proper subset of the selection event, and it is easier to characterize the event (8) than the event { b E = E}, as shown in the next result. In this result, note that b S R d and b Z Rp K d denote the signs of lasso solutions at active variables and the subgradients at inactive variables, respectively, stacked across K machines, with the realized values s and z (see Table 1). Proposition 1 (Characterization of the selection event) The conditioning event (8) can be characterized as n bΓ(k) = γ(k) k [K] o = n sign(b B) = s, b Z = z o . To avoid any confusion, we will hereafter refer to the conditioning event in (8) as the selection event. Selective Inference with Distributed Data 3.2 Marginal distribution of the MLE One common method to infer for βE is through a maximum likelihood approach. In a distributed setting, we consider a maximum likelihood estimator (MLE) that is obtained by aggregating local estimators from each of the machines. Assuming that E is a fixed set, we derive the joint marginal distribution of this MLE and the randomization variables Ωwithout considering the model selection process. This is the first step in obtaining our selective likelihood, before we condition the marginal distribution on the selection event. Let bβ(k) E = argmin β Rd 1 nρk i Ck A(x i,Eβ) yix i,Eβ be the local MLE at machine k, and let b I(k) E,E = 1 nk (X(k) E ) c W k X(k) E , where c W (k) = diag 2A(X(k) E bβ(k) E ) , be the corresponding observed Fisher information (obs-FI) matrix. Let b IE,E = PK k=0 ρkb I(k) E,E. Define the aggregated MLE as bβE = b I 1 E,E k=0 ρkb I(k) E,E bβ(k) E . (9) Let I denote the Fisher information E 1 n X diag( 2A(XEβE))X . In order to derive this marginal distribution of the aggregated MLE bβE, we state a few conditions. Assumption 1 For k [K], let e E(k) = E \ E(k). For j e E(k), either βE,j = O(n 1/2) or XE(k)I 1 E(k),E(k)IE(k),j = Xj. Assumption 2 The aggregated MLE bβE can be written as n(bβE βE) = I 1 E,E ℓ(βE) + op(1), n . Assumption 1 states conditions on predictors that are present in the selected model but are not selected by machine k. This condition suggests that our asymptotic assertions remain valid as long as these predictors are either weak in strength or have a high partial correlation with a predictor in the selected set E(k). Assumption 2 states that the aggregated MLE has an asymptotically linear representation. This condition is met when the aggregated MLE is asymptotically equivalent to the standard MLE. The latter has been shown to hold under some mild regularity conditions in Lin and Xi (2011). The following result provides the asymptotic distribution of the randomization variables, with a detailed proof in Appendix A.1. Proposition 2 Suppose nk/n ρk as n for k [K] and PK k=1 ρk < 1. Let U = diag(ρ 1 1 , . . . , ρ 1 K ) 1K K, where 1K K denotes the K K matrix with all entries equal to 1. Define Liu and Panigrahi the Kronecker product of U and I. Under Assumptions 1 and 2, we have n Ωd Np K (0, ΣΩ) , n . The model selection process is not solely a function of the aggregated MLE and the randomization variables, as it also depends on variables that were not included in the selected GLM. Therefore, we consider an additional statistic, defined as: n X E( A(XE bβE) Y ), which involves variables that were not selected by the lasso. The next theorem derives the marginal distribution of bβE, bβ E, and Ω. 
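Before turning to that theorem, here is a minimal sketch of the information-weighted aggregation in Equation (9), assuming each machine has already refit the GLM on E and reported its local MLE and observed Fisher information; `aggregate_mle` is an illustrative name, not part of the paper's code.

```python
import numpy as np

def aggregate_mle(local_mles, local_infos, ns):
    # Information-weighted aggregation of Equation (9):
    #   I_hat = sum_k rho_k * I_hat^(k),
    #   beta_hat = I_hat^{-1} sum_k rho_k * I_hat^(k) beta_hat^(k),
    # where rho_k = n_k / n are the sample proportions.
    n = float(sum(ns))
    rhos = [n_k / n for n_k in ns]
    info_agg = sum(rho * I_k for rho, I_k in zip(rhos, local_infos))
    weighted = sum(rho * I_k @ b_k
                   for rho, I_k, b_k in zip(rhos, local_infos, local_mles))
    return np.linalg.solve(info_agg, weighted), info_agg
```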
Theorem 3 Under the same assumptions as Proposition 2, it holds that bβE βE bβ E Ω I 1 E,E 0 0 0 I/IE,E 0 0 0 ΣΩ where I/IE,E = I E, E I E,EI 1 E,EIE, E is the Schur complement of IE,E in I. The proof of Theorem 3 is deferred to Appendix A.2. It is worth noting from this result that bβ E is an ancillary statistic for the parameters in the selected GLM. Furthermore, the randomization variables are independent of the statistics bβE and bβ E, which can be seen from the block diagonal covariance matrix in the limiting Gaussian distribution. 3.3 Selective likelihood In what follows, we base inference on a conditional likelihood in the selected GLM. This likelihood, referred to as the selective likelihood, is stated in Theorem 4. It is derived from the asymptotic distribution of the aggregated MLE, as stated in Theorem 3, after conditioning on the event described in Proposition 1. Before presenting this likelihood, we specify a technical condition and introduce the matrices involved in this likelihood. Assumption 3 Assume that the distribution of bβE βE,n bβ E Ω is absolutely continuous with respect to the Lebesgue measure on Rp(K+1). Let {pn}n 1 be the corresponding sequence of densities. Assume that {pn}n 1 is uniformly equicontinuous and bounded. Selective Inference with Distributed Data The condition in Assumption 3 together with the weak convergence in Theorem 3 implies that the density of (bβE, bβ E, Ω) converges to the corresponding limiting Gaussian density, uniformly on compact subsets of Rp(K+1). Define g(j) k = γ(j) E(k) Rdk for j, k [K], where g(j) k collects the coordinates of γ(j) belonging to E(k). Define g(j) = γ(j) E Rd, where g(j) contains the coordinates of γ(j) belonging to E. Let ρ0 = 1 P k [K] ρk. Let the matrices Ξ R d d, Ψ R d d, τ R d, Θ Rd d, Π Rd d, κ Rd be defined as follows. The (j, k) block of Ξ 1 is a d(j) d(k) matrix given by ρk + ρ2 k ρ0 IE(k),E(k) if j = k, ρ0 IE(j),E(k) if j = k. The (k, 1) block of Ξ 1Ψ and Ξ 1τ are given by ρ0 IE(k),E; Ξ 1τ k = ρkg(k) k ρk j=1 ρjg(j) k . Further, let ρ0 IE,E Ψ Ξ 1Ψ; Θ 1Π = IE,E; Θ 1κ = Ψ Ξ 1τ + ρj ρ0 g(j). Let ϕ( ; µ, Σ) denote the density function of the normal distribution N(µ, Σ). Let O = {v R d : sign(v) = sign(s)} denote the orthant in R d based on the sign vector s. Theorem 4 (Selective likelihood) Suppose that the conditions stated in Theorem 3 and Assumption 3 hold. The selective likelihood function, based on the asymptotic distribution of nbβE n bΓ(k) = γ(k) k [K] o , is equal to f βE; bβE, s, z = ϕ( nbβE; Π nβE + κ, Θ) P h nb B O | b Z = z i , P h nb B O | b Z = z i = Z ϕ( nbβE; Π nβE + κ, Θ) ϕ( nb B; Ψ nbβE + τ, Ξ) 1 n b B O o dbβEdb B. The proof is provided in Appendix A.3. With this selective likelihood, we proceed to compute the selective MLE and its associated Fisher information to perform inference as outlined in Section 2.3. In Section 4, we offer an approximation to this selective likelihood, which in turn enables an efficient way to compute the selective MLE and the corresponding Fisher information matrix. Liu and Panigrahi 3.4 Algorithm To conclude this section, we identify and specify the information that must be exchanged between the central machine and the local machines in order to carry out inferences in the selected GLM. Three rounds of communication are required for the central machine to perform selective inference: 1. Local machines send E(k) to the central machine. 2. The central machine sends the aggregated model E back to the local machines. 3. 
Local machines send local MLE bβ(k) E , local Fisher information b I(k) E,E, and the subgra- dient variables γ(k) E to the central machine. This is summarized in Algorithm 1. Note that the first two rounds of communication involve only the indices of the selected variables, which lead to the selected model E. It is obvious that the local MLE bβ(k) E and local Fisher information b I(k) E,E are needed at the central machine to compute the aggregated MLE and aggregated Fisher information, as defined in Equation (9). Besides these statistics, the selective likelihood derived in Theorem 4 involves the subgradients γ(j) E(k) and γ(j) E for j, k [K] in order to correct for the bias from model selection. Hence, in order to calculate the selective likelihood, the central machine needs the subgradients γ(j) with indices from E S k [K] E(k) sent from all local machines. Since we assumed for simplicity that E = S k [K] E(k), machine j simply needs to send γ(j) E to the central machine. Note that b I(k) E,E has a size of d2, resulting in a communication cost of O(d2) for each local machine, which does not depend on the original feature dimension p. Therefore, as long as a sparse model is selected, the communication cost for making inferences in the selected GLM remains relatively low. Extensions for more general aggregation rules are given in Appendix D. 4. Approximate selective MLE In the previous section, we obtained the selective likelihood function. As per Theorem 4, the selective log-likelihood is equal to log ϕ nbβE; Π nβE + κ, Θ log P h nb B O b Z = z i . (10) The main focus of this section is to provide an approximation for the second term (in the log-likelihood) by using a large-deviation limit for the log-probability. This approximation is made under the following moment and regularity conditions, which are commonly assumed to ensure the existence of the limit. Consider a real-valued sequence an that goes to infinity as n , and an = o(n1/2). Assume βE = βE,n satisfies that nβE,n = anβ E R|E|, where β E does not depend on n. Selective Inference with Distributed Data Algorithm 1: Communication of information STEP 1: Variable selection at local machines Machine k solves Problem (1) and sends E(k) = Support(bβΛ,(k)) to the central machine. STEP 2: Model aggregation at the central machine The central machine forms E = S k [K] E(k) and sends E back to the local machines. STEP 3: Communication of summary statistics Local machines communicate the following statistics to the central machine: local MLE and Fisher information: bβ(j) E , b I(j) E,E; subgradient: γ(j) E From the proof of Theorem 3, we have bβE βE,n bβ E Ω = n En + Rn, (11) where En = 1 n Pn i=1 ei,n is the average of n i.i.d. observations, and Rn = op(1). Assumption 4 (Moment condition and convergence of remainder) Assume that E [exp(λ e1,n 2)] < for some λ R+, and that lim n 1 a2n log P 1 an Rn 2 > ϵ = for any ϵ > 0, where ei,n and Rn are from the asymptotic linear representation in (11). Assumption 4 assumes the existence of an exponential moment in a neighborhood of zero, which is needed to justify a Laplace-type approximation for probabilities involving the mean of n i.i.d. variables, which in our case is En. The second condition in this assumption helps extend this justification to En with an added op(1) error term, Rn. 
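The three rounds of Algorithm 1 can be sketched as follows for the Gaussian case, extending the earlier selection sketch. The `LocalMachine` and `LocalSummary` classes and their method names are hypothetical stand-ins for whatever interface a real deployment would use; the local refit here is ordinary least squares.

```python
import numpy as np
from dataclasses import dataclass
from sklearn.linear_model import Lasso

@dataclass
class LocalSummary:
    # STEP 3 payload of Algorithm 1: O(d^2) numbers per machine, independent of p.
    beta_local: np.ndarray   # local MLE beta_hat^(k)_E on the aggregated model E
    info_local: np.ndarray   # local observed Fisher information I_hat^(k)_{E,E}
    subgrad_E: np.ndarray    # lasso subgradient gamma^(k) restricted to E

class LocalMachine:
    # Hypothetical interface for one site, specialized to the Gaussian loss.
    def __init__(self, X_k, y_k, lam):
        self.X, self.y, self.lam = X_k, y_k, lam

    def step1_select(self):
        # STEP 1: solve the local lasso and report the selected support E^(k).
        self.fit = Lasso(alpha=self.lam, fit_intercept=False).fit(self.X, self.y)
        return np.flatnonzero(self.fit.coef_)

    def step3_summarize(self, E):
        # STEP 3: refit the selected GLM on E (here: least squares) and summarize.
        n_k = self.X.shape[0]
        X_E = self.X[:, E]
        beta_local = np.linalg.lstsq(X_E, self.y, rcond=None)[0]
        info_local = X_E.T @ X_E / n_k
        subgrad = -self.X.T @ (self.X @ self.fit.coef_ - self.y) / n_k
        return LocalSummary(beta_local, info_local, subgrad[E])

def run_protocol(machines):
    supports = [m.step1_select() for m in machines]         # STEP 1: machines -> center
    E = sorted(set().union(*map(set, supports)))             # STEP 2: center -> machines
    summaries = [m.step3_summarize(E) for m in machines]     # STEP 3: machines -> center
    return E, summaries
```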
Assumption 5 For a fixed convex set R0 Rp(K+1), and for O = Op(1), we impose the condition that lim n 1 a2n nbβE nbβ E nΩ nbβE nbβ E nΩ Liu and Panigrahi In terms of bβE, bβ E, and Ω, the probability of the selection event takes the form nbβE nbβ E nΩ (see Appendix B.1). Therefore, Assumption 5 allows us to approximate the probability of the selection event up to an op(1) remainder. Theorem 5 Suppose that the conditions in Assumption 4 and Assumption 5 are met. Define Ln = inf b,B an κ Θ 1 b Πβ E 1 an τ Ξ 1 B Ψb 1 a2n Barr O(an B) where Barr O(x) = P i log(1 + 1 sixi ). Then, we have lim n 1 a2n log P h nb B O b Z = z i + Ln = C0, where C0 is a constant that does not depend on β E. The proof is provided in Appendix B.1. As a consequence of Theorem 5, we can substitute the log-probability in (10) by ( 1 2 (anb anΠβ E κ) Θ 1 (anb anΠβ E κ) 2 (an B anΨb τ) Ξ 1 (an B anΨb τ) + Barr O(an B) ignoring the additive constant in the limit. Finally, we reparameterize anb = nv and an B = n V , which gives the following approximation of the selective log-likelihood log ϕ( nbβE; Π nβE,n + κ, Θ) ( 1 2 nv nΠβE,n κ Θ 1 nv nΠβE,n κ 2 n V nΨv τ Ξ 1 n V nΨv τ + Barr O( n V ) The score and curvature of this selective likelihood yield the selective MLE and the selective obs-FI matrix. The derivation of these two estimators follows the steps in Panigrahi and Taylor (2023) for the standard Gaussian regression problem. In the interest of completeness, the expressions of these estimators are provided below. Selective Inference with Distributed Data Theorem 6 Consider solving the optimization problem b V bβE = argmin V R d 1 2( n V Ψ nbβE τ) Ξ 1( n V Ψ nbβE τ) + Barr O( n V ). (13) The maximizer of the approximate selective likelihood and the observed Fisher information matrix are equal to Π 1 bβE 1 nΠ 1κ + I 1 E,EΨ Θ 1 ΨbβE + 1 nτ b V bβE Θ 1 + Ψ Ξ 1Ψ Ψ Ξ 1 Ξ 1 + 2Barr O nb V bβE 1 Ξ 1Ψ 1 IE,E, (15) respectively. The proof is provided in Appendix B.2. In practice, the matrices Π, κ, Θ, Ψ, τ, Ξ are computed with the observed Fisher information b I, which are then used to compute the selective MLE and the selective obs-FI in (14) and (15), respectively. The algorithm for conducting inference based on the approximate selective MLE is summarized in Algorithm 2. Algorithm 2: Approximate selective MLE-based inference Compute the aggregated MLE bβE defined by Equation (9). Solve the d-dimensional convex optimization in (13). Compute bβ(S) E and b I(S) E,E as stated in (14) and (15). r b I(S) E,E 1 j,j , j [d]. Compute the two-sided p-value for H0 : βE,j = 0 as where Φ = 1 Φ is the CDF of the standard normal distribution. Compute the two-sided 100 (1 α)% confidence interval for βE,j as bβ(S) E,j z1 α/2 bσ(S) j n . 5. Addressing p-value lotteries When dealing with sparse regression, it is often feasible to construct p-values after reducing the number of variables to a manageable size. One common method is sample-splitting, Liu and Panigrahi as described in Wasserman and Roeder (2009), where variables are first selected on a subset of the data and then p-values are reported using classical least squares estimation on the remaining samples. Variables that are not selected are assigned a p-value equal to 1. A more powerful alternative to sample-splitting is carving, introduced in Fithian et al. (2014) through conditioning. However, results produced by sample splitting or carving are overly sensitive to the randomness involved in partitioning the data. 
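A minimal sketch of the convex optimization in (13) and of the remaining Wald-style steps of Algorithm 2 is given below, assuming the matrices Ψ, τ, Ξ, the sign vector s, and the selective MLE and observed Fisher information from (14) and (15) have already been assembled. It uses a generic derivative-free solver for brevity, whereas a practical implementation, such as the released code linked in Section 6, would exploit gradients; the function names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def barrier(v, signs):
    # Barr_O(v) = sum_i log(1 + 1/(s_i v_i)); infinite outside the orthant O.
    sv = signs * v
    return np.inf if np.any(sv <= 0) else np.sum(np.log1p(1.0 / sv))

def solve_problem_13(beta_agg, Psi, tau, Xi, signs, n):
    # Objective of (13):
    #   0.5 * (sqrt(n) V - Psi sqrt(n) beta_hat_E - tau)' Xi^{-1} (...) + Barr_O(sqrt(n) V).
    Xi_inv = np.linalg.inv(Xi)
    center = Psi @ (np.sqrt(n) * beta_agg) + tau

    def objective(v):
        resid = np.sqrt(n) * v - center
        return 0.5 * resid @ Xi_inv @ resid + barrier(np.sqrt(n) * v, signs)

    v0 = signs / np.sqrt(n)  # strictly feasible start inside the orthant O
    return minimize(objective, v0, method="Nelder-Mead",
                    options={"maxiter": 50000, "fatol": 1e-10, "xatol": 1e-8}).x

def wald_summary(beta_sel, info_sel, n, alpha=0.10):
    # Remaining steps of Algorithm 2, given the selective MLE (14) and obs-FI (15):
    # standard errors, two-sided p-values, and (1 - alpha) confidence intervals.
    se = np.sqrt(np.diag(np.linalg.inv(info_sel)) / n)
    pvals = 2 * norm.sf(np.abs(beta_sel / se))
    q = norm.ppf(1 - alpha / 2)
    return pvals, np.column_stack([beta_sel - q * se, beta_sel + q * se])
```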
This issue has been widely reported in literature as the p-value lottery problem. See, for example, the paper by Meinshausen et al. (2009). To address the p-value lottery problem, Meinshausen et al. (2009) aggregate p-values from repeated splitting, and more recently, Schultheiss et al. (2021) propose to aggregate p-values after repeated carving on random splits of data. We refer to the former procedure as multi-splitting and the latter procedure as multi-carving. With increasing numbers of replicates, the results are expected to be less sensitive to the randomness from the splits. Suppose one conducts multi-splitting or multi-carving B times, and obtains the p-values p(b) j , for b [B], and j [p]. This is followed by aggregating the B p-values through their empirical quantiles Qj(γ) := qγ γ p(b) j , b [B] 1, where qγ denotes the γ-th empirical quantile. One can also minimize over γ and use Pj := (1 log(γmin)) inf γ (γmin,1) Qj(γ) 1. (16) The aggregation scheme produces valid p-values as long as all p-values are individually valid. Below, we show that our proposal in the paper can be easily adapted to address the pvalue lottery problem without recourse to MCMC sampling. We proceed as multi-splitting and multi-carving, i.e., we use a subsample of size n1 for variable selection. For inference, we reuse data from selection by conditioning on the event of selection. We repeat this procedure B times and aggregate the p-values as above. Moreover, in each replicate, we can draw K random subsets of size n1 with replacement. A base model is selected using each subset, and the K base models are aggregated as done in (2). To conduct selective inference, a similar procedure can be used, with a slight modification in the distribution of the randomization variable Ω. Specifically, nΩstill converges to a normal distribution but with a different covariance matrix, as stated in Lemma 7. Lemma 7 If the K subsets D(1), . . . , D(K) are independent random samples of the data set D (rather than disjoint partitions) and each subset has size [ρn], then Theorem 3 holds with The proof is provided in Appendix C. Following the same argument as in Theorem 4, the asymptotic selective likelihood function takes the same form as the expression in Theorem 4, with slightly different expressions Selective Inference with Distributed Data Algorithm 3: Multiple carving for k = 1, . . . , K do Perform n1-out-of-n subsampling to form data set D(k). Solve the lasso problem with data D(k) and select variable set E(k). Construct the aggregated model E = Aggregate(E(1), . . . , E(K)). Compute the MLE bβE and Fisher information b IE,E in the selected GLM (3) using all the data. Compute matrices Ξ, Ψ, τ, Θ, Π, κ as given in Equation (17). Perform selective inference for βE using the approximate selective-MLE based method in Algorithm 2. of the matrices Ξ, Ψ, τ, Θ, Π, κ due to the change of the covariance matrix ΣΩ. There matrices are now calculated as {Ξ 1}j,k = δj,k ρ 1 ρIE(j),E(j); {Ξ 1Ψ}k = ρ 1 ρIE(k),E; {Ξ 1τ}k = ρ 1 ρg(k) k ; (17) Θ 1 = 1 + Kρ IE,E Ψ Ξ 1Ψ; Θ 1Π = IE,E; Θ 1κ = Ψ Ξ 1τ + ρ 1 ρ The entire procedure is summarized in Algorithm 3. When K = 1 subset is drawn in each of the B replicates, this procedure closely resembles the multi-carving procedure described in Schultheiss et al. (2021). When B = 1 and K = 1, it reduces to the data carving of Fithian et al. (2014). However, the multi-carving method of Schultheiss et al. 
(2021) conditions on the indices of the samples used for selection which results in a polyhedral event similar to the event in Lee et al. (2016). In contrast, our method marginalizes over the randomness involved in subsampling instead of conditioning on it. This is captured in the definition of our randomization variable Ω. As shown in the simulation results in Section 6.2, our method achieves higher power compared to the method that conditions on more. 6. Experiments In this section, we demonstrate the numerical performance of our proposed procedure. Our code can be accessed from the Git Hub repository https://github.com/snigdhagit/ Distributed-Selectinf. 6.1 Experiments with distributed data sets The data in our experiments is simulated as follows. For i [n], we draw xi Np(0, Σ), where Σij = ρ|i j| with ρ = 0.9, p = 100. In the first model, we draw a real-valued response from a linear model as yi N(x i β, σ2) with σ2 = 1. The dispersion parameter σ2 is estimated by ˆσ2 = 1 n d Pn i=1(yi x i,E bβE)2. In our second model, we draw a binary response from a logistic-linear model as yi Bernoulli(1/(1 + e x i β)). Observation i is Liu and Panigrahi independent of all the other observations in our data set. There are 5 non-zero coefficients in our model; each non-zero βj is equal to 2c log p, where the sign is random and c is referred to as the signal strength. Unless otherwise specified, the signal strength c is 0.7 for the linear model and 1 for the logistic model. The regularization parameter λ is set to 2 log p for linear regression and 0.5 log p for logistic regression. The n observations are partitioned into K + 1 disjoint subsets D(0), D(1), . . . , D(K). Subsets 1 through K, representing the data stored at K local machines, are used for variable selection. Subset 0, representing the data at the central machine, is used only at the time of selective inference. We design three different scenarios to investigate the performance of our procedure. We conduct 500 rounds of simulations in each scenario. (I). In Scenario 1, we vary the number of distributed data sets K {2, 4, 6, 8}. Each local machine has nk = [8000/K] samples, and the central machine has access to 1000 samples. (II). In Scenario 2, we vary the signal strength c. For linear regression, we vary c among {0.3, 0.5, 0.7, 0.9}; while for logistic regression, we vary c among {0.5, 1, 1.5, 2}. We fix K = 2. The central machine has 2000 samples, and each of the two local machines has 4000 samples. (III). In Scenario 3, we vary the number of samples that are reserved only for selective inference at the central machine. Specially, we vary n0 {250, 500, 1000, 2000}. We fix K = 3 and nk = 2000 for 1 k K. We report the coverage proportions and average interval lengths for βE in Figure 2 and Figure 3, for the linear model and the logistic model, respectively. The target parameter βE is defined as βE := argmin βE Rd i=1 E h ℓ(yi; x i,EβE) i , where ℓrepresents the corresponding GLM loss and the expectation is taken over the true distribution of yi. Specifically, for linear regression, ℓ(yi; θ) = 1 2(yi θ)2 and the true distribution is y N(x i β, σ2). For logistic regression, ℓ(yi; θ) = [yi log 1 1+e θ + (1 yi) log(1 1 1+e θ )] and the true distribution is yi Bernoulli((1 + e x i β) 1). The target coverage of the confidence interval is 90%. The lengths of confidence intervals are indications of the statistical power associated with selective inference for βE. 
Our method Dist-SI constructs the confidence intervals based on the approximate selective MLE method as introduced before. As a baseline for comparison, we consider the Splitting method, which uses only the data at the central machine to construct standard Wald confidence intervals. Error bars represent variations among 500 random replicates. Observations. Across all scenarios, the confidence intervals produced by Dist-SI (approximately) attain the desired coverage probability. The Splitting method produces valid confidence intervals, but discards samples used by the local machines. The advantages of reusing data from the local machines are quite evident in the plots for the lengths of the confidence intervals. As expected, Dist-SI yields tighter confidence intervals and achieves higher power than the baseline procedure based on Splitting in all three scenarios. We Selective Inference with Distributed Data Dist-SI Splitting 1 2 3 4 Signal strength 250 500 1000 2000 n0 Dist-SI Splitting 1 2 3 4 Signal strength 250 500 1000 2000 n0 Figure 2: Coverage proportions (top panel) and average interval lengths (bottom panel) for the linear model. Left panel: varying K. Middle panel: varying signal strength. Right panel: varying n0. observe that interval lengths for both methods increase with K. This is because the final model, which is the union of the K models selected by local machines, is likely to be larger for larger K. In this case, the variance of bβj tends to be larger. For a similar reason, interval lengths tend to increase with signal strengths as well. In Scenario 3, we see that both methods produce longer intervals when n0 decreases, and as expected, the gap between splitting and Dist-SI is more pronounced with fewer samples at the central machine. 6.2 Experiments on p-value lotteries In this section, we apply the suitable adaptation of our procedure to solve the p-value lottery problem, as described in Section 5. We compare our procedure with Multi-carving and Multi-splitting as proposed by Schultheiss et al. (2021) and Meinshausen et al. (2009), respectively. For the latter two algorithms, we use the implementation provided by Schultheiss et al. (2021) with code available on Git Hub1. To avoid any confusion, we continue to refer our procedure as Dist-SI , though we are no longer simulating distributed data sets. 1. https://github.com/cschultheiss/Multicarving. The original code is written in R, and we load them into Python when running our simulations, which might have contributed to slightly longer running times as reported in our findings. Liu and Panigrahi Dist-SI Splitting 1 2 3 4 Signal strength 250 500 1000 2000 n0 Dist-SI Splitting 1 2 3 4 Signal strength 250 500 1000 2000 n0 Figure 3: Coverage proportions (top panel) and average interval lengths (bottom panel) for the logistic model. Left panel: varying K. Middle panel: varying signal strength. Right panel: varying n0. Selective Inference with Distributed Data We generate our data from the same linear model as described before, but now we use the same dimension as in Schultheiss et al. (2021). We take n = 100 and p = 200, and consider s = 5 nonzero coefficients with βj = 2 log p. We use B = 10 replicates, and aggregate the p-values using formula (16) with γmin = 0.1. The proportion of samples used for variable selection is varied in the set {0.5, 0.6, . . . , 0.9}. We fix the significance level at 0.1. 
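Since the aggregation rule in (16) is used verbatim in this experiment, here is a minimal sketch of it for a single coefficient, assuming B replicate p-values are available; `aggregate_pvalues` and the grid over γ are our illustrative choices.

```python
import numpy as np

def aggregate_pvalues(pvals, gamma_min=0.1, grid_size=100):
    # Quantile aggregation over B replicates, Equation (16):
    #   Q_j(gamma) = min(1, gamma-quantile of {p_j^(b) / gamma}),
    #   P_j = min(1, (1 - log gamma_min) * inf_{gamma in (gamma_min, 1)} Q_j(gamma)).
    pvals = np.asarray(pvals, dtype=float)
    gammas = np.linspace(gamma_min, 1.0, grid_size, endpoint=False)
    Q = np.array([min(1.0, np.quantile(pvals / g, g)) for g in gammas])
    return min(1.0, (1.0 - np.log(gamma_min)) * Q.min())

# Example: B = 10 replicate p-values for a single coefficient.
print(aggregate_pvalues([0.01, 0.03, 0.2, 0.5, 0.04, 0.02, 0.6, 0.08, 0.01, 0.3]))
```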
Again, the inferential target of these methods is not the actual generating β, unless the selected set E contains all the nonzero coefficients of β. Nevertheless, we can compare these methods by their accuracy in terms of detecting nonzero signals. Specifically, a coefficient βj is predicted to be nonzero if the corresponding p-value is no larger than 0.1. We measure the accuracy of this prediction by the F-score, which is defined as F-score := 2 True Positive 2 True Positive + False Positive + False Negative Note that the positives and negatives are computed with respect to the true underlying parameter β, which might be different from the inferential target post selection if the selected E fails to contain some nonzero coefficients of β. Besides computing the F-score, we compare the average run time for Dist-SI and Multi-carving . The results are shown in Figure 4. In the left panel, we plot the F-scores of the three methods with varying proportions. The error bars are once again reported for 500 random repetitions. In the right panel, we plot the average run times of Dist-SI and Multi-carving on the log scale. Observations. We find that our procedure has higher F-scores than the two previously proposed alternatives, Multi-carving and Multi-splitting , for all values of sample proportion. Especially, a p-value in every replicate uses the full data after carefully discarding information that was used up for selecting predictors. The re-use of data from selection results in larger power over Multi-splitting . Our procedure aligns with Multi-carving , which also deploys conditional techniques to reuse data for hypothesis testing. However, a key distinction of our procedure with Multi-carving lies in how we use the randomization framework to characterize the selection event, and subsequently marginalize over this randomness to construct our p-values. In particular, we note that Multi-carving conditions on the randomization that is involved during variable selection on a random split of the data, whereas our procedure explicitly characterizes the distribution of randomization instead of simply conditioning on Ω. We believe that this difference between the two procedures shows up in our simulated findings as we note higher values of F-score with Dist-SI . Unsurprisingly, our proposal is also faster than Multi-carving by about 100 times. This is because the latter procedure requires MCMC sampling, while our procedure only involves solving a convex optimization problem. 6.3 Experiments on medical data set We illustrate an application of our procedure on a real data set that is publicly available on MIT s GOSSIS database Raffa et al. (2022). This data set contains records on intensive care unit (ICU) admissions from 192 hospitals, including patients demographic information, and various medical measurements, and lab results. We only use the data sets from the four largest hospitals, among which three data sets are used for variable selection and the remaining one is reserved for selective inference. We focus on a regression problem with data Liu and Panigrahi 0.5 0.6 0.7 0.8 0.9 proportion 0.5 0.6 0.7 0.8 0.9 proportion Dist-SI Multi-carving Multi-splitting Figure 4: Compare Dist-SI with multi-carving and multi-splitting. The left panel shows the F-scores when using different proportions of samples for selection. The right panel shows the average running time. from the first 24 hours of intensive care. 
The response in this problem is binary, and takes the value 1 if a patient admitted to an ICU has been diagnosed with Diabetes Mellitus, and is 0 otherwise. The same problem appeared in the 2021 Women in Data Science Datathon 2. We remove variables with more than half missing values, and also remove rows with missing values. After preprocessing, we end up with 81 predictors. The three data sets used for variable selection have sample sizes ranging from 1633 to 1788, and the data set reserved for inference has 2000 samples. For model selection, we run the logistic regression with lasso penalty. The regularization parameter is tuned with one extra data set with 893 samples. The selected GLM has 58 predictors. To construct confidence intervals for the 58 selected variables, we apply the proposed Dist-SI algorithm and Splitting as done in simulations. The significance level is set to be 0.1. Dist-SI reports 21 significant variables, while Splitting reports 13 significant variables, with 10 variables overlapping as significant in both methods. In Figure 5, we plot the confidence intervals for the regression coefficients that are rejected by either of the two procedures. The boxplot for the lengths of these intervals, in Figure 6, show that the median length of the Dist-SI intervals is smaller than the Splitting intervals by 67%. Additionally, the coefficient of variation is 1.8 and 3.9 for Dist-SI and Splitting , respectively. This indicates that the dispersion of interval lengths for Dist-SI is smaller than Splitting . On this instance, we see that Splitting yields a few very wide intervals. This is because the Hessian matrix based on data present at the central machine (reserved data set) is ill-conditioned. Dist-SI does not have this issue because it reuses data from the three hospitals for more powerful selective inference. 2. https://www.kaggle.com/competitions/widsdatathon2021/data. Accessed on on Dec. 17, 2022. Selective Inference with Distributed Data resprate_apache d1_diasbp_min d1_resprate_max pre_icu_los_days elective_surgery gcs_motor_apache intubated_apache d1_sysbp_noninvasive_min d1_sysbp_noninvasive_max d1_resprate_min d1_heartrate_max d1_sysbp_min d1_sysbp_max d1_heartrate_min encounter_id d1_spo2_min ventilated_apache Dist-SI Splitting Figure 5: Confidence intervals for the coefficients that are rejected by either Dist-SI or sample splitting. Dist-SI Splitting Figure 6: Boxplot of confidence interval lengths produced by Dist-SI and sample splitting. The y-axis is on the logarithmic scale. Liu and Panigrahi 7. Conclusion Model selection appears to be routine practice when analyzing big data sets. Inference for data-dependent models and parameters is a very challenging goal, because sound procedures must rigorously account for randomness from the selection process. To the best of our knowledge, this paper is the first contribution that addresses selective inference with distributed data. We provide a rigorous procedure to construct confidence intervals and p-values when inference is sought in a generalized linear model with selected predictors. We identify a representation for selection in a common distributed setup, and provide an asymptotic selective likelihood by developing a novel randomized framework for our problem. 
Approximately-valid selective inference, based on our selective likelihood, takes a very simple form: our confidence intervals for the selected regression coefficients are centered around the MLE of the selective likelihood, and the variance of the MLE is estimated by the observed Fisher information matrix. An appealing feature of our proposed inferential procedure is that we only require some aggregated information, with relatively low communication cost, from each machine. This feature allows an adaptation of our procedure to settings where various sites may not be willing to share their individual data sets. But, we note that there is room for improvement here, specially if various sites have not measured the same set of predictors. Our paper also provides an efficient solution for the p-value lottery problem without relying on MCMC samplers. Our procedure bypasses the primary computational bottleneck in the earlier proposal (Schultheiss et al., 2021) by reducing selective inference to the solution of an optimization problem. The recently developed sampler in Liu (2023) is also applicable in our distributed setup to perform sampling-based inference, as an alternative to the MLEbased inference used in this paper. Acknowledgments S. Panigrahi s research is supported by NSF grants DMS-1951980 and DMS-2113342 and by the NSF CAREER Award DMS-2337882. S. Liu s research is partially supported by the Stanford Data Science Scholars program. Selective Inference with Distributed Data Appendix A. Proofs for Section 3 Supporting results are collected in Appendix A.3.1. A.1 Proof of Proposition 2 Proof We write βE = βE,n in the asymptotic analysis as n . Define β E(k),n = I 1 E(k),E(k)IE(k),EβE,n. We start with the decomposition 1 n X ( A(X bβΛ,(k)) Y ) = 1 n X ( A(XEβE,n) Y ) + R(k) 1 + R(k) 2 , n X ( A(XE(k)β E(k),n) A(XEβE,n)), n X ( A(X bβΛ,(k)) A(XE(k)β E(k),n)). In a similar fashion, we decompose the quantities based on local data D(k) as 1 nk X(k) ( A(X(k) bβΛ,(k)) Y (k)) = 1 nk X(k) ( A(X(k) E βE,n) Y (k)) + r(k) 1 + r(k) 2 . The decomposition in the above two displays allow us to write nω(k) = n n 1 n X ( A(XEβE,n) Y ) 1 nk X(k) ( A(X(k) E βE,n) Y (k)) + R(k) 1 + R(k) 2 r(k) 1 r(k) 2 o = n ω(k) + n R(k), where R(k) = R(k) 1 + R(k) 2 r(k) 1 r(k) 2 , and ω(k) = 1 n X ( A(XEβE,n) Y ) 1 nk X(k) ( A(X(k) E βE,n) Y (k)). Let Ω Rp K be the stack of ω(k) for 1 k K. It suffices to show that n Ωd N(0, ΣΩ) and n R(k) p 0. (19) To proceed with the proof, let ei = xi( A(x i,EβE,n) yi). It is easy to see that ei are i.i.d. for all 1 i n with E [ei] = 0 and Var [ei] I under the selected model (3). It follows that n ω(k) = p 1 ρk 1 n nk i/ Ck ei 1 ρk ρk Hence, we have n ω(k) d N 0, 1 ρk Liu and Panigrahi Cov h n ω(j), n ω(k)i I for j = k. This leads to ρj I I I 1 ρk which proves the first statement of (19). Lemma 8 and Lemma 9 below show that n R(k) = op(1), thereby concluding the proof of (19). Lemma 8 (Rate of R(k) 1 r(k) 1 ) Let n X ( A(XE(k)β E(k),n) A(XEβE,n)), nk X(k) ( A(X(k) E(k)β E(k),n) A(X(k) E βE,n)). Then n(R(k) 1 r(k) 1 ) = op(1). Proof [Proof of Lemma 8.] According to Lemma 12, we have XE(k)β E(k),n XEβE,n = Op(n 1/2). By a Taylor approximation, we have R(k) 1 r(k) 1 = 1 n X diag( 2A(XEβE,n))(XE(k)I 1 E(k),E(k)IE(k),E XE)βE,n nk X(k) diag( 2A(X(k) E βE,n))(X(k) E(k)I 1 E(k),E(k)IE(k),E XE)βE,n + op(n 1/2). By Assumption 1, there exists Ek e E(k) = E\E(k) such that for j e E(k)\Ek, XE(k)I 1 E(k),E(k)IE(k),j = Xj and βEk,n = O(n 1/2). 
So the last display simplifies as R(k) 1 r(k) 1 = 1 n X diag( 2A(XEβE,n))(XE(k)I 1 E(k),E(k)IE(k),Ek XEk)βEk,n nk X(k) diag( 2A(X(k) E βE,n))(X(k) E(k)I 1 E(k),E(k)IE(k),Ek XEk)βEk,n + op(n 1/2). If Ek is not empty, let n X diag( 2A(XEβE,n))(XE(k)I 1 E(k),E(k)IE(k),Ek XEk) . R(k) 1 r(k) 1 = 1 n X diag( 2A(XEβE,n))(XE(k)I 1 E(k),E(k)IE(k),Ek XEk) T1 nk X(k) diag( 2A(X(k) E βE,n))(X(k) E(k)I 1 E(k),E(k)IE(k),Ek XEk) T1 + op(n 1/2). Selective Inference with Distributed Data Note that βEk = O(n 1/2). Further, observe that 1 n X diag( 2A(XEβE,n)(XE(k)I 1 E(k),E(k)IE(k),Ek XEk) T1 = op(1), and 1 nk X(k) diag( 2A(X(k) E βE,n)(X(k) E(k)I 1 E(k),E(k)IE(k),Ek XEk) T1 = op(1). Thus, we conclude that R(k) 1 r(k) 1 = op(n 1/2). Lemma 9 (Rate of R(k) 2 r(k) 2 ) Let n X ( A(X bβΛ,(k)) A(XE(k)β E(k),n)), nk X(k) ( A(X(k) bβΛ,(k)) A(X(k) E(k)β E(k),n)). Then n(R(k) 2 r(k) 2 ) = op(1). Proof [Proof of Lemma 9.] Based on the assertion in Lemma 12, we have bβΛ,(k) E(k) β E(k),n = Op(n 1/2). Taking a Taylor expansion of A(X bβΛ,(k)) at Xβ E(k),n for each coordinate, we obtain R(k) 2 r(k) 2 = 1 n X h diag( 2A(XE(k)β E(k),n))XE(k)(bβΛ,(k) E(k) β E(k),n) + o( bβΛ,(k) E(k) β E(k),n )) i 1 nk X(k) h diag( 2A(X(k) E(k)β E(k),n))X(k) E(k)(bβΛ,(k) E(k) β E(k),n) + o( bβΛ,(k) E(k) β E(k),n )) i . n X diag( 2A(XE(k)β E(k),n))XE(k) , we have 1 n X diag( 2A(XE(k)β E(k),n))XE(k) = T + op(1), and 1 nk X(k) diag( 2A(X(k) E(k)β E(k),n))X(k) E(k) = T + op(1). R(k) 2 r(k) 2 = 1 n X diag( 2A(XE(k)β E(k),n))XE(k) T (bβΛ,(k) E(k) β E(k),n) nk X(k) diag( 2A(X(k) E(k)β E(k),n))X(k) E(k) T (bβΛ,(k) E(k) β E(k),n) + op(n 1/2) = op(n 1/2). Liu and Panigrahi A.2 Proof of Theorem 3 Proof It follows from Assumption 2 that n(bβE βE,n) d N(0, I 1 E,E), and from Theorem 2 we have nΩd N(0, ΣΩ). Note that nbβ E = 1 n X E( A(XEβE,n) Y ) + 1 n X E( A(XE bβE) A(XEβE,n)) = 1 n X E( A(XEβE,n) Y ) + 1 n X EWX E(bβE βE,n) + op(1) = 1 n X E( A(XEβE,n) Y ) I E,EI 1 E,E 1 n X E( A(XEβE,n) Y ) + op(1). In particular, we have 1 n X E( A(XEβE,n) Y ) 1 n X E( A(XEβE,n) Y ) = IE,E IE, E I E,E I E, E Thus, we observe that the asymptotic variance of nbβ E is equal to I E, E I E,EI 1 E,EIE, E, and conclude that nbβ E d N(0, I/IE,E). Further, it is easy see from the asymptotic representations of n(bβE βE,n) and nbβ E that they are mutually independent. Now, observe that n(bβE βE,n) and nbβ E are asymptotically equivalent to sums of i.i.d. random variables zi, and that nω(k) assumes the form in (18). Thus, we can write where ei = xi( A(x i,EβE,n) yi). The covariance term on the right-hand-side is 0, which completes the proof. A.3 Proof of Theorem 4 Proof We start by writing nω(k) = n ω(k) + op(1), established in Lemma 13, where n ω(k) = T(k)( n b B(k), b Z(k); nbβE, nbβ E). Selective Inference with Distributed Data Let nΩ= T( nb B, b Z; nbβE, nbβ E) denote the stack of n ω(k) for k [K]. We begin with pn, the Lebesgue density of n(bβE βE,n), nbβ E, nΩ . We then apply the change of variables nΩ ( nb B, b Z) through the mapping nΩ= T( nb B, b Z; nbβE, nbβ E). Because the mapping is linear, the density of n(bβE βE,n), nbβ E, nb B, b Z is proportional to pn( n(bβE βE,n), nbβ E, T( nb B, b Z; nbβE, nbβ E)). Now, the condition in Assumption 3 allows us to replace pn by the limiting Gaussian density in Theorem 3 which gives us the corresponding asymptotic density function ϕ( nbβE; nβE,n, I 1 E,E) ϕ( nbβ E;0, (I/IE,E) 1) ϕ(T( nb B, b Z; nbβE, nbβ E); 0, ΣΩ). 
(20) Furthermore, if we condition on b Z = z and ignore constants, the asymptotic likelihood can be simplified to ϕ( nbβE; Π nβE,n + κ; Θ) ϕ( nb B; Ψ nbβE + τ; Ξ), where Π, κ, Θ, Ψ, τ, Ξ are defined in Theorem 4. See details of the simplification in Lemma 10 below. The selection event, when conditioned on b Z = z, is equivalent to b B O. So the conditional density of bβE, b B is given by ϕ( nbβE; Π nβE,n + κ; Θ) ϕ( nb B; Ψ nbβE,n + τ; Ξ) 1 n b B O o R ϕ( nbβE; Π nβE,n + κ; Θ) ϕ( nb B; Ψ nbβE + τ; Ξ) 1 n b B O o dbβEdb B . Ignoring the factors that do not depend on βE,n gives the selective likelihood function. Lemma 10 (Matrix simplification) The joint density of ( nbβE, nb B, bβ E) when conditioned on b Z = z is equal to ϕ( nbβE; Π nβE,n + κ; Θ) ϕ( nb B; Ψ nbβE + τ; Ξ) ϕ( nbβ E; 0, (I/IE,E) 1). Proof [Proof of Lemma 10.] Denote I ,E ... I ,E I ,E(1) 0 . . . 0 0 I ,E(2) . . . 0 ... 0 . . . 0 I ,E(K) Liu and Panigrahi Let r(k) = Λ s(k) , and let r be the stack of r(1), . . . , r(K). Observe, the mapping T, given in Lemma 13, can be written as nΩ= Q1 nbβE + Q2 nb B + r. (22) It follows from Equation (20) that the joint density of ( nbβE, nb B) after conditioning on b Z = z is proportional to 2( nbβE nβE,n) IE,E( nbβE nβE,n) 2( Q1 nbβE + Q2 nb B + r) Σ 1 Ω( Q1 nbβE + Q2 nb B + r) 2( nb B) Q 2Σ 1 ΩQ2( nb B) + ( nb B) Q 2Σ 1 Ω(Q1 nbβE r) 2( nbβE) (IE,E + Q 1Σ 1 ΩQ1)( nbβE) + ( nbβE) (IE,E nβE,n + Q 1Σ 1 Ωr) . For Ξ 1 = Q 2Σ 1 ΩQ2, Ψ = ΞQ 2Σ 1 ΩQ1, τ = ΞQ 2Σ 1 Ωr, we observe that this density is proportional to ϕ( nb B; Ψ nbβE + τ, Ξ) exp 1 2( nbβE) (IE,E + Q 1Σ 1 ΩQ1 Ψ Ξ 1Ψ)( nbβE) +( nbβE) (Ψ Ξ 1τ + IE,E nβE,n + Q 1Σ 1 Ωr) i . The density in the last display is proportional to ϕ( nbβE; Π nβE,n + κ; Θ) ϕ( nb B; Ψ nbβE + τ; Ξ) for Θ 1 = IE,E + Q 1Σ 1 ΩQ1 Ψ Ξ 1Ψ, Π = ΘIE,E, κ = Θ(Ψ Ξ 1τ + Q 1Σ 1 Ωr). To further simplifying the matrices in the likelihood, we note that Σ 1 Ω = U 1 I 1. Therefore, we can write Q 2Σ 1 ΩQ2 in the form of K K blocks, where the (j, k)-block is ρjρk ρ0 + ρjδj,k IE(j),E(k). Similarly, Q 2Σ 1 Ωr has K blocks where the k-th block is ρk JE(k)r(k) + ρk j=1 ρj JE(k)r(j), where JE is the matrix that selects the coordinates in E, i.e. JEx = x E. Since JE(k) 0E bβ E 0 and JE(k)γ(j) = g(j) k , the above display is equal to ρkg(k) k +(ρk/ρ0) P j [K] ρjg(j) k . Similarly, Q 1Σ 1 Ωr = P k [K] ρk ρ0 JEγ(k) = P k [K] ρk ρ0 g(k) and Q 1Σ 1 ΩQ1 = k [K] ρk + X IE,E = 1 ρ0 Selective Inference with Distributed Data Thus Θ 1 = 1 ρ0 IE,E Ψ Ξ 1Ψ. Remark 11 We note that when using the union aggregation rule, the quantities Π, κ, Θ, Ψ, τ, Ξ do not depend on bβ E. If E(k) is not necessarily a subset of E, then JE(k) 0E bβ E 0E(k) E bβ E(k)\E, JE(k)r(j) = JE(k)γ(j) + 0E(k) E bβ E(k)\E But, we only need to re-define g(j) k as g(j) k = JE(k)γ(j) + 0E(k) E bβ E(k)\E and this change only affects τ. For the complete procedure when using general aggregation rules, see Appendix D. A.3.1 Supporting Results Lemma 12 Let β E(k),n = I 1 E(k),E(k)IE(k),EβE,n. Under Assumption 1, XE(k)β E(k),n XEβE,n = Op(n 1/2), and bβ(k),Λ E(k) β E(k),n = Op(n 1/2). Proof Note that XE(k)β E(k) XEβE = (XE(k)I 1 E(k),E(k)IE(k),E XE)βE = XE(k)I 1 E(k),E(k)IE(k),E\E(k) XE\E(k) βE\E(k)+ XE(k)I 1 E(k),E(k)IE(k),E E(k) XE E(k) βE E(k) = XE(k)I 1 E(k),E(k)IE(k),E\E(k) XE\E(k) βE\E(k). By Assumption 1, let Ek be the set such that if j (E \ E(k)) \ Ek, XE(k)I 1 E(k),E(k)IE(k),j Xj = 0, Liu and Panigrahi and βEk = O(n 1/2). Then we have XE(k)β E(k),n XEβE = (XE(k)I 1 E(k),E(k)IE(k),Ek XEk)βEk = Op(n 1/2). This proves the first claim. 
By the KKT condition of lasso problem (1), we have 1 nk X(k) E(k) ( A(X(k) E(k) bβΛ,(k) E(k) ) Y (k)) = 1 nΛE(k)sk. Denote the left-hand-side as ℓ(k)(bβΛ,(k) E(k) ). Taking a Taylor expansion at β E(k),n, we get 1 nΛE(k)sk = ℓ(k)(bβΛ,(k) E(k) ) = ℓ(k)(β E(k),n) + 2ℓ(k)(β E(k),n)(bβΛ,(k) E(k) β E(k),n) + o( bβΛ,(k) E(k) β E(k) 2). (23) ℓ(k)(β E(k)) = 1 nk X(k) E(k) ( A(X(k) E(k)β E(k),n) Y (k)) nk X(k) E(k) ( A(X(k) E(k)β E(k),n) A(X(k) E βE,n))+ 1 nk X(k) E(k) ( A(X(k) E βE,n) Y (k)). The first term is of order Op(n 1/2) due to the first claim. The second term is of order Op(n 1/2) because it is an average of nk i.i.d. random variables of mean zero and finite variance. This proves ℓ(k)(β E(k)) = Op(n 1/2). 2ℓ(k)(β E(k)) = 1 nk X(k) E(k) diag( 2A(XE(k)β E(k)))X(k) E(k) nk X(k) E(k) diag( 2A(XEβE))X(k) E(k) + Op(n 1/2) = IE(k),E(k) + Op(n 1/2). Substituting into Equation (23) gives n(bβΛ,(k) E(k) β E(k),n) = Op(1). Selective Inference with Distributed Data Lemma 13 The randomization variables have the expression nω(k) = n ω(k) + op(1), where n ω(k) = T(k)( n B(k), Z(k); nbβE, nbβ E) = I ,E(k) n B(k) I ,E nbβE + Λ S(k) Proof The K.K.T. conditions of stationarity for the lasso on machine k are summarized by nω(k) = 1 n X ( A(X bβ(k),Λ) Y ) + γ(k) = 1 n X ( A(XE bβE) Y + A(XE(k)B(k)) A(XE bβE)) + γ(k) + 1 n X diag( 2A(XEβE,n))XE(k)B(k) 1 n X diag( 2A(XEβE,n))XE bβE + γ(k) + op(1) = I ,E(k) n B(k) I ,E nbβE + Λ S(k) = n w(k) + op(1). Appendix B. Proofs for Section 4 B.1 Proof of Theorem 5 Before proving Theorem 5, we provide a supporting result in Lemma 14. First, let s(k) = γ(k) E(k) 0 Rp be the active components of the subgradient vector γ(k) padded with a vector of all zeros. Now, let s Rp K be formed by stacking the vectors s(k) for k [K]. Recall that Ωis formed by stacking the vectors ω(k), k [K] that we previously defined in Lemma 13. Suppose that n Ω= nΩ s. Lemma 14 Define nbβE nbβ E n Ω Let S be a convex subset of Rp(K+1). Under the conditions in Assumption 4 and Assumption 5, we have a2n log P 1 an Vn S = inf b,b ,ω S R(b, b , ω), Liu and Panigrahi R(b, b , ω) = ( 1 2(b β E) IE,E(b β E) + 1 2(b ) (I/IE,E) 1b + 1 Proof Because the difference between n Ωand n Ωis O(1), based on the condition in Assumption 5, we have lim n 1 a2n an Vn S log P nbβE nbβ E nΩ From the proof of Proposition 2, we have the representation = n En + Rn, where the limiting density of n En, at (b, b , ω), is proportional to ϕ(b; β E, I 1 E,E) ϕ(b ; 0, I/IE,E) ϕ(ω; 0, ΣΩ). So we have the following large-deviation limit nbβE nbβ E nΩ = inf b,b ,ω S R(b, b , ω) under Assumption 4, where R is the rate function (25). The assertion follows directly by using (26). Now we are ready to prove Theorem 5. Proof First let us define some notations which we need for the proof. Recall that r(k) = , and that r is the stack of r(k) for k [K], and that s Rp K is the stack of s(k) = γ(k) E(k) 0 . In matrix form, we can write r = Q3b Z + Q4 nbβ E + s for fixed matrices Q3 and Q4. Let 1 an z = ζ. Let Q1, Q2 be defined according to (21). Then, note that we have = Q1 nbβE + Q2 nb B + r s = Q1 nbβE + Q2 nb B + Q3b Z + Q4 nbβ E. Selective Inference with Distributed Data If we define: h(B, Z; b, b ) = Q2B + Q3Z Q1b + Q4b . then n Ω= h( nb B, b Z; nbβE, nbβ E). Denote by Un the vector nbβE nbβ E nb B b Z and consider Vn as defined in Lemma 14. Observe that nbβE nbβ E h nb B, b Z; nbβE, nbβ E is a bijective linear function of Un. 
Let us define H : Rp(K+1) Rp(K+1) as the function such that 1 an Vn = H( 1 Applying the contraction principle for large-deviation limit together with Lemma 14, we conclude that the vector 1 an Un = H 1( 1 an Vn) satisfies a large deviation principle with the rate function R H, where R is the rate function of 1 an Vn given in Equation (25). Thus, it follows that a2n log P n an b Z = ζ = inf b,b ,B O R H(b, b , B, ζ) = inf b,b ,B O R(b, b , h(B, ζ; b, b )) = inf b,b ,B O 1 2(b β E) IE,E(b β E) + 1 2b (I/IE,E) 1b 2h(B, ζ; b, b ) Σ 1 Ωh(B, ζ; b, b ). By a similar matrix simplification as lemma 10, this is equal to (up to constant independent of β E) inf b,B O 1 2(b Πβ E κ) Θ 1(b Πβ E κ) + 1 2(B Ψb τ) Ξ 1(B Ψb τ), where τ = ΞQ 2Σ 1 Ω(Q3ζ + Q4b ), κ = Θ(Ψ Ξ 1 τ + Q 1Σ 1 Ω(Q3ζ + Q4b )). The matrices Π, Θ, Ψ, Ξ are the same as in Lemma 10. It remains to show that inf b,B O 1 2(b Πβ E κ) Θ 1(b Πβ E κ) + 1 2(B Ψb τ) Ξ 1(B Ψb τ) = lim n inf b,B O 1 2(b Πβ E 1 an κ) Θ 1(b Πβ E κ) + 1 2(B Ψb τ) Ξ 1(B Ψb 1 Liu and Panigrahi Recall from that Lemma 10 that Q 2Σ 1 Ω has (j, k) block equal to (ρjρk ρ0 + ρkδj,k)JEj, and Q 1Σ 1 Ω has (1, k) block equal to ρk ρ0 JE. The matrix Q4 has block k equal to J E. Thus, Q 2Σ 1 ΩQ4 = 0, Q 1Σ 1 ΩQ4 = 0. So τ = ΞQ 2Σ 1 ΩQ3ζ. Recall that z = anζ, so τ = ΞQ 2Σ 1 Ω(Q3anz + s) = an τ ΞQ 2Σ 1 Ωs. Because an , τ = 1 an τ + o(1). Similarly, κ = 1 an κ + o(1). This proves the last claim. To conclude the proof, we observe that lim n inf b,B an κ Θ 1 b Πβ E 1 an τ Ξ 1 B Ψb 1 a2n Barr O(an B) = lim n inf b,B O 1 2(b Πβ E 1 an κ) Θ 1(b Πβ E 1 an τ) Ξ 1(B Ψb 1 This is because the sequence of convex objectives in the left-hand side display converge (in a pointwise sense) to the convex objective on the right-hand side display which has a unique minimum. B.2 Proof of Theorem 6 Proof Observe, the approximate selective likelihood is equal to ( nbβE) Θ 1( nΠβE,n + κ) Q n Θ 1( nΠβE,n + κ) , where Q n(α) = sup v ( nv) α Qn( nv) (28) Qn( nv) = 1 2( nv) Θ 1 nv + inf V 2 n V nΨv τ Ξ 1 n V nΨv τ + Barr O( n V ) o . The score, based on the approximate selective likelihood, is equal to nΠ Θ 1 nbβE Q n Θ 1( nΠβE,n + κ) . Thus, the selective MLE is given by Θ 1( nΠbβ(S) E,n + κ) = ( Q n) 1( nbβE) = Qn( nbβE) = Θ 1 nbβE Ψ Ξ 1 nb V bβE nΨbβE τ . Selective Inference with Distributed Data nbβ(S) E = nΠ 1 bβE Π 1κ + Π 1ΘΨ Ξ 1 Ψ nbβE + τ nb V bβE = nΠ 1 bβE Π 1κ + I 1 E,EΨ Θ 1 Ψ nbβE + τ nb V bβE Let v be the solution of (28) when α = nΠbβ(S) E +κ. The selective obs-FI matrix, derived from the curvature of the approximate selective likelihood, is given by b I(S) E,E = Π Θ 1 2Q n nΠbβ(S) E + κ Θ 1Π = Π Θ 1 2Qn nv 1 Θ 1Π = Π Θ 1 Θ 1 + Ψ Ξ 1Ψ Ψ Ξ 1 Ξ 1 + 2Barr nb V bβE 1 Ξ 1Ψ 1 Θ 1Π Θ 1 + Ψ Ξ 1Ψ Ψ Ξ 1 Ξ 1 + 2Barr nb V bβE 1 Ξ 1Ψ 1 IE,E. Appendix C. Sampling subsets with replacement Proof [Proof of Lemma 7] It suffices to prove that for j = k Cov nω(j), nω(k) 0. Following the proof A.1, we only need to show i=1 ei 1 nρ i C(j) ei, 1 n i=1 ei 1 nρ Let C(j) denote the index set of D(j). Since E h 1 n Pn i=1 ei 1 nρ P i C(j) ei | C(j)i = 0, it remains to show that i=1 ei 1 nρ i C(j) ei, 1 n i=1 ei 1 nρ i C(k) ei | C(j), C(k) i=1 ei 1 nρ i C(j) ei, 1 n i=1 ei 1 nρ i C(k) ei | C(j), C(k) = Cov [ei] Cov [ei] Cov [ei] + |C(j) C(k)| 1 nρ2 Cov [ei] = (|C(j) C(k)| 1 nρ2 1) Cov [ei] . The proof is completed by the fact that E |C(j) C(k)| = nρ2. Liu and Panigrahi In this setting, the matrices Ξ, Θ, Ψ, τ, ν, Π, κ are similarly found by Theorem 4 and its proof with ΣΩ= 1 ρ ρ IK I. So now Σ 1 Ω= ρ 1 ρIK I 1. 
Then $\{\Xi^{-1}\}_{j,k} = \{Q_2^\top \Sigma_\Omega^{-1} Q_2\}_{j,k} = \frac{\rho}{1-\rho}\,\mathcal{I}_{E^{(j)},E^{(j)}}$ if $j = k$, and $0$ otherwise. The other matrices are computed similarly.

Appendix D. Selective inference with general aggregation rules

D.1 Algorithm

In the main manuscript, we focused on the union aggregation rule, i.e., the final model $E$ is the union of the variables selected in the base models $E^{(k)}$, $1 \le k \le K$. We show that, with a slight modification, our procedure can be adapted to accommodate other aggregation rules. The new procedure is summarized in Algorithm 4. We note that the procedure remains almost the same, except that
\[
g^{(j)}_k = J_{E^{(k)}}\gamma^{(j)} + \begin{pmatrix} 0_{E^{(k)} \cap E} \\ \widehat\beta^{\perp}_{E^{(k)} \setminus E} \end{pmatrix},
\tag{29}
\]
where $\widehat\beta^{\perp} = \frac{1}{n} X^\top (Y - X_E\widehat\beta_E)$; the second term in (29) is the vector indexed by $E^{(k)}$ that is zero on $E^{(k)} \cap E$ and equals $\widehat\beta^{\perp}_j$ for $j \in E^{(k)} \setminus E$. If a variable is selected by machine $k$ but is not selected in the final model $E$, then, for a more general aggregation rule, we must compensate the subgradients by the correlation between that variable and the residual vector.

To compute the vectors $g^{(j)}_k$, the central machine requires $\widehat\beta^{\perp}_{E_u \setminus E}$, where $E_u = \cup_{k \in [K]} E^{(k)}$ is the union of the base models. Thus, each local machine must send $\widehat\beta^{\perp,(k)}_{E_u \setminus E} = X^{(k)\top}_{E_u \setminus E}(Y^{(k)} - X^{(k)}_E\widehat\beta_E)$ to the central machine. Because this quantity depends on the MLE $\widehat\beta_E$, which is computed on the central machine, the central machine must first send $\widehat\beta_E$ to the local machines. Our modified procedure in Algorithm 4, therefore, involves two more exchanges between the central machine and the local machines: (1) the central machine sends $\widehat\beta_E$ to the local machines; (2) the local machines send $\widehat\beta^{\perp,(k)}_{E_u \setminus E}$ back to the central machine. In comparison with Algorithm 1, the additional communication cost is $|E_u \setminus E|$ per local machine. Note that this cost is comparable to the overall cost of order $O(|E|^2)$, as long as $|E_u|$ is of about the same order as $|E|$. Of course, the modified $g^{(j)}_k$ in (29) change some of the matrices in the optimization that we solve for approximately-valid selective inference. Theoretically, our selective likelihood is now obtained by conditioning further on $\widehat\beta^{\perp}_{E_u \setminus E}$, besides the information from the subgradient vectors.

Algorithm 4: General aggregation rules.

STEP 1: Variable selection at local machines. Machine $k$ solves (1) and sends $E^{(k)} = \mathrm{Support}(\widehat\beta^{\Lambda,(k)})$ to the central machine.

STEP 2: Modeling with selected predictors. The central machine aggregates the sets $E^{(k)}$ to form the final model $E$.

STEP 3: Communication with the central machine.
Exchange 1: The central machine sends the set $E$, as well as $E_u = \cup_{k \in [K]} E^{(k)}$, to the local machines.
Exchange 2: The local machines send back the local estimators $\widehat\beta^{(k)}_E$ and $\widehat{\mathcal{I}}^{(k)}_{E,E}$, and the subgradient at $\widehat\beta^{\Lambda,(k)}$: $\gamma^{(k)}_{E_u}$.
Exchange 3: The central machine computes the MLE $\widehat\beta_E$ and sends it to the local machines.
Exchange 4: The local machines compute $\widehat\beta^{\perp,(k)}_{E_u \setminus E} = X^{(k)\top}_{E_u \setminus E}(Y^{(k)} - X^{(k)}_E\widehat\beta_E)$ and send it to the central machine.

STEP 4: Selective inference at the central machine.
(A) Compute $\widehat\beta^{\perp}_{E_u \setminus E} = \frac{1}{n} X^{(0)\top}_{E_u \setminus E}(Y^{(0)} - X^{(0)}_E\widehat\beta_E) + \frac{1}{n}\sum_{k=1}^{K} \widehat\beta^{\perp,(k)}_{E_u \setminus E}$.
(B) Compute $\widehat\Xi, \widehat\Psi, \widehat\tau, \widehat\Theta, \widehat\Pi, \widehat\kappa$ as defined in Theorem 4, with the $g^{(j)}_k$ defined according to Equation (29).
(C) Apply Algorithm 2 to perform inference based on the selective MLE.

D.2 Experiments

We illustrate the performance of Algorithm 4 on simulated data. In the following experiment, we consider the same setting as in Section 6, with prespecified groups of correlated predictors. More specifically, we consider 20 groups of predictors, each of size 5; distinct groups of predictors are uncorrelated, while all pairs of predictors within the same group have correlation equal to 0.9.
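To make the design of this experiment concrete, the snippet below is a minimal sketch of how such a block-correlated design matrix can be simulated. The group structure (20 uncorrelated groups of 5 predictors, within-group correlation 0.9) is taken from the description above; the sample size, random seed, and variable names are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch (not the authors' code): simulate a Gaussian design with
# 20 uncorrelated groups of 5 predictors each and within-group correlation 0.9.
# The sample size n and the seed are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_groups, group_size, rho_within = 20, 5, 0.9
p = n_groups * group_size              # 100 predictors in total
n = 2000                               # hypothetical total sample size

# Block-diagonal covariance: 5 x 5 blocks with unit variance and
# off-diagonal entries 0.9; zero correlation across groups.
block = np.full((group_size, group_size), rho_within)
np.fill_diagonal(block, 1.0)
Sigma = np.kron(np.eye(n_groups), block)

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
groups = np.repeat(np.arange(n_groups), group_size)   # group label of each column
```

The group labels produced here are what the grouped aggregation rule described next operates on, mapping each locally selected predictor back to its group.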
As before, we assume there are 5 nonzero coefficients $\beta_j$, and these nonzero coefficients lie in 5 different groups. Suppose that $G = \cup_{k \in [K]} \{G_j : j \in E^{(k)}\}$, i.e., $G$ contains the groups with at least one predictor selected by at least one of the $K$ local machines. Our final aggregated model is formed by randomly picking one predictor from each of the selected groups in $G$, each of which consists of highly correlated predictors. The results of our experiment are shown in Figure 7. We see that our proposed method achieves the desired coverage probability. Similar patterns hold for the lengths and power of our confidence intervals, as was already noted for the previous aggregation rule.

Figure 7: Results for the grouped aggregation rule, comparing Dist-SI with splitting. (a) Varying $K$; each local machine has $\lfloor 8000/K \rfloor$ data points. (b) Varying signal strength; the nonzero $\beta_j$ equal $2c\log p$ with random signs for $c = 0.3, 0.5, 0.7, 0.9$ in the four signal regimes. (c) Varying $n_0$, the sample size in the central machine.

References

François Bachoc, Hannes Leeb, and Benedikt Pötscher. Valid confidence intervals for post-model-selection predictors. The Annals of Statistics, 47(3):1475–1504, 2019.

Maria Florina Balcan, Avrim Blum, Shai Fine, and Yishay Mansour. Distributed learning, communication complexity and privacy. In Conference on Learning Theory, JMLR Workshop and Conference Proceedings, 2012.

Heather Battey, Jianqing Fan, Han Liu, Junwei Lu, and Ziwei Zhu. Distributed testing and estimation under sparse high dimensional models. The Annals of Statistics, 46(3):1352, 2018.

Alexandre Belloni, Victor Chernozhukov, and Kengo Kato. Uniform post-selection inference for least absolute deviation regression and other Z-estimation problems. Biometrika, 102(1):77–94, 2015.

Yoav Benjamini and Daniel Yekutieli. False discovery rate adjusted multiple confidence intervals for selected parameters. Journal of the American Statistical Association, 100(469):71–81, 2005.

Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang, and Linda Zhao. Valid post-selection inference. The Annals of Statistics, 41(2):802–837, 2013.

Ali Charkhi and Gerda Claeskens. Asymptotic post-selection inference for the Akaike information criterion. Biometrika, 105(3):645–664, 2018.

Xueying Chen and Min-ge Xie. A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica, pages 1655–1684, 2014.

Edgar Dobriban and Yue Sheng. Distributed linear regression by averaging. The Annals of Statistics, 49(2):918–943, 2021.

William Fithian, Dennis Sun, and Jonathan Taylor. Optimal inference after model selection. arXiv preprint arXiv:1410.2597, 2014.

Zaijing Huang and Andrew Gelman. Sampling for Bayesian computation with large datasets. Available at SSRN 1010107, 2005.

Sangwon Hyun, Max G'Sell, and Ryan Tibshirani. Exact post-selection inference for the generalized lasso path. Electronic Journal of Statistics, 12(1):1053–1097, 2018.

Michael I Jordan, Jason D Lee, and Yun Yang. Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 2019.

Vo Nguyen Le Duy and Ichiro Takeuchi. More powerful conditional selective inference for generalized lasso by parametric programming. Journal of Machine Learning Research, 23(300):1–37, 2022.

Jason Lee, Yuekai Sun, Qiang Liu, and Jonathan Taylor.
Communication-efficient sparse regression: a one-shot approach. arXiv preprint arXiv:1503.04337, 2015.

Jason Lee, Dennis L Sun, Yuekai Sun, and Jonathan Taylor. Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3):907–927, 2016.

Nan Lin and Ruibin Xi. Aggregated estimating equation estimation. Statistics and Its Interface, 4(1):73–83, 2011.

Sifan Liu. An exact sampler for inference after polyhedral model selection. arXiv preprint arXiv:2308.10346, 2023.

Sifan Liu, Jelena Markovic, and Jonathan Taylor. Black-box selective inference via bootstrapping. arXiv preprint arXiv:2203.14504, 2022.

Ryan McDonald, Mehryar Mohri, Nathan Silberman, Dan Walker, and Gideon Mann. Efficient large-scale distributed training of conditional maximum entropy models. Advances in Neural Information Processing Systems, 22, 2009.

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.

Nicolai Meinshausen, Lukas Meier, and Peter Bühlmann. P-values for high-dimensional regression. Journal of the American Statistical Association, 104(488):1671–1681, 2009.

Stanislav Minsker, Sanvesh Srivastava, Lizhen Lin, and David Dunson. Robust and scalable Bayes via a median of subset posterior measures. The Journal of Machine Learning Research, 18(1):4488–4527, 2017.

Willie Neiswanger, Chong Wang, and Eric P Xing. Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pages 623–632, 2014.

Snigdha Panigrahi. Carving model-free inference. The Annals of Statistics, 51(6):2318–2341, 2023.

Snigdha Panigrahi and Jonathan Taylor. Scalable methods for Bayesian selective inference. Electronic Journal of Statistics, 12(2):2355–2400, 2018.

Snigdha Panigrahi and Jonathan Taylor. Approximate selective inference via maximum likelihood. Journal of the American Statistical Association, 118(544):2810–2820, 2023.

Snigdha Panigrahi, Jelena Markovic, and Jonathan Taylor. An MCMC-free approach to post-selective inference. arXiv preprint arXiv:1703.06154, 2017.

Snigdha Panigrahi, Jonathan Taylor, and Asaf Weinstein. Integrative methods for post-selection inference under convex constraints. The Annals of Statistics, 49(5):2803–2824, 2021.

Snigdha Panigrahi, Jingshen Wang, and Xuming He. Treatment effect estimation with efficient data aggregation. arXiv preprint arXiv:2203.12726, 2022.

Jesse Raffa, Alistair Johnson, Zach O'Brien, Tom Pollard, Roger Mark, Leo Celi, David Pilcher, and Omar Badawi. The global open source severity of illness score (GOSSIS). Critical Care Medicine, 2022.

D. García Rasines and G. A. Young. Splitting strategies for post-selection inference. Biometrika, 110(3):597–614, 2023.

Jonathan Rosenblatt and Boaz Nadler. On the optimality of averaging in distributed statistical learning. Information and Inference: A Journal of the IMA, 5(4):379–404, 2016.

Christoph Schultheiss, Claude Renaux, and Peter Bühlmann. Multicarving for high-dimensional post-selection inference. Electronic Journal of Statistics, 15(1):1695–1742, 2021.

Steven Scott, Alexander Blocker, Fernando Bonassi, Hugh Chipman, Edward George, and Robert McCulloch. Bayes and big data: The consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management, 11(2):78–88, 2016.
Sanvesh Srivastava, Cheng Li, and David Dunson. Scalable Bayes via barycenter in Wasserstein space. The Journal of Machine Learning Research, 19(1):312–346, 2018.

Jonathan Taylor and Robert Tibshirani. Post-selection inference for penalized likelihood models. Canadian Journal of Statistics, 46(1):41–61, 2018.

Xiaoying Tian and Jonathan Taylor. Selective inference with a randomized response. The Annals of Statistics, 46(2):679–710, 2018.

Xiaoying Tian, Snigdha Panigrahi, Jelena Markovic, Nan Bi, and Jonathan Taylor. Selective sampling after solving a convex problem. arXiv preprint arXiv:1609.05609, 2016.

Xiangyu Wang and David Dunson. Parallelizing MCMC via Weierstrass sampler. arXiv preprint arXiv:1312.4605, 2013.

Larry Wasserman and Kathryn Roeder. High dimensional variable selection. The Annals of Statistics, 37(5A):2178, 2009.

Yuchen Zhang, Martin Wainwright, and John Duchi. Communication-efficient algorithms for statistical optimization. Advances in Neural Information Processing Systems, 25, 2012.

Martin Zinkevich, Markus Weimer, Lihong Li, and Alex Smola. Parallelized stochastic gradient descent. Advances in Neural Information Processing Systems, 23, 2010.

Tijana Zrnic and Michael I Jordan. Post-selection inference via algorithmic stability. The Annals of Statistics, 51(4):1666–1691, 2023.