# Selective Inference with Distributed Data

Journal of Machine Learning Research 26 (2025) 1-44. Submitted 3/23; Revised 12/24; Published 1/25.

Sifan Liu (sfliu@stanford.edu), Department of Statistics, Stanford University, Stanford, CA 94305-4020, USA

Snigdha Panigrahi (psnigdha@umich.edu), Department of Statistics, University of Michigan, Ann Arbor, MI 48109-1107, USA

Editor: Po-Ling Loh

© 2025 Sifan Liu and Snigdha Panigrahi. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/23-0309.html.

When data are distributed across multiple sites or machines rather than centralized in one location, researchers face the challenge of extracting meaningful information without directly sharing individual data points. While there are many distributed methods for point estimation using sparse regression, few options are available for estimating uncertainties or conducting hypothesis tests based on the estimated sparsity. In this paper, we introduce a procedure for performing selective inference with distributed data. We consider a scenario where each local machine solves a lasso problem and communicates the selected predictors to a central machine. The central machine then aggregates these selected predictors to form a generalized linear model (GLM). Our goal is to provide valid inference for the selected GLM while reusing data that have been used in the model selection process. Our proposed procedure requires only low-dimensional summary statistics from local machines, thus keeping communication costs low and preserving the privacy of individual data sets. Furthermore, this procedure can be applied in scenarios where model selection is repeatedly conducted on randomly subsampled data sets, addressing the p-value lottery problem linked with model selection. We demonstrate the effectiveness of our approach through simulations and an analysis of a medical data set on ICU admissions.

Keywords: carving, data aggregation, generalized linear models, lasso, post-selection inference, selective inference, selective likelihood

1. Introduction

In recent years, analyzing decentralized data spread across various sites or machines has become increasingly important, particularly for collaborative research. This has created a growing demand for distributed methods that enable researchers to extract meaningful information from such data while respecting its decentralized nature, and to combine this information using summary statistics that preserve privacy within the individual data sets. One of the simplest and most popular approaches in the distributed framework is divide-and-conquer, also known as the split-and-merge or the one-shot approach; see, for example, the early work by Mcdonald et al. (2009); Zinkevich et al. (2010); Zhang et al. (2012). Most of these approaches use only one round of communication: each local machine estimates the unknown parameter using its subset of the training data and communicates the estimator to a central machine, which merges the local estimators to obtain a global estimator. In our paper, we focus on a sparse regression setting where only a few of the measured predictors affect the response.
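To make the one-shot recipe above concrete, here is a minimal sketch, assuming a plain least-squares loss at each machine; the helper names `local_estimate` and `one_shot_average` are illustrative and are not taken from the paper or its released code.

```python
import numpy as np

def local_estimate(X_k, y_k):
    # One machine's estimator on its own shard; plain least squares for illustration.
    return np.linalg.lstsq(X_k, y_k, rcond=None)[0]

def one_shot_average(shards):
    # Divide-and-conquer: merge the local estimators with a single round of communication.
    return np.mean([local_estimate(X_k, y_k) for X_k, y_k in shards], axis=0)

# Toy usage: three machines, each holding 200 samples from a 5-dimensional linear model.
rng = np.random.default_rng(0)
beta = np.array([1.0, 0.5, 0.0, 0.0, -0.5])
shards = []
for _ in range(3):
    X_k = rng.normal(size=(200, 5))
    shards.append((X_k, X_k @ beta + rng.normal(size=200)))
print(one_shot_average(shards))  # close to beta, without pooling any raw data
```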
The goal in this setup is usually two-fold: (i) select relevant predictors, and model the response by using the estimated sparsity, (ii) provide uncertainties or conduct hypothesis tests for regression parameters in the estimated model, and all this while, respect the distributed nature of data. Substantial progress has been made on the first goal. For example, Lee et al. (2015) average locally computed, debiased lasso estimators, and show that the averaged estimator achieves the same estimation rate as the full-sample lasso, as long as the number of machines is not too large. Chen and Xie (2014) prove that the models aggregated via majority voting, based on variables selected by local machines, are consistent under some conditions. More recently, Battey et al. (2018) provide a debiased approach for hypothesis testing in the distributed setting. However, these methods are limited to models that are fixed before the selection of relevant predictors. As a result, the communication cost in prior work scales with the number of original predictors, which can be unnecessarily large in sparse settings. We introduce a new procedure for conducting selective inference with distributed data. Selective inference tools ensure valid inference by accounting for the fact that the same data, used to select models, is reused when providing confidence intervals and p-values. Several ingenious methods have been developed to provide selective inference in sparse regression problems; see papers by Benjamini and Yekutieli (2005); Berk et al. (2013); Belloni et al. (2015); Lee et al. (2016); Tian and Taylor (2018); Charkhi and Claeskens (2018); Bachoc et al. (2019); Panigrahi et al. (2021). Our procedure in this paper reuses data from all machines to form an asymptotic selective likelihood . This likelihood delivers approximately-valid selective inference in a selected generalized linear model (GLM), which is based on the predictors selected across different machines. The proposed procedure has several important features: 1. It only requires low-dimensional summary statistics from each machine to conduct selective inference. Thus, the techniques developed are applicable to settings when data sets are distributed across different sites due to security, privacy, or ethical concerns, as seen in the areas of differential privacy (Balcan et al., 2012) and federated learning (Mc Mahan et al., 2017). These sites can merge summary information without sharing individual data, enabling inference in a GLM based on the estimated sparsity. 2. The communication cost of our inferential procedure is only linear in the dimension of the selected model, which can be substantially smaller than the initial dimension of the problem. 3. This procedure can be easily adapted to address the p-value lottery problem in model selection, which arises when the selection process is repeated on randomly subsampled data sets. We show that our procedure serves as an efficient alternative to multi-splitting (Meinshausen et al., 2009) and multi-carving (Schultheiss et al., 2021) without recourse to Markov chain Monte Carlo (MCMC) sampling. Selective Inference with Distributed Data The remaining paper is structured as follows. We provide a slightly more technical account of our contributions after outlining the problem setup, and review related work in Section 2. In Section 3, we introduce our procedure for selective inference with distributed data. 
In Section 4, we provide an asymptotic justification for our selective likelihood, which forms the basis of our approach. We explore how our procedure can be modified to address the p-value lottery problem in Section 5. Section 6 reports findings from the application of our method on simulated data sets and a publicly available, medical data set on intensive care unit (ICU) admissions. We conclude the paper with a discussion in Section 7. Proofs of our technical results are collected in the Appendix. 2. Problem setup and background In this section, we describe the distributed setup and introduce some background on selective inference with a single machine. Other related work is summarized at the end. 2.1 Problem setup We consider a distributed setup with K local machines, all connected to a central machine, referred to as machine 0. Suppose that we observe n i.i.d. observations (yi, xi,1, . . . , xi,p) Rp+1, i [n], where [n] = {1, 2, . . . , n} for n N. Let C(k) [n] denote the index set of the samples stored at machine k, for k {0} [K]. The index sets C(k) are disjoint and form a partition of [n]. Let nk = |C(k)| be the cardinality of C(k). Let ρk = nk n be the proportion of samples store at machine k. Then we have k=0 nk, 1 = Let D(k) = Y (k), X(k) denote the data stored at machine k, where X(k) Rnk p represents the feature matrix and Y (k) Rnk represents the response vector. Each local machine conducts variable selection with a loss function based on a GLMdensity. Let this loss function at machine k be denoted by ℓ(k)(β; D(k)) = 1 nρk i C(k) {A(x i β) yix i β} , where A( ) is the log-partition function of the GLM density. That is, machine k solves the following lasso problem: bβΛ,(k) = argmin β Rp ℓ(k)(β; D(k)) + Λ(k)β 1, (1) where Λ(k) = diag λ(k) 1 , . . . , λ(k) p is the diagonal matrix of regularization parameters. Although our procedure works when the regularization parameters differ across machines, for simplicity, we assume Λ(k) = Λ for k [K]. Liu and Panigrahi Let b E(k) = n j [p] : bβΛ,(k) j = 0 o . represent the set of predictors with non-zero lasso coefficients selected at machine k for 1 k K. We will denote the realized value of b E(k) with data D(k) as E(k). Let d(k) = |E(k)| denote the number of selected variables at machine k. Note, for a vector x Rp and E [p], x E represents the subvector of x with entries in the set E, and for a matrix X, XE is composed of the columns of X present in the set E. 2.2 Selected model As described above, the K local machines return as output the selected predictors, which are then communicated to the central machine. The central machine aggregates these selected predictors as b E = Aggregate n b E(k), k [K] o . (2) We let E be the observed value of b E and let d = |E|. Deferring the discussion on general aggregation rules to Appendix D, now consider the union aggregation rule as a concrete example. Given E, our response y is modeled using a selected GLM based on XE. The density of y is given by: f(y | x E, βE) = exp yx EβE A(x EβE) σ2 c(y; σ), (3) where βE, the regression coefficient vector, is the parameter that we want to infer about. We assume that the dispersion parameter, σ, is either known or can be consistently estimated. Some immediate questions arise when we seek inference for βE: Can the central machine reuse data from the local machines to conduct selective inference for βE? 
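As an illustration of the local selection step in (1) and the union aggregation rule in (2), the following sketch uses scikit-learn's `Lasso` with a Gaussian loss and a single scalar penalty as a stand-in for the GLM lasso solved at each machine; the function names are ours and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def select_on_machine(X_k, y_k, lam):
    # Local selection step: solve a lasso on machine k's shard and return its support E^(k).
    fit = Lasso(alpha=lam, fit_intercept=False).fit(X_k, y_k)
    return set(np.flatnonzero(fit.coef_))

def aggregate_union(supports):
    # Union aggregation rule of Equation (2): E is the union of the locally selected sets.
    return sorted(set().union(*supports))

# Example with two machines splitting 400 samples of a 20-dimensional design.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 20))
y = X[:, 0] - X[:, 3] + rng.normal(size=400)
supports = [select_on_machine(X[:200], y[:200], lam=0.1),
            select_on_machine(X[200:], y[200:], lam=0.1)]
E = aggregate_union(supports)  # indices of predictors defining the selected GLM X_E
```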
Of course, naive inference, which uses all the data without adjusting for the bias from the model selection process, falls short of coverage guarantees, as illustrated on one of our simulated instances in Figure 1. Section 6 provides the details of this simulation. A procedure for making selective inference must respect the distributed nature of data, as done at the time of selection. What information does the central machine require from the local machines, and how many communication exchanges does it involve?

2.3 Some background when K = 1

We provide some background in a rather simple setup with K = 1. Consider the special case of a Gaussian linear model, i.e., in (3), we have
$$A(x_E^\top \beta_E) = \tfrac{1}{2}(x_E^\top \beta_E)^2, \qquad c(y; \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{y^2}{2\sigma^2}\Big).$$

Figure 1: Coverage probabilities of Dist-SI (our procedure for selective inference with distributed data), the Splitting, and Naïve methods for a Gaussian linear model, with 2 local machines and a central machine. The 2 local machines and the central machine each have 1000 samples. The pre-specified level of coverage is 0.90, as indicated by the dotted horizontal line at 0.9.

Post selection, we want to infer for $\beta_E$. We begin by noting that the regression problem in (1), at machine 1, can be rewritten as
$$\hat{\beta}^{\lambda,(1)} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \ \frac{1}{n} \sum_{i \in [n]} \frac{1}{2}(y_i - x_i^\top \beta)^2 + \|\Lambda \beta\|_1 - \omega_n^\top \beta, \quad (4)$$
where
$$\omega_n = \frac{1}{n} \sum_{i \in [n]} x_i (x_i^\top \hat{\beta}^{\lambda,(1)} - y_i) - \frac{1}{n_1} \sum_{i \in C^{(1)}} x_i (x_i^\top \hat{\beta}^{\lambda,(1)} - y_i).$$
The term $\omega_n$ is what we call a randomization variable. Regression of the form (4), with an added randomization variable $\omega_n$, was termed the randomized lasso in Tian et al. (2016). Suppose we have n observations in our response vector that are drawn as independent Gaussian variables with the same variance. Further, suppose that we solved the randomized lasso with a Gaussian randomization variable $\sqrt{n}\,\omega_n \sim N(0_p, \Sigma_\Omega)$. A recent procedure by Panigrahi and Taylor (2023) constructs a selective likelihood by conditioning on a subset of the selection event $\{\widehat{E} = E\}$. This approach enables valid selective inference in the selected linear model while allowing for the re-use of data employed during model selection. This procedure centers interval estimates around the maximum likelihood estimator (MLE) of the selective likelihood and estimates its variance using the observed Fisher information matrix. If $\hat{\beta}^{(S)}_E$ and $\hat{I}^{(S)}_{E,E}$ denote the selective MLE and the observed Fisher information matrix, using the selective likelihood, a two-sided $100(1-\alpha)\%$ confidence interval for $\beta_{E,j}$ is constructed as
$$\hat{\beta}^{(S)}_{E,j} \pm z_{1-\alpha/2}\, \frac{\hat{\sigma}^{(S)}_j}{\sqrt{n}},$$
where $\hat{\sigma}^{(S)}_j = \sqrt{\big[(\hat{I}^{(S)}_{E,E})^{-1}\big]_{j,j}}$ is the estimated standard deviation of $\sqrt{n}\,\hat{\beta}^{(S)}_{E,j}$, and $z_{1-\alpha/2}$ is the $(1-\alpha/2)$-th quantile of a standard normal distribution. This procedure closely resembles classical inference via maximum likelihood, with the exception that the standard estimators are replaced by their selective analogs that adjust for bias from the model selection process.

2.4 Contributions and other related work

The contribution of the proposed method is threefold. First, we identify a simple representation of the selection event by developing a randomization framework. A significant challenge in conditional selective inference is the need for an explicit characterization of the selection event, which is particularly difficult in distributed settings. By leveraging the randomization framework, the selection event can be expressed in a simple closed form, making conditional selective inference feasible for GLMs with selected predictors.
Second, the proposed method is efficient in both computation and communication. It requires only lowdimensional summary statistics to be shared from local machines with a central machine. After aggregating these summary statistics, the central machine solves a convex optimization problem. Therefore, the proposed approach is suitable for scenarios where direct data sharing between sites is not possible or inter-site communication is costly. Third, the proposed method can be easily adapted to address the p-value lottery problem. In this sense, our procedure is related to the multi-carving approach in Schultheiss et al. (2021), which uses conditional techniques known as carving (Fithian et al., 2014; Panigrahi, 2023). However, existing carving and multi-carving implementations often depend on computationally expensive MCMC sampling, while our method provides a more efficient alternative. We conclude this section with some more related work. Within the distributed setting, much work has been devoted to the averaged M-estimator (Mcdonald et al., 2009; Zinkevich et al., 2010; Zhang et al., 2012; Rosenblatt and Nadler, 2016). Rosenblatt and Nadler (2016) show that the averaged M-estimator is first-order equivalent to the centralized Mestimator in the fixed-dimension setting. Dobriban and Sheng (2021) study the efficiency of an estimator based on weighted average in the linear regression setting, as dimensions grow with sample sizes. Some methods have taken a likelihood-centric approach, e.g., Jordan et al. (2019) propose a surrogate likelihood where higher-order derivatives in a Taylor-series expansion of the full log-likelihood are replaced by local approximations. Lin and Xi (2011) propose an aggregated estimator for generalized linear models (GLM), where the locally computed MLE and Hessian of the likelihood are merged for efficiency gains. In work by Huang and Gelman (2005); Neiswanger et al. (2014); Wang and Dunson (2013); Scott et al. (2016); Minsker et al. (2017); Srivastava et al. (2018), distributed MCMC algorithms combine local posterior samples to obtain a global posterior distribution. In the post-selection inference literature, a selective likelihood was appended to priors for Bayesian inference post selection in Panigrahi and Taylor (2018); Panigrahi et al. (2021). The focus in these settings was on a category of variable selection rules that can be written as a set of polyhedral constraints on data. For the same category of selection rules, a different body of work (Lee et al., 2016; Hyun et al., 2018; Le Duy and Takeuchi, 2022) considers a truncated Gaussian distribution for inference from normal data. Moving beyond normal data, Taylor and Tibshirani (2018) outline an asymptotic scheme to provide selective inference in GLMs. For blackbox selection, Liu et al. (2022) obtain the selective likelihood Selective Inference with Distributed Data by learning the selection event from bootstrapped data. We take a different approach in the distributed setup by casting the problem into a randomized framework, and provide an asymptotic likelihood function for the selected GLM. The use of a randomized framework for selective inference was explored in Tian et al. (2016); Panigrahi et al. (2017) to improve power. In Rasines and Young (2023); Panigrahi et al. (2022), a randomized framework for selective inference was investigated for more efficient uses of data than sample splitting. 
The connection between a randomized framework for selective inference and algorithmic stability was explored in Zrnic and Jordan (2023). 3. Selective inference with distributed data In this section, we first provide a characterization of the selection event with distributed data. Then we introduce a selective likelihood that takes into account the aggregation of predictors across machines to form the selected GLM. 3.1 The selection event Fixing some notations, for z Rn, let A(z) and 2A(z) denote the vectors in Rn whose ith coordinates are the first and second derivatives of A( ) at zi, respectively. We recast the regression problem (1) as a randomized lasso problem similar to (4). To do so, define the randomization variables n X ( A(X bβΛ,(k)) Y ) 1 nk X(k) ( A(X(k) bβΛ,(k)) Y (k)) (5) for k [K], where bβΛ,(k) is the solution to the lasso problem at machine k. Let Ω= (ω(1), , . . . , ω(K), ) RKp be the stack of the K randomization variables. As reviewed earlier, an equivalent expression for the lasso problem on machine k is bβΛ,(k) = argmin β Rp 1 n i [n] (A(x i β) yix i β) + Λβ 1 nω(k) β. (6) This is because the Karush-Kuhn-Tucker (KKT) conditions of problem (1) and problem (6) are the same. Note that the KKT conditions are given as follows: 1 n X A(X bβΛ,(k)) Y + γ(k) = nω(k), γ(k) E(k) = ΛE(k)s(k), s(k) = sign bβΛ,(k) E(k) , γ(k) E(k) = Λ E(k)z(k), z(k) 1, where the vector γ(k) represents the subgradient of the penalty Λβ 1 evaluated at the lasso solution. To perform inference for the selected GLM (3), we need some additional notations. Let dk = |E(k)|, d = |E|, and d = P k [K] dk. Let b(k) = bβΛ,(k) E(k) Rdk represent the lasso solution from machine k with only the active variables E(k), and let b B(k) denote the corresponding random vector. Let b B R d be the random vector formed by stacking b B(k) Liu and Panigrahi Local machine k Aggregated version Variable R.V. Obs. Dim R.V. Obs. Dim Selected variable set b E(k) E(k) [p] b E E [p] Active variables of lasso solution b B(k) b(k) dk b B b d Signs of active variables b S(k) s(k) dk b S s d Subgradients of inactive variables b Z(k) z(k) p dk b Z z p K d Subgradients of lasso penalty Λβ 1 bΓ(k) γ(k) p bΓ γ p K Randomization variables ω(k) p K Ω p K Table 1: Notations for the variables involved in selection. Columns 2-4 list the variables for the lasso problem on local machine k, while columns 5-7 show the aggregated or stacked versions. For the optimization variables (excluding selected variable sets and randomization variables), the corresponding random variables (R.V.) are denoted by uppercase letters with a hat, observed values (Obs.) by lowercase letters, and their dimensions (Dim) are also provided. The relationship among the variables is given in Equation (7). from all machines, and let b denote the realized value of b B. Following the same scheme, we denote random variables with uppercase letters, while lowercase letters will represent their realized values, and bold letters will represent a vector obtained by stacking these variables from all machines. To make it easier for readers, we have provided Table 1 that contains a list of notations used in our paper. Our method conditions on n bΓ(k) = γ(k) k [K] o , (8) a proper subset of { b E = E}, where bΓ(k) is the subgradient of the lasso penalty with realized value γ(k). In fact, event (8) is a subset of { b E(k) = E(k) k [K]}, and therefore a subset of the event { b E = E}. 
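For intuition about the quantities just introduced, the next sketch evaluates the randomization variable in (5) and the penalty subgradient for the Gaussian case, where the gradient of A is the identity. The randomization variable is a theoretical device built from the pooled data, so this sketch is purely illustrative; the helper names are ours. The identification of the subgradient with minus the gradient of machine k's local loss at its lasso solution follows from the KKT conditions of problem (1).

```python
import numpy as np

def local_subgradient(X_k, y_k, beta_lasso_k):
    # gamma^(k): subgradient of the lasso penalty at machine k's solution.
    # By the KKT conditions of problem (1) with Gaussian loss (grad A(z) = z),
    # it equals minus the gradient of the local loss at beta_hat^{Lambda,(k)}.
    n_k = X_k.shape[0]
    return -X_k.T @ (X_k @ beta_lasso_k - y_k) / n_k

def randomization_variable(X, y, X_k, y_k, beta_lasso_k):
    # omega^(k) from Equation (5) in the Gaussian case: the gap between the pooled-loss
    # gradient and machine k's local-loss gradient, evaluated at the local lasso solution.
    n, n_k = X.shape[0], X_k.shape[0]
    pooled = X.T @ (X @ beta_lasso_k - y) / n
    local = X_k.T @ (X_k @ beta_lasso_k - y_k) / n_k
    return pooled - local

def signs_and_inactive_subgradient(beta_lasso_k, gamma_k):
    # Split gamma^(k) into the sign vector s^(k) on the active set E^(k) and the
    # (penalty-scaled) subgradient on the inactive coordinates, as listed in Table 1.
    active = np.flatnonzero(beta_lasso_k)
    inactive = np.setdiff1d(np.arange(beta_lasso_k.size), active)
    return np.sign(beta_lasso_k[active]), gamma_k[inactive]
```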
Note that the conditional approach for selective inference remains valid even when we work with a proper subset of the selection event, and it is easier to characterize the event (8) than the event { b E = E}, as shown in the next result. In this result, note that b S R d and b Z Rp K d denote the signs of lasso solutions at active variables and the subgradients at inactive variables, respectively, stacked across K machines, with the realized values s and z (see Table 1). Proposition 1 (Characterization of the selection event) The conditioning event (8) can be characterized as n bΓ(k) = γ(k) k [K] o = n sign(b B) = s, b Z = z o . To avoid any confusion, we will hereafter refer to the conditioning event in (8) as the selection event. Selective Inference with Distributed Data 3.2 Marginal distribution of the MLE One common method to infer for βE is through a maximum likelihood approach. In a distributed setting, we consider a maximum likelihood estimator (MLE) that is obtained by aggregating local estimators from each of the machines. Assuming that E is a fixed set, we derive the joint marginal distribution of this MLE and the randomization variables Ωwithout considering the model selection process. This is the first step in obtaining our selective likelihood, before we condition the marginal distribution on the selection event. Let bβ(k) E = argmin β Rd 1 nρk i Ck A(x i,Eβ) yix i,Eβ be the local MLE at machine k, and let b I(k) E,E = 1 nk (X(k) E ) c W k X(k) E , where c W (k) = diag 2A(X(k) E bβ(k) E ) , be the corresponding observed Fisher information (obs-FI) matrix. Let b IE,E = PK k=0 ρkb I(k) E,E. Define the aggregated MLE as bβE = b I 1 E,E k=0 ρkb I(k) E,E bβ(k) E . (9) Let I denote the Fisher information E 1 n X diag( 2A(XEβE))X . In order to derive this marginal distribution of the aggregated MLE bβE, we state a few conditions. Assumption 1 For k [K], let e E(k) = E \ E(k). For j e E(k), either βE,j = O(n 1/2) or XE(k)I 1 E(k),E(k)IE(k),j = Xj. Assumption 2 The aggregated MLE bβE can be written as n(bβE βE) = I 1 E,E ℓ(βE) + op(1), n . Assumption 1 states conditions on predictors that are present in the selected model but are not selected by machine k. This condition suggests that our asymptotic assertions remain valid as long as these predictors are either weak in strength or have a high partial correlation with a predictor in the selected set E(k). Assumption 2 states that the aggregated MLE has an asymptotically linear representation. This condition is met when the aggregated MLE is asymptotically equivalent to the standard MLE. The latter has been shown to hold under some mild regularity conditions in Lin and Xi (2011). The following result provides the asymptotic distribution of the randomization variables, with a detailed proof in Appendix A.1. Proposition 2 Suppose nk/n ρk as n for k [K] and PK k=1 ρk < 1. Let U = diag(ρ 1 1 , . . . , ρ 1 K ) 1K K, where 1K K denotes the K K matrix with all entries equal to 1. Define Liu and Panigrahi the Kronecker product of U and I. Under Assumptions 1 and 2, we have n Ωd Np K (0, ΣΩ) , n . The model selection process is not solely a function of the aggregated MLE and the randomization variables, as it also depends on variables that were not included in the selected GLM. Therefore, we consider an additional statistic, defined as: n X E( A(XE bβE) Y ), which involves variables that were not selected by the lasso. The next theorem derives the marginal distribution of bβE, bβ E, and Ω. 
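Before turning to that theorem, here is a minimal sketch of the information-weighted aggregation in Equation (9), assuming each machine has already refit the GLM on E and reported its local MLE and observed Fisher information; `aggregate_mle` is an illustrative name, not part of the paper's code.

```python
import numpy as np

def aggregate_mle(local_mles, local_infos, ns):
    # Information-weighted aggregation of Equation (9):
    #   I_hat = sum_k rho_k * I_hat^(k),
    #   beta_hat = I_hat^{-1} sum_k rho_k * I_hat^(k) beta_hat^(k),
    # where rho_k = n_k / n are the sample proportions.
    n = float(sum(ns))
    rhos = [n_k / n for n_k in ns]
    info_agg = sum(rho * I_k for rho, I_k in zip(rhos, local_infos))
    weighted = sum(rho * I_k @ b_k
                   for rho, I_k, b_k in zip(rhos, local_infos, local_mles))
    return np.linalg.solve(info_agg, weighted), info_agg
```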
Theorem 3 Under the same assumptions as Proposition 2, it holds that bβE βE bβ E Ω I 1 E,E 0 0 0 I/IE,E 0 0 0 ΣΩ where I/IE,E = I E, E I E,EI 1 E,EIE, E is the Schur complement of IE,E in I. The proof of Theorem 3 is deferred to Appendix A.2. It is worth noting from this result that bβ E is an ancillary statistic for the parameters in the selected GLM. Furthermore, the randomization variables are independent of the statistics bβE and bβ E, which can be seen from the block diagonal covariance matrix in the limiting Gaussian distribution. 3.3 Selective likelihood In what follows, we base inference on a conditional likelihood in the selected GLM. This likelihood, referred to as the selective likelihood, is stated in Theorem 4. It is derived from the asymptotic distribution of the aggregated MLE, as stated in Theorem 3, after conditioning on the event described in Proposition 1. Before presenting this likelihood, we specify a technical condition and introduce the matrices involved in this likelihood. Assumption 3 Assume that the distribution of bβE βE,n bβ E Ω is absolutely continuous with respect to the Lebesgue measure on Rp(K+1). Let {pn}n 1 be the corresponding sequence of densities. Assume that {pn}n 1 is uniformly equicontinuous and bounded. Selective Inference with Distributed Data The condition in Assumption 3 together with the weak convergence in Theorem 3 implies that the density of (bβE, bβ E, Ω) converges to the corresponding limiting Gaussian density, uniformly on compact subsets of Rp(K+1). Define g(j) k = γ(j) E(k) Rdk for j, k [K], where g(j) k collects the coordinates of γ(j) belonging to E(k). Define g(j) = γ(j) E Rd, where g(j) contains the coordinates of γ(j) belonging to E. Let ρ0 = 1 P k [K] ρk. Let the matrices Ξ R d d, Ψ R d d, τ R d, Θ Rd d, Π Rd d, κ Rd be defined as follows. The (j, k) block of Ξ 1 is a d(j) d(k) matrix given by ρk + ρ2 k ρ0 IE(k),E(k) if j = k, ρ0 IE(j),E(k) if j = k. The (k, 1) block of Ξ 1Ψ and Ξ 1τ are given by ρ0 IE(k),E; Ξ 1τ k = ρkg(k) k ρk j=1 ρjg(j) k . Further, let ρ0 IE,E Ψ Ξ 1Ψ; Θ 1Π = IE,E; Θ 1κ = Ψ Ξ 1τ + ρj ρ0 g(j). Let ϕ( ; µ, Σ) denote the density function of the normal distribution N(µ, Σ). Let O = {v R d : sign(v) = sign(s)} denote the orthant in R d based on the sign vector s. Theorem 4 (Selective likelihood) Suppose that the conditions stated in Theorem 3 and Assumption 3 hold. The selective likelihood function, based on the asymptotic distribution of nbβE n bΓ(k) = γ(k) k [K] o , is equal to f βE; bβE, s, z = ϕ( nbβE; Π nβE + κ, Θ) P h nb B O | b Z = z i , P h nb B O | b Z = z i = Z ϕ( nbβE; Π nβE + κ, Θ) ϕ( nb B; Ψ nbβE + τ, Ξ) 1 n b B O o dbβEdb B. The proof is provided in Appendix A.3. With this selective likelihood, we proceed to compute the selective MLE and its associated Fisher information to perform inference as outlined in Section 2.3. In Section 4, we offer an approximation to this selective likelihood, which in turn enables an efficient way to compute the selective MLE and the corresponding Fisher information matrix. Liu and Panigrahi 3.4 Algorithm To conclude this section, we identify and specify the information that must be exchanged between the central machine and the local machines in order to carry out inferences in the selected GLM. Three rounds of communication are required for the central machine to perform selective inference: 1. Local machines send E(k) to the central machine. 2. The central machine sends the aggregated model E back to the local machines. 3. 
Local machines send local MLE bβ(k) E , local Fisher information b I(k) E,E, and the subgra- dient variables γ(k) E to the central machine. This is summarized in Algorithm 1. Note that the first two rounds of communication involve only the indices of the selected variables, which lead to the selected model E. It is obvious that the local MLE bβ(k) E and local Fisher information b I(k) E,E are needed at the central machine to compute the aggregated MLE and aggregated Fisher information, as defined in Equation (9). Besides these statistics, the selective likelihood derived in Theorem 4 involves the subgradients γ(j) E(k) and γ(j) E for j, k [K] in order to correct for the bias from model selection. Hence, in order to calculate the selective likelihood, the central machine needs the subgradients γ(j) with indices from E S k [K] E(k) sent from all local machines. Since we assumed for simplicity that E = S k [K] E(k), machine j simply needs to send γ(j) E to the central machine. Note that b I(k) E,E has a size of d2, resulting in a communication cost of O(d2) for each local machine, which does not depend on the original feature dimension p. Therefore, as long as a sparse model is selected, the communication cost for making inferences in the selected GLM remains relatively low. Extensions for more general aggregation rules are given in Appendix D. 4. Approximate selective MLE In the previous section, we obtained the selective likelihood function. As per Theorem 4, the selective log-likelihood is equal to log ϕ nbβE; Π nβE + κ, Θ log P h nb B O b Z = z i . (10) The main focus of this section is to provide an approximation for the second term (in the log-likelihood) by using a large-deviation limit for the log-probability. This approximation is made under the following moment and regularity conditions, which are commonly assumed to ensure the existence of the limit. Consider a real-valued sequence an that goes to infinity as n , and an = o(n1/2). Assume βE = βE,n satisfies that nβE,n = anβ E R|E|, where β E does not depend on n. Selective Inference with Distributed Data Algorithm 1: Communication of information STEP 1: Variable selection at local machines Machine k solves Problem (1) and sends E(k) = Support(bβΛ,(k)) to the central machine. STEP 2: Model aggregation at the central machine The central machine forms E = S k [K] E(k) and sends E back to the local machines. STEP 3: Communication of summary statistics Local machines communicate the following statistics to the central machine: local MLE and Fisher information: bβ(j) E , b I(j) E,E; subgradient: γ(j) E From the proof of Theorem 3, we have bβE βE,n bβ E Ω = n En + Rn, (11) where En = 1 n Pn i=1 ei,n is the average of n i.i.d. observations, and Rn = op(1). Assumption 4 (Moment condition and convergence of remainder) Assume that E [exp(λ e1,n 2)] < for some λ R+, and that lim n 1 a2n log P 1 an Rn 2 > ϵ = for any ϵ > 0, where ei,n and Rn are from the asymptotic linear representation in (11). Assumption 4 assumes the existence of an exponential moment in a neighborhood of zero, which is needed to justify a Laplace-type approximation for probabilities involving the mean of n i.i.d. variables, which in our case is En. The second condition in this assumption helps extend this justification to En with an added op(1) error term, Rn. 
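The three rounds of Algorithm 1 can be sketched as follows for the Gaussian case, extending the earlier selection sketch. The `LocalMachine` and `LocalSummary` classes and their method names are hypothetical stand-ins for whatever interface a real deployment would use; the local refit here is ordinary least squares.

```python
import numpy as np
from dataclasses import dataclass
from sklearn.linear_model import Lasso

@dataclass
class LocalSummary:
    # STEP 3 payload of Algorithm 1: O(d^2) numbers per machine, independent of p.
    beta_local: np.ndarray   # local MLE beta_hat^(k)_E on the aggregated model E
    info_local: np.ndarray   # local observed Fisher information I_hat^(k)_{E,E}
    subgrad_E: np.ndarray    # lasso subgradient gamma^(k) restricted to E

class LocalMachine:
    # Hypothetical interface for one site, specialized to the Gaussian loss.
    def __init__(self, X_k, y_k, lam):
        self.X, self.y, self.lam = X_k, y_k, lam

    def step1_select(self):
        # STEP 1: solve the local lasso and report the selected support E^(k).
        self.fit = Lasso(alpha=self.lam, fit_intercept=False).fit(self.X, self.y)
        return np.flatnonzero(self.fit.coef_)

    def step3_summarize(self, E):
        # STEP 3: refit the selected GLM on E (here: least squares) and summarize.
        n_k = self.X.shape[0]
        X_E = self.X[:, E]
        beta_local = np.linalg.lstsq(X_E, self.y, rcond=None)[0]
        info_local = X_E.T @ X_E / n_k
        subgrad = -self.X.T @ (self.X @ self.fit.coef_ - self.y) / n_k
        return LocalSummary(beta_local, info_local, subgrad[E])

def run_protocol(machines):
    supports = [m.step1_select() for m in machines]         # STEP 1: machines -> center
    E = sorted(set().union(*map(set, supports)))             # STEP 2: center -> machines
    summaries = [m.step3_summarize(E) for m in machines]     # STEP 3: machines -> center
    return E, summaries
```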
Assumption 5 For a fixed convex set R0 Rp(K+1), and for O = Op(1), we impose the condition that lim n 1 a2n nbβE nbβ E nΩ nbβE nbβ E nΩ Liu and Panigrahi In terms of bβE, bβ E, and Ω, the probability of the selection event takes the form nbβE nbβ E nΩ (see Appendix B.1). Therefore, Assumption 5 allows us to approximate the probability of the selection event up to an op(1) remainder. Theorem 5 Suppose that the conditions in Assumption 4 and Assumption 5 are met. Define Ln = inf b,B an κ Θ 1 b Πβ E 1 an τ Ξ 1 B Ψb 1 a2n Barr O(an B) where Barr O(x) = P i log(1 + 1 sixi ). Then, we have lim n 1 a2n log P h nb B O b Z = z i + Ln = C0, where C0 is a constant that does not depend on β E. The proof is provided in Appendix B.1. As a consequence of Theorem 5, we can substitute the log-probability in (10) by ( 1 2 (anb anΠβ E κ) Θ 1 (anb anΠβ E κ) 2 (an B anΨb τ) Ξ 1 (an B anΨb τ) + Barr O(an B) ignoring the additive constant in the limit. Finally, we reparameterize anb = nv and an B = n V , which gives the following approximation of the selective log-likelihood log ϕ( nbβE; Π nβE,n + κ, Θ) ( 1 2 nv nΠβE,n κ Θ 1 nv nΠβE,n κ 2 n V nΨv τ Ξ 1 n V nΨv τ + Barr O( n V ) The score and curvature of this selective likelihood yield the selective MLE and the selective obs-FI matrix. The derivation of these two estimators follows the steps in Panigrahi and Taylor (2023) for the standard Gaussian regression problem. In the interest of completeness, the expressions of these estimators are provided below. Selective Inference with Distributed Data Theorem 6 Consider solving the optimization problem b V bβE = argmin V R d 1 2( n V Ψ nbβE τ) Ξ 1( n V Ψ nbβE τ) + Barr O( n V ). (13) The maximizer of the approximate selective likelihood and the observed Fisher information matrix are equal to Π 1 bβE 1 nΠ 1κ + I 1 E,EΨ Θ 1 ΨbβE + 1 nτ b V bβE Θ 1 + Ψ Ξ 1Ψ Ψ Ξ 1 Ξ 1 + 2Barr O nb V bβE 1 Ξ 1Ψ 1 IE,E, (15) respectively. The proof is provided in Appendix B.2. In practice, the matrices Π, κ, Θ, Ψ, τ, Ξ are computed with the observed Fisher information b I, which are then used to compute the selective MLE and the selective obs-FI in (14) and (15), respectively. The algorithm for conducting inference based on the approximate selective MLE is summarized in Algorithm 2. Algorithm 2: Approximate selective MLE-based inference Compute the aggregated MLE bβE defined by Equation (9). Solve the d-dimensional convex optimization in (13). Compute bβ(S) E and b I(S) E,E as stated in (14) and (15). r b I(S) E,E 1 j,j , j [d]. Compute the two-sided p-value for H0 : βE,j = 0 as where Φ = 1 Φ is the CDF of the standard normal distribution. Compute the two-sided 100 (1 α)% confidence interval for βE,j as bβ(S) E,j z1 α/2 bσ(S) j n . 5. Addressing p-value lotteries When dealing with sparse regression, it is often feasible to construct p-values after reducing the number of variables to a manageable size. One common method is sample-splitting, Liu and Panigrahi as described in Wasserman and Roeder (2009), where variables are first selected on a subset of the data and then p-values are reported using classical least squares estimation on the remaining samples. Variables that are not selected are assigned a p-value equal to 1. A more powerful alternative to sample-splitting is carving, introduced in Fithian et al. (2014) through conditioning. However, results produced by sample splitting or carving are overly sensitive to the randomness involved in partitioning the data. 
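A minimal sketch of the convex optimization in (13) and of the remaining Wald-style steps of Algorithm 2 is given below, assuming the matrices Ψ, τ, Ξ, the sign vector s, and the selective MLE and observed Fisher information from (14) and (15) have already been assembled. It uses a generic derivative-free solver for brevity, whereas a practical implementation, such as the released code linked in Section 6, would exploit gradients; the function names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def barrier(v, signs):
    # Barr_O(v) = sum_i log(1 + 1/(s_i v_i)); infinite outside the orthant O.
    sv = signs * v
    return np.inf if np.any(sv <= 0) else np.sum(np.log1p(1.0 / sv))

def solve_problem_13(beta_agg, Psi, tau, Xi, signs, n):
    # Objective of (13):
    #   0.5 * (sqrt(n) V - Psi sqrt(n) beta_hat_E - tau)' Xi^{-1} (...) + Barr_O(sqrt(n) V).
    Xi_inv = np.linalg.inv(Xi)
    center = Psi @ (np.sqrt(n) * beta_agg) + tau

    def objective(v):
        resid = np.sqrt(n) * v - center
        return 0.5 * resid @ Xi_inv @ resid + barrier(np.sqrt(n) * v, signs)

    v0 = signs / np.sqrt(n)  # strictly feasible start inside the orthant O
    return minimize(objective, v0, method="Nelder-Mead",
                    options={"maxiter": 50000, "fatol": 1e-10, "xatol": 1e-8}).x

def wald_summary(beta_sel, info_sel, n, alpha=0.10):
    # Remaining steps of Algorithm 2, given the selective MLE (14) and obs-FI (15):
    # standard errors, two-sided p-values, and (1 - alpha) confidence intervals.
    se = np.sqrt(np.diag(np.linalg.inv(info_sel)) / n)
    pvals = 2 * norm.sf(np.abs(beta_sel / se))
    q = norm.ppf(1 - alpha / 2)
    return pvals, np.column_stack([beta_sel - q * se, beta_sel + q * se])
```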
This issue has been widely reported in literature as the p-value lottery problem. See, for example, the paper by Meinshausen et al. (2009). To address the p-value lottery problem, Meinshausen et al. (2009) aggregate p-values from repeated splitting, and more recently, Schultheiss et al. (2021) propose to aggregate p-values after repeated carving on random splits of data. We refer to the former procedure as multi-splitting and the latter procedure as multi-carving. With increasing numbers of replicates, the results are expected to be less sensitive to the randomness from the splits. Suppose one conducts multi-splitting or multi-carving B times, and obtains the p-values p(b) j , for b [B], and j [p]. This is followed by aggregating the B p-values through their empirical quantiles Qj(γ) := qγ γ p(b) j , b [B] 1, where qγ denotes the γ-th empirical quantile. One can also minimize over γ and use Pj := (1 log(γmin)) inf γ (γmin,1) Qj(γ) 1. (16) The aggregation scheme produces valid p-values as long as all p-values are individually valid. Below, we show that our proposal in the paper can be easily adapted to address the pvalue lottery problem without recourse to MCMC sampling. We proceed as multi-splitting and multi-carving, i.e., we use a subsample of size n1 for variable selection. For inference, we reuse data from selection by conditioning on the event of selection. We repeat this procedure B times and aggregate the p-values as above. Moreover, in each replicate, we can draw K random subsets of size n1 with replacement. A base model is selected using each subset, and the K base models are aggregated as done in (2). To conduct selective inference, a similar procedure can be used, with a slight modification in the distribution of the randomization variable Ω. Specifically, nΩstill converges to a normal distribution but with a different covariance matrix, as stated in Lemma 7. Lemma 7 If the K subsets D(1), . . . , D(K) are independent random samples of the data set D (rather than disjoint partitions) and each subset has size [ρn], then Theorem 3 holds with The proof is provided in Appendix C. Following the same argument as in Theorem 4, the asymptotic selective likelihood function takes the same form as the expression in Theorem 4, with slightly different expressions Selective Inference with Distributed Data Algorithm 3: Multiple carving for k = 1, . . . , K do Perform n1-out-of-n subsampling to form data set D(k). Solve the lasso problem with data D(k) and select variable set E(k). Construct the aggregated model E = Aggregate(E(1), . . . , E(K)). Compute the MLE bβE and Fisher information b IE,E in the selected GLM (3) using all the data. Compute matrices Ξ, Ψ, τ, Θ, Π, κ as given in Equation (17). Perform selective inference for βE using the approximate selective-MLE based method in Algorithm 2. of the matrices Ξ, Ψ, τ, Θ, Π, κ due to the change of the covariance matrix ΣΩ. There matrices are now calculated as {Ξ 1}j,k = δj,k ρ 1 ρIE(j),E(j); {Ξ 1Ψ}k = ρ 1 ρIE(k),E; {Ξ 1τ}k = ρ 1 ρg(k) k ; (17) Θ 1 = 1 + Kρ IE,E Ψ Ξ 1Ψ; Θ 1Π = IE,E; Θ 1κ = Ψ Ξ 1τ + ρ 1 ρ The entire procedure is summarized in Algorithm 3. When K = 1 subset is drawn in each of the B replicates, this procedure closely resembles the multi-carving procedure described in Schultheiss et al. (2021). When B = 1 and K = 1, it reduces to the data carving of Fithian et al. (2014). However, the multi-carving method of Schultheiss et al. 
(2021) conditions on the indices of the samples used for selection which results in a polyhedral event similar to the event in Lee et al. (2016). In contrast, our method marginalizes over the randomness involved in subsampling instead of conditioning on it. This is captured in the definition of our randomization variable Ω. As shown in the simulation results in Section 6.2, our method achieves higher power compared to the method that conditions on more. 6. Experiments In this section, we demonstrate the numerical performance of our proposed procedure. Our code can be accessed from the Git Hub repository https://github.com/snigdhagit/ Distributed-Selectinf. 6.1 Experiments with distributed data sets The data in our experiments is simulated as follows. For i [n], we draw xi Np(0, Σ), where Σij = ρ|i j| with ρ = 0.9, p = 100. In the first model, we draw a real-valued response from a linear model as yi N(x i β, σ2) with σ2 = 1. The dispersion parameter σ2 is estimated by ˆσ2 = 1 n d Pn i=1(yi x i,E bβE)2. In our second model, we draw a binary response from a logistic-linear model as yi Bernoulli(1/(1 + e x i β)). Observation i is Liu and Panigrahi independent of all the other observations in our data set. There are 5 non-zero coefficients in our model; each non-zero βj is equal to 2c log p, where the sign is random and c is referred to as the signal strength. Unless otherwise specified, the signal strength c is 0.7 for the linear model and 1 for the logistic model. The regularization parameter λ is set to 2 log p for linear regression and 0.5 log p for logistic regression. The n observations are partitioned into K + 1 disjoint subsets D(0), D(1), . . . , D(K). Subsets 1 through K, representing the data stored at K local machines, are used for variable selection. Subset 0, representing the data at the central machine, is used only at the time of selective inference. We design three different scenarios to investigate the performance of our procedure. We conduct 500 rounds of simulations in each scenario. (I). In Scenario 1, we vary the number of distributed data sets K {2, 4, 6, 8}. Each local machine has nk = [8000/K] samples, and the central machine has access to 1000 samples. (II). In Scenario 2, we vary the signal strength c. For linear regression, we vary c among {0.3, 0.5, 0.7, 0.9}; while for logistic regression, we vary c among {0.5, 1, 1.5, 2}. We fix K = 2. The central machine has 2000 samples, and each of the two local machines has 4000 samples. (III). In Scenario 3, we vary the number of samples that are reserved only for selective inference at the central machine. Specially, we vary n0 {250, 500, 1000, 2000}. We fix K = 3 and nk = 2000 for 1 k K. We report the coverage proportions and average interval lengths for βE in Figure 2 and Figure 3, for the linear model and the logistic model, respectively. The target parameter βE is defined as βE := argmin βE Rd i=1 E h ℓ(yi; x i,EβE) i , where ℓrepresents the corresponding GLM loss and the expectation is taken over the true distribution of yi. Specifically, for linear regression, ℓ(yi; θ) = 1 2(yi θ)2 and the true distribution is y N(x i β, σ2). For logistic regression, ℓ(yi; θ) = [yi log 1 1+e θ + (1 yi) log(1 1 1+e θ )] and the true distribution is yi Bernoulli((1 + e x i β) 1). The target coverage of the confidence interval is 90%. The lengths of confidence intervals are indications of the statistical power associated with selective inference for βE. 
Our method Dist-SI constructs the confidence intervals based on the approximate selective MLE method as introduced before. As a baseline for comparison, we consider the Splitting method, which uses only the data at the central machine to construct standard Wald confidence intervals. Error bars represent variations among 500 random replicates. Observations. Across all scenarios, the confidence intervals produced by Dist-SI (approximately) attain the desired coverage probability. The Splitting method produces valid confidence intervals, but discards samples used by the local machines. The advantages of reusing data from the local machines are quite evident in the plots for the lengths of the confidence intervals. As expected, Dist-SI yields tighter confidence intervals and achieves higher power than the baseline procedure based on Splitting in all three scenarios. We Selective Inference with Distributed Data Dist-SI Splitting 1 2 3 4 Signal strength 250 500 1000 2000 n0 Dist-SI Splitting 1 2 3 4 Signal strength 250 500 1000 2000 n0 Figure 2: Coverage proportions (top panel) and average interval lengths (bottom panel) for the linear model. Left panel: varying K. Middle panel: varying signal strength. Right panel: varying n0. observe that interval lengths for both methods increase with K. This is because the final model, which is the union of the K models selected by local machines, is likely to be larger for larger K. In this case, the variance of bβj tends to be larger. For a similar reason, interval lengths tend to increase with signal strengths as well. In Scenario 3, we see that both methods produce longer intervals when n0 decreases, and as expected, the gap between splitting and Dist-SI is more pronounced with fewer samples at the central machine. 6.2 Experiments on p-value lotteries In this section, we apply the suitable adaptation of our procedure to solve the p-value lottery problem, as described in Section 5. We compare our procedure with Multi-carving and Multi-splitting as proposed by Schultheiss et al. (2021) and Meinshausen et al. (2009), respectively. For the latter two algorithms, we use the implementation provided by Schultheiss et al. (2021) with code available on Git Hub1. To avoid any confusion, we continue to refer our procedure as Dist-SI , though we are no longer simulating distributed data sets. 1. https://github.com/cschultheiss/Multicarving. The original code is written in R, and we load them into Python when running our simulations, which might have contributed to slightly longer running times as reported in our findings. Liu and Panigrahi Dist-SI Splitting 1 2 3 4 Signal strength 250 500 1000 2000 n0 Dist-SI Splitting 1 2 3 4 Signal strength 250 500 1000 2000 n0 Figure 3: Coverage proportions (top panel) and average interval lengths (bottom panel) for the logistic model. Left panel: varying K. Middle panel: varying signal strength. Right panel: varying n0. Selective Inference with Distributed Data We generate our data from the same linear model as described before, but now we use the same dimension as in Schultheiss et al. (2021). We take n = 100 and p = 200, and consider s = 5 nonzero coefficients with βj = 2 log p. We use B = 10 replicates, and aggregate the p-values using formula (16) with γmin = 0.1. The proportion of samples used for variable selection is varied in the set {0.5, 0.6, . . . , 0.9}. We fix the significance level at 0.1. 
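Since the aggregation rule in (16) is used verbatim in this experiment, here is a minimal sketch of it for a single coefficient, assuming B replicate p-values are available; `aggregate_pvalues` and the grid over γ are our illustrative choices.

```python
import numpy as np

def aggregate_pvalues(pvals, gamma_min=0.1, grid_size=100):
    # Quantile aggregation over B replicates, Equation (16):
    #   Q_j(gamma) = min(1, gamma-quantile of {p_j^(b) / gamma}),
    #   P_j = min(1, (1 - log gamma_min) * inf_{gamma in (gamma_min, 1)} Q_j(gamma)).
    pvals = np.asarray(pvals, dtype=float)
    gammas = np.linspace(gamma_min, 1.0, grid_size, endpoint=False)
    Q = np.array([min(1.0, np.quantile(pvals / g, g)) for g in gammas])
    return min(1.0, (1.0 - np.log(gamma_min)) * Q.min())

# Example: B = 10 replicate p-values for a single coefficient.
print(aggregate_pvalues([0.01, 0.03, 0.2, 0.5, 0.04, 0.02, 0.6, 0.08, 0.01, 0.3]))
```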
Again, the inferential target of these methods is not the actual generating β, unless the selected set E contains all the nonzero coefficients of β. Nevertheless, we can compare these methods by their accuracy in terms of detecting nonzero signals. Specifically, a coefficient βj is predicted to be nonzero if the corresponding p-value is no larger than 0.1. We measure the accuracy of this prediction by the F-score, which is defined as F-score := 2 True Positive 2 True Positive + False Positive + False Negative Note that the positives and negatives are computed with respect to the true underlying parameter β, which might be different from the inferential target post selection if the selected E fails to contain some nonzero coefficients of β. Besides computing the F-score, we compare the average run time for Dist-SI and Multi-carving . The results are shown in Figure 4. In the left panel, we plot the F-scores of the three methods with varying proportions. The error bars are once again reported for 500 random repetitions. In the right panel, we plot the average run times of Dist-SI and Multi-carving on the log scale. Observations. We find that our procedure has higher F-scores than the two previously proposed alternatives, Multi-carving and Multi-splitting , for all values of sample proportion. Especially, a p-value in every replicate uses the full data after carefully discarding information that was used up for selecting predictors. The re-use of data from selection results in larger power over Multi-splitting . Our procedure aligns with Multi-carving , which also deploys conditional techniques to reuse data for hypothesis testing. However, a key distinction of our procedure with Multi-carving lies in how we use the randomization framework to characterize the selection event, and subsequently marginalize over this randomness to construct our p-values. In particular, we note that Multi-carving conditions on the randomization that is involved during variable selection on a random split of the data, whereas our procedure explicitly characterizes the distribution of randomization instead of simply conditioning on Ω. We believe that this difference between the two procedures shows up in our simulated findings as we note higher values of F-score with Dist-SI . Unsurprisingly, our proposal is also faster than Multi-carving by about 100 times. This is because the latter procedure requires MCMC sampling, while our procedure only involves solving a convex optimization problem. 6.3 Experiments on medical data set We illustrate an application of our procedure on a real data set that is publicly available on MIT s GOSSIS database Raffa et al. (2022). This data set contains records on intensive care unit (ICU) admissions from 192 hospitals, including patients demographic information, and various medical measurements, and lab results. We only use the data sets from the four largest hospitals, among which three data sets are used for variable selection and the remaining one is reserved for selective inference. We focus on a regression problem with data Liu and Panigrahi 0.5 0.6 0.7 0.8 0.9 proportion 0.5 0.6 0.7 0.8 0.9 proportion Dist-SI Multi-carving Multi-splitting Figure 4: Compare Dist-SI with multi-carving and multi-splitting. The left panel shows the F-scores when using different proportions of samples for selection. The right panel shows the average running time. from the first 24 hours of intensive care. 
The response in this problem is binary, and takes the value 1 if a patient admitted to an ICU has been diagnosed with Diabetes Mellitus, and is 0 otherwise. The same problem appeared in the 2021 Women in Data Science Datathon 2. We remove variables with more than half missing values, and also remove rows with missing values. After preprocessing, we end up with 81 predictors. The three data sets used for variable selection have sample sizes ranging from 1633 to 1788, and the data set reserved for inference has 2000 samples. For model selection, we run the logistic regression with lasso penalty. The regularization parameter is tuned with one extra data set with 893 samples. The selected GLM has 58 predictors. To construct confidence intervals for the 58 selected variables, we apply the proposed Dist-SI algorithm and Splitting as done in simulations. The significance level is set to be 0.1. Dist-SI reports 21 significant variables, while Splitting reports 13 significant variables, with 10 variables overlapping as significant in both methods. In Figure 5, we plot the confidence intervals for the regression coefficients that are rejected by either of the two procedures. The boxplot for the lengths of these intervals, in Figure 6, show that the median length of the Dist-SI intervals is smaller than the Splitting intervals by 67%. Additionally, the coefficient of variation is 1.8 and 3.9 for Dist-SI and Splitting , respectively. This indicates that the dispersion of interval lengths for Dist-SI is smaller than Splitting . On this instance, we see that Splitting yields a few very wide intervals. This is because the Hessian matrix based on data present at the central machine (reserved data set) is ill-conditioned. Dist-SI does not have this issue because it reuses data from the three hospitals for more powerful selective inference. 2. https://www.kaggle.com/competitions/widsdatathon2021/data. Accessed on on Dec. 17, 2022. Selective Inference with Distributed Data resprate_apache d1_diasbp_min d1_resprate_max pre_icu_los_days elective_surgery gcs_motor_apache intubated_apache d1_sysbp_noninvasive_min d1_sysbp_noninvasive_max d1_resprate_min d1_heartrate_max d1_sysbp_min d1_sysbp_max d1_heartrate_min encounter_id d1_spo2_min ventilated_apache Dist-SI Splitting Figure 5: Confidence intervals for the coefficients that are rejected by either Dist-SI or sample splitting. Dist-SI Splitting Figure 6: Boxplot of confidence interval lengths produced by Dist-SI and sample splitting. The y-axis is on the logarithmic scale. Liu and Panigrahi 7. Conclusion Model selection appears to be routine practice when analyzing big data sets. Inference for data-dependent models and parameters is a very challenging goal, because sound procedures must rigorously account for randomness from the selection process. To the best of our knowledge, this paper is the first contribution that addresses selective inference with distributed data. We provide a rigorous procedure to construct confidence intervals and p-values when inference is sought in a generalized linear model with selected predictors. We identify a representation for selection in a common distributed setup, and provide an asymptotic selective likelihood by developing a novel randomized framework for our problem. 
Approximately-valid selective inference, based on our selective likelihood, takes a very simple form: our confidence intervals for the selected regression coefficients are centered around the MLE of the selective likelihood, and the variance of the MLE is estimated by the observed Fisher information matrix. An appealing feature of our proposed inferential procedure is that we only require some aggregated information, with relatively low communication cost, from each machine. This feature allows an adaptation of our procedure to settings where various sites may not be willing to share their individual data sets. But, we note that there is room for improvement here, specially if various sites have not measured the same set of predictors. Our paper also provides an efficient solution for the p-value lottery problem without relying on MCMC samplers. Our procedure bypasses the primary computational bottleneck in the earlier proposal (Schultheiss et al., 2021) by reducing selective inference to the solution of an optimization problem. The recently developed sampler in Liu (2023) is also applicable in our distributed setup to perform sampling-based inference, as an alternative to the MLEbased inference used in this paper. Acknowledgments S. Panigrahi s research is supported by NSF grants DMS-1951980 and DMS-2113342 and by the NSF CAREER Award DMS-2337882. S. Liu s research is partially supported by the Stanford Data Science Scholars program. Selective Inference with Distributed Data Appendix A. Proofs for Section 3 Supporting results are collected in Appendix A.3.1. A.1 Proof of Proposition 2 Proof We write βE = βE,n in the asymptotic analysis as n . Define β E(k),n = I 1 E(k),E(k)IE(k),EβE,n. We start with the decomposition 1 n X ( A(X bβΛ,(k)) Y ) = 1 n X ( A(XEβE,n) Y ) + R(k) 1 + R(k) 2 , n X ( A(XE(k)β E(k),n) A(XEβE,n)), n X ( A(X bβΛ,(k)) A(XE(k)β E(k),n)). In a similar fashion, we decompose the quantities based on local data D(k) as 1 nk X(k) ( A(X(k) bβΛ,(k)) Y (k)) = 1 nk X(k) ( A(X(k) E βE,n) Y (k)) + r(k) 1 + r(k) 2 . The decomposition in the above two displays allow us to write nω(k) = n n 1 n X ( A(XEβE,n) Y ) 1 nk X(k) ( A(X(k) E βE,n) Y (k)) + R(k) 1 + R(k) 2 r(k) 1 r(k) 2 o = n ω(k) + n R(k), where R(k) = R(k) 1 + R(k) 2 r(k) 1 r(k) 2 , and ω(k) = 1 n X ( A(XEβE,n) Y ) 1 nk X(k) ( A(X(k) E βE,n) Y (k)). Let Ω Rp K be the stack of ω(k) for 1 k K. It suffices to show that n Ωd N(0, ΣΩ) and n R(k) p 0. (19) To proceed with the proof, let ei = xi( A(x i,EβE,n) yi). It is easy to see that ei are i.i.d. for all 1 i n with E [ei] = 0 and Var [ei] I under the selected model (3). It follows that n ω(k) = p 1 ρk 1 n nk i/ Ck ei 1 ρk ρk Hence, we have n ω(k) d N 0, 1 ρk Liu and Panigrahi Cov h n ω(j), n ω(k)i I for j = k. This leads to ρj I I I 1 ρk which proves the first statement of (19). Lemma 8 and Lemma 9 below show that n R(k) = op(1), thereby concluding the proof of (19). Lemma 8 (Rate of R(k) 1 r(k) 1 ) Let n X ( A(XE(k)β E(k),n) A(XEβE,n)), nk X(k) ( A(X(k) E(k)β E(k),n) A(X(k) E βE,n)). Then n(R(k) 1 r(k) 1 ) = op(1). Proof [Proof of Lemma 8.] According to Lemma 12, we have XE(k)β E(k),n XEβE,n = Op(n 1/2). By a Taylor approximation, we have R(k) 1 r(k) 1 = 1 n X diag( 2A(XEβE,n))(XE(k)I 1 E(k),E(k)IE(k),E XE)βE,n nk X(k) diag( 2A(X(k) E βE,n))(X(k) E(k)I 1 E(k),E(k)IE(k),E XE)βE,n + op(n 1/2). By Assumption 1, there exists Ek e E(k) = E\E(k) such that for j e E(k)\Ek, XE(k)I 1 E(k),E(k)IE(k),j = Xj and βEk,n = O(n 1/2). 
So the last display simplifies as R(k) 1 r(k) 1 = 1 n X diag( 2A(XEβE,n))(XE(k)I 1 E(k),E(k)IE(k),Ek XEk)βEk,n nk X(k) diag( 2A(X(k) E βE,n))(X(k) E(k)I 1 E(k),E(k)IE(k),Ek XEk)βEk,n + op(n 1/2). If Ek is not empty, let n X diag( 2A(XEβE,n))(XE(k)I 1 E(k),E(k)IE(k),Ek XEk) . R(k) 1 r(k) 1 = 1 n X diag( 2A(XEβE,n))(XE(k)I 1 E(k),E(k)IE(k),Ek XEk) T1 nk X(k) diag( 2A(X(k) E βE,n))(X(k) E(k)I 1 E(k),E(k)IE(k),Ek XEk) T1 + op(n 1/2). Selective Inference with Distributed Data Note that βEk = O(n 1/2). Further, observe that 1 n X diag( 2A(XEβE,n)(XE(k)I 1 E(k),E(k)IE(k),Ek XEk) T1 = op(1), and 1 nk X(k) diag( 2A(X(k) E βE,n)(X(k) E(k)I 1 E(k),E(k)IE(k),Ek XEk) T1 = op(1). Thus, we conclude that R(k) 1 r(k) 1 = op(n 1/2). Lemma 9 (Rate of R(k) 2 r(k) 2 ) Let n X ( A(X bβΛ,(k)) A(XE(k)β E(k),n)), nk X(k) ( A(X(k) bβΛ,(k)) A(X(k) E(k)β E(k),n)). Then n(R(k) 2 r(k) 2 ) = op(1). Proof [Proof of Lemma 9.] Based on the assertion in Lemma 12, we have bβΛ,(k) E(k) β E(k),n = Op(n 1/2). Taking a Taylor expansion of A(X bβΛ,(k)) at Xβ E(k),n for each coordinate, we obtain R(k) 2 r(k) 2 = 1 n X h diag( 2A(XE(k)β E(k),n))XE(k)(bβΛ,(k) E(k) β E(k),n) + o( bβΛ,(k) E(k) β E(k),n )) i 1 nk X(k) h diag( 2A(X(k) E(k)β E(k),n))X(k) E(k)(bβΛ,(k) E(k) β E(k),n) + o( bβΛ,(k) E(k) β E(k),n )) i . n X diag( 2A(XE(k)β E(k),n))XE(k) , we have 1 n X diag( 2A(XE(k)β E(k),n))XE(k) = T + op(1), and 1 nk X(k) diag( 2A(X(k) E(k)β E(k),n))X(k) E(k) = T + op(1). R(k) 2 r(k) 2 = 1 n X diag( 2A(XE(k)β E(k),n))XE(k) T (bβΛ,(k) E(k) β E(k),n) nk X(k) diag( 2A(X(k) E(k)β E(k),n))X(k) E(k) T (bβΛ,(k) E(k) β E(k),n) + op(n 1/2) = op(n 1/2). Liu and Panigrahi A.2 Proof of Theorem 3 Proof It follows from Assumption 2 that n(bβE βE,n) d N(0, I 1 E,E), and from Theorem 2 we have nΩd N(0, ΣΩ). Note that nbβ E = 1 n X E( A(XEβE,n) Y ) + 1 n X E( A(XE bβE) A(XEβE,n)) = 1 n X E( A(XEβE,n) Y ) + 1 n X EWX E(bβE βE,n) + op(1) = 1 n X E( A(XEβE,n) Y ) I E,EI 1 E,E 1 n X E( A(XEβE,n) Y ) + op(1). In particular, we have 1 n X E( A(XEβE,n) Y ) 1 n X E( A(XEβE,n) Y ) = IE,E IE, E I E,E I E, E Thus, we observe that the asymptotic variance of nbβ E is equal to I E, E I E,EI 1 E,EIE, E, and conclude that nbβ E d N(0, I/IE,E). Further, it is easy see from the asymptotic representations of n(bβE βE,n) and nbβ E that they are mutually independent. Now, observe that n(bβE βE,n) and nbβ E are asymptotically equivalent to sums of i.i.d. random variables zi, and that nω(k) assumes the form in (18). Thus, we can write where ei = xi( A(x i,EβE,n) yi). The covariance term on the right-hand-side is 0, which completes the proof. A.3 Proof of Theorem 4 Proof We start by writing nω(k) = n ω(k) + op(1), established in Lemma 13, where n ω(k) = T(k)( n b B(k), b Z(k); nbβE, nbβ E). Selective Inference with Distributed Data Let nΩ= T( nb B, b Z; nbβE, nbβ E) denote the stack of n ω(k) for k [K]. We begin with pn, the Lebesgue density of n(bβE βE,n), nbβ E, nΩ . We then apply the change of variables nΩ ( nb B, b Z) through the mapping nΩ= T( nb B, b Z; nbβE, nbβ E). Because the mapping is linear, the density of n(bβE βE,n), nbβ E, nb B, b Z is proportional to pn( n(bβE βE,n), nbβ E, T( nb B, b Z; nbβE, nbβ E)). Now, the condition in Assumption 3 allows us to replace pn by the limiting Gaussian density in Theorem 3 which gives us the corresponding asymptotic density function ϕ( nbβE; nβE,n, I 1 E,E) ϕ( nbβ E;0, (I/IE,E) 1) ϕ(T( nb B, b Z; nbβE, nbβ E); 0, ΣΩ). 
(20) Furthermore, if we condition on b Z = z and ignore constants, the asymptotic likelihood can be simplified to ϕ( nbβE; Π nβE,n + κ; Θ) ϕ( nb B; Ψ nbβE + τ; Ξ), where Π, κ, Θ, Ψ, τ, Ξ are defined in Theorem 4. See details of the simplification in Lemma 10 below. The selection event, when conditioned on b Z = z, is equivalent to b B O. So the conditional density of bβE, b B is given by ϕ( nbβE; Π nβE,n + κ; Θ) ϕ( nb B; Ψ nbβE,n + τ; Ξ) 1 n b B O o R ϕ( nbβE; Π nβE,n + κ; Θ) ϕ( nb B; Ψ nbβE + τ; Ξ) 1 n b B O o dbβEdb B . Ignoring the factors that do not depend on βE,n gives the selective likelihood function. Lemma 10 (Matrix simplification) The joint density of ( nbβE, nb B, bβ E) when conditioned on b Z = z is equal to ϕ( nbβE; Π nβE,n + κ; Θ) ϕ( nb B; Ψ nbβE + τ; Ξ) ϕ( nbβ E; 0, (I/IE,E) 1). Proof [Proof of Lemma 10.] Denote I ,E ... I ,E I ,E(1) 0 . . . 0 0 I ,E(2) . . . 0 ... 0 . . . 0 I ,E(K) Liu and Panigrahi Let r(k) = Λ s(k) , and let r be the stack of r(1), . . . , r(K). Observe, the mapping T, given in Lemma 13, can be written as nΩ= Q1 nbβE + Q2 nb B + r. (22) It follows from Equation (20) that the joint density of ( nbβE, nb B) after conditioning on b Z = z is proportional to 2( nbβE nβE,n) IE,E( nbβE nβE,n) 2( Q1 nbβE + Q2 nb B + r) Σ 1 Ω( Q1 nbβE + Q2 nb B + r) 2( nb B) Q 2Σ 1 ΩQ2( nb B) + ( nb B) Q 2Σ 1 Ω(Q1 nbβE r) 2( nbβE) (IE,E + Q 1Σ 1 ΩQ1)( nbβE) + ( nbβE) (IE,E nβE,n + Q 1Σ 1 Ωr) . For Ξ 1 = Q 2Σ 1 ΩQ2, Ψ = ΞQ 2Σ 1 ΩQ1, τ = ΞQ 2Σ 1 Ωr, we observe that this density is proportional to ϕ( nb B; Ψ nbβE + τ, Ξ) exp 1 2( nbβE) (IE,E + Q 1Σ 1 ΩQ1 Ψ Ξ 1Ψ)( nbβE) +( nbβE) (Ψ Ξ 1τ + IE,E nβE,n + Q 1Σ 1 Ωr) i . The density in the last display is proportional to ϕ( nbβE; Π nβE,n + κ; Θ) ϕ( nb B; Ψ nbβE + τ; Ξ) for Θ 1 = IE,E + Q 1Σ 1 ΩQ1 Ψ Ξ 1Ψ, Π = ΘIE,E, κ = Θ(Ψ Ξ 1τ + Q 1Σ 1 Ωr). To further simplifying the matrices in the likelihood, we note that Σ 1 Ω = U 1 I 1. Therefore, we can write Q 2Σ 1 ΩQ2 in the form of K K blocks, where the (j, k)-block is ρjρk ρ0 + ρjδj,k IE(j),E(k). Similarly, Q 2Σ 1 Ωr has K blocks where the k-th block is ρk JE(k)r(k) + ρk j=1 ρj JE(k)r(j), where JE is the matrix that selects the coordinates in E, i.e. JEx = x E. Since JE(k) 0E bβ E 0 and JE(k)γ(j) = g(j) k , the above display is equal to ρkg(k) k +(ρk/ρ0) P j [K] ρjg(j) k . Similarly, Q 1Σ 1 Ωr = P k [K] ρk ρ0 JEγ(k) = P k [K] ρk ρ0 g(k) and Q 1Σ 1 ΩQ1 = k [K] ρk + X IE,E = 1 ρ0 Selective Inference with Distributed Data Thus Θ 1 = 1 ρ0 IE,E Ψ Ξ 1Ψ. Remark 11 We note that when using the union aggregation rule, the quantities Π, κ, Θ, Ψ, τ, Ξ do not depend on bβ E. If E(k) is not necessarily a subset of E, then JE(k) 0E bβ E 0E(k) E bβ E(k)\E, JE(k)r(j) = JE(k)γ(j) + 0E(k) E bβ E(k)\E But, we only need to re-define g(j) k as g(j) k = JE(k)γ(j) + 0E(k) E bβ E(k)\E and this change only affects τ. For the complete procedure when using general aggregation rules, see Appendix D. A.3.1 Supporting Results Lemma 12 Let β E(k),n = I 1 E(k),E(k)IE(k),EβE,n. Under Assumption 1, XE(k)β E(k),n XEβE,n = Op(n 1/2), and bβ(k),Λ E(k) β E(k),n = Op(n 1/2). Proof Note that XE(k)β E(k) XEβE = (XE(k)I 1 E(k),E(k)IE(k),E XE)βE = XE(k)I 1 E(k),E(k)IE(k),E\E(k) XE\E(k) βE\E(k)+ XE(k)I 1 E(k),E(k)IE(k),E E(k) XE E(k) βE E(k) = XE(k)I 1 E(k),E(k)IE(k),E\E(k) XE\E(k) βE\E(k). By Assumption 1, let Ek be the set such that if j (E \ E(k)) \ Ek, XE(k)I 1 E(k),E(k)IE(k),j Xj = 0, Liu and Panigrahi and βEk = O(n 1/2). Then we have XE(k)β E(k),n XEβE = (XE(k)I 1 E(k),E(k)IE(k),Ek XEk)βEk = Op(n 1/2). This proves the first claim. 
By the KKT condition of lasso problem (1), we have 1 nk X(k) E(k) ( A(X(k) E(k) bβΛ,(k) E(k) ) Y (k)) = 1 nΛE(k)sk. Denote the left-hand-side as ℓ(k)(bβΛ,(k) E(k) ). Taking a Taylor expansion at β E(k),n, we get 1 nΛE(k)sk = ℓ(k)(bβΛ,(k) E(k) ) = ℓ(k)(β E(k),n) + 2ℓ(k)(β E(k),n)(bβΛ,(k) E(k) β E(k),n) + o( bβΛ,(k) E(k) β E(k) 2). (23) ℓ(k)(β E(k)) = 1 nk X(k) E(k) ( A(X(k) E(k)β E(k),n) Y (k)) nk X(k) E(k) ( A(X(k) E(k)β E(k),n) A(X(k) E βE,n))+ 1 nk X(k) E(k) ( A(X(k) E βE,n) Y (k)). The first term is of order Op(n 1/2) due to the first claim. The second term is of order Op(n 1/2) because it is an average of nk i.i.d. random variables of mean zero and finite variance. This proves ℓ(k)(β E(k)) = Op(n 1/2). 2ℓ(k)(β E(k)) = 1 nk X(k) E(k) diag( 2A(XE(k)β E(k)))X(k) E(k) nk X(k) E(k) diag( 2A(XEβE))X(k) E(k) + Op(n 1/2) = IE(k),E(k) + Op(n 1/2). Substituting into Equation (23) gives n(bβΛ,(k) E(k) β E(k),n) = Op(1). Selective Inference with Distributed Data Lemma 13 The randomization variables have the expression nω(k) = n ω(k) + op(1), where n ω(k) = T(k)( n B(k), Z(k); nbβE, nbβ E) = I ,E(k) n B(k) I ,E nbβE + Λ S(k) Proof The K.K.T. conditions of stationarity for the lasso on machine k are summarized by nω(k) = 1 n X ( A(X bβ(k),Λ) Y ) + γ(k) = 1 n X ( A(XE bβE) Y + A(XE(k)B(k)) A(XE bβE)) + γ(k) + 1 n X diag( 2A(XEβE,n))XE(k)B(k) 1 n X diag( 2A(XEβE,n))XE bβE + γ(k) + op(1) = I ,E(k) n B(k) I ,E nbβE + Λ S(k) = n w(k) + op(1). Appendix B. Proofs for Section 4 B.1 Proof of Theorem 5 Before proving Theorem 5, we provide a supporting result in Lemma 14. First, let s(k) = γ(k) E(k) 0 Rp be the active components of the subgradient vector γ(k) padded with a vector of all zeros. Now, let s Rp K be formed by stacking the vectors s(k) for k [K]. Recall that Ωis formed by stacking the vectors ω(k), k [K] that we previously defined in Lemma 13. Suppose that n Ω= nΩ s. Lemma 14 Define nbβE nbβ E n Ω Let S be a convex subset of Rp(K+1). Under the conditions in Assumption 4 and Assumption 5, we have a2n log P 1 an Vn S = inf b,b ,ω S R(b, b , ω), Liu and Panigrahi R(b, b , ω) = ( 1 2(b β E) IE,E(b β E) + 1 2(b ) (I/IE,E) 1b + 1 Proof Because the difference between n Ωand n Ωis O(1), based on the condition in Assumption 5, we have lim n 1 a2n an Vn S log P nbβE nbβ E nΩ From the proof of Proposition 2, we have the representation = n En + Rn, where the limiting density of n En, at (b, b , ω), is proportional to ϕ(b; β E, I 1 E,E) ϕ(b ; 0, I/IE,E) ϕ(ω; 0, ΣΩ). So we have the following large-deviation limit nbβE nbβ E nΩ = inf b,b ,ω S R(b, b , ω) under Assumption 4, where R is the rate function (25). The assertion follows directly by using (26). Now we are ready to prove Theorem 5. Proof First let us define some notations which we need for the proof. Recall that r(k) = , and that r is the stack of r(k) for k [K], and that s Rp K is the stack of s(k) = γ(k) E(k) 0 . In matrix form, we can write r = Q3b Z + Q4 nbβ E + s for fixed matrices Q3 and Q4. Let 1 an z = ζ. Let Q1, Q2 be defined according to (21). Then, note that we have = Q1 nbβE + Q2 nb B + r s = Q1 nbβE + Q2 nb B + Q3b Z + Q4 nbβ E. Selective Inference with Distributed Data If we define: h(B, Z; b, b ) = Q2B + Q3Z Q1b + Q4b . then n Ω= h( nb B, b Z; nbβE, nbβ E). Denote by Un the vector nbβE nbβ E nb B b Z and consider Vn as defined in Lemma 14. Observe that nbβE nbβ E h nb B, b Z; nbβE, nbβ E is a bijective linear function of Un. 
Let us define H : Rp(K+1) Rp(K+1) as the function such that 1 an Vn = H( 1 Applying the contraction principle for large-deviation limit together with Lemma 14, we conclude that the vector 1 an Un = H 1( 1 an Vn) satisfies a large deviation principle with the rate function R H, where R is the rate function of 1 an Vn given in Equation (25). Thus, it follows that a2n log P n an b Z = ζ = inf b,b ,B O R H(b, b , B, ζ) = inf b,b ,B O R(b, b , h(B, ζ; b, b )) = inf b,b ,B O 1 2(b β E) IE,E(b β E) + 1 2b (I/IE,E) 1b 2h(B, ζ; b, b ) Σ 1 Ωh(B, ζ; b, b ). By a similar matrix simplification as lemma 10, this is equal to (up to constant independent of β E) inf b,B O 1 2(b Πβ E κ) Θ 1(b Πβ E κ) + 1 2(B Ψb τ) Ξ 1(B Ψb τ), where τ = ΞQ 2Σ 1 Ω(Q3ζ + Q4b ), κ = Θ(Ψ Ξ 1 τ + Q 1Σ 1 Ω(Q3ζ + Q4b )). The matrices Π, Θ, Ψ, Ξ are the same as in Lemma 10. It remains to show that inf b,B O 1 2(b Πβ E κ) Θ 1(b Πβ E κ) + 1 2(B Ψb τ) Ξ 1(B Ψb τ) = lim n inf b,B O 1 2(b Πβ E 1 an κ) Θ 1(b Πβ E κ) + 1 2(B Ψb τ) Ξ 1(B Ψb 1 Liu and Panigrahi Recall from that Lemma 10 that Q 2Σ 1 Ω has (j, k) block equal to (ρjρk ρ0 + ρkδj,k)JEj, and Q 1Σ 1 Ω has (1, k) block equal to ρk ρ0 JE. The matrix Q4 has block k equal to J E. Thus, Q 2Σ 1 ΩQ4 = 0, Q 1Σ 1 ΩQ4 = 0. So τ = ΞQ 2Σ 1 ΩQ3ζ. Recall that z = anζ, so τ = ΞQ 2Σ 1 Ω(Q3anz + s) = an τ ΞQ 2Σ 1 Ωs. Because an , τ = 1 an τ + o(1). Similarly, κ = 1 an κ + o(1). This proves the last claim. To conclude the proof, we observe that lim n inf b,B an κ Θ 1 b Πβ E 1 an τ Ξ 1 B Ψb 1 a2n Barr O(an B) = lim n inf b,B O 1 2(b Πβ E 1 an κ) Θ 1(b Πβ E 1 an τ) Ξ 1(B Ψb 1 This is because the sequence of convex objectives in the left-hand side display converge (in a pointwise sense) to the convex objective on the right-hand side display which has a unique minimum. B.2 Proof of Theorem 6 Proof Observe, the approximate selective likelihood is equal to ( nbβE) Θ 1( nΠβE,n + κ) Q n Θ 1( nΠβE,n + κ) , where Q n(α) = sup v ( nv) α Qn( nv) (28) Qn( nv) = 1 2( nv) Θ 1 nv + inf V 2 n V nΨv τ Ξ 1 n V nΨv τ + Barr O( n V ) o . The score, based on the approximate selective likelihood, is equal to nΠ Θ 1 nbβE Q n Θ 1( nΠβE,n + κ) . Thus, the selective MLE is given by Θ 1( nΠbβ(S) E,n + κ) = ( Q n) 1( nbβE) = Qn( nbβE) = Θ 1 nbβE Ψ Ξ 1 nb V bβE nΨbβE τ . Selective Inference with Distributed Data nbβ(S) E = nΠ 1 bβE Π 1κ + Π 1ΘΨ Ξ 1 Ψ nbβE + τ nb V bβE = nΠ 1 bβE Π 1κ + I 1 E,EΨ Θ 1 Ψ nbβE + τ nb V bβE Let v be the solution of (28) when α = nΠbβ(S) E +κ. The selective obs-FI matrix, derived from the curvature of the approximate selective likelihood, is given by b I(S) E,E = Π Θ 1 2Q n nΠbβ(S) E + κ Θ 1Π = Π Θ 1 2Qn nv 1 Θ 1Π = Π Θ 1 Θ 1 + Ψ Ξ 1Ψ Ψ Ξ 1 Ξ 1 + 2Barr nb V bβE 1 Ξ 1Ψ 1 Θ 1Π Θ 1 + Ψ Ξ 1Ψ Ψ Ξ 1 Ξ 1 + 2Barr nb V bβE 1 Ξ 1Ψ 1 IE,E. Appendix C. Sampling subsets with replacement Proof [Proof of Lemma 7] It suffices to prove that for j = k Cov nω(j), nω(k) 0. Following the proof A.1, we only need to show i=1 ei 1 nρ i C(j) ei, 1 n i=1 ei 1 nρ Let C(j) denote the index set of D(j). Since E h 1 n Pn i=1 ei 1 nρ P i C(j) ei | C(j)i = 0, it remains to show that i=1 ei 1 nρ i C(j) ei, 1 n i=1 ei 1 nρ i C(k) ei | C(j), C(k) i=1 ei 1 nρ i C(j) ei, 1 n i=1 ei 1 nρ i C(k) ei | C(j), C(k) = Cov [ei] Cov [ei] Cov [ei] + |C(j) C(k)| 1 nρ2 Cov [ei] = (|C(j) C(k)| 1 nρ2 1) Cov [ei] . The proof is completed by the fact that E |C(j) C(k)| = nρ2. Liu and Panigrahi In this setting, the matrices Ξ, Θ, Ψ, τ, ν, Π, κ are similarly found by Theorem 4 and its proof with ΣΩ= 1 ρ ρ IK I. So now Σ 1 Ω= ρ 1 ρIK I 1. 
Then $\{\Xi^{-1}\}_{j,k} = \{Q_2^\top \Sigma_\Omega^{-1} Q_2\}_{j,k} = \frac{\rho}{1-\rho}\,\mathcal{I}_{E^{(j)},E^{(j)}}$ if $j = k$, and $0$ otherwise. The other matrices are computed similarly.

Appendix D. Selective inference with general aggregation rules

D.1 Algorithm

In the main manuscript, we focused on the union aggregation rule, i.e., the final model $E$ is the union of the variables selected in the base models $E^{(k)}$, $1 \le k \le K$. We show that, with a slight modification, our procedure can be adapted to accommodate other aggregation rules. The new procedure is summarized in Algorithm 4. We note that the procedure remains almost the same, except that
\[
g^{(j)}_k = J_{E^{(k)}}\gamma^{(j)} + \begin{pmatrix} 0_{E^{(k)} \cap E} \\ \widehat\beta^{\perp}_{E^{(k)} \setminus E} \end{pmatrix},
\tag{29}
\]
where $\widehat\beta^{\perp} = \frac{1}{n} X^\top (Y - X_E\widehat\beta_E)$; the second term in (29) is the vector indexed by $E^{(k)}$ that is zero on $E^{(k)} \cap E$ and equals $\widehat\beta^{\perp}_j$ for $j \in E^{(k)} \setminus E$. If a variable is selected by machine $k$ but is not selected in the final model $E$, then, for a more general aggregation rule, we must compensate the subgradients by the correlation between that variable and the residual vector.

To compute the vectors $g^{(j)}_k$, the central machine requires $\widehat\beta^{\perp}_{E_u \setminus E}$, where $E_u = \cup_{k \in [K]} E^{(k)}$ is the union of the base models. Thus, each local machine must send $\widehat\beta^{\perp,(k)}_{E_u \setminus E} = X^{(k)\top}_{E_u \setminus E}(Y^{(k)} - X^{(k)}_E\widehat\beta_E)$ to the central machine. Because this quantity depends on the MLE $\widehat\beta_E$, which is computed on the central machine, the central machine must first send $\widehat\beta_E$ to the local machines. Our modified procedure in Algorithm 4, therefore, involves two more exchanges between the central machine and the local machines: (1) the central machine sends $\widehat\beta_E$ to the local machines; (2) the local machines send $\widehat\beta^{\perp,(k)}_{E_u \setminus E}$ back to the central machine. In comparison with Algorithm 1, the additional communication cost is $|E_u \setminus E|$ per local machine. Note that this cost is comparable to the overall cost of order $O(|E|^2)$, as long as $|E_u|$ is of about the same order as $|E|$. Of course, the modified $g^{(j)}_k$ in (29) change some of the matrices in the optimization that we solve for approximately-valid selective inference. Theoretically, our selective likelihood is now obtained by conditioning further on $\widehat\beta^{\perp}_{E_u \setminus E}$, besides the information from the subgradient vectors.

Algorithm 4: General aggregation rules.

STEP 1: Variable selection at local machines. Machine $k$ solves (1) and sends $E^{(k)} = \mathrm{Support}(\widehat\beta^{\Lambda,(k)})$ to the central machine.

STEP 2: Modeling with selected predictors. The central machine aggregates the sets $E^{(k)}$ to form the final model $E$.

STEP 3: Communication with the central machine.
Exchange 1: The central machine sends the set $E$, as well as $E_u = \cup_{k \in [K]} E^{(k)}$, to the local machines.
Exchange 2: The local machines send back the local estimators $\widehat\beta^{(k)}_E$ and $\widehat{\mathcal{I}}^{(k)}_{E,E}$, and the subgradient at $\widehat\beta^{\Lambda,(k)}$: $\gamma^{(k)}_{E_u}$.
Exchange 3: The central machine computes the MLE $\widehat\beta_E$ and sends it to the local machines.
Exchange 4: The local machines compute $\widehat\beta^{\perp,(k)}_{E_u \setminus E} = X^{(k)\top}_{E_u \setminus E}(Y^{(k)} - X^{(k)}_E\widehat\beta_E)$ and send it to the central machine.

STEP 4: Selective inference at the central machine.
(A) Compute $\widehat\beta^{\perp}_{E_u \setminus E} = \frac{1}{n} X^{(0)\top}_{E_u \setminus E}(Y^{(0)} - X^{(0)}_E\widehat\beta_E) + \frac{1}{n}\sum_{k=1}^{K} \widehat\beta^{\perp,(k)}_{E_u \setminus E}$.
(B) Compute $\widehat\Xi, \widehat\Psi, \widehat\tau, \widehat\Theta, \widehat\Pi, \widehat\kappa$ as defined in Theorem 4, with the $g^{(j)}_k$ defined according to Equation (29).
(C) Apply Algorithm 2 to perform inference based on the selective MLE.

D.2 Experiments

We illustrate the performance of Algorithm 4 on simulated data. In the following experiment, we consider the same setting as in Section 6, with prespecified groups of correlated predictors. More specifically, we consider 20 groups of predictors, each of size 5; distinct groups of predictors are uncorrelated, while all pairs of predictors within the same group have correlation equal to 0.9.
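To make the design of this experiment concrete, the snippet below is a minimal sketch of how such a block-correlated design matrix can be simulated. The group structure (20 uncorrelated groups of 5 predictors, within-group correlation 0.9) is taken from the description above; the sample size, random seed, and variable names are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch (not the authors' code): simulate a Gaussian design with
# 20 uncorrelated groups of 5 predictors each and within-group correlation 0.9.
# The sample size n and the seed are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_groups, group_size, rho_within = 20, 5, 0.9
p = n_groups * group_size              # 100 predictors in total
n = 2000                               # hypothetical total sample size

# Block-diagonal covariance: 5 x 5 blocks with unit variance and
# off-diagonal entries 0.9; zero correlation across groups.
block = np.full((group_size, group_size), rho_within)
np.fill_diagonal(block, 1.0)
Sigma = np.kron(np.eye(n_groups), block)

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
groups = np.repeat(np.arange(n_groups), group_size)   # group label of each column
```

The group labels produced here are what the grouped aggregation rule described next operates on, mapping each locally selected predictor back to its group.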
As before, we assume there are 5 nonzero coefficients $\beta_j$, and these nonzero coefficients lie in 5 different groups. Suppose that $G = \cup_{k \in [K]} \{G_j : j \in E^{(k)}\}$, i.e., $G$ contains the groups with at least one predictor selected by at least one of the $K$ local machines. Our final aggregated model is formed by randomly picking one predictor from each of the selected groups in $G$, each of which consists of highly correlated predictors. The results of our experiment are shown in Figure 7. We see that our proposed method achieves the desired coverage probability. Similar patterns hold for the lengths and power of our confidence intervals, as was already noted for the previous aggregation rule.

Figure 7: Results for the grouped aggregation rule, comparing Dist-SI with splitting. (a) Varying $K$; each local machine has $\lfloor 8000/K \rfloor$ data points. (b) Varying signal strength; the nonzero $\beta_j$ equal $2c\log p$ with random signs for $c = 0.3, 0.5, 0.7, 0.9$ in the four signal regimes. (c) Varying $n_0$, the sample size in the central machine.

References

François Bachoc, Hannes Leeb, and Benedikt Pötscher. Valid confidence intervals for post-model-selection predictors. The Annals of Statistics, 47(3):1475–1504, 2019.

Maria Florina Balcan, Avrim Blum, Shai Fine, and Yishay Mansour. Distributed learning, communication complexity and privacy. In Conference on Learning Theory, JMLR Workshop and Conference Proceedings, 2012.

Heather Battey, Jianqing Fan, Han Liu, Junwei Lu, and Ziwei Zhu. Distributed testing and estimation under sparse high dimensional models. The Annals of Statistics, 46(3):1352, 2018.

Alexandre Belloni, Victor Chernozhukov, and Kengo Kato. Uniform post-selection inference for least absolute deviation regression and other Z-estimation problems. Biometrika, 102(1):77–94, 2015.

Yoav Benjamini and Daniel Yekutieli. False discovery rate adjusted multiple confidence intervals for selected parameters. Journal of the American Statistical Association, 100(469):71–81, 2005.

Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang, and Linda Zhao. Valid post-selection inference. The Annals of Statistics, 41(2):802–837, 2013.

Ali Charkhi and Gerda Claeskens. Asymptotic post-selection inference for the Akaike information criterion. Biometrika, 105(3):645–664, 2018.

Xueying Chen and Min-ge Xie. A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica, pages 1655–1684, 2014.

Edgar Dobriban and Yue Sheng. Distributed linear regression by averaging. The Annals of Statistics, 49(2):918–943, 2021.

William Fithian, Dennis Sun, and Jonathan Taylor. Optimal inference after model selection. arXiv preprint arXiv:1410.2597, 2014.

Zaijing Huang and Andrew Gelman. Sampling for Bayesian computation with large datasets. Available at SSRN 1010107, 2005.

Sangwon Hyun, Max G'Sell, and Ryan Tibshirani. Exact post-selection inference for the generalized lasso path. Electronic Journal of Statistics, 12(1):1053–1097, 2018.

Michael I Jordan, Jason D Lee, and Yun Yang. Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 2019.

Vo Nguyen Le Duy and Ichiro Takeuchi. More powerful conditional selective inference for generalized lasso by parametric programming. Journal of Machine Learning Research, 23(300):1–37, 2022.

Jason Lee, Yuekai Sun, Qiang Liu, and Jonathan Taylor.
Communication-efficient sparse regression: a one-shot approach. arXiv preprint arXiv:1503.04337, 2015.

Jason Lee, Dennis L Sun, Yuekai Sun, and Jonathan Taylor. Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3):907–927, 2016.

Nan Lin and Ruibin Xi. Aggregated estimating equation estimation. Statistics and Its Interface, 4(1):73–83, 2011.

Sifan Liu. An exact sampler for inference after polyhedral model selection. arXiv preprint arXiv:2308.10346, 2023.

Sifan Liu, Jelena Markovic, and Jonathan Taylor. Black-box selective inference via bootstrapping. arXiv preprint arXiv:2203.14504, 2022.

Ryan McDonald, Mehryar Mohri, Nathan Silberman, Dan Walker, and Gideon Mann. Efficient large-scale distributed training of conditional maximum entropy models. Advances in Neural Information Processing Systems, 22, 2009.

Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.

Nicolai Meinshausen, Lukas Meier, and Peter Bühlmann. P-values for high-dimensional regression. Journal of the American Statistical Association, 104(488):1671–1681, 2009.

Stanislav Minsker, Sanvesh Srivastava, Lizhen Lin, and David Dunson. Robust and scalable Bayes via a median of subset posterior measures. The Journal of Machine Learning Research, 18(1):4488–4527, 2017.

Willie Neiswanger, Chong Wang, and Eric P Xing. Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pages 623–632, 2014.

Snigdha Panigrahi. Carving model-free inference. The Annals of Statistics, 51(6):2318–2341, 2023.

Snigdha Panigrahi and Jonathan Taylor. Scalable methods for Bayesian selective inference. Electronic Journal of Statistics, 12(2):2355–2400, 2018.

Snigdha Panigrahi and Jonathan Taylor. Approximate selective inference via maximum likelihood. Journal of the American Statistical Association, 118(544):2810–2820, 2023.

Snigdha Panigrahi, Jelena Markovic, and Jonathan Taylor. An MCMC-free approach to post-selective inference. arXiv preprint arXiv:1703.06154, 2017.

Snigdha Panigrahi, Jonathan Taylor, and Asaf Weinstein. Integrative methods for post-selection inference under convex constraints. The Annals of Statistics, 49(5):2803–2824, 2021.

Snigdha Panigrahi, Jingshen Wang, and Xuming He. Treatment effect estimation with efficient data aggregation. arXiv preprint arXiv:2203.12726, 2022.

Jesse Raffa, Alistair Johnson, Zach O'Brien, Tom Pollard, Roger Mark, Leo Celi, David Pilcher, and Omar Badawi. The global open source severity of illness score (GOSSIS). Critical Care Medicine, 2022.

D. García Rasines and G. A. Young. Splitting strategies for post-selection inference. Biometrika, 110(3):597–614, 2023.

Jonathan Rosenblatt and Boaz Nadler. On the optimality of averaging in distributed statistical learning. Information and Inference: A Journal of the IMA, 5(4):379–404, 2016.

Christoph Schultheiss, Claude Renaux, and Peter Bühlmann. Multicarving for high-dimensional post-selection inference. Electronic Journal of Statistics, 15(1):1695–1742, 2021.

Steven Scott, Alexander Blocker, Fernando Bonassi, Hugh Chipman, Edward George, and Robert McCulloch. Bayes and big data: The consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management, 11(2):78–88, 2016.
Sanvesh Srivastava, Cheng Li, and David Dunson. Scalable Bayes via barycenter in Wasserstein space. The Journal of Machine Learning Research, 19(1):312–346, 2018.

Jonathan Taylor and Robert Tibshirani. Post-selection inference for penalized likelihood models. Canadian Journal of Statistics, 46(1):41–61, 2018.

Xiaoying Tian and Jonathan Taylor. Selective inference with a randomized response. The Annals of Statistics, 46(2):679–710, 2018.

Xiaoying Tian, Snigdha Panigrahi, Jelena Markovic, Nan Bi, and Jonathan Taylor. Selective sampling after solving a convex problem. arXiv preprint arXiv:1609.05609, 2016.

Xiangyu Wang and David Dunson. Parallelizing MCMC via Weierstrass sampler. arXiv preprint arXiv:1312.4605, 2013.

Larry Wasserman and Kathryn Roeder. High dimensional variable selection. The Annals of Statistics, 37(5A):2178, 2009.

Yuchen Zhang, Martin Wainwright, and John Duchi. Communication-efficient algorithms for statistical optimization. Advances in Neural Information Processing Systems, 25, 2012.

Martin Zinkevich, Markus Weimer, Lihong Li, and Alex Smola. Parallelized stochastic gradient descent. Advances in Neural Information Processing Systems, 23, 2010.

Tijana Zrnic and Michael I Jordan. Post-selection inference via algorithmic stability. The Annals of Statistics, 51(4):1666–1691, 2023.