# distributed_statistical_inference_under_heterogeneity__65d1cc86.pdf Journal of Machine Learning Research 24 (2023) 1-56 Submitted 11/22; Revised 9/23; Published 12/23 Distributed Statistical Inference under Heterogeneity Jia Gu GUJIA@PKU.EDU.CN Center for Statistical Science Peking University Bejing, China Song Xi Chen SONGXICHEN@PKU.EDU.CN School of Mathematical Science, Guanghua School of Management and Center for Statistical Science, Peking University Beijing, China Editor: Tong Zhang We consider distributed statistical optimization and inference in the presence of heterogeneity among distributed data blocks. A weighted distributed estimator is proposed to improve the statistical efficiency of the standard split-and-conquer estimator for the common parameter shared by all the data blocks. The weighted distributed estimator is at least as efficient as the would-be full sample and the generalized method of moment estimators with the latter two estimators requiring full data access. A bias reduction is formulated for the weighted distributed estimator to accommodate much larger numbers of data blocks (relaxing the constraint from K = o(N 1/2) to K = o(N 2/3), where K is the number of blocks and N is the total sample size) than the existing methods without sacrificing the statistical efficiency at the same time. The mean squared error bounds, the asymptotic distributions, and the corresponding statistical inference procedures of the weighted distributed and the debiased estimators are derived, which show an advantageous performance of the debiased weighted estimators when the number of data blocks is large. Keywords: bias correction; distributed inference; heterogeneity; split-and-conquer method; weighted estimation. 1 Introduction Modern big data have brought new challenges to statistical inference. One such challenge is that despite the sheer volume of the data, full communication among the data points may not be possible due to either the cost of data communication or the privacy concern. The distributed or the splitand-conquer method has been proposed to divide the full data sample into smaller size data blocks to avoid data communication. The split-and-conquer estimator is also suited to situations where the data are naturally divided into data blocks and data communication among the data blocks are prohibited due to privacy concern. The split-and-conquer estimation has been considered in Lin and Xi (2010) for the U-statistics, Zhang et al. (2013) for the statistical optimization, Chen and Xie (2014) for the generalized linear models, Volgushev et al. (2017) and Chen et al. (2019) for the quantile regression, Battey et al. (2018) for high dimensional testing and estimation, and Chen and Peng (2021) for asymptotic symmetric statistics (Lai and Wang, 1993). Bootstrap resampling-based methods had been introduced to facilitate statistical inference. Kleiner et al. (2011) proposed the bag-of-little bootstrap (BLB) method for the plug-in estimators by making up economically the full 2023 Jia Gu and Song Xi Chen. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-1274.html. GU AND CHEN sample for the distributed inference. Sengupta et al. (2015) suggested a sub-sampled double bootstrap method designed to improve the computational efficiency of the BLB. Chen and Peng (2021) proposed the distributed and the pseudo-distributed bootstrap methods with the former conducting the resampling within each data block while the latter directly resampling the distributed statistics. Privacy has been a major concern in big data applications where people are naturally reluctant to share the raw data to form a pool of big data as practised in the traditional full sample estimation. However, the data holders may like to contribute summary statistics without having to give away the full data information. Federated Learning or the distributed inference with a central host has been proposed to accommodate such reality (Mc Mahan et al., 2017; Yang et al., 2019; Li et al., 2020; Kairouz et al., 2021), where summary statistics of the data blocks or the gradients of the objective functions associated with the private data blocks are submitted to a central host for forming aggregated estimation or computation. Homogeneous distribution among the data blocks is assumed in the majority of the statistical distributed inference studies with a few exceptions (Zhao et al., 2014; Duan et al., 2021). Federated Learning, on the other hand, was introduced to mitigate challenges arising from classical distributed optimization. In particular, heterogeneous or non-IID distributed data across different data blocks is one of the defining characteristics in the Federated Learning (Li et al., 2020; Kairouz et al., 2021). Indeed, it is natural to expect the existence of heterogeneity, especially for data stored in different locations or generated by different stochastic mechanisms, for instance, mobile phones of different users. But few works have focused on the asymptotic statistical properties of the estimator, especially in a heterogeneous setting. Main Contributions. This paper considers distributed statistical inference under heterogeneous distributions among the data blocks, where there is a common parameter shared by the distributions of the data blocks and data-block-specific heterogeneous parameters. It is noted that Duan et al. (2021) also considered a heterogeneous setting but under a fully parametric framework. Specifically, the main contributions of this paper are as follows: Our study reveals that in the presence of heterogeneity the full sample estimator of the common parameter obtained by requiring full data access, can be less efficient than the split-andconquer estimator. It is found that this phenomenon disappears if the objective function of the statistical optimization satisfies a generalized second-order Bartlett s identity. We propose a weighted distributed (WD) estimator, which is asymptotically at least as efficient as the full sample and the split-and-conquer estimators when the number of data blocks K = o(N1/2), where K is the number of data blocks and N is the total sample size. The mean squared error bound and the asymptotic distribution of the proposed weighted distributed estimator are derived, as well as the asymptotic equivalence between the weighted distributed and the generalized method of moment (GMM) estimator (Hansen, 1982). We also propose a debiased weighted distributed estimator with a data splitting mechanism on each data block to remove the dependency between the bias correction and the weights used to tackle the heterogeneity. The debiased weighted distributed estimator is asymptotically as efficient as the WD estimator but allows quicker growth for the number of blocks K = o(N2/3). The bias correction is also applied to the split-and-conquer formulation, leading to a more communication-efficient debiased split-and-conquer estimator. DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY 2 Preliminaries Suppose that there is a large data sample of size N, which is divided into K data blocks of sizes {nk}K k=1 such that N = PK k=1 nk and let n = NK 1 be the average sample size of the data blocks. For the relative sample sizes among data blocks, we assume the following Assumption 1 There exist c, C > 0 such that c nk1/nk2 C for all pairs of (k1, k2). The k-th data block consists of a sub-sample {Xk,i}nk i=1 which are independent and identically distributed (IID) random vectors from a probability space (Ω, F, P) to (Rd, Rd) with Fk as the distribution. The K distributions {Fk}K k=1 share a common parameter φ Rp1, while each Fk has another parameter λk Rp2 specific to Fk. The parameters of interests in the k-th block are θk = (φT , λT k )T Rp where p = p1 + p2, and the overall parameters of interests are θ = (φT , λT 1 , λT 2 , ..., λT K)T Rp1+Kp2. Suppose there is a common objective function M(X; φ, λk) that is convex with respect to the parameter θk = (φT , λT k )T and facilitates the statistical optimization in each data block. In general, the loss function can be made block-specific, say Mk( ; ) function. Indeed, the presence of the heterogeneous local parameters {λk}K k=1 leads to different Mk(x, φ) = M(x; φ, λk) for the inference on φ, which connects to the multi-task learning. In the k-th data block the true parameter θ k = (φ T , λ T k )T is defined as the unique minimum of the expected objective function, namely θ k = arg min θk Θk EFk (M(Xk,1; φ, λk)) . The true common parameter φ appears in all θ k, and the block-specific {λ k}K k=1 may differ from each other. The entire true parameters θ = (φ T , λ T 1 , , λ T K )T , can be also identified as θ = arg min θ Θ k=1 γk EFk (M(Xk,1; φ, λk)) , where γk = limk (nk/N) is the asymptotic proportion of the local sample size nk relative to the total sample size N such that PK k=1 γk = 1. Our study is conducted under the semiparametirc setting, which does not assume knowledge of the data distribution of X but knowing the functional form of M( ; ), and hence semiparametric. It is a setting situated between the parametric and the nonparametric settings. Parametric setting means a full knowledge in the data generation distribution, which means that in the context of the M-estimation one knows the form of each Pk up to some parameter θ so that Pk = Pk(θ), where Pk(θ) is a parametrized form of Pk. In contrast, the nonparametric setting is where neither the form of the target quantities nor the distribution of X is known. The difficulty with the nonparametric setting is due to not knowing the forms of the data distribution and the quantities of interests. If the data could be shared across the data blocks, we would attain the conventional full sample estimator ˆθfull = arg min θ Θ i=1 M(Xk,i; φ, λk), where ˆθfull = (ˆφT full, ˆλT 1,full, , ˆλT K,full)T , which serves as a benchmark for distributed estimators. The estimating equations for the full sample estimators are (PK k=1 Pnk i=1 ψφ(Xk,i; φ, λk) = 0, Pnk i=1 ψλ(Xk,i; φ, λk) = 0 k = 1, ..., K, (1) GU AND CHEN where ψφ(Xk,i; φ, λk) = M(Xk,i; φ, λk)/ φ and ψλ(Xk,i; φ, λk) = M(Xk,i; φ, λk)/ λk are the score functions. The above full sample estimation is not attainable for the distributed situations due to privacy or the costs associated with the data communication. The distributed estimation first conducts local estimation on each data block, namely the local estimator ˆθk = (ˆφT k , ˆλT k )T = arg minθk Θk Pnk i=1 M(Xk,i; θk) with the corresponding estimating equations (Pnk i=1 ψφ(Xk,i; φk, λk) = 0, Pnk i=1 ψλ(Xk,i; φk, λk) = 0, (2) and the split-and-conquer estimator for the common parameter φ is k=1 nk ˆφk. (3) The heterogeneity among the distributions of the data blocks calls for study the relative efficiency and the estimation errors, which are the focus of this paper. We are to show that the splitand-conquer estimator (3) may not be the best formulation for estimating φ. Throughout this paper, unless otherwise stated, 2 represents the ℓ2 norm of a vector and a matrix. We will use C and Ci to denote positive constants independent of (nk, K, N). An important question is the efficiency and the estimation errors of the split-and-conquer estimator ˆφSa C relative to the full sample estimator ˆφfull. For the homogeneous case, Chen and Peng (2021) found that for the asymptotic symmetric statistics, the split-and-conquer estimator (3) attains the same efficiency as the full sample estimator in the non-degenerate case but encounters an efficiency loss in the degenerate case1, due to a lack of communication among different data blocks. Zhang et al. (2013) derived the mean squared error bound for the split-and-conquer estimator in the homogeneous case and showed that whenever K = O(N1/2), the split-and-conquer estimator achieves the best possible rate of convergence when all N data are accessible. Consider the estimating equations of the full sample statistical optimization PK k=1 Pnk i=1 ψφ(Xk,i; φ, λk) Pn1 i=1 ψλ(X1,i; φ, λ1) ... Pn K i=1 ψλ(XK,i; φ, λK) Let Ψθ(θk) = θk E (M(Xk,1; θk)) and Jk(θk) = T θkΨθ(θk) be the first and second order gradients of the k-th population objective function, respectively, whose matrix forms are: Ψθ(θk) = (Ψφ(θk)T , Ψλ(θk)T )T , Jk(θk) = Ψφ φ(θk) Ψλ φ(θk) Ψφ λ(θk) Ψλ λ(θk) 1. The symmetric statistics admits the expansion Tn = θ+n 1 Pn i=1 α(Xi; F)+n 2 P 1 i 0 cases, respectively. DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY Let Jφ|λ(θk) = Ψφ φ(θk) Ψλ φ(θk)Ψλ λ(θk) 1Ψφ λ(θk), Jλ|φ(θk) = Ψλ λ(θk) Ψφ λ(θk)Ψφ φ(θk) 1Ψλ φ(θk), Sφ(Xk,i; θk) = ψφ(Xk,i; θk) Ψλ φ(θk)Ψλ λ(θk) 1ψλ(Xk,i; θk) and Sλ(Xk,i; θk) = ψλ(Xk,i; θk) Ψφ λ(θk)Ψφ φ(θk) 1ψφ(Xk,i; θk). Then, apply Taylor s expansion to obtain (see Appendix A.1) N Jφ|λ(θ k) i=1 Sφ(Xk,i; θ k) + op(N 1/2). For the local estimator (ˆφk, ˆλk) that solves (2), the same derivation leads to (ˆφk φ = n 1 k Jφ|λ(θ k) 1 Pnk i=1 Sφ(Xk,i; θ k) + op(n 1/2 k ), ˆλk λ k = n 1 k Jλ|φ(θ k) 1 Pnk i=1 Sλ(Xk,i; θ k) + op(n 1/2 k ). Our analysis requires the following conditions. Assumption 2 (Identifiability) The parameters θ k = (φ , λ k) is the unique minimizer of Mk(θk) = E(M(Xk,1; θk)) for θk Θk. Assumption 3 (Compactness) The true parameter θ k is an interior point of the parameter space Θk which is a compact and convex set in Rp, and supθk Θk θk θ k 2 r for all k 1 and some r > 0. The true common parameter φ is an interior point of a subset Φ Θk. Assumption 4 (Local strong convexity) The population objective function on the k-th data block Mk(θk) = E(M(Xk,1; θk)) is twice differentiable, and there exists a constant ρ > 0 such that 2 θk Mk(θ k) ρ Ip p. Here A B means A B is a positive semi-definite matrix. Assumption 5 (Smoothness I) The objective function on the k-th data is twice differentiable with respect to θk Uk, where Uk = {θk | θk θ k 2 ρ} and ρ is a positive constant. There are positive constants R, L, v and v1 such that E θk M(Xk,1; θ k) 2v1 2 R2v1 and E 2 θk M(Xk,1; θ k) 2 θk Mk(θ k) 2v 2 L2v for all k 1. There also exist a positive function G( ) and a corresponding positive constant G such that 2 θk M(x; θk) 2 θk M(x; θ k) 2 G(x) θk θ k 2 for all θk, θ k Uk and x Rd, and E(G(Xk,1)2v) G2v. Assumptions 1-5 are standard ones on the parameter space and population objective functions for the homogeneous case (Jordan et al., 2019). In the heterogeneous case, Duan et al. (2021) requires the parameter space for the common parameter to be bounded, i.e. φ φ r under a fully parametric setting, while we need the overall parameter space to be bounded. The stronger condition is needed since we do not fully specify the distributions {Fk}K k=1 and it will be used when we derive the mean squared error bound for the proposed weighted distributed estimator in Section 4. Besides, since the differentiability of the objective function is assumed locally, we need Assumption 5 to define a high probability event in (34), under which all the derivatives are well-defined. GU AND CHEN 3 Full Sample versus split-and-conquer Estimation When the full data communication is available, one would in general prefer making the estimation based on the pooled data, namely solving the estimating equations (1) rather than the local estimating equations (2). This choice is based on the common belief that the estimator ˆφfull utilizes the full sample information and allow full communications among data blocks. In contrast, for distributed data, some information requiring interactions between observations may be prohibited (the degenerate U-statistics), and thus may reduces the statistical efficiency of the Sa C estimator ˆφSa C as compared with ˆφfull. In the homogeneous setting, there have been works justifying this reality (Zhang et al., 2013; Chen and Peng, 2021). However, we will show in this section that this is not necessarily the case under heterogeneity via a theorem and a concrete example. It is worth mentioning that we assume K being fixed in the following Proposition 1 and Theorem 2 for simplicity of formulating the asymptotic variance of the estimators, which helps us to motivate the construction of the weighted distributed estimator. We will show more general results and allow diverging K along with N in the subsequent theoretical results and the supplementary material (see Lemma B.2 for the uniform consistency of {ˆθk}K k=1 under diverging K and the relationship between the divergence rate and the smoothness factors v, v1). In particular, we will discuss how to improve the divergence rate of K in Section 5. Proposition 1 Under Assumptions 1 - 4 and Assumption 5 with v, v1 1, and if K is fixed, then ˆθk θ k and ˆθfull θ in probability; ˆφSa C = (1/N) PK k=1 nk ˆφk and ˆφfull are consistent to φ . Theorem 2 Under Assumptions 1 - 4 and Assumption 5 with v, v1 2, if K is fixed and nk/N γk (0, 1) for a set of constants {γk}K k=1, then N(ˆφSa C φ ) N k=1 γk Jφ|λ(θ k) 1Σk(θ k)Jφ|λ(θ k) 1 ! N(ˆφfull φ ) N k=1 γk Jφ|λ(θ k)) 1( k=1 γkΣk(θ k))( k=1 γk Jφ|λ(θ k)) 1 ! where Σk = Var (Sφ(Xk,1; θ k)). Theorem 2 suggests that the asymptotic variance of the full sample estimator may surpass that of the Sa C estimator. To appreciate this, define V (Σ, A) = (AT ) 1ΣA 1 as a mapping from Sp1 p1 ++ GL(Rp1) to Sp1 p1 ++ , where Sp1 p1 ++ and GL(Rp1) denote the symmetric positive definite matrices and invertible real matrices of order p1, respectively. Since ΣK k=1γk = 1 and γk > 0, the asymptotic variance of ˆφSa C can be interpreted as a convex combination of function values {V (Σk(θ k), Jφ|λ(θ k))}K k=1 and that of ˆφfull can be expressed as V (PK k=1 γkΣk(θ k), PK k=1 γk Jφ|λ(θ k)). However, V ( , ) is not convex with respect to its arguments (Σ, A), which means that k=1 γk Jφ|λ(θ k)) 1( k=1 γkΣk(θ k))( k=1 γk Jφ|λ(θ k)) 1 k=1 γk Jφ|λ(θ k) 1Σk(θ k)Jφ|λ(θ k) 1. In other words, ˆφfull is not necessarily more efficient than ˆφSa C under heterogeneity. In contrast, in the homogeneous setting (Jφ|λ(θ k), Σk(θ k)) are all equal for different k, then the non-convexity of DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY V ( , ) does not matter anymore and the first-order equivalence between the Sa C estimator and the full sample estimator holds as long as K = o(N1/2). A conclusion from Theorem 2 is that naively using the pooled data to solve the statistical optimization problem may not be a good choice when the underlying distributions of the data are heterogeneous. To gain confirmation of Theorem 2, we consider the errors-in-variables model. Suppose there are K independent data blocks {(Xk,i, Yk,i)}n i=1 for k = 1, 2..., K, where (Xk,i, Yk,i) are IID and generated from Xk = Zk + ek, Yk = φ + λ k Zk + fk, (6) where {Zk}K k=1 are random variables whose measurements {(Xk, Yk)}K k=1 are subject to errors {(ek, fk)}K k=1, and (ek, fk) are bivariate normally distributed with zero mean and covariance matrix σ2I2 and is independent of Z. Obviously, φ is the common parameter across all data blocks while λ k(λ k > 0) represents the block specific parameter. The condition Var(e) = Var(f) is assumed to avoid any identification issue arising when Z is also normally distributed (Reiersol, 1950) 2. We consider the approach in Example 5.26 of van der Vaart (1999) as detailed in Appendix A.2, which leads to the M-function M(Xk, Yk; θk) = 1 (1 + λ2 k)(λk Xk (Yk φ))2. (7) For simplicity we assume K = 2, then from Theorem 2 we have Var(ˆφfull) Var(Z) 2 1 1+λ 2 1 + 1 1+λ 2 2 + σ4(E(Z))2 2 (1+λ 2 1 )2 + 2 (1+λ 2 2 )2 ( 1 1+λ 2 1 + 1 1+λ 2 2 )2 Var(ˆφSa C) σ2E(Z2) Var(Z) (1+λ 2 1 )+(1+λ 2 2 ) 2 + σ4(E(Z))2 Note that the coefficients to σ2E(Z2)/Var(Z) in the first terms of the variances are harmonic and arithmetic means of {1 + λ 2 1 , 1 + λ 2 2 }, respectively. Hence, the coefficient in the first term of Var(ˆφSa C) is larger than that in Var(ˆφfull). The second terms of the variances involves (E(Z))2 as a multiplicative factor. Thus, if the unobserved Z has zero mean, the full-sample estimator will be at least as good as the Sa C estimator in terms of variance when the full sample size N goes to infinity. However, the situation may change when E(Z) = 0, because the second term of Var(ˆφfull) has a factor which is the square of a ratio between the quadratic mean and the arithmetic mean of { 1 1+λ 2 1 , 1 1+λ 2 2 }. The factor is larger than or equal to 1, where the equality holds if and only if λ 1 = λ 2, namely the homogeneous case. In the heterogeneous case, by adjusting σ4(E(Z))2 Var2(Z) /σ2E(Z2) Var(Z) , we can find cases such that λ 1 = λ 2 so that the full sample estimator has a larger variance than the Sa C estimator. Appendix C.1 displays such cases. 4 Weighted Distributed Estimator That the full sample estimator ˆφfull under heterogeneity may be less efficient than the simple averaged ˆφSa C suggests that the wisdom formulated in the homogeneous context may not be applicable 2. When Z is normal, (X, Y ) is also jointly normal, whose distribution can be fully characterized by the five parameters (E(X), Var(X), E(Y ), Var(Y ), E(XY )). Now, if we do not require Var(e) = Var(f), there will be six unknown parameters: (Var(e), Var(f), E(Z), Var(Z), φ, λ), leading to identification issues. For more detailed discussions on the errors-in-variables model, see Fuller (1987). GU AND CHEN to the heterogeneous case. How to better aggregate the local estimators {ˆφk}K k=1 for more efficient estimation is the focus of this section. 4.1 Formulation and Results Consider a class of estimators formed by linear combinations of the local estimators {ˆφk}K k=1: {ˆφSa C w | ˆφSa C w = k=1 Wk ˆφk, Wk Rp1 p1, k=1 Wk = Ip1 p1}. We want to minimize the asymptotic variance of ˆφSa C w with respect to the weighting matrices {Wk}K k=1. It may be shown from Theorem 2 that Var(ˆφSa C w ) PK k=1 n 1 k Wk A 1 k Σk(AT k ) 1W T k , where Ak = Jφ|λ(θ k) and Σk = Var (Sφ(Xk,i; θ k)). It is noted that the asymptotic variance is defined via the asymptotic normality of the statistical optimization. For the time being, Ak and Σk are assumed known and we denote Hk = A 1 k Σk(AT k ) 1. We choose the trace operator as a measure of the size of the covariance matrix, which leads to a minimization problem: Minimize {Wk}K k=1 trace K X k=1 n 1 k Wk Hk W T k k=1 Wk = Ip1 p1. (9) It is a convex optimization problem and can be solved via the Lagrangian multiplier method, which gives the optimal weighting matrices W k = (PK s=1 ns H 1 s ) 1nk H 1 k . If we replace the trace with the Frobenius norm in (9), the same solution is attained as shown in Appendix A.3. The split-andconquer estimator with the optimal weights W k is called the weighted distributed estimator and denoted as ˆφWD, which is at least as efficient as ˆφSa C by construction. To compare the statistical efficiency between ˆφfull and ˆφWD, we note that their covariances Var(ˆφfull) k=1 nk Ak)T ( k=1 nkΣk) 1( k=1 nk AT k Σ 1 k Ak , respectively. (10) Define V (Σ, A) = AT Σ 1A, which is a generalized convex function with respect to the matrix inequality shown in Lemma B.1. Applying Jensen s inequality leads to the conclusion that the weighted distributed estimator is at least as efficient as the full sample estimator ˆφfull. Thus, the estimating equations (4) obtained from the first-order derivatives of the simple summation of local objectives Pnk i=1 M(Xk,i; θk) may not be the best formulation. In contrast, the weighted distributed estimator exploits the potential efficiency gain from the heterogeneity by re-weighting of the local estimators, which is why the full sample estimator may not be as efficient as the weighted distributed estimator. 4.2 Likelihood and Quasi-likelihood The above results lead us to wonder whether the weighted distributed estimator can also be more efficient than the full sample estimator under the heterogeneity in a fully parametric setting. The answer is negative as shown below. DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY When the distribution of Xk,i is fully parametric with density function f( ; φ, λk), the Fisher information matrix in the k-th data block is I(θk) = I(φ, λk) = Iφφ Iφλk Iλkφ Iλkλk 2 φ2 log f(Xk,1; θk) 2 φ λT log f(Xk,1; θk) 2 λ φT log f(Xk,1; θk) 2 λ2 log f(Xk,1; θk) and the partial information matrix Iφ|λk = Iφφ Iφλk I 1 λkλk Iλkφ. Now, the objective function for the statistical optimization is M(Xk,i; φ, λk) = log f(Xk,i; φ, λk). Routine derivations show that Σk = Var (Sφ(Xk,1; θ k)) = Iφ|λk and Ak = Jφ|λ(θ k) = Iφ|λk. Hence, Var(ˆφfull) Var(ˆφWD) PK k=1 nk Iφ|λk 1 and Var(ˆφSa C) (1/N2) PK k=1 nk I 1 φ|λk. A direct application of Lemma B.1 shows that PK k=1 nk Iφ|λk 1 (1/N2) PK k=1 nk I 1 φ|λk. Thus, the full sample maximum likelihood estimator automatically adjusts for the heterogeneity and has the same asymptotic efficiency as that of the weighted distributed estimator. Both estimators are at least as efficient as the split-and-conquer estimator ˆφSa C. The same is true for the quasi-likelihood estimation with independent observations (see Appendix A.4). A close examination reveals that the underlying reason for the asymptotic equivalence between the weighted distributed estimator and the likelihood-based full sample estimators is that the two statistical optimization functions satisfy the second-order Bartlett s identity (Bartlett, 1953; Mc Cullagh, 1983): E M(Xk, θ k) M(Xk, θ k)T = E 2M(Xk, θ k) . By the asymptotic variance formula of the estimator and Lemma B.1, it is apparent that Bartlett s identity can be relaxed by allowing a factor γ = 0 such that E M(Xk, θ k) M(Xk, θ k)T = γE 2M(Xk, θ k) . (11) An example of such cases is the least square estimation in the parametric regression with homoscedastic and non-autocorrelated residuals in Appendix A.5. Otherwise, the full sample least square estimator may not be efficient and there is an opportunity for the weighted least square estimation such as the case in the errors-in-variables model (6). Thus, if M(xk, θk) satisfies (11), ˆφfull attains the same statistical efficiency as ˆφWD. 4.3 Relation to Generalized Method of Moment Estimation To further justify the statistical efficiency of the weighted distributed estimation, we consider the generalized method of moment (GMM) estimator (Hansen, 1982), which has certain optimal properties for the semiparametric inference that the weighted distributed estimation can compare with, despite it requires full data sharing. The score functions of the statistical optimization on each data block are aggregated to form the moment equations (Pnk i=1 ψφ(Xk,i; φ, λk) = 0, Pnk i=1 ψλ(Xk,i; φ, λk) = 0, k = 1, ..., K, (12) which have p K estimating equations, where the dimension of θ is p K (K 1)p1. Thus, the parameter is over-identified, offering potential efficiency gain for the generalized method of moment. The GMM estimation based on the moment restrictions (12) solves the minimization problem ˆθGMM = arg min θ Θ ψT N(θ)W0 ψN(θ), GU AND CHEN where W0 = Var( ψN(θ )) 1 is the optimal weighting matrix (Hansen, 1982; Yaron et al., 1996) and i=1 ψφ(X1,i; θ1)T , i=1 ψλ(X1,i; θ1)T , , i=1 ψφ(XK,i; θK)T , i=1 ψλ(XK,i; θK)T )T . Let the first p1 elements of ˆθGMM be ˆφGMM as an estimator of the common parameter. A derivation in Appendix A.6 shows that Var(ˆφGMM) PK k=1 nk Jφ|λΣ 1 k Jφ|λ 1 . Thus, the weighted dis- tributed estimator s asymptotic efficiency is the same as that of ˆφGMM. This is encouraging as the weighted distributed estimator does it without requiring so much data sharing among the blocks. 4.4 Estimation of weights with one round communication To formulate the weighted distributed estimator, we have to estimate the optimal weights W k = PK s=1 ns H 1 s 1 nk H 1 k . As we will show in Theorem 4, the estimation of the weights will not affect the estimation efficiency of the weighted distributed estimator attained in (10). By the structure of W k , we only need to estimate Hk, the leading principal submatrix of order p1 of the asymptotic covariance matrix e Hk of ˆθk. Note that e Hk = ( Ψθ(θ k)) 1 E ψθk(Xk,1; θ k)ψθk(Xk,1; θ k)T ( Ψθ(θ k)) 1 = Hk where Ψθ(θk) = E (ψθk(Xk,1; θk)). We can construct a sandwich type estimator (Stefanski and Boos, 2002) to estimate e Hk and then Hk. The procedure to obtain the weighted distributed estimator is summarized in Algorithm 1. Input: Distributed datasets: {Xk,i, k = 1, ..., K; i = 1, ..., nk} Output: Weighted distributed estimator: ˆφWD 1 In each data block k (k = 1, 2, , K): 2 Solve (2) and obtain ˆθk = (ˆφk, ˆλk) ; 3 Calculate b Hk(ˆθk), which is the leading principal sub-matrix of order p1 of ( θk bΨθk) 1(n 1 k Pnk i=1 Z(Xk,i; ˆθk))( θk bΨθk) T , where Z(x, θk) is defined in Assumption 6 and bΨθk = n 1 k Pnk i=1 ψθk(Xk,i; ˆθk); 4 In a central server: 5 Collect (ˆφk, b Hk(ˆθk) 1) from all the K data blocks; 6 Calculate ˆφ = PK k=1 nk b Hk(ˆθk) 1 1 PK k=1 nk( b Hk(ˆθk)) 1 ˆφk ; 7 ˆφWD = ˆφI(ˆφ Φ) + ˆφSa CI(ˆφ Φ), where ˆφSa C = N 1 PK k=1 nk ˆφk. Algorithm 1: Weighted Distributed estimator Step 7 of the algorithm is necessary since there is no guarantee that after weighting the estimator ˆφ belongs to the set Φ as required in Assumption 3. However, the event {ˆφ Φ} should happen with probability approaching one. Hence, the ˆφSa CI(ˆφ Φ) term is asymptotically negligible. To establish the theoretical properties of the weighted distributed estimator, we impose the following assumptions. DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY Assumption 6 (Smoothness II) Denote Z(x, θk) = θk M(x; θk) θk M(x; θk)T , then there are positive constants ρ and B, and a positive function B(x) such that Z(x, θk) is B(x) Lipschitz continuous with respect to θk, in the sense that Z(x, θk) Z(x, θ k) 2 B(x) θk θ k 2 for all θk, θ k Uk = {θk | θk θ k 2 ρ} and x Rd, and E(B(Xk,1)2v) B2v . Assumption 6 specifies the Lipschitz continuity of the outer product Z(x; θk) with respect to θk, which is to control the estimation errors when we estimate the asymptotic covariance matrices of the local estimators {ˆθk}K k=1. Appendix A.7 shows it is valid for the logistic regression model. Assumption 7 (Boundedness) Denote ΣS,k(θk) = EFk ψθk(Xk,1; θk)ψθk(Xk,1; θk)T , then there exists constants ρσ, c > 0 such that ΣS,k(θ k) 2 ρσ and Hk c Ip1 p1 for k 1, where θ k is the minimizer of the k-th population objective function and Hk = A 1 k Σk A 1 k , where Ak = Jφ|λ(θ k) and Σk = Var (Sφ(Xk,i; θ k)) By Hk s definition, Hk 2 Jk(θ k) 1 2 2 ΣS,k(θ k) 2 ρσρ 2 , implying H 1 k (ρ2 /ρσ)Ip1 p1. The above inequality also leads to the inequality Jk(θ k) 1 2 (c/ρσ)1/2, indicating a finite upper bound for the norm of the Hessian, as assumed in Jordan et al. (2019) and Duan et al. (2021). Theorem 3 Under Assumptions 1 - 4 and 7, and Assumption 5 - 6 with v 2 and v1 4 , for n = NK 1, the mean-squared error of the weighted distributed estimator ˆφWD satisfies E ˆφWD φ 2 2 n K + C2 (L2 + L4) + R2(R2 + R6 + G2) n3 + C4K 1 + L2v The v1 and v appeared in Assumptions 5 - 6 quantify the moments of the first two orders of the gradients of the M-function and their corresponding Lipschitz functions. When the number of data blocks K = O(N1/2) namely K = O(n), the convergence rate of mean squared error of ˆφWD is O((n K) 1), which is the same as the standard full sample estimator. However, when there are too many data blocks such that K n, the convergence rate is reduced to O(n 2). Theorem 4 Under Assumptions 1 - 4 and 7, and Assumptions 5 - 6 with v, v1 2, if K = o(N1/2), then ˆφWD φ )T PK k=1 nk H 1 k ˆφWD φ ) χ2 p1. As mentioned before, K is allowed to diverge with the full sample size at the rate o(N1/2). Although {Hk}K k=1 have bounded spectral norms, PK k=1(nk/N)H 1 k may not converge to a fixed matrix in the presence of heterogeneity. Thus, we can only obtain the asymptotic normality of the standardized N 1/2 PK k=1(nk/N)H 1 k 1/2 (ˆφWD φ ). This is why Theorem 4 is presented in the limiting chi-squared form, which implies that we can construct confidence regions for φ with confidence level 1 α as ( φ | ˆφWD φ)T K X k=1 nk b Hk(ˆθk) 1 ! ˆφWD φ) χ2 p1,α after replacing PK k=1 nk H 1 k with its sample counterpart PK k=1 nk b Hk(ˆθk) 1, where χ2 p1,α is the upper α quantile of the χ2 p1 distribution. Given the weighted distributed estimator of the common GU AND CHEN parameter φ , a natural question is that whether a more efficient estimator of the block-specific λ k can be obtained, if we plug in the weighted distributed estimator to each data block and re-estimate λk: Let ˆλ(2) k be the updated estimator, and Theorem 9 in Appendix A.8 will show that ˆλ(2) k is not necessarily more efficient than the local estimator ˆλk. 5 Debiased Estimator for Diverging K It is noted that K = o(N1/2) is required in Theorems 3 and 4 to attain the O(N 1) leading order mean squared error and the limiting chi-squared distribution of the weighted distributed estimator ˆφWD. A reason for this requirement is that the bias of the local estimator ˆθk is at order Op(n 1 k ), which can not be reduced by the weighted averaging. This leads to the bias of N1/2(ˆφWD φ ) being at the order Op(KN 1/2), which is not necessarily diminishing to zero unless K = o(N1/2). It is worth mentioning that Duan et al. (2021) needed the same K = o(N1/2) order in their maximum likelihood estimation framework to obtain the N1/2-convergence since Li et al. (2003) showed that the maximum likelihood estimator is asymptotically biased when K/n C (0, + ). This calls for a bias reduction step for the local estimators before aggregation to allow for a larger K. To facilitate the bias correction, we have to simplify the notation. Suppose F(θ) is a p 1 vector function, F(θ) is the usual Jacobian whose l-th row contains the partial derivatives of the l-th element of F(θ). Then, the matrices of higher derivatives are defined recursively so that the j-th element of the l-th row of s L(θ) (a p ps matrix) is the 1 p vector fv lj(θ) = fv 1 lj (θ)/ θT , where fv 1 lj is the l-th row and j-th element of v 1F(θ). Let denote the Kronecker product. Using Kronecker product we can express v F(θ) = v F(θ)/( θT θT θT ). Besides, define Mn,k(θk) = n 1 k Pnk i=1 M(Xk,i; θk), H3,k(θk) = E( 2 θkψθk(Xk,1; θk)), Qk(θk) = ( E( θkψθk(Xk,1; θk))) 1 , di,k(θk) = Qk(θk)ψθk(Xk,i; θk) and vi,k(θk) = θkψθk(Xk,i, θk) θkΨθ(θk). Then, the leading order bias of ˆθk (Rilstone et al., 1996) is Bias(ˆθk) = n 1 k Qk(θ k) E (v1,k(θ k)d1,k(θ k)) + 1 2H3,k(θ k)E (d1,k(θ k) d1,k(θ k)) . Let Bk(θk) = Qk(θk) E (v1,k(θk)d1,k(θk)) + 1 2H3,k(θk)E (d1,k(θk) d1,k(θk)) , whose first p1 dimension associated with φ are denoted as B1 k(θk). An estimator of Bk(θk) is ˆBk(θk) = ˆQk(θk) n 1 k i=1 ˆvi,k(θk) ˆdi,k(θk) + 1 2 ˆH3,k(θk)n 1 k i=1 ( ˆdi,k(θk) ˆdi,k(θk)) , (13) where ˆH3,k(θk) = n 1 k Pnk i=1 2 θkψθk(Xk,i; θk), ˆQk(θk) = n 1 k Pnk i=1 θkψθk(Xk,i; θk) 1 , ˆdi,k(θk) = ˆQk(θk)ψθk(Xk,i; θk) and ˆvi,k(θk) = θkψθk(Xk,i; θk). Applying it to each data block, we have the bias-corrected local estimator ˆθk,bc = ˆθk n 1 k ˆBk(ˆθk)1Ek,bc, where Ek,bc = {ˆθk n 1 k ˆBk(ˆθk) Θk}, and the indicator function is to ensure that ˆθk,bc Θk. After the local debiased estimators are obtained, we need to aggregate them via the estimated weights. A direct aggregation will invalidate the bias correction due to the dependence between the estimated weights and the local debiased estimator if they are constructed with the same dataset. DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY The accumulation of dependence over a large number of data blocks can make the bias correction fail. To remove the dependence between the local estimator ˆθk,bc and the estimated local weights c Wk = PK s=1 b Hs(ˆθs) 1 1 b Hk(ˆθk) 1, we divide each local dataset {Xk,i}nk i=1 to two basically equal-sized splits Ds k = {X(s) k,i }nk/2 i=1 , s = 1, 2. For s = 1, 2, we calculate the local estimators ˆθk,s and obtain ˆHk,s(ˆθk,s), which is the first p1 principal sub-matrix of ( θk bΨθk) 1((2/nk) i=1 ψθk(X(s) k,i ; ˆθk,s)ψθk(X(s) k,i ; ˆθk,s)T )( θk bΨθk) T , where bΨθk = (2/nk) Pnk/2 i=1 ψθk(X(s) k,i ; ˆθk,s). We perform the local bias correction to ˆθk,s based on a split with the weight obtained by the other, leading to two debiased estimators of the form k=1 nk ˆHk,s(ˆθk,s) 1 ! 1 K X k=1 nk( ˆHk,s(ˆθk,s)) 1 ˆφbc k,2 |s 1| for s = 1, 2. The two debiased local estimators are averaged to obtain the final debiased weighted distributed estimator, whose procedure is summarized in Algorithm 2. To provide a theoretical guarantee on the bias correction, we need an assumption on the third-order gradient of the M-function (Zhang et al., 2013), which strengthens a part of Assumption 5. Assumption 8 (Strong smoothness) For each x Rp, the third order derivatives of M(x; θk) with respect to θk exist and are A(x) Lipschitz continuous in the sense that ( 2 θkψθk(x; θk) 2 θkψθk(x; θ k))(u u) 2 A(x) θk θ k 2 u 2 2, for all θk, θ k Uk defined in Assumption 5 and u Rp, where A(x) is a positive function, E(A(Xk,i)2v) A2v for some v > 0 and A < . Theorem 5 Under Assumptions 1 - 4 and 7 - 8, and Assumptions 5 - 6 with v, v1 4 , E ˆφd WD φ 2 2 n K + C2 R2(L2 + R2) n2K + C3 G2R2(G2 + R2 + G4) + A2R6 +C5K 1 + L2v where the CBn 3 term characterizes the error due to the estimation of the bias correction terms {Bk(θ k)}K k=1, whose definition can be found in the proof of this theorem presented in the Appendix. The main difference between the upper bounds in Theorem 5 from that in Theorem 3 for the weighted distributed estimator is the disappearance of the O(n 2) term for the weighted distributed estimator, which has been absorbed into the O((n2K) 1 + n 3) terms for the debiased weighted distributed estimator. As shown next, this translates to a more relaxed K = o(N2/3) condition as compared with the K = o(N1/2) condition for the weighted distributed estimator in Theorem 4. GU AND CHEN Input: Distributed datasets: {Xk,i, k = 1, ..., K; i = 1, ..., nk} Output: debiased weighted distributed estimator: ˆφd WD 1 In each data block k (k = 1, 2, , K): 2 Split the local dataset into two equal sized subsets: Ds k = {X(s) k,i }nk/2 i=1 , s = 1, 2 ; 3 Solve (2) based on Ds k and obtain ˆθk,s = (ˆφk,s, ˆλk,s) for s = 1, 2 ; 4 Calculate b Hk,s(ˆθk,s) based on Ds k and ˆθs k for s = 1, 2; 5 Calculate ˆθbc k,s = ˆθk,s 2n 1 k ˆBk,s(ˆθk,s)1Ek,bc,s using formula (13) for s = 1, 2, where Ek,bc,s = {ˆθk,s 2n 1 k ˆBk,s(ˆθk,s) Θk} ; 6 In a central server: 7 Collect {ˆφbc k,s, b Hk,s(ˆθk,s) 1, s = 1, 2} from all the K data blocks; 8 Construct ˆφs = PK k=1 nk b Hk,s(ˆθk,s) 1 1 PK k=1 nk b Hk,s(ˆθk,s) 1 ˆφbc k,2 |s 1|; 9 Calculate ˆφd WD s = ˆφs I(ˆφs Φ) + K 1 PK k=1 nk ˆφbc k,2 |s 1|I(ˆφs Φ) for s = 1, 2; 10 ˆφd WD = 2 1 P2 s=1 ˆφd WD s . Algorithm 2: debiased Weighted Distributed (d WD) Estimator Theorem 6 Under the conditions required by Theorem 5, if K = o(N2/3), (ˆφd WD φ )T K X k=1 nk Hk(θ k) 1 ! (ˆφd WD φ ) d χ2 p1. Theorem 6 is also formulated in the chi-squared distribution form for the same reason when we formulate Theorem 4, and similar confidence region with confidence level 1 α can be constructed as n φ | ˆφd WD φ)T {PK k=1 nk Hk(ˆθk) 1} ˆφd WD φ) χ2 p1,α o . The fact that the confidence regions of debiased weighted distributed and weighted distributed estimators use the same standardizing matrix PK k=1 nk b Hk(ˆθk) 1 reflects that both estimators have the same estimation efficiency. However, the debiased version has a more relaxed constraint on K = o(N2/3) than that of the WD estimator requiring K = o(N1/2) . Both the debiased and non-debiased weighted distributed estimators are communication efficient as they only require one round of communication. When the communication budget is strictly limited, people may only share the debiased estimators without transmitting the weights. In this case, one may consider the following debiased split-and-conquer estimator ˆφd Sa C = N 1 K X k=1 nk(ˆφk n 1 k ˆB1 k(ˆθk)1Ek,bc), (14) which only performs bias correction and may be preferable when the heterogeneity is not severe. The asymptotic property of ˆφd Sa C is summarized in the following theorem. Theorem 7 Under the conditions required by Theorem 5, if K = o(N2/3), the debiased split-andconquer estimator ˆφd Sa C satisfies that (i) E ˆφd Sa C φ 2 2 C1/(n K) + C2/(n2K) + C3/n3 and (ii) N2(ˆφd Sa C φ )T PK k=1 nk Hk(θ k) 1 (ˆφd Sa C φ ) χ2 p1. DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY The corresponding confidence region with confidence level 1 α can be constructed as {φ | N2 ˆφd Sa C φ)T PK k=1 nk b Hk(ˆθk) 1 ˆφd Sa C φ) χ2 p1,α}. It is noted that the debiased version of the split-and-conquer estimator ˆφd Sa C has the same asymptotic distribution as that of ˆφSa C, but under a much more relaxed constraint on the divergence rate of K. Hence, the confidence regions based on the split-and-conquer estimator can be constructed in the same way as that based on the weighted distributed estimator with ˆφd Sa C replaced by ˆφSa C. To compare with the subsampled average mixture method (SAVGM) estimator proposed in Zhang et al. (2013), which also performs local bias correction but under the homogeneous setting, we have the following corollary to Theorem 7. Corollary 8 Under the homogeneous case such that {Xk,i, k = 1, ..., K, i = 1, ..., n; } are IID distributed, and the assumptions required by Theorem 5, E ˆθd Sa C θ 1 2 2 2E θ1Ψθ(θ 1) 1ψθ1(X1,1; θ 1) 2 2 where θ 1 is the true parameter for all the K data blocks. The SAVGM estimator resamples rnk data points from each data block k for a r (0, 1) to obtain a local estimator ˆθk,r based on the sub-samples, and has the form Its mean squared error bound as given in Theorem 4 of Zhang et al. (2013) is E θSAVGM θ 1 2 2 2 + 3r (1 r)2 E θ1Ψθ(θ 1) 1ψθ1(X1,1; θ 1) 2 2 Thus, the mean squared error bound (16) of the SAVGM estimator has an inflated factor (2+3r)(1 r) 2/2 > 1 for r (0, 1) when compared with that of the d Sa C estimator, although it is computationally more efficient than the debiased split-and-conquer and debiased weighted distributed estimators as it only draws one subsample in its resampling. For more comparisons between the debiased split-and-conquer estimator and one-step estimators proposed by Huang and Huo (2019), see Appendix A.9. To facilitate an overall comparison among the existing and the proposed estimators, Table 1 summarizes the non-asymptotic MSE rates of the estimators along with the details on the smoothness condition and the restriction on K, and the statue of statistical efficiency relative to the full sample GMM estimator. It is noted that the smoothness parameters v and v1 can be large if the corresponding random variables are of thin tails, such as the sub-exponential or sub-gaussian distributions, as assumed in Jordan et al. (2019) and Duan et al. (2021). As a result, the smoothness condition becomes trivial, and the Kn min(v,v1/2) term that appeared in the non-asymptotic rates can be ignored. 6 Numerical Results The purpose of this section is to examine the numerical performances of the estimators via both the simulation study in Section 6.1 and the real data analysis in Section 6.2. GU AND CHEN Table 1: MSE bounds in estimating the common parameter φ by the ideal full sample M-estimator (Full), the distributed Sa C, d Sa C, WD, d WD and SAVGM estimators, and the estimator in Chen and Peng (2021) (CP). The column headed by Smoothness states the needed moment restriction in Assumption 5. The column headed by K further states the condition on the block size under which the asymptotic normality of the estimator holds, while that by Efficiency indicates if an estimator is statistically efficient ( ) or not ( ) relative to the GMM estimator ˆφGMM. For the CP estimator, τ1 1 and τ1 = 1 under the current M-estimation setting. Estimator Smoothness Non-asymptotic Regime Asymptotic Regime MSE bound K Efficiency Full v, v1 1 C1(n K) 1 + C2(n K) 2 - Sa C v, v1 2 C1(n K) 1 + C2n 2 o(N1/2) CP - C1(n K) 1 + C2(n2K) 1 + C3n 2τ1 o(N1 1 2τ1 ) WD v 2, v1 4 C1(n K) 1 + C2n 2 + C3Kn min(v,v1/2) o(N1/2) d Sa C v, v1 4 C1(n K) 1 + C2n 2K 1 + C3n 3 o(N2/3) SAVGM v, v1 4 C1(n K) 1 + C2n 2K 1 + C3n 3 o(N2/3) d WD v, v1 4 C1(n K) 1 + C2n 2K 1 + C3n 3 + C3Kn min(v,v1/2) o(N2/3) 6.1 Simulation study We report results from simulation experiments designed to verify the theoretical findings made in the previous sections, which was to evaluate the numerical performance of the proposed weighted distributed (WD), debaised split-and-conquer (d Sa C) and debiased weighted distributed (d WD) estimators of the common parameter and compare them with the existing split-and-conquer (Sa C) and subsampled average mixture method (SAVGM) (with subsampling rate r = 0.05) estimators. Although Zhang s SAVGM estimator (Zhang et al., 2013) was proposed under the homogeneous setting, but since its main bias correction is performed locally on each data block k as shown in (15), similar theoretical bounds as (16) can be derived without much modifications on the original proof. Throughout the simulation experiments, the results of each simulation setting were based on B = 500 number of replications and were conducted in R with a 10-core Intel(R) Core(TM) i9-10900K @3.7 GHz processor. We evaluated the numerical performance of the five estimators for the common parameter φ under a logistic regression model. For each of K data blocks with K {10, 50, 100, 250, 500, 1000, 2000}, {(Xk,i; Yk,i)}n i=1 Rp {0, 1} were independently sampled from the following model: Xk,i N(0p 1, 0.752Ip p) and P(Yk,i = 1 | Xk,i) = exp(XT k,iθ k) 1 + exp(XT k,iθ k), where θ k = (φ T , λ T k )T , φ = 1, λ k = (λ k,1, λ k,2, , λ k,p2)T and λ k,j = ( 1)j10(1 2(k 1)/(K 1)). The sample sizes of the data blocks were equal at n = NK 1 with N = 2 106. Two levels of the dimension p2 = 4 and 10 of the nuisance parameter λk were considered. Due to space limit, we only report the set of result with p2 = 10 in the main paper. See Appendix C.2 for the result with p2 = 4 and Appendix A.10 for a derivation of the bias correction formula for the logistic model. Figure 1 reports the root mean squared errors and absolute bias of the estimators when p2 = 10. It is observed that the weighted distributed estimator and the two debiased estimators had smaller DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY root mean squared errors than those of the Sa C and SAVGM for almost all the simulation settings. The classical split-and-conquer estimator fared better than Zhang s SAVGM estimator as K became larger, which is due to the extra variation introduced by the subsampling method as indicated in (16), especially when K is large (the local sample size n is small). It was evident that the WD estimator had much smaller root mean squared errors than the Sa C and SAVGM estimators for all the block number K, realizing its theoretical promises. In most cases, the WD estimator had smaller bias than the Sa C estimator although it was not debiased. The WD estimator was advantageous for K 250. In comparison, both bias corrected d WD and d Sa C were very effective in reducing the bias of the WD and Sa C estimators, respectively, especially for larger K when the bias was more severe. The d WD attained the smallest root mean squared error and the bias in all settings, suggesting the need for conducting both weighting and the bias correction in the distributed inference especially for large K. These empirical results were consistent with Theorems 3 and 5, namely the leading root mean squared error term of the WD estimator changes from O((n K) 1) to O(n 2) when K surpasses the local sample size n, while the leading term of the d WD is still O((n K) 1) until K is much larger than n2. (a) Absolute Bias (p2 = 10) (b) RMSE (p2 = 10) Figure 1: Average simulated bias (a) and the root mean squared errors (RMSE) (b) of the weighted distributed (WD) (red circle), the split-and-conquer(Sa C) (blue triangle), the debiased split-andconquer (d Sa C) (green square), the debiased weighted distributed (d WD) (purple cross), the subsampled average mixture SAVGM (pink square cross) estimators, with respect to the number of data block K for the logistic regression model with the dimension p2 of the nuisance parameter λk being 10, and the full sample size N = 2 106. We also evaluated the coverage probabilities and widths of the 1 α (α = 0.01, 0.05, 0.1) confidence intervals (CIs) of the common parameter based on the asymptotic normality as given after Theorems 4 and 6. The SAVGM estimator was not included as its asymptotic distribution was not made available in Zhang et al. (2013). Table 2 reports the empirical coverage and the average width of the CIs. It is observed that the four types of the CIs all had quite adequate coverage levels when K 100. However, for K 250, the Sa C CIs first started to lose coverage, followed by those of the WD, while the CIs of the d Sa C and d WD estimators can hold up to the promised coverage for all cases of K. Although the d Sa C CIs had comparable coverages with the d WD CIs, their widths were much wider than those of the d WD. This was largely due to the fact that GU AND CHEN the weighted averaging conducted in the weighted distributed estimation reduced the variation and hence the width of the CIs. The widths of the WD CIs were largely the same with those of the d WD, and yet the coverage levels of the d WD CIs were much more accurate indicating the importance of the bias correction as it shifted the CIs without inflating the width. Table 2: Coverage probabilities and widths (in parentheses, multiplied by 100) of the 1 α confidence intervals for the common parameter φ in the logistic regression model based on the asymptotic normality of the split-and-conquer (Sa C), the weighted distributed (WD), the debiased splitand-conquer (d Sa C) and the debiased weighted distributed (d WD) estimators with respect to the number of data blocks K. The dimension p2 of the nuisance parameter λk is 10 and total sample size N = 2 106 K Sa C WD d Sa C d WD 1 α 0.99 0.95 0.90 0.99 0.95 0.90 0.99 0.95 0.90 0.99 0.95 0.90 10 0.99 0.94 0.88 1.00 0.96 0.92 1.00 0.94 0.88 1.00 0.96 0.92 (3.05) (2.32) (1.95) (2.41) (1.84) (1.54) (3.05) (2.32) (1.95) (2.42) (1.84) (1.54) 50 0.99 0.93 0.87 0.99 0.95 0.88 0.98 0.94 0.88 0.99 0.96 0.88 (2.94) (2.24) (1.88) (2.29) (1.74) (1.46) (2.94) (2.24) (1.88) (2.29) (1.74) (1.46) 100 0.97 0.89 0.84 0.97 0.93 0.87 0.98 0.95 0.90 0.98 0.94 0.89 (2.93) (2.23) (1.87) (2.28) (1.74) (1.46) (2.93) (2.23) (1.87) (2.29) (1.74) (1.46) 250 0.89 0.72 0.63 0.98 0.92 0.87 1.00 0.97 0.90 1.00 0.96 0.90 (2.94) (2.24) (1.88) (2.28) (1.74) (1.46) (2.94) (2.24) (1.88) (2.29) (1.74) (1.46) 500 0.51 0.28 0.18 0.93 0.81 0.70 0.99 0.94 0.90 0.98 0.94 0.88 (2.97) (2.26) (1.90) (2.29) (1.74) (1.46) (2.97) (2.26) (1.90) (2.30) (1.75) (1.47) 1000 0.00 0.00 0.00 0.66 0.37 0.28 0.99 0.95 0.90 0.99 0.96 0.89 (3.04) (2.31) (1.94) (2.30) (1.75) (1.47) (3.04) (2.31) (1.94) (2.34) (1.78) (1.49) 2000 0.00 0.00 0.00 0.02 0.00 0.00 0.99 0.96 0.90 0.99 0.93 0.87 (3.22) (2.45) (2.06) (2.34) (1.78) (1.49) (3.22) (2.45) (2.06) (2.40) (1.82) (1.53) In addition to the simulation experiments on the statistical properties of the estimators, the computation efficiency of the estimators was also evaluated. Table 3 reports the average CPU time per simulation run based on 500 replications of the five estimators for a range of K of the nuisance parameter for the logistic regression model with the total sample size N = 2 106 and p2 = 10. The computation speed of the d Sa C and d WD estimators were relatively slower than those of the Sa C, WD and Zhang s SAVGM estimators. The WD estimator was quite fast, which means that the re-weighting used less computing time than the bias-reduction. In comparison, the d WD estimator was the slowest as a cost for attaining the best root mean squared error among the five estimators in all settings. It is observed in Table 3 that the overall computation time for each estimator first decreased and then increased as K became larger. The decrease in time was because of the benefit of the distributed computation, while the increase was due to the increase in the number of optimization associated with the statistical optimization performed as K got larger. However, it is worth mentioning that these results did not account for the potential time expenditure in data communication among different data blocks. DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY Table 3: Average CPU time for each replication based on B = 500 replications for the split-andconquer (Sa C), the Zhang s SAVGM, the weighted distributed (WD), the debiased split-and-conquer (d Sa C) and the debiased weighted distributed (d WD) estimators for the logistic regression model with respect to K. The dimension p2 of the nuisance parameter λk is 10 and total sample size N = 2 106 K Sa C SAVGM WD d Sa C d WD 10 34.60 35.19 43.84 50.47 55.35 50 20.13 20.18 24.16 29.99 33.69 100 15.60 16.20 17.74 23.63 24.47 250 10.77 12.61 11.88 18.22 20.39 500 11.55 14.50 12.56 18.80 23.73 1000 15.23 18.27 16.28 22.38 32.24 2000 23.42 27.99 24.62 30.43 48.05 6.2 Real data analysis We report results from an empirical analysis of an airline s on-time performance data to demonstrate the proposed weighted distributed estimation for massive data. We aim to quantify the association between flight departure delay and a set of covariates, the arrival delay of the previous flight of the same plane, the seasonal effects, and the weather conditions with a logistic regression model, based on data from the top 10 busiest airports in the United States in 2007. The flight data are available from https://community.amstat.org/jointscsg-section/dataexpo/dataexpo2009 and the weather data are obtained from https://cds.climate.copernicus.eu/. We segmented the full data of N = 2412782 according to the airports of departing flights and obtained 10 data segments. For each segment, we split it to data blocks of size n = 5000, while the residual data blocks were discarded, such that the total number of blocks K = 479. We included seven covariates in the logistic regression: the arrival delay of the previous flight, the season (encoded by three dummy variables: spring (March-May), summer (June-August), autumn (September-November) with winter as the baseline, the near-surface air temperature and pressure, and the rain rate before the scheduled departure time. The coefficients of the three weather variables were treated as the common parameters while the remaining coefficients including the intercept were regarded as heterogeneous; see Section C.3 in the supplementary material for the justification. The estimated common parameters of the near-surface air pressure, temperature, and convective rain rate with 95% confidence intervals using the weighted distributed estimator and the split-and-conquer estimator are shown in Figure 2. Both methods successfully identified a significant association between the three weather variables and the departure delay of a flight. Besides, the weighted distributed estimator reduced the lengths of the confidence intervals of the estimated common parameters compared with the split-and-conquer method. In particular, the confidence interval of the rain parameter was shortened by 19.1%, while those of the other two common parameters were shorted by 2.2% (pressure) and 2.9% (temperature), which justified the statistical efficiency of the weighted distributed estimator. The data analysis demonstrated the feasibility of implementing the proposed weighted distributed estimation method for real-world distributed inference problems. With only one round GU AND CHEN of weighting to tackle the heterogeneity among the nuisance parameters, more efficient estimation can be obtained. Figure 2: Estimated common parameters of the near surface air pressure, temperature and convective rain rate with 95% confidence intervals using the weighted distributed estimator and the split-andconquer estimator temperature air pressure 0.0950 0.0975 0.1000 0.1025 0.1050 0.1075 0.080 0.085 0.090 0.095 0.100 0.0625 0.0600 0.0575 0.0550 Estimates (95% Confidence Interval) 7 Discussion This paper investigates distributed statistical optimization in the presence of heterogeneity among the data blocks. The weighted distributed estimator can improve the estimation efficiency of the split-and-conquer estimator for the common parameter. Two debiased estimators are proposed to allow for larger numbers of data blocks K. The statistical properties of the proposed estimators are shown to be advantageous over the split-and-conquer and SAVGM estimators. In particular, the weighted distributed estimator performs well for smaller K relative to N, and the debiased weighted distributed estimator that conducted both bias correction and weighting offers good estimation accuracy for large K. An important issue for the distributed estimation is the size of K relative to the full sample size N. Both the split-and-conquer and weighted distributed estimators require K = o(N1/2) to DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY preserve the the N1/2 convergence rate. The debiased weighted distributed and debiased split-andconquer estimators relax the restriction to K = o(N2/3) without sacrificing the statistical efficiency. Our finding that the heterogeneity requires separate weighting for better efficiency gain has implications beyond the current context. In particular, the proposed WD estimator can be extended to a multi-round procedure, where one can substitute the WD estimator back to the local loss functions, update the remaining local parameters, and then repeat the WD procedure. However, it may be shown that under the current M-estimation framework which is non-degenerate in the sense of Chen and Peng (2021), doing so could not lead to further gain in the efficiency beyond the WD estimator, as it suffices to achieve statistical efficiency with a one-round averaging as the WD. Nevertheless, it is of interest to explore further for the degenerate cases as the above statement may no longer be applicable. The heterogeneous M-estimation framework in this work is related to the federated learning (Mc Mahan et al., 2017), where one wants to minimize a federated risk function k=1 wk Mk(φ), (17) where Mk(Xk; ) is the k-th client specific loss function, Mk(φ) = EPk (Mk(Xk; φ)) is the corresponding risk function, φ Rp1 is the parameter of interest and wk is the pre-specified weight of the k-th client with a natural choice wk = 1/K. The local distributions {Pk}K k=1 may be not identically distributed to reflect heterogeneity. In (17), only a shared parameter φ needs to be estimated, and the heterogeneity is hidden in the local loss and risk functions. Our formulation is different from the federated risk function (17), where the {Mk}K k=1 are parameterized via the heterogeneous local parameters {λk}K k=1, leading to Mk(Xk; ) = M(X; φ, λk) (18) for inference on φ. Our finding that by actively weighting with respect to the heterogeneity as in the WD estimation can provide useful guideline for the selection of the weights wk in (17) of the federated learning. Different from (17), the federated multi-task learning (Smith et al., 2017) is designed to tackle the heterogeneity in a distributed network, which fits separate local parameters {φk}K k=1 Rp1 to different data blocks (tasks) through loss functions {ℓk( , )}K k=1 and its general formulation is i=1 ℓk(φT k Xk,i, Yk,i) + R(Φ, Ω) , (19) where Φ is the matrix with {φk}K k=1 as column vectors, Ω RK K and R( , ) measures the extent of the heterogeneity among different data blocks. Choices of R( , ) include the bi-convex function R(Φ, Ω) = δ1trace(ΦΩΦT ) + δ2 Φ 2 F for δ1, δ2 > 0 and Ω= IK K (1/K)1K1T K such that trace(ΦΩΦT ) = PK k=1 φk φK 2 2 where φK = (1/K) PK k=1 φk, which leads to the meanregularized multi-task learning (Evgeniou and Pontil, 2004) with R conducting regularization on each local model. Similar regularization formulations have also been applied in personalized federated learning (T. Dinh et al., 2020; Li et al., 2021). Other than regularization methods, Marfoq et al. (2021) proposed a clustering-based method, where the {Pk}K k=1 are assumed to be sampled from a GU AND CHEN mixture of S (S K) underlying distributions. It is also noted that although federated multi-task learning assumes different parameters {φk}K k=1 over the data blocks, it regularizes them toward a common one. In contrast, we assume there is a common parameter φ shared by the distributions. By doing so, we can clarify the source of heterogeneity {λk}K k=1 and homogeneity φ instead of putting an equal treatment on all the dimensions of the parameter and focusing on the statistical inference of the common parameter. Acknowledgments and Disclosure of Funding The authors would like to thank the two anonymous referees, the Associate Editor, and the Editor for their thoughtful comments and suggestions, which have improved the presentation of the paper. This research is funded by National Natural Science Foundation of China Grants 12071013, 12292980 and 12292983. DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY The Appendix is organized as follows. Section A provides derivations of the formulas given in the main text. Section B contains detailed proofs of the theoretical results. More simulation results and details about the real data analysis are reported in Section C. Appendix A. Derivation of formulas A.1 Expansion of the full sample estimator ˆφfull By integral form of Taylor s expansion around the true value θ , we have 0p 1 = ΨN(X; ˆφfull, ˆλ1,full, ..., ˆλK,full) = ΨN(X; θ ) + J(θ )(ˆθfull θ ) + ( ΨN(X; θ ) J(θ ))(ˆθfull θ ) 0 ΨN(X; θ + t(ˆθfull θ ))(ˆθfull θ )dt ΨN(X; θ )}(ˆθfull θ ), where J(θ) = E ( ΨN(X; θ)). Then, inverting the above leads to ˆθfull θ = J(θ ) 1ΨN(X; θ ) + RN1 + RN2, (20) where RN1 = J(θ ) 1{ ΨN(X; θ ) J(θ )}(ˆθfull θ ) and RN2 = J(θ ) 1{ R 1 0 ΨN(X; θ + t(ˆθfull θ ))(ˆθfull θ )dt ΨN(X; θ )}(ˆθfull θ ) are both higher-order remainder terms. Since J(θ) has the following form PK k=1 nkΨφ φ(θk) n1Ψλ φ(θ1) n KΨλ φ(θK) n1Ψφ λ(θ1) n1Ψλ λ(θ1) 0 0 ... 0 ... 0 n KΨφ λ(θK) 0 0 n KΨλ λ(θK), then the right bottom part of J(θ) is a block diagonal matrix, whose inverse is at hand. Thus we can see J(θ) as a 2 2 block matrix and directly apply the block matrix inverse formula (Lu and Shiou, 2002). Thus from (20) we have k=1 (nk/N)Jφ|λ(θ k) i=1 Sφ(Xk,i; θ k) + op(N 1/2). A.2 Errors-in-variables model We first give a derivation of the objective function from the perspective of statistical optimization. As we will see, the derived objective is exactly the same as that when we do orthogonal regression or Deming s regression (Carroll and Ruppert, 1996). Consider the conditional likelihood of (Xk,i, Yk,i) given Zk,i in block k f({Xk,i}, {Yk,i}|{Zk,i}, θk) = i=1 f1(Xk,i|Zk,i)f2(Yk,i|Zk,i) =( 1 2πσ2 )n n Y (X2 k,i + (Yk,i φ)2) 2Zk,i(Xk,i + λk(Yk,i φ)) + (1 + λ2 k)Z2 k,i GU AND CHEN By the factorization theorem, Xk,i + λk(Yk,i φ) is a sufficient statistic for Zk,i if θk = (φ, λk) is assumed to be known. And Xk,i + 2λk(Yk,i φ)|Zk,i N((1 + λ2 k)Zk,i, (1 + λ2 k)σ2). Then, the above conditional likelihood can be factorized as f({Xk,i}, {Yk,i}|{Zk,i}, θk) i=1 exp 1 2σ2(1 + λ2 k)(λk Xk,i (Yk,i φ))2 h(Xk,i + λk(Yk,i φ)|Zk,i), where h(si|zi) is the conditional density of N((1 + λ2 k)zi, (1 + λ2 k)σ2). Since {Zk,i}n i=1 are not observable, we discard the factor h and construct the estimator based on the first part of the factorization, which is denoted as f({Xk,i}, {Yk,i}|{Zk,i}, θk). Differentiate log f({Xk,i}, {Yk,i}|{Zk,i}, θk) with respect to θk = (φ, λk)T , we obtain φ log f({Xk,i}, {Yk,i}|{Zk,i}, θk) = 1 σ2(1 + λ2 k) i=1 (λk Xk,i (Yk,i φ)) and λk log f({Xk,i}, {Yk,i}|{Zk,i}, θk) = n λk 1 + λ2 k + λk σ2(1 + λ2 k)2 (λk Xk,i (Yk,i φ))2 Xk,i σ2(1 + λ2 k)(λk Xk,i (Yk,i φ)). However, E f({Xk,i}, {Yk,i}|{Zk,i}, θ k) = (0, nλ k/(1+λ 2 k ))T = 02 1, thus a correction term should be added to construct an appropriate objective function which satisfies the standard first-order condition in statistical optimization framework: Mn,k({Xk,i}, {Yk,i}|{Zk,i}, θk) = log f({Xk,i}, {Yk,i}|{Zk,i}, θk) + n 2 log(1 + λ2 k) = 1 2σ2(1 + λ2 k) i=1 (λk Xk,i (Yk,i φ))2 + C(σ), where C(σ) = n log( 2πσ) is an absolute constant so we also discard it. The corresponding M-function is M(Xk, θk) = 1 2σ2(1 + λ2 k)(λk Xk (Yk φ))2. (22) Below we check the identification of the true parameter under this objective function. We can directly solve the population level first-order conditions (FOC) using E ( M(Xk, Yk|Zk, θk)) = 02 1, which are given as 02 1 = (1 + λ2 k) (λk λ k)E (Zk) (φ φ) (λkλ k + 1)(λk λ k)E Z2 k λk(φ φ )2 + (φ φ )(1 + 2λkλ k λ2 k)E (Zk) (23) To solve the above set of equations, we consider the two scenarios. When E (Zk) = 0, from the first equation we obtain φ = φ , then the second equation reduces to C(λkλ k + 1)(λk DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY λ k)E Z2 k = 0. Since we have assumed λk, λ k > 0, we must have λk = λ k. When E (Zk) = 0, if λk = λ k we would obtain E (Zk) = (φ φ)/(λk λ k). Plugging it into the second equation of (23) and we can obtain (1 + λkλ k) σ2(1 + λ2 k)2(λk λ k) (λk λ k)2E Z2 k (φ φ )2 = 0, which is impossible unless Zk is degenerate, namely Zk = (φ φ)(λk λ k) with probability one. This leads to a contradiction. Thus we must have λk = λ k. Again from the first equation of (23) we will obtain that φ = φ . In summary, E ( M(Xk, Yk|Zk, θk)) = 02 1 if and only if θk = θ k. To give an explicit form of asymptotic variance of the estimator obtained from the M-function (22), we can directly calculate the following two terms: E 2M(Xk, Yk|Zk; θ k) 1 σ2(1+λ 2 k ) Xk σ2(1+λ 2 k ) 2λ k(λ k Xk (Yk φ )) σ2(1+λ 2 k )2 Xk σ2(1+λ 2 k ) 2λ k(λ k Xk (Yk φ )) σ2(1+λ 2 k )2 (3λ 2 k 1)(λ k Xk (Yk φ ))2 σ2(1+λ 2 k )3 4λ k Xk(λ k Xk (Yk φ )) σ2(1+λ 2 k )2 + X2 k σ2(1+λ 2 k ) = 1 σ2(1 + λ 2 k ) 1 E (Zk) E (Zk) E Z2 k and E M(Xk, Yk|Zk, θ k)( M(Xk, Yk|Zk, θ k))T 1 σ2(1+λ 2 k ) E(Zk) σ2(1+λ 2 k ) E(Zk) σ2(1+λ 2 k ) E(Z2 k) σ2(1+λ 2 k ) + 1 (1+λ 2 k )2 = 1 σ2(1 + λ 2 k ) 1 E (Zk) E (Zk) E Z2 k + σ2 1+λ 2 k Thus we have Jφ|λ(θ k) = 1 σ2(1 + λ 2 k )(1 (EZk)2 EZ2 k ) = 1 σ2(1 + λ 2 k ) Var(Zk) Var(Sφ) = 1 E(Zk) 1 σ2(1 + λ 2 k ) 1 EZk EZk EZ2 k + σ2 1+λ 2 k = 1 σ2(1 + λ 2 k ) which leads to the Equation (8) in the main text. A.3 Equivalent variance minimization formulations of the weighted estimators For simplicity, we assume that n1 = n2 = = n K = n. We claim that the following two formulations of the variance minimization problem have identical solution. Formulation 1: Trace Operator Minimize Wk trace K X k=1 Wk Hk W T k k=1 Wk = Ip1. (24) Formulation 2: Frobenius Norm Minimize Wk k=1 Wk Hk W T k F , s.t. k=1 Wk = Ip1. (25) GU AND CHEN Proof We solve the problem (24) first. The Lagrangian of this problem is L1 = trace K X k=1 Wk Hk W T k k=1 Wk Ip1 >, where Λ1 Rp1 p1 is the corresponding Lagrangian multiplier. If we take derivative of L1 w.r.t. Wk we can obtain 2Wk Hk + Λ1 = 0, k = 1, 2, , K. Then Wk = 1 2Λ1H 1 k . Using the constraint PK k=1 Wk = Ip1, we can obtain Λ 1 = 2(PK s=1 A 1 s ) 1 and W k = (PK s=1 A 1 s ) 1A 1 k . Now we turn to solve the problem (25). Equivalently we can minimize the square of the Frobenius norm, and the corresponding Lagrangian is k=1 Wk Hk W T k 2 F + < Λ2, k=1 Wk Ip1 > . Taking derivative w.r.t. Wk we can obtain 4(PK s=1 Ws As W T s )Wk Ak + Λ2 = 0. Now we can use the constraint PK k=1 Wk = Ip1 and get Λ 2 = 4(PK s=1 Ws As W T s )(PK s=1 A 1 s ) 1 and W k = (PK s=1 A 1 s ) 1A 1 k . A.4 Second-order Bartlett s indentity under QMLE For the quasi maximum likelihood estimation (QMLE), we only check that the second order Bartlett s identity holds for independent observarions. Suppose that the components of the response vector Y are independent with mean vector µ and covariance matrix σ2V (µ), where σ2 maybe unknown and V (µ) is a matrix of known functions. It is assumed that the parameters of interest, θ, is a function of µ. By independence of the components of Y and the physical mechanism plausibility, it is reasonable to assume further that Vi(µ) depends on µ only through µi, which implies that V (µ) = diag{V1(µ1), V2(µ2), , Vn(µn)}. For a single observation Y , we can construct the score function as U = u(µ; Y ) = (Y µ)/(σ2V (µ)). Then the corresponding objective function can be defined as Q(µ; y) = Z µ y t σ2V (t)dt, (26) which behaves like a negative log-likelihood: E ( µQ) = 0, Var( µQ) = E( 2 µQ) = 1/{σ2V (µ)}. We refer to Q(µ; y) as the negative quasi-likelihood (Mc Cullagh, 1983), or more precisely the negative log quasi-likelihood for µ based on data y. By independence, the negative quasi-likelihood for the complete data is the sum of the individual contributions: Q(µ; y) = Pn i=1 Q(µi; yi). The quasi-likelihood estimating equations for the regression parameters θ, obtained by differentiating Q(µ; y), can be written in the form U(ˆθ) = 0, where U(θ) = DV 1(Y µ)/σ2 is called the quasi-score function. The components of D, of order n p, are Dir = µi/ θr, the derivatives of µ(θ) with respect to the parameters. Suppose the true parameters are θ and µ , then by the zero-mean of U(θ ), we have Co V{U(θ )} = E U(θ )U(θ )T = DT V 1D/σ2 and θT (θ ) = E{DT V 1 µ θT /σ2 + DT V 1 σ2 } = DT V 1D/σ2. DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY A.5 Generalized second-order Bartlett s identity for parametric regression Suppose that we observe a random sample (X1, Y1), (X2, Y2), , (Xn, Yn), which follows Y = fθ (X) + e, E(e|X) = 0, Var(e|X) = σ2(X), X p(x). Then the objective function for the least square estimation is M(Z, θ) = (Y fθ(X))2 with Z = (X, Y ). Note that E (M(Z, θ)) = E(fθ(X) fθ (X))2 + Ee2 E (M(Z, θ )) + E((θ θ )T fθ (X))2, (27) which suggests that 2 θM(θ ) = 2E fθ (X) fθ (X)T where M(θ) = EM(Z, θ). For the approximation (27), see van der Vaart (1999). If we assume the independence between e and X, which implies Var(e) = σ2, then E M(Z, θ ) M(Z, θ )T = 4σ2E fθ (X) fθ (X)T with the multiplicative factor γ for the generalized second-order Bartlett s identity being 4σ2. A.6 GMM formulation of the full sample statistical optimization under heterogeneity It is noted that W0 admits the following form Var{ψθ1(X1,1; φ , λ 1)} 1 0 0 0 Var{ψθ2(X2,1; φ , λ 2)} 1 0 ... ... 0 0 Var{ψθK(XK,1; φ , λ K)} 1 Thus, W0 is a block diagonal matrix. Also note that GT 0 = E{ ψT N(θ ) ψφ φ(X1,i; φ , λ 1) ψλ φ(X1,i; φ , λ 1) ψφ φ(XK,i; φ , λ K) ψλ φ(XK,i; φ , λ K) ψφ λ(X1,i; φ , λ 1) ψλ λ(X1,i; φ , λ 1) 0 0 0 0 0 ψφ λ(X2,i; φ , λ 2) ψλ λ(X2,i; φ , λ 2) 0 ... ... ... ... ... ... 0 0 0 0 ψφ λ(XK,i; φ , λ K) ψλ λ(XK,i; φ , λ K) then the asymptotic variance of the GMM estimtator (Hansen, 1982) is Asy V Ar(ˆθGMM) = (GT 0 W0G0) 1 and has the following form: PK k=1 nk DΨφ(θ k)T Σ 1 S,k DΨφ(θk) n1DΨφ(θ 1)T Σ 1 S,1DΨλ(θ 1) n KDΨφ(θ K)T Σ 1 S,KDΨλ(θ K) n1DΨλ(θ 1)T Σ 1 S,1DΨφ(θ 1) n1DΨλ(θ 1)T Σ 1 S,1DΨλ(θ 1) 0 0 ... 0 ... ... ... ... ... ... ... 0 n KDΨλ(θ K)T Σ 1 S,KDΨφ(θ K) 0 0 n KDΨλ(θ K)T Σ 1 S,KDΨλ(θ K) DΨφ(θk)T = Ψφ φ(θk) Ψλ φ(θk) , DΨλ(θk)T = Ψφ λ(θk) Ψλ λ(θk) and ΣS,k = Var{ψθk(Xk,1; φ , λ k)}. GU AND CHEN By the inversion of block matrix, approximately Var(ˆφGMM) 1 has the following form: DΨφ(θ k)T Σ 1 S,k DΨφ(θ k) DΨφ(θ k)T Σ 1 S,k DΨλ(θ k) DΨλ(θ k)T Σ 1 S,k DΨλ(θ k) 1 DΨλ(θ k)T Σ 1 S,k DΨφ(θ k) . If we denote the elements in the above summation as nk Uk, then it is straightforward to verify that ( DΨφ(θ k)T ΣS,k DΨφ(θ k) DΨλ(θ k) ) 1 , namely, the inverse of Uk is the left top part of the inverse of a bigger matrix in the RHS of the above equation, from which we are able to obtain the simplified expression of Uk: Uk = J 1 φ|λ Ip1 p1 Ψλ φ(θ k)Ψλ λ(θ k) 1 ΣS,k Ip1 p1 Ψλ λ(θ k) 1Ψφ λ(θ k) = Jφ|λΣ 1 k Jφ|λ. Now we conclude that Var(ˆφGMM) PK k=1 Jφ|λΣ 1 k Jφ|λ 1 , which is the same as that of the WD estimator ˆφWD. A.7 Lipschitz continuity of the outer product of the gradient in logistic regression model First we define the logit function logit(a) = exp(a)/(1 + exp(a)) for a R. Then the logistic regression model can be defined as P(Y = 1|X) = logit(XT β ), where X, β Rp. If we define the objective M as M(z, β) = y log(logit(x T β))+(y 1) log(1 logit(x T β)), where z = (y, x), then the outer product of gradient, denoted as f(z, β), is f(z, β) = (y logit(x T β))2xx T . Now we have f(z, β1) f(z, β2) 2 = xx T (2y logit(x T β1) logit(x T β2))(logit(x T β1) logit(x T β2)) 2 = xx T (2y logit(x T β1) logit(x T β2))(1 logit(ξ))logit(ξ)x(β1 β2) 2 x 3 2 β1 β2 2, where the second equality comes from an application of the mean value theorem. A.8 Asymptotic efficiency comparison of ˆλk and ˆλ(2) k Theorem 9 Under the conditions required in Theorem 4, if K , then for the updated estimator ˆλ(2) k , we have that nk(ˆλ(2) k λ k) d N(0, Ψλ λ(θ k) 1Eψλ(Xk,1; θ k)ψλ(Xk,1; θ k)T Ψλ λ(θ k) 1). (28) DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY Hence, the asymptotic distribution of ˆλ(2) k is the same as that of the estimator of λ k obtained when the common parameter φ is known. It is noted that the joint asymptotic distribution for the estimator ˆθk = (ˆφT k , ˆλT k )T is nk(ˆθk θ k) d N(0, Jk(θ k) 1Eψθk(Xk,1; θ k)ψθk(Xk,1; θ k)T Jk(θ k) 1), which leads to nk(ˆλk λ k) d N(0, Jλ|φ(θ k) 1Var(Sφ(Xk,1; θ k))Jλ|φ(θ k) 1). (29) There is not a definite order on the relative efficiency between ˆλk and ˆλ(2) k by comparing the two asymptotic variances in (28) and (29), suggesting it would depend on the specific M function and the model setting. For general statistical optimization, a known nuisance parameter (here φ ) does not necessarily improve the efficiency of a parameter of interest Yuan and Jennrich (2000); Henmi and Eguchi (2004), which is the case for the current setting. Consider again the errors-in-variables model where it can be shown that Var(ˆλ(2) k ) σ4 (Var(Zk))2 1 nk and Var(ˆλk) σ4 (E(Z2 k))2 + σ2(1 + λ2 k) E(Z2 k) When E (Zk) = 0, i.e. Var(Zk) = E(Z2 k), the updated estimator ˆλ(2) k is more efficient, and the efficiency gain gets large as λ2 k increases. However, if E (Zk) has a large absolute magnitude, ˆλk can be more efficient than ˆλ(2) k . Moreover, the requirement in Theorem 9 that K is to obtain a succinct asymptotic variance of ˆλ(2) k . The above conclusion does not change for the fixed K case. Consider block 1, we assume ˆλ(2) 1 p λ 1 and ˆφWD is n1 consistent (detailed proofs of both claims are available in the next section). Then by Theorem 1 in Yuan and Jennrich (2000), if n1 1 n1 Pn1 i=1 ψλ(X1,i; θ 1) + Ψφ λ(θ 1)(ˆφWD φ ) d N(0, Q), we will have n1(ˆλ(2) 1 λ 1) d N(0, Ω) where Ω= Ψλ λ(θ 1) 1QΨλ λ(θ 1) 1. Denote Tn,K = nΨλ λ(θ 1) 1 1 n Pn i=1 ψλ(X1,i; θ 1) + Ψφ λ(θ 1)(ˆφWD φ ) , then Tn,K should have the same asymptotic distribution as n(ˆλ(2) 1 λ 1). So, we study the limiting behavior of Tn,K for simplicity. Consider the homogeneous scenario as a special case when θ 1 = θ 2 = = θ K, n1 = n2 = = n K = n, then the optimal weights are W 1 = W 2 = = W K = 1 K Ip1 p1. Now we have Tn,K = 1 nΨλ λ(θ 1) 1 n X i=1 ψλ(X1,i; θ 1) Ψφ λ(θ 1) 1 K (ˆφk φ ) = 1 nΨλ λ(θ 1) 1 n X i=1 ψλ(X1,i; θ 1) Ψφ λ(θ 1) i=1 J 1 φ|λSφ(Xk,i; θ k) + op(1) K )Ψλ λ(θ 1) 1Ψφ λ(θ 1) Ip2 p2 2M1(θ 1) 1 1 n ψφ(X1,i; θ 1) ψλ(X1,i; θ 1) Ψλ λ(θ 1) 1Ψφ λ(θ 1) 1 n 1 K i=1 J 1 φ|λSφ(Xk,i; θ k) + op(1) = T (1) n,k + op(1). GU AND CHEN We can verify that Var(T (1) n,k) = (1 1 K )Ψλ λ(θ 1) 1Var(ψλ(X1,1; θ 1))Ψλ λ(θ 1) 1 + n K Var(ˆλ1), or equivalently, Var(ˆλ(2) 1 ) (1 1 K )Ψλ λ(θ 1) 1Var(ψλ(X1,1; θ 1))Ψλ λ(θ 1) 1 1 K Var(ˆλ1). Thus Var(ˆλ(2) 1 ) Var(ˆλ1) if and only if Ψλ λ(θ 1) 1Var(ψλ(X1,1, θ 1))Ψλ λ(θ 1) 1/n Var(ˆλ1). (30) The LHS of inequality (30) is the asymptotic variance of the estimator of λ 1 if φ 1 is known and RHS is the asymptotic variance of estimator of λ 1 when we jointly estimate (φ T 1 , λ T 1 )T . Henmi and Eguchi (2004) showed that the inequality does not always hold for general statistical optimization problem and derived a sufficient condition under which a known nuisance parameter (φ ) will lead to a bigger asymptotic variance of the estimator of the parameter of interest (λ 1). A.9 Comparison with a one-step estimator Huang and Huo (2019), also under the same homogeneous setting, considered to utilize the second order information of the M-function to allow for a larger K. They proposed a one-step estimator which aggregates the local Hessian matrices and gradients and performs a single Newton-Raphson updating. The estimator, denoted as ˆθ(1), has a MSE upper bound E ˆθ(1) θ 1 2 2 2E θ1Ψθ(θ 1) 1ψθ1(X1,1; θ 1) 2 2 Thus, this method allows for K = o(n3), while still preserves the O(N 1) convergence rate. The price of this procedure is one extra round of transmission of the local Hessians and gradients. To mitigate the communication burden, they considered to use only one local Hessian matrix instead of the averaged one. Let ˆθ(1) LH be the estimator. They showed that E ˆθ(1) LH θ 1 2 2 2E θ1Ψθ(θ 1) 1ψθ1(X1,1; θ 1) 2 2 which is similar to the MSE bound of the d Sa C estimator in Corollary 8. However, both ˆθ(1) and ˆθ(1) LH are not readily extended to the heterogeneous setting, as the one-step update procedure relies crucially on the N1/2 consistency of the initial estimators of all the unknown parameters (van der Vaart, 1999), but the convergence rate of the block-specific estimators ˆλk are only of order Op(n1/2 k ). A.10 Bias correction for statistical optimization under logistic regression model Given observations {(yi, Xi)}n i=1, we now construct ˆB(β). Denote y = (y1, y2, , yn)T , X = (X1, X2, , Xn)T and ˆy = (ˆy1, ˆy2, , ˆyn) with ˆyi = logit(x T i β). Since daj logit(a) = logit(a) s=1 (1 logit(a)s), then we have Mn(β) = 1 n XT (ˆy y), 2Mn(β) = 1 n XT diag{ˆy (1 ˆy)}X and 3Mn(β) = 1 n Pn i=1 ˆyi(1 ˆyi)(1 2ˆyi)xivec(xi xi)T , where denotes the element-wise product of two vectors and vec is the vectorization operator. Then, the bias-correction formula is a combination of the gradients up to the third order. DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY Appendix B. Proofs Without loss of generality, we assume equal sample size n in each data block. Besides, unless otherwise stated, we will use C, ci and Ci to denote positive constants independent of (nk, K, N), and the same Ci can have different values from one context to another. Before presenting the proofs of the theoretical results established in the main paper, we first establish some technical lemmas in the following sub-section. Lemma B.1 Suppose H and K are positive definite matrices of order p, and X and Y are arbitrary p m matrices. Then, Q = XT H 1X + Y T K 1Y (X + Y )T (H + K) 1(X + Y ) 0. Proof Let A and B be defined as follows A = H X XT XT H 1X , B = K Y Y T Y T K 1Y Since H, K are positive definite, we can directly check that A, B are positive semi-definite. Thus A + B is also positive semi-definite, and the conclusion follows. See Haynsworth (1970); Ando (1979) for more similar types of matrix inequalities. Lemma B.2 Under Assumptions 1 - 4 and Assumptions 5 - 6 with min{v, v1} 1, if K = o(nv2), then sup 1 k K ˆθk θ k 2 P 0. Besides, the following holds for all k = 1, 2, , K and 1 v2 v1 E ˆθk θk 2v2 2 C R2v2 nv2 + 1 + L2v Proof Let Gn,k = 1 n Pn i=1 Gk(Xk,i), Mn,k(θ) = 1 n Pn i=1 M(Xk,i; θ) and δρ = min{ρ, ρρ /4G}. For k = 1, ..., K, define the following good events : Ek = {Gn,k 2G, 2 θk Mn,k(θ k) 2 θk Mk(θ k) 2 ρρ 2 , θk Mn,k(θ k) 2 (1 ρ)ρ δρ 2 }, (34) then by Lemma 6 in Zhang et al. (2013), we obtain that under the event K k=1Ek, ˆθk θ k 2 2 θk Mn,k(θ k) 2 (1 ρ)ρ holds for all k = 1, 2, , K. (35) Similar to the proof of Lemma C.1 in Jordan et al. (2019), we can show that there exist constants C1, C2, C3 independent of (n, K, d, L, R) such that P( K k=1Ec k) K C1 + C2(log2d)2v L2v nv + C3R2v1 GU AND CHEN Now For any ϵ > 0 and k K, we further define the events E k = { θk Mn,k(θ k) 2 (1 ρ)ρ ϵ/2}. Then by Markov s inequality and the union bound, there exist a positive constant C4 depending on ϵ such that P( K k=1E c k ) C4(log2d)2v L2v Thus, sup 1 k K ˆθk θ k 2 P 0 as n as long as K = o(nmin{v,v1}). Besides, the higher-order estimation error bound follows from the following decomposition E ˆθk θk 2v2 2 = E ˆθk θk 2v2 2 (I(Ek) + I(Ec k)) C E θk Mn,k(θ k) 2v2 2 + P ((Ek)c) nv2 + 1 + L2v Lemma B.3 Inv(A) : GL(Rp) GL(Rp) : A 7 A 1 is Lipschitz continuous at any A GL(Rp), where GL(Rp) consists of all p p invertible matrices of real numbers. Proof Let A0 GL(Rp) be given. Denote 1/ A 1 0 2 = δ > 0. It follows that for all x Rp we have x 2 = A 1 0 A0x 2 (1/δ) A0x 2, namely A0x 2 δ x 2. Assume that A A0 2 < δ/2, then Ax 2 A0x 2 (A A0)x 2 δ 2 x 2, which means A 1 exists and A 1 2 2/δ. Since A 1 A 1 0 = A 1(A0 A)A 1 0 , A 1 A 1 0 2 A 1 2 A0 A 2 A 1 0 2 (2/δ2) A A0 2, which completes the proof. Lemma B.4 Under Assumptions 1 - 4 and 7, and Assumptions 5 - 6 for v, v1 1, if K = o(n), k=1 Hk(θ k) 1(ˆφk φ )}T { k=1 nk Hk(θ k) 1} 1{ k=1 nk Hk(θ k) 1(ˆφk φ )} d χ2 p1. Proof We prove for the case when K , and it is straightforward to derive the proof for the fixed K case. Since we have assumed equal sample size n for simplicity, we can denote s=1 Hs(θ s) 1} 1 k=1 Hk(θ k) 1(ˆφk φ ), and the problem is equivalent to show that T1 2 2 χ2 p1 in distribution when K = o(n). Since all the smoothness conditions in Assumptions 5 - 6 only holds locally, namely in the Uρ ball, so all the Taylor expansions hold only under the event K k=1Ek, where the definition of the event Ek can be found in the proof of Lemma B.2. Now, under the event K k=1Ek, by the integral form of Taylor s expansion of θk Mn,k(θk) around the true parameter θ k, we have ˆθk θ k = Jk(θ k) 1 θk Mn,k(θ k) + R(k) n , (36) DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY where R(k) n = R(k) n,1 + R(k) n,2, R(k) n,1 = Jk(θ k) 1{ Z 1 0 2 θk Mn,k(θ k + t(ˆθk θ k))dt 2 θk Mn,k(θ k)}(ˆθk θ k) and R(k) n,2 = Jk(θ k) 1{ 2 θk Mn,k(θ k) Jk(θ k)}(ˆθk θ k) for each k. Recall the definition of Jφ|λ and Sφ, if we denote Pk = Hk(θ k) 1Jφ|λ(θ k) 1 Ip1 p1 Ψλ φ(θ k)Ψλ λ(θ k) 1 , (37) s=1 Hs(θ s) 1} 1 i=1 Pkψθk(Xk,i; θ k) k=1 Hk(θ k) 1) 1 k=1 Pk{ Z 1 0 2 θk Mn,k(θ k + t(ˆθk θ k))dt 2 θk Mn,k(θ k)}(ˆθk θ k) and k=1 Hk(θ k) 1) 1 k=1 Pk{ 2 θk Mn,k(θ k) Jk(θ k)}(ˆθk θ k), T1 = (T1,0 + R1 + R2) I( K k=1Ek) + T1 1 I( K k=1Ek) = T1,0 + (R1 + R2) I( K k=1Ek) + (T1 + T1,0) 1 I( K k=1Ek) . The proof of Lemma B.2 also shows that P( K k=1Ek) = 1 O(K/nv2), where v2 = min{v, v1}. Thus, it suffices to establish the asymptotic normality for T1,0, show that the R1 and R2 terms are both op(1) terms and then apply the Slutsky s lemma. Considering the T1,0 term, we apply the Cramer-Wold device to reduce the problem into a scalar case. For any non-zero l Rp1, let l T k = l T { 1 K PK s=1 Hk(θ s) 1} 1/2Pk, then l T T1,0 = N 1/2 PK k=1 Pn i=1 l T k ψθk(Xk,i; θ k). If we denote ZK,k = N 1/2 Pn i=1 l T k ψθk(Xk,i; θ k), then l T T1,0 = PK k=1 ZK,k and E (ZK,k) = 0. Below we check the Lindeberg conditions. First, PK k=1 E(Z2 K,k) = l T l = σ2 l > 0. Second, for any ϵ > 0, k=1 E(|ZK,k|2; |ZK,k| > ϵ) = k=1 E(|ZK,k|2I{|ZK,k|>ϵ}) ϵ )P(|ZK,k I{|ZK,k|>ϵ}| > t)dt k=1 P(|ZK,k| > ϵ) + ϵ P(|ZK,k| > t)dt, where the second equality comes from the tail-sum formula for expectations of absolute moments. Using Chebyshev s inequality, Marcinkiewicz-Zygmund inequality with b3 being the corresponding constant and Jensen s inequality , we can show that k=1 P(|ZK,k| > ϵ) b3 ϵ3K3/2 k=1 lk 3 2E ψθk(Xk,1; θ k) 2 2 . (38) GU AND CHEN Recalling the definition of lk, we can use the boundedness of Hk(θ k) and 2 θk Mk(θ k) to show that lk 2 C l 2. Thus we have that k=1 P(|ZK,k| > ϵ) b3C ϵ3K3/2 l 2K max 1 k KE ψθk(Xk,1; θ k) 3 2 max 1 k KE ψθk(Xk,1; θ k) 3 2 Now consider the PK k=1 R ϵ P(|ZK,k| > t)dt term. By replacing the ϵ in (38) with t, we have ϵ P(|ZK,k| > t)dt 1 Thus we conclude that T1,0 d N(0, Ip1 p1). Now we consider the remainder term R2. Since K PK k=1 Hk(θ k) 1 2 is bounded, we only need to show R2,1 = { 1 K PK k=1 Hk(θ k) 1}R2 is op(1). Since R2,1 2 q n 1 K PK k=1 Pk 2 n{ 2 θk Mn,k(θ k) Jk(θ k)} 2 n(ˆθk θ k) 2, by Markov s inequality and H older s inequality, we will have P( R2,1 ϵ) C1 E n{ 2 θk Mn,k(θ k) Jk(θ k)} 2 2 E n(ˆθk θ k) 2 2 . From Lemma 7 of Zhang et al. (2013) and Assumption 5 with v 1, we know that E n{ 2 θk Mn,k(θ k) Jk(θ k)} 2 2 C. On the other hand, by Lemma 6 of Zhang et al. (2013) and using the event Ek we can show that E n(ˆθk θ k) 2 2 C1. Since K = o(n), we conclude that R2 = op(1). Considering the R1 term, we can similarly prove that R1 2 = op(1) by using the Lipschitz condition 2 θk M(x; θk) 2 θk M(x; θ k) 2 G(x) θk θ k 2 as assumed in Assumption 5. The result follows via a direct application of the Slutsky s lemma. Lemma B.5 Under the same conditions required by Lemma B.4, the following term is asymptotically negligible (i.e. op(1)): s=1 ns ˆHs(ˆθs) 1} 1nk ˆHk(ˆθk) 1(ˆφk φ ) s=1 ns Hs(θ s) 1} 1nk Hk(θ k) 1(ˆφk φ ) . Proof Denote the above term as T2, then we have that s=1 ˆHs(ˆθs) 1} 1 2 1 K k=1 n( ˆHk(ˆθk) 1 Hk(θ k) 1) 2 n(ˆφk φ ) 2 k=1 Hk(θ k) 1 2 n { 1 s=1 ˆHs(ˆθs) 1} 1 { 1 s=1 Hs(θ s) 1} 1 2 n(ˆφk φ ) 2 n (T (1) 2 + T (2) 2 ). DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY Since K = o(n), it suffices to show T (1) 2 and T (2) 2 are both Op(1). Under the event AK defined in Equation (61), we have T (2) 2 I(AK) C n ˆΣS,k(θ k) ΣS,k(θ k) 2 + n ˆLk(θ k) Lk(θ k) 2 + n(ˆθk θ k) 2 n(ˆθk θ k) 2. Thus for v 1, v1 2, by Markov s inequality and Cauchy s inequality we have P(T (2) 2 > 1, AK) n max 1 k K E ˆΣS,k(θ k) ΣS,k(θ k) 2 2 E ˆθk θ k 2 2 E ˆLk(θ k) Lk(θ k) 2 2 E ˆθk θ k 2 2 + C 3E ˆθk θ k 2 2 Since we have shown P(AK) 1 if K = o(n v) with v = min{v, v1 2 }, and we have assumed that K = o(n) , we can conclude that T (2) 2 = Op(1). We can similarly show that T (1) 2 = Op(1). Now we complete the proof. Lemma B.6 Let A1, A2, , An Sp p, if vec(A2)T ... vec(An)T ( ) 2 A 2 2 holds for Rp, then A 2 pn A, where A = (vec(A1), vec(A2), , vec(An))T . Proof Since A( ) = ( T A1 , T A2 , , T An )T , A2 4 2 Pn i=1( T Ai )2 which implies max i n Ai 2 A. On the other hand, for B = (A1, A2, , An) Rp np, we have B 2 2 = λmax(Pn i=1 Ai AT i ) Pn i=1 λmax(Ai AT i ) = Pn i=1 Ai 2 n A2, which gives A 2 = AT 2 q Pn i=1 vec(Ai) 2 2 = q Pn i=1 Ai 2 F pn A. Lemma B.7 Under Assumptions 1 - 4 and 7 - 8, and Assumption 5 with v, v1 4 , E ˆBk(ˆθk)IEk,bc Bk(θ k) 2 2 C Proof Denote k = ˆθk θ k. By the definition of the event Ek,bc given in Algorithm 2, we immediately have that ˆBk(ˆθk)1Ek,bc Bk(θ k) 2 2 Cn2. Below we first control the Qk(θ k) ˆQk(ˆθk) 2 term. Note that Qk(θk) and ˆQk(θk) are exactly L 1 k (θk) and ˆLk(θk) 1 defined in the proof of Theorem 3, thus under the event { ˆLk(ˆθk) Lk(θ k) 2 ρ 2 }, we have Qk(θ k) ˆQk(ˆθk) 2 2 ρ2 ˆLk(ˆθk) Lk(θ k) 2. Besides, when k 2 ρ, we have ˆLk(ˆθk) Lk(θ k) 2 1 i=1 G(Xk,i) k 2 + ˆLk(θ k) Lk(θ k) 2. (39) GU AND CHEN Without loss of generality, we can assume ρ 8Gρ, then if we define EQ,k = { k 2 ρ 8G, Gn,k 2G, ˆLk(θ k) Lk(θ k) 2 < ρ ˆQk(ˆθk) 21EQ,k Qk(θ k) ˆQk(ˆθk) 2 + Qk(θ k) 2 1EQ,k 1 Qk(θ k) ˆQk(ˆθk) 21EQ,k Ek C G θk Mn,k(θ k) 2 + ˆLk(θ k) Lk(θ k) 2 . Using union bound and Markov s inequality, it is easy to show P ((EQ,k)c) C (1 + L2v) Besides, under this event, we can decompose the estimation error using the event Ek and obtain E 1EQ,k Qk(θ k) ˆQk(ˆθk) 2 2 C E θk Mn,k(θ k) 2 2 + E ˆLk(θ k) Lk(θ k) 2 2 + P ((Ek)c) n + P ((Ek)c) . (40) It is noted that ˆBk(ˆθk) Bk(θ k) 2 2 2 ˆQk(ˆθk) 1 i=1 ˆvi,k(ˆθk) ˆdi,k(ˆθk) Qk(θ k)E (v1,k(θ k)d1,k(θ k)) 2 2 2 ˆQk(ˆθk) ˆH3,k(ˆθk) 1 i=1 ˆdi,k(ˆθk) ˆdi,k(ˆθk) Qk(θ k)H3,k(θ k)E (d1,k(θ k) d1,k(θ k)) 2 2 := 2Ωk,1 + 1 then we can bound those two terms respectively. For Ωk,1, under the event EQ,k we have 2 ˆQk(ˆθk){ 1 i=1 ˆvi,k(ˆθk) ˆdi,k(ˆθk) E (v1,k(θ k)d1,k(θ k))} 2 21EQ,k +1EQ,k Qk(θ k) ˆQk(ˆθk) 2 2 E (v1,k(θ k)d1,k(θ k)) 2 2 i=1 (ˆvi,k(ˆθk) ˆdi,k(ˆθk) ˆvi,k(θ k)di,k(θ k)) 2 21EQ,k | {z } i=1 ˆvi,k(θ k)di,k(θ k) E (v1,k(θ k)d1,k(θ k)) 2 2 + 1EQ,k Qk(θ k) ˆQk(ˆθk) 2 2 By Lemma 7 in Zhang et al. (2013), we have i=1 ˆvi,k(θ k)di,k(θ k) E (v1,k(θ k)d1,k(θ k)) 2 2 DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY Besides, given (40), it suffices to control E Ω(1) k,1 . Note that Ω(1) k,1 2 1 i=1 (ˆvi,k(ˆθk) ˆvi,k(θ k)) ˆdi,k(ˆθk) 2 21EQ,k + 2 1 i=1 ˆvi,k(θ k)( ˆdi,k(ˆθk) di,k(θ k)) 2 21EQ,k := 2(Ω(2) k,1 + Ω(3) k,1). Under the event EQ,k Ek, we have ˆvi,k(ˆθk) ˆvi,k(θ k) 2 CG(Xk,i) θk Mn,k(θ k) 2. Besides, note that θk Mn,k(θ k) 2 is bounded under Ek, then by Taylor s expansion we can show that θk M(Xk,i; ˆθk) θk M(Xk,i; θ k) 21Ek C G(Xk,i) θk Mn,k(θ k) 2 2 + 2 θk M(Xk,i; θ k) 2 θk Mn,k(θ k) 2 1Ek C G(Xk,i) + 2 θk M(Xk,i; θ k) 2 θk Mn,k(θ k) 21Ek. (42) Now by H older s inequality, we can show that E Ω(2) k,11Ek θk Mn,k(θ k) 2 2 1 n i=1 G2(Xk,i) θk M(Xk,i; ˆθk) 2 21Ek i=1 {E θk Mn,k(θ k) 6 2 E G6(Xk,i) E θk M(Xk,i; ˆθk)1Ek 6 2 }1/3 C G2R2(G2 + L2 + R2) For E Ω(3) k,11Ek , first note that ˆdi,k(ˆθk) di,k(θ k) 2 ˆQk(ˆθk) Qk(θ k) 2 θk M(Xk,i; θ k) 2+ ˆQk(ˆθk) 2 θk M(Xk,i; ˆθk) θk M(Xk,i; θ k) 2, (43) then combined with (39) and (42), we have that E Ω(3) k,11Ek i=1 E 2 θk M(Xk,i; θ ) 2 2 G2 θk Mn,k(θ k) 2 2 + 2 θk Mn,k(θ k) 2 θk Mk(θ k) 2 2 θk M(Xk,i; θ k) 2 2 + G2(Xk,i) + 2 θk M(Xk,i; θ k) 2 2 2 θk Mn,k(θ k) 2 2 C L2R2(L2 + G2 + G2R2) Now we consider Ωk,2. First, by Assumption 5 and Lemma B.6, we can show 3 θk M(Xk,i; θ k) H3,k(θ k) 2 C(G(Xk,i) + G), (44) GU AND CHEN leading to E ˆH3,k(θ k) H3,k(θ k) 2v 2 CG2v nv . Besides, using Assumption 8 and Lemma B.6, ˆH3,k(ˆθk) ˆH3,k(θ k) 21Ek C i=1 A(Xk,i) θk Mn,k(θ k) 21Ek. C1 ˆQk(ˆθk) ˆH3,k(ˆθk) 1 i=1 ˆdi,k(ˆθk) ˆdi,k(ˆθk) Qk(θ k)H3,k(θ k) 1 i=1 di,k(θ k) di,k(θ k) 2 2 i=1 di,k(θ k) di,k(θ k) E (d1,k(θ k) d1,k(θ k)) 2 2 C1 ( ˆQk(ˆθk) ˆH3,k(ˆθk) Qk(θ k)H3,k(θ k)) 1 i=1 ˆdi,k(ˆθk) ˆdi,k(ˆθk) 2 2 i=1 ˆdi,k(ˆθk) ˆdi,k(ˆθk) 1 i=1 di,k(θ k) di,k(θ k) 2 2 i=1 di,k(θ k) di,k(θ k) E (d1,k(θ k) d1,k(θ k)) 2 2 := C Ω(1) k,2 + G2Ω(2) k,2 + G2Ω(3) k,2 Using lemma 7 in Zhang et al. (2013), we have E Ω(3) k,2 C L4R4 n . Considering Ω(1) k,2, Ω(1) k,21EQ,k Ek C( ˆH3,k(ˆθk) H3,k(θ k) 2 2 + G2 ˆQk(ˆθk) Qk(θ k) 2 2)( 1 i=1 θk M(Xk,i; ˆθk) 2 2)21EQ,k Ek, then by H older s inequality, we can show that E Ω(1) k,21EQ,k Ek C r E ˆH3,k(ˆθk) H3,k(θ k) 4 21EQ,k Ek + G2 r E ˆQk(ˆθk) Qk(θ k) 4 21EQ,k Ek v u u t E( 1 i=1 θk M(Xk,i; ˆθk) 2 21Ek)4 G2(1 + L2 + R2) + A2R2 (G4 + R4 + L4) n . Considering Ω(2) k,2, note for any two p-dimensional vectors ℓ1 and ℓ2, we have ℓ1 ℓ1 ℓ2 ℓ2 2 2 = ℓ1ℓT 1 ℓ2ℓT 2 2 F 2 ℓ1 ℓ2 2 2 ℓ1 2 2 + ℓ2 2 2 , (45) then using (42) and (43), we can show that E Ω(2) k,21EQ,k Ek C R2(G2 + R2 + L2)2 DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY Finally, we consider Ek,bc and we can show that P(Ec k,bc) P(Ec k,bc EQ,k Ek) + P((EQ,k Ek)c) C n2 E ˆB(ˆθk) Bk(θ k) 2 21EQ,k Ek + CP((EQ,k Ek)c). Collecting all the above results, we have the following upper-bound E ˆBk(ˆθk)1Ek,bc Bk(θ k) 2 2 E ˆB(ˆθk) Bk(θ k) 2 21EQ,k Ek + CP((EQ,k Ek Ek,bc)c) R2 + L2 + R2L2 + G2R2(G2 + R2 + L2) + L2R2(L2 + G2 + G2R2) + R2(G2 + R2 + L2)2 +(G2(1 + L2 + R2) + A2R2)(G4 + R4 + L4) + CP((EQ,k Ek)c). (46) This concludes the proof. Now let the pseudo debiased weighted distributed estimator be ˆφpd WD = PK k=1 Wk(θ k)(ˆφk 1 n B1 k(θ k)), we then give the following lemma on the MSE bound of this estimator. Lemma B.8 Under Assumptions 1 - 4 and 7 - 8, and Assumption5 with v, v1 4 , E ˆφpd WD φ 2 2 C1 Proof Under the event Ek defined in the Lemma B.2, we have that 0 = θk Mn,k(θ k) + 2 θk Mn,k(θ k) k + 1 0 3 θk Mn,k(θ k + t k)dt}( k k) = θk Mn,k(θ k) + 2 θk Mk(θ k) k + 1 2 3 θk Mk(θ k)( k k) +( 2 θk Mn,k(θ k) 2 θk Mk(θ k)) k + 1 0 3 θk Mn,k(θ k + t k)dt 3 θk Mk(θ k) ( k k). Recall that we have denoted Jk(θk) = 2 θk Mk(θk), solve for the above equation and we will have k = Jk(θ k) 1 θk Mn,k(θ k) Jk(θ k) 1 2 θk Mn,k(θ k) 2 θk Mk(θ k) k 2Jk(θ k) 1 3 θk Mk(θ k)( k k) 2Jk(θ k) 1 Z 1 0 3 θk Mn,k(θ k + t k)dt 3 θk Mk(θ k) ( k k). (48) Recalling the definition of Wk(θ k), we have that ˆφpd WD φ 2 2 C 1 K PK k=1 Hk(θ k) 1(ˆφk φ 1 n B1 k(θ k)) 2 2 = C 1 K PK k=1 Hk(θ k)( k 1 n Bk(θ k)) 2 2, where Hk(θ k) = Hk(θ k) 1 0 and GU AND CHEN thus Hk(θ k) 2 = Hk(θ k) 2. Denote Ωk,1 = ( 2 θk Mn,k(θ k) 2 θk Mk(θ k)) k 1 n E (v1,k(θ k)d1,k(θ k)) , Ωk,2 = ( k k) 1 n E (d1,k(θ k) d1,k(θ k)) and 0 3 θk Mn,k(θ k + t k)dt 3 θk Mk(θ k) ( k k), then (49) n Bk(θ k) = i=1 di,k(θ k) + Qk(θ k)(Ωk,1 + 1 2H3,k(θ k)Ωk,2 + 1 + k I(EC k ). (50) Considering Ωk,1, denote Ω(1) k,1 and Ω(2) k,1 as follows such that Ωk,1 = Ω(1) k,1 + Ω(2) k,1: Ω(1) k,1 = ( 2 θk Mn,k(θ k) 2 θk Mk(θ k))( k 1 i=1 di,k(θ k)) and Ω(2) k,1 = ( 2 θk Mn,k(θ k) 2 θk Mk(θ k)) 1 i=1 di,k(θ k) 1 n E (v1,k(θ k)d1,k(θ k)) . For Ω(1) k,1, we can show by Taylor s expansion that i=1 di,k(θ k) 1Ek = Qk(θ k) Z 1 0 2 θk Mn,k(θ k + t k)dt 2 θk Mk(θ k) k1Ek, then using H older s inequality we can show that E Ω(1) k,1 2 2IEk CG4R2(G2 + 1) For Ω(2) k,1, by the independence among the observations, it is easy to first show that E Ω(2) k,1 = 0. Denote eij = E (vi,k(θ k)dj,k(θ k)), then we have E Ω(2) k,1 2 2 = 1 t=1 E (vi,k(θ k)dj,k(θ k) eij)T (vs,k(θ k)dt,k(θ k) est) . (52) By a conditioning argument and independence among observations, it is straightforward to show that if the set {i, j, s, t} has three or four unique elements, then E(vi,k(θ k)di,k(θ k) eij)T (vs,k(θ k)dt,k(θ k) eij) = 0. Thus the RHS of Equation (52) has at most O(n2) non-zero elements and each of those non-zero elements can be bounded using H older s inequality. Since E v1,k(θ k)d1,k(θ k) e11 2 2 C(LR)2, we have E Ω(2) k,1 2 2 C(LR n )2. By independence among different Ω(2) k,1, we can directly show that k=1 Hk(θ k)Qk(θ k)Ωk,1 2 2 n2K + G4R2(G2 + 1) DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY For Ωk,2 appeared in (49), we define Ω(1) k,2 and Ω(2) k,2 as follows such that Ωk,2 = Ω(1) k,2 + Ω(2) k,2: Ω(1) k,2 = ( k k) ( 1 i=1 di,k(θ k)) ( 1 i=1 di,k(θ k)) and Ω(2) k,2 = ( 1 i=1 di,k(θ k)) ( 1 i=1 di,k(θ k)) 1 n E (d1,k(θ k) d1,k(θ k)) . Similar to (51), we can show that k=1 Hk(θ k)Qk(θ k)H3,k(θ k)Ω(2) k,2 2 2 CR4 Besides, due to (45) and (51), we can show that E Ω(1) k,2 2 21Ek C G2R4 n3 . For Ωk,3 in (49), combined with (44) and the Lipschitz continuity of 3 θk M(Xk,1, θk) with respect to θk in Uk, we can show that E Ωk,3 2 21Ek C(A2R6+G2R4) n3 . For k I(EC k ), E k I(EC k ) 2 2 CP(EC k ). In summary, we have the following MSE bound of the pd WD estimator: E ˆφpd WD φ 2 2 C1 R2 n K +C2 R2(L2 + R2) n2K +C3 G2R2(G2 + R2 + G4) + A2R6 k=1 P(Ec k). The asymptotic normality of the pd WD estimator is established in the following lemma. Lemma B.9 Under Assumptions 1 - 4 and 7 - 8, and Assumption 5 with v, v1 4 , if K = o(n2), (ˆφpd WD φ )T { k=1 nk Hk(θ k) 1}(ˆφpd WD φ ) d χ2 p1. Proof Since ˆφpd WD φ = { 1 k=1 Hk(θ k) 1} 1 1 k=1 Hk(θ k) k 1 n Bk(θ k) . Using the expansion (50) and other results in the proof of Lemma B.8, when K = o(n2), k=1 Hk(θ k)( k 1 n Bk(θ k)) = 1 Kn i=1 Hk(θ k)di,k(θ k) + op(1). (53) Then, the proof is completed with a direct application of the central limit theorem. GU AND CHEN B.2 Proof of Proposition 1 Proof The consistency of the local estimator ˆθk is implied by Lemma 6 of Zhang et al. (2013). Below we show the consistency of the global estimator ˆθfull. Define the global objective function M( X, θ) = (1/K) PK k=1 M(Xk, θk) and the global expected objective function M(θ) = M( X, θ) , where X = (XT 1 , XT 2 , ..., XT K)T and Xk is sampled from the disribution Fk. Then, equivalently, we have θ = argminθ Θ E M( X; θ) and ˆθfull = argminθ Θ 1 n where Xi = (XT 1,i, XT 2,i, ..., XT K,i)T . Now we can show that E θM( Xi, θ ) 2v1 2 R2v1 and E 2 θM( Xi, θ ) 2 θM(θ ) 2v 2 L2v hold by a direct application of Lemma 7 of Zhang et al. (2013). Besides, for all θ, θ U with U = {θ| θ θ 2 ρ}, we have 2 θM( X, θ) 2 θM( X, θ ) 2 θ θ 2 and E Besides, we can also show that 2 θM(θ ) ρ Ip1 p1 0 0 ρ K I(p1+Kp2) (p1+Kp2). It remains to apply Lemma 6 of Zhang et al. (2013) to obtain the consistency of ˆθfull. B.3 Proof of Theorem 2 Proof See the proof of Lemma B.4. B.4 Proof of Theorem 3 Proof Note that k=1 ˆHk(ˆθk) 1) 1 2 1 k=1 ˆHk(ˆθk) 1(ˆφk φ ) 2. (54) Since Hk(θ k) 1 ρ2 ρσ Ip1 p1 = ρh Ip1 p1, by Lemma B.3, the event IHK = { ˆHk(ˆθk) 1 Hk(θ k) 1 2 c 2, k = 1, ..., K} implies { 1 K PK k=1 ˆHk(ˆθk) 1} 1 { 1 K PK k=1 Hk(θ k) 1} 1 2 2 c2 1 K PK k=1 ˆHk(ˆθk) 1 1 K PK k=1 Hk(θ k) 1 2. Using Lemma B.3 again with Hk(θ k) c Ip1 p1 as assumed in Assumption 7, the event HK = { ˆHk(ˆθk) Hk(θ k) 2 c 2, k = 1, ..., K} implies ˆHk(ˆθk) 1 Hk(θ k) 1 2 2 c2 ˆHk(ˆθk) Hk(θ k) 2, k = 1, 2, , K. Now for any ϵ > 0, define ϵH = min{ϵ, c}/4, then under the event Hϵ K = { ˆHk(ˆθk) Hk(θ k) 2 ϵH, k = 1, ..., K} (55) DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY we have { 1 K PK k=1 ˆHk(ˆθk) 1} 1 { 1 K PK k=1 Hk(θ k) 1} 1 2 ϵ. Now using the boundedness of ˆφWD φ , we have E ˆφWD φ 2 2 k=1 ˆHk(ˆθk) 1(ˆφk φ ) 2 2I(Hϵ K) k=1 (ˆφk φ k) 2 2I(ˆφ Φ) + C3P((Hϵ K)c). Thus we only need to separately bound the three terms on the RHS of (56). Let us first consider bounding P((Hϵ K)c). Denote ˆLk(θk) = 2 θk Mn,k(θk), Lk(θk) = 2 θk Mk(θk), ˆVk(θk) = ˆLk(θk) 1 ˆΣS,k(θk)ˆLk(θk) 1 and Vk(θk) = Lk(θk) 1ΣS,k(θk)Lk(θk) 1. By definition of ˆHk(θk) and the triangle s inequality, we have ˆHk(ˆθk) Hk(θ k) 2 ˆVk(ˆθk) ˆVk(θ k) 2 + ˆVk(θ k) Vk(θ k) 2. (57) Hence, we can bound those two terms on the RHS of (57) separately. Note that ˆVk(θk) Vk(θk) 2 = ˆLk(θk) 1 ˆΣS,k(θk)ˆLk(θk) 1 Lk(θk) 1ΣS,k(θk)Lk(θk) 1 2 2( ˆLk(θk) 1 Lk(θk) 1 2 2 + Lk(θk) 1 2 2) ˆΣS,k(θk) ΣS,k(θk) 2 (58) +( ˆLk(θk) 1 Lk(θk) 1 2 + 2 Lk(θk) 1 2) ΣS,k(θk) 2 ˆLk(θk) 1 Lk(θk) 1 2. Then, under the event Lϵ K = { ˆLk(θ k) Lk(θ k) 2 min{ϵρ2 /2, ρ /2}, k = 1, ..., K} with ρ being the lower bound of the eigenvalues of Lk(θ k) as assumed in Assumption 4, we have ˆVk(θ k) Vk(θ k) 2 (59) ρ2 ) ˆΣS,k(θ k) ΣS,k(θ k) 2 + (ϵ + 2 ρ2 ˆLk(θ k) Lk(θ k) 2, k = 1, ..., K. Similar to (58), we have ˆVk(ˆθk) ˆVk(θ k) 2 2( ˆLk(ˆθk) 1 ˆLk(θ k) 1 2 2 + ˆLk(θ k) 1 2 2) ˆΣS,k(ˆθk) ˆΣS,k(θ k) 2 +( ˆLk(ˆθk) 1 ˆLk(θ k) 1 2 + 2 ˆLk(θ k) 1 2) ˆΣS,k(θ k) 2 ˆLk(ˆθk) 1 ˆLk(θ k) 1 2. Define an event MK = ˆLk(θ k) 1 2 2 ρ , ˆΣS,k(θ k) 2 2ρσ, ˆLk(ˆθk) ˆLk(θ k) 2 min{ρ 4 , ϵρ2 8 }, k = 1, ..., K . GU AND CHEN Under this event, we have for all k = 1, 2, , K, ˆVk(ˆθk) ˆVk(θ k) 2 ρ2 ) ˆΣS,k(ˆθk) ˆΣS,k(θ k) 2 + (ϵ + 4 ρ2 ˆLk(ˆθk) ˆLk(θ k) 2 (C1Bn,k + C2Gn,k) ˆθk θ k 2, (60) where Bn,k = (1/n) Pn i=1 B(Xk,i) and Gn,k = (1/n) Pn i=1 G(Xk,i). Note that ˆLk(θ k) 1 2 2 ρ and ˆΣS,k(θ k) 2ρσ are implied by ˆLk(θ k) Lk(θ k) 2 ρ 2 and ˆΣS,k(θ k) ΣS,k(θ k) 2 ρσ 2 , respectively. Thus, we define the event UK = Bn,k 2B, Gn,k 2G, ˆLk(θ k) Lk(θ k) 2 C1, ˆΣS,k(θ k) ΣS,k(θ k) 2 C2, k = 1, ..., K , which satisfies UK MK Lϵ K. Under UK, we have ˆHk(ˆθk) Hk(θ k) 2 C ˆθk θ k 2 + ϵH 2 . Furthermore, we define the following event AK = UK ( K k=1Ek) ({ ˆθk θ k 2 ϵH 2C , k = 1, ..., K}). (61) By Lemma 6 in Zhang et al. (2013), under the event K k=1Ek, the event { ˆθk θ k 2 ϵH/(2C), k = 1, ..., K} is implied by the event { θk Mn,k(θ k) 2 (1 ρ)ρ ϵH/(4C), k = 1, ...K}. Thus, the event Ak can be equivalently expressed as AK = Bn,k 2B, Gn,k 2G, ˆLk(θ k) Lk(θ k) 2 C1, ˆΣS,k(θ k) ΣS,k(θ k) 2 C2, θk Mn,k(θ k) 2 C3, k = 1, ..., K . Now with the union bound and Lemma 7 in Zhang et al. (2013), we can obtain that P(Bn,k > 2B) + P(Gn,k > 2G) + P( ˆLk(θ k) Lk(θ k) 2 > C1) +P( ˆΣS,k(θ k) ΣS,k(θ k) 2 > C2) + P( θk Mn,k(θ k) 2 > C3) where v = min{v, v1 Next we consider bounding E 1 K PK k=1 ˆHk(ˆθk) 1(ˆφk φ ) 2 2I(Hϵ K) in (56). Recall the definition of Hϵ K in (55), we can naturally decompose the event into Hϵ K = K k=1Hϵ,(k) K , where DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY Hϵ,(k) K = { ˆHk(ˆθk) Hk(θ k) 2 ϵH}. It is noted that for each k, under the event Hϵ,(k) K , we have ˆHk(ˆθk) 1 2 2 c2 ˆHk(ˆθk) Hk(θ k) 2 + Hk(θ k) 1 2 C. Since elements of { ˆHk(ˆθk) 1(ˆφk φ )I(Hϵ,(k) K )}K k=1 are independent with one another, we decompose the term as follows: k=1 ˆHk(ˆθk) 1(ˆφk φ ) 2 2I(Hϵ K) K E ˆφk φ 2 2 + E ˆHk(ˆθk) 1(ˆφk φ )I(Hϵ,(k) K ) 2 2 The first term in the RHS of (63) can be bounded using Lemma B.2. For the second term, we have E ˆHk(ˆθk) 1(ˆφk φ )I(Hϵ,(k) K ) 2 2 2 E ( ˆHk(ˆθk) 1 Hk(θ k) 1)(ˆφk φ )I(Hϵ,(k) K ) 2 2 + 2 E Hk(θ k) 1(ˆφk φ )I(Hϵ,(k) K ) 2 2 C1E ˆHk(ˆθk) 1 Hk(θ k) 1 2 2I(Hϵ,(k) K ) ˆφk φ 2 2 + C E (ˆθk θ k)I(Hϵ,(k) K ) 2 2. Using (57), (59) and (60), and decomposing AK into K k=1A(k) K , we can show by H older s inequality that E ˆHk(ˆθk) 1 Hk(θ k) 1 2 2I(Hϵ,(k) K ) ˆφk φ 2 2 C1E ˆΣS,k(θ k) ΣS,k(θ k) 4 2 + ˆLk(θ k) Lk(θ k) 4 2 + θk Mn,k(θ k) 4 2 + C2P (A(k) K )c . C R8 + R4 + L4 n2 + P (A(k) K )c . Besides, we have that E (ˆφk φ )I(Hϵ,(k) K ) 2 2 2 E(ˆφk φ ) 2 2 + 2 E (ˆφk φ )(1 I(Hϵ,(k) K )) 2 2. The second term can be bounded by also conditioning on the event A(k) K , using the inequality (35) and the H older s inequality, and we will obtain E (ˆφk φ )(1 I(Hϵ,(k) K )) 2 2 C E θk Mn,k(θ k) 2 2I((Hϵ,(k) K )c) + P (A(k) K )c P (A(k) K )c + P (A(k) K )c ! n2 + P (A(k) K )c . GU AND CHEN And for the first term in the RHS, the following holds from the proof of Theorem 1 of Zhang et al. (2013): E ˆφk φ 2 2 C L2 + R2G2 In summary, we have that k=1 ˆHk(ˆθk) 1(ˆφk φ ) 2 2I(Hϵ K) C 1 + L2 + R2 n K + (L2 + L4) + R2(R2 + R6 + G2) n3 + P (A(k) K )c . (65) It remains to bound E 1 K PK k=1(ˆφk φ k) 2 2I(ˆφ Φ) . First we have that k=1 (ˆφk φ k) 2 2I(ˆφ Φ) k=1 E ˆφk φ k 2 2I(ˆφ Φ) . Then, it is noted by (54) that under the event Hϵ K, the event {ˆφ Φ} is equivalent to the event { 1 K PK k=1 ˆHk(ˆθk) 1(ˆφk φ ) 2 > C}, namely we have that k=1 ˆHk(ˆθk) 1(ˆφk φ ) 2I(Hϵ K) > C} (Hϵ K)c, then we can repeat the conditioning on A(k) K argument to obtain E ˆφk φ k 2 2I(ˆφ Φ) C R4 n2 + P(ˆφ Φ) + P (A(k) K )c . Finally, gathering all the above results, we obtain that E ˆφWD φ 2 2 n K + C2 (L2 + L4) + R2(R2 + R6 + G2) n3 + C4K 1 + L2v The proof is now complete. B.5 Proof of Theorem 4 Proof With the results in Lemma B.4 and Lemma B.5, the proof follows from a direct application of the Slutsky s lemma. B.6 Proof of Theorem 5 Proof Recall the definition of the event Hϵ K defined in (55) in the proof of Theorem 3, we can similarly define Hϵ K,j to control the estimation error of { ˆHk,j(ˆθk,j)}K k=1, j = 1, 2. Since E ˆφd WD φ 2 2 1 j=1 E ˆφd WD j φ 2 2 = E ˆφd WD 1 φ 2 2 , DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY it suffices to bound the last term. Under the event Hϵ K,1 and using boundedness of ˆφd WD 1 φ 2, E ˆφd WD 1 φ 2 2 k=1 ˆHk,1(ˆθk,1) 1(ˆφbc k,2 φ ) 2 2I(Hϵ K,1) k=1 (ˆφbc k,2 φ ) 2 2I(ˆφ1 Φ) +C3P (Hϵ K,1)c := C1E (R1) + C2E (R2) + C3P (Hϵ K,1)c , (66) which is similar to (56). We have derived the upper bound for in C3P (Hϵ K,1)c in (62), and it remains to control R1 and R2. Considering R1, we have that ˆ Hk,1(ˆθk,1)(ˆθk,2 θ k 1 n/2Bk(θ k)) 2 2I(Hϵ K,1) k=1 ˆBk,2(ˆθk,2)1Ek,bc,2 Bk(θ k) 2 2I(ˆθbc k,2 Θk), where ˆ Hk,1(ˆθk,1) = ˆHk,1(ˆθk,1) 1 0 . Due to Lemma B.7 , it remains to control the R(1) 1 term. We can decompose the event Hϵ K,1 as Hϵ K,1 = K k=1H(k),ϵ K,1 where Hϵ,(k) K,1 = { ˆHk,1(ˆθk,1) Hk(θ k) 2 ϵH}. Then, we have ˆ Hk,1(ˆθk,1)I(H(k),ϵ K,1 )(ˆθk,2 θ k 1 n/2Bk(θ k)) 2 2. Since { ˆ Hk,1(ˆθk,1)I(H(k),ϵ K,1 )}K k=1 are independent of {ˆθk,2}K k=1 and bounded, E R(1) 1 has a similar upper bound as that of Lemma B.8. The E (R2) term is of higher-order and its upper bound can be easily derived. Collecting all the above results, we have the following upper bound E ˆφd WD φ 2 2 n K + C2 R2(L2 + R2) n2K + C3 G2R2(G2 + R2 + G4) + A2R6 +C5K 1 + L2v where CB is the constant term over n appeared in (46). This completes the proof. B.7 Proof of Theorem 6 Proof Similar to Lemma B.5, we first prove that the following term is of op(N 1/2): s=1 ˆHs,1(ˆθs,1) 1} 1 K X k=1 ˆHk,1(ˆθk,1) 1(ˆφbc k,2 φ ) { s=1 Hs(θ s) 1} 1 K X k=1 Hk(θ k) 1(ˆφbc k,2 φ ) GU AND CHEN for K = o(n2). Note N RH 2I(Hϵ K,1) s=1 ˆHs,1(ˆθs,1) 1} 1 { 1 s=1 Hs(θ s) 1} 1 2I(Hϵ K,1) 1 k=1 ˆHk,1(ˆθk,1) 1(ˆφbc k,2 φ ) 2 k=1 ( ˆHk,1(ˆθk,1) 1 Hk(θ k) 1)(ˆφbc k,2 φ ) 2I(Hϵ K,1) := RH,1 + CRH,2 For K = o(n2), we have shown in the proof of Theorem 5 that I(Hϵ K,1) 1 k=1 ˆHk,1(ˆθk,1) 1(ˆφbc k,2 φ ) 2 = Op( 1 Besides, we also have that s=1 ˆHs,1(ˆθs,1) 1} 1 { 1 s=1 Hs(θ s) 1} 1 2I(Hϵ K,1) = Op( 1 n). Combining those two results we prove that RH,1 = Op( 1 n) = op(1). Considering RH,2, we have k=1 ( ˆ Hk,1(ˆθk,1) Hk,1(θ k))I(H(k),ϵ K,1 )(ˆθbc k,2 θ k) 2 := R(1) H,2. From previous derivations, we know that when K = o(n2) the leading order term in R(1) H,2 is k=1 ( ˆ Hk,1(ˆθk,1) Hk,1(θ k))I(H(k),ϵ K,1 ) 1 i=1 d(2) i,k (θ k) 2, where di,k(θ k)(j) = Qk(θ k) θk M(X(j) k,i ; θ k). So we only need to show that R(2) H,2 = op(1). Denote m = n/2, then using the independence between ˆ Hk,1(ˆθk,1) and d(2) i,k (θ k), we have that E(R(2) H,2)2 N k=1 E ( ˆ Hk,1(ˆθk,1) Hk,1(θ k))I(H(k),ϵ K,1 ) 2 2 E i=1 d(2) i,k (θ k) 2 2 Then by Markov s inequality it s direct to show R(2) H,2 = op(1). Now with Slutsky s lemma, it remains to establish the asymptotic normality of s=1 Hs(θ s) 1} 1 K X k=1 Hk(θ k) 1 ˆφbc k,j, and the proof directly follows from the proof of Lemmas B.7 and B.9. DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY B.8 Proof of Theorem 7 and Corollary 8 Proof The proof can be easily derived based on previous derivations. B.9 Proof of Theorem 9 Proof To apply Theorem 1 in Yuan and Jennrich (2000), we need to check the uniform convergence of 1 n Pn i=1 ψλ φ(Xk,i, θk)T ψλ λ(Xk,i; θk)T T . This is actually the last p2 columns of 2 θk Mn,k(θk) for θk Uk, so we only need to show the uniform convergence of 2 θk Mn,k(θk) in Uk. By Assumption 5, 2 θk M(x; θk) is Lipschitz continuous w.r.t. θk for θk Uk, then we can directly apply Corollary 3.1 of Newey (1991) to establish the required uniform convergence. Now we are to show ˆλ(2) k P λ k. Following the proof of Lemma 6 in Zhang et al. (2013), we can first show that under the event Ek, Mn,k(θk) is (1 ρ)ρ -strongly convex on the ball Uk = { θk θ k 2 ρk}, where ρk min{ρρ 4G , ρ}. Define the event EWD,k = { ˆφWD φ 2 < ρk/2}, then under this event, θ k = (ˆφWD, λ k) Uk. For any θ k = (ˆφWD, λk) Θk, if θ k Uk, then under EWD,k, there exists w0 [0, 1] such that θ k,0 = w0θ k + (1 w0) θ k lies on the surface of the ball Uk, and thus θ k,0 θ k 2 = w0 θ k θ k 2 (ρk 2 ). Now under EWD,k we have that Mn,k(θ k) Mn,k(θ k,0)+ < θk Mn,k(θ k,0), θ k θ k,0 > Mn,k( θ k)+ < θk Mn,k( θ k), θ k θ k > +1 2(1 ρ)ρ ρ2 k 4 + < θk Mn,k(θ k,0) θk Mn,k( θ k), θ k θ k,0 > Mn,k( θ k)+ < θk Mn,k( θ k), θ k θ k > +1 2(1 ρ)ρ ρ2 k 4 , (67) where the first inequality holds due to the convexity of Mn,k(θk) on Uk and the second holds due to the strong convexity on Uk. The last inequality holds due to θ k θ k = 1 w0 w0 (θ k,0 θ k). When θ k Uk, with strong convexity Equation (67) still holds with ρ2 k 4 changed to θ k θ k 2 2 = λk λ k 2 2. In any case the following relationship holds under the event EWD,k: Mn,k(θ k) Mn,k( θ k)+ < θk Mn,k( θ k), θ k θ k > +1 2(1 ρ)ρ min{ρ2 k 4 , λk λ k 2 2}. Rewriting the inequality we obtain that min{ λk λ k 2 2, ρ2 k 4 } 2 (1 ρ)ρ Mn,k(θ k) Mn,k( θ k)+ < θk Mn,k( θ k), θ k θ k > Mn,k(θ k) Mn,k( θ k) + θk Mn,k( θ k) 2 θ k θ k 2 . (68) Now if we denote θ k,1 = (ˆφWD, ˆλ(2) k ) and set θ k = κθ k,1 + (1 κ) θ k for any fixed κ [0, 1], we will have min{κ ˆλ(2) k λ k 2, ρ2 k 4κ ˆλ(2) k λ k 2 } 2(Mn,k(κθ k,1 + (1 κ) θ k) Mn,k( θ k)) κ(1 ρ)ρ ˆλ(2) k λ k 2 +2 θk Mn,k( θ k) 2 (1 ρ)ρ . GU AND CHEN By definition we have Mn,k(θ k,1) Mn,k( θ k) and thus by convexity we have min{κ ˆλ(2) k λ k 2, ρ2 k 4κ ˆλ(2) k λ k 2 } 2 θk Mn,k( θ k) 2 (1 ρ)ρ . Define the event Es,k = { 2 θk Mn,k( θ k) 2 (1 ρ)ρ ρk 2 }, then under this event we have min{κ ˆλ(2) k λ k 2, ρ2 k 4κ ˆλ(2) k λ k 2 } ρk If ˆλ(2) k λ k 2 > ρk 2 , we can set κ = ρk 2 ˆλ(2) k λ k 2 , which leads to a contradiction. Thus we have ˆλ(2) k λ k 2 ρk 2 . Then using Equation (68) we have ˆλ(2) k λ k 2 < 2 θk Mn,k( θ k) 2 (1 ρ)ρ . (69) Since ˆφWD is consistent, we have P(EWD,k) 1. Besides, we already know that P(Ek) 1. Due to the form of the event Es,k and inequality (69), it remains to show θk Mn,k( θ k) 2 = o P (1) to establish the consistency of ˆλ(2) k . Note θk Mn,k( θ k) 2 θk Mn,k( θ k) θk Mn,k(θ k) 2 + θk Mn,k(θ k) 2 and the latter term is of Op( 1 nv1 ). Using the consistency of ˆφWD we can show θk Mn,k( θ k) θk Mn,k(θ k) 2 = op(1). Besides, since ˆφWD is N-consistent and K , then n(ˆφWD φ ) = op(1) and the asymptotic normality of n 1 n Pn i=1 ψλ(Xk,i; θ k) + Ψφ λ(θ k)(ˆφWD φ ) is implied by the asymptotic normality of n 1 n Pn i=1 ψλ(Xk,i; θ k) and Slutsky s lemma. Now we apply Theorem 1 in Yuan and Jennrich (2000) and the result follows. Appendix C. Additional numerical results C.1 Simulation results based on the errors in variables model In this simulation experiment, we simulated the errors-in-variables Model (6) with the objective function (7) to compare the performance of the full sample, the split-and-conquer and the weighted distributed estimators: ˆφfull, ˆφSa C and ˆφWD. The simulation was carried out by first generating IID {Zi,k} from N(µZ, σ2 Z), and then upon given a Zi,k, (Xk,i, Yi,k)T were independently drawn from N (Zi,k, φ + λ k Zi,k)T , σ2I2 2 . We chose φ = 1, K = 2, σ2 = 1 and n1 = n2 = 5 104 = N/2, and λ 1, λ 2, µZ and σ2 Z were those reported in Table 4 under four scenarios. As discussed in Section 3, the relative efficiency of ˆφfull to ˆφSa C depends on the ratio σ2(E(Z))2/(Var(Z)E(Z2)) as shown in (8). We designed four scenarios according to the above ratio under λ 1 = λ 2 and E (Z) = 0, respectively, which represented the settings where the full sample estimator ˆφfull would be less (Scenario 1) or more (Scenario 2) efficient than the split-and-conquer estimator as predicted by the ratio but not as efficient as the weighted distributed estimator ˆφWD. Scenario 3 (λ 1 = λ 2, E (Z) = 0) was the case when ˆφfull and ˆφWD would be asymptotically equivalent, and both estimators would be more efficient than ˆφSa C. Scenario 4 was the homogeneous case with λ 1 = λ 2 DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY in which all three estimators would have the same asymptotic efficiency. For all four scenarios, the ARE column of Table 4 confirmed the relative efficiency as predicted by the asymptotic variances in (8), and was well reflected in the comparison of the root mean squared errors, as the bias is of smaller order as compared with that of the standard deviation and thus negligible. Table 4: Average root mean squared error (RMSE) and the standard deviation (SD), multiplied by 102, of the full sample estimator ˆφfull, the Sa C estimator ˆφSa C and the WD estimator ˆφWD under four scenarios for the errors-in-variables model (12) for N = 105, K = 2 and n1 = n2. AREs (asymptotic relative efficiency) of ˆφfull to ˆφSa C are calculated from (8) ˆφfull ˆφSa C ˆφWD Scenario (λ 1, λ 2) ARE RMSE SD RMSE SD RMSE SD Scenario 1 (0.25,3.25) 0.89 4.55 4.51 4.12 4.09 3.91 3.89 (µZ = 1, σ2 Z = 0.1) (0.5,3.5) 0.93 4.65 4.65 4.35 4.35 4.08 4.08 (0.75,3.75) 0.97 4.52 4.52 4.40 4.38 4.13 4.13 Scenario 2 (0.25,2.25) 1.18 2.95 2.95 3.24 3.24 2.89 2.89 (µZ = 3, σ2 Z = 0.5) (0.75,2.75) 1.28 3.28 3.26 3.65 3.64 3.17 3.16 (1.25,3.25) 1.31 3.71 3.71 4.16 4.07 3.64 3.61 Scenario 3 (0.25,2.25) 1.97 0.41 0.41 0.61 0.61 0.41 0.41 (µZ = 0, σ2 Z = 0.5) (0.75,2.75) 1.92 0.51 0.51 0.70 0.70 0.51 0.51 (1.25,3.25) 1.68 0.64 0.64 0.82 0.82 0.64 0.64 Scenario 4 (0.5,0.5) 1 3.25 3.24 3.31 3.28 3.30 3.26 (µZ = 4, σ2 Z = 0.5) (1.0,1.0) 1 3.53 3.53 3.59 3.59 3.59 3.59 (1.5,1.5) 1 4.06 4.03 4.08 4.07 4.06 4.06 C.2 Simulation results based on the logistic model Figure 3 reports the absolute bias and root mean squared errors of the estimators when p2 = 4. Table 5 reports the empirical coverage and the average width of the CIs when p2 = 4. Table 6 reports the average CPU time per simulation run based on 500 replications of the five estimators for a range of K for the logistic regression model with p2 = 4. It is observed in Figure 3 that the bias of the estimators were smaller with p2 = 4 compared to the results with p2 = 10 in Figure 1. As a consequence, the CIs based on the weighted distributed estimator had adequate coverage probabilities even when K = 1000. C.3 Pre-processing of the real data The arrival delay of the previous flight that utilized the same plane was obtained by matching the tail number of the plane. The three meteorological factors (rain rate, close surface air pressure, and temperature) were obtained by matching this airline s on-time performance data with the ERA5 hourly data (https://cds.climate.copernicus.eu/). This dataset includes reanalysis from 1959 onwards whose temporal and spatial resolutions are one hour and 0.25 0.25 , respectively. We applied the f(x) = log(1 + x) transformation to the rain variable due to its serious skewness. We also standardized each covariate in each of the data blocks before performing the logistic regression analysis. GU AND CHEN (a) Absolute Bias (p2 = 4) (b) RMSE (p2 = 4) Figure 3: Average simulated bias (a) and the root mean squared errors (RMSE) (b) of the weighted distributed (WD) (red circle), the split-and-conquer(Sa C) (blue triangle), the debiased split-andconquer (d Sa C) (green square), the debiased weighted distributed (d WD) (purple cross), the subsampled average mixture SAVGM (pink square cross) estimators, with respect to the number of data block K for the logistic regression model with the dimension p2 of the nuisance parameter λk being 4, and the full sample size N = 2 106. Table 5: Coverage probabilities and widths (in parentheses, multiplied by 100) of the 1 α confidence intervals for the common parameter φ in the logistic regression model based on the asymptotic normality of the split-and-conquer (Sa C), the weighted distributed (WD), the debiased splitand-conquer (d Sa C) and the debiased weighted distributed (d WD) estimators with respect to the number of data blocks K. The dimension p2 of the nuisance parameter λk is 4 and total sample size N = 2 106 K Sa C WD d Sa C d WD 1 α 0.99 0.95 0.90 0.99 0.95 0.90 0.99 0.95 0.90 0.99 0.95 0.90 10 0.99 0.96 0.92 0.99 0.97 0.91 0.99 0.96 0.92 0.99 0.96 0.91 (2.45) (1.87) (1.57) (2.03) (1.55) (1.30) (2.45) (1.87) (1.57) (2.03) (1.55) (1.30) 50 0.99 0.95 0.91 0.98 0.93 0.89 0.99 0.95 0.91 0.99 0.93 0.88 (2.36) (1.80) (1.51) (1.97) (1.50) (1.26) (2.36) (1.80) (1.51) (1.97) (1.50) (1.26) 100 0.98 0.94 0.91 0.99 0.95 0.91 0.99 0.95 0.91 0.99 0.95 0.91 (2.36) (1.79) (1.51) (1.96) (1.49) (1.25) (2.36) (1.79) (1.51) (1.96) (1.49) (1.25) 250 0.99 0.93 0.85 0.99 0.95 0.90 0.99 0.96 0.91 0.99 0.95 0.90 (2.36) (1.79) (1.50) (1.96) (1.49) (1.25) (2.36) (1.79) (1.50) (1.96) (1.49) (1.25) 500 0.91 0.77 0.66 0.99 0.95 0.88 0.99 0.96 0.90 0.99 0.95 0.89 (2.36) (1.80) (1.51) (1.96) (1.49) (1.25) (2.36) (1.80) (1.51) (1.96) (1.49) (1.25) 1000 0.65 0.41 0.28 0.99 0.94 0.88 0.99 0.94 0.88 0.99 0.93 0.88 (2.38) (1.81) (1.52) (1.96) (1.49) (1.25) (2.38) (1.81) (1.52) (1.97) (1.50) (1.25) 2000 0.01 0.01 0.00 0.99 0.91 0.81 0.98 0.94 0.88 0.99 0.94 0.90 (2.42) (1.84) (1.55) (1.96) (1.50) (1.25) (2.42) (1.84) (1.55) (1.98) (1.50) (1.26) DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY Table 6: Average CPU time for each replication based on B = 500 replications for the split-andconquer (Sa C), Zhang s SAVGM, the weighted distributed (WD), the debiased split-and-conquer (d Sa C) and the debiased weighted distributed (d WD) estimators for the logistic regression model with respect to K. The dimension p2 of the nuisance parameter λk is 4 and total sample size N = 2 106 K Sa C SAVGM WD d Sa C d WD 10 15.65 15.97 18.50 20.00 21.95 50 9.63 9.95 10.66 12.37 14.59 100 8.09 8.63 8.76 10.50 12.05 250 8.49 9.69 9.07 10.84 12.82 500 9.68 11.58 10.25 11.97 14.84 1000 11.67 13.81 12.32 13.93 19.08 2000 15.78 19.68 16.57 18.11 28.55 We chose the parameter of the three meteorological factors as the common parameter based on Figure 4, which shows that the local estimates of those three parameters are the most concentrated. Figure 4: Histogram of the parameter estimates across the data blocks Temperature Air Pressure Rain Previous Delay Spring Summer Autumn 1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0 Estimates (K = 479) GU AND CHEN T. Ando. Concavity of certain maps on positive definite matrices and applications to hadamard products. Linear Algebra and its Applications, 26:203 241, 08 1979. doi: 10.1016/0024-3795(79) 90179-4. M. Bartlett. Approximate confidence intervals. Biometrika, 40:12 19, 01 1953. doi: 10.2307/ 2333091. H. Battey, J. Fan, H. Liu, J. Lu, and Z. Zhu. Distributed testing and estimation under sparse high dimensional models. The Annals of Statistics, 46:1352 1382, 06 2018. doi: 10.1214/17-AOS1587. R. J. Carroll and D. Ruppert. The use and misuse of orthogonal regression in linear errors-invariables models. The American Statistician, 50(1):1 6, 1996. ISSN 00031305. URL http: //www.jstor.org/stable/2685035. S. X. Chen and L. Peng. Distributed statistical inference for massive data. The Annals of Statistics, 49:2851 2869, 02 2021. X. Chen and M. Xie. A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica, 24:1655 1684, 10 2014. doi: 10.5705/ss.2013.088. X. Chen, W. Liu, and Y. Zhang. Quantile regression under memory constraint. The Annals of Statistics, 47:3244 3273, 12 2019. doi: 10.1214/18-AOS1777. R. Duan, Y. Ning, and Y. Chen. Heterogeneity-aware and communication-efficient distributed statistical inference. Biometrika, 109(1):67 83, 02 2021. ISSN 1464-3510. doi: 10.1093/biomet/ asab007. URL https://doi.org/10.1093/biomet/asab007. T. Evgeniou and M. Pontil. Regularized multi task learning. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 109 117, 08 2004. doi: 10.1145/ 1014052.1014067. W. Fuller. Measurement Error Models. Wiley, 01 1987. ISBN 9780471861874. doi: 10.2307/ 1164709. L. Hansen. Large sample properties generalized method of moments estimators. Econometrica, 50: 1029 1054, 02 1982. doi: 10.2307/1912775. E. Haynsworth. Applications of an inequality for the schur complement. Proceedings of the American Mathematical Society, 24:512 516, 03 1970. doi: 10.1090/S0002-9939-1970-0255580-7. M. Henmi and S. Eguchi. A paradox concerning nuisance parameters and projected estimating functions. Biometrika, 91:929 941, 02 2004. doi: 10.1093/biomet/91.4.929. C. Huang and X. Huo. A distributed one-step estimator. Mathematical Programming, 174:41 76, 11 2019. doi: 10.1007/s10107-019-01369-0. M. Jordan, J. Lee, and Y. Yang. Communication-efficient distributed statistical learning. Journal of the American Statistical Association, 114:668 681, 05 2019. doi: 10.1080/01621459.2018. 1429274. DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY P. Kairouz, H. Mc Mahan, B. Avent, A. Bellet, M. Bennis, A. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, R. D Oliveira, H. Eichner, S. El Rouayheb, D. Evans, J. Gardner, Z. Garrett, A. Gasc on, B. Ghazi, P. Gibbons, and S. Zhao. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14:1 210, 01 2021. doi: 10.1561/9781680837896. A. Kleiner, A. Talwalkar, P. Sarkar, and M. Jordan. A scalable bootstrap for massive data. Journal of the Royal Statistical Society Series B (Statistical Methodology), 76:795 816, 12 2011. doi: 10.1111/rssb.12050. T. Lai and J. Wang. Edgeworth expansions for symmetric statistics with applications to bootstrap methods. Statistica Sinica, 3:517 542, 01 1993. H. Li, B. Lindsay, and R. Waterman. Efficiency of projected score methods in rectangular array asymptotics. Journal of the Royal Statistical Society Series B, 65:191 208, 02 2003. doi: 10. 1111/1467-9868.00380. T. Li, A. Sahu, A. Talwalkar, and V. Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37:50 60, 05 2020. doi: 10.1109/MSP.2020. 2975749. T. Li, S. Hu, A. Beirami, and V. Smith. Ditto: Fair and robust federated learning through personalization. In International Conference on Machine Learning (ICML), pages 6357 6368, 2021. URL http://proceedings.mlr.press/v139/li21h.html. N. Lin and R. Xi. Fast surrogates of U-statistics. Computational Statistics & Data Analysis, 54: 16 24, 01 2010. doi: 10.1016/j.csda.2009.08.009. T.-T. Lu and S.-H. Shiou. Inverses of 2 2 block matrices. Computers & Mathematics With Applications - COMPUT MATH APPL, 43:119 129, 01 2002. doi: 10.1016/S0898-1221(01) 00278-4. O. Marfoq, G. Neglia, A. Bellet, L. Kameni, and R. Vidal. Federated multi-task learning under a mixture of distributions. In Advances in Neural Information Processing Systems (Neur IPS), volume 34, pages 15434 15447, 2021. P. Mc Cullagh. Quasi-likelihood functions. The Annals of Statistics, 11:59 67, 03 1983. doi: 10.1214/aos/1176346056. B. Mc Mahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas. Communication-Efficient Learning of Deep Networks from Decentralized Data. International Conference on Artificial Intelligence and Statistics (AISTATS), 54:1273 1282, 2017. W. Newey. Uniform convergence in probability and stochastic equicontinuity. Econometrica, 59: 1161 1167, 02 1991. doi: 10.2307/2938179. O. Reiersol. Identifiability of a linear relation between variables which are subject to error. Econometrica, 18:375 389, 10 1950. doi: 10.2307/1907835. GU AND CHEN P. Rilstone, V. Srivastava, and A. Ullah. The second-order bias and mean squared error of nonlinear estimators. Journal of Econometrics, 124:369 395, 12 1996. doi: 10.1016/0304-4076(96) 89457-7. S. Sengupta, S. Volgushev, and X. Shao. A subsampled double bootstrap for massive data. Journal of the American Statistical Association, 111:1222 1232, 08 2015. doi: 10.1080/01621459.2015. 1080709. V. Smith, C.-K. Chiang, M. Sanjabi, and A. Talwalkar. Federated Multi-Task Learning. Advances in Neural Information Processing Systems(Neur IPS), 05 2017. L. Stefanski and D. Boos. The Calculus of M-Estimation. The American Statistician, 56:29 38, 02 2002. doi: 10.1198/000313002753631330. C. T. Dinh, N. Tran, and J. Nguyen. Personalized federated learning with moreau envelopes. In Advances in Neural Information Processing Systems (Neur IPS), volume 33, pages 21394 21405, 2020. A. van der Vaart. Asymptotic Statistics, chapter 5. Cambridge University Press, 01 1999. doi: 10.1017/CBO9780511802256. S. Volgushev, S.-K. Chao, and G. Cheng. Distributed inference for quantile regression processes. Annals of Statistics, 47, 01 2017. doi: 10.1214/18-AOS1730. Q. Yang, Y. Liu, T. Chen, and Y. Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology, 10:1 19, 01 2019. doi: 10.1145/3298981. A. Yaron, L. Hansen, and J. Heaton. Finite-Sample Properties of Some Alternative GMM Estimators. Journal of Business & Economic Statistics, 14:262 80, 02 1996. doi: 10.1080/07350015. 1996.10524656. K.-H. Yuan and R. Jennrich. Estimating equations with nuisance parameters: Theory and applications. Annals of the Institute of Statistical Mathematics, 52:343 350, 02 2000. doi: 10.1023/A:1004122007440. Y. Zhang, J. Duchi, and M. Wainwright. Comunication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14:3321 3363, 2013. doi: 10.1109/CDC.2012. 6426691. T. Zhao, G. Cheng, and H. Liu. A partially linear framework for massive heterogeneous data. The Annals of Statistics, 44:1400 1437, 10 2014. doi: 10.1214/15-AOS1410.