# distributed_statistical_inference_under_heterogeneity__65d1cc86.pdf

Journal of Machine Learning Research 24 (2023) 1-56 Submitted 11/22; Revised 9/23; Published 12/23

Distributed Statistical Inference under Heterogeneity

Jia Gu GUJIA@PKU.EDU.CN Center for Statistical Science Peking University Bejing, China

Song Xi Chen SONGXICHEN@PKU.EDU.CN School of Mathematical Science, Guanghua School of Management and Center for Statistical Science, Peking University Beijing, China

Editor: Tong Zhang

We consider distributed statistical optimization and inference in the presence of heterogeneity among distributed data blocks. A weighted distributed estimator is proposed to improve the statistical efﬁciency of the standard split-and-conquer estimator for the common parameter shared by all the data blocks. The weighted distributed estimator is at least as efﬁcient as the would-be full sample and the generalized method of moment estimators with the latter two estimators requiring full data access. A bias reduction is formulated for the weighted distributed estimator to accommodate much larger numbers of data blocks (relaxing the constraint from K = o(N 1/2) to K = o(N 2/3), where K is the number of blocks and N is the total sample size) than the existing methods without sacriﬁcing the statistical efﬁciency at the same time. The mean squared error bounds, the asymptotic distributions, and the corresponding statistical inference procedures of the weighted distributed and the debiased estimators are derived, which show an advantageous performance of the debiased weighted estimators when the number of data blocks is large. Keywords: bias correction; distributed inference; heterogeneity; split-and-conquer method; weighted estimation.

1 Introduction

Modern big data have brought new challenges to statistical inference. One such challenge is that despite the sheer volume of the data, full communication among the data points may not be possible due to either the cost of data communication or the privacy concern. The distributed or the splitand-conquer method has been proposed to divide the full data sample into smaller size data blocks to avoid data communication. The split-and-conquer estimator is also suited to situations where the data are naturally divided into data blocks and data communication among the data blocks are prohibited due to privacy concern. The split-and-conquer estimation has been considered in Lin and Xi (2010) for the U-statistics, Zhang et al. (2013) for the statistical optimization, Chen and Xie (2014) for the generalized linear models, Volgushev et al. (2017) and Chen et al. (2019) for the quantile regression, Battey et al. (2018) for high dimensional testing and estimation, and Chen and Peng (2021) for asymptotic symmetric statistics (Lai and Wang, 1993). Bootstrap resampling-based methods had been introduced to facilitate statistical inference. Kleiner et al. (2011) proposed the bag-of-little bootstrap (BLB) method for the plug-in estimators by making up economically the full

2023 Jia Gu and Song Xi Chen.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-1274.html.

GU AND CHEN

sample for the distributed inference. Sengupta et al. (2015) suggested a sub-sampled double bootstrap method designed to improve the computational efﬁciency of the BLB. Chen and Peng (2021) proposed the distributed and the pseudo-distributed bootstrap methods with the former conducting the resampling within each data block while the latter directly resampling the distributed statistics. Privacy has been a major concern in big data applications where people are naturally reluctant to share the raw data to form a pool of big data as practised in the traditional full sample estimation. However, the data holders may like to contribute summary statistics without having to give away the full data information. Federated Learning or the distributed inference with a central host has been proposed to accommodate such reality (Mc Mahan et al., 2017; Yang et al., 2019; Li et al., 2020; Kairouz et al., 2021), where summary statistics of the data blocks or the gradients of the objective functions associated with the private data blocks are submitted to a central host for forming aggregated estimation or computation. Homogeneous distribution among the data blocks is assumed in the majority of the statistical distributed inference studies with a few exceptions (Zhao et al., 2014; Duan et al., 2021). Federated Learning, on the other hand, was introduced to mitigate challenges arising from classical distributed optimization. In particular, heterogeneous or non-IID distributed data across different data blocks is one of the deﬁning characteristics in the Federated Learning (Li et al., 2020; Kairouz et al., 2021). Indeed, it is natural to expect the existence of heterogeneity, especially for data stored in different locations or generated by different stochastic mechanisms, for instance, mobile phones of different users. But few works have focused on the asymptotic statistical properties of the estimator, especially in a heterogeneous setting. Main Contributions. This paper considers distributed statistical inference under heterogeneous distributions among the data blocks, where there is a common parameter shared by the distributions of the data blocks and data-block-speciﬁc heterogeneous parameters. It is noted that Duan et al. (2021) also considered a heterogeneous setting but under a fully parametric framework. Speciﬁcally, the main contributions of this paper are as follows:

Our study reveals that in the presence of heterogeneity the full sample estimator of the common parameter obtained by requiring full data access, can be less efﬁcient than the split-andconquer estimator. It is found that this phenomenon disappears if the objective function of the statistical optimization satisﬁes a generalized second-order Bartlett s identity.

We propose a weighted distributed (WD) estimator, which is asymptotically at least as efﬁcient as the full sample and the split-and-conquer estimators when the number of data blocks K = o(N1/2), where K is the number of data blocks and N is the total sample size. The mean squared error bound and the asymptotic distribution of the proposed weighted distributed estimator are derived, as well as the asymptotic equivalence between the weighted distributed and the generalized method of moment (GMM) estimator (Hansen, 1982).

We also propose a debiased weighted distributed estimator with a data splitting mechanism on each data block to remove the dependency between the bias correction and the weights used to tackle the heterogeneity. The debiased weighted distributed estimator is asymptotically as efﬁcient as the WD estimator but allows quicker growth for the number of blocks K = o(N2/3). The bias correction is also applied to the split-and-conquer formulation, leading to a more communication-efﬁcient debiased split-and-conquer estimator.

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

2 Preliminaries

Suppose that there is a large data sample of size N, which is divided into K data blocks of sizes {nk}K k=1 such that N = PK k=1 nk and let n = NK 1 be the average sample size of the data blocks. For the relative sample sizes among data blocks, we assume the following

Assumption 1 There exist c, C > 0 such that c nk1/nk2 C for all pairs of (k1, k2).

The k-th data block consists of a sub-sample {Xk,i}nk i=1 which are independent and identically distributed (IID) random vectors from a probability space (Ω, F, P) to (Rd, Rd) with Fk as the distribution. The K distributions {Fk}K k=1 share a common parameter φ Rp1, while each Fk has another parameter λk Rp2 speciﬁc to Fk. The parameters of interests in the k-th block are θk = (φT , λT k )T Rp where p = p1 + p2, and the overall parameters of interests are θ = (φT , λT 1 , λT 2 , ..., λT K)T Rp1+Kp2. Suppose there is a common objective function M(X; φ, λk) that is convex with respect to the parameter θk = (φT , λT k )T and facilitates the statistical optimization in each data block. In general, the loss function can be made block-speciﬁc, say Mk( ; ) function. Indeed, the presence of the heterogeneous local parameters {λk}K k=1 leads to different Mk(x, φ) = M(x; φ, λk) for the inference on φ, which connects to the multi-task learning. In the k-th data block the true parameter θ k = (φ T , λ T k )T is deﬁned as the unique minimum of the expected objective function, namely

θ k = arg min θk Θk EFk (M(Xk,1; φ, λk)) .

The true common parameter φ appears in all θ k, and the block-speciﬁc {λ k}K k=1 may differ from each other. The entire true parameters θ = (φ T , λ T 1 , , λ T K )T , can be also identiﬁed as

θ = arg min θ Θ

k=1 γk EFk (M(Xk,1; φ, λk)) ,

where γk = limk (nk/N) is the asymptotic proportion of the local sample size nk relative to the total sample size N such that PK k=1 γk = 1. Our study is conducted under the semiparametirc setting, which does not assume knowledge of the data distribution of X but knowing the functional form of M( ; ), and hence semiparametric. It is a setting situated between the parametric and the nonparametric settings. Parametric setting means a full knowledge in the data generation distribution, which means that in the context of the M-estimation one knows the form of each Pk up to some parameter θ so that Pk = Pk(θ), where Pk(θ) is a parametrized form of Pk. In contrast, the nonparametric setting is where neither the form of the target quantities nor the distribution of X is known. The difﬁculty with the nonparametric setting is due to not knowing the forms of the data distribution and the quantities of interests. If the data could be shared across the data blocks, we would attain the conventional full sample estimator

ˆθfull = arg min θ Θ

i=1 M(Xk,i; φ, λk), where ˆθfull = (ˆφT full, ˆλT 1,full, , ˆλT K,full)T ,

which serves as a benchmark for distributed estimators. The estimating equations for the full sample estimators are (PK k=1 Pnk i=1 ψφ(Xk,i; φ, λk) = 0, Pnk i=1 ψλ(Xk,i; φ, λk) = 0 k = 1, ..., K, (1)

GU AND CHEN

where ψφ(Xk,i; φ, λk) = M(Xk,i; φ, λk)/ φ and ψλ(Xk,i; φ, λk) = M(Xk,i; φ, λk)/ λk are the score functions. The above full sample estimation is not attainable for the distributed situations due to privacy or the costs associated with the data communication. The distributed estimation ﬁrst conducts local estimation on each data block, namely the local estimator ˆθk = (ˆφT k , ˆλT k )T = arg minθk Θk Pnk i=1 M(Xk,i; θk) with the corresponding estimating equations

(Pnk i=1 ψφ(Xk,i; φk, λk) = 0, Pnk i=1 ψλ(Xk,i; φk, λk) = 0, (2)

and the split-and-conquer estimator for the common parameter φ is

k=1 nk ˆφk. (3)

The heterogeneity among the distributions of the data blocks calls for study the relative efﬁciency and the estimation errors, which are the focus of this paper. We are to show that the splitand-conquer estimator (3) may not be the best formulation for estimating φ. Throughout this paper, unless otherwise stated, 2 represents the ℓ2 norm of a vector and a matrix. We will use C and Ci to denote positive constants independent of (nk, K, N). An important question is the efﬁciency and the estimation errors of the split-and-conquer estimator ˆφSa C relative to the full sample estimator ˆφfull. For the homogeneous case, Chen and Peng (2021) found that for the asymptotic symmetric statistics, the split-and-conquer estimator (3) attains the same efﬁciency as the full sample estimator in the non-degenerate case but encounters an efﬁciency loss in the degenerate case1, due to a lack of communication among different data blocks. Zhang et al. (2013) derived the mean squared error bound for the split-and-conquer estimator in the homogeneous case and showed that whenever K = O(N1/2), the split-and-conquer estimator achieves the best possible rate of convergence when all N data are accessible. Consider the estimating equations of the full sample statistical optimization

PK k=1 Pnk i=1 ψφ(Xk,i; φ, λk) Pn1 i=1 ψλ(X1,i; φ, λ1) ... Pn K i=1 ψλ(XK,i; φ, λK)

Let Ψθ(θk) = θk E (M(Xk,1; θk)) and Jk(θk) = T θkΨθ(θk) be the ﬁrst and second order gradients of the k-th population objective function, respectively, whose matrix forms are:

Ψθ(θk) = (Ψφ(θk)T , Ψλ(θk)T )T , Jk(θk) =

Ψφ φ(θk) Ψλ φ(θk) Ψφ λ(θk) Ψλ λ(θk)

1. The symmetric statistics admits the expansion Tn = θ+n 1 Pn i=1 α(Xi; F)+n 2 P

1 i<j n β(Xi, Xj; F)+Rn (Lai and Wang, 1993), which covers the U-statistics and M-estimator as special cases. The functions α(x; F) and β(x, y; F), depending on the underlying data distribution F, are known measurable functions of x and y, satisfying E (α(X1; F)) = 0 and Var(α(X1; F)) = σ2 α [0, ) and β(x, y; F) being symmetric in x and y such that E (β(X1, X2; F)|X1) = 0 and Var(β(X1, X2; F)) = σ2 β [0, ). The degenerate and non-degenerate case correspond to the σ2 α = 0 and σ2 α > 0 cases, respectively.

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

Let Jφ|λ(θk) = Ψφ φ(θk) Ψλ φ(θk)Ψλ λ(θk) 1Ψφ λ(θk), Jλ|φ(θk) = Ψλ λ(θk) Ψφ λ(θk)Ψφ φ(θk) 1Ψλ φ(θk), Sφ(Xk,i; θk) = ψφ(Xk,i; θk) Ψλ φ(θk)Ψλ λ(θk) 1ψλ(Xk,i; θk) and Sλ(Xk,i; θk) = ψλ(Xk,i; θk)

Ψφ λ(θk)Ψφ φ(θk) 1ψφ(Xk,i; θk). Then, apply Taylor s expansion to obtain (see Appendix A.1)

N Jφ|λ(θ k)

i=1 Sφ(Xk,i; θ k)

+ op(N 1/2).

For the local estimator (ˆφk, ˆλk) that solves (2), the same derivation leads to (ˆφk φ = n 1 k Jφ|λ(θ k) 1 Pnk i=1 Sφ(Xk,i; θ k) + op(n 1/2 k ), ˆλk λ k = n 1 k Jλ|φ(θ k) 1 Pnk i=1 Sλ(Xk,i; θ k) + op(n 1/2 k ).

Our analysis requires the following conditions.

Assumption 2 (Identiﬁability) The parameters θ k = (φ , λ k) is the unique minimizer of Mk(θk) = E(M(Xk,1; θk)) for θk Θk.

Assumption 3 (Compactness) The true parameter θ k is an interior point of the parameter space Θk which is a compact and convex set in Rp, and supθk Θk θk θ k 2 r for all k 1 and some r > 0. The true common parameter φ is an interior point of a subset Φ Θk.

Assumption 4 (Local strong convexity) The population objective function on the k-th data block Mk(θk) = E(M(Xk,1; θk)) is twice differentiable, and there exists a constant ρ > 0 such that 2 θk Mk(θ k) ρ Ip p. Here A B means A B is a positive semi-deﬁnite matrix.

Assumption 5 (Smoothness I) The objective function on the k-th data is twice differentiable with respect to θk Uk, where Uk = {θk | θk θ k 2 ρ} and ρ is a positive constant. There are positive constants R, L, v and v1 such that

E θk M(Xk,1; θ k) 2v1 2 R2v1 and E 2 θk M(Xk,1; θ k) 2 θk Mk(θ k) 2v 2 L2v

for all k 1. There also exist a positive function G( ) and a corresponding positive constant G such that 2 θk M(x; θk) 2 θk M(x; θ k) 2 G(x) θk θ k 2

for all θk, θ k Uk and x Rd, and E(G(Xk,1)2v) G2v.

Assumptions 1-5 are standard ones on the parameter space and population objective functions for the homogeneous case (Jordan et al., 2019). In the heterogeneous case, Duan et al. (2021) requires the parameter space for the common parameter to be bounded, i.e. φ φ r under a fully parametric setting, while we need the overall parameter space to be bounded. The stronger condition is needed since we do not fully specify the distributions {Fk}K k=1 and it will be used when we derive the mean squared error bound for the proposed weighted distributed estimator in Section 4. Besides, since the differentiability of the objective function is assumed locally, we need Assumption 5 to deﬁne a high probability event in (34), under which all the derivatives are well-deﬁned.

GU AND CHEN

3 Full Sample versus split-and-conquer Estimation

When the full data communication is available, one would in general prefer making the estimation based on the pooled data, namely solving the estimating equations (1) rather than the local estimating equations (2). This choice is based on the common belief that the estimator ˆφfull utilizes the full sample information and allow full communications among data blocks. In contrast, for distributed data, some information requiring interactions between observations may be prohibited (the degenerate U-statistics), and thus may reduces the statistical efﬁciency of the Sa C estimator ˆφSa C

as compared with ˆφfull. In the homogeneous setting, there have been works justifying this reality (Zhang et al., 2013; Chen and Peng, 2021). However, we will show in this section that this is not necessarily the case under heterogeneity via a theorem and a concrete example. It is worth mentioning that we assume K being ﬁxed in the following Proposition 1 and Theorem 2 for simplicity of formulating the asymptotic variance of the estimators, which helps us to motivate the construction of the weighted distributed estimator. We will show more general results and allow diverging K along with N in the subsequent theoretical results and the supplementary material (see Lemma B.2 for the uniform consistency of {ˆθk}K k=1 under diverging K and the relationship between the divergence rate and the smoothness factors v, v1). In particular, we will discuss how to improve the divergence rate of K in Section 5.

Proposition 1 Under Assumptions 1 - 4 and Assumption 5 with v, v1 1, and if K is ﬁxed, then ˆθk θ k and ˆθfull θ in probability; ˆφSa C = (1/N) PK k=1 nk ˆφk and ˆφfull are consistent to φ .

Theorem 2 Under Assumptions 1 - 4 and Assumption 5 with v, v1 2, if K is ﬁxed and nk/N γk (0, 1) for a set of constants {γk}K k=1, then

N(ˆφSa C φ ) N

k=1 γk Jφ|λ(θ k) 1Σk(θ k)Jφ|λ(θ k) 1 !

N(ˆφfull φ ) N

k=1 γk Jφ|λ(θ k)) 1(

k=1 γkΣk(θ k))(

k=1 γk Jφ|λ(θ k)) 1 !

where Σk = Var (Sφ(Xk,1; θ k)).

Theorem 2 suggests that the asymptotic variance of the full sample estimator may surpass that of the Sa C estimator. To appreciate this, deﬁne V (Σ, A) = (AT ) 1ΣA 1 as a mapping from Sp1 p1 ++ GL(Rp1) to Sp1 p1 ++ , where Sp1 p1 ++ and GL(Rp1) denote the symmetric positive deﬁnite matrices and invertible real matrices of order p1, respectively. Since ΣK k=1γk = 1 and γk > 0, the asymptotic variance of ˆφSa C can be interpreted as a convex combination of function values {V (Σk(θ k), Jφ|λ(θ k))}K k=1 and that of ˆφfull can be expressed as V (PK k=1 γkΣk(θ k), PK k=1 γk Jφ|λ(θ k)). However, V ( , ) is not convex with respect to its arguments (Σ, A), which means that

k=1 γk Jφ|λ(θ k)) 1(

k=1 γkΣk(θ k))(

k=1 γk Jφ|λ(θ k)) 1

k=1 γk Jφ|λ(θ k) 1Σk(θ k)Jφ|λ(θ k) 1.

In other words, ˆφfull is not necessarily more efﬁcient than ˆφSa C under heterogeneity. In contrast, in the homogeneous setting (Jφ|λ(θ k), Σk(θ k)) are all equal for different k, then the non-convexity of

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

V ( , ) does not matter anymore and the ﬁrst-order equivalence between the Sa C estimator and the full sample estimator holds as long as K = o(N1/2). A conclusion from Theorem 2 is that naively using the pooled data to solve the statistical optimization problem may not be a good choice when the underlying distributions of the data are heterogeneous. To gain conﬁrmation of Theorem 2, we consider the errors-in-variables model. Suppose there are K independent data blocks {(Xk,i, Yk,i)}n i=1 for k = 1, 2..., K, where (Xk,i, Yk,i) are IID and generated from Xk = Zk + ek, Yk = φ + λ k Zk + fk, (6)

where {Zk}K k=1 are random variables whose measurements {(Xk, Yk)}K k=1 are subject to errors {(ek, fk)}K k=1, and (ek, fk) are bivariate normally distributed with zero mean and covariance matrix σ2I2 and is independent of Z. Obviously, φ is the common parameter across all data blocks while λ k(λ k > 0) represents the block speciﬁc parameter. The condition Var(e) = Var(f) is assumed to avoid any identiﬁcation issue arising when Z is also normally distributed (Reiersol, 1950) 2. We consider the approach in Example 5.26 of van der Vaart (1999) as detailed in Appendix A.2, which leads to the M-function

M(Xk, Yk; θk) = 1 (1 + λ2 k)(λk Xk (Yk φ))2. (7)

For simplicity we assume K = 2, then from Theorem 2 we have

Var(ˆφfull)

Var(Z) 2 1 1+λ 2 1 + 1 1+λ 2 2 + σ4(E(Z))2

2 (1+λ 2 1 )2 + 2 (1+λ 2 2 )2

( 1 1+λ 2 1 + 1 1+λ 2 2 )2

Var(ˆφSa C) σ2E(Z2)

Var(Z) (1+λ 2 1 )+(1+λ 2 2 ) 2 + σ4(E(Z))2

Note that the coefﬁcients to σ2E(Z2)/Var(Z) in the ﬁrst terms of the variances are harmonic and arithmetic means of {1 + λ 2 1 , 1 + λ 2 2 }, respectively. Hence, the coefﬁcient in the ﬁrst term of Var(ˆφSa C) is larger than that in Var(ˆφfull). The second terms of the variances involves (E(Z))2

as a multiplicative factor. Thus, if the unobserved Z has zero mean, the full-sample estimator will be at least as good as the Sa C estimator in terms of variance when the full sample size N goes to inﬁnity. However, the situation may change when E(Z) = 0, because the second term of Var(ˆφfull) has a factor which is the square of a ratio between the quadratic mean and the arithmetic mean of { 1 1+λ 2 1 , 1 1+λ 2 2 }. The factor is larger than or equal to 1, where the equality holds if and only if λ 1 =

λ 2, namely the homogeneous case. In the heterogeneous case, by adjusting σ4(E(Z))2

Var2(Z) /σ2E(Z2)

Var(Z) , we can ﬁnd cases such that λ 1 = λ 2 so that the full sample estimator has a larger variance than the Sa C estimator. Appendix C.1 displays such cases.

4 Weighted Distributed Estimator

That the full sample estimator ˆφfull under heterogeneity may be less efﬁcient than the simple averaged ˆφSa C suggests that the wisdom formulated in the homogeneous context may not be applicable

2. When Z is normal, (X, Y ) is also jointly normal, whose distribution can be fully characterized by the ﬁve parameters (E(X), Var(X), E(Y ), Var(Y ), E(XY )). Now, if we do not require Var(e) = Var(f), there will be six unknown parameters: (Var(e), Var(f), E(Z), Var(Z), φ, λ), leading to identiﬁcation issues. For more detailed discussions on the errors-in-variables model, see Fuller (1987).

GU AND CHEN

to the heterogeneous case. How to better aggregate the local estimators {ˆφk}K k=1 for more efﬁcient estimation is the focus of this section.

4.1 Formulation and Results

Consider a class of estimators formed by linear combinations of the local estimators {ˆφk}K k=1:

{ˆφSa C w | ˆφSa C w =

k=1 Wk ˆφk, Wk Rp1 p1,

k=1 Wk = Ip1 p1}.

We want to minimize the asymptotic variance of ˆφSa C w with respect to the weighting matrices {Wk}K k=1. It may be shown from Theorem 2 that Var(ˆφSa C w ) PK k=1 n 1 k Wk A 1 k Σk(AT k ) 1W T k , where Ak = Jφ|λ(θ k) and Σk = Var (Sφ(Xk,i; θ k)). It is noted that the asymptotic variance is deﬁned via the asymptotic normality of the statistical optimization. For the time being, Ak and Σk are assumed known and we denote Hk = A 1 k Σk(AT k ) 1. We choose the trace operator as a measure of the size of the covariance matrix, which leads to a minimization problem:

Minimize {Wk}K k=1 trace K X

k=1 n 1 k Wk Hk W T k

k=1 Wk = Ip1 p1. (9)

It is a convex optimization problem and can be solved via the Lagrangian multiplier method, which gives the optimal weighting matrices W k = (PK s=1 ns H 1 s ) 1nk H 1 k . If we replace the trace with the Frobenius norm in (9), the same solution is attained as shown in Appendix A.3. The split-andconquer estimator with the optimal weights W k is called the weighted distributed estimator and denoted as ˆφWD, which is at least as efﬁcient as ˆφSa C by construction. To compare the statistical efﬁciency between ˆφfull and ˆφWD, we note that their covariances

Var(ˆφfull)

k=1 nk Ak)T (

k=1 nkΣk) 1(

k=1 nk AT k Σ 1 k Ak

, respectively. (10)

Deﬁne V (Σ, A) = AT Σ 1A, which is a generalized convex function with respect to the matrix inequality shown in Lemma B.1. Applying Jensen s inequality leads to the conclusion that the weighted distributed estimator is at least as efﬁcient as the full sample estimator ˆφfull. Thus, the estimating equations (4) obtained from the ﬁrst-order derivatives of the simple summation of local objectives Pnk i=1 M(Xk,i; θk) may not be the best formulation. In contrast, the weighted distributed estimator exploits the potential efﬁciency gain from the heterogeneity by re-weighting of the local estimators, which is why the full sample estimator may not be as efﬁcient as the weighted distributed estimator.

4.2 Likelihood and Quasi-likelihood

The above results lead us to wonder whether the weighted distributed estimator can also be more efﬁcient than the full sample estimator under the heterogeneity in a fully parametric setting. The answer is negative as shown below.

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

When the distribution of Xk,i is fully parametric with density function f( ; φ, λk), the Fisher information matrix in the k-th data block is

I(θk) = I(φ, λk) = Iφφ Iφλk Iλkφ Iλkλk

2 φ2 log f(Xk,1; θk) 2 φ λT log f(Xk,1; θk)

2 λ φT log f(Xk,1; θk) 2 λ2 log f(Xk,1; θk)

and the partial information matrix Iφ|λk = Iφφ Iφλk I 1 λkλk Iλkφ. Now, the objective function for the statistical optimization is M(Xk,i; φ, λk) = log f(Xk,i; φ, λk). Routine derivations show that Σk = Var (Sφ(Xk,1; θ k)) = Iφ|λk and Ak = Jφ|λ(θ k) = Iφ|λk.

Hence, Var(ˆφfull) Var(ˆφWD) PK k=1 nk Iφ|λk 1 and Var(ˆφSa C) (1/N2) PK k=1 nk I 1 φ|λk.

A direct application of Lemma B.1 shows that PK k=1 nk Iφ|λk 1 (1/N2) PK k=1 nk I 1 φ|λk. Thus, the full sample maximum likelihood estimator automatically adjusts for the heterogeneity and has the same asymptotic efﬁciency as that of the weighted distributed estimator. Both estimators are at least as efﬁcient as the split-and-conquer estimator ˆφSa C. The same is true for the quasi-likelihood estimation with independent observations (see Appendix A.4). A close examination reveals that the underlying reason for the asymptotic equivalence between the weighted distributed estimator and the likelihood-based full sample estimators is that the two statistical optimization functions satisfy the second-order Bartlett s identity (Bartlett, 1953; Mc Cullagh, 1983): E M(Xk, θ k) M(Xk, θ k)T = E 2M(Xk, θ k) . By the asymptotic variance formula of the estimator and Lemma B.1, it is apparent that Bartlett s identity can be relaxed by allowing a factor γ = 0 such that

E M(Xk, θ k) M(Xk, θ k)T = γE 2M(Xk, θ k) . (11)

An example of such cases is the least square estimation in the parametric regression with homoscedastic and non-autocorrelated residuals in Appendix A.5. Otherwise, the full sample least square estimator may not be efﬁcient and there is an opportunity for the weighted least square estimation such as the case in the errors-in-variables model (6). Thus, if M(xk, θk) satisﬁes (11), ˆφfull attains the same statistical efﬁciency as ˆφWD.

4.3 Relation to Generalized Method of Moment Estimation

To further justify the statistical efﬁciency of the weighted distributed estimation, we consider the generalized method of moment (GMM) estimator (Hansen, 1982), which has certain optimal properties for the semiparametric inference that the weighted distributed estimation can compare with, despite it requires full data sharing. The score functions of the statistical optimization on each data block are aggregated to form the moment equations (Pnk i=1 ψφ(Xk,i; φ, λk) = 0, Pnk i=1 ψλ(Xk,i; φ, λk) = 0, k = 1, ..., K, (12)

which have p K estimating equations, where the dimension of θ is p K (K 1)p1. Thus, the parameter is over-identiﬁed, offering potential efﬁciency gain for the generalized method of moment. The GMM estimation based on the moment restrictions (12) solves the minimization problem

ˆθGMM = arg min θ Θ ψT N(θ)W0 ψN(θ),

GU AND CHEN

where W0 = Var( ψN(θ )) 1 is the optimal weighting matrix (Hansen, 1982; Yaron et al., 1996) and

i=1 ψφ(X1,i; θ1)T ,

i=1 ψλ(X1,i; θ1)T , ,

i=1 ψφ(XK,i; θK)T ,

i=1 ψλ(XK,i; θK)T )T .

Let the ﬁrst p1 elements of ˆθGMM be ˆφGMM as an estimator of the common parameter. A derivation

in Appendix A.6 shows that Var(ˆφGMM) PK k=1 nk Jφ|λΣ 1 k Jφ|λ 1 . Thus, the weighted dis-

tributed estimator s asymptotic efﬁciency is the same as that of ˆφGMM. This is encouraging as the weighted distributed estimator does it without requiring so much data sharing among the blocks.

4.4 Estimation of weights with one round communication

To formulate the weighted distributed estimator, we have to estimate the optimal weights W k = PK s=1 ns H 1 s 1 nk H 1 k . As we will show in Theorem 4, the estimation of the weights will not affect the estimation efﬁciency of the weighted distributed estimator attained in (10). By the structure of W k , we only need to estimate Hk, the leading principal submatrix of order p1 of the asymptotic covariance matrix e Hk of ˆθk. Note that

e Hk = ( Ψθ(θ k)) 1 E ψθk(Xk,1; θ k)ψθk(Xk,1; θ k)T ( Ψθ(θ k)) 1 = Hk

where Ψθ(θk) = E (ψθk(Xk,1; θk)). We can construct a sandwich type estimator (Stefanski and Boos, 2002) to estimate e Hk and then Hk. The procedure to obtain the weighted distributed estimator is summarized in Algorithm 1.

Input: Distributed datasets: {Xk,i, k = 1, ..., K; i = 1, ..., nk} Output: Weighted distributed estimator: ˆφWD

1 In each data block k (k = 1, 2, , K):

2 Solve (2) and obtain ˆθk = (ˆφk, ˆλk) ;

3 Calculate b Hk(ˆθk), which is the leading principal sub-matrix of order p1 of ( θk bΨθk) 1(n 1 k Pnk i=1 Z(Xk,i; ˆθk))( θk bΨθk) T , where Z(x, θk) is deﬁned in Assumption 6 and bΨθk = n 1 k Pnk i=1 ψθk(Xk,i; ˆθk);

4 In a central server:

5 Collect (ˆφk, b Hk(ˆθk) 1) from all the K data blocks;

6 Calculate ˆφ = PK k=1 nk b Hk(ˆθk) 1 1 PK k=1 nk( b Hk(ˆθk)) 1 ˆφk ;

7 ˆφWD = ˆφI(ˆφ Φ) + ˆφSa CI(ˆφ Φ), where ˆφSa C = N 1 PK k=1 nk ˆφk. Algorithm 1: Weighted Distributed estimator

Step 7 of the algorithm is necessary since there is no guarantee that after weighting the estimator ˆφ belongs to the set Φ as required in Assumption 3. However, the event {ˆφ Φ} should happen with probability approaching one. Hence, the ˆφSa CI(ˆφ Φ) term is asymptotically negligible. To establish the theoretical properties of the weighted distributed estimator, we impose the following assumptions.

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

Assumption 6 (Smoothness II) Denote Z(x, θk) = θk M(x; θk) θk M(x; θk)T , then there are positive constants ρ and B, and a positive function B(x) such that Z(x, θk) is B(x) Lipschitz continuous with respect to θk, in the sense that Z(x, θk) Z(x, θ k) 2 B(x) θk θ k 2 for all θk, θ k Uk = {θk | θk θ k 2 ρ} and x Rd, and E(B(Xk,1)2v) B2v .

Assumption 6 speciﬁes the Lipschitz continuity of the outer product Z(x; θk) with respect to θk, which is to control the estimation errors when we estimate the asymptotic covariance matrices of the local estimators {ˆθk}K k=1. Appendix A.7 shows it is valid for the logistic regression model.

Assumption 7 (Boundedness) Denote ΣS,k(θk) = EFk ψθk(Xk,1; θk)ψθk(Xk,1; θk)T , then there exists constants ρσ, c > 0 such that ΣS,k(θ k) 2 ρσ and Hk c Ip1 p1 for k 1, where θ k is the minimizer of the k-th population objective function and Hk = A 1 k Σk A 1 k , where Ak = Jφ|λ(θ k) and Σk = Var (Sφ(Xk,i; θ k))

By Hk s deﬁnition, Hk 2 Jk(θ k) 1 2 2 ΣS,k(θ k) 2 ρσρ 2 , implying H 1 k (ρ2 /ρσ)Ip1 p1. The above inequality also leads to the inequality Jk(θ k) 1 2 (c/ρσ)1/2, indicating a ﬁnite upper bound for the norm of the Hessian, as assumed in Jordan et al. (2019) and Duan et al. (2021).

Theorem 3 Under Assumptions 1 - 4 and 7, and Assumption 5 - 6 with v 2 and v1 4 , for n = NK 1, the mean-squared error of the weighted distributed estimator ˆφWD satisﬁes

E ˆφWD φ 2 2

n K + C2 (L2 + L4) + R2(R2 + R6 + G2)

n3 + C4K 1 + L2v

The v1 and v appeared in Assumptions 5 - 6 quantify the moments of the ﬁrst two orders of the gradients of the M-function and their corresponding Lipschitz functions. When the number of data blocks K = O(N1/2) namely K = O(n), the convergence rate of mean squared error of ˆφWD is O((n K) 1), which is the same as the standard full sample estimator. However, when there are too many data blocks such that K n, the convergence rate is reduced to O(n 2).

Theorem 4 Under Assumptions 1 - 4 and 7, and Assumptions 5 - 6 with v, v1 2, if K = o(N1/2), then ˆφWD φ )T PK k=1 nk H 1 k ˆφWD φ ) χ2 p1.

As mentioned before, K is allowed to diverge with the full sample size at the rate o(N1/2). Although {Hk}K k=1 have bounded spectral norms, PK k=1(nk/N)H 1 k may not converge to a ﬁxed matrix in the presence of heterogeneity. Thus, we can only obtain the asymptotic normality of the

standardized N 1/2 PK k=1(nk/N)H 1 k 1/2 (ˆφWD φ ). This is why Theorem 4 is presented in the limiting chi-squared form, which implies that we can construct conﬁdence regions for φ with conﬁdence level 1 α as (

φ | ˆφWD φ)T K X

k=1 nk b Hk(ˆθk) 1 ! ˆφWD φ) χ2 p1,α

after replacing PK k=1 nk H 1 k with its sample counterpart PK k=1 nk b Hk(ˆθk) 1, where χ2 p1,α is the upper α quantile of the χ2 p1 distribution. Given the weighted distributed estimator of the common

GU AND CHEN

parameter φ , a natural question is that whether a more efﬁcient estimator of the block-speciﬁc λ k can be obtained, if we plug in the weighted distributed estimator to each data block and re-estimate λk: Let ˆλ(2) k be the updated estimator, and Theorem 9 in Appendix A.8 will show that ˆλ(2) k is not necessarily more efﬁcient than the local estimator ˆλk.

5 Debiased Estimator for Diverging K

It is noted that K = o(N1/2) is required in Theorems 3 and 4 to attain the O(N 1) leading order mean squared error and the limiting chi-squared distribution of the weighted distributed estimator ˆφWD. A reason for this requirement is that the bias of the local estimator ˆθk is at order Op(n 1 k ), which can not be reduced by the weighted averaging. This leads to the bias of N1/2(ˆφWD φ ) being at the order Op(KN 1/2), which is not necessarily diminishing to zero unless K = o(N1/2). It is worth mentioning that Duan et al. (2021) needed the same K = o(N1/2) order in their maximum likelihood estimation framework to obtain the N1/2-convergence since Li et al. (2003) showed that the maximum likelihood estimator is asymptotically biased when K/n C (0, + ). This calls for a bias reduction step for the local estimators before aggregation to allow for a larger K. To facilitate the bias correction, we have to simplify the notation. Suppose F(θ) is a p 1 vector function, F(θ) is the usual Jacobian whose l-th row contains the partial derivatives of the l-th element of F(θ). Then, the matrices of higher derivatives are deﬁned recursively so that the j-th element of the l-th row of s L(θ) (a p ps matrix) is the 1 p vector fv lj(θ) = fv 1 lj (θ)/ θT , where fv 1 lj is the l-th row and j-th element of v 1F(θ). Let denote the Kronecker product. Using Kronecker product we can express v F(θ) = v F(θ)/( θT θT θT ). Besides, deﬁne Mn,k(θk) = n 1 k Pnk i=1 M(Xk,i; θk), H3,k(θk) = E( 2 θkψθk(Xk,1; θk)), Qk(θk) = ( E( θkψθk(Xk,1; θk))) 1 , di,k(θk) = Qk(θk)ψθk(Xk,i; θk) and vi,k(θk) = θkψθk(Xk,i, θk) θkΨθ(θk). Then, the leading order bias of ˆθk (Rilstone et al., 1996) is

Bias(ˆθk) = n 1 k Qk(θ k) E (v1,k(θ k)d1,k(θ k)) + 1

2H3,k(θ k)E (d1,k(θ k) d1,k(θ k)) .

Let Bk(θk) = Qk(θk) E (v1,k(θk)d1,k(θk)) + 1

2H3,k(θk)E (d1,k(θk) d1,k(θk)) , whose ﬁrst p1 dimension associated with φ are denoted as B1 k(θk). An estimator of Bk(θk) is

ˆBk(θk) = ˆQk(θk) n 1 k

i=1 ˆvi,k(θk) ˆdi,k(θk) + 1

2 ˆH3,k(θk)n 1 k

i=1 ( ˆdi,k(θk) ˆdi,k(θk)) , (13)

where ˆH3,k(θk) = n 1 k Pnk i=1 2 θkψθk(Xk,i; θk), ˆQk(θk) = n 1 k Pnk i=1 θkψθk(Xk,i; θk) 1 , ˆdi,k(θk) = ˆQk(θk)ψθk(Xk,i; θk) and ˆvi,k(θk) = θkψθk(Xk,i; θk). Applying it to each data block, we have the bias-corrected local estimator

ˆθk,bc = ˆθk n 1 k ˆBk(ˆθk)1Ek,bc,

where Ek,bc = {ˆθk n 1 k ˆBk(ˆθk) Θk}, and the indicator function is to ensure that ˆθk,bc Θk. After the local debiased estimators are obtained, we need to aggregate them via the estimated weights. A direct aggregation will invalidate the bias correction due to the dependence between the estimated weights and the local debiased estimator if they are constructed with the same dataset.

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

The accumulation of dependence over a large number of data blocks can make the bias correction fail. To remove the dependence between the local estimator ˆθk,bc and the estimated local weights c Wk = PK s=1 b Hs(ˆθs) 1 1 b Hk(ˆθk) 1, we divide each local dataset {Xk,i}nk i=1 to two basically

equal-sized splits Ds k = {X(s) k,i }nk/2 i=1 , s = 1, 2. For s = 1, 2, we calculate the local estimators ˆθk,s and obtain ˆHk,s(ˆθk,s), which is the ﬁrst p1 principal sub-matrix of

( θk bΨθk) 1((2/nk)

i=1 ψθk(X(s) k,i ; ˆθk,s)ψθk(X(s) k,i ; ˆθk,s)T )( θk bΨθk) T ,

where bΨθk = (2/nk) Pnk/2 i=1 ψθk(X(s) k,i ; ˆθk,s). We perform the local bias correction to ˆθk,s based on a split with the weight obtained by the other, leading to two debiased estimators of the form

k=1 nk ˆHk,s(ˆθk,s) 1 ! 1 K X

k=1 nk( ˆHk,s(ˆθk,s)) 1 ˆφbc k,2 |s 1| for s = 1, 2.

The two debiased local estimators are averaged to obtain the ﬁnal debiased weighted distributed estimator, whose procedure is summarized in Algorithm 2. To provide a theoretical guarantee on the bias correction, we need an assumption on the third-order gradient of the M-function (Zhang et al., 2013), which strengthens a part of Assumption 5.

Assumption 8 (Strong smoothness) For each x Rp, the third order derivatives of M(x; θk) with respect to θk exist and are A(x) Lipschitz continuous in the sense that

( 2 θkψθk(x; θk) 2 θkψθk(x; θ k))(u u) 2 A(x) θk θ k 2 u 2 2,

for all θk, θ k Uk deﬁned in Assumption 5 and u Rp, where A(x) is a positive function, E(A(Xk,i)2v) A2v for some v > 0 and A < .

Theorem 5 Under Assumptions 1 - 4 and 7 - 8, and Assumptions 5 - 6 with v, v1 4 ,

E ˆφd WD φ 2 2

n K + C2 R2(L2 + R2)

n2K + C3 G2R2(G2 + R2 + G4) + A2R6

+C5K 1 + L2v

where the CBn 3 term characterizes the error due to the estimation of the bias correction terms {Bk(θ k)}K k=1, whose deﬁnition can be found in the proof of this theorem presented in the Appendix.

The main difference between the upper bounds in Theorem 5 from that in Theorem 3 for the weighted distributed estimator is the disappearance of the O(n 2) term for the weighted distributed estimator, which has been absorbed into the O((n2K) 1 + n 3) terms for the debiased weighted distributed estimator. As shown next, this translates to a more relaxed K = o(N2/3) condition as compared with the K = o(N1/2) condition for the weighted distributed estimator in Theorem 4.

GU AND CHEN

Input: Distributed datasets: {Xk,i, k = 1, ..., K; i = 1, ..., nk} Output: debiased weighted distributed estimator: ˆφd WD

1 In each data block k (k = 1, 2, , K):

2 Split the local dataset into two equal sized subsets: Ds k = {X(s) k,i }nk/2 i=1 , s = 1, 2 ;

3 Solve (2) based on Ds k and obtain ˆθk,s = (ˆφk,s, ˆλk,s) for s = 1, 2 ;

4 Calculate b Hk,s(ˆθk,s) based on Ds k and ˆθs k for s = 1, 2;

5 Calculate ˆθbc k,s = ˆθk,s 2n 1 k ˆBk,s(ˆθk,s)1Ek,bc,s using formula (13) for s = 1, 2, where Ek,bc,s = {ˆθk,s 2n 1 k ˆBk,s(ˆθk,s) Θk} ;

6 In a central server:

7 Collect {ˆφbc k,s, b Hk,s(ˆθk,s) 1, s = 1, 2} from all the K data blocks;

8 Construct ˆφs = PK k=1 nk b Hk,s(ˆθk,s) 1 1 PK k=1 nk b Hk,s(ˆθk,s) 1 ˆφbc k,2 |s 1|;

9 Calculate ˆφd WD s = ˆφs I(ˆφs Φ) + K 1 PK k=1 nk ˆφbc k,2 |s 1|I(ˆφs Φ) for s = 1, 2;

10 ˆφd WD = 2 1 P2 s=1 ˆφd WD s . Algorithm 2: debiased Weighted Distributed (d WD) Estimator

Theorem 6 Under the conditions required by Theorem 5, if K = o(N2/3),

(ˆφd WD φ )T K X

k=1 nk Hk(θ k) 1 !

(ˆφd WD φ ) d χ2 p1.

Theorem 6 is also formulated in the chi-squared distribution form for the same reason when we formulate Theorem 4, and similar conﬁdence region with conﬁdence level 1 α can be constructed as n φ | ˆφd WD φ)T {PK k=1 nk Hk(ˆθk) 1} ˆφd WD φ) χ2 p1,α o . The fact that the conﬁdence regions of debiased weighted distributed and weighted distributed estimators use the same standardizing matrix PK k=1 nk b Hk(ˆθk) 1 reﬂects that both estimators have the same estimation efﬁciency. However, the debiased version has a more relaxed constraint on K = o(N2/3) than that of the WD estimator requiring K = o(N1/2) . Both the debiased and non-debiased weighted distributed estimators are communication efﬁcient as they only require one round of communication. When the communication budget is strictly limited, people may only share the debiased estimators without transmitting the weights. In this case, one may consider the following debiased split-and-conquer estimator

ˆφd Sa C = N 1 K X

k=1 nk(ˆφk n 1 k ˆB1 k(ˆθk)1Ek,bc), (14)

which only performs bias correction and may be preferable when the heterogeneity is not severe. The asymptotic property of ˆφd Sa C is summarized in the following theorem.

Theorem 7 Under the conditions required by Theorem 5, if K = o(N2/3), the debiased split-andconquer estimator ˆφd Sa C satisﬁes that (i) E ˆφd Sa C φ 2 2 C1/(n K) + C2/(n2K) + C3/n3

and (ii) N2(ˆφd Sa C φ )T PK k=1 nk Hk(θ k) 1 (ˆφd Sa C φ ) χ2 p1.

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

The corresponding conﬁdence region with conﬁdence level 1 α can be constructed as {φ |

N2 ˆφd Sa C φ)T PK k=1 nk b Hk(ˆθk) 1 ˆφd Sa C φ) χ2 p1,α}. It is noted that the debiased version

of the split-and-conquer estimator ˆφd Sa C has the same asymptotic distribution as that of ˆφSa C, but under a much more relaxed constraint on the divergence rate of K. Hence, the conﬁdence regions based on the split-and-conquer estimator can be constructed in the same way as that based on the weighted distributed estimator with ˆφd Sa C replaced by ˆφSa C. To compare with the subsampled average mixture method (SAVGM) estimator proposed in Zhang et al. (2013), which also performs local bias correction but under the homogeneous setting, we have the following corollary to Theorem 7.

Corollary 8 Under the homogeneous case such that {Xk,i, k = 1, ..., K, i = 1, ..., n; } are IID distributed, and the assumptions required by Theorem 5,

E ˆθd Sa C θ 1 2 2 2E θ1Ψθ(θ 1) 1ψθ1(X1,1; θ 1) 2 2

where θ 1 is the true parameter for all the K data blocks.

The SAVGM estimator resamples rnk data points from each data block k for a r (0, 1) to obtain a local estimator ˆθk,r based on the sub-samples, and has the form

Its mean squared error bound as given in Theorem 4 of Zhang et al. (2013) is

E θSAVGM θ 1 2 2 2 + 3r

(1 r)2 E θ1Ψθ(θ 1) 1ψθ1(X1,1; θ 1) 2 2

Thus, the mean squared error bound (16) of the SAVGM estimator has an inﬂated factor (2+3r)(1 r) 2/2 > 1 for r (0, 1) when compared with that of the d Sa C estimator, although it is computationally more efﬁcient than the debiased split-and-conquer and debiased weighted distributed estimators as it only draws one subsample in its resampling. For more comparisons between the debiased split-and-conquer estimator and one-step estimators proposed by Huang and Huo (2019), see Appendix A.9. To facilitate an overall comparison among the existing and the proposed estimators, Table 1 summarizes the non-asymptotic MSE rates of the estimators along with the details on the smoothness condition and the restriction on K, and the statue of statistical efﬁciency relative to the full sample GMM estimator. It is noted that the smoothness parameters v and v1 can be large if the corresponding random variables are of thin tails, such as the sub-exponential or sub-gaussian distributions, as assumed in Jordan et al. (2019) and Duan et al. (2021). As a result, the smoothness condition becomes trivial, and the Kn min(v,v1/2) term that appeared in the non-asymptotic rates can be ignored.

6 Numerical Results

The purpose of this section is to examine the numerical performances of the estimators via both the simulation study in Section 6.1 and the real data analysis in Section 6.2.

GU AND CHEN

Table 1: MSE bounds in estimating the common parameter φ by the ideal full sample M-estimator (Full), the distributed Sa C, d Sa C, WD, d WD and SAVGM estimators, and the estimator in Chen and Peng (2021) (CP). The column headed by Smoothness states the needed moment restriction in Assumption 5. The column headed by K further states the condition on the block size under which the asymptotic normality of the estimator holds, while that by Efﬁciency indicates if an estimator is statistically efﬁcient ( ) or not ( ) relative to the GMM estimator ˆφGMM. For the CP estimator, τ1 1 and τ1 = 1 under the current M-estimation setting.

Estimator Smoothness Non-asymptotic Regime Asymptotic Regime

MSE bound K Efﬁciency Full v, v1 1 C1(n K) 1 + C2(n K) 2 - Sa C v, v1 2 C1(n K) 1 + C2n 2 o(N1/2)

CP - C1(n K) 1 + C2(n2K) 1 + C3n 2τ1 o(N1 1 2τ1 ) WD v 2, v1 4 C1(n K) 1 + C2n 2 + C3Kn min(v,v1/2) o(N1/2) d Sa C v, v1 4 C1(n K) 1 + C2n 2K 1 + C3n 3 o(N2/3) SAVGM v, v1 4 C1(n K) 1 + C2n 2K 1 + C3n 3 o(N2/3) d WD v, v1 4 C1(n K) 1 + C2n 2K 1 + C3n 3 + C3Kn min(v,v1/2) o(N2/3)

6.1 Simulation study

We report results from simulation experiments designed to verify the theoretical ﬁndings made in the previous sections, which was to evaluate the numerical performance of the proposed weighted distributed (WD), debaised split-and-conquer (d Sa C) and debiased weighted distributed (d WD) estimators of the common parameter and compare them with the existing split-and-conquer (Sa C) and subsampled average mixture method (SAVGM) (with subsampling rate r = 0.05) estimators. Although Zhang s SAVGM estimator (Zhang et al., 2013) was proposed under the homogeneous setting, but since its main bias correction is performed locally on each data block k as shown in (15), similar theoretical bounds as (16) can be derived without much modiﬁcations on the original proof. Throughout the simulation experiments, the results of each simulation setting were based on B = 500 number of replications and were conducted in R with a 10-core Intel(R) Core(TM) i9-10900K @3.7 GHz processor. We evaluated the numerical performance of the ﬁve estimators for the common parameter φ under a logistic regression model. For each of K data blocks with K {10, 50, 100, 250, 500, 1000, 2000}, {(Xk,i; Yk,i)}n i=1 Rp {0, 1} were independently sampled from the following model:

Xk,i N(0p 1, 0.752Ip p) and P(Yk,i = 1 | Xk,i) = exp(XT k,iθ k)

1 + exp(XT k,iθ k),

where θ k = (φ T , λ T k )T , φ = 1, λ k = (λ k,1, λ k,2, , λ k,p2)T and λ k,j = ( 1)j10(1 2(k 1)/(K 1)). The sample sizes of the data blocks were equal at n = NK 1 with N = 2 106. Two levels of the dimension p2 = 4 and 10 of the nuisance parameter λk were considered. Due to space limit, we only report the set of result with p2 = 10 in the main paper. See Appendix C.2 for the result with p2 = 4 and Appendix A.10 for a derivation of the bias correction formula for the logistic model. Figure 1 reports the root mean squared errors and absolute bias of the estimators when p2 = 10. It is observed that the weighted distributed estimator and the two debiased estimators had smaller

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

root mean squared errors than those of the Sa C and SAVGM for almost all the simulation settings. The classical split-and-conquer estimator fared better than Zhang s SAVGM estimator as K became larger, which is due to the extra variation introduced by the subsampling method as indicated in (16), especially when K is large (the local sample size n is small). It was evident that the WD estimator had much smaller root mean squared errors than the Sa C and SAVGM estimators for all the block number K, realizing its theoretical promises. In most cases, the WD estimator had smaller bias than the Sa C estimator although it was not debiased. The WD estimator was advantageous for K 250. In comparison, both bias corrected d WD and d Sa C were very effective in reducing the bias of the WD and Sa C estimators, respectively, especially for larger K when the bias was more severe. The d WD attained the smallest root mean squared error and the bias in all settings, suggesting the need for conducting both weighting and the bias correction in the distributed inference especially for large K. These empirical results were consistent with Theorems 3 and 5, namely the leading root mean squared error term of the WD estimator changes from O((n K) 1) to O(n 2) when K surpasses the local sample size n, while the leading term of the d WD is still O((n K) 1) until K is much larger than n2.

(a) Absolute Bias (p2 = 10) (b) RMSE (p2 = 10)

Figure 1: Average simulated bias (a) and the root mean squared errors (RMSE) (b) of the weighted distributed (WD) (red circle), the split-and-conquer(Sa C) (blue triangle), the debiased split-andconquer (d Sa C) (green square), the debiased weighted distributed (d WD) (purple cross), the subsampled average mixture SAVGM (pink square cross) estimators, with respect to the number of data block K for the logistic regression model with the dimension p2 of the nuisance parameter λk being 10, and the full sample size N = 2 106.

We also evaluated the coverage probabilities and widths of the 1 α (α = 0.01, 0.05, 0.1) conﬁdence intervals (CIs) of the common parameter based on the asymptotic normality as given after Theorems 4 and 6. The SAVGM estimator was not included as its asymptotic distribution was not made available in Zhang et al. (2013). Table 2 reports the empirical coverage and the average width of the CIs. It is observed that the four types of the CIs all had quite adequate coverage levels when K 100. However, for K 250, the Sa C CIs ﬁrst started to lose coverage, followed by those of the WD, while the CIs of the d Sa C and d WD estimators can hold up to the promised coverage for all cases of K. Although the d Sa C CIs had comparable coverages with the d WD CIs, their widths were much wider than those of the d WD. This was largely due to the fact that

GU AND CHEN

the weighted averaging conducted in the weighted distributed estimation reduced the variation and hence the width of the CIs. The widths of the WD CIs were largely the same with those of the d WD, and yet the coverage levels of the d WD CIs were much more accurate indicating the importance of the bias correction as it shifted the CIs without inﬂating the width.

Table 2: Coverage probabilities and widths (in parentheses, multiplied by 100) of the 1 α conﬁdence intervals for the common parameter φ in the logistic regression model based on the asymptotic normality of the split-and-conquer (Sa C), the weighted distributed (WD), the debiased splitand-conquer (d Sa C) and the debiased weighted distributed (d WD) estimators with respect to the number of data blocks K. The dimension p2 of the nuisance parameter λk is 10 and total sample size N = 2 106

K Sa C WD d Sa C d WD 1 α 0.99 0.95 0.90 0.99 0.95 0.90 0.99 0.95 0.90 0.99 0.95 0.90 10 0.99 0.94 0.88 1.00 0.96 0.92 1.00 0.94 0.88 1.00 0.96 0.92 (3.05) (2.32) (1.95) (2.41) (1.84) (1.54) (3.05) (2.32) (1.95) (2.42) (1.84) (1.54) 50 0.99 0.93 0.87 0.99 0.95 0.88 0.98 0.94 0.88 0.99 0.96 0.88 (2.94) (2.24) (1.88) (2.29) (1.74) (1.46) (2.94) (2.24) (1.88) (2.29) (1.74) (1.46) 100 0.97 0.89 0.84 0.97 0.93 0.87 0.98 0.95 0.90 0.98 0.94 0.89 (2.93) (2.23) (1.87) (2.28) (1.74) (1.46) (2.93) (2.23) (1.87) (2.29) (1.74) (1.46) 250 0.89 0.72 0.63 0.98 0.92 0.87 1.00 0.97 0.90 1.00 0.96 0.90 (2.94) (2.24) (1.88) (2.28) (1.74) (1.46) (2.94) (2.24) (1.88) (2.29) (1.74) (1.46) 500 0.51 0.28 0.18 0.93 0.81 0.70 0.99 0.94 0.90 0.98 0.94 0.88 (2.97) (2.26) (1.90) (2.29) (1.74) (1.46) (2.97) (2.26) (1.90) (2.30) (1.75) (1.47) 1000 0.00 0.00 0.00 0.66 0.37 0.28 0.99 0.95 0.90 0.99 0.96 0.89 (3.04) (2.31) (1.94) (2.30) (1.75) (1.47) (3.04) (2.31) (1.94) (2.34) (1.78) (1.49) 2000 0.00 0.00 0.00 0.02 0.00 0.00 0.99 0.96 0.90 0.99 0.93 0.87 (3.22) (2.45) (2.06) (2.34) (1.78) (1.49) (3.22) (2.45) (2.06) (2.40) (1.82) (1.53)

In addition to the simulation experiments on the statistical properties of the estimators, the computation efﬁciency of the estimators was also evaluated. Table 3 reports the average CPU time per simulation run based on 500 replications of the ﬁve estimators for a range of K of the nuisance parameter for the logistic regression model with the total sample size N = 2 106 and p2 = 10. The computation speed of the d Sa C and d WD estimators were relatively slower than those of the Sa C, WD and Zhang s SAVGM estimators. The WD estimator was quite fast, which means that the re-weighting used less computing time than the bias-reduction. In comparison, the d WD estimator was the slowest as a cost for attaining the best root mean squared error among the ﬁve estimators in all settings. It is observed in Table 3 that the overall computation time for each estimator ﬁrst decreased and then increased as K became larger. The decrease in time was because of the beneﬁt of the distributed computation, while the increase was due to the increase in the number of optimization associated with the statistical optimization performed as K got larger. However, it is worth mentioning that these results did not account for the potential time expenditure in data communication among different data blocks.

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

Table 3: Average CPU time for each replication based on B = 500 replications for the split-andconquer (Sa C), the Zhang s SAVGM, the weighted distributed (WD), the debiased split-and-conquer (d Sa C) and the debiased weighted distributed (d WD) estimators for the logistic regression model with respect to K. The dimension p2 of the nuisance parameter λk is 10 and total sample size N = 2 106

K Sa C SAVGM WD d Sa C d WD 10 34.60 35.19 43.84 50.47 55.35 50 20.13 20.18 24.16 29.99 33.69 100 15.60 16.20 17.74 23.63 24.47 250 10.77 12.61 11.88 18.22 20.39 500 11.55 14.50 12.56 18.80 23.73 1000 15.23 18.27 16.28 22.38 32.24 2000 23.42 27.99 24.62 30.43 48.05

6.2 Real data analysis

We report results from an empirical analysis of an airline s on-time performance data to demonstrate the proposed weighted distributed estimation for massive data. We aim to quantify the association between ﬂight departure delay and a set of covariates, the arrival delay of the previous ﬂight of the same plane, the seasonal effects, and the weather conditions with a logistic regression model, based on data from the top 10 busiest airports in the United States in 2007. The ﬂight data are available from https://community.amstat.org/jointscsg-section/dataexpo/dataexpo2009 and the weather data are obtained from https://cds.climate.copernicus.eu/. We segmented the full data of N = 2412782 according to the airports of departing ﬂights and obtained 10 data segments. For each segment, we split it to data blocks of size n = 5000, while the residual data blocks were discarded, such that the total number of blocks K = 479.

We included seven covariates in the logistic regression: the arrival delay of the previous ﬂight, the season (encoded by three dummy variables: spring (March-May), summer (June-August), autumn (September-November) with winter as the baseline, the near-surface air temperature and pressure, and the rain rate before the scheduled departure time. The coefﬁcients of the three weather variables were treated as the common parameters while the remaining coefﬁcients including the intercept were regarded as heterogeneous; see Section C.3 in the supplementary material for the justiﬁcation. The estimated common parameters of the near-surface air pressure, temperature, and convective rain rate with 95% conﬁdence intervals using the weighted distributed estimator and the split-and-conquer estimator are shown in Figure 2. Both methods successfully identiﬁed a signiﬁcant association between the three weather variables and the departure delay of a ﬂight. Besides, the weighted distributed estimator reduced the lengths of the conﬁdence intervals of the estimated common parameters compared with the split-and-conquer method. In particular, the conﬁdence interval of the rain parameter was shortened by 19.1%, while those of the other two common parameters were shorted by 2.2% (pressure) and 2.9% (temperature), which justiﬁed the statistical efﬁciency of the weighted distributed estimator.

The data analysis demonstrated the feasibility of implementing the proposed weighted distributed estimation method for real-world distributed inference problems. With only one round

GU AND CHEN

of weighting to tackle the heterogeneity among the nuisance parameters, more efﬁcient estimation can be obtained.

Figure 2: Estimated common parameters of the near surface air pressure, temperature and convective rain rate with 95% conﬁdence intervals using the weighted distributed estimator and the split-andconquer estimator

temperature

air pressure

0.0950 0.0975 0.1000 0.1025 0.1050 0.1075

0.080 0.085 0.090 0.095 0.100

0.0625 0.0600 0.0575 0.0550

Estimates (95% Confidence Interval)

7 Discussion

This paper investigates distributed statistical optimization in the presence of heterogeneity among the data blocks. The weighted distributed estimator can improve the estimation efﬁciency of the split-and-conquer estimator for the common parameter. Two debiased estimators are proposed to allow for larger numbers of data blocks K. The statistical properties of the proposed estimators are shown to be advantageous over the split-and-conquer and SAVGM estimators. In particular, the weighted distributed estimator performs well for smaller K relative to N, and the debiased weighted distributed estimator that conducted both bias correction and weighting offers good estimation accuracy for large K. An important issue for the distributed estimation is the size of K relative to the full sample size N. Both the split-and-conquer and weighted distributed estimators require K = o(N1/2) to

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

preserve the the N1/2 convergence rate. The debiased weighted distributed and debiased split-andconquer estimators relax the restriction to K = o(N2/3) without sacriﬁcing the statistical efﬁciency. Our ﬁnding that the heterogeneity requires separate weighting for better efﬁciency gain has implications beyond the current context. In particular, the proposed WD estimator can be extended to a multi-round procedure, where one can substitute the WD estimator back to the local loss functions, update the remaining local parameters, and then repeat the WD procedure. However, it may be shown that under the current M-estimation framework which is non-degenerate in the sense of Chen and Peng (2021), doing so could not lead to further gain in the efﬁciency beyond the WD estimator, as it sufﬁces to achieve statistical efﬁciency with a one-round averaging as the WD. Nevertheless, it is of interest to explore further for the degenerate cases as the above statement may no longer be applicable. The heterogeneous M-estimation framework in this work is related to the federated learning (Mc Mahan et al., 2017), where one wants to minimize a federated risk function

k=1 wk Mk(φ), (17)

where Mk(Xk; ) is the k-th client speciﬁc loss function, Mk(φ) = EPk (Mk(Xk; φ)) is the corresponding risk function, φ Rp1 is the parameter of interest and wk is the pre-speciﬁed weight of the k-th client with a natural choice wk = 1/K. The local distributions {Pk}K k=1 may be not identically distributed to reﬂect heterogeneity. In (17), only a shared parameter φ needs to be estimated, and the heterogeneity is hidden in the local loss and risk functions. Our formulation is different from the federated risk function (17), where the {Mk}K k=1 are parameterized via the heterogeneous local parameters {λk}K k=1, leading to

Mk(Xk; ) = M(X; φ, λk) (18)

for inference on φ. Our ﬁnding that by actively weighting with respect to the heterogeneity as in the WD estimation can provide useful guideline for the selection of the weights wk in (17) of the federated learning. Different from (17), the federated multi-task learning (Smith et al., 2017) is designed to tackle the heterogeneity in a distributed network, which ﬁts separate local parameters {φk}K k=1 Rp1 to different data blocks (tasks) through loss functions {ℓk( , )}K k=1 and its general formulation is

i=1 ℓk(φT k Xk,i, Yk,i) + R(Φ, Ω) , (19)

where Φ is the matrix with {φk}K k=1 as column vectors, Ω RK K and R( , ) measures the extent of the heterogeneity among different data blocks. Choices of R( , ) include the bi-convex function R(Φ, Ω) = δ1trace(ΦΩΦT ) + δ2 Φ 2 F for δ1, δ2 > 0 and Ω= IK K (1/K)1K1T K such that trace(ΦΩΦT ) = PK k=1 φk φK 2 2 where φK = (1/K) PK k=1 φk, which leads to the meanregularized multi-task learning (Evgeniou and Pontil, 2004) with R conducting regularization on each local model. Similar regularization formulations have also been applied in personalized federated learning (T. Dinh et al., 2020; Li et al., 2021). Other than regularization methods, Marfoq et al. (2021) proposed a clustering-based method, where the {Pk}K k=1 are assumed to be sampled from a

GU AND CHEN

mixture of S (S K) underlying distributions. It is also noted that although federated multi-task learning assumes different parameters {φk}K k=1 over the data blocks, it regularizes them toward a common one. In contrast, we assume there is a common parameter φ shared by the distributions. By doing so, we can clarify the source of heterogeneity {λk}K k=1 and homogeneity φ instead of putting an equal treatment on all the dimensions of the parameter and focusing on the statistical inference of the common parameter.

Acknowledgments and Disclosure of Funding

The authors would like to thank the two anonymous referees, the Associate Editor, and the Editor for their thoughtful comments and suggestions, which have improved the presentation of the paper. This research is funded by National Natural Science Foundation of China Grants 12071013, 12292980 and 12292983.

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

The Appendix is organized as follows. Section A provides derivations of the formulas given in the main text. Section B contains detailed proofs of the theoretical results. More simulation results and details about the real data analysis are reported in Section C.

Appendix A. Derivation of formulas

A.1 Expansion of the full sample estimator ˆφfull

By integral form of Taylor s expansion around the true value θ , we have

0p 1 = ΨN(X; ˆφfull, ˆλ1,full, ..., ˆλK,full)

= ΨN(X; θ ) + J(θ )(ˆθfull θ ) + ( ΨN(X; θ ) J(θ ))(ˆθfull θ )

0 ΨN(X; θ + t(ˆθfull θ ))(ˆθfull θ )dt ΨN(X; θ )}(ˆθfull θ ),

where J(θ) = E ( ΨN(X; θ)). Then, inverting the above leads to

ˆθfull θ = J(θ ) 1ΨN(X; θ ) + RN1 + RN2, (20)

where RN1 = J(θ ) 1{ ΨN(X; θ ) J(θ )}(ˆθfull θ ) and RN2 = J(θ ) 1{ R 1 0 ΨN(X; θ + t(ˆθfull θ ))(ˆθfull θ )dt ΨN(X; θ )}(ˆθfull θ ) are both higher-order remainder terms. Since J(θ) has the following form

PK k=1 nkΨφ φ(θk) n1Ψλ φ(θ1) n KΨλ φ(θK) n1Ψφ λ(θ1) n1Ψλ λ(θ1) 0 0 ... 0 ... 0 n KΨφ λ(θK) 0 0 n KΨλ λ(θK),

then the right bottom part of J(θ) is a block diagonal matrix, whose inverse is at hand. Thus we can see J(θ) as a 2 2 block matrix and directly apply the block matrix inverse formula (Lu and Shiou, 2002). Thus from (20) we have

k=1 (nk/N)Jφ|λ(θ k)

i=1 Sφ(Xk,i; θ k)

+ op(N 1/2).

A.2 Errors-in-variables model

We ﬁrst give a derivation of the objective function from the perspective of statistical optimization. As we will see, the derived objective is exactly the same as that when we do orthogonal regression or Deming s regression (Carroll and Ruppert, 1996). Consider the conditional likelihood of (Xk,i, Yk,i) given Zk,i in block k

f({Xk,i}, {Yk,i}|{Zk,i}, θk) =

i=1 f1(Xk,i|Zk,i)f2(Yk,i|Zk,i)

=( 1 2πσ2 )n n Y

(X2 k,i + (Yk,i φ)2) 2Zk,i(Xk,i + λk(Yk,i φ)) + (1 + λ2 k)Z2 k,i

GU AND CHEN

By the factorization theorem, Xk,i + λk(Yk,i φ) is a sufﬁcient statistic for Zk,i if θk = (φ, λk) is assumed to be known. And Xk,i + 2λk(Yk,i φ)|Zk,i N((1 + λ2 k)Zk,i, (1 + λ2 k)σ2). Then, the above conditional likelihood can be factorized as

f({Xk,i}, {Yk,i}|{Zk,i}, θk)

i=1 exp 1 2σ2(1 + λ2 k)(λk Xk,i (Yk,i φ))2 h(Xk,i + λk(Yk,i φ)|Zk,i),

where h(si|zi) is the conditional density of N((1 + λ2 k)zi, (1 + λ2 k)σ2). Since {Zk,i}n i=1 are not observable, we discard the factor h and construct the estimator based on the ﬁrst part of the factorization, which is denoted as f({Xk,i}, {Yk,i}|{Zk,i}, θk). Differentiate log f({Xk,i}, {Yk,i}|{Zk,i}, θk) with respect to θk = (φ, λk)T , we obtain

φ log f({Xk,i}, {Yk,i}|{Zk,i}, θk) = 1 σ2(1 + λ2 k)

i=1 (λk Xk,i (Yk,i φ)) and

λk log f({Xk,i}, {Yk,i}|{Zk,i}, θk) = n λk 1 + λ2 k +

λk σ2(1 + λ2 k)2 (λk Xk,i (Yk,i φ))2

Xk,i σ2(1 + λ2 k)(λk Xk,i (Yk,i φ)).

However, E f({Xk,i}, {Yk,i}|{Zk,i}, θ k) = (0, nλ k/(1+λ 2 k ))T = 02 1, thus a correction term should be added to construct an appropriate objective function which satisﬁes the standard ﬁrst-order condition in statistical optimization framework:

Mn,k({Xk,i}, {Yk,i}|{Zk,i}, θk) = log f({Xk,i}, {Yk,i}|{Zk,i}, θk) + n

2 log(1 + λ2 k)

= 1 2σ2(1 + λ2 k)

i=1 (λk Xk,i (Yk,i φ))2 + C(σ),

where C(σ) = n log(

2πσ) is an absolute constant so we also discard it. The corresponding M-function is M(Xk, θk) = 1 2σ2(1 + λ2 k)(λk Xk (Yk φ))2. (22)

Below we check the identiﬁcation of the true parameter under this objective function. We can directly solve the population level ﬁrst-order conditions (FOC) using E ( M(Xk, Yk|Zk, θk)) = 02 1, which are given as

02 1 = (1 + λ2 k) (λk λ k)E (Zk) (φ φ)

(λkλ k + 1)(λk λ k)E Z2 k λk(φ φ )2 + (φ φ )(1 + 2λkλ k λ2 k)E (Zk)

(23) To solve the above set of equations, we consider the two scenarios. When E (Zk) = 0, from the ﬁrst equation we obtain φ = φ , then the second equation reduces to C(λkλ k + 1)(λk

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

λ k)E Z2 k = 0. Since we have assumed λk, λ k > 0, we must have λk = λ k. When E (Zk) = 0, if λk = λ k we would obtain E (Zk) = (φ φ)/(λk λ k). Plugging it into the second equation of (23) and we can obtain

(1 + λkλ k) σ2(1 + λ2 k)2(λk λ k)

(λk λ k)2E Z2 k (φ φ )2 = 0,

which is impossible unless Zk is degenerate, namely Zk = (φ φ)(λk λ k) with probability one. This leads to a contradiction. Thus we must have λk = λ k. Again from the ﬁrst equation of (23) we will obtain that φ = φ . In summary, E ( M(Xk, Yk|Zk, θk)) = 02 1 if and only if θk = θ k. To give an explicit form of asymptotic variance of the estimator obtained from the M-function (22), we can directly calculate the following two terms:

E 2M(Xk, Yk|Zk; θ k)

1 σ2(1+λ 2 k ) Xk σ2(1+λ 2 k ) 2λ k(λ k Xk (Yk φ )) σ2(1+λ 2 k )2

Xk σ2(1+λ 2 k ) 2λ k(λ k Xk (Yk φ )) σ2(1+λ 2 k )2 (3λ 2 k 1)(λ k Xk (Yk φ ))2

σ2(1+λ 2 k )3 4λ k Xk(λ k Xk (Yk φ )) σ2(1+λ 2 k )2 + X2 k σ2(1+λ 2 k )

= 1 σ2(1 + λ 2 k )

1 E (Zk) E (Zk) E Z2 k and

E M(Xk, Yk|Zk, θ k)( M(Xk, Yk|Zk, θ k))T

1 σ2(1+λ 2 k ) E(Zk) σ2(1+λ 2 k ) E(Zk) σ2(1+λ 2 k ) E(Z2 k) σ2(1+λ 2 k ) + 1 (1+λ 2 k )2

= 1 σ2(1 + λ 2 k )

1 E (Zk) E (Zk) E Z2 k + σ2 1+λ 2 k

Thus we have

Jφ|λ(θ k) = 1 σ2(1 + λ 2 k )(1 (EZk)2

EZ2 k ) = 1 σ2(1 + λ 2 k ) Var(Zk)

Var(Sφ) = 1 E(Zk)

1 σ2(1 + λ 2 k )

1 EZk EZk EZ2 k + σ2 1+λ 2 k

= 1 σ2(1 + λ 2 k )

which leads to the Equation (8) in the main text.

A.3 Equivalent variance minimization formulations of the weighted estimators

For simplicity, we assume that n1 = n2 = = n K = n. We claim that the following two formulations of the variance minimization problem have identical solution. Formulation 1: Trace Operator

Minimize Wk trace K X

k=1 Wk Hk W T k

k=1 Wk = Ip1. (24)

Formulation 2: Frobenius Norm

Minimize Wk

k=1 Wk Hk W T k F , s.t.

k=1 Wk = Ip1. (25)

GU AND CHEN

Proof We solve the problem (24) ﬁrst. The Lagrangian of this problem is

L1 = trace K X

k=1 Wk Hk W T k

k=1 Wk Ip1 >,

where Λ1 Rp1 p1 is the corresponding Lagrangian multiplier. If we take derivative of L1 w.r.t. Wk we can obtain 2Wk Hk + Λ1 = 0, k = 1, 2, , K. Then Wk = 1

2Λ1H 1 k . Using the constraint PK k=1 Wk = Ip1, we can obtain Λ 1 = 2(PK s=1 A 1 s ) 1 and W k = (PK s=1 A 1 s ) 1A 1 k . Now we turn to solve the problem (25). Equivalently we can minimize the square of the Frobenius norm, and the corresponding Lagrangian is

k=1 Wk Hk W T k 2 F + < Λ2,

k=1 Wk Ip1 > .

Taking derivative w.r.t. Wk we can obtain 4(PK s=1 Ws As W T s )Wk Ak + Λ2 = 0. Now we can use the constraint PK k=1 Wk = Ip1 and get Λ 2 = 4(PK s=1 Ws As W T s )(PK s=1 A 1 s ) 1 and W k = (PK s=1 A 1 s ) 1A 1 k .

A.4 Second-order Bartlett s indentity under QMLE

For the quasi maximum likelihood estimation (QMLE), we only check that the second order Bartlett s identity holds for independent observarions. Suppose that the components of the response vector Y are independent with mean vector µ and covariance matrix σ2V (µ), where σ2 maybe unknown and V (µ) is a matrix of known functions. It is assumed that the parameters of interest, θ, is a function of µ. By independence of the components of Y and the physical mechanism plausibility, it is reasonable to assume further that Vi(µ) depends on µ only through µi, which implies that

V (µ) = diag{V1(µ1), V2(µ2), , Vn(µn)}.

For a single observation Y , we can construct the score function as U = u(µ; Y ) = (Y µ)/(σ2V (µ)). Then the corresponding objective function can be deﬁned as

Q(µ; y) = Z µ

y t σ2V (t)dt, (26)

which behaves like a negative log-likelihood: E ( µQ) = 0, Var( µQ) = E( 2 µQ) = 1/{σ2V (µ)}. We refer to Q(µ; y) as the negative quasi-likelihood (Mc Cullagh, 1983), or more precisely the negative log quasi-likelihood for µ based on data y. By independence, the negative quasi-likelihood for the complete data is the sum of the individual contributions: Q(µ; y) = Pn i=1 Q(µi; yi). The quasi-likelihood estimating equations for the regression parameters θ, obtained by differentiating Q(µ; y), can be written in the form U(ˆθ) = 0, where U(θ) = DV 1(Y µ)/σ2 is called the quasi-score function. The components of D, of order n p, are Dir = µi/ θr, the derivatives of µ(θ) with respect to the parameters. Suppose the true parameters are θ and µ , then by the zero-mean of U(θ ), we have

Co V{U(θ )} = E U(θ )U(θ )T = DT V 1D/σ2 and

θT (θ ) = E{DT V 1 µ

θT /σ2 + DT V 1

σ2 } = DT V 1D/σ2.

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

A.5 Generalized second-order Bartlett s identity for parametric regression

Suppose that we observe a random sample (X1, Y1), (X2, Y2), , (Xn, Yn), which follows

Y = fθ (X) + e, E(e|X) = 0, Var(e|X) = σ2(X), X p(x).

Then the objective function for the least square estimation is M(Z, θ) = (Y fθ(X))2 with Z = (X, Y ). Note that

E (M(Z, θ)) = E(fθ(X) fθ (X))2 + Ee2 E (M(Z, θ )) + E((θ θ )T fθ (X))2, (27)

which suggests that 2 θM(θ ) = 2E fθ (X) fθ (X)T where M(θ) = EM(Z, θ). For the approximation (27), see van der Vaart (1999). If we assume the independence between e and X, which implies Var(e) = σ2, then E M(Z, θ ) M(Z, θ )T = 4σ2E fθ (X) fθ (X)T

with the multiplicative factor γ for the generalized second-order Bartlett s identity being 4σ2.

A.6 GMM formulation of the full sample statistical optimization under heterogeneity

It is noted that W0 admits the following form

Var{ψθ1(X1,1; φ , λ 1)} 1 0 0 0 Var{ψθ2(X2,1; φ , λ 2)} 1 0 ... ... 0 0 Var{ψθK(XK,1; φ , λ K)} 1

Thus, W0 is a block diagonal matrix. Also note that

GT 0 = E{ ψT N(θ )

ψφ φ(X1,i; φ , λ 1) ψλ φ(X1,i; φ , λ 1) ψφ φ(XK,i; φ , λ K) ψλ φ(XK,i; φ , λ K)

ψφ λ(X1,i; φ , λ 1) ψλ λ(X1,i; φ , λ 1) 0 0 0 0 0 ψφ λ(X2,i; φ , λ 2) ψλ λ(X2,i; φ , λ 2) 0 ... ... ... ... ... ... 0 0 0 0 ψφ λ(XK,i; φ , λ K) ψλ λ(XK,i; φ , λ K)

then the asymptotic variance of the GMM estimtator (Hansen, 1982) is Asy V Ar(ˆθGMM) = (GT 0 W0G0) 1 and has the following form:

PK k=1 nk DΨφ(θ k)T Σ 1 S,k DΨφ(θk) n1DΨφ(θ 1)T Σ 1 S,1DΨλ(θ 1) n KDΨφ(θ K)T Σ 1 S,KDΨλ(θ K) n1DΨλ(θ 1)T Σ 1 S,1DΨφ(θ 1) n1DΨλ(θ 1)T Σ 1 S,1DΨλ(θ 1) 0 0 ... 0 ... ... ... ... ... ... ... 0 n KDΨλ(θ K)T Σ 1 S,KDΨφ(θ K) 0 0 n KDΨλ(θ K)T Σ 1 S,KDΨλ(θ K)

DΨφ(θk)T = Ψφ φ(θk) Ψλ φ(θk) , DΨλ(θk)T = Ψφ λ(θk) Ψλ λ(θk) and ΣS,k = Var{ψθk(Xk,1; φ , λ k)}.

GU AND CHEN

By the inversion of block matrix, approximately Var(ˆφGMM) 1 has the following form:

DΨφ(θ k)T Σ 1 S,k DΨφ(θ k)

DΨφ(θ k)T Σ 1 S,k DΨλ(θ k) DΨλ(θ k)T Σ 1 S,k DΨλ(θ k) 1 DΨλ(θ k)T Σ 1 S,k DΨφ(θ k) .

If we denote the elements in the above summation as nk Uk, then it is straightforward to verify that

( DΨφ(θ k)T

ΣS,k DΨφ(θ k) DΨλ(θ k) ) 1 ,

namely, the inverse of Uk is the left top part of the inverse of a bigger matrix in the RHS of the above equation, from which we are able to obtain the simpliﬁed expression of Uk:

Uk = J 1 φ|λ Ip1 p1 Ψλ φ(θ k)Ψλ λ(θ k) 1 ΣS,k

Ip1 p1 Ψλ λ(θ k) 1Ψφ λ(θ k)

= Jφ|λΣ 1 k Jφ|λ.

Now we conclude that Var(ˆφGMM) PK k=1 Jφ|λΣ 1 k Jφ|λ 1 , which is the same as that of the

WD estimator ˆφWD.

A.7 Lipschitz continuity of the outer product of the gradient in logistic regression model

First we deﬁne the logit function logit(a) = exp(a)/(1 + exp(a)) for a R. Then the logistic regression model can be deﬁned as P(Y = 1|X) = logit(XT β ), where X, β Rp. If we deﬁne the objective M as M(z, β) = y log(logit(x T β))+(y 1) log(1 logit(x T β)), where z = (y, x), then the outer product of gradient, denoted as f(z, β), is f(z, β) = (y logit(x T β))2xx T . Now we have

f(z, β1) f(z, β2) 2 = xx T (2y logit(x T β1) logit(x T β2))(logit(x T β1) logit(x T β2)) 2 = xx T (2y logit(x T β1) logit(x T β2))(1 logit(ξ))logit(ξ)x(β1 β2) 2 x 3 2 β1 β2 2,

where the second equality comes from an application of the mean value theorem.

A.8 Asymptotic efﬁciency comparison of ˆλk and ˆλ(2) k

Theorem 9 Under the conditions required in Theorem 4, if K , then for the updated estimator ˆλ(2) k , we have that

nk(ˆλ(2) k λ k) d N(0, Ψλ λ(θ k) 1Eψλ(Xk,1; θ k)ψλ(Xk,1; θ k)T Ψλ λ(θ k) 1). (28)

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

Hence, the asymptotic distribution of ˆλ(2) k is the same as that of the estimator of λ k obtained when the common parameter φ is known. It is noted that the joint asymptotic distribution for the estimator ˆθk = (ˆφT k , ˆλT k )T is

nk(ˆθk θ k) d N(0, Jk(θ k) 1Eψθk(Xk,1; θ k)ψθk(Xk,1; θ k)T Jk(θ k) 1),

which leads to

nk(ˆλk λ k) d N(0, Jλ|φ(θ k) 1Var(Sφ(Xk,1; θ k))Jλ|φ(θ k) 1). (29)

There is not a deﬁnite order on the relative efﬁciency between ˆλk and ˆλ(2) k by comparing the two asymptotic variances in (28) and (29), suggesting it would depend on the speciﬁc M function and the model setting. For general statistical optimization, a known nuisance parameter (here φ ) does not necessarily improve the efﬁciency of a parameter of interest Yuan and Jennrich (2000); Henmi and Eguchi (2004), which is the case for the current setting. Consider again the errors-in-variables model where it can be shown that

Var(ˆλ(2) k ) σ4

(Var(Zk))2 1 nk and Var(ˆλk) σ4

(E(Z2 k))2 + σ2(1 + λ2 k) E(Z2 k)

When E (Zk) = 0, i.e. Var(Zk) = E(Z2 k), the updated estimator ˆλ(2) k is more efﬁcient, and the efﬁciency gain gets large as λ2 k increases. However, if E (Zk) has a large absolute magnitude, ˆλk can be more efﬁcient than ˆλ(2) k . Moreover, the requirement in Theorem 9 that K is to obtain a succinct asymptotic variance of ˆλ(2) k . The above conclusion does not change for the ﬁxed K case. Consider block 1, we assume ˆλ(2) 1 p λ 1 and ˆφWD is n1 consistent (detailed proofs of both claims are available in the next section). Then by Theorem 1 in Yuan and Jennrich (2000), if n1 1

n1 Pn1 i=1 ψλ(X1,i; θ 1) + Ψφ λ(θ 1)(ˆφWD φ ) d N(0, Q), we will have n1(ˆλ(2) 1 λ 1) d N(0, Ω) where Ω= Ψλ λ(θ 1) 1QΨλ λ(θ 1) 1. Denote Tn,K = nΨλ λ(θ 1) 1 1

n Pn i=1 ψλ(X1,i; θ 1) +

Ψφ λ(θ 1)(ˆφWD φ ) , then Tn,K should have the same asymptotic distribution as n(ˆλ(2) 1 λ 1). So, we study the limiting behavior of Tn,K for simplicity. Consider the homogeneous scenario as a special case when θ 1 = θ 2 = = θ K, n1 = n2 = = n K = n, then the optimal weights are W 1 = W 2 = = W K = 1

K Ip1 p1. Now we have

Tn,K = 1 nΨλ λ(θ 1) 1 n X

i=1 ψλ(X1,i; θ 1) Ψφ λ(θ 1)

1 K (ˆφk φ )

= 1 nΨλ λ(θ 1) 1 n X

i=1 ψλ(X1,i; θ 1) Ψφ λ(θ 1)

i=1 J 1 φ|λSφ(Xk,i; θ k) + op(1)

K )Ψλ λ(θ 1) 1Ψφ λ(θ 1) Ip2 p2 2M1(θ 1) 1 1 n

ψφ(X1,i; θ 1) ψλ(X1,i; θ 1)

Ψλ λ(θ 1) 1Ψφ λ(θ 1) 1 n 1 K

i=1 J 1 φ|λSφ(Xk,i; θ k) + op(1) = T (1) n,k + op(1).

GU AND CHEN

We can verify that Var(T (1) n,k) = (1 1

K )Ψλ λ(θ 1) 1Var(ψλ(X1,1; θ 1))Ψλ λ(θ 1) 1 + n

K Var(ˆλ1), or

equivalently, Var(ˆλ(2) 1 ) (1 1

K )Ψλ λ(θ 1) 1Var(ψλ(X1,1; θ 1))Ψλ λ(θ 1) 1 1

K Var(ˆλ1). Thus

Var(ˆλ(2) 1 ) Var(ˆλ1) if and only if

Ψλ λ(θ 1) 1Var(ψλ(X1,1, θ 1))Ψλ λ(θ 1) 1/n Var(ˆλ1). (30)

The LHS of inequality (30) is the asymptotic variance of the estimator of λ 1 if φ 1 is known and RHS is the asymptotic variance of estimator of λ 1 when we jointly estimate (φ T 1 , λ T 1 )T . Henmi and Eguchi (2004) showed that the inequality does not always hold for general statistical optimization problem and derived a sufﬁcient condition under which a known nuisance parameter (φ ) will lead to a bigger asymptotic variance of the estimator of the parameter of interest (λ 1).

A.9 Comparison with a one-step estimator

Huang and Huo (2019), also under the same homogeneous setting, considered to utilize the second order information of the M-function to allow for a larger K. They proposed a one-step estimator which aggregates the local Hessian matrices and gradients and performs a single Newton-Raphson updating. The estimator, denoted as ˆθ(1), has a MSE upper bound

E ˆθ(1) θ 1 2 2 2E θ1Ψθ(θ 1) 1ψθ1(X1,1; θ 1) 2 2

Thus, this method allows for K = o(n3), while still preserves the O(N 1) convergence rate. The price of this procedure is one extra round of transmission of the local Hessians and gradients. To mitigate the communication burden, they considered to use only one local Hessian matrix instead of the averaged one. Let ˆθ(1) LH be the estimator. They showed that

E ˆθ(1) LH θ 1 2 2 2E θ1Ψθ(θ 1) 1ψθ1(X1,1; θ 1) 2 2

which is similar to the MSE bound of the d Sa C estimator in Corollary 8. However, both ˆθ(1) and ˆθ(1) LH are not readily extended to the heterogeneous setting, as the one-step update procedure relies crucially on the N1/2 consistency of the initial estimators of all the unknown parameters (van der Vaart, 1999), but the convergence rate of the block-speciﬁc estimators ˆλk are only of order Op(n1/2 k ).

A.10 Bias correction for statistical optimization under logistic regression model

Given observations {(yi, Xi)}n i=1, we now construct ˆB(β). Denote y = (y1, y2, , yn)T , X = (X1, X2, , Xn)T and ˆy = (ˆy1, ˆy2, , ˆyn) with ˆyi = logit(x T i β). Since

daj logit(a) = logit(a)

s=1 (1 logit(a)s),

then we have Mn(β) = 1

n XT (ˆy y), 2Mn(β) = 1

n XT diag{ˆy (1 ˆy)}X and 3Mn(β) = 1 n Pn i=1 ˆyi(1 ˆyi)(1 2ˆyi)xivec(xi xi)T , where denotes the element-wise product of two vectors and vec is the vectorization operator. Then, the bias-correction formula is a combination of the gradients up to the third order.

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

Appendix B. Proofs

Without loss of generality, we assume equal sample size n in each data block. Besides, unless otherwise stated, we will use C, ci and Ci to denote positive constants independent of (nk, K, N), and the same Ci can have different values from one context to another.

Before presenting the proofs of the theoretical results established in the main paper, we ﬁrst establish some technical lemmas in the following sub-section.

Lemma B.1 Suppose H and K are positive deﬁnite matrices of order p, and X and Y are arbitrary p m matrices. Then, Q = XT H 1X + Y T K 1Y (X + Y )T (H + K) 1(X + Y ) 0.

Proof Let A and B be deﬁned as follows

A = H X XT XT H 1X

, B = K Y Y T Y T K 1Y

Since H, K are positive deﬁnite, we can directly check that A, B are positive semi-deﬁnite. Thus A + B is also positive semi-deﬁnite, and the conclusion follows. See Haynsworth (1970); Ando (1979) for more similar types of matrix inequalities.

Lemma B.2 Under Assumptions 1 - 4 and Assumptions 5 - 6 with min{v, v1} 1, if K = o(nv2), then sup 1 k K ˆθk θ k 2 P 0.

Besides, the following holds for all k = 1, 2, , K and 1 v2 v1

E ˆθk θk 2v2 2 C R2v2

nv2 + 1 + L2v

Proof Let Gn,k = 1

n Pn i=1 Gk(Xk,i), Mn,k(θ) = 1

n Pn i=1 M(Xk,i; θ) and δρ = min{ρ, ρρ /4G}. For k = 1, ..., K, deﬁne the following good events :

Ek = {Gn,k 2G, 2 θk Mn,k(θ k) 2 θk Mk(θ k) 2 ρρ

2 , θk Mn,k(θ k) 2 (1 ρ)ρ δρ

2 }, (34) then by Lemma 6 in Zhang et al. (2013), we obtain that under the event K k=1Ek,

ˆθk θ k 2 2 θk Mn,k(θ k) 2 (1 ρ)ρ holds for all k = 1, 2, , K. (35)

Similar to the proof of Lemma C.1 in Jordan et al. (2019), we can show that there exist constants C1, C2, C3 independent of (n, K, d, L, R) such that

P( K k=1Ec k) K C1 + C2(log2d)2v L2v

nv + C3R2v1

GU AND CHEN

Now For any ϵ > 0 and k K, we further deﬁne the events E k = { θk Mn,k(θ k) 2 (1 ρ)ρ ϵ/2}. Then by Markov s inequality and the union bound, there exist a positive constant C4 depending on ϵ such that

P( K k=1E c k ) C4(log2d)2v L2v

Thus, sup 1 k K ˆθk θ k 2 P 0 as n as long as K = o(nmin{v,v1}). Besides, the higher-order

estimation error bound follows from the following decomposition

E ˆθk θk 2v2 2 = E ˆθk θk 2v2 2 (I(Ek) + I(Ec k))

C E θk Mn,k(θ k) 2v2 2 + P ((Ek)c)

nv2 + 1 + L2v

Lemma B.3 Inv(A) : GL(Rp) GL(Rp) : A 7 A 1 is Lipschitz continuous at any A GL(Rp), where GL(Rp) consists of all p p invertible matrices of real numbers.

Proof Let A0 GL(Rp) be given. Denote 1/ A 1 0 2 = δ > 0. It follows that for all x Rp we have x 2 = A 1 0 A0x 2 (1/δ) A0x 2, namely A0x 2 δ x 2. Assume that A A0 2 < δ/2, then Ax 2 A0x 2 (A A0)x 2 δ

2 x 2, which means A 1 exists and A 1 2 2/δ. Since A 1 A 1 0 = A 1(A0 A)A 1 0 , A 1 A 1 0 2 A 1 2 A0 A 2 A 1 0 2 (2/δ2) A A0 2, which completes the proof.

Lemma B.4 Under Assumptions 1 - 4 and 7, and Assumptions 5 - 6 for v, v1 1, if K = o(n),

k=1 Hk(θ k) 1(ˆφk φ )}T {

k=1 nk Hk(θ k) 1} 1{

k=1 nk Hk(θ k) 1(ˆφk φ )} d χ2 p1.

Proof We prove for the case when K , and it is straightforward to derive the proof for the ﬁxed K case. Since we have assumed equal sample size n for simplicity, we can denote

s=1 Hs(θ s) 1} 1

k=1 Hk(θ k) 1(ˆφk φ ),

and the problem is equivalent to show that T1 2 2 χ2 p1 in distribution when K = o(n). Since all the smoothness conditions in Assumptions 5 - 6 only holds locally, namely in the Uρ ball, so all the Taylor expansions hold only under the event K k=1Ek, where the deﬁnition of the event Ek can be found in the proof of Lemma B.2. Now, under the event K k=1Ek, by the integral form of Taylor s expansion of θk Mn,k(θk) around the true parameter θ k, we have

ˆθk θ k = Jk(θ k) 1 θk Mn,k(θ k) + R(k) n , (36)

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

where R(k) n = R(k) n,1 + R(k) n,2,

R(k) n,1 = Jk(θ k) 1{ Z 1

0 2 θk Mn,k(θ k + t(ˆθk θ k))dt 2 θk Mn,k(θ k)}(ˆθk θ k) and

R(k) n,2 = Jk(θ k) 1{ 2 θk Mn,k(θ k) Jk(θ k)}(ˆθk θ k)

for each k. Recall the deﬁnition of Jφ|λ and Sφ, if we denote

Pk = Hk(θ k) 1Jφ|λ(θ k) 1 Ip1 p1 Ψλ φ(θ k)Ψλ λ(θ k) 1 , (37)

s=1 Hs(θ s) 1} 1

i=1 Pkψθk(Xk,i; θ k)

k=1 Hk(θ k) 1) 1

k=1 Pk{ Z 1

0 2 θk Mn,k(θ k + t(ˆθk θ k))dt 2 θk Mn,k(θ k)}(ˆθk θ k) and

k=1 Hk(θ k) 1) 1

k=1 Pk{ 2 θk Mn,k(θ k) Jk(θ k)}(ˆθk θ k),

T1 = (T1,0 + R1 + R2) I( K k=1Ek) + T1 1 I( K k=1Ek)

= T1,0 + (R1 + R2) I( K k=1Ek) + (T1 + T1,0) 1 I( K k=1Ek) .

The proof of Lemma B.2 also shows that P( K k=1Ek) = 1 O(K/nv2), where v2 = min{v, v1}. Thus, it sufﬁces to establish the asymptotic normality for T1,0, show that the R1 and R2 terms are both op(1) terms and then apply the Slutsky s lemma. Considering the T1,0 term, we apply the Cramer-Wold device to reduce the problem into a scalar case. For any non-zero l Rp1, let l T k = l T { 1

K PK s=1 Hk(θ s) 1} 1/2Pk, then l T T1,0 = N 1/2 PK k=1 Pn i=1 l T k ψθk(Xk,i; θ k). If we denote ZK,k = N 1/2 Pn i=1 l T k ψθk(Xk,i; θ k), then l T T1,0 = PK k=1 ZK,k and E (ZK,k) = 0. Below we check the Lindeberg conditions. First, PK k=1 E(Z2 K,k) = l T l = σ2 l > 0. Second, for any ϵ > 0,

k=1 E(|ZK,k|2; |ZK,k| > ϵ) =

k=1 E(|ZK,k|2I{|ZK,k|>ϵ})

ϵ )P(|ZK,k I{|ZK,k|>ϵ}| > t)dt

k=1 P(|ZK,k| > ϵ) +

ϵ P(|ZK,k| > t)dt,

where the second equality comes from the tail-sum formula for expectations of absolute moments. Using Chebyshev s inequality, Marcinkiewicz-Zygmund inequality with b3 being the corresponding constant and Jensen s inequality , we can show that

k=1 P(|ZK,k| > ϵ) b3 ϵ3K3/2

k=1 lk 3 2E ψθk(Xk,1; θ k) 2 2 . (38)

GU AND CHEN

Recalling the deﬁnition of lk, we can use the boundedness of Hk(θ k) and 2 θk Mk(θ k) to show that lk 2 C l 2. Thus we have that

k=1 P(|ZK,k| > ϵ) b3C ϵ3K3/2 l 2K max 1 k KE ψθk(Xk,1; θ k) 3 2 max 1 k KE ψθk(Xk,1; θ k) 3 2

Now consider the PK k=1 R ϵ P(|ZK,k| > t)dt term. By replacing the ϵ in (38) with t, we have

ϵ P(|ZK,k| > t)dt 1

Thus we conclude that T1,0 d N(0, Ip1 p1). Now we consider the remainder term R2. Since

K PK k=1 Hk(θ k) 1 2 is bounded, we only need to show R2,1 = { 1

K PK k=1 Hk(θ k) 1}R2 is

op(1). Since R2,1 2 q

n 1 K PK k=1 Pk 2 n{ 2 θk Mn,k(θ k) Jk(θ k)} 2 n(ˆθk θ k) 2, by Markov s inequality and H older s inequality, we will have

P( R2,1 ϵ) C1

E n{ 2 θk Mn,k(θ k) Jk(θ k)} 2 2 E n(ˆθk θ k) 2 2 .

From Lemma 7 of Zhang et al. (2013) and Assumption 5 with v 1, we know that

E n{ 2 θk Mn,k(θ k) Jk(θ k)} 2 2 C.

On the other hand, by Lemma 6 of Zhang et al. (2013) and using the event Ek we can show that E n(ˆθk θ k) 2 2 C1. Since K = o(n), we conclude that R2 = op(1). Considering the R1 term, we can similarly prove that R1 2 = op(1) by using the Lipschitz condition 2 θk M(x; θk) 2 θk M(x; θ k) 2 G(x) θk θ k 2 as assumed in Assumption 5. The result follows via a direct application of the Slutsky s lemma.

Lemma B.5 Under the same conditions required by Lemma B.4, the following term is asymptotically negligible (i.e. op(1)):

s=1 ns ˆHs(ˆθs) 1} 1nk ˆHk(ˆθk) 1(ˆφk φ )

s=1 ns Hs(θ s) 1} 1nk Hk(θ k) 1(ˆφk φ ) .

Proof Denote the above term as T2, then we have that

s=1 ˆHs(ˆθs) 1} 1 2 1 K

k=1 n( ˆHk(ˆθk) 1 Hk(θ k) 1) 2 n(ˆφk φ ) 2

k=1 Hk(θ k) 1 2 n { 1

s=1 ˆHs(ˆθs) 1} 1 { 1

s=1 Hs(θ s) 1} 1 2 n(ˆφk φ ) 2

n (T (1) 2 + T (2) 2 ).

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

Since K = o(n), it sufﬁces to show T (1) 2 and T (2) 2 are both Op(1). Under the event AK deﬁned

in Equation (61), we have T (2) 2 I(AK) C

n ˆΣS,k(θ k) ΣS,k(θ k) 2 + n ˆLk(θ k)

Lk(θ k) 2 + n(ˆθk θ k) 2

n(ˆθk θ k) 2. Thus for v 1, v1 2, by Markov s inequality

and Cauchy s inequality we have

P(T (2) 2 > 1, AK) n max 1 k K

E ˆΣS,k(θ k) ΣS,k(θ k) 2 2 E ˆθk θ k 2 2

E ˆLk(θ k) Lk(θ k) 2 2 E ˆθk θ k 2 2 + C 3E ˆθk θ k 2 2

Since we have shown P(AK) 1 if K = o(n v) with v = min{v, v1

2 }, and we have assumed that

K = o(n) , we can conclude that T (2) 2 = Op(1). We can similarly show that T (1) 2 = Op(1). Now we complete the proof.

Lemma B.6 Let A1, A2, , An Sp p, if

vec(A2)T ... vec(An)T

( ) 2 A 2 2 holds for Rp,

then A 2 pn A, where A = (vec(A1), vec(A2), , vec(An))T .

Proof Since A( ) = ( T A1 , T A2 , , T An )T , A2 4 2 Pn i=1( T Ai )2

which implies max i n Ai 2 A. On the other hand, for B = (A1, A2, , An) Rp np, we

have B 2 2 = λmax(Pn i=1 Ai AT i ) Pn i=1 λmax(Ai AT i ) = Pn i=1 Ai 2 n A2, which gives

A 2 = AT 2 q Pn i=1 vec(Ai) 2 2 = q Pn i=1 Ai 2 F pn A.

Lemma B.7 Under Assumptions 1 - 4 and 7 - 8, and Assumption 5 with v, v1 4 ,

E ˆBk(ˆθk)IEk,bc Bk(θ k) 2 2 C

Proof Denote k = ˆθk θ k. By the deﬁnition of the event Ek,bc given in Algorithm 2, we immediately have that ˆBk(ˆθk)1Ek,bc Bk(θ k) 2 2 Cn2. Below we ﬁrst control the Qk(θ k) ˆQk(ˆθk) 2 term. Note that Qk(θk) and ˆQk(θk) are exactly L 1 k (θk) and ˆLk(θk) 1 deﬁned in the proof of Theorem 3, thus under the event { ˆLk(ˆθk) Lk(θ k) 2 ρ

2 }, we have Qk(θ k) ˆQk(ˆθk) 2 2 ρ2 ˆLk(ˆθk) Lk(θ k) 2. Besides, when k 2 ρ, we have

ˆLk(ˆθk) Lk(θ k) 2 1

i=1 G(Xk,i) k 2 + ˆLk(θ k) Lk(θ k) 2. (39)

GU AND CHEN

Without loss of generality, we can assume ρ 8Gρ, then if we deﬁne EQ,k = { k 2 ρ 8G, Gn,k 2G, ˆLk(θ k) Lk(θ k) 2 < ρ

ˆQk(ˆθk) 21EQ,k Qk(θ k) ˆQk(ˆθk) 2 + Qk(θ k) 2 1EQ,k 1

Qk(θ k) ˆQk(ˆθk) 21EQ,k Ek C G θk Mn,k(θ k) 2 + ˆLk(θ k) Lk(θ k) 2 .

Using union bound and Markov s inequality, it is easy to show

P ((EQ,k)c) C (1 + L2v)

Besides, under this event, we can decompose the estimation error using the event Ek and obtain

E 1EQ,k Qk(θ k) ˆQk(ˆθk) 2 2 C E θk Mn,k(θ k) 2 2 + E ˆLk(θ k) Lk(θ k) 2 2 + P ((Ek)c)

n + P ((Ek)c) . (40)

It is noted that

ˆBk(ˆθk) Bk(θ k) 2 2

2 ˆQk(ˆθk) 1

i=1 ˆvi,k(ˆθk) ˆdi,k(ˆθk) Qk(θ k)E (v1,k(θ k)d1,k(θ k)) 2 2

2 ˆQk(ˆθk) ˆH3,k(ˆθk) 1

i=1 ˆdi,k(ˆθk) ˆdi,k(ˆθk) Qk(θ k)H3,k(θ k)E (d1,k(θ k) d1,k(θ k)) 2 2

:= 2Ωk,1 + 1

then we can bound those two terms respectively. For Ωk,1, under the event EQ,k we have

2 ˆQk(ˆθk){ 1

i=1 ˆvi,k(ˆθk) ˆdi,k(ˆθk) E (v1,k(θ k)d1,k(θ k))} 2 21EQ,k

+1EQ,k Qk(θ k) ˆQk(ˆθk) 2 2 E (v1,k(θ k)d1,k(θ k)) 2 2

i=1 (ˆvi,k(ˆθk) ˆdi,k(ˆθk) ˆvi,k(θ k)di,k(θ k)) 2 21EQ,k | {z }

i=1 ˆvi,k(θ k)di,k(θ k) E (v1,k(θ k)d1,k(θ k)) 2 2 + 1EQ,k Qk(θ k) ˆQk(ˆθk) 2 2

By Lemma 7 in Zhang et al. (2013), we have

i=1 ˆvi,k(θ k)di,k(θ k) E (v1,k(θ k)d1,k(θ k)) 2 2

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

Besides, given (40), it sufﬁces to control E Ω(1) k,1 . Note that

Ω(1) k,1 2 1

i=1 (ˆvi,k(ˆθk) ˆvi,k(θ k)) ˆdi,k(ˆθk) 2 21EQ,k + 2 1

i=1 ˆvi,k(θ k)( ˆdi,k(ˆθk) di,k(θ k)) 2 21EQ,k

:= 2(Ω(2) k,1 + Ω(3) k,1).

Under the event EQ,k Ek, we have ˆvi,k(ˆθk) ˆvi,k(θ k) 2 CG(Xk,i) θk Mn,k(θ k) 2. Besides, note that θk Mn,k(θ k) 2 is bounded under Ek, then by Taylor s expansion we can show that

θk M(Xk,i; ˆθk) θk M(Xk,i; θ k) 21Ek C G(Xk,i) θk Mn,k(θ k) 2 2 + 2 θk M(Xk,i; θ k) 2 θk Mn,k(θ k) 2 1Ek C G(Xk,i) + 2 θk M(Xk,i; θ k) 2 θk Mn,k(θ k) 21Ek. (42)

Now by H older s inequality, we can show that

E Ω(2) k,11Ek

θk Mn,k(θ k) 2 2 1 n

i=1 G2(Xk,i) θk M(Xk,i; ˆθk) 2 21Ek

i=1 {E θk Mn,k(θ k) 6 2 E G6(Xk,i) E θk M(Xk,i; ˆθk)1Ek 6 2 }1/3

C G2R2(G2 + L2 + R2)

For E Ω(3) k,11Ek , ﬁrst note that

ˆdi,k(ˆθk) di,k(θ k) 2 ˆQk(ˆθk) Qk(θ k) 2 θk M(Xk,i; θ k) 2+ ˆQk(ˆθk) 2 θk M(Xk,i; ˆθk) θk M(Xk,i; θ k) 2, (43) then combined with (39) and (42), we have that

E Ω(3) k,11Ek

i=1 E 2 θk M(Xk,i; θ ) 2 2 G2 θk Mn,k(θ k) 2 2 + 2 θk Mn,k(θ k) 2 θk Mk(θ k) 2 2 θk M(Xk,i; θ k) 2 2

+ G2(Xk,i) + 2 θk M(Xk,i; θ k) 2 2 2 θk Mn,k(θ k) 2 2

C L2R2(L2 + G2 + G2R2)

Now we consider Ωk,2. First, by Assumption 5 and Lemma B.6, we can show

3 θk M(Xk,i; θ k) H3,k(θ k) 2 C(G(Xk,i) + G), (44)

GU AND CHEN

leading to E ˆH3,k(θ k) H3,k(θ k) 2v 2 CG2v

nv . Besides, using Assumption 8 and Lemma B.6,

ˆH3,k(ˆθk) ˆH3,k(θ k) 21Ek C

i=1 A(Xk,i) θk Mn,k(θ k) 21Ek.

C1 ˆQk(ˆθk) ˆH3,k(ˆθk) 1

i=1 ˆdi,k(ˆθk) ˆdi,k(ˆθk) Qk(θ k)H3,k(θ k) 1

i=1 di,k(θ k) di,k(θ k) 2 2

i=1 di,k(θ k) di,k(θ k) E (d1,k(θ k) d1,k(θ k)) 2 2

C1 ( ˆQk(ˆθk) ˆH3,k(ˆθk) Qk(θ k)H3,k(θ k)) 1

i=1 ˆdi,k(ˆθk) ˆdi,k(ˆθk) 2 2

i=1 ˆdi,k(ˆθk) ˆdi,k(ˆθk) 1

i=1 di,k(θ k) di,k(θ k) 2 2

i=1 di,k(θ k) di,k(θ k) E (d1,k(θ k) d1,k(θ k)) 2 2

:= C Ω(1) k,2 + G2Ω(2) k,2 + G2Ω(3) k,2

Using lemma 7 in Zhang et al. (2013), we have E Ω(3) k,2 C L4R4

n . Considering Ω(1) k,2,

Ω(1) k,21EQ,k Ek

C( ˆH3,k(ˆθk) H3,k(θ k) 2 2 + G2 ˆQk(ˆθk) Qk(θ k) 2 2)( 1

i=1 θk M(Xk,i; ˆθk) 2 2)21EQ,k Ek,

then by H older s inequality, we can show that

E Ω(1) k,21EQ,k Ek C r

E ˆH3,k(ˆθk) H3,k(θ k) 4 21EQ,k Ek + G2 r

E ˆQk(ˆθk) Qk(θ k) 4 21EQ,k Ek v u u t E( 1

i=1 θk M(Xk,i; ˆθk) 2 21Ek)4

G2(1 + L2 + R2) + A2R2 (G4 + R4 + L4) n .

Considering Ω(2) k,2, note for any two p-dimensional vectors ℓ1 and ℓ2, we have

ℓ1 ℓ1 ℓ2 ℓ2 2 2 = ℓ1ℓT 1 ℓ2ℓT 2 2 F 2 ℓ1 ℓ2 2 2 ℓ1 2 2 + ℓ2 2 2 , (45)

then using (42) and (43), we can show that

E Ω(2) k,21EQ,k Ek C R2(G2 + R2 + L2)2

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

Finally, we consider Ek,bc and we can show that

P(Ec k,bc) P(Ec k,bc EQ,k Ek) + P((EQ,k Ek)c)

C n2 E ˆB(ˆθk) Bk(θ k) 2 21EQ,k Ek + CP((EQ,k Ek)c).

Collecting all the above results, we have the following upper-bound

E ˆBk(ˆθk)1Ek,bc Bk(θ k) 2 2

E ˆB(ˆθk) Bk(θ k) 2 21EQ,k Ek + CP((EQ,k Ek Ek,bc)c)

R2 + L2 + R2L2 + G2R2(G2 + R2 + L2) + L2R2(L2 + G2 + G2R2) + R2(G2 + R2 + L2)2

+(G2(1 + L2 + R2) + A2R2)(G4 + R4 + L4) + CP((EQ,k Ek)c). (46)

This concludes the proof.

Now let the pseudo debiased weighted distributed estimator be ˆφpd WD = PK k=1 Wk(θ k)(ˆφk 1 n B1 k(θ k)), we then give the following lemma on the MSE bound of this estimator.

Lemma B.8 Under Assumptions 1 - 4 and 7 - 8, and Assumption5 with v, v1 4 ,

E ˆφpd WD φ 2 2 C1

Proof Under the event Ek deﬁned in the Lemma B.2, we have that

0 = θk Mn,k(θ k) + 2 θk Mn,k(θ k) k + 1

0 3 θk Mn,k(θ k + t k)dt}( k k)

= θk Mn,k(θ k) + 2 θk Mk(θ k) k + 1

2 3 θk Mk(θ k)( k k)

+( 2 θk Mn,k(θ k) 2 θk Mk(θ k)) k + 1

0 3 θk Mn,k(θ k + t k)dt 3 θk Mk(θ k) ( k k).

Recall that we have denoted Jk(θk) = 2 θk Mk(θk), solve for the above equation and we will have

k = Jk(θ k) 1 θk Mn,k(θ k) Jk(θ k) 1 2 θk Mn,k(θ k) 2 θk Mk(θ k) k

2Jk(θ k) 1 3 θk Mk(θ k)( k k)

2Jk(θ k) 1 Z 1

0 3 θk Mn,k(θ k + t k)dt 3 θk Mk(θ k) ( k k). (48)

Recalling the deﬁnition of Wk(θ k), we have that ˆφpd WD φ 2 2 C 1

K PK k=1 Hk(θ k) 1(ˆφk φ 1

n B1 k(θ k)) 2 2 = C 1

K PK k=1 Hk(θ k)( k 1

n Bk(θ k)) 2 2, where Hk(θ k) = Hk(θ k) 1 0 and

GU AND CHEN

thus Hk(θ k) 2 = Hk(θ k) 2. Denote

Ωk,1 = ( 2 θk Mn,k(θ k) 2 θk Mk(θ k)) k 1

n E (v1,k(θ k)d1,k(θ k)) ,

Ωk,2 = ( k k) 1

n E (d1,k(θ k) d1,k(θ k)) and

0 3 θk Mn,k(θ k + t k)dt 3 θk Mk(θ k) ( k k), then (49)

n Bk(θ k) =

i=1 di,k(θ k) + Qk(θ k)(Ωk,1 + 1

2H3,k(θ k)Ωk,2 + 1

+ k I(EC k ). (50)

Considering Ωk,1, denote Ω(1) k,1 and Ω(2) k,1 as follows such that Ωk,1 = Ω(1) k,1 + Ω(2) k,1:

Ω(1) k,1 = ( 2 θk Mn,k(θ k) 2 θk Mk(θ k))( k 1

i=1 di,k(θ k)) and

Ω(2) k,1 = ( 2 θk Mn,k(θ k) 2 θk Mk(θ k)) 1

i=1 di,k(θ k) 1

n E (v1,k(θ k)d1,k(θ k)) .

For Ω(1) k,1, we can show by Taylor s expansion that

i=1 di,k(θ k)

1Ek = Qk(θ k) Z 1

0 2 θk Mn,k(θ k + t k)dt 2 θk Mk(θ k) k1Ek,

then using H older s inequality we can show that

E Ω(1) k,1 2 2IEk CG4R2(G2 + 1)

For Ω(2) k,1, by the independence among the observations, it is easy to ﬁrst show that E Ω(2) k,1 = 0. Denote eij = E (vi,k(θ k)dj,k(θ k)), then we have

E Ω(2) k,1 2 2 = 1

t=1 E (vi,k(θ k)dj,k(θ k) eij)T (vs,k(θ k)dt,k(θ k) est) . (52)

By a conditioning argument and independence among observations, it is straightforward to show that if the set {i, j, s, t} has three or four unique elements, then E(vi,k(θ k)di,k(θ k) eij)T (vs,k(θ k)dt,k(θ k) eij) = 0. Thus the RHS of Equation (52) has at most O(n2) non-zero elements and each of those non-zero elements can be bounded using H older s inequality. Since E v1,k(θ k)d1,k(θ k) e11 2 2

C(LR)2, we have E Ω(2) k,1 2 2 C(LR

n )2. By independence among different Ω(2) k,1, we can directly show that

k=1 Hk(θ k)Qk(θ k)Ωk,1 2 2

n2K + G4R2(G2 + 1)

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

For Ωk,2 appeared in (49), we deﬁne Ω(1) k,2 and Ω(2) k,2 as follows such that Ωk,2 = Ω(1) k,2 + Ω(2) k,2:

Ω(1) k,2 = ( k k) ( 1

i=1 di,k(θ k)) ( 1

i=1 di,k(θ k)) and

Ω(2) k,2 = ( 1

i=1 di,k(θ k)) ( 1

i=1 di,k(θ k)) 1

n E (d1,k(θ k) d1,k(θ k)) .

Similar to (51), we can show that

k=1 Hk(θ k)Qk(θ k)H3,k(θ k)Ω(2) k,2 2 2 CR4

Besides, due to (45) and (51), we can show that E Ω(1) k,2 2 21Ek C G2R4

n3 . For Ωk,3 in (49),

combined with (44) and the Lipschitz continuity of 3 θk M(Xk,1, θk) with respect to θk in Uk, we

can show that E Ωk,3 2 21Ek C(A2R6+G2R4)

n3 . For k I(EC k ), E k I(EC k ) 2 2 CP(EC k ). In summary, we have the following MSE bound of the pd WD estimator:

E ˆφpd WD φ 2 2 C1 R2

n K +C2 R2(L2 + R2)

n2K +C3 G2R2(G2 + R2 + G4) + A2R6

k=1 P(Ec k).

The asymptotic normality of the pd WD estimator is established in the following lemma.

Lemma B.9 Under Assumptions 1 - 4 and 7 - 8, and Assumption 5 with v, v1 4 , if K = o(n2),

(ˆφpd WD φ )T {

k=1 nk Hk(θ k) 1}(ˆφpd WD φ ) d χ2 p1.

Proof Since

ˆφpd WD φ = { 1

k=1 Hk(θ k) 1} 1 1

k=1 Hk(θ k) k 1

n Bk(θ k) .

Using the expansion (50) and other results in the proof of Lemma B.8, when K = o(n2),

k=1 Hk(θ k)( k 1

n Bk(θ k)) = 1 Kn

i=1 Hk(θ k)di,k(θ k) + op(1). (53)

Then, the proof is completed with a direct application of the central limit theorem.

GU AND CHEN

B.2 Proof of Proposition 1

Proof The consistency of the local estimator ˆθk is implied by Lemma 6 of Zhang et al. (2013). Below we show the consistency of the global estimator ˆθfull. Deﬁne the global objective function M( X, θ) = (1/K) PK k=1 M(Xk, θk) and the global expected objective function M(θ) =

M( X, θ) , where X = (XT 1 , XT 2 , ..., XT K)T and Xk is sampled from the disribution Fk. Then, equivalently, we have

θ = argminθ Θ E

M( X; θ) and ˆθfull = argminθ Θ 1 n

where Xi = (XT 1,i, XT 2,i, ..., XT K,i)T . Now we can show that

E θM( Xi, θ ) 2v1 2 R2v1 and E 2 θM( Xi, θ ) 2 θM(θ ) 2v 2 L2v

hold by a direct application of Lemma 7 of Zhang et al. (2013). Besides, for all θ, θ U with U = {θ| θ θ 2 ρ}, we have

2 θM( X, θ) 2 θM( X, θ ) 2

θ θ 2 and E

Besides, we can also show that

2 θM(θ ) ρ Ip1 p1 0 0 ρ

K I(p1+Kp2) (p1+Kp2).

It remains to apply Lemma 6 of Zhang et al. (2013) to obtain the consistency of ˆθfull.

B.3 Proof of Theorem 2

Proof See the proof of Lemma B.4.

B.4 Proof of Theorem 3

Proof Note that

k=1 ˆHk(ˆθk) 1) 1 2 1

k=1 ˆHk(ˆθk) 1(ˆφk φ ) 2. (54)

Since Hk(θ k) 1 ρ2 ρσ Ip1 p1 = ρh Ip1 p1, by Lemma B.3, the event IHK = { ˆHk(ˆθk) 1

Hk(θ k) 1 2 c

2, k = 1, ..., K} implies { 1

K PK k=1 ˆHk(ˆθk) 1} 1 { 1

K PK k=1 Hk(θ k) 1} 1 2 2 c2 1

K PK k=1 ˆHk(ˆθk) 1 1

K PK k=1 Hk(θ k) 1 2. Using Lemma B.3 again with Hk(θ k) c Ip1 p1 as assumed in Assumption 7, the event HK = { ˆHk(ˆθk) Hk(θ k) 2 c

2, k = 1, ..., K} implies ˆHk(ˆθk) 1 Hk(θ k) 1 2 2

c2 ˆHk(ˆθk) Hk(θ k) 2, k = 1, 2, , K. Now for any ϵ > 0, deﬁne ϵH = min{ϵ, c}/4, then under the event

Hϵ K = { ˆHk(ˆθk) Hk(θ k) 2 ϵH, k = 1, ..., K} (55)

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

we have { 1

K PK k=1 ˆHk(ˆθk) 1} 1 { 1

K PK k=1 Hk(θ k) 1} 1 2 ϵ. Now using the boundedness of ˆφWD φ , we have

E ˆφWD φ 2 2

k=1 ˆHk(ˆθk) 1(ˆφk φ ) 2 2I(Hϵ K)

k=1 (ˆφk φ k) 2 2I(ˆφ Φ)

+ C3P((Hϵ K)c).

Thus we only need to separately bound the three terms on the RHS of (56). Let us ﬁrst consider bounding P((Hϵ K)c). Denote

ˆLk(θk) = 2 θk Mn,k(θk), Lk(θk) = 2 θk Mk(θk), ˆVk(θk) = ˆLk(θk) 1 ˆΣS,k(θk)ˆLk(θk) 1 and Vk(θk) = Lk(θk) 1ΣS,k(θk)Lk(θk) 1.

By deﬁnition of ˆHk(θk) and the triangle s inequality, we have

ˆHk(ˆθk) Hk(θ k) 2 ˆVk(ˆθk) ˆVk(θ k) 2 + ˆVk(θ k) Vk(θ k) 2. (57)

Hence, we can bound those two terms on the RHS of (57) separately. Note that

ˆVk(θk) Vk(θk) 2 = ˆLk(θk) 1 ˆΣS,k(θk)ˆLk(θk) 1 Lk(θk) 1ΣS,k(θk)Lk(θk) 1 2 2( ˆLk(θk) 1 Lk(θk) 1 2 2 + Lk(θk) 1 2 2) ˆΣS,k(θk) ΣS,k(θk) 2 (58)

+( ˆLk(θk) 1 Lk(θk) 1 2 + 2 Lk(θk) 1 2) ΣS,k(θk) 2 ˆLk(θk) 1 Lk(θk) 1 2.

Then, under the event Lϵ K = { ˆLk(θ k) Lk(θ k) 2 min{ϵρ2 /2, ρ /2}, k = 1, ..., K} with ρ being the lower bound of the eigenvalues of Lk(θ k) as assumed in Assumption 4, we have

ˆVk(θ k) Vk(θ k) 2 (59)

ρ2 ) ˆΣS,k(θ k) ΣS,k(θ k) 2 + (ϵ + 2

ρ2 ˆLk(θ k) Lk(θ k) 2, k = 1, ..., K.

Similar to (58), we have

ˆVk(ˆθk) ˆVk(θ k) 2 2( ˆLk(ˆθk) 1 ˆLk(θ k) 1 2 2 + ˆLk(θ k) 1 2 2) ˆΣS,k(ˆθk) ˆΣS,k(θ k) 2 +( ˆLk(ˆθk) 1 ˆLk(θ k) 1 2 + 2 ˆLk(θ k) 1 2) ˆΣS,k(θ k) 2 ˆLk(ˆθk) 1 ˆLk(θ k) 1 2.

Deﬁne an event

MK = ˆLk(θ k) 1 2 2

ρ , ˆΣS,k(θ k) 2 2ρσ,

ˆLk(ˆθk) ˆLk(θ k) 2 min{ρ

4 , ϵρ2 8 }, k = 1, ..., K .

GU AND CHEN

Under this event, we have for all k = 1, 2, , K,

ˆVk(ˆθk) ˆVk(θ k) 2

ρ2 ) ˆΣS,k(ˆθk) ˆΣS,k(θ k) 2 + (ϵ + 4

ρ2 ˆLk(ˆθk) ˆLk(θ k) 2

(C1Bn,k + C2Gn,k) ˆθk θ k 2, (60)

where Bn,k = (1/n) Pn i=1 B(Xk,i) and Gn,k = (1/n) Pn i=1 G(Xk,i). Note that ˆLk(θ k) 1 2 2 ρ and ˆΣS,k(θ k) 2ρσ are implied by ˆLk(θ k) Lk(θ k) 2 ρ

2 and ˆΣS,k(θ k) ΣS,k(θ k) 2 ρσ

2 , respectively. Thus, we deﬁne the event

UK = Bn,k 2B, Gn,k 2G, ˆLk(θ k) Lk(θ k) 2 C1,

ˆΣS,k(θ k) ΣS,k(θ k) 2 C2, k = 1, ..., K ,

which satisﬁes UK MK Lϵ K. Under UK, we have ˆHk(ˆθk) Hk(θ k) 2 C ˆθk θ k 2 + ϵH

2 . Furthermore, we deﬁne the following event

AK = UK ( K k=1Ek) ({ ˆθk θ k 2 ϵH

2C , k = 1, ..., K}). (61)

By Lemma 6 in Zhang et al. (2013), under the event K k=1Ek, the event { ˆθk θ k 2 ϵH/(2C), k = 1, ..., K} is implied by the event { θk Mn,k(θ k) 2 (1 ρ)ρ ϵH/(4C), k = 1, ...K}. Thus, the event Ak can be equivalently expressed as

AK = Bn,k 2B, Gn,k 2G, ˆLk(θ k) Lk(θ k) 2 C1,

ˆΣS,k(θ k) ΣS,k(θ k) 2 C2, θk Mn,k(θ k) 2 C3, k = 1, ..., K .

Now with the union bound and Lemma 7 in Zhang et al. (2013), we can obtain that

P(Bn,k > 2B) + P(Gn,k > 2G) + P( ˆLk(θ k) Lk(θ k) 2 > C1)

+P( ˆΣS,k(θ k) ΣS,k(θ k) 2 > C2) + P( θk Mn,k(θ k) 2 > C3)

where v = min{v, v1

Next we consider bounding E 1

K PK k=1 ˆHk(ˆθk) 1(ˆφk φ ) 2 2I(Hϵ K) in (56). Recall the

deﬁnition of Hϵ K in (55), we can naturally decompose the event into Hϵ K = K k=1Hϵ,(k) K , where

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

Hϵ,(k) K = { ˆHk(ˆθk) Hk(θ k) 2 ϵH}. It is noted that for each k, under the event Hϵ,(k) K , we have

ˆHk(ˆθk) 1 2 2

c2 ˆHk(ˆθk) Hk(θ k) 2 + Hk(θ k) 1 2 C.

Since elements of { ˆHk(ˆθk) 1(ˆφk φ )I(Hϵ,(k) K )}K k=1 are independent with one another, we decompose the term as follows:

k=1 ˆHk(ˆθk) 1(ˆφk φ ) 2 2I(Hϵ K)

K E ˆφk φ 2 2 + E ˆHk(ˆθk) 1(ˆφk φ )I(Hϵ,(k) K ) 2 2

The ﬁrst term in the RHS of (63) can be bounded using Lemma B.2. For the second term, we have

E ˆHk(ˆθk) 1(ˆφk φ )I(Hϵ,(k) K ) 2 2

2 E ( ˆHk(ˆθk) 1 Hk(θ k) 1)(ˆφk φ )I(Hϵ,(k) K ) 2 2 + 2 E Hk(θ k) 1(ˆφk φ )I(Hϵ,(k) K ) 2 2

C1E ˆHk(ˆθk) 1 Hk(θ k) 1 2 2I(Hϵ,(k) K ) ˆφk φ 2 2 + C E (ˆθk θ k)I(Hϵ,(k) K ) 2 2.

Using (57), (59) and (60), and decomposing AK into K k=1A(k) K , we can show by H older s inequality that

E ˆHk(ˆθk) 1 Hk(θ k) 1 2 2I(Hϵ,(k) K ) ˆφk φ 2 2

C1E ˆΣS,k(θ k) ΣS,k(θ k) 4 2 + ˆLk(θ k) Lk(θ k) 4 2 + θk Mn,k(θ k) 4 2 + C2P (A(k) K )c .

C R8 + R4 + L4

n2 + P (A(k) K )c .

Besides, we have that

E (ˆφk φ )I(Hϵ,(k) K ) 2 2 2 E(ˆφk φ ) 2 2 + 2 E (ˆφk φ )(1 I(Hϵ,(k) K )) 2 2.

The second term can be bounded by also conditioning on the event A(k) K , using the inequality (35) and the H older s inequality, and we will obtain

E (ˆφk φ )(1 I(Hϵ,(k) K )) 2 2

C E θk Mn,k(θ k) 2 2I((Hϵ,(k) K )c) + P (A(k) K )c

P (A(k) K )c + P (A(k) K )c !

n2 + P (A(k) K )c .

GU AND CHEN

And for the ﬁrst term in the RHS, the following holds from the proof of Theorem 1 of Zhang et al. (2013):

E ˆφk φ 2 2 C L2 + R2G2

In summary, we have that

k=1 ˆHk(ˆθk) 1(ˆφk φ ) 2 2I(Hϵ K)

C 1 + L2 + R2

n K + (L2 + L4) + R2(R2 + R6 + G2)

n3 + P (A(k) K )c . (65)

It remains to bound E 1

K PK k=1(ˆφk φ k) 2 2I(ˆφ Φ) . First we have that

k=1 (ˆφk φ k) 2 2I(ˆφ Φ)

k=1 E ˆφk φ k 2 2I(ˆφ Φ) .

Then, it is noted by (54) that under the event Hϵ K, the event {ˆφ Φ} is equivalent to the event { 1

K PK k=1 ˆHk(ˆθk) 1(ˆφk φ ) 2 > C}, namely we have that

k=1 ˆHk(ˆθk) 1(ˆφk φ ) 2I(Hϵ K) > C} (Hϵ K)c,

then we can repeat the conditioning on A(k) K argument to obtain

E ˆφk φ k 2 2I(ˆφ Φ) C R4

n2 + P(ˆφ Φ) + P (A(k) K )c .

Finally, gathering all the above results, we obtain that

E ˆφWD φ 2 2

n K + C2 (L2 + L4) + R2(R2 + R6 + G2)

n3 + C4K 1 + L2v

The proof is now complete.

B.5 Proof of Theorem 4

Proof With the results in Lemma B.4 and Lemma B.5, the proof follows from a direct application of the Slutsky s lemma.

B.6 Proof of Theorem 5

Proof Recall the deﬁnition of the event Hϵ K deﬁned in (55) in the proof of Theorem 3, we can similarly deﬁne Hϵ K,j to control the estimation error of { ˆHk,j(ˆθk,j)}K k=1, j = 1, 2. Since

E ˆφd WD φ 2 2 1

j=1 E ˆφd WD j φ 2 2 = E ˆφd WD 1 φ 2 2 ,

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

it sufﬁces to bound the last term. Under the event Hϵ K,1 and using boundedness of ˆφd WD 1 φ 2,

E ˆφd WD 1 φ 2 2

k=1 ˆHk,1(ˆθk,1) 1(ˆφbc k,2 φ ) 2 2I(Hϵ K,1)

k=1 (ˆφbc k,2 φ ) 2 2I(ˆφ1 Φ)

+C3P (Hϵ K,1)c

:= C1E (R1) + C2E (R2) + C3P (Hϵ K,1)c , (66)

which is similar to (56). We have derived the upper bound for in C3P (Hϵ K,1)c in (62), and it remains to control R1 and R2. Considering R1, we have that

ˆ Hk,1(ˆθk,1)(ˆθk,2 θ k 1 n/2Bk(θ k)) 2 2I(Hϵ K,1)

k=1 ˆBk,2(ˆθk,2)1Ek,bc,2 Bk(θ k) 2 2I(ˆθbc k,2 Θk),

where ˆ Hk,1(ˆθk,1) = ˆHk,1(ˆθk,1) 1 0 . Due to Lemma B.7 , it remains to control the R(1) 1 term. We can decompose the event Hϵ K,1 as Hϵ K,1 = K k=1H(k),ϵ K,1 where Hϵ,(k) K,1 = { ˆHk,1(ˆθk,1) Hk(θ k) 2 ϵH}. Then, we have

ˆ Hk,1(ˆθk,1)I(H(k),ϵ K,1 )(ˆθk,2 θ k 1 n/2Bk(θ k)) 2 2.

Since { ˆ Hk,1(ˆθk,1)I(H(k),ϵ K,1 )}K k=1 are independent of {ˆθk,2}K k=1 and bounded, E R(1) 1 has a similar upper bound as that of Lemma B.8. The E (R2) term is of higher-order and its upper bound can be easily derived. Collecting all the above results, we have the following upper bound

E ˆφd WD φ 2 2

n K + C2 R2(L2 + R2)

n2K + C3 G2R2(G2 + R2 + G4) + A2R6

+C5K 1 + L2v

where CB is the constant term over n appeared in (46). This completes the proof.

B.7 Proof of Theorem 6

Proof Similar to Lemma B.5, we ﬁrst prove that the following term is of op(N 1/2):

s=1 ˆHs,1(ˆθs,1) 1} 1 K X

k=1 ˆHk,1(ˆθk,1) 1(ˆφbc k,2 φ ) {

s=1 Hs(θ s) 1} 1 K X

k=1 Hk(θ k) 1(ˆφbc k,2 φ )

GU AND CHEN

for K = o(n2). Note

N RH 2I(Hϵ K,1)

s=1 ˆHs,1(ˆθs,1) 1} 1 { 1

s=1 Hs(θ s) 1} 1 2I(Hϵ K,1) 1

k=1 ˆHk,1(ˆθk,1) 1(ˆφbc k,2 φ ) 2

k=1 ( ˆHk,1(ˆθk,1) 1 Hk(θ k) 1)(ˆφbc k,2 φ ) 2I(Hϵ K,1)

:= RH,1 + CRH,2

For K = o(n2), we have shown in the proof of Theorem 5 that

I(Hϵ K,1) 1

k=1 ˆHk,1(ˆθk,1) 1(ˆφbc k,2 φ ) 2 = Op( 1

Besides, we also have that

s=1 ˆHs,1(ˆθs,1) 1} 1 { 1

s=1 Hs(θ s) 1} 1 2I(Hϵ K,1) = Op( 1 n).

Combining those two results we prove that RH,1 = Op( 1 n) = op(1). Considering RH,2, we have

k=1 ( ˆ Hk,1(ˆθk,1) Hk,1(θ k))I(H(k),ϵ K,1 )(ˆθbc k,2 θ k) 2 := R(1) H,2.

From previous derivations, we know that when K = o(n2) the leading order term in R(1) H,2 is

k=1 ( ˆ Hk,1(ˆθk,1) Hk,1(θ k))I(H(k),ϵ K,1 ) 1

i=1 d(2) i,k (θ k) 2,

where di,k(θ k)(j) = Qk(θ k) θk M(X(j) k,i ; θ k). So we only need to show that R(2) H,2 = op(1). Denote

m = n/2, then using the independence between ˆ Hk,1(ˆθk,1) and d(2) i,k (θ k), we have that

E(R(2) H,2)2 N

k=1 E ( ˆ Hk,1(ˆθk,1) Hk,1(θ k))I(H(k),ϵ K,1 ) 2 2 E

i=1 d(2) i,k (θ k) 2 2

Then by Markov s inequality it s direct to show R(2) H,2 = op(1). Now with Slutsky s lemma, it remains to establish the asymptotic normality of

s=1 Hs(θ s) 1} 1 K X

k=1 Hk(θ k) 1 ˆφbc k,j,

and the proof directly follows from the proof of Lemmas B.7 and B.9.

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

B.8 Proof of Theorem 7 and Corollary 8

Proof The proof can be easily derived based on previous derivations.

B.9 Proof of Theorem 9

Proof To apply Theorem 1 in Yuan and Jennrich (2000), we need to check the uniform convergence of 1 n Pn i=1 ψλ φ(Xk,i, θk)T ψλ λ(Xk,i; θk)T T . This is actually the last p2 columns of 2 θk Mn,k(θk) for θk Uk, so we only need to show the uniform convergence of 2 θk Mn,k(θk) in Uk. By Assumption 5, 2 θk M(x; θk) is Lipschitz continuous w.r.t. θk for θk Uk, then we can directly apply Corollary 3.1 of Newey (1991) to establish the required uniform convergence. Now we are to show ˆλ(2) k P λ k. Following the proof of Lemma 6 in Zhang et al. (2013), we can ﬁrst show that under the event Ek, Mn,k(θk) is (1 ρ)ρ -strongly convex on the ball Uk = { θk θ k 2 ρk}, where ρk min{ρρ

4G , ρ}. Deﬁne the event EWD,k = { ˆφWD φ 2 < ρk/2}, then under this event, θ k = (ˆφWD, λ k) Uk. For any θ k = (ˆφWD, λk) Θk, if θ k Uk, then under EWD,k, there exists w0 [0, 1] such that θ k,0 = w0θ k + (1 w0) θ k lies on the surface of the ball Uk, and thus θ k,0 θ k 2 = w0 θ k θ k 2 (ρk

2 ). Now under EWD,k we have that

Mn,k(θ k) Mn,k(θ k,0)+ < θk Mn,k(θ k,0), θ k θ k,0 >

Mn,k( θ k)+ < θk Mn,k( θ k), θ k θ k > +1

2(1 ρ)ρ ρ2 k 4 + < θk Mn,k(θ k,0) θk Mn,k( θ k), θ k θ k,0 >

Mn,k( θ k)+ < θk Mn,k( θ k), θ k θ k > +1

2(1 ρ)ρ ρ2 k 4 , (67)

where the ﬁrst inequality holds due to the convexity of Mn,k(θk) on Uk and the second holds due to the strong convexity on Uk. The last inequality holds due to θ k θ k = 1 w0

w0 (θ k,0 θ k). When θ k Uk, with strong convexity Equation (67) still holds with ρ2 k 4 changed to θ k θ k 2 2 = λk λ k 2 2. In any case the following relationship holds under the event EWD,k:

Mn,k(θ k) Mn,k( θ k)+ < θk Mn,k( θ k), θ k θ k > +1

2(1 ρ)ρ min{ρ2 k 4 , λk λ k 2 2}.

Rewriting the inequality we obtain that

min{ λk λ k 2 2, ρ2 k 4 } 2 (1 ρ)ρ

Mn,k(θ k) Mn,k( θ k)+ < θk Mn,k( θ k), θ k θ k >

Mn,k(θ k) Mn,k( θ k) + θk Mn,k( θ k) 2 θ k θ k 2 . (68)

Now if we denote θ k,1 = (ˆφWD, ˆλ(2) k ) and set θ k = κθ k,1 + (1 κ) θ k for any ﬁxed κ [0, 1], we will have

min{κ ˆλ(2) k λ k 2, ρ2 k 4κ ˆλ(2) k λ k 2 } 2(Mn,k(κθ k,1 + (1 κ) θ k) Mn,k( θ k))

κ(1 ρ)ρ ˆλ(2) k λ k 2 +2 θk Mn,k( θ k) 2 (1 ρ)ρ .

GU AND CHEN

By deﬁnition we have Mn,k(θ k,1) Mn,k( θ k) and thus by convexity we have

min{κ ˆλ(2) k λ k 2, ρ2 k 4κ ˆλ(2) k λ k 2 } 2 θk Mn,k( θ k) 2 (1 ρ)ρ .

Deﬁne the event Es,k = { 2 θk Mn,k( θ k) 2 (1 ρ)ρ ρk

2 }, then under this event we have

min{κ ˆλ(2) k λ k 2, ρ2 k 4κ ˆλ(2) k λ k 2 } ρk

If ˆλ(2) k λ k 2 > ρk

2 , we can set κ = ρk 2 ˆλ(2) k λ k 2 , which leads to a contradiction. Thus we have

ˆλ(2) k λ k 2 ρk

2 . Then using Equation (68) we have

ˆλ(2) k λ k 2 < 2 θk Mn,k( θ k) 2 (1 ρ)ρ . (69)

Since ˆφWD is consistent, we have P(EWD,k) 1. Besides, we already know that P(Ek) 1. Due to the form of the event Es,k and inequality (69), it remains to show θk Mn,k( θ k) 2 = o P (1) to establish the consistency of ˆλ(2) k . Note

θk Mn,k( θ k) 2 θk Mn,k( θ k) θk Mn,k(θ k) 2 + θk Mn,k(θ k) 2

and the latter term is of Op( 1

nv1 ). Using the consistency of ˆφWD we can show θk Mn,k( θ k) θk Mn,k(θ k) 2 = op(1). Besides, since ˆφWD is

N-consistent and K , then n(ˆφWD φ ) = op(1) and the asymptotic normality of n 1

n Pn i=1 ψλ(Xk,i; θ k) + Ψφ λ(θ k)(ˆφWD φ ) is implied by the asymptotic normality of n 1

n Pn i=1 ψλ(Xk,i; θ k) and Slutsky s lemma. Now we apply Theorem 1 in Yuan and Jennrich (2000) and the result follows.

Appendix C. Additional numerical results

C.1 Simulation results based on the errors in variables model

In this simulation experiment, we simulated the errors-in-variables Model (6) with the objective function (7) to compare the performance of the full sample, the split-and-conquer and the weighted distributed estimators: ˆφfull, ˆφSa C and ˆφWD. The simulation was carried out by ﬁrst generating IID {Zi,k} from N(µZ, σ2 Z), and then upon given a Zi,k, (Xk,i, Yi,k)T were independently drawn from N (Zi,k, φ + λ k Zi,k)T , σ2I2 2 . We chose φ = 1, K = 2, σ2 = 1 and n1 = n2 = 5 104 = N/2, and λ 1, λ 2, µZ and σ2 Z were those reported in Table 4 under four scenarios. As discussed in Section 3, the relative efﬁciency of ˆφfull to ˆφSa C depends on the ratio σ2(E(Z))2/(Var(Z)E(Z2)) as shown in (8). We designed four scenarios according to the above ratio under λ 1 = λ 2 and E (Z) = 0, respectively, which represented the settings where the full sample estimator ˆφfull would be less (Scenario 1) or more (Scenario 2) efﬁcient than the split-and-conquer estimator as predicted by the ratio but not as efﬁcient as the weighted distributed estimator ˆφWD. Scenario 3 (λ 1 = λ 2, E (Z) = 0) was the case when ˆφfull and ˆφWD would be asymptotically equivalent, and both estimators would be more efﬁcient than ˆφSa C. Scenario 4 was the homogeneous case with λ 1 = λ 2

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

in which all three estimators would have the same asymptotic efﬁciency. For all four scenarios, the ARE column of Table 4 conﬁrmed the relative efﬁciency as predicted by the asymptotic variances in (8), and was well reﬂected in the comparison of the root mean squared errors, as the bias is of smaller order as compared with that of the standard deviation and thus negligible.

Table 4: Average root mean squared error (RMSE) and the standard deviation (SD), multiplied by 102, of the full sample estimator ˆφfull, the Sa C estimator ˆφSa C and the WD estimator ˆφWD under four scenarios for the errors-in-variables model (12) for N = 105, K = 2 and n1 = n2. AREs (asymptotic relative efﬁciency) of ˆφfull to ˆφSa C are calculated from (8)

ˆφfull ˆφSa C ˆφWD

Scenario (λ 1, λ 2) ARE RMSE SD RMSE SD RMSE SD Scenario 1 (0.25,3.25) 0.89 4.55 4.51 4.12 4.09 3.91 3.89 (µZ = 1, σ2 Z = 0.1) (0.5,3.5) 0.93 4.65 4.65 4.35 4.35 4.08 4.08 (0.75,3.75) 0.97 4.52 4.52 4.40 4.38 4.13 4.13 Scenario 2 (0.25,2.25) 1.18 2.95 2.95 3.24 3.24 2.89 2.89 (µZ = 3, σ2 Z = 0.5) (0.75,2.75) 1.28 3.28 3.26 3.65 3.64 3.17 3.16 (1.25,3.25) 1.31 3.71 3.71 4.16 4.07 3.64 3.61 Scenario 3 (0.25,2.25) 1.97 0.41 0.41 0.61 0.61 0.41 0.41 (µZ = 0, σ2 Z = 0.5) (0.75,2.75) 1.92 0.51 0.51 0.70 0.70 0.51 0.51 (1.25,3.25) 1.68 0.64 0.64 0.82 0.82 0.64 0.64 Scenario 4 (0.5,0.5) 1 3.25 3.24 3.31 3.28 3.30 3.26 (µZ = 4, σ2 Z = 0.5) (1.0,1.0) 1 3.53 3.53 3.59 3.59 3.59 3.59 (1.5,1.5) 1 4.06 4.03 4.08 4.07 4.06 4.06

C.2 Simulation results based on the logistic model

Figure 3 reports the absolute bias and root mean squared errors of the estimators when p2 = 4. Table 5 reports the empirical coverage and the average width of the CIs when p2 = 4. Table 6 reports the average CPU time per simulation run based on 500 replications of the ﬁve estimators for a range of K for the logistic regression model with p2 = 4. It is observed in Figure 3 that the bias of the estimators were smaller with p2 = 4 compared to the results with p2 = 10 in Figure 1. As a consequence, the CIs based on the weighted distributed estimator had adequate coverage probabilities even when K = 1000.

C.3 Pre-processing of the real data

The arrival delay of the previous ﬂight that utilized the same plane was obtained by matching the tail number of the plane. The three meteorological factors (rain rate, close surface air pressure, and temperature) were obtained by matching this airline s on-time performance data with the ERA5 hourly data (https://cds.climate.copernicus.eu/). This dataset includes reanalysis from 1959 onwards whose temporal and spatial resolutions are one hour and 0.25 0.25 , respectively. We applied the f(x) = log(1 + x) transformation to the rain variable due to its serious skewness. We also standardized each covariate in each of the data blocks before performing the logistic regression analysis.

GU AND CHEN

(a) Absolute Bias (p2 = 4) (b) RMSE (p2 = 4)

Figure 3: Average simulated bias (a) and the root mean squared errors (RMSE) (b) of the weighted distributed (WD) (red circle), the split-and-conquer(Sa C) (blue triangle), the debiased split-andconquer (d Sa C) (green square), the debiased weighted distributed (d WD) (purple cross), the subsampled average mixture SAVGM (pink square cross) estimators, with respect to the number of data block K for the logistic regression model with the dimension p2 of the nuisance parameter λk being 4, and the full sample size N = 2 106.

Table 5: Coverage probabilities and widths (in parentheses, multiplied by 100) of the 1 α conﬁdence intervals for the common parameter φ in the logistic regression model based on the asymptotic normality of the split-and-conquer (Sa C), the weighted distributed (WD), the debiased splitand-conquer (d Sa C) and the debiased weighted distributed (d WD) estimators with respect to the number of data blocks K. The dimension p2 of the nuisance parameter λk is 4 and total sample size N = 2 106

K Sa C WD d Sa C d WD 1 α 0.99 0.95 0.90 0.99 0.95 0.90 0.99 0.95 0.90 0.99 0.95 0.90 10 0.99 0.96 0.92 0.99 0.97 0.91 0.99 0.96 0.92 0.99 0.96 0.91 (2.45) (1.87) (1.57) (2.03) (1.55) (1.30) (2.45) (1.87) (1.57) (2.03) (1.55) (1.30) 50 0.99 0.95 0.91 0.98 0.93 0.89 0.99 0.95 0.91 0.99 0.93 0.88 (2.36) (1.80) (1.51) (1.97) (1.50) (1.26) (2.36) (1.80) (1.51) (1.97) (1.50) (1.26) 100 0.98 0.94 0.91 0.99 0.95 0.91 0.99 0.95 0.91 0.99 0.95 0.91 (2.36) (1.79) (1.51) (1.96) (1.49) (1.25) (2.36) (1.79) (1.51) (1.96) (1.49) (1.25) 250 0.99 0.93 0.85 0.99 0.95 0.90 0.99 0.96 0.91 0.99 0.95 0.90 (2.36) (1.79) (1.50) (1.96) (1.49) (1.25) (2.36) (1.79) (1.50) (1.96) (1.49) (1.25) 500 0.91 0.77 0.66 0.99 0.95 0.88 0.99 0.96 0.90 0.99 0.95 0.89 (2.36) (1.80) (1.51) (1.96) (1.49) (1.25) (2.36) (1.80) (1.51) (1.96) (1.49) (1.25) 1000 0.65 0.41 0.28 0.99 0.94 0.88 0.99 0.94 0.88 0.99 0.93 0.88 (2.38) (1.81) (1.52) (1.96) (1.49) (1.25) (2.38) (1.81) (1.52) (1.97) (1.50) (1.25) 2000 0.01 0.01 0.00 0.99 0.91 0.81 0.98 0.94 0.88 0.99 0.94 0.90 (2.42) (1.84) (1.55) (1.96) (1.50) (1.25) (2.42) (1.84) (1.55) (1.98) (1.50) (1.26)

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

Table 6: Average CPU time for each replication based on B = 500 replications for the split-andconquer (Sa C), Zhang s SAVGM, the weighted distributed (WD), the debiased split-and-conquer (d Sa C) and the debiased weighted distributed (d WD) estimators for the logistic regression model with respect to K. The dimension p2 of the nuisance parameter λk is 4 and total sample size N = 2 106

K Sa C SAVGM WD d Sa C d WD 10 15.65 15.97 18.50 20.00 21.95 50 9.63 9.95 10.66 12.37 14.59 100 8.09 8.63 8.76 10.50 12.05 250 8.49 9.69 9.07 10.84 12.82 500 9.68 11.58 10.25 11.97 14.84 1000 11.67 13.81 12.32 13.93 19.08 2000 15.78 19.68 16.57 18.11 28.55

We chose the parameter of the three meteorological factors as the common parameter based on Figure 4, which shows that the local estimates of those three parameters are the most concentrated.

Figure 4: Histogram of the parameter estimates across the data blocks

Temperature Air Pressure Rain

Previous Delay Spring Summer Autumn

1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0

1.0 0.5 0.0 0.5 1.0

Estimates (K = 479)

GU AND CHEN

T. Ando. Concavity of certain maps on positive deﬁnite matrices and applications to hadamard products. Linear Algebra and its Applications, 26:203 241, 08 1979. doi: 10.1016/0024-3795(79) 90179-4.

M. Bartlett. Approximate conﬁdence intervals. Biometrika, 40:12 19, 01 1953. doi: 10.2307/ 2333091.

H. Battey, J. Fan, H. Liu, J. Lu, and Z. Zhu. Distributed testing and estimation under sparse high dimensional models. The Annals of Statistics, 46:1352 1382, 06 2018. doi: 10.1214/17-AOS1587.

R. J. Carroll and D. Ruppert. The use and misuse of orthogonal regression in linear errors-invariables models. The American Statistician, 50(1):1 6, 1996. ISSN 00031305. URL http: //www.jstor.org/stable/2685035.

S. X. Chen and L. Peng. Distributed statistical inference for massive data. The Annals of Statistics, 49:2851 2869, 02 2021.

X. Chen and M. Xie. A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica, 24:1655 1684, 10 2014. doi: 10.5705/ss.2013.088.

X. Chen, W. Liu, and Y. Zhang. Quantile regression under memory constraint. The Annals of Statistics, 47:3244 3273, 12 2019. doi: 10.1214/18-AOS1777.

R. Duan, Y. Ning, and Y. Chen. Heterogeneity-aware and communication-efﬁcient distributed statistical inference. Biometrika, 109(1):67 83, 02 2021. ISSN 1464-3510. doi: 10.1093/biomet/ asab007. URL https://doi.org/10.1093/biomet/asab007.

T. Evgeniou and M. Pontil. Regularized multi task learning. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 109 117, 08 2004. doi: 10.1145/ 1014052.1014067.

W. Fuller. Measurement Error Models. Wiley, 01 1987. ISBN 9780471861874. doi: 10.2307/ 1164709.

L. Hansen. Large sample properties generalized method of moments estimators. Econometrica, 50: 1029 1054, 02 1982. doi: 10.2307/1912775.

E. Haynsworth. Applications of an inequality for the schur complement. Proceedings of the American Mathematical Society, 24:512 516, 03 1970. doi: 10.1090/S0002-9939-1970-0255580-7.

M. Henmi and S. Eguchi. A paradox concerning nuisance parameters and projected estimating functions. Biometrika, 91:929 941, 02 2004. doi: 10.1093/biomet/91.4.929.

C. Huang and X. Huo. A distributed one-step estimator. Mathematical Programming, 174:41 76, 11 2019. doi: 10.1007/s10107-019-01369-0.

M. Jordan, J. Lee, and Y. Yang. Communication-efﬁcient distributed statistical learning. Journal of the American Statistical Association, 114:668 681, 05 2019. doi: 10.1080/01621459.2018. 1429274.

DISTRIBUTED STATISTICAL INFERENCE UNDER HETEROGENEITY

P. Kairouz, H. Mc Mahan, B. Avent, A. Bellet, M. Bennis, A. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, R. D Oliveira, H. Eichner, S. El Rouayheb, D. Evans, J. Gardner, Z. Garrett, A. Gasc on, B. Ghazi, P. Gibbons, and S. Zhao. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14:1 210, 01 2021. doi: 10.1561/9781680837896.

A. Kleiner, A. Talwalkar, P. Sarkar, and M. Jordan. A scalable bootstrap for massive data. Journal of the Royal Statistical Society Series B (Statistical Methodology), 76:795 816, 12 2011. doi: 10.1111/rssb.12050.

T. Lai and J. Wang. Edgeworth expansions for symmetric statistics with applications to bootstrap methods. Statistica Sinica, 3:517 542, 01 1993.

H. Li, B. Lindsay, and R. Waterman. Efﬁciency of projected score methods in rectangular array asymptotics. Journal of the Royal Statistical Society Series B, 65:191 208, 02 2003. doi: 10. 1111/1467-9868.00380.

T. Li, A. Sahu, A. Talwalkar, and V. Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37:50 60, 05 2020. doi: 10.1109/MSP.2020. 2975749.

T. Li, S. Hu, A. Beirami, and V. Smith. Ditto: Fair and robust federated learning through personalization. In International Conference on Machine Learning (ICML), pages 6357 6368, 2021. URL http://proceedings.mlr.press/v139/li21h.html.

N. Lin and R. Xi. Fast surrogates of U-statistics. Computational Statistics & Data Analysis, 54: 16 24, 01 2010. doi: 10.1016/j.csda.2009.08.009.

T.-T. Lu and S.-H. Shiou. Inverses of 2 2 block matrices. Computers & Mathematics With Applications - COMPUT MATH APPL, 43:119 129, 01 2002. doi: 10.1016/S0898-1221(01) 00278-4.

O. Marfoq, G. Neglia, A. Bellet, L. Kameni, and R. Vidal. Federated multi-task learning under a mixture of distributions. In Advances in Neural Information Processing Systems (Neur IPS), volume 34, pages 15434 15447, 2021.

P. Mc Cullagh. Quasi-likelihood functions. The Annals of Statistics, 11:59 67, 03 1983. doi: 10.1214/aos/1176346056.

B. Mc Mahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas. Communication-Efﬁcient Learning of Deep Networks from Decentralized Data. International Conference on Artiﬁcial Intelligence and Statistics (AISTATS), 54:1273 1282, 2017.

W. Newey. Uniform convergence in probability and stochastic equicontinuity. Econometrica, 59: 1161 1167, 02 1991. doi: 10.2307/2938179.

O. Reiersol. Identiﬁability of a linear relation between variables which are subject to error. Econometrica, 18:375 389, 10 1950. doi: 10.2307/1907835.

GU AND CHEN

P. Rilstone, V. Srivastava, and A. Ullah. The second-order bias and mean squared error of nonlinear estimators. Journal of Econometrics, 124:369 395, 12 1996. doi: 10.1016/0304-4076(96) 89457-7.

S. Sengupta, S. Volgushev, and X. Shao. A subsampled double bootstrap for massive data. Journal of the American Statistical Association, 111:1222 1232, 08 2015. doi: 10.1080/01621459.2015. 1080709.

V. Smith, C.-K. Chiang, M. Sanjabi, and A. Talwalkar. Federated Multi-Task Learning. Advances in Neural Information Processing Systems(Neur IPS), 05 2017.

L. Stefanski and D. Boos. The Calculus of M-Estimation. The American Statistician, 56:29 38, 02 2002. doi: 10.1198/000313002753631330.

C. T. Dinh, N. Tran, and J. Nguyen. Personalized federated learning with moreau envelopes. In Advances in Neural Information Processing Systems (Neur IPS), volume 33, pages 21394 21405, 2020.

A. van der Vaart. Asymptotic Statistics, chapter 5. Cambridge University Press, 01 1999. doi: 10.1017/CBO9780511802256.

S. Volgushev, S.-K. Chao, and G. Cheng. Distributed inference for quantile regression processes. Annals of Statistics, 47, 01 2017. doi: 10.1214/18-AOS1730.

Q. Yang, Y. Liu, T. Chen, and Y. Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology, 10:1 19, 01 2019. doi: 10.1145/3298981.

A. Yaron, L. Hansen, and J. Heaton. Finite-Sample Properties of Some Alternative GMM Estimators. Journal of Business & Economic Statistics, 14:262 80, 02 1996. doi: 10.1080/07350015. 1996.10524656.

K.-H. Yuan and R. Jennrich. Estimating equations with nuisance parameters: Theory and applications. Annals of the Institute of Statistical Mathematics, 52:343 350, 02 2000. doi: 10.1023/A:1004122007440.

Y. Zhang, J. Duchi, and M. Wainwright. Comunication-efﬁcient algorithms for statistical optimization. Journal of Machine Learning Research, 14:3321 3363, 2013. doi: 10.1109/CDC.2012. 6426691.

T. Zhao, G. Cheng, and H. Liu. A partially linear framework for massive heterogeneous data. The Annals of Statistics, 44:1400 1437, 10 2014. doi: 10.1214/15-AOS1410.