Journal of Machine Learning Research 25 (2024) 1-41. Submitted 6/22; Revised 9/23; Published 3/24.

Distributed Estimation on Semi-Supervised Generalized Linear Model

Jiyuan Tu (tujy.19@gmail.com), School of Mathematics, Shanghai University of Finance and Economics, Shanghai, 200433, China

Weidong Liu (weidongl@sjtu.edu.cn), School of Mathematical Sciences, MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, 200240, China

Xiaojun Mao (maoxj@sjtu.edu.cn), School of Mathematical Sciences, Ministry of Education Key Laboratory of Scientific and Engineering Computing, Shanghai Jiao Tong University, Shanghai, 200240, China

Editor: Sanmi Koyejo

Abstract

Semi-supervised learning is devoted to using unlabeled data to improve the performance of machine learning algorithms. In this paper, we study the semi-supervised generalized linear model (GLM) in the distributed setup. In the cases of single or multiple machines containing unlabeled data, we propose two distributed semi-supervised algorithms based on the distributed approximate Newton method. When the labeled local sample size is small, our algorithms still give a consistent estimate, while fully supervised methods fail to converge. Moreover, we theoretically prove that the convergence rate is greatly improved when sufficient unlabeled data are available. Therefore, the proposed method requires far fewer rounds of communication to achieve the optimal rate than its fully supervised counterpart. In the case of the linear model, we prove a lower bound on the rate after one round of communication, which shows that the rate improvement is essential. Finally, several simulation analyses and real data studies are provided to demonstrate the effectiveness of our method.

Keywords: Distributed Learning, Semi-supervised Learning, Generalized Linear Model

1. Introduction

Distributed machine learning has attracted much attention in statistics in recent years.
A variety of distributed algorithms have been proposed for statistical inference problems, many interesting properties have been investigated, and the scope of distributed machine learning has expanded remarkably. As one of the most fundamental ideas in distributed learning, the divide-and-conquer (DC) method often delivers impressive performance in statistical models. For quantile regression (Volgushev et al., 2019; Chen et al., 2019), kernel ridge regression (Zhang et al., 2015; Lin et al., 2017) and kernel density estimation (Li et al., 2013), the optimal rates of DC depend on the number of machines and the scale of samples. For high-dimensional linear regression and the support vector machine, Lee et al. (2017); Lian and Fan (2018); Battey et al. (2018); Zhao et al. (2020) proved that the accuracy of DC can be remarkably improved by a well-known debiasing technique in statistics. For iterative algorithms like the Distributed Approximate NEwton (DANE) method (Shamir et al., 2014), Jordan et al. (2019); Jianqing Fan and Wang (2023) proved that with an appropriate initial estimator, DANE and its variants can achieve the optimal statistical rate within a constant number of communication rounds. There are also many other distributed learning algorithms, such as the alternating direction method of multipliers (Boyd et al., 2011), the subsampled average mixture algorithm (Zhang et al., 2013), local stochastic gradient descent (Stich, 2019; Yu et al., 2019), the distributed dual coordinate ascent method (Jaggi et al., 2014; Smith et al., 2017), the lazily aggregated gradient method (Chen et al., 2018), and so on. In this article, we focus on distributed semi-supervised estimation for the generalized linear model (GLM).

Weidong Liu and Xiaojun Mao are the co-corresponding authors.

©2024 Jiyuan Tu, Weidong Liu, and Xiaojun Mao. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/22-0670.html.
There are various practical problems, especially in the era of big data, that involve the analysis of massive unlabeled datasets (e.g., electronic medical records, textual data). Semi-supervised learning (SSL) has demonstrated considerable success in machine learning problems in terms of prediction precision. In non-distributed settings, a growing body of literature demonstrates that SSL can be of great use in classification (see, e.g., Ando and Zhang (2005, 2007); Blum and Mitchell (1998); Vapnik (1999); Wang and Shen (2007); Wang et al. (2007, 2009); Zhu (2005); Zhu and Goldberg (2009)). Semi-supervised estimation problems with continuous labels have also been studied by Wasserman and Lafferty (2007); Johnson and Zhang (2008); Chakrabortty and Cai (2018); Zhang et al. (2019); Cai and Guo (2020); Zhang and Bradic (2021); Hou et al. (2023); Azriel et al. (2022). In recent years, distributed SSL has emerged as an exciting new area of research in machine learning. A novel study of kernel ridge regression by Chang et al. (2017) has shown that the semi-supervised DC method benefits significantly from the unlabeled data and allows a more flexible number of machines than the ordinary DC. Works along this line include distributed SSL for kernel-based gradient descent algorithms (Lin and Zhou, 2018), bias-corrected kernel ridge regression (Guo et al., 2017), learning with multi-penalty regularization (Guo et al., 2019) and flexible Gaussian kernels (Hu and Zhou, 2021). A common feature of these prior works on distributed SSL is that they all focus on one-shot algorithms. It is well known that iterative algorithms can lead to more accurate estimates than one-shot algorithms. In this paper, we study distributed SSL for generalized linear models using the approximate Newton iteration method.
Within this paper, we concentrate specifically on the scenario where data are distributed across several computing units, some of which possess supplementary unlabeled data. This setup aligns with numerous practical situations; for instance, in the medical field, multiple hospitals collaborate to enhance the efficacy of the model through sharing medical data, which typically encompass ample unlabeled covariates (Hou et al., 2023; Liu et al., 2022; Cai et al., 2022). Throughout this paper, we direct our focus towards the generalized linear model (GLM), a widely utilized statistical model that has been implemented across diverse fields such as healthcare (Parikh et al., 2019; Guo et al., 2022), genetic analysis (Ma et al., 2022), and economics (Theodossiou, 1998). One of the reasons we choose to consider GLM is not only its broad range of applications but also the unique form of its loss function, which will be delved into in more detail in Section 2.1. In the following, we introduce the generalized linear model. Given a function $\psi:\mathbb{R}\to\mathbb{R}$, the observations $(X, Y)\in\mathbb{R}^{p+1}$ are generated according to the following conditional probability function
$$\mathbb{P}(Y \mid X) = e^{c}\exp\Big\{\frac{Y X^{\top}\beta^{*} - \psi(X^{\top}\beta^{*})}{c(\sigma)}\Big\}, \qquad (1)$$
where $e^{c}$ and $c(\sigma)$ are scale constants, and $\beta^{*}$ is the true model parameter. Here $\psi'$, the derivative of $\psi$, is called the canonical link function of the generalized linear model. To estimate $\beta^{*}$, the most straightforward estimator is the minimizer of the (empirical) negative log-likelihood, whose per-observation loss is
$$f(X, Y, \beta) = -Y X^{\top}\beta + \psi(X^{\top}\beta). \qquad (2)$$
In SSL, a large unlabeled dataset only contains information on the predictors $X$. It is thus of great interest to develop a distributed SSL algorithm using both the labeled and unlabeled datasets. However, as we will prove, naively incorporating the unlabeled dataset into DANE with the loss function (2) leads to a degradation of the estimation precision.
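To make the loss (2) concrete, the short sketch below evaluates the empirical negative log-likelihood, using the logistic cumulant $\psi(t) = \log(1 + e^t)$ as one instance of $\psi$; the function names are ours, for illustration only.

```python
import numpy as np

def psi_logistic(t):
    # cumulant function psi(t) = log(1 + e^t) for logistic regression,
    # computed stably via logaddexp for large |t|
    return np.logaddexp(0.0, t)

def glm_loss(X, Y, beta, psi=psi_logistic):
    # empirical negative log-likelihood from (2): mean of -Y_i X_i^T beta + psi(X_i^T beta)
    eta = X @ beta
    return np.mean(-Y * eta + psi(eta))

# toy check: at beta = 0 every linear predictor is 0 and psi(0) = log 2
X = np.ones((4, 2)); Y = np.array([0., 1., 1., 0.]); beta = np.zeros(2)
print(np.isclose(glm_loss(X, Y, beta), np.log(2.0)))  # True
```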
The primary reason is that the unlabeled predictors do not carry information about $\beta^{*}$ in the absence of the corresponding responses, and may thus introduce higher uncertainty. On the other hand, in GLM the predictors are directly related to the Hessian matrix of the loss (2). In other words, the unlabeled dataset is particularly useful for estimating the Hessian matrix and reducing the number of iterations of iterative algorithms. In this paper, we propose a novel loss function (called the semi-supervised surrogate loss function) based on a variance-correction technique to avoid this higher uncertainty. The proposed surrogate loss has a more concentrated Hessian matrix, and the resulting estimator is as efficient as the supervised estimator. Further, we combine it with the Newton iteration method and propose a Semi-Supervised Distributed Approximate NEwton estimator (SSDANE) for GLM. Compared with the supervised DANE, our SSDANE has the following advantages. First, SSDANE converges faster to the optimal neighborhood of $\beta^{*}$. Second, SSDANE requires fewer rounds of communication between the master machine and the local machines. This is particularly important when communication cost becomes a bottleneck in distributed computing. Third, SSDANE allows a higher dimension for $\beta^{*}$: the supervised DANE typically requires the dimension to be smaller than the labeled local sample size, whereas SSDANE can still attain a nearly optimal rate when the dimension exceeds the labeled local sample size. Finally, extensive numerical experiments are performed to demonstrate the better performance of SSDANE. Our experimental results suggest that the performance of the semi-supervised algorithm is consistently superior to that of the supervised algorithm, and its performance improves as the amount of unlabeled data increases. Based on our findings, we recommend the use of our SSDANE over DANE, particularly when a larger amount of unlabeled data is available. The rest of the paper is organized as follows.
Section 2 introduces the semi-supervised surrogate loss function and SSDANE. In Section 3, we present theoretical results for the proposed methods. Numerical experiments are provided in Section 4. Finally, we conclude our work in Section 5. The proofs of the theoretical results are relegated to the Appendix. Given a vector $v = (v_1, \dots, v_p)^{\top}$, we denote $|v|_1 = \sum_{l=1}^{p}|v_l|$, $|v|_2 = \sqrt{\sum_{l=1}^{p} v_l^2}$ and $|v|_{\infty} = \sup_{1\le l\le p}|v_l|$. For a matrix $A\in\mathbb{R}^{n\times p}$, define $\|A\|$ as the spectral norm, and $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ as the largest and smallest eigenvalues of $A$, respectively.

2. Methodology

In this section, we first introduce the semi-supervised surrogate loss in a single-machine setting. Then we apply it in a distributed setup and develop two distributed semi-supervised algorithms.

2.1 Semi-Supervised Surrogate Loss

To motivate the construction of the semi-supervised surrogate loss, we shall consider the single-machine setting. Denote $D$ as the index set of the labeled pairs $\{(X_i, Y_i)\}$, $D^{*}$ as the index set of the unlabeled covariates $\{X_i\}$, and $H = D \cup D^{*}$. We let $|D| = n$ and $|H| = n^{*}$. For the generalized linear model (1) with link function $\psi'(\cdot)$, recall that the empirical loss function based on the labeled data $D$ is
$$L(\beta) = -\frac{1}{n}\sum_{i\in D} Y_i X_i^{\top}\beta + \frac{1}{n}\sum_{i\in D} \psi(X_i^{\top}\beta). \qquad (3)$$
With this target function, one may estimate the true parameter by the optimization
$$\widehat{\beta}_D = \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\{L(\beta)\}. \qquad (4)$$
It is well known that the computation time (the number of iterations) of a first/second-order method for solving this optimization depends on the condition number of the Hessian matrix
$$\widehat{H}_D(\beta) = \frac{1}{n}\sum_{i\in D} X_i X_i^{\top}\,\psi''(X_i^{\top}\beta).$$
When the sample size $|D|$ is small, or the dimension $p$ is large, the condition number can be large. For example, when $p$ is of the same order as $n$, the condition number of the sample covariance matrix can go to infinity (see Bickel and Levina (2008)). In order to get a more stable and concentrated Hessian matrix, a direct way is to make use of the unlabeled data, noting that the Hessian matrix only depends on the covariates.
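The condition-number issue described above is easy to see numerically. The following sketch (our own illustration, with standard normal covariates so the Hessian reduces to a sample covariance matrix as in the linear model) compares a Hessian built from a sample of size comparable to $p$ with one built from a much larger covariate pool.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 80
conds = {}
# The sample Hessian from n ~ p points is badly conditioned, while pooling
# many more covariates (as the unlabeled data allow) gives a stable one.
for n in (100, 2000):
    X = rng.standard_normal((n, p))
    eig = np.linalg.eigvalsh(X.T @ X / n)   # eigenvalues in ascending order
    conds[n] = eig[-1] / eig[0]             # condition number
print(conds[100] > conds[2000])  # True
```

With $n = 100$ the ratio is in the hundreds, while with $n = 2000$ it is close to 1, which is exactly why a larger covariate pool speeds up Newton-type iterations.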
That is, one may consider the following loss
$$L^{ss}(\beta) = -\frac{1}{n}\sum_{i\in D} Y_i X_i^{\top}\beta + \frac{1}{n^{*}}\sum_{i\in H} \psi(X_i^{\top}\beta), \qquad (5)$$
where the unlabeled data are used, leading to a more stable Hessian matrix once $n^{*}$ is much larger than $n$:
$$\widehat{H}^{ss}_H(\beta) = \frac{1}{n^{*}}\sum_{i\in H} X_i X_i^{\top}\,\psi''(X_i^{\top}\beta).$$
However, due to the imbalance between the sample sizes of $D$ and $H$, the asymptotic variance of the estimator $\widehat{\beta}_H = \operatorname{argmin}_{\beta\in\mathbb{R}^p}\{L^{ss}(\beta)\}$ increases. This results in a great loss of statistical efficiency in estimating $\beta^{*}$. In fact, the covariance of the gradient at $\beta^{*}$,
$$\nabla L^{ss}(\beta^{*}) = -\frac{1}{n}\sum_{i\in D} Y_i X_i + \frac{1}{n^{*}}\sum_{i\in H} X_i\,\psi'(X_i^{\top}\beta^{*}),$$
satisfies
$$\mathrm{Cov}\big(\nabla L^{ss}(\beta^{*})\big) = \frac{1}{n}\,\mathrm{Cov}\big({-YX} + X\psi'(X^{\top}\beta^{*})\big) + \mathrm{Cov}\Big(\frac{1}{n^{*}}\sum_{i\in H} X_i\psi'(X_i^{\top}\beta^{*}) - \frac{1}{n}\sum_{i\in D} X_i\psi'(X_i^{\top}\beta^{*})\Big)$$
$$= \frac{1}{n}\,\mathrm{Cov}\big({-YX} + X\psi'(X^{\top}\beta^{*})\big) + \frac{n^{*}-n}{n n^{*}}\,\mathrm{Cov}\big(X\psi'(X^{\top}\beta^{*})\big), \qquad (6)$$
where the first matrix on the right-hand side is the covariance matrix of the gradient of $L(\beta)$ in (3). The imbalance between $D$ and $H$ leads to the second, extra covariance matrix, which has the same order as the first one. In order to eliminate the second term of (6) while maintaining the stability of the Hessian matrix, we consider the following two-step method. Suppose that we have an initial estimator $\widehat{\beta}^{(0)}$ for $\beta^{*}$. Then we can subtract
$$\nabla L^{ss}(\widehat{\beta}^{(0)}) - \nabla L(\widehat{\beta}^{(0)}) = \frac{1}{n^{*}}\sum_{i\in H} X_i\,\psi'(X_i^{\top}\widehat{\beta}^{(0)}) - \frac{1}{n}\sum_{i\in D} X_i\,\psi'(X_i^{\top}\widehat{\beta}^{(0)})$$
to remove the extra term in the gradient of $L^{ss}(\beta)$. That is, we consider the following corrected estimating equation:
$$\nabla L^{ss}(\beta) - \big\{\nabla L^{ss}(\widehat{\beta}^{(0)}) - \nabla L(\widehat{\beta}^{(0)})\big\} = 0. \qquad (7)$$
It is easy to see that if the initial value $\widehat{\beta}^{(0)}$ is close to $\beta^{*}$, then the covariance of the above equation (at the value $\beta^{*}$) is asymptotically equivalent to that of $\nabla L(\beta)$. Therefore, the estimator satisfying equation (7) will have almost the same statistical efficiency as that of $L(\beta)$. On the other hand, the Hessian matrix of (7) is $\widehat{H}^{ss}_H(\beta)$, which is more stable than $\widehat{H}_D(\beta)$. The estimating equation (7) corresponds to the following loss function:
$$\widetilde{L}(\beta) = L^{ss}(\beta) - \big\langle \nabla L^{ss}(\widehat{\beta}^{(0)}) - \nabla L(\widehat{\beta}^{(0)}),\ \beta \big\rangle. \qquad (8)$$
We call $\widetilde{L}(\beta)$ the semi-supervised surrogate loss.
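The correction in (7)-(8) involves only gradients, so it can be sketched in a few lines. The sketch below (illustrative helper names of our own, with the logistic link as one concrete choice of $\psi'$) computes the corrected gradient; by construction it coincides with the labeled-only gradient at $\beta = \widehat{\beta}^{(0)}$, while its Jacobian in $\beta$ remains the stable semi-supervised Hessian.

```python
import numpy as np

def sigmoid(t):
    # psi'(t) for logistic regression, used here as a concrete link
    return 1.0 / (1.0 + np.exp(-t))

def grad_sup(X_D, Y, beta, dpsi=sigmoid):
    # gradient of the labeled-only loss L(beta) in (3)
    n = len(Y)
    return (-X_D.T @ Y + X_D.T @ dpsi(X_D @ beta)) / n

def grad_ss(X_D, Y, X_H, beta, dpsi=sigmoid):
    # gradient of the semi-supervised loss L^ss(beta) in (5):
    # the psi' part is averaged over the full covariate set H
    return -X_D.T @ Y / len(Y) + X_H.T @ dpsi(X_H @ beta) / len(X_H)

def grad_surrogate(X_D, Y, X_H, beta, beta0, dpsi=sigmoid):
    # gradient of the semi-supervised surrogate loss (8): subtracting the
    # correction grad L^ss(beta0) - grad L(beta0) removes the extra
    # covariance term in (6)
    corr = grad_ss(X_D, Y, X_H, beta0, dpsi) - grad_sup(X_D, Y, beta0, dpsi)
    return grad_ss(X_D, Y, X_H, beta, dpsi) - corr
```

Evaluating `grad_surrogate` at `beta = beta0` returns exactly `grad_sup(X_D, Y, beta0)`, which is the algebraic reason the surrogate keeps the supervised statistical efficiency.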
This surrogate loss can be easily extended to the distributed setting. We will show rigorously that the number of iterations, and hence the communication cost, is much smaller than for the algorithm with only labeled data.

2.2 Distributed Semi-Supervised Learning

In this section, we consider semi-supervised learning in the distributed setup. More specifically, assume we have $N = mn$ i.i.d. pairs of observations $\{(X_i, Y_i)\}$ from the generalized linear model (1), evenly stored on $m$ different machines $\{H_1, \dots, H_m\}$. Let $H_1$ be the master machine, which is in charge of data update and transmission. We discuss separately the cases where the unlabeled data are stored on a single machine and on multiple machines. In both cases, we develop algorithms that achieve a faster convergence rate than their fully supervised counterparts.

2.2.1 Unlabeled Data on Master Machine

We first assume the sample size of the additional $n^{*} - n$ unlabeled covariates $X_i$ is not too large, so that they can be stored on a single machine (i.e., $H_1$, the master machine). We denote by $D_j$ the indices of the labeled observations $\{(X_i, Y_i)\}$, and by $D^{*}_1$ the unlabeled covariates $\{X_i\}$ on $H_1$. Then we have $|D_j| = n$, $|D^{*}_1| = n^{*} - n$ and $|H_1| = n^{*}$. To conduct distributed learning with the assistance of the unlabeled data, we use the semi-supervised surrogate loss in (8) and replace the local gradient $\nabla L(\widehat{\beta}^{(0)})$ by the global gradient $\frac{1}{m}\sum_{j=1}^{m}\nabla L_j(\widehat{\beta}^{(0)})$, where
$$L_j(\beta) = -\frac{1}{n}\sum_{i\in D_j} Y_i X_i^{\top}\beta + \frac{1}{n}\sum_{i\in D_j} \psi(X_i^{\top}\beta) \qquad (9)$$
is the local loss on the $j$-th machine. After the master machine receives all the gradients from the workers, it solves the following optimization and updates the initial value $\widehat{\beta}^{(0)}$ to $\widehat{\beta}^{(1)}$:
$$\widehat{\beta}^{(1)} = \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\widetilde{L}^{(1)}(\beta) = \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\Big\{ L^{ss}_1(\beta) - \Big\langle \nabla L^{ss}_1(\widehat{\beta}^{(0)}) - \frac{1}{m}\sum_{j=1}^{m}\nabla L_j(\widehat{\beta}^{(0)}),\ \beta\Big\rangle \Big\}. \qquad (10)$$
We call this the one-step Semi-Supervised Distributed Approximate NEwton (SSDANE) estimator, named with respect to the DANE method proposed by Shamir et al.
(2014).¹ The one-step SSDANE can be directly extended to the multi-round SSDANE stated in Algorithm 1.

2.2.2 Unlabeled Data on Multiple Machines

In this section, we consider the case where the unlabeled data are so numerous that they are stored separately on multiple machines. To be more specific, we denote by $U$ the set of machines having an unlabeled dataset, and let $D_j$ and $D^{*}_j$ be the index sets of the labeled observations $\{(X_i, Y_i)\}$ and the unlabeled covariates $\{X_i\}$ on $H_j$, respectively. Then for $j\in U$, we have $|D_j| = n$, $|D^{*}_j| = n^{*} - n$ and $|H_j| = n^{*}$. Suppose we have an initial estimator $\widehat{\beta}^{(0)}$; for $j\in U$, we apply SSDANE on the $j$-th machine to obtain
$$\widehat{\beta}^{(1)}_j = \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\widetilde{L}^{(1)}_j(\beta) = \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\Big\{ L^{ss}_j(\beta) - \Big\langle \nabla L^{ss}_j(\widehat{\beta}^{(0)}) - \frac{1}{m}\sum_{k=1}^{m}\nabla L_k(\widehat{\beta}^{(0)}),\ \beta\Big\rangle \Big\}, \qquad (13)$$
where $L^{ss}_j(\beta)$ denotes the local semi-supervised empirical loss on the $j$-th machine, namely,
$$L^{ss}_j(\beta) = -\frac{1}{n}\sum_{i\in D_j} Y_i X_i^{\top}\beta + \frac{1}{n^{*}}\sum_{i\in H_j} \psi(X_i^{\top}\beta).$$

¹ The original DANE is based on the average of the local solutions over all workers. For the sake of comparison, we refer to the solution on a local machine as DANE and to the average of them as DANE-Avg.

Algorithm 1 Semi-Supervised Distributed Approximate NEwton Method (SSDANE)
Input: labeled data $\{(X_i, Y_i) \mid i\in D_j\}$ on machine $H_j$ for $j = 1, \dots, m$, and unlabeled data $\{X_i \mid i\in D^{*}_1\}$ on the master machine $H_1$; the number of iterations $T$.
1: The master machine $H_1$ obtains the initial estimator $\widehat{\beta}^{(0)}$ by minimizing the local empirical loss function on $H_1$.
2: for $t = 1, \dots, T$ do
3: The master machine broadcasts the parameter $\widehat{\beta}^{(t-1)}$ to each worker machine.
4: for $j = 1, \dots, m$ do
5: The $j$-th machine computes the local gradient
$$\nabla L_j(\widehat{\beta}^{(t-1)}) = -\frac{1}{n}\sum_{i\in D_j} Y_i X_i + \frac{1}{n}\sum_{i\in D_j} \psi'(X_i^{\top}\widehat{\beta}^{(t-1)})\, X_i,$$
and sends it back to the master machine $H_1$.
6: end for
7: The master machine updates the parameter by solving
$$\widehat{\beta}^{(t)} = \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\widetilde{L}^{(t)}(\beta) = \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\Big\{ L^{ss}_1(\beta) - \Big\langle \nabla L^{ss}_1(\widehat{\beta}^{(t-1)}) - \frac{1}{m}\sum_{j=1}^{m}\nabla L_j(\widehat{\beta}^{(t-1)}),\ \beta\Big\rangle \Big\}. \qquad (11)$$
8: end for
Output: the final estimator $\widehat{\beta}^{(T)}$.
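For intuition, Algorithm 1 takes a particularly simple form in the linear model, where $\psi(x) = x^2/2$ makes the master's update a closed-form linear solve: the minimizer of the quadratic surrogate is exactly a Newton step on the global gradient, preconditioned by the stabilized semi-supervised Hessian. The following is a minimal single-process sketch under that assumption (function and variable names are ours), not the authors' implementation.

```python
import numpy as np

def ssdane_linear(X_parts, Y_parts, X_unlab, T=5):
    """Sketch of Algorithm 1 for the linear model (psi(x) = x^2 / 2).

    X_parts[j], Y_parts[j]: labeled data on machine j (j = 0 is the master);
    X_unlab: additional unlabeled covariates stored on the master.
    """
    m, n = len(X_parts), len(Y_parts[0])
    X_H1 = np.vstack([X_parts[0], X_unlab])   # full covariate set H1 on the master
    H_ss = X_H1.T @ X_H1 / len(X_H1)          # semi-supervised Hessian of L1^ss
    # initial estimator: local least squares on the master
    beta = np.linalg.lstsq(X_parts[0], Y_parts[0], rcond=None)[0]
    for _ in range(T):
        # each worker sends back its local gradient of L_j at the current beta;
        # the master averages them into the global gradient
        global_grad = sum((-X.T @ Y + X.T @ (X @ beta)) / n
                          for X, Y in zip(X_parts, Y_parts)) / m
        # for quadratic psi the surrogate minimization reduces to this
        # preconditioned Newton step with the stable Hessian H_ss
        beta = beta - np.linalg.solve(H_ss, global_grad)
    return beta
```

A few such rounds already bring the iterate near the pooled least-squares solution, which is the point of the faster contraction discussed in Section 3.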
Then we take the average over these local estimators to obtain a more accurate one,
$$\widehat{\beta}^{(1)}_{\mathrm{Avg}} = \frac{1}{|U|}\sum_{j\in U}\widehat{\beta}^{(1)}_j. \qquad (12)$$
The multi-round realization of the averaged SSDANE (SSDANE-Avg) method is provided in Algorithm 2.

Algorithm 2 Semi-Supervised Distributed Approximate NEwton with Average (SSDANE-Avg)
Input: labeled data $\{(X_i, Y_i) \mid i\in D_j\}$ on machine $H_j$ for $j = 1, \dots, m$; unlabeled data $\{X_i \mid i\in D^{*}_j\}$ on each machine $H_j$ with $j\in U$; the number of iterations $T$.
1: The master machine $H_1$ obtains the initial estimator $\widehat{\beta}^{(0)}$ by minimizing the local empirical loss function on $H_1$.
2: for $t = 1, \dots, T$ do
3: The master machine broadcasts the parameter $\widehat{\beta}^{(t-1)}$ to each worker machine.
4: for $j = 1, \dots, m$ do
5: The $j$-th machine computes the local gradient
$$\nabla L_j(\widehat{\beta}^{(t-1)}) = -\frac{1}{n}\sum_{i\in D_j} Y_i X_i + \frac{1}{n}\sum_{i\in D_j} \psi'(X_i^{\top}\widehat{\beta}^{(t-1)})\, X_i,$$
and sends it back to the master machine $H_1$.
6: end for
7: The master machine computes $m^{-1}\sum_{j=1}^{m}\nabla L_j(\widehat{\beta}^{(t-1)})$ and broadcasts it to the worker machines with unlabeled data, that is, $H_j$ for $j\in U$.
8: for $j\in U$ do
9: The $j$-th machine updates the parameter by solving
$$\widehat{\beta}^{(t)}_j = \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\widetilde{L}^{(t)}_j(\beta) = \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\Big\{ L^{ss}_j(\beta) - \Big\langle \nabla L^{ss}_j(\widehat{\beta}^{(t-1)}) - \frac{1}{m}\sum_{k=1}^{m}\nabla L_k(\widehat{\beta}^{(t-1)}),\ \beta\Big\rangle \Big\}$$
and sends it back to the master machine $H_1$.
10: end for
11: The master machine updates the parameter by $\widehat{\beta}^{(t)}_{\mathrm{Avg}} = |U|^{-1}\sum_{j\in U}\widehat{\beta}^{(t)}_j$.
12: end for
Output: the final estimator $\widehat{\beta}^{(T)}_{\mathrm{Avg}}$.

3. Theoretical Results

In this section, we investigate the theoretical properties of the proposed algorithms. First, we need the following regularity assumptions.

Assumption 1 Let $H = \mathbb{E}\{\psi''(X^{\top}\beta^{*})XX^{\top}\}$; then there is a constant $\rho > 0$ such that $\rho \le \lambda_{\min}(H) \le \lambda_{\max}(H) \le \rho^{-1}$.

Assumption 2 There exist positive numbers $\eta_0 > 0$ and $C_0$ such that
$$\max\Big\{ \sup_{v\in\mathbb{S}^{p-1}} \mathbb{E}\big[\exp\{\eta_0 |X^{\top}v|^2\}\big],\ \mathbb{E}\big[\exp\{\eta_0 |\psi'(X^{\top}\beta^{*})|^2\}\big],\ \mathbb{E}\big[\exp\{\eta_0 Y^2\}\big] \Big\} \le C_0.$$

Assumption 3 The canonical link function $\psi'(\cdot)$ is twice differentiable, and there exists a uniform constant $c_\psi$ such that $\sup_{x\in\mathbb{R}} \max\{|\psi''(x)|, |\psi'''(x)|\} \le c_\psi$.
The above assumptions are standard conditions for proving estimation consistency. Assumption 1 assumes the well-posedness of the Hessian matrix. Assumption 2 guarantees that the covariate $X$ and the label $Y$ are sub-Gaussian. In Assumption 3 we assume the canonical link function $\psi'(x)$ has uniformly bounded first-order and second-order derivatives.

3.1 Convergence Rate of Semi-Supervised Distributed Estimation

3.1.1 Unlabeled Data on Master Machine

We first give the convergence rate of the estimator in Algorithm 1 when all unlabeled data are stored in the master machine.

Theorem 1 Suppose Assumptions 1 to 3 hold, and assume the initial estimator $\widehat{\beta}^{(0)}$ satisfies $|\widehat{\beta}^{(0)} - \beta^{*}|_2 = O_{\mathbb{P}}(r_n)$. Moreover, the dimension of the parameter $p$ satisfies $p = o(\min\{(n^{*})^{2/3}/\log n^{*},\ mn/\log n^{*}\})$ and $r_n = o(p^{-1/2})$. The $T$-th round SSDANE estimator satisfies
$$\big|\widehat{\beta}^{(T)} - \beta^{*}\big|_2 = O_{\mathbb{P}}\Bigg( \sqrt{\frac{p}{mn}} + r_n\Big(\frac{p\log n^{*}}{n^{*}}\Big)^{T/2} + \frac{1}{\sqrt{p}}\big(\sqrt{p}\, r_n\big)^{2^{T}} \Bigg). \qquad (14)$$

This theorem provides an upper bound for the SSDANE method. The first term is the nearly optimal statistical convergence rate with all labeled data. The second and third terms represent the improvement of the convergence rate by each iteration of the algorithm. For example, the rate in the second term is improved by the factor $\sqrt{p(\log n^{*})/n^{*}}$ with one iteration. To achieve the nearly optimal statistical rate $\sqrt{p/(mn)}$, it suffices to take
$$T \ \ge\ \frac{\log n + \log m - \log p - \log\log n^{*}}{\log n^{*} - 2\log p - \log\log n^{*}}. \qquad (15)$$
Note that $n^{*}$ is the total number of samples on the master. This indicates that the information in the unlabeled data indeed helps improve the rate of the algorithm. In fact, if the loss $L_1(\beta)$ with only labeled data is used in DANE, i.e., replacing $L^{ss}_1(\beta)$ by $L_1(\beta)$ in (11),
$$\widehat{\beta}^{(t)}_o = \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\Big\{ L_1(\beta) - \Big\langle \nabla L_1(\widehat{\beta}^{(t-1)}_o) - \frac{1}{m}\sum_{j=1}^{m}\nabla L_j(\widehat{\beta}^{(t-1)}_o),\ \beta\Big\rangle \Big\}, \qquad (16)$$
then the second term of (14) becomes $r_n\big(p\log n/n\big)^{T/2}$, which is much slower than the rate with unlabeled data as long as $n^{*} \gg n$.
A slower rate means more iterations and more rounds of communication between the master machine and the workers. The improvement of the rate by the unlabeled data is essential. Consider the one-round estimator with $T = 1$ for the linear model. Proposition 7 below shows that a tight lower bound for the DANE estimator $\widehat{\beta}^{(1)}_o$ is $C\big(\sqrt{p/(mn)} + r_n\sqrt{p/n}\big)$. Beyond this theoretical evidence, we also conduct extensive numerical analysis to verify such improvement. To further see the advantage of our semi-supervised surrogate loss in distributed estimation, we consider the case where $p > n$ but $p$ is much smaller than $n^{*}$ and $mn$. Since $p > n$, the Hessian matrix of $L_1(\beta)$ is not positive definite. Hence $\widehat{\beta}^{(t)}_o$ performs poorly in estimating $\beta^{*}$. For example, for the linear model, (16) reduces to solving a linear equation $\widehat{\Sigma}_1\beta = b$, where $\widehat{\Sigma}_1 = \frac{1}{n}\sum_{i\in D_1} X_i X_i^{\top}$ is not of full rank, so that the solution is not unique. On the other hand, the Hessian matrix of our semi-supervised surrogate loss is still positive definite as long as $n^{*} \gg p$. In this case, $\widehat{\beta}^{(T)}$ is well defined and can still attain the nearly optimal rate. Next, we present the asymptotic normality result for our proposed SSDANE estimator.

Theorem 2 (Asymptotic normality of SSDANE) Under the same conditions as in Theorem 1, assume additionally that the dimension of the parameter $p$ satisfies $p = o(\min\{(n^{*})^{1/2}/\log n^{*},\ (mn/\log^2 n^{*})^{1/3}\})$. Then when the iteration round $T$ satisfies (15), for any vector $v\in\mathbb{S}^{p-1}$, we have
$$\frac{\sqrt{mn}}{\sigma(v)}\, v^{\top}\big(\widehat{\beta}^{(T)} - \beta^{*}\big) \xrightarrow{d} N(0, 1), \qquad \sigma(v)^2 = v^{\top} H^{-1} C H^{-1} v, \qquad (17)$$
where $H = \mathbb{E}\{\psi''(X^{\top}\beta^{*})XX^{\top}\}$ and $C = c(\sigma)\,\mathbb{E}\{\psi''(X^{\top}\beta^{*})XX^{\top}\}$.

The result shows that the proposed semi-supervised surrogate loss keeps the same statistical efficiency as $L(\beta)$ with all labeled data.

Remark 3 Regarding statistical efficiency, we also note that unlabeled data sometimes helps reduce the asymptotic variance in some works on semi-supervised learning. However, our methods have different settings and targets from these works.
In particular, most of these results enjoy variance reduction due to additional information. For example, in Hou et al. (2023), the authors proposed an imputation method with the assistance of additional surrogate information. In Cai and Guo (2020), the authors considered estimating the explained variance with unlabeled data. Chakrabortty and Cai (2018) and Azriel et al. (2022) estimated the model parameter for the mis-specified linear model. In contrast, we focus on estimating the model parameter in the distributed setting. It would also be interesting to study the aforementioned problems in the distributed setting by extending our method, and there should be no technical difficulty in showing a similar efficiency-enhancement effect.

3.1.2 Unlabeled Data on Multiple Machines

In this section, we provide similar theoretical results when the unlabeled data are separately stored in multiple machines.

Theorem 4 Suppose Assumptions 1 to 3 hold, and assume the initial estimator $\widehat{\beta}^{(0)}$ satisfies $|\widehat{\beta}^{(0)} - \beta^{*}|_2 = O_{\mathbb{P}}(r_n)$. Moreover, assume the rate constraints $p = o(\min\{(n^{*})^{2/3}/\log n^{*},\ mn/\log n^{*}\})$ and $r_n = o(p^{-1/2})$. Then the $T$-th round SSDANE-Avg estimator satisfies
$$\big|\widehat{\beta}^{(T)}_{\mathrm{Avg}} - \beta^{*}\big|_2 = O_{\mathbb{P}}\Bigg( \sqrt{\frac{p}{mn}} + r_n\Big(\frac{p\log n^{*}}{n^{*}}\Big)^{T} + \frac{r_n}{\sqrt{|U|}}\Big(\frac{p\log n^{*}}{n^{*}}\Big)^{T/2} + \frac{1}{\sqrt{p}}\big(\sqrt{p}\, r_n\big)^{2^{T}} \Bigg), \qquad (18)$$
where $|U|$ denotes the cardinality of the set $U$.

We now compare the rates in (18) and (14). It is obvious that the more machines with unlabeled data, the faster the convergence rate of the algorithm. Note that the total number of unlabeled data points is $|U|(n^{*} - n)$. If these $|U|(n^{*} - n)$ data points were all stored in the master machine, then according to Theorem 1 the rate of the SSDANE estimator $\widehat{\beta}^{(T)}$ would be
$$O_{\mathbb{P}}\Bigg( \sqrt{\frac{p}{mn}} + r_n\Big(\frac{p\log n^{*}}{|U|(n^{*}-n) + n}\Big)^{T/2} + \frac{1}{\sqrt{p}}\big(\sqrt{p}\, r_n\big)^{2^{T}} \Bigg).$$
This rate is slower than or the same as (18) when $|U| \lesssim n^{*}/(p\log n^{*})$, depending on the magnitude of $n^{*} - n$. Therefore, when the unlabeled data are stored in multiple workers, the averaged SSDANE can have a better convergence rate.
Furthermore, it has a faster rate than SSDANE by an extra factor $1/\sqrt{|U|}$ when $n^{*}$ is close to $n$ (e.g., $n^{*} \asymp n$). On the other hand, the averaged SSDANE requires the $|U|$ workers to solve the optimization (13) simultaneously, and hence needs a more homogeneous distributed system. For the asymptotic distribution of $\widehat{\beta}^{(T)}_{\mathrm{Avg}}$, we can establish a result similar to Theorem 2 with the same asymptotic variance. Therefore, the averaged SSDANE further accelerates the algorithm while keeping the statistical efficiency. We further note that, if the loss $L^{ss}_j(\beta)$ with unlabeled data were used trivially in DANE, i.e., letting
$$\widehat{\beta}^{(t)}_{ss,j} = \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\Big\{ L^{ss}_j(\beta) - \Big\langle \nabla L^{ss}_j(\widehat{\beta}^{(t-1)}_{ss}) - \frac{1}{m}\sum_{k=1}^{m}\nabla L^{ss}_k(\widehat{\beta}^{(t-1)}_{ss}),\ \beta\Big\rangle \Big\}$$
and $\widehat{\beta}^{(t)}_{ss,\mathrm{Avg}} = \frac{1}{|U|}\sum_{j\in U}\widehat{\beta}^{(t)}_{ss,j}$, then we can prove that the asymptotic variance of $\widehat{\beta}^{(t)}_{ss,\mathrm{Avg}}$ is
$$v^{\top} H^{-1}\Big( C + \frac{(n^{*}-n)|U|}{mn}\, C^{ss} \Big) H^{-1} v, \qquad \text{where } C^{ss} = \mathrm{Cov}\big(\psi'(X^{\top}\beta^{*})X\big).$$
That is, a trivial use of unlabeled data in DANE can lead to a larger variance when $|U|$ is proportional to $m$.

Remark 5 We note that our current theoretical framework is founded on the premise that the data have comparable sample sizes across local sources, primarily for the sake of clarity of presentation. Specifically, for the SSDANE method, we conduct the minimization of the surrogate loss (defined in (10)) solely on the master machine $H_1$. In this context, we allow for imbalanced local sample sizes, provided that the master machine contains a sufficient number of covariates. Conversely, in the case of the SSDANE-Avg method, we take the average of all SSDANE estimators from each machine in $U$. In this case, when sample sizes across machines are imbalanced, the statistical rate becomes contingent on the smallest local sample size, denoted as $\min\{n^{*}_j \mid j\in U\}$. In simpler terms, the presence of smaller local sample sizes can potentially impact the theoretical performance of SSDANE-Avg negatively. In practice, we suggest using the SSDANE method when the sample sizes across machines are extremely imbalanced.
Notably, recent research efforts have delved into distributed training with imbalanced data (see, e.g., Duan et al. (2021a,b); Chen et al. (2023)). It would indeed be intriguing to explore the fusion of these methodologies with our distributed semi-supervised approach.

3.2 Two Examples

In this section, we provide two specific examples of the generalized linear model. The first one is linear regression. We use this example to illustrate the lower bound on the rate of DANE with labeled data and to support the claims on the better performance of our SSDANE in Section 2.2.1.

Linear regression. Assume the observation pair $(X, Y)$ comes from the following model:
$$Y = X^{\top}\beta^{*} + \epsilon, \qquad (19)$$
where $\epsilon$ denotes Gaussian noise. In this case, the canonical link function is $\psi'(x) = x$, so $\psi(x) = x^2/2$, which is quadratic. The third-order derivative of $\psi(x)$ vanishes for the linear model.

Proposition 6 In the linear model, suppose the covariate $X$ follows a sub-Gaussian distribution. Then the $T$-th round SSDANE estimator $\widehat{\beta}^{(T)}$ satisfies
$$\big|\widehat{\beta}^{(T)} - \beta^{*}\big|_2 = O_{\mathbb{P}}\Bigg( \sqrt{\frac{p}{mn}} + r_n\Big(\frac{p\log n^{*}}{n^{*}}\Big)^{T/2} \Bigg).$$

For linear regression, the third term in (14) disappears because the canonical link function is linear. In the following, we provide a lower bound for the one-step SSDANE.

Proposition 7 In the linear model, assume the covariate $X$ follows a sub-Gaussian distribution. Moreover, assume that
$$\inf_{v\in\mathbb{S}^{p-1}} \lambda_{\min}\Big[ \mathrm{Cov}\big\{ (XX^{\top} - \Sigma)v \big\} \Big] \ge \rho_1.$$
Let $\widehat{\beta}^{(0)}$ be an initial estimator that is independent of the $X_i$'s and $\epsilon_i$'s. Then there exists a constant $c$ such that, with high probability, the one-step SSDANE estimator $\widehat{\beta}^{(1)}$ satisfies
$$\big|\widehat{\beta}^{(1)} - \beta^{*}\big|_2 \ \ge\ c\Big( \sqrt{\frac{p}{mn}} + r_n\sqrt{\frac{p}{n^{*}}} \Big),$$
where $r_n = |\widehat{\beta}^{(0)} - \beta^{*}|_2$. Suppose there is no additional unlabeled data, so that $n^{*} = n$. In this case, SSDANE becomes DANE and $\widehat{\beta}^{(1)} = \widehat{\beta}^{(1)}_o$.
Therefore, DANE with only labeled data has the lower bound $c\big(\sqrt{p/(mn)} + r_n\sqrt{p/n}\big)$.

Logistic regression. Assume the observation $(X, Y)\in\mathbb{R}^{p+1}$ admits the following conditional probability function:
$$\mathbb{P}(Y \mid X) = \frac{\exp(Y X^{\top}\beta^{*})}{1 + \exp(X^{\top}\beta^{*})}, \qquad (21)$$
where the response variable $Y$ only takes values in $\{0, 1\}$. Then the canonical link function is $\psi'(x) = e^{x}/(1 + e^{x})$, i.e., $\psi(x) = \log(1 + e^{x})$. It is not hard to verify that $\psi$ fulfills Assumption 3. Therefore, the theorems above hold for logistic regression.

3.3 Theory in the Non-Distributed Setting

To show that a trivial use of unlabeled data declines the statistical efficiency, we provide a result on the asymptotic distribution in the non-distributed setting. Recall the loss function
$$L^{ss}(\beta) = -\frac{1}{n}\sum_{i\in D} Y_i X_i^{\top}\beta + \frac{1}{n^{*}}\sum_{i\in H} \psi(X_i^{\top}\beta),$$
and define
$$\widehat{\beta}^{ss} \in \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\big\{ L^{ss}(\beta) \big\}. \qquad (22)$$

Proposition 8 (Asymptotic normality of the semi-supervised GLM estimator) Suppose Assumptions 1 to 3 hold, and additionally assume the rate constraint $p = o((n/\log n)^{2/3})$. Then for any vector $v\in\mathbb{S}^{p-1}$, we have
$$\frac{\sqrt{n}}{\sigma^{*}(v)}\, v^{\top}\big(\widehat{\beta}^{ss} - \beta^{*}\big) \xrightarrow{d} N(0, 1), \qquad \sigma^{*}(v)^2 = v^{\top} H^{-1}\Big( C + \frac{n^{*}-n}{n^{*}}\, C^{ss} \Big) H^{-1} v, \qquad (23)$$
where $C = c(\sigma)\,\mathbb{E}\{\psi''(X^{\top}\beta^{*})XX^{\top}\}$, $C^{ss} = \mathrm{Cov}\big(\psi'(X^{\top}\beta^{*})X\big)$, and $H = \mathbb{E}\{\psi''(X^{\top}\beta^{*})XX^{\top}\}$.

Let $n^{*} = n$ in Proposition 8. In this case, $D = H$ and hence $\widehat{\beta}^{ss} = \widehat{\beta}_D$. That is, the asymptotic covariance matrix of $\widehat{\beta}_D$ is $H^{-1}CH^{-1}$. When the unlabeled data are used, the covariance matrix becomes $H^{-1}\big( C + \frac{n^{*}-n}{n^{*}} C^{ss} \big) H^{-1}$, which has lower statistical efficiency.

4. Empirical Analysis

In the empirical analysis, we conduct several experiments to show the effectiveness of our proposed methods.

4.1 Simulation Studies on Synthetic Dataset

In this section, we show the performance of our proposed semi-supervised estimators on linear regression and logistic regression.

Parameter Settings. In both models, we assume the i.i.d. covariate vectors $X_i = (X_{i,1}, \dots, X_{i,p})^{\top}$ are drawn from a multivariate normal distribution $N(0, \Sigma)$ for $i = 1, \dots, N$.
Here the covariance matrix $\Sigma$ is a $p\times p$ Toeplitz matrix with $(i, j)$-th entry $\Sigma_{ij} = 0.5^{|i-j|}$, where $1\le i, j\le p$. We fix the dimension $p = 20$ and the true coefficient $\beta^{*} = (1, 0.95, 0.9, \dots, 0.1, 0.05)^{\top}$. We repeat 100 independent simulations and report the averaged estimation error and the corresponding standard error. We mainly compare SSDANE (Algorithm 1) and SSDANE-Avg (Algorithm 2) with the following six methods:

- DANE: the supervised version of the distributed approximate Newton estimator. We run Algorithm 1 without using the additional unlabeled dataset.
- DANE-Avg: the supervised version of the averaged distributed approximate Newton estimator. We run Algorithm 2 without using the additional unlabeled dataset.
- DANEimp: the imputation-based distributed approximate Newton estimator. We impute the unlabeled data using the current estimator $\widehat{\beta}$ and run DANE on all pairs of data.
- DANEimp-Avg: the imputation-based averaged distributed approximate Newton estimator. We impute the unlabeled data using the current estimator $\widehat{\beta}$ and run DANE-Avg on all pairs of data.
- Local estimator: we minimize the empirical loss function $L_1(\beta) = -\frac{1}{n}\sum_{i\in D_1} Y_i X_i^{\top}\beta + \frac{1}{n}\sum_{i\in D_1}\psi(X_i^{\top}\beta)$ with the data on the master machine $H_1$.
- Pooled estimator: we collect all labeled pairs on one machine and minimize the empirical loss function.

In particular, we denote by SSDANE-Avg($\alpha$) (resp. DANE-Avg($\alpha$)) the estimator obtained by averaging the local SSDANE (resp. DANE) estimators from a fraction $\alpha$ of the machines. For the choice of the initial estimator, we uniformly use the local estimator on the master machine $H_1$.

4.1.1 Results for Linear Regression

We first consider linear regression, where the observation $(X, Y)$ is generated from the linear model (19).
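The covariate design just described can be reproduced as follows; the helper name is ours, and this is only a sketch of the simulation setting, not the authors' code.

```python
import numpy as np

def make_design(N, p, rho=0.5, seed=None):
    # covariates X_i ~ N(0, Sigma) with the Toeplitz matrix Sigma_ij = rho^{|i-j|},
    # matching the parameter settings of Section 4.1
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    return rng.multivariate_normal(np.zeros(p), Sigma, size=N), Sigma

p = 20
beta_star = 1.0 - 0.05 * np.arange(p)   # (1, 0.95, 0.90, ..., 0.10, 0.05)
X, Sigma = make_design(1000, p, seed=0)
print(X.shape, Sigma[0, 1])  # (1000, 20) 0.5
```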
Effect of the Number of Machines and Local Unlabeled Data. To investigate the effect of the number of machines and of the local unlabeled data, we fix the labeled local sample size $n$ at 100, and vary the number of machines $m$ over $\{20, 50, 100\}$ and the unlabeled local sample size $n^*$ over $\{100, 200, 500\}$. We compare the one-step SSDANE and SSDANE-Avg with different averaging fractions. The $\ell_2$-errors and their standard errors are reported in Table 1. As we can see from Table 1, for every fixed $m$, more unlabeled samples help reduce the $\ell_2$-error of the semi-supervised estimators. The averaged estimator SSDANE-Avg always has a lower error than SSDANE, which only solves problem (11) on the master machine. Moreover, the larger the fraction of machines we average over, the smaller the estimation error. All of these findings coincide with the theoretical results in Theorem 1 and Theorem 4.

Covariance of the gradient at $\beta^*$. From the proofs of Theorem 2 and Proposition 8, we observe that the asymptotic variances of the estimators differ only in the covariance of the gradient of the loss function $L$. Therefore, to demonstrate the statistical efficiency of our method in comparison with the naive semi-supervised estimator, we compare the rescaled covariance of the gradient at the true parameter $\beta^*$. Specifically, for each round, we generate $N$ samples and compute $N\,\nabla L(\beta^*)\nabla L(\beta^*)^T$. Then, we repeat this process 100 times and compute the average, which provides an estimate of the covariance of the gradient $\nabla L(\beta^*)$

Table 1: The ℓ2-errors and their standard errors (in parentheses) of the one-step SSDANE, SSDANE-Avg(0.2), SSDANE-Avg(0.5) and SSDANE-Avg(1) under labeled local sample size n = 100. Data are generated from a linear model with parameter dimension p = 20.
  m    n*   SSDANE        SSDANE-Avg(0.2)  SSDANE-Avg(0.5)  SSDANE-Avg(1)
  20   100  0.588(0.226)  0.310(0.083)     0.236(0.053)     0.213(0.042)
       200  0.321(0.109)  0.193(0.041)     0.163(0.030)     0.154(0.026)
       500  0.203(0.040)  0.156(0.026)     0.145(0.028)     0.142(0.028)
  50   100  0.586(0.228)  0.227(0.056)     0.199(0.041)     0.187(0.035)
       200  0.299(0.103)  0.131(0.029)     0.118(0.022)     0.112(0.020)
       500  0.173(0.039)  0.101(0.019)     0.096(0.017)     0.094(0.016)
 100   100  0.593(0.235)  0.199(0.049)     0.184(0.038)     0.178(0.036)
       200  0.295(0.104)  0.106(0.023)     0.097(0.020)     0.095(0.019)
       500  0.161(0.039)  0.073(0.015)     0.070(0.014)     0.069(0.013)

(denoted as Σ̂(∇L)). Here the loss function is chosen as $L$ (fully supervised empirical loss), $L^{ss}$ (semi-supervised empirical loss), $\tilde L^{(1)}$ (1-step semi-supervised surrogate loss), and $\tilde L^{(5)}$ (5-step semi-supervised surrogate loss). We present the trace of the covariance (Table 2), the difference $\|\hat\Sigma(\nabla L) - C\|$ (Table 3), and the distribution of the eigenvalues of $\hat\Sigma(\nabla L)$ (Figure 1).

Table 2: The rescaled trace of the covariance Σ̂(∇L) for ∇L, ∇L^ss, ∇L̃^(1) and ∇L̃^(5) under labeled local sample size n = 100. Data are generated from a linear model with parameter dimension p = 20.

  m    n*   ∇L     ∇L^ss   ∇L̃^(1)  ∇L̃^(5)
  20   100  0.997  0.997   6.097    2891.910
       200  0.997  12.443  3.720    1.763
       500  0.997  18.476  2.196    1.046
  50   100  0.979  0.979   13.591   9386.730
       200  0.979  12.530  7.425    3.062
       500  0.979  19.100  3.542    1.048
 100   100  0.944  0.944   26.512   22642.400
       200  0.944  12.228  13.755   5.829
       500  0.944  18.367  5.822    0.998

From Table 2, it can be observed that the fully supervised surrogate loss ($n^* = n$) has the largest covariance trace. This is because the supervised estimator has a large bias. As $n^*$ grows, the semi-supervised empirical loss exhibits a larger covariance trace than that of $\nabla L$, which aligns with the result in Proposition 8. On the contrary, the semi-supervised surrogate loss has a smaller covariance trace.
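The Monte Carlo estimate Σ̂(∇L) described above can be sketched as follows. This is a minimal illustration (not the paper's code) assuming the fully supervised least-squares loss $L(\beta) = \frac{1}{2N}\sum_i (Y_i - X_i^T\beta)^2$, whose gradient at $\beta^*$ is $-\frac{1}{N}\sum_i \varepsilon_i X_i$:

```python
import numpy as np

def gradient_covariance(N=2000, p=20, reps=100, seed=0):
    # Average of N * g g^T over replications, where g = grad L(beta*)
    # for the supervised least-squares loss under the Toeplitz design.
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    L = np.linalg.cholesky(Sigma)          # so that X has covariance Sigma
    acc = np.zeros((p, p))
    for _ in range(reps):
        X = rng.standard_normal((N, p)) @ L.T
        eps = rng.standard_normal(N)       # noise; residuals at beta*
        g = -(X * eps[:, None]).mean(axis=0)
        acc += N * np.outer(g, g)
    return acc / reps

C_hat = gradient_covariance(N=500, p=5, reps=200)
```

Under this design the estimate should approach $\sigma^2\Sigma = \Sigma$; its trace divided by $p$ is then close to 1, consistent with the ∇L column of Table 2 (assuming that is the rescaling used there).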
In particular, for large iteration numbers and unlabeled sample sizes, the semi-supervised surrogate loss has a covariance trace similar to that of $\nabla L$. Table 3 reveals that $\hat\Sigma(\nabla L)$ approximates $C$ for SSDANE when the local unlabeled sample size is large, while the naive semi-supervised loss exhibits a larger difference as the unlabeled sample size increases. Figure 1 conveys similar implications.

Table 3: The norm of the difference $\|\hat\Sigma(\nabla L) - C\|$ for ∇L, ∇L^ss, ∇L̃^(1) and ∇L̃^(5) under labeled local sample size n = 100. Data are generated from a linear model with parameter dimension p = 20.

  m    n*   ∇L     ∇L^ss       ∇L̃^(1)     ∇L̃^(5)
  20   100  1.787  1.787       3.211×10^1  3.331×10^4
       200  1.787  7.892×10^1  1.716×10^1  6.037
       500  1.787  1.121×10^2  8.215       1.963
  50   100  2.224  2.224       7.789×10^1  1.157×10^5
       200  2.224  8.153×10^1  4.016×10^1  1.634×10^1
       500  2.224  1.262×10^2  1.698×10^1  2.290
 100   100  2.138  2.138       1.554×10^2  2.717×10^5
       200  2.138  8.200×10^1  7.753×10^1  3.946×10^1
       500  2.138  1.214×10^2  3.028×10^1  2.202

Figure 1: The eigenvalues of $\hat\Sigma(\nabla L)$ in increasing order under the linear regression model. The labeled local sample size is 100, the number of machines is 50, the unlabeled local sample size takes values in {0, 100, 400}, and the dimension p is 20.

Effect of the Number of Iterations. Next, we study the effect of the number of iterations. We set the number of machines to 100, the unlabeled local sample size to 400, and vary the labeled local sample size over {50, 100, 200}. The curves of the ℓ2-error over the number of iterations are presented in Figure 2. From Figure 2, we can see that in all cases SSDANE and SSDANE-Avg outperform their fully supervised counterparts. The superiority is more obvious when the labeled local sample size n is small. The imputation-based method DANEimp is similar to SSDANE when only one machine has unlabeled data, but the averaged estimator DANEimp-Avg is inferior to both SSDANE-Avg and DANE-Avg.
This indicates that a naive combination of imputation with DANE may lead to a deterioration in performance. Moreover, we find that the fully supervised estimator DANE blows up within several steps when n is small, while the semi-supervised estimators always converge well.

Figure 2: The ℓ2-error over the number of iterations under the linear model. The labeled local sample size takes values in {50, 100, 200}, the number of machines is 100, the unlabeled local sample size is 400, and the dimension p is 20.

Effect of initialization. By Proposition 6, in the linear model the statistical rate always improves as the iteration number grows. In the following experiment, we examine whether the estimation error is sensitive to the initial point. To see this, we take the initial point $\hat\beta^{(0)} = \beta^* + \epsilon$, where the noise $\epsilon$ is sampled from $N(0, I_p)$. It is not hard to see that $\mathbb{E}[|\hat\beta^{(0)} - \beta^*|_2^2] = p$, which violates the assumption in our theory. The results are presented in Table 4.

Table 4: The ℓ2-errors and their standard errors (in parentheses) of the 1-step SSDANE, 5-step SSDANE, 1-step SSDANE-Avg, and 5-step SSDANE-Avg with initialization $\hat\beta^{(0)} = \beta^* + \epsilon$, under labeled local sample size n = 100. Data are generated from a linear model with parameter dimension p = 20.

  m    n*   1-Step SSDANE  1-Step SSDANE-Avg  5-Step SSDANE    5-Step SSDANE-Avg
  20   100  3.434(1.212)   1.169(0.222)       28.978(37.931)   0.126(0.022)
       200  2.135(0.717)   0.682(0.165)       1.476(1.927)     0.125(0.022)
       500  1.248(0.354)   0.580(0.160)       0.145(0.038)     0.125(0.022)
  50   100  3.496(1.273)   1.177(0.221)       33.016(48.324)   0.080(0.016)
       200  2.128(0.722)   0.587(0.132)       1.462(2.004)     0.079(0.015)
       500  1.184(0.338)   0.384(0.107)       0.989(0.032)     0.079(0.015)
 100   100  3.492(1.300)   1.177(0.216)       33.780(51.572)   0.056(0.010)
       200  2.138(0.734)   0.558(0.114)       1.475(2.026)     0.056(0.010)
       500  1.175(0.346)   0.307(0.085)       0.077(0.028)     0.056(0.010)
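The claim that the random initialization $\hat\beta^{(0)} = \beta^* + \epsilon$ with $\epsilon \sim N(0, I_p)$ satisfies $\mathbb{E}[|\hat\beta^{(0)} - \beta^*|_2^2] = p$ follows because $|\epsilon|_2^2$ is a sum of $p$ independent $\chi^2_1$ variables; a quick numerical check:

```python
import numpy as np

# E|eps|_2^2 = p for eps ~ N(0, I_p), so this initialization violates the
# o(1/sqrt(p)) initial-rate condition required by the theory.
rng = np.random.default_rng(0)
p, reps = 20, 20000
eps = rng.standard_normal((reps, p))
mean_sq_norm = (eps ** 2).sum(axis=1).mean()
# mean_sq_norm is close to p = 20
```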
As shown in the linear model, the fully supervised estimator ($n^* = n$) is highly sensitive to the initialization compared with the semi-supervised estimator ($n^* > n$), and increasing the amount of unlabeled data makes the estimator more robust. Moreover, we also consider initializing with distributed batch stochastic gradient descent (BSGD) and local stochastic gradient descent (LSGD). We use a batch size of $0.1n$ for both BSGD and LSGD and run 10 iterations in total. BSGD uses a constant step size, while, following Stich (2019), LSGD uses the step size $C/(t+1)$. We compare SSDANE and SSDANE-Avg initialized by the local estimator, random initialization ($\hat\beta^{(0)} = \beta^* + \epsilon$), early-stopped BSGD, and early-stopped LSGD, and show how their estimation errors evolve with the iterations. The results are shown in Figure 3.

Figure 3: The ℓ2-error over the number of iterations under a linear model with different initializations. The labeled local sample size takes values in {50, 100, 200}, the number of machines is 100, the unlabeled local sample size is 400, and the dimension p is 20.

As we can see from Figure 3, in the linear model all four initialization methods lead to consistent estimators. The SSDANE-Avg method always outperforms the corresponding SSDANE method. Notably, the BSGD- and LSGD-initialized methods exhibit smaller estimation errors and consequently require fewer iterations to attain the optimal rate. In practical applications, the choice of initialization can be tailored to the problem at hand. Specifically, when the local sample size is ample, we recommend initializing with the local minimizer, since it incurs no communication cost.
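The LSGD initializer described above (batch size $0.1n$, step size $C/(t+1)$) can be sketched as follows for the least-squares loss. The function name, the value of $C$, and the final averaging step are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def lsgd_init(X_parts, Y_parts, C=0.5, iters=10, seed=0):
    # Local SGD initialization: each machine runs mini-batch SGD on its own
    # labeled data with decaying step size C/(t+1); the final local iterates
    # are averaged (one communication round) to produce beta^(0).
    rng = np.random.default_rng(seed)
    p = X_parts[0].shape[1]
    betas = []
    for X, Y in zip(X_parts, Y_parts):
        n = X.shape[0]
        b = max(1, int(0.1 * n))                 # batch size 0.1 n
        beta = np.zeros(p)
        for t in range(iters):
            idx = rng.choice(n, size=b, replace=False)
            grad = X[idx].T @ (X[idx] @ beta - Y[idx]) / b  # least-squares gradient
            beta -= (C / (t + 1)) * grad
        betas.append(beta)
    return np.mean(betas, axis=0)

# toy run: m = 5 machines with n = 200 labeled samples each
rng = np.random.default_rng(1)
m, n, p = 5, 200, 5
beta_star = np.full(p, 0.5)
X_parts = [rng.standard_normal((n, p)) for _ in range(m)]
Y_parts = [X @ beta_star + rng.standard_normal(n) for X in X_parts]
beta_init = lsgd_init(X_parts, Y_parts)
```

Only the final averaging requires communication, which is why early-stopped LSGD is an inexpensive way to obtain a rough but consistent starting point.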
Conversely, when the local sample size is small, initializing with BSGD or LSGD is a prudent choice, as it meets the requirement $|\hat\beta^{(0)} - \beta^*|_2 = o_P(p^{-1/2})$.

4.1.2 Results for Logistic Regression

In this section, we consider logistic regression, where the observation $(X, Y)$ is generated from model (21). To solve the optimization problem (13) for the logistic regression model, we apply conjugate gradient descent, motivated by Minka (2003).

Effect of the Number of Machines and Local Unlabeled Data. As in the linear model, we fix the labeled local sample size n at 300 and vary the number of machines m over {20, 50, 100} and the unlabeled local sample size n* over {300, 450, 900}. We compare the one-step SSDANE and SSDANE-Avg with different averaging fractions. The results are shown in Table 5. We observe phenomena similar to those in the linear model, which shows that the unlabeled dataset and the averaging fraction play important roles in reducing the estimation error.

Covariance of the gradient at $\beta^*$. In this part, we compare various quantities of the covariance matrix $\hat\Sigma(\nabla L)$, including the rescaled trace (Table 6) and the difference $\|\hat\Sigma(\nabla L) - C\|$ (Table 7), and we draw the distribution of the eigenvalues of $\hat\Sigma(\nabla L)$ (Figure 4). We can observe a similar phenomenon as in the linear model.

Table 5: The ℓ2-errors and their standard errors (in parentheses) of the one-step SSDANE, SSDANE-Avg(0.2), SSDANE-Avg(0.5) and SSDANE-Avg(1) under labeled local sample size n = 300. Data are generated from a logistic regression model with parameter dimension p = 20.
  m    n*   SSDANE        SSDANE-Avg(0.2)  SSDANE-Avg(0.5)  SSDANE-Avg(1)
  20   300  0.692(0.267)  0.414(0.105)     0.347(0.074)     0.325(0.061)
       450  0.510(0.154)  0.347(0.076)     0.309(0.064)     0.294(0.053)
       900  0.375(0.080)  0.299(0.053)     0.284(0.052)     0.278(0.049)
  50   300  0.656(0.301)  0.280(0.072)     0.247(0.054)     0.239(0.047)
       450  0.455(0.176)  0.226(0.053)     0.204(0.042)     0.199(0.037)
       900  0.309(0.079)  0.189(0.035)     0.178(0.032)     0.177(0.031)
 100   300  0.634(0.281)  0.226(0.046)     0.210(0.039)     0.204(0.036)
       450  0.437(0.169)  0.172(0.033)     0.161(0.028)     0.157(0.028)
       900  0.285(0.079)  0.137(0.023)     0.131(0.022)     0.129(0.022)

Table 6: The rescaled trace of the covariance Σ̂(∇L) for ∇L, ∇L^ss, ∇L̃^(1) and ∇L̃^(5) under labeled local sample size n = 300. Data are generated from a logistic regression model with parameter dimension p = 20.

  m    n*   ∇L     ∇L^ss  ∇L̃^(1)  ∇L̃^(5)
  20   300  0.074  0.074  0.222    0.271
       450  0.074  0.218  0.174    0.091
       900  0.074  0.355  0.130    0.077
  50   300  0.073  0.073  0.441    0.693
       450  0.073  0.227  0.325    0.108
       900  0.073  0.348  0.211    0.078
 100   300  0.076  0.076  0.809    1.421
       450  0.076  0.213  0.570    0.133
       900  0.076  0.344  0.336    0.078

Effect of the Number of Iterations. In this experiment, we fix the number of machines at 100 and the unlabeled local sample size at 600, and vary the labeled local sample size over {100, 150, 300}. From Figure 5, we find that the semi-supervised estimators SSDANE and SSDANE-Avg are more stable and accurate, especially when the labeled local sample size n is small. Moreover, DANEimp-Avg always performs worse than SSDANE-Avg.

Effect of initialization. Theorems 1 and 4 suggest that the algorithm converges only when the initial rate satisfies $r_n = o(1/\sqrt{p})$. In this experiment, we check whether this requirement is necessary. As in the linear model, we take the initial point $\hat\beta^{(0)} = \beta^* + \epsilon$, where the noise $\epsilon$ is sampled from $N(0, I_p)$. The ℓ2-errors are reported in Table 8, where '/' indicates that the algorithm blows up.
We can observe a similar trend as in the linear model: the semi-supervised estimator is more robust to the initialization than its fully supervised counterpart. Additionally, the SSDANE algorithm exhibits greater robustness than SSDANE-Avg. These findings suggest that there may be scope for further improvement in Theorems 1 and 4. We also find that the 5-step SSDANE exhibits a large variance, or even blows up, when the unlabeled sample size n* is small. We conjecture that this behavior is due to an unfavorable initialization coupled with the small value of n*. Indeed, Theorems 1 and 4 indicate that when $\sqrt{p}\,r_n$ or $\sqrt{p \log n^*/n^*}$ is greater than 1, the statistical rate deteriorates as the number of steps increases. This potentially explains why the 5-step SSDANE is worse than its single-step counterpart in certain instances.

Table 7: The norm of the difference $\|\hat\Sigma(\nabla L) - C\|$ for ∇L, ∇L^ss, ∇L̃^(1) and ∇L̃^(5) under labeled local sample size n = 300. Data are generated from a logistic regression model with parameter dimension p = 20.

  m    n*   ∇L     ∇L^ss  ∇L̃^(1)  ∇L̃^(5)
  20   300  0.152  0.152  0.937    1.578
       450  0.152  0.935  0.668    0.224
       900  0.152  1.702  0.411    0.163
  50   300  0.138  0.138  2.250    5.041
       450  0.138  0.967  1.567    0.349
       900  0.138  1.662  0.908    0.143
 100   300  0.146  0.146  4.424    10.456
       450  0.146  0.900  2.975    0.549
       900  0.146  1.610  1.618    0.154

Figure 4: The eigenvalues of $\hat\Sigma(\nabla L)$ in increasing order under the logistic regression model. The labeled local sample size is 300, the number of machines is 50, the unlabeled local sample size takes values in {0, 150, 600}, and the dimension p is 20.
We further investigate a range of initialization strategies, including the local minimizer, random initialization, and early-stopped BSGD and LSGD, mirroring the linear-model experiments. The hyperparameters are chosen as in Section 4.1.1.

Figure 5: The ℓ2-error over the number of iterations under the logistic regression model. The labeled local sample size takes values in {100, 150, 300}, the number of machines is 100, the unlabeled local sample size is 600, and the dimension p is 20.

Table 8: The ℓ2-errors and their standard errors (in parentheses) of the 1-step SSDANE, 5-step SSDANE, 1-step SSDANE-Avg, and 5-step SSDANE-Avg with initialization $\hat\beta^{(0)} = \beta^* + \epsilon$, under labeled local sample size n = 300. Data are generated from a logistic regression model with parameter dimension p = 20. A '/' denotes that the algorithm blows up.

  m    n*   1-Step SSDANE    1-Step SSDANE-Avg  5-Step SSDANE  5-Step SSDANE-Avg
  20   300  10.193(20.198)   115.690(450.732)   /              /
       450  4.715(10.561)    3.239(4.366)       0.431(0.252)   /
       900  1.532(2.203)     1.042(1.651)       0.264(0.053)   /
  50   300  10.245(21.135)   133.470(316.598)   /              /
       450  4.963(12.696)    2.994(3.789)       /              /
       900  1.385(1.614)     0.860(1.254)       0.174(0.047)   /
 100   300  10.197(21.368)   89.883(180.015)    /              /
       450  4.912(13.116)    7.447(44.892)      /              /
       900  1.400(2.047)     0.757(0.961)       0.127(0.043)   /

In contrast with the findings in Figure 3, we find that random initialization may cause the SSDANE-Avg method to diverge. However, as the local sample size increases, the SSDANE-Avg method converges. This observation suggests that both SSDANE and SSDANE-Avg benefit from a well-chosen initialization, and that a larger local sample size helps mitigate the sensitivity to initialization. This phenomenon can be partially explained by Theorems 1 and 4.
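For the linear model, the one-step SSDANE iterate evaluated throughout these simulations admits the closed form given in the proof of Proposition 7: $\hat\beta^{(1)} = \hat\beta^{(0)} - (\hat\Sigma^{ss}_1)^{-1}\{\frac{1}{mn}\sum_i X_iX_i^T\hat\beta^{(0)} - \frac{1}{mn}\sum_i Y_iX_i\}$, where the preconditioner $\hat\Sigma^{ss}_1$ is built from all samples (labeled and unlabeled) on the master machine. A minimal single-process sketch, with illustrative names and a simplified data layout (the unlabeled pool is generated separately from the labeled set):

```python
import numpy as np

def ssdane_one_step(beta0, X_labeled, Y_labeled, X_master_all):
    # One SSDANE round for the linear model:
    #   beta1 = beta0 - (Sigma_ss)^{-1} * global labeled gradient at beta0,
    # where Sigma_ss averages X X^T over ALL master-machine samples.
    Sigma_ss = X_master_all.T @ X_master_all / X_master_all.shape[0]
    grad = X_labeled.T @ (X_labeled @ beta0 - Y_labeled) / X_labeled.shape[0]
    return beta0 - np.linalg.solve(Sigma_ss, grad)

# toy run: p = 5, mn = 500 labeled samples in total, n* = 2000 master samples
rng = np.random.default_rng(0)
p, mn, n_star = 5, 500, 2000
beta_star = np.linspace(1.0, 0.2, p)
X = rng.standard_normal((mn, p))
Y = X @ beta_star + rng.standard_normal(mn)
X_master = rng.standard_normal((n_star, p))       # master's (unlabeled) pool
beta0 = beta_star + 0.5 * rng.standard_normal(p)  # rough initial estimate
beta1 = ssdane_one_step(beta0, X, Y, X_master)
# beta1 should be closer to beta_star than beta0
```

The unlabeled data enters only through the Hessian estimate $\hat\Sigma^{ss}_1$, which is why a large $n^*$ sharpens the Newton direction without requiring any extra labels.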
4.2 Application to Real-World Benchmarks

In this section, we analyze the CelebA dataset² from the Kaggle website, which is included in LEAF (Caldas et al., 2018), a standard distributed learning benchmark. Our aim is to train a classifier that distinguishes young people from old ones. In the dataset, each observation has 39 attributes. We take the total sample size to be 120000 and randomly partition the dataset into 20000 testing data, 20000 labeled training data, and 80000 unlabeled training data. We perform 100 random partitions of the dataset and report the averaged classification error on the testing set. We consider three cases, where (m, n) = (20, 1000), (40, 500), and (80, 250), respectively. The results are shown in Figure 7. It is not hard to see that the proposed semi-supervised methods always outperform their supervised counterparts, and the superiority of the semi-supervised method is more obvious when the labeled local sample size n is small. While the imputation-based method DANEimp is comparable to SSDANE, the averaged estimator DANEimp-Avg is much worse than SSDANE-Avg. This result again reveals that naively combining imputation with DANE may not improve the training performance.

2 https://www.kaggle.com/datasets/jessicali9530/celeba-dataset

Figure 6: The ℓ2-error over the number of iterations under a logistic model with different initializations. The labeled local sample size takes values in {100, 150, 300}, the number of machines is 100, the unlabeled local sample size is 600, and the dimension p is 20.

Figure 7: CelebA dataset. The classification error over iterations under various pairs of (m, n). The labeled training sample size is 20000, the unlabeled sample size is 80000, the test sample size is 20000, and the dimension p is 39.

5. Concluding Remarks and Future Study

In this paper, we study the semi-supervised generalized linear model in the distributed setup.
With the assistance of the additional unlabeled data, we theoretically show that the proposed estimators enjoy higher statistical accuracy than their fully supervised counterparts. The superiority is also illustrated by numerical experiments. In the future, several directions are worth exploring. First, it is interesting to study distributed semi-supervised learning in a high-dimensional regime. We believe that the unlabeled data can yield a similar accuracy gain when combined with the one-shot averaging method (Lee et al., 2017) and the multi-round distributed sparse learning methods (Wang et al., 2017; Jordan et al., 2019). Second, while we focus on the scenario where the model is well-specified, it is important to consider distributed semi-supervised learning for misspecified models (Bellec et al., 2018; Chakrabortty and Cai, 2018; Zhang and Bradic, 2021; Deng et al., 2020; Azriel et al., 2022). Third, it will be of great importance to explore distributed semi-supervised ridge regression (Dobriban and Sheng, 2021, 2020; Sheng and Dobriban, 2020). Lastly, it would be of great interest to combine research on distributed training with imbalanced data (see, e.g., Duan et al. (2021a,b); Chen et al. (2023)) with our distributed semi-supervised approach.

Acknowledgments

Jiyuan Tu's research is supported by the Fundamental Research Funds for the Central Universities. Weidong Liu's research is supported by NSFC Grant No. 11825104. Xiaojun Mao's research is supported by NSFC Grant No. 12371273, the Shanghai Rising-Star Program 23QA1404600, and the Young Elite Scientists Sponsorship Program by CAST (2023QNRC001).
The authors would like to thank the action editor and three anonymous referees for their constructive comments, which greatly improved the quality of the paper.

The appendix consists of three parts. In Appendix A, we provide proofs of the theoretical results for SSDANE and SSDANE-Avg in Section 3.1. In Appendix B, we prove the theory in Section 3.2, especially the rate lower bound of SSDANE for the linear model. Appendix C is devoted to proving the normality result for the semi-supervised generalized linear model in the single-machine setting, namely the theorems in Section 3.3.

Appendix A. Proof of Theories in Section 3.1

Proof [Proof of Equation (6)] Notice that
$$\mathbb{E}\big[-Y_i X_i + X_i\psi'(X_i^T\beta^*)\big] = 0, \qquad \mathbb{E}\Big[\frac{1}{n^*}\sum_{i \in H} X_i\psi'(X_i^T\beta^*) - \frac{1}{n}\sum_{i \in D} X_i\psi'(X_i^T\beta^*)\Big] = 0,$$
and
$$\operatorname{Cov}\Big(\frac{1}{n}\sum_{i \in D}\big\{-Y_i X_i + X_i\psi'(X_i^T\beta^*)\big\},\ \frac{1}{n^*}\sum_{i \in H} X_i\psi'(X_i^T\beta^*) - \frac{1}{n}\sum_{i \in D} X_i\psi'(X_i^T\beta^*)\Big) = 0,$$
so we have that
$$\operatorname{Cov}\Big(-\frac{1}{n}\sum_{i \in D} Y_i X_i + \frac{1}{n^*}\sum_{i \in H} X_i\psi'(X_i^T\beta^*)\Big) = \operatorname{Cov}\Big(\frac{1}{n}\sum_{i \in D}\big\{-Y_i X_i + X_i\psi'(X_i^T\beta^*)\big\}\Big) + \operatorname{Cov}\Big(\frac{1}{n^*}\sum_{i \in H} X_i\psi'(X_i^T\beta^*) - \frac{1}{n}\sum_{i \in D} X_i\psi'(X_i^T\beta^*)\Big). \qquad (24)$$
We compute the two terms respectively. For the first term, since the $X_i$ are independent of each other, we have
$$\operatorname{Cov}\Big(\frac{1}{n}\sum_{i \in D}\big\{-Y_i X_i + X_i\psi'(X_i^T\beta^*)\big\}\Big) = \frac{1}{n}\operatorname{Cov}\big(-YX + X\psi'(X^T\beta^*)\big).$$
For the second term in (24), writing $\frac{1}{n^*}\sum_{i \in H} - \frac{1}{n}\sum_{i \in D} = -\frac{n^* - n}{n n^*}\sum_{i \in D} + \frac{1}{n^*}\sum_{i \in H \setminus D}$ and using the independence of the $X_i$, we compute that
$$\operatorname{Cov}\Big(\frac{1}{n^*}\sum_{i \in H} X_i\psi'(X_i^T\beta^*) - \frac{1}{n}\sum_{i \in D} X_i\psi'(X_i^T\beta^*)\Big) = \Big\{\frac{n(n^* - n)^2}{(n n^*)^2} + \frac{n^* - n}{(n^*)^2}\Big\}\operatorname{Cov}\big(X\psi'(X^T\beta^*)\big) = \frac{n^* - n}{n n^*}\operatorname{Cov}\big(X\psi'(X^T\beta^*)\big).$$
Then (6) can be obtained by substituting the above two formulas into (24).

Lemma 9 Given a convex, second-order differentiable loss function $L(\beta)$, suppose there exist constants $\rho_0, r_0$ such that
$$\inf_{\beta: |\beta - \beta^*|_2 \le r_0}\ \inf_{v \in \mathbb{S}^{p-1}} v^T\nabla^2 L(\beta)v \ge \rho_0, \qquad (25)$$
and $r_0 > 3\rho_0^{-1}|\nabla L(\beta^*)|_2$. Then the minimizer $\hat\beta$ of $L(\beta)$ satisfies $|\hat\beta - \beta^*|_2 \le 3\rho_0^{-1}|\nabla L(\beta^*)|_2$.

Proof For simplicity, denote $b_n = 3\rho_0^{-1}|\nabla L(\beta^*)|_2$ and construct the set $\Theta_1 = \{\beta : |\beta - \beta^*|_2 = b_n\}$ in the parameter space. Then clearly $\Theta_1 \subset \{\beta : |\beta - \beta^*|_2 \le r_0\}$. We will show that $L(\beta) > L(\beta^*)$ holds strictly and uniformly for all $\beta \in \Theta_1$.
To show this, we compute that 0 (β β )T L β + t(β β ) dt =(β β )T L(β ) + Z 1 0 (1 t)(β β )T 2L β + t(β β) (β β )dt 2 |β β |2 2 | L(β )|2 |β β |2 2 3ρ 1 0 | L(β )|2|β β |2 | L(β )|2 |β β |2 > 0. Then by convexity of the loss function L(β), we have that |β β |2 3 ρ0 | L(β )|2, which completes the proof. Lemma 10 Suppose Assumption 1 to 3 hold. Assume r = o(1), then for each t 1, there is inf β:|β β |2 r inf v Sp 1 v T 2 e L(t)(β)v ρ holds with probability not less than 1 O((n ) pτ) for some constant τ > 0. Proof For arbitrary β satisfying |β β|2 r, there is v T 2 e L(t)(β)v i H1 ψ XT i β v TXi 2 =v TE ψ (XTβ )XXT v + 1 0 ψ XT i β + t(β β ) XT i (β β ) v TXi 2dt i H1 ψ XT i β Xi XT i E ψ (XTβ )XXT Tu, Liu, and Mao It left to bound the term |T1| and T 2 respectively. Bound of |T1|: From Assumption 3, It is not hard to see that 0 ψ XT i β + t(β β ) XT i (β β ) v TXi 2dt cψr sup v Sp 1 i H1 |v TXi|3 Denote N as the 1/2-net of Sp 1, by Lemma 5.2 of Vershynin (2010) we know that |N| 5p. Then there is i H1 |v TXi|3 i H1 |ev TXi|3 + 4 max ev N sup v:|v ev|2 1/2 i H1 |(ev v)TXi|3 i H1 |ev TXi|3 2 sup v Sp 1 i H1 |v TXi|3 i H1 |v TXi|3 i H1 |ev TXi|3 i H1 |ev TXi|3 E |ev TXi|3 + 8 sup v Sp 1 E |ev TXi|3 . Note that for each ev N, there is max ev N E h exp n η3/2|ev TXi|3 2/3oi sup ev Sp 1 E h exp n η|ev TXi|2oi C0. We know that |ev TXi|3 s are sub-Weibull(2/3) random variables. Therefore, by Theorem 3.1 of Kuchibhotla and Chakrabortty (2022), there exist constants c3, τ > 0 such that i H1 |ev TXi|3 E |ev TXi|3 c3 n + (p log n )3/2 5p max ev N P i H1 |ev TXi|3 E |ev TXi|3 c3 n + (p log n )3/2 5p(n ) τ1p (n ) (τ1 1)p, Distributed Estimation on Semi-Supervised Generalized Linear Model which tends to 0 when τ1 > 1. On the other hand, we compute that sup v Sp 1 E |ev TXi|3 2η3/2 0 sup v Sp 1 E exp η0|ev TXi|2 C0 2η3/2 0 , (29) where the second line uses the elementary inequality |x|3 1 2e|x|. 
Combining (27), (28) and (29) we have that |T1| cψr sup v Sp 1 i H1 |v TXi|3 8cψr max ev N i H1 |ev TXi|3 E |ev TXi|3 + 8cψr sup v Sp 1 E |ev TXi|3 n + (p log n )3/2 Bound of T 2 : Denote N as the 1/4-net of Sp 1, by Lemma 5.2 of Vershynin (2010) we know that |N| 9p. Then there is T 2 = sup v Sp 1 ev TT 2ev + 2 max ev N sup v:|v ev|2 1/4 ev TT 2(ev v) + max ev N sup v:|v ev|2 1/4 (ev v)TT 2(ev v) ev TT 2ev + 1 2 sup v Sp 1 v TT 2v + 1 16 sup v Sp 1 ev TT 2ev . (30) For each ev N, it is not hard to see that ψ XT i β (v TXi)2 is sub-exponential random variable. Therefore we can apply Lemma 1 of Cai and Liu (2011) and yield max ev N 1 n X i H1 ψ XT i β (v TXi)2 E ψ (XTβ )(v TX)2 c4 9p max ev N P i H1 ψ XT i β (v TXi)2 E ψ (XTβ )(v TX)2 c4 9p(n ) τ2p (n ) (τ2 1)p, for some constant τ2 > 0. Substitute this bound into (30) we have that T 2 = o(1). Substitute the bound of |T1| and T 2 into (26) we finally have that v T 2 e L(t)(β)v v TE ψ (XTβ )XXT v |T1| T 2 ρ Tu, Liu, and Mao hold with probability not less than 1 (n ) (τ1 1)p (n ) (τ2 1)p = 1 O((n ) τp) by taking τ = min{τ1 1, τ2 1}, which proves the lemma. Proof [Proof of Theorem 1] For simplicity, we first consider the convergence rate of bβ (1). By Lemma 10, we know the condition (25) is sufficed with high probability. Now using Lemma 9, we know that it is enough to bound | e L(1)(β )|2. Indeed, we can compute that e L(1)(β ) 2 = Lss 1 (β ) Lss 1 (bβ (0)) + 1 j=1 Lj(bβ (0)) Lss 1 (β ) Lss 1 (bβ (0)) E ψ (XTβ )XXT (β bβ (0)) | {z } T1 h Lj(bβ (0)) Lj(β ) E ψ (XTβ )XXT (bβ (0) β ) i For the term T1, there is n ψ (XT i β ) ψ (XT i bβ (0)) o Xi E ψ (XTβ )XXT (β bβ (0)) ψ (XT i β )Xi XT i E ψ (XTβ )XXT (β bβ (0)) 0 ψ XT i {bβ (0) + t(β bβ (0))} Xi Xi(β bβ (0)) 2dt. Following the same strategy as in the proof of Lemma 10, we have that ψ (XT i β )Xi XT i E ψ (XTβ )XXT (β bβ (0)) 0 ψ XT i {bβ (0) + t(β bβ (0))} Xi Xi(β bβ (0)) 2dt = OP pr2 n . 
Therefore we have Similarly, we can show that Distributed Estimation on Semi-Supervised Generalized Linear Model T3 is the average of mn copies of Y X ψ(XTβ )X, thus we can simply apply Lemma 1 of Cai and Liu (2011) and yield Plugging (32), (33) and (34) into (31) we have e L(1)(β ) 2 = OP Therefore, by Lemma 9 we obtain bβ (1) β 2 = OP Apply the above formula iteratively we can obtain Theorem 1. Proof [Proof of Theorem 2] Since |bβ (t 1) β |2 = OP( p p log n /(mn)), follow the same strategy as in the proof of Theorem 1, we can show that |bβ (t) β |2 = OP Lss 1 (β ) Lss 1 (bβ (t)) E ψ (XTβ )XXT (β bβ (t)) 2 mnn + p3/2 log n By optimality condition of bβ (t) we have that 0 = e L(t)(bβ (t)) = Lss 1 (bβ (t)) Lss 1 (β ) + Lss 1 (β ) Lss 1 (bβ (t 1)) + 1 j=1 Lj(bβ (t 1)) = Lss 1 (bβ (t)) Lss 1 (β ) + 1 j=1 Lj(β ) + OP =E ψ (XTβ )XXT (bβ (t) β ) + 1 j=1 Lj(β ) + OP mnn + p3/2 log n bβ (t) β = E ψ (XTβ )XXT 1 1 j=1 Lj(β ) + OP mnn + p3/2 log n Tu, Liu, and Mao For each ev Sp 1, we know i Dj (ψ (XT i β ) Yi)ev TXi, which is the average of mn i.i.d. terms. Compute that Var h (ψ (XT i β ) Yi)ev TXi i = c(σ)E h ψ (XTβ )(ev TX)2i . Therefore, by the central limit theorem, there is mn σ(ev) ev T 1 j=1 Lj(β ) d N(0, 1), {σ(ev)}2 = ev TCev, and C = c(σ)E ψ (XTβ )XXT . Next we assume (p log n )2 = o(n ) and p3 log2 n = o(mn), and replace ev by E ψ (XTβ )XXT 1 v, we can obtain the asymptotic normality result for v T(bβ (t) β ). Proof [Proof of Theorem 4] We first consider the convergence rate of bβ (1) Avg. 
For each bβ (1) j = argmin e L(1) j (β) (where j U), by Theorem 1 we know that bβ (1) j β 2 = OP By the optimality condition, there is 0 = e L(1) j (bβ (1) j ) = Lss j (bβ (1) j ) Lss j (bβ (0)) + 1 k=1 Lk(bβ (0)) =E n ψ (XTβ )XXTo (bβ (1) j β ) + Lss j (bβ (1) j ) Lss j (β ) E n ψ (XTβ )XXTo (bβ (1) j β ) + Lss j (β ) Lss j (bβ (0)) + 1 k=1 Lk(bβ (0)) bβ (1) j β = h E n ψ (XTβ )XXToi 1 " Lss j (bβ (1) j ) Lss j (β ) E n ψ (XTβ )XXTo (bβ (1) j β ) + Lss j (β ) Lss j (bβ (0)) + 1 k=1 Lk(bβ (0)) Distributed Estimation on Semi-Supervised Generalized Linear Model Take the average over j U and take the norm on both sides, we have that bβ (1) Avg β 2 = j U bβ (1) j β n Lss j (bβ (1) j ) Lss j (β ) E n ψ (XTβ )XXTo (bβ (1) j β ) o 2 n Lss j (β ) Lss j (bβ (0)) + Lj(bβ (0)) o 2 Lss j (bβ (1) j ) Lss j (β ) E n ψ (XTβ )XXTo (bβ (1) j β ) 2 | {z } T1 n Lss j (β ) Lss j (bβ (0)) + Lj(bβ (0)) Following the strategy of the proof of Theorem 1, we can prove that max 1 j m |bβ (1) j β |2 n + p max 1 j m |bβ (1) j β |2 2 ! mnn + rn p log n n + p3/2r4 n |U|n + pr2 n Plug them into (35) we have that bβ (1) Avg β 2 = OP mn + rn p log n |U|n + pr2 n Then Theorem 4 can be proved by applying Theorem 4 inductively. Appendix B. Proof of Theories in Section 3.2 Proof [Proof of Proposition 6] Notice that in the linear regression model, the canonical link function ψ (x) = x satisfies ψ (x) = 0. Therefore, in the proof of Theorem 1, we can obtain that Lss 1 (β ) Lss 1 (bβ (0)) E ψ XTβ XXT (β bβ (0)) 2 = OP rn j=1 Lj(β ) 1 j=1 Lj(bβ (0)) E ψ XTβ XXT (β bβ (0)) 2 = OP rn Tu, Liu, and Mao where the term pr2 n is eliminated. Therefore, the corollary is proved by plugging the above bound into (31) of Theorem 1. Proof [Proof of Proposition 7] For linear model, given the initial estimator bβ (0), the onestep SSDANE can be written explicitly as follows bβ (1) = bβ (0) (bΣ ss 1 ) 1n 1 i Dj Xi XT i bβ (0) 1 mn i Dj Yi Xi o , where bΣ ss 1 = (n ) 1 P i H1 Xi XT i . 
Then the statistical error bβ (1) β can be written down explicitly as follows =(bΣ ss 1 ) 1n 1 i H0 Xi XT i (bβ (0) β ) 1 mn i Dj Xi XT i bβ (0) + 1 mn i Dj Yi Xi o =(bΣ ss 1 ) 1h 1 i H1 Xi XT i 1 mn i Dj Xi XT i (bβ (0) β ) + 1 mn i Dj ϵi Xi i =(bΣ ss 1 ) 1 " X Xi XT i Σ (bβ (0) β ) + X Xi XT i Σ (bβ (0) β ) + 1 mnϵi Xi o n Xi XT i Σ (bβ (0) β ) + ϵi Xi o# (bΣ ss 1 ) 1T . (36) Since we already assume that bβ (0) is independent to all Xi s and ϵi s, there are sum of m(n 1)+n independent, zero-mean terms in the square bracket. For notational convenience we denote Zi = Xi XT i Σ (bβ (0) β )/|bβ (0) β |2 and W i = ϵi Xi, then we have that Cov(Zi) = Cov Xi XT i bβ (0) β |bβ (0) β |2 , Cov(W i) = σ2Σ, Cov(Zi, W i) = 0. Thus we can compute that (n )2 r2 n Cov(Z) + n(mn n )2 (mnn )2 r2 n Cov(Z) + n (mn)2 Cov(W ) + n(m 1) (mn)2 r2 n Cov(Z) + n(m 1) (mn)2 Cov(W ) r2 n Cov(Z) + 1 mn Cov(W ). (37) Distributed Estimation on Semi-Supervised Generalized Linear Model Moreover, there is mnn rn Zi + 1 mn W i 1 mnrn Zi + 1 mn W i 2 + 1 mn W i 2 + 1 mn W i (n )2 + 16 (mn)2 r3 n E |Z|3 2 + 4 (mn)2 E |W |3 2 C0p3/2 r3 n (n )2 + 1 (mn)2 for some constant C0 > 0. Denote e T = (Cov(T )) 1/2T , then we can apply multivariate Berry-Esseen theorem (see Theorem 1.1 of Raiˇc (2019)) and yield P(e T S) N(0, Ip){S} C1p7/4 min 1 n , (rn mn)3 for all measurable convex S Rp. Let T = (T 1, ..., T p)T follows p-dimensional standard Gaussian distribution N(0, Ip), then we can apply Lemma 1 of Cai and Liu (2011) to the i.i.d. sequence T 2 i 1 and obtain that l=1 (T 2 l 1) p l=1 (T 2 l 1) p for sufficiently large p. 
Then there is p/2) P(|T | p p/2) P(|e T | p p/2) P(|T | p By assumption λmin(Cov(Z)), λmin(Cov(W )) ρ1, then from (37) we know λmin Cov(T ) C3ρ1 r2 n n + 1 mn Then there is P |T 2|2 C3ρ1 P |T 2|2 rp Following the strategy, in the proof of Lemma 10 we can show that P λmin (bΣ ss) 1 ρ = P λmax bΣ ss 2ρ 1 1 Tu, Liu, and Mao Therefore by combining (38) and (36), we know that P |bβ (1) β |2 ρ P |T 2|2 C3ρ1 2 ; λmin (bΣ ss) 1 ρ P |T 2|2 C3ρ1 P λmin (bΣ ss) 1 ρ which proves the theorem. Appendix C. Proof of Theories in Section 3.3 Proof [Proof of Proposition 8] We first prove the convergence rate of bβ ss. Notice that the semi-supervised empirical loss Lss(β) defined in (5) has the same Hessian matrix as the surrogate loss e L(1)(β) defined in (10). Therefore by Lemma 10, it holds that inf β:|β β |2 r inf v Sp 1 v T 2Lss(β)v ρ Then by Lemma 9, we only need to bound the term | Lss(β )|2. Note that for arbitrary unit vector v Sp 1, there is i D Yiv TXi + 1 i H ψ (XT i β )v TXi Yiv TXi + E ψ (XT i β )v TXi + 1 ψ (XT i β )v TXi E ψ (XT i β )v TXi , Denote E(X, η) = E[X2 exp(η|X|)], we compute that E Yiv TXi + E ψ (XT i β )v TXi , η0 η2 0 E h exp 2η0 Yiv TXi + E ψ (XT i β )v TXi i n E h exp 2η0 Yiv TXi io2 n E h exp η0 Yi 2 exp η0 v TXi 2 io2 1 where the second line uses the elementary inequality x2 ex, the third line uses Jensen s inequality, and the last line uses Cauchy s inequality. Similarly, we have that E ψ (XT i β )v TXi E ψ (XT i β )v TXi , η0 1 Distributed Estimation on Semi-Supervised Generalized Linear Model Then we can apply Lemma 1 of Cai and Liu (2011) and yield that Yiv TXi + E ψ (XT i β )v TXi = OP ψ (XT i β )v TXi E ψ (XT i β )v TXi = OP Therefore, we know that there exists a constant c1 such that | Lss(β )|2 = e T l Lss(β ) 2 with probability not less than 1 (n ) τ, where τ > 0 is some positive constant and el (where l = 1, ..., p) denotes the l-th coordinate vectors. 
Therefore, by Lemma 9, we know that
$$
\big| \hat{\beta}_{ss} - \beta^* \big|_2 = O_P\Big( \sqrt{\frac{p\log n^*}{n}} \Big).
$$
On the other hand, by the definition of $\hat{\beta}_{ss}$ in (22) we have that
$$
0 = \nabla\mathcal{L}_{ss}(\hat{\beta}_{ss}) = -\frac{1}{n}\sum_{i\in\mathcal{D}} Y_iX_i + \frac{1}{n^*}\sum_{i\in\mathcal{H}} \psi'(X_i^{\mathrm{T}}\hat{\beta}_{ss})X_i
$$
$$
= -\frac{1}{n}\sum_{i\in\mathcal{D}} Y_iX_i + \frac{1}{n^*}\sum_{i\in\mathcal{H}} \psi'(X_i^{\mathrm{T}}\beta^*)X_i + \frac{1}{n^*}\sum_{i\in\mathcal{H}}\big\{ \psi'(X_i^{\mathrm{T}}\hat{\beta}_{ss}) - \psi'(X_i^{\mathrm{T}}\beta^*) \big\}X_i
$$
$$
= \nabla\mathcal{L}_{ss}(\beta^*) + \frac{1}{n^*}\sum_{i\in\mathcal{H}}\int_0^1 \psi''\big( X_i^{\mathrm{T}}\beta^* + tX_i^{\mathrm{T}}(\hat{\beta}_{ss}-\beta^*) \big)\,dt\; X_iX_i^{\mathrm{T}}(\hat{\beta}_{ss}-\beta^*). \quad (40)
$$
The last term can be decomposed as
$$
\frac{1}{n^*}\sum_{i\in\mathcal{H}}\int_0^1 (1-t)\,\psi'''\big( X_i^{\mathrm{T}}\beta^* + tX_i^{\mathrm{T}}(\hat{\beta}_{ss}-\beta^*) \big)\big\{ X_i^{\mathrm{T}}(\hat{\beta}_{ss}-\beta^*) \big\}^2X_i\,dt
+ \Big( \frac{1}{n^*}\sum_{i\in\mathcal{H}} \psi''(X_i^{\mathrm{T}}\beta^*)X_iX_i^{\mathrm{T}} - E[\psi''(X^{\mathrm{T}}\beta^*)XX^{\mathrm{T}}] \Big)(\hat{\beta}_{ss}-\beta^*)
+ E[\psi''(X^{\mathrm{T}}\beta^*)XX^{\mathrm{T}}](\hat{\beta}_{ss}-\beta^*).
$$
Using the results in the proof of Lemma 10, we know that
$$
\Big| \frac{1}{n^*}\sum_{i\in\mathcal{H}}\int_0^1 (1-t)\,\psi'''(\cdot)\big\{ X_i^{\mathrm{T}}(\hat{\beta}_{ss}-\beta^*) \big\}^2X_i\,dt \Big|_2 = O_P\big( \sqrt{p}\,|\hat{\beta}_{ss}-\beta^*|_2^2 \big),
$$
$$
\Big| \Big( \frac{1}{n^*}\sum_{i\in\mathcal{H}} \psi''(X_i^{\mathrm{T}}\beta^*)X_iX_i^{\mathrm{T}} - E[\psi''(X^{\mathrm{T}}\beta^*)XX^{\mathrm{T}}] \Big)(\hat{\beta}_{ss}-\beta^*) \Big|_2 = O_P\Big( \sqrt{\frac{p}{n^*}}\,|\hat{\beta}_{ss}-\beta^*|_2 \Big).
$$
Substituting these bounds into (40), we have that
$$
E[\psi''(X^{\mathrm{T}}\beta^*)XX^{\mathrm{T}}](\hat{\beta}_{ss}-\beta^*) = -\nabla\mathcal{L}_{ss}(\beta^*) + O_P\Big( \sqrt{p}\,|\hat{\beta}_{ss}-\beta^*|_2^2 + \sqrt{\frac{p}{n^*}}\,|\hat{\beta}_{ss}-\beta^*|_2 \Big),
$$
and therefore
$$
\hat{\beta}_{ss} - \beta^* = -\big\{ E[\psi''(X^{\mathrm{T}}\beta^*)XX^{\mathrm{T}}] \big\}^{-1}\nabla\mathcal{L}_{ss}(\beta^*) + O_P\Big( \sqrt{p}\,|\hat{\beta}_{ss}-\beta^*|_2^2 + \sqrt{\frac{p}{n^*}}\,|\hat{\beta}_{ss}-\beta^*|_2 \Big). \quad (41)
$$
For each $\tilde{v}\in\mathbb{S}^{p-1}$, we know
$$
\tilde{v}^{\mathrm{T}}\nabla\mathcal{L}_{ss}(\beta^*) = -\frac{1}{n}\sum_{i\in\mathcal{D}} Y_i\tilde{v}^{\mathrm{T}}X_i + \frac{1}{n^*}\sum_{i\in\mathcal{H}} \psi'(X_i^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X_i
$$
$$
= \sum_{i\in\mathcal{D}}\Big\{ -\frac{1}{n}Y_i\tilde{v}^{\mathrm{T}}X_i + \frac{1}{n^*}\psi'(X_i^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X_i + \frac{n^*-n}{nn^*}E[\psi'(X^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X] \Big\} + \sum_{i\in\mathcal{H}\setminus\mathcal{D}}\Big\{ \frac{1}{n^*}\psi'(X_i^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X_i - \frac{1}{n^*}E[\psi'(X^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X] \Big\},
$$
which is the sum of $n^*$ independent, zero-mean random variables. Compute that
$$
\mathrm{Var}\Big( -\frac{1}{n}Y_i\tilde{v}^{\mathrm{T}}X_i + \frac{1}{n^*}\psi'(X_i^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X_i + \frac{n^*-n}{nn^*}E[\psi'(X^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X] \Big)
= \frac{1}{n^2}\mathrm{Var}\Big[ \big( Y_i - \psi'(X_i^{\mathrm{T}}\beta^*) \big)\tilde{v}^{\mathrm{T}}X_i \Big] + \frac{(n^*-n)^2}{(nn^*)^2}\mathrm{Var}\Big[ \psi'(X_i^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X_i \Big]
$$
$$
= \frac{c(\sigma)}{n^2}E\Big[ \psi''(X^{\mathrm{T}}\beta^*)(\tilde{v}^{\mathrm{T}}X)^2 \Big] + \frac{(n^*-n)^2}{(nn^*)^2}\mathrm{Var}\Big[ \psi'(X^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X \Big],
$$
and
$$
\mathrm{Var}\Big( \frac{1}{n^*}\psi'(X_i^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X_i - \frac{1}{n^*}E[\psi'(X^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X] \Big) = \frac{1}{(n^*)^2}\mathrm{Var}\Big[ \psi'(X^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X \Big].
$$
Since from Assumption 2 we know that both $\tilde{v}^{\mathrm{T}}X_i$ and $\psi'(X_i^{\mathrm{T}}\beta^*)$ are sub-Gaussian random variables, the Lindeberg condition for the central limit theorem is satisfied.
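For reference, the Lindeberg condition being verified here, stated for a generic triangular array $\{\xi_{n,i}\}_{i\le n^*}$ of independent, zero-mean summands with $s_n^2 = \sum_i \mathrm{Var}(\xi_{n,i})$, reads:

```latex
% Lindeberg condition: for every \varepsilon > 0,
\frac{1}{s_n^2}\sum_{i=1}^{n^*} E\big[ \xi_{n,i}^2\,
    \mathbf{1}\{ |\xi_{n,i}| > \varepsilon s_n \} \big] \longrightarrow 0 .
% Sub-Gaussian summands with variances of comparable order satisfy it,
% since the truncated second moment decays exponentially fast in
% (\varepsilon s_n)^2 by the sub-Gaussian tail bound.
```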
Therefore, by the central limit theorem (see, e.g., Theorem 9.1.1 of Chow and Teicher (2012)), we have that
$$
\sqrt{n}\,\sigma(\tilde{v})^{-1}\,\tilde{v}^{\mathrm{T}}\nabla\mathcal{L}_{ss}(\beta^*) \overset{d}{\longrightarrow} N(0, 1), \quad (42)
$$
where
$$
\sigma(\tilde{v})^2 = n^2\,\mathrm{Var}\Big( -\frac{1}{n}Y_i\tilde{v}^{\mathrm{T}}X_i + \frac{1}{n^*}\psi'(X_i^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X_i + \frac{n^*-n}{nn^*}E[\psi'(X^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X] \Big) + (n^*-n)\,n\,\mathrm{Var}\Big( \frac{1}{n^*}\psi'(X_i^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X_i - \frac{1}{n^*}E[\psi'(X^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X] \Big)
$$
$$
= c(\sigma)E\Big[ \psi''(X^{\mathrm{T}}\beta^*)(\tilde{v}^{\mathrm{T}}X)^2 \Big] + \frac{n^*-n}{n^*}\mathrm{Var}\Big[ \psi'(X^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X \Big].
$$
We can also write it in the following way:
$$
\sigma(\tilde{v})^2 = \tilde{v}^{\mathrm{T}}\Big( C + \frac{n^*-n}{n^*}C_{ss} \Big)\tilde{v}, \qquad \text{where } C = c(\sigma)E\big[ \psi''(X^{\mathrm{T}}\beta^*)XX^{\mathrm{T}} \big], \quad C_{ss} = \mathrm{Cov}\big( \psi'(X^{\mathrm{T}}\beta^*)X \big).
$$
Next we assume $p^{3/2}\log n = o(n)$; replacing $\tilde{v}$ by $\{E[\psi''(X^{\mathrm{T}}\beta^*)XX^{\mathrm{T}}]\}^{-1}v$ (suitably normalized) in (42) and combining with (41), we can obtain the asymptotic normality result for $v^{\mathrm{T}}(\hat{\beta}_{ss}-\beta^*)$.

References

Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn. Res., 6(61):1817–1853, 2005.

Rie Kubota Ando and Tong Zhang. Two-view feature generation model for semi-supervised learning. In Proceedings of the 24th International Conference on Machine Learning, pages 25–32, 2007.

David Azriel, Lawrence D. Brown, Michael Sklar, Richard Berk, Andreas Buja, and Linda Zhao. Semi-supervised linear regression. J. Amer. Statist. Assoc., 117(540):2238–2251, 2022.

Heather Battey, Jianqing Fan, Han Liu, Junwei Lu, and Ziwei Zhu. Distributed testing and estimation under sparse high dimensional models. Ann. Statist., 46(3):1352–1382, 2018.

Pierre C. Bellec, Arnak S. Dalalyan, Edwin Grappin, and Quentin Paris. On the prediction loss of the lasso in the partially labeled setting. Electron. J. Stat., 12(2):3443–3472, 2018.

Peter J. Bickel and Elizaveta Levina. Covariance regularization by thresholding. Ann. Statist., 36(6):2577–2604, 2008.

Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100, 1998.

Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein.
Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122, 2011.

T. Tony Cai and Zijian Guo. Semisupervised inference for explained variance in high dimensional linear regression and its applications. J. R. Stat. Soc. Ser. B. Stat. Methodol., 82(2):391–419, 2020.

Tianxi Cai, Molei Liu, and Yin Xia. Individual data protected integrative regression analysis of high-dimensional heterogeneous data. J. Amer. Statist. Assoc., 117(540):2105–2119, 2022.

Tony Cai and Weidong Liu. Adaptive thresholding for sparse covariance matrix estimation. J. Amer. Statist. Assoc., 106(494):672–684, 2011.

Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečný, H. Brendan McMahan, Virginia Smith, and Ameet Talwalkar. LEAF: A benchmark for federated settings. arXiv e-prints, page arXiv:1812.01097, 2018.

Abhishek Chakrabortty and Tianxi Cai. Efficient and adaptive linear regression in semi-supervised settings. Ann. Statist., 46(4):1541–1572, 2018.

Xiangyu Chang, Shao-Bo Lin, and Ding-Xuan Zhou. Distributed semi-supervised learning with kernel ridge regression. J. Mach. Learn. Res., 18(1):1493–1514, 2017.

Dengsheng Chen, Jie Hu, Vince Junkai Tan, Xiaoming Wei, and Enhua Wu. Elastic aggregation for federated optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12187–12197, 2023.

Tianyi Chen, Georgios Giannakis, Tao Sun, and Wotao Yin. LAG: Lazily aggregated gradient for communication-efficient distributed learning. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Xi Chen, Weidong Liu, and Yichen Zhang. Quantile regression under memory constraint. Ann. Statist., 47(6):3244–3273, 2019.

Y. S. Chow and H. Teicher. Probability Theory: Independence, Interchangeability, Martingales. Springer Texts in Statistics. Springer New York, 2012.

Siyi Deng, Yang Ning, Jiwei Zhao, and Heping Zhang.
Optimal semi-supervised estimation and inference for high-dimensional linear regression. arXiv e-prints, art. arXiv:2011.14185, 2020.

Edgar Dobriban and Yue Sheng. WONDER: Weighted one-shot distributed ridge regression in high dimensions. J. Mach. Learn. Res., 21(66):1–52, 2020.

Edgar Dobriban and Yue Sheng. Distributed linear regression by averaging. Ann. Statist., 49(2):918–943, 2021.

Moming Duan, Duo Liu, Xianzhang Chen, Renping Liu, Yujuan Tan, and Liang Liang. Self-balancing federated learning with global imbalanced data in mobile systems. IEEE Trans. Parallel Distrib. Syst., 32(1):59–71, 2021a.

Rui Duan, Yang Ning, and Yong Chen. Heterogeneity-aware and communication-efficient distributed statistical inference. Biometrika, 109(1):67–83, 2021b. ISSN 1464-3510.

Zheng-Chu Guo, Lei Shi, and Qiang Wu. Learning theory of distributed regression with bias corrected regularization kernel network. J. Mach. Learn. Res., 18(118):1–25, 2017.

Zheng-Chu Guo, Shao-Bo Lin, and Lei Shi. Distributed learning with multi-penalty regularization. Appl. Comput. Harmon. Anal., 46(3):478–499, 2019.

Zijian Guo, Prabrisha Rakshit, Daniel S. Herman, and Jinbo Chen. Inference for the case probability in high-dimensional logistic regression. J. Mach. Learn. Res., 22(1), 2022. ISSN 1532-4435.

Jue Hou, Zijian Guo, and Tianxi Cai. Surrogate assisted semi-supervised inference for high dimensional risk prediction. J. Mach. Learn. Res., 24(265):1–58, 2023.

Ting Hu and Ding-Xuan Zhou. Distributed regularized least squares with flexible Gaussian kernels. Appl. Comput. Harmon. Anal., 53:349–377, 2021.

Martin Jaggi, Virginia Smith, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I. Jordan. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems, volume 27, 2014.

Jianqing Fan, Yongyi Guo, and Kaizheng Wang.
Communication-efficient accurate statistical estimation. J. Amer. Statist. Assoc., 118(542):1000–1010, 2023.

Rie Johnson and Tong Zhang. Graph-based semi-supervised learning and spectral kernel design. IEEE Trans. Inform. Theory, 54(1):275–288, 2008.

Michael I. Jordan, Jason D. Lee, and Yun Yang. Communication-efficient distributed statistical inference. J. Amer. Statist. Assoc., 114(526):668–681, 2019.

Arun Kumar Kuchibhotla and Abhishek Chakrabortty. Moving beyond sub-Gaussianity in high-dimensional statistics: applications in covariance estimation and linear regression. Inf. Inference, 11(4):1389–1456, 2022. ISSN 2049-8772.

Jason D. Lee, Qiang Liu, Yuekai Sun, and Jonathan E. Taylor. Communication-efficient sparse regression. J. Mach. Learn. Res., 18(1):115–144, 2017.

Runze Li, Dennis K. J. Lin, and Bing Li. Statistical inference in massive data sets. Appl. Stoch. Models Bus. Ind., 29(5):399–409, 2013.

Heng Lian and Zengyan Fan. Divide-and-conquer for debiased l1-norm support vector machine in ultra-high dimensions. J. Mach. Learn. Res., 18(182):1–26, 2018.

Shao-Bo Lin and Ding-Xuan Zhou. Distributed kernel-based gradient descent algorithms. Constr. Approx., 47(2):249–276, 2018.

Shao-Bo Lin, Xin Guo, and Ding-Xuan Zhou. Distributed learning with regularized least squares. J. Mach. Learn. Res., 18(92):1–31, 2017.

Molei Liu, Yin Xia, Kelly Cho, and Tianxi Cai. Integrative high dimensional multiple testing with heterogeneity under data sharing constraints. J. Mach. Learn. Res., 22(1), 2022. ISSN 1532-4435.

Rong Ma, Zijian Guo, T. Tony Cai, and Hongzhe Li. Statistical inference for genetic relatedness based on high-dimensional logistic regression. arXiv e-prints, page arXiv:2202.10007, 2022.

Thomas P. Minka. A comparison of numerical optimizers for logistic regression. Unpublished draft, pages 1–18, 2003.

Ravi B. Parikh, Christopher Manz, Corey Chivers, Susan Harkness Regli, Jennifer Braun, Michael E. Draugelis, Lynn M. Schuchter, Lawrence N.
Shulman, Amol S. Navathe, Mitesh S. Patel, and Nina R. O'Connor. Machine learning approaches to predict 6-month mortality among patients with cancer. JAMA Network Open, 2(10):e1915997–e1915997, 2019. ISSN 2574-3805.

Martin Raič. A multivariate Berry-Esseen theorem with explicit constants. Bernoulli, 25(4A):2824–2853, 2019.

Ohad Shamir, Nati Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In Proceedings of the 31st International Conference on Machine Learning, volume 32, pages 1000–1008, 2014.

Yue Sheng and Edgar Dobriban. One-shot distributed ridge regression in high dimensions. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 8763–8772, 2020.

Virginia Smith, Simone Forte, Chenxin Ma, Martin Takáč, Michael I. Jordan, and Martin Jaggi. CoCoA: A general framework for communication-efficient distributed optimization. J. Mach. Learn. Res., 18(1):8590–8638, 2017.

Sebastian U. Stich. Local SGD converges fast and communicates little. In 7th International Conference on Learning Representations, ICLR 2019, May 6-9, 2019, 2019.

I. Theodossiou. The effects of low-pay and unemployment on psychological well-being: A logistic regression approach. J. Health Econ., 17(1):85–104, 1998. ISSN 0167-6296.

Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 1999.

Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv e-prints, art. arXiv:1011.3027, 2010.

Stanislav Volgushev, Shih-Kang Chao, and Guang Cheng. Distributed inference for quantile regression processes. Ann. Statist., 47(3):1634–1662, 2019.

Jialei Wang, Mladen Kolar, Nathan Srebro, and Tong Zhang. Efficient distributed learning with sparsity. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3636–3645, 2017.
Junhui Wang and Xiaotong Shen. Large margin semi-supervised learning. J. Mach. Learn. Res., 8(65):1867–1891, 2007.

Junhui Wang, Xiaotong Shen, and Yufeng Liu. Probability estimation for large-margin classifiers. Biometrika, 95(1):149–167, 2007.

Junhui Wang, Xiaotong Shen, and Wei Pan. On efficient large margin semi-supervised learning: Method and theory. J. Mach. Learn. Res., 10(25):719–742, 2009.

Larry Wasserman and John Lafferty. Statistical analysis of semi-supervised regression. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007.

Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):5693–5700, 2019.

Anru Zhang, Lawrence D. Brown, and T. Tony Cai. Semi-supervised inference: General theory and estimation of means. Ann. Statist., 47(5):2538–2566, 2019.

Yuchen Zhang, John C. Duchi, and Martin J. Wainwright. Communication-efficient algorithms for statistical optimization. J. Mach. Learn. Res., 14:3321–3363, 2013.

Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. J. Mach. Learn. Res., 16(1):3299–3340, 2015.

Yuqian Zhang and Jelena Bradic. High-dimensional semi-supervised learning: in search of optimal inference of the mean. Biometrika, 109(2):387–403, 2021.

Weihua Zhao, Fode Zhang, and Heng Lian. Debiasing and distributed estimation for high-dimensional quantile regression. IEEE Trans. Neural Netw. Learn. Syst., 31(7):2569–2577, 2020.

Xiaojin Zhu and Andrew B. Goldberg. Introduction to semi-supervised learning. Synth. Lect. Artif. Intell. Mach. Learn., 3(1):1–130, 2009.

Xiaojin Jerry Zhu. Semi-supervised learning literature survey. Unpublished draft, 2005.