Journal of Machine Learning Research 25 (2024) 1-41. Submitted 6/22; Revised 9/23; Published 3/24.

Distributed Estimation on Semi-Supervised Generalized Linear Model

Jiyuan Tu (tujy.19@gmail.com), School of Mathematics, Shanghai University of Finance and Economics, Shanghai, 200433, China

Weidong Liu (weidongl@sjtu.edu.cn), School of Mathematical Sciences, MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, 200240, China

Xiaojun Mao (maoxj@sjtu.edu.cn), School of Mathematical Sciences, Ministry of Education Key Laboratory of Scientific and Engineering Computing, Shanghai Jiao Tong University, Shanghai, 200240, China

Editor: Sanmi Koyejo

Abstract

Semi-supervised learning is devoted to using unlabeled data to improve the performance of machine learning algorithms. In this paper, we study the semi-supervised generalized linear model (GLM) in the distributed setup. In the cases of single or multiple machines containing unlabeled data, we propose two distributed semi-supervised algorithms based on the distributed approximate Newton method. When the labeled local sample size is small, our algorithms still give a consistent estimate, while fully supervised methods fail to converge. Moreover, we theoretically prove that the convergence rate is greatly improved when sufficient unlabeled data are available. Therefore, the proposed method requires far fewer rounds of communication to achieve the optimal rate than its fully supervised counterpart. In the case of the linear model, we prove a lower bound on the rate after one round of communication, which shows that the rate improvement is essential. Finally, several simulation analyses and real data studies are provided to demonstrate the effectiveness of our method.

Keywords: Distributed Learning, Semi-supervised Learning, Generalized Linear Model

1. Introduction

Distributed machine learning has attracted much attention in statistics in recent years.
A variety of distributed algorithms have been proposed for statistical inference problems, many interesting properties have been investigated, and the scope of distributed machine learning has expanded remarkably. As one of the most fundamental ideas in distributed learning, the divide-and-conquer (DC) method often delivers impressive performance in statistical models. For quantile regression (Volgushev et al., 2019; Chen et al., 2019), kernel ridge regression (Zhang et al., 2015; Lin et al., 2017) and kernel density estimation (Li et al., 2013), the optimal rates of DC depend on the number of machines and the scale of samples. For high-dimensional linear regression and the support vector machine, Lee et al. (2017); Lian and Fan (2018); Battey et al. (2018); Zhao et al. (2020) proved that the accuracy of DC can be remarkably improved by a well-known debiasing technique in statistics. For iterative algorithms like the Distributed Approximate NEwton (DANE) method (Shamir et al., 2014), Jordan et al. (2019); Jianqing Fan and Wang (2023) proved that with an appropriate initial estimator, DANE and its variants can achieve the optimal statistical rate within a constant number of communication rounds. There are also many other distributed learning algorithms, such as the alternating direction method of multipliers (Boyd et al., 2011), the subsampled average mixture algorithm (Zhang et al., 2013), local stochastic gradient descent (Stich, 2019; Yu et al., 2019), the distributed dual coordinate ascent method (Jaggi et al., 2014; Smith et al., 2017), the lazily aggregated gradient method (Chen et al., 2018), and so on. In this article, we focus on distributed semi-supervised estimation for the generalized linear model (GLM).

Weidong Liu and Xiaojun Mao are the co-corresponding authors.

©2024 Jiyuan Tu, Weidong Liu, and Xiaojun Mao. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/22-0670.html.
There are various practical problems, especially in the era of big data, that involve the analysis of massive unlabeled datasets (e.g., electronic medical records, textual data). Semi-supervised learning (SSL) has demonstrated considerable success in machine learning problems in terms of prediction precision. In non-distributed settings, a growing body of literature demonstrates that SSL can be of great use in classification (see, e.g., Ando and Zhang (2005, 2007); Blum and Mitchell (1998); Vapnik (1999); Wang and Shen (2007); Wang et al. (2007, 2009); Zhu (2005); Zhu and Goldberg (2009)). Semi-supervised estimation problems with continuous labels have also been studied by Wasserman and Lafferty (2007); Johnson and Zhang (2008); Chakrabortty and Cai (2018); Zhang et al. (2019); Cai and Guo (2020); Zhang and Bradic (2021); Hou et al. (2023); Azriel et al. (2022). In recent years, distributed SSL has emerged as an exciting new area of research in machine learning. A novel study of kernel ridge regression by Chang et al. (2017) has shown that the semi-supervised DC method benefits significantly from the unlabeled data and allows a more flexible number of machines than the ordinary DC. Works along this line include distributed SSL for kernel-based gradient descent algorithms (Lin and Zhou, 2018), bias-corrected kernel ridge regression (Guo et al., 2017), learning with multi-penalty regularization (Guo et al., 2019) and flexible Gaussian kernels (Hu and Zhou, 2021). A common feature of these prior works on distributed SSL is that they all focus on one-shot algorithms. It is well known that iterative algorithms can lead to more accurate estimates than one-shot algorithms. In this paper, we study distributed SSL for generalized linear models using the approximate Newton iteration method.
Within this paper, we concentrate specifically on the scenario where data are distributed across several computing units, some of which possess supplementary unlabeled data. This setup aligns with numerous practical situations; for instance, in the medical field, multiple hospitals collaborate to enhance the efficacy of the model through sharing medical data, which typically encompass ample unlabeled covariates (Hou et al., 2023; Liu et al., 2022; Cai et al., 2022). Throughout this paper, we direct our focus towards the generalized linear model (GLM), a widely utilized statistical model that has been implemented across diverse fields such as healthcare (Parikh et al., 2019; Guo et al., 2022), genetic analysis (Ma et al., 2022), and economics (Theodossiou, 1998). One of the reasons we choose to consider GLM is not only its broad range of applications but also the unique form of its loss function, which will be delved into in more detail in Section 2.1. In the following, we introduce the generalized linear model. Given a function $\psi:\mathbb{R}\to\mathbb{R}$, the observations $(X, Y)\in\mathbb{R}^{p+1}$ are generated according to the following conditional probability function
$$\mathbb{P}(Y \mid X) = e^{c}\exp\Big\{\frac{Y X^{\top}\beta^{*} - \psi(X^{\top}\beta^{*})}{c(\sigma)}\Big\}, \qquad (1)$$
where $e^{c}$ and $c(\sigma)$ are scale constants, and $\beta^{*}$ is the true model parameter. Here $\psi'$, the derivative of $\psi$, is called the canonical link function of the generalized linear model. To estimate $\beta^{*}$, the most straightforward estimator is the minimizer of the (empirical) negative log-likelihood, whose per-observation loss is
$$f(X, Y, \beta) = -Y X^{\top}\beta + \psi(X^{\top}\beta). \qquad (2)$$
In SSL, a large unlabeled dataset only contains information on the predictors $X$. It is thus of great interest to develop a distributed SSL algorithm using both the labeled and unlabeled datasets. However, as we will prove, naively incorporating the unlabeled dataset into DANE with the loss function (2) leads to a degradation of the estimation precision.
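To make the loss (2) concrete, the short sketch below evaluates the empirical negative log-likelihood, using the logistic cumulant $\psi(t) = \log(1 + e^t)$ as one instance of $\psi$; the function names are ours, for illustration only.

```python
import numpy as np

def psi_logistic(t):
    # cumulant function psi(t) = log(1 + e^t) for logistic regression,
    # computed stably via logaddexp for large |t|
    return np.logaddexp(0.0, t)

def glm_loss(X, Y, beta, psi=psi_logistic):
    # empirical negative log-likelihood from (2): mean of -Y_i X_i^T beta + psi(X_i^T beta)
    eta = X @ beta
    return np.mean(-Y * eta + psi(eta))

# toy check: at beta = 0 every linear predictor is 0 and psi(0) = log 2
X = np.ones((4, 2)); Y = np.array([0., 1., 1., 0.]); beta = np.zeros(2)
print(np.isclose(glm_loss(X, Y, beta), np.log(2.0)))  # True
```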
The primary reason is that the unlabeled predictors do not carry information about $\beta^{*}$ in the absence of the corresponding responses, and may thus introduce higher uncertainty. On the other hand, in GLM the predictors are directly related to the Hessian matrix of the loss (2). In other words, the unlabeled dataset is particularly useful for estimating the Hessian matrix and reducing the number of iterations of iterative algorithms. In this paper, we propose a novel loss function (called the semi-supervised surrogate loss function) based on a variance-correction technique to avoid this higher uncertainty. The proposed surrogate loss has a more concentrated Hessian matrix, and the resulting estimator is as efficient as the supervised estimator. Further, we combine it with the Newton iteration method and propose a Semi-Supervised Distributed Approximate NEwton estimator (SSDANE) for GLM. Compared with the supervised DANE, our SSDANE has the following advantages. First, SSDANE converges faster to the optimal neighborhood of $\beta^{*}$. Second, SSDANE requires fewer rounds of communication between the master machine and the local machines. This is particularly important when communication cost becomes a bottleneck in distributed computing. Third, SSDANE allows a higher dimension for $\beta^{*}$: the supervised DANE typically requires the dimension to be smaller than the labeled local sample size, whereas SSDANE can still attain a nearly optimal rate when the dimension exceeds the labeled local sample size. Finally, extensive numerical experiments are performed to demonstrate the better performance of SSDANE. Our experimental results suggest that the performance of the semi-supervised algorithm is consistently superior to that of the supervised algorithm, and its performance improves as the amount of unlabeled data increases. Based on our findings, we recommend the use of our SSDANE over DANE, particularly when a larger amount of unlabeled data is available. The rest of the paper is organized as follows.
Section 2 introduces the semi-supervised surrogate loss function and SSDANE. In Section 3, we present theoretical results for the proposed methods. Numerical experiments are provided in Section 4. Finally, we conclude our work in Section 5. The proofs of the theoretical results are relegated to the Appendix. Given a vector $v = (v_1, \dots, v_p)^{\top}$, we denote $|v|_1 = \sum_{l=1}^{p}|v_l|$, $|v|_2 = \sqrt{\sum_{l=1}^{p} v_l^2}$ and $|v|_{\infty} = \sup_{1\le l\le p}|v_l|$. For a matrix $A\in\mathbb{R}^{n\times p}$, define $\|A\|$ as the spectral norm, and $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ as the largest and smallest eigenvalues of $A$, respectively.

2. Methodology

In this section, we first introduce the semi-supervised surrogate loss in a single-machine setting. Then we apply it in a distributed setup and develop two distributed semi-supervised algorithms.

2.1 Semi-Supervised Surrogate Loss

To motivate the construction of the semi-supervised surrogate loss, we shall consider the single-machine setting. Denote $D$ as the index set of the labeled pairs $\{(X_i, Y_i)\}$, $D^{*}$ as the index set of the unlabeled covariates $\{X_i\}$, and $H = D \cup D^{*}$. We let $|D| = n$ and $|H| = n^{*}$. For the generalized linear model (1) with link function $\psi'(\cdot)$, recall that the empirical loss function based on the labeled data $D$ is
$$L(\beta) = -\frac{1}{n}\sum_{i\in D} Y_i X_i^{\top}\beta + \frac{1}{n}\sum_{i\in D} \psi(X_i^{\top}\beta). \qquad (3)$$
With this target function, one may estimate the true parameter by the optimization
$$\widehat{\beta}_D = \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\{L(\beta)\}. \qquad (4)$$
It is well known that the computation time (the number of iterations) of a first/second-order method for solving this optimization depends on the condition number of the Hessian matrix
$$\widehat{H}_D(\beta) = \frac{1}{n}\sum_{i\in D} X_i X_i^{\top}\,\psi''(X_i^{\top}\beta).$$
When the sample size $|D|$ is small, or the dimension $p$ is large, the condition number can be large. For example, when $p$ is of the same order as $n$, the condition number of the sample covariance matrix can go to infinity (see Bickel and Levina (2008)). In order to get a more stable and concentrated Hessian matrix, a direct way is to make use of the unlabeled data, noting that the Hessian matrix only depends on the covariates.
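The condition-number issue described above is easy to see numerically. The following sketch (our own illustration, with standard normal covariates so the Hessian reduces to a sample covariance matrix as in the linear model) compares a Hessian built from a sample of size comparable to $p$ with one built from a much larger covariate pool.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 80
conds = {}
# The sample Hessian from n ~ p points is badly conditioned, while pooling
# many more covariates (as the unlabeled data allow) gives a stable one.
for n in (100, 2000):
    X = rng.standard_normal((n, p))
    eig = np.linalg.eigvalsh(X.T @ X / n)   # eigenvalues in ascending order
    conds[n] = eig[-1] / eig[0]             # condition number
print(conds[100] > conds[2000])  # True
```

With $n = 100$ the ratio is in the hundreds, while with $n = 2000$ it is close to 1, which is exactly why a larger covariate pool speeds up Newton-type iterations.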
That is, one may consider the following loss
$$L^{ss}(\beta) = -\frac{1}{n}\sum_{i\in D} Y_i X_i^{\top}\beta + \frac{1}{n^{*}}\sum_{i\in H} \psi(X_i^{\top}\beta), \qquad (5)$$
where the unlabeled data are used, leading to a more stable Hessian matrix once $n^{*}$ is much larger than $n$:
$$\widehat{H}^{ss}_H(\beta) = \frac{1}{n^{*}}\sum_{i\in H} X_i X_i^{\top}\,\psi''(X_i^{\top}\beta).$$
However, due to the imbalance between the sample sizes of $D$ and $H$, the asymptotic variance of the estimator $\widehat{\beta}_H = \operatorname{argmin}_{\beta\in\mathbb{R}^p}\{L^{ss}(\beta)\}$ increases. This results in a great loss of statistical efficiency in estimating $\beta^{*}$. In fact, the covariance of the gradient at $\beta^{*}$,
$$\nabla L^{ss}(\beta^{*}) = -\frac{1}{n}\sum_{i\in D} Y_i X_i + \frac{1}{n^{*}}\sum_{i\in H} X_i\,\psi'(X_i^{\top}\beta^{*}),$$
satisfies
$$\mathrm{Cov}\big(\nabla L^{ss}(\beta^{*})\big) = \frac{1}{n}\,\mathrm{Cov}\big({-YX} + X\psi'(X^{\top}\beta^{*})\big) + \mathrm{Cov}\Big(\frac{1}{n^{*}}\sum_{i\in H} X_i\psi'(X_i^{\top}\beta^{*}) - \frac{1}{n}\sum_{i\in D} X_i\psi'(X_i^{\top}\beta^{*})\Big)$$
$$= \frac{1}{n}\,\mathrm{Cov}\big({-YX} + X\psi'(X^{\top}\beta^{*})\big) + \frac{n^{*}-n}{n n^{*}}\,\mathrm{Cov}\big(X\psi'(X^{\top}\beta^{*})\big), \qquad (6)$$
where the first matrix on the right-hand side is the covariance matrix of the gradient of $L(\beta)$ in (3). The imbalance between $D$ and $H$ leads to the second, extra covariance matrix, which has the same order as the first one. In order to eliminate the second term of (6) while maintaining the stability of the Hessian matrix, we consider the following two-step method. Suppose that we have an initial estimator $\widehat{\beta}^{(0)}$ for $\beta^{*}$. Then we can subtract
$$\nabla L^{ss}(\widehat{\beta}^{(0)}) - \nabla L(\widehat{\beta}^{(0)}) = \frac{1}{n^{*}}\sum_{i\in H} X_i\,\psi'(X_i^{\top}\widehat{\beta}^{(0)}) - \frac{1}{n}\sum_{i\in D} X_i\,\psi'(X_i^{\top}\widehat{\beta}^{(0)})$$
to remove the extra term in the gradient of $L^{ss}(\beta)$. That is, we consider the following corrected estimating equation:
$$\nabla L^{ss}(\beta) - \big\{\nabla L^{ss}(\widehat{\beta}^{(0)}) - \nabla L(\widehat{\beta}^{(0)})\big\} = 0. \qquad (7)$$
It is easy to see that if the initial value $\widehat{\beta}^{(0)}$ is close to $\beta^{*}$, then the covariance of the above equation (at the value $\beta^{*}$) is asymptotically equivalent to that of $\nabla L(\beta)$. Therefore, the estimator satisfying equation (7) will have almost the same statistical efficiency as that of $L(\beta)$. On the other hand, the Hessian matrix of (7) is $\widehat{H}^{ss}_H(\beta)$, which is more stable than $\widehat{H}_D(\beta)$. The estimating equation (7) corresponds to the following loss function:
$$\widetilde{L}(\beta) = L^{ss}(\beta) - \big\langle \nabla L^{ss}(\widehat{\beta}^{(0)}) - \nabla L(\widehat{\beta}^{(0)}),\ \beta \big\rangle. \qquad (8)$$
We call $\widetilde{L}(\beta)$ the semi-supervised surrogate loss.
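The correction in (7)-(8) involves only gradients, so it can be sketched in a few lines. The sketch below (illustrative helper names of our own, with the logistic link as one concrete choice of $\psi'$) computes the corrected gradient; by construction it coincides with the labeled-only gradient at $\beta = \widehat{\beta}^{(0)}$, while its Jacobian in $\beta$ remains the stable semi-supervised Hessian.

```python
import numpy as np

def sigmoid(t):
    # psi'(t) for logistic regression, used here as a concrete link
    return 1.0 / (1.0 + np.exp(-t))

def grad_sup(X_D, Y, beta, dpsi=sigmoid):
    # gradient of the labeled-only loss L(beta) in (3)
    n = len(Y)
    return (-X_D.T @ Y + X_D.T @ dpsi(X_D @ beta)) / n

def grad_ss(X_D, Y, X_H, beta, dpsi=sigmoid):
    # gradient of the semi-supervised loss L^ss(beta) in (5):
    # the psi' part is averaged over the full covariate set H
    return -X_D.T @ Y / len(Y) + X_H.T @ dpsi(X_H @ beta) / len(X_H)

def grad_surrogate(X_D, Y, X_H, beta, beta0, dpsi=sigmoid):
    # gradient of the semi-supervised surrogate loss (8): subtracting the
    # correction grad L^ss(beta0) - grad L(beta0) removes the extra
    # covariance term in (6)
    corr = grad_ss(X_D, Y, X_H, beta0, dpsi) - grad_sup(X_D, Y, beta0, dpsi)
    return grad_ss(X_D, Y, X_H, beta, dpsi) - corr
```

Evaluating `grad_surrogate` at `beta = beta0` returns exactly `grad_sup(X_D, Y, beta0)`, which is the algebraic reason the surrogate keeps the supervised statistical efficiency.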
This surrogate loss can be easily extended to the distributed setting. We will show rigorously that the number of iterations, and hence the communication cost, is much smaller than for the algorithm with only labeled data.

2.2 Distributed Semi-Supervised Learning

In this section, we consider semi-supervised learning in the distributed setup. More specifically, assume we have $N = mn$ i.i.d. pairs of observations $\{(X_i, Y_i)\}$ from the generalized linear model (1), evenly stored on $m$ different machines $\{H_1, \dots, H_m\}$. Let $H_1$ be the master machine, which is in charge of data update and transmission. We discuss separately the cases where the unlabeled data are stored on a single machine and on multiple machines. In both cases, we develop algorithms that achieve a faster convergence rate than their fully supervised counterparts.

2.2.1 Unlabeled Data on Master Machine

We first assume the sample size of the additional $n^{*} - n$ unlabeled covariates $X_i$ is not too large, so that they can be stored on a single machine (i.e., $H_1$, the master machine). We denote by $D_j$ the indices of the labeled observations $\{(X_i, Y_i)\}$, and by $D^{*}_1$ the unlabeled covariates $\{X_i\}$ on $H_1$. Then we have $|D_j| = n$, $|D^{*}_1| = n^{*} - n$ and $|H_1| = n^{*}$. To conduct distributed learning with the assistance of the unlabeled data, we use the semi-supervised surrogate loss in (8) and replace the local gradient $\nabla L(\widehat{\beta}^{(0)})$ by the global gradient $\frac{1}{m}\sum_{j=1}^{m}\nabla L_j(\widehat{\beta}^{(0)})$, where
$$L_j(\beta) = -\frac{1}{n}\sum_{i\in D_j} Y_i X_i^{\top}\beta + \frac{1}{n}\sum_{i\in D_j} \psi(X_i^{\top}\beta) \qquad (9)$$
is the local loss on the $j$-th machine. After the master machine receives all the gradients from the workers, it solves the following optimization and updates the initial value $\widehat{\beta}^{(0)}$ to $\widehat{\beta}^{(1)}$:
$$\widehat{\beta}^{(1)} = \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\widetilde{L}^{(1)}(\beta) = \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\Big\{ L^{ss}_1(\beta) - \Big\langle \nabla L^{ss}_1(\widehat{\beta}^{(0)}) - \frac{1}{m}\sum_{j=1}^{m}\nabla L_j(\widehat{\beta}^{(0)}),\ \beta\Big\rangle \Big\}. \qquad (10)$$
We call this the one-step Semi-Supervised Distributed Approximate NEwton (SSDANE) estimator, named with respect to the DANE method proposed by Shamir et al.
(2014).¹ The one-step SSDANE can be directly extended to the multi-round SSDANE stated in Algorithm 1.

2.2.2 Unlabeled Data on Multiple Machines

In this section, we consider the case where the unlabeled data are so numerous that they are stored separately on multiple machines. To be more specific, we denote by $U$ the set of machines having an unlabeled dataset, and let $D_j$ and $D^{*}_j$ be the index sets of the labeled observations $\{(X_i, Y_i)\}$ and the unlabeled covariates $\{X_i\}$ on $H_j$, respectively. Then for $j\in U$, we have $|D_j| = n$, $|D^{*}_j| = n^{*} - n$ and $|H_j| = n^{*}$. Suppose we have an initial estimator $\widehat{\beta}^{(0)}$; for $j\in U$, we apply SSDANE on the $j$-th machine to obtain
$$\widehat{\beta}^{(1)}_j = \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\widetilde{L}^{(1)}_j(\beta) = \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\Big\{ L^{ss}_j(\beta) - \Big\langle \nabla L^{ss}_j(\widehat{\beta}^{(0)}) - \frac{1}{m}\sum_{k=1}^{m}\nabla L_k(\widehat{\beta}^{(0)}),\ \beta\Big\rangle \Big\}, \qquad (13)$$
where $L^{ss}_j(\beta)$ denotes the local semi-supervised empirical loss on the $j$-th machine, namely,
$$L^{ss}_j(\beta) = -\frac{1}{n}\sum_{i\in D_j} Y_i X_i^{\top}\beta + \frac{1}{n^{*}}\sum_{i\in H_j} \psi(X_i^{\top}\beta).$$

¹ The original DANE is based on the average of the local solutions over all workers. For the sake of comparison, we refer to the solution on a local machine as DANE and to the average of them as DANE-Avg.

Algorithm 1 Semi-Supervised Distributed Approximate NEwton Method (SSDANE)
Input: labeled data $\{(X_i, Y_i) \mid i\in D_j\}$ on machine $H_j$ for $j = 1, \dots, m$, and unlabeled data $\{X_i \mid i\in D^{*}_1\}$ on the master machine $H_1$; the number of iterations $T$.
1: The master machine $H_1$ obtains the initial estimator $\widehat{\beta}^{(0)}$ by minimizing the local empirical loss function on $H_1$.
2: for $t = 1, \dots, T$ do
3: The master machine broadcasts the parameter $\widehat{\beta}^{(t-1)}$ to each worker machine.
4: for $j = 1, \dots, m$ do
5: The $j$-th machine computes the local gradient
$$\nabla L_j(\widehat{\beta}^{(t-1)}) = -\frac{1}{n}\sum_{i\in D_j} Y_i X_i + \frac{1}{n}\sum_{i\in D_j} \psi'(X_i^{\top}\widehat{\beta}^{(t-1)})\, X_i,$$
and sends it back to the master machine $H_1$.
6: end for
7: The master machine updates the parameter by solving
$$\widehat{\beta}^{(t)} = \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\widetilde{L}^{(t)}(\beta) = \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\Big\{ L^{ss}_1(\beta) - \Big\langle \nabla L^{ss}_1(\widehat{\beta}^{(t-1)}) - \frac{1}{m}\sum_{j=1}^{m}\nabla L_j(\widehat{\beta}^{(t-1)}),\ \beta\Big\rangle \Big\}. \qquad (11)$$
8: end for
Output: the final estimator $\widehat{\beta}^{(T)}$.
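For intuition, Algorithm 1 takes a particularly simple form in the linear model, where $\psi(x) = x^2/2$ makes the master's update a closed-form linear solve: the minimizer of the quadratic surrogate is exactly a Newton step on the global gradient, preconditioned by the stabilized semi-supervised Hessian. The following is a minimal single-process sketch under that assumption (function and variable names are ours), not the authors' implementation.

```python
import numpy as np

def ssdane_linear(X_parts, Y_parts, X_unlab, T=5):
    """Sketch of Algorithm 1 for the linear model (psi(x) = x^2 / 2).

    X_parts[j], Y_parts[j]: labeled data on machine j (j = 0 is the master);
    X_unlab: additional unlabeled covariates stored on the master.
    """
    m, n = len(X_parts), len(Y_parts[0])
    X_H1 = np.vstack([X_parts[0], X_unlab])   # full covariate set H1 on the master
    H_ss = X_H1.T @ X_H1 / len(X_H1)          # semi-supervised Hessian of L1^ss
    # initial estimator: local least squares on the master
    beta = np.linalg.lstsq(X_parts[0], Y_parts[0], rcond=None)[0]
    for _ in range(T):
        # each worker sends back its local gradient of L_j at the current beta;
        # the master averages them into the global gradient
        global_grad = sum((-X.T @ Y + X.T @ (X @ beta)) / n
                          for X, Y in zip(X_parts, Y_parts)) / m
        # for quadratic psi the surrogate minimization reduces to this
        # preconditioned Newton step with the stable Hessian H_ss
        beta = beta - np.linalg.solve(H_ss, global_grad)
    return beta
```

A few such rounds already bring the iterate near the pooled least-squares solution, which is the point of the faster contraction discussed in Section 3.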
Then we take the average over these local estimators to obtain a more accurate one,
$$\widehat{\beta}^{(1)}_{\mathrm{Avg}} = \frac{1}{|U|}\sum_{j\in U}\widehat{\beta}^{(1)}_j. \qquad (12)$$
The multi-round realization of the averaged SSDANE (SSDANE-Avg) method is provided in Algorithm 2.

Algorithm 2 Semi-Supervised Distributed Approximate NEwton with Average (SSDANE-Avg)
Input: labeled data $\{(X_i, Y_i) \mid i\in D_j\}$ on machine $H_j$ for $j = 1, \dots, m$; unlabeled data $\{X_i \mid i\in D^{*}_j\}$ on each machine $H_j$ with $j\in U$; the number of iterations $T$.
1: The master machine $H_1$ obtains the initial estimator $\widehat{\beta}^{(0)}$ by minimizing the local empirical loss function on $H_1$.
2: for $t = 1, \dots, T$ do
3: The master machine broadcasts the parameter $\widehat{\beta}^{(t-1)}$ to each worker machine.
4: for $j = 1, \dots, m$ do
5: The $j$-th machine computes the local gradient
$$\nabla L_j(\widehat{\beta}^{(t-1)}) = -\frac{1}{n}\sum_{i\in D_j} Y_i X_i + \frac{1}{n}\sum_{i\in D_j} \psi'(X_i^{\top}\widehat{\beta}^{(t-1)})\, X_i,$$
and sends it back to the master machine $H_1$.
6: end for
7: The master machine computes $m^{-1}\sum_{j=1}^{m}\nabla L_j(\widehat{\beta}^{(t-1)})$ and broadcasts it to the worker machines with unlabeled data, that is, $H_j$ for $j\in U$.
8: for $j\in U$ do
9: The $j$-th machine updates the parameter by solving
$$\widehat{\beta}^{(t)}_j = \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\widetilde{L}^{(t)}_j(\beta) = \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\Big\{ L^{ss}_j(\beta) - \Big\langle \nabla L^{ss}_j(\widehat{\beta}^{(t-1)}) - \frac{1}{m}\sum_{k=1}^{m}\nabla L_k(\widehat{\beta}^{(t-1)}),\ \beta\Big\rangle \Big\}$$
and sends it back to the master machine $H_1$.
10: end for
11: The master machine updates the parameter by $\widehat{\beta}^{(t)}_{\mathrm{Avg}} = |U|^{-1}\sum_{j\in U}\widehat{\beta}^{(t)}_j$.
12: end for
Output: the final estimator $\widehat{\beta}^{(T)}_{\mathrm{Avg}}$.

3. Theoretical Results

In this section, we investigate the theoretical properties of the proposed algorithms. First, we need the following regularity assumptions.

Assumption 1 Let $H = \mathbb{E}\{\psi''(X^{\top}\beta^{*})XX^{\top}\}$; then there is a constant $\rho > 0$ such that $\rho \le \lambda_{\min}(H) \le \lambda_{\max}(H) \le \rho^{-1}$.

Assumption 2 There exist positive numbers $\eta_0 > 0$ and $C_0$ such that
$$\max\Big\{ \sup_{v\in\mathbb{S}^{p-1}} \mathbb{E}\big[\exp\{\eta_0 |X^{\top}v|^2\}\big],\ \mathbb{E}\big[\exp\{\eta_0 |\psi'(X^{\top}\beta^{*})|^2\}\big],\ \mathbb{E}\big[\exp\{\eta_0 Y^2\}\big] \Big\} \le C_0.$$

Assumption 3 The canonical link function $\psi'(\cdot)$ is twice differentiable, and there exists a uniform constant $c_\psi$ such that $\sup_{x\in\mathbb{R}} \max\{|\psi''(x)|, |\psi'''(x)|\} \le c_\psi$.
The above assumptions are standard conditions for proving estimation consistency. Assumption 1 assumes the well-posedness of the Hessian matrix. Assumption 2 guarantees that the covariate $X$ and the label $Y$ are sub-Gaussian. In Assumption 3 we assume the canonical link function $\psi'(x)$ has uniformly bounded first-order and second-order derivatives.

3.1 Convergence Rate of Semi-Supervised Distributed Estimation

3.1.1 Unlabeled Data on Master Machine

We first give the convergence rate of the estimator in Algorithm 1 when all unlabeled data are stored in the master machine.

Theorem 1 Suppose Assumptions 1 to 3 hold, and assume the initial estimator $\widehat{\beta}^{(0)}$ satisfies $|\widehat{\beta}^{(0)} - \beta^{*}|_2 = O_{\mathbb{P}}(r_n)$. Moreover, the dimension of the parameter $p$ satisfies $p = o(\min\{(n^{*})^{2/3}/\log n^{*},\ mn/\log n^{*}\})$ and $r_n = o(p^{-1/2})$. The $T$-th round SSDANE estimator satisfies
$$\big|\widehat{\beta}^{(T)} - \beta^{*}\big|_2 = O_{\mathbb{P}}\Bigg( \sqrt{\frac{p}{mn}} + r_n\Big(\frac{p\log n^{*}}{n^{*}}\Big)^{T/2} + \frac{1}{\sqrt{p}}\big(\sqrt{p}\, r_n\big)^{2^{T}} \Bigg). \qquad (14)$$

This theorem provides an upper bound for the SSDANE method. The first term is the nearly optimal statistical convergence rate with all labeled data. The second and third terms represent the improvement of the convergence rate by each iteration of the algorithm. For example, the rate in the second term is improved by the factor $\sqrt{p(\log n^{*})/n^{*}}$ with one iteration. To achieve the nearly optimal statistical rate $\sqrt{p/(mn)}$, it suffices to take
$$T \ \ge\ \frac{\log n + \log m - \log p - \log\log n^{*}}{\log n^{*} - 2\log p - \log\log n^{*}}. \qquad (15)$$
Note that $n^{*}$ is the total number of samples on the master. This indicates that the information in the unlabeled data indeed helps improve the rate of the algorithm. In fact, if the loss $L_1(\beta)$ with only labeled data is used in DANE, i.e., replacing $L^{ss}_1(\beta)$ by $L_1(\beta)$ in (11),
$$\widehat{\beta}^{(t)}_o = \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\Big\{ L_1(\beta) - \Big\langle \nabla L_1(\widehat{\beta}^{(t-1)}_o) - \frac{1}{m}\sum_{j=1}^{m}\nabla L_j(\widehat{\beta}^{(t-1)}_o),\ \beta\Big\rangle \Big\}, \qquad (16)$$
then the second term of (14) becomes $r_n\big(p\log n/n\big)^{T/2}$, which is much slower than the rate with unlabeled data as long as $n^{*} \gg n$.
A slower rate means more iterations and more rounds of communication between the master machine and the workers. The improvement of the rate by the unlabeled data is essential. Consider the one-round estimator with $T = 1$ for the linear model. Proposition 7 below shows that a tight lower bound for the DANE estimator $\widehat{\beta}^{(1)}_o$ is $C\big(\sqrt{p/(mn)} + r_n\sqrt{p/n}\big)$. Beyond this theoretical evidence, we also conduct extensive numerical analysis to verify such improvement. To further see the advantage of our semi-supervised surrogate loss in distributed estimation, we consider the case where $p > n$ but $p$ is much smaller than $n^{*}$ and $mn$. Since $p > n$, the Hessian matrix of $L_1(\beta)$ is not positive definite. Hence $\widehat{\beta}^{(t)}_o$ performs poorly in estimating $\beta^{*}$. For example, for the linear model, (16) reduces to solving a linear equation $\widehat{\Sigma}_1\beta = b$, where $\widehat{\Sigma}_1 = \frac{1}{n}\sum_{i\in D_1} X_i X_i^{\top}$ is not of full rank, so that the solution is not unique. On the other hand, the Hessian matrix of our semi-supervised surrogate loss is still positive definite as long as $n^{*} \gg p$. In this case, $\widehat{\beta}^{(T)}$ is well defined and can still attain the nearly optimal rate. Next, we present the asymptotic normality result for our proposed SSDANE estimator.

Theorem 2 (Asymptotic normality of SSDANE) Under the same conditions as in Theorem 1, assume additionally that the dimension of the parameter $p$ satisfies $p = o(\min\{(n^{*})^{1/2}/\log n^{*},\ (mn/\log^2 n^{*})^{1/3}\})$. Then when the iteration round $T$ satisfies (15), for any vector $v\in\mathbb{S}^{p-1}$, we have
$$\frac{\sqrt{mn}}{\sigma(v)}\, v^{\top}\big(\widehat{\beta}^{(T)} - \beta^{*}\big) \xrightarrow{d} N(0, 1), \qquad \sigma(v)^2 = v^{\top} H^{-1} C H^{-1} v, \qquad (17)$$
where $H = \mathbb{E}\{\psi''(X^{\top}\beta^{*})XX^{\top}\}$ and $C = c(\sigma)\,\mathbb{E}\{\psi''(X^{\top}\beta^{*})XX^{\top}\}$.

The result shows that the proposed semi-supervised surrogate loss keeps the same statistical efficiency as $L(\beta)$ with all labeled data.

Remark 3 Regarding statistical efficiency, we also note that unlabeled data sometimes helps reduce the asymptotic variance in some works on semi-supervised learning. However, our methods have different settings and targets from these works.
In particular, most of these results enjoy variance reduction due to additional information. For example, in Hou et al. (2023), the authors proposed an imputation method with the assistance of additional surrogate information. In Cai and Guo (2020), the authors considered estimating the explained variance with unlabeled data. Chakrabortty and Cai (2018) and Azriel et al. (2022) estimated the model parameter for the mis-specified linear model. In contrast, we focus on estimating the model parameter in the distributed setting. It would also be interesting to study the aforementioned problems in the distributed setting by extending our method, and there should be no technical difficulty in showing a similar efficiency-enhancement effect.

3.1.2 Unlabeled Data on Multiple Machines

In this section, we provide similar theoretical results when the unlabeled data are separately stored in multiple machines.

Theorem 4 Suppose Assumptions 1 to 3 hold, and assume the initial estimator $\widehat{\beta}^{(0)}$ satisfies $|\widehat{\beta}^{(0)} - \beta^{*}|_2 = O_{\mathbb{P}}(r_n)$. Moreover, assume the rate constraints $p = o(\min\{(n^{*})^{2/3}/\log n^{*},\ mn/\log n^{*}\})$ and $r_n = o(p^{-1/2})$. Then the $T$-th round SSDANE-Avg estimator satisfies
$$\big|\widehat{\beta}^{(T)}_{\mathrm{Avg}} - \beta^{*}\big|_2 = O_{\mathbb{P}}\Bigg( \sqrt{\frac{p}{mn}} + r_n\Big(\frac{p\log n^{*}}{n^{*}}\Big)^{T} + \frac{r_n}{\sqrt{|U|}}\Big(\frac{p\log n^{*}}{n^{*}}\Big)^{T/2} + \frac{1}{\sqrt{p}}\big(\sqrt{p}\, r_n\big)^{2^{T}} \Bigg), \qquad (18)$$
where $|U|$ denotes the cardinality of the set $U$.

We now compare the rates in (18) and (14). It is obvious that the more machines with unlabeled data, the faster the convergence rate of the algorithm. Note that the total number of unlabeled data points is $|U|(n^{*} - n)$. If these $|U|(n^{*} - n)$ data points were all stored in the master machine, then according to Theorem 1 the rate of the SSDANE estimator $\widehat{\beta}^{(T)}$ would be
$$O_{\mathbb{P}}\Bigg( \sqrt{\frac{p}{mn}} + r_n\Big(\frac{p\log n^{*}}{|U|(n^{*}-n) + n}\Big)^{T/2} + \frac{1}{\sqrt{p}}\big(\sqrt{p}\, r_n\big)^{2^{T}} \Bigg).$$
This rate is slower than or the same as (18) when $|U| \lesssim n^{*}/(p\log n^{*})$, depending on the magnitude of $n^{*} - n$. Therefore, when the unlabeled data are stored in multiple workers, the averaged SSDANE can have a better convergence rate.
Furthermore, it has a faster rate than SSDANE by an extra factor $1/\sqrt{|U|}$ when $n^{*}$ is close to $n$ (e.g., $n^{*} \asymp n$). On the other hand, the averaged SSDANE requires the $|U|$ workers to solve the optimization (13) simultaneously, and hence needs a more homogeneous distributed system. For the asymptotic distribution of $\widehat{\beta}^{(T)}_{\mathrm{Avg}}$, we can establish a result similar to Theorem 2 with the same asymptotic variance. Therefore, the averaged SSDANE further accelerates the algorithm while keeping the statistical efficiency. We further note that, if the loss $L^{ss}_j(\beta)$ with unlabeled data were used trivially in DANE, i.e., letting
$$\widehat{\beta}^{(t)}_{ss,j} = \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\Big\{ L^{ss}_j(\beta) - \Big\langle \nabla L^{ss}_j(\widehat{\beta}^{(t-1)}_{ss}) - \frac{1}{m}\sum_{k=1}^{m}\nabla L^{ss}_k(\widehat{\beta}^{(t-1)}_{ss}),\ \beta\Big\rangle \Big\}$$
and $\widehat{\beta}^{(t)}_{ss,\mathrm{Avg}} = \frac{1}{|U|}\sum_{j\in U}\widehat{\beta}^{(t)}_{ss,j}$, then we can prove that the asymptotic variance of $\widehat{\beta}^{(t)}_{ss,\mathrm{Avg}}$ is
$$v^{\top} H^{-1}\Big( C + \frac{(n^{*}-n)|U|}{mn}\, C^{ss} \Big) H^{-1} v, \qquad \text{where } C^{ss} = \mathrm{Cov}\big(\psi'(X^{\top}\beta^{*})X\big).$$
That is, a trivial use of unlabeled data in DANE can lead to a larger variance when $|U|$ is proportional to $m$.

Remark 5 We note that our current theoretical framework is founded on the premise that the data have comparable sample sizes across local sources, primarily for the sake of clarity of presentation. Specifically, for the SSDANE method, we conduct the minimization of the surrogate loss (defined in (10)) solely on the master machine $H_1$. In this context, we allow for imbalanced local sample sizes, provided that the master machine contains a sufficient number of covariates. Conversely, in the case of the SSDANE-Avg method, we take the average of all SSDANE estimators from each machine in $U$. In this case, when sample sizes across machines are imbalanced, the statistical rate becomes contingent on the smallest local sample size, denoted as $\min\{n^{*}_j \mid j\in U\}$. In simpler terms, the presence of smaller local sample sizes can potentially impact the theoretical performance of SSDANE-Avg negatively. In practice, we suggest using the SSDANE method when the sample sizes across machines are extremely imbalanced.
Notably, recent research efforts have delved into distributed training with imbalanced data (see, e.g., Duan et al. (2021a,b); Chen et al. (2023)). It would indeed be intriguing to explore the fusion of these methodologies with our distributed semi-supervised approach.

3.2 Two Examples

In this section, we provide two specific examples of the generalized linear model. The first one is linear regression. We use this example to illustrate the lower bound on the rate of DANE with labeled data and to support the claims on the better performance of our SSDANE in Section 2.2.1.

Linear regression. Assume the observation pair $(X, Y)$ comes from the following model:
$$Y = X^{\top}\beta^{*} + \epsilon, \qquad (19)$$
where $\epsilon$ denotes Gaussian noise. In this case, the canonical link function is $\psi'(x) = x$, so $\psi(x) = x^2/2$, which is quadratic. The third-order derivative of $\psi(x)$ vanishes for the linear model.

Proposition 6 In the linear model, suppose the covariate $X$ follows a sub-Gaussian distribution. Then the $T$-th round SSDANE estimator $\widehat{\beta}^{(T)}$ satisfies
$$\big|\widehat{\beta}^{(T)} - \beta^{*}\big|_2 = O_{\mathbb{P}}\Bigg( \sqrt{\frac{p}{mn}} + r_n\Big(\frac{p\log n^{*}}{n^{*}}\Big)^{T/2} \Bigg).$$

For linear regression, the third term in (14) disappears because the canonical link function is linear. In the following, we provide a lower bound for the one-step SSDANE.

Proposition 7 In the linear model, assume the covariate $X$ follows a sub-Gaussian distribution. Moreover, assume that
$$\inf_{v\in\mathbb{S}^{p-1}} \lambda_{\min}\Big[ \mathrm{Cov}\big\{ (XX^{\top} - \Sigma)v \big\} \Big] \ge \rho_1.$$
Let $\widehat{\beta}^{(0)}$ be an initial estimator that is independent of the $X_i$'s and $\epsilon_i$'s. Then there exists a constant $c$ such that, with high probability, the one-step SSDANE estimator $\widehat{\beta}^{(1)}$ satisfies
$$\big|\widehat{\beta}^{(1)} - \beta^{*}\big|_2 \ \ge\ c\Big( \sqrt{\frac{p}{mn}} + r_n\sqrt{\frac{p}{n^{*}}} \Big),$$
where $r_n = |\widehat{\beta}^{(0)} - \beta^{*}|_2$. Suppose there is no additional unlabeled data, so that $n^{*} = n$. In this case, SSDANE becomes DANE and $\widehat{\beta}^{(1)} = \widehat{\beta}^{(1)}_o$.
Therefore, DANE with only labeled data has the lower bound $c\big(\sqrt{p/(mn)} + r_n\sqrt{p/n}\big)$.

Logistic regression. Assume the observation $(X, Y)\in\mathbb{R}^{p+1}$ admits the following conditional probability function:
$$\mathbb{P}(Y \mid X) = \frac{\exp(Y X^{\top}\beta^{*})}{1 + \exp(X^{\top}\beta^{*})}, \qquad (21)$$
where the response variable $Y$ only takes values in $\{0, 1\}$. Then the canonical link function is $\psi'(x) = e^{x}/(1 + e^{x})$, i.e., $\psi(x) = \log(1 + e^{x})$. It is not hard to verify that $\psi$ fulfills Assumption 3. Therefore, the theorems above hold for logistic regression.

3.3 Theory in the Non-Distributed Setting

To show that a trivial use of unlabeled data declines the statistical efficiency, we provide a result on the asymptotic distribution in the non-distributed setting. Recall the loss function
$$L^{ss}(\beta) = -\frac{1}{n}\sum_{i\in D} Y_i X_i^{\top}\beta + \frac{1}{n^{*}}\sum_{i\in H} \psi(X_i^{\top}\beta),$$
and define
$$\widehat{\beta}^{ss} \in \operatorname*{argmin}_{\beta\in\mathbb{R}^p}\big\{ L^{ss}(\beta) \big\}. \qquad (22)$$

Proposition 8 (Asymptotic normality of the semi-supervised GLM estimator) Suppose Assumptions 1 to 3 hold, and additionally assume the rate constraint $p = o((n/\log n)^{2/3})$. Then for any vector $v\in\mathbb{S}^{p-1}$, we have
$$\frac{\sqrt{n}}{\sigma^{*}(v)}\, v^{\top}\big(\widehat{\beta}^{ss} - \beta^{*}\big) \xrightarrow{d} N(0, 1), \qquad \sigma^{*}(v)^2 = v^{\top} H^{-1}\Big( C + \frac{n^{*}-n}{n^{*}}\, C^{ss} \Big) H^{-1} v, \qquad (23)$$
where $C = c(\sigma)\,\mathbb{E}\{\psi''(X^{\top}\beta^{*})XX^{\top}\}$, $C^{ss} = \mathrm{Cov}\big(\psi'(X^{\top}\beta^{*})X\big)$, and $H = \mathbb{E}\{\psi''(X^{\top}\beta^{*})XX^{\top}\}$.

Let $n^{*} = n$ in Proposition 8. In this case, $D = H$ and hence $\widehat{\beta}^{ss} = \widehat{\beta}_D$. That is, the asymptotic covariance matrix of $\widehat{\beta}_D$ is $H^{-1}CH^{-1}$. When the unlabeled data are used, the covariance matrix becomes $H^{-1}\big( C + \frac{n^{*}-n}{n^{*}} C^{ss} \big) H^{-1}$, which has lower statistical efficiency.

4. Empirical Analysis

In the empirical analysis, we conduct several experiments to show the effectiveness of our proposed methods.

4.1 Simulation Studies on Synthetic Dataset

In this section, we show the performance of our proposed semi-supervised estimators on linear regression and logistic regression.

Parameter Settings. In both models, we assume the i.i.d. covariate vectors $X_i = (X_{i,1}, \dots, X_{i,p})^{\top}$ are drawn from a multivariate normal distribution $N(0, \Sigma)$ for $i = 1, \dots, N$.
Here the covariance matrix $\Sigma$ is a $p\times p$ Toeplitz matrix with $(i, j)$-th entry $\Sigma_{ij} = 0.5^{|i-j|}$, where $1\le i, j\le p$. We fix the dimension $p = 20$ and the true coefficient $\beta^{*} = (1, 0.95, 0.9, \dots, 0.1, 0.05)^{\top}$. We repeat 100 independent simulations and report the averaged estimation error and the corresponding standard error. We mainly compare SSDANE (Algorithm 1) and SSDANE-Avg (Algorithm 2) with the following six methods:

- DANE: the supervised version of the distributed approximate Newton estimator. We run Algorithm 1 without using the additional unlabeled dataset.
- DANE-Avg: the supervised version of the averaged distributed approximate Newton estimator. We run Algorithm 2 without using the additional unlabeled dataset.
- DANEimp: the imputation-based distributed approximate Newton estimator. We impute the unlabeled data using the current estimator $\widehat{\beta}$ and run DANE on all pairs of data.
- DANEimp-Avg: the imputation-based averaged distributed approximate Newton estimator. We impute the unlabeled data using the current estimator $\widehat{\beta}$ and run DANE-Avg on all pairs of data.
- Local estimator: we minimize the empirical loss function $L_1(\beta) = -\frac{1}{n}\sum_{i\in D_1} Y_i X_i^{\top}\beta + \frac{1}{n}\sum_{i\in D_1}\psi(X_i^{\top}\beta)$ with the data on the master machine $H_1$.
- Pooled estimator: we collect all labeled pairs on one machine and minimize the empirical loss function.

In particular, we denote by SSDANE-Avg($\alpha$) (resp. DANE-Avg($\alpha$)) the estimator obtained by averaging the local SSDANE (resp. DANE) estimators from a fraction $\alpha$ of the machines. For the choice of the initial estimator, we uniformly use the local estimator on the master machine $H_1$.

4.1.1 Results for Linear Regression

We first consider linear regression, where the observation $(X, Y)$ is generated from the linear model (19).
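The covariate design just described can be reproduced as follows; the helper name is ours, and this is only a sketch of the simulation setting, not the authors' code.

```python
import numpy as np

def make_design(N, p, rho=0.5, seed=None):
    # covariates X_i ~ N(0, Sigma) with the Toeplitz matrix Sigma_ij = rho^{|i-j|},
    # matching the parameter settings of Section 4.1
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    return rng.multivariate_normal(np.zeros(p), Sigma, size=N), Sigma

p = 20
beta_star = 1.0 - 0.05 * np.arange(p)   # (1, 0.95, 0.90, ..., 0.10, 0.05)
X, Sigma = make_design(1000, p, seed=0)
print(X.shape, Sigma[0, 1])  # (1000, 20) 0.5
```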
Effect of the Number of Machines and Local Unlabeled Data. To investigate the effect of the number of machines and of the local unlabeled data, we fix the labeled local sample size $n$ at 100, and vary the number of machines $m$ over $\{20, 50, 100\}$ and the unlabeled local sample size $n^*$ over $\{100, 200, 500\}$. We compare the one-step SSDANE and SSDANE-Avg with different averaging fractions. The $\ell_2$-errors and their standard errors are reported in Table 1. As we can see from Table 1, for every fixed $m$, more unlabeled samples help reduce the $\ell_2$-error of the semi-supervised estimators. The averaged estimator SSDANE-Avg always has a lower error than SSDANE, which only solves problem (11) on the master machine. Moreover, the larger the fraction of machines we average over, the smaller the estimation error. All of these findings coincide with the theoretical results in Theorem 1 and Theorem 4.

Covariance of the gradient at $\beta^*$. From the proofs of Theorem 2 and Proposition 8, we observe that the asymptotic variances of the estimators differ only in the covariance of the gradient of the loss function $L$. Therefore, to demonstrate the statistical efficiency of our method in comparison with the naive semi-supervised estimator, we compare the rescaled covariance of the gradient at the true parameter $\beta^*$. Specifically, for each round, we generate $N$ samples and compute $N\,\nabla L(\beta^*)\nabla L(\beta^*)^T$. Then, we repeat this process 100 times and compute the average, which provides an estimate of the covariance of the gradient $\nabla L(\beta^*)$

Table 1: The ℓ2-errors and their standard errors (in parentheses) of the one-step SSDANE, SSDANE-Avg(0.2), SSDANE-Avg(0.5) and SSDANE-Avg(1) under labeled local sample size n = 100. Data are generated from a linear model with parameter dimension p = 20.
  m    n*   SSDANE        SSDANE-Avg(0.2)  SSDANE-Avg(0.5)  SSDANE-Avg(1)
  20   100  0.588(0.226)  0.310(0.083)     0.236(0.053)     0.213(0.042)
       200  0.321(0.109)  0.193(0.041)     0.163(0.030)     0.154(0.026)
       500  0.203(0.040)  0.156(0.026)     0.145(0.028)     0.142(0.028)
  50   100  0.586(0.228)  0.227(0.056)     0.199(0.041)     0.187(0.035)
       200  0.299(0.103)  0.131(0.029)     0.118(0.022)     0.112(0.020)
       500  0.173(0.039)  0.101(0.019)     0.096(0.017)     0.094(0.016)
 100   100  0.593(0.235)  0.199(0.049)     0.184(0.038)     0.178(0.036)
       200  0.295(0.104)  0.106(0.023)     0.097(0.020)     0.095(0.019)
       500  0.161(0.039)  0.073(0.015)     0.070(0.014)     0.069(0.013)

(denoted as Σ̂(∇L)). Here the loss function is chosen as $L$ (fully supervised empirical loss), $L^{ss}$ (semi-supervised empirical loss), $\tilde L^{(1)}$ (1-step semi-supervised surrogate loss), and $\tilde L^{(5)}$ (5-step semi-supervised surrogate loss). We present the trace of the covariance (Table 2), the difference $\|\hat\Sigma(\nabla L) - C\|$ (Table 3), and the distribution of the eigenvalues of $\hat\Sigma(\nabla L)$ (Figure 1).

Table 2: The rescaled trace of the covariance Σ̂(∇L) for ∇L, ∇L^ss, ∇L̃^(1) and ∇L̃^(5) under labeled local sample size n = 100. Data are generated from a linear model with parameter dimension p = 20.

  m    n*   ∇L     ∇L^ss   ∇L̃^(1)  ∇L̃^(5)
  20   100  0.997  0.997   6.097    2891.910
       200  0.997  12.443  3.720    1.763
       500  0.997  18.476  2.196    1.046
  50   100  0.979  0.979   13.591   9386.730
       200  0.979  12.530  7.425    3.062
       500  0.979  19.100  3.542    1.048
 100   100  0.944  0.944   26.512   22642.400
       200  0.944  12.228  13.755   5.829
       500  0.944  18.367  5.822    0.998

From Table 2, it can be observed that the fully supervised surrogate loss ($n^* = n$) has the largest covariance trace. This is because the supervised estimator has a large bias. As $n^*$ grows, the semi-supervised empirical loss exhibits a larger covariance trace than that of $\nabla L$, which aligns with the result in Proposition 8. On the contrary, the semi-supervised surrogate loss has a smaller covariance trace.
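The Monte Carlo estimate Σ̂(∇L) described above can be sketched as follows. This is a minimal illustration (not the paper's code) assuming the fully supervised least-squares loss $L(\beta) = \frac{1}{2N}\sum_i (Y_i - X_i^T\beta)^2$, whose gradient at $\beta^*$ is $-\frac{1}{N}\sum_i \varepsilon_i X_i$:

```python
import numpy as np

def gradient_covariance(N=2000, p=20, reps=100, seed=0):
    # Average of N * g g^T over replications, where g = grad L(beta*)
    # for the supervised least-squares loss under the Toeplitz design.
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    L = np.linalg.cholesky(Sigma)          # so that X has covariance Sigma
    acc = np.zeros((p, p))
    for _ in range(reps):
        X = rng.standard_normal((N, p)) @ L.T
        eps = rng.standard_normal(N)       # noise; residuals at beta*
        g = -(X * eps[:, None]).mean(axis=0)
        acc += N * np.outer(g, g)
    return acc / reps

C_hat = gradient_covariance(N=500, p=5, reps=200)
```

Under this design the estimate should approach $\sigma^2\Sigma = \Sigma$; its trace divided by $p$ is then close to 1, consistent with the ∇L column of Table 2 (assuming that is the rescaling used there).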
In particular, for large iteration numbers and unlabeled sample sizes, the semi-supervised surrogate loss has a covariance trace similar to that of $\nabla L$. Table 3 reveals that $\hat\Sigma(\nabla L)$ approximates $C$ for SSDANE when the local unlabeled sample size is large, while the naive semi-supervised loss exhibits a larger difference as the unlabeled sample size increases. Figure 1 conveys similar implications.

Table 3: The norm of the difference $\|\hat\Sigma(\nabla L) - C\|$ for ∇L, ∇L^ss, ∇L̃^(1) and ∇L̃^(5) under labeled local sample size n = 100. Data are generated from a linear model with parameter dimension p = 20.

  m    n*   ∇L     ∇L^ss       ∇L̃^(1)     ∇L̃^(5)
  20   100  1.787  1.787       3.211×10^1  3.331×10^4
       200  1.787  7.892×10^1  1.716×10^1  6.037
       500  1.787  1.121×10^2  8.215       1.963
  50   100  2.224  2.224       7.789×10^1  1.157×10^5
       200  2.224  8.153×10^1  4.016×10^1  1.634×10^1
       500  2.224  1.262×10^2  1.698×10^1  2.290
 100   100  2.138  2.138       1.554×10^2  2.717×10^5
       200  2.138  8.200×10^1  7.753×10^1  3.946×10^1
       500  2.138  1.214×10^2  3.028×10^1  2.202

Figure 1: The eigenvalues of $\hat\Sigma(\nabla L)$ in increasing order under the linear regression model. The labeled local sample size is 100, the number of machines is 50, the unlabeled local sample size takes values in {0, 100, 400}, and the dimension p is 20.

Effect of the Number of Iterations. Next, we study the effect of the number of iterations. We set the number of machines to 100, the unlabeled local sample size to 400, and vary the labeled local sample size over {50, 100, 200}. The curves of the ℓ2-error over the number of iterations are presented in Figure 2. From Figure 2, we can see that in all cases SSDANE and SSDANE-Avg outperform their fully supervised counterparts. The superiority is more obvious when the labeled local sample size n is small. The imputation-based method DANEimp is similar to SSDANE when only one machine has unlabeled data, but the averaged estimator DANEimp-Avg is inferior to both SSDANE-Avg and DANE-Avg.
This indicates that a naive combination of imputation with DANE may lead to a deterioration in performance. Moreover, we find that the fully supervised estimator DANE blows up within several steps when n is small, while the semi-supervised estimators always converge well.

Figure 2: The ℓ2-error over the number of iterations under the linear model. The labeled local sample size takes values in {50, 100, 200}, the number of machines is 100, the unlabeled local sample size is 400, and the dimension p is 20.

Effect of initialization. By Proposition 6, in the linear model the statistical rate always improves as the iteration number grows. In the following experiment, we examine whether the estimation error is sensitive to the initial point. To see this, we take the initial point $\hat\beta^{(0)} = \beta^* + \epsilon$, where the noise $\epsilon$ is sampled from $N(0, I_p)$. It is not hard to see that $\mathbb{E}[|\hat\beta^{(0)} - \beta^*|_2^2] = p$, which violates the assumption in our theory. The results are presented in Table 4.

Table 4: The ℓ2-errors and their standard errors (in parentheses) of the 1-step SSDANE, 5-step SSDANE, 1-step SSDANE-Avg, and 5-step SSDANE-Avg with initialization $\hat\beta^{(0)} = \beta^* + \epsilon$, under labeled local sample size n = 100. Data are generated from a linear model with parameter dimension p = 20.

  m    n*   1-Step SSDANE  1-Step SSDANE-Avg  5-Step SSDANE    5-Step SSDANE-Avg
  20   100  3.434(1.212)   1.169(0.222)       28.978(37.931)   0.126(0.022)
       200  2.135(0.717)   0.682(0.165)       1.476(1.927)     0.125(0.022)
       500  1.248(0.354)   0.580(0.160)       0.145(0.038)     0.125(0.022)
  50   100  3.496(1.273)   1.177(0.221)       33.016(48.324)   0.080(0.016)
       200  2.128(0.722)   0.587(0.132)       1.462(2.004)     0.079(0.015)
       500  1.184(0.338)   0.384(0.107)       0.989(0.032)     0.079(0.015)
 100   100  3.492(1.300)   1.177(0.216)       33.780(51.572)   0.056(0.010)
       200  2.138(0.734)   0.558(0.114)       1.475(2.026)     0.056(0.010)
       500  1.175(0.346)   0.307(0.085)       0.077(0.028)     0.056(0.010)
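The claim that the random initialization $\hat\beta^{(0)} = \beta^* + \epsilon$ with $\epsilon \sim N(0, I_p)$ satisfies $\mathbb{E}[|\hat\beta^{(0)} - \beta^*|_2^2] = p$ follows because $|\epsilon|_2^2$ is a sum of $p$ independent $\chi^2_1$ variables; a quick numerical check:

```python
import numpy as np

# E|eps|_2^2 = p for eps ~ N(0, I_p), so this initialization violates the
# o(1/sqrt(p)) initial-rate condition required by the theory.
rng = np.random.default_rng(0)
p, reps = 20, 20000
eps = rng.standard_normal((reps, p))
mean_sq_norm = (eps ** 2).sum(axis=1).mean()
# mean_sq_norm is close to p = 20
```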
As shown in the linear model, the fully supervised estimator ($n^* = n$) is highly sensitive to the initialization compared with the semi-supervised estimator ($n^* > n$), and increasing the amount of unlabeled data makes the estimator more robust. Moreover, we also consider initializing with distributed batch stochastic gradient descent (BSGD) and local stochastic gradient descent (LSGD). We use a batch size of $0.1n$ for both BSGD and LSGD and run 10 iterations in total. BSGD uses a constant step size, while, following Stich (2019), LSGD uses the step size $C/(t+1)$. We compare SSDANE and SSDANE-Avg initialized by the local estimator, random initialization ($\hat\beta^{(0)} = \beta^* + \epsilon$), early-stopped BSGD, and early-stopped LSGD, and show how their estimation errors evolve with the iterations. The results are shown in Figure 3.

Figure 3: The ℓ2-error over the number of iterations under a linear model with different initializations. The labeled local sample size takes values in {50, 100, 200}, the number of machines is 100, the unlabeled local sample size is 400, and the dimension p is 20.

As we can see from Figure 3, in the linear model all four initialization methods lead to consistent estimators. The SSDANE-Avg method always outperforms the corresponding SSDANE method. Notably, the BSGD- and LSGD-initialized methods exhibit smaller estimation errors and consequently require fewer iterations to attain the optimal rate. In practical applications, the choice of initialization can be tailored to the problem at hand. Specifically, when the local sample size is ample, we recommend initializing with the local minimizer, since it incurs no communication cost.
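The LSGD initializer described above (batch size $0.1n$, step size $C/(t+1)$) can be sketched as follows for the least-squares loss. The function name, the value of $C$, and the final averaging step are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def lsgd_init(X_parts, Y_parts, C=0.5, iters=10, seed=0):
    # Local SGD initialization: each machine runs mini-batch SGD on its own
    # labeled data with decaying step size C/(t+1); the final local iterates
    # are averaged (one communication round) to produce beta^(0).
    rng = np.random.default_rng(seed)
    p = X_parts[0].shape[1]
    betas = []
    for X, Y in zip(X_parts, Y_parts):
        n = X.shape[0]
        b = max(1, int(0.1 * n))                 # batch size 0.1 n
        beta = np.zeros(p)
        for t in range(iters):
            idx = rng.choice(n, size=b, replace=False)
            grad = X[idx].T @ (X[idx] @ beta - Y[idx]) / b  # least-squares gradient
            beta -= (C / (t + 1)) * grad
        betas.append(beta)
    return np.mean(betas, axis=0)

# toy run: m = 5 machines with n = 200 labeled samples each
rng = np.random.default_rng(1)
m, n, p = 5, 200, 5
beta_star = np.full(p, 0.5)
X_parts = [rng.standard_normal((n, p)) for _ in range(m)]
Y_parts = [X @ beta_star + rng.standard_normal(n) for X in X_parts]
beta_init = lsgd_init(X_parts, Y_parts)
```

Only the final averaging requires communication, which is why early-stopped LSGD is an inexpensive way to obtain a rough but consistent starting point.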
Conversely, when the local sample size is small, initializing with BSGD or LSGD is a prudent choice, as it meets the requirement $|\hat\beta^{(0)} - \beta^*|_2 = o_P(p^{-1/2})$.

4.1.2 Results for Logistic Regression

In this section, we consider logistic regression, where the observation $(X, Y)$ is generated from model (21). To solve the optimization problem (13) for the logistic regression model, we apply conjugate gradient descent, motivated by Minka (2003).

Effect of the Number of Machines and Local Unlabeled Data. As in the linear model, we fix the labeled local sample size n at 300 and vary the number of machines m over {20, 50, 100} and the unlabeled local sample size n* over {300, 450, 900}. We compare the one-step SSDANE and SSDANE-Avg with different averaging fractions. The results are shown in Table 5. We observe phenomena similar to those in the linear model, which shows that the unlabeled dataset and the averaging fraction play important roles in reducing the estimation error.

Covariance of the gradient at $\beta^*$. In this part, we compare various quantities of the covariance matrix $\hat\Sigma(\nabla L)$, including the rescaled trace (Table 6) and the difference $\|\hat\Sigma(\nabla L) - C\|$ (Table 7), and we draw the distribution of the eigenvalues of $\hat\Sigma(\nabla L)$ (Figure 4). We can observe a similar phenomenon as in the linear model.

Table 5: The ℓ2-errors and their standard errors (in parentheses) of the one-step SSDANE, SSDANE-Avg(0.2), SSDANE-Avg(0.5) and SSDANE-Avg(1) under labeled local sample size n = 300. Data are generated from a logistic regression model with parameter dimension p = 20.
  m    n*   SSDANE        SSDANE-Avg(0.2)  SSDANE-Avg(0.5)  SSDANE-Avg(1)
  20   300  0.692(0.267)  0.414(0.105)     0.347(0.074)     0.325(0.061)
       450  0.510(0.154)  0.347(0.076)     0.309(0.064)     0.294(0.053)
       900  0.375(0.080)  0.299(0.053)     0.284(0.052)     0.278(0.049)
  50   300  0.656(0.301)  0.280(0.072)     0.247(0.054)     0.239(0.047)
       450  0.455(0.176)  0.226(0.053)     0.204(0.042)     0.199(0.037)
       900  0.309(0.079)  0.189(0.035)     0.178(0.032)     0.177(0.031)
 100   300  0.634(0.281)  0.226(0.046)     0.210(0.039)     0.204(0.036)
       450  0.437(0.169)  0.172(0.033)     0.161(0.028)     0.157(0.028)
       900  0.285(0.079)  0.137(0.023)     0.131(0.022)     0.129(0.022)

Table 6: The rescaled trace of the covariance Σ̂(∇L) for ∇L, ∇L^ss, ∇L̃^(1) and ∇L̃^(5) under labeled local sample size n = 300. Data are generated from a logistic regression model with parameter dimension p = 20.

  m    n*   ∇L     ∇L^ss  ∇L̃^(1)  ∇L̃^(5)
  20   300  0.074  0.074  0.222    0.271
       450  0.074  0.218  0.174    0.091
       900  0.074  0.355  0.130    0.077
  50   300  0.073  0.073  0.441    0.693
       450  0.073  0.227  0.325    0.108
       900  0.073  0.348  0.211    0.078
 100   300  0.076  0.076  0.809    1.421
       450  0.076  0.213  0.570    0.133
       900  0.076  0.344  0.336    0.078

Effect of the Number of Iterations. In this experiment, we fix the number of machines at 100 and the unlabeled local sample size at 600, and vary the labeled local sample size over {100, 150, 300}. From Figure 5, we find that the semi-supervised estimators SSDANE and SSDANE-Avg are more stable and accurate, especially when the labeled local sample size n is small. Moreover, DANEimp-Avg always performs worse than SSDANE-Avg.

Effect of initialization. Theorems 1 and 4 suggest that the algorithm converges only when the initial rate satisfies $r_n = o(1/\sqrt{p})$. In this experiment, we check whether this requirement is necessary. As in the linear model, we take the initial point $\hat\beta^{(0)} = \beta^* + \epsilon$, where the noise $\epsilon$ is sampled from $N(0, I_p)$. The ℓ2-errors are reported in Table 8, where '/' indicates that the algorithm blows up.
We can observe a similar trend as in the linear model: the semi-supervised estimator is more robust to the initialization than its fully supervised counterpart. Additionally, the SSDANE algorithm exhibits greater robustness than SSDANE-Avg. These findings suggest that there may be scope for further improvement in Theorems 1 and 4. We also find that the 5-step SSDANE exhibits a large variance, or even blows up, when the unlabeled sample size n* is small. We conjecture that this behavior is due to an unfavorable initialization coupled with the small value of n*. Indeed, Theorems 1 and 4 indicate that when $\sqrt{p}\,r_n$ or $\sqrt{p \log n^*/n^*}$ is greater than 1, the statistical rate deteriorates as the number of steps increases. This potentially explains why the 5-step SSDANE is worse than its single-step counterpart in certain instances.

Table 7: The norm of the difference $\|\hat\Sigma(\nabla L) - C\|$ for ∇L, ∇L^ss, ∇L̃^(1) and ∇L̃^(5) under labeled local sample size n = 300. Data are generated from a logistic regression model with parameter dimension p = 20.

  m    n*   ∇L     ∇L^ss  ∇L̃^(1)  ∇L̃^(5)
  20   300  0.152  0.152  0.937    1.578
       450  0.152  0.935  0.668    0.224
       900  0.152  1.702  0.411    0.163
  50   300  0.138  0.138  2.250    5.041
       450  0.138  0.967  1.567    0.349
       900  0.138  1.662  0.908    0.143
 100   300  0.146  0.146  4.424    10.456
       450  0.146  0.900  2.975    0.549
       900  0.146  1.610  1.618    0.154

Figure 4: The eigenvalues of $\hat\Sigma(\nabla L)$ in increasing order under the logistic regression model. The labeled local sample size is 300, the number of machines is 50, the unlabeled local sample size takes values in {0, 150, 600}, and the dimension p is 20.
We further investigate a range of initialization strategies, including the local minimizer, random initialization, and early-stopped BSGD and LSGD, mirroring the linear-model experiments. The hyperparameters are chosen as in Section 4.1.1.

Figure 5: The ℓ2-error over the number of iterations under the logistic regression model. The labeled local sample size takes values in {100, 150, 300}, the number of machines is 100, the unlabeled local sample size is 600, and the dimension p is 20.

Table 8: The ℓ2-errors and their standard errors (in parentheses) of the 1-step SSDANE, 5-step SSDANE, 1-step SSDANE-Avg, and 5-step SSDANE-Avg with initialization $\hat\beta^{(0)} = \beta^* + \epsilon$, under labeled local sample size n = 300. Data are generated from a logistic regression model with parameter dimension p = 20. A '/' denotes that the algorithm blows up.

  m    n*   1-Step SSDANE    1-Step SSDANE-Avg  5-Step SSDANE  5-Step SSDANE-Avg
  20   300  10.193(20.198)   115.690(450.732)   /              /
       450  4.715(10.561)    3.239(4.366)       0.431(0.252)   /
       900  1.532(2.203)     1.042(1.651)       0.264(0.053)   /
  50   300  10.245(21.135)   133.470(316.598)   /              /
       450  4.963(12.696)    2.994(3.789)       /              /
       900  1.385(1.614)     0.860(1.254)       0.174(0.047)   /
 100   300  10.197(21.368)   89.883(180.015)    /              /
       450  4.912(13.116)    7.447(44.892)      /              /
       900  1.400(2.047)     0.757(0.961)       0.127(0.043)   /

In contrast with the findings in Figure 3, we find that random initialization may cause the SSDANE-Avg method to diverge. However, as the local sample size increases, the SSDANE-Avg method converges. This observation suggests that both SSDANE and SSDANE-Avg benefit from a well-chosen initialization, and that a larger local sample size helps mitigate the sensitivity to initialization. This phenomenon can be partially explained by Theorems 1 and 4.
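For the linear model, the one-step SSDANE iterate evaluated throughout these simulations admits the closed form given in the proof of Proposition 7: $\hat\beta^{(1)} = \hat\beta^{(0)} - (\hat\Sigma^{ss}_1)^{-1}\{\frac{1}{mn}\sum_i X_iX_i^T\hat\beta^{(0)} - \frac{1}{mn}\sum_i Y_iX_i\}$, where the preconditioner $\hat\Sigma^{ss}_1$ is built from all samples (labeled and unlabeled) on the master machine. A minimal single-process sketch, with illustrative names and a simplified data layout (the unlabeled pool is generated separately from the labeled set):

```python
import numpy as np

def ssdane_one_step(beta0, X_labeled, Y_labeled, X_master_all):
    # One SSDANE round for the linear model:
    #   beta1 = beta0 - (Sigma_ss)^{-1} * global labeled gradient at beta0,
    # where Sigma_ss averages X X^T over ALL master-machine samples.
    Sigma_ss = X_master_all.T @ X_master_all / X_master_all.shape[0]
    grad = X_labeled.T @ (X_labeled @ beta0 - Y_labeled) / X_labeled.shape[0]
    return beta0 - np.linalg.solve(Sigma_ss, grad)

# toy run: p = 5, mn = 500 labeled samples in total, n* = 2000 master samples
rng = np.random.default_rng(0)
p, mn, n_star = 5, 500, 2000
beta_star = np.linspace(1.0, 0.2, p)
X = rng.standard_normal((mn, p))
Y = X @ beta_star + rng.standard_normal(mn)
X_master = rng.standard_normal((n_star, p))       # master's (unlabeled) pool
beta0 = beta_star + 0.5 * rng.standard_normal(p)  # rough initial estimate
beta1 = ssdane_one_step(beta0, X, Y, X_master)
# beta1 should be closer to beta_star than beta0
```

The unlabeled data enters only through the Hessian estimate $\hat\Sigma^{ss}_1$, which is why a large $n^*$ sharpens the Newton direction without requiring any extra labels.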
4.2 Application to Real-World Benchmarks

In this section, we analyze the CelebA dataset² from the Kaggle website, which is included in LEAF (Caldas et al., 2018), a standard distributed learning benchmark. Our aim is to train a classifier that distinguishes young people from old ones. In the dataset, each observation has 39 attributes. We take the total sample size to be 120000 and randomly partition the dataset into 20000 testing data, 20000 labeled training data, and 80000 unlabeled training data. We perform 100 random partitions of the dataset and report the averaged classification error on the testing set. We consider three cases, where (m, n) = (20, 1000), (40, 500), and (80, 250), respectively. The results are shown in Figure 7. It is not hard to see that the proposed semi-supervised methods always outperform their supervised counterparts, and the superiority of the semi-supervised method is more obvious when the labeled local sample size n is small. While the imputation-based method DANEimp is comparable to SSDANE, the averaged estimator DANEimp-Avg is much worse than SSDANE-Avg. This result again reveals that naively combining imputation with DANE may not improve the training performance.

2 https://www.kaggle.com/datasets/jessicali9530/celeba-dataset

Figure 6: The ℓ2-error over the number of iterations under a logistic model with different initializations. The labeled local sample size takes values in {100, 150, 300}, the number of machines is 100, the unlabeled local sample size is 600, and the dimension p is 20.

Figure 7: CelebA dataset. The classification error over iterations under various pairs of (m, n). The labeled training sample size is 20000, the unlabeled sample size is 80000, the test sample size is 20000, and the dimension p is 39.

5. Concluding Remarks and Future Study

In this paper, we study the semi-supervised generalized linear model in the distributed setup.
With the assistance of the additional unlabeled data, we theoretically show that the proposed estimators enjoy higher statistical accuracy than their fully supervised counterparts. The superiority is also illustrated by numerical experiments. In the future, several directions are worth exploring. First, it is interesting to study distributed semi-supervised learning in a high-dimensional regime. We believe that the unlabeled data can yield a similar accuracy gain when combined with the one-shot averaging method (Lee et al., 2017) and the multi-round distributed sparse learning methods (Wang et al., 2017; Jordan et al., 2019). Second, while we focus on the scenario where the model is well-specified, it is important to consider distributed semi-supervised learning for misspecified models (Bellec et al., 2018; Chakrabortty and Cai, 2018; Zhang and Bradic, 2021; Deng et al., 2020; Azriel et al., 2022). Third, it will be of great importance to explore distributed semi-supervised ridge regression (Dobriban and Sheng, 2021, 2020; Sheng and Dobriban, 2020). Lastly, it would be of great interest to combine research on distributed training with imbalanced data (see, e.g., Duan et al. (2021a,b); Chen et al. (2023)) with our distributed semi-supervised approach.

Acknowledgments

Jiyuan Tu's research is supported by the Fundamental Research Funds for the Central Universities. Weidong Liu's research is supported by NSFC Grant No. 11825104. Xiaojun Mao's research is supported by NSFC Grant No. 12371273, the Shanghai Rising-Star Program 23QA1404600, and the Young Elite Scientists Sponsorship Program by CAST (2023QNRC001).
The authors would like to thank the action editor and three anonymous referees for their constructive comments, which greatly improved the quality of the paper.

The appendix consists of three parts. In Appendix A, we provide proofs of the theoretical results for SSDANE and SSDANE-Avg in Section 3.1. In Appendix B, we prove the theory in Section 3.2, especially the rate lower bound of SSDANE for the linear model. Appendix C is devoted to proving the normality result for the semi-supervised generalized linear model in the single-machine setting, namely the theorems in Section 3.3.

Appendix A. Proof of Theories in Section 3.1

Proof [Proof of Equation (6)] Notice that
$$\mathbb{E}\big[-Y_i X_i + X_i\psi'(X_i^T\beta^*)\big] = 0, \qquad \mathbb{E}\Big[\frac{1}{n^*}\sum_{i \in H} X_i\psi'(X_i^T\beta^*) - \frac{1}{n}\sum_{i \in D} X_i\psi'(X_i^T\beta^*)\Big] = 0,$$
and
$$\operatorname{Cov}\Big(\frac{1}{n}\sum_{i \in D}\big\{-Y_i X_i + X_i\psi'(X_i^T\beta^*)\big\},\ \frac{1}{n^*}\sum_{i \in H} X_i\psi'(X_i^T\beta^*) - \frac{1}{n}\sum_{i \in D} X_i\psi'(X_i^T\beta^*)\Big) = 0,$$
so we have that
$$\operatorname{Cov}\Big(-\frac{1}{n}\sum_{i \in D} Y_i X_i + \frac{1}{n^*}\sum_{i \in H} X_i\psi'(X_i^T\beta^*)\Big) = \operatorname{Cov}\Big(\frac{1}{n}\sum_{i \in D}\big\{-Y_i X_i + X_i\psi'(X_i^T\beta^*)\big\}\Big) + \operatorname{Cov}\Big(\frac{1}{n^*}\sum_{i \in H} X_i\psi'(X_i^T\beta^*) - \frac{1}{n}\sum_{i \in D} X_i\psi'(X_i^T\beta^*)\Big). \qquad (24)$$
We compute the two terms respectively. For the first term, since the $X_i$ are independent of each other, we have
$$\operatorname{Cov}\Big(\frac{1}{n}\sum_{i \in D}\big\{-Y_i X_i + X_i\psi'(X_i^T\beta^*)\big\}\Big) = \frac{1}{n}\operatorname{Cov}\big(-YX + X\psi'(X^T\beta^*)\big).$$
For the second term in (24), writing $\frac{1}{n^*}\sum_{i \in H} - \frac{1}{n}\sum_{i \in D} = -\frac{n^* - n}{n n^*}\sum_{i \in D} + \frac{1}{n^*}\sum_{i \in H \setminus D}$ and using the independence of the $X_i$, we compute that
$$\operatorname{Cov}\Big(\frac{1}{n^*}\sum_{i \in H} X_i\psi'(X_i^T\beta^*) - \frac{1}{n}\sum_{i \in D} X_i\psi'(X_i^T\beta^*)\Big) = \Big\{\frac{n(n^* - n)^2}{(n n^*)^2} + \frac{n^* - n}{(n^*)^2}\Big\}\operatorname{Cov}\big(X\psi'(X^T\beta^*)\big) = \frac{n^* - n}{n n^*}\operatorname{Cov}\big(X\psi'(X^T\beta^*)\big).$$
Then (6) can be obtained by substituting the above two formulas into (24).

Lemma 9 Given a convex, second-order differentiable loss function $L(\beta)$, suppose there exist constants $\rho_0, r_0$ such that
$$\inf_{\beta: |\beta - \beta^*|_2 \le r_0}\ \inf_{v \in \mathbb{S}^{p-1}} v^T\nabla^2 L(\beta)v \ge \rho_0, \qquad (25)$$
and $r_0 > 3\rho_0^{-1}|\nabla L(\beta^*)|_2$. Then the minimizer $\hat\beta$ of $L(\beta)$ satisfies $|\hat\beta - \beta^*|_2 \le 3\rho_0^{-1}|\nabla L(\beta^*)|_2$.

Proof For simplicity, denote $b_n = 3\rho_0^{-1}|\nabla L(\beta^*)|_2$ and construct the set $\Theta_1 = \{\beta : |\beta - \beta^*|_2 = b_n\}$ in the parameter space. Then clearly $\Theta_1 \subset \{\beta : |\beta - \beta^*|_2 \le r_0\}$. We will show that $L(\beta) > L(\beta^*)$ holds strictly and uniformly for all $\beta \in \Theta_1$.
To show this, we compute that 0 (β β )T L β + t(β β ) dt =(β β )T L(β ) + Z 1 0 (1 t)(β β )T 2L β + t(β β) (β β )dt 2 |β β |2 2 | L(β )|2 |β β |2 2 3ρ 1 0 | L(β )|2|β β |2 | L(β )|2 |β β |2 > 0. Then by convexity of the loss function L(β), we have that |β β |2 3 ρ0 | L(β )|2, which completes the proof. Lemma 10 Suppose Assumption 1 to 3 hold. Assume r = o(1), then for each t 1, there is inf β:|β β |2 r inf v Sp 1 v T 2 e L(t)(β)v ρ holds with probability not less than 1 O((n ) pτ) for some constant τ > 0. Proof For arbitrary β satisfying |β β|2 r, there is v T 2 e L(t)(β)v i H1 ψ XT i β v TXi 2 =v TE ψ (XTβ )XXT v + 1 0 ψ XT i β + t(β β ) XT i (β β ) v TXi 2dt i H1 ψ XT i β Xi XT i E ψ (XTβ )XXT Tu, Liu, and Mao It left to bound the term |T1| and T 2 respectively. Bound of |T1|: From Assumption 3, It is not hard to see that 0 ψ XT i β + t(β β ) XT i (β β ) v TXi 2dt cψr sup v Sp 1 i H1 |v TXi|3 Denote N as the 1/2-net of Sp 1, by Lemma 5.2 of Vershynin (2010) we know that |N| 5p. Then there is i H1 |v TXi|3 i H1 |ev TXi|3 + 4 max ev N sup v:|v ev|2 1/2 i H1 |(ev v)TXi|3 i H1 |ev TXi|3 2 sup v Sp 1 i H1 |v TXi|3 i H1 |v TXi|3 i H1 |ev TXi|3 i H1 |ev TXi|3 E |ev TXi|3 + 8 sup v Sp 1 E |ev TXi|3 . Note that for each ev N, there is max ev N E h exp n η3/2|ev TXi|3 2/3oi sup ev Sp 1 E h exp n η|ev TXi|2oi C0. We know that |ev TXi|3 s are sub-Weibull(2/3) random variables. Therefore, by Theorem 3.1 of Kuchibhotla and Chakrabortty (2022), there exist constants c3, τ > 0 such that i H1 |ev TXi|3 E |ev TXi|3 c3 n + (p log n )3/2 5p max ev N P i H1 |ev TXi|3 E |ev TXi|3 c3 n + (p log n )3/2 5p(n ) τ1p (n ) (τ1 1)p, Distributed Estimation on Semi-Supervised Generalized Linear Model which tends to 0 when τ1 > 1. On the other hand, we compute that sup v Sp 1 E |ev TXi|3 2η3/2 0 sup v Sp 1 E exp η0|ev TXi|2 C0 2η3/2 0 , (29) where the second line uses the elementary inequality |x|3 1 2e|x|. 
Combining (27), (28) and (29) we have that |T1| cψr sup v Sp 1 i H1 |v TXi|3 8cψr max ev N i H1 |ev TXi|3 E |ev TXi|3 + 8cψr sup v Sp 1 E |ev TXi|3 n + (p log n )3/2 Bound of T 2 : Denote N as the 1/4-net of Sp 1, by Lemma 5.2 of Vershynin (2010) we know that |N| 9p. Then there is T 2 = sup v Sp 1 ev TT 2ev + 2 max ev N sup v:|v ev|2 1/4 ev TT 2(ev v) + max ev N sup v:|v ev|2 1/4 (ev v)TT 2(ev v) ev TT 2ev + 1 2 sup v Sp 1 v TT 2v + 1 16 sup v Sp 1 ev TT 2ev . (30) For each ev N, it is not hard to see that ψ XT i β (v TXi)2 is sub-exponential random variable. Therefore we can apply Lemma 1 of Cai and Liu (2011) and yield max ev N 1 n X i H1 ψ XT i β (v TXi)2 E ψ (XTβ )(v TX)2 c4 9p max ev N P i H1 ψ XT i β (v TXi)2 E ψ (XTβ )(v TX)2 c4 9p(n ) τ2p (n ) (τ2 1)p, for some constant τ2 > 0. Substitute this bound into (30) we have that T 2 = o(1). Substitute the bound of |T1| and T 2 into (26) we finally have that v T 2 e L(t)(β)v v TE ψ (XTβ )XXT v |T1| T 2 ρ Tu, Liu, and Mao hold with probability not less than 1 (n ) (τ1 1)p (n ) (τ2 1)p = 1 O((n ) τp) by taking τ = min{τ1 1, τ2 1}, which proves the lemma. Proof [Proof of Theorem 1] For simplicity, we first consider the convergence rate of bβ (1). By Lemma 10, we know the condition (25) is sufficed with high probability. Now using Lemma 9, we know that it is enough to bound | e L(1)(β )|2. Indeed, we can compute that e L(1)(β ) 2 = Lss 1 (β ) Lss 1 (bβ (0)) + 1 j=1 Lj(bβ (0)) Lss 1 (β ) Lss 1 (bβ (0)) E ψ (XTβ )XXT (β bβ (0)) | {z } T1 h Lj(bβ (0)) Lj(β ) E ψ (XTβ )XXT (bβ (0) β ) i For the term T1, there is n ψ (XT i β ) ψ (XT i bβ (0)) o Xi E ψ (XTβ )XXT (β bβ (0)) ψ (XT i β )Xi XT i E ψ (XTβ )XXT (β bβ (0)) 0 ψ XT i {bβ (0) + t(β bβ (0))} Xi Xi(β bβ (0)) 2dt. Following the same strategy as in the proof of Lemma 10, we have that ψ (XT i β )Xi XT i E ψ (XTβ )XXT (β bβ (0)) 0 ψ XT i {bβ (0) + t(β bβ (0))} Xi Xi(β bβ (0)) 2dt = OP pr2 n . 
Therefore we have Similarly, we can show that Distributed Estimation on Semi-Supervised Generalized Linear Model T3 is the average of mn copies of Y X ψ(XTβ )X, thus we can simply apply Lemma 1 of Cai and Liu (2011) and yield Plugging (32), (33) and (34) into (31) we have e L(1)(β ) 2 = OP Therefore, by Lemma 9 we obtain bβ (1) β 2 = OP Apply the above formula iteratively we can obtain Theorem 1. Proof [Proof of Theorem 2] Since |bβ (t 1) β |2 = OP( p p log n /(mn)), follow the same strategy as in the proof of Theorem 1, we can show that |bβ (t) β |2 = OP Lss 1 (β ) Lss 1 (bβ (t)) E ψ (XTβ )XXT (β bβ (t)) 2 mnn + p3/2 log n By optimality condition of bβ (t) we have that 0 = e L(t)(bβ (t)) = Lss 1 (bβ (t)) Lss 1 (β ) + Lss 1 (β ) Lss 1 (bβ (t 1)) + 1 j=1 Lj(bβ (t 1)) = Lss 1 (bβ (t)) Lss 1 (β ) + 1 j=1 Lj(β ) + OP =E ψ (XTβ )XXT (bβ (t) β ) + 1 j=1 Lj(β ) + OP mnn + p3/2 log n bβ (t) β = E ψ (XTβ )XXT 1 1 j=1 Lj(β ) + OP mnn + p3/2 log n Tu, Liu, and Mao For each ev Sp 1, we know i Dj (ψ (XT i β ) Yi)ev TXi, which is the average of mn i.i.d. terms. Compute that Var h (ψ (XT i β ) Yi)ev TXi i = c(σ)E h ψ (XTβ )(ev TX)2i . Therefore, by the central limit theorem, there is mn σ(ev) ev T 1 j=1 Lj(β ) d N(0, 1), {σ(ev)}2 = ev TCev, and C = c(σ)E ψ (XTβ )XXT . Next we assume (p log n )2 = o(n ) and p3 log2 n = o(mn), and replace ev by E ψ (XTβ )XXT 1 v, we can obtain the asymptotic normality result for v T(bβ (t) β ). Proof [Proof of Theorem 4] We first consider the convergence rate of bβ (1) Avg. 
For each bβ (1) j = argmin e L(1) j (β) (where j U), by Theorem 1 we know that bβ (1) j β 2 = OP By the optimality condition, there is 0 = e L(1) j (bβ (1) j ) = Lss j (bβ (1) j ) Lss j (bβ (0)) + 1 k=1 Lk(bβ (0)) =E n ψ (XTβ )XXTo (bβ (1) j β ) + Lss j (bβ (1) j ) Lss j (β ) E n ψ (XTβ )XXTo (bβ (1) j β ) + Lss j (β ) Lss j (bβ (0)) + 1 k=1 Lk(bβ (0)) bβ (1) j β = h E n ψ (XTβ )XXToi 1 " Lss j (bβ (1) j ) Lss j (β ) E n ψ (XTβ )XXTo (bβ (1) j β ) + Lss j (β ) Lss j (bβ (0)) + 1 k=1 Lk(bβ (0)) Distributed Estimation on Semi-Supervised Generalized Linear Model Take the average over j U and take the norm on both sides, we have that bβ (1) Avg β 2 = j U bβ (1) j β n Lss j (bβ (1) j ) Lss j (β ) E n ψ (XTβ )XXTo (bβ (1) j β ) o 2 n Lss j (β ) Lss j (bβ (0)) + Lj(bβ (0)) o 2 Lss j (bβ (1) j ) Lss j (β ) E n ψ (XTβ )XXTo (bβ (1) j β ) 2 | {z } T1 n Lss j (β ) Lss j (bβ (0)) + Lj(bβ (0)) Following the strategy of the proof of Theorem 1, we can prove that max 1 j m |bβ (1) j β |2 n + p max 1 j m |bβ (1) j β |2 2 ! mnn + rn p log n n + p3/2r4 n |U|n + pr2 n Plug them into (35) we have that bβ (1) Avg β 2 = OP mn + rn p log n |U|n + pr2 n Then Theorem 4 can be proved by applying Theorem 4 inductively. Appendix B. Proof of Theories in Section 3.2 Proof [Proof of Proposition 6] Notice that in the linear regression model, the canonical link function ψ (x) = x satisfies ψ (x) = 0. Therefore, in the proof of Theorem 1, we can obtain that Lss 1 (β ) Lss 1 (bβ (0)) E ψ XTβ XXT (β bβ (0)) 2 = OP rn j=1 Lj(β ) 1 j=1 Lj(bβ (0)) E ψ XTβ XXT (β bβ (0)) 2 = OP rn Tu, Liu, and Mao where the term pr2 n is eliminated. Therefore, the corollary is proved by plugging the above bound into (31) of Theorem 1. Proof [Proof of Proposition 7] For linear model, given the initial estimator bβ (0), the onestep SSDANE can be written explicitly as follows bβ (1) = bβ (0) (bΣ ss 1 ) 1n 1 i Dj Xi XT i bβ (0) 1 mn i Dj Yi Xi o , where bΣ ss 1 = (n ) 1 P i H1 Xi XT i . 
Then the statistical error bβ (1) β can be written down explicitly as follows =(bΣ ss 1 ) 1n 1 i H0 Xi XT i (bβ (0) β ) 1 mn i Dj Xi XT i bβ (0) + 1 mn i Dj Yi Xi o =(bΣ ss 1 ) 1h 1 i H1 Xi XT i 1 mn i Dj Xi XT i (bβ (0) β ) + 1 mn i Dj ϵi Xi i =(bΣ ss 1 ) 1 " X Xi XT i Σ (bβ (0) β ) + X Xi XT i Σ (bβ (0) β ) + 1 mnϵi Xi o n Xi XT i Σ (bβ (0) β ) + ϵi Xi o# (bΣ ss 1 ) 1T . (36) Since we already assume that bβ (0) is independent to all Xi s and ϵi s, there are sum of m(n 1)+n independent, zero-mean terms in the square bracket. For notational convenience we denote Zi = Xi XT i Σ (bβ (0) β )/|bβ (0) β |2 and W i = ϵi Xi, then we have that Cov(Zi) = Cov Xi XT i bβ (0) β |bβ (0) β |2 , Cov(W i) = σ2Σ, Cov(Zi, W i) = 0. Thus we can compute that (n )2 r2 n Cov(Z) + n(mn n )2 (mnn )2 r2 n Cov(Z) + n (mn)2 Cov(W ) + n(m 1) (mn)2 r2 n Cov(Z) + n(m 1) (mn)2 Cov(W ) r2 n Cov(Z) + 1 mn Cov(W ). (37) Distributed Estimation on Semi-Supervised Generalized Linear Model Moreover, there is mnn rn Zi + 1 mn W i 1 mnrn Zi + 1 mn W i 2 + 1 mn W i 2 + 1 mn W i (n )2 + 16 (mn)2 r3 n E |Z|3 2 + 4 (mn)2 E |W |3 2 C0p3/2 r3 n (n )2 + 1 (mn)2 for some constant C0 > 0. Denote e T = (Cov(T )) 1/2T , then we can apply multivariate Berry-Esseen theorem (see Theorem 1.1 of Raiˇc (2019)) and yield P(e T S) N(0, Ip){S} C1p7/4 min 1 n , (rn mn)3 for all measurable convex S Rp. Let T = (T 1, ..., T p)T follows p-dimensional standard Gaussian distribution N(0, Ip), then we can apply Lemma 1 of Cai and Liu (2011) to the i.i.d. sequence T 2 i 1 and obtain that l=1 (T 2 l 1) p l=1 (T 2 l 1) p for sufficiently large p. 
Then there is p/2) P(|T | p p/2) P(|e T | p p/2) P(|T | p By assumption λmin(Cov(Z)), λmin(Cov(W )) ρ1, then from (37) we know λmin Cov(T ) C3ρ1 r2 n n + 1 mn Then there is P |T 2|2 C3ρ1 P |T 2|2 rp Following the strategy, in the proof of Lemma 10 we can show that P λmin (bΣ ss) 1 ρ = P λmax bΣ ss 2ρ 1 1 Tu, Liu, and Mao Therefore by combining (38) and (36), we know that P |bβ (1) β |2 ρ P |T 2|2 C3ρ1 2 ; λmin (bΣ ss) 1 ρ P |T 2|2 C3ρ1 P λmin (bΣ ss) 1 ρ which proves the theorem. Appendix C. Proof of Theories in Section 3.3 Proof [Proof of Proposition 8] We first prove the convergence rate of bβ ss. Notice that the semi-supervised empirical loss Lss(β) defined in (5) has the same Hessian matrix as the surrogate loss e L(1)(β) defined in (10). Therefore by Lemma 10, it holds that inf β:|β β |2 r inf v Sp 1 v T 2Lss(β)v ρ Then by Lemma 9, we only need to bound the term | Lss(β )|2. Note that for arbitrary unit vector v Sp 1, there is i D Yiv TXi + 1 i H ψ (XT i β )v TXi Yiv TXi + E ψ (XT i β )v TXi + 1 ψ (XT i β )v TXi E ψ (XT i β )v TXi , Denote E(X, η) = E[X2 exp(η|X|)], we compute that E Yiv TXi + E ψ (XT i β )v TXi , η0 η2 0 E h exp 2η0 Yiv TXi + E ψ (XT i β )v TXi i n E h exp 2η0 Yiv TXi io2 n E h exp η0 Yi 2 exp η0 v TXi 2 io2 1 where the second line uses the elementary inequality x2 ex, the third line uses Jensen s inequality, and the last line uses Cauchy s inequality. Similarly, we have that E ψ (XT i β )v TXi E ψ (XT i β )v TXi , η0 1 Distributed Estimation on Semi-Supervised Generalized Linear Model Then we can apply Lemma 1 of Cai and Liu (2011) and yield that Yiv TXi + E ψ (XT i β )v TXi = OP ψ (XT i β )v TXi E ψ (XT i β )v TXi = OP Therefore, we know that there exists a constant c1 such that | Lss(β )|2 = e T l Lss(β ) 2 with probability not less than 1 (n ) τ, where τ > 0 is some positive constant and el (where l = 1, ..., p) denotes the l-th coordinate vectors. 
Therefore, by Lemma 9, we know that
$$
\big| \hat{\beta}_{ss} - \beta^* \big|_2 = O_P\Big( \sqrt{\frac{p\log n^*}{n}} \Big).
$$
On the other hand, by the definition of $\hat{\beta}_{ss}$ in (22) we have that
$$
0 = \nabla\mathcal{L}_{ss}(\hat{\beta}_{ss}) = -\frac{1}{n}\sum_{i\in\mathcal{D}} Y_iX_i + \frac{1}{n^*}\sum_{i\in\mathcal{H}} \psi'(X_i^{\mathrm{T}}\hat{\beta}_{ss})X_i
$$
$$
= -\frac{1}{n}\sum_{i\in\mathcal{D}} Y_iX_i + \frac{1}{n^*}\sum_{i\in\mathcal{H}} \psi'(X_i^{\mathrm{T}}\beta^*)X_i + \frac{1}{n^*}\sum_{i\in\mathcal{H}}\big\{ \psi'(X_i^{\mathrm{T}}\hat{\beta}_{ss}) - \psi'(X_i^{\mathrm{T}}\beta^*) \big\}X_i
$$
$$
= \nabla\mathcal{L}_{ss}(\beta^*) + \frac{1}{n^*}\sum_{i\in\mathcal{H}}\int_0^1 \psi''\big( X_i^{\mathrm{T}}\beta^* + tX_i^{\mathrm{T}}(\hat{\beta}_{ss}-\beta^*) \big)\,dt\; X_iX_i^{\mathrm{T}}(\hat{\beta}_{ss}-\beta^*). \quad (40)
$$
The last term can be decomposed as
$$
\frac{1}{n^*}\sum_{i\in\mathcal{H}}\int_0^1 (1-t)\,\psi'''\big( X_i^{\mathrm{T}}\beta^* + tX_i^{\mathrm{T}}(\hat{\beta}_{ss}-\beta^*) \big)\big\{ X_i^{\mathrm{T}}(\hat{\beta}_{ss}-\beta^*) \big\}^2X_i\,dt
+ \Big( \frac{1}{n^*}\sum_{i\in\mathcal{H}} \psi''(X_i^{\mathrm{T}}\beta^*)X_iX_i^{\mathrm{T}} - E[\psi''(X^{\mathrm{T}}\beta^*)XX^{\mathrm{T}}] \Big)(\hat{\beta}_{ss}-\beta^*)
+ E[\psi''(X^{\mathrm{T}}\beta^*)XX^{\mathrm{T}}](\hat{\beta}_{ss}-\beta^*).
$$
Using the results in the proof of Lemma 10, we know that
$$
\Big| \frac{1}{n^*}\sum_{i\in\mathcal{H}}\int_0^1 (1-t)\,\psi'''(\cdot)\big\{ X_i^{\mathrm{T}}(\hat{\beta}_{ss}-\beta^*) \big\}^2X_i\,dt \Big|_2 = O_P\big( \sqrt{p}\,|\hat{\beta}_{ss}-\beta^*|_2^2 \big),
$$
$$
\Big| \Big( \frac{1}{n^*}\sum_{i\in\mathcal{H}} \psi''(X_i^{\mathrm{T}}\beta^*)X_iX_i^{\mathrm{T}} - E[\psi''(X^{\mathrm{T}}\beta^*)XX^{\mathrm{T}}] \Big)(\hat{\beta}_{ss}-\beta^*) \Big|_2 = O_P\Big( \sqrt{\frac{p}{n^*}}\,|\hat{\beta}_{ss}-\beta^*|_2 \Big).
$$
Substituting these bounds into (40), we have that
$$
E[\psi''(X^{\mathrm{T}}\beta^*)XX^{\mathrm{T}}](\hat{\beta}_{ss}-\beta^*) = -\nabla\mathcal{L}_{ss}(\beta^*) + O_P\Big( \sqrt{p}\,|\hat{\beta}_{ss}-\beta^*|_2^2 + \sqrt{\frac{p}{n^*}}\,|\hat{\beta}_{ss}-\beta^*|_2 \Big),
$$
and therefore
$$
\hat{\beta}_{ss} - \beta^* = -\big\{ E[\psi''(X^{\mathrm{T}}\beta^*)XX^{\mathrm{T}}] \big\}^{-1}\nabla\mathcal{L}_{ss}(\beta^*) + O_P\Big( \sqrt{p}\,|\hat{\beta}_{ss}-\beta^*|_2^2 + \sqrt{\frac{p}{n^*}}\,|\hat{\beta}_{ss}-\beta^*|_2 \Big). \quad (41)
$$
For each $\tilde{v}\in\mathbb{S}^{p-1}$, we know
$$
\tilde{v}^{\mathrm{T}}\nabla\mathcal{L}_{ss}(\beta^*) = -\frac{1}{n}\sum_{i\in\mathcal{D}} Y_i\tilde{v}^{\mathrm{T}}X_i + \frac{1}{n^*}\sum_{i\in\mathcal{H}} \psi'(X_i^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X_i
$$
$$
= \sum_{i\in\mathcal{D}}\Big\{ -\frac{1}{n}Y_i\tilde{v}^{\mathrm{T}}X_i + \frac{1}{n^*}\psi'(X_i^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X_i + \frac{n^*-n}{nn^*}E[\psi'(X^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X] \Big\} + \sum_{i\in\mathcal{H}\setminus\mathcal{D}}\Big\{ \frac{1}{n^*}\psi'(X_i^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X_i - \frac{1}{n^*}E[\psi'(X^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X] \Big\},
$$
which is the sum of $n^*$ independent, zero-mean random variables. Compute that
$$
\mathrm{Var}\Big( -\frac{1}{n}Y_i\tilde{v}^{\mathrm{T}}X_i + \frac{1}{n^*}\psi'(X_i^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X_i + \frac{n^*-n}{nn^*}E[\psi'(X^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X] \Big)
= \frac{1}{n^2}\mathrm{Var}\Big[ \big( Y_i - \psi'(X_i^{\mathrm{T}}\beta^*) \big)\tilde{v}^{\mathrm{T}}X_i \Big] + \frac{(n^*-n)^2}{(nn^*)^2}\mathrm{Var}\Big[ \psi'(X_i^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X_i \Big]
$$
$$
= \frac{c(\sigma)}{n^2}E\Big[ \psi''(X^{\mathrm{T}}\beta^*)(\tilde{v}^{\mathrm{T}}X)^2 \Big] + \frac{(n^*-n)^2}{(nn^*)^2}\mathrm{Var}\Big[ \psi'(X^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X \Big],
$$
and
$$
\mathrm{Var}\Big( \frac{1}{n^*}\psi'(X_i^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X_i - \frac{1}{n^*}E[\psi'(X^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X] \Big) = \frac{1}{(n^*)^2}\mathrm{Var}\Big[ \psi'(X^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X \Big].
$$
Since from Assumption 2 we know that both $\tilde{v}^{\mathrm{T}}X_i$ and $\psi'(X_i^{\mathrm{T}}\beta^*)$ are sub-Gaussian random variables, the Lindeberg condition for the central limit theorem is satisfied.
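For reference, the Lindeberg condition being verified here, stated for a generic triangular array $\{\xi_{n,i}\}_{i\le n^*}$ of independent, zero-mean summands with $s_n^2 = \sum_i \mathrm{Var}(\xi_{n,i})$, reads:

```latex
% Lindeberg condition: for every \varepsilon > 0,
\frac{1}{s_n^2}\sum_{i=1}^{n^*} E\big[ \xi_{n,i}^2\,
    \mathbf{1}\{ |\xi_{n,i}| > \varepsilon s_n \} \big] \longrightarrow 0 .
% Sub-Gaussian summands with variances of comparable order satisfy it,
% since the truncated second moment decays exponentially fast in
% (\varepsilon s_n)^2 by the sub-Gaussian tail bound.
```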
Therefore, by the central limit theorem (see, e.g., Theorem 9.1.1 of Chow and Teicher (2012)), we have that
$$
\sqrt{n}\,\sigma(\tilde{v})^{-1}\,\tilde{v}^{\mathrm{T}}\nabla\mathcal{L}_{ss}(\beta^*) \overset{d}{\longrightarrow} N(0, 1), \quad (42)
$$
where
$$
\sigma(\tilde{v})^2 = n^2\,\mathrm{Var}\Big( -\frac{1}{n}Y_i\tilde{v}^{\mathrm{T}}X_i + \frac{1}{n^*}\psi'(X_i^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X_i + \frac{n^*-n}{nn^*}E[\psi'(X^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X] \Big) + (n^*-n)\,n\,\mathrm{Var}\Big( \frac{1}{n^*}\psi'(X_i^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X_i - \frac{1}{n^*}E[\psi'(X^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X] \Big)
$$
$$
= c(\sigma)E\Big[ \psi''(X^{\mathrm{T}}\beta^*)(\tilde{v}^{\mathrm{T}}X)^2 \Big] + \frac{n^*-n}{n^*}\mathrm{Var}\Big[ \psi'(X^{\mathrm{T}}\beta^*)\tilde{v}^{\mathrm{T}}X \Big].
$$
We can also write it in the following way:
$$
\sigma(\tilde{v})^2 = \tilde{v}^{\mathrm{T}}\Big( C + \frac{n^*-n}{n^*}C_{ss} \Big)\tilde{v}, \qquad \text{where } C = c(\sigma)E\big[ \psi''(X^{\mathrm{T}}\beta^*)XX^{\mathrm{T}} \big], \quad C_{ss} = \mathrm{Cov}\big( \psi'(X^{\mathrm{T}}\beta^*)X \big).
$$
Next we assume $p^{3/2}\log n = o(n)$; replacing $\tilde{v}$ by $\{E[\psi''(X^{\mathrm{T}}\beta^*)XX^{\mathrm{T}}]\}^{-1}v$ (suitably normalized) in (42) and combining with (41), we can obtain the asymptotic normality result for $v^{\mathrm{T}}(\hat{\beta}_{ss}-\beta^*)$.

References

Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn. Res., 6(61):1817–1853, 2005.

Rie Kubota Ando and Tong Zhang. Two-view feature generation model for semi-supervised learning. In Proceedings of the 24th International Conference on Machine Learning, pages 25–32, 2007.

David Azriel, Lawrence D. Brown, Michael Sklar, Richard Berk, Andreas Buja, and Linda Zhao. Semi-supervised linear regression. J. Amer. Statist. Assoc., 117(540):2238–2251, 2022.

Heather Battey, Jianqing Fan, Han Liu, Junwei Lu, and Ziwei Zhu. Distributed testing and estimation under sparse high dimensional models. Ann. Statist., 46(3):1352–1382, 2018.

Pierre C. Bellec, Arnak S. Dalalyan, Edwin Grappin, and Quentin Paris. On the prediction loss of the lasso in the partially labeled setting. Electron. J. Stat., 12(2):3443–3472, 2018.

Peter J. Bickel and Elizaveta Levina. Covariance regularization by thresholding. Ann. Statist., 36(6):2577–2604, 2008.

Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100, 1998.

Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein.
Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122, 2011.

T. Tony Cai and Zijian Guo. Semisupervised inference for explained variance in high dimensional linear regression and its applications. J. R. Stat. Soc. Ser. B. Stat. Methodol., 82(2):391–419, 2020.

Tianxi Cai, Molei Liu, and Yin Xia. Individual data protected integrative regression analysis of high-dimensional heterogeneous data. J. Amer. Statist. Assoc., 117(540):2105–2119, 2022.

Tony Cai and Weidong Liu. Adaptive thresholding for sparse covariance matrix estimation. J. Amer. Statist. Assoc., 106(494):672–684, 2011.

Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečný, H. Brendan McMahan, Virginia Smith, and Ameet Talwalkar. LEAF: A benchmark for federated settings. arXiv e-prints, page arXiv:1812.01097, 2018.

Abhishek Chakrabortty and Tianxi Cai. Efficient and adaptive linear regression in semi-supervised settings. Ann. Statist., 46(4):1541–1572, 2018.

Xiangyu Chang, Shao-Bo Lin, and Ding-Xuan Zhou. Distributed semi-supervised learning with kernel ridge regression. J. Mach. Learn. Res., 18(1):1493–1514, 2017.

Dengsheng Chen, Jie Hu, Vince Junkai Tan, Xiaoming Wei, and Enhua Wu. Elastic aggregation for federated optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12187–12197, 2023.

Tianyi Chen, Georgios Giannakis, Tao Sun, and Wotao Yin. LAG: Lazily aggregated gradient for communication-efficient distributed learning. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Xi Chen, Weidong Liu, and Yichen Zhang. Quantile regression under memory constraint. Ann. Statist., 47(6):3244–3273, 2019.

Y. S. Chow and H. Teicher. Probability Theory: Independence, Interchangeability, Martingales. Springer Texts in Statistics. Springer New York, 2012.

Siyi Deng, Yang Ning, Jiwei Zhao, and Heping Zhang.
Optimal semi-supervised estimation and inference for high-dimensional linear regression. arXiv e-prints, art. arXiv:2011.14185, 2020.

Edgar Dobriban and Yue Sheng. WONDER: Weighted one-shot distributed ridge regression in high dimensions. J. Mach. Learn. Res., 21(66):1–52, 2020.

Edgar Dobriban and Yue Sheng. Distributed linear regression by averaging. Ann. Statist., 49(2):918–943, 2021.

Moming Duan, Duo Liu, Xianzhang Chen, Renping Liu, Yujuan Tan, and Liang Liang. Self-balancing federated learning with global imbalanced data in mobile systems. IEEE Trans. Parallel Distrib. Syst., 32(1):59–71, 2021a.

Rui Duan, Yang Ning, and Yong Chen. Heterogeneity-aware and communication-efficient distributed statistical inference. Biometrika, 109(1):67–83, 2021b. ISSN 1464-3510.

Zheng-Chu Guo, Lei Shi, and Qiang Wu. Learning theory of distributed regression with bias corrected regularization kernel network. J. Mach. Learn. Res., 18(118):1–25, 2017.

Zheng-Chu Guo, Shao-Bo Lin, and Lei Shi. Distributed learning with multi-penalty regularization. Appl. Comput. Harmon. Anal., 46(3):478–499, 2019.

Zijian Guo, Prabrisha Rakshit, Daniel S. Herman, and Jinbo Chen. Inference for the case probability in high-dimensional logistic regression. J. Mach. Learn. Res., 22(1), 2022. ISSN 1532-4435.

Jue Hou, Zijian Guo, and Tianxi Cai. Surrogate assisted semi-supervised inference for high dimensional risk prediction. J. Mach. Learn. Res., 24(265):1–58, 2023.

Ting Hu and Ding-Xuan Zhou. Distributed regularized least squares with flexible Gaussian kernels. Appl. Comput. Harmon. Anal., 53:349–377, 2021.

Martin Jaggi, Virginia Smith, Martin Takáč, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I. Jordan. Communication-efficient distributed dual coordinate ascent. In Advances in Neural Information Processing Systems, volume 27, 2014.

Jianqing Fan, Yongyi Guo, and Kaizheng Wang.
Communication-efficient accurate statistical estimation. J. Amer. Statist. Assoc., 118(542):1000–1010, 2023.

Rie Johnson and Tong Zhang. Graph-based semi-supervised learning and spectral kernel design. IEEE Trans. Inform. Theory, 54(1):275–288, 2008.

Michael I. Jordan, Jason D. Lee, and Yun Yang. Communication-efficient distributed statistical inference. J. Amer. Statist. Assoc., 114(526):668–681, 2019.

Arun Kumar Kuchibhotla and Abhishek Chakrabortty. Moving beyond sub-Gaussianity in high-dimensional statistics: applications in covariance estimation and linear regression. Inf. Inference, 11(4):1389–1456, 2022. ISSN 2049-8772.

Jason D. Lee, Qiang Liu, Yuekai Sun, and Jonathan E. Taylor. Communication-efficient sparse regression. J. Mach. Learn. Res., 18(1):115–144, 2017.

Runze Li, Dennis K. J. Lin, and Bing Li. Statistical inference in massive data sets. Appl. Stoch. Models Bus. Ind., 29(5):399–409, 2013.

Heng Lian and Zengyan Fan. Divide-and-conquer for debiased l1-norm support vector machine in ultra-high dimensions. J. Mach. Learn. Res., 18(182):1–26, 2018.

Shao-Bo Lin and Ding-Xuan Zhou. Distributed kernel-based gradient descent algorithms. Constr. Approx., 47(2):249–276, 2018.

Shao-Bo Lin, Xin Guo, and Ding-Xuan Zhou. Distributed learning with regularized least squares. J. Mach. Learn. Res., 18(92):1–31, 2017.

Molei Liu, Yin Xia, Kelly Cho, and Tianxi Cai. Integrative high dimensional multiple testing with heterogeneity under data sharing constraints. J. Mach. Learn. Res., 22(1), 2022. ISSN 1532-4435.

Rong Ma, Zijian Guo, T. Tony Cai, and Hongzhe Li. Statistical inference for genetic relatedness based on high-dimensional logistic regression. arXiv e-prints, page arXiv:2202.10007, 2022.

Thomas P. Minka. A comparison of numerical optimizers for logistic regression. Unpublished draft, pages 1–18, 2003.

Ravi B. Parikh, Christopher Manz, Corey Chivers, Susan Harkness Regli, Jennifer Braun, Michael E. Draugelis, Lynn M. Schuchter, Lawrence N.
Shulman, Amol S. Navathe, Mitesh S. Patel, and Nina R. O'Connor. Machine learning approaches to predict 6-month mortality among patients with cancer. JAMA Network Open, 2(10):e1915997–e1915997, 2019. ISSN 2574-3805.

Martin Raič. A multivariate Berry-Esseen theorem with explicit constants. Bernoulli, 25(4A):2824–2853, 2019.

Ohad Shamir, Nati Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In Proceedings of the 31st International Conference on Machine Learning, volume 32, pages 1000–1008, 2014.

Yue Sheng and Edgar Dobriban. One-shot distributed ridge regression in high dimensions. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 8763–8772, 2020.

Virginia Smith, Simone Forte, Chenxin Ma, Martin Takáč, Michael I. Jordan, and Martin Jaggi. CoCoA: A general framework for communication-efficient distributed optimization. J. Mach. Learn. Res., 18(1):8590–8638, 2017.

Sebastian U. Stich. Local SGD converges fast and communicates little. In 7th International Conference on Learning Representations, ICLR 2019, May 6-9, 2019, 2019.

I. Theodossiou. The effects of low-pay and unemployment on psychological well-being: A logistic regression approach. J. Health Econ., 17(1):85–104, 1998. ISSN 0167-6296.

Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 1999.

Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv e-prints, art. arXiv:1011.3027, 2010.

Stanislav Volgushev, Shih-Kang Chao, and Guang Cheng. Distributed inference for quantile regression processes. Ann. Statist., 47(3):1634–1662, 2019.

Jialei Wang, Mladen Kolar, Nathan Srebro, and Tong Zhang. Efficient distributed learning with sparsity. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3636–3645, 2017.
Junhui Wang and Xiaotong Shen. Large margin semi-supervised learning. J. Mach. Learn. Res., 8(65):1867–1891, 2007.

Junhui Wang, Xiaotong Shen, and Yufeng Liu. Probability estimation for large-margin classifiers. Biometrika, 95(1):149–167, 2007.

Junhui Wang, Xiaotong Shen, and Wei Pan. On efficient large margin semi-supervised learning: Method and theory. J. Mach. Learn. Res., 10(25):719–742, 2009.

Larry Wasserman and John Lafferty. Statistical analysis of semi-supervised regression. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007.

Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):5693–5700, 2019.

Anru Zhang, Lawrence D. Brown, and T. Tony Cai. Semi-supervised inference: General theory and estimation of means. Ann. Statist., 47(5):2538–2566, 2019.

Yuchen Zhang, John C. Duchi, and Martin J. Wainwright. Communication-efficient algorithms for statistical optimization. J. Mach. Learn. Res., 14:3321–3363, 2013.

Yuchen Zhang, John Duchi, and Martin Wainwright. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. J. Mach. Learn. Res., 16(1):3299–3340, 2015.

Yuqian Zhang and Jelena Bradic. High-dimensional semi-supervised learning: in search of optimal inference of the mean. Biometrika, 109(2):387–403, 2021.

Weihua Zhao, Fode Zhang, and Heng Lian. Debiasing and distributed estimation for high-dimensional quantile regression. IEEE Trans. Neural Netw. Learn. Syst., 31(7):2569–2577, 2020.

Xiaojin Zhu and Andrew B. Goldberg. Introduction to semi-supervised learning. Synth. Lect. Artif. Intell. Mach. Learn., 3(1):1–130, 2009.

Xiaojin Jerry Zhu. Semi-supervised learning literature survey. Unpublished draft, 2005.