# Federated Expectation Maximization with heterogeneity mitigation and variance reduction

Aymeric Dieuleveut (Centre de Mathématiques Appliquées, École Polytechnique, France; Institut Polytechnique de Paris) aymeric.dieuleveut@polytechnique.edu
Gersende Fort (Institut de Mathématiques de Toulouse, Université de Toulouse; CNRS UPS, Toulouse, France) gersende.fort@math.univ-toulouse.fr
Eric Moulines (Centre de Mathématiques Appliquées, École Polytechnique, France; CS Dpt, HSE University, Russian Federation) eric.moulines@polytechnique.edu
Geneviève Robin (Laboratoire de Mathématiques et Modélisation d'Évry, Université d'Évry Val d'Essonne; CNRS, Évry-Courcouronnes, France) genevieve.robin@cnrs.fr

The Expectation Maximization (EM) algorithm is the default algorithm for inference in latent variable models. As in any other field of machine learning, applications of latent variable models to very large datasets make the use of advanced parallel and distributed architectures mandatory. This paper introduces FedEM, which is the first extension of the EM algorithm to the federated learning context. FedEM is a new communication-efficient method, which handles partial participation of local devices, and is robust to heterogeneous distributions of the datasets. To alleviate the communication bottleneck, FedEM compresses appropriately defined complete-data sufficient statistics. We also develop and analyze an extension of FedEM that further incorporates a variance reduction scheme. In all cases, we derive finite-time complexity bounds for smooth non-convex problems. Numerical results are presented to support our theoretical findings, as well as an application to federated missing values imputation for biodiversity monitoring.

1 Introduction

The Expectation Maximization (EM) algorithm is the most popular approach for inference in latent variable models.
The EM algorithm, a special instance of the Majorize/Minimize algorithm [24], was formalized by [8] and is without doubt one of the fundamental algorithms in machine learning. Applications include, among many others, finite mixture analysis, latent factor model inference, and missing data imputation; see [38, 29, 26, 13] and the references therein. As in any other field of machine learning, training latent variable models on very large datasets makes the use of advanced parallel and distributed architectures mandatory. Federated Learning (FL) [22, 39], which exploits the computation power of a large number of edge devices to perform distributed machine learning, is a powerful framework to achieve this goal.

The conventional EM algorithm is not suitable for FL settings. We propose several new distributed versions of the EM algorithm supporting compressed communication.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

More precisely, our objective is to minimize a non-convex finite-sum smooth objective function

Argmin_{θ∈Θ} F(θ), F(θ) := (1/n) Σ_{i=1}^n L_i(θ) + R(θ), Θ ⊆ R^d, (1)

where n is the number of workers/devices connected to a central server, and worker #i only has access to its local data; finally, R is a penalty term which may be introduced to promote sparsity, regularity, etc. In latent variable models, L_i(θ) := −(1/m) Σ_{j=1}^m log p(y_ij; θ), where {y_ij}_{j=1}^m are the m observations available for worker #i, and p(y; θ) is the incomplete likelihood. p(y; θ) is defined by marginalizing the complete-data likelihood p(y, z; θ), the joint probability density function of the observation y and a non-observed latent variable z ∈ Z, i.e.

p(y; θ) = ∫_Z p(y, z; θ) μ(dz),

where Z is the latent space and μ is a measure on Z. We focus in this paper on the case where p(y, z; θ) belongs to a curved exponential family, given by

p(y, z; θ) := ρ(y, z) exp( ⟨s(y, z), φ(θ)⟩ − ψ(θ) ), (2)

where s(y, z) ∈ R^q is the complete-data sufficient statistic, and φ : Θ → R^q, ψ : Θ → R, ρ : Y × Z →
R^+ are vector/scalar functions.

In the absence of communication constraints, the EM algorithm is a popular method to solve (1). It alternates between two steps. In the Expectation (E) step, using the current value of the iterate θ_curr, it computes a majorizing function θ ↦ Q(θ, θ_curr), given up to an additive constant by

Q(θ, θ_curr) := −⟨s̄(θ_curr), φ(θ)⟩ + ψ(θ) + R(θ), where s̄(θ) := (1/n) Σ_{i=1}^n s̄_i(θ); (3)

and s̄_i(θ) is the i-th device's conditional expectation of the complete-data sufficient statistics:

s̄_i(θ) := (1/m) Σ_{j=1}^m s̄_ij(θ), s̄_ij(θ) := ∫_Z s(y_ij, z) p(z|y_ij; θ) μ(dz), (4)

where p(z|y_ij; θ) := p(y_ij, z; θ)/p(y_ij; θ). As for the M step, an updated value of θ_curr is computed as a minimizer of θ ↦ Q(θ, θ_curr). The majorizing function is then updated with the new θ_curr; this process is iterated until convergence. The EM algorithm is most useful when, for any θ_curr ∈ Θ, the function θ ↦ Q(θ, θ_curr) is a convex function of the parameter θ which can be minimized either explicitly or with little computational effort. A major advantage of the EM algorithm stems from its invariance under homeomorphisms, contrary to classical first-order methods: the EM updates are the same for any continuous invertible re-parametrization [23].

In the FL context, the vanilla EM algorithm is affected by three major problems: (1) the communication bottleneck, (2) data heterogeneity, and (3) partial participation (PP) of the workers. When the number of workers is large, the cost of communication becomes overwhelming. A classical technique to alleviate this problem is communication compression. Most FL algorithms are first-order methods, and compression is typically applied to stochastic gradients. Yet, these methods are not appropriate to solve (1) since (i) they do not preserve the desirable homeomorphic invariance property, and (ii) the full EM iteration is not distributed, since the M step is performed by the central server only. This calls for an extension of the EM algorithm to the FL setting.
Since workers are often users' personal devices, the issue of data heterogeneity naturally arises. Our model in Equations (1), (3) and (4) allows the local loss functions to depend on the worker i ∈ {1, ..., n} and the observations y_ij to be independent but not necessarily identically distributed. In addition, our theoretical results deal with specific behaviors for each worker i ∈ {1, ..., n}; see e.g. A5, A7 and A8. In the FL-EM setting, heterogeneity manifests itself through the non-equality of the local conditional expectations of the complete-data sufficient statistics, the s̄_i's; the algorithms must be modified to ensure convergence at the central server. Finally, a subset of users is potentially inactive in each learning round, being unavailable or unwilling to participate. Thus, taking into account PP of the workers and its impact on the convergence of algorithms is a major issue.

FedEM. The main contribution of our paper is a new method called FedEM, supporting communication compression, partial participation and data heterogeneity. In this algorithm, the workers compute an estimate of the local complete-data sufficient statistics s̄_i using a minibatch of data, apply an unbiased compression operator to a noise-compensated version (using a technique inspired by [17, 15]) and send the result to the central server, which performs aggregation and the M-step (i.e. the parameter update).

VR-FedEM. We improve FedEM by adding a variance reduction method inspired by the SPIDER framework [9], which has recently been extended to the EM framework [10]. For both FedEM and VR-FedEM, the central server updates the expectations of the global complete-data sufficient statistics through a Stochastic Approximation procedure [3, 4].
When compared to FedEM, VR-FedEM additionally performs variance reduction for each worker, progressively alleviating the variance brought by the random oracles which provide approximations of the local complete-data sufficient statistics.

Theoretical analysis. EM in the curved exponential family setting converges to the roots of a function h (see e.g. Section 2). We introduce a unified theoretical framework which covers the convergence of the FedEM and VR-FedEM algorithms in the non-convex case and establish convergence guarantees for finding an ε-stationary point (see Theorem 1 and Theorem 3). In both cases, we provide the number K_opt(ε) of optimization steps and the number K_CE(ε) of computed conditional expectations to reach ε-stationarity. These results show that in the Stochastic Approximation steps of VR-FedEM, the step sizes are independent of m, the number of observations per server. Furthermore, the computational cost in terms of K_CE(ε) improves on earlier results. In this respect, VR-FedEM has the same advantages as SPIDER [9] compared to SVRG [18] and SAGA [6], or as SPIDER-EM [10] compared to sEM-vr [5] and FIEM [20, 11]. Lastly, our bounds demonstrate the robustness of FedEM and VR-FedEM to data heterogeneity. Finally, seen as a root-finding algorithm in a quantized FL setting, VR-FedEM can be compared to VR-DIANA [17]: we show that VR-FedEM does not require the step sizes to decrease with m and provides state-of-the-art iteration complexity to reach a precision ε.

Notations. For vectors a, b in R^q, ⟨a, b⟩ is the Euclidean scalar product, and ‖·‖ denotes the associated norm. For r ≥ 1, ‖a‖_r is the ℓ_r-norm of a vector a. The Hadamard product a ⊙ b denotes the entrywise product of the two vectors a, b. By convention, vectors are column vectors. For a matrix A, Aᵀ is its transpose and ‖A‖_F is its Frobenius norm; for two matrices A, B, ⟨A, B⟩ := Trace(BᵀA). For a positive integer n, set [n]* := {1, ..., n} and [n] := {0, ..., n}.
The set of non-negative (resp. positive) integers is denoted by N (resp. N*). The minimum (resp. maximum) of two real numbers a, b is denoted by a ∧ b (resp. a ∨ b). We use the Bachmann-Landau notation a(x) = O(b(x)) to state that the growth rate of a(x) is upper bounded by that of b(x).

2 FedEM: Expectation Maximization algorithms for federated learning

Recall the definition of the negative penalized (normalized) log-likelihood F(θ) from (1). Throughout the paper, we make the following assumptions A1 to A3, which define the model at hand.

A1. The parameter set Θ ⊆ R^d is a convex open set. The functions R : Θ → R, φ : Θ → R^q, ψ : Θ → R, and ρ(y_ij, ·) : Z → R^+, s(y_ij, ·) : Z → R^q for i ∈ [n]* and j ∈ [m]* are measurable functions. For any θ ∈ Θ and i ∈ [n]*, the log-likelihood is finite: −∞ < L_i(θ) < ∞.

A2. For all θ ∈ Θ and i ∈ [n]*, the conditional expectation s̄_i(θ) is well-defined.

A3. For any s ∈ R^q, the set Argmin_{θ∈Θ} { ψ(θ) + R(θ) − ⟨s, φ(θ)⟩ } exists and is a singleton, denoted by {T(s)}.

EM defines a sequence {θ_k, k ≥ 0} that can be computed recursively as θ_{k+1} = T ∘ s̄(θ_k), where the map T is defined in A3 and s̄ is defined in (3). On the other hand, the EM algorithm can be defined through a mapping in the space of complete-data sufficient statistics, referred to as the expectation space. In this setting, the EM iteration defines an R^q-valued sequence {Ŝ_k, k ≥ 0} given by Ŝ_{k+1} = s̄ ∘ T(Ŝ_k). Thus, the EM algorithm admits two equivalent representations:

(Parameter space) θ_{k+1} = T ∘ s̄(θ_k); (Expectation space) Ŝ_{k+1} = s̄ ∘ T(Ŝ_k). (5)

In this paper, we focus on the expectation-space representation; see [23] for an interesting discussion of the connection between EM and mirror descent. It has been shown in [7] that if s⋆ is a fixed point of the EM algorithm in the expectation space, then θ⋆ := T(s⋆) is a fixed point of the EM algorithm in the parameter space, i.e., θ⋆ = T ∘ s̄(θ⋆); note that the converse is also true.
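As a concrete illustration of the two representations in (5), the following sketch runs both recursions on a toy two-component Gaussian mixture with unit variances and known equal weights (the model, data and initialization here are purely illustrative, not from the paper); with Ŝ_0 := s̄(θ_0), the iterates stay linked by Ŝ_k = s̄(θ_k):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data: two well-separated Gaussian clusters, unit variances, known weights 1/2
y = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

def s_bar(theta):
    """E-step map: averaged conditional expectation of the complete-data statistics."""
    mu0, mu1 = theta
    # posterior probability of component 1 (equal weights, unit variances)
    r1 = 1.0 / (1.0 + np.exp(0.5 * (y - mu1) ** 2 - 0.5 * (y - mu0) ** 2))
    r0 = 1.0 - r1
    return np.array([np.mean(r0 * y), np.mean(r0), np.mean(r1 * y), np.mean(r1)])

def T(s):
    """M-step map: unique minimizer of the complete-data objective given s."""
    return np.array([s[0] / s[1], s[2] / s[3]])

theta = np.array([-1.0, 1.0])   # parameter-space iterate theta_k
S = s_bar(theta)                # expectation-space iterate, S_0 := s_bar(theta_0)
for _ in range(50):
    theta = T(s_bar(theta))     # theta_{k+1} = T o s_bar(theta_k)
    S = s_bar(T(S))             # S_{k+1} = s_bar o T(S_k)
```

After the loop, the two recursions agree (S equals s̄(theta)) and the means have converged near the cluster centers, which is the equivalence claimed in (5).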
Define the functions h_i and h from R^q to R^q by

h(s) := (1/n) Σ_{i=1}^n h_i(s), where h_i(s) := s̄_i ∘ T(s) − s. (6)

A key property is that the fixed points of EM in the expectation space are the roots of the mean field s ↦ h(s) (see (3) for the definition of s̄). Therefore, convergence of EM-based algorithms is evaluated in terms of ε-stationarity (see [14, 10]): for all ε > 0, there exists a (possibly random) termination time K such that E[‖h(Ŝ_K)‖²] ≤ ε. Another key property of EM is that it is a monotonic algorithm: each iteration leads to a decrease of the negative penalized log-likelihood, i.e., F(θ_{k+1}) ≤ F(θ_k) or, equivalently in the expectation space, F ∘ T(Ŝ_{k+1}) ≤ F ∘ T(Ŝ_k) (for sequences {θ_k, k ≥ 0} and {Ŝ_k, k ≥ 0} given by (5)). A4 assumes that the roots of the mean field h are the roots of the gradient of F ∘ T (see [7] for the same assumption when studying Stochastic EM). A5 assumes global Lipschitz properties of the functions h_i.

A4. The function W := F ∘ T : R^q → R is continuously differentiable on R^q and its gradient is globally Lipschitz with constant L_Ẇ. Furthermore, for any s ∈ R^q, ∇W(s) = −B(s)h(s), where B(s) is a q × q positive definite matrix. In addition, there exist 0 < v_min ≤ v_max such that for any s ∈ R^q, the spectrum of B(s) is in [v_min, v_max].

A5. For any i ∈ [n]*, there exists L_i > 0 such that for any s, s′ ∈ R^q,

‖h_i(s) − h_i(s′)‖ = ‖( s̄_i ∘ T(s) − s ) − ( s̄_i ∘ T(s′) − s′ )‖ ≤ L_i ‖s − s′‖.

A Federated EM algorithm.

Algorithm 1: FedEM with partial participation
Data: k_max ∈ N*; for i ∈ [n]*, V_{0,i} ∈ R^q; Ŝ_0 ∈ R^q; a positive sequence {γ_{k+1}, k ∈ [k_max − 1]}; α > 0; the coefficient p = E_{A∼PP}[card(A)]/n.
Result: The FedEM-PP sequence {Ŝ_k, k ∈ [k_max]}
1 Set V_0 = n⁻¹ Σ_{i=1}^n V_{0,i}
2 for k = 0, ..., k_max − 1 do
3   Sample A_{k+1} ∼ PP
4   for i ∈ A_{k+1} do
5     (worker #i)
6     Sample S_{k+1,i}, an approximation of s̄_i ∘ T(Ŝ_k)
7     Set Δ_{k+1,i} = S_{k+1,i} − Ŝ_k − V_{k,i}
8     Set V_{k+1,i} = V_{k,i} + α Quant(Δ_{k+1,i})
9     Send Quant(Δ_{k+1,i}) to the central server
10  for i ∉ A_{k+1} do
11    (worker #i)
12    Set V_{k+1,i} = V_{k,i} (no update)
13  (the central server)
14  Set H_{k+1} = V_k + (np)⁻¹ Σ_{i∈A_{k+1}} Quant(Δ_{k+1,i})
15  Set Ŝ_{k+1} = Ŝ_k + γ_{k+1} H_{k+1}
16  Set V_{k+1} = V_k + α n⁻¹ Σ_{i∈A_{k+1}} Quant(Δ_{k+1,i})
17  Send Ŝ_{k+1} and T(Ŝ_{k+1}) to the n workers

Our first contribution, the novel algorithm FedEM, is described by Algorithm 1. The algorithm encompasses partial participation of the workers: at iteration #(k + 1), only a subset A_{k+1} of active workers participates in the training, see line 3. The average fraction of participating workers is denoted by p. Each active worker #i computes an unbiased approximation S_{k+1,i} (line 6) of s̄_i ∘ T(Ŝ_k); conditionally to the past (see Appendix D.2 for a rigorous definition), these approximations are independent. The workers then transmit to the central server compressed information about the new sufficient statistics. A naive solution would be to compress and transmit S_{k+1,i} − Ŝ_k, but data heterogeneity between workers often prevents these local differences from vanishing at the optimum, leading to large compression errors and impairing convergence of the algorithm. Following [28], a memory V_{k,i} (initialized to h_i(Ŝ_0) at k = 0) is introduced, and the differences Δ_{k+1,i} := S_{k+1,i} − Ŝ_k − V_{k,i} are compressed for i ∈ A_{k+1} (line 7 and line 9). These memories are updated locally: V_{k+1,i} = V_{k,i} + α Quant(Δ_{k+1,i}) at line 8, with α > 0 (typically set to 1/(1 + ω), where ω is defined in A6). On its side, the central server releases an aggregated estimate Ŝ_{k+1} of the complete-data sufficient statistics by averaging the quantized differences, (np)⁻¹ Σ_{i∈A_{k+1}} Quant(Δ_{k+1,i}), and adding V_k (line 14 and line 15). Then, it updates V_{k+1} = V_k + α n⁻¹ Σ_{i∈A_{k+1}} Quant(Δ_{k+1,i}), see line 16. The final step consists in solving the M-step of the EM algorithm, i.e. in computing T(Ŝ_{k+1}) (see A3). We finally state our assumption on the compression process.
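To make the interplay between the memories V_{k,i} and the server update concrete, here is a minimal full-participation sketch of one run of Algorithm 1 on a hypothetical linear toy model (worker fields h_i(s) = 0.5(target_i − s); all numerical choices are illustrative). For readability, Quant is the identity (ω = 0), in which case H_{k+1} is exactly the mean field:

```python
import numpy as np

rng = np.random.default_rng(1)
n, q = 4, 3                       # number of workers, dimension of the statistics
alpha, gamma = 0.5, 0.1           # memory step and SA step (hypothetical values)

# hypothetical heterogeneous workers: h_i(s) = 0.5 * (target_i - s)
targets = rng.normal(size=(n, q))

def local_stat(i, S_hat):
    """Exact oracle for worker i's statistic s_bar_i o T(S_hat) in this toy model."""
    return S_hat + 0.5 * (targets[i] - S_hat)

quant = lambda v: v               # identity compression (omega = 0) for readability

S_hat = np.zeros(q)
V_loc = np.array([local_stat(i, S_hat) - S_hat for i in range(n)])  # V_{0,i} = h_i(S_0)
V = V_loc.mean(axis=0)
for k in range(200):
    deltas = np.empty((n, q))
    for i in range(n):                                   # workers, locally
        deltas[i] = quant(local_stat(i, S_hat) - S_hat - V_loc[i])
        V_loc[i] += alpha * deltas[i]                    # memory update (line 8)
    H = V + deltas.mean(axis=0)                          # server field estimate (line 14)
    S_hat = S_hat + gamma * H                            # SA update (line 15)
    V = V + alpha * deltas.mean(axis=0)                  # aggregated memory (line 16)
```

Despite the heterogeneous targets, Ŝ converges to the root of the averaged field (the mean of the targets), and H vanishes there, which is the behavior the memories are designed to preserve under compression.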
We consider a large class of unbiased compression operators Quant satisfying a variance bound:

A6. There exists ω ≥ 0 such that for any s ∈ R^q: E[Quant(s)] = s and E[‖Quant(s) − s‖²] ≤ ω‖s‖².

Intuitively, the stronger the compression, the larger ω. Remark that if no compression is used, or equivalently if Quant(s) = s for all s ∈ R^q, then A6 is satisfied with ω = 0. An example of a quantization operator satisfying A6 is random dithering, described as the random operator Quant : R^q → R^q,

Quant(x) = (1/s_quant) ‖x‖_r sign(x) ⊙ ⌊ s_quant (|x|/‖x‖_r) + ξ ⌋,

where r ≥ 1 is user-defined, ξ is a uniform random variable on [0, 1]^q and s_quant ∈ N* is the number of rounding levels; see [17, 2]. This operator satisfies A6 with ω = s_quant⁻¹ O(q^{1/r} + q^{1/2}); see [17, Example 1]. Another example, namely block-p-quantization, is provided in the supplemental (see Appendix B). More generally, this assumption is valid for many compression operators, for example those resulting in sparsification [see e.g. 28].

The convergence analysis is carried out under the following assumptions on the oracle S_{k+1,i}: for any i ∈ [n]*, the approximations S_{k+1,i} are unbiased and their conditional variances are uniformly bounded in k. For each k ∈ N, denote by F_k the σ-algebra generated by {S_{ℓ,i}, A_ℓ; i ∈ [n]*, ℓ ∈ [k]}, including the randomness inherited from the quantization operator Quant up to iteration #k.

A7. For all k ∈ N, conditionally to F_k, {S_{k+1,i}}_{i=1}^n are independent. Moreover, for any i ∈ [n]*, E[S_{k+1,i} | F_k] = s̄_i ∘ T(Ŝ_k) and there exists σ_i² > 0 such that for any k ≥ 0, E[‖S_{k+1,i} − s̄_i ∘ T(Ŝ_k)‖² | F_k] ≤ σ_i².

A7 covers both the finite-sum setting described in the introduction and the online setting. In the finite-sum setting, s̄_i is of the form m⁻¹ Σ_{j=1}^m s̄_ij. In that case, S_{k+1,i} can be the average over a minibatch B_{k+1,i} of size b, sampled at random in [m]* with or without replacement and independently of the history of the algorithm: we have S_{k+1,i} = b⁻¹ Σ_{j∈B_{k+1,i}} s̄_ij ∘ T(Ŝ_k).
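The random dithering operator above can be sketched in a few lines; averaging many independent quantizations recovers x, illustrating the unbiasedness required by A6 (the argument name s_levels plays the role of s_quant; the input vector is arbitrary test data):

```python
import numpy as np

def random_dithering(x, s_levels=4, r=2, rng=np.random.default_rng(2)):
    """Unbiased random dithering: (||x||_r / s) * sign(x) * floor(s |x| / ||x||_r + xi)."""
    norm = np.linalg.norm(x, ord=r)
    if norm == 0.0:
        return np.zeros_like(x)
    xi = rng.uniform(size=x.shape)                  # dither, uniform on [0, 1]^q
    return (norm / s_levels) * np.sign(x) * np.floor(s_levels * np.abs(x) / norm + xi)

x = np.array([0.3, -1.2, 0.05, 2.0])
# empirical check of E[Quant(x)] = x (first part of A6)
avg = np.mean([random_dithering(x) for _ in range(10_000)], axis=0)
```

Each output coordinate takes one of s_quant + 1 magnitude levels (times the norm), so it can be encoded with few bits; the dither ξ is what makes the rounding unbiased.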
In the online setting, the oracles S_{k+1,i} come from online processing of streaming data; in that case, S_{k+1,i} can be computed from a minibatch of independent examples, so that the conditional variance σ_i², which is inversely proportional to the size of the minibatch, can be made arbitrarily small.

Reduction of communication complexity for FL. Reducing the communication cost between workers is a crucial aspect of the FL approach [19]. In gradient-based optimization, four techniques have been used to reduce the amount of communication: (i) increasing the minibatch size and reducing the number of iterations, (ii) increasing the number of local steps between two communication rounds, (iii) using compression, (iv) sampling clients at each step. Here, we provide a tight analysis of strategies (i), (iii) and (iv) (client sampling is part of PP). Regarding the interest of performing multiple local iterations (ii), as analyzed for example in [21, 27] in the classical gradient setting, we note the following. First, from a theoretical standpoint, tradeoffs between larger minibatches and more local iterations are unclear [37]. Second, performing local iterations is not possible in the EM setting: one iteration of EM is the combination of the two steps E and M, and the M step, which requires the use of the map T, is only performed by the central server; this remark is a fundamental specificity of the EM framework (not shared by the gradient framework). In applications, we usually do not want T to be available at each local node. However, our work allows multiple local iterations of the E step to be performed before communicating with the central server. In Algorithm 1, the local statistics S_{k+1,i} are general enough to cover this case; see the comment above on A7. Finally, as we do not perform local full EM iterations, we do not face the well-identified client-drift challenge (in the presence of heterogeneity).
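Strategy (i) can be seen directly on the finite-sum oracle of A7: the minibatch average is unbiased, and its conditional variance σ_i² shrinks as 1/b. A quick empirical check on hypothetical per-example statistics (frozen at one iterate, so the oracle is a plain average):

```python
import numpy as np

rng = np.random.default_rng(5)
m = 500
s_ij = rng.normal(size=(m, 3))        # hypothetical per-example statistics s_ij o T(S_k)
s_i = s_ij.mean(axis=0)               # exact local statistic s_i o T(S_k)

def oracle(b, n_draws=10_000):
    """Minibatch oracle of A7 (sampling with replacement): empirical mean and variance."""
    draws = np.array([s_ij[rng.integers(0, m, size=b)].mean(axis=0)
                      for _ in range(n_draws)])
    return draws.mean(axis=0), np.mean(np.sum((draws - s_i) ** 2, axis=1))

mean_b5, var_b5 = oracle(b=5)
mean_b50, var_b50 = oracle(b=50)
```

Both oracles average to s̄_i (unbiasedness), and the b = 50 oracle has roughly ten times less variance than the b = 5 one, which is the trade-off between communication rounds and per-round computation exploited by strategy (i).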
Yet, we stress that combining compression and heterogeneity raises other challenges: it is known in the Gradient Descent setting (see e.g. [28, 31]) that heterogeneity strongly hinders convergence in the presence of compression. To alleviate the impact of heterogeneity, we introduce the memory variables V_{k,i}.

Convergence analysis, full participation regime. In this paragraph, we focus on the full-participation regime (p = 1): for all k ∈ [k_max]*, A_k = [n]*. We now present in Theorem 1 our key result, from which complexity expressions are derived. The proof is postponed to Appendix C.

Theorem 1. Assume A1 to A7 and set L² := n⁻¹ Σ_{i=1}^n L_i², σ² := n⁻¹ Σ_{i=1}^n σ_i². Let {Ŝ_k, k ∈ [k_max]} be given by Algorithm 1, with ω > 0, α := (1 + ω)⁻¹ and γ_k = γ ∈ (0, γ_max], where

γ_max := ( v_min / (2 L_Ẇ) ) ∧ ( v_min √n / (2√2 L (1 + ω) √ω) ). (7)

Denote by K the uniform random variable on [k_max − 1]. Then, taking V_{0,i} = h_i(Ŝ_0) for all i ∈ [n]*:

E[‖h(Ŝ_K)‖²] ≤ (W(Ŝ_0) − min W) / (v_min γ k_max) + γ L_Ẇ (1 + 5ω) σ² / (v_min n). (8)

When there is no compression (ω = 0, so that Quant(s) = s), we prove that the introduction of the random variables V_{k,i} plays no role, whatever α > 0 and the choice of the V_{0,i}'s, and we have for any γ ∈ (0, 2v_min/L_Ẇ) (see (29) in the supplemental)

E[‖h(Ŝ_K)‖²] ≤ (W(Ŝ_0) − min W) / (v_min γ k_max) + γ L_Ẇ σ² / (v_min n). (9)

Optimizing the learning rate γ, we derive the following corollary (see the proof in Appendix C).

Corollary 2 (of Theorem 1). Choose γ := ( (W(Ŝ_0) − min W) n / (k_max L_Ẇ (1 + 5ω) σ²) )^{1/2} ∧ γ_max. We get

E[‖h(Ŝ_K)‖²] ≤ (W(Ŝ_0) − min W) / (v_min γ_max k_max) + (2/v_min) ( (W(Ŝ_0) − min W) L_Ẇ (1 + 5ω) σ² / (n k_max) )^{1/2}.

Theorem 1 and Corollary 2 do not require any assumption on the distributional heterogeneity of the workers. These results thus remain valid when workers have access to data drawn from different distributions, a widespread situation in FL frameworks. Crucially, without assumptions on the heterogeneity of the workers, a naive implementation of compressed distributed EM (i.e. an implementation without the variables V_{k,i}) would not converge.
Let us comment on the complexity of reaching an ε-stationary point, and more precisely on how the complexity, evaluated in terms of the number of optimization steps, depends on ω, n, σ² and ε. Since K_opt(ε) = k_max, Corollary 2 gives:

K_opt(ε) = O( (W(Ŝ_0) − min W) [ L_Ẇ (1 + 5ω) σ² / (n ε²) + 1 / (γ_max ε) ] ).

Maximal learning rate and compression. The comparison of Theorem 1 with the no-compression case (see (9)) shows that compression impacts γ_max by a factor proportional to √n/ω^{3/2} as ω increases (similar constraints were observed in the risk optimization literature, e.g. in [17, 32]). This highlights two different regimes depending on the ratio n/ω³: if the number of workers n scales at least as ω³, the maximal learning rate is not impacted by compression; on the other hand, for smaller numbers of workers, n ≲ ω³, compression can degrade the maximal learning rate. We illustrate this conclusion with a small example in the case of scalar quantization, for which ω ≈ √q/s_quant: for q = 10² and s_quant = 4 (a compression rate of a factor 16), the maximal learning rate is almost unchanged if n ≥ 16.

Dependency on ε. The complexity K_opt(ε) is decomposed into two terms, scaling respectively as σ²ε⁻² and γ_max⁻¹ε⁻¹, the first term being dominant as ε → 0. This observation highlights two different regimes: a high noise regime, corresponding to γ_max(1 + ω)σ²/(nε) ≥ 1, where the complexity is of order σ²ε⁻²; and a low noise regime, where γ_max(1 + ω)σ²/(nε) ≤ 1 and the complexity is of order γ_max⁻¹ε⁻¹. An extreme example of the low noise case is σ² = 0, which occurs for example in the finite-sum case (i.e., when s̄_i = m⁻¹ Σ_{j=1}^m s̄_ij) with the oracle S_{k+1,i} = s̄_i ∘ T(Ŝ_k).

Impact of compression for ε-stationarity. As mentioned above, compression simultaneously impacts the maximal learning rate (as in (7)) and the complexity K_opt(ε). Consequently, the impact of compression depends on the balance between ω, n, σ² and ε, and we can distinguish four main regimes.
In the following table, for each of the four situations, we summarize the increase in complexity K_opt(ε) resulting from compression.

| Complexity regime (dominating term in K_opt(ε)) | γ_max regime (dominating term in (7)) | Example situation | Increase in K_opt(ε) |
|---|---|---|---|
| High noise: σ²ε⁻² | v_min/(2L_Ẇ) | large ratio n/ω³ | ω |
| High noise: σ²ε⁻² | v_min√n/(2√2 L(1+ω)√ω) | low ratio n/ω³ | ω |
| Low noise: γ_max⁻¹ε⁻¹ (low σ², e.g., large minibatch) | v_min/(2L_Ẇ) | large ratio n/ω³ | 1 |
| Low noise: γ_max⁻¹ε⁻¹ | v_min√n/(2√2 L(1+ω)√ω) | low ratio n/ω³ | ω^{3/2}/√n |

Depending on the situation, the complexity is thus multiplied by a factor ranging from 1 to ω ∨ (ω^{3/2}/√n). Remark that the communication cost of each iteration is typically reduced by compression by a factor of at least ω. Moreover, the benefit of compression is most significant in the low noise regime when the maximal learning rate is v_min/(2L_Ẇ) (e.g., when n is large enough): we then improve the communication cost of each iteration without increasing the optimization complexity, effectively reducing the communication budget "for free". Because of space constraints, the results in the PP regime are postponed to Appendix A.

3 VR-FedEM: Federated EM algorithm with variance reduction

A novel algorithm, called VR-FedEM and described by Algorithm 2, additionally incorporates a variance reduction scheme in FedEM. It is described in the finite-sum setting, when for all i ∈ [n]*, s̄_i := m⁻¹ Σ_{j=1}^m s̄_ij: at each iteration #(t, k + 1), the oracle on s̄_i ∘ T(Ŝ_{t,k}) uses a minibatch B_{t,k+1,i} of examples sampled at random (with or without replacement) in [m]*.

Algorithm 2: VR-FedEM
Data: k_out, k_in, b ∈ N*; for i ∈ [n]*, V_{1,0,i} ∈ R^q; Ŝ_init ∈ R^q; a positive sequence {γ_{t,k+1}, t ∈ [k_out]*, k ∈ [k_in − 1]}; α > 0
Result: The sequence {Ŝ_{t,k}, t ∈ [k_out]*, k ∈ [k_in]}
1 Ŝ_{1,0} = Ŝ_{1,−1} = Ŝ_init, V_{1,0} = n⁻¹ Σ_{i=1}^n V_{1,0,i}
2 for i = 1, ..., n do
3   S_{1,0,i} = m⁻¹ Σ_{j=1}^m s̄_ij ∘ T(Ŝ_init)
4 for t = 1, ..., k_out do
5   for k = 0, ..., k_in − 1 do
6     for i = 1, ..., n (worker #i, locally) do
7       Sample at random a batch B_{t,k+1,i} of size b in [m]*
8       Set S_{t,k+1,i} = S_{t,k,i} + b⁻¹ Σ_{j∈B_{t,k+1,i}} ( s̄_ij ∘ T(Ŝ_{t,k}) − s̄_ij ∘ T(Ŝ_{t,k−1}) )
9       Set Δ_{t,k+1,i} = S_{t,k+1,i} − Ŝ_{t,k} − V_{t,k,i}
10      Set V_{t,k+1,i} = V_{t,k,i} + α Quant(Δ_{t,k+1,i})
11      Send Quant(Δ_{t,k+1,i}) to the central server
12    (the central server)
13    Set H_{t,k+1} = V_{t,k} + n⁻¹ Σ_{i=1}^n Quant(Δ_{t,k+1,i})
14    Set Ŝ_{t,k+1} = Ŝ_{t,k} + γ_{t,k+1} H_{t,k+1}
15    Set V_{t,k+1} = V_{t,k} + α n⁻¹ Σ_{i=1}^n Quant(Δ_{t,k+1,i})
16    Send Ŝ_{t,k+1} and T(Ŝ_{t,k+1}) to the n workers
17  Ŝ_{t+1,0} = Ŝ_{t+1,−1} = Ŝ_{t,k_in}
18  V_{t+1,0} = V_{t,k_in}
19  for i = 1, ..., n do
20    S_{t+1,0,i} = m⁻¹ Σ_{j=1}^m s̄_ij ∘ T(Ŝ_{t+1,0})
21    V_{t+1,0,i} = V_{t,k_in,i}

The algorithm is decomposed into k_out outer loops (indexed by t), each of them comprising k_in inner loops (indexed by k). At iteration #(k + 1) of the inner loop, each worker #i updates a local statistic S_{t,k+1,i} based on a minibatch B_{t,k+1,i} of its own examples {s̄_ij, j ∈ B_{t,k+1,i}} (see Line 8): starting from S_{t,0,i} := m⁻¹ Σ_{j=1}^m s̄_ij ∘ T(Ŝ_{t,−1}), S_{t,k+1,i} is defined in such a way that it approximates m⁻¹ Σ_{j=1}^m s̄_ij ∘ T(Ŝ_{t,k}) (see Corollary 18). Then, worker #i sends to the central server a quantization of Δ_{t,k+1,i} (see Line 11), which can be seen as an approximation of h_i(Ŝ_{t,k}) − h_i(Ŝ_{t,k−1}), upon noting that the variable V_{t,k+1,i} defined by Line 10 approximates h_i(Ŝ_{t,k}) (see Proposition 26). The central server tracks the mean value V_{t,k+1} = n⁻¹ Σ_{i=1}^n V_{t,k+1,i} (see Line 15 and Lemma 21) and, by adding the quantized quantities, defines a field H_{t,k+1} which approximates n⁻¹ Σ_{i=1}^n h_i(Ŝ_{t,k}) (see Proposition 24). Line 14 can be seen as a Stochastic Approximation update, with learning rate γ_{t,k+1} and mean field s ↦ n⁻¹ Σ_{i=1}^n h_i(s) (see (6) for the definition of h_i). The variance reduction is encoded in the definition of S_{t,k+1,i} at Line 8: we have S_{t,k+1,i} = b⁻¹ Σ_{j∈B_{t,k+1,i}} s̄_ij ∘ T(Ŝ_{t,k}) + ξ_{t,k+1,i}. The first term is the natural approximation of s̄_i ∘ T(Ŝ_{t,k}) based on a minibatch B_{t,k+1,i}.
Conditionally to the past, ξ_{t,k+1,i} is correlated with the first term and biased, but its bias is canceled at the beginning of each outer loop (see Line 20 and Appendix E.3.2): ξ_{t,k+1,i} defines a control variate. Such a variance reduction technique was first proposed in the stochastic gradient setting [30, 9, 36] and then extended to the EM setting [10, 12]. At the end of each outer loop, the local approximations S_{t+1,0,i} are re-initialized to the full sum m⁻¹ Σ_{j=1}^m s̄_ij ∘ T(Ŝ_{t,k_in}) (see Line 20), thus canceling the bias of S_{·,i} (see Proposition 17).

When there is a single worker and no compression is used (n = 1, ω = 0), VR-FedEM reduces to SPIDER-EM, which has been shown to be rate optimal for smooth, non-convex finite-sum optimization [10]. Theorem 3 studies the FL setting (n ≥ 1 and ω ≥ 0): it establishes a finite-time control of convergence in expectation for VR-FedEM. Assumptions A5 and A7 are replaced with A8.

A8. For any i ∈ [n]* and j ∈ [m]*, the conditional expectations s̄_ij(θ) are well defined for any θ ∈ Θ, and there exists L_ij such that for any s, s′ ∈ R^q,

‖( s̄_ij ∘ T(s) − s ) − ( s̄_ij ∘ T(s′) − s′ )‖ ≤ L_ij ‖s − s′‖.

Theorem 3. Assume A1 to A3, A4, A6 and A8. Set L² := n⁻¹ m⁻¹ Σ_{i=1}^n Σ_{j=1}^m L_ij². Let {Ŝ_{t,k}, t ∈ [k_out]*, k ∈ [k_in − 1]} be given by Algorithm 2, run with α := 1/(1 + ω), V_{1,0,i} := h_i(Ŝ_{1,0}) for any i ∈ [n]*, b := ⌈ k_in / (1 + ω)² ⌉ and

γ_{t,k} = γ := ( v_min / L_Ẇ ) ∧ ( v_min √n / ( v_max L (1 + ω) √(1 + 10ω) ) ). (10)

Let (τ, K) be the uniform random variable on [k_out]* × [k_in − 1], independent of {Ŝ_{t,k}, t ∈ [k_out]*, k ∈ [k_in]}. Then it holds that

E[‖H_{τ,K+1}‖²] ≤ (2 / (v_min γ k_in k_out)) ( E[W(Ŝ_{1,0})] − min W ), (11)

E[‖h(Ŝ_{τ,K})‖²] ≤ 2 ( 1 + γ² L² (1 + ω)² / n ) E[‖H_{τ,K+1}‖²]. (12)

The proof is postponed to Appendix E. This result is a consequence of the more general Proposition 25. We make the following comments.

1. Eq. (11) provides the convergence of E[‖H_{τ,K+1}‖²], and Eq. (12) ensures that the quantity of interest E[‖h(Ŝ_{τ,K})‖²] is controlled by E[‖H_{τ,K+1}‖²]. We observe that 2(1 + γ²L²(1 + ω)²/n) is uniformly bounded w.r.t. ω since, by (10), γ² = O_{ω→∞}(ω⁻³).

2. Up to our knowledge, this is the first result on Federated EM that leverages advanced variance reduction techniques while being robust to distribution heterogeneity (the theorem is valid without any assumption on heterogeneity) and reducing the communication cost.

3. Without compression (ω = 0) and in the single-worker case (n = 1), Fort et al. [10] use k_in = b: we recover this result as a particular case. When n > 1 and ω > 0, the recommended batch size b decreases as 1/(1 + ω)².

Convergence rate and optimization complexity. Our step size γ is constant and independent of k_in, k_out. Indeed, contrary to Theorem 1, there is no bias-variance trade-off (as typically observed with variance-reduced methods), and the optimal choice of γ is the largest one ensuring convergence. Consequently, since the number of optimization steps is k_out k_in, we have K_opt(ε) = O(1/(γε)).

Impact of compression on the learning rate and ε-stationarity. The compression constant ω does not directly appear in (11), but impacts the value of γ. Two different regimes appear:

1. If ω³ ≲ n (i.e. we focus on the large ω, n asymptotics when ω³ ≪ n), then γ ≈ v_min/L_Ẇ has nearly the same value as without compression [10]. The complexity is then similar to that of SPIDER-EM [10], with a smaller communication cost. The gain from compression is maximal in this regime.

2. If ω³ ≳ n (i.e. we focus on the large ω, n asymptotics when ω³ ≫ n), then γ = O( √n v_min / (v_max L ω^{3/2}) ) is strictly smaller than without compression. The optimization complexity is then higher than that of SPIDER-EM¹ (by a factor proportional to ω^{3/2}/√n), with a smaller communication cost (typically at least ω times fewer bits exchanged per iteration). The overall trade-off thus depends on the comparison between ω and n.
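The effect of the Line-8 recursion can be isolated on a hypothetical scalar example: along a slowly moving parameter path (as along EM iterates with a small step size), the control-variate estimate tracks the full-data statistic far more accurately than a plain minibatch estimate of the same size (data, path and batch sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
m, b, K = 1000, 10, 50
a = rng.normal(size=m)                  # hypothetical data; per-example statistic s_j(th) = a_j * th
full = lambda th: a.mean() * th         # exact mean statistic s_bar(th)

thetas = 1.0 + 0.01 * np.arange(K + 1)  # slowly moving parameter path
S = full(thetas[0])                     # outer-loop refresh: start from the full sum
spider_err, naive_err = [], []
for k in range(K):
    B = rng.integers(0, m, size=b)      # minibatch sampled with replacement
    # Line-8 style control-variate update: add minibatch average of the increments
    S = S + np.mean(a[B]) * (thetas[k + 1] - thetas[k])
    naive = np.mean(a[B]) * thetas[k + 1]     # plain minibatch estimate, same batch
    spider_err.append((S - full(thetas[k + 1])) ** 2)
    naive_err.append((naive - full(thetas[k + 1])) ** 2)
```

Both estimates are unbiased, but the recursive one only accumulates noise proportional to the parameter increments, so its mean squared error stays orders of magnitude below that of the plain minibatch estimate when the iterates move slowly.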
We summarize these two regimes in the following table, focusing on the large-n, large-ω asymptotics. For each regime, we indicate the increase in complexity K_opt(ε) = O(1/(γε)) resulting from compression. We provide a discussion of the computed-conditional-expectations complexity K_CE in Appendix E.2.

| γ regime (dominating term in (10)) | Example situation | Increase in K_opt(ε) |
|---|---|---|
| v_min/L_Ẇ | large ratio n/ω³ | 1 |
| v_min√n/(v_max L ω^{3/2}) | low ratio n/ω³ | ω^{3/2}/√n |

¹ As a corollary of [10, Theorem 2], the optimization complexity of SPIDER-EM is k_out + k_in k_out, that is, of order ε⁻¹ to reach ε-stationarity.

4 Numerical illustrations

In this section, we illustrate the performance of FedEM and VR-FedEM applied to inference in Gaussian Mixture Models (GMM), on a synthetic data set and on the MNIST data set. We also present an application to federated missing data imputation with the analysis of the eBird data set [34, 1].

Synthetic data. The synthetic data are drawn from the following GMM model: for all ℓ ∈ [N]* and g ∈ {0, 1}, P(Z_ℓ = g) = π_g; and conditionally to Z_ℓ = g, Y_ℓ ∼ N₂(μ_g, Σ). The 2 × 2 covariance matrix Σ is known, and the parameters to be fitted are the weights (π_0, π_1) and the expectations (μ_0, μ_1). The total number of examples is N = 10⁴, the number of agents is n = 10², and the probability of participation of the workers is p = 0.75. FedEM and VR-FedEM are run with γ = 10⁻², ω = 1 and α = 10⁻². For FedEM, we consider the finite-sum setting where s̄_i = m⁻¹ Σ_{j=1}^m s̄_ij with m = 10²; the oracle S_{k+1,i} is obtained by averaging over a minibatch of b = 20 examples. For VR-FedEM, we set b = 5 and k_in = 20. We run the two algorithms for 500 epochs (one epoch corresponds to N conditional expectation evaluations s̄_ij). Figure 1 shows a trajectory of ‖H_k‖² for FedEM (and ‖H_{t,k}‖² for VR-FedEM), along with the value of the mean field ‖h(Ŝ_k)‖² for FedEM (and ‖h(Ŝ_{t,k})‖² for VR-FedEM).
The results illustrate the variance reduction, and give insight on the variability of the trajectories produced by the two algorithms.

MNIST data set. We perform a similar experiment on the MNIST data set, to illustrate the behaviour of FedEM and VR-FedEM on a GMM inference problem with real data. The data set consists of N = 7×10^4 images of handwritten digits, each with 784 pixels. We pre-process the data set by removing 67 uninformative pixels (those which are zero across all images), obtaining d = 717 pixels per image. Second, we apply principal component analysis to reduce the data dimension; we keep the d_PC = 20 principal components of each observation. These N preprocessed observations are distributed at random across n = 10^2 servers, each containing m = 700 observations. We estimate an R^{d_PC}-multivariate GMM with G = 10 components. Details on the multivariate Gaussian mixture model are given in the supplementary material (see Appendix F). Here again, s_i is a sum over the m examples available at server #i; the minibatches are independent and sampled at random in [m]* with replacement; we choose b = 20 and the step size is constant, set to γ = 10^-3. The same initial value Ŝ_init is used for all experiments: we set Ŝ_init := s(π^0, μ^0, Σ̂^0), where π^0_g = 1/G for all g ∈ [G]*, the expectations μ^0_g are sampled uniformly at random among the available examples, and Σ̂^0 is the empirical covariance matrix of the N examples. Figure 3 shows the sequence of parameter estimates for the weights and the squared norm of the mean field ||H_k||^2 for FedEM (resp. ||H_{t,k}||^2 for VR-FedEM) vs the number of epochs.

Federated missing values imputation for citizen science. We develop FedMissEM, a special instance of FedEM designed for missing values imputation in the federated setting; we apply it to the analysis of part of the eBird data base [34, 1], a citizen science smartphone application for biodiversity monitoring.
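The MNIST preprocessing pipeline described above can be sketched as follows. A random stand-in matrix replaces the actual MNIST data (loading it is environment-specific), and N is scaled down for brevity; the three steps (drop always-zero pixels, PCA via SVD, random sharding across workers) follow the text:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for the N x 784 MNIST matrix (scaled down; real data loading omitted)
N, d_raw = 1000, 784
X = rng.random((N, d_raw))
X[:, rng.choice(d_raw, size=67, replace=False)] = 0.0  # 67 always-zero pixels

# 1) Remove uninformative (always-zero) pixels: 784 - 67 = 717 remain
keep = ~np.all(X == 0, axis=0)
X = X[:, keep]

# 2) PCA down to d_PC = 20 components, via SVD of the centered data matrix
d_pc = 20
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_pc = Xc @ Vt[:d_pc].T  # N x 20 principal-component scores

# 3) Distribute at random across n workers, m = N / n observations each
n = 100
perm = rng.permutation(N)
shards = np.split(X_pc[perm], n)  # list of n arrays of shape (m, 20)
print(X.shape[1], X_pc.shape, len(shards), shards[0].shape)
```

With the paper's actual sizes (N = 7×10^4, n = 10^2), each shard would contain m = 700 observations, as stated above.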
In eBird, citizens record wildlife observations, specifying the ecological site they visited, the date, the species, and the number of observed specimens. Two major challenges arise: (i) ecological sites are visited irregularly, which leads to missing values, and (ii) non-professional observers have heterogeneous wildlife counting schemes. Model and the FedMissEM algorithm. I observers participate in the programme; there are J ecological sites and L time stamps. Each observer #i provides a J×L matrix X^i and a subset of indices Ω^i ⊆ [J]* × [L]*. For j ∈ [J]* and ℓ ∈ [L]*, the variable X^i_{jℓ} encodes the observation that would be collected by observer #i if site #j were visited at time stamp #ℓ; since there are unvisited sites, we denote by Y^i := {X^i_{jℓ}, (j, ℓ) ∈ Ω^i} the set of observed values and by Z^i := {X^i_{jℓ}, (j, ℓ) ∉ Ω^i} the set of unobserved values. The statistical model is parameterized by a matrix θ ∈ R^{J×L}, where θ_{jℓ} is a scalar parameter characterizing the distribution of species individuals at site j and time stamp ℓ. For instance, θ_{jℓ} is the log-intensity of a Poisson distribution when the observations are count data, or the log-odds of a binomial model when the observations are presence-absence data. This model could be extended to the case where observers #i and #i′ count different numbers of specimens on average at the same location and time stamp, because they do not have access to the same material or do not have the same level of expertise: heterogeneity between observers could be modeled by using different parameters for each individual #i, say θ^i ∈ R^{J×L}. Here, we consider the case where θ^i_{jℓ} = θ_{jℓ} for all (j, ℓ) ∈ [J]* × [L]* and i ∈ [I]*.
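The two parameterizations mentioned above (θ_{jℓ} as a Poisson log-intensity for count data, and as a binomial log-odds for presence-absence data) can be illustrated with a minimal sketch; the specific value of θ is arbitrary:

```python
import numpy as np

theta = 1.2  # one scalar entry theta_{j,l} of the parameter matrix (arbitrary value)

# Count data: theta is the log-intensity of a Poisson model
lam = np.exp(theta)  # Poisson mean, lambda = e^theta

# Presence-absence data: theta is the log-odds of a Bernoulli/binomial model
p = 1.0 / (1.0 + np.exp(-theta))  # sigmoid: p = e^theta / (1 + e^theta)

rng = np.random.default_rng(3)
counts = rng.poisson(lam, size=100_000)     # simulated count observations
presence = rng.random(100_000) < p          # simulated presence-absence observations
print(counts.mean(), presence.mean())
```

In both cases the natural parameter θ_{jℓ} enters the exponential-family density linearly, which is what makes the complete-data sufficient statistics linear in the (possibly missing) observations.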
We further assume that the entries {X^i_{jℓ}, i ∈ [I]*, j ∈ [J]*, ℓ ∈ [L]*} are independent, with a distribution from an exponential family with respect to some reference measure on R, of the form x ↦ ρ(x) exp{x θ_{jℓ} − ψ(θ_{jℓ})}. Algorithm 7 in Appendix F.2 provides details on the model, and the pseudo-code for FedMissEM.

Figure 1: Trajectory of FedEM vs the number of epochs (left; blue line: ||h(Ŝ_k)||^2; red line: ||H_k||^2) and of VR-FedEM (right; dashed blue line: ||h(Ŝ_k)||^2; solid red line: ||H_{t,k}||^2).

Figure 2: Estimated temporal trends for the Common Buzzard (left) and the Mallard (right). Blue crosses: estimated monthly counts; red triangles: number of missing values. Dotted lines: LOESS regressions for the estimated counts (blue) and the number of missing values (red).

Figure 3: [Left to right] For FedEM: evolution of the estimates of the weights π_g for g ∈ [G]* vs the number of epochs (first plot), and evolution of the squared norm of the mean field ||H_k||^2 vs the number of epochs (second plot). Then, the same quantities for VR-FedEM (third and fourth plots).

Application to eBird data analysis. We apply FedMissEM to the analysis of part of the eBird data base [34, 1] of field observations reported in France by I = 2,465 observers, across J = 9,721 sites and at L = 525 monthly time points.
We analyze successively two data sets corresponding to observations of two relatively common species: the Common Buzzard and the Mallard. These subsamples correspond respectively to N = 5,980 and N = 12,185 field observations. The I field observers are randomly assigned to n = 10 groups (the observations of the field observers from group c ∈ [n]* are allocated to server #c). For c ∈ [n]*, server c contains N_c observations; in our two examples, N_c ranges between 400 and 1,500. We run FedMissEM for 150 epochs, with γ = 10^-4, α = 10^-3, b = 10^2, a rank r = 2 and λ = 0; for the distribution of the variables X^i_{jℓ}, we use a Gaussian distribution with unknown expectation θ_{jℓ} and variance 1. We recover aggregated temporal trends at the national French level for these two bird species by summing the estimated counts across ecological sites, for each time stamp; the trends are displayed in Figure 2, along with a locally estimated scatterplot smoothing (LOESS).

5 Conclusions

We introduced FedEM which is, to the best of our knowledge, the first algorithm implementing EM in a FL setting; it handles compression of exchanged information, data heterogeneity and partial participation. We further extended it to incorporate a variance reduction scheme, yielding VR-FedEM. We derived complexity bounds which highlight the efficiency of the two algorithms, and illustrated our claims with numerical simulations, as well as an application to biodiversity monitoring data. In a simultaneously published work, Marfoq et al. [25] consider a different federated EM algorithm, addressing the personalization challenge by considering a mixture model. Under the assumption that each local data distribution is a mixture of unknown underlying distributions, their algorithm computes a model corresponding to each distribution.
On the other hand, we focus on the curved exponential family, with variance reduction, partial participation and compression, and on limiting the impact of heterogeneity, but do not address personalization.

Acknowledgments. The work of A. Dieuleveut and E. Moulines is partially supported by ANR-19-CHIA-0002-01 / chaire SCAI, and Hi!Paris. The work of G. Fort is partially supported by the Fondation Simone et Cino del Duca under the project OpSiMorE.

Broader Impact of this work. This work is mostly theoretical, and we believe it does not currently present any direct societal consequence. However, the methods described in this paper can be used to train machine learning models which could themselves have societal consequences. For instance, the deployment of machine learning models can suffer from gender and racial bias, or amplify existing inequalities.

References

[1] eBird. 2017. eBird: an online database of bird distribution and abundance [web application]. eBird, Cornell Lab of Ornithology, Ithaca, New York. Available: http://www.ebird.org (accessed: 21 March 2020).
[2] D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli. The convergence of sparsified gradient methods. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31, pages 5973–5983. Curran Associates, Inc., 2018.
[3] A. Benveniste, M. Métivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximations. Springer Verlag, 1990.
[4] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, Cambridge; Hindustan Book Agency, New Delhi, 2008.
[5] J. Chen, J. Zhu, Y. Teh, and T. Zhang. Stochastic expectation maximization with variance reduction. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7967–7977, 2018.
[6] A. Defazio, F.
Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1646–1654. Curran Associates, Inc., 2014.
[7] B. Delyon, M. Lavielle, and E. Moulines. Convergence of a stochastic approximation version of the EM algorithm. Ann. Statist., 27(1):94–128, 1999.
[8] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B Met., 39(1):1–38, 1977.
[9] C. Fang, C. Li, Z. Lin, and T. Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 689–699. Curran Associates, Inc., 2018.
[10] G. Fort, E. Moulines, and H.-T. Wai. A Stochastic Path Integral Differential EstimatoR Expectation Maximization algorithm. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 16972–16982. Curran Associates, Inc., 2020.
[11] G. Fort, P. Gach, and E. Moulines. Fast incremental expectation maximization for finite-sum optimization: non-asymptotic convergence. Statistics and Computing, 2021. Accepted for publication.
[12] G. Fort, E. Moulines, and H.-T. Wai. Geom-SPIDER-EM: Faster variance reduced stochastic expectation maximization for nonconvex finite-sum optimization. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2021.
[13] S. Frühwirth-Schnatter, G. Celeux, and C. P. Robert, editors. Handbook of Mixture Analysis. Chapman & Hall/CRC Handbooks of Modern Statistical Methods. CRC Press, Boca Raton, FL, 2019.
[14] S. Ghadimi and G. Lan.
Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim., 23(4):2341–2368, 2013.
[15] E. Gorbunov, F. Hanzely, and P. Richtárik. A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent. In International Conference on Artificial Intelligence and Statistics, pages 680–690. PMLR, 2020.
[16] S. Horváth and P. Richtárik. A better alternative to error feedback for communication-efficient distributed learning. In International Conference on Learning Representations, 2021.
[17] S. Horváth, D. Kovalev, K. Mishchenko, S. Stich, and P. Richtárik. Stochastic distributed learning with gradient quantization and variance reduction. arXiv preprint arXiv:1904.05115, 2019.
[18] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 315–323. Curran Associates, Inc., 2013.
[19] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, R. G. L. D'Oliveira, H. Eichner, S. El Rouayheb, D. Evans, J. Gardner, Z. Garrett, A. Gascón, B. Ghazi, P. B. Gibbons, M. Gruteser, Z. Harchaoui, C. He, L. He, Z. Huo, B. Hutchinson, J. Hsu, M. Jaggi, T. Javidi, G. Joshi, M. Khodak, J. Konečný, A. Korolova, F. Koushanfar, S. Koyejo, T. Lepoint, Y. Liu, P. Mittal, M. Mohri, R. Nock, A. Özgür, R. Pagh, H. Qi, D. Ramage, R. Raskar, M. Raykova, D. Song, W. Song, S. U. Stich, Z. Sun, A. Theertha Suresh, F. Tramèr, P. Vepakomma, J. Wang, L. Xiong, Z. Xu, Q. Yang, F. X. Yu, H. Yu, and S. Zhao. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, Now Publishers, 2021.
[20] B. Karimi, H.-T. Wai, E. Moulines, and M. Lavielle. On the global convergence of (fast) incremental expectation maximization methods. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E.
Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 2837–2847. Curran Associates, Inc., 2019.
[21] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. In H. Daumé III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5132–5143. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/karimireddy20a.html.
[22] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
[23] F. Kunstner, R. Kumar, and M. Schmidt. Homeomorphic-invariance of EM: Non-asymptotic convergence in KL divergence for exponential families via mirror descent. In International Conference on Artificial Intelligence and Statistics, pages 3295–3303. PMLR, 2021.
[24] K. Lange. MM Optimization Algorithms. SIAM - Society for Industrial and Applied Mathematics, 2016.
[25] O. Marfoq, G. Neglia, A. Bellet, L. Kameni, and R. Vidal. Federated multi-task learning under a mixture of distributions. In Advances in Neural Information Processing Systems (NeurIPS 2021), 2021.
[26] G. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley Series in Probability and Statistics. Wiley, 2008.
[27] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.
[28] K. Mishchenko, E. Gorbunov, M. Takáč, and P. Richtárik. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.
[29] K. Murphy and S. J. Russell. Dynamic Bayesian networks: Representation, inference and learning. PhD thesis, University of California, Berkeley, 2002.
[30] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč.
SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, ICML'17, pages 2613–2621. JMLR.org, 2017.
[31] C. Philippenko and A. Dieuleveut. Bidirectional compression in heterogeneous settings for distributed or federated learning with partial participation: tight convergence guarantees. Technical report, arXiv:2006.14591v3, 2020.
[32] C. Philippenko and A. Dieuleveut. Preserved central model for faster bidirectional compression in distributed settings. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[33] F. Sattler, S. Wiedemann, K.-R. Müller, and W. Samek. Robust and communication-efficient federated learning from non-i.i.d. data. IEEE Transactions on Neural Networks and Learning Systems, pages 1–14, 2019.
[34] B. L. Sullivan, C. L. Wood, M. J. Iliff, R. E. Bonney, D. Fink, and S. Kelling. eBird: A citizen-based bird observation network in the biological sciences. Biological Conservation, 142(10):2282–2292, 2009.
[35] H. Tang, C. Yu, X. Lian, T. Zhang, and J. Liu. DoubleSqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. In International Conference on Machine Learning, pages 6155–6165. PMLR, May 2019.
[36] Z. Wang, K. Ji, Y. Zhou, Y. Liang, and V. Tarokh. SpiderBoost and Momentum: Faster stochastic variance reduction algorithms. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 2406–2416, 2019.
[37] B. Woodworth, K. K. Patel, S. Stich, Z. Dai, B. Bullins, B. McMahan, O. Shamir, and N. Srebro. Is local SGD better than minibatch SGD? In H. Daumé III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 10334–10343. PMLR, 13–18 Jul 2020.
URL https://proceedings.mlr.press/v119/woodworth20a.html.
[38] L. Xu and M. I. Jordan. On convergence properties of the EM algorithm for Gaussian mixtures. Neural Computation, 8(1):129–151, 1996.
[39] Q. Yang, Y. Liu, T. Chen, and Y. Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1–19, 2019.