# Federated Expectation Maximization with heterogeneity mitigation and variance reduction

Aymeric Dieuleveut (Centre de Mathématiques Appliquées, École Polytechnique, France; Institut Polytechnique de Paris) aymeric.dieuleveut@polytechnique.edu
Gersende Fort (Institut de Mathématiques de Toulouse, Université de Toulouse; CNRS UPS, Toulouse, France) gersende.fort@math.univ-toulouse.fr
Eric Moulines (Centre de Mathématiques Appliquées, École Polytechnique, France; CS Dpt, HSE University, Russian Federation) eric.moulines@polytechnique.edu
Geneviève Robin (Laboratoire de Mathématiques et Modélisation d'Évry, Université d'Évry Val d'Essonne; CNRS, Évry-Courcouronnes, France) genevieve.robin@cnrs.fr

The Expectation Maximization (EM) algorithm is the default algorithm for inference in latent variable models. As in any other field of machine learning, applications of latent variable models to very large datasets make the use of advanced parallel and distributed architectures mandatory. This paper introduces FedEM, which is the first extension of the EM algorithm to the federated learning context. FedEM is a new communication-efficient method, which handles partial participation of local devices, and is robust to heterogeneous distributions of the datasets. To alleviate the communication bottleneck, FedEM compresses appropriately defined complete-data sufficient statistics. We also develop and analyze an extension of FedEM that further incorporates a variance reduction scheme. In all cases, we derive finite-time complexity bounds for smooth non-convex problems. Numerical results are presented to support our theoretical findings, as well as an application to federated missing values imputation for biodiversity monitoring.

1 Introduction

The Expectation Maximization (EM) algorithm is the most popular approach for inference in latent variable models.
The EM algorithm, a special instance of the Majorize/Minimize algorithm [24], was formalized by [8] and is without doubt one of the fundamental algorithms in machine learning. Applications include, among many others, finite mixture analysis, latent factor model inference, and missing data imputation; see [38, 29, 26, 13] and the references therein. As in any other field of machine learning, training latent variable models on very large datasets makes the use of advanced parallel and distributed architectures mandatory. Federated Learning (FL) [22, 39], which exploits the computation power of a large number of edge devices to perform distributed machine learning, is a powerful framework to achieve this goal.

The conventional EM algorithm is not suitable for FL settings. We propose several new distributed versions of the EM algorithm supporting compressed communication.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

More precisely, our objective is to minimize a non-convex finite-sum smooth objective function

Argmin_{θ∈Θ} F(θ), F(θ) := (1/n) Σ_{i=1}^n L_i(θ) + R(θ), Θ ⊆ R^d, (1)

where n is the number of workers/devices connected to a central server, and worker #i only has access to its local data; finally, R is a penalty term which may be introduced to promote sparsity, regularity, etc. In latent variable models, L_i(θ) := −(1/m) Σ_{j=1}^m log p(y_ij; θ), where {y_ij}_{j=1}^m are the m observations available for worker #i, and p(y; θ) is the incomplete likelihood. p(y; θ) is defined by marginalizing the complete-data likelihood p(y, z; θ), the joint probability density function of the observation y and a non-observed latent variable z ∈ Z, i.e.

p(y; θ) = ∫_Z p(y, z; θ) μ(dz),

where Z is the latent space and μ is a measure on Z. We focus in this paper on the case where p(y, z; θ) belongs to a curved exponential family, given by

p(y, z; θ) := ρ(y, z) exp( ⟨s(y, z), φ(θ)⟩ − ψ(θ) ), (2)

where s(y, z) ∈ R^q is the complete-data sufficient statistic, and φ : Θ → R^q, ψ : Θ → R, ρ : Y × Z →
R^+ are vector/scalar functions.

In the absence of communication constraints, the EM algorithm is a popular method to solve (1). It alternates between two steps. In the Expectation (E) step, using the current value of the iterate θ_curr, it computes a majorizing function θ ↦ Q(θ, θ_curr), given up to an additive constant by

Q(θ, θ_curr) := −⟨s̄(θ_curr), φ(θ)⟩ + ψ(θ) + R(θ), where s̄(θ) := (1/n) Σ_{i=1}^n s̄_i(θ); (3)

and s̄_i(θ) is the i-th device's conditional expectation of the complete-data sufficient statistics:

s̄_i(θ) := (1/m) Σ_{j=1}^m s̄_ij(θ), s̄_ij(θ) := ∫_Z s(y_ij, z) p(z|y_ij; θ) μ(dz), (4)

where p(z|y_ij; θ) := p(y_ij, z; θ)/p(y_ij; θ). As for the M step, an updated value of θ_curr is computed as a minimizer of θ ↦ Q(θ, θ_curr). The majorizing function is then updated with the new θ_curr; this process is iterated until convergence. The EM algorithm is most useful when, for any θ_curr ∈ Θ, the function θ ↦ Q(θ, θ_curr) is a convex function of the parameter θ which can be minimized either explicitly or with little computational effort. A major advantage of the EM algorithm stems from its invariance under homeomorphisms, contrary to classical first-order methods: the EM updates are the same for any continuous invertible re-parametrization [23].

In the FL context, the vanilla EM algorithm is affected by three major problems: (1) the communication bottleneck, (2) data heterogeneity, and (3) partial participation (PP) of the workers. When the number of workers is large, the cost of communication becomes overwhelming. A classical technique to alleviate this problem is communication compression. Most FL algorithms are first-order methods, and compression is typically applied to stochastic gradients. Yet, these methods are not appropriate to solve (1) since (i) they do not preserve the desirable homeomorphic invariance property, and (ii) the full EM iteration is not distributed, since the M step is performed by the central server only. This calls for an extension of the EM algorithm to the FL setting.
Since workers are often users' personal devices, the issue of data heterogeneity naturally arises. Our model in Equations (1), (3) and (4) allows the local loss functions to depend on the worker i ∈ {1, ..., n} and the observations y_ij to be independent but not necessarily identically distributed. In addition, our theoretical results deal with specific behaviors for each worker i ∈ {1, ..., n}; see e.g. A5, A7 and A8. In the FL-EM setting, heterogeneity manifests itself through the non-equality of the local conditional expectations of the complete-data sufficient statistics, the s̄_i's; the algorithms must be modified to ensure convergence at the central server. Finally, a subset of users is potentially inactive in each learning round, being unavailable or unwilling to participate. Thus, taking into account PP of the workers and its impact on the convergence of algorithms is a major issue.

FedEM. The main contribution of our paper is a new method called FedEM, supporting communication compression, partial participation and data heterogeneity. In this algorithm, the workers compute an estimate of the local complete-data sufficient statistics s̄_i using a minibatch of data, apply an unbiased compression operator to a noise-compensated version (using a technique inspired by [17, 15]) and send the result to the central server, which performs aggregation and the M-step (i.e. the parameter update).

VR-FedEM. We improve FedEM by adding a variance reduction method inspired by the SPIDER framework [9], which has recently been extended to the EM framework [10]. For both FedEM and VR-FedEM, the central server updates the expectations of the global complete-data sufficient statistics through a Stochastic Approximation procedure [3, 4].
When compared to FedEM, VR-FedEM additionally performs variance reduction for each worker, progressively alleviating the variance brought by the random oracles which provide approximations of the local complete-data sufficient statistics.

Theoretical analysis. EM in the curved exponential family setting converges to the roots of a function h (see e.g. Section 2). We introduce a unified theoretical framework which covers the convergence of the FedEM and VR-FedEM algorithms in the non-convex case and establish convergence guarantees for finding an ε-stationary point (see Theorem 1 and Theorem 3). In both cases, we provide the number K_opt(ε) of optimization steps and the number K_CE(ε) of computed conditional expectations to reach ε-stationarity. These results show that in the Stochastic Approximation steps of VR-FedEM, the step sizes are independent of m, the number of observations per server. Furthermore, the computational cost in terms of K_CE(ε) improves on earlier results. In this respect, VR-FedEM has the same advantages as SPIDER [9] compared to SVRG [18] and SAGA [6], or as SPIDER-EM [10] compared to sEM-vr [5] and FIEM [20, 11]. Lastly, our bounds demonstrate the robustness of FedEM and VR-FedEM to data heterogeneity. Finally, seen as a root-finding algorithm in a quantized FL setting, VR-FedEM can be compared to VR-DIANA [17]: we show that VR-FedEM does not require the step sizes to decrease with m and provides state-of-the-art iteration complexity to reach a precision ε.

Notations. For vectors a, b in R^q, ⟨a, b⟩ is the Euclidean scalar product, and ‖·‖ denotes the associated norm. For r ≥ 1, ‖a‖_r is the ℓ_r-norm of a vector a. The Hadamard product a ⊙ b denotes the entrywise product of the two vectors a, b. By convention, vectors are column vectors. For a matrix A, Aᵀ is its transpose and ‖A‖_F is its Frobenius norm; for two matrices A, B, ⟨A, B⟩ := Trace(BᵀA). For a positive integer n, set [n]* := {1, ..., n} and [n] := {0, ..., n}.
The set of non-negative (resp. positive) integers is denoted by N (resp. N*). The minimum (resp. maximum) of two real numbers a, b is denoted by a ∧ b (resp. a ∨ b). We use the Bachmann-Landau notation a(x) = O(b(x)) to state that the growth rate of a(x) is upper bounded by that of b(x).

2 FedEM: Expectation Maximization algorithms for federated learning

Recall the definition of the negative penalized (normalized) log-likelihood F(θ) from (1). Throughout the paper, we make the following assumptions A1 to A3, which define the model at hand.

A1. The parameter set Θ ⊆ R^d is a convex open set. The functions R : Θ → R, φ : Θ → R^q, ψ : Θ → R, and ρ(y_ij, ·) : Z → R^+, s(y_ij, ·) : Z → R^q for i ∈ [n]* and j ∈ [m]* are measurable functions. For any θ ∈ Θ and i ∈ [n]*, the log-likelihood is finite: −∞ < L_i(θ) < ∞.

A2. For all θ ∈ Θ and i ∈ [n]*, the conditional expectation s̄_i(θ) is well-defined.

A3. For any s ∈ R^q, the set Argmin_{θ∈Θ} { ψ(θ) + R(θ) − ⟨s, φ(θ)⟩ } exists and is a singleton, denoted by {T(s)}.

EM defines a sequence {θ_k, k ≥ 0} that can be computed recursively as θ_{k+1} = T ∘ s̄(θ_k), where the map T is defined in A3 and s̄ is defined in (3). On the other hand, the EM algorithm can be defined through a mapping in the space of complete-data sufficient statistics, referred to as the expectation space. In this setting, the EM iteration defines an R^q-valued sequence {Ŝ_k, k ≥ 0} given by Ŝ_{k+1} = s̄ ∘ T(Ŝ_k). Thus, the EM algorithm admits two equivalent representations:

(Parameter space) θ_{k+1} = T ∘ s̄(θ_k); (Expectation space) Ŝ_{k+1} = s̄ ∘ T(Ŝ_k). (5)

In this paper, we focus on the expectation-space representation; see [23] for an interesting discussion of the connection between EM and mirror descent. It has been shown in [7] that if s⋆ is a fixed point of the EM algorithm in the expectation space, then θ⋆ := T(s⋆) is a fixed point of the EM algorithm in the parameter space, i.e., θ⋆ = T ∘ s̄(θ⋆); note that the converse is also true.
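As a concrete illustration of the two representations in (5), the following sketch runs both recursions on a toy two-component Gaussian mixture with unit variances and known equal weights (the model, data and initialization here are purely illustrative, not from the paper); with Ŝ_0 := s̄(θ_0), the iterates stay linked by Ŝ_k = s̄(θ_k):

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data: two well-separated Gaussian clusters, unit variances, known weights 1/2
y = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

def s_bar(theta):
    """E-step map: averaged conditional expectation of the complete-data statistics."""
    mu0, mu1 = theta
    # posterior probability of component 1 (equal weights, unit variances)
    r1 = 1.0 / (1.0 + np.exp(0.5 * (y - mu1) ** 2 - 0.5 * (y - mu0) ** 2))
    r0 = 1.0 - r1
    return np.array([np.mean(r0 * y), np.mean(r0), np.mean(r1 * y), np.mean(r1)])

def T(s):
    """M-step map: unique minimizer of the complete-data objective given s."""
    return np.array([s[0] / s[1], s[2] / s[3]])

theta = np.array([-1.0, 1.0])   # parameter-space iterate theta_k
S = s_bar(theta)                # expectation-space iterate, S_0 := s_bar(theta_0)
for _ in range(50):
    theta = T(s_bar(theta))     # theta_{k+1} = T o s_bar(theta_k)
    S = s_bar(T(S))             # S_{k+1} = s_bar o T(S_k)
```

After the loop, the two recursions agree (S equals s̄(theta)) and the means have converged near the cluster centers, which is the equivalence claimed in (5).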
Define the functions h_i and h from R^q to R^q by

h(s) := (1/n) Σ_{i=1}^n h_i(s), where h_i(s) := s̄_i ∘ T(s) − s. (6)

A key property is that the fixed points of EM in the expectation space are the roots of the mean field s ↦ h(s) (see (3) for the definition of s̄). Therefore, convergence of EM-based algorithms is evaluated in terms of ε-stationarity (see [14, 10]): for all ε > 0, there exists a (possibly random) termination time K such that E[‖h(Ŝ_K)‖²] ≤ ε. Another key property of EM is that it is a monotonic algorithm: each iteration leads to a decrease of the negative penalized log-likelihood, i.e., F(θ_{k+1}) ≤ F(θ_k) or, equivalently in the expectation space, F ∘ T(Ŝ_{k+1}) ≤ F ∘ T(Ŝ_k) (for sequences {θ_k, k ≥ 0} and {Ŝ_k, k ≥ 0} given by (5)). A4 assumes that the roots of the mean field h are the roots of the gradient of F ∘ T (see [7] for the same assumption when studying Stochastic EM). A5 assumes global Lipschitz properties of the functions h_i.

A4. The function W := F ∘ T : R^q → R is continuously differentiable on R^q and its gradient is globally Lipschitz with constant L_Ẇ. Furthermore, for any s ∈ R^q, ∇W(s) = −B(s)h(s), where B(s) is a q × q positive definite matrix. In addition, there exist 0 < v_min ≤ v_max such that for any s ∈ R^q, the spectrum of B(s) is in [v_min, v_max].

A5. For any i ∈ [n]*, there exists L_i > 0 such that for any s, s′ ∈ R^q,

‖h_i(s) − h_i(s′)‖ = ‖( s̄_i ∘ T(s) − s ) − ( s̄_i ∘ T(s′) − s′ )‖ ≤ L_i ‖s − s′‖.

A Federated EM algorithm.

Algorithm 1: FedEM with partial participation
Data: k_max ∈ N*; for i ∈ [n]*, V_{0,i} ∈ R^q; Ŝ_0 ∈ R^q; a positive sequence {γ_{k+1}, k ∈ [k_max − 1]}; α > 0; the coefficient p = E_{A∼PP}[card(A)]/n.
Result: The FedEM-PP sequence {Ŝ_k, k ∈ [k_max]}
1 Set V_0 = n⁻¹ Σ_{i=1}^n V_{0,i}
2 for k = 0, ..., k_max − 1 do
3   Sample A_{k+1} ∼ PP
4   for i ∈ A_{k+1} do
5     (worker #i)
6     Sample S_{k+1,i}, an approximation of s̄_i ∘ T(Ŝ_k)
7     Set Δ_{k+1,i} = S_{k+1,i} − Ŝ_k − V_{k,i}
8     Set V_{k+1,i} = V_{k,i} + α Quant(Δ_{k+1,i})
9     Send Quant(Δ_{k+1,i}) to the central server
10  for i ∉ A_{k+1} do
11    (worker #i)
12    Set V_{k+1,i} = V_{k,i} (no update)
13  (the central server)
14  Set H_{k+1} = V_k + (np)⁻¹ Σ_{i∈A_{k+1}} Quant(Δ_{k+1,i})
15  Set Ŝ_{k+1} = Ŝ_k + γ_{k+1} H_{k+1}
16  Set V_{k+1} = V_k + α n⁻¹ Σ_{i∈A_{k+1}} Quant(Δ_{k+1,i})
17  Send Ŝ_{k+1} and T(Ŝ_{k+1}) to the n workers

Our first contribution, the novel algorithm FedEM, is described by Algorithm 1. The algorithm encompasses partial participation of the workers: at iteration #(k + 1), only a subset A_{k+1} of active workers participates in the training, see line 3. The average fraction of participating workers is denoted by p. Each active worker #i computes an unbiased approximation S_{k+1,i} (line 6) of s̄_i ∘ T(Ŝ_k); conditionally to the past (see Appendix D.2 for a rigorous definition), these approximations are independent. The workers then transmit to the central server compressed information about the new sufficient statistics. A naive solution would be to compress and transmit S_{k+1,i} − Ŝ_k, but data heterogeneity between workers often prevents these local differences from vanishing at the optimum, leading to large compression errors and impairing convergence of the algorithm. Following [28], a memory V_{k,i} (initialized to h_i(Ŝ_0) at k = 0) is introduced, and the differences Δ_{k+1,i} := S_{k+1,i} − Ŝ_k − V_{k,i} are compressed for i ∈ A_{k+1} (line 7 and line 9). These memories are updated locally: V_{k+1,i} = V_{k,i} + α Quant(Δ_{k+1,i}) at line 8, with α > 0 (typically set to 1/(1 + ω), where ω is defined in A6). On its side, the central server releases an aggregated estimate Ŝ_{k+1} of the complete-data sufficient statistics by averaging the quantized differences, (np)⁻¹ Σ_{i∈A_{k+1}} Quant(Δ_{k+1,i}), and adding V_k (line 14 and line 15). Then, it updates V_{k+1} = V_k + α n⁻¹ Σ_{i∈A_{k+1}} Quant(Δ_{k+1,i}), see line 16. The final step consists in solving the M-step of the EM algorithm, i.e. in computing T(Ŝ_{k+1}) (see A3). We finally state our assumption on the compression process.
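To make the interplay between the memories V_{k,i} and the server update concrete, here is a minimal full-participation sketch of one run of Algorithm 1 on a hypothetical linear toy model (worker fields h_i(s) = 0.5(target_i − s); all numerical choices are illustrative). For readability, Quant is the identity (ω = 0), in which case H_{k+1} is exactly the mean field:

```python
import numpy as np

rng = np.random.default_rng(1)
n, q = 4, 3                       # number of workers, dimension of the statistics
alpha, gamma = 0.5, 0.1           # memory step and SA step (hypothetical values)

# hypothetical heterogeneous workers: h_i(s) = 0.5 * (target_i - s)
targets = rng.normal(size=(n, q))

def local_stat(i, S_hat):
    """Exact oracle for worker i's statistic s_bar_i o T(S_hat) in this toy model."""
    return S_hat + 0.5 * (targets[i] - S_hat)

quant = lambda v: v               # identity compression (omega = 0) for readability

S_hat = np.zeros(q)
V_loc = np.array([local_stat(i, S_hat) - S_hat for i in range(n)])  # V_{0,i} = h_i(S_0)
V = V_loc.mean(axis=0)
for k in range(200):
    deltas = np.empty((n, q))
    for i in range(n):                                   # workers, locally
        deltas[i] = quant(local_stat(i, S_hat) - S_hat - V_loc[i])
        V_loc[i] += alpha * deltas[i]                    # memory update (line 8)
    H = V + deltas.mean(axis=0)                          # server field estimate (line 14)
    S_hat = S_hat + gamma * H                            # SA update (line 15)
    V = V + alpha * deltas.mean(axis=0)                  # aggregated memory (line 16)
```

Despite the heterogeneous targets, Ŝ converges to the root of the averaged field (the mean of the targets), and H vanishes there, which is the behavior the memories are designed to preserve under compression.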
We consider a large class of unbiased compression operators Quant satisfying a variance bound:

A6. There exists ω ≥ 0 such that for any s ∈ R^q: E[Quant(s)] = s and E[‖Quant(s) − s‖²] ≤ ω‖s‖².

Intuitively, the stronger the compression, the larger ω. Remark that if no compression is used, or equivalently if Quant(s) = s for all s ∈ R^q, then A6 is satisfied with ω = 0. An example of a quantization operator satisfying A6 is random dithering, described as the random operator Quant : R^q → R^q,

Quant(x) = (1/s_quant) ‖x‖_r sign(x) ⊙ ⌊ s_quant (|x|/‖x‖_r) + ξ ⌋,

where r ≥ 1 is user-defined, ξ is a uniform random variable on [0, 1]^q and s_quant ∈ N* is the number of rounding levels; see [17, 2]. This operator satisfies A6 with ω = s_quant⁻¹ O(q^{1/r} + q^{1/2}); see [17, Example 1]. Another example, namely block-p-quantization, is provided in the supplemental (see Appendix B). More generally, this assumption is valid for many compression operators, for example those resulting in sparsification [see e.g. 28].

The convergence analysis is carried out under the following assumptions on the oracle S_{k+1,i}: for any i ∈ [n]*, the approximations S_{k+1,i} are unbiased and their conditional variances are uniformly bounded in k. For each k ∈ N, denote by F_k the σ-algebra generated by {S_{ℓ,i}, A_ℓ; i ∈ [n]*, ℓ ∈ [k]}, including the randomness inherited from the quantization operator Quant up to iteration #k.

A7. For all k ∈ N, conditionally to F_k, {S_{k+1,i}}_{i=1}^n are independent. Moreover, for any i ∈ [n]*, E[S_{k+1,i} | F_k] = s̄_i ∘ T(Ŝ_k) and there exists σ_i² > 0 such that for any k ≥ 0, E[‖S_{k+1,i} − s̄_i ∘ T(Ŝ_k)‖² | F_k] ≤ σ_i².

A7 covers both the finite-sum setting described in the introduction and the online setting. In the finite-sum setting, s̄_i is of the form m⁻¹ Σ_{j=1}^m s̄_ij. In that case, S_{k+1,i} can be the average over a minibatch B_{k+1,i} of size b, sampled at random in [m]* with or without replacement and independently of the history of the algorithm: we have S_{k+1,i} = b⁻¹ Σ_{j∈B_{k+1,i}} s̄_ij ∘ T(Ŝ_k).
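The random dithering operator above can be sketched in a few lines; averaging many independent quantizations recovers x, illustrating the unbiasedness required by A6 (the argument name s_levels plays the role of s_quant; the input vector is arbitrary test data):

```python
import numpy as np

def random_dithering(x, s_levels=4, r=2, rng=np.random.default_rng(2)):
    """Unbiased random dithering: (||x||_r / s) * sign(x) * floor(s |x| / ||x||_r + xi)."""
    norm = np.linalg.norm(x, ord=r)
    if norm == 0.0:
        return np.zeros_like(x)
    xi = rng.uniform(size=x.shape)                  # dither, uniform on [0, 1]^q
    return (norm / s_levels) * np.sign(x) * np.floor(s_levels * np.abs(x) / norm + xi)

x = np.array([0.3, -1.2, 0.05, 2.0])
# empirical check of E[Quant(x)] = x (first part of A6)
avg = np.mean([random_dithering(x) for _ in range(10_000)], axis=0)
```

Each output coordinate takes one of s_quant + 1 magnitude levels (times the norm), so it can be encoded with few bits; the dither ξ is what makes the rounding unbiased.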
In the online setting, the oracles S_{k+1,i} come from online processing of streaming data; in that case, S_{k+1,i} can be computed from a minibatch of independent examples, so that the conditional variance σ_i², which is inversely proportional to the size of the minibatch, can be made arbitrarily small.

Reduction of communication complexity for FL. Reducing the communication cost between workers is a crucial aspect of the FL approach [19]. In gradient-based optimization, four techniques have been used to reduce the amount of communication: (i) increasing the minibatch size and reducing the number of iterations, (ii) increasing the number of local steps between two communication rounds, (iii) using compression, (iv) sampling clients at each step. Here, we provide a tight analysis of strategies (i), (iii) and (iv) (client sampling is part of PP). Regarding the interest of performing multiple local iterations (ii), as analyzed for example in [21, 27] in the classical gradient setting, we note the following. First, from a theoretical standpoint, tradeoffs between larger minibatches and more local iterations are unclear [37]. Second, performing local iterations is not possible in the EM setting: one iteration of EM is the combination of the two steps E and M, and the M step, which requires the use of the map T, is only performed by the central server; this remark is a fundamental specificity of the EM framework (not shared by the gradient framework). In applications, we usually do not want T to be available at each local node. However, our work allows multiple local iterations of the E step to be performed before communicating with the central server. In Algorithm 1, the local statistics S_{k+1,i} are general enough to cover this case; see the comment above on A7. Finally, as we do not perform local full EM iterations, we do not face the well-identified client-drift challenge (in the presence of heterogeneity).
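Strategy (i) can be seen directly on the finite-sum oracle of A7: the minibatch average is unbiased, and its conditional variance σ_i² shrinks as 1/b. A quick empirical check on hypothetical per-example statistics (frozen at one iterate, so the oracle is a plain average):

```python
import numpy as np

rng = np.random.default_rng(5)
m = 500
s_ij = rng.normal(size=(m, 3))        # hypothetical per-example statistics s_ij o T(S_k)
s_i = s_ij.mean(axis=0)               # exact local statistic s_i o T(S_k)

def oracle(b, n_draws=10_000):
    """Minibatch oracle of A7 (sampling with replacement): empirical mean and variance."""
    draws = np.array([s_ij[rng.integers(0, m, size=b)].mean(axis=0)
                      for _ in range(n_draws)])
    return draws.mean(axis=0), np.mean(np.sum((draws - s_i) ** 2, axis=1))

mean_b5, var_b5 = oracle(b=5)
mean_b50, var_b50 = oracle(b=50)
```

Both oracles average to s̄_i (unbiasedness), and the b = 50 oracle has roughly ten times less variance than the b = 5 one, which is the trade-off between communication rounds and per-round computation exploited by strategy (i).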
Yet, we stress that combining compression and heterogeneity raises other challenges: it is known in the Gradient Descent setting (see e.g. [28, 31]) that heterogeneity strongly hinders convergence in the presence of compression. To alleviate the impact of heterogeneity, we introduce the memory variables V_{k,i}.

Convergence analysis, full participation regime. In this paragraph, we focus on the full-participation regime (p = 1): for all k ∈ [k_max]*, A_k = [n]*. We now present in Theorem 1 our key result, from which complexity expressions are derived. The proof is postponed to Appendix C.

Theorem 1. Assume A1 to A7 and set L² := n⁻¹ Σ_{i=1}^n L_i², σ² := n⁻¹ Σ_{i=1}^n σ_i². Let {Ŝ_k, k ∈ [k_max]} be given by Algorithm 1, with ω > 0, α := (1 + ω)⁻¹ and γ_k = γ ∈ (0, γ_max], where

γ_max := ( v_min / (2 L_Ẇ) ) ∧ ( v_min √n / (2√2 L (1 + ω) √ω) ). (7)

Denote by K the uniform random variable on [k_max − 1]. Then, taking V_{0,i} = h_i(Ŝ_0) for all i ∈ [n]*:

E[‖h(Ŝ_K)‖²] ≤ (W(Ŝ_0) − min W) / (v_min γ k_max) + γ L_Ẇ (1 + 5ω) σ² / (v_min n). (8)

When there is no compression (ω = 0, so that Quant(s) = s), we prove that the introduction of the random variables V_{k,i} plays no role, whatever α > 0 and the choice of the V_{0,i}'s, and we have for any γ ∈ (0, 2v_min/L_Ẇ) (see (29) in the supplemental)

E[‖h(Ŝ_K)‖²] ≤ (W(Ŝ_0) − min W) / (v_min γ k_max) + γ L_Ẇ σ² / (v_min n). (9)

Optimizing the learning rate γ, we derive the following corollary (see the proof in Appendix C).

Corollary 2 (of Theorem 1). Choose γ := ( (W(Ŝ_0) − min W) n / (k_max L_Ẇ (1 + 5ω) σ²) )^{1/2} ∧ γ_max. We get

E[‖h(Ŝ_K)‖²] ≤ (W(Ŝ_0) − min W) / (v_min γ_max k_max) + (2/v_min) ( (W(Ŝ_0) − min W) L_Ẇ (1 + 5ω) σ² / (n k_max) )^{1/2}.

Theorem 1 and Corollary 2 do not require any assumption on the distributional heterogeneity of the workers. These results thus remain valid when workers have access to data drawn from different distributions, a widespread situation in FL frameworks. Crucially, without assumptions on the heterogeneity of the workers, a naive implementation of compressed distributed EM (i.e. an implementation without the variables V_{k,i}) would not converge.
Let us comment on the complexity of reaching an ε-stationary point, and more precisely on how the complexity, evaluated in terms of the number of optimization steps, depends on ω, n, σ² and ε. Since K_opt(ε) = k_max, Corollary 2 gives:

K_opt(ε) = O( (W(Ŝ_0) − min W) [ L_Ẇ (1 + 5ω) σ² / (n ε²) + 1 / (γ_max ε) ] ).

Maximal learning rate and compression. The comparison of Theorem 1 with the no-compression case (see (9)) shows that compression impacts γ_max by a factor proportional to √n/ω^{3/2} as ω increases (similar constraints were observed in the risk optimization literature, e.g. in [17, 32]). This highlights two different regimes depending on the ratio n/ω³: if the number of workers n scales at least as ω³, the maximal learning rate is not impacted by compression; on the other hand, for smaller numbers of workers, n ≲ ω³, compression can degrade the maximal learning rate. We illustrate this conclusion with a small example in the case of scalar quantization, for which ω ≈ √q/s_quant: for q = 10² and s_quant = 4 (a compression rate of a factor 16), the maximal learning rate is almost unchanged if n ≥ 16.

Dependency on ε. The complexity K_opt(ε) is decomposed into two terms, scaling respectively as σ²ε⁻² and γ_max⁻¹ε⁻¹, the first term being dominant as ε → 0. This observation highlights two different regimes: a high noise regime, corresponding to γ_max(1 + ω)σ²/(nε) ≥ 1, where the complexity is of order σ²ε⁻²; and a low noise regime, where γ_max(1 + ω)σ²/(nε) ≤ 1 and the complexity is of order γ_max⁻¹ε⁻¹. An extreme example of the low noise case is σ² = 0, which occurs for example in the finite-sum case (i.e., when s̄_i = m⁻¹ Σ_{j=1}^m s̄_ij) with the oracle S_{k+1,i} = s̄_i ∘ T(Ŝ_k).

Impact of compression for ε-stationarity. As mentioned above, compression simultaneously impacts the maximal learning rate (as in (7)) and the complexity K_opt(ε). Consequently, the impact of compression depends on the balance between ω, n, σ² and ε, and we can distinguish four main regimes.
In the following table, for each of the four situations, we summarize the increase in complexity K_opt(ε) resulting from compression.

| Complexity regime (dominating term in K_opt(ε)) | γ_max regime (dominating term in (7)) | Example situation | Increase in K_opt(ε) |
|---|---|---|---|
| High noise: σ²ε⁻² | v_min/(2L_Ẇ) | large ratio n/ω³ | ω |
| High noise: σ²ε⁻² | v_min√n/(2√2 L(1+ω)√ω) | low ratio n/ω³ | ω |
| Low noise: γ_max⁻¹ε⁻¹ (low σ², e.g., large minibatch) | v_min/(2L_Ẇ) | large ratio n/ω³ | 1 |
| Low noise: γ_max⁻¹ε⁻¹ | v_min√n/(2√2 L(1+ω)√ω) | low ratio n/ω³ | ω^{3/2}/√n |

Depending on the situation, the complexity is thus multiplied by a factor ranging from 1 to ω ∨ (ω^{3/2}/√n). Remark that the communication cost of each iteration is typically reduced by compression by a factor of at least ω. Moreover, the benefit of compression is most significant in the low noise regime when the maximal learning rate is v_min/(2L_Ẇ) (e.g., when n is large enough): we then improve the communication cost of each iteration without increasing the optimization complexity, effectively reducing the communication budget "for free". Because of space constraints, the results in the PP regime are postponed to Appendix A.

3 VR-FedEM: Federated EM algorithm with variance reduction

A novel algorithm, called VR-FedEM and described by Algorithm 2, additionally incorporates a variance reduction scheme in FedEM. It is described in the finite-sum setting, when for all i ∈ [n]*, s̄_i := m⁻¹ Σ_{j=1}^m s̄_ij: at each iteration #(t, k + 1), the oracle on s̄_i ∘ T(Ŝ_{t,k}) uses a minibatch B_{t,k+1,i} of examples sampled at random (with or without replacement) in [m]*.

Algorithm 2: VR-FedEM
Data: k_out, k_in, b ∈ N*; for i ∈ [n]*, V_{1,0,i} ∈ R^q; Ŝ_init ∈ R^q; a positive sequence {γ_{t,k+1}, t ∈ [k_out]*, k ∈ [k_in − 1]}; α > 0
Result: The sequence {Ŝ_{t,k}, t ∈ [k_out]*, k ∈ [k_in]}
1 Ŝ_{1,0} = Ŝ_{1,−1} = Ŝ_init, V_{1,0} = n⁻¹ Σ_{i=1}^n V_{1,0,i}
2 for i = 1, ..., n do
3   S_{1,0,i} = m⁻¹ Σ_{j=1}^m s̄_ij ∘ T(Ŝ_init)
4 for t = 1, ..., k_out do
5   for k = 0, ..., k_in − 1 do
6     for i = 1, ..., n (worker #i, locally) do
7       Sample at random a batch B_{t,k+1,i} of size b in [m]*
8       Set S_{t,k+1,i} = S_{t,k,i} + b⁻¹ Σ_{j∈B_{t,k+1,i}} ( s̄_ij ∘ T(Ŝ_{t,k}) − s̄_ij ∘ T(Ŝ_{t,k−1}) )
9       Set Δ_{t,k+1,i} = S_{t,k+1,i} − Ŝ_{t,k} − V_{t,k,i}
10      Set V_{t,k+1,i} = V_{t,k,i} + α Quant(Δ_{t,k+1,i})
11      Send Quant(Δ_{t,k+1,i}) to the central server
12    (the central server)
13    Set H_{t,k+1} = V_{t,k} + n⁻¹ Σ_{i=1}^n Quant(Δ_{t,k+1,i})
14    Set Ŝ_{t,k+1} = Ŝ_{t,k} + γ_{t,k+1} H_{t,k+1}
15    Set V_{t,k+1} = V_{t,k} + α n⁻¹ Σ_{i=1}^n Quant(Δ_{t,k+1,i})
16    Send Ŝ_{t,k+1} and T(Ŝ_{t,k+1}) to the n workers
17  Ŝ_{t+1,0} = Ŝ_{t+1,−1} = Ŝ_{t,k_in}
18  V_{t+1,0} = V_{t,k_in}
19  for i = 1, ..., n do
20    S_{t+1,0,i} = m⁻¹ Σ_{j=1}^m s̄_ij ∘ T(Ŝ_{t+1,0})
21    V_{t+1,0,i} = V_{t,k_in,i}

The algorithm is decomposed into k_out outer loops (indexed by t), each of them comprising k_in inner loops (indexed by k). At iteration #(k + 1) of the inner loop, each worker #i updates a local statistic S_{t,k+1,i} based on a minibatch B_{t,k+1,i} of its own examples {s̄_ij, j ∈ B_{t,k+1,i}} (see Line 8): starting from S_{t,0,i} := m⁻¹ Σ_{j=1}^m s̄_ij ∘ T(Ŝ_{t,−1}), S_{t,k+1,i} is defined in such a way that it approximates m⁻¹ Σ_{j=1}^m s̄_ij ∘ T(Ŝ_{t,k}) (see Corollary 18). Then, worker #i sends to the central server a quantization of Δ_{t,k+1,i} (see Line 11), which can be seen as an approximation of h_i(Ŝ_{t,k}) − h_i(Ŝ_{t,k−1}), upon noting that the variable V_{t,k+1,i} defined by Line 10 approximates h_i(Ŝ_{t,k}) (see Proposition 26). The central server tracks the mean value V_{t,k+1} = n⁻¹ Σ_{i=1}^n V_{t,k+1,i} (see Line 15 and Lemma 21) and, by adding the quantized quantities, defines a field H_{t,k+1} which approximates n⁻¹ Σ_{i=1}^n h_i(Ŝ_{t,k}) (see Proposition 24). Line 14 can be seen as a Stochastic Approximation update, with learning rate γ_{t,k+1} and mean field s ↦ n⁻¹ Σ_{i=1}^n h_i(s) (see (6) for the definition of h_i). The variance reduction is encoded in the definition of S_{t,k+1,i} at Line 8: we have S_{t,k+1,i} = b⁻¹ Σ_{j∈B_{t,k+1,i}} s̄_ij ∘ T(Ŝ_{t,k}) + ξ_{t,k+1,i}. The first term is the natural approximation of s̄_i ∘ T(Ŝ_{t,k}) based on a minibatch B_{t,k+1,i}.
Conditionally to the past, ξ_{t,k+1,i} is correlated with the first term and biased, but its bias is canceled at the beginning of each outer loop (see Line 20 and Appendix E.3.2): ξ_{t,k+1,i} defines a control variate. Such a variance reduction technique was first proposed in the stochastic gradient setting [30, 9, 36] and then extended to the EM setting [10, 12]. At the end of each outer loop, the local approximations S_{t+1,0,i} are re-initialized to the full sum m⁻¹ Σ_{j=1}^m s̄_ij ∘ T(Ŝ_{t,k_in}) (see Line 20), thus canceling the bias of S_{·,i} (see Proposition 17).

When there is a single worker and no compression is used (n = 1, ω = 0), VR-FedEM reduces to SPIDER-EM, which has been shown to be rate optimal for smooth, non-convex finite-sum optimization [10]. Theorem 3 studies the FL setting (n ≥ 1 and ω ≥ 0): it establishes a finite-time control of convergence in expectation for VR-FedEM. Assumptions A5 and A7 are replaced with A8.

A8. For any i ∈ [n]* and j ∈ [m]*, the conditional expectations s̄_ij(θ) are well defined for any θ ∈ Θ, and there exists L_ij such that for any s, s′ ∈ R^q,

‖( s̄_ij ∘ T(s) − s ) − ( s̄_ij ∘ T(s′) − s′ )‖ ≤ L_ij ‖s − s′‖.

Theorem 3. Assume A1 to A3, A4, A6 and A8. Set L² := n⁻¹ m⁻¹ Σ_{i=1}^n Σ_{j=1}^m L_ij². Let {Ŝ_{t,k}, t ∈ [k_out]*, k ∈ [k_in − 1]} be given by Algorithm 2, run with α := 1/(1 + ω), V_{1,0,i} := h_i(Ŝ_{1,0}) for any i ∈ [n]*, b := ⌈ k_in / (1 + ω)² ⌉ and

γ_{t,k} = γ := ( v_min / L_Ẇ ) ∧ ( v_min √n / ( v_max L (1 + ω) √(1 + 10ω) ) ). (10)

Let (τ, K) be the uniform random variable on [k_out]* × [k_in − 1], independent of {Ŝ_{t,k}, t ∈ [k_out]*, k ∈ [k_in]}. Then it holds that

E[‖H_{τ,K+1}‖²] ≤ (2 / (v_min γ k_in k_out)) ( E[W(Ŝ_{1,0})] − min W ), (11)

E[‖h(Ŝ_{τ,K})‖²] ≤ 2 ( 1 + γ² L² (1 + ω)² / n ) E[‖H_{τ,K+1}‖²]. (12)

The proof is postponed to Appendix E. This result is a consequence of the more general Proposition 25. We make the following comments.

1. Eq. (11) provides the convergence of E[‖H_{τ,K+1}‖²], and Eq. (12) ensures that the quantity of interest E[‖h(Ŝ_{τ,K})‖²] is controlled by E[‖H_{τ,K+1}‖²]. We observe that 2(1 + γ²L²(1 + ω)²/n) is uniformly bounded w.r.t. ω since, by (10), γ² = O_{ω→∞}(ω⁻³).

2. Up to our knowledge, this is the first result on Federated EM that leverages advanced variance reduction techniques while being robust to distribution heterogeneity (the theorem is valid without any assumption on heterogeneity) and reducing the communication cost.

3. Without compression (ω = 0) and in the single-worker case (n = 1), Fort et al. [10] use k_in = b: we recover this result as a particular case. When n > 1 and ω > 0, the recommended batch size b decreases as 1/(1 + ω)².

Convergence rate and optimization complexity. Our step size γ is constant and independent of k_in, k_out. Indeed, contrary to Theorem 1, there is no bias-variance trade-off (as typically observed with variance-reduced methods), and the optimal choice of γ is the largest one ensuring convergence. Consequently, since the number of optimization steps is k_out k_in, we have K_opt(ε) = O(1/(γε)).

Impact of compression on the learning rate and ε-stationarity. The compression constant ω does not directly appear in (11), but impacts the value of γ. Two different regimes appear:

1. If ω³ ≲ n (i.e. we focus on the large ω, n asymptotics when ω³ ≪ n), then γ ≈ v_min/L_Ẇ has nearly the same value as without compression [10]. The complexity is then similar to that of SPIDER-EM [10], with a smaller communication cost. The gain from compression is maximal in this regime.

2. If ω³ ≳ n (i.e. we focus on the large ω, n asymptotics when ω³ ≫ n), then γ = O( √n v_min / (v_max L ω^{3/2}) ) is strictly smaller than without compression. The optimization complexity is then higher than that of SPIDER-EM¹ (by a factor proportional to ω^{3/2}/√n), with a smaller communication cost (typically at least ω times fewer bits exchanged per iteration). The overall trade-off thus depends on the comparison between ω and n.
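The effect of the Line-8 recursion can be isolated on a hypothetical scalar example: along a slowly moving parameter path (as along EM iterates with a small step size), the control-variate estimate tracks the full-data statistic far more accurately than a plain minibatch estimate of the same size (data, path and batch sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
m, b, K = 1000, 10, 50
a = rng.normal(size=m)                  # hypothetical data; per-example statistic s_j(th) = a_j * th
full = lambda th: a.mean() * th         # exact mean statistic s_bar(th)

thetas = 1.0 + 0.01 * np.arange(K + 1)  # slowly moving parameter path
S = full(thetas[0])                     # outer-loop refresh: start from the full sum
spider_err, naive_err = [], []
for k in range(K):
    B = rng.integers(0, m, size=b)      # minibatch sampled with replacement
    # Line-8 style control-variate update: add minibatch average of the increments
    S = S + np.mean(a[B]) * (thetas[k + 1] - thetas[k])
    naive = np.mean(a[B]) * thetas[k + 1]     # plain minibatch estimate, same batch
    spider_err.append((S - full(thetas[k + 1])) ** 2)
    naive_err.append((naive - full(thetas[k + 1])) ** 2)
```

Both estimates are unbiased, but the recursive one only accumulates noise proportional to the parameter increments, so its mean squared error stays orders of magnitude below that of the plain minibatch estimate when the iterates move slowly.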
We summarize these two regimes in the following table, focusing on the large-n, large-ω asymptotics. For each regime, we indicate the increase in complexity K_opt(ε) = O(1/(γε)) resulting from compression. We provide a discussion of the computed-conditional-expectations complexity K_CE in Appendix E.2.

| γ regime (dominating term in (10)) | Example situation | Increase in K_opt(ε) |
|---|---|---|
| v_min/L_Ẇ | large ratio n/ω³ | 1 |
| v_min√n/(v_max L ω^{3/2}) | low ratio n/ω³ | ω^{3/2}/√n |

¹ As a corollary of [10, Theorem 2], the optimization complexity of SPIDER-EM is k_out + k_in k_out, that is, of order ε⁻¹ to reach ε-stationarity.

4 Numerical illustrations

In this section, we illustrate the performance of FedEM and VR-FedEM applied to inference in Gaussian Mixture Models (GMM), on a synthetic data set and on the MNIST data set. We also present an application to federated missing data imputation with the analysis of the eBird data set [34, 1].

Synthetic data. The synthetic data are drawn from the following GMM model: for all ℓ ∈ [N]* and g ∈ {0, 1}, P(Z_ℓ = g) = π_g; and conditionally to Z_ℓ = g, Y_ℓ ∼ N₂(μ_g, Σ). The 2 × 2 covariance matrix Σ is known, and the parameters to be fitted are the weights (π_0, π_1) and the expectations (μ_0, μ_1). The total number of examples is N = 10⁴, the number of agents is n = 10², and the probability of participation of the workers is p = 0.75. FedEM and VR-FedEM are run with γ = 10⁻², ω = 1 and α = 10⁻². For FedEM, we consider the finite-sum setting where s̄_i = m⁻¹ Σ_{j=1}^m s̄_ij with m = 10²; the oracle S_{k+1,i} is obtained by averaging over a minibatch of b = 20 examples. For VR-FedEM, we set b = 5 and k_in = 20. We run the two algorithms for 500 epochs (one epoch corresponds to N conditional expectation evaluations s̄_ij). Figure 1 shows a trajectory of ‖H_k‖² for FedEM (and ‖H_{t,k}‖² for VR-FedEM), along with the value of the mean field ‖h(Ŝ_k)‖² for FedEM (and ‖h(Ŝ_{t,k})‖² for VR-FedEM).
The results illustrate the variance reduction, and give insight on the variability of the trajectories produced by the two algorithms.

MNIST data set. We perform a similar experiment on the MNIST data set, to illustrate the behaviour of FedEM and VR-FedEM on a GMM inference problem with real data. The data set consists of N = 7×10^4 images of handwritten digits, each with 784 pixels. We pre-process the data set by removing 67 uninformative pixels (those which are zero across all images), obtaining d = 717 pixels per image. Second, we apply principal component analysis to reduce the data dimension; we keep the d_PC = 20 principal components of each observation. These N preprocessed observations are distributed at random across n = 10^2 servers, each containing m = 700 observations. We estimate an R^{d_PC}-multivariate GMM with G = 10 components. Details on the multivariate Gaussian mixture model are given in the supplementary material (see Appendix F). Here again, s_i is a sum over the m examples available at server #i; the minibatches are independent and sampled at random in [m]* with replacement; we choose b = 20 and the step size is constant, set to γ = 10^-3. The same initial value Ŝ_init is used for all experiments: we set Ŝ_init := s(π^0, μ^0, Σ̂^0), where π^0_g = 1/G for all g ∈ [G]*, the expectations μ^0_g are sampled uniformly at random among the available examples, and Σ̂^0 is the empirical covariance matrix of the N examples. Figure 3 shows the sequence of parameter estimates for the weights and the squared norm of the mean field ||H_k||^2 for FedEM (resp. ||H_{t,k}||^2 for VR-FedEM) vs the number of epochs.

Federated missing values imputation for citizen science. We develop FedMissEM, a special instance of FedEM designed for missing values imputation in the federated setting; we apply it to the analysis of part of the eBird data base [34, 1], a citizen science smartphone application for biodiversity monitoring.
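The MNIST preprocessing pipeline described above can be sketched as follows. A random stand-in matrix replaces the actual MNIST data (loading it is environment-specific), and N is scaled down for brevity; the three steps (drop always-zero pixels, PCA via SVD, random sharding across workers) follow the text:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for the N x 784 MNIST matrix (scaled down; real data loading omitted)
N, d_raw = 1000, 784
X = rng.random((N, d_raw))
X[:, rng.choice(d_raw, size=67, replace=False)] = 0.0  # 67 always-zero pixels

# 1) Remove uninformative (always-zero) pixels: 784 - 67 = 717 remain
keep = ~np.all(X == 0, axis=0)
X = X[:, keep]

# 2) PCA down to d_PC = 20 components, via SVD of the centered data matrix
d_pc = 20
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_pc = Xc @ Vt[:d_pc].T  # N x 20 principal-component scores

# 3) Distribute at random across n workers, m = N / n observations each
n = 100
perm = rng.permutation(N)
shards = np.split(X_pc[perm], n)  # list of n arrays of shape (m, 20)
print(X.shape[1], X_pc.shape, len(shards), shards[0].shape)
```

With the paper's actual sizes (N = 7×10^4, n = 10^2), each shard would contain m = 700 observations, as stated above.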
In eBird, citizens record wildlife observations, specifying the ecological site they visited, the date, the species, and the number of observed specimens. Two major challenges arise: (i) ecological sites are visited irregularly, which leads to missing values, and (ii) non-professional observers have heterogeneous wildlife counting schemes. Model and the FedMissEM algorithm. I observers participate in the programme; there are J ecological sites and L time stamps. Each observer #i provides a J×L matrix X^i and a subset of indices Ω^i ⊆ [J]* × [L]*. For j ∈ [J]* and ℓ ∈ [L]*, the variable X^i_{jℓ} encodes the observation that would be collected by observer #i if site #j were visited at time stamp #ℓ; since there are unvisited sites, we denote by Y^i := {X^i_{jℓ}, (j, ℓ) ∈ Ω^i} the set of observed values and by Z^i := {X^i_{jℓ}, (j, ℓ) ∉ Ω^i} the set of unobserved values. The statistical model is parameterized by a matrix θ ∈ R^{J×L}, where θ_{jℓ} is a scalar parameter characterizing the distribution of species individuals at site j and time stamp ℓ. For instance, θ_{jℓ} is the log-intensity of a Poisson distribution when the observations are count data, or the log-odds of a binomial model when the observations are presence-absence data. This model could be extended to the case where observers #i and #i′ count different numbers of specimens on average at the same location and time stamp, because they do not have access to the same material or do not have the same level of expertise: heterogeneity between observers could be modeled by using different parameters for each individual #i, say θ^i ∈ R^{J×L}. Here, we consider the case where θ^i_{jℓ} = θ_{jℓ} for all (j, ℓ) ∈ [J]* × [L]* and i ∈ [I]*.
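The two parameterizations mentioned above (θ_{jℓ} as a Poisson log-intensity for count data, and as a binomial log-odds for presence-absence data) can be illustrated with a minimal sketch; the specific value of θ is arbitrary:

```python
import numpy as np

theta = 1.2  # one scalar entry theta_{j,l} of the parameter matrix (arbitrary value)

# Count data: theta is the log-intensity of a Poisson model
lam = np.exp(theta)  # Poisson mean, lambda = e^theta

# Presence-absence data: theta is the log-odds of a Bernoulli/binomial model
p = 1.0 / (1.0 + np.exp(-theta))  # sigmoid: p = e^theta / (1 + e^theta)

rng = np.random.default_rng(3)
counts = rng.poisson(lam, size=100_000)     # simulated count observations
presence = rng.random(100_000) < p          # simulated presence-absence observations
print(counts.mean(), presence.mean())
```

In both cases the natural parameter θ_{jℓ} enters the exponential-family density linearly, which is what makes the complete-data sufficient statistics linear in the (possibly missing) observations.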
We further assume that the entries {X^i_{jℓ}, i ∈ [I]*, j ∈ [J]*, ℓ ∈ [L]*} are independent, with a distribution from an exponential family with respect to some reference measure on R, of the form x ↦ ρ(x) exp{x θ_{jℓ} − ψ(θ_{jℓ})}. Algorithm 7 in Appendix F.2 provides details on the model, and the pseudo-code for FedMissEM.

Figure 1: Trajectory of FedEM vs the number of epochs (left; blue line: ||h(Ŝ_k)||^2; red line: ||H_k||^2) and of VR-FedEM (right; dashed blue line: ||h(Ŝ_k)||^2; solid red line: ||H_{t,k}||^2).

Figure 2: Estimated temporal trends for the Common Buzzard (left) and the Mallard (right). Blue crosses: estimated monthly counts; red triangles: number of missing values. Dotted lines: LOESS regressions for the estimated counts (blue) and the number of missing values (red).

Figure 3: [Left to right] For FedEM: evolution of the estimates of the weights π_g for g ∈ [G]* vs the number of epochs (first plot), and evolution of the squared norm of the mean field ||H_k||^2 vs the number of epochs (second plot). Then, the same quantities for VR-FedEM (third and fourth plots).

Application to eBird data analysis. We apply FedMissEM to the analysis of part of the eBird data base [34, 1] of field observations reported in France by I = 2,465 observers, across J = 9,721 sites and at L = 525 monthly time points.
We analyze successively two data sets corresponding to observations of two relatively common species: the Common Buzzard and the Mallard. These subsamples correspond respectively to N = 5,980 and N = 12,185 field observations. The I field observers are randomly assigned to n = 10 groups (the observations of the field observers from group c ∈ [n]* are allocated to server #c). For c ∈ [n]*, server c contains N_c observations; in our two examples, N_c ranges between 400 and 1,500. We run FedMissEM for 150 epochs, with γ = 10^-4, α = 10^-3, b = 10^2, a rank r = 2 and λ = 0; for the distribution of the variables X^i_{jℓ}, we use a Gaussian distribution with unknown expectation θ_{jℓ} and variance 1. We recover aggregated temporal trends at the national French level for these two bird species by summing the estimated counts across ecological sites, for each time stamp; the trends are displayed in Figure 2, along with a locally estimated scatterplot smoothing (LOESS).

5 Conclusions

We introduced FedEM which is, to the best of our knowledge, the first algorithm implementing EM in a FL setting; it handles compression of exchanged information, data heterogeneity and partial participation. We further extended it to incorporate a variance reduction scheme, yielding VR-FedEM. We derived complexity bounds which highlight the efficiency of the two algorithms, and illustrated our claims with numerical simulations, as well as an application to biodiversity monitoring data. In a simultaneously published work, Marfoq et al. [25] consider a different federated EM algorithm, addressing the personalization challenge by considering a mixture model. Under the assumption that each local data distribution is a mixture of unknown underlying distributions, their algorithm computes a model corresponding to each distribution.
On the other hand, we focus on the curved exponential family, with variance reduction, partial participation and compression, and on limiting the impact of heterogeneity, but do not address personalization.

Acknowledgments. The work of A. Dieuleveut and E. Moulines is partially supported by ANR-19-CHIA-0002-01 / chaire SCAI, and Hi!Paris. The work of G. Fort is partially supported by the Fondation Simone et Cino del Duca under the project OpSiMorE.

Broader Impact of this work. This work is mostly theoretical, and we believe it does not currently present any direct societal consequence. However, the methods described in this paper can be used to train machine learning models which could themselves have societal consequences. For instance, the deployment of machine learning models can suffer from gender and racial bias, or amplify existing inequalities.

References

[1] eBird. 2017. eBird: an online database of bird distribution and abundance [web application]. eBird, Cornell Lab of Ornithology, Ithaca, New York. Available: http://www.ebird.org (accessed: 21 March 2020).
[2] D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli. The convergence of sparsified gradient methods. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31, pages 5973–5983. Curran Associates, Inc., 2018.
[3] A. Benveniste, M. Métivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximations. Springer Verlag, 1990.
[4] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press, Cambridge; Hindustan Book Agency, New Delhi, 2008.
[5] J. Chen, J. Zhu, Y. Teh, and T. Zhang. Stochastic expectation maximization with variance reduction. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7967–7977, 2018.
[6] A. Defazio, F.
Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1646–1654. Curran Associates, Inc., 2014.
[7] B. Delyon, M. Lavielle, and E. Moulines. Convergence of a stochastic approximation version of the EM algorithm. Ann. Statist., 27(1):94–128, 1999.
[8] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B Met., 39(1):1–38, 1977.
[9] C. Fang, C. Li, Z. Lin, and T. Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 689–699. Curran Associates, Inc., 2018.
[10] G. Fort, E. Moulines, and H.-T. Wai. A Stochastic Path Integral Differential EstimatoR Expectation Maximization algorithm. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 16972–16982. Curran Associates, Inc., 2020.
[11] G. Fort, P. Gach, and E. Moulines. Fast incremental expectation maximization for finite-sum optimization: non-asymptotic convergence. Statistics and Computing, 2021. Accepted for publication.
[12] G. Fort, E. Moulines, and H.-T. Wai. Geom-SPIDER-EM: Faster variance reduced stochastic expectation maximization for nonconvex finite-sum optimization. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2021.
[13] S. Frühwirth-Schnatter, G. Celeux, and C. P. Robert, editors. Handbook of Mixture Analysis. Chapman & Hall/CRC Handbooks of Modern Statistical Methods. CRC Press, Boca Raton, FL, 2019.
[14] S. Ghadimi and G. Lan.
Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim., 23(4):2341–2368, 2013.
[15] E. Gorbunov, F. Hanzely, and P. Richtárik. A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent. In International Conference on Artificial Intelligence and Statistics, pages 680–690. PMLR, 2020.
[16] S. Horváth and P. Richtárik. A better alternative to error feedback for communication-efficient distributed learning. In International Conference on Learning Representations, 2021.
[17] S. Horváth, D. Kovalev, K. Mishchenko, S. Stich, and P. Richtárik. Stochastic distributed learning with gradient quantization and variance reduction. arXiv preprint arXiv:1904.05115, 2019.
[18] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 315–323. Curran Associates, Inc., 2013.
[19] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, R. G. L. D'Oliveira, H. Eichner, S. El Rouayheb, D. Evans, J. Gardner, Z. Garrett, A. Gascón, B. Ghazi, P. B. Gibbons, M. Gruteser, Z. Harchaoui, C. He, L. He, Z. Huo, B. Hutchinson, J. Hsu, M. Jaggi, T. Javidi, G. Joshi, M. Khodak, J. Konečný, A. Korolova, F. Koushanfar, S. Koyejo, T. Lepoint, Y. Liu, P. Mittal, M. Mohri, R. Nock, A. Özgür, R. Pagh, H. Qi, D. Ramage, R. Raskar, M. Raykova, D. Song, W. Song, S. U. Stich, Z. Sun, A. Theertha Suresh, F. Tramèr, P. Vepakomma, J. Wang, L. Xiong, Z. Xu, Q. Yang, F. X. Yu, H. Yu, and S. Zhao. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, Now Publishers, 2021.
[20] B. Karimi, H.-T. Wai, E. Moulines, and M. Lavielle. On the global convergence of (fast) incremental expectation maximization methods. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E.
Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 2837–2847. Curran Associates, Inc., 2019.
[21] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. In H. Daumé III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5132–5143. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/karimireddy20a.html.
[22] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
[23] F. Kunstner, R. Kumar, and M. Schmidt. Homeomorphic-invariance of EM: Non-asymptotic convergence in KL divergence for exponential families via mirror descent. In International Conference on Artificial Intelligence and Statistics, pages 3295–3303. PMLR, 2021.
[24] K. Lange. MM Optimization Algorithms. SIAM - Society for Industrial and Applied Mathematics, 2016.
[25] O. Marfoq, G. Neglia, A. Bellet, L. Kameni, and R. Vidal. Federated multi-task learning under a mixture of distributions. In Advances in Neural Information Processing Systems (NeurIPS 2021), 2021.
[26] G. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley Series in Probability and Statistics. Wiley, 2008.
[27] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.
[28] K. Mishchenko, E. Gorbunov, M. Takáč, and P. Richtárik. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.
[29] K. Murphy and S. J. Russell. Dynamic Bayesian networks: Representation, inference and learning. PhD thesis, University of California, Berkeley, 2002.
[30] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč.
SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, ICML'17, pages 2613–2621. JMLR.org, 2017.
[31] C. Philippenko and A. Dieuleveut. Bidirectional compression in heterogeneous settings for distributed or federated learning with partial participation: tight convergence guarantees. Technical report, arXiv:2006.14591v3, 2020.
[32] C. Philippenko and A. Dieuleveut. Preserved central model for faster bidirectional compression in distributed settings. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
[33] F. Sattler, S. Wiedemann, K.-R. Müller, and W. Samek. Robust and communication-efficient federated learning from non-i.i.d. data. IEEE Transactions on Neural Networks and Learning Systems, pages 1–14, 2019.
[34] B. L. Sullivan, C. L. Wood, M. J. Iliff, R. E. Bonney, D. Fink, and S. Kelling. eBird: A citizen-based bird observation network in the biological sciences. Biological Conservation, 142(10):2282–2292, 2009.
[35] H. Tang, C. Yu, X. Lian, T. Zhang, and J. Liu. DoubleSqueeze: Parallel stochastic gradient descent with double-pass error-compensated compression. In International Conference on Machine Learning, pages 6155–6165. PMLR, May 2019.
[36] Z. Wang, K. Ji, Y. Zhou, Y. Liang, and V. Tarokh. SpiderBoost and Momentum: Faster stochastic variance reduction algorithms. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 2406–2416, 2019.
[37] B. Woodworth, K. K. Patel, S. Stich, Z. Dai, B. Bullins, B. McMahan, O. Shamir, and N. Srebro. Is local SGD better than minibatch SGD? In H. Daumé III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 10334–10343. PMLR, 13–18 Jul 2020.
URL https://proceedings.mlr.press/v119/woodworth20a.html.
[38] L. Xu and M. I. Jordan. On convergence properties of the EM algorithm for Gaussian mixtures. Neural Computation, 8(1):129–151, 1996.
[39] Q. Yang, Y. Liu, T. Chen, and Y. Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1–19, 2019.