Published in Transactions on Machine Learning Research (10/2023)

Straggler-Resilient Personalized Federated Learning

Isidoros Tziotis (isidoros_13@utexas.edu), The University of Texas at Austin
Zebang Shen (zebang.shen@inf.ethz.ch), ETH Zurich
Ramtin Pedarsani (ramtin@ece.ucsb.edu), The University of California, Santa Barbara
Hamed Hassani (hassani@seas.upenn.edu), The University of Pennsylvania
Aryan Mokhtari (mokhtari@austin.utexas.edu), The University of Texas at Austin

Reviewed on OpenReview: https://openreview.net/forum?id=gxEpUFxIgz

Abstract

Federated Learning is an emerging learning paradigm that allows training models from samples distributed across a large network of clients while respecting privacy and communication restrictions. Despite its success, federated learning faces several challenges related to its decentralized nature. In this work, we develop a novel algorithmic procedure with theoretical speedup guarantees that simultaneously handles two of these hurdles, namely (i) data heterogeneity, i.e., data distributions can vary substantially across clients, and (ii) system heterogeneity, i.e., the computational power of the clients could differ significantly. By leveraging previous works in the realm of representation learning (Collins et al., 2021; Liang et al., 2020), our method constructs a global common representation utilizing the data from all clients. Additionally, it learns a user-specific set of parameters, resulting in a personalized solution for each individual client. Furthermore, it mitigates the effects of stragglers by adaptively selecting clients based on their computational characteristics, thus achieving for the first time near-optimal sample complexity and provable logarithmic speedup.
Experimental results support our theoretical findings, showing the superiority of our method over alternative personalized federated schemes in system- and data-heterogeneous environments.

1 Introduction

Due to growing concerns over data privacy and communication cost, Federated Learning (FL) has become an emerging learning paradigm, as it allows for training machine learning models without collecting local data from the clients. Due to its decentralized nature, a major challenge in designing efficient solvers for FL is the heterogeneity of local devices, which can be categorized into two different types: (i) data heterogeneity, where the underlying data distributions of clients vary substantially, and (ii) system heterogeneity, where the computational and storage capabilities of devices differ significantly. In fact, it has been observed that the seminal Federated Averaging (FedAvg) method suffers from slow convergence to a high-quality solution when facing highly heterogeneous datasets (McMahan et al., 2017) as well as heterogeneous systems (Li et al., 2020; Kairouz et al., 2021). In this paper, we aim to address these two challenges simultaneously by introducing a generic framework that includes algorithms with robust performance in the presence of both forms of client heterogeneity. Inspired by prior works in the FL literature (Collins et al., 2021; Liang et al., 2020) that utilized representation learning theory to tackle data heterogeneity, we propose a meta-algorithm that produces personalized solutions and handles data heterogeneity by leveraging a global representation shared among all clients. Further, our method circumvents the delays introduced by the presence of stragglers by carefully selecting participating nodes based on their computational speeds.
In early stages, only a few of the fastest nodes are chosen to participate, and slower devices are sequentially included in the training process until the target accuracy is achieved. Although the disproportionate selection of nodes raises fairness and accuracy concerns, we highlight that our method achieves speedup without compromising the resulting solution. The most significant contribution of our work is achieving near-optimal sample complexity in regimes with data and system heterogeneity, alongside a provable logarithmic speedup guarantee in terms of running time. Next we summarize our contributions:

1. SRPFL Algorithm. We propose Straggler-Resilient Personalized Federated Learning (SRPFL), an adaptive node participation meta-algorithm that builds upon subroutines that fall into the representation learning framework (Collins et al., 2021; Liang et al., 2020), enhancing their resilience to stragglers and their performance.

2. Logarithmic Speedup. Assuming that clients' speeds are drawn from the exponential distribution, we prove that SRPFL guarantees logarithmic speedup in the linear representation setting, outperforming established, straggler-prone benchmarks while maintaining the state-of-the-art sample complexity per client $m = O(d/N + \log N)$, where $d$ and $N$ denote the feature vector size and the number of active nodes. Our results hold for non-convex loss functions, heterogeneous data, and dynamically changing client speeds.

3. Numerical Results. Experiments on various datasets (CIFAR10, CIFAR100, EMNIST, FEMNIST, Sent140) support our theoretical results, showing that: (i) SRPFL significantly boosts the performance of different subroutines designed for personalized FL in both full and partial participation settings, and (ii) SRPFL exhibits superior performance in system- and data-heterogeneous settings compared to state-of-the-art baselines.

1.1 Related Work

Data heterogeneity.
In data-heterogeneous settings, if one aims at minimizing the aggregate loss in the network using the classic FedAvg method or more advanced algorithms that utilize control-variate techniques, such as SCAFFOLD (Karimireddy et al., 2019), FEDGATE (Haddadpour et al., 2021), FedDyn (Acar et al., 2021), or FEDSHUFFLE (Horváth et al., 2022), the resulting solution could perform poorly for some of the clients. This is an unavoidable hurdle due to the fact that no single model works well for all clients when their underlying data distributions are diverse. A common technique that addresses this issue is fine-tuning the derived global model to each local task by following a few steps of SGD updates (Wang et al., 2019; Yu et al., 2020). Based on this observation, Fallah et al. (2020b) showed that one might need to train models that work well after fine-tuning, and showed its connections to Model-Agnostic Meta-Learning (MAML). In (Cho et al., 2022; Balakrishnan et al., 2021), novel client-sampling schemes were explored, achieving increased efficiency in regimes with data heterogeneity. Another line of work for personalized FL is learning additive mixtures of local and global models (Deng et al., 2020; Mansour et al., 2020; Hanzely and Richtárik, 2020). These methods learn local models for clients that are close to each other in some norm, an idea closely related to multi-task FL (Smith et al., 2017; Hu et al., 2021). The works in (Chen et al., 2022; Lee et al., 2022) studied the interplay of local and global models utilizing Bayesian hierarchical models and partial participation, respectively. An alternative approach was presented by Collins et al. (2021), where instead of enforcing local models to be close, the authors assumed that models across clients share a common representation. Using this perspective, they presented FedRep, a method that provably learns this underlying structure in the linear representation setting.
Building upon the idea of a common representation, Zhu et al. (2021) and Jiang and Lin (2022) proposed federated methods that can handle data heterogeneity while exhibiting robustness to distribution shifts. Recently, a novel framework was proposed allowing the comparison of personalized FL methods under various metrics (Wu et al., 2022). In all of the aforementioned methods, however, a subset of clients participates regardless of their computational capabilities. Thus, in the presence of stragglers, the speed of the training process significantly goes down, as the server waits, at every communication round, for the slowest participating node to complete its local updates.

System heterogeneity. Several works have attempted to address the issue of system heterogeneity. Specifically, asynchronous methods, which rely on bounded staleness of slow clients, have demonstrated significant gains in distributed data centers (Xie et al., 2019; Stich, 2019; So et al., 2021). In FL frameworks, however, stragglers could be arbitrarily slow, rendering these methods inefficient. In an attempt to manually control staleness, deadline-based computation has been proposed (Reisizadeh et al., 2019), as well as aggregation of a fixed number of models per round (Nguyen et al., 2022). However, in the worst-case scenario the performance of these methods is still determined by the slowest client in the network. Active sampling is another approach, where the server aggregates as many local updates as possible within a predefined time span (Nishio and Yonetani, 2019). In a different line of work, the effects of stragglers are mitigated by utilizing computation/gradient coding schemes (Tandon et al., 2017; Wang et al., 2018; 2020b). In (Cho et al., 2021), clients use heterogeneous model architectures to transfer knowledge to nodes with similar data distributions, while Yang et al.
(2022) proposed a new framework where clients are free to choose their participation scheme. More recently, normalized averaging methods were proposed in (Wang et al., 2020a; Horváth et al., 2022) that rectify the objective inconsistency created by the mismatch in clients' updates. FedLin (Mitra et al., 2021) instead utilizes gradient correction and error-feedback mechanisms to circumvent the speed-accuracy conflict. A novel approach to mitigate the effects of stragglers was proposed by Reisizadeh et al. (2022), where clients are selected to take part in different stages of the training according to their computational characteristics. Alas, all of the above methods yield improvement only in data-homogeneous settings and are not applicable in regimes with data heterogeneity.

2 Problem Formulation

In this section, we introduce our setting and define the data and system heterogeneity models that we study. Consider the FL framework where $M$ clients interact with a central server. We focus on a supervised, data-heterogeneous setting where client $i$ draws samples from distribution $\mathcal{D}_i$, potentially with $\mathcal{D}_i \neq \mathcal{D}_j$. Further, consider the learning model of the $i$-th client as $q_i : \mathbb{R}^d \to \mathcal{Y}$, which maps inputs $x_i \in \mathbb{R}^d$ to predicted labels $q_i(x_i) \in \mathcal{Y}$. The objective function of client $i$ is defined as $f_i(q_i) := \mathbb{E}_{(x_i, y_i) \sim \mathcal{D}_i}[\ell(q_i(x_i), y_i)]$, where the loss $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ penalizes the gap between the predicted label $q_i(x_i)$ and the true label $y_i$. In the most general setting, clients aim to solve

$$\min_{(q_1, \dots, q_M) \in \mathcal{Q}} \frac{1}{M} \sum_{i=1}^{M} f_i(q_i), \qquad (1)$$

with $\mathcal{Q}$ the space of feasible tuples of mappings $(q_1, \dots, q_M)$. Traditionally in FL, methods focus on learning a single model $q = q_1 = \dots = q_M$ that performs well on average across clients (Li et al., 2020; McMahan et al., 2017). Although such a solution may be satisfactory in data-homogeneous settings, it leads to undesirable local models when the data distributions are diverse.
Indeed, in the presence of data heterogeneity the loss functions $f_i$ have different forms and their minimizers could be far from each other. This justifies the formulation in (1) and necessitates the search for personalized solutions that can be learned in a federated fashion.

Low-Dimensional Common Representation. There have been numerous examples in image classification and word prediction where tasks with heterogeneous data share a common, low-dimensional representation, despite having different labels (Bengio et al., 2013; LeCun et al., 2015; Pillutla et al., 2022). Based on that, a reasonable choice for $\mathcal{Q}$ is a set in which all $q_i$ share a common map, coupled with a personalized map that fits their local data. To formalize this, suppose the ground-truth map can be written for each client $i$ as $q_i = h_i \circ \phi$, where $\phi : \mathbb{R}^d \to \mathbb{R}^k$ is a shared global representation which maps $d$-dimensional data points to a lower-dimensional space of size $k$, and $h_i : \mathbb{R}^k \to \mathcal{Y}$ maps from the lower-dimensional subspace to the space of labels. Typically $k \ll d$, and thus given any fixed representation $\phi$, the client-specific heads $h_i : \mathbb{R}^k \to \mathcal{Y}$ are easy to optimize locally. With this common structure in consideration, (1) can be reformulated as $\min_{\phi \in \Phi} \frac{1}{M} \sum_{i=1}^{M} \min_{h_i \in \mathcal{H}} f_i(h_i \circ \phi)$, where $\Phi$ is the class of feasible representations and $\mathcal{H}$ is the class of feasible heads. This formulation leads to good local solutions if the underlying data-generation models of the clients share a low-dimensional common representation, i.e., $y_i = h_i^* \circ \phi^*(x_i) + z_i$, where $z_i$ is some additive noise.

[Figure 1: Classic FL schemes for solving (2)]  [Figure 2: SRPFL for solving (2)]

The server and clients collaborate to learn the common representation $\phi$, while locally each client learns their unique head $h_i$.
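As a toy illustration of this structure (a minimal numpy sketch; the dimensions, names, and the choice of a linear shared map are all illustrative assumptions, not the paper's model), each client composes the common representation with its own head:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, M = 20, 3, 5                    # ambient dim, representation dim, clients

B = rng.standard_normal((d, k))       # shared representation: phi(x) = B^T x
heads = rng.standard_normal((M, k))   # client-specific heads: h_i(z) = w_i^T z

def predict(i, x):
    """Personalized prediction of client i: q_i(x) = h_i(phi(x))."""
    return heads[i] @ (B.T @ x)

x = rng.standard_normal(d)
preds = [predict(i, x) for i in range(M)]  # one shared phi, M personalized outputs
```

All clients reuse the same `B`, so only the small $k$-dimensional heads differ across clients, which is what makes local optimization cheap when $k \ll d$.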
Since clients often do not have access to their true data distributions, instead of minimizing their expected loss, they settle for minimizing the empirical loss associated with their local samples. Specifically, we assume client $i$ has access to $S_i$ samples $\{x_i^1, x_i^2, \dots, x_i^{S_i}\}$, and its local empirical loss is $\hat{f}_i(h_i \circ \phi) = \frac{1}{S_i} \sum_{s=1}^{S_i} \ell(h_i \circ \phi(x_i^s), y_i^s)$. Hence, the global problem becomes

$$\min_{\phi \in \Phi} \frac{1}{M} \sum_{i=1}^{M} \min_{h_i \in \mathcal{H}} \left( \hat{f}_i(h_i \circ \phi) := \frac{1}{S_i} \sum_{s=1}^{S_i} \ell(h_i \circ \phi(x_i^s), y_i^s) \right). \qquad (2)$$

System Heterogeneity Model. In most FL settings, thousands of clients participate in the training process, each with different computational capabilities. Thus, fixed computational tasks such as gradient computations or local model updates require different processing times for different clients. Formally, for each client $i \in [M]$, we denote by $T_i$ the time required to compute a local model update. When a subset of clients participates in the learning process, the computational time of each round is determined by the slowest participating node. Naturally, as the number of nodes in the network grows, we expect the number of stragglers to increase. This phenomenon calls for the development of straggler-resilient methods in system-heterogeneous settings.

3 Straggler-Resilient Personalized FL

In the shared representation setting, we face the challenge of finding an algorithm that coordinates server and clients in order to learn a common representation and a set of personalized parameters in a federated and straggler-resilient fashion. To this end, we propose a method that tackles problem (2) with limited sample access and provably superior performance over naive, straggler-prone methods. Specifically, we propose the Straggler-Resilient Personalized Federated Learning (SRPFL) meta-algorithm, designed to mitigate the effects of system heterogeneity in environments with non-convex loss functions and heterogeneous data, while accommodating a variety of methods as subroutines.
In a nutshell, SRPFL iteratively solves problem (2), while adaptively increasing the set of participating nodes based on their computational capabilities. As a result, the process of learning the common representation across clients' tasks is accelerated without compromising the resulting accuracy. To simplify the exposition, henceforth we denote by $\mathcal{A}$ some federated algorithm of choice, designed to solve (2). As depicted in Figure 1, in standard FL frameworks, out of all $M$ clients in the network, the server often selects uniformly at random a subset of size $N$. Subsequently, a few iterations of algorithm $\mathcal{A}$ are performed to approximately solve a version of (2) which corresponds to those $N$ selected clients, i.e., $\min_{\phi \in \Phi} \frac{1}{N} \sum_{i=1}^{N} \min_{h_i \in \mathcal{H}} \hat{f}_i(h_i \circ \phi)$. In every following stage, a new subset of $N$ nodes is sampled and the same process is repeated. Although such a procedure eventually learns the global representation across all tasks, it is susceptible to delays caused by stragglers, as the server has to wait for the slowest client (among the $N$ selected ones) to complete its updates. Hence, when $N$ is large the training procedure could become prohibitively slow.

SRPFL takes a different approach in order to mitigate the effects of straggling clients. An overview of the selection scheme is provided in Figure 2. At each stage, $N$ clients are randomly selected, but only a small, fast subset of them is used to solve their corresponding learning problem. More precisely, suppose that at stage $r$, each client $i$ in the sampled subset of size $N$ is associated with a computational time $T_i^r$. For simplicity, we assume that the nodes are re-indexed at every stage so that their computational times are in increasing order, i.e., $T_1^r \leq T_2^r \leq \dots \leq T_N^r$. Initially, only the $n_0$ fastest clients, $\{1, 2, \dots, n_0\}$, are included in the learning procedure, with $n_0$ much smaller than $N$.
At every communication round, the set of $n_0$ nodes performs the model updates indicated by algorithm $\mathcal{A}$ to solve their version of (2), i.e., $\min_{\phi \in \Phi} \frac{1}{n_0} \sum_{i=1}^{n_0} \min_{h_i \in \mathcal{H}} \hat{f}_i(h_i \circ \phi)$. We note that during this stage of the algorithm, the server waits only for the slowest client among the participating ones, i.e., client $n_0$, which is potentially much faster than node $N$.

Remark 3.1. In practice, knowledge of the clients' computational power is not required to figure out the $n_0$ fastest nodes. Instead, the server sends the global representation model to all $N$ sampled clients and updates the common representation once the first $n_0$ new models are received. Indeed, these representations belong to the fastest $n_0$ nodes.

Once the current stage terminates, a new batch of $N$ clients is sampled and the $2n_0$ fastest nodes are chosen to participate in the learning process. Since speeds vary across stages, consecutive sets of participating nodes could have small or no overlap. However, the representations learned in previous stages still operate as good starting points for subsequent stages, which is possible since nodes are homogeneous w.r.t. their representations (see Section 2). Thus, utilizing the representation model learned in the previous stage, nodes $\{1, 2, \dots, 2n_0\}$ continue the learning process, deriving a model of improved accuracy. The procedure of geometrically increasing the number of participating nodes continues until the target accuracy is achieved. Hence, SRPFL uses the data of stragglers only in the latest stages of the algorithm, when an accurate approximation is required.

Remark 3.2. For simplicity of exposition, we assume that the set of $N$ sampled nodes maintains connectivity to the server throughout each stage. However, our analysis remains unaffected even if a new set of nodes is sampled at every round.

We proceed to characterize the class of federated algorithms able to employ SRPFL to enhance their performance in system-heterogeneous environments.
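The stage mechanics above (wait for the first $n$ responses, then double $n$) can be sketched as a small simulation; the exponential response times and all names here are illustrative assumptions, not part of the method's specification:

```python
import random

def stage_round_time(n, N, rng):
    """Server broadcasts to N sampled clients and proceeds once the first n
    models arrive, so the round is paced by the n-th fastest client."""
    times = sorted(rng.expovariate(1.0) for _ in range(N))  # hypothetical speeds
    return times[n - 1]

rng = random.Random(0)
N, n0 = 64, 4
n, schedule = n0, []
while True:
    schedule.append(n)               # participating-set size at this stage
    if n == N:
        break
    n = min(N, 2 * n)                # doubling rule between stages

early = stage_round_time(n0, N, rng)  # waiting for client n0 ...
late = stage_round_time(N, N, rng)    # ... vs. waiting for the slowest client
```

In early stages the server's bottleneck is `times[n0 - 1]`, which is typically far smaller than the full-participation bottleneck `times[N - 1]`; stragglers are only waited on once the schedule reaches $n = N$.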
Any iterative method that solves an instance of (2) can be combined with our adaptive node participation scheme; however, in this paper we focus on a broad class of alternating gradient-based update methods, presented in (Collins et al., 2021). As the name suggests, in each round clients update their heads and representation models in an alternating fashion. After a certain number of gradient updates is completed, all clients send their derived representations to the server, where the models are averaged and broadcast back to the clients. Next, we rigorously illustrate this procedure.

Alternating Update Scheme. At round $t$, the server communicates a common representation $\phi^t$ to the clients, and a subset of them, $\mathcal{I}^t$, is selected to participate by performing the following updates.

Client Head Update. Each client $i \in \mathcal{I}^t$ performs $\tau_h$ local gradient-based updates optimizing their head parameter, given the received representation $\phi^t$. Concretely, for $s = 1, \dots, \tau_h$, client $i$ updates their head model as follows:

$$h_i^{t,s} = \mathrm{GRD}\big(f_i(h_i^{t,s-1}, \phi^t),\; h_i^{t,s-1},\; \eta\big). \qquad (3)$$

Client Representation Update. Once the updated local heads $h_i^{t,\tau_h}$ are obtained, each client $i$ executes $\tau_\phi$ local updates on their representation parameters. That is, for $s = 1, \dots, \tau_\phi$,

$$\phi_i^{t,s} = \mathrm{GRD}\big(f_i(h_i^{t,\tau_h}, \phi_i^{t,s-1}),\; \phi_i^{t,s-1},\; \eta\big). \qquad (4)$$

In the above expressions, $\mathrm{GRD}(f, h, \eta)$ captures the generic notion of an update of variable $h$ using the gradient of function $f$ with respect to $h$ and step size $\eta$. This notation allows the inclusion of a large class of algorithms such as Gradient Descent with momentum, SGD, etc.

Server Update. Each client $i$ sends their derived representation model $\phi_i^{t,\tau_\phi}$ to the server, where the models are averaged, producing the next representation model $\phi^{t+1}$.

Coupling SRPFL with a generic subroutine that falls into the Alternating Update Scheme gives rise to Algorithm 1. Every stage $r$ is characterized by a participating set of size $2^r n_0$, denoted by $\mathcal{I}_r$.
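A minimal sketch of one client's round under this scheme, with GRD instantiated as plain gradient descent and a quadratic loss over a hypothetical linear model (the data, dimensions, and step size below are assumptions for illustration, not the paper's experimental setup):

```python
import numpy as np

def local_round(h, phi, X, y, eta=0.01, tau_h=5, tau_phi=1):
    """tau_h head updates with phi frozen (eq. (3)), then tau_phi
    representation updates with the head frozen (eq. (4)).
    Loss: f(h, phi) = (1/2m) ||X phi h - y||^2."""
    m = len(y)
    Z = X @ phi                                     # features under frozen phi
    for _ in range(tau_h):
        h = h - eta * Z.T @ (Z @ h - y) / m         # GRD step on the head
    for _ in range(tau_phi):
        r = X @ phi @ h - y
        phi = phi - eta * np.outer(X.T @ r, h) / m  # GRD step on the representation
    return h, phi

rng = np.random.default_rng(1)
d, k, m = 10, 2, 50
X, y = rng.standard_normal((m, d)), rng.standard_normal(m)
h, phi = rng.standard_normal(k), rng.standard_normal((d, k))
loss_before = np.mean((X @ phi @ h - y) ** 2) / 2
h, phi = local_round(h, phi, X, y)
loss_after = np.mean((X @ phi @ h - y) ** 2) / 2
```

With a sufficiently small `eta`, each alternating pass decreases the local loss; the server would then average the returned `phi` matrices across the participating set.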
Algorithm 1 SRPFL
1: Input: Initial number of nodes $n_0$; step size $\eta$; number of local updates for head $\tau_h$; number of local updates for representation $\tau_\phi$.
2: Initialization: $n \leftarrow n_0$, $\phi^0$, $h_1^{0,\tau_h}, \dots, h_N^{0,\tau_h}$
3: for $r = 0, 1, 2, \dots, \log(N/n_0)$ do
4:   $\phi^0 \leftarrow \phi^r$
5:   for $t = 1, 2, \dots, \tau_r$ do
6:     Server sends representation $\phi^{t-1}$ to $N$ clients sampled from $[M]$.
7:     for $i \in \mathcal{I}_r$ do
8:       Client $i$ initializes $h_i^{t,0} \leftarrow h_i^{t-1,\tau_h}$ and runs $\tau_h$ updates $h_i^t \leftarrow h_i^t - \eta \nabla_{h_i^t} f_i(h_i^t, \phi^{t-1})$.
9:       Client $i$ initializes $\phi_i^{t,0} \leftarrow \phi^{t-1}$ and runs $\tau_\phi$ updates $\phi_i^t \leftarrow \phi_i^t - \eta \nabla_{\phi_i^t} f_i(h_i^t, \phi_i^t)$.
10:      Client $i$ sends $\phi_i^{t,\tau_\phi}$ to the server.
11:     end for
12:     for each client $i \notin \mathcal{I}_r$ do
13:       $h_i^{t,\tau_h} \leftarrow h_i^{t-1,\tau_h}$
14:     end for
15:     Server computes $\phi^t \leftarrow \frac{1}{n} \sum_{i=1}^{n} \phi_i^{t,\tau_\phi}$.
16:   end for
17:   Server sets $n \leftarrow \min\{N, 2n\}$ and $\phi^{r+1} \leftarrow \phi^{\tau_r}$.
18: end for

At the beginning of each round, the server provides a representation model to the participating clients (line 6). The clients update their models (lines 8 and 9) and send their representations back to the server, where they are averaged, producing a new global model. The numbers of local updates $\tau_h, \tau_\phi$ depend on the subroutine method of choice, and the number of rounds per stage is denoted by $\tau_r$. At the end of every stage, the set of participating nodes doubles in size until a set of size $N$ is reached (line 17). Remark 3.3 summarizes the technical novelties of our work and highlights crucial benefits enjoyed by SRPFL.

Remark 3.3. Reisizadeh et al. (2022) proposed a similar participation scheme; however, their approach differs from ours and their results apply to significantly more restrictive regimes. Specifically, in (Reisizadeh et al., 2022) the analysis heavily relies on deriving a connection between the ERM solutions of consecutive stages.
In order to control the statistical accuracy of the corresponding ERM problems, (i) data homogeneity across all clients is necessary, and further, (ii) clients who participate in early stages are required to remain active and connected to the server in all subsequent stages, maintaining fixed computational speeds throughout the whole training process. (iii) The results of their analysis hold only for strongly convex loss functions, and (iv) their stage-termination criterion requires knowledge of the strong convexity parameter. The above restrictions are detrimental in the FL regime and severely undermine the applicability of the resulting algorithm. In our work we follow a different approach, controlling (in terms of principal angle distance) a quantity analogous to statistical accuracy, thereby directly connecting the common representation (and overall solution) at every stage to the ground-truth representation. This novel approach allows our algorithm to accommodate (i) data heterogeneity, (ii) clients with dynamically changing speeds, or equivalently clients that are replaced by new ones at every round, and (iii) non-convex loss functions. Additionally, a major part of our technical contribution focuses on (iv) analytically deriving the optimal number of rounds per stage, thus producing a simple and efficient doubling scheme.

4 SRPFL in the Linear Representation Case

Our theoretical analysis focuses on a specific instance of (1), where clients strive to solve a linear representation learning problem. Concretely, we assume that $f_i$ is the quadratic loss, $\phi$ is a projection onto a $k$-dimensional subspace of $\mathbb{R}^d$ given by a matrix $B \in \mathbb{R}^{d \times k}$, and the $i$-th client's head $h_i$ is a vector $w_i \in \mathbb{R}^k$. We model the local data of client $i$ such that $y_i = w_i^{*\top} B^{*\top} x_i + z_i$, for some ground-truth representation $B^* \in \mathbb{R}^{d \times k}$, local heads $w_i^* \in \mathbb{R}^k$, and $z_i \sim \mathcal{N}(0, \sigma^2)$ capturing the noise in the measurements.
Hence, all clients' optimal solutions lie in the same $k$-dimensional subspace.

Algorithm 2 FedRep-SRPFL (Linear Representation)
1: Input: Step size $\eta$; batch size $m$; initial nodes $n_0$.
2: Initialization: Client $i \in [N]$ sends to server: $P_i := \frac{1}{m} \sum_{j=1}^{m} (y_i^{0,j})^2 x_i^{0,j} (x_i^{0,j})^\top$; $n \leftarrow n_0$.
3: Server finds $U D U^\top \leftarrow$ rank-$k$ SVD$\big(\frac{1}{N} \sum_{i=1}^{N} P_i\big)$.
4: for $r = 0, 1, 2, \dots, \log(N/n_0)$ do
5:   Server initializes representation $B^{r,0} \leftarrow U$.
6:   for $t = 1, 2, \dots, \tau_r$ do
7:     Server sends $B^{r,t}$ to $N$ clients sampled from $[M]$.
8:     for $i \in \{1, \dots, n\}$ do
9:       Client $i$ samples a fresh batch of $m$ samples.
10:      Client $i$ computes $w_i^{r,t+1} \leftarrow \arg\min_w \hat{f}_i^t(w, B^{r,t})$.
11:      Client $i$ computes $\tilde{B}_i^{r,t+1} \leftarrow B^{r,t} - \eta \nabla_B \hat{f}_i^t(w_i^{t+1}, B^{r,t})$ and sends it back to the server.
12:     end for
13:     Server computes $\tilde{B}^{r,t+1} \leftarrow \frac{1}{n} \sum_{i \in \mathcal{I}_t} \tilde{B}_i^{r,t+1}$.
14:     Server computes $B^{r,t+1}, R^{r,t+1} \leftarrow \mathrm{QR}(\tilde{B}^{r,t+1})$.
15:   end for
16:   Server sets $U \leftarrow B^{r,t+1}$ and $n \leftarrow \min\{N, 2n\}$.
17: end for

Under these assumptions, the global expected risk is

$$\min_{B, W} \frac{1}{2M} \sum_{i=1}^{M} \mathbb{E}_{(x_i, y_i) \sim \mathcal{D}_i}\Big[\big(y_i - w_i^\top B^\top x_i\big)^2\Big], \qquad (5)$$

where $W = [w_1, \dots, w_N]^\top \in \mathbb{R}^{N \times k}$ is the concatenation of the client-specific heads. Since the true distributions $\mathcal{D}_i$ are unknown, algorithms strive to minimize the empirical risk instead. The global empirical risk over all clients is

$$\frac{1}{M} \sum_{i=1}^{M} \hat{f}_i(w_i, B) = \frac{1}{2Mm} \sum_{i=1}^{M} \sum_{j=1}^{m} \big(y_i^j - w_i^\top B^\top x_i^j\big)^2, \qquad (6)$$

where $m$ is the number of samples at each client. The global loss in (6) is nonconvex and has many global minima, including all pairs $(W^* Q^{-1}, B^* Q^\top)$, where $Q \in \mathbb{R}^{k \times k}$ is some invertible matrix. Thus, the server aims to retrieve the column space of $B^*$ instead of the ground-truth factors $(W^*, B^*)$. To measure closeness between column spaces, we adopt the metric of principal angle distance (Jain et al., 2013).

Definition 4.1. Let $B_1, B_2 \in \mathbb{R}^{d \times k}$, and let $\hat{B}_1, \hat{B}_2$ be orthonormal matrices such that $\mathrm{span}(\hat{B}_1) = \mathrm{span}(B_1)$ and $\mathrm{span}(\hat{B}_2) = \mathrm{span}(B_2)$.
The principal angle distance between the column spaces of $B_1$ and $B_2$ is defined as $\mathrm{dist}(B_1, B_2) := \|\hat{B}_{1,\perp}^\top \hat{B}_2\|_2$, where $\hat{B}_{1,\perp}$ is an orthonormal matrix spanning the orthogonal complement of $\mathrm{span}(B_1)$.

Federated Representation Learning (FedRep) is an alternating minimization-descent algorithm, recently proposed in (Collins et al., 2021) for the linear shared representation framework. SRPFL coupled with FedRep gives rise to Algorithm 2. Below we highlight the main points of interest. In the initialization phase (lines 2 and 3), a model of bounded distance from the optimal representation is obtained via the Method of Moments (Tripuraneni et al., 2021). Subsequently, at every round $t$, client $i$ samples a fresh batch of samples $\{x_i^{t,j}, y_i^{t,j}\}_{j=1}^{m}$ from its local distribution (line 9), and thus the corresponding loss function becomes $\hat{f}_i^t(w_i, B) := \frac{1}{2m} \sum_{j=1}^{m} (y_i^{t,j} - w_i^\top B^\top x_i^{t,j})^2$. Utilizing the global representation provided by the server, client $i$ computes the optimal head $w_i$ (line 10). Fixing the newly computed head, client $i$ proceeds to update its global representation model with one step of gradient descent (line 11) and transmits it back to the server. As depicted in lines 13 and 14, the parameter server averages the models received and orthogonalizes the resulting matrix before broadcasting it to the clients, a component of crucial importance in our analysis. Mapping this method back to Algorithm 1, we note that the number of representation model updates $\tau_\phi$ is set to 1, whereas the number of head updates $\tau_h$ is sufficiently large to derive (approximately) optimal solutions. This imbalance is designed to take advantage of the inherent structure of our problem, where the size of the $w_i$'s is significantly smaller than $d$. Finally, we point out that the number of communication rounds per stage $\tau_r$ is a small constant, known a priori to the algorithm and specified by our analysis.

Remark 4.2.
Although our proposed method utilizes FedRep as a backbone, our analysis and framework differ substantially from the ones in (Collins et al., 2021). Specifically, Collins et al. (2021) assume access to (i) infinitely many samples that are (ii) noiseless. Additionally, (iii) the number of participating nodes remains fixed throughout the training, and (iv) the focus lies solely on handling heterogeneous data, with system heterogeneity being an orthogonal direction to their work. In contrast, our analysis requires only (i) finite and (ii) noisy samples, and our theoretical results (Theorems 4.7 and 4.9) revolve around (iii) participating subsets of different sizes and (iv) regimes where both data and system heterogeneity are prevalent.

4.1 Theoretical Results

In this subsection, we provide a rigorous analysis of FedRep-SRPFL in the linear representation setting. First, we present the notion of Wall Clock Time (WCT), which is the measure of performance for our algorithm. Subsequently, we illustrate the contraction inequality that determines the rate at which the distance to the optimal representation diminishes. We conclude by showing that FedRep-SRPFL outperforms its straggler-prone variant by a factor of $O(\log N)$, under the standard assumption that clients' computational times follow the exponential distribution. Before we proceed, we introduce the necessary notation and assumptions:

$$E_0 := 1 - \mathrm{dist}^2(B^0, B^*), \qquad (7)$$
$$\bar{\sigma}_{\max,*} := \max_{\mathcal{I} \subseteq [N],\, |\mathcal{I}| = n,\, n_0 \le n \le N} \sigma_{\max}\big(\tfrac{1}{\sqrt{n}} W_{\mathcal{I}}^*\big), \qquad (8)$$
$$\bar{\sigma}_{\min,*} := \min_{\mathcal{I} \subseteq [N],\, |\mathcal{I}| = n,\, n_0 \le n \le N} \sigma_{\min}\big(\tfrac{1}{\sqrt{n}} W_{\mathcal{I}}^*\big), \qquad (9)$$

where $W_{\mathcal{I}}^*$ denotes the submatrix of $W^*$ with rows indexed by $\mathcal{I}$.

Assumption 4.3 (Sub-Gaussian design). The local samples $x_i \in \mathbb{R}^d$ are i.i.d. with mean $0$, covariance $I_d$, and are $I_d$-sub-Gaussian, i.e., $\mathbb{E}[e^{v^\top x_i}] \le e^{\|v\|_2^2/2}$ for all $v \in \mathbb{R}^d$.

Assumption 4.4 (Client diversity). Let $\bar{\sigma}_{\min,*}$, defined in (9), be the minimum singular value of any matrix that can be obtained by taking $n$ rows of $\frac{1}{\sqrt{n}} W^*$. Then $\bar{\sigma}_{\min,*} > 0$. Specifically, our theoretical analysis requires Assumption 4.4 to be satisfied for every $n \ge n_0$.
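Assumption 4.4 can be checked numerically on small instances (a brute-force sketch over all row subsets; the function name and the Gaussian test matrix are hypothetical, and the enumeration is only feasible for tiny N):

```python
import itertools
import numpy as np

def min_client_diversity(W, n0):
    """Smallest sigma_min over all row-subsets I with |I| = n, n0 <= n <= N,
    of the matrix (1/sqrt(n)) W_I, mirroring definition (9)."""
    N, k = W.shape
    worst = np.inf
    for n in range(n0, N + 1):
        for rows in itertools.combinations(range(N), n):
            s = np.linalg.svd(W[list(rows)] / np.sqrt(n), compute_uv=False)
            worst = min(worst, s[-1])            # s[-1] is sigma_min
    return worst

rng = np.random.default_rng(2)
W_star = rng.standard_normal((6, 2))             # N=6 ground-truth heads in R^2
sigma_min_star = min_client_diversity(W_star, n0=2)
# Assumption 4.4 holds iff sigma_min_star > 0: any n >= n0 heads span R^k.
```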
Assumption 4.4 implies that the optimal heads of the participating clients span $\mathbb{R}^k$. This is true in many FL regimes, as the number of clients is usually much larger than the dimension of the shared representation.

Remark 4.5. In this work we consider client speeds to be independent of the local data available to them. This is a natural assumption, since the computational power of each client crucially depends on their device characteristics (battery, CPU, etc.), whereas any connection to their local data is unclear. However, in the presence of strong correlation between data and system heterogeneity, Assumption 4.4 may not hold, which can be seen as a potential limitation of our work and an interesting future direction to explore.

Assumption 4.6 (Client normalization). The ground-truth client-specific parameters satisfy $\|w_i^*\|_2 = \sqrt{k}$ for all $i \in [N]$, and $B^*$ has orthonormal columns.

Assumption 4.6 ensures the ground-truth matrix $W^* B^{*\top}$ is row-wise incoherent, i.e., its row norms have similar magnitudes. This is of vital importance, since our measurement matrices are row-wise sparse and incoherence is a common requirement in sensing problems with sparse measurements.

Wall Clock Time. To measure the speedup that our meta-algorithm enjoys, we use the concept of real time, or WCT, as described below. FedRep-SRPFL runs in communication rounds grouped into stages. Consider such a round $t$ at stage $r$, with nodes $\{1, 2, \dots, n_r\}$ participating in the learning process. Here $n_r$ denotes the slowest participating node. The expected amount of time that the server has to wait for the updates to take place is $\mathbb{E}[T_{n_r}^r]$. Put simply, the expected computational time of the slowest node acts as the bottleneck for the round. Further, at the beginning and at the end of every round, models are exchanged between the server and the clients. This incurs an additional, fixed communication cost $C$.
If τ_r communication rounds take place at every stage r, then the overall expected WCT for FedRep-SRPFL is

E[T_SRPFL] = Σ_{r=0}^{log(N/n₀)} τ_r (E[T^r_{n_r}] + C).

Similarly, the total expected runtime for FedRep can be expressed in terms of the total number of rounds, T_FR, as E[T_FedRep] = T_FR (E[T^r_N] + C). Taking the ratio of these quantities yields the desired speedup guarantee.

Figure 3: Numerical results on CIFAR10, CIFAR100, EMNIST, FEMNIST with full participation (M = N) in the fixed computation speeds setting. Shard denotes the number of classes per client. C.T.
denotes the communication cost per round.

Contraction Inequality. Theorem 4.7 captures the progress made between two consecutive rounds of FedRep-SRPFL. It follows that convergence to the optimal representation is exponentially fast, provided that the number of participating nodes and the batch size are sufficiently large.

Theorem 4.7. Let Assumptions 4.3-4.6 hold. Further, let the number of participating nodes and the batch size satisfy, respectively, n ≥ n₀ and m ≥ c₀ ((1+σ²)k³κ⁴/E₀²) max{log(N), d/n₀}, for some absolute constant c₀. Then FedRep-SRPFL with stepsize η ≤ 1/(8σ²_{max,-}) satisfies the following contraction inequality:

dist(B̂^{t+1}, B̂*) ≤ dist(B̂^t, B̂*) √(1 − a) + α/(√(n/n₀)(1 − a)),   (10)

w.p. at least 1 − T exp(−90 min{d, k² log(N)}), where a = (1/2) η E₀ σ²_{min,-}. Here T denotes the total number of communication rounds, which is logarithmic w.r.t. the target error ϵ. The initial representation computed by the Method of Moments satisfies dist(B̂⁰, B̂*) ≤ 1 − C_M, for some constant C_M. Since E₀ is strictly greater than zero, inequality (10) ensures contraction.

Remark 4.8. Theorem 4.7 suggests that the server can learn the ground-truth representation before some of the clients update their local heads or participate in the learning process. This might raise concerns about fairness or accuracy; however, it is important to highlight that such concerns are unfounded. This is because, after obtaining the ground-truth representation, the server shares it with all the clients in the system. Thus, even if a client i was not selected in the representation learning process, it can still optimize its low-dimensional head w_i ∈ R^k using its local samples through a few local updates (given that k is a small constant). Consequently, the derivation of the ground-truth model benefits both the clients that already participated in the learning procedure and new clients who opt to join the federated system at a later time.

Logarithmic Speedup.
Algorithm 2 sets off with n₀ participating clients and follows a doubling scheme, so that at stage r only the fastest 2^r n₀ nodes contribute to the learning process. Thus at stage r, inequality (10) can be written as:

dist⁺ ≤ dist · √(1 − α) + α/(√(2^r)(1 − α)).   (11)

Figure 4: Numerical results on CIFAR10, CIFAR100, EMNIST, FEMNIST with partial participation (N = M/5) in the fixed computational speeds setting. Shard denotes the number of classes per client. C.T. denotes the communication cost per round.

We note that the second term on the r.h.s. is an artifact of the noisy measurements. Utilizing geometric-series properties, we can deduce that in the limit, contraction (11) converges to α/(√(2^r)(1 − α)(1 − √(1 − α))).
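The geometric-series claim can be checked numerically. The sketch below (illustrative values of our own choosing) iterates a recurrence of the form d ← d·√(1 − α) + c, where the constant c is a generic stand-in for the stage-r noise term, and verifies convergence to the fixed point c/(1 − √(1 − α)).

```python
import math

alpha = 0.5   # contraction parameter (illustrative value)
c = 0.01      # stand-in for the stage-r noise term alpha / (sqrt(2^r) (1 - alpha))

# Iterate the contraction  d <- d * sqrt(1 - alpha) + c
d = 1.0       # initial distance to the optimal representation
for _ in range(200):
    d = d * math.sqrt(1 - alpha) + c

# Summing the geometric series gives the fixed point  c / (1 - sqrt(1 - alpha)).
limit = c / (1 - math.sqrt(1 - alpha))
assert abs(d - limit) < 1e-12
```

The distance contracts geometrically toward this floor rather than to zero, which is exactly the noise artifact described above: halving the noise constant halves the achievable error floor.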
This implies that the achievable error of our algorithm is lower bounded by α/(√(N/n₀)(1 − α)(1 − √(1 − α))), since the total number of stages is at most r = log(N/n₀). To illustrate the theoretical benefits of SRPFL, we compare Algorithm 2 to FedRep. One can distill FedRep from Algorithm 2 by disregarding the doubling scheme and instead, at each round, randomly sampling N nodes to participate. For a fair comparison between the two methods, we set the target error small enough so that the contribution of all N nodes is necessary. Specifically, we express the error ϵ as

ϵ = ĉ · α/(√(N/n₀)(1 − α)(1 − √(1 − α))), with 2 > ĉ > 1.   (12)

Intuitively, one should expect FedRep-SRPFL to vastly outperform straggler-prone FedRep as ĉ approaches 2 (large error), since in this case the biggest chunk of the workload is completed before FedRep-SRPFL utilizes the slower half of the clients. In contrast, FedRep experiences heavy delays throughout the whole training process due to the inclusion of stragglers at every round. Conversely, as ĉ approaches 1 (small error), the number of rounds FedRep-SRPFL spends utilizing all N clients increases. In this case, one should expect the speedup achieved by FedRep-SRPFL in early stages to eventually become negligible. Theorem 4.9 provides a rigorous exposition of these insights.

Theorem 4.9. Suppose that at each stage the clients' computational times are i.i.d. random variables drawn from the exponential distribution with parameter λ. Further, suppose that the expected communication cost per round is C = c/λ, for some constant c. Finally, consider the target error ϵ given in (12).
Then, we have

E[T_SRPFL] / E[T_FedRep] = O( log(1/(ĉ − 1)) / (log(N) + log(1/(ĉ − 1))) ).

Figure 5: Numerical results on CIFAR10, CIFAR100, EMNIST, FEMNIST with full participation (M = N) in the random (dynamic) computation speeds setting. Shard denotes the number of classes per client. C.T. denotes the communication cost per round.

Theorem 4.9 establishes an O(log N) speedup for our method compared to its straggler-prone benchmark.
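The exponential assumption is what makes the straggler penalty logarithmic: the expected maximum of n i.i.d. Exp(λ) variables is exactly H_n/λ ≈ log(n)/λ, so a round that waits for the slowest of all N clients pays a Θ(log N) factor that the doubling scheme mostly avoids. A quick Monte-Carlo check of this order-statistics fact (our own illustration, pure Python):

```python
import random

random.seed(0)
lam = 1.0

def mean_max_exponential(n, trials=50000):
    """Monte-Carlo estimate of E[max of n i.i.d. Exp(lam)]: the per-round
    bottleneck when all n participating clients must report back."""
    return sum(max(random.expovariate(lam) for _ in range(n))
               for _ in range(trials)) / trials

# For exponentials the expected maximum equals the harmonic number H_n / lam,
# which grows like log(n) / lam.
for n in (8, 64):
    harmonic = sum(1.0 / i for i in range(1, n + 1)) / lam
    assert abs(mean_max_exponential(n) - harmonic) < 0.05
```

Going from 8 to 64 clients thus raises the expected per-round bottleneck only from H_8 ≈ 2.72/λ to H_64 ≈ 4.74/λ, but it is exactly this logarithmic growth that the ratio in Theorem 4.9 captures.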
This result holds both when speeds are drawn once at the beginning of the process and when new speeds are drawn at every round, thus rendering our method versatile in a broad class of settings.

Remark 4.10. Our analysis is crucially intertwined with the representation learning framework, where the presence of a shared, global representation serves as common ground across data-heterogeneous clients, allowing us to show that intermediate solutions constitute good starting points and that substantial progress is achieved between stages. Despite FedAvg being a general-purpose algorithm not designed for representation learning, it was recently shown to recover the ground-truth representation in the case of multi-task linear regression (Collins et al., 2022), thus casting FedAvg as a potential candidate subroutine for our method.

Remark 4.11. The initialization phase of Algorithm 2 requires a one-time exchange of information between the clients and the server. Although this process reveals only the sum of outer products of local samples, it can be further fortified using differential privacy techniques, such as the ones in (Jain et al., 2021; Shen et al., 2022).

5 Experiments

In our empirical study we consider classification tasks on the CIFAR10, CIFAR100, EMNIST, FEMNIST and Sent140 datasets. We conduct experiments under the full and partial participation schemes with different computation speed distributions, comparing the performance of our proposed method against other state-of-the-art benchmarks. Due to space limitations, in this section we present the results for the image classification tasks in the full and partial participation regimes with fixed and dynamic computation speeds. Additional results and extensive discussion can be found in Appendix C.

Baselines. The first benchmarks under consideration are FedRep (Collins et al., 2021) and Local-Global FedAvg (LG-FedAvg) (Liang et al., 2020).
These federated methods utilize a mixture of global and local models to derive personalized solutions with small global loss. Coupling these algorithms with our proposed doubling scheme gives rise to FedRep-SRPFL and LG-SRPFL, respectively, which are our proposed algorithms for the numerical experiments.

Figure 6: Numerical results on CIFAR10, CIFAR100, EMNIST, FEMNIST with partial participation (N = M/5) in the random (dynamic) computation speeds setting. Shard denotes the number of classes per client. C.T. denotes the communication cost per round.
We further compare our methods with FLANP (Reisizadeh et al., 2022) and FedAvg (McMahan et al., 2017), with and without fine-tuning. The fine-tuning variants are considered because they often lead to better performance in data-heterogeneous settings (Wang et al., 2019; Yu et al., 2020). Our final baseline is the HF-MAML algorithm from (Fallah et al., 2020a), a Model-Agnostic Meta-Learning-based method producing high-quality personalized solutions in data-heterogeneous settings.

Data allocation. To ensure that our data allocation is heterogeneous, we randomly split the data points among the clients in such a way that each client can only observe a specific subset of classes. For instance, in the CIFAR10 dataset, which has 10 different classes in total, each client is only assigned data from 5 different classes, which we refer to as Shards; see the first column of Figure 3. We also make sure that the test set of each client is consistent with the samples they have access to at training time; e.g., if client i only observes samples with labels 1, 4, 5 during training, then at test time they are only asked to classify samples from those same classes. Further details and different allocation schemes are presented in Appendix C.

Simulation of System Heterogeneity. We consider two different types of client speed configurations:

Fixed computation speeds. In this configuration we sample a value for each client from the exponential distribution with parameter λ, once at the beginning of the training process. These personalized values capture the computational time of every client and remain fixed throughout the training procedure. In addition to the computational time, our methods suffer a fixed communication cost at every round. In Figures 3 to 6, each row depicts the effects of a different value of the communication cost on the convergence of the algorithms under consideration (C.T. = 0, 10, 100).

Dynamic computation speeds.
In this configuration, every client samples their processing time at every round from a personalized exponential distribution that remains fixed throughout the process. Specifically, at the beginning of the training process we sample for each client i a parameter λ_i from the uniform distribution over [1/M, 1]. Subsequently, at every round we sample a new computational time for each client i from the exponential distribution with parameter λ_i. Similarly to the former setting, an additional, fixed communication cost is incurred at every round, contributing to the overall running time. As we illustrate in Figure 5 (full participation) and Figure 6 (partial participation - 20%), the experimental results under this configuration are qualitatively similar to the ones presented in Figure 3 (full participation) and Figure 4 (partial participation - 20%) for fixed computation speeds, with SRPFL providing substantial speedup over straggler-prone variants.

¹For reference, the computational times are sampled with λ = 1.

Results and Discussion. From the numerical results in Figures 3 to 6 we distill the following takeaways: 1) Coupling SRPFL with FedRep exhibits consistently superior performance across different datasets and communication time regimes compared to all proposed baselines. 2) Applying SRPFL to personalized FL solvers (FedRep and LG-FedAvg) significantly enhances their efficiency. 3) The speedup achieved by our meta-algorithm is more significant for smaller values of communication time. Concretely, we observe that the gap between FedRep-SRPFL and FedRep, as well as between LG-SRPFL and LG-FedAvg, diminishes as the communication cost increases (plots in the same column, C.T. = 0, 10, 100). This is unsurprising, since our method improves the computational cost of training, and thus when the communication cost dominates the overall running time, the benefits of SRPFL are less apparent.
4) FedRep-SRPFL vastly outperforms the fine-tuning variant of the previously proposed FLANP, especially in regimes with high data heterogeneity (FEMNIST).

6 Conclusion

In this paper, we proposed SRPFL, a straggler-resilient FL meta-algorithm with near-optimal sample complexity and provable logarithmic speedup guarantees in regimes with data and system heterogeneity. Our method leverages ideas from representation learning theory to compute a global representation model along with local client heads, thus deriving personalized solutions for all clients. In SRPFL the participating clients are selected in an adaptive manner: in early stages fast nodes are prioritized, and progressively slower nodes are included in the training process, thereby mitigating the effects of stragglers without compromising the quality of the solutions. Our numerical results illustrated the benefits of SRPFL when coupled with different personalized FL methods such as FedRep and LG-FedAvg. Furthermore, our experiments support our theoretical findings, showcasing the superior performance of FedRep-SRPFL compared to state-of-the-art FL methods.

Acknowledgments

The research of I. Tziotis and A. Mokhtari is supported in part by NSF Grants 2019844 and 2112471, ARO Grant W911NF2110226, the Machine Learning Lab (MLL) at UT Austin, and the Wireless Networking and Communications Group (WNCG) Industrial Affiliates Program. The research of Z. Shen and H. Hassani is supported by the NSF Institute for CORE Emerging Methods in Data Science (EnCORE) as well as The Institute for Learning-enabled Optimization at Scale (TILOS). The research of R. Pedarsani is supported by NSF awards 2003035 and 2236483.

References

Acar, D. A. E., Zhao, Y., Matas, R., Mattina, M., Whatmough, P., and Saligrama, V. (2021). Federated learning based on dynamic regularization. In International Conference on Learning Representations.

Balakrishnan, R., Li, T., Zhou, T., Himayat, N., Smith, V., and Bilmes, J. (2021).
Diverse client selection for federated learning via submodular maximization. In International Conference on Learning Representations.

Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828.

Chen, H., Ding, J., Tramel, E. W., Wu, S., Sahu, A. K., Avestimehr, S., and Zhang, T. (2022). ActPerFL: Active personalized federated learning. In ACL 2022 Workshop on Federated Learning for Natural Language Processing.

Cho, Y. J., Wang, J., Chiruvolu, T., and Joshi, G. (2021). Personalized federated learning for heterogeneous clients with clustered knowledge transfer. arXiv preprint arXiv:2109.08119.

Cho, Y. J., Wang, J., and Joshi, G. (2022). Towards understanding biased client selection in federated learning. In International Conference on Artificial Intelligence and Statistics, pages 10351-10375. PMLR.

Collins, L., Hassani, H., Mokhtari, A., and Shakkottai, S. (2021). Exploiting shared representations for personalized federated learning. In Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR.

Collins, L., Hassani, H., Mokhtari, A., and Shakkottai, S. (2022). FedAvg with fine tuning: Local updates lead to representation learning. Advances in Neural Information Processing Systems, 35:10572-10586.

Deng, Y., Kamani, M. M., and Mahdavi, M. (2020). Adaptive personalized federated learning. arXiv preprint arXiv:2003.13461.

Fallah, A., Mokhtari, A., and Ozdaglar, A. (2020a). On the convergence theory of gradient-based model-agnostic meta-learning algorithms. In International Conference on Artificial Intelligence and Statistics. PMLR.

Fallah, A., Mokhtari, A., and Ozdaglar, A. (2020b). Personalized federated learning: A meta-learning approach. arXiv preprint arXiv:2002.07948.

Haddadpour, F., Kamani, M.
M., Mokhtari, A., and Mahdavi, M. (2021). Federated learning with compression: Unified analysis and sharp guarantees. In International Conference on Artificial Intelligence and Statistics. PMLR.

Hanzely, F. and Richtárik, P. (2020). Federated learning of a mixture of global and local models. arXiv preprint arXiv:2002.05516.

Horváth, S., Sanjabi, M., Xiao, L., Richtárik, P., and Rabbat, M. (2022). FedShuffle: Recipes for better use of local work in federated learning. arXiv preprint arXiv:2204.13169.

Hu, S., Wu, Z. S., and Smith, V. (2021). Private multi-task learning: Formulation and applications to federated learning. arXiv preprint arXiv:2108.12978.

Jain, P., Netrapalli, P., and Sanghavi, S. (2013). Low-rank matrix completion using alternating minimization. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, STOC '13, pages 665-674, New York, NY, USA. Association for Computing Machinery.

Jain, P., Rush, J., Smith, A., Song, S., and Guha Thakurta, A. (2021). Differentially private model personalization. Advances in Neural Information Processing Systems, 34:29723-29735.

Jiang, L. and Lin, T. (2022). Test-time robust personalization for federated learning. arXiv preprint arXiv:2205.10920.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. (2021). Advances and open problems in federated learning. Foundations and Trends in Machine Learning.

Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S. J., Stich, S. U., and Suresh, A. T. (2019). SCAFFOLD: Stochastic controlled averaging for on-device federated learning. arXiv preprint arXiv:1910.06378.

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521:436-444.

Lee, S., Sahu, A. K., He, C., and Avestimehr, S. (2022). Partial model averaging in federated learning: Performance guarantees and benefits. arXiv preprint arXiv:2201.03789.

Li, T., Sahu, A.
K., Zaheer, M., Sanjabi, M., Talwalkar, A., and Smith, V. (2020). Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2:429-450.

Liang, P. P., Liu, T., Ziyin, L., Allen, N. B., Auerbach, R. P., Brent, D., Salakhutdinov, R., and Morency, L.-P. (2020). Think locally, act globally: Federated learning with local and global representations. arXiv preprint arXiv:2001.01523.

Mansour, Y., Mohri, M., Ro, J., and Suresh, A. T. (2020). Three approaches for personalization with applications to federated learning. arXiv preprint arXiv:2002.10619.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and Arcas, B. A. y. (2017). Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research. PMLR.

Mitra, A., Jaafar, R., Pappas, G. J., and Hassani, H. (2021). Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W., editors, Advances in Neural Information Processing Systems, volume 34, pages 14606-14619. Curran Associates, Inc.

Nguyen, J., Malik, K., Zhan, H., Yousefpour, A., Rabbat, M., Malek, M., and Huba, D. (2022). Federated learning with buffered asynchronous aggregation. In International Conference on Artificial Intelligence and Statistics, pages 3581-3607. PMLR.

Nishio, T. and Yonetani, R. (2019). Client selection for federated learning with heterogeneous resources in mobile edge. In ICC 2019 - 2019 IEEE International Conference on Communications (ICC), pages 1-7. IEEE.

Pillutla, K., Malik, K., Mohamed, A.-R., Rabbat, M., Sanjabi, M., and Xiao, L. (2022). Federated learning with partial model personalization. In International Conference on Machine Learning, pages 17716-17758. PMLR.
Reisizadeh, A., Taheri, H., Mokhtari, A., Hassani, H., and Pedarsani, R. (2019). Robust and communication-efficient collaborative learning. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 8388-8399.

Reisizadeh, A., Tziotis, I., Hassani, H., Mokhtari, A., and Pedarsani, R. (2022). Straggler-resilient federated learning: Leveraging the interplay between statistical accuracy and system heterogeneity. IEEE Journal on Selected Areas in Information Theory.

Shen, Z., Ye, J., Kang, A., Hassani, H., and Shokri, R. (2022). Share your representation only: Guaranteed improvement of the privacy-utility tradeoff in federated learning. In The Eleventh International Conference on Learning Representations.

Smith, V., Chiang, C.-K., Sanjabi, M., and Talwalkar, A. S. (2017). Federated multi-task learning. Advances in Neural Information Processing Systems.

So, J., Ali, R. E., Güler, B., and Avestimehr, A. S. (2021). Secure aggregation for buffered asynchronous federated learning. arXiv preprint arXiv:2110.02177.

Stich, S. U. (2019). Local SGD converges fast and communicates little. In ICLR 2019 - International Conference on Learning Representations.

Tandon, R., Lei, Q., Dimakis, A. G., and Karampatziakis, N. (2017). Gradient coding: Avoiding stragglers in distributed learning. In Precup, D. and Teh, Y. W., editors, Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR.

Tripuraneni, N., Jin, C., and Jordan, M. (2021). Provable meta-learning of linear representations. In International Conference on Machine Learning, pages 10434-10443. PMLR.

Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.

Wang, J., Liu, Q., Liang, H., Joshi, G., and Poor, H. V. (2020a).
Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in Neural Information Processing Systems.

Wang, K., Mathews, R., Kiddon, C., Eichner, H., Beaufays, F., and Ramage, D. (2019). Federated evaluation of on-device personalization. arXiv preprint arXiv:1910.10252.

Wang, S., Liu, J., and Shroff, N. (2018). Coded sparse matrix multiplication. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5152-5160. PMLR.

Wang, S., Liu, J., and Shroff, N. B. (2020b). Fundamental limits of approximate gradient coding. ACM SIGMETRICS Performance Evaluation Review, 48:21-22.

Wu, S., Li, T., Charles, Z., Xiao, Y., Liu, Z., Xu, Z., and Smith, V. (2022). Motley: Benchmarking heterogeneity and personalization in federated learning.

Xie, C., Koyejo, S., and Gupta, I. (2019). Asynchronous federated optimization. arXiv preprint arXiv:1903.03934.

Yang, H., Zhang, X., Khanduri, P., and Liu, J. (2022). Anarchic federated learning. In International Conference on Machine Learning, pages 25331-25363. PMLR.

Yu, T., Bagdasaryan, E., and Shmatikov, V. (2020). Salvaging federated learning by local adaptation. arXiv preprint arXiv:2002.04758.

Zhu, C., Xu, Z., Chen, M., Konečný, J., Hard, A., and Goldstein, T. (2021). Diurnal or nocturnal? Federated learning of multi-branch networks from periodically shifting distributions. In International Conference on Learning Representations.

Before we dive into the analysis we provide the following useful definition.

Definition A.1. For a random vector x ∈ R^{d₁} and a fixed matrix A ∈ R^{d₁×d₂}, the vector Aᵀx is called ‖A‖₂-sub-Gaussian if yᵀAᵀx is sub-Gaussian with sub-Gaussian norm O(‖A‖₂‖y‖₂) for all y ∈ R^{d₂}, i.e., E[exp(yᵀAᵀx)] ≤ exp(‖y‖₂²‖A‖₂²/2).
We study the performance of SRPFL with FedRep as the subroutine of choice. The first part of our analysis focuses on a single round t and extends the analysis in Collins et al. (2021). We assume that there are N clients in the network and that at round t a subset I_t of them participates in the learning procedure, with cardinality n ≥ n₀ := 2d log(N)/σ²_{max,-}. Without loss of generality, we assume that the clients are indexed from fastest to slowest, so that the clients participating in the learning process are all i ∈ [n]. Each client i draws a batch of m ≥ c₀(1+σ²)k³κ⁴ log(N)/E₀² fresh, i.i.d. samples at every round. We denote by X^t_i ∈ R^{m×d} and Y^t_i ∈ R^m the matrix of samples and the labels of client i, such that the rows of X^t_i are the samples {x¹_i, ..., x^m_i}. By Z^t_i ∈ R^m we denote the noise in the measurements of client i, with z_{i,j} ∼ N(0, σ²). Let B̂* ∈ R^{d×k} and W* ∈ R^{N×k} stand for the optimal representation and the concatenation of the optimal heads, respectively. The hat denotes that a matrix is orthonormal, i.e., its columns form an orthonormal set. Similarly, B̂^t ∈ R^{d×k} and W^t ∈ R^{N×k} denote the global representation and the concatenation of the heads at round t, while w*_i and w^t_i denote the optimal head and the head at round t of client i, which constitute the rows of W* and W^t, respectively. Furthermore, we define

σ_{min,-} := min_{I⊆[N], |I|=n′, n′≤N} σ_min((1/√n′) W*_I)  and  σ_{max,-} := max_{I⊆[N], |I|=n′, n′≤N} σ_max((1/√n′) W*_I),

where W*_I is formed by taking the rows of W* indexed by I. That is, σ_{max,-} and σ_{min,-} are the maximum and minimum singular values of any submatrix W*_I that can be obtained throughout the course of our algorithm. Notice that by Assumption 4.6 each row of W* has norm √k, so 1/√n′ acts as a normalization factor such that ‖(1/√n′) W*_I‖₂ ≤ √k. Finally, we define κ = σ_{max,-}/σ_{min,-}. Since we focus on a single round, the time index can be dropped for simplicity. Further, henceforth we drop the subscript I_t on W^t. First we derive the update scheme our algorithm follows.
Notice that the empirical objective function given in (6) can be expressed via the matrices X_i and Y_i as

L_N(B, W) = (1/2mn) Σ_{i=1}^n ‖Y_i − X_iB̂w_i‖₂².   (13)

Further, computing the gradients we derive

∇_B L_N = (1/mn) Σ_{i=1}^n X_iᵀ(X_iB̂w_i − Y_i)w_iᵀ,   (14)
∇_{w_i} L_N = (1/mn) B̂ᵀX_iᵀ(X_iB̂w_i − Y_i),   (15)

and since (1/m)B̂ᵀX_iᵀX_iB̂ is invertible with high probability by Lemma A.4, solving for the minimizer gives us

w_i = ((1/m)B̂ᵀX_iᵀX_iB̂)⁻¹ (1/m)B̂ᵀX_iᵀY_i.   (16)

Thus our update scheme with stepsize η is, for all i ∈ [n],

w⁺_i = ((1/m)B̂ᵀX_iᵀX_iB̂)⁻¹ (1/m)B̂ᵀX_iᵀY_i,   (17)
B⁺ = B̂ − (η/mn) Σ_{i=1}^n X_iᵀ(X_iB̂w⁺_i − Y_i)(w⁺_i)ᵀ,   (18)
B̂⁺, R⁺ = QR(B⁺),   (19)

where QR denotes the QR decomposition and Y_i = X_iB̂*w*_i + Z_i.

Lemma A.2. For every client i the update for w_i can be expressed as follows:

w⁺_i = B̂ᵀB̂*w*_i + F_i + G_i,   (20)

where F_i and G_i are defined in equations (24) and (25), respectively.

Proof. Further expanding (17) we can write

w⁺_i = ((1/m)B̂ᵀX_iᵀX_iB̂)⁻¹ (1/m)B̂ᵀX_iᵀX_iB̂*w*_i + ((1/m)B̂ᵀX_iᵀX_iB̂)⁻¹ (1/m)B̂ᵀX_iᵀZ_i   (21)
= B̂ᵀB̂*w*_i + ((1/m)B̂ᵀX_iᵀX_iB̂)⁻¹ ((1/m)B̂ᵀX_iᵀX_iB̂* − ((1/m)B̂ᵀX_iᵀX_iB̂)B̂ᵀB̂*)w*_i + ((1/m)B̂ᵀX_iᵀX_iB̂)⁻¹ (1/m)B̂ᵀX_iᵀZ_i   (22)
= B̂ᵀB̂*w*_i + F_i + G_i,   (23)

where we define

F_i := ((1/m)B̂ᵀX_iᵀX_iB̂)⁻¹ ((1/m)B̂ᵀX_iᵀX_iB̂* − ((1/m)B̂ᵀX_iᵀX_iB̂)B̂ᵀB̂*)w*_i,   (24)
G_i := ((1/m)B̂ᵀX_iᵀX_iB̂)⁻¹ (1/m)B̂ᵀX_iᵀZ_i.   (25)

We further have the following immediate corollary.

Corollary A.3. Let W⁺, F and G be the matrices whose rows are the concatenations of the w⁺_i, F_i and G_i, respectively. Then

W⁺ = W*B̂*ᵀB̂ + F + G.   (26)

Our first goal is to control the norm of w⁺_i. In order to achieve that, we provide lemmas that bound the norms of F_i and G_i, extending the analysis in Collins et al. (2021) and Jain et al. (2013).

Lemma A.4. Let δ = c k^{3/2}√(log(N)/m) for some absolute constant c. Then with probability at least 1 − exp(−115k³ log(N)),

σ_min((1/m)B̂ᵀX_iᵀX_iB̂) ≥ 1 − δ for all i ∈ [n].   (27)

It follows that with the same probability,

σ_max(((1/m)B̂ᵀX_iᵀX_iB̂)⁻¹) ≤ 1/(1 − δ) for all i ∈ [n].   (28)

Proof.
First notice that we can rewrite
$$\frac{1}{m}\hat B^\top X_i^\top X_i\hat B = \sum_{j=1}^m \Big(\frac{1}{\sqrt m}\hat B^\top x_i^j\Big)\Big(\frac{1}{\sqrt m}\hat B^\top x_i^j\Big)^\top.$$
For all $i\in[n]$, $j\in[m]$ we define $v_i^j := \frac{1}{\sqrt m}\hat B^\top x_i^j$, such that each $v_i^j$ is an i.i.d. $\frac{1}{\sqrt m}\|\hat B\|_2$-subgaussian random vector (please see Definition A.1), and thus by equation (4.22) (Theorem 4.6.1) in Vershynin (2018) we obtain, for any $m\geq k$ and $l\geq 0$,
$$\Big\|\frac{1}{m}\hat B^\top X_i^\top X_i\hat B - I_k\Big\|_2 \leq c_1\,\frac{\sqrt k + l}{\sqrt m}$$
with probability at least $1-\exp(-l^2)$, for some absolute constant $c_1$. We set $l = 12k^{3/2}\sqrt{\log(N)} - \sqrt k$ and $\delta_1 := 12c_1 k^{3/2}\sqrt{\log(N)/m}$, and the above bound becomes
$$\Big\|\frac{1}{m}\hat B^\top X_i^\top X_i\hat B - I_k\Big\|_2 \leq \delta_1 \qquad (31)$$
with probability at least $1-\exp\big(-(12k^{3/2}\sqrt{\log(N)}-\sqrt k)^2\big)$. Further notice that
$$\exp\big(-(12k^{3/2}\sqrt{\log(N)}-\sqrt k)^2\big) = \exp\big(-(144k^3\log(N) - 24k^2\sqrt{\log(N)} + k)\big) \leq \exp\big(-120k^3\log(N)\big). \qquad (32)\text{–}(33)$$
Thus, taking a union bound over $i\in[n]$, we have that for all $i\in[n]$,
$$\Big\|\frac{1}{m}\hat B^\top X_i^\top X_i\hat B - I_k\Big\|_2 \leq \delta_1 \qquad (34)$$
with probability at least
$$1 - n\exp\big(-120k^3\log(N)\big) \geq 1-\exp\big(-115k^3\log(N)\big). \qquad (35)$$
Choosing $c$ sufficiently large derives the statement of the lemma. □

Lemma A.5. Let $H_i := \frac{1}{m}\hat B^\top X_i^\top X_i\big(\hat B\hat B^\top - I_d\big)\hat B^*$ and $\delta := c\,k^{3/2}\sqrt{\log(N)/m}$, for an absolute constant $c$. Then with probability at least $1-\exp(-115k^2\log(N))$ we have
$$\sum_{i=1}^n \big\|H_i w_i^*\big\|_2^2 \leq \delta^2\,\|W^*\|_2^2\,\mathrm{dist}^2\big(\hat B,\hat B^*\big). \qquad (36)$$

Proof. In order to argue about the quantity $H_i = \frac{1}{m}\hat B^\top X_i^\top X_i(\hat B\hat B^\top - I_d)\hat B^*$, we define the matrix $U := \frac{1}{\sqrt m}X_i(\hat B\hat B^\top - I_d)\hat B^*$, such that its $j$-th row $u_j = \frac{1}{\sqrt m}\hat B^{*\top}(\hat B\hat B^\top - I_d)x_i^j$ is subgaussian with norm at most $\frac{1}{\sqrt m}\|\hat B^{*\top}(\hat B\hat B^\top - I_d)\|_2$. Similarly we define $V := \frac{1}{\sqrt m}X_i\hat B$, such that its $j$-th row $v_j = \frac{1}{\sqrt m}\hat B^\top x_i^j$ has norm at most $\frac{1}{\sqrt m}\|\hat B\|_2$. We are now ready to use a concentration argument similar to Proposition 4.4.5 in Vershynin (2018). Let $S^{k-1}$ denote the unit sphere in $k$ dimensions and $N_k$ the $1/4$-net of cardinality $9^k$.
From equation (4.13) in Vershynin (2018) we have
$$\|H_i\|_2 = \Big\|\frac{1}{m}\hat B^\top X_i^\top X_i\big(\hat B\hat B^\top - I_d\big)\hat B^*\Big\|_2 = \big\|U^\top V\big\|_2 \leq 2\max_{p,y\in N_k}\Big|\sum_{j=1}^m \langle p, u_j\rangle\langle v_j, y\rangle\Big|. \qquad (38)$$
By definition $\langle p,u_j\rangle$ and $\langle v_j,y\rangle$ are subgaussian with norms $\frac{1}{\sqrt m}\|\hat B^{*\top}(\hat B\hat B^\top - I_d)\|_2 = \frac{1}{\sqrt m}\mathrm{dist}(\hat B,\hat B^*)$ and $\frac{1}{\sqrt m}\|\hat B\|_2 = \frac{1}{\sqrt m}$, respectively, and thus for all $j\in[m]$ the product $\langle p,u_j\rangle\langle v_j,y\rangle$ is subexponential with norm at most $\frac{C'}{m}\,\mathrm{dist}(\hat B,\hat B^*)$, for some constant $C'$. Note that
$$\mathbb{E}\big[\langle p,u_j\rangle\langle v_j,y\rangle\big] = \frac{1}{m}\,p^\top\hat B^{*\top}\big(\hat B\hat B^\top - I_d\big)\hat B\,y = 0, \qquad (39)$$
and thus we can use Bernstein's inequality to bound the sum of $m$ zero-mean subexponential random variables: for any fixed pair $p,y\in N_k$,
$$\Pr\Big[\Big|\sum_{j=1}^m\langle p,u_j\rangle\langle v_j,y\rangle\Big| \geq s\Big] \leq 2\exp\Big(-c_2\,m\min\Big(\frac{s^2}{\mathrm{dist}^2(\hat B,\hat B^*)},\ \frac{s}{\mathrm{dist}(\hat B,\hat B^*)}\Big)\Big)$$
for a constant $c_2$. Thus, taking a union bound over all the points in the net, we derive
$$\Pr\Big[2\max_{p,y\in N_k}\Big|\sum_{j=1}^m\langle p,u_j\rangle\langle v_j,y\rangle\Big| \geq 2s\Big] \leq 9^{2k}\cdot 2\exp\Big(-c_2\,m\min\Big(\frac{s^2}{\mathrm{dist}^2(\hat B,\hat B^*)},\ \frac{s}{\mathrm{dist}(\hat B,\hat B^*)}\Big)\Big).$$
Since $m > Ck^2\log(N)$, by setting $s = \mathrm{dist}(\hat B,\hat B^*)\sqrt{Ck^2\log(N)/(4m)}$ and using (38) we obtain
$$\|H_i\|_2 \leq \mathrm{dist}\big(\hat B,\hat B^*\big)\sqrt{\frac{Ck^2\log(N)}{m}}$$
with failure probability at most $9^{2k}\exp(-C\,c_2\,k^2\log(N)/4) \leq \exp(-120k^2\log(N))$ (45) for sufficiently large $C$. Using a union bound again over all participating clients, we get that for all $i\in[n]$,
$$\|H_i\|_2 \leq \mathrm{dist}\big(\hat B,\hat B^*\big)\sqrt{\frac{Ck^2\log(N)}{m}} \qquad (46)$$
with probability at least $1-n\exp(-120k^2\log(N)) \geq 1-\exp(-115k^2\log(N))$ (47). The above also implies that with the same probability
$$\sum_{i=1}^n\|H_i\|_2^2 \leq C\,n\,\mathrm{dist}^2\big(\hat B,\hat B^*\big)\frac{k^2\log(N)}{m}. \qquad (48)$$
Finally, notice that
$$\sum_{i=1}^n\big\|H_i w_i^*\big\|_2^2 \leq k\sum_{i=1}^n\|H_i\|_2^2 \leq C\,n\,\mathrm{dist}^2\big(\hat B,\hat B^*\big)\frac{k^3\log(N)}{m} \leq C\,\|W^*\|_2^2\,\mathrm{dist}^2\big(\hat B,\hat B^*\big)\frac{k^3\log(N)}{m}, \qquad (49)\text{–}(50)$$
where we used Assumption 4.6 (each $\|w_i^*\|_2^2 = k$) and the fact that the rank of $W^*$ is $k$, which gives $n = \|W^*\|_F^2/k \leq \|W^*\|_2^2$. Combining this with (48) and choosing a sufficiently large $c$, we derive the result. □

Building on the previous lemmas we can now bound the norm of $F_i$.

Lemma A.6. Let $\delta := c\,k^{3/2}\sqrt{\log(N)/m}$ for some absolute constant $c$ and for all $i\in[n]$ let $F_i$ be given by (24). Further let the matrix $F\in\mathbb{R}^{n\times k}$ be such that its rows are the concatenation of the $F_i$'s.
Then with probability at least $1-\exp(-110k^2\log(N))$ we have
$$\|F_i\|_2 \leq \frac{\delta}{1-\delta}\,\mathrm{dist}\big(\hat B,\hat B^*\big)\,\|w_i^*\|_2 \quad \forall i\in[n], \qquad (51)$$
$$\|F\|_F \leq \frac{\delta}{1-\delta}\,\mathrm{dist}\big(\hat B,\hat B^*\big)\,\|W^*\|_2. \qquad (52)$$

Proof. Notice that $F_i = -\big(\frac{1}{m}\hat B^\top X_i^\top X_i\hat B\big)^{-1}H_i w_i^*$, hence
$$\|F_i\|_2^2 \leq \Big\|\Big(\frac{1}{m}\hat B^\top X_i^\top X_i\hat B\Big)^{-1}\Big\|_2^2\,\|H_i\|_2^2\,\|w_i^*\|_2^2 \qquad (53)$$
$$\leq \frac{\delta^2}{(1-\delta)^2}\,\mathrm{dist}^2\big(\hat B,\hat B^*\big)\,\|w_i^*\|_2^2, \qquad (54)$$
which holds for all $i\in[n]$ with probability at least $1-\exp(-110k^2\log(N))$, by a union bound on the failure probabilities of (28) and (47). Similarly, we have
$$\|F\|_F^2 \leq \max_{i\in[n]}\Big\|\Big(\frac{1}{m}\hat B^\top X_i^\top X_i\hat B\Big)^{-1}\Big\|_2^2\,\sum_{i=1}^n\big\|H_i w_i^*\big\|_2^2 \qquad (55)\text{–}(56)$$
$$\leq \frac{\delta^2}{(1-\delta)^2}\,\mathrm{dist}^2\big(\hat B,\hat B^*\big)\,\|W^*\|_2^2, \qquad (57)$$
which holds with probability at least $1-\exp(-110k^2\log(N))$, taking a union bound on the failure probabilities of (28) and (36). □

We now turn our attention to deriving a bound for $\|G_i\|_2$.

Lemma A.7. Let $\delta := c\,k^{3/2}\sqrt{\log(N)/m}$ for some absolute constant $c$ and for all $i\in[n]$ let $G_i$ be given by (25). Further let the matrix $G\in\mathbb{R}^{n\times k}$ be such that its rows are the concatenation of the $G_i$'s. Then with probability at least $1-\exp(-110k^2\log(N))$ we have
$$\|G_i\|_2 \leq \frac{\delta}{1-\delta}\,\sigma^2 \quad \forall i\in[n], \qquad (58)$$
$$\|G\|_F \leq \frac{\delta}{1-\delta}\,\sqrt n\,\sigma^2. \qquad (59)$$

Proof. Notice that we can write
$$G_i = \Big(\frac{1}{m}\hat B^\top X_i^\top X_i\hat B\Big)^{-1}\frac{1}{m}\hat B^\top X_i^\top Z_i = \Big(\frac{1}{m}\hat B^\top X_i^\top X_i\hat B\Big)^{-1}\frac{1}{m}\sum_{j=1}^m z_i^j\,\hat B^\top x_i^j, \qquad (60)$$
and since $z_i^j\sim\mathcal N(0,\sigma^2)$, we can conclude that for all $i,j$ the vectors $z_i^j\,\hat B^\top x_i^j$ are i.i.d. zero-mean subexponential with norm at most $C_2'\sigma^2\|\hat B\|_2 = C_2'\sigma^2$, for some constant $C_2'$. Once again we denote by $S^{k-1}$ the unit sphere in $k$ dimensions and by $N_k$ the $1/4$-net of cardinality $9^k$. Using Bernstein's inequality and a union bound over all the points of the net, we follow the derivations of Lemma A.5 to get
$$\Pr\Big[\Big\|\frac{1}{m}\sum_{j=1}^m z_i^j\,\hat B^\top x_i^j\Big\|_2 \geq 2s\Big] \leq 9^{k+1}\exp\Big(-c_3\,m\min\Big(\frac{s^2}{\sigma^4},\ \frac{s}{\sigma^2}\Big)\Big). \qquad (61)$$
Since $m > C_2 k^2\log(N)$, by setting $s = \sigma^2\sqrt{C_2 k^2\log(N)/(4m)}$ we derive
$$\Big\|\frac{1}{m}\sum_{j=1}^m z_i^j\,\hat B^\top x_i^j\Big\|_2 \leq \sigma^2\sqrt{\frac{C_2 k^2\log(N)}{m}} \qquad (62)$$
with failure probability at most
$$9^{k+1}\exp\big(-C_2\,c_3\,k^2\log(N)/4\big) \leq \exp\big(-115k^2\log(N)\big) \qquad (63)\text{–}(64)$$
for sufficiently large $C_2$.
Choosing $c$ large enough and taking a union bound over all $i\in[n]$, we obtain
$$\Big\|\frac{1}{m}\sum_{j=1}^m z_i^j\,\hat B^\top x_i^j\Big\|_2 \leq \delta\sigma^2 \quad \forall i\in[n] \qquad (65)$$
with probability at least $1-n\exp(-115k^2\log(N)) \geq 1-\exp(-113k^2\log(N))$ (66). Finally, taking a union bound over the failure probabilities of (28) and (66), we get
$$\|G_i\|_2 \leq \Big\|\Big(\frac{1}{m}\hat B^\top X_i^\top X_i\hat B\Big)^{-1}\Big\|_2\,\Big\|\frac{1}{m}\sum_{j=1}^m z_i^j\,\hat B^\top x_i^j\Big\|_2 \leq \frac{\delta}{1-\delta}\,\sigma^2 \qquad (68)$$
with probability at least $1-\exp(-110k^2\log(N))$. It follows that with the same probability
$$\|G\|_F^2 = \sum_{i=1}^n\|G_i\|_2^2 \leq n\Big(\frac{\delta}{1-\delta}\,\sigma^2\Big)^2. \qquad (69)$$
□

For all $i\in[n]$ we define $q_i := \hat B w_i^+ - \hat B^* w_i^*$. The following lemma provides upper bounds on the norms of $w_i^+$ and $q_i$.

Lemma A.8. Let $\delta := c\,k^{3/2}\sqrt{\log(N)/m}$ for some absolute constant $c$ and $\hat\delta = \delta/(1-\delta)$. Then with probability at least $1-\exp(-105k^2\log(N))$ we have
$$\|w_i^+\|_2 \leq 2\sqrt k + \sigma^2\hat\delta \quad \forall i\in[n]. \qquad (70)$$
Further, with probability at least $1-\exp(-105k^2\log(N))$ we have
$$\|q_i\|_2 \leq 2\sqrt k\,\mathrm{dist}\big(\hat B,\hat B^*\big) + \sigma^2\hat\delta \quad \forall i\in[n]. \qquad (71)$$

Proof. We have
$$\|w_i^+\|_2 \leq \|\hat B\|_2\|\hat B^*\|_2\|w_i^*\|_2 + \|F_i\|_2 + \|G_i\|_2 \qquad (72)$$
$$\leq \|w_i^*\|_2 + \hat\delta\,\|w_i^*\|_2\,\mathrm{dist}\big(\hat B,\hat B^*\big) + \hat\delta\sigma^2 \qquad (73)$$
$$\leq 2\sqrt k + \hat\delta\sigma^2, \qquad (74)$$
where the first inequality comes from (20) and the third from Assumption 4.6. For the second inequality we take a union bound over the failure probabilities of (51) and (58), and thus the above result holds with probability at least $1-\exp(-107k^2\log(N))$. Taking a union bound over all $i\in[n]$, we get that with probability at least $1-\exp(-105k^2\log(N))$,
$$\|w_i^+\|_2 \leq 2\sqrt k + \sigma^2\hat\delta \quad \forall i\in[n], \qquad (75)$$
and the first result of the lemma follows. For the second part we have
$$\|q_i\|_2 = \big\|\hat B w_i^+ - \hat B^* w_i^*\big\|_2 = \big\|\hat B\hat B^\top\hat B^* w_i^* + \hat B F_i + \hat B G_i - \hat B^* w_i^*\big\|_2 \qquad (76)$$
$$\leq \big\|\big(\hat B\hat B^\top - I_d\big)\hat B^* w_i^*\big\|_2 + \|\hat B F_i\|_2 + \|\hat B G_i\|_2 \qquad (77)$$
$$\leq \big\|\big(\hat B\hat B^\top - I_d\big)\hat B^*\big\|_2\,\|w_i^*\|_2 + \|F_i\|_2 + \|G_i\|_2 \qquad (78)$$
$$\leq \mathrm{dist}\big(\hat B,\hat B^*\big)\,\|w_i^*\|_2 + \mathrm{dist}\big(\hat B,\hat B^*\big)\,\hat\delta\,\|w_i^*\|_2 + \sigma^2\hat\delta \qquad (79)$$
$$\leq 2\sqrt k\,\mathrm{dist}\big(\hat B,\hat B^*\big) + \sigma^2\hat\delta, \qquad (80)$$
where the first equality comes from (20). For the fourth inequality we take a union bound over the failure probabilities of (51) and (58), and thus the above result holds with probability at least $1-\exp(-107k^2\log(N))$. Taking a union bound over all $i\in[n]$, we get that with probability at least $1-\exp(-105k^2\log(N))$,
$$\|q_i\|_2 \leq 2\sqrt k\,\mathrm{dist}\big(\hat B,\hat B^*\big) + \sigma^2\hat\delta \quad \forall i\in[n]. \qquad (81)$$
□

Lemma A.9.
Let $\delta := c\,k^{3/2}\sqrt{\log(N)/m}$ for some absolute constant $c$ and $\hat\delta = \delta/(1-\delta)$. Then with probability at least $1-\exp(-105d)-\exp(-105k^2\log(N))$ we have
$$\Big\|\frac{1}{mn}\sum_{i=1}^n X_i^\top Z_i w_i^{+\top}\Big\|_2 \leq c\,\sigma^2\big(2\sqrt k + \hat\delta\sigma^2\big)\sqrt{\frac{d}{mn}}. \qquad (82)$$

Proof. Let $S^{d-1}$, $S^{k-1}$ denote the unit spheres in $d$ and $k$ dimensions and $N_d$, $N_k$ the $1/4$-nets of cardinality $9^d$ and $9^k$, respectively. By equation (4.13) in Vershynin (2018) we have
$$\Big\|\frac{1}{mn}\sum_{i=1}^n X_i^\top Z_i w_i^{+\top}\Big\|_2 \leq 2\max_{p\in N_d,\,y\in N_k}\,p^\top\Big(\sum_{i=1}^n\frac{1}{mn}X_i^\top Z_i w_i^{+\top}\Big)y = 2\max_{p\in N_d,\,y\in N_k}\sum_{i=1}^n\sum_{j=1}^m\frac{z_i^j}{mn}\big\langle x_i^j, p\big\rangle\big\langle w_i^+, y\big\rangle. \qquad (85)$$
Notice that for any fixed $p,y$ and $i\in[n]$, $j\in[m]$, the random variables $\frac{z_i^j}{mn}\langle x_i^j,p\rangle\langle w_i^+,y\rangle$ are i.i.d. zero-mean subexponential with norm at most $\frac{C_3'\sigma^2\|w_i^+\|_2}{mn}$, for some absolute constant $C_3'$. Conditioning on the event
$$E_1 := \Big\{\forall i\in[n]:\ \|w_i^+\|_2 \leq 2\sqrt k + \hat\delta\sigma^2\Big\}, \qquad (86)$$
which holds with probability at least $1-\exp(-105k^2\log(N))$ by Lemma A.8, we can invoke Bernstein's inequality to get
$$\Pr\Big[\Big|\sum_{i,j}\frac{z_i^j}{mn}\langle x_i^j,p\rangle\langle w_i^+,y\rangle\Big| \geq s\Big] \leq 2\exp\Big(-c_4\,mn\,\min\Big(\frac{s^2}{\sigma^4\big(2\sqrt k+\sigma^2\hat\delta\big)^2},\ \frac{s}{\sigma^2\big(2\sqrt k+\sigma^2\hat\delta\big)}\Big)\Big). \qquad (87)$$
Since $m > \frac{d\,C_3}{n}$, by setting $s = \sigma^2\big(2\sqrt k+\hat\delta\sigma^2\big)\sqrt{\frac{C_3\,d}{8mn}}$, the failure probability above simplifies to at most
$$\exp(-C_3\,c_4\,d) \leq \exp(-110d) \qquad (89)\text{–}(90)$$
for $C_3$ large enough. Taking a union bound over all points $p,y$ of $N_d$, $N_k$ and using (85), we derive a conditional failure probability of at most
$$9^{d+k}\exp(-110d) \leq \exp(-105d), \qquad (91)\text{–}(92)$$
and removing the conditioning on $E_1$ we get a total failure probability of at most
$$\exp(-105d) + \Pr\big[E_1^C\big] \leq \exp(-105d) + \exp\big(-105k^2\log(N)\big). \qquad (93)\text{–}(94)$$
Choosing $c$ large enough and taking the complementary event derives the result. □

Lemma A.10. Let $\delta := c\,k^{3/2}\sqrt{\log(N)/m}$ for some absolute constant $c$ and $\hat\delta = \delta/(1-\delta)$. Then with probability at least $1-\exp(-100d)-\exp(-100k^2\log(N))$ we have
$$\Big\|\frac{1}{n}\sum_{i=1}^n\Big(\frac{1}{m}X_i^\top X_i\big(\hat B w_i^+ - \hat B^* w_i^*\big) - \big(\hat B w_i^+ - \hat B^* w_i^*\big)\Big)w_i^{+\top}\Big\|_2 \leq c\sqrt{\frac{d}{mn}}\Big(k\,\mathrm{dist}\big(\hat B,\hat B^*\big) + \sqrt k\,\hat\delta\sigma^2 + \big(\hat\delta\sigma^2\big)^2\Big). \qquad (95)$$

Proof. Let us define the event
$$E_2 := \Big\{\forall i\in[n]:\ \|w_i^+\|_2 \leq 2\sqrt k+\hat\delta\sigma^2\ \ \text{and}\ \ \|q_i\|_2 \leq 2\sqrt k\,\mathrm{dist}\big(\hat B,\hat B^*\big) + \hat\delta\sigma^2\Big\}, \qquad (96)$$
which happens with probability at least $1-\exp(-100k^2\log(N))$ by a union bound and Lemma A.8. For the rest of this proof we work conditioning on event $E_2$.
Recall that $q_i := \hat B w_i^+ - \hat B^* w_i^*$, and thus we can write
$$\frac{1}{n}\sum_{i=1}^n\Big(\frac{1}{m}X_i^\top X_i q_i - q_i\Big)w_i^{+\top} = \frac{1}{mn}\sum_{i=1}^n\sum_{j=1}^m\Big(\big\langle x_i^j, q_i\big\rangle x_i^j - q_i\Big)w_i^{+\top}.$$
Let $S^{d-1}$, $S^{k-1}$ denote the unit spheres in $d$ and $k$ dimensions and $N_d$, $N_k$ the $1/4$-nets of cardinality $9^d$ and $9^k$, respectively. By equation (4.13) in Vershynin (2018) we have
$$\Big\|\frac{1}{mn}\sum_{i,j}\Big(\langle x_i^j,q_i\rangle x_i^j - q_i\Big)w_i^{+\top}\Big\|_2 \leq 2\max_{p\in N_d,\,y\in N_k}\Big|\frac{1}{mn}\sum_{i,j}\Big(\langle x_i^j,q_i\rangle\langle p, x_i^j\rangle\langle w_i^+,y\rangle - \langle p,q_i\rangle\langle w_i^+,y\rangle\Big)\Big|.$$
Notice that for any fixed $p,y$, the $\langle x_i^j,q_i\rangle$ are i.i.d. subgaussian with norm at most $c_1'\|q_i\|_2$ and the $\langle p,x_i^j\rangle$ are i.i.d. subgaussian with norm at most $c_2'\|p\|_2 = c_2'$. Hence, under the event $E_2$, the products $\frac{1}{mn}\langle x_i^j,q_i\rangle\langle p,x_i^j\rangle\langle w_i^+,y\rangle$ are subexponential with norm at most $\frac{C_4'}{mn}\,K$, where $K := k\,\mathrm{dist}(\hat B,\hat B^*)+\sqrt k\,\hat\delta\sigma^2+(\hat\delta\sigma^2)^2$, for some constant $C_4'$. Also note that
$$\mathbb{E}\Big[\langle x_i^j,q_i\rangle\langle p,x_i^j\rangle\langle w_i^+,y\rangle - \langle p,q_i\rangle\langle w_i^+,y\rangle\Big] = 0, \qquad (101)$$
and thus applying Bernstein's inequality we get
$$\Pr\Big[\Big|\frac{1}{mn}\sum_{i,j}\Big(\langle x_i^j,q_i\rangle\langle p,x_i^j\rangle\langle w_i^+,y\rangle - \langle p,q_i\rangle\langle w_i^+,y\rangle\Big)\Big| \geq s\Big] \leq 2\exp\Big(-c_5\,mn\,\min\Big(\frac{s^2}{K^2},\ \frac{s}{K}\Big)\Big). \qquad (102)$$
Since $m > \frac{d\,C_4}{n}$, by setting $s = C_4\,K\sqrt{\frac{d}{mn}}$ and taking a union bound over all $p\in N_d$, $y\in N_k$, we derive a failure probability of at most
$$9^{d+k}\exp(-C_4\,c_5\,d) \leq 9^{d+k}\exp(-120d) \leq \exp(-100d), \qquad (103)\text{–}(105)$$
choosing a large enough constant $C_4$. Recall that $\Pr[E_2^C] \leq \exp(-100k^2\log(N))$. Hence, by removing the conditioning on $E_2$, we get that with probability at least $1-\exp(-100d)-\exp(-100k^2\log(N))$,
$$\Big\|\frac{1}{n}\sum_{i=1}^n\Big(\frac{1}{m}X_i^\top X_i q_i - q_i\Big)w_i^{+\top}\Big\|_2 \leq c\sqrt{\frac{d}{mn}}\Big(k\,\mathrm{dist}\big(\hat B,\hat B^*\big)+\sqrt k\,\hat\delta\sigma^2+\big(\hat\delta\sigma^2\big)^2\Big) \qquad (106)$$
for sufficiently large $c$. □

Having set all the building blocks, we now proceed to the proof of Theorem 4.7.

Theorem 4.7. Let Assumptions 4.3–4.6 hold.
Further, let the following inequalities hold for the number of participating nodes and the batch size, respectively: $n \geq n_0$ and $m \geq c_0\frac{(1+\sigma^2)k^3\kappa^4}{E_0^2}\max\{\log(N),\ d/n_0\}$, for some absolute constant $c_0$. Then FedRep-SRPFL with stepsize $\eta \leq \frac{1}{8\bar\sigma^2_{\max,*}}$ satisfies the following contraction inequality:
$$\mathrm{dist}\big(B^{t+1}, B^*\big) \leq \mathrm{dist}\big(B^t, B^*\big)\sqrt{1-a} + \frac{a}{\sqrt{\frac{n}{n_0}(1-a)}}, \qquad (10)$$
w.p. at least $1 - T\exp\big(-90\min\{d,\,k^2\log(N)\}\big)$, where $a = \frac{1}{2}\eta E_0\bar\sigma^2_{\min,*} \leq \frac14$.

Proof. First let us recall the definition $\delta := c\,k^{3/2}\sqrt{\log(N)/m}$ for some absolute constant $c$, and $\hat\delta = \delta/(1-\delta)$. Further notice that for our choice of $m$ and sufficiently large $c_0$ we have the following useful inequality:
$$\hat\delta = \frac{\delta}{1-\delta} \leq 2\delta \leq \frac{E_0}{20\,\kappa^2}\cdot\frac{1}{1+\sigma^2}. \qquad (107)$$
From the update scheme of our algorithm (18), and using $Y_i = X_i\hat B^* w_i^* + Z_i$, we have
$$B^+ = \hat B - \frac{\eta}{mn}\sum_{i=1}^n X_i^\top X_i\big(\hat B w_i^+ - \hat B^* w_i^*\big)w_i^{+\top} + \frac{\eta}{mn}\sum_{i=1}^n X_i^\top Z_i w_i^{+\top} \qquad (108)$$
$$= \hat B - \frac{\eta}{n}\sum_{i=1}^n\big(\hat B w_i^+ - \hat B^* w_i^*\big)w_i^{+\top} - \frac{\eta}{n}\sum_{i=1}^n\Big(\frac{1}{m}X_i^\top X_i\big(\hat B w_i^+ - \hat B^* w_i^*\big) - \big(\hat B w_i^+ - \hat B^* w_i^*\big)\Big)w_i^{+\top} + \frac{\eta}{mn}\sum_{i=1}^n X_i^\top Z_i w_i^{+\top}, \qquad (109)$$
where we added and subtracted terms. Multiplying both sides by $\hat B^{*\top}_\perp$ (the orthogonal complement of $\hat B^*$), we get
$$\hat B^{*\top}_\perp B^+ = \hat B^{*\top}_\perp\hat B\Big(I_k - \frac{\eta}{n}\sum_{i=1}^n w_i^+ w_i^{+\top}\Big) + \eta\,\hat B^{*\top}_\perp\Big[\frac{1}{mn}\sum_{i=1}^n X_i^\top Z_i w_i^{+\top} - \frac{1}{n}\sum_{i=1}^n\Big(\frac{1}{m}X_i^\top X_i q_i - q_i\Big)w_i^{+\top}\Big], \qquad (110)\text{–}(111)$$
where the second equality holds since $\hat B^{*\top}_\perp\hat B^* = 0$. Recall that from the QR decomposition of $B^+$ we have $B^+ = \hat B^+ R^+$.
Hence, multiplying by $(R^+)^{-1}$ and taking norms on both sides, we derive
$$\mathrm{dist}\big(\hat B^+,\hat B^*\big) \leq \Big(\Big\|\hat B^{*\top}_\perp\hat B\Big(I_k-\frac{\eta}{n}\sum_{i=1}^n w_i^+w_i^{+\top}\Big)\Big\|_2 + \eta\Big\|\frac{1}{mn}\sum_{i=1}^n X_i^\top Z_i w_i^{+\top}\Big\|_2 + \eta\Big\|\frac{1}{n}\sum_{i=1}^n\Big(\frac{1}{m}X_i^\top X_i q_i - q_i\Big)w_i^{+\top}\Big\|_2\Big)\,\big\|(R^+)^{-1}\big\|_2.$$
Let us define
$$A_1 := \mathrm{dist}\big(\hat B,\hat B^*\big)\,\Big\|I_k - \frac{\eta}{n}\sum_{i=1}^n w_i^+ w_i^{+\top}\Big\|_2,\qquad A_2 := \eta\,\Big\|\frac{1}{mn}\sum_{i=1}^n X_i^\top Z_i w_i^{+\top}\Big\|_2,\qquad A_3 := \eta\,\Big\|\frac{1}{n}\sum_{i=1}^n\Big(\frac{1}{m}X_i^\top X_i q_i - q_i\Big)w_i^{+\top}\Big\|_2,$$
so that the following inequality holds:
$$\mathrm{dist}\big(\hat B^+,\hat B^*\big) \leq (A_1+A_2+A_3)\,\big\|(R^+)^{-1}\big\|_2. \qquad (116)$$
For the rest of the proof we will work conditioning on the intersection of the events
$$E_2 := \Big\{\forall i\in[n]:\ \|w_i^+\|_2 \leq 2\sqrt k+\hat\delta\sigma^2\ \text{and}\ \|q_i\|_2 \leq 2\sqrt k\,\mathrm{dist}\big(\hat B,\hat B^*\big)+\hat\delta\sigma^2\Big\}, \qquad (117)$$
$$E_3 := \Big\{\|F\|_F \leq \mathrm{dist}\big(\hat B,\hat B^*\big)\,\hat\delta\,\|W^*\|_2\ \text{and}\ \|G\|_F \leq \hat\delta\sqrt n\,\sigma^2\Big\}, \qquad (118)$$
$$E_4 := \Big\{\Big\|\frac{1}{mn}\sum_{i=1}^n X_i^\top Z_i w_i^{+\top}\Big\|_2 \leq c\,\sigma^2\big(2\sqrt k+\hat\delta\sigma^2\big)\sqrt{\tfrac{d}{mn}}\Big\}, \qquad (119)$$
$$E_5 := \Big\{\Big\|\frac{1}{n}\sum_{i=1}^n\Big(\frac{1}{m}X_i^\top X_i q_i - q_i\Big)w_i^{+\top}\Big\|_2 \leq c\sqrt{\tfrac{d}{mn}}\Big(k\,\mathrm{dist}\big(\hat B,\hat B^*\big)+\sqrt k\,\hat\delta\sigma^2+\big(\hat\delta\sigma^2\big)^2\Big)\Big\}, \qquad (120)$$
which happens with probability at least $1-\exp(-90d)-\exp(-90k^2\log(N))$, by a union bound on the failure probabilities of (52), (59), (82), (95) and (96). We will now provide bounds for each of the terms of interest in (116), starting from $A_1$. Notice that by (26) we have
$$\lambda_{\max}\big(W^{+\top}W^+\big) = \|W^+\|_2^2 = \big\|W^*\hat B^{*\top}\hat B + F + G\big\|_2^2 \leq 2\|W^*\|_2^2 + 4\|F\|_2^2 + 4\|G\|_2^2 \qquad (122)$$
$$\leq 2\|W^*\|_2^2 + 4\,\mathrm{dist}^2\big(\hat B,\hat B^*\big)\,\hat\delta^2\|W^*\|_2^2 + 4\hat\delta^2\sigma^4 n \qquad (123)$$
$$\leq 4\big(\|W^*\|_2^2 + n\big) \qquad (124)$$
$$\leq 4n\big(\bar\sigma^2_{\max,*}+1\big), \qquad (125)$$
where in the last inequality we use the fact that $\|W^*\|_2 = \sqrt n\,\bar\sigma_{\max,*}$. Since $\eta < \big(4(\bar\sigma^2_{\max,*}+1)\big)^{-1}$, the matrix $I_k - \frac{\eta}{n}W^{+\top}W^+$ is positive definite. Thus we have
$$\Big\|I_k-\frac{\eta}{n}W^{+\top}W^+\Big\|_2 \leq 1 - \frac{\eta}{n}\lambda_{\min}\big(W^{+\top}W^+\big). \qquad (126)$$
Expanding $W^{+\top}W^+ = \big(W^*\hat B^{*\top}\hat B+F+G\big)^\top\big(W^*\hat B^{*\top}\hat B+F+G\big)$ (127) and bounding the cross terms, we obtain
$$\lambda_{\min}\big(W^{+\top}W^+\big) \geq \sigma^2_{\min}\big(W^*\big)\,\sigma^2_{\min}\big(\hat B^{*\top}\hat B\big) - 2\big(\sigma_{\max}(F)+\sigma_{\max}(G)\big)\|W^*\|_2 - 2\,\sigma_{\max}(F)\,\sigma_{\max}(G), \qquad (128)\text{–}(129)$$
and hence
$$\Big\|I_k-\frac{\eta}{n}W^{+\top}W^+\Big\|_2 \leq 1 - \eta\,\bar\sigma^2_{\min,*}\,\sigma^2_{\min}\big(\hat B^{*\top}\hat B\big) + \frac{2\eta}{n}\big(\|F\|_2+\|G\|_2\big)\|W^*\|_2 + \frac{2\eta}{n}\|F\|_2\|G\|_2, \qquad (130)$$
where we used that the norms of $\hat B$ and $\hat B^*$ are $1$ since the matrices are orthonormal, and $\sqrt n\,\bar\sigma_{\min,*} \leq \sigma_{\min}(W^*)$.
Recall that we operate under $E_3$, and thus we can further write
$$\Big\|I_k-\frac{\eta}{n}W^{+\top}W^+\Big\|_2 \leq 1 - \eta\,\bar\sigma^2_{\min,*}\,\sigma^2_{\min}\big(\hat B^{*\top}\hat B\big) + 2\eta\Big(\mathrm{dist}\big(\hat B,\hat B^*\big)\,\hat\delta\,\frac{\|W^*\|_2^2}{n} + \hat\delta\sigma^2\frac{\|W^*\|_2}{\sqrt n} + \mathrm{dist}\big(\hat B,\hat B^*\big)\,\hat\delta^2\sigma^2\frac{\|W^*\|_2}{\sqrt n}\Big) \qquad (131)\text{–}(132)$$
$$\leq 1 - \eta\,\bar\sigma^2_{\min,*}\,\sigma^2_{\min}\big(\hat B^{*\top}\hat B\big) + 3\eta\,\frac{E_0\,\bar\sigma^2_{\min,*}}{20\,\bar\sigma^2_{\max,*}}\,\bar\sigma^2_{\max,*} \qquad (133)$$
$$\leq 1 - \eta\,\bar\sigma^2_{\min,*}\,\sigma^2_{\min}\big(\hat B^{*\top}\hat B\big) + \frac{1}{6}\eta E_0\bar\sigma^2_{\min,*}, \qquad (134)$$
where we upper bound $\mathrm{dist}(\hat B,\hat B^*)$ by $1$, use $\hat\delta \leq \frac{E_0}{20\kappa^2}\frac{1}{1+\sigma^2}$, and use Assumption 4.4 in the third inequality. Further, by the definition $E_0 := 1 - \mathrm{dist}^2\big(\hat B^0,\hat B^*\big) \leq \sigma^2_{\min}\big(\hat B^{*\top}\hat B\big)$, we have
$$\Big\|I_k-\frac{\eta}{n}W^{+\top}W^+\Big\|_2 \leq 1 - \eta E_0\bar\sigma^2_{\min,*} + \frac{1}{6}\eta E_0\bar\sigma^2_{\min,*}, \qquad (135)$$
and it follows immediately that
$$A_1 \leq \mathrm{dist}\big(\hat B,\hat B^*\big)\Big(1 - \frac{5}{6}\eta E_0\bar\sigma^2_{\min,*}\Big). \qquad (136)$$
Further, since we operate under $E_4$ (119) and $\|\hat B^*_\perp\|_2 = 1$, we have
$$A_2 \leq \eta\,c\,\sigma^2\big(2\sqrt k+\hat\delta\sigma^2\big)\sqrt{\frac{d}{mn}}, \qquad (137)$$
and since we operate under $E_5$ (120) we obtain
$$A_3 \leq \eta\,c\sqrt{\frac{d}{mn}}\Big(k\,\mathrm{dist}\big(\hat B,\hat B^*\big)+\sqrt k\,\hat\delta\sigma^2+\big(\hat\delta\sigma^2\big)^2\Big). \qquad (138)$$
Combining (116), (136), (137) and (138) we get
$$\mathrm{dist}\big(\hat B^+,\hat B^*\big) \leq \Big(\mathrm{dist}\big(\hat B,\hat B^*\big)\Big(1-\frac{5}{6}\eta E_0\bar\sigma^2_{\min,*}\Big) + \eta ck\big(\sqrt k+1\big)\big(\sigma^2+1\big)\sqrt{\frac{d}{mn}}\Big)\,\big\|(R^+)^{-1}\big\|_2. \qquad (139)$$
The last part of the proof focuses on bounding $\|(R^+)^{-1}\|_2$. Let us define
$$S := \sum_{i=1}^n\frac{1}{m}X_i^\top X_i\big(\hat B w_i^+ - \hat B^* w_i^*\big)w_i^{+\top}, \qquad (140)$$
$$E := \sum_{i=1}^n\frac{1}{m}X_i^\top Z_i w_i^{+\top}, \qquad (141)$$
and hence (108) takes the form $B^+ = \hat B - \frac{\eta}{n}S + \frac{\eta}{n}E$, so that
$$B^{+\top}B^+ = \hat B^\top\hat B - \frac{\eta}{n}\big(\hat B^\top S + S^\top\hat B\big) + \frac{\eta}{n}\big(\hat B^\top E + E^\top\hat B\big) - \frac{\eta^2}{n^2}\big(E^\top S + S^\top E\big) + \frac{\eta^2}{n^2}E^\top E + \frac{\eta^2}{n^2}S^\top S. \qquad (143)\text{–}(144)$$
By Weyl's inequality and since $R^{+\top}R^+ = B^{+\top}B^+$, we derive
$$\sigma^2_{\min}\big(R^+\big) \geq 1 - \frac{\eta}{n}\lambda_{\max}\big(\hat B^\top S + S^\top\hat B\big) - \frac{\eta}{n}\lambda_{\max}\big(\hat B^\top E + E^\top\hat B\big) - \frac{\eta^2}{n^2}\lambda_{\max}\big(E^\top S + S^\top E\big). \qquad (145)$$
Let us further define
$$R_1 := \frac{\eta}{n}\lambda_{\max}\big(\hat B^\top S + S^\top\hat B\big),\qquad R_2 := \frac{\eta^2}{n^2}\lambda_{\max}\big(E^\top S + S^\top E\big),\qquad R_3 := \frac{\eta}{n}\lambda_{\max}\big(\hat B^\top E + E^\top\hat B\big), \qquad (146)\text{–}(148)$$
so that we can succinctly rewrite the above inequality as follows:
$$\sigma^2_{\min}\big(R^+\big) \geq 1 - R_1 - R_2 - R_3. \qquad (149)$$
We work to bound separately each of the three terms.
We have
$$R_1 \leq \frac{2\eta}{n}\max_{p:\|p\|_2=1}\big|p^\top\hat B^\top S\,p\big| \qquad (150)$$
$$\leq \max_{p:\|p\|_2=1}\frac{2\eta}{n}\Big|p^\top\hat B^\top\Big[\sum_{i=1}^n\Big(\frac{1}{m}X_i^\top X_i q_i - q_i\Big)w_i^{+\top}\Big]p\Big| + \max_{p:\|p\|_2=1}\frac{2\eta}{n}\Big|p^\top\hat B^\top\Big[\sum_{i=1}^n q_i w_i^{+\top}\Big]p\Big|, \qquad (151)\text{–}(152)$$
and since we operate under $E_5$ (120), the above simplifies to
$$R_1 \leq 3\eta c\sqrt{\frac{d}{mn}}\Big(k\,\mathrm{dist}\big(\hat B,\hat B^*\big)+\sqrt k\,\hat\delta\sigma^2+\big(\hat\delta\sigma^2\big)^2\Big) + \max_{p:\|p\|_2=1}\frac{2\eta}{n}\Big|p^\top\hat B^\top\Big[\sum_{i=1}^n q_i w_i^{+\top}\Big]p\Big|. \qquad (153)\text{–}(154)$$
We focus on the second term, and using (20) we get
$$\frac{2\eta}{n}\Big|p^\top\hat B^\top\Big[\sum_{i=1}^n q_i w_i^{+\top}\Big]p\Big| = \frac{2\eta}{n}\Big|\mathrm{tr}\Big(\sum_{i=1}^n q_i\big(\hat B^\top\hat B^* w_i^* + F_i + G_i\big)^\top pp^\top\hat B^\top\Big)\Big|. \qquad (155)$$
We bound each term separately, and to this end we define
$$T_1 := \frac{2\eta}{n}\Big|\mathrm{tr}\Big(\sum_{i=1}^n q_i\,w_i^{*\top}\hat B^{*\top}\hat B\,pp^\top\hat B^\top\Big)\Big|,\qquad T_2 := \frac{2\eta}{n}\Big|\mathrm{tr}\Big(\sum_{i=1}^n q_i\,F_i^\top pp^\top\hat B^\top\Big)\Big|,\qquad T_3 := \frac{2\eta}{n}\Big|\mathrm{tr}\Big(\sum_{i=1}^n q_i\,G_i^\top pp^\top\hat B^\top\Big)\Big|, \qquad (156)\text{–}(158)$$
such that (155) can be expressed as
$$\frac{2\eta}{n}\Big|p^\top\hat B^\top\Big[\sum_{i=1}^n q_i w_i^{+\top}\Big]p\Big| \leq T_1 + T_2 + T_3. \qquad (159)$$
Further expanding $T_1$, we substitute $q_i = \hat B\hat B^\top\hat B^* w_i^* + \hat B F_i + \hat B G_i - \hat B^* w_i^*$ via (20); since $\hat B^\top\big(\hat B\hat B^\top - I_d\big) = 0$, the term involving $\big(\hat B\hat B^\top-I_d\big)\hat B^*\sum_i w_i^* w_i^{*\top}$ vanishes, and with $\bar F := \sum_{i=1}^n F_i w_i^{*\top}$ and $\bar G := \sum_{i=1}^n G_i w_i^{*\top}$ we obtain
$$T_1 \leq \frac{2\eta}{n}\Big|\mathrm{tr}\Big(\hat B^\top\hat B\big(\bar F + \bar G\big)^\top\hat B^{*\top}\hat B\,pp^\top\Big)\Big| \leq \frac{2\eta}{n}\big(\|F\|_F + \|G\|_F\big)\,\|W^*\|_2, \qquad (160)\text{–}(163)$$
where the inequality is obtained by noticing that the norms of the orthonormal $\hat B$, $\hat B^*$ are one and also $\|pp^\top\|_2 \leq 1$. Conditioning on $E_3$ (118) we can further simplify as follows:
$$T_1 \leq 2\eta\Big(\mathrm{dist}\big(\hat B,\hat B^*\big)\,\hat\delta\,\frac{\|W^*\|_2^2}{n} + \hat\delta\sigma^2\frac{\|W^*\|_2}{\sqrt n}\Big) \leq \frac{1}{10}\eta E_0\bar\sigma^2_{\min,*}. \qquad (164)\text{–}(165)$$
We now turn our attention to $T_2$. Expanding $q_i$ via (20) in the same way, we get
$$T_2 \leq \frac{2\eta}{n}\big(\|F\|_F^2 + \|G\|_F\|F\|_F\big), \qquad (166)\text{–}(169)$$
where in the third equality we used that $\hat B^\top\big(\hat B\hat B^\top - I_d\big) = 0$ and in the fourth that the norms of the orthonormal matrices are $1$, as is the norm of $pp^\top$.
Following the same calculations for $T_3$ we get
$$T_3 \leq \frac{2\eta}{n}\big(\|G\|_F^2 + \|G\|_F\|F\|_F\big), \qquad (170)$$
and thus, summing the two terms, we get the following:
$$T_2 + T_3 \leq \frac{2\eta}{n}\big(\|F\|_F + \|G\|_F\big)^2. \qquad (171)$$
Again conditioning on $E_3$ (118) we derive
$$T_2 + T_3 \leq \frac{2\eta}{n}\big(\hat\delta\,\|W^*\|_2 + \sqrt n\,\hat\delta\sigma^2\big)^2 \leq 2\eta\hat\delta^2\big(\bar\sigma^2_{\max,*} + \sigma^4\big) \leq \frac{1}{10}\eta E_0\bar\sigma^2_{\min,*}. \qquad (172)\text{–}(174)$$
Hence, combining (154), (165) and (174), we get a bound for $R_1$:
$$R_1 \leq 3\eta c\sqrt{\frac{d}{mn}}\Big(k\,\mathrm{dist}\big(\hat B,\hat B^*\big)+\sqrt k\,\hat\delta\sigma^2+\big(\hat\delta\sigma^2\big)^2\Big) + \frac{1}{5}\eta E_0\bar\sigma^2_{\min,*}. \qquad (175)\text{–}(176)$$
We work in similar fashion to derive the bound on $R_2$:
$$R_2 = \frac{\eta^2}{n^2}\lambda_{\max}\big(E^\top S + S^\top E\big) \leq \frac{2\eta^2}{n^2}\max_{p:\|p\|_2=1}\big|p^\top S^\top E\,p\big| \qquad (177)\text{–}(178)$$
$$\leq 2\eta^2\Big(\Big\|\frac{1}{n}\sum_{i=1}^n\Big(\frac{1}{m}X_i^\top X_i q_i - q_i\Big)w_i^{+\top}\Big\|_2 + \Big\|\frac{1}{n}\sum_{i=1}^n q_i w_i^{+\top}\Big\|_2\Big)\,\Big\|\frac{1}{mn}\sum_{i=1}^n X_i^\top Z_i w_i^{+\top}\Big\|_2. \qquad (179)\text{–}(180)$$
Since we work conditioning on the event $E_4\cap E_5$, and using $\|\frac{1}{n}\sum_i q_i w_i^{+\top}\|_2 \leq \max_i\|q_i\|_2\|w_i^+\|_2$ together with $E_2$ (117), we finally get
$$R_2 \leq 3\eta^2 c^2 k^{3/2}\sigma^2\sqrt{\frac{d}{mn}}. \qquad (181)\text{–}(185)$$
The last term we need to bound is $R_3$:
$$R_3 = \frac{\eta}{n}\lambda_{\max}\big(\hat B^\top E + E^\top\hat B\big) \leq \frac{2\eta}{n}\max_{p:\|p\|_2=1}\big|p^\top\hat B^\top E\,p\big| \leq 2\eta\,\Big\|\frac{1}{mn}\sum_{i=1}^n X_i^\top Z_i w_i^{+\top}\Big\|_2 \leq 2\eta c\,\sigma^2\big(2\sqrt k+\hat\delta\sigma^2\big)\sqrt{\frac{d}{mn}}. \qquad (186)\text{–}(190)$$
Combining (149) with (176), (185) and (190), we derive
$$\sigma^2_{\min}\big(R^+\big) \geq 1 - 14\eta ck^{3/2}\sigma^2\sqrt{\frac{d}{mn}} - 3\eta^2c^2k^{3/2}\sigma^2\sqrt{\frac{d}{mn}} - \frac{1}{5}\eta E_0\bar\sigma^2_{\min,*} \qquad (191)\text{–}(192)$$
$$\geq 1 - 15\eta ck^{3/2}\sigma^2\sqrt{\frac{d}{mn}} - \frac{1}{5}\eta E_0\bar\sigma^2_{\min,*}, \qquad (193)$$
where the last inequality holds since $mn \geq c\,d$.
We can now combine (139) and (193) to obtain the contraction inequality
$$\mathrm{dist}\big(\hat B^+,\hat B^*\big) \leq \frac{\mathrm{dist}\big(\hat B,\hat B^*\big)\big(1-\frac{5}{6}\eta E_0\bar\sigma^2_{\min,*}\big) + \eta ck\big(\sqrt k+1\big)\big(\sigma^2+1\big)\sqrt{\frac{d}{mn}}}{\sqrt{1 - 15\eta ck^{3/2}\sigma^2\sqrt{\frac{d}{mn}} - \frac{1}{5}\eta E_0\bar\sigma^2_{\min,*}}}. \qquad (194)\text{–}(195)$$
We divide and multiply by $n_0$, and using our bounds on $m$ and $n$, the previous inequality further simplifies to
$$\mathrm{dist}\big(\hat B^+,\hat B^*\big) \leq \mathrm{dist}\big(\hat B,\hat B^*\big)\sqrt{1-\frac{1}{2}\eta E_0\bar\sigma^2_{\min,*}} + \frac{\frac{1}{2}\eta E_0\bar\sigma^2_{\min,*}}{\sqrt{\frac{n}{n_0}\big(1-\frac{1}{2}\eta E_0\bar\sigma^2_{\min,*}\big)}}, \qquad (196)\text{–}(197)$$
where in the second inequality we used that for our choices of $m$ and $n$ the following inequality holds: $15\eta ck^{3/2}(1+\sigma^2)\sqrt{d/(mn_0)} \leq \frac{1}{10}\eta E_0\bar\sigma^2_{\min,*}$. Taking a union bound over the total number of iterations $T$, we derive the result. □

Corollary A.11. Recall that our algorithm starts at stage 0 with $n_0$ participating clients and doubles the number of participating clients at every subsequent stage. Thus, by slightly abusing notation, we can reformulate the contraction inequality of Theorem 4.7 at stage $r$ as follows:
$$\mathrm{dist}^+ \leq \mathrm{dist}\cdot\sqrt{1-a} + \frac{a}{\sqrt{2^r(1-a)}},\quad \text{with } a \leq \frac14.$$

In the second part of our analysis we compute the expected wall-clock time of our proposed method and compare it to the corresponding wall-clock time of straggler-prone FedRep. We prove that when the computational speeds are drawn from the exponential distribution with parameter $\lambda$ and the communication cost is given by $C = c\cdot\frac{1}{\lambda}$ (for some constant $c$), then SRPFL guarantees a logarithmic speedup. Recall that in Corollary A.11 we obtained the following simplified version of the contraction inequality:
$$\mathrm{dist}^+ \leq \mathrm{dist}\cdot\sqrt{1-a} + \frac{a}{\sqrt{2^r(1-a)}},\quad \text{with } a \leq \frac14. \qquad (199)$$
For the rest of this section, w.l.o.g. we assume that the clients are re-indexed at every stage $r$ so that the expected computation times are ordered from fastest to slowest, i.e., $\mathbb{E}[T_1^r] \leq \mathbb{E}[T_2^r] \leq \dots \leq \mathbb{E}[T_N^r]$. For simplicity, henceforth we drop the stage index $r$.
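The trade-off that the wall-clock analysis below formalizes can be simulated directly: with exponential computation times, the expected wait for the $n$ fastest of $N$ clients is a partial harmonic sum, and the stage that maximizes progress per unit of time grows as the distance shrinks. A small sketch under the contraction model of (199); all numeric values are illustrative assumptions, not taken from the paper:

```python
from math import sqrt, log

# Illustrative (hypothetical) parameters:
lam, c = 1.0, 1.0          # exponential rate of client speeds, communication constant
N, n0, a = 1024, 4, 0.25   # network size, initial cohort size, contraction parameter

def H(n):
    """n-th harmonic number."""
    return sum(1.0 / j for j in range(1, n + 1))

def expected_wait(n):
    """Expected time until the n fastest of N i.i.d. Exp(lam) clients finish:
    the n-th order statistic, (H(N) - H(N - n)) / lam."""
    return (H(N) - H(N - n)) / lam

def progress_rate(dist, r):
    """Per-round progress of contraction (199) with n0 * 2**r clients,
    divided by the expected round cost E[T_{n0 2^r}] + C."""
    n = n0 * 2 ** r
    new_dist = sqrt(1 - a) * dist + a / sqrt(2 ** r * (1 - a))
    return (dist - new_dist) / (expected_wait(n) + c / lam)

def best_stage(dist):
    """Stage r maximizing progress per unit of time at the given distance."""
    stages = range(int(log(N / n0, 2)) + 1)
    return max(stages, key=lambda r: progress_rate(dist, r))
```

In this instance `best_stage(1.0) < best_stage(0.2)`: far from the optimum a smaller cohort yields more progress per second, while near the noise floor only a large cohort makes progress at all, which is the rationale behind the doubling rule.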
Notice that this ordering of the computation times, in combination with (199), implies that SRPFL initially benefits from including only a few fast nodes in the training procedure. However, as the distance diminishes, the improvement guaranteed by the contraction inequality becomes negligible, and thus our method benefits from including slower nodes, thereby decreasing the second term on the r.h.s. of (199). Let us denote by $X_i$ the maximum distance for which the optimal number of participating nodes (for SRPFL) is $n_0 2^i$. This definition immediately implies that $X_0 = +\infty$. To compute each $X_i$ we turn our attention to measuring the progress per unit of time achieved by SRPFL when $2^r n_0$ nodes are utilized. This ratio at stage $r$ can be expressed as
$$\frac{\mathrm{dist} - \mathrm{dist}^+}{\mathbb{E}[T_{n_0 2^r}] + C} = \frac{\mathrm{dist}\big(1-\sqrt{1-a}\big) - \frac{a}{\sqrt{2^r(1-a)}}}{\mathbb{E}[T_{n_0 2^r}] + C}. \qquad (200)$$
Notice that by (199) the numerator captures the progress per round, while the algorithm incurs $\mathbb{E}[T_{n_0 2^r}]$ computation and $C$ communication cost. Similarly, the ratio when $2^{r+1} n_0$ nodes are used is given by
$$\frac{\mathrm{dist}\big(1-\sqrt{1-a}\big) - \frac{a}{\sqrt{2^{r+1}(1-a)}}}{\mathbb{E}[T_{n_0 2^{r+1}}] + C}. \qquad (201)$$
Based on the above expressions we can now compute the optimal doubling points (in terms of distance) and thus the values of the $X_i$'s. Subsequently, we compute the number of iterations SRPFL spends in every stage.

Lemma B.1. For all $i$ let $X_i$ denote the maximum distance for which the optimal number of nodes for SRPFL is $n_0 2^i$. Then the following holds for all $i = r+1 > 0$:
$$X_{r+1} = \frac{a}{\sqrt{2^r(1-a)}\big(1-\sqrt{1-a}\big)}\Bigg(1 + \frac{\big(\mathbb{E}[T_{n_0 2^r}] + C\big)\big(1-\frac{1}{\sqrt 2}\big)}{\mathbb{E}[T_{n_0 2^{r+1}}] - \mathbb{E}[T_{n_0 2^r}]}\Bigg). \qquad (202)$$
Further, SRPFL spends at each stage $r$ at most $t_r$ communication rounds, where
$$t_r \leq \frac{\log\Big(\frac{2\,(\mathbb{E}[T_{n_0 2^{r+1}}]-\mathbb{E}[T_{n_0 2^r}])}{\mathbb{E}[T_{n_0 2^r}]-\mathbb{E}[T_{n_0 2^{r-1}}]}\Big)}{\log\big(\frac{1}{\sqrt{1-a}}\big)}. \qquad (203)$$

Proof. For each stage $r$ let us compute the point where the transitioning between $2^r n_0$ and $2^{r+1} n_0$ occurs, that is, the distance at which SRPFL benefits by doubling the number of participating nodes to $2^{r+1} n_0$.
Thus, equating the two ratios in (200) and (201), we get
$$\frac{X_{r+1}\big(1-\sqrt{1-a}\big) - \frac{a}{\sqrt{2^r(1-a)}}}{\mathbb{E}[T_{n_0 2^r}] + C} = \frac{X_{r+1}\big(1-\sqrt{1-a}\big) - \frac{a}{\sqrt{2^{r+1}(1-a)}}}{\mathbb{E}[T_{n_0 2^{r+1}}] + C} \qquad (204)$$
$$\Rightarrow\quad X_{r+1}\big(1-\sqrt{1-a}\big)\big(\mathbb{E}[T_{n_0 2^{r+1}}] - \mathbb{E}[T_{n_0 2^r}]\big) = \frac{a}{\sqrt{2^r(1-a)}}\Big(\mathbb{E}[T_{n_0 2^{r+1}}] - \tfrac{1}{\sqrt 2}\mathbb{E}[T_{n_0 2^r}] + \big(1-\tfrac{1}{\sqrt 2}\big)C\Big) \qquad (205)$$
$$\Rightarrow\quad X_{r+1} = \frac{a}{\sqrt{2^r(1-a)}\big(1-\sqrt{1-a}\big)}\Bigg(1 + \frac{\big(\mathbb{E}[T_{n_0 2^r}] + C\big)\big(1-\frac{1}{\sqrt 2}\big)}{\mathbb{E}[T_{n_0 2^{r+1}}] - \mathbb{E}[T_{n_0 2^r}]}\Bigg). \qquad (206)$$
Let us now compute the number of rounds $t_r$ (henceforth denoted by $t$) required in stage $r$, that is, the minimum number of iterations that SRPFL needs to decrease the distance from $X_r$ to $X_{r+1}$ using only the $n_0 2^r$ fastest participating nodes. Starting off at $X_r$ and following (199) for $t$ rounds, we have
$$\mathrm{dist}_t \leq X_r\big(\sqrt{1-a}\big)^t + \frac{a}{\sqrt{2^r(1-a)}}\sum_{i=0}^{t-1}\big(\sqrt{1-a}\big)^i. \qquad (207)$$
As stated above, we want to find the minimum number of rounds such that we reach the next doubling point, i.e., we want $t$ large enough such that
$$X_r\big(\sqrt{1-a}\big)^t + \frac{a}{\sqrt{2^r(1-a)}}\cdot\frac{1-\big(\sqrt{1-a}\big)^t}{1-\sqrt{1-a}} \leq X_{r+1}, \qquad (208)\text{–}(209)$$
where in the last expression we use geometric series properties. We proceed to solve for $t$ by rearranging and using (202) and the fact that $X_r > \frac{a}{\sqrt{2^r(1-a)}(1-\sqrt{1-a})}$:
$$\big(\sqrt{1-a}\big)^t \leq \frac{X_{r+1} - \frac{a}{\sqrt{2^r(1-a)}(1-\sqrt{1-a})}}{X_r - \frac{a}{\sqrt{2^r(1-a)}(1-\sqrt{1-a})}} \leq \frac{\mathbb{E}[T_{n_0 2^r}]-\mathbb{E}[T_{n_0 2^{r-1}}]}{2\big(\mathbb{E}[T_{n_0 2^{r+1}}]-\mathbb{E}[T_{n_0 2^r}]\big)}, \qquad (210)\text{–}(212)$$
and taking the logarithm on both sides we derive the required amount of rounds:
$$t_r \leq \frac{\log\Big(\frac{2\,(\mathbb{E}[T_{n_0 2^{r+1}}]-\mathbb{E}[T_{n_0 2^r}])}{\mathbb{E}[T_{n_0 2^r}]-\mathbb{E}[T_{n_0 2^{r-1}}]}\Big)}{\log\big(\frac{1}{\sqrt{1-a}}\big)}. \qquad (213)\text{–}(214)$$
□

The following lemmas compute the wall-clock time that SRPFL and FedRep require in order to achieve target accuracy $\epsilon$. As discussed in Section 4.1, for fair comparison we consider accuracy of the form
$$\epsilon = \hat c\cdot\frac{a}{\sqrt{\frac{N}{n_0}(1-a)}\big(1-\sqrt{1-a}\big)},\quad \text{with } 2 > \hat c > 1. \qquad (216)$$
When $\hat c$ takes values close to $2$ we expect SRPFL to vastly outperform FedRep, and as $\hat c$ takes values close to $1$ the performance gap diminishes.

Lemma B.2. Suppose at each stage the clients' computational times are i.i.d. random variables drawn from the exponential distribution with parameter $\lambda$. Further, suppose that the expected communication cost per round is $C = c\cdot\frac{1}{\lambda}$, for some constant $c$.
Finally, consider the target accuracy $\epsilon$ given in (12). Then the expected wall-clock time for SRPFL is upper bounded as follows:
$$\mathbb{E}[T_{\mathrm{SRPFL}}] \leq \log(N)\Bigg(\frac{6(c+1) + 4\log\big(\frac{1}{\hat c-1}\big)}{\log\big(\frac{1}{1-a}\big)}\Bigg)\frac{1}{\lambda}. \qquad (217)$$

Proof. First we upper bound the expected cost suffered by our method until the distance between the current representation and the optimal representation becomes smaller than $X_{\log(N/n_0)}$, i.e., the cost corresponding to the first $\log\frac{N}{2n_0}$ stages of SRPFL (denoted by $\mathbb{E}[T^1_{\mathrm{SRPFL}}]$):
$$\mathbb{E}\big[T^1_{\mathrm{SRPFL}}\big] = \sum_{i=1}^{\log(N/2n_0)} t_i\big(\mathbb{E}[T_{n_0 2^i}] + C\big) \qquad (218)$$
$$\leq \sum_{i=1}^{\log(N/2n_0)} 2\big(\mathbb{E}[T_{n_0 2^i}] + C\big)\,\frac{\log\Big(\frac{2(\mathbb{E}[T_{n_0 2^{i+1}}]-\mathbb{E}[T_{n_0 2^i}])}{\mathbb{E}[T_{n_0 2^i}]-\mathbb{E}[T_{n_0 2^{i-1}}]}\Big)}{\log\big(\frac{1}{1-a}\big)} \qquad (219)$$
$$\leq \sum_{i=1}^{\log(N/4n_0)} 2\big(\mathbb{E}[T_{n_0 2^i}] + C\big)\,\frac{\log\Big(\frac{2(\mathbb{E}[T_{n_0 2^{i+1}}]-\mathbb{E}[T_{n_0 2^i}])}{\mathbb{E}[T_{n_0 2^i}]-\mathbb{E}[T_{n_0 2^{i-1}}]}\Big)}{\log\big(\frac{1}{1-a}\big)} + 2\big(\mathbb{E}[T_{N/2}]+C\big)\,\frac{\log\Big(\frac{2(\mathbb{E}[T_N]-\mathbb{E}[T_{N/2}])}{\mathbb{E}[T_{N/2}]-\mathbb{E}[T_{N/4}]}\Big)}{\log\big(\frac{1}{1-a}\big)}, \qquad (220)$$
where we used (214). Since the computational times of the clients come from the exponential distribution, the expected waiting time for the $n$ fastest of $N$ clients is the $n$-th order statistic, $\mathbb{E}[T_n] = \frac{1}{\lambda}\sum_{j=N-n+1}^{N}\frac{1}{j}$, and it is straightforward to derive the following bounds:
$$\mathbb{E}[T_N] - \mathbb{E}[T_{N/2}] = \frac{1}{\lambda}\sum_{j=1}^{N/2}\frac{1}{j} \leq \frac{1}{\lambda}\big(\ln(N/2)+1\big) \leq \frac{1}{\lambda}\log(N), \qquad (221)$$
$$\mathbb{E}[T_{N/2}] - \mathbb{E}[T_{N/4}] = \frac{1}{\lambda}\Big(\frac{1}{N/2+1} + \frac{1}{N/2+2} + \dots + \frac{1}{3N/4}\Big) \geq \frac{1}{4\lambda}, \qquad (222)\text{–}(223)$$
$$\mathbb{E}[T_{N/4}] - \mathbb{E}[T_{N/8}] = \frac{1}{\lambda}\Big(\frac{1}{3N/4+1} + \frac{1}{3N/4+2} + \dots + \frac{1}{7N/8}\Big), \qquad (224)$$
and so on. Making use of the above bounds, the expression in (220) further simplifies to
$$\mathbb{E}\big[T^1_{\mathrm{SRPFL}}\big] \leq \frac{5}{\log\big(\frac{1}{1-a}\big)}\sum_{i=1}^{\log(N/4n_0)}\big(\mathbb{E}[T_{n_0 2^i}]+C\big) + \frac{2\log\big(3\log(N)\big)}{\log\big(\frac{1}{1-a}\big)}\big(\mathbb{E}[T_{N/2}]+C\big). \qquad (225)\text{–}(226)$$
Further, notice that $\mathbb{E}[T_{N/2}] \leq \frac{1}{\lambda}$ and similarly $\mathbb{E}[T_{N/4}] \leq \frac{1}{3\lambda}$, $\mathbb{E}[T_{N/8}] \leq \frac{1}{7\lambda}$, $\mathbb{E}[T_{N/16}] \leq \frac{1}{15\lambda}$, and so on. Thus,
$$\sum_{i=1}^{\log(N/4n_0)}\mathbb{E}[T_{n_0 2^i}] = \mathbb{E}[T_{2n_0}] + \mathbb{E}[T_{4n_0}] + \dots$$
$$\dots + \mathbb{E}[T_{N/4}] \leq \frac{1}{\lambda}\sum_{j\geq 1}\frac{1}{2^j-1} \leq \frac{2}{\lambda}. \qquad (227)\text{–}(228)$$
Combining the bounds from (227) and (228) and substituting $C = c\cdot\frac{1}{\lambda}$ in expression (226), we derive the following bound:
$$\mathbb{E}\big[T^1_{\mathrm{SRPFL}}\big] \leq \frac{5}{\log\big(\frac{1}{1-a}\big)}\big(c\log(N/2n_0)+2\big)\frac{1}{\lambda} + \frac{2\log(3\log(N))}{\log\big(\frac{1}{1-a}\big)}(c+1)\frac{1}{\lambda} \leq \log(N/n_0)\,\frac{6(c+1)}{\log\big(\frac{1}{1-a}\big)}\,\frac{1}{\lambda}. \qquad (229)\text{–}(230)$$
Having derived an upper bound on the cost suffered by SRPFL over the first $\log\frac{N}{2n_0}$ stages, we now turn our attention to bounding the cost incurred from $X_{\log(N/n_0)}$ until the target accuracy $\epsilon$ is achieved (denoted by $\mathbb{E}[T^2_{\mathrm{SRPFL}}]$). Recall that
$$X_{\log(N/n_0)} = \frac{a}{\sqrt{\frac{N}{2n_0}(1-a)}\big(1-\sqrt{1-a}\big)}\Bigg(1 + \frac{\big(\mathbb{E}[T_{N/2}]+C\big)\big(1-\frac{1}{\sqrt 2}\big)}{\mathbb{E}[T_N]-\mathbb{E}[T_{N/2}]}\Bigg), \qquad (231)$$
and further, during the last stage of SRPFL, $N$ clients are utilized, giving the following form of the contraction inequality from (199):
$$\mathrm{dist}^+ \leq \mathrm{dist}\cdot\sqrt{1-a} + \frac{a}{\sqrt{\frac{N}{n_0}(1-a)}},\quad \text{with } a \leq \frac14. \qquad (232)$$
We first compute the number of rounds required in this second phase of the algorithm. Starting with distance $X_{\log(N/n_0)}$ and following the contraction in (232) for $t$ rounds, we derive current distance at most
$$X_{\log(N/n_0)}\big(\sqrt{1-a}\big)^t + \frac{a}{\sqrt{\frac{N}{n_0}(1-a)}}\cdot\frac{1-(\sqrt{1-a})^t}{1-\sqrt{1-a}} \qquad (233)$$
$$= \frac{a}{\sqrt{\frac{N}{n_0}(1-a)}\big(1-\sqrt{1-a}\big)}\Bigg(\sqrt 2\Big(1 + \frac{\big(\mathbb{E}[T_{N/2}]+C\big)\big(1-\frac{1}{\sqrt 2}\big)}{\mathbb{E}[T_N]-\mathbb{E}[T_{N/2}]}\Big)\big(\sqrt{1-a}\big)^t + \big(1-(\sqrt{1-a})^t\big)\Bigg), \qquad (234)\text{–}(235)$$
where in the first equality we use geometric series properties and in the second we substitute according to (231). Using the facts that $\sqrt 2\big(1-\frac{1}{\sqrt 2}\big)\big(\mathbb{E}[T_{N/2}]+C\big) \leq \mathbb{E}[T_{N/2}]+C$ and $\mathbb{E}[T_N]-\mathbb{E}[T_{N/2}] \geq \frac{1}{\lambda}$, expression (235) is upper bounded by
$$\frac{a}{\sqrt{\frac{N}{n_0}(1-a)}\big(1-\sqrt{1-a}\big)}\Big(\big(2+(c+1)\big)\big(\sqrt{1-a}\big)^t + 1 - \big(\sqrt{1-a}\big)^t\Big) \qquad (236)$$
$$\leq \frac{a}{\sqrt{\frac{N}{n_0}(1-a)}\big(1-\sqrt{1-a}\big)} + \frac{a}{\sqrt{\frac{N}{n_0}(1-a)}\big(1-\sqrt{1-a}\big)}\,(c+2)\big(\sqrt{1-a}\big)^t. \qquad (237)$$
The above implies that the number of rounds in the second phase is the smallest $t$ such that the target accuracy is achieved, i.e.,
$$\epsilon \geq \frac{a}{\sqrt{\frac{N}{n_0}(1-a)}\big(1-\sqrt{1-a}\big)}\Big(1 + (c+2)\big(\sqrt{1-a}\big)^t\Big). \qquad (238)$$
Further recall that from (216) the accuracy can be expressed in terms of
$$\epsilon = \hat c\cdot\frac{a}{\sqrt{\frac{N}{n_0}(1-a)}\big(1-\sqrt{1-a}\big)},\quad \text{with } 2 > \hat c > 1.$$
$$\qquad (239)$$
Combining the above and solving for $t$, we derive the required number of rounds for the second phase:
$$t \leq \frac{2\log\big(\frac{1}{\hat c-1}\big)}{\log\big(\frac{1}{1-a}\big)}. \qquad (240)$$
The expected cost during phase 2 can be computed as follows:
$$\mathbb{E}\big[T^2_{\mathrm{SRPFL}}\big] \leq \big(\mathbb{E}[T_N]+C\big)\Bigg(\frac{2\log\big(\frac{1}{\hat c-1}\big)}{\log\big(\frac{1}{1-a}\big)} + 1\Bigg) \qquad (241)$$
$$\leq \big(\ln(N)+1+c\big)\,\frac{1}{\lambda}\Bigg(\frac{2\log\big(\frac{1}{\hat c-1}\big)}{\log\big(\frac{1}{1-a}\big)} + 1\Bigg) \qquad (242)\text{–}(243)$$
$$\leq \log(N)\,\frac{4\log\big(\frac{1}{\hat c-1}\big)}{\log\big(\frac{1}{1-a}\big)}\,\frac{1}{\lambda}. \qquad (244)$$
Summing the two quantities of interest, we can derive the promised upper bound on the wall-clock time of SRPFL:
$$\mathbb{E}[T_{\mathrm{SRPFL}}] = \mathbb{E}\big[T^1_{\mathrm{SRPFL}}\big] + \mathbb{E}\big[T^2_{\mathrm{SRPFL}}\big] \qquad (245)$$
$$\leq \log(N/n_0)\,\frac{6(c+1)}{\log\big(\frac{1}{1-a}\big)}\,\frac{1}{\lambda} + \log(N)\,\frac{4\log\big(\frac{1}{\hat c-1}\big)}{\log\big(\frac{1}{1-a}\big)}\,\frac{1}{\lambda} \qquad (246)$$
$$\leq \log(N)\Bigg(\frac{6(c+1)+4\log\big(\frac{1}{\hat c-1}\big)}{\log\big(\frac{1}{1-a}\big)}\Bigg)\frac{1}{\lambda}. \qquad (247)$$
□

Having computed an upper bound on the expected wall-clock time of SRPFL, we proceed to compute a lower bound on the expected wall-clock time of FedRep.

Lemma B.3. Suppose at each stage the clients' computational times are i.i.d. random variables drawn from the exponential distribution with parameter $\lambda$. Further, suppose that the expected communication cost per round is $C = c\cdot\frac{1}{\lambda}$, for some constant $c$. Finally, consider the target accuracy $\epsilon$ given in (12). Then the expected wall-clock time for FedRep is lower bounded as follows:
$$\mathbb{E}[T_{\mathrm{FedRep}}] \geq \frac{\log(N)}{\lambda}\cdot\frac{\log N + 2\log\big(\frac{1}{\hat c-1}\big)}{\log\big(\frac{1}{1-a}\big)}. \qquad (248)$$

Proof. First we compute the number of rounds required by FedRep to achieve the target accuracy. Recall that FedRep utilizes $N$ clients at each round, giving the following form of the contraction inequality from (199):
$$\mathrm{dist}^+ \leq \mathrm{dist}\cdot\sqrt{1-a} + \frac{a}{\sqrt{\frac{N}{n_0}(1-a)}},\quad \text{with } a \leq \frac14. \qquad (249)$$
Starting with distance equal to $1$ and following the contraction in (249) for $t$ rounds, we derive current distance at most
$$\big(\sqrt{1-a}\big)^t + \frac{a}{\sqrt{\frac{N}{n_0}(1-a)}}\cdot\frac{1-(\sqrt{1-a})^t}{1-\sqrt{1-a}}, \qquad (250)$$
using the properties of geometric series. Further recall that from (216) the accuracy can be expressed as
$$\epsilon = \hat c\cdot\frac{a}{\sqrt{\frac{N}{n_0}(1-a)}\big(1-\sqrt{1-a}\big)},\quad \text{with } 2 > \hat c > 1.$$
$$\qquad (251)$$
The above imply that the number of rounds is going to be the smallest $t$ that guarantees that the target accuracy has been achieved, that is,
$$\hat c\cdot\frac{a}{\sqrt{\frac{N}{n_0}(1-a)}\big(1-\sqrt{1-a}\big)} \geq \big(\sqrt{1-a}\big)^t + \frac{a}{\sqrt{\frac{N}{n_0}(1-a)}}\cdot\frac{1-(\sqrt{1-a})^t}{1-\sqrt{1-a}}. \qquad (252)$$
We use the fact that $\sqrt{\frac{N}{n_0}(1-a)}\big(1-\sqrt{1-a}\big) - a > 0$ for $a \leq 1/4$ and all reasonable values of $N$ to rearrange and solve for $t$. Thus, we derive
$$t \geq \frac{2\log\big(\frac{1}{\hat c-1}\big)}{\log\big(\frac{1}{1-a}\big)} + \frac{\log N}{\log\big(\frac{1}{1-a}\big)}. \qquad (253)$$
Multiplying the number of rounds with a lower bound on the expected cost incurred per round results in the desired lower bound on the expected wall-clock time suffered by FedRep:
$$\mathbb{E}[T_{\mathrm{FedRep}}] \geq \big(\mathbb{E}[T_N]+C\big)\,\frac{2\log\big(\frac{1}{\hat c-1}\big) + \log N}{\log\big(\frac{1}{1-a}\big)} \geq \frac{\log(N)}{\lambda}\cdot\frac{\log N + 2\log\big(\frac{1}{\hat c-1}\big)}{\log\big(\frac{1}{1-a}\big)}. \qquad (254)\text{–}(255)$$
□

Combining the results of Lemma B.2 and Lemma B.3, we obtain Theorem 4.9.

Theorem 4.9. Suppose that at each stage the clients' computational times are i.i.d. random variables drawn from the exponential distribution with parameter $\lambda$. Further, suppose that the expected communication cost per round is $C = \frac{c}{\lambda}$, for some constant $c$. Finally, consider the target error $\epsilon$ given in (12). Then we have
$$\frac{\mathbb{E}[T_{\mathrm{SRPFL}}]}{\mathbb{E}[T_{\mathrm{FedRep}}]} \leq \frac{6(c+1) + 4\log\big(\frac{1}{\hat c-1}\big)}{\log N + 2\log\big(\frac{1}{\hat c-1}\big)} = \mathcal{O}\Bigg(\frac{\log\big(\frac{1}{\hat c-1}\big)}{\log(N) + \log\big(\frac{1}{\hat c-1}\big)}\Bigg). \qquad (256)$$

Remark B.4. The initialization scheme in Algorithm 2 guarantees that $\mathrm{dist}(B^0, B^*) \leq 1-c$, with probability at least $1-\mathcal{O}\big((mn)^{-100}\big)$, effectively without increasing the overall sample complexity. The formal statement and proof are identical to Theorem 3 in Collins et al. (2021) and are omitted.

C More on Experiments

Hyperparameters and choice of models. We set the hyperparameters following the work of Collins et al. (2021). Specifically, for the implementation of FedRep and FedRep-SRPFL we use SGD with momentum, where the momentum parameter is set to 0.5 and the local learning rate to 0.1. Further, similarly to Collins et al. (2021), we set the local learning rate to 0.1 for all other methods under consideration, which obtains optimal performance.
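The local optimizer described above (SGD with momentum 0.5 and learning rate 0.1) follows the standard buffered update. A minimal self-contained sketch of that rule on a toy quadratic — an illustration of the update form, not the authors' training code:

```python
def sgd_momentum_step(w, buf, grad, lr=0.1, momentum=0.5):
    """One PyTorch-style SGD-with-momentum step: buf <- mu*buf + grad, w <- w - lr*buf."""
    buf = momentum * buf + grad
    return w - lr * buf, buf

# Toy check on f(w) = w^2 / 2, whose gradient at w is simply w.
w, buf = 1.0, 0.0
for _ in range(200):
    w, buf = sgd_momentum_step(w, buf, grad=w)
```

With these hyperparameters the iterate spirals into the minimizer at 0; in a real run `grad` would be the mini-batch gradient of the local loss.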
We fix the batch size to 10 for all our implementations. The number of local epochs is set to 1 for CIFAR10 with N = 100 and to 5 for the rest of the datasets. In terms of the choice of the neural network model, for CIFAR10 we use LeNet-5, which includes two convolution layers with (64, 64) channels and three fully connected layers where the numbers of hidden neurons are (120, 64). The same structure is used for CIFAR100, but the numbers of channels in the convolution layers are increased to (64, 128) and the numbers of hidden neurons are increased to (256, 128). Additionally, a dropout layer with parameter 0.6 is added after the first two fully connected layers, which improves the testing accuracy. For EMNIST and FEMNIST, we use an MLP with three hidden layers with (512, 256, 64) hidden neurons. For Sent140 we use a two-layer bidirectional LSTM with dimension 256 and dropout rate 0.5, followed by a fully connected layer of dimension 5 and a classification head. Further, we use the standard GloVe embedding of dimension 100 and a vocabulary of size 10,000.

[Figure 7: Numerical results on CIFAR10, EMNIST, Sent140 with full participation (M = N) in the fixed computation speeds setting. Shard denotes the number of classes per client. C.T. denotes the communication cost per round.]

For the needs of SRPFL, we split the neural network model into two parts: the customized head hi and the common representation φ. In our experiments, we simply take the customized head to be the last hidden layer, and the rest of the parameters are treated as the common representation. Note that LG-FedAvg and LG-FLANP have a different head/representation split scheme: the head is globally shared across all clients, while a local version of the representation is maintained on every client. For all included datasets, i.e., CIFAR10, CIFAR100, EMNIST, FEMNIST and Sent140, the common head includes the last two fully connected layers, and the rest of the layers are treated as the representation part.

Datasets. We include five datasets in our empirical study: CIFAR10, which consists of 10 classes and a total of 50,000 training data points; CIFAR100, which consists of 100 classes and the same number of data points as CIFAR10; and EMNIST (balanced), which consists of 47 classes and 131,600 training data points. Note that in Figures 3-7, we use the first 10 classes from EMNIST following (Collins et al., 2021). For FEMNIST, we use the same setting as (Collins et al., 2021), with the exception that in Figures 3-6 we allocate to each client 150 data points, and in Figure 8 we allocate to client i, (100 + ui) samples, with ui a uniformly distributed random variable in [0, 50]. The Sentiment140 dataset contains 1,600,000 tweets annotated as negative or positive, which can be used to detect sentiment.
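The head/representation split described above (personalized head hi on each client, shared representation φ averaged on the server) can be sketched as a simple partition of the model's parameter dictionary. The layer names below are illustrative placeholders, not the actual model definition from our experiments.

```python
import numpy as np

# Hypothetical parameter dictionary for a small network; names are illustrative.
params = {
    "conv1.weight": np.zeros((64, 3, 5, 5)),
    "conv2.weight": np.zeros((64, 64, 5, 5)),
    "fc1.weight":   np.zeros((120, 64)),
    "fc2.weight":   np.zeros((64, 120)),
    "head.weight":  np.zeros((10, 64)),   # last layer = personalized head h_i
}

def split_head_representation(params, head_keys=("head.weight",)):
    """Split parameters into the client-specific head h_i and the
    globally shared representation phi (everything else)."""
    head = {k: v for k, v in params.items() if k in head_keys}
    representation = {k: v for k, v in params.items() if k not in head_keys}
    return head, representation

head, phi = split_head_representation(params)
# Only phi is communicated and averaged; head stays local to the client.
assert set(head) == {"head.weight"}
assert "head.weight" not in phi
```

For the LG-FedAvg-style split, the same helper applies with `head_keys` set to the shared layers instead, since there the roles of the two parts are reversed.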
For this dataset we perform pre-processing, splitting the samples across clients both in a homogeneous as well as a heterogeneous manner. The number of samples allocated to each client follows the log-normal distribution. During our training procedure, we perform the data augmentation operations of standard random crop and horizontal flip on the first two datasets and perform no pre-processing for the last one.

Homogeneous Setting. Apart from providing straggler-resilience benefits, our method takes advantage of the shared representation model to simultaneously address the hurdle of data heterogeneity. In the less challenging homogeneous setting one would expect FedRep-SRPFL to lose some of its competitive advantage. However, our numerical results in Figure 7 indicate that our method exhibits stable performance while maintaining an edge over the other baselines. We note that LG-SRPFL outperforms FedRep-SRPFL on the Sent140 dataset, where LG-FedAvg appears to be better-suited than FedRep. Importantly, our doubling scheme continues to enhance both subroutines, providing clear straggler-resilience benefits even in data-homogeneous environments.

[Figure 8: Numerical results on FEMNIST, Sent140, CIFAR10 in the fixed computation speeds setting. From left to right: Imbalanced/Imbalanced/Stragglers%/Correlated Heterogeneity. Shard denotes the number of classes per client. C.T. denotes the communication cost per round.]

Additional Regimes. In Figure 8 we observe the extent to which the benefits of our meta-algorithm persist in various regimes.

Imbalanced Datasets. In the two left columns we explore how the performance of our doubling scheme degrades for different levels of imbalance in the local datasets. In the FEMNIST dataset we allocate to each client i, (100 + ui) local samples, where ui is a random variable uniformly distributed in [0, 50]. In this setting the numbers of local samples are sufficiently close, and as a result the benefits of our scheme are prevalent for all different values of communication cost. As we allow the clients to have more diverse numbers of local samples, however, the speedup provided by our doubling scheme diminishes. Indeed, this is the case in the Sent140 dataset, where we allocate data to clients in a more disproportional manner following the log-normal distribution.
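A log-normal allocation of the kind used for Sent140 can be produced by normalizing i.i.d. log-normal draws; the sketch below shows one such scheme. The `sigma` value is an illustrative assumption (larger `sigma` yields more imbalance), not the setting used in our experiments.

```python
import numpy as np

def lognormal_allocation(total_samples, num_clients, sigma=1.0, seed=0):
    """Allocate total_samples across num_clients proportionally to
    i.i.d. log-normal draws; sigma controls the degree of imbalance."""
    rng = np.random.default_rng(seed)
    weights = rng.lognormal(mean=0.0, sigma=sigma, size=num_clients)
    alloc = np.floor(total_samples * weights / weights.sum()).astype(int)
    alloc[0] += total_samples - alloc.sum()  # hand the rounding remainder to one client
    return alloc

alloc = lognormal_allocation(10_000, 83)  # 83 clients, as in the Sent140 runs
assert alloc.sum() == 10_000
assert (alloc >= 0).all()
```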
We point out that in settings with high diversity, simply doubling the number of clients is not sufficient. Instead, to achieve optimal performance one needs to select consecutive participating sets of clients whose cumulative numbers of samples double from one stage to the next.

Different levels of stragglers. We further consider the setting where clients are split into two categories. The first category consists of typically fast clients, whose computational times come from the exponential distribution with λ = 0.1. The second category consists of clients who are typically straggling, i.e., sampling their computational cost from the exponential distribution with λ = 10. In the third column of Figure 8 we compare FedRep, LG-FedAvg and their straggler-resilient variants for different percentages of clients coming from the latter category. Unsurprisingly, the performance of the methods under consideration degrades as the percentage of stragglers increases; however, our doubling scheme mitigates this effect to some extent.

Correlated heterogeneity. In this setting we assign data to clients depending on their speeds. More specifically, we split clients into 3 groups based on their computational times, and respectively we group samples into 3 distinct groups based on their labels. Subsequently, we allocate mutually exclusive (with respect to their labels) groups of data to specific groups of clients. The rightmost column of Figure 8 indicates that the benefits of our doubling scheme persist, although to a lesser degree as the correlation between data and system heterogeneity grows.
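The effect of a fast/straggler mix on synchronous round times can be illustrated with a short simulation: a round lasts as long as its slowest participant, so waiting only for a small set of fast clients is far cheaper than full participation. Here we treat the λ values of the two categories as mean computation times (since the fast clients are the ones with the smaller value); that parameterization, and all constants below, are assumptions for the sketch.

```python
import numpy as np

def round_time(client_means, participants, rng):
    """A synchronous round lasts as long as its slowest participant;
    each client's time is exponential with the given mean."""
    times = rng.exponential(client_means[participants])
    return times.max()

rng = np.random.default_rng(0)
n = 100
straggler_frac = 0.3
# Fast clients: mean time 0.1; stragglers: mean time 10 (assumed parameterization).
means = np.where(rng.random(n) < straggler_frac, 10.0, 0.1)
order = np.argsort(means)  # fastest clients first

full = np.mean([round_time(means, np.arange(n), rng) for _ in range(200)])
fastest = np.mean([round_time(means, order[:10], rng) for _ in range(200)])

# Waiting only for the 10 fastest clients is much cheaper per round
# than waiting for the slowest of all 100 clients.
assert fastest < full
```

This is the mechanism the doubling scheme exploits: early stages involve only fast clients, and slower clients are folded in once their samples are needed for the target accuracy.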