Published in Transactions on Machine Learning Research (10/2023)

Straggler-Resilient Personalized Federated Learning

Isidoros Tziotis (isidoros_13@utexas.edu), The University of Texas at Austin
Zebang Shen (zebang.shen@inf.ethz.ch), ETH Zurich
Ramtin Pedarsani (ramtin@ece.ucsb.edu), The University of California, Santa Barbara
Hamed Hassani (hassani@seas.upenn.edu), The University of Pennsylvania
Aryan Mokhtari (mokhtari@austin.utexas.edu), The University of Texas at Austin

Reviewed on OpenReview: https://openreview.net/forum?id=gxEpUFxIgz

Abstract

Federated Learning is an emerging learning paradigm that allows training models from samples distributed across a large network of clients while respecting privacy and communication restrictions. Despite its success, federated learning faces several challenges related to its decentralized nature. In this work, we develop a novel algorithmic procedure with theoretical speedup guarantees that simultaneously handles two of these hurdles, namely (i) data heterogeneity, i.e., data distributions can vary substantially across clients, and (ii) system heterogeneity, i.e., the computational power of the clients could differ significantly. By leveraging previous works in the realm of representation learning (Collins et al., 2021; Liang et al., 2020), our method constructs a global common representation utilizing the data from all clients. Additionally, it learns a user-specific set of parameters, resulting in a personalized solution for each individual client. Furthermore, it mitigates the effects of stragglers by adaptively selecting clients based on their computational characteristics, thus achieving for the first time near-optimal sample complexity and provable logarithmic speedup.
Experimental results support our theoretical findings, showing the superiority of our method over alternative personalized federated schemes in system- and data-heterogeneous environments.

1 Introduction

Due to growing concerns over data privacy and communication cost, Federated Learning (FL) has become an emerging learning paradigm, as it allows for training machine learning models without collecting local data from the clients. Due to its decentralized nature, a major challenge in designing efficient solvers for FL is the heterogeneity of local devices, which can be categorized into two different types: (i) data heterogeneity, where the underlying data distributions of clients vary substantially, and (ii) system heterogeneity, where the computational and storage capabilities of devices differ significantly. In fact, it has been observed that the seminal Federated Averaging (FedAvg) method suffers from slow convergence to a high-quality solution when facing highly heterogeneous datasets (McMahan et al., 2017) as well as heterogeneous systems (Li et al., 2020; Kairouz et al., 2021). In this paper, we aim to address these two challenges simultaneously by introducing a generic framework that includes algorithms with robust performance in the presence of both forms of client heterogeneity. Inspired by prior works in the FL literature (Collins et al., 2021; Liang et al., 2020) that utilized representation learning theory to tackle data heterogeneity, we propose a meta-algorithm that produces personalized solutions and handles data heterogeneity by leveraging a global representation shared among all clients. Further, our method circumvents the delays introduced by the presence of stragglers by carefully selecting participating nodes based on their computational speeds.
In early stages, only a few of the fastest nodes are chosen to participate, and slower devices are sequentially included in the training process until the target accuracy is achieved. Although the disproportionate selection of nodes raises fairness and accuracy concerns, we highlight that our method achieves speedup without compromising the resulting solution. The most significant contribution of our work is achieving near-optimal sample complexity in regimes with data and system heterogeneity, alongside a provable logarithmic speedup guarantee in terms of running time. Next we summarize our contributions:

1. SRPFL Algorithm. We propose Straggler-Resilient Personalized Federated Learning (SRPFL), an adaptive node participation meta-algorithm that builds upon subroutines that fall into the representation learning framework (Collins et al., 2021; Liang et al., 2020), enhancing their resilience to stragglers and their performance.

2. Logarithmic Speedup. Assuming that clients' speeds are drawn from the exponential distribution, we prove that SRPFL guarantees logarithmic speedup in the linear representation setting, outperforming established, straggler-prone benchmarks while maintaining the state-of-the-art sample complexity per client $m = O(d/N + \log N)$, where $d$ and $N$ denote the feature vector size and the number of active nodes. Our results hold for non-convex loss functions, heterogeneous data, and dynamically changing client speeds.

3. Numerical Results. Experiments on various datasets (CIFAR10, CIFAR100, EMNIST, FEMNIST, Sent140) support our theoretical results, showing that: (i) SRPFL significantly boosts the performance of different subroutines designed for personalized FL in both full and partial participation settings, and (ii) SRPFL exhibits superior performance in system- and data-heterogeneous settings compared to state-of-the-art baselines.

1.1 Related Work

Data heterogeneity.
In data-heterogeneous settings, if one aims at minimizing the aggregate loss in the network using the classic FedAvg method or more advanced algorithms that utilize control-variate techniques, such as SCAFFOLD (Karimireddy et al., 2019), FEDGATE (Haddadpour et al., 2021), FedDyn (Acar et al., 2021), or FEDSHUFFLE (Horváth et al., 2022), the resulting solution could perform poorly for some of the clients. This is an unavoidable hurdle due to the fact that no single model works well for all clients when their underlying data distributions are diverse. A common technique that addresses this issue is fine-tuning the derived global model to each local task by following a few steps of SGD updates (Wang et al., 2019; Yu et al., 2020). Based on this observation, Fallah et al. (2020b) showed that one might need to train models that work well after fine-tuning, and showed its connections to Model-Agnostic Meta-Learning (MAML). In (Cho et al., 2022; Balakrishnan et al., 2021), novel client-sampling schemes were explored, achieving increased efficiency in regimes with data heterogeneity. Another line of work for personalized FL is learning additive mixtures of local and global models (Deng et al., 2020; Mansour et al., 2020; Hanzely and Richtárik, 2020). These methods learn local models for clients that are close to each other in some norm, an idea closely related to multi-task FL (Smith et al., 2017; Hu et al., 2021). The works in (Chen et al., 2022; Lee et al., 2022) studied the interplay of local and global models utilizing Bayesian hierarchical models and partial participation, respectively. An alternative approach was presented by Collins et al. (2021), where instead of enforcing local models to be close, the authors assumed that models across clients share a common representation. Using this perspective, they presented FedRep, a method that provably learns this underlying structure in the linear representation setting.
Building upon the idea of a common representation, Zhu et al. (2021) and Jiang and Lin (2022) proposed federated methods that can handle data heterogeneity while exhibiting robustness to distribution shifts. Recently, a novel framework was proposed allowing the comparison of personalized FL methods under various metrics (Wu et al., 2022). In all of the aforementioned methods, however, a subset of clients participates regardless of their computational capabilities. Thus, in the presence of stragglers, the speed of the training process significantly goes down, as the server waits, at every communication round, for the slowest participating node to complete its local updates.

System heterogeneity. Several works have attempted to address the issue of system heterogeneity. Specifically, asynchronous methods, which rely on bounded staleness of slow clients, have demonstrated significant gains in distributed data centers (Xie et al., 2019; Stich, 2019; So et al., 2021). In FL frameworks, however, stragglers could be arbitrarily slow, rendering these methods inefficient. In an attempt to manually control staleness, deadline-based computation has been proposed (Reisizadeh et al., 2019), as well as aggregation of a fixed number of models per round (Nguyen et al., 2022). However, in the worst-case scenario the performance of these methods is still determined by the slowest client in the network. Active sampling is another approach, where the server aggregates as many local updates as possible within a predefined time span (Nishio and Yonetani, 2019). In a different line of work, the effects of stragglers are mitigated by utilizing computation/gradient coding schemes (Tandon et al., 2017; Wang et al., 2018; 2020b). In (Cho et al., 2021), clients use heterogeneous model architectures to transfer knowledge to nodes with similar data distributions, while Yang et al.
(2022) proposed a new framework where clients are free to choose their participation scheme. More recently, normalized averaging methods were proposed in (Wang et al., 2020a; Horváth et al., 2022) that rectify the objective inconsistency created by the mismatch in clients' updates. FedLin (Mitra et al., 2021) instead utilizes gradient correction and error-feedback mechanisms to circumvent the speed-accuracy conflict. A novel approach to mitigate the effects of stragglers was proposed by Reisizadeh et al. (2022), where clients are selected to take part in different stages of the training according to their computational characteristics. Alas, all of the above methods yield improvement only in data-homogeneous settings and are not applicable in regimes with data heterogeneity.

2 Problem Formulation

In this section, we introduce our setting and define the data and system heterogeneity models that we study. Consider the FL framework where $M$ clients interact with a central server. We focus on a supervised, data-heterogeneous setting where client $i$ draws samples from distribution $\mathcal{D}_i$, potentially with $\mathcal{D}_i \neq \mathcal{D}_j$. Further, consider the learning model of the $i$-th client as $q_i : \mathbb{R}^d \to \mathcal{Y}$, which maps inputs $x_i \in \mathbb{R}^d$ to predicted labels $q_i(x_i) \in \mathcal{Y}$. The objective function of client $i$ is defined as $f_i(q_i) := \mathbb{E}_{(x_i, y_i) \sim \mathcal{D}_i}[\ell(q_i(x_i), y_i)]$, where the loss $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ penalizes the gap between the predicted label $q_i(x_i)$ and the true label $y_i$. In the most general setting, clients aim to solve

$$\min_{(q_1, \dots, q_M) \in \mathcal{Q}} \frac{1}{M} \sum_{i=1}^{M} f_i(q_i), \qquad (1)$$

with $\mathcal{Q}$ the space of feasible tuples of mappings $(q_1, \dots, q_M)$. Traditionally in FL, methods focus on learning a single model $q = q_1 = \dots = q_M$ that performs well on average across clients (Li et al., 2020; McMahan et al., 2017). Although such a solution may be satisfactory in data-homogeneous settings, it leads to undesirable local models when the data distributions are diverse.
Indeed, in the presence of data heterogeneity the loss functions $f_i$ have different forms and their minimizers could be far from each other. This justifies the formulation in (1) and necessitates the search for personalized solutions that can be learned in a federated fashion.

Low-Dimensional Common Representation. There have been numerous examples in image classification and word prediction where tasks with heterogeneous data share a common, low-dimensional representation, despite having different labels (Bengio et al., 2013; LeCun et al., 2015; Pillutla et al., 2022). Based on that, a reasonable choice for $\mathcal{Q}$ is a set in which all $q_i$ share a common map, coupled with a personalized map that fits their local data. To formalize this, suppose the ground-truth map can be written for each client $i$ as $q_i = h_i \circ \phi$, where $\phi : \mathbb{R}^d \to \mathbb{R}^k$ is a shared global representation which maps $d$-dimensional data points to a lower-dimensional space of size $k$, and $h_i : \mathbb{R}^k \to \mathcal{Y}$ maps from the lower-dimensional subspace to the space of labels. Typically $k \ll d$, and thus given any fixed representation $\phi$, the client-specific heads $h_i : \mathbb{R}^k \to \mathcal{Y}$ are easy to optimize locally. With this common structure in consideration, (1) can be reformulated as $\min_{\phi \in \Phi} \frac{1}{M} \sum_{i=1}^{M} \min_{h_i \in \mathcal{H}} f_i(h_i \circ \phi)$, where $\Phi$ is the class of feasible representations and $\mathcal{H}$ is the class of feasible heads. This formulation leads to good local solutions if the underlying data-generation models of the clients share a low-dimensional common representation, i.e., $y_i = h_i^* \circ \phi^*(x_i) + z_i$, where $z_i$ is some additive noise.

[Figure 1: Classic FL schemes for solving (2)]  [Figure 2: SRPFL for solving (2)]

The server and clients collaborate to learn the common representation $\phi$, while locally each client learns their unique head $h_i$.
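As a toy illustration of this structure (a minimal numpy sketch; the dimensions, names, and the choice of a linear shared map are all illustrative assumptions, not the paper's model), each client composes the common representation with its own head:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, M = 20, 3, 5                    # ambient dim, representation dim, clients

B = rng.standard_normal((d, k))       # shared representation: phi(x) = B^T x
heads = rng.standard_normal((M, k))   # client-specific heads: h_i(z) = w_i^T z

def predict(i, x):
    """Personalized prediction of client i: q_i(x) = h_i(phi(x))."""
    return heads[i] @ (B.T @ x)

x = rng.standard_normal(d)
preds = [predict(i, x) for i in range(M)]  # one shared phi, M personalized outputs
```

All clients reuse the same `B`, so only the small $k$-dimensional heads differ across clients, which is what makes local optimization cheap when $k \ll d$.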
Since clients often do not have access to their true data distributions, instead of minimizing their expected loss, they settle for minimizing the empirical loss associated with their local samples. Specifically, we assume client $i$ has access to $S_i$ samples $\{x_i^1, x_i^2, \dots, x_i^{S_i}\}$, and its local empirical loss is $\hat{f}_i(h_i \circ \phi) = \frac{1}{S_i} \sum_{s=1}^{S_i} \ell(h_i \circ \phi(x_i^s), y_i^s)$. Hence, the global problem becomes

$$\min_{\phi \in \Phi} \frac{1}{M} \sum_{i=1}^{M} \min_{h_i \in \mathcal{H}} \left( \hat{f}_i(h_i \circ \phi) := \frac{1}{S_i} \sum_{s=1}^{S_i} \ell(h_i \circ \phi(x_i^s), y_i^s) \right). \qquad (2)$$

System Heterogeneity Model. In most FL settings, thousands of clients participate in the training process, each with different computational capabilities. Thus, fixed computational tasks such as gradient computations or local model updates require different processing times for different clients. Formally, for each client $i \in [M]$, we denote by $T_i$ the time required to compute a local model update. When a subset of clients participates in the learning process, the computational time of each round is determined by the slowest participating node. Naturally, as the number of nodes in the network grows, we expect the number of stragglers to increase. This phenomenon calls for the development of straggler-resilient methods in system-heterogeneous settings.

3 Straggler-Resilient Personalized FL

In the shared representation setting, we face the challenge of finding an algorithm that coordinates server and clients in order to learn a common representation and a set of personalized parameters in a federated and straggler-resilient fashion. To this end, we propose a method that tackles problem (2) with limited sample access and provably superior performance over naive, straggler-prone methods. Specifically, we propose the Straggler-Resilient Personalized Federated Learning (SRPFL) meta-algorithm, designed to mitigate the effects of system heterogeneity in environments with non-convex loss functions and heterogeneous data, while accommodating a variety of methods as subroutines.
In a nutshell, SRPFL iteratively solves problem (2), while adaptively increasing the set of participating nodes based on their computational capabilities. As a result, the process of learning the common representation across clients' tasks is accelerated without compromising the resulting accuracy. To simplify the exposition, henceforth we denote by $\mathcal{A}$ some federated algorithm of choice, designed to solve (2). As depicted in Figure 1, in standard FL frameworks, out of all $M$ clients in the network, the server often selects uniformly at random a subset of size $N$. Subsequently, a few iterations of algorithm $\mathcal{A}$ are performed to approximately solve a version of (2) which corresponds to those $N$ selected clients, i.e., $\min_{\phi \in \Phi} \frac{1}{N} \sum_{i=1}^{N} \min_{h_i \in \mathcal{H}} \hat{f}_i(h_i \circ \phi)$. In every following stage, a new subset of $N$ nodes is sampled and the same process is repeated. Although such a procedure eventually learns the global representation across all tasks, it is susceptible to delays caused by stragglers, as the server has to wait for the slowest client (among the $N$ selected ones) to complete its updates. Hence, when $N$ is large the training procedure could become prohibitively slow.

SRPFL takes a different approach in order to mitigate the effects of straggling clients. An overview of the selection scheme is provided in Figure 2. At each stage, $N$ clients are randomly selected, but only a small, fast subset of them is used to solve their corresponding learning problem. More precisely, suppose that at stage $r$, each client $i$ in the sampled subset of size $N$ is associated with a computational time $T_i^r$. For simplicity, we assume that the nodes are re-indexed at every stage so that their computational times are in increasing order, i.e., $T_1^r \leq T_2^r \leq \dots \leq T_N^r$. Initially, only the $n_0$ fastest clients, $\{1, 2, \dots, n_0\}$, are included in the learning procedure, with $n_0$ much smaller than $N$.
At every communication round, the set of $n_0$ nodes performs the model updates indicated by algorithm $\mathcal{A}$ to solve their version of (2), i.e., $\min_{\phi \in \Phi} \frac{1}{n_0} \sum_{i=1}^{n_0} \min_{h_i \in \mathcal{H}} \hat{f}_i(h_i \circ \phi)$. We note that during this stage of the algorithm, the server waits only for the slowest client among the participating ones, i.e., client $n_0$, which is potentially much faster than node $N$.

Remark 3.1. In practice, knowledge of the clients' computational power is not required to figure out the $n_0$ fastest nodes. Instead, the server sends the global representation model to all $N$ sampled clients and updates the common representation once the first $n_0$ new models are received. Indeed, these representations belong to the fastest $n_0$ nodes.

Once the current stage terminates, a new batch of $N$ clients is sampled and the $2n_0$ fastest nodes are chosen to participate in the learning process. Since speeds vary across stages, consecutive sets of participating nodes could have small or no overlap. However, the representations learned in previous stages still operate as good starting points for subsequent stages, which is possible since nodes are homogeneous w.r.t. their representations (see Section 2). Thus, utilizing the representation model learned in the previous stage, nodes $\{1, 2, \dots, 2n_0\}$ continue the learning process, deriving a model of improved accuracy. The procedure of geometrically increasing the number of participating nodes continues until the target accuracy is achieved. Hence, SRPFL uses the data of stragglers only in the latest stages of the algorithm, when an accurate approximation is required.

Remark 3.2. For simplicity of exposition, we assume that the set of $N$ sampled nodes maintains connectivity to the server throughout each stage. However, our analysis remains unaffected even if a new set of nodes is sampled at every round.

We proceed to characterize the class of federated algorithms able to employ SRPFL to enhance their performance in system-heterogeneous environments.
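The stage mechanics above (wait for the first $n$ responses, then double $n$) can be sketched as a small simulation; the exponential response times and all names here are illustrative assumptions, not part of the method's specification:

```python
import random

def stage_round_time(n, N, rng):
    """Server broadcasts to N sampled clients and proceeds once the first n
    models arrive, so the round is paced by the n-th fastest client."""
    times = sorted(rng.expovariate(1.0) for _ in range(N))  # hypothetical speeds
    return times[n - 1]

rng = random.Random(0)
N, n0 = 64, 4
n, schedule = n0, []
while True:
    schedule.append(n)               # participating-set size at this stage
    if n == N:
        break
    n = min(N, 2 * n)                # doubling rule between stages

early = stage_round_time(n0, N, rng)  # waiting for client n0 ...
late = stage_round_time(N, N, rng)    # ... vs. waiting for the slowest client
```

In early stages the server's bottleneck is `times[n0 - 1]`, which is typically far smaller than the full-participation bottleneck `times[N - 1]`; stragglers are only waited on once the schedule reaches $n = N$.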
Any iterative method that solves an instance of (2) can be combined with our adaptive node participation scheme; however, in this paper we focus on a broad class of alternating gradient-based update methods, presented in (Collins et al., 2021). As the name suggests, in each round clients update their heads and representation models in an alternating fashion. After a certain number of gradient updates is completed, all clients send their derived representations to the server, where the models are averaged and broadcast back to the clients. Next, we rigorously illustrate this procedure.

Alternating Update Scheme. At round $t$, the server communicates a common representation $\phi^t$ to the clients, and a subset of them, $\mathcal{I}^t$, is selected to participate by performing the following updates.

Client Head Update. Each client $i \in \mathcal{I}^t$ performs $\tau_h$ local gradient-based updates optimizing their head parameter, given the received representation $\phi^t$. Concretely, for $s = 1, \dots, \tau_h$, client $i$ updates their head model as follows:

$$h_i^{t,s} = \mathrm{GRD}\big(f_i(h_i^{t,s-1}, \phi^t),\; h_i^{t,s-1},\; \eta\big). \qquad (3)$$

Client Representation Update. Once the updated local heads $h_i^{t,\tau_h}$ are obtained, each client $i$ executes $\tau_\phi$ local updates on their representation parameters. That is, for $s = 1, \dots, \tau_\phi$,

$$\phi_i^{t,s} = \mathrm{GRD}\big(f_i(h_i^{t,\tau_h}, \phi_i^{t,s-1}),\; \phi_i^{t,s-1},\; \eta\big). \qquad (4)$$

In the above expressions, $\mathrm{GRD}(f, h, \eta)$ captures the generic notion of an update of variable $h$ using the gradient of function $f$ with respect to $h$ and step size $\eta$. This notation allows the inclusion of a large class of algorithms such as Gradient Descent with momentum, SGD, etc.

Server Update. Each client $i$ sends their derived representation model $\phi_i^{t,\tau_\phi}$ to the server, where the models are averaged, producing the next representation model $\phi^{t+1}$.

Coupling SRPFL with a generic subroutine that falls into the Alternating Update Scheme gives rise to Algorithm 1. Every stage $r$ is characterized by a participating set of size $2^r n_0$, denoted by $\mathcal{I}_r$.
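A minimal sketch of one client's round under this scheme, with GRD instantiated as plain gradient descent and a quadratic loss over a hypothetical linear model (the data, dimensions, and step size below are assumptions for illustration, not the paper's experimental setup):

```python
import numpy as np

def local_round(h, phi, X, y, eta=0.01, tau_h=5, tau_phi=1):
    """tau_h head updates with phi frozen (eq. (3)), then tau_phi
    representation updates with the head frozen (eq. (4)).
    Loss: f(h, phi) = (1/2m) ||X phi h - y||^2."""
    m = len(y)
    Z = X @ phi                                     # features under frozen phi
    for _ in range(tau_h):
        h = h - eta * Z.T @ (Z @ h - y) / m         # GRD step on the head
    for _ in range(tau_phi):
        r = X @ phi @ h - y
        phi = phi - eta * np.outer(X.T @ r, h) / m  # GRD step on the representation
    return h, phi

rng = np.random.default_rng(1)
d, k, m = 10, 2, 50
X, y = rng.standard_normal((m, d)), rng.standard_normal(m)
h, phi = rng.standard_normal(k), rng.standard_normal((d, k))
loss_before = np.mean((X @ phi @ h - y) ** 2) / 2
h, phi = local_round(h, phi, X, y)
loss_after = np.mean((X @ phi @ h - y) ** 2) / 2
```

With a sufficiently small `eta`, each alternating pass decreases the local loss; the server would then average the returned `phi` matrices across the participating set.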
Algorithm 1 SRPFL
1: Input: Initial number of nodes $n_0$; step size $\eta$; number of local updates for head $\tau_h$; number of local updates for representation $\tau_\phi$.
2: Initialization: $n \leftarrow n_0$, $\phi^0$, $h_1^{0,\tau_h}, \dots, h_N^{0,\tau_h}$
3: for $r = 0, 1, 2, \dots, \log(N/n_0)$ do
4:   $\phi^0 \leftarrow \phi^r$
5:   for $t = 1, 2, \dots, \tau_r$ do
6:     Server sends representation $\phi^{t-1}$ to $N$ clients sampled from $[M]$.
7:     for $i \in \mathcal{I}_r$ do
8:       Client $i$ initializes $h_i^{t,0} \leftarrow h_i^{t-1,\tau_h}$ and runs $\tau_h$ updates $h_i^t \leftarrow h_i^t - \eta \nabla_{h_i^t} f_i(h_i^t, \phi^{t-1})$.
9:       Client $i$ initializes $\phi_i^{t,0} \leftarrow \phi^{t-1}$ and runs $\tau_\phi$ updates $\phi_i^t \leftarrow \phi_i^t - \eta \nabla_{\phi_i^t} f_i(h_i^t, \phi_i^t)$.
10:      Client $i$ sends $\phi_i^{t,\tau_\phi}$ to the server.
11:     end for
12:     for each client $i \notin \mathcal{I}_r$ do
13:       $h_i^{t,\tau_h} \leftarrow h_i^{t-1,\tau_h}$
14:     end for
15:     Server computes $\phi^t \leftarrow \frac{1}{n} \sum_{i=1}^{n} \phi_i^{t,\tau_\phi}$.
16:   end for
17:   Server sets $n \leftarrow \min\{N, 2n\}$ and $\phi^{r+1} \leftarrow \phi^{\tau_r}$.
18: end for

At the beginning of each round, the server provides a representation model to the participating clients (line 6). The clients update their models (lines 8 and 9) and send their representations back to the server, where they are averaged, producing a new global model. The numbers of local updates $\tau_h, \tau_\phi$ depend on the subroutine method of choice, and the number of rounds per stage is denoted by $\tau_r$. At the end of every stage, the set of participating nodes doubles in size until a set of size $N$ is reached (line 17). Remark 3.3 summarizes the technical novelties of our work and highlights crucial benefits enjoyed by SRPFL.

Remark 3.3. Reisizadeh et al. (2022) proposed a similar participation scheme; however, their approach differs from ours and their results apply to significantly more restrictive regimes. Specifically, in (Reisizadeh et al., 2022) the analysis heavily relies on deriving a connection between the ERM solutions of consecutive stages.
In order to control the statistical accuracy of the corresponding ERM problems, (i) data homogeneity across all clients is necessary, and further, (ii) clients who participate in early stages are required to remain active and connected to the server in all subsequent stages, maintaining fixed computational speeds throughout the whole training process. (iii) The results of their analysis hold only for strongly convex loss functions, and (iv) their stage-termination criterion requires knowledge of the strong convexity parameter. The above restrictions are detrimental in the FL regime and severely undermine the applicability of the resulting algorithm. In our work we follow a different approach, controlling (in terms of principal angle distance) a quantity analogous to statistical accuracy, thereby directly connecting the common representation (and overall solution) at every stage to the ground-truth representation. This novel approach allows our algorithm to accommodate (i) data heterogeneity, (ii) clients with dynamically changing speeds, or equivalently clients that are replaced by new ones at every round, and (iii) non-convex loss functions. Additionally, a major part of our technical contribution focuses on (iv) analytically deriving the optimal number of rounds per stage, thus producing a simple and efficient doubling scheme.

4 SRPFL in the Linear Representation Case

Our theoretical analysis focuses on a specific instance of (1), where clients strive to solve a linear representation learning problem. Concretely, we assume that $f_i$ is the quadratic loss, $\phi$ is a projection onto a $k$-dimensional subspace of $\mathbb{R}^d$ given by a matrix $B \in \mathbb{R}^{d \times k}$, and the $i$-th client's head $h_i$ is a vector $w_i \in \mathbb{R}^k$. We model the local data of client $i$ such that $y_i = w_i^{*\top} B^{*\top} x_i + z_i$, for some ground-truth representation $B^* \in \mathbb{R}^{d \times k}$, local heads $w_i^* \in \mathbb{R}^k$, and $z_i \sim \mathcal{N}(0, \sigma^2)$ capturing the noise in the measurements.
Hence, all clients' optimal solutions lie in the same $k$-dimensional subspace.

Algorithm 2 FedRep-SRPFL (Linear Representation)
1: Input: Step size $\eta$; batch size $m$; initial nodes $n_0$.
2: Initialization: Client $i \in [N]$ sends to server: $P_i := \frac{1}{m} \sum_{j=1}^{m} (y_i^{0,j})^2 x_i^{0,j} (x_i^{0,j})^\top$; $n \leftarrow n_0$.
3: Server finds $U D U^\top \leftarrow$ rank-$k$ SVD$\big(\frac{1}{N} \sum_{i=1}^{N} P_i\big)$.
4: for $r = 0, 1, 2, \dots, \log(N/n_0)$ do
5:   Server initializes representation $B^{r,0} \leftarrow U$.
6:   for $t = 1, 2, \dots, \tau_r$ do
7:     Server sends $B^{r,t}$ to $N$ clients sampled from $[M]$.
8:     for $i \in \{1, \dots, n\}$ do
9:       Client $i$ samples a fresh batch of $m$ samples.
10:      Client $i$ computes $w_i^{r,t+1} \leftarrow \arg\min_w \hat{f}_i^t(w, B^{r,t})$.
11:      Client $i$ computes $\tilde{B}_i^{r,t+1} \leftarrow B^{r,t} - \eta \nabla_B \hat{f}_i^t(w_i^{t+1}, B^{r,t})$ and sends it back to the server.
12:     end for
13:     Server computes $\tilde{B}^{r,t+1} \leftarrow \frac{1}{n} \sum_{i \in \mathcal{I}_t} \tilde{B}_i^{r,t+1}$.
14:     Server computes $B^{r,t+1}, R^{r,t+1} \leftarrow \mathrm{QR}(\tilde{B}^{r,t+1})$.
15:   end for
16:   Server sets $U \leftarrow B^{r,t+1}$ and $n \leftarrow \min\{N, 2n\}$.
17: end for

Under these assumptions, the global expected risk is

$$\min_{B, W} \frac{1}{2M} \sum_{i=1}^{M} \mathbb{E}_{(x_i, y_i) \sim \mathcal{D}_i}\Big[\big(y_i - w_i^\top B^\top x_i\big)^2\Big], \qquad (5)$$

where $W = [w_1, \dots, w_N]^\top \in \mathbb{R}^{N \times k}$ is the concatenation of the client-specific heads. Since the true distributions $\mathcal{D}_i$ are unknown, algorithms strive to minimize the empirical risk instead. The global empirical risk over all clients is

$$\frac{1}{M} \sum_{i=1}^{M} \hat{f}_i(w_i, B) = \frac{1}{2Mm} \sum_{i=1}^{M} \sum_{j=1}^{m} \big(y_i^j - w_i^\top B^\top x_i^j\big)^2, \qquad (6)$$

where $m$ is the number of samples at each client. The global loss in (6) is nonconvex and has many global minima, including all pairs $(W^* Q^{-1}, B^* Q^\top)$, where $Q \in \mathbb{R}^{k \times k}$ is some invertible matrix. Thus, the server aims to retrieve the column space of $B^*$ instead of the ground-truth factors $(W^*, B^*)$. To measure closeness between column spaces, we adopt the metric of principal angle distance (Jain et al., 2013).

Definition 4.1. Let $B_1, B_2 \in \mathbb{R}^{d \times k}$, and let $\hat{B}_1, \hat{B}_2$ be orthonormal matrices such that $\mathrm{span}(\hat{B}_1) = \mathrm{span}(B_1)$ and $\mathrm{span}(\hat{B}_2) = \mathrm{span}(B_2)$.
The principal angle distance between the column spaces of $B_1$ and $B_2$ is defined as $\mathrm{dist}(B_1, B_2) := \|\hat{B}_{1,\perp}^\top \hat{B}_2\|_2$, where $\hat{B}_{1,\perp}$ is an orthonormal matrix spanning the orthogonal complement of $\mathrm{span}(B_1)$.

Federated Representation Learning (FedRep) is an alternating minimization-descent algorithm, recently proposed in (Collins et al., 2021) for the linear shared representation framework. SRPFL coupled with FedRep gives rise to Algorithm 2. Below we highlight the main points of interest. In the initialization phase (lines 2 and 3), a model of bounded distance from the optimal representation is obtained via the Method of Moments (Tripuraneni et al., 2021). Subsequently, at every round $t$, client $i$ samples a fresh batch of samples $\{x_i^{t,j}, y_i^{t,j}\}_{j=1}^{m}$ from its local distribution (line 9), and thus the corresponding loss function becomes $\hat{f}_i^t(w_i, B) := \frac{1}{2m} \sum_{j=1}^{m} (y_i^{t,j} - w_i^\top B^\top x_i^{t,j})^2$. Utilizing the global representation provided by the server, client $i$ computes the optimal head $w_i$ (line 10). Fixing the newly computed head, client $i$ proceeds to update its global representation model with one step of gradient descent (line 11) and transmits it back to the server. As depicted in lines 13 and 14, the parameter server averages the models received and orthogonalizes the resulting matrix before broadcasting it to the clients, a component of crucial importance in our analysis. Mapping this method back to Algorithm 1, we note that the number of representation model updates $\tau_\phi$ is set to 1, whereas the number of head updates $\tau_h$ is sufficiently large to derive (approximately) optimal solutions. This imbalance is designed to take advantage of the inherent structure of our problem, where the size of the $w_i$'s is significantly smaller than $d$. Finally, we point out that the number of communication rounds per stage $\tau_r$ is a small constant, known a priori to the algorithm and specified by our analysis.

Remark 4.2.
Although our proposed method utilizes FedRep as a backbone, our analysis and framework differ substantially from the ones in (Collins et al., 2021). Specifically, Collins et al. (2021) assume access to (i) infinitely many samples that are (ii) noiseless. Additionally, (iii) the number of participating nodes remains fixed throughout the training, and (iv) the focus lies solely on handling heterogeneous data, with system heterogeneity being an orthogonal direction to their work. In contrast, our analysis requires only (i) finite and (ii) noisy samples, and our theoretical results (Theorems 4.7 and 4.9) revolve around (iii) participating subsets of different sizes and (iv) regimes where both data and system heterogeneity are prevalent.

4.1 Theoretical Results

In this subsection, we provide a rigorous analysis of FedRep-SRPFL in the linear representation setting. First, we present the notion of Wall Clock Time (WCT), which is the measure of performance for our algorithm. Subsequently, we illustrate the contraction inequality that determines the rate at which the distance to the optimal representation diminishes. We conclude by showing that FedRep-SRPFL outperforms its straggler-prone variant by a factor of $O(\log N)$, under the standard assumption that clients' computational times follow the exponential distribution. Before we proceed, we introduce the necessary notation and assumptions:

$$E_0 := 1 - \mathrm{dist}^2(B^0, B^*), \qquad (7)$$
$$\bar{\sigma}_{\max,*} := \max_{\mathcal{I} \subseteq [N],\, |\mathcal{I}| = n,\, n_0 \le n \le N} \sigma_{\max}\big(\tfrac{1}{\sqrt{n}} W_{\mathcal{I}}^*\big), \qquad (8)$$
$$\bar{\sigma}_{\min,*} := \min_{\mathcal{I} \subseteq [N],\, |\mathcal{I}| = n,\, n_0 \le n \le N} \sigma_{\min}\big(\tfrac{1}{\sqrt{n}} W_{\mathcal{I}}^*\big), \qquad (9)$$

where $W_{\mathcal{I}}^*$ denotes the submatrix of $W^*$ with rows indexed by $\mathcal{I}$.

Assumption 4.3 (Sub-Gaussian design). The local samples $x_i \in \mathbb{R}^d$ are i.i.d. with mean $0$, covariance $I_d$, and are $I_d$-sub-Gaussian, i.e., $\mathbb{E}[e^{v^\top x_i}] \le e^{\|v\|_2^2/2}$ for all $v \in \mathbb{R}^d$.

Assumption 4.4 (Client diversity). Let $\bar{\sigma}_{\min,*}$, defined in (9), be the minimum singular value of any matrix that can be obtained by taking $n$ rows of $\frac{1}{\sqrt{n}} W^*$. Then $\bar{\sigma}_{\min,*} > 0$. Specifically, our theoretical analysis requires Assumption 4.4 to be satisfied for every $n \ge n_0$.
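Assumption 4.4 can be checked numerically on small instances (a brute-force sketch over all row subsets; the function name and the Gaussian test matrix are hypothetical, and the enumeration is only feasible for tiny N):

```python
import itertools
import numpy as np

def min_client_diversity(W, n0):
    """Smallest sigma_min over all row-subsets I with |I| = n, n0 <= n <= N,
    of the matrix (1/sqrt(n)) W_I, mirroring definition (9)."""
    N, k = W.shape
    worst = np.inf
    for n in range(n0, N + 1):
        for rows in itertools.combinations(range(N), n):
            s = np.linalg.svd(W[list(rows)] / np.sqrt(n), compute_uv=False)
            worst = min(worst, s[-1])            # s[-1] is sigma_min
    return worst

rng = np.random.default_rng(2)
W_star = rng.standard_normal((6, 2))             # N=6 ground-truth heads in R^2
sigma_min_star = min_client_diversity(W_star, n0=2)
# Assumption 4.4 holds iff sigma_min_star > 0: any n >= n0 heads span R^k.
```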
Assumption 4.4 implies that the optimal heads of the participating clients span $\mathbb{R}^k$. This is true in many FL regimes, as the number of clients is usually much larger than the dimension of the shared representation.

Remark 4.5. In this work we consider client speeds to be independent of the local data available to them. This is a natural assumption, since the computational power of each client crucially depends on their device characteristics (battery, CPU, etc.), whereas any connection to their local data is unclear. However, in the presence of strong correlation between data and system heterogeneity, Assumption 4.4 may not hold, which can be seen as a potential limitation of our work and an interesting future direction to explore.

Assumption 4.6 (Client normalization). The ground-truth client-specific parameters satisfy $\|w_i^*\|_2 = \sqrt{k}$ for all $i \in [N]$, and $B^*$ has orthonormal columns.

Assumption 4.6 ensures the ground-truth matrix $W^* B^{*\top}$ is row-wise incoherent, i.e., its row norms have similar magnitudes. This is of vital importance, since our measurement matrices are row-wise sparse and incoherence is a common requirement in sensing problems with sparse measurements.

Wall Clock Time. To measure the speedup that our meta-algorithm enjoys, we use the concept of real time, or WCT, as described below. FedRep-SRPFL runs in communication rounds grouped into stages. Consider such a round $t$ at stage $r$, with nodes $\{1, 2, \dots, n_r\}$ participating in the learning process. Here $n_r$ denotes the slowest participating node. The expected amount of time that the server has to wait for the updates to take place is $\mathbb{E}[T_{n_r}^r]$. Put simply, the expected computational time of the slowest node acts as the bottleneck for the round. Further, at the beginning and at the end of every round, models are exchanged between the server and the clients. This incurs an additional, fixed communication cost $C$.
If τ_r communication rounds take place at every stage r, then the overall expected WCT for FedRep-SRPFL is

E[T_SRPFL] = Σ_{r=0}^{log(N/n₀)} τ_r (E[T^r_{n_r}] + C).

Similarly, the total expected runtime for FedRep can be expressed in terms of the total number of rounds, T_FR, as E[T_FedRep] = T_FR (E[T^r_N] + C). Taking the ratio of these quantities yields the desired speedup guarantee.

Figure 3: Numerical results on CIFAR10, CIFAR100, EMNIST, FEMNIST with full participation (M = N) in the fixed computation speeds setting. Shard denotes the number of classes per client. C.T.
denotes the communication cost per round.

Contraction Inequality. Theorem 4.7 captures the progress made between two consecutive rounds of FedRep-SRPFL. It follows that convergence to the optimal representation is exponentially fast, provided that the number of participating nodes and the batch size are sufficiently large.

Theorem 4.7. Let Assumptions 4.3-4.6 hold. Further, let the number of participating nodes and the batch size satisfy, respectively, n ≥ n₀ and m ≥ c₀ ((1+σ²)k³κ⁴/E₀²) max{log(N), d/n₀}, for some absolute constant c₀. Then FedRep-SRPFL with stepsize η ≤ 1/(8σ²_{max,-}) satisfies the following contraction inequality:

dist(B̂^{t+1}, B̂*) ≤ dist(B̂^t, B̂*) √(1 − a) + α/(√(n/n₀)(1 − a)),   (10)

w.p. at least 1 − T exp(−90 min{d, k² log(N)}), where a = (1/2) η E₀ σ²_{min,-}. Here T denotes the total number of communication rounds, which is logarithmic w.r.t. the target error ϵ. The initial representation computed by the Method of Moments satisfies dist(B̂⁰, B̂*) ≤ 1 − C_M, for some constant C_M. Since E₀ is strictly greater than zero, inequality (10) ensures contraction.

Remark 4.8. Theorem 4.7 suggests that the server can learn the ground-truth representation before some of the clients update their local heads or participate in the learning process. This might raise concerns about fairness or accuracy; however, it is important to highlight that such concerns are unfounded. This is because, after obtaining the ground-truth representation, the server shares it with all the clients in the system. Thus, even if a client i was not selected in the representation learning process, it can still optimize its low-dimensional head w_i ∈ R^k using its local samples through a few local updates (given that k is a small constant). Consequently, the derivation of the ground-truth model benefits both the clients that already participated in the learning procedure and new clients who opt to join the federated system at a later time.

Logarithmic Speedup.
Algorithm 2 sets off with n₀ participating clients and follows a doubling scheme, so that at stage r only the fastest 2^r n₀ nodes contribute to the learning process. Thus at stage r, inequality (10) can be written as:

dist⁺ ≤ dist · √(1 − α) + α/(√(2^r)(1 − α)).   (11)

Figure 4: Numerical results on CIFAR10, CIFAR100, EMNIST, FEMNIST with partial participation (N = M/5) in the fixed computational speeds setting. Shard denotes the number of classes per client. C.T. denotes the communication cost per round.

We note that the second term on the r.h.s. is an artifact of the noisy measurements. Utilizing geometric-series properties, we can deduce that in the limit, contraction (11) converges to α/(√(2^r)(1 − α)(1 − √(1 − α))).
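The geometric-series claim can be checked numerically. The sketch below (illustrative values of our own choosing) iterates a recurrence of the form d ← d·√(1 − α) + c, where the constant c is a generic stand-in for the stage-r noise term, and verifies convergence to the fixed point c/(1 − √(1 − α)).

```python
import math

alpha = 0.5   # contraction parameter (illustrative value)
c = 0.01      # stand-in for the stage-r noise term alpha / (sqrt(2^r) (1 - alpha))

# Iterate the contraction  d <- d * sqrt(1 - alpha) + c
d = 1.0       # initial distance to the optimal representation
for _ in range(200):
    d = d * math.sqrt(1 - alpha) + c

# Summing the geometric series gives the fixed point  c / (1 - sqrt(1 - alpha)).
limit = c / (1 - math.sqrt(1 - alpha))
assert abs(d - limit) < 1e-12
```

The distance contracts geometrically toward this floor rather than to zero, which is exactly the noise artifact described above: halving the noise constant halves the achievable error floor.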
This implies that the achievable error of our algorithm is lower bounded by α/(√(N/n₀)(1 − α)(1 − √(1 − α))), since the total number of stages is at most r = log(N/n₀). To illustrate the theoretical benefits of SRPFL, we compare Algorithm 2 to FedRep. One can distill FedRep from Algorithm 2 by disregarding the doubling scheme and instead, at each round, randomly sampling N nodes to participate. For a fair comparison between the two methods, we set the target error small enough so that the contribution of all N nodes is necessary. Specifically, we express the error ϵ as

ϵ = ĉ · α/(√(N/n₀)(1 − α)(1 − √(1 − α))), with 2 > ĉ > 1.   (12)

Intuitively, one should expect FedRep-SRPFL to vastly outperform straggler-prone FedRep as ĉ approaches 2 (large error), since in this case the biggest chunk of the workload is completed before FedRep-SRPFL utilizes the slower half of the clients. In contrast, FedRep experiences heavy delays throughout the whole training process due to the inclusion of stragglers at every round. Conversely, as ĉ approaches 1 (small error), the number of rounds FedRep-SRPFL spends utilizing all N clients increases. In this case, one should expect the speedup achieved by FedRep-SRPFL in early stages to eventually become negligible. Theorem 4.9 provides a rigorous exposition of these insights.

Theorem 4.9. Suppose that at each stage the clients' computational times are i.i.d. random variables drawn from the exponential distribution with parameter λ. Further, suppose that the expected communication cost per round is C = c/λ, for some constant c. Finally, consider the target error ϵ given in (12).
Then, we have

E[T_SRPFL] / E[T_FedRep] = O( log(1/(ĉ − 1)) / (log(N) + log(1/(ĉ − 1))) ).

Figure 5: Numerical results on CIFAR10, CIFAR100, EMNIST, FEMNIST with full participation (M = N) in the random (dynamic) computation speeds setting. Shard denotes the number of classes per client. C.T. denotes the communication cost per round.

Theorem 4.9 establishes an O(log N) speedup for our method compared to its straggler-prone benchmark.
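The exponential assumption is what makes the straggler penalty logarithmic: the expected maximum of n i.i.d. Exp(λ) variables is exactly H_n/λ ≈ log(n)/λ, so a round that waits for the slowest of all N clients pays a Θ(log N) factor that the doubling scheme mostly avoids. A quick Monte-Carlo check of this order-statistics fact (our own illustration, pure Python):

```python
import random

random.seed(0)
lam = 1.0

def mean_max_exponential(n, trials=50000):
    """Monte-Carlo estimate of E[max of n i.i.d. Exp(lam)]: the per-round
    bottleneck when all n participating clients must report back."""
    return sum(max(random.expovariate(lam) for _ in range(n))
               for _ in range(trials)) / trials

# For exponentials the expected maximum equals the harmonic number H_n / lam,
# which grows like log(n) / lam.
for n in (8, 64):
    harmonic = sum(1.0 / i for i in range(1, n + 1)) / lam
    assert abs(mean_max_exponential(n) - harmonic) < 0.05
```

Going from 8 to 64 clients thus raises the expected per-round bottleneck only from H_8 ≈ 2.72/λ to H_64 ≈ 4.74/λ, but it is exactly this logarithmic growth that the ratio in Theorem 4.9 captures.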
This result holds both when speeds are drawn once at the beginning of the process and when new speeds are drawn at every round, thus rendering our method versatile in a broad class of settings.

Remark 4.10. Our analysis is crucially intertwined with the representation learning framework, where the presence of a shared, global representation serves as common ground across data-heterogeneous clients, allowing us to show that intermediate solutions constitute good starting points and that substantial progress is achieved between stages. Despite FedAvg being a general-purpose algorithm not designed for representation learning, it was recently shown to recover the ground-truth representation in the case of multi-task linear regression (Collins et al., 2022), thus casting FedAvg as a potential candidate subroutine for our method.

Remark 4.11. The initialization phase of Algorithm 2 requires a one-time exchange of information between the clients and the server. Although this process reveals only the sum of outer products of local samples, it can be further fortified using differential privacy techniques, such as the ones in (Jain et al., 2021; Shen et al., 2022).

5 Experiments

In our empirical study we consider classification tasks on the CIFAR10, CIFAR100, EMNIST, FEMNIST and Sent140 datasets. We conduct experiments under the full and partial participation schemes with different computation speed distributions, comparing the performance of our proposed method against other state-of-the-art benchmarks. Due to space limitations, in this section we present the results for the image classification tasks in the full and partial participation regimes with fixed and dynamic computation speeds. Additional results and extensive discussion can be found in Appendix C.

Baselines. The first benchmarks under consideration are FedRep (Collins et al., 2021) and Local-Global FedAvg (LG-FedAvg) (Liang et al., 2020).
These federated methods utilize a mixture of global and local models to derive personalized solutions with small global loss. Coupling these algorithms with our proposed doubling scheme gives rise to FedRep-SRPFL and LG-SRPFL, respectively, which are our proposed algorithms for the numerical experiments.

Figure 6: Numerical results on CIFAR10, CIFAR100, EMNIST, FEMNIST with partial participation (N = M/5) in the random (dynamic) computation speeds setting. Shard denotes the number of classes per client. C.T. denotes the communication cost per round.
We further compare our methods with FLANP (Reisizadeh et al., 2022) and FedAvg (McMahan et al., 2017), with and without fine-tuning. The fine-tuning variants are considered because they often lead to better performance in data-heterogeneous settings (Wang et al., 2019; Yu et al., 2020). Our final baseline is the HF-MAML algorithm from (Fallah et al., 2020a), a Model-Agnostic Meta-Learning-based method producing high-quality personalized solutions in data-heterogeneous settings.

Data allocation. To ensure that our data allocation is heterogeneous, we randomly split the data points among the clients in such a way that each client can only observe a specific subset of classes. For instance, in the CIFAR10 dataset, which has 10 different classes in total, each client is only assigned data from 5 different classes, which we refer to as Shards; see the first column of Figure 3. We also make sure that the test set of each client is consistent with the samples they have access to at training time; e.g., if client i only observes samples with labels 1, 4, 5 during training, then at test time they are only asked to classify samples from those same classes. Further details and different allocation schemes are presented in Appendix C.

Simulation of System Heterogeneity. We consider two different types of client speed configurations:

Fixed computation speeds. In this configuration we sample a value for each client from the exponential distribution with parameter λ, once at the beginning of the training process. These personalized values capture the computational time of every client and remain fixed throughout the training procedure. In addition to the computational time, our methods suffer a fixed communication cost at every round. In Figures 3 to 6, each row depicts the effects of a different value of the communication cost on the convergence of the algorithms under consideration (C.T. = 0, 10, 100).

Dynamic computation speeds.
In this configuration, every client samples their processing time at every round from a personalized exponential distribution that remains fixed throughout the process. Specifically, at the beginning of the training process we sample for each client i a parameter λ_i from the uniform distribution over [1/M, 1]. Subsequently, at every round we sample a new computational time for each client i from the exponential distribution with parameter λ_i. Similarly to the former setting, an additional, fixed communication cost is incurred at every round, contributing to the overall running time. As we illustrate in Figure 5 (full participation) and Figure 6 (partial participation - 20%), the experimental results under this configuration are qualitatively similar to the ones presented in Figure 3 (full participation) and Figure 4 (partial participation - 20%) for fixed computation speeds, with SRPFL providing substantial speedup over straggler-prone variants.

¹For reference, the computational times are sampled with λ = 1.

Results and Discussion. From the numerical results in Figures 3 to 6 we distill the following takeaways: 1) Coupling SRPFL with FedRep exhibits consistently superior performance across different datasets and communication time regimes compared to all proposed baselines. 2) Applying SRPFL to personalized FL solvers (FedRep and LG-FedAvg) significantly enhances their efficiency. 3) The speedup achieved by our meta-algorithm is more significant for smaller values of communication time. Concretely, we observe that the gap between FedRep-SRPFL and FedRep, as well as between LG-SRPFL and LG-FedAvg, diminishes as the communication cost increases (plots in the same column, C.T. = 0, 10, 100). This is unsurprising, since our method improves the computational cost of training, and thus when the communication cost dominates the overall running time, the benefits of SRPFL are less apparent.
4) FedRep-SRPFL vastly outperforms the fine-tuning variant of the previously proposed FLANP, especially in regimes with high data heterogeneity (FEMNIST).

6 Conclusion

In this paper, we proposed SRPFL, a straggler-resilient FL meta-algorithm with near-optimal sample complexity and provable logarithmic speedup guarantees in regimes with data and system heterogeneity. Our method leverages ideas from representation learning theory to compute a global representation model along with local client heads, thus deriving personalized solutions for all clients. In SRPFL the participating clients are selected in an adaptive manner: in early stages fast nodes are prioritized, and progressively slower nodes are included in the training process, thereby mitigating the effects of stragglers without compromising the quality of the solutions. Our numerical results illustrated the benefits of SRPFL when coupled with different personalized FL methods such as FedRep and LG-FedAvg. Furthermore, our experiments support our theoretical findings, showcasing the superior performance of FedRep-SRPFL compared to state-of-the-art FL methods.

Acknowledgments

The research of I. Tziotis and A. Mokhtari is supported in part by NSF Grants 2019844 and 2112471, ARO Grant W911NF2110226, the Machine Learning Lab (MLL) at UT Austin, and the Wireless Networking and Communications Group (WNCG) Industrial Affiliates Program. The research of Z. Shen and H. Hassani is supported by the NSF Institute for CORE Emerging Methods in Data Science (EnCORE) as well as The Institute for Learning-enabled Optimization at Scale (TILOS). The research of R. Pedarsani is supported by NSF awards 2003035 and 2236483.

References

Acar, D. A. E., Zhao, Y., Matas, R., Mattina, M., Whatmough, P., and Saligrama, V. (2021). Federated learning based on dynamic regularization. In International Conference on Learning Representations.

Balakrishnan, R., Li, T., Zhou, T., Himayat, N., Smith, V., and Bilmes, J. (2021).
Diverse client selection for federated learning via submodular maximization. In International Conference on Learning Representations.

Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828.

Chen, H., Ding, J., Tramel, E. W., Wu, S., Sahu, A. K., Avestimehr, S., and Zhang, T. (2022). ActPerFL: Active personalized federated learning. In ACL 2022 Workshop on Federated Learning for Natural Language Processing.

Cho, Y. J., Wang, J., Chiruvolu, T., and Joshi, G. (2021). Personalized federated learning for heterogeneous clients with clustered knowledge transfer. arXiv preprint arXiv:2109.08119.

Cho, Y. J., Wang, J., and Joshi, G. (2022). Towards understanding biased client selection in federated learning. In International Conference on Artificial Intelligence and Statistics, pages 10351-10375. PMLR.

Collins, L., Hassani, H., Mokhtari, A., and Shakkottai, S. (2021). Exploiting shared representations for personalized federated learning. In Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR.

Collins, L., Hassani, H., Mokhtari, A., and Shakkottai, S. (2022). FedAvg with fine tuning: Local updates lead to representation learning. Advances in Neural Information Processing Systems, 35:10572-10586.

Deng, Y., Kamani, M. M., and Mahdavi, M. (2020). Adaptive personalized federated learning. arXiv preprint arXiv:2003.13461.

Fallah, A., Mokhtari, A., and Ozdaglar, A. (2020a). On the convergence theory of gradient-based model-agnostic meta-learning algorithms. In International Conference on Artificial Intelligence and Statistics. PMLR.

Fallah, A., Mokhtari, A., and Ozdaglar, A. (2020b). Personalized federated learning: A meta-learning approach. arXiv preprint arXiv:2002.07948.

Haddadpour, F., Kamani, M.
M., Mokhtari, A., and Mahdavi, M. (2021). Federated learning with compression: Unified analysis and sharp guarantees. In International Conference on Artificial Intelligence and Statistics. PMLR.

Hanzely, F. and Richtárik, P. (2020). Federated learning of a mixture of global and local models. arXiv preprint arXiv:2002.05516.

Horváth, S., Sanjabi, M., Xiao, L., Richtárik, P., and Rabbat, M. (2022). FedShuffle: Recipes for better use of local work in federated learning. arXiv preprint arXiv:2204.13169.

Hu, S., Wu, Z. S., and Smith, V. (2021). Private multi-task learning: Formulation and applications to federated learning. arXiv preprint arXiv:2108.12978.

Jain, P., Netrapalli, P., and Sanghavi, S. (2013). Low-rank matrix completion using alternating minimization. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, STOC '13, pages 665-674, New York, NY, USA. Association for Computing Machinery.

Jain, P., Rush, J., Smith, A., Song, S., and Guha Thakurta, A. (2021). Differentially private model personalization. Advances in Neural Information Processing Systems, 34:29723-29735.

Jiang, L. and Lin, T. (2022). Test-time robust personalization for federated learning. arXiv preprint arXiv:2205.10920.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. (2021). Advances and open problems in federated learning. Foundations and Trends in Machine Learning.

Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S. J., Stich, S. U., and Suresh, A. T. (2019). SCAFFOLD: Stochastic controlled averaging for on-device federated learning. arXiv preprint arXiv:1910.06378.

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521:436-444.

Lee, S., Sahu, A. K., He, C., and Avestimehr, S. (2022). Partial model averaging in federated learning: Performance guarantees and benefits. arXiv preprint arXiv:2201.03789.

Li, T., Sahu, A.
K., Zaheer, M., Sanjabi, M., Talwalkar, A., and Smith, V. (2020). Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2:429-450.

Liang, P. P., Liu, T., Ziyin, L., Allen, N. B., Auerbach, R. P., Brent, D., Salakhutdinov, R., and Morency, L.-P. (2020). Think locally, act globally: Federated learning with local and global representations. arXiv preprint arXiv:2001.01523.

Mansour, Y., Mohri, M., Ro, J., and Suresh, A. T. (2020). Three approaches for personalization with applications to federated learning. arXiv preprint arXiv:2002.10619.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and Arcas, B. A. y. (2017). Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research. PMLR.

Mitra, A., Jaafar, R., Pappas, G. J., and Hassani, H. (2021). Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W., editors, Advances in Neural Information Processing Systems, volume 34, pages 14606-14619. Curran Associates, Inc.

Nguyen, J., Malik, K., Zhan, H., Yousefpour, A., Rabbat, M., Malek, M., and Huba, D. (2022). Federated learning with buffered asynchronous aggregation. In International Conference on Artificial Intelligence and Statistics, pages 3581-3607. PMLR.

Nishio, T. and Yonetani, R. (2019). Client selection for federated learning with heterogeneous resources in mobile edge. In ICC 2019 - 2019 IEEE International Conference on Communications (ICC), pages 1-7. IEEE.

Pillutla, K., Malik, K., Mohamed, A.-R., Rabbat, M., Sanjabi, M., and Xiao, L. (2022). Federated learning with partial model personalization. In International Conference on Machine Learning, pages 17716-17758. PMLR.
Reisizadeh, A., Taheri, H., Mokhtari, A., Hassani, H., and Pedarsani, R. (2019). Robust and communication-efficient collaborative learning. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 8388-8399.

Reisizadeh, A., Tziotis, I., Hassani, H., Mokhtari, A., and Pedarsani, R. (2022). Straggler-resilient federated learning: Leveraging the interplay between statistical accuracy and system heterogeneity. IEEE Journal on Selected Areas in Information Theory.

Shen, Z., Ye, J., Kang, A., Hassani, H., and Shokri, R. (2022). Share your representation only: Guaranteed improvement of the privacy-utility tradeoff in federated learning. In The Eleventh International Conference on Learning Representations.

Smith, V., Chiang, C.-K., Sanjabi, M., and Talwalkar, A. S. (2017). Federated multi-task learning. Advances in Neural Information Processing Systems.

So, J., Ali, R. E., Güler, B., and Avestimehr, A. S. (2021). Secure aggregation for buffered asynchronous federated learning. arXiv preprint arXiv:2110.02177.

Stich, S. U. (2019). Local SGD converges fast and communicates little. In ICLR 2019 - International Conference on Learning Representations.

Tandon, R., Lei, Q., Dimakis, A. G., and Karampatziakis, N. (2017). Gradient coding: Avoiding stragglers in distributed learning. In Precup, D. and Teh, Y. W., editors, Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR.

Tripuraneni, N., Jin, C., and Jordan, M. (2021). Provable meta-learning of linear representations. In International Conference on Machine Learning, pages 10434-10443. PMLR.

Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.

Wang, J., Liu, Q., Liang, H., Joshi, G., and Poor, H. V. (2020a).
Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in Neural Information Processing Systems.

Wang, K., Mathews, R., Kiddon, C., Eichner, H., Beaufays, F., and Ramage, D. (2019). Federated evaluation of on-device personalization. arXiv preprint arXiv:1910.10252.

Wang, S., Liu, J., and Shroff, N. (2018). Coded sparse matrix multiplication. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5152-5160. PMLR.

Wang, S., Liu, J., and Shroff, N. B. (2020b). Fundamental limits of approximate gradient coding. ACM SIGMETRICS Performance Evaluation Review, 48:21-22.

Wu, S., Li, T., Charles, Z., Xiao, Y., Liu, Z., Xu, Z., and Smith, V. (2022). Motley: Benchmarking heterogeneity and personalization in federated learning.

Xie, C., Koyejo, S., and Gupta, I. (2019). Asynchronous federated optimization. arXiv preprint arXiv:1903.03934.

Yang, H., Zhang, X., Khanduri, P., and Liu, J. (2022). Anarchic federated learning. In International Conference on Machine Learning, pages 25331-25363. PMLR.

Yu, T., Bagdasaryan, E., and Shmatikov, V. (2020). Salvaging federated learning by local adaptation. arXiv preprint arXiv:2002.04758.

Zhu, C., Xu, Z., Chen, M., Konečný, J., Hard, A., and Goldstein, T. (2021). Diurnal or nocturnal? Federated learning of multi-branch networks from periodically shifting distributions. In International Conference on Learning Representations.

Before we dive into the analysis we provide the following useful definition.

Definition A.1. For a random vector x ∈ R^{d₁} and a fixed matrix A ∈ R^{d₁×d₂}, the vector Aᵀx is called ‖A‖₂-sub-Gaussian if yᵀAᵀx is sub-Gaussian with sub-Gaussian norm O(‖A‖₂‖y‖₂) for all y ∈ R^{d₂}, i.e., E[exp(yᵀAᵀx)] ≤ exp(‖y‖₂²‖A‖₂²/2).
We study the performance of SRPFL with FedRep as the subroutine of choice. The first part of our analysis focuses on a single round t and extends the analysis in Collins et al. (2021). We assume that there are N clients in the network and that at round t a subset I_t of them participates in the learning procedure, with cardinality n ≥ n₀ := 2d log(N)/σ²_{max,-}. Without loss of generality, we assume that the clients are indexed from fastest to slowest, so that the clients participating in the learning process are all i ∈ [n]. Each client i draws a batch of m ≥ c₀(1+σ²)k³κ⁴ log(N)/E₀² fresh, i.i.d. samples at every round. We denote by X^t_i ∈ R^{m×d} and Y^t_i ∈ R^m the matrix of samples and the labels of client i, such that the rows of X^t_i are the samples {x¹_i, ..., x^m_i}. By Z^t_i ∈ R^m we denote the noise in the measurements of client i, with z_{i,j} ∼ N(0, σ²). Let B̂* ∈ R^{d×k} and W* ∈ R^{N×k} stand for the optimal representation and the concatenation of the optimal heads, respectively. The hat denotes that a matrix is orthonormal, i.e., its columns form an orthonormal set. Similarly, B̂^t ∈ R^{d×k} and W^t ∈ R^{N×k} denote the global representation and the concatenation of the heads at round t, while w*_i and w^t_i denote the optimal head and the head at round t of client i, which constitute the rows of W* and W^t, respectively. Furthermore, we define

σ_{min,-} := min_{I⊆[N], |I|=n′, n′≤N} σ_min((1/√n′) W*_I)  and  σ_{max,-} := max_{I⊆[N], |I|=n′, n′≤N} σ_max((1/√n′) W*_I),

where W*_I is formed by taking the rows of W* indexed by I. That is, σ_{max,-} and σ_{min,-} are the maximum and minimum singular values of any submatrix W*_I that can be obtained throughout the course of our algorithm. Notice that by Assumption 4.6 each row of W* has norm √k, so 1/√n′ acts as a normalization factor such that ‖(1/√n′) W*_I‖₂ ≤ √k. Finally, we define κ = σ_{max,-}/σ_{min,-}. Since we focus on a single round, the time index can be dropped for simplicity. Further, henceforth we drop the subscript I_t on W^t. First we derive the update scheme our algorithm follows.
Notice that the empirical objective function given in (6) can be expressed via the matrices X_i and Y_i as

L_N(B, W) = (1/2mn) Σ_{i=1}^n ‖Y_i − X_iB̂w_i‖₂².   (13)

Further, computing the gradients we derive

∇_B L_N = (1/mn) Σ_{i=1}^n X_iᵀ(X_iB̂w_i − Y_i)w_iᵀ,   (14)
∇_{w_i} L_N = (1/mn) B̂ᵀX_iᵀ(X_iB̂w_i − Y_i),   (15)

and since (1/m)B̂ᵀX_iᵀX_iB̂ is invertible with high probability by Lemma A.4, solving for the minimizer gives us

w_i = ((1/m)B̂ᵀX_iᵀX_iB̂)⁻¹ (1/m)B̂ᵀX_iᵀY_i.   (16)

Thus our update scheme with stepsize η is, for all i ∈ [n],

w⁺_i = ((1/m)B̂ᵀX_iᵀX_iB̂)⁻¹ (1/m)B̂ᵀX_iᵀY_i,   (17)
B⁺ = B̂ − (η/mn) Σ_{i=1}^n X_iᵀ(X_iB̂w⁺_i − Y_i)(w⁺_i)ᵀ,   (18)
B̂⁺, R⁺ = QR(B⁺),   (19)

where QR denotes the QR decomposition and Y_i = X_iB̂*w*_i + Z_i.

Lemma A.2. For every client i the update for w_i can be expressed as follows:

w⁺_i = B̂ᵀB̂*w*_i + F_i + G_i,   (20)

where F_i and G_i are defined in equations (24) and (25), respectively.

Proof. Further expanding (17) we can write

w⁺_i = ((1/m)B̂ᵀX_iᵀX_iB̂)⁻¹ (1/m)B̂ᵀX_iᵀX_iB̂*w*_i + ((1/m)B̂ᵀX_iᵀX_iB̂)⁻¹ (1/m)B̂ᵀX_iᵀZ_i   (21)
= B̂ᵀB̂*w*_i + ((1/m)B̂ᵀX_iᵀX_iB̂)⁻¹ ((1/m)B̂ᵀX_iᵀX_iB̂* − ((1/m)B̂ᵀX_iᵀX_iB̂)B̂ᵀB̂*)w*_i + ((1/m)B̂ᵀX_iᵀX_iB̂)⁻¹ (1/m)B̂ᵀX_iᵀZ_i   (22)
= B̂ᵀB̂*w*_i + F_i + G_i,   (23)

where we define

F_i := ((1/m)B̂ᵀX_iᵀX_iB̂)⁻¹ ((1/m)B̂ᵀX_iᵀX_iB̂* − ((1/m)B̂ᵀX_iᵀX_iB̂)B̂ᵀB̂*)w*_i,   (24)
G_i := ((1/m)B̂ᵀX_iᵀX_iB̂)⁻¹ (1/m)B̂ᵀX_iᵀZ_i.   (25)

We further have the following immediate corollary.

Corollary A.3. Let W⁺, F and G be the matrices whose rows are the concatenations of the w⁺_i, F_i and G_i, respectively. Then

W⁺ = W*B̂*ᵀB̂ + F + G.   (26)

Our first goal is to control the norm of w⁺_i. In order to achieve that, we provide lemmas that bound the norms of F_i and G_i, extending the analysis in Collins et al. (2021) and Jain et al. (2013).

Lemma A.4. Let δ = c k^{3/2}√(log(N)/m) for some absolute constant c. Then with probability at least 1 − exp(−115k³ log(N)),

σ_min((1/m)B̂ᵀX_iᵀX_iB̂) ≥ 1 − δ for all i ∈ [n].   (27)

It follows that with the same probability,

σ_max(((1/m)B̂ᵀX_iᵀX_iB̂)⁻¹) ≤ 1/(1 − δ) for all i ∈ [n].   (28)

Proof.
First notice that we can rewrite
$$\frac{1}{m}\hat B^\top X_i^\top X_i\hat B = \sum_{j=1}^m \Big(\frac{1}{\sqrt m}\hat B^\top x_i^j\Big)\Big(\frac{1}{\sqrt m}\hat B^\top x_i^j\Big)^\top.$$
For all $i\in[n]$, $j\in[m]$ we define $v_i^j := \frac{1}{\sqrt m}\hat B^\top x_i^j$, such that each $v_i^j$ is an i.i.d. $\frac{1}{\sqrt m}\|\hat B\|_2$-subgaussian random vector (please see Definition A.1), and thus by equation (4.22) (Theorem 4.6.1) in Vershynin (2018) we obtain, for any $m\geq k$ and $l\geq 0$,
$$\Big\|\frac{1}{m}\hat B^\top X_i^\top X_i\hat B - I_k\Big\|_2 \leq c_1\,\frac{\sqrt k + l}{\sqrt m}$$
with probability at least $1-\exp(-l^2)$, for some absolute constant $c_1$. We set $l = 12k^{3/2}\sqrt{\log(N)} - \sqrt k$ and $\delta_1 := 12c_1 k^{3/2}\sqrt{\log(N)/m}$, and the above bound becomes
$$\Big\|\frac{1}{m}\hat B^\top X_i^\top X_i\hat B - I_k\Big\|_2 \leq \delta_1 \qquad (31)$$
with probability at least $1-\exp\big(-(12k^{3/2}\sqrt{\log(N)}-\sqrt k)^2\big)$. Further notice that
$$\exp\big(-(12k^{3/2}\sqrt{\log(N)}-\sqrt k)^2\big) = \exp\big(-(144k^3\log(N) - 24k^2\sqrt{\log(N)} + k)\big) \leq \exp\big(-120k^3\log(N)\big). \qquad (32)\text{–}(33)$$
Thus, taking a union bound over $i\in[n]$, we have that for all $i\in[n]$,
$$\Big\|\frac{1}{m}\hat B^\top X_i^\top X_i\hat B - I_k\Big\|_2 \leq \delta_1 \qquad (34)$$
with probability at least
$$1 - n\exp\big(-120k^3\log(N)\big) \geq 1-\exp\big(-115k^3\log(N)\big). \qquad (35)$$
Choosing $c$ sufficiently large derives the statement of the lemma. □

Lemma A.5. Let $H_i := \frac{1}{m}\hat B^\top X_i^\top X_i\big(\hat B\hat B^\top - I_d\big)\hat B^*$ and $\delta := c\,k^{3/2}\sqrt{\log(N)/m}$, for an absolute constant $c$. Then with probability at least $1-\exp(-115k^2\log(N))$ we have
$$\sum_{i=1}^n \big\|H_i w_i^*\big\|_2^2 \leq \delta^2\,\|W^*\|_2^2\,\mathrm{dist}^2\big(\hat B,\hat B^*\big). \qquad (36)$$

Proof. In order to argue about the quantity $H_i = \frac{1}{m}\hat B^\top X_i^\top X_i(\hat B\hat B^\top - I_d)\hat B^*$, we define the matrix $U := \frac{1}{\sqrt m}X_i(\hat B\hat B^\top - I_d)\hat B^*$, such that its $j$-th row $u_j = \frac{1}{\sqrt m}\hat B^{*\top}(\hat B\hat B^\top - I_d)x_i^j$ is subgaussian with norm at most $\frac{1}{\sqrt m}\|\hat B^{*\top}(\hat B\hat B^\top - I_d)\|_2$. Similarly we define $V := \frac{1}{\sqrt m}X_i\hat B$, such that its $j$-th row $v_j = \frac{1}{\sqrt m}\hat B^\top x_i^j$ has norm at most $\frac{1}{\sqrt m}\|\hat B\|_2$. We are now ready to use a concentration argument similar to Proposition 4.4.5 in Vershynin (2018). Let $S^{k-1}$ denote the unit sphere in $k$ dimensions and $N_k$ the $1/4$-net of cardinality $9^k$.
From equation (4.13) in Vershynin (2018) we have
$$\|H_i\|_2 = \Big\|\frac{1}{m}\hat B^\top X_i^\top X_i\big(\hat B\hat B^\top - I_d\big)\hat B^*\Big\|_2 = \big\|U^\top V\big\|_2 \leq 2\max_{p,y\in N_k}\Big|\sum_{j=1}^m \langle p, u_j\rangle\langle v_j, y\rangle\Big|. \qquad (38)$$
By definition $\langle p,u_j\rangle$ and $\langle v_j,y\rangle$ are subgaussian with norms $\frac{1}{\sqrt m}\|\hat B^{*\top}(\hat B\hat B^\top - I_d)\|_2 = \frac{1}{\sqrt m}\mathrm{dist}(\hat B,\hat B^*)$ and $\frac{1}{\sqrt m}\|\hat B\|_2 = \frac{1}{\sqrt m}$, respectively, and thus for all $j\in[m]$ the product $\langle p,u_j\rangle\langle v_j,y\rangle$ is subexponential with norm at most $\frac{C'}{m}\,\mathrm{dist}(\hat B,\hat B^*)$, for some constant $C'$. Note that
$$\mathbb{E}\big[\langle p,u_j\rangle\langle v_j,y\rangle\big] = \frac{1}{m}\,p^\top\hat B^{*\top}\big(\hat B\hat B^\top - I_d\big)\hat B\,y = 0, \qquad (39)$$
and thus we can use Bernstein's inequality to bound the sum of $m$ zero-mean subexponential random variables: for any fixed pair $p,y\in N_k$,
$$\Pr\Big[\Big|\sum_{j=1}^m\langle p,u_j\rangle\langle v_j,y\rangle\Big| \geq s\Big] \leq 2\exp\Big(-c_2\,m\min\Big(\frac{s^2}{\mathrm{dist}^2(\hat B,\hat B^*)},\ \frac{s}{\mathrm{dist}(\hat B,\hat B^*)}\Big)\Big)$$
for a constant $c_2$. Thus, taking a union bound over all the points in the net, we derive
$$\Pr\Big[2\max_{p,y\in N_k}\Big|\sum_{j=1}^m\langle p,u_j\rangle\langle v_j,y\rangle\Big| \geq 2s\Big] \leq 9^{2k}\cdot 2\exp\Big(-c_2\,m\min\Big(\frac{s^2}{\mathrm{dist}^2(\hat B,\hat B^*)},\ \frac{s}{\mathrm{dist}(\hat B,\hat B^*)}\Big)\Big).$$
Since $m > Ck^2\log(N)$, by setting $s = \mathrm{dist}(\hat B,\hat B^*)\sqrt{Ck^2\log(N)/(4m)}$ and using (38) we obtain
$$\|H_i\|_2 \leq \mathrm{dist}\big(\hat B,\hat B^*\big)\sqrt{\frac{Ck^2\log(N)}{m}}$$
with failure probability at most $9^{2k}\exp(-C\,c_2\,k^2\log(N)/4) \leq \exp(-120k^2\log(N))$ (45) for sufficiently large $C$. Using a union bound again over all participating clients, we get that for all $i\in[n]$,
$$\|H_i\|_2 \leq \mathrm{dist}\big(\hat B,\hat B^*\big)\sqrt{\frac{Ck^2\log(N)}{m}} \qquad (46)$$
with probability at least $1-n\exp(-120k^2\log(N)) \geq 1-\exp(-115k^2\log(N))$ (47). The above also implies that with the same probability
$$\sum_{i=1}^n\|H_i\|_2^2 \leq C\,n\,\mathrm{dist}^2\big(\hat B,\hat B^*\big)\frac{k^2\log(N)}{m}. \qquad (48)$$
Finally, notice that
$$\sum_{i=1}^n\big\|H_i w_i^*\big\|_2^2 \leq k\sum_{i=1}^n\|H_i\|_2^2 \leq C\,n\,\mathrm{dist}^2\big(\hat B,\hat B^*\big)\frac{k^3\log(N)}{m} \leq C\,\|W^*\|_2^2\,\mathrm{dist}^2\big(\hat B,\hat B^*\big)\frac{k^3\log(N)}{m}, \qquad (49)\text{–}(50)$$
where we used Assumption 4.6 (each $\|w_i^*\|_2^2 = k$) and the fact that the rank of $W^*$ is $k$, which gives $n = \|W^*\|_F^2/k \leq \|W^*\|_2^2$. Combining this with (48) and choosing a sufficiently large $c$, we derive the result. □

Building on the previous lemmas we can now bound the norm of $F_i$.

Lemma A.6. Let $\delta := c\,k^{3/2}\sqrt{\log(N)/m}$ for some absolute constant $c$ and for all $i\in[n]$ let $F_i$ be given by (24). Further let the matrix $F\in\mathbb{R}^{n\times k}$ be such that its rows are the concatenation of the $F_i$'s.
Then with probability at least $1-\exp(-110k^2\log(N))$ we have
$$\|F_i\|_2 \leq \frac{\delta}{1-\delta}\,\mathrm{dist}\big(\hat B,\hat B^*\big)\,\|w_i^*\|_2 \quad \forall i\in[n], \qquad (51)$$
$$\|F\|_F \leq \frac{\delta}{1-\delta}\,\mathrm{dist}\big(\hat B,\hat B^*\big)\,\|W^*\|_2. \qquad (52)$$

Proof. Notice that $F_i = -\big(\frac{1}{m}\hat B^\top X_i^\top X_i\hat B\big)^{-1}H_i w_i^*$, hence
$$\|F_i\|_2^2 \leq \Big\|\Big(\frac{1}{m}\hat B^\top X_i^\top X_i\hat B\Big)^{-1}\Big\|_2^2\,\|H_i\|_2^2\,\|w_i^*\|_2^2 \qquad (53)$$
$$\leq \frac{\delta^2}{(1-\delta)^2}\,\mathrm{dist}^2\big(\hat B,\hat B^*\big)\,\|w_i^*\|_2^2, \qquad (54)$$
which holds for all $i\in[n]$ with probability at least $1-\exp(-110k^2\log(N))$, by a union bound on the failure probabilities of (28) and (47). Similarly, we have
$$\|F\|_F^2 \leq \max_{i\in[n]}\Big\|\Big(\frac{1}{m}\hat B^\top X_i^\top X_i\hat B\Big)^{-1}\Big\|_2^2\,\sum_{i=1}^n\big\|H_i w_i^*\big\|_2^2 \qquad (55)\text{–}(56)$$
$$\leq \frac{\delta^2}{(1-\delta)^2}\,\mathrm{dist}^2\big(\hat B,\hat B^*\big)\,\|W^*\|_2^2, \qquad (57)$$
which holds with probability at least $1-\exp(-110k^2\log(N))$, taking a union bound on the failure probabilities of (28) and (36). □

We now turn our attention to deriving a bound for $\|G_i\|_2$.

Lemma A.7. Let $\delta := c\,k^{3/2}\sqrt{\log(N)/m}$ for some absolute constant $c$ and for all $i\in[n]$ let $G_i$ be given by (25). Further let the matrix $G\in\mathbb{R}^{n\times k}$ be such that its rows are the concatenation of the $G_i$'s. Then with probability at least $1-\exp(-110k^2\log(N))$ we have
$$\|G_i\|_2 \leq \frac{\delta}{1-\delta}\,\sigma^2 \quad \forall i\in[n], \qquad (58)$$
$$\|G\|_F \leq \frac{\delta}{1-\delta}\,\sqrt n\,\sigma^2. \qquad (59)$$

Proof. Notice that we can write
$$G_i = \Big(\frac{1}{m}\hat B^\top X_i^\top X_i\hat B\Big)^{-1}\frac{1}{m}\hat B^\top X_i^\top Z_i = \Big(\frac{1}{m}\hat B^\top X_i^\top X_i\hat B\Big)^{-1}\frac{1}{m}\sum_{j=1}^m z_i^j\,\hat B^\top x_i^j, \qquad (60)$$
and since $z_i^j\sim\mathcal N(0,\sigma^2)$, we can conclude that for all $i,j$ the vectors $z_i^j\,\hat B^\top x_i^j$ are i.i.d. zero-mean subexponential with norm at most $C_2'\sigma^2\|\hat B\|_2 = C_2'\sigma^2$, for some constant $C_2'$. Once again we denote by $S^{k-1}$ the unit sphere in $k$ dimensions and by $N_k$ the $1/4$-net of cardinality $9^k$. Using Bernstein's inequality and a union bound over all the points of the net, we follow the derivations of Lemma A.5 to get
$$\Pr\Big[\Big\|\frac{1}{m}\sum_{j=1}^m z_i^j\,\hat B^\top x_i^j\Big\|_2 \geq 2s\Big] \leq 9^{k+1}\exp\Big(-c_3\,m\min\Big(\frac{s^2}{\sigma^4},\ \frac{s}{\sigma^2}\Big)\Big). \qquad (61)$$
Since $m > C_2 k^2\log(N)$, by setting $s = \sigma^2\sqrt{C_2 k^2\log(N)/(4m)}$ we derive
$$\Big\|\frac{1}{m}\sum_{j=1}^m z_i^j\,\hat B^\top x_i^j\Big\|_2 \leq \sigma^2\sqrt{\frac{C_2 k^2\log(N)}{m}} \qquad (62)$$
with failure probability at most
$$9^{k+1}\exp\big(-C_2\,c_3\,k^2\log(N)/4\big) \leq \exp\big(-115k^2\log(N)\big) \qquad (63)\text{–}(64)$$
for sufficiently large $C_2$.
Choosing $c$ large enough and taking a union bound over all $i\in[n]$, we obtain
$$\Big\|\frac{1}{m}\sum_{j=1}^m z_i^j\,\hat B^\top x_i^j\Big\|_2 \leq \delta\sigma^2 \quad \forall i\in[n] \qquad (65)$$
with probability at least $1-n\exp(-115k^2\log(N)) \geq 1-\exp(-113k^2\log(N))$ (66). Finally, taking a union bound over the failure probabilities of (28) and (66), we get
$$\|G_i\|_2 \leq \Big\|\Big(\frac{1}{m}\hat B^\top X_i^\top X_i\hat B\Big)^{-1}\Big\|_2\,\Big\|\frac{1}{m}\sum_{j=1}^m z_i^j\,\hat B^\top x_i^j\Big\|_2 \leq \frac{\delta}{1-\delta}\,\sigma^2 \qquad (68)$$
with probability at least $1-\exp(-110k^2\log(N))$. It follows that with the same probability
$$\|G\|_F^2 = \sum_{i=1}^n\|G_i\|_2^2 \leq n\Big(\frac{\delta}{1-\delta}\,\sigma^2\Big)^2. \qquad (69)$$
□

For all $i\in[n]$ we define $q_i := \hat B w_i^+ - \hat B^* w_i^*$. The following lemma provides upper bounds on the norms of $w_i^+$ and $q_i$.

Lemma A.8. Let $\delta := c\,k^{3/2}\sqrt{\log(N)/m}$ for some absolute constant $c$ and $\hat\delta = \delta/(1-\delta)$. Then with probability at least $1-\exp(-105k^2\log(N))$ we have
$$\|w_i^+\|_2 \leq 2\sqrt k + \sigma^2\hat\delta \quad \forall i\in[n]. \qquad (70)$$
Further, with probability at least $1-\exp(-105k^2\log(N))$ we have
$$\|q_i\|_2 \leq 2\sqrt k\,\mathrm{dist}\big(\hat B,\hat B^*\big) + \sigma^2\hat\delta \quad \forall i\in[n]. \qquad (71)$$

Proof. We have
$$\|w_i^+\|_2 \leq \|\hat B\|_2\|\hat B^*\|_2\|w_i^*\|_2 + \|F_i\|_2 + \|G_i\|_2 \qquad (72)$$
$$\leq \|w_i^*\|_2 + \hat\delta\,\|w_i^*\|_2\,\mathrm{dist}\big(\hat B,\hat B^*\big) + \hat\delta\sigma^2 \qquad (73)$$
$$\leq 2\sqrt k + \hat\delta\sigma^2, \qquad (74)$$
where the first inequality comes from (20) and the third from Assumption 4.6. For the second inequality we take a union bound over the failure probabilities of (51) and (58), and thus the above result holds with probability at least $1-\exp(-107k^2\log(N))$. Taking a union bound over all $i\in[n]$, we get that with probability at least $1-\exp(-105k^2\log(N))$,
$$\|w_i^+\|_2 \leq 2\sqrt k + \sigma^2\hat\delta \quad \forall i\in[n], \qquad (75)$$
and the first result of the lemma follows. For the second part we have
$$\|q_i\|_2 = \big\|\hat B w_i^+ - \hat B^* w_i^*\big\|_2 = \big\|\hat B\hat B^\top\hat B^* w_i^* + \hat B F_i + \hat B G_i - \hat B^* w_i^*\big\|_2 \qquad (76)$$
$$\leq \big\|\big(\hat B\hat B^\top - I_d\big)\hat B^* w_i^*\big\|_2 + \|\hat B F_i\|_2 + \|\hat B G_i\|_2 \qquad (77)$$
$$\leq \big\|\big(\hat B\hat B^\top - I_d\big)\hat B^*\big\|_2\,\|w_i^*\|_2 + \|F_i\|_2 + \|G_i\|_2 \qquad (78)$$
$$\leq \mathrm{dist}\big(\hat B,\hat B^*\big)\,\|w_i^*\|_2 + \mathrm{dist}\big(\hat B,\hat B^*\big)\,\hat\delta\,\|w_i^*\|_2 + \sigma^2\hat\delta \qquad (79)$$
$$\leq 2\sqrt k\,\mathrm{dist}\big(\hat B,\hat B^*\big) + \sigma^2\hat\delta, \qquad (80)$$
where the first equality comes from (20). For the fourth inequality we take a union bound over the failure probabilities of (51) and (58), and thus the above result holds with probability at least $1-\exp(-107k^2\log(N))$. Taking a union bound over all $i\in[n]$, we get that with probability at least $1-\exp(-105k^2\log(N))$,
$$\|q_i\|_2 \leq 2\sqrt k\,\mathrm{dist}\big(\hat B,\hat B^*\big) + \sigma^2\hat\delta \quad \forall i\in[n]. \qquad (81)$$
□

Lemma A.9.
Let $\delta := c\,k^{3/2}\sqrt{\log(N)/m}$ for some absolute constant $c$ and $\hat\delta = \delta/(1-\delta)$. Then with probability at least $1-\exp(-105d)-\exp(-105k^2\log(N))$ we have
$$\Big\|\frac{1}{mn}\sum_{i=1}^n X_i^\top Z_i w_i^{+\top}\Big\|_2 \leq c\,\sigma^2\big(2\sqrt k + \hat\delta\sigma^2\big)\sqrt{\frac{d}{mn}}. \qquad (82)$$

Proof. Let $S^{d-1}$, $S^{k-1}$ denote the unit spheres in $d$ and $k$ dimensions and $N_d$, $N_k$ the $1/4$-nets of cardinality $9^d$ and $9^k$, respectively. By equation (4.13) in Vershynin (2018) we have
$$\Big\|\frac{1}{mn}\sum_{i=1}^n X_i^\top Z_i w_i^{+\top}\Big\|_2 \leq 2\max_{p\in N_d,\,y\in N_k}\,p^\top\Big(\sum_{i=1}^n\frac{1}{mn}X_i^\top Z_i w_i^{+\top}\Big)y = 2\max_{p\in N_d,\,y\in N_k}\sum_{i=1}^n\sum_{j=1}^m\frac{z_i^j}{mn}\big\langle x_i^j, p\big\rangle\big\langle w_i^+, y\big\rangle. \qquad (85)$$
Notice that for any fixed $p,y$ and $i\in[n]$, $j\in[m]$, the random variables $\frac{z_i^j}{mn}\langle x_i^j,p\rangle\langle w_i^+,y\rangle$ are i.i.d. zero-mean subexponential with norm at most $\frac{C_3'\sigma^2\|w_i^+\|_2}{mn}$, for some absolute constant $C_3'$. Conditioning on the event
$$E_1 := \Big\{\forall i\in[n]:\ \|w_i^+\|_2 \leq 2\sqrt k + \hat\delta\sigma^2\Big\}, \qquad (86)$$
which holds with probability at least $1-\exp(-105k^2\log(N))$ by Lemma A.8, we can invoke Bernstein's inequality to get
$$\Pr\Big[\Big|\sum_{i,j}\frac{z_i^j}{mn}\langle x_i^j,p\rangle\langle w_i^+,y\rangle\Big| \geq s\Big] \leq 2\exp\Big(-c_4\,mn\,\min\Big(\frac{s^2}{\sigma^4\big(2\sqrt k+\sigma^2\hat\delta\big)^2},\ \frac{s}{\sigma^2\big(2\sqrt k+\sigma^2\hat\delta\big)}\Big)\Big). \qquad (87)$$
Since $m > \frac{d\,C_3}{n}$, by setting $s = \sigma^2\big(2\sqrt k+\hat\delta\sigma^2\big)\sqrt{\frac{C_3\,d}{8mn}}$, the failure probability above simplifies to at most
$$\exp(-C_3\,c_4\,d) \leq \exp(-110d) \qquad (89)\text{–}(90)$$
for $C_3$ large enough. Taking a union bound over all points $p,y$ of $N_d$, $N_k$ and using (85), we derive a conditional failure probability of at most
$$9^{d+k}\exp(-110d) \leq \exp(-105d), \qquad (91)\text{–}(92)$$
and removing the conditioning on $E_1$ we get a total failure probability of at most
$$\exp(-105d) + \Pr\big[E_1^C\big] \leq \exp(-105d) + \exp\big(-105k^2\log(N)\big). \qquad (93)\text{–}(94)$$
Choosing $c$ large enough and taking the complementary event derives the result. □

Lemma A.10. Let $\delta := c\,k^{3/2}\sqrt{\log(N)/m}$ for some absolute constant $c$ and $\hat\delta = \delta/(1-\delta)$. Then with probability at least $1-\exp(-100d)-\exp(-100k^2\log(N))$ we have
$$\Big\|\frac{1}{n}\sum_{i=1}^n\Big(\frac{1}{m}X_i^\top X_i\big(\hat B w_i^+ - \hat B^* w_i^*\big) - \big(\hat B w_i^+ - \hat B^* w_i^*\big)\Big)w_i^{+\top}\Big\|_2 \leq c\sqrt{\frac{d}{mn}}\Big(k\,\mathrm{dist}\big(\hat B,\hat B^*\big) + \sqrt k\,\hat\delta\sigma^2 + \big(\hat\delta\sigma^2\big)^2\Big). \qquad (95)$$

Proof. Let us define the event
$$E_2 := \Big\{\forall i\in[n]:\ \|w_i^+\|_2 \leq 2\sqrt k+\hat\delta\sigma^2\ \ \text{and}\ \ \|q_i\|_2 \leq 2\sqrt k\,\mathrm{dist}\big(\hat B,\hat B^*\big) + \hat\delta\sigma^2\Big\}, \qquad (96)$$
which happens with probability at least $1-\exp(-100k^2\log(N))$ by a union bound and Lemma A.8. For the rest of this proof we work conditioning on event $E_2$.
Recall that $q_i := \hat B w_i^+ - \hat B^* w_i^*$, and thus we can write
$$\frac{1}{n}\sum_{i=1}^n\Big(\frac{1}{m}X_i^\top X_i q_i - q_i\Big)w_i^{+\top} = \frac{1}{mn}\sum_{i=1}^n\sum_{j=1}^m\Big(\big\langle x_i^j, q_i\big\rangle x_i^j - q_i\Big)w_i^{+\top}.$$
Let $S^{d-1}$, $S^{k-1}$ denote the unit spheres in $d$ and $k$ dimensions and $N_d$, $N_k$ the $1/4$-nets of cardinality $9^d$ and $9^k$, respectively. By equation (4.13) in Vershynin (2018) we have
$$\Big\|\frac{1}{mn}\sum_{i,j}\Big(\langle x_i^j,q_i\rangle x_i^j - q_i\Big)w_i^{+\top}\Big\|_2 \leq 2\max_{p\in N_d,\,y\in N_k}\Big|\frac{1}{mn}\sum_{i,j}\Big(\langle x_i^j,q_i\rangle\langle p, x_i^j\rangle\langle w_i^+,y\rangle - \langle p,q_i\rangle\langle w_i^+,y\rangle\Big)\Big|.$$
Notice that for any fixed $p,y$, the $\langle x_i^j,q_i\rangle$ are i.i.d. subgaussian with norm at most $c_1'\|q_i\|_2$ and the $\langle p,x_i^j\rangle$ are i.i.d. subgaussian with norm at most $c_2'\|p\|_2 = c_2'$. Hence, under the event $E_2$, the products $\frac{1}{mn}\langle x_i^j,q_i\rangle\langle p,x_i^j\rangle\langle w_i^+,y\rangle$ are subexponential with norm at most $\frac{C_4'}{mn}\,K$, where $K := k\,\mathrm{dist}(\hat B,\hat B^*)+\sqrt k\,\hat\delta\sigma^2+(\hat\delta\sigma^2)^2$, for some constant $C_4'$. Also note that
$$\mathbb{E}\Big[\langle x_i^j,q_i\rangle\langle p,x_i^j\rangle\langle w_i^+,y\rangle - \langle p,q_i\rangle\langle w_i^+,y\rangle\Big] = 0, \qquad (101)$$
and thus applying Bernstein's inequality we get
$$\Pr\Big[\Big|\frac{1}{mn}\sum_{i,j}\Big(\langle x_i^j,q_i\rangle\langle p,x_i^j\rangle\langle w_i^+,y\rangle - \langle p,q_i\rangle\langle w_i^+,y\rangle\Big)\Big| \geq s\Big] \leq 2\exp\Big(-c_5\,mn\,\min\Big(\frac{s^2}{K^2},\ \frac{s}{K}\Big)\Big). \qquad (102)$$
Since $m > \frac{d\,C_4}{n}$, by setting $s = C_4\,K\sqrt{\frac{d}{mn}}$ and taking a union bound over all $p\in N_d$, $y\in N_k$, we derive a failure probability of at most
$$9^{d+k}\exp(-C_4\,c_5\,d) \leq 9^{d+k}\exp(-120d) \leq \exp(-100d), \qquad (103)\text{–}(105)$$
choosing a large enough constant $C_4$. Recall that $\Pr[E_2^C] \leq \exp(-100k^2\log(N))$. Hence, by removing the conditioning on $E_2$, we get that with probability at least $1-\exp(-100d)-\exp(-100k^2\log(N))$,
$$\Big\|\frac{1}{n}\sum_{i=1}^n\Big(\frac{1}{m}X_i^\top X_i q_i - q_i\Big)w_i^{+\top}\Big\|_2 \leq c\sqrt{\frac{d}{mn}}\Big(k\,\mathrm{dist}\big(\hat B,\hat B^*\big)+\sqrt k\,\hat\delta\sigma^2+\big(\hat\delta\sigma^2\big)^2\Big) \qquad (106)$$
for sufficiently large $c$. □

Having set all the building blocks, we now proceed to the proof of Theorem 4.7.

Theorem 4.7. Let Assumptions 4.3–4.6 hold.
Further, let the following inequalities hold for the number of participating nodes and the batch size, respectively: $n \geq n_0$ and $m \geq c_0\frac{(1+\sigma^2)k^3\kappa^4}{E_0^2}\max\{\log(N),\ d/n_0\}$, for some absolute constant $c_0$. Then FedRep-SRPFL with stepsize $\eta \leq \frac{1}{8\bar\sigma^2_{\max,*}}$ satisfies the following contraction inequality:
$$\mathrm{dist}\big(B^{t+1}, B^*\big) \leq \mathrm{dist}\big(B^t, B^*\big)\sqrt{1-a} + \frac{a}{\sqrt{\frac{n}{n_0}(1-a)}}, \qquad (10)$$
w.p. at least $1 - T\exp\big(-90\min\{d,\,k^2\log(N)\}\big)$, where $a = \frac{1}{2}\eta E_0\bar\sigma^2_{\min,*} \leq \frac14$.

Proof. First let us recall the definition $\delta := c\,k^{3/2}\sqrt{\log(N)/m}$ for some absolute constant $c$, and $\hat\delta = \delta/(1-\delta)$. Further notice that for our choice of $m$ and sufficiently large $c_0$ we have the following useful inequality:
$$\hat\delta = \frac{\delta}{1-\delta} \leq 2\delta \leq \frac{E_0}{20\,\kappa^2}\cdot\frac{1}{1+\sigma^2}. \qquad (107)$$
From the update scheme of our algorithm (18), and using $Y_i = X_i\hat B^* w_i^* + Z_i$, we have
$$B^+ = \hat B - \frac{\eta}{mn}\sum_{i=1}^n X_i^\top X_i\big(\hat B w_i^+ - \hat B^* w_i^*\big)w_i^{+\top} + \frac{\eta}{mn}\sum_{i=1}^n X_i^\top Z_i w_i^{+\top} \qquad (108)$$
$$= \hat B - \frac{\eta}{n}\sum_{i=1}^n\big(\hat B w_i^+ - \hat B^* w_i^*\big)w_i^{+\top} - \frac{\eta}{n}\sum_{i=1}^n\Big(\frac{1}{m}X_i^\top X_i\big(\hat B w_i^+ - \hat B^* w_i^*\big) - \big(\hat B w_i^+ - \hat B^* w_i^*\big)\Big)w_i^{+\top} + \frac{\eta}{mn}\sum_{i=1}^n X_i^\top Z_i w_i^{+\top}, \qquad (109)$$
where we added and subtracted terms. Multiplying both sides by $\hat B^{*\top}_\perp$ (the orthogonal complement of $\hat B^*$), we get
$$\hat B^{*\top}_\perp B^+ = \hat B^{*\top}_\perp\hat B\Big(I_k - \frac{\eta}{n}\sum_{i=1}^n w_i^+ w_i^{+\top}\Big) + \eta\,\hat B^{*\top}_\perp\Big[\frac{1}{mn}\sum_{i=1}^n X_i^\top Z_i w_i^{+\top} - \frac{1}{n}\sum_{i=1}^n\Big(\frac{1}{m}X_i^\top X_i q_i - q_i\Big)w_i^{+\top}\Big], \qquad (110)\text{–}(111)$$
where the second equality holds since $\hat B^{*\top}_\perp\hat B^* = 0$. Recall that from the QR decomposition of $B^+$ we have $B^+ = \hat B^+ R^+$.
Hence, multiplying by $(R^+)^{-1}$ and taking norms on both sides, we derive
$$\mathrm{dist}\big(\hat B^+,\hat B^*\big) \leq \Big(\Big\|\hat B^{*\top}_\perp\hat B\Big(I_k-\frac{\eta}{n}\sum_{i=1}^n w_i^+w_i^{+\top}\Big)\Big\|_2 + \eta\Big\|\frac{1}{mn}\sum_{i=1}^n X_i^\top Z_i w_i^{+\top}\Big\|_2 + \eta\Big\|\frac{1}{n}\sum_{i=1}^n\Big(\frac{1}{m}X_i^\top X_i q_i - q_i\Big)w_i^{+\top}\Big\|_2\Big)\,\big\|(R^+)^{-1}\big\|_2.$$
Let us define
$$A_1 := \mathrm{dist}\big(\hat B,\hat B^*\big)\,\Big\|I_k - \frac{\eta}{n}\sum_{i=1}^n w_i^+ w_i^{+\top}\Big\|_2,\qquad A_2 := \eta\,\Big\|\frac{1}{mn}\sum_{i=1}^n X_i^\top Z_i w_i^{+\top}\Big\|_2,\qquad A_3 := \eta\,\Big\|\frac{1}{n}\sum_{i=1}^n\Big(\frac{1}{m}X_i^\top X_i q_i - q_i\Big)w_i^{+\top}\Big\|_2,$$
so that the following inequality holds:
$$\mathrm{dist}\big(\hat B^+,\hat B^*\big) \leq (A_1+A_2+A_3)\,\big\|(R^+)^{-1}\big\|_2. \qquad (116)$$
For the rest of the proof we will work conditioning on the intersection of the events
$$E_2 := \Big\{\forall i\in[n]:\ \|w_i^+\|_2 \leq 2\sqrt k+\hat\delta\sigma^2\ \text{and}\ \|q_i\|_2 \leq 2\sqrt k\,\mathrm{dist}\big(\hat B,\hat B^*\big)+\hat\delta\sigma^2\Big\}, \qquad (117)$$
$$E_3 := \Big\{\|F\|_F \leq \mathrm{dist}\big(\hat B,\hat B^*\big)\,\hat\delta\,\|W^*\|_2\ \text{and}\ \|G\|_F \leq \hat\delta\sqrt n\,\sigma^2\Big\}, \qquad (118)$$
$$E_4 := \Big\{\Big\|\frac{1}{mn}\sum_{i=1}^n X_i^\top Z_i w_i^{+\top}\Big\|_2 \leq c\,\sigma^2\big(2\sqrt k+\hat\delta\sigma^2\big)\sqrt{\tfrac{d}{mn}}\Big\}, \qquad (119)$$
$$E_5 := \Big\{\Big\|\frac{1}{n}\sum_{i=1}^n\Big(\frac{1}{m}X_i^\top X_i q_i - q_i\Big)w_i^{+\top}\Big\|_2 \leq c\sqrt{\tfrac{d}{mn}}\Big(k\,\mathrm{dist}\big(\hat B,\hat B^*\big)+\sqrt k\,\hat\delta\sigma^2+\big(\hat\delta\sigma^2\big)^2\Big)\Big\}, \qquad (120)$$
which happens with probability at least $1-\exp(-90d)-\exp(-90k^2\log(N))$, by a union bound on the failure probabilities of (52), (59), (82), (95) and (96). We will now provide bounds for each of the terms of interest in (116), starting from $A_1$. Notice that by (26) we have
$$\lambda_{\max}\big(W^{+\top}W^+\big) = \|W^+\|_2^2 = \big\|W^*\hat B^{*\top}\hat B + F + G\big\|_2^2 \leq 2\|W^*\|_2^2 + 4\|F\|_2^2 + 4\|G\|_2^2 \qquad (122)$$
$$\leq 2\|W^*\|_2^2 + 4\,\mathrm{dist}^2\big(\hat B,\hat B^*\big)\,\hat\delta^2\|W^*\|_2^2 + 4\hat\delta^2\sigma^4 n \qquad (123)$$
$$\leq 4\big(\|W^*\|_2^2 + n\big) \qquad (124)$$
$$\leq 4n\big(\bar\sigma^2_{\max,*}+1\big), \qquad (125)$$
where in the last inequality we use the fact that $\|W^*\|_2 = \sqrt n\,\bar\sigma_{\max,*}$. Since $\eta < \big(4(\bar\sigma^2_{\max,*}+1)\big)^{-1}$, the matrix $I_k - \frac{\eta}{n}W^{+\top}W^+$ is positive definite. Thus we have
$$\Big\|I_k-\frac{\eta}{n}W^{+\top}W^+\Big\|_2 \leq 1 - \frac{\eta}{n}\lambda_{\min}\big(W^{+\top}W^+\big). \qquad (126)$$
Expanding $W^{+\top}W^+ = \big(W^*\hat B^{*\top}\hat B+F+G\big)^\top\big(W^*\hat B^{*\top}\hat B+F+G\big)$ (127) and bounding the cross terms, we obtain
$$\lambda_{\min}\big(W^{+\top}W^+\big) \geq \sigma^2_{\min}\big(W^*\big)\,\sigma^2_{\min}\big(\hat B^{*\top}\hat B\big) - 2\big(\sigma_{\max}(F)+\sigma_{\max}(G)\big)\|W^*\|_2 - 2\,\sigma_{\max}(F)\,\sigma_{\max}(G), \qquad (128)\text{–}(129)$$
and hence
$$\Big\|I_k-\frac{\eta}{n}W^{+\top}W^+\Big\|_2 \leq 1 - \eta\,\bar\sigma^2_{\min,*}\,\sigma^2_{\min}\big(\hat B^{*\top}\hat B\big) + \frac{2\eta}{n}\big(\|F\|_2+\|G\|_2\big)\|W^*\|_2 + \frac{2\eta}{n}\|F\|_2\|G\|_2, \qquad (130)$$
where we used that the norms of $\hat B$ and $\hat B^*$ are $1$ since the matrices are orthonormal, and $\sqrt n\,\bar\sigma_{\min,*} \leq \sigma_{\min}(W^*)$.
Recall that we operate under $E_3$, and thus we can further write
$$\Big\|I_k-\frac{\eta}{n}W^{+\top}W^+\Big\|_2 \leq 1 - \eta\,\bar\sigma^2_{\min,*}\,\sigma^2_{\min}\big(\hat B^{*\top}\hat B\big) + 2\eta\Big(\mathrm{dist}\big(\hat B,\hat B^*\big)\,\hat\delta\,\frac{\|W^*\|_2^2}{n} + \hat\delta\sigma^2\frac{\|W^*\|_2}{\sqrt n} + \mathrm{dist}\big(\hat B,\hat B^*\big)\,\hat\delta^2\sigma^2\frac{\|W^*\|_2}{\sqrt n}\Big) \qquad (131)\text{–}(132)$$
$$\leq 1 - \eta\,\bar\sigma^2_{\min,*}\,\sigma^2_{\min}\big(\hat B^{*\top}\hat B\big) + 3\eta\,\frac{E_0\,\bar\sigma^2_{\min,*}}{20\,\bar\sigma^2_{\max,*}}\,\bar\sigma^2_{\max,*} \qquad (133)$$
$$\leq 1 - \eta\,\bar\sigma^2_{\min,*}\,\sigma^2_{\min}\big(\hat B^{*\top}\hat B\big) + \frac{1}{6}\eta E_0\bar\sigma^2_{\min,*}, \qquad (134)$$
where we upper bound $\mathrm{dist}(\hat B,\hat B^*)$ by $1$, use $\hat\delta \leq \frac{E_0}{20\kappa^2}\frac{1}{1+\sigma^2}$, and use Assumption 4.4 in the third inequality. Further, by the definition $E_0 := 1 - \mathrm{dist}^2\big(\hat B^0,\hat B^*\big) \leq \sigma^2_{\min}\big(\hat B^{*\top}\hat B\big)$, we have
$$\Big\|I_k-\frac{\eta}{n}W^{+\top}W^+\Big\|_2 \leq 1 - \eta E_0\bar\sigma^2_{\min,*} + \frac{1}{6}\eta E_0\bar\sigma^2_{\min,*}, \qquad (135)$$
and it follows immediately that
$$A_1 \leq \mathrm{dist}\big(\hat B,\hat B^*\big)\Big(1 - \frac{5}{6}\eta E_0\bar\sigma^2_{\min,*}\Big). \qquad (136)$$
Further, since we operate under $E_4$ (119) and $\|\hat B^*_\perp\|_2 = 1$, we have
$$A_2 \leq \eta\,c\,\sigma^2\big(2\sqrt k+\hat\delta\sigma^2\big)\sqrt{\frac{d}{mn}}, \qquad (137)$$
and since we operate under $E_5$ (120) we obtain
$$A_3 \leq \eta\,c\sqrt{\frac{d}{mn}}\Big(k\,\mathrm{dist}\big(\hat B,\hat B^*\big)+\sqrt k\,\hat\delta\sigma^2+\big(\hat\delta\sigma^2\big)^2\Big). \qquad (138)$$
Combining (116), (136), (137) and (138) we get
$$\mathrm{dist}\big(\hat B^+,\hat B^*\big) \leq \Big(\mathrm{dist}\big(\hat B,\hat B^*\big)\Big(1-\frac{5}{6}\eta E_0\bar\sigma^2_{\min,*}\Big) + \eta ck\big(\sqrt k+1\big)\big(\sigma^2+1\big)\sqrt{\frac{d}{mn}}\Big)\,\big\|(R^+)^{-1}\big\|_2. \qquad (139)$$
The last part of the proof focuses on bounding $\|(R^+)^{-1}\|_2$. Let us define
$$S := \sum_{i=1}^n\frac{1}{m}X_i^\top X_i\big(\hat B w_i^+ - \hat B^* w_i^*\big)w_i^{+\top}, \qquad (140)$$
$$E := \sum_{i=1}^n\frac{1}{m}X_i^\top Z_i w_i^{+\top}, \qquad (141)$$
and hence (108) takes the form $B^+ = \hat B - \frac{\eta}{n}S + \frac{\eta}{n}E$, so that
$$B^{+\top}B^+ = \hat B^\top\hat B - \frac{\eta}{n}\big(\hat B^\top S + S^\top\hat B\big) + \frac{\eta}{n}\big(\hat B^\top E + E^\top\hat B\big) - \frac{\eta^2}{n^2}\big(E^\top S + S^\top E\big) + \frac{\eta^2}{n^2}E^\top E + \frac{\eta^2}{n^2}S^\top S. \qquad (143)\text{–}(144)$$
By Weyl's inequality and since $R^{+\top}R^+ = B^{+\top}B^+$, we derive
$$\sigma^2_{\min}\big(R^+\big) \geq 1 - \frac{\eta}{n}\lambda_{\max}\big(\hat B^\top S + S^\top\hat B\big) - \frac{\eta}{n}\lambda_{\max}\big(\hat B^\top E + E^\top\hat B\big) - \frac{\eta^2}{n^2}\lambda_{\max}\big(E^\top S + S^\top E\big). \qquad (145)$$
Let us further define
$$R_1 := \frac{\eta}{n}\lambda_{\max}\big(\hat B^\top S + S^\top\hat B\big),\qquad R_2 := \frac{\eta^2}{n^2}\lambda_{\max}\big(E^\top S + S^\top E\big),\qquad R_3 := \frac{\eta}{n}\lambda_{\max}\big(\hat B^\top E + E^\top\hat B\big), \qquad (146)\text{–}(148)$$
so that we can succinctly rewrite the above inequality as follows:
$$\sigma^2_{\min}\big(R^+\big) \geq 1 - R_1 - R_2 - R_3. \qquad (149)$$
We work to bound separately each of the three terms.
We have
$$R_1 \leq \frac{2\eta}{n}\max_{p:\|p\|_2=1}\big|p^\top\hat B^\top S\,p\big| \qquad (150)$$
$$\leq \max_{p:\|p\|_2=1}\frac{2\eta}{n}\Big|p^\top\hat B^\top\Big[\sum_{i=1}^n\Big(\frac{1}{m}X_i^\top X_i q_i - q_i\Big)w_i^{+\top}\Big]p\Big| + \max_{p:\|p\|_2=1}\frac{2\eta}{n}\Big|p^\top\hat B^\top\Big[\sum_{i=1}^n q_i w_i^{+\top}\Big]p\Big|, \qquad (151)\text{–}(152)$$
and since we operate under $E_5$ (120), the above simplifies to
$$R_1 \leq 3\eta c\sqrt{\frac{d}{mn}}\Big(k\,\mathrm{dist}\big(\hat B,\hat B^*\big)+\sqrt k\,\hat\delta\sigma^2+\big(\hat\delta\sigma^2\big)^2\Big) + \max_{p:\|p\|_2=1}\frac{2\eta}{n}\Big|p^\top\hat B^\top\Big[\sum_{i=1}^n q_i w_i^{+\top}\Big]p\Big|. \qquad (153)\text{–}(154)$$
We focus on the second term, and using (20) we get
$$\frac{2\eta}{n}\Big|p^\top\hat B^\top\Big[\sum_{i=1}^n q_i w_i^{+\top}\Big]p\Big| = \frac{2\eta}{n}\Big|\mathrm{tr}\Big(\sum_{i=1}^n q_i\big(\hat B^\top\hat B^* w_i^* + F_i + G_i\big)^\top pp^\top\hat B^\top\Big)\Big|. \qquad (155)$$
We bound each term separately, and to this end we define
$$T_1 := \frac{2\eta}{n}\Big|\mathrm{tr}\Big(\sum_{i=1}^n q_i\,w_i^{*\top}\hat B^{*\top}\hat B\,pp^\top\hat B^\top\Big)\Big|,\qquad T_2 := \frac{2\eta}{n}\Big|\mathrm{tr}\Big(\sum_{i=1}^n q_i\,F_i^\top pp^\top\hat B^\top\Big)\Big|,\qquad T_3 := \frac{2\eta}{n}\Big|\mathrm{tr}\Big(\sum_{i=1}^n q_i\,G_i^\top pp^\top\hat B^\top\Big)\Big|, \qquad (156)\text{–}(158)$$
such that (155) can be expressed as
$$\frac{2\eta}{n}\Big|p^\top\hat B^\top\Big[\sum_{i=1}^n q_i w_i^{+\top}\Big]p\Big| \leq T_1 + T_2 + T_3. \qquad (159)$$
Further expanding $T_1$, we substitute $q_i = \hat B\hat B^\top\hat B^* w_i^* + \hat B F_i + \hat B G_i - \hat B^* w_i^*$ via (20); since $\hat B^\top\big(\hat B\hat B^\top - I_d\big) = 0$, the term involving $\big(\hat B\hat B^\top-I_d\big)\hat B^*\sum_i w_i^* w_i^{*\top}$ vanishes, and with $\bar F := \sum_{i=1}^n F_i w_i^{*\top}$ and $\bar G := \sum_{i=1}^n G_i w_i^{*\top}$ we obtain
$$T_1 \leq \frac{2\eta}{n}\Big|\mathrm{tr}\Big(\hat B^\top\hat B\big(\bar F + \bar G\big)^\top\hat B^{*\top}\hat B\,pp^\top\Big)\Big| \leq \frac{2\eta}{n}\big(\|F\|_F + \|G\|_F\big)\,\|W^*\|_2, \qquad (160)\text{–}(163)$$
where the inequality is obtained by noticing that the norms of the orthonormal $\hat B$, $\hat B^*$ are one and also $\|pp^\top\|_2 \leq 1$. Conditioning on $E_3$ (118) we can further simplify as follows:
$$T_1 \leq 2\eta\Big(\mathrm{dist}\big(\hat B,\hat B^*\big)\,\hat\delta\,\frac{\|W^*\|_2^2}{n} + \hat\delta\sigma^2\frac{\|W^*\|_2}{\sqrt n}\Big) \leq \frac{1}{10}\eta E_0\bar\sigma^2_{\min,*}. \qquad (164)\text{–}(165)$$
We now turn our attention to $T_2$. Expanding $q_i$ via (20) in the same way, we get
$$T_2 \leq \frac{2\eta}{n}\big(\|F\|_F^2 + \|G\|_F\|F\|_F\big), \qquad (166)\text{–}(169)$$
where in the third equality we used that $\hat B^\top\big(\hat B\hat B^\top - I_d\big) = 0$ and in the fourth that the norms of the orthonormal matrices are $1$, as is the norm of $pp^\top$.
Following the same calculations for $T_3$ we get
$$T_3 \leq \frac{2\eta}{n}\big(\|G\|_F^2 + \|G\|_F\|F\|_F\big), \qquad (170)$$
and thus, summing the two terms, we get the following:
$$T_2 + T_3 \leq \frac{2\eta}{n}\big(\|F\|_F + \|G\|_F\big)^2. \qquad (171)$$
Again conditioning on $E_3$ (118) we derive
$$T_2 + T_3 \leq \frac{2\eta}{n}\big(\hat\delta\,\|W^*\|_2 + \sqrt n\,\hat\delta\sigma^2\big)^2 \leq 2\eta\hat\delta^2\big(\bar\sigma^2_{\max,*} + \sigma^4\big) \leq \frac{1}{10}\eta E_0\bar\sigma^2_{\min,*}. \qquad (172)\text{–}(174)$$
Hence, combining (154), (165) and (174), we get a bound for $R_1$:
$$R_1 \leq 3\eta c\sqrt{\frac{d}{mn}}\Big(k\,\mathrm{dist}\big(\hat B,\hat B^*\big)+\sqrt k\,\hat\delta\sigma^2+\big(\hat\delta\sigma^2\big)^2\Big) + \frac{1}{5}\eta E_0\bar\sigma^2_{\min,*}. \qquad (175)\text{–}(176)$$
We work in similar fashion to derive the bound on $R_2$:
$$R_2 = \frac{\eta^2}{n^2}\lambda_{\max}\big(E^\top S + S^\top E\big) \leq \frac{2\eta^2}{n^2}\max_{p:\|p\|_2=1}\big|p^\top S^\top E\,p\big| \qquad (177)\text{–}(178)$$
$$\leq 2\eta^2\Big(\Big\|\frac{1}{n}\sum_{i=1}^n\Big(\frac{1}{m}X_i^\top X_i q_i - q_i\Big)w_i^{+\top}\Big\|_2 + \Big\|\frac{1}{n}\sum_{i=1}^n q_i w_i^{+\top}\Big\|_2\Big)\,\Big\|\frac{1}{mn}\sum_{i=1}^n X_i^\top Z_i w_i^{+\top}\Big\|_2. \qquad (179)\text{–}(180)$$
Since we work conditioning on the event $E_4\cap E_5$, and using $\|\frac{1}{n}\sum_i q_i w_i^{+\top}\|_2 \leq \max_i\|q_i\|_2\|w_i^+\|_2$ together with $E_2$ (117), we finally get
$$R_2 \leq 3\eta^2 c^2 k^{3/2}\sigma^2\sqrt{\frac{d}{mn}}. \qquad (181)\text{–}(185)$$
The last term we need to bound is $R_3$:
$$R_3 = \frac{\eta}{n}\lambda_{\max}\big(\hat B^\top E + E^\top\hat B\big) \leq \frac{2\eta}{n}\max_{p:\|p\|_2=1}\big|p^\top\hat B^\top E\,p\big| \leq 2\eta\,\Big\|\frac{1}{mn}\sum_{i=1}^n X_i^\top Z_i w_i^{+\top}\Big\|_2 \leq 2\eta c\,\sigma^2\big(2\sqrt k+\hat\delta\sigma^2\big)\sqrt{\frac{d}{mn}}. \qquad (186)\text{–}(190)$$
Combining (149) with (176), (185) and (190), we derive
$$\sigma^2_{\min}\big(R^+\big) \geq 1 - 14\eta ck^{3/2}\sigma^2\sqrt{\frac{d}{mn}} - 3\eta^2c^2k^{3/2}\sigma^2\sqrt{\frac{d}{mn}} - \frac{1}{5}\eta E_0\bar\sigma^2_{\min,*} \qquad (191)\text{–}(192)$$
$$\geq 1 - 15\eta ck^{3/2}\sigma^2\sqrt{\frac{d}{mn}} - \frac{1}{5}\eta E_0\bar\sigma^2_{\min,*}, \qquad (193)$$
where the last inequality holds since $mn \geq c\,d$.
We can now combine (139) and (193) to obtain the contraction inequality
$$\mathrm{dist}\big(\hat B^+,\hat B^*\big) \leq \frac{\mathrm{dist}\big(\hat B,\hat B^*\big)\big(1-\frac{5}{6}\eta E_0\bar\sigma^2_{\min,*}\big) + \eta ck\big(\sqrt k+1\big)\big(\sigma^2+1\big)\sqrt{\frac{d}{mn}}}{\sqrt{1 - 15\eta ck^{3/2}\sigma^2\sqrt{\frac{d}{mn}} - \frac{1}{5}\eta E_0\bar\sigma^2_{\min,*}}}. \qquad (194)\text{–}(195)$$
We divide and multiply by $n_0$, and using our bounds on $m$ and $n$, the previous inequality further simplifies to
$$\mathrm{dist}\big(\hat B^+,\hat B^*\big) \leq \mathrm{dist}\big(\hat B,\hat B^*\big)\sqrt{1-\frac{1}{2}\eta E_0\bar\sigma^2_{\min,*}} + \frac{\frac{1}{2}\eta E_0\bar\sigma^2_{\min,*}}{\sqrt{\frac{n}{n_0}\big(1-\frac{1}{2}\eta E_0\bar\sigma^2_{\min,*}\big)}}, \qquad (196)\text{–}(197)$$
where in the second inequality we used that for our choices of $m$ and $n$ the following inequality holds: $15\eta ck^{3/2}(1+\sigma^2)\sqrt{d/(mn_0)} \leq \frac{1}{10}\eta E_0\bar\sigma^2_{\min,*}$. Taking a union bound over the total number of iterations $T$, we derive the result. □

Corollary A.11. Recall that our algorithm starts at stage 0 with $n_0$ participating clients and doubles the number of participating clients at every subsequent stage. Thus, by slightly abusing notation, we can reformulate the contraction inequality of Theorem 4.7 at stage $r$ as follows:
$$\mathrm{dist}^+ \leq \mathrm{dist}\cdot\sqrt{1-a} + \frac{a}{\sqrt{2^r(1-a)}},\quad \text{with } a \leq \frac14.$$

In the second part of our analysis we compute the expected wall-clock time of our proposed method and compare it to the corresponding wall-clock time of straggler-prone FedRep. We prove that when the computational speeds are drawn from the exponential distribution with parameter $\lambda$ and the communication cost is given by $C = c\cdot\frac{1}{\lambda}$ (for some constant $c$), then SRPFL guarantees a logarithmic speedup. Recall that in Corollary A.11 we obtained the following simplified version of the contraction inequality:
$$\mathrm{dist}^+ \leq \mathrm{dist}\cdot\sqrt{1-a} + \frac{a}{\sqrt{2^r(1-a)}},\quad \text{with } a \leq \frac14. \qquad (199)$$
For the rest of this section, w.l.o.g. we assume that the clients are re-indexed at every stage $r$ so that the expected computation times are ordered from fastest to slowest, i.e., $\mathbb{E}[T_1^r] \leq \mathbb{E}[T_2^r] \leq \dots \leq \mathbb{E}[T_N^r]$. For simplicity, henceforth we drop the stage index $r$.
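The trade-off that the wall-clock analysis below formalizes can be simulated directly: with exponential computation times, the expected wait for the $n$ fastest of $N$ clients is a partial harmonic sum, and the stage that maximizes progress per unit of time grows as the distance shrinks. A small sketch under the contraction model of (199); all numeric values are illustrative assumptions, not taken from the paper:

```python
from math import sqrt, log

# Illustrative (hypothetical) parameters:
lam, c = 1.0, 1.0          # exponential rate of client speeds, communication constant
N, n0, a = 1024, 4, 0.25   # network size, initial cohort size, contraction parameter

def H(n):
    """n-th harmonic number."""
    return sum(1.0 / j for j in range(1, n + 1))

def expected_wait(n):
    """Expected time until the n fastest of N i.i.d. Exp(lam) clients finish:
    the n-th order statistic, (H(N) - H(N - n)) / lam."""
    return (H(N) - H(N - n)) / lam

def progress_rate(dist, r):
    """Per-round progress of contraction (199) with n0 * 2**r clients,
    divided by the expected round cost E[T_{n0 2^r}] + C."""
    n = n0 * 2 ** r
    new_dist = sqrt(1 - a) * dist + a / sqrt(2 ** r * (1 - a))
    return (dist - new_dist) / (expected_wait(n) + c / lam)

def best_stage(dist):
    """Stage r maximizing progress per unit of time at the given distance."""
    stages = range(int(log(N / n0, 2)) + 1)
    return max(stages, key=lambda r: progress_rate(dist, r))
```

In this instance `best_stage(1.0) < best_stage(0.2)`: far from the optimum a smaller cohort yields more progress per second, while near the noise floor only a large cohort makes progress at all, which is the rationale behind the doubling rule.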
Notice that this ordering of the computation times, in combination with (199), implies that SRPFL initially benefits from including only a few fast nodes in the training procedure. However, as the distance diminishes, the improvement guaranteed by the contraction inequality becomes negligible, and thus our method benefits from including slower nodes, thereby decreasing the second term on the r.h.s. of (199). Let us denote by $X_i$ the maximum distance for which the optimal number of participating nodes (for SRPFL) is $n_0 2^i$. This definition immediately implies that $X_0 = +\infty$. To compute each $X_i$ we turn our attention to measuring the progress per unit of time achieved by SRPFL when $2^r n_0$ nodes are utilized. This ratio at stage $r$ can be expressed as
$$\frac{\mathrm{dist} - \mathrm{dist}^+}{\mathbb{E}[T_{n_0 2^r}] + C} = \frac{\mathrm{dist}\big(1-\sqrt{1-a}\big) - \frac{a}{\sqrt{2^r(1-a)}}}{\mathbb{E}[T_{n_0 2^r}] + C}. \qquad (200)$$
Notice that by (199) the numerator captures the progress per round, while the algorithm incurs $\mathbb{E}[T_{n_0 2^r}]$ computation and $C$ communication cost. Similarly, the ratio when $2^{r+1} n_0$ nodes are used is given by
$$\frac{\mathrm{dist}\big(1-\sqrt{1-a}\big) - \frac{a}{\sqrt{2^{r+1}(1-a)}}}{\mathbb{E}[T_{n_0 2^{r+1}}] + C}. \qquad (201)$$
Based on the above expressions we can now compute the optimal doubling points (in terms of distance) and thus the values of the $X_i$'s. Subsequently, we compute the number of iterations SRPFL spends in every stage.

Lemma B.1. For all $i$ let $X_i$ denote the maximum distance for which the optimal number of nodes for SRPFL is $n_0 2^i$. Then the following holds for all $i = r+1 > 0$:
$$X_{r+1} = \frac{a}{\sqrt{2^r(1-a)}\big(1-\sqrt{1-a}\big)}\Bigg(1 + \frac{\big(\mathbb{E}[T_{n_0 2^r}] + C\big)\big(1-\frac{1}{\sqrt 2}\big)}{\mathbb{E}[T_{n_0 2^{r+1}}] - \mathbb{E}[T_{n_0 2^r}]}\Bigg). \qquad (202)$$
Further, SRPFL spends at each stage $r$ at most $t_r$ communication rounds, where
$$t_r \leq \frac{\log\Big(\frac{2\,(\mathbb{E}[T_{n_0 2^{r+1}}]-\mathbb{E}[T_{n_0 2^r}])}{\mathbb{E}[T_{n_0 2^r}]-\mathbb{E}[T_{n_0 2^{r-1}}]}\Big)}{\log\big(\frac{1}{\sqrt{1-a}}\big)}. \qquad (203)$$

Proof. For each stage $r$ let us compute the point where the transitioning between $2^r n_0$ and $2^{r+1} n_0$ occurs, that is, the distance at which SRPFL benefits by doubling the number of participating nodes to $2^{r+1} n_0$.
Thus, equating the two ratios in (200) and (201), we get
$$\frac{X_{r+1}\big(1-\sqrt{1-a}\big) - \frac{a}{\sqrt{2^r(1-a)}}}{\mathbb{E}[T_{n_0 2^r}] + C} = \frac{X_{r+1}\big(1-\sqrt{1-a}\big) - \frac{a}{\sqrt{2^{r+1}(1-a)}}}{\mathbb{E}[T_{n_0 2^{r+1}}] + C} \qquad (204)$$
$$\Rightarrow\quad X_{r+1}\big(1-\sqrt{1-a}\big)\big(\mathbb{E}[T_{n_0 2^{r+1}}] - \mathbb{E}[T_{n_0 2^r}]\big) = \frac{a}{\sqrt{2^r(1-a)}}\Big(\mathbb{E}[T_{n_0 2^{r+1}}] - \tfrac{1}{\sqrt 2}\mathbb{E}[T_{n_0 2^r}] + \big(1-\tfrac{1}{\sqrt 2}\big)C\Big) \qquad (205)$$
$$\Rightarrow\quad X_{r+1} = \frac{a}{\sqrt{2^r(1-a)}\big(1-\sqrt{1-a}\big)}\Bigg(1 + \frac{\big(\mathbb{E}[T_{n_0 2^r}] + C\big)\big(1-\frac{1}{\sqrt 2}\big)}{\mathbb{E}[T_{n_0 2^{r+1}}] - \mathbb{E}[T_{n_0 2^r}]}\Bigg). \qquad (206)$$
Let us now compute the number of rounds $t_r$ (henceforth denoted by $t$) required in stage $r$, that is, the minimum number of iterations that SRPFL needs to decrease the distance from $X_r$ to $X_{r+1}$ using only the $n_0 2^r$ fastest participating nodes. Starting off at $X_r$ and following (199) for $t$ rounds, we have
$$\mathrm{dist}_t \leq X_r\big(\sqrt{1-a}\big)^t + \frac{a}{\sqrt{2^r(1-a)}}\sum_{i=0}^{t-1}\big(\sqrt{1-a}\big)^i. \qquad (207)$$
As stated above, we want to find the minimum number of rounds such that we reach the next doubling point, i.e., we want $t$ large enough such that
$$X_r\big(\sqrt{1-a}\big)^t + \frac{a}{\sqrt{2^r(1-a)}}\cdot\frac{1-\big(\sqrt{1-a}\big)^t}{1-\sqrt{1-a}} \leq X_{r+1}, \qquad (208)\text{–}(209)$$
where in the last expression we use geometric series properties. We proceed to solve for $t$ by rearranging and using (202) and the fact that $X_r > \frac{a}{\sqrt{2^r(1-a)}(1-\sqrt{1-a})}$:
$$\big(\sqrt{1-a}\big)^t \leq \frac{X_{r+1} - \frac{a}{\sqrt{2^r(1-a)}(1-\sqrt{1-a})}}{X_r - \frac{a}{\sqrt{2^r(1-a)}(1-\sqrt{1-a})}} \leq \frac{\mathbb{E}[T_{n_0 2^r}]-\mathbb{E}[T_{n_0 2^{r-1}}]}{2\big(\mathbb{E}[T_{n_0 2^{r+1}}]-\mathbb{E}[T_{n_0 2^r}]\big)}, \qquad (210)\text{–}(212)$$
and taking the logarithm on both sides we derive the required amount of rounds:
$$t_r \leq \frac{\log\Big(\frac{2\,(\mathbb{E}[T_{n_0 2^{r+1}}]-\mathbb{E}[T_{n_0 2^r}])}{\mathbb{E}[T_{n_0 2^r}]-\mathbb{E}[T_{n_0 2^{r-1}}]}\Big)}{\log\big(\frac{1}{\sqrt{1-a}}\big)}. \qquad (213)\text{–}(214)$$
□

The following lemmas compute the wall-clock time that SRPFL and FedRep require in order to achieve target accuracy $\epsilon$. As discussed in Section 4.1, for fair comparison we consider accuracy of the form
$$\epsilon = \hat c\cdot\frac{a}{\sqrt{\frac{N}{n_0}(1-a)}\big(1-\sqrt{1-a}\big)},\quad \text{with } 2 > \hat c > 1. \qquad (216)$$
When $\hat c$ takes values close to $2$ we expect SRPFL to vastly outperform FedRep, and as $\hat c$ takes values close to $1$ the performance gap diminishes.

Lemma B.2. Suppose at each stage the clients' computational times are i.i.d. random variables drawn from the exponential distribution with parameter $\lambda$. Further, suppose that the expected communication cost per round is $C = c\cdot\frac{1}{\lambda}$, for some constant $c$.
Finally, consider the target accuracy $\epsilon$ given in (12). Then the expected wall-clock time for SRPFL is upper bounded as follows:
$$\mathbb{E}[T_{\mathrm{SRPFL}}] \leq \log(N)\Bigg(\frac{6(c+1) + 4\log\big(\frac{1}{\hat c-1}\big)}{\log\big(\frac{1}{1-a}\big)}\Bigg)\frac{1}{\lambda}. \qquad (217)$$

Proof. First we upper bound the expected cost suffered by our method until the distance between the current representation and the optimal representation becomes smaller than $X_{\log(N/n_0)}$, i.e., the cost corresponding to the first $\log\frac{N}{2n_0}$ stages of SRPFL (denoted by $\mathbb{E}[T^1_{\mathrm{SRPFL}}]$):
$$\mathbb{E}\big[T^1_{\mathrm{SRPFL}}\big] = \sum_{i=1}^{\log(N/2n_0)} t_i\big(\mathbb{E}[T_{n_0 2^i}] + C\big) \qquad (218)$$
$$\leq \sum_{i=1}^{\log(N/2n_0)} 2\big(\mathbb{E}[T_{n_0 2^i}] + C\big)\,\frac{\log\Big(\frac{2(\mathbb{E}[T_{n_0 2^{i+1}}]-\mathbb{E}[T_{n_0 2^i}])}{\mathbb{E}[T_{n_0 2^i}]-\mathbb{E}[T_{n_0 2^{i-1}}]}\Big)}{\log\big(\frac{1}{1-a}\big)} \qquad (219)$$
$$\leq \sum_{i=1}^{\log(N/4n_0)} 2\big(\mathbb{E}[T_{n_0 2^i}] + C\big)\,\frac{\log\Big(\frac{2(\mathbb{E}[T_{n_0 2^{i+1}}]-\mathbb{E}[T_{n_0 2^i}])}{\mathbb{E}[T_{n_0 2^i}]-\mathbb{E}[T_{n_0 2^{i-1}}]}\Big)}{\log\big(\frac{1}{1-a}\big)} + 2\big(\mathbb{E}[T_{N/2}]+C\big)\,\frac{\log\Big(\frac{2(\mathbb{E}[T_N]-\mathbb{E}[T_{N/2}])}{\mathbb{E}[T_{N/2}]-\mathbb{E}[T_{N/4}]}\Big)}{\log\big(\frac{1}{1-a}\big)}, \qquad (220)$$
where we used (214). Since the computational times of the clients come from the exponential distribution, the expected waiting time for the $n$ fastest of $N$ clients is the $n$-th order statistic, $\mathbb{E}[T_n] = \frac{1}{\lambda}\sum_{j=N-n+1}^{N}\frac{1}{j}$, and it is straightforward to derive the following bounds:
$$\mathbb{E}[T_N] - \mathbb{E}[T_{N/2}] = \frac{1}{\lambda}\sum_{j=1}^{N/2}\frac{1}{j} \leq \frac{1}{\lambda}\big(\ln(N/2)+1\big) \leq \frac{1}{\lambda}\log(N), \qquad (221)$$
$$\mathbb{E}[T_{N/2}] - \mathbb{E}[T_{N/4}] = \frac{1}{\lambda}\Big(\frac{1}{N/2+1} + \frac{1}{N/2+2} + \dots + \frac{1}{3N/4}\Big) \geq \frac{1}{4\lambda}, \qquad (222)\text{–}(223)$$
$$\mathbb{E}[T_{N/4}] - \mathbb{E}[T_{N/8}] = \frac{1}{\lambda}\Big(\frac{1}{3N/4+1} + \frac{1}{3N/4+2} + \dots + \frac{1}{7N/8}\Big), \qquad (224)$$
and so on. Making use of the above bounds, the expression in (220) further simplifies to
$$\mathbb{E}\big[T^1_{\mathrm{SRPFL}}\big] \leq \frac{5}{\log\big(\frac{1}{1-a}\big)}\sum_{i=1}^{\log(N/4n_0)}\big(\mathbb{E}[T_{n_0 2^i}]+C\big) + \frac{2\log\big(3\log(N)\big)}{\log\big(\frac{1}{1-a}\big)}\big(\mathbb{E}[T_{N/2}]+C\big). \qquad (225)\text{–}(226)$$
Further, notice that $\mathbb{E}[T_{N/2}] \leq \frac{1}{\lambda}$ and similarly $\mathbb{E}[T_{N/4}] \leq \frac{1}{3\lambda}$, $\mathbb{E}[T_{N/8}] \leq \frac{1}{7\lambda}$, $\mathbb{E}[T_{N/16}] \leq \frac{1}{15\lambda}$, and so on. Thus,
$$\sum_{i=1}^{\log(N/4n_0)}\mathbb{E}[T_{n_0 2^i}] = \mathbb{E}[T_{2n_0}] + \mathbb{E}[T_{4n_0}] + \dots$$
$$\dots + \mathbb{E}[T_{N/4}] \leq \frac{1}{\lambda}\sum_{j\geq 1}\frac{1}{2^j-1} \leq \frac{2}{\lambda}. \qquad (227)\text{–}(228)$$
Combining the bounds from (227) and (228) and substituting $C = c\cdot\frac{1}{\lambda}$ in expression (226), we derive the following bound:
$$\mathbb{E}\big[T^1_{\mathrm{SRPFL}}\big] \leq \frac{5}{\log\big(\frac{1}{1-a}\big)}\big(c\log(N/2n_0)+2\big)\frac{1}{\lambda} + \frac{2\log(3\log(N))}{\log\big(\frac{1}{1-a}\big)}(c+1)\frac{1}{\lambda} \leq \log(N/n_0)\,\frac{6(c+1)}{\log\big(\frac{1}{1-a}\big)}\,\frac{1}{\lambda}. \qquad (229)\text{–}(230)$$
Having derived an upper bound on the cost suffered by SRPFL over the first $\log\frac{N}{2n_0}$ stages, we now turn our attention to bounding the cost incurred from $X_{\log(N/n_0)}$ until the target accuracy $\epsilon$ is achieved (denoted by $\mathbb{E}[T^2_{\mathrm{SRPFL}}]$). Recall that
$$X_{\log(N/n_0)} = \frac{a}{\sqrt{\frac{N}{2n_0}(1-a)}\big(1-\sqrt{1-a}\big)}\Bigg(1 + \frac{\big(\mathbb{E}[T_{N/2}]+C\big)\big(1-\frac{1}{\sqrt 2}\big)}{\mathbb{E}[T_N]-\mathbb{E}[T_{N/2}]}\Bigg), \qquad (231)$$
and further, during the last stage of SRPFL, $N$ clients are utilized, giving the following form of the contraction inequality from (199):
$$\mathrm{dist}^+ \leq \mathrm{dist}\cdot\sqrt{1-a} + \frac{a}{\sqrt{\frac{N}{n_0}(1-a)}},\quad \text{with } a \leq \frac14. \qquad (232)$$
We first compute the number of rounds required in this second phase of the algorithm. Starting with distance $X_{\log(N/n_0)}$ and following the contraction in (232) for $t$ rounds, we derive current distance at most
$$X_{\log(N/n_0)}\big(\sqrt{1-a}\big)^t + \frac{a}{\sqrt{\frac{N}{n_0}(1-a)}}\cdot\frac{1-(\sqrt{1-a})^t}{1-\sqrt{1-a}} \qquad (233)$$
$$= \frac{a}{\sqrt{\frac{N}{n_0}(1-a)}\big(1-\sqrt{1-a}\big)}\Bigg(\sqrt 2\Big(1 + \frac{\big(\mathbb{E}[T_{N/2}]+C\big)\big(1-\frac{1}{\sqrt 2}\big)}{\mathbb{E}[T_N]-\mathbb{E}[T_{N/2}]}\Big)\big(\sqrt{1-a}\big)^t + \big(1-(\sqrt{1-a})^t\big)\Bigg), \qquad (234)\text{–}(235)$$
where in the first equality we use geometric series properties and in the second we substitute according to (231). Using the facts that $\sqrt 2\big(1-\frac{1}{\sqrt 2}\big)\big(\mathbb{E}[T_{N/2}]+C\big) \leq \mathbb{E}[T_{N/2}]+C$ and $\mathbb{E}[T_N]-\mathbb{E}[T_{N/2}] \geq \frac{1}{\lambda}$, expression (235) is upper bounded by
$$\frac{a}{\sqrt{\frac{N}{n_0}(1-a)}\big(1-\sqrt{1-a}\big)}\Big(\big(2+(c+1)\big)\big(\sqrt{1-a}\big)^t + 1 - \big(\sqrt{1-a}\big)^t\Big) \qquad (236)$$
$$\leq \frac{a}{\sqrt{\frac{N}{n_0}(1-a)}\big(1-\sqrt{1-a}\big)} + \frac{a}{\sqrt{\frac{N}{n_0}(1-a)}\big(1-\sqrt{1-a}\big)}\,(c+2)\big(\sqrt{1-a}\big)^t. \qquad (237)$$
The above implies that the number of rounds in the second phase is the smallest $t$ such that the target accuracy is achieved, i.e.,
$$\epsilon \geq \frac{a}{\sqrt{\frac{N}{n_0}(1-a)}\big(1-\sqrt{1-a}\big)}\Big(1 + (c+2)\big(\sqrt{1-a}\big)^t\Big). \qquad (238)$$
Further recall that from (216) the accuracy can be expressed in terms of
$$\epsilon = \hat c\cdot\frac{a}{\sqrt{\frac{N}{n_0}(1-a)}\big(1-\sqrt{1-a}\big)},\quad \text{with } 2 > \hat c > 1.$$
$$\qquad (239)$$
Combining the above and solving for $t$, we derive the required number of rounds for the second phase:
$$t \leq \frac{2\log\big(\frac{1}{\hat c-1}\big)}{\log\big(\frac{1}{1-a}\big)}. \qquad (240)$$
The expected cost during phase 2 can be computed as follows:
$$\mathbb{E}\big[T^2_{\mathrm{SRPFL}}\big] \leq \big(\mathbb{E}[T_N]+C\big)\Bigg(\frac{2\log\big(\frac{1}{\hat c-1}\big)}{\log\big(\frac{1}{1-a}\big)} + 1\Bigg) \qquad (241)$$
$$\leq \big(\ln(N)+1+c\big)\,\frac{1}{\lambda}\Bigg(\frac{2\log\big(\frac{1}{\hat c-1}\big)}{\log\big(\frac{1}{1-a}\big)} + 1\Bigg) \qquad (242)\text{–}(243)$$
$$\leq \log(N)\,\frac{4\log\big(\frac{1}{\hat c-1}\big)}{\log\big(\frac{1}{1-a}\big)}\,\frac{1}{\lambda}. \qquad (244)$$
Summing the two quantities of interest, we can derive the promised upper bound on the wall-clock time of SRPFL:
$$\mathbb{E}[T_{\mathrm{SRPFL}}] = \mathbb{E}\big[T^1_{\mathrm{SRPFL}}\big] + \mathbb{E}\big[T^2_{\mathrm{SRPFL}}\big] \qquad (245)$$
$$\leq \log(N/n_0)\,\frac{6(c+1)}{\log\big(\frac{1}{1-a}\big)}\,\frac{1}{\lambda} + \log(N)\,\frac{4\log\big(\frac{1}{\hat c-1}\big)}{\log\big(\frac{1}{1-a}\big)}\,\frac{1}{\lambda} \qquad (246)$$
$$\leq \log(N)\Bigg(\frac{6(c+1)+4\log\big(\frac{1}{\hat c-1}\big)}{\log\big(\frac{1}{1-a}\big)}\Bigg)\frac{1}{\lambda}. \qquad (247)$$
□

Having computed an upper bound on the expected wall-clock time of SRPFL, we proceed to compute a lower bound on the expected wall-clock time of FedRep.

Lemma B.3. Suppose at each stage the clients' computational times are i.i.d. random variables drawn from the exponential distribution with parameter $\lambda$. Further, suppose that the expected communication cost per round is $C = c\cdot\frac{1}{\lambda}$, for some constant $c$. Finally, consider the target accuracy $\epsilon$ given in (12). Then the expected wall-clock time for FedRep is lower bounded as follows:
$$\mathbb{E}[T_{\mathrm{FedRep}}] \geq \frac{\log(N)}{\lambda}\cdot\frac{\log N + 2\log\big(\frac{1}{\hat c-1}\big)}{\log\big(\frac{1}{1-a}\big)}. \qquad (248)$$

Proof. First we compute the number of rounds required by FedRep to achieve the target accuracy. Recall that FedRep utilizes $N$ clients at each round, giving the following form of the contraction inequality from (199):
$$\mathrm{dist}^+ \leq \mathrm{dist}\cdot\sqrt{1-a} + \frac{a}{\sqrt{\frac{N}{n_0}(1-a)}},\quad \text{with } a \leq \frac14. \qquad (249)$$
Starting with distance equal to $1$ and following the contraction in (249) for $t$ rounds, we derive current distance at most
$$\big(\sqrt{1-a}\big)^t + \frac{a}{\sqrt{\frac{N}{n_0}(1-a)}}\cdot\frac{1-(\sqrt{1-a})^t}{1-\sqrt{1-a}}, \qquad (250)$$
using the properties of geometric series. Further recall that from (216) the accuracy can be expressed as
$$\epsilon = \hat c\cdot\frac{a}{\sqrt{\frac{N}{n_0}(1-a)}\big(1-\sqrt{1-a}\big)},\quad \text{with } 2 > \hat c > 1.$$
$$\qquad (251)$$
The above imply that the number of rounds is going to be the smallest $t$ that guarantees that the target accuracy has been achieved, that is,
$$\hat c\cdot\frac{a}{\sqrt{\frac{N}{n_0}(1-a)}\big(1-\sqrt{1-a}\big)} \geq \big(\sqrt{1-a}\big)^t + \frac{a}{\sqrt{\frac{N}{n_0}(1-a)}}\cdot\frac{1-(\sqrt{1-a})^t}{1-\sqrt{1-a}}. \qquad (252)$$
We use the fact that $\sqrt{\frac{N}{n_0}(1-a)}\big(1-\sqrt{1-a}\big) - a > 0$ for $a \leq 1/4$ and all reasonable values of $N$ to rearrange and solve for $t$. Thus, we derive
$$t \geq \frac{2\log\big(\frac{1}{\hat c-1}\big)}{\log\big(\frac{1}{1-a}\big)} + \frac{\log N}{\log\big(\frac{1}{1-a}\big)}. \qquad (253)$$
Multiplying the number of rounds with a lower bound on the expected cost incurred per round results in the desired lower bound on the expected wall-clock time suffered by FedRep:
$$\mathbb{E}[T_{\mathrm{FedRep}}] \geq \big(\mathbb{E}[T_N]+C\big)\,\frac{2\log\big(\frac{1}{\hat c-1}\big) + \log N}{\log\big(\frac{1}{1-a}\big)} \geq \frac{\log(N)}{\lambda}\cdot\frac{\log N + 2\log\big(\frac{1}{\hat c-1}\big)}{\log\big(\frac{1}{1-a}\big)}. \qquad (254)\text{–}(255)$$
□

Combining the results of Lemma B.2 and Lemma B.3, we obtain Theorem 4.9.

Theorem 4.9. Suppose that at each stage the clients' computational times are i.i.d. random variables drawn from the exponential distribution with parameter $\lambda$. Further, suppose that the expected communication cost per round is $C = \frac{c}{\lambda}$, for some constant $c$. Finally, consider the target error $\epsilon$ given in (12). Then we have
$$\frac{\mathbb{E}[T_{\mathrm{SRPFL}}]}{\mathbb{E}[T_{\mathrm{FedRep}}]} \leq \frac{6(c+1) + 4\log\big(\frac{1}{\hat c-1}\big)}{\log N + 2\log\big(\frac{1}{\hat c-1}\big)} = \mathcal{O}\Bigg(\frac{\log\big(\frac{1}{\hat c-1}\big)}{\log(N) + \log\big(\frac{1}{\hat c-1}\big)}\Bigg). \qquad (256)$$

Remark B.4. The initialization scheme in Algorithm 2 guarantees that $\mathrm{dist}(B^0, B^*) \leq 1-c$, with probability at least $1-\mathcal{O}\big((mn)^{-100}\big)$, effectively without increasing the overall sample complexity. The formal statement and proof are identical to Theorem 3 in Collins et al. (2021) and are omitted.

C More on Experiments

Hyperparameters and choice of models. We set the hyperparameters following the work of Collins et al. (2021). Specifically, for the implementation of FedRep and FedRep-SRPFL we use SGD with momentum, where the momentum parameter is set to 0.5 and the local learning rate to 0.1. Further, similarly to Collins et al. (2021), we set the local learning rate to 0.1 for all other methods under consideration, which obtains optimal performance.
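The local optimizer described above (SGD with momentum 0.5 and learning rate 0.1) follows the standard buffered update. A minimal self-contained sketch of that rule on a toy quadratic — an illustration of the update form, not the authors' training code:

```python
def sgd_momentum_step(w, buf, grad, lr=0.1, momentum=0.5):
    """One PyTorch-style SGD-with-momentum step: buf <- mu*buf + grad, w <- w - lr*buf."""
    buf = momentum * buf + grad
    return w - lr * buf, buf

# Toy check on f(w) = w^2 / 2, whose gradient at w is simply w.
w, buf = 1.0, 0.0
for _ in range(200):
    w, buf = sgd_momentum_step(w, buf, grad=w)
```

With these hyperparameters the iterate spirals into the minimizer at 0; in a real run `grad` would be the mini-batch gradient of the local loss.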
We fix the batch size to 10 for all our implementations. The number of local epochs is set to 1 for CIFAR10 with N = 100 and to 5 for the rest of the datasets. In terms of the choice of the neural network model, for CIFAR10 we use LeNet-5, which includes two convolution layers with (64, 64) channels and three fully connected layers where the numbers of hidden neurons are (120, 64). The same structure is used for CIFAR100, but the numbers of channels in the convolution layers are increased to (64, 128) and the numbers of hidden neurons are increased to (256, 128). Additionally, a dropout layer with parameter 0.6 is added after the first two fully connected layers, which improves the testing accuracy. For EMNIST and FEMNIST, we use an MLP with three hidden layers with (512, 256, 64) hidden neurons. For Sent140 we use a two-layer bidirectional LSTM with dimension 256 and dropout rate 0.5, followed by a fully connected layer of dimension 5 and a classification head. Further, we use the standard GloVe embedding of dimension 100 and a vocabulary of size 10,000.

[Figure 7: Numerical results on CIFAR10, EMNIST, Sent140 with full participation (M = N) in the fixed computation speeds setting. Shard denotes the number of classes per client. C.T. denotes the communication cost per round.]

For the needs of SRPFL, we split the neural network model into two parts: the customized head hi and the common representation φ. In our experiments, we simply take the customized head to be the last hidden layer, and the rest of the parameters are treated as the common representation. Note that LG-FedAvg and LG-FLANP have a different head/representation split scheme: the head is globally shared across all clients, while a local version of the representation is maintained on every client. For all included datasets, i.e., CIFAR10, CIFAR100, EMNIST, FEMNIST and Sent140, the common head includes the last two fully connected layers, and the rest of the layers are treated as the representation part.

Datasets. We include five datasets in our empirical study: CIFAR10, which consists of 10 classes and a total of 50,000 training data points; CIFAR100, which consists of 100 classes and the same number of data points as CIFAR10; and EMNIST (balanced), which consists of 47 classes and 131,600 training data points. Note that in Figures 3-7, we use the first 10 classes from EMNIST following (Collins et al., 2021). For FEMNIST, we use the same setting as (Collins et al., 2021), with the exception that in Figures 3-6 we allocate to each client 150 data points, and in Figure 8 we allocate to client i, (100 + ui) samples, with ui a uniformly distributed random variable in [0, 50]. The Sentiment140 dataset contains 1,600,000 tweets annotated as negative or positive, which can be used to detect sentiment.
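The head/representation split described above (personalized head hi on each client, shared representation φ averaged on the server) can be sketched as a simple partition of the model's parameter dictionary. The layer names below are illustrative placeholders, not the actual model definition from our experiments.

```python
import numpy as np

# Hypothetical parameter dictionary for a small network; names are illustrative.
params = {
    "conv1.weight": np.zeros((64, 3, 5, 5)),
    "conv2.weight": np.zeros((64, 64, 5, 5)),
    "fc1.weight":   np.zeros((120, 64)),
    "fc2.weight":   np.zeros((64, 120)),
    "head.weight":  np.zeros((10, 64)),   # last layer = personalized head h_i
}

def split_head_representation(params, head_keys=("head.weight",)):
    """Split parameters into the client-specific head h_i and the
    globally shared representation phi (everything else)."""
    head = {k: v for k, v in params.items() if k in head_keys}
    representation = {k: v for k, v in params.items() if k not in head_keys}
    return head, representation

head, phi = split_head_representation(params)
# Only phi is communicated and averaged; head stays local to the client.
assert set(head) == {"head.weight"}
assert "head.weight" not in phi
```

For the LG-FedAvg-style split, the same helper applies with `head_keys` set to the shared layers instead, since there the roles of the two parts are reversed.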
For this dataset we perform pre-processing, splitting the samples across clients both in a homogeneous as well as a heterogeneous manner. The number of samples allocated to each client follows the log-normal distribution. During our training procedure, we perform the data augmentation operations of standard random crop and horizontal flip on the first two datasets and perform no pre-processing for the last one.

Homogeneous Setting. Apart from providing straggler-resilience benefits, our method takes advantage of the shared representation model to simultaneously address the hurdle of data heterogeneity. In the less challenging homogeneous setting one would expect FedRep-SRPFL to lose some of its competitive advantage. However, our numerical results in Figure 7 indicate that our method exhibits stable performance while maintaining an edge over the other baselines. We note that LG-SRPFL outperforms FedRep-SRPFL on the Sent140 dataset, where LG-FedAvg appears to be better-suited than FedRep. Importantly, our doubling scheme continues to enhance both subroutines, providing clear straggler-resilience benefits even in data-homogeneous environments.

[Figure 8: Numerical results on FEMNIST, Sent140, CIFAR10 in the fixed computation speeds setting. From left to right: Imbalanced/Imbalanced/Stragglers%/Correlated Heterogeneity. Shard denotes the number of classes per client. C.T. denotes the communication cost per round.]

Additional Regimes. In Figure 8 we observe the extent to which the benefits of our meta-algorithm persist in various regimes.

Imbalanced Datasets. In the two left columns we explore how the performance of our doubling scheme degrades for different levels of imbalance in the local datasets. In the FEMNIST dataset we allocate to each client i, (100 + ui) local samples, where ui is a random variable uniformly distributed in [0, 50]. In this setting the numbers of local samples are sufficiently close, and as a result the benefits of our scheme are prevalent for all different values of communication cost. As we allow the clients to have more diverse numbers of local samples, however, the speedup provided by our doubling scheme diminishes. Indeed, this is the case in the Sent140 dataset, where we allocate data to clients in a more disproportional manner following the log-normal distribution.
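A log-normal allocation of the kind used for Sent140 can be produced by normalizing i.i.d. log-normal draws; the sketch below shows one such scheme. The `sigma` value is an illustrative assumption (larger `sigma` yields more imbalance), not the setting used in our experiments.

```python
import numpy as np

def lognormal_allocation(total_samples, num_clients, sigma=1.0, seed=0):
    """Allocate total_samples across num_clients proportionally to
    i.i.d. log-normal draws; sigma controls the degree of imbalance."""
    rng = np.random.default_rng(seed)
    weights = rng.lognormal(mean=0.0, sigma=sigma, size=num_clients)
    alloc = np.floor(total_samples * weights / weights.sum()).astype(int)
    alloc[0] += total_samples - alloc.sum()  # hand the rounding remainder to one client
    return alloc

alloc = lognormal_allocation(10_000, 83)  # 83 clients, as in the Sent140 runs
assert alloc.sum() == 10_000
assert (alloc >= 0).all()
```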
We point out that in settings with high diversity, simply doubling the number of clients is not sufficient. Instead, to achieve optimal performance one needs to select consecutive participating sets of clients whose cumulative numbers of samples double from one stage to the next.

Different levels of stragglers. We further consider the setting where clients are split into two categories. The first category consists of typically fast clients, whose computational times come from the exponential distribution with λ = 0.1. The second category consists of clients who are typically straggling, i.e., sampling their computational cost from the exponential distribution with λ = 10. In the third column of Figure 8 we compare FedRep, LG-FedAvg and their straggler-resilient variants for different percentages of clients coming from the latter category. Unsurprisingly, the performance of the methods under consideration degrades as the percentage of stragglers increases; however, our doubling scheme mitigates this effect to some extent.

Correlated heterogeneity. In this setting we assign data to clients depending on their speeds. More specifically, we split clients into 3 groups based on their computational times, and respectively we group samples into 3 distinct groups based on their labels. Subsequently, we allocate mutually exclusive (with respect to their labels) groups of data to specific groups of clients. The rightmost column of Figure 8 indicates that the benefits of our doubling scheme persist, although to a lesser degree as the correlation between data and system heterogeneity grows.
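The effect of a fast/straggler mix on synchronous round times can be illustrated with a short simulation: a round lasts as long as its slowest participant, so waiting only for a small set of fast clients is far cheaper than full participation. Here we treat the λ values of the two categories as mean computation times (since the fast clients are the ones with the smaller value); that parameterization, and all constants below, are assumptions for the sketch.

```python
import numpy as np

def round_time(client_means, participants, rng):
    """A synchronous round lasts as long as its slowest participant;
    each client's time is exponential with the given mean."""
    times = rng.exponential(client_means[participants])
    return times.max()

rng = np.random.default_rng(0)
n = 100
straggler_frac = 0.3
# Fast clients: mean time 0.1; stragglers: mean time 10 (assumed parameterization).
means = np.where(rng.random(n) < straggler_frac, 10.0, 0.1)
order = np.argsort(means)  # fastest clients first

full = np.mean([round_time(means, np.arange(n), rng) for _ in range(200)])
fastest = np.mean([round_time(means, order[:10], rng) for _ in range(200)])

# Waiting only for the 10 fastest clients is much cheaper per round
# than waiting for the slowest of all 100 clients.
assert fastest < full
```

This is the mechanism the doubling scheme exploits: early stages involve only fast clients, and slower clients are folded in once their samples are needed for the target accuracy.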