# federated_representation_learning_in_the_underparameterized_regime__88ede9bf.pdf

Federated Representation Learning in the Under-Parameterized Regime

Renpu Liu 1 Cong Shen 2 Jing Yang 1

Federated representation learning (FRL) is a popular personalized federated learning (FL) framework where clients work together to train a common representation while retaining their personalized heads. Existing studies, however, largely focus on the over-parameterized regime. In this paper, we make the initial efforts to investigate FRL in the under-parameterized regime, where the FL model is insufficient to express the variations in all ground-truth models. We propose a novel FRL algorithm FLUTE, and theoretically characterize its sample complexity and convergence rate for linear models in the under-parameterized regime. To the best of our knowledge, this is the first FRL algorithm with provable performance guarantees in this regime. FLUTE features a dataindependent random initialization and a carefully designed objective function that aids the distillation of subspace spanned by the global optimal representation from the misaligned local representations. On the technical side, we bridge lowrank matrix approximation techniques with the FL analysis, which may be of broad interest. We also extend FLUTE beyond linear representations. Experimental results demonstrate that FLUTE outperforms state-of-the-art FRL solutions in both synthetic and real-world tasks.

1. Introduction

In the development of machine learning (ML), the role of representation learning has become increasingly essential. It transforms raw data into meaningful features, reveals hidden patterns and insights in data, and facilitates efficient learning

1Department of Electrical Engineering, The Pennsylvania State University, University Park, PA, USA 2Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, Virginia, USA. Correspondence to: Jing Yang <yangjing@psu.edu>.

Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

of various ML tasks such as meta-learning (Tripuraneni et al., 2021), multi-task learning (Wang et al., 2016a), and few-shot learning (Du et al., 2020).

Recently, representation learning has been introduced to the federated learning (FL) framework to cope with the heterogeneous local datasets at participating clients (Liang et al., 2020). In the FL setting, it often assumes that all clients share a common representation, which works in conjunction with personalized local heads to realize personalized prediction while harnessing the collective training power (Arivazhagan et al., 2019; Collins et al., 2021; Zhong et al., 2022; Shen et al., 2023).

Existing theoretical analysis of representation learning usually assumes the adopted model is over-parameterized to almost fit the ground-truth model (Tripuraneni et al., 2021; Wang et al., 2016a). While this may be valid for expressive models like Deep Neural Networks (He et al., 2016; Liu et al., 2017) or Large Language Models (Open AI, 2023; Touvron et al., 2023), it may be too restrictive for FL on resource-constrained devices, as adopting overparameterized models in such a framework faces several significant challenges, as elaborated below.

Computation limitation. In FL, edge devices like smartphones and Internet of Things (Io T) devices often have limited memory and lack computational power, which are not capable of either storing or training overparameterized models (Wang et al., 2019; He et al., 2020; Kairouz et al., 2021)1. Communication overhead. In FL, the clients need to communicate updated model information with the server frequently. It thus becomes prohibitive to transmit a huge number of model updates for devices operating with limited communication energy and bandwidth. Privacy concern. Existing works show that excessively expressive models may memorize relevant information from local datasets, increasing the model s susceptibility to reconstruction attacks (Hitaj et al., 2017; Melis et al., 2019; Wang et al., 2018; Li et al., 2020) or membership inference (Tan et al., 2022).

1For example, two of the widely adopted neural network models suitable for Io T or embedded devices, Mobile Net V3 (Howard et al., 2019) and Efficient Net-B0 (Tan & Le, 2019), only have a few million parameters and, as an example, typically process at most a few GFLOPS in a Raspberry Pi 4 (Ju et al., 2023).

Federated Representation Learning in the Under-Parameterized Regime

Motivated by those concerns, in this work, we focus on federated representation learning (FRL) in the underparameterized regime, where the parameterized model class is not rich enough to realize the ground-truth models across all clients. This is arguably a more realistic setting for edge devices supporting FL. Meanwhile, due to the inherent limitation of the expressiveness of the under-parameterized models, the algorithm design and theoretical guarantees in the over-parameterized regime do not naturally translate to this setting. We summarize our main contributions as follows.

Algorithm design. A major challenge for FRL in the under-parameterized regime is the fact that the locally optimal representation may not be globally optimal. As a result, simply averaging the local representations may not converge to the global optimal solution. To cope with this challenge, we propose FLUTE, a novel FRL framework tailored for the under-parameterized setting. To the best of our knowledge, this is the first FRL framework that focuses on the under-parameterized regime. Our algorithm design features two primary innovations. First, we develop a new regularization term that generalizes the existing formulations in a non-trivial way. In particular, this new regularization term is designed to provably enhance the performance of FRL in the under-parameterized setting. Second, our algorithm contains a new and critical step of server-side updating by simultaneously optimizing both the representation layer and all local head layers. This represents a significant departure from existing approaches in FRL, particularly in over-parameterized settings where local heads are optimized solely on the client side. By leveraging information across these local heads, our approach could learn the ground-truth model more effectively. Theoretical guarantees. In terms of theoretical performance, we specialize FLUTE to the linear setting and analyze the sample complexity required for FLUTE to recover a near-optimal model, as well as characterizing its convergence rate. FLUTE achieves a sample complexity that scales in O max{d,M}

Mϵ2 for recovering an ϵ-optimal model, where d is the dimension of the input data and M is the number of clients. This result indicates a linear sample complexity speedup in terms of M in the high dimensional setting (i.e., d M) compared with its single-agent counterpart (Hsu et al., 2012). Besides, it outperforms the sample complexity in the noiseless over-parameterized FRL setting (Collins et al., 2021) in terms of both M and d. Moreover, we show that FLUTE converges to the optimal model exponentially fast when the number of samples is sufficiently large. Technical contributions. In the under-parameterized regime, we must analyze the convergence of both the representation and personalized heads toward their op-

timal estimations. This is in sharp contrast to the overparameterized regime, where we only need to study the convergence of the representation column space to the ground truth (Collins et al., 2021; Zhong et al., 2022). Towards this end, we adopt a low-rank matrix approximation framework (Chen et al., 2023) of the ground-truth model. However, in contrast to conventional low-rank matrix approximation, in FRL, the global model is not accessible a priori but must be learned from distributed local datasets. Thus, the technical analysis needs to bound the unavoidable gradient discrepancy in the underparameterized regime, as well as ensure that neither gradient discrepancy nor noise-induced errors accumulate over iterations. To address these technical challenges, we first provide new concentration results to ensure that the norm of the gradient discrepancy can be bounded when local datasets are sufficiently large. We then develop iterationdependent upper bounds for sample complexity, which guarantee that the improvement in the estimation, i.e., the distance between our estimated model and the optimal low-rank model, can mitigate potential disturbances caused by gradient discrepancy and noise.

Empirical evaluation.2 We conduct a series of experiments utilizing both synthetic datasets for linear FLUTE and real-world datasets, specifically CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009), for general FLUTE. The empirical results demonstrate the advantages of FLUTE, as evidenced by its superior performance over baselines, particularly in the scenarios where the level of under-parameterization is significant.

2. Related Work

Representation learning. Representation learning focuses on acquiring a representation across diverse tasks to effectively extract feature information (Le Cun et al., 2015; Tripuraneni et al., 2021; Wang et al., 2016a; Finn et al., 2017). In the linear multi-task learning setting, Du et al. (2020) characterize the optimal solution of the empirical risk minimization (ERM) problem, demonstrating that the gap between the solution and the ground-truth representation

is upper bounded by O q

MN , where d is the dimension of data, M is the number of clients and N is the number of samples per task. Tripuraneni et al. (2021) give an upper

d MN using the Method-of-Moment estimator.

Thekumparampil et al. (2021) also show the O q

upper bound in their work. Duchi et al. (2022) consider data-dependent noise and show that the sample complexity required to recover the shared subspace of the linear mod-

2Main experiments can be reproduced with the code provided under the following link: https://github.com/ Renpu Liu/flute

Federated Representation Learning in the Under-Parameterized Regime

els scales in O log3(Nd) q

d MN . These works, however, only focus on the over-parameterized regime in a centralized setting.

Federated representation learning. Recently, representation learning has been introduced to FL

(Arivazhagan et al., 2019; Liang et al., 2020; Collins et al., 2021; Yu et al., 2020). Liang et al. (2020) propose an FRL framework named Fed-LG, where the distinct representations are stored locally and the common prediction head is forwarded to the server for aggregation. In contrast, Arivazhagan et al. (2019) propose Fed Per, where a common representation is shared among clients, with personalized local heads kept at the client side. A similar setting is adopted by Fed Rep (Collins et al., 2021), where exponential convergence to the optimal representation in the linear setting is proved. These works focus on the over-parameterized regime, while the under-parameterized regime has largely been overlooked.

Low-rank matrix factorization. Under-parameterized representation learning problem considered in this work is closely related to low-rank matrix factorization, where the objective is to find two low-rank matrices whose product is closest to a given matrix Φ. Pitaval et al. (2015) prove the global convergence of gradient search with infinitesimal step size for this problem. Ge et al. (2017) demonstrate that no spurious minima exists in such a problem and all saddle points are strict. Based on a revised robust strict saddle property, Zhu et al. (2021) show that the local search method such as gradient descent leads to a linear convergence rate with good initialization with a regularity condition on Φ. Chen et al. (2023) extend the analysis in Zhu et al. (2021) to general Φ, and show that with a moderate random initialization, the gradient descent method will converge globally at a linear rate. In the over-parameterized regime, Ye & Du (2021) proves that the gradient descent method will converge to a global minimum at a polynomial rate with random initialization. We note, however, that these works assume the perfect knowledge of Φ, which is different from the data-based representation learning problem considered in this work.

3. Problem Formulation

Notations. We use diag(x1, , xd) to denote a ddimension diagonal matrix with diagonal entries x1, , xd. x, y denotes the inner product of x and y, and x denotes the Euclidean norm of vector x. We use f ψ to denote the composition of functions f : Rk Rm and ψ : Rd Rk, i.e., (f ψ)(x) = f(ψ(x)). Id represents a d d identify matrix, and 0 is a d-dimensional all-zero vector.

FL with common representation. We consider an FL system consisting of M clients and one server. Client i has

a local dataset Di that consists of ni training samples (x, y) where x Rd and y Rm. For simplicity, we assume ni = N for all client i [M]. For (xi,j, yi,j) Di, we assume yi,j = gi(xi,j)+ξi,j, where xi,j is randomly drawn according to a sub-Gaussian distribution PX with mean 0 and covariance matrix Id, gi : Rd Rm is a deterministic function, and ξi,j Rm is an independent and identically distributed (IID) centered sub-Gaussian noise vector with covariance matrix σ2Id.

Federated representation learning (FRL) aims at learning both a common representation that suits all clients and an individual head that only fits client i. An FL framework adopting this principle was proposed by Arivazhagan et al. (2019), and we follow the same framework in this paper. More specifically, we assume that the local model of client i can be decomposed into two parts: a common representation ψB : Rd Rk shared by all clients and a local head fwi : Rk Rm, where B and wi are the parameters of the corresponding functions. Then, the ERM problem considered in this FRL framework can be formulated as:

min B,{wi} 1 M

(x,y) Di ℓ((fwi ψB)(x), y). (1)

This formulation leverages the common representation while accommodating data heterogeneity among clients, facilitating efficient personalized model training (Arivazhagan et al., 2019; Collins et al., 2021).

In this work, we focus on the under-parameterized setting in FRL, which is formally defined as follows.

Definition 3.1 (Under-Parameterization in FRL). Given a common representation class Ψ and a collection of local head classes {Fi}M i=1, an FRL problem is underparameterized if there does not exist a representation ψ Ψ, and a collection of functions f1 f2 . . . f M F1 F2 . . . FM such that fi ψ = gi for all i [M].

The over-parameterization in FRL can be defined in a symmetric form. This definition aligns with the overparameterized frameworks in matrix approximation, as detailed in Jiang et al. (2022); Ye & Du (2021), where overparameterization is characterized by the rank of the representation being no-less than that of the ground-truth model. It also encompasses the definition in central statistical learning (Belkin et al., 2019; Oneto et al., 2023), where overparameterization is defined as the predictor s function class being sufficiently rich to approximate the global minimum.

While various algorithms have been developed and analyzed in the over-parameterized setting (Arivazhagan et al., 2019; Liang et al., 2020; Collins et al., 2021), to the best of our knowledge, under-parameterized FRL has not been studied in the literature before. This is, however, arguably a more practical setting in large-scale FRL supported by a massive

Federated Representation Learning in the Under-Parameterized Regime

number of resources-scarce Io T devices, as such Io T devices usually cannot support the storage, computation, and communication of models parameterized by a large number of parameters, while the task heterogeneity across massive devices imposes significant challenges on the model class to reconstruct M different local models perfectly3.

Low-dimensional linear representation. We first focus on the linear setting in which all local models gi are linear, i.e., yi,j = ϕ i xi,j +ξi,j for (xi,j, yi,j) Di. Denote Φ := [ϕ1, , ϕM] Rd M and assume its rank is r. Then, similar to the works of Collins et al. (2021); Arivazhagan et al. (2019), we consider a linear prediction model where (fwi ψB)(x) can be expressed as x Bwi. Here, B Rd k

is the common linear representation shared across clients, and wi Rk is the local head maintained by client i. We denote W = [w1, , w M]. Then, if we further consider the ℓ2 loss function, the ERM problem becomes

min B,W 1 M

(x,y) Di x Bwi y 2. (2)

We note that the existing literature usually assumes that r k, which falls in the over-parameterized regime (Zhu et al., 2021). The over-parameterized assumption implies the existence of a pair of B and W that can accurately recover the ground-truth model Φ, i.e., BW = Φ. Thus, the learning goal in the over-parameterized regime is to identify such a pair using available training data (Du et al., 2020; Tripuraneni et al., 2021; Collins et al., 2021; Shen et al., 2023).

In contrast to the existing works, in the under-parameterized regime given in Definition 3.1, we have r > k, i.e., there does not exist matrices B Rd k and W Rk M such that BW = Φ. Our objective is to learn a common representation and local heads (B, W) in the federated learning framework such that BW Φ 2 F reaches its minimum, although Φ is not explicitly given but embedded in local datasets.

4. The FLUTE Algorithm

In this section, we present the FLUTE algorithm for the linear model. We will first highlight the unique challenges the under-parameterized setting brings, and then introduce our algorithm design.

3Continuing the previous example of Mobile Net, which can be adapted for object detection for autonomous driving (Chen et al., 2021), it is known that a single model may not capture very detailed or complex features of the complete environment, including pedestrians, cyclists, and various road signs (Chen et al., 2022).

4.1. Challenges

In order to understand the fundamental differences between the overand under-parameterized regimes, we first assume Φ is known beforehand, and consider solving the following optimization problem:

(B , W ) = arg min B Rd k,W Rk M BW Φ 2 F . (3)

Denote the singular value decomposition (SVD) of Φ as UΛV , where U and V are two unitary matrices, and Λ is a diagonal matrix. When k r, i.e., the model is over-parameterized, B and W can be explicitly constructed from the SVD of Φ, i.e., any (B, W) satisfying BW = UΛV is an optimizer to Equation (3). When k < r, i.e., in the under-parameterized regime, we can no longer recover the full matrix Φ with B and W . Instead, existing result (Golub & Van Loan, 2013) states that we can only determine that the solution must satisfy B W = UkΛk V k , where Λk is a k k diagonal matrix with the k largest singular values of Φ as the diagonal entries.

Compared with the over-parameterized setting, learning B and W from decentralized datasets is more challenging in the under-parameterized setting. Let B i be the locally optimized representation at client i, i.e., (B i , w i ) = arg min Biwi ϕi 2. Then, in the over-parameterized setting, B i will always stay in the same column space as B , i.e., span(B i ) span(B ), i [M]. However, for the under-parameterized setting, it is possible that span(B i ) span(B ), i [M]. How to aggregate the locally obtained B i to correctly span the column space of B

thus becomes a unique challenge in the under-parameterized setting and requires novel techniques different from those in the existing over-parameterized literature. Example 1. Consider a scenario that Φ Rd M with M < d. We assume Φ = U diag(λ1, , λM), where U := [u1, . . . , u M] is a unitary matrix and λ1 > λ2 > > λM > 0. Assume k = 1. Then, we have B W = u1λ1. Assume each client i can perfectly recover its local model ϕi = uiλi with Bi = uiλi/wi. Then, depending on the value of wi s, the aggregated representation B := 1 M P

i Bi may exhibit different properties. For example, if wi = λi, we have B = 1 M P

i ui, which deviates significantly from the column space of B . On the other hand, if wi = p

λi/M, then Bi = ui Mλi, while B = P i ui p

λi/M. Thus, u1 will have a heavier weight in the aggregated representation, which will eventually help recover the column space of B . Intuitively, to accurately recover the column space of B , in the under-parameterized setting, it requires a more sophisticated algorithm design not just to estimate the column space of Φ, but also distill the most significant components of it from distributed datasets in each aggregation.

Federated Representation Learning in the Under-Parameterized Regime

4.2. A New Loss Function

Motivated by the observation in Example 1, instead of considering the original problem in (2), we introduce two new regularization terms and consider the following ERM problem:

min B,{wi}M i=1

(x,y) Di x Bwi y 2 (4)

γ1 BW 2 F | {z } (I)

+ γ2( B B 2 F + WW 2 F ) | {z } (II)

In Equation (4), we introduce the regularization term (I) into the loss function, with the purpose of preserving the top-k significant components of BW. By preserving the significant components in B, the term (I) mitigates local over-fitting induced during local updates. However, minimizing term (I) alone would result in a uniform enlargement of all k singular values of BW. To address this, we further incorporate the regularization term (II). This term is specifically formulated to promote the k most significant components and suppress the less significant ones. By doing so, it aids the server in accurately distilling the correct subspace spanned by the optimal representation. We note that when γ1 = 2λ2, (I) and (II) together recover the conventional penalty term B B WW 2 F , which has been previously adopted for low-rank matrix approximation (Chen et al., 2023; Zhu et al., 2021; Wang et al., 2016b) and multi-task learning (Tripuraneni et al., 2021).

4.3. FLUTE for Linear Model

In order to solve the optimization problem given in (4), we introduce an algorithm named FLUTE (Ferated Learning in Under-parame Terized REgime), which is compactly described in Algorithm 1. Specifically, for each epoch, the algorithm consists of three major steps, namely, server broadcast, client update, and server update.

Server broadcast. At the beginning of epoch t, the server broadcasts the representation Bt 1 to all clients, and wt 1 i (i.e., the i-th column of Wt 1) to each individual client i.

Client update. Denoting the local loss function as Li = 1 N P

(x,y) Di x Bwi y 2, the client calculates the gradient of Li with respect to wt 1 i and Bt 1 i , respectively, and uploads them to the server.

Server update. After receiving wt 1 i Li and Bt 1Li from all clients, the server first aggregates them to update the global representation and local heads as follows:

Bt = Bt 1 ηl X

i [M] Bt 1Li wt 1 i , Bt 1 ,

wt i = wt 1 i ηl wt 1 i Li wt 1 i , Bt 1 , i [M],

Algorithm 1 FLUTE Linear

1: Input: Learning rates ηl and ηr, regularization parameter λ, communication round T, constant α 2: Initialization: All entries of B0 and W0 are independently sampled form N(0, α2). 3: for t [T] do 4: Server sends Bt 1 and wt 1 i to client i, i [M]. 5: for client i [M] in parallel do

6: Calculates wt 1 i Li wt 1 i , Bt 1 and

Bt 1Li wt 1 i , Bt 1 .

7: Sends gradients to the server. 8: end for 9: Server updates according to Equations (5) to (6). 10: end for

after which it constructs matrix Wt by setting Wt := [wt 1, , wt M]. It then performs another step of gradient descent with respect to the regularization term in (4) to refine the global representation and local heads and obtain Bt

Bt = Bt + γ1ηr Bt 1 Bt 1Wt 1 2 F (6)

γ2ηr Bt 1( (Bt 1) Bt 1 2 F + Wt 1(Wt 1) 2 F ),

Wt = Wt + γ1ηr Wt 1 Bt 1Wt 1 2 F γ2ηr Wt 1( (Bt 1) Bt 1 2 F + Wt 1(Wt 1) 2 F ).

The procedure repeats until some stop criterion is satisfied. Remark 4.1. When α is small, the initialization of B0

and W0 would ensure that the largest singular value of B0(W0)T is sufficiently small with high probability. As we will show in the next section, such initialization guarantees that FLUTE converges to the global minimum.

The major differences between FLUTE and existing FRL algorithms such as Fed Rep (Collins et al., 2021), Fed Rod (Chen & Chao, 2021), and Fed CP (Zhang et al., 2023a) lie in the server-side model updating. While these existing algorithms typically involve transmitting only the shared representation layers of local models to the server, with local heads being optimized and utilized exclusively at the client side, FLUTE requires clients to transmit both the shared representation layers and the local heads to the server. The increased communication cost is fundamentally necessary due to the unique nature of FRL in the underparameterized regime, as it allows for server-side optimization, not just aggregation, of the entire model. Furthermore, FLUTE introduces additional data-free penalty terms to the server-side updates. These terms are designed to guide the shared representation to converge toward the global minimum by leveraging the information in the local heads. This approach represents a significant paradigm shift in federated learning, aiming to enhance the overall global performance

Federated Representation Learning in the Under-Parameterized Regime

of the FRL model.

5. Theoretical Guarantees

Before introducing our main theorem, we denote d = min{d, M} and d = max{d, M}. We also denote λ1 λ2 λd as the ordered singular values of Φ with := 2(λk λk+1). Denote E = P

i λ2 i . We assume > 0 throughout the analysis.

5.1. Main Results

Theorem 5.1 (Sample complexity). Set γ1 = 1

4 and γ2 = 1

8 in Equation (4). Let 0 < α 1 10d, and η := ηl = ηr

228λ13 . Then, for any ϵ > 0, under Algorithm 1, there exists positive constants c and c such that when the number of samples per client satisfies

N cλ4 1k(d + log 1

δ + log log 1

k(λ1)2 + E +

and t log(ϵ

Mη 2/c λ2 1

k) log(1 η /16) , with probability at least 1 δ,

i [M] Btwt i B w i ϵ. (7)

Remark 5.2. Theorem 5.1 indicates that the per-client sample complexity scales in O max{d,M}

Mϵ2 . Compared with the single-client setting, which is essentially a noisy linear regression problem with sample complexity O( d

ϵ2 ) (Hsu et al., 2012), FLUTE achieves a linear speedup in terms of M in the high dimensional setting (i.e., d > M). When d < M, the sample complexity of FLUTE becomes independent with M, which is due to the fact that each client requires a minimum number of samples to have the local optimization problem non-ill-conditioned. Compared with the sample complexity O d M + log(M) of FRL in the noiseless over-parameterized setting (Collins et al., 2021), FLUTE achieves more favorable dependency on M. Remark 5.3. We note that the dependency on and λ1, especially , is unique for the under-parameterized FRL. For the special case when λ k+1 = 0, the problem we consider essentially falls into the overparameterized regime, and FLUTE can still be applied. Theorem 5.1 shows that the sample complexity scales in O( max{d,M}

Mϵ2 ( λ 1 )10). We note that under the assumption that B consists of orthonormal columns, the SOTA sample complexity in the over-parameterized regime scales in O( max{d,M}

Mϵ2 κ4) (Tripuraneni et al., 2021), where κ = σ1((W ) W )/σr((W ) W ). Under the same assumption on B , we have λ 1 = p

σ1((W ) W ), = p

σk((W ) W ), and our sample complexity then becomes O( max{d,M}

Mϵ2 κ5). The additional order of κ in

the bound is due to an initial state-dependent quantity bounded by λ 1 . The detailed analysis can be found in Appendix A.2.3. Remark 5.4. The sample complexity in Theorem 5.1 requires that the size of each local dataset be sufficiently large. This is in stark contrast to the sample complexity result in existing works (Collins et al., 2021), which imposes a requirement on the total number of samples in the system instead of on each individual client/task. We need the size for each local dataset to be sufficiently large to ensure that every ϕi can be locally estimated with a small error so that the top k components of the ground truth Φ can be correctly recovered.

Theorem 5.5 (Convergence rate). Set γ1, γ2 and η as in Theorem 5.1. Denote κT = (1 η

16 )T . Then, for a constant TR (defined in Equation (16) in Appendix A) and any T > TR, there exist positive constants c1 and c2 such that when

N c1 (d log δ+log T )(

κ2 T 2 , for all TR < t T, with probability at least 1 δ, we have

i [M] Btwt i B w i c2λ2 1

Remark 5.6. Theorem 5.5 shows that when the number of samples per client N is sufficiently large, FLUTE converges exponentially fast. We note that the required number of samples grows exponentially in the total number of iterations. Such an exponential increase in the required number of samples is essential to guarantee that the noise level, which is the gradient estimation error, decays at least as fast as the decay rate of the representation estimation error, which is exponential. Similar phenomenon has been observed in the literature (Mitra et al., 2021; Zhang et al., 2023b). In our problem, there are essentially two parts of noise in the learning process. One is the sub-exponential label noise ξi,j, and the other is the gradient discrepancy arising from the under-parameterized nature. This discrepancy persists even when Bt and Wt are nearly optimal, leading to an unavoidable gap between Bt Wt and Φ. This gap behaves similarly to the sub-Gaussian noise in the convergence analysis, as elaborated in Section 5.2. Therefore, an exponential increase in the number of samples is required to cope with both parts of the noise and ensure the one-step improvement of the estimation error as iteration grows. Remark 5.7. We also note that both the sample complexity in Theorem 5.1 and the convergence rate in Theorem 5.5 are influenced by , the gap between λk and λk+1. A smaller signifies a growing challenge in correctly identifying the top-k principal components of Φ, leading to increased sample complexity and slower convergence. This is due to the challenge of accurately distinguishing and recovering the k-th and (k + 1)-th significant components from the dataset when is small. Note that in order to successfully distin-

Federated Representation Learning in the Under-Parameterized Regime

guish σk and σk+1, we need to estimate them to be /2accurate, i.e., |ˆσk σk| /2 and |ˆσk+1 σk+1| /2. Hence, the required number of samples per client would grow significantly when is small, and this is arguably inevitable.

5.2. Proof Sketch

In this subsection, we outline the major challenges and main steps in the proof of Theorem 5.5 while deferring the complete analysis to Appendix A. Theorem 5.1 can be proved once Theorem 5.5 is established.

Challenges of the analysis. The analytical frameworks proposed by Collins et al. (2021) and Zhong et al. (2022) for over-parameterized learning scenarios, as well as by Chen et al. (2023) for low-rank matrix approximation, cannot handle the unique challenges that arise in the underparameterized FRL framework, as elaborated below.

The first major challenge we encounter is to bound the gradient discrepancy on the update of Bt, denoted as (Bt Wt Φ)(Wt) P

i [M] Xi X i N Btwt i ϕi (wt i) . Such difficulty is absent in the analyses in Collins et al. (2021) and Zhong et al. (2022) because, in the overparameterized regime and with a fixed number of samples per client per iteration, the error caused by the gradient discrepancy decays at a rate comparable to that of the representation estimation error. Therefore, the gradient discrepancy will gradually converge to zero. However, for the underparameterized setting, even with the optimal (Bt, Wt), i.e., when Bt Wt = B W , gradient discrepancy can still be non-zero, as the optimal representation cannot recover all local models, i.e., B W = Φ. Instead, it only decreases when the number of samples N increases. This phenomenon indicates that an increase in the number of samples is essential to ensure one-step improvements of the estimated representations toward the ground-truth representation as the iteration progresses.

Another main challenge is ensuring that neither the gradient discrepancy nor noise-induced errors accumulate over iterations. This is critical as error accumulation can lead to significant deviation from the optimal solution, resulting in poor convergence and degraded model performance. To achieve this, we need to ensure the improvement of the estimation can dominate the effect of potential disturbances.

To tackle these new challenges, we first prove two concentration lemmas (Lemma A.13 and Lemma A.14 in Appendix A.2.5) to ensure that the norm of the gradient discrepancy can be bounded when local datasets are sufficiently large. Next, to address the second challenge of avoiding the accumulation of gradient discrepancy and noise-induced errors over iterations, we develop iteration-dependent upper bounds for sample complexity (Lemma A.10 and

Lemma A.11 in Appendix A.2.3). These bounds guarantee that the improvement in estimation, i.e., the distance improvement between our estimated model and the optimal low-rank model, can mitigate potential disturbances caused by gradient discrepancy and noise. We establish this by introducing a novel approach to derive an accuracy-dependent upper bound for the per-client sample complexity, ensuring the error caused by the gradient discrepancy decays as fast as the increase of the signal-to-noise ratio (SNR), formally introduced in Appendix A.

Main steps of the proof. First, we transform the asymmetric matrix factorization problem into a symmetric problem by appropriately padding 0 columns or rows to Bt

and Wt and constructing the updating matrices Θt (see Appendix A). Our goal is then to prove that Θt(Θt) converges. We first show that, with a small random initialization, Θt will enter a region containing the optima with high probability. Then, utilizing Lemma A.13 and Lemma A.14, we demonstrate that when Θt enters the region R, it will remain in this region with high probability despite gradient discrepancy and noise. Finally, utilizing Lemma A.10 and Lemma A.11, we show that when N is sufficiently large, Θt(Θt) converges at a linear rate with high probability under the influence of gradient discrepancy and noise, provided that the initialization condition satisfies Θ0 R.

6. General FLUTE

In this section, we extend FLUTE designed for linear models to more general settings. Specifically, we use ψB to denote the representation, and assume linear local heads fi(z) = H i z + bi, where Hi Rk m, bi Rm. This is motivated by the neural network architecture where all layers before the last layer are abstracted as the representation layer, and the last layer is linear. Then, the objective function becomes

min B,{Hi},{bi} 1 M

(x,y) Di ℓ H i ψB(x) + bi, y

+ λR({Hi}, B), (9)

where R({Hi}, B) is the regularization term to encourage the alignment of local models with the global optimum structure.

The general FLUTE algorithm for solving problem (9) is provided in Algorithm 2 in Appendix B. Given the nonlinearity of ψB, the penalty introduced in linear FLUTE is not directly applicable to the general problem. We thus formulate and design new penalty terms, following the same principles that motivated the design in the linear setting. This is to mitigate the local over-fitting induced by local updates and to encourage a structure benefit to global optimization. As a concrete example, we present a design of the penalty term for the classification problem with CNN as a

Federated Representation Learning in the Under-Parameterized Regime

prediction model in Section 7.2.

7. Experimental Results

7.1. Synthetic Datasets

We generate a synthetic dataset as follows. First, we randomly generate ϕi according to a d-dimensional standard Gaussian distribution. For each ϕi, we then randomly generate N pairs of (x, y), where x is sampled from a standard Gaussian distribution, ξ is sampled from a centered Gaussian distribution with variance σ2, and y = ϕ i x + ξ.

In Figure 1, we compare FLUTE with Fed Rep (Collins et al., 2021). We measure the quality of the learned representation Bt and Wt over the metric 1 M P i [M] Btwt i ϕi . We emphasize that Fed Rep requires empirical covariance estimated from the local datasets to be transmitted to the server for the initialization. Thus, it begins with a good estimate of the subspace spanned by B . In contrast, FLUTE commences with a random initialization of both the representation and the heads. As a result, Fed Rep converges to a relatively small error within the few initial epochs, while FLUTE needs to go through more epochs to obtain a good estimate of the representation. However, as the learning progresses over more epochs, FLUTE eventually outperforms Fed Rep. To validate this hypothesis, we introduce Fed Rep(RI) in our experiments, which has the same initialization as FLUTE but is otherwise identical to Fed Rep. We see from Figure 1 that when Fed Rep is randomly initialized, FLUTE outperforms Fed Rep(RI) in much fewer iterations.

We also observe that the performance gain of FLUTE is more pronounced in highly under-parameterized scenarios, i.e., where k is relatively small. As k increases, the gap between the convergence rates of FLUTE and Fed Rep narrows. These results demonstrate that FLUTE achieves better performance in the under-parameterized regime. In the additional experimental results included in Appendix C, we also observe that when the number of participating clients M increases, the average error of the model learned from FLUTE decreases, which is consistent with Theorem 5.5.

7.2. Real World Datasets

Datasets and models. We now evaluate the performance of general FLUTE on multi-class classification tasks with real-world datasets CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009). For all experiments, we adopt a convolutional neural network (CNN) with two convolution layers, two fully connected layers with Re LU activation, and a final fully connected layer with a softmax activation function. A detailed description of the CNN structure is deferred to Appendix C of the Appendix.

Algorithms for comparison. We compare FLUTE with sev-

Figure 1. Experimental results with synthetic datasets.

eral baseline algorithms, including Fed Avg (Mc Mahan et al., 2017), Fed-LG (Liang et al., 2020), Fed Per (Arivazhagan et al., 2019), Fed Rep (Collins et al., 2021), Fed Rod (Chen & Chao, 2021) and Fed CP (Zhang et al., 2023a). Fed-LG is designed to learn a common head shared across clients while allowing for localized representations, while Fed Per and Fed Rep both assume shared representation and personalized local heads. Fed Rod extends the model considered in Fed Rep by adding another head layer into the local model, and Fed CP further equips a conditional policy network into the local model. We also consider variants of FLUTE and Fed Rep, denoted as FLUTE* and Fed Rep*, respectively, under which we vary the number of updates of the local heads in each communication round, as elaborated later.

Loss function and penalty. For algorithms other than FLUTE and FLUTE*, the local loss function is chosen as Li = 1 N P

(x,y) Di LCE(H i ψB(x) + bi, y), where LCE is the cross entropy loss. The local loss function for FLUTE and FLUTE* are specialized as

Li(B, b, H) = 1

(x,y) Di LCE H i ψB(x) + bi, y +

λ1 ψB(x) 2 + λ2 Hi 2 F + λ3NCi(Hi), (10)

where y Rm is a one-hot vector whose k-th entry is 1 if the corresponding x belongs to class k, λ1, λ2 and λ3 are non-negative regularization parameters. NCi(Hi) is motivated by Papyan et al. (2020) and set as H i Hi H i Hi F 1 m 1uiu i Im 1

m1m1 m F , where

ui is an m-dimensional one-hot vector whose c-th entry

Federated Representation Learning in the Under-Parameterized Regime

Table 1. Average test accuracy on CIFAR-10 and CIFAR-100.

Dataset CIFAR-10 CIFAR-100

Partition 50 2 50 5 100 2 100 5 100 5 100 10 100 20 100 40

Fed Avg 34.460 1.083 47.217 0.395 41.584 0.433 51.876 0.675 20.212 0.574 31.533 0.519 34.659 0.482 32.902 0.195 Fed Avg-FT 83.996 0.948 71.465 0.701 84.688 0.437 70.884 0.697 78.342 0.574 66.660 0.370 54.464 0.178 44.858 0.119 Fed-LG 82.724 2.137 61.820 0.409 83.019 0.431 62.957 0.895 72.526 0.692 53.526 0.151 34.445 0.375 22.702 0.315 Fed Per 85.173 1.082 74.015 0.724 86.168 0.703 73.666 0.281 76.001 0.454 67.100 0.229 56.066 0.389 44.689 0.411 Fed Rep 86.133 0.775 71.737 0.296 86.685 0.766 73.808 0.561 78.621 0.159 68.530 0.255 56.360 0.245 43.061 0.476 Fed Rep* 87.320 1.485 75.766 0.220 87.177 0.489 75.296 0.505 78.892 0.410 68.630 0.705 56.654 0.609 42.025 0.404 Fed Ro D 79.476 2.966 68.728 1.750 83.296 1.545 72.116 0.788 74.299 0.338 66.462 0.284 57.280 0.105 48.120 0.186 Fed CP 85.361 1.605 71.603 0.885 84.798 0.489 71.344 0.587 74.266 0.559 66.426 0.372 57.067 0.483 43.638 0.415 FLUTE 87.012 0.453 76.478 0.484 86.128 1.007 76.918 0.712 77.750 0.615 70.598 0.282 59.243 0.334 48.169 0.597 FLUTE* 87.713 1.365 76.543 0.921 88.567 0.457 78.255 0.688 79.560 0.627 70.844 0.419 59.714 0.448 48.170 0.440

is 1 if c Ci, and denotes the Hadamard product. We specialize the regularization term optimized on the server side as R({Hi}) = P

i NCi(Hi). Note that for general FLUTE specified to a classification problem, we penalize ψB(x) instead of directly penalizing the parameter B. Since ψB(x) depends on data, the regularization term is optimized partially on the client side and partially on the server side.

Compared with the objective function in Equation (4) for the linear case, the term λ3NCi(Hi) replaces term (I) and λ1 ψB(x) 2 2 + λ2 Hi 2 F replaces term (II). The primary goal of introducing λ3NCi(Hi) is to mitigate local overfitting that occurs during local updates in the training process. As elaborated in Appendix B.3, such a regularization term promotes a beneficial structure for the global model, facilitating efficient learning performance. This term shares the same motivation as the term (I) in the linear scenario, which focuses on distilling significant components from the model to mitigate local over-fitting effects. For term (II), we only replace B B 2 F with ψB(x) 2, since the representation is not linear in general.

Implementation and evaluation. We use m to denote the number of classes assigned to each client. For CIFAR-10 dataset, we consider four (N, m) pairs: (50, 2), (50, 5), (100, 2) and (100, 5); For CIFAR-100 dataset, we consider four (N, m) pairs: (100, 5), (100, 10), (100, 20) and (100, 40).

For experiments conducted on the CIFAR-10 dataset, all algorithms are executed over 100 communication rounds. For LG-Fed, Fed Per, Fed Ro D, Fed CP and FLUTE, each client performs one round of local updates in each communication round. Fed Rep performs one epoch of local head update and an additional epoch for the local representation update. Compared with Fed Rep, Fed Rep* processes 10 epochs to update its local heads and one epoch to update its representation. For comparison, FLUTE* also runs 11 rounds of local updates, updating both representation and local head in the first round, followed by 10 rounds of only updating the local head.

The experiments on the CIFAR-100 dataset also use 100 communication rounds. The number of local updates for LG-Fed, Fed Per, Fed Ro D, Fed CP and FLUTE are set to 5. Fed Rep is configured to update the local representation and head for 5 epochs each, while Fed Rep* allocates 5 epochs for updating the local representation and 10 for updating the local head. FLUTE* runs 15 epochs of local updates, where the initial 5 epochs update both the representation and local head while the subsequent 10 epochs solely update the local head.

Averaged performance. The results are reported in Table 1. It is evident that FLUTE and FLUTE* consistently outperform other baseline algorithms in all experiments conducted on CIFAR-10 and CIFAR-100 datasets. This superior performance is attributed to the tailored design that encourages the locally learned models to move towards a global optimal solution rather than a local optimum. We also observe that the gain of FLUTE and FLUTE* becomes more prominent with larger N and m. Intuitively, larger N and m implies more severe under-parameterization for the given CNN model, and our algorithms exhibit more advantage for such cases.

8. Conclusion

To the best of our knowledge, this paper represents the first effort in the study of federated representation learning in the under-parameterized regime, which is of great practical importance. We have proposed a novel FRL algorithm FLUTE that was inspired by asymmetric low-rank matrix approximation. FLUTE incorporates a novel regularization term in the loss function and solves the corresponding ERM problem in a federated manner. We proved the convergence of FLUTE and established the per-client sample complexity that is comparable to the over-parameterized result but with very different proof techniques. We also extended FLUTE to general (non-linear) settings which are of practical interest. FLUTE demonstrated superior performance over existing FRL solutions in both synthetic and real-world tasks, highlighting its advantages for efficient learning in the under-parameterized regime.

Federated Representation Learning in the Under-Parameterized Regime

Acknowledgements

The work of R. Liu and J. Yang was supported in part by the U.S. National Science Foundation under grants CNS1956276, CNS-2003131 and CNS-2114542. The work of C. Shen was supported in part by the U.S. National Science Foundation under grants ECCS-2332060, CPS-2313110, ECCS-2143559, and ECCS-2033671.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

Arivazhagan, M. G., Aggarwal, V., Singh, A. K., and Choudhary, S. Federated learning with personalization layers. ar Xiv preprint ar Xiv:1912.00818, 2019.

Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling modern machine-learning practice and the classical bias variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849 -15854, July 2019.

Chen, H., Chen, X., Elmasri, M., and Sun, Q. Fast global convergence of gradient descent for low-rank matrix approximation. ar Xiv preprint ar Xiv:2305.19206, 2023.

Chen, H.-Y. and Chao, W.-L. On bridging generic and personalized federated learning for image classification. ar Xiv preprint ar Xiv:2107.00778, 2021.

Chen, L., Lin, S., Lu, X., Cao, D., Wu, H., Guo, C., Liu, C., and Wang, F.-Y. Deep neural network based vehicle and pedestrian detection for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems, 22(6):3234 3246, 2021.

Chen, Y., Dai, X., Chen, D., Liu, M., Dong, X., Yuan, L., and Liu, Z. Mobile-Former: Bridging Mobile Net and transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5270 5279, 2022.

Collins, L., Hassani, H., Mokhtari, A., and Shakkottai, S. Exploiting shared representations for personalized federated learning. In International Conference on Machine Learning, pp. 2089 2099. PMLR, 2021.

Davidson, K. R. and Szarek, S. J. Local operator theory, random matrices and banach spaces. Handbook of the geometry of Banach spaces, 1(317-366):131, 2001.

Du, S. S., Hu, W., Kakade, S. M., Lee, J. D., and Lei, Q. Few-shot learning via learning the representation, provably. ar Xiv preprint ar Xiv:2002.09434, 2020.

Duchi, J., Feldman, V., Hu, L., and Talwar, K. Subspace recovery from heterogeneous data with non-isotropic noise, 2022.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic metalearning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126 1135. PMLR, 2017.

Ge, R., Jin, C., and Zheng, Y. No spurious local minima in nonconvex low rank problems: A unified geometric analysis, 2017.

Golub, G. H. and Van Loan, C. F. Matrix computations. JHU Press, 2013.

He, C., Annavaram, M., and Avestimehr, S. Group knowledge transfer: Federated learning of large CNNs at the edge. Advances in Neural Information Processing Systems, 33:14068 14080, 2020.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770 778, 2016.

Hitaj, B., Ateniese, G., and Perez-Cruz, F. Deep models under the gan: information leakage from collaborative deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 603 618, 2017.

Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1314 1324, 2019.

Hsu, D., Kakade, S. M., and Zhang, T. Random design analysis of ridge regression. In Conference on Learning Theory, pp. 9 1. JMLR Workshop and Conference Proceedings, 2012.

Hwang, S.-G. Cauchy s interlace theorem for eigenvalues of hermitian matrices. The American Mathematical Monthly, 111(2):157 159, 2004.

Jiang, L., Chen, Y., and Ding, L. Algorithmic regularization in model-free overparametrized asymmetric matrix factorization. ar Xiv preprint ar Xiv:2203.02839, 2022.

Ju, R.-Y., Lin, T.-Y., Jian, J.-H., and Chiang, J.-S. Efficient convolutional neural networks on Raspberry Pi for image classification. Journal of Real-Time Image Processing, 20(2):21, 2023.

Federated Representation Learning in the Under-Parameterized Regime

Kairouz, P., Mc Mahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1 2):1 210, 2021.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Le Cun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436 444, 2015.

Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50 60, 2020.

Liang, P. P., Liu, T., Ziyin, L., Allen, N. B., Auerbach, R. P., Brent, D., Salakhutdinov, R., and Morency, L.-P. Think locally, act globally: Federated learning with local and global representations. ar Xiv preprint ar Xiv:2001.01523, 2020.

Liu, W., Wang, Z., Liu, X., Zeng, N., Liu, Y., and Alsaadi, F. E. A survey of deep neural network architectures and their applications. Neurocomputing, 234:11 26, 2017.

Mc Mahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In AISTATS, pp. 1273 1282. PMLR, 2017.

Melis, L., Song, C., De Cristofaro, E., and Shmatikov, V. Exploiting unintended feature leakage in collaborative learning. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 691 706, 2019.

Mitra, A., Jaafar, R., Pappas, G. J., and Hassani, H. Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients. Advances in Neural Information Processing Systems, 34:14606 14619, 2021.

Oneto, L., Ridella, S., and Anguita, D. Do we really need a new theory to understand over-parameterization? Neurocomputing, 543:126227, 2023.

Open AI. Gpt-4 technical report, 2023.

Papyan, V., Han, X. Y., and Donoho, D. L. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652 24663, Sept. 2020.

Pitaval, R.-A., Dai, W., and Tirkkonen, O. Convergence of gradient descent for low-rank matrix approximation. IEEE Transactions on Information Theory, 61(8):4451 4457, 2015.

Rudelson, M. and Vershynin, R. The littlewood-offord problem and invertibility of random matrices, 2008.

Shen, Z., Ye, J., Kang, A., Hassani, H., and Shokri, R. Share your representation only: Guaranteed improvement of the privacy-utility tradeoff in federated learning, 2023.

Tan, J., Mason, B., Javadi, H., and Baraniuk, R. Parameters or privacy: A provable tradeoff between overparameterization and membership inference. Advances in Neural Information Processing Systems, 35:17488 17500, 2022.

Tan, M. and Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105 6114. PMLR, 2019.

Thekumparampil, K. K., Jain, P., Netrapalli, P., and Oh, S. Sample efficient linear meta-learning by alternating minimization, 2021.

Tirer, T. and Bruna, J. Extended unconstrained features model for exploring deep neural collapse, 2022.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. Llama: Open and efficient foundation language models, 2023.

Tripuraneni, N., Jin, C., and Jordan, M. Provable metalearning of linear representations. In International Conference on Machine Learning, pp. 10434 10443. PMLR, 2021.

Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. ar Xiv preprint ar Xiv:1011.3027, 2010.

Wang, J., Kolar, M., and Srebro, N. Distributed multitask learning with shared representation. ar Xiv preprint ar Xiv:1603.02185, 2016a.

Wang, L., Zhang, X., and Gu, Q. A unified computational and statistical framework for nonconvex low-rank matrix estimation, 2016b.

Wang, S., Tuor, T., Salonidis, T., Leung, K. K., Makaya, C., He, T., and Chan, K. Adaptive federated learning in resource constrained edge computing systems. IEEE Journal on Selected Areas in Communications, 37(6): 1205 1221, 2019.

Wang, Z., Song, M., Zhang, Z., Song, Y., Wang, Q., and Qi, H. Beyond inferring class representatives: User-level privacy leakage from federated learning, 2018.

Ye, T. and Du, S. S. Global convergence of gradient descent for asymmetric low-rank matrix factorization, 2021.

Federated Representation Learning in the Under-Parameterized Regime

Yu, T., Bagdasaryan, E., and Shmatikov, V. Salvaging federated learning by local adaptation. ar Xiv preprint ar Xiv:2002.04758, 2020.

Zhang, J., Hua, Y., Wang, H., Song, T., Xue, Z., Ma, R., and Guan, H. Fedcp: Separating feature information for personalized federated learning via conditional policy. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 3249 3261, 2023a.

Zhang, T. T. C. K., Toso, L. F., Anderson, J., and Matni, N. Meta-learning operators to optimality from multi-task non-iid data, 2023b.

Zhong, A., He, H., Ren, Z., Li, N., and Li, Q. Feddar: Federated domain-aware representation learning. ar Xiv preprint ar Xiv:2209.04007, 2022.

Zhu, Z., Li, Q., Tang, G., and Wakin, M. B. The global optimization geometry of low-rank matrix optimization, 2021.

Federated Representation Learning in the Under-Parameterized Regime

Notations. Throughout this paper, bold capital letters (e.g., X) denote matrices, and calligraphic capital letters (e.g., C) denote sets. We use tr(X) to denote the trace of matrix X, σmin(X) and σmax(X) to denote the minimum and maximum singular values of X, respectively, and diag(x1, , xd) to denote a d-dimensional diagonal matrix with diagonal entries x1, , xd. |C| denotes the cardinality of set C, and {Xi}i [N] denotes the set {X1, , XN}. We use x, y to denote the inner product of x and y, and x to denote the Euclidean norm of vector x. We use f ψ to denote the composition of functions f : Rk Rm and ψ : Rd Rk, i.e., (f ψ)(x) = f(ψ(x)). a b indicates a Cb for a positive constant C. Id represents a d d identity matrix, and 0 is a d-dimensional all-zero vector.

Denote d := max{d, M}, d := min{d, M}, and Φ Rd d as the matrix constructed from Φ Rd M by padding allzero columns or rows. Define its SVD as Φ = U Λ V . Denote Λ = diag(2Λ , 2Λ ) and let λ 1 λ 2 λ 2d be the eigenvalues of Λ, with = λ k λ k+1. Note that the definition of is consistent with the definition in Section 5. For clarity of presentation, we use σξ to denote the standard deviation of the noise ξ instead of σ that is used in the main paper.

A. Analysis of the FLUTE Linear Algorithm

A.1. Preliminaries

We start with the updating rule of Bt and Wt in Algorithm 1.

For Bt, from FLUTE we have the following updating rule:

Bt+1 = Bt η

j [N] xi,j x i,j Btwt i yi,j (wt i) η

2Bt (Bt) Bt Wt(Wt)

i [M] Xi X i Btwt i ϕi (wt i) η

2Bt (Bt) Bt Wt(Wt) .

Since data points {xi,j} are sampled from a standard Gaussian distribution, for large N, it holds that Xi X i /N I. Then, we introduce the following definition:

Qt+1 := η X

Btwt i ϕi (wt i) η X

Xi X i N Btwt i ϕi (wt i) + η X

Xi Ei(wt i)

With this definition, the updating rule of Bt can be rewritten as

Bt+1 = Bt η Bt Wt Φ (Wt) η

2Bt (Bt) Bt Wt(Wt) + Qt+1.

Now we consider the updating rule of W. Observe that each of its columns satisfies

wt+1 i = wt i η

j [N] (Bt) xi,j (xi,j) Btwt i yi,j

N (Bt) Xi X i Btwt i ϕi .

We define Qt+1 := [ qt+1 1 , , qt+1 M ], where each of its columns is given by

qt+1 i := η(Bt) Btwt i ϕi η

N (Bt) Xi X i Btwt i ϕi + η

N (Bt) Xi Ei. (12)

Then, Wt is updated according to

Wt+1 = Wt η(Bt) (Bt Wt Φ) + η

2 (Bt) Bt Wt(Wt) Wt + Qt+1.

Recall the SVD of Φ is denoted as Φ = UΛV . Further denote Bt = U Bt and Wt = Wt V. Then, we have

Bt+1 = Bt η Bt Wt Λ Wt η

2 Bt ( Bt) Bt Wt 1( Wt) + U Qt+1,

Federated Representation Learning in the Under-Parameterized Regime

Wt+1 = Wt η( Bt) Bt Wt Λ η

2 ( Bt) Bt Wt( Wt) Wt + Qt+1V.

Similar to the definition of Φ , we construct Bt Rd k and Wt Rk d by padding all-zero columns or rows to Bt and W, respectively. Similarly, we obtain Qt Rd k and Qt Rk d by padding all-zero columns or rows to Qt and Qt, respectively. Then, we define Θt and Rt as

Θt = (Bt ) + Wt

Rt = (Qt ) U + Qt V

2 (Qt ) U Qt V

Then, the updating rule of Θt can be described as

Θt+1 = Θt + η

2Θt(Θt) Θt + Rt+1. (13)

Let Θt = [(Θt k) (Θt res) ] and Rt = [(Rt k) (Rt 2d k) ] where Θt k Rk k, Θt res R(2d k) k, Rt k Rk k and

Rt 2d k R(2d k) k. Then, we decompose the updating rule of Θt as

Θt k = Θt 1 k + η

2 ΛkΘt 1 k η

2Θt 1 k (Θt 1) Θt 1 + Rt k, (14)

Θt res = Θt 1 res + η

2 ΛresΘt 1 res η

2Θt 1 res (Θt 1) Θt 1 + Rt 2d k. (15)

A.2. Proof of Theorem 5.5

First, we restate Theorem 5.5 as follows.

Theorem A.1 (Restatement of Theorem 5.5). Set λ and η as in Theorem 5.1. Then for constant TR and any T > TR, there exist positive constants c1 and c2 such that when the number of samples per client satisfies N

c1 (d log δ+log T )(k

κ2 T 2 , for all TR < t T we have

i [M] Btwt i B w i c2κ

with probability at least 1 δ, where κT = (1 η

Overview of the proof. The proof of Theorem 5.5 consists of three main steps.

Step 1: We show that with a small random initialization, Θt will enter a region containing the optima with high probability (see Appendix A.2.1).

Step 2: We show that once Θt enters this region, it will stay in it with high probability (see Appendix A.2.2).

Step 3: We show that when N is sufficiently large, with high probability it holds that Θt(Θt) diag( Λk, 0) converges to 0 at a linear rate when the initialization satisfies Θ0 R (see Appendix A.2.3).

We then put pieces together and prove Theorem 5.5 in Appendix A.2.4. We introduce some auxiliary lemmas in Appendix A.2.5.

A.2.1. STEP 1: ENTERING A REGION WITH SMALL RANDOM INITIALIZATION

We first introduce the following definitions, adapted from the proof in Chen et al. (2023). Recall that = λ k λ k+1. We define

R = Θt = Θt k Θt res

R2d k σ2 1(Θt) 2λ 1, σ2 1(Θt res) λ k /2, σ2 k(Θt k) /4 ,

Federated Representation Learning in the Under-Parameterized Regime

Rs = Θt = Θt k Θt res

R2d k σ2 1(Θt) 2λ 1, σ2 1(Θt res) λ k /2 .

Then, we establish the following proposition. Proposition A.2. Assume η 1 6λ 1 and all entries of B0 and W0 are independently sampled from N(0, α2) with a sufficiently small α. Then, if

d log δ c1 , 3456 p

k(λ 1)2 + E +

kσξ) min{σ1(Θ0res), σ1(Θ0res)} c1

with probability at least 1 ctδ for some constant c > 0, Θt will enter region R for some t [TR], where

TR = log( /(4σ2 k(Θ0 k))) 2 log(1 + η

2(λ k /2)). (16)

The proof of Proposition A.2 relies on Lemma A.4 and Lemma A.5, which will be introduced shortly. Before that, we state the following claim introduced in Chen et al. (2023):

Claim A.3. σ2 1(Θ0) λ 1, σ2 1(Θ0 res) λ k /2, σ2 k(Θ0 k) /4 and σ2 1(Θ0 res) c1 σk(Θ0 k)1+κ, where c1 = 1 κ/2

23 κ and κ = log(1+ η

8 ) log(1+ η

2 (λ k /2)) < 1.

The following lemma shows that with a small random initialization, Claim A.3 holds with high probability. Lemma A.4. Assume all entries of B0 and W0 are independently sampled from N(0, α2). Then, for any δ [0, 1], if α is sufficiently small, Claim A.3 holds with probability at least 1 δ.

Proof of Lemma A.4. Using Lemma A.19, we have σ1(Θ0) p

λ 1 and σ1(Θ0 res) p

λ k /2 hold with probability at least 1 2 exp( ( 1

d)2/2), and σk(Θ0 k)

/2 holds with probability at least 1 2 exp( (

d)2/2). Then for α small enough such that

2 log(2/δ ) ,

2 log(2/δ )

λ 1, σ1(Θ0 res) p

λ k /2 and σk(Θ0 k)

/2 hold with probability at least 1 2δ for any δ [0, 1].

From Rudelson & Vershynin (2008), there exists a constant K that only depends on δ such that with probability at least 1 δ , we have σk(Θ0 k) α2K

Thus, when α is sufficiently small such that α4 2(1+κ) c1(K

8d+4 log(2/δ ), with probability at least 1 δ , we have

c1σk(Θ0 k)1+κ c1(α2K

2 log(2/δ ))2.

Note that from Lemma A.19, with probability at least 1 δ we have

σ2 1(Θ0 res) α4(2

2 log(2/δ ))2.

Then we conclude that with probability at least 1 4δ , we have σ1(Θ0) p

λ 1, σ1(Θ0 res) p

λ 1 /2, σk(Θ0 k)

/2 and σ2 1(Θ0 res) c1 σk(Θ0 k)1+κ. Finally the lemma follows by setting δ = 4δ .

Next, we introduce the following lemma, which shows that when Claim A.3 holds, Θt will enter the region R in a short time period. Lemma A.5. Assume η 1 6λ 1 and Claim A.3 holds. Then, if

d log δ c1 , 3456 p

k(λ 1)2 + E +

kσξ) min{σ1(Θ0res), σ1(Θ0res)} c1

with probability at least 1 ctδ, we have σk(Θt k)

/2 for some t [0, log( /4σ2 k(Θ0 k)) 2 log(1+ η

2 (λ k /2))].

Federated Representation Learning in the Under-Parameterized Regime

Proof of Lemma A.5. With Claim A.3 holds, we have σ2 1(Θ0) 2λ 1, σ2 1(Θ0 res) λ k /2. Then, based on Lemma A.17, for N satisfying inequality (17), we have

σ1(Θt res) 1 + η

8 t σ1(Θ0 res),

holds with probability at least 1 ctδ. Combining with Claim A.3, we obtain

8 TR σ2 1(Θ0 res) = 4σ2 k(Θ0 k)

κ/2 σ2 1(Θ0 res) 8 p

λ 1 σk(Θ0 k).

Then, for all t TR, we have 1 + η

8 2t σ2 1(Θ0 res) 1 + η

8 t+TR σ2 1(Θ0 res) 1 + η

λ 1 σk(Θ0 k).

Let T = min{t > 0|σ2 k(Θt k) /4}. We then aim to prove that

σk(Θt k) 1 + η

4 t σk(Θ0 k), t min{TR, T }. (18)

We prove it by induction.

Assume Equation (18) holds for some τ t, where t min{TR, T }. Then we have

σ2 1(Θτ res) 1 + η

8 2τ σ2 1(Θ0 res) 8 p

4 τ σk(Θ0 k) 8 p

λ 1 σk(Θτ k).

We consider the next time step τ + 1. Note that σk(Θτ+1 k ) can be lower bounded as

σk(Θτ+1 k ) σk(Θτ k + η

2Θτ k(Θτ) Θτ) σ1(Rτ+1 k )

σk(Θτ k + η

2Θτ k(Θτ k) Θτ k) η

2σ1(Θτ k(Θτ res) Θτ res) σ1(Rτ+1).

Applying Lemma D.4 in Jiang et al. (2022) gives

σk(Θτ k + η

2Θτ k(Θτ k) Θτ k) η

2σ1(Θτ k(Θτ res) Θτ res)

2 σ1( Λk(Θτ k) Θτ k) (1 + η

2λ k)σk(Θτ k) 1 η

2σ2 k(Θτ k) η

2σ1(Θτ k(Θτ res) Θτ res)

2 λ 1 2 1 + ηλ k 2

8 σk(Θτ k).

Then, for η 2

18λ 1 3 , we have

σk(Θτ k + η

2Θτ k(Θτ k) Θτ k) η

2σ1(Θτ k(Θτ res) Θτ res) 1 + ηλ k 2 η 71

σk(Θτ k). (19)

According to Lemma A.13 and Lemma A.14, if

d log δ c1 , 3456 p

k(λ 1)2 + E +

d log δ σk(Θ0 k) c1

then, with probability at least 1 2δ, we have σ1(Rτ+1) 1 288η σk(Θ0 k).

Combining with Equation (19) gives

σk(Θτ+1 k ) 1 + ηλ k 2 η 71

σk(Θτ k) 1 288η σk(Θ0 k)

Federated Representation Learning in the Under-Parameterized Regime

1 + ηλ k 2 η 71

4 τ σk(Θ0 k) 1 288η σk(Θ0 k)

1 + ηλ k 2 η 71

4 τ σk(Θ0 k) 1 288

4 τ η σk(Θ0 k)

4 τ+1 σk(Θ0 k).

Then, we conclude that with probability at least 1 ctδ for some constant c, we have

σk(Θt k) 1 + ηλ k/2 η /4 t σk(Θ0 k).

Here we claim TR T always holds, since if TR > T , we must have

σk(ΘTR k ) 1 + ηλ k/2 η /4 TR σk(Θ0 k)

which contradicts the definition of T . The proof is thus complete.

A.2.2. STEP 2: TRAPPED IN THE ABSORBING REGION

We start by introducing Lemma A.6, Lemma A.7 and Lemma A.8.

Lemma A.6. Assume η 2 5λ 1 and Θ0 Rs. Then, if

dk k log δ c1 for constant c1, with probability at least

1 2tδ, we have σ1(Θτ) p

2λ 1 hold for all τ t.

Proof of Lemma A.6. Assume that Θτ 1 Rs. Then, utilizing Equation (13), we have

σ1(Θτ) σ1(Θτ 1) 1 + η

2σ2 1(Θτ 1) + σ1(Rτ).

Note that σ1(Θτ 1) 1 + η

2σ2 1(Θτ 1) reaches its maximum at σ1(Θτ 1) = q

2+ηλ 1 3η . For η 2 5λ 1 , we have q

2+ηλ 1 3η p

2λ 1. Thus, σ1(Θτ 1) 1 + η

2σ2 1(Θτ 1) is monotonically increasing for σ1(Θτ 1) [0, p

2λ 1]. Then,

2λ 1 + σ1(Rτ).

We prove σ1(Θτ) p

2λ 1 by induction. First, since Θ0 Rs, we have σ1(Θ0) p

2λ 1. Then, assume σ1(Θτ) p

2λ 1 holds for time step 0 τ < t. According to Lemma A.13 and Lemma A.14, if σ1(Θτ) p

dk k log δ c1 , with probability at least 1 2δ, we have σ1(Rτ+1)

2 2 η(λ 1) 3 2 . Thus,

2λ 1 + σ1(Rτ+1) p

Then, by induction, with probability at least 1 2tδ, we have σ1(Θτ) p

2λ 1 for all τ t. Then the proof is complete.

Lemma A.7. Assume η 1 6λ 1 and Θ0 Rs. Then, if

d log δ c1 , 48 p

k(λ 1)2 + E +

for constant c1, with probability at least 1 2tδ, we have σ1(Θτ res) p

λ k /2 holds for all τ t.

Federated Representation Learning in the Under-Parameterized Regime

Proof of Lemma A.7. We prove it by induction. Note that σ1(Θ0 res) p

λ k /2 since Θ0 Rs. Assume σ1(Θτ 1 res ) p

λ k /2 holds for some τ. We aim to show that the inequality holds for σ1(Θτ res) as well.

Based on Lemma A.15, for η 1/6λ 1 and σ1(Θτ) p

2λ 1, we have

σ1(Θτ res) 1 + η

2 λ k+1 σ2 1(Θτ 1 res ) σ2 k(Θτ 1 k ) σ1(Θτ 1 res ) + σ1(Rτ 2d k)

2 λ k+1 σ2 1(Θτ 1 res ) σ1(Θτ 1 res ) + σ1(Rτ 2d k).

Note that when σ1(Θτ 1 res ) 0, 1 + η

2 λ k+1 σ2 1(Θτ 1 res ) σ1(Θτ 1 res ) is maximized at σ1(Θτ 1 res ) = q

2+ηλ k+1 3η . Since

we assume η 1 6λ 1 2 3λ k+λ k+1 , it holds that q

2+ηλ k+1 3η p

λ k /2. Then, 1 + η

2 λ k+1 σ2 1(Θτ 1 res ) σ1(Θτ 1 res )

is monotonically increasing for 0 σ1(Θτ 1 res ) p

λ k /2. We thus have

σ1(Θτ res) q

4 + σ1(Rτ). (20)

According to Lemma A.13 and Lemma A.14, if

d log δ c1 , 48

c1(λ k /2) , with probabil-

ity at least 1 2δ, we have

λ k /2 4 . (21)

Then by combining (20) and (21), we have σ1(Θτ res) p

λ k /2. The proof is thus complete.

Lemma A.8. Assume η 2

32λ 1 3 , σk(Θ0 k)

/2 and Θτ Rs for all 0 < τ t. Then, if

d log δ c1 , 6144 p

k(λ 1)2 + E +

for some contact c1, with probability at least 1 2tδ, we have σk(Θτ k)

/2 hold for all 0 < τ t.

Proof of Lemma A.8. We prove it by induction. First note that σk(Θ0 k)

/2 under the assumption of Lemma A.8. Assume that σk(Θτ k)

/2 holds for some t. We then show that σk(Θτ+1 k )

/2 holds as well.

Since for all τ t it holds that Θτ Rs and N satisfies the condition described in Lemma A.8, based on Lemmas A.13 and A.14, with probability at least 1 2tδ, it holds that σ1(Rτ k) η 2

λ 1 for all 0 τ t. From the intermediate result

of Lemma A.16, for η 2

16λ 1 3 , we have

σ2 k(Θτ+1 k ) 1 + η λ k σ2 1(Θτ res) σ2 k(Θτ k) σ2 k(Θτ k) η2λ 1 3 4 p

λ 1σ1(Rτ k).

Combining with the fact that

σ2 k(Θτ+1 k ) 1 + η λ k σ2 1(Θτ res) σ2 k(Θτ k) σ2 k(Θτ k) η2λ 1 3 η 2

σ2 k(Θτ+1 k ) 1 + η /2 σ2 k(Θτ k) σ2 k(Θτ k) η2λ 1 3 η 2

(1 + η( /2 /4)) /4 η2λ 1 3 η 2

The proof is thus complete.

Federated Representation Learning in the Under-Parameterized Regime

Combining Lemmas A.6 and A.7, we conclude that for N sufficiently large, Rs is an absorbing region with high probability, i.e., starting from Θ0 Rs, the subsequent Θt will stay in Rs for all t > 0 with high probability, which is summarized in the following proposition.

Proposition A.9. Assume Θ0 R. If

3 2 for some constant c, then, with probability at least 1 tδ, we have Θτ R for all 0 τ t.

A.2.3. STEP 3: LOCAL CONVERGENCE OF Θt(Θt)

We next show that when N is sufficiently large, with high probability, Θt(Θt) diag( Λk, 0) converges to 0 exponentially fast when Θ0 R.

Firstly, we establish the following lemma that lower bounds the number of samples needed for the inverse SNR to converge exponentially fast with high probability.

Lemma A.10. Denote σt+1 ref = p

16 t+1 . Assume η 2/(36λ 3 1 ), Θτ R for all 0 τ t, and

k(λ 1)2 + E +

kσξ) σt+1 ref ,

d log δλ 1( p

k(λ 1)2 + E +

for some constant c. Then, with probability at least 1 (t + 1)δ, we have

σ1(Θt+1 res ) 2

Proof of Lemma A.10. Based on Lemma A.20, we have σ1(Rτ+1 2d k) σ1(Rτ+1) and σ1(Rτ+1 k ) σ1(Rτ+1). Then, it follows that

σ1(Rτ+1 2d k) q

σ2 1((Qτ+1 ) U ) + σ2 1( Qτ+1 V ) and σ1(Rτ+1 k ) q

σ2 1((Qτ+1 ) U ) + σ2 1( Qτ+1 V ).

Substitute the σ in Lemmas A.13 and A.14 by σt+1 ref . Then, if

d log δ c1 , 192 p

k(λ 1)2 + E +

σt+1 ref c1 , 6144 p

d log δλ 1( p

k(λ 1)2 + E +

with probability at least 1 2δ, we have

σ1(Rτ 2d k) σt+1 ref η

16 and σ1(Rτ k) η 2

We prove the lemma by considering two cases. In the first case, we assume σ1(Θτ res) σt+1 ref for all 0 τ t. In the second case, we assume there exists at least one time step in [0, t] such that σ1(Θτ res) < σt+1 ref , and we denote the last time step satisfying this condition as t .

We start from the first case. Combining Equation (22) with Lemma A.15 gives

σ1(Θτ+1 res ) 1 + η

2 λ k+1 σ2 1(Θτ res) σ2 k(Θτ k) σ1(Θτ res) + σ1(Θτ res)η 16 (23)

2 λ k+1 + /8 σ2 1(Θτ res) σ2 k(Θτ k) σ1(Θτ res),

where in Equation (23) we use the assumption that σt+1 ref σ1(Θτ res).

Then, using the fact that σ2 1(Θτ) 2λ 1 and η 16λ 1 2 we obtain

σ2 1(Θτ+1 res ) 1 + η λ k+1 + /8 σ2 1(Θτ res) σ2 k(Θτ k) + 4η2λ 1 2 σ2 1(Θτ res) (24)

Federated Representation Learning in the Under-Parameterized Regime

1 + η λ k+1 + /8 σ2 1(Θτ res) σ2 k(Θτ k) + η

σ2 1(Θτ res)

1 η /8 + η λ k /2 σ2 1(Θτ res) σ2 k(Θτ k) σ2 1(Θτ res), (25)

where Equation (24) holds since λ k+1 + /8 σ2 1(Θτ res) σ2 k(Θτ k) 2 16λ 1 2.

Next, combining Lemma A.16 and Equation (22) leads to

σ2 k(Θτ+1 k ) σ2 k( Θτ+1 k ) 4 p

λ 1σ1(Rτ+1 k )

σ2 k( Θτ+1 k ) η 2

σ2 k( Θτ+1 k ) η

8 σ2 k(Θτ k) (26)

1 + η λ k σ2 1(Θτ res) σ2 k(Θτ k) η

σ2 k(Θτ k) η

8 σ2 k(Θτ k)

= 1 + η /8 + η λ k /2 σ2 1(Θτ res) σ2 k(Θτ k) σ2 k(Θτ k), (27)

where in Equation (26) we use the fact σk(Θτ k)

Then, combining Equation (25) with Equation (27) we have

σ2 1(Θτ+1 res ) σ2 k(Θτ+1 k )

1 η /8 + η λ k /2 σ2 1(Θτ res) σ2 k(Θτ k) σ2 1(Θτ res) 1 + η /8 + η λ k /2 σ2 1(Θτres) σ2 k(Θτ k) σ2 k(Θτ k)

3/2 + η /8 σ2 1(Θτ res) σ2 k(Θτ k) (28)

σ2 1(Θτ res) σ2 k(Θτ k)

2 σ2 1(Θτ res) σ2 k(Θτ k) , (29)

where Equation (28) holds when 1/2 η λ k /2 σ2 1(Θτ res) σ2 k(Θτ k) 1/2, which is valid when η 2/(36λ 1 3), and Equation (29) holds since (1 η /6) (1 η /16) is valid for positive η.

Then, with probability at least 1 2(t + 1)δ, we have

σ2 1(Θt+1 res ) 1 η

2(t+1) σ2 1(Θ0 res) σ2 k(Θ0 k) σ2 k(Θt+1 k ) 8λ 1 2

For the second case, note at time step t we have σ1(Θt res) < σt+1 ref , and for all t < τ t we have σ1(Θτ res) σt+1 ref . Similar to the previous analysis, we show that with probability at least 1 2(t + 1 t )δ, we have

σ2 1(Θt+1 res ) 1 η

t+1 t σt+1 ref 2

σ2 k(Θt k )σ2 k(Θt+1 k )

8λ 1 σt+1 ref 2 1 η

The proof is complete by combining the two cases.

Federated Representation Learning in the Under-Parameterized Regime

The following lemma characterizes the number of samples needed for Θk to converge to Λk, which is based on the convergence of the inverse SNR.

Lemma A.11. Assume η 2/(36λ 3 1 ), Θt R for all 0 τ t and N satisfies

d log δλ 1(

for some constant c. Then, with probability at least 1 (t + 1)δ,

σ1(Dt+1) 200λ 1 2

where σt+1 D = min n 3λ 1, (1 η

16 )t+1 λ 1 2

Proof of Lemma A.11. We denote Dτ = Θτ k(Θτ k) Λk. For Θτ k defined in Lemma A.16, we have

Dτ = Θτ k( Θτ k) Λk + Θτ k(Rτ k) + Rτ k( Θτ k)T + Rτ k(Rτ k) . (30)

Let σt+1 D = min n 3λ 1, (1 η

16 )t+1 λ 1 2

η 2 o . Then, if

d log δ c1 , 1152 p

d log δλ 1( p

k(λ 1)2 + E +

kσξ) σt+1 D c1

we have Rτ k σt+1 D η

λ 1 . It follows that

Θτ k(Rτ k) σt+1 D η

Rτ k(Rτ k) σt+1 D η 96 p

where Equation (32) holds since σt+1 D η

λ 1 1. Then, with probability at least 1 2δ, we have

σ1 Θτ k(Rτ k) + Rτ k( Θτ k) + Rτ k(Rτ k) η

16 σt+1 D .

We prove the lemma by considering two cases: In the first case, we assume σ1(Dτ) σt+1 D for all 0 τ t; In the second case, we assume there is at least one time step in [0, t] such that σ1(Dτ) < σt+1 D , and we denote the latest time step satisfies this condition as t .

We start from the first case. From Section A.3 in Chen et al. (2023), we have

σ1(Dτ) (1 η

8 )σ1(Dτ 1) + σ2 1(Θτ 1 res ) + η

8 )σ1(Dτ 1) + (1 η

16 )2(τ 1) 8λ 1 2

16 σ1(Dτ 1)

16 )σ1(Dτ 1) + (1 η

16 )2(τ 1) 8λ 1 2

Then, for N satisfying Equation (31), with probability at least 1 2(t + 1)δ, we have

σ1(Dt+1) (1 η

16 )t+1 σ1(D0) +

1 η /16 i 8λ 1 2

Federated Representation Learning in the Under-Parameterized Regime

σ1(D0) + 130λ 1 2

where Equation (33) follows from the fact that λ 1/η 2 1 and σ1(D0) 3λ 1 3λ 1 2/η 2. Therefore, we conclude that σ1(Dt+1) (1 η /16)t+1 200λ 1 2

For the second case, note at time step t we have σ1(Dt ) < σt+1 D , and for all t < τ t we have σ1(Dτ) σt+1 D . Similar to the previous analysis, we show that with probability at least 1 2(t + 1 t ) exp( c2(d + k)), it has

σ1(Dt+1) (1 η

16 )t+1 t σ1(Dt ) +

1 η /16 i 8λ 1 2

(1 η /16) ,

16 )t+1 + 130λ 1 2

The proof is complete by combining the two cases.

Then, we aim to show the local convergence property of Θt stated in the following proposition.

Proposition A.12. Assume Θ0 R0, η 2/(36λ 3 1 ) and N satisfies

d log δ c1 , 1152 p

k(λ 1)2 + E +

kσξ) κt c1 , 6144 p

d log δλ 1( p

k(λ 1)2 + E +

(34) for constant c1 and κt = 1 η

16 t+1. Define κ = λ 1 2

η 2 . Then, with probability at least 1 ctδ, we have

Θt(Θt) diag( Λk, 0) F 400κ

Proof. First, by combining Lemmas A.6 to A.8, we conclude that if N satisfies (34), Θτ Rτ holds for all τ t with probability at least 1 ctδ.

Then, based on Lemma A.10, if N satisfies Equation (34), with probability at least 1 2tδ, we have

σ1(Θt res) 2

Define Dt = Θt k(Θt k) Λk. Based on Lemma A.11, we have

k(λ 1)2 + E +

kσξ) κt c1 1152 p

d log δλ 1( p

k(λ 1)2 + E +

kσξ) σt+1 D c1 .

Under the same conditions in Equation (34), it holds that

σ1(Dt) 200λ 1 2

By combining Equation (35) and Equation (36), we have

Θt(Θt) diag( Λk, 0) F

kλ 1 Θt res (37)

Federated Representation Learning in the Under-Parameterized Regime

Note that the randomness in {Θt}t comes from {Rt}t. If N satisfies Equation (34), we have Equation (35) Equation (36)

and the event Θτ Rτ, 0 < τ t holds with probability at least 1 ctδ. Noting that max n

η 2 , the proof is complete.

A.2.4. PUTTING ALL TOGETHER

Combining Propositions A.2, A.9 and A.12, it is straightforward to show that if t > TR, N satisfies N

c2 (d log δ+log T )(

κ2 T 2 for some constant c2, and Θ0 satisfies the small random initialization condition

stated in Proposition A.2, then it holds that Θt(Θt) diag( Λk, 0) F c4κ

16 t for all t satisfies TR < t < T

with probability at least 1 δ, where κT = (1 η

16 )T and constant c4 = 400 1 η

Applying Lemma A.18, with probability at least 1 δ, we have

Bt Wt diag(Λk, 0) F c4κ

Note that Bt Wt B W F = Bt Wt diag(Λk, 0) F . Combing with the fact that P

i [M] Btwt i B w i 2 = Bt Wt B W 2 F , we have X

i [M] Btwt i B w i 2 c2 4κ2k 1 η

Then, applying the Cauchy-Schwarz inequality gives X

i [M] Btwt i B w i 2 c2 4Mκ2k 1 η

which immediately implies that

i [M] Btwt i B w i c4κ

A.2.5. AUXILIARY LEMMAS

Lemma A.13 (Concentration of U Qτ+1 ). For any T 0, assume Θτ Rs holds for all 0 < τ t. Then, we have the following results for any 0 τ t and c2 0 with probability at least 1 2δ:

d log δ c1 , 192

kσξ) σ c1 , then it holds that

d log δ c1 , 6144

d log δλ 1(

kσξ) 2 c1 , then it holds that

Proof of Lemma A.13. Recall that Qτ+1 is defined as

Bτ wτ i ϕ i (wτ i ) η X

X i X i N Bτ wτ i ϕ i (wτ i ) + η X

X i Ei(wτ i )

Federated Representation Learning in the Under-Parameterized Regime

where ϕ i and X i are padded versions of ϕi and Xi, respectively. To upper bound the norm of Qτ+1 , we decompose it into two parts:

Bτ wτ i ϕ i (wτ i ) X

X i X i N Bτ wτ i ϕ i (wτ i )

| {z } Aτ+1 1

X i Ei(wτ i )

| {z } Aτ+1 2

For Aτ+1 1 ,by applying Lemma 5.4 in Vershynin (2010), there exists a 1

4-net Nk on the unit sphere Sk 1 and a 1

4-net Nd on the unit sphere Sd 1 such that

Aτ+1 1 2 max u Nd,v Nk

j [N] u Bτwτ i ϕi (wτ i ) v X

j [N] u xi,jx i,j Bτwτ i ϕi (wτ i ) v

Denote cτ i = Bτwτ i ϕi and cτ w = maxi{ wτ i }. Observe that u xi,jx i,j Bτwτ i ϕi (wτ i ) v u (Bτwτ i ϕi)(wτ i ) v is a sub-exponential random variable with sub-exponential norm c cτ i cτ w for some constant c , where c depends on the distribution of x. Then, based on the tail bound for sub-exponential random variables, there exists a constant c 2 > 0 such that for any s 0,

j [N] u Bτwτ i ϕi (wτ i ) v X

j [N] u xi,jx i,j Bτwτ i ϕi (wτ i ) v

i [M](cτ i cτw)2 , s maxi{cτ i cτw}

Taking the union bound over all u Nd and v Nk, with probability at least 1 9d+kexp

Nc 2min n s2 P

i [M](cτ i cτw)2 , s maxi{cτ cτ w} o , we have

Bτwτ i ϕi (wτ i ) X

Xi X i N Bτwτ i ϕi (wτ i )

Since σ2 1(Θτ) 2λ 1, we have

( Bτ) Bτ + Wτ( Wτ) = σ2 1(Θτ) 2λ 1.

Note that ( Bτ) Bτ and Wτ( Wτ) are PSD matrices. It follows that

2λ 1 and Wτ p

which implies that Bτ p

2λ 1 and Wτ p

2λ 1. Since cτ i = Bτwτ i ϕi and cτ w = maxi{ wτ i }, we have P

i [M](cτ i cτ w)2 2λ 1 BτWτ Φ 2 F 4λ 1(k(λ 1)2 + E) and maxi{cτ i cτ w} 3

2(λ 1) 3 2 , where

i(λi)2. Let s = p

18λ 1(k(λ 1)2 + E) p

log(1/δ)/d + 6

Nc 2. Then, if N is sufficiently large such that ( p

log(1/δ)/d + 2)

Nc 2 1, we have

2(λ 1) 3 2 s p

18λ 1(k(λ 1)2 + E) s2

18λ 1(k(λ 1)2 + E).

Then, with probability at least 1 δ, we have

Bτwτ i ϕi (wτ i ) X

Xi X i N Bτwτ i ϕi (wτ i )

18λ 1(k(λ 1)2 + E) p

log(1/δ)/d + 6

Federated Representation Learning in the Under-Parameterized Regime

Therefore, with probability at least 1 δ, we have

λ 1 k(λ 1)2 + E 6

2 d log δ Nc2 ,

where c2 = c 2 6 .

Next, we consider Aτ+1 2 . Similar to the above analysis, note that u xi,jξi,j(wτ i ) v is a centered sub-exponential random variable with sub-exponential norm c σξ wτ i for some constant c . Based on the tail bound for sub-exponential random variables, there exists a constant c 3 > 0 such that for any s 0,

i [M] u X i E i(wτ i )

σ2 ξ Wτ 2 F , s σξ p

Combining with the fact Wτ 2 F 2kλ 1 and taking the union bound over all u Nd and v Nk, we have that inequality P

i [M] X i Eτ i(wτ i )

2s holds with probability at least 1 9d+kexp Ncmin n s2

2σ2 ξkλ 1 , s σξ

o . Then, let

log(1/δ)/d + 6

Nc. If N is sufficiently large such that ( p

log(1/δ)/d + 2)

Nc 3 1 we have min{ s2

2σ2 ξkλ 1 , s σξ

2λ 1 } = s2

2σ2 ξkλ 1 . Therefore, with a probability at least 1 δ, we have

2σ2 ξkλ 1 p

log(1/δ)/d + 6

d Nc 3 (λ 1) 1 2 σξ 6

2 dk k log δ Nc3 ,

where c3 = c 3 24. Combining the upper bounds of Aτ+1 1 and Aτ+1 2 , we conclude that following inequality holds with probability at least 1 δ:

U Qτ+1 Qτ+1 η(Aτ+1 1 + Aτ+1 2 )

η(λ 1) 1 2 q

k(λ 1)2 + E 6

d log(δ/2) Nc2 + η(λ 1) 1 2 σξ 6

dk k log(δ/2) Nc3

k(λ 1)2 + E +

2 d log δ Nc1 ,

where c1 = 1

2 min{c2, c3}. Thus, for any σ 0, if

N 192 d log δ

kσξ) σ c1 , with probability at least 1 δ it holds

Similarly, if

N 6144 d log δλ 1(

kσξ) 2 c1 , with probability at least 1 δ,

Lemma A.14 (Concentration of Qτ+1 V ). For any t 0, assume Θτ Rs holds for all 0 < τ t. Then, we have the following results for any 0 τ t and c2 0 with probability at least 1 2δ:

d log δ c1 , 192

kσξ) σ c1 , then it holds that

Federated Representation Learning in the Under-Parameterized Regime

d log δ c1 , 6144

d log δλ 1(

kσξ) 2 c1 , then it holds that

Proof of Lemma A.14. This proof resembles the proof of Lemma A.13. According to Lemma 5.4 in Vershynin (2010), there exists a 1

4-net Nk on the unit sphere Sk 1 and a 1

4-net NM on the unit sphere SM 1 so that

1 η Qτ+1 2 max u NM,v Nk

i [M] v qτ+1 i ui + v 1

i [M] (Bτ) Xi Eiui

2 max u NM,v Nk

j [N] v (Bτ) Bτwτ i ϕi ui X

j [N] v (Bτ) xi,jx i,j Bτwτ i ϕi ui

| {z } Aτ+1 3

+ 2 max u NM,v Nk

i [M] (Bτ) Xi Eiui

| {z } Aτ+1 4

Let cτ B = Bτ and recall that cτ i = Bτwτ i ϕi . Based on the tail bound for sub-exponential random variables, there exists a constant c > 0 such that for any s 0,

j [N] v (Bτ) Bτwτ i ϕi ui X

j [N] v (Bτ) xi,jx i,j Bτwτ i ϕi ui

i [M](cτ i cτ B)2 , s maxi{cτ i cτ B}

Taking the union bound over all u NM and v Nk, with probability at least 1 9M+kexp

Ncmin n s2 P

i [M](cτ i cτ B)2 , s maxi{cτ cτ B} o , we have Aτ+1 3 2s. Let s = (λ 1) 1 2 p

k(λ 1)2 + E q

log(1/δ)/d + 6

If N is sufficiently large such that ( p

log(1/δ)/d + 2)

Nc 1, there exits constant c2 such that with probability at least 1 δ, we have

Aτ+1 3 η(λ 1) 1 2 q

k(λ 1)2 + E 6

d log δ Nc2 .

For term Aτ+1 4 , from the tail bound for sub-exponential random variables, there exists a constant c > 0 such that for any s 0,

i [M] v (Bτ) Xi Ei

i [M] ui Bτ 2 , s σξ p

2σ2 ξλ 1 , s σξ p

2σ2 ξkλ 1 q

log(1/δ)/d + 6

Nc. If N is sufficiently large such that ( q

log(1/δ)/d + 2)

Nc 1, there exists constant c3 such that with a probability at least 1 δ, we have

Aτ+1 4 (λ 1) 1 2 σξ 6

dk k log δ Nc3 .

Federated Representation Learning in the Under-Parameterized Regime

Combining the upper bounds of Aτ+1 3 and Aτ+1 4 , we conclude that following inequality holds with probability at least 1 δ:

Qτ+1 V Qτ+1 η(Aτ+1 3 + Aτ+1 4 )

η(λ 1) 1 2 q

k(λ 1)2 + E 6

d log(δ/2) Nc2 + η(λ 1) 1 2 σξ 6

dk k log(δ/2)

k(λ 1)2 + E +

d log δ Nc1 ,

where c1 = 1

2 min{c2, c3}. Thus, for any σ 0, if

kσξ) σ c1 , with probability at least 1 δ it holds that

Similarly, if

d log δλ 1(

kσξ) 2 c1 , with probability at least 1 δ,

Lemma A.15. Suppose η 1/6λ 1 and σ1(Θτ) p

2λ 1. Then, it holds that

σ1(Θτ+1 res ) 1 + η

2 λ k+1 σ2 1(Θτ res) σ2 k(Θτ k) σ1(Θτ res) + σ1(Rτ 2d k).

Proof of Lemma A.15. According to Equation (15), we can rewrite Θτ+1 res as

Θτ+1 res = 1

2Θτ res(Θτ res) Θτ 1 res + 1

2 Λres Θτ res + Θτ res 1

2(Θτ k) Θτ k + Rτ+1 2d k. (41)

From Lemma A.5 in Chen et al. (2023) we have the following inequalities:

2Θτ res(Θτ res) Θτ res 1

2σ1(Θτ res) η

2σ3 1(Θτ res), (42)

2Λres Θτ res 1

4 + ηλ k+1 2 σ1(Θτ res), (43)

σ1 Θτ res 1

2(Θτ k) Θt k σ1(Θτ res) 1

2σ2 k(Θτ k) . (44)

Substituting Equations (42) to (44) into Equation (41) proves the lemma.

Lemma A.16. Suppose σ1(Θt) p

2λ 1 and η 2

16λ 1 3 . Then, it holds that

σ2 k(Θτ+1 k ) 1 + η λ k σ2 1(Θτ res) σ2 k(Θτ k) η

σ2 k(Θτ k) 4 p

λ 1σ1(Rτ k).

Proof of Lemma A.16. Denote Θτ+1 k = Θτ k + η

2Θτ k(Θτ) Θτ. Based on Lemma A.6 and Lemma 2.3 in Chen et al. (2023), we have

σ2 k( Θτ+1 k ) 1 + η λ k σ2 1(Θτ res) σ2 k(Θτ k) σ2 k(Θτ k) η2λ 1 3

1 + η λ k σ2 1(Θτ res) σ2 k(Θτ k) η

σ2 k(Θτ k). (45)

Federated Representation Learning in the Under-Parameterized Regime

Combining with η 1 16λ 1 gives

σ1( Θτ+1 k ) σ1(Θτ k) + σ1 η

2 ΛkΘτ k + σ1 η

2Θτ k(Θτ) Θτ

Thus, σk( Θτ+1 k ) σ1( Θτ+1 k ) 2 p

λ 1. Combining with the fact that σk(Θτ+1 k ) σk( Θτ+1 k ) σ1(Rτ+1 k ), we have

σ2 k(Θτ+1 k ) σk( Θτ+1 k ) σ1(Rτ+1 k ) 2

σ2 k( Θτ+1 k ) 2σk( Θτ+1 k ) σ1(Rτ+1 k )

σ2 k( Θτ k) 4 p

λ 1σ1(Rτ k). (46)

The lemma thus follows by substituting Equation (45) into Equation (46).

Lemma A.17. Assume η 1 6λ 1 and σ1(Θ0) p

2λ 1 hold. Then, if

d log δ c1 , 96 p

k(λ 1)2 + E +

d log δ σ1(Θ0res) c1

with probability at least 1 ctδ for some constant c, we have

σ1(Θτ res) 1 + η

8 τ σ1(Θ0 res)

holds for all τ t.

Proof of Lemma A.17. Suppose η 1/6λ 1 and σ1(Θτ) p

2λ 1. We have

σ1(Θτ res) 1 + η

2 λ k+1 σ2 1(Θτ 1 res ) σ2 k(Θτ 1 k ) σ1(Θτ 1 res ) + σ1(Rτ 2d k)

2λ k+1 σ1(Θτ 1 res ) + σ1(Rτ).

Combining Lemma A.13 and Lemma A.14, when N satisfies Equation (47), we have σ1(Rt) η

8 σ1(Θ0 res). The lemma follows by induction.

Lemma A.18. If Θ(Θ) diag( Λk, 0) F δ for some δ > 0, then BW diag(Λk, 0) F δ.

Proof of Lemma A.18. Note that Θ(Θ)T diag( Λk, 0) F δ implies that B + W

2 diag(Λk, 0) F δ,

BW + W B 2 diag(Λk, 0) F

2 diag(Λk, 0) B W

2 diag(Λk, 0) F + B W

Combining with the fact BW + W B 2 diag(Λk, 0) F = 2 BW diag(Λk, 0) F , the proof is complete.

Federated Representation Learning in the Under-Parameterized Regime

Lemma A.19 (Theorem 2.13 in Davidson & Szarek (2001)). Let N n and A be an N n matrix whose entries are IID standard Gaussian random variables. Then, for any ϵ 0, with probability at least 1 2 exp( ϵ2/2), we have

N n ϵ σmin(A) σmax(A)

Lemma A.20 (Eigenvalue Interlacing Theorem (Hwang, 2004)). For a symmetric matrix A Rd d, let B Rk k, k < d, be a principal matrix of A. Denote the eigenvalues of A as λ1 λd and the eigenvalues of B as µ1 µd. Then, for any i [k], it holds that λi+d k µi λi.

B. General FLUTE

B.1. Details of General FLUTE

Algorithm 2 General FLUTE

1: Input: Learning rates ηl and ηr, regularization parameter λ, communication round T 2: Initialization: Server initializes model parameters B0, {b0 i }, {H0 i } 3: for t = {0, , T 1} do 4: Server samples a batch of clients It+1

5: Server sends Bt, and Ht i to all client i It+1

6: for client i [M] in parallel do 7: if i It+1 then 8: Bt,0 i Bt, bt,0 i bt i and Ht,0 i Hi 9: for τ = {0, , T 1} do 10: Ht,τ+1 i GRD(Li(Bt,τ i , bt,τ i , Ht,τ i ); Ht,τ i , ηl) 11: bt,τ+1 i GRD(Li(Bt,τ i , bt,τ i , Ht,τ i ); bt,τ i , ηl) 12: Bt,τ+1 i GRD(Li(Bt,τ i , bt,τ i , Ht,τ i ); Bt,τ, ηl) 13: end for 14: Bt+1 i Bt,T i , bt+1 i bt,T i and Ht+1 i Ht,T i 15: Sends Bt+1 i , bt+1 i and Ht+1 i to the server 16: else 17: bt+1 i bt i 18: end if 19: end for 20: Server updates: 21: Bt+1 = 1 r M P

i It+1 Bt+1 i 22: {Ht+1 i }i It+1 GRD(R({Ht+1 i }i It+1, Bt+1); {Ht+1 i }i It+1, ηr) 23: Ht+1 i Ht i, i / It+1

24: end for

The General FLUTE is presented in Algorithm 2, where GRD(f; θ, α) denotes the update of variable θ using the gradient of the function f with respect to θ and the step size α. The local loss function Li is defined as

Li(B, b, H) = 1

(x,y) Di L H f B(x) + b, y . (48)

In this work, we instantiate the general FLUTE by a federated multi-class classification problem. In this case, the local loss function is specialized as

Li(B, b, H) = 1

(x,y) Di LCE H i f B(x) + bi, y + λ1 f B(x) 2 2 + λ2 Hi 2 F + λ3NCi(Hi), (49)

where y Rm is a one-hot vector whose k-th entry is 1 if the corresponding x belongs to class k and 0 otherwise, and λ1, λ2 and λ3 are non-negative regularization parameters. LCE( ) is the cross-entropy loss, where for a one-hot vector y whose

Federated Representation Learning in the Under-Parameterized Regime

k-th entry is 1, we have:

LCE(ˆy, y) = log

exp(ˆyk) P i [c] exp(ˆyi)

NCi(Hi), inspired by the concept of neural collapse (Papyan et al., 2020), is defined as

H i Hi HT i Hi F 1 m 1uiu i Im 1

where ui is an m-dimensional one-hot vector whose c-th entry is 1 if c Ci and 0 otherwise. Also, we specialize the regularization term optimized on the server side as R({Hi}) = P

B.2. Additional Definition

Definition B.1 (k-Simplex ETF, Definition 2.2 in Tirer & Bruna (2022)). The standard simplex equiangular tight frame (ETF) is a collection of points in Rk specified by the columns of

Consequently, the standard simplex EFT obeys

M M = MM = k k 1

In this work, we consider a (general) simplex ETF as a collection of points in Rd, d k specified by the columns

k k 1P Ik 1

k1k1 k , where P Rd k is an orthonormal matrix. Consequently, M M MM =

B.3. More Discussion on General FLUTE

Firstly, we explain the concept of neural collapse.

Neural collapse. Neural collapse (NC) was experimentally identified in Papyan et al. (2020), and they outlined four elements in the neural collapse phenomenon:

(NC1) Features learned by the model (output of the representation layers) for samples within the same class tend to converge toward their average, essentially causing the within-class variance to diminish;

(NC2) When adjusted for their overall average, the final means of different classes display a structure known as a simplex equiangular tight frame (ETF);

(NC3) The weights of the final layer, which serves as the classifier, align with this simplex ETF structure;

(NC4) Consequently, after this collapse occurs, classification decisions are made based on measuring the nearest class center in the feature space.

Next, we discuss some observations on the vanilla multi-classification problem, i.e., no additional regularization term and no client-side optimization, which is given as

Li(B, b, H) = 1

(x,y) Di LCE H i f B(x) + bi, y + λ1 f B(x) 2 2 + λ2 Hi 2 F . (54)

The first observation, which directly comes from Theorem 3.2 in Tirer & Bruna (2022), describes the phenomena of local neural collapse, which could happen when the model is locally trained for long epochs.

Federated Representation Learning in the Under-Parameterized Regime

Observation B.2. When f B( ) is sufficiently expressive such that f B(x) can be viewed as a free variable. and the feature dimension k is no smaller than the number of total classes m, locally learned B and Hi that optimize the objective function (54) must satisfy:

f B (x1) = f B (x2), x1, x2 Dc i , c Ci (55)

f B (x), h i,c f B (x) h i,c = 1, x Dc i , c Ci (56)

H i Hi H i Hi F = 1

m 1uiu i Im 1

where ui is a m-dimensional one-hot vector whose c-th entry is 1 if c Ci and 0 otherwise, and m is the number of classes per client.

The above observation states that NC1, NC2, and NC3 happen locally, implying: 1) hi,c = 0 if c / Ci; and 2) the sub-matrix of Hi constructed by columns hi,c with c Ci will form a K-Simplex ETF (c.f. Definition B.1) up to some scaling and rotation. We conclude that if there exist B and H1, , HM such that they are the optimal models for all clients, then the data from the same class may be mapped to different points in the feature space by f B when data are drawn from different clients. However, this condition usually cannot be satisfied in the under-parameterized regime, due to the less expressiveness of the under-parameterized model.

To further demonstrate the phenomenon in the under-parameterized regime, we assume that in the under-parameterized regime, a well-performed representation f B should map data from the same class but different clients to the same feature mean: Condition 1. For client i and j, if class c Ci and c Cj, then 1 |Dc i | P

x:(x,y) Dc i f B(x) = 1 |Dc j| P

x:(x,y) Dc j f B(x).

With this condition, we have the following observation that also comes from Theorem 3.1 in Zhu et al. (2021), which describes the neural collapse in the under-parameterized regime.

Observation B.3. When Condition 1 holds and the feature dimension k is no smaller than the number of total classes m, any global optimizer B , H 1, , H M of (54) satisfies

f B (x1) = f B (x2), x1, x2 Dc i , i [M], c Ci, (58)

f B (x), h i,c f B (x) h i,c = 1, x Dc i , i [M], c Ci, (59)

H i Hi H i Hi F = 1 m 1uiu i Im 1

, i [M], (60)

where ui is a m-dimensional one-hot vector whose c-th entry is 1 if c Ci and 0 otherwise.

Comparing these two observations, we conclude that in the under-parameterized case, the optimal models hi,c and hj,c are of the same direction when class c is included in both Ci and Cj. It implies that the globally optimized model performs differently compared with the locally learned model. In Figure 2, we present an example to illustrate how H performs differently when it is globally or locally optimized.

In Figure 2, we consider the scenario that the number of clients M = 3, total number of data classes m = 3, number of data classes per client m = 2, client 1 contains data of class 1 and class 2, client 2 contains data of class 1 and class 3, and client 3 contains data of class 2 and class 3. The first row of the three sub-figures shows the structure of normalized columns of H1, H2, and H3 when they are locally optimized, and the second row of the three sub-figures shows those optimize (54). We observe that under this setting, the locally optimized heads are in opposite directions, which perform differently compared with the global optimal heads.

Inspired by such observations, we add NCi to the local loss function and also optimize R({Hi}), to ensure that the personalized heads also contribute to the global performance. This principle aligns with our motivation to design the linear FLUTE.

Federated Representation Learning in the Under-Parameterized Regime

Figure 2. Behavior of locally optimized heads and globally optimized heads.

C. Additional Experimental Results

C.1. Synthetic Datasets

Implementation Details. In the experiments conducted on synthetic datasets shown in Figure 3, Λ Rd d is generated by setting the i-th singular value to be 2d i+1. We randomly generate U Rd d with d orthonormal columns and V Rd M

with d orthonormal rows. The ground-truth model is then Φ = UΛV , where each column ϕi represents the local ground-truth model for client i. Each client generates N samples (x, y) from y = x ϕi + ξi, where x is sampled from a standard Gaussian distribution and every entry of ξi is IID sampled from N(0, 0.3). The learning rate is set to η = 0.03, and for random initialization, we set α = 1 10d.

Parameter Settings. For experiments on synthetic datasets shown in Figure 3, we set d = 10. We select the value of k from the set {2, 4, 6, 8}, M from the set {15, 30}, and N from the set {12, 20}.

Experimental Results. From the experiments in Figure 3, we observe that, with the dimensions d, M, and N fixed, an increase in k results in a diminishing discrepancy in convergence speeds between FLUTE and Fed Rep. This trend demonstrates FLUTE s superior performance in under-parameterized settings. Furthermore, keeping d, k, and N unchanged while increasing the number of clients M, we see a reduction in the average error of models generated by FLUTE. This observation aligns with our theoretical findings presented in Theorem 5.5.

Varying γ1 and γ2. In Figure 4, we report the results of the following experiments where d = 10, k = 6, M = 10, and N selected from the set {8, 9, 10, 11}. For comparison, we use three pairs of γ1 and γ2: γ1 = 2γ2, γ1 = γ2, and γ1 = 2

3γ2. We do not set γ1 > 2γ2 because in this setting, BW F usually diverges. From the experimental results, we observe that when N = 8, 9, or 10, γ1 = γ2 shows the best performance among the three settings of γ1 and γ2.

Federated Representation Learning in the Under-Parameterized Regime

Figure 3. Experimental results with synthetic datasets.

Federated Representation Learning in the Under-Parameterized Regime

Figure 4. Experimental results with synthetic datasets.

C.2. Real-world Datasets

Implementation Details. For our experiments on the CIFAR-10 dataset, we employ a 5-layer CNN architecture. It begins with a convolutional layer Conv2d(3, 64, 5), followed by a pooling layer Max Pool2d(2, 2). The second convolutional layer is Conv2d(64, 64, 5), which precedes three fully connected layers: Linear(64*5*5, 120), Linear(120, 64), and Linear(64, 10). In contrast, for the CIFAR-100 dataset, we also use a 5-layer CNN, but with some modifications to accommodate the higher complexity of the dataset. The initial layer is Conv2d(3, 64, 5), followed by pooling and dropout layers: Max Pool2d(2, 2) and nn.Dropout(0.6). The subsequent convolutional layer is Conv2d(64, 128, 5). This is succeeded by three fully connected layers: Linear(128*5*5, 256), Linear(256, 128), and Linear(128, 100).

Experimental Results. In this section, we plot Figure 5 to Figure 12 to illustrate the detailed convergence behavior of the test accuracy of the trained models reported in Table 1 as a function of the training epochs. We augment the test accuracy results by introducing two different metrics. The first one is Global NC2, which is measured by

H i Hi H i Hi F 1 m 1uiu i Im 1

The second one is Averaged Local NC2, referred to as

H i Hi H i Hi F 1

m 1uiu i Im 1

These two metrics are inspired by Observation B.3 and Observation B.2, respectively. Global NC2 aims to measure the similarity between the learned models and the optimal under-parameterized global model. In contrast, Averaged Local NC2 assesses the similarity between the learned models and the optimal local models. Note that these two metrics are positively correlated, meaning that when one is small, the other is usually small as well. In some results, such as those shown in Figure 5 and Figure 7, the gaps between Fed Rep* and FLUTE* in terms of Averaged Local NC2 are significantly larger than those in terms of Global NC2, suggesting that the models learned by FLUTE* are closer to the global optimizer than those learned by Fed Rep*.

Federated Representation Learning in the Under-Parameterized Regime

Figure 5. Experimental results for CIFAR10 when M = 50, m = 2.

Figure 6. Experimental results for CIFAR10 when M = 50, m = 5.

Figure 7. Experimental results for CIFAR10 when M = 100, m = 2.

Figure 8. Experimental results for CIFAR10 when M = 50, m = 5.

Federated Representation Learning in the Under-Parameterized Regime

Figure 9. Experimental results for CIFAR100 when M = 100, m = 5.

Figure 10. Experimental results for CIFAR100 when M = 100, m = 10.

Figure 11. Experimental results for CIFAR100 when M = 100, m = 20.

Figure 12. Experimental results for CIFAR100 when M = 100, m = 40.