# personalized_federated_learning_via_lowrank_matrix_optimization__c9416e6d.pdf

Published in Transactions on Machine Learning Research (08/2025)

Personalized Federated Learning via Low-Rank Matrix Optimization

Ali Dadras ali.dadras@umu.se Umeå University, Sweden

Sebastian U. Stich stich@cispa.de CISPA Helmholtz Center, Germany

Alp Yurtsever alp.yurtsever@umu.se Umeå University, Sweden

Reviewed on Open Review: https: // openreview. net/ forum? id= DFJu1QB2Nr

Personalized Federated Learning (p FL) has gained signiﬁcant attention for building a suite of models tailored to diﬀerent clients. In p FL, the challenge lies in balancing the reliance on local datasets, which may lack representativeness, against the diversity of other clients models, whose quality and relevance are uncertain. Focusing on the clustered FL scenario, where devices are grouped based on similarities in their data distributions without prior knowledge of cluster memberships, we develop a mathematical model for p FL using low-rank matrix optimization. Building on this formulation, we propose a p FL approach leveraging the Burer Monteiro factorization technique. We examine the convergence guarantees of the proposed method and present numerical experiments on training deep neural networks, demonstrating the empirical performance of the proposed method in scenarios where personalization is crucial.

1 Introduction

Federated Learning (FL) is an important paradigm in machine learning, holding a great promise for training machine learning models over a large network with restricted data sharing. It is most suitable when clients require collaboration often due to the absence of a large, representative dataset available at hand that can capture the diversity and variability of the underlying behavior but in an environment where sharing datasets with collaborators is prohibited often driven by concerns and regulations surrounding data sharing and storage. Consequently, FL research has been focused on designing algorithms that can solve optimization and learning problems on a network without sharing essential data; but instead communicating decision variables (referred to as local models in machine learning) or other auxiliary variables like gradients or step-directions, alongside privacy improvement strategies such as encryption or noise injection. However, the restriction of data sharing presents challenges beyond the training process. Primarily, limitations on data sharing hinder eﬀective control over the quality and relevance of the data provided by participating clients a major concern that led to the rise of personalized federated learning models (p FL).

The primary goal of p FL is ﬁnding the right balance between the two conﬂicting forces: the reliability of local datasets which may lack representativeness, and the diversity of collaborators models whose quality and relevance are uncertain. As a result, p FL lacks a clear deﬁnition and direction without explicit speciﬁcations regarding data distribution of the clients. Regrettably, it appears there is no universally accepted measurable quantitative goal to evaluate the success in personalization. Adding to this concern, many existing p FL methods are tested in settings that are inherently unsuited for p FL, where either Federated Averaging (Fed Avg) or local training produces the best accuracies. It is evident that when Fed Avg yields optimal results,

Published in Transactions on Machine Learning Research (08/2025)

indicating uniform data distributions across clients, there is no room for personalization. On the other hand, if local training performs well, suggesting that local datasets represent the problem suﬃciently well, the necessity of FL is called into question.

Motivated by these observations, our ﬁrst step is to formulate a mathematical problem that highlights the role and necessity of a p FL approach. Suppose there are n clients collaborating on an FL system, indexed by i = 1, . . . , n, and assume that the data for each client comes from a speciﬁc data distribution, denoted by Di. The true objective function for each client is deﬁned as:

f i (θi) := Eξi Di ℓi(θi, ξi), (1)

where ℓi : Rd Rp R is a loss function. Here, ξi = (xi, yi) denotes a random data sample drawn from the probability distribution Di, where xi represents the input features and yi denotes the corresponding response. When data distributions are known, a solution to this problem can be found by minimizing f i (θi) locally:

min θi f i (θi). (2)

However, the true distribution Di is unknown in practice. Instead, client i approximates its objective using an empirical dataset Mi = { ξi,1, . . . , ξi,mi} consisting of mi independent and identically distributed data samples drawn from Di. This yields the empirical objective

fi(θi) := 1

ξi Mi ℓi(θi, ξi) .

Throughout, we operate under the assumption that the dataset Mi is not large enough for clients to accurately approximate a solution to problem (2) locally on their own. Otherwise, FL would not be required.

An eﬀective solution to this problem is possible only if the distributions Di exhibit some correlation that we can exploit. At one extreme, when all distributions are the same, the standard FL template can be used, which can be formulated as

min θ1,...,θn 1 n

i=1 fi(θi) s.t. θ1 = = θn , (3)

or equivalently as

i=1 fi(θ) . (4)

A signiﬁcant portion of existing p FL methods are designed by relaxing the equality constraint (also called consensus constraint); examples include Moreau envelope smoothing and quadratic penalty regularization (Dinh et al., 2020; Li et al., 2020). However, these approaches that penalize model dissimilarity using a speciﬁc norm have limitations, as they rely on the (implicit) assumption that similarity in distributions Di translates to the proximity of client models in a given norm. The following simple examples demonstrate these limitations:

Example 1 (Label noise in classiﬁcation). Consider a linear binary classiﬁcation problem with two groups of clients that diﬀer in their sign conventions. Speciﬁcally, these groups label the positive and negative classes in opposite ways due to a misalignment in how they interpret the binary outcomes. Clearly, the data distributions of these two groups are nearly identical, diﬀering only by one bit. However, for this linear classiﬁer, this diﬀerence results in the optimal models for the two groups having opposite signs, leading to solutions θ group1 = θ group2, which are distant in all norms.

Although presented as a toy example here, mislabeling is a common problem in classiﬁcation tasks, especially in domains like healthcare where human experts are involved in data collection. In real-world FL systems, where data remains private, detecting or preventing such issues is challenging. Therefore, algorithms must be designed robust against these inconsistencies.

Example 2 (Clustered FL). Suppose each client draws data from one of r distinct distributions, forming r clusters of clients. We assume that cluster memberships are unknown, and the challenge is to establish eﬀective collaboration without knowing in advance which clients share similar data distributions.

Published in Transactions on Machine Learning Research (08/2025)

θ 1 = a1 θ 2 = a2

(b) Fed Avg

(c) Distance based

θfac 2 θfac 3

Low dimensional subspace

(d) Low-rank (ours)

Figure 1: Solutions to the minimization problem 1

n Pn i=1 θi ai 2, where θi, ai R2. The red points represent individual minimizers of the problem which are equivalent to θ i = ai. Fed Avg solution is θ = 1

n Pn i=1 ai which is shown in yellow. Considering θi θj δ constraint one can ﬁnd θ i (δ) as shown in green. Solving this problem with the low-rank assumption (r = 1) gives us solutions θfac i , blue points, lying in a low-dimensional subspace.

A speciﬁc class of p FL methods for clustered FL problems focuses on identifying clusters using various similarity measures and then performing cluster-aware aggregation. However, in a real-world FL system where data remains private, estimating these clusters is challenging without exposing additional information. This diﬃculty persists unless we rely on the stringent assumption that similar distributions produce models that are close in a certain norm, which could then be used to estimate clusters during training.

Example 3 (Collaborative ﬁltering). Consider the classical problem of recommendation systems. Suppose there are n clients and p items. Let θi Rp represent the relevance scores of client i for the items. The data consists of the actual scores rated by the clients, where each client rates only a subset of the items, denoted by Si {1, . . . , p}. We denote these scores by θ ij for j Si. The goal is predicting unknown scores that are not in Si, based on hidden patterns among clients.1

Models based on Euclidean distance regularizations are known to fail in accurately predicting movie preferences. Instead, low-rank matrix factorization is among the most popular and successful approaches to collaborative ﬁltering, as demonstrated by the Netﬂix competition 2008 progress prize winning team (Koren et al., 2009). The typical explanation for the empirical success of low-rank models in this problem is that movie preferences are well-parameterized by a few meaningful factors, such as genre, cast, language, and year. A more nuanced argument generalizes this by noting that low-rank matrices naturally arise in latent variable models (LVMs). While this is standard for LVMs with linear parameterizations, (Udell & Townsend, 2019) demonstrate that low-rank models are eﬀective for a broad class of (possibly high-dimensional) LVMs parameterized by a piecewise analytic function.

Inspired by these examples, we explore how to formulate p FL without relying on a speciﬁc distance metric. This leads us to investigate low-dimensional subspace formulations, where personalized models are related by their membership to a low-dimensional subspace rather than their proximity in a distance metric. This approach allows us to conceptualize p FL by focusing on the inherent structure of the model relationships rather than their spatial closeness, as illustrated in Figure 1. Drawing parallels to collaborative ﬁltering, we speciﬁcally focus on low-rank formulations.

We can now summarize our main contributions:

We introduce a new formulation for p FL based on low-rank matrix optimization in Section 3. Utilizing a nonconvex matrix factorization method applied to this formulation, we propose a new method called Personalized Federated Learning via Matrix Factorization (p FLMF).

We investigate the convergence guarantees of the proposed method in Section 4. For the smooth nonconvex minimization problem, we show that the proposed method converges to a ﬁrst-order stationary point at a rate of O(1/T); with the stochastic gradients, the rate becomes O(1/

1The decision variable in matrix completion reveals the data, limiting the privacy beneﬁts of FL. Nevertheless, the problem highlights the challenge of distributed learning with personalized models.

Published in Transactions on Machine Learning Research (08/2025)

We present numerical experiments on training various types of neural networks in Section 5. We compare the performance of the proposed method against the baseline in scenarios where personalization is crucial, such as classiﬁcation in clustered FL with label misalignment or non-homogeneous data distributions.

2 Related Work

Many p FL algorithms regularize client models to remain close; the main assumption is that personal models are close under a chosen metric (Dinh et al., 2020). Learning a mixture of the global model and the local models is proposed in (Hanzely & Richtárik, 2020), where these personalized models are encouraged to stay close to their average by incorporating a quadratic penalty. Empirical evidence suggests that model similarities among diﬀerent neural networks, particularly in their classiﬁer layers, correlates with the similarity of the training data distributions (Tan et al., 2023). Thus, enforcing model closeness (e.g., in Euclidean distance) tends to favor clients with similar data distributions.

More recently, model decoupling methods have been proposed (Arivazhagan et al., 2019; Oh et al., 2021; Pillutla et al., 2022; Mishchenko et al., 2023), showing a better performance than distance-regularization-based p FL methods. The main idea is to decouple each local model into two blocks, a feature extractor block followed by a classiﬁer block. The feature extractor block is communicated and aggregated over clients and the classiﬁer block is trained locally by each client. (Arivazhagan et al., 2019) considered a personalization of only selected layers of the neural network: all devices share a common set of base layers with identical weights, while each device maintains its own personalization layers that adapt to its local data. The base layers are synchronized with the server while the personalization layers are kept private by each device. In (Oh et al., 2021), the entire network is decomposed into the body (extractor), which is related to universality, and the head (classiﬁer), which is related to personalization. This reduces the update and aggregation parts from the entire model to the body of the model during federated training.

Perhaps the most relevant works to ours are (Collins et al., 2021) and (Thekumparampil et al., 2021). The goal in (Thekumparampil et al., 2021) is to ﬁnd a shared low-dimensional representation of the data features. While the primary focus of (Thekumparampil et al., 2021) is on multi-task learning, it can be applied for personalized FL by treating each client s learning problem as a separate task, as considered in (Collins et al., 2021). In (Collins et al., 2021), the server learns the common low-dimensional features of the data, and each client learns local features suited to its requirements. This method, Federated Representation Learning (Fed Rep), uses gradient-based updates to train a global low-dimensional representation and allows each client to compute a personalized low-dimensional classiﬁer. The key diﬀerence between Fed Rep and our approach is in the low-rank assumption. Fed Rep assumes that each clients parameters are low rank. In contrast, we assume that the concatenation of all clients parameters lies in a low-dimensional space. In other words, their feature extractor extracts the features from a single shared global model while our method trains a set of models and each client uses a combination of these models as its personalized model. Put diﬀerently, Fed Rep assumes that θi are low rank, whereas we assume that Θ := [vec(θ1), vec(θ2), . . . , vec(θn)] is low rank. Furthermore, Fed Rep focuses on the linear representation setting with a quadratic loss.

Beyond distance-based and decoupling approaches, a variety of other strategies have been explored for p FL. (Zhang et al., 2024) introduces LR-BPFL, a Bayesian personalized federated learning method that learns a global deterministic model along with personalized low-rank Bayesian corrections. (Bao et al., 2023) proposes Fed Collab, a clustered federated learning framework that mitigates negative transfer by partitioning clients into non-overlapping coalitions informed by both pairwise distribution distances and relative data quantities. (Prakash et al., 2023) processes hierarchical, tree-like data in federated learning by developing an algorithm tailored to hyperbolic spaces. (Anelli et al., 2022) investigates federated pair-wise learning for factorization models in a recommendation scenario. (Huang et al., 2022) proposes an FL framework for solving the Point-of-Interest (POI) recommendation problem. (Ammad-Ud-Din et al., 2019) introduces a federated implementation of collaborative ﬁltering for recommendation systems. Liang et al. (2020) introduces LG-FEDAVG, which combines local representation learning with global model learning in an end-to-end manner. Each local device learns to extract higher-level representations from raw data before a global model operates on the representations (rather than raw data) from all devices. (Tan et al., 2023) proposes a decoupling algorithm that also personalizes feature extractors by adjusting aggregation weights based on

Published in Transactions on Machine Learning Research (08/2025)

classiﬁer similarity. (Deng et al., 2020) introduces APFL algorithm, which learns a personalized model for each user as a convex combination of local and global models, with the combination coeﬃcients adaptively updated during training. (Hao et al., 2022) assumes factorized weights for neural network weights and, rather than learning a single global model, learns a dictionary of rank-1 weight factor matrices. Each client then assembles a personalized model from this dictionary to match its own data distribution. (Jeong & Hwang, 2022) considers factorization of the model parameters and allows clients to perform a selective aggregation scheme to utilize only the knowledge from the relevant participants for each client. (Pal et al., 2024) introduces LRS, modeling user parameters as low-rank + sparse to capture shared structure and individual-speciﬁc characteristics. They further develop AMHT-LRS, providing theoretical guarantees in the linear Gaussian setting, and extend it to a user-level diﬀerentially private version.

Finally, several works use low-rank matrix optimization in FL, but with diﬀerent motivations and goals from ours. One line uses low-rank structure in the model parameters: (Pinto et al., 2023) projects private datasets onto a low-dimensional subspace spanned by principal components estimated from public unlabeled data, then apply gradient-based private algorithms (e.g., Noisy-SGD) to learn a linear classiﬁer; (Zhao et al., 2016; Cai et al., 2014) assumes one of the neural network layers is low rank; (Liu et al., 2024a) uses homogeneous pre-factorized low-rank layers across clients; (Tran et al., 2025) factorizes local prompts into two lower-rank components plus a residual; and (Niu et al., 2023) introduces Pri SM training, which assigns resource-constrained clients low-rank sub-models via importance-aware probabilistic sampling. A second line uses low-rank structure in the gradients, such as (Kasiviswanathan, 2021; Yu et al., 2021; Gooneratne et al., 2020). (Yao et al., 2021) proposes Fed HM, which trains low-rank factorized neural networks of a speciﬁed size and reconstructs a full-rank global model on the server via a model shape alignment method.

3 Algorithm

We propose a novel formulation for p FL based on low-rank matrix optimization:

min Θ Rd n F(Θ) := 1

i=1 fi(θi) s.t. rank(Θ) r. (5)

Here, Θ Rd n denotes the system-level decision variable obtained by concatenating clients decision variables as Θ := [θ1, θ2, . . . , θn], and r is problem speciﬁc tuning parameter. Note that this formulation suits well for the examples we discussed in the introduction.

There exists a rich literature on rank-constrained matrix optimization problems, approaches including hard thresholding algorithms (Jain et al., 2010; Goldfarb & Ma, 2011; Kyrillidis & Cevher, 2014), convex relaxation methods (Candès & Recht, 2012; Recht et al., 2010), and nonconvex matrix factorization techniques (Burer & Monteiro, 2003; Sun & Luo, 2016; Bhojanapalli et al., 2016; Park et al., 2017). The ﬁrst two class of algorithms require expensive spectral decomposition steps, hence they are not suitable for federated implementation unless the server possesses suﬃcient computational power to perform such decompositions. Consequently, we adopt the nonconvex matrix factorization technique, also known as the Burer-Monteiro (BM) factorization.

BM factorization strategy replaces the system-level decision variable Θ Rd n with a factorized form of Θ = UV . This transformation leads to the following optimization problem:

min U Rd r, V Rn r ψ(U, V), where ψ(U, V) := F(UV ) = 1

i=1 fi(Uvi). (6)

We denote by V := [v1, , vn] Rr n. In this notation, personalized model parameters can be computed as θi = Uvi Rd. One can interpret U Rd r as a shared feature representation in the FL problem, computed by the server, and vi Rr as the feature extractor speciﬁc to client i.

While various optimization techniques can address problem (6), we design our algorithm based on the simple block-coordinate gradient updates. We can compute the gradient of ψ with respect to U and vi as follows:

Uψ(U, V) = 1

i=1 fi(Uvi) v i and viψ(U, V) = 1

n U fi(Uvi). (7)

Published in Transactions on Machine Learning Research (08/2025)

Algorithm 1 Personalized Federated Learning via Matrix Factorization (p FLMF)

set U0 Rm r, v0 i Rr i [n]. for round t = 0, 1, . . . , T 1 do

Client-level local training - for client i St do

set vt,1 i = vt i. for k = 0, . . . , K 1 do

vt,k+1 i = vt,k i ηi 1

n Ut fi(Utvt,k i ) end for vt+1 i = vt,K i Gt i = fi(Utvt i) vt i

Client communicates Gt i to the server. end for Server-level aggregation Ut+1 = Ut ηi 1 |St| P i St Gt i Server communicates Ut+1 to the clients.

It is crucial that ψ is separable with respect to vi, enabling clients to compute viψ(U, V) in parallel without requiring access to data or model parameters from other clients, given the features U. Consequently, for a given step-size ηi > 0, local training steps can be independently formulated and performed by each participating client as:

vt+1 i = vt i ηi 1 n Ut fi(Utvt i). (8)

On the other hand, ψ is not separable with respect to the rows or columns of U, necessitating collaboration among clients for computing Uψ(U, V). Consequently, the gradient step in U requires communication and will be performed at the server, forming our aggregation step:

Ut+1 = Ut 1

i=1 ηi fi(Utvt i) vt i . (9)

Algorithm 1 depicts the pseudo-code of our algorithm. Here, K is the number of local passes each client performs, and the output of the algorithm is a set of personalized parameters θi = Uvi that each client can compute locally using its feature extractors vi and the shared feature representation U.

4 Convergence Guarantees

Several works have studied the convergence for the problem (6) under diﬀerent assumptions; we refer to (Chi et al., 2019) and references therein. (Bhojanapalli et al., 2016; Park et al., 2018) proved linear/sub-linear rates for smooth functions and smooth and strongly convex functions, respectively. Due to the nonconvex nature of BM factorization, even in cases where f(.) is convex in Θ, it is not possible to prove a convergence theorem to the global minimum. For more specialized cases (e.g., matrix sensing problems under some technical assumptions called restricted isometry property), convergence to a global solution can be characterized with careful initialization procedures (Park et al., 2018; Jain et al., 2013; Zheng & Laﬀerty, 2016; Park et al., 2016). Since our focus is primarily on neural network applications, where objectives are already nonconvex in Θ, we derive convergence guarantees to a stationary point.

We begin our presentation of the main convergence guarantee by ﬁrst listing our assumptions. We assume fi(Uvi) are directionally smooth, and that we have access to unbiased stochastic gradients VF(UV ) and UF(UV ) with bounded variance.

Published in Transactions on Machine Learning Research (08/2025)

Assumption 1 (Directional smoothness). We assume that F(UV ) is smooth with respect to U and V, i.e., there exist constants LU, LV 0 such that for all U1, U2 Rd r and V1, V2 Rn r:

UF(U1V 1 ) UF(U2V 2 ) F LU U1 U2 F + V1 V2 F

VF(U1V 1 ) VF(U2V 2 ) F LV U1 U2 F + V1 V2 F

Assumption 2 (Stochastic gradients). We assume access to an unbiased stochastic gradient estimator with bounded variance, i.e. there exists σ < + such that for all U Rd r and V Rn r:

E F(UV ) = F(UV ) and E h UF(UV ) UF(UV ) 2i σ2

E h VF(UV ) VF(UV ) 2i σ2.

Theorem 1. Consider problem (6) with smooth loss functions fi(.) in the sense that Assumption 1 holds. Assume access to a stochastic gradient estimator such that Assumption 2 holds. Furthermore, assume that every client participates in each round with probability p and performs K local steps per iteration. Then, the sequence Ut, Vt generated by p FLMF with step-sizes ηv = pηu

K and ηu < 1 2L, where L := max{LU, LV }, satisﬁes the following bound:

E UF(Ut Vt ) 2 + E

VF(Ut Vt k ) 2 #!

2 F(U0V0 ) F

ηT 1 2ηL + 2ηLσ2

Corollary 2. Choosing η = 1 2L

T in Theorem 1 yields a rate of O(1/

T) in the stochastic setting. If full gradients are available (σ = 0), then η = 1 4L results in a convergence rate of O(1/T).

5 Numerical Experiments

In this section, we evaluate the performance of p FLMF. We compare the performance of p FLMF against several baselines, including Local training, Fed Avg (Mc Mahan et al., 2017), Fed Per (Arivazhagan et al., 2019), Fed Rep (Collins et al., 2021), APFL (Deng et al., 2020), CFL (Sattler et al., 2021), FLUTE (Liu et al., 2024b), Fed AS (Yang et al., 2024), Fed Alt (Pillutla et al., 2022), and p Fed FDA (Mc Laughlin & Su, 2024) by implementing p FLMF in the FL-Bench benchmark (Tan & Wang, 2024; Tan et al., 2023). The source code is available at https://github.com/Dadras Ali/p FLMF. We conducted experiments in ﬁve diﬀerent setups:

Setup (1) For the MNIST, CIFAR10, and CIFAR100 datasets, we split the data according to the Dirichlet distribution Dir(0.5) and Dir(1) across 100 clients. The labels distribution is shown in Figures 2c and 2d. Performance of the algorithms is shown in Table 1.

Setup (2) For the CIFAR-100 dataset, we partitioned the 100 classes into 20 groups, each containing 5 distinct labels. Data was then distributed among 500 clients, with each client exclusively assigned data from a single group, resulting in highly heterogeneous data. Results are shown in Table 2.

Setup (3) For the MNIST, we follow the experimental setup in (Sattler et al., 2019) and consider 1000 clients divided into 10 groups, and labels in each group are re-mapped (permuted) according to a random permutation map. In other words, clients in group one would have the same numbers {0, , 9} but labeled diﬀerently; group one may consider 0 with label 0, and group two may consider 0 with label 8. Figures 2a and 2b show the distribution of the labels before and after re-labeling, respectively. Results are shown in Table 2.

Setup (4) We sampled a subset of clients, 30% of the total clients, from FEMNIST dataset without changing the underlying data distribution, then we removed clients with less than 10 data points. The remaining set has 1091 clients. We ran the experiments for 1 and 5 numbers of local epochs. Results are shown in Table 2.

Published in Transactions on Machine Learning Research (08/2025)

0 100 200 300 400 500 600 700

0 1 2 3 4 5 6 7 8 9

(a) Uniform

0 100 200 300 400 500 600 700

0 1 2 3 4 5 6 7 8 9

(b) Permuted

0 200 400 600 800 1000 1200 1400

0 1 2 3 4 5 6 7 8 9

(c) Dir(0.5)

0 200 400 600 800 1000

0 1 2 3 4 5 6 7 8 9

Figure 2: Distribution of the labels for MNIST dataset and 100 clients. The vertical and horizontal axes show clients and the size of each client s data, respectively.

MNIST CIFAR10 CIFAR100

Dir(0.5) Dir(1) Dir(0.5) Dir(1) Dir(0.5) Dir(1) Local 92.12% ( 0.59) 89.15% ( 1.12) 59.14% ( 3.74) 48.53% ( 2.53) 16.09% ( 1.52) 10.66% ( 0.98) Fed Avg 96.92% ( 0.65) 97.01% ( 0.54) 65.21% ( 2.11) 65.44% ( 1.68) 28.30% ( 1.82) 28.36% ( 1.32) Fed Per 96.30% ( 0.26) 95.16% ( 0.58) 66.86% ( 3.19) 58.25% ( 2.22) 19.98% ( 1.6) 14.22% ( 1.01) Fed Rep 95.04% ( 0.40) 93.33% ( 0.94) 65.16% ( 3.44) 55.4% ( 2.06) 17.49% ( 1.12) 12.14% ( 1.05) APFL 97.93% ( 0.51) 97.64% ( 0.39) 65.99% ( 2.06) 65.14% ( 1.54) 27.07% ( 1.57) 27.07% ( 1.36) CFL 96.92% ( 0.72) 97.04% ( 0.5) 64.97% ( 2.68) 65.98% ( 1.70) 27.02% ( 1.48) 24.84% ( 0.91) FLUTE 76.72% ( 0.88) 73.25% ( 1.01) 48.20% ( 0.74) 40.59% ( 0.69) 12.58% ( 0.28) 7.28% ( 0.13) Fed Alt 98.09% ( 0.05) 97.91% ( 0.05) 62.89% ( 0.57) 58.09% ( 0.43) 20.57% ( 0.46) 16.20% ( 0.19) Fed AS 97.17% ( 0.12) 97.11% ( 0.39) 66.33% ( 1.14) 63.80% ( 0.54) 8.48% ( 1.79) 5.12% ( 1.56) p Fed FDA 97.23% ( 0.04) 97.05% ( 0.08) 70.65% ( 1, 59) 66.67% ( 1.41) 26.15% ( 0.19) 19.10% ( 0.17) p FLMF

r = 1 96.75% ( 0.61) 96.53% ( 0.59) 43.89% ( 3.49) 64.03% ( 1.66) 34.32% ( 1.96) 35.24% ( 1.77) r = 5 96.78% ( 0.51) 96.55% ( 0.60) 60.73% ( 2.86) 65.89% ( 1.88) 35.64% ( 2.09) 35.75% ( 1.24) r = 10 96.98% ( 0.70) 96.84% ( 0.56) 65.10% ( 2.30) 67.68% ( 1.56) 35.28% ( 1.73) 36.84% ( 1.50) r = 15 98.24% ( 0.26) 97.93% ( 0.22) 68.13% ( 2.43) 65.88% ( 1.62) 35.70% ( 1.77) 36.12% ( 1.46)

Table 1: Performance of the algorithms for Setup (1). The best accuracy is shown in boldface, and the second best is underlined.

Setup (5) We examine the sensitivity of p FLMF to the choice of rank, ranging from 1 to 20, on a highly heterogeneous data split, Dir(0.1) and 100 clients. Figure 3 shows the average test accuracy and runtime versus rank for the MNIST and CIFAR-10 datasets.

Setup (6) We evaluated p FLMF with a range of local update counts, KU {1, 10, 20} and r = {1, 10, 20}, on a highly heterogeneous MNIST split generated by Dir(0.1). The experiment involved 200 clients, each communication round sampling a 10% subset. Figure 4 reports the resulting average test accuracy.

Model. We used a three-layer neural network, consisting of three linear layers, on the MNIST and FEMNIST datasets and a four-layer convolutional neural network, consisting of two convolutional layers followed by two linear layers, on the CIFAR10 and CIFAR100 datasets. For Fed Per and Fed Rep, we treated the last layer as the classiﬁer, while in p FLMF, we factorized the entire model.

Hyper-parameters. We consider partial participation with probability equal to 0.1. We set the batch size equal to 256 for all algorithms. We tried parameter r values of (6) is set to r {1, 5, 10, 15}. All experiments have 75% train and 25% test data splits on each client s data. We chose the best step size for each algorithm from the set {10 4, 10 3, 10 2, 10 1}.

Observations. In the heterogeneous experiments, Setup (1), p FLMF outperforms other p FL methods in most cases, though the performance across algorithms is comparable on the MNIST dataset. Remarkably, p FLMF achieves substantial improvements in average test accuracy when diﬀerent client groups have similar intra-group distributions but diﬀer signiﬁcantly across groups (see Table 2). It is important to note that the low test accuracy observed in the CIFAR-100 experiment, Setup (2), is attributed to the simplicity of the neural network model rather than the algorithms themselves. Additionally, the results indicate that multiple

Published in Transactions on Machine Learning Research (08/2025)

MNIST (permuted labels) CIFAR100 (super groups) FEMNIST

1000 clients 500 clients 1091 clients

1 local epoch 1 local epoch 1 local epoch 5 local epochs Local 25.36% ( 0.013) 10.49% ( 0.95) 50.77% ( 0.053) 65.69%(0.012) Fed Avg 12.02% ( 0.022) 36.40% ( 1.31) 65.40% ( 0.017) 77.19%(0.013) Fed Per 19.86% ( 0.141) 14.80% ( 0.81) 66.05% ( 0.010) 67.72%(0.009) Fed Rep 21.30% ( 0.148) 12.18% ( 0.80) 66.10% ( 0.013) 66.29%( 0.010) p FLMF

r = 1 14.70% ( 0.083) 42.94% ( 1.22) 67.82% ( 0.134) 71.42%(0.044) r = 5 23.75% ( 0.027) 44.70% ( 1.91) 69.99% ( 0.123) 72.09%( 0.208) r = 10 34.23% ( 0.090) 45.57% ( 1.97) 72.56% ( 0.023) 72.47%( 0.010) r = 15 39.31% ( 0.042) 45.43% ( 1.23) 73.59% (0.092) 76.41%( 0.006)

Table 2: Performance of the algorithms for Setup (2), Setup (3), and Setup (4). The best accuracy is shown in boldface, and the second best is underlined.

1 3 5 7 9 11 13 15 17 19 0

MNIST CIFAR10

1 3 5 7 9 11 13 15 17 19 0.5

runtime (s)

MNIST CIFAR10

Figure 3: Sensitivity analysis of p FLMF with respect to the factorization rank r. The left plot shows prediction accuracy versus rank, while the right plot shows runtime versus rank. Results are averaged over 5 random trials. See Setup (5) for details.

local updates in p FLMF contribute positively to performance. Overall, the ﬁndings highlight the practical signiﬁcance of the proposed method.

6 Conclusions

We introduced a new p FL formulation based on low-rank matrix optimization and developed a novel p FL algorithm utilizing Burer-Monteiro factorization. We further established convergence guarantees for the proposed method: for minimizing a smooth non-convex objective, the algorithm converges to a stationary point at a rate of O(1/T) with full gradients; and O(1/

T) for the stochastic setting. Evaluations across four experimental setups highlight the practical signiﬁcance of the proposed method, especially in scenarios where personalization is essential, and standard approaches are unable to adequately capture the complexity of the underlying data distributions.

We conclude by listing some limitations and future directions. Our numerical experiments demonstrate improved performance of p FLMF with multiple local steps; however, this enhancement is not reﬂected in our theoretical convergence guarantees. Establishing stronger guarantees that reﬂect this behavior is a valuable direction for future research. Another notable limitation is that our formulation currently factorizes the entire model (decision variable), which can be computationally intensive in some cases, particularly in large-scale neural network applications. A more eﬃcient approach might be to apply the BM factorization selectively, targeting only a subset of the parameters, which could reduce overhead while maintaining its beneﬁts. Exploring such partial factorizations is a promising direction for future research

Published in Transactions on Machine Learning Research (08/2025)

communication round

communication round

KU = 1 KU = 10 KU = 100

communication round

Figure 4: Performance of p FLMF for diﬀerent numbers of U steps and ranks. See Setup (6) for details.

Acknowledgments

This work was partially conducted while Ali Dadras was at a research visit at CISPA; we sincerely appreciate their support and the stimulating research environment they provided. Alp Yurtsever and Ali Dadras were supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. The computations were enabled by the supercomputing resource Berzelius provided by National Supercomputer Centre at Linköping University and the Knut and Alice Wallenberg foundation. Alp Yurtsever and Ali Dadras further acknowledge support from the Swedish Research Council, under registration number 2023-05476.

Muhammad Ammad-Ud-Din, Elena Ivannikova, Suleiman A. Khan, Were Oyomno, Qiang Fu, Kuan Eeik Tan, and Adrian Flanagan. Federated collaborative ﬁltering for privacy-preserving personalized recommendation system. ar Xiv preprint ar Xiv:1901.09888, 2019.

Vito Walter Anelli, Yashar Deldjoo, Tommaso Di Noia, Antonio Ferrara, and Fedelucio Narducci. Usercontrolled federated matrix factorization for recommender systems. Journal of Intelligent Information Systems, 58(2):287 309, 2022.

Manoj Ghuhan Arivazhagan, Vinay Aggarwal, Aaditya Kumar Singh, and Sunav Choudhary. Federated learning with personalization layers. ar Xiv preprint ar Xiv:1912.00818, 2019.

Wenxuan Bao, Haohan Wang, Jun Wu, and Jingrui He. Optimizing the collaboration structure in cross-silo federated learning. In International Conference on Machine Learning, pp. 1718 1736. PMLR, 2023.

Srinadh Bhojanapalli, Anastasios Kyrillidis, and Sujay Sanghavi. Dropping convexity for faster semi-deﬁnite optimization. In Conference on Learning Theory, pp. 530 582. PMLR, 2016.

Samuel Burer and Renato D. C. Monteiro. A nonlinear programming algorithm for solving semideﬁnite programs via low-rank factorization. Mathematical Programming, 95(2):329 357, 2003.

Chenghao Cai, Dengfeng Ke, Yanyan Xu, and Kaile Su. Fast learning of deep neural networks via singular value decomposition. In PRICAI 2014: Trends in Artiﬁcial Intelligence, pp. 820 826. Springer, 2014.

Emmanuel Candès and Benjamin Recht. Exact matrix completion via convex optimization. Communications of the ACM, 55(6):111 119, 2012.

Yuejie Chi, Y. M. Lu, and Yuxin Chen. Nonconvex optimization meets low-rank matrix factorization: an overview. IEEE Transactions on Signal Processing, 67(20):5239 5269, 2019.

Published in Transactions on Machine Learning Research (08/2025)

Liam Collins, Hamed Hassani, Aryan Mokhtari, and Sanjay Shakkottai. Exploiting shared representations for personalized federated learning. In International Conference on Machine Learning, pp. 2089 2099. PMLR, 2021.

Yuyang Deng, Mohammad Mahdi Kamani, and Mehrdad Mahdavi. Adaptive personalized federated learning. ar Xiv preprint ar Xiv:2003.13461, 2020.

Canh T. Dinh, Nguyen H. Tran, and Josh Nguyen. Personalized federated learning with Moreau envelopes. In Advances in Neural Information Processing Systems, volume 33, pp. 21394 21405, 2020.

Donald Goldfarb and Shiqian Ma. Convergence of ﬁxed-point continuation algorithms for matrix rank minimization. Foundations of Computational Mathematics, 11(2):183 210, 2011.

Mary Gooneratne, Khe Chai Sim, Petr Zadrazil, Andreas Kabel, Françoise Beaufays, and Giovanni Motta. Low-rank gradient approximation for memory-eﬃcient on-device training of deep neural networks. In ICASSP 2020 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3017 3021. IEEE, 2020.

Filip Hanzely and Peter Richtárik. Federated learning of a mixture of global and local models. ar Xiv preprint ar Xiv:2002.05516, 2020.

Weituo Hao, Nikhil Mehta, Kevin J. Liang, Pengyu Cheng, Mostafa El-Khamy, and Lawrence Carin. WAFFLE: weight anonymized factorization for federated learning. IEEE Access, 10:49207 49218, 2022.

Jiwei Huang, Zeyu Tong, and Zihan Feng. Geographical POI recommendation for internet of things: a federated learning approach using matrix factorization. International Journal of Communication Systems, pp. e5161, 2022.

Prateek Jain, Raghu Meka, and Inderjit Dhillon. Guaranteed rank minimization via singular value projection. In Advances in Neural Information Processing Systems, volume 23, 2010.

Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In ACM Symposium on Theory of Computing, pp. 665 674. ACM, 2013.

Wonyong Jeong and Sung Ju Hwang. Factorized-FL: personalized federated learning with parameter factorization and similarity matching. In Advances in Neural Information Processing Systems, volume 35, pp. 35684 35695, 2022.

Shiva Prasad Kasiviswanathan. SGD with low-dimensional gradients with applications to private and distributed learning. In Uncertainty in Artiﬁcial Intelligence, pp. 1905 1915. PMLR, 2021.

Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. IEEE Computer, 42(8):30 37, 2009.

Anastasios Kyrillidis and Volkan Cevher. Matrix recipes for hard thresholding methods. Journal of Mathematical Imaging and Vision, 48:235 265, 2014.

Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems, 2:429 450, 2020.

Paul Pu Liang, Terrance Liu, Ziyin Liu, Nicholas B. Allen, Randy P. Auerbach, David Brent, Ruslan Salakhutdinov, and Louis-Philippe Morency. Think locally, act globally: federated learning with local and global representations. ar Xiv preprint ar Xiv:2001.01523, 2020.

Jiahao Liu, Yipeng Zhou, Di Wu, Miao Hu, Mohsen Guizani, and Quan Z. Sheng. Fed LMT: tackling system heterogeneity of federated learning via low-rank model training with theoretical guarantees. In International Conference on Machine Learning. PMLR, 2024a.

Renpu Liu, Cong Shen, and Jing Yang. Federated representation learning in the under-parameterized regime. ar Xiv preprint ar Xiv:2406.04596, 2024b.

Published in Transactions on Machine Learning Research (08/2025)

Connor Mc Laughlin and Lili Su. Personalized federated learning via feature distribution adaptation. In Advances in Neural Information Processing Systems, volume 37, pp. 77038 77059, 2024.

Brendan Mc Mahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communicationeﬃcient learning of deep networks from decentralized data. In International Conference on Artiﬁcial Intelligence and Statistics, pp. 1273 1282. PMLR, 2017.

Konstantin Mishchenko, Rustem Islamov, Eduard Gorbunov, and Samuel Horváth. Partially personalized federated learning: breaking the curse of data heterogeneity. ar Xiv preprint ar Xiv:2305.18285, 2023.

Yue Niu, Saurav Prakash, Souvik Kundu, Sunwoo Lee, and Salman Avestimehr. Overcoming resource constraints in federated learning: large models can be trained with only weak clients. Transactions on Machine Learning Research, 2023.

Jaehoon Oh, Sangmook Kim, and Se-Young Yun. Fed BABU: towards enhanced representation for federated image classiﬁcation. ar Xiv preprint ar Xiv:2106.06042, 2021.

Soumyabrata Pal, Prateek Varshney, Gagan Madan, Prateek Jain, Abhradeep Thakurta, Gaurav Aggarwal, Pradeep Shenoy, and Gaurav Srivastava. Sample-eﬃcient personalization: modeling user parameters as low rank plus sparse components. In International Conference on Artiﬁcial Intelligence and Statistics, pp. 1702 1710. PMLR, 2024.

Dohyung Park, Anastasios Kyrillidis, Srinadh Bhojanapalli, Constantine Caramanis, and Sujay Sanghavi. Provable Burer Monteiro factorization for a class of norm-constrained matrix problems. ar Xiv preprint ar Xiv:1606.01316, 2016.

Dohyung Park, Anastasios Kyrillidis, Constantine Caramanis, and Sujay Sanghavi. Non-square matrix sensing without spurious local minima via the Burer Monteiro approach. In Artiﬁcial Intelligence and Statistics, pp. 65 74. PMLR, 2017.

Dohyung Park, Anastasios Kyrillidis, Constantine Caramanis, and Sujay Sanghavi. Finding low-rank solutions via nonconvex matrix factorization, eﬃciently and provably. SIAM Journal on Imaging Sciences, 11(4): 2165 2204, 2018.

Krishna Pillutla, Kshitiz Malik, Abdel-Rahman Mohamed, Mike Rabbat, Maziar Sanjabi, and Lin Xiao. Federated learning with partial model personalization. In International Conference on Machine Learning, pp. 17716 17758. PMLR, 2022.

Francesco Pinto, Yaxi Hu, Fanny Yang, and Amartya Sanyal. PILLAR: how to make semi-private learning more eﬀective. ar Xiv preprint ar Xiv:2306.03962, 2023.

Saurav Prakash, Jin Sima, Chao Pan, Eli Chien, and Olgica Milenkovic. Federated classiﬁcation in hyperbolic spaces via secure aggregation of convex hulls. ar Xiv preprint ar Xiv:2308.06895, 2023.

Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471 501, 2010.

Felix Sattler, Klaus-Robert Müller, and Wojciech Samek. Clustered federated learning. In Neur IPS19 Workshop on Federated Learning for Data Privacy and Conﬁdentiality, pp. 1 5, 2019.

Felix Sattler, Klaus-Robert Müller, and Wojciech Samek. Clustered federated learning: model-agnostic distributed multitask optimization under privacy constraints. IEEE Transactions on Neural Networks and Learning Systems, 32(8):3710 3722, 2021.

Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 62(11):6535 6579, 2016.

Jiahao Tan and Xinpeng Wang. FL-bench: a federated learning benchmark for solving image classiﬁcation tasks. https://github.com/Karhou Tam/FL-bench, 2024.

Published in Transactions on Machine Learning Research (08/2025)

Jiahao Tan, Yipeng Zhou, Gang Liu, Jessie Hui Wang, and Shui Yu. p Fed Sim: similarity-aware model aggregation towards personalized federated learning. ar Xiv preprint ar Xiv:2305.15706, 2023.

Kiran K. Thekumparampil, Prateek Jain, Praneeth Netrapalli, and Sewoong Oh. Statistically and computationally eﬃcient linear meta-representation learning. In Advances in Neural Information Processing Systems, volume 34, pp. 18487 18500, 2021.

Linh Tran, Wei Sun, Stacy Patterson, and Ana Milanova. Privacy-preserving personalized federated prompt learning for multimodal large language models. ar Xiv preprint ar Xiv:2501.13904, 2025.

Madeleine Udell and Alex Townsend. Why are big data matrices approximately low rank? SIAM Journal on Mathematics of Data Science, 1(1):144 160, 2019.

Xiyuan Yang, Wenke Huang, and Mang Ye. Fed AS: bridging inconsistency in personalized federated learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11986 11995, 2024.

Dezhong Yao, Wanning Pan, Michael J. O Neill, Yutong Dai, Yao Wan, Hai Jin, and Lichao Sun. Fed HM: eﬃcient federated learning for heterogeneous models via low-rank factorization. ar Xiv preprint ar Xiv:2111.14655, 2021.

Da Yu, Huishuai Zhang, Wei Chen, and Tie-Yan Liu. Do not let privacy overbill utility: gradient embedding perturbation for private learning. ar Xiv preprint ar Xiv:2102.12677, 2021.

Boning Zhang, Dongzhu Liu, Osvaldo Simeone, Guanchu Wang, Dimitrios Pezaros, and Guangxu Zhu. Personalizing low-rank Bayesian neural networks via federated learning. ar Xiv preprint ar Xiv:2410.14390, 2024.

Yong Zhao, Jinyu Li, and Yifan Gong. Low-rank plus diagonal adaptation for deep neural networks. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5005 5009. IEEE, 2016.

Qinqing Zheng and John Laﬀerty. Convergence analysis for rectangular matrix completion using Burer Monteiro factorization and gradient descent. ar Xiv preprint ar Xiv:1605.07051, 2016.

Published in Transactions on Machine Learning Research (08/2025)

A Theoretical Results

A.1 Compact Notation

We assume that each client participates in the learning process independently with probability p. To model this, we deﬁne a partial participation matrix Dt for each time step t as a diagonal matrix, where each diagonal entry [Dt]i,i represents the participation status of client i at time t. Speciﬁcally,

( 1, with probability p, 0, with probability 1 p,

where each [Dt]i,i is an independent Bernoulli random variable with parameter p. This implies that for each client i, [Dt]i,i = 1 if the client participates in the training process at time t, and [Dt]i,i = 0 otherwise. We can write our algorithm in the compact form as follows:

for k = 0, . . . , K 1, do

Vt k+1 = Vt k ηv Dt VF(Ut Vt k )

Ut+1 = Ut ηu UF(Ut Vt )

Vt+1 = Vt K .

We deﬁne expectations with respect to gradient noise as EV noise[ ] and EU noise[ ], and expectation with respect to participation randomness as ESt[ ]. For participation randomness, we assume that ESt[Dt] = p I, where I is the identity matrix and p is the probability of client participation under independent sampling. We deﬁne the conditional expectation given all randomness before iteration t and local step k as

Et,k[ ] := EV noise h | randomness before (t, k), Dt i ,

where the randomness includes all prior gradient noise and participation randomness up to local step k of iteration t. Additionally, we deﬁne the conditional expectation given all randomness in the algorithm before iteration t and the ﬁnal local step K as

Et[ ] := EU noise h | randomness before (t, K), Dt i .

Finally, we use E[ ] to denote the total expectation over all sources of randomness in the algorithm, including gradient noise and client participation.

A.2 Convergence Analysis

We start with proving some useful bounds. For any t,

UF(Ut Vt ) 2

UF(Ut Vt ) 2

F + UF(Ut Vt ) UF(Ut Vt ) 2

+ 2 D UF(Ut Vt ), UF(Ut Vt ) UF(Ut Vt ) E

UF(Ut Vt ) 2

F + σ2. (10)

Published in Transactions on Machine Learning Research (08/2025)

Similar to above, we can write, for any t and k,

Dt VF(Ut Vt k ) 2

Dt VF(Ut Vt k ) 2

F + Dt VF(Ut Vt k ) Dt VF(Ut Vt k ) 2

Dt VF(Ut Vt k ), Dt VF(Ut Vt k ) Dt VF(Ut Vt k ) | {z } Et,k[ . |Dt]=0

= Dt VF(Ut Vt k ) 2

F + Dt 2 2 Et,k

VF(Ut Vt k ) VF(Ut Vt k ) 2

= Dt VF(Ut Vt k ) 2

F + Dt 2 2 σ2

where in the third line, we used the submultiplicative property of the Frobenius norm. Now we take the expectation with respect to the participation probability

Dt VF(Ut Vt k ) 2

i = ESt h Dt VF(Ut Vt k ) 2

F + Dt 2 2 σ2i

p VF(Ut Vt ) 2

F + σ2 , (11)

where we used Lemma 6, and Lemma 3.

(A) First, we use the smoothness of F(UVt ) with respect to V and write

F(Ut Vt k+1 ) F(Ut Vt k ) + D VF(Ut Vt k ), Vt k+1 Vt k E + LV

Vt k+1 Vt k 2 F

= F(Ut Vt k ) ηv D VF(Ut Vt k ), Dt VF(Ut Vt k ) E + η2 v LV

Dt VF(Ut Vt k ) 2

Taking conditional expectation, we get

ESt h Et,k h F(Ut Vt k+1 ) ii F(Ut Vt k ) ηvp VF(Ut Vt k ) 2

F + η2 v p LV

VF(Ut Vt k ) 2

F + η2 v LV

= F(Ut Vt k ) ηvp 1 ηv LV

VF(Ut Vt k ) 2

F + η2 v LV σ2

where we used (11) in the second line. We rearrange the inequality above and average over k to obtain

ηvp 1 ηv LV

VF(Ut Vt k ) 2

F(Ut Vt,0 ) ESt h Et,K h F(Ut Vt,K ) ii + η2 v LV σ2

F(Ut Vt ) ESt h Et,K h F(Ut Vt+1 ) ii + η2 v LV σ2

where, in the second line, we used Vt,0 = Vt and Vt,K = Vt+1.

(B) Now, we will use the smoothness again, but this time with respect to U:

F(Ut+1Vt k ) F(Ut Vt k ) + D UF(Ut Vt k ), Ut+1 Ut E + LU

Ut+1 Ut 2 F

F(Ut Vt k ) ηu D UF(Ut Vt k ), UF(Ut Vt ) E + η2 u LU

UF(Ut Vt ) 2

Similar to the previous case, if we take expectation with respect to the randomness in U update at iteration t, we obtain the following bound by using (10):

Et F(Ut+1Vt k ) F(Ut Vt k ) ηu D UF(Ut Vt k ), UF(Ut Vt ) E + η2 u LU

UF(Ut Vt ) 2

F + η2 u LUσ2

Published in Transactions on Machine Learning Research (08/2025)

If we split the inner product term as D UF(Ut Vt k ), UF(Ut Vt ) E = D UF(Ut Vt k ), UF(Ut Vt ) UF(Ut Vt k ) + UF(Ut Vt k ) E

= D UF(Ut Vt k ), UF(Ut Vt ) UF(Ut Vt k ) E + UF(Ut Vt k ) 2

UF(Ut Vt k ) 2

UF(Ut Vt ) UF(Ut Vt k ) 2

F + UF(Ut Vt k ) 2

where the last line follows from Young s inequality (17) with α = ηu LU. Moreover, by the smoothness assumption, we have

UF(Ut Vt ) UF(Ut Vt k ) 2 LU

Vt Vt k 2 η2 v ηu

i=0 VF(Ut Vt i )

Combining all these bounds, we get

Et h F(Ut+1Vt k ) i F(Ut Vt k ) ηu

UF(Ut Vt k ) 2

F + η2 u LU

UF(Ut Vt ) 2

i=0 VF(Ut Vt i )

F + η2 u LUσ2

Now we consider two cases k = 0 and k = K in (13).

1. For k = 0, we have

Et h F(Ut+1Vt ) i F(Ut Vt ) ηu (1 ηu LU) UF(Ut Vt ) 2

F + η2 u LUσ2

where we used Vt,0 = Vt.

2. For k = K, we have

Et F(Ut+1Vt+1) F(Ut Vt+1 ) ηu

UF(Ut Vt+1 ) 2

F + η2 u LU

UF(Ut Vt ) 2

k=0 VF(Ut Vt k )

F + η2 u LUσ2

F(Ut Vt+1 ) + η2 u LU

UF(Ut Vt ) 2

F + η2 v LU

k=0 VF(Ut Vt k )

F + η2 u LUσ2

Rearranging the terms we can write

η2 u UF(Ut Vt ) 2

k=0 VF(Ut Vt k )

F(Ut Vt+1 ) Et F(Ut+1Vt+1 ) + η2 u LUσ2

(C) We once again use smoothness with respect to V:

F(Ut+1Vt k+1 ) F(Ut+1Vt k ) + D VF(Ut+1Vt k ), Vt k+1 Vt k E + LV

Vt k+1 Vt k 2 F

= F(Ut+1Vt k ) ηv D VF(Ut+1Vt k ), Dt VF(Ut Vt k ) E + η2 v LV

Dt VF(Ut Vt k ) 2

Published in Transactions on Machine Learning Research (08/2025)

We take the conditional expectation

ESt h Et,k h F(Ut+1Vt k+1 ) ii F(Ut+1Vt k ) ηv ESt h Et,k h D VF(Ut+1Vt k ), Dt VF(Ut Vt k ) Eii

+ η2 v p LV

VF(Ut Vt k ) 2

F + η2 v LV σ2

where we used (11). Focusing again on the inner product term, we obtain

ESt h Et,k h D VF(Ut+1Vt k ), Dt VF(Ut Vt k ) Eii

= ESt h Et,k h D VF(Ut+1Vt k ) VF(Ut Vt k ), Dt VF(Ut Vt k ) Eii

= ESt h D VF(Ut+1Vt k ) VF(Ut Vt k ), Dt VF(Ut Vt k ) E + D VF(Ut Vt k ), Dt VF(Ut Vt k ) Ei

= p D VF(Ut+1Vt k ) VF(Ut Vt k ), VF(Ut Vt k ) E + p VF(Ut Vt k ) 2

VF(Ut+1Vt k ) VF(Ut Vt k ) 2

VF(Ut Vt k ) 2

F + p VF(Ut Vt k ) 2

Ut+1 Ut 2 F ηvp KLV

VF(Ut Vt k ) 2

F + p VF(Ut Vt k ) 2

2ηv K η2 u UF(Ut Vt ) 2

VF(Ut Vt k ) 2

F + p VF(Ut Vt k ) 2

where we used Young s inequality (17) in the fourth line with α = ηv KLV . Substituting back, we get

ESt h Et,k h F(Ut+1Vt k+1 ) ii F(Ut+1Vt k ) ηvp 1 ηv (K + 1)LV

VF(Ut Vt k ) 2

+ η2 u p LV

UF(Ut Vt ) 2

F + η2 v LV σ2

By summing over k and appropriately rearranging the terms in the inequality, we obtain the following:

ηvp 1 ηv (K + 1)LV

VF(Ut Vt k ) 2

F η2 u p LV

UF(Ut Vt ) 2

F(Ut+1Vt ) ESt h Et,K h F(Ut+1Vt+1 ) ii + η2 v KLV σ2

where we used Vt,K = Vt+1 and Vt,0 = Vt. We deﬁne

UF(Ut Vt ) 2

VF(Ut Vt k ) 2

We summarize the resulting inequalities in part (A) to (C) as

ηvp 1 ηv LV

F(Ut Vt ) ESt h Et,K h F(Ut Vt+1 ) ii + η2 v LV σ2

ηu 1 ηu LU U F(Ut Vt ) Et F(Ut+1Vt ) + η2 u LUσ2

η2 u U + η2 v

k=0 VF(Ut Vt k )

F(Ut Vt+1 ) Et F(Ut+1Vt+1 ) + η2 u LUσ2

ηvp 1 ηv (K + 1)LV

K V η2 u p LV

2 U F(Ut+1Vt ) ESt h Et,K h F(Ut+1Vt+1 ) ii + η2 v KLV σ2

Published in Transactions on Machine Learning Research (08/2025)

where we used the deﬁnitions (15). We rewrite four inequalities above using ηv = pηu

K and deﬁning L := max{LU, LV } as

ηp2 1 η p L

2K V F(Ut Vt ) ESt h Et,K h F(Ut Vt+1 ) ii + η2 Lσ2

η 1 ηL U F(Ut Vt ) Et F(Ut+1Vt ) + η2 Lσ2

η2Et[ U] + η2p2

k=0 VF(Ut Vt k )

F(Ut Vt+1 ) Et F(Ut+1Vt+1 ) + η2 Lσ2

K 1 ηp(K + 1)L

2K K V η2 p L

2 U F(Ut+1Vt ) ESt h Et,K h F(Ut+1Vt+1 ) ii + η2 Lσ2

Summing up the inequalities above, we get

η 1 2ηL ( U + V ) η 1 2ηL U + ηp 1 η p L

2 F(Ut Vt ) ESt h Et,K h F(Ut+1Vt+1 ) ii + η2 1 + 1

2 F(Ut Vt ) ESt h Et,K h F(Ut+1Vt+1 ) ii + 2η2Lσ2 , (16)

where we used the following inequality

k=0 VF(Ut Vt k )

1 ηp(K + 1)L

VF(Ut Vt k ) 2

VF(Ut Vt k ) 2 ηp2

1 ηp(K + 1)L

VF(Ut Vt k ) 2

2 ((1 + p)K + 1

K ) 1 K 1 X

VF(Ut Vt k ) 2

where in the second line, we used Jensen s inequality, see (18). Next, we take the expectation over all sources of randomness in the algorithm. Then, we average both sides of (16) over the iterations t, followed by dividing both sides by η (1 2ηL), yielding the following expression:

E UF(Ut Vt ) 2 + E

VF(Ut Vt k ) 2 #!

2 F(U0V0 ) F(UT VT )

ηT 1 2ηL + 2ηLσ2

2 F(U0V0 ) F

ηT 1 2ηL + 2ηLσ2

provided that L = max{LU, LV }, ηv = pηu

K , and η 1 2L. This completes the proof.

Published in Transactions on Machine Learning Research (08/2025)

A.3 Useful Inequalities

We list below a few elementary facts that we used in our analysis, included here only for completeness.

Lemma 3. For any matrices A Rd1 d2 and B Rd2 d3, the Frobenius norm of the product AB satisﬁes

AB F A 2 B F .

where . 2 is the spectral norm.

Lemma 4 (Young s inequality). Let Θ, Y Rd1 d2 and α > 0. Then, the following inequality holds:

2 Θ 2 F + 1

2α Y 2 F . (17)

Lemma 5. Let Θi Rd1 d2 for i 0, . . . , K 1. Then, the following bound holds:

i=0 Θi 2 F . (18)

Proof. This inequality follows directly from Jensen s inequality applied to the Frobenius norm.

Lemma 6. Let D be a diagonal matrix with diagonal entries that are 1 with probability p and 0 with probability 1 p, and let A be an arbitrary matrix. Then the expectation of the squared Frobenius norm of the product DA is given by ED DA 2 F = p A 2 F .

Proof. The Frobenius norm squared of DA is deﬁned as:

i,j (DA)2 ij.

Since D is diagonal, the product DA will zero out all rows of A where the corresponding diagonal entry in D is 0. Let di represent the i-th diagonal entry of D, where each di is a Bernoulli random variable with ED[di] = p.

Thus, we can express DA 2 F as:

j=1 A2 ij =

i=1 di Ai, 2 2,

where Ai, 2 2 = Pm j=1 A2 ij is the squared norm of the i-th row of A.

Now, taking the expectation, we have:

ED DA 2 F =

i=1 ED[di] Ai, 2 2 =

i=1 p Ai, 2 2.

Simplifying, we get:

ED DA 2 F = p

i=1 Ai, 2 2 = p A 2 F .

Therefore, the expectation of DA 2 F is:

ED DA 2 F = p A 2 F .

This completes the proof.

Published in Transactions on Machine Learning Research (08/2025)

Lemma 7. Let D be a diagonal n n matrix where each diagonal entry is independently 1 with probability p and 0 with probability 1 p. Then the expected value of the spectral norm D 2 is given by

E( D 2) = 1 (1 p)n 1.

Proof. Since D is diagonal, its spectral norm D 2 is the largest absolute value among its diagonal entries. Therefore, D 2 = 1 if at least one diagonal entry is 1, and D 2 = 0 only if all diagonal entries are 0.

Deﬁne X as the event that all diagonal entries are 0. The probability of this event, Pr(X), is:

Pr(X) = (1 p)n,

since each diagonal entry is 0 independently with probability 1 p.

Thus, the probability that D 2 = 1 (i.e., the event X does not occur) is:

1 Pr(X) = 1 (1 p)n.

Therefore, the expected value of D 2 is:

E( D 2) = 1 (1 (1 p)n) + 0 (1 p)n = 1 (1 p)n 1.

This completes the proof.