# cobo_collaborative_learning_via_bilevel_optimization__328b09c5.pdf

COBO: Collaborative Learning via Bilevel Optimization

Diba Hashemi EPFL diba.hashemi@epfl.ch

Tencent Inc. liam.he15@gmail.com

Martin Jaggi EPFL martin.jaggi@epfl.ch

Collaborative learning is an important tool to train multiple clients more effectively by enabling communication among clients. Identifying helpful clients, however, presents challenging and often introduces significant overhead. In this paper, we model client-selection and model-training as two interconnected optimization problems, proposing a novel bilevel optimization problem for collaborative learning. We introduce COBO, a scalable and elastic, SGD-type alternating optimization algorithm that efficiently addresses these problem with theoretical convergence guarantees. Empirically, COBO achieves superior performance, surpassing popular personalization algorithms by 9.3% in accuracy on a task with high heterogeneity, involving datasets distributed among 80 clients.2

1 Introduction

In a classic collaborative learning scenario, n clients, each with a distinct machine learning task, seek solutions that potentially outperform their individual solvers through a collective effort. Common collaborative learning frameworks generally alternate between training local models on individual datasets and synchronizing updates among collaborators. More concretely, during the computation step, client i [n] trains a d-dimensional model xi Rd to minimize its loss function, fi : Rd R. In the subsequent communication step, client i exchanges updates with clients, potentially benefiting from collaboration.

Despite the plethora of collaborative learning frameworks, the ideal approach to collaborate remains under-exploited. The FEDAVG [28, 18] algorithm learns a single global model over pooled datasets from all clients, i.e., minx Rd 1

n Pn i=1 fi(x). However, due to heterogeneous data distributions among clients, a global model may significantly underperform compared to personal models trained on local datasets for certain clients, which can discourage their participation in collaborative training [29]. DITTO addresses this issue by training personal models with a regularization term that penalizes deviations from a global model [24]. Although DITTO enables personal models to leverage the global model, it offers only a coarse-grained level of collaboration. In instances where clients data exhibit significant differences, the DITTO algorithm is constrained to facilitating collaboration at a global level, thereby neglecting the inherent heterogeneity among clients.

Clustering-based federated learning algorithms have been developed to accommodate scenarios in which clients data originate from multiple clusters [11, 39]. Nevertheless, these algorithms typically inherit the limitations associated with clustering techniques, including the need to predetermine the number of clusters, initialize cluster centers, and other such prerequisites, which can diminish their practical utility.

Corresponding author. 2The code is available at: https://github.com/epfml/Co Bo.

38th Conference on Neural Information Processing Systems (Neur IPS 2024).

In this paper, we propose a bilevel optimization framework to enhance collaborative learning by discovering better structural relationships among clients. The inner problem focuses on optimizing a binary collaborator selection variable wij {0, 1}, determined based on a gradient alignment measure for each pair of clients. In the outer problem, we train personalized models X Rn d, incorporating a penalization term that accounts for the distances between clients, as dictated by the collaboration weights established in the inner problem.

The contributions of this paper can be summarized as follows:

We model collaborative learning through a novel bilevel optimization formulation that yields more generalizable solutions by fully exploiting the inherent structure of collaboration.

We propose COBO, an SGD-type alternating optimization algorithm that efficiently solves the bilevel problem. COBO scales with the number of clients n and is elastic to the number of clients.

We prove that COBO enjoys theoretical convergence guarantees for collaborative learning with cluster structures.

Empirically, COBO surpasses popular personalized federated learning baselines in experiments involving highly heterogeneous federated learning settings and Large Language Models (LLMs).

2 Problem formulation

In this paper, we model collaborative learning as a bilevel optimization problem, where personalized models X Rd n are trained in the outer problem, and collaborative weights W Rn n are determined by the inner problem. More concretely,

min [x1,...,xn] Rd n

i=1 fi(xi) + ρ

1 i<j n w ij xi xj 2 2 (Model-Training)

where w ij arg max wij [0,1] wij

(Client-Selection)

where ρ > 0 is a hyperparameter for penalization. We break down the formulation as follows.

Outer problem: training personalized models. In the outer problem (Model-Training), client i trains its model xi by minimizing its loss function fi, along with penalizing its distances to neighboring models, e.g. xj, as weighted by w ij > 0.

Our formulation is similar to DITTO [24], but with two key differences: First, DITTO uses uniform and fixed collaboration weights and penalizes the distance between xi and a global model, whereas we penalize the distances between pairs of clients and adjust the collaboration weights during training. Consequently, when client tasks are heterogeneous such as clients drawn from clusters the performance of a global model deteriorates, and DITTO s local models cannot benefit from finegrained collaboration. In contrast, our method is able to exploit such structure and achieve better performance in diverse settings.

Inner Problem: Finding Collaborators. In the inner problem, we decompose the task of optimizing W Rn n into independent sub-problems, one for each entry of W . We relax the binary collaborator selection variable wij {0, 1} to a continuous weight wij [0, 1]. As the objective function is linear with respect to wij, and the domain is convex, optimization algorithms such as Frank-Wolfe [10, 17] or projected gradient descent can efficiently find the maximizers, which occur at 0 or 1.

It is important to note that w ij does not imply a permanent connection between clients i and j, but rather a temporary assessment based on the current states of xi and xj.

A simple inner problem with two clients is illustrated in Figure 1. The f1, f2 are their loss functions, and suppose µ1, µ2, and µ are the minimizers of f1, f2, and 1

2(f1 + f2). Suppose µ1, µ2, and µ are the minimizers of f1, f2, and 1

2(f1 + f2) respectively. The model weights at points A, B, C demonstrate three scenarios for updating W .

Model weights x A, x B, x C

Minimizers of f1 and f2

Minimizer of 1

Averaged model weight

Grads for X-update f1(x1), f2(x2)

Grads for W-update f1(

Figure 1: Diagram of the inner problem (Client-Selection) represented through a contour of 1

2(f1+f2). The blue arrows are gradients computed at middle point 1

2(x1 + x2) to determine connectivity. The red arrows represent gradients computed at local models to update model weights.

Point A: The model x A is far away from µ, i.e., x A µ >> maxi µi µ . The descent directions of the clients have a positive inner product; therefore w12 = 1. Collaboration at this stage speeds up training.

Point B: The model x B is closer to µ, i.e., x B µ µi µ . In this case, moving closer to the minimizer µ of 1

2(f1 + f2) no longer helps both clients get closer to the minimizers of their own losses µi. The inner problem yields w12 = 0 and the clients disconnect.

Point C: The models x C 1 and x C 2 are already disconnected. The gradients computed at their midpoint suggest they should remain disconnected; thus w12 = 0.

Because collaboration weights in Client-Selection are determined in a pairwise fashion, our formulation, unlike clustering-based methods [11, 39], does not require knowledge of cluster sizes, allowing clients to join and leave during collaborative training. This elasticity enables our method to be applicable in a wider range of scenarios. Remark 1 (Extensions). While Client-Selection is defined over a box constraint W [0, 1]n n, it can be easily extended to other convex domains. For example, in all-for-one type collaborative training, the weights are optimized over a simplex. The experiment on language models is deferred to Section 4.3.

2.1 Algorithm

We propose a novel stochastic gradient descent (SGD)-type alternating optimization algorithm, termed COBO, to solve the bilevel optimization problem defined by (Model-Training) and (Client-Selection). The algorithm alternates between updating the model variables X and the collaboration weights W .

In each iteration t, we first fix the model variables {xt i}n i=1 and update the collaboration weights by applying projected gradient ascent with step size γ > 0 to the inner problem (Client-Selection):

wt+1 ij = Proj[0,1]

wt ij + γ fi

xt i + xt j 2

xt i + xt j 2

i, j [n]. (1)

Next, with the updated collaboration weights {wt+1 ij } are fixed, we optimize the model variables {xi}n i=1 using the following update rule derived from the outer problem (Model-Training):

xt+1 i = xt i η

fi(xt i) + ρ

k=1 wt+1 ik xt i xt k !

where η > 0 is the step size for updating the model variables. This alternating process is repeated until convergence.

The detailed implementation of the algorithm is provided in Algorithm 1. In this implementation, the full gradients { fi}i [n] in (1) and (2) are replaced by their stochastic estimates. Additionally, collaborative weights are updated with a probability of O( 1

n), leading to an expected computation of

Algorithm 1 COBO: Collaborative Learning via Bilevel Optimization Input: Model parameters i [n] x0 i = x0 Rd; Penalization parameter ρ > 0; W 0 Rn n where w0 ij = 1, i, j [n]; Step size η, γ > 0. 1: for round t = 0, 1 . . . , T do 2: Call W t+1 Client-Selection({xt i}i [n], W t) 3: for client i = 1, . . . n do 4: Draw sample ξi Di and compute stochastic gradient gt i Rd of fi(xt i) and update

xt+1 i xt i η

k=1 wt+1 ik xt i xt k !

5: end for 6: end for 7: Output: Uniform randomly select s [T] and return {xs 0, . . . , xs n} and W s. 8: 9: procedure CLIENT-SELECTION(X, W ) 10: for each pair of clients (i, j) where i = j [n] do 11: if with a probability 1/n, then 12: Compute the average model zij = 1

2(xi + xj). 13: Compute stochastic gradient gi i and gi j for fi(zij) and fj(zij) respectively,

wij Proj[0,1] (wij + γ gi i, gi j ) . (4)

14: end if 15: end for 16: return updated selection variables W 17: end procedure

O(n) gradients. This results in an overhead comparable to that of standard decentralized learning methods [25, 19], thereby enabling client selection with minimal additional cost.

Compared to federated clustering algorithms, which require global synchronization before applying clustering oracles, the inner problem (Client-Selection) in COBO is solved in a pairwise fashion. This pairwise approach makes the algorithm non-blocking and robust to stragglers, providing greater flexibility and efficiency. Not all pairwise weights have to be computed in each iteration. In Table 1 we compare the performance of multiple edge-sampling strategies.

3 Theoretical results

In this section, we define assumptions in collaborative learning settings and show that COBO converges to stationary points. The following assumptions regarding the local optimization objective fi are commonly adopted in the literature [1, 19]:

(A1) L-smooth. For all x and y in Rd and i [n], the loss function fi has L-Lipschitz gradients, i.e. fi(x) fi(y) L x y .

(A2) Noise bound. For all x Rd and i [n], there exists σ2 > 0 such that the stochastic gradient has bounded noise Eξ h fi(x; ξ) Eξ [ fi(x; ξ)] 2i σ2 .

(A3) Global minimum. For all i [n], the loss function fi has a global lower bound f i .

The next assumption characterizes the possible relationships between clients. In the first case, when reaching the stationary point x of their joint objective fi + fj, then by (5) implies that fi(x) = fj(x) = 0 client i and j reach their own stationary points. In the second case, when client i reaches its stationary point, the gradient of j is lower bounded by a positive constant, meaning they don t share stationary points. This leads to eventual

(A4) Collaborativeness. If clients i and j are collaborative, then there exists Mij > 0 such that

fi(x) fj(x) 2 2 M 2 ij fi(x) + fj(x) 2 2 x Rd. (5)

Otherwise, there exists ζ2 ij > 0 such that

fi(x) 2 2 + fj(x) 2 2 ζ2 ij x Rd. (6)

This assumption is similar to [39, Assumptions 4,5], but we define relations for pairs of clients instead of clusters. In the next example, we use quadratics to demonstrate (A4) Example 2. Assume that there are K clusters with [n] = k [K] Ck and Ck Ck = for all k = k [K]. Consider the k-th cluster with center µk and client i Ck, the loss function is fi(x) = ai

2 x µk 2 2 where ai > 0. Then for clients i, j in the same cluster, i.e. i, j Ck

fi(x) fj(x) 2 2 = (ai aj)2 x µk 2 2 = (ai aj)2

(ai + aj)2 fi(x) + fj(x) 2 2.

The Mij = |ai aj|

ai+aj in this case. On the other hand, for i Ck and j Ck and µk = µk ,

fi(x) 2 2 + fj(x) 2 2 = a2 i x µk 2 2 + a2 j x µk 2 2 = a2 i a2 j (a2 i + a2 j)2 µk µk 2 2

where the lower bound ζ2 ij = a2 i a2 j (a2 i +a2 j)2 µk µk 2 2 > 0.

Finally, we derive a convergence theorem with the assumption that clients are drawn from clusters, as e.g. in [33, Assumption 2].

(A5) Cluster. All clients are drawn from clusters where within each cluster clients share stationary points. Theorem I. Suppose Assumption 1,2,3,4,5 hold true. Suppose that COBO solves (4) with minibatch size b. Consider clients i and j in the same cluster C of size c. Suppose that M 2 ij (0, 1

b 2 c2 2Lη(c 2)σ2 and ζ2 ik fi(x) + fk(x) 2 2 for all x and k. Let ρ

3L c and step size

fij z0 ij f ij , 1 2

The consensus distance also converges to 0, i.e.

i,j C E h xt+1 i xt+1 j 2

i 6M 2 ij ρ2c2

v u u t Lσ2

fij z0 ij f ij .

Moreover, the gradient norm is upper bounded.

i,j C E fij zt ij 2

v u u t Lσ2

fij z0 ij f ij .

This theorem suggests that clients inside the same cluster gradually reach consensus. This clusterlevel consensus model reaches stationary point of their losses by (5). Note that a larger penalization parameter ρ and smaller values of M 2 ij lead to faster convergence, which aligns with our expectations. Note that Mij in (A4) measures how well i,j collaborate. A smaller Mij leads to better consensus distance in Theorem I, with Mij = 0 leading to identical data distribution. The following corollary states the convergence of norm of client gradient of model xi.

Corollary II. Under same conditions as Theorem I, fi (xt i) 2 2 converges at a similar rate

fi xt i 2 2 4

v u u t Lσ2

fij z0 ij f ij .

4 Experiments

In this section, we present three experiments to demonstrate the practical effectiveness of COBO. In the first two experiments, we benchmark COBO in both a cross-silo federated learning setup involving 8 clients and a cross-device setup with 80 clients, using the CIFAR-100 dataset for multi-task learning [21]. In the third experiment, we train language models on subsets of the Wiki-40B dataset while learning domain weights within a simplex [12]. Compared to state-of-the-art personalized federated learning baselines, COBO obtains personalized models of higher quality and correctly identifies cluster structures. Details of the experiments, including descriptions of the architectures and the system setup, are deferred to Appendix B.

Throughout the experiments, we use popular federated learning baselines such as FEDAVG [28], Federated Clustering (abbreviated as FC) [39], DITTO [24], IFCA [11], and an oracle algorithm. The definition of the oracle baseline varies in each setup and will be discussed case by case. Note that we additionally provide clustering-based algorithms, i.e., FC and IFCA, with the actual number of clusters. Their experimental statistics reported in this section, such as accuracy and perplexity, include this advantage. In addition to previous baselines, we also compare COBO with a collaborative fine-tuning approach for large language models that leverages performance on validation data to determine the collaboration weights [38] (referred to as Validation Based in Table 2).

4.1 Cross-silo federated learning experiment with 8 clients

In this experiment, we evaluate the performance of COBO by comparing the average accuracies of local models against those of established collaborative learning baselines. Our objective is to assess how effectively COBO discerns and leverages the structure of data clusters relative to other collaborative learning algorithms.

We simulate a cross-silo multi-task environment where training a single model across all clients yields poor performance, thus highlighting the necessity for client selection. Our experimental configuration consists of 4 clusters, each containing 2 clients utilizing the Res Net-9 model [13]. To encourage collaboration within clusters, we randomly allocate half of the dataset to each client in a cluster. To differentiate between clusters, we introduce label diversity by flipping the image labels in each cluster using distinct random seeds. This process ensures that each class maintains unique labels across all clusters, effectively creating a scenario where a universally trained model would not be optimal, thereby necessitating personalized models that can cater to the specific label distribution of each cluster.

In this context, collaboration among clients within the same cluster is advantageous, as their datasets are complementary. There are two primary reasons why collaboration between different clusters may not be beneficial: (1) the dataset available to clients within each cluster is identical, negating the incentive to collaborate with clients from other clusters; and (2) the label flipping across clusters could mean that inter-cluster collaboration might actually degrade local model performance.

Given these considerations, we designate an oracle algorithm for our scenario: FEDAVG implemented separately within each cluster. This ensures that collaboration is confined to where it is most beneficial. Additionally, the oracle collaboration matrix is defined to be a block-diagonal matrix, with entries of 1 for pairs of clients within the same cluster (indicating collaboration) and entries of 0 for pairs from different clusters (indicating no collaboration). This matrix serves as a benchmark for the ideal collaboration structure in our simulated environment.

To enable the practical application of COBO, we sample pairs of clients in each iteration to update their collaboration weights. We begin by examining the impact of various sampling strategies on the performance of COBO. The primary approach involves sampling with a constant probability of O(1/n). Additionally, we observe that COBO identifies an appropriate collaboration matrix early in the training process, motivating the use of a time-step-dependent sampling rate, O(1/t). We also implement a mixed strategy: employing the constant sampling rate, O(1/n), for the initial 0.2% of iterations, followed by a switch to the time-dependent sampling rate, O(1/t), for the remainder of the training. A comparison of these strategies with the non-sampling oracle, where all pairs are updated in every iteration, is presented in Table 1. While COBO demonstrates consistent performance across all sampling strategies, achieving results close to those of the non-sampling oracle, the mixed strategy shows a slight performance advantage.

Table 1: Comparison of the average performance of COBO across different sampling strategies for updating the weights of client pairs in the collaboration matrix. All strategies demonstrate performance close to that of the non-sampling oracle. However, the mixed strategy, which combines a constant sampling rate at the start with a time-dependent rate during later training phases, shows superior performance.

Acc.(%) Loss

Constant (O(1/n)) 73.05 1.104 Time-dependent (O(1/t)) 73.18 1.226 Mixed 74.77 1.081 No Sampling (Oracle) 74.93 1.278

0.1 0.3 0.5 Fraction of dataset

3 4 5 Clusters

2 3 4 Clients per cluster

0 100 200 300 400 Iterations ( 100)

Co Bo Finetuned Fed Avg Start of fine-tuning

Oracle Ditto

Local Fed Avg

Figure 2: (2a) Average accuracy in cross-silo experiments with varying factors, including the fraction of the dataset available to clients, the number of clusters, and the number of clients per cluster. (2b) Average accuracy of personalized models for cross-silo federated learning with 8 clients. The "Oracle" denotes applying Fed Avg to the clients with the same label permutation.

To further assess performance, we trained COBO and other baseline algorithms for a total of 40,000 iterations. Figure 2b presents the accuracy diagram. We observe that COBO almost reaches the performance bound established by the Oracle. Moreover, COBO achieves a fixed accuracy of 60% in 4,500 iterations, which is 30% faster than DITTO. For better comparison, the values of accuracy and loss are reported in Table 2. Additionally, the evolution of the collaboration matrix for clustering algorithms and COBO is illustrated in Figure 3. COBO starts to identify clients with similar label permutations as early as 300 iterations and stabilizes in less than 5,000 iterations (12.5% of the training phase). IFCA always degenerates to one fully connected cluster, while FC periodically suffers from clustering mistakes even at the end of training.

Figure 2a presents the results of the cross-silo experiment under various configurations to further assess the robustness of COBO. First, we modify the fraction of the dataset allocated to each client. Intuitively, the total amount of data available to a cluster directly impacts the performance of COBO. Next, we experiment with different numbers of clusters, each containing two clients, and observe that

Figure 3: Collaboration matrices learned by Federated Clustering (FC), IFCA, and COBO at different stages of training for cross-silo experiment with 8 clients. The diagonals are masked out. The oracle matrix is a block diagonal matrix with blocks of size 2. The collaboration matrix of COBO already starts to look similar to oracle matrix within as low as 300 iterations (0.75% of the total iterations), and converges to it within 5000 iterations (12.5% of the total iterations). On the other hand, IFCA yields a fully-connected matrix while FC occasionally diverges from the achieved cluster structures (e.g., iterations 300, 5000, and 40000), even at the end of training.

the number of clusters does not significantly affect COBO s accuracy. Additionally, we investigate the effect of varying the number of clients per cluster while maintaining a fixed total of four clusters. In this setup, the dataset is partitioned among clients within each cluster, resulting in less data per client as the cluster size increases. Despite this, COBO leverages collaboration to maintain robust performance even with larger cluster sizes.

4.2 Cross-device experiment experiment with 80 clients

In this experiment, we demonstrate the performance of COBO in a challenging cross-device federated learning setting characterized by significant data heterogeneity. We create 10 clusters of varying sizes: 2 clusters consist of 6 clients each, another 2 comprise 7 clients each, and so on. Each cluster is allocated data from 10 distinct classes out of the total 100 classes available in the CIFAR-100 dataset, ensuring that the data across clusters are disjoint. Within each cluster, the data are distributed uniformly at random among the clients. We proceed to train individual Res Net-9 models [13] on each client s data for a total of 20,000 iterations. This setup allows us to observe the behavior of COBO and its ability to handle both the quantity and diversity of data across different client groups and cluster sizes.

We define the oracle algorithm and the corresponding collaboration matrix in the same manner as in Section 4.1. Note that while we manually create the clusters, inter-cluster collaboration may still be helpful in practice, and it is impossible to know the actual ground truth in this case. Consequently, we recognize that the oracle may not correspond to the optimal performance. Nevertheless, this oracle still exhibits superior performance compared to other baselines that lack prior knowledge of the data distribution among clients, as evidenced by the results presented in Table 2. The collaboration matrix and accuracy plots are deferred to Figure 5 and Figure 6 in Appendix B, respectively.

In this challenging experiment, COBO surpasses all other baselines by at least 5.7% in accuracy. This supports the conclusion that COBO scales well with the size of collaborative learning and effectively exploits collaboration weights among clients at a fine-grained level.

Table 2: Comparisons of model quality and fairness measure of personalized models for cross-silo experiment with 8 clients, and cross-device experiment with 80 clients, and the language modelling experiment with 4 clients having different languages. Federated clustering (FC) is not scalable with number of clients due to its O(n2) complexity, and therefore ignored in the cross-device fl experiment. The clustering algorithms IFCA and FC are not applicable to LLMs and there ignored. Note that Oracle is not defined in the LLMs experiment. The column Imp.(%) demonstrates the percentage of clients with improved performance compared to local training.

Cross-silo Cross-device Fine-tuning LLMs

Acc.(%) Loss Imp.(%) Acc.(%) Loss Imp.(%) Perplexity Imp.(%)

Local 64.9 0.1 1.67 - 54.9 0.1 1.40 - 41.26 0.38 - Fed Avg 18.8 0.1 2.66 0 53.9 0.1 1.79 29 64.84 0.00 0 Fine-tuning Fed Avg 70.2 0.2 1.77 0 58.9 0.1 1.88 94 46.70 0.07 0 Ditto 73.5 0.3 1.55 100 70.3 0.1 1.21 100 40.05 0.01 100 IFCA 18.6 0.1 2.75 0 45.6 0.8 2.15 4 - - FC 55.1 0.4 1.79 0 - - - - - Validation Based - - - - - - 42.90 1.68 75 COBO 74.6 0.2 1.08 100 79.6 0.4 0.97 100 39.28 0.01 100

Oracle 75.4 0.2 1.07 100 83.6 0.3 0.70 100 - -

4.3 Collaborative fine-tuning on language models

Recently, Large Language Models (LLMs) have gained significant popularity due to their ability to effectively solve challenging tasks. Their downstream performance can be further enhanced by finetuning; however, the scarcity of data often leads to inferior performance and necessitates collaboration. Therefore, we conduct an experiment with four clients, each having a pre-trained GPT-2 base model3 with 124 million parameters in total [32], and a subset of articles from the Wiki-40B dataset [12] in one of the following four languages: Catalan, Spanish, German, or Dutch. We use Lo RA for the Self-Attention and MLP layers for fine-tuning, which accounts for 0.47% of the full parameters [14].

0 25 50 75 100 Iterations

Domain Weights

Spanish German Dutch

Figure 4: Domain weights found by COBO for Catalan language. There are 4 domains in total: Catalan, Spanish, German, and Dutch. The curves are smoothed by exponential moving average.

For data-hungry tasks, such as those involving LLMs, contributions from all domains are valuable. Clustering methods fall short in this aspect due to their binary, discrete outputs, which do not capture the nuanced degrees of collaboration needed. COBO addresses this limitation by allowing for a continuous range of collaboration intensities, achieved by a simple yet effective modification to the projection domain in (1). Specifically, we employ a probability simplex, denoted as i = {wij 0, P

j wij = 1}, as the domain of the inner problem.

In Table 2, we compare the perplexity of COBO with baselines after 500 iterations, when FEDAVG converges. There are no oracle domain weights in this experiment due to the complex coherence among languages; therefore, we omit the oracle algorithm in the table. COBO achieves the best perplexity among all algorithms. In Figure 4, we demonstrate the domain weights learned for the Catalan language. Overall, Catalan assigns the highest collaboration weight to Spanish, which is reasonable considering the similarity between the two languages.

3https://github.com/karpathy/nano GPT

5 Related Work

Personalized federated learning. Personalized federated learning has received significant attention in recent years due to its potential to tailor models to individual user data while benefit from collaboration [33, 35, 36, 9, 22, 3]. There are various flavors of personalized federated learning. DITTO trains personalized models by incorporating a regularization term that penalizes the divergence from a global model [24]. Many personalization works assume that clients are drawn from clusters. For example, Marfoq et al. [27] use K-nearest neighbors (KNN) to determine collaborators. Mansour et al. [26], Ghosh et al. [11], Werner et al. [39] develop K personalized models and assign clients to clusters based on criteria such as minimum function values or gradient similarities. Additionally, Even et al. [7] provided theoretical insights by establishing lower bounds, which demonstrate that the optimal gradient filtering strategy involves clustering clients with identical optima.

Federated learning with client selection In federated learning, client selection is often performed by simultaneously minimizing task losses and collaborative weights in a single-level objective function. Zantedeschi et al. [40] minimize task losses augmented with a penalization term wij xi xj 2 2, similar to our outer problem. However, optimizing wij directly can lead to a degenerate solution (wij = 0), which necessitates an additional penalization for small wij values. Smith et al. [34] approach multi-task learning by minimizing task losses with a more sophisticated penalization term that accounts for the relationships between tasks. This formulation requires the client-selection function to be consistent with client selection, which can negatively impact performance. Apart from multi-task federated learning, a similar bilevel optimization formulation has been used by Le Bars et al. [23] to find a sparse mixing matrix while training a consensus model in the outer problem.

Bilevel optimization and alternating optimization. Bilevel optimization is a powerful tool which models a broad range of problems, such as reinforcement learning [6, 30, 16, 15, 37, 31], and linearlysolvable Markov decision process [5], meta-learning [8, 20], etc. A typical bilevel optimization problem, as the name indicates, consists of an outer and an inner optimization problem whose variables are inter-dependent. Typical bilevel optimization solvers requires hessian information which is usually expansive to acquire [8]. On the other hand, alternating optimization tools has been used be used to solve bilevel optimization problem [2, 4]. While in general there is no universal convergence guarantees for alternative optimizations, the special structure of our inner problem ensures the convergence of COBO to the stationary point.

6 Conclusions

Existing collaborative learning algorithms only allow coarse-grained collaboration, which leads to inferior performance in practice. To address this issue, we model collaborative learning as a special bilevel optimization problem where client selection is based on the optimization of a linear function of gradient alignment measure for each pair of clients. In addition, we propose an efficient SGD-type alternating optimization algorithm COBO which is scalable, elastic, and enjoy theoretical guarantees. Besides, COBO empirically outperforms popular personalized federated learning algorithms in realistic collaborative learning problems.

Limitations. In this work, we do not take privacy into consideration. The existing algorithm requires exchanging gradients between collaborators when updating weight w which may raise privacy concerns. We defer the discussion of privacy-preserving collaborative learning framework to future work.

Acknowledgement. We acknowledge funding from Swiss National Science Foundation (SNSF) grant number 200020_200342, from Huawei Cloud Intelligent Cloud Technologies Initiative, and from Google Research Collaborations.

[1] Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. Mathematical Programming, 199 (1):165 214, 2023.

[2] James C Bezdek and Richard J Hathaway. Convergence of alternating optimization. Neural, Parallel & Scientific Computations, 11(4):351 368, 2003.

[3] Aymeric Capitaine, Etienne Boursier, Antoine Scheid, Eric Moulines, Michael I Jordan, El Mahdi El-Mhamdi, and Alain Durmus. Unravelling in collaborative learning. ar Xiv preprint ar Xiv:2407.14332, 2024.

[4] Tianyi Chen, Yuejiao Sun, and Wotao Yin. Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems. Advances in Neural Information Processing Systems, 34:25294 25307, 2021.

[5] Bo Dai, Niao He, Yunpeng Pan, Byron Boots, and Le Song. Learning from conditional distributions via dual embeddings. In Aarti Singh and Xiaojin (Jerry) Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, volume 54 of Proceedings of Machine Learning Research, pages 1458 1467. PMLR, 2017. URL http://proceedings.mlr.press/v54/ dai17a.html.

[6] Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. SBEED: convergent reinforcement learning with nonlinear function approximation. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 1133 1142. PMLR, 2018. URL http://proceedings.mlr.press/v80/dai18c.html.

[7] Mathieu Even, Laurent Massoulié, and Kevin Scaman. On sample optimality in personalized collaborative and federated learning. Advances in Neural Information Processing Systems, 35: 212 225, 2022.

[8] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 1126 1135. PMLR, 2017. URL http://proceedings.mlr.press/v70/finn17a.html.

[9] Gersende Fort. Federated expectation maximization with heterogeneity mitigation and variance reduction.

[10] Marguerite Frank, Philip Wolfe, et al. An algorithm for quadratic programming. Naval research logistics quarterly, 3(1-2):95 110, 1956.

[11] Avishek Ghosh, Jichan Chung, Dong Yin, and Kannan Ramchandran. An efficient framework for clustered federated learning. Advances in Neural Information Processing Systems, 33: 19586 19597, 2020.

[12] Mandy Guo, Zihang Dai, Denny Vrandeˇci c, and Rami Al-Rfou. Wiki-40B: Multilingual language model dataset. In LREC - Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2440 2452. European Language Resources Association, May 2020. ISBN 979-10-95546-34-4. URL https://aclanthology.org/2020.lrec-1.297.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.

[14] Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lo RA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=n Ze VKee FYf9.

[15] Yifan Hu, Siqi Zhang, Xin Chen, and Niao He. Biased stochastic first-order methods for conditional stochastic optimization and applications in meta learning. In Hugo Larochelle, Marc Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien

Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, Neur IPS 2020, December 612, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/ 1cdf14d1e3699d61d237cf76ce1c2dca-Abstract.html.

[16] Feihu Huang, Junyi Li, and Shangqian Gao. Biadam: Fast adaptive bilevel optimization methods. ar Xiv preprint ar Xiv:2106.11396, 2021.

[17] Martin Jaggi. Revisiting frank-wolfe: Projection-free sparse convex optimization. In International conference on machine learning, pages 427 435. PMLR, 2013.

[18] Peter Kairouz, H Brendan Mc Mahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and trends in machine learning, 14(1 2):1 210, 2021.

[19] Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian U. Stich. A unified theory of decentralized SGD with changing topology and local updates. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 5381 5393. PMLR, 2020. URL http://proceedings.mlr.press/v119/koloskova20a.html.

[20] Guy Kornowski, Swati Padmanabhan, Kai Wang, Zhe Zhang, and Suvrit Sra. First-order methods for linearly constrained bilevel optimization. ar Xiv preprint ar Xiv:2406.12771, 2024.

[21] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[22] Viraj Kulkarni, Milind Kulkarni, and Aniruddha Pant. Survey of personalization techniques for federated learning. In 2020 Fourth World Conference on Smart Trends in Systems, Security and Sustainability (World S4), pages 794 797. IEEE, 2020.

[23] Batiste Le Bars, Aurélien Bellet, Marc Tommasi, Erick Lavoie, and Anne-Marie Kermarrec. Refined convergence and topology learning for decentralized sgd with heterogeneous data. In International Conference on Artificial Intelligence and Statistics, pages 1672 1702. PMLR, 2023.

[24] T. Li, Shengyuan Hu, Ahmad Beirami, and Virginia Smith. Fair and robust federated learning through personalization. In 38th International Conference on Machine Learning, 2021.

[25] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5330 5340, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/ f75526659f31040afeb61cb7133e4e6d-Abstract.html.

[26] Yishay Mansour, Mehryar Mohri, Jae Ro, and Ananda Theertha Suresh. Three approaches for personalization with applications to federated learning. 2021.

[27] Othmane Marfoq, Giovanni Neglia, Laetitia Kameni, and Richard Vidal. Federated multi-task learning under a mixture of distributions. In Proceedings of the 35th International Conference on Machine Learning, volume 34, 2021.

[28] Brendan Mc Mahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273 1282. PMLR, 2017.

[29] Mehryar Mohri, Gary Sivek, and Ananda Theertha Suresh. Agnostic federated learning. In International Conference on Machine Learning, pages 4615 4625. PMLR, 2019.

[30] Ofir Nachum and Bo Dai. Reinforcement learning via fenchel-rockafellar duality. Ar Xiv preprint, abs/2001.01866, 2020. URL https://arxiv.org/abs/2001.01866.

[31] Ieva Petrulionyte, Julien Mairal, and Michael Arbel. Functional bilevel optimization for machine learning. ar Xiv preprint ar Xiv:2403.20233, 2024.

[32] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.

[33] Felix Sattler, Klaus-Robert Muller, and Wojciech Samek. Clustered federated learning: Modelagnostic distributed multi-task optimization under privacy constraints. 2019.

[34] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. Federated multi-task learning. Advances in neural information processing systems, 30, 2017.

[35] Canh T Dinh, Nguyen Tran, and Josh Nguyen. Personalized federated learning with moreau envelopes. Advances in Neural Information Processing Systems, 33:21394 21405, 2020.

[36] Alysa Ziying Tan, Han Yu, Lizhen Cui, and Qiang Yang. Towards personalized federated learning. IEEE Transactions on Neural Networks and Learning Systems, 2022.

[37] Vinzenz Thoma, Barna Pásztor, Andreas Krause, Giorgia Ramponi, and Yifan Hu. Bilevel optimization with lower-level contextual mdps. In Agentic Markets Workshop at ICML 2024.

[38] Nicolas Wagner, Dongyang Fan, and Martin Jaggi. Personalized collaborative fine-tuning for on-device large language models. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=bwo3GVsg Ov.

[39] Mariel Werner, Lie He, Sai Praneeth Karimireddy, Michael Jordan, and Martin Jaggi. Provably personalized and robust federated learning. ar Xiv preprint ar Xiv:2306.08393, 2023.

[40] Valentina Zantedeschi, Aurélien Bellet, and Marc Tommasi. Fully decentralized joint learning of personalized models and collaboration graphs. In International Conference on Artificial Intelligence and Statistics, pages 864 874. PMLR, 2020.

Let zt ij := 1

2(xt i + xt j) be the average iterate of xt i and xt j and fij(x) := 1

2(fi(x) + fj(x)) be an averaged objective function. Lemma 3. Suppose (A1) holds true. Let η 1 2L. Then for i, j in the same cluster C of size c 1 c2 X

fij zt ij 2

fij zt ij E[ fij zt+1 ij ] + D1 1 c2 X

xt i xt j 2 2

k=1 | Eh[wt+1 ik ] w ik|2 xt i xt k 2 2 + Lησ2

where coefficient D1 is defined as

D1(L, c, η, b, σ2) := 3L2

4 + 3c2ρ2 + Lηρ2(c 2)2σ2

A.1 Proof of Lemma 3

Proof. Let ht i and ht j be independent and unbiased estimates of fi(zt ij), fj(zt ij) respectively. The variance of ht i has a variance of σ2

b . Let s denote Eg[ ] := Eg1,...,gn[ |zt i] and Eh[ ] := Eh1,...,hn[ ] and let E[ ] = Eh[Eg[ ]]. By the L-smoothness assumption (A1) and bounded noise assumption (A2)

Eh Eg h fij zt+1 ij i fij zt ij + D fij zt ij , Eh Eg zt+1 ij zt ij E + L

2 Eh Eg[zt+1 ij zt ij] 2 2

2 Eh Eg zt+1 ij zt ij Eh Eg[zt+1 ij zt ij] 2 2 .

The last quantity, i.e. variance of zt+1 ij zt ij , can be expanded as follows

Eh Eg zt+1 ij zt ij Eh Eg[zt+1 ij zt ij] 2 2 = Eh Eg zt+1 ij Eh Eg[zt+1 ij ] 2 2

=Eh Eg zt+1 ij Eh[zt+1 ij ] Eh Eg[zt+1 ij ] 2 2

=Eg Eh[zt+1 ij ] Eh Eg[zt+1 ij ] 2 2 + Eh Eg zt+1 ij Eh[zt+1 ij ] 2 2

=Eg Eh[zt+1 ij ] Eh Eg[zt+1 ij ] 2 2 + Eh Eg zt+1 ij Eh[zt+1 ij ] 2 2

1 2(gt i + gt j fi(xt i) fj(xt j))

2 + Eh Eg zt+1 ij Eh[zt+1 ij ] 2 2

4 + Eh Eg zt+1 ij Eh[zt+1 ij ] 2 2

| {z } =:Tij where the linear term vanishes due to the expectation of h in the second equality and the definition of zt+1 ij and (3) for the third equality. Then

E h fij zt+1 ij i fij zt ij + D fij zt ij , E zt+1 ij zt ij E + L

E[zt+1 ij zt ij] 2

Expand the inner product with equality x, y = 1

2 y 2 2 + 1

2 x y 2 2 yields

E h fij zt+1 ij i fij zt ij + η

E[zt+1 ij ] zt ij η + fij zt ij

E[zt+1 ij ] zt ij η

fij zt ij 2

E[zt+1 ij ] zt ij η

fij zt ij + η

E[zt+1 ij ] zt ij η + fij zt ij

E[zt+1 ij ] zt ij η

fij zt ij 2

where η 1 2L is applied in the last inequality. Then the upper bound of fij zt ij 2

fij zt ij 2

fij zt ij E[ fij zt+1 ij ] +

E[zt+1 ij ] zt ij η + fij zt ij

2 | {z } =:T

The T can be upper bounded by expanding zt+1 i with (2)

E[zt+1 ij ] zt ij η + fij zt ij

fij zt ij 1

2 fi(xt i) + fj(xt j) ρ

k=1 Eh[wt+1 ik ](xt i xt k) +

k=1 Eh[wt+1 jk ](xt j xt k)

3 fij zt ij 1

2 fi(xt i) + fj(xt j)

2 | {z } T1

k=1 w ik(xt i xt k) +

k=1 w jk(xt j xt k)

2 | {z } T2

k=1 (Eh[wt+1 ik ] w ik)(xt i xt k) +

k=1 (Eh[wt+1 jk ] w jk)(xt j xt k)

2 | {z } T3

Bound T1: Use L-smoothness of fi and fj. Take expectation with respect to gt i and gt j which are unbiased estimates of fi(xt i) and fj(xt j)

T1 = fij zt ij 1

2 fi(xt i) + fj(xt j)

zt ij xt i 2 2 + L2

zt ij xt j 2 2

xt i xt j 2 2.

Bound T2: Use Cauchy-Schwarz inequality and C has a cluster size of c

k=1 w ik xt i xt k 2 2 +

k=1 w jk xt j xt k 2 2

Bound T3: Use Cauchy-Schwarz inequality

k=1 | Eh[wt+1 ik ] w ik|2 xt i xt k 2 2 +

k=1 | Eh[wt+1 jk ] w jk|2 xt j xt k 2 2

Sum up fij zt ij 2

2 for all of i, j in the same cluster C for (7) yields

fij zt ij 2

fij zt ij E[ fij zt+1 ij ] + 3L2

xt i xt j 2 2

xt i xt j 2 2 + 3ncρ2 X

k=1 | Eh[wt+1 ik ] w ik|2 xt i xt k 2 2

fij zt ij fij zt+1 ij + 3L2

4 + 3c2ρ2 X

xt i xt j 2 2

k=1 | Eh[wt+1 ik ] w ik|2 xt i xt k 2 2 + c2Lησ2

Now we expand Tij as follows

Tij =Eh Eg zt+1 ij Eh[zt+1 ij ] 2 2

" η 2(gt i + gt j) + ηρ

k=1 (wt+1 ik (xt i xt k) + wt+1 jk (xt j xt k))

" η 2(gt i + gt j) + ηρ

k=1 (wt+1 ik (xt i xt k) + wt+1 jk (xt j xt k))

k=1 ((wt+1 ik Eh[wt+1 ik ])(xt i xt k) + (wt+1 jk Eh[wt+1 jk ])(xt j xt k))

Eh h wt+1 ik Eh[wt+1 ik ] 2 2

i xt i xt k 2 2 + Eh

wt+1 jk Eh[wt+1 jk ] 2

xt j xt k 2 2

where we use the independence of random variables in the last equality.

Average the above equality over i, j C yields

ij Tij =η2ρ2(c 2)

i,j Eh h wt+1 ij Eh[wt+1 ij ] 2

i xt i xt j 2 2

i,j xt i xt j 2 2.

Therefore, (8) is upper bounded with

fij zt ij 2

fij zt ij E[ fij zt+1 ij ] + 3L2

4 + 3c2ρ2 X

xt i xt j 2 2

k=1 | Eh[wt+1 ik ] w ik|2 xt i xt k 2 2 + c2Lησ2

2 η2ρ2(c 2)

i,j xt i xt j 2 2

fij zt ij E[ fij zt+1 ij ]

4 + 3c2ρ2 + Lηρ2(c 2)2σ2

xt i xt j 2 2

k=1 | Eh[wt+1 ik ] w ik|2 xt i xt k 2 2 + c2Lησ2

Lemma 4. Suppose Mij 1

3L c , b 2 c2 2Lη(c 2)σ2, and η 1 2ρc 1 2

i,j C E h xt+1 i xt+1 j 2

i (1 ηρc) 1

xt i xt j 2 2

k=1 Eh |wt+1 ik w ik|2 xt i xt k 2 2

+ 6M 2 ij ρc 1 c2 X

fij zt ij E h fij zt+1 ij i

c2 + 3ηLM 2 ij

Proof. Expand xt+1 i xt+1 j with (2)

xt+1 i xt+1 j = xt i xt j η gt i gt j ηρ

wt+1 ik (xt i xt k) wt+1 jk (xt j xt k) .

As i and j belong to the same cluster (i.e., w ij = 1), we add 2ηρ Pn k=1 w ik(xt i xt k)

xt+1 i xt+1 j =(1 2ηρc)(xt i xt j) η gt i gt j

(wt+1 ik w ik)(xt i xt k) (wt+1 jk w jk)(xt j xt k) .

Compute the norm of xt+1 i xt+1 j and choose ηρ 1 2c to use Jensen s inequality

Eg[xt+1 i xt+1 j ] 2

2 (1 2ηρc) xt i xt j 2 2

(wt+1 ik w ik)(xt i xt k) (wt+1 jk w jk)(xt j xt k)

+ 1 2ρc fi(xt i) fj(xt j)

Expand the right-hand side with Cauchy-Schwarz inequality

Eg[xt+1 i xt+1 j ] 2

2 (1 2ηρc) xt i xt j 2 2

(wt+1 ik w ik)(xt i xt k) (wt+1 jk w jk)(xt j xt k)

+ 4ηρc 1 2ρc fi(xt i) fj(xt j)

(1 2ηρc) xt i xt j 2 2 + 8ηρc

k=1 (wt+1 ik w ik)(xt i xt k)

k=1 (wt+1 jk w jk)(xt j xt k)

+ 4ηρc 1 2ρc fi(xt i) fj(xt j)

(1 2ηρc) xt i xt j 2 2 + 2nηρ

k=1 |wt+1 ik w ik|2 xt i xt k 2 2

k=1 |wt+1 jk w jk|2 xt j xt k 2 2 + η

ρc fi(xt i) fj(xt j) 2 2 | {z } =:T

The last term T can be upper bounded by adding fi zt ij fj zt ij and use L-smoothness assumption (A1) of fi and that i, j belong to the same cluster (A4)

T = fi(xt i) fi zt ij fj zt ij fj(xt j) 2 2 3 fi(xt i) fi zt ij 2 2 + 3 fi zt ij fj zt ij 2 2 + 3 fj(xt j) fj zt ij 2 2

2 xt i xt j 2 2 + 3M 2 ij fi zt ij + fj zt ij 2 2

xt i xt j 2 2 + 3M 2 ij fij(zt ij) 2

Note that E[ xt+1 i xt+1 j 2 2] = η2 E[ gt i gt j E[gt i gt j] 2 2] + Eh Eg[xt+1 i xt+1 j ] 2 2 and independence between randomness on worker i and j where expectation is take to all of the randomness until time t

By averaging for all i, j C

i,j C E h xt+1 i xt+1 j 2

i 1 2ηρc + 3ηL2

xt i xt j 2 2

k=1 Eh |wt+1 ik w ik|2 xt i xt k 2 2

ρc2σ2 + 3ηM 2 ij ρc 1 c2 X

fij(zt ij) 2

Use the previous Lemma 3 to bound P

i,j C fij(zt ij) 2

i,j C E h xt+1 i xt+1 j 2

1 2ηρc + 3ηL2

xt i xt j 2 2 + 4n2ηρ

k=1 Eh |wt+1 ik w ik|2 xt i xt k 2 2

2ρc 1 c2 + 3ηM 2 ij ρc

fij zt ij E h fij zt+1 ij i + D1 1 c2 X

xt i xt j 2 2

+ 3ηM 2 ij ρc

k=1 Eh |wt+1 ik w ik|2 xt i xt k 2 2 + Lησ2

Rearrange the terms 1 c2 X

i,j C E h xt+1 i xt+1 j 2

1 2ηρc + 3ηL2

2ρc + 3ηM 2 ij ρc D1

xt i xt j 2 2

c + 3ηM 2 ij ρc 3n2ρ2 ! 1 cn

k=1 Eh |wt+1 ik w ik|2 xt i xt k 2 2

+ 3ηM 2 ij ρc 2 η 1 c2 X

fij zt ij fij zt+1 ij + ησ2

2ρc 1 c2 + 3ηM 2 ij ρc Lησ2

We would like to achieve 1 2ηρc + 3ηL2

2ρc + 3ηM2 ij ρc D1 (1 ηρc) by

3L c , s.t. 3ηL2

letting b 2 c2 2Lη(c 2)σ2 and ρ

3L c and Mij 1

5, the following inequality hold true

3ηM 2 ij ρc

4 + 3c2ρ2 + Lηρ2(c 2)2σ2

3ηM 2 ij ρc 15

4 ρcηM 2 ij 1

Using the same requirement that ρ

3L c and Mij 1

c + 3ηM 2 ij ρc 3n2ρ2 4n2ηρ2

3ηM 2 ijn2ρ2

The upper bound of 1/c2 P

i,j C xt+1 i xt+1 j 2

2 can be simplified to

i,j C E h xt+1 i xt+1 j 2

i (1 ηρc) 1

xt i xt j 2 2

k=1 Eh |wt+1 ik w ik|2 xt i xt k 2 2

+ 6M 2 ij ρc 1 c2 X

fij zt ij E h fij zt+1 ij i

c2 + 3ηLM 2 ij

A.2 Proof of Theorem I

Proof. Given Lemma 4 and average over time t = 0 over T 1 and take expectation to all randomness throughout training

i,j C E h xt+1 i xt+1 j 2

i (1 ηρc) 1 c2T

i,j C E h xt i xt j 2 2

k=1 E |wt+1 ik w ik|2 xt i xt k 2 2

+ 6M 2 ij ρc 1 Tc2

E h fij zt ij i E h fij zt+1 ij i

c2 + 3ηLM 2 ij

Rearrange 1 c2T PT 1 t=0 P

i,j C E h xt+1 i xt+1 j 2

i,j C E h xt+1 i xt+1 j 2

k=1 E |wt ik w ik|2 xt i xt k 2 2

+ 6M 2 ij ηρ2c2 1 c2T

i,j C E h fij zt ij fij zt+1 ij i

c2 + 3ηLM 2 ij

Consider bounding |wt ik w ik|2 in two cases

Case 1: w ik = 1. Suppose Mik (0, 1), then fi(zt ik) fk(zt ik) 2 2 M 2 ij fi(zt ik) + fj(zt ik) 2 2 implies

fi(zt ik), fk(zt ik) 1 M 2 ik 2(1 + M 2 ik) fi(zt ik) 2 2 + fk(zt ik) 2 2 0.

then wt+1 ik = w ik = 1 and therefore |wt+1 ik w ik|2 = 0.

Case 2: w ik = 0. Suppose ζ2 ik fi(x) + fk(x) 2 2 for all x then

fi(zt ik) + fk(zt ik) 2 2 = fi(zt ik) 2 2 + fk(zt ik) 2 2 + 2 fi(zt ik), fk(zt ik)

ζ2 ik + 2 fi(zt ik), fk(zt ik)

which means the inner product fi(zt ij), fj(zt ij) 0 is negative, i.e., wt+1 ij = 0 = w ij. The above cases hold true for well initialized weights x0.

Then with lower bound assumption of fi and fj (A3)

i,j C E h xt+1 i xt+1 j 2

i 6M 2 ij ηρ2c2 1 c2T

i,j C E h fij zt ij fij zt+1 ij i

c2 + 3ηLM 2 ij

6M 2 ij ηρ2c2 1 c2T

i,j C E h fij z0 ij fij z T ij i

c2 + 3ηLM 2 ij

Minimize the upper bound through choosing η

fij z0 ij f ij

i,j C E h xt+1 i xt+1 j 2

i 6M 2 ij ρ2

v u u t Lσ2

fij z0 ij f ij . (9)

By the result of Lemma 3

i,j C E fij zt ij 2

fij z0 ij f ij + 4c2ρ2 1

i,j C E h xt i xt j 2 2

2c2 v u u t Lσ2

fij z0 ij f ij + 4c2ρ2 1

i,j C E h xt i xt j 2 2

2c2 + 24c2M 2 ij v u u t Lσ2

fij z0 ij f ij

3c2 v u u t Lσ2

fij z0 ij f ij .

A.3 Proof of Corollary II

Proof. By adding fi(x) + fj(x) 2 2 on both sides of (5), and replace x with zij we have

2 fi(zij) 2 2 + fj(zij) 2 2 4(1 + M 2 ij) fij(zij) 2

2 x Rd. (10)

Then using the upper bound of Mij < 1/5 from Theorem I, and average over t and i, j yields

i,j C E h fi zt ij 2 2

v u u t Lσ2

fij z0 ij f ij . (11)

By Cauchy-Schwarz inequality and L-Lipschitz smoothness, we have that

fi xt i 2 2 1

fi zt ij 2 2 + 1 c2T

fi zt ij fi xt i 2 2

fi zt ij 2 2 + L2

xt i xt j 2 2.

Applying (9) and (11) to the upper bound of the above inequality

fi xt i 2 2 1 + 1

v u u t Lσ2

fij z0 ij f ij

4 6M 2 ij ρ2c2

v u u t Lσ2

fij z0 ij f ij .

3L and Mij < 1

5 as stated in Theorem I,

fi xt i 2 2 4

v u u t Lσ2

fij z0 ij f ij .

B Experimental Details

In Section 4, we present our results on two tasks with different properties. Here, we provide the full details of our experimental setup, alongside with additional experiments.

We first describe the setup for cross-device and cross-silo experiments: we use the fix batch size of 128 for cross-device, and cross-silo experiments on CIFAR-100. We tune each method for the optimal learning rate individually: we use learning rate of 0.1 for ditto, 0.05 for Federated Clustering (FC), and 0.01 for all other methods. For Ditto, we use the hyper-parameter of λ = 1 as recommended in their paper. For Federated Clustering, we use the ground truth number of clusters and size of clusters as the hyper-parameter. We also use the ground truth number of clusters for IFCA, and sample all the clients in cross-silo experiment. We reduce the sampling rate to 10% for the cross-device experiment to ensure scalability and fairness for comparison to other methods. For cross-silo experiments we employed a single NVIDIA V-100 GPU with 32GB memory, and moved to four NVIDIA V-100 GPUs with 32 GB memory for cross-device experiment. With this setup, running COBO for cross-silo and cross-device experiment takes 9 hours and 28 hours respectively.

For Language modeling experiment, we conducted the experiments with the learning rate of 0.002, batch size of 50, and 4 accumulation steps. Note that each agent only get a subset of the regarding language from Wiki-40B dataset, consisting of total of 800000 tokens. We also used the context length of 512, dropout rate of 0.1, and Lo RA module with rank 4. Training is performed on a single NVIDIA A-100 GPU with 40GB memory. It takes 2.5 hours to run COBO for 500 iterations in this framework. We also use λ = 0.1 for Ditto which has higher performance for this experiment. There are 3 strategies proposed in [38] for collaborative fine-tuning, highlighting one as the most effective. However, this approach requires a portion of dataset to be shared among agents, which is incompatible with our experimental setup. Therefore, we use the second-best method in our experiments, which relies on validation performance.

Figure 5: Collaboration matrices learned by COBO at different stages of training for cross-device experiment with 80 clients. The diagonals are masked out. The oracle matrix is a block diagonal matrix, consisting of 10 blocks: two blocks of size 10, two blocks of size 9, and so on. The collaboration matrix of COBO already starts to look similar to oracle matrix within as low as 300 iterations. (1.5% of the total iterations)

For the cross-device experiment with 80 agents in Section 4.3, we present the accuracy curve in Figure 6. Our method outperform all other methods except the Oracle with a large margin. We can also observe the collaboration matrix of COBO in Figure 5. The clusters are learned with COBO efficiently.

Neur IPS Paper Checklist

Question: Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? Answer: [Yes] Justification: The theoretical results are given in Section 3 and empirical evaluations are presented in Section 4. 2. Limitations

0 25 50 75 100 125 150 175 200 Iterations ( 100)

Co Bo Oracle Ditto Finetuned Fed Avg Local Fed Avg IFCA Start of fine-tuning

Figure 6: Averaged accuracies of personalized models for cross-device federated learning with 80 clients. The "Oracle" denotes applying Fed Avg to the clients having the data from the same classes of CIFAR-100 dataset.

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: We list limitations at the end of the main text.

3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [Yes]

Justification: The assumptions are given in Section 3.

4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

Answer: [Yes]

Justification: The experiments are clearly defined, and all the necessary hyperparameters for reproducing them are presented in the Appendix.

5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes]

Justification: The code is available in supplementary materials as well as on an anonymous Github repository, and we only used publicly available datasets in our experiments.

6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: The full training setup is described in the main text. The hyper-parameters are also available both for COBO and other baselines either in the main text or appendix.

7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments? Answer: [Yes] Justification: We report proper confidence intervals for all experiments in our work. 8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We explained the training framework thoroughly in appendix. All the mentioned information are discussed and we believe it is sufficient for reproducing the experiments. 9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics https://neurips.cc/public/Ethics Guidelines? Answer: [Yes] Justification: We preserve the Neur IPS Code of Ethics throughout the research project. 10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed? Answer: [NA] Justification: Not applicable. 11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)? Answer: [NA] Justification: Note applicable. 12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected? Answer: [Yes] Justification: We have cited the datasets. 13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets? Answer: [NA] Justification: Not applicable 14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)? Answer: [NA] Justification: This paper does not involve crowdsourcing experiments. 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained? Answer: [NA] Justification: Not applicable.