INCENTIVE-AWARE FEDERATED LEARNING WITH TRAINING-TIME MODEL REWARDS
Zhaoxuan Wu, Mohammad Mohammadi Amiri, Ramesh Raskar & Bryan Kian Hsiang Low
Institute of Data Science, National University of Singapore, Republic of Singapore
Department of Computer Science, Rensselaer Polytechnic Institute, USA
Media Lab, Massachusetts Institute of Technology, USA
Department of Computer Science, National University of Singapore, Republic of Singapore
{wu.zhaoxuan,lowkh}@comp.nus.edu.sg mamiri@rpi.edu raskar@mit.edu

ABSTRACT
In federated learning (FL), incentivizing contributions of training resources (e.g., data, compute) from potentially competitive clients is crucial. Existing incentive mechanisms often distribute post-training monetary rewards, which suffer from practical challenges of timeliness and feasibility of the rewards. Rewarding the clients after the completion of training may incentivize them to abort the collaboration, and monetizing the contribution is challenging in practice. To address these problems, we propose an incentive-aware algorithm that offers differentiated training-time model rewards for each client at each FL iteration. We theoretically prove that such a local design ensures the global objective of client incentivization. Through theoretical analyses, we further identify the issue of error propagation in model rewards and thus propose a stochastic reference-model recovery strategy to ensure theoretically that all the clients eventually obtain the optimal model in the limit. We perform extensive experiments to demonstrate the superior incentivizing performance of our method compared to existing baselines.

1 INTRODUCTION
Federated learning (FL) is a popular framework that fosters collaboration among distributed clients while keeping the raw data on their local devices (McMahan et al., 2017). The clients perform local optimization on local data, while the server performs centralized parameter updates by aggregating these local model updates (Li et al., 2020a). It is essential to motivate potentially competing clients to contribute their training resources since they incur nontrivial costs for data collection (Sim et al., 2020), local computation (Sarikaya & Ercetin, 2020), and federated communication (Lim et al., 2020). Moreover, self-interested clients may refrain from contributing to the best of their abilities or drop out of the FL due to insufficient rewards (Zhan et al., 2022), which can delay model training and worsen model performance (Tu et al., 2022). To address these issues, a number of incentive mechanisms have been proposed (Tu et al., 2022; Khan et al., 2020; Zhang et al., 2021; Liu et al., 2023). Most existing incentive mechanisms distribute external resources (e.g., money) post-training (Zhan et al., 2022; Tu et al., 2022), which poses practical challenges regarding the timeliness and feasibility of the incentives. Firstly, rewarding clients only after the completion of the FL process can discourage their participation, as they usually anticipate timely rewards due to the continuous contribution of costly resources throughout the process (IMDA, 2019). It is also impractical to withhold compensation from clients until the end of the FL process, as they have the freedom to exit the collaboration in the middle of the process.
Secondly, monetary incentives are often infeasible in situations where the source of revenue is unclear, the budget is limited (Sim et al., 2020), and the contribution-to-dollar value denomination is difficult to determine (Xu et al., 2021). To tackle these challenges, it is vital to design training-time model rewards at each iteration of the FL process to achieve the overall incentivization goal. However, few works have explored this direction (Xu et al., 2021; Kong et al., 2022), and their heuristic methods overlook the important theoretical implications of local design choices on the performance of rewarded models and thus client incentivization.

A central issue is the global-to-local design: How should each client be rewarded locally in each FL iteration, given the global incentivization objective? Inspired by game-theoretic insights, we propose that the proportion of local model updates, which a client can aggregate as a local reward, should be commensurate with his contribution in each FL iteration. Therefore, clients receive different models (each aggregated using a specific proportion of local model updates) such that higher-contributing clients receive models aggregated from a larger proportion of local model updates. We prove that this local reward scheme ensures that a higher-contributing client gets a final model with a better performance guarantee. However, we observe an undesirable phenomenon of error propagation from the performance bound: Low-contributing clients can worsen every client model, as aggregating low-quality model updates can adversely influence the high-quality models (Deng et al., 2022). This phenomenon implies that achieving the incentivization goal prevents the client models from reaching optimality. Ideally, our algorithm should allow clients to eventually get the globally optimal model. How should we adjust our local reward scheme if we want every client to obtain the best model in the limit, yet without hurting their incentives? To address this challenge, we propose a reference-model recovery strategy that stochastically provides every client with the same reference model at each FL iteration. This approach mitigates the error propagation, and we further prove that all the client models asymptotically converge to the global optimum as long as their contributions do not diminish too quickly with the FL iterations. Consequently, every client is better off and is expected to eventually receive the best model while still being incentivized to contribute in finite FL iterations.

In summary, we propose an incentive-aware federated learning (IAFL) algorithm with training-time model rewards commensurate with client contributions, while being agnostic to any contribution measures (Section 4). Then, we theoretically justify our design choices through convergence analyses (Section 5). Finally, we demonstrate through extensive experiments that our method enjoys superior incentivizing performance compared to other baselines (Section 6).

2 RELATED WORKS
Mechanism design for FL incentives. Popular tools to model the behaviors of the server and clients include Stackelberg games (Khan et al., 2020; Zhan et al., 2020), auctions (Zhang et al., 2021; Cong et al., 2020) and contract theory (Liu et al., 2023; Kang et al., 2019), which all utilize post-training monetary incentive schemes to motivate client contributions (Zhan et al., 2022; Tu et al., 2022). Karimireddy et al.
(2022) suggested post-training model rewards of various utilities achieved through noisy model perturbation. However, this method may not be effective in FL since clients have already obtained the best global model during the model broadcasting of the FL process. Our paper instead directly integrates the incentive mechanism into the FL algorithm using training-time model rewards.

Heterogeneity and personalized FL. Data heterogeneity among distributed clients is a common challenge in FL (Wang et al., 2020a; Chen et al., 2022). Methods like FedProx (Li et al., 2020b) and FedNova (Wang et al., 2020a) train a single shared model that aligns clients' potentially mismatched objectives caused by heterogeneous local data distributions. Personalization in FL yields a personalized model for each client and focuses on improving performance on local test sets (Tan et al., 2022). Personalized models are achieved by personalizing layers or structures of the shared model (Liang et al., 2020; Collins et al., 2021; Li et al., 2021; Pillutla et al., 2022). In contrast, we focus on tailored client model rewards trained by optimizing a shared global objective with data heterogeneity (e.g., a client with only MNIST digits from classes 0-2 is still interested in a model that classifies all digits).

Training-time model incentives for FL. CGSV (Xu et al., 2021) and Rank (Kong et al., 2022) have explored training-time model incentives during FL. However, their heuristic approaches lack theoretical convergence analyses. Additionally, Rank impractically assumes access to a validation set to rank local model performances for aggregation. CGSV presents a contrasting perspective to ours: CGSV zeroes out partial model update parameter values for all clients, while our IAFL effectively zeroes out partial client model updates for all parameter values. Our aggregation approach is more aligned with the partial client participation setting in FL (Li et al., 2020d). CGSV's local performance guarantee cannot be generalized to the global incentivization objective for the entire FL training.

3 NOTATIONS AND BACKGROUNDS
We consider $N$ federated clients collaboratively learning a predictive model $\theta \in \mathbb{R}^d$. Client $i$ has a local dataset $D_i$ of size $B_i$ such that the grand dataset $D = \cup_{i=1}^{N} D_i$ has size $B := \sum_{i=1}^{N} B_i$. The global loss $F(\theta) = \sum_{i=1}^{N} w_i F_i(\theta)$ is a weighted sum of local loss functions $F_i(\theta) = \frac{1}{B_i} \sum_{d \in D_i} f_i(\theta, d)$, where $f_i(\cdot,\cdot)$ is an empirical loss function. Let $\theta_t$ denote the global model at FL iteration $t$. FL conducts two steps for $T$ iterations: (1) Clients download the global model $\theta^1_{i,t} = \theta_t$, where $\theta^j_{i,t}$ denotes the model at client $i$ before the $j$-th step of the local update; (2) Clients perform $\tau$ steps of local updates according to $\theta^{j+1}_{i,t} = \theta^j_{i,t} - \eta_t \nabla F_i(\theta^j_{i,t}, \xi^j_{i,t})$, where $\eta_t$ is the learning rate and $\xi^j_{i,t}$ is the local batch randomly chosen for the stochastic gradient descent step. Then, each client $i$ uploads the model update $g_{i,t} = \theta^{\tau+1}_{i,t} - \theta^1_{i,t} = -\eta_t \sum_{j=1}^{\tau} \nabla F_i(\theta^j_{i,t}, \xi^j_{i,t})$ to the server.

Game-theoretic formulation for incentivization. Instead of assuming benevolent clients as in standard FL, we consider a more practical setting where a client $i$ has the freedom to strategically decide his resource contribution $p_i \in \mathbb{R}_{\geq 0}$. Letting $p_{-i} = [p_1, p_2, \ldots, p_{i-1}, p_{i+1}, \ldots, p_N] \in \mathbb{R}^{N-1}_{\geq 0}$, we define $P_{-i} \in \mathbb{R}_{\geq 0}$ as the aggregate contribution of clients other than $i$.
We define a rewarding function $\nu(p_i, P_{-i})$ that is continuous, non-decreasing and concave in both arguments, reflecting the diminishing marginal return of contribution. It is usually observed in machine learning that the marginal increase in model accuracy diminishes with the increase in training dataset size (Wang et al., 2021; De & Chakrabarti, 2022). Furthermore, a client $i$ incurs a non-trivial constant cost $c_i > 0$ for offering some marginal contribution (Sim et al., 2020), and the cost is essential for the clients to decide an optimal amount of contribution that yields the highest utility (i.e., benefit minus cost). Practically, $c_i$ could account for the cost associated with collecting an additional data sample, computing the gradients for a larger dataset, or increasing the rate of partial federated participation.

4 INCENTIVE-AWARE FEDERATED LEARNING
Standard FL focuses on training a globally shared model and neglects the incentives for clients to join the FL procedure. We rethink the server-centric objective of FL and aim to develop a client-centric procedure to incentivize participation. The main idea is to offer training-time model rewards that are commensurate with client contributions through our proposed mechanism implementation.

4.1 INCENTIVE MECHANISM
When multiple clients are free to decide their contributions, clients' behaviors depend on the rewarding strategy formalized as a mechanism $\mathcal{M}_\nu(p) : \mathbb{R}^N_{\geq 0} \to [0,1]^N$. The rewarding mechanism $[\mathcal{M}_\nu(p)]_i = \nu(p_i, P_{-i})$ maps the contribution vector $p$ to the corresponding reward values using the rewarding function $\nu$. In standard FL, the global model is synced with all the clients at each FL iteration. Clients lose their incentives to increase $p_i$ since they receive the best aggregate model regardless of contributions. Therefore, we have $P_{-i} = h(p_{-i})$, where the aggregate function $h$ only depends on $p_{-i}$. Our idea is to make $P_{-i}$ dependent on $p_i$ by defining a new aggregate function $h(p_i, p_{-i})$.

Proposition 1. If $h$ satisfies $\frac{\mathrm{d}\, h(p_i, p_{-i})}{\mathrm{d} p_i} > 0$ and $\frac{\partial\, \nu(p_i, h(p_i, p_{-i}))}{\partial\, h(p_i, p_{-i})} > 0$, then a mechanism $[\mathcal{M}_\nu(p)]_i = \nu(p_i, h(p_i, p_{-i}))$ incentivizes a client to contribute more than that in the standard FL mechanism.

The proof is in Appendix B.1. Specifically, $\mathrm{d}\, h(p_i, p_{-i})/\mathrm{d} p_i > 0$ implies that a client $i$ should benefit more from other clients if his contribution $p_i$ is larger, and $\partial\, \nu(p_i, h(p_i, p_{-i}))/\partial\, h(p_i, p_{-i}) > 0$ indicates the feasibility of improving the reward by increasing the aggregate contribution of other clients. Proposition 1 reveals possible better designs of the mechanism that incentivize clients to contribute more resources. This result aligns with Karimireddy et al. (2022), which showed that the standard FL mechanism leads to catastrophic free-riding where only the lowest-cost client contributes.

Mechanism design. Following Proposition 1, we propose to design a function $h$ and a mechanism to reward clients based on contributions. Intuitively, we allow clients with higher contribution $p_i$ to be rewarded based on a larger proportion of the aggregate contribution of other clients $h(p_{-i})$. We first propose $[\mathcal{M}_\nu(p)]_i = \nu\!\left(p_i, (p_i/\max_j p_j)\, h(p_{-i})\right)$. However, this mechanism with a relative rewarding strategy may result in an undesirable equilibrium. It is possible that clients converge to an equilibrium where all clients give similar but little contributions. Such equilibria are undesirable as little contribution may translate to a bad model. The proof for the existence of undesirable equilibria is provided in Proposition 3 in Appendix B.2.
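For intuition, the undesirable equilibrium can be checked numerically. The sketch below is illustrative only and is not part of the paper's implementation: it fixes a concrete concave reward $g(x) = \sqrt{x}$, a common cost $c$, and $N$ identical clients (all values hypothetical), mirroring the construction used in Appendix B.2.

```python
# Numerical illustration of the undesirable equilibrium of the relative mechanism
# [M_nu(p)]_i = nu(p_i, (p_i / max_j p_j) * h(p_-i)), a sketch with assumed values.
import numpy as np

N, c = 5, 0.05                        # hypothetical number of clients and marginal cost
g = np.sqrt                           # concave, non-decreasing reward function
p_star = 1.0 / (4 * c**2)             # standalone optimum: g'(p*) = 1/(2*sqrt(p*)) = c

def utility(p_i, p_others):
    """Utility of client i under the relative mechanism: reward minus cost."""
    p_max = max(p_i, p_others.max())
    aggregate = (p_i / p_max) * p_others.sum()    # scaled share of others' contributions
    return g(p_i + aggregate) - c * p_i

others = np.full(N - 1, p_star / N)               # every other client plays p*/N
eq = utility(p_star / N, others)
for dev in [0.5, 0.9, 1.1, 2.0]:                  # unilateral deviations around p*/N
    print(f"deviate x{dev}: utility changes by {utility(dev * p_star / N, others) - eq:+.4f}")
# Every deviation prints a non-positive change: p_i = p*/N for all clients is an equilibrium,
# yet the total contribution N * (p*/N) = p* equals what one client would contribute alone.
```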
To circumvent this problem, we propose to replace Published as a conference paper at ICLR 2024 maxj pj with a pre-defined contribution ceiling pceil and a tunable sharing coefficient κ [0, 1], [Mν(p)]i = ν pi, (min {pi/pceil, 1})1 κ h(p i) . (1) Here, pceil can be viewed as a hypothetical upper limit for a client s contribution imposed by the mechanism. For example, pceil = 1 implies full client participation in all FL iterations if pi measures the rate of client participation. As for the sharing coefficient, κ determines the extent of sharing enforced by the server. A larger κ makes the rewards more equitable across clients. One limitation of this design is that the clients lose incentive for pi > pceil. The server can mitigate this by setting a high pceil upper limit (to keep incentivizing high contributors) while setting a larger κ (to even out the fractions (pi/pceil)1 κ towards 1 to motivate low contributors). To interpret, h(p i) can be seen as local model updates from other clients (i.e., other than client i) in FL and a client with higher contribution pi is rewarded a model trained with a broader range of local model updates. If an averaging aggregation strategy is adopted, the model of a higher-contributing is updated using an aggregated average from a larger proportion of local model updates. Definition 1 (Individual Rationality (IR) in FL). A mechanism Mν(p) satisfies individual rationality if [Mν(p)]i ν(pi, h(0)) for any i and p. That is, any client will receive a reward at least as good as what they can achieve on their own. Remark 1. Our mechanism (1) satisfies IR (Definition 1), an established concept for player incentivization now adapted to FL. In our FL process, a client i always utilizes the resource pi that he contributes. This essential property of our mechanism ensures the participation of all rational agents. 4.2 IMPLEMENTATION OF THE MECHANISM IN FL We present an implementation of mechanism (1) in the FL algorithm. We extend the setting in Section 4.1 to allow a client i to vary its contribution pi,t across iterations t (instead of a fixed pi). Aligning with the conditions in Proposition 1, we propose the idea that, instead of sharing the global model in each iteration, the server shares updates of varying qualities with the clients. Accordingly, a client with a higher contribution receives a higher quality model. This can be achieved by first computing a client reward rate γi,t = (min{pi,t/pceil, 1})1 κ = min{(pi,t/pceil)1 κ, 1} and then using it to vary the proportion of local model updates {gi,t}N i=1 that a client can average from in iteration t. To ensure convergence to the global optimum, we additionally introduce a stochastic recovery (with probability q) of client models using a flexible user-defined reference model θref,t trained with custom induced reward rates γ ref,t. We defer the justification and details of the stochastic recovery technique to Section 5.1. The details of the incentive-aware federated learning (IAFL) algorithm are in Algorithm 1, where [N] := {1, 2, ..., N} and denotes the ceiling. 
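As a concrete illustration of the sharing rule, the client reward rates that drive Algorithm 1 below can be computed in a few lines. The following sketch is illustrative only (NumPy, with hypothetical contribution values) and is not the paper's implementation.

```python
import numpy as np

def reward_rates(p, p_ceil, kappa):
    """Per-client reward rate gamma_i = min{(p_i / p_ceil)^(1 - kappa), 1} from mechanism (1).

    p:       array of client contributions for the current FL iteration
    p_ceil:  contribution ceiling imposed by the mechanism
    kappa:   sharing coefficient in [0, 1]; larger kappa -> more equitable rewards
    """
    return np.minimum((np.asarray(p, dtype=float) / p_ceil) ** (1.0 - kappa), 1.0)

# Illustrative contributions (e.g., dataset sizes or participation rates, rescaled).
p = np.array([1.0, 0.6, 0.2])
print(reward_rates(p, p_ceil=1.0, kappa=0.0))   # [1.0, 0.6, 0.2]   -> strongly differentiated
print(reward_rates(p, p_ceil=1.0, kappa=0.8))   # [1.0, ~0.90, ~0.72] -> more equitable sharing
```

The second call shows how a larger sharing coefficient pushes the fractions $(p_i/p_{\mathrm{ceil}})^{1-\kappa}$ towards 1, as discussed above.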
Algorithm 1: Incentive-Aware Federated Learning
1: Initialize $\theta_{i,0} = \theta_{\mathrm{ref},0} = \theta_0$ and $g_{i,0} = 0$;
2: for $t = 1$ to $T$ do
3:   Perform Procedure 2 such that $\{\Delta\theta_{i,t-1}\}_{i=1}^{N}, \theta_{\mathrm{ref},t-1} = \mathrm{ServerAgg}(t-1)$;
4:   foreach client $i$ do
5:     with probability $q$ do
6:       Server shares $\theta_{\mathrm{ref},t-1}$ with client $i$;
7:       $\theta_{i,t} = \theta_{\mathrm{ref},t-1}$;
8:     otherwise
9:       Server shares $\Delta\theta_{i,t-1}$ with client $i$;
10:      $\theta_{i,t} = \theta_{i,t-1} + \Delta\theta_{i,t-1}$;
11:    $\theta^{\tau+1}_{i,t} = \theta_{i,t} - \eta_t \sum_{j=1}^{\tau} \nabla F_i(\theta^j_{i,t}, \xi^j_{i,t})$;
12:    Upload the local update $g_{i,t} = \theta^{\tau+1}_{i,t} - \theta_{i,t}$;

Procedure 2: ServerAgg(t)
1: foreach client $i$ do
2:   $\gamma_{i,t} = \min\{(p_{i,t}/p_{\mathrm{ceil}})^{1-\kappa}, 1\}$;
3:   Sample $S_{-i,t} \subseteq \{j : j \in [N], j \neq i\}$ randomly s.t. $|S_{-i,t}| = \lceil \gamma_{i,t}(N-1) \rceil$;
4:   Include client $i$'s update: $S_{i,t} = \{i\} \cup S_{-i,t}$;
5:   $\Delta\theta_{i,t} = \frac{1}{|S_{i,t}|} \sum_{j \in S_{i,t}} g_{j,t}$;
6: Sample $S_{\mathrm{ref},t} \subseteq [N]$ randomly such that $|S_{\mathrm{ref},t}| = \lceil \gamma'_{\mathrm{ref},t} N \rceil$;
7: $\theta_{\mathrm{ref},t} = \theta_{\mathrm{ref},t-1} + \frac{1}{|S_{\mathrm{ref},t}|} \sum_{j \in S_{\mathrm{ref},t}} g_{j,t}$;
8: return $\{\Delta\theta_{i,t}\}_{i=1}^{N}, \theta_{\mathrm{ref},t}$

An intuition for the connection between the mechanism in (1) and the IAFL algorithm is provided in Appendix C. Overall, clients keep different versions of the model and update their respective models using the partially averaged model updates rewarded to them over the FL iterations. The models have varying qualities as the updates rewarded to the clients are crafted to match their contributions. This approach incentivizes clients to increase their contributions to the learning process.

Contribution measurement. Our IAFL is agnostic to contribution measures and distributes training-time model rewards that are commensurate with the client contribution $p_{i,t}$. This is beneficial because clients have the option to define a custom contribution measure collaboratively. Besides the conventional measures depending on dataset size or participation rate, it is possible to utilize external knowledge to value contributions. For example, meta attributes such as the timeliness of the data, the reliability of the client network and the closeness of clients' relationships are all possible considerations for the contribution measure. Importantly, our method also gives the flexibility to vary $p_i$ with FL iteration $t$ through $p_{i,t}$. Our IAFL is also plug-and-play with existing FL contribution evaluation methods capturing the quality of local model updates, such as FedSV (Wang et al., 2020b), ComFedSV (Fan et al., 2022), CGSV (Xu et al., 2021), FedFAIM (Shi et al., 2022), R-RCCE (Zhao et al., 2021), etc.

5 THEORETICAL ANALYSIS
This section provides theoretical justification for IAFL via convergence analysis. Our training-time local reward scheme ensures that a higher-contributing client receives a final model with a better performance guarantee. We further propose a stochastic reference-model recovery strategy to mitigate the theoretically observed error propagation problem and ensure the asymptotic convergence of client models to the global optimum as $T \to \infty$. For ease of analysis, we consider full client participation, where the results can be extended to partial client participation following Li et al. (2020d). We define an induced reward rate $\gamma'_{i,t} = \gamma_{i,t} - (\gamma_{i,t} - 1)/N$, which represents the fraction of the local model updates used by client $i$ at FL iteration $t$.

5.1 PERFORMANCE GUARANTEE
Formalized in Assumptions 1-4 (Appendix D), we assume that the loss functions $F_i(\cdot)$ are $L$-smooth and $\mu$-strongly convex. The average expected variance of stochastic gradients is $\Sigma^2$, and the squared $\ell_2$-norm of stochastic gradients is bounded by $G^2$.
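Before proceeding with the analysis, the sampling and aggregation logic of Algorithm 1 and Procedure 2 can be made concrete with a simplified sketch. It assumes the model is a flat parameter vector, abstracts the $\tau$ local SGD steps into a user-supplied routine, and assumes full client participation; it is an illustrative sketch, not the paper's released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def server_agg(g, gamma, gamma_ref, theta_ref):
    """Procedure 2 (sketch): per-client partial aggregates and the reference-model update.

    g:         (N, d) array of the latest client model updates g_{i,t}
    gamma:     length-N array of client reward rates gamma_{i,t}
    gamma_ref: induced reference reward rate gamma'_{ref,t}
    theta_ref: current reference model parameters
    """
    N, d = g.shape
    delta = np.zeros((N, d))
    for i in range(N):
        others = [j for j in range(N) if j != i]
        k = int(np.ceil(gamma[i] * (N - 1)))
        S_i = [i] + list(rng.choice(others, size=k, replace=False))  # client i always included
        delta[i] = g[S_i].mean(axis=0)                               # partial average of updates
    S_ref = rng.choice(N, size=int(np.ceil(gamma_ref * N)), replace=False)
    theta_ref = theta_ref + g[S_ref].mean(axis=0)                    # reference model update
    return delta, theta_ref

def iafl_round(theta, theta_ref, g_prev, gamma, gamma_ref, q, local_update):
    """One FL iteration of Algorithm 1 for all clients (full participation, sketch)."""
    delta, theta_ref = server_agg(g_prev, gamma, gamma_ref, theta_ref)
    g_new = np.zeros_like(g_prev)
    for i in range(len(theta)):
        if rng.random() < q:                 # stochastic recovery with probability q
            theta[i] = theta_ref.copy()
        else:                                # otherwise apply the partial aggregate
            theta[i] = theta[i] + delta[i]
        # tau local SGD steps, abstracted by a user-supplied routine returning the new model
        theta_after = local_update(i, theta[i].copy())
        g_new[i] = theta_after - theta[i]    # model update uploaded to the server
    return theta, theta_ref, g_new

# Toy usage: 4 clients, 3-dimensional model, dummy local step standing in for tau SGD steps.
N, d = 4, 3
theta = [np.zeros(d) for _ in range(N)]
theta_ref, g_prev = np.zeros(d), np.zeros((N, d))
gamma = np.array([1.0, 0.8, 0.5, 0.2])
dummy_step = lambda i, w: w - 0.1 * (w - (i + 1))
for t in range(3):
    theta, theta_ref, g_prev = iafl_round(theta, theta_ref, g_prev, gamma, 1.0,
                                          q=0.1, local_update=dummy_step)
```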
We let each client $i$ have the equal weight $w_i = 1/N$ in the global objective $F$ and further use $\Gamma = F^* - \sum_{i=1}^{N} w_i F_i^*$ to denote the extent of data heterogeneity, where $F^*$ and $F_i^*$ are the optimum values of $F$ and $F_i$, respectively.

Theorem 1 (Performance bound). With full client participation and a decreasing learning rate $\eta_t = \frac{1}{\mu\tau(t+\alpha)}$ where $\alpha \geq \frac{4L(\tau+1)}{\mu\tau}$, define $B = 2L\tau(2\tau+3)\Gamma + 2\tau^3 G^2 + \tau^2\Sigma^2 + (\alpha+1)\mu^2\tau^2\,\mathbb{E}[\|\theta_1 - \theta^*\|^2]$ and $H_T = (2 - 2\mu\eta_T)\tau^3 G^2$. Then the performance of the client model $\theta_{i,T}$ trained over $T$ FL iterations using IAFL is bounded by
$$\mathbb{E}[F(\theta_{i,T})] - F^* \leq \frac{L}{2\mu^2\tau^2} \cdot \frac{B + C_T}{\alpha + T}, \quad \text{where } C_T = \frac{H_T}{N} \sum_{m=1}^{N} \sum_{t=1}^{T} \left(\frac{T+\alpha}{t+\alpha}\right)^2 \left(\frac{1}{\gamma'_{m,t}} + \frac{1}{\gamma'_{i,t}} - 2\right).$$

The proof is given in Appendix D.1. Here, $B$ is constant w.r.t. $T$, and the client contributions only affect $\gamma'_{i,t}$ and thus $C_T$. To interpret, our IAFL ensures that a higher-contributing client (i.e., larger $\gamma'_{i,t}$) receives a model with a better performance guarantee measured in terms of the expected performance gap to the global optimum. This aligns with our goal of rewarding higher-contributing clients more and hence incentivizing client participation and contributions. However, we observe that $C_T$ grows with the number of FL iterations $T$ and causes non-convergence of the client model $\theta_{i,T}$. This undesirable behavior is caused by the lack of model synchronization, and we next propose a remedy to the problem.

Error propagation and non-convergence. The non-convergence is seen in Theorem 1 as $T \to \infty$. Asymptotically, we require $C_T/T \to 0$ for the algorithm to converge to $F^*$. Since we already have $(H_T/T)\sum_{t=1}^{T} T^2/t^2 = O(T)$, we need $1/\gamma'_{i,T} = o(1/T)$ for the convergence to hold, where the big-O and small-o notations are defined in Definition D.1 and Definition D.2. However, this cannot be achieved as $\gamma'_{i,t} \leq 1$ for all $t$. Consequently, we instead require that $\gamma'_{i,t} = 1$ for all $i \in [N], t \in [T]$, which corresponds to the situation where all the clients contribute at their maximum capacity.

Corollary 1. When all the clients contribute at their maximum capacity, i.e., $p_{i,t} = p_{\mathrm{ceil}}$ for all $i \in [N], t \in [T]$, we have $C_T = 0$ and achieve asymptotic convergence $\lim_{T\to\infty} \mathbb{E}[F(\theta_{i,T})] - F^* = 0$.

Corollary 1 states a very strong assumption that recovers the FedAvg algorithm: even when only one client fails to contribute at the maximum, all client models cannot be guaranteed to converge to the optimum, including that of the highest-contributing client. We call this phenomenon error propagation because low-quality model updates from low-contributing clients deteriorate the high-quality models.

Stochastic recovery strategy. To ensure convergence to the global optimum, we propose a stochastic recovery strategy, which shares the current reference model with a client with probability $q$. The reference model is allowed to have a custom induced reference reward rate $\gamma'_{\mathrm{ref},t}$. Specifically, the reference model is trained by aggregating the local model updates $\{g_{i,t}\}_{i \in S_{\mathrm{ref},t}}$, where $S_{\mathrm{ref},t}$ is randomly sampled from $[N]$ such that $|S_{\mathrm{ref},t}| = \lceil \gamma'_{\mathrm{ref},t} N \rceil$. Therefore, every participating client at each FL iteration has an equal chance $q$ of receiving the current reference model. With probability $1-q$, the client still receives from the server an update that is aggregated with the corresponding proportion of local model updates. Algorithm 1 outlines the implementation. As we present in the following theorem, this method prevents persistent error propagation and enables the convergence of the model.

Theorem 2 (Improved bound). Let $\eta_t = \frac{\beta}{t+\alpha}$ and let $B$, $H_T$ be defined as in Theorem 1.
With a stochastic recovery rate of $q$, the performance of the client model $\theta_{i,T}$ trained over $T$ FL iterations is bounded by
$$\mathbb{E}[F(\theta_{i,T})] - F^* \leq \frac{L}{2}\left[\prod_{t=1}^{T}(1 - 2\mu\eta_t\tau)\,\mathbb{E}\big[\|\theta_1 - \theta^*\|^2\big] + \sum_{t=1}^{T}\eta_t^2\,(Q + D_t + E_t)\prod_{l=t+1}^{T}(1 - 2\mu\eta_l\tau)\right],$$
where $Q = 2L\tau(2\tau+3)\Gamma + 2\tau^3 G^2 + \tau^2\Sigma^2$, $D_t = \frac{H_t}{2N}\sum_{m=1}^{N}\left(\frac{1}{\gamma'_{m,t}} + \frac{1}{\gamma'_{\mathrm{ref},t}} - 2\right)$, and $E_t = \frac{H_t}{2}\left(\frac{1}{\gamma'_{i,t}} + \frac{1}{\gamma'_{\mathrm{ref},t}} - 2\right)(1-q)^{T-t+1}$.

Proposition 2 (Asymptotic convergence). If $\frac{1}{\gamma'_{i,t}} = o\!\left(\frac{t^2}{\log t}\right)$ for all $i \in [N]$ and $\frac{1}{\gamma'_{\mathrm{ref},t}} = o\!\left(\frac{t^2}{\log t}\right)$, then $\lim_{T\to\infty} \mathbb{E}[F(\theta_{i,T})] - F^* = 0$.

The proofs are in Appendix D.3 and Appendix D.4. Asymptotically, all clients are expected to achieve the optimal model regardless of their contributions. To interpret the conditions for convergence, the contributions of all clients $i \in [N]$ and the induced reference reward rate must not decrease too quickly over time. The first condition is satisfied since $1/\gamma'_{i,t} \leq N$ by definition, whereas the second is user-defined. While offering the same model in the limit may seem to work against client incentives at first glance, clients who exit the collaboration early will not obtain the globally optimal model. Theorem 2 shows that the impact of the contribution $\gamma'_{i,t}$ is reflected in $E_T$, where $E_T$ decreases with $\gamma'_{i,t}$. Therefore, a client that contributes more receives a better model with a smaller performance gap to the global optimum. This improvement is more pronounced for limited FL iterations $T$ (see Section 6.2).

6 EXPERIMENTS
In this section, we show the effectiveness of IAFL and the impacts of its hyperparameters. We also show that IAFL is suited for custom contribution measure definitions and partial client participation.

Datasets & partition strategies. We use the datasets and heterogeneous data partition pipelines in the non-IID FL benchmark (Li et al., 2022). We simulate 50 clients for all experiments. We perform extensive experiments on various vision datasets, namely MNIST (LeCun et al., 1989), FMNIST (Xiao et al., 2017), SVHN (Netzer et al., 2011), CIFAR-10/100 (Krizhevsky, 2009) and Tiny-ImageNet (Deng et al., 2009), and the language datasets Stanford Sentiment Treebank (SST) (Socher et al., 2013) and Sentiment140 (Go et al., 2009). The datasets are partitioned among the clients following five partitioning strategies: (1) Distribution-based label distribution skew simulates label imbalance by sampling proportions $\pi_k$ from a Dirichlet distribution $\mathrm{Dir}(\beta)$, where $\beta$ is the concentration parameter. For each label class $k$, we sample $\pi_k$ and allocate a $\pi_{k,i}$ proportion of the label-$k$ samples to client $i$; (2) Quantity-based label distribution skew, denoted by $\#C = k$, creates another type of label imbalance by sampling $k$ label classes for each client and then randomly and equally distributing the samples of each class among the eligible clients; (3) Noise-based feature distribution skew only applies to vision datasets and adds noise with distribution $N(\sigma \cdot i/N)$ to the data of client $i$; (4) Quantity skew allocates a $\pi_i$ proportion of the total data randomly to client $i$ from a Dirichlet sample $\pi \sim \mathrm{Dir}(\beta)$; (5) Homogeneous partition, denoted by IID, partitions the samples from each class randomly and equally among the clients.

Table 1: Comparison of IPRaccu among IAFL and baselines using different dataset partitions. Each value reports the mean and the standard error of 10 independent evaluations and partition seedings.
Category Dataset Partitioning Fed Avg Finetune LG-Fed Avg CGSV Rank IAFL Label Distribution Skew MNIST Dir(0.5) 0.85 0.01 0.96 0.01 0.95 0.01 0.79 0.08 1.00 0.00 #C = 3 0.73 0.06 1.00 0.00 1.00 0.00 0.84 0.02 1.00 0.00 FMNIST Dir(0.5) 0.65 0.03 0.89 0.01 0.95 0.01 0.80 0.02 0.99 0.01 #C = 3 0.41 0.04 0.99 0.01 0.99 0.01 0.71 0.02 1.00 0.00 SVHN Dir(0.5) 0.99 0.00 1.00 0.00 0.90 0.02 0.92 0.02 1.00 0.00 #C = 3 0.35 0.05 0.97 0.01 0.56 0.09 0.46 0.02 0.82 0.02 CIFAR-10 Dir(0.5) 0.26 0.09 0.45 0.09 0.12 0.06 0.33 0.10 0.67 0.05 #C = 3 0.00 0.00 0.01 0.01 0.01 0.00 0.02 0.01 0.46 0.03 CIFAR-100 Dir(0.5) 0.37 0.05 0.96 0.02 0.00 0.00 0.83 0.02 1.00 0.00 #C = 30 0.02 0.01 0.14 0.03 0.00 0.00 0.40 0.05 0.64 0.06 SST Dir(0.5) 0.47 0.04 0.65 0.02 0.47 0.03 0.75 0.03 0.79 0.03 #C = 3 0.96 0.01 0.58 0.02 0.51 0.05 0.85 0.01 0.89 0.01 Feature Distribution Skew 0.19 0.03 0.70 0.04 0.98 0.01 0.21 0.04 0.71 0.09 FMNIST 0.66 0.03 0.16 0.05 0.02 0.01 0.68 0.03 1.00 0.00 SVHN 1.00 0.00 1.00 0.00 1.00 0.00 0.97 0.01 1.00 0.00 CIFAR-10 1.00 0.00 1.00 0.00 0.16 0.07 1.00 0.00 1.00 0.00 CIFAR-100 1.00 0.00 1.00 0.00 0.46 0.10 0.98 0.01 1.00 0.00 Quantity Skew 0.75 0.02 0.89 0.01 0.95 0.01 0.97 0.01 0.99 0.01 FMNIST 0.88 0.02 0.80 0.03 0.52 0.02 0.90 0.02 1.00 0.00 SVHN 1.00 0.00 1.00 0.00 0.99 0.01 0.63 0.06 1.00 0.00 CIFAR-10 0.91 0.01 0.96 0.01 0.65 0.02 0.54 0.04 0.96 0.00 CIFAR-100 0.98 0.01 0.99 0.00 0.68 0.01 0.68 0.04 0.98 0.01 SST 1.00 0.00 0.57 0.05 1.00 0.00 0.94 0.01 1.00 0.00 Homogeneous Partition 0.23 0.02 0.58 0.04 0.99 0.01 0.25 0.06 0.71 0.07 FMNIST 0.60 0.05 0.37 0.03 0.01 0.01 0.68 0.04 1.00 0.00 SVHN 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 CIFAR-10 1.00 0.00 1.00 0.00 0.26 0.09 0.99 0.00 1.00 0.00 CIFAR-100 1.00 0.00 1.00 0.00 0.64 0.08 1.00 0.00 1.00 0.00 SST 1.00 0.00 0.51 0.07 1.00 0.00 0.96 0.01 1.00 0.00 Number of times that performs the best 10 11 7 3 24 Baselines. We compare to a Fed Avg (Mc Mahan et al., 2017) baseline with local finetuning, a personalization method LG-Fed Avg (Liang et al., 2020) and two FL training-time model incentives methods, CGSV (Xu et al., 2021) and Rank (Kong et al., 2022). In particular, CGSV has demonstrated empirical superiority over other popular techniques such as q-FFL (Li et al., 2020c), CFFL (Lyu et al., 2020) and ECI (Song et al., 2019). We refer to Appendix E.1 for the implementation details. 6.1 INCENTIVIZATION PERFORMANCE Incentivized participation rate (IPR). Cho et al. (2022) proposed using IPR to measure the incentivization performance of an FL algorithm. We let IPRloss be the percentage of clients receiving a model not worse than his standalone model in terms of the test loss. Likewise, We define IPRaccu but in terms of the test accuracy. Here, a standalone model refers to the model trained only with the client s local dataset. IPR therefore indicates IR (Definition 1). Table 1 summarizes the IPRaccu for different methods, datasets and partitions. Notably, our IAFL achieves the highest IPRaccu 24 times out of the 29 experimental settings, out of which 17 of them have a 100% IPRaccu. Although Fed Avg Finetune and LF-Fed Avg achieve comparable performance as IAFL, they are significantly worse than IAFL under the heterogeneous learning setting with label distribution skew, which may be the most challenging regime and the regime of interest for FL. Therefore, we have demonstrated IAFL s ability to facilitate effective collaboration to achieve better learning performance for federated clients. 
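For reference, the IPR metric used above is straightforward to compute from per-client results. The following minimal sketch (with made-up accuracy values, not numbers from Table 1) shows how IPRaccu could be evaluated.

```python
import numpy as np

def ipr(rewarded, standalone):
    """Incentivized participation rate: fraction of clients whose rewarded model is at least
    as good as their standalone model. With test accuracies this is IPR_accu; passing negated
    test losses instead gives IPR_loss."""
    rewarded, standalone = np.asarray(rewarded), np.asarray(standalone)
    return float(np.mean(rewarded >= standalone))

# Illustrative numbers only (not taken from the paper's tables).
standalone_acc = np.array([0.62, 0.55, 0.48, 0.40])
rewarded_acc   = np.array([0.70, 0.66, 0.47, 0.52])
print(ipr(rewarded_acc, standalone_acc))   # 0.75: three of the four clients are better off
```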
Additionally, we attribute the imperfect IR (Remark 1) to the violation of non-decreasing and concave assumptions for the rewarding function ν in practice, which is defined as the test accuracy in this experiment. Alternatively choosing ν as the test loss, IAFL achieves perfect IR for almost all settings. We show the results for IPRloss in Table 4 of Appendix E.2. Additional results on complex large-scale datasets, Tiny-Image Net and Sent140, are in Appendix E.9. Correlation to contribution. Another way to quantify the incentivization performance is by measuring the Pearson correlation coefficient ρ between the client model accuracies achieved by the algorithm after T FL iterations and their standalone accuracies (Xu et al., 2021). We use standalone accuracies as a surrogate for the client contributions in FL, hence a good incentivizing method should produce client models with accuracies highly correlated to the respective client contributions. Table 2 presents the effectiveness of various methods under different types of heterogeneity. Most methods, including LG-Fed Avg, CGSV and Rank, could only perform well when the quality of client datasets Published as a conference paper at ICLR 2024 Table 2: The incentivization performance under different dataset partitions, measured using the Pearson correlation coefficient ρ between the final client model accuracies and standalone accuracies. Each value reports the mean and the standard error over 10 independent evaluations. Category Dataset Partitioning Fed Avg Finetune LG-Fed Avg CGSV Rank IAFL Label Distribution Skew MNIST Dir(0.5) 0.74 0.03 0.66 0.06 -0.80 0.02 0.71 0.08 0.81 0.02 #C = 3 0.27 0.06 0.12 0.03 -0.01 0.11 0.56 0.03 0.59 0.02 FMNIST Dir(0.5) 0.85 0.01 0.82 0.02 0.23 0.14 0.92 0.01 0.84 0.02 #C = 3 -0.05 0.09 0.35 0.05 0.28 0.04 0.65 0.01 0.86 0.02 SVHN Dir(0.5) 0.87 0.01 0.75 0.01 0.57 0.03 0.85 0.01 0.81 0.01 #C = 3 1.00 0.00 0.81 0.02 0.49 0.05 0.79 0.02 0.84 0.01 CIFAR-10 Dir(0.5) 0.76 0.04 0.84 0.03 0.46 0.04 0.79 0.03 0.79 0.02 #C = 3 0.83 0.01 0.88 0.01 0.31 0.05 0.56 0.06 0.63 0.02 CIFAR-100 Dir(0.5) 0.58 0.04 0.43 0.04 0.20 0.07 0.60 0.03 0.89 0.01 #C = 30 0.70 0.02 0.55 0.03 0.01 0.07 0.53 0.03 0.87 0.02 SST Dir(0.5) 0.83 0.01 0.97 0.00 0.73 0.04 0.89 0.02 0.72 0.02 #C = 3 0.86 0.01 0.93 0.01 0.53 0.09 0.84 0.01 0.66 0.02 Feature Distribution Skew 0.01 0.03 0.09 0.04 -0.02 0.04 0.03 0.04 0.39 0.09 FMNIST 0.04 0.03 0.06 0.04 0.01 0.05 0.03 0.04 0.39 0.05 SVHN 0.00 0.05 0.14 0.03 -0.01 0.07 0.13 0.05 0.41 0.07 CIFAR-10 0.08 0.05 0.08 0.04 -0.02 0.04 0.20 0.04 0.57 0.03 CIFAR-100 0.12 0.04 0.16 0.04 -0.05 0.05 0.20 0.03 0.72 0.02 Quantity Skew -0.56 0.03 0.82 0.04 0.37 0.16 0.88 0.03 0.77 0.03 FMNIST -0.37 0.03 0.90 0.01 0.86 0.04 0.94 0.01 0.78 0.02 SVHN -0.20 0.03 0.83 0.02 0.75 0.04 0.94 0.01 0.77 0.03 CIFAR-10 0.25 0.06 0.95 0.00 0.73 0.04 0.95 0.01 0.80 0.01 CIFAR-100 -0.24 0.04 0.80 0.01 0.66 0.04 0.93 0.01 0.83 0.01 SST 0.20 0.06 0.81 0.01 0.49 0.05 0.70 0.02 0.47 0.06 Homogeneous Partition -0.03 0.05 0.02 0.05 0.08 0.04 0.01 0.04 0.39 0.12 FMNIST 0.04 0.05 -0.06 0.05 0.01 0.03 0.12 0.04 0.39 0.05 SVHN 0.04 0.03 0.20 0.07 -0.04 0.04 0.11 0.08 0.40 0.06 CIFAR-10 0.00 0.05 0.08 0.04 -0.03 0.05 0.09 0.03 0.59 0.05 CIFAR-100 -0.02 0.03 0.08 0.04 0.05 0.07 0.18 0.04 0.71 0.02 SST 0.27 0.04 0.69 0.02 0.22 0.04 0.56 0.02 0.44 0.02 Number of times that performs the best 2 7 0 5 15 exhibits distinct differences under severe label and quantity skew. 
They cannot handle the more challenging scenarios under feature skew or homogeneous partition, where client data only present minor differences in standalone accuracies. In particular, Rank further assumes access to a good global validation set to rank local models during training, which could be infeasible in practice. Fed Avg Finetune fails the quantity skew data partition, showing even negative correlations. In contrast, IAFL shows superior incentivization performance across extensive settings. It can achieve relatively high ρ for all settings and perform the best for 16 out of 29 experimental settings. Therefore, IAFL has demonstrated its ability to distribute model rewards that are commensurate with client contributions. Predictive performance. We compare the average and highest client model test accuracies achieved by IAFL and the baseline methods. Presented in Table 5 of Appendix E.3 and Table 10 of Appendix E.9, our IAFL outperforms other baselines in nearly all the settings. The high average client model performance illustrates the effectiveness of collaboration following IAFL. Additionally, IAFL is able to maintain the model performance of high-contributing clients by updating their local models using data from a larger proportion of clients, which is important in a collaborative training setting. Additional results showing the average increase in test accuracies are in Table 6 of Appendix E.4. 6.2 EFFECT OF HYPERPARAMETERS ON EMPIRICAL CONVERGENCE Sharing coefficient κ. As introduced in (1), κ [0, 1] determines the extent of sharing in IAFL and reflects an essential practical trade-off between model performance and client incentivization (more in Appendix E.7). A larger κ results in more equitable and better client models, but this can hurt the incentives of high-contributing clients as they receive models similar to those received by low-contributing clients. This is reflected in Figure 1 where we plot the model performance of 6 clients with different λ i (see Section 5). A larger κ better utilizes local model updates and achieves higher accuracies. However, it results in models with more similar model performance and thus worse client incentivization measured by ρ. For convergence, Figure 1 confirms the results in Theorem 1 that less contributing clients converge to worse models, further away from the global optimum. Stochastic recovery rate q. Stochastic recovery is introduced in IAFL to ensure asymptotic convergence to the global optimum for all client models. In a stochastic manner, each client has an equal probability q to be recovered with a reference model in each FL iteration. In practice, we recommend Published as a conference paper at ICLR 2024 0 250 500 750 1000 FL iterations Test accuracy 0 250 500 750 1000 FL iterations 0 250 500 750 1000 FL iterations 1 = 1.00 2 = 0.82 3 = 0.60 4 = 0.44 5 = 0.24 6 = 0.09 Figure 1: Effects of sharing coefficient κ with a fixed stochastic recovery rate. We use CIFAR-100 with quantity skew and N = 50. 0 250 500 750 1000 FL iterations Test accuracy 0 250 500 750 1000 FL iterations =0.5, q=0.005 0 250 500 750 1000 FL iterations =0.5, q=0.01 1 = 1.00 2 = 0.82 3 = 0.60 4 = 0.44 5 = 0.24 6 = 0.09 Figure 2: Effects of stochastic recovery rate q with a fixed sharing coefficient. We use CIFAR-100 with quantity skew and N = 50. Table 3: Performance comparison to baselines when pi corresponds to the participation rate. Method ρ IPRaccu Acc. 
(highest) Fed Avg Finetune 0.06 0.28 0.12 (0.16) LG-Fed Avg -0.25 0.60 0.13 (0.15) CGSV 0.90 0.00 0.06 (0.07) Rank -0.74 0.94 0.24 (0.31) IAFL 0.94 1.00 0.44 (0.53) 0 10 20 30 40 50 Client index Test accuracy Participation rate pi Figure 3: Visualization of correlation ρ when pi corresponds to the participation rate. the reference model to be updated with the best reward (largest λ i,t) in each FL iteration, whereas another choice of the median reference model is discussed in Appendix E.5. As shown in Figure 2, stochastic recovery presents a trade-off between incentivization performance measured by ρ and the test accuracies of client models. A small q is sufficient as q = 0.01 results in higher test accuracies while maintaining a relatively high incentivization performance of ρ = 0.61 in finite FL iterations. The empirical convergence observed in Figure 2 corroborates the theoretical guarantee in Theorem 2. Free riders behavior. With differentiated training-time model rewards, IAFL can offer worsened or delayed rewards to free riders. In Figures 1 and 2, we identify client 6 with λ 6 = 0.09 as the free rider as it contributes very little compared to other clients. With q = 0, the free rider only achieves 4.6% accuracy when κ = 0 and 19.2% accuracy when the κ = 0.8. After the stochastic recovery is incorporated, we observe the free rider receives a good recovered reward at a delayed time stochastically and continues to worsen due to the poor model updates received in each FL iteration. 6.3 ALTERNATIVE CONTRIBUTION MEASURES IAFL is flexible to plug-and-play with various contribution evaluation techniques out of the box (i.e., no modification is needed). To demonstrate the effectiveness with partial client participation, we let the client participation rate be the measure. Specifically, we define pi = 0.5 (1+i/N) as illustrated in Figure 3. There is a high correlation of 0.94 between the test accuracies of client models and client contributions measured by the participation rate pi. In Table 3, IAFL also achieves the best average (highest) client model accuracy of 44% (53%) and IPRaccu across all methods. In Appendix E.8, we further show IAFL s superior effectiveness when using the established CGSV contribution measure. 7 CONCLUSION & DISCUSSION In this paper, we have introduced the IAFL, a novel algorithm that distributes training-time model rewards to incentive client contributions for FL. In particular, we propose a proportional aggregation scheme for local training-time rewards that achieves the global incentivization objective. By analyzing the theoretical guarantees of the method, we observe the error propagation phenomenon that creates tension between client incentives and achieving optimality on client models. We then mitigate the problem with a stochastic reference-model recovery strategy, which enables us to distribute globally optimal models in the limit while preserving client incentives. Our method is highly versatile and can be applied to a wider range of domains (e.g., financial, biomedical, etc.) as it allows novel definitions of the contribution measure according to the specific resources that are valuable in the domain. Our work also opens up several directions for future investigation. It is important to study the roles of the server and clients in determining the contribution measure and rewarding strategy. 
Additionally, there can be tensions and mismatches between the contribution measure and the model rewards because clients can continuously contribute, while model performance cannot grow infinitely and suffers from diminishing returns. We further elaborate on the broader societal impacts in Appendix A. Published as a conference paper at ICLR 2024 REPRODUCIBILITY STATEMENT We have carefully discussed all the assumptions for the theoretical results and included the complete proofs for all theorems, lemmas and propositions in Appendix B. To further improve the reproducibility of our work, we elaborate in Appendix E.1 the details for dataset preprocessing, models architectures, hyperparameters for training using IAFL as well as training using all the baseline methods that we compare to. The code has been submitted as supplementary material. ACKNOWLEDGEMENT This research/project is supported by the National Research Foundation Singapore and DSO National Laboratories under the AI Singapore Programme (AISG Award No: AISG2-RP-2020-018). The authors thank Fusheng Liu for many interesting discussions. Daoyuan Chen, Dawei Gao, Weirui Kuang, Yaliang Li, and Bolin Ding. p FL-bench: A comprehensive benchmark for personalized federated learning. In Proc. Neur IPS (Datasets and Benchmarks Track), pp. 9344 9360, 2022. Yae Jee Cho, Divyansh Jhunjhunwala, Tian Li, Virginia Smith, and Gauri Joshi. To federate or not to federate: Incentivizing client participation in federated learning. In Proc. Neur IPS Workshop on Federated Learning: Recent Advances and New Challenges, 2022. Liam Collins, Hamed Hassani, Aryan Mokhtari, and Sanjay Shakkottai. Exploiting shared representations for personalized federated learning. In Proc. ICML, pp. 2089 2099, 2021. Mingshu Cong, Han Yu, Xi Weng, and Siu Ming Yiu. A game-theoretic framework for incentive mechanism design in federated learning. In Federated Learning: Privacy and Incentive, pp. 205 222. Springer International Publishing, 2020. Abir De and Soumen Chakrabarti. Neural estimation of submodular functions with applications to differentiable subset selection. In Proc. Neur IPS, pp. 19537 19552, 2022. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Image Net: A large-scale hierarchical image database. In Proc. CVPR, pp. 248 255, 2009. Yongheng Deng, Feng Lyu, Ju Ren, Huaqing Wu, Yuezhi Zhou, Yaoxue Zhang, and Xuemin Shen. Auction: Automated and quality-aware client selection framework for efficient federated learning. IEEE Transactions on Parallel and Distributed Systems, 33(8):1996 2009, 2022. Zhenan Fan, Huang Fang, Zirui Zhou, Jian Pei, Michael P Friedlander, Changxin Liu, and Yong Zhang. Improving fairness for data valuation in federated learning. In Proc. ICDE, pp. 2440 2453, 2022. Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using distant supervision. CS224N project report, Stanford, 1(12):2009, 2009. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. CVPR, pp. 770 778, 2016. Kevin Hsieh, Amar Phanishayee, Onur Mutlu, and Phillip Gibbons. The non-iid data quagmire of decentralized machine learning. In Proc. ICML, pp. 4387 4398, 2020. IMDA. Guide to data valuation for data sharing. Technical report, The Infocomm Media Development Authority (IMDA) Singapore, 2019. Jiawen Kang, Zehui Xiong, Dusit Niyato, Shengli Xie, and Junshan Zhang. Incentive mechanism for reliable federated learning: A joint optimization approach to combining reputation and contract theory. 
IEEE Internet of Things Journal, 6(6):10700 10714, 2019. Published as a conference paper at ICLR 2024 Sai Praneeth Karimireddy, Wenshuo Guo, and Michael Jordan. Mechanisms that incentivize data sharing in federated learning. In Proc. Neur IPS Workshop on Federated Learning: Recent Advances and New Challenges, 2022. Latif U. Khan, Shashi Raj Pandey, Nguyen H. Tran, Walid Saad, Zhu Han, Minh N. H. Nguyen, and Choong Seon Hong. Federated learning for edge networks: Resource optimization and incentive mechanism. IEEE Communications Magazine, 58(10):88 93, 2020. Shuyu Kong, You Li, and Hai Zhou. Incentivizing federated learning. ar Xiv:2205.10951, 2022. Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Department of Computer Science, University of Toronto, 2009. Yann Le Cun, Bernhard Boser, John Denker, Donnie Henderson, R. Howard, Wayne Hubbard, and Lawrence Jackel. Handwritten digit recognition with a back-propagation network. In Proc. Neur IPS, 1989. Qinbin Li, Yiqun Diao, Quan Chen, and Bingsheng He. Federated learning on non-iid data silos: An experimental study. In Proc. ICDE, pp. 965 978, 2022. Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50 60, 2020a. Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Proc. MLSys, pp. 429 450, 2020b. Tian Li, Maziar Sanjabi, Ahmad Beirami, and Virginia Smith. Fair resource allocation in federated learning. In Proc. ICLR, 2020c. Tian Li, Shengyuan Hu, Ahmad Beirami, and Virginia Smith. Ditto: Fair and robust federated learning through personalization. In Proc. ICML, pp. 6357 6368, 2021. Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of Fed Avg on non-IID data. In Proc. ICLR, 2020d. Paul Pu Liang, Terrance Liu, Liu Ziyin, Nicholas B. Allen, Randy P. Auerbach, David Brent, Ruslan Salakhutdinov, and Louis-Philippe Morency. Think locally, act globally: Federated learning with local and global representations. ar Xiv:2001.01523, 2020. Wei Yang Bryan Lim, Nguyen Cong Luong, Dinh Thai Hoang, Yutao Jiao, Ying-Chang Liang, Qiang Yang, Dusit Niyato, and Chunyan Miao. Federated learning in mobile edge networks: A comprehensive survey. IEEE Communications Surveys & Tutorials, 22(3):2031 2063, 2020. Yuan Liu, Mengmeng Tian, Yuxin Chen, Zehui Xiong, Cyril Leung, and Chunyan Miao. A contract theory based incentive mechanism for federated learning. In Federated and Transfer Learning, pp. 117 137. Springer International Publishing, 2023. Lingjuan Lyu, Xinyi Xu, Qian Wang, and Han Yu. Collaborative fairness in federated learning. In Federated Learning: Privacy and Incentive, pp. 189 204. Springer International Publishing, 2020. Brendan Mc Mahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proc. AISTATS, pp. 1273 1282, 2017. Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In Proc. Neur IPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011. Quoc Phong Nguyen, Bryan Kian Hsiang Low, and Patrick Jaillet. Trade-off between payoff and model rewards in shapley-fair collaborative machine learning. In Proc. Neur IPS, pp. 30542 30553, 2022. 
Krishna Pillutla, Kshitiz Malik, Abdel-Rahman Mohamed, Mike Rabbat, Maziar Sanjabi, and Lin Xiao. Federated learning with partial model personalization. In Proc. ICML, pp. 17716 17758, 2022. Published as a conference paper at ICLR 2024 Yunus Sarikaya and Ozgur Ercetin. Motivating workers in federated learning: A stackelberg game perspective. IEEE Networking Letters, 2(1):23 27, 2020. Zhuan Shi, Lan Zhang, Zhenyu Yao, Lingjuan Lyu, Cen Chen, Li Wang, Junhao Wang, and Xiang Yang Li. Fedfaim: A model performance-based fair incentive mechanism for federated learning. IEEE Transactions on Big Data, pp. 1 13, 2022. Rachael Hwee Ling Sim, Yehong Zhang, Mun Choon Chan, and Bryan Kian Hsiang Low. Collaborative machine learning with incentive-aware model rewards. In Proc. ICML, pp. 8927 8936, 2020. Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christophe Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proc. EMNLP, pp. 1631 1642, 2013. Tianshu Song, Yongxin Tong, and Shuyue Wei. Profit allocation for federated learning. In IEEE International Conference on Big Data (Big Data), pp. 2577 2586, 2019. Alysa Ziying Tan, Han Yu, Lizhen Cui, and Qiang Yang. Towards personalized federated learning. IEEE Transactions on Neural Networks and Learning Systems, pp. 1 17, 2022. Xuezhen Tu, Kun Zhu, Nguyen Cong Luong, Dusit Niyato, Yang Zhang, and Juan Li. Incentive mechanisms for federated learning: From economic and game theoretic perspective. IEEE Transactions on Cognitive Communications and Networking, 8(3):1566 1593, 2022. Hui-Po Wang, Sebastian Stich, Yang He, and Mario Fritz. Progfed: effective, communication, and computation efficient federated learning by progressive training. In Proc. ICML, pp. 23034 23054, 2022. Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. In Proc. Neur IPS, pp. 7611 7623, 2020a. Tianhao Wang, Johannes Rausch, Ce Zhang, Ruoxi Jia, and Dawn Song. A principled approach to data valuation for federated learning. In Federated Learning, pp. 153 167. Springer International Publishing, 2020b. Tianhao Wang, Si Chen, and Ruoxi Jia. One-round active learning. ar Xiv:2104.11843, 2021. Yuxin Wu and Kaiming He. Group normalization. In Proc. ECCV, pp. 3 19, 2018. Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. ar Xiv:1708.07747, 2017. Xinyi Xu and Lingjuan Lyu. A reputation mechanism is all you need: Collaborative fairness and adversarial robustness in federated learning. In Proc. ICML Workshop on Federated Learning for User Privacy and Data Confidentiality, 2021. Xinyi Xu, Lingjuan Lyu, Xingjun Ma, Chenglin Miao, Chuan Sheng Foo, and Bryan Kian Hsiang Low. Gradient driven rewards to guarantee fairness in collaborative machine learning. In Proc. Neur IPS, pp. 16104 16117, 2021. Yufeng Zhan, Peng Li, Zhihao Qu, Deze Zeng, and Song Guo. A learning-based incentive mechanism for federated learning. IEEE Internet of Things Journal, 7(7):6360 6368, 2020. Yufeng Zhan, Jie Zhang, Zicong Hong, Leijie Wu, Peng Li, and Song Guo. A survey of incentive mechanism design for federated learning. IEEE Transactions on Emerging Topics in Computing, 10(2):1035 1044, 2022. Jingwen Zhang, Yuezhou Wu, and Rong Pan. Incentive mechanism for horizontal federated learning based on reputation and reverse auction. In Proc. WWW, pp. 947 956, 2021. 
Jie Zhao, Xinghua Zhu, Jianzong Wang, and Jing Xiao. Efficient client contribution evaluation for horizontal federated learning. In Proc. ICASSP, pp. 3060 3064, 2021. Xingyu Zhou. On the Fenchel duality between strong convexity and Lipschitz continuous gradient. ar Xiv:1803.06573, 2018. Published as a conference paper at ICLR 2024 A BROADER IMPACTS AND ETHICS Designed to promote client contribution among potentially competitive clients in FL, our incentiveaware federated learning algorithm with training-time model reward offers versatile applications in various fields such as finance, healthcare, the Internet of things, etc. While our proposed method is capable of shaping the pipeline of collaboration between the server and clients in distributed learning, we also see potential societal impacts associated with the development of such incentive methods. Firstly, the use of highly customizable contribution measures presents both advantages and challenges. While it is technically feasible to employ any contribution measure, ethical considerations arise when certain choices of the contribution measure do not align well with the training-time model reward. For instance, the rewards based on model performance (e.g., using the test performance as a utility measure), cannot infinitely improve and should exhibit diminishing returns with respect to the contribution of learning resources. This raises questions about the ethical implications of incentivizing client contributions in these later stages of model training, especially when the marginal utility is minimal. Is it ethical to continue incentivizing client contributions at later FL iterations (e.g., near convergence), knowing that the marginal utility of contribution is small? Is it reasonable to reward a client less at a later FL iteration, even though the client contributes more in this FL iteration as compared to some earlier iterations? These concerns need to be addressed with specific application scenarios, taking into account the characteristics of the server and clients. In order to mitigate these ethical concerns, it may be necessary to enforce regulations on the protocol to ensure the adoption of an accumulative contribution measure. Secondly, the ethical considerations regarding the distribution of rewards require closer examination. In this paper, we adopt the perspective of collaborative fairness (Xu & Lyu, 2021), where our incentive-aware algorithm encourages client contributions by providing rewards commensurate with their level of contribution. However, some may argue that this approach violates equitable fairness (Li et al., 2020c) among clients. Interestingly, our stochastic reference-model recovery with a tunable probability q ensures equitable fairness in the long run (i.e., as the number of FL iterations approaches infinity). By adjusting the value of q, a balance can be struck between collaborative fairness and equitable fairness. Smaller q values prioritize collaborative fairness, while larger q values emphasize equitable fairness. Therefore, guidelines should be established to assist the server and clients in collaboratively determining the algorithm s hyperparameters, thereby minimizing potential ethical concerns. Thirdly, reward distribution within the broader spectrum of collaborative learning may require careful ethical considerations. In this paper, we give out training-time model rewards due to the difficulties of timeliness and feasibility (discussed in Section 1) associated with post-training monetary rewards. 
We believe future investigations could potentially open up new opportunities if we can incorporate both monetary and non-monetary (e.g., model) rewards when considering client incentives in FL training. However, the challenge lies in determining the denomination between monetary incentives and model incentives both technically and ethically. Nguyen et al. (2022) offer some insights on dealing with the following scenario: How can one rectify a situation where a client s contribution does not correspond to a good model reward, but the client desires to obtain a better model by offering additional monetary payments? Addressing these interesting questions could bring us a more holistic view of incentivization and collaboration in learning. At the same time, apart from the technical perspective, more in-depth ethical studies may be necessary to regulate such denominations between monetary and non-monetary rewards for client incentivization in practice. B.1 PROOF OF PROPOSITION 1 Proposition 1. If h satisfies d h(pi,p i) dpi > 0 and ν(pi, h(pi,p i)) h(pi,p i) > 0, then a mechanism [Mν(p)]i = ν(pi, h(pi, p i)) incentivizes a client to contribute more than that in the standard FL mechanism. Published as a conference paper at ICLR 2024 Proof. The utility of client i with pi contribution can be written as ui = [Mν(p)]i cipi = ν(pi, h(pi, p i)) cipi . Thus, taking the derivative with respect to pi, we have dpi dpi + ν h(pi, p i) d h(pi, p i) pi + ν h(pi, p i) d h(pi, p i) In the standard FL mechanism, we have h(pi, p i) = h(p i) where h(p i) is independent of pi. This implies d h(pi,p i) dpi = 0. The derivative of the utility is u (F L) i = ν If the following conditions are satified: 1 ν h(pi,p i) > 0 and 2 d h(pi,p i) dpi > 0, according to (2) we will have u i > u (F L) i for any pi. To maximize the utility, a rational client i will contribute p i such that u i = 0. Since ui is in general a concave function and u i > u (F L) i for any pi with 1 ν h(pi,p i) > 0 and 2 d h(pi,p i) dpi > 0, we have p i > p (F L) i . That is, the mechanism with such a h incentivizes a client to contribute more than that in the standard FL mechanism. B.2 PROOF OF PROPOSITION 3 Proposition 3. There exists an undesirable equilibrium in the relative reward mechanism [Mν(p)]i = ν pi, pi maxj pj h(p i) . Specifically, the equilibrium elicits contributions from all client that sums to what a single client would contribute without the federation. Proof. This is a constructive proof for existence. We consider N identical clients each with a common cost ci = c, i {1, . . . , N}. Let the rewarding function ν be a composition function such that ν(pi, P i) = g(f(pi, P i)) where g is a continuous non-decreasing concave function and f(pi, P i) = pi + P i. Furthermore, we define the aggregate function as h(p i) = P j:j [N] j =i pj. Define an optimal contribution p for an individual client, such that ν (p , P i) = g (p ) = c. This is because P i = h(p i) = 0 when there is only an individual client. Now, if every client contributes p /N (i.e., p = [p /N, . . . , p /N]), we have pi maxj pj = 1 for all i and thus [Mν(p)]i = ν(p /N, h([p /N, . . . , p /N] | {z } N 1 terms = g(f(p /N, X j:j [N] j =i pj)) In this case, clients reach a bad Nash equilibrium as the total contribution from all the federated clients is the same as a single client s contribution in the non-federated environment. 
Recall that in a Nash equilibrium, each client is assumed to know the equilibrium strategies of the other participating clients, and no client can increase its own utility by deviating to another strategy. In other words, the mechanism does not incentivize clients to contribute more.

C CONNECTION BETWEEN THE MECHANISM AND THE IAFL ALGORITHM

We draw the connection between the IAFL algorithm in Algorithm 1 and the mechanism in (1) through a simplifying scenario. Consider $N$ homogeneous clients producing updates $\{g_{i,t}\}_{i=1}^N$ that follow a population with the same mean $\mu$. The update $\bar{\theta}_{i,t}$ follows the sampling distribution of the mean with $\mathrm{Var}(\bar{\theta}_{i,t}) = \sigma_t^2 / |S_{i,t}|$ where $\sigma_t^2$ is the population variance. Note that step 4 of Procedure 2 aggregates the contribution of client $i$, $\bar{\theta}_{i,t}$, with the sampled contributions of other clients. Taking the reciprocal of the variance as the quality of the training-time model update reward, then
$$\frac{1}{\mathrm{Var}(\bar{\theta}_{i,t})} = \frac{1}{\sigma_t^2}\Big(1 + \big(\min\{p_{i,t}/p_{\text{ceil}},\, 1\}\big)^{1-\kappa}(N-1)\Big) \,,$$
which matches the mechanism in (1).

D PROOF OF THE CONVERGENCE

Denote the optimal solution of $F(\theta)$ by $\theta^*$ such that $F^* = F(\theta^*)$. Similarly, we denote $F_i^* = F_i(\theta_i^*)$ where $\theta_i^*$ is the optimal solution of $F_i(\theta)$. We define the data bias as $\Gamma \triangleq F^* - \frac{1}{N}\sum_{m=1}^N F_m^*$. We set the learning rate for every client $i$ at FL iteration $t$ to a common rate $\eta_t = \eta_{i,t}, \forall i$. Recall that the $j$-th step of stochastic gradient descent at client $i$ is $\theta_{i,t}^{j+1} = \theta_{i,t}^j - \eta_t \nabla F_i(\theta_{i,t}^j, \xi_{i,t}^j)$ where $\xi_{i,t}^j$ is the local batch randomly chosen for the SGD step. Note that $\theta_{i,t}^1 = \theta_{i,t}$ is the model at client $i$ at the beginning of an FL iteration $t$. At the end of each FL iteration, a client $i$ updates its model to
$$\theta_{i,t+1} = \theta_{i,t} - \eta_t \frac{1}{|S_{i,t}|} \sum_{m \in S_{i,t}} \sum_{j=1}^{\tau} \nabla F_m(\theta_{m,t}^j, \xi_{m,t}^j)$$
where $S_{i,t}$ is defined in Algorithm 1 as the set of indices that a client $i$ can aggregate from in a given FL iteration $t$. Without loss of generality, we assume that $\gamma'_{i,t} N$ is an integer; in practice, the ceiling $\lceil \gamma'_{i,t} N \rceil$ can be used. Note that $|S_{i,t}| = \gamma'_{i,t} N$. For simplicity, we denote the update for client $i$ in iteration $t$ as
$$R_{i,t} = -\eta_t \frac{1}{|S_{i,t}|} \sum_{m \in S_{i,t}} \sum_{j=1}^{\tau} \nabla F_m(\theta_{m,t}^j, \xi_{m,t}^j) \,.$$
Since $R_{i,t}$ essentially computes the i.i.d. sample mean from the population $\{g_{m,t}\}_{m=1}^N = \{-\eta_t \sum_{j=1}^{\tau} \nabla F_m(\theta_{m,t}^j, \xi_{m,t}^j)\}_{m=1}^N$ for each FL iteration $t$, we have $\mathbb{E}[R_{i,t}] = \mathbb{E}[R_{j,t}]$ for all $i, j \in \{1, \ldots, N\}$.

For the analysis in this section, we make the following assumptions on the functions $F_1, \ldots, F_N$:

Assumption 1 (L-smoothness). The loss function $F_m$ is $L$-smooth: for all $u, v \in \mathbb{R}^d$,
$$2(F_m(u) - F_m(v)) \le 2\langle u - v, \nabla F_m(v)\rangle + L \|u - v\|^2 \,, \quad \forall m \in [N] \,.$$

Assumption 2 (µ-strong convexity). The loss functions $F_1, \ldots, F_N$ for all $N$ clients are $\mu$-strongly convex: for all $u, v \in \mathbb{R}^d$,
$$2(F_m(u) - F_m(v)) \ge 2\langle u - v, \nabla F_m(v)\rangle + \mu \|u - v\|^2 \,, \quad \forall m \in [N] \,.$$

Assumption 3 (Expected stochastic gradient variance). The variance of stochastic gradients in a client is bounded:
$$\mathbb{E}\big\|\nabla F_m(\theta_{m,t}^j, \xi_{m,t}^j) - \nabla F_m(\theta_{m,t}^j)\big\|^2 \le \sigma_m^2 \,, \quad \forall m \in [N],\, j \in [\tau],\, t \in [T] \,.$$
We further define $\bar{\Sigma}^2 = \frac{1}{N}\sum_{m=1}^N \sigma_m^2$ as the average expected stochastic gradient variance among clients.

Assumption 4 (Expected squared $\ell_2$-norm). The expected squared $\ell_2$-norm of the stochastic gradients is uniformly bounded:
$$\mathbb{E}\big\|\nabla F_m(\theta_{m,t}^j, \xi_{m,t}^j)\big\|^2 \le G^2 \,, \quad \forall m \in [N],\, j \in [\tau],\, t \in [T] \,.$$

D.1 PROOF OF THEOREM 1

Theorem 1 (Performance bound).
With full clients participation and a decreasing learning rate ηt = 1 µτ(t+α) where α 4L(τ+1) µτ , define B = 2Lτ(2τ + 3)Γ + 2τ 3G2 + τ 2 Σ2 + (α + 1)µ2τ 2E[ θ1 θ 2] and HT = (2 2µηT )τ 3G2, then the performance of the client model θi,T trained over T FL iterations using IAFL is bounded by E [F(θi,T )] F L 2µ2τ 2 B + CT where CT = HT 2 1 γ m,t + 1 Proof. Define a decreasing learning rate ηt = β t+α for some β > 1 2µτ and α 4L(τ+1) From Lemma D.1, we have the following for any client i after we train for T FL iterations, E h θi,T +1 θ 2i (1 2µηT τ)E h θi,T θ 2i + 2Lτ(2τ + 3)η2 T Γ + (2 2µηT ) 1 j=1 E θj m,T θi,T 2 + η2 T τ 2 Σ2 . From Lemma D.2, we have the following for any pair of clients w and v after we train for T FL iterations, E θj w,T θ1 v,T 2 τ 2G2 T X γ w,t + 1 γ v,t 2 + η2 T (j 1)2G2 . (4) E h θi,T +1 θ 2i (1 2µηT τ)E h θi,T θ 2i + 2Lτ(2τ + 3)η2 T Γ + η2 T τ 2 Σ2 + (2 2µηT ) 1 1 γ m,t + 1 + η2 T (j 1)2G2 ! = (1 2µηT τ)E h θi,T θ 2i 2Lτ(2τ + 3)Γ + (2 2µηT )G2 τ X j=1 (j 1)2 + τ 2 Σ2 + (2 2µηT ) 1 1 γ m,t + 1 (1 2µηT τ)E h θi,T θ 2i + η2 T 2Lτ(2τ + 3)Γ + 2τ 3G2 + τ 2 Σ2 + η2 T (2 2µηT )τ 3G2 1 2 1 γ m,t + 1 = (1 2µηT τ)E h θi,T θ 2i + η2 T (Q + CT ) Published as a conference paper at ICLR 2024 where for simplicity we define shorthands Q = 2Lτ(2τ + 3)Γ + 2τ 3G2 + τ 2 Σ2, HT = (2 2µηT )τ 3G2 and CT = HT N PN m=1 PT t=1 T +α t+α 2 1 γ m,t + 1 γ i,t 2 . Note that Q is a constant whereas CT is dependent on T. Let t = E h θi,t θ 2i . We now aim to prove t vt α+t by induction where vt = max n β2(Q+Ct) 2βµτ 1 , (α + 1) 1 o . When t = 1, we have 1 v1 α+1 by definition. Now assume t vt α+t for some t, then from (5) we have vt t + α + β2(Q + Ct) vt t + α + 2βµτ 1 (t + α)2 vt = vt t + α vt (t + α)2 vt t + α vt (t + α)(t + α + 1) = vt t + α + 1 (a) vt+1 t + α + 1 where (a) follows from Ct Ct+1 and Ht Ht+1. From the L-smoothness of F, E [F(θi,T )] F L 2 v T α + T . (7) Specifically, we let β = 1 µτ and train for t = T iterations, then v T = max β2(Q + CT ) 2βµτ 1 , (α + 1) 1 β2(Q + CT ) 2βµτ 1 + (α + 1) 1 µ2τ 2 + (α + 1) 1 . Finally, substituting (8) into (7), we get E [F(θi,T )] F L 2(α + T) µ2τ 2 + (α + 1) 1 Define B = Q + (α + 1)µ2τ 2E h θ1 θ 2i , we simplify the inequality to E [F(θi,T )] F L 2µ2τ 2(α + T) Q + CT + (α + 1)µ2τ 2 1 = L 2µ2τ 2 B + CT α + T . (10) Published as a conference paper at ICLR 2024 D.2 DEFERRED PROOFS OF LEMMAS Lemma D.1. Given ηt 1 4L(τ+1) and Assumptions 1-3, E h θi,t+1 θ 2i (1 2µηtτ)E h θi,t θ 2i + 2Lτ(2τ + 3)η2 t Γ + (2 2µηt) 1 j=1 E θj m,t θi,t 2 + η2 t τ 2 Σ2 . Proof. In this proof, we define Ri,t = ηt 1 γ i,t N j=1 Fm(θj m,t) . (12) To simplify the notation in this proof, we use θi,t interchangeably with θ1 i,t to denote the model of client i at FL iteration t without performing any local updates. = θi,t + Ri,t θ Ri,t + Ri,t 2 = θi,t θ + Ri,t 2 + Ri,t Ri,t 2 + 2 θi,t θ + Ri,t, Ri,t Ri,t . Step 1: Bounding the second and third term in (13) Bound the second term in (13) in expectation with respect to ξj m,t and Si,t, ES,ξ h Ri,t Ri,t 2i ηt 1 γ i,t N j=1 ( Fm(θj m,t, ξj m,t) Fm(θj m,t)) Fm(θj m,t, ξj m,t) Fm(θj m,t)) 2 (a) η2 t τ 2ES m Si,t σ2 m (b)= η2 t τ 2 Σ2 where (a) and (b) both follow from Assumption 3. The expectation of the third term in (13) is zero. That is, E 2 θi,t θ + Ri,t, Ri,t Ri,t = 0 (15) because E [Ri,t] = Ri,t. Step 2: Bounding the first term in (13) We write θi,t θ + Ri,t 2 = θi,t θ 2 + Ri,t 2 + 2 θi,t θ , Ri,t . (16) We will next bound Ri,t 2 and 2 θi,t θ , Ri,t , respectively. 
Published as a conference paper at ICLR 2024 Step 2.1: Bounding Ri,t 2 From the Assumption 1 of L-smoothness for Fm, due to Lemma 4 of (Zhou, 2018), we have Fm(θj m,t) 2 2L(Fm(θj m,t)) F m) . (17) Bounding the second term in the RHS of (16), we have Ri,t 2 = η2 t j=1 Fm(θj m,t) η2 t 1 γ i,t N j=1 Fm(θj m,t) η2 t τ 1 γ i,t N Fm(θj m,t) 2 2Lη2 t τ 1 γ i,t N j=1 (Fm(θj m,t) F m) . Step 2.2: Bounding 2 θi,t θ , Ri,t Bounding the last term in the RHS of (16), we rewrite 2 θi,t θ , Ri,t = 2 θi,t θ , ηt 1 γ i,t N j=1 Fm(θj m,t) = 2ηt 1 γ i,t N j=1 θ θi,t, Fm(θj m,t) = 2ηt 1 γ i,t N j=1 θj m,t θi,t, Fm(θj m,t) + 2ηt 1 γ i,t N j=1 θ θj m,t, Fm(θj m,t) . Consider the first term in (19), 2ηt 1 γ i,t N j=1 θj m,t θi,t, Fm(θj m,t) ηt 1 γ i,t N θj m,t θi,t 2 + ηt Fm(θj m,t) 2 = 1 γ i,t N θj m,t θi,t 2 + η2 t 1 γ i,t N Fm(θj m,t) 2 . Consider the second term in (19), 2ηt 1 γ i,t N j=1 θ θj m,t, Fm(θj m,t) (a) 2ηt 1 γ i,t N Fm(θ ) Fm(θj m,t) µ θj m,t θ 2 (21) Published as a conference paper at ICLR 2024 where (a) follows from Assumption 2. Step 3: Putting the results together In the previous step, we want to bound (16). Applying (18), (20) and (21), we have θi,t θ + Ri,t 2 = θi,t θ 2 + Ri,t 2 + 2 θi,t θ , Ri,t θi,t θ 2 + 2Lη2 t τ 1 γ i,t N j=1 (Fm(θj m,t) F m) + 1 γ i,t N θj m,t θi,t 2 + η2 t 1 γ i,t N Fm(θj m,t) 2 + 2η(t) 1 γ i,t N Fm(θ ) Fm(θj m,t) µ (a)= θi,t θ 2 µηt 1 γ i,t N θj m,t θ 2 + 1 γ i,t N θj m,t θi,t 2 + 2L(τ + 1)η2 t 1 γ i,t N j=1 (Fm(θj m,t) F m) + 2ηt 1 γ i,t N Fm(θ ) Fm(θj m,t) (b) θi,t θ 2 2µηt 1 γ i,t N j=1 θi,t θ 2 + (1 2µηt) 1 γ i,t N θj m,t θi,t 2 + At = (1 2µηtτ) θi,t θ 2 + (1 2µηt) 1 γ i,t N θj m,t θi,t 2 + At (22) where (a) follows from L-smoothness in (17) and rearranging and (b) follows from θj m,t θ 2 2 θj m,t θi,t 2 + 2 θi,t θ 2. We next bound At in expectation. Define Wt = 2ηt(1 2L(τ + 1)ηt). Note that we define ηt such that ηt 1 4L(τ+1). Hence, Wt > 0. Published as a conference paper at ICLR 2024 From the definition of A above, we have At = 2L(τ + 1)η2 t 1 γ i,t N j=1 (Fm(θj m,t) F m) + 2ηt 1 γ i,t N Fm(θ ) F m + F m Fm(θj m,t) = 2ηt(1 2L(τ + 1)ηt) 1 γ i,t N j=1 (Fm(θj m,t) F m) + 2ηt 1 γ i,t N j=1 (Fm(θ ) F m) = Wt 1 γ i,t N j=1 (Fm(θj m,t) F ) Wtτ 1 γ i,t N m Si,t (F F m) + 2ηtτ 1 γ i,t N m Si,t (Fm(θ ) F m) . Wt 1 γ i,t N j=1 (Fm(θj m,t) F ) Wtτ 1 γ i,t N E m Si,t (F F m) + 2ηtτ 1 γ i,t N E m Si,t (Fm(θ ) F m) Wt 1 γ i,t N j=1 (Fm(θj m,t) F ) + 4Lτ(τ + 1)η2 t Γ where (a) follows from E h 1 γ i,t N P m Si,t(F F m) i = 1 N PN m=1(F F m) = Γ. Note that in the equation above, we have j=1 (Fm(θj m,t) F ) = 1 γ i,t N j=1 (Fm(θj m,t) Fm(θi,t)) + 1 γ i,t N j=1 (Fm(θi,t) F ) (a) 1 γ i,t N j=1 Fm(θi,t), θj m,t θi,t + 1 γ i,t N j=1 (Fm(θi,t) F ) ηt Fm(θi,t) 2 + 1 θj m,t θi,t 2 + 1 γ i,t N j=1 (Fm(θi,t) F ) (b) 1 γ i,t N ηt L(Fm(θi,t) F m) + 1 θj m,t θi,t 2 + 1 γ i,t N j=1 (Fm(θi,t) F ) (25) Published as a conference paper at ICLR 2024 where (a) follows from Assumption 2 and (b) follows from (17). Therefore, Wt 1 γ i,t N j=1 (Fm(θj m,t) F ) Wt 1 γ i,t N ηt L(Fm(θi,t) F m) + 1 θj m,t θi,t 2 Wt 1 γ i,t N j=1 (Fm(θi,t) F ) = Wt(ηt L 1) 1 γ i,t N j=1 (Fm(θi,t) F ) + Wtηt L 1 γ i,t N j=1 (F F m) + Wt θj m,t θi,t 2 (a) Wtηt L 1 γ i,t N j=1 (F F m) + 1 γ i,t N θj m,t θi,t 2 where (a) follows from ηt L 1 0 and 1 γ i,t N P m Si,t Pτ j=1(Fm(θi,t) F ) 0, and also from the fact that ηt Wt 2ηt given that ηt 1 4L(τ+1). 
Then, substituting (26) into (24), E [At] Wtηt Lτ 1 γ i,t N E m Si,t (F F m) θj m,t θi,t 2 + 4Lτ(τ + 1)η2 t Γ (a) 2Lτη2 t (2τ + 3)Γ + E θj m,t θi,t 2 where (a) again follows from E h 1 γ i,t N P m Si,t(F F m) i = 1 N PN m=1(F F m) = Γ. Finally, taking expectation of (22) and then substituting (27), we have E h θi,t θ + Ri,t 2i (1 2µηtτ)E h θi,t θ 2i + (1 2µηt)E θj m,t θi,t 2 + 2Lτ(2τ + 3)η2 t Γ + E θj m,t θi,t 2 = (1 2µηtτ)E h θi,t θ 2i + 2Lτ(2τ + 3)η2 t Γ + (2 2µηt) 1 j=1 E θj m,t θi,t 2 . Step 4: Concluding the proof Published as a conference paper at ICLR 2024 Lastly, we take expectation of (13) and substitute (28), (14), (15), E h θi,t+1 θ 2i = E h θi,t θ + Ri,t 2i + E h Ri,t Ri,t 2i (a) (1 2µηtτ)E h θi,t θ 2i + 2Lτ(2τ + 3)η2 t Γ + (2 2µηt) 1 j=1 E θj m,t θi,t 2 + η2 t τ 2 Σ2 . Note that we have two sources of uncertainties above. They are 1) the random selection of updates from Si,t in Ri,t and 2) the stochastic gradients ξj m,t. We have taken expectation with respect to both of them above. Lemma D.2. Given Assumption 4, E θj w,t θ1 v,t 2 τ 2G2 t X γ w,z + 1 γ v,z 2 + η2 t (j 1)2G2 . (30) Proof. Using the Cauchy-Schwarz inequality, we first bound E h θ1 w,t θ1 v,t 2i z=1 (Rw,z Rv,z) z=1 E h Rw,z Rv,z 2i j=1 Fm(θj m,z, ξj m,z) 1 |Sv,z| j=1 Fm(θj m,z, ξj m,z) 1 γ w,z N 1 γ v,z N m Sw,z m Sv,z j=1 Fm(θj m,z, ξj m,z) + 1 γ w,z N m Sw,z m/ Sv,z j=1 Fm(θj m,z, ξj m,z) 1 γ v,z N m Sv,z m/ Sw,z j=1 Fm(θj m,z, ξj m,z) γ w,zγ v,z N 1 γ w,z N 1 γ v,z N 2 + γ w,z(1 γ v,z)N 1 γ w,z N +γ v,z(1 γ w,z)N 1 γ v,z N j=1 Fm(θj m,z, ξj m,z) γ w,z + 1 γ v,z 2 1 j=1 Fm(θj m,z, ξj m,z) Published as a conference paper at ICLR 2024 (d) τ 2G2 t X γ w,z + 1 γ v,z 2 (31) where (a) follows from the independent nature of Ri,z s at different z, i and E [Rw,z Rs,z] = 0, (b) follows from the algorithm such that |Si,z| = γi,z(N 1)+1 = γ i,z N for all i [N], (c) follows from expected number of party indices in Sw,z, Sv,z and (d) follows from Assumption 4. Note that here we assume γ i,z N is a positive integer w.l.o.g. Then, we bound the model that has performed j 1 local gradient steps, E θj w,t θ1 v,t 2 (32) b=1 Fw(θb w,t, ξb w,t) θ1 v,t = E h θ1 w,t θ1 v,t 2i + E b=1 Fw(θb w,t, ξb w,t) 2 θ1 w,t θ1 v,t, ηt b=1 Fw(θb w,t, ξb w,t) (a)= E h θ1 w,t θ1 v,t 2i + E b=1 Fw(θb w,t, ξb w,t) E h θ1 w,t θ1 v,t 2i + η2 t (j 1) b=1 E h Fi(θb i,t, ξb i,t) 2i (37) (b) E h θ1 w,t θ1 v,t 2i + η2 t (j 1)2G2 (38) (c) τ 2G2 t X γ w,z + 1 γ v,z 2 + η2 t (j 1)2G2 (39) where (a) follows from E θ1 w,t θ1 v,t = Pt 1 z=0 E [Rw,z Rv,z] = 0, (b) follows from Assumption 4 and (c) follows from (31). D.3 PROOF OF THEOREM 2 Theorem 2 (Improved bound). Let ηt = β t+α and B, HT be defined in Theorem 1. With a stochastic recovery rate of q, the performance of client model θi,T trained over T FL iterations is bounded by E [F(θi,T )] F L t=1 (1 2µηtτ) t=1 η2 t (Q + Dt + Et) l=t+1 (1 2µηtτ) where DT = HT 2 1 γ m,t + 1 γ ref,t 2 2 1 γ i,t + 1 γ ref,t 2 (1 q)T t+1 . Proof. Let 0 q 1 be the probability of stochastic recovery. During stochastic recovery, a client recovers a reference model. We denote θref as the reference model parameter and Rref,t as the reward for the reference model at iteration t, which is determined by the induced reference reward rate γ ref,t. Let h be the number of iterations the client has not synchronized for. 
Published as a conference paper at ICLR 2024 With the introduction of the reference model, the difference θm,t θi,t 2 between two models at an iteration t can be broken down into θm,t θref,t 2 and θref,t θi,t 2, which can be better bounded using the stochastic recovery rate 0 q 1. We now derive a new bound for the expected difference between a client model θw,T and the reference model θref,T at an iteration T, E h θ1 w,T θ1 ref,T 2i h=1 q(1 q)h E z=T +1 h (Rw,z Rref,z) h=1 q(1 q)h " T X z=T +1 h E h (Rw,z Rref,z) 2i# h=1 (1 q)h # E h Rw,T Rref,T 2i + . . . + h=T (1 q)h # E h Rw,1 Rref,1 2i# h=l (1 q)h # E h Rw,T +1 l Rref,T +1 l 2i# (b) qτ 2G2 T X 1 γ w,T +1 l + 1 γ ref,T +1 l 2 h=l (1 q)h # (c)= τ 2G2 T X 1 γ w,T +1 l + 1 γ ref,T +1 l 2 η2 T +1 l (1 q)l (1 q)T +1 (d)= τ 2G2 T X 1 γ w,t + 1 γ ref,t 2 η2 t (1 q)T t+1 (1 q)T +1 1 γ w,t + 1 γ ref,t 2 η2 t (1 q)T t+1 (40) where (a) follows from the independent nature of Ri,t s at different t, i and E [Rw,z Rs,z] = 0, (b) follows from (31), (c) follows from the closed form of the summation PT h=l(1 q)h = (1 q)l (1 q)T +1 q and (d) follows from replacing l with t such that t = T + 1 l. From Lemma D.1, we have E h θi,T +1 θ 2i (1 2µηT τ)E h θi,T θ 2i + 2Lτ(2τ + 3)η2 T Γ + η2 t τ 2 Σ2 + (2 2µηt) 1 j=1 E θj m,T θi,T 2 (a) (1 2µηT τ)E h θi,T θ 2i + 2Lτ(2τ + 3)η2 T Γ + η2 T τ 2 Σ2 + (2 2µηT ) 1 E h θ1 m,T θ1 i,T 2i + η2 t (j 1)2G2 (b) (1 2µηT τ)E h θi,T θ 2i + 2Lτ(2τ + 3)η2 T Γ + 2η2 T τ 3G2 + η2 T τ 2 Σ2 + (2 2µηT )τ 1 m=1 E h θ1 m,T θ1 ref,T 2i + (2 2µηT )τE h θ1 ref,T θ1 i,T 2i where (a) follows from (38) and (b) follows from E θ1 i,T = E h θ1 ref,T i . Published as a conference paper at ICLR 2024 Then, we substitute (40), E h θi,T +1 θ 2i (1 2µηT τ)E h θi,T θ 2i + 2Lτ(2τ + 3)η2 T Γ + 2η2 T τ 3G2 + η2 T τ 2 Σ2 + η2 T (2 2µηT )τ 3G2 1 2 1 γ m,t + 1 γ ref,t 2 + η2 T (2 2µηT )τ 3G2 T X 2 1 γ i,t + 1 γ ref,t 2 = (1 2µηT τ)E h θi,T θ 2i + η2 T (Q + DT + ET ) . Note that in the above, we have defined Q = 2Lτ(2τ +3)Γ+2τ 3G2+τ 2 Σ2, HT = (2 2µηT )τ 3G2 in Theorem 1 and we further define 2 1 γ m,t + 1 γ ref,t 2 2 1 γ i,t + 1 γ ref,t 2 (1 q)T t+1 . From the L-smoothness of F, E [F(θi,T )] F 2 E h θi,T θ 2i t=1 (1 2µηtτ) t=1 η2 t (Q + Dt + Et) l=t+1 (1 2µηtτ) . D.4 PROOF OF PROPOSITION 2 Definition D.1. We write f(x) = o(g(x)) as x if for all positive real number M, there exist a real number x0 such that |f(x)| < Mg(x) x x0 . Or equivalently, lim x f(x) g(x) = 0 . Definition D.2. We write f(x) = O(g(x)) as x if there exists positive real numbers M and δ such that for all defined x with 0 < |x a| < δ, |f(x)| Mg(x) x x0 . Or equivalently, lim sup x |f(x)| Proposition 2 (Asymptotic convergence). If 1 γ i,t = o t2 log t for all i [N] and 1 γ ref,t = o t2 log t , we achieve lim T E [F(θi,T )] F = 0 . Published as a conference paper at ICLR 2024 Proof. Firstly, we define new functions ai(t) = 1 γ i,t for all i [N] such that t Z+. Define a non-decreasing function ai(t) such that ai(t) ai(t) for all t, specifically, ai(t) = max x [t] ai(x) . (45) Note that ai(t) = o t2 log t and aref(t) = o t2 log t still hold. Consider the term ET in (42), 2 1 γ i,t + 1 γ ref,t 2 t=1 (ai(t) + aref(t)) T + α 2 (1 q)T t+1 (a) 2τ 3G2 T 1 X l=0 (ai(T l) + aref(T l)) T + α T l + α (b)) 2τ 3G2 T 1 X l=0 ( ai(T) + aref(T)) T + α T l + α 2τ 3G2( ai(T) + aref(T)) l=0 (l + 1 + α)2(1 q)l+1 where (a) follows from t = T l, (b) follows from the definition of ai(t) and the non-decreasing nature of it. Notice in (46), ST is non-decreasing in T, we can then use the limit of ST to bound it. 
The limit has a closed form, lim T ST = (q 1)( α2q2 2αq + q 2) Therefore, we bound the term ET , ET MAi,ref(T) (48) where M is a constant M = 2τ 3G2 (q 1)( α2q2 2αq + q 2) Ai,ref(T) = ai(T) + aref(T) . (50) Similarly, we bound the term DT in (42), m=1 ( am(T) + aref(T)) m=1 Am,ref(T) . Notice that we have Ai,ref(T) = o T 2 log T for all i [N]. Published as a conference paper at ICLR 2024 Now, we define t = E h θi,t θ 2i , Yt = β2(Q + MAi,ref(t) + M N PN m=1 Am,ref(t)) and let β = 1 2µτ . From (42), we can write t+1 (1 2µηtτ) t + η2 T (Q + Dt + Et) t+1 1 1 α + t t + Yt (α + t)2 (α + t) t+1 (α + t 1) t + Yt α + t . Let t = (α + t 1) t where 1 = α 1, we have t+1 t + Yt α + t . (53) 1 α + l (54) due to the non-decreasing nature of Ai,ref(t) with t. lim T T = lim T lim T α 1 + β2(Q + MAi,ref(T) + M N PN m=1 Am,ref(T)) PT 1 l=1 1 α+l T = lim T Ai,ref(T) + 1 N PN m=1 Am,ref(T) T lim T PT 1 l=1 1 α+l T = lim T Ai,ref(T) + 1 N PN m=1 Am,ref(T) T 2 log T lim T PT 1 l=1 1 α+l log T . Note that Ai,ref(T) = o T 2 log T , i [N], hence by definition lim T Ai,ref(T) + 1 N PN m=1 Am,ref(T) T 2 log T = 0 . (56) Also, since PT 1 l=1 1 α+l = O(log T), PT 1 l=1 1 α+l log T = Z (57) where Z is a constant. Therefore, substituting (56) and (57) into (55) to obtain lim T T 0, and further considering that T 0, lim T T = 0 . (58) From the L-smoothness of F, we conclude that lim T E [F(θi,T )] F lim T L 2 T = 0 . (59) Published as a conference paper at ICLR 2024 E ADDITIONAL EXPERIMENTS E.1 DATASETS AND IMPLEMENTATION DETAILS We provide a comprehensive comparison between our method and the existing baselines using widely-used benchmark datasets, following a heterogeneous data partitioning benchmark (Li et al., 2022). All experiments were carried out on a server with Intel(R) Xeon(R)@2.70GHz processors and 1.5TB RAM. We utilized one Tesla V100 GPU for the experiments. Dataset preprocessing. For all vision datasets, we standardized the pixel values of the images. We additionally applied data augmentation techniques of random cropping and random horizontal flipping to the CIFAR-10 and CIFAR-100 datasets. For natural language datasets, we performed standard tokenization and vocabulary building. We simulated the federated environment with 50 clients for all experiments. We employed five different heterogeneous partitioning strategies, namely: 1) Distribution-based label distribution skew, 2) quantity-based label distribution skew, 3) noise-based feature distribution skew, 4) quantity skew and 5) homogeneous partition. More detailed information on the implementation of these partitioning strategies can be found in Section 6. FL models. To ensure the reproducibility of the experiments carried out in this paper, we used common and well-established model architectures. Specifically, we utilized the following model architectures: 1. Convolutional neural networks (CNN) following that of (Li et al., 2022). The CNN consists of 2 convolutional layers followed by 3 fully connected layers. 2. Res Net18 (He et al., 2016). We replaced the batch normalization layers in Res Net18 with group normalization to ensure stable training with highly heterogeneous clients (Wu & He, 2018; Hsieh et al., 2020; Wang et al., 2022). The number of groups used for the four layers of Res Net18 were 4, 8, 16 and 32, respectively. 3. Long short-term memory networks (LSTM). The network comprises an embedding layer of a dimension 300, an LSTM layer and 3 fully connected layers. 
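For concreteness, the following is a minimal PyTorch sketch of architecture 1 (two convolutional layers followed by three fully connected layers). The channel widths, kernel sizes, and hidden dimensions are our own assumptions for illustration, since only the layer counts are specified above.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Sketch of the 2-conv + 3-FC architecture described above.
    Channel widths, kernel sizes, and hidden sizes are assumed for illustration."""
    def __init__(self, in_channels: int = 3, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(),  # assumes 32x32 inputs (e.g., CIFAR-10)
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Example forward pass on a CIFAR-10-sized batch.
logits = SimpleCNN()(torch.randn(4, 3, 32, 32))
print(logits.shape)  # torch.Size([4, 10])
```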
In Section 6.1, Appendix E.2, Appendix E.3 and Appendix E.4, we conducted fair evaluations using architectures 1 and 3. In Section 6.2, we utilized the more complex architecture 2 to demonstrate the ability of IAFL to achieve high accuracies on the challenging CIFAR-100 dataset. Architecture 2 was also employed in experiments conducted in Section 6.3 and Appendix E.5. Hyperparameters and options for training. We used a default initial learning rate of η0 = 0.001, unless otherwise specified. We used an exponential learning rate decay with a rate of 0.977, causing the learning rate to decrease by 10 folds in 100 iterations. For MNIST, we used η0 = 0.01. For CIFAR-100 with Res Net18, we used η0 = 0.005 with a decay rate of 0.988. The total number of FL training iterations used was 50 for MNIST, FMNIST, SVHN and 100 for CIFAR-10, CIFAR-100 and SST. Each FL iteration involved a client training for 1 local epoch. For all classification tasks, we employed the cross-entropy loss. Fed Avg finetune. Standard Fed Avg returns the same model to all clients in each FL iteration. In this case, the correlation metric ρ in Section 6.1 is undefined and uncomputable. To address this and create a personalized model for each client, we consider a simple modification that allows clients to train for an additional epoch at the end of the FL algorithm. Therefore, this variation, named Fed Avg finetune, serves as a straightforward FL personalization baseline. LG-Fed Avg (Liang et al., 2020). Local global federated averaging (LG-Fed Avg) achieves personalized models by personalizing the layers or structures of the shared model. Specifically, each client learns a personalized local feature extractor and the local representations of client data are federated to train shared global layers. To achieve this, the model updates averaged by the server are exclusively used to update the shared global layers. In our experiments, we specify the last 3 fully connected layers of the model architecture as the shared global layers. CGSV (Xu et al., 2021). Cosine gradient Shapley value (CGSV) evaluates client contributions based on the cosine similarity between a client s gradients and the average of all clients gradients. During the rewarding phase (i.e., when the server sends respective updates), CGSV masks out (i.e., zeroes out) portions of the gradient update aggregate according to the client contributions. This masking is Published as a conference paper at ICLR 2024 Table 4: Comparison of IPRloss among IAFL and baselines using differents datasets and partitions. Each value reports the mean and standard error of 10 independent evaluations and partition seedings. 
Category Dataset Partitioning Fed Avg Finetune LG-Fed Avg CGSV Rank IAFL Label Distribution Skew MNIST Dir(0.5) 0.97 0.01 0.93 0.02 1.00 0.00 0.80 0.02 1.00 0.00 #C = 3 0.89 0.01 0.90 0.02 1.00 0.00 0.68 0.02 1.00 0.00 FMNIST Dir(0.5) 1.00 0.00 0.98 0.01 1.00 0.00 0.85 0.01 1.00 0.00 #C = 3 0.92 0.01 0.95 0.01 1.00 0.00 0.81 0.02 1.00 0.00 SVHN Dir(0.5) 1.00 0.00 1.00 0.00 1.00 0.00 0.95 0.01 1.00 0.00 #C = 3 1.00 0.00 1.00 0.00 1.00 0.00 0.94 0.01 1.00 0.00 CIFAR-10 Dir(0.5) 0.99 0.01 1.00 0.00 1.00 0.00 0.92 0.01 1.00 0.00 #C = 3 0.91 0.01 1.00 0.00 1.00 0.00 0.85 0.02 1.00 0.00 CIFAR-100 Dir(0.5) 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 #C = 30 1.00 0.00 1.00 0.00 1.00 0.00 0.92 0.01 1.00 0.00 SST Dir(0.5) 1.00 0.00 0.00 0.00 0.99 0.01 0.74 0.01 1.00 0.00 #C = 3 1.00 0.00 0.00 0.00 1.00 0.00 0.78 0.02 1.00 0.00 Feature Distribution Skew 0.66 0.02 0.24 0.01 1.00 0.00 0.76 0.04 1.00 0.00 FMNIST 1.00 0.00 0.92 0.02 1.00 0.00 0.96 0.01 1.00 0.00 SVHN 1.00 0.00 1.00 0.00 1.00 0.00 0.97 0.00 1.00 0.00 CIFAR-10 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 CIFAR-100 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 Quantity Skew 0.96 0.01 0.56 0.02 1.00 0.00 0.92 0.02 1.00 0.00 FMNIST 1.00 0.00 0.74 0.02 1.00 0.00 0.99 0.00 1.00 0.00 SVHN 1.00 0.00 1.00 0.00 1.00 0.00 0.99 0.00 1.00 0.00 CIFAR-10 0.96 0.01 0.98 0.00 0.85 0.01 0.90 0.02 0.98 0.01 CIFAR-100 1.00 0.00 1.00 0.00 0.99 0.00 0.99 0.00 1.00 0.00 SST 1.00 0.00 0.00 0.00 1.00 0.00 0.77 0.02 1.00 0.00 Homogeneous Partition 0.81 0.02 0.41 0.03 1.00 0.00 0.76 0.04 1.00 0.00 FMNIST 1.00 0.00 0.98 0.01 1.00 0.00 1.00 0.00 1.00 0.00 SVHN 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 CIFAR-10 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 CIFAR-100 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 SST 1.00 0.00 0.00 0.00 1.00 0.00 0.67 0.02 1.00 0.00 performed layer-wise, where the contribution determines the proportion of parameter values being masked out in each model layer s gradient updates. For our experiments, we utilized the original implementation of Xu et al. (2021). Rank (Kong et al., 2022). Rank rewards clients in each iteration based on the accuracy of their respective models on a validation dataset. The method ranks the clients by their validation accuracies before rewarding them. Client i can aggregate model updates from other clients that have a lower validation accuracy than that of client i. Since no code has been released for this work, we followed the original paper for our own implementation. IAFL. The implementation follows that of Algorithm 1. For baseline comparisons in Section 4.1, we set the sharing parameter κ = 0 and stochastic recovery probability q = 0. Subsequently, we investigated the effects of these hyperparameters in Section 6.2. In all experiments except Section 6.3, we used the standalone accuracies of clients as their contributions pi,t, which remain fixed across iterations. In Section 6.3, we instead used the participation rates of clients as the contribution measure. E.2 INCENTIVIZED PARTICIPATION RATE USING TEST LOSS We expect IAFL to satisfy individual rationality (IR) if the mechanism strictly follows [Mν(p)]i = ν pi, (min {pi/pceil, 1})1 κ h(p i) and the rewarding function ν(pi, P i) is continuous, nondecreasing, concave and exchangeable with respect to both arguments. However, the assumptions on ν are often violated, as shown by the results of Table 1 in Section 6.1. Here we show that using the test loss as the rewarding function ν better fulfills the assumptions needed for ν. 
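To illustrate how the mechanism restated above drives the training-time reward in IAFL, here is a small sketch of one rewarding step. It is our own illustrative reading of the quantities gamma_{i,t} = (min{p_{i,t}/p_ceil, 1})^(1-kappa) and |S_{i,t}| = gamma_{i,t}(N-1) + 1 together with the stochastic recovery probability q, not the authors' reference implementation; the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def induced_reward_rate(p_i, p_ceil, kappa):
    """gamma_{i,t} = (min{p_i / p_ceil, 1})^(1 - kappa), as in the mechanism above."""
    return min(p_i / p_ceil, 1.0) ** (1.0 - kappa)

def iafl_reward(updates, i, contributions, p_ceil, kappa=0.0, q=0.0, ref_update=None):
    """One rewarding step for client i (illustrative sketch).

    `updates` holds the N local model updates of this iteration (NumPy arrays).
    Client i receives the average of its own update plus a random subset of the
    other updates whose size is set by its reward rate; with probability q the
    reference update is returned instead (stochastic reference-model recovery).
    """
    N = len(updates)
    if ref_update is not None and rng.random() < q:
        return ref_update
    gamma = induced_reward_rate(contributions[i], p_ceil, kappa)
    n_others = int(np.ceil(gamma * (N - 1)))  # |S_{i,t}| = gamma*(N-1) + 1 including client i
    others = [m for m in range(N) if m != i]
    sampled = rng.choice(others, size=n_others, replace=False) if n_others > 0 else []
    chosen = [updates[i]] + [updates[m] for m in sampled]
    return np.mean(chosen, axis=0)

# Toy usage: 5 clients, contributions taken as standalone accuracies, p_ceil = best accuracy.
updates = [rng.normal(size=4) for _ in range(5)]
contribs = [0.9, 0.7, 0.5, 0.3, 0.1]
reward_i = iafl_reward(updates, i=1, contributions=contribs, p_ceil=0.9,
                       kappa=0.5, q=0.01, ref_update=np.mean(updates, axis=0))
print(reward_i)
```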
The comparison of IPRloss (i.e., using the test loss) among IAFL and baselines is shown in Table 4. Our IAFL has excellent incentivization performance as it achieves the highest IPRloss in all settings and perfect IR in all but one setting. Published as a conference paper at ICLR 2024 Table 5: The average (highest) accuracies achieved by the client models. The values show the top-1 test accuracy measured in percentage (i.e., %). Each value reports the mean of 10 independent evaluations and partition seedings. Category Dataset Partitioning Fed Avg Finetune LG-Fed Avg CGSV Rank IAFL Label Distribution Skew MNIST Dir(0.5) 79 (95) 86 (94) 89 (93) 71 (94) 90 (96) #C = 3 32 (46) 47 (63) 64 (72) 38 (66) 93 (96) FMNIST Dir(0.5) 60 (77) 64 (75) 71 (73) 62 (80) 74 (83) #C = 3 30 (46) 35 (46) 47 (54) 32 (47) 71 (78) SVHN Dir(0.5) 57 (75) 70 (77) 59 (68) 56 (75) 67 (79) #C = 3 26 (41) 35 (52) 30 (44) 25 (41) 38 (69) CIFAR-10 Dir(0.5) 28 (39) 30 (39) 24 (30) 29 (40) 32 (45) #C = 3 21 (24) 22 (26) 16 (23) 22 (25) 24 (36) CIFAR-100 Dir(0.5) 8 (11) 11 (13) 6 (6) 10 (12) 13 (17) #C = 30 7 (9) 8 (10) 4 (5) 8 (9) 9 (13) SST Dir(0.5) 23 (33) 23 (30) 23 (29) 24 (31) 26 (34) #C = 3 29 (37) 25 (30) 24 (30) 27 (34) 30 (37) Feature Distribution Skew 92 (96) 95 (97) 96 (96) 91 (96) 95 (96) FMNIST 79 (82) 76 (79) 75 (75) 79 (83) 84 (84) SVHN 83 (85) 79 (81) 83 (83) 82 (85) 86 (86) CIFAR-10 52 (54) 49 (51) 42 (42) 51 (54) 56 (56) CIFAR-100 17 (18) 16 (17) 9 (9) 15 (18) 19 (20) Quantity Skew 95 (98) 93 (97) 96 (96) 91 (97) 97 (97) FMNIST 82 (85) 76 (83) 75 (76) 75 (83) 85 (86) SVHN 84 (86) 77 (85) 79 (84) 60 (81) 84 (86) CIFAR-10 53 (55) 46 (57) 41 (44) 37 (50) 52 (56) CIFAR-100 18 (20) 15 (20) 8 (9) 8 (15) 15 (20) SST 35 (38) 25 (29) 33 (35) 29 (35) 37 (38) Homogeneous Partition 93 (96) 94 (96) 96 (96) 91 (96) 95 (96) FMNIST 79 (82) 78 (80) 75 (75) 79 (83) 84 (84) SVHN 84 (85) 82 (83) 83 (83) 83 (85) 86 (86) CIFAR-10 53 (55) 51 (53) 43 (43) 52 (54) 55 (56) CIFAR-100 17 (18) 18 (18) 9 (9) 16 (18) 19 (20) SST 36 (38) 25 (28) 35 (36) 30 (35) 37 (38) E.3 AVERAGE AND HIGHEST TEST ACCURACY It is important to achieve high overall performance for client models as obtaining good models serves as a strong motivation for clients to actively participate in the federated learning process. To assess the effectiveness of IAFL and baseline methods in this regard, we compare the average and highest client model test accuracies. As presented in Table 5, our IAFL consistently outperforms the other baselines in almost all settings. These results indicate that IAFL successfully incentivizes clients to actively contribute and enables them to benefit from the collaborative process. The higher highest accuracies achieved by IAFL further demonstrated its ability to effectively leverage the decentralized data of clients to achieve superior predictive performance. We further note that the low accuracies obtained for CIFAR-100 can be attributed to the CNN model architecture used in this experiment. Further results in Section 6.2 show that IAFL achieves competitive accuracies on CIFAR-100 using a more complex Res Net18 model. E.4 AVERAGE INCREASE IN ACCURACIES Following (Pillutla et al., 2022), we compare the average increase in client model performance measured by test accuracy. Table 6 demonstrates that IAFL generally yields the most significant improvements in test accuracies across all clients. 
Intuitively speaking, such significant increases in test accuracies before and after the adoption of federated collaboration via IAFL serve as strong incentives for clients to actively participate and use the IAFL algorithm. These findings further complement the earlier experiments on IPRaccu and IPRloss, which only provided a percentage for binary outcomes (i.e., whether a client model improved after collaboration or not). Through this additional experiment, we have clearly shown the extent of client model improvements on the individual standalone models that our IAFL brings. Published as a conference paper at ICLR 2024 Table 6: Average accuracy increase of clients. The unit for this table is %, and we report the mean of 10 independent evaluations. Category Dataset Partitioning Fed Avg Finetune LG-Fed Avg CGSV Rank IAFL Label Distribution Skew MNIST Dir(0.5) 8.6 0.5 15.0 0.7 18.2 0.6 0.9 4.5 19.1 0.7 #C = 3 2.7 0.5 17.2 0.9 34.0 2.4 8.4 0.3 63.2 0.4 FMNIST Dir(0.5) 3.0 0.5 7.2 0.3 14.2 0.5 5.5 0.4 17.5 0.4 #C = 3 1.9 0.3 6.8 0.4 18.3 1.5 3.5 0.4 42.7 1.1 SVHN Dir(0.5) 9.2 0.5 21.5 0.6 10.8 0.7 7.8 0.6 18.6 0.5 #C = 3 -0.3 0.1 7.9 0.5 3.0 2.8 -1.2 0.2 11.1 0.9 CIFAR-10 Dir(0.5) -1.7 0.9 0.1 0.9 -5.7 0.9 -1.0 1.0 2.5 0.9 #C = 3 -3.3 0.1 -2.0 0.1 -8.0 0.3 -2.1 0.1 -0.4 0.4 CIFAR-100 Dir(0.5) -0.2 0.1 2.8 0.3 -2.6 0.1 1.0 0.1 4.5 0.2 #C = 30 -1.3 0.1 -0.9 0.1 -4.2 0.2 -0.2 0.1 0.7 0.3 SST Dir(0.5) -0.0 0.3 0.3 0.0 -0.0 0.3 1.1 0.1 3.3 0.3 #C = 3 4.6 0.2 0.2 0.0 -1.0 0.4 2.0 0.2 5.1 0.3 Feature Distribution Skew -1.9 0.5 1.0 0.5 2.2 0.4 -3.3 0.6 1.4 0.4 FMNIST 0.6 0.2 -2.6 0.4 -3.6 0.3 1.1 0.2 5.9 0.1 SVHN 10.8 0.3 6.8 0.4 10.7 0.5 9.1 0.5 13.1 0.5 CIFAR-10 8.9 0.3 5.4 0.3 -1.6 0.4 7.8 0.2 12.3 0.2 CIFAR-100 8.6 0.2 7.3 0.2 -0.1 0.2 6.1 0.1 10.2 0.2 Quantity Skew 10.1 0.8 7.7 0.5 10.8 0.6 6.3 0.4 11.3 0.6 FMNIST 10.6 0.6 5.2 0.4 4.1 0.6 3.7 0.4 13.5 0.5 SVHN 26.9 1.0 19.3 0.8 21.6 0.9 2.9 1.0 27.1 0.9 CIFAR-10 15.8 0.3 8.7 0.3 4.1 0.3 -0.0 0.4 15.6 0.3 CIFAR-100 10.7 0.1 7.8 0.1 0.8 0.1 0.4 0.2 7.5 0.2 SST 10.7 0.1 0.2 0.2 8.6 0.2 4.2 0.2 11.8 0.2 Homogeneous Partition -1.2 0.4 0.5 0.4 2.3 0.4 -2.9 0.5 1.5 0.4 FMNIST 0.3 0.2 -0.7 0.1 -3.6 0.2 0.9 0.2 5.2 0.2 SVHN 10.9 0.5 9.2 0.5 10.2 0.4 10.2 0.5 13.3 0.3 CIFAR-10 9.1 0.3 7.4 0.4 -1.3 0.4 7.8 0.5 11.4 0.4 CIFAR-100 8.3 0.2 8.5 0.1 0.3 0.2 7.2 0.2 10.0 0.1 SST 10.3 0.2 -0.1 0.2 9.6 0.3 4.4 0.2 11.6 0.2 E.5 CHANGING THE REFERENCE MODEL IAFL is flexible with the reference model definition, as long as the reference model at the server is also updated using a portion of the client model updates in each FL iteration, similar to any other federated client. The reference model, of course, does not contribute any learning resources (e.g., data). The proportion of client model updates used by the reference model is determined by the user-defined custom induced reference reward rate γ ref,t. In the experiments of Section 6.2, we set the induced reference reward rate as γ ref,t = maxi [N] γ i,t. In other words, the induced reference reward rate is equal to the highest induced reward rate among all clients in a given iteration t. In this section, we explore the possibility of changing the reference model to a median reward, i.e., γ ref,t = mediani [N]γ i,t. Intuitively, the clients are now stochastically recovered with a roughly median model among all clients in any given iteration. The results are shown in Figure 4 and should be viewed in comparison to Figure 1, where the highest induced reference reward rates are used to update the reference models. 
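As a small illustration of the two definitions compared here, the sketch below (using the same assumed notation as before; the function name is hypothetical) sets the induced reference reward rate either to the highest or to the median per-client rate in a given iteration.

```python
import numpy as np

def reference_reward_rate(gammas, mode="max"):
    """Induced reference reward rate gamma'_{ref,t} from the per-client rates.

    mode="max"    mirrors the choice used in Section 6.2 (highest rate among clients);
    mode="median" mirrors the variant explored in this section.
    """
    gammas = np.asarray(gammas, dtype=float)
    return gammas.max() if mode == "max" else float(np.median(gammas))

gammas_t = [1.00, 0.82, 0.60, 0.44, 0.24, 0.09]   # e.g., six representative clients
print(reference_reward_rate(gammas_t, "max"))      # 1.00
print(reference_reward_rate(gammas_t, "median"))   # 0.52
```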
Overall, the incentivization performances are comparable under the two reference model definitions.

Figure 4: Effects of the stochastic recovery rate q with a fixed sharing coefficient κ = 0.5 and a reference model trained with median induced reward rates (panel titles: κ = 0.5, q = 0.005; κ = 0.5, q = 0.01; legend: six representative clients with induced reward rates 1.00, 0.82, 0.60, 0.44, 0.24 and 0.09). The dataset is CIFAR-100 with quantity skew and 50 clients. The figures should be viewed in comparison to Figure 2.

Figure 5: Comparing IAFL and FedAvg for the speed of convergence (panels: (a) CIFAR-100, quantity skew, ResNet18, κ = 0, q = 0; (b) CIFAR-100, quantity skew, ResNet18, κ = 0.5, q = 0.01; (c) CIFAR-10, quantity skew, CNN, κ = 0, q = 0; (d) CIFAR-10, label distribution skew, CNN, κ = 0, q = 0). Generally, clients with different levels of contribution induce different reward rates γ'_i and result in different speeds of convergence. Notably, the stochastic reference-model recovery is vital to the convergence to the global optimum.

We observe that lower induced reference reward rates γ'_ref,t may hinder convergence to the global optimum and affect the overall average accuracy. However, lower γ'_ref,t values have the advantage of better incentivizing top contributors because it becomes more difficult for the lower-contributing clients to obtain the highest-performing models within a limited number of iterations. Therefore, the choice of the reference model definition becomes an important hyperparameter of IAFL, as it determines the extent of sharing and collaboration among clients. In practical settings where clients are more altruistic in their contributions (e.g., collaboration among non-profit organizations or government agencies), a high γ'_ref,t value is recommended.

E.6 EMPIRICAL CONVERGENCE

In this section, we investigate the effect of IAFL on the empirical convergence speed of clients contributing at different levels. Figure 5 illustrates the empirical convergence curves of selected representative clients. Based on Figure 5(a), when q = 0 (i.e., the stochastic reference-model recovery is deactivated), even the client with the highest induced reward rate γ'_i experiences slowed convergence, eventually to a suboptimal model. Therefore, the proposed stochastic reference-model recovery strategy is essential to the convergence to the global optimum. Upon the activation of the stochastic recovery at q = 0.01, as shown in Figure 5(b), we observe that the convergence rates of client models trained using IAFL are comparable to FedAvg. In Figure 5(c) and (d), we observe a similar behavior to that in Figure 5(a) using different datasets and data partitions: the empirical convergence is slower than FedAvg, as expected when stochastic recovery is deactivated.

E.7 PRACTICAL TRADE-OFF BETWEEN PERFORMANCE AND INCENTIVIZATION

In Section 6.2, we discuss the controllable hyperparameters of the IAFL algorithm and their effects on the practical trade-off between model performance and our goal of client incentivization.
Generally, larger values of κ, q and setting γ ref, t closer to maxi [N] γ i,t favor overall client model performances (at the same time promoting equality among clients) while trading-off client incentivization (i.e., compromising collaborative fairness and distributing less distinctive client models). In this paper, we have presented recommended values for easy usage that strike a good balance between these goals. The following set of values κ = 0.5 , q = 0.01 , γ ref, t = maxi [N] γ i,t , generally worked well in our experiments. Consequently, we recommend using them as the default configuration. If practitioners wish to achieve a different degree of trade-off with our IAFL framework, they could calibrate these hyperparameters to favor either client incentivization or model performance, from different perspectives depending on the hyperparameter. Our proposed IAFL algorithm provides a high degree of flexibility for the server and clients (e.g., users, practitioners) to adjust the hyperparameters such that they align with users perceptions and preferences. However, we highlight that this flexibility is not without constraint: It must trade off between fairness and equality, implying that there Published as a conference paper at ICLR 2024 are certain scenarios infeasible for accommodation. For example, the winner takes all perspective falls outside the spectrum of our trade-off as it violates both fairness and equality. Similarly, it is also unrealistic to expect to achieve both fairness and equality together in the IAFL s training-time model rewards distributed to the clients. E.8 ALTERNATIVE CHOICES OF THE CONTRIBUTION MEASURE As discussed in Section 4.2, IAFL as an incentive mechanism can be applied to a vast range of contribution measures. In this section, we adopt the cosine gradient Shapley value (CGSV) discussed in Section 2 as the contribution measure to determine pi,t and conduct IAFL training. We denote this specific method as CGSV-IAFL and compare it with the original CGSV implementation with gradient masking (Xu et al., 2021). Table 7 shows the superior incentivization performance of CGSV-IAFL across various data partitioning strategies on CIFAR-10. Therefore, IAFL works with alternative choices of contribution measures. Additionally, we demonstrate through this experiment that our rewarding scheme detailed in Procedure 2 is more effective than the gradient masking technique of the original CGSV. Table 7: Comparison of the incentivization performance between CGSV and IAFL after adopting a common contribution measure defined by CGSV. The incentivization performance is measured using the Pearson correlation coefficient ρ between the final client model accuracies and standalone accuracies. Each value reports the mean and the standard error over 10 independent evaluations. Category Partitioning CGSV CGSV-IAFL Label Distribution Skew Dir(0.5) 0.46 0.04 0.69 0.03 #C = 3 0.31 0.05 0.62 0.03 Feature Distribution Skew N(0.1) -0.02 0.04 0.06 0.05 Quantity Skew Dir(0.5) 0.73 0.04 0.81 0.01 Homogeneous Partition IID -0.03 0.05 -0.03 0.04 E.9 EXPERIMENTS ON COMPLEX LARGE-SCALE DATASETS We additionally conduct experiments on complex large-scale datasets: Tiny-Image Net for vision tasks and Sent140 for language tasks. The results are shown in Table 8. Table 8: The incentivization performance under different dataset partitions, measured using the Pearson correlation coefficient ρ between the final client model accuracies and standalone accuracies. 
Each value reports the mean and the standard error over 3 independent evaluations. Category Dataset Partitioning Fed Avg Finetune LG-Fed Avg CGSV Rank IAFL Label Distribution Skew Tiny-Image Net Dir(0.5) 0.27 0.06 0.10 0.07 -0.02 0.00 0.34 0.11 0.81 0.07 #C = 60 0.38 0.06 0.37 0.11 0.10 0.08 0.22 0.10 0.82 0.01 Sent140 Dir(0.5) 0.93 0.01 0.99 0.00 0.23 0.32 0.98 0.00 0.84 0.03 #C = 1 1.00 0.00 1.00 0.00 1.00 0.00 1.00 0.00 0.07 0.07 Feature Distribution Skew Tiny-Image Net N(0.1) undefined -0.04 0.07 -0.04 0.10 0.04 0.10 0.77 0.06 Quantity Skew Tiny-Image Net Dir(0.5) undefined 0.95 0.00 0.38 0.11 0.85 0.02 0.83 0.02 Sent140 -0.74 0.04 1.00 0.00 0.80 0.02 0.99 0.00 0.90 0.01 Homogeneous Partition Tiny-Image Net IID undefined -0.02 0.06 0.07 0.10 0.02 0.01 0.78 0.08 Sent140 0.08 0.03 0.27 0.02 -0.20 0.08 0.06 0.04 -0.06 0.11 Number of times that performs the best 1 5 1 1 4 For the complex Tiny-Image Net dataset, IAFL achieves the best incentivization performance in the majority of the data partition settings. The correlations ρ between the final client model accuracies and standalone accuracies are generally above 0.75. Note that finetuning the network trained by Fed Avg with the local client dataset for a small number of epochs may not affect the performance of the networks. Therefore, in some cases, the final models received by the clients have the same accuracy (as the global Fed Avg model) and thus the correlation ρ is undefined. For the Sent140 dataset, we notice that other methods such as Fed Avg Finetune and LG-Fed Avg outperform IAFL in terms of incentivization performance measured by the correlation ρ in many Published as a conference paper at ICLR 2024 cases. Further investigation reveals that using ρ alone to assess the effectiveness of collaborative learning is insufficient. Comparing the IPRloss in Table 9, the high ρ of baselines on Sent140 #C = 1 corresponds to 0% client incentivization measured by IPRloss. This is because collaboration through baseline methods does not result in any improvement of the client s model. We further provide the results of the average (and highest) accuracies achieved by clients on Sent140 in Table 10. The results illustrate that IAFL rewards better models to clients as compared to other baselines. Overall, IAFL achieves the best incentivization performance considering both metrics ρ and IPRloss while outputting client models with the highest model accuracies. Table 9: Comparison of IPRloss among IAFL and baselines on Sent140. Each value reports the mean and standard error of 3 independent evaluations and partition seedings. Category Partitioning Fed Avg Finetune LG-Fed Avg CGSV Rank IAFL Label Distribution Skew Dir(0.5) 0.84 0.02 0.01 0.01 0.93 0.07 0.62 0.00 0.87 0.02 #C = 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.83 0.04 Quantity Skew Dir(0.5) 1.00 0.00 0.00 0.00 1.00 0.00 0.91 0.01 1.00 0.00 Homogeneous Partition IID 1.00 0.00 0.00 0.00 1.00 0.00 0.71 0.01 1.00 0.00 Table 10: The average (highest) accuracies achieved by the client models on Sent140. The values show the top-1 test accuracy measured in percentage (i.e., %). Each value reports the mean of 3 independent evaluations and partition seedings. 
Category Partitioning Fed Avg Finetune LG-Fed Avg CGSV Rank IAFL Label Distribution Skew Dir(0.5) 68 (83) 61 (77) 61 (72) 66 (83) 71 (84) #C = 1 50 (50) 50 (50) 50 (50) 50 (50) 56 (70) Quantity Skew Dir(0.5) 85 (85) 72 (80) 79 (81) 76 (85) 85 (85) Homogeneous Partition IID 84 (85) 75 (76) 81 (81) 81 (83) 85 (85) F FREQUENTLY ASKED QUESTIONS AND DISCUSSIONS Question 1: Is it more natural to incentivize the clients to contribute using their full capacity? Answer: While having clients contribute to their full capacities for the best learning outcome is the ideal scenario, this goal is unrealistic because every client incurs non-trivial costs to contribute. Importantly, the actual contribution of clients depends on the interplay between rewards and costs for different contribution levels. In our work, we propose the next achievable alternative: To incorporate the goal of incentivizing clients to contribute as much as possible on top of satisfying individual rationality. To this end, we designed IAFL such that Theorem 1 and 2 are fulfilled. These theorems suggest that clients will be rewarded with better models if they contribute more. This will incentivize clients to further contribute to their full capacity for improved model performance received, of course, subject to their marginal utility increment in the presence of a cost ci. Clients will contribute up till the marginal increment in reward still surpasses the associated cost. Question 2: What if a client does not faithfully compute and upload the full local model updates? Answer: Our framework does not place any assumption on what kind of local model updates are computed and uploaded by a client. Instead, the client can freely decide whether he or she wants to make full or zero contribution. The contribution measures (in Section 4.2) capture the quality of the local model updates as contributions, e.g., Fed SV (Wang et al., 2020b), Com Fed SV (Fan et al., 2022), CGSV (Xu et al., 2021), Fed FAIM (Shi et al., 2022) are examples of such contribution measures. We summarise the whole process here: The client s behavior model decides the quality of the local model updates being computed (e.g., using different portions of its local data). Then, the quality of the local model updates affects the client contribution level (as assessed by the contribution evaluation measures mentioned above), which in turn affects the reward rate in IAFL. Subsequently, the reward rates of clients affect their model convergence. Question 3: Does convergence speed become faster as the number of agents N grows? Published as a conference paper at ICLR 2024 Answer: The simple answer is, adding clients that help than hurt is going to improve the convergence speed. To elaborate, the effect depends on the contribution levels of the clients. We can look at the term N PN m=1 PT t=1 T +α t+α 2 1 γ m,t + 1 γ i,t 2 , when N increases, the denominator becomes bigger but at the same time there are additional terms in the summation. We can consider a case where we increase N = n to N = n + 1. Abstracting away the constants HT , T and α for simplicity, we can make a comparison for CT = 1 n Pn m=1 1 γ m,t + 1 γ i,t 2 when N = n against the case for CT = 1 n+1 Pn+1 m=1 1 γ m,t + 1 γ i,t 2 when N = n + 1. Note that γ i,t can also be treated as a constant here. Informally speaking, we observe that when N increases, the term CT generally decreases in its scale if the added client has contribution γ m,t at least the average contribution of the existing clients 1 n Pn m=1 γ m,t. 
Therefore, the decrease in CT tightens the performance bound in Theorem 1 and makes the convergence faster. An exception that we can derive from the above is that when the added clients contribute badly with a very low γ m,t, it could slow down the convergence despite the increase in N. In practice, this implies that the model update provided by the added client is bad, harmful, or even adversarial, and is detected by the contribution evaluation measure. Theoretically, the convergence is adversely affected if we add harmful clients. However, we could easily implement client selection methods based on this insight on contribution evaluation to filter such clients during the training process to ensure faster convergence when N grows. Question 4: Is it fair that for small clients, even if they contribute to their best, they still cannot receive a high-quality model? Yet the large clients only need to contribute above some threshold to receive the best model? Answer: There are two prevalent concepts of fairness in literature: (a) collaborative fairness (Lyu et al., 2020) (i.e., contribute more get back more), and (b) equality (Li et al., 2020c). To be fair, people from one of the camps will not view the other as fair. In our paper, we have demonstrated results for the collaborative fairness view. Specifically, the contributions of clients are fairly evaluated by a contribution evaluation measure, e.g., CGSV (Xu et al., 2021), and then used to fairly reward the clients with a model that has performance commensurate with their contributions. If the designer of the FL process views fairness more towards equality such that if everyone puts in their best efforts, they should be treated equally , it is possible to use individual efforts percentage (e.g., how many percentages a client contributes to his/her best effort) as a contribution evaluation measure (assuming that individual efforts of clients can be measured). This measure can be seamlessly combined with the IAFL framework, too. However, we would like to clarify that individual efforts percentage is not commonly used as a contribution measure in literature.
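To complement the informal argument in the answer to Question 3 above, here is a tiny numerical sketch in which the constants H_T, (T+α)/(t+α), and the tracked client's rate are abstracted away, and the reward-rate values are assumed for illustration. It shows the average term decreasing when an added client's induced reward rate is at least the existing average, and increasing when the added client's rate is very low.

```python
import numpy as np

def c_term(gammas, gamma_i):
    """Average of (1/gamma_m + 1/gamma_i)^2 over the clients, as in the C_T discussion."""
    gammas = np.asarray(gammas, dtype=float)
    return np.mean((1.0 / gammas + 1.0 / gamma_i) ** 2)

existing = [0.9, 0.8, 0.7, 0.6]   # assumed induced reward rates of n = 4 existing clients
gamma_i = 0.8                      # the client whose performance bound we track

print(c_term(existing, gamma_i))             # baseline with n = 4 clients
print(c_term(existing + [0.85], gamma_i))    # add a client above the average rate -> smaller
print(c_term(existing + [0.05], gamma_i))    # add a very low-contributing client -> much larger
```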