Published as a conference paper at ICLR 2020

FAIR RESOURCE ALLOCATION IN FEDERATED LEARNING

Tian Li (CMU) tianli@cmu.edu
Maziar Sanjabi (Facebook AI) maziars@fb.com
Ahmad Beirami (Facebook AI) beirami@fb.com
Virginia Smith (CMU) smithv@cmu.edu

ABSTRACT

Federated learning involves training statistical models in massive, heterogeneous networks. Naively minimizing an aggregate loss function in such a network may disproportionately advantage or disadvantage some of the devices. In this work, we propose q-Fair Federated Learning (q-FFL), a novel optimization objective inspired by fair resource allocation in wireless networks that encourages a more fair (specifically, a more uniform) accuracy distribution across devices in federated networks. To solve q-FFL, we devise a communication-efficient method, q-FedAvg, that is suited to federated networks. We validate both the effectiveness of q-FFL and the efficiency of q-FedAvg on a suite of federated datasets with both convex and non-convex models, and show that q-FFL (along with q-FedAvg) outperforms existing baselines in terms of the resulting fairness, flexibility, and efficiency.

1 INTRODUCTION

Federated learning is an attractive paradigm for fitting a model to data generated by, and residing on, a network of remote devices (McMahan et al., 2017). Unfortunately, naively minimizing an aggregate loss in a large network may disproportionately advantage or disadvantage the model performance on some of the devices. For example, although the accuracy may be high on average, there is no accuracy guarantee for individual devices in the network. This is exacerbated by the fact that the data are often heterogeneous in federated networks, both in terms of size and distribution, and model performance can thus vary widely. In this work, we therefore ask: Can we devise an efficient federated optimization method to encourage a more fair (i.e., more uniform) distribution of the model performance across devices in federated networks?

There has been tremendous recent interest in developing fair methods for machine learning (see, e.g., Cotter et al., 2019; Dwork et al., 2012). However, current approaches do not adequately address concerns in the federated setting. For example, a common definition in the fairness literature is to enforce accuracy parity between protected groups¹ (Zafar et al., 2017a). For devices in massive federated networks, however, it does not make sense for the accuracy to be identical on each device given the significant variability of data in the network. Recent work has taken a step towards addressing this by introducing good-intent fairness, in which the goal is instead to ensure that the training procedure does not overfit a model to any one device at the expense of another (Mohri et al., 2019). However, the proposed objective is rigid in the sense that it only maximizes the performance of the worst-performing device/group, and has only been tested in small networks (for 2-3 devices). In realistic federated learning applications, it is natural to instead seek methods that can flexibly trade off between overall performance and fairness in the network, and can be implemented at scale across hundreds to millions of devices. In this work, we propose q-FFL, a novel optimization objective that addresses fairness issues in federated learning.
Inspired by work in fair resource allocation for wireless networks, q-FFL minimizes an aggregate reweighted loss parameterized by q such that the devices with higher loss are given higher relative weight. We show that this objective encourages a device-level definition¹ of fairness in the federated setting, which generalizes standard accuracy parity by measuring the degree of uniformity in performance across devices.

¹While fairness is typically concerned with performance between groups, we define fairness in the federated setting at a more granular scale, in terms of the devices in the network. We note that devices may naturally combine to form groups, and thus we use these terms interchangeably in the context of prior work.

As a motivating example, we examine the test accuracy distribution of a model trained via a baseline approach (FedAvg) vs. q-FFL in Figure 1. Due to the variation in the data across devices, the model accuracy is quite poor on some devices. By using q-FFL, we can maintain the same overall average accuracy while ensuring a more fair/uniform quality of service across the network. Adaptively minimizing our q-FFL objective results in a flexible framework that can be tuned depending on the desired amount of fairness.

Figure 1: Model performance (e.g., test accuracy) in federated networks can vary widely across devices. Our objective, q-FFL, aims to increase the fairness/uniformity of model performance (increasing the accuracy of the worst-performing devices) while maintaining average performance.

To solve q-FFL in massive federated networks, we additionally propose a lightweight and scalable distributed method, q-FedAvg. Our method carefully accounts for important characteristics of the federated setting such as communication efficiency and low participation of devices (Bonawitz et al., 2019; McMahan et al., 2017). The method also reduces the overhead of tuning the hyperparameter q in q-FFL by dynamically estimating the step-sizes associated with different values of q. Through extensive experiments on federated datasets with both convex and non-convex models, we demonstrate the fairness and flexibility of q-FFL and the efficiency of q-FedAvg compared with existing baselines. In terms of fairness, q-FFL is able to reduce the variance of accuracies across devices by 45% on average while maintaining the same overall average accuracy. In terms of efficiency, our distributed method, q-FedAvg, is capable of solving the proposed objective orders of magnitude more quickly than other baselines. Finally, while we consider our approaches primarily in the context of federated learning, we also demonstrate that q-FFL can be applied to other related problems such as meta-learning, helping to produce fair initializations across multiple tasks.

2 RELATED WORK

Fairness in Resource Allocation. Fair resource allocation has been extensively studied in fields such as network management (Ee & Bajcsy, 2004; Hahne, 1991; Kelly et al., 1998; Neely et al., 2008) and wireless communications (Eryilmaz & Srikant, 2006; Nandagopal et al., 2000; Sanjabi et al., 2014; Shi et al., 2014). In these contexts, the problem is defined as allocating a scarce shared resource, e.g., communication time or power, among many users. In these cases, directly maximizing utilities such as total throughput may lead to unfair allocations where some users receive poor service.
As a service provider, it is important to improve the quality of service for all users while maintaining overall throughput. For this reason, several popular fairness measurements have been proposed to balance between fairness and total throughput, including Jain s index (Jain et al., 1984), entropy (Rényi et al., 1961), max-min/min-max fairness (Radunovic & Le Boudec, 2007), and proportional fairness (Kelly, 1997). A unified framework is captured through α-fairness (Lan et al., 2010; Mo & Walrand, 2000), in which the network manager can tune the emphasis on fairness by changing a single parameter, α. To draw an analogy between federated learning and the problem of resource allocation, one can think of the global model as a resource that is meant to serve the users (or devices). In this sense, it is natural to ask similar questions about the fairness of the service that users receive and use similar tools to promote fairness. Despite this, we are unaware of any works that use α-fairness from resource allocation to modify objectives in machine learning. Inspired by the α-fairness metric, we propose a similarly modified objective, q-Fair Federated Learning (q-FFL), to encourage a more fair accuracy distribution across devices in the context of federated training. Similar to the α-fairness metric, our q-FFL objective is flexible enough to enable trade-offs between fairness and other traditional metrics such as accuracy by changing the parameter q. In Section 4, we show empirically that the use of q-FFL as an objective in federated learning enables a more uniform accuracy distribution across devices significantly reducing variance while maintaining the average accuracy. Fairness in Machine Learning. Fairness is a broad topic that has received much attention in the machine learning community, though the goals often differ from that described in this work. Indeed, Published as a conference paper at ICLR 2020 fairness in machine learning is typically defined as the protection of some specific attribute(s). Two common approaches are to preprocess the data to remove information about the protected attribute, or to post-process the model by adjusting the prediction threshold after classifiers are trained (Feldman, 2015; Hardt et al., 2016; Calmon et al., 2017). Another set of works optimize an objective subject to some fairness constraints during training time (Agarwal et al., 2018; Cotter et al., 2019; Hashimoto et al., 2018; Woodworth et al., 2017; Baharlouei et al., 2020; Zafar et al., 2017a;b; Dwork et al., 2012). Our work also enforces fairness during training, although we define fairness as the uniformity of the accuracy distribution across devices in federated learning (Section 3), as opposed to the protection of a specific attribute. Although some works define accuracy parity to enforce equal error rates among specific groups as a notion of fairness (Zafar et al., 2017a; Cotter et al., 2019), devices in federated networks may not be partitioned by protected attributes, and our goal is not to optimize for identical accuracy across all devices. Cotter et al. (2019) use a notion of minimum accuracy , which is conceptually similar to our goal. However, it requires one optimization constraint for each device, which would result in hundreds to millions of constraints in federated networks. In federated settings, Mohri et al. (2019) recently proposed a minimax optimization scheme, Agnostic Federated Learning (AFL), which optimizes for the performance of the single worst device. 
This method has only been applied at small scales (for a handful of devices). Compared to AFL, our proposed objective is more flexible as it can be tuned based on the desired amount of fairness; AFL can in fact be seen as a special case of our objective, q-FFL, with a large enough q. In Section 4, we demonstrate that the flexibility of our objective results in more favorable accuracy vs. fairness trade-offs than AFL, and that q-FFL can also be solved at scale more efficiently.

Federated Optimization. Federated learning faces challenges such as expensive communication, variability in systems environments in terms of hardware or network connection, and non-identically distributed data across devices (Li et al., 2019). In order to reduce communication and tolerate heterogeneity, optimization methods must be developed to allow for local updating and low participation among devices (McMahan et al., 2017; Smith et al., 2017). We incorporate these key ingredients when designing methods to solve our q-FFL objective efficiently in the federated setting (Section 3.3).

3 FAIR FEDERATED LEARNING

In this section, we first formally define the classical federated learning objective and methods, and introduce our proposed notion of fairness (Section 3.1). We then introduce q-FFL, a novel objective that encourages a more fair (uniform) accuracy distribution across all devices (Section 3.2). Finally, in Section 3.3, we describe q-FedAvg, an efficient distributed method to solve the q-FFL objective in federated settings.

3.1 PRELIMINARIES: FEDERATED LEARNING, FEDAVG, AND FAIRNESS

Federated learning algorithms involve hundreds to millions of remote devices learning locally on their device-generated data and communicating with a central server periodically to reach a global consensus. In particular, the goal is typically to solve:

$$\min_w \; f(w) = \sum_{k=1}^{m} p_k F_k(w), \qquad (1)$$

where m is the total number of devices, p_k ≥ 0, and \sum_k p_k = 1. The local objectives F_k can be defined by empirical risks over local data, i.e., $F_k(w) = \frac{1}{n_k}\sum_{j_k=1}^{n_k} \ell_{j_k}(w)$, where n_k is the number of samples available locally. We can set p_k = n_k/n, where $n = \sum_k n_k$ is the total number of samples, to fit a traditional empirical risk minimization-type objective over the entire dataset.

Most prior work solves (1) by sampling a subset of devices with probabilities p_k at each round, and then running an optimizer such as stochastic gradient descent (SGD) for a variable number of iterations locally on each device. These local updating methods enable flexible and efficient communication compared to traditional mini-batch methods, which would simply calculate a subset of the gradients (Stich, 2019; Wang & Joshi, 2018; Woodworth et al., 2018; Yu et al., 2019). FedAvg (McMahan et al., 2017), summarized in Algorithm 3 in Appendix C.1, is one of the leading methods to solve (1) in non-convex settings. The method runs simply by having each selected device apply E epochs of SGD locally and then averaging the resulting local models.

Unfortunately, solving problem (1) in this manner can implicitly introduce highly variable performance between different devices. For instance, the learned model may be biased towards devices with larger numbers of data points, or (if weighting devices equally) towards commonly occurring devices. More formally, we define our desired fairness criterion for federated learning below.
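As a minimal illustration of objective (1), the sketch below evaluates the weighted aggregate loss on a toy network with p_k = n_k/n. It is illustrative code only (not the released implementation); the data and the logistic loss here are hypothetical stand-ins rather than any of the federated datasets used later.

```python
import numpy as np

# Toy setup: m devices, each with its own (hypothetical) local dataset.
rng = np.random.default_rng(0)
m, d = 5, 10
local_X = [rng.normal(size=(rng.integers(20, 200), d)) for _ in range(m)]
local_y = [rng.integers(0, 2, size=len(X)) for X in local_X]

def local_loss(w, X, y):
    """F_k(w): average logistic loss on one device's local data."""
    margins = (2 * y - 1) * (X @ w)
    return np.mean(np.log1p(np.exp(-margins)))

def federated_objective(w):
    """f(w) = sum_k p_k F_k(w), with p_k = n_k / n as in objective (1)."""
    n_k = np.array([len(X) for X in local_X])
    p = n_k / n_k.sum()
    return sum(p_k * local_loss(w, X, y)
               for p_k, X, y in zip(p, local_X, local_y))

w = np.zeros(d)
print(federated_objective(w))  # aggregate loss of the zero model under the p_k weighting
```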
Definition 1 (Fairness of performance distribution). For trained models w and w̃, we informally say that model w provides a more fair solution to the federated learning objective (1) than model w̃ if the performance of model w on the m devices, {a_1, ..., a_m}, is more uniform than the performance of model w̃ on the m devices.

In this work, we take "performance", a_k, to be the testing accuracy of applying the trained model w on the test data for device k. There are many ways to mathematically evaluate the uniformity of the performance. In this work, we mainly use the variance of the performance distribution as a measure of uniformity. However, we also explore other uniformity metrics, both empirically and theoretically, in Appendix A.1. We note that a tension exists between the fairness/uniformity of the final testing accuracy and the average testing accuracy across devices. In general, our goal is to impose more fairness/uniformity while maintaining the same (or similar) average accuracy.

Remark 2 (Connections to other fairness definitions). Definition 1 targets device-level fairness, which has finer granularity than classical attribute-level fairness such as accuracy parity (Zafar et al., 2017a). We note that in certain cases where devices can be naturally clustered into groups with specific attributes, our definition can be seen as a relaxed version of accuracy parity, in that we optimize for similar but not necessarily identical performance across devices.

3.2 THE OBJECTIVE: q-FAIR FEDERATED LEARNING (q-FFL)

A natural idea to achieve fairness as defined in Definition 1 would be to reweight the objective, assigning higher weights to devices with poor performance, so that the distribution of accuracies in the network shifts towards more uniformity. Note that this reweighting must be done dynamically, as the performance of the devices depends on the model being trained, which cannot be evaluated a priori. Drawing inspiration from α-fairness, a utility function used in fair resource allocation in wireless networks, we propose the following objective. For given local non-negative cost functions F_k and parameter q > 0, we define the q-Fair Federated Learning (q-FFL) objective as:

$$\min_w \; f_q(w) = \sum_{k=1}^{m} \frac{p_k}{q+1} F_k^{q+1}(w), \qquad (2)$$

where $F_k^{q+1}(\cdot)$ denotes $F_k(\cdot)$ raised to the power (q+1). Here, q is a parameter that tunes the amount of fairness we wish to impose. Setting q = 0 does not encourage fairness beyond the classical federated learning objective (1). A larger q means that we emphasize devices with higher local empirical losses, F_k(w), thus imposing more uniformity on the training accuracy distribution and potentially inducing fairness in accordance with Definition 1. With a large enough q, f_q(w) reduces to classical minimax fairness (Mohri et al., 2019), as the device with the worst performance (largest loss) will dominate the objective. We note that while the (q+1) term in the denominator in (2) may be absorbed into p_k, we include it as it is standard in the α-fairness literature and helps to ease notation. For completeness, we provide additional background on α-fairness in Appendix B.

As mentioned previously, q-FFL generalizes prior work in fair federated learning (AFL) (Mohri et al., 2019), allowing for a flexible trade-off between fairness and accuracy as parameterized by q. In our theoretical analysis (Appendix A), we provide generalization bounds for q-FFL that generalize the learning bounds of the AFL objective. Moreover, based on our fairness definition (Definition 1), we theoretically explore how q-FFL results in more uniform accuracy distributions with increasing q. Our results suggest that q-FFL is able to impose uniformity of the test accuracy distribution in terms of various metrics, such as variance and other geometric and information-theoretic measures. In our experiments (Section 4.2), on both convex and non-convex models, we show that using the q-FFL objective, we can obtain fairer/more uniform solutions for federated datasets in terms of both the training and testing accuracy distributions.
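A minimal sketch of objective (2), and of the dynamic device reweighting it induces, is given below. The code is illustrative rather than the released implementation, and the loss values and weights are hypothetical.

```python
import numpy as np

def qffl_objective(local_losses, p, q):
    """f_q(w) = sum_k p_k / (q+1) * F_k(w)^(q+1), i.e., objective (2).

    local_losses: current local losses F_k(w), one entry per device.
    p: device weights p_k (non-negative, summing to 1).
    q: fairness parameter; q = 0 recovers the standard objective (1) up to scaling.
    """
    F = np.asarray(local_losses, dtype=float)
    return np.sum(np.asarray(p) / (q + 1) * F ** (q + 1))

def qffl_device_weights(local_losses, p, q):
    """grad f_q(w) = sum_k p_k F_k(w)^q grad F_k(w), so device k's gradient is
    effectively scaled by p_k * F_k(w)^q (normalized here for display)."""
    w = np.asarray(p) * np.asarray(local_losses, dtype=float) ** q
    return w / w.sum()

# Hypothetical snapshot of local losses on four devices.
losses = [0.2, 0.4, 0.8, 1.6]
p = np.full(4, 0.25)
for q in [0.0, 1.0, 5.0]:
    print(q, round(qffl_objective(losses, p, q), 4),
          np.round(qffl_device_weights(losses, p, q), 3))
# As q grows, the weight concentrates on the device with the largest loss,
# approaching the minimax (AFL-style) behavior discussed above.
```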
3.3 THE SOLVER: FEDAVG-STYLE q-FAIR FEDERATED LEARNING (q-FEDAVG)

In developing a functional approach for fair federated learning, it is critical to consider not only what objective to solve but also how to solve such an objective efficiently in a massive distributed network. In this section, we provide methods to solve q-FFL. We start with a simpler method, q-FedSGD, to illustrate our main techniques. We then provide a more efficient counterpart, q-FedAvg, by considering local updating schemes. Our proposed methods closely mirror the traditional distributed optimization methods mini-batch SGD and federated averaging (FedAvg), but with step-sizes and subproblems carefully chosen in accordance with the q-FFL problem (2).

Achieving variable levels of fairness: tuning q. In devising a method to solve q-FFL (2), we begin by noting that it is crucial to first determine how to set q. In practice, q can be tuned based on the desired amount of fairness (with larger q inducing more fairness). As we describe in our experiments (Section 4.2), it is therefore common to train a family of objectives for different q values so that a practitioner can explore the trade-off between accuracy and fairness for the application at hand. One concern with solving such a family of objectives is that it requires step-size tuning for every value of q. In particular, in gradient-based methods, the step-size inversely depends on the Lipschitz constant of the function's gradient, which will change as we change q. This can quickly cause the search space to explode. To overcome this issue, we propose estimating the local Lipschitz constant for the family of q-FFL objectives by using the Lipschitz constant we infer by tuning the step-size (via grid search) on just one q (e.g., q = 0). This allows us to dynamically adjust the step-size of our gradient-based optimization method for the q-FFL objective, avoiding manual tuning for each q. In Lemma 3 below we formalize the relation between the Lipschitz constant L for q = 0 and q > 0.

Lemma 3. If the non-negative function $f(\cdot)$ has a Lipschitz gradient with constant L, then for any q ≥ 0 and at any point w,

$$L_q(w) = L f(w)^q + q f(w)^{q-1} \|\nabla f(w)\|^2 \qquad (3)$$

is an upper bound for the local Lipschitz constant of the gradient of $\frac{1}{q+1} f^{q+1}(\cdot)$ at point w.

Proof. At any point w, we can compute the Hessian of $\frac{1}{q+1} f^{q+1}(w)$ as:

$$\nabla^2 \Big(\tfrac{1}{q+1} f^{q+1}(w)\Big) = q f^{q-1}(w)\, \nabla f(w) \nabla f(w)^\top + f^q(w)\, \nabla^2 f(w),$$

where $\nabla f(w)\nabla f(w)^\top \preceq \|\nabla f(w)\|^2 I$ and $\nabla^2 f(w) \preceq L I$. As a result,

$$\Big\|\nabla^2 \tfrac{1}{q+1} f^{q+1}(w)\Big\|_2 \leq L_q(w) = L f(w)^q + q f(w)^{q-1} \|\nabla f(w)\|^2.$$
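The bound in Lemma 3 can be computed from quantities already available on each device. The sketch below is illustrative code only: it assumes a Lipschitz estimate L obtained by tuning at q = 0, and evaluates L_q and the corresponding dynamic step-size 1/L_q for several values of q.

```python
import numpy as np

def local_lipschitz_bound(F_k, grad_F_k, L, q):
    """Upper bound on the local Lipschitz constant of grad (1/(q+1) * F_k^(q+1)):
       L_q = L * F_k^q + q * F_k^(q-1) * ||grad F_k||^2   (Lemma 3).
    F_k: current local loss value (scalar); grad_F_k: current local gradient;
    L: Lipschitz estimate inferred by tuning the step-size at q = 0."""
    grad_sq = float(np.dot(grad_F_k, grad_F_k))
    bound = L * F_k ** q
    if q > 0:  # the second term vanishes at q = 0
        bound += q * F_k ** (q - 1) * grad_sq
    return bound

# Example: reuse one tuned L for several q's instead of re-tuning the step-size.
L = 10.0                                   # assumed estimate from grid search at q = 0
F_k, grad_F_k = 0.7, np.array([0.3, -0.1, 0.5])
for q in [0.0, 1.0, 5.0]:
    Lq = local_lipschitz_bound(F_k, grad_F_k, L, q)
    print(f"q={q}: L_q={Lq:.3f}, dynamic step-size 1/L_q={1 / Lq:.3f}")
```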
A first approach: q-FedSGD. Our first fair federated learning method, q-FedSGD, is an extension of the well-known federated mini-batch SGD (FedSGD) method (McMahan et al., 2017). q-FedSGD uses a dynamic step-size instead of the fixed step-size normally used by FedSGD. Based on Lemma 3, for each local device k, the upper bound on the local Lipschitz constant is $L F_k(w)^q + q F_k(w)^{q-1} \|\nabla F_k(w)\|^2$. In each step of q-FedSGD, F_k and ∇F_k on each selected device k are computed at the current iterate and communicated to the central node. This information is used to compute the step-sizes (weights) for combining the updates from each device. The details are summarized in Algorithm 1. Note that q-FedSGD reduces to FedSGD when q = 0. It is also important to note that to run q-FedSGD with different values of q, we only need to estimate L once, by tuning the step-size on q = 0, and can then reuse it for all values of q > 0.

Improving communication efficiency: q-FedAvg. In federated settings, communication-efficient schemes using local stochastic solvers (such as FedAvg) have been shown to significantly improve convergence speed (McMahan et al., 2017). However, when q > 0, the $F_k^{q+1}$ term is not an empirical average of the loss over all local samples due to the (q+1) exponent, preventing the use of local SGD as in FedAvg. To address this, we propose to generalize FedAvg for q > 0 using a more sophisticated dynamic weighted averaging scheme. The weights (step-sizes) are inferred from the upper bound of the local Lipschitz constants of the gradients of $F_k^{q+1}$, similar to q-FedSGD. To extend the local updating technique of FedAvg to the q-FFL objective (2), we propose a heuristic where we replace the gradient ∇F_k in the q-FedSGD steps with the local updates obtained by running SGD locally on device k. Similarly, q-FedAvg reduces to FedAvg when q = 0. We provide additional details on q-FedAvg in Algorithm 2.

Algorithm 1: q-FedSGD
1: Input: K, T, q, 1/L, w^0, p_k, k = 1, ..., m
2: for t = 0, ..., T-1 do
3:   Server selects a subset S_t of K devices at random (each device k is chosen with probability p_k)
4:   Server sends w^t to all selected devices
5:   Each selected device k computes:
       Δ_k^t = F_k^q(w^t) ∇F_k(w^t)
       h_k^t = q F_k^{q-1}(w^t) ||∇F_k(w^t)||^2 + L F_k^q(w^t)
6:   Each selected device k sends Δ_k^t and h_k^t back to the server
7:   Server updates w^{t+1} as: w^{t+1} = w^t − (Σ_{k∈S_t} Δ_k^t) / (Σ_{k∈S_t} h_k^t)
8: end for

Algorithm 2: q-FedAvg
1: Input: K, E, T, q, 1/L, η, w^0, p_k, k = 1, ..., m
2: for t = 0, ..., T-1 do
3:   Server selects a subset S_t of K devices at random (each device k is chosen with probability p_k)
4:   Server sends w^t to all selected devices
5:   Each selected device k updates w^t for E epochs of SGD on F_k with step-size η to obtain w̄_k^{t+1}
6:   Each selected device k computes:
       Δw_k^t = L (w^t − w̄_k^{t+1})
       Δ_k^t = F_k^q(w^t) Δw_k^t
       h_k^t = q F_k^{q-1}(w^t) ||Δw_k^t||^2 + L F_k^q(w^t)
7:   Each selected device k sends Δ_k^t and h_k^t back to the server
8:   Server updates w^{t+1} as: w^{t+1} = w^t − (Σ_{k∈S_t} Δ_k^t) / (Σ_{k∈S_t} h_k^t)
9: end for

As we will see empirically, q-FedAvg can solve the q-FFL objective much more efficiently than q-FedSGD due to the local updating heuristic (an illustrative sketch of one aggregation round is given below, after the evaluation overview). Finally, recall that as q → ∞, the q-FFL objective recovers that of AFL. However, we empirically notice that q-FedAvg has a more favorable convergence speed compared to AFL while resulting in similar performance across devices (see Figure 9 in the appendix).

4 EVALUATION

We now present empirical results of the proposed objective, q-FFL, and proposed methods, q-FedAvg and q-FedSGD. We describe our experimental setup in Section 4.1. We then demonstrate the improved fairness of q-FFL in Section 4.2, and compare q-FFL with several baseline fairness objectives in Section 4.3. Finally, we show the efficiency of q-FedAvg compared with q-FedSGD in Section 4.4. All code, data, and experiments are publicly available at github.com/litian96/fair_flearn.
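As a concrete illustration of Algorithm 2 above, the following sketch simulates one communication round of q-FedAvg on a toy least-squares problem. It is a simplified rendering, not the released fair_flearn code: devices are sampled uniformly rather than with probabilities p_k, and the data, model, and hyperparameter values are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 10, 5
# Hypothetical per-device problems: F_k(w) = mean squared error on device k's data.
devices = [(rng.normal(size=(50, d)), rng.normal(size=50)) for _ in range(m)]

def local_loss(w, X, y):
    return 0.5 * np.mean((X @ w - y) ** 2)

def local_grad(w, X, y):
    return X.T @ (X @ w - y) / len(y)

def local_sgd(w, X, y, epochs, lr, batch=10):
    """E epochs of mini-batch SGD on one device (line 5 of Algorithm 2)."""
    w = w.copy()
    for _ in range(epochs):
        idx = rng.permutation(len(y))
        for s in range(0, len(y), batch):
            b = idx[s:s + batch]
            w -= lr * local_grad(w, X[b], y[b])
    return w

def qfedavg_round(w, q, L, eta, E, K):
    """One communication round of q-FedAvg; devices sampled uniformly for simplicity."""
    selected = rng.choice(m, size=K, replace=False)
    deltas, hs = [], []
    for k in selected:
        X, y = devices[k]
        F_k = local_loss(w, X, y)
        w_bar = local_sgd(w, X, y, epochs=E, lr=eta)
        dw = L * (w - w_bar)                                             # Δw_k^t
        deltas.append(F_k ** q * dw)                                     # Δ_k^t
        hs.append(q * F_k ** (q - 1) * np.dot(dw, dw) + L * F_k ** q)    # h_k^t
    return w - np.sum(deltas, axis=0) / np.sum(hs)                       # server update

w = np.zeros(d)
for _ in range(20):
    w = qfedavg_round(w, q=1.0, L=10.0, eta=0.01, E=1, K=5)
print([round(local_loss(w, X, y), 3) for X, y in devices])  # per-device losses after training
```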
4.1 EXPERIMENTAL SETUP Federated datasets. We explore a suite of federated datasets using both convex and non-convex models in our experiments. The datasets are curated from prior work in federated learning (Mc Mahan et al., 2017; Smith et al., 2017; Li et al., 2020; Mohri et al., 2019) as well as recent federated learning benchmarks (Caldas et al., 2018). In particular, we study: (1) a synthetic dataset using a linear regression classifier, (2) a Vehicle dataset collected from a distributed sensor network (Duarte & Hu, 2004) with a linear SVM for binary classification, (3) tweet data curated from Sentiment140 (Go et al., 2009) (Sent140) with an LSTM classifier for text sentiment analysis, and (4) text data built Published as a conference paper at ICLR 2020 from The Complete Works of William Shakespeare (Mc Mahan et al., 2017) and an RNN to predict the next character. When comparing with AFL, we use the two small benchmark datasets (Fashion MNIST (Xiao et al., 2017) and Adult (Blake, 1998)) studied in Mohri et al. (2019). When applying q-FFL to meta-learning, we use the common meta-learning benchmark dataset Omniglot (Lake et al., 2015). Full dataset details are given in Appendix D.1. Implementation. We implement all code in Tensorflow (Abadi et al., 2016), simulating a federated network with one server and m devices, where m is the total number of devices in the dataset (Appendix D.1). We provide full details (including all hyperparameter values) in Appendix D.2. 4.2 FAIRNESS OF q-FFL In our first experiments, we verify that the proposed objective q-FFL leads to more fair solutions (Definition 1) for federated data. In Figure 2, we compare the final testing accuracy distributions of two objectives (q = 0 and a tuned value of q > 0) averaged across 5 random shuffles of each dataset. We observe that while the average testing accuracy remains fairly consistent, the objectives with q > 0 result in more centered (i.e., fair) testing accuracy distributions with lower variance. In particular, while maintaining roughly the same average accuracy, q-FFL reduces the variance of accuracies across all devices by 45% on average. We further report the worst and best 10% testing accuracies and the variance of the final accuracy distributions in Table 1. Comparing q = 0 and q > 0, we see that the average testing accuracy remains almost unchanged with the proposed objective despite significant reductions in variance. We report full results on all uniformity measurements (including variance) in Table 5 in the appendix, and show that q-FFL encourages more uniform accuracies under other metrics as well. We observe similar results on training accuracy distributions in Figure 6 and Table 6, Appendix E. In Table 1, the average accuracy is with respect to all data points, not all devices; however, we observe similar results with respect to devices, as shown in Table 7, Appendix E. Figure 2: q-FFL leads to fairer test accuracy distributions. While the average accuracy remains almost identical (see Table 1), by setting q > 0, the distributions shift towards the center as low accuracies increase at the cost of potentially decreasing high accuracies on some devices. Setting q = 0 corresponds to the original objective (1). The selected q values for q > 0 on the four datasets, as well as distribution statistics, are also shown in Table 1. Table 1: Statistics of the test accuracy distribution for q-FFL. 
By setting q > 0, the accuracy of the worst 10% of devices is increased at the cost of possibly decreasing the accuracy of the best 10% of devices. While the average accuracy remains similar, the variance of the final accuracy distribution decreases significantly. We provide full results of other uniformity measurements (including variance) in Table 5, Appendix E.1, and show that q-FFL encourages more uniform distributions under all metrics.

Dataset     | Objective | Average (%)  | Worst 10% (%) | Best 10% (%)  | Variance
------------|-----------|--------------|---------------|---------------|-----------
Synthetic   | q = 0     | 80.8 ± 0.9   | 18.8 ± 5.0    | 100.0 ± 0.0   | 724 ± 72
Synthetic   | q = 1     | 79.0 ± 1.2   | 31.1 ± 1.8    | 100.0 ± 0.0   | 472 ± 14
Vehicle     | q = 0     | 87.3 ± 0.5   | 43.0 ± 1.0    | 95.7 ± 1.0    | 291 ± 18
Vehicle     | q = 5     | 87.7 ± 0.7   | 69.9 ± 0.6    | 94.0 ± 0.9    | 48 ± 5
Sent140     | q = 0     | 65.1 ± 4.8   | 15.9 ± 4.9    | 100.0 ± 0.0   | 697 ± 132
Sent140     | q = 1     | 66.5 ± 0.2   | 23.0 ± 1.4    | 100.0 ± 0.0   | 509 ± 30
Shakespeare | q = 0     | 51.1 ± 0.3   | 39.7 ± 2.8    | 72.9 ± 6.7    | 82 ± 41
Shakespeare | q = .001  | 52.1 ± 0.3   | 42.1 ± 2.1    | 69.0 ± 4.4    | 54 ± 27

Choosing q. As discussed in Section 3.3, a natural question is how q should be tuned in the q-FFL objective. Our framework is flexible in that it allows one to choose q to trade off between fairness/uniformity and average accuracy. We empirically show that a family of q's can result in variable levels of fairness (and accuracy) on synthetic data in Table 11, Appendix E. In general, this value can be tuned based on the data/application at hand and the desired amount of fairness. Another reasonable approach in practice would be to run Algorithm 2 with multiple q's in parallel to obtain multiple final global models, and then select amongst these based on performance (e.g., accuracy) on the validation data. Rather than using just one optimal q for all devices, for example, each device could pick a device-specific model based on its validation data; a minimal sketch of this selection step follows below. We show additional performance improvements with this device-specific strategy in Table 12 in Appendix E. Finally, we note that one potential issue is that increasing the value of q may slow the speed of convergence. However, for values of q that result in more fair solutions on our datasets, we do not observe a significant decrease in convergence speed, as shown in Figure 8, Appendix E.
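The device-specific selection strategy mentioned above amounts to a few lines of post-processing. The sketch below is illustrative only; the validation accuracies are synthetic and the helper names are hypothetical.

```python
import numpy as np

def select_device_specific_models(val_acc_by_q):
    """val_acc_by_q: dict mapping each q to an array of per-device validation accuracies
    (one entry per device, from the model trained with that q).
    Returns, for every device, the q whose model performs best on its validation data."""
    qs = sorted(val_acc_by_q)
    acc = np.stack([val_acc_by_q[q] for q in qs])   # shape: (num_q, num_devices)
    return [qs[i] for i in np.argmax(acc, axis=0)]

def distribution_stats(acc):
    """Worst/best 10% device accuracies and variance of the per-device accuracy
    distribution (note: the averages in Table 1 are over data points, not devices)."""
    acc = np.sort(np.asarray(acc, dtype=float))
    k = max(1, int(0.1 * len(acc)))
    return {"device average": acc.mean(), "worst 10%": acc[:k].mean(),
            "best 10%": acc[-k:].mean(), "variance": acc.var()}

# Hypothetical validation accuracies for 100 devices under two values of q.
rng = np.random.default_rng(2)
val_acc_by_q = {0.0: rng.uniform(0.2, 1.0, size=100),
                1.0: rng.uniform(0.4, 0.95, size=100)}
per_device_q = select_device_specific_models(val_acc_by_q)
print(per_device_q[:10])
print(distribution_stats(val_acc_by_q[1.0]))
```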
4.3 COMPARISON WITH OTHER OBJECTIVES

Next, we compare q-FFL with other objectives that are likely to impose fairness in federated networks. One heuristic is to weight each data point equally, which reduces to the original objective in (1) (i.e., q-FFL with q = 0) and has been investigated in Section 4.2. We additionally compare with two alternatives: weighting devices equally when sampling devices, and weighting devices adversarially, namely, optimizing for the worst-performing device, as proposed in Mohri et al. (2019).

Weighting devices equally. We compare q-FFL with uniform sampling schemes and report testing accuracy in Figure 3. A table with the final accuracies and three fairness metrics is given in the appendix in Table 9. While the weighting-each-device-equally heuristic tends to outperform our method in training accuracy distributions (Figure 7 and Table 8 in Appendix E), we see that our method produces more fair solutions in terms of testing accuracies. One explanation for this is that uniform sampling is a static method and can easily overfit to devices with very few data points, whereas q-FFL will put less weight on a device once its loss becomes small, potentially providing better generalization performance due to its dynamic nature.

Figure 3: q-FFL (q > 0) compared with uniform sampling. In terms of testing accuracy, our objective produces more fair solutions than uniform sampling. Distribution statistics are provided in Table 9 in the appendix. q-FFL achieves similar average accuracies and more fair solutions.

Weighting devices adversarially. We further compare with AFL (Mohri et al., 2019), which is the only work we are aware of that aims to address fairness issues in federated learning. We implement a non-stochastic version of AFL where all devices are selected and updated each round, and perform grid search on the AFL hyperparameters, γ_w and γ_λ. In order to devise a setup that is as favorable to AFL as possible, we modify Algorithm 2 by sampling all devices and letting each of them run gradient descent at each round. We use the same small datasets (Adult (Blake, 1998) and subsampled Fashion MNIST (Xiao et al., 2017)) and the same logistic regression model as in Mohri et al. (2019). Full details of the implementation and hyperparameters (e.g., the values of q1 and q2) are provided in Appendix D.2.3. We note that, as opposed to AFL, q-FFL is flexible depending on the amount of fairness desired, with larger q leading to more accuracy uniformity. As discussed, q-FFL generalizes AFL in this regard, as AFL is equivalent to q-FFL with a large enough q. In Table 2, we observe that q-FFL can in fact achieve higher testing accuracy than AFL on the device with the worst performance (i.e., the problem that AFL was designed to solve) with an appropriate q. This also indicates that q-FFL obtains the most fair solutions in certain cases. We also observe that q-FFL converges faster than AFL in terms of communication rounds to obtain similar performance (Appendix E), which we speculate is due to the non-smoothness of the AFL objective.

Table 2: Our objective compared with weighting devices adversarially (AFL (Mohri et al., 2019)). In order to be favorable to AFL, we use the two small, well-behaved datasets studied in Mohri et al. (2019). q-FFL (q > 0) outperforms AFL on the worst testing accuracy of both datasets. The tunable parameter q controls how much fairness we would like to achieve; larger q induces less variance. Each accuracy is averaged across 5 runs with different random initializations.

Objective      | Adult: average (%) | Adult: PhD (%) | Adult: non-PhD (%) | Fashion MNIST: average (%) | shirt (%)  | pullover (%) | T-shirt (%)
---------------|--------------------|----------------|--------------------|----------------------------|------------|--------------|-------------
q-FFL, q = 0   | 83.2 ± .1          | 69.9 ± .4      | 83.3 ± .1          | 78.8 ± .2                  | 66.0 ± .7  | 84.5 ± .8    | 85.9 ± .7
AFL            | 82.5 ± .5          | 73.0 ± 2.2     | 82.6 ± .5          | 77.8 ± 1.2                 | 71.4 ± 4.2 | 81.0 ± 3.6   | 82.1 ± 3.9
q-FFL, q1 > 0  | 82.6 ± .1          | 74.1 ± .6      | 82.7 ± .1          | 77.8 ± .2                  | 74.2 ± .3  | 78.9 ± .4    | 80.4 ± .6
q-FFL, q2 > q1 | 82.3 ± .1          | 74.4 ± .9      | 82.4 ± .1          | 77.1 ± .4                  | 74.7 ± .9  | 77.9 ± .4    | 78.7 ± .6

4.4 EFFICIENCY OF THE METHOD q-FEDAVG

In this section, we show the efficiency of our proposed distributed solver, q-FedAvg, by comparing Algorithm 2 with its non-local-updating baseline, q-FedSGD (Algorithm 1), on the same objective (the same q > 0 as in Table 1). At each communication round, we have each method perform the same amount of computation, with q-FedAvg running one epoch of local updates on each selected device while q-FedSGD runs gradient descent with the local training data. In Figure 4, q-FedAvg converges faster than q-FedSGD in terms of communication rounds in most cases due to its local updating scheme. The slower convergence of q-FedAvg compared with q-FedSGD on the synthetic dataset may be due to the fact that when local data distributions are highly heterogeneous, local updating schemes may allow local models to move too far away from the initial global model, potentially hurting convergence; see Figure 10 in Appendix E for more details.
Figure 4 (testing accuracy vs. number of communication rounds, one panel per dataset, including Shakespeare): For a fixed objective (i.e., q-FFL with the same q), the convergence of q-FedAvg (Alg. 2), q-FedSGD (Alg. 1), and FedSGD. For q-FedAvg and q-FedSGD, we tune a best step-size on q = 0 and apply that step-size to solve q-FFL with q > 0. For FedSGD, we tune the step-size directly. We observe that (1) q-FedAvg converges faster in terms of communication rounds; (2) our proposed q-FedSGD solver with a dynamic step-size achieves similar convergence behavior compared with a best-tuned FedSGD.

To demonstrate the optimality of our dynamic step-size strategy in terms of solving q-FFL, we also compare our solver q-FedSGD with FedSGD using a best-tuned step-size. For q-FedSGD, we tune a step-size on q = 0 and apply that step-size to solve q-FFL with q > 0. q-FedSGD has similar performance to FedSGD, which indicates that (the inverse of) our estimated Lipschitz constant on q > 0 is as good as a best-tuned fixed step-size. We can reuse this estimate for different q's instead of manually re-tuning it when q changes. We show the full results on other datasets in Appendix E. We note that both proposed methods, q-FedAvg and q-FedSGD, can be easily integrated into existing implementations of federated learning algorithms such as TensorFlow Federated (TFF).

4.5 BEYOND FEDERATED LEARNING: APPLYING q-FFL TO META-LEARNING

Finally, we generalize the proposed q-FFL objective to other learning tasks beyond federated learning. One natural extension is to apply q-FFL to meta-learning, where each task can be viewed as a device in federated networks. The goal of meta-learning is to learn a model initialization such that it can be quickly adapted to new tasks using limited training samples. However, as the new tasks can be heterogeneous, the performance distribution of the final personalized models may also be non-uniform. Therefore, we aim to learn a better initialization such that it can quickly solve unseen tasks in a fair manner, i.e., reduce the variance of the accuracy distribution of the personalized models.

To achieve this goal, we propose a new method, q-MAML, by combining q-FFL with the popular meta-learning method MAML (Finn et al., 2017). In particular, instead of updating the global model in the way described in MAML, we update the global parameters using the gradients of the q-FFL objective $\frac{1}{q+1} F_k^{q+1}(w)$, with weights inferred from Lemma 3. Similarly, q-MAML with q = 0 reduces to MAML, and q-MAML with q → ∞ corresponds to MAML with a most fair initialization and a potentially lower average accuracy. The detailed algorithm is summarized in Algorithm 4 in Appendix C.2.
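A rough sketch of the q-MAML outer update (Algorithm 4, Appendix C.2) on a toy regression problem is given below. It is our own illustrative code, not the implementation used for the Omniglot experiments; the tasks, splits, and hyperparameter values are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
d, num_tasks = 5, 20
# Hypothetical tasks: each task k is a small least-squares problem with its own splits.
tasks = [{"train": (rng.normal(size=(10, d)), rng.normal(size=10)),
          "test": (rng.normal(size=(10, d)), rng.normal(size=10))} for _ in range(num_tasks)]

def loss(w, data):
    X, y = data
    return 0.5 * np.mean((X @ w - y) ** 2)

def grad(w, data):
    X, y = data
    return X.T @ (X @ w - y) / len(y)

def qmaml_round(w, q, L, eta, K):
    """One outer iteration of q-MAML: a MAML inner adaptation step on each sampled
    task's training split, then a q-FFL-weighted update using the adapted gradients."""
    sampled = rng.choice(num_tasks, size=K, replace=False)
    deltas, hs = [], []
    for k in sampled:
        w_k = w - eta * grad(w, tasks[k]["train"])     # inner adaptation step
        g_k = grad(w_k, tasks[k]["test"])              # gradient at adapted parameters
        F_k = loss(w_k, tasks[k]["test"])
        deltas.append(F_k ** q * g_k)
        hs.append(q * F_k ** (q - 1) * np.dot(g_k, g_k) + L * F_k ** q)
    return w - np.sum(deltas, axis=0) / np.sum(hs)

w = np.zeros(d)
for _ in range(50):
    w = qmaml_round(w, q=0.1, L=10.0, eta=0.05, K=10)
print(round(np.mean([loss(w, t["test"]) for t in tasks]), 3))  # average meta-test loss
```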
We sample 10 tasks at each round during meta-training, and train for 5 iterations of (mini-batch) SGD for personalization on the meta-testing tasks. We report the test accuracy of the personalized models on the meta-testing tasks. From Figure 5 and Table 3, we observe that q-MAML is able to learn initializations which result in fairer personalized models with lower variance.

Figure 5 (x-axis: testing accuracy): q-FFL results in fairer (more centered) initializations for meta-learning tasks.

Table 3: Statistics of the accuracy distribution of personalized models using q-MAML. The method with q = 0 corresponds to MAML. Similarly, we see that the variance is reduced while the accuracy of the worst 10% of devices is increased.

Dataset  | Objective | Average (%) | Worst 10% (%) | Best 10% (%) | Variance
---------|-----------|-------------|---------------|--------------|---------
Omniglot | q = 0     | 79.1 ± 9.8  | 61.2 ± 3.2    | 94.0 ± 0.5   | 93 ± 23
Omniglot | q = .1    | 79.3 ± 9.6  | 62.5 ± 5.3    | 93.8 ± 0.9   | 86 ± 28

5 CONCLUSION

In this work, we propose q-FFL, a novel optimization objective inspired by fair resource allocation in wireless networks that encourages fairer (more uniform) accuracy distributions across devices in federated learning. We devise a scalable method, q-FedAvg, to solve this objective in massive networks. Our empirical evaluation on a suite of federated datasets demonstrates the resulting fairness and flexibility of q-FFL, as well as the efficiency of q-FedAvg compared with existing baselines. We show that our framework is useful not only for federated learning tasks, but also for other learning paradigms such as meta-learning.

ACKNOWLEDGMENTS

We thank Sebastian Caldas, Chen Dan, Neel Guha, Anit Kumar Sahu, Eric Tan, and Samuel Yeom for their helpful discussions and comments. The work of TL and VS was supported in part by National Science Foundation grant IIS1838017, a Google Faculty Award, a Carnegie Bosch Institute Research Award, and the CONIX Research Center. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or any other funding agency.

REFERENCES

TensorFlow Federated: Machine learning on decentralized data. URL https://www.tensorflow.org/federated. Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. In Operating Systems Design and Implementation, 2016. Alekh Agarwal, Alina Beygelzimer, Miroslav Dudik, John Langford, and Hanna Wallach. A reductions approach to fair classification. In International Conference on Machine Learning, 2018. Sina Baharlouei, Maher Nouiehed, Ahmad Beirami, and Meisam Razaviyayn. Rényi fair inference. In International Conference on Learning Representations, 2020. Ahmad Beirami, Robert Calderbank, Mark M Christiansen, Ken R Duffy, and Muriel Médard. A characterization of guesswork on swiftly tilting curves. IEEE Transactions on Information Theory, 65(5):2850-2871, 2019. Catherine L Blake. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository, 1998. Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konečný, Stefano Mazzocchi, H Brendan McMahan, Timon Van Overveldt, David Petrou, Daniel Ramage, and Jason Roselander. Towards federated learning at scale: System design. In Conference on Machine Learning and Systems, 2019. Sebastian Caldas, Peter Wu, Tian Li, Jakub Konečný, H Brendan McMahan, Virginia Smith, and Ameet Talwalkar. Leaf: A benchmark for federated settings.
ar Xiv preprint ar Xiv:1812.01097, 2018. Flavio Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R Varshney. Optimized pre-processing for discrimination prevention. In Advances in Neural Information Processing Systems, 2017. Andrew Cotter, Heinrich Jiang, Serena Wang, Taman Narayan, Maya Gupta, Seungil You, and Karthik Sridharan. Optimization with non-differentiable constraints with applications to fairness, recall, churn, and other goals. Journal of Machine Learning Research, 2019. Mina Dashti, Paeiz Azmi, and Keivan Navaie. Harmonic mean rate fairness for cognitive radio networks with heterogeneous traffic. Transactions on Emerging Telecommunications Technologies, 2013. Marco F Duarte and Yu Hen Hu. Vehicle classification in distributed sensor networks. Journal of Parallel and Distributed Computing, 2004. Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Innovations in Theoretical Computer Science, 2012. Cheng Tien Ee and Ruzena Bajcsy. Congestion control and fairness for many-to-one routing in sensor networks. In International Conference on Embedded Networked Sensor Systems, 2004. Atilla Eryilmaz and R Srikant. Joint congestion control, routing, and mac for stability and fairness in wireless networks. IEEE Journal on Selected Areas in Communications, 2006. Michael Feldman. Computational fairness: Preventing machine-learned discrimination. 2015. Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, 2017. Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using distant supervision. Stanford CS224N Project Report, 2009. Published as a conference paper at ICLR 2020 Ellen L. Hahne. Round-robin scheduling for max-min fairness in data networks. IEEE Journal on Selected Areas in communications, 1991. Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, 2016. Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, 2018. Rajendra K Jain, Dah-Ming W Chiu, and William R Hawe. A quantitative measure of fairness and discrimination. Eastern Research Laboratory, Digital Equipment Corporation, 1984. Frank Kelly. Charging and rate control for elastic traffic. European Transactions on Telecommunications, 1997. Frank P Kelly, Aman K Maulloo, and David KH Tan. Rate control for communication networks: shadow prices, proportional fairness and stability. Journal of the Operational Research society, 1998. Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 2015. Tian Lan, David Kao, Mung Chiang, and Ashutosh Sabharwal. An axiomatic theory of fairness in network resource allocation. In Conference on Information Communications, 2010. Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. ar Xiv preprint ar Xiv:1908.07873, 2019. Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. In Conference on Machine Learning and Systems, 2020. H Brendan Mc Mahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. 
Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics, 2017. Jeonghoon Mo and Jean Walrand. Fair end-to-end window-based congestion control. IEEE/ACM Transactions on Networking, 2000. Mehryar Mohri, Gary Sivek, and Ananda Theertha Suresh. Agnostic federated learning. In International Conference on Machine Learning, 2019. Thyagarajan Nandagopal, Tae-Eun Kim, Xia Gao, and Vaduvur Bharghavan. Achieving mac layer fairness in wireless packet networks. In International Conference on Mobile Computing and Networking, 2000. Michael J Neely, Eytan Modiano, and Chih-Ping Li. Fairness and optimal stochastic control for heterogeneous networks. IEEE/ACM Transactions On Networking, 2008. Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing, 2014. Bozidar Radunovic and Jean-Yves Le Boudec. A unified framework for max-min and min-max fairness with applications. IEEE/ACM Transactions on Networking, 2007. Alfréd Rényi et al. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1961. Maziar Sanjabi, Meisam Razaviyayn, and Zhi-Quan Luo. Optimal joint base station assignment and beamforming for heterogeneous networks. IEEE Transactions on Signal Processing, 2014. Ohad Shamir, Nati Srebro, and Tong Zhang. Communication-efficient distributed optimization using an approximate newton-type method. In International Conference on Machine Learning, 2014. Published as a conference paper at ICLR 2020 Huaizhou Shi, R Venkatesha Prasad, Ertan Onur, and IGMM Niemegeers. Fairness in wireless networks: Issues, measures and challenges. IEEE Communications Surveys and Tutorials, 2014. Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. Federated multi-task learning. In Advances in Neural Information Processing Systems, 2017. Virginia Smith, Simone Forte, Ma Chenxin, Martin Takáˇc, Michael I Jordan, and Martin Jaggi. Cocoa: A general framework for communication-efficient distributed optimization. Journal of Machine Learning Research, 2018. Sebastian U Stich. Local sgd converges fast and communicates little. In International Conference on Learning Representations, 2019. Jianyu Wang and Gauri Joshi. Cooperative sgd: A unified framework for the design and analysis of communication-efficient sgd algorithms. ar Xiv preprint ar Xiv:1808.07576, 2018. Blake Woodworth, Suriya Gunasekar, Mesrob I. Ohannessian, and Nathan Srebro. Learning nondiscriminatory predictors. In Conference on Learning Theory, 2017. Blake E Woodworth, Jialei Wang, Adam Smith, Brendan Mc Mahan, and Nati Srebro. Graph oracle models, lower bounds, and gaps for parallel stochastic optimization. In Advances in Neural Information Processing Systems, 2018. Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. ar Xiv preprint ar Xiv:1708.07747, 2017. Hao Yu, Sen Yang, and Shenghuo Zhu. Parallel restarted sgd for non-convex optimization with faster convergence and less communication. In AAAI Conference on Artificial Intelligence, 2019. Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In International Conference on World Wide Web, 2017a. 
Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rogriguez, and Krishna P Gummadi. Fairness constraints: Mechanisms for fair classification. In International Conference on Artificial Intelligence and Statistics, 2017b. Published as a conference paper at ICLR 2020 A THEORETICAL ANALYSIS OF THE PROPOSED OBJECTIVE q-FFL A.1 UNIFORMITY INDUCED BY q-FFL In this section, we theoretically justify that the q-FFL objective can impose more uniformity of the performance/accuracy distribution. As discussed in Section 3.2, q-FFL can encourage more fair solutions in terms of several metrics, including (1) the variance of accuracy distribution (smaller variance), (2) the cosine similarity between the accuracy distribution and the all-ones vector 1 (larger similarity), and (3) the entropy of the accuracy distribution (larger entropy). We begin by formally defining these fairness notions. Definition 4 (Uniformity 1: Variance of the performance distribution). We say that the performance distribution of m devices {F1(w), . . . , Fm(w)} is more uniform under solution w than w if Var(F1(w), . . . , Fm(w)) < Var(F1(w ), . . . , Fm(w )). (5) Definition 5 (Uniformity 2: Cosine similarity between the performance distribution and 1). We say that the performance distribution of m devices {F1(w), . . . , Fm(w)} is more uniform under solution w than w if the cosine similarity between {F1(w), . . . , Fm(w)} and 1 is larger than that between {F1(w ), . . . , Fm(w )} and 1, i.e., 1 m Pm k=1 Fk(w) q 1 m Pm k=1 F 2 k (w) 1 m Pm k=1 Fk(w ) q 1 m Pm k=1 F 2 k (w ) . (6) Definition 6 (Uniformity 3: Entropy of performance distribution). We say that the performance distribution of m devices {F1(w), . . . , Fm(w)} is more uniform under solution w than w if e H(F(w)) e H(F(w )), (7) where e H(F(w)) is the entropy of the stochastic vector obtained by normalizing {F1(w), . . . , Fm(w)}, defined as e H(F(w)) := Fk(w) Pm k=1 Fk(w) log Fk(w) Pm k=1 Fk(w) To enforce uniformity/fairness (defined in Definition 4, 5, and 6), we propose the q-FFL objective to impose more weights on the devices with worse performance. Throughout the proof, for the ease of mathematical exposition, we consider a similar unweighted objective: k=1 F q+1 k (w) and we denote w q as the global optimal solution of minw fq(w). We first investigate the special case of q = 1 and show that q = 1 results in more fair solutions than q = 0 based on Definition 4 and Definition 5. Lemma 7. q = 1 leads to a more fair solution (smaller variance of the model performance distribution) than q = 0, i.e., Var(F1(w 1), . . . , Fm(w 1)) < Var(F1(w 0), . . . , Fm(w 0)). Proof. Use the fact that w 1 is the optimal solution of minw f1(w), and w 0 is the optimal solution of minw f0(w), we get Pm k=1 F 2 k (w 1) m i=1 Fk(w 1) Pm k=1 F 2 k (w 0) m i=1 Fk(w 1) Pm k=1 F 2 k (w 0) m i=1 Fk(w 0) Published as a conference paper at ICLR 2020 Lemma 8. q = 1 leads to a more fair solution (larger cosine similarity between the performance distribution and 1) than q = 0, i.e., 1 m Pm k=1 Fk(w 1) q 1 m F 2 k (w 1) 1 m Pm k=1 Fk(w 0) q 1 m F 2 k (w 0) . Proof. As 1 m Pm k=1 Fk(w 1) 1 m Pm k=1 Fk(w 0) and 1 m Pm k=1 F 2 k (w 1) 1 m Pm k=1 F 2 k (w 0), it directly follows that 1 m Pm k=1 Fk(w 1) q 1 m F 2 k (w 1) 1 m Pm k=1 Fk(w 0) q 1 m F 2 k (w 0) . We next provide results based on Definition 6. It states that for arbitrary q 0, by increasing q for a small amount, we can get more uniform performance distributions defined over higher-orders of the performance. Lemma 9. 
Let F(w) be twice differentiable in w with 2F(w) 0 (positive definite). The derivative of e H(F q(w p)) with respect to the variable p evaluated at the point p = q is non-negative, i.e., p e H(F q(w p)) p=q 0, where e H(F q(w p)) is defined in equation 8. p e H(F q(w p)) p=q = F q k (w p) P κ F q κ(w p) ln F q k (w p) P κ F q κ(w p) F q k (w p) P κ F q κ(w p) ln F q k (w p) p=q κ F q κ(w p) w F q k (w q) P κ F q κ(w q) ln F q k (w q) F q k (w q) P κ F q κ(w q) w F q k (w q) F q k (w q) (12) w F q k (w q) P κ F q κ(w q) ln F q k (w q) + 1 . (13) Now, let us examine pw p p=q. We know that P k w F p k (w p) = 0 by definition. Taking the derivative with respect to p, we have X k 2 w F p k (w p) ln F p k (w p) + 1 w F p k (w p) = 0. (14) Invoking implicit function theorem, k 2 w F p k (w p) ln F p k (w p) + 1 w F p k (w p). (15) Published as a conference paper at ICLR 2020 Plugging pw p p=q into (13), we get that p e H(F q(w p)) p=q 0 completing the proof. Lemma 9 states that for any p, the performance distribution of {F p 1 (w p+ϵ), . . . , F p m(w p+ϵ)} is guaranteed to be more uniform based on Definition 6 than that of {F p 1 (w p), . . . , F p m(w p)} for a small enough ϵ. Note that Lemma 9 is different from the existing results on the monotonicity of entropy under the tilt operation, which would imply that q e H(F q(w p)) 0 for all q 0 (see Beirami et al. (2019, Lemma 11)). Ideally, we would like to prove a result more general than Lemma 9, implying that the distribution {F q 1 (w p+ϵ), . . . , F q m(w p+ϵ)} is more uniform than {F q 1 (w p), . . . , F q m(w p)} for any p, q and small enough ϵ. We prove this result for the special case of m = 2 in the following. Lemma 10. Let F(w) be twice differentiable in w with 2F(w) 0 (positive definite). If m = 2, for any q R+, the derivative of e H(F q(w p)) with respect to the variable p is non-negative, i.e., p e H(F q(w p)) 0, where e H(F q(w p)) is defined in equation 8. Proof. First, we invoke Lemma 9 to obtain that p e H(F q(w p)) p=q 0. (16) θq(w) := F q 1 (w) F q 1 (w) + F q 2 (w). (17) Without loss of generality assume that θq(w p) (0, 1 2], as we can relabel F1 and F2 otherwise. Then, given that m = 2, we conclude from equation 16 along with the monotonicity of the binary entropy function in (0, 1 2] that pθq(w p) p=q 0, (18) which in conjunction with equation 17 implies that F1(w p) F2(w p) q p=q 0. (19) Given the monotonicity of xq with respect to x for all q > 0, it can be observed that the above is sufficient to imply that for any q > 0, F1(w p) F2(w p) Going all of the steps back we would obtain that for all p > 0 p e H(F q(w p)) 0. (21) This completes the proof of the lemma. We conjecture that the statement of Lemma 9 is true for all q R+, which would be equivalent to the statement of Lemma 10 holding true for all m N. Thus far, we provided results that showed that q-FFL promotes fairness in three different senses. Next, we further provide a result on equivalence between the geometric and information-theoretic notions of fairness. Published as a conference paper at ICLR 2020 Lemma 11 (Equivalence between uniformity in entropy and cosine distance). 
q-FFL encourages more uniform performance distributions in the cosine distance sense (Definition 5) if any only if it encourages more uniform performance distributions in the entropy sense (Definition 6), i.e., (a) holds if and only if (b) holds where (a) for any p, q R, the derivative of H(F q(w p)) with respect to p is non-negative, (b) for any 0 t r, 0 v u, ft(w u) fr(w u) ft(w v) fr(w v). Proof. Definition 6 is a special case of H(F q(w p)) with q = 1. If e H(F q(w p)) increases with p for any p, q, then we are guaranteed to get more fair solutions based on Definition 6. Similarly, Definition 5 is a special case of ft(w u) fr(w u) with t = 0, r = 1. If ft(w u) fr(w u) increases with u for any t r, q-FFL can also obtain more fair solutions under Definition 5. Next, we show that (a) and (b) are equivalent measures of fairness. For any r t 0, and any u v 0, ft(w u) fr(w u) ft(w v) fr(w v) ln ft(w u) fr(w u) ln ft(w v) fr(w v) 0 (22) τ ln ft(w τ) fr(w τ)dτ 0 (23) p ln ft(w p) fr(w p) 0, for any p 0 (24) p ln fr(w p) p ln ft(w p) 0, for any p 0 (25) p q ln fq(w p)dq 0 for any p, q 0 (26) p q ln fq(w p) 0, for any p, q 0 (27) p e H(F q(w p)) 0, for any p, q 0. (28) The last inequality is obtained using the fact that by taking the derivative of ln fq(w p) with respect to q, we get e H(F q(w p)). Discussions. We give geometric (Definition 5) and information-theoretic (Definition 6) interpretations of our uniformity/fairness notion and provide uniformity guarantees under the q-FFL objective in some cases (Lemma 7, Lemma 8, and Lemma 9). We reveal interesting relations between the geometric and information-theoretic interpretations in Lemma 11. Future work would be to gain further understandings for more general cases indicated in Lemma 11. A.2 GENERALIZATION BOUNDS In this section, we first describe the setup we consider in more detail, and then provide generalization bounds of q-FFL. One benefit of q-FFL is that it allows for a flexible trade-off between fairness and accuracy, which generalizes AFL (a special case of q-FFL with q ). We also provide learning bounds that generalize the bounds of the AFL objective, as described below. Suppose the service provider is interested in minimizing the loss over a distributed network of devices, with possibly unknown weights on each device: k=1 λk E(x,y) Dk[l(h(x), y)], (29) where λ is in a probability simplex Λ, m is the total number of devices, Dk is the local data distribution for device k, h is the hypothesis function, and l is the loss. We use ˆLλ(h) to denote the empirical Published as a conference paper at ICLR 2020 j=1 l(h(xk,j), yk,j), (30) where nk is the number of local samples on device k and (xk,j, yk,j) Dk. We consider a slightly different, unweighted version of q-FFL: min w fq(w) = 1 k=1 F q+1 k (w) , (31) which is equivalent to minimizing the empirical loss Lq(h) = max ν, ν p 1 j=1 l(h(xk,j), yk,j), (32) p + 1 q+1 = 1 (p 1, q 0). Lemma 12 (Generalization bounds of q-FFL for a specific λ). Assume that the loss l is bounded by M > 0 and the numbers of local samples are (n1, , nm). Then, for any δ > 0, with probability at least 1 δ, the following holds for any λ Λ, h H: Lλ(h) Aq(λ) Lq(h) + E max h H Lλ(h) ˆLλ(h) + M λ2 k 2nk log 1 where Aq(λ) = λ p, and 1/p + 1/(q + 1) = 1. Proof. Similar to the proof in Mohri et al. 
Lemma 12 (Generalization bounds of q-FFL for a specific $\lambda$). Assume that the loss $\ell$ is bounded by $M > 0$ and that the numbers of local samples are $(n_1, \ldots, n_m)$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for any $\lambda \in \Lambda$, $h \in \mathcal{H}$:

$L_\lambda(h) \le A_q(\lambda)\, L_q(h) + \mathbb{E}\Big[\max_{h \in \mathcal{H}}\big(L_\lambda(h) - \hat{L}_\lambda(h)\big)\Big] + M \sqrt{\sum_k \frac{\lambda_k^2}{2 n_k} \log \frac{1}{\delta}},$  (33)

where $A_q(\lambda) = \|\lambda\|_p$ and $1/p + 1/(q+1) = 1$.

Proof. Similar to the proof in Mohri et al. (2019), for any $\delta > 0$, the following inequality holds with probability at least $1 - \delta$ for any $\lambda \in \Lambda$, $h \in \mathcal{H}$:

$L_\lambda(h) \le \hat{L}_\lambda(h) + \mathbb{E}\Big[\max_{h \in \mathcal{H}}\big(L_\lambda(h) - \hat{L}_\lambda(h)\big)\Big] + M \sqrt{\sum_k \frac{\lambda_k^2}{2 n_k} \log \frac{1}{\delta}}.$  (34)

Denote the empirical loss on device $k$, $\frac{1}{n_k}\sum_{j=1}^{n_k} \ell(h(x_{k,j}), y_{k,j})$, by $F_k$. From Hölder's inequality, we have

$\hat{L}_\lambda(h) = \sum_{k=1}^{m} \lambda_k F_k \le \|\lambda\|_p \left(\sum_{k=1}^{m} F_k^{q+1}\right)^{\!\frac{1}{q+1}} = A_q(\lambda)\, L_q(h), \qquad \frac{1}{p} + \frac{1}{q+1} = 1.$

Plugging $\hat{L}_\lambda(h) \le A_q(\lambda)\, L_q(h)$ into (34) yields the result.

Theorem 13 (Generalization bounds of q-FFL for any $\lambda$). Assume that the loss $\ell$ is bounded by $M > 0$ and that the numbers of local samples are $(n_1, \ldots, n_m)$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for any $\lambda \in \Lambda$, $h \in \mathcal{H}$:

$L_\lambda(h) \le \max_{\lambda \in \Lambda} \big(A_q(\lambda)\big)\, L_q(h) + \max_{\lambda \in \Lambda} \left(\mathbb{E}\Big[\max_{h \in \mathcal{H}}\big(L_\lambda(h) - \hat{L}_\lambda(h)\big)\Big] + M \sqrt{\sum_k \frac{\lambda_k^2}{2 n_k} \log \frac{1}{\delta}}\right),$

where $A_q(\lambda) = \|\lambda\|_p$ and $1/p + 1/(q+1) = 1$.

Proof. This follows directly from Lemma 12 by taking the maximum over all possible $\lambda$'s in $\Lambda$.

Discussions. From Lemma 12, letting $\lambda_k = \frac{1}{m}$ and $q \to \infty$, we recover the generalization bounds of AFL (Mohri et al., 2019). In that sense, our generalization results extend those of AFL. In addition, it is not straightforward to derive the optimal $q$ (the one with the tightest generalization bound) from Lemma 12 and Theorem 13. In practice, our proposed method q-FedAvg allows us to tune over a family of $q$'s by re-using the step-sizes.

B α-FAIRNESS AND q-FFL

As discussed in Section 2, while it is natural to consider the α-fairness framework for machine learning, we are unaware of any work that uses α-fairness to modify machine learning training objectives. We provide additional details on the framework below; for further background on α-fairness and on fairness in resource allocation more generally, we defer the reader to Shi et al. (2014) and Mo & Walrand (2000). α-fairness (Lan et al., 2010; Mo & Walrand, 2000) is a popular fairness criterion widely used in resource allocation problems. The framework defines a family of overall utility functions obtained by summing the following function of the individual utilities of the users in the network:

$U_\alpha(x) = \begin{cases} \ln(x), & \text{if } \alpha = 1, \\ \frac{1}{1-\alpha}\, x^{1-\alpha}, & \text{if } \alpha \ge 0,\ \alpha \ne 1. \end{cases}$

Here $U_\alpha(x)$ represents the individual utility of a user given $x$ allocated resources (e.g., bandwidth). The goal is to find a resource allocation strategy that maximizes the sum of the individual utilities. This family of functions covers a wide range of popular fair resource allocation strategies: zero fairness with $\alpha = 0$, proportional fairness (Kelly, 1997) with $\alpha = 1$, harmonic mean fairness (Dashti et al., 2013) with $\alpha = 2$, and max-min fairness (Radunovic & Le Boudec, 2007) with $\alpha = +\infty$. Note that in federated learning we are dealing with costs, not utilities; thus, max-min in resource allocation corresponds to min-max in our setting. With this analogy, it is clear that in our proposed objective q-FFL (2), the case $q = +\infty$ corresponds to min-max fairness, since it optimizes for the worst-performing device, similar to what was proposed in Mohri et al. (2019). Likewise, $q = 0$ corresponds to zero fairness, which reduces to the original FedAvg objective (1). In resource allocation problems, α can be tuned to trade off between fairness and system efficiency. In federated settings, $q$ can be tuned based on the desired level of fairness (e.g., the desired variance of the accuracy distribution) and other performance metrics such as the overall accuracy. For instance, in Table 2 in Section 4.3, we demonstrate on two datasets that as $q$ increases, the overall average accuracy decreases slightly, while the worst accuracies increase significantly and the variance of the accuracy distribution decreases.
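As a small illustration of the α-fairness family defined above, the sketch below (our own toy example; the helper name and allocations are not from the paper) compares the aggregate utility of an equal and an unequal allocation of the same total resource for a few values of α.

```python
# Illustrative sketch of the alpha-fairness utility family U_alpha described above.
# The helper name and the toy allocations are our own; only the formula follows the
# definition in the text.
import numpy as np

def alpha_utility(x, alpha):
    """U_alpha(x) for positive allocated resources x (scalar or array)."""
    x = np.asarray(x, dtype=float)
    if alpha == 1.0:
        return np.log(x)                       # proportional fairness
    return x ** (1.0 - alpha) / (1.0 - alpha)  # alpha = 0 reduces to the raw utility x

# Two toy allocations of the same total resource across three users:
unequal = np.array([2.7, 0.2, 0.1])
equal = np.array([1.0, 1.0, 1.0])
for a in [0.0, 1.0, 2.0]:
    print(a, alpha_utility(unequal, a).sum(), alpha_utility(equal, a).sum())
# With alpha = 0 only the total matters (both allocations score 3.0); as alpha grows,
# the aggregate utility increasingly favors the more uniform allocation.
```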
C PSEUDO-CODE OF ALGORITHMS

C.1 THE FEDAVG ALGORITHM

Algorithm 3 Federated Averaging (FedAvg), McMahan et al. (2017)
Input: $K, T, \eta, E, w^0, N, p_k$, $k = 1, \ldots, N$
for $t = 0, \ldots, T-1$ do
  Server randomly chooses a subset $S_t$ of $K$ devices (each device $k$ is chosen with probability $p_k$)
  Server sends $w^t$ to all chosen devices
  Each chosen device $k$ updates $w^t$ for $E$ epochs of SGD on $F_k$ with step-size $\eta$ to obtain $w_k^{t+1}$
  Each chosen device $k$ sends $w_k^{t+1}$ back to the server
  Server aggregates the models as $w^{t+1} = \frac{1}{K} \sum_{k \in S_t} w_k^{t+1}$
end for

C.2 THE q-MAML ALGORITHM

Algorithm 4 q-FFL applied to MAML: q-MAML
1: Input: $K, T, \eta, w^0, N, p_k$, $k = 1, \ldots, N$
2: for $t = 0, \ldots, T-1$ do
3:   Sample a batch $S_t$ ($|S_t| = K$) of tasks randomly (each task $k$ is chosen with probability $p_k$)
4:   Send $w^t$ to all sampled tasks
5:   Each task $k \in S_t$ samples data $D_k$ from the training set and $D'_k$ from the testing set, and computes updated parameters on $D_k$: $w_k^t = w^t - \eta \nabla F_k(w^t)$
6:   Each task $k \in S_t$ computes the gradient $\nabla F_k(w_k^t)$ on $D'_k$
7:   Each task $k \in S_t$ computes: $\Delta_k^t = F_k^q(w_k^t)\, \nabla F_k(w_k^t)$ and $h_k^t = q\, F_k^{q-1}(w_k^t)\, \|\nabla F_k(w_k^t)\|^2 + L\, F_k^q(w_k^t)$
8:   Update $w^{t+1} = w^t - \frac{\sum_{k \in S_t} \Delta_k^t}{\sum_{k \in S_t} h_k^t}$
9: end for
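For concreteness, here is a minimal NumPy sketch of the server-side aggregation in steps 7–8 of Algorithm 4, assuming the per-task meta-losses $F_k(w_k^t)$, meta-gradients $\nabla F_k(w_k^t)$, and a Lipschitz estimate $L$ have already been computed on the sampled tasks; the function name and toy inputs are our own.

```python
# Minimal sketch of the q-MAML aggregation in Algorithm 4 (steps 7-8), assuming the
# per-task meta-losses F_k(w_k^t), meta-gradients grad_k, and a Lipschitz estimate L
# are already available; all names here are our own.
import numpy as np

def q_maml_server_update(w, losses, grads, q, L):
    """One server update: w^{t+1} = w^t - sum_k Delta_k / sum_k h_k."""
    deltas, hs = [], []
    for F_k, g_k in zip(losses, grads):
        deltas.append((F_k ** q) * g_k)                                    # Delta_k^t
        hs.append(q * (F_k ** (q - 1)) * np.dot(g_k, g_k) + L * F_k ** q)  # h_k^t
    return w - np.sum(deltas, axis=0) / np.sum(hs)

# Toy usage with 3 sampled tasks and a 4-dimensional model:
w = np.zeros(4)
losses = [0.8, 2.0, 0.3]                    # hypothetical F_k(w_k^t) values
grads = [np.ones(4) * f for f in losses]    # hypothetical meta-gradients
w_next = q_maml_server_update(w, losses, grads, q=1.0, L=1.0)
print(w_next)
```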
D EXPERIMENTAL DETAILS

D.1 DATASETS AND MODELS

We provide full details on the datasets and models used in our experiments. The statistics of the four federated datasets used in the federated learning (as opposed to meta-learning) experiments are summarized in Table 4. We report the total number of devices, the total number of samples, and the mean and standard deviation of the number of data points per device. Additional details on the datasets and models are given below.

Synthetic: We follow a similar setup to Shamir et al. (2014) and impose additional heterogeneity. The model is $y = \mathrm{argmax}(\mathrm{softmax}(Wx + b))$, $x \in \mathbb{R}^{60}$, $W \in \mathbb{R}^{10 \times 60}$, $b \in \mathbb{R}^{10}$, and the goal is to learn a global $W$ and $b$. Samples $(X_k, Y_k)$ and local models on each device $k$ satisfy $W_k \sim N(u_k, 1)$, $b_k \sim N(u_k, 1)$, $u_k \sim N(0, 1)$; $x_k \sim N(v_k, \Sigma)$, where the covariance matrix $\Sigma$ is diagonal with $\Sigma_{j,j} = j^{-1.2}$. Each element of $v_k$ is drawn from $N(B_k, 1)$, with $B_k \sim N(0, 1)$. There are 100 devices in total, and the number of samples on each device follows a power law.

Vehicle²: We use the same Vehicle Sensor (Vehicle) dataset as Smith et al. (2017), modelling each sensor as a device. This dataset consists of acoustic, seismic, and infrared sensor data collected from a distributed network of 23 sensors (Duarte & Hu, 2004). Each sample has a 100-dimensional feature vector and a binary label. We train a linear SVM to predict between AAV-type and DW-type vehicles. We tune the SVM hyperparameters and report the best configuration.
(² http://www.ecs.umass.edu/~mduarte/Software.html)

Sent140: This dataset is a collection of tweets curated from 1,101 accounts from Sentiment140 (Go et al., 2009) (Sent140), where each Twitter account corresponds to a device. The task is text sentiment analysis, which we model as a binary classification problem. The model takes as input a 25-word sequence, embeds each word into a 300-dimensional space using pretrained GloVe embeddings (Pennington et al., 2014), and outputs a binary label after two LSTM layers and one densely-connected layer.

Shakespeare: This dataset is built from The Complete Works of William Shakespeare (McMahan et al., 2017). Each speaking role in the plays is associated with a device. We subsample 31 speaking roles to train a deep language model for next-character prediction. The model takes as input an 80-character sequence, embeds each character into a learnt 8-dimensional space, and outputs one character after two LSTM layers and one densely-connected layer.

Omniglot: The Omniglot dataset (Lake et al., 2015) consists of 1,623 characters from 50 different alphabets. We create 300 meta-training tasks from the first 1,200 characters, and 100 meta-testing tasks from the last 423 characters. Each task is a 5-class classification problem where each character forms a class. The model is a convolutional neural network with two convolution layers and two fully-connected layers.

Table 4: Statistics of federated datasets

Dataset       Devices   Samples   Samples/device (mean)   Samples/device (stdev)
Synthetic     100       12,697    127                     73
Vehicle       23        43,695    1,899                   349
Sent140       1,101     58,170    53                      32
Shakespeare   31        116,214   3,749                   6,912

D.2 IMPLEMENTATION DETAILS

D.2.1 MACHINES

We simulate the federated setting (one server and m devices) on a server with 2 Intel Xeon E5-2650 v4 CPUs and 8 NVIDIA 1080Ti GPUs.

D.2.2 SOFTWARE

We implement all code in TensorFlow (Abadi et al., 2016), Version 1.10.1. Please see github.com/litian96/fair_flearn for full details.

D.2.3 HYPERPARAMETERS

We randomly split the data on each local device into an 80% training set, a 10% testing set, and a 10% validation set. We tune the best q from {0.001, 0.01, 0.1, 0.5, 1, 2, 5, 10, 15} on the validation set and report accuracy distributions on the testing set. We pick the q value for which the variance decreases the most while the change in overall average accuracy (compared with the q = 0 case) stays within 1%. For each dataset, we repeat this process for five randomly selected train/test/validation splits, and report the mean and standard deviation across these five runs where applicable. For Synthetic, Vehicle, Sent140, and Shakespeare, the optimal q values are 1, 5, 1, and 0.001, respectively. For all datasets, we randomly sample 10 devices each round. We tune the learning rate and batch size for FedAvg and use the same learning rate and batch size for all q-FedAvg experiments on that dataset. The learning rates for Synthetic, Vehicle, Sent140, and Shakespeare are 0.1, 0.01, 0.03, and 0.8, respectively; the batch sizes are 10, 64, 32, and 10. The number of local epochs E is fixed to 1 for both FedAvg and q-FedAvg, regardless of the value of q. When comparing the efficiency of q-FedAvg with q-FedSGD, we also tune a best learning rate for q-FedSGD with q = 0. For each comparison, we fix the selected devices and mini-batch orders across all runs. We stop training when the training loss F(w) does not decrease for 10 rounds. When running the AFL method, we search for the best γ_w and γ_λ such that AFL achieves the highest testing accuracy on the device with the highest loss within a fixed number of rounds. For Adult, we use γ_w = 0.1 and γ_λ = 0.1; for Fashion MNIST, we use γ_w = 0.001 and γ_λ = 0.01. We use the same γ_w as the step-size for q-FedAvg on Adult and Fashion MNIST. In Table 2, q₁ = 0.01 and q₂ = 2 for q-FFL on Adult, and q₁ = 5 and q₂ = 15 for q-FFL on Fashion MNIST. Similarly, the number of local epochs is fixed to 1 whenever we perform local updates.
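The q-selection rule in D.2.3 can be summarized by the following sketch (our own illustration, with hypothetical per-device validation accuracies): keep the candidate q values whose average validation accuracy stays within 1% of the q = 0 run, and among those pick the q with the smallest variance.

```python
# Sketch of the q-selection rule described above (our own illustration):
# among candidate q values, keep those whose average validation accuracy is within
# 1 percentage point of the q = 0 run, then pick the q with the lowest accuracy variance.
import numpy as np

def select_q(per_device_val_acc):
    """per_device_val_acc: dict mapping q -> array of per-device validation accuracies."""
    base = per_device_val_acc[0.0]
    candidates = {
        q: acc for q, acc in per_device_val_acc.items()
        if abs(acc.mean() - base.mean()) <= 0.01  # average accuracy within 1% of q = 0
    }
    return min(candidates, key=lambda q: candidates[q].var())  # largest variance reduction

# Hypothetical validation accuracies for 5 devices under three q values:
accs = {
    0.0: np.array([0.95, 0.60, 0.90, 0.55, 0.85]),
    1.0: np.array([0.94, 0.64, 0.89, 0.58, 0.84]),
    5.0: np.array([0.80, 0.70, 0.76, 0.68, 0.74]),
}
print(select_q(accs))  # -> 1.0 here: q = 5 trades away too much average accuracy
```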
E FULL EXPERIMENTS

E.1 FULL RESULTS OF PREVIOUS EXPERIMENTS

Fairness of q-FFL under all uniformity metrics. We demonstrate the fairness of q-FFL in Table 1 in terms of variance. Here, we report similar results in terms of the other uniformity measures (the last two columns).

Table 5: Full statistics of the test accuracy distribution for q-FFL. q-FFL increases the accuracy of the worst 10% of devices without decreasing the average accuracy. We see that q-FFL encourages more uniform distributions under all of the uniformity metrics defined in Appendix A: (1) the variance of the accuracy distribution (Definition 4), (2) the cosine similarity/geometric angle between the accuracy distribution and the all-ones vector 1 (Definition 5), and (3) the KL divergence between the normalized accuracy vector a and the uniform distribution u, which can be directly translated into the entropy of a (Definition 6).

Dataset       Objective   Average (%)   Worst 10% (%)   Best 10% (%)   Variance    Angle (°)    KL(a||u)
Synthetic     q = 0       80.8 ± .9     18.8 ± 5.0      100.0 ± 0.0    724 ± 72    19.5 ± 1.1   .083 ± .013
Synthetic     q = 1       79.0 ± 1.2    31.1 ± 1.8      100.0 ± 0.0    472 ± 14    16.0 ± .5    .049 ± .003
Vehicle       q = 0       87.3 ± .5     43.0 ± 1.0      95.7 ± 1.0     291 ± 18    11.3 ± .3    .031 ± .003
Vehicle       q = 5       87.7 ± .7     69.9 ± .6       94.0 ± .9      48 ± 5      4.6 ± .2     .003 ± .000
Sent140       q = 0       65.1 ± 4.8    15.9 ± 4.9      100.0 ± 0.0    697 ± 132   22.4 ± 3.3   .104 ± .034
Sent140       q = 1       66.5 ± .2     23.0 ± 1.4      100.0 ± 0.0    509 ± 30    18.8 ± .5    .067 ± .006
Shakespeare   q = 0       51.1 ± .3     39.7 ± 2.8      72.9 ± 6.7     82 ± 41     9.8 ± 2.7    .014 ± .006
Shakespeare   q = .001    52.1 ± .3     42.1 ± 2.1      69.0 ± 4.4     54 ± 27     7.9 ± 2.3    .009 ± .05

Fairness of q-FFL with respect to training accuracy. The empirical results in Section 4 are with respect to testing accuracy. As a sanity check, we show that q-FFL also results in more fair training accuracy distributions, in Figure 6 and Table 6.

Figure 6: q-FFL (q > 0) results in more centered (i.e., fair) training accuracy distributions across devices without sacrificing the average accuracy.

Table 6: q-FFL results in more fair training accuracy distributions in terms of all uniformity measurements: (a) the accuracy variance, (b) the cosine similarity (i.e., angle) between the accuracy distribution and the all-ones vector 1, and (c) the KL divergence between the normalized accuracy a and the uniform distribution u.

Dataset       Objective   Average (%)   Worst 10% (%)   Best 10% (%)   Variance    Angle (°)    KL(a||u)
Synthetic     q = 0       81.7 ± .3     23.6 ± 1.1      100.0 ± .0     597 ± 10    17.5 ± .3    .061 ± .002
Synthetic     q = 1       78.9 ± .2     41.8 ± 1.0      96.8 ± .5      292 ± 11    12.5 ± .2    .027 ± .001
Vehicle       q = 0       87.5 ± .2     49.5 ± 10.2     94.9 ± .7      237 ± 97    10.2 ± 2.4   .025 ± .011
Vehicle       q = 5       87.8 ± .5     71.3 ± 2.2      93.1 ± 1.4     37 ± 12     4.0 ± .7     .003 ± .001
Sent140       q = 0       69.8 ± .8     36.9 ± 3.1      94.4 ± 1.1     278 ± 44    13.6 ± 1.1   .032 ± .006
Sent140       q = 1       68.2 ± .6     46.0 ± .3       88.8 ± .8      143 ± 4     10.0 ± .1    .017 ± .000
Shakespeare   q = 0       72.7 ± .8     46.4 ± 1.4      79.7 ± .9      116 ± 8     9.9 ± .3     .015 ± .001
Shakespeare   q = .001    66.7 ± 1.2    48.0 ± .4       71.2 ± 1.9     56 ± 9      7.1 ± .5     .008 ± .001
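For reference, the three uniformity metrics reported in these tables can be computed as in the following sketch (our own helper, written from the captions' descriptions; the toy accuracy vectors are made up).

```python
# Sketch of the three uniformity metrics reported in Tables 5-9 (our own helper):
# variance of the accuracy distribution, the angle between the accuracy vector and
# the all-ones vector, and the KL divergence between the normalized accuracy vector
# and the uniform distribution.
import numpy as np

def uniformity_metrics(acc):
    acc = np.asarray(acc, dtype=float)
    m = len(acc)
    variance = acc.var()
    cos = acc.sum() / (np.sqrt(m) * np.linalg.norm(acc))      # cosine similarity with 1
    angle_deg = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    a = acc / acc.sum()                                        # normalized accuracy vector
    kl_to_uniform = np.sum(a * np.log(a * m))                  # KL(a || u), u uniform
    return variance, angle_deg, kl_to_uniform

# Toy example: a more uniform accuracy vector gives a smaller angle and smaller KL.
print(uniformity_metrics([0.9, 0.2, 0.7, 0.4]))
print(uniformity_metrics([0.6, 0.5, 0.6, 0.5]))
```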
Average testing accuracy with respect to devices. In Section 4.2, we show that q-FFL leads to more fair accuracy distributions while maintaining approximately the same testing accuracy. Note that in Table 1 we report the average testing accuracy with respect to all data points. However, we observe similar results for the average accuracy with respect to all devices between the q = 0 and q > 0 objectives, as shown in Table 7. This indicates that q-FFL can reduce the variance of the accuracy distribution without sacrificing the average accuracy over devices or over data points.

Table 7: Average testing accuracy under the q-FFL objectives. We show that the resulting solutions of the q = 0 and q > 0 objectives have approximately the same average accuracy, both with respect to all data points and with respect to all devices.

Dataset       Objective   Accuracy w.r.t. data points (%)   Accuracy w.r.t. devices (%)
Synthetic     q = 0       80.8 ± .9                         77.3 ± .6
Synthetic     q = 1       79.0 ± 1.2                        76.3 ± 1.7
Vehicle       q = 0       87.3 ± .5                         85.6 ± .4
Vehicle       q = 5       87.7 ± .7                         86.5 ± .7
Sent140       q = 0       65.1 ± 4.8                        64.6 ± 4.5
Sent140       q = 1       66.5 ± .2                         66.2 ± .2
Shakespeare   q = 0       51.1 ± .3                         61.4 ± 2.7
Shakespeare   q = .001    52.1 ± .3                         60.0 ± .5

Comparison with uniform sampling. In Figure 7 and Table 8, we show that in terms of training accuracy, the uniform sampling heuristic may outperform q-FFL (as opposed to the testing accuracy results in Section 4). We suspect that this is because the uniform sampling baseline is a static method and is likely to overfit to devices with few samples. In addition to Figure 3 in Section 4.3, we also report the average testing accuracy with respect to data points, the best 10% and worst 10% accuracies, and the variance (along with two other uniformity measures) in Table 9.

Figure 7: q-FFL (q > 0) compared with uniform sampling in terms of training accuracy. We see that on some datasets uniform sampling has higher (and more fair) training accuracies, due to the fact that it overfits to devices with few samples.

Table 8: More statistics comparing the uniform sampling objective with q-FFL in terms of training accuracy. We observe that uniform sampling can result in more fair training accuracy distributions with smaller variance in some cases.

Dataset       Objective   Average (%)   Worst 10% (%)   Best 10% (%)   Variance    Angle (°)    KL(a||u)
Synthetic     uniform     83.5 ± .2     42.6 ± 1.4      100.0 ± .0     366 ± 17    13.4 ± .3    .031 ± .002
Synthetic     q = 1       78.9 ± .2     41.8 ± 1.0      96.8 ± .5      292 ± 11    12.5 ± .2    .027 ± .001
Vehicle       uniform     87.3 ± .3     46.6 ± .8       94.8 ± .5      261 ± 10    10.7 ± .2    .027 ± .001
Vehicle       q = 5       87.8 ± .5     71.3 ± 2.2      93.1 ± 1.4     37 ± 12     4.0 ± .7     .003 ± .001
Sent140       uniform     69.1 ± .5     42.2 ± 1.1      91.0 ± 1.3     188 ± 19    11.3 ± .5    .022 ± .002
Sent140       q = 1       68.2 ± .6     46.0 ± .3       88.8 ± .8      143 ± 4     10.0 ± .1    .017 ± .000
Shakespeare   uniform     57.7 ± 1.5    54.1 ± 1.7      72.4 ± 3.2     32 ± 7      5.2 ± .5     .004 ± .001
Shakespeare   q = .001    66.7 ± 1.2    48.0 ± .4       71.2 ± 1.9     56 ± 9      7.1 ± .5     .008 ± .001

Table 9: More statistics showing the more fair solutions induced by q-FFL compared with the uniform sampling baseline in terms of test accuracy. Again, we observe that under q-FFL, the testing accuracy of the worst 10% of devices tends to increase compared with uniform sampling, and the variance of the final testing accuracies is smaller. Similarly, q-FFL is also more fair than uniform sampling in terms of the other uniformity metrics.

Dataset       Objective   Average (%)   Worst 10% (%)   Best 10% (%)   Variance    Angle (°)    KL(a||u)
Synthetic     uniform     82.2 ± 1.1    30.0 ± .4       100.0 ± .0     525 ± 47    15.6 ± .8    .048 ± .007
Synthetic     q = 1       79.0 ± 1.2    31.1 ± 1.8      100.0 ± 0.0    472 ± 14    16.0 ± .5    .049 ± .003
Vehicle       uniform     86.8 ± .3     45.4 ± .3       95.4 ± .7      267 ± 7     10.8 ± .1    .028 ± .001
Vehicle       q = 5       87.7 ± 0.7    69.9 ± .6       94.0 ± .9      48 ± 5      4.6 ± .2     .003 ± .000
Sent140       uniform     66.6 ± 2.6    21.1 ± 1.9      100.0 ± 0.0    560 ± 19    19.8 ± .7    .076 ± .006
Sent140       q = 1       66.5 ± .2     23.0 ± 1.4      100.0 ± 0.0    509 ± 30    18.8 ± .5    .067 ± .006
Shakespeare   uniform     50.9 ± .4     41.0 ± 3.7      70.6 ± 5.4     71 ± 38     9.1 ± 2.8    .012 ± .006
Shakespeare   q = .001    52.1 ± .3     42.1 ± 2.1      69.0 ± 4.4     54 ± 27     7.9 ± 2.3    .009 ± .05

E.2 ADDITIONAL EXPERIMENTS

Effects of data heterogeneity and the number of devices on unfairness. To study more directly how data heterogeneity and the total number of devices affect unfairness, we investigate a set of synthetic datasets where we can quantify the degree of heterogeneity. The results are shown in Table 10 below.
We generate three synthetic datasets following the process described in Appendix D.1, but with different parameters to control heterogeneity. In particular, we generate an IID dataset, Synthetic (IID), by setting the same W and b on all devices and setting the samples $x_k \sim N(0, 1)$ for every device k. We instantiate two non-identically distributed datasets (Synthetic (1, 1) and Synthetic (2, 2)) from Synthetic (α, β), where $u_k \sim N(0, \alpha)$ and $B_k \sim N(0, \beta)$. Recall that α and β allow us to precisely control the degree of heterogeneity, with larger values indicating more statistical heterogeneity. Therefore, from top to bottom in Table 10, the data become more heterogeneous. For each dataset, we further create two variants with different numbers of participating devices. We see that as the data become more heterogeneous, and as the number of devices in the network increases, the accuracy distribution tends to be less uniform.

Table 10: Effects of data heterogeneity and the number of devices on unfairness. For a fixed number of devices, as data heterogeneity increases from top to bottom, the accuracy distributions become less uniform (larger variance) for both q = 0 and q > 0. Within each dataset, decreasing the number of devices results in a more uniform accuracy distribution. In all scenarios (except on IID data), setting q > 0 helps to encourage more fair solutions.

Dataset                         Objective   Average (%)   Worst 10% (%)   Best 10% (%)   Variance
Synthetic (IID), 100 devices    q = 0       89.2 ± .6     70.9 ± 3        100.0 ± 0      85 ± 15
Synthetic (IID), 100 devices    q = 1       89.0 ± .5     70.3 ± 3        100.0 ± 0      88 ± 19
Synthetic (IID), 50 devices     q = 0       87.1 ± 1.5    66.5 ± 3        100.0 ± 0      107 ± 14
Synthetic (IID), 50 devices     q = 1       86.8 ± 0.8    66.5 ± 2        100.0 ± 0      109 ± 13
Synthetic (1, 1), 100 devices   q = 0       83.0 ± .9     36.8 ± 2        100.0 ± 0      452 ± 22
Synthetic (1, 1), 100 devices   q = 1       82.7 ± 1.3    43.5 ± 5        100.0 ± 0      362 ± 58
Synthetic (1, 1), 50 devices    q = 0       84.5 ± .3     43.3 ± 2        100.0 ± 0      370 ± 37
Synthetic (1, 1), 50 devices    q = 1       85.1 ± .8     47.3 ± 3        100.0 ± 0      317 ± 41
Synthetic (2, 2), 100 devices   q = 0       82.6 ± 1.1    25.5 ± 8        100.0 ± 0      618 ± 117
Synthetic (2, 2), 100 devices   q = 1       82.2 ± 0.7    31.9 ± 6        100.0 ± 0      484 ± 79
Synthetic (2, 2), 50 devices    q = 0       85.9 ± 1.0    36.8 ± 7        100.0 ± 0      421 ± 85
Synthetic (2, 2), 50 devices    q = 1       85.9 ± 1.4    39.1 ± 6        100.0 ± 0      396 ± 76
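To make the Synthetic (α, β) setup concrete, the following sketch follows the generation process described in Appendix D.1 and above; where the text leaves details unspecified, the choices below (the exact power-law parameters for per-device sample counts, and treating α and β as variances) are our own assumptions.

```python
# Sketch of the Synthetic (alpha, beta) generation process described above.
# The power-law parameters for per-device sample counts are our own choice, and we
# treat alpha/beta as variances; the rest follows the description in Appendix D.1.
import numpy as np

def generate_synthetic(alpha, beta, num_devices=100, dim=60, num_classes=10, seed=0):
    rng = np.random.default_rng(seed)
    # Per-device sample counts follow a power law (assumed shape and scale).
    n_k = (rng.pareto(1.5, size=num_devices) * 20 + 30).astype(int)
    cov_diag = np.array([j ** (-1.2) for j in range(1, dim + 1)])  # Sigma_{j,j} = j^{-1.2}
    data = []
    for k in range(num_devices):
        u_k = rng.normal(0.0, np.sqrt(alpha))          # controls model heterogeneity
        B_k = rng.normal(0.0, np.sqrt(beta))           # controls feature heterogeneity
        W_k = rng.normal(u_k, 1.0, size=(num_classes, dim))
        b_k = rng.normal(u_k, 1.0, size=num_classes)
        v_k = rng.normal(B_k, 1.0, size=dim)
        x = rng.normal(v_k, np.sqrt(cov_diag), size=(n_k[k], dim))
        y = np.argmax(x @ W_k.T + b_k, axis=1)         # argmax of softmax = argmax of logits
        data.append((x, y))
    return data

devices = generate_synthetic(alpha=1.0, beta=1.0)       # Synthetic (1, 1)
print(len(devices), devices[0][0].shape)
```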
A family of q's results in variable levels of fairness. In Table 11, we show the accuracy distribution statistics of using a family of q's on the synthetic data. Our objective and methods are not sensitive to any particular q, since all q > 0 values lead to more fair solutions compared with q = 0. In the experiments in Section 4, we report the results using the q values selected following the protocol described in Appendix D.2.3.

Table 11: Test accuracy statistics for a family of q's on synthetic data. We show results with q's selected from our candidate set {0.001, 0.01, 0.1, 1, 2, 5, 10, 15}. q-FFL allows for a flexible trade-off between fairness and accuracy: a larger q results in more fairness (smaller variance), but potentially lower accuracy. Similarly, a larger q imposes more uniformity in terms of the other metrics: (a) the cosine similarity/angle between the accuracy distribution and the all-ones vector 1, and (b) the KL divergence between the normalized accuracy a and a uniform distribution u.

Objective   Average (%)   Worst 10% (%)   Best 10% (%)   Variance    Angle (°)    KL(a||u)
q = 0       80.8 ± .9     18.8 ± 5.0      100.0 ± 0.0    724 ± 72    19.5 ± 1.1   .083 ± .013
q = 0.1     81.1 ± 0.8    22.1 ± .8       100.0 ± 0.0    666 ± 56    18.4 ± .8    .070 ± .009
q = 1       79.0 ± 1.2    31.1 ± 1.8      100.0 ± 0.0    472 ± 14    16.0 ± .5    .049 ± .003
q = 2       74.7 ± 1.3    32.2 ± 2.1      99.9 ± .2      410 ± 23    15.6 ± 0.7   .044 ± .005
q = 5       67.2 ± 0.9    30.0 ± 4.8      94.3 ± 1.4     369 ± 51    16.3 ± 1.2   .048 ± .010

Device-specific q. In these experiments, we explore a device-specific strategy for selecting q in q-FFL. We solve q-FFL with q ∈ {0, 0.001, 0.01, 0.1, 1, 2, 5, 10} in parallel. After training, each device selects the best resulting model based on its validation data and tests the performance of that model on its testing set (see the short sketch following Figure 9 below for an illustration of this selection rule). We report the results in terms of testing accuracy in Table 12. Interestingly, using this device-specific strategy, the average accuracy in fact increases while the variance of accuracies is reduced, in comparison with q = 0. We note that this strategy does induce more local computation and additional communication load at each round. However, it does not increase the number of communication rounds if run in parallel.

Table 12: Effects of running q-FFL with several q's in parallel. We train multiple global models (corresponding to different q's) independently in the network. After training finishes, each device picks the best device-specific model based on its performance (accuracy) on the validation data. While this adds local computation and communication load per round, the device-specific strategy has the added benefit of simultaneously increasing the accuracies of the devices with the worst 10% accuracies and the devices with the best 10% accuracies. This strategy is built upon the proposed primitive, Algorithm 2; in practice, other heuristics can be developed to improve performance (similar to what we explore here), based on the method of adaptively averaging model updates proposed in Algorithm 2.

Dataset       Objective    Average (%)   Worst 10% (%)   Best 10% (%)   Variance    Angle (°)    KL(a||u)
Vehicle       q = 0        87.3 ± .5     43.0 ± 1.0      95.7 ± 1.0     291 ± 18    11.3 ± .3    .031 ± .003
Vehicle       q = 5        87.7 ± .7     69.9 ± .6       94.0 ± .9      48 ± 5      4.6 ± .2     .003 ± .000
Vehicle       multiple q   88.5 ± .3     70.0 ± 2.0      95.8 ± .6      52 ± 7      4.7 ± .3     .004 ± .000
Shakespeare   q = 0        51.1 ± .3     39.7 ± 2.8      72.9 ± 6.7     82 ± 41     9.8 ± 2.7    .014 ± .006
Shakespeare   q = .001     52.1 ± .3     42.1 ± 2.1      69.0 ± 4.4     54 ± 27     7.9 ± 2.3    .009 ± .05
Shakespeare   multiple q   52.0 ± 1.5    41.0 ± 4.3      72.0 ± 4.8     72 ± 32     10.1 ± .7    .017 ± .000

Convergence speed of q-FFL. Since q-FFL (q > 0) is a more difficult optimization problem, a natural question is whether the q-FFL (q > 0) objectives slow down convergence compared with FedAvg. We investigate this empirically on the four datasets. We use q-FedAvg to solve q-FFL and compare it with FedAvg (i.e., solving q-FFL with q = 0). As demonstrated in Figure 8, the q values that result in more fair solutions do not significantly slow down convergence.

Figure 8: The convergence speed of q-FFL compared with FedAvg. We plot the distance to the highest accuracy achieved versus the number of communication rounds. Although q-FFL with q > 0 is a more difficult optimization problem, for the q values we choose (those that lead to more fair results), the convergence speed is comparable to that of q = 0.

Efficiency of q-FFL compared with AFL. One added benefit of q-FFL is that it leads to faster convergence than AFL, even when we use non-local-updating methods for both objectives. In Figure 9, we show that, with respect to the final testing accuracy of the single worst device (i.e., the objective that AFL optimizes), q-FFL converges faster than AFL. As the number of devices increases (from Fashion MNIST to Vehicle), the performance gap between AFL and q-FFL becomes larger, because AFL introduces larger variance.

Figure 9: q-FFL is more efficient than AFL. With the worst device achieving the same final testing accuracy, q-FFL converges faster than AFL. For Vehicle (with 23 devices), as opposed to Fashion MNIST (with 3 devices), the performance gap is larger. We run full gradient descent at each round for both methods.
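The per-device selection rule referenced above can be written in a few lines; the sketch below (with hypothetical validation accuracies) simply lets each device keep the q whose model scores highest on its own validation split.

```python
# Sketch of the device-specific strategy described above: train one global model per
# q in parallel, then let each device keep whichever model scores best on its own
# validation split. The accuracy arrays below are hypothetical.
import numpy as np

def pick_per_device_models(val_acc_by_q):
    """val_acc_by_q: dict mapping q -> array of per-device validation accuracies.
    Returns, for each device, the q whose model it should keep."""
    qs = sorted(val_acc_by_q)
    acc_matrix = np.stack([val_acc_by_q[q] for q in qs])   # shape: (num_q, num_devices)
    best_idx = acc_matrix.argmax(axis=0)
    return [qs[i] for i in best_idx]

val_acc_by_q = {
    0.0: np.array([0.95, 0.58, 0.91]),
    1.0: np.array([0.93, 0.66, 0.90]),
    5.0: np.array([0.88, 0.71, 0.86]),
}
print(pick_per_device_models(val_acc_by_q))  # -> [0.0, 5.0, 0.0]: each device keeps its best q
```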
Efficiency of q-FedAvg under different data heterogeneity. As mentioned in Appendix E.1, one potential cause of the slower convergence of q-FedAvg on the synthetic dataset is that local updating schemes can hurt convergence when local data distributions are highly heterogeneous. Although it has been shown that applying updates locally results in significantly faster convergence in terms of communication rounds (McMahan et al., 2017; Smith et al., 2018), which is consistent with our observations on most datasets, we note that when the data are highly heterogeneous, local updating may hurt convergence. We validate this by creating an IID synthetic dataset (Synthetic-IID) where the local data on each device follow the same global distribution. We call the synthetic dataset used in Section 4 Synthetic-Non-IID. We also create a hybrid dataset (Synthetic-Hybrid) where half of the devices are assigned IID data from the same distribution and the other half are assigned data from different distributions. We observe that if the data are perfectly IID, q-FedAvg is more efficient than q-FedSGD. As the data become more heterogeneous, q-FedAvg converges more slowly than q-FedSGD in terms of communication rounds. For all three synthetic datasets, we repeat the process of tuning a best constant step-size for FedSGD and observe similar results as before: our dynamic solver q-FedSGD behaves similarly to (or even outperforms) a best hand-tuned FedSGD.

Figure 10: Convergence of q-FedAvg compared with q-FedSGD under different data heterogeneity (testing accuracy versus the number of communication rounds on Synthetic-Non-IID, Synthetic-Hybrid, and Synthetic-IID). When data distributions are heterogeneous, it is possible that q-FedAvg converges more slowly than q-FedSGD. Again, the proposed dynamic solver q-FedSGD performs similarly to (or better than) a best tuned fixed-step-size FedSGD.