# Active Sampling for Min-Max Fairness

Jacob Abernethy¹, Pranjal Awasthi², Matthäus Kleindessner³, Jamie Morgenstern⁴, Chris Russell³, Jie Zhang⁴

¹Georgia Tech, USA  ²Google, USA  ³Amazon Web Services, Germany  ⁴University of Washington, USA. Correspondence to: J. Abernethy, P. Awasthi, M. Kleindessner, J. Morgenstern, C. Russell, J. Zhang.

*Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).*

**Abstract.** We propose simple active sampling and reweighting strategies for optimizing min-max fairness that can be applied to any classification or regression model learned via loss minimization. The key intuition behind our approach is to use at each timestep a datapoint from the group that is worst off under the current model for updating the model. The ease of implementation and the generality of our robust formulation make it an attractive option for improving model performance on disadvantaged groups. For convex learning problems, such as linear or logistic regression, we provide a fine-grained analysis, proving the rate of convergence to a min-max fair solution.

## 1. Introduction

A model trained on a dataset containing multiple demographic groups typically has unequal error rates across the different groups, either because some groups are underrepresented in the training data or because the underlying learning task is inherently harder for particular groups. Many existing fairness notions aim to equalize the performance on different demographic groups (e.g., Hardt et al., 2016; Zafar et al., 2019), which can result in deliberately degrading the performance on the better-off groups and unnecessarily reducing overall performance. This degradation of performance can correspond to a loss of access to relevant services, which is referred to as leveling down in law and ethics, where it has received substantial criticism (Holtug, 1998; Temkin, 2000; Doran, 2001; Mason, 2001; Brown, 2003; Christiano and Braynen, 2008). In contrast, min-max notions of fairness offer an alternative approach. These notions level up, by adopting an optimization perspective that prioritizes improving the model's performance on the group for which performance is worst. Such optimizations only degrade the performance on a group if doing so improves the performance on the worst-off group (Martinez et al., 2020; Diana et al., 2021).

In this paper, we propose simple and theoretically principled algorithms for min-max fairness (formally defined in Section 2). We provide a general template in Algorithm 1. The key idea underlying our approach is to adaptively sample or reweight data from worst-off groups. The intent is that by providing additional data for these groups, or increasing the weights associated with them, we improve the model's performance on these groups. This is a compellingly simple approach to mitigating loss disparity between groups, and we show that for convex learning problems such methods indeed provably minimize the maximum per-group loss.

We consider two concrete variants of Algorithm 1. The first one (Algorithm 2) simply samples a point from the worst-off group and updates the model parameters via stochastic gradient descent. We show in Theorem 1 that Algorithm 2 converges to a min-max fair solution at a rate of $1/\sqrt{T}$ (as a function of the number of iterations T), assuming a convex loss function. Our second approach (Algorithm 3), accelerated min-max gradient descent, converges to a min-max fair solution at a faster rate of $1/T$. The accelerated Algorithm 3 is based on an adaptive reweighting scheme and performs in each iteration a gradient descent step for optimizing a weighted population loss. In contrast, Algorithm 2 samples, in each iteration, a datapoint from the worst-off group and uses it for performing a stochastic gradient descent step.
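The first variant can be sketched in a few lines of code. The following is a minimal illustration of the sampling idea for logistic regression on per-group numpy arrays, not the implementation used in our experiments; the helper names `logistic_loss`, `logistic_grad`, and `minmax_sgd` are ours:

```python
import numpy as np

def logistic_loss(theta, X, y):
    # mean logistic loss, labels y in {-1, +1}; logaddexp avoids overflow
    return float(np.mean(np.logaddexp(0.0, -y * (X @ theta))))

def logistic_grad(theta, X, y):
    # gradient of the mean logistic loss w.r.t. theta
    s = -y / (1.0 + np.exp(y * (X @ theta)))
    return X.T @ s / len(y)

def minmax_sgd(groups, theta0, eta, T, rng):
    """At each step, take an SGD step on a single point drawn from the
    group whose current loss is highest; return the average iterate."""
    theta, avg = theta0.copy(), np.zeros_like(theta0)
    for _ in range(T):
        losses = [logistic_loss(theta, X, y) for X, y in groups]
        X, y = groups[int(np.argmax(losses))]   # worst-off group
        j = rng.integers(len(y))                # one point from that group
        theta = theta - eta * logistic_grad(theta, X[j:j+1], y[j:j+1])
        avg += theta
    return avg / T
```

With a projection onto a bounded parameter set added after the update, this matches the structure of the stochastic variant analyzed in Section 3.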
While Algorithm 3 achieves a faster rate of convergence, Algorithm 2 is conceptually simpler, easier to implement, and efficient in practice due to the stochastic nature of the updates. We also provide finite-sample generalization bounds for Algorithm 2. We present an empirical evaluation of both our proposed algorithms in Section 5.

**Algorithm 1** A generic group-specific loss-aware sampling strategy
1: Initialize classifier / regressor f with an initial model
2: repeat:
3: determine the group for which the current model f has the highest loss
4: sample a labelled datapoint from that group and use it to update f
5: (optional) set f to the average over all past f
6: return f

## 2. Preliminaries

Let Z be our data space; any $z \in Z$ represents a population sample containing both the observed features and the unobserved label. We assume that there are g disjoint demographic groups, indexed by the set $[g] := \{1, \dots, g\}$.¹ Let $D_1, \dots, D_g$ be a family of distributions over Z, where $D_j \in \Delta(Z) := \{\text{distributions over } Z\}$ is the distribution of the data for group j. Let $q \in \Delta_g := \{v \in [0,1]^g : \sum_{i=1}^g v_i = 1\}$ denote a vector of mixture weights over groups. Given any q, we define the mixture distribution $D_q$ on Z as follows: we first sample a group index $i \sim q$, and then we sample $z \sim D_i$ from the randomly chosen group i.

The models considered are based on a parameterized family of functions $\mathcal{F} := \{f_\theta : \theta \in \Theta\}$, where each $f_\theta$ operates on examples $z \in Z$ and $\Theta \subseteq \mathbb{R}^d$ is a d-dimensional parameter space. We assume that Θ is a compact convex set. We evaluate each $f_\theta$ according to a loss function $\ell : \mathcal{F} \times Z \to \mathbb{R}$, and we assume the loss $\ell(f_\theta; z)$ to be convex in θ for any fixed $z \in Z$:

**Assumption 1.** For any $z \in Z$, the function $\theta \mapsto \ell(f_\theta; z)$ is convex in θ.
Assumption 1 is satisfied in several common scenarios, e.g., when $f_\theta$ is linear in θ and ℓ is the standard logistic loss or hinge loss (binary classification), the cross-entropy loss composed with softmax activation (multiclass classification), or the squared loss (regression). While we use $\nabla\ell$ throughout the paper to refer to the gradient, we do not strictly need $\theta \mapsto \ell(f_\theta; z)$ to be differentiable, as we may consider any sub-gradient instead.

Given a distribution $D \in \Delta(Z)$ and any $\theta \in \Theta$, we define the expected loss of θ with respect to D as $v(\theta; D) := \mathbb{E}_{z \sim D}\,\ell(f_\theta; z)$. Similarly, given mixture weights $q \in \Delta_g$, we consider the performance of $f_\theta$ with respect to $D_q$. Thus

$$v(\theta; D_q) = \mathbb{E}_{i \sim q}\,\mathbb{E}_{z \sim D_i}\big[\ell(f_\theta; z)\big] = \sum_{i=1}^g q(i)\,\mathbb{E}_{z \sim D_i}\,\ell(f_\theta; z).$$

¹Sometimes we want to be fair w.r.t. demographic groups that are not disjoint, e.g., to men and women and also to old and young people. In this case, as is standard in the literature, we simply consider all intersections of these groups, i.e., young females, young males, etc.

For each group $i \in [g]$ we assume we have an IID sample of $m_i$ examples $z_1, \dots, z_{m_i} \sim D_i$. We use $\hat D_i$ to represent the empirical distribution over these samples. Hence,

$$v(\theta; \hat D_i) = \frac{1}{m_i}\sum_{j=1}^{m_i} \ell(f_\theta; z_j).$$

Throughout, we use the notation $\mathrm{PROJ}_K(x)$ to refer to the ℓ2-projection of a point $x \in \mathbb{R}^d$ onto a compact convex set $K \subseteq \mathbb{R}^d$, that is, $\mathrm{PROJ}_K(x) := \operatorname{argmin}_{y \in K} \|x - y\|_2$.

Our goal is to learn a model $f_{\theta^*}$ that is min-max fair w.r.t. the g demographic groups. This means that $\theta^*$ satisfies

$$\max_{i \in [g]} v(\theta^*; D_i) = \inf_{\theta \in \Theta}\,\max_{i \in [g]} v(\theta; D_i).$$

**Remark 1.** A criticism of the min-max approach to fairness is that it puts too much focus on improving the performance of a single group. If some group j is particularly hard to learn, so that any $\theta \in \Theta$ will have a large value $v(\theta; D_j)$, larger than for all other groups $j'$, then the training procedure will aim to optimize the loss with respect to only $D_j$.
This is a reasonable concern, but we can mitigate it by considering blended distributions: let $p \in [0,1]$ be a trade-off parameter, and define the mixture distribution $D_i^p := (1-p)\,D_i + p\,D_{\hat q}$, where $D_{\hat q}$ is the true population-level distribution; that is, the full population is made up of a mixture of the subpopulations $D_1, \dots, D_g$ weighted by the true mixture weights $\hat q$. If we run our min-max fairness procedures on the blended distributions $D_i^p$ instead of on $D_i$, we bias our algorithm, with bias parameter p, to focus more on the full population and less on any particular group.

## 3. Algorithms and Analysis

Here we present formal versions of Algorithm 1 and analyze their theoretical performance. To formalize Algorithm 1 fully, we describe how we evaluate which group has the highest loss in each iteration, and how we either sample additional data from that group or reweight it and update the model. Both algorithms use their updating schemes to reduce their training error on the min-max objective; while we also present theorems that bound the test error, this work does not use reweighting or resampling to explicitly decrease generalization error.

### 3.1. Stochastic Optimization

Algorithm 2 maintains a validation set, that is, a fixed comparison sample set on which the group-specific loss is repeatedly measured. It samples a fresh point from the group with the highest loss on the comparison set, then takes a single gradient step in the direction of that fresh sample. Its performance is governed by two quantities: the regret term, which decreases with the number of iterations T, and the uniform deviation bound of the comparisons of group-specific loss.

**Algorithm 2** Min-max Stochastic Gradient Descent
1: Init: $\theta_1 \in \Theta$ arbitrary
2: for $t = 1, \dots, T-1$ do
3: compute $i_t = \operatorname{argmax}_{i \in [g]} v(\theta_t; \hat D_i)$
4: sample $z_t \sim D_{i_t}$
5: compute $\nabla_t \leftarrow \nabla_\theta\, \ell(f_{\theta_t}; z_t)$
6: update $\theta_{t+1} \leftarrow \mathrm{PROJ}_\Theta(\theta_t - \eta \nabla_t)$
7: end for
8: return $\bar\theta_T = \frac{1}{T}\sum_{t=1}^T \theta_t$

As is common in gradient descent, the following proof assumes that the Lipschitz constant L and a domain radius W are known. When this is not the case, the step size η is typically tuned empirically.

**Theorem 1.** Assume we have a function $R_\delta = R(m_1, \dots, m_g; \delta)$ which guarantees that

$$\sup_{\theta \in \Theta}\,\max_{i \in [g]}\, \big|v(\theta; D_i) - v(\theta; \hat D_i)\big| \le R_\delta$$

with probability at least $1-\delta$. Let $W := \sup_{\theta \in \Theta} \|\theta - \theta_1\|_2$ and $L := \sup_{\theta \in \Theta} \max_{i \in [g]} \|\nabla_\theta v(\theta; D_i)\|_2$. With $\eta := \frac{W}{L\sqrt{T}}$, Algorithm 2 ensures that

$$\max_{i \in [g]} v(\bar\theta_T; D_i) \le \inf_{\theta \in \Theta}\,\max_{i \in [g]} v(\theta; D_i) + \frac{WL}{\sqrt{T}} + 2R_{\delta/T}$$

with probability at least $1-\delta$.

*Proof.* This bound is obtained by combining a number of classical results from empirical process theory with common tricks for using online convex optimization tools to solve min-max problems. The sequence of steps is given in Figure 1, with further discussion here. One observes that we need to swap between $v(\theta; \hat D)$ and $v(\theta; D)$ in two separate inequalities, and in each we have to add the deviation bound $R_{\delta/T}$; the T factor is necessary because we need a union bound over all T rounds. We then replace $v(\theta_t; D_{i_t})$ with $\ell(f_{\theta_t}; z_t)$, which is valid since $\theta_t$ is independent of $z_t$, $z_t$ is distributed according to $D_{i_t}$, and we take the outer expectation over all $z_1, \dots, z_T$ (more details on this technique can be found in the paper of Cesa-Bianchi et al., 2004). Next, since the $\theta_t$'s are chosen using the Online Gradient Descent (OGD) algorithm on the loss functions $h_t(\cdot) := \ell(f_\cdot; z_t)$, we can immediately apply the OGD regret bound; see Hazan (2019) for details. The most subtle part of this proof may be the final two observations. The sequence $z_{1:T}$ is generated stochastically and sequentially, where each $z_t$ may depend on the previous samples chosen. But in the end, the sample $S_T$ is produced by taking some combination of samples from the various $D_1, \dots, D_g$, and ultimately we marginalize the quantity $v(\theta^*; S_T)$ over the randomness generated by $z_1, \dots, z_T$. On average, the z's in $S_T$ will have been drawn from some mixture over the various groups, and we refer to those mixture weights as $\bar q$. It then follows from this observation that $\mathbb{E}_{z_{1:T}}\, v(\theta^*; S_T) = v(\theta^*; D_{\bar q})$. Finally, since $v(\theta^*; D_{\bar q}) = \mathbb{E}_{i \sim \bar q}\, v(\theta^*; D_i)$, we upper bound $\mathbb{E}_{i \sim \bar q}$ with $\max_i$ to complete the proof. ∎

While not the focus of our paper, it is easy enough to give a uniform deviation bound of the kind Theorem 1 employs.

**Lemma 1.** With probability at least $1-\delta$, for every $\theta \in \Theta$ and for every $q \in \Delta_g$ it holds that

$$v(\theta; \hat D_q) \le v(\theta; D_q) + c\,\sqrt{\frac{\mathrm{P\text{-}dim}(\Theta)\,\log(g\, m_{\min}/\delta)}{m_{\min}}},$$

where $m_{\min} := \min_{i \in [g]} m_i$, $c > 0$ is some constant, and P-dim is the pseudo-dimension of the class (Pollard, 1990).

*Proof.* This follows from a standard uniform convergence argument over Θ for any fixed group i, as $\hat D_i$ is a sample of $m_i$ IID points drawn from $D_i$. Taking a union bound over all g groups yields the bound. ∎

This implies that the total error of Algorithm 2 is comprised of two terms: one upper bounding the optimization error (which decays as $1/\sqrt{T}$), and the generalization error, which is governed by the sample size of the smallest dataset across all groups. A key benefit of Theorem 1 is that it provides both an optimization guarantee as well as a sample complexity bound. The proof's core is the classical online-to-batch conversion (Cesa-Bianchi et al., 2004), which provides generalization based on regret bounds, combined with tools from min-max optimization. One downside of this method is that it relies on having two sources of data: a comparison set $\hat D_i$ for each $i \in [g]$, as well as the ability to draw fresh (independent) samples from each $D_i$. Alternatively, we consider a version that focuses only on training error and reweights rather than samples fresh data, by drawing $z_t$ from $\hat D_i$. This variant still allows the min-max empirical risk to decay at the standard $1/\sqrt{T}$ rate.

**Corollary 1.**
Consider a version of Algorithm 2 that, on line 4, draws samples IID from the empirical distribution $\hat D_{i_t}$ as opposed to fresh samples from $D_{i_t}$. Then, with W and L defined as in Theorem 1, we have

$$\max_{i \in [g]} v(\bar\theta_T; \hat D_i) \le \inf_{\theta \in \Theta}\,\max_{i \in [g]} v(\theta; \hat D_i) + \frac{WL}{\sqrt{T}}.$$

**Figure 1.** Main steps in the proof of Theorem 1. The proof proceeds along the lines of the classical online-to-batch conversion (Cesa-Bianchi et al., 2004), but hinges on a few additional tricks.

$$\begin{aligned}
\mathbb{E}_{z_{1:T}}\Big[\max_{i\in[g]} v(\bar\theta_T; D_i)\Big]
&\le \mathbb{E}_{z_{1:T}}\Big[\tfrac{1}{T}\max_{i\in[g]}\textstyle\sum_{t=1}^T v(\theta_t; D_i)\Big] && \text{(Jensen's inequality)}\\
&\le \mathbb{E}_{z_{1:T}}\Big[\tfrac{1}{T}\textstyle\sum_{t=1}^T \max_{i\in[g]} v(\theta_t; D_i)\Big] && \text{(max of sum $\le$ sum of max)}\\
&\le \mathbb{E}_{z_{1:T}}\Big[\tfrac{1}{T}\textstyle\sum_{t=1}^T \max_{i\in[g]} v(\theta_t; \hat D_i)\Big] + R_{\delta/T} && \text{(deviation between $D$, $\hat D$ + union bound)}\\
&= \mathbb{E}_{z_{1:T}}\Big[\tfrac{1}{T}\textstyle\sum_{t=1}^T v(\theta_t; \hat D_{i_t})\Big] + R_{\delta/T} && \text{(definition of $i_t$)}\\
&\le \mathbb{E}_{z_{1:T}}\Big[\tfrac{1}{T}\textstyle\sum_{t=1}^T v(\theta_t; D_{i_t})\Big] + 2R_{\delta/T} && \text{(additional deviation between $\hat D$, $D$)}\\
&= \mathbb{E}_{z_{1:T}}\Big[\tfrac{1}{T}\textstyle\sum_{t=1}^T \ell(f_{\theta_t}; z_t)\Big] + 2R_{\delta/T} && \text{(since $z_t \sim D_{i_t}$ + outer expectation)}\\
&\le \mathbb{E}_{z_{1:T}}\Big[\tfrac{1}{T}\textstyle\sum_{t=1}^T \ell(f_{\theta^*}; z_t) + \tfrac{\eta L^2}{2} + \tfrac{W^2}{2T\eta}\Big] + 2R_{\delta/T} && \text{(apply OGD regret bound)}\\
&= \mathbb{E}_{z_{1:T}}\big[v(\theta^*; S_T)\big] + \tfrac{WL}{\sqrt{T}} + 2R_{\delta/T} && \text{($\eta := \tfrac{W}{L\sqrt{T}}$, $S_T := \{z_1,\dots,z_T\}$)}\\
&= v(\theta^*; D_{\bar q}) + \tfrac{WL}{\sqrt{T}} + 2R_{\delta/T}\\
&\le \max_{i\in[g]} v(\theta^*; D_i) + \tfrac{WL}{\sqrt{T}} + 2R_{\delta/T}.
\end{aligned}$$

**Remark 2 (Mini-batching).** Many online training scenarios use mini-batch gradient updates, where instead of a single sample a set of samples is drawn, an average gradient is computed across these samples, and the average gradient is used to update the current parameter estimate. It requires only a straightforward modification to implement mini-batch training in Algorithm 2. While this may have practical benefits, providing faster empirical training times, we note that it is not likely to provide improved theoretical guarantees. Our convergence guarantee in Theorem 1 still applies in the mini-batch setting, with convergence depending on the number of updates T rather than the total amount of data used. Batches of size k then require k times more data overall for the same convergence guarantee.
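The modification is indeed minimal: lines 4–5 of Algorithm 2 become a draw of k points from the worst-off group followed by an averaged gradient. A sketch (the `grad_fn` signature is our illustrative assumption, not code from the paper):

```python
import numpy as np

def minibatch_step(theta, X, y, grad_fn, eta, k, rng):
    """Replace the single-sample update of Algorithm 2 by a mini-batch one:
    draw k points from the worst-off group's data and average their gradients."""
    idx = rng.integers(len(y), size=k)
    grad = np.mean([grad_fn(theta, X[j:j+1], y[j:j+1]) for j in idx], axis=0)
    return theta - eta * grad
```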
One might hope for a decrease in variance from the mini-batch averaging, and indeed this often empirically leads to better convergence, though it is not promised by our results.

### 3.2. Accelerated Optimization

Algorithm 3's optimization error shrinks much faster as a function of T. However, it is non-stochastic and has a more complex update rule. The algorithm explicitly maintains a distribution over groups, which it updates relative to the current group losses, increasing the probability mass assigned to groups with higher loss. Then the algorithm takes a gradient step with respect to the full, weighted distribution over (empirical) group distributions. While each iteration requires a full pass over the data, the convergence rate is $O(1/T)$ rather than $O(1/\sqrt{T})$. Unlike Algorithm 2, Algorithm 3 does not have a natural sampling analogue: the update rule is with respect to the entire weighted empirical distribution, rather than a single datapoint. Below we present the main algorithmic guarantee associated with the algorithm. See Appendix A for the proof.

**Algorithm 3** Accelerated Min-max Gradient Descent
1: Init: $q_0 = (\frac{1}{g}, \dots, \frac{1}{g})$, $\theta_1 \in \Theta$ arbitrary, $\nabla_0 = 0$
2: for $t = 1, \dots, T$ do
3: compute $u_t(i) \leftarrow v(\theta_t; \hat D_i)$, for $i = 1, \dots, g$
4: update $q_t(i) \leftarrow q_{t-1}(i)\,\exp(\gamma\, u_t(i))$, for $i = 1, \dots, g$
5: normalize $q_t \leftarrow q_t / \|q_t\|_1$
6: compute $\nabla_t \leftarrow \nabla_\theta\, v(\theta_t; \hat D_{q_t})$
7: update $\theta_{t+1} \leftarrow \mathrm{PROJ}_\Theta(\theta_t - 2\eta \nabla_t + \eta \nabla_{t-1})$
8: end for
9: return $\bar\theta_T = \frac{1}{T}\sum_{t=1}^T \theta_t$

**Theorem 2.** Algorithm 3, with parameters η and γ set in terms of W, L, and $\log g$ as specified in Appendix A, outputs $\bar\theta_T$ that satisfies

$$\max_{i \in [g]} v(\bar\theta_T; \hat D_i) \le \inf_{\theta \in \Theta}\,\max_{i \in [g]} v(\theta; \hat D_i) + \frac{2WL\sqrt{\log g}}{T},$$

where W and L are defined as in Theorem 1.

**Remark 3 (Averaging versus final iterate).** A careful reader may note that Algorithms 2 and 3 output a time-weighted average $\bar\theta_T$, whereas typical online training methods simply output the final iterate $\theta_T$. Indeed, for the min-max framework we propose, our theory requires returning the average iterate.
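In code, the weight and parameter updates of Algorithm 3 take the following form. This is a sketch on a toy unconstrained problem: the projection onto Θ and the parameter settings of Theorem 2 are omitted, and the function names and quadratic group losses used below are our illustration, not part of the paper:

```python
import numpy as np

def accelerated_minmax(loss_fns, grad_fns, theta0, eta, gamma, T):
    """Exponentiated-gradient updates on the group weights q, and an
    optimistic step theta <- theta - 2*eta*grad_t + eta*grad_{t-1}
    on the q-weighted loss; returns the average iterate."""
    g = len(loss_fns)
    q = np.full(g, 1.0 / g)                          # uniform initial weights
    theta, grad_prev = theta0.copy(), np.zeros_like(theta0)
    avg = np.zeros_like(theta0)
    for _ in range(T):
        u = np.array([f(theta) for f in loss_fns])   # per-group losses
        q = q * np.exp(gamma * u)                    # upweight high-loss groups
        q /= q.sum()                                 # normalize to the simplex
        grad = sum(qi * gf(theta) for qi, gf in zip(q, grad_fns))
        theta = theta - 2.0 * eta * grad + eta * grad_prev
        grad_prev, avg = grad, avg + theta
    return avg / T
```

On two quadratic group losses $(\theta-1)^2$ and $(\theta+1)^2$, the iterates oscillate around the min-max optimum $\theta = 0$ and the returned average settles near it, illustrating why averaging is needed.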
Some work exists on last-iterate convergence for special cases of min-max optimization (Abernethy et al., 2021), but this is beyond the scope of the present work.

## 4. Related Work

**Fair ML.** There is a large body of work on fairness in machine learning (Barocas et al., 2018), much of it focusing on supervised learning. Many fairness notions balance performance measures across different groups (e.g., Hardt et al., 2016). These notions suffer from the leveling down discussed in the introduction. Min-max fairness notions have been proposed as a remedy.

**Min-max fairness.** Martinez et al. (2020) consider the search for min-max Pareto optimal classifiers and present structural results regarding the case of unbounded hypothesis sets. By appropriately reparameterizing the space, they show that one can, in principle, model the case of learning min-max Pareto optimal classifiers over the class of deep neural networks. Martinez et al. propose an algorithm to find optimal classifiers (based on sub-gradient descent), but unlike our work, their proposed algorithm has no performance guarantees and is not guaranteed to converge. Diana et al. (2021) propose a multiplicative-weights-update-based method to achieve min-max fairness. While they do not require convexity, they assume access to a weighted empirical risk minimization (ERM) oracle, and it is unclear how to implement such oracles in a non-convex setting. Furthermore, the analysis in Diana et al. (2021) is only carried out in the population setting, where it is assumed that certain weighted ERM problems can be exactly optimized over the distribution. As a result, their work ignores the complexity of the analysis arising from the stochastic nature of gradient updates. One key contribution of our work is the analysis of gradient-based updates, which allow for more efficient computation and the use of highly-optimised frameworks and tools. Finally, at least in the non-convex case, the hypothesis output by Diana et al. (2021) needs to be randomized, which can be problematic in scenarios strongly affecting people (Cotter et al., 2019b).

A previous arXiv version of this paper introduced a restricted variant of Algorithm 1, in which a single sample was drawn from the worst-off group and each round's model was the global optimum on the current dataset, and analyzed its behaviour. Shekhar et al. (2021) claimed that the sampling scheme proposed in our previous draft, which is closely related to our Algorithm 1, converges to min-max fair solutions. While their paper does not impose convexity constraints, their algorithm has no rate-of-convergence guarantees, and their assumptions (particularly their Assumption 2) needed to prove convergence frequently fail to hold in practice.² We improve upon those results empirically and theoretically and guarantee fast rates of convergence. Min-max fairness has also been studied in unsupervised settings such as dimensionality reduction (Samadi et al., 2018; Tantipongpipat et al., 2019) and clustering (Ghadiri et al., 2021), as well as in federated learning scenarios (Mohri et al., 2019; Papadaki et al., 2022).

**Min-max optimization.** Many problems beyond fairness can be formulated as min-max optimization problems, and the study of generic methods for solving these remains an active field of research (e.g., Thekumparampil et al., 2019; Razaviyayn et al., 2020; Ouyang and Xu, 2021). We are unaware of any generic methods that would be appropriate for our fairness problem with a discrete group variable.

**Group reweighting.** Other works study the problem of debiasing a dataset via group reweighting. Li and Vasconcelos (2019) propose a reweighting scheme to reduce representation bias. While based on a min-max formulation, their problem setting is different from ours. Rolf et al. (2021) study structural properties of optimal group allocations for a dataset. They present structural results regarding the nature of optimal allocations, but no algorithmic results. Agarwal et al.
(2019) consider a fair regression problem under a bounded group loss constraint, which in their setting is equivalent to finding a min-max fair classifier. Similar to Diana et al. (2021), they design a near-optimal regressor assuming access to a weighted risk minimization oracle that can be optimized exactly on the population. Achieving fairness under bounded loss constraints assuming oracle access has also been studied by Cotter et al. (2019a).

²Their Assumption 2 states: for any two distinct attributes $z, z' \in Z$, we must have $L(z, f^*_z) < L(z', f^*_z)$ for any $f^*_z \in \operatorname{arg\,min}_{f \in \mathcal F} L(z, f)$. This rarely holds in practice. In particular, let $\hat f$ be a min-max optimal predictor, and consider a group z with the largest loss under $\hat f$: we know that $L(\hat f, z) \ge L(\hat f, z')$ for any other group $z'$. When this inequality is strict, i.e., $L(\hat f, z) > \max_{z' \ne z} L(\hat f, z')$, we actually know that $L(\hat f, z) = L(f^*_z, z)$, or else mixing $\hat f$ and $f^*_z$ would reduce the min-max risk of $\hat f$. So in many applications of interest (for instance, where the min-max optimal predictor has a unique worst-off group), their assumption will not hold; it never held in any of our experiments.

[Figure 2: panels showing the logistic loss, classification error, and difference in iterates over the number of operations (points visited, up to 1e7) on the Drug Consumption dataset, for the TRAIN=VAL and SMALL VAL settings, comparing Diana et al., OURS (Alg. 2), OURS (Alg. 3), and Naive on the groups US, UK, and Other; plus a trade-off plot of maximum group log loss vs. overall log loss for p = 0.0, 0.1, …, 1.0. See caption below.]

| Log loss | US | UK | Other | Overall |
|---|---|---|---|---|
| Diana et al. (2021) | 0.4035 | **0.5894** | 0.4710 | 0.5166 |
| Martinez et al. (2020) | 0.4122 | **0.5889** | 0.4771 | 0.5198 |
| OURS (Alg. 2 with TRAIN=VAL; avg. over 5 runs) | 0.4114 | **0.5889** | 0.4766 | 0.5195 |
| OURS (Alg. 2 with SMALL VAL; avg. over 5 runs) | 0.4112 | **0.5889** | 0.4766 | 0.5195 |
| OURS (Algorithm 3) | 0.4108 | **0.5889** | 0.4761 | 0.5193 |

| Error | US | UK | Other | Overall |
|---|---|---|---|---|
| Diana et al. (2021) | 0.1566 | **0.2964** | 0.2173 | 0.2432 |
| Martinez et al. (2020) | 0.1598 | **0.2950** | 0.2183 | 0.2435 |
| OURS (Alg. 2 with TRAIN=VAL; avg. over 5 runs) | 0.1634 | **0.2971** | 0.2218 | 0.2463 |
| OURS (Alg. 2 with SMALL VAL; avg. over 5 runs) | 0.1601 | **0.2960** | 0.2211 | 0.2446 |
| OURS (Algorithm 3) | 0.1598 | **0.2960** | 0.2183 | 0.2440 |

**Figure 2.** Logistic regression on the Drug Consumption dataset. Top row: logistic loss and classification error for the three groups over time, both for Diana et al. (2021) and our Algorithms 2 and 3. Middle row: same as top row, but for a validation set that comprises only 60 datapoints (20 datapoints sampled uniformly at random from each group). Bottom row: trade-off curves obtained by varying γ in a variant of the algorithm by Diana et al. (2021) and a probability parameter p with which we sample from the whole population in our Algorithm 2. (Average) per-group losses and errors as well as overall losses and errors from the final iteration are shown in the table. For every method, the maximum loss / error among the groups is shown in bold.

**Active sampling.** Active / adaptive sampling lies at the heart of active learning (Settles, 2010). Related to our work is the paper by Anahideh and Asudeh (2020).
In each round, their sampling strategy queries the label of a datapoint that is both informative and expected to yield a classifier with a small violation of a fairness measure (they do not consider min-max fairness but mainly demographic parity, which requires the classifier's prediction to be independent of group membership). Unlike our work, their approach requires training a classifier for every datapoint that might be queried before actually querying a datapoint, resulting in a significant computational overhead. Moreover, their work does not provide any theoretical analysis. Also related is the paper by Noriega-Campero et al. (2019), who propose to actively collect additional features for datapoints in order to equalize the performance on different groups.

## 5. Experiments

Before presenting empirical results,³ we once more highlight the key advantages of our method over existing ones: (a) simplicity and computational efficiency: we only perform one (stochastic) gradient step in every iteration, while the other methods fully retrain in every iteration; (b) stronger convergence guarantees: our proposed Algorithm 2 (Algorithm 3) is guaranteed to converge at a rate of $1/\sqrt{T}$ ($1/T$) in the number of SGD (GD) steps. In contrast, the algorithm presented in Diana et al. (2021) is guaranteed to converge at a rate of $1/\sqrt{T}$ in the number of oracle calls, and Martinez et al. (2020) do not prove convergence of their proposed algorithm.

³Code available on https://github.com/amazonresearch/active-sampling-for-minmaxfairness

Our online stochastic approach is substantially faster than the fully deterministic approaches that exactly solve subproblems at each iteration. To avoid being misled by implementation details, such as the choice of implementation language, we consider a proxy for runtime: namely, the number of times any datapoint is examined. Under this metric, the cost of computing a single SGD update with mini-batch size k is k; the cost of evaluating the objective w.r.t. a single point is 1; and the cost of evaluating the objective for every datapoint of a dataset of size n is n. Logistic regression has a cost of $n \cdot i$, where i is the number of iterations needed to reach convergence. Looking at Algorithm 2, we see that a significant per-iteration bottleneck is line 3, which evaluates the loss over a validation set. The size of this set is potentially important: too small and the method may not reliably select the worst-off group; too large and it needlessly hurts runtime. As such, we evaluate using both small validation sets containing 20 random members of each group and larger validation sets. For all experiments we use a mini-batch size of 32.

[Figure 3: logistic loss, classification error, and group weights over rounds t (up to 10000) on the COMPAS dataset, for the groups African-American, Caucasian, Hispanic, and Other, comparing Diana et al., OURS (Alg. 3), and Naive. See caption below.]

| Log loss | African-American | Caucasian | Hispanic | Other | Overall |
|---|---|---|---|---|---|
| Diana et al. (2021) | **0.6200** | 0.6132 | 0.6104 | 0.5804 | 0.6145 |
| Martinez et al. (2020) | **0.6196** | 0.6162 | 0.6129 | 0.5830 | 0.6158 |
| OURS (Algorithm 3) | **0.6196** | 0.6161 | 0.6127 | 0.5828 | 0.6156 |

| Error | African-American | Caucasian | Hispanic | Other | Overall |
|---|---|---|---|---|---|
| Diana et al. (2021) | 0.3279 | **0.3292** | 0.3150 | 0.3074 | 0.3260 |
| Martinez et al. (2020) | 0.3309 | **0.3370** | 0.3234 | 0.3185 | 0.3316 |
| OURS (Algorithm 3) | 0.3309 | **0.3370** | 0.3234 | 0.3185 | 0.3316 |

**Figure 3.** Logistic regression on the COMPAS dataset. Performance of Algorithm 3 in comparison to the method by Diana et al. (2021): logistic loss (left), classification error (center), and group weights (right) over time. In the table, for every method we show the maximum loss / error among the groups in bold.

We compare with the public code of both Diana et al. (2021) and Martinez et al. (2020). When evaluating efficiency, we focus on the method of Diana et al. (2021), however, since the method of Martinez et al.
(2020) does not come with theoretical guarantees of convergence (and as such is incomparable to our methods anyway), and their experimental evaluation does not look at the evolution of iterates over time but only at the final iterate. As for our proposed strategy, we focus on Algorithm 2 as the efficient variant of our general strategy (Algorithm 1) in the convex case; however, we also study the performance of Algorithm 3, and we run Algorithm 2 without averaging, using a simple neural network as the classification model, to study the non-convex case. Similarly to Diana et al., we report both optimization and generalization performance. To evaluate optimization performance, we use small datasets and report results on the training data. When studying generalization performance, we consider a large dataset and report results on a held-out test set. Since our strategy (Algorithms 1 or 2) is randomized, we show its results for five runs with different random seeds. See Appendix B.1 for implementation details.

**Heuristics for estimating W and L.** Algorithm 3 requires estimates of the values $\theta_1$, $W := \sup_{\theta \in \Theta} \|\theta - \theta_1\|_2$, and $L := \sup_{\theta \in \Theta} \max_{i \in [g]} \|\nabla_\theta v(\theta; D_i)\|_2$ in order to set the parameters η and γ. We use the same method of estimating them for both experiments, shown in Figure 3 and the table of Figure 2, respectively: as the data is whitened, we simply take $L := \sqrt{d}$, where d is the number of parameters in our logistic regression model, as an upper bound for the gradient of logistic regression. For $\theta_1$, we run unweighted logistic regression over the entire dataset and use this as the initialisation of our model. Finally, we take $W = \|\theta_1\|_2$ as an approximate estimate of the size of the domain.

**Performance on smaller datasets.** We compare Algorithms 2 and 3 with Diana et al. (2021) and Martinez et al. (2020) on the Drug Consumption dataset (Fehrman et al., 2015) and the COMPAS dataset (Angwin et al., 2016), respectively.
We provide some details about the datasets used in our experiments in Appendix B.2. On the Drug Consumption dataset, we train a logistic regressor to predict if an individual consumed cannabis within the last decade or not. The groups that we want to be fair to are defined by an individual s country. The dataset contains 1885 records. We use the entire dataset for training (sampling and performing SGD updates) and for reporting performance metrics. We either use the entire dataset or a small subset comprising 20 datapoints sampled uniformly at random from each group as validation set (for determining the group with the highest loss). Active Sampling for Min-Max Fairness Table 1. Logistic regression on the Diabetes dataset. (Average) Per-group log losses and classification errors from the final iteration. For every method, the maximum loss / error among the groups is shown in bold. [0-50) [50-60) [60-70) [70-80) [80-90) [0-50) [50-60) [60-70) [70-80) [80-90) Diana et al. (2021) Train 0.6327 0.6425 0.6474 0.6550 0.6473 0.3256 0.3418 0.3500 0.3626 0.3498 Test 0.6357 0.6443 0.6495 0.6563 0.6509 0.3304 0.3450 0.3537 0.3652 0.3556 Martinez et al. (2020) Train 0.6145 0.6219 0.6289 0.6458 0.6439 0.3228 0.3326 0.3435 0.3616 0.3573 Test 0.6113 0.6263 0.6329 0.6448 0.6456 0.3198 0.3396 0.3513 0.3650 0.3611 OURS (Alg. 2; Avg. 5 runs) Train 0.6161 0.6232 0.6299 0.6458 0.6439 0.3240 0.3333 0.3442 0.3618 0.3579 Test 0.6129 0.6277 0.6337 0.6449 0.6455 0.3197 0.3401 0.3511 0.3650 0.3600 0 2000 4000 Round t Logistic loss Diabetes Test Set --- MULTILAYER PERCEPTRON Group [0-50) [50-60) [60-70) [70-80) [80-100) Method OURS (Alg. 2) Naive 0 2000 4000 Round t Diabetes Test Set --- MLP 0 2000 4000 Round t Logistic loss Diabetes Test Set --- MLP 0 2000 4000 Round t Diabetes Test Set --- MLP Figure 4. Algorithm 2 (returning the final iterate rather than the average) applied to a non-convex MLP on the Diabetes dataset. 
The plots show the per-group logistic losses and classification errors over time, evaluated on the held-out test set. The first and second plots show the results for five runs of the experiment; the third and fourth show the average results together with 95%-confidence intervals.

Figure 2 shows the results. The plots show the loss and error for each group over the run of our algorithms or that of Diana et al. (2021). Alongside the loss and error curves, we also plot baseline curves obtained by training a classifier or regressor using standard SGD on the training data (dashed lines; denoted by Naive). The figure also provides a plot showing trade-off curves, trading off the maximum group loss against the overall population loss. For Diana et al., the curve is obtained by varying the parameter $\gamma$ in a variant of their algorithm; for our strategy, we exploit the simple modification discussed in Remark 1: before determining the worst-off group, we flip a biased coin, and with probability $p$ we sample a datapoint from the whole population, while with probability $1-p$ we sample from the worst-off group. By varying the parameter $p \in [0, 1]$, we generate the trade-off curve. The table shows the performance metrics for the model obtained in the final iteration of each algorithm (including the algorithm of Martinez et al. (2020)). The loss (and also the error) of the worst-off group decreases over time, while increasing for the other groups. This is consistent with Section 3, which guarantees improved performance on the highest-loss group, but not on the other groups. In terms of the solution found in the final iteration, we perform similarly to Diana et al. (2021) and Martinez et al. (2020), with all methods accurately solving the same objective and finding solutions of similar cost (cf. the table in Figure 2). The two trade-off curves are also almost identical. Very clear trends can be seen in the graphs. While around 1e7 operations are required for Diana et al.
(2021) to converge (this is particularly apparent in the blue curve representing the loss on the US group), our approaches converge much faster. In particular, the intrinsic volatility of the SGD update is largely masked by the fast convergence of our approach, with all runs converging much faster than any other approach, leading to multiple overlaid plots. The performance benefit is even more pronounced when a small validation set is used: here convergence looks near instantaneous when plotted on a scale that also shows the behavior of the algorithm by Diana et al. In general, despite its better performance guarantees, our deterministic accelerated version has performance comparable to Algorithm 2 with a large validation set, and lies midway between Algorithm 2 with a small validation set and Diana et al. (2021). Note that as the parameters $\gamma$ and $W$ are estimated heuristically, better performance could likely be obtained if they were known; however, we felt it was more informative to report the performance obtained without tuning.

In Figure 3, we evaluate Algorithm 3 on the COMPAS dataset (Angwin et al., 2016) and train a logistic regression classifier to predict recidivism. The graphs show a comparison with Diana et al. (2021). For our method, a single iteration corresponds to one step of gradient descent, while Diana et al. require the computation of an optimal classifier in each iteration. Despite this, we still converge substantially faster per iteration.

Generalization on a larger dataset. In Table 1, we evaluate on the Diabetes 130-US Hospitals dataset (Strack et al., 2014). The goal is to predict whether a patient was readmitted to hospital, and we want to be fair with respect to different age groups. We train a linear logistic classifier. The Diabetes dataset contains 101766 records, which we split into a training, a validation, and a held-out test set of equal size.
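The per-round update of our strategy used throughout these experiments, including the Remark 1 coin flip that generates the trade-off curves, can be sketched end-to-end for logistic regression as follows. This is a minimal sketch with our own helper names, not the code used for the reported results.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logloss(theta, X, y):
    p = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def active_minmax_sgd(train, val, d, lr=0.1, rounds=2000, p=0.0, seed=0):
    """Per-round update of the active sampling strategy.

    `train` and `val` are lists of (X, y) pairs, one per group. Each
    round: (i) with probability p (the Remark 1 coin flip; p = 0 is pure
    min-max sampling, p = 1 is naive uniform sampling) pick a group
    proportionally to its size, otherwise (ii) pick the group that is
    currently worst off on the validation data; then (iii) take one SGD
    step on a uniformly sampled point from that group."""
    rng = np.random.default_rng(seed)
    sizes = np.array([len(y) for _, y in train], dtype=float)
    theta = np.zeros(d)
    for _ in range(rounds):
        if rng.random() < p:
            g = int(rng.choice(len(train), p=sizes / sizes.sum()))
        else:
            losses = [logloss(theta, Xv, yv) for Xv, yv in val]
            g = int(np.argmax(losses))                    # worst-off group
        Xg, yg = train[g]
        j = rng.integers(len(yg))                         # one datapoint
        grad = (sigmoid(Xg[j] @ theta) - yg[j]) * Xg[j]   # pointwise gradient
        theta -= lr * grad
    return theta
```

Setting `p` between 0 and 1 trades the maximum group loss against the overall population loss, which is how the trade-off curves in Figure 2 are produced.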
The latter is not used in training. We initialize our classifier by training on a subset of 2000 training points. All methods achieve similar loss and error, both when comparing training and test sets and when comparing the various methods. The former illustrates the good generalization performance of the algorithms. The results of our Algorithm 2 and of the method by Martinez et al. (2020) are even more similar to each other than to those of the method by Diana et al. (2021).

Algorithm 2 in non-convex learning. Except for the averaging step, Algorithm 2 can also be applied, without guarantees, to non-convex learning problems, including neural network training. We demonstrate this by training a simple multilayer perceptron (MLP) with Algorithm 2 on the Diabetes dataset, in the setting of Table 1. Results are shown in Figure 4. The plots show the per-group logistic losses and classification errors on the test set. We improve the loss and the error of the worst-off group compared to the naive baseline.

6. Discussion

Potential harms. Implicit in min-max fairness is the idea that the labels are accurate and that a trained classifier should reproduce them with high fidelity. As argued by Wachter et al. (2021), where this is not the case, for example: where racially-biased law enforcement practices make stop and arrest rates a poor surrogate for criminal activity (Baumgartner et al., 2018); where hiring data is based on biased historic practices (Harvie et al., 1998); or when using existing diagnoses to train skin cancer detection (Gupta et al., 2016); min-max fairness, along with other error-based fairness notions, can give rise to classifiers that mimic the biases present in the data, which, if used to make substantive decisions about individuals, can perpetuate inequality.

Our contribution. We present a novel approach to min-max algorithmic fairness.
In contrast to existing approaches, ours stands out for its efficient stochastic nature, its easy-to-implement formulation, and its guaranteed convergence rates. Our experiments on real-world datasets show the merits of our approach.

Acknowledgements

Jamie Morgenstern and Jie Zhang acknowledge funding from the NSF AI Institute for the Foundations of Machine Learning (IFML), an NSF Career award, and the Simons Collaborative grant on Theory of Algorithmic Fairness.

References

J. Abernethy, K. Lai, and A. Wibisono. Last-iterate convergence rates for min-max optimization: Convergence of Hamiltonian gradient descent and consensus optimization. In International Conference on Algorithmic Learning Theory (ALT), 2021.

A. Agarwal, M. Dudík, J. Langford, and Z. S. Wu. Fair regression: Quantitative definitions and reduction-based algorithms. In International Conference on Machine Learning (ICML), 2019.

H. Anahideh and A. Asudeh. Fair active learning. arXiv:2001.01796 [cs.LG], 2020.

J. Angwin, J. Larson, S. Mattu, and L. Kirchner. ProPublica machine bias, 2016. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.

S. Barocas, M. Hardt, and A. Narayanan. Fairness and Machine Learning. fairmlbook.org, 2018. http://www.fairmlbook.org.

F. R. Baumgartner, D. A. Epp, and K. Shoub. Suspect citizens: What 20 million traffic stops tell us about policing and race. Cambridge University Press, 2018.

C. Brown. Giving up levelling down. Economics and Philosophy, 19(1):111, 2003.

N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, 2004.

C.-K. Chiang, T. Yang, C.-J. Lee, M. Mahdavi, C.-J. Lu, R. Jin, and S. Zhu. Online optimization with gradual variations. In Conference on Learning Theory (COLT), 2012.

T. Christiano and W. Braynen. Inequality, injustice and levelling down. Ratio, 21(4):392–420, 2008.

A. Cotter, H. Jiang, and K. Sridharan.
Two-player games for efficient non-convex constrained optimization. In International Conference on Algorithmic Learning Theory (ALT), 2019a.

A. Cotter, H. Narasimhan, and M. Gupta. On making stochastic classifiers deterministic. In Neural Information Processing Systems (NeurIPS), 2019b.

E. Diana, W. Gill, M. Kearns, K. Kenthapadi, and A. Roth. Minimax group fairness: Algorithms and experiments. In AAAI/ACM Conference on AI, Ethics, and Society (AIES), 2021. Code available at https://github.com/amazon-research/minimax-fair.

B. Doran. Reconsidering the levelling-down objection against egalitarianism. Utilitas, 13(1):65–85, 2001.

D. Dua and C. Graff. UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences, 2019. http://archive.ics.uci.edu/ml.

E. Fehrman, A. K. Muhammad, E. M. Mirkes, V. Egan, and A. N. Gorban. The five factor model of personality and evaluation of drug consumption risk. arXiv:1506.06297 [stat.AP], 2015. Dataset available at https://archive.ics.uci.edu/ml/datasets/Drug+consumption+(quantified).

M. Ghadiri, S. Samadi, and S. Vempala. Socially fair k-means clustering. In ACM Conference on Fairness, Accountability, and Transparency (FAccT), 2021.

A. K. Gupta, M. Bharadwaj, and R. Mehrotra. Skin cancer concerns in people of color: risk factors and prevention. Asian Pacific Journal of Cancer Prevention, 17(12):5257, 2016.

M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In Neural Information Processing Systems (NeurIPS), 2016.

K. Harvie, J. Marshall-Mcaskey, and L. Johnston. Gender-based biases in occupational hiring decisions. Journal of Applied Social Psychology, 28(18):1698–1711, 1998.

E. Hazan. Introduction to online convex optimization. arXiv:1909.05207 [cs.LG], 2019.

N. Holtug. Egalitarianism and the levelling down objection. Analysis, 58(2):166–174, 1998.

Y. Li and N. Vasconcelos.
Repair: Removing representation bias by dataset resampling. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

N. Martinez, M. Bertran, and G. Sapiro. Minimax Pareto fairness: A multi objective perspective. In International Conference on Machine Learning (ICML), 2020. Code available at https://github.com/natalialmg/MMPF.

A. Mason. Egalitarianism and the levelling down objection. Analysis, 61(3):246–254, 2001.

M. Mohri, G. Sivek, and A. T. Suresh. Agnostic federated learning. In International Conference on Machine Learning (ICML), 2019.

A. Noriega-Campero, M. A. Bakker, B. Garcia-Bulle, and A. Pentland. Active fairness in algorithmic decision making. In AAAI/ACM Conference on AI, Ethics, and Society (AIES), 2019.

Y. Ouyang and Y. Xu. Lower complexity bounds of first-order methods for convex-concave bilinear saddle-point problems. Mathematical Programming, 185:1–35, 2021.

A. Papadaki, N. Martinez, M. Bertran, G. Sapiro, and M. Rodrigues. Minimax demographic group fairness in federated learning. arXiv:2201.08304 [cs.LG], 2022.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research (JMLR), 12:2825–2830, 2011.

D. Pollard. Empirical processes: Theory and applications. NSF-CBMS Regional Conference Series in Probability and Statistics, 2:i–86, 1990.

S. Rakhlin and K. Sridharan. Optimization, learning, and games with predictable sequences. In Neural Information Processing Systems (NeurIPS), 2013.

M. Razaviyayn, T. Huang, S. Lu, M. Nouiehed, M. Sanjabi, and M. Hong. Nonconvex min-max optimization: Applications, challenges, and recent theoretical advances. IEEE Signal Processing Magazine, 37(5):55–66, 2020.

E. Rolf, T. Worledge, B. Recht, and M. Jordan.
Representation matters: Assessing the importance of subgroup allocations in training data. arXiv:2103.03399 [cs.LG], 2021.

S. Samadi, U. Tantipongpipat, J. Morgenstern, M. Singh, and S. Vempala. The price of fair PCA: One extra dimension. In Neural Information Processing Systems (NeurIPS), 2018.

B. Settles. Active learning literature survey. Technical report, 2010. Available at http://burrsettles.com/pub/settles.activelearning.pdf.

S. Shekhar, M. Ghavamzadeh, and T. Javidi. Adaptive sampling for minimax fair classification. arXiv:2103.00755 [cs.LG], 2021.

B. Strack, J. P. DeShazo, C. Gennings, J. L. Olmo, S. Ventura, K. J. Cios, and J. N. Clore. Impact of HbA1c measurement on hospital readmission rates: Analysis of 70000 clinical database patient records. BioMed Research International, 2014. Dataset available at https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008.

U. Tantipongpipat, S. Samadi, M. Singh, J. Morgenstern, and S. Vempala. Multi-criteria dimensionality reduction with applications to fairness. In Neural Information Processing Systems (NeurIPS), 2019.

L. Temkin. Equality, priority, and the levelling down objection. The Ideal of Equality, pages 126–161, 2000.

K. Thekumparampil, P. Jain, P. Netrapalli, and S. Oh. Efficient algorithms for smooth minimax optimization. In Neural Information Processing Systems (NeurIPS), 2019.

S. Wachter, B. Mittelstadt, and C. Russell. Bias preservation in machine learning: The legality of fairness metrics under EU non-discrimination law. West Virginia Law Review, forthcoming, 2021.

M. B. Zafar, I. Valera, M. G. Rodriguez, and K. P. Gummadi. Fairness constraints: A flexible approach for fair classification. Journal of Machine Learning Research (JMLR), 20:1–42, 2019.
A. Proof of Theorem 2

Let $\mathrm{ENT}(q) := -\sum_{i=1}^{g} q(i)\log q(i)$ be the entropy function and $\mathrm{KL}(p\|q) := \sum_{i=1}^{g} p(i)\log\frac{p(i)}{q(i)}$ be the Kullback–Leibler divergence. Let $q^\ast \in \Delta_g$ be the indicator distribution that puts all of its mass on $\operatorname{argmax}_{i\in[g]} v(\bar\theta_T; D_i)$. Notice that the iterative description of $q_t$ allows us to write
\[
q_t := \operatorname{argmax}_{q\in\Delta_g}\; \frac{1}{\gamma}\,\mathrm{ENT}(q) + \Big\langle \sum_{s=1}^{t} u_s,\, q \Big\rangle ,
\]
where $u_t(i) = v(\theta_t; D_i)$. We first observe that, using Jensen's inequality and $\mathrm{ENT}(q^\ast)=0$,
\begin{align*}
T \max_{i\in[g]} v\big(\bar\theta_T; D_i\big) = T\, v\big(\bar\theta_T; \hat D_{q^\ast}\big)
&\le \sum_{t=1}^{T} v\big(\theta_t; \hat D_{q^\ast}\big)
 = \Big\langle \sum_{t=1}^{T} u_t,\, q^\ast \Big\rangle + \frac{\mathrm{ENT}(q^\ast)}{\gamma} \\
&\le \Big\langle \sum_{t=1}^{T} u_t,\, q_T \Big\rangle + \frac{\mathrm{ENT}(q_T)}{\gamma}
 = \langle u_T, q_T\rangle + \Big\langle \sum_{t=1}^{T-1} u_t,\, q_T \Big\rangle + \frac{\mathrm{ENT}(q_T)}{\gamma} \\
&\le \langle u_T, q_T\rangle + \Big\langle \sum_{t=1}^{T-1} u_t,\, q_{T-1} \Big\rangle + \frac{\mathrm{ENT}(q_{T-1})}{\gamma} - \frac{\mathrm{KL}(q_T\|q_{T-1})}{\gamma} \\
&\le \cdots \le \sum_{t=1}^{T} \langle u_t, q_t\rangle + \frac{\mathrm{ENT}(q_1)}{\gamma} - \sum_{t=2}^{T} \frac{\mathrm{KL}(q_t\|q_{t-1})}{\gamma} \\
&\le \sum_{t=1}^{T} v\big(\theta_t; \hat D_{q_t}\big) - \sum_{t=1}^{T} \frac{\|q_t - q_{t-1}\|_1^2}{2\gamma} + \frac{\log g}{\gamma},
\end{align*}
where the third inequality uses that $q_{T-1}$ maximizes the $\frac{1}{\gamma}$-strongly concave (with respect to KL) objective $q \mapsto \frac{1}{\gamma}\mathrm{ENT}(q) + \langle \sum_{s=1}^{T-1} u_s, q\rangle$, and the last step uses Pinsker's inequality, $\mathrm{ENT}(q_1)\le\log g$, and the convention $q_0 := q_1$.

Now let us focus on the first summation on the last line. We notice that the $\theta$ update protocol follows the Optimistic Mirror Descent algorithm (Chiang et al., 2012; Rakhlin and Sridharan, 2013), which leads to the following upper bound that holds for arbitrary $\theta^\ast \in \Theta$:
\[
\sum_{t=1}^{T} v\big(\theta_t; \hat D_{q_t}\big) \le \sum_{t=1}^{T} v\big(\theta^\ast; \hat D_{q_t}\big) + \frac{\|\theta^\ast-\theta_1\|_2^2}{2\eta} + \frac{\eta}{2}\sum_{t=1}^{T} \big\|\nabla v\big(\theta_t; \hat D_{q_t}\big) - \nabla v\big(\theta_t; \hat D_{q_{t-1}}\big)\big\|_2^2. \tag{1}
\]
We have assumed that $\|\nabla v(\,\cdot\,;\,\cdot\,)\|_2$ is uniformly upper bounded by $L$, and therefore it holds that $\big\|\nabla v\big(\theta_t; \hat D_{q_t}\big) - \nabla v\big(\theta_t; \hat D_{q_{t-1}}\big)\big\|_2 \le L\,\|q_t - q_{t-1}\|_1$. We therefore have
\[
\sum_{t=1}^{T} v\big(\theta_t; \hat D_{q_t}\big) \le \sum_{t=1}^{T} v\big(\theta^\ast; \hat D_{q_t}\big) + \frac{W^2}{2\eta} + \frac{\eta L^2}{2}\sum_{t=1}^{T} \|q_t - q_{t-1}\|_1^2. \tag{2}
\]
Combining, we have
\[
\max_{i\in[g]} v\big(\bar\theta_T; D_i\big) \le \frac{1}{T}\sum_{t=1}^{T} v\big(\theta^\ast; \hat D_{q_t}\big) + \frac{W^2}{2T\eta} + \frac{\log g}{T\gamma} + \frac{1}{T}\Big(\frac{\eta L^2}{2} - \frac{1}{2\gamma}\Big)\sum_{t=1}^{T} \|q_t - q_{t-1}\|_1^2.
\]
If we set $\gamma = (\eta L^2)^{-1}$, the final summation vanishes. Furthermore, if we let $\bar q := \frac{1}{T}\sum_{t=1}^{T} q_t$, we see that
\[
\frac{1}{T}\sum_{t=1}^{T} v\big(\theta^\ast; \hat D_{q_t}\big) = v\big(\theta^\ast; \hat D_{\bar q}\big) \le \max_{i\in[g]} v(\theta^\ast; D_i).
\]
Putting it all together gives
\[
\max_{i\in[g]} v\big(\bar\theta_T; D_i\big) \le \max_{i\in[g]} v(\theta^\ast; D_i) + \frac{W^2}{2T\eta} + \frac{\eta L^2 \log g}{T},
\]
and plugging in the parameter $\eta = \frac{W}{L\sqrt{2\log g}}$ completes the proof.

B.1. Details About Implementation and Hyperparameters

We implemented Algorithm 2 based on Scikit-learn's (Pedregosa et al., 2011; https://scikit-learn.org) SGDClassifier class, and we implemented Algorithm 3 using PyTorch (https://pytorch.org/).
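As an implementation note on the proof of Theorem 2 above: the argmax defining $q_t$ has the closed form of a Gibbs distribution, $q_t(i) \propto \exp\big(\gamma \sum_{s\le t} u_s(i)\big)$, which is how such entropy-regularized updates are typically computed. The following small numerical check is our own illustration, not the paper's code.

```python
import numpy as np

def entropy(q):
    q = q[q > 0]
    return float(-np.sum(q * np.log(q)))

def objective(q, U, gamma):
    # (1/gamma) * ENT(q) + <U, q>, the quantity q_t maximizes
    return entropy(q) / gamma + float(U @ q)

def gibbs(U, gamma):
    # Closed-form maximizer over the simplex: q(i) proportional to exp(gamma * U(i))
    w = np.exp(gamma * (U - U.max()))   # shift the exponent for numerical stability
    return w / w.sum()

# Check the closed form against random points on the simplex.
rng = np.random.default_rng(0)
U = rng.normal(size=5)                  # cumulative per-group losses
gamma = 2.0
q_star = gibbs(U, gamma)
best_random = max(objective(rng.dirichlet(np.ones(5)), U, gamma)
                  for _ in range(1000))
```
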
When applying our strategy (Algorithm 1) to the MLP on the Diabetes dataset, we used Scikit-learn's MLPClassifier class. In that experiment, we used an MLP with two hidden layers of size 10 and 5, respectively. For the method by Diana et al. (2021), we used Scikit-learn's LogisticRegression class with the lbfgs solver as oracle.

Regularization parameter: In the experiments on the Drug Consumption dataset and the COMPAS dataset, we did not use regularization, neither for our algorithms nor for the method by Diana et al. (2021). In the experiments on the Diabetes dataset, for our strategy, we set the regularization parameter for l2-regularization to 10^-6 for the logistic regression classifier and to 10^-4 for the MLP classifier. For the method by Diana et al. (2021), we set the regularization parameter for l2-regularization to 10^-4; when setting it to 10^-6, the learnt classifier does not generalize well to the held-out test set. The code of Martinez et al. (2020) does not provide the option to easily set the regularization parameter via an input argument, and we used their hard-coded default value of 10^-7/(2n), where n is the number of training points, as the l2-regularization parameter.

Learning rate: In all experiments we used a constant learning rate for our methods. We described how to set the learning rate for Algorithm 3 in the main body of the paper. For Algorithm 2, we used a learning rate of 0.01 on the Drug Consumption dataset and 0.005 (logistic regression) or 0.001 (MLP) on the Diabetes dataset. The code of Diana et al. (2021) and of Martinez et al. (2020) with logistic regression as the baseline classifier does not rely on a learning rate.
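For convenience, the regularization and learning-rate settings above can be collected in one place. The dictionary layout and key names below are our own summary structure, not part of the released code.

```python
# Hyperparameters as stated in the text above; the layout and keys are
# our own summary, not taken from the released code.
HYPERPARAMS = {
    "drug_consumption": {"l2": 0.0, "lr_alg2": 0.01},
    "compas": {"l2": 0.0},
    "diabetes": {
        "l2_ours_logreg": 1e-6,   # our strategy, logistic regression
        "l2_ours_mlp": 1e-4,      # our strategy, MLP
        "l2_diana": 1e-4,         # Diana et al. (2021)
        "lr_alg2_logreg": 0.005,
        "lr_alg2_mlp": 0.001,
    },
}
```
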
We used the same parameters as for our method to train the baseline (naive) classifiers, and in order to perform a fair comparison with our method, we returned the average over the iterates instead of the last iterate (by setting the average parameter to True in SGDClassifier; this does not apply to the MLP on the Diabetes dataset). In all experiments except on the Diabetes dataset with the logistic regression classifier (cf. Section 5), we initialized our strategy with the baseline classifier. All other parameters in the code of Diana et al. (2021) and Martinez et al. (2020) are set to their default values.

B.2. Details About Datasets

In Section 5 we use the Drug Consumption dataset (Fehrman et al., 2015) and the Diabetes 130-US Hospitals dataset (Diabetes dataset; Strack et al., 2014), which are both publicly available in the UCI repository (Dua and Graff, 2019). We also use the COMPAS dataset (Angwin et al., 2016), which is publicly available at https://github.com/propublica/compas-analysis.

On the Drug Consumption dataset, we use the features Nscore, Escore, Oscore, Ascore, Cscore, Impulsive, and SS for predicting whether an individual consumed cannabis within the last decade or not. We define the groups by an individual's country, where we merge Australia, Canada, New Zealand, Republic of Ireland, and Other into one group Other.

On the Diabetes dataset, we use the features gender, age, admission_type_id, time_in_hospital, num_lab_procedures, num_procedures, num_medications, number_outpatient, number_emergency, number_inpatient, number_diagnoses, max_glu_serum, A1Cresult, change, and diabetesMed for predicting whether a patient was readmitted to hospital or not, and we define the groups by a patient's age.

On the COMPAS dataset, we use the features age, sex, priors_count, c_charge_degree, and juv_fel_count for predicting recidivism. Groups are defined by a person's race, where we merge Asian, Native American, and Other into one group Other.
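The group construction described in this appendix (mapping a raw attribute such as country or race to groups, merging small categories into one group Other) can be sketched as follows. This is our own illustration; the attribute values in the usage note are hypothetical.

```python
import numpy as np

def build_groups(values, merge_into_other, other_label="Other"):
    """Map a raw group attribute to group labels, merging the listed
    small categories into one 'Other' group, and return per-group
    index arrays for sampling."""
    labels = np.array([other_label if v in merge_into_other else v
                       for v in values])
    return {lab: np.flatnonzero(labels == lab) for lab in np.unique(labels)}
```

The returned index arrays are what a sampling strategy needs: to draw a datapoint from a given group, one samples an index uniformly from that group's array.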
We never provide the group information as a feature.