# Adaptive Sampling for Minimax Fair Classification

Shubhanshu Shekhar, Greg Fields, Mohammad Ghavamzadeh, Tara Javidi

Electrical and Computer Engineering Department at UCSD ({shshekha,grfields,tjavidi}@ucsd.edu); Google Research (ghavamza@google.com)

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

## Abstract

Machine learning models trained on uncurated datasets can often end up adversely affecting inputs belonging to underrepresented groups. To address this issue, we consider the problem of adaptively constructing training sets that allow us to learn classifiers that are fair in a minimax sense. We first propose an adaptive sampling algorithm based on the principle of optimism, and derive theoretical bounds on its performance. We also propose heuristic extensions of this algorithm suitable for application to large-scale, practical problems. Next, by deriving algorithm-independent lower bounds for a specific class of problems, we show that the performance achieved by our adaptive scheme cannot be improved in general. We then validate the benefits of adaptively constructing training sets via experiments on synthetic tasks with logistic regression classifiers, as well as on several real-world tasks using convolutional neural networks (CNNs).

## 1 Introduction

Machine learning (ML) models are increasingly being applied to automate decision-making in sensitive applications, such as loan approval and employment screening. However, recent work has demonstrated that discriminatory behaviour might get encoded in the model at various stages of the ML pipeline, such as data collection, labelling, feature selection, and training, and as a result, adversely impact members of some protected groups in rather subtle ways (Barocas and Selbst, 2016). In response, ML researchers have introduced a large number of fairness measures to incorporate the notion of fairness into the design of their algorithms. Important measures of fairness include demographic parity (Zemel et al., 2013), equal odds and opportunity (Hardt et al., 2016; Woodworth et al., 2017), individual fairness (Dwork et al., 2012), and minimax fairness (Feldman et al., 2015). Minimax fairness is particularly important in scenarios in which it is necessary to be as close as possible to equality without introducing unnecessary harm (Ustun et al., 2019). Such scenarios are common in areas such as healthcare and predicting domestic violence. A measure that has been explored to achieve this goal is predictive risk disparity (Feldman et al., 2015; Chen et al., 2018; Ustun et al., 2019). Instead of using the common approach of putting constraints on the norm of discrimination gaps, Martinez et al. (2020) have recently introduced the notion of minimax Pareto fairness. These are minimax classifiers that lie on the Pareto frontier of prediction risk, i.e., no decrease in the predictive risk of one group is possible without increasing the risk of another.

In this paper, we are primarily interested in the notion of minimax fairness in terms of predictive risk. However, instead of studying the training phase of the ML pipeline, our focus is on the data-collection stage, motivated by Jo and Gebru (2019) and Holstein et al. (2018). In particular, we study the following question: given a finite sampling budget, how should a learner construct a training set consisting of elements from different protected groups in appropriate proportions to ensure that a classifier trained on this dataset achieves minimax fairness?
Our work is motivated by the following scenario: suppose we have to learn an ML model for performing a task (e.g., loan approval) for inputs belonging to different groups defined by protected attributes, such as race or gender. Depending on the distribution of input-label pairs from different groups, the given task may be statistically harder for some groups. Our goal is to ensure that the eventual ML model has optimal predictive accuracy for the worst-off group. We show that, under certain technical conditions, this results in a model with comparable predictive accuracy over all groups. One approach would be to train a separate classifier for each group. However, this is often not possible, as the group-membership information may not be available at deployment time, or it may be forbidden by law to explicitly use the protected characteristics as an input in the prediction process (Lipton et al., 2017). To ensure a higher proportion of samples from the harder groups without knowing the identity of the hard or easy groups a priori, we consider this problem in an active setting, where a learner has to incrementally construct a training set by drawing samples one at a time (or a batch at a time) from different groups. Towards this goal, we propose and analyze an adaptive sampling scheme based on the principle of optimism, used in the bandit literature (e.g., Auer et al., 2002), that detects the harder groups and populates the training set with more samples from them in an adaptive manner.

We also wish to note that bias in ML has multi-faceted origins, and that our work here addresses dataset construction; it cannot account for bias introduced by model selection, the underlying data distribution, or other sources, as discussed in Hooker (2021) and Suresh and Guttag (2019). We endeavor to ensure minimax fairness, but in some contexts another notion of fairness, such as those mentioned above, may be more appropriate or equitable. In general, application of our algorithm is not a guarantee that the resulting model is wholly without bias.

Our main contributions are: 1) We propose an optimistic adaptive sampling strategy, Aopt, for training-set construction in Section 3. This strategy is suited to smaller problems and admits theoretical guarantees. We then introduce a heuristic variant of Aopt in Section 4 that is more suitable for practical problems involving CNNs. 2) We obtain upper bounds on the convergence rate of Aopt, and show its minimax near-optimality by constructing a matching lower bound in Section 3.1. 3) Finally, we demonstrate the benefits of our algorithm with empirical results on several synthetic and real-world datasets in Section 5.

**Related Work.** The closest work to ours is by Abernethy et al. (2020), who propose an ϵ-greedy adaptive sampling strategy. They present a theoretical analysis under somewhat restrictive assumptions and also empirically demonstrate the benefits of their strategy over some baselines. We describe their results in more detail in Appendix C.1. Employing the tools we develop to analyze our own algorithm, we perform a thorough analysis of the excess risk of their strategy under a much less restrictive set of assumptions, and show some necessary conditions on the value of their exploration parameter, ϵ.
We find comparable empirical results for both algorithms given sufficient tuning of their respective hyperparameters; we report some of these results in Section 5 and compare the algorithms in Appendix C. In another related work, Anahideh et al. (2020) propose a fair adaptive sampling strategy that selects points based on a linear combination of model accuracy and fairness measures, and empirically study its performance. Neither of these works, however, obtains convergence rates for the excess risk of its method, and both offer implementations only for small-scale datasets.

The above works study the problem of fair classification in an active setting and target the data-collection stage of the ML pipeline. There are also works that take a passive approach to this problem and focus on the training phase of the pipeline. Agarwal et al. (2018) design a scheme for learning fair binary classifiers with fairness metrics that can be written as linear constraints involving certain conditional moments. This class of fairness metrics, however, does not contain the minimax fairness measure. Diana et al. (2021) propose a method for constructing (randomized) classifiers for minimax fairness w.r.t. the empirical loss calculated on the given training set. Similarly, Martinez et al. (2020) derive an algorithm for learning a Pareto-optimal minimax fair classifier, under the assumption that the learner has access to the true expected loss functions. Thus, the theoretical guarantees in Diana et al. (2021) and Martinez et al. (2020) hold under the assumption of large training sets. Our work, in contrast, constructs the training set incrementally (active setting) from scratch, while carefully taking into account the effects of the finite sample size. Finally, we note that the data-collection strategies proposed in our paper can, in principle, be combined with the training methods presented in Martinez et al. (2020) and Diana et al. (2021) to further guarantee the (minimax) fairness of the resulting classifier. We leave the investigation of this approach for future work.

Besides data collection and training, there have been studies in the fair-ML literature on other aspects of the ML pipeline, such as pre-processing (Celis et al., 2020), learning feature representations (Zemel et al., 2013), post-processing (Hardt et al., 2016), and model documentation (Mitchell et al., 2018). We refer readers to a recent survey by Caton and Haas (2020) for a detailed description of these results.

## 2 Problem Formulation

Consider a classification problem with feature (input) space $\mathcal{X}$, label set $\mathcal{Y}$, and a protected attribute set $\mathcal{Z} = \{z_1, \ldots, z_m\}$. For any $z \in \mathcal{Z}$, we use $P_z(x, y)$ as a shorthand for $P_{XY|Z=z}(X = x, Y = y \mid Z = z)$ to denote the feature-label joint distribution given that the protected feature value is $Z = z$. We also use $\mathbb{E}_z[\cdot]$ as a shorthand for the expectation w.r.t. $P_z$. A (non-randomized) classifier is a mapping $f : \mathcal{X} \to \mathcal{Y}$, which assigns a label $y \in \mathcal{Y}$ to every feature $x \in \mathcal{X}$. For a family of classifiers $\mathcal{F} \subseteq \mathcal{Y}^{\mathcal{X}}$, a loss function $\ell : \mathcal{F} \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, and a mixture distribution over the protected attributes $\pi \in \Delta_m$, we define the $\pi$-optimal classifier $f_\pi$ as

$$f_\pi \in \operatorname*{arg\,min}_{f \in \mathcal{F}} \; \mathbb{E}_\pi[\ell(f, X, Y)] := \sum_{z \in \mathcal{Z}} \pi(z)\, \mathbb{E}_z[\ell(f, X, Y)]. \tag{1}$$

When $\pi$ lies on a corner of the simplex $\Delta_m$, i.e., $\pi(z) = 1$ for some $z \in \mathcal{Z}$, we use the notation $f_z$ instead of $f_\pi$. For $\pi$ in the interior of $\Delta_m$, it is clear that the risk of $f_\pi$ on group $z$, i.e., $\mathbb{E}_z[\ell(f_\pi, X, Y)]$, can be no smaller than the best possible classification loss for $P_{XY|Z=z}$.
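To make Equation (1) concrete, the following is a minimal sketch, ours rather than the paper's code, of computing a $\pi$-optimal classifier by weighted ERM: giving each sample from group $z$ the weight $\pi(z)/n_z$ makes the total weight of group $z$ equal to $\pi(z)$, so the weighted empirical loss estimates $\mathbb{E}_\pi[\ell(f, X, Y)]$.

```python
# A minimal sketch (not the paper's implementation) of the pi-optimal
# classifier f_pi of Equation (1) via weighted empirical risk minimization.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_pi_optimal(X, y, z, pi):
    """X: (n, d) features; y: (n,) labels; z: (n,) group labels;
    pi: dict mapping each group to its mixture weight (weights sum to 1)."""
    weights = np.empty(len(y))
    for group, w in pi.items():
        mask = (z == group)
        # Group z's empirical risk is a mean over its n_z samples, so each
        # of its samples receives weight pi(z) / n_z.
        weights[mask] = w / mask.sum()
    clf = LogisticRegression()
    clf.fit(X, y, sample_weight=weights)
    return clf
```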
In this paper, our goal is to develop an active sampling scheme to find the fair mixture distribution $\pi^\star$ in a minimax sense (minimizing the maximum risk among the groups), i.e.,

$$\pi^\star \in \operatorname*{arg\,min}_{\pi \in \Delta_m} \; \max_{z \in \mathcal{Z}} \; L(z, f_\pi), \qquad \text{where } L(z, f_\pi) := \mathbb{E}_z[\ell(f_\pi, X, Y)]. \tag{2}$$

The active sampling problem that we study in this paper can be formally defined as follows:

**Problem 1.** Suppose $\mathcal{F}$ denotes a family of classifiers, $\ell$ a loss function, $n$ a sampling budget, and $O : \mathcal{Z} \to \mathcal{X} \times \mathcal{Y}$ an oracle that maps any attribute $z \in \mathcal{Z}$ to a feature-label pair $(X, Y) \sim P_{XY|Z=z}$. The learner designs an adaptive sampling scheme $\mathcal{A}$ that comprises a sequence of mappings $(\mathcal{A}_t)_{t=1}^{n}$, with $\mathcal{A}_t : (\mathcal{X} \times \mathcal{Y})^{t-1} \to \mathcal{Z}$, to adaptively query $O$ and construct a dataset of size $n$. Let $\pi_n \in \Delta_m$ denote the resulting empirical mixture distribution over $\mathcal{Z}$, where $\pi_n(z) = N_{z,n}/n$ and $N_{z,n}$ is the number of times that $\mathcal{A}$ samples from $P_z$ in $n$ rounds. Then, the quality of the resulting dataset is measured by the excess risk, or sub-optimality, of the $\pi_n$-optimal classifier $f_{\pi_n}$, i.e.,

$$R_n(\mathcal{A}) := \max_{z \in \mathcal{Z}} L(z, f_{\pi_n}) \;-\; \max_{z \in \mathcal{Z}} L(z, f_{\pi^\star}), \tag{3}$$

where $\pi^\star$ is the fair mixture distribution defined by (2). Hence, the goal of the learner is to design a strategy $\mathcal{A}$ with small excess risk $R_n(\mathcal{A})$. Informally, the algorithm should adaptively identify the harder groups $z$ and dedicate a larger portion of the overall budget $n$ to sampling from their distributions (see Section 2.2 for an illustrative example).

### 2.1 Properties of the Fair Mixture

As discussed above, our goal is to derive an active sampling scheme that allocates the overall budget $n$ over the attributes $z \in \mathcal{Z}$ in a similar manner as the unknown fair mixture $\pi^\star$, defined by (2). Thus, it is important to better understand the properties of $\pi^\star$ and the $\pi$-optimal classifiers $f_\pi$, defined by (1). We state three properties of $f_\pi$ and $\pi^\star$ in this section. We refer the readers to Martinez et al. (2020) for the definitions of Pareto front and convexity discussed in this section.

**Property 1.** As discussed in Martinez et al. (2020, Section 4), any $f_\pi$ that solves (1) for a $\pi$ with $\pi(z) > 0$ for all $z \in \mathcal{Z}$ belongs to the Pareto front $\mathcal{P}_{\mathcal{Z},\mathcal{F}}$ of the risk functions $\{L(z, f)\}_{z \in \mathcal{Z}, f \in \mathcal{F}}$.

**Property 2.** We prove in Proposition 1 (see Appendix A for the proof) that under the following two assumptions on the conditional distributions $\{P_z\}_{z \in \mathcal{Z}}$, function class $\mathcal{F}$, and loss function $\ell$, there exists a unique fair mixture $\pi^\star$, whose classifier $f_{\pi^\star}$ has equal risk over all attributes $z \in \mathcal{Z}$.

**Assumption 1.** The mapping $\pi \mapsto L(z, f_\pi)$ is continuous for all attributes $z \in \mathcal{Z}$. Furthermore, if $\pi, \nu \in \Delta_m$ are such that $\pi(z) > \nu(z)$ for an attribute $z \in \mathcal{Z}$, then $L(z, f_\pi) < L(z, f_\nu)$.

The above assumption indicates that increasing the weight of an attribute $z$ in the mixture distribution must lead to an improvement in the performance of the resulting classifier on the distribution $P_z$.

**Assumption 2.** For any two distinct attributes $z, z' \in \mathcal{Z}$, we must have $L(z, f_z) < L(z', f_z)$, for any $f_z \in \arg\min_{f \in \mathcal{F}} L(z, f)$.

This assumption requires any optimal classifier corresponding to distribution $P_z$ to have higher risk w.r.t. the distribution $P_{z'}$ of any other attribute $z' \in \mathcal{Z}$. Note that Assumption 2 may not always hold, for example, if one attribute is significantly easier to classify than another.

**Proposition 1.** Let Assumptions 1 and 2 hold for the conditional distributions $\{P_z\}_{z \in \mathcal{Z}}$, the loss function $\ell$, and the function class $\mathcal{F}$. Then, there exists a unique $\pi^\star \in \Delta_m$ that achieves the optimal value for problem (2) and satisfies $L(z_1, f_{\pi^\star}) = L(z_2, f_{\pi^\star}) = \cdots = L(z_m, f_{\pi^\star}) := M^\star$.

In other words, Assumptions 1 and 2 define a regime in which there exists a fair mixture $\pi^\star$ whose classifier $f_{\pi^\star}$ achieves complete parity in classification performance across all attributes. Moreover, Property 1 indicates that $f_{\pi^\star}$ also belongs to the Pareto front $\mathcal{P}_{\mathcal{Z},\mathcal{F}}$. Therefore, under these two assumptions, the equal-risk classifier not only belongs to the Pareto front, but is also the minimax Pareto fair classifier (Lemma 3.1 in Martinez et al. 2020). Note that, in general, without Assumption 1, the classifier that attains equality of risk might have worse performance on all attributes than the minimax Pareto fair classifier (Lemma 3.2 in Martinez et al. 2020).
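Proposition 1 suggests a simple, if data-hungry, way to approximate $\pi^\star$ for two groups: sweep the mixture weight over a grid and keep the value minimizing the maximum per-group risk, which under Assumptions 1 and 2 is also (approximately) the equal-risk mixture. The sketch below does this, reusing the hypothetical `fit_pi_optimal` helper from the previous sketch; it assumes large held-out samples from both groups, which is exactly what the active setting of Problem 1 does not provide.

```python
# A brute-force sketch of Equation (2) for two groups: sweep pi(u) over a grid
# and keep the mixture minimizing the maximum per-group empirical risk.
# Assumes large held-out samples (X_ho, y_ho, z_ho); for intuition only.
import numpy as np

def estimate_pi_star(X, y, z, X_ho, y_ho, z_ho, groups=("u", "v"), grid=101):
    best_w, best_risk = None, np.inf
    for w in np.linspace(0.01, 0.99, grid):
        pi = {groups[0]: w, groups[1]: 1.0 - w}
        clf = fit_pi_optimal(X, y, z, pi)   # weighted ERM, sketched earlier
        risks = [np.mean(clf.predict(X_ho[z_ho == g]) != y_ho[z_ho == g])
                 for g in groups]
        if max(risks) < best_risk:          # minimax criterion of Equation (2)
            best_risk, best_w = max(risks), w
    return best_w, best_risk
```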
**Property 3.** We may show that when both the function class $\mathcal{F}$ and the risk functions $\{L(z, \cdot)\}_{z \in \mathcal{Z}}$ are convex, $f_{\pi^\star}$ is a minimax Pareto fair classifier. As originally derived by Geoffrion (1968) and then restated by Martinez et al. (2020, Theorem 4.1), under these convexity assumptions the Pareto front $\mathcal{P}_{\mathcal{Z},\mathcal{F}}$ is convex and any classifier on $\mathcal{P}_{\mathcal{Z},\mathcal{F}}$ is a solution to (1). This, together with Property 1, indicates that the classifier $f_{\pi^\star}$ corresponding to the fair (minimax optimal) mixture $\pi^\star$ is on the Pareto front and, additionally, is a minimax Pareto fair classifier.

### 2.2 Synthetic Models

It is illustrative to consider a class of synthetic models which we use as a running example.

**Definition 1 (Synthetic Model 1).** Set $\mathcal{X} = \mathbb{R}^2$, $\mathcal{Y} = \{0, 1\}$, and $\mathcal{Z} = \{u, v\}$. Each instance of our synthetic model is defined by the set of distributions $P_{XY|Z}$ that satisfy the following: $P_{Y|Z}(Y = 1 \mid Z = z) = 0.5$ for both $z \in \mathcal{Z}$, and $P_{X|Y=y,Z=z} = N(\mu_{yz}, I_2)$ for all $(y, z) \in \mathcal{Y} \times \mathcal{Z}$. Thus, each instance of this model can be represented by a mean vector $\mu = (\mu_{yz} : y \in \mathcal{Y}, z \in \mathcal{Z})$.

The idea behind this class of models is that the hardness of the classification problem for a value of $z \in \mathcal{Z}$ depends on the distance between $\mu_{0z}$ and $\mu_{1z}$: the more separated these two mean vectors are, the easier it is to distinguish the labels. Thus, depending on this distance, different protected attributes may require different fractions of samples in the training set to achieve comparable test accuracy. To illustrate this, we consider two instances of Synthetic Model 1:

I. $\mu_{0u} = (-2, -2)$, $\mu_{1u} = (2, 2)$, $\mu_{0v} = (-1, -1)$, and $\mu_{1v} = (1, 1)$; and

II. $\mu_{0u} = (-1.5, -1.5)$, $\mu_{1u} = (1.5, 1.5)$, $\mu_{0v} = (-2, -2)$, and $\mu_{1v} = (2, 2)$.

For both model instances, we trained a logistic regression classifier over training sets with 1001 equally spaced values of $\pi(u)$ in $[0, 1]$. Figure 1 shows the test accuracy of the learned classifier for both attributes (blue and red curves), as well as their minimum (black curve), for different values of $\pi(u)$. Since the pair $(\mu_{0u}, \mu_{1u})$ is better separated than $(\mu_{0v}, \mu_{1v})$ in the first instance, it is easier to classify inputs with $Z = u$ than those with $Z = v$, and thus fewer training samples from $P_{z=u}$ than from $P_{z=v}$ are required to achieve the same accuracy. This is reflected in Figure 1 (left), which shows that the best (min-max) performance is achieved at $\pi(u) \approx 0.23$. The opposite trend is observed in the second instance (right plot in Figure 1), where the mean vectors corresponding to the attribute $Z = v$ are better separated, and hence fewer training samples from $P_{z=v}$ are required to achieve the same accuracy.

Figure 1: The variation of the minimum prediction accuracy over the two attributes $\mathcal{Z} = \{u, v\}$ with $\pi(u)$, for Instance I (left) and Instance II (right) of Synthetic Model 1.
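For concreteness, here is a minimal sketch, again ours rather than the paper's experiment code, of drawing samples from an instance of Synthetic Model 1; the means below encode Instance I.

```python
# A minimal sketch of sampling from Synthetic Model 1 (Definition 1): labels
# are Bernoulli(0.5) and features are Gaussian with identity covariance around
# the class- and attribute-dependent mean. The means encode Instance I.
import numpy as np

MU = {("u", 0): (-2.0, -2.0), ("u", 1): (2.0, 2.0),
      ("v", 0): (-1.0, -1.0), ("v", 1): (1.0, 1.0)}

def sample_group(z, n, rng):
    """Draw n i.i.d. (x, y) pairs from P_{XY|Z=z}."""
    y = rng.integers(0, 2, size=n)                   # P(Y = 1 | Z = z) = 0.5
    means = np.array([MU[(z, int(label))] for label in y])
    x = means + rng.standard_normal((n, 2))          # X | Y=y, Z=z ~ N(mu_yz, I_2)
    return x, y

rng = np.random.default_rng(0)
x_u, y_u = sample_group("u", 500, rng)
```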
## 3 Optimistic Adaptive Sampling Strategy

In this section, we present our optimistic adaptive sampling algorithm Aopt for Problem 1. Algorithm 1 contains a simple pseudo-code for Aopt; see Algorithm 2 for a more detailed, formal pseudo-code. The algorithm proceeds in the following phases:

**Phase 1.** In the initial phase $t \le m = |\mathcal{Z}|$, we draw two independent samples from $P_z = P_{XY|Z=z}$ for each attribute $z \in \mathcal{Z}$, and add one to the training dataset $D_t$ and the other to $D_z$. We use $D_t$ to learn a common (over all $z \in \mathcal{Z}$) classifier $\hat f_t$ via empirical risk minimization (ERM). The independent datasets $\{D_{z_i}\}_{i=1}^m$ are used for estimating the performance of $\hat f_t$ for each $z \in \mathcal{Z}$.

```
Algorithm 1: Optimistic Adaptive Sampling for Minimax Fair Classification.
Input: n (budget), F (function class), ℓ (loss function), ξ (forced-exploration term)
 1  Initialize: D_1 = ∅; {D_{z_i}}_{i=1}^m = ∅; {N_{z_i,1}}_{i=1}^m = 0
 2  for t = 1, ..., n/2 do
 3      if t ≤ m then
 4          z_t ← t-th element of Z            // initial phase: sample each attribute once
 5      else if min_{z∈Z} N_{z,t} < t^ξ then
 6          z_t ← argmin_{z∈Z} N_{z,t}         // forced exploration
 7      else
 8          z_t ← argmax_{z∈Z} U_t(z, f̂_t)    // optimistic selection; see Equation (4)
 9      end
10      Draw two independent samples from P_{z_t}
11      Add the first one to D_t and the second one to D_{z_t}
12      Update N_{z_t,t}; π_t; e_{z_t}(N_{z_t,t})
13      Learn classifier f̂_t by ERM using D_t
Output: π_{n/2} and f̂_{n/2}
```

**Phase 2.** In each round $t > m$, we choose an attribute $z_t$ according to the following selection rule: if there exists an attribute $z$ whose number of samples in $D_t$, denoted by $N_{z,t}$, is fewer than $t^{\xi}$ for some input $\xi \in (0, 1)$ (Line 5), we set it as $z_t$ (Line 6); otherwise, we set $z_t$ to the attribute with the largest upper confidence bound (UCB), described below, on the risk of the classifier $f_{\pi_t}$ (Line 8). Here, $\pi_t$ denotes the empirical mixture distribution (over the attributes $z$) of $D_t$.

**Phase 3.** When the attribute $z_t$ is selected in Phase 2, the algorithm draws a pair of independent samples from $P_{z_t}$, adds one to $D_t$ and the other to $D_{z_t}$, and updates $N_{z_t,t}$, $\pi_t$, and the uniform deviation bound $e_{z_t}(N_{z_t,t})$ (described below) (Lines 10 to 12). The updated dataset $D_t$ is then used to learn a new candidate classifier $\hat f_t$ (Line 13). Phases 2 and 3 are repeated until the sampling budget is exhausted, i.e., $t = n/2$ (note that we sample twice at each round).
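In Python, the selection logic above amounts to a few lines. In the sketch below, `draw_pair` (the sampling oracle of Problem 1), `train_erm` (the ERM step over $\mathcal{F}$), and `ucb` (Equation (4), defined below) are hypothetical placeholders rather than the paper's implementation.

```python
# A minimal sketch of Algorithm 1's selection loop. `draw_pair`, `train_erm`,
# and `ucb` are hypothetical placeholders for the sampling oracle, the ERM
# step, and the upper confidence bound of Equation (4).
def a_opt(groups, n, xi, draw_pair, train_erm, ucb):
    D = []                                      # training set D_t
    D_val = {z: [] for z in groups}             # per-attribute validation sets D_z
    N = {z: 0 for z in groups}                  # per-attribute counts N_{z,t}
    clf = None
    for t in range(1, n // 2 + 1):              # two samples are drawn per round
        if t <= len(groups):                    # Phase 1: visit each attribute once
            z_t = groups[t - 1]
        elif min(N.values()) < t ** xi:         # forced exploration (Lines 5-6)
            z_t = min(groups, key=lambda z: N[z])
        else:                                   # optimistic selection (Line 8)
            z_t = max(groups, key=lambda z: ucb(z, clf, D_val[z], N))
        (x1, y1), (x2, y2) = draw_pair(z_t)     # two independent samples from P_{z_t}
        D.append((x1, y1))
        D_val[z_t].append((x2, y2))
        N[z_t] += 1
        clf = train_erm(D)                      # re-learn the candidate classifier
    pi_n = {z: N[z] / (n // 2) for z in groups}
    return pi_n, clf
```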
**Calculating the UCB.** To construct the UCB, we introduce two additional assumptions. For the next assumption, we use the notation $L(z, f) = \mathbb{E}_z[\ell(f, X, Y)]$ and $L(\pi, f) = \sum_{z \in \mathcal{Z}} \pi(z) L(z, f)$, for any $f \in \mathcal{F}$, $z \in \mathcal{Z}$, and $\pi \in \Delta_m$.

**Assumption 3.** There exist positive constants $\epsilon_0, C > 0$ such that for any function $f \in \mathcal{F}$ and any $\pi \in \Delta_m$ with $\pi(z) > 0$ for all $z \in \mathcal{Z}$, if we have $L(\pi, f) \le L(\pi, f_\pi) + \epsilon$ for some $0 < \epsilon \le \epsilon_0$, then $\pi(z)\,|L(z, f) - L(z, f_\pi)| \le 2C\epsilon$ for all $z \in \mathcal{Z}$.

Assumption 3 is stated in abstract terms, so Example 1 describes a problem for which it holds.

**Example 1.** Consider the binary classification problem with $\mathcal{X} = [0, 1]^k$, $\mathcal{Y} = \{0, 1\}$, and $\mathcal{Z} = \{z_1, \ldots, z_m\}$. Let $\mathcal{F}$ be the given family of classifiers $f : \mathcal{X} \to \mathcal{Y}$, and introduce $d(f_1, f_2) = \mu(\{f_1 \ne f_2\})$, where $\mu$ is the Lebesgue measure on $\mathcal{X}$. Assume that the marginal distributions $P_{X|Z=z}$ for all $z \in \mathcal{Z}$ admit densities $\nu_z$ with respect to $\mu$ satisfying $C_1 \le \nu_z(x) \le C_2$ for all $x \in \mathcal{X}$. Note that this is a version of the strong-density assumption used in prior works in classification, such as (Audibert and Tsybakov, 2007, Definition 2.2). Then, for the $\ell_{01}$ loss, we observe that the expected loss between any two $f_1, f_2$ satisfies $C_1 d(f_1, f_2) \le |L(z, f_1) - L(z, f_2)| \le C_2 d(f_1, f_2)$. As a result, for any $\pi \in \Delta_m$ and $f \in \mathcal{F}$, we have $C_1 d(f, f_\pi) \le L(\pi, f) - L(\pi, f_\pi) \le C_2 d(f, f_\pi)$. Therefore, if it is known that $L(\pi, f) - L(\pi, f_\pi) \le \epsilon$, then we can conclude that we must have $d(f, f_\pi) \le \epsilon/C_1$. Hence, by the strong-density assumption on $P_{X|Z=z}$, we have that $\pi(z)\,|L(z, f) - L(z, f_\pi)| \le C_2\, d(f, f_\pi) \le (C_2/C_1)\,\epsilon := 2C\epsilon$.

Our final assumption is a standard uniform convergence requirement, which is satisfied by many commonly used families of classifiers (several examples are detailed in Remark 1).

**Assumption 4.** Let $\delta \in [0, 1]$ be a confidence parameter. For each $z \in \mathcal{Z}$, there exists a monotonically non-increasing sequence $\{e_z(N, \mathcal{F}, \delta) : N \ge 1\}$, with $\lim_{N \to \infty} e_z(N, \mathcal{F}, \delta) = 0$, such that the following event holds with probability at least $1 - \delta/2$:

$$\Big\{ \big|\widehat{L}_N(z, f) - L(z, f)\big| \le e_z(N, \mathcal{F}, \delta), \;\; \forall\, f \in \mathcal{F},\; z \in \mathcal{Z},\; N \ge 1 \Big\},$$

where $\widehat{L}_N(z, f) := \frac{1}{N} \sum_{i=1}^{N} \ell\big(f, X_z^{(i)}, Y_z^{(i)}\big)$, with $\big(X_z^{(i)}, Y_z^{(i)}\big)_{i=1}^{N}$ being an i.i.d. sequence of input-label pairs from $P_z$. In what follows, we drop the $\mathcal{F}$ and $\delta$ dependence and refer to $e_z(N, \mathcal{F}, \delta)$ as $e_z(N)$, for all $z \in \mathcal{Z}$ and $N \ge 1$.

Given an appropriate sequence $e_z(N)$, we construct the UCB for the risk function $L(z, f_{\pi_t})$, defined by Equation (2), with the following expression:

$$U_t(z, \hat f_t) := \frac{1}{|D_z|} \sum_{(x,y) \in D_z} \ell(\hat f_t, x, y) \;+\; e_z(N_{z,t}) \;+\; \frac{2C}{\pi_t(z)} \sum_{z' \in \mathcal{Z}} \pi_t(z')\, e_{z'}(N_{z',t}), \tag{4}$$

where $C$ is the constant described in Assumption 3. The first term on the RHS is the empirical loss of the learned classifier at round $t$ on attribute $Z = z$, so by Assumption 4 the first two terms of the RHS of Equation (4) give a high-probability upper bound on $L(z, \hat f_t)$, the expected loss of $\hat f_t$ conditioned on $Z = z$. The third term then provides an upper bound on the difference $|L(z, f_{\pi_t}) - L(z, \hat f_t)|$, so altogether this provides the desired UCB on the expected loss of the $\pi_t$-optimal classifier on attribute $Z = z$. The form of the third term is due to Assumption 3 and is discussed in detail in Appendix B.1.

### 3.1 Theoretical Analysis

In this section, we derive an upper bound on the excess risk $R_n(\mathcal{A}_{\mathrm{opt}})$ of Algorithm 2. We also show that the performance achieved by this algorithm cannot in general be improved, by obtaining an algorithm-independent lower bound on the excess risk for a particular class of problems.

**Upper Bound.** We begin by obtaining an upper bound on the convergence rate of the excess risk of Aopt; the proof of this result is in Appendix B.1.

**Theorem 1.** Let Assumptions 1-4 hold and define $\pi_{\min} := \min_{z \in \mathcal{Z}} \pi^\star(z)$. Fix any $A$ such that $\pi_{\min}/2 \le A < \pi_{\min}$. Suppose the query budget $n$ is sufficiently large, as defined in Equation 19 in Appendix B.1. Then, with probability $1 - \delta$, the excess risk of Algorithm 2 can be upper-bounded as

$$R_n(\mathcal{A}_{\mathrm{opt}}) = \max_{z \in \mathcal{Z}} L(z, f_{\pi_n}) - M^\star = O\left( \frac{|\mathcal{Z}|\, C}{\pi_{\min}} \, \max_{z \in \mathcal{Z}} e_z(N_A) \right), \tag{5}$$

where $M^\star = L(z, f_{\pi^\star})$ for all $z \in \mathcal{Z}$ (see Proposition 1) and $N_A = n\pi_{\min}^2/(2\pi_{\min} - A)$.

**Remark 1.** The uniform deviation bounds $e_z(N)$ defined in Assumption 4, and hence the bound on the excess risk of Algorithm 2 in Equation (5), can be instantiated for several commonly used classifiers (function classes $\mathcal{F}$) to obtain an explicit convergence rate in $n$. If $\mathcal{F}$ has a finite VC dimension $d_{VC}$, then a suitable deviation bound is

$$e_z(N) = 2\sqrt{\big(2 d_{VC} \log(2eN/d_{VC}) + 2 \log(2N^2\pi^2|\mathcal{Z}|/(3\delta))\big)/N},$$

where here $\pi$ is the numerical constant. If $\mathcal{F}$ has Rademacher complexity $\mathcal{R}_N$, we can choose $e_z(N) = 2\mathcal{R}_N + \sqrt{\log(N^2\pi^2|\mathcal{Z}|/(3\delta))/N}$. Furthermore, with these uniform deviation bounds, and if $\mathcal{R}_n$ is $O(n^{-\alpha})$ for some $\alpha > 0$, we obtain excess risk bounds of

$$R_n(\mathcal{A}_{\mathrm{opt}}) = O\left( \frac{|\mathcal{Z}|\, C}{\pi_{\min}^{3/2}} \sqrt{\frac{d_{VC}\log n}{n}} \right), \qquad R_n(\mathcal{A}_{\mathrm{opt}}) = O\left( \frac{|\mathcal{Z}|\, C}{\pi_{\min}^{1+\alpha}\, n^{\alpha}} \right)$$

for the VC-dimension and Rademacher cases, respectively. These conditions on $\mathcal{F}$ are satisfied by several commonly used classifiers, such as linear classifiers, SVMs, and multi-layer perceptrons.
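For intuition about the rates in Remark 1, the following small sketch evaluates the VC-type deviation bound $e_z(N)$ numerically; the instantiation (linear classifiers in $\mathbb{R}^2$, so $d_{VC} = 3$) is our illustrative choice, not a setting prescribed by the paper.

```python
# A small numeric sketch of the VC-type uniform deviation bound e_z(N) from
# Remark 1; useful for seeing how fast the bound in Equation (5) shrinks.
import numpy as np

def e_z_vc(N, d_vc, n_groups, delta):
    """Deviation bound for a class of VC dimension d_vc (np.pi is the constant)."""
    log_term = 2 * d_vc * np.log(2 * np.e * N / d_vc) \
             + 2 * np.log(2 * N**2 * np.pi**2 * n_groups / (3 * delta))
    return 2 * np.sqrt(log_term / N)

# Example: linear classifiers in R^2 (d_vc = 3), two groups, delta = 0.05.
for N in (100, 1000, 10000):
    print(N, round(e_z_vc(N, d_vc=3, n_groups=2, delta=0.05), 3))
```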
**Remark 2 (Analysis of ϵ-greedy).** Abernethy et al. (2020) only analyze the greedy version (i.e., ϵ = 0) of their ϵ-greedy sampling strategy with $|\mathcal{Z}| = 2$, and show that at time $n$, either the excess risk is of $O\big(\max_{z \in \mathcal{Z}} \sqrt{2\, d_{VC}(\ell \circ \mathcal{F}) \log(2/\delta)/N_{z,n}}\big)$, or the algorithm draws a sample from the attribute with the largest loss. However, due to the greedy nature of the algorithm they analyzed, there are no guarantees that $N_{z,n} = \Omega(n)$, and thus, in the worst case, the above excess risk bound is $O(1)$. We show in Appendix C how the techniques we developed for the analysis of Algorithm 2 can be suitably employed to study their ϵ-greedy strategy. In particular, we obtain sufficient conditions on ϵ under which the excess risk of ϵ-greedy converges to zero, and the rate at which this convergence occurs.

**Remark 3 (Comparison between Aopt and ϵ-greedy).** Aopt holds two primary advantages over ϵ-greedy; these are discussed in detail in Appendix C. Briefly, Aopt has superior risk guarantees, by a constant factor. Moreover, while both algorithms depend on a tunable parameter ($C$ for Aopt and ϵ for ϵ-greedy), Aopt is far more robust to the choice of this parameter. Figure 9 in Appendix C illustrates a case where a poor choice of ϵ can prevent ϵ-greedy from converging to the optimal mixture distribution for any number of samples. This robustness makes Aopt easier to employ in practical settings.

**Lower Bound.** Let $\mathcal{Q} = (\mu, \mathcal{F}, \ell_{01})$ denote the class of problems where $\mu \in \mathcal{M}$ is an instance of Synthetic Model 1 described in Section 2.2, $\mathcal{F}$ is the class of linear classifiers in two dimensions, and $\ell_{01}$ is the 0-1 loss. For this function class $\mathcal{F}$, $e_z(N) = O(\sqrt{\log(N)/N})$ for $z \in \mathcal{Z} = \{u, v\}$, which implies that the excess risk achieved by both the Aopt and ϵ-greedy strategies is of $O(\sqrt{\log(n)/n})$. We prove in Appendix D that this convergence rate (in terms of $n$) cannot in general be improved, by showing that $\max_{Q \in \mathcal{Q}} \mathbb{E}_Q[R_n(\mathcal{A})] = \Omega(1/\sqrt{n})$.

## 4 Heuristic Extensions

We show in Appendix B.1.1 that if the uniform deviation bounds $e_z(N)$ can be chosen to decrease to zero sufficiently quickly as $N$ increases, then we can omit the $C$-dependent third term in Equation (4) and still attain the same regret bounds given in Theorem 1. The resulting two-term UCB is then simply the high-probability upper bound on $L(z, \hat f_t)$. We use this UCB as the basis for several practical modifications to our optimistic adaptive sampling strategy.

First, we note that in UCB-based algorithms, the confidence bounds necessary to attain theoretical results often do not produce optimal empirical results; see, for example, Section 6 in Srinivas et al. (2009). So, following standard practice, we introduce a hyperparameter, $c_0$, in Equation (6) below, which we can tune to optimize the exploration/exploitation trade-off.

Practical applications of our strategy run into two further challenges. As described in Section 3, Aopt requires re-training the classifier in every iteration. While this can be implemented in small problems, it becomes infeasible for problems involving large models, such as CNNs. Also, as noted in Section 2.1, for data where one attribute is significantly easier than the others, Assumption 2 may not hold; in this case, it may not be beneficial to continue to sample from the attribute with the largest loss. To address these issues, we present a heuristic variant of Algorithm 2. For the first challenge, we make the following modifications to Algorithm 2:
1) we expand Phase 1 to $n_0$ rounds; 2) at each subsequent round, we draw two batches of size $b_0$ from the chosen attribute; 3) instead of re-training from scratch each round, we update the previous model $\hat f_{t-1}$; and 4) instead of training to convergence, we perform one gradient step over the entire accumulated training set $D_t$.

To address the second challenge, we modify the UCB by adding a term based on the Mann-Kendall statistic (Mann, 1945; Kendall, 1948), which, for time-series data $(X_i)_{i=1}^n$, is given by $S = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \mathrm{sgn}(X_j - X_i)$. This statistic is designed to identify monotonic upward or downward trends in a time series. For each attribute, we calculate this statistic, denoted $S_{z,t}$, on the accuracy of the classifier $\hat f_t$ on the attribute's validation set $D_z$ over time. Intuitively, this tracks whether training on additional samples from an attribute is improving the classifier's accuracy on that attribute. We incorporate the statistic into the UCB as follows:

$$\widetilde U_t(z, \hat f_t) := \frac{1}{|D_z|} \sum_{(x,y) \in D_z} \ell(\hat f_t, x, y) \;+\; \frac{c_0}{\sqrt{N_{z,t}}} \;+\; c_1\, \frac{S_{z,t}}{\sqrt{\mathrm{var}(S_{z,t})}}, \tag{6}$$

to incentivize the algorithm to sample attributes on which accuracy is increasing; $c_1$ is a free parameter controlling the importance of per-attribute loss trends. We note that the general ability to modify the UCB in this manner is a strength of our algorithm, as it allows for a great deal of interpretability and adaptability in practical applications. The parameters $c_0$, $c_1$, $n_0$, and $b_0$ are chosen based on the specific problem and available computational resources; we describe our selection of these terms in Appendix E. In the experiments, we refer to the variant of this algorithm with $c_1 = 0$ as Aopt, and the variant with $c_1 > 0$ as Aopt+.

## 5 Empirical Results

We evaluate the performance of our proposed active sampling algorithm Aopt on both synthetic and real datasets, and compare it with the following baselines: 1) the ϵ-greedy scheme of Abernethy et al. (2020); 2) Greedy, which is equivalent to ϵ-greedy with ϵ = 0; 3) Uniform, where an equal number of samples is drawn for each attribute; and 4) Uncurated, where samples are drawn according to the natural distribution of the dataset. We note that for some datasets the natural distribution is uniform; for those, we omit the results for the Uncurated scheme. We will make the code for these experiments available in the supplemental materials. All experiments were run for multiple trials (the number of trials is indicated with the results), and we report the average over trials, with shaded regions on plots indicating one standard deviation. Experiments for image datasets were run on a single GPU provided by Google Colab or other shared computing resources.

Figure 2: Convergence of $\pi_n(u)$ for the three algorithms Aopt, ϵ-greedy (with ϵ = 0.1), and Empirical (i.e., ϵ-greedy with ϵ = 0), to the optimal value $\pi^\star(u)$ for the two instances of Synthetic Model 1 introduced in Section 2.2.

**Synthetic Dataset.** In this experiment, we compare how the $\pi_n(u)$ returned by the different algorithms converges to $\pi^\star(u)$ for the two synthetic models introduced in Section 2.2, with $\mathcal{F}$ chosen as the family of logistic regression (LR) classifiers. Since the feature space is two-dimensional, we use the versions of the Aopt, ϵ-greedy, and Empirical schemes in which we train the classifier from scratch in each round. For Aopt, we use the UCB given in Equation (6) with $c_0 = 0.1$, and for ϵ-greedy, we use ϵ = 0.1. These values of $c_0$ and ϵ are selected via a grid search.
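Since these experiments select attributes using Equation (6) (with $c_1 = 0$ here, and $c_1 > 0$ for Aopt+ later), the following minimal sketch shows one way to compute that acquisition score. The no-ties normal-approximation variance $\mathrm{var}(S) = n(n-1)(2n+5)/18$ for the Mann-Kendall statistic is a standard formula, but its use here is our assumption rather than a detail taken from the paper.

```python
# A minimal sketch of the heuristic UCB of Equation (6), including the
# Mann-Kendall trend term used by Aopt+. The variance formula assumes no ties.
import numpy as np

def mann_kendall_S(acc_history):
    """S = sum_{i<j} sgn(x_j - x_i) over the per-round validation accuracies."""
    x = np.asarray(acc_history)
    n = len(x)
    return sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))

def heuristic_ucb(val_losses, N_z, acc_history, c0=0.1, c1=0.0):
    """Equation (6): empirical loss + exploration bonus + optional trend term."""
    u = np.mean(val_losses) + c0 / np.sqrt(N_z)
    if c1 > 0 and len(acc_history) > 2:
        n = len(acc_history)
        var_S = n * (n - 1) * (2 * n + 5) / 18.0   # no-ties normal approximation
        u += c1 * mann_kendall_S(acc_history) / np.sqrt(var_S)
    return u
```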
Figure 2 shows how $\pi_n(u)$ changes with $n$ for the three algorithms, averaged over 100 trials. As expected, the algorithms with an exploration component (i.e., Aopt and ϵ-greedy) eventually converge to the optimal $\pi^\star(u)$ value in both cases, whereas the Empirical scheme, which acts greedily, often gets stuck with a wrong mixture distribution, resulting in high variability in its performance.

**Adult Dataset.** For the remaining experiments, we find that, for properly tuned values of ϵ and $c_0$, both Aopt and ϵ-greedy attain comparable minimum test error. We therefore omit the ϵ-greedy results for clarity; see Appendix C for a detailed comparison of the two algorithms. We now analyze the performance of the remaining algorithms on a dataset from the UCI ML Repository (Dua and Graff, 2017) that is commonly used in the fairness literature: the Adult dataset. It is a low-dimensional problem, so we use LR classifiers and the exact version of each sampling algorithm, where the optimal classifier is computed with each new sample. We set $\mathcal{Z}$ to be white men, non-white men, white women, and non-white women. The minimum test accuracy over all attributes at each iteration is displayed in Figure 3a. Each sampling scheme approaches 80.5% accuracy as the sample size grows; this matches the results achieved under the LR algorithms reported in Table 4a of Martinez et al. (2020). This is the maximum accuracy an LR classifier can achieve on the white male sub-population in isolation, so additional sampling, or other fairness algorithms, cannot improve on this performance in a minimax sense. We do note, however, that the adaptive algorithms hold a sizable advantage over Uniform at small sample sizes.

**Image Datasets.** We also compare the performance of the different sampling schemes on larger-scale problems and more complex hypothesis classes. To this end, we use three image datasets, UTKFace (Zhang et al., 2017), Fashion MNIST (Xiao et al., 2017), and Cifar10 (Krizhevsky and Hinton, 2009), with CNNs. We use the heuristic variants of the Aopt (with $c_1 = 0$) and Greedy algorithms, with batch sizes of 50. The CNN architecture, data transforms, and further algorithm parameters are detailed in Appendix E.

Figure 3: Left: minimum test accuracy over all attributes for the Adult dataset as a function of the sampling budget $n$, averaged over 10 trials. Right: mixture distribution learned by Aopt over 500 training rounds on Fashion MNIST.

Figure 4: Minimum test accuracy, over all attributes, for both Fashion MNIST and UTKFace as a function of the time step $t$, averaged over 10 trials.

The UTKFace dataset consists of face images annotated with age, gender, and ethnicity. We choose $\mathcal{Y} = \{\text{Male}, \text{Female}\}$ and set $\mathcal{Z}$ to the five ethnicities. The minimum test accuracy, over all attributes, is shown in Figure 4b and demonstrates a clear separation between the adaptive algorithms and both Uniform and Uncurated sampling. Uncurated does particularly poorly in the low-sample regime here, in contrast with the Adult dataset, where Uncurated performed comparably to the adaptive algorithms. This is because, for the Adult dataset, the lowest-accuracy attribute is over-represented in the dataset, with white men constituting 63% of all samples, whereas for UTKFace, accuracy was lowest for Asian people, who are under-represented in the available data at only 14% of all samples. The other two datasets, Fashion MNIST and Cifar10, were chosen to provide a controlled setting to demonstrate the existence of hard and easy attributes in real-world data.
For both, we divide the labels into 5 pairs and assign each pair an "attribute". The pairs were chosen according to existing confusion matrices so as to include both pairs that are easy to distinguish and pairs that are hard to distinguish from each other; see Appendix E for more details. For Fashion MNIST, each pair in Figure 3b is one attribute, and within each pair, one item is assigned $Y = 0$ and the other $Y = 1$. A single binary CNN classifier is then learned simultaneously over all 5 pairs. Figure 3b shows the mixture distribution generated by the Aopt sampling scheme, which allocated the vast majority of samples to the (T-shirt, Shirt) and (Pullover, Coat) pairs. This makes intuitive sense, since both pairs of items are qualitatively very similar to each other, and it aligns with common confusion matrices, which indicate that those items are frequently misclassified as each other by standard classifiers. Figure 4a displays the worst-case accuracy for each scheme on Fashion MNIST as a function of the time step and shows that both adaptive algorithms outperform Uniform sampling throughout the training process.

Finally, Figure 5 shows the test accuracy on each attribute for both the Aopt and Uniform sampling schemes. Aopt maintains a much smaller spread between the accuracies over different attributes. This equity is particularly desirable from a fairness perspective, as it ensures no attribute has a large advantage over any other in the sense of expected accuracy. It also validates our assumption that the optimal distribution will tend to equalize the losses across attributes on real-world data. Final accuracy for all experiments, per-attribute accuracy for UTKFace, all results for Cifar10, and details of the Adult dataset are included in Appendix E, along with results for the German dataset, another UCI ML Repository dataset.

Figure 5: Test accuracy for each attribute in Fashion MNIST as a function of the time step $t$, for both the Aopt and Uniform sampling schemes, averaged over 10 trials.

Figure 6: Test accuracy as a function of the sampling budget for both attributes from the dataset shown in Fig. 7, averaged over 100 trials.

**Empirical results when Assumption 2 is violated.** Finally, we consider the case where our assumption of a unique, risk-equalizing mixture distribution may not hold. The practical effects of this situation can be seen in the Adult dataset, where additional samples of the white male attribute show diminishing returns, while other attributes can attain higher accuracy given more samples. In such a scenario, each adaptive algorithm will continue to select samples mainly from the worst-case group, despite this granting little improvement in performance.

Figure 7: An instance of Synthetic Model 2, with attribute $Z = u$ shown in red and $Z = v$ shown in blue.

To evaluate this scenario, we introduce a second synthetic data model: an instance of Synthetic Model 2 is illustrated in Figure 7 and specified in detail in Definition 4 in Appendix E. In this model, we create one attribute, shown in red, which allows only a relatively small maximal accuracy that can be attained with few samples. In contrast, the blue attribute can be classified with high accuracy given few samples, as the bulk of its mass is in the central, separable clusters shown in the figure; but classification of this attribute also benefits from many additional samples, as the sparser regions in the top left and bottom right are explored.
In this setting, constantly sampling from the attribute with the lowest empirical accuracy is inadvisable, as additional samples of the red attribute cannot increase its accuracy, while additional samples of the blue attribute can. We compared the performance of Aopt+ (the heuristic variant of our algorithm with the trend-based statistic included), Aopt, and Greedy on Synthetic Model 2. Figure 6 shows the results: each of the three algorithms achieves similar minimax accuracy and performs worst on attribute $Z = u$, shown in red in Figure 7. However, Aopt+ achieves 4 points higher average accuracy than both Greedy and Aopt (with the original UCB) on the other attribute, $Z = v$. This demonstrates that the additional term in the Aopt+ UCB is effective at recognizing when the hardest group is not benefiting from additional training samples, and at redistributing those samples more effectively.

## 6 Conclusion

We considered the problem of actively constructing a training set in order to learn a classifier that achieves minimax fairness in terms of predictive loss. We proposed a new strategy for this problem (Aopt) and obtained theoretical guarantees on its performance. We then showed that the theoretical performance achieved by Aopt and ϵ-greedy (Abernethy et al., 2020) cannot be improved in general, by obtaining algorithm-independent lower bounds for the problem. Our experiments demonstrated that adaptive sampling schemes can achieve superior minimax risk [Fig. 3a, Fig. 4] and smaller disparity between per-attribute risks [Fig. 5] compared to both uniform and uncurated schemes. The results in Fig. 2 show the necessity of exploration, as purely greedy algorithms can converge to sub-optimal mixture distributions. Finally, Fig. 6 shows the versatility of our general UCB-based strategy in its ability to readily incorporate new terms to accommodate the practical challenges posed by real-world problems.

Our theoretical results rely on the existence of a unique risk-equalizing mixture distribution $\pi^\star$, a condition that may not always hold. Thus, an important direction for future work is to relax this assumption, and to design and analyze algorithms that identify Pareto-optimal mixture distributions achieving minimax fairness. Another important direction is to derive and analyze active sampling strategies for other fairness measures.