# Multi-group Agnostic PAC Learnability

Guy N. Rothblum¹, Gal Yona¹

*Equal contribution. ¹Weizmann Institute of Science, Rehovot, Israel. Correspondence to: Guy N. Rothblum, Gal Yona.

*Proceedings of the 38th International Conference on Machine Learning*, PMLR 139, 2021. Copyright 2021 by the author(s).

## Abstract

An agnostic PAC learning algorithm finds a predictor that is competitive with the best predictor in a benchmark hypothesis class, where competitiveness is measured with respect to a given loss function. However, its predictions might be quite suboptimal for structured subgroups of individuals, such as protected demographic groups. Motivated by such fairness concerns, we study *multi-group agnostic PAC learnability*: fixing a measure of loss, a benchmark class $\mathcal{H}$ and a (potentially) rich collection of subgroups $\mathcal{G}$, the objective is to learn a single predictor such that the loss experienced by every group $g \in \mathcal{G}$ is not much larger than the best possible loss for this group within $\mathcal{H}$. Under natural conditions, we provide a characterization of the loss functions for which such a predictor is guaranteed to exist. For any such loss function we construct a learning algorithm whose sample complexity is logarithmic in the size of the collection $\mathcal{G}$. Our results unify and extend previous positive and negative results from the multi-group fairness literature, which applied to specific loss functions.

## 1. Introduction

Machine learning tools are used to make and inform increasingly consequential decisions about individuals. Examples range from medical risk prediction to hiring decisions and criminal justice. Automated classification and risk prediction come with benefits, but they also raise substantial societal concerns. One prominent concern is that these algorithms might discriminate against protected and/or disadvantaged groups. In particular, a learned predictor might perform differently on a protected subgroup compared to the general population. The growing literature on algorithmic fairness has studied different concerns. Many works aim to ensure parity or balance between demographic groups, e.g. similar rates of positive predictions or similar false positive or false negative rates (Hardt et al., 2016; Kleinberg et al., 2016). Other works consider accuracy guarantees, such as calibration (Dawid, 1982), for protected groups. Protections at the level of a single group might be too weak (Dwork et al., 2012), and recent works have studied extending these notions to the setting of multiple overlapping groups (Hébert-Johnson et al., 2018; Kearns et al., 2018).

In this work, we focus on the setting of (supervised) agnostic learning (Kearns et al., 1994). Given an i.i.d. training set of labeled data, the goal is to learn a predictor $h$ that performs well on the underlying distribution. Performance is measured by a loss function, and with respect to a fixed class $\mathcal{H}$: the loss incurred by the predictor $h$ should be competitive with the best predictor in $\mathcal{H}$. To capture a wide variety of settings, we aim to be quite general in our treatment of different loss functions. With fairness in mind, the agnostic learning paradigm raises a fundamental concern: since the predictor's loss is measured over the entire underlying distribution, it might not reflect the predictor's performance on sub-populations such as protected demographic groups.
Indeed, it has been demonstrated that standard machine learning tools, when applied to standard data sets, produce predictors whose performance on protected demographic groups is quite poor (Buolamwini & Gebru, 2018).

Motivated by these concerns, we study multi-group agnostic learning. For a rich collection $\mathcal{G}$ of (potentially) overlapping groups, our goal is to learn a single predictor $h$, such that the loss experienced by every group $g \in \mathcal{G}$ (when classified by $h$) is not much larger than the loss of the best predictor for that group in the class $\mathcal{H}$. We emphasize that this should hold for all groups in $\mathcal{G}$ simultaneously.

To see how this objective differs from the usual agnostic PAC objective, consider the simple example in which $\mathcal{H}$ is the class of hyperplanes and we have two subgroups $S, T \subseteq \mathcal{X}$. Suppose that the data is generated such that every group $g$ has a hyperplane $h_g$ with very low error on it (but that these hyperplanes are different, so e.g. $h_T$ has large loss on $S$ and vice versa). This means that there is no classifier $h \in \mathcal{H}$ that perfectly labels the data. If $S$ is small compared to $T$, then the agnostic learning objective could be satisfied by $h_T$, the optimal classifier for $T$. For multi-group agnostic PAC, the fact that there is some other classifier in $\mathcal{H}$ that perfectly labels $S$ serves to disqualify $h_T$ (more generally, it could be the case that no $h \in \mathcal{H}$ is multi-PAC). This also highlights that the multi-group objective becomes challenging when the groups in question are intersecting (if the groups are disjoint, we can combine the optimal classifiers for each group (Dwork et al., 2017)).
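To make the example concrete, here is a minimal simulation (ours, not from the paper): two groups whose optimal hyperplanes are orthogonal, where a crude empirical risk minimizer over hyperplanes achieves low overall loss but high loss on the smaller group $S$. The data-generating process and the random-search "learner" are illustrative assumptions.

```python
# Illustrative sketch (not from the paper): global ERM over hyperplanes fits the
# large group T and neglects the small group S, even though some hyperplane in
# the class labels S perfectly.
import numpy as np

rng = np.random.default_rng(0)

def make_group(n, w):
    """n points in R^2, labeled noiselessly by the halfspace 1[x . w > 0]."""
    X = rng.normal(size=(n, 2))
    return X, (X @ w > 0).astype(int)

X_T, y_T = make_group(2000, np.array([1.0, 0.0]))   # large group T
X_S, y_S = make_group(100, np.array([0.0, 1.0]))    # small group S
X, y = np.vstack([X_T, X_S]), np.concatenate([y_T, y_S])

def zero_one(w, X, y):
    return np.mean((X @ w > 0).astype(int) != y)

# Crude ERM: best of many random hyperplanes (a stand-in for a real learner).
w_erm = min(rng.normal(size=(5000, 2)), key=lambda w: zero_one(w, X, y))

print("overall loss:", zero_one(w_erm, X, y))      # low: dominated by T
print("loss on T:   ", zero_one(w_erm, X_T, y_T))  # near zero
print("loss on S:   ", zero_one(w_erm, X_S, y_S))  # near 1/2: S is neglected
```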
A recent work by Blum and Lykouris (Blum & Lykouris, 2019) studied this question in an online setting with sequential predictions. Our focus is on the batch setting. They showed that (for every collection of groups and every benchmark hypothesis class) it is possible to achieve competitive loss for all groups, so long as the loss function is decomposable: the loss experienced by each group is an average of the losses experienced by its members (this result also applies to the batch setting). On the other hand, they exhibited a loss function (the average of the false negative and false positive rates) for which the objective is infeasible even in the batch setting. Since this loss is non-decomposable (the false positive rate of a classifier depends on the rate of negative labels, which is a property of the entire distribution in question), one might conjecture that the multi-group objective is only achievable for decomposable losses. However, this conjecture is false: for example, calibration (Dawid, 1982) is a non-decomposable loss that is compatible with the multi-group objective. Indeed, (Hébert-Johnson et al., 2018) propose an algorithm that guarantees predictions are calibrated across groups (in other words, the calibration loss for each $g \in \mathcal{G}$ is close to zero). In particular, the output of this algorithm would satisfy the multi-group agnostic PAC requirement (with respect to the calibration loss, and for every hypothesis class).

### 1.1. Our contributions

Motivated by these observations, in this work we formalize two main questions:

1. We define a loss function as (multi-group) *compatible* if it is appropriate to use with our objective, in the sense that for every hypothesis class $\mathcal{H}$ and collection of groups $\mathcal{G}$, there exists a hypothesis $h$ that is competitive with $\mathcal{H}$ for every group in $\mathcal{G}$. What makes a loss function compatible? Previous works provide several positive and negative results, but there was no clear characterization of compatibility.

2. Is it always the case that for such compatible losses, a multi-group predictor can also be found using a finite number of samples? In other words, is there a separation between multi-group compatibility and multi-group learnability?

Our main technical contributions answer both questions:

1. We prove a partial information-theoretic characterization of the compatible loss functions.

2. For any such loss function that also satisfies a natural uniform convergence property, we show an algorithm that, for any specified finite collection $\mathcal{G}$ and finite hypothesis class $\mathcal{H}$, learns a multi-group agnostic predictor from labeled data. The sample complexity is logarithmic in the sizes of $\mathcal{G}$ and $\mathcal{H}$. Our algorithm is derived by a reduction to outcome indistinguishability (OI), a learning objective recently introduced by Dwork et al. (Dwork et al., 2020), drawing a new connection between OI and agnostic learning. This shows that (under minimal assumptions on the loss function) multi-group compatibility implies multi-group learnability.

In slightly more detail, we characterize the compatible loss functions assuming an additional *unambiguity* property (we refer to the characterization as partial because of this assumption): we assume that once we fix a single individual and specify the distribution of their label, there is a unique prediction that minimizes the loss for that individual. We view this as a very natural assumption on the loss function; see the discussion following Definition 2.2. We show that if a loss function is compatible and unambiguous, then for each individual, the optimal prediction can be obtained by a fixed local function $f$, whose output depends only on the features and on the marginal distribution of that individual's label. The point is that the function $f$ doesn't depend on the global distribution, but the predictions that it specifies still minimize the loss for that distribution. We call loss functions that satisfy this property *f-proper*, and we show that (under the unambiguity assumption) being f-proper is equivalent to multi-group agnostic PAC compatibility.

We then construct a universal multi-group agnostic PAC learning algorithm for any f-proper loss function that also satisfies a minimal uniform convergence property (this is necessary for finite sample complexity). The learning algorithm works for any specified finite hypothesis class $\mathcal{H}$ and any finite collection of groups $\mathcal{G}$, and its sample complexity is logarithmic in $|\mathcal{H}|$ and $|\mathcal{G}|$. The algorithm is obtained via a reduction from multi-group agnostic PAC learning to the recently studied task of outcome indistinguishable learning (Dwork et al., 2020). Beyond unifying previous work in multi-group fairness, this result can be thought of as a multi-group analogue of the known result that every PAC learnable class is learnable via empirical risk minimization.

### 1.2. Related work

The literature on algorithmic fairness is broad and growing. Most pertinent to our work is the literature on fairness notions that aim to stake a middle ground between the strong, individual-level semantics of individual fairness notions (Dwork et al., 2012) and the weaker but useful group fairness notions (Hardt et al., 2016; Kleinberg et al., 2016). We broadly refer to such notions as multi-group notions.
Some of these works are geared towards guaranteeing parity: equalizing some statistic across the (possibly exponentially many) subgroups in $\mathcal{G}$. For example, (Kearns et al., 2018) study multi-group versions of both the Equality of Opportunity and Demographic Parity notions (Hardt et al., 2016). We note, however, that as the collection $\mathcal{G}$ becomes richer and richer, it might be the case that multi-group parity can only be obtained via trivial predictions or otherwise undesirable behaviour, such as intentionally making worse predictions on some groups to achieve parity (Chen et al., 2018). A different line of works considers guarantees that are not inherently in conflict with accuracy, e.g. because the Bayes optimal predictor always satisfies multi-group fairness (regardless of $\mathcal{G}$). Notable examples include multicalibration (Hébert-Johnson et al., 2018) and multi-accuracy (Kim et al., 2019), with subsequent works studying extensions to ranking (Dwork et al., 2019), regression (Jung et al., 2020) and online learning (Gupta et al., 2021).

As discussed above, Blum and Lykouris (Blum & Lykouris, 2019) study multi-group agnostic PAC learning (which, depending on the loss function, is often similarly aligned with accuracy) in the online setting. In the batch setting, their approach for learning a multi-PAC predictor is via the mixture-of-experts approach (Jacobs et al., 1991), where the experts in question are a set of $|\mathcal{G}|$ predictors, with each $h_i$ being the optimal predictor in the class $\mathcal{H}$ for group $g_i \in \mathcal{G}$. However, this approach fails to produce a multi-PAC predictor when the loss is non-decomposable, such as calibration (this happens even though, by (Hébert-Johnson et al., 2018), such multi-fair predictors exist and can be found with sample complexity that scales with $\log |\mathcal{G}|$).

## 2. Preliminaries

**Setup and notation.** We consider binary classification problems, where $\mathcal{X} \subseteq \mathbb{R}^d$ denotes a feature space and $\mathcal{Y} = \{0, 1\}$ the target labels. As a general convention, we use $D$ to denote distributions over $\mathcal{X} \times \mathcal{Y}$ and $D_{\mathcal{X}}$ for the marginal distribution of $D$ over $\mathcal{X}$. The support of a distribution (w.r.t. $\mathcal{X}$) is $\mathrm{supp}(D) = \mathrm{supp}(D_{\mathcal{X}})$. We will sometimes be interested in distributions over $\mathcal{X} \times \mathcal{Y}$ for which $|\mathrm{supp}(D)| = 1$, i.e. distributions that are supported on a single element $x \in \mathcal{X}$. We refer to these as *singleton distributions*. For a distribution $D$ and an element $x \in \mathcal{X}$, we use $D_x$ to denote the singleton distribution on $\mathcal{X} \times \mathcal{Y}$ obtained by restricting $D$ to $X = x$.

A predictor is a mapping $h : \mathcal{X} \to [0, 1]$, where $h(x)$ is an estimate for the probability that the label of $x$ is 1. We will sometimes consider the special case of classifiers (binary predictors), whose range is $\{0, 1\}$. A hypothesis class is a collection of predictors, and is denoted by $\mathcal{H}$. A subgroup is some subset of $\mathcal{X}$. A collection of subgroups is denoted by $\mathcal{G}$. For this work we generally assume $\mathcal{H}$ and $\mathcal{G}$ are finite (but possibly exponentially large in $d$). For a distribution $D$ and a group $g \subseteq \mathcal{X}$, we use $D_g$ to denote the distribution on $\mathcal{X} \times \mathcal{Y}$ obtained by restricting $D$ to $x \in g$.

### 2.1. General loss functions

A loss function $L$ is some mapping from a distribution $D$ and a predictor $h$ (technically, the restriction of $h$ to $\mathrm{supp}(D)$) to $[0, 1]$. We are typically interested in how $L$ changes as we keep the first argument (the distribution) fixed, and only change the second argument (the predictor in question). We thus typically use $L_D(h)$ to denote the loss of $h$ w.r.t. a distribution $D$.
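As a preview of the decomposable special case formalized in Definition 2.1 below, the following interface sketch (ours; all names are illustrative) treats a loss as a function of a predictor and a finite sample standing in for the restriction of the distribution. The later sketches in this document reuse this interface.

```python
# Interface sketch (ours): a loss maps (predictor, sample) to [0, 1], where the
# sample stands in for the restriction of the distribution. Decomposable losses
# are the special case that averages a pointwise penalty ell(x, y, h(x)).
from typing import Callable, List, Tuple

Sample = List[Tuple[object, int]]          # pairs (x, y), with y in {0, 1}
Predictor = Callable[[object], float]
Loss = Callable[[Predictor, Sample], float]

def decomposable(ell) -> Loss:
    """Lift a pointwise penalty ell(x, y, h(x)) to an (empirical) loss L_S(h)."""
    def L(h: Predictor, S: Sample) -> float:
        return sum(ell(x, y, h(x)) for x, y in S) / len(S)
    return L

squared_loss = decomposable(lambda x, y, p: (p - y) ** 2)
zero_one_loss = decomposable(lambda x, y, p: float(p != y))  # for classifiers
```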
For a sample $S = \{(x_i, y_i)\}_{i=1}^m$ we use $L_S(h)$ to denote the empirical loss, calculated as $L_{\hat D}(h)$, where $\hat D$ is the empirical distribution defined by the sample $S$. Note that this setup is extremely general, and assumes nothing about the loss (except that it is bounded and can't depend on what happens outside $D$). In machine learning it is common to consider more structured losses, in which $L_D(h)$ is the expected loss of $h$ on a random example drawn according to $D$. We refer to such structured losses as decomposable losses.

**Definition 2.1** (Decomposable losses). A loss function $L$ is decomposable if there exists a function $\ell : \mathcal{X} \times \mathcal{Y} \times [0, 1] \to [0, 1]$ such that for every distribution $D$ and classifier $h$, $L_D(h) = \mathbb{E}_{(x,y) \sim D}[\ell(x, y, h(x))]$.

For example, for binary classifiers a standard decomposable loss is the 0-1 loss, in which $\ell(x, y, h(x)) = \mathbb{1}[h(x) \neq y]$. For predictors, an example of a standard decomposable loss is the squared loss, in which $\ell(x, y, h(x)) = (h(x) - y)^2$.

**Beyond decomposable losses.** While decomposable losses are standard and common, there are many loss functions of interest that don't have this form, especially in the literature on algorithmic fairness. For this reason, we focus on a general notion of loss functions (which does not explicitly assume losses are decomposable) in our exploration of multi-group agnostic PAC learning. Notable examples of such losses, as used in the algorithmic fairness literature, include the following notions:

**Calibration** (Chouldechova, 2017; Hébert-Johnson et al., 2018; Kleinberg et al., 2016; Shabat et al., 2020): A predictor is calibrated if for every value $v \in [0, 1]$, conditioned on $h(x) = v$, the true expectation of the label is close to $v$. Intuitively, this means that the outputs of the predictor can reasonably be thought of as probabilities, and hence this is a fundamental requirement in the literature on forecasting (Dawid, 1982; Foster & Vohra, 1998). For example, in weather prediction, calibrated forecasts ensure that, out of all the days on which the forecaster predicted say 0.8, it really rained on 80% of them. A calibration loss measures the extent to which a predictor is miscalibrated; e.g., the expected calibration error (Kumar et al., 2018) measures the deviation from calibration w.r.t. a value $v$ drawn from the distribution induced by the predictions in question. This loss is not decomposable because it is a global function of the predictions, not a property of the prediction for a single $x \in \mathcal{X}$.
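The following sketch (ours) of a binned calibration-style loss makes the non-decomposability concrete: an example's contribution depends on the average label over all examples sharing its prediction bin, a global property of the sample. The binning scheme is a simplification of the expected calibration error; exact definitions vary across the cited works.

```python
# Sketch (ours) of a binned calibration loss: the weighted average, over
# prediction bins, of |mean prediction - mean label|. No pointwise penalty
# ell(x, y, h(x)) can express this, since each term depends on the whole bin.
import numpy as np

def calibration_loss(preds, labels, n_bins=10):
    preds, labels = np.asarray(preds, float), np.asarray(labels, float)
    bins = np.minimum((preds * n_bins).astype(int), n_bins - 1)
    loss = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            loss += mask.mean() * abs(preds[mask].mean() - labels[mask].mean())
    return loss

# A forecaster that says 0.8 on days where it rains 80% of the time is
# perfectly calibrated; saying 0.9 on the same days incurs calibration loss.
print(calibration_loss([0.8] * 5, [1, 1, 1, 1, 0]))  # 0.0
print(calibration_loss([0.9] * 5, [1, 1, 1, 1, 0]))  # 0.1
```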
**One-sided error rates** (Blum & Lykouris, 2019; Blum & Stangl, 2019; Chouldechova, 2017; Hardt et al., 2016; Kearns et al., 2018): The false positive rate (similarly, the false negative rate) measures the probability of a random example being labeled as $h(x) = 1$, conditioned on the true label being $y = 0$. This isn't a decomposable loss because the exact contribution of a single misclassification depends on the frequency of the negative labels, which is a global property.

**Individual fairness** (Dwork et al., 2012; Rothblum & Yona, 2018): Individual fairness (IF) requires that individuals that are considered similar w.r.t. the task at hand are treated similarly by the classifier. Similarity is specified by a metric $d : \mathcal{X} \times \mathcal{X} \to [0, 1]$. If $d(x, x') \approx 0$ (i.e., $x$ and $x'$ are similar), then it should be the case that $h(x) \approx h(x')$. An IF loss may quantify the expected violation of this requirement, e.g. over a draw of a pair $(x, x')$ i.i.d. from $D$. This objective isn't decomposable because the requirement w.r.t. a single $x \in \mathcal{X}$ depends on the extent to which there are other similar $x'$ in $\mathrm{supp}(D)$.

We note that the latter two losses (false positive rate and individual fairness) are typically not interesting on their own, e.g. because the constant classifier $h \equiv 0$ minimizes them. Typically we are interested in these losses either as an additional constraint on an existing objective, or paired with an additional loss (e.g. IF plus an accuracy-based loss, or false positive rate plus false negative rate).

In the rest of this section we specify two minimal conditions for the losses we consider. We will use the notation $\mathcal{L}$ for the class of losses that satisfy both conditions.

The first condition is unambiguity. It guarantees that distributions over a single example $x$ have a unique loss minimizer.

**Definition 2.2** (Unambiguity). A loss $L$ is unambiguous if for every singleton distribution $D_x$ over $\mathcal{X} \times \mathcal{Y}$, there is a unique prediction $h(x)$ that minimizes the loss. That is, $\left|\arg\min_{h(x) \in [0,1]} L_{D_x}(h)\right| = 1$.

Standard decomposable losses satisfy this condition because the function $\ell(x, y, \cdot)$ typically has a unique minimizer in its last argument. For example, when $\ell$ corresponds to the squared loss the optimal labeling is $h(x) = \mathbb{E}[y|x]$, and when $\ell$ corresponds to the expected 0-1 loss it is $h(x) = \mathbb{1}[\mathbb{E}[y|x] \ge 0.5]$. With respect to the fairness-motivated losses mentioned above, unambiguity fails for individual fairness and the one-sided error rates (exactly because they can both be minimized by the constant all-zeroes classifier, regardless of the true distribution). But as mentioned above, these losses are rarely interesting on their own, and combined losses (individual fairness with an unambiguous loss, or an average of the false positive rate and false negative rate) are unambiguous. In other words, all losses that are of practical interest to us are unambiguous.
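For concreteness, a short calculation (ours) verifying unambiguity for the squared loss on a singleton distribution $D_x$ with $\Pr[y = 1] = z$:

```latex
\mathbb{E}_{y \sim \mathrm{Ber}(z)}\big[(h(x) - y)^2\big]
  = z\,\big(h(x) - 1\big)^2 + (1 - z)\,h(x)^2
  = \big(h(x) - z\big)^2 + z(1 - z),
```

which is uniquely minimized at $h(x) = z$. Similarly, for a binary prediction $h(x) \in \{0, 1\}$, the expected 0-1 loss is $z \cdot \mathbb{1}[h(x) = 0] + (1 - z) \cdot \mathbb{1}[h(x) = 1]$, uniquely minimized at $h(x) = \mathbb{1}[z \ge 0.5]$ whenever $z \neq 0.5$.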
The second condition we will require is that the empirical risk $L_S(\cdot)$ can really be used to approximate the true risk $L_D(\cdot)$, for a sufficiently large sample $S$. To build up to this notion we first recall the standard definition of uniform convergence for hypothesis classes.

**Definition 2.3** (Uniform convergence for hypothesis classes). We say that a hypothesis class $\mathcal{H}$ has the uniform convergence property (w.r.t. a domain $\mathcal{X} \times \mathcal{Y}$ and a loss function $L$) if there exists a function $m^{UC}_{\mathcal{H}} : (0, 1)^2 \to \mathbb{N}$ such that for every $\varepsilon, \delta \in (0, 1)$ and for every probability distribution $D$ over $\mathcal{X} \times \mathcal{Y}$, if $S$ is a sample of $m \ge m^{UC}_{\mathcal{H}}(\varepsilon, \delta)$ examples drawn i.i.d. according to $D$, then, with probability at least $1 - \delta$, $\forall h \in \mathcal{H} : |L_S(h) - L_D(h)| \le \varepsilon$.

In our context, we will be interested in uniform convergence as a property of the loss function. We will say that a loss $L$ has uniform convergence (w.r.t. finite classes) with sample complexity $m^{UC}_L : (0, 1)^2 \times \mathbb{N} \to \mathbb{N}$ if every finite class $\mathcal{H}$ has the uniform convergence property w.r.t. $L$ with sample complexity $m^{UC}_{\mathcal{H}}(\varepsilon, \delta) \le m^{UC}_L(\varepsilon, \delta, |\mathcal{H}|)$. Specifically, we will be interested in losses that have the uniform convergence property with sample complexity that depends polynomially on $1/\varepsilon$, $1/\delta$ and $\log |\mathcal{H}|$. This gives rise to the following definition:

**Definition 2.4** (Uniform convergence for loss functions). A loss $L$ has the uniform convergence property (w.r.t. finite classes) with sample complexity $m^{UC}_L : (0, 1)^2 \times \mathbb{N} \to \mathbb{N}$ if there exists a polynomial $f : \mathbb{R}^3 \to \mathbb{N}$ such that for every $\varepsilon, \delta \in (0, 1)$ and $k \in \mathbb{N}$,

$$m^{UC}_L(\varepsilon, \delta, k) \triangleq \max_{\mathcal{H} : |\mathcal{H}| = k} m^{UC}_{\mathcal{H}}(\varepsilon, \delta) \le f(1/\varepsilon, 1/\delta, \log(k)).$$

The uniform convergence property is satisfied by any decomposable loss function. This follows by a combination of Hoeffding's bound (for a single $h$) and a union bound, to get a simultaneous guarantee for every $h \in \mathcal{H}$. Out of the fairness-motivated losses we discussed above, only the loss $L_D(h) = a \cdot \mathrm{FPR}_D(h) + b \cdot \mathrm{FNR}_D(h)$ doesn't have uniform convergence, as we prove in Appendix A. For calibration, uniform convergence follows as a special case of the bounds in (Shabat et al., 2020); for individual fairness, the argument is similar to the standard argument for decomposable losses, only this time the concentration argument is w.r.t. pairs of samples from $D$ (Rothblum & Yona, 2018).

| loss | notation | in $\mathcal{L}$? |
|---|---|---|
| decomposable | $L^{\ell}_D(h)$ | yes |
| calibration | $L^{C}_D(h)$ | yes |
| IF + decomposable | $a \cdot L^{IF}_D(h) + b \cdot L^{\ell}_D(h)$ | yes |
| error rates | $a \cdot L^{FPR}_D(h) + b \cdot L^{FNR}_D(h)$ | no |

*Figure 1. Summary: loss functions and the set $\mathcal{L}$ (all losses that are unambiguous and have the uniform convergence property).*

To summarize, we have defined a collection of reasonable losses $\mathcal{L}$ as all loss functions that are both unambiguous and have the uniform convergence property. We have argued that of the loss functions we discussed, this only rules out the one-sided error rate loss; see Figure 1 for an overview.

### 2.2. Learning paradigms

In this work we draw a connection between an extension of the classic notion of agnostic PAC learnability to the presence of multiple groups and the recently introduced complexity-theory-inspired framework of Outcome Indistinguishability (Dwork et al., 2020), which builds on previous work in algorithmic fairness (Hébert-Johnson et al., 2018). In this section we review both of these learning paradigms, starting with agnostic PAC learnability. For consistency, we give the definition w.r.t. finite classes.

**Definition 2.5** (Agnostic PAC learnability). A hypothesis class $\mathcal{H}$ is agnostic PAC learnable with respect to a set $\mathcal{X} \times \mathcal{Y}$ and a loss function $L$ if there exist a function $m_{\mathcal{H}} : (0, 1)^2 \to \mathbb{N}$ and a learning algorithm with the following property: for every $\varepsilon, \delta \in (0, 1)$ and for every distribution $D$ over $\mathcal{X} \times \mathcal{Y}$, when running the learning algorithm on $m \ge m_{\mathcal{H}}(\varepsilon, \delta)$ i.i.d. examples generated by $D$, the algorithm returns $h$ such that, with probability at least $1 - \delta$ (over the choice of the $m$ training examples),

$$L_D(h) \le L_D(\mathcal{H}) + \varepsilon, \tag{1}$$

where $L_D(\mathcal{H}) = \min_{h' \in \mathcal{H}} L_D(h')$. Additionally, the sample complexity $m_{\mathcal{H}}$ should be polynomial in $1/\varepsilon$, $1/\delta$ and $\log(|\mathcal{H}|)$.
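For a finite class and a decomposable loss, the guarantee of Definition 2.5 is achieved by empirical risk minimization; the following sketch (ours) spells out the Hoeffding-plus-union-bound sample complexity alluded to after Definition 2.4. The function names are illustrative.

```python
# Sketch (ours): agnostic PAC learning of a finite class H via empirical risk
# minimization. By Hoeffding's bound plus a union bound over H,
# m = ceil(ln(2|H|/delta) / (2 eps^2)) samples give |L_S(h) - L_D(h)| <= eps
# for all h in H simultaneously (w.p. >= 1 - delta), so the empirical
# minimizer satisfies L_D(h) <= L_D(H) + 2*eps.
import math

def sample_size(eps, delta, H_size):
    return math.ceil(math.log(2 * H_size / delta) / (2 * eps ** 2))

def erm(H, sample, ell):
    """H: list of predictors; sample: list of (x, y); ell: pointwise penalty."""
    def empirical_loss(h):
        return sum(ell(x, y, h(x)) for x, y in sample) / len(sample)
    return min(H, key=empirical_loss)

print(sample_size(0.05, 0.01, 1000))  # 2442 samples suffice in this regime
```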
**Outcome indistinguishability.** A predictor $\tilde p : \mathcal{X} \to [0, 1]$ can be viewed as providing a generative model for outcomes, where for $x \in \mathcal{X}$ a binary outcome is sampled as $\tilde y \sim \mathrm{Ber}(\tilde p(x))$. Given a distribution $D$ and a predictor $\tilde p$, we define the modeled distribution $\tilde D = \tilde D(\tilde p)$ as the distribution over $\mathcal{X} \times \mathcal{Y}$ obtained by first drawing $x$ according to the marginal distribution of $D$ on $\mathcal{X}$ and then labeling it as $\tilde y \sim \mathrm{Ber}(\tilde p(x))$. The framework of Outcome Indistinguishability (OI) aims to guarantee that outcomes produced by $\tilde p$ are indistinguishable from the true outcomes under $D$. This guarantee is formulated with respect to a fixed class $\mathcal{A}$ of distinguishers, and the requirement is that every distinguisher $A \in \mathcal{A}$ behaves similarly on samples from the real distribution $D$ and on samples from the modeled distribution $\tilde D$.

As discussed and studied in (Dwork et al., 2020), there are several different ways to instantiate this framework based on the power of the distinguishers. In particular, two axes of variation are (i) the input to the distinguisher (single sample vs. multi-sample) and (ii) the access level it receives to the predictor $\tilde p$ in question. In this work, we focus on multi-sample Sample-Access OI distinguishers, where the input to each distinguisher $A \in \mathcal{A}_k$ is a $k$-tuple of the form $\{(x_i, y_i, \tilde p_i)\}_{i=1}^k$, where for every $i \in [k]$, $(x_i, y_i)$ is sampled from either $D$ or $\tilde D$:

**Definition 2.6** (Outcome Indistinguishability (Dwork et al., 2020)). Fix a distribution $D$, a collection of distinguishers $\mathcal{A}_k$ and $\tau > 0$. A predictor $\tilde p : \mathcal{X} \to [0, 1]$ satisfies $(\mathcal{A}_k, \tau)$-OI if for every $A \in \mathcal{A}_k$,

$$\left| \Pr_{\{(x_i, y_i)\}_{i=1}^k \sim D^k}\left[A\big(\{(x_i, y_i, \tilde p(x_i))\}_{i=1}^k\big) = 1\right] - \Pr_{\{(x_i, y_i)\}_{i=1}^k \sim \tilde D^k}\left[A\big(\{(x_i, y_i, \tilde p(x_i))\}_{i=1}^k\big) = 1\right] \right| \le \tau.$$

Much like the definition of (regular) PAC learning, we say that an algorithm learns multi-sample OI if, on receiving sufficiently many i.i.d. samples from the distribution, it is guaranteed to return a predictor that is probably approximately OI.

**Definition 2.7** (OI learnability). A family of $k$-sample distinguishers $\mathcal{A}_k$ is multi-sample OI learnable with respect to a set $\mathcal{X} \times \mathcal{Y}$ if there exists a function $m_{\mathcal{A}} : (0, 1)^2 \to \mathbb{N}$ and a learning algorithm with the following property: for every $\tau, \eta \in (0, 1)$ and for every distribution $D$, when running the learning algorithm on $m \ge m_{\mathcal{A}}(\tau, \eta)$ i.i.d. examples generated by $D$, the algorithm returns a predictor $\tilde p$ such that, with probability at least $1 - \eta$ (over the choice of the $m$ training examples), $\tilde p$ satisfies $(\mathcal{A}_k, \tau)$-OI. Additionally, the sample complexity $m_{\mathcal{A}}$ should be polynomial in $1/\tau$, $1/\eta$ and $\log(|\mathcal{A}_k|)$.

Dwork et al. (Dwork et al., 2020) showed that every distinguisher class $\mathcal{A}_k$ is OI-learnable. In our analysis we use the following theorem, which bounds the sample complexity of this algorithm and follows from (Dwork et al., 2020).

**Theorem 2.8** (from (Dwork et al., 2020)). Fix a class of $k$-sample distinguishers $\mathcal{A}_k$. There exists an algorithm $\mathrm{OI}_{\mathcal{A}_k}$ that satisfies the requirement of Definition 2.7, and whose sample complexity is $O\!\left(k \cdot \log(|\mathcal{A}_k| / \eta) / \tau^2\right)$.
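A small Monte Carlo sketch (ours) of the quantity bounded in Definition 2.6: the gap between a distinguisher's acceptance probability on real batches from $D^k$ and on batches relabeled according to $\tilde p$. Reusing the real batch's features for the modeled batch is valid here because $\tilde D$ shares the marginal of $D$ over $\mathcal{X}$; all names are illustrative.

```python
# Sketch (ours): estimate the distinguishing advantage of a k-sample
# distinguisher A, i.e. |Pr[A accepts on D^k] - Pr[A accepts on D~^k]|.
import numpy as np

rng = np.random.default_rng(1)

def oi_advantage(A, draw_batch, p_tilde, trials=1000):
    """A: list of (x, y, p) triples -> {0, 1}; draw_batch: () -> k pairs (x, y) ~ D^k."""
    acc_real = acc_model = 0.0
    for _ in range(trials):
        batch = draw_batch()
        real = [(x, y, p_tilde(x)) for x, y in batch]
        # modeled batch: same marginal over x, labels redrawn as y ~ Ber(p~(x))
        model = [(x, int(rng.random() < p_tilde(x)), p_tilde(x)) for x, _ in batch]
        acc_real += A(real) / trials
        acc_model += A(model) / trials
    return abs(acc_real - acc_model)  # at most tau (up to sampling error) if p~ is OI
```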
## 3. Agnostic PAC with multiple groups

The objective of agnostic PAC learning is to output a predictor $h$ that satisfies $L_D(h) \le L_D(\mathcal{H}) + \varepsilon$ (Equation 1). Putting aside questions of computational complexity and sample complexity, this objective is itself always feasible. In other words, regardless of the loss in question, it can always be obtained by a learner who has full knowledge of the distribution $D$ and is not constrained in runtime. From an information-theoretic perspective, this makes the interesting question what can be done with finite samples, hence the focus on agnostic PAC *learnability*.

A multi-group extension of agnostic PAC asks for a predictor that satisfies the above, but simultaneously for every group $g$ in a collection $\mathcal{G}$: $L_{D_g}(h) \le L_{D_g}(\mathcal{H}) + \varepsilon$, where $D_g$ denotes the restriction of $D$ to samples from $g$. When $\mathcal{G}$ consists of intersecting groups, however, it is not immediately clear that this objective is always feasible: it might not be satisfied by any predictor $h : \mathcal{X} \to [0, 1]$. For a simple (but contrived) example, let $h_0, h_1$ denote the all-zeroes and all-ones predictors, and consider a loss $L$ that specifies that $L_{D_S}(h_0) = 0$ and $L_{D_T}(h_1) = 0$ (and for any other classifier $h$, the loss on every distribution is always 1). Then the multi-group objective w.r.t. $\mathcal{G} = \{S, T\}$ requires that we label the intersection $S \cap T$ as both 1 and 0, which is impossible.

The following proposition shows that this can be the case even for seemingly natural losses that are in $\mathcal{L}$.

**Proposition 3.1.** There exists $L \in \mathcal{L}$ for which some problem instances (corresponding to a distribution $D$, class $\mathcal{H}$, subgroups $\mathcal{G}$ and $\varepsilon > 0$) don't admit a multi-group agnostic PAC solution. In other words, there is no classifier $h$ for which $L_{D_g}(h) \le L_{D_g}(\mathcal{H}) + \varepsilon$ for every group $g \in \mathcal{G}$.

The loss in question is a weighted combination of an individual fairness loss and an accuracy loss. We construct two intersecting groups $S$ and $T$ and a similarity metric $d$ that specifies $d(x, x') = 0$ if and only if $x \in S \setminus T$ and $x' \in S \cap T$. However, the true labels of these individuals are different: for $x \in S \setminus T$ the label is $y = 0$, but for $x \in S \cap T$ the label is $y = 1$. This creates a situation in which group $T$ is optimizing strictly for accuracy (and so it wants the intersection to be labeled as $\hat y = 1$), whereas for the group $S$ the dominating term is IF (and so it wants everyone, and in particular the intersection, to be labeled as $\hat y = 0$). Thus, any classifier that is simultaneously multi-PAC w.r.t. both $T$ and $S$ must label the intersection as both 0 and 1, which is impossible. See Appendix B for the full proof.

We note that (Blum & Lykouris, 2019) already demonstrated that this impossibility can occur in the batch setting. This motivated them to focus on the 0-1 loss. However, their counterexample is for the error-rate type loss we discussed earlier, which is a fixed combination of the false positive rate and false negative rate of a classifier. As noted in Section 2.1, this loss doesn't have the uniform convergence property and is therefore not in $\mathcal{L}$. Proposition 3.1 clarifies that uniform convergence is not the issue, and there are natural (and otherwise reasonable) losses that are not appropriate to use with the multi-group objective (in the sense that even an approximate solution is not guaranteed to exist).

In light of this discussion, we proceed to explicitly separate the question of feasibility (whether it's always the case that some multi-group solution is guaranteed to exist) from learnability (whether we can find it with a reasonable number of samples). Formally, we define two notions, compatibility and learnability, that formalize these concepts. Importantly, both are properties of the loss function in question, taking a universal quantifier over the class $\mathcal{H}$ and the groups $\mathcal{G}$ (and the other aspects of the problem instance).

**Definition 3.2** (Multi-PAC compatibility). We say that a loss $L$ is multi-PAC compatible if for every distribution $D$, class $\mathcal{H}$, subgroups $\mathcal{G}$ and $\varepsilon > 0$, there exists $h : \mathcal{X} \to [0, 1]$ such that for every group $g \in \mathcal{G}$, $L_{D_g}(h) \le L_{D_g}(\mathcal{H}) + \varepsilon$.
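The following sketch (ours) makes Definition 3.2 operational on a finite sample, reusing the (predictor, sample) loss interface sketched in Section 2.1; the group-membership predicates and names are illustrative.

```python
# Sketch (ours): check the multi-group condition on a sample -- for every group
# g, h's empirical loss on the restriction to g must be eps-competitive with
# the best hypothesis in H on that restriction.
def is_multi_pac(h, H, groups, sample, L, eps):
    """groups: dict name -> membership predicate; L: (predictor, sample) -> loss."""
    for name, member in groups.items():
        S_g = [(x, y) for x, y in sample if member(x)]
        if not S_g:
            continue  # empirically empty group: nothing to check
        if L(h, S_g) > min(L(hh, S_g) for hh in H) + eps:
            return False
    return True
```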
Multi-PAC learnability strengthens the above requirement by asking that such a solution can also be found by a learning algorithm whose sample complexity is constrained to depend inverse-polynomially on the parameters in question and logarithmically on the sizes of $\mathcal{H}$ and $\mathcal{G}$.

**Definition 3.3** (Multi-PAC learnability). We say that a loss $L$ is multi-PAC learnable with sample complexity $m^{g\text{-PAC}}_L : (0, 1)^3 \times \mathbb{N}^2 \to \mathbb{N}$ if there exists a learning algorithm with the following property: for every $\varepsilon, \delta, \gamma \in (0, 1)$, for every finite hypothesis class $\mathcal{H}$, for every finite collection of subgroups $\mathcal{G} \subseteq 2^{\mathcal{X}}$ and for every distribution $D$ over $\mathcal{X} \times \mathcal{Y}$, when running the learning algorithm on $m \ge m^{g\text{-PAC}}_L(\varepsilon, \delta, \gamma, |\mathcal{H}|, |\mathcal{G}|)$ i.i.d. examples generated by $D$, the algorithm returns $h$ such that, with probability at least $1 - \delta$ (over the choice of the $m$ training examples and the coins of the learning algorithm),

$$\forall g \in \mathcal{G}_\gamma, \quad L_{D_g}(h) \le L_{D_g}(\mathcal{H}) + \varepsilon,$$

where $\mathcal{G}_\gamma \subseteq \mathcal{G}$ is the subset of groups whose mass under $D$ is at least $\gamma$: $\mathcal{G}_\gamma = \{g \in \mathcal{G} : \Pr_D[x \in g] \ge \gamma\}$. Additionally, the sample complexity $m^{g\text{-PAC}}_L$ should be polynomial in $1/\varepsilon$, $1/\delta$, $1/\gamma$ and $\log(|\mathcal{H}|)$, $\log(|\mathcal{G}|)$.

## 4. Compatibility ⇒ Learnability via OI

In this section we prove our main result: that for loss functions in $\mathcal{L}$, multi-PAC compatibility also implies multi-PAC learnability.

**Theorem 4.1.** If a loss $L \in \mathcal{L}$ is multi-group compatible (Definition 3.2), then it is also multi-group learnable (Definition 3.3).

Towards proving Theorem 4.1, we introduce an additional property of loss functions, which we define below.

**Definition 4.2** (f-proper). For a function $f : \mathcal{X} \times [0, 1] \to [0, 1]$, we say that a loss $L$ is f-proper if for every distribution $D$ on $\mathcal{X} \times \mathcal{Y}$, the classifier $h_D$ given by $h_D(x) = f(x, \mathbb{E}_D[y|x])$ minimizes the loss w.r.t. $D$: $h_D \in \arg\min_h L_D(h)$.

Recall that proper losses (or proper scoring functions) are losses that are minimized by the conditional expectation predictor $x \mapsto \mathbb{E}_D[y|x]$ (Buja et al., 2005). Definition 4.2 is a weaker requirement, which says that for every distribution a minimizer can be obtained as some local transformation of this predictor (i.e., one that does not depend on the rest of the distribution).
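For intuition, the local transformations $f$ witnessing f-properness for two standard decomposable losses are simple; the following sketch (ours) follows the examples given after Definition 2.2.

```python
# Sketch (ours): local transformations f(x, z), with z = E_D[y|x], whose induced
# predictor h_D(x) = f(x, E_D[y|x]) minimizes the loss for every distribution D.
def f_squared(x, z):
    return z                            # squared loss: predict E[y|x] itself

def f_zero_one(x, z):
    return 1.0 if z >= 0.5 else 0.0     # 0-1 loss: threshold E[y|x] at 1/2
```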
Recalling that we defined $\mathcal{L}$ as precisely all loss functions that satisfy both uniform convergence and unambiguity, we see that Theorem 4.1 follows as a direct corollary of the following two lemmas.

**Lemma 4.3.** If $L$ is unambiguous (Definition 2.2) and multi-group compatible (Definition 3.2), then $L$ is f-proper (Definition 4.2).

**Lemma 4.4.** If $L$ is f-proper (Definition 4.2) and has the uniform convergence property (Definition 2.4), then $L$ is multi-group learnable (Definition 3.3).

We note that the converse of Lemma 4.3 is also true (if a loss is f-proper, it is also multi-group compatible): this follows immediately, since labeling everyone in $\mathrm{supp}(D)$ according to $f$ is optimal, and in particular satisfies the multi-group requirement for every $\mathcal{H}$. This means that the notion of f-properness provides a partial characterization of compatibility (up to unambiguity). The full proofs of Lemmas 4.3 and 4.4 can be found in Appendices C and D, respectively. In the rest of this section we give an overview of both proofs, and end with a discussion of the implications of the equivalence.

### 4.1. Overview of Lemma 4.3

Let $L \in \mathcal{L}$ be a loss function that is multi-group compatible and unambiguous (we do not use the uniform convergence property here). We show that $L$ is also f-proper, where the function $f$ maps a singleton distribution (specified by an input $x$ and the conditional probability that the label is 1) to whatever prediction minimizes the loss on that distribution. In more detail: for an input $x \in \mathcal{X}$ and a value $z \in [0, 1]$, let $D_{x,z}$ be the singleton distribution over $\{x\} \times \{0, 1\}$, where the label is drawn from the Bernoulli distribution $\mathrm{Ber}(z)$. Let $h_{x,z}$ be the predictor that minimizes the loss on this distribution, i.e. $h_{x,z} = \arg\min_h L_{D_{x,z}}(h)$, and recall that by unambiguity, the prediction that minimizes the loss is unique, so the value $h_{x,z}(x)$ is well defined. We take $f(x, z) = h_{x,z}(x)$.

It remains to prove that $L$ is an f-proper loss function. Suppose for contradiction that it is not: i.e., there exists a distribution $D$ for which $h_D(x) = f(x, \mathbb{E}_D[y|x])$ does not minimize the loss, i.e. there exists some predictor $h^*$ s.t. $L_D(h^*) < L_D(h_D)$. We show this contradicts the multi-group compatibility of $L$. To see this, define a collection of sets that includes all the singletons in the support of $D$, as well as the global set comprised of the entire support of $D$. By unambiguity, the singletons all "want" to be labeled by $h_D$. On the other hand, the global set "wants" to be labeled by $h^*$. Whichever predictor we pick, the loss on either the global set or on one of the singletons will not be optimal. Thus, for the above collection of groups $\mathcal{G}$ (comprised of all singletons plus the global set), and for the hypothesis class $\mathcal{H}$ that includes $h_D$ and $h^*$, it is impossible to obtain predictions that are competitive with $\mathcal{H}$ for all groups in $\mathcal{G}$ simultaneously.

### 4.2. Overview of Lemma 4.4

Given a loss function $L$ that is f-proper and has the uniform convergence property, we want to construct a multi-group agnostic PAC learning algorithm that works for any (finite) hypothesis class $\mathcal{H}$ and (finite) collection of groups $\mathcal{G}$. The algorithm works by a reduction to the task of OI learning (see above): namely, we construct a collection $\mathcal{A}$ of distinguishers, and show that any predictor $\tilde p$ that is OI w.r.t. this collection can be used to derive a multi-group agnostic predictor $\tilde h$. In particular, we show that if $\tilde p$ is OI (w.r.t. $\mathcal{A}$), then $\tilde h(x) = f(x, \tilde p(x))$ is a multi-group agnostic predictor (recall that $f$ is the local transformation for the f-proper loss function $L$). The collection of distinguishers depends on the loss function $L$, on the hypothesis class $\mathcal{H}$ and on the collection of groups $\mathcal{G}$. This reduction, together with the OI learning algorithm of Theorem 2.8 (from (Dwork et al., 2020)), gives a universal multi-group agnostic learning algorithm for any f-proper loss function. The algorithm is described in Algorithm 1 below.

It remains to construct the family of distinguishers $\mathcal{A}$, and to prove the reduction. Towards this, fix a group $g \in \mathcal{G}$ and a hypothesis $h \in \mathcal{H}$. We want to guarantee that the loss of the hypothesis $\tilde h(x) = f(x, \tilde p(x))$ is competitive with the loss of $h$, where both losses are measured on the distribution $D_g$ over members of the group $g$. We begin by observing that this is true when the labels are drawn according to $\tilde p(x)$ (as in the distribution $\tilde D$). We will use OI (with an appropriately constructed distinguisher) to ensure that it is also true for the real distribution $D_g$. In more detail, since $L$ is an f-proper loss function, we have $L_{\tilde D_g}(\tilde h) \le L_{\tilde D_g}(h)$, because in $\tilde D$ the labels are indeed generated by $\tilde p$, i.e. $\tilde p(x) = \mathbb{E}_{\tilde D}[y|x]$. By uniform convergence, this remains true even if we consider the empirical loss over a (sufficiently large) i.i.d. sample from $\tilde D_g$. With this in mind, we define a distinguisher $A^k_{g,h,\alpha}$, which takes as input $k$ samples $\{(x_i, y_i, p_i)\}$ and checks whether, for the samples where $x_i \in g$, the loss obtained by predicting $f(x_i, p_i)$ for each $x_i$ is competitive with the loss obtained by $h$ on those samples (up to an additive factor of $\alpha$). By the above discussion, when the labels $y_i$ are drawn from $\mathrm{Ber}(\tilde p(x_i))$, and assuming that there are sufficiently many samples in $g$ to guarantee uniform convergence for the loss $L$, the distinguisher accepts with high probability. See Algorithm 2 below for a full description of the distinguisher.
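A Python rendering (ours) of the distinguisher just described; Algorithm 2 below gives the paper's pseudocode. We assume features are hashable and the empirical loss follows the (predictor, sample) interface sketched earlier; the vacuous accept on an empty $S_g$ is our choice.

```python
# Sketch (ours) of the distinguisher A_{g,h,alpha}: accept iff, on the input
# samples that fall in g, predicting f(x_i, p_i) is 2*alpha-competitive with
# the benchmark hypothesis h under the empirical loss L_{S_g}.
def make_distinguisher(g, h, f, L, alpha):
    """g: membership predicate; h: predictor; L: (predictor, sample) -> loss."""
    def A(triples):                              # triples: list of (x, y, p)
        in_g = [(x, y, p) for x, y, p in triples if g(x)]
        if not in_g:
            return 1                             # no samples in g: accept (our choice)
        S_g = [(x, y) for x, y, _ in in_g]
        table = {x: f(x, p) for x, _, p in in_g}
        f_g = lambda x: table.get(x, 0.0)        # the predictor f~_g of Algorithm 2
        return 1 if L(f_g, S_g) < L(h, S_g) + 2 * alpha else 0
    return A
```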
Now, if $\tilde p$ is OI w.r.t. a class that includes the distinguisher $A^k_{g,h,\alpha}$, then the distinguisher should accept with similar probabilities when the labeled examples are drawn from $D$ or from $\tilde D$ (where in both cases $p_i = \tilde p(x_i)$). I.e., $A^k_{g,h,\alpha}$ should also accept w.h.p. when the labeled examples are drawn from $D$. This can only happen if the predictor $\tilde h$ is competitive with the hypothesis $h$ w.r.t. the distribution $D_g$, which is exactly the guarantee we wanted from $\tilde h$! The class $\mathcal{A}$ of distinguishers includes such a distinguisher for each $g \in \mathcal{G}$ and $h \in \mathcal{H}$, and thus if $\tilde p$ is OI w.r.t. $\mathcal{A}$, we conclude that the loss of $\tilde h$ is competitive with every $h \in \mathcal{H}$ for every group in $\mathcal{G}$ simultaneously.

Note that the distinguishers in $\mathcal{A}$ use multiple samples, and the number of samples must be sufficiently large so that (for any sufficiently large group $g$) w.h.p. enough of them fall in $g$ to guarantee uniform convergence. The sample complexity of the learning algorithm is governed by the sample complexity of OI learning, which is logarithmic in the number of distinguishers. Since the class $\mathcal{A}$ includes $|\mathcal{G}| \cdot |\mathcal{H}|$ distinguishers, the resulting learning algorithm has sample complexity that is logarithmic in $|\mathcal{G}|$ and in $|\mathcal{H}|$. We note that we need $\mathcal{G}$ and $\mathcal{H}$ to be finite because the known OI learning algorithm works for finite collections of distinguishers.

**Algorithm 1** $\mathrm{MultiGroup}_{L,f}(\varepsilon, \delta, \gamma, \mathcal{H}, \mathcal{G})$

1. Parameters: loss function $L$, function $f : \mathcal{X} \times [0, 1] \to [0, 1]$.
2. Input: accuracy parameter $\varepsilon \in (0, 1)$, failure probability $\delta \in (0, 1)$, minimal subgroup size parameter $\gamma \in (0, 1)$, hypothesis class $\mathcal{H}$, collection of subgroups $\mathcal{G}$.
3. Output: a classifier $\tilde h$ satisfying the $(\varepsilon, \delta)$-multi-group guarantee w.r.t. $\mathcal{H}$ and $\mathcal{G}$.
4. Set $\varepsilon' = \alpha = \varepsilon / 4$ and $\delta' = \eta = \tau = \delta / 4$.
5. Set $k_G = m^{UC}_L(\varepsilon', \delta', |\mathcal{H}| + 1)$.
6. Set $k$ large enough that, with probability at least $1 - \delta'$, a sample of $k$ i.i.d. draws from $D$ contains at least $k_G$ points from every group of mass at least $\gamma$ (e.g., $k = 10 \cdot k_G / \gamma$).
7. Let $\mathcal{A} = \left\{ A^{L,f,k}_{g,h,\alpha} \,\middle|\, g \in \mathcal{G}, h \in \mathcal{H} \right\}$ be a collection of distinguishers, as defined in Algorithm 2.
8. Invoke OI as a subroutine to learn $\tilde p \leftarrow \mathrm{OI}(\tau, \eta, \mathcal{A})$.
9. Return $\tilde h = f(\cdot, \tilde p(\cdot))$.

**Algorithm 2** $A^{L,f,k}_{g,h,\alpha}$ (multi-sample Sample-Access OI distinguisher)

1. Parameters: number of samples $k \in \mathbb{N}$, group $g \subseteq \mathcal{X}$, classifier $h : \mathcal{X} \to [0, 1]$, loss function $L$, function $f : \mathcal{X} \times [0, 1] \to [0, 1]$.
2. Input: $\{(x_i, y_i, p_i)\}_{i=1}^k$, where $x_i \in \mathcal{X}$, $y_i \in \{0, 1\}$ and $p_i \in [0, 1]$.
3. Output: a binary output denoting Accept/Reject.
4. $I_g = \{i : x_i \in g\}$.
5. $S_g = \{(x_i, y_i)\}_{i \in I_g}$.
6. Define a predictor $\tilde f_g$ as
   $$\tilde f_g(x) = \begin{cases} f(x_i, p_i) & \text{if } \exists i \in [k] \text{ such that } x = x_i \\ 0 & \text{otherwise} \end{cases} \tag{2}$$
7. If $L_{S_g}(\tilde f_g) < L_{S_g}(h) + 2\alpha$, return 1.
8. Otherwise, return 0.

**Acknowledgements.** We thank Cynthia Dwork, Michael P. Kim and Omer Reingold for fruitful discussions throughout this work.

## References

Blum, A. and Lykouris, T. Advancing subgroup fairness via sleeping experts. arXiv preprint arXiv:1909.08375, 2019.

Blum, A. and Stangl, K. Recovering from biased data: Can fairness constraints improve accuracy? arXiv preprint arXiv:1912.01094, 2019.

Buja, A., Stuetzle, W., and Shen, Y. Loss functions for binary class probability estimation and classification: Structure and applications. Working draft, November 3, 2005.

Buolamwini, J. and Gebru, T. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Friedler, S. A. and Wilson, C. (eds.), Conference on Fairness, Accountability and Transparency, FAT 2018, 23–24 February 2018, New York, NY, USA, volume 81 of Proceedings of Machine Learning Research, pp. 77–91. PMLR, 2018. URL http://proceedings.mlr.press/v81/buolamwini18a.html.

Chen, I., Johansson, F. D., and Sontag, D. Why is my classifier discriminatory? In Advances in Neural Information Processing Systems, pp. 3539–3550, 2018.

Chouldechova, A. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2):153–163, 2017.
Dawid, A. P. The well-calibrated Bayesian. Journal of the American Statistical Association, 77(379):605–610, 1982.

Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226, 2012.

Dwork, C., Immorlica, N., Kalai, A. T., and Leiserson, M. Decoupled classifiers for fair and efficient machine learning. arXiv preprint arXiv:1707.06613, 2017.

Dwork, C., Kim, M. P., Reingold, O., Rothblum, G. N., and Yona, G. Learning from outcomes: Evidence-based rankings. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pp. 106–125. IEEE, 2019.

Dwork, C., Kim, M. P., Reingold, O., Rothblum, G. N., and Yona, G. Outcome indistinguishability. arXiv preprint arXiv:2011.13426, 2020.

Foster, D. P. and Vohra, R. V. Asymptotic calibration. Biometrika, 85(2):379–390, 1998.

Gupta, V., Jung, C., Noarov, G., Pai, M. M., and Roth, A. Online multivalid learning: Means, moments, and prediction intervals. arXiv preprint arXiv:2101.01739, 2021.

Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pp. 3315–3323, 2016.

Hébert-Johnson, Ú., Kim, M. P., Reingold, O., and Rothblum, G. Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning, pp. 1939–1948, 2018.

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.

Jung, C., Lee, C., Pai, M. M., Roth, A., and Vohra, R. Moment multicalibration for uncertainty estimation. arXiv preprint arXiv:2008.08037, 2020.

Kearns, M., Neel, S., Roth, A., and Wu, Z. S. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In International Conference on Machine Learning, pp. 2564–2572, 2018.

Kearns, M. J., Schapire, R. E., and Sellie, L. M. Toward efficient agnostic learning. Machine Learning, 17(2-3):115–141, 1994.

Kim, M. P., Ghorbani, A., and Zou, J. Multiaccuracy: Black-box post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 247–254, 2019.

Kleinberg, J., Mullainathan, S., and Raghavan, M. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807, 2016.

Kumar, A., Sarawagi, S., and Jain, U. Trainable calibration measures for neural networks from kernel mean embeddings. In International Conference on Machine Learning, pp. 2805–2814. PMLR, 2018.

Rothblum, G. and Yona, G. Probably approximately metric-fair learning. In International Conference on Machine Learning, pp. 5680–5688. PMLR, 2018.

Shabat, E., Cohen, L., and Mansour, Y. Sample complexity of uniform convergence for multicalibration. arXiv preprint arXiv:2005.01757, 2020.