# Fair and Optimal Classification via Post-Processing

Ruicheng Xian¹ Lang Yin¹ Han Zhao¹

¹University of Illinois Urbana-Champaign, Urbana, Illinois, USA. Correspondence to: Ruicheng Xian, Han Zhao. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

To mitigate the bias exhibited by machine learning models, fairness criteria can be integrated into the training process to ensure fair treatment across all demographics, but this often comes at the expense of model performance. Understanding such tradeoffs, therefore, underlies the design of fair algorithms. To this end, this paper provides a complete characterization of the inherent tradeoff of demographic parity on classification problems, under the most general multi-group, multi-class, and noisy setting. Specifically, we show that the minimum error rate achievable by randomized and attribute-aware fair classifiers is given by the optimal value of a Wasserstein-barycenter problem. On the practical side, our findings lead to a simple post-processing algorithm that derives fair classifiers from score functions, which yields the optimal fair classifier when the score is Bayes optimal. We provide suboptimality and sample complexity analyses for our algorithm, and demonstrate its effectiveness on benchmark datasets.

1. Introduction

Machine learning models trained on biased data have been found to perpetuate and even amplify the bias against historically underrepresented and disadvantaged demographic groups at inference time (Barocas & Selbst, 2016; Bolukbasi et al., 2016). As a result, concerns of fairness have gained significant attention, especially as applications of these models expand to high-stakes domains such as criminal justice, healthcare, and finance (Berk et al., 2021). To mitigate the bias, a variety of fairness criteria and algorithms have been proposed (Barocas et al., 2023; Caton & Haas, 2020), which impose mathematical or statistical constraints on the model to ensure equitable treatment under the respective fairness notions. But these algorithms typically incur a cost to model performance as they improve model fairness (Calders et al., 2009; Corbett-Davies et al., 2017). It is not immediately clear whether the degradation in performance is attributable to artifacts of the algorithm, or to an inherent tradeoff: predictive power that must be given up to satisfy the criteria (Hardt et al., 2016; Zhao & Gordon, 2022). Hence the design of fair algorithms necessitates an understanding of this tradeoff, which would also provide insight into the implications of fairness in machine learning; yet, it remains an open problem for most fairness criteria and learning settings.

For the group fairness criterion of demographic parity (DP; Definition 2.1), a.k.a. statistical parity, which requires statistical independence between model output and demographic group membership (Calders et al., 2009), Le Gouic et al. (2020) and Chzhen et al. (2020) concurrently characterized the tradeoff between mean squared error (MSE) and fairness on regression problems. On classification problems, the inherent tradeoff in terms of error rate has only been studied under special cases: Denis et al. (2023) assumed binary groups, Zeng et al. (2022a) and Gaucher et al. (2023) assumed binary class labels, and Zhao & Gordon (2022) assumed that the data distribution is noiseless, i.e., the Bayes error rate is zero. We will close this gap and complete the characterization of the tradeoff of DP fairness in the most general classification setting.
Contributions. This paper considers learning randomized and attribute-aware classifiers under (approximate) DP fairness in the general setting of multi-group, multi-class, and potentially noisy data distributions. We show that:

1. The minimum classification error rate under DP is given by the optimal value of a (relaxed) Wasserstein-barycenter problem (Section 3.1).
2. This characterization reveals that the optimal fair classifier, one that satisfies DP while achieving the minimum error, is given by the composition of the Bayes optimal score function (the minimum MSE regressor of the one-hot labels) and the optimal transports from the Wasserstein-barycenter problem (Section 3.2).
3. Based on these findings, we propose a post-processing method that derives fair classifiers from (pre-trained) score functions (Section 3.3). Our method is instantiated for finite-sample estimation in Section 4 with sample complexity analysis.²
4. Experiments on benchmark datasets demonstrate the effectiveness of our algorithm (Section 5), which achieves precise control of the tradeoff provided sufficient training data.

²Our code is available at https://github.com/rxian/fair-classification.

1.1. Related Work

Inherent Tradeoff. The concept of barycenter appears in many analyses of the tradeoff of DP fairness. Intuitively, by treating the barycenter computed over the distributions of optimal model outputs (without constraints) on each group as the output distribution that is required to be identical across groups under DP, the sum of distances to the barycenter is naturally related to the minimum fair error. We review existing characterizations of the tradeoff of DP fairness below and draw connections to our result.

Denote the input by $X$, group membership by $A$, and target variable by $Y$ (for classification, the one-hot label). Let $r^*_a$ be the distribution of the conditional mean on group $a$, $\mathbb{E}[Y \mid X, A = a]$, i.e., the minimum MSE estimates of $Y$ given $(X, A = a)$ (for classification, these are distributions of class probabilities). Lastly, let $w_a := \mathbb{P}(A = a)$ denote the proportion of each group. Then, under DP:

On regression problems (Le Gouic et al., 2020; Chzhen et al., 2020; Chzhen & Schreuder, 2022), the minimum excess risk in terms of MSE is given by the Wasserstein-2 barycenter (under the $\ell_2$ metric) over the $r^*_a$'s,

$$\min_{q:\, \mathrm{supp}(q) \subseteq \mathbb{R}} \sum_{a \in \mathcal{A}} w_a\, W_2^2(r^*_a, q); \tag{1}$$

on noiseless classification problems (Zhao & Gordon, 2022), the minimum/excess error rate is given by the TV-barycenter over the class priors, $p_a(e_i) := \mathbb{P}(Y = e_i \mid A = a)$,

$$\min_{q:\, \mathrm{supp}(q) \subseteq \{e_1, \dots, e_k\}} \sum_{a \in \mathcal{A}} \frac{w_a}{2}\, \|p_a - q\|_1, \tag{2}$$

where $\frac{1}{2}\|\cdot\|_1$ computes the total variation (TV); and on classification problems in the general setting (Theorem 3.2), the minimum error rate is given by the Wasserstein-1 barycenter (under the $\ell_1$ metric),

$$\min_{q:\, \mathrm{supp}(q) \subseteq \{e_1, \dots, e_k\}} \sum_{a \in \mathcal{A}} \frac{w_a}{2}\, W_1(r^*_a, q). \tag{3}$$

First, unlike regression, the support of the barycenters in Equations (2) and (3) is restricted to $\{e_1, \dots, e_k\}$, which represents the one-hot labels. Combined with the fact that the error rate is the expected $\frac{1}{2}\ell_1$ distance between the true class probabilities and the output class assignments, the minimum error rate equals the sum of $\frac{1}{2}W_1$ distances to the barycenter under the $\ell_1$ metric.
Similarly, the use of the $W_2^2$ distance under $\ell_2$ in Equation (1) reflects the MSE loss. Second, our Equation (3) recovers Equation (2) in the noiseless setting because, in that case, $r^*_a = p_a$ and $\frac{1}{2}W_1 = \frac{1}{2}\|\cdot\|_1$. Denis et al. (2023) and Gaucher et al. (2023) also derived expressions for the tradeoff similar to ours, but only under binary groups or binary class labels.

Post-Processing. Given a (biased) model, this family of mitigation algorithms post-processes the model to satisfy fairness, e.g., via remapping the outputs (Hardt et al., 2016; Pleiss et al., 2017). Existing algorithms for DP fairness include (Fish et al., 2016; Menon & Williamson, 2018; Chzhen et al., 2019; Jiang et al., 2020; Zeng et al., 2022a; Denis et al., 2023), but they are limited to binary groups and/or binary classification. For multi-group and multi-class DP, the only applicable post-processing algorithm, to our knowledge, is due to Alghamdi et al. (2022), which is based on model projection. But the tradeoff of their algorithm is unclear, as they did not directly relate the error rate to the difference between the projected model and the original, and experiments show that their algorithm underperforms compared to ours, especially on tasks involving a large number of groups and classes.

2. Preliminaries

Notation. Denote the $(k-1)$-dimensional probability simplex by $\Delta_k := \{z \in \mathbb{R}^k_{\geq 0} : \|z\|_1 = 1\}$, whose $k$ vertices are $\{e_1, \dots, e_k\}$, where $e_i \in \mathbb{R}^k$ is the vector of all zeros except for a single 1 on the $i$-th coordinate. Let $\mathcal{Q}_k$ denote the collection of distributions supported on the vertices of $\Delta_k$. We will work with randomized functions (Definition B.2), which have probabilistic outputs according to some distributions conditioned on the input. Given a (randomized) function $f : \mathcal{X} \to \mathcal{Y}$ and a distribution $p$ over $\mathcal{X}$, we denote the push-forward of $p$ by $f_\sharp p$ (Definition B.3).

Problem Setup. A $k$-class classification problem is defined by a joint distribution $\mu$ of input $X \in \mathcal{X}$, demographic group membership (a.k.a. the sensitive attribute) $A \in \mathcal{A} = [m] := \{1, \dots, m\}$, and class label in one-hot representation, $Y \in \mathcal{Y} = \{e_1, \dots, e_k\}$; the class labels may be subject to noise originating from, e.g., the data collection process. Denote the marginal distribution of the input $X$ by $\mu_X$, the conditional distribution of $\mu$ on group $A = a$ by $\mu_a$, and the group weight by $w_a := \mathbb{P}_\mu(A = a)$.

The goal of fair classification is to find a randomized and attribute-aware classifier, $h : \mathcal{X} \times \mathcal{A} \to \mathcal{Y}$, that achieves the minimum classification error rate on $\mu$ subject to the constraints set by the designated fairness criteria. Denote the component of $h$ associated with group $a$ by $h_a : \mathcal{X} \to \mathcal{Y}$, i.e., $h_a(x) \equiv h(x, a)$. The error rate is defined as

$$\mathrm{err}(h) := \mathbb{P}(h_A(X) \neq Y) = \sum_{a \in [m]} w_a\, \mathbb{P}(h_A(X) \neq Y \mid A = a) = \sum_{a \in [m]} w_a \int_{\mathcal{X} \times \mathcal{Y}} \mathbb{P}(h_a(x) \neq y)\, \mathrm{d}\mu_a(x, y),$$

where the decomposition on the last line highlights the randomness of $h$. For fairness, we consider the group criterion of demographic parity:

Definition 2.1 (Approximate Demographic Parity). For $\alpha \in [0, 1]$, a classifier $h : \mathcal{X} \times \mathcal{A} \to \mathcal{Y}$ is said to satisfy $\alpha$-DP if $\Delta_{\mathrm{DP}}(h) \leq \alpha$, where

$$\Delta_{\mathrm{DP}}(h) := \max_{a, a' \in [m]} \max_{y \in \mathcal{Y}} \big| \mathbb{P}(h_A(X) = y \mid A = a) - \mathbb{P}(h_A(X) = y \mid A = a') \big| = \max_{a, a' \in [m]} \big\| (h_a)_\sharp \mu^X_a - (h_{a'})_\sharp \mu^X_{a'} \big\|_\infty,$$

where $\mathbb{P}(h_A(X) = y \mid A = a) = \int_{\mathcal{X}} \mathbb{P}(h_a(x) = y)\, \mathrm{d}\mu^X_a(x)$ is the proportion of outputs with class assignment $y$ on group $a$, and $\|p - q\|_\infty := \max_{z \in \mathcal{Z}} |p(z) - q(z)|$ between two distributions $p, q$. We call a classifier $\alpha$-fair if it satisfies $\alpha$-DP.
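To make Definition 2.1 concrete, here is a minimal NumPy sketch (our own illustration; the function name and array conventions are assumptions, not from the paper) that estimates $\Delta_{\mathrm{DP}}$ from hard predictions by comparing per-group output histograms.

```python
import numpy as np

def dp_violation(y_pred, groups, k):
    """Empirical DP violation (Definition 2.1): the largest gap, over all
    pairs of groups and all classes, between per-group output frequencies."""
    group_ids = np.unique(groups)
    # Row a holds the empirical distribution of predicted classes in group a.
    freqs = np.stack([
        np.bincount(y_pred[groups == g], minlength=k) / np.sum(groups == g)
        for g in group_ids
    ])
    # max_{a, a'} || freqs_a - freqs_{a'} ||_inf
    return max(
        np.max(np.abs(freqs[i] - freqs[j]))
        for i in range(len(group_ids)) for j in range(i + 1, len(group_ids))
    )
```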
The parameter $\alpha$ controls the tradeoff between fairness and the maximum attainable accuracy (due to the inherent tradeoff); setting $\alpha = 0$ recovers the standard strict definition of DP.

Lastly, an (attribute-aware) score function $f : \mathcal{X} \times \mathcal{A} \to \Delta_k$ is a model that outputs probability vectors as estimates of the class probabilities, as in $f(x, a)_y \approx \mathbb{P}_{\mu_a}(Y = y \mid X = x)$. A score function is said to be Bayes optimal, denoted by $f^*$, if it computes the true class probabilities exactly, $f^*_a(x)_i := \mathbb{P}_{\mu_a}(Y = e_i \mid X = x) = \mathbb{E}_{\mu_a}[Y \mid X = x]_i$; it coincides with the minimum MSE estimator of the one-hot labels $Y$ given $(X, A = a)$. We will often work with the quantity $r^*_a := (f^*_a)_\sharp \mu^X_a$, the distribution of true class probabilities conditioned on group $a$.

Given a (pre-trained) score function $f$, our post-processing method finds a (probabilistic) fair classifier by deriving from $f$; i.e., it returns classifiers of the form $(x, a) \mapsto g_a \circ f_a(x)$ for some post-processing maps $g_1, \dots, g_m : \Delta_k \to \mathcal{Y}$.

Optimal Transport and Wasserstein Distance. Our analysis involves the concepts of optimal transport and Wasserstein distance (Villani, 2003); the latter is a metric on the space of probability distributions.

Definition 2.2 (Coupling). Let $p, q$ be probability distributions over $\mathcal{X}$ and $\mathcal{Y}$, respectively. A coupling $\gamma$ of $p, q$ is a joint distribution over $\mathcal{X} \times \mathcal{Y}$ satisfying $p(x) = \int_{\mathcal{Y}} \mathrm{d}\gamma(x, y)$ for all $x \in \mathcal{X}$, and $q(y) = \int_{\mathcal{X}} \mathrm{d}\gamma(x, y)$ for all $y \in \mathcal{Y}$. We denote the collection of couplings of $p, q$ by $\Gamma(p, q)$.

Definition 2.3 (Optimal Transport). Let $p, q$ be probability distributions over $\mathcal{X}$ and $\mathcal{Y}$, respectively, and $c : \mathcal{X} \times \mathcal{Y} \to [0, \infty)$ a cost function. The optimal transportation cost between $p$ and $q$ is given by $\inf_{\gamma \in \Gamma(p,q)} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y)\, \mathrm{d}\gamma(x, y)$. Let $\gamma^*$ be a minimizer; then the optimal transport from $p$ to $q$, denoted by $T_{p \to q, c} : \mathcal{X} \to \mathcal{Y}$, is a (randomized) function satisfying $\gamma^* = (\mathrm{Id} \otimes T_{p \to q, c})_\sharp p$, where $\mathrm{Id}$ is the identity map (the transport in the other direction is defined symmetrically).

Intuitively, $T_{p \to q, c}$ specifies a plan for moving masses distributed according to $p$ to $q$ with the minimum total cost. In this plan, the mass located at each $x \in \mathcal{X}$ is moved (probabilistically) to $T_{p \to q, c}(x) \in \mathcal{Y}$. The optimal transport can also be represented by the optimal coupling $\gamma^* \in \Gamma(p, q)$, as we can derive an optimal transport $T$ from $\gamma^*$ by setting $\mathbb{P}(T(X) = y \mid X = x) = \gamma^*(x, y) / \gamma^*(x, \mathcal{Y})$ for all $x, y$.³

³We will only consider transportation under the $\ell_1$ cost $(x, y) \mapsto \|x - y\|_1$, hence we omit the dependency of $T_{p \to q}$ on $c$.

Lastly, when $\mathcal{X} = \mathcal{Y}$ is a metric space equipped with distance $d$, the optimal transportation cost between $p$ and $q$ under $c = d$ is equal to their Wasserstein-1 distance:

Definition 2.4 (Wasserstein Distance). Let $p, q$ be probability distributions over a metric space $(\mathcal{X}, d)$, and $r \in [1, \infty]$. The Wasserstein-$r$ distance between $p$ and $q$ is $W_r(p, q) = \big( \inf_{\gamma \in \Gamma(p,q)} \int_{\mathcal{X} \times \mathcal{X}} d(x, x')^r\, \mathrm{d}\gamma(x, x') \big)^{1/r}$.

3. Fair and Optimal Classification

In this section, we provide a characterization of the inherent tradeoff of DP fairness, and then, based on the findings, propose and analyze a post-processing method for DP.

3.1. Characterization of the Inherent Tradeoff

Our characterization comes from a reformulation of the classification problem assuming access to the Bayes optimal score. On any generic (group-less) classification problem:

Lemma 3.1. Let $f^* : \mathcal{X} \to \Delta_k$ be the Bayes optimal score function, define $r^* := f^*_\sharp \mu_X$, and fix $q \in \mathcal{Q}_k$. For any (randomized) classifier $h : \mathcal{X} \to \mathcal{Y}$ satisfying $h_\sharp \mu_X = q$, there exists a coupling $\gamma \in \Gamma(r^*, q)$ such that

$$\mathrm{err}(h) = \frac{1}{2} \int_{\Delta_k \times \mathcal{Y}} \|s - y\|_1\, \mathrm{d}\gamma(s, y).$$

Conversely, for any $\gamma \in \Gamma(r^*, q)$, there exists a randomized classifier $h$ satisfying $h_\sharp \mu_X = q$ such that the above equality holds.
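As a sanity check on Definitions 2.3 and 2.4 and on Lemma 3.1, the optimal transportation cost between two finitely supported distributions is a small linear program; the sketch below (our own illustration, using scipy, with names of our choosing) computes it under the $\ell_1$ cost.

```python
import numpy as np
from scipy.optimize import linprog

def ot_cost_l1(supp_p, p, supp_q, q):
    """Optimal transport cost between discrete distributions p, q whose
    support points are the rows of supp_p, supp_q, under the l1 ground cost."""
    n, m = len(p), len(q)
    # Cost vector c[i*m + j] = || supp_p[i] - supp_q[j] ||_1.
    c = np.abs(supp_p[:, None, :] - supp_q[None, :, :]).sum(-1).ravel()
    # Coupling constraints: row sums equal p, column sums equal q.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    res = linprog(c, A_eq=A_eq, b_eq=np.concatenate([p, q]), bounds=(0, None))
    return res.fun

# W1 between a score distribution r on the 2-simplex and a vertex distribution q;
# by Lemma 3.1, the minimum error with output distribution q is W1(r, q) / 2.
supp_r = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]])
r = np.array([0.5, 0.3, 0.2])
q = np.array([0.6, 0.3, 0.1])
print(ot_cost_l1(supp_r, r, np.eye(3), q))
```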
Figure 1. A distribution $r$ over the 2-dimensional simplex (density; grey surface), and a finite distribution $q \in \mathcal{Q}_3$ over its vertices $\{e_1, e_2, e_3\}$ (blue spikes).

It follows that minimizing the classification error subject to having an output distribution of $q$ is equivalent to solving an optimal transport problem from $r^*$ to $q$ under the $\ell_1$ cost, because by Definition 2.4,

$$\min_{h:\, h_\sharp \mu_X = q} \mathrm{err}(h) = \frac{1}{2} \min_{\gamma \in \Gamma(r^*, q)} \int \|s - y\|_1\, \mathrm{d}\gamma = \frac{1}{2} W_1(r^*, q).$$

Note that this reformulation allows for explicit control of the output distribution, which is well-suited for analyzing DP fairness, since DP only constrains the output distributions $q_a := (h_a)_\sharp \mu^X_a$; namely, they need to be equal ($\alpha = 0$) or close: $\Delta_{\mathrm{DP}}(h) \leq \alpha \iff \max_{a, a'} \|q_a - q_{a'}\|_\infty \leq \alpha$ by Definition 2.1. Therefore, for attribute-aware classifiers, whose components $h_a$ can be optimized independently, the discussion above immediately gives the following characterization of the minimum error rate under DP:

Theorem 3.2 (Minimum Fair Error Rate). Let $\alpha \in [0, 1]$, $f^* : \mathcal{X} \times \mathcal{A} \to \Delta_k$ be the Bayes optimal score function, and define $r^*_a := (f^*_a)_\sharp \mu^X_a$, $a \in [m]$. With $W_1$ under the $\ell_1$ metric,

$$\mathrm{err}^*_\alpha := \min_{h:\, \Delta_{\mathrm{DP}}(h) \leq \alpha} \mathrm{err}(h) = \min_{\substack{q_1, \dots, q_m \in \mathcal{Q}_k \\ \max_{a,a'} \|q_a - q_{a'}\|_\infty \leq \alpha}} \sum_{a \in [m]} \frac{w_a}{2} W_1(r^*_a, q_a). \tag{4}$$

Equation (4) can be viewed as a (relaxed) Wasserstein-barycenter problem on the $r^*_a$'s under the special case where the support of the barycenter(s) $q_a$ is restricted to the vertices $\{e_1, \dots, e_k\}$. It is a convex problem (in the primal form presented above), and it simplifies under certain assumptions: if the problem is noiseless (and $\alpha = 0$), it reduces to the TV-barycenter problem in Equation (2) (Theorem A.1); this result was first established by Zhao & Gordon (2022), but only under $m = k = 2$.

Under strict DP fairness ($\alpha = 0$), the inherent tradeoff, namely the excess risk incurred by the DP constraint, is

$$\frac{1}{2} \Big( \min_{q} \sum_a w_a W_1(r^*_a, q) - \sum_a \min_{q_a} w_a W_1(r^*_a, q_a) \Big) \geq 0;$$

the second term is the Bayes error rate, achieved by the classifier $(x, a) \mapsto e_{\arg\max_i f^*_a(x)_i}$. The tradeoff is expected to be large on problems with very different $r^*_a$'s, and equals zero when they are identical (i.e., $\mathbb{E}_\mu[Y \mid X, A] \perp A$; all groups would then have the same optimal decision rule), meaning that enforcing DP would not degrade model performance. But we point out that the tradeoff could be zero even if $\mathbb{E}_\mu[Y \mid X, A] \not\perp A$, partly due to the non-uniqueness of the optimal classifier (Example A.4). Lastly, Zhao & Gordon (2022) concluded that in the noiseless setting, the tradeoff is zero if and only if the class priors are identical, i.e., $\mathbb{E}_\mu[Y \mid A] \perp A$. But this condition is no longer sufficient in the general setting (Example A.5).

3.2. Optimal Fair Classifier via Post-Processing

In addition to characterizing the minimum error rate under DP, we also show that the optimal fair classifier can be derived from the Bayes optimal score $f^*$:

Theorem 3.3 (Optimal Fair Classifier). Let $\alpha \in [0, 1]$, $f^* : \mathcal{X} \times \mathcal{A} \to \Delta_k$ be the Bayes optimal score function, $(q^*_1, \dots, q^*_m)$ a minimizer of Equation (4), and $T_{r^*_a \to q^*_a}$ the optimal transport from $r^*_a$ to $q^*_a$ under the $\ell_1$ cost, $a \in [m]$. We have

$$(x, a) \mapsto T_{r^*_a \to q^*_a} \circ f^*_a(x) \in \arg\min_{h:\, \Delta_{\mathrm{DP}}(h) \leq \alpha} \mathrm{err}(h).$$

This result is a consequence of the construction used in the proof of Lemma 3.1 (deferred to Appendix B), and it reveals the form of the optimal fair classifier: a composition of $f^*$ and the optimal transports from the $r^*_a$'s to the minimizing $q^*_a$'s of Equation (4) (see Figure 1 for a picture of these distributions).
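To make the composition in Theorem 3.3 concrete, the sketch below (our own; it assumes the optimal couplings have already been computed over a finite support) turns a coupling into the randomized transport of Definition 2.3 and composes it with a score function.

```python
import numpy as np

def transport_from_coupling(supp, gamma, rng=np.random.default_rng(0)):
    """Turn a coupling gamma (rows: support points of r, columns: classes)
    into a randomized transport T with
    P(T(s) = e_j) = gamma[i, j] / sum_j' gamma[i, j'] for s = supp[i]."""
    index = {tuple(np.round(s, 12)): i for i, s in enumerate(supp)}
    def T(s):
        row = gamma[index[tuple(np.round(s, 12))]]
        return rng.choice(len(row), p=row / row.sum())  # class index j, i.e. e_j
    return T

def make_fair_classifier(score, transports):
    """The optimal fair classifier of Theorem 3.3: (x, a) -> T_a(f*_a(x))."""
    return lambda x, a: transports[a](score(x, a))
```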
Theorem 3.3 immediately suggests a three-step method for learning optimal fair classifiers: (i) learn the Bayes optimal score function $f^*$, e.g., via minimizing the MSE w.r.t. the one-hot label $Y$,

$$f^* = \arg\min_{f:\, \mathcal{X} \times \mathcal{A} \to \Delta_k} \mathbb{E}_\mu\big[ \|f(X, A) - Y\|_2^2 \big],$$

(ii) find a minimizer $(q^*_1, \dots, q^*_m)$ of the barycenter problem in Equation (4), and (iii) compute the optimal transports $T_{r^*_a \to q^*_a}$, and finally return $(x, a) \mapsto T_{r^*_a \to q^*_a} \circ f^*_a(x)$. The last two steps (reproduced in Algorithm 1) post-process $f^*$.

3.3. Post-Processing Any Score Function

In practice, however, the Bayes optimal $f^*$ may not be learned exactly due to computational cost, or due to difficulties in representation, optimization, and generalization (Woodworth et al., 2017). Instead, we will often work with suboptimal scores $f \neq f^*$ (e.g., pre-trained by a vendor). This section analyzes the applicability and suboptimality of Algorithm 1 for post-processing non-Bayes optimal score functions.

Given an arbitrary score function $f : \mathcal{X} \times \mathcal{A} \to \Delta_k$, we want to find post-processing maps $g_a : \Delta_k \to \mathcal{Y}$ such that the derived classifier $(x, a) \mapsto g_a \circ f_a(x)$ satisfies DP fairness, and, ideally, achieves the minimum error rate among all fair classifiers derived from $f$.

Algorithm 1 Post-Process for $\alpha$-DP
1: Input: $\alpha \in [0, 1]$, score function $f : \mathcal{X} \times \mathcal{A} \to \Delta_k$, marginal distribution $\mu_{X,A}$ of $(X, A)$
2: Define $w_a := \mathbb{P}_\mu(A = a)$ and $r_a := (f_a)_\sharp \mu^X_a$, $a \in [m]$
3: $(q_1, \dots, q_m) \leftarrow$ minimizer of Eq. (4)
4: for $a = 1$ to $m$ do
5:   $T_{r_a \to q_a} \leftarrow$ $r_a$-to-$q_a$ optimal transport under the $\ell_1$ cost
6: end for
7: Return: $(x, a) \mapsto T_{r_a \to q_a} \circ f_a(x)$

Let $\bar{h}(x, a) := T_{r_a \to q_a} \circ f_a(x)$ denote the classifier obtained from applying Algorithm 1 to $f$. First, $\bar{h}$ is always $\alpha$-fair regardless of $f$, because $(\bar{h}_a)_\sharp \mu^X_a = q_a$, $a \in [m]$, by construction, and $\Delta_{\mathrm{DP}}(\bar{h}) = \max_{a,a'} \|q_a - q_{a'}\|_\infty \leq \alpha$ from the constraints in Equation (4). Second, the suboptimality of $\bar{h}$ can be upper bounded by the $L^1$ difference between $f$ and the Bayes optimal score $f^*$:

Theorem 3.4 (Error Propagation). Let $\alpha \in [0, 1]$, $f : \mathcal{X} \times \mathcal{A} \to \Delta_k$ be a score function, and $f^*$ the Bayes optimal score function. For the $\alpha$-fair classifier $\bar{h}$ obtained from applying Algorithm 1 to $f$,

$$0 \leq \mathrm{err}(\bar{h}) - \mathrm{err}^*_\alpha \leq \mathbb{E}\big[ \|f(X, A) - f^*(X, A)\|_1 \big],$$

where $\mathrm{err}^*_\alpha$ is defined in Equation (4).

Hence, whereas the degradation in performance of the post-processed $\bar{h}$ from Algorithm 1 when $f = f^*$ is attributed entirely to the inherent tradeoff (Theorem 3.3), this is not the case when $f \neq f^*$, due to the loss of information about $Y$. We may, however, guarantee that $\bar{h}$ is optimal among all fair classifiers derived from $f$ if $f$ satisfies group-wise distribution calibration (Kull & Flach, 2015), i.e., the output predictions correctly convey the class probabilities:

Definition 3.5. A score function $f : \mathcal{X} \times \mathcal{A} \to \Delta_k$ is said to be group-wise distribution calibrated if $\mathbb{P}_\mu(Y = e_i \mid f(X, a) = s, A = a) = s_i$ for all $s \in \Delta_k$, $i \in [k]$, $a \in [m]$.

If $f$ is not calibrated but labeled data is available, one could learn mappings $u_a : \Delta_k \to \Delta_k$ and compose them with $f$ to recalibrate it. The optimal calibration maps $u^*_a$ (in the sense that they achieve calibration without incurring further information loss) are by definition the minimum MSE estimators of $Y$ given $(f_a(X), A = a)$; a sketch of this recalibration step follows below.
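Under the assumption that per-group labeled scores are available, $u_a$ could be fit with any regressor of the one-hot labels onto the scores; below is a minimal sketch using scikit-learn's multinomial logistic regression as the calibration map (our own illustrative choice, not the paper's prescription).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_calibration_maps(scores, labels, groups):
    """Fit a per-group recalibration map u_a: simplex -> simplex by regressing
    labels on scores, approximating the minimum-MSE estimator of Y given
    (f_a(X), A = a). Assumes every class appears in each group's data."""
    maps = {}
    for a in np.unique(groups):
        mask = groups == a
        clf = LogisticRegression(max_iter=1000).fit(scores[mask], labels[mask])
        maps[a] = lambda s, clf=clf: clf.predict_proba(s.reshape(1, -1))[0]
    return maps
```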
To see why the optimality of $\bar{h}$ among all fair classifiers derived from $f$ is guaranteed under calibration, note that finding the optimal fair post-processing map for $f$ is equivalent to finding the optimal fair classifier on a new problem $\mu'$ derived from the original $\mu$ under an input transformation, given by the joint distribution of $(X' := f_A(X), A, Y)$. Also, the Bayes optimal score on $\mu'$ coincides with the optimal calibration map, as $\mathbb{E}_{\mu'}[Y \mid X' = s, A = a] = u^*_a(s)$. So by Theorem 3.3, Algorithm 1 finds post-processing maps $g^*_a$ such that $(x', a) \mapsto g^*_a \circ u^*_a(x')$ is the optimal fair classifier on $\mu'$, whereby $(x, a) \mapsto g^*_a \circ (u^*_a \circ f_a)(x)$ is optimal among all derived fair classifiers (the term in the parentheses is the recalibrated score). Finally, we remark that the suboptimality of $\bar{h}$ due to miscalibration can be bounded using Theorem 3.4 by simply replacing the reference $f^*$ with the calibrated score $(x, a) \mapsto u^*_a \circ f_a(x)$.

4. Finite Sample Estimation

We have discussed post-processing for DP assuming access to the distribution $\mu_{X,A}$. In this section, we instantiate Algorithm 1 for post-processing using finite samples:

Assumption 4.1. We have $n$ i.i.d. samples of $(X, A)$ that are independent of the score function $f$ being post-processed. Denote the samples from group $a$ by $(x_{a,i})_{i \in [n_a]}$, and their number by $n_a$. Define $\hat{w}_a := n_a / n$, and the empirical distribution $\hat{r}_a := \frac{1}{n_a} \sum_{i \in [n_a]} \delta_{f_a(x_{a,i})}$, where $\delta$ is the Dirac delta.

We also analyze the sample complexity for both the fairness and the error rate of the returned classifier. For simplicity, we assume $f$ to be calibrated; otherwise, Theorem 3.4 can be used to bound the suboptimality due to miscalibration.

Assumption 4.2. The score function $f$ being post-processed is group-wise calibrated (Definition 3.5).

Let $r_a := (f_a)_\sharp \mu^X_a$, $a \in [m]$. Recall from Section 3.3 that the error rate of the optimal derived $\alpha$-fair classifier is

$$\mathrm{err}^*_{\alpha,f} := \min_{\substack{q_1, \dots, q_m \in \mathcal{Q}_k \\ \max_{a,a'} \|q_a - q_{a'}\|_\infty \leq \alpha}} \sum_{a \in [m]} \frac{w_a}{2} W_1(r_a, q_a).$$

This section is divided into three subsections according to the continuity of the distributions of the score, $r_1, \dots, r_m$.

4.1. The Finite Case

We start with the case where the $r_a$'s have finite supports, i.e., $|R_a| < \infty$ where $R_a := \mathrm{supp}(r_a)$. Note that this does not mean that the input distribution $\mu_X$ is finite. If the true probability masses of the $r_a$'s were known, then Algorithm 1 could be implemented by a linear program:

$$\mathrm{LP}: \quad \min_{\substack{q_1, \dots, q_m \geq 0 \\ \gamma_1, \dots, \gamma_m \geq 0}} \sum_{a \in [m]} \frac{w_a}{2} \sum_{s \in R_a} \sum_{y \in \mathcal{Y}} \|s - y\|_1\, \gamma_a(s, y)$$
$$\text{s.t.} \quad \sum_{s \in R_a} \gamma_a(s, y) = q_a(y), \quad a \in [m],\ y \in \mathcal{Y},$$
$$\sum_{y \in \mathcal{Y}} \gamma_a(s, y) = r_a(s), \quad a \in [m],\ s \in R_a,$$
$$|q_a(y) - q_{a'}(y)| \leq \alpha, \quad a, a' \in [m],\ y \in \mathcal{Y},$$

where $q_a \in \Delta_k$ and $\gamma_a \in \mathbb{R}^{|R_a| \times k}$. This program simultaneously finds a minimizer $(q^*_1, \dots, q^*_m)$ of the barycenter problem in Equation (4) and the optimal transports $T_{r_a \to q^*_a}$ (used in Theorem 3.3) in the form of couplings $(\gamma^*_1, \dots, \gamma^*_m)$: namely, each $T_{r_a \to q^*_a}$ is a randomized function satisfying $\mathbb{P}(T_{r_a \to q^*_a}(R) = y \mid R = s) = \gamma^*_a(s, y) / \sum_{y' \in \mathcal{Y}} \gamma^*_a(s, y')$ for all $s \in R_a$.

Figure 2. Examples of simplex-vertex optimal transports ($k = 3$; with different vertex distributions). All points in the lower-left blue partition are transported to $e_1$, lower-right yellow to $e_2$, and upper green to $e_3$. The transports are described by a Y-shaped boundary.

If the true pmfs of the $r_a$'s are unknown but finite samples as in Assumption 4.1 are given, we proceed by solving LP defined on the empirical $\hat{w}_a$'s and $\hat{r}_a$'s, which gives us estimates $\hat{q}_a$ and $T_{\hat{r}_a \to \hat{q}_a}$. Then, we post-process $f$ via $\hat{h}(x, a) = T_{\hat{r}_a \to \hat{q}_a} \circ f_a(x)$. The sample complexity is:

Theorem 4.3 (Sample Complexity, Finite Case). Let $\alpha \in [0, 1]$, $f : \mathcal{X} \times \mathcal{A} \to \Delta_k$ be a score function, and assume $|R_a| = |\mathrm{supp}((f_a)_\sharp \mu^X_a)| < \infty$, $a \in [m]$. With probability at least $1 - \delta$ over the random draw of samples in Assumption 4.1, for the classifier $\hat{h}$ derived above, and $n \geq \Omega(\max_a \ln(m/\delta)/w_a)$,

$$\Delta_{\mathrm{DP}}(\hat{h}) \leq \alpha + O\bigg( \max_a \sqrt{\frac{|R_a| + \ln(m/\delta)}{n w_a}} \bigg);$$

in addition, with Assumption 4.2,

$$\mathrm{err}(\hat{h}) - \mathrm{err}^*_{\alpha,f} \leq O\bigg( \sum_a w_a \sqrt{\frac{|R_a| + \ln(m/\delta)}{n w_a}} \bigg).$$
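For concreteness, the LP of this subsection translates almost line by line into a modeling language; below is a minimal cvxpy sketch (our own illustration, not the authors' released implementation; `supports[a]` holds the support points of $\hat{r}_a$ as rows and `masses[a]` their empirical weights).

```python
import cvxpy as cp
import numpy as np

def solve_lp(alpha, supports, masses, w):
    """Solve LP from Section 4.1: returns per-group output distributions q_a
    and couplings gamma_a encoding the transports T_{r_a -> q_a}."""
    m, k = len(supports), supports[0].shape[1]
    E = np.eye(k)
    q = [cp.Variable(k, nonneg=True) for _ in range(m)]
    gamma = [cp.Variable((len(masses[a]), k), nonneg=True) for a in range(m)]
    cost, cons = 0, []
    for a in range(m):
        # C[s, y] = || s - e_y ||_1 over the support of r_a
        C = np.abs(supports[a][:, None, :] - E[None, :, :]).sum(-1)
        cost += w[a] / 2 * cp.sum(cp.multiply(C, gamma[a]))
        cons += [cp.sum(gamma[a], axis=0) == q[a],       # column marginals = q_a
                 cp.sum(gamma[a], axis=1) == masses[a]]  # row marginals = r_a
    for a in range(m):
        for b in range(a + 1, m):
            cons += [cp.abs(q[a] - q[b]) <= alpha]       # alpha-DP constraint
    cp.Problem(cp.Minimize(cost), cons).solve()
    return [qa.value for qa in q], [ga.value for ga in gamma]
```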
4.2. The Continuous Case

When the $r_a$'s are continuous,⁴ given finite samples, we may still solve LP defined on the $\hat{w}_a$'s and $\hat{r}_a$'s to estimate the optimal output distributions $\hat{q}_a$ under $\alpha$-DP, but the empirical transports $T_{\hat{r}_a \to \hat{q}_a}$ are no longer usable for post-processing in this case, since by continuity, the inputs to the transports at inference time will be unseen almost surely (i.e., $f_a(x) \notin f_a[(x_{a,i})_{i \in [n_a]}]$ a.s. for $x \sim \mu^X_a$); in other words, they cannot extrapolate to the full support of $r_a$.

⁴I.e., the probability measure does not give mass to sets whose intersection with $\Delta_k$ has Hausdorff dimension less than $k - 1$.

So after obtaining the $\hat{q}_a$'s, we need to estimate the optimal transports $T_{r_a \to \hat{q}_a}$ from the (population) $r_a$'s to the $\hat{q}_a$'s. Since $\mathrm{supp}(\hat{q}_a) \subseteq \{e_1, \dots, e_k\}$ is finite, finding $T_{r_a \to \hat{q}_a}$ is a semi-discrete optimal transport problem (Genevay et al., 2016; Staib et al., 2017; Chen et al., 2019), for which a common procedure in existing work is to reformulate optimal transport as a convex optimization problem over a vector $\psi_a \in \mathbb{R}^k$ using the Kantorovich-Rubinstein dual and the $c$-transform of the Kantorovich potential $\phi_a$: concretely, for each $a \in [m]$,

$$W_1(r_a, \hat{q}_a) = \inf_{\gamma_a \in \Gamma(r_a, \hat{q}_a)} \int_{\Delta_k \times \mathcal{Y}} \|s - y\|_1\, \mathrm{d}\gamma_a(s, y) = \sup_{\substack{\phi_a : \Delta_k \to \mathbb{R},\ \psi_a \in \mathbb{R}^k \\ \phi_a(s) + \psi_{a,i} \leq \|s - e_i\|_1}} \mathbb{E}_{S \sim r_a}[\phi_a(S)] + \sum_{i=1}^k \psi_{a,i}\, \hat{q}_a(e_i) = \sup_{\psi_a \in \mathbb{R}^k} \mathbb{E}_{S \sim r_a}\Big[ \min_i \big( \|S - e_i\|_1 - \psi_{a,i} \big) \Big] + \sum_{i=1}^k \psi_{a,i}\, \hat{q}_a(e_i), \tag{5}$$

and $\psi^*_a$ can be optimized, e.g., using (stochastic) gradient ascent. Moreover, Gangbo & McCann (1996) showed that in the semi-discrete case, the optimal transport $T_{r_a \to \hat{q}_a}$ belongs to the parameterized function class

$$\mathcal{G}_k := \Big\{ s \mapsto e_{\arg\min_{i \in [k]} (\|s - e_i\|_1 - \psi_i)} : \psi \in \mathbb{R}^k \Big\}$$

(break ties toward the tied $e_i$ with the largest index $i$), with $\psi^*_a$ as its parameter. See Figure 2 for pictures of semi-discrete optimal transports when $k = 3$. Compared to the empirical transports $T_{\hat{r}_a \to \hat{q}_a}$, the domain of $T_{r_a \to \hat{q}_a} \in \mathcal{G}_k$ covers the support of $r_a$, i.e., it can handle future unseen inputs.

Note that Gangbo & McCann (1996) assumed the cost function to be strictly convex and superlinear (Assumptions H1 to H3 in their paper), which is not satisfied by our $\ell_1$ cost (Ambrosio & Pratelli, 2003). Hence, as a technical contribution, we provide a proof in Appendix D of the existence and uniqueness of the optimal transport in $\mathcal{G}_k$ on simplex-vertex transportation problems under the $\ell_1$ cost, via analyzing its geometry.
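A minimal sketch of this dual approach (our own illustration, with names of our choosing): the transport in $\mathcal{G}_k$ is a one-liner given $\psi$, and $\psi$ can be fit by stochastic gradient ascent on the sampled version of Equation (5).

```python
import numpy as np

def transport(S, psi):
    """Maps in G_k: each score s goes to the vertex e_i minimizing
    ||s - e_i||_1 - psi_i (ties broken toward the largest index)."""
    k = len(psi)
    scores = np.abs(S[:, None, :] - np.eye(k)[None, :, :]).sum(-1) - psi
    # argmin with ties to the largest index: flip columns, argmin, map back
    return k - 1 - np.argmin(scores[:, ::-1], axis=1)

def fit_psi(samples, q_hat, steps=5000, lr=0.1, batch=128,
            rng=np.random.default_rng(0)):
    """SGA on the concave dual (5): ascend
    E_S[min_i(||S - e_i||_1 - psi_i)] + <psi, q_hat>; the subgradient w.r.t.
    psi_i is q_hat_i - P(transport assigns class i)."""
    k = len(q_hat)
    psi = np.zeros(k)
    for t in range(steps):
        S = samples[rng.integers(len(samples), size=batch)]
        assigned = transport(S, psi)
        grad = q_hat - np.bincount(assigned, minlength=k) / batch
        psi += lr / np.sqrt(t + 1) * grad
    return psi
```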
Our Implementation. To recap, for post-processing in the finite-sample continuous case, we (i) get the $\hat{q}_a$'s from solving LP, then (ii) estimate the $T_{r_a \to \hat{q}_a}$'s; as discussed above, the typical approach for solving this semi-discrete transportation problem is to optimize $\psi_a$ w.r.t. Equation (5) (taking the expectation over the empirical $\hat{r}_a$). However, instead of estimating from scratch, we leverage two observations: each of the empirical transports $T_{\hat{r}_a \to \hat{q}_a}$ obtained (as byproducts) from solving LP already achieves the optimal value $W_1(\hat{r}_a, \hat{q}_a)$ of Equation (5), and its decision boundary is described by just $\binom{k}{2} = \Theta(k^2)$ hyperplanes (recall Figure 2). We can therefore directly extract a set of transport mappings $T_a \in \mathcal{G}_k$ from the $T_{\hat{r}_a \to \hat{q}_a}$'s.

Algorithm 2 Post-Process for $\alpha$-DP (Finite Samples, Continuous Case)
1: Input: $\alpha \in [0, 1]$, score function $f : \mathcal{X} \times \mathcal{A} \to \Delta_k$, samples $((x_{a,i})_{i \in [n_a]})_{a \in [m]}$
2: Define $\hat{w}_a := \frac{n_a}{n}$ and $\hat{r}_a := \frac{1}{n_a} \sum_i \delta_{f_a(x_{a,i})}$, $a \in [m]$
3: $(\gamma_1, \dots, \gamma_m) \leftarrow$ minimizer of LP on the $\hat{w}_a$'s and $\hat{r}_a$'s
4: Define $v_{ij} := e_j - e_i$
5: for $a = 1$ to $m$ do
6:   $B_{a,ij} \leftarrow \max\big( \{0\} \cup \{ f_a(x_{a,\ell})^\top v_{ij} + 1 : \ell \text{ s.t. } \gamma_a(f_a(x_{a,\ell}), e_i) > 0 \} \big)$
7:   $z_a \leftarrow$ point in $\bigcap_{i \neq j} \{ x \in \mathbb{R}^k : x^\top v_{ij} \geq B_{a,ij} - 1 \}$
8:   $\psi_{a,i} \leftarrow 2 z_a^\top v_{i1}$, $i \in [k]$
9:   $T_a \leftarrow \big( s \mapsto e_{\arg\min_i (\|s - e_i\|_1 - \psi_{a,i})} \big)$
10: end for
11: Return: $(x, a) \mapsto T_a \circ f_a(x)$

Figure 3. Extracting a simplex-vertex transport $T \in \mathcal{G}_3$ that agrees with the discrete optimal transport (Lines 6 to 9 of Algorithm 2; panels: get boundaries from the points and their discrete transport assignments, find a point in the convex polytope, compute the parameters of the simplex-vertex transport mapping). For illustrative purposes, the discrete transport shown does not split mass; otherwise, there would be disagreements on the boundaries.

The procedure for the extraction is given on Lines 6 to 9 of Algorithm 2 (with a step-by-step illustration in Figure 3; formal derivations are deferred to Appendix D), and it amounts to finding a feasible point in a polytope, formulated as a linear program (Appendix E.3). Each extracted $T_a$ agrees with $T_{\hat{r}_a \to \hat{q}_a}$ on all points in $\hat{r}_a$ except those that lie on the $\Theta(k^2)$ boundaries, and by continuity of $r_a$, the number of disagreements between them is $O(k^2)$ almost surely.

Our implementation in Algorithm 2 involves $(m + 1)$ linear programs, of which LP dominates with $O(nk)$ variables and constraints, and takes, e.g., $\tilde{O}(\mathrm{poly}(nk))$ time to solve to (near-)optimality using interior point methods (Vaidya, 1989). Its sample complexity is:

Theorem 4.4 (Sample Complexity, Continuous Case). Let $\alpha \in [0, 1]$, $f : \mathcal{X} \times \mathcal{A} \to \Delta_k$ be a score function, and assume that $(f_a)_\sharp \mu^X_a$ is continuous, $a \in [m]$. With probability at least $1 - \delta$ over the random draw of samples in Assumption 4.1, for the classifier $\hat{h}$ obtained from applying Algorithm 2 to $f$, and $n \geq \Omega(\max_a \ln(m/\delta)/w_a)$,

$$\Delta_{\mathrm{DP}}(\hat{h}) \leq \alpha + O\bigg( \max_a \bigg( \sqrt{\frac{k + \ln(mk/\delta)}{n w_a}} + \frac{k}{n w_a} \bigg) \bigg);$$

in addition, with Assumption 4.2,

$$\mathrm{err}(\hat{h}) - \mathrm{err}^*_{\alpha,f} \leq O\bigg( \sum_a w_a \bigg( \sqrt{\frac{k + \ln(mk/\delta)}{n w_a}} + \frac{k}{n w_a} \bigg) \bigg).$$

The first term in both expressions is the sample complexity of PAC learning (with the complexity of $\mathcal{G}_k$ analyzed in Theorem D.2), and the second term comes from the disagreements between $T_a$ and $T_{\hat{r}_a \to \hat{q}_a}$ on $\hat{r}_a$ discussed above.

4.3. The General Case

For completeness, we briefly discuss the general case where the $r_a$'s are neither finite nor continuous (i.e., contain atoms), which is handled by smoothing the $r_a$'s using an i.i.d. noise generator $\rho$ with a continuous distribution.⁴ The smoothing is done by perturbing (samples from) $r_a$ with random noise drawn from $\rho$, i.e., $\tilde{r}_a := (u_\rho)_\sharp r_a$, where $u_\rho$ is a randomized function such that $u_\rho(s) \sim s + N$ with $N \sim \rho$; it is not hard to show that the resulting $\tilde{r}_a$ is continuous. Now that $u_\rho \circ f_a$ produces continuous score distributions, we may apply the same algorithm from Section 4.2 for DP post-processing. The resulting classifier, $\bar{h}_\rho(x, a) := T_{\tilde{r}_a \to q_a} \circ u_\rho \circ f_a(x)$, is $\alpha$-fair regardless of the choice of $\rho$, and the suboptimality incurred by the smoothing procedure is controlled by the bandwidth of $\rho$:

Theorem 4.5 (Error Propagation, Smoothing). Let $\alpha \in [0, 1]$, $f : \mathcal{X} \times \mathcal{A} \to \Delta_k$ be a score function, and $\rho$ a continuous distribution with finite first moment. Under Assumption 4.2, for the $\alpha$-fair classifier $\bar{h}_\rho$ derived above,

$$0 \leq \mathrm{err}(\bar{h}_\rho) - \mathrm{err}^*_{\alpha,f} \leq \mathbb{E}_{N \sim \rho}[\|N\|_1].$$

E.g., if $\rho = \mathrm{Laplace}(0, b I_k)$, then $\mathbb{E}[\|N\|_1] = kb$, and the suboptimality due to smoothing is at most $kb$; in practice, the smallest allowable $b$ depends on machine precision.
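A minimal sketch of the smoothing step (our own; Laplace noise is the example the paper gives):

```python
import numpy as np

def smooth_scores(S, b, rng=np.random.default_rng(0)):
    """Perturb score samples with i.i.d. Laplace(0, b) noise, i.e., the
    randomized map u_rho of Section 4.3; E||N||_1 = k*b bounds the added
    suboptimality (Theorem 4.5). Smoothed scores may fall slightly outside
    the simplex, but the transports in G_k are well defined on all of R^k."""
    return S + rng.laplace(scale=b, size=S.shape)
```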
Figure 4. Tradeoff curves between accuracy and DP (Definition 2.1) on ACSIncome (2 genders, 2 classes; and 5 races, 5 classes) and Bias Bios (2 genders, 28 occupations): classification accuracy vs. violation of demographic parity ($\Delta_{\mathrm{DP}}$), comparing Ours, Fair Projection-KL, and Fair Projection-CE. The scoring model is logistic regression. Error bars indicate the standard deviation over 10 runs with different random splits. Running time is reported in appendix Table 1. On ACSIncome, Fair Projection-KL and Fair Projection-CE have similar results.

5. Experiments

Our proposed DP post-processing Algorithm 2 is evaluated on four benchmark datasets: the UCI Adult dataset for income prediction (Kohavi, 1996); the ACSIncome dataset (Ding et al., 2021), an extension of Adult with many more examples (1.6 million vs. 48,842), on which we consider a binary setting where the sensitive attribute is gender and the target is whether the income is over $50k, as well as a multi-group multi-class setting with five race categories and five income buckets; the Communities & Crime dataset (Redmond & Baveja, 2002); and the Bias Bios dataset (De-Arteaga et al., 2019), where the task is to predict occupations from biographies. We highlight the effectiveness of our algorithm by comparing it to Fair Projection (Alghamdi et al., 2022), which is to our knowledge the only other fair post-processing algorithm for multi-group and multi-class classification.

On each dataset, we split the data into pre-training, post-processing, and testing splits. We first train a linear logistic regression scoring model on the pre-training split, then perform DP post-processing. On Bias Bios, the model is trained on embeddings of biographies computed by a pre-trained BERT language model from the bert-base-uncased checkpoint (Devlin et al., 2019). Additional details, including hyperparameters, are in Appendix E.

Results. The results on ACSIncome and Bias Bios are shown in Figure 4 (those on Adult and Communities & Crime are deferred to appendix Figure 7). Across all tasks, our method is effective at reducing the disparity, and almost precise control of DP via $\alpha$ is achieved on tasks with sufficient data. Compared to Fair Projection, our method can achieve lower DP and produces better tradeoff curves, and its advantage is most evident in multi-class settings (e.g., Bias Bios).

While our method achieves a significant degree of DP fairness, there remains a gap to $\Delta_{\mathrm{DP}} = 0$ in our results, especially on tasks with more groups and classes (e.g., ACSIncome under the multi-group multi-class setting). This could be due to a potential violation of the continuity assumption required by Algorithm 2 (Section 4.2), but we suspect the main reason to be insufficient data. Recall from Theorem 4.4 that the sample complexity scales as $\tilde{O}(\sqrt{k/(n w_a)})$ in the worst-case $a$, where $w_a$ is the proportion of group $a$. This means that good generalization performance hinges on collecting adequate amounts of data from minority groups. Lastly, as discussed in Section 3.3 (although not empirically explored here), higher accuracies could be achieved if the scores were calibrated prior to post-processing.

6. Further Related Work

Fairness Criteria. This paper focused on the group criterion of demographic parity, which is defined via population-level statistics (Verma & Rubin, 2018; Kearns et al., 2018). Other group criteria include parity of true positive and/or negative rates (Hardt et al., 2016), predictive rates (Chouldechova, 2017; Berk et al., 2021; Zeng et al., 2022b), accuracy (Buolamwini & Gebru, 2018), etc.
Besides group-level criteria, there are criteria defined at the individual level (Dwork et al., 2012; Kearns et al., 2019), which require the model to output similar predictions for individuals deemed similar under application- and context-specific measures designed by the practitioner.

Mitigation Methods. In addition to post-processing, there are data pre-processing methods (Calmon et al., 2017), as well as in-processing ones via constrained optimization (Kamishima et al., 2012; Zafar et al., 2017; Agarwal et al., 2019) or fair representation learning (Zemel et al., 2013; Madras et al., 2018; Zhao et al., 2020; Song et al., 2019). There are also methods for other learning settings and paradigms, such as unsupervised learning (Chierichetti et al., 2017; Backurs et al., 2019; Li et al., 2020), ranking (Zehlike et al., 2017), and sequential decision making (Joseph et al., 2016; 2017; Gillen et al., 2018; Chi et al., 2022).

7. Conclusion

In this paper, we characterized the inherent tradeoff of DP fairness on classification problems in the most general setting, and proposed an effective post-processing method with suboptimality and sample complexity analyses. Our implementation uses linear program solvers; while they enjoy stability and consistent performance, a potential concern is scalability to larger numbers of classes or samples. It would be of practical value to analyze an implementation that uses more time-efficient optimization methods, e.g., gradient descent (Staib et al., 2017).

Technically, we studied the geometry of the optimal transport between distributions supported on the simplex and its vertices. A main result is that when the distributions are semi-discrete, the optimal transport is unique and is given by the $c$-transform of the Kantorovich potential, including under the $\ell_1$ cost, which is neither strictly convex nor superlinear. This result may be of independent theoretical interest to the community.

Our results add to the line of work that studies the inherent tradeoffs of fairness criteria, which we believe will benefit practitioners in the design of fair machine learning systems and contribute to a better understanding of the implications of fairness in machine learning.

Acknowledgements

The authors thank Jane Du, Yuzheng Hu, and Seiyun Shin for feedback on the draft. The work of HZ was supported in part by the Defense Advanced Research Projects Agency (DARPA) under Cooperative Agreement Number HR00112320012, a Facebook Research Award, and Amazon AWS Cloud Credit.

References

Abid, A., Farooqi, M., and Zou, J. Persistent Anti-Muslim Bias in Large Language Models. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pp. 298–306, 2021.

Agarwal, A., Dudík, M., and Wu, Z. S. Fair Regression: Quantitative Definitions and Reduction-Based Algorithms. In Proceedings of the 36th International Conference on Machine Learning, pp. 120–129, 2019.

Alghamdi, W., Hsu, H., Jeong, H., Wang, H., Michalak, P. W., Asoodeh, S., and Calmon, F. P. Beyond Adult and COMPAS: Fair Multi-Class Prediction via Information Projection. In Advances in Neural Information Processing Systems, 2022.

Ambrosio, L. and Pratelli, A. Existence and stability results in the L1 theory of optimal transportation. In Optimal Transportation and Applications, volume 1813 of Lecture Notes in Mathematics, pp. 123–160. Springer, 2003.
Backurs, A., Indyk, P., Onak, K., Schieber, B., Vakilian, A., and Wagner, T. Scalable Fair Clustering. In Proceedings of the 36th International Conference on Machine Learning, pp. 405–413, 2019.

Barocas, S. and Selbst, A. D. Big Data's Disparate Impact. California Law Review, 104(3):671–732, 2016.

Barocas, S., Hardt, M., and Narayanan, A. Fairness and Machine Learning: Limitations and Opportunities. The MIT Press, 2023.

Bauer, U., Kerber, M., Roll, F., and Rolle, A. A unified view on the functorial nerve theorem and its variations. Expositiones Mathematicae, 2023.

Berk, R., Heidari, H., Jabbari, S., Kearns, M., and Roth, A. Fairness in Criminal Justice Risk Assessments: The State of the Art. Sociological Methods & Research, 50(1):3–44, 2021.

Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., and Kalai, A. Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. In Advances in Neural Information Processing Systems, 2016.

Buolamwini, J. and Gebru, T. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Proceedings of the 2018 Conference on Fairness, Accountability, and Transparency, pp. 77–91, 2018.

Calders, T., Kamiran, F., and Pechenizkiy, M. Building Classifiers with Independency Constraints. In 2009 IEEE International Conference on Data Mining Workshops, pp. 13–18, 2009.

Calmon, F. P., Wei, D., Vinzamuri, B., Ramamurthy, K. N., and Varshney, K. R. Optimized Pre-Processing for Discrimination Prevention. In Advances in Neural Information Processing Systems, 2017.

Caton, S. and Haas, C. Fairness in Machine Learning: A Survey, 2020. arXiv:2010.04053 [cs.LG].

Chen, Y., Telgarsky, M., Zhang, C., Bailey, B., Hsu, D., and Peng, J. A Gradual, Semi-Discrete Approach to Generative Network Training via Explicit Wasserstein Minimization. In Proceedings of the 36th International Conference on Machine Learning, pp. 1071–1080, 2019.

Chi, J., Shen, J., Dai, X., Zhang, W., Tian, Y., and Zhao, H. Towards Return Parity in Markov Decision Processes. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, pp. 1161–1178, 2022.

Chierichetti, F., Kumar, R., Lattanzi, S., and Vassilvitskii, S. Fair Clustering Through Fairlets. In Advances in Neural Information Processing Systems, 2017.

Chouldechova, A. Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments. Big Data, 5(2):153–163, 2017.

Chzhen, E. and Schreuder, N. A minimax framework for quantifying risk-fairness trade-off in regression. The Annals of Statistics, 50(4):2416–2442, 2022.

Chzhen, E., Denis, C., Hebiri, M., Oneto, L., and Pontil, M. Leveraging Labeled and Unlabeled Data for Consistent Fair Binary Classification. In Advances in Neural Information Processing Systems, 2019.

Chzhen, E., Denis, C., Hebiri, M., Oneto, L., and Pontil, M. Fair Regression with Wasserstein Barycenters. In Advances in Neural Information Processing Systems, 2020.

Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., and Huq, A. Algorithmic Decision Making and the Cost of Fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 797–806, 2017.

De-Arteaga, M., Romanov, A., Wallach, H., Chayes, J., Borgs, C., Chouldechova, A., Geyik, S., Kenthapadi, K., and Kalai, A. T. Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting. In Proceedings of the 2019 ACM Conference on Fairness, Accountability, and Transparency, pp. 120–128, 2019.
Denis, C., Elie, R., Hebiri, M., and Hu, F. Fairness guarantee in multi-class classification, 2023. arXiv:2109.13642 [math.ST].

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 1, pp. 4171–4186, 2019.

Diamond, S. and Boyd, S. CVXPY: A Python-Embedded Modeling Language for Convex Optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.

Ding, F., Hardt, M., Miller, J., and Schmidt, L. Retiring Adult: New Datasets for Fair Machine Learning. In Advances in Neural Information Processing Systems, 2021.

Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness Through Awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226, 2012.

Fish, B., Kun, J., and Lelkes, Á. D. A Confidence-Based Approach for Balancing Fairness and Accuracy. In Proceedings of the 2016 SIAM International Conference on Data Mining, pp. 144–152, 2016.

Forrest, J., Ralphs, T., Santos, H. G., Vigerske, S., Forrest, J., Hafer, L., Kristjansson, B., jpfasano, EdwinStraver, Lubin, M., Jan-Willem, rlougee, jpgoncal1, Brito, S., h-i-gassmann, Cristina, Saltzman, M., tosttost, Pitrus, B., Matsushima, F., and to-st. coin-or/Cbc: Release releases/2.10.10, April 2023. URL https://doi.org/10.5281/zenodo.7843975.

Gangbo, W. and McCann, R. J. The geometry of optimal transportation. Acta Mathematica, 177(2):113–161, 1996.

Gaucher, S., Schreuder, N., and Chzhen, E. Fair learning with Wasserstein barycenters for non-decomposable performance measures. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics, pp. 2436–2459, 2023.

Genevay, A., Cuturi, M., Peyré, G., and Bach, F. Stochastic Optimization for Large-scale Optimal Transport. In Advances in Neural Information Processing Systems, 2016.

Gillen, S., Jung, C., Kearns, M., and Roth, A. Online Learning with an Unknown Fairness Metric. In Advances in Neural Information Processing Systems, 2018.

Hardt, M., Price, E., and Srebro, N. Equality of Opportunity in Supervised Learning. In Advances in Neural Information Processing Systems, 2016.

Jiang, R., Pacchiano, A., Stepleton, T., Jiang, H., and Chiappa, S. Wasserstein Fair Classification. In Proceedings of the 35th Uncertainty in Artificial Intelligence Conference, pp. 862–872, 2020.

Joseph, M., Kearns, M., Morgenstern, J., and Roth, A. Fairness in Learning: Classic and Contextual Bandits. In Advances in Neural Information Processing Systems, 2016.

Joseph, M., Kearns, M., Morgenstern, J., Neel, S., and Roth, A. Fair Algorithms for Infinite and Contextual Bandits, 2017. arXiv:1610.09559 [cs.LG].

Kamishima, T., Akaho, S., Asoh, H., and Sakuma, J. Fairness-Aware Classifier with Prejudice Remover Regularizer. In Machine Learning and Knowledge Discovery in Databases, pp. 35–50, 2012.

Kearns, M., Neel, S., Roth, A., and Wu, Z. S. Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness. In Proceedings of the 35th International Conference on Machine Learning, pp. 2564–2572, 2018.

Kearns, M., Roth, A., and Sharifi-Malvajerdi, S. Average Individual Fairness: Algorithms, Generalization and Experiments. In Advances in Neural Information Processing Systems, 2019.
Kohavi, R. Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 202–207, 1996.

Kull, M. and Flach, P. Novel Decompositions of Proper Scoring Rules for Classification: Score Adjustment as Precursor to Calibration. In Machine Learning and Knowledge Discovery in Databases, pp. 68–85, 2015.

Le Gouic, T., Loubes, J.-M., and Rigollet, P. Projection to Fairness in Statistical Learning, 2020. arXiv:2005.11720 [cs.LG].

Li, P., Zhao, H., and Liu, H. Deep Fair Clustering for Visual Learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9067–9076, 2020.

Madras, D., Creager, E., Pitassi, T., and Zemel, R. Learning Adversarially Fair and Transferable Representations. In Proceedings of the 35th International Conference on Machine Learning, pp. 3384–3393, 2018.

Menon, A. K. and Williamson, R. C. The cost of fairness in binary classification. In Proceedings of the 2018 Conference on Fairness, Accountability, and Transparency, pp. 107–118, 2018.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. The MIT Press, second edition, 2018.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, É. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(85):2825–2830, 2011.

Peyré, G. and Cuturi, M. Computational Optimal Transport: With Applications to Data Science. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J., and Weinberger, K. Q. On Fairness and Calibration. In Advances in Neural Information Processing Systems, 2017.

Ravfogel, S., Elazar, Y., Gonen, H., Twiton, M., and Goldberg, Y. Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7237–7256, 2020.

Redmond, M. and Baveja, A. A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research, 141(3):660–678, 2002.

Shalev-Shwartz, S. and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Song, J., Kalluri, P., Grover, A., Zhao, S., and Ermon, S. Learning Controllable Fair Representations. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, pp. 2164–2173, 2019.

Spanier, E. H. Algebraic Topology. Springer, 1981.

Staib, M., Claici, S., Solomon, J., and Jegelka, S. Parallel Streaming Wasserstein Barycenters. In Advances in Neural Information Processing Systems, 2017.

Vaidya, P. M. Speeding-up linear programming using fast matrix multiplication. In 30th Annual Symposium on Foundations of Computer Science, pp. 332–337, 1989.

Verma, S. and Rubin, J. Fairness Definitions Explained. In 2018 IEEE/ACM International Workshop on Software Fairness, 2018.

Villani, C. Topics in Optimal Transportation. American Mathematical Society, 2003.

Weed, J. and Bach, F. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli, 25(4A):2620–2648, 2019.

Weissman, T., Ordentlich, E., Seroussi, G., Verdu, S., and Weinberger, M. J. Inequalities for the L1 Deviation of the Empirical Distribution. Technical Report HPL-2003-97R1, Hewlett-Packard Laboratories, 2003.
Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, 2020.

Woodworth, B., Gunasekar, S., Ohannessian, M. I., and Srebro, N. Learning Non-Discriminatory Predictors. In Proceedings of the 2017 Conference on Learning Theory, pp. 1920–1953, 2017.

Zafar, M. B., Valera, I., Rogriguez, M. G., and Gummadi, K. P. Fairness Constraints: Mechanisms for Fair Classification. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pp. 962–970, 2017.

Zehlike, M., Bonchi, F., Castillo, C., Hajian, S., Megahed, M., and Baeza-Yates, R. FA*IR: A Fair Top-k Ranking Algorithm. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1569–1578, 2017.

Zemel, R., Wu, Y. L., Swersky, K., Pitassi, T., and Dwork, C. Learning Fair Representations. In Proceedings of the 30th International Conference on Machine Learning, pp. 325–333, 2013.

Zeng, X., Dobriban, E., and Cheng, G. Bayes-Optimal Classifiers under Group Fairness, 2022a. arXiv:2202.09724 [stat.ML].

Zeng, X., Dobriban, E., and Cheng, G. Fair Bayes-Optimal Classifiers Under Predictive Parity. In Advances in Neural Information Processing Systems, 2022b.

Zhao, H. and Gordon, G. J. Inherent Tradeoffs in Learning Fair Representations. Journal of Machine Learning Research, 23(57):1–26, 2022.

Zhao, H., Coston, A., Adel, T., and Gordon, G. J. Conditional Learning of Fair Representations. In International Conference on Learning Representations, 2020.

Zhao, J., Wang, T., Yatskar, M., Ordonez, V., and Chang, K.-W. Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, volume 2, pp. 15–20, 2018.

A. Additional Discussions in Section 3.1

This section discusses the reduction of Theorem 3.2 to a TV-barycenter in the noiseless setting, and the examples in Section 3.1.

A.1. Reduction to TV-Barycenter

When the classification problem $\mu$ is noiseless, i.e., the (unconstrained) Bayes error rate is zero, $\min_h \mathrm{err}(h) = 0$, we show that Theorem 3.2 reduces to a relaxed TV-barycenter problem. For strict DP ($\alpha = 0$), this result was previously established by Zhao & Gordon (2022) for the binary case $m = k = 2$. Our reduction here holds in the general case. Note that noiselessness means that there exists a deterministic labeling function $y^* : \mathcal{X} \times \mathcal{A} \to \mathcal{Y}$ such that $Y = y^*(X, A)$ (almost surely). Therefore, the Bayes optimal score is $f^* = y^*$, and we have $r^*_a = p_a := (f^*_a)_\sharp \mu^X_a$, where $p_a(e_i) = \mathbb{P}_\mu(Y = e_i \mid A = a)$.

Theorem A.1 (Minimum Fair Error Rate, Noiseless Setting). Let $\alpha \in [0, 1]$, and suppose $\mu$ is noiseless. We have

$$\min_{h:\, \Delta_{\mathrm{DP}}(h) \leq \alpha} \mathrm{err}(h) = \min_{\substack{q_1, \dots, q_m \in \mathcal{Q}_k \\ \max_{a,a'} \|q_a - q_{a'}\|_\infty \leq \alpha}} \sum_{a \in [m]} \frac{w_a}{2} W_1(r^*_a, q_a) = \min_{\substack{q_1, \dots, q_m \in \mathcal{Q}_k \\ \max_{a,a'} \|q_a - q_{a'}\|_\infty \leq \alpha}} \sum_{a \in [m]} \frac{w_a}{2} \|p_a - q_a\|_1.$$

This is due to $\mathrm{supp}(p_a) \subseteq \{e_1, \dots, e_k\}$ sharing the same finite support with any $q_a \in \mathcal{Q}_k$, whereby $\frac{1}{2} W_1(r^*_a, q_a) = \frac{1}{2} W_1(p_a, q_a) = \frac{1}{2} \|p_a - q_a\|_1$. Specifically, recall the fact that $W_1$ under the 0-1 distance equals the total variation distance:
Proposition A.2. Let $p, q$ be probability measures on $\mathcal{X}$ with metric $d(x, y) = \mathbb{1}[x \neq y]$, where $\mathbb{1}[\cdot]$ denotes the indicator function. We have $W_1(p, q) = \frac{1}{2} \|p - q\|_1$.

Proof. By definition, under the metric $d(x, y) = \mathbb{1}[x \neq y]$,

$$W_1(p, q) = \inf_{\gamma \in \Gamma(p,q)} \int_{\mathcal{X} \times \mathcal{X}} \mathbb{1}[x \neq y]\, \mathrm{d}\gamma(x, y) = 1 - \sup_{\gamma \in \Gamma(p,q)} \int_{\mathcal{X} \times \mathcal{X}} \mathbb{1}[x = y]\, \mathrm{d}\gamma(x, y) = 1 - \int_{\mathcal{X}} \min(p(x), q(x))\, \mathrm{d}x = \int_{\mathcal{X}} \big( p(x) - \min(p(x), q(x)) \big)\, \mathrm{d}x = \int_{\mathcal{X}} \max(0, p(x) - q(x))\, \mathrm{d}x = \frac{1}{2} \int_{\mathcal{X}} |p(x) - q(x)|\, \mathrm{d}x,$$

where the third equality is due to $\gamma(x, x) \leq \min(p(x), q(x))$ for all $\gamma \in \Gamma(p, q)$, plus there always exists a coupling s.t. $\gamma(x, x) = \min(p(x), q(x))$, and the final equality to $\int_{\mathcal{X}} (p(x) - q(x))\, \mathrm{d}x = 0$.

Proof of Theorem A.1. Because $\mathrm{supp}(r^*_a) = \mathrm{supp}(p_a) \subseteq \{e_1, \dots, e_k\}$, the $\ell_1$ distance (in $W_1$) between any $s \in \mathrm{supp}(r^*_a)$ and $y \in \{e_1, \dots, e_k\}$ simplifies to $\|s - y\|_1 = 2 \cdot \mathbb{1}[s \neq y]$. The result then follows from Proposition A.2.

Moreover, in this case, we have a closed-form solution for the optimal fair classifier in Theorem 3.3:

Theorem A.3 (Optimal Fair Classifier, Noiseless Setting). Let $\alpha \in [0, 1]$, suppose $\mu$ is noiseless, let $y^* : \mathcal{X} \times \mathcal{A} \to \mathcal{Y}$ be the ground-truth labeling function, and $(q^*_1, \dots, q^*_m)$ a minimizer of Equation (4). Define

$$d_a(e_i) := \frac{\max(0,\, q^*_a(e_i) - p_a(e_i))}{\sum_{j \in [k]} \max(0,\, q^*_a(e_j) - p_a(e_j))}, \qquad s_a(e_i) := \frac{\max(0,\, p_a(e_i) - q^*_a(e_i))}{p_a(e_i)}, \qquad a \in [m],\ i \in [k],$$

and randomized functions $T_a : \mathcal{Y} \to \mathcal{Y}$ for each $a \in [m]$ satisfying

$$T_a(e_i) = \begin{cases} e_i & \text{w.p. } s_a(e_i) d_a(e_i) + (1 - s_a(e_i)), \\ e_j & \text{w.p. } s_a(e_i) d_a(e_j), \quad j \in [k],\ j \neq i. \end{cases}$$

We have $(x, a) \mapsto T_a \circ y^*_a(x) \in \arg\min_{h:\, \Delta_{\mathrm{DP}}(h) \leq \alpha} \mathrm{err}(h)$.

Proof. We first verify that $(T_a \circ y^*_a)_\sharp \mu^X_a = q^*_a$, which implies $\Delta_{\mathrm{DP}}(h) \leq \alpha$ by the constraints in Equation (4). For all $e_i \in \mathcal{Y}$,

$$\mathbb{P}(T_a \circ y^*_a(X) = e_i \mid A = a) = p_a(e_i)\big( s_a(e_i) d_a(e_i) + (1 - s_a(e_i)) \big) + d_a(e_i) \sum_{j \neq i} p_a(e_j) s_a(e_j) = p_a(e_i)(1 - s_a(e_i)) + d_a(e_i) \sum_{j \in [k]} p_a(e_j) s_a(e_j) = p_a(e_i) - \max(0,\, p_a(e_i) - q^*_a(e_i)) + \max(0,\, q^*_a(e_i) - p_a(e_i)) = q^*_a(e_i),$$

where the third equality is because $\sum_i (p_a(e_i) - q^*_a(e_i)) = 0$ implies $\sum_i \max(0,\, p_a(e_i) - q^*_a(e_i)) = \sum_i \max(0,\, q^*_a(e_i) - p_a(e_i))$ (noting $p_a(e_j) s_a(e_j) = \max(0,\, p_a(e_j) - q^*_a(e_j))$), and the last line follows from case analysis.

Next, we compute the error rate on group $a$. By construction, the accuracy conditioned on $Y = y^*_a(X) = e_i$ is

$$\mathbb{P}(T_a \circ y^*_a(X) = e_i \mid A = a, Y = e_i) = \begin{cases} 1 - s_a(e_i) & \text{if } d_a(e_i) = 0, \\ 1 = 1 - s_a(e_i) & \text{if } d_a(e_i) > 0 \text{ (whence } s_a(e_i) = 0\text{)}, \end{cases}$$

so the error rate is

$$\mathbb{P}(T_a \circ y^*_a(X) \neq Y \mid A = a) = \sum_{i \in [k]} p_a(e_i) \big( 1 - \mathbb{P}(T_a \circ y^*_a(X) = e_i \mid A = a, Y = e_i) \big) = \sum_{i \in [k]} p_a(e_i) s_a(e_i) = \sum_{i \in [k]} \max(0,\, p_a(e_i) - q^*_a(e_i)) = \frac{1}{2} \sum_{i \in [k]} |p_a(e_i) - q^*_a(e_i)|.$$

We conclude by combining the errors on all groups and invoking Theorem A.1.
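Under the assumption that the class priors $p_a$ and a minimizer $q^*_a$ are available as arrays, the closed-form randomized map $T_a$ of Theorem A.3 can be sampled directly; a minimal sketch (our own, for one group):

```python
import numpy as np

def make_noiseless_transport(p, q, rng=np.random.default_rng(0)):
    """Randomized map T_a of Theorem A.3: given class priors p and target
    output distribution q (length-k arrays), relabel class i w.p. s[i] to a
    class drawn from d, and keep it otherwise."""
    surplus = np.maximum(0.0, p - q)   # mass to move away from each class
    deficit = np.maximum(0.0, q - p)   # mass missing at each class
    s = np.divide(surplus, p, out=np.zeros_like(p), where=p > 0)
    d = deficit / deficit.sum() if deficit.sum() > 0 else np.zeros_like(q)
    def T(i):
        if rng.random() < s[i]:
            return rng.choice(len(q), p=d)  # redistribute to deficit classes
        return i
    return T
```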
A.2. Examples Regarding the Tradeoff of DP Fairness

In the remarks on Theorem 3.2, we discussed properties of the inherent tradeoff of the error rate for DP fairness, which we illustrate here with two concrete examples.

It was discussed that the tradeoff could be zero even when the distributions of class probabilities $r^*_a := (f^*_a)_\sharp \mu^X_a$ differ, or equivalently, $\mathbb{E}_\mu[Y \mid X, A] \not\perp A$. This means that on certain problem instances, the Bayes error rate is simultaneously achieved by an unfair classifier and a fair one; in other words, the cost of DP fairness is zero. Such cases arise from the non-uniqueness of the optimal classifier. They would not occur on regression problems (with MSE), where the optimal regressor is always unique (namely, $f^*_a(x) = \mathbb{E}_\mu[Y \mid X = x, A = a]$).

Example A.4. Consider the two-group binary classification problem given by $\mathbb{P}_{\mu_1}(Y = e_1 \mid X = x) = 1$ and $\mathbb{P}_{\mu_2}(Y = e_1 \mid X = x) = \mathbb{P}_{\mu_2}(Y = e_2 \mid X = x) = \frac{1}{2}$ for all $x \in \mathcal{X}$. The optimal classifier on group 1 is the constant function $x \mapsto e_1$, and all classifiers on group 2 yield the same (hence optimal) error rate of $\frac{1}{2}$, including $x \mapsto e_1$, which when combined with the optimal group 1 classifier achieves DP and the (group-balanced) Bayes error rate of $\frac{1}{4}$.

In (Zhao & Gordon, 2022), it is concluded that in the noiseless setting, the inherent tradeoff is zero if and only if the class prior distributions are the same, where $p_a(e_i) = \mathbb{P}_\mu(Y = e_i \mid A = a)$ and $r^*_a = p_a := (f^*_a)_\sharp \mu^X_a$ in this case, or equivalently $\mathbb{E}_\mu[Y \mid A] \perp A$. However, in the general case, this is no longer sufficient for the tradeoff to be zero (a sufficient condition here is $\mathbb{E}_\mu[Y \mid X, A] \perp A$).

Example A.5. Consider the two-group binary classification problem with two inputs, $\mathcal{X} = \{1, 2\}$, given by

$$\mathbb{P}_{\mu_1}(Y = e_1 \mid X = 1) = 1, \quad \mathbb{P}_{\mu_1}(Y = e_1 \mid X = 2) = 0, \quad \mathbb{P}_{\mu_1}(Y = e_2 \mid X = 1) = 0, \quad \mathbb{P}_{\mu_1}(Y = e_2 \mid X = 2) = 1,$$

with $\mathbb{P}_{\mu_1}(X = 1) = \frac{1}{3}$, $\mathbb{P}_{\mu_1}(X = 2) = \frac{2}{3}$, and $\mathbb{P}_{\mu_2}(Y = e_1 \mid X = x) = \frac{1}{3}$ for all $x \in \{1, 2\}$. Note that the class prior on both groups is $(\frac{1}{3}, \frac{2}{3})$. The unique optimal classifier on group 1 is $x \mapsto e_x$, and the unique optimal classifier on group 2 is the constant $x \mapsto e_2$, but this combination does not satisfy DP, since the output distribution on group 1 is $(\frac{1}{3}, \frac{2}{3})$ but that on group 2 is $(0, 1)$. Since all other classifiers, including the fair ones, have strictly higher error rates, the tradeoff is nonzero.

B. Proofs for Section 3

To make our arguments rigorous, we begin by providing a definition of randomized functions via Markov kernels. These definitions will be frequently referred to in the proofs in this section, and in that of Theorem 4.5.

Definition B.1 (Markov Kernel). A Markov kernel from a measurable space $(\mathcal{X}, \mathcal{S})$ to $(\mathcal{Y}, \mathcal{T})$ is a mapping $K : \mathcal{X} \times \mathcal{T} \to [0, 1]$ such that $K(\cdot, T)$ is $\mathcal{S}$-measurable for all $T \in \mathcal{T}$, and $K(x, \cdot)$ is a probability measure on $(\mathcal{Y}, \mathcal{T})$ for all $x \in \mathcal{X}$.

Definition B.2 (Randomized Function). A randomized function $f : (\mathcal{X}, \mathcal{S}) \to (\mathcal{Y}, \mathcal{T})$ is associated with a Markov kernel $K : \mathcal{X} \times \mathcal{T} \to [0, 1]$, and for all $x \in \mathcal{X}$, $T \in \mathcal{T}$, $\mathbb{P}(f(x) \in T) = K(x, T)$.

Definition B.3 (Push-Forward by Randomized Function). Let $p$ be a measure on $(\mathcal{X}, \mathcal{S})$ and $f : (\mathcal{X}, \mathcal{S}) \to (\mathcal{Y}, \mathcal{T})$ a randomized function with Markov kernel $K$. The push-forward of $p$ under $f$, denoted by $f_\sharp p$, is a measure on $\mathcal{Y}$ given by $f_\sharp p(T) = \int_{\mathcal{X}} K(x, T)\, \mathrm{d}p(x)$ for all $T \in \mathcal{T}$.

Also, let the blackboard-bold $\mathbb{1}$ denote the indicator function, where $\mathbb{1}[P] = 1$ if the predicate $P$ is true, and 0 otherwise.

We now provide the proofs of Lemma 3.1 and Theorems 3.2 to 3.4. These proofs, and those of the results in Section 4, all make use of the following rewriting of the error rate as an integral over a coupling:

Lemma B.4. Let $f^* : \mathcal{X} \to \Delta_k$ be the Bayes optimal score function. The error rate of any randomized classifier $h : \mathcal{X} \to \mathcal{Y}$ can be written as

$$\mathrm{err}(h) = \frac{1}{2} \int_{\Delta_k \times \mathcal{Y}} \|s - y\|_1\, \mathbb{P}(f^*(X) = s, h(X) = y)\, \mathrm{d}(s, y) = \frac{1}{2}\, \mathbb{E}\big[ \|f^*(X) - h(X)\|_1 \big].$$

Note that the joint distribution $\mathbb{P}$ of $(f^*(X), h(X))$ is a coupling belonging to $\Gamma(f^*_\sharp \mu_X, h_\sharp \mu_X)$.

Proof. The accuracy of $h$ is

$$1 - \mathrm{err}(h) = \mathbb{P}(h(X) = Y) = \sum_{i \in [k]} \int \mathbb{P}(Y = e_i, h(X) = e_i, f^*(X) = s)\, \mathrm{d}s = \sum_{i \in [k]} \int \mathbb{P}(Y = e_i, h(X) = e_i \mid f^*(X) = s)\, \mathbb{P}_\mu(f^*(X) = s)\, \mathrm{d}s = \sum_{i \in [k]} \int \mathbb{P}_\mu(Y = e_i \mid f^*(X) = s)\, \mathbb{P}(h(X) = e_i \mid f^*(X) = s)\, \mathbb{P}_\mu(f^*(X) = s)\, \mathrm{d}s = \sum_{i \in [k]} \int s_i\, \mathbb{P}(f^*(X) = s, h(X) = e_i)\, \mathrm{d}s,$$

where the factorization in the fourth equality follows from $h(X) \perp Y$ given $f^*(X)$, since $f^*(X) = \mathbb{E}_\mu[Y \mid X]$ fully specifies the pmf of $Y$ conditioned on $X$.
Next, because $\mathbb{P}(f^\star(X) = \cdot, h(X) = \cdot)$ is a probability measure,
$$\begin{aligned}
\operatorname{err}(h) &= \int \sum_{i \in [k]} (1 - s_i) \,\mathbb{P}(f^\star(X) = s, h(X) = e_i) \,\mathrm{d}s \\
&= \frac{1}{2} \int \sum_{i \in [k]} \|s - e_i\|_1 \,\mathbb{P}(f^\star(X) = s, h(X) = e_i) \,\mathrm{d}s \\
&= \frac{1}{2} \int_{\Delta_k \times \mathcal{Y}} \|s - y\|_1 \,\mathbb{P}(f^\star(X) = s, h(X) = y) \,\mathrm{d}(s, y) = \frac{1}{2} \mathbb{E}[\|f^\star(X) - h(X)\|_1],
\end{aligned}$$
where the second equality is due to the identity stated in Equation (8).

Lemma B.5 (Full Version of Lemma 3.1). Let $f^\star : \mathcal{X} \to \Delta_k$ be the Bayes optimal score function, define $r^\star := f^\star_\sharp \mu^X$, and fix $q \in \mathcal{Q}_k$. For any randomized classifier $h : \mathcal{X} \to \mathcal{Y}$ with Markov kernel $K$ satisfying $h_\sharp \mu^X = q$, the coupling $\gamma \in \Gamma(r^\star, q)$ given by
$$\gamma(s, y) = \int_{f^{\star-1}(s)} K(x, y) \,\mathrm{d}\mu^X(x), \quad \text{where } f^{\star-1}(s) := \{x \in \mathcal{X} : f^\star(x) = s\},$$
satisfies
$$\operatorname{err}(h) = \frac{1}{2} \int_{\Delta_k \times \mathcal{Y}} \|s - y\|_1 \,\mathrm{d}\gamma(s, y). \tag{6}$$
Conversely, for any $\gamma \in \Gamma(r^\star, q)$, the randomized classifier $h$ with Markov kernel $K(x, T) = \gamma(f^\star(x), T) / \gamma(f^\star(x), \mathcal{Y})$ satisfies $h_\sharp \mu^X = q$ and Equation (6).

Proof. We begin with the first direction. Let a randomized classifier $h$ with Markov kernel $K$ satisfying $h_\sharp \mu^X = q$ be given. We verify that the coupling constructed above belongs to $\Gamma(r^\star, q)$:
$$\begin{aligned}
\int_{\mathcal{Y}} \gamma(s, y) \,\mathrm{d}y &= \int_{\mathcal{Y}} \int_{f^{\star-1}(s)} K(x, y) \,\mathrm{d}\mu^X(x) \,\mathrm{d}y \\
&= \int_{f^{\star-1}(s)} \int_{\mathcal{Y}} K(x, y) \,\mathrm{d}y \,\mathrm{d}\mu^X(x) \\
&= \int_{f^{\star-1}(s)} \mathrm{d}\mu^X(x) \\
&= \mathbb{P}_{\mu^X}(f^\star(X) = s) = r^\star(s),
\end{aligned}$$
where line 3 follows from Definition B.1 of Markov kernels, and line 5 from the definition of push-forward measures; and
$$\begin{aligned}
\int_{\Delta_k} \gamma(s, y) \,\mathrm{d}s &= \int_{\Delta_k} \int_{f^{\star-1}(s)} K(x, y) \,\mathrm{d}\mu^X(x) \,\mathrm{d}s \\
&= \int_{\mathcal{X}} K(x, y) \,\mathrm{d}\mu^X(x) \\
&= \int_{\mathcal{X}} \mathbb{P}(h(X) = y \mid X = x) \,\mathrm{d}\mu^X(x) \\
&= \mathbb{P}(h(X) = y) = q(y),
\end{aligned}$$
where line 3 follows from Definition B.2 of randomized functions, and line 5 is by assumption. Next, by Lemma B.4 and the same arguments above,
$$\begin{aligned}
\operatorname{err}(h) &= \frac{1}{2} \int_{\Delta_k \times \mathcal{Y}} \|s - y\|_1 \,\mathbb{P}(f^\star(X) = s, h(X) = y) \,\mathrm{d}(s, y) \\
&= \frac{1}{2} \int_{\Delta_k \times \mathcal{Y}} \|s - y\|_1 \int_x \mathbb{P}(f^\star(X) = s, h(X) = y, X = x) \,\mathrm{d}x \,\mathrm{d}(s, y) \\
&= \frac{1}{2} \int_{\Delta_k \times \mathcal{Y}} \|s - y\|_1 \int_{f^{\star-1}(s)} \mathbb{P}(h(X) = y, X = x) \,\mathrm{d}x \,\mathrm{d}(s, y) \\
&= \frac{1}{2} \int_{\Delta_k \times \mathcal{Y}} \|s - y\|_1 \int_{f^{\star-1}(s)} \mathbb{P}(h(X) = y \mid X = x) \,\mathrm{d}\mu^X(x) \,\mathrm{d}(s, y) \\
&= \frac{1}{2} \int_{\Delta_k \times \mathcal{Y}} \|s - y\|_1 \,\gamma(s, y) \,\mathrm{d}(s, y)
\end{aligned}$$
as desired, where line 3 is due to $\mathbb{P}(f^\star(X) = s, h(X) = y, X = x) = \mathbb{1}[f^\star(x) = s] \,\mathbb{P}(h(X) = y, X = x)$ for all $(s, y, x)$.

For the converse, let a coupling $\gamma \in \Gamma(r^\star, q)$ be given. We show that the Markov kernel of the randomized classifier $h$ constructed in the statement satisfies the equality $\gamma(s, y) = \int_{f^{\star-1}(s)} K(x, y) \,\mathrm{d}\mu^X(x)$; Equation (6) will then follow directly from the same arguments used in the previous part. Let $s \in \Delta_k$ and $y \in \mathcal{Y}$, and $x' \in f^{\star-1}(s)$ be arbitrary, then
$$\gamma(s, y) = \frac{\gamma(s, y)}{\gamma(s, \mathcal{Y})} \gamma(s, \mathcal{Y}) = \frac{\gamma(f^\star(x'), y)}{\gamma(f^\star(x'), \mathcal{Y})} \gamma(s, \mathcal{Y}) = K(x', y) \,\gamma(s, \mathcal{Y}) = K(x', y) \,r^\star(s) = K(x', y) \int_{x \in f^{\star-1}(s)} \mathrm{d}\mu^X(x) = \int_{x \in f^{\star-1}(s)} K(x', y) \,\mathrm{d}\mu^X(x) = \int_{x \in f^{\star-1}(s)} K(x, y) \,\mathrm{d}\mu^X(x),$$
where line 3 is by construction of $K$, line 4 from $\gamma \in \Gamma(r^\star, q)$, and the last line is because $K(x, y)$ is constant for all $x \in f^{\star-1}(s)$, also by construction.

Proof of Theorem 3.2. Lemma 3.1 implies that for each $a \in [m]$ and fixed $q_a \in \mathcal{Q}_k$, the minimum error rate on group $a$ among randomized classifiers $h_a : \mathcal{X} \to \mathcal{Y}$ whose output distribution equals $q_a$ is given by
$$\min_{h_a :\, h_{a\sharp} \mu^X_a = q_a} \operatorname{err}_a(h_a) := \min_{h_a :\, h_{a\sharp} \mu^X_a = q_a} \mathbb{P}(h_a(X) \neq Y \mid A = a) = \frac{1}{2} W_1(r^\star_a, q_a).$$
Because of attribute-awareness, we can optimize each component of $h : \mathcal{X} \times \mathcal{A} \to \mathcal{Y}$, $h(\cdot, a) =: h_a$ for each $a \in [m]$, independently. So for any set of fixed $q_1, \ldots, q_m \in \mathcal{Q}_k$,
$$\min_{h :\, h_{a\sharp} \mu^X_a = q_a, \forall a} \operatorname{err}(h) = \sum_{a \in [m]} \min_{h_a :\, h_{a\sharp} \mu^X_a = q_a} w_a \operatorname{err}_a(h_a) = \sum_{a \in [m]} \frac{w_a}{2} W_1(r^\star_a, q_a).$$
Incorporating the $\alpha$-DP constraint, we get
$$\min_{h :\, \Delta_{\mathrm{DP}}(h) \leq \alpha} \operatorname{err}(h) = \min_{\substack{q_1, \ldots, q_m \in \mathcal{Q}_k \\ \max_{a, a'} \|q_a - q_{a'}\|_\infty \leq \alpha}}\ \min_{h :\, h_{a\sharp} \mu^X_a = q_a, \forall a} \operatorname{err}(h) = \min_{\substack{q_1, \ldots, q_m \in \mathcal{Q}_k \\ \max_{a, a'} \|q_a - q_{a'}\|_\infty \leq \alpha}} \sum_{a \in [m]} \frac{w_a}{2} W_1(r^\star_a, q_a).$$

Proof of Theorem 3.3. By construction, the Markov kernel of the randomized optimal fair classifier $\bar h^\star(x, a) := T_{r^\star_a \to q^\star_a} \circ f^\star_a(x)$ is
$$K((x, a), y) = \frac{\gamma^\star_a(f^\star_a(x), y)}{\gamma^\star_a(f^\star_a(x), \mathcal{Y})},$$
where $\gamma^\star_a \in \Gamma(r^\star_a, q^\star_a)$ is the optimal transport between $r^\star_a$ and $q^\star_a$.
We verify that the output distributions of $\bar h^\star$ equal $q^\star_1, \ldots, q^\star_m$ (whereby it is $\alpha$-DP, because the $q^\star_a$'s satisfy the constraint in Equation (4)), and that its error rate achieves the minimum in Theorem 3.2. First, for all $y \in \mathcal{Y}$,
$$\begin{aligned}
\mathbb{P}(\bar h^\star(X, A) = y \mid A = a) &= \int_{\mathcal{X}} \mathbb{P}(\bar h^\star(x, a) = y) \,\mathbb{P}_\mu(X = x \mid A = a) \,\mathrm{d}x = \int_{\mathcal{X}} K((x, a), y) \,\mathrm{d}\mu^X_a(x) \\
&= \int_{\mathcal{X}} \frac{\gamma^\star_a(f^\star_a(x), y)}{\gamma^\star_a(f^\star_a(x), \mathcal{Y})} \,\mathrm{d}\mu^X_a(x) = \int_{\Delta_k} \frac{\gamma^\star_a(s, y)}{\gamma^\star_a(s, \mathcal{Y})} \int_{f^{\star-1}_a(s)} \mathrm{d}\mu^X_a(x) \,\mathrm{d}s \\
&= \int_{\Delta_k} \frac{\gamma^\star_a(s, y)}{\gamma^\star_a(s, \mathcal{Y})} r^\star_a(s) \,\mathrm{d}s = \int_{\Delta_k} \gamma^\star_a(s, y) \,\mathrm{d}s = \gamma^\star_a(\Delta_k, y) = q^\star_a(y),
\end{aligned}$$
where line 3 is due to $\gamma^\star_a(f^\star_a(x), y) = \gamma^\star_a(s, y)$ being constant for all $x \in f^{\star-1}_a(s)$. Similarly, Lemma B.5 implies that the error rate on group $a$ is
$$\operatorname{err}_a(\bar h^\star) = \frac{1}{2} \int_{\Delta_k \times \mathcal{Y}} \|s - y\|_1 \,\mathrm{d}\gamma_a(s, y),$$
where $\gamma_a \in \Gamma(r^\star_a, q^\star_a)$ equals
$$\gamma_a(s, y) = \int_{f^{\star-1}_a(s)} K((x, a), y) \,\mathrm{d}\mu^X_a(x) = \int_{f^{\star-1}_a(s)} \frac{\gamma^\star_a(f^\star_a(x), y)}{\gamma^\star_a(f^\star_a(x), \mathcal{Y})} \,\mathrm{d}\mu^X_a(x) = \frac{\gamma^\star_a(s, y)}{\gamma^\star_a(s, \mathcal{Y})} \int_{f^{\star-1}_a(s)} \mathrm{d}\mu^X_a(x) = \gamma^\star_a(s, y).$$
So $\operatorname{err}_a(\bar h^\star) = \frac{1}{2} \int_{\Delta_k \times \mathcal{Y}} \|s - y\|_1 \,\mathrm{d}\gamma^\star_a(s, y) = \frac{1}{2} W_1(r^\star_a, q^\star_a)$ because $\gamma^\star_a$ is an optimal transport between $r^\star_a$ and $q^\star_a$, and $\operatorname{err}(\bar h^\star) = \sum_{a \in [m]} w_a \operatorname{err}_a(\bar h^\star) = \sum_{a \in [m]} \frac{w_a}{2} W_1(r^\star_a, q^\star_a)$, the minimum error rate under $\alpha$-DP.

Proof of Theorem 3.4. Recall that the $\alpha$-fair classifier $\bar h$ returned from Algorithm 1 is $\bar h(x, a) = T_{r_a \to q_a} \circ f_a(x)$, where $(q_1, \ldots, q_m)$ is a minimizer of Equation (4) on $r_a := f_{a\sharp} \mu^X_a$ and $\alpha$, and $T_{r_a \to q_a}$ is the optimal transport from $r_a$ to $q_a$. Denote the $L^1$ difference between $f$ and $f^\star$ by $E := \mathbb{E}_{\mu^X}[\|f_A(X) - f^\star_A(X)\|_1]$. For the upper bound, by Lemma B.4 and the triangle inequality,
$$\begin{aligned}
\operatorname{err}(\bar h) &= \frac{1}{2} \sum_{a \in [m]} w_a \,\mathbb{E}\big[\|T_{r_a \to q_a} \circ f_a(X) - f^\star_a(X)\|_1 \mid A = a\big] \\
&\leq \frac{1}{2} \sum_{a \in [m]} w_a \left( \mathbb{E}[\|T_{r_a \to q_a} \circ f_a(X) - f_a(X)\|_1 \mid A = a] + \mathbb{E}[\|f_a(X) - f^\star_a(X)\|_1 \mid A = a] \right) \\
&= \sum_{a \in [m]} \frac{w_a}{2} W_1(r_a, q_a) + \frac{E}{2} = \min_{\substack{q'_1, \ldots, q'_m \in \mathcal{Q}_k \\ \max_{a, a'} \|q'_a - q'_{a'}\|_\infty \leq \alpha}} \sum_{a \in [m]} \frac{w_a}{2} W_1(r_a, q'_a) + \frac{E}{2},
\end{aligned}$$
where line 3 is because $T_{r_a \to q_a}$ is the optimal transport from $r_a$ to $q_a$ under the $\ell_1$ cost, and line 4 is because $(q_1, \ldots, q_m)$ is a minimizer. Let $(q^\star_1, \ldots, q^\star_m)$ denote a minimizer of Equation (4) on the distributions of Bayes scores $r^\star_a := f^\star_{a\sharp} \mu^X_a$ and $\alpha$. Then by Theorem 3.2,
$$\begin{aligned}
\operatorname{err}(\bar h) - \operatorname{err}^\star_\alpha &\leq \frac{E}{2} + \min_{\substack{q'_1, \ldots, q'_m \in \mathcal{Q}_k \\ \max_{a, a'} \|q'_a - q'_{a'}\|_\infty \leq \alpha}} \sum_{a \in [m]} \frac{w_a}{2} W_1(r_a, q'_a) - \sum_{a \in [m]} \frac{w_a}{2} W_1(r^\star_a, q^\star_a) \\
&\leq \frac{E}{2} + \sum_{a \in [m]} \frac{w_a}{2} W_1(r_a, q^\star_a) - \sum_{a \in [m]} \frac{w_a}{2} W_1(r^\star_a, q^\star_a) \\
&\leq \frac{E}{2} + \sum_{a \in [m]} \frac{w_a}{2} W_1(r_a, r^\star_a) \leq \frac{E}{2} + \frac{1}{2} \sum_{a \in [m]} w_a \,\mathbb{E}[\|f_a(X) - f^\star_a(X)\|_1 \mid A = a] = E,
\end{aligned}$$
where the third inequality uses the triangle inequality for $W_1$, and the last is because for each $a \in [m]$, $W_1(r_a, r^\star_a)$ is upper bounded by the transportation cost under the coupling given by the joint distribution of $(f_a(X), f^\star_a(X))$ conditioned on $A = a$: denoting the coupling by $\pi_a$, clearly $\pi_a \in \Gamma(r_a, r^\star_a)$, and $\int_{\Delta_k \times \Delta_k} \|s - s'\|_1 \,\mathrm{d}\pi_a(s, s') = \int_{\mathcal{X}} \|f_a(x) - f^\star_a(x)\|_1 \,\mathrm{d}\mu^X_a(x) = \mathbb{E}[\|f_a(X) - f^\star_a(X)\|_1 \mid A = a]$.

For the lower bound, again by Lemma B.4,
$$\operatorname{err}(\bar h) = \sum_{a \in [m]} \frac{w_a}{2} \int_{\Delta_k \times \mathcal{Y}} \|s - y\|_1 \,\mathbb{P}(f^\star_a(X) = s, T_{r_a \to q_a} \circ f_a(X) = y \mid A = a) \,\mathrm{d}(s, y) \geq \sum_{a \in [m]} \frac{w_a}{2} W_1(r^\star_a, q_a) \geq \min_{\substack{q'_1, \ldots, q'_m \in \mathcal{Q}_k \\ \max_{a, a'} \|q'_a - q'_{a'}\|_\infty \leq \alpha}} \sum_{a \in [m]} \frac{w_a}{2} W_1(r^\star_a, q'_a) = \operatorname{err}^\star_\alpha,$$
where the first inequality is because the joint distribution of $(f^\star_a(X), T_{r_a \to q_a} \circ f_a(X))$ conditioned on $A = a$ is a coupling belonging to $\Gamma(r^\star_a, q_a)$, so the transportation cost in the preceding line upper bounds $W_1(r^\star_a, q_a)$.

C. Proofs for Section 4

We establish the sample complexities in Theorems 4.3 and 4.4 and the error propagation bound for smoothing (Theorem 4.5), in that order. We remark that Assumption 4.2 of group-wise calibration of the scores can be dropped by adding the error propagation of Theorem 3.4 to the results. The sample complexity for the general-case procedure described in Section 4.3 via smoothing can also be obtained by combining Theorems 4.4 and 4.5 (provided that the supports of the distributions after smoothing are contained in the simplex).
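The object that all of these reductions bottom out in is the relaxed barycenter problem of Equation (4), which becomes an ordinary linear program once the score distributions are finitely supported. The following sketch is our own illustration of such a solver using cvxpy (the interface our experiments also use, see Appendix E.2); it is not the released implementation, the function name `barycenter_lp` is ours, and the DP constraint is encoded coordinatewise in the $\ell_\infty$ sense used throughout this appendix.

```python
# A minimal sketch (ours, not the released code) of the linear program behind
# Equation (4) for finitely supported score distributions, written with cvxpy.
import cvxpy as cp
import numpy as np

def barycenter_lp(supports, r_hats, weights, alpha):
    """supports[a]: (n_a, k) array of score values in the simplex;
    r_hats[a]: (n_a,) empirical probabilities; weights[a]: group prior w_a;
    alpha: DP tolerance. Returns the output distributions q_a and couplings."""
    m = len(supports)
    k = supports[0].shape[1]
    E = np.eye(k)                                 # one-hot vertices e_1, ..., e_k
    gammas = [cp.Variable((len(r), k), nonneg=True) for r in r_hats]
    qs = [cp.sum(g, axis=0) for g in gammas]      # q_a(y) = sum_s gamma_a(s, y)
    cost = 0
    constraints = []
    for a in range(m):
        # ell_1 transport costs ||s - e_y||_1 between support points and vertices
        C = np.abs(supports[a][:, None, :] - E[None, :, :]).sum(axis=2)
        cost += weights[a] / 2 * cp.sum(cp.multiply(C, gammas[a]))
        constraints.append(cp.sum(gammas[a], axis=1) == r_hats[a])  # marginal r_a
    for a in range(m):                            # relaxed DP constraint on the q_a's
        for b in range(a + 1, m):
            constraints += [qs[a] - qs[b] <= alpha, qs[b] - qs[a] <= alpha]
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return [q.value for q in qs], [g.value for g in gammas]
```

An optimal transport map can then be extracted from each returned coupling, as on Lines 6 to 9 of Algorithm 2.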
The generalization bound in the finite case uses an $\ell_1$ (TV) convergence result for empirical distributions, which follows directly from the concentration of multinoulli random variables (Weissman et al., 2003):

Theorem C.1. Let $p \in \Delta_d$, $d \geq 2$, and $\hat p_n \sim \frac{1}{n} \operatorname{Multinomial}(n, p)$. W.p. at least $1 - \delta$, $\|p - \hat p_n\|_1 \leq \sqrt{2d \ln(2/\delta)/n}$.

Corollary C.2. Let $p$ be a distribution over $\mathcal{X}$ with finite support, and $x_1, \ldots, x_n \sim p$ be i.i.d. samples. Define the empirical distribution $\hat p_n := \frac{1}{n} \sum_{i=1}^n \delta_{x_i}$. W.p. at least $1 - \delta$ over the random draw of the samples, $\|p - \hat p_n\|_1 \leq \sqrt{2|\mathcal{X}| \ln(2/\delta)/n}$.

Theorem C.3 (Hoeffding's Inequality). Let $x_1, \ldots, x_n \in \mathbb{R}$ be i.i.d. random variables s.t. $a_i \leq x_i \leq b_i$ almost surely. W.p. at least $1 - \delta$, $\big|\frac{1}{n} \sum_{i=1}^n (x_i - \mathbb{E} x_i)\big| \leq \sqrt{\sum_{i=1}^n (b_i - a_i)^2 / (2n^2) \cdot \ln(2/\delta)}$.

Proof of Theorem 4.3. We first bound the error rate, then DP.

Error Rate. Consider the classification problem $\mu'$ derived from the original $\mu$ under the input transformation given by the joint distribution of $(f_A(X), A, Y)$, as discussed in Section 3.3, on which $\operatorname{Id}$ is the Bayes optimal score due to calibration of the $f$'s. Then by Lemma B.4 applied to $\mu'$,
$$\begin{aligned}
\operatorname{err}(\hat h) &= \frac{1}{2} \sum_{a \in [m]} w_a \sum_{s \in R_a} \sum_{y \in \mathcal{Y}} \|s - y\|_1 \,\mathbb{P}(\operatorname{Id}(X') = s, T_{\hat r_a \to \hat q_a}(X') = y) \\
&= \frac{1}{2} \sum_{a \in [m]} w_a \sum_{s \in R_a} \sum_{y \in \mathcal{Y}} \|s - y\|_1 \,r_a(s) \,\mathbb{P}(T_{\hat r_a \to \hat q_a}(s) = y) \\
&\leq \frac{1}{2} \sum_{a \in [m]} w_a \sum_{s \in R_a} \sum_{y \in \mathcal{Y}} \|s - y\|_1 \,(\hat r_a(s) + |r_a(s) - \hat r_a(s)|) \,\mathbb{P}(T_{\hat r_a \to \hat q_a}(s) = y) \\
&\leq \sum_{a \in [m]} w_a \left( \frac{1}{2} W_1(\hat r_a, \hat q_a) + \sum_{s \in R_a} |r_a(s) - \hat r_a(s)| \right),
\end{aligned}$$
where line 4 uses the fact that each $T_{\hat r_a \to \hat q_a}$ is an optimal transport from $\hat r_a$ to $\hat q_a$, and that $\|s - y\|_1 \leq 2$. Let $(q_1, \ldots, q_m)$ denote a minimizer of Equation (4) on the $r_a$'s with $\alpha$; then
$$\begin{aligned}
\operatorname{err}(\hat h) - \operatorname{err}^\star_{\alpha, f} &\leq \sum_{a \in [m]} w_a \,\frac{1}{2} (W_1(\hat r_a, \hat q_a) - W_1(r_a, q_a)) + \sum_{a \in [m]} w_a \sum_{s \in R_a} |r_a(s) - \hat r_a(s)| \\
&\leq O\Big( \sum_{a \in [m]} \hat w_a \,\tfrac{1}{2} (W_1(\hat r_a, \hat q_a) - W_1(r_a, q_a)) \Big) + \sum_{a \in [m]} w_a \sum_{s \in R_a} |r_a(s) - \hat r_a(s)| \\
&\leq O\Big( \sum_{a \in [m]} \hat w_a \,\tfrac{1}{2} (W_1(\hat r_a, q_a) - W_1(r_a, q_a)) \Big) + \sum_{a \in [m]} w_a \sum_{s \in R_a} |r_a(s) - \hat r_a(s)| \\
&\leq O\Big( \sum_{a \in [m]} \hat w_a \,\tfrac{1}{2} W_1(\hat r_a, r_a) \Big) + \sum_{a \in [m]} w_a \sum_{s \in R_a} |r_a(s) - \hat r_a(s)|,
\end{aligned}$$
where we defined $\hat w_a := n_a/n$, line 2 is because $w_a = \Theta(\hat w_a)$ for all $a \in [m]$ when $n \geq \Omega(\max_a \ln(m/\delta)/w_a)$, and line 3 is due to $(\hat q_1, \ldots, \hat q_m)$ being a minimizer of Equation (4) on the $\hat r_a$'s. Because $\|s - s'\|_1 \leq 2 \cdot \mathbb{1}[s \neq s']$, by Proposition A.2, $W_1(\hat r_a, r_a) \leq \|\hat r_a - r_a\|_1$, so it follows that
$$\operatorname{err}(\hat h) - \operatorname{err}^\star_{\alpha, f} \leq O\left( \sum_{a \in [m]} w_a \|r_a - \hat r_a\|_1 \right) \leq O\left( \sum_{a \in [m]} w_a \sqrt{\frac{|R_a| \ln(m/\delta)}{n w_a}} \right)$$
w.p. at least $1 - \delta$, from $m$ applications of Corollary C.2 and a union bound.

Fairness. In the finite case, we can get a stronger result in terms of the $\ell_1$-norm. Note that for all $a \in [m]$,
$$\begin{aligned}
\sum_{y \in \mathcal{Y}} \left| \mathbb{P}(\hat h(X, A) = y \mid A = a) - \hat q_a(y) \right| &= \sum_{y \in \mathcal{Y}} \left| \sum_{s \in R_a} r_a(s) \,\mathbb{P}(T_{\hat r_a \to \hat q_a}(s) = y) - \sum_{s \in R_a} \hat r_a(s) \,\mathbb{P}(T_{\hat r_a \to \hat q_a}(s) = y) \right| \\
&\leq \sum_{y \in \mathcal{Y}} \sum_{s \in R_a} |r_a(s) - \hat r_a(s)| \,\mathbb{P}(T_{\hat r_a \to \hat q_a}(s) = y) \\
&= \sum_{s \in R_a} |r_a(s) - \hat r_a(s)| \leq O\left( \sqrt{\frac{|R_a| \ln(m/\delta)}{n w_a}} \right)
\end{aligned}$$
w.p. at least $1 - \delta$, from applications of Corollary C.2 and a union bound, where the first equality is because $T_{\hat r_a \to \hat q_a}$ is a transport from $\hat r_a$ to $\hat q_a$. This $\ell_1$-norm bound directly implies an $\ell_\infty$-norm bound:
$$\left| \mathbb{P}(\hat h(X, A) = y \mid A = a) - \hat q_a(y) \right| \leq \sum_{y' \in \mathcal{Y}} \left| \mathbb{P}(\hat h(X, A) = y' \mid A = a) - \hat q_a(y') \right| = O\left( \sqrt{\frac{|R_a| \ln(m/\delta)}{n w_a}} \right).$$
Lastly, because of the constraint in Equation (4) that $\max_{a, a' \in [m]} \|\hat q_a - \hat q_{a'}\|_\infty \leq \alpha$,
$$\begin{aligned}
\Delta_{\mathrm{DP}}(\hat h) &= \max_{a, a' \in [m]} \max_{y \in \mathcal{Y}} \left| \mathbb{P}(\hat h(X, A) = y \mid A = a) - \mathbb{P}(\hat h(X, A) = y \mid A = a') \right| \\
&\leq \max_{a, a' \in [m]} \max_{y \in \mathcal{Y}} |\hat q_a(y) - \hat q_{a'}(y)| + \max_{a, a' \in [m]} \max_{y \in \mathcal{Y}} \left( \left| \mathbb{P}(\hat h(X, A) = y \mid A = a) - \hat q_a(y) \right| + \left| \mathbb{P}(\hat h(X, A) = y \mid A = a') - \hat q_{a'}(y) \right| \right) \\
&\leq \alpha + \max_{a, a' \in [m]} O\left( \sqrt{\frac{|R_a| \ln(m/\delta)}{n w_a}} + \sqrt{\frac{|R_{a'}| \ln(m/\delta)}{n w_{a'}}} \right).
\end{aligned}$$
The theorem then follows from a final application of the union bound.

The proofs of the sample complexities in the continuous case use the following uniform convergence results, with the VC dimension and the pseudo-dimension as the complexity measures. We omit the proofs, but refer readers to (Shalev-Shwartz & Ben-David, 2014, Theorem 6.8) and (Mohri et al., 2018, Theorem 11.8), respectively.
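The $\ell_1$ concentration of Theorem C.1 is also easy to check empirically. The following snippet is our own quick simulation (with arbitrarily chosen constants) of the deviation of an empirical multinomial distribution against the $\sqrt{2d \ln(2/\delta)/n}$ bound:

```python
# A quick numerical check (ours, not from the paper) of Theorem C.1: the
# ell_1 error of an empirical multinomial distribution should exceed
# sqrt(2 d ln(2/delta) / n) with probability at most delta.
import numpy as np

rng = np.random.default_rng(0)
d, n, delta, trials = 10, 2000, 0.05, 1000
p = rng.dirichlet(np.ones(d))                 # a random pmf on d outcomes
bound = np.sqrt(2 * d * np.log(2 / delta) / n)
errs = np.array([np.abs(rng.multinomial(n, p) / n - p).sum()
                 for _ in range(trials)])
print(f"P(||p - p_hat||_1 > bound) ~ {np.mean(errs > bound):.3f}"
      f"  (should be <= {delta})")
```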
We also need a characterization of the disagreement between the empirical transports $T_{\hat r_a \to \hat q_a}$ and the transport mappings $T_a \in \mathcal{G}_k$ extracted from them on Lines 6 to 9 of Algorithm 2, (to be) stated in Lemma D.5.

Theorem C.4 (VC Dimension Uniform Convergence). Let $\mathcal{H}$ be a class of binary functions from $\mathcal{X}$ to $\{0, 1\}$, and $p$ a distribution over $\mathcal{X} \times \{0, 1\}$, of which $(x_1, y_1), \ldots, (x_n, y_n) \sim p$ are i.i.d. samples. W.p. at least $1 - \delta$ over the random draw of the samples, for all $h \in \mathcal{H}$,
$$\left| \mathbb{E}_{(X, Y) \sim p} \,\mathbb{1}[h(X) \neq Y] - \frac{1}{n} \sum_{i=1}^n \mathbb{1}[h(x_i) \neq y_i] \right| \leq c \sqrt{\frac{d + \ln(1/\delta)}{n}}$$
for some universal constant $c$, where $d$ is the VC dimension of $\mathcal{H}$ (Definition D.12 with $k = 2$).

Theorem C.5 (Pseudo-Dimension Uniform Convergence). Let $\mathcal{H}$ be a class of functions from $\mathcal{X}$ to $\mathbb{R}$, $\ell : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_{\geq 0}$ a nonnegative loss function upper bounded by $M$, and $p$ a distribution over $\mathcal{X} \times \mathcal{Y}$, of which $(x_1, y_1), \ldots, (x_n, y_n) \sim p$ are i.i.d. samples. W.p. at least $1 - \delta$ over the random draw of the samples, for all $h \in \mathcal{H}$,
$$\left| \mathbb{E}_{(X, Y) \sim p} \,\ell(h(X), Y) - \frac{1}{n} \sum_{i=1}^n \ell(h(x_i), y_i) \right| \leq cM \sqrt{\frac{d + \ln(1/\delta)}{n}}$$
for some universal constant $c$, where $d$ is the pseudo-dimension of $\{(x, y) \mapsto \ell(h(x), y) : h \in \mathcal{H}\}$ (Definition D.14).

One highlight of our proof is that we avoid using the convergence of the empirical measure under the Wasserstein distance in our arguments, which would have resulted in a sample complexity that is exponential in the number of label classes $k$ (Weed & Bach, 2019). Instead, we leverage the existence and uniqueness of the semi-discrete simplex-vertex optimal transport in the low-complexity function class $\mathcal{G}_k$, established in Theorems D.1 and D.2, whereby we can apply the above uniform bounds to $\mathcal{G}_k$ and achieve a rate that is only polynomial in $k$. In addition, we remark that the $O(k^2 / n w_a)$ term coming from the disagreements between $T_a$ and $T_{\hat r_a \to \hat q_a}$ on $(x_{a,i})_{i \in [n_a]}$ could potentially be improved to $O(k / n w_a)$ if LP is assumed to return an extremal solution (Peyré & Cuturi, 2019), and $\mathcal{G}_k$ is modified so that the output on points that lie on each boundary can be specified, rather than always tie-broken to the $e_i$ with the largest index $i$.

Proof of Theorem 4.4. We first bound the error rate, followed by DP. Recall that the classifier returned from Algorithm 2 is $\hat h(x, a) := T_a \circ f_a(x)$, where each $T_a \in \mathcal{G}_k$ is extracted by Algorithm 2 from the empirical optimal transport $T_{\hat r_a \to \hat q_a}$, obtained from calling $\operatorname{LP}(\hat r_1, \ldots, \hat r_m, \alpha)$, where $(\hat q_1, \ldots, \hat q_m)$ is the minimizer of Equation (4) on the $\hat r_a$'s with $\alpha$. We will use a complexity result for $\mathcal{G}_k$ in terms of its pseudo-dimension when associated with the $\ell_1$ loss, and a VC bound for binarized versions of $\mathcal{G}_k$ (to be defined in Equation (17)), which are deferred to Theorem D.15 and Corollary D.16.

Error Rate. Consider the classification problem $\mu'$ derived from the original $\mu$ under the input transformation given by the joint distribution of $(f_A(X), A, Y)$, as discussed in Section 3.3, on which $\operatorname{Id}$ is the Bayes optimal score due to calibration of the $f$'s. Then by Lemma B.4 applied to $\mu'$,
$$\operatorname{err}(\hat h) = \sum_{a \in [m]} \frac{w_a}{2} \int_{\Delta_k \times \mathcal{Y}} \|s - y\|_1 \,\mathbb{P}(\operatorname{Id}(X') = s, T_a(X') = y) \,\mathrm{d}(s, y) = \sum_{a \in [m]} \frac{w_a}{2} \mathbb{E}_{S \sim r_a}[\|T_a(S) - S\|_1].$$
Define $s_{a,j} := f_a(x_{a,j})$. By Theorems C.5 and D.15, and a union bound, we have w.p. at least $1 - \delta$, for all $a \in [m]$,
$$\mathbb{E}_{S \sim r_a}[\|T_a(S) - S\|_1] \leq \frac{1}{n_a} \sum_{j=1}^{n_a} \|T_a(s_{a,j}) - s_{a,j}\|_1 + O\left( \sqrt{\frac{k + \ln(m/\delta)}{n_a}} \right).$$
Because each $T_a$ is extracted from the empirical optimal transport $T_{\hat r_a \to \hat q_a}$ using Lines 6 to 9 of Algorithm 2, by the discussion in Appendix D and Lemma D.5, the two agree on all of $(x_{a,i})_{i \in [n_a]}$ except for points that lie on the decision boundaries of $T_a$.
The boundaries are described by $\binom{k}{2}$ hyperplanes, and because $r_a$ is continuous, no two points in $(x_{a,i})_{i \in [n_a]}$ lie on the same hyperplane almost surely, so the number of disagreements is at most $\binom{k}{2} = k(k-1)/2$, and
$$\frac{1}{n_a} \sum_{j=1}^{n_a} \|T_a(s_{a,j}) - s_{a,j}\|_1 \leq \frac{1}{n_a} \sum_{j=1}^{n_a} \|T_{\hat r_a \to \hat q_a}(s_{a,j}) - s_{a,j}\|_1 + \frac{1}{n_a} \sum_{j=1}^{n_a} \|T_a(s_{a,j}) - T_{\hat r_a \to \hat q_a}(s_{a,j})\|_1 \leq W_1(\hat r_a, \hat q_a) + \frac{2}{n_a} \sum_{j=1}^{n_a} \mathbb{1}[T_a(s_{a,j}) \neq T_{\hat r_a \to \hat q_a}(s_{a,j})] \leq W_1(\hat r_a, \hat q_a) + O\!\left(\frac{k^2}{n_a}\right),$$
where the first term after the triangle inequality equals $W_1(\hat r_a, \hat q_a)$ because $T_{\hat r_a \to \hat q_a}$ is the optimal transport from $\hat r_a$ to $\hat q_a$. Therefore, we arrive at, w.p. at least $1 - \delta$,
$$\operatorname{err}(\hat h) \leq \sum_{a \in [m]} \frac{w_a}{2} W_1(\hat r_a, \hat q_a) + \mathcal{E}, \quad \text{where } \mathcal{E} := O\!\left( \sum_{a \in [m]} w_a \left( \sqrt{\frac{k + \ln(m/\delta)}{n_a}} + \frac{k^2}{n_a} \right) \right)$$
collects the concentration and disagreement terms above. Continuing, let $T_{r_a \to q_a} \in \mathcal{G}_k$ denote the optimal transport from $r_a$ to $q_a$, where the $q_a$'s denote the minimizer of Equation (4) on the $r_a$'s with $\alpha$; the existence of this transport in $\mathcal{G}_k$ is due to the problem being semi-discrete and Theorem D.1. Define $q'_a(y) := \frac{1}{n_a} \sum_{j=1}^{n_a} \mathbb{1}[T_{r_a \to q_a}(s_{a,j}) = y]$. It follows that
$$\begin{aligned}
\operatorname{err}(\hat h) - \operatorname{err}^\star_{\alpha, f} &\leq \mathcal{E} + \sum_{a \in [m]} \frac{w_a}{2} \left( W_1(\hat r_a, \hat q_a) - W_1(r_a, q_a) \right) \\
&= \mathcal{E} + \sum_{a \in [m]} \frac{w_a}{2} \left( (W_1(\hat r_a, \hat q_a) - W_1(\hat r_a, q'_a)) + (W_1(\hat r_a, q'_a) - W_1(r_a, q_a)) \right) \\
&\leq \mathcal{E} + O\!\left( \sum_{a \in [m]} \frac{\hat w_a}{2} \left( (W_1(\hat r_a, \hat q_a) - W_1(\hat r_a, q'_a)) + (W_1(\hat r_a, q'_a) - W_1(r_a, q_a)) \right) \right) \\
&\leq \mathcal{E} + O\!\left( \sum_{a \in [m]} \frac{\hat w_a}{2} \left( (W_1(\hat r_a, q_a) - W_1(\hat r_a, q'_a)) + (W_1(\hat r_a, q'_a) - W_1(r_a, q_a)) \right) \right) \\
&\leq \mathcal{E} + O\!\left( \sum_{a \in [m]} \frac{\hat w_a}{2} \left( W_1(q_a, q'_a) + (W_1(\hat r_a, q'_a) - W_1(r_a, q_a)) \right) \right),
\end{aligned}$$
where we defined $\hat w_a := n_a/n$, line 3 is because $w_a = \Theta(\hat w_a)$ for all $a \in [m]$ when $n \geq \Omega(\max_a \ln(m/\delta)/w_a)$, and line 4 is due to $(\hat q_1, \ldots, \hat q_m)$ being a minimizer of Equation (4) on the $\hat r_a$'s (with $(q_1, \ldots, q_m)$ being feasible for it). For the first term in the summand, because both distributions $q_a, q'_a$ are supported on the vertices, by Proposition A.2, $W_1(q_a, q'_a) = \|q_a - q'_a\|_1$, and w.p. at least $1 - \delta$, for all $a \in [m]$,
$$\|q_a - q'_a\|_1 = \sum_{i \in [k]} \left| \mathbb{E}_{S \sim r_a} \mathbb{1}[T_{r_a \to q_a}(S)_i = 1] - \frac{1}{n_a} \sum_{j=1}^{n_a} \mathbb{1}[T_{r_a \to q_a}(s_{a,j})_i = 1] \right| = \left\| \mathbb{E}_{S \sim r_a}[T_{r_a \to q_a}(S)] - \frac{1}{n_a} \sum_{j=1}^{n_a} T_{r_a \to q_a}(s_{a,j}) \right\|_1 \leq O\!\left( \sqrt{\frac{k \ln(m/\delta)}{n_a}} \right),$$
where the last step follows from Theorem C.1 and a union bound over all $a \in [m]$. For the second term, because the joint distribution of $(\hat S, T_{r_a \to q_a}(\hat S))$, $\hat S \sim \hat r_a$, is a coupling belonging to $\Gamma(\hat r_a, q'_a)$, the transportation cost of $T_{r_a \to q_a}$ on $\hat r_a$ upper bounds $W_1(\hat r_a, q'_a)$, whereby w.p. at least $1 - \delta$, for all $a \in [m]$,
$$W_1(\hat r_a, q'_a) - W_1(r_a, q_a) \leq \frac{1}{n_a} \sum_{j=1}^{n_a} \|T_{r_a \to q_a}(s_{a,j}) - s_{a,j}\|_1 - \mathbb{E}_{S \sim r_a}\big[\|T_{r_a \to q_a}(S) - S\|_1\big] \leq O\!\left( \sqrt{\frac{\ln(m/\delta)}{n_a}} \right)$$
by Theorem C.3, because $\|T_{r_a \to q_a}(s_{a,j}) - s_{a,j}\|_1 \leq 2$. Hence, putting everything together, we conclude with a union bound that
$$\operatorname{err}(\hat h) - \operatorname{err}^\star_{\alpha, f} \leq O\!\left( \sum_{a \in [m]} w_a \left( \sqrt{\frac{k + \ln(m/\delta)}{n w_a}} + \frac{k^2}{n w_a} \right) \right).$$

Fairness. By applying Theorem C.4 and Corollary D.16 to the artificial binary classification problem whose data distribution is the joint distribution of $(S, 1)$, $S \sim r_a$, and a union bound, w.p. at least $1 - \delta$, for all $i \in [k]$ and $a \in [m]$,
$$\left| \mathbb{P}(\hat h(X, A) = e_i \mid A = a) - \frac{1}{n_a} \sum_{j=1}^{n_a} \mathbb{1}[T_a(s_{a,j}) = e_i] \right| = \left| \mathbb{E}_{S \sim r_a} \mathbb{1}[T_a(S)_i = 1] - \frac{1}{n_a} \sum_{j=1}^{n_a} \mathbb{1}[T_a(s_{a,j})_i = 1] \right| \leq O\!\left( \sqrt{\frac{k + \ln(mk/\delta)}{n_a}} \right).$$
Now, the decision boundaries of the function $s \mapsto \mathbb{1}[T_a(s)_i = 1] \in \mathcal{G}_{k,i}$ (defined in Equation (17)) are described by $k$ hyperplanes, and $T_a$ is extracted from $T_{\hat r_a \to \hat q_a}$, so by the discussion in Appendix D and Lemma D.5 and the same reasoning used previously, the two agree on all but $k$ points of $(x_{a,i})_{i \in [n_a]}$ almost surely, whereby
$$\left| \frac{1}{n_a} \sum_{j=1}^{n_a} \mathbb{1}[T_a(s_{a,j}) = e_i] - \hat q_a(e_i) \right| = \left| \frac{1}{n_a} \sum_{j=1}^{n_a} \mathbb{1}[T_a(s_{a,j})_i = 1] - \frac{1}{n_a} \sum_{j=1}^{n_a} \mathbb{1}[T_{\hat r_a \to \hat q_a}(s_{a,j})_i = 1] \right| \leq \frac{1}{n_a} \sum_{j=1}^{n_a} \mathbb{1}[T_a(s_{a,j})_i \neq T_{\hat r_a \to \hat q_a}(s_{a,j})_i] \leq O\!\left( \frac{k}{n_a} \right).$$
Therefore, we conclude that
$$\begin{aligned}
\Delta_{\mathrm{DP}}(\hat h) &= \max_{a, a' \in [m]} \max_{y \in \mathcal{Y}} \left| \mathbb{P}(\hat h(X, A) = y \mid A = a) - \mathbb{P}(\hat h(X, A) = y \mid A = a') \right| \\
&\leq \max_{a, a' \in [m]} \max_{y \in \mathcal{Y}} |\hat q_a(y) - \hat q_{a'}(y)| + \max_{a, a' \in [m]} \max_{y \in \mathcal{Y}} \left( \left| \mathbb{P}(\hat h(X, A) = y \mid A = a) - \hat q_a(y) \right| + \left| \mathbb{P}(\hat h(X, A) = y \mid A = a') - \hat q_{a'}(y) \right| \right) \\
&\leq \alpha + \max_{a, a' \in [m]} O\!\left( \sqrt{\frac{k + \ln(mk/\delta)}{n_a}} + \frac{k}{n_a} + \sqrt{\frac{k + \ln(mk/\delta)}{n_{a'}}} + \frac{k}{n_{a'}} \right).
\end{aligned}$$
The theorem then follows from noting that $n_a = \Theta(n w_a)$ when $n \geq \Omega(\max_a \ln(m/\delta)/w_a)$.

Proof of Theorem 4.5. Let $(\bar q_1, \ldots, \bar q_m)$ denote a minimizer of Equation (4) on the $\bar r_a$'s with $\alpha$, so that $\bar h_\rho(x, a) := T_{\bar r_a \to \bar q_a} \circ u_\rho \circ f_a(x)$, where $\bar r_a := u_{\rho\sharp} r_a$.
Denote the coupling associated with $T_{\bar r_a \to \bar q_a}$ by $\bar\gamma_a \in \Gamma(\bar r_a, \bar q_a)$; then the Markov kernel of $T_{\bar r_a \to \bar q_a} \circ u_\rho$ is
$$K(s, T) = \mathbb{E}_{N \sim \rho} \left[ \frac{\bar\gamma_a(s + N, T)}{\bar\gamma_a(s + N, \mathcal{Y})} \right] = \int_{\bar s \in \mathbb{R}^k} \frac{\bar\gamma_a(\bar s, T)}{\bar\gamma_a(\bar s, \mathcal{Y})} \,\mathrm{d}(\rho * \delta_s)(\bar s) = \int_{\bar s \in \mathbb{R}^k} \frac{\bar\gamma_a(\bar s, T)}{\bar r_a(\bar s)} \,\mathrm{d}(\rho * \delta_s)(\bar s),$$
where $*$ denotes convolution. Consider the classification problem $\mu'$ derived from the original $\mu$ under the input transformation given by the joint distribution of $(f_A(X), A, Y)$, as discussed in Section 3.3, on which $\operatorname{Id}$ is the Bayes optimal score due to calibration of $f$. Then by Lemma B.5 applied to $\mu'$, the error rate on group $a$ is
$$\operatorname{err}_a(\bar h_\rho) = \frac{1}{2} \int_{\Delta_k \times \mathcal{Y}} \|s - y\|_1 \,\mathrm{d}\gamma'_a(s, y), \tag{7}$$
where $\gamma'_a \in \Gamma(r_a, \bar q_a)$ equals
$$\gamma'_a(s, y) = \int_{\operatorname{Id}^{-1}(s)} K(s', y) \,\mathrm{d}r_a(s') = K(s, y) \,r_a(s) = \int_{\mathbb{R}^k} \frac{\bar\gamma_a(\bar s, y)}{\bar r_a(\bar s)} \,\mathrm{d}(\rho * \delta_s)(\bar s) \cdot r_a(s),$$
and hence
$$\begin{aligned}
2 \operatorname{err}_a(\bar h_\rho) &= \sum_{y \in \mathcal{Y}} \int_{\Delta_k} \int_{\mathbb{R}^k} \|s - y\|_1 \,\frac{\bar\gamma_a(\bar s, y)}{\bar r_a(\bar s)} (\rho * \delta_s)(\bar s) \,r_a(s) \,\mathrm{d}\bar s \,\mathrm{d}s \\
&\leq \sum_{y \in \mathcal{Y}} \int_{\Delta_k} \int_{\mathbb{R}^k} \|\bar s - y\|_1 \,\frac{\bar\gamma_a(\bar s, y)}{\bar r_a(\bar s)} (\rho * \delta_s)(\bar s) \,r_a(s) \,\mathrm{d}\bar s \,\mathrm{d}s + \sum_{y \in \mathcal{Y}} \int_{\Delta_k} \int_{\mathbb{R}^k} \|s - \bar s\|_1 \,\frac{\bar\gamma_a(\bar s, y)}{\bar r_a(\bar s)} (\rho * \delta_s)(\bar s) \,r_a(s) \,\mathrm{d}\bar s \,\mathrm{d}s \\
&= \sum_{y \in \mathcal{Y}} \int_{\mathbb{R}^k} \|\bar s - y\|_1 \,\bar\gamma_a(\bar s, y) \,\mathrm{d}\bar s + \int_{\Delta_k} \int_{\mathbb{R}^k} \|n\|_1 \,\mathrm{d}\rho(n) \,\mathrm{d}r_a(s) \\
&= W_1(\bar r_a, \bar q_a) + \mathbb{E}_{N \sim \rho}[\|N\|_1],
\end{aligned}$$
where the second line is the triangle inequality, and the third uses $\int_{\Delta_k} (\rho * \delta_s)(\bar s) \,r_a(s) \,\mathrm{d}s = \bar r_a(\bar s)$ together with the change of variable $n := \bar s - s$. Then we have
$$\operatorname{err}(\bar h_\rho) - \operatorname{err}^\star_{\alpha, f} \leq \sum_{a \in [m]} \frac{w_a}{2} \left( W_1(\bar r_a, \bar q_a) - W_1(r_a, q_a) \right) + \frac{1}{2} \mathbb{E}_{N \sim \rho}[\|N\|_1] \leq \sum_{a \in [m]} \frac{w_a}{2} W_1(\bar r_a, r_a) + \frac{1}{2} \mathbb{E}_{N \sim \rho}[\|N\|_1],$$
where $(q_1, \ldots, q_m)$ is a minimizer of Equation (4) on the $r_a$'s, and the second inequality uses the fact that $(\bar q_1, \ldots, \bar q_m)$ is a minimizer on the $\bar r_a$'s together with the triangle inequality for $W_1$. Now, we upper bound the first term. Consider the coupling $\pi_a \in \Gamma(\bar r_a, r_a)$ given by $\pi_a(\bar s, s) = \rho(\bar s - s) r_a(s)$, whereby
$$W_1(\bar r_a, r_a) = \inf_{\gamma \in \Gamma(\bar r_a, r_a)} \iint \|\bar s - s\|_1 \,\mathrm{d}\gamma(\bar s, s) \leq \iint \|\bar s - s\|_1 \,\mathrm{d}\pi_a(\bar s, s) = \iint \|\bar s - s\|_1 \,\rho(\bar s - s) r_a(s) \,\mathrm{d}\bar s \,\mathrm{d}s = \iint \|(s + n) - s\|_1 \,\rho(n) r_a(s) \,\mathrm{d}n \,\mathrm{d}s = \mathbb{E}_{N \sim \rho}[\|N\|_1].$$
Substituting this into the result above, we obtain the upper bound. On the other hand, the lower bound follows from Equation (7):
$$\operatorname{err}(\bar h_\rho) = \sum_{a \in [m]} \frac{w_a}{2} \int_{\Delta_k \times \mathcal{Y}} \|s - y\|_1 \,\mathrm{d}\gamma'_a(s, y) \geq \sum_{a \in [m]} \frac{w_a}{2} W_1(r_a, \bar q_a) \geq \operatorname{err}^\star_{\alpha, f}.$$

D. Optimal Transport Between Simplex and Vertex Distributions

The $(k-1)$-dimensional probability simplex is defined for $k \geq 2$ by
$$\Delta_k := \left\{ x \in \mathbb{R}^k : \sum_{i=1}^k x_i = 1,\ x_j \geq 0,\ \forall j \in [k] \right\},$$
and its $k$ vertices are $\{e_1, \ldots, e_k\}$. In this section, we study the optimal transport problem between distributions supported on the simplex and on its vertices under the $\ell_1$ cost, given by $c(x, y) = \|x - y\|_1$. By extending $\Delta_k$ to infinity (i.e., dropping the nonnegativity constraints), we obtain the $(k-1)$-dimensional affine space
$$D_k := \left\{ x \in \mathbb{R}^k : \sum_{i=1}^k x_i = 1 \right\}.$$
Define vectors $v_{ij} := e_j - e_i$, $i, j \in [k]$, and note that for each $i \in [k]$, $\{v_{ij} : j \neq i\}$ forms a basis for (the direction space of) $D_k$. Also, observe the following identity for the $\ell_1$ distance between a point on the simplex and a vertex:
$$\|x - e_i\|_1 = 1 - x_i + \sum_{j \neq i} x_j = 1 - 2x_i + \sum_j x_j = 2(1 - x_i), \qquad x \in \Delta_k,\ i \in [k] \tag{8}$$
(this identity is central to some of the upcoming results). A main result of this section is that when the transportation problem is semi-discrete, the deterministic (Monge) optimal transport exists and is unique:

Theorem D.1. Let $p$ be a continuous probability measure on $\Delta_k$, $q$ a probability measure on $\{e_1, \ldots, e_k\}$, and $c(x, y) = \|x - y\|_1$. The optimal transport from $p$ to $q$ is a Monge plan, and is unique up to sets of measure zero w.r.t. $p$.

Specifically, the optimal transport $T_{p \to q}$ in Theorem D.1 is given by the $c$-transform of the Kantorovich potential from the Kantorovich-Rubinstein dual formulation of the transportation problem. In other words, it belongs to the following parameterized class of deterministic functions:
$$\mathcal{G}_k := \left\{ x \mapsto e_{\operatorname{argmin}_{i \in [k]} (\|x - e_i\|_1 - \psi_i)} : \psi \in \mathbb{R}^k \right\} \subseteq \{e_1, \ldots, e_k\}^{\Delta_k} \tag{9}$$
(ties broken to the tied $e_i$ with the largest index $i$). This function class is therefore of particular interest to various analyses in this paper.
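Both parameterizations of these transport maps are straightforward to evaluate in code. The snippet below is our own numpy transcription of Equation (9), using the identity of Equation (8) and the stated tie-breaking rule, together with a numerical check of the equivalent center parameterization $g_z$ established in Proposition D.3 below, with $z$ given by Equation (11); all function names are ours.

```python
# A direct transcription (ours, not the released code) of the class G_k in
# Equation (9), plus a numerical check of the center parameterization of
# Proposition D.3 with z from Equation (11).
import numpy as np

def g_psi(x, psi):
    """Index i of the vertex e_i chosen by g_psi at a simplex point x."""
    scores = 2 * (1 - x) - psi                    # ||x - e_i||_1 - psi_i, Eq. (8)
    return len(x) - 1 - np.argmin(scores[::-1])   # ties -> largest index i

def g_z(x, z):
    """Center parameterization: e_i is chosen iff x_j - x_i <= z_j - z_i for
    all j, i.e., iff i maximizes x_i - z_i (ties -> largest index i)."""
    diffs = x - z
    return len(x) - 1 - np.argmax(diffs[::-1])

rng = np.random.default_rng(0)
k = 5
psi = rng.normal(size=k)
z = -psi / 2 + (1 + psi.sum() / 2) / k            # Equation (11); sums to 1
assert np.isclose(z.sum(), 1.0)
for x in rng.dirichlet(np.ones(k), size=1000):    # random points in the simplex
    assert g_psi(x, psi) == g_z(x, z)
```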
For the generalization bounds in Section 4.2, we show that this function class has low complexity in terms of the Natarajan dimension (Definition D.12):

Theorem D.2. $d_{\mathrm{N}}(\mathcal{G}_k) = k - 1$.

In addition, note that, as illustrated in Figure 2, we can equivalently characterize each $g \in \mathcal{G}_k$ by the center point at which its decision boundaries all intersect:

Proposition D.3. Define the function class $\mathcal{G}'_k \subseteq \{e_1, \ldots, e_k\}^{\Delta_k}$ parameterized by $z \in \mathbb{R}^k$, s.t. for each $g_z \in \mathcal{G}'_k$ with parameter $z$,
$$g_z(x) = e_i \quad \text{if} \quad x_j - x_i \leq z_j - z_i \iff x \cdot v_{ij} \leq z \cdot v_{ij}, \quad \forall j \neq i \tag{10}$$
(when multiple $e_i$'s are eligible, output the tied $e_i$ with the largest index $i$). For every $g_\psi \in \mathcal{G}_k$, we have $g_\psi = g_z \in \mathcal{G}'_k$ by setting
$$z_i = -\frac{\psi_i}{2} + \frac{1}{k}\left(1 + \frac{1}{2}\sum_{j \in [k]} \psi_j\right), \quad i \in [k] \tag{11}$$
(the choice of $\sum_{i=1}^k z_i = 1$, s.t. $z \in D_k$, was arbitrary, due to an extra degree of freedom because the support of $g$ is contained in $\Delta_k$). Conversely, for every $g_z \in \mathcal{G}'_k$, we have $g_z = g_\psi \in \mathcal{G}_k$ by setting
$$\psi_i = 2(z_1 - z_i) = 2\, z \cdot v_{i1}, \quad i \in [k] \tag{12}$$
(again, the choice of $\psi_1 = 0$ was arbitrary).

Proof. Let $g_\psi \in \mathcal{G}_k$; then for the $g_z \in \mathcal{G}'_k$ constructed in Equation (11), by Equation (8),
$$g_z(x) = e_i \text{ is eligible} \iff x_j - x_i \leq z_j - z_i,\ \forall j \neq i \iff x_j - x_i \leq \frac{\psi_i - \psi_j}{2},\ \forall j \neq i \iff 2(x_j - x_i) \leq \psi_i - \psi_j,\ \forall j \neq i \iff 2(1 - x_i) - \psi_i \leq 2(1 - x_j) - \psi_j,\ \forall j \neq i \iff \|x - e_i\|_1 - \psi_i \leq \|x - e_j\|_1 - \psi_j,\ \forall j \neq i.$$
Conversely, let $g_z \in \mathcal{G}'_k$; then for the $g_\psi \in \mathcal{G}_k$ constructed in Equation (12),
$$g_\psi(x) = e_i \text{ is eligible} \iff \|x - e_i\|_1 - \psi_i \leq \|x - e_j\|_1 - \psi_j,\ \forall j \neq i \iff 2(x_j - x_i) \leq 2(z_1 - z_i) - 2(z_1 - z_j),\ \forall j \neq i \iff x_j - x_i \leq z_j - z_i,\ \forall j \neq i.$$
We will often use this alternative characterization of $\mathcal{G}_k$. The remaining proofs are deferred to Appendix D.2. Theorem D.1 is established via an analysis of the geometry of the simplex-vertex optimal transport, which we discuss in the next section.

D.1. Geometry of Optimal Transport

Let $p$ be an arbitrary distribution supported on $\Delta_k$, and $q$ a (finite) distribution over $\mathcal{Y} := \{e_1, \ldots, e_k\}$. We study the geometric properties of the optimal solution to the (Kantorovich) transportation problem between $p, q$ under the $\ell_1$ cost,
$$\inf_{\gamma \in \Gamma(p, q)} \int_{\Delta_k \times \mathcal{Y}} \|x - y\|_1 \,\mathrm{d}\gamma(x, y), \tag{13}$$
and note that the infimum can be attained because the supports are compact.

Figure 5. Illustration of the objects $C_i$ and $A_i$ (and the intersection $\bigcap_{i \in [3]} A_i$) defined in Equations (14) and (15) for $k = 3$. See Figure 6 for an example where the intersection is empty, when the underlying transport is not optimal.

First, given a simplex-vertex transport $\gamma \in \Gamma(p, q)$, we define the following geometric objects:
$$B_{ij} := \min\left( \{ b \in \mathbb{R} : \gamma(\{x \in \Delta_k : x \cdot v_{ij} \leq b - 1\}, e_i) = q(e_i) \} \cup \{0\} \right), \quad \text{and} \quad C_i := \bigcap_{j \neq i} \{x \in \Delta_k : x \cdot v_{ij} \leq B_{ij} - 1\}. \tag{14}$$
For each $i \in [k]$, $B_{ij}$ defines the (smallest offset of the) halfspace in the $v_{ij}$ direction in which all points that are transported by $\gamma$ to $e_i$ are contained, and $C_i$ is formed by the intersection of these halfspaces, also containing all points transported to $e_i$. See Figures 3 and 5 for illustrations. Now, if $\gamma$ is an optimal transport of Equation (13), then intuition tells us that, in order to achieve the minimum cost, the halfspaces along each direction should not overlap (i.e., $B_{ij} + B_{ji} \leq 2$ for all $i \neq j$), and the $C_i$'s should not intersect (except on a set of Lebesgue measure zero). We show that these intuitions regarding the geometry of $\gamma$ are indeed valid, and they are implied by showing that the intersection of the following sets $A_i$ is nonempty (see Figure 5 for an illustration):
$$A_i := \bigcap_{j \neq i} \{x \in D_k : x \cdot v_{ij} \geq B_{ij} - 1\}. \tag{15}$$

Proposition D.4. If $\gamma$ is a minimizer of Equation (13), then $\bigcap_{i \in [k]} A_i \neq \emptyset$.

The proof is deferred to Appendix D.2. Note that $\bigcap_{i \in [k]} A_i$ is exactly the set considered on Line 7 of Algorithm 2, and Proposition D.4 says that if $\gamma$ is an optimal transport, then a point $z \in \bigcap_{i \in [k]} A_i$ exists.
The significance of this point is that the function $g_z \in \mathcal{G}_k$ with parameter $z \in D_k$ agrees with the transport $T_{p \to q}$ associated with $\gamma$ except for points that lie on the decision boundaries (which have Lebesgue measure zero):

Lemma D.5. Let $p, q$ be probability measures on $\Delta_k$ and $\{e_1, \ldots, e_k\}$, respectively. If $\gamma \in \Gamma(p, q)$ is a minimizer of Equation (13), then there exists $T \in \mathcal{G}_k$ with parameter $z \in D_k$ satisfying
$$\gamma(x, T(x)) = p(x), \qquad \forall x \in \operatorname{supp}(p) \setminus \bigcup_{i \neq j} \{x \in D_k : x \cdot v_{ij} = z \cdot v_{ij}\}.$$
This result underlies many discussions throughout our presentation: (i) the construction used in its proof led to Lines 6 to 9 of Algorithm 2 for extracting post-processing functions from the empirical optimal transports, (ii) it embodies the argument used in the proof of Theorem 4.4 regarding the disagreements between the extracted functions and the empirical transports, and (iii) the existence part of Theorem D.1 is a direct consequence, since the set on which disagreements may occur always has measure zero when $p$ is continuous.

Proof. Let $z \in \bigcap_{i \in [k]} A_i$, which exists due to Proposition D.4, and let $T \in \mathcal{G}_k$ have parameter $z$. We show that $T$ agrees with $\gamma$ on all $x \in \operatorname{supp}(p) \setminus \bigcup_{i \neq j} \{x \in D_k : x \cdot v_{ij} = z \cdot v_{ij}\}$: suppose $T(x) = e_i$; then $x \cdot v_{ij} < z \cdot v_{ij}$ for all $j \neq i$ by construction (strictly, since $x$ is off the boundaries). Furthermore, because $z \in A_j$ for each $j \neq i$, we have $z \cdot v_{ji} \geq B_{ji} - 1$, whereas $x \cdot v_{ji} > z \cdot v_{ji} \geq B_{ji} - 1$, so $\gamma(x, e_j) = 0$ for all $j \neq i$ (otherwise the definition of $B_{ji}$ in Equation (14) would be contradicted), and hence $\gamma(x, e_i) = p(x)$.

D.2. Omitted Proofs for Section D

Proof of Theorem D.1. For existence, Lemma D.5 provides a $T \in \mathcal{G}_k$ that agrees with the optimal transport almost everywhere, since the set of points lying on the boundaries has measure zero w.r.t. $p$ by continuity.

Next, we prove uniqueness. Let $\gamma, \gamma' \in \Gamma(p, q)$ be two optimal transports, and $T, T' \in \mathcal{G}_k$ the mappings provided by Lemma D.5 that agree with $\gamma, \gamma'$ a.e. We will show that $T = T'$ a.e., and therefore $\gamma = \gamma'$. Denote the parameter (i.e., the center of the decision boundaries) of $T$ by $z \in D_k$ (analogously $z'$ for $T'$), the decision boundaries by $B_{ij} := z \cdot v_{ij} + 1$, and the decision regions by $C_i := \bigcap_{j \neq i} \{x \in \Delta_k : x \cdot v_{ij} \leq B_{ij} - 1\}$. By definition of $\mathcal{G}_k$, $\Delta_k = \bigsqcup_{i=1}^k C_i$, and for all $x \in \Delta_k$, $T(x) = e_i \iff x \in C_i$ almost surely; therefore, $\gamma(C_i, e_i) = q(e_i)$ because $T$ agrees with the transport $\gamma$ a.e.

Define the difference in the boundaries between $T$ and $T'$ by $d_{ij} := B'_{ij} - B_{ij}$, and note that
$$d_{ij} = d_{nj} - d_{ni}, \quad \text{with } d_{\ell\ell} := 0, \qquad \forall i, j, n, \ell \in [k], \tag{16}$$
which follows from the observation that $B_{ij} = z \cdot (v_{nj} - v_{ni}) + 1 = B_{nj} - B_{ni} + 1$, with $B_{\ell\ell} := 1$, for all $i, j, n, \ell \in [k]$. Construct a directed graph on $k$ nodes where $(i, j)$ is an edge iff $d_{ij} > 0$. Note that this graph is acyclic: first, it cannot contain cycles of length 2; otherwise, $(i, j), (j, i) \in E \implies d_{ij} + d_{ji} > 0$, contradicting the fact that $d_{ij} + d_{ji} = 0$ by definition. Next, consider the shortest cycle, and let $(i, j), (n, i)$, $j \neq n$, denote two edges contained in it. It follows that $d_{ij}, d_{ni} > 0$ and $d_{nj} \leq 0$ (or it is not the shortest cycle); then by Equation (16), $0 < d_{ij} = d_{nj} - d_{ni} < 0$, which is a contradiction.

Now, we show by strong induction on the reverse topological order of the graph nodes that for all $i \in [k]$, $p(C_i \triangle C'_i) = 0$, where $\triangle$ denotes the symmetric difference of sets. For the base case, let $i$ denote a sink node in the graph; then we have $d_{ij} \leq 0$ for all $j$, meaning that $C'_i \subseteq C_i$. Then
$$q(e_i) = \gamma'(C'_i, e_i) = p(C'_i) \leq p(C_i) = \gamma(C_i, e_i) = q(e_i).$$
If the inequality were strict, it would be a contradiction; otherwise, combining the equality with $T(x) = e_i \iff x \in C_i$ a.s. (and analogously for $T'$) implies $p(C_i \triangle C'_i) = p(C_i \setminus C'_i) = 0$.
For the inductive case, let $i$ denote a node, and $J \subseteq [k] \setminus \{i\}$ the set of nodes directed to from $i$; then by construction, $\bigsqcup_{j \in J \cup \{i\}} C'_j \subseteq \bigsqcup_{j \in J \cup \{i\}} C_j$. Let $F_i := C_i \cap C'_i$, and note that for all $x \in C'_i \setminus F_i$, $T'(x) = e_i$ and $T(x) \in \{e_j : j \in J\}$. Therefore, $C'_i \setminus F_i \subseteq \bigcup_{j \in J} (C_j \triangle C'_j)$, and by the inductive hypothesis, $p(C'_i \setminus F_i) = 0$. It then follows that $p(C'_i) \leq p(C_i)$, and subsequently $p(C_i \triangle C'_i) = p(C_i \setminus C'_i) = 0$ by the same arguments used in the base case. Therefore, $p(\{x : T(x) \neq T'(x)\}) \leq \sum_{i=1}^k p(C_i \triangle C'_i) = 0$, so $T = T'$ a.e.

The proof of Proposition D.4 needs the following technical result, which at a high level states that if a collection of $v_{ij}$-aligned convex sets does not intersect, then it cannot cover the entire space:

Proposition D.6. Let $B \in \mathbb{R}^{k \times k}$ be arbitrary, and define $S_i := \bigcap_{j \neq i} \{x \in D_k : x \cdot v_{ij} \geq B_{ij} - 1\}$ for each $i \in [k]$. We have $\bigcap_{i \in [k]} S_i = \emptyset \implies \bigcup_{i \in [k]} S_i \neq D_k$.

While this could be proved with elementary arguments, for clarity we use known results from algebraic topology in the final steps of our proof. The tools and concepts that we use include homotopy equivalence and homology groups (we omit the definition of the latter, but refer readers to (Spanier, 1981) for a textbook treatment). The definitions are provided below for completeness; readers may skip ahead to the main proof.

Definition D.7 (Homotopy). Let $\mathcal{X}, \mathcal{Y}$ be topological spaces, and $f, g : \mathcal{X} \to \mathcal{Y}$ two continuous functions. A homotopy between $f$ and $g$ is a continuous function $h : \mathcal{X} \times [0, 1] \to \mathcal{Y}$ such that $h(x, 0) = f(x)$ and $h(x, 1) = g(x)$ for all $x \in \mathcal{X}$. We say $f, g$ are homotopic if there exists a homotopy between them.

Definition D.8 (Homotopy Equivalence). Let $\mathcal{X}, \mathcal{Y}$ be topological spaces. If there exist continuous maps $f : \mathcal{X} \to \mathcal{Y}$ and $g : \mathcal{Y} \to \mathcal{X}$ such that $g \circ f$ is homotopic to the identity map $\operatorname{Id}_{\mathcal{X}}$ on $\mathcal{X}$, and $f \circ g$ is homotopic to $\operatorname{Id}_{\mathcal{Y}}$, then $\mathcal{X}$ and $\mathcal{Y}$ are homotopy equivalent, denoted by $\mathcal{X} \simeq \mathcal{Y}$.

Fact D.9 (Homology). (See (Spanier, 1981) for a textbook treatment.)
1. The homology groups of $\mathbb{R}^d$, denoted by $H_n(\mathbb{R}^d)$ for $n \in \{0, 1, 2, \ldots\}$, are $H_n(\mathbb{R}^d) = \mathbb{Z}$ if $n = 0$, and $\{0\}$ otherwise.
2. The homology groups of the $d$-dimensional simplex $\Delta_{d+1}$ are $H_n(\Delta_{d+1}) = \mathbb{Z}$ if $n = 0$, and $\{0\}$ otherwise.
3. The homology groups of the $d$-dimensional simplex without its interior, $\partial\Delta_{d+1}$, are $H_n(\partial\Delta_{d+1}) = \mathbb{Z}$ if $n = 0$ or $n = d - 1$, and $\{0\}$ otherwise.
4. If $\mathcal{X} \simeq \mathcal{Y}$, then $H_n(\mathcal{X}) \cong H_n(\mathcal{Y})$ for all $n$.

Clearly, the affine space $D_k \simeq \mathbb{R}^{k-1}$ via a rotation and a translation. We also cite the Nerve theorem (Bauer et al., 2023, Theorem 3.1):

Theorem D.10 (Nerve). Let $\mathcal{S} = \{S_1, \ldots, S_n\}$ be a finite collection of sets, and define its nerve by $\operatorname{Nrv}(\mathcal{S}) := \{J \subseteq [n] : \bigcap_{j \in J} S_j \neq \emptyset\}$. If the sets $S_i$ are convex closed subsets of $\mathbb{R}^d$, then $\operatorname{Nrv}(\mathcal{S}) \simeq \bigcup_i S_i$.

Proof of Proposition D.6. We prove the contrapositive statement, $\bigcup_{i \in [k]} S_i = D_k \implies \bigcap_{i \in [k]} S_i \neq \emptyset$, by strong induction on the dimensionality $k$. For the base case of $k = 2$, observe that
$$S_1 \cup S_2 = \{x : x \cdot v_{12} \geq B_{12} - 1 \text{ or } x \cdot v_{12} \leq 1 - B_{21}\},$$
so $S_1 \cup S_2 = D_2$ if and only if $B_{12} - 1 \leq 1 - B_{21}$, in which case the point $(1 - B_{12}/2, B_{12}/2) \in S_1 \cap S_2$, so the intersection is nonempty.

For $k > 2$, suppose $\bigcup_{i \in [k]} S_i = D_k$. Our goal is to show that $\bigcap_{j \in J} S_j \neq \emptyset$ for all $J \subseteq [k]$. Recall that
$$S_i = \bigcap_{j \in [k], j \neq i} \{x \in D_k : x \cdot v_{ij} \geq B_{ij} - 1\},$$
and define, for any $J \subseteq [k]$ and $i \in [k]$,
$$S^J_i := \bigcap_{j \in J, j \neq i} \{x \in D_k : x \cdot v_{ij} \geq B_{ij} - 1\}$$
(we keep the superscript $J$ fixed, as the discussion below focuses on a single $J$). We first show that $\bigcap_{i \in J} S_i \neq \emptyset$ for any $J \subseteq [k]$ with $|J| \leq k - 1$. By assumption, $D_k = \bigcup_{i \in [k]} S_i \subseteq \bigcup_{i \in [k]} S^J_i$, and we argue that $\bigcup_{i \in J} S^J_i = D_k$.
Suppose not; then let $z \notin \bigcup_{i \in J} S^J_i$, and consider the line
$$L := \left\{ z + a \sum_{i \notin J,\, j \in J} v_{ij} : a \in \mathbb{R} \right\}.$$
First, no part of this line is contained in $\bigcup_{i \in J} S^J_i$, because it contains the point $z$, and $L$ runs parallel to (and hence never crosses) each of the hyperplanes defining the $S^J_i$ for $i \in J$: for $i, j \in J$, $i \neq j$,
$$\left( \sum_{n \notin J,\, m \in J} v_{nm} \right) \cdot v_{ij} = \sum_{n \notin J,\, m \in J} \left( e_j \cdot e_m - e_j \cdot e_n - e_i \cdot e_m + e_i \cdot e_n \right) = (k - |J|) - 0 - (k - |J|) + 0 = 0.$$
Second, the line is only partially contained in $\bigcup_{i \notin J} S^J_i$, where $S^J_i = \bigcap_{j \in J} \{x \in D_k : x \cdot v_{ij} \geq B_{ij} - 1\}$ for $i \notin J$: for $i \notin J$ and $j \in J$,
$$\left( \sum_{n \notin J,\, m \in J} v_{nm} \right) \cdot v_{ij} = \sum_{n \notin J,\, m \in J} \left( e_j \cdot e_m - e_j \cdot e_n - e_i \cdot e_m + e_i \cdot e_n \right) = (k - |J|) - 0 - 0 + |J| = k > 0,$$
so $x \cdot v_{ij}$ is strictly monotone in $a$ along $L$, and points on $L$ with $a$ sufficiently far in one direction are contained in none of the $S^J_i$, $i \notin J$; together with the first observation, this contradicts the assumption that $\bigcup_{i \in [k]} S^J_i = D_k$.

Back to proving that $\bigcap_{i \in J} S_i \neq \emptyset$ for any $J \subseteq [k]$ with $|J| \leq k - 1$. Since $\bigcup_{i \in J} S^J_i = D_k$, by applying the inductive hypothesis to the $|J|$-dimensional instance derived from $\{S^J_i : i \in J\}$ by removing the axes $\{e_i : i \notin J\}$, we get a point $z \in \bigcap_{i \in J} S^J_i$. Using arguments similar to those above, it can be shown that the line $L := \{z + a \sum_{i \notin J, j \in J} v_{ij} : a \in \mathbb{R}\}$ is entirely contained in $\bigcap_{i \in J} S^J_i$, and partially in
$$\bigcap_{i \in J} S_i = \left( \bigcap_{i \in J} S^J_i \right) \cap \left( \bigcap_{i \in J,\, j \notin J} \{x \in D_k : x \cdot v_{ij} \geq B_{ij} - 1\} \right)$$
(for $i \in J$ and $j \notin J$, $x \cdot v_{ij}$ is strictly monotone in $a$ along $L$, so the additional constraints are all satisfied for $a$ sufficiently far in one direction), so the intersection is nonempty.

We have thus established that every intersection of a strict subset of $\{S_i\}_{i \in [k]}$ is nonempty, and we conclude with the Nerve theorem. We have $J \in \operatorname{Nrv}(\{S_1, \ldots, S_k\})$ for all $J \subseteq [k]$ with $1 \leq |J| \leq k - 1$. Because we assumed in the beginning that $\bigcup_{i \in [k]} S_i = D_k$, it must follow that $[k] \in \operatorname{Nrv}(\{S_1, \ldots, S_k\})$ as well: otherwise, the nerve is a $(k-1)$-dimensional simplex (each $n$-face represented by its $n + 1$ vertices) without its interior (represented by $[k]$), whose homology differs from that of $D_k$, yet $\bigcup_{i \in [k]} S_i \simeq \operatorname{Nrv}(\{S_1, \ldots, S_k\})$ by Theorem D.10 while $\bigcup_{i \in [k]} S_i = D_k$, which is a contradiction. Hence the nerve contains $[k]$, meaning $\bigcap_{i \in [k]} S_i \neq \emptyset$.

Figure 6. Illustration of the construction in the proof of Proposition D.4 for $k = 3$: when the $A_i$ do not intersect, one finds a point contained in the intersection of their complements and constructs a transport with lower cost, so the underlying transport is not optimal.

Proof of Proposition D.4. Recall the definitions of the objects $B_{ij}$, $C_i$, and $A_i$ in Equations (14) and (15) for $\gamma'$. Suppose $\bigcap_{i \in [k]} A_i = \emptyset$; then there exists $z \in \bigcap_{i \in [k]} (D_k \setminus A_i)$ by Proposition D.6. It then follows, by definition, that for every $i \in [k]$ there exists $j \neq i$ s.t. $z \cdot v_{ij} < B_{ij} - 1$. Let $u : [k] \to [k]$ denote a mapping s.t. the pairs $(i, u(i))$ satisfy this relation for all $i \in [k]$; since $u(i) \neq i$, there exists a nonempty $J \subseteq [k]$ s.t. the edges $\{(i, u(i)) : i \in J\}$ form a cycle. Also, there exist $\bar m > 0$ and measurable sets $F_i \subseteq \{x : x \cdot v_{i u(i)} > z \cdot v_{i u(i)}\} \cap C_i$ s.t. $\gamma'(F_i, e_i) =: m_i \geq \bar m$ (such sets exist by the minimality of $B_{i u(i)}$ in Equation (14)). We show that the coupling $\gamma'' \in \Gamma(p, q)$ given by
$$\gamma''(B, e_i) = \begin{cases} \gamma'(B, e_i) & \text{if } i \notin J, \\ \gamma'(B \cap (\Delta_k \setminus F_i), e_i) + \left(1 - \dfrac{\bar m}{m_i}\right) \gamma'(B \cap F_i, e_i) + \dfrac{\bar m}{m_{u^{-1}(i)}} \gamma'(B \cap F_{u^{-1}(i)}, e_{u^{-1}(i)}) & \text{else} \end{cases}$$
has a lower transportation cost than $\gamma'$ (see Figure 6 for an illustration):
$$\begin{aligned}
\int_{\Delta_k \times \mathcal{Y}} \|x - y\|_1 \,\mathrm{d}(\gamma'' - \gamma')(x, y) &= \sum_{i \in J} \frac{\bar m}{m_i} \int_{F_i} \left( \|x - e_{u(i)}\|_1 - \|x - e_i\|_1 \right) \mathrm{d}\gamma'(x, e_i) \\
&= \sum_{i \in J} \frac{\bar m}{m_i} \int_{F_i} \left( -2\, x \cdot v_{i u(i)} \right) \mathrm{d}\gamma'(x, e_i) \\
&< \sum_{i \in J} \frac{\bar m}{m_i} \int_{F_i} \left( -2\, z \cdot v_{i u(i)} \right) \mathrm{d}\gamma'(x, e_i) = -2\bar m \sum_{i \in J} z \cdot v_{i u(i)} = -2\bar m \sum_{i \in J} (z_{u(i)} - z_i) = 0,
\end{aligned}$$
where line 2 follows from Equation (8), and the last equality holds because the pairs $(i, u(i))$, $i \in J$, form a cycle. This contradicts the optimality of $\gamma'$.

Finally, we consider the complexity of the function class $\mathcal{G}_k$ defined in Equation (9). First, recall the definition of multi-class shattering, on which the Natarajan dimension is based (Shalev-Shwartz & Ben-David, 2014, Definitions 29.1 and 29.2):

Definition D.11 (Multi-Class Shattering). Let $\mathcal{H}$ be a class of functions from $\mathcal{X}$ to $\{1, \ldots, k\}$.
A set $S := \{x_1, \ldots, x_n\} \subseteq \mathcal{X}$ is said to be multi-class shattered by $\mathcal{H}$ if there exist labelings $f_0, f_1 : S \to \{1, \ldots, k\}$ satisfying $f_0(x_i) \neq f_1(x_i)$ for all $i \in [n]$, such that for all $S_0, S_1$ that partition $S$ (i.e., $S_0 \sqcup S_1 = S$), there exists $h \in \mathcal{H}$ s.t. $h(x) = f_0(x)$, $\forall x \in S_0$, and $h(x) = f_1(x)$, $\forall x \in S_1$.

Definition D.12 (Natarajan Dimension). Let $\mathcal{H}$ be a class of functions from $\mathcal{X}$ to $\{1, \ldots, k\}$. The Natarajan dimension of $\mathcal{H}$, denoted by $d_{\mathrm{N}}(\mathcal{H})$, is the size of the largest subset of $\mathcal{X}$ multi-class shattered by $\mathcal{H}$.

Proof of Theorem D.2. We associate $e_i$ with the label $i$, $i \in \{1, \ldots, k\}$. We first show that $d_{\mathrm{N}}(\mathcal{G}_k) \geq k - 1$ by constructing a set of cardinality $k - 1$ that is shattered by $\mathcal{G}_k$, then show that $d_{\mathrm{N}}(\mathcal{G}_k) < k$ by contradiction.

Lower Bound. Consider the set $S = \{e_1, e_2, \ldots, e_{k-1}\}$, and let $f_0(e_j) = j$ and $f_1(e_j) = k$ for all $j \in [k-1]$, which satisfy $f_0 \neq f_1$ on all $x \in S$. Let $S_0 \sqcup S_1 = S$ be arbitrary, and define
$$\iota(j) := \begin{cases} \mathbb{1}[e_j \in S_1] & \text{if } j \in [k-1], \\ 0 & \text{if } j = k. \end{cases}$$
Consider $g_z \in \mathcal{G}'_k$ with parameter
$$z := \sum_{j=1}^k \iota(j) \sum_{\ell \neq j} v_{\ell j} = \sum_{j=1}^k \iota(j) (k e_j - \mathbf{1}_k),$$
where boldface $\mathbf{1}_k \in \mathbb{R}^k$ denotes the vector of all ones. Observe that
$$z \cdot v_{nm} = \sum_{j=1}^k \iota(j) (k e_j - \mathbf{1}_k) \cdot (e_m - e_n) = k \sum_{j=1}^k \iota(j) \left( \mathbb{1}[m = j] - \mathbb{1}[n = j] \right) = k(\iota(m) - \iota(n)).$$
Recall from Equation (10) that for all $i, n \in [k]$, $g_z(e_i) = e_n$ is eligible $\iff e_i \cdot v_{nm} \leq z \cdot v_{nm}$, $\forall m \neq n$. In our case, it follows that for all $i \in [k-1]$,
$$g_z(e_i) = e_i \text{ is eligible} \iff -1 \leq k(\iota(m) - \iota(i)),\ \forall m \neq i,$$
and for $j \neq i$,
$$g_z(e_i) = e_j \text{ is eligible} \iff 1 \leq k(\iota(i) - \iota(j)) \text{ and } 0 \leq k(\iota(m) - \iota(j)),\ \forall m \neq i, j$$
(recall that $\iota(k) := 0$). Observe that for any $i \in [k-1]$: if $\iota(i) = 0$, then $g_z(e_i) = e_j$ is ineligible for all $j \neq i$ (the first condition would require $\iota(j) \leq -1/k$), so we must have $g_z(e_i) = e_i$. Otherwise, if $\iota(i) = 1$, then $e_k$ is always eligible, so $g_z(e_i) = e_k$ due to the tie-breaking rule. Therefore, $g_z$ is a witness function, and we conclude that $\mathcal{G}_k$ shatters $S$.

Upper Bound. Let $S = (x_1, \ldots, x_k)$ be given, along with $f_0, f_1 : S \to [k]$ satisfying $f_0(x) \neq f_1(x)$ for all $x \in S$, and suppose $\mathcal{G}_k$ shatters $S$. Let $g_z \in \mathcal{G}'_k$ denote a witness function for the partition $S_0 = S$, $S_1 = \emptyset$, and $g_{z'} \in \mathcal{G}'_k$ that for the partition $S_0 = \emptyset$, $S_1 = S$. We reuse an argument from an earlier proof: denote the decision boundaries of $g_z$ (analogously for $g_{z'}$) by $B_{ij} := z \cdot v_{ij} + 1$ and the decision regions by $C_i := \bigcap_{j \neq i} \{x \in \Delta_k : x \cdot v_{ij} \leq B_{ij} - 1\}$; by definition of $\mathcal{G}'_k$, $x \in C_i \implies g_z(x) = e_i$ is eligible. Define the difference in the boundaries between $g_z$ and $g_{z'}$ by $d_{ij} := B'_{ij} - B_{ij}$, and construct a directed graph on $k$ nodes where $(i, j)$ is an edge iff $d_{ij} > 0$, which is acyclic as shown in the proof of Theorem D.1.

First, consider the case where there exist $x, x' \in S$ s.t. $\{f_0(x), f_1(x)\} = \{f_0(x'), f_1(x')\}$. W.l.o.g., assume $i := f_0(x_1) = f_1(x_2)$ and $j := f_1(x_1) = f_0(x_2)$; then we have
$$g_z(x_1) = e_i,\ g_{z'}(x_1) = e_j \implies d_{ji} > 0, \qquad g_z(x_2) = e_j,\ g_{z'}(x_2) = e_i \implies d_{ij} > 0$$
(after taking the tie-breaking rule into account); however, this would imply a cycle in the graph, which is a contradiction. Next, if $\{f_0(x), f_1(x)\}$ differs for all $x \in S$, then we may assume w.l.o.g. that $f_0(x_i) = e_i$ and $f_1(x_i) = e_{i+1}$ for all $i \in [k]$ (where the index $k + 1$ means $1$). Then
$$g_z(x_i) = e_i,\ g_{z'}(x_i) = e_{i+1} \implies d_{i+1, i} > 0,\ \forall i \in [k];$$
again, this would imply a cycle in the graph, hence a contradiction. Therefore, we conclude that $\mathcal{G}_k$ cannot shatter any $S \subseteq \Delta_k$ of cardinality $k$.

In addition, on data distributions satisfying $Y = X \in \Delta_k$ (i.e., where the target is the input itself, as when measuring transport displacements), applying the $\ell_1$ loss $\ell(\hat y, y) := \|\hat y - y\|_1$ to $\mathcal{G}_k$ yields a function class with pseudo-dimension $k - 1$:

Definition D.13 (Pseudo-Shattering). Let $\mathcal{F}$ be a class of functions from $\mathcal{X}$ to $\mathbb{R}$. A set $\{x_1, \ldots, x_n\} \subseteq \mathcal{X}$ is said to be pseudo-shattered by $\mathcal{F}$ if there exist threshold values $t_1, \ldots, t_n \in \mathbb{R}$ s.t.
for all binary labels $b_1, \ldots, b_n \in \{0, 1\}$, there exists $f \in \mathcal{F}$ s.t. $\mathbb{1}[f(x_i) \geq t_i] = b_i$ for all $i \in [n]$.

Definition D.14 (Pseudo-Dimension). Let $\mathcal{F}$ be a class of functions from $\mathcal{X}$ to $\mathbb{R}$. The pseudo-dimension of $\mathcal{F}$, denoted by $d_{\mathrm{P}}(\mathcal{F})$, is the size of the largest subset of $\mathcal{X}$ pseudo-shattered by $\mathcal{F}$.

Theorem D.15. Define $\mathcal{F}_k := \{x \mapsto \|g(x) - x\|_1 : g \in \mathcal{G}_k\}$. We have $d_{\mathrm{P}}(\mathcal{F}_k) = k - 1$.

Proof. The proof shares the same arguments as that of Theorem D.2. We only show the upper bound, and remark that the lower bound can be established using a construction similar to that in Theorem D.2. Let $x_1, \ldots, x_k$ be given, and suppose there exist thresholds $t_1, \ldots, t_k$ s.t. $\mathcal{F}_k$ pseudo-shatters the set of points. It follows that there exist $g_z, g_{z'} \in \mathcal{G}_k$ s.t. $\|g_{z'}(x_i) - x_i\|_1 < t_i \leq \|g_z(x_i) - x_i\|_1$ for all $i$, which means that $g_z(x_i) \neq g_{z'}(x_i)$ for all $i$. But by the arguments in the proof of the upper bound of Theorem D.2, such a pair $(g_z, g_{z'})$ does not exist, contradicting the shattering assumption.

Finally, for all $i \in \{1, \ldots, k\}$, define the restriction of $\mathcal{G}_k$ to class $i$ by
$$\mathcal{G}_{k,i} := \{x \mapsto \mathbb{1}[g(x) = e_i] : g \in \mathcal{G}_k\} \subseteq \{0, 1\}^{\Delta_k}. \tag{17}$$
Because this is a class of binary functions, its VC dimension, denoted by $d_{\mathrm{VC}}(\mathcal{G}_{k,i})$, is equivalent to its Natarajan dimension by Definition D.12; moreover, because $\mathcal{G}_{k,i}$ is derived from $\mathcal{G}_k$ by an output remapping, its Natarajan dimension is clearly upper bounded by that of the latter:

Corollary D.16. $d_{\mathrm{VC}}(\mathcal{G}_{k,i}) \leq d_{\mathrm{N}}(\mathcal{G}_k) = k - 1$.

E. Experiment Details

E.1. Datasets and Tasks

Adult (Kohavi, 1996). We consider the binary classification task of whether the annual income of an individual is over or under $50k per year ($|\mathcal{Y}| = 2$) given attributes including gender, race, age, and education level. The data are collected from the 1994 US Census. We let gender be the sensitive attribute ($|\mathcal{A}| = 2$). The dataset contains 48,842 examples in total, which we split for pre-training/post-processing/testing by 0.35/0.35/0.3.

ACSIncome (Ding et al., 2021). This is an extension of the Adult dataset with data collected from the US Census Bureau. We consider income prediction under two settings. In the binary setting, the task is to predict whether the annual income of an individual is over or under $50k per year ($|\mathcal{Y}| = 2$), with gender as the sensitive attribute ($|\mathcal{A}| = 2$), same as in Adult. In the multi-group, multi-class setting, we create five income buckets of <$15,000, [$15,000, $30,000), [$30,000, $48,600), [$48,600, $78,030), and >=$78,030, and group the data into five race categories of "American Indian or Alaska Native alone", "Asian, Native Hawaiian, or Other Pacific Islander alone", "Black or African American alone", "White alone", and "Other". The dataset contains 1,664,500 examples, which we split for pre-training/post-processing/testing by 0.63/0.07/0.3.

Bias Bios (De-Arteaga et al., 2019). The task is to determine the occupation ($|\mathcal{Y}| = 28$) of female and male individuals ($|\mathcal{A}| = 2$) from their raw text biographies. The data are mined from the Common Crawl corpus. In this dataset, gender is the sensitive attribute, and is observed to correlate with certain occupations such as software engineer and nurse. We use the version of Bias Bios scraped and hosted by Ravfogel et al. (2020), with 393,423 examples in total, which we split for pre-training/post-processing/testing by 0.35/0.35/0.3. This experiment is of particular interest because of the increasing popularity of large language models and the fairness concerns regarding their usage.
In particular, the uncurated corpora (e.g., crawled from the internet) on which the language models are pre-trained may contain historical social bias, and empirical investigations have shown that such bias can be propagated and amplified in downstream applications (Bolukbasi et al., 2016; Zhao et al., 2018; Abid et al., 2021).

Communities & Crime (Redmond & Baveja, 2002). The Communities & Crime tabular dataset contains the socioeconomic and crime data of communities in 46 US states, and the task is to predict the number of violent crimes per 100k population given attributes ranging from the racial composition of the community and their income and background, to law enforcement resources. The data come from the 1990 US Census, the 1990 LEMAS survey, and the 1995 FBI Uniform Crime Reporting program. We bin the rate of violent crime into five classes ($|\mathcal{Y}| = 5$), and we treat race as the sensitive attribute by the presence of minorities ($|\mathcal{A}| = 4$): a community does not have a significant presence of minorities if White makes up more than 95% of the population; otherwise, the largest minority group (Asian, Black, or Hispanic) is considered to have a significant presence. The dataset contains 1,994 examples, which we split for pre-training/post-processing/testing by 0.35/0.35/0.3.

E.2. Additional Details

On each task, we first create the pre-training split from the dataset and train a linear logistic regression scoring model using the implementation provided in scikit-learn (Pedregosa et al., 2011). Then, we randomly split the remaining data for post-processing and testing with 10 different seeds, and aggregate the results as presented in Figures 4 and 7 (the pre-trained model remains the same). The linear programs of our Algorithm 2 are implemented using the interface of cvxpy (Diamond & Boyd, 2016), and are solved using the COIN-OR Cbc solver, which is based on the branch-and-cut method (Forrest et al., 2023). Finally, the BERT model in the Bias Bios experiments is loaded through the Hugging Face Transformers library (Wolf et al., 2020).

Hyperparameters. The tradeoff curves in Figures 4 and 7 are generated with the following fairness tolerance/strictness settings. For our method, $\alpha$ is set to:

ACSIncome (binary). 0.2, 0.18, 0.16, 0.14, 0.12, 0.1, 0.08, 0.06, 0.04, 0.02, 0.01, 0.005, 0.001, 0.0.

ACSIncome (5-group, 5-class). 0.32, 0.3, 0.28, 0.26, 0.24, 0.22, 0.2, 0.18, 0.16, 0.14, 0.12, 0.1, 0.08, 0.06, 0.04, 0.02, 0.01, 0.0.

Bias Bios. 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0.008, 0.006, 0.004, 0.002, 0.001, 0.0.

Adult. 0.16, 0.14, 0.12, 0.1, 0.08, 0.06, 0.04, 0.02, 0.01, 0.008, 0.006, 0.004, 0.002, 0.001, 0.0.

Communities. 0.6, 0.55, 0.5, 0.45, 0.4, 0.35, 0.3, 0.25, 0.2, 0.15, 0.1, 0.05, 0.01, 0.0.

For Fair Projection, we use the default settings that came with the code/package; in particular, increasing the number of iterations beyond 1,000 did not improve performance. The tolerance is set to:

ACSIncome (binary). 0.3, 0.2, 0.18, 0.16, 0.14, 0.12, 0.1, 0.08, 0.06, 0.04, 0.02, 0.01, 0.005, 0.001, 0.0.

ACSIncome (5-group, 5-class). 0.5, 0.4, 0.35, 0.3, 0.25, 0.2, 0.15, 0.1, 0.08, 0.06, 0.04, 0.02, 0.01, 0.0.

Bias Bios. 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.0.

Adult. 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0.

Communities. 1.0, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1, 0.0.
E.3. Finding a Feasible Point on Line 7

Line 7 of Algorithm 2 involves finding a feasible point in the intersection of halfspaces, which can be obtained with the following linear program:
$$\min_{z \in \mathbb{R}^k}\ 0 \quad \text{s.t.} \quad z \cdot v_{ij} \geq B_{ij} - 1, \quad \forall i, j \in [k],\ i \neq j.$$
As illustrated in Figure 3, the point $z$ that is returned determines the center of the boundaries of the extracted transport maps $T_a \in \mathcal{G}_k$. Because of the machine learning folklore that classifiers with larger margins enjoy better generalization properties, in our experiments we instead use the following quadratic program (a least-squares problem) that maximizes the margins for point-finding:
$$\min_{z \in \mathbb{R}^k} \sum_{i \neq j} \left( \frac{z \cdot v_{ij} - (B_{ij} - 1)}{2 - B_{ij} - B_{ji}} - \frac{1}{2} \right)^2 \quad \text{s.t.} \quad z \cdot v_{ij} \geq B_{ij} - 1, \quad \forall i, j \in [k],\ i \neq j,$$
which pushes each $z \cdot v_{ij}$ toward the midpoint of the feasible slab $[B_{ij} - 1, 1 - B_{ji}]$ of width $2 - B_{ij} - B_{ji}$. Our preliminary experiments showed that using the quadratic program for point-finding led to better post-processing performance with Algorithm 2 than using the linear program, both in terms of the error rate and DP during inference.

Table 1. Running time (in seconds) of post-processing algorithms under the strictest tolerance setting (see Appendix E.2), averaged over three random splits. Our algorithm is run on a single core of an Intel Xeon Silver 4314 CPU, and Fair Projection is run on an NVIDIA RTX A6000 GPU.

|                                  | ACSIncome | ACSIncome | Bias Bios | Adult  | Communities |
|----------------------------------|-----------|-----------|-----------|--------|-------------|
| Groups                           | 2         | 5         | 2         | 2      | 4           |
| Classes                          | 2         | 5         | 28        | 2      | 5           |
| Examples (post-processing split) | 116,515   | 116,515   | 137,698   | 17,095 | 698         |
| Ours (CPU)                       | 2.56      | 109       | 797       | 0.43   | 0.38        |
| Fair Projection-KL (GPU)         | 33        | 38        | 99        | 15     | 13          |

Figure 7. Tradeoff curves between accuracy and $\Delta_{\mathrm{DP}}$ (Definition 2.1; x-axis: violation of demographic parity, y-axis: classification accuracy) on Adult (2 genders, 2 classes) and Communities (4 races, 5 classes). The scoring model is logistic regression. Error bars indicate the standard deviation over 10 runs with different random splits. Running time is reported in Table 1. On both datasets, Fair Projection-KL and Fair Projection-CE have similar results.
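For concreteness, the following sketch, ours and not the released code at https://github.com/rxian/fair-classification, implements both point-finding programs of Appendix E.3 with cvxpy; the function name and the reading of the quadratic objective as a normalized distance to the midpoint of each feasible slab are our own.

```python
# A sketch (ours) of the point-finding programs in Appendix E.3. B is the
# (k, k) matrix of offsets B_ij from Equation (14); the diagonal is unused.
# Assumes B comes from an optimal transport, so each slab width is nonnegative.
import cvxpy as cp
import numpy as np

def find_center(B, maximize_margin=True):
    k = B.shape[0]
    z = cp.Variable(k)
    constraints, residuals = [], []
    for i in range(k):
        for j in range(k):
            if i == j:
                continue
            v_ij = np.zeros(k)
            v_ij[j], v_ij[i] = 1.0, -1.0            # v_ij = e_j - e_i
            slack = z @ v_ij - (B[i, j] - 1)        # distance above the boundary
            constraints.append(slack >= 0)          # z . v_ij >= B_ij - 1
            gap = 2 - B[i, j] - B[j, i]             # width of the feasible slab
            residuals.append((slack / gap - 0.5) ** 2)  # distance to midpoint
    objective = cp.Minimize(sum(residuals)) if maximize_margin else cp.Minimize(0)
    cp.Problem(objective, constraints).solve()
    return z.value
```

With `maximize_margin=False` this reduces to the feasibility LP; with the default, it centers $z$ within the slabs, matching the margin-maximizing variant used in our experiments.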