# Fair Wrapping for Black-box Predictions

Alexander Soen (Australian National University) alexander.soen@anu.edu.au
Ibrahim Alabdulmohsin (Google Research) ibomohsin@google.com
Sanmi Koyejo (Google Research; Stanford University) sanmik@google.com
Yishay Mansour (Google Research; Tel Aviv University) mansour@google.com
Nyalleng Moorosi (Google Research) nyalleng@google.com
Richard Nock (Google Research; Australian National University) richardnock@google.com
Ke Sun (Australian National University; CSIRO's Data61) sunk@ieee.org
Lexing Xie (Australian National University) lexing.xie@anu.edu.au

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

We introduce a new family of techniques to post-process ("wrap") a black-box classifier in order to reduce its bias. Our technique builds on the recent analysis of improper loss functions, whose optimization can correct any twist in prediction, unfairness being treated as a twist. In the post-processing, we learn a wrapper function, which we define as an α-tree, which modifies the prediction. We provide two generic boosting algorithms to learn α-trees. We show that our modification has appealing properties in terms of composition of α-trees, generalization, interpretability, and KL divergence between modified and original predictions. We exemplify the use of our technique on three fairness notions: conditional value-at-risk, equality of opportunity, and statistical parity; and provide experiments on several readily available datasets.

1 Introduction

The social impact of Machine Learning (ML) has seen a dramatic increase over the past decade, enough so that the bias of model outputs must be accounted for alongside accuracy [1, 14, 35]. Considering the number of fairness targets [20] and the energy and CO2 footprint of ML [19, 28], the combinatorics of training accurate and fair models is non-trivial. This is especially so given the inherent incompatibilities of fairness constraints [17] and the underlying tension of satisfying fairness whilst maintaining accuracy. One trend in the field "decouples" the two constraints by post-processing pretrained (accurate) models to achieve fairer outputs [35]. Post-processing may be the only option if we have no access to the model's training data, algorithm, hyperparameters, etc. Within the post-processing approach, three trends have emerged: learning a new fair model close to the black-box, tweaking the output subject to fairness constraints, and exploiting sets of classifiers. If the task is class probability estimation [24], the estimated black-box is an accurate but potentially unfair posterior $u : \mathcal{X} \to [0,1]$ which can be neither opened nor trained further. The goal is then to learn a fair posterior f from it. In addition to the black-box constraint, a number of desiderata can be considered for post-processing. Ideally, in correcting a black-box, we would want the approach to have (flexibility) in satisfying substantially different fairness criteria, (proximity) of the learnt f to the original u, and meaningful (composability) properties if, e.g., f were later treated as a new black-box to be post-processed.

Figure 1: Summary of using different α-correction wrappers to obtain different fairness criteria guarantees (accuracy, CVaR (Thm. 1), EOO (Thm. 5)). See Sections 5 and 6 for full details on guarantees.
We may also want the representation of the correction to support (explainability), for auditing the post-processing procedure, and bounds on the increased model (complexity) of the final classifier f. Algorithmically, when training the correction, we would also want (convergence) guarantees. Our contribution satisfies the aforementioned desiderata in its correction, representation, and algorithmic guarantees. By leveraging recent theory on improper loss functions, we utilize a universal correction of black-box posteriors defined by the α-loss. This allows for a flexible correction which yields divergence bounds between f and u, a convenient form for the Rademacher complexity of the class of f, and a simple composability property. Representation-wise, the corrections we learn are easy-to-interpret tree-shaped functions that we define as α-trees. Algorithmically speaking, we provide two formal boosting algorithms to learn α-trees, building upon seminal results [15]. We demonstrate our algorithm for conditional value-at-risk (CVaR) [32], equality of opportunity (EOO), and statistical parity, as depicted in Fig. 1. Experiments are provided against five baselines on readily available datasets. All proofs and more experiments are in an Appendix denoted as SI.

2 Related Work

Post-processing models to achieve fairness is one of three different categories in tackling the ML + fairness challenge [35, Section 6.2]. Although other notions exist, e.g., individual fairness [12], we limit our analysis to group fairness, which concerns itself with ensuring that statistics of subpopulations are similar. We further segment this cluster into three subsets: (I) approaches learning a new model with two constraints, being close to the pretrained model and being fair [16, 22, 31, 34]; (II) approaches biasing the output of the pretrained model at classification time, modifying observations for fairer outcomes [1, 14, 18, 21, 33, 34]; and (III) techniques exploiting sets of models to achieve fairness [11]. None of these approaches formulates substantial guarantees on all of the desiderata in the introduction. Some bring contributions with the (flexibility) of being applicable to more than two fairness notions [7, 31, 11, 34]. Two of these provide the convenience of analytic conditions for new fairness notions to fit into the approach [31, 11]. However, for all of them, the algorithmic price-tag is unclear [7, 11] or heavily depends on convex optimization routines [31]. [1, 34] provide strong guarantees regarding (proximity), w.r.t. consistency and generalization. To our knowledge, our approach of correcting prediction unfairness through improper losses [29] is new.

3 Setting and Motivating Example

Let $\mathcal{X}$ be a domain of observations, $\mathcal{Y} \doteq \{-1, 1\}$ the labels, and S a sensitive attribute in $\mathcal{X}$. We assume that the modalities of S induce a partition of $\mathcal{X}$. We further let D denote the joint measure over $\mathcal{X} \times \mathcal{Y}$, M the marginal measure over $\mathcal{X}$, and $P[Y = 1]$ the prior. We denote conditioning of M through a subscript, e.g., $M_s$ for $s \in S$ denotes M conditioned on the sensitive attribute subgroup S = s. We leave the σ-algebra implicit (it is assumed to be the same everywhere). As is often assumed in ML, sampling is i.i.d.; we make no notational distinction between empirical and true measures to simplify exposition, as most of our results apply to both.
Consider the task of learning a function $\hat\eta \in [0, 1]^{\mathcal{X}}$ to estimate the true posterior $\eta(x) = P[Y = 1 \mid X = x]$ in binary classification. For instance, we may want to predict the probability of hiring an applicant for a company. Given the pointwise loss $L(\hat\eta(X), \eta(X))$, which determines the loss of a single example, the (total) risk is defined as
$$L(\hat\eta; M, \eta) \doteq \mathbb{E}_{X \sim M}[L(\hat\eta(X), \eta(X))] \quad (1)$$
(with slight abuse of notation). A low risk corresponds to good classification performance. In this paper, we consider the risk determined by the log-loss:
$$L(\hat\eta; M, \eta) \doteq \mathbb{E}_{X \sim M}[-\eta(X) \log \hat\eta(X) - (1 - \eta(X)) \log(1 - \hat\eta(X))]. \quad (2)$$

We now consider a simple fairness problem, centered around the example of Fig. 2. Suppose that we are given a black-box u which predicts hiring probabilities without considering fairness. Although the minimized total risk of (1) might be small, there can be discrepancies in performance between different subgroups. Instead of considering the total risk, the predictive performance of specific subgroups can be examined through the subgroup risk $L(u; M_s, \eta)$, for a subgroup $s \in S$. For instance, we might want to examine the discrepancy of subgroup risks across the ages of applicants. A natural fairness task would be to improve the worst-performing subgroup, say $s_w \in S$. To post-process unfairness, we want to learn a function $\alpha : \mathcal{X} \to \mathbb{R}$ which wraps u and lowers the worst subgroup risk. We propose the following wrapping f, inspired by improper loss functions [29]:
$$f(x) \doteq \frac{u(x)^{\alpha(x)}}{u(x)^{\alpha(x)} + (1 - u(x))^{\alpha(x)}} \in [0, 1]. \quad (3)$$
Notice that when $\alpha(x) = 1$ the resulting posterior is the original u(x). Importantly, (3) is flexible enough to transform any input black-box u into any needed f. Looking at Fig. 2 (left and middle), intuitively, by setting different $\alpha(x)$ values, (3) sharpens (yellow, $\alpha > 1$), dampens (blue, $0 < \alpha < 1$), or polarity-reverses (red, $\alpha < 0$) the original posterior u. To improve fairness, we need a combination of these corrections to accommodate different subsets of the input domain (thus learning $\alpha(\cdot)$ as a function). We specifically learn $\alpha(\cdot)$ to be a tree structure, which allows an interpretable correction alongside other formal properties (Section 5).

Figure 2: Improving CVaR for a toy hiring task with α-trees. An α-tree (left) transforms the posterior via (3) (middle), resulting in an input-dependent fairness correction of the posterior u (right). Curves compare the ground truth with u and f on the subgroups Age ≥ 25, Age < 25 with z < 0, and Age < 25 with z ≥ 0, where $z \in [-1, 1]$ is a noise feature.

Definition 1. An α-tree is a rooted, directed binary tree, with internal nodes labeled with observation variables. Outgoing arcs are labeled with tests over the node's variable. Leaves are real-valued. $\Lambda(T)$ is the leafset of α-tree T. An α-tree T induces a correction over posteriors as per (3), with $\alpha(x)$ given by the value of the leaf reached by x in T.

Our fairness problem is now learning an α-tree which provides a corresponding correction that improves the worst subgroup risk, i.e., $L(u; M_{s_w}, \eta) > L(f; M_{s_w}, \eta)$. The entirety of Fig. 2 presents such a process. In this hiring task example, the ground-truth hiring rate is constant w.r.t. the inputs, $\eta(x) = 0.7$. Despite this, the black-box $u(\cdot)$ unfairly depends on the age of applicants and incorrectly depends on a noise feature z. By choosing $\alpha(x)$ as per Fig. 2 (left), the correction moves u closer to $\eta$, improving the risk of the worst subgroup (alongside the other subgroup in this example): the worst-case loss improves from 1.09 to 0.62.
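To make the correction concrete, here is a minimal Python sketch of the pointwise transform in (3); the function name `alpha_correct` is ours and not from the paper's code base, and the black-box values are assumed to lie strictly inside (0, 1) so the powers are numerically safe.

```python
import numpy as np

def alpha_correct(u, alpha):
    """Pointwise alpha-correction of a posterior, as in (3).

    u     : black-box posterior value(s), assumed in (0, 1)
    alpha : correction value(s); alpha = 1 leaves u unchanged, alpha > 1
            sharpens, 0 < alpha < 1 dampens, alpha < 0 reverses polarity.
    """
    u = np.asarray(u, dtype=float)
    num = u ** alpha
    return num / (num + (1.0 - u) ** alpha)

print(alpha_correct(0.8, 1.0))   # 0.8   (unchanged)
print(alpha_correct(0.8, 2.0))   # ~0.94 (sharpened)
print(alpha_correct(0.8, 0.5))   # ~0.67 (dampened)
print(alpha_correct(0.8, -1.0))  # 0.2   (polarity reversed)
```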
Although the fairness criterion discussed so far might be considered simple, the procedure of iteratively improving the worst-performing subgroup can be used to improve the conditional value-at-risk (lower is better) fairness criterion [32]:
$$\mathrm{CVaR}_\beta(f) \doteq \mathbb{E}_S[L(f; M_S, \eta) \mid L(f; M_S, \eta) \ge L_\beta], \quad (4)$$
where $L_\beta$ is the risk value at the β quantile among subgroups, which is user-defined. The difference is that $\mathrm{CVaR}_\beta$ not only considers the worst-case subgroup, but all subgroups above the $L_\beta$ risk value (call these subgroups $S_\beta$). One can simply iteratively improve all $s \in S_\beta$, as done in the example of Fig. 2. Indeed, in the example, $\mathrm{CVaR}_\beta$ with β = 0.9 improves from 1.09 to 0.62 (which is equivalent to the worst-case loss in this case).

Algorithm 1 TOPDOWN($M_t$, $\eta_t$, $T_0$, B)
Input: mixture $M_t$, target posterior $\eta_t$, initial α-tree $T_0$, $B \in \mathbb{R}_+$;
Step 1: $T \leftarrow T_0$;
Step 2: while stopping condition not met do
  Step 2.1: pick leaf $\lambda^* \in \Lambda(T)$;  // i.e., heaviest leaf
  Step 2.2: $h^* \leftarrow \arg\min_{h \in \mathcal{H}} H(T(\lambda^*, h); M_t, \eta_t)$;
  Step 2.3: $T \leftarrow T(\lambda^*, h^*)$;  // split using $h^*$ at $\lambda^*$
Step 3: label leaves, $\forall \lambda \in \Lambda(T)$:
$$\alpha_\lambda \leftarrow \frac{1}{B}\log\frac{1 + e(M_\lambda, \eta_t)}{1 - e(M_\lambda, \eta_t)} \quad \text{(α-value).} \quad (5)$$

Figure 3: Picking $T_0$ as a stump on the fairness attribute (one branch per modality, e.g., S = s vs. S = s′) allows to finely tune the growth of the sub-α-trees to the fairness criterion at hand.

4 Growing Alpha-Trees

We now introduce the procedure to grow an α-tree via a boosting algorithm, TOPDOWN (Algorithm 1). TOPDOWN can be thought of as a generalization of standard decision tree induction for classification [15]. We first introduce relevant concepts from decision tree induction to explain TOPDOWN, and we contextualize TOPDOWN through its application to improving the CVaR criterion. We first introduce a technical assumption on the black-box u to be post-processed:

Assumption 1. The black-box prediction is bounded away from the extremes: $\exists B > 0$ such that
$$u(X) \in I \doteq [(1 + \exp(B))^{-1},\ (1 + \exp(-B))^{-1}] \quad \text{(a.s.)}. \quad (6)$$

Compliance with Assumption 1 can be achieved by clipping the black-box's output with a user-fixed B, or by making sure it is calibrated and then finding B.

Entropy-based updates for fairness: An important component in standard decision tree induction is the edge function, which measures the label purity (proportion of positive examples) of a decision tree node. We introduce a generalization which considers the alignment purity of a black-box.

Definition 2. Let $\lambda(u) \doteq \log(u/(1-u))$ be the logit of $u \in (0,1)$ and $\tilde\lambda(u) \doteq \lambda(u)/B$ a normalization, which satisfies $\tilde\lambda(I) = [-1, 1]$. The alignment edge of $M_t$ and $\eta_t$ is defined as
$$e(M_t, \eta_t) \doteq \mathbb{E}_{(X,Y) \sim D_t}[Y\,\tilde\lambda(u(X))], \quad (7)$$
where $D_t$ is the joint measure induced by $M_t$ and $\eta_t$. With Assumption 1, $e(M_t, \eta_t) \in [-1, 1]$.

By replacing the normalized logit with a constant 1, (7) reduces to the measure of label purity used in regular classification. In our case, (7) measures how well the black-box u aligns with the true labels Y through the logit. This also takes into account the confidence of the black-box's predictions: for a high alignment purity, predictions not only need to be correct but also highly confident (u close to the endpoints of I).
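As an illustration of Definition 2, the sketch below computes an empirical plug-in estimate of the alignment edge from labeled data and a clipped black-box; the helper names (`clip_posterior`, `normalized_logit`, `alignment_edge`) and the optional `weights` argument are our own conventions, not the paper's API.

```python
import numpy as np

def clip_posterior(u, B):
    """Clip a black-box posterior into the interval I of Assumption 1 (eq. (6))."""
    lo, hi = 1.0 / (1.0 + np.exp(B)), 1.0 / (1.0 + np.exp(-B))
    return np.clip(u, lo, hi)

def normalized_logit(u, B):
    """Normalized logit of Definition 2: log(u / (1 - u)) / B, lands in [-1, 1]."""
    return np.log(u / (1.0 - u)) / B

def alignment_edge(u, y, B, weights=None):
    """Empirical alignment edge (7): average of Y * normalized_logit(u(X)).

    y takes values in {-1, +1}; `weights` (optional) stands in for the measure
    D_t restricted to the examples at hand (e.g., those reaching a leaf).
    """
    z = y * normalized_logit(clip_posterior(u, B), B)
    return np.average(z, weights=weights)
```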
Similar to how the splits of a decision tree classifier are determined by the entropy of the tree's label purity, an α-tree splits based on its alignment entropy.

Definition 3. Given an α-tree T with leafset $\Lambda(T)$, when Assumption 1 is satisfied, the entropy of T is
$$H(T; M_t, \eta_t) \doteq \mathbb{E}_{\lambda \sim M_{\Lambda(T)}}[H_1(\lambda; M_t, \eta_t)], \quad (8)$$
where $H(q) \doteq -q \log(q) - (1-q)\log(1-q)$, $H_1(\lambda; M_t, \eta_t) \doteq H((1 + e(M_\lambda, \eta_t))/2)$, $M_\lambda$ is $M_t$ conditioned to leaf $\lambda \in \Lambda(T)$, and $M_{\Lambda(T)}$ is the measure induced on $\Lambda(T)$ by the leaf weights under $M_t$.

Theorem 1. For any $M_t, \eta_t$, let T's leaves follow (5). Then $L(f; M_t, \eta_t) \le H(T; M_t, \eta_t)$.

Algorithm 1 can now be explained by repeatedly leveraging Theorem 1. Suppose that we have a hypothesis set $\mathcal{H}$ of possible splits to grow our α-tree. Denote $T(\lambda, h)$ the α-tree T split at leaf $\lambda \in \Lambda(T)$ using test h. The inner loop of Step 2 finds the best possible leaf split, which helps to minimize the α-tree's entropy and thus to reduce the risk as per Theorem 1. The α-values of (5), computed in Step 3, are those used to ensure Theorem 1 holds. By setting $M_t \leftarrow M_{s_w}$ and $\eta_t \leftarrow \eta$, Algorithm 1 improves CVaR by iteratively improving the α-tree's worst subgroup entropy $H(T; M_{s_w}, \eta)$, which as a surrogate improves the worst subgroup risk $L(f; M_{s_w}, \eta)$. To accommodate different β quantile values for CVaR, TOPDOWN can be run repeatedly (replacing the initial input tree $T_0$) to progressively improve all $s \in S_\beta$. Hence, to reduce CVaR, we basically

use TOPDOWN with $M_t \leftarrow M_s$ ($s \in S_\beta$) and $\eta_t \leftarrow \eta$. (CVaR)

As alluded to by the notation used to instantiate TOPDOWN, the inputs of the algorithm can be instantiated to optimize fairness criteria beyond CVaR; this is discussed in Section 6. In the usual ML setting, $M_t, \eta_t$ can be estimated from a training sample (see Section 7).

Initialization: In the procedure of improving CVaR, the worst subgroups can be iteratively improved. However, we also need to make sure that the improvement of a subgroup does not adversely affect another subgroup (which could potentially harm CVaR instead). As such, we introduce additional structure to the α-tree by tweaking the initial tree $T_0$ used in TOPDOWN. Since the fairness attribute S partitions the dataset, a convenient choice is to initialize the α-tree by splitting on the subgroup modalities, as depicted in Fig. 3. We thus grow separate sub-α-trees for each of the sensitive modalities. For CVaR, this allows the subgroup risk of individual subgroups to be tweaked without adversely affecting other subgroups.
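To make the greedy criterion of Step 2.2 concrete, here is a small empirical sketch: for a candidate test h at a leaf, the entropy of Definition 3 is evaluated on the two children, weighted by their empirical mass. It reuses `alignment_edge` from the previous sketch; the function names are ours and the leaf weights are approximated by sample counts.

```python
import numpy as np

def leaf_entropy(edge):
    """H1 of Definition 3: binary entropy of (1 + alignment edge) / 2."""
    q = np.clip((1.0 + edge) / 2.0, 1e-12, 1.0 - 1e-12)
    return -q * np.log(q) - (1.0 - q) * np.log(1.0 - q)

def split_entropy(u, y, test_mask, B):
    """Entropy of a leaf after a candidate binary split, as minimized in Step 2.2.

    u, y      : black-box posteriors and {-1, +1} labels of examples at the leaf
    test_mask : boolean outcome of the candidate test h on those examples
    """
    H, n = 0.0, len(y)
    for side in (test_mask, ~test_mask):
        if side.sum() == 0:
            continue
        e = alignment_edge(u[side], y[side], B)   # from the previous sketch
        H += (side.sum() / n) * leaf_entropy(e)
    return H

# TOPDOWN's inner loop would pick the test h minimizing split_entropy at the
# chosen leaf, then relabel all leaves with their alpha-values as in Step 3.
```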
5 Formal Properties

We move to the formal properties of our approach. We first detail the background on improper loss functions which motivates the correction given in Section 3, and then present the formal properties of this correction. The useful properties of having $\alpha(\cdot)$ represented by a tree structure are then discussed. Finally, we present a convergence analysis of Algorithm 1 and an alternative boosting scheme.

Can f as per (3) correct (any) potential unfairness? Yes. In short, this comes from recent theory on improper loss functions for class probability estimation (CPE) [29]. We are interested in the pointwise minimizer (eventually set-valued) of
$$t(\eta) \doteq \arg\inf_{u \in [0,1]} L(u, \eta). \quad (9)$$
Dubbed the Bayes tilted estimate of a loss [29], $t(\eta)$ is the set of optimal responses given a ground-truth (pointwise) posterior $\eta$. Common loss functions are proper: the ground-truth value $\eta \in t(\eta)$ is an optimal response. However, in the case where $\eta$ cannot be trusted (for instance when it is unfair), we may not want to default to imitating $\eta$. In addition, we also want to make sure that the Bayes tilted estimate can fit any desired (in our context, fair) target. The so-called α-loss, which generalizes the (proper) log-loss, is a good candidate, parameterized by a variable α. Its Bayes tilted estimate is the pointwise version of (3): for $\alpha \notin \{0, 1\}$ and $\eta \neq 1/2$,
$$t_\alpha(\eta) = \left\{\frac{\eta^\alpha}{\eta^\alpha + (1-\eta)^\alpha}\right\}. \quad (10)$$
Importantly for α-losses, for any $\eta \notin \{0, 1/2, 1\}$ and any $\eta' \in (0, 1)$, there exists $\alpha \in \mathbb{R}$ in (10) such that $t_\alpha(\eta) = \{\eta'\}$. This property, called twist-properness [29], allows for any pointwise correction. By extending α to a function (of $x \in \mathcal{X}$, as per (3)), twist-properness ensures that, given an initial unfair posterior, an appropriately learned $\alpha(\cdot)$ can correct any unfairness. This allows for (flexible) fairness post-processing: different α functions can be learned for different criteria (cf. Fig. 1).

Why use the log-loss? As per Section 3, we minimize the log-loss. We choose the log-loss for two reasons: it is strictly proper, so minimizing $L(f; M_t, \eta_t)$ (i.e., via TOPDOWN) pushes f towards the target $\eta_t$; and it is the α-loss for α = 1, so we are guaranteed that, at the minimizer, $f \to u \Leftrightarrow \alpha \to 1$. With an alternative (i.e., non-strictly proper) loss, we might only retain one direction of this equivalence.

Do we have guarantees on some proximity of f with respect to u? Yes, with light assumptions. We examine the (proximity) of the black-box and post-processed posteriors with the KL divergence [2]:
$$\mathrm{KL}(u, f; M) \doteq \mathbb{E}_{(X,Y) \sim D_u}\left[\log\frac{\mathrm{d}D_u((X,Y))}{\mathrm{d}D_f((X,Y))}\right], \quad (11)$$
where $D_u, D_f$ are the product measures defined from M and their respective posteriors. To bound the proximity (11), we present setting (S1).

(S1) Assumption 1 holds for some $0 < B \le 3$ and the function α satisfies $|\alpha(x) - 1| \le 1/B$ (a.s.).

This setting leads to the following data-independent proximity bound.

Theorem 2. For any M, (S1) implies $\mathrm{KL}(u, f; M) \le (1 + B^2)/(6\,(2 + \exp(B) + \exp(-B)))$.

As an example, fix B = 3 for (S1). In this case, we want $\alpha(\cdot) \in [2/3, 4/3]$ (a.s.), a reasonably sized interval centered at 1. The clamped black-box posterior's interval is approximately [0.04, 0.96], which is quite flexible, and the distortion is upper bounded as $\mathrm{KL}(u, f; M) \le 7.5 \cdot 10^{-2}$.

Is the composition of transformations meaningful? Yes. The analytical form in (3) brings the following easy-to-check (composability) property.

Lemma 1. The composition of any two wrapping transformations $u \mapsto f \mapsto f'$ following (3), with respective corrections α and α′, is equivalent to the single transformation of u with correction $\alpha' \cdot \alpha$.

This gives an invertibility condition: wrapping f with $\alpha' = 1/\alpha$ recovers the original black-box u.
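The following numerical check of Lemma 1 reuses `alpha_correct` from the earlier sketch: composing two wraps multiplies the corrections, and α′ = 1/α undoes the wrap.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(0.05, 0.95, size=1000)   # toy black-box posteriors
a1, a2 = 1.7, -0.4

f1 = alpha_correct(u, a1)
f2 = alpha_correct(f1, a2)
assert np.allclose(f2, alpha_correct(u, a1 * a2))    # composition = product of alphas
assert np.allclose(alpha_correct(f1, 1.0 / a1), u)   # alpha' = 1/alpha inverts the wrap
```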
Given some capacity parameter for u, can we easily compute that of f? Yes, e.g., for decision trees. Such a question is particularly relevant for generalization. As we are using the log-loss (2), a relevant capacity notion to assess the uniform convergence of risk minimization for the whole wrapped model is the Rademacher (complexity) [3]. We examine the following set of functions:
$$\mathcal{H}_f \doteq \{f : f(x) \text{ given by (3) with } \alpha, u;\ \forall(\alpha, u)\}, \quad (12)$$
where we assume known the set of functions from which u was trained. We now assume we have an m-sample training set $\mathcal{S} \doteq \{(x_i, y_i) \sim D\}_{i=1}^m$. The empirical Rademacher complexity of a set of functions $\mathcal{H}$ from $\mathcal{X}$ to $\mathbb{R}$, $R_{\mathcal{S}}(\mathcal{H}) \doteq \mathbb{E}_\sigma[\sup_{h \in \mathcal{H}} \mathbb{E}_i[\sigma_i h(x_i)]]$ (with the $\sigma_i$ sampled uniformly in $\{-1, 1\}$), is a capacity parameter that yields efficient control of uniform convergence when the loss used is Lipschitz [3, Theorem 7], which is the case for the log-loss. To see how the α-tree affects the Rademacher complexity of classification using f instead of u, suppose real-valued prediction based on u is achieved via the logit mapping $\lambda \circ u$ in (12). Such mappings are common for decision trees [26].

Lemma 2. Suppose $\{u\}$ is the set of decision trees of depth d, denote $R_{\mathcal{S}}(\mathrm{DT}(d))$ the empirical Rademacher complexity of decision trees of depth d [3], and let d′ be the maximum depth allowed for α-trees. Then we have, for $\mathcal{H}_f$ in (12): $R_{\mathcal{S}}(\mathcal{H}_f) \le R_{\mathcal{S}}(\mathrm{DT}(d + d'))$.

The proof is straightforward once we remark that elements of $\mathcal{H}_f$ can be represented as decision trees, where we plug a copy of the α-tree at each leaf of u.

Does Algorithm 1 have any convergence properties? Yes, it is a boosting algorithm. Following a similar blueprint to classical decision tree induction, it comes as no surprise that TOPDOWN can achieve boosting-compliant convergence. To show that TOPDOWN is a boosting algorithm, we need a Weak Hypothesis Assumption (WHA), which postulates, informally, that each chosen split brings a small edge over random splits for a tailored distribution.

Definition 4. Let $\lambda \in \Lambda(T)$ and $D_{t\lambda}$ be the product measure on $\mathcal{X} \times \mathcal{Y}$ conditioned on λ. The balanced product measure $D'_{t\lambda}$ at leaf λ is defined as ($z \doteq (x, y)$ for short):
$$D'_{t\lambda}(z) \doteq \frac{1 - e(M_\lambda, \eta_t)\, y\,\tilde\lambda(u(x))}{1 - e(M_\lambda, \eta_t)^2}\, D_{t\lambda}(z). \quad (13)$$
We check that $D'_{t\lambda}$ normalizes to 1 because of Def. 2. Our balanced distribution is named after [15]'s: ours indeed generalizes theirs. The key difference comes from the change in setting, where we consider the alignment purity of a leaf and not its label purity. The fairness-free case, where $\tilde\lambda(\cdot)$ is replaced by the constant 1, yields the original balanced distribution [15]. We now state our WHA.

Assumption 2. Let $h : \mathcal{X} \to \mathcal{Y}$ be the function splitting leaf λ. For γ > 0, h γ-witnesses the Weak Hypothesis Assumption (WHA) at λ iff (i) $\left|\mathbb{E}_{(X,Y) \sim D'_{t\lambda}}[Y\,\tilde\lambda(u(X))\, h(X)]\right| \ge \gamma$ and (ii) $e(M_\lambda, \eta_t) \cdot \mathbb{E}_{(X,Y) \sim D_{t\lambda}}[(1 - \tilde\lambda^2(u(X)))\, h(X)] \ge 0$.

Intuitively, (i) gives a condition on the split h's correlation with the agreement between the unfair posterior u and the labels, and (ii) does the same for the confidence of u's predictions.
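A small sketch of the reweighting in (13) on an empirical sample, reusing the helpers from the earlier sketches; `base_weights` is our stand-in for the leaf distribution, and the output sums to one by the same normalization argument noted after (13).

```python
import numpy as np

def balanced_weights(u, y, B, base_weights=None):
    """Empirical weights of the balanced distribution (13) at a leaf.

    Reduces to the balanced distribution of [15] when the normalized logit is
    replaced by the constant 1. Nonnegative since |Y * normalized_logit| <= 1;
    assumes the leaf's alignment edge satisfies |e| < 1.
    """
    if base_weights is None:
        base_weights = np.full(len(y), 1.0 / len(y))
    z = y * normalized_logit(clip_posterior(u, B), B)   # earlier sketch
    e = np.sum(base_weights * z)                        # alignment edge at the leaf
    return base_weights * (1.0 - e * z) / (1.0 - e ** 2)
```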
Figure 4: Left: difference between the per-leaf bounds on the risk $L(f; M_t, \eta_t)$ using (8) and Theorem 1 (conservative scoring) versus (15) (audacious scoring); details in the proof of Lemma 3. Right: a representation of the (p, δ)-pushup of $\eta$, where $\eta(p) \doteq \inf \eta(\mathcal{X}_p) < 1/2$ (Def. A). All posteriors in $[\eta(p), 1/2 + \delta]$ are mapped to $1/2 + \delta$; others do not change. The new posterior $\eta_{p,\delta}$ eventually reduces the accuracy of classification for observations whose posterior lands in the thick red interval (x-axis).

Similarly to the balanced distribution, in the fairness-free case where we only care about the label purity of splits, the WHA simplifies to that of [15], i.e., only (i) matters. A further discussion of our balanced distribution and the WHA is in the SI, Section III. We now state TOPDOWN's boosting-compliant (convergence).

Theorem 3. Suppose (a) Assumption 1 holds, (b) we pick the heaviest leaf to split at each iteration in Step 2.1 of TOPDOWN, and (c) there exists γ > 0 such that each split $h^*$ (Step 2.2) γ-witnesses the WHA. Then there exists a constant c > 0 such that, for all ε > 0, if the number of leaves of T satisfies $|\Lambda(T)| \ge (1/\varepsilon)^{(c/\gamma^2)\log(1/\varepsilon)}$, then the f crafted from (3) using TOPDOWN's T achieves $L(f; M_t, \eta_t) \le \varepsilon$.

Are there alternative ways of growing α-trees? Yes. Let us call conservative the scoring scheme in (5). There is an alternative scoring scheme, which can lead to substantially larger corrections in absolute value, hence its name, and which yields better entropic bounds for the α-tree.

Definition 5. For any mixture $M_t$ and posteriors $u, \eta_t$, let $e^+(M_t, \eta_t)$ and $e^-(M_t, \eta_t)$ be defined by
$$e^{\pm}(M_t, \eta_t) \doteq \mathbb{E}_{(X,Y) \sim D_t}[\max\{0, \pm Y\,\tilde\lambda(u(X))\}]. \quad (14)$$
The audacious scoring scheme at the leaves of the α-tree replaces (5) in Step 3 by
$$\alpha_\lambda \leftarrow \frac{1}{B}\log\frac{e^+(M_\lambda, \eta_t)}{e^-(M_\lambda, \eta_t)}, \quad \forall \lambda \in \Lambda(T).$$

Theorem 4. Suppose Assumption 1 holds and let $H_2(q) \doteq H(q)/\log 2$ ($\in [0, 1]$), H being defined in Definition 3. For any leaf $\lambda \in \Lambda(T)$, denote for short $H_2(\lambda; M_t, \eta_t)$ the corresponding per-leaf entropy, computed from $e^b \doteq e^b(M_\lambda, \eta_t)$, $b \in \{+, -\}$. Using audacious scoring, we get, instead of Theorem 1:
$$L(f; M_t, \eta_t) \le \mathbb{E}_{\lambda \sim M_{\Lambda(T)}}[H_2(\lambda; M_t, \eta_t)]. \quad (15)$$

While the upper bounds in Theorem 1 and Theorem 4 may look incomparable, a simple argument shows that (15) is never worse and can be much tighter.

Lemma 3. For any α-tree T, $\mathbb{E}_{\lambda \sim M_{\Lambda(T)}}[H_2(\lambda; M_t, \eta_t)] \le H(T; M_t, \eta_t)$.

It thus comes as no surprise that using the audacious scoring also yields a boosting result for TOPDOWN, guaranteeing the same rates as in Theorem 3. The per-leaf slack in Lemma 3 can be substantial, as represented in Figure 4 (left); it follows from the use of Jensen's inequality in the Lemma's proof.

As audacious scoring is better boosting-wise, is conservative scoring useful? Yes. If we only cared about accuracy, we would barely have any reason to use the conservative correction. Even thinking about generalization, the Rademacher complexity of decision trees is a function of their depth, so the faster the convergence, the better [3, Section 4.1]. Adding fairness substantially changes the picture: some constraints, like equality of opportunity (Section 6), can antagonize accuracy to some extent. In such a case, using the conservative correction can keep the posteriors u and f close enough (Theorem 2) so that fairness can be achieved without a substantial sacrifice of accuracy.
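For completeness, a sketch of the quantities of Definition 5 on an empirical sample (same conventions and helpers as the earlier sketches); their difference recovers the alignment edge of Definition 2.

```python
import numpy as np

def signed_edges(u, y, B, weights=None):
    """Empirical e^+ and e^- of (14): expectations of the positive and negative
    parts of Y * normalized_logit(u(X))."""
    if weights is None:
        weights = np.full(len(y), 1.0 / len(y))
    z = y * normalized_logit(clip_posterior(u, B), B)   # earlier sketch
    e_plus = np.sum(weights * np.maximum(z, 0.0))
    e_minus = np.sum(weights * np.maximum(-z, 0.0))
    return e_plus, e_minus   # e_plus - e_minus equals the alignment edge (7)
```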
6 Fairness and Societal Considerations

In this section, we present the fairness guarantees TOPDOWN can achieve. In particular, we discuss how Theorem 3 can guarantee minimization of the CVaR criterion. Furthermore, we provide alternative inputs to TOPDOWN which allow EOO to be targeted as a fairness criterion. In the SI, Section II, we further present a treatment of statistical parity. Lastly, we discuss how using α-trees provides explainable corrections and how utilization of the sensitive attribute (as per Fig. 3) can be circumvented.

Guarantees on CVaR: As discussed in previous sections, one way to improve the CVaR fairness criterion (as per (4)) is to focus optimization on the worst-treated subgroups. Given a specified quantile β and the set of worst subgroups $S_\beta$, we can repeat (CVaR) until $\mathrm{CVaR}_\beta(f)$ gets below a threshold or (more specifically) its worst-treated subgroup gets a risk below a threshold (i.e., a stopping criterion). Importantly, Theorem 3 provides a guarantee: to ensure $\mathrm{CVaR}_\beta$ is below ε, we simply need to boost for |S| times the tree-size bound $|\Lambda(T)|$ given in Theorem 3.

Guarantees on EOO: EOO requires to smooth discrimination within an advantaged group, modeled by the label Y = 1 [14]. We say that f achieves ε-equality of opportunity iff a mapping $h_f$ of f to $\mathcal{Y}$ (e.g., using the sign of its logit) satisfies
$$\max_{s \in S}\ \mathbb{P}_{X \sim P_s}[h_f(X) = 1] - \min_{s \in S}\ \mathbb{P}_{X \sim P_s}[h_f(X) = 1] \le \varepsilon, \quad (16)$$
where $P_s$ is the measure over positive observations conditioned on the value S = s of the sensitive attribute. It is clear that EOO can be antagonistic to accuracy: the rate of advantage in the data D may not be equal among the subgroups. As such, unlike for CVaR, we do not want to target the Bayes posterior, i.e., $\eta_t \neq \eta$, for EOO. Instead, we target a skewed posterior which aims to improve the least advantaged subgroup, i.e., to increase $\mathbb{P}_{X \sim P_{s_\downarrow}}[h_f(X) = 1]$ for $s_\downarrow \in \arg\min_{s \in S} \mathbb{P}_{X \sim P_s}[h_f(X) = 1]$. Our strategy consists of picking a target posterior which skews part of the original $\eta$ to be more advantaged, thus reducing the LHS of (16) until (16) is satisfied¹. For this, we create a (p, δ)-pushup of $\eta$, defined in SI Appendix I. Fig. 4 (right) presents an example of a pushup map. Notice that the pushup only changes the predicted probability of examples which do not have a confident prediction (the interval $[\eta(p), 1/2 + \delta]$). Intuitively, p controls how many examples are corrected and δ controls how much the correction pushes up advantage; further discussion is in SI Appendix I. We then run TOPDOWN using as mixture the positive measure conditioned to $S = s_\downarrow$ and $p \doteq \mathbb{P}_{X \sim P_{s_\downarrow}}[h_f(X) = 1] + \varepsilon/(K - 1)$, $\delta \doteq K\varepsilon/(K - 1)$, with K > 1 user-fixed. Thus, we do:

use TOPDOWN with $M_t \leftarrow P_{s_\downarrow}$ and $\eta_t \leftarrow \eta_{p,\delta}$. (EOO)

Theorem 5. If TOPDOWN is run until $L(f; M_t, \eta_t) \le (\varepsilon^4/2) + \mathbb{E}_{X \sim M_t}[H(\eta_t(X))]$, then after the run we observe $\mathbb{P}_{X \sim P_{s^*}}[h_f(X) = 1] - \mathbb{P}_{X \sim P_{s_\downarrow}}[h_f(X) = 1] \le \varepsilon$ (with $s^*$ the most advantaged subgroup).

In the full context of EOO, we should not wait, in the optimization, to reach the bound on $L(f; M_t, \eta_t)$. Rather, we should make sure that (a) we update $\arg\min_{s \in S} \mathbb{P}_{X \sim P_s}[h_f(X) = 1]$ (and thus $s_\downarrow$) after each split in the α-tree, and (b) we keep $\arg\max_{s \in S} \mathbb{P}_{X \sim P_s}[h_f(X) = 1]$ as is, to prevent switching targets and eventually composing pushup transformations for the same $S = s_\downarrow$, which would not necessarily comply with our theory. Notably, the guarantee presented in Theorem 5 depends on the mapping $h_f$ and not on the posterior f directly, as is typically considered [14].
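The precise (p, δ)-pushup is defined in SI Appendix I; the sketch below only mirrors the description of Fig. 4 (right), with `low` a hypothetical stand-in for the quantile-determined lower endpoint $\eta(p)$.

```python
import numpy as np

def pushup_posterior(eta, low, delta):
    """Map posteriors in [low, 1/2 + delta] to 1/2 + delta; leave others unchanged
    (cf. Fig. 4, right). Confident predictions outside the interval are untouched."""
    eta = np.asarray(eta, dtype=float)
    hi = 0.5 + delta
    return np.where((eta >= low) & (eta <= hi), hi, eta)
```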
When taking a threshold (the sign of the logit), $h_f$ can be interpreted as forcing the original posterior to the extreme values 0 or 1. Unlike the CVaR case, EOO (as per (17), SI) requires an explicit approximation of $\eta$. In practice, we find that taking a simple approximation of $\eta$ can still yield fairness gains. However, if one does not want to make such an approximation, one can adapt the statistical parity approach (detailed in the SI, Section II). Similarly, if one wants to consider the typical EOO definitions depending on posterior values, the target measure can be replaced (i.e., swapping the measure $M_t$ with the positive examples P).

¹ If we instead reduce $\arg\max_{s \in S} \mathbb{P}_{X \sim P_s}[h_f(X) = 1]$, we get a symmetric strategy. The application informs which to use: if the positive class implies money spending (e.g., loan prediction), then our strategy implies spending more money, while the latter aims to reduce the money lent to achieve fairness.

Figure 5: ACS 2015 with binary sensitive attribute and random forest black-box: evaluation of TOPDOWN over boosting iterations (x-axis) for the different fairness criteria (panels: CVaR, EOO, and posterior mean difference MD; lower is better; curves: audacious and conservative wrappers for CVaR and EOO, BBOX, and INCVAR). "c" on the x-axis denotes the clipped black-box. Markers denote when a subgroup's α-tree is initiated (over any fold). The shading denotes one standard deviation from the mean; it disappears when folds have early stopping.

Explainability: α-trees using the initialization proposed in Section 4 (and Fig. 3) allow for (explainability) properties similar to those of decision tree classifiers. Fixing a sensitive attribute S = s, the corresponding sub-α-tree can be examined to scrutinize the correction done for the corresponding subgroup. If the splits of the α-tree are simple, then, similarly to standard decision tree classifiers, corresponding partitions of the input domain can be examined. Furthermore, the type of correction can also be examined, as discussed in Section 3, with corrections classified as sharpening, dampening, or polarity-flipping depending on the leaves' α-values.

Usage of the sensitive attribute: Post-processing methods have been flagged in the context of fair classification for the fact that they require explicit access to the sensitive feature at classification time [35, Section 6.2.3]. Our basic approach to the induction of α-trees falls in this category (Fig. 3), but there is a simple way to mask the use of the sensitive attribute and the polarity of disparate treatment it induces: first induce a decision tree to predict the sensitive feature from the other features, and use this decision tree as an alternative initialization to naively splitting on subgroups. We thus also redefine sensitive groups based on this decision tree, alleviating the need to use the sensitive attribute in the α-tree. Proxy sensitive attributes have been used in a similar manner in various domains such as health care [6, 5] and finance [13]. We note, however, that their application in post-processing and α-trees may not be appropriate across all domains [8].
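A toy sketch of this masking strategy with scikit-learn (synthetic data, hypothetical variable names): a shallow decision tree predicts the sensitive attribute from the remaining features, and its leaves serve as proxy subgroups for initializing the α-tree, so the true sensitive attribute is not needed at classification time.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_other = rng.normal(size=(500, 5))                 # non-sensitive features (toy data)
s = (X_other[:, 0] + rng.normal(size=500) > 0)      # sensitive attribute, training only

proxy = DecisionTreeClassifier(max_depth=2).fit(X_other, s)
proxy_groups = proxy.apply(X_other)                 # leaf ids act as proxy subgroups
```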
7 Experiments

To evaluate TOPDOWN (implementation public at https://github.com/alexandersoen/alpha-tree-fair-wrappers), we consider three datasets presenting a range of sizes and feature types: Bank and German Credit (preprocessed by AIF360 [4]), and the American Community Survey (ACS) dataset preprocessed by Folktables (https://github.com/zykls/folktables) [9]. The SI (pg. 32) presents all results at length (including considerations on proxy sensitive attributes, distribution shift, and interpretability), along with the different black-boxes considered (random forests and neural nets). We concentrate in this section on the ACS dataset for income prediction in the state of CA and evaluate TOPDOWN's application to various fairness criteria (as per Section 6 and SI pg. 15) with random forests (RF). For these experiments, we consider age as a binary sensitive attribute with a bin split at 25 (a trinary modality is deferred to the SI). For the black-box, we consider a clipped (Assumption 1 with B = 1) RF from scikit-learn, calibrated using Platt's method [23]. The RF consists of an ensemble of 50 decision trees with a maximum depth of 4 and a random selection of 10% of the training samples per decision tree. Data is split into 3 subsets for black-box training, post-processing training, and testing, with 40:40:20 splits under 5-fold cross-validation. For EOO, we utilize an out-of-the-box Gaussian Naive Bayes classifier from scikit-learn to approximate $\eta$.

Multiple fairness criteria: We evaluate TOPDOWN for CVaR, equality of opportunity (EOO), and statistical parity (SP). The complete treatment of SP is pushed to the SI (Sections II, XII). SP aims to make subgroups' expected posteriors similar and is popular in various post-processing methods [31, 1]. Its definition can be found in the SI (pg. 15), along with the strategy used in TOPDOWN. Conservative and audacious update rules are both tested. For each of these TOPDOWN configurations, we boost for 32 iterations. The initial α-tree is set as in Fig. 3. We compare against 5 baseline approaches. For CVaR we consider the in-processing approach (INCVAR) presented in [32]. For EOO, we consider a derived predictor (DEREOO) [14]. Our SP baselines include an optimized score transformation approach (OST) [31], a derived predictor modified for SP (DERSP) [14], and a randomized threshold optimizer approach (RTO) [1]. We denote the clipped black-box as BBOX.
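Before turning to results, here is a toy sketch of the black-box setup just described, with placeholder data instead of ACS (the paper's actual pipeline lives in the public repository linked above), using scikit-learn's calibration and the clipping of Assumption 1.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = np.random.rand(1000, 8), np.random.randint(0, 2, 1000)    # placeholder data
# 40:40:20 split: black-box training / post-processing training / testing
X_bb, X_rest, y_bb, y_rest = train_test_split(X, y, test_size=0.6, random_state=0)
X_post, X_test, y_post, y_test = train_test_split(X_rest, y_rest, test_size=1/3, random_state=0)

rf = RandomForestClassifier(n_estimators=50, max_depth=4, max_samples=0.1, random_state=0)
bbox = CalibratedClassifierCV(rf, method="sigmoid").fit(X_bb, y_bb)   # Platt-style calibration

B = 1.0
lo, hi = 1 / (1 + np.exp(B)), 1 / (1 + np.exp(-B))
u_post = np.clip(bbox.predict_proba(X_post)[:, 1], lo, hi)            # Assumption 1, B = 1
```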
The experiments for CVaR and EOO are summarized in Fig. 5; the full plot with SP is presented in SI Fig. 10. For clarity, we only plot the baselines and wrappers directly associated with each fairness criterion. We also plot the posterior mean difference between the data and the debiased posteriors, MD (0/1 loss), to examine the effects on accuracy. For CVaR, both the conservative and audacious approaches decrease CVaR, resulting in better CVaR values than both the original BBOX and the in-processing baseline INCVAR, which is good news since INCVAR directly optimizes CVaR. We note that there are cases in which the in-processing approach is better than ours (trinary sensitive attributes in the SI), but this is expected given INCVAR's optimization goal. Interestingly, the audacious update is superior to the conservative update in both CVaR and MD. This is also consistent for trinary sensitive attributes. Thus, the audacious update is desirable when optimizing CVaR. Another observation is that only one sensitive attribute subgroup's α-tree is initialized (only one initialization marker in Fig. 5). This indicates that after 32 iterations the worst-case subgroup does not change in the binary case. For EOO, there is a huge difference between the conservative and audacious updates, as the former reaches the fairest outcomes of all baselines. Even if we used early stopping or pruning of the α-tree (taking an earlier iteration), the audacious update would fail at producing outcomes as fair as its conservative counterpart. Furthermore, the audacious update comes with a significant degradation of accuracy (MD). Moreover, looking at the iterations at which subgroup α-trees are initialized, the audacious update causes large (primarily bad) jumps in performance. This echoes our remark in Section 5 on the interest of having a conservative update. When compared to DEREOO, we find that the conservative TOPDOWN approach produces lower EOO; however, DEREOO tends to have better accuracy scores in MD. These observations are consistent for the trinary sensitive attribute (SI). For SP, we observe fairness results that can be on par with contenders for the conservative update, but with a substantial degradation of MD. This, we believe, follows from the simple plug-in instantiation of $M_t, \eta_t$ for this fairness notion in SI Section II, resulting in potentially harsh updates. In the SI (pg. 15), we discuss an alternative approach with ties to optimal transport.

8 Limitations and Conclusion

Given the context of fairness, it is important to highlight possible limitations of our approach and the potential social harm from its misuse. We highlight two for our TOPDOWN approach. Firstly, our approach has been shown to have failure cases in small-data regimes. This can be seen in the experiments on the German Credit dataset (SI, Sections XIII-XV). This, we believe, has a formal basis in our approach: for instance, EOO requires accurate data posterior estimation, which may be difficult in small-data regimes. Secondly, TOPDOWN is of course not unilaterally better than all other post-processing fairness approaches: there is no free lunch. As such, we do not claim that our instantiation of TOPDOWN is optimal for CVaR, EOO, or SP. However, considering that such different fairness models instantiated in the same algorithm can lead to results competitive with the respective state of the art, the avenue for improved instantiations or accurate extensions to new fairness constraints appears promising. We leave these for future work.

Acknowledgments and Disclosure of Funding

YM received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No. 882396), the Israel Science Foundation (grant number 993/17), the Tel Aviv University Center for AI and Data Science (TAD), and the Yandex Initiative for Machine Learning at Tel Aviv University. AS and LX thank members of the ANU Humanising Machine Intelligence program for discussions on fairness and ethical concerns in AI, and the NeCTAR Research Cloud, an Australian research platform supported by the National Collaborative Research Infrastructure Strategy, for providing computational resources.

References

[1] I. Alabdulmohsin and M. Lucic. A near-optimal algorithm for debiasing trained machine learning models. In NeurIPS*34, 2021.
[2] S.-I. Amari and H. Nagaoka. Methods of Information Geometry. Oxford University Press, 2000.
[3] P.-L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3:463-482, 2002.
[4] Rachel K. E. Bellamy, Kuntal Dey, Michael Hind, Samuel C. Hoffman, Stephanie Houde, Kalapriya Kannan, Pranay Lohia, Jacquelyn Martino, Sameep Mehta, Aleksandra Mojsilovic, Seema Nagar, Karthikeyan Natesan Ramamurthy, John Richards, Diptikalyan Saha, Prasanna Sattigeri, Moninder Singh, Kush R.
Varshney, and Yunfeng Zhang. AI Fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias. IBM J. Res. Dev., 63:4:1-4:15, 2019.
[5] David P. Brown, Caprice Knapp, Kimberly Baker, and Meggen Kaufmann. Using Bayesian imputation to assess racial and ethnic disparities in pediatric performance measures. Health Services Research, 51(3):1095-1108, 2016.
[6] Consumer Financial Protection Bureau. Using publicly available information to proxy for unidentified race and ethnicity: A methodology and assessment. Washington, DC: CFPB, Summer, 2014.
[7] S. Corbett-Davies, E. Pierson, A. Feller, S. Goel, and A. Huq. Algorithmic decision making and the cost of fairness. CoRR, abs/1701.08230, 2017.
[8] Anupam Datta, Matthew Fredrikson, Gihyuk Ko, Piotr Mardziel, and Shayak Sen. Use privacy in data-driven systems: Theory and experiments with machine learnt programs. In 24th ACM SIGSAC, pages 1193-1210, 2017.
[9] Frances Ding, Moritz Hardt, John Miller, and Ludwig Schmidt. Retiring Adult: New datasets for fair machine learning. NeurIPS*34, 2021.
[10] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R.-S. Zemel. Fairness through awareness. In ITCS'12, pages 214-226, 2012.
[11] C. Dwork, N. Immorlica, A.-T. Kalai, and M.-D.-M. Leiserson. Decoupled classifiers for group-fair and efficient machine learning. In Conference on Fairness, Accountability and Transparency, volume 81, pages 119-133. PMLR, 2018.
[12] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214-226, 2012.
[13] Allen M. Fremont, Arlene Bierman, Steve L. Wickstrom, Chloe E. Bird, Mona Shah, José J. Escarce, Thomas Horstman, and Thomas Rector. Use of geocoding in managed care settings to identify quality disparities. Health Affairs, 24(2):516-526, 2005.
[14] M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In NeurIPS'16, pages 3315-3323, 2016.
[15] M. J. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. In Proc. of the 28th ACM STOC, pages 459-468, 1996.
[16] M.-P. Kim, A. Ghorbani, and J.-Y. Zou. Multiaccuracy: Black-box post-processing for fairness in classification. In AAAI/ACM Conference on AI, Ethics, and Society, pages 247-254. ACM, 2019.
[17] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. Inherent trade-offs in the fair determination of risk scores. In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.
[18] P.-K. Lohia, K.-N. Ramamurthy, M. Bhide, D. Saha, K.-R. Varshney, and R. Puri. Bias mitigation post-processing for individual and group fairness. In ICASSP'19, pages 2847-2851. IEEE, 2019.
[19] K. Martineau. Shrinking deep learning's carbon footprint. https://news.mit.edu/2020/shrinking-deep-learning-carbon-footprint-0807, 2020.
[20] N. Mehrabi, F. Morstatter, N. Saxena, and A. Galstyan. A survey on bias and fairness in machine learning. ACM CSUR, 54:1-35, 2022.
[21] A.-K. Menon and R.-C. Williamson. The cost of fairness in binary classification. In Conference on Fairness, Accountability and Transparency, volume 81, pages 107-118. PMLR, 2018.
[22] F. Petersen, D. Mukherjee, Y. Sun, and M. Yurochkin. Post-processing for individual fairness. CoRR, abs/2110.13796, 2021.
[23] John Platt et al.
Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61-74, 1999.
[24] M.-D. Reid and R.-C. Williamson. Information, divergence and risk for binary experiments. JMLR, 12:731-817, 2011.
[25] R.-T. Rockafellar and S. Uryasev. Optimisation of conditional value-at-risk. Journal of Risk, 2:21-41, 2000.
[26] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. In 9th COLT, pages 80-91, 1998.
[27] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. MLJ, 37:297-336, 1999.
[28] E. Strubell, A. Ganesh, and A. McCallum. Energy and policy considerations for deep learning in NLP. In ACL'19, pages 3645-3650, 2019.
[29] T. Sypherd, R. Nock, and L. Sankar. Being properly improper. In ICML'22, 2022.
[30] T. van Erven and P. Harremoës. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. IT, 60:3797-3820, 2014.
[31] D. Wei, K.-N. Ramamurthy, and F. du Pin Calmon. Optimized score transformation for fair classification. In AISTATS'20, volume 108, pages 1673-1683, 2020.
[32] Robert C. Williamson and Aditya Krishna Menon. Fairness risk measures. In International Conference on Machine Learning, pages 6786-6797, 2019.
[33] B.-E. Woodworth, S. Gunasekar, M.-I. Ohannessian, and N. Srebro. Learning non-discriminatory predictors. In COLT'17, volume 65, pages 1920-1953. PMLR, 2017.
[34] Forest Yang, Mouhamadou Cisse, and Oluwasanmi O. Koyejo. Fairness with overlapping groups; a probabilistic perspective. Advances in Neural Information Processing Systems, 33, 2020.
[35] M.-B. Zafar, I. Valera, M. Gomez-Rodriguez, and K.-P. Gummadi. Fairness constraints: A flexible approach for fair classification. JMLR, 20:75:1-75:42, 2019.

Checklist

1. For all authors...
(a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] At the end of the last page.
(c) Did you discuss any potential negative societal impacts of your work? [Yes] In the limitations section and corresponding Appendix sections.
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [Yes] All assumptions are stated in the Theorems, Lemmas, and Corollaries.
(b) Did you include complete proofs of all theoretical results? [Yes] Full proofs are presented in the Appendix.
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See code in supplementary material.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] Summary presented in main text with further details in the Appendix.
(c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] Error bars are present in all relevant plots.
(d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] Details present in Appendix XI.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes] Data sources and baselines cited.
(b) Did you mention the license of the assets? [Yes] When available.
(c) Did you include any new assets either in the supplemental material or as a URL? [No]
(d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [Yes] We use public datasets. For the newer dataset, we defer to the cited paper's section.
(e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [No] The datasets are standard benchmarks used extensively in ML. For the newer dataset, we defer to the reference's section.
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]