# disparate_conditional_prediction_in_multiclass_classifiers__8ddfdcd2.pdf

Disparate Conditional Prediction in Multiclass Classifiers

Sivan Sabato 1 2 Eran Treister 2 Elad Yom-Tov 3

We propose methods for auditing multiclass classifiers for fairness under multiclass equalized odds, by estimating the deviation from equalized odds when the classifier is not completely fair. We generalize to multiclass classifiers the measure of Disparate Conditional Prediction (DCP), originally suggested by Sabato & Yom-Tov (2020) for binary classifiers. DCP is defined as the fraction of the population for which the classifier predicts with conditional prediction probabilities that differ from the closest common baseline. We provide new local-optimization methods for estimating the multiclass DCPunder two different regimes, one in which the conditional confusion matrices for each protected sub-population are known, and one in which these cannot be estimated, for instance, because the classifier is inaccessible or because good-quality individual-level data is not available. These methods can be used to detect classifiers that likely treat a significant fraction of the population unfairly. Experiments demonstrate the accuracy of the methods. Code is provided at https://github.com/sivansabato/ DCPmulticlass.

1. Introduction

Fairness of classifiers is a crucial property in many real-life scenarios (see, e.g., Caton & Haas, 2024). In particular, auditing classifiers for fairness is essential in a wide range of applications and has been studied in many works (e.g., Saleiro et al., 2018; Angwin et al., 2022; Taskesen et al., 2021; Cherian & Cand es, 2024).

While it is desirable that a classifier accurately satisfies the required fairness criterion, this is in many cases not achiev-

1Department of Computing and Software, Mc Master University; Canada CIFAR AI Chair, Vector Institute 2Department of Computer Science, Ben-Gurion University of the Negev 3Department of Computer Science, Bar-Ilan University. Correspondence to: Sivan Sabato <sabatos@mcmaster.ca>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

able, due to inherent limitations of the distribution (Pleiss et al., 2017; Kleinberg et al., 2017; Menon & Williamson, 2018; Wang et al., 2024), as well as practical constraints. It is thus necessary to be able to quantify the deviation from fairness of a given classifier. In addition, often the classifier is proprietary (Dastin, 2022; Grandinetti, 2023) and thus is not directly accessible for auditing. In other cases, individual-level data that is required for auditing is missing or insufficient. For instance, a health insurance company may use a proprietary classifier to decide on coverage or on premium rates (Kafuria, 2022). Sabato & Yom-Tov (2020) showed that for binary classifiers, it is possible to provide fairness auditing without individual-level data, using only population-level information on the frequencies of positive predictions and of true positive labels in each subpopulation, where a sub-population is the group of individuals who have the same value of the protected attribute(s). They considered fairness in the sense of equalized odds (Hardt et al., 2016), which defines a binary classifier as fair if its false positive rate and its false negative rate are each the same across all sub-populations. They proposed to quantify the deviation from equalized odds using a measure that we will henceforth call Disparate Conditional Prediction (DCP). DCP is the fraction of the population for which the classifier predicts with conditional prediction probabilities that differ from the closest common baseline.

The fairness notion of equalized odds was originally studied for binary classifiers. However, interest in fairness for multiclass classifiers has gained traction in recent years (see, e.g. Alghamdi et al., 2022; Rouzot et al., 2022; Wang et al., 2024). In multiclass scenarios, the multiclass equalized odds criterion measures any differences in conditional prediction probabilities between sub-populations. This includes not only the difference in the rate of correct predictions as in the binary case, but also the types of prediction mistakes. For instance, if a patient s heart attack is misdiagnosed as an anxiety attack (which may mean the patient is denied care), this is significantly different than being misdiagnosed as a stroke (which may lead to delayed care). If some subpopulations incur more of a certain type of misdiagnosis error, this could indicate bias in diagnosis, as well as lead to undesired differences in treatment.

In this work, we study the auditing of multiclass classifiers for deviation from multiclass equalized odds, using

Disparate Conditional Prediction in Multiclass Classifiers

a natural generalization of DCP. DCP is different from other commonly used measures of deviation from equalized odds, such as the difference or ratio of equalized odds (e.g., Alghamdi et al., 2022; Wang et al., 2024), in that it has a consistent interpretable meaning as a fraction of the population, regardless of the number of protected attribute values, the number of classes, or the degree of class imbalance. Thus, DCP is useful for interpretably auditing and comparing classifiers. This is contrasted with the standard use of differences or ratios, which produces undesirable artifacts, such as discounting differences in rare labels, lack of normalization or boundedness, and lack of differentiation between classifiers with different degrees of bias. The quantifiable interpretation of the DCP measure ensures that it does not suffer from similar issues.

It was shown in Sabato & Yom-Tov (2020) that the DCP of a given binary classifier can be calculated efficiently using the confusion matrices of this classifier in each sub-population. When the confusion matrices are not available as discussed above, there exists an efficient procedure for deriving a lower bound on the value of DCP given population-level frequencies. This lower bound can be used to identify classifiers that treat a high fraction of the population unfairly, without direct access to the classifier. In the case of multiclass classifiers, efficient procedures for calculating DCP are unknown, as this may be computationally intractable. Thus, we propose procedures to upper-bound and lower-bound the DCP of a multiclass classifier, given the conditional confusion matrices of the classifier for each sub-population, as well as given only population-level frequencies.

The upper bounds are obtained using local minimization procedures. The minimization problems are constrained, non-smooth and non-convex, and their objective functions have regions with large gradients, which is challenging numerically. As it is known that non-smooth functions are challenging for gradient-based algorithms, we first handle the non-smoothness, caused by a maximum term over several concave functions, by splitting the functions into their smooth parts, and adding more constraints to the existing ones to account for that. Then, we replace the concave functions in the constraints with linear approximations, and the minimization is obtained via sequential solutions of standard linear programming (LP) problems, which have available and rather efficient solution routines (Hall et al., 2023). The LP solvers efficiently handle the constrained minimization at each step, even when the constraint matrices include large numbers because of the large gradients. As the problem is highly non-convex, the sequential minimization reaches local minima.

We report experiments on several data sets, showing that the gap between the upper and lower bounds, for both scenarios, are usually quite small, indicating that the optimization

procedures provide useful estimates. These estimates can be used to identify classifiers that behave differently on different protected sub-populations.

Paper structure. Section 2 discusses related work. Preliminaries are provided in Section 3. In Section 4, we present the DCP measure and extend it to multiclass classifiers. Section 5 provides methods for calculating upper and lower bounds for DCP for multiclass classifiers when the conditional confusion matrices for all sub-populations are known. In Section 6, we consider the case where these matrices are unknown, using only population-level frequencies. Experiments are reported in Section 7. We conclude in Section 8. Some technical details are deferred to appendices.

2. Related Work

Fairness for multiclass classification has been gaining interest in recent years. Denis et al. (2024) studied multiclass fairness with demographic parity. Alghamdi et al. (2022) use model projection for multiclass equalized odds. Putzel & Lee (2022) propose post-processing techniques for obtaining fairness in multiclass classification for various fairness notions. Rouzot et al. (2022) propose fairness scoring systems for multiclass classification. Wang et al. (2024) study the fundamental limits of fairness in multiclass classifiers, under several fairness notions, including equalized odds.

Sabato & Yom-Tov (2020) studied estimating the (un)fairness of a classifier, a crucial task in many applications (Bellamy et al., 2019). They proposed the new DCP measure for binary classifiers, showed that it is easy to calculate using the confusion matrices for each sub-population, and provided methods for lower-bounding this measure in the absence of individual label data, using population-level frequencies. Several other works have studied fairness auditing in the absence of individual information about protected attributes (e.g., Awasthi et al., 2021; Fabris et al., 2023; Cornacchia et al., 2023). Fairness auditing has also been studied in specific applications, including tax auditing algorithms (Black et al., 2022), visual systems (Goyal et al., 2022), and candidate rankings (Roth et al., 2022).

Many works use some relaxation of equalized odds to allow learning or studying near-fair classifiers. However, there is not a single agreed-upon relaxation, in the binary or the multiclass case. For instance, Donini et al. (2018); Jung et al. (2021); Xue (2023); Wang et al. (2024) use a difference-based formulation, while Calmon et al. (2017) and Alghamdi et al. (2022) use a ratio-based one. In Xue (2023), the sum of the differences is used, while in Wang et al. (2024), the maximum is used. In this work, we study a natural extension to of the interpretable DCP measure of Sabato & Yom-Tov (2020) to multiclass classification.

Disparate Conditional Prediction in Multiclass Classifiers

Our work partly falls within the realm of fairness auditing using only aggregate statistics, without assuming access to the classifier. The challenge of auditing fairness using limited information has received significant attention in recent years, as evident, for example, in (Pinz on et al., 2024; Wang et al., 2021). Our work is unique, in that it is the first, to our knowledge, to address the multiclass setting.

3. Preliminaries

We consider a multiclass classification problem with k possible labels, in which each individual in the population has a true label in the label set Y {1, . . . , k}. Fairness is considered with respect to some protected attribute, such as race, state of residence or age. A value of the attribute, which can be multi-valued, is assigned to each individual. If there is more than one protected attribute, it can be substituted for a single attribute which is the Cartesian product of all protected attributes. A sub-population is the subset of the population that includes all the individuals with the same value of the protected attribute.

The object of study is an existing classifier, denote it C, which maps each individual from the population to a predicted label, which may be different from its true label. We do not make any assumptions about the way the classifier is generated or the classification model. For a given C, denote by D the uniform distribution over the population of the triplets of true label, predicted label, and protected attribute value of individuals. A random triplet drawn according to D is denoted by (Y, ˆY , A), where Y Y is the true label of the individual, ˆY Y is the label predicted by C for this individual, and A A is the individual s protected attribute value, where A is the set of possible values. Denote the probability of an event E according to D by P[E].

We define the following notation for properties of D: The frequency of each sub-population in the distribution is wa := P[A = a]; The vector of frequencies is w = (wa)a A. The proportion of true label y Y in subpopulation a A is πy a := P[Y = y | A = a]; The vector of these proportions is πa := (πy a)y Y. The proportion of predicted label y Y by classifier C in sub-population a A is ˆpy a := P[ ˆY = y | A = a]; The vector of these proportions is ˆpa := (ˆpy a)y Y. The entries in the confusion matrix of the classifier C on a sub-population a A are denoted by y, ˆy Y, αyˆy a := P[ ˆY = ˆy | Y = y, A = a]. Throughout this work, the value of a conditional probability expression in which the probability of the condition is zero is treated as zero. Denote the confusion matrix of C on a subpopulation a by Ma := (αyˆy a )y,ˆy Y. Denote the indexed set of all such confusion matrices by MA := {Ma}a A. Denote row y in Ma by αy a := (αyˆy a )ˆy Y.

Denote the k-simplex by k := {v Rk + | v 1 = 1}.

The set of matrices whose rows are in the simplex over Y is k k := {X := (xyˆy)y,ˆy Y | y Y, (xyˆy)ˆy Y k}. By definition, the classifier C satisfies the following:

a A it holds that Ma k k and MT a πa = ˆpa. (1)

The fairness notion of equalized odds (Hardt et al., 2016), originally defined for binary classification, was later generalized to multiclass classifiers. We use the term-by-term multiclass equalized odds criterion (Putzel & Lee, 2022; Alghamdi et al., 2022), which requires that for each y, ˆy Y, αyˆy a P[ ˆY = ˆy | Y = y, A = a] is the same across all a A. Equivalently, all the matrices in MA are the same. Note that the same definitions can be used also for relaxed multiclass equalized odds criteria, such as those that distinguish sensitive and insensitive labels (see, e.g. Rouzot et al., 2022). This can be achieved by mapping ˆY conditioned Y = y to a smaller set of predicted labels, where the mapping can depend on y, and then recalculating the distribution properties provided above. All the methods and results below are valid for the resulting transformation.

4. The DCP Measure for Multiclass Classifiers

Recall that the distribution D is determined by the given classifier C. Denote the conditional distribution of D given the attribute value A = a, and the true label Y = y by Dy a. The binary DCP measure is based on modeling each Dy a as a mixture of two conditional distributions: a global baseline distribution, which is the same for all sub-populations a and represents the baseline behavior, and a local nuisance distribution, which can be different for each a and represents the deviation from this baseline. For a A, y Y, let ηy a [0, 1] be the probability conditioned on A = a, Y = y that ˆY is drawn according to the nuisance distribution. DCP(C) is defined as the fraction of the population for which the local nuisance conditional distributions are used by the classifier instead of the global baseline distribution. Since the baseline conditional distribution is unobserved, it is taken to be the one that results in the minimal possible DCP. Formally, for a given classifier C with the distribution D, and letting ηa := (ηy a)y Y,

DCP(C) = min {ηa}

a A waπT a ηa, (2)

where the minimum is taken over {ηa}a A that are consistent with D; that is, such that there exist a baseline distribution Dy b and nuisance distributions {N y a }y Y,a A such that for each y Y, a A, Dy a = (1 ηy a)Dy b + ηy a N y a . If the classifier satisfies equalized odds, then the distributions Dy a are the same for all a A, in which case the decomposition holds by setting ηy a = 0 for all y Y, a A and setting the baseline distribution for y to Dy a for an arbitrary a. This

Disparate Conditional Prediction in Multiclass Classifiers

0.0 0.2 0.4 0.6 0.8 1.0 a

b = 0.01 b = 0.25 b = 0.5 b = 0.75 b = 0.99

(a) a 7 η(a, b)

0.0 0.2 0.4 0.6 0.8 1.0 b

a = 0.0 a = 0.25 a = 0.5 a = 0.75 a = 1.0

(b) b 7 η(a, b)

Figure 1: The function η(a, b).

gives DCP(C) = 0, as expected. Sabato & Yom-Tov (2020) show that for binary classifiers, where Y = {0, 1},

DCP(C) = DCP(MA, w, π) := (3) X

y Y min αy(1 y) b [0,1]

a A waπy aη(αy(1 y) b , αy(1 y) a ),

where MA is determined by C, w, π are the population properties, and η without subscripts denotes the function:

1 b/a b < a, 1 (1 b)/(1 a) b > a, 0 b = a. (4)

η is illustrated in Figure 1. In the minimization in Eq. (3), αy(1 y) b represent conditional prediction rates for the baseline distribution. More generally, we denote the conditional probability of predicting ˆy given true label y under the baseline distribution by αyˆy b . Sabato & Yom-Tov (2020) observe that since a 7 η(a, b) is piecewise concave on the intervals [0, b] and [b, 1] (see Figure 1), it suffices to minimize each of the terms in Eq. (3) over αy(1 y) b {αy a}a A {0, 1}.

We first provide a formulation of DCP(MA, w, π) for multiclass classifiers.

Theorem 4.1. For multiclass classification, Eq. (2) implies

DCP(C) = DCP(MA, w, π) = (5) X

y Y min αy b k

a A waπy a max ˆy Y η(αyˆy b , αyˆy a ),

where αy b := (αyˆy b )ˆy Y.

Proof. In an analog to the derivation in Sabato & Yom-Tov (2020), it is easy to see that for a given baseline distribution with rates {αyˆy b }y,ˆy Y,

αyˆy a = αyˆy b (1 ηy a) + νyˆy a ηy a,

where νyˆy a is the conditional probability of predicting ˆy given true label y, under the nuisance distribution N y a . From Eq. (2), it follows that the optimal solution for νyˆy a minimizes ηy a subject to

ˆy Y s.t. νyˆy a = αyˆy b , ηy a = αyˆy a αyˆy b νyˆy a αyˆy b ,

y Y, a A, ηy a 0,

a A, (νy1 a , . . . , νyk a ) k.

If αyˆy a = αyˆy b for all ˆy Y, then clearly ηy a = 0. Otherwise, the first constraint requires νyˆy a = αyˆy b + (αyˆy a αyˆy b )/ηy a. Summing over ˆy, this implies P

ˆy Y νyˆy a = P

ˆy Y αyˆy b +

(P ˆy Y αyˆy a P ˆy Y αyˆy b )/ηy a = 1. Thus, the last constraint can be replaced by νyˆy a [0, 1] for all ˆy Y. Since the derivative of the RHS of the first constraint is never zero, the minimizer of ηy a is obtained when one of the other constraints holds. That is, with an assignment of ν ˆyy a = 1 for some a, ˆy, y such that αyˆy a > αyˆy b , or ν ˆyy a = 0 for some a, ˆy, y such that αyˆy a < αyˆy b . It follows that there is some ˆy Y such that

αyˆy a > αyˆy b and ηy a = αyˆy a αyˆy b 1 αyˆy b = 1 1 αyˆy a 1 αyˆy b or

αyˆy a < αyˆy b and ηy a = αyˆy a αyˆy b αyˆy b = 1 αyˆy a αyˆy b .

Moreover, for all ˆy Y, if αyˆy a > αyˆy b then ηa 1 1 αyˆy a 1 αyˆy b and if αyˆy a < αyˆy b then ηa 1 αyˆy a αyˆy b . It follows that

ηy a = max ˆy Y max(1 1 αyˆy a 1 αyˆy b , 1 αyˆy a αyˆy b ) = max ˆy Y η(αyˆy b , αyˆy a ).

Eq. (5) thus follows.

It is easy to verify that Eq. (5) is equivalent to Eq. (3) for binary classifiers, since for Y = {0, 1}, αyy a = 1 αy(1 y) a and the same for αb, and η(a, b) = η(1 a, 1 b).

For simplicity of notation, we henceforth treat w and π as fixed and write DCP(MA). In the multiclass case,

Disparate Conditional Prediction in Multiclass Classifiers

DCP(MA) is the solution of a non-convex constrained minimization problem. In the next section, we provide methods for calculating a lower bound and an upper bound for this quantity.

Note that in practice, in some cases, the properties of the distribution D for C would be estimated from a limited data set and so may deviate from the true properties for D. However, since DCP is interpretable as a fraction of the population, any inaccuracy in the estimation of D would map to at most the same amount of inaccuracy in DCP.

5. Bounding the DCP of a Multiclass Classifier

We propose methods for calculating a lower bound and an upper bound for DCP(MA). The lower bound is a simple analytical formula. For y Y, define αy A := {αy a}a A and

DCPy(αy A) := min αy b k

a A waπy a max z Y η(αyˆy b , αyˆy a ). (6)

Then DCP(MA) = P

y Y DCPy(αy A). A lower bound on DCPy can be derived as follows:

DCPy(αy A) = min αy b k

a A waπy a max ˆy Y η(αyˆy b , αyˆy a ) (7)

min αy b [0,1]k X

a A waπy a max ˆy Y η(αyˆy b , αyˆy a )

min αy b [0,1]k max ˆy Y

a A waπy aη(αyˆy b , αyˆy a )

= max ˆy Y min x [0,1]

a A waπy aη(x, αyˆy a ).

Due to the piecewise concavity of a 7 η(a, b) on [0, b] and [b, 1], it suffices to minimize each of the terms in Eq. (3) over x {0, 1} {αyˆy a }a A. Thus,

y Y max ˆy Y min x {0,1} {αyˆy a }a A

a A waπy aη(x, αyˆy a ).

This lower bound can be calculated in time O(|A|2k2).

To upper bound DCP(MA), we propose a local iterative optimization approach on the objective function in Eq. (6), which is constrained, non-smooth, and non-convex. One can randomly initialize using a feasible solution. However, we propose a more tailored approach at the end of this section.

Our optimization approach utilizes sequential LP solutions, and is as follows. Given y Y, denote the matrix H R|A| k to include the entries ha,ˆy := αyˆy a . Note that the rows of H are in the simplex k. The vector ew [0, 1]|A| includes the entries ewa := waπy a. The optimization variables are denoted by the vector α := αy b Rk. Denote the all-one vector of dimension d by 1d and the

Algorithm 1 Local minimization of DCPy via sequential linear programming

Input: α(0), H, ew, max Iter, ε Output: The local minimizer α

for t = 0 to max Iter do

Define the LP approximation of Eq. (9) by computing M1, M2, and v. Define ˆα as the minimizer of the LP in Eq. (11). Set α(t+1) = α(t) +µ(ˆα α(t)), where µ is obtained using line search to ensure a reduction in Eq. (8). If e < ε, stop. end for return the last iterate α(t).

identity matrix of dimensions d d by Id. Our objective in Eq. (5) and its constraints, is thus given by:

Minimize α Rk X

a ewa max ˆy {η(αˆy, ha,ˆy)} (8)

s.t. 0 α 1,

To solve the problem, we observe that its structure resembles a linear program (LP). That is, the objective is given by an inner product, and the constraints are linear. The only difference between an LP and Eq. (8) is the maximum term over the η values inside the inner product. Hence, to solve Eq. (8), we apply the sequential linear programming approach (Nocedal & Wright, 2006), which iteratively approximates Eq. (8) as a linear program and solves it.

First, we replace the maximum terms in the objective with another variable vector c and additional constraints, following Charalambous & Conn (1978):

Minimize α Rk,c R|A| ew, c (9)

s.t. 0 α, c 1,

η(αˆy, ha,ˆy) ca, ˆy [k], a A.

This problem is equivalent to Eq. (8), but has no maximum operation. However, now we have non-linear constraints, thus violating the definition of an LP. To correct this, we use a local approximation of η (around an iterate α(t)) using the Taylor series. Note that the function a 7 η(a, b) is piecewise concave, non-negative, and smooth at (0, 1) except at b, where it is minimized and equals zero. Therefore, we do not expect the maximum over ˆy in the objective Eq. (8) to fall on the discontinuity of η at b, except for extreme cases. Hence, using a linear approximation we expect to get a good local approximation of η().

The sequential linear programming approach calculates the new iterate (α(t+1), c(t+1)) by solving an LP problem of an

Disparate Conditional Prediction in Multiclass Classifiers

approximation of Eq. (9) around (α(t), c(t)). That is, given an iterate (α(t), c(t)) we first approximate η by a linear Taylor series in its first argument:

η(α+ϵ, b) η(α, b) η(α, b)

b α2 ϵ α > b, (1 b) (1 α)2 ϵ α < b, 0 α = b.

We now locally approximate Eq. (9) around an iterate α(t)

via the following LP problem:

Minimize α Rk,c R|A| ew, c (10)

s.t. 0 α, c 1,

va + Ja(α α(t)) 1kca, a A,

where va Rk is a vector with the coordinates (va)ˆy = η(a(t) ˆy , ha,ˆy) for all ˆy [k], and Ja Rk k is a diagonal

Jacobian matrix s.t (Ja)ˆyˆy = η(a(t) ˆy ,ha,ˆy)

a . Bringing this into a canonical form with a variable α = α(t) + e yields

Minimize α Rk,c R|A| ew, c (11)

s.t. 0 α, c 1,

M1c + M2α M2α(t) + v 0,

where M1 := I|A| 1k, M2 := [J1; J2; ...; J|A|] is the stacking of all diagonal Jacobian matrices on top of each other, and v := [v1; ...; v|A|] is the stacking of vectors. Eq. (11) is an LP in a canonical form, which is solved iteratively. Note that M1 and M2 are highly sparse, which can be exploited to easily solve large instances of the problem. The resulting local optimization algorithm is given in Alg. 1.

A note on usage. By definition, the function η in Eq. (4) is continuous in the open section (0, 1)2, but at the boundaries, it is easy to see that η(0, 0) = 0, while limδ 0 η(δ, 0) = 1. Furthermore, the derivatives of η when its first argument is close to 0 or 1 can be very large, causing numerical difficulties in the LP solvers. Thus, to avoid numerical instability, we make sure that all entries of H are in [ε, 1 ε] (where we used ε = 10 5) by replacing any value outside these boundaries by the relevant boundary value and renormalizing. Note that since the input values in H are calculated in practice based on finite data sets, values in H are already noisy representations of the true population values, and so slightly changing them does not negatively affect the correctness of the method.

Finding an initializing assignment A natural guess for an initializing assignment α(0) is the weighted average of the confusion matrices in MA. However, this does not take

the objective function into account. We propose instead a greedy approach, in which the entries in the baseline confusion matrix are optimized label-by-label, using the fact that the binary problem is easy to minimize. In the first iteration, all labels except for one are treated as the same label, and optimal confusion matrix values for the first labels are calculated. In each further iteration, one additional label is separated, and the assignment for this label is optimized given the assignments of the previous labels. This approach is possible because DCP is monotonic under the merging of labels. The full derivation and procedure are provided in Appendix C. Our experiments in Section 7 demonstrate that this method is superior to the averaging approach.

6. Best-case DCP without Confusion Matrices

We next consider the case of a classifier for which the confusion matrices for each sub-population are not available, either because the classifier is not accessible for testing or because individual-level ground-truth data is missing or insufficient. In this case, we assume, as in Sabato & Yom Tov (2020), access only to the frequencies of true and predicted labels in each sub-population, but not the conditional probabilities. Formally, we do not have MA for the given classifier C. We only have access to the population-level frequencies (w, {(πa, ˆpa)}a A).

If C is fair under multiclass equalized odds, then all the matrices in MA must be identical. Thus, from Eq. (1), there exists a single confusion matrix M k k such that a A, MT πa = ˆpa. If such a matrix does not exist and the input frequencies are accurate, then C does not satisfy the multiclass equalized odds criterion. Nonetheless, given (w, {(πa, ˆpa)}a A) for C, we would like to find the bestcase value of DCP for C, denoted min DCP. Formally,

min DCP((w, {(πa, ˆpa)}a A)) := (12)

min{DCP(MA) | a A, Ma k k, MT a πa = ˆpa}.

Sabato & Yom-Tov (2020) show that min DCP can be calculated exactly for binary classifiers. In the multiclass case, we are not aware of an efficient algorithm for calculating the exact value of min DCP. Instead, we provide below a local-minimization algorithm for DCP(MA) under the constraints on MA, which results in an upper bound for min DCP. This, combined with a lower bound on min DCP, provides a limited range of possible values of min DCP for the given classifier C.

To calculate the lower bound, observe that

DCP(MA) = X

y Y min αy b k

a A waπy a max z Y η(αyˆy b , αyˆy a )

y Y min αyy b [0,1]

a A waπy aη(αyy b , αyy a ). (13)

Disparate Conditional Prediction in Multiclass Classifiers

Each of the terms in the RHS is of the same form as the terms in the definition of DCP for binary classification. The RHS is a constrained minimization problem that can be solved using the same methodology as in Sabato & Yom Tov (2020), providing a lower bound for min DCP.

We now turn to the local optimization algorithm for min DCP. We use a sequential linear programming approach, similarly to our solution to Eq. (8) in Section 5. First, we extend the notation of Eq. (8) to explicitly denote the value of y Y: The matrix Hy R|A| k is defined such that hy aˆy = αyˆy a ; The vector ewy [0, 1]|A| includes the entries ewy a := waπy a; The optimization variables are denoted by the vectors αy := αy b Rk. We replace the maximum terms in the problem in Eq. (5) with the vectors cy Rk, similarly to the transformation from Eq. (8) to Eq. (9). The variables for the optimization problem are the collection of the vectors and matrices

x = {α1, H1, c1, ..., αk, Hk, ck}, (14)

which amount to k2 + k2 |A| + k |A| scalar unknowns in total. We get the following new problem, which is equivalent to the problem defined in Eq. (5), including its constraints:

Minimize {αy},{Hy},{cy}

y ewy, cy (15)

s.t. 0 αy 1 y [k],

0 Hy 1 y [k],

0 cy 1 y [k],

αy, 1 = 1 y [k],

Hy1k = 1|A| y [k], Pk y=1 πy ahy a,ˆy = pˆy a ˆy [k], a [|A|].

η(αy ˆy, hy a,ˆy) cy a y, ˆy [k], a [|A|],

Note that the objective in Eq. (15) is linear, and its constraints are linear as well, except the bottom ones involving η(), similarly to Eq. (9). Also note that all of the linear constraints are separable in y, except for the last one, which introduces a coupling between the Hy of different y s.

To solve the problem we again use the sequential linear programming approach, as the structure of the problem again resembles a linear program. Given the t-th iterate x(t) for all the variables defined in Eq. (14), we first approximate η by a linear Taylor series, this time in both of its arguments, since now both are optimized. The function η(α, b) is linear in b, hence the Taylor series is exact with respect to b, which is appealing for our approach. However, minimizing the objective with respect to both hy a,ˆy and αy a leads to the minimal non-smooth points of η. These points are the singularity points (see Fig. 1), where η is not approximated well by a linear function. This results in a slow convergence of the sequential LP algorithm. To solve this, we split η and define

0.15 0.20 0.25 0.30 0.35 0.40 0.45

η(α,b) = max(η

Linear approx. of η

Linear approx. of η

Minimum of η

Minimum of linear approx.

Figure 2: The linear approximation of η(α, b) using the split to η1 and η2 around α0 = 0.35, for a fixed b = 0.25.

it as the maximum between the two smooth functions:

η(α, b) = max {η1(α, b), η2(α, b)} , (16)

where η1(α, b) = 1 b

α and η2(α, b) = 1 1 b

1 α. This split is an alternative formulation to the definition of η in Eq. (4), only now the singularity point is replaced with a maximum, and can be treated in the same way we treat the maximization over all terms involving η using sequential LP. Figure 2 shows the independent linearization of the two terms after the split and the piece-wise linear approximation of the non-smooth function in our objective. Overall, the solution of the Eq. (15) is obtained after the split using sequential LP. More details are given in Appendix D.

7. Experiments

We report experiments demonstrating the performance of the methods proposed above, for the case of known as well as unknown confusion matrices (Section 7.1 and Section 7.2), respectively. We show that in most cases, the ratio between the lower bound and the upper bound, in both types of experiments, is close to 1, indicating that the methods produce fairly accurate estimates. Code is provided at https: //github.com/sivansabato/DCPmulticlass.

7.1. Bounding the DCP with Known Confusion Matrices

We report the results of calculating an upper bound and a lower bound on the DCP(MA) of a multiclass classifier given MA. As discussed above, if the confusion matrices in MA are known up to some accuracy of D, then DCP(MA) will have at least the same accuracy, as a direct result of its definition as a fraction of the population.

We compared our proposed local-optimization approach for calculating an upper bound, described in Section 5, to other approaches for finding a feasible low-value assignment for Eq. (5). We tested the following alternatives: (1) Average: the weighted average of the sub-population confusion matrices; (2) Greedy: the greedy initialization procedure proposed in Section 5; (3) Average+LM and (4) Greedy+LM:

Disparate Conditional Prediction in Multiclass Classifiers

initializing with the corresponding method, then running the local optimization procedure. Our results below show that our approach, Greedy+LM, has the best performance. We also report the ratio between the best upper bound and the lower bound, calculated according to Eq. (7). A ratio close to 1 indicates that in this case, the two bounds are tight.

For these experiments, we selected data sets on individuals, which are mostly categorical, and which include several multi-valued attributes that can be used as a predicted label and as a protected attribute. First, we used the US Census data set (Dua & Graff, 2019) to generate multiclass classifiers to predict each of the individual attributes that have between 3 and 10 values. The protected attribute was the state of work. We tested our methods with three different types of classifiers: a decision tree and a nearest neighbor classifier, using standard Matlab libraries, as well as a classifier based on a standard fully connected neural network. The latter used two layers of size 64 and 32 neurons, with Re LU activation. The network was implemented in Py Torch, and trained with the Adam W optimizer, with a learning rate of 0.001, and a batch size of 256, for 4 epochs.

Table 1, as well as Table 3 and Table 5 in Appendix A, list the DCP lower bound calculated for each of these classifiers given their confusion matrices, the upper bound obtained by each of the tested upper bound methods, and the ratio between the smallest upper bound and the lower bound. In some of the experiments, the lower bound is equal to the upper bound, giving the exact value of the DCP of this classifier. In all of the experiments except for a single case, our proposed approach provides the tightest upper bound. The ratio between the upper bound and the lower bound ranges from 1.00 to 2.85, and is close to 1.00 in most cases, showing that for these classifiers, the DCP can be estimated quite well. In most cases, the DCP for decision tree classifiers obtained better accuracy ratios.

Second, we used data about births in the United States (CDC, 2017), which provides detailed information about each birth that occurred during 2017. The data set includes approximately 3.8 million data points and 50 attributes. Using only attributes that are known before the labor, we generated three classifiers that attempt to predict the type of labor out of the five possible options (e.g., spontaneous, Cesarean): a decision tree (error 30.8%), a k-Nearest Neighbor classifier, where k was set to 9 using parameter tuning (error 24.6%), and a 2-layer neural network (error 22.43%). This allows studying a high-error regime for the estimation of DCP. We estimated the DCP of each classifier, with respect to different protected attributes. Table 1 (bottom), and Table 4, Table 6 in Appendix A, report the DCP lower bound and upper bounds relative to each of these protected attributes. Here too, our proposed approach provides the tightest upper bounds. The ratio between the upper bound and the

Table 1: DCP with known confusion matrices, decision trees; US Census (top), Natality (bottom). For US Census, each row corresponds to a different classifier. Its error is indicated in the Error column. Lower bound is the lower bound calculated for DCP using our method. The upper bounds are the values obtained by each of the compared methods for minimizing the DCP objective. Best ratio indicates the ratio between the best (lowest) upper bound and the lower bound. A ratio of 1 indicates that the bounds are both tight. For Natality, all the rows report the same classifier, and each row calculates DCP with respect to a different protected attribute.

# Labels Error Lower Upper Bounds Best Bound Average Greedy Average+LM Greedy+LM Ratio 3 11.74% 5.39% 27.19% 9.65% 14.07% 7.65% 1.42 3 5.71% 4.35% 42.17% 5.92% 32.39% 5.28% 1.21 3 3.96% 3.24% 43.63% 5.07% 16.95% 3.25% 1.00 3 5.15% 4.24% 39.05% 5.40% 14.32% 4.25% 1.00 3 3.36% 2.65% 48.04% 5.20% 5.49% 3.81% 1.44 3 1.85% 1.85% 59.64% 8.22% 17.85% 1.85% 1.00 3 1.96% 1.96% 51.86% 7.88% 13.31% 1.96% 1.00 3 2.32% 2.28% 48.50% 6.57% 10.56% 2.28% 1.00 3 14.00% 4.57% 28.85% 6.47% 13.13% 6.10% 1.34 4 3.91% 1.48% 27.55% 8.12% 1.86% 1.49% 1.00 5 5.61% 2.06% 7.14% 43.01% 6.61% 3.91% 1.90 5 11.83% 4.57% 25.28% 34.40% 22.18% 8.28% 1.81 6 0.87% 0.86% 21.96% 24.65% 9.55% 0.86% 1.00 8 22.61% 8.29% 84.54% 32.03% 38.20% 23.61% 2.85 9 21.54% 5.03% 86.17% 12.50% 6.84% 6.17% 1.23

Protected Lower Upper Bounds Best Attribute Bound Average Greedy Average+LM Greedy+LM Ratio Attendant 1.91% 14.59% 2.34% 2.10% 2.08% 1.09 Father Race 0.92% 19.75% 1.50% 1.32% 1.28% 1.40 Mother Race 0.65% 13.43% 1.31% 1.14% 1.12% 1.71 Payer 1.74% 24.15% 1.96% 1.97% 1.89% 1.09

lower bound is between 1.00 and 1.58, showing that here too, DCP is estimated to a high accuracy.

7.2. Bounding min DCP without Confusion Matrices

In the second set of experiments, we used only the population-level frequencies (w, {(πa, ˆpa)}a A) to estimate min DCP. We ran the local optimization procedure provided in Section 6 to find an upper bound on min DCP, and compared this result to the lower bound Eq. (13), to provide a range of possible values for min DCP. This range can indicate whether the population-level frequencies point to a possible, or definite, large deviation from multiclass equalized odds, as measured by the DCP.

First, we calculated the local minimizer of min DCP for the same two labeled data sets that were tested in Section 7.1, this time without access to the confusion matrices. We then compared the obtained value to the actual ranges of DCP that were calculated using the confusion matrices in Section 7.1. Table 2 provides results for the US Census data set and the Natality data set. In all but a single case the local optimizer of min DCP was lower than the true DCP range of the classifier, showing that this value is indeed a relevant best-case value for DCP with unknown confusion matrices.

Next, we report two experiments for which we do not have ground truth labels to compare to, to demonstrate how this

Disparate Conditional Prediction in Multiclass Classifiers

Table 2: Comparing the output of the min DCP local optimizer (LO) to the DCP range calculated for US Census classifiers (top) and Natality (bottom). The ranges are derived from the results of Table 1, Table 3 and Table 4.

US Census Decision tree Nearest Neighbor

# Labels min DCP LO true DCP min DCP LO true DCP

3 2.47% 5.39% 7.65% 3.16% 6.56% 9.77% 3 1.11% 4.35% 5.28% 2.86% 6.13% 9.87% 3 1.20% 3.24% 3.25% 3.01% 6.82% 9.59% 3 0.84% 4.24% 4.25% 2.79% 6.16% 8.81% 3 0.98% 2.65% 3.81% 2.83% 7.15% 9.05% 3 1.13% 1.85% 2.82% 6.93% 9.08% 3 1.11% 1.96% 3.02% 6.56% 8.59% 3 0.68% 2.28% 2.82% 7.14% 9.47% 3 1.99% 4.57% 6.10% 2.40% 5.24% 7.01% 4 1.07% 1.48% 1.49% 4.29% 7.53% 10.81% 5 2.08% 2.06% 3.91% 6.03% 8.86% 10.74% 5 3.13% 4.57% 8.28% 3.29% 8.86% 10.74% 5 5.99% 9.50% 18.22% 5 5.62% 8.88% 21.25% 6 0.32% 0.86% 6.41% 10.10% 20.58% 8 4.08% 8.29% 23.61% 5.97% 7.89% 20.45% 9 2.38% 5.03% 6.17% 4.31% 7.91% 14.09%

Natality Decision tree k-Nearest-Neighbors

Protected Attribute min DCP LO true DCP min DCP LO true DCP Attendant 0.75% 1.91% 2.08% 0.74% 1.80% 1.82% Father Race 0.26% 0.92% 1.28% 0.32% 1.17% 1.18% Mother Race 0.12% 0.65% 1.12% 0.14% 0.61% 0.62% Payer 0.05% 1.74% 1.89% 0.59% 1.73% 1.75%

method can be used for studying various empirical questions. In the first experiment, we used data about the general elections in the UK from 1918 to 2019 (Watson et al., 2020). We studied the changes in voting patterns between elections by studying the min DCP value of a hypothetical classifier that would predict the vote of individuals in one general election to be the same as their vote in the previous general elections (ignoring the change in population between elections). We studied DCP when the protected attribute was the geographic region of the voters, as reported in the data set. A high DCP value of such a classifier would indicate a possibly large difference between voting pattern changes across regions. Figure 3 shows the value of min DCP by election year, revealing clear differences between different periods of the last century.1 The full results are reported in Table 9 in the appendix. In Appendix B, we report an additional experiment, on a US education data set.

8. Conclusion

In this work, we provided a definition and bounding methods for the DCP measure for multiclass classifiers. This provides a new tool for auditing fairness in multiclass classification. Because DCP is interpretable as a fraction of the population, the estimation methods that we proposed can be used to provide a clear evaluation of classifiers deviation

1Note the spike in Figure 3 for Oct. 1974 elections; they were unusual as they were held in the same year as the previous elections, and resulted in a significant political change (Roe-Crines, 2021).

1920 1930 1940 1950 1960 1970 1980 1990 2000 2010 2020

min DCT estimate

election year

UK elections

Figure 3: Calculated value min DCP for each election year in the UK elections dataset, where the value is reported for the classifier that attempts to predict this year s election result using the results of the previous election.

from equalized odds, even when there are many classes, protected attribute values, or highly imbalanced data sets.

Impact Statement

This paper studies the auditing of classifiers for fairness in the sense of multiclass equalized odds. Identifying classifiers that may violate fairness criteria is an important task that can help advance societal desiderata. Nonetheless, such use must always consider the suitability of the fairness criterion to the specific classification task. In addition, in some cases the methods that we propose may provide loose bounds for the studied criterion. These cases can be identified by observing a large gap between the upper bound and the lower bound. In such cases, care must be taken when making operative decisions based on the results of the methods.

Acknowledgements

We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), [funding reference number RGPIN-2024-05907]. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute; see https://vectorinstitute.ai/ partnerships/current-partners/. This work was supported by the Israeli Council for Higher Education (CHE) via the Data Science Research Center at Ben-Gurion University.

Alghamdi, W., Hsu, H., Jeong, H., Wang, H., Michalak, P., Asoodeh, S., and Calmon, F. Beyond adult and compas:

Disparate Conditional Prediction in Multiclass Classifiers

Fair multi-class prediction via information projection. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 38747 38760. Curran Associates, Inc., 2022.

Angwin, J., Larson, J., Mattu, S., and Kirchner, L. Machine bias. In Ethics of data and analytics, pp. 254 264. Auerbach Publications, 2022.

Awasthi, P., Beutel, A., Kleindessner, M., Morgenstern, J., and Wang, X. Evaluating fairness of machine learning models under uncertain and incomplete information. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pp. 206 214, 2021.

Bellamy, R. K., Dey, K., Hind, M., Hoffman, S. C., Houde, S., Kannan, K., Lohia, P., Martino, J., Mehta, S., Mojsilovi c, A., et al. AI fairness 360: An extensible toolkit for detecting and mitigating algorithmic bias. IBM Journal of Research and Development, 63(4/5):4 1, 2019.

Black, E., Elzayn, H., Chouldechova, A., Goldin, J., and Ho, D. Algorithmic fairness and vertical equity: Income fairness with irs tax audit models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, FAcc T 22, pp. 1479 1503, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450393522.

Calmon, F., Wei, D., Vinzamuri, B., Natesan Ramamurthy, K., and Varshney, K. R. Optimized pre-processing for discrimination prevention. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 3992 4001. Curran Associates, Inc., 2017.

Caton, S. and Haas, C. Fairness in machine learning: A survey. ACM Computing Surveys, 56(7):1 38, 2024.

CDC. United states natality public use file, 2017. URL https://ftp.cdc.gov/pub/ Health_Statistics/NCHS/Datasets/DVS/ natality/Nat2017us.zip.

Charalambous, C. and Conn, A. An efficient method to solve the minimax problem directly. SIAM Journal on Numerical Analysis, 15(1):162 187, 1978.

Cherian, J. J. and Cand es, E. J. Statistical inference for fairness auditing. Journal of Machine Learning Research, 25(149):1 49, 2024.

Cornacchia, G., Anelli, V. W., Biancofiore, G. M., Narducci, F., Pomo, C., Ragone, A., and Di Sciascio, E. Auditing fairness under unawareness through counterfactual reasoning. Information Processing & Management, 60(2): 103224, 2023.

Dastin, J. Amazon scraps secret ai recruiting tool that showed bias against women. In Ethics of data and analytics, pp. 296 299. Auerbach Publications, 2022.

Denis, C., Elie, R., Hebiri, M., and Hu, F. Fairness guarantees in multi-class classification with demographic parity. Journal of Machine Learning Research, 25(130):1 46, 2024.

Donini, M., Oneto, L., Ben-David, S., Shawe-Taylor, J. S., and Pontil, M. Empirical risk minimization under fairness constraints. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 2791 2801. Curran Associates, Inc., 2018.

Dua, D. and Graff, C. UCI machine learning repository, 2019. URL http://archive.ics.uci.edu/ml.

Fabris, A., Esuli, A., Moreo, A., and Sebastiani, F. Measuring fairness under unawareness of sensitive attributes: A quantification-based approach. Journal of Artificial Intelligence Research, 76:1117 1180, 2023.

Goyal, P., Soriano, A. R., Hazirbas, C., Sagun, L., and Usunier, N. Fairness indicators for systematic assessments of visual feature extractors. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, FAcc T 22, pp. 70 88, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450393522.

Grandinetti, J. Examining embedded apparatuses of ai in facebook and tiktok. AI & Society, pp. 1 14, 2023.

Hall, J., Galabova, I., Gottwald, L., and Feldmeier, M. Highs high performance software for linear optimization, 2023.

Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. In Advances in neural information processing systems, pp. 3315 3323, 2016.

Jung, S., Lee, D., Park, T., and Moon, T. Fair feature distillation for visual recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12115 12124, 2021.

Kafuria, A. D. A predictice model for health insurance premium rates using machine learning algorithms. Ph D thesis, University of Rwanda, 2022.

Kiessling, D., Zanelli, A., Nurkanovi c, A., Gillis, J., Diehl, M., Zeilinger, M., Pipeleers, G., and Swevers, J. A feasible sequential linear programming algorithm with application to time-optimal path planning problems. In 2022 IEEE 61st Conference on Decision and Control (CDC), pp. 1196 1203. IEEE, 2022.

Disparate Conditional Prediction in Multiclass Classifiers

Kleinberg, J., Mullainathan, S., and Raghavan, M. Inherent trade-offs in the fair determination of risk scores. In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.

Menon, A. K. and Williamson, R. C. The cost of fairness in binary classification. In Conference on Fairness, Accountability and Transparency, pp. 107 118, 2018.

Nocedal, J. and Wright, S. Numerical optimization. Springer Science & Business Media, 2006.

Pinz on, C., Palamidessi, C., Piantanida, P., and Valencia, F. On the incompatibility of accuracy and equal opportunity. Machine Learning, 113(5):2405 2434, 2024.

Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J., and Weinberger, K. Q. On fairness and calibration. In Advances in Neural Information Processing Systems, pp. 5680 5689, 2017.

Putzel, P. and Lee, S. Blackbox post-processing for multiclass fairness. ar Xiv preprint ar Xiv:2201.04461, 2022.

Roe-Crines, A. S. Who Governs? The General Election Defeats of 1974, pp. 355 375. Springer International Publishing, Cham, 2021.

Roth, J., Saint-Jacques, G., and Yu, Y. An outcome test of discrimination for ranked lists. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, FAcc T 22, pp. 350 356, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450393522.

Rouzot, J., Ferry, J., and Huguet, M.-J. Learning optimal fair scoring systems for multi-class classification. In 2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 197 204. IEEE, 2022.

Sabato, S. and Yom-Tov, E. Bounding the fairness and accuracy of classifiers from population statistics. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 8316 8325. PMLR, 13 18 Jul 2020.

Saleiro, P., Kuester, B., Hinkson, L., London, J., Stevens, A., Anisfeld, A., Rodolfa, K. T., and Ghani, R. Aequitas: A bias and fairness audit toolkit. ar Xiv preprint ar Xiv:1811.05577, 2018.

Taskesen, B., Blanchet, J., Kuhn, D., and Nguyen, V. A. A statistical test for probabilistic fairness. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pp. 648 665, 2021.

USDA Economic Research Service. Highest level of educational attainment, 2021. URL https://data.ers. usda.gov/reports.aspx?ID=17829.

Wang, H., He, L., Gao, R., and Calmon, F. Aleatoric and epistemic discrimination: Fundamental limits of fairness interventions. Advances in Neural Information Processing Systems, 36, 2024.

Wang, J., Liu, Y., and Levy, C. Fair classification with groupdependent label noise. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pp. 526 536, 2021.

Watson, C., Uberoi, E., and Loft, P. General election results from 1918 to 2019, 2020. URL https://commonslibrary.parliament. uk/research-briefings/cbp-8647/.

Xue, Z. Group adaboost with fairness constraint. In Proceedings of the 2023 SIAM International Conference on Data Mining (SDM), pp. 865 873. SIAM, 2023.

Disparate Conditional Prediction in Multiclass Classifiers

A. Deferred Experiment Results

Table 3 Table 9 below provide the remaining result of the experiments which are described in Section 7.

Table 3: Bounding the DCP for multiclass classifiers with known confusion matrices; US Census data set with nearest neighbor classifiers. Table columns as in Table 1.

# Error Lower Upper Bounds Best Labels Bound Average Greedy Average+LM Greedy+LM Ratio 3 27.71% 6.56% 19.71% 10.64% 10.80% 9.77% 1.49 3 28.71% 6.13% 19.00% 11.66% 10.67% 9.87% 1.61 3 26.22% 6.82% 18.44% 10.46% 11.82% 9.59% 1.41 3 24.61% 6.16% 19.54% 10.05% 11.23% 8.81% 1.43 3 24.24% 7.15% 20.04% 10.44% 12.04% 9.05% 1.27 3 23.52% 6.93% 20.42% 10.00% 11.97% 9.08% 1.31 3 23.43% 6.56% 19.15% 8.91% 11.10% 8.59% 1.31 3 23.46% 7.14% 20.01% 10.95% 12.08% 9.47% 1.33 3 19.45% 5.24% 23.24% 8.80% 9.98% 7.01% 1.34 4 27.22% 5.61% 39.16% 11.24% 18.47% 9.12% 1.63 5 13.87% 7.53% 64.62% 13.32% 14.62% 10.81% 1.43 5 17.54% 8.86% 59.21% 11.83% 29.39% 10.74% 1.21 5 48.57% 9.50% 41.93% 22.14% 28.16% 18.22% 1.92 5 54.92% 8.88% 36.77% 24.23% 24.88% 21.25% 2.39 6 50.47% 10.10% 48.96% 25.51% 33.19% 20.58% 2.04 8 42.67% 7.89% 69.92% 30.60% 47.78% 20.45% 2.59 9 39.56% 7.91% 99.27% 16.90% 17.99% 14.09% 1.78

Table 4: Bounding the DCP for multiclass classifiers with known confusion matrices; Natality data set with a k-Nearest Neighbor classifier. Table columns as in Table 1.

Test Lower Upper Bounds Best Bound Average Greedy Average+LM Greedy+LM Ratio Attendant 1.80% 28.53% 1.82% 1.82% 1.82% 1.01 Father Race 1.17% 46.54% 1.20% 1.18% 1.18% 1.01 Mother Race 0.61% 21.47% 0.69% 0.62% 0.62% 1.02 Payer 1.73% 54.61% 1.79% 1.76% 1.75% 1.01

Disparate Conditional Prediction in Multiclass Classifiers

Table 5: Bounding the DCP for multiclass classifiers with known confusion matrices; US Census data set with the neural network-based classifiers. Table columns as in Table 1.

# Error Lower Upper Bounds Best Ratio Labels Bound Average Greedy Average+LM Greedy+LM 3 12.76% 2.36% 20.19% 3.46% 3.04% 3.03% 1.28 3 12.12% 1.43% 23.16% 2.06% 2.75% 1.84% 1.28 3 8.95% 1.27% 18.89% 2.05% 1.75% 1.75% 1.38 3 8.96% 1.52% 16.63% 1.94% 1.83% 1.77% 1.16 3 6.04% 2.42% 22.88% 3.52% 2.96% 2.96% 1.22 3 8.26% 0.86% 34.80% 3.53% 2.24% 2.24% 2.60 3 13.31% 0.01% 68.05% 27.12% 0.01% 0.02% 2.45 3 8.07% 0.63% 24.58% 1.41% 0.79% 0.78% 1.23 3 10.39% 1.53% 40.93% 1.56% 2.06% 1.56% 1.02 4 4.08% 0.93% 41.29% 4.07% 2.51% 2.17% 2.33 5 4.09% 0.78% 69.65% 12.97% 1.89% 0.99% 1.28 5 5.40% 2.44% 66.73% 41.57% 3.18% 2.53% 1.04 5 17.48% 3.55% 46.89% 4.35% 4.24% 4.05% 1.14 5 21.35% 6.02% 64.82% 8.01% 8.86% 7.67% 1.27 6 17.76% 2.49% 66.61% 6.01% 4.30% 3.67% 1.47 8 26.69% 2.64% 68.10% 14.22% 5.66% 4.28% 1.62 9 18.19% 2.95% 80.76% 3.13% 4.15% 3.12% 1.06

Table 6: Bounding the DCP for multiclass classifiers with known confusion matrices; Natality data set with the neural network classifier. Table columns as in Table 1.

Protected Lower Upper Bounds Best Attribute Bound Average Greedy Average+LM Greedy+LM Ratio Attendant 1.03% 26.25% 1.05% 1.04% 1.04% 1.01 Father Race 0.67% 50.53% 0.73% 0.67% 0.67% 1.01 Mother Race 0.58% 13.25% 0.58% 0.58% 0.58% 1.01 Payer 1.21% 55.86% 1.21% 1.21% 1.21% 1.00

Disparate Conditional Prediction in Multiclass Classifiers

Table 7: Comparing the output of the min DCP local optimizer (LO) to the DCP range calculated for US Census classifiers for the neural network classifiers. The ranges are derived from Table 5.

US Census Neural Network # Labels min DCP LO true DCP 3 0.97% 2.36% 3.03% 3 0.63% 1.43% 1.84% 3 0.68% 1.27% 1.75% 3 0.53% 1.52% 1.77% 3 1.23% 2.42% 2.96% 3 0.56% 0.86% 2.24% 3 0.01% 0.01% 3 0.26% 0.63% 0.78% 3 0.90% 1.53% 1.56% 4 0.47% 0.93% 2.17% 5 0.70% 0.78% 0.99% 5 1.36% 2.44% 2.53% 5 1.09% 3.55% 4.05% 5 3.16% 6.02% 7.67% 6 0.80% 2.49% 3.67% 8 1.77% 2.64% 4.28% 9 1.63% 2.95% 3.12%

Table 8: Comparing the output of the min DCP local optimizer (LO) to the DCP range calculated for the Natality data set for the neural network classifier. The ranges are derived from Table 6.

Natality Neural Network Protected Attribute min DCP LO true DCP Attendant 0.15% 1.03% 1.04% Father Race 0.24% 0.67% Mother Race 0.10% 0.58% Payer 0.13% 1.21%

Disparate Conditional Prediction in Multiclass Classifiers

Table 9: Calculated min DCP local optimizers for the UK elections data. In each line, the election data from the listed baseline year was used to predict the vote in the listed prediction year.

Election years

Baseline Prediction # protected attribute values min DCP LO 1918 1922 12 6.36% 1922 1923 13 8.57% 1923 1924 13 3.26% 1924 1929 13 4.92% 1929 1931 13 5.14% 1931 1935 13 3.11% 1935 1945 13 5.69% 1945 1950 12 4.75% 1950 1951 12 3.79% 1951 1955 12 5.54% 1955 1959 12 2.40% 1959 1964 12 2.16% 1964 1966 12 1.37% 1966 1970 12 1.32% 1970 1974 (Feb) 12 2.69% 1974 (Feb) 1974 (Oct) 12 6.28% 1974 (Oct) 1979 12 3.67% 1979 1983 11 3.23% 1983 1987 12 1.43% 1987 1992 12 4.21% 1992 1997 11 5.25% 1997 2001 12 5.25% 2001 2005 12 4.83% 2005 2010 12 5.26% 2010 2015 12 3.96% 2015 2017 12 5.54% 2017 2019 11 5.24%

Disparate Conditional Prediction in Multiclass Classifiers

B. An Additional Experiment

In this experiment, we studied a data set on US education (USDA Economic Research Service, 2021), which provides the percentage of various levels of education attainment (e.g., high school, college) in each US state in each decade. Here too, we calculated min DCP for a hypothesized classifier that predicts the education level to be distributed the same in each state in each decade. The protected attribute as set to be the state. Table 10 provides our results. Here, we found no significant differences in the DCP of change patterns in different decades, indicating a fairly constant behavior of this measure of divergence between states. This type of analysis can be used for exploratory research on social questions.

Table 10: Calculated min DCP upper bounds for the US education data set.

Baseline Predicted min DCP upper bound 1970 1980 2.38% 1980 1990 2.94% 1990 2000 2.22% 2000 2015-2019 2.32%

C. The Greedy Initialization Procedure

We provide here the full details of the greedy initialization procedure presented in Section 5.

Let f := {fy}y Y, where fy : Y Y, be label mappings conditioned on the true label y, which can map some predicted labels to the same transformed label. For a given classifier C with distribution D, let C[f] be a hypothetical classifier that predicts fy(ˆy) whenever the true label is y and C would predict ˆy. For a given distribution P over (Y, ˆY , A), let P[f] be the distribution of (Y, f Y ( ˆY ), A). Then, D[f] is the distribution determined by C[f]. It is easy to see that DCP(C[f]) DCP(C). This is because the equality Dy a = (1 ηy a)Dy b +ηy a N y a implies that also Dy a[f] = (1 ηy a)Dy b[f]+ηy a N y a [f]. Thus, minimizing over ηy a for C[f] can never result in a solution of a higher value than minimizing for C. We use this observation to devise an iterative greedy optimization procedure.

For i [k 1], let fi := (fi,y)y Y be an indexed set of label mappings, fi,y : Y Y, defined as follows. Let yi be the i th label in Y that is different from y. Denote Yi = {y, y1, . . . , yi}. Note that Yk 1 = Y. For i [k 2], define fi,y(j) = j I[j Yi]+yi I[j / Yi]. Note that fi,y can be calculated from the image of fi 1,y. Hence, C[fi] is a refinement of C[fi 1]. In addition, fk 1,y is the identity. Thus, the following monotonicity property holds:

DCP(C[f1]) DCP(C[f2]) . . . DCP(C[fk 1]) = DCP(C).

Moreover, DCP(f1) can be calculated exactly as in case of binary classification, since the range of f1 includes only y and y1. Based on these observations, we define a greedy procedure for calculating an assignment for αy b to initialize the minimization in Eq. (5).

Let αy A[i] be row y of the confusion matrices of D[fi]. Then coordinates j Yi 1 of αy b[i] are the same as those of αy A, and coordinate yi has the value eαyyi a := Pk 1 j=i αyyj a . We have

DCPy(αy A[1]) DCPy(αy A[2]) . . . DCPy(αy A[k 1]) = DCPy(αy A).

The greedy procedure first calculates an assignment for αy b[1] that obtains the value of DCPy(αy A[1]). This is a binary problem, which can be solved exactly following Sabato & Yom-Tov (2020). Then, at each iteration i + 1 for i [k 2], a local minimum αy a[i + 1] for DCPy(αy A[i + 1]) is calculated by constraining αy b[i + 1] to have the same coordinates as αy b[i] on Yi 1, and minimizing over αyyi b , αyyi+1 b such that their sum is equal to coordinate yi in αy b[i]. This minimization can be solved exactly, as follows.

Denote the value of coordinate yi in αy b[i] by γ = 1 P

j Yi 1 αyj b . Minimizing the objective of DCPy(αy A[i + 1]) subject

Disparate Conditional Prediction in Multiclass Classifiers

to the constraints resulting from αy b[i] is equivalent solving the following problem:

Minimize α yyi b ,α yyi+1 b

a A waπy a max{max ˆy Yi η(αyˆy b , αyˆy a ), η(αyyi+1 b , eαyyi+1 a )}

s.t. αyyi b , αyyi+1 b 0 and αyyi b + αyyi+1 b = γ.

Letting va := maxˆy Yi 1 η(αyˆy b , αyˆy a ), this is equivalent to

Minimize α yyi b [0,γ]

a A waπy a max{va, η(αyyi b , αyyi a ), η(γ αyyi b , eαyyi+1 a )}.

Similarly to the case of binary classification, this objective is one-dimensional, and concave in each of the intervals defined by the inflection points of the η instances and the values for which any two of the expressions in the maximum are equal. Denote this set of points by M y i . Then the objective above is minimized by one of the values in the following set: M y i {αyyi a }a A {eαyyi+1 a }a A {0, γ}. Repeating this procedure until iteration i = k 1, we obtain an assignment for αy b which can be used to calculate an upper bound for DCPy(αy A).

Since the ordering of the labels in the greedy procedure is arbitrary, it is possible to attempt several different orderings and select the one that obtains the smallest DCP value. In our experiments, we tried 10 random orderings in each upper bound calculation.

D. More Details on the Local Optimization Procedure

Here we provide more details on the minimization of Eq. (15). Given the split in Eq. (16), the last constraint in Eq. (15) becomes

η(αy ˆy, hy a,ˆy) cy a η1(αy ˆy, hy a,ˆy) cy a η2(αy ˆy, hy a,ˆy) cy a ,

and now the constraints does not include singularity points, and can be locally approximated by two linear functions, one for η1 and one for η2 (see Figure 2). Explicitly, the Taylor approximation of ηi for i = 1, 2 are given by:

ηi(α + ϵa, b + ϵb) ηi(α, b) + ηi

α = b α2 , η1

(1 α)2 , and η2

Given the first order Taylor approximations above, we form the LP approximation of Eq. (15) around an iterate x(t) = { αy, Hy, cy}k y=1 as follows:

Minimize {αy},{Hy},{cy}

y ewy, cy (17)

s.t. 0 αy 1 y [k],

0 Hy 1 y [k],

0 cy 1 y [k],

αy, 1k = 1 y [k],

Hy1k = 1|A| y [k], Pk y=1 πy ahy a,ˆy = pˆy a ˆy [k], a [|A|].

ηi( αy ˆy, hy a,ˆy) + ηi

α (αy ˆy αy ˆy) + ηi

b (hy a,ˆy hy a,ˆy) cy a, y, ˆy [k], a [|A|], i = 1, 2,

Eq. (17) is an LP problem which is solved at each iteration by an LP solver, for which we use the scipy.optimize library. In addition to the box constraints of [0, 1] for all variables, it includes 2 k |A| + k equality constraints, and 2 k2 |A| inequality constraints. It can be seen that the number of inequality constraints is rather high, and as a result, so is the computational complexity of the algorithm if run as is. However, many of these constraints are not active in the solution, and in any case we limit the step size of our algorithm. Hence, we can ease the difficulty of the LP problem by

Disparate Conditional Prediction in Multiclass Classifiers

both limiting the search space of the LP solver, and removing the inequality constraints that seem to be inactive in the solution. To this end, we remove the inequality constraints where ηi( αy ˆy, hy a,ˆy) < 1, as we expect these not to be active at the solution. Furthermore, we use a maximum step size τ so that αy αy < τ, and Hy Hy < τ, where is the maximum norm. This condition is trivially incorporated in the box constraints of Eq. (17). Finally, given the LP approximation above, the solver of Eq. (15) follows the same lines as Alg. 1.

As in Section 6, numerical instabilities arise when αyˆy a is too close to 0 or 1. Thus, here as well we restrict these values to be in the segment [ε, 1 ε], this time as optimization variables through the box constraints. Also, we update the values pˆy a to be pˆy a = (1 kϵ)pˆy a + ϵ to guarantee a solution for Eq. (15) after updating the box constraints for αyˆy a .

Convergence of Algorithm 1 Our problem contains the non-linear functions η, which are approximated by linear functions. For a general non-linear programming problem, sequential linear programming (SLP) methods converge linearly if all functions are smooth. That is because they do not use Hessian information, and their whole purpose is to handle the constraints efficiently (determine who is active and who is not). If one uses Hessian information, we get sequential quadratic programming (SQP) and the convergence is asymptotically quadratic once the active set of constraints is identified, similar to Newton s method (Nocedal & Wright, 2006). For an SLP method like in Algorithm 1 to converge linearly, we first need to have smooth constraints, and for that we use the split of presented in Eq. (16) and detailed above. Furthermore, we also limit the size of the steps to be of size at most τ. This approach is called a Trust Region method, and is commonly used with sequential quadratic or linear programming methods. The method was analyzed in (Kiessling et al., 2022) for a case that is similar to ours, where the objective of the constrained problem is linear, like in Eq. (9) and Eq. (15). It is shown that the convergence of the SLP method is linear, and the rate depends on the radius of the trust region method (τ, in our case). In Figure 4 we show the convergence of our method for solving Eq. (15) for a convex η, so that the problem is convex and the solutions that are obtained by the method using all trust regions are equivalent. Specifically, to demonstrate the convergence we used

ηconvex(a, b) =

a2 ab b < a, (1 a)2 (1 b)(1 a) b > a, 0 b = a, (18)

instead of Eq. (4), and got the plots in Figure 4 for various maximal step sizes τ, for solving Eq. (15) for the labor dataset. Here, the plots show that the method is slower as τ is smaller, since the algorithm is constrained to be slower. However, in some cases, the inner LP solution can result in a stagnate direction, and then decreasing τ as shown in (Kiessling et al., 2022) can mitigate this. In our original problem, η and the whole problem are non-convex, hence the algorithm can converge to a different local minimum for each value of τ. For the problems reported in this paper, we used τ = 0.2 which seemed to work best, and decreased τ when the LP solver failed to converge.

0 20 40 60 80 100 Iteration

|obj - obj*|

max_step=0.2 max_step=0.1 max_step=0.05 max_step=0.025

Figure 4: The convergence of Algorithm 1 for different trust region (maximal step size) parameters τ.