# Fair Performance Metric Elicitation

Gaurush Hiranandani (UIUC, gaurush2@illinois.edu), Harikrishna Narasimhan (Google Research USA, hnarasimhan@google.com), Oluwasanmi Koyejo (UIUC & Google Research Accra, sanmi@illinois.edu)

Abstract

What is a fair performance metric? We consider the choice of fairness metrics through the lens of metric elicitation, a principled framework for selecting performance metrics that best reflect implicit preferences. The use of metric elicitation enables a practitioner to tune the performance and fairness metrics to the task, context, and population at hand. Specifically, we propose a novel strategy to elicit group-fair performance metrics for multiclass classification problems with multiple sensitive groups that also includes selecting the trade-off between predictive performance and fairness violation. The proposed elicitation strategy requires only relative preference feedback and is robust to both finite sample and feedback noise.

1 Introduction

Machine learning models are increasingly employed for critical decision-making tasks such as hiring and sentencing [44, 3, 11, 14, 31]. Yet, it is increasingly evident that automated decision-making is susceptible to bias, whereby decisions made by the algorithm are unfair to certain subgroups [5, 3, 10, 8, 31]. To this end, a wide variety of group fairness metrics have been proposed, all aiming to reduce discrimination and bias in automated decision-making [25, 13, 17, 29, 49, 32]. However, the dearth of formal principles for selecting the most appropriate metric has left experts, practitioners, and end users confused about which group fairness metric to employ [53]. This is further exacerbated by the observation that common metrics often lead to contradictory outcomes [29].

While the problem of selecting an appropriate fairness metric has gained prominence in recent years [17, 32, 53], it is perhaps best understood as a special case of the task of choosing evaluation metrics in machine learning. For instance, when a cost-sensitive predictive model classifies patients into cancer categories [50], even without considering fairness, it is often unclear how the cost trade-offs should be chosen so that they reflect the expert's decision-making, i.e., so that expert intuition is replaced by quantifiable metrics. The recently proposed Metric Elicitation (ME) framework [20, 21] provides a solution. ME is a principled framework for eliciting performance metrics using feedback over classifiers from an end user. The motivation behind ME is that employing performance metrics which reflect user trade-offs enables learning models that best capture user preferences [20]. As humans are often inaccurate in providing absolute preferences [41], Hiranandani et al. [20] propose to use pairwise comparison queries, where the user (oracle) is asked to compare two classifiers and provide a relative preference. Using such queries, ME aims to recover the oracle's metric. Figure 1 (reproduced from [20]) illustrates the ME framework. Existing research suggests a fundamental trade-off between algorithmic fairness and performance [25, 51, 11, 7, 32, 53], where in addition to appropriate metrics, the practitioner or policymaker must choose a trade-off operating point between the competing objectives [53].
To this end, we extend the ME framework from eliciting multiclass classification metrics [21] to the task of eliciting fair performance metrics from pairwise preference feedback in the presence of multiple sensitive groups. In particular, we elicit metrics that jointly reflect (i) predictive performance, evaluated as a weighting of the classifier's overall predictive rates, (ii) fairness violation, assessed as the discrepancy in predictive rates among groups, and (iii) a trade-off between the predictive performance and fairness violation.

Figure 1: Framework of Metric Elicitation [20].

Figure 2: $\mathcal{R}^1 \cap \cdots \cap \mathcal{R}^m$ (best seen in colors); $\mathcal{R}^u$, $u \in [m]$, are convex sets with common vertices $\mathbf{e}_i$, $i \in [k]$, and enclose the sphere $\mathcal{S}_\rho$.

Importantly, the elicited metrics are sufficiently flexible to encapsulate and generalize many existing predictive performance and fairness violation measures. In eliciting group-fair performance metrics, we tackle three new challenges. First, from a preference query perspective, the predictive performance and fairness violations are correlated, thus increasing the complexity of joint elicitation. Second, we find that in order to measure both positive and negative violations, the fair metrics are necessarily non-linear functions of the predictive rates, thus existing results on linear ME [21] cannot be applied directly. Finally, as we show, the number of groups directly impacts query complexity. We overcome these challenges by proposing a novel query-efficient procedure that exploits the geometric properties of the set of rates.

Contributions. We consider metrics for algorithmically group-fair classification and propose a novel approach for eliciting predictive performance, fairness violations, and their trade-off point from expert pairwise feedback. Our procedure uses binary-search based subroutines and recovers the metric with linear query complexity. Moreover, the procedure is robust to both finite sample and oracle feedback noise and is thus useful in practice. Lastly, our method can be applied either by querying preferences over classifiers or over rates. Such an equivalence is crucial for practical applications [20, 21].

Notations. Matrices and vectors are denoted by bold upper case and bold lower case letters, respectively. We denote the inner product of two vectors by $\langle \cdot, \cdot \rangle$ and the Hadamard product by $\odot$. The $\ell_2$-norm is denoted by $\|\cdot\|_2$. For $k \in \mathbb{Z}_+$, we represent the index set $\{1, 2, \ldots, k\}$ by $[k]$, and the $(k-1)$-dimensional simplex by $\Delta_k$. Given a matrix $\mathbf{A}$, off-diag($\mathbf{A}$) returns a vector of the off-diagonal elements of $\mathbf{A}$ in row-major form. Group membership is denoted by superscripts, and coordinates of vectors, matrices, and tuples are denoted by subscripts.

2 Background

The standard multiclass, multigroup classification setting comprises $k$ classes and $m$ groups, with $X \in \mathcal{X}$, $G \in [m]$, and $Y \in [k]$ representing the input, group membership, and output random variables, respectively. The groups are assumed to be disjoint and known a priori [17, 29]. We have access to a dataset $\{(x, g, y)_i\}_{i=1}^{n}$ of size $n$, generated iid from a distribution $\mathbb{P}(X, G, Y)$.

Group-specific rates: We consider separate (randomized) classifiers $h^g : \mathcal{X} \to \Delta_k$ for each group $g$, and use $\mathcal{H}^g = \{h^g : \mathcal{X} \to \Delta_k\}$ to denote the set of all classifiers for group $g$. The group-specific rate matrix $\mathbf{R}^g(h^g, \mathbb{P}) \in \mathbb{R}^{k \times k}$ for a classifier $h^g$ is given by:
$$R^g_{ij}(h^g, \mathbb{P}) := \mathbb{P}(h^g = j \mid Y = i, G = g) \quad \text{for } i, j \in [k]. \qquad (1)$$
Since the diagonal entries of the rate matrix can be written in terms of the off-diagonal entries:
$$R^g_{ii}(h^g, \mathbb{P}) = 1 - \sum_{j=1, j \neq i}^{k} R^g_{ij}(h^g, \mathbb{P}), \qquad (2)$$
any rate matrix is uniquely represented by its $q := k^2 - k$ off-diagonal elements as a vector $\mathbf{r}^g(h^g, \mathbb{P}) = \text{off-diag}(\mathbf{R}^g(h^g, \mathbb{P}))$. So we will interchangeably refer to the rate matrix as a vector of rates. The feasible set of rates associated with a group $g$ is denoted by $\mathcal{R}^g = \{\mathbf{r}^g(h^g, \mathbb{P}) : h^g \in \mathcal{H}^g\}$. For clarity, we will suppress the dependence on $\mathbb{P}$ and $h^g$ if it is clear from the context.

Overall rates: We define the overall classifier $h : (\mathcal{X}, [m]) \to \Delta_k$ by $h(x, g) := h^g(x)$ and denote its tuple of group-specific rates by $\mathbf{r}^{1:m} := (\mathbf{r}^1, \ldots, \mathbf{r}^m) \in \mathcal{R}^1 \times \cdots \times \mathcal{R}^m =: \mathcal{R}^{1:m}$. This tuple allows us to measure the fairness violation across groups. The fairness violation is believed to be in trade-off with the predictive performance [25, 7, 32]. The latter is measured using the overall rate matrix of the classifier $h$:
$$R_{ij} := \mathbb{P}(h = j \mid Y = i) = \sum_{g=1}^{m} t^g_i R^g_{ij}, \qquad (3)$$
where $t^g_i := \mathbb{P}(G = g \mid Y = i)$ is the prevalence of group $g$ within class $i$. For an overall classifier $h$, the vector of rates $\mathbf{r} = \text{off-diag}(\mathbf{R})$ can be conveniently written in terms of its group-specific tuple of rates as $\mathbf{r} = \sum_{g=1}^{m} \boldsymbol{\tau}^g \odot \mathbf{r}^g$, where $\boldsymbol{\tau}^g := \text{off-diag}([\mathbf{t}^g\ \mathbf{t}^g\ \cdots\ \mathbf{t}^g])$.

Fairness violation measure: The (approximate) fairness of a classifier is often determined by the discrepancy in rates across different groups, e.g., equalized odds [17, 4]. So given two groups $u, v \in [m]$, we define the discrepancy in their rates as:
$$\mathbf{d}^{uv} := |\mathbf{r}^u - \mathbf{r}^v|. \qquad (4)$$
Since there are $m$ groups, the number of discrepancy vectors is $\binom{m}{2}$.

2.1 Fair Performance Metric

We aim to elicit a general class of metrics, based on a trade-off between predictive performance and fairness violation, which recovers and generalizes existing fairness measures [25, 17, 10, 7, 32]. Let $\phi : [0, 1]^q \to \mathbb{R}$ be the cost of overall misclassification (aka. predictive performance) and $\varphi : [0, 1]^{mq} \to \mathbb{R}$ be the fairness violation cost for a classifier $h$, determined by the overall rates $\mathbf{r}(h)$ and the group discrepancies $\{\mathbf{d}^{uv}(h)\}_{u,v=1, v>u}^{m}$, respectively. Without loss of generality (wlog), we assume the metrics $\phi$ and $\varphi$ are costs. Moreover, the metrics are scale invariant, as global scale does not affect the learning problem [36]; hence let $\phi : [0, 1]^q \to [0, 1]$ and $\varphi : [0, 1]^{mq} \to [0, 1]$.

Definition 1 (Fair Performance Metric). Let $\phi$ and $\varphi$ be monotonically increasing linear functions of the overall rates and the group discrepancies, respectively. The fair metric $\Psi$ is a trade-off between $\phi$ and $\varphi$. In particular, given $\mathbf{a} \in \mathbb{R}^q$, $\mathbf{a} \geq 0$ (misclassification weights), a set of vectors $\mathbf{B} := \{\mathbf{b}^{uv} \in \mathbb{R}^q, \mathbf{b}^{uv} \geq 0\}_{u,v=1, v>u}^{m}$ (fairness violation weights), and a scalar $\lambda$ (trade-off) with
$$\|\mathbf{a}\|_2 = 1, \quad \sum_{u,v=1, v>u}^{m} \|\mathbf{b}^{uv}\|_2 = 1, \quad 0 \leq \lambda \leq 1, \qquad (5)$$
(wlog, due to scale invariance), we define the metric $\Psi$ as:
$$\Psi(\mathbf{r}^{1:m}; \mathbf{a}, \mathbf{B}, \lambda) := (1 - \lambda)\underbrace{\langle \mathbf{a}, \mathbf{r} \rangle}_{\phi(\mathbf{r})} \;+\; \lambda \underbrace{\sum_{u,v=1, v>u}^{m} \langle \mathbf{b}^{uv}, \mathbf{d}^{uv} \rangle}_{\varphi(\mathbf{r}^{1:m})}, \qquad (6)$$
where $1 - \lambda$ and $\lambda$ control the trade-off.

Examples of the misclassification cost $\phi(\mathbf{r})$ include cost-sensitive linear metrics [1]. Many existing fairness metrics for two classes and two groups, such as equal opportunity [17], balance for the negative class [29], error-rate balance (i.e., $0.5|r^1_1 - r^2_1| + 0.5|r^1_2 - r^2_2|$) [10], and weighted equalized odds (i.e., $b_1|r^1_1 - r^2_1| + b_2|r^1_2 - r^2_2|$) [17, 7], correspond to fairness violations of the form $\varphi(\mathbf{r}^{1:m})$ considered above. The combination of $\phi(\mathbf{r})$ and $\varphi(\mathbf{r}^{1:m})$ as defined in $\Psi(\mathbf{r}^{1:m})$ appears regularly in prior work [25, 7, 32].
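As a concrete illustration of the quantities above, the following minimal Python sketch (the helper names are ours, not the paper's) computes empirical group rate vectors as in Eq. (1), the overall rate of Eq. (3), the discrepancies of Eq. (4), and the metric $\Psi$ of Eq. (6). It is only an illustration of the definitions, not the authors' implementation.

```python
import numpy as np

def off_diag(M):
    """Off-diagonal entries of a k x k matrix in row-major order (length q = k^2 - k)."""
    k = M.shape[0]
    return M[~np.eye(k, dtype=bool)]

def group_rate_vector(y_true, y_pred, k):
    """Empirical rate vector r^g: off-diagonal entries of R^g_ij = P(h = j | Y = i), Eq. (1)."""
    R = np.zeros((k, k))
    for i in range(k):
        in_class = (y_true == i)
        if in_class.any():
            for j in range(k):
                R[i, j] = np.mean(y_pred[in_class] == j)
    return off_diag(R)

def fair_metric(rates, tau, a, B, lam):
    """Fair performance metric of Definition 1 (a cost; lower is better).
    rates: dict g -> r^g;  tau: dict g -> tau^g (so that r = sum_g tau^g * r^g, Eq. (3));
    a: misclassification weights;  B: dict (u, v), u < v -> b^{uv};  lam: trade-off in [0, 1]."""
    r = sum(tau[g] * rates[g] for g in rates)                    # overall rate vector
    phi = a @ r                                                  # predictive cost phi(r)
    varphi = sum(B[(u, v)] @ np.abs(rates[u] - rates[v])         # violation term, Eqs. (4), (6)
                 for (u, v) in B)
    return (1 - lam) * phi + lam * varphi                        # Psi, Eq. (6)
```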
Notice that the metric is flexible enough to allow different fairness violation costs for different pairs of groups, and is thus capable of enabling reverse discrimination [38]. Lastly, while the metric is linear with respect to (wrt.) the discrepancies, it is non-linear wrt. the group-wise rates. Hence, the standard linear ME algorithm [21] cannot be trivially applied for eliciting the metric in Definition 1.

2.2 Fair Performance Metric Elicitation: Problem Statement

We now state the problem of Fair Performance Metric Elicitation (FPME) and define the associated oracle query. The broad definitions follow from Hiranandani et al. [20, 21], extended so that the rates and the performance metrics correspond to the multiclass, multigroup-fair classification setting.

Definition 2 (Oracle Query). Given two classifiers $h_1, h_2$ (equivalent to a pair of rate tuples $\mathbf{r}^{1:m}_1, \mathbf{r}^{1:m}_2$, respectively), a query to the Oracle (with metric $\Psi$) is represented by:
$$\Gamma(h_1, h_2) = \Omega(\mathbf{r}^{1:m}_1, \mathbf{r}^{1:m}_2) = \mathbb{1}[\Psi(\mathbf{r}^{1:m}_1) > \Psi(\mathbf{r}^{1:m}_2)], \qquad (7)$$
where $\Gamma : \mathcal{H} \times \mathcal{H} \to \{0, 1\}$ and $\Omega : \mathcal{R}^{1:m} \times \mathcal{R}^{1:m} \to \{0, 1\}$. In words, the query asks whether $h_1$ is preferred to $h_2$ (equivalently, whether $\mathbf{r}^{1:m}_1$ is preferred to $\mathbf{r}^{1:m}_2$), as measured by $\Psi$.

In practice, the oracle can be an expert, a group of experts, or an entire user population. The ME framework can be applied by posing classifier comparisons directly to them via interpretable learning techniques [42, 12] or via A/B testing [45]. For example, one may perform A/B testing for an internet-based application by deploying two classifiers A and B and using the population's level of engagement to decide the preference between the two classifiers. For other applications, intuitive visualizations of the predictive rates of two different classifiers (see e.g., [53, 6]) can be used to ask for preference feedback from a group of domain experts.

We emphasize that the metric $\Psi$ used by the oracle is unknown to us and can be accessed only through queries to the oracle. Since the metrics we consider are functions of rates, comparing two classifiers on a metric is equivalent to comparing their corresponding rates. Henceforth, we will denote any query to the oracle by a pair of rate tuples $(\mathbf{r}^{1:m}_1, \mathbf{r}^{1:m}_2)$. Also, whenever we refer to an oracle's dimension, we are referring to the dimension of its rate arguments. For instance, we will consider the oracle in Definition 2 to be of dimension $mq$. Next, we formally state the FPME problem.

Definition 3 (Fair Performance Metric Elicitation with Pairwise Queries; given $\{(x, g, y)_i\}_{i=1}^n$). Suppose that the oracle's (unknown) performance metric is $\Psi$. Using oracle queries of the form $\Omega(\hat{\mathbf{r}}^{1:m}_1, \hat{\mathbf{r}}^{1:m}_2)$, where $\hat{\mathbf{r}}^{1:m}_1, \hat{\mathbf{r}}^{1:m}_2$ are rates estimated from samples, recover a metric $\hat{\Psi}$ such that $\|\Psi - \hat{\Psi}\| < \omega$ under a suitable norm, for a sufficiently small error tolerance $\omega > 0$.

Similar to the standard metric elicitation problems [20, 21], the performance of FPME is evaluated both by the fidelity of the recovered metric and by the query complexity. As done in the decision theory literature [30, 20], we present our FPME solution by first assuming access to population quantities such as the population rates $\mathbf{r}^{1:m}(h, \mathbb{P})$, and then discuss how elicitation can be performed from finite samples, e.g., with empirical rates $\hat{\mathbf{r}}^{1:m}(h, \{(x, g, y)_i\}_{i=1}^n)$.

2.3 Linear Performance Metric Elicitation (Warmup)

We revisit the Linear Performance Metric Elicitation (LPME) procedure from [21], which we will use as a subroutine to elicit fair performance metrics.
The LPME procedure assumes an enclosed sphere $\mathcal{S} \subset \mathcal{Z}$, where $\mathcal{Z}$ is the $q$-dimensional space of classifier statistics that are feasible, i.e., can be achieved by some classifier. It also assumes access to a $q$-dimensional oracle $\Omega'$ whose scale-invariant linear metric is of the form $\xi(\mathbf{z}) := \langle \mathbf{a}, \mathbf{z} \rangle$ with $\|\mathbf{a}\|_2 = 1$, analogous to the misclassification cost in Definition 1. Analogously, the oracle queries are of the type $\Omega'(\mathbf{z}_1, \mathbf{z}_2) := \mathbb{1}[\xi(\mathbf{z}_1) > \xi(\mathbf{z}_2)]$. When the number of classes $k = 2$, LPME elicits the coefficients $\mathbf{a}$ using a simple one-dimensional binary search. When $k > 2$, LPME performs binary search in each coordinate while keeping the others fixed, and repeats this in a coordinate-wise fashion until convergence. By restricting this coordinate-wise binary search procedure to posing queries from within a sphere $\mathcal{S}$, LPME can be equivalently seen as minimizing a strongly-convex function and can be shown to converge to a solution $\hat{\mathbf{a}}$ close to $\mathbf{a}$. Specifically, the algorithm takes the query space $\mathcal{S} \subset \mathcal{Z}$, binary-search tolerance $\epsilon$, and the oracle $\Omega'$ as input, and using $O(q \log(1/\epsilon))$ queries recovers $\hat{\mathbf{a}}$ with $\|\hat{\mathbf{a}}\|_2 = 1$ such that $\|\mathbf{a} - \hat{\mathbf{a}}\|_2 \leq O(\sqrt{q}\,\epsilon)$ (Theorem 2 in [21]). We provide details of the LPME procedure in Algorithm 2 (Appendix A) for completeness and summarize the discussion with the following remark.

Remark 1. Given a $q$-dimensional space $\mathcal{Z}$ enclosing a sphere $\mathcal{S} \subset \mathcal{Z}$ and an oracle $\Omega'$ with linear metric $\xi(\mathbf{z}) := \langle \mathbf{a}, \mathbf{z} \rangle$, the LPME algorithm (Algorithm 2, Appendix A) provides an estimate $\hat{\mathbf{a}}$ with $\|\hat{\mathbf{a}}\|_2 = 1$ such that the estimated slope is close to the true slope, i.e., $a_i / a_j \approx \hat{a}_i / \hat{a}_j$ for all $i, j \in [q]$. Note that the algorithm estimates the direction of the coefficient vector, not its magnitude.

3 Geometry of the product set $\mathcal{R}^{1:m}$

The LPME procedure described above works with rate queries of dimension $q$. We would like to use this procedure to elicit the fair metrics in Definition 1, which are defined on tuples of dimension $mq$. So to make use of LPME, we restrict our queries to a $q$-dimensional sphere $\mathcal{S}$ which is common to the feasible rate region $\mathcal{R}^g$ of each group $g$, i.e., to a sphere in the intersection $\mathcal{R}^1 \cap \cdots \cap \mathcal{R}^m$. We now show that such a sphere does indeed exist under a mild assumption.

Assumption 1. For all groups, the conditional class distributions are not identical, i.e., $\forall\, g \in [m]$ and $i \neq j$, $\mathbb{P}(Y = i \mid X, G = g) \neq \mathbb{P}(Y = j \mid X, G = g)$. In other words, there is some non-trivial signal for classification within each group.

Let $\mathbf{e}_i \in \{0, 1\}^q$ be the rate profile of a trivial classifier that predicts class $i$ on all inputs. Note that these trivial classifiers evaluate to the same rates $\mathbf{e}_i$ irrespective of which group we apply them to.

Figure 3: Workflow of the FPME procedure.

Algorithm 1: FPM Elicitation
Input: Query spaces $\mathcal{S}_\rho$, $\mathcal{S}^+_\varrho$, search tolerance $\epsilon > 0$, and oracle $\Omega$
1: $\hat{\mathbf{a}} \leftarrow$ LPME($\mathcal{S}_\rho$, $\epsilon$, $\Omega^{\text{class}}$)
2: If $m == 2$:
3: &nbsp;&nbsp; $\mathbf{f} \leftarrow$ LPME($\mathcal{S}_\rho$, $\epsilon$, $\Omega^{\text{viol}}_1$)
4: &nbsp;&nbsp; $\bar{\mathbf{f}} \leftarrow$ LPME($\mathcal{S}_\rho$, $\epsilon$, $\Omega^{\text{viol}}_2$)
5: &nbsp;&nbsp; $\hat{\mathbf{b}}^{12} \leftarrow$ normalized solution from (11)
6: Else: Let $L \leftarrow \emptyset$
7: &nbsp;&nbsp; For $\sigma \in \mathcal{M}$ do:
8: &nbsp;&nbsp;&nbsp;&nbsp; $\mathbf{f}^\sigma \leftarrow$ LPME($\mathcal{S}_\rho$, $\epsilon$, $\Omega^{\text{viol}}_{\sigma, 1}$)
9: &nbsp;&nbsp;&nbsp;&nbsp; $\bar{\mathbf{f}}^\sigma \leftarrow$ LPME($\mathcal{S}_\rho$, $\epsilon$, $\Omega^{\text{viol}}_{\sigma, k}$)
10: &nbsp;&nbsp;&nbsp;&nbsp; Let $\ell_\sigma$ be Eq. (13); extend $L \leftarrow L \cup \{\ell_\sigma\}$
11: &nbsp;&nbsp; $\hat{\mathbf{B}} \leftarrow$ normalized solution from (14) using $L$
12: $\hat{\lambda} \leftarrow$ Algorithm 4($\mathcal{S}^+_\varrho$, $\epsilon$, $\Omega^{\text{trade-off}}$)
Output: $\hat{\mathbf{a}}$, $\hat{\mathbf{B}}$, $\hat{\lambda}$

Proposition 1 (Geometry of $\mathcal{R}^{1:m}$; Figure 2). For any group $g \in [m]$, the set of confusion rates $\mathcal{R}^g$ is convex, bounded in $[0, 1]^q$, and has vertices $\{\mathbf{e}_i\}_{i=1}^{k}$. The intersection of the group rate sets $\mathcal{R}^1 \cap \cdots \cap \mathcal{R}^m$ is convex and always contains the rate $\mathbf{o} = \frac{1}{k}\sum_{i=1}^{k} \mathbf{e}_i$ in the interior, which is associated with the uniform random classifier that predicts each class with equal probability.
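As a small numeric illustration of the trivial rate profiles $\mathbf{e}_i$ and the center $\mathbf{o}$ appearing in Proposition 1 (the helper name below is ours), for $k = 3$:

```python
import numpy as np

def trivial_rate_profile(i, k):
    """Rate vector e_i of the trivial classifier predicting class i on every input:
    R_uj = P(h = j | Y = u) = 1 if j == i, else 0; returned as the q = k^2 - k
    off-diagonal entries in row-major order."""
    R = np.zeros((k, k))
    R[:, i] = 1.0
    return R[~np.eye(k, dtype=bool)]

k = 3
E = [trivial_rate_profile(i, k) for i in range(k)]
o = np.mean(E, axis=0)   # rate of the uniform random classifier (Proposition 1)
print(E[0])              # [0. 0. 1. 0. 1. 0.]
print(o)                 # every off-diagonal rate equals 1/k
```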
Since $\mathcal{R}^1 \cap \cdots \cap \mathcal{R}^m$ is convex and always contains a point $\mathbf{o}$ in the interior, we can make the following remark (see Figure 2 for an illustration).

Remark 2 (Existence of a common sphere $\mathcal{S}_\rho$). There exists a $q$-dimensional sphere $\mathcal{S}_\rho \subset \mathcal{R}^1 \cap \cdots \cap \mathcal{R}^m$ of non-zero radius $\rho$ centered at $\mathbf{o}$. Thus, any rate $\mathbf{s} \in \mathcal{S}_\rho$ is feasible for all groups, i.e., $\mathbf{s}$ is achievable by some classifier $h^g$ for every group $g \in [m]$.

A method to obtain $\mathcal{S}_\rho$ with a suitable radius $\rho$, taken from [21], is discussed in Appendix B.1. From Remark 2, we observe that any tuple of group rates $\mathbf{r}^{1:m} = (\mathbf{s}^1, \ldots, \mathbf{s}^m)$ chosen from $\mathcal{S}_\rho \times \cdots \times \mathcal{S}_\rho$ is achievable for some choice of group-specific classifiers $h^1, \ldots, h^m$. Moreover, when two groups $u, v$ are assigned the same rate profile $\mathbf{s} \in \mathcal{S}_\rho$, the fairness discrepancy $\mathbf{d}^{uv} = \mathbf{0}$. We will exploit these observations in the elicitation strategy we discuss next.

4 Metric Elicitation

We have access to an oracle whose (unknown) metric $\Psi$ given in Definition 1 is parameterized by $(\mathbf{a}, \mathbf{B}, \lambda)$. The proposed FPME framework for eliciting the oracle's metric is presented in Figure 3 and summarized in Algorithm 1. The procedure has three parts executed in sequence: (a) eliciting the misclassification cost $\phi(\mathbf{r})$ (i.e., $\mathbf{a}$), (b) eliciting the fairness violation $\varphi(\mathbf{r}^{1:m})$ (i.e., $\mathbf{B}$), and (c) eliciting the trade-off between the misclassification cost and the fairness violation (i.e., $\lambda$). For simplicity, we will suppress the coefficients $(\mathbf{a}, \mathbf{B}, \lambda)$ from the notation $\Psi$ whenever they are clear from context.

Notice that the metric $\Psi$ is piece-wise linear in its coefficients. So our high-level idea is to restrict the queries we pose to the oracle to lie within regions where the metric $\Psi$ is linear, so that we can then employ the LPME subroutine to elicit the corresponding linear coefficients. We will show, for each of the three components (a)-(c), how we can identify regions in the query space where the metric is linear and apply the LPME procedure (or a variant of it). By restricting the query inputs to those regions, we will essentially be converting the $(mq)$-dimensional oracle $\Omega$ in Definition 2 into an equivalent $q$-dimensional oracle that compares rates $\mathbf{s}_1, \mathbf{s}_2$ from the common sphere $\mathcal{S}_\rho \subset \mathcal{R}^1 \cap \cdots \cap \mathcal{R}^m$. We first discuss our approach assuming the oracle has no feedback noise, and later, in Section 5, show that our approach is robust to noisy feedback and provide query complexity guarantees.

4.1 Eliciting the Misclassification Cost $\phi(\mathbf{r})$: Part 1 in Figure 3 and Line 1 in Algorithm 1

To elicit the misclassification cost coefficients $\mathbf{a}$, we query from a region of the query space where the fairness violation term in the metric is zero. Specifically, we query group rate profiles of the form $\mathbf{r}^{1:m} = (\mathbf{s}, \ldots, \mathbf{s})$, where $\mathbf{s}$ is a $q$-dimensional rate from the common sphere $\mathcal{S}_\rho$. For these group rate profiles, the metric $\Psi$ simply evaluates to the linear misclassification term, i.e.:
$$\Psi(\mathbf{s}, \ldots, \mathbf{s}) = (1 - \lambda)\langle \mathbf{a}, \mathbf{s} \rangle.$$
So given a pair of group rate profiles $\mathbf{r}^{1:m}_1 = (\mathbf{s}_1, \ldots, \mathbf{s}_1)$ and $\mathbf{r}^{1:m}_2 = (\mathbf{s}_2, \ldots, \mathbf{s}_2)$, where $\mathbf{s}_1, \mathbf{s}_2 \in \mathcal{S}_\rho$, the oracle's response essentially compares $\mathbf{s}_1$ and $\mathbf{s}_2$ on the linear metric $(1 - \lambda)\langle \mathbf{a}, \mathbf{s} \rangle$. Hence, we estimate the coefficients $\mathbf{a}$ by applying LPME over the $q$-dimensional sphere $\mathcal{S}_\rho$ with a modified oracle $\Omega^{\text{class}}$, which takes a pair of rate profiles $\mathbf{s}_1$ and $\mathbf{s}_2$ from $\mathcal{S}_\rho$ as input and responds with:
$$\Omega^{\text{class}}(\mathbf{s}_1, \mathbf{s}_2) = \Omega((\mathbf{s}_1, \ldots, \mathbf{s}_1), (\mathbf{s}_2, \ldots, \mathbf{s}_2)).$$
This is described in line 1 of Algorithm 1, which applies the LPME subroutine with query space $\mathcal{S}_\rho$, binary search tolerance $\epsilon$, and the oracle $\Omega^{\text{class}}$.
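The following is a minimal, self-contained sketch of this step for the special case $q = 2$ (i.e., $k = 2$), with a toy oracle standing in for the expert. The wrapper `omega_class` and the simplified comparison-based search `lpme_2d` are our own illustrative stand-ins for the actual LPME subroutine (Algorithm 2, Appendix A), not the paper's implementation.

```python
import numpy as np

def omega_class(oracle, m):
    """Wrap the (m*q)-dimensional oracle into a q-dimensional one by assigning the same
    rate s to every group, which zeroes out the fairness-violation term of Psi."""
    return lambda s1, s2: oracle(tuple([s1] * m), tuple([s2] * m))

def lpme_2d(oracle_q, o, rho, eps=1e-3):
    """Simplified stand-in for LPME when q = 2: comparison-based trisection over the circle
    S_rho to locate the oracle's most preferred (lowest-cost) rate, whose direction from the
    center o reveals the weight vector a (up to scale)."""
    point = lambda th: o + rho * np.array([np.cos(th), np.sin(th)])
    lo, hi = np.pi, 1.5 * np.pi                   # minimizer of <a, s> lies here when a >= 0
    while hi - lo > eps:
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if oracle_q(point(m1), point(m2)):        # point(m1) has the higher cost -> drop [lo, m1]
            lo = m1
        else:
            hi = m2
    a_hat = (o - point(0.5 * (lo + hi))) / rho    # most preferred point is o - rho * a / ||a||
    return a_hat / np.linalg.norm(a_hat)

# Toy oracle with hidden parameters (a, b^{12}, lambda) for m = 2 groups, k = 2 classes (q = 2).
a_true, b12, lam = np.array([0.8, 0.6]), np.array([0.5, 0.5]), 0.3
tau1, tau2 = np.array([0.6, 0.6]), np.array([0.4, 0.4])  # group prevalences, tau1 + tau2 = 1
def psi(r):                                              # r = (r^1, r^2); Psi is a cost
    overall = tau1 * r[0] + tau2 * r[1]
    return (1 - lam) * (a_true @ overall) + lam * (b12 @ np.abs(r[0] - r[1]))
oracle = lambda r1, r2: psi(r1) > psi(r2)                # pairwise feedback only, Eq. (7)

a_hat = lpme_2d(omega_class(oracle, m=2), o=np.array([0.5, 0.5]), rho=0.2)
print(a_hat)                                             # close to a_true = (0.8, 0.6)
```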
From Remark 1, this subroutine returns a coefficient vector $\mathbf{f}$ with $\|\mathbf{f}\|_2 = 1$ such that:
$$\frac{(1 - \lambda)a_i}{(1 - \lambda)a_j} = \frac{a_i}{a_j} \approx \frac{f_i}{f_j} \quad \forall\, i, j \in [q].$$
By setting $\hat{\mathbf{a}} = \mathbf{f}$, we recover the classification coefficients independently of the fairness violation coefficients and the trade-off parameter. See part 1 in Figure 3 for further illustration.

4.2 Eliciting the Fairness Violation $\varphi(\mathbf{r}^{1:m})$: Part 2 in Figure 3 and Lines 2-11 in Algorithm 1

We now discuss eliciting the fairness term $\varphi(\mathbf{r}^{1:m})$. We first discuss the special case of $m = 2$ groups and later discuss how the proposed procedure can be extended to handle multiple groups.

4.2.1 Special Case of $m = 2$: Lines 2-5 in Algorithm 1

Recall from Definition 1 that in the violation term, we measure the group discrepancies using the absolute difference between the group rates, i.e., $\mathbf{d}^{12} = |\mathbf{r}^1 - \mathbf{r}^2|$. If we restrict our queries to only those rate profiles $\mathbf{r}^{1:2}$ for which the difference in each coordinate of $\mathbf{r}^1 - \mathbf{r}^2$ is either always positive or always negative, then we can treat the violation term as a linear metric within this region and apply LPME to estimate the associated coefficients. To this end, we pose to the oracle queries of the form $\mathbf{r}^{1:2} = (\mathbf{s}, \mathbf{e}_i)$, where we assign to group 1 a rate profile $\mathbf{s}$ from the common sphere $\mathcal{S}_\rho$, and to group 2 the rate profile $\mathbf{e}_i \in \{0, 1\}^q$ for some $i$. Remember that $\mathbf{e}_i$ is the rate vector associated with a trivial classifier that predicts class $i$ on all inputs, and is therefore a binary vector. Since we know whether each entry of $\mathbf{e}_i$ is a 0 or a 1, we can decipher the sign of each entry of the difference vector $\mathbf{s} - \mathbf{e}_i$. Hence, for group rate profiles of the above form, the metric $\Psi$ can be written as a linear function in $\mathbf{s}$:
$$\Psi(\mathbf{s}, \mathbf{e}_i) = \langle (1 - \lambda)\,\mathbf{a} \odot (\mathbf{1} - \boldsymbol{\tau}^2) + \lambda\, \mathbf{w}_i \odot \mathbf{b}^{12},\ \mathbf{s} \rangle + c_i, \qquad (9)$$
where $\mathbf{w}_i := \mathbf{1} - 2\mathbf{e}_i$ tells us the sign of each entry of $\mathbf{s} - \mathbf{e}_i$, $c_i$ is a constant, and we have used the fact that $\boldsymbol{\tau}^1 = \mathbf{1} - \boldsymbol{\tau}^2$. Fixing a class $i$, we then apply LPME over the $q$-dimensional sphere $\mathcal{S}_\rho$ with a modified oracle $\Omega^{\text{viol}}_i$, which takes a pair of rate profiles $\mathbf{s}_1, \mathbf{s}_2 \in \mathcal{S}_\rho$ as input and responds with:
$$\Omega^{\text{viol}}_i(\mathbf{s}_1, \mathbf{s}_2) = \Omega((\mathbf{s}_1, \mathbf{e}_i), (\mathbf{s}_2, \mathbf{e}_i)). \qquad (10)$$
One run of LPME with oracle $\Omega^{\text{viol}}_1$ results in $q - 1$ independent equations. In order to elicit the $q$-dimensional vector $\mathbf{b}^{12}$, we must run LPME again with oracle $\Omega^{\text{viol}}_2$. This is described in lines 3 and 4 of Algorithm 1. The LPME calls provide us with two slopes $\mathbf{f}, \bar{\mathbf{f}}$ with $\|\mathbf{f}\|_2 = \|\bar{\mathbf{f}}\|_2 = 1$, from which it is easy to obtain the fairness violation weights:
$$\hat{\mathbf{b}}^{12} = \frac{\bar{\mathbf{b}}^{12}}{\|\bar{\mathbf{b}}^{12}\|_2}, \quad \text{with} \quad \bar{\mathbf{b}}^{12} = \mathbf{w}_1 \odot \left[\delta\, \mathbf{f} - \hat{\mathbf{a}} \odot (\mathbf{1} - \boldsymbol{\tau}^2)\right], \qquad (11)$$
where $\delta$ is a scalar depending on the known entities $\boldsymbol{\tau}^2$, $\hat{\mathbf{a}}$, $\mathbf{f}$, and $\bar{\mathbf{f}}$. The derivation is provided in Appendix C.2.1. Because $\varphi$ is scale invariant (see Definition 1), the normalized solution $\hat{\mathbf{b}}^{12}$ is independent of the true trade-off $\lambda$ and depends only on the previously elicited vector $\hat{\mathbf{a}}$.

4.2.2 General Case of $m > 2$: Lines 6-11 in Algorithm 1

We briefly outline the elicitation procedure for $m > 2$ groups, with details in Appendix C.2.2. Let $\mathcal{M}$ be a set of subsets of the $m$ groups such that each element $\sigma \in \mathcal{M}$ and its complement $[m] \setminus \sigma$ partition the set of $m$ groups. We will later discuss how to choose $\mathcal{M}$ for efficient elicitation. Similar to the two-group case, we pose queries $\mathbf{r}^{1:m}$ in which the groups in a subset $\sigma \in \mathcal{M}$ are assigned the trivial rate vector $\mathbf{e}_i$, and the remaining groups $[m] \setminus \sigma$ are assigned a point $\mathbf{s}$ from the common sphere $\mathcal{S}_\rho$. Observe that within this query region, the metric $\Psi$ is linear in its inputs; a small sketch of this query construction is shown below, before we formalize the corresponding oracle.
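The following sketch illustrates the query construction just described (the helper names are ours, not the paper's notation); the resulting oracle is formalized in Eq. (12) next.

```python
# A sketch of the violation-oracle query construction: groups in sigma receive the trivial
# rate e_i, while all remaining groups receive the rate s drawn from the common sphere S_rho.
def violation_query(s, e_i, sigma, m):
    """Group rate tuple r^{1:m} with r^g = e_i if g in sigma, and r^g = s otherwise."""
    return tuple(e_i if g in sigma else s for g in range(1, m + 1))

def omega_viol(oracle, e_i, sigma, m):
    """q-dimensional oracle comparing s1, s2 by forwarding the two tuples to the full oracle."""
    return lambda s1, s2: oracle(violation_query(s1, e_i, sigma, m),
                                 violation_query(s2, e_i, sigma, m))
```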
So for a fixed partitioning of groups defined by $\sigma$, we apply LPME with query space $\mathcal{S}_\rho$ using the modified $q$-dimensional oracle:
$$\Omega^{\text{viol}}_{\sigma, i}(\mathbf{s}_1, \mathbf{s}_2) = \Omega(\mathbf{r}^{1:m}_1, \mathbf{r}^{1:m}_2), \quad \text{where} \quad \mathbf{r}^g_1 = \begin{cases} \mathbf{e}_i & \text{if } g \in \sigma \\ \mathbf{s}_1 & \text{o.w.} \end{cases} \quad \text{and} \quad \mathbf{r}^g_2 = \begin{cases} \mathbf{e}_i & \text{if } g \in \sigma \\ \mathbf{s}_2 & \text{o.w.} \end{cases} \qquad (12)$$
As described in lines 8 and 9 of the algorithm, we repeat this twice, fixing the class $i$ to 1 and to $k$. The guarantees for LPME then give us the following relationship between the coefficients $\mathbf{b}^{uv}$ we wish to elicit and the already elicited coefficients $\hat{\mathbf{a}}$:
$$\sum_{u, v\,:\, |\{u, v\} \cap \sigma| = 1} \bar{\mathbf{b}}^{uv} = \mathbf{w}_1 \odot \left[\delta_\sigma\, \mathbf{f}^\sigma - \hat{\mathbf{a}} \odot (\mathbf{1} - \boldsymbol{\tau}^\sigma)\right], \qquad (13)$$
where $\boldsymbol{\tau}^\sigma = \sum_{g \in \sigma} \boldsymbol{\tau}^g$ and $\bar{\mathbf{b}}^{uv} := \lambda \mathbf{b}^{uv} / (1 - \lambda)$ is a scaled version of the true (unknown) $\mathbf{b}^{uv}$. Since we need to estimate $\binom{m}{2}$ coefficients, we repeat the above procedure for $\binom{m}{2}$ partitions of the groups defined by $\sigma$ and obtain a system of $\binom{m}{2}$ linear equations. We may choose any $\mathcal{M}$ of size $\binom{m}{2}$ such that the equations are independent. From the solution to these equations, we recover the $\bar{\mathbf{b}}^{uv}$'s, which we normalize to get estimates of the final fairness violation weights:
$$\hat{\mathbf{b}}^{uv} = \frac{\bar{\mathbf{b}}^{uv}}{\sum_{u, v = 1, v > u}^{m} \|\bar{\mathbf{b}}^{uv}\|_2} \quad \text{for } u, v \in [m],\ v > u. \qquad (14)$$
Thanks to the normalization, the elicited fairness violation weights are independent of the trade-off $\lambda$.

4.3 Eliciting the Trade-off $\lambda$: Part 3 in Figure 3 and Line 12 in Algorithm 1

Equipped with estimates of the misclassification and fairness violation coefficients $(\hat{\mathbf{a}}, \hat{\mathbf{B}})$, the final step is to elicit the trade-off $\lambda$ between them. We now show how this can be posed as a one-dimensional binary search problem. Suppose we restrict our queries to be of the form $\mathbf{r}^{1:m} = (\mathbf{s}^+, \mathbf{o}, \ldots, \mathbf{o})$, where for all but the first group we assign the rate $\mathbf{o}$ associated with the uniform random classifier, and for the first group we assign some rate $\mathbf{s}^+$ such that $\mathbf{s}^+ \geq \mathbf{o}$. For these rate profiles, the group rate difference terms $\mathbf{r}^1 - \mathbf{r}^v = \mathbf{s}^+ - \mathbf{o} \geq \mathbf{0}$ for all $v \in \{2, \ldots, m\}$, and all the other difference terms are $\mathbf{0}$. As a result, the metric $\Psi$ is linear in the input rate profiles:
$$\Psi(\mathbf{s}^+, \mathbf{o}, \ldots, \mathbf{o}) = \Big\langle (1 - \lambda)\,\boldsymbol{\tau}^1 \odot \mathbf{a} + \lambda \sum_{v=2}^{m} \mathbf{b}^{1v},\ \mathbf{s}^+ \Big\rangle + c, \qquad (15)$$
where $c$ is a constant. Despite the metric being linear in the identified input region, we cannot directly apply the LPME procedure described in Section 2.3 to elicit $\lambda$, because we have one parameter to elicit while the input to the metric is $q$-dimensional. Here we propose a slight variant of LPME. Similar to the original procedure [20], we first construct a one-dimensional function $\vartheta$, which takes a guess of the trade-off parameter as input and outputs the quality of the guess. We show that this function is unimodal and that its mode coincides with the oracle's true trade-off parameter $\lambda$.

Lemma 1. Let $\mathcal{S}^+_\varrho \subset \mathcal{S}_\rho$ be a $q$-dimensional sphere with radius $\varrho < \rho$ such that $\mathbf{s}^+ \geq \mathbf{o}$ for all $\mathbf{s}^+ \in \mathcal{S}^+_\varrho$ (see Figure 2). Assume the estimates $\hat{\mathbf{a}}$ and $\hat{\mathbf{b}}^{uv}$'s satisfy the mild regularity condition $\langle \hat{\mathbf{a}}, \sum_{v=2}^{m} \hat{\mathbf{b}}^{1v} \rangle \neq 1$. Define a one-dimensional function $\vartheta$ as:
$$\vartheta(\tilde{\lambda}) := \Psi(\mathbf{s}_{\tilde{\lambda}}, \mathbf{o}, \ldots, \mathbf{o}), \qquad (16)$$
where
$$\mathbf{s}_{\tilde{\lambda}} = \operatorname*{argmax}_{\mathbf{s}^+ \in \mathcal{S}^+_\varrho} \Big\langle (1 - \tilde{\lambda})\,\boldsymbol{\tau}^1 \odot \hat{\mathbf{a}} + \tilde{\lambda} \sum_{v=2}^{m} \hat{\mathbf{b}}^{1v},\ \mathbf{s}^+ \Big\rangle. \qquad (17)$$
Then the function $\vartheta$ is strictly quasiconcave (and therefore unimodal) in $\tilde{\lambda}$. Moreover, the mode of this function is achieved at the oracle's true trade-off parameter $\lambda$.

For a candidate trade-off $\tilde{\lambda}$, the function $\vartheta$ first constructs a candidate linear metric based on (15), maximizes this candidate metric over inputs $\mathbf{s}^+$, and evaluates the oracle's true metric $\Psi$ at the maximizing rate profile. Note that we cannot directly compute the function $\vartheta$, as it needs the oracle's metric $\Psi$.
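Although $\vartheta$ itself is not computable, the maximizer in Eq. (17) is: a linear function over a sphere is maximized at the center plus the radius times the normalized weight vector. Below is a minimal sketch of this step and of the single pairwise query used to compare two trade-off candidates. The function names and the explicit center `c_plus` of $\mathcal{S}^+_\varrho$ are our own illustrative assumptions, and `b1_sum` stands for $\sum_{v=2}^{m} \hat{\mathbf{b}}^{1v}$.

```python
import numpy as np

def s_max(lam_t, a_hat, b1_sum, tau1, c_plus, varrho):
    """Maximizer in Eq. (17): <u, s> over a sphere of radius varrho centered at c_plus
    is maximized at c_plus + varrho * u / ||u||_2."""
    u = (1.0 - lam_t) * tau1 * a_hat + lam_t * b1_sum
    return c_plus + varrho * u / np.linalg.norm(u)

def compare_tradeoffs(oracle, lam1, lam2, a_hat, b1_sum, tau1, c_plus, varrho, o, m):
    """One pairwise query deciding whether vartheta(lam1) > vartheta(lam2): build the two
    (s^+, o, ..., o) rate profiles and forward them to the oracle (cf. Omega^trade-off)."""
    s1 = s_max(lam1, a_hat, b1_sum, tau1, c_plus, varrho)
    s2 = s_max(lam2, a_hat, b1_sum, tau1, c_plus, varrho)
    q1 = tuple([s1] + [o] * (m - 1))
    q2 = tuple([s2] + [o] * (m - 1))
    return oracle(q1, q2)  # 1 iff Psi(q1) > Psi(q2)
```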
However, given two candidates for the trade-off parameter $\tilde{\lambda}_1$ and $\tilde{\lambda}_2$, one can compare the values of $\vartheta(\tilde{\lambda}_1)$ and $\vartheta(\tilde{\lambda}_2)$ by finding the corresponding maximizers over $\mathbf{s}^+$ and querying the oracle to compare them. Because $\vartheta$ is unimodal, one can use a simple binary search with such pairwise comparisons to find the mode of the function, which we know coincides with the true $\lambda$. We provide an outline of this procedure in Algorithm 4 in Appendix C.3, which uses the modified oracle $\Omega^{\text{trade-off}}(\mathbf{s}^+_1, \mathbf{s}^+_2) = \Omega((\mathbf{s}^+_1, \mathbf{o}, \ldots, \mathbf{o}), (\mathbf{s}^+_2, \mathbf{o}, \ldots, \mathbf{o}))$ to compare the maximizers in (17). We also discuss in Appendix C.3 how the maximizer in (17) can be computed efficiently. Combining parts 1, 2, and 3 in Figure 3 completes the FPME procedure.

Figure 4: Elicitation error in recovering the oracle's metric.

5 Guarantees

We discuss elicitation guarantees under the following feedback model.

Definition 4 (Oracle Feedback Noise: $\epsilon_\Omega \geq 0$). For two rate tuples $\mathbf{r}^{1:m}_1, \mathbf{r}^{1:m}_2 \in \mathcal{R}^{1:m}$, the oracle responds correctly as long as $|\Psi(\mathbf{r}^{1:m}_1) - \Psi(\mathbf{r}^{1:m}_2)| > \epsilon_\Omega$. Otherwise, it may respond incorrectly.

In words, the oracle may respond incorrectly if the rates are very close as measured by the metric $\Psi$. Since deriving the final metric involves offline computations, including certain ratios, we discuss guarantees under a regularity assumption that ensures all components are well defined.

Assumption 2. We assume that $1 > c_1 > \lambda > c_2 > 0$, $\min_i |a_i| > c_3$, and $\min_i |(1 - \lambda) a_i \tau^\sigma_i - \lambda w_{ji} b^\sigma_i| > c_4$ for all $j \in [q]$ and $\sigma \in \mathcal{M}$, for some $c_1, c_2, c_3, c_4 > 0$; moreover, $\rho > \varrho \gg \epsilon_\Omega$ and $\langle \mathbf{a}, \sum_{v=2}^{m} \mathbf{b}^{1v} \rangle \neq 1$.

Theorem 1. Given $\epsilon, \epsilon_\Omega \geq 0$ and a 1-Lipschitz fair performance metric $\Psi$ parametrized by $(\mathbf{a}, \mathbf{B}, \lambda)$, under Assumptions 1 and 2, Algorithm 1 returns a metric $\hat{\Psi}$ with parameters:
- $\hat{\mathbf{a}}$: after $O\!\left(q \log \frac{1}{\epsilon}\right)$ queries, such that $\|\mathbf{a} - \hat{\mathbf{a}}\|_2 \leq O\!\left(\sqrt{q}\,\big(\epsilon + \sqrt{\epsilon_\Omega / \rho}\big)\right)$;
- $\hat{\mathbf{B}}$: after $O\!\left(\binom{m}{2} q \log \frac{1}{\epsilon}\right)$ queries, such that $\|\text{vec}(\mathbf{B}) - \text{vec}(\hat{\mathbf{B}})\|_2 \leq O\!\left(\sqrt{mq}\,\big(\epsilon + \sqrt{\epsilon_\Omega / \rho}\big)\right)$, where vec($\cdot$) vectorizes the matrix;
- $\hat{\lambda}$: after $O\!\left(\log \frac{1}{\epsilon}\right)$ queries, with error $|\lambda - \hat{\lambda}| \leq O\!\left(\epsilon + \sqrt{\epsilon_\Omega / \rho}\right)$.

We see that the proposed FPME procedure is robust to noise, and its query complexity depends linearly on the number of unknown entities. For instance, line 1 in Algorithm 1 elicits $\hat{\mathbf{a}} \in \mathbb{R}^q$ by posing $O(q)$ queries, the for loop in line 7 of Algorithm 1 runs for $\binom{m}{2}$ iterations, where each iteration requires $O(2q)$ queries, and finally, line 12 in Algorithm 1 is a simple binary search requiring $O(1)$ queries in the problem dimensions. Previous work shows that linear multiclass elicitation (LPME) elicits misclassification costs ($\phi$) with linear query complexity [21]. Surprisingly, our proposed FPME procedure elicits a more complex (non-linear) metric without increasing the order of the query complexity. Furthermore, since sample estimates of rates are consistent estimators and the metrics discussed are 1-Lipschitz wrt. the rates, with high probability we gather correct oracle feedback when querying with finite sample estimates $\Omega(\hat{\mathbf{r}}^{1:m}_1, \hat{\mathbf{r}}^{1:m}_2)$ instead of population statistics $\Omega(\mathbf{r}^{1:m}_1, \mathbf{r}^{1:m}_2)$, as long as we have sufficient samples. Apart from this, Algorithm 1 is agnostic to finite sample errors as long as the sphere $\mathcal{S}_\rho$ is contained within the feasible region $\mathcal{R}^1 \cap \cdots \cap \mathcal{R}^m$.

6 Experiments

We first empirically validate the FPME procedure and the recovery guarantees of Section 5. Recall that there exists a sphere $\mathcal{S}_\rho \subset \mathcal{R}^1 \cap \cdots \cap \mathcal{R}^m$ as long as there is a non-trivial classification signal within each group (Assumption 1). Thus, for the experiments, we assume access to a feasible sphere $\mathcal{S}_\rho$ with $\rho = 0.2$. We randomly generate 100 oracle metrics for each $k, m \in \{2, 3, 4, 5\}$, parametrized by $\{\mathbf{a}, \mathbf{B}, \lambda\}$.
This specifies the query outputs of the oracle for each metric in Algorithm 1. We then use Algorithm 1 with tolerance $\epsilon = 10^{-3}$ to elicit the corresponding metrics parametrized by $\{\hat{\mathbf{a}}, \hat{\mathbf{B}}, \hat{\lambda}\}$. Algorithm 1 makes $1 + 2|\mathcal{M}|$ subroutine calls to the LPME procedure and one call to Algorithm 4. The LPME subroutine requires exactly $16(q - 1)\log(\pi / 2\epsilon)$ queries, where we use 4 queries to shrink the interval in the binary search loop and fix 4 cycles for the coordinate-wise search. Also, Algorithm 4 requires $4\log(1/\epsilon)$ queries. In Figure 4, we report the mean of the $\ell_2$-norm between the oracle's metric and the elicited metric. Clearly, we elicit metrics that are close to the true metrics. Moreover, this holds true across a range of $m$ and $k$ values, demonstrating the robustness of the proposed approach. Figure 4(a) shows that the error $\|\mathbf{a} - \hat{\mathbf{a}}\|_2$ increases only with the number of classes $k$ and not with the number of groups $m$. This is expected, since $\hat{\mathbf{a}}$ is elicited by querying rates that zero out the fairness violation (Section 4.1). Figure 4(b) verifies Theorem 1 by showing that $\|\text{vec}(\mathbf{B}) - \text{vec}(\hat{\mathbf{B}})\|_2$ increases with both the number of classes $k$ and the number of groups $m$. In accord with Theorem 1, Figure 4(c) shows that the elicited trade-off $\hat{\lambda}$ is also close to the true $\lambda$. However, this elicitation error increases consistently with the number of groups $m$ but not with the number of classes $k$. A possible reason may be the cancellation of errors from eliciting $\hat{\mathbf{a}}$ and $\hat{\mathbf{B}}$ separately.

We also highlight the utility of FPME in ranking real-world classifiers in Appendix E. We find that even though FPME may incur elicitation error as discussed above, it achieves a near-perfect ranking of classifiers, whereas the default baseline metrics fail to achieve good rankings (see Appendix E for details). To connect to practice, this implies that when given a set of classifiers, a ranking based on elicited metrics will align most closely with a ranking based on the true metric, as compared to ranking classifiers based on default metrics. This is a crucial advantage of ME for practical purposes.

7 Related Work

Some early attempts at eliciting individual fairness metrics [22, 33] are distinct from ours, as we focus on the more prevalent setting of group fairness, for which, to our knowledge, there are no existing elicitation approaches. Zhang et al. [53] propose an approach that elicits only the trade-off between accuracy and fairness using complicated ratio queries. We, on the other hand, elicit the classification cost, the fairness violation, and their trade-off together as a non-linear function, all using much simpler pairwise comparison queries. Prior work on constrained classification focuses on learning classifiers under constraints for fairness [16, 17, 52, 34]. We take the regularization view of algorithmic fairness, where a fairness violation is embedded in the metric definition instead of imposed as constraints [25, 7, 11, 2, 32]. From the elicitation perspective, the closest line of work to ours is Hiranandani et al. [20, 21], who propose the problem of ME but solve it only for the simpler setting of classification without fairness. As we move to multiclass, multigroup fair performance ME, we find that the complexity of both the form of the metrics and the query space increases. This results in a starkly different elicitation strategy, with novel methods required to provide query complexity guarantees. Learning (linear) functions passively using pairwise comparisons is a mature field [24, 19, 40], but these approaches fail to control the sample (i.e., query) complexity.
Active learning in fairness [37] is a related direction; however, the aim there is to learn a fair classifier based on a fixed metric instead of eliciting the metric itself.

8 Discussion Points and Future Work

Transportability: Our elicitation procedure is independent of the population $\mathbb{P}$ as long as there exists a sphere of rates that is feasible for all groups. Thus, any metric that is learned using one dataset or model class (i.e., with an estimated distribution $\hat{\mathbb{P}}$) can be applied to other applications and datasets, as long as the expert believes the context and trade-offs are the same.

Extensions. Our proposal can be modified to leverage structure in the metric or the groups to further reduce the query complexity. For example, when the fairness violation weights are the same for all pairs of groups, the procedure in Section 4.2.2 requires only one partitioning of the groups to elicit the metric $\hat{\varphi}$. Such modifications are easy to incorporate. In the future, we plan to extend our approach to more complex metrics such as linear-fractional functions of rates and discrepancies.

Limitations of group-fair metrics. Since the metrics we consider depend on a classifier only through its rates, comparing two classifiers on these metrics is equivalent to comparing their rates. Unfortunately, with this setup, all the limitations associated with group-fairness definitions of metrics apply to our setup as well. For example, we may discard notions of individual fairness when only group rates are considered for comparing classifiers [9]. Similarly, issues associated with overlapping groups [27], detailed group specification [27], unknown or changing groups [18, 15], and noisy or biased group information [48], among others, pose limitations to our proposed setup. We hope that, as the first work on the topic, our work will inspire the research community to address many of these open problems for the task of metric elicitation.

Optimal bounds. We conjecture that our query complexity bounds are tight; however, we leave this detail for the future. In conclusion, we elicit a more complex (non-linear) group-fair metric with the same query complexity order as standard linear elicitation procedures for classification [21].

Broader Impact

The machine learning community has constructed many methods that create bias-free classifiers; however, it has long been accepted that fairness is not only an algorithmic concept but a societal one [53, 10]. So machine learning systems cannot presume how policymakers would like to handle fairness but have to elicit fairness criteria from them. Thus, our primary contribution is a framework for selecting metrics that can be tuned to expert panel preferences and are designed to measure intrinsic fairness trade-offs. We hope that by eliciting metrics, algorithmic fairness can be better tuned to the trade-offs of stakeholders, or be used to compare the preferences of different possible stakeholders. This paper seeks to truly democratize and personalize fair machine learning. Moreover, the significance of fair performance metric elicitation lies in how it empowers the practitioner to tune the design of machine learning models to the needs of the target fairness task. The transportability aspect of the proposed procedure is crucial here, since it allows practitioners to elicit metrics using any estimated distribution, perhaps from a simple model class, and then use the elicited fair metric for optimizing complex models or for evaluating models at test time for the target fairness task.
At the same time, this work may have drawbacks, because it leaves open the key question of who the stakeholders to be queried should be. This work also assumes a parametric form for the oracle metric, which may not be an exact match to practice. Furthermore, we should be cautious of failures of the system, which could cause disparate impact among sensitive groups when the elicited metric is incorrect, e.g., when applied to settings where the stated assumptions are not met.

Acknowledgments and Disclosure of Funding

We thank the anonymous reviewers for providing helpful and constructive feedback on the paper. Funding in support of this work: None.

References

[1] N. Abe, B. Zadrozny, and J. Langford. An iterative method for multi-class cost-sensitive learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 3-11. ACM, 2004.
[2] A. Agarwal, A. Beygelzimer, M. Dudik, J. Langford, and H. Wallach. A reductions approach to fair classification. In International Conference on Machine Learning, pages 60-69, 2018.
[3] J. Angwin, J. Larson, S. Mattu, and L. Kirchner. Machine bias: Risk assessments in criminal sentencing. ProPublica, May 23, 2016.
[4] S. Barocas, M. Hardt, and A. Narayanan. Fairness in machine learning. NIPS Tutorial, 2017.
[5] S. Barocas and A. D. Selbst. Big data's disparate impact. Calif. L. Rev., 104:671, 2016.
[6] E. Beauxis-Aussalet and L. Hardman. Visualization of confusion matrix for non-expert users. In IEEE Conference on Visual Analytics Science and Technology (VAST) - Poster Proceedings, 2014.
[7] Y. Bechavod and K. Ligett. Learning fair classifiers: A regularization-inspired approach. In 4th Workshop on Fairness, Accountability, and Transparency in Machine Learning (FATML), 2017.
[8] R. Berk, H. Heidari, S. Jabbari, M. Kearns, and A. Roth. Fairness in criminal justice risk assessments: The state of the art. Sociological Methods & Research, page 0049124118782533, 2018.
[9] R. Binns. On the apparent conflict between individual and group fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 514-524, 2020.
[10] A. Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2):153-163, 2017.
[11] S. Corbett-Davies, E. Pierson, A. Feller, S. Goel, and A. Huq. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 797-806, 2017.
[12] F. Doshi-Velez and B. Kim. Towards a rigorous science of interpretable machine learning. arXiv e-prints:1702.08608, 2017.
[13] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214-226, 2012.
[14] S. A. Friedler, C. Scheidegger, S. Venkatasubramanian, S. Choudhary, E. P. Hamilton, and D. Roth. A comparative study of fairness-enhancing interventions in machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 329-338, 2019.
[15] S. Gillen, C. Jung, M. Kearns, and A. Roth. Online learning with an unknown fairness metric. In Advances in Neural Information Processing Systems, pages 2600-2609, 2018.
[16] G. Goh, A. Cotter, M. Gupta, and M. P. Friedlander. Satisfying real-world goals with dataset constraints. In Advances in Neural Information Processing Systems, pages 2415-2423, 2016.
[17] M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pages 3315-3323, 2016.
[18] T. Hashimoto, M. Srivastava, H. Namkoong, and P. Liang. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, pages 1929-1938, 2018.
[19] R. Herbrich. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers, pages 115-132. The MIT Press, 2000.
[20] G. Hiranandani, S. Boodaghians, R. Mehta, and O. Koyejo. Performance metric elicitation from pairwise classifier comparisons. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 371-379, 2019.
[21] G. Hiranandani, S. Boodaghians, R. Mehta, and O. O. Koyejo. Multiclass performance metric elicitation. In Advances in Neural Information Processing Systems, pages 9351-9360, 2019.
[22] C. Ilvento. Metric learning for individual fairness. arXiv preprint arXiv:1906.00250, 2019.
[23] T. Joachims. SVMlight: Support vector machine. http://svmlight.joachims.org/, University of Dortmund, 19(4), 1999.
[24] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133-142. ACM, 2002.
[25] T. Kamishima, S. Akaho, H. Asoh, and J. Sakuma. Fairness-aware classifier with prejudice remover regularizer. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 35-50. Springer, 2012.
[26] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3146-3154, 2017.
[27] M. Kearns, S. Neel, A. Roth, and Z. S. Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In International Conference on Machine Learning, pages 2564-2572, 2018.
[28] D. G. Kleinbaum, K. Dietz, M. Gail, M. Klein, and M. Klein. Logistic Regression. Springer, 2002.
[29] J. Kleinberg, S. Mullainathan, and M. Raghavan. Inherent trade-offs in the fair determination of risk scores. In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.
[30] O. O. Koyejo, N. Natarajan, P. K. Ravikumar, and I. S. Dhillon. Consistent multilabel classification. In NIPS, pages 3321-3329, 2015.
[31] P. Lahoti, K. P. Gummadi, and G. Weikum. iFair: Learning individually fair data representations for algorithmic decision making. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 1334-1345. IEEE, 2019.
[32] A. K. Menon and R. C. Williamson. The cost of fairness in binary classification. In Conference on Fairness, Accountability and Transparency, pages 107-118, 2018.
[33] D. Mukherjee, M. Yurochkin, M. Banerjee, and Y. Sun. Two simple ways to learn individual fairness metrics from data. In ICML, 2020.
[34] H. Narasimhan. Learning with complex loss functions and constraints. In International Conference on Artificial Intelligence and Statistics, pages 1646-1654, 2018.
[35] H. Narasimhan, A. Cotter, and M. Gupta. Optimizing generalized rate metrics with three players. In Advances in Neural Information Processing Systems, pages 10746-10757, 2019.
[36] H. Narasimhan, H. Ramaswamy, A. Saha, and S. Agarwal. Consistent multiclass algorithms for complex performance measures. In ICML, pages 2398-2407, 2015.
[37] A. Noriega-Campero, M. A. Bakker, B. Garcia-Bulle, and A. Pentland. Active fairness in algorithmic decision making. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 77-83, 2019.
[38] S. Opotow. Affirmative action, fairness, and the scope of justice. Journal of Social Issues, 52(4):19-24, 1996.
[39] S. K. Pal and S. Mitra. Multilayer perceptron, fuzzy sets, classification. 1992.
[40] M. Peyrard, T. Botschen, and I. Gurevych. Learning to score system summaries for better content selection evaluation. In Proceedings of the Workshop on New Frontiers in Summarization, pages 74-84, 2017.
[41] B. Qian, X. Wang, F. Wang, H. Li, J. Ye, and I. Davidson. Active learning from relative queries. In IJCAI, pages 1614-1620, 2013.
[42] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In ACM SIGKDD, pages 1135-1144. ACM, 2016.
[43] G. S. Shieh. A weighted Kendall's tau statistic. Statistics & Probability Letters, 39(1):17-24, 1998.
[44] A. Singla, E. Horvitz, P. Kohli, and A. Krause. Learning to hire teams. In Third AAAI Conference on Human Computation and Crowdsourcing, 2015.
[45] G. Tamburrelli and A. Margara. Towards automated A/B testing. In International Symposium on Search Based Software Engineering, pages 184-198. Springer, 2014.
[46] S. K. Tavker, H. G. Ramaswamy, and H. Narasimhan. Consistent plug-in classifiers for complex objectives and constraints. In Advances in Neural Information Processing Systems, 2020, to appear.
[47] H. Valizadegan, R. Jin, R. Zhang, and J. Mao. Learning to rank by optimizing NDCG measure. In Advances in Neural Information Processing Systems, pages 1883-1891, 2009.
[48] S. Wang, W. Guo, H. Narasimhan, A. Cotter, M. Gupta, and M. I. Jordan. Robust optimization for fairness with noisy protected groups. 2020, to appear.
[49] B. Woodworth, S. Gunasekar, M. I. Ohannessian, and N. Srebro. Learning non-discriminatory predictors. In Conference on Learning Theory, pages 1920-1953, 2017.
[50] S. Yang and D. Q. Naiman. Multiclass cancer classification based on gene expression comparison. Statistical Applications in Genetics and Molecular Biology, 13(4):477-496, 2014.
[51] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pages 1171-1180, 2017.
[52] M. B. Zafar, I. Valera, M. G. Rodriguez, and K. P. Gummadi. Fairness constraints: Mechanisms for fair classification. In Artificial Intelligence and Statistics, pages 962-970, 2017.
[53] Y. Zhang, R. Bellamy, and K. Varshney. Joint optimization of AI fairness and utility: A human-centered approach. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pages 400-406, 2020.