# Fair Performance Metric Elicitation

Gaurush Hiranandani (UIUC, gaurush2@illinois.edu), Harikrishna Narasimhan (Google Research USA, hnarasimhan@google.com), Oluwasanmi Koyejo (UIUC & Google Research Accra, sanmi@illinois.edu)

Abstract

What is a fair performance metric? We consider the choice of fairness metrics through the lens of metric elicitation, a principled framework for selecting performance metrics that best reflect implicit preferences. The use of metric elicitation enables a practitioner to tune the performance and fairness metrics to the task, context, and population at hand. Specifically, we propose a novel strategy to elicit group-fair performance metrics for multiclass classification problems with multiple sensitive groups that also includes selecting the trade-off between predictive performance and fairness violation. The proposed elicitation strategy requires only relative preference feedback and is robust to both finite sample and feedback noise.

1 Introduction

Machine learning models are increasingly employed for critical decision-making tasks such as hiring and sentencing [44, 3, 11, 14, 31]. Yet, it is increasingly evident that automated decision-making is susceptible to bias, whereby decisions made by the algorithm are unfair to certain subgroups [5, 3, 10, 8, 31]. To this end, a wide variety of group fairness metrics have been proposed, all aiming to reduce discrimination and bias in automated decision-making [25, 13, 17, 29, 49, 32]. However, the dearth of formal principles for selecting the most appropriate metric has left experts, practitioners, and end users confused about which group fairness metric to employ [53]. This is further exacerbated by the observation that common metrics often lead to contradictory outcomes [29].

While the problem of selecting an appropriate fairness metric has gained prominence in recent years [17, 32, 53], it is perhaps best understood as a special case of the task of choosing evaluation metrics in machine learning. For instance, when a cost-sensitive predictive model classifies patients into cancer categories [50], even without considering fairness, it is often unclear how the cost trade-offs should be chosen so that they reflect the expert's decision-making, i.e., so that expert intuition is replaced by quantifiable metrics. The recently proposed Metric Elicitation (ME) framework [20, 21] provides a solution. ME is a principled framework for eliciting performance metrics using feedback over classifiers from an end user. The motivation behind ME is that employing performance metrics which reflect user trade-offs enables learning models that best capture user preferences [20]. As humans are often inaccurate in providing absolute preferences [41], Hiranandani et al. [20] propose to use pairwise comparison queries, where the user (oracle) is asked to compare two classifiers and provide a relative preference. Using such queries, ME aims to recover the oracle's metric. Figure 1 (reproduced from [20]) illustrates the ME framework. Existing research suggests a fundamental trade-off between algorithmic fairness and performance [25, 51, 11, 7, 32, 53], where in addition to appropriate metrics, the practitioner or policymaker must choose a trade-off operating point between the competing objectives [53].
To this end, we extend the ME framework from eliciting multiclass classification metrics [21] to the task of eliciting fair performance metrics from pairwise preference feedback in the presence of multiple sensitive groups. In particular, we elicit metrics that jointly reflect (i) predictive performance, evaluated as a weighting of the classifier's overall predictive rates, (ii) fairness violation, assessed as the discrepancy in predictive rates among groups, and (iii) a trade-off between the predictive performance and fairness violation.

Figure 1: Framework of Metric Elicitation [20].

Figure 2: $\mathcal{R}^1 \cap \cdots \cap \mathcal{R}^m$ (best seen in colors); $\mathcal{R}^u$, $u \in [m]$, are convex sets with common vertices $\mathbf{e}_i$, $i \in [k]$, and enclose the sphere $\mathcal{S}_\rho$.

Importantly, the elicited metrics are sufficiently flexible to encapsulate and generalize many existing predictive performance and fairness violation measures. In eliciting group-fair performance metrics, we tackle three new challenges. First, from a preference query perspective, the predictive performance and fairness violations are correlated, thus increasing the complexity of joint elicitation. Second, we find that in order to measure both positive and negative violations, the fair metrics are necessarily non-linear functions of the predictive rates, thus existing results on linear ME [21] cannot be applied directly. Finally, as we show, the number of groups directly impacts query complexity. We overcome these challenges by proposing a novel query-efficient procedure that exploits the geometric properties of the set of rates.

Contributions. We consider metrics for algorithmically group-fair classification and propose a novel approach for eliciting predictive performance, fairness violations, and their trade-off point from expert pairwise feedback. Our procedure uses binary-search based subroutines and recovers the metric with linear query complexity. Moreover, the procedure is robust to both finite sample and oracle feedback noise and is thus useful in practice. Lastly, our method can be applied either by querying preferences over classifiers or over rates. Such an equivalence is crucial for practical applications [20, 21].

Notations. Matrices and vectors are denoted by bold upper case and bold lower case letters, respectively. We denote the inner product of two vectors by $\langle \cdot, \cdot \rangle$ and the Hadamard product by $\odot$. The $\ell_2$-norm is denoted by $\|\cdot\|_2$. For $k \in \mathbb{Z}_+$, we represent the index set $\{1, 2, \ldots, k\}$ by $[k]$, and the $(k-1)$-dimensional simplex by $\Delta_k$. Given a matrix $\mathbf{A}$, off-diag($\mathbf{A}$) returns a vector of the off-diagonal elements of $\mathbf{A}$ in row-major form. Group membership is denoted by superscripts, and coordinates of vectors, matrices, and tuples are denoted by subscripts.

2 Background

The standard multiclass, multigroup classification setting comprises $k$ classes and $m$ groups, with $X \in \mathcal{X}$, $G \in [m]$, and $Y \in [k]$ representing the input, group membership, and output random variables, respectively. The groups are assumed to be disjoint and known a priori [17, 29]. We have access to a dataset $\{(x, g, y)_i\}_{i=1}^{n}$ of size $n$, generated iid from a distribution $\mathbb{P}(X, G, Y)$.

Group-specific rates: We consider separate (randomized) classifiers $h^g : \mathcal{X} \to \Delta_k$ for each group $g$, and use $\mathcal{H}^g = \{h^g : \mathcal{X} \to \Delta_k\}$ to denote the set of all classifiers for group $g$. The group-specific rate matrix $\mathbf{R}^g(h^g, \mathbb{P}) \in \mathbb{R}^{k \times k}$ for a classifier $h^g$ is given by:
$$R^g_{ij}(h^g, \mathbb{P}) := \mathbb{P}(h^g = j \mid Y = i, G = g) \quad \text{for } i, j \in [k]. \qquad (1)$$
Since the diagonal entries of the rate matrix can be written in terms of the off-diagonal entries:
$$R^g_{ii}(h^g, \mathbb{P}) = 1 - \sum_{j=1, j \neq i}^{k} R^g_{ij}(h^g, \mathbb{P}), \qquad (2)$$
any rate matrix is uniquely represented by its $q := k^2 - k$ off-diagonal elements as a vector $\mathbf{r}^g(h^g, \mathbb{P}) = \text{off-diag}(\mathbf{R}^g(h^g, \mathbb{P}))$. So we will interchangeably refer to the rate matrix as a vector of rates. The feasible set of rates associated with a group $g$ is denoted by $\mathcal{R}^g = \{\mathbf{r}^g(h^g, \mathbb{P}) : h^g \in \mathcal{H}^g\}$. For clarity, we will suppress the dependence on $\mathbb{P}$ and $h^g$ if it is clear from the context.

Overall rates: We define the overall classifier $h : (\mathcal{X}, [m]) \to \Delta_k$ by $h(x, g) := h^g(x)$ and denote its tuple of group-specific rates by $\mathbf{r}^{1:m} := (\mathbf{r}^1, \ldots, \mathbf{r}^m) \in \mathcal{R}^1 \times \cdots \times \mathcal{R}^m =: \mathcal{R}^{1:m}$. This tuple allows us to measure the fairness violation across groups. The fairness violation is believed to be in trade-off with the predictive performance [25, 7, 32]. The latter is measured using the overall rate matrix of the classifier $h$:
$$R_{ij} := \mathbb{P}(h = j \mid Y = i) = \sum_{g=1}^{m} t^g_i R^g_{ij}, \qquad (3)$$
where $t^g_i := \mathbb{P}(G = g \mid Y = i)$ is the prevalence of group $g$ within class $i$. For an overall classifier $h$, the vector of rates $\mathbf{r} = \text{off-diag}(\mathbf{R})$ can be conveniently written in terms of its group-specific tuple of rates as $\mathbf{r} = \sum_{g=1}^{m} \boldsymbol{\tau}^g \odot \mathbf{r}^g$, where $\boldsymbol{\tau}^g := \text{off-diag}([\mathbf{t}^g\ \mathbf{t}^g\ \cdots\ \mathbf{t}^g])$.

Fairness violation measure: The (approximate) fairness of a classifier is often determined by the discrepancy in rates across different groups, e.g., equalized odds [17, 4]. So given two groups $u, v \in [m]$, we define the discrepancy in their rates as:
$$\mathbf{d}^{uv} := |\mathbf{r}^u - \mathbf{r}^v|. \qquad (4)$$
Since there are $m$ groups, the number of discrepancy vectors is $\binom{m}{2}$.

2.1 Fair Performance Metric

We aim to elicit a general class of metrics, based on a trade-off between predictive performance and fairness violation, which recovers and generalizes existing fairness measures [25, 17, 10, 7, 32]. Let $\phi : [0, 1]^q \to \mathbb{R}$ be the cost of overall misclassification (aka. predictive performance) and $\varphi : [0, 1]^{mq} \to \mathbb{R}$ be the fairness violation cost for a classifier $h$, determined by the overall rates $\mathbf{r}(h)$ and the group discrepancies $\{\mathbf{d}^{uv}(h)\}_{u,v=1, v>u}^{m}$, respectively. Without loss of generality (wlog), we assume the metrics $\phi$ and $\varphi$ are costs. Moreover, the metrics are scale invariant, as global scale does not affect the learning problem [36]; hence let $\phi : [0, 1]^q \to [0, 1]$ and $\varphi : [0, 1]^{mq} \to [0, 1]$.

Definition 1 (Fair Performance Metric). Let $\phi$ and $\varphi$ be monotonically increasing linear functions of the overall rates and the group discrepancies, respectively. The fair metric $\Psi$ is a trade-off between $\phi$ and $\varphi$. In particular, given $\mathbf{a} \in \mathbb{R}^q$, $\mathbf{a} \geq 0$ (misclassification weights), a set of vectors $\mathbf{B} := \{\mathbf{b}^{uv} \in \mathbb{R}^q, \mathbf{b}^{uv} \geq 0\}_{u,v=1, v>u}^{m}$ (fairness violation weights), and a scalar $\lambda$ (trade-off) with
$$\|\mathbf{a}\|_2 = 1, \quad \sum_{u,v=1, v>u}^{m} \|\mathbf{b}^{uv}\|_2 = 1, \quad 0 \leq \lambda \leq 1, \qquad (5)$$
(wlog, due to scale invariance), we define the metric $\Psi$ as:
$$\Psi(\mathbf{r}^{1:m}; \mathbf{a}, \mathbf{B}, \lambda) := (1 - \lambda)\underbrace{\langle \mathbf{a}, \mathbf{r} \rangle}_{\phi(\mathbf{r})} \;+\; \lambda \underbrace{\sum_{u,v=1, v>u}^{m} \langle \mathbf{b}^{uv}, \mathbf{d}^{uv} \rangle}_{\varphi(\mathbf{r}^{1:m})}, \qquad (6)$$
where $1 - \lambda$ and $\lambda$ control the trade-off.

Examples of the misclassification cost $\phi(\mathbf{r})$ include cost-sensitive linear metrics [1]. Many existing fairness metrics for two classes and two groups, such as equal opportunity [17], balance for the negative class [29], error-rate balance (i.e., $0.5|r^1_1 - r^2_1| + 0.5|r^1_2 - r^2_2|$) [10], and weighted equalized odds (i.e., $b_1|r^1_1 - r^2_1| + b_2|r^1_2 - r^2_2|$) [17, 7], correspond to fairness violations of the form $\varphi(\mathbf{r}^{1:m})$ considered above. The combination of $\phi(\mathbf{r})$ and $\varphi(\mathbf{r}^{1:m})$ as defined in $\Psi(\mathbf{r}^{1:m})$ appears regularly in prior work [25, 7, 32].
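As a concrete illustration of the quantities above, the following minimal Python sketch (the helper names are ours, not the paper's) computes empirical group rate vectors as in Eq. (1), the overall rate of Eq. (3), the discrepancies of Eq. (4), and the metric $\Psi$ of Eq. (6). It is only an illustration of the definitions, not the authors' implementation.

```python
import numpy as np

def off_diag(M):
    """Off-diagonal entries of a k x k matrix in row-major order (length q = k^2 - k)."""
    k = M.shape[0]
    return M[~np.eye(k, dtype=bool)]

def group_rate_vector(y_true, y_pred, k):
    """Empirical rate vector r^g: off-diagonal entries of R^g_ij = P(h = j | Y = i), Eq. (1)."""
    R = np.zeros((k, k))
    for i in range(k):
        in_class = (y_true == i)
        if in_class.any():
            for j in range(k):
                R[i, j] = np.mean(y_pred[in_class] == j)
    return off_diag(R)

def fair_metric(rates, tau, a, B, lam):
    """Fair performance metric of Definition 1 (a cost; lower is better).
    rates: dict g -> r^g;  tau: dict g -> tau^g (so that r = sum_g tau^g * r^g, Eq. (3));
    a: misclassification weights;  B: dict (u, v), u < v -> b^{uv};  lam: trade-off in [0, 1]."""
    r = sum(tau[g] * rates[g] for g in rates)                    # overall rate vector
    phi = a @ r                                                  # predictive cost phi(r)
    varphi = sum(B[(u, v)] @ np.abs(rates[u] - rates[v])         # violation term, Eqs. (4), (6)
                 for (u, v) in B)
    return (1 - lam) * phi + lam * varphi                        # Psi, Eq. (6)
```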
Notice that the metric is flexible enough to allow different fairness violation costs for different pairs of groups, and is thus capable of enabling reverse discrimination [38]. Lastly, while the metric is linear with respect to (wrt.) the discrepancies, it is non-linear wrt. the group-wise rates. Hence, the standard linear ME algorithm [21] cannot be trivially applied for eliciting the metric in Definition 1.

2.2 Fair Performance Metric Elicitation: Problem Statement

We now state the problem of Fair Performance Metric Elicitation (FPME) and define the associated oracle query. The broad definitions follow from Hiranandani et al. [20, 21], extended so that the rates and the performance metrics correspond to the multiclass, multigroup-fair classification setting.

Definition 2 (Oracle Query). Given two classifiers $h_1, h_2$ (equivalent to a pair of rate tuples $\mathbf{r}^{1:m}_1, \mathbf{r}^{1:m}_2$, respectively), a query to the Oracle (with metric $\Psi$) is represented by:
$$\Gamma(h_1, h_2) = \Omega(\mathbf{r}^{1:m}_1, \mathbf{r}^{1:m}_2) = \mathbb{1}[\Psi(\mathbf{r}^{1:m}_1) > \Psi(\mathbf{r}^{1:m}_2)], \qquad (7)$$
where $\Gamma : \mathcal{H} \times \mathcal{H} \to \{0, 1\}$ and $\Omega : \mathcal{R}^{1:m} \times \mathcal{R}^{1:m} \to \{0, 1\}$. In words, the query asks whether $h_1$ is preferred to $h_2$ (equivalently, whether $\mathbf{r}^{1:m}_1$ is preferred to $\mathbf{r}^{1:m}_2$), as measured by $\Psi$.

In practice, the oracle can be an expert, a group of experts, or an entire user population. The ME framework can be applied by posing classifier comparisons directly to them via interpretable learning techniques [42, 12] or via A/B testing [45]. For example, one may perform A/B testing for an internet-based application by deploying two classifiers A and B and using the population's level of engagement to decide the preference between the two classifiers. For other applications, intuitive visualizations of the predictive rates of two different classifiers (see e.g., [53, 6]) can be used to ask for preference feedback from a group of domain experts.

We emphasize that the metric $\Psi$ used by the oracle is unknown to us and can be accessed only through queries to the oracle. Since the metrics we consider are functions of rates, comparing two classifiers on a metric is equivalent to comparing their corresponding rates. Henceforth, we will denote any query to the oracle by a pair of rate tuples $(\mathbf{r}^{1:m}_1, \mathbf{r}^{1:m}_2)$. Also, whenever we refer to an oracle's dimension, we are referring to the dimension of its rate arguments. For instance, we will consider the oracle in Definition 2 to be of dimension $mq$. Next, we formally state the FPME problem.

Definition 3 (Fair Performance Metric Elicitation with Pairwise Queries; given $\{(x, g, y)_i\}_{i=1}^n$). Suppose that the oracle's (unknown) performance metric is $\Psi$. Using oracle queries of the form $\Omega(\hat{\mathbf{r}}^{1:m}_1, \hat{\mathbf{r}}^{1:m}_2)$, where $\hat{\mathbf{r}}^{1:m}_1, \hat{\mathbf{r}}^{1:m}_2$ are rates estimated from samples, recover a metric $\hat{\Psi}$ such that $\|\Psi - \hat{\Psi}\| < \omega$ under a suitable norm, for a sufficiently small error tolerance $\omega > 0$.

Similar to the standard metric elicitation problems [20, 21], the performance of FPME is evaluated both by the fidelity of the recovered metric and by the query complexity. As done in the decision theory literature [30, 20], we present our FPME solution by first assuming access to population quantities such as the population rates $\mathbf{r}^{1:m}(h, \mathbb{P})$, and then discuss how elicitation can be performed from finite samples, e.g., with empirical rates $\hat{\mathbf{r}}^{1:m}(h, \{(x, g, y)_i\}_{i=1}^n)$.

2.3 Linear Performance Metric Elicitation (Warmup)

We revisit the Linear Performance Metric Elicitation (LPME) procedure from [21], which we will use as a subroutine to elicit fair performance metrics.
The LPME procedure assumes an enclosed sphere $\mathcal{S} \subset \mathcal{Z}$, where $\mathcal{Z}$ is the $q$-dimensional space of classifier statistics that are feasible, i.e., can be achieved by some classifier. It also assumes access to a $q$-dimensional oracle $\Omega'$ whose scale-invariant linear metric is of the form $\xi(\mathbf{z}) := \langle \mathbf{a}, \mathbf{z} \rangle$ with $\|\mathbf{a}\|_2 = 1$, analogous to the misclassification cost in Definition 1. Analogously, the oracle queries are of the type $\Omega'(\mathbf{z}_1, \mathbf{z}_2) := \mathbb{1}[\xi(\mathbf{z}_1) > \xi(\mathbf{z}_2)]$. When the number of classes $k = 2$, LPME elicits the coefficients $\mathbf{a}$ using a simple one-dimensional binary search. When $k > 2$, LPME performs binary search in each coordinate while keeping the others fixed, and repeats this in a coordinate-wise fashion until convergence. By restricting this coordinate-wise binary search procedure to posing queries from within a sphere $\mathcal{S}$, LPME can be equivalently seen as minimizing a strongly-convex function and can be shown to converge to a solution $\hat{\mathbf{a}}$ close to $\mathbf{a}$. Specifically, the algorithm takes the query space $\mathcal{S} \subset \mathcal{Z}$, binary-search tolerance $\epsilon$, and the oracle $\Omega'$ as input, and using $O(q \log(1/\epsilon))$ queries recovers $\hat{\mathbf{a}}$ with $\|\hat{\mathbf{a}}\|_2 = 1$ such that $\|\mathbf{a} - \hat{\mathbf{a}}\|_2 \leq O(\sqrt{q}\,\epsilon)$ (Theorem 2 in [21]). We provide details of the LPME procedure in Algorithm 2 (Appendix A) for completeness and summarize the discussion with the following remark.

Remark 1. Given a $q$-dimensional space $\mathcal{Z}$ enclosing a sphere $\mathcal{S} \subset \mathcal{Z}$ and an oracle $\Omega'$ with linear metric $\xi(\mathbf{z}) := \langle \mathbf{a}, \mathbf{z} \rangle$, the LPME algorithm (Algorithm 2, Appendix A) provides an estimate $\hat{\mathbf{a}}$ with $\|\hat{\mathbf{a}}\|_2 = 1$ such that the estimated slope is close to the true slope, i.e., $a_i / a_j \approx \hat{a}_i / \hat{a}_j$ for all $i, j \in [q]$. Note that the algorithm estimates the direction of the coefficient vector, not its magnitude.

3 Geometry of the product set $\mathcal{R}^{1:m}$

The LPME procedure described above works with rate queries of dimension $q$. We would like to use this procedure to elicit the fair metrics in Definition 1, which are defined on tuples of dimension $mq$. So to make use of LPME, we restrict our queries to a $q$-dimensional sphere $\mathcal{S}$ which is common to the feasible rate region $\mathcal{R}^g$ of each group $g$, i.e., to a sphere in the intersection $\mathcal{R}^1 \cap \cdots \cap \mathcal{R}^m$. We now show that such a sphere does indeed exist under a mild assumption.

Assumption 1. For all groups, the conditional class distributions are not identical, i.e., $\forall\, g \in [m]$ and $i \neq j$, $\mathbb{P}(Y = i \mid X, G = g) \neq \mathbb{P}(Y = j \mid X, G = g)$. In other words, there is some non-trivial signal for classification within each group.

Let $\mathbf{e}_i \in \{0, 1\}^q$ be the rate profile of a trivial classifier that predicts class $i$ on all inputs. Note that these trivial classifiers evaluate to the same rates $\mathbf{e}_i$ irrespective of which group we apply them to.

Figure 3: Workflow of the FPME procedure.

Algorithm 1: FPM Elicitation
Input: Query spaces $\mathcal{S}_\rho$, $\mathcal{S}^+_\varrho$, search tolerance $\epsilon > 0$, and oracle $\Omega$
1: $\hat{\mathbf{a}} \leftarrow$ LPME($\mathcal{S}_\rho$, $\epsilon$, $\Omega^{\text{class}}$)
2: If $m == 2$:
3: &nbsp;&nbsp; $\mathbf{f} \leftarrow$ LPME($\mathcal{S}_\rho$, $\epsilon$, $\Omega^{\text{viol}}_1$)
4: &nbsp;&nbsp; $\bar{\mathbf{f}} \leftarrow$ LPME($\mathcal{S}_\rho$, $\epsilon$, $\Omega^{\text{viol}}_2$)
5: &nbsp;&nbsp; $\hat{\mathbf{b}}^{12} \leftarrow$ normalized solution from (11)
6: Else: Let $L \leftarrow \emptyset$
7: &nbsp;&nbsp; For $\sigma \in \mathcal{M}$ do:
8: &nbsp;&nbsp;&nbsp;&nbsp; $\mathbf{f}^\sigma \leftarrow$ LPME($\mathcal{S}_\rho$, $\epsilon$, $\Omega^{\text{viol}}_{\sigma, 1}$)
9: &nbsp;&nbsp;&nbsp;&nbsp; $\bar{\mathbf{f}}^\sigma \leftarrow$ LPME($\mathcal{S}_\rho$, $\epsilon$, $\Omega^{\text{viol}}_{\sigma, k}$)
10: &nbsp;&nbsp;&nbsp;&nbsp; Let $\ell_\sigma$ be Eq. (13); extend $L \leftarrow L \cup \{\ell_\sigma\}$
11: &nbsp;&nbsp; $\hat{\mathbf{B}} \leftarrow$ normalized solution from (14) using $L$
12: $\hat{\lambda} \leftarrow$ Algorithm 4($\mathcal{S}^+_\varrho$, $\epsilon$, $\Omega^{\text{trade-off}}$)
Output: $\hat{\mathbf{a}}$, $\hat{\mathbf{B}}$, $\hat{\lambda}$

Proposition 1 (Geometry of $\mathcal{R}^{1:m}$; Figure 2). For any group $g \in [m]$, the set of confusion rates $\mathcal{R}^g$ is convex, bounded in $[0, 1]^q$, and has vertices $\{\mathbf{e}_i\}_{i=1}^{k}$. The intersection of the group rate sets $\mathcal{R}^1 \cap \cdots \cap \mathcal{R}^m$ is convex and always contains the rate $\mathbf{o} = \frac{1}{k}\sum_{i=1}^{k} \mathbf{e}_i$ in the interior, which is associated with the uniform random classifier that predicts each class with equal probability.
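As a small numeric illustration of the trivial rate profiles $\mathbf{e}_i$ and the center $\mathbf{o}$ appearing in Proposition 1 (the helper name below is ours), for $k = 3$:

```python
import numpy as np

def trivial_rate_profile(i, k):
    """Rate vector e_i of the trivial classifier predicting class i on every input:
    R_uj = P(h = j | Y = u) = 1 if j == i, else 0; returned as the q = k^2 - k
    off-diagonal entries in row-major order."""
    R = np.zeros((k, k))
    R[:, i] = 1.0
    return R[~np.eye(k, dtype=bool)]

k = 3
E = [trivial_rate_profile(i, k) for i in range(k)]
o = np.mean(E, axis=0)   # rate of the uniform random classifier (Proposition 1)
print(E[0])              # [0. 0. 1. 0. 1. 0.]
print(o)                 # every off-diagonal rate equals 1/k
```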
Since $\mathcal{R}^1 \cap \cdots \cap \mathcal{R}^m$ is convex and always contains a point $\mathbf{o}$ in the interior, we can make the following remark (see Figure 2 for an illustration).

Remark 2 (Existence of a common sphere $\mathcal{S}_\rho$). There exists a $q$-dimensional sphere $\mathcal{S}_\rho \subset \mathcal{R}^1 \cap \cdots \cap \mathcal{R}^m$ of non-zero radius $\rho$ centered at $\mathbf{o}$. Thus, any rate $\mathbf{s} \in \mathcal{S}_\rho$ is feasible for all groups, i.e., $\mathbf{s}$ is achievable by some classifier $h^g$ for every group $g \in [m]$.

A method to obtain $\mathcal{S}_\rho$ with a suitable radius $\rho$, taken from [21], is discussed in Appendix B.1. From Remark 2, we observe that any tuple of group rates $\mathbf{r}^{1:m} = (\mathbf{s}^1, \ldots, \mathbf{s}^m)$ chosen from $\mathcal{S}_\rho \times \cdots \times \mathcal{S}_\rho$ is achievable for some choice of group-specific classifiers $h^1, \ldots, h^m$. Moreover, when two groups $u, v$ are assigned the same rate profile $\mathbf{s} \in \mathcal{S}_\rho$, the fairness discrepancy $\mathbf{d}^{uv} = \mathbf{0}$. We will exploit these observations in the elicitation strategy we discuss next.

4 Metric Elicitation

We have access to an oracle whose (unknown) metric $\Psi$ given in Definition 1 is parameterized by $(\mathbf{a}, \mathbf{B}, \lambda)$. The proposed FPME framework for eliciting the oracle's metric is presented in Figure 3 and summarized in Algorithm 1. The procedure has three parts executed in sequence: (a) eliciting the misclassification cost $\phi(\mathbf{r})$ (i.e., $\mathbf{a}$), (b) eliciting the fairness violation $\varphi(\mathbf{r}^{1:m})$ (i.e., $\mathbf{B}$), and (c) eliciting the trade-off between the misclassification cost and the fairness violation (i.e., $\lambda$). For simplicity, we will suppress the coefficients $(\mathbf{a}, \mathbf{B}, \lambda)$ from the notation $\Psi$ whenever they are clear from context.

Notice that the metric $\Psi$ is piece-wise linear in its coefficients. So our high-level idea is to restrict the queries we pose to the oracle to lie within regions where the metric $\Psi$ is linear, so that we can then employ the LPME subroutine to elicit the corresponding linear coefficients. We will show, for each of the three components (a)-(c), how we can identify regions in the query space where the metric is linear and apply the LPME procedure (or a variant of it). By restricting the query inputs to those regions, we will essentially be converting the $(mq)$-dimensional oracle $\Omega$ in Definition 2 into an equivalent $q$-dimensional oracle that compares rates $\mathbf{s}_1, \mathbf{s}_2$ from the common sphere $\mathcal{S}_\rho \subset \mathcal{R}^1 \cap \cdots \cap \mathcal{R}^m$. We first discuss our approach assuming the oracle has no feedback noise, and later, in Section 5, show that our approach is robust to noisy feedback and provide query complexity guarantees.

4.1 Eliciting the Misclassification Cost $\phi(\mathbf{r})$: Part 1 in Figure 3 and Line 1 in Algorithm 1

To elicit the misclassification cost coefficients $\mathbf{a}$, we query from a region of the query space where the fairness violation term in the metric is zero. Specifically, we query group rate profiles of the form $\mathbf{r}^{1:m} = (\mathbf{s}, \ldots, \mathbf{s})$, where $\mathbf{s}$ is a $q$-dimensional rate from the common sphere $\mathcal{S}_\rho$. For these group rate profiles, the metric $\Psi$ simply evaluates to the linear misclassification term, i.e.:
$$\Psi(\mathbf{s}, \ldots, \mathbf{s}) = (1 - \lambda)\langle \mathbf{a}, \mathbf{s} \rangle.$$
So given a pair of group rate profiles $\mathbf{r}^{1:m}_1 = (\mathbf{s}_1, \ldots, \mathbf{s}_1)$ and $\mathbf{r}^{1:m}_2 = (\mathbf{s}_2, \ldots, \mathbf{s}_2)$, where $\mathbf{s}_1, \mathbf{s}_2 \in \mathcal{S}_\rho$, the oracle's response essentially compares $\mathbf{s}_1$ and $\mathbf{s}_2$ on the linear metric $(1 - \lambda)\langle \mathbf{a}, \mathbf{s} \rangle$. Hence, we estimate the coefficients $\mathbf{a}$ by applying LPME over the $q$-dimensional sphere $\mathcal{S}_\rho$ with a modified oracle $\Omega^{\text{class}}$, which takes a pair of rate profiles $\mathbf{s}_1$ and $\mathbf{s}_2$ from $\mathcal{S}_\rho$ as input and responds with:
$$\Omega^{\text{class}}(\mathbf{s}_1, \mathbf{s}_2) = \Omega((\mathbf{s}_1, \ldots, \mathbf{s}_1), (\mathbf{s}_2, \ldots, \mathbf{s}_2)).$$
This is described in line 1 of Algorithm 1, which applies the LPME subroutine with query space $\mathcal{S}_\rho$, binary search tolerance $\epsilon$, and the oracle $\Omega^{\text{class}}$.
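The following is a minimal, self-contained sketch of this step for the special case $q = 2$ (i.e., $k = 2$), with a toy oracle standing in for the expert. The wrapper `omega_class` and the simplified comparison-based search `lpme_2d` are our own illustrative stand-ins for the actual LPME subroutine (Algorithm 2, Appendix A), not the paper's implementation.

```python
import numpy as np

def omega_class(oracle, m):
    """Wrap the (m*q)-dimensional oracle into a q-dimensional one by assigning the same
    rate s to every group, which zeroes out the fairness-violation term of Psi."""
    return lambda s1, s2: oracle(tuple([s1] * m), tuple([s2] * m))

def lpme_2d(oracle_q, o, rho, eps=1e-3):
    """Simplified stand-in for LPME when q = 2: comparison-based trisection over the circle
    S_rho to locate the oracle's most preferred (lowest-cost) rate, whose direction from the
    center o reveals the weight vector a (up to scale)."""
    point = lambda th: o + rho * np.array([np.cos(th), np.sin(th)])
    lo, hi = np.pi, 1.5 * np.pi                   # minimizer of <a, s> lies here when a >= 0
    while hi - lo > eps:
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if oracle_q(point(m1), point(m2)):        # point(m1) has the higher cost -> drop [lo, m1]
            lo = m1
        else:
            hi = m2
    a_hat = (o - point(0.5 * (lo + hi))) / rho    # most preferred point is o - rho * a / ||a||
    return a_hat / np.linalg.norm(a_hat)

# Toy oracle with hidden parameters (a, b^{12}, lambda) for m = 2 groups, k = 2 classes (q = 2).
a_true, b12, lam = np.array([0.8, 0.6]), np.array([0.5, 0.5]), 0.3
tau1, tau2 = np.array([0.6, 0.6]), np.array([0.4, 0.4])  # group prevalences, tau1 + tau2 = 1
def psi(r):                                              # r = (r^1, r^2); Psi is a cost
    overall = tau1 * r[0] + tau2 * r[1]
    return (1 - lam) * (a_true @ overall) + lam * (b12 @ np.abs(r[0] - r[1]))
oracle = lambda r1, r2: psi(r1) > psi(r2)                # pairwise feedback only, Eq. (7)

a_hat = lpme_2d(omega_class(oracle, m=2), o=np.array([0.5, 0.5]), rho=0.2)
print(a_hat)                                             # close to a_true = (0.8, 0.6)
```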
From Remark 1, this subroutine returns a coefficient vector $\mathbf{f}$ with $\|\mathbf{f}\|_2 = 1$ such that:
$$\frac{(1 - \lambda)a_i}{(1 - \lambda)a_j} = \frac{a_i}{a_j} \approx \frac{f_i}{f_j} \quad \forall\, i, j \in [q].$$
By setting $\hat{\mathbf{a}} = \mathbf{f}$, we recover the classification coefficients independently of the fairness violation coefficients and the trade-off parameter. See part 1 in Figure 3 for further illustration.

4.2 Eliciting the Fairness Violation $\varphi(\mathbf{r}^{1:m})$: Part 2 in Figure 3 and Lines 2-11 in Algorithm 1

We now discuss eliciting the fairness term $\varphi(\mathbf{r}^{1:m})$. We first discuss the special case of $m = 2$ groups and later discuss how the proposed procedure can be extended to handle multiple groups.

4.2.1 Special Case of $m = 2$: Lines 2-5 in Algorithm 1

Recall from Definition 1 that in the violation term, we measure the group discrepancies using the absolute difference between the group rates, i.e., $\mathbf{d}^{12} = |\mathbf{r}^1 - \mathbf{r}^2|$. If we restrict our queries to only those rate profiles $\mathbf{r}^{1:2}$ for which the difference in each coordinate of $\mathbf{r}^1 - \mathbf{r}^2$ is either always positive or always negative, then we can treat the violation term as a linear metric within this region and apply LPME to estimate the associated coefficients. To this end, we pose to the oracle queries of the form $\mathbf{r}^{1:2} = (\mathbf{s}, \mathbf{e}_i)$, where we assign to group 1 a rate profile $\mathbf{s}$ from the common sphere $\mathcal{S}_\rho$, and to group 2 the rate profile $\mathbf{e}_i \in \{0, 1\}^q$ for some $i$. Remember that $\mathbf{e}_i$ is the rate vector associated with a trivial classifier that predicts class $i$ on all inputs, and is therefore a binary vector. Since we know whether each entry of $\mathbf{e}_i$ is a 0 or a 1, we can decipher the sign of each entry of the difference vector $\mathbf{s} - \mathbf{e}_i$. Hence, for group rate profiles of the above form, the metric $\Psi$ can be written as a linear function in $\mathbf{s}$:
$$\Psi(\mathbf{s}, \mathbf{e}_i) = \langle (1 - \lambda)\,\mathbf{a} \odot (\mathbf{1} - \boldsymbol{\tau}^2) + \lambda\, \mathbf{w}_i \odot \mathbf{b}^{12},\ \mathbf{s} \rangle + c_i, \qquad (9)$$
where $\mathbf{w}_i := \mathbf{1} - 2\mathbf{e}_i$ tells us the sign of each entry of $\mathbf{s} - \mathbf{e}_i$, $c_i$ is a constant, and we have used the fact that $\boldsymbol{\tau}^1 = \mathbf{1} - \boldsymbol{\tau}^2$. Fixing a class $i$, we then apply LPME over the $q$-dimensional sphere $\mathcal{S}_\rho$ with a modified oracle $\Omega^{\text{viol}}_i$, which takes a pair of rate profiles $\mathbf{s}_1, \mathbf{s}_2 \in \mathcal{S}_\rho$ as input and responds with:
$$\Omega^{\text{viol}}_i(\mathbf{s}_1, \mathbf{s}_2) = \Omega((\mathbf{s}_1, \mathbf{e}_i), (\mathbf{s}_2, \mathbf{e}_i)). \qquad (10)$$
One run of LPME with oracle $\Omega^{\text{viol}}_1$ results in $q - 1$ independent equations. In order to elicit the $q$-dimensional vector $\mathbf{b}^{12}$, we must run LPME again with oracle $\Omega^{\text{viol}}_2$. This is described in lines 3 and 4 of Algorithm 1. The LPME calls provide us with two slopes $\mathbf{f}, \bar{\mathbf{f}}$ with $\|\mathbf{f}\|_2 = \|\bar{\mathbf{f}}\|_2 = 1$, from which it is easy to obtain the fairness violation weights:
$$\hat{\mathbf{b}}^{12} = \frac{\bar{\mathbf{b}}^{12}}{\|\bar{\mathbf{b}}^{12}\|_2}, \quad \text{with} \quad \bar{\mathbf{b}}^{12} = \mathbf{w}_1 \odot \left[\delta\, \mathbf{f} - \hat{\mathbf{a}} \odot (\mathbf{1} - \boldsymbol{\tau}^2)\right], \qquad (11)$$
where $\delta$ is a scalar depending on the known entities $\boldsymbol{\tau}^2$, $\hat{\mathbf{a}}$, $\mathbf{f}$, and $\bar{\mathbf{f}}$. The derivation is provided in Appendix C.2.1. Because $\varphi$ is scale invariant (see Definition 1), the normalized solution $\hat{\mathbf{b}}^{12}$ is independent of the true trade-off $\lambda$ and depends only on the previously elicited vector $\hat{\mathbf{a}}$.

4.2.2 General Case of $m > 2$: Lines 6-11 in Algorithm 1

We briefly outline the elicitation procedure for $m > 2$ groups, with details in Appendix C.2.2. Let $\mathcal{M}$ be a set of subsets of the $m$ groups such that each element $\sigma \in \mathcal{M}$ and its complement $[m] \setminus \sigma$ partition the set of $m$ groups. We will later discuss how to choose $\mathcal{M}$ for efficient elicitation. Similar to the two-group case, we pose queries $\mathbf{r}^{1:m}$ in which the groups in a subset $\sigma \in \mathcal{M}$ are assigned the trivial rate vector $\mathbf{e}_i$, and the remaining groups $[m] \setminus \sigma$ are assigned a point $\mathbf{s}$ from the common sphere $\mathcal{S}_\rho$. Observe that within this query region, the metric $\Psi$ is linear in its inputs; a small sketch of this query construction is shown below, before we formalize the corresponding oracle.
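The following sketch illustrates the query construction just described (the helper names are ours, not the paper's notation); the resulting oracle is formalized in Eq. (12) next.

```python
# A sketch of the violation-oracle query construction: groups in sigma receive the trivial
# rate e_i, while all remaining groups receive the rate s drawn from the common sphere S_rho.
def violation_query(s, e_i, sigma, m):
    """Group rate tuple r^{1:m} with r^g = e_i if g in sigma, and r^g = s otherwise."""
    return tuple(e_i if g in sigma else s for g in range(1, m + 1))

def omega_viol(oracle, e_i, sigma, m):
    """q-dimensional oracle comparing s1, s2 by forwarding the two tuples to the full oracle."""
    return lambda s1, s2: oracle(violation_query(s1, e_i, sigma, m),
                                 violation_query(s2, e_i, sigma, m))
```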
So for a fixed partitioning of groups defined by $\sigma$, we apply LPME with query space $\mathcal{S}_\rho$ using the modified $q$-dimensional oracle:
$$\Omega^{\text{viol}}_{\sigma, i}(\mathbf{s}_1, \mathbf{s}_2) = \Omega(\mathbf{r}^{1:m}_1, \mathbf{r}^{1:m}_2), \quad \text{where} \quad \mathbf{r}^g_1 = \begin{cases} \mathbf{e}_i & \text{if } g \in \sigma \\ \mathbf{s}_1 & \text{o.w.} \end{cases} \quad \text{and} \quad \mathbf{r}^g_2 = \begin{cases} \mathbf{e}_i & \text{if } g \in \sigma \\ \mathbf{s}_2 & \text{o.w.} \end{cases} \qquad (12)$$
As described in lines 8 and 9 of the algorithm, we repeat this twice, fixing the class $i$ to 1 and to $k$. The guarantees for LPME then give us the following relationship between the coefficients $\mathbf{b}^{uv}$ we wish to elicit and the already elicited coefficients $\hat{\mathbf{a}}$:
$$\sum_{u, v\,:\, |\{u, v\} \cap \sigma| = 1} \bar{\mathbf{b}}^{uv} = \mathbf{w}_1 \odot \left[\delta_\sigma\, \mathbf{f}^\sigma - \hat{\mathbf{a}} \odot (\mathbf{1} - \boldsymbol{\tau}^\sigma)\right], \qquad (13)$$
where $\boldsymbol{\tau}^\sigma = \sum_{g \in \sigma} \boldsymbol{\tau}^g$ and $\bar{\mathbf{b}}^{uv} := \lambda \mathbf{b}^{uv} / (1 - \lambda)$ is a scaled version of the true (unknown) $\mathbf{b}^{uv}$. Since we need to estimate $\binom{m}{2}$ coefficients, we repeat the above procedure for $\binom{m}{2}$ partitions of the groups defined by $\sigma$ and obtain a system of $\binom{m}{2}$ linear equations. We may choose any $\mathcal{M}$ of size $\binom{m}{2}$ such that the equations are independent. From the solution to these equations, we recover the $\bar{\mathbf{b}}^{uv}$'s, which we normalize to get estimates of the final fairness violation weights:
$$\hat{\mathbf{b}}^{uv} = \frac{\bar{\mathbf{b}}^{uv}}{\sum_{u, v = 1, v > u}^{m} \|\bar{\mathbf{b}}^{uv}\|_2} \quad \text{for } u, v \in [m],\ v > u. \qquad (14)$$
Thanks to the normalization, the elicited fairness violation weights are independent of the trade-off $\lambda$.

4.3 Eliciting the Trade-off $\lambda$: Part 3 in Figure 3 and Line 12 in Algorithm 1

Equipped with estimates of the misclassification and fairness violation coefficients $(\hat{\mathbf{a}}, \hat{\mathbf{B}})$, the final step is to elicit the trade-off $\lambda$ between them. We now show how this can be posed as a one-dimensional binary search problem. Suppose we restrict our queries to be of the form $\mathbf{r}^{1:m} = (\mathbf{s}^+, \mathbf{o}, \ldots, \mathbf{o})$, where for all but the first group we assign the rate $\mathbf{o}$ associated with the uniform random classifier, and for the first group we assign some rate $\mathbf{s}^+$ such that $\mathbf{s}^+ \geq \mathbf{o}$. For these rate profiles, the group rate difference terms $\mathbf{r}^1 - \mathbf{r}^v = \mathbf{s}^+ - \mathbf{o} \geq \mathbf{0}$ for all $v \in \{2, \ldots, m\}$, and all the other difference terms are $\mathbf{0}$. As a result, the metric $\Psi$ is linear in the input rate profiles:
$$\Psi(\mathbf{s}^+, \mathbf{o}, \ldots, \mathbf{o}) = \Big\langle (1 - \lambda)\,\boldsymbol{\tau}^1 \odot \mathbf{a} + \lambda \sum_{v=2}^{m} \mathbf{b}^{1v},\ \mathbf{s}^+ \Big\rangle + c, \qquad (15)$$
where $c$ is a constant. Despite the metric being linear in the identified input region, we cannot directly apply the LPME procedure described in Section 2.3 to elicit $\lambda$, because we have one parameter to elicit while the input to the metric is $q$-dimensional. Here we propose a slight variant of LPME. Similar to the original procedure [20], we first construct a one-dimensional function $\vartheta$, which takes a guess of the trade-off parameter as input and outputs the quality of the guess. We show that this function is unimodal and that its mode coincides with the oracle's true trade-off parameter $\lambda$.

Lemma 1. Let $\mathcal{S}^+_\varrho \subset \mathcal{S}_\rho$ be a $q$-dimensional sphere with radius $\varrho < \rho$ such that $\mathbf{s}^+ \geq \mathbf{o}$ for all $\mathbf{s}^+ \in \mathcal{S}^+_\varrho$ (see Figure 2). Assume the estimates $\hat{\mathbf{a}}$ and $\hat{\mathbf{b}}^{uv}$'s satisfy the mild regularity condition $\langle \hat{\mathbf{a}}, \sum_{v=2}^{m} \hat{\mathbf{b}}^{1v} \rangle \neq 1$. Define a one-dimensional function $\vartheta$ as:
$$\vartheta(\tilde{\lambda}) := \Psi(\mathbf{s}_{\tilde{\lambda}}, \mathbf{o}, \ldots, \mathbf{o}), \qquad (16)$$
where
$$\mathbf{s}_{\tilde{\lambda}} = \operatorname*{argmax}_{\mathbf{s}^+ \in \mathcal{S}^+_\varrho} \Big\langle (1 - \tilde{\lambda})\,\boldsymbol{\tau}^1 \odot \hat{\mathbf{a}} + \tilde{\lambda} \sum_{v=2}^{m} \hat{\mathbf{b}}^{1v},\ \mathbf{s}^+ \Big\rangle. \qquad (17)$$
Then the function $\vartheta$ is strictly quasiconcave (and therefore unimodal) in $\tilde{\lambda}$. Moreover, the mode of this function is achieved at the oracle's true trade-off parameter $\lambda$.

For a candidate trade-off $\tilde{\lambda}$, the function $\vartheta$ first constructs a candidate linear metric based on (15), maximizes this candidate metric over inputs $\mathbf{s}^+$, and evaluates the oracle's true metric $\Psi$ at the maximizing rate profile. Note that we cannot directly compute the function $\vartheta$, as it needs the oracle's metric $\Psi$.
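Although $\vartheta$ itself is not computable, the maximizer in Eq. (17) is: a linear function over a sphere is maximized at the center plus the radius times the normalized weight vector. Below is a minimal sketch of this step and of the single pairwise query used to compare two trade-off candidates. The function names and the explicit center `c_plus` of $\mathcal{S}^+_\varrho$ are our own illustrative assumptions, and `b1_sum` stands for $\sum_{v=2}^{m} \hat{\mathbf{b}}^{1v}$.

```python
import numpy as np

def s_max(lam_t, a_hat, b1_sum, tau1, c_plus, varrho):
    """Maximizer in Eq. (17): <u, s> over a sphere of radius varrho centered at c_plus
    is maximized at c_plus + varrho * u / ||u||_2."""
    u = (1.0 - lam_t) * tau1 * a_hat + lam_t * b1_sum
    return c_plus + varrho * u / np.linalg.norm(u)

def compare_tradeoffs(oracle, lam1, lam2, a_hat, b1_sum, tau1, c_plus, varrho, o, m):
    """One pairwise query deciding whether vartheta(lam1) > vartheta(lam2): build the two
    (s^+, o, ..., o) rate profiles and forward them to the oracle (cf. Omega^trade-off)."""
    s1 = s_max(lam1, a_hat, b1_sum, tau1, c_plus, varrho)
    s2 = s_max(lam2, a_hat, b1_sum, tau1, c_plus, varrho)
    q1 = tuple([s1] + [o] * (m - 1))
    q2 = tuple([s2] + [o] * (m - 1))
    return oracle(q1, q2)  # 1 iff Psi(q1) > Psi(q2)
```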
However, given two candidates for the trade-off parameter $\tilde{\lambda}_1$ and $\tilde{\lambda}_2$, one can compare the values of $\vartheta(\tilde{\lambda}_1)$ and $\vartheta(\tilde{\lambda}_2)$ by finding the corresponding maximizers over $\mathbf{s}^+$ and querying the oracle to compare them. Because $\vartheta$ is unimodal, one can use a simple binary search with such pairwise comparisons to find the mode of the function, which we know coincides with the true $\lambda$. We provide an outline of this procedure in Algorithm 4 in Appendix C.3, which uses the modified oracle $\Omega^{\text{trade-off}}(\mathbf{s}^+_1, \mathbf{s}^+_2) = \Omega((\mathbf{s}^+_1, \mathbf{o}, \ldots, \mathbf{o}), (\mathbf{s}^+_2, \mathbf{o}, \ldots, \mathbf{o}))$ to compare the maximizers in (17). We also discuss in Appendix C.3 how the maximizer in (17) can be computed efficiently. Combining parts 1, 2, and 3 in Figure 3 completes the FPME procedure.

Figure 4: Elicitation error in recovering the oracle's metric.

5 Guarantees

We discuss elicitation guarantees under the following feedback model.

Definition 4 (Oracle Feedback Noise: $\epsilon_\Omega \geq 0$). For two rate tuples $\mathbf{r}^{1:m}_1, \mathbf{r}^{1:m}_2 \in \mathcal{R}^{1:m}$, the oracle responds correctly as long as $|\Psi(\mathbf{r}^{1:m}_1) - \Psi(\mathbf{r}^{1:m}_2)| > \epsilon_\Omega$. Otherwise, it may respond incorrectly.

In words, the oracle may respond incorrectly if the rates are very close as measured by the metric $\Psi$. Since deriving the final metric involves offline computations, including certain ratios, we discuss guarantees under a regularity assumption that ensures all components are well defined.

Assumption 2. We assume that $1 > c_1 > \lambda > c_2 > 0$, $\min_i |a_i| > c_3$, and $\min_i |(1 - \lambda) a_i \tau^\sigma_i - \lambda w_{ji} b^\sigma_i| > c_4$ for all $j \in [q]$ and $\sigma \in \mathcal{M}$, for some $c_1, c_2, c_3, c_4 > 0$; moreover, $\rho > \varrho \gg \epsilon_\Omega$ and $\langle \mathbf{a}, \sum_{v=2}^{m} \mathbf{b}^{1v} \rangle \neq 1$.

Theorem 1. Given $\epsilon, \epsilon_\Omega \geq 0$ and a 1-Lipschitz fair performance metric $\Psi$ parametrized by $(\mathbf{a}, \mathbf{B}, \lambda)$, under Assumptions 1 and 2, Algorithm 1 returns a metric $\hat{\Psi}$ with parameters:
- $\hat{\mathbf{a}}$: after $O\!\left(q \log \frac{1}{\epsilon}\right)$ queries, such that $\|\mathbf{a} - \hat{\mathbf{a}}\|_2 \leq O\!\left(\sqrt{q}\,\big(\epsilon + \sqrt{\epsilon_\Omega / \rho}\big)\right)$;
- $\hat{\mathbf{B}}$: after $O\!\left(\binom{m}{2} q \log \frac{1}{\epsilon}\right)$ queries, such that $\|\text{vec}(\mathbf{B}) - \text{vec}(\hat{\mathbf{B}})\|_2 \leq O\!\left(\sqrt{mq}\,\big(\epsilon + \sqrt{\epsilon_\Omega / \rho}\big)\right)$, where vec($\cdot$) vectorizes the matrix;
- $\hat{\lambda}$: after $O\!\left(\log \frac{1}{\epsilon}\right)$ queries, with error $|\lambda - \hat{\lambda}| \leq O\!\left(\epsilon + \sqrt{\epsilon_\Omega / \rho}\right)$.

We see that the proposed FPME procedure is robust to noise, and its query complexity depends linearly on the number of unknown entities. For instance, line 1 in Algorithm 1 elicits $\hat{\mathbf{a}} \in \mathbb{R}^q$ by posing $O(q)$ queries, the for loop in line 7 of Algorithm 1 runs for $\binom{m}{2}$ iterations, where each iteration requires $O(2q)$ queries, and finally, line 12 in Algorithm 1 is a simple binary search requiring $O(1)$ queries in the problem dimensions. Previous work shows that linear multiclass elicitation (LPME) elicits misclassification costs ($\phi$) with linear query complexity [21]. Surprisingly, our proposed FPME procedure elicits a more complex (non-linear) metric without increasing the order of the query complexity. Furthermore, since sample estimates of rates are consistent estimators and the metrics discussed are 1-Lipschitz wrt. the rates, with high probability we gather correct oracle feedback when querying with finite sample estimates $\Omega(\hat{\mathbf{r}}^{1:m}_1, \hat{\mathbf{r}}^{1:m}_2)$ instead of population statistics $\Omega(\mathbf{r}^{1:m}_1, \mathbf{r}^{1:m}_2)$, as long as we have sufficient samples. Apart from this, Algorithm 1 is agnostic to finite sample errors as long as the sphere $\mathcal{S}_\rho$ is contained within the feasible region $\mathcal{R}^1 \cap \cdots \cap \mathcal{R}^m$.

6 Experiments

We first empirically validate the FPME procedure and the recovery guarantees of Section 5. Recall that there exists a sphere $\mathcal{S}_\rho \subset \mathcal{R}^1 \cap \cdots \cap \mathcal{R}^m$ as long as there is a non-trivial classification signal within each group (Assumption 1). Thus, for the experiments, we assume access to a feasible sphere $\mathcal{S}_\rho$ with $\rho = 0.2$. We randomly generate 100 oracle metrics for each $k, m \in \{2, 3, 4, 5\}$, parametrized by $\{\mathbf{a}, \mathbf{B}, \lambda\}$.
This specifies the query outputs of the oracle for each metric in Algorithm 1. We then use Algorithm 1 with tolerance $\epsilon = 10^{-3}$ to elicit the corresponding metrics parametrized by $\{\hat{\mathbf{a}}, \hat{\mathbf{B}}, \hat{\lambda}\}$. Algorithm 1 makes $1 + 2|\mathcal{M}|$ subroutine calls to the LPME procedure and one call to Algorithm 4. The LPME subroutine requires exactly $16(q - 1)\log(\pi / 2\epsilon)$ queries, where we use 4 queries to shrink the interval in the binary search loop and fix 4 cycles for the coordinate-wise search. Also, Algorithm 4 requires $4\log(1/\epsilon)$ queries. In Figure 4, we report the mean of the $\ell_2$-norm between the oracle's metric and the elicited metric. Clearly, we elicit metrics that are close to the true metrics. Moreover, this holds true across a range of $m$ and $k$ values, demonstrating the robustness of the proposed approach. Figure 4(a) shows that the error $\|\mathbf{a} - \hat{\mathbf{a}}\|_2$ increases only with the number of classes $k$ and not with the number of groups $m$. This is expected, since $\hat{\mathbf{a}}$ is elicited by querying rates that zero out the fairness violation (Section 4.1). Figure 4(b) verifies Theorem 1 by showing that $\|\text{vec}(\mathbf{B}) - \text{vec}(\hat{\mathbf{B}})\|_2$ increases with both the number of classes $k$ and the number of groups $m$. In accord with Theorem 1, Figure 4(c) shows that the elicited trade-off $\hat{\lambda}$ is also close to the true $\lambda$. However, this elicitation error increases consistently with the number of groups $m$ but not with the number of classes $k$. A possible reason may be the cancellation of errors from eliciting $\hat{\mathbf{a}}$ and $\hat{\mathbf{B}}$ separately.

We also highlight the utility of FPME in ranking real-world classifiers in Appendix E. We find that even though FPME may incur elicitation error as discussed above, it achieves a near-perfect ranking of classifiers, whereas the default baseline metrics fail to achieve good rankings (see Appendix E for details). To connect to practice, this implies that when given a set of classifiers, a ranking based on elicited metrics will align most closely with a ranking based on the true metric, as compared to ranking classifiers based on default metrics. This is a crucial advantage of ME for practical purposes.

7 Related Work

Some early attempts at eliciting individual fairness metrics [22, 33] are distinct from ours, as we focus on the more prevalent setting of group fairness, for which, to our knowledge, there are no existing elicitation approaches. Zhang et al. [53] propose an approach that elicits only the trade-off between accuracy and fairness using complicated ratio queries. We, on the other hand, elicit the classification cost, the fairness violation, and their trade-off together as a non-linear function, all using much simpler pairwise comparison queries. Prior work on constrained classification focuses on learning classifiers under constraints for fairness [16, 17, 52, 34]. We take the regularization view of algorithmic fairness, where a fairness violation is embedded in the metric definition instead of imposed as constraints [25, 7, 11, 2, 32]. From the elicitation perspective, the closest line of work to ours is Hiranandani et al. [20, 21], who propose the problem of ME but solve it only for the simpler setting of classification without fairness. As we move to multiclass, multigroup fair performance ME, we find that the complexity of both the form of the metrics and the query space increases. This results in a starkly different elicitation strategy, with novel methods required to provide query complexity guarantees. Learning (linear) functions passively using pairwise comparisons is a mature field [24, 19, 40], but these approaches fail to control the sample (i.e., query) complexity.
Active learning in fairness [37] is a related direction; however, the aim there is to learn a fair classifier based on a fixed metric instead of eliciting the metric itself.

8 Discussion Points and Future Work

Transportability: Our elicitation procedure is independent of the population $\mathbb{P}$ as long as there exists a sphere of rates that is feasible for all groups. Thus, any metric that is learned using one dataset or model class (i.e., with an estimated distribution $\hat{\mathbb{P}}$) can be applied to other applications and datasets, as long as the expert believes the context and trade-offs are the same.

Extensions. Our proposal can be modified to leverage structure in the metric or the groups to further reduce the query complexity. For example, when the fairness violation weights are the same for all pairs of groups, the procedure in Section 4.2.2 requires only one partitioning of the groups to elicit the metric $\hat{\varphi}$. Such modifications are easy to incorporate. In the future, we plan to extend our approach to more complex metrics such as linear-fractional functions of rates and discrepancies.

Limitations of group-fair metrics. Since the metrics we consider depend on a classifier only through its rates, comparing two classifiers on these metrics is equivalent to comparing their rates. Unfortunately, with this setup, all the limitations associated with group-fairness definitions of metrics apply to our setup as well. For example, we may discard notions of individual fairness when only group rates are considered for comparing classifiers [9]. Similarly, issues associated with overlapping groups [27], detailed group specification [27], unknown or changing groups [18, 15], and noisy or biased group information [48], among others, pose limitations to our proposed setup. We hope that, as the first work on the topic, our work will inspire the research community to address many of these open problems for the task of metric elicitation.

Optimal bounds. We conjecture that our query complexity bounds are tight; however, we leave this detail for the future. In conclusion, we elicit a more complex (non-linear) group-fair metric with the same query complexity order as standard linear elicitation procedures for classification [21].

Broader Impact

The machine learning community has constructed many methods that create bias-free classifiers; however, it has long been accepted that fairness is not only an algorithmic concept but a societal one [53, 10]. So machine learning systems cannot presume how policymakers would like to handle fairness but have to elicit fairness criteria from them. Thus, our primary contribution is a framework for selecting metrics that can be tuned to expert panel preferences and are designed to measure intrinsic fairness trade-offs. We hope that by eliciting metrics, algorithmic fairness can be better tuned to the trade-offs of stakeholders, or be used to compare the preferences of different possible stakeholders. This paper seeks to truly democratize and personalize fair machine learning. Moreover, the significance of fair performance metric elicitation lies in how it empowers the practitioner to tune the design of machine learning models to the needs of the target fairness task. The transportability aspect of the proposed procedure is crucial here, since it allows practitioners to elicit metrics using any estimated distribution, perhaps from a simple model class, and then use the elicited fair metric for optimizing complex models or for evaluating models at test time for the target fairness task.
At the same time, this work may have drawbacks, because it leaves open the key question of who the stakeholders to be queried should be. This work also assumes a parametric form for the oracle metric, which may not be an exact match to practice. Furthermore, we should be cautious of failures of the system, which could cause disparate impact among sensitive groups when the elicited metric is incorrect, e.g., when applied to settings where the stated assumptions are not met.

Acknowledgments and Disclosure of Funding

We thank the anonymous reviewers for providing helpful and constructive feedback on the paper. Funding in support of this work: None.

References

[1] N. Abe, B. Zadrozny, and J. Langford. An iterative method for multi-class cost-sensitive learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 3-11. ACM, 2004.
[2] A. Agarwal, A. Beygelzimer, M. Dudik, J. Langford, and H. Wallach. A reductions approach to fair classification. In International Conference on Machine Learning, pages 60-69, 2018.
[3] J. Angwin, J. Larson, S. Mattu, and L. Kirchner. Machine bias: Risk assessments in criminal sentencing. ProPublica, May 23, 2016.
[4] S. Barocas, M. Hardt, and A. Narayanan. Fairness in machine learning. NIPS Tutorial, 2017.
[5] S. Barocas and A. D. Selbst. Big data's disparate impact. Calif. L. Rev., 104:671, 2016.
[6] E. Beauxis-Aussalet and L. Hardman. Visualization of confusion matrix for non-expert users. In IEEE Conference on Visual Analytics Science and Technology (VAST) - Poster Proceedings, 2014.
[7] Y. Bechavod and K. Ligett. Learning fair classifiers: A regularization-inspired approach. In 4th Workshop on Fairness, Accountability, and Transparency in Machine Learning (FATML), 2017.
[8] R. Berk, H. Heidari, S. Jabbari, M. Kearns, and A. Roth. Fairness in criminal justice risk assessments: The state of the art. Sociological Methods & Research, page 0049124118782533, 2018.
[9] R. Binns. On the apparent conflict between individual and group fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 514-524, 2020.
[10] A. Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2):153-163, 2017.
[11] S. Corbett-Davies, E. Pierson, A. Feller, S. Goel, and A. Huq. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 797-806, 2017.
[12] F. Doshi-Velez and B. Kim. Towards a rigorous science of interpretable machine learning. arXiv e-prints:1702.08608, 2017.
[13] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214-226, 2012.
[14] S. A. Friedler, C. Scheidegger, S. Venkatasubramanian, S. Choudhary, E. P. Hamilton, and D. Roth. A comparative study of fairness-enhancing interventions in machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 329-338, 2019.
[15] S. Gillen, C. Jung, M. Kearns, and A. Roth. Online learning with an unknown fairness metric. In Advances in Neural Information Processing Systems, pages 2600-2609, 2018.
[16] G. Goh, A. Cotter, M. Gupta, and M. P. Friedlander. Satisfying real-world goals with dataset constraints. In Advances in Neural Information Processing Systems, pages 2415-2423, 2016.
[17] M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pages 3315-3323, 2016.
[18] T. Hashimoto, M. Srivastava, H. Namkoong, and P. Liang. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, pages 1929-1938, 2018.
[19] R. Herbrich. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers, pages 115-132. The MIT Press, 2000.
[20] G. Hiranandani, S. Boodaghians, R. Mehta, and O. Koyejo. Performance metric elicitation from pairwise classifier comparisons. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 371-379, 2019.
[21] G. Hiranandani, S. Boodaghians, R. Mehta, and O. O. Koyejo. Multiclass performance metric elicitation. In Advances in Neural Information Processing Systems, pages 9351-9360, 2019.
[22] C. Ilvento. Metric learning for individual fairness. arXiv preprint arXiv:1906.00250, 2019.
[23] T. Joachims. SVMlight: Support vector machine. http://svmlight.joachims.org/, University of Dortmund, 19(4), 1999.
[24] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133-142. ACM, 2002.
[25] T. Kamishima, S. Akaho, H. Asoh, and J. Sakuma. Fairness-aware classifier with prejudice remover regularizer. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 35-50. Springer, 2012.
[26] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3146-3154, 2017.
[27] M. Kearns, S. Neel, A. Roth, and Z. S. Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In International Conference on Machine Learning, pages 2564-2572, 2018.
[28] D. G. Kleinbaum, K. Dietz, M. Gail, M. Klein, and M. Klein. Logistic Regression. Springer, 2002.
[29] J. Kleinberg, S. Mullainathan, and M. Raghavan. Inherent trade-offs in the fair determination of risk scores. In 8th Innovations in Theoretical Computer Science Conference (ITCS 2017). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.
[30] O. O. Koyejo, N. Natarajan, P. K. Ravikumar, and I. S. Dhillon. Consistent multilabel classification. In NIPS, pages 3321-3329, 2015.
[31] P. Lahoti, K. P. Gummadi, and G. Weikum. iFair: Learning individually fair data representations for algorithmic decision making. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 1334-1345. IEEE, 2019.
[32] A. K. Menon and R. C. Williamson. The cost of fairness in binary classification. In Conference on Fairness, Accountability and Transparency, pages 107-118, 2018.
[33] D. Mukherjee, M. Yurochkin, M. Banerjee, and Y. Sun. Two simple ways to learn individual fairness metrics from data. In ICML, 2020.
[34] H. Narasimhan. Learning with complex loss functions and constraints. In International Conference on Artificial Intelligence and Statistics, pages 1646-1654, 2018.
[35] H. Narasimhan, A. Cotter, and M. Gupta. Optimizing generalized rate metrics with three players. In Advances in Neural Information Processing Systems, pages 10746-10757, 2019.
[36] H. Narasimhan, H. Ramaswamy, A. Saha, and S. Agarwal. Consistent multiclass algorithms for complex performance measures. In ICML, pages 2398-2407, 2015.
[37] A. Noriega-Campero, M. A. Bakker, B. Garcia-Bulle, and A. Pentland. Active fairness in algorithmic decision making. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 77-83, 2019.
[38] S. Opotow. Affirmative action, fairness, and the scope of justice. Journal of Social Issues, 52(4):19-24, 1996.
[39] S. K. Pal and S. Mitra. Multilayer perceptron, fuzzy sets, classification. 1992.
[40] M. Peyrard, T. Botschen, and I. Gurevych. Learning to score system summaries for better content selection evaluation. In Proceedings of the Workshop on New Frontiers in Summarization, pages 74-84, 2017.
[41] B. Qian, X. Wang, F. Wang, H. Li, J. Ye, and I. Davidson. Active learning from relative queries. In IJCAI, pages 1614-1620, 2013.
[42] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In ACM SIGKDD, pages 1135-1144. ACM, 2016.
[43] G. S. Shieh. A weighted Kendall's tau statistic. Statistics & Probability Letters, 39(1):17-24, 1998.
[44] A. Singla, E. Horvitz, P. Kohli, and A. Krause. Learning to hire teams. In Third AAAI Conference on Human Computation and Crowdsourcing, 2015.
[45] G. Tamburrelli and A. Margara. Towards automated A/B testing. In International Symposium on Search Based Software Engineering, pages 184-198. Springer, 2014.
[46] S. K. Tavker, H. G. Ramaswamy, and H. Narasimhan. Consistent plug-in classifiers for complex objectives and constraints. In Advances in Neural Information Processing Systems, 2020, to appear.
[47] H. Valizadegan, R. Jin, R. Zhang, and J. Mao. Learning to rank by optimizing NDCG measure. In Advances in Neural Information Processing Systems, pages 1883-1891, 2009.
[48] S. Wang, W. Guo, H. Narasimhan, A. Cotter, M. Gupta, and M. I. Jordan. Robust optimization for fairness with noisy protected groups. 2020, to appear.
[49] B. Woodworth, S. Gunasekar, M. I. Ohannessian, and N. Srebro. Learning non-discriminatory predictors. In Conference on Learning Theory, pages 1920-1953, 2017.
[50] S. Yang and D. Q. Naiman. Multiclass cancer classification based on gene expression comparison. Statistical Applications in Genetics and Molecular Biology, 13(4):477-496, 2014.
[51] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, pages 1171-1180, 2017.
[52] M. B. Zafar, I. Valera, M. G. Rodriguez, and K. P. Gummadi. Fairness constraints: Mechanisms for fair classification. In Artificial Intelligence and Statistics, pages 962-970, 2017.
[53] Y. Zhang, R. Bellamy, and K. Varshney. Joint optimization of AI fairness and utility: A human-centered approach. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pages 400-406, 2020.