# multiclass_performance_metric_elicitation__376060a8.pdf Multiclass Performance Metric Elicitation Gaurush Hiranandani Department of Computer Science University of Illinois at Urbana-Champaign gaurush2@illinois.edu Shant Boodaghians Department of Computer Science University of Illinois at Urbana-Champaign boodagh2@illinois.edu Ruta Mehta Department of Computer Science University of Illinois at Urbana-Champaign rutameht@illinois.edu Oluwasanmi Koyejo Department of Computer Science University of Illinois at Urbana-Champaign sanmi@illinois.edu Metric Elicitation is a principled framework for selecting the performance metric that best reflects implicit user preferences. However, available strategies have so far been limited to binary classification. In this paper, we propose novel strategies for eliciting multiclass classification performance metrics using only relative preference feedback. We also show that the strategies are robust to both finite sample and feedback noise. 1 Introduction Consider a machine learning model for cancer diagnosis and treatment support where the doctor applies a cost-sensitive predictive model to classify patients into cancer categories [23, 24]. It is clear that the chosen costs directly determine the model decisions, and thus dictate the patient outcomes. This raises an obvious question, how should the cost-tradeoffs be chosen so that it reflects the expert s decision-making? As it turns out, going from expert intuition to precise quantitative cost trade-offs is often difficult. Needless to say, this is not only true for medical applications as there are a plethora of domains where the question of what to measure poses a serious ongoing challenge [3]. To address this issue, Hiranandani et al. [7] recently formalized the problem of Metric Elicitation (ME), which aims to determine the user s performance metric based on preference feedback. The motivation behind ME is that employing the performance metrics which reflect innate user tradeoffs will allow one to learn models that best capture user preferences. As humans are often inaccurate in providing absolute quality feedback [17], Hiranandani et al. [7] propose to use pairwise comparison queries, where the user (oracle) is asked to compare two classifiers and provide an indicator of relative preference. They show that in various settings, the user s innate metric can be elicited based on this preference feedback. Figure 1 (reproduced from Hiranandani et al. [7]) illustrates this framework. Conceptually, ME is applicable to any learning setting. However, Hiranandani et al. [7] only proposed methods for eliciting binary classification performance metrics. This manuscript extends prior work by proposing ME strategies for the more complicated multiclass classification setting thus significantly increasing the use cases for ME. Similar to the binary case, we also consider the most common families of performance metrics which are functions of the confusion matrix [15]; however, in our case, the elements of the confusion matrix summarize multiclass error statistics. In order to perform efficient multilcass performance metric elicitation, we study novel geometric properties of the space of multiclass confusion matrices. Our analysis reveals that due to structural differences between the space of binary and multiclass confusions, we can not trivially extend the elicitation procedure used for binary to the multiclass case. Instead, we provide novel strategies for 33rd Conference on Neural Information Processing Systems (Neur IPS 2019), Vancouver, Canada. Figure 1: Metric Elicitation framework [7]. Table 1: The Bayes Optimal (BO) and Restricted-Bayes Optimal (RBO). Name Definition BO confusion c over a subset S C argmaxc S C φ(c) RBO classifier hk1,k2 argmax h Hk1,k2 ψ(d(h)) RBO diagonal confusion dk1,k2 argmax d Dk1,k2 ψ(d) eliciting linear functions of the multiclass confusion matrix and extend elicitation to more complicated yet popular functional forms such as linear-fractional functions of the confusion matrix elements [14]. Specifically, the elicitation procedures involve binary-search type algorithms that are robust to both finite sample and oracle feedback noise. In addition, the proposed methods can be applied either by querying pairwise classifier preferences or pairwise confusion matrix preferences. We find that this equivalence is crucial for practical applications. In summary, our main contributions are novel query efficient metric elicitation algorithms for multiclass classification. We study ME for linear functions of the confusion matrix and then briefly discuss extensions to more complicated functional forms such as the linear-fractional and arbitrary monotonic functions of the confusion matrix (with details in the appendix). Lastly, we show that the proposed procedures are robust to finite sample and feedback noise, thus are useful in practice. Notation. Matrices and vectors are denoted by bold upper case and bold lower case letters, respectively. Let R and Z+ denote the set of reals and positive integers, respectively. For k Z+, we denote the index set {1, 2, , k} by [k]. k denotes the (k 1) dimensional simplex. 1, 2, and denote the ℓ1-norm, ℓ2-norm, and ℓ -norm, respectively. We denote the inner product of two vectors by , . Given a matrix A, off-diag(A) returns a vector of off-diagonal elements of A in row-major form, and diag(A) returns a vector of diagonal elements of A. 2 Preliminaries The standard multiclass classification setting comprises k classes with X X and Y [k] representing the input and output random variables, respectively. We have access to a dataset of size n denoted by {(x, y)i}n i=1, generated iid from a distribution P(X, Y ). Let ηi(x) = P(Y = i|X = x) and ζi = P(Y = i) for i [k] be the conditional and the unconditional probability of the k classes, respectively. Let H = {h : X k} be the set of all classifiers. A confusion matrix for a classifier h is denoted by C(h, P) Rk k, where its elements are given by: Cij(h, P) = P(Y = i, h = j) for i, j [k]. (1) Under the population law P, it is useful to keep the following decomposition in mind: P(Y = i, h = i) = ζi P(Y = i, h = i) = Cii(h, P) = ζi j=1,j =i Cij(h, P). (2) Using this decomposition, any confusion matrix is uniquely represented by its q := (k2 k) offdiagonal elements. Hence, we will represent a confusion matrix C(h, P) by a vector c(h, P) = off-diag(C(h, P)), and interchangeably refer the confusion matrix as a vector of off-diagonal confusions . The space of off-diagonal confusions is denoted by C = {c(h, P) = off-diag(C(h, P)) : h H}. For clarity, we will suppress the dependence on P and h if it is clear from the context. Performance of a classifier is often determined by just the misclassification and not the type of misclassification, especially when the number of classes is large. Therefore, we will also consider metrics that only depend on correct and incorrect predictions, namely P(Y = i, h = i) and P(Y = i, h = i). Following the decomposition in (2), such metrics require only the diagonal elements of the original confusion matrices. Given a confusion matrix C, we will denote its diagonal by d = diag(C) and refer it as the vector of diagonal confusions . The space of diagonal confusions is represented by D = {d = diag(C(h)) : h H}. Let φ : [0, 1]q R and ψ : [0, 1]k R be the performance metrics for a classifier h determined by its corresponding off-diagonal and diagonal confusion entries c(h) and d(h), respectively. Without loss of generality (wlog), we assume the metrics φ and ψ are utilities so that larger values are preferred. Furthermore, the metrics are scale invariant as global scale does not affect the learning problem [15]. For this manuscript, we assume the following regularity assumption on the data distribution. Assumption 1. We assume that the functions gij(r) = P h ηi(X) ηj(X) r i i, j [k] are continuous and strictly decreasing for r [0, ). Intuitively, this weak assumption ensures that when the cost or reward tradeoffs for the classes change, the preferred confusion matrices for those cost or reward tradeoffs also change (and vice-versa). 2.1 Bayes Optimal and Restricted Bayes Optimal Confusions and Classifiers As illustrated in Table 1, the Bayes Optimal (BO) confusion c represents the optimal value of the off-diagonal confusions according to the metric φ over a subset S C. This is analogously defined for ψ and D. The Restricted Bayes Optimal (RBO) entities are of interest for diagonal metrics ψ, and indicate the case where classifiers are restricted to predict only classes k1, k2 [k]. Thus Hk1,k2 and Dk1,k2 denote the space of classifiers which exclusively predict either k1 or k2 and the associated space of diagonal confusions, respectively. Note that for such restricted classifiers h, Cii(h) = di(h) evaluates to zero at every index i = k1, k2. 2.2 Performance Metrics We first discuss elicitation for the following two major types of metrics used in classification. Definition 1. Diagonal Linear Performance Metric (DLPM): We denote this family by ϕDLP M. Given a Rk such that a 1 = 1 ( wlog., due to scale invariance), the metric is defined as: ψ(d) := a, d . This is also called weighted accuracy [15] and focuses on correct classification. Definition 2. Linear Performance Metric (LPM): We denote this family by ϕLP M. Given a Rq such that a 2 = 1 (wlog., due to scale invariance), the metric is defined as: φ(c) := a, c . Cost-sensitive linear metrics belong to ϕLP M [1] and focus on the types of misclassifications. The difference of norms in the definitions is only for simplicity of exposition and chosen to best complement the underlying metric elicitation algorithm and vice-versa. Moreover, notice that the elements of diagonal confusions (d s) and off-diagonal confusions (c s) reflect correct and incorrect classification, respectively. Thus, according to standard practice, wlog., we focus on eliciting monotonically increasing DLPMs and monotonically decreasing LPMs in their respective arguments. 2.3 Metric Elicitation; Problem Setup This section describes the problem of Metric Elicitation and the associated oracle query. Our definitions follow from Hiranandani et al. [7], extended so the confusion elements and the performance metrics correspond to the multiclass classification setting. The following definitions hold analogously for the diagonal case by replacing φ, c and C by ψ, d, and D, respectively. Definition 3 (Oracle Query). Given two classifiers h, h (equivalent to off-diagonal confusions c, c respectively), a query to the Oracle (with metric φ) is represented by: Γ(h, h ) = Ω(c, c ) = 1[φ(c) > φ(c )] =: 1[c c ], (3) where Γ : H H {0, 1} and Ω: C C {0, 1}. The query asks whether h is preferred to h (equivalent to c is preferred to c ), as measured by φ. We elicit metrics which are functions of the confusion matrix, thus comparison queries using classifiers are indistinguishable from comparison queries using confusions. Henceforth, for simplicity of notation, we denote any query as confusions based query. Next, we formally state the ME problem. Definition 4 (Metric Elicitation with Pairwise Queries (given {(x, y)i}n i=1)). Suppose that the oracle s (unknown) performance metric is φ. Using oracle queries of the form Ω(ˆc, ˆc ), where ˆc, ˆc are the estimated off-diagonal confusions from samples, recover a metric ˆφ such that φ ˆφ < κ under a suitable norm for sufficiently small error tolerance κ > 0. v1 = (ζ1, 0, 0) v2 = (0, ζ2, 0) v3 = (0, 0, ζ3) dk1 (ζk1, 0) Sλ λ c f (c) Figure 2: (a) Geometry of the space of diagonal confusions D for k = 3: a strictly convex space. Notice that each of the three axis-aligned faces are equivalent in geometry to the following figure in (b); (b) Geometry of diagonal confusions when restricted to classifiers predicting only classes k1 and k2 i.e. Dk1,k2; (c) A sphere Sλ centered at o with radius λ, contained in the convex space of off-diagonal confusions C. f (c) denotes the distance of c from the hyperplane ℓ tangent at c . The performance of ME is evaluated both by the fidelity of the recovered metric and the query complexity. Given the formal definitions, we can now proceed. As is standard in the decision theory literature [13, 7], we present our ME solution by first assuming access to population quantities such as the population confusions c(h, P), then examine practical implementation by considering the estimation error from finite samples e.g. with empirical confusions ˆc(h, {(x, y)i}n i=1). 3 Geometry and Parametrizations of the Query Spaces For any query based approach, it is important to understand the structure of the query space. Thus, we first study the properties of the query spaces and then develop parametrizations required for efficient elicitation. Readers may find these properties independently useful in other applications as well. 3.1 Geometry of the space of diagonal confusions D and parametrization of its boundary Let vi Rk for i [k] be the vectors with ζi at the i-th index and zero everywhere else. Notice that vi s are the diagonal confusions of the trivial classifiers predicting only class i on the entire space X. Proposition 1 (Geometry of D Figure 2 (a)). Under Assumption 1, the space of diagonal confusions D is strictly convex, closed, and contained in the box [0, ζ1] [0, ζk]. The diagonal confusions vi i [k] are the only vertices of D. Moreover, for any k1, k2 [k], the 2-dimensional (k1, k2) axesaligned face of D is Dk1,k2 (Figure 2 (b)), which is equivalent to the space of binary classification confusion matrices confined to classes k1, k2. In particular, Dk1,k2 is strictly convex. Proposition 1 characterizes the geometry of the space of diagonal confusions D. Figure 2(a) illustrates this geometry when k = 3. Interestingly, the 2-dimensional axes-aligned faces of D (Figure 2 (b)) have exactly the same geometry as the space of binary classification confusion matrices (compare this with Figure 2(a) of Hiranandani et al. [7]), where recall that a binary classification confusion matrix is uniquely determined by its two diagonal elements due to (2). We will exploit the set Dk1,k2 (more specifically, its boundary) for the elicitation task. Now notice that for ψ ϕDLP M, the RBO classifier restricted to predict classes k1, k2, predicts the label (out of the two possible choices) that maximizes the expected utility conditioned on the instance. This is discussed below. Proposition 2. Let ψ ϕDLP M be parametrized by a such that a 1 = 1, and let k1, k2 [k], then hk1,k2(x) = k1, if ak1ηk1(x) ak2ηk2(x) k2, o.w. is the Restricted Bayes Optimal classifier (restricted to classes k1, k2) with respect to ψ. For a metric ψ ϕDLP M, Proposition 2 provides RBO classifiers in Hk1,k2, which further gives us RBO diagonal confusions dk1,k2 using (1). We know that this dk1,k2 is unique, since any linear metric over a strictly convex domain (Dk1,k2) is maximized at a unique point on the boundary [2]. So, given a DLPM, we have access to a unique point in the query space. This allows us to define and then parametrize a subset of the query space, specifically, the upper boundary of Dk1,k2 through DLPMs. Definition 5. The upper boundary of Dk1,k2, denoted by D+ k1,k2, constitutes the RBO diagonal confusions confined to classes k1, k2 [k] for monotonically increasing DLPMs (ai 0 i [k]) such that at least one out of ak1 or ak2 is non-zero (i.e. ak1 + ak2 > 0). Parameterizing the upper boundary D+ k1,k2. Let m [0, 1]. Construct a DLPM by setting ak1 = m, ak2 = 1 m, and ai = 0 for i = k1, k2. By using Proposition 2 and (1), obtain its RBO diagonal confusions, which by definition lies on the upper boundary. Thus, varying m in this process, parametrizes the upper boundary D+ k1,k2. We denote this parametrization by ν(m; k1, k2), where ν : ([0, 1]; k1, k2) D+ k1,k2. 3.2 Geometry of the space C and parametrization of the enclosed sphere Recall that, unlike the diagonal case, we focus on eliciting LPMs monotonically decreasing in the elements of the off-diagonal confusions (Section 2.2). To this end, let ui C for i [k] be the off-diagonal confusions achieved by trivial classifiers predicting only class i on the entire space X. Proposition 3 (Geometry of C Figure 2 (c)). The space of off-diagonal confusions C is convex and contained in the box [0, ζ1](k 1) [0, ζk](k 1). {ui}k i=1 belong to the set of vertices of C. C always contains the point o = 1 k Pk i=1 ui which corresponds to the off-diagonal confusions of the trivial classifier that randomly predicts each class with equal probability on the entire space X. We find that the space of off-diagonal confusions C has quite different geometry than the diagonal case. For instance, C is not strictly convex. Nevertheless, since C is convex and always contains the point o, we may make the following assumption. Please see Figure 2(c) for an illustration. Assumption 2. There exists a q-dimensional sphere Sλ C of radius λ > 0 centered at o. Such a sphere always exists as long as the class-conditional distributions are not completely overlapping i.e. there is some signal for non-trivial classification. A method to obtain Sλ is discussed in Section 5. Now recall that the optimum for a linear function optimized over a sphere is given by the slope of the function scaled by the radius of the sphere. This is formalized as a trivial lemma below. Lemma 1. Let φ ϕLP M be parametrized by a such that a 2 = 1, then the unique optimal off-diagonal confusion c over the sphere Sλ is a point on the boundary of Sλ given by c = λa + o. Given an LPM, Lemma 1 provides a unique point in the query space Sλ C. This gives us an opportunity to characterize and then parametrize a subset of the query space through LPMs. Since we focus on eliciting monotonically decreasing LPMs, we parametrize the lower boundary of Sλ. Definition 6. The lower boundary of Sλ, denoted by S λ , constitutes the set of optimal off-diagonal confusions over the sphere Sλ for LPMs with ai 0 i [q] (monotonically decreasing condition). Parameterizing the lower boundary of the enclosed sphere S λ . We follow the standard method for parametrizing points on the surface of a sphere via angles. Let θ be a (q 1)-dimensional vector of angles, where all the angles except the primary angle are in second quadrant, i.e. {θi [π/2, π]}q 2 i=1 , and the primary angle is in the third quadrant, i.e. θ(q 1) [π, 3π/2]. Construct an LPM ( a 2 = 1) by setting ai = Πi 1 j=1 sin θj cos θi for i [q 1] and aq = Πq 1 j=1 sin θj. The choice of the quadrants ensures the monontonically decreasing condition i.e. {ai 0}q i=1. By using Lemma 1, obtain its BO off-diagonal confusions over the sphere Sλ, which clearly lies on the lower boundary. Thus, varying θ in this procedure, parametrizes the lower boundary S λ . We denote this parametrization by µ(θ), where µ : [π/2, π]q 2 [π, 3π/2] S λ . 4 Metric Elicitation Using the outlined parametrizations {ν, µ}, we propose efficient binary-search type algorithms to elicit oracle s implicit performance metric. We will first discuss elicitation procedures with no feedback noise from the oracle. We will later show robustness to noisy feedback in Section 5. 4.1 DLPM Elicitation The following lemma concerning a broader family of metrics is the route to our elicitation procedures. Since both linear and linear-fractional functions are quasiconcave, the lemma applies to both. Algorithm 1: DLPM Elicitation Input: ϵ > 0, oracle Ω, ˆa1 = 1 For i = 2, , k do Initialize: ma = 0, mb = 1. While mb ma > ϵ do Set mc = 3ma+mb 4 , md = ma+mb me = ma+3mb Set d a 1,i = ν(ma; 1, i) (i.e. parametrization of D+ 1,i in Section 3.1). Similarly, set d c 1,i, d d 1,i, d e 1,i, d b 1,i. Query Ω(d c 1,i, d a 1,i), Ω(d d 1,i, d c 1,i), Ω(d e 1,i, d d 1,i), and Ω(d b 1,i, d e 1,i). [ma, mb] Shrink Interval-1 (responses). Set md = ma+mb 2 . Then set ˆai = 1 md Output: ˆa = ˆa1 ˆa 1 , , ˆak ˆa 1 Algorithm 2: LPM Elicitation Input: ϵ > 0, oracle Ω, λ, and θ = θ(1) For t = 1, 2, , T do Set θa = θc = θd = θe = θb = θ(t). if (t%(q 1)) Set j = t%(q 1); else j = q 1. if (j == q 1) Initialize: θa j = π, θb j = 3π/2. else Initialize: θa j = π/2, θb j = π. While θb j θa j > ϵ do Set θc j = 3θa j +θb j 4 , θd j = θa j +θb j 2 , and θe j = θa j +3θb j 4 . Set ca = µ(θa) (i.e. parametrization of S λ in Section 3.2). Similarly, set cc, cd, ce, cb. Query Ω(cc, ca), Ω(cd, cc), Ω(ce, cd), Ω(cb, ce) [θa j , θb j] Shrink Interval-2 (responses). Set θd j = 1 2(θa j + θb j) and then set θ(t) = θd. Output: ˆai = Πi 1 j=1 sin θ(T ) j cos θi(T ) i [q 1], ˆaq = Πq 1 j=1 sin θ(T ) j . Lemma 2. Let ψ : D R be a quasiconcave metric which is monotone increasing in all {di}k i=1. For k1, k2 [k], let ρ+ : [0, 1] D+ k1,k2 be a continuous, bijective, parametrization of the upper boundary. Then the composition ψ ρ+ : [0, 1] R is quasiconcave and thus unimodal on [0, 1]. Remark 1. Under Assumption 1, every supporting hyperplane of Dk1,k2 supports a unique point on the boundary D+ k1,k2 and vice-versa (Proposition 1); therefore, the composition ψ ρ+ has no flat regions. In other words, the function ψ ρ+ is concave. The proof of Lemma 2 first shows that any quasiconcave metric ψ defined on the space D is also quasiconcave on the restricted space Dk1,k2, and then shows the quasiconcavity and thus the unimodality (due to the one-dimensional parametrization of D+ k1,k2) of ψ on a further restricted space D+ k1,k2. Furthermore, Remark 1 reveals that the function ψ ρ+ is concave, allowing us to devise the following binary-search type method for elicitation. Suppose that the oracle s metric is ψ ϕDLP M parametrized by a where a 1 = 1, {a i }k i=1 0 (Section 2.2). Using the parametrization ν, Algorithm 1 returns an estimate ˆa of a . It takes two classes at a time, class 1 and class i. Since the metric is unimodal on D+ 1,i (Lemma 2), the algorithm applies binary-search in the inner while-loop to estimate the ratio a i /a 1. The Shrink Interval-1 subroutine shrinks the interval [ma, mb] into half based on the oracle responses in the usual binarysearch way for searching the optimum (Figure 4, Appendix A). The algorithm repeats this (k 1) times to estimate the ratios {a 2/a 1, . . . , a k/a 1}. Finally, it outputs a normalized metric estimate ˆa. 4.2 LPM Elicitation We now discuss LPM elicitation, where the metrics are assumed to be monotonically decreasing in the off-diagonal confusions. Unfortunately, C may have flat regions due to lack of strict convexity, so the algorithm for the diagonal case does not apply. Instead, we consider a query space given by the sphere Sλ C and propose a coordinate-wise binary-search style algorithm, which is an outcome of our novel geometric characterization and the approach in Derivative-Free Optimization (DFO) [9]. Suppose that the oracle s metric is φ ϕLP M parametrized by a where a 2 = 1, {a i }q i=1 0 (Section 2.2). Using the parametrization µ(θ) of S λ (Section 3.2), Algorithm 2 returns an estimate ˆa of a . In each iteration, the algorithm updates one angle θj keeping other angles fixed by a binary-search procedure, where again the Shrink Interval-2 subroutine shrinks the interval [θa j , θb j] by half based on the oracle responses (Figure 5, Appendix A). Then the algorithm cyclically updates each angle until it converges to a metric sufficiently close to the true metric. The convergence is assured because, intuitively, the algorithm via a dual interpretation minimizes a smooth, strongly convex function f (c) measuring the distance of the boundary points from a hyperplane ℓ , whose slope is given by a and is tangent at the BO confusion c (see Figure 2(c)). Table 2: DLPM elicitation at ϵ = 0.01 for synthetic data. #Q denotes the number of queries. Classes k = 3 Classes k = 4 ψ = a ˆψ = ˆa #Q ψ = a ˆψ = ˆa #Q (0.21, 0.59, 0.20) (0.21, 0.60, 0.20) 56 (0.22, 0.13, 0.14, 0.52) (0.22, 0.13, 0.14, 0.52) 84 (0.23, 0.15, 0.62) (0.23, 0.15, 0.62) 56 (0.58, 0.17, 0.08, 0.18) (0.58, 0.17, 0.08, 0.18) 84 5 Guarantees We discuss robustness under the following feedback model, which is useful in practical scenarios. Definition 7 (Oracle Feedback Noise: ϵΩ 0). The oracle responses correctly as long as |φ(c) φ(c )| > ϵΩ(analogously |ψ(d) ψ(d )| > ϵΩ). Otherwise, it may provide incorrect answers. In other words, the oracle may respond incorrectly if the confusions are too close as measured by the metric φ (analogously ψ). Next, we discuss elicitation guarantees for DLPM and LPM elicitation. Theorem 1. Given ϵ, ϵΩ 0, and a 1-Lipschitz DLPM ψ parametrized by a . Then the output ˆa of Algorithm 1 after O((k 1) log 1 ϵ ) queries to the oracle satisfies a ˆa O(ϵ + ϵΩ), which is equivalent to a ˆa 2 O( k(ϵ + ϵΩ)) using standard norm bounds. The following theorem guarantees LPM elicitation when the sphere radius dominates the oracle noise. Theorem 2. Given ϵ, ϵΩ 0, and a 1-Lipschitz LPM φ parametrized by a . Suppose λ ϵΩ, then the output ˆa of Algorithm 2 after O z1 log(z2/(qϵ2))(q 1) log π 2ϵ queries satisfies a ˆa 2 O( q(ϵ + p ϵΩ/λ)), where z1, z2 are constants independent of ϵ and q. We see that the algorithms are robust to noise, and their query complexity depends linearly in the unknown entities. The term z1 log(z2/(qϵ2)) may attribute to the number of cycles in Algorithm 2, but due to the curvature of the sphere, we observe that it is not a dominating factor in the query complexity. For instance, we find that when ϵ = 10 2, two cycles (i.e. T = 2(q 1) in Algorithm 2) are sufficient for achieving elicitation up to the error tolerance qϵ. One remaining question for LPM elicitation is to select a sufficiently large value of λ. Algorithm 3 (Appendix D) provides an offline procedure to compute a λ r/k, where r is the radius of the largest ball contained in the set C. ME with Finite Samples: As a final step, we consider the following questions when working with finite samples: (a) do we get the correct feedback from querying Ω(ˆc, ˆc ) instead of querying Ω(c, c )? (b) what is the effect of ˆηi s when used in place of true ηi s? The answers are straightforward. Since the sample estimates of confusion matrices are consistent estimators and the metrics discussed are 1-Lipschitz with respect to the confusion matrices, with high probability, we gather correct oracle feedback as long as we have sufficient samples. Furthermore, subject to regularity assumptions, Lemma 3 of Hiranandani et al. [7] shows that the errors due to using ˆη affect the (binary) confusion matrices on the boundary in a controlled manner. Since Algorithm 1 uses pairwise RBO (binary) classifiers, it inherits the error guarantees in the multiclass case. Due to limited space, we do not repeat the details here. On the other hand, since Algorithm 2 does not use the boundary, its results are agnostic to finite sample error as long as the sphere is contained within the feasible region C. 6 Experiments In this section, we empirically validate the results of theorems 1 and 2 and investigate sensitivity due to finite sample estimates.1 For the ease of judgments, we show results for k = 3 and k = 4 classes. 6.1 Synthetic Data Experiments We assume a joint distribution for X = [ 1, 1] and Y = [k]. This is given by the marginal distribution f X = U[ 1, 1] and ηi(x) = 1 1+epix for i [k], where U[ 1, 1] is the uniform distribution on [ 1, 1] and {pi}k i=1 are the parameters controlling the degree of noise in the labels. We fix (p1, p2, p3) = (1, 3, 5) and (p1, p2, p3, p4) = (1, 3, 6, 10) for experiments with three and four classes, respectively. To verify elicitation, we first define a true metric ψ or φ . This specifies the query outputs of Algorithm 1 or Algorithm 2. Then we run the algorithms to check whether or not we recover the same 1A subset of results is shown here. Refer Appendix F for more results. Table 3: LPM elicitation at ϵ = 0.01 for synthetic data. #Q denotes the number of queries. Classes φ = a ˆφ = ˆa #Q 3 (-0.37, -0.89, -0.09, -0.23, -0.04, -0.03) (-0.37, -0.89, -0.09, -0.23, -0.04, -0.03) 320 3 (-0.80, -0.55, -0.18, -0.08, -0.14, -0.05) (-0.80, -0.55, -0.18, -0.08, -0.14, -0.05) 320 4 (-0.90, -0.28 -0.10, -0.31, -0.04, -0.05, (-0.90, -0.28, -0.10, -0.31, -0.04, -0.05, 704 -0.03, -0.04, -0.02, -0.01, -0.01, -0.01) -0.03, -0.04, -0.02, -0.01, -0.01, -0.01) 4 (-0.54, -0.10, -0.62, -0.52, -0.03, -0.07, (-0.55, -0.11, -0.62, -0.51, -0.03, -0.07, 704 -0.11, -0.07, -0.14, -0.03, -0.03, -0.04) -0.11, -0.07, -0.14, -0.03, -0.03, -0.04) 0.04 0.06 0.08 0.1 0.12 0.14 Successful ME Proportion ME on Sens IT (Acoustic) Dataset 50% Data 75% Data 100% Data 0.04 0.06 0.08 0.1 0.12 0.14 Successful ME Proportion ME on Vehicle Dataset 50% Data 75% Data 100% Data Figure 3: DLPM elicitation on real data for ϵ = 0.01. For randomly chosen hundred a , we show the proportion of times our estimates ˆa obtained with 4(k 1) log(1/ϵ) queries satisfy a ˆa ω. metric. Some results are shown in Table 2 and Table 3. Results verify that we elicit the true metrics even for small ϵ = 0.01, and as predicted, this requires only 4(k 1) log(1/ϵ) and 4T log(π/2ϵ) queries for DLPM and LPM elicitation respectively, where is the ceil function and T = 2(q 1). 6.2 Real-World Data Experiments Finite samples may affect the size of the sphere Sλ in LPM elicitation, but we observe that as long as λ is greater than ϵΩLPMs can be elicited (Appendix F.2). Thus, here we emprically validate only DLPM elicitation with finite samples. We consider two real-world datasets: (a) Sens IT (Acoustic) dataset [5] (78823 instances, 3 classes), and (b) Vehicle dataset [21] (846 instances, 4 classes). From each dataset, we create two other datasets containing randomly chosen 50% and 75% of the datapoints. So, we have six datasets in total. For all the datasets, we standardize the features and split the dataset into two parts S1 and S2. On S1, we learn {ˆηi(x)}k i=1 using a regularized softmax regression model. We use S2 for making predictions and computing sample confusions. We randomly selected 100 DLPMs i.e. a s. We then used Algorithm 1 with ϵ = 0.01 to recover the estimates ˆa s. In Figure 3, we show the proportion of times a ˆa ω for different values of ω. We see improved elicitation as we increase the number of datapoints in both the datasets, suggesting that ME improves with larger datasets. In particular, for the full Sens IT (Acoustic) dataset, we elicit all the metrics within ω = 0.12. We also observe that ω [0.04, 0.08] is an overly tight evaluation criterion that can result in failures. This is because the elicitation routine gets stuck at the closest achievable sample confusions, which need not be optimal within the (small) search tolerance ϵ. 7 Discussion Points and Future Work Extensions. The family of human evaluation metrics is believed to be large and now that we have discussed elicitation and guarantees for linear metrics, we can certainly aim for eliciting broader metric families. (a) Linear-fractional metrics e.g. F-measure [15] are common in classification problems because often one measures classification quality using proportions of predictions with respect to different classes. For eliciting linear-fractional metrics, we exploit their quasiconcave and quasiconvex nature. Intuitively, we aim to get a supporting hyperplane ℓ at the maximizer c and a supporting hyperplane ℓ at the minimizer c (see Figure 2(c)), which results in two non-linear systems of equations. Then we find a common solution to both the systems resulting in the true metric in just twice the number of queries required in the linear case. Due to limited space, we defer the details of diagonal and full linear-fractional elicitation to appendices E.1 and E.2, respectively. (b) When the oracle s metric is just monotonically increasing in diagonal confusions without even having a restricted functional form, then Algorithm 1 can return a first order approximation at the BO diagonal confusion. Notice that even this may be of high importance to practitioners. The elicitation details are discussed in Appendix E.3. Practical Convenience. Our procedures can also be applied by posing pairwise classifier comparisons directly. One way is to use A/B testing [22] where the user population acts an oracle. Another way is to use comparisons from a single expert, perhaps combined with interpretable machine learning techniques [19, 4]. We suggest the approach proposed by Narasimhan [14] for estimating the classifier associated with a given confusion matrix. Advantage of Algorithm 1. When there is a reason to restrict the metric search to DLPM e.g. due to prior knowledge, then Algorithm 1 is preferred for its lower query complexity. Future Work. We conjecture that our query complexity bounds are tight; however, we leave this detail for the future. We also plan to extend our procedures for the oracles that are only probably correct. This can be done easily by applying majority voting over repeated queries [11]. 8 Related Work The closest line of work to ours is Hiranandani et al. [7], who proposed the problem of ME but solved it only for a simpler setting of binary classification. As we move to multiclass performance ME, we find that the form of metrics and the complexity of the query space increases. This results in stark differences in the elicitation algorithms. Algorithm 1, which is closest to the binary approach, only works for Restricted Bayes Optimal classifiers, and Algorithm 2 requires a coordinate-wise binarysearch approach. As a result, novel methods are also required to provide query complexity guarantees. The LPM elicitation problem can be posed as a Derivative-Free Optimization [9] to a certain extent, but only after exploiting the geometry as we have. In addition, passively learning linear functions using pairwise comparisons has been studied before [6, 10, 16], but these approaches fail to control sample (i.e. query) complexity and end up utilizing more queries than the active approaches [20, 8, 12]. Papers which actively control the query samples for linear elicitation, e.g. [18], exploit the query space like us in order to achieve lower query complexity. However, unlike us, [18] does not provide theoretical bounds and is also applied to a different query space. 9 Conclusion We study the space of multiclass confusions and propose robust, efficient algorithms to elicit diagonallinear and linear performance metrics using preference feedback. We extend elicitation to other families e.g. linear-fractional metrics, thus covering a wide range of metrics encountered in practice. Acknowledgments Gaurush Hiranandani and Oluwasanmi Koyejo thank Microsoft Azure for providing computing credits. Shant Boodaghians and Ruta Mehta acknowledge the support of NSF via CCF 1750436. [1] N. Abe, B. Zadrozny, and J. Langford. An iterative method for multi-class cost-sensitive learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 3 11. ACM, 2004. [2] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004. [3] P. Dmitriev and X. Wu. Measuring metrics. In CIKM, 2016. [4] F. Doshi-Velez and B. Kim. Towards A Rigorous Science of Interpretable Machine Learning. Ar Xiv e-prints:1702.08608, 2017. [5] M. F. Duarte and Y. H. Hu. Vehicle classification in distributed sensor networks. Journal of Parallel and Distributed Computing, 64(7):826 838, 2004. [6] R. Herbrich. Large margin rank boundaries for ordinal regression. In Advances in large margin classifiers, pages 115 132. The MIT Press, 2000. [7] G. Hiranandani, S. Boodaghians, R. Mehta, and O. Koyejo. Performance metric elicitation from pairwise classifier comparisons. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 371 379, 2019. [8] K. G. Jamieson and R. Nowak. Active ranking using pairwise comparisons. In NIPS, pages 2240 2248, 2011. [9] K. G. Jamieson, R. Nowak, and B. Recht. Query complexity of derivative-free optimization. In Advances in Neural Information Processing Systems, pages 2672 2680, 2012. [10] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 133 142. ACM, 2002. [11] M. Kääriäinen. Active learning in the non-realizable case. In International Conference on Algorithmic Learning Theory, pages 63 77. Springer, 2006. [12] D. M. Kane, S. Lovett, S. Moran, and J. Zhang. Active classification with comparison queries. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 355 366. IEEE, 2017. [13] O. O. Koyejo, N. Natarajan, P. K. Ravikumar, and I. S. Dhillon. Consistent multilabel classification. In NIPS, pages 3321 3329, 2015. [14] H. Narasimhan. Learning with complex loss functions and constraints. In International Conference on Artificial Intelligence and Statistics, pages 1646 1654, 2018. [15] H. Narasimhan, H. Ramaswamy, A. Saha, and S. Agarwal. Consistent multiclass algorithms for complex performance measures. In ICML, pages 2398 2407, 2015. [16] M. Peyrard, T. Botschen, and I. Gurevych. Learning to score system summaries for better content selection evaluation. In Proceedings of the Workshop on New Frontiers in Summarization, pages 74 84, 2017. [17] B. Qian, X. Wang, F. Wang, H. Li, J. Ye, and I. Davidson. Active learning from relative queries. In IJCAI, pages 1614 1620, 2013. [18] L. Qian, J. Gao, and H. Jagadish. Learning user preferences by adaptive pairwise comparison. Proceedings of the VLDB Endowment, 8(11):1322 1333, 2015. [19] M. T. Ribeiro, S. Singh, and C. Guestrin. Why should i trust you?: Explaining the predictions of any classifier. In ACM SIGKDD, pages 1135 1144. ACM, 2016. [20] B. Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009. [21] J. P. Siebert. Vehicle recognition using rule based methods. 1987. [22] G. Tamburrelli and A. Margara. Towards automated A/B testing. In International Symposium on Search Based Software Engineering, pages 184 198. Springer, 2014. [23] S. Yang and D. Q. Naiman. Multiclass cancer classification based on gene expression comparison. Statistical applications in genetics and molecular biology, 13(4):477 496, 2014. [24] Z.-H. Zhou and X.-Y. Liu. On multi-class cost-sensitive learning. Computational Intelligence, 26(3):232 257, 2010.