Optimizing Black-box Metrics with Iterative Example Weighting

Gaurush Hiranandani¹  Jatin Mathur¹  Harikrishna Narasimhan²  Mahdi Milani Fard²  Oluwasanmi Koyejo³

Abstract

We consider learning to optimize a classification metric defined by a black-box function of the confusion matrix. Such black-box learning settings are ubiquitous, for example, when the learner only has query access to the metric of interest, or in noisy-label and domain adaptation applications where the learner must evaluate the metric via performance evaluation on a small validation sample. Our approach is to adaptively learn example weights on the training dataset such that the resulting weighted objective best approximates the metric on the validation sample. We show how to model and estimate the example weights and use them to iteratively post-shift a pre-trained class probability estimator to construct a classifier. We also analyze the resulting procedure's statistical properties. Experiments on various label noise, domain shift, and fair classification setups confirm that our proposal compares favorably to the state-of-the-art baselines for each application.

1. Introduction

In many real-world machine learning tasks, the evaluation metric one seeks to optimize is not explicitly available in closed form. This is true for metrics that are evaluated through live experiments or by querying human users (Tamburrelli & Margara, 2014; Hiranandani et al., 2019a), or that require access to private or legally protected data (Awasthi et al., 2021), and hence cannot be written as an explicit training objective. This is also the case when the learner only has access to data with a skewed training distribution or labels with heteroscedastic noise (Huang et al., 2019; Jiang et al., 2020), and hence cannot directly optimize the metric on the training set despite knowing its mathematical form.
These problems can be framed as black-box learning tasks, where the goal is to optimize an unknown classification metric on large (possibly noisy) training data, given access to evaluations of the metric on a small, clean validation sample (Jiang et al., 2020). Our high-level approach to these learning tasks is to adaptively assign weights to the training examples, so that the resulting weighted training objective closely approximates the black-box metric on the validation sample. We then construct a classifier by using the example weights to post-shift a class-probability estimator pre-trained on the training set. This results in an efficient, iterative approach that does not require any re-training. Indeed, example weighting strategies have been widely used both to optimize metrics and to correct for distribution shift, but prior works either handle specialized forms of metrics or data noise (Sugiyama et al., 2008; Natarajan et al., 2013; Patrini et al., 2017), formulate the example-weight learning task as a difficult non-convex problem that is hard to analyze (Ren et al., 2018; Zhao et al., 2019), or employ an expensive surrogate re-weighting strategy that comes with limited statistical guarantees (Jiang et al., 2020). In contrast, we propose a simple and effective approach to optimize a general black-box metric (that is a function of the confusion matrix) and provide a rigorous statistical analysis.

Part of this work was done when G.H. was an intern at Google Research, USA. ¹University of Illinois at Urbana-Champaign, Illinois, USA; ²Google Research, USA; ³Google Research, Accra. Correspondence to: Gaurush Hiranandani. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).
A key element of our approach is eliciting the weight coefficients by probing the black-box metric at a few select classifiers and solving a system of linear equations matching the weighted training errors to the validation metric. We choose the probing classifiers so that the linear system is well-conditioned, for which we provide both theoretically-grounded options and practically efficient variants. This weight elicitation procedure is then used as a subroutine to iteratively construct the final plug-in classifier.

Contributions: (i) We provide a method for eliciting example weights for linear black-box metrics (Section 3). (ii) We use this procedure to iteratively learn a plug-in classifier for general black-box metrics (Section 4). (iii) We provide theoretical guarantees for metrics that are concave functions of the confusion matrix under distributional assumptions (Section 5). (iv) We experimentally show that our approach is competitive with (or better than) the state-of-the-art methods for tackling label noise in CIFAR-10 (Krizhevsky et al., 2009) and domain shift in Adience (Eidinger et al., 2014), and optimizing with proxy labels and a black-box fairness metric on Adult (Dua & Graff, 2017) (Section 7).

Notations: Δ_m denotes the (m−1)-dimensional simplex. [m] = {1, . . . , m} represents an index set. onehot(j) ∈ {0, 1}^m returns the one-hot encoding of j ∈ [m]. The ℓ2 norm of a vector is denoted by ∥·∥.

2. Problem Setup

We consider a standard multiclass setup with an instance space X ⊆ R^d and a label space Y = [m]. We wish to learn a randomized multiclass classifier h : X → Δ_m that for any input x ∈ X predicts a distribution h(x) ∈ Δ_m over the m classes. We will also consider deterministic classifiers h : X → [m] which map an instance x to one of the m classes.

Evaluation Metrics. Let D denote the underlying data distribution over X × Y.
We will evaluate the performance of a classifier h on D using an evaluation metric E^D[h], with higher values indicating better performance. Our goal is to learn a classifier h that maximizes this evaluation measure:

max_h E^D[h]. (1)

We will focus on metrics E^D that can be written in terms of the classifier's confusion matrix C^D[h] ∈ [0, 1]^{m×m}, where the (i, j)-th entry is the probability that the true label is i and the randomized classifier h predicts j:

C^D_ij[h] = E_{(x,y)∼D}[1(y = i) h_j(x)].

The performance of the classifier can then be evaluated using a (possibly unknown) function ψ : [0, 1]^{m×m} → R_+ of the confusion matrix:

E^D[h] = ψ(C^D[h]). (2)

Several common classification metrics take this form, including typical linear metrics ψ(C) = Σ_{ij} L_ij C_ij for some reward matrix L ∈ R^{m×m}_+, the F-measure ψ(C) = Σ_i 2C_ii / (Σ_j C_ij + Σ_j C_ji) (Lewis, 1995), and the G-mean ψ(C) = (Π_i C_ii / Σ_j C_ij)^{1/m} (Daskalaki et al., 2006).

We consider settings where the learner has query access to the evaluation metric E^D, i.e., can evaluate the metric for any given classifier h but cannot directly write out the metric as an explicit mathematical objective. This happens when the metric is truly a black-box function, i.e., ψ is unknown, or when ψ is known, but we have access to only a noisy version of the distribution D needed to compute the metric.

Noisy Training Distribution. For learning a classifier, we assume access to a large sample S_tr of n_tr examples drawn from a distribution µ, which we will refer to as the training distribution. The training distribution µ may be the same as the true distribution D, or may differ from D in the feature distribution P(x), the conditional label distribution P(y|x), or both. We also assume access to a smaller sample S_val of n_val examples drawn from the true distribution D. We will refer to the sample S_tr as the training sample, and the smaller sample S_val as the validation sample. We seek to solve (1) using both these samples.
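To make the setup concrete, here is a minimal sketch (function names are our own) of estimating the diagonal confusion entries and the G-mean from a finite sample, where the randomized classifier is represented by a matrix of per-example class probabilities:

```python
import numpy as np

def confusion_diag(y, H, m):
    """Empirical diagonal confusion entries C_ii[h] = E[1(y=i) h_i(x)],
    with H an (n, m) matrix of per-example prediction probabilities."""
    n = len(y)
    return np.array([H[y == i, i].sum() / n for i in range(m)])

def g_mean(C_diag, class_priors):
    """G-mean: geometric mean of per-class recalls C_ii / (sum_j C_ij)."""
    return np.prod(C_diag / class_priors) ** (1.0 / len(C_diag))

# Toy check: a deterministic classifier that always predicts the true
# label has every per-class recall equal to 1, hence G-mean 1.
y = np.array([0, 0, 1, 1, 2, 2])
H = np.eye(3)[y]                        # one-hot predictions equal to labels
C = confusion_diag(y, H, m=3)
priors = np.bincount(y, minlength=3) / len(y)
print(round(g_mean(C, priors), 4))      # -> 1.0
```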
The following are some examples of noisy training distributions in the literature:

Example 1 (Independent label noise (ILN) (Natarajan et al., 2013; Patrini et al., 2017)). The distribution µ draws an example (x, y) from D, and randomly flips y to ỹ with probability P(ỹ|y), independent of the instance x.

Example 2 (Cluster-dependent label noise (CDLN) (Wang et al., 2020a)). Suppose each x belongs to one of k disjoint clusters g(x) ∈ [k]. The distribution µ draws (x, y) from D and randomly flips y to ỹ with probability P(ỹ|y, g(x)).

Example 3 (Instance-dependent label noise (IDLN) (Menon et al., 2018)). µ draws (x, y) from D and randomly flips y to ỹ with probability P(ỹ|y, x), which may depend on x.

Example 4 (Domain shift (DS) (Sugiyama et al., 2008)). µ draws x̃ according to a distribution P_µ(x) different from P_D(x), but draws y from the true conditional P_D(y|x̃).

Our approach is to learn example weights on the training sample S_tr, so that the resulting weighted empirical objective (locally, if not globally) approximates an estimate of the metric E^D on the validation sample S_val. For ease of presentation, we will assume that the metrics only depend on the diagonal entries of the confusion matrix, i.e., the C_ii's. In Appendix A, we elaborate how our ideas can be extended to handle metrics that depend on the entire confusion matrix. While our approach uses randomized classifiers, in practice one can replace them with similarly performing deterministic classifiers using, e.g., the techniques of Cotter et al. (2019a). In what follows, we will need the empirical confusion matrix on the validation set Ĉ^val[h], where Ĉ^val_ij[h] = (1/n_val) Σ_{(x,y)∈S_val} 1(y = i) h_j(x).

3. Example Weighting for Linear Metrics

We first describe our example weighting strategy for linear functions of the diagonal entries of the confusion matrix, which are given by:

E^D[h] = Σ_i β_i C^D_ii[h] (3)

for some (unknown) weights β_1, . . . , β_m.
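As a concrete instance of the noise models above, a small simulation sketch of Example 1's independent label noise (the transition matrix and sample sizes here are our own toy choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def flip_labels_iln(y, T, rng):
    """Independent label noise (Example 1): flip each label y to y~ with
    probability T[y, y~], independently of the instance x."""
    m = T.shape[0]
    return np.array([rng.choice(m, p=T[yi]) for yi in y])

# Class 0 flips to class 1 with probability 0.6; class 1 is clean.
T = np.array([[0.4, 0.6],
              [0.0, 1.0]])
y = np.zeros(20000, dtype=int)          # all clean labels are class 0
y_noisy = flip_labels_iln(y, T, rng)
print(abs(y_noisy.mean() - 0.6) < 0.02)  # empirical flip rate ~ 0.6
```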
In the next section, we will discuss how to use this procedure as a subroutine to handle more complex metrics.

3.1. Modeling Example Weights

We define an example weighting function W : X → R^m_+ which associates m correction weights [W_i(x)]^m_{i=1} with each example x so that:

E_{(x,y)∼µ}[Σ_i W_i(x) 1(y = i) h_i(x)] ≈ E^D[h], ∀h. (4)

Indeed, for the noise models in Examples 1–4, there exist weighting functions W for which the above holds with equality. Table 1 shows the form of the weighting function for general linear metrics. Figure 1 gives an overview of our approach.

Table 1: Example weights W : X → R^{m×m}_+ for a linear metric E^D[h] = ⟨L, C^D[h]⟩ under the noise models in Examples 1–4, where W_ij(x) is the weight on entry C_ij. In Sections 3–4, we consider metrics that are functions of the diagonal confusion entries alone (i.e., L and T are diagonal), and handle general metrics in Appendix A.

Model | Noise Transition Matrix | Correction Weights
ILN | T_ij = P(ỹ = j | y = i) | W(x) = L T^{−1}
CDLN | T^[k]_ij = P(ỹ = j | y = i, g(x) = k) | W(x) = L (T^[g(x)])^{−1}
IDLN | T_ij(x) = P(ỹ = j | y = i, x) | W(x) = L (T(x))^{−1}
DS | – | W_ij(x) = P_D(x)/P_µ(x), ∀i, j

Figure 1: Overview of our approach. [Diagram: weights elicited from probing classifiers post-shift a pre-trained plug-in (for linear metrics); Frank-Wolfe updates with elicited gradients handle general metrics.]

Ideally, the weighting function W assigns m independent weights to each example x ∈ X. However, in practice, we estimate E^D using a small validation sample S_val ∼ D. So to avoid having the example weights over-fit to the validation sample, we restrict the flexibility of W and set it to a weighted sum of L basis functions φ_ℓ : X → [0, 1]:

W_i(x) = Σ^L_{ℓ=1} α^ℓ_i φ_ℓ(x), (5)

where α^ℓ_i ∈ R is the coefficient associated with basis function φ_ℓ and diagonal confusion entry (i, i).
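A quick numerical check of the correction-weight idea in Table 1 for the ILN model. Note the multiplication order below follows our own indexing convention for T (rows = clean label), under which the weights come out as T^{−1}L; with the transposed convention this is the L T^{−1} form shown in the table:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 3
T = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.0, 0.3, 0.7]])     # T[i, j] = P(noisy label j | clean label i)
L = np.diag([1.0, 2.0, 0.5])        # diagonal reward matrix

C_clean = rng.dirichlet(np.ones(m * m)).reshape(m, m)   # arbitrary clean confusion
C_noisy = T.T @ C_clean             # confusion observed under independent label noise

W = np.linalg.inv(T) @ L            # correction weights
clean_metric = np.sum(L * C_clean)  # <L, C_clean>: the metric we actually want
corrected = np.sum(W * C_noisy)     # weighted metric computed on noisy confusions
print(np.isclose(clean_metric, corrected))   # -> True
```

The check works for any clean confusion matrix: weighting the noisy confusion entries by W recovers the clean linear metric exactly, which is the identity behind equation (4).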
In practice, the basis functions can be as simple as a partitioning of the instance space into L clusters, i.e.:

φ_ℓ(x) = 1(g(x) = ℓ), (6)

for a clustering function g : X → [L], or may define a more complicated soft clustering using, e.g., radial basis functions (Sugiyama et al., 2008) with centers x_ℓ and width σ:

φ_ℓ(x) = exp(−∥x − x_ℓ∥² / 2σ²). (7)

3.2. φ-transformed Confusions

Expanding the weighting function in (4) gives us:

Σ^L_{ℓ=1} Σ^m_{i=1} α^ℓ_i E_{(x,y)∼µ}[φ_ℓ(x) 1(y = i) h_i(x)] = Σ_{ℓ,i} α^ℓ_i Φ^{µ,ℓ}_i[h],

where Φ^{µ,ℓ}[h] ∈ [0, 1]^m can be seen as a φ-transformed confusion matrix for the training distribution µ. For example, if one had only one basis function φ_1(x) = 1, ∀x, then Φ^{µ,1}_i[h] = E_{(x,y)∼µ}[1(y = i) h_i(x)] gives the standard confusion entries for the training distribution. If the basis functions divide the data into L clusters, as in (6), then Φ^{µ,ℓ}_i[h] = E_{(x,y)∼µ}[1(g(x) = ℓ, y = i) h_i(x)] gives the training confusion entries evaluated on examples from cluster ℓ. We can thus re-write equation (4) as a weighted combination of the Φ-confusion entries:

Σ_{ℓ,i} α^ℓ_i Φ^{µ,ℓ}_i[h] ≈ E^D[h], ∀h. (8)

3.3. Eliciting Weight Coefficients α

We next discuss how to estimate the weighting function coefficients α^ℓ_i from the training sample S_tr and validation sample S_val. Notice that (8) gives a relationship between statistics Φ^{µ,ℓ} computed on the training distribution µ, and the evaluation metric of interest computed on the true distribution D. Moreover, for a fixed classifier h, the left-hand side is linear in the unknown coefficients α = [α^1_1, . . . , α^L_1, . . . , α^1_m, . . . , α^L_m] ∈ R^{Lm}. We therefore probe the metric Ê^val at Lm different classifiers h^{1,1}, . . . , h^{1,m}, . . . , h^{L,1}, . . . , h^{L,m}, which results in a set of Lm linear equations of the form in (8):

Σ_{ℓ,i} α^ℓ_i Φ̂^{tr,ℓ}_i[h^{1,1}] = Ê^val[h^{1,1}],
. . .
Σ_{ℓ,i} α^ℓ_i Φ̂^{tr,ℓ}_i[h^{L,m}] = Ê^val[h^{L,m}], (9)

where Φ̂^{tr,ℓ}_i[h] = (1/n_tr) Σ_{(x,y)∈S_tr} φ_ℓ(x) 1(y = i) h_i(x) is evaluated on the training sample and the metric Ê^val[h] = Σ_i β_i Ĉ^val_ii[h] is evaluated on the validation sample.
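The recovery step behind this linear system can be sketched as follows. Here we synthesize both sides of (9) from a hypothetical ground-truth α, rather than computing them from real probing classifiers, to isolate the solve:

```python
import numpy as np

rng = np.random.default_rng(2)
L_basis, m = 2, 3
d = L_basis * m                     # number of unknown coefficients alpha

# Each row of Sigma holds the phi-transformed training confusions of one
# probing classifier; E holds the validation metric at that classifier.
alpha_true = rng.normal(size=d)
Sigma = rng.uniform(0.0, 1.0, size=(d, d)) + np.eye(d)  # boosted diagonal keeps it well-conditioned
E = Sigma @ alpha_true

alpha_hat = np.linalg.solve(Sigma, E)   # alpha = Sigma^{-1} E, as in Algorithm 1
print(np.allclose(alpha_hat, alpha_true))   # -> True
```

With noisy estimates of Σ and E, the conditioning of Σ governs how much that noise is amplified in α̂, which is why the choice of probing classifiers in the next subsection matters.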
More formally, let Σ̂ ∈ R^{Lm×Lm} and Ê ∈ R^{Lm} denote the left-hand and right-hand side observations in (9), i.e.:

Σ̂_{(ℓ,i),(ℓ′,i′)} = (1/n_tr) Σ_{(x,y)∈S_tr} φ_{ℓ′}(x) 1(y = i′) h^{ℓ,i}_{i′}(x),   Ê_{(ℓ,i)} = Ê^val[h^{ℓ,i}]. (10)

Then the weight coefficients are given by α̂ = Σ̂^{−1} Ê.

3.4. Choosing the Probing Classifiers h^{1,1}, . . . , h^{L,m}

We will have to choose the Lm probing classifiers so that Σ̂ is well-conditioned. One way to do this is to choose the classifiers so that Σ̂ has high values on the diagonal entries and low values on the off-diagonals, i.e., choose each classifier h^{ℓ,i} to evaluate to a high value on Φ̂^{tr,ℓ}_i[h] and a low value on Φ̂^{tr,ℓ′}_{i′}[h], ∀(ℓ′, i′) ≠ (ℓ, i). This can be framed as the following constraint satisfaction problem on S_tr:

For h^{ℓ,i}, pick h ∈ H such that: Φ̂^{tr,ℓ}_i[h] ≥ γ, and Φ̂^{tr,ℓ′}_{i′}[h] ≤ ω, ∀(ℓ′, i′) ≠ (ℓ, i), (11)

for some γ > ω > 0 and a sufficiently flexible hypothesis class H for which the constraints are feasible. These problems can generally be solved by formulating a constrained classification problem (Cotter et al., 2019b; Narasimhan, 2018).

Algorithm 1: Elicit Weights for Diagonal Linear Metrics
1: Input: Ê^val, basis functions φ_1, . . . , φ_L : X → [0, 1], training set S_tr ∼ µ, validation set S_val ∼ D, h̄, ϵ, H, γ, ω
2: If fixed classifiers:
3:   Choose h^{ℓ,i}(x) = ϵ φ_ℓ(x) e_i(x) + (1 − ϵ φ_ℓ(x)) h̄(x)
4: Else:
5:   H̄ = {τ h + (1 − τ) h̄ | h ∈ H, τ ∈ [0, ϵ]}
6:   Pick h^{ℓ,i} ∈ H̄ to satisfy (11) with slacks γ, ω, ∀(ℓ, i)
7: Compute Σ̂ and Ê using (10) with metric Ê^val
8: Output: α̂ = Σ̂^{−1} Ê

Algorithm 2: Plug-in with Elicited Weights (PI-EW) for Diagonal Linear Metrics
1: Input: Ê^val, basis functions φ_1, . . . , φ_L : X → [0, 1], class probability model η̂^tr : X → Δ_m for µ, training set S_tr ∼ µ, validation set S_val ∼ D, h̄, ϵ
2: α̂ = ElicitWeights(Ê^val, φ_1, . . . , φ_L, S_tr, S_val, h̄, ϵ)
3: Example weights: Ŵ_i(x) = Σ^L_{ℓ=1} α̂^ℓ_i φ_ℓ(x)
4: Plug-in: ĥ(x) ∈ argmax_{i∈[m]} Ŵ_i(x) η̂^tr_i(x)
5: Output: ĥ
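A sketch of the fixed probing classifiers from line 3 of Algorithm 1, which mix a trivial predict-class-i classifier into a base classifier on the support of a basis function (the uniform base classifier here is our own toy choice):

```python
import numpy as np

def probing_classifier(phi_l, i, h_base, eps):
    """h^{l,i}(x) = eps*phi_l(x)*e_i + (1 - eps*phi_l(x))*h_base(x):
    shift probability mass toward class i on examples where phi_l is large."""
    def h(x):
        p = eps * phi_l(x)
        out = (1.0 - p) * h_base(x)    # new array; base classifier untouched
        out[i] += p
        return out
    return h

# Hypothetical setup: one constant basis function and a uniform base classifier.
m = 3
phi = lambda x: 1.0                      # phi_1(x) = 1 for all x
h_base = lambda x: np.full(m, 1.0 / m)   # uniform randomized base classifier
h_10 = probing_classifier(phi, 0, h_base, eps=0.6)
p = h_10(None)
print(p.sum(), p[0] > p[1])   # still a valid distribution, mass boosted on class 0
```

Because each probing classifier concentrates mass on a distinct (ℓ, i) pair, the rows of Σ̂ they induce are diagonally dominant, which is exactly what makes the solve α̂ = Σ̂^{−1}Ê stable.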
We show in Appendix G that this problem is feasible and can be efficiently solved for a range of settings. In practice, we do not explicitly solve (11) over a hypothesis class H. Instead, a simpler and surprisingly effective strategy is to set the probing classifiers to trivial classifiers that predict the same class on all (or a subset of) examples. To build intuition for why this is a good idea, consider a simple setting with only one basis function φ_1(x) = 1, ∀x, where the φ-confusions Φ̂^{tr,1}_i[h] = (1/n_tr) Σ_{(x,y)∈S_tr} 1(y = i) h_i(x) are the standard confusion entries on the training set. In this case, a trivial classifier e_i(x) = onehot(i), ∀x, which predicts class i on all examples, yields the highest value for Φ̂^{tr,1}_i and 0 for all other Φ̂^{tr,1}_j, j ≠ i. In fact, in our experiments, we set the probing classifier h^{1,i} to a randomized combination of e_i and some fixed base classifier h̄: h^{1,i}(x) = ϵ e_i(x) + (1 − ϵ) h̄(x), for large enough ϵ so that Σ̂ is well-conditioned. Similarly, if the basis functions divide the data into L clusters (as in (6)), then we can randomize between h̄ and a trivial classifier that predicts a particular class i on all examples assigned to the cluster ℓ ∈ [L]. The confusion matrix for the resulting classifiers will have higher values than h̄ on the (ℓ, i)-th diagonal entry and a lower value on other

Algorithm 3: Frank-Wolfe with Elicited Gradients (FW-EG) for General Diagonal Metrics (also depicted in Fig. 1)
1: Input: Ê^val, basis functions φ_1, . . . , φ_L : X → [0, 1], pre-trained η̂^tr : X → Δ_m, S_tr ∼ µ, S_val ∼ D, T, ϵ
2: Initialize classifier h^0 and c^0 = diag(Ĉ^val[h^0])
3: For t = 0 to T − 1 do
4:   if E^D[h] = ψ(C^D_11[h], . . .
, C^D_mm[h]) for known ψ:
5:     β^t = ∇ψ(c^t)
6:     Ê^lin[h] = Σ_i β^t_i Ĉ^val_ii[h]
7:   else:
8:     Ê^lin[h] = Ê^val[h]  {small ϵ recommended}
9:   f̂ = PI-EW(Ê^lin, φ_1, . . . , φ_L, η̂^tr, S_tr, S_val, h^t, ϵ)
10:  c̃ = diag(Ĉ^val[f̂])
11:  h^{t+1} = (1 − 2/(t+2)) h^t + (2/(t+2)) onehot(f̂)
12:  c^{t+1} = (1 − 2/(t+2)) c^t + (2/(t+2)) c̃
13: End For
14: Output: ĥ = h^T

entries. These classifiers can be succinctly written as:

h^{ℓ,i}(x) = ϵ φ_ℓ(x) e_i(x) + (1 − ϵ φ_ℓ(x)) h̄(x), (12)

where we again tune ϵ to make sure that the resulting Σ̂ is well-conditioned. This choice of the probing classifiers also works well in practice for general basis functions φ_ℓ. Algorithm 1 summarizes the weight elicitation procedure, where the probing classifiers are either constructed by solving the constraint satisfaction problem (11) or set to the fixed classifiers in (12). In both cases, the algorithm takes a base classifier h̄ and the parameter ϵ as input, where ϵ controls the extent to which h̄ is perturbed to construct the probing classifiers. This radius parameter ϵ restricts the probing classifiers to a neighborhood around h̄ and will prove handy in the algorithm we develop in Section 4.2.

4. Plug-in Based Algorithms

Having elicited the weight coefficients α, we now seek to learn a classifier that optimizes the left-hand side of (8). We do this via the plug-in approach: first pre-train a model η̂^tr : X → Δ_m on the noisy training distribution µ to estimate the conditional class probabilities η̂^tr_i(x) ≈ P_µ(y = i|x), and then apply the correction weights to post-shift η̂^tr.

4.1. Plug-in Algorithm for Linear Metrics

We first describe our approach for (diagonal) linear metrics E^D[h] = Σ_i β_i C^D_ii[h] in Algorithm 2. Given the correction weights Ŵ : X → R^m_+, we seek to maximize the following weighted objective on the training distribution:

max_h E_{(x,y)∼µ}[Σ_i Ŵ_i(x) 1(y = i) h_i(x)].
This is a standard example-weighted learning problem, for which the following plug-in (post-shift) classifier is a consistent estimator (Narasimhan et al., 2015b; Yang et al., 2020): ĥ(x) ∈ argmax_{i∈[m]} Ŵ_i(x) η̂^tr_i(x).

4.2. Iterative Algorithm for General Metrics

To optimize generic non-linear metrics of the form E^D[h] = ψ(C^D_11[h], . . . , C^D_mm[h]) for ψ : [0, 1]^m → R_+, we apply Algorithm 2 iteratively. We consider both the case where ψ is unknown, and the case where ψ is known but needs to be optimized using the noisy distribution µ. The idea is to first elicit local linear approximations to ψ and to then learn plug-in classifiers for the resulting linear metrics in each iteration.

Specifically, following Narasimhan et al. (2015b), we derive our algorithm from the classical Frank-Wolfe method (Jaggi, 2013) for maximizing a smooth concave function ψ(c) over a convex set C ⊆ R^m. In our case, C is the set of confusion matrices C^D[h] achieved by any classifier h, and is convex when we allow randomized classifiers (see Lemma 10, Appendix B.3). The algorithm maintains iterates c^t, and at each step, maximizes a linear approximation to ψ at c^t: c̃ ∈ argmax_{c∈C} ⟨∇ψ(c^t), c⟩. The next iterate c^{t+1} is then a convex combination of c^t and the current solution c̃.

In Algorithm 3, we outline an adaptation of this Frank-Wolfe algorithm to our setting, where we maintain a classifier h^t and an estimate of the diagonal confusion entries c^t from the validation sample S_val. At each step, we linearize ψ using Ê^lin[h] = Σ_i β^t_i Ĉ^val_ii[h], where β^t = ∇ψ(c^t), and invoke the plug-in method in Algorithm 2 to optimize the linear approximation Ê^lin. When the mathematical form of ψ is known, one can directly compute the gradient β^t.
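The linearize-then-maximize loop just described can be sketched as follows, with a `linear_oracle` standing in for the PI-EW call (which, given the gradient weights β, returns the confusion of the corresponding plug-in classifier); the toy feasible set and objective are our own:

```python
import numpy as np

def frank_wolfe(psi_grad, linear_oracle, c0, T):
    """Skeleton of Algorithm 3's outer loop: linearize psi at the current
    confusion iterate and move toward the maximizer of the linear metric."""
    c = c0.copy()
    for t in range(T):
        beta = psi_grad(c)             # gradient of psi at c_t
        c_new = linear_oracle(beta)    # best achievable confusion for <beta, .>
        step = 2.0 / (t + 2)
        c = (1 - step) * c + step * c_new
    return c

# Toy check: maximize the concave objective psi(c) = sum_i log c_i over
# the convex hull of three achievable confusion vectors; by symmetry the
# maximizer is the balanced point (1/3, 1/3, 1/3).
vertices = np.array([[0.90, 0.05, 0.05],
                     [0.05, 0.90, 0.05],
                     [0.05, 0.05, 0.90]])
oracle = lambda beta: vertices[np.argmax(vertices @ beta)]
psi_grad = lambda c: 1.0 / np.maximum(c, 1e-9)
c_star = frank_wolfe(psi_grad, oracle, c0=vertices[0], T=200)
print(np.round(c_star, 2))   # converges toward the balanced point
```

The decaying 2/(t+2) step is the standard Frank-Wolfe schedule; the final randomized classifier in Algorithm 3 is the matching convex combination of the per-step plug-in classifiers.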
When it is not known, we can simply set Ê^lin[h] = Ê^val[h], but restrict the weight elicitation routine (Algorithm 1) to choose its probing classifiers h^{ℓ,i} from a small neighborhood around the current classifier h^t (in which ψ is effectively linear). This can be done by passing h̄ = h^t to the weight elicitation routine and setting the radius ϵ to a small value. Each call to Algorithm 2 uses the training and validation sets to elicit example weights for a local linear approximation to ψ, and uses the weights to construct a plug-in classifier. The final output is a randomized combination of the plug-in classifiers from each step. Note that Algorithm 3 runs efficiently for reasonable values of L and m. Indeed, the runtime is almost always dominated by the pre-training of the base model η̂^tr, with the time taken to elicit the weights (e.g., using (12)) being relatively inexpensive (see App. E).

5. Theoretical Guarantees

We provide theoretical guarantees for the weight elicitation procedure and the plug-in methods in Algorithms 1–3.

Assumption 1. The distributions D and µ are such that for any linear metric E^D[h] = Σ_i β_i C^D_ii[h] with ∥β∥ ≤ 1, there exists α ∈ R^{Lm} such that |Σ_{ℓ,i} α^ℓ_i Φ^{µ,ℓ}_i[h] − E^D[h]| ≤ ν, ∀h, and ∥α∥_1 ≤ B, for some ν ∈ [0, 1) and B > 0.

The assumption states that our choice of basis functions φ_1, . . . , φ_L is such that any linear metric on D can be approximated (up to a slack ν) by a weighting W_i(x) = Σ_ℓ α^ℓ_i φ_ℓ(x) of the training examples from µ. The existence of such a weighting function depends on how well the basis functions capture the underlying distribution shift. Indeed, the assumption holds for some common settings in Table 1, e.g., when the noise transition T is diagonal (Appendix A handles a general T), and the basis functions are set to φ_1(x) = 1, ∀x, for the ILN setting, and φ_ℓ(x) = 1(g(x) = ℓ), ∀x, for the CDLN setting.

We analyze the coefficients α̂ elicited by Algorithm 1 when the probing classifiers h^{ℓ,i} are chosen to satisfy (11).
In Appendix C, we provide an analysis when the probing classifiers h^{ℓ,i} are set to the fixed choices in (12).

Theorem 1 (Error bound on elicited weights). Fix δ ∈ (0, 1). Let γ, ω > 0 be such that the constraints in (11) are feasible for hypothesis class H, for all ℓ, i. Suppose Algorithm 1 chooses each classifier h^{ℓ,i} to satisfy (11), with E^D[h^{ℓ,i}] ∈ [c, 1], ∀ℓ, i, for some c > 0. Let α be defined as in Assumption 1. Suppose γ > 2√2 Lm ω and n_tr ≥ L²m log(Lm|H|/δ) / (γ − √2 Lm ω)². Then w.p. ≥ 1 − δ over draws of S_tr and S_val from µ and D respectively, the coefficients α̂ output by Algorithm 1 satisfy a bound on ∥α̂ − α∥ that scales with √(L log(Lm|H|/δ)/n_tr), with the corresponding sampling error from the validation sample, and with the slack ν, where the term |H| can be replaced by a measure of capacity of the hypothesis class H.

Because the probing classifiers are chosen using the training set alone, it is only the sampling errors from the training set that depend on the complexity of H, and not those from the validation set. This suggests robustness of our approach to a small validation set, as long as the training set is sufficiently large and the number of basis functions is reasonably small.

For the iterative plug-in method in Algorithm 3, we bound the gap between the metric value E^D[ĥ] for the output classifier ĥ on the true distribution D and the optimal value. We handle the case where the function ψ is known and its gradient ∇ψ can be computed in closed form. The more general case of an unknown ψ is handled in Appendix D. The bound in Theorem 2 below depends on the gap between the estimated class probabilities η̂^tr_i(x) for the training distribution and the true class probabilities η^tr_i(x) = P_µ(y = i|x), as well as on the quality of the coefficients α̂ provided by the weight estimation subroutine, as measured by κ(·). One can substitute κ(·) with, e.g., the error bound provided in Theorem 1.

Theorem 2 (Error Bound for FW-EG). Let E^D[h] = ψ(C^D_11[h], . . . , C^D_mm[h]) for a known concave function ψ : [0, 1]^m → R_+, which is Q-Lipschitz and λ-smooth. Fix δ ∈ (0, 1).
Suppose Assumption 1 holds, and for any linear metric Σ_i β_i C^D_ii[h] whose associated weight coefficients α satisfy ∥α∥ ≤ B, w.p. ≥ 1 − δ over draws of S_tr and S_val, the weight estimation routine in Algorithm 1 outputs coefficients α̂ with ∥α̂ − α∥ ≤ κ(δ, n_tr, n_val), for some function κ(·) > 0. Let B̄ = B + √(Lm) κ(δ/T, n_tr, n_val). Then w.p. ≥ 1 − δ over draws of S_tr and S_val from µ and D respectively, the classifier ĥ output by Algorithm 3 after T iterations satisfies:

max_h E^D[h] − E^D[ĥ] ≤ 2QB̄ E_x[∥η^tr(x) − η̂^tr(x)∥_1] + 4Q√(Lm) κ(δ/T, n_tr, n_val) + O(√((m log(m) log(n_val) + log(m/δ)) / n_val) + λ/T).

The proof in turn derives an error bound for the plug-in classifier in Algorithm 2 for linear metrics (see App. B.2).

6. Related Work

Methods for closed-form metrics. There has been a variety of work on optimizing complex evaluation metrics, including both plug-in type algorithms (Ye et al., 2012; Narasimhan et al., 2014; Koyejo et al., 2014; Narasimhan et al., 2015b; Yan et al., 2018) and those that use convex surrogates for the metric (Joachims, 2005; Kar et al., 2014; 2016; Narasimhan et al., 2015a; Eban et al., 2017; Narasimhan et al., 2019; Hiranandani et al., 2020). These methods rely on the test metric having a specific closed-form structure and do not handle black-box metrics.

Methods for black-box metrics. Among recent black-box metric learning works, the closest to ours is Jiang et al. (2020), who learn a weighted combination of surrogate losses to approximate the metric on a validation set. Like us, they probe the metric at multiple classifiers, but their approach has several drawbacks on both practical and theoretical fronts. Firstly, Jiang et al. (2020) require re-training the model in each iteration, which can be time-intensive, whereas we only post-shift a pre-trained model. Secondly, the procedure they prescribe for eliciting gradients requires perturbing the model parameters multiple times, which can be very expensive for large deep networks, whereas we only require perturbing the predictions from the model.
Moreover, the number of perturbations they need grows polynomially with the precision with which they need to estimate the loss coefficients, whereas we only require a constant number of them. Lastly, their approach does not come with strong statistical guarantees, whereas ours does. Besides these benefits over Jiang et al. (2020), we will also see in Section 7 that our method yields better accuracies. Other related black-box learning methods include Zhao et al. (2019), Ren et al. (2018), and Huang et al. (2019), who learn a (weighted) loss to approximate the metric, but do so using computationally expensive procedures (e.g., meta-gradient descent or RL) that often require re-training the model from scratch, and come with limited theoretical analysis.

Methods for distribution shift. The literature on distribution shift is vast, and so we cover a few representative papers; see (Frénay & Verleysen, 2013; Csurka, 2017) for a comprehensive discussion. For the independent label noise setting (Natarajan et al., 2013), Patrini et al. (2017) propose a loss correction approach that first trains a model with noisy labels, uses its predictions to estimate the noise transition matrix, and then re-trains the model with the corrected loss. This approach is, however, tailored to optimizing linear metrics, whereas we can handle more complex metrics as well, without re-training the underlying model. A plethora of approaches exist for tackling domain shift, including classical importance weighting (IW) strategies (Sugiyama et al., 2008; Shimodaira, 2000; Kanamori et al., 2009; Lipton et al., 2018) that work in two steps: estimate the density ratios, and train a model with the resulting weighted loss. One such approach is Kernel Mean Matching (Huang et al., 2006), which matches covariate distributions between training and test sets in a high-dimensional RKHS feature space. These IW approaches are, however, prone to over-fitting when used with deep networks (Byrd & Lipton, 2019).
More recent iterative variants seek to remedy this (Fang et al., 2020).

7. Experiments

We run experiments on four classification tasks, with both known and black-box metrics, and under different label noise and domain shift settings. All our experiments use a large training sample, which is either noisy or contains missing attributes, and a smaller clean (and complete) validation sample. We always optimize the cross-entropy loss for learning η̂^tr(x) ≈ P_µ(Y|x) using the training set (or η̂^val(x) ≈ P_D(Y|x) for some baselines), where the models are varied across experiments. For monitoring the quality of η̂^tr and η̂^val, we sample small hyper-train and hyper-val subsets from the original training and validation data, respectively. We repeat our experiments over 5 random train-vali-test splits, and report the mean and standard deviation for each metric. We mark results where the differences between our method and the closest baseline are statistically significant (using Welch's t-test) at confidence levels of 90%, 95%, and 99%, respectively. Table 6 in App. H summarizes the datasets used. The source code (along with random seeds) is provided at the link below.¹

¹https://github.com/koyejolab/fweg/

Table 2: Test accuracy for the noisy label experiment on CIFAR-10.

Cross-entropy [train]  0.582 ± 0.007
Cross-entropy [val]    0.386 ± 0.031
Learn-to-reweight      0.651 ± 0.017
Plug-in [train-val]    0.733 ± 0.044
Forward Correction     0.757 ± 0.005
Fine-tuning            0.769 ± 0.005
PI-EW                  0.781 ± 0.019

Common baselines: We use representative baselines from the black-box learning (Jiang et al., 2020), iterative re-weighting (Ren et al., 2018), label noise correction (Patrini et al., 2017), and importance weighting (Huang et al., 2006) literatures. First, we list the ones common to all experiments.

1. Cross-entropy [train]: Maximizes accuracy on the training set and predicts: ĥ(x) ∈ argmax_{i∈[m]} η̂^tr_i(x).

2.
Cross-entropy [val]: Maximizes accuracy on the validation set and predicts: ĥ(x) ∈ argmax_{i∈[m]} η̂^val_i(x).

3. Fine-tuning: Fine-tunes the pre-trained η̂^tr using the validation data, monitoring the cross-entropy loss on the hyper-val data for early stopping.

4. Opt-metric [val]: For metrics ψ(C^D[h]) for which ψ is known, trains a model to directly maximize the metric on the small validation set using the Frank-Wolfe based algorithm of Narasimhan et al. (2015b).

5. Learn-to-reweight (Ren et al., 2018): Jointly learns example weights, along with the model, to maximize accuracy on the validation set; does not handle specialized metrics.

6. Plug-in [train-val]: Constructs a classifier ĥ(x) ∈ argmax_i w_i η̂^val_i(x), where the weights w_i ∈ R are tuned to maximize the given metric on the validation set, using a coordinate-wise line search (details in Appendix F).

7. Adaptive Surrogates (Jiang et al., 2020): Learns a weighted combination of surrogate losses (evaluated on clusters of examples) to approximate the metric on the validation set. Since this method is not directly amenable for use with large neural networks (see Section 6), we compare with it only when using linear models, and present additional comparisons in App. H (Table 7).

Hyper-parameters: The learning rate for Fine-tuning is chosen from {1e−6, . . . , 1e−4}. For PI-EW and FW-EG, we tune the parameter ϵ over {1, 0.4, 1e−4, 1e−3, 1e−2, 1e−1}. The line search for Plug-in is performed with a spacing of 1e−4. The only hyper-parameters the other baselines have are those for training η̂^tr and η̂^val, which we state in the individual tasks.

7.1. Maximizing Accuracy under Label Noise

In our first task, we train a 10-class image classifier for the CIFAR-10 dataset (Krizhevsky et al., 2009), replicating the independent (asymmetric) label noise setup from Patrini et al. (2017). The evaluation metric we use is accuracy. We take 2% of the original training data as validation data and flip labels in the remaining training set based on the following transition matrix: TRUCK → AUTOMOBILE, BIRD
We take 2% of the original training data as validation data and flip labels in the remaining training set based on the following transitions: TRUCK → AUTOMOBILE, BIRD → PLANE, DEER → HORSE, CAT → DOG, each with a flip probability of 0.6. For bηtr and bηval, we use the same ResNet-14 architecture as Patrini et al. (2017), trained using SGD with momentum 0.9, weight decay 1e-4, and learning rate 0.01, which we divide by 10 after 40 and 80 epochs (120 in total). We additionally compare with the Forward Correction method of Patrini et al. (2017), a specialized method for correcting independent label noise, which estimates the noise transition matrix T using predictions from bηtr on the training set, and retrains the model with the corrected loss, thus training the ResNet twice. We saw a notable drop with this method when we used the (small) validation set to estimate T.

Table 3: Test G-mean for proxy label experiment on Adult.
  Cross-entropy [train]   0.654 ± 0.002
  Cross-entropy [val]     0.394 ± 0.064
  Opt-metric [val]        0.652 ± 0.027
  Learn-to-reweight       0.668 ± 0.003
  Plug-in [train-val]     0.672 ± 0.013
  Forward Correction      0.214 ± 0.004
  Fine-tuning             0.631 ± 0.017
  Importance Weights      0.662 ± 0.024
  Adaptive Surrogates     0.682 ± 0.002
  FW-EG [unknown ψ]       0.685 ± 0.002
  FW-EG [known ψ]         0.685 ± 0.001

We apply the proposed PI-EW method for linear metrics, using a weighting function W defined with one of two choices for the basis functions (chosen via cross-validation): (i) a default basis function φdef(x) = 1 for all x, which clusters all the points together, and (ii) ten basis functions φ1, . . . , φ10, each one being the average of the RBF kernels (see (7)) centered at the validation points belonging to one true class. The RBF kernels are computed with width 2 on UMAP-reduced 50-dimensional image embeddings (McInnes et al., 2018). As shown in Table 2, PI-EW achieves significantly better test accuracies than all the baselines.
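The per-class RBF basis functions described above admit a compact implementation. The sketch below builds one basis function per class by averaging RBF kernel values over that class's validation points; the function name is illustrative, and for simplicity it operates on raw features rather than the UMAP-reduced embeddings used in the paper:

```python
import numpy as np

def make_rbf_basis(X_val, y_val, width=2.0):
    """Return a function mapping inputs X to an (n, num_classes) matrix,
    where column k is phi_k(x): the average RBF kernel between x and the
    validation points with true class k. (A sketch; the paper computes
    these on UMAP-reduced embeddings.)"""
    classes = np.unique(y_val)
    centers = [X_val[y_val == k] for k in classes]

    def basis(X):
        out = np.empty((len(X), len(centers)))
        for j, C in enumerate(centers):
            # Squared distances between every x and every center in class j.
            d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
            out[:, j] = np.exp(-d2 / (2.0 * width ** 2)).mean(axis=1)
        return out

    return basis

# Usage on toy 2-d data with two classes.
rng = np.random.default_rng(1)
Xv = rng.normal(size=(40, 2))
yv = (Xv[:, 0] > 0).astype(int)
phi = make_rbf_basis(Xv, yv)
vals = phi(rng.normal(size=(5, 2)))
```

Each basis value lies in (0, 1], and is large exactly when a point sits near the validation examples of the corresponding class, which is what lets the learned example weights vary smoothly across classes.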
The results for Forward Correction match those in Patrini et al. (2017); unlike this method, we train the ResNet only once, yet achieve 2.4% higher accuracy. Cross-entropy [val] over-fits badly, and yields the lowest test accuracy. Surprisingly, simple Fine-tuning yields the second-best accuracy. A possible reason is that the pre-trained model learns a good feature representation, and the fine-tuning step adapts well to the domain change. We also observed that PI-EW achieves better accuracy during cross-validation with the ten basis functions, highlighting the benefit of the underlying modeling in PI-EW. Lastly, in Figure 2(a), we show the elicited (class) weights with the default basis function (φdef(x) = 1 for all x), where, e.g., because BIRD → PLANE, the weight on BIRD is increased and that on PLANE is decreased.

Table 4: Test F-measure for domain shift experiment on Adience.
  Cross-entropy [train]       0.760 ± 0.014
  Cross-entropy [val]         0.708 ± 0.022
  Opt-metric [val]            0.760 ± 0.014
  Plug-in [train-val]         0.759 ± 0.014
  Importance Weights [KMM]    0.760 ± 0.013
  Learn-to-reweight           0.773 ± 0.009
  Fine-tuning                 0.781 ± 0.014
  FW-EG [unknown ψ]           0.815 ± 0.013
  FW-EG [known ψ]             0.804 ± 0.015

7.2. Maximizing G-mean with Proxy Labels

Our next experiment borrows the proxy label setup from Jiang et al. (2020) on the Adult dataset (Dua & Graff, 2017). The task is to predict whether a candidate's gender is male, but the training set contains only a proxy for the true label. We sample 1% validation data from the original training data, and replace the labels in the remaining sample with the feature 'relationship-husband'. The label noise here is instance-dependent (see Example 3), and we seek to maximize the G-mean metric ψ(C) = (∏_i C_ii/π_i)^{1/m}, the geometric mean of the per-class recalls, where π_i denotes the prior of class i. We train bηtr and bηval using linear logistic regression with SGD and a learning rate of 0.01.
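Written on labels rather than the confusion matrix, the G-mean is simply the geometric mean of the per-class recalls. A minimal sketch (the function name is ours):

```python
import numpy as np

def g_mean(y_true, y_pred, num_classes):
    """Geometric mean of per-class recalls: psi(C) = (prod_i C_ii/pi_i)^(1/m).
    Each C_ii/pi_i equals the recall of class i. (A sketch.)"""
    recalls = []
    for k in range(num_classes):
        mask = y_true == k
        recalls.append(float(np.mean(y_pred[mask] == k)) if mask.any() else 0.0)
    return float(np.prod(recalls) ** (1.0 / num_classes))

# A perfect predictor scores 1; ignoring any one class entirely gives 0,
# which is why trivial majority-class predictors do badly on G-mean.
y = np.array([0, 0, 1, 1])
assert g_mean(y, y, 2) == 1.0
assert g_mean(y, np.zeros(4, dtype=int), 2) == 0.0
```

This zero-at-any-missed-class behavior explains the very low Forward Correction score in Table 3: a correction mismatched to the noise model can collapse recall on one class and drag the whole geometric mean down.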
As additional baselines, we include the Adaptive Surrogates method of Jiang et al. (2020) and Forward Correction (Patrini et al., 2017). The inner and outer learning rates for Adaptive Surrogates are each cross-validated in {0.1, 1.0}. We also compare with a simple Importance Weighting strategy, where we first train a logistic regression model f to predict whether an example (x, y) belongs to the validation data, and then train a gender classifier with the training examples weighted by f(x, y)/(1 − f(x, y)). We choose between three sets of basis functions (using cross-validation): (i) a default basis function φdef(x) = 1 for all x; (ii) φdef, φpw, φnpw, where φpw(x) = 1(xpw = 1) and φnpw(x) = 1(xnpw = 1) use the features 'private-workforce' and 'non-private-workforce' to form hard clusters; and (iii) φdef, φpw, φnpw, φinc, where φinc(x) = 1(xinc = 1) uses the binary feature 'income'. These choices are motivated by those used by Jiang et al. (2020), who compute surrogate losses on the individual clusters. We provide their Adaptive Surrogates method with the same clustering choices.

Table 3 summarizes our results. We apply both variants of our FW-EG method for a non-linear metric ψ: one where ψ is known and its gradient is available in closed form, and the other where ψ is assumed to be unknown and is treated as a general black-box metric. Both variants perform similarly and are better than the baselines. Adaptive Surrogates comes a close second, but underperforms by 0.3% (with the difference being statistically significant). While the improvement of FW-EG over Adaptive Surrogates is small, the latter is time-intensive, as it re-trains a logistic regression model in each iteration. We verify this empirically in Figure 2(b) by reporting run-times for Adaptive Surrogates and our method FW-EG (including the pre-training time) against the choices of basis functions (clustering features). We see that our approach is 5× faster for this experiment.
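The Importance Weighting baseline above is the classical density-ratio trick: a classifier f that separates validation from training examples yields weights f/(1 − f) proportional to p_val/p_train. A self-contained sketch using a small hand-rolled logistic regression (the function name, optimizer settings, and the pre-stacked feature array z are all illustrative assumptions):

```python
import numpy as np

def importance_weights(Z_train, Z_val, lr=0.1, steps=2000, clip=1e-3):
    """Estimate w(z) ~ p_val(z) / p_train(z) with a domain classifier:
    fit logistic regression f(z) = P(z came from validation) by plain
    gradient descent on the log-loss, then weight training points by
    f/(1-f). (A sketch of the Importance Weighting baseline; z stands
    for the stacked (x, y) features.)"""
    Z = np.vstack([Z_train, Z_val])
    Z = np.hstack([Z, np.ones((len(Z), 1))])          # bias column
    d = np.concatenate([np.zeros(len(Z_train)), np.ones(len(Z_val))])
    theta = np.zeros(Z.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Z @ theta))
        theta -= lr * Z.T @ (p - d) / len(Z)          # log-loss gradient
    p_tr = 1.0 / (1.0 + np.exp(-Z[: len(Z_train)] @ theta))
    p_tr = np.clip(p_tr, clip, 1 - clip)
    # f/(1-f), rescaled to correct for the train/val size imbalance.
    return (p_tr / (1 - p_tr)) * (len(Z_train) / len(Z_val))

# Usage: validation data shifted relative to training data; the weights
# should favor training points that look like validation points.
rng = np.random.default_rng(2)
Ztr = rng.normal(0.0, 1.0, size=(500, 2))
Zva = rng.normal(1.5, 1.0, size=(100, 2))
w = importance_weights(Ztr, Zva)
```

The clipping guards the ratio against saturated probabilities, and the final rescaling corrects for the fact that the domain classifier sees many more training than validation examples.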
Lastly, Forward Correction performs poorly, likely because its loss correction is not aligned with this label noise model.

Table 5: Black-box fairness metric on the test set for Adult.
  Cross-entropy [train]   0.736 ± 0.005
  Cross-entropy [val]     0.610 ± 0.020
  Learn-to-reweight       0.729 ± 0.007
  Fine-tuning             0.738 ± 0.005
  Adaptive Surrogates     0.812 ± 0.004
  Plug-in [train-val]     0.812 ± 0.005
  FW-EG                   0.822 ± 0.002

7.3. Maximizing F-measure under Domain Shift

We now move on to a domain shift application (see Example 4). The task is to learn a gender recognizer for the Adience face image dataset (Eidinger et al., 2014), but with the training and test datasets containing images from different age groups (domain shift based on age). We use images belonging to age buckets 1–5 for training (12.2K images), and evaluate on images from age buckets 6–8 (4K images). For the validation set, we sample 20% of the 6–8 age bucket images. Here we aim to maximize the F-measure. For bηtr and bηval, we use the same ResNet-14 model from the CIFAR-10 experiment, except that the learning rate is divided by 2 after 10 epochs (20 in total). As an additional baseline, we compute importance weights using Kernel Mean Matching (KMM) (Huang et al., 2006), and train the same ResNet model with a weighted loss. Since the images are too large to apply KMM to directly, we first compute 2048-dimensional ImageNet embeddings (Krizhevsky et al., 2012) for the images and further reduce them to 10 dimensions via UMAP; the KMM weights are learned on this 10-dimensional embedding. For the basis functions, besides the default basis φdef(x) = 1 for all x, we choose from subsets of six RBF basis functions φ1, . . . , φ6, centered at points from the validation set, each representing one of six age-gender combinations. We use the same UMAP embedding as for KMM to compute the RBF kernels. Table 4 presents the test F-measure values. Both variants of the FW-EG algorithm provide statistically significant improvements over the baselines.
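For readers unfamiliar with the KMM baseline: it chooses per-example weights β so that the weighted mean embedding of the training data matches the mean embedding of the test data in an RKHS. The sketch below is a simplified variant solved by projected gradient descent with only a box constraint 0 ≤ β ≤ B (KMM's usual constraint that the weights sum to roughly n is dropped for brevity); function names and solver settings are ours:

```python
import numpy as np

def rbf_kernel(A, B, width=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * width ** 2))

def kmm_weights(X_train, X_test, width=1.0, B=10.0, lr=0.005, steps=3000):
    """Kernel Mean Matching, simplified: minimize
        0.5 * beta^T K beta - kappa^T beta   s.t.  0 <= beta <= B,
    where K is the train-train kernel and kappa_i sums kernels between
    training point i and the test points. Solved by projected gradient
    descent. (A sketch, not the QP formulation of Huang et al., 2006.)"""
    n_tr, n_te = len(X_train), len(X_test)
    K = rbf_kernel(X_train, X_train, width)
    kappa = (n_tr / n_te) * rbf_kernel(X_train, X_test, width).sum(axis=1)
    beta = np.ones(n_tr)
    for _ in range(steps):
        beta = np.clip(beta - lr * (K @ beta - kappa), 0.0, B)
    return beta

# Usage: test distribution shifted; training points near the test
# cluster should receive larger weights.
rng = np.random.default_rng(3)
Xtr = rng.normal(0.0, 1.0, size=(200, 2))
Xte = rng.normal(1.0, 1.0, size=(100, 2))
beta = kmm_weights(Xtr, Xte)
```

Unlike the classifier-based importance weighting above, KMM needs no labeled "domain" model, but it requires the full kernel matrix, which is why the paper first reduces the image embeddings to 10 dimensions.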
Both Fine-tuning and Learn-to-reweight improve over plain cross-entropy optimization on the training set, though only moderately, likely because of the small size of the validation set, and because these methods are not tailored to optimize the F-measure.

Figure 2: (a) Elicited (class) weights for CIFAR-10 by PI-EW for the default basis (Sec. 7.1); (b) Run-time for FW-EG and Adaptive Surrogates (Jiang et al., 2020) vs. the number of grouping features on the proxy label task (Sec. 7.2); (c) Effect of the quality of the base model bηtr on Adult (Sec. 7.2): as the base model's quality improves, the test accuracy of PI-EW also improves; (d) Effect of the validation set size on Adience (Sec. 7.3): PI-EW performs better than fine-tuning even for small validation sets, while both improve with larger ones.

7.4. Maximizing Black-box Fairness Metric

We next handle a black-box metric given only query access to its value. We consider a fairness application where the goal is to balance classification performance across multiple protected groups. The groups that one cares about are known, but due to privacy or legal restrictions, the protected attribute for an individual cannot be revealed (Awasthi et al., 2021). Instead, we have access to an oracle that reveals the value of the fairness metric for predictions on a validation sample, with the protected attributes absent from the training sample.
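Such an oracle can be simulated for experimentation. The sketch below computes the group-wise geometric-mean metric used in this section (the learner only queries it and never sees its form); the function name and the binary 0/1 coding of the two protected groups are illustrative assumptions:

```python
import numpy as np

def fairness_metric(y_true, y_pred, group):
    """Black-box oracle (a sketch): geometric mean of the TP and TN
    rates computed separately on each of two protected groups,
        E[h] = (TP_0 * TN_0 * TP_1 * TN_1)^(1/4),
    which rewards equal performance across both groups and classes."""
    rates = []
    for g in (0, 1):                       # the two protected groups
        yt, yp = y_true[group == g], y_pred[group == g]
        tp = np.mean(yp[yt == 1] == 1)     # true-positive rate in group g
        tn = np.mean(yp[yt == 0] == 0)     # true-negative rate in group g
        rates += [tp, tn]
    return float(np.prod(rates) ** 0.25)

# A perfect classifier scores 1; failing every positive in one group
# drives the whole metric to 0, so no group can be sacrificed.
y = np.array([0, 1, 0, 1])
g = np.array([0, 0, 1, 1])
assert fairness_metric(y, y, g) == 1.0
```

Note the multiplicative structure: unlike an averaged metric, a classifier cannot trade one group's performance for another's, since a zero rate for any group-class pair zeroes the product.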
This setup is different from recent work on learning fair classifiers from incomplete group information (Lahoti et al., 2019; Wang et al., 2020b), in that the focus here is on optimizing any given black-box fairness metric. We use the Adult dataset, and seek to predict whether the candidate's income is greater than $50K, with gender as the protected attribute. The black-box metric we consider (whose form is unknown to the learner) is the geometric mean of the true-positive (TP) and true-negative (TN) rates, evaluated separately on the male and female examples, which promotes equal performance for both groups and classes:

E_D[h] = (TP_male[h] · TN_male[h] · TP_female[h] · TN_female[h])^{1/4}.

We train the same logistic regression models as in the previous Adult experiment in Section 7.2. Along with the basis functions φdef, φpw, and φnpw used there, we additionally include two basis functions φhs and φwf based on the features 'relationship-husband' and 'relationship-wife', which we expect to be correlated with gender.2 We include two baselines that can handle black-box metrics: Plug-in [train-val], which tunes a threshold on bηtr by querying the metric on the validation set, and Adaptive Surrogates. The latter is cross-validated on the same set of clustering features (i.e., the basis functions in our method) for computing the surrogate losses. As seen in Table 5, FW-EG yields the highest black-box metric on the test set, Adaptive Surrogates comes in second, and, surprisingly, the simple plug-in approach fares better than the other baselines. During cross-validation, we also observed that the performance of FW-EG improves with more basis functions, particularly with the ones that are better correlated with gender. Specifically, FW-EG with basis functions {φdef, φpw, φnpw, φwf, φhs} achieves approximately 1% better performance than both FW-EG with the φdef basis function alone and FW-EG with basis functions {φdef, φpw, φnpw}.

7.5. Ablation Studies

We close with two sets of experiments.
First, we analyze how the performance of PI-EW, while optimizing accuracy for the Adult experiment (Section 7.2), varies with the quality of the base model bηtr. We save an estimate of bηtr after every 50 batches (batch size 32) while training the logistic regression model, and use these estimates as inputs to PI-EW. As shown in Figure 2(c), the test accuracy of PI-EW improves with the quality of bηtr (as measured by the log loss on the hyper-train set). This is in accordance with Theorem 2. One can further improve the quality of the estimate bηtr by using calibration techniques (Guo et al., 2017), which would likely enhance the performance of PI-EW as well.

Next, we show that PI-EW is robust to changes in the validation set size, using the Adience setup of Section 7.3 with accuracy as the metric. We set aside 50% of the 6–8 age bucket data for testing, and sample validation sets of varying sizes from the rest. As shown in Figure 2(d), PI-EW generally performs better than fine-tuning even for small validation sets, while both improve with larger ones. The only exception is the validation set of size 100 (0.8% of the training data), where we see overfitting due to the small validation size.

2The only domain knowledge we use is that the protected group is 'gender'; beyond this, the form of the metric is unknown, and importantly, an individual's gender is not available.

8. Conclusion and Discussion

We have proposed the FW-EG method for optimizing black-box metrics given query access to the evaluation metric on a small validation set. Our framework includes common distribution shift settings as special cases, and unlike prior distribution correction strategies, is able to handle general non-linear metrics. A key benefit of our method is that it is agnostic to the choice of bηtr, and can thus be used to post-shift pre-trained deep networks, without having to retrain them.
We showed that the post-shift example weights can be flexibly modeled with various choices of basis functions (e.g., hard clusters, RBF kernels, etc.), and empirically demonstrated their efficacy. We look forward to further improving the results with more nuanced basis functions.

Acknowledgements

We thank the anonymous reviewers for their helpful and constructive feedback. We also thank Google Cloud for supporting this research with cloud computing credits.

References

Awasthi, P., Beutel, A., Kleindessner, M., Morgenstern, J., and Wang, X. Evaluating fairness of machine learning models under uncertain and incomplete information. In FAccT, 2021.

Byrd, J. and Lipton, Z. What is the effect of importance weighting in deep learning? In International Conference on Machine Learning, pp. 872–881. PMLR, 2019.

Cotter, A., Gupta, M., and Narasimhan, H. On making stochastic classifiers deterministic. In Advances in Neural Information Processing Systems, 2019a.

Cotter, A., Jiang, H., Wang, S., Narayan, T., You, S., Sridharan, K., and Gupta, M. R. Optimization with non-differentiable constraints with applications to fairness, recall, churn, and other goals. Journal of Machine Learning Research (JMLR), 20(172):1–59, 2019b.

Csurka, G. A comprehensive survey on domain adaptation for visual applications. Domain Adaptation in Computer Vision Applications, pp. 1–35, 2017.

Daniely, A., Sabato, S., Ben-David, S., and Shalev-Shwartz, S. Multiclass learnability and the ERM principle. In Proceedings of the 24th Annual Conference on Learning Theory, pp. 207–232. JMLR Workshop and Conference Proceedings, 2011.

Daniely, A., Sabato, S., Ben-David, S., and Shalev-Shwartz, S. Multiclass learnability and the ERM principle. Journal of Machine Learning Research, 16:2377–2404, 2015.

Daskalaki, S., Kopanas, I., and Avouris, N. Evaluation of classifiers for an uneven class distribution problem. Applied Artificial Intelligence, 20:381–417, 2006.
Demmel, J. W. Applied Numerical Linear Algebra. SIAM, 1997.

Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Eban, E., Schain, M., Mackey, A., Gordon, A., Rifkin, R., and Elidan, G. Scalable learning of non-decomposable objectives. In Artificial Intelligence and Statistics, pp. 832–840. PMLR, 2017.

Eidinger, E., Enbar, R., and Hassner, T. Age and gender estimation of unfiltered faces. IEEE Transactions on Information Forensics and Security, 9(12):2170–2179, 2014.

Fang, T., Lu, N., Niu, G., and Sugiyama, M. Rethinking importance weighting for deep learning under distribution shift. arXiv preprint arXiv:2006.04662, 2020.

Frénay, B. and Verleysen, M. Classification in the presence of label noise: a survey. IEEE Transactions on Neural Networks and Learning Systems, 25(5):845–869, 2013.

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330. PMLR, 2017.

Hiranandani, G., Boodaghians, S., Mehta, R., and Koyejo, O. Performance metric elicitation from pairwise classifier comparisons. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 371–379, 2019a.

Hiranandani, G., Boodaghians, S., Mehta, R., and Koyejo, O. O. Multiclass performance metric elicitation. In Advances in Neural Information Processing Systems, pp. 9351–9360, 2019b.

Hiranandani, G., Vijitbenjaronk, W., Koyejo, S., and Jain, P. Optimization and analysis of the pAp@k metric for recommender systems. In International Conference on Machine Learning, pp. 4260–4270. PMLR, 2020.

Huang, C., Zhai, S., Talbott, W., Martin, M. B., Sun, S.-Y., Guestrin, C., and Susskind, J. Addressing the loss-metric mismatch with adaptive loss alignment. In International Conference on Machine Learning, pp. 2891–2900. PMLR, 2019.

Huang, J., Gretton, A., Borgwardt, K., Schölkopf, B., and Smola, A. Correcting sample selection bias by unlabeled data.
Advances in Neural Information Processing Systems, 19:601–608, 2006.

Jaggi, M. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, 2013.

Jiang, Q., Adigun, O., Narasimhan, H., Fard, M. M., and Gupta, M. Optimizing black-box metrics with adaptive surrogates. In ICML, 2020.

Joachims, T. A support vector method for multivariate performance measures. In Proceedings of the 22nd International Conference on Machine Learning, pp. 377–384. ACM, 2005.

Kanamori, T., Hido, S., and Sugiyama, M. A least-squares approach to direct importance estimation. The Journal of Machine Learning Research, 10:1391–1445, 2009.

Kar, P., Narasimhan, H., and Jain, P. Online and stochastic gradient methods for non-decomposable loss functions. arXiv preprint arXiv:1410.6776, 2014.

Kar, P., Li, S., Narasimhan, H., Chawla, S., and Sebastiani, F. Online optimization methods for the quantification problem. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1625–1634, 2016.

Koyejo, O. O., Natarajan, N., Ravikumar, P. K., and Dhillon, I. S. Consistent binary classification with generalized performance metrics. In NIPS, pp. 2744–2752, 2014.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25:1097–1105, 2012.

Lahoti, P., Gummadi, K. P., and Weikum, G. iFair: Learning individually fair data representations for algorithmic decision making. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 1334–1345. IEEE, 2019.

Lewis, D. Evaluating and optimizing autonomous text classification systems. In SIGIR, 1995.

Lipton, Z., Wang, Y.-X., and Smola, A. Detecting and correcting for label shift with black box predictors.
In International Conference on Machine Learning, pp. 3122–3130. PMLR, 2018.

McInnes, L., Healy, J., Saul, N., and Großberger, L. UMAP: Uniform manifold approximation and projection. Journal of Open Source Software, 3(29):861, 2018.

Menon, A. K., Van Rooyen, B., and Natarajan, N. Learning from binary labels with instance-dependent noise. Machine Learning, 107(8-10):1561–1595, 2018.

Narasimhan, H. Learning with complex loss functions and constraints. In International Conference on Artificial Intelligence and Statistics, pp. 1646–1654, 2018.

Narasimhan, H., Vaish, R., and Agarwal, S. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In Advances in Neural Information Processing Systems, pp. 1493–1501, 2014.

Narasimhan, H., Kar, P., and Jain, P. Optimizing non-decomposable performance measures: A tale of two classes. In International Conference on Machine Learning, pp. 199–208. PMLR, 2015a.

Narasimhan, H., Ramaswamy, H., Saha, A., and Agarwal, S. Consistent multiclass algorithms for complex performance measures. In ICML, pp. 2398–2407, 2015b.

Narasimhan, H., Cotter, A., and Gupta, M. Optimizing generalized rate metrics with three players. In Advances in Neural Information Processing Systems, pp. 10746–10757, 2019.

Natarajan, B. K. On learning sets and functions. Machine Learning, 4(1):67–97, 1989.

Natarajan, N., Dhillon, I. S., Ravikumar, P. K., and Tewari, A. Learning with noisy labels. Advances in Neural Information Processing Systems, 26:1196–1204, 2013.

Patrini, G., Rozza, A., Krishna Menon, A., Nock, R., and Qu, L. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1944–1952, 2017.

Ren, M., Zeng, W., Yang, B., and Urtasun, R. Learning to reweight examples for robust deep learning. In International Conference on Machine Learning, pp. 4334–4343. PMLR, 2018.

Shimodaira, H.
Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.

Stewart, G. W. Perturbation theory for the singular value decomposition. Technical report, 1998.

Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Bünau, P., and Kawanabe, M. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699–746, 2008.

Tamburrelli, G. and Margara, A. Towards automated A/B testing. In International Symposium on Search Based Software Engineering, pp. 184–198. Springer, 2014.

Wang, J., Liu, Y., and Levy, C. Fair classification with group-dependent label noise. arXiv preprint arXiv:2011.00379, 2020a.

Wang, S., Guo, W., Narasimhan, H., Cotter, A., Gupta, M., and Jordan, M. I. Robust optimization for fairness with noisy protected groups. 2020b.

Yan, B., Koyejo, S., Zhong, K., and Ravikumar, P. Binary classification with karmic, threshold-quasi-concave metrics. In International Conference on Machine Learning, pp. 5531–5540. PMLR, 2018.

Yang, F., Cisse, M., and Koyejo, S. Fairness with overlapping groups. 2020.

Ye, N., Chai, K. M., Lee, W. S., and Chieu, H. L. Optimizing F-measures: a tale of two approaches. In Proceedings of the 29th International Conference on Machine Learning, pp. 289–296. Omnipress, 2012.

Zhao, S., Fard, M. M., Narasimhan, H., and Gupta, M. Metric-optimized example weights. In International Conference on Machine Learning, pp. 7533–7542. PMLR, 2019.