# optimizing_generalized_rate_metrics_with_three_players__b0d5acb3.pdf Optimizing Generalized Rate Metrics with Three Players Harikrishna Narasimhan, Andrew Cotter, Maya Gupta Google Research 1600 Amphitheatre Pkwy, Mountain View, CA 94043 {hnarasimhan, acotter, mayagupta}@google.com We present a general framework for solving a large class of learning problems with non-linear functions of classification rates. This includes problems where one wishes to optimize a non-decomposable performance metric such as the Fmeasure or G-mean, and constrained training problems where the classifier needs to satisfy non-linear rate constraints such as predictive parity fairness, distribution divergences or churn ratios. We extend previous two-player game approaches for constrained optimization to an approach with three players to decouple the classifier rates from the non-linear objective, and seek to find an equilibrium of the game. Our approach generalizes many existing algorithms, and makes possible new algorithms with more flexibility and tighter handling of non-linear rate constraints. We provide convergence guarantees for convex functions of rates, and show how our methodology can be extended to handle sums-of-ratios of rates. Experiments on different fairness tasks confirm the efficacy of our approach. 1 Introduction In many real-world machine learning problems, the performance measures used to evaluate a classification model are non-linear functions of the classifier s prediction rates. Examples include the F-measure and G-mean used in class-imbalanced classification tasks [1 5], metrics such as predictive parity used to impose fairness goals [6], the win-loss-ratio used to measure classifier churn [7], KL-divergence based metrics used in quantification tasks [8, 9] and score-based metrics such as the PR-AUC [10]. Because these goals are non-linear and are non-continuous in the model parameters, it becomes very challenging to optimize with them, especially when they are used in constraints [11]. Prior work on optimizing generalized rate metrics has largely focused on unconstrained learning problems. These approaches fall under two broad categories: surrogate-based methods that replace the classifier rates with convex relaxations [12 17], and oracle-based methods that formulate multiple cost-sensitive learning tasks, and solve them using an oracle [18 22, 11]. Both these approaches have notable deficiencies. The first category of methods rely crucially on the surrogates being close approximations to the rate metric, and perform poorly when this is not the case (see e.g. experiments in [20]). The use of surrogates becomes particularly problematic with constrained training problems, as relaxing the constraints with convex upper bounds can result in solutions that are over-constrained or infeasible [23]. The second category of methods assume access to a near-optimal cost-sensitive oracle, which is usually unrealistic in practice. In this paper, we present a three-player approach for learning problems where both the objective and constraints can be defined by general functions of rates. The three players optimize over model parameters, Lagrange multipliers and slack variables to produce a game equilibrium. Our approach generalizes many existing algorithms (see Table 2), and makes possible new algorithms with more flexibility and tighter handling of non-linear rate constraints. Specifically, we give a new method 33rd Conference on Neural Information Processing Systems (Neur IPS 2019), Vancouver, Canada. (Algorithm 2) that can handle a wider range of performance metrics than previous surrogate methods (such as e.g. KL-divergence based metrics that only take inputs from a restricted range), and can be applied to constrained training problems without the risk of over-constraining the model because it needs to use surrogates less. To our knowledge, this is the first practical, surrogate-based approach that can handle constraints on generalized rate metrics. We show convergence of our algorithms for objectives and constraints that are convex functions of rates. This result builds on previous work by Cotter et al. [23, 24] for handling linear rate constraints, and additionally resolves an unanswered question in their work on the convergence of Lagrangian optimizers for non-zero-sum games. We also extend our framework to develop a heuristic (Algorithm 3) for optimizing performance measures that are a sum-of-ratios of rates (e.g. constraints on predictive parity and F-measure), and demonstrate their utility on real-world tasks. Related work: Many fairness goals can be expressed as linear constraints on a model s prediction rates [16, 25]. Recent work has focused on optimizing with linear rate constraints by computing an equilibrium of a game between players who optimize model parameters and Lagrange multipliers λ [26 28, 23, 24]. Of these, the closest to us is the work of Cotter et al. (2019) [23], who propose the idea of having only the -player optimize a surrogate objective, and having the λ-player use the original rates. We adapt and build on this idea to handle general functions of rates. Other game-based formulations include the approach of Wang et al. [29] for optimizing multivariate evaluation metrics. Their setup is very different from ours, with their players optimizing over (distributions on) all possible labelings of the data. Moreover, they do not handle constraints on the classifier. There has also been a concentrated effort on optimizing performance measures such as the F-measure that are fractional-linear functions of rates [18 20, 30, 31]. Many of these works exploit the pseudoconvex structure of the F-measure, but this property is absent for the problems that we consider where we need to handle sums or differences of ratios. Pan et al. [32] provide a heuristic approach for unconstrained sums-of-ratios, and recently Celis et al. [33] handle constraints that are sums-of-ratios, but do so by solving a large number of linearly constrained sub-problems, with the number of sub-problems growing exponentially with the number of rates. In contrast, we provide a practical algorithm that handles both objectives and constraints that are sums-of-ratios of rates. 2 Problem Setup Let X Rd and Y = { 1} be the instance and label spaces. Let f : X ! R be a prediction model, parameterized by 2 . Given a distribution D over X Y, we define the model s rate on D as: R( ; D) = E(X,Y ) D Y = sign(f (X)) For example, if DA denotes the distribution over a sub-population A X, then R( ; DA) gives us the accuracy of f on this sub-population. If D+ denotes a conditional distribution over the positively-labeled examples, then R( ; D+) is the true-positive rate TPR( ) of f and 1 R( ; D+) is the false-negative rate FNR( ) of f . Similarly, if D denotes a conditional distribution over the negatively-labeled examples, then TNR( ) = R( ; D ) is the true-negative rate of f , and FPR( ) = 1 R( ; D ) is the false-positive rate of f . Further, the true-positive and false-positive proportions are given by TP( ) = p TPR( ) and FP( ) = (1 p) FPR( ), where p = P(Y = 1); we can similarly define the true negative proportion TN and false negative proportion FN. We will consider several distributions D1, . . . , DK over X Y, and use the short-hand Rk( ) to denote the rate function R( ; Dk) 2 [0, 1]. We will denote a vector of K rates as: R( ) = [R1( ), . . . , RK( )]>. We assume we have access to unbiased estimates of the rates ˆRk( ) = i = sign(f (xk , computed from samples S = {(xk 1), . . . , (xk We will also consider stochastic models which are defined by distributions over the model parameters . Let denote the set of all such distributions over , where for any µ 2 , µ( ) is the probability mass on model . We define the rate Rk for a stochastic model µ to be the expected value for random draw of a model from µ, so Rk(µ) = E µ [Rk( )]. Generalized Rate Metrics as Objective. In many real-world applications the performance of a model is evaluated by a function of multiple rates: (R1( ), . . . , RK( )) for some : [0, 1]K ! R. For example, a common evaluation metric used in class-imbalanced classification tasks is the G-mean: GM( ) = 1 TPR( ) TNR( ), which is a convex function of rates [34, 35]. Similarly, a popular Table 1: Examples of generalized rate metrics. is either convex (C), pseudo-convex (PC) or a sum or difference of ratios (SR). p = P(Y = 1) and ˆp is the predicted proportion of positives. wins (losses) is the fraction of correct (wrong) predictions by the new model among examples where it disagrees with the old model. A and B refer to different protected groups or slices of the population. Measure Definition Type G-mean [34, 35] 1 p TPR TNR 1 pz1 z2 C H-mean [36] 1 2 1/TPR+1/TNR 1 2 1/z1+1/z2 C Q-mean [5] 1 FPR2 + FNR2 1 2 C KLD [8, 9] p log( p ˆp) + (1 p)log( 1 p 1 ˆp) p log( p z1 ) + (1 p)log( 1 p z2 ) C F-measure [37] 1 2TP 2TP + FP + FN 1 2z1 2z1+z2+z3 PC Predictive parity [6] TPA TPA+FNA TPB TPB+FNB z1 z1+z2 z3 z3+z4 SR F-measure parity 2TPA 2TPA+FPA+FNA 2TPB 2TPB+FPB+FNB 2z1 2z1+z2+z3 2z4 2z4+z5+z6 SR Churn [7] wins A losses A wins B losses B PR-AUC [10] 1 1 M TPm TPm+FPm , where TPm, FPm are evaluated at the largest threshold at which recall of f (X) + is m+0.5 evaluation metric used in text retrieval is the F-measure: F1( ) = 2TP( ) 2TP( ) + FP( ) + FN( ), which is a fractional-linear or pseudo-convex function of rates [38]. See Table 1 for more examples. One can consider directly optimizing these performance measures during training: 2 (R1( ), . . . , RK( )) . (P1) Generalized Rate Metrics as Constraints. There are also many applications, where one wishes to impose constraints defined by non-linear function of rates, for example to ensure groupspecific fairness metrics. Examples include: (i) Predictive parity fairness: Fair classification tasks where one wishes to match the precision of a model across different protected groups [6]: ''' TPA( ) TPA( ) + FPA( ) TPB( ) TPB( ) + FPB( ) ''' . One may also want to match e.g. the F-measure of a model across different groups. (ii) Distribution matching: Fairness or quantification tasks where one wishes to match the distribution of a model s outputs across different protected groups [9, 15]. One way to achieve this is to constrain the KL-divergence between the overall class proportion p and proportion of predicted positives for each group ˆp A [11], i.e. to enforce KLD(p, ˆp A) , where KLD is convex in ˆp A. (iii) Churn: Problems where one wishes to replace a legacy model with a more accurate model while limiting the changed predictions between the new and old models (possibly across different slices of the user base). This is ideally framed as constraints on the win-loss ratio s [24], which can be expressed as ratios of rates [7]. These and related problems can be framed generally as: 2 g( ) s.t. φj (R1( ), . . . , RK( )) 0, 8j 2 [J], (P2) for some objective function g : ! R, and J constraint functions φj : [0, 1]K ! R. We also consider a special case of (P2) with an objective and a constraint that is a sum-of-ratios of rates: h m, R( )i hβm, R( )i s.t. h m, R( )i hβm, R( )i γ, (P3) for coefficients m, βm 2 RK + , 8m 2 [2M] and slack γ 2 R. Our setup can also be used to optimize score-based metrics, such as the area under the precisionrecall curve (PR-AUC), that summarize the performance of a score model f : X ! R across multiple thresholds. We use the approach of Eban et al. [10] to (approximately) express PR-AUC as a Riemann summation of the precision of f at thresholds 1, . . . , M 2 R at which the recall of f is 0.5 M , . . . , M 0.5 M resp. This results in a formulation similar to (P3). See Appendix E for details. We next provide a three-player framework for solving (P1) (P3). We note that our formulations can be equivalently regarded as two-player games where one player is in charge of the parameters that need to be minimized, and the other player is in charge of the parameters that need to be maximized. We however find the three-player viewpoint to be a useful way to think about the problem algorithmically in that the three sets of optimization parameters can use different algorithms (see Table 2). Table 2: Algorithms for (P1) (P3) with 3 players. Frank-Wolfe [20], SPADE [14], NEMSIS [15] are previous algorithms. Alg. 1 3 are the proposed methods. Each player can do Best Response (BR), Online Gradient Descent (OGD) or Follow-The-Leader (FTL), and the game is zero-sum (ZS) or not. The first five algorithms find an approximate Nash or Coarse-Correlated (C.C.) equilibrium assuming and φj s are convex. Since (P3) is non-convex in the rates, Alg. 3 may not find an equilibrium. Player Objective Player Strategy Alg. P λ λ ZS Equil + L2( ; ) L2 - FTL BR 3 Nash SPADE P1 L1 L1 + L2 L2 BR OGD OGD 3 Nash NEMSIS P1 - + L2( ; ) L2 - FTL OGD 3 Nash Alg. 1 P1-2 L1 L1 + L2 L2 BR OGD BR 3 Nash Alg. 2 P1-2 L1 L1 + L2 L2 BR OGD OGD 7 C. C. Alg. 3 P3 L1 L1 + L2 L2 OGD OGD OGD 7 - 3 Generalized Rate Metric Objective We first present algorithms for the unconstrained problem in (P1). We assume is strictly convex, and is L-Lispchitz w.r.t. the 1-norm. For simplicity, we assume that is monotonically increasing in all arguments and that r (0) = 0, although our approach easily extends to more general metrics that are e.g. monotonically increasing in some arguments and monotonically decreasing in others. Game Formulation. We equivalently re-write (P1) to de-couple the rates Rk from the non-linear function by introducing auxiliary variables 1, . . . , K 2 [0, 1]: min 2 , 2[0,1]K ( 1, . . . , K) s.t. Rk( ) k, 8k 2 [K]. (1) A standard approach for solving (1) is to write the Lagrangian for the problem with Lagrange multipliers λ1, . . . , λK 2 R+ for the K constraints: L( , ; λ) = ( ) + λk (Rk( ) k) = ( ) | {z } L1( ;λ) | {z } L2( ;λ) Then one maximizes the Lagrangian over λ 2 RK + , and minimizes it over 2 and 2 [0, 1]K: L1( ; λ) + L2( ; λ), (2) Notice that L1 is convex in (by convexity of ), while L1 and L2 are linear in λ. We pose this max-min problem as a zero-sum game played with three players: a player who minimizes L1 + L2 over , a player who minimizes L1 + L2 over , and a player who maximizes L1 + L2 over λ. Each of the three players can now use different optimization algorithms customized for their problem. If additionally the Lagrangian was convex in , one could solve for an equilibrium of this game and obtain a solution for the primal problem (1). However, since L2 is a weighted sum of rates Rk( ), it need not be convex (or even continuous) in . To overcome this difficulty, we expand the solution space from deterministic models to stochastic models [26, 23, 24, 39], and re-formulate (2) as a problem that is linear in µ, by replacing each Rk( ) with E µ[Rk( )] in L2: λ2 min µ2 2[0,1]K L1( ; λ) + L2(µ; λ). (3) Here for technical reasons, we restrict the Lagrange multipliers to a bounded set = {λ 2 R+ | kλk1 }; we will choose the radius > 0 later in our theoretical analysis. By solving for an equilibrium of this expanded max-min problem, we can find a stochastic model µ 2 that minimizes (R1(µ), . . . , RK(µ)). There are two approaches that we can take to find an equilibrium of the expanded game. The first approach is to assume access to an oracle that can perform the minimization of L2 over µ for a fixed λ and . Since this is a linear optimization over the simplex, this amounts to performing a minimization over deterministic models in . The second and more realistic approach is to work with surrogates for the rates that are continuous and differentiable in . Let R1, . . . , RK : ! R be differentiable convex surrogate functions that are upper bounds on the rates: Rk( ) Rk( ), 8 2 . We assume access to unbiased stochastic sub-gradients for the surrogates r Rk( ) with E[r Rk( )] 2 @ Rk( ). We then define a surrogate-based approximation for L2: L2( ; λ) = PK k=1 λk Rk( ). (4) All we need to do now is to choose the objective that each player seeks to optimize (true or surrogate [23]) and the strategy that they use to optimize their objective, so that the players converge to an equilibrium of the game. Each of these choices lead to a different algorithm for (approximately) solving (P1). Table 2 summarizes different choices of strategies and objectives for the players, and the type of equilibrium and the algorithm that results from these choices. As we shall see shortly (and also elaborate in Appendix B.1), many existing algorithms can be seen as instances of this template. Oracle-based Lagrangian Optimizer. As a warm-up illustration of our approach, we first describe an idealized algorithm assuming access to an oracle that can approximately optimize L2 over (this oracle essentially optimizes a weighted-sum of rates, i.e. a cost-sensitive error objective [40]): Definition 1. A -approximate cost-sensitive optimization (CSO) oracle takes λ as input and outputs a model 2 such that L2( ; λ) min 2 L2( ; λ) + . We have all three players optimize the true Lagrangians, with the -player and -player playing best responses to the opponents strategies, i.e. they perform full optimization over their parameter space, and the λ-player running online gradient descent (OGD), an algorithm with no-regret guarantees [41]. The -player performs best response by using the above oracle to approximately minimize L2 over . For the -player, the best response optimization can be computed in closed-form: Lemma 1. Let : RK + ! R denote the Fenchel conjugate of . Then for any λ s.t. kλk1 L: r (λ) 2 argmin 2[0,1]K L1( ; λ). .The resulting algorithm outlined in Algorithm 1 is guaranteed to find an approximate Nash equilibrium of the max-min game in (3), and yields an approximate solution to (P1): Theorem 1. Let 1, . . . , T be the iterates generated by Algorithm 1 for (P1), and let µ be a stochastic model with a probability mass of 1 T on t. Define Bλ maxt krλL( t, t; λt)k2. Then setting = L and = L Bλ 2T , we have w.p. 1 δ over draws of stochastic gradients: (R( µ)) min µ2 (R(µ)) + O Remark 1 (Frank-Wolfe: a special case). A previous oracle-based approach for optimizing convex functions of rates is the Frank-Wolfe based method of Narasimhan et al. [14]. This method can be recovered from our framework by reformulating the λ-player s objective to include the minimization over : LFW(λ, ) = min L1( ; λ)+L2( ; λ) and having the λ-player play the Follow-The-Leader (FTL) algorithm [42] to maximize this objective over λ. As before, the -player plays best response on L2 using the CSO oracle. See Appendix B.1 for details, where we use recent connections between the classical Frank-Wolfe technique and equilibrium computation [43]. Surrogate-based Lagrangian Optimizer. While the CSO oracle may be available in some special cases (e.g. when is finite or when the underlying conditional-class probabilities can be estimated accurately), in many practical scenarios, it is not realistic to assume access to an oracle that can optimize non-continuous rates. We now provide a more practical algorithm for solving (P1) where the -player optimizes the surrogate Lagrangian function L2 in (4) instead of L2 using stochastic gradients r L2( ; λ). The - and λ-players, however, continue to operate on the true Lagrangian functions L1 and L2, which are continuous in the parameters and λ that these players optimize. In our proposed approach, outlined in Algorithm 2, both the -player and λ-player now run online gradient descent algorithms, while the -player plays its best response at each iteration. Since it is Algorithm 1 Oracle-based Optimizer Initialize: λ0 for t = 0 to T 1 do if (P1) then t = r (λt) else if (P2) then t 2 argmin 2[0,1]K L1( ; λt) end if t 2 argmin 2 L2( ; λt) [CSO] λt+1 = λt + rλ L( t, t; λt) end for return 1, . . . , T Algorithm 2 Surrogate-based Optimizer Initialize: 0, λ0 for t = 0 to T 1 do if (P1) then t = r (λt) else if (P2) then t 2 argmin 2[0,1]K L1( ; λt) end if t+1 ( t r L2( t; λt)) λt+1 (λt + λ rλL( t, t; λt)) end for return 1, . . . , T Figure 1: Optimizers for the unconstrained problem (P1) and constrained problem (P2). Here denotes the 1-projection onto and denotes the 2-projection onto . We denote a (stochastic) gradient of L by rλL( t, t; λt) = PK k=1( ˆRk( t) t k), where ˆRk( t) is an unbiased estimate of Rk( t). We denote a (stochastic) sub-gradient of L2 by r L2( t; λt) = PK k=1 λk r Rk( t). the -player alone who optimizes a surrogate, the resulting game between the three players is no longer zero-sum. Yet, we are able to show that the player strategies converge to an approximate coarse-correlated (C. C.) equilibrium of the game, and yields an approximate solution to (P1). Theorem 2. Let 1, . . . , T be the iterates of Algorithm 2 for (P1), and let µ be a stochastic model with probability 1 T on t. Let be a convex set and = 2 | R( ) 2 [0, 1]K . Let B max 2 k k2, B maxt kr L2( t; λt)k2 and Bλ maxt krλL( t, t; λt)k2. Setting = L, = B B T and λ = L Bλ 2T , we have w.p. 1 δ over draws of stochastic gradients: Note the right-hand side contains the optimal value for the surrogate objective ( R( )) and not for the original performance metric. This is unsurprising given the -player s inability to work with the true rates. Also, while Algorithm 2 can be applied to optimize over a general (bounded) convex model class , the comparator for our guarantee is a subset of models for which ( R( )) is defined. This is needed as the surrogate R( ) may output values outside the domain of . Remark 2 (SPADE, NEMSIS: special cases of our approach). Our approach includes two previous surrogate-based algorithms as special cases: SPADE [20] and NEMSIS [15]. SPADE can be recovered from our framework by having the same player strategies as Algorithm 2 but with both the - and λ-players optimizing surrogate objectives, i.e. with the -player minimizing L2 and the λ-player maximizing L1 + L2. NEMSIS also uses surrogates for both the and λ updates. It can be recovered by having the -player run OGD on L2, and having the λ-player play FTL over λ on the combined objective min L1( ; λ) + L2( ; λ). See Appendix B.1 for details. Remark 3 (Application to wider range of metrics). Because of their strong reliance on surrogates, SPADE and NEMSIS cannot be applied directly to functions that take inputs from a restricted range (e.g. KL-divergence), unless the surrogates are also bounded in the same range. In Appendix D.1, we point out scenarios where the NEMSIS method [15] fails to optimize the KL-divergence metric, unless the model is sufficiently regularized to not output large negative values. Algorithm 2 has no such restriction and can be applied even if the outputs of the surrogates are not within the domain of . This is because it uses the original rates for updates on λ. As a result, the game play between and λ never produces values that are outside the domain of . 4 Generalized Rate Metric Constraints We next describe how to apply our approach to the constrained optimization problem in (P2) and to the special case in (P3). We start with (P2) assuming that the constraints φj s are jointly convex, monotonic in each argument and L-Lipschitz w.r.t. the 1-norm, and g is a bounded convex function. For convenience, we assume that the φj s are monotonically increasing in all arguments. Constraints on the KL-divergence and G-mean metrics are examples of this setting. We introduce a set auxiliary variables 1, . . . , K for the K rate functions and re-write (P2) as: min 2 , 2[0,1]K g( ) s.t. φj ( 1, . . . , K) 0, 8j 2 [J], Rk( ) k, 8k 2 [K]. The Lagrangian for the re-written problem is given below, where λ1, . . . , λJ 2 R+ and λJ+1, . . . , λJ+K 2 R+ are the Lagrange multipliers for the two sets of constraints: L( , ; λ) = | {z } L1( ;λ) | {z } L2( ;λ) As before, we expand the search space to include stochastic models in , restrict the Lagrange multipliers to a bounded set = , and formulate a max-min problem: λ2 min µ2 , 2[0,1]K L1( ; λ) + L2(µ; λ). (6) One can now apply Algorithm 1 and 2 to this problem. For Algorithm 1, the -player uses the CSO oracle to optimize the true Lagrangian L2, and for Algorithm 2, the -player uses OGD to optimize a surrogate Lagrangian L2(µ; λ) = PK k=1 λJ+k Rk(µ). In both cases, the -player plays best response by minimizing L1 over using an analytical solution (where available) or using a convex optimization solver. Theorem 3. Let 1, . . . , T be the iterates generated by Algorithm 1 for (P2), and let µ be a stochastic model with a probability mass of 1 T on t. Suppose there exists a µ0 2 such that φj(R(µ0)) γ, 8j 2 [J], for some γ > 0. Let Bg = max 2 g( ). Let µ 2 be such that µ is feasible, i.e. φj(R(µ )) 0, 8j 2 [J], and E µ [g( )] E µ [g( )] for every µ 2 that is feasible. Let Bλ maxt krλL( t, t; λt)k2. Then setting = 2(L+1)Bg 2T , we have w.p. 1 δ over draws of stochastic gradients: E µ [g( )] E µ [g( )] + O and φj(R( µ)) O We have thus shown that Algorithm 1 outputs a stochastic model that has an objective close to the optimal feasible solution for (P2), while also closely satisfying the constraints. Theorem 4. Let 1, . . . , T be the iterates of Algorithm 1 for (P2), and let µ be a stochastic model with prob. 1 T on t. Let be convex and = 2 | 8j, R( ) 2 [0, 1]K . Let 2 be such that it satisfies the surrogate-relaxed constraints φj( R( )) 0, 8j 2 [J], and g( ) g( ) for every 2 that satisfies the same constraints. Let B max 2 k k2, B maxt kr L2( t; λt)k2 and Bλ maxt krλL( t, t; λt)k2. Then setting = (L + 1)T ! for ! 2 (0, 0.5), = B B 2T and λ = Bλ 2T , we have w.p. 1 δ over draws of stochastic gradients: E µ [g( )] g( ) + O log(1/δ) T 1/2 ! and φj(R( µ)) O The proof is an adaptation of the analysis in Agarwal et al. (2018) [26] to non-zero-sum games. We point out that despite the -player optimizing surrogate functions, the final stochastic model is nearfeasible for the original rate metrics. We also note that this result holds even if the surrogates output values outside the domain of the constraint functions φj s (e.g. with KL-divergence constraints). While the above convergence rate is not as good as the standard O(1/ T) rate achievable for OGD, this is similar to the guarantees shown by e.g. Agarwal et al. [26] for linear rate-constrained optimization problems.1 The reason for the poorer convergence rate is that we are unable to fix the radius of the space of Lagrange multipliers to a constant, and instead set it to a function of T. 1See Theorem 3 in their paper, where, for T = O(n4 ) iterations, the error bound is O(n ) = O(T 1/4). Algorithm 3 Surrogate-based Optimizer for (P3) Initialize: a0, b0, 0, λ0 for t = 0 to T 1 do at+1 = C (at ara Lsr( t, at, bt; λt)); bt+1 = C (bt brb Lsr( t, at, bt; λt)), where C = {a, b 2 R | a a b b} t+1 = ( t r Lsr( t; at, bt λt)), where Lsr is (7) with Rk replaced with Rk λt+1 = (λt + λ rλLsr( t, at, bt; λt)) end for return 1, . . . , T Remark 4 (Unanswered Question in Cotter et al.). Cotter et al. consider optimization problems with linear rate constraints, formulate a non-zero-game like us, and consider two algorithms, one where both the - and λ-player seek to minimize external regret (through OGD updates), and the other where the -player optimizes alone minimizes external regret, while the λ-player minimizes swap regret. They are however able to show convergence guarantees only for a more complicated swap regret algorithm, and leave the analysis of the external regret algorithm unanswered. Theorem 4 provides convergence guarantees for a generalization of their external regret algorithm. In their paper, Cotter et al. do obtain a better O(1/ T) convergence rate for their swap regret algorithm. It is easy to show a similar convergence rate for an adaptation of this algorithm to our setting (see Appendix C), but we stick to our present algorithm because of its simplicity. Case of Sum-of-ratios Metrics. Moving beyond convex functions of rates, we present a heuristic algorithm for optimizing with objectives and constraints in (P3) that are sums-of-ratios of rates. We assume that the numerators and denominators in each ratio term is bounded, i.e. a h m, R( )i hβm, R( )i b, and a h 0 m, R( )i hβ0 m, R( )i b, 8 2 for some a, b > 0. Introducing slack variables a1, . . . , a2M, b1, . . . , b2M for the numerators and denominators respectively to decouple the rates from the ratio terms, we equivalently re-write (P3): 2 a am bm b γ, am h m, R( )i, bm hβm, R( )i, 8m. We then formulate the Lagrangian for the above problem with multipliers λ 2 R4M+1 Lsr( , a, b; λ) = PM m=1 am/bm + λ0 m=M+1 am/bm γ m=1 λm(h m, R( )i am) + P2M m=1 λ2M+m(bm hβm, R( )i). Because Lsr is non-convex in the slack variables a, b, strong duality may not hold, and an optimal solution for the dual problem may not be optimal for (P3). Yet, by performing OGD updates for the λ, and the slack variables, with the -player alone optimizing the surrogate rates Rk, we obtain a heuristic to solve (P3). The details are given in Algorithm 3. 5 Experiments We conduct two types of experiments. In the first, we evaluate Algorithm 2 on the task of optimizing a convex rate objective subject to linear rate constraints and show it often performs as well as an oracle-based approach for this problem [11] (and does so without having to make an idealized oracle assumption). In the second, we evaluate Algorithm 3 on the task of optimizing a sum-of-ratios objective subject to sum-of-ratios constraints, and compare it against existing baselines. Datasets. We use five datasets: (1) COMPAS, where the goal is to predict recidivism with gender as the protected attribute [44]; (2) Communities & Crime, where the goal is to predict if a community in the US has a crime rate above the 70th percentile [45], and we consider communities having a black population above the 50th percentile as protected [27]; (3) Law School, where the task is to predict whether a law school student will pass the bar exam, with race (black or other) as the protected attribute [46]; (4) Adult, where the task is to predict if a person s income exceeds 50K/year, with gender as the protected attribute [45]; (5) Wiki Toxicity, where the goal is to predict if a comment Table 3: Optimizing KL-divergence fairness metric s.t. error rate constraints. For each method, we report two metrics: A (B), where A is the test fairness metric (lower is better) and B is the ratio of the test error rate of the method and that of Unc Error (lower is better). During training, we constrain B to be 1.1. Among the last 3 columns, the lowest fairness metric is highlighted in bold. Unc Error Post Shift COCO Stochastic Determ. COMPAS 0.115 (1.00) 0.000 (1.01) 0.043 (1.01) 0.000 (1.03) 0.000 (1.03) Crime 0.224 (1.00) 0.005 (1.40) 0.252 (0.83) 0.120 (1.11) 0.146 (1.08) Law 0.199 (1.00) 0.001 (1.45) 0.043 (1.05) 0.054 (1.12) 0.056 (1.08) Adult 0.114 (1.00) 0.000 (1.22) 0.011 (1.10) 0.014 (1.10) 0.014 (1.10) Wiki 0.175 (1.00) 0.001 (1.21) 0.134 (1.17) 0.133 (1.09) 0.127 (1.18) Table 4: Optimizing F-measure s.t. F-measure constraints. For each method, we report two metrics: A (B), where A is the overall test F-measure (higher is better) and B is the test constraint violation: Fmeasureprt Fmeasureother 0.02 (lower is better). The lowest constraint violation is in italic. Unc Error Unc F1 Stochastic Determ. COMPAS 0.656 (0.13) 0.666 (0.09) 0.627 (0.07) 0.628 (0.07) Crime 0.742 (0.19) 0.752 (0.19) 0.711 (0.11) 0.711 (0.11) Law 0.973 (0.10) 0.975 (0.08) 0.927 (0.04) 0.842 (-0.05) Adult 0.675 (0.03) 0.688 (0.06) 0.660 (0.04) 0.647 (0.03) Wiki 0.968 (0.18) 0.967 (0.18) 0.826 (0.00) 0.782 (-0.05) posted on a Wikipedia talk page contains non-toxic/acceptable content, with the comments containing the term gay considered as a protected group [47]. We use linear models, and hinge losses as surrogates Rk. All implementations are in Tensorflow.2 See Appendix G for additional details. KL-divergence Based Fairness Objective. We consider a demographic-parity style fairness objective that seeks to match the proportion of positives predicted in each group ˆp G with the true proportion of positives p in the data, measured using a KL-divergence metric: P G2{0,1} KLD(p, ˆp G). Note that this is convex in ˆp G. We additionally enforce a constraint that the error rate of the model is no more than 10% higher than an unconstrained model that optimizes error rate, i.e. ˆ err(f ) 1.1 ˆ err(func). The only previous method that we are aware of that can handle constrained problems of this form is the COCO method of Narasimhan (2018) [11]. This is an oracle-based approach and uses a plug-in method to implement the cost-sensitive oracle. We compare Algorithm 2 with COCO, and the post-shift method of Hardt et al. (2016) [48], where we take a pre-trained logistic regression model and assign different thresholds to the groups to correct for fairness disparity [48]. For our algorithm, we report the performance of both the trained stochastic classifier and the best deterministic classifier chosen through a best iterate heuristic of Cotter et al. (2019) [23]. The results are shown in Table 3. Post Shift performs the best on the fairness metric, but on all datasets except COMPAS, fairs poorly on the constraint. The proposed method and COCO achieve different trade-offs between optimizing the objective and satisfying the error rate constraint. The stochastic classifier trained by our method closely satisfies the constraint on almost all datasets, and yields a lower objective than COCO on three datasets. Moreover, the best deterministic classifier often has a similar performance to the stochastic classifier. We present further analysis and additional comparisons in Appendix G.1. F-measure Based Parity Constraints. We consider the fairness goal of training a classifier that yields at least as high a F-measure for the protected group as it does for the rest of the population, and impose this as a constraint. Specifically, we seek to maximize the overall F-measure subject to the constraint: Fmeasureprt Fmeasureother 0.02. We apply Algorithm 3 and compare it with an unconstrained classifier that optimizes error rate (Unc Error) and a plug-in classifier that optimizes the F-measure without constraints (Unc F1). As seen in Table 4, while Unc F1 yields a better objective than Unc Error, both the baselines fail to satisfy the constraint. On four of the datasets, the stochastic classifier trained by our approach yields moderate to significant reduction in constraint violation. On Adult, the trained stochastic classifier yields a very small constraint violation on the training set (0.004), but does not perform as well on the test set. The deterministic classifiers have an equal or lower constraint violation compared to the stochastic classifiers, but at the cost of a lower objective. 2https://github.com/google-research/google-research/tree/master/generalized_rates Acknowledgments We thank Qijia Jiang and Heinrich Jiang for helpful feedback on draft versions of this paper. [1] D.D. Lewis. Evaluating text categorization. In HLT Workshop on Speech and Natural Language, [2] J-D. Kim, Y. Wang, and Y. Yasunori. The Genia event extraction shared task, 2013 edition- overview. ACL, 2013. [3] Y. Sun, M.S. Kamel, and Y. Wang. Boosting for learning multiple classes with imbalanced class distribution. In ICDM, 2006. [4] S. Wang and X. Yao. Multiclass imbalance problems: Analysis and potential solutions. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 42(4):1119 1130, 2012. [5] S. Lawrence, I. Burns, A. Back, A-C. Tsoi, and C.L. Giles. Neural network classification and prior class probabilities. In Neural Networks: Tricks of the Trade, LNCS, pages 1524:299 313. Springer, 1998. [6] A. Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big data, 5(2):153 163, 2017. [7] M.M. Fard, Q. Cormier, K. Canini, and M. Gupta. Launch and iterate: Reducing prediction churn. In NIPS, 2016. [8] A. Esuli and F. Sebastiani. Optimizing text quantifiers for multivariate loss functions. ACM Transactions on Knowledge Discovery and Data, 9(4):Article 27, 2015. [9] W. Gao and F. Sebastiani. Tweet sentiment: From classification to quantification. In ASONAM, [10] E. Eban, M. Schain, A. Mackey, A. Gordon, R. Rifkin, and G. Elidan. Scalable learning of non-decomposable objectives. In AISTATS, 2017. [11] H. Narasimhan. Learning with complex loss functions and constraints. In AISTATS, 2018. [12] T. Joachims. A support vector method for multivariate performance measures. In ICML, 2005. [13] Purushottam Kar, Harikrishna Narasimhan, and Prateek Jain. Online and stochastic gradient methods for non-decomposable loss functions. In NIPS, 2014. [14] H. Narasimhan, P. Kar, and P. Jain. Optimizing non-decomposable performance measures: A tale of two classes. In ICML, 2015. [15] P. Kar, S. Li, H. Narasimhan, S. Chawla, and F. Sebastiani. Online optimization methods for the quantification problem. In KDD, 2016. [16] G. Goh, A. Cotter, M. Gupta, and M.P. Friedlander. Satisfying real-world goals with dataset constraints. In NIPS, 2016. [17] A. Sanyal, P. Kumar, P. Kar, S. Chawla, and F. Sebastiani. Optimizing non-decomposable measures with deep networks. Machine Learning, 107(8-10):1597 1620, 2018. [18] S.A.P. Parambath, N. Usunier, and Y. Grandvalet. Optimizing F-measures by cost-sensitive classification. In NIPS, 2014. [19] O. Koyejo, N. Natarajan, P. Ravikumar, and I.S. Dhillon. Consistent binary classification with generalized performance metrics. In NIPS, 2014. [20] H. Narasimhan, H.G. Ramaswamy, A. Saha, and S. Agarwal. Consistent multiclass algorithms for complex performance measures. In ICML, 2015. [21] B. Yan, O. Koyejo, K. Zhong, and P. Ravikumar. Binary classification with karmic, threshold- quasi-concave metrics. In ICML, 2018. [22] D. Alabi, N. Immorlica, and A. Kalai. Unleashing linear optimizers for group-fair learning and optimization. In COLT, 2018. [23] A. Cotter, H. Jiang, and K. Sridharan. Two-player games for efficient non-convex constrained optimization. In ALT, 2019. [24] A. Cotter, H. Jiang, S. Wang, T. Narayan, M. Gupta, S. You, and K. Sridharan. Optimization with non-differentiable constraints with applications to fairness, recall, churn, and other goals. JMLR (to appear), ar Xiv preprint ar Xiv:1809.04198, 2019. [25] M. B. Zafar, I. Valera, M. Gomez-Rodriguez, and K. P. Gummadi. Fairness constraints: Mechanisms for fair classification. In AISTATS, 2017. [26] A. Agarwal, A. Beygelzimer, M. Dudik, J. Langford, and H. Wallach. A reductions approach to fair classification. In ICML, 2018. [27] M. Kearns, S. Neel, A. Roth, and Z.S. Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In ICML, 2018. [28] M. Donini, L. Oneto, S. Ben-David, J.S. Shawe-Taylor, and M. Pontil. Empirical risk minimiza- tion under fairness constraints. In Neur IPS, 2018. [29] H. Wang, W. Xing, K. Asif, and B. Ziebart. Adversarial prediction games for multivariate losses. In NIPS, 2015. [30] R. Busa-Fekete, B. Szörényi, K. Dembczynski, and E. Hüllermeier. Online F-measure optimiza- tion. In NIPS, 2015. [31] M. Liu, X. Zhang, X. Zhou, and T. Yang. Faster online learning of optimal threshold for consistent F-measure optimization. In Neur IPS, 2018. [32] W. Pan, H. Narasimhan, P. Kar, P. Protopapas, and H. G. Ramaswamy. Optimizing the multiclass F-measure via biconcave programming. In ICDM, 2016. [33] L. E. Celis, L. Huang, V. Keswani, and N. K. Vishnoi. Classification with fairness constraints: A meta-algorithm with provable guarantees. In FAT, 2019. [34] M. Kubat and S. Matwin. Addressing the curse of imbalanced training sets: One-sided selection. In ICML, 1997. [35] S. Daskalaki, I. Kopanas, and N. Avouris. Evaluation of classifiers for an uneven class distribu- tion problem. Applied Artificial Intelligence, 20:381 417, 2006. [36] K. Kennedy, B.M. Namee, and S.J. Delany. Learning without default: A study of one-class classification and the low-default portfolio problem. In ICAICS, 2009. [37] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. [38] D. D. Lewis and W. A. Gale. A sequential algorithm for training text classifiers. In SIGIR, 1994. [39] M. Gupta A. Cotter, H. Narasimhan. On making stochastic classifiers deterministic. In Neur IPS, [40] C. Elkan. The foundations of cost-sensitive learning. In IJCAI, 2001. [41] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003. [42] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends R in Machine Learning, 4(2):107 194, 2012. [43] J. D. Abernethy and J.-K. Wang. On Frank-Wolfe and equilibrium computation. In NIPS, 2017. [44] J. Angwin, J. Larson, S. Mattu, and L. Kirchner. Machine bias. Pro Publica, May, 23, 2016. [45] A. Frank and A. Asuncion. UCI machine learning repository. URL: http://archive.ics. uci.edu/ml, 2010. [46] L. Wightman. Lsac national longitudinal bar passage study. Law School Admission Council, [47] L. Dixon, J. Li, J. Sorensen, N. Thain, and L. Vasserman. Measuring and mitigating unintended bias in text classification. In AIES, 2018. [48] M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In NIPS, [49] J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In EMNLP, 2014. [50] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050 2057, 2004. [51] X. Zhou. On the Fenchel duality between strong convexity and Lipschitz continuous gradient. ar Xiv preprint ar Xiv:1803.06573, 2018.