# Class-Weighted Classification: Trade-offs and Robust Approaches

Ziyu Xu¹  Chen Dan²  Justin Khim¹  Pradeep Ravikumar¹

We address imbalanced classification, the problem in which a label may have low marginal probability relative to other labels, by weighting losses according to the correct class. First, we examine the convergence rates of the expected excess weighted risk of plug-in classifiers where the weighting for the plug-in classifier and the risk may be different. This leads to irreducible errors that do not converge to the weighted Bayes risk, which motivates our consideration of robust risks. We define a robust risk that minimizes risk over a set of weightings, show excess risk bounds for this problem, and demonstrate that particular choices of the weighting set lead to a special instance of conditional value at risk (CVaR) from stochastic programming, which we call label conditional value at risk (LCVaR). Additionally, we generalize this weighting to derive a new robust risk problem that we call label heterogeneous conditional value at risk (LHCVaR). Finally, we empirically demonstrate the efficacy of LCVaR and LHCVaR in improving class-conditional risks.

1. Introduction

Classification is a fundamental problem in statistics and machine learning, including scientific problems such as cancer diagnosis and satellite image processing as well as engineering applications such as credit card fraud detection, handwritten digit recognition, and text processing (Khan et al., 2001; Lee et al., 2004), but modern applications have brought new challenges. In online retailing, websites such as Amazon have hundreds of thousands or millions of products to taxonomize (Lin et al., 2018).
In text data, the distribution of words in documents has been observed to follow a power law, in that there are many labels with few instances (Zipf, 1936; Feldman, 2019). Similarly, image data also has a long tail of many classes with few examples (Salakhutdinov et al., 2011; Zhu et al., 2014). In such settings, the classes with smaller probabilities are generally classified incorrectly more often, and this is undesirable when the smaller classes are important, such as rare forms of cancer, fraudulent credit card transactions, and expensive online purchases. Thus, we need modern classification methods that work well when there are a large number of classes and when the class-wise probabilities are imbalanced.

When faced with such class imbalance, a popular approach in practice is to choose a metric other than zero-one accuracy, such as precision, recall, or the Fβ-measure (Van Rijsbergen, 1974, 1979), which explicitly take class-conditional risks into account, and to train classifiers to optimize this metric. A difficulty with this approach, however, is that the right metric for imbalanced classification is often not clear. A related class of approaches keeps the zero-one accuracy metric but modifies the samples instead. The popular algorithm SMOTE (Chawla et al., 2002) performs a type of data augmentation for a minority class, i.e., a class with lower probability, and sub-samples the large classes. This has led to variants with different forms of data augmentation (Zhou & Liu, 2006; Mariani et al., 2018), but from a theoretical perspective, these methods remain poorly understood.

¹Machine Learning Department, Carnegie Mellon University, Pennsylvania, United States. ²Computer Science Department, Carnegie Mellon University, Pennsylvania, United States. Correspondence to: Ziyu Xu. Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).
A much simpler approach, which is also related to the approaches above, is class-weighting, in which different costs are incurred for mis-classifying samples of different labels. Practically, this is a natural approach because it is often possible to assign different costs to different classes. For example, the average fraudulent credit card transaction may cost hundreds of dollars, or in online retailing, failing to show a customer the correct item causes the company to lose out on the profit of selling that item. Thus, a good classifier should be fairly sensitive to possibly fraudulent transactions, and online retailers should prioritize displaying high-profit products. As a result, class-weighting has been studied in a variety of settings, including modifying black-box classifiers, SVMs, and neural networks (Domingos, 1999; Lin et al., 2002; Scott, 2012; Zhou & Liu, 2006). Additionally, class-weighting has been observed to be useful for estimating class probabilities, since class-weighting amounts to adjusting decision thresholds (Wang et al., 2008; Wu et al., 2010; Wang et al., 2019).

A crucial caveat with cost-weighting, however, is that the right choice of costs is often not clear, and with any one choice of costs, the performance of the corresponding classifier might suffer under some other, perhaps more suitable, choice of costs. In this paper, we use cost-weighting for imbalanced classification in three ways. We start by examining a weighted sum of class-conditional risks, i.e., the risks conditional on the class Y taking some specific value i. This allows us to upweight a minority class to achieve better performance on the minority examples. We then provide an illuminating analysis of the fundamental trade-offs that occur with any single choice of costs.
Since we may not understand precisely which weighting $q$ to pick, we examine a robust risk that is a supremum of the weighted risks over an uncertainty set $Q$ of possible weights. This objective can be interpreted as a class-wise distributionally robust optimization problem where we ask for robustness over the marginal distribution of $Y$. This leads to a minimax problem, for which we provide generalization guarantees. We also note that a standard gradient descent-ascent algorithm may solve the optimization problem when the risk is convex in the classifier parameters. Finally, we show that for a natural class of uncertainty sets, the robust risk reduces to what we call label conditional value at risk (LCVaR). We highlight a connection to conditional value at risk (CVaR), which is a well-studied quantity in portfolio optimization and stochastic programming, parametrized by an $\alpha$ in $(0, 1)$ (Rockafellar et al., 2000; Shapiro et al., 2009). Further, we propose a generalization that we call label heterogeneous conditional value at risk (LHCVaR) that allows for different parameters $\alpha_i$ for each class $i$. To the best of our knowledge, this has not been examined previously, and it could possibly be used more broadly. To give an example in portfolio optimization, we may wish to treat risks arising from different types of assets, e.g., large-cap stocks versus small-cap stocks or domestic debt versus international debt, differently. Next, we show that the dual form for LHCVaR is similar to that for LCVaR as long as the heterogeneity is finite-dimensional, and this leads to an unconstrained optimization problem. Finally, we examine the efficacy of LCVaR and LHCVaR on real and synthetic data.

The rest of the paper is outlined as follows. In Section 2, we discuss our problem setup. In Section 3, we examine weighting in plug-in classification. In particular, we elucidate the fundamental trade-off in weighted classification and its methodological implications.
In Section 4, we examine a robust version of the weighted risk problem, including generalization guarantees and connections to stochastic programming. In Section 5, we provide numerical results, and we conclude with a discussion in Section 6. Additional proofs and results in related settings are deferred to the appendices.

1.1. Further Related Work

We briefly review other research related to imbalanced classification; for a far more exhaustive treatment, see the surveys of the area (He & Garcia, 2009; Fernández et al., 2018). First, two other methods may be employed to solve imbalanced classification problems. The first is class-based margin adjustment (Lin et al., 2002; Scott, 2012; Cao et al., 2019), in which the margin parameter for the margin loss function may vary by class. Broadly, margin adjustment and weighting may both be considered loss modification procedures. The second method is Neyman-Pearson classification, in which one attempts to minimize the error on one class given a constraint on the worst permissible error on the other class (Rigollet & Tong, 2011; Tong, 2013; Tong et al., 2016).

An important topic that is related to our paper but has not been well connected to imbalanced classification is robust optimization. Robust optimization is a well-studied topic (Ben-Tal & Nemirovski, 1999, 2003; Ben-Tal et al., 2004, 2009). A variant that has gained traction more recently is distributionally robust optimization (Ben-Tal et al., 2013; Bertsimas et al., 2014; Namkoong & Duchi, 2017). Unsurprisingly, CVaR, as a coherent risk measure, has previously been connected to distributionally robust optimization (Goh & Sim, 2010). Distributionally robust optimization generally, and CVaR specifically, have also previously been used in machine learning to deal with imbalance (Duchi et al., 2018; Duchi & Namkoong, 2018), but in these works, the imbalance was considered to exist in the covariates, whether known to the algorithm or not.
These works are motivated by the recent push toward fairness in machine learning, in particular so that ethnic minorities do not suffer discrimination in high-stakes situations such as loan applications, medical diagnoses, or parole decisions, due to biases in the data.

2. Preliminaries

2.1. Classification with Imbalanced Classes

In this section, we briefly go over the problem setup. First, we draw samples from the space $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. For our purposes, we are interested in $\mathcal{Y} = \{0, 1\}$ or $\mathcal{Y} = \{1, \dots, k\}$. Note that there are two slightly different mechanisms for the data-generating process that are considered in imbalanced classification and Neyman-Pearson classification. In the first, we are given $n$ i.i.d. samples $(X_1, Y_1), \dots, (X_n, Y_n)$ from a distribution $P_{X,Y}$. Here, we let $p_i = P(Y = i)$ be the probability of class $i$. Additionally, we sometimes refer to the vector of class probabilities as $p$. This is our framework of interest, since it corresponds to standard assumptions in nonparametric statistics and learning theory. In the alternative framework, we are given $n_i$ samples $(X_1, i), \dots, (X_{n_i}, i)$ from each marginal distribution $P_{X \mid Y = i}$. The probability of class $i$ in this case is then known: $p_i = \hat{p}_i = n_i/n$. For the most part, these two mechanisms yield similar results, but the analyses differ slightly. To streamline the presentation, we only consider the first case in the main paper, although we give a result for the alternative framework in the appendix that illustrates the difference.

2.2. Class-Conditioned Risk

We are interested in finding a good classifier $f : \mathcal{X} \to \mathcal{Y}$ in some function space $\mathcal{F}$, such as linear classifiers or neural networks. In this section, we establish our risk measures of interest. In general, we want to minimize the expectation of some loss function $\ell : \mathcal{F} \times \mathcal{Z} \to [0, 1]$, which we call the risk and denote $R(f) = \mathbb{E}[\ell(f, Z)]$.
Analogously, we define the class-conditioned risk for class $i$ to be $R_{\ell,i}(f) = \mathbb{E}[\ell(f, Z) \mid Y = i]$. At this point, we make some observations for plug-in classification and empirical risk minimization. In the plug-in classification results, we consider the zero-one loss $\ell_{01}(f, z) = \mathbf{1}\{f(x) \neq y\}$, and for our results on empirical risk minimization, we are primarily interested in convex surrogate losses. For simplicity, when $\ell$ is clear from context, or a statement is made for a generic $\ell$, we will denote this as $R_i$. Now, we can work toward defining weighted risks. Observe that we can relate the risk to the class-conditioned risk by $R(f) = \mathbb{E}[R_Y(f)] = \sum_{i \in \mathcal{Y}} p_i R_i(f)$. An important part of our paper is an examination of class-weighted risk.

Definition 1. Let $q = (q_1, \dots, q_{|\mathcal{Y}|})$ be a vector such that $q_i \geq 0$ for all $i$ and $\mathbb{E}[q_Y] = \sum_{i \in \mathcal{Y}} q_i p_i = 1$. Then, the $q$-weighted risk is

$$R_q(f) = \mathbb{E}[q_Y R_Y(f)] = \sum_{i \in \mathcal{Y}} q_i p_i R_i(f).$$

Note that the usual risk is recovered by setting $q = (1, \dots, 1)$.

2.3. Plug-in Classification

In this section, we discuss weighted plug-in classification. For plug-in classification, we restrict our attention to the binary case of $\mathcal{Y} = \{0, 1\}$, and the primary quantity of interest is usually the zero-one risk $R_{01}(f)$, i.e., the risk under $\ell_{01}$. In general, the risk for the best classifier is nonzero because for a given $x$ in $\mathcal{X}$, there is some probability that $Y$ may take the value 0 or 1. As a result, we need a way to discuss the convergence of our estimator to the best possible estimator. We define the regression function $\eta$ by $\eta(x) = P(Y = 1 \mid X = x)$. Now, the Bayes optimal classifier is the classifier that minimizes the risk, and it is defined by $f^*(x) = \mathbf{1}\{\eta(x) > 1/2\}$. The minimum possible risk is called the Bayes risk and denoted by $R^* = R(f^*)$, and generally we focus on minimizing the excess risk $\mathcal{E}(f) = R(f) - R^*$.
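As a small numerical illustration of the $q$-weighted risk in Definition 1, the following sketch uses hypothetical class probabilities and class-conditional risks (the numbers are assumptions, chosen only to mimic an imbalanced problem):

```python
import numpy as np

# Hypothetical class probabilities p_i and class-conditional risks R_i(f).
p = np.array([0.9, 0.1])
R = np.array([0.05, 0.40])

def weighted_risk(q, p, R):
    """q-weighted risk R_q(f) = sum_i q_i p_i R_i(f), valid when E[q_Y] = 1."""
    assert np.isclose(np.sum(q * p), 1.0), "weights must satisfy sum_i q_i p_i = 1"
    return float(np.sum(q * p * R))

standard = weighted_risk(np.ones(2), p, R)     # q = (1, 1) recovers the usual risk
balanced = weighted_risk(1.0 / (2 * p), p, R)  # q_i = 1/(k p_i) averages the R_i
```

Here the standard risk (0.085) is dominated by the majority class, while the balanced weighting $q_i = 1/(k p_i)$ returns the plain average of the class-conditional risks (0.225), exposing the poor minority-class performance.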
Following the form of the Bayes classifier, a plug-in estimator $\hat{f}$ attempts to estimate the regression function $\eta$ by some $\hat{\eta}$ and then plugs the result into a threshold function. Thus, $\hat{f}$ has the form $\hat{f}(x) = \mathbf{1}\{\hat{\eta}(x) > 1/2\}$, which is analogous to the form of the Bayes classifier. For additional background on plug-in estimation, see, e.g., Devroye et al. (1996). At this point, we wish to define the weighted versions of the Bayes classifier, Bayes risk, plug-in classifier, and excess risk. For brevity, define the threshold $t_q = q_0/(q_0 + q_1)$. First, we consider the Bayes classifier.

Lemma 1. Let $q = (q_0, q_1)$ be a weighting. The Bayes optimal classifier for $q$-weighted risk is $f^*_q(x) = \mathbf{1}\{\eta(x) > t_q\}$.

The proof, along with proofs of other subsequent results on plug-in classification, appears in the appendix. In this case, we denote the Bayes risk by $R^*_q = R_q(f^*_q)$. Lemma 1 reveals that the Bayes classifier is a plug-in rule, and analogously, we see that a plug-in estimator in the weighted case takes the form $\hat{f}_q(x) = \mathbf{1}\{\hat{\eta}(x) > t_q\}$. Consequently, we define the excess $q$-risk for an empirical classifier $\hat{f}$ as $\mathcal{E}_q(\hat{f}) = R_q(\hat{f}) - R^*_q$, and note that we are interested in bounding the expected excess $q$-risk for plug-in estimators.

2.4. Empirical Risk Minimization

In this section, we define the empirical quantities that we need for empirical risk minimization, particularly the weighted and robust risks. We consider $\mathcal{Y} = \{1, \dots, k\}$. We define the empirical class-conditioned risk by $\hat{R}_i(f) = (1/N_i) \sum_{j=1}^{n} \ell(f, z_j)\, \mathbf{1}\{y_j = i\}$, where $N_i = \sum_{j=1}^{n} \mathbf{1}\{y_j = i\}$. Let $\hat{p}_i = N_i/n$ denote the empirical proportion of observations of class $i$, and let $q$ be a weight vector. The empirical $q$-weighted risk is $\hat{R}_q(f) = \sum_{i=1}^{k} q_i \hat{p}_i \hat{R}_i(f)$. The empirical $Q$-weighted risk is defined analogously by $\hat{R}_Q(f) = \sup_{q \in Q} \hat{R}_q(f)$.
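To make the saddle-point objective $\hat{R}_Q(f) = \sup_{q \in Q} \hat{R}_q(f)$ concrete, here is a minimal gradient descent-ascent sketch (not the paper's appendix algorithm). It assumes a binary logistic model on a toy dataset with $\eta(x) = x$, and for concreteness takes the uncertainty set to be all weightings with $v_i = q_i \hat{p}_i$ in the probability simplex; the data, step sizes, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(0.0, 1.0, n)
y = (rng.uniform(size=n) < X).astype(int)       # eta(x) = P(Y=1 | X=x) = x
N = np.array([(y == 0).sum(), (y == 1).sum()])  # class counts N_0, N_1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def class_risks(w, b):
    """Empirical class-conditional logistic risks (R_hat_0, R_hat_1)."""
    z = w * X + b
    loss = np.maximum(z, 0) - y * z + np.log1p(np.exp(-np.abs(z)))  # stable log-loss
    return np.array([loss[y == 0].mean(), loss[y == 1].mean()])

# Descent in the model parameters (w, b); exponentiated-gradient ascent in v,
# where v_i = q_i * p_hat_i ranges over the probability simplex.
v, w, b = np.array([0.5, 0.5]), 0.0, 0.0
lr, lr_v = 1.0, 0.05
for _ in range(1000):
    v = v * np.exp(lr_v * class_risks(w, b))
    v /= v.sum()                                # ascent step, renormalized to the simplex
    s = v[y] / N[y]                             # per-sample weights for sum_i v_i R_hat_i
    g = sigmoid(w * X + b) - y                  # d(log-loss)/d(score)
    w -= lr * np.sum(s * g * X)
    b -= lr * np.sum(s * g)
```

The ascent step concentrates weight on whichever class currently has the larger empirical risk, so the final classifier roughly equalizes the two class-conditional losses rather than minimizing their $\hat{p}$-weighted average.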
The problem of minimizing $\hat{R}_Q(f) = \sup_{q \in Q} \hat{R}_q(f)$ is convex in $f$ when the loss $\ell$ is convex, and it is concave in $q$ due to linearity; so one may solve the resulting saddle-point problem with standard techniques such as gradient descent-ascent, which we give in the appendix.

Often in empirical risk minimization, generalization bounds are provided, i.e., a bound on the true risk of a classifier $f$ in $\mathcal{F}$ in terms of its empirical risk and a variance term. To bring our results closer to those of plug-in estimation, we also consider a form of excess risk. To distinguish the two, define the excess $(\mathcal{F}, Q)$-weighted risk to be $\mathcal{E}_Q(\mathcal{F}) = R_Q(\hat{f}) - R_Q(f^*_Q)$, where here $\hat{f}$ is the $Q$-weighted empirical risk minimizer in $\mathcal{F}$ and $f^*_Q$ is the population $Q$-weighted risk minimizer in $\mathcal{F}$. Beyond the robust formulation, the key difference between the excess $q$-weighted risk and the excess $(\mathcal{F}, Q)$-weighted risk is that in the former we compete with the true regression function, and in the latter we compete with the best classifier in $\mathcal{F}$.

One additional tool we need for empirical risk minimization is a measure of function class complexity, and a typical measure of the expressiveness of a function class is Rademacher complexity. The empirical Rademacher complexity given a sample $(X_1, Y_1), \dots, (X_n, Y_n)$ is

$$\hat{\mathfrak{R}}_n(\mathcal{F}) = \mathbb{E}_\sigma \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(X_i),$$

where the expectation is taken with respect to the $\sigma_i$, which are Rademacher random variables. The Rademacher complexity is $\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}\,\hat{\mathfrak{R}}_n(\mathcal{F})$, where the expectation is with respect to the $X_i$ random variables. Finally, we make one note about the loss for our empirical risk minimization results. For binary classification, one can obtain bounds for any bounded loss function that is Lipschitz continuous in $f(x)$. Since we present multiclass results, we use the multiclass margin loss, which is a bounded version of the multiclass hinge loss (Mohri et al., 2012).
Here, it is assumed that for each $i$ in $\mathcal{Y}$, the function $f$ outputs a score $f_i(x)$, and the chosen class is $\arg\max_{i \in \mathcal{Y}} f_i(x)$. The multiclass margin loss is defined as $\ell_{\mathrm{mar}}(f, z) = \Phi\big(f_y(x) - \max_{y' \neq y} f_{y'}(x)\big)$, where $\Phi(a) = \mathbf{1}\{a \leq 0\} + (1 - a)\,\mathbf{1}\{0 < a \leq 1\}$. For simplicity, we ignore the margin parameter, usually denoted by $\rho$, and treat it as 1 in our results. Finally, we define the projection set $\Pi_1(\mathcal{F}) = \{x \mapsto f_y(x) : y \in \mathcal{Y}, f \in \mathcal{F}\}$.

3. Trade-offs with Class-Weighted Risk

In this section, we examine weighted plug-in classification, and we have two main results. First, we show that weighted plug-in classification enjoys essentially the same rate of convergence as unweighted plug-in classification, although there is dependence on the chosen weights. Second, there is a fundamental trade-off in that optimizing for one set of weights $q$ may lead to suboptimal performance for another set of weights $q'$.

Figure 1: The irreducible error (IE) and estimation error (EE). The irreducible error is the measure of the set of $x$ where $\eta(x)$ is between the thresholds of $q$ and $q'$, which does not depend on $\hat{\eta}$. The estimation error is the measure of the $x$ for which $\hat{\eta}(x)$ and $\eta(x)$ lead to different plug-in estimates.

3.1. Excess Risk Bounds

We start with the excess risk bound for plug-in estimators when the weighting is well-specified.

Proposition 1. Suppose the regression function $\eta$ is $\beta$-Hölder. Then, the $q$-weighted excess risk of $\hat{f}_q$ satisfies

$$\mathbb{E}\,\mathcal{E}_q(\hat{f}_q) \leq O\big((q_0 + q_1)\, n^{-\beta/(2\beta + d)}\big).$$

Here, we see that the upper bound depends linearly on $q_0$ and $q_1$. This implies that when we increase the weight for a class with few examples, our bound on the excess risk increases. While previous cost-weighting setups have normalized the sum of weights (Scott, 2012), our normalization scheme is computed with respect to the prior probabilities on each class as well, and consequently we explicitly include $q_0, q_1$ in our bound. Our choice of domain for weights is defined in Section 4.
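The weighted plug-in rule from Lemma 1 is a one-line change to the usual rule: threshold the estimated regression function at $t_q = q_0/(q_0 + q_1)$ instead of $1/2$. A sketch with hypothetical $\hat{\eta}$ values:

```python
import numpy as np

def plugin_classifier(eta_hat, q0, q1):
    """Weighted plug-in rule: predict 1 when eta_hat(x) exceeds
    the q-dependent threshold t_q = q0 / (q0 + q1) (Lemma 1)."""
    t_q = q0 / (q0 + q1)
    return (eta_hat > t_q).astype(int)

eta_hat = np.array([0.2, 0.4, 0.6, 0.8])           # hypothetical estimates of eta
unweighted = plugin_classifier(eta_hat, 1.0, 1.0)  # t_q = 1/2, the usual rule
upweighted = plugin_classifier(eta_hat, 1.0, 3.0)  # t_q = 1/4: favors class 1
```

Upweighting class 1 ($q_1 > q_0$) lowers the threshold, so borderline points are pushed into the minority class; this is exactly the "adjusting decision thresholds" view of class-weighting mentioned in the introduction.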
Now, we turn to our second task: examining the weighted excess risk of $\hat{f}_q$ under a different weighting $q'$. Observe that we can decompose the excess risk as

$$\mathbb{E}\,\mathcal{E}_{q'}(\hat{f}_q) = \underbrace{\mathbb{E}\,R_{q'}(\hat{f}_q) - R_{q'}(f^*_q)}_{\text{estimation error}} + \underbrace{R_{q'}(f^*_q) - R_{q'}(f^*_{q'})}_{\text{irreducible error}} =: (\mathrm{EE}) + (\mathrm{IE}). \tag{1}$$

Unsurprisingly, we see that an error term that is constant, or irreducible, appears in equation (1). The irreducible error is given by the measure of the subset of $\mathcal{X}$ where $\eta(x)$ lies between $t_q$ and $t_{q'}$. Given that we know the Bayes optimal classifier for any weighting, we observe that the irreducible error can be upper bounded by a term proportional to the product of the measure of $P_X$ in the region between $t_q$ and $t_{q'}$ and the difference between the thresholds themselves. We state this formally in the following proposition.

Proposition 2. Let $\underline{t}_{q,q'} = \min\{t_q, t_{q'}\}$ and $\overline{t}_{q,q'} = \max\{t_q, t_{q'}\}$. The irreducible error satisfies the bound

$$(\mathrm{IE}) \leq (q'_0 + q'_1)\, |t_q - t_{q'}|\, P\big(\underline{t}_{q,q'} \leq \eta(X) \leq \overline{t}_{q,q'}\big).$$

A visualization is given in Figure 1. Now, we turn to analyzing the estimation error. The result is in many ways similar to Proposition 1, but an additional term appears due to the decision threshold $t_q$ for $\hat{\eta}$ differing from the threshold $t_{q'}$ of the risk measurement.

Proposition 3. For any density estimator $\hat{\eta}$, the estimation error satisfies

$$(\mathrm{EE}) \leq (q'_0 + q'_1)\,\mathbb{E}\big[\|\eta - \hat{\eta}\|_{L_1(P_X)}\big] + (q'_0 + q'_1)\, |t_{q'} - t_q|\, \mathbb{E}\big[P\big(\hat{f}_q(X) \neq f^*_q(X)\big)\big].$$

Corollary 1. When $\eta$ is $\beta$-Hölder, using a local polynomial estimator (Yang, 1999) for $\hat{\eta}$ gives

$$(\mathrm{EE}) \leq (q'_0 + q'_1)\, O\big(n^{-\beta/(2\beta + d)}\big) + (q'_0 + q'_1)\, |t_{q'} - t_q|\, \mathbb{E}\big[P\big(\hat{f}_q(X) \neq f^*_q(X)\big)\big].$$

Consequently, we can upper bound the expected excess $q'$-risk. The probability in the bound of the estimation error has been considered in the context of nearest neighbors (Chaudhuri & Dasgupta, 2014), but in general, additional assumptions are required to provide an explicit rate. We consider one such assumption in the appendix.
4. Robust Class-Weighted Risk

Based on the results in the previous section, we know that the performance degradation need not be graceful when we don't know how to choose the weights. This motivates us to study a more robust version of class-weighted risk.

Definition 2. Let $Q \subseteq \mathbb{R}^{|\mathcal{Y}|}$ be a compact convex set such that $q_i \geq 0$ for each $i$ and $\mathbb{E}[q_Y] = 1$ for each $q$ in $Q$. Then, the $Q$-weighted risk is

$$R_Q(f) = \sup_{q \in Q} \mathbb{E}[q_Y R_Y(f)] = \sup_{q \in Q} \sum_{i \in \mathcal{Y}} q_i p_i R_i(f).$$

Additionally, we refer to the set $Q$ as the uncertainty set. In this section, we have two goals: (1) to provide excess $\mathcal{F}$-risk bounds and generalization bounds for robust weighted risk via uniform convergence, and (2) to make connections to stochastic optimization via special choices of the uncertainty set. We start with generalization; the proofs are given in the appendix.

Theorem 1. Let $\ell = \ell_{\mathrm{mar}}$ be the multiclass margin loss. Recall that $N_i = \sum_{j=1}^{n} \mathbf{1}\{y_j = i\}$. With probability at least $1 - \delta$, we have the generalization bound

$$R_Q(f) \leq \sup_{q \in Q} \Big\{ \hat{R}_q(f) + \sum_{i=1}^{k} q_i p_i\, \hat{\mathfrak{R}}_{N_i}(\Pi_1(\mathcal{F})) + \cdots \Big\}$$

for every $f$ in $\mathcal{F}$, and the excess risk bound

$$\mathcal{E}_Q(\mathcal{F}) \leq 2 \sup_{q \in Q} \Big\{ \sum_{i=1}^{k} q_i p_i\, \hat{\mathfrak{R}}_{N_i}(\Pi_1(\mathcal{F})) + \cdots \Big\}.$$

A few remarks are in order. First, note that we only use the multiclass margin loss because it leads to simple multiclass bounds. In a binary classification setting, standard results would imply generalization for other Lipschitz losses. Second, in many cases, we can simplify the Rademacher complexity term. The following result applies to commonly used function classes such as linear functions and neural networks (Bartlett et al., 2017; Golowich et al., 2018; Mohri et al., 2012).

Corollary 2. Let $\ell = \ell_{\mathrm{mar}}$ be the multiclass margin loss. Let $\mathcal{F}$ be a function class satisfying $\hat{\mathfrak{R}}_n(\Pi_1(\mathcal{F})) \leq C(\mathcal{F})\, n^{-1/2}$ for some constant $C(\mathcal{F})$ that does not depend on $n$. Then with probability at least $1 - \delta$, we have the generalization bound

$$R_Q(f) \leq \sup_{q \in Q} \Big\{ \hat{R}_q(f) + 4k\, C(\mathcal{F}) \sum_{i=1}^{k} \frac{q_i p_i}{\sqrt{N_i}} + \cdots \Big\}$$

and the excess $(\mathcal{F}, Q)$-risk bound

$$\mathcal{E}_Q(\mathcal{F}) \leq 2 \sup_{q \in Q} \Big\{ 8k\, C(\mathcal{F}) \sum_{i=1}^{k} \frac{q_i p_i}{\sqrt{N_i}} + \cdots \Big\}.$$
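Theorem 1 and Corollary 2 are stated for the multiclass margin loss $\ell_{\mathrm{mar}}$ of Section 2.4; a direct transcription of $\Phi$ (with the margin parameter treated as 1, as in the text):

```python
def margin_loss(scores, y):
    """Multiclass margin loss: Phi(f_y(x) - max_{y' != y} f_{y'}(x)), where
    Phi(a) = 1 for a <= 0, 1 - a for 0 < a <= 1, and 0 for a > 1."""
    margin = scores[y] - max(s for j, s in enumerate(scores) if j != y)
    if margin <= 0:
        return 1.0                     # misclassified (or tied): full loss
    return max(0.0, 1.0 - margin)      # linear in the margin, clipped at 1
```

Unlike the multiclass hinge loss, this version is bounded in $[0, 1]$, which is what the generalization bounds above require.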
4.1. Connections to Stochastic Programming

In this section, we make concrete connections to stochastic programming (Shapiro et al., 2009). First, we introduce label conditional value at risk, and then we describe its generalization, label heterogeneous conditional value at risk.

4.1.1. Label CVaR

We start with the definition.

Definition 3. Let $\alpha$ in $(0, 1)$ be given. Define the set

$$Q_\alpha = \big\{ q : \mathbb{E}[q_Y] = 1,\ q_i \in [0, \alpha^{-1}] \text{ for } i = 1, \dots, k \big\}.$$

The label conditional value at risk (LCVaR) is $\mathrm{LCVaR}_\alpha(f) = R_{Q_\alpha}(f)$.

Now, we describe the connection to CVaR. Letting $Z$ be a random variable, the CVaR of $Z$ at level $\alpha$ is

$$\mathrm{CVaR}_\alpha(Z) = \sup_{Q \in \mathcal{Q}_\alpha} \mathbb{E}_Q[Z] = \sup_{Q \in \mathcal{Q}_\alpha} \mathbb{E}[(dQ/dP)\, Z],$$

where $\mathcal{Q}_\alpha$ is the set of all probability measures that are absolutely continuous with respect to the underlying measure $P$ such that $dQ/dP \leq \alpha^{-1}$. If $Z$ takes values on a finite discrete probability space with probability mass function $p$, then the CVaR may be written as $\mathrm{CVaR}_\alpha(Z) = \sup_{q \in Q_\alpha} \sum_{i=1}^{k} q_i p_i z_i$, where $z_i$ denotes the value of $Z$ on the $i$th atom. Thus, LCVaR is a specialization of CVaR to the variables $R_Y(f)$, which take values on the finite discrete space $\mathcal{Y}$. Notably, this is in contrast to other uses of CVaR in machine learning where, as noted previously, CVaR is used with respect to samples directly, in order to provide robustness or fairness. As with CVaR, LCVaR is a straightforward way to provide robustness. Intuitively, it moves weight to the worst losses, where all weightings are bounded by the same constant $\alpha^{-1}$. Now, we consider the dual form.

Proposition 4 (LCVaR dual form). LCVaR permits the dual formulation

$$\mathrm{LCVaR}_\alpha(f) = \inf_{\lambda \in \mathbb{R}} \big\{ \alpha^{-1}\, \mathbb{E}[(R_Y(f) - \lambda)_+] + \lambda \big\}.$$

Moreover, if $\mathcal{F}$ is compact in the supremum norm on $\mathcal{X}$ and $\ell$ is continuous, then the dual form holds for all $f$ in $\mathcal{F}$.

The proof is mostly standard and therefore deferred to the appendix. The only trick compared with CVaR is showing that we may restrict the domain of $\lambda$ to a compact set, which essentially requires showing that the process $\{R_Y(f) : f \in \mathcal{F}\}$ is sufficiently well-behaved.
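For a finite label space, both sides of Proposition 4 can be computed directly, which makes the duality easy to check numerically. A sketch with hypothetical class probabilities and class-conditional risks: the primal supremum is solved greedily, and since the dual objective is piecewise linear in $\lambda$ (with negative slope below $\min_i R_i$ when $\alpha < 1$ and positive slope above $\max_i R_i$), some breakpoint $\lambda = R_i$ is optimal.

```python
import numpy as np

def lcvar_primal(p, R, alpha):
    """sup over q in Q_alpha of sum_i q_i p_i R_i: greedily place as much
    mass w_i = q_i p_i as the cap q_i <= 1/alpha allows on the worst risks."""
    order = np.argsort(R)[::-1]        # classes from highest to lowest risk
    w = np.zeros_like(p)
    budget = 1.0                       # the w_i must sum to 1
    for i in order:
        w[i] = min(p[i] / alpha, budget)
        budget -= w[i]
    return float(np.dot(w, R))

def lcvar_dual(p, R, alpha):
    """inf over lambda of lambda + (1/alpha) * E[(R_Y - lambda)_+],
    minimized over the breakpoints lambda = R_i."""
    return float(min(lam + np.dot(p, np.maximum(R - lam, 0.0)) / alpha
                     for lam in R))

p = np.array([0.7, 0.2, 0.1])          # hypothetical class probabilities
R = np.array([0.10, 0.30, 0.60])       # hypothetical class-conditional risks
primal, dual = lcvar_primal(p, R, 0.5), lcvar_dual(p, R, 0.5)
```

Both sides evaluate to 0.28 here; as $\alpha$ shrinks, LCVaR interpolates toward the worst class risk $\max_i R_i$.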
It would also suffice to assume that $\ell$ is bounded, as with most theoretical results in learning theory. Note that to minimize LCVaR, we can solve this convex program in $\lambda$ and $f$.

4.1.2. Label Heterogeneous CVaR

While the LCVaR approach of the previous section is useful for providing some robustness in a computationally tractable manner, it may not be best suited for imbalanced classification because it treats all classes identically, in that each $q_i$ must lie in the interval $[0, \alpha^{-1}]$. Since imbalanced classification is inherently a problem of heterogeneity, we may wish to allow $q_i$ to be in some interval $[0, \alpha_i^{-1}]$ instead. We can formalize this problem as follows.

Definition 4. Define the uncertainty set

$$Q_{H,\alpha} = \big\{ q : \mathbb{E}[q_Y] = 1,\ q_i \in [0, \alpha_i^{-1}] \text{ for } i = 1, \dots, k \big\}.$$

We call the resulting optimization problem label heterogeneous conditional value at risk (LHCVaR), and we write

$$\mathrm{LHCVaR}_\alpha(f) = \sup_{q \in Q_{H,\alpha}} \mathbb{E}[q_Y R_Y(f)].$$

Similar to LCVaR, this has a dual form.

Proposition 5. A dual form for LHCVaR is given by

$$\mathrm{LHCVaR}_\alpha(f) = \inf_{\lambda \in \mathbb{R}} \mathbb{E}\big[\alpha_Y^{-1} (R_Y(f) - \lambda)_+\big] + \lambda.$$

Moreover, if $\mathcal{F}$ is compact in the supremum norm on $\mathcal{X}$ and $\ell$ is continuous, then the dual form holds for all $f$ in $\mathcal{F}$.

Again, we note that an alternative sufficient condition for the dual to hold for all $f$ in $\mathcal{F}$ is that $\ell$ be bounded. Importantly, the LHCVaR dual form is convex in $f$ and $\lambda$. As a result, we can still optimize efficiently, in principle. We also note that the finite dimension $k$ is crucial for LHCVaR. This is due to our use of the minimax theorem, which requires compactness in various places; so in general this result cannot be extended to the infinite-dimensional case.

5. Numerical Results

Code for reproducing the results in this section can be found at https://www.github.com/neilzxu/robust_weighted_classification.

5.1. Methods

We examine the empirical performance of the LCVaR and LHCVaR risks, and compare them against the standard risk and a balanced risk as baselines.
Let $\hat{p}_i$ be the empirical proportion of the $i$th label and $\hat{R}_i$ be the empirical class-conditional risk.

Balanced risk. Here, we consider the specific weighting where each class is equally weighted:

$$\hat{R}_{1/(k\hat{p})}(f) = \frac{1}{k} \sum_{i=1}^{k} \hat{R}_i(f),$$

i.e., we fix $q_i = 1/(k\hat{p}_i)$.

Figure 2: Plots of class 0, class 1, and worst-class risk on the test dataset under different choices of $1 - p$ in the synthetic experiment ((a) class 0 risk, (b) class 1 risk, (c) worst-class risk). The worst test class risk is the maximum of the risks of the two classes for each choice of the probability of class 0. LCVaR and LHCVaR perform better in worst-class risk than both the standard and balanced risks as class imbalance increases.

LCVaR. The empirical formulation optimizes the dual formulation, in which $\alpha$ is a hyperparameter:

$$\widehat{\mathrm{LCVaR}}_\alpha(f) = \min_{\lambda \in \mathbb{R}} \frac{1}{\alpha} \sum_{i=1}^{k} \hat{p}_i (\hat{R}_i(f) - \lambda)_+ + \lambda.$$

LHCVaR. We similarly optimize a dual form in the empirical LHCVaR risk. To reduce the number of hyperparameters to only $c \in (0, 1]$ and $\kappa \in (0, \infty)$, we calculate the $\alpha_i$ as follows:

$$\alpha_i^{(\kappa, c)} = c\, \frac{\hat{p}_i^{1/\kappa}}{\sum_{j=1}^{k} \hat{p}_j^{1/\kappa}}.$$

Here $\kappa$ behaves as a temperature parameter (similar to Jang et al., 2016; Wang et al., 2020) and causes $\alpha$ to become a smoother distribution of weights when $\kappa > 1$, converging to uniform weights as $\kappa \to \infty$. Conversely, when $\kappa < 1$, the $\alpha$ distribution becomes sharper and heavily weights the classes with the lowest $\hat{p}_i$ as $\kappa \to 0$. We simply choose $\kappa = 1$ unless otherwise stated. Consequently, $c$ characterizes the total magnitude of the weights. Ultimately, we formulate the empirical risk as:

$$\widehat{\mathrm{LHCVaR}}_{\kappa, c}(f) = \inf_{\lambda \in \mathbb{R}} \sum_{i=1}^{k} \frac{\hat{p}_i}{\alpha_i^{(\kappa, c)}} (\hat{R}_i(f) - \lambda)_+ + \lambda.$$

We train a logistic regression model with gradient descent on a cross-entropy loss, which acts as a convex surrogate loss for the zero-one risk.

5.2. Datasets

We evaluate our methods on both synthetic and real datasets.

Synthetic datasets. The data in our synthetic experiment are constructed for $\mathcal{X} = [0, 1]$ and $\mathcal{Y} = \{0, 1\}$.
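As a quick illustration of the $\alpha_i^{(\kappa, c)}$ temperature reparametrization from Section 5.1 (the $\hat{p}$ values below are hypothetical):

```python
import numpy as np

def alpha_weights(p_hat, kappa, c):
    """alpha_i^{(kappa,c)} = c * p_hat_i^(1/kappa) / sum_j p_hat_j^(1/kappa).
    Large kappa flattens the alphas toward uniform; small kappa sharpens them,
    giving rare classes tiny alpha_i and hence large weight caps 1/alpha_i."""
    t = p_hat ** (1.0 / kappa)
    return c * t / t.sum()

p_hat = np.array([0.9, 0.1])
sharp = alpha_weights(p_hat, kappa=0.5, c=1.0)   # sharper than p_hat itself
flat = alpha_weights(p_hat, kappa=100.0, c=1.0)  # close to (0.5, 0.5)
```

With $\kappa < 1$, the minority class receives a much smaller $\alpha_i$ than its empirical proportion, so the adversary's cap $\alpha_i^{-1}$ on its weight grows; with very large $\kappa$, all classes get nearly equal caps.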
For a given $p = P(Y = 0)$, we generated a dataset by uniformly randomly sampling an $X$ in $[0, 1]$ and sampling a $Y$ with the following distribution:

$$P(Y = 1 \mid X = x) = x, \qquad P(Y = 0 \mid X = x) = 1 - x.$$

In these synthetic datasets, we note that the Bayes optimal classifier and class risks can be computed in closed form; in particular, $R_0(f^*) = 1 - (1 + p)^{-1}$. When $p$ is high, $R_0(f^*) < R_1(f^*)$, which leads to a classifier that has vastly worse performance on class 1 compared to class 0. This discrepancy in class risk is a common issue in classification problems where there is a significant class imbalance. We randomly generated 10,000 data points for both train and test sets. We generated datasets for each value of $p$ from 0.80 to 0.98, inclusive, in steps of 0.02.

Real-world datasets. We also experiment on the Covertype dataset taken from the UCI dataset repository (Dua & Graff, 2017). This dataset is 53-dimensional with 7 classes and has a 2%–98% train–test split (11,340 training and 565,892 test examples).

Figure 3: Worst-class risk for different $\alpha$ values for LCVaR ((a)) and different $\kappa$ values for LHCVaR with $\alpha = 0.01$ ((b)) in the synthetic setting. Across different levels of class imbalance, $\alpha$ and $\kappa$ do not have a significant impact on the worst-class risk of LCVaR and LHCVaR.

Figure 4: Histograms of class risks for each method ((a) standard, (b) balanced, (c) LCVaR, (d) LHCVaR) on the Covertype dataset. The red line marks the largest risk for each method. The distributions of class risks for the standard and balanced methods are more spread out, while the class risks for LCVaR and LHCVaR are more concentrated near the max class risk. The max class risks are slightly lower for LCVaR and LHCVaR compared to the other two methods.

5.3. Results

Synthetic. In Fig. 2, we can observe that the worst-case class risk of LCVaR and LHCVaR across multiple values of $p$ is better than that of both the standard and balanced classifiers.
The class-wise risks of LCVaR and LHCVaR are relatively close across different values of $p$, while there is a large discrepancy between the class-wise risks of the classifiers trained under the standard or balanced risks. Note that the more significant the imbalance, i.e., the smaller $1 - p$, the better LCVaR and LHCVaR perform compared to the balanced risk on class 0, while paying a progressively smaller price on the class 1 risk. The same is also true between both LCVaR and LHCVaR and the standard risk, although with the classes swapped. We note that while the worst-class risks of LCVaR and LHCVaR seem to decrease with greater imbalance, this may not be a general property of these methods. Rather, this is more likely an artifact of the synthetic setup having more probability mass further from the decision boundary as the imbalance increases. The main observation is simply that LCVaR and LHCVaR have lower worst-class risk in comparison to the baseline methods. Thus, this empirically demonstrates that both LCVaR and LHCVaR can significantly improve the highest class risks while losing little in performance on classes with lower risks.

In addition to comparing against baselines, we also examine the effect of different choices of $\alpha$ and $\kappa$ on LCVaR and LHCVaR, respectively. The results of this comparison are in Fig. 3. In both methods, varying the hyperparameters does not have a dramatic impact on the behavior of the worst-class risk across different levels of class imbalance.

Table 1: Standard risk and risk of the worst class for each method on the Covertype dataset. LCVaR and LHCVaR improve on the worst-class risk.

| Method   | Standard Risk | Worst Class Risk |
|----------|---------------|------------------|
| LHCVaR   | 0.3979        | 0.4907           |
| LCVaR    | 0.3384        | 0.5037           |
| Standard | 0.3275        | 0.5111           |
| Balanced | 0.3765        | 0.5333           |

Real. In Table 1, we observe that LCVaR and LHCVaR have better worst-class risks than the standard and class-weighted baselines.
However, improving the worst class risk comes at a cost to the standard risk for both LCVaR and LHCVaR. This tradeoff is reflected in the histograms of class risks shown in Fig. 4, where the class risks under the standard and balanced classifiers are more spread out and include classes with much lower risks. On the other hand, LCVaR and LHCVaR have class risk distributions that are more concentrated towards the worst class risk value. Consequently, LCVaR and LHCVaR achieve a lower worst class risk, which is consistent with our theory.

Table 2: Performance of LCVaR across different α values, and of LHCVaR across different κ values. The performance of each method is relatively insensitive to the choice of α and κ, although the smallest choices of α and κ for each method produce the largest changes in worst class risk.

Method     α      κ      Standard Risk    Worst Class Risk
LCVaR      0.01   N/A    0.4266           0.5474
LCVaR      0.05   N/A    0.3993           0.4932
LCVaR      0.1    N/A    0.4060           0.5037
LHCVaR     0.05   0.8    0.4308           0.5408
LHCVaR     0.05   1      0.3979           0.4907
LHCVaR     0.05   1.2    0.4171           0.5050

We also compare the effect of choosing different values of α and κ on LCVaR and LHCVaR, respectively, in Table 2. The worst class risk remains good under different choices of α and κ, although there is some degradation when α is smaller than the optimal choice, in the case of LCVaR, and when κ is smaller and produces a sharper distribution, in the case of LHCVaR.

6. Discussion

In this work, we have studied the effect of optimizing classifiers with respect to different weightings and developed robust risk measures that minimize the worst-case weighted risk over a set of weightings. We subsequently showed that optimizing with respect to LCVaR and LHCVaR empirically improves the worst class risk, at a reasonable cost to accuracy. One future direction for research is to understand the Bayes optimal classifier under LCVaR and LHCVaR.
Another, more applied, direction is to consider domain shift. If we formalize each prior over the classes as a weighting, optimizing LCVaR or LHCVaR may improve performance when the test class priors differ from the training class priors.

7. Acknowledgements

We acknowledge the support of Rakuten Inc. and Microsoft Research. The authors would also like to thank Biswajit Paria for his contributions to the numerical simulations, and him and Pradipto Das for their helpful comments and discussions.

References

Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pp. 6240–6249, 2017.

Ben-Tal, A. and Nemirovski, A. Robust solutions of uncertain linear programs. Operations Research Letters, 25(1):1–13, 1999.

Ben-Tal, A. and Nemirovski, A. Robust solutions of linear programming problems contaminated with uncertain data. Mathematical Programming, 88(3):411–424, 2003.

Ben-Tal, A., Goryashko, A., Guslitzer, E., and Nemirovski, A. Adjustable robust solutions of uncertain linear programs. Mathematical Programming, 99(2):351–376, 2004.

Ben-Tal, A., El Ghaoui, L., and Nemirovski, A. Robust Optimization. Princeton University Press, 2009.

Ben-Tal, A., Den Hertog, D., De Waegenaere, A., Melenberg, B., and Rennen, G. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.

Bertsimas, D., Gupta, V., and Kallus, N. Robust sample average approximation. Mathematical Programming, pp. 1–66, 2014.

Cao, K., Wei, C., Gaidon, A., Arechiga, N., and Ma, T. Learning imbalanced datasets with label-distribution-aware margin loss. arXiv preprint arXiv:1906.07413, 2019.

Chaudhuri, K. and Dasgupta, S. Rates of convergence for nearest neighbor classification. In Advances in Neural Information Processing Systems, pp. 3437–3445, 2014.

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P.
SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.

Devroye, L., Györfi, L., and Lugosi, G. A Probabilistic Theory of Pattern Recognition. Springer Science & Business Media, 1996.

Domingos, P. MetaCost: A general method for making classifiers cost-sensitive. In KDD, volume 99, pp. 155–164, 1999.

Dua, D. and Graff, C. UCI Machine Learning Repository, 2017.

Duchi, J. and Namkoong, H. Learning models with uniform performance via distributionally robust optimization. arXiv preprint arXiv:1810.08750, 2018.

Duchi, J. C., Hashimoto, T., and Namkoong, H. Distributionally robust losses against mixture covariate shifts. arXiv, 2018.

Feldman, V. Does learning require memorization? A short tale about a long tail. arXiv preprint arXiv:1906.05271, 2019.

Fernández, A., García, S., Galar, M., Prati, R. C., Krawczyk, B., and Herrera, F. Learning from Imbalanced Data Sets. Springer, 2018.

Goh, J. and Sim, M. Distributionally robust optimization and its tractable approximations. Operations Research, 58(4-part-1):902–917, 2010.

Golowich, N., Rakhlin, A., and Shamir, O. Size-independent sample complexity of neural networks. In Conference On Learning Theory, pp. 297–299, 2018.

He, H. and Garcia, E. A. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284, 2009.

Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.

Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C. R., Peterson, C., et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7(6):673, 2001.

Lee, Y., Wahba, G., and Ackerman, S. A. Cloud classification of satellite radiance data by multicategory support vector machines.
Journal of Atmospheric and Oceanic Technology, 21(2):159–169, 2004.

Lin, Y., Lee, Y., and Wahba, G. Support vector machines for classification in nonstandard situations. Machine Learning, 46(1-3):191–202, 2002.

Lin, Y.-C., Das, P., and Datta, A. Overview of the SIGIR 2018 eCom Rakuten Data Challenge. In eCOM@SIGIR, 2018.

Mariani, G., Scheidegger, F., Istrate, R., Bekas, C., and Malossi, C. BAGAN: Data augmentation with balancing GAN. arXiv preprint arXiv:1803.09655, 2018.

Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press, 2012.

Namkoong, H. and Duchi, J. C. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pp. 2971–2980, 2017.

Rigollet, P. and Tong, X. Neyman-Pearson classification, convexity and stochastic constraints. Journal of Machine Learning Research, 12(Oct):2831–2855, 2011.

Rockafellar, R. T., Uryasev, S., et al. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.

Salakhutdinov, R., Torralba, A., and Tenenbaum, J. Learning to share visual appearance for multiclass object detection. In CVPR 2011, pp. 1481–1488. IEEE, 2011.

Scott, C. Calibrated asymmetric surrogate losses. Electronic Journal of Statistics, 6:958–992, 2012.

Shapiro, A., Dentcheva, D., and Ruszczyński, A. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2009.

Tong, X. A plug-in approach to Neyman-Pearson classification. The Journal of Machine Learning Research, 14(1):3011–3040, 2013.

Tong, X., Feng, Y., and Zhao, A. A survey on Neyman-Pearson classification and suggestions for future research. Wiley Interdisciplinary Reviews: Computational Statistics, 8(2):64–81, 2016.

Van Rijsbergen, C. J. Foundation of evaluation. Journal of Documentation, 30(4):365–373, 1974.

Van Rijsbergen, C. J. Information Retrieval. Butterworth-Heinemann, London, 2nd edition, 1979.

Wang, J., Shen, X., and Liu, Y. Probability estimation for large-margin classifiers.
Biometrika, 95(1):149–167, 2008.

Wang, X., Helen Zhang, H., and Wu, Y. Multiclass probability estimation with support vector machines. Journal of Computational and Graphical Statistics, pp. 1–18, 2019.

Wang, X., Tsvetkov, Y., and Neubig, G. Balancing training for multilingual neural machine translation. arXiv preprint arXiv:2004.06748, 2020.

Wu, Y., Zhang, H. H., and Liu, Y. Robust model-free multiclass probability estimation. Journal of the American Statistical Association, 105(489):424–436, 2010.

Yang, Y. Minimax nonparametric classification. I. Rates of convergence. IEEE Transactions on Information Theory, 45(7):2271–2284, 1999.

Zhou, Z.-H. and Liu, X.-Y. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering, 18(1):63–77, 2006.

Zhu, X., Anguelov, D., and Ramanan, D. Capturing long-tail distributions of object subcategories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 915–922, 2014.

Zipf, G. K. The Psycho-Biology of Language: An Introduction to Dynamic Philology. George Routledge & Sons, Ltd., 1936.