# Trainable Undersampling for Class-Imbalance Learning

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Minlong Peng,1 Qi Zhang,1 Xiaoyu Xing,1 Tao Gui,1 Xuanjing Huang,1 Yu-Gang Jiang,1 Keyu Ding,2 Zhigang Chen2

School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai Institute of Intelligent Electronics & Systems, Shanghai, China
1{mlpeng16, qz, xyxing14, tgui16, xjhuang, ygj}@fudan.edu.cn
2{kyding, zgcheng}@iflytek.com

## Abstract

Undersampling has been widely used in the class-imbalance learning area. The main deficiency of most existing undersampling methods is that their data sampling strategies are heuristic-based and independent of the classifier and evaluation metric in use. Thus, they may discard instances that are informative for the classifier during data sampling. In this work, we propose a meta-learning method built on undersampling to address this issue. The key idea of this method is to parametrize the data sampler and train it to optimize the classification performance over the evaluation metric. We solve the non-differentiable optimization problem for training the data sampler via reinforcement learning. By incorporating evaluation metric optimization into the data sampling process, the proposed method can learn which instances should be discarded for the given classifier and evaluation metric. In addition, as a data-level operation, this method can be easily applied to arbitrary evaluation metrics and classifiers, including non-parametric ones (e.g., C4.5 and KNN). Experimental results on both synthetic and realistic datasets demonstrate the effectiveness of the proposed method.

## Introduction

In many application areas of data mining and machine learning, the problem of class imbalance is ubiquitous, and the tasks in these areas commonly aim to distinguish the minority classes or to achieve a balanced classification performance (Van Hulse, Khoshgoftaar, and Napolitano 2007). In this situation, conventional accuracy-based measurements are usually misleading because they are highly dependent on the classification accuracy of the majority classes. Therefore, more appropriate, domain-relevant measurements such as the F-measures, area under the curve (AUC) (Hanley and McNeil 1982), and geometric mean (GM) were developed. In general, a classifier is assumed to work well for a class-imbalanced task if it achieves good performance over the given evaluation metric. However, most existing learning algorithms were designed to improve accuracy (Ganganwar 2012), rather than the given evaluation metric, by minimizing the training loss. There is thus a gap between the training objective of the supervised classifier and the task objective revealed by the evaluation metric.

Undersampling has been widely used to narrow this gap in the class-imbalance learning area. The prevailing undersampling strategies undersample instances of the majority classes using different heuristics (Cieslak and Chawla 2008; Wilson 1972; Mani and Zhang 2003; Tomek 1976b; 1976a), in the hope of arriving at a more robust and fair decision boundary for the evaluation metric.
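To make the gap concrete, here is a minimal sketch (not from the paper; it assumes scikit-learn and an arbitrary synthetic 1% minority rate) showing how a trivial majority-class predictor attains near-perfect accuracy while the metrics named above, F-measure, AUC, and the geometric mean of per-class recalls, expose its failure on the minority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)   # ~1% positive (minority) class
y_pred = np.zeros_like(y_true)                     # classifier that always predicts the majority class
y_score = rng.random(10_000)                       # uninformative scores, used only for AUC

print("Accuracy:", accuracy_score(y_true, y_pred))             # ~0.99, yet useless for the minority class
print("F1      :", f1_score(y_true, y_pred, zero_division=0))  # 0.0
print("AUC     :", roc_auc_score(y_true, y_score))             # ~0.5 (chance level)

# Geometric mean (GM) of the per-class recalls
per_class_recall = recall_score(y_true, y_pred, average=None, zero_division=0)
print("GM      :", float(np.sqrt(per_class_recall.prod())))    # 0.0
```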
The sampling probability of each example is usually decided by the global or local imbalance ratio (Cieslak and Chawla 2008) and by hyper-parameters, which are tuned to obtain better performance over the evaluation metric. However, these undersampling strategies are heuristic-based: they do not take into account the form of the classifier and evaluation metric in use. Thus, even with fine-tuned hyper-parameters, these strategies are not guaranteed to obtain a subset that matches the task objective (Batista, Prati, and Monard 2004; He and Garcia 2009). A typical problem of these strategies is that they may throw away potentially useful data (Liu, Wu, and Zhou 2009).

In this work, we propose a meta-learning method built on undersampling to address the above issues. We parametrize the data sampler and train it to optimize the classification performance over the evaluation metric. Therefore, unlike previous undersampling strategies that sample instances heuristically, the parametrized data sampler is trained to distinguish which instances should be discarded and which should be preserved. We approach the non-differentiable optimization problem for training the data sampler via reinforcement learning. Specifically, we formulate the data sampling procedure as a Markov decision process (MDP), which takes the sampling operation on each example as the action, the chosen subset as the state, and the performance over the evaluation metric of the classifier trained on the chosen subset as the reward. We show that the convergence of this algorithm is guaranteed by that of the policy search algorithm (Williams 1992). To evaluate the proposed method, we performed experiments on both synthetic and realistic imbalanced datasets. The experimental results show that the proposed method consistently outperforms different heuristic-based data sampling methods, including undersampling and oversampling, and that it achieves performance comparable to specifically designed, state-of-the-art cost-sensitive learning methods.

The contributions of this work can be summarized as follows: 1) We propose a meta-learning method to incorporate evaluation metric optimization into the undersampling process. It can be easily applied to arbitrary classifiers and evaluation metrics, and it makes the data sampler trainable. 2) We approach the non-differentiable optimization problem for training the data sampler via reinforcement learning and propose a practical implementation of this approach. 3) The proposed model consistently outperforms different heuristic-based data sampling methods, including undersampling and oversampling, and achieves results comparable to specifically designed class-imbalance learning methods, which usually achieve state-of-the-art performance.

## Related Work

The extensive development of undersampling in recent decades has resulted in various strategies. A representative one is random majority undersampling (RUS), in which instances of the majority classes are randomly discarded from the dataset. Other strategies have attempted to improve upon RUS by exploiting the distribution of the data (Wilson 1972). For example, Near Miss (Mani and Zhang 2003) selects the majority examples that are nearest to minority instances, and Cluster Centroid (Lemaître, Nogueira, and Aridas 2017) undersamples the majority class by replacing clusters of majority samples with the cluster centroids of a K-Means algorithm.
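For concreteness, the following minimal sketch (assuming the imbalanced-learn package of Lemaître, Nogueira, and Aridas (2017), which provides reference implementations of the strategies above, and an arbitrary synthetic 95/5 class split) shows how these heuristic undersamplers are typically applied. Note that none of them consults the downstream classifier or the evaluation metric, which is precisely the deficiency addressed in this work.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler, NearMiss, ClusterCentroids

# A synthetic 95%/5% binary problem.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
print("original:", Counter(y))

samplers = [
    ("RUS", RandomUnderSampler(random_state=0)),            # randomly drop majority instances
    ("NearMiss", NearMiss(version=1)),                       # keep majority points closest to the minority
    ("ClusterCentroids", ClusterCentroids(random_state=0)),  # replace majority clusters with K-Means centroids
]
for name, sampler in samplers:
    X_res, y_res = sampler.fit_resample(X, y)                # heuristic, classifier- and metric-agnostic
    print(name, Counter(y_res))
```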
However, these strategies are all heuristic-based and commonly suffer from the problem of discarding potentially useful data. Some undersampling methods have used the ensemble technique to overcome this problem (Błaszczyński and Stefanowski 2015; Kang and Cho 2006; Liu, Wu, and Zhou 2009). Two representatives of these methods are Easy Ensemble and Balance Cascade (Liu, Wu, and Zhou 2009). In short, Easy Ensemble independently samples with replacement several subsets from the majority instances and builds a classifier for each subset; all the generated classifiers form a single ensemble for the final decision. Balance Cascade is similar to Easy Ensemble in structure; the main difference is that Balance Cascade iteratively removes the majority examples that were correctly classified by the classifiers.

Evaluation metric optimization has been gaining popularity in recent years (Parambath, Usunier, and Grandvalet 2014; Eban et al. 2017; Norouzi et al. 2016), but few researchers have used it to tackle the imbalanced-data classification problem. A popular solution is to approximate the discrete evaluation metric with a continuous loss (Eban et al. 2017; Herschtal and Raskutti 2004), on which gradient-based updating methods can be used. The problem is that it is usually hard to find appropriate approximations for many evaluation metrics. In addition, this solution is not applicable to non-parametric classifiers such as the decision tree (DT), k-nearest neighbor (KNN), and other rule-based models. Another popular solution borrows ideas from the reinforcement learning literature: it samples from the model during training and directly optimizes the reward over the model parameters with policy gradient ascent methods (Norouzi et al. 2016; Ranzato et al. 2015). In theory, this class of methods can be applied to any evaluation metric; however, it also suffers from not being applicable to non-parametric models. Finally, Evolutionary Undersampling (EUS) (García and Herrera 2009) applies an evolutionary algorithm to this end. In EUS, each chromosome is a binary vector representing the presence or absence of instances in the dataset. Its time complexity is $O(TNC)$, where $T$ is the number of iterated generations, $N$ is the population size, and $C$ is the cost of evaluating a candidate (including training and testing a classifier). The drawback of this algorithm is that it is only practical with quite simple classifiers (such as 1NN); otherwise, its time complexity becomes prohibitive.

## Method

The proposed method trains a data sampler to select a subset of the training dataset, with the goal of making the classifier achieve the optimum performance over the evaluation metric. Computing the optimum solution is NP-hard, so we must resort to an approximation. In the following, we first formulate this problem precisely and then show how to approximate it via reinforcement learning.

### Formulation

Let $\Im(A)$ denote a subset of $A$. Our approach involves a training dataset $\{X, Y\}$, a data sampler $w: \{X, Y\} \rightarrow \Im(\{X, Y\})$, a supervised classifier $f: x \rightarrow \hat{y}$ that is to be trained on $\Im(\{X, Y\})$, and a specially defined evaluation metric $G: \{Y, \hat{Y}\} \rightarrow \mathbb{R}$. The problem can then be specified as finding the best possible $w$. Ideally, we would take $w^*$ defined by

$$w^*(\{X, Y\}) := \arg\max_{\Im(\{X,Y\})} G\big(\{Y, f(X; \Im(\{X, Y\}))\}\big), \tag{1}$$

where $f(X; \Im(\{X, Y\}))$ denotes the predicted labels $\hat{Y}$ of $X$ by the classifier $f$ trained on $\Im(\{X, Y\})$.
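To make the objective in Eq. (1) concrete, the sketch below (the names and the decision-tree/F1 choices are illustrative assumptions, not the paper's) evaluates the inner quantity for one candidate subset: train a fresh copy of the classifier $f$ on the subset and score its predictions on the full training set with the metric $G$.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import f1_score                  # stands in for the metric G
from sklearn.tree import DecisionTreeClassifier       # stands in for the classifier f

def subset_score(f, G, X, Y, keep_mask):
    """Train a fresh copy of f on the subset selected by keep_mask and score it with G."""
    f_sub = clone(f).fit(X[keep_mask], Y[keep_mask])
    return G(Y, f_sub.predict(X))

# Example: score one random candidate subset that keeps all minority and ~10% of majority points.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
Y = (rng.random(1000) < 0.1).astype(int)
keep = (Y == 1) | (rng.random(1000) < 0.1)
print(subset_score(DecisionTreeClassifier(random_state=0), f1_score, X, Y, keep))
```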
But in general, we do not expect this to be achievable. Instead, we aim for a good approximation of $w^*$,

$$w(\{X, Y\}) \approx \arg\max_{\Im(\{X,Y\})} G\big(\{Y, f(X; \Im(\{X, Y\}))\}\big), \tag{2}$$

rather than the best one.

### Characterize as a Markov Decision Process

We approach the task of approximating (1) via reinforcement learning. Specifically, we characterize the problem as a Markov decision process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{T}, \mathcal{I})$, where $\mathcal{S}$ is the state space; $\mathcal{A}$ maps a state $s \in \mathcal{S}$ to the set of possible actions $\mathcal{A}(s)$ available in $s$; $\mathcal{R}$ maps a state $s \in \mathcal{S}$ and an action $a \in \mathcal{A}$ to the reward $\mathcal{R}(s, a) \in \mathbb{R}$; $\mathcal{T}$ characterizes the transitions made by the MDP, $\mathcal{T}: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$; and $\mathcal{I}$ is the distribution of the initial state $s_0 \in \mathcal{S}$.

A policy $\pi(a|s; \theta) = p(a|s; \theta)$ defines the probability of performing action $a$ given that we are in state $s$. Here we write $\theta$ inside the probability to denote that the probability is determined by the parameter $\theta$. Given a policy $\pi$, the MDP starts by sampling an initial state $s_0$ according to $\mathcal{I}$ and then evolves according to

$$s_{t+1} := \mathcal{T}\big(s_t, a_t \sim \pi(a|s_t; \theta)\big)$$

at each step $t \geq 1$. For reasons that will be made clear below, we only consider deterministic $\mathcal{T}$ and impose a finite horizon of $T$ steps on our MDP, so that we do not consider states beyond step $T$. We now look for a good set of parameters $\theta^*$ such that, by following the policy $\pi(a|s; \theta^*)$, we obtain a high expected reward $\mathbb{E}_\tau[R_\tau|\pi; \theta]$:

$$\theta^* = \arg\max_{\theta} \mathbb{E}_\tau[R_\tau|\pi; \theta]. \tag{3}$$

Here $\tau$ denotes a trajectory of the MDP, that is, a sequence $s_0, a_0, r_1, s_1, a_1, r_2, \dots, s_{T-1}, a_{T-1}, r_T$, where $r_{t+1}$ is the reward for having been in state $s_t$ and taken action $a_t$, and $R_\tau = \sum_{t=1}^{T} r_t$.

In this work, we use the policy gradient method to solve this optimization problem. Weights are updated by stochastic gradient ascent in the direction that maximizes the expected reward,

$$\nabla_\theta \mathbb{E}_\tau[R_\tau|\pi; \theta] = \mathbb{E}_\tau\big[R_\tau \nabla_\theta \log \pi(\tau; \theta)\big],$$

where $\pi(\tau; \theta) = \pi(a_0|s_0; \theta) \cdots \pi(a_{T-1}|s_{T-1}; \theta)$ denotes the trajectory probability of $\tau$.

We now show how the problem stated in the Formulation section can be formulated within this framework. We assume that the labeled dataset contains $T$ examples with a fixed order, and $(x_t, y_t)$ denotes the $t$-th example. And we will refer X 1 and X
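The following is a deliberately simplified sketch of this training loop, not the authors' implementation: it collapses the MDP into a single-step episode with one Bernoulli keep/drop decision per majority instance, and assumes a linear (logistic) policy, a KNN classifier, and F1 as the reward; all of these choices are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier  # any classifier, parametric or not

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sampler(X, Y, n_iters=200, lr=0.1, seed=0):
    """Learn keep/drop probabilities for majority instances via REINFORCE.

    Reward = F1 of a classifier trained on the sampled subset (a stand-in for G).
    """
    rng = np.random.default_rng(seed)
    maj = np.where(Y == 0)[0]                             # assume label 0 is the majority class
    minority = np.where(Y == 1)[0]
    feats = np.hstack([X[maj], np.ones((len(maj), 1))])   # per-instance features plus a bias term
    theta = np.zeros(feats.shape[1])                      # linear keep/drop policy parameters
    baseline = 0.0                                        # moving-average reward baseline
    for _ in range(n_iters):
        p_keep = sigmoid(feats @ theta)
        actions = (rng.random(len(maj)) < p_keep).astype(float)   # 1 = keep, 0 = drop
        keep = np.concatenate([maj[actions == 1], minority])
        clf = KNeighborsClassifier().fit(X[keep], Y[keep])
        reward = f1_score(Y, clf.predict(X))              # evaluation metric as the reward
        # REINFORCE: the gradient of sum_i log pi(a_i|x_i) for a Bernoulli policy is (a_i - p_i) x_i
        grad = feats.T @ (actions - p_keep)
        theta += lr * (reward - baseline) * grad
        baseline = 0.9 * baseline + 0.1 * reward
    return theta
```

In the paper's formulation, the sampling decisions are instead made sequentially over the $T$ ordered examples as an MDP, but the reward signal (the metric achieved by the classifier trained on the sampled subset) and the policy-gradient update follow the same pattern as in this sketch.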