# anglebased_multicategory_distanceweighted_svm__7dbdc160.pdf Journal of Machine Learning Research 18 (2017) 1-21 Submitted 1/17; Revised 5/17; Published 7/17 Angle-based Multicategory Distance-weighted SVM Hui Sun sun400@purdue.edu Department of Statistics Purdue University West Lafayette, IN 47906, USA Bruce A. Craig bacraig@purdue.edu Department of Statistics Purdue University West Lafayette, IN 47906, USA Lingsong Zhang lingsong@purdue.edu Department of Statistics Purdue University West Lafayette, IN 47906, USA Editor: John Shawe-Taylor Classification is an important supervised learning technique with numerous applications. We develop an angle-based multicategory distance-weighted support vector machine (MDWSVM) classification method that is motivated from the binary distance-weighted support vector machine (DWSVM) classification method. The new method has the merits of both support vector machine (SVM) and distance-weighted discrimination (DWD) but also alleviates both the data piling issue of SVM and the imbalanced data issue of DWD. Theoretical and numerical studies demonstrate the advantages of MDWSVM method over existing angle-based methods. Keywords: Discriminant analysis, Imbalanced data, High dimension, Support vector machine, Distance-weighted discrimination 1. Introduction Classification is important in both statistics and machine learning. The goal of classification is to build a classifier such that it can predict the category of a new observation. Popular classification methods include Fisher s linear discriminant analysis, logistic regression, support vector machine (SVM), and boosting. See Hastie et al. (2001) for an introduction to various classification methods. SVM (Sch olkopf and Burges, 1999; Cristianini and Shawe-Taylor, 2000) has been shown to be a very popular and powerful method. It is well known that the binary SVM searches for a hyperplane in the feature space (that is a type of projection space from the original data) that maximizes the margin (a gap between the two groups). SVM has numerous applications, such as image classification (Chapelle et al., 1999; Foody and Mathur, 2006) and cancer diagnostics (Duan et al., 2005; Wang and Huang, 2011). c 2017 Sun, Craig and Zhang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/17-003.html. Sun, Craig and Zhang In a high-dimensional, low sample size (HDLSS) setting, Marron et al. (2007) and Ahn and Marron (2010) observed a data-piling phenomenon with the binary SVM and other classification methods. A SVM-type linear classifier is a margin-based classifier. It has a separating hyperplane, and its normal vector is essentially the discriminant direction. Datapiling is the phenomenon when projecting the data points to the discriminant direction that many of these projections are identical. This phenomenon indicates that the resulting separating hyperplane might be affected by noise artifacts in the data. Noise artifacts result in a discrimination direction far away from the Bayes direction. See more discussion in Ahn and Marron (2010). To alleviate the data-piling issue, Marron et al. (2007) proposed the binary distanceweighted discrimination (DWD) classifier. The idea of DWD is to minimize the total inverse margin of all the data points. This method works quite well in the HDLSS setting. However, because the DWD method uses all the observations to estimate the decision boundary, it is very sensitive to imbalanced sample sizes (Qiao et al., 2010). In particular, when the sample size of one class is much larger than the other, the classification boundary will be pushed towards the minority class and all future data will be assigned to the majority class. See more discussion in Qiao and Zhang (2015a,b) for the HDLSS overfitting issue of SVM and imbalanced data issue of DWD. To deal with both problems, Qiao and Zhang (2015a) proposed a binary distance-weighted support vector machine (DWSVM) method, which can be viewed as a combination of the binary SVM and DWD. The new method inherits both the merits of SVM and DWD yet outperforms both SVM and DWD in the HDLSS and imbalanced context. In practice, many classification problems have more than two classes. A natural way to deal with multiclass classification problems is to take a one-versus-one or one-versus-rest approach, see examples in Hastie et al. (1998) and Allwein et al. (2000). Though both approaches are intuitive, the one-versus-one approach can lead to a tie-in-vote problem and the one-versus-rest approach suffers from inconsistency when there is no dominant class (Lee et al., 2004). It is more desirable to consider all classes simultaneously. In a multiclass setting, the observed data are (xi, yi), i = 1, . . . , n, where xi Rd is a multivariate predictor, the scalar yi {1, . . . , K} is the corresponding class label, with K as the number of classes. Many classification approaches map x to f(x) RK, and the corresponding prediction rule is ˆy = arg maxi fj(x), where fj is the jth element of f. In this type of approach, a constraint such as PK j=1 fj = 0 is usually imposed to remove redundancy and reduce the dimension of the problem. See Zhu and Hastie (2001); Lee et al. (2004); Liu and Shen (2006); Liu (2007); Liu and Yuan (2011) for more discussion. Fisher consistency for several existing multicategory hinge loss functions are also provided in Liu (2007). It is straightforward to see that the sum-to-zero constraint can be removed if we redefine f in RK to be in RK 1, as the degrees of freedom of f is essentially K 1. Several classifiers have been proposed using this fact. For example, Lange and Wu (2008) proposed a vertex based model that maps the data x in Rd to a f(x) RK 1, and predicts the label based on the distance of f(x) to K predefined vertices. Zhang and Liu (2014) proposed a similar idea where they define the same K vertices in the Rk 1 space, and use the angle between f to these vertex vectors to predict labels. More details are given in Section 2 of this paper. Angle-based Multicategory Distance-weighted SVM The angle-based method can be viewed as a natural extension of the binary large margin classifier to the multiclass context. Zhang and Liu (2014) replace the usual functional margin by the angle (or inner product) between the projection f and the vertices. Their simulation results show that these angle-based classifiers have good prediction performance. The Fisher consistency of a family of large margin classifiers is also proved. However, as a specific case in large margin classifiers, the angle-based SVM (MSVM) method is not Fisher consistent because its loss function is not a strictly monotone decreasing function. In Zhang and Liu (2014), Fisher consistency of a proximal SVM was proposed and proven instead. In our experiment in Section 2, under HDLSS and imbalanced data setting, we observed that MSVM suffers from data piling issue. Binary DWD, based on the idea of Zhang and Liu (2014), can be extended to multicategory angle-based DWD (MDWD). Though free from the data piling concern, MDWD suffers from the imbalanced issue. Both these issues were previously observed in the binary case (Marron et al., 2007; Qiao and Zhang, 2015a). Note that Huang et al. (2013) extended the binary DWD to a version of multiclass DWD (MDWDH). It adopts the idea of pairwise comparisons from one-versus-one idea. MDWDH considers a data point with label i being misclassified if the difference of projections on ith member and jth member is negative (i = j). Even though it has nice theoretical properties and empirical performance as shown in their paper, the angle-based methods have better geometric interpretation. In addition, the pair-wise natural of the method will have more expensive computational cost, compared to angle-based approaches. We also observed that the performance is slightly better than angled-based MDWD method. If the angle-based MDWD approach also incorporates the pairwise idea, the two approaches will have similar empirical performance. In this paper, we adapt the idea in Qiao and Zhang (2015a) to develop a hybrid of MSVM and MDWD. The work can also be viewed as an extension of the binary DWSVM to the multiclass context. We prove its Fisher consistency and use extensive simulation studies to show the usefulness of our approach. For many cases, the novel approach outperforms both MDWD and MSVM, especially under HDLSS and the imbalanced case. The rest of this article is organized as follows. In Section 2, we briefly review the existing multicategory classifiers, and introduce our angle-based distance-weighted support vector machine (MDWSVM) model. In Section 3, we prove the Fisher consistency and show some imbalance properties of our new approach. In Section 4, we perform simulation studies to compare our model with MSVM and MDWD. The sensitivity of the prediction performance in terms of the tuning parameters is also explored in this section. Section 5 involves a real application and Section 6 discusses some future work for this model. The proofs of all theorems and lemmas are given in the appendix. 2. Methodology In this section, we give a general introduction to classification, including the angle-based multicategory classifier. We then show some drawbacks of this angle-based classification, which motivates our MDWSVM. We conclude with a detailed introduction of our approach, along with its implementation. Sun, Craig and Zhang 2.1 Classification and Loss Function Consider a binary classification problem with observed data (xi, yi), i = 1, . . . , n. The xi Rd is a multivariate predictor and the scalar yi {1, 1} is the corresponding class label. The goal is to find a decision function f along with its prediction ˆy(x) = sign(f(x)) to minimize the misclassification error E( ˆY = Y ). Note that when yf(x) > 0, f(x) gives a correct prediction; otherwise f(x) gives a misclassification. A natural way to estimate the misclassification error is to use the empirical error 1/n P I(ˆyi(x) = yi) = 1/n P I(yif(xi) < 0), where I(.) is the indicator function. However, due to the discontinuity and nonconvexity of I(yif(xi) < 0), it is hard to conduct a direct minimization. A common surrogate is a convex loss function ℓ(.), which is commonly used in large margin classifiers (Hastie et al., 2001). A large margin classifier can be viewed as minimizing the loss function given a constraint i=1 ℓ(f(xi), yi) + λ where F denotes the function space and J(.) is a type of norm, which is used to control the complexity of the model. The function ℓ(f, y) is a loss function surrogate for the 0-1 loss. The tuning parameter λ balances the loss and the norm. For example, the popular linear SVM uses the hinge loss function ℓS(u) = (1 u)+ where u = yf(x), and the L2 norm. The SVM method can also be viewed as maximizing the smallest distances of all observations to the separating hyperplane. As discussed in Section 1, SVM suffers from the data piling problem in HDLSS setting. Marron et al. (2007) proposed the DWD method, which improves the performance of SVM in the HDLSS setting. Essentially, DWD minimizes the mean of inverse distance of all data vectors to the separating hyperplane. As is discussed in Bartlett et al. (2006); Liu et al. (2011); Qiao and Zhang (2015a), DWD is also a large margin classifier, and its loss function is ( 2 u u 1 1/u otherwise. (1) In practice, lots of applications deal with multicategory rather than binary classification. For multiclass problems, yi {1, 2, . . . , K}, i = 1, . . . , n, with K the number of classes. The common simultaneous procedure is to map x to f(x) RK, and the corresponding prediction rule is ˆy = arg maxj fj(x), where fj is the jth element of f. Commonly a sum to zero constraint on f is used as discussed in Section 1 to overcome identifiability issues, see more discussion in Vapnik and Vapnik (1998); Lee et al. (2004); Liu and Yuan (2011). Many multicategory classification methods can be viewed as the following constrained optimization problem, i=1 ℓ(f(xi), yi) + λ j=1 fj(x) = 0. Angle-based Multicategory Distance-weighted SVM For example, a multicategory SVM with hinge loss (Vapnik and Vapnik, 1998) uses the loss function ℓ(f(x), y) = (1 fy(x))+. However, unlike binary classification, multicategory classification with a sum to zero constraint does not have a clear geometric explanation. It also suffers from expensive computation (Zhang and Liu, 2014). To overcome these limitations, Lange and Wu (2008) proposed the vertex idea where they define f as a K 1 dimensional function instead of a K dimensional function. This removes the need for a sum-to-zero constraint. A similar idea is used later by Zhang and Liu (2014) to conduct angle-based classification, which will be discussed next. 2.2 Angle-based Classification Framework The idea of angle-based classification is to map x to f(x), where f = (f1, . . . , f K 1), with a set of K predefined vertices in RK 1. We then assess which vertex has the smallest angle to the projection f, and the corresponding label is the prediction. In Zhang and Liu (2014), the vertices W = (W1, W2, . . . , WK) are defined as a collection of K vectors in RK 1 with elements ( (K 1) 1/2ζ, j = 1, (1 + K1/2)/{(K 1)3/2}ζ + {K/(K 1)}1/2ej 1, 2 j K. The unit vector ζ is of length K 1, and ej is a vector in RK 1 such that all of its element are 0, except the jth is 1. In this setting, W form a simplex with K vertices in a (K 1) dimensional space. The center of W is at the origin, and each of the Wj, j = 1, . . . , K has Euclidean norm of 1. Further, it is easy to check that the angle between each pair of vertices Wi and Wj, i = j is the same. Instead of yi, Wyi is used to represent the observed class. The prediction function is ˆy = arg maxj Wj, ˆf , where the inner product ., . between the two vectors denotes the projection of ˆf to Wj. The larger the inner product, the smaller the angle between ˆf and Wj. With this prediction rule, Zhang and Liu (2014) proposed the optimization model for the angle-based classification min f F 1 n i=1 ℓ( f(xi), Wyi ) + λ 2 J(f). (2) The product f(xi), Wyi can be viewed as a new functional margin of (x, y). Defining u = f(xi), Wyi , one of the examples given in Zhang and Liu (2014) is the multicategory angle-based SVM (MSVM) where the loss function is replaced by hinge loss ℓS(u) = (1 u)+ and with L2 norm. DWD loss can be applied to this framework as well, along with more generalizations of binary large margin classifiers. See Zhang and Liu (2014) for more details. 2.3 From Binary DWSVM to MDWSVM Through the exploration of the angle-based classification method, we found that MSVM has similar data piling issues; while MDWD does not have data piling problems, it suffers from imbalanced issues. To demonstrate the data piling and imbalanced issues, we show projection plots of a simulated example. In this example, we randomly simulate three Sun, Craig and Zhang Figure 1: Plots of projections and Wj s in R2 space. Dashed lines are the Wj s, j = 1, . . . , 3; Dots, triangles and squares represent the points from the three different classes. The left panel shows the projection plot for the Bayes classification, the middle one is for anglebased MSVM, and the right one is for the angle-based MDWD. The middle panel shows the MSVM has severe data piling issues (the middle panel), and MDWD in the right panel suffers from imbalanced issues (the right panel). classes of observations with 500 covariates, the sample sizes for each class are 100, 50, 50 respectively. For each group, the first two covariates are distributed N(µj, σ2I2), where µj s are three fixed points equally spaced on the unit circle, and σ = 0.5. All other covariates are independently and identically distributed N(0, σ2). Note that the data are HDLSS and imbalanced. One representation of the projection plots is given in Figure 1. This plot is used to visualize the n projections f(xi) s, i = 1, . . . , n and vectors Wj s, j = 1, . . . , 3 (dashed lines) in the R2 plane. The different colors and shapes correspond to different groups. The solid purple lines are the Bayesian decision boundaries. X1 and X2 are the two axes in this R2 plane. From Figure 1 it is clear that MSVM has severe data piling issues since almost all the points project to a single point on the W direction. Furthermore, MDWD suffers from the imbalanced issue as the angle-based classification assigns almost all the points to the dominant Class 1. These findings agree with those in Qiao and Zhang (2015a) for the binary case. To alleviate both data piling and imbalanced issues, Qiao and Zhang (2015a) proposed binary DWSVM, a combination of SVM and DWD. The method has the form min f F 1 n i=1 αℓD(yif0(xi)) + (1 α)ℓS(yif(xi)) + λ 2 J(f), (3) where f0(x) = xω + β0 and f(x) = xω + β, ω Rd is the coefficient direction vector and scalar β, β0 R. The loss function ℓD is from formula (1) and ℓS is the hinge loss. Notice that β is the SVM intercept and β0 is the DWD intercept, which is called auxillary intercept in the DWSVM paper. The prediction function is ˆy = sign(f(x)) = sign(x T i ω + β). The tuning parameter 0 < α < 1 is used to balance SVM and DWD losses. Angle-based Multicategory Distance-weighted SVM Qiao and Zhang (2015a) show that in binary classification, by choosing the appropriate α, the DWSVM method will result in a smaller misclassification error compared to both the DWD method and SVM method in HDLSS and imbalanced data context. Furthermore, in terms of the similarity to the Bayes classifier, the DWSVM are similar to the DWD but better than the SVM. The DWSVM method motivated us to build a multicategory DWSVM within the anglebased framework. Applying DWSVM to the angle-based framework, we propose a multicategory angle-based DWSVM (MDWSVM) min f F 1 n i=1 αℓD( f0(xi), Wyi ) + (1 α)ℓS( f(xi), Wyi ) + λ 2 J(f). (4) In this model f(xi) = xi B +β0 and f0(xi) = xi B +βd 0, where B = (B1, B2, . . . , BK 1), each of the Bj, j = 1, . . . , K 1 is a vector of length d, which does not include the intercept.The parameters β0, βd 0 RK 1 are intercept vectors. Note that f0 and f are only different in terms of the intercept β0 and βd 0 respectively. In this model, the interest is to find B, β0 and βd 0 to minimize the loss function. For prediction ˆy = arg maxj Wj, ˆf = arg maxj Wj, xi B + β0 , however, only β0 and B are used. This avoids the imbalanced issue cost by βd 0. Note that the prediction arg maxj Wj, ˆf is equivalent to arg minj (Wj, ˆf) where (a, b) represents the angle between vector a and b . We predict x with the label j such that vertex Wj and f(x) has the smallest angle among all (Wj, ˆf), j = 1, . . . , K. Observe that PK j=1 Wj, ˆf = 0 for all x, which means the angle-based classification framework automatically includes sum-to-zero constraints. 2.4 Implementation of MDWSVM In Qiao and Zhang (2015a), the implementation of the binary DWSVM (3) was through second-order cone programming. Mathematically, the DWSVM model can be written as min ω,β,β0,ηi,ξi ri + ηi) + (1 α)ξi}, s.t. ri = yi(x T i ω + β) + ηi, ri 0 and ηi 0, yi(x T i ω + β) + ξi 1, ξi 0, The first constraint ri = yi(x T i ω + β) + ηi is the distance from each data vector i to its separating hyperplane (adding slackness to allow misclassification), which corresponds to the DWD optimization. The second constraint yi(x T i ω+β)+ξi 1 is a standard constraint in SVM optimization. Both ηi and ξi control the misclassification rate, but with different decision boundaries. The third constraint ω 2 C is equivalent to the second term in (2), the Euclidean norm. To extend (5) to multiclass, we replace the distances (functional margins) to the inner product as introduced earlier in (2). Thus our MDWSVM will have the following mathe- Sun, Craig and Zhang matical form: ri + ηi) + (1 α)ξi}, s.t. ri = f0(xi), Wyi + ηi, ri 0 and ηi 0, f(xi), Wyi + ξi 1, ξi 0, j=1 BT j Bj C. In this form f0 and f are the same as in (4). It is verified in Zhang and Liu (2014) that the first term in the objective function Pn i=1( 1 ri + ηi) along with its constraint is equivalent to the objective of the MDWD method, and the second term in the objective function Pn i=1 ξi along with its constraint is equivalent to the objective in the MSVM method. In this case, MDWSVM can be viewed as a convex combination of MDWD and MSVM losses where the parameter α balances the two. Model (6) can be easily implemented in Matlab using the CVX package (Grant et al., 2008). Notice the only difference between f(x) and f0(x) is their location vectors β0 and βd 0. For prediction we only adopt the location vector from MSVM, which shows insensitivity to the imbalanced issue from Figure 1. Moreover, by combining the discriminant direction of MDWD and MSVM, our new model will have a better discriminant direction (closer to the Bayes direction) than the MSVM method alone. Both improvements will be shown in Sections 4 and 5 using simulations and real examples. 3. Theoretical Properties Fisher consistency is a fundamental requirement for a classification method. Fisher consistency implies that when the sample size approaches infinity, the classifier becomes closer and closer to the Bayes classification rule, which corresponds to the minimum misclassification rate. Qiao and Zhang (2015a) explored Fisher consistency of the binary DWSVM model. In the multiclass context, Zhang and Liu (2014) extended Fisher consistency to all large margin classification models under the angle-based framework. Let Pj = Pr(Y = j|X = x) for j = 1, . . . , K. Note that ˆy = arg maxj Pj is the Bayes rule. Assume that for a given x, the vector f (x) minimizes E[ℓ{ f(X), WY }|X = x], and the corresponding decision boundary will then be ˆy = arg maxj f (x), WY . Note that this is essentially the limit minimizer of (2) when sample size diverges to infinity. Fisher consistency assures that these two decision functions are the same ( arg maxj Pj = arg maxj f (x), WY ). In this section, we will prove that if using the approximate SVM loss function from Zhang and Liu (2014) in replacement of hinge loss, our MDWSM is Fisher consistent. Theorem 1 The MDWSVM is Fisher consistent for any 0 < α < 1. Fisher consistency in Theorem 1 ensures that the minimizer of the expected loss function will assign an observation to the same class as what Bayes rule does. Furthermore, in our numerical study in Section 4.2, we notice that for MDWSVM method, as long as C is fixed, different α s will give similar performance in both prediction error and closeness to Angle-based Multicategory Distance-weighted SVM the Bayes rule. Thus we will fix α to be 0.5 in this paper and not discuss the choice of α further. In the next theorem, we want to prove that MDWSVM is insensitive to imbalance. Using a similar paradigm as in Owen (2007), we consider the case that the sample size of one class diverges to infinity. Qiao and Zhang (2015a) showed that, in binary classification, the intercept term of DWD diverges, but the intercept of SVM and DWSVM will not diverge. This shows that SVM is not sensitive to imbalance, but DWD will be severely affected. In our multiclass setting, for simplicity, it is assumed that only one of the classes is the dominant one, and the sample size of all other classes are equally fixed. Without loss of generality, we assume their sample sizes are all 1. Under this setting, we can simply assume observation 1, . . . , K 1 belongs to the class 1, . . . , K 1 respectively, and observations K . . . , n belong to class K. As n goes to infinity, the classifier tends to classify all the points to the dominant class K. If this happens, β0, Wy K goes to infinity. In the next proposition, we prove that this will be not be the case for the angle-based SVM. Furthermore, we present in Theorem 3 that the intercept of our MDWSVM model is not sensitive to imbalance either. Proposition 2 In MSVM setting, when the size of the majority class goes to infinity, β0, Wy K < 2CK max |xij| + 1. Theorem 3 In the MDWSVM setting, when the size of the majority class goes to infinity, β0, Wy K < 2CK max |xij| + 1. Note that Theorem 3 does not ensure that the MDWSVM method completely overcomes the imbalanced issue. When the sample size of the majority group goes to infinity, the method still will ignore some observations in minority groups. 4. Simulation In this section, we use three simulation examples to demonstrate the performance of our MDWSVM method. We compare it to the angle-based SVM (MSVM) described in Section 2 and the angle-based DWD (MDWD) naturally developed using the ideas from Zhang and Liu (2014). In each example, we simulate a training data set, a tuning data set, and a testing data set. The training data and tuning data have the same sample sizes and are used to estimate the model and to find the optimal tuning parameters. The size of the testing data set is ten times the size of the training data, and is used to evaluate the prediction performance. As we are interested in the misclassification rate in both the dominant class and the minority classes, we will not use the total error rate 1/n P i I(ˆyi = yi) in this paper. Instead, we use the average within-group error rate as follows i:xi Cj I(ˆyi = j|xi Cj) Here Cj stands for class j and nj is the sample size for class j. This measure was previously introduced in Qiao and Liu (2009). Note that the term within the bracket is the error rate for each group, so r is the arithmetic average of all these error rates. Sun, Craig and Zhang We also want to measure the closeness of the estimated classifier to the Bayes rule. For the binary case, we can measure the angle between the two linear decision boundaries. For the multiclass case, we develop a similar measure as follows. Note that B is the projection matrix from Rd to RK 1 (the projection space). In the binary case, B is the discrimination direction vector, we can use the Euclidean inner product B, BBayes to measure the angle between the estimated and the Bayes rule. For the multiclass case, both B and BBayes are matrices. In matrix form, we want to measure the angle between the jth columns in both B and BBayes, and then calculate an average of these angles. In this paper, we use the Frobenius inner product: B, BBayes F = P i,j Bij BBayes ij. Essentially, this is the sum of entries of the Hadamard product between B and BBayes. One can see that this is the same idea as the inner product of the corresponding columns in B and BBayes, a scaled mean of the inner products. To make this quantity directly linked to angle, we normalize both B and BBayes to have Frobenius norm of 1, and thus B, BBayes = arc cos( B, BBayes F ) will be the angle used in this paper. In all examples, α is set as 0.5, the reason for this is described in Section 4.2. We want to choose C in R+. For convenience, we use the log scale, and set log2 C from -3 to 15. For the first two examples, we generate datasets that have signal based on only a few covariates, and then we add pure noise as additional covariates. To better compare the performance for both balanced and imbalanced scenarios, all the examples are conducted under both balance and imbalance cases. Let p = Pr(Y = 1) and Pr(Y = j) = 1 K 1(1 p) for j = 1. We will consider p = 1/K for the balanced case and p = 1/2 and p = 1/3 for the imbalanced case for all examples. The size of the training dataset for each example is 300, 600 and 300 respectively. In each example, five sets of dimensions are considered: 2, 10, 100, 500 and 1000. The noise covariates are identically independent distributed as N(0, σ2). For the third example, all covariates are signal variables. And for all simulation settings, we repeat the experiments 100 times and report the average performance. 4.1 Performance Comparison Example 1 We generate a three class dataset, where the first two covariates are distributed N(uj, σ2I2). In this setting, the uj s are three points equally spaced on a unit circle, and σ is chosen such that the Bayes error is 0.1. As we can see, this example is similar to the Example 1 in Zhang and Liu (2014) other than that our case considered both the balanced and imbalanced scenarios. Example 2 We generate a five class dataset, Let Pr(Y = 1) = p, and the first five covariates are distributed N(uj, σ2I5). Here uj s are five points equally spaced on the sphere of unit ball in R4, and σ = 0.55. When dimension is larger than 4, the last d 4 covariates are identically independent distributed as N(0, σ2). Example 3 A three groups dataset is generated with dimension d, the centers of the three groups are equally distributed on the sphere of an unit ball in Rd. A random noise N(0, σ2 = 0.552) was added to each dimension. We report the average prediction error rate and the average angle to the Bayes rule in Figures 2. Take Figure 2(a), which corresponds to Example 1, as an example. We report Angle-based Multicategory Distance-weighted SVM p=1/3 p=1/2 p=2/3 Error Angle 0 250 500 750 1000 0 250 500 750 1000 0 250 500 750 1000 Methods MSVM MDWD MDWSVM p=1/5 p=1/2 p=2/3 Error Angle 0 250 500 750 1000 0 250 500 750 1000 0 250 500 750 1000 Methods MSVM MDWD MDWSVM p=1/3 p=1/2 p=2/3 Error Angle 0 250 500 750 1000 0 250 500 750 1000 0 250 500 750 1000 Methods MSVM MDWD MDWSVM Figure 2: Performance comparison plot between the three methods for Examples 1, 2, and 3. The top row of each graph plots the misclassification rate for different dimensions (the x axis) and different prior probabilities (left, middle and right panels). The bottom row is the angle between the estimated and the Bayes rule. For all measures, smaller implies better result. We can see that our method (the solid black line) performs almost the same as the other two methods (MSVM, the red dashed line, MDWD, the dotted blue line) for balanced case, but outperforms the other two for the imbalanced cases. Sun, Craig and Zhang both misclassification rate (the top row) and the angle between the estimated classifiers and the Bayes rule. In the plot, we use black solid lines for our MDWSVM method, red dashed lines for MSVM method, and blue dotted lines for MDWD method. The grid points on the x axis represents the different dimensions d. The y axis corresponds to the performance measure. In our plot, the smaller the y axis value, the better the performance. Different imbalance ratios are visualized in different panels from left to right. The left two panels correspond to the balanced case; the middle panels are mild imbalance (p = 1/2); and the right two panels are more severe imbalanced case (p = 2/3). For the balanced case, our approach performs similar to the MSVM and MDWD methods. However, for the imbalanced cases (the middle and right panels), we can see clear gaps between the performance of our method and the other two methods, demonstrating that our method outperforms the other two. Note that a similar pattern can also be seen in Figures 2(b) and 2(c). All suggests that the novel approach is better than MSVM and MDWD. It is also shown in these plots that as the dimension of training data changes from small to large (2 to 1000), the classifier s performances become worse and worse. Furthermore, the performance differences of the three methods become more pronounced. Note that MDWD gives the worst prediction error rate compared to the other two methods, and MSVM gives the worst classification direction compared to the other two methods. Our MDWSVM gives comparably the best performance in both aspects. 4.2 Sensitivity to Parameters There are two parameters C and α in our MDWSVM method. We have conducted many simulations to evaluate the performance of these two parameters. In this section, we will only use Example 1 to show the performance. We set α to be fixed, varied C, and evaluated its performance. Then we fixed an optimal C to evaluate the sensitivity of our approach to different α s. At the beginning, we let α = 0.5, and allow C to change from 2 3 to 212. The simulation is conducted under different dimensions (100, 500, 1000). All the simulation results are based on 100 replicates. The left panel of Figure 3 is the prediction error under different values of C with different dimensions of training data. It is clearly shown from the graph that as C increases, the prediction error rate first decreases and then increases. It shows that a minimal prediction error can be reached within this range. The right panel shows the relationship between prediction error rate and different α s. It is also clear that the prediction error rate stays the same as α changes from 0.1 to 0.9, regardless of the slightly increase as α approaches to 1. Based on the performance from Figure 3, the change of α has little impact on the prediction error rate compared to a change of C. The performance is quite stable for different α s. Since the property of the parameters are not the focus in this paper, this simulation gives us an easy suggestion of choosing parameters. One can simply fix α = 0.5 and use cross-validation to choose C. This is why we fix α = 0.5 in our simulation. 4.3 Computation Time In this subsection, we will compare the computational time for these methods (MDWSVM, MSVM and MDWD). To test the computational complexity, we only consider the simulation Angle-based Multicategory Distance-weighted SVM Dimension=100 Dimension=500 Dimension=1000 0.2 0.3 0.4 0.5 0.6 0.2 0.3 0.4 0.5 0.6 0.2 0.3 0.4 0.5 0.6 prediction error rate Dimension=100 Dimension=500 Dimension=1000 0.25 0.50 0.75 prediction error rate Figure 3: Average within-group error rate change under different parameters. Left panel is the prediction error rate change under different C value for fixed α = 0.5; right panel is prediction error rate change for different α when C is fixed at its optimal value Dimension MDWSVM MDWD MSVM 10 8.31(0.04) 8.22(0.04) 0.88(0.01) 100 14.52(0.07) 12.80(0.05) 2.79(0.01) 1000 41.66(0.20) 27.43(0.12) 9.30(0.07) Table 1: Computation time comparison for MDWSVM, MDWD and MSVM based on 100 runs of Example 1. The number shows average computing time in seconds, and the number in the parenthesis is the corresponding standard error. of Example 1. We let the dimension change from 10 to 1000. Table 1 gives the average computation time in seconds for 100 replicates along with their standard error. All numerical experiments were carried out on an Intel Xeon E3-1284L (2.5 GHz) processor. Table 1 shows that the most efficient method is MSVM and the most time-consuming method is our MDWSVM. Note that MSVM can still be viewed as a quadratic programming problem, while both MDWD and MDWSVM are conic program problems. It is not surprising that MSVM is the most computational efficient one. It is our expectation that MDWSVM would have the longest time to run, since it combines both MSVM and MDWD. From equation 6, we can see that the number of parameters can be viewed as the sum of the ones for MDWD and MSVM. Thus the computation times it takes to solve the problem increases as well. It is good enough that the computational time of MDWSVM is shorter than the sum of the computing time of each individual methods. It is worth mentioning that as dimension increases, the CPU times for the three methods increase. Sun, Craig and Zhang 2002/01/01 New Year 2002/09/02 Labor Day 2002/05/27 Memorial Day 2002/11/28 Thanksgiving 2002/07/04 Independence Day 2002/12/25 Christmas Table 2: Six national holidays on weekdays that are removed 5. Real Data Application In this section, we apply our MDWSVM method to a real data used in Shen and Huang (2005). The data were gathered at an inbound call center of a major northeastern U.S financial firm in 2002, and describe the call volume from 7:00am-12:00am. Each day is divided into 408 150-second intervals and the number of phone calls is recorded in each interval. Due to equipment malfunctioning, there are 6 missing weekdays within the whole year. The call volume data form a 360 408 matrix, where each row corresponds to a day and each column is the call volume for one of the 150-second intervals. Note that the data have been thoroughly analyzed in Shen and Huang (2005). Here we simply add some new insights from the data by using our novel approach. According to their analysis, the pattern for Saturday and Sunday is very different from the weekdays. Thus in this analysis, we only focus on the weekdays. Shen and Huang (2005) show that for weekdays, by using singular value decomposition to analyze the number of phone calls, Monday and Friday are slightly different from all other weekdays, see Section 5.3 and Figure 6 of Shen and Huang (2005) for more details. Tuesday, Wednesday and Thursday are hard to tell apart from each other. In this section, we only focus on classifying Monday, Friday and other weekdays (Tuesday, Wednesday and Thursday). In addition, due to the fact that the center has very low volumes on national holidays, we remove the holidays that fall on weekdays. Table 2 provides the national holidays excluded in our analysis. These days, along with some other more holidays, were also removed in Shen and Huang (2005). After removing these holidays, we have 48, 50, 51, 50, 52 days for Monday to Friday respectively. The data are divided into three groups, Group 1: Monday (size 48); Group 2: Tuesday to Thursday (size 151); Group 3: Friday (size 52). This dataset is a typical imbalanced HDLSS dataset with Group 2 as the dominant group. The average number of phone calls on each time interval are presented in Figure 4. From Figure 4, we can see that the average number of phone calls on Monday is quite distinct from the other days as it is larger than the other two groups. However, Group 2 and 3 are hard to distinguish from each other by only looking at the average number of phone calls. For this data set, we will compare the performance of three classifiers: our MDWSVM, MSVM, MDWD. To obtain a good evaluation, all three methods use five-fold cross validation. And the evaluation measures are the total error rate (TER) and the average within-group error rate (AER). We also report the prediction error within each group. Table 3 provides these results for the three different methods. From Table 3 it is straightforward to conclude that the prediction error of MDWSVM for each of the groups is the smallest, as well as the total error rate and average with-group prediction error rate. Our MDWSVM model works very well for this data set. Taking another look at the Table 3, the prediction error rate for Monday and Friday are more than 30%, which seems to be large. The result could be explained by the fact Angle-based Multicategory Distance-weighted SVM 0 100 200 300 400 Time intervals Average calls in each time interval Figure 4: Average number of calls in each time interval for the three classes Monday, Tuesday to Thursday and Friday. In this plot, the red dotted line is the number of phone calls for Mondays, the black solid line is for Tuesday to Thursday, and the blue dashed line is for Friday. MDWD MSVM MDWSVM Mon. 0.8156 0.6200 0.4111 Tue. - Thu. 0.0396 0.0594 0.0985 Fri. 0.6145 0.5527 0.3200 TER 0.3098 0.2702 0.2053 AER 0.4899 0.4107 0.2765 Table 3: Prediction error for number of phone calls. The top 3 rows report the crossvalidation prediction error for each class, and the bottom two rows are the total error rate and the average within-group error rate. Sun, Craig and Zhang that the DWSVM used here is a linear classifier, while the nature of the data may not be well classified by a linear classifier. If we incorporate a kernel approach to our classifier, the performance should improve. 6. Discussion In this paper, we proposed a new angle-based method MDWSVM. We show in this paper that in HDLSS and the imbalanced case, our novel MDWSVM method has smaller misclassification error relative to both MDWD and MSVM. In addition, it is closer to the Bayes discriminant direction compared to that of MSVM. The MDWSVM can be viewed as an extension of the binary DWSVM to the multicategory case. Furthermore, as our model considers both SVM and DWD loss, it can be treated as a weighted hybridization of MSVM and MDWD. One of the limitations of DWD-type methods is that it suffers from slow computation time compared to SVM-type methods (See Chapter 4.3 for examples). Recently an alternating direction method of multipliers (ADMM) algorithm has been implemented by Lam et al. (2016) for the DWD type methods, which can handle large size data more efficiently. A similar implementation for our MDWSVM method will be explored in the future. Note that we only consider linear classifiers for simplicity. This method can be extended to a kernel approach (Liu and Yuan, 2011) by allowing f to be a nonlinear mapping. The implementation of such a generalization will be our immediate future work. Our MDWSVM uses the squared norm as the regularization component so it does not have a variable selection property. To better deal with data of high dimension, variable selection penalties can be added to the model, e.g. LASSO (Tibshirani, 1996) or Elastic net (Zou and Hastie, 2005). Work on this type of generalization will follow. Acknowledgments We would like to thank the reviewers for the valuable comments and suggestions. Also thank Cheng Li (Purdue University) for the insightful discussion on proof of theorems. Angle-based Multicategory Distance-weighted SVM Appendix A. In this appendix we prove the following theorems and proposition from Section 3: Lemma 1 (Zhang and Liu, 2014): Suppose we have an arbitrary f RK 1. For any u, v {1, . . . , K} such that u = v, define Tu,v = Wu Wv. For any scalar z R, (f + z Tu,v), Ww = f, Ww , where w {1, . . . , K} and w = u, v. Furthermore, we have that (f + z Tu,v), Wu f, Wu = (f + z Tu,v), Wv + f, Wv . The proof of Lemma 1 is given in Zhang and Liu (2014). From Lemma 1, one can see that for a given f, if we move it along the direction of Tu,v, the inner product of f and Ww will stay the same when w = u, v. Furthermore, the sum of inner product f, Wu + f, Wu , Wu will remain unchanged as well. This lemma will help us to prove the Fisher consistency of the MDWSVM method. Theorem 1. The MDWSVM is Fisher consistent for any 0 < α < 1. Proof. Recall the definition of f is that (f , f 0 ) = arg min f,f0 E[(1 α)ℓs{ f(X), WY } + αℓd{ f0(X), WY }|X = x]. We need to show that when P1 > P2, W1, f > W2, f . This can be easily proved by contradiction. If W1, f = W2, f , Let f 0 = f , here we can see that as is only the difference of intercept, which is independent of X. Let (f , f 0 ) = (f , f 0 ) be such that Wj, f = Wj, f for j 3 and W1, f = W1, f + ϵ, W2, f = W2, f ϵ. This (f , f 0 ) exists based on Lemma 1 and the fact that inner product is continuous. To get the required f , we only need to move f along the direction of T1,2. Then it is easy to get j=1 Pj[(1 α)ℓs{ f , Wj } + αℓd{ f 0 , Wj }] j=1 Pj[(1 α)ℓs{ f , Wj } + αℓd{ f 0 , Wj }] =ϵ(P1 P2)(1 α)ℓ s{ f , W1 } + o(ϵ) Since we are using proximal hinge loss, ℓs is differentiable, P1 P2 > 0, ℓ s < 0 and 0 < α < 1. we have PK j=1 Pj[(1 α)ℓs{ f , Wj } + αℓd{ f 0 , Wj }] < PK j=1 Pj[(1 α)ℓs{ f , Wj } + αℓd{ f 0 , Wj }], which is a contradiction. For W1, f < W2, f case, if P1ℓ s{ f , W1 } P2ℓ s{ f , W2 } < 0, then choose (f , f 0 ) = (f , f 0 ) be such that Wj, f = Wj, f for j 3 and W1, f = Sun, Craig and Zhang W1, f + ϵ, W2, f = W2, f ϵ. Then we have j=1 Pj[(1 α)ℓs{ f , Wj } + αℓd{ f 0 , Wj }] j=1 Pj[(1 α)ℓs{ f , Wj } + αℓd{ f 0 , Wj }] =ϵ(1 α){P1ℓ s{ f , W1 } P2ℓ s{ f , W2 }} + o(ϵ) < 0 We can see that If P1ℓ s{ f , W1 } P2ℓ s{ f , W2 } > 0, then choose (f , f 0 ) = (f , f 0 ) be such that Wj, f = Wj, f for j 3 and W1, f = W1, f ϵ, W2, f = W2, f + ϵ. Then we have j=1 Pj[(1 α)ℓs{ f , Wj } + αℓd{ f 0 , Wj }] j=1 Pj[(1 α)ℓs{ f , Wj } + αℓd{ f 0 , Wj }] =ϵ(1 α){ P1ℓ s{ f , W1 } + P2ℓ s{ f , W2 }} + o(ϵ) < 0 We can see that this is a contradiction. This completes the proof. Proposition 2. In MSVM setting, when the size of the majority class goes to infinity, β0, Wy K < 2CK max |xij| + 1. Proof. Assume observations 1, . . . , K 1 belong to the class 1, . . . , K 1 respectively, and observations K, . . . , n belong to class K. i=1 ℓs{ f(xi), Wyi } i=1 ℓs{ f(xi), Wi } + i=K ℓs{ f(xi), WK } i=1 ℓs{ x T i B, Wi + β0, Wi } + i=K ℓs{ x T i B, WK + β0, WK } Now we prove that B Rp (K 1), we have β0, WK < sup i | x T i B, Wi |K + 1 < 2CK max |xij| + 1. We can use contradiction to prove it, if β0, WK > supi | x T i B, Wi |K + 1, then ℓs{ x T i B, WK + β0, WK } = 0 for all i {K, . . . , n} as x T i B, WK + β0, WK > 1. Angle-based Multicategory Distance-weighted SVM i=1 ℓs{ f(xi), Wyi } = i=1 ℓs{ x T i B, Wi + β0, Wi }. Then d L dβ0 = PK 1 i=1 l s{ x T i B, Wi + β0, Wi }W i. Since β0, WK > supi | x T i B, Wi |K + 1, one can get that u K = x T KB, WK + β0, WK > 1. Based on the property of W, we have PK i=1 β0, Wi = 0. Furthermore, it is easy to deduct that min β0, Wi < supi | x T i B, Wi | for i {1, . . . , K 1}. Then min ui = x T i B, Wi + min β0, Wi < 0 for i = 1, . . . , K 1. Thus we can choose K 1 different values K1, K2, , . . . , KK 1 from 1, . . . , K 1 such that u K1 u K2 . . . 0 . . . u KK 1. Assume i0 = max{i, u Ki < 1}, then d L dβ0 = PKKi0 i=KK 1 W Ki. One can simply verify that d L dβ0 = 0 based on the property of vertex W. Thus β0 cannot be the β0 that minimize the loss function given B. This step completes the prove. Theorem 3. In the MDWSVM setting, when the size of the majority class goes to infinity, β0, Wy K < 2CK max |xij| + 1. Proof. Based on the proof of Proposition 2, for any B Rp (K 1), we have β0, WK < sup i | x T i B, Wi |K + 1 < 2CK max |xij| + 1. Thus for MDWSVM model, no matter what B we get, the intercept only comes from MSVM part. Therefore, from the conclusion of Proposition 2, Theorem 3 is proved. Jeongyoun Ahn and JS Marron. The maximal data piling direction for discrimination. Biometrika, 97(1):254 259, 2010. Erin L Allwein, Robert E Schapire, and Yoram Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1(Dec): 113 141, 2000. Peter L Bartlett, Michael I Jordan, and Jon D Mc Auliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138 156, 2006. Olivier Chapelle, Patrick Haffner, and Vladimir N Vapnik. Support vector machines for histogram-based image classification. IEEE Transactions on Neural Networks, 10(5): 1055 1064, 1999. Nello Cristianini and John Shawe-Taylor. An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, 2000. Kai-Bo Duan, Jagath C Rajapakse, Haiying Wang, and Francisco Azuaje. Multiple svm-rfe for gene selection in cancer classification with expression data. IEEE Transactions on Nanobioscience, 4(3):228 234, 2005. Sun, Craig and Zhang Giles M Foody and Ajay Mathur. The use of small training sets containing mixed pixels for accurate hard image classification: Training on mixed spectral responses for classification by a svm. Remote Sensing of Environment, 103(2):179 189, 2006. Michael Grant, Stephen Boyd, and Yinyu Ye. Cvx: Matlab software for disciplined convex programming, 2008. Trevor Hastie, Robert Tibshirani, et al. Classification by pairwise coupling. The Annals of Statistics, 26(2):451 471, 1998. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning. 2001. Springer, 2001. Hanwen Huang, Yufeng Liu, Ying Du, Charles M Perou, D Neil Hayes, Michael J Todd, and James Stephen Marron. Multiclass distance-weighted discrimination. Journal of Computational and Graphical Statistics, 22(4):953 969, 2013. Xin Yee Lam, JS Marron, Defeng Sun, and Kim-Chuan Toh. Fast algorithms for large scale generalized distance weighted discrimination. ar Xiv preprint ar Xiv:1604.05473, 2016. Kenneth Lange and Tongtong Wu. An mm algorithm for multicategory vertex discriminant analysis. Journal of Computational and Graphical Statistics, 17(3):527 544, 2008. Yoonkyung Lee, Yi Lin, and Grace Wahba. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67 81, 2004. Yufeng Liu. Fisher consistency of multicategory support vector machines. In AISTATS, pages 291 298, 2007. Yufeng Liu and Xiaotong Shen. Multicategory ψ-learning. Journal of the American Statistical Association, 101(474):500 509, 2006. Yufeng Liu and Ming Yuan. Reinforced multicategory support vector machines. Journal of Computational and Graphical Statistics, 20(4):901 919, 2011. Yufeng Liu, Hao Helen Zhang, and Yichao Wu. Hard or soft classification? large-margin unified machines. Journal of the American Statistical Association, 106(493):166 177, 2011. JS Marron, Michael J Todd, and Jeongyoun Ahn. Distance-weighted discrimination. Journal of the American Statistical Association, 102(480):1267 1271, 2007. Art B Owen. Infinitely imbalanced logistic regression. Journal of Machine Learning Research, 8(Apr):761 773, 2007. Xingye Qiao and Yufeng Liu. Adaptive weighted learning for unbalanced multicategory classification. Biometrics, 65(1):159 168, 2009. Xingye Qiao and Lingsong Zhang. Distance-weighted support vector machine. Statistics and Its Interface, 8(3):331 345, 2015a. Angle-based Multicategory Distance-weighted SVM Xingye Qiao and Lingsong Zhang. Flexible high-dimensional classification machines and their asymptotic properties. Journal of Machine Learning Research, 16:1547 1572, 2015b. Xingye Qiao, Hao Helen Zhang, Yufeng Liu, Michael J Todd, and James Stephen Marron. Weighted distance weighted discrimination and its asymptotic properties. Journal of the American Statistical Association, 105(489):401 414, 2010. Bernhard Sch olkopf and Christopher JC Burges. Advances in kernel methods: support vector learning. MIT press, 1999. Haipeng Shen and Jianhua Z. Huang. Analysis of call centre arrival data using singular value decomposition. Applied Stochastic Models in Business and Industry, 21(3):251 263, 2005. Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267 288, 1996. Vladimir Naumovich Vapnik and Vlamimir Vapnik. Statistical learning theory, volume 1. Wiley New York, 1998. Hui Wang and Gang Huang. Application of support vector machine in cancer diagnosis. Medical Oncology, 28(1):613 618, 2011. Chong Zhang and Yufeng Liu. Multicategory angle-based large-margin classification. Biometrika, 101(3):625 640, 2014. Ji Zhu and Trevor Hastie. Kernel logistic regression and the import vector machine. In Advances in Neural Information Processing Systems, pages 1081 1088, 2001. Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301 320, 2005.