# anglebased_multicategory_distanceweighted_svm__7dbdc160.pdf

Journal of Machine Learning Research 18 (2017) 1-21 Submitted 1/17; Revised 5/17; Published 7/17

Angle-based Multicategory Distance-weighted SVM

Hui Sun sun400@purdue.edu Department of Statistics Purdue University West Lafayette, IN 47906, USA

Bruce A. Craig bacraig@purdue.edu Department of Statistics Purdue University West Lafayette, IN 47906, USA

Lingsong Zhang lingsong@purdue.edu Department of Statistics Purdue University West Lafayette, IN 47906, USA

Editor: John Shawe-Taylor

Classiﬁcation is an important supervised learning technique with numerous applications. We develop an angle-based multicategory distance-weighted support vector machine (MDWSVM) classiﬁcation method that is motivated from the binary distance-weighted support vector machine (DWSVM) classiﬁcation method. The new method has the merits of both support vector machine (SVM) and distance-weighted discrimination (DWD) but also alleviates both the data piling issue of SVM and the imbalanced data issue of DWD. Theoretical and numerical studies demonstrate the advantages of MDWSVM method over existing angle-based methods.

Keywords: Discriminant analysis, Imbalanced data, High dimension, Support vector machine, Distance-weighted discrimination

1. Introduction

Classiﬁcation is important in both statistics and machine learning. The goal of classiﬁcation is to build a classiﬁer such that it can predict the category of a new observation. Popular classiﬁcation methods include Fisher s linear discriminant analysis, logistic regression, support vector machine (SVM), and boosting. See Hastie et al. (2001) for an introduction to various classiﬁcation methods.

SVM (Sch olkopf and Burges, 1999; Cristianini and Shawe-Taylor, 2000) has been shown to be a very popular and powerful method. It is well known that the binary SVM searches for a hyperplane in the feature space (that is a type of projection space from the original data) that maximizes the margin (a gap between the two groups). SVM has numerous applications, such as image classiﬁcation (Chapelle et al., 1999; Foody and Mathur, 2006) and cancer diagnostics (Duan et al., 2005; Wang and Huang, 2011).

c 2017 Sun, Craig and Zhang.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v18/17-003.html.

Sun, Craig and Zhang

In a high-dimensional, low sample size (HDLSS) setting, Marron et al. (2007) and Ahn and Marron (2010) observed a data-piling phenomenon with the binary SVM and other classiﬁcation methods. A SVM-type linear classiﬁer is a margin-based classiﬁer. It has a separating hyperplane, and its normal vector is essentially the discriminant direction. Datapiling is the phenomenon when projecting the data points to the discriminant direction that many of these projections are identical. This phenomenon indicates that the resulting separating hyperplane might be aﬀected by noise artifacts in the data. Noise artifacts result in a discrimination direction far away from the Bayes direction. See more discussion in Ahn and Marron (2010).

To alleviate the data-piling issue, Marron et al. (2007) proposed the binary distanceweighted discrimination (DWD) classiﬁer. The idea of DWD is to minimize the total inverse margin of all the data points. This method works quite well in the HDLSS setting. However, because the DWD method uses all the observations to estimate the decision boundary, it is very sensitive to imbalanced sample sizes (Qiao et al., 2010). In particular, when the sample size of one class is much larger than the other, the classiﬁcation boundary will be pushed towards the minority class and all future data will be assigned to the majority class. See more discussion in Qiao and Zhang (2015a,b) for the HDLSS overﬁtting issue of SVM and imbalanced data issue of DWD. To deal with both problems, Qiao and Zhang (2015a) proposed a binary distance-weighted support vector machine (DWSVM) method, which can be viewed as a combination of the binary SVM and DWD. The new method inherits both the merits of SVM and DWD yet outperforms both SVM and DWD in the HDLSS and imbalanced context.

In practice, many classiﬁcation problems have more than two classes. A natural way to deal with multiclass classiﬁcation problems is to take a one-versus-one or one-versus-rest approach, see examples in Hastie et al. (1998) and Allwein et al. (2000). Though both approaches are intuitive, the one-versus-one approach can lead to a tie-in-vote problem and the one-versus-rest approach suﬀers from inconsistency when there is no dominant class (Lee et al., 2004).

It is more desirable to consider all classes simultaneously. In a multiclass setting, the observed data are (xi, yi), i = 1, . . . , n, where xi Rd is a multivariate predictor, the scalar yi {1, . . . , K} is the corresponding class label, with K as the number of classes. Many classiﬁcation approaches map x to f(x) RK, and the corresponding prediction rule is ˆy = arg maxi fj(x), where fj is the jth element of f. In this type of approach, a constraint such as PK j=1 fj = 0 is usually imposed to remove redundancy and reduce the dimension of the problem. See Zhu and Hastie (2001); Lee et al. (2004); Liu and Shen (2006); Liu (2007); Liu and Yuan (2011) for more discussion. Fisher consistency for several existing multicategory hinge loss functions are also provided in Liu (2007).

It is straightforward to see that the sum-to-zero constraint can be removed if we redeﬁne f in RK to be in RK 1, as the degrees of freedom of f is essentially K 1. Several classiﬁers have been proposed using this fact. For example, Lange and Wu (2008) proposed a vertex based model that maps the data x in Rd to a f(x) RK 1, and predicts the label based on the distance of f(x) to K predeﬁned vertices. Zhang and Liu (2014) proposed a similar idea where they deﬁne the same K vertices in the Rk 1 space, and use the angle between f to these vertex vectors to predict labels. More details are given in Section 2 of this paper.

Angle-based Multicategory Distance-weighted SVM

The angle-based method can be viewed as a natural extension of the binary large margin classiﬁer to the multiclass context. Zhang and Liu (2014) replace the usual functional margin by the angle (or inner product) between the projection f and the vertices. Their simulation results show that these angle-based classiﬁers have good prediction performance. The Fisher consistency of a family of large margin classiﬁers is also proved. However, as a speciﬁc case in large margin classiﬁers, the angle-based SVM (MSVM) method is not Fisher consistent because its loss function is not a strictly monotone decreasing function. In Zhang and Liu (2014), Fisher consistency of a proximal SVM was proposed and proven instead.

In our experiment in Section 2, under HDLSS and imbalanced data setting, we observed that MSVM suﬀers from data piling issue. Binary DWD, based on the idea of Zhang and Liu (2014), can be extended to multicategory angle-based DWD (MDWD). Though free from the data piling concern, MDWD suﬀers from the imbalanced issue. Both these issues were previously observed in the binary case (Marron et al., 2007; Qiao and Zhang, 2015a).

Note that Huang et al. (2013) extended the binary DWD to a version of multiclass DWD (MDWDH). It adopts the idea of pairwise comparisons from one-versus-one idea. MDWDH considers a data point with label i being misclassiﬁed if the diﬀerence of projections on ith member and jth member is negative (i = j). Even though it has nice theoretical properties and empirical performance as shown in their paper, the angle-based methods have better geometric interpretation. In addition, the pair-wise natural of the method will have more expensive computational cost, compared to angle-based approaches. We also observed that the performance is slightly better than angled-based MDWD method. If the angle-based MDWD approach also incorporates the pairwise idea, the two approaches will have similar empirical performance.

In this paper, we adapt the idea in Qiao and Zhang (2015a) to develop a hybrid of MSVM and MDWD. The work can also be viewed as an extension of the binary DWSVM to the multiclass context. We prove its Fisher consistency and use extensive simulation studies to show the usefulness of our approach. For many cases, the novel approach outperforms both MDWD and MSVM, especially under HDLSS and the imbalanced case.

The rest of this article is organized as follows. In Section 2, we brieﬂy review the existing multicategory classiﬁers, and introduce our angle-based distance-weighted support vector machine (MDWSVM) model. In Section 3, we prove the Fisher consistency and show some imbalance properties of our new approach. In Section 4, we perform simulation studies to compare our model with MSVM and MDWD. The sensitivity of the prediction performance in terms of the tuning parameters is also explored in this section. Section 5 involves a real application and Section 6 discusses some future work for this model. The proofs of all theorems and lemmas are given in the appendix.

2. Methodology

In this section, we give a general introduction to classiﬁcation, including the angle-based multicategory classiﬁer. We then show some drawbacks of this angle-based classiﬁcation, which motivates our MDWSVM. We conclude with a detailed introduction of our approach, along with its implementation.

Sun, Craig and Zhang

2.1 Classiﬁcation and Loss Function

Consider a binary classiﬁcation problem with observed data (xi, yi), i = 1, . . . , n. The xi Rd is a multivariate predictor and the scalar yi {1, 1} is the corresponding class label. The goal is to ﬁnd a decision function f along with its prediction ˆy(x) = sign(f(x)) to minimize the misclassiﬁcation error E( ˆY = Y ). Note that when yf(x) > 0, f(x) gives a correct prediction; otherwise f(x) gives a misclassiﬁcation. A natural way to estimate the misclassiﬁcation error is to use the empirical error 1/n P I(ˆyi(x) = yi) = 1/n P I(yif(xi) < 0), where I(.) is the indicator function. However, due to the discontinuity and nonconvexity of I(yif(xi) < 0), it is hard to conduct a direct minimization. A common surrogate is a convex loss function ℓ(.), which is commonly used in large margin classiﬁers (Hastie et al., 2001). A large margin classiﬁer can be viewed as minimizing the loss function given a constraint

i=1 ℓ(f(xi), yi) + λ

where F denotes the function space and J(.) is a type of norm, which is used to control the complexity of the model. The function ℓ(f, y) is a loss function surrogate for the 0-1 loss. The tuning parameter λ balances the loss and the norm. For example, the popular linear SVM uses the hinge loss function ℓS(u) = (1 u)+ where u = yf(x), and the L2 norm. The SVM method can also be viewed as maximizing the smallest distances of all observations to the separating hyperplane. As discussed in Section 1, SVM suﬀers from the data piling problem in HDLSS setting. Marron et al. (2007) proposed the DWD method, which improves the performance of SVM in the HDLSS setting. Essentially, DWD minimizes the mean of inverse distance of all data vectors to the separating hyperplane. As is discussed in Bartlett et al. (2006); Liu et al. (2011); Qiao and Zhang (2015a), DWD is also a large margin classiﬁer, and its loss function is

( 2 u u 1 1/u otherwise. (1)

In practice, lots of applications deal with multicategory rather than binary classiﬁcation. For multiclass problems, yi {1, 2, . . . , K}, i = 1, . . . , n, with K the number of classes. The common simultaneous procedure is to map x to f(x) RK, and the corresponding prediction rule is ˆy = arg maxj fj(x), where fj is the jth element of f. Commonly a sum to zero constraint on f is used as discussed in Section 1 to overcome identiﬁability issues, see more discussion in Vapnik and Vapnik (1998); Lee et al. (2004); Liu and Yuan (2011). Many multicategory classiﬁcation methods can be viewed as the following constrained optimization problem,

i=1 ℓ(f(xi), yi) + λ

j=1 fj(x) = 0.

Angle-based Multicategory Distance-weighted SVM

For example, a multicategory SVM with hinge loss (Vapnik and Vapnik, 1998) uses the loss function ℓ(f(x), y) = (1 fy(x))+. However, unlike binary classiﬁcation, multicategory classiﬁcation with a sum to zero constraint does not have a clear geometric explanation. It also suﬀers from expensive computation (Zhang and Liu, 2014). To overcome these limitations, Lange and Wu (2008) proposed the vertex idea where they deﬁne f as a K 1 dimensional function instead of a K dimensional function. This removes the need for a sum-to-zero constraint. A similar idea is used later by Zhang and Liu (2014) to conduct angle-based classiﬁcation, which will be discussed next.

2.2 Angle-based Classiﬁcation Framework

The idea of angle-based classiﬁcation is to map x to f(x), where f = (f1, . . . , f K 1), with a set of K predeﬁned vertices in RK 1. We then assess which vertex has the smallest angle to the projection f, and the corresponding label is the prediction. In Zhang and Liu (2014), the vertices W = (W1, W2, . . . , WK) are deﬁned as a collection of K vectors in RK 1 with elements

( (K 1) 1/2ζ, j = 1, (1 + K1/2)/{(K 1)3/2}ζ + {K/(K 1)}1/2ej 1, 2 j K.

The unit vector ζ is of length K 1, and ej is a vector in RK 1 such that all of its element are 0, except the jth is 1. In this setting, W form a simplex with K vertices in a (K 1) dimensional space. The center of W is at the origin, and each of the Wj, j = 1, . . . , K has Euclidean norm of 1. Further, it is easy to check that the angle between each pair of vertices Wi and Wj, i = j is the same. Instead of yi, Wyi is used to represent the observed class. The prediction function is ˆy = arg maxj Wj, ˆf , where the inner product ., . between the two vectors denotes the projection of ˆf to Wj. The larger the inner product, the smaller the angle between ˆf and Wj. With this prediction rule, Zhang and Liu (2014) proposed the optimization model for the angle-based classiﬁcation

min f F 1 n

i=1 ℓ( f(xi), Wyi ) + λ

2 J(f). (2)

The product f(xi), Wyi can be viewed as a new functional margin of (x, y). Deﬁning u = f(xi), Wyi , one of the examples given in Zhang and Liu (2014) is the multicategory angle-based SVM (MSVM) where the loss function is replaced by hinge loss ℓS(u) = (1 u)+ and with L2 norm. DWD loss can be applied to this framework as well, along with more generalizations of binary large margin classiﬁers. See Zhang and Liu (2014) for more details.

2.3 From Binary DWSVM to MDWSVM

Through the exploration of the angle-based classiﬁcation method, we found that MSVM has similar data piling issues; while MDWD does not have data piling problems, it suﬀers from imbalanced issues. To demonstrate the data piling and imbalanced issues, we show projection plots of a simulated example. In this example, we randomly simulate three

Sun, Craig and Zhang

Figure 1: Plots of projections and Wj s in R2 space. Dashed lines are the Wj s, j = 1, . . . , 3; Dots, triangles and squares represent the points from the three diﬀerent classes. The left panel shows the projection plot for the Bayes classiﬁcation, the middle one is for anglebased MSVM, and the right one is for the angle-based MDWD. The middle panel shows the MSVM has severe data piling issues (the middle panel), and MDWD in the right panel suﬀers from imbalanced issues (the right panel).

classes of observations with 500 covariates, the sample sizes for each class are 100, 50, 50 respectively. For each group, the ﬁrst two covariates are distributed N(µj, σ2I2), where µj s are three ﬁxed points equally spaced on the unit circle, and σ = 0.5. All other covariates are independently and identically distributed N(0, σ2). Note that the data are HDLSS and imbalanced. One representation of the projection plots is given in Figure 1. This plot is used to visualize the n projections f(xi) s, i = 1, . . . , n and vectors Wj s, j = 1, . . . , 3 (dashed lines) in the R2 plane. The diﬀerent colors and shapes correspond to diﬀerent groups. The solid purple lines are the Bayesian decision boundaries. X1 and X2 are the two axes in this R2

plane. From Figure 1 it is clear that MSVM has severe data piling issues since almost all the points project to a single point on the W direction. Furthermore, MDWD suﬀers from the imbalanced issue as the angle-based classiﬁcation assigns almost all the points to the dominant Class 1. These ﬁndings agree with those in Qiao and Zhang (2015a) for the binary case. To alleviate both data piling and imbalanced issues, Qiao and Zhang (2015a) proposed binary DWSVM, a combination of SVM and DWD. The method has the form

min f F 1 n

i=1 αℓD(yif0(xi)) + (1 α)ℓS(yif(xi)) + λ

2 J(f), (3)

where f0(x) = xω + β0 and f(x) = xω + β, ω Rd is the coeﬃcient direction vector and scalar β, β0 R. The loss function ℓD is from formula (1) and ℓS is the hinge loss. Notice that β is the SVM intercept and β0 is the DWD intercept, which is called auxillary intercept in the DWSVM paper. The prediction function is ˆy = sign(f(x)) = sign(x T i ω + β). The tuning parameter 0 < α < 1 is used to balance SVM and DWD losses.

Angle-based Multicategory Distance-weighted SVM

Qiao and Zhang (2015a) show that in binary classiﬁcation, by choosing the appropriate α, the DWSVM method will result in a smaller misclassiﬁcation error compared to both the DWD method and SVM method in HDLSS and imbalanced data context. Furthermore, in terms of the similarity to the Bayes classiﬁer, the DWSVM are similar to the DWD but better than the SVM. The DWSVM method motivated us to build a multicategory DWSVM within the anglebased framework. Applying DWSVM to the angle-based framework, we propose a multicategory angle-based DWSVM (MDWSVM)

min f F 1 n

i=1 αℓD( f0(xi), Wyi ) + (1 α)ℓS( f(xi), Wyi ) + λ

2 J(f). (4)

In this model f(xi) = xi B +β0 and f0(xi) = xi B +βd 0, where B = (B1, B2, . . . , BK 1), each of the Bj, j = 1, . . . , K 1 is a vector of length d, which does not include the intercept.The parameters β0, βd 0 RK 1 are intercept vectors. Note that f0 and f are only diﬀerent in terms of the intercept β0 and βd 0 respectively. In this model, the interest is to ﬁnd B, β0 and βd 0 to minimize the loss function. For prediction ˆy = arg maxj Wj, ˆf = arg maxj Wj, xi B + β0 , however, only β0 and B are used. This avoids the imbalanced issue cost by βd 0. Note that the prediction arg maxj Wj, ˆf is equivalent to arg minj (Wj, ˆf) where (a, b) represents the angle between vector a and b . We predict x with the label j such that vertex Wj and f(x) has the smallest angle among all (Wj, ˆf), j = 1, . . . , K. Observe that PK j=1 Wj, ˆf = 0 for all x, which means the angle-based classiﬁcation framework automatically includes sum-to-zero constraints.

2.4 Implementation of MDWSVM

In Qiao and Zhang (2015a), the implementation of the binary DWSVM (3) was through second-order cone programming. Mathematically, the DWSVM model can be written as

min ω,β,β0,ηi,ξi

ri + ηi) + (1 α)ξi},

s.t. ri = yi(x T i ω + β) + ηi, ri 0 and ηi 0,

yi(x T i ω + β) + ξi 1, ξi 0,

The ﬁrst constraint ri = yi(x T i ω + β) + ηi is the distance from each data vector i to its separating hyperplane (adding slackness to allow misclassiﬁcation), which corresponds to the DWD optimization. The second constraint yi(x T i ω+β)+ξi 1 is a standard constraint in SVM optimization. Both ηi and ξi control the misclassiﬁcation rate, but with diﬀerent decision boundaries. The third constraint ω 2 C is equivalent to the second term in (2), the Euclidean norm. To extend (5) to multiclass, we replace the distances (functional margins) to the inner product as introduced earlier in (2). Thus our MDWSVM will have the following mathe-

Sun, Craig and Zhang

matical form:

ri + ηi) + (1 α)ξi},

s.t. ri = f0(xi), Wyi + ηi, ri 0 and ηi 0,

f(xi), Wyi + ξi 1, ξi 0,

j=1 BT j Bj C.

In this form f0 and f are the same as in (4). It is veriﬁed in Zhang and Liu (2014) that the ﬁrst term in the objective function Pn i=1( 1

ri + ηi) along with its constraint is equivalent to the objective of the MDWD method, and the second term in the objective function Pn i=1 ξi along with its constraint is equivalent to the objective in the MSVM method. In this case, MDWSVM can be viewed as a convex combination of MDWD and MSVM losses where the parameter α balances the two. Model (6) can be easily implemented in Matlab using the CVX package (Grant et al., 2008). Notice the only diﬀerence between f(x) and f0(x) is their location vectors β0 and βd 0. For prediction we only adopt the location vector from MSVM, which shows insensitivity to the imbalanced issue from Figure 1. Moreover, by combining the discriminant direction of MDWD and MSVM, our new model will have a better discriminant direction (closer to the Bayes direction) than the MSVM method alone. Both improvements will be shown in Sections 4 and 5 using simulations and real examples.

3. Theoretical Properties

Fisher consistency is a fundamental requirement for a classiﬁcation method. Fisher consistency implies that when the sample size approaches inﬁnity, the classiﬁer becomes closer and closer to the Bayes classiﬁcation rule, which corresponds to the minimum misclassiﬁcation rate. Qiao and Zhang (2015a) explored Fisher consistency of the binary DWSVM model. In the multiclass context, Zhang and Liu (2014) extended Fisher consistency to all large margin classiﬁcation models under the angle-based framework. Let Pj = Pr(Y = j|X = x) for j = 1, . . . , K. Note that ˆy = arg maxj Pj is the Bayes rule. Assume that for a given x, the vector f (x) minimizes E[ℓ{ f(X), WY }|X = x], and the corresponding decision boundary will then be ˆy = arg maxj f (x), WY . Note that this is essentially the limit minimizer of (2) when sample size diverges to inﬁnity. Fisher consistency assures that these two decision functions are the same ( arg maxj Pj = arg maxj f (x), WY ). In this section, we will prove that if using the approximate SVM loss function from Zhang and Liu (2014) in replacement of hinge loss, our MDWSM is Fisher consistent.

Theorem 1 The MDWSVM is Fisher consistent for any 0 < α < 1.

Fisher consistency in Theorem 1 ensures that the minimizer of the expected loss function will assign an observation to the same class as what Bayes rule does. Furthermore, in our numerical study in Section 4.2, we notice that for MDWSVM method, as long as C is ﬁxed, diﬀerent α s will give similar performance in both prediction error and closeness to

Angle-based Multicategory Distance-weighted SVM

the Bayes rule. Thus we will ﬁx α to be 0.5 in this paper and not discuss the choice of α further. In the next theorem, we want to prove that MDWSVM is insensitive to imbalance. Using a similar paradigm as in Owen (2007), we consider the case that the sample size of one class diverges to inﬁnity. Qiao and Zhang (2015a) showed that, in binary classiﬁcation, the intercept term of DWD diverges, but the intercept of SVM and DWSVM will not diverge. This shows that SVM is not sensitive to imbalance, but DWD will be severely aﬀected. In our multiclass setting, for simplicity, it is assumed that only one of the classes is the dominant one, and the sample size of all other classes are equally ﬁxed. Without loss of generality, we assume their sample sizes are all 1. Under this setting, we can simply assume observation 1, . . . , K 1 belongs to the class 1, . . . , K 1 respectively, and observations K . . . , n belong to class K. As n goes to inﬁnity, the classiﬁer tends to classify all the points to the dominant class K. If this happens, β0, Wy K goes to inﬁnity. In the next proposition, we prove that this will be not be the case for the angle-based SVM. Furthermore, we present in Theorem 3 that the intercept of our MDWSVM model is not sensitive to imbalance either.

Proposition 2 In MSVM setting, when the size of the majority class goes to inﬁnity, β0, Wy K <

2CK max |xij| + 1.

Theorem 3 In the MDWSVM setting, when the size of the majority class goes to inﬁnity, β0, Wy K <

2CK max |xij| + 1.

Note that Theorem 3 does not ensure that the MDWSVM method completely overcomes the imbalanced issue. When the sample size of the majority group goes to inﬁnity, the method still will ignore some observations in minority groups.

4. Simulation

In this section, we use three simulation examples to demonstrate the performance of our MDWSVM method. We compare it to the angle-based SVM (MSVM) described in Section 2 and the angle-based DWD (MDWD) naturally developed using the ideas from Zhang and Liu (2014). In each example, we simulate a training data set, a tuning data set, and a testing data set. The training data and tuning data have the same sample sizes and are used to estimate the model and to ﬁnd the optimal tuning parameters. The size of the testing data set is ten times the size of the training data, and is used to evaluate the prediction performance. As we are interested in the misclassiﬁcation rate in both the dominant class and the minority classes, we will not use the total error rate 1/n P i I(ˆyi = yi) in this paper. Instead, we use the average within-group error rate as follows

i:xi Cj I(ˆyi = j|xi Cj)

Here Cj stands for class j and nj is the sample size for class j. This measure was previously introduced in Qiao and Liu (2009). Note that the term within the bracket is the error rate for each group, so r is the arithmetic average of all these error rates.

Sun, Craig and Zhang

We also want to measure the closeness of the estimated classiﬁer to the Bayes rule. For the binary case, we can measure the angle between the two linear decision boundaries. For the multiclass case, we develop a similar measure as follows. Note that B is the projection matrix from Rd to RK 1 (the projection space). In the binary case, B is the discrimination direction vector, we can use the Euclidean inner product B, BBayes to measure the angle between the estimated and the Bayes rule. For the multiclass case, both B and BBayes are matrices. In matrix form, we want to measure the angle between the jth columns in both B and BBayes, and then calculate an average of these angles. In this paper, we use the Frobenius inner product: B, BBayes F = P i,j Bij BBayes ij. Essentially, this is the sum of entries of the Hadamard product between B and BBayes. One can see that this is the same idea as the inner product of the corresponding columns in B and BBayes, a scaled mean of the inner products. To make this quantity directly linked to angle, we normalize both B and BBayes to have Frobenius norm of 1, and thus B, BBayes = arc cos( B, BBayes F ) will be the angle used in this paper. In all examples, α is set as 0.5, the reason for this is described in Section 4.2. We want to choose C in R+. For convenience, we use the log scale, and set log2 C from -3 to 15. For the ﬁrst two examples, we generate datasets that have signal based on only a few covariates, and then we add pure noise as additional covariates. To better compare the performance for both balanced and imbalanced scenarios, all the examples are conducted under both balance and imbalance cases. Let p = Pr(Y = 1) and Pr(Y = j) = 1 K 1(1 p) for j = 1. We will consider p = 1/K for the balanced case and p = 1/2 and p = 1/3 for the imbalanced case for all examples. The size of the training dataset for each example is 300, 600 and 300 respectively. In each example, ﬁve sets of dimensions are considered: 2, 10, 100, 500 and 1000. The noise covariates are identically independent distributed as N(0, σ2). For the third example, all covariates are signal variables. And for all simulation settings, we repeat the experiments 100 times and report the average performance.

4.1 Performance Comparison

Example 1 We generate a three class dataset, where the ﬁrst two covariates are distributed N(uj, σ2I2). In this setting, the uj s are three points equally spaced on a unit circle, and σ is chosen such that the Bayes error is 0.1. As we can see, this example is similar to the Example 1 in Zhang and Liu (2014) other than that our case considered both the balanced and imbalanced scenarios.

Example 2 We generate a ﬁve class dataset, Let Pr(Y = 1) = p, and the ﬁrst ﬁve covariates are distributed N(uj, σ2I5). Here uj s are ﬁve points equally spaced on the sphere of unit ball in R4, and σ = 0.55. When dimension is larger than 4, the last d 4 covariates are identically independent distributed as N(0, σ2).

Example 3 A three groups dataset is generated with dimension d, the centers of the three groups are equally distributed on the sphere of an unit ball in Rd. A random noise N(0, σ2 = 0.552) was added to each dimension.

We report the average prediction error rate and the average angle to the Bayes rule in Figures 2. Take Figure 2(a), which corresponds to Example 1, as an example. We report

Angle-based Multicategory Distance-weighted SVM

p=1/3 p=1/2 p=2/3

Error Angle

0 250 500 750 1000 0 250 500 750 1000 0 250 500 750 1000

Methods MSVM MDWD MDWSVM

p=1/5 p=1/2 p=2/3

Error Angle

0 250 500 750 1000 0 250 500 750 1000 0 250 500 750 1000

Methods MSVM MDWD MDWSVM

p=1/3 p=1/2 p=2/3

Error Angle

0 250 500 750 1000 0 250 500 750 1000 0 250 500 750 1000

Methods MSVM MDWD MDWSVM

Figure 2: Performance comparison plot between the three methods for Examples 1, 2, and 3. The top row of each graph plots the misclassiﬁcation rate for diﬀerent dimensions (the x axis) and diﬀerent prior probabilities (left, middle and right panels). The bottom row is the angle between the estimated and the Bayes rule. For all measures, smaller implies better result. We can see that our method (the solid black line) performs almost the same as the other two methods (MSVM, the red dashed line, MDWD, the dotted blue line) for balanced case, but outperforms the other two for the imbalanced cases.

Sun, Craig and Zhang

both misclassiﬁcation rate (the top row) and the angle between the estimated classiﬁers and the Bayes rule. In the plot, we use black solid lines for our MDWSVM method, red dashed lines for MSVM method, and blue dotted lines for MDWD method. The grid points on the x axis represents the diﬀerent dimensions d. The y axis corresponds to the performance measure. In our plot, the smaller the y axis value, the better the performance. Diﬀerent imbalance ratios are visualized in diﬀerent panels from left to right. The left two panels correspond to the balanced case; the middle panels are mild imbalance (p = 1/2); and the right two panels are more severe imbalanced case (p = 2/3). For the balanced case, our approach performs similar to the MSVM and MDWD methods. However, for the imbalanced cases (the middle and right panels), we can see clear gaps between the performance of our method and the other two methods, demonstrating that our method outperforms the other two. Note that a similar pattern can also be seen in Figures 2(b) and 2(c). All suggests that the novel approach is better than MSVM and MDWD. It is also shown in these plots that as the dimension of training data changes from small to large (2 to 1000), the classiﬁer s performances become worse and worse. Furthermore, the performance diﬀerences of the three methods become more pronounced. Note that MDWD gives the worst prediction error rate compared to the other two methods, and MSVM gives the worst classiﬁcation direction compared to the other two methods. Our MDWSVM gives comparably the best performance in both aspects.

4.2 Sensitivity to Parameters

There are two parameters C and α in our MDWSVM method. We have conducted many simulations to evaluate the performance of these two parameters. In this section, we will only use Example 1 to show the performance. We set α to be ﬁxed, varied C, and evaluated its performance. Then we ﬁxed an optimal C to evaluate the sensitivity of our approach to diﬀerent α s. At the beginning, we let α = 0.5, and allow C to change from 2 3 to 212. The simulation is conducted under diﬀerent dimensions (100, 500, 1000). All the simulation results are based on 100 replicates. The left panel of Figure 3 is the prediction error under diﬀerent values of C with diﬀerent dimensions of training data. It is clearly shown from the graph that as C increases, the prediction error rate ﬁrst decreases and then increases. It shows that a minimal prediction error can be reached within this range. The right panel shows the relationship between prediction error rate and diﬀerent α s. It is also clear that the prediction error rate stays the same as α changes from 0.1 to 0.9, regardless of the slightly increase as α approaches to 1. Based on the performance from Figure 3, the change of α has little impact on the prediction error rate compared to a change of C. The performance is quite stable for diﬀerent α s. Since the property of the parameters are not the focus in this paper, this simulation gives us an easy suggestion of choosing parameters. One can simply ﬁx α = 0.5 and use cross-validation to choose C. This is why we ﬁx α = 0.5 in our simulation.

4.3 Computation Time

In this subsection, we will compare the computational time for these methods (MDWSVM, MSVM and MDWD). To test the computational complexity, we only consider the simulation

Angle-based Multicategory Distance-weighted SVM

Dimension=100 Dimension=500 Dimension=1000

0.2 0.3 0.4 0.5 0.6

0.2 0.3 0.4 0.5 0.6

0.2 0.3 0.4 0.5 0.6

prediction error rate

Dimension=100 Dimension=500 Dimension=1000

0.25 0.50 0.75

prediction error rate

Figure 3: Average within-group error rate change under diﬀerent parameters. Left panel is the prediction error rate change under diﬀerent C value for ﬁxed α = 0.5; right panel is prediction error rate change for diﬀerent α when C is ﬁxed at its optimal value

Dimension MDWSVM MDWD MSVM 10 8.31(0.04) 8.22(0.04) 0.88(0.01) 100 14.52(0.07) 12.80(0.05) 2.79(0.01) 1000 41.66(0.20) 27.43(0.12) 9.30(0.07)

Table 1: Computation time comparison for MDWSVM, MDWD and MSVM based on 100 runs of Example 1. The number shows average computing time in seconds, and the number in the parenthesis is the corresponding standard error.

of Example 1. We let the dimension change from 10 to 1000. Table 1 gives the average computation time in seconds for 100 replicates along with their standard error. All numerical experiments were carried out on an Intel Xeon E3-1284L (2.5 GHz) processor.

Table 1 shows that the most eﬃcient method is MSVM and the most time-consuming method is our MDWSVM. Note that MSVM can still be viewed as a quadratic programming problem, while both MDWD and MDWSVM are conic program problems. It is not surprising that MSVM is the most computational eﬃcient one. It is our expectation that MDWSVM would have the longest time to run, since it combines both MSVM and MDWD. From equation 6, we can see that the number of parameters can be viewed as the sum of the ones for MDWD and MSVM. Thus the computation times it takes to solve the problem increases as well. It is good enough that the computational time of MDWSVM is shorter than the sum of the computing time of each individual methods. It is worth mentioning that as dimension increases, the CPU times for the three methods increase.

Sun, Craig and Zhang

2002/01/01 New Year 2002/09/02 Labor Day 2002/05/27 Memorial Day 2002/11/28 Thanksgiving 2002/07/04 Independence Day 2002/12/25 Christmas

Table 2: Six national holidays on weekdays that are removed

5. Real Data Application

In this section, we apply our MDWSVM method to a real data used in Shen and Huang (2005). The data were gathered at an inbound call center of a major northeastern U.S ﬁnancial ﬁrm in 2002, and describe the call volume from 7:00am-12:00am. Each day is divided into 408 150-second intervals and the number of phone calls is recorded in each interval. Due to equipment malfunctioning, there are 6 missing weekdays within the whole year. The call volume data form a 360 408 matrix, where each row corresponds to a day and each column is the call volume for one of the 150-second intervals. Note that the data have been thoroughly analyzed in Shen and Huang (2005). Here we simply add some new insights from the data by using our novel approach. According to their analysis, the pattern for Saturday and Sunday is very diﬀerent from the weekdays. Thus in this analysis, we only focus on the weekdays. Shen and Huang (2005) show that for weekdays, by using singular value decomposition to analyze the number of phone calls, Monday and Friday are slightly diﬀerent from all other weekdays, see Section 5.3 and Figure 6 of Shen and Huang (2005) for more details. Tuesday, Wednesday and Thursday are hard to tell apart from each other. In this section, we only focus on classifying Monday, Friday and other weekdays (Tuesday, Wednesday and Thursday). In addition, due to the fact that the center has very low volumes on national holidays, we remove the holidays that fall on weekdays. Table 2 provides the national holidays excluded in our analysis. These days, along with some other more holidays, were also removed in Shen and Huang (2005). After removing these holidays, we have 48, 50, 51, 50, 52 days for Monday to Friday respectively. The data are divided into three groups, Group 1: Monday (size 48); Group 2: Tuesday to Thursday (size 151); Group 3: Friday (size 52). This dataset is a typical imbalanced HDLSS dataset with Group 2 as the dominant group. The average number of phone calls on each time interval are presented in Figure 4. From Figure 4, we can see that the average number of phone calls on Monday is quite distinct from the other days as it is larger than the other two groups. However, Group 2 and 3 are hard to distinguish from each other by only looking at the average number of phone calls. For this data set, we will compare the performance of three classiﬁers: our MDWSVM, MSVM, MDWD. To obtain a good evaluation, all three methods use ﬁve-fold cross validation. And the evaluation measures are the total error rate (TER) and the average within-group error rate (AER). We also report the prediction error within each group. Table 3 provides these results for the three diﬀerent methods. From Table 3 it is straightforward to conclude that the prediction error of MDWSVM for each of the groups is the smallest, as well as the total error rate and average with-group prediction error rate. Our MDWSVM model works very well for this data set. Taking another look at the Table 3, the prediction error rate for Monday and Friday are more than 30%, which seems to be large. The result could be explained by the fact

Angle-based Multicategory Distance-weighted SVM

0 100 200 300 400 Time intervals

Average calls in each time interval

Figure 4: Average number of calls in each time interval for the three classes Monday, Tuesday to Thursday and Friday. In this plot, the red dotted line is the number of phone calls for Mondays, the black solid line is for Tuesday to Thursday, and the blue dashed line is for Friday.

MDWD MSVM MDWSVM Mon. 0.8156 0.6200 0.4111 Tue. - Thu. 0.0396 0.0594 0.0985 Fri. 0.6145 0.5527 0.3200 TER 0.3098 0.2702 0.2053 AER 0.4899 0.4107 0.2765

Table 3: Prediction error for number of phone calls. The top 3 rows report the crossvalidation prediction error for each class, and the bottom two rows are the total error rate and the average within-group error rate.

Sun, Craig and Zhang

that the DWSVM used here is a linear classiﬁer, while the nature of the data may not be well classiﬁed by a linear classiﬁer. If we incorporate a kernel approach to our classiﬁer, the performance should improve.

6. Discussion

In this paper, we proposed a new angle-based method MDWSVM. We show in this paper that in HDLSS and the imbalanced case, our novel MDWSVM method has smaller misclassiﬁcation error relative to both MDWD and MSVM. In addition, it is closer to the Bayes discriminant direction compared to that of MSVM. The MDWSVM can be viewed as an extension of the binary DWSVM to the multicategory case. Furthermore, as our model considers both SVM and DWD loss, it can be treated as a weighted hybridization of MSVM and MDWD. One of the limitations of DWD-type methods is that it suﬀers from slow computation time compared to SVM-type methods (See Chapter 4.3 for examples). Recently an alternating direction method of multipliers (ADMM) algorithm has been implemented by Lam et al. (2016) for the DWD type methods, which can handle large size data more eﬃciently. A similar implementation for our MDWSVM method will be explored in the future. Note that we only consider linear classiﬁers for simplicity. This method can be extended to a kernel approach (Liu and Yuan, 2011) by allowing f to be a nonlinear mapping. The implementation of such a generalization will be our immediate future work. Our MDWSVM uses the squared norm as the regularization component so it does not have a variable selection property. To better deal with data of high dimension, variable selection penalties can be added to the model, e.g. LASSO (Tibshirani, 1996) or Elastic net (Zou and Hastie, 2005). Work on this type of generalization will follow.

Acknowledgments

We would like to thank the reviewers for the valuable comments and suggestions. Also thank Cheng Li (Purdue University) for the insightful discussion on proof of theorems.

Angle-based Multicategory Distance-weighted SVM

Appendix A.

In this appendix we prove the following theorems and proposition from Section 3:

Lemma 1 (Zhang and Liu, 2014): Suppose we have an arbitrary f RK 1. For any u, v {1, . . . , K} such that u = v, deﬁne Tu,v = Wu Wv. For any scalar z R, (f + z Tu,v), Ww = f, Ww , where w {1, . . . , K} and w = u, v. Furthermore, we have that (f + z Tu,v), Wu f, Wu = (f + z Tu,v), Wv + f, Wv .

The proof of Lemma 1 is given in Zhang and Liu (2014). From Lemma 1, one can see that for a given f, if we move it along the direction of Tu,v, the inner product of f and Ww will stay the same when w = u, v. Furthermore, the sum of inner product f, Wu + f, Wu , Wu will remain unchanged as well. This lemma will help us to prove the Fisher consistency of the MDWSVM method.

Theorem 1. The MDWSVM is Fisher consistent for any 0 < α < 1.

Proof. Recall the deﬁnition of f is that

(f , f 0 ) = arg min f,f0 E[(1 α)ℓs{ f(X), WY } + αℓd{ f0(X), WY }|X = x].

We need to show that when P1 > P2, W1, f > W2, f . This can be easily proved by contradiction.

If W1, f = W2, f , Let f 0 = f , here we can see that as is only the diﬀerence of intercept, which is independent of X. Let (f , f 0 ) = (f , f 0 ) be such that Wj, f = Wj, f for j 3 and W1, f = W1, f + ϵ, W2, f = W2, f ϵ. This (f , f 0 ) exists based on Lemma 1 and the fact that inner product is continuous. To get the required f , we only need to move f along the direction of T1,2.

Then it is easy to get

j=1 Pj[(1 α)ℓs{ f , Wj } + αℓd{ f 0 , Wj }]

j=1 Pj[(1 α)ℓs{ f , Wj } + αℓd{ f 0 , Wj }]

=ϵ(P1 P2)(1 α)ℓ s{ f , W1 } + o(ϵ)

Since we are using proximal hinge loss, ℓs is diﬀerentiable, P1 P2 > 0, ℓ s < 0 and 0 < α < 1. we have PK j=1 Pj[(1 α)ℓs{ f , Wj } + αℓd{ f 0 , Wj }] < PK j=1 Pj[(1 α)ℓs{ f , Wj } + αℓd{ f 0 , Wj }], which is a contradiction.

For W1, f < W2, f case, if P1ℓ s{ f , W1 } P2ℓ s{ f , W2 } < 0, then choose (f , f 0 ) = (f , f 0 ) be such that Wj, f = Wj, f for j 3 and W1, f =

Sun, Craig and Zhang

W1, f + ϵ, W2, f = W2, f ϵ. Then we have

j=1 Pj[(1 α)ℓs{ f , Wj } + αℓd{ f 0 , Wj }]

j=1 Pj[(1 α)ℓs{ f , Wj } + αℓd{ f 0 , Wj }]

=ϵ(1 α){P1ℓ s{ f , W1 } P2ℓ s{ f , W2 }} + o(ϵ) < 0

We can see that If P1ℓ s{ f , W1 } P2ℓ s{ f , W2 } > 0, then choose (f , f 0 ) = (f , f 0 ) be such that Wj, f = Wj, f for j 3 and W1, f = W1, f ϵ, W2, f = W2, f + ϵ. Then we have

j=1 Pj[(1 α)ℓs{ f , Wj } + αℓd{ f 0 , Wj }]

j=1 Pj[(1 α)ℓs{ f , Wj } + αℓd{ f 0 , Wj }]

=ϵ(1 α){ P1ℓ s{ f , W1 } + P2ℓ s{ f , W2 }} + o(ϵ) < 0

We can see that this is a contradiction. This completes the proof.

Proposition 2. In MSVM setting, when the size of the majority class goes to inﬁnity, β0, Wy K <

2CK max |xij| + 1.

Proof. Assume observations 1, . . . , K 1 belong to the class 1, . . . , K 1 respectively, and observations K, . . . , n belong to class K.

i=1 ℓs{ f(xi), Wyi }

i=1 ℓs{ f(xi), Wi } +

i=K ℓs{ f(xi), WK }

i=1 ℓs{ x T i B, Wi + β0, Wi } +

i=K ℓs{ x T i B, WK + β0, WK }

Now we prove that B Rp (K 1), we have

β0, WK < sup i | x T i B, Wi |K + 1 <

2CK max |xij| + 1.

We can use contradiction to prove it, if β0, WK > supi | x T i B, Wi |K + 1, then

ℓs{ x T i B, WK + β0, WK } = 0

for all i {K, . . . , n} as x T i B, WK + β0, WK > 1.

Angle-based Multicategory Distance-weighted SVM

i=1 ℓs{ f(xi), Wyi } =

i=1 ℓs{ x T i B, Wi + β0, Wi }.

Then d L dβ0 = PK 1 i=1 l s{ x T i B, Wi + β0, Wi }W i. Since β0, WK > supi | x T i B, Wi |K + 1, one can get that u K = x T KB, WK + β0, WK > 1. Based on the property of W, we have PK i=1 β0, Wi = 0. Furthermore, it is easy to deduct that min β0, Wi < supi | x T i B, Wi | for i {1, . . . , K 1}. Then min ui = x T i B, Wi + min β0, Wi < 0 for i = 1, . . . , K 1. Thus we can choose K 1 diﬀerent values K1, K2, , . . . , KK 1 from 1, . . . , K 1 such that u K1 u K2 . . . 0 . . . u KK 1. Assume i0 = max{i, u Ki < 1}, then

d L dβ0 = PKKi0 i=KK 1 W Ki. One can simply verify that d L dβ0 = 0 based on the property of vertex W. Thus β0 cannot be the β0 that minimize the loss function given B. This step completes the prove.

Theorem 3. In the MDWSVM setting, when the size of the majority class goes to inﬁnity, β0, Wy K <

2CK max |xij| + 1.

Proof. Based on the proof of Proposition 2, for any B Rp (K 1), we have

β0, WK < sup i | x T i B, Wi |K + 1 <

2CK max |xij| + 1.

Thus for MDWSVM model, no matter what B we get, the intercept only comes from MSVM part. Therefore, from the conclusion of Proposition 2, Theorem 3 is proved.

Jeongyoun Ahn and JS Marron. The maximal data piling direction for discrimination. Biometrika, 97(1):254 259, 2010.

Erin L Allwein, Robert E Schapire, and Yoram Singer. Reducing multiclass to binary: A unifying approach for margin classiﬁers. Journal of Machine Learning Research, 1(Dec): 113 141, 2000.

Peter L Bartlett, Michael I Jordan, and Jon D Mc Auliﬀe. Convexity, classiﬁcation, and risk bounds. Journal of the American Statistical Association, 101(473):138 156, 2006.

Olivier Chapelle, Patrick Haﬀner, and Vladimir N Vapnik. Support vector machines for histogram-based image classiﬁcation. IEEE Transactions on Neural Networks, 10(5): 1055 1064, 1999.

Nello Cristianini and John Shawe-Taylor. An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, 2000.

Kai-Bo Duan, Jagath C Rajapakse, Haiying Wang, and Francisco Azuaje. Multiple svm-rfe for gene selection in cancer classiﬁcation with expression data. IEEE Transactions on Nanobioscience, 4(3):228 234, 2005.

Sun, Craig and Zhang

Giles M Foody and Ajay Mathur. The use of small training sets containing mixed pixels for accurate hard image classiﬁcation: Training on mixed spectral responses for classiﬁcation by a svm. Remote Sensing of Environment, 103(2):179 189, 2006.

Michael Grant, Stephen Boyd, and Yinyu Ye. Cvx: Matlab software for disciplined convex programming, 2008.

Trevor Hastie, Robert Tibshirani, et al. Classiﬁcation by pairwise coupling. The Annals of Statistics, 26(2):451 471, 1998.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning. 2001. Springer, 2001.

Hanwen Huang, Yufeng Liu, Ying Du, Charles M Perou, D Neil Hayes, Michael J Todd, and James Stephen Marron. Multiclass distance-weighted discrimination. Journal of Computational and Graphical Statistics, 22(4):953 969, 2013.

Xin Yee Lam, JS Marron, Defeng Sun, and Kim-Chuan Toh. Fast algorithms for large scale generalized distance weighted discrimination. ar Xiv preprint ar Xiv:1604.05473, 2016.

Kenneth Lange and Tongtong Wu. An mm algorithm for multicategory vertex discriminant analysis. Journal of Computational and Graphical Statistics, 17(3):527 544, 2008.

Yoonkyung Lee, Yi Lin, and Grace Wahba. Multicategory support vector machines: Theory and application to the classiﬁcation of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67 81, 2004.

Yufeng Liu. Fisher consistency of multicategory support vector machines. In AISTATS, pages 291 298, 2007.

Yufeng Liu and Xiaotong Shen. Multicategory ψ-learning. Journal of the American Statistical Association, 101(474):500 509, 2006.

Yufeng Liu and Ming Yuan. Reinforced multicategory support vector machines. Journal of Computational and Graphical Statistics, 20(4):901 919, 2011.

Yufeng Liu, Hao Helen Zhang, and Yichao Wu. Hard or soft classiﬁcation? large-margin uniﬁed machines. Journal of the American Statistical Association, 106(493):166 177, 2011.

JS Marron, Michael J Todd, and Jeongyoun Ahn. Distance-weighted discrimination. Journal of the American Statistical Association, 102(480):1267 1271, 2007.

Art B Owen. Inﬁnitely imbalanced logistic regression. Journal of Machine Learning Research, 8(Apr):761 773, 2007.

Xingye Qiao and Yufeng Liu. Adaptive weighted learning for unbalanced multicategory classiﬁcation. Biometrics, 65(1):159 168, 2009.

Xingye Qiao and Lingsong Zhang. Distance-weighted support vector machine. Statistics and Its Interface, 8(3):331 345, 2015a.

Angle-based Multicategory Distance-weighted SVM

Xingye Qiao and Lingsong Zhang. Flexible high-dimensional classiﬁcation machines and their asymptotic properties. Journal of Machine Learning Research, 16:1547 1572, 2015b.

Xingye Qiao, Hao Helen Zhang, Yufeng Liu, Michael J Todd, and James Stephen Marron. Weighted distance weighted discrimination and its asymptotic properties. Journal of the American Statistical Association, 105(489):401 414, 2010.

Bernhard Sch olkopf and Christopher JC Burges. Advances in kernel methods: support vector learning. MIT press, 1999.

Haipeng Shen and Jianhua Z. Huang. Analysis of call centre arrival data using singular value decomposition. Applied Stochastic Models in Business and Industry, 21(3):251 263, 2005.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267 288, 1996.

Vladimir Naumovich Vapnik and Vlamimir Vapnik. Statistical learning theory, volume 1. Wiley New York, 1998.

Hui Wang and Gang Huang. Application of support vector machine in cancer diagnosis. Medical Oncology, 28(1):613 618, 2011.

Chong Zhang and Yufeng Liu. Multicategory angle-based large-margin classiﬁcation. Biometrika, 101(3):625 640, 2014.

Ji Zhu and Trevor Hastie. Kernel logistic regression and the import vector machine. In Advances in Neural Information Processing Systems, pages 1081 1088, 2001.

Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301 320, 2005.