# adversarial_multiclass_classification_a_risk_minimization_perspective__746ea84e.pdf

Adversarial Multiclass Classiﬁcation: A Risk Minimization Perspective

Rizal Fathony Anqi Liu Kaiser Asif Brian D. Ziebart Department of Computer Science University of Illinois at Chicago Chicago, IL 60607 {rfatho2, aliu33, kasif2, bziebart}@uic.edu

Recently proposed adversarial classiﬁcation methods have shown promising results for cost sensitive and multivariate losses. In contrast with empirical risk minimization (ERM) methods, which use convex surrogate losses to approximate the desired non-convex target loss function, adversarial methods minimize non-convex losses by treating the properties of the training data as being uncertain and worst case within a minimax game. Despite this difference in formulation, we recast adversarial classiﬁcation under zero-one loss as an ERM method with a novel prescribed loss function. We demonstrate a number of theoretical and practical advantages over the very closely related hinge loss ERM methods. This establishes adversarial classiﬁcation under the zero-one loss as a method that ﬁlls the long standing gap in multiclass hinge loss classiﬁcation, simultaneously guaranteeing Fisher consistency and universal consistency, while also providing dual parameter sparsity and high accuracy predictions in practice.

1 Introduction

A common goal for standard classiﬁcation problems in machine learning is to ﬁnd a classiﬁer that minimizes the zero-one loss. Since directly minimizing this loss over training data via empirical risk minimization (ERM) [1] is generally NP-hard [2], convex surrogate losses are employed to approximate the zero-one loss. For example, the logarithmic loss is minimized by the logistic regression classiﬁer [3] and the hinge loss is minimized by the support vector machine (SVM) [4, 5]. Both are Fisher consistent [6, 7] and universally consistent [8, 9] for binary classiﬁcation, meaning they minimize the zero-one loss and are Bayes-optimal classiﬁers when they learn from any true distribution of data using a rich feature representation. SVMs provide the additional advantage of dual parameter sparsity so that when combined with kernel methods, extremely rich feature representations can be efﬁciently considered. Unfortunately, generalizing the hinge loss to classiﬁcation tasks with more than two labels is challenging and existing multiclass convex surrogates [10 12] tend to lose their consistency guarantees [13 15] or produce low accuracy predictions in practice [15].

Adversarial classiﬁcation [16, 17] uses a different approach to tackle non-convex losses like the zero-one loss. Instead of approximating the desired loss function and evaluating over the training data, it adversarially approximates the available training data within a minimax game formulation with game payoffs deﬁned by the desired (zero-one) loss function [18, 19]. This provides promising empirical results for cost-sensitive losses [16] and multivariate losses such as the F-measure and the precision-at-k [17]. Conceptually, parameter optimization for the adversarial method forces the adversary to behave like certain properties of the training data sample, making labels easier to predict within the minimax prediction game. However, a key bottleneck for these methods has been

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

their reliance on zero-sum game solvers for inference, which are computationally expensive relative to inference in other prediction methods, such as SVMs.

In this paper, we recast adversarial prediction from an empirical risk minimization perspective by analyzing the Nash equilibrium value of adversarial zero-one classiﬁcation games to deﬁne a new multiclass loss1. This enables us to demonstrate that zero-one adversarial classiﬁcation ﬁlls the long standing gap in ERM-based multiclass classiﬁcation by simultaneously: (1) guaranteeing Fisher consistency and universal consistency; (2) enabling computational efﬁciency via the kernel trick and dual parameter sparsity; and (3) providing competitive performance in practice. This reformulation also provides signiﬁcant computational efﬁciency improvements compared to previous adversarial classiﬁcation training methods [16].

2 Background and Related Work

2.1 Multiclass SVM generalizations

The multiclass support vector machine (SVM) seeks class-based potentials fy(xi) for each input vector x X and class y Y so that the discriminant function, ˆyf(xi) = argmaxy fy(xi), minimizes misclassiﬁcation errors, lossf(xi, yi) = I(yi = ˆyf(xi)). Unfortunately, empirical risk minimization (ERM), minf E P (x,y) [lossf(X, Y )], for the zero-one loss is NP-hard once the set of potentials is (parametrically) restricted (e.g., as a linear function of input features) [2]. Instead, a hinge loss approximation is employed by the SVM. In the binary setting, yi { 1, +1}, where the potential of one class can be set to zero (f 1 = 0) with no loss in generality, the hinge loss is deﬁned as [1 yif+1(xi)]+, with the compact deﬁnition [g(.)]+ max(0, g(.)). Binary SVM, which is an empirical risk minimizer using the hinge loss with L2 regularization,

min fθ E P (x,y) [lossfθ(X, Y )] + λ

2 ||θ||2 2, (1)

provides strong theoretical guarantees (Fisher consistency and universal consistency) [8, 21] and computational efﬁciency [1].

Many methods have been proposed to generalize SVM to the multiclass setting. Apart from the one-vs-all and one-vs-one decomposed formulations [22], there are three main joint formulations: the WW model by Weston et al. [11], which incorporates the sum of hinge losses for all alternative labels, loss WW(xi, yi) = P

j =yi [1 (fyi(xi) fj(xi))]+; the CS model by Crammer and Singer [10], which uses the hinge loss of only the largest alternative label, loss CS(xi, yi) = maxj =yi [1 (fyi(xi) fj(xi))]+; and the LLW model by Lee et al. [12], which employs an absolute hinge loss, loss LLW(xi, yi) = P

j =yi [1 + fj(xi)]+, and a constraint that P

j fj(xi) = 0. The former two models (CS and WW) both utilize the pairwise class-based potential differences fyi(xi) fj(xi) and are therefore categorized as relative margin methods. LLW, on the other hand, is an absolute margin method that only relates to fj(xi)[15]. Fisher consistency, or Bayes consistency [7, 13] guarantees that minimization of a surrogate loss for the true distribution provides the Bayes-optimal classiﬁer, i.e., minimizes the zero-one loss. If given any possible distribution of data, a classiﬁer is Bayes-optimal, it is called universally consistent. Of these, only the LLW method is Fisher consistent and universally consistent [12 14]. However, as pointed out by Do gan et al. [15], LLW s use of an absolute margin in the loss (rather than the relative margin of WW and CS) often causes it to perform poorly for datasets with low dimensional feature spaces. From the opposite direction, the requirements for Fisher consistency have been well-characterized [13], yet this has not led to a multiclass classiﬁer that is both Fisher consistent and performs well in practice.

2.2 Adversarial prediction games

Building on a variety of diverse formulations for adversarial prediction [23 26], Asif et al. [16] proposed an adversarial game formulation for multiclass classiﬁcation with cost-sensitive loss functions. Under this formulation, the empirical training data is replaced by an adversarially chosen conditional label distribution ˇP(ˇy|x) that must closely approximate the training data, but otherwise

1Farnia & Tse independently and concurrently discovered this same loss function [20]. They provide an analysis focused on generalization bounds and experiments for binary classiﬁcation.

seeks to maximize expected loss, while an estimator player ˆP(ˆy|x) seeks to minimize expected loss. For the zero-one loss, the prediction game is:

min ˆ P max ˇ P :EP (x) ˇ P (ˇy|x)[φ(X, ˇY )]= φ E P (x) ˆ P (ˆy|x) ˇ P (ˇy|x) h I( ˆY = ˇY ) i . (2)

The vector of feature moments, φ = E P (x,y)[φ(X, Y )], is measured from sample training data. Using minimax and strong Lagrangian duality, the optimization of Eq. (2) reduces to minimizing the equilibrium game values of a new set of zero-sum games characterized by matrix L xi,θ:

i max ˇp min ˆp ˆp T xi L xi,θˇpxi; L xi,θ =

ψ1,yi(xi) ψ|Y|,yi(xi) + 1 ... ... ... ψ1,yi(xi) + 1 ψ|Y|,yi(xi)

where θ is a vector of Lagrangian model parameters, ˆpxi is a vector representation of the conditional label distribution, ˆP( ˆY = k|xi), i.e. ˆpxi = [ ˆP( ˆY = 1|xi) ˆP( ˆY = 2|xi) . . .]T, and similarly for ˇpxi. The matrix L xi,θ is a zero-sum game matrix for each example, with ψj,yi(xi) = fj(xi) fyi(xi) = θT (φ(xi, j) φ(xi, yi)). This optimization problem (Eq. (3)) is convex in θ and the inner zero-sum game can be solved using linear programming [16].

3 Risk Minimization Perspective of Adversarial Multiclass Classiﬁcation

3.1 Nash equilibrium game value

Despite the differences in formulation between adversarial loss minimization and empirical risk minimization, we now recast the zero-one loss adversarial game as the solution to an empirical risk minimization problem. Theorem 1 deﬁnes the loss function that provides this equivalence by considering all possible combinations of the adversary s label assignments with non-zero probability in the Nash equilibrium of the game.2

Theorem 1. The model parameters θ for multiclass zero-one adversarial classiﬁcation are equivalently obtained from empirical risk minimization under the adversarial zero-one loss function:

AL0-1 f (xi, yi) = max S {1,...,|Y|}, S =

j S ψj,yi(xi) + |S| 1

where S is any non-empty member of the powerset of classes {1, 2, . . . , |Y|}.

Figure 1: AL0-1 evaluated over the space of potential differences (ψj,y(x) = fj(x) fy(x); and ψj,j(x) = 0) for binary prediction tasks when the true label is y = 1.

Thus, AL0-1 is the maximum value over 2|Y| 1 linear hyperplanes. For binary prediction tasks, there are three linear hyperplanes: ψ1,y(x), ψ2,y(x) and ψ1,y(x)+ψ2,y(x)+1

2 . Figure 1 shows the loss function in potential difference spaces ψ when the true label is y = 1. Note that AL0-1 combines two hinge functions at ψ2,y(x) = 1 and ψ2,y(x) = 1, rather than SVM s single hinge at ψ1,y(x) = 1. This difference from the hinge loss corresponds to the loss that is realized by randomizing label predictions.3 For three classes, the loss function has seven facets as shown in Figure 2a. Figures 2a, 2b, and 2c show the similarities and differences between AL0-1 and the multiclass SVM surrogate losses based on class potential differences. Note that AL0-1 is also a relative margin loss function that utilizes the pairwise potential difference ψj,y(x).

3.2 Consistency properties

Fisher consistency is a desirable property for a surrogate loss function that guarantees its minimizer, given the true distribution, P(x, y), will yield the Bayes optimal decision boundary [13, 14]. For

2The proof of this theorem and others in the paper are contained in the Supplementary Materials. 3We refer the reader to Appendix H for a comparison of the binary adversarial method and the binary SVM.

Figure 2: Loss function contour plots over the space of potential differences for the prediction task with three classes when the true label is y = 1 under AL0-1 (a), the WW loss (b), and the CS loss (c). (Note that ψi in the plots refers to ψj,y(x) = fj(x) fy(x); and ψj,j(x) = 0.)

multiclass zero-one loss, given that we know Pj(x) P(Y = j|x), Fisher consistency requires that argmaxj f j (x) argmaxj Pj(x), where f (x) = [f 1 (x), . . . , f |Y|(x)]T is the minimizer of E [lossf(X, Y )|X = x]. Since any constant can be added to all f j (x) while keeping argmaxj f j (x)

the same, we employ a sum-to-zero constraint, P|Y| j=1 fj(x) = 0, to remove redundant solutions. We establish an important property of the minimizer for AL0-1 in the following theorem.

Theorem 2. The loss for the minimizer f of E AL0-1 f (X, Y )|X = x resides on the hyperplane deﬁned (in Eq. 4) by the complete set of labels, S = {1, . . . , |Y|}.

As an illustration for the case of three classes (Figure 2a), the area described in the theorem above corresponds to the region in the middle where the hyperplane that supports AL0-1 is ψ1,y(x)+ψ2,y(x)+ψ3,y(x)+2

3 , and, equivalently, where 1

|Y| fj(x) |Y| 1

|Y| , j {1, . . . , |Y|} with a constraint that P

j fj(x) = 0. Based on this restriction, we focus on the minimization of

E AL0-1 f (X, Y )|X = x subject to 1

|Y| fj(x) |Y| 1

|Y| , j {1, . . . , |Y|} and the sum of potentials equal to zero. This minimization reduces to the following optimization:

y=1 Py(x)fy(x) subject to: 1

|Y| fj(x) |Y| 1

|Y| j {1, . . . , |Y|};

j=1 fj(x) = 0.

The solution for this maximization (a linear program) satisﬁes f j (x) = |Y| 1

|Y| if j = argmaxj Pj(x), and 1

|Y| otherwise, which therefore implies the Fisher consistency theorem.

Theorem 3. The adversarial zero-one loss, AL0-1, from Eq. (4) is Fisher consistent.

Theorem 3 implies that AL0-1 (Eq. (4)) is classiﬁcation calibrated, which indicates minimization of that loss for all distributions on X Y also minimizes the zero-one loss [21, 13]. As proven in general by Steinwart and Christmann [2], Micchelli et al. [27], since AL0-1 (Eq.(4)) is a Lipschitz loss with constant 1, the adversarial multiclass classiﬁer is universally consistent under the conditions speciﬁed in Corollary 1. Corollary 1. Given a universal kernel and regularization parameter λ in Eq. (1) tending to zero slower than 1

n, the adversarial multiclass classiﬁer is also universally consistent.

3.3 Optimization

In the learning process for adversarial classiﬁcation, Asif et al. [16] requires a linear program to be solved that ﬁnds the Nash equilibrium game value and strategy for every training data point in each gradient update. This requirement is computationally burdensome compared to multiclass SVMs, which must simply ﬁnd potential-maximizing labels. We propose two approaches with improved

efﬁciency by leveraging an oracle for ﬁnding the maximization inside AL0-1 and Lagrange duality in the quadratic programming formulation.

3.3.1 Primal optimization using stochastic sub-gradient descent

The sub-gradient in the empirical risk minimization of AL0-1 includes the mean of feature differences, 1 |R| P j R [φ(xi, j) φ(xi, yi)] , where R is the set that maximizes AL0-1. The set R is computed by the oracle using a greedy algorithm. Given θ and a sample (xi, yi), the algorithm calculates all potentials ψj,yi(xi) for each label j {1, . . . , |Y|} and sorts them in non-increasing order. Starting with the empty set R = , it then adds labels to R in sorted order until adding a label would decrease the value of

j R ψj,yi(xi)+|R| 1

Theorem 4. The proposed greedy algorithm used by the oracle is optimal.

3.3.2 Dual optimization

In the next subsections, we focus on the dual optimization technique as it enables us to establish convergence guarantees. We re-formulate the learning algorithm (with L2 regularization) as a constrained quadratic program (QP) with ξi specifying the amount of AL0-1 incurred by each of the n training examples:

min θ 1 2 θ 2 + C

i=1 ξi subject to: ξi i,k i {1, . . . n}k {1, . . . , 2|Y| 1}, (5)

where we denote each of the 2|Y| 1 possible constraints for example i corresponding to non-empty elements of the label powerset as i,k (e.g., i,1 = ψ1,yi(xi), and i,2|Y| 1 =

j Y ψj,yi(xi)+|Y| 1

|Y| ). Note also that non-negativity for ξi is enforced since i,yi = ψyi,yi(xi) = 0.

Theorem 5. Let Λi,k be the partial derivative of i,k with respect to θ, i.e., Λi,k = d i,k

dθ and νi,k is the constant part of i,k (for example if i,k = ψ1,yi(xi)+ψ3,yi(xi)+ψ4,yi(xi)+2

3 , then νi,k = 2

3), then the corresponding dual optimization for the primal minimization (Eq. 5) is:

k=1 νi,k αi,k 1

k,l=1 αi,kαj,l [Λi,k Λj,l] (6)

subject to: αi,k 0,

k=1 αi,k = C, i {1, . . . , n}, k {1, . . . , 2|Y| 1},

where αi,k is the dual variable for the k-th constraint of the i-th sample.

Note that the dual formulation above only depends on the dot product of two constraints partial derivatives (with respect to θ) and the constant part of the constraints. The original primal variable θ can be recovered from the dual variables using the formula: θ = Pn i=1 P2|Y| 1 k=1 αi,k Λi,k. Given a new datapoint x, de-randomized predictions are obtained from argmaxj fj(x) = argmaxj θTφ(x, j).

3.3.3 Efﬁciently incorporating rich feature spaces using kernelization

Considering large feature spaces is important for developing an expressive classiﬁer that can learn from large amounts of training data. Indeed, Fisher consistency requires such feature spaces for its guarantees to be meaningful. However, naïvely projecting from the original input space, xi, to richer (or possibly inﬁnite) feature spaces ω(xi), can be computationally burdensome. Kernel methods enable this feature expansion by allowing the dot products of certain feature functions to be computed implicitly, i.e., K(xi, xj) = ω(xi) ω(xj). Since our dual formulation only depends on dot products, we employ kernel methods to incorporate rich feature spaces into our formulation as stated in the following theorem.

Theorem 6. Let X be the input space and K be a positive deﬁnite real valued kernel on X X with a mapping function ω(x) : X H that maps the input space X to a reproducing kernel Hilbert

space H. Then all the values in the dual optimization of Eq. (6) needed to operate in the Hilbert space H can be computed in terms of the kernel function K(xi, xj) as:

Λi,k Λj,l = c(i,k),(j,l) K(xi, xj), i,k =

l=1 αj,l c(j,l),(i,k) K(xj, xi) + νi,k, (7)

|Rj,l| 1(m = yj) K(xj, xi) , (8)

where c(i,k),(j,l) =

|Ri,k| 1(m = yi) 1(m Rj,l)

|Rj,l| 1(m = yj) ,

and Ri,k is the set of labels included in the constraint i,k (for example if i,k = ψ1,yi(xi)+ψ3,yi(xi)+ψ4,yi(xi)+2

3 , then Ri,k = {1, 3, 4}), the function 1(j = yi) returns 1 if j = yi or 0 otherwise, and the function 1(j Ri,k) returns 1 if j is a member of set Ri,k or 0 otherwise.

3.3.4 Efﬁcient optimization using constraint generation

The number of constraints in the QP formulation above grows exponentially with the number of classes: O(2|Y|). This prevents the naïve formulation from being efﬁcient for large multiclass problems. We employ a constraint generation method to efﬁciently solve the dual quadratic programming formulation that is similar to those used for extending the SVM to multivariate loss functions [28] and structured prediction settings [29].

Algorithm 1 Constraint generation method

Require: Training data (x1, y1), . . . (xn, yn), C, ϵ

1: θ 0 2: A i { i,k| i,k = ψyi,yi(xi)} i = 1, . . . , n Actual label enforces non-negativity 3: repeat 4: for i 1, n do 5: a arg maxk| i,k Ai i,k Find the most violated constraint 6: ξi maxk| i,k A i i,k Compute the example s current loss estimate 7: if i,a > ξi + ϵ then 8: A i A i { i,a} Add it to the enforced constraints set 9: α Optimize dual over A = i A i 10: Compute θ from α: θ = Pn i=1 P

k| i,k A i αi,k Λi,k 11: end if 12: end for 13: until no A i has changed in the iteration

Algorithm 1 incrementally expands the set of enforced constraints, A i , until no remaining constraint from the set of all 2|Y| 1 constraints (in Ai) is violated by more than ϵ. To obtain the most violated constraint, we use the greedy algorithm described in the primal optimization. The constraint generation algorithm s stopping criterion ensures that a solution close to the optimal is returned (violating no constraint by more than ϵ). Theorem 7 provides a polynomial run time convergence bounds for the Algorithm 1.

Theorem 7. For any ϵ > 0 and training dataset {(x1, y1), . . . , (xn, yn)} with U = maxi[xi xi], Algorithm 1 terminates after incrementally adding at most max 2n

ϵ2 constraints to the constraint set A .

The proof of Theorem 7 follows the procedures developed by Tsochantaridis et al. [28] for bounding the running time of structured support vector machines. We observe that this bound is quite loose in practice and the algorithm tends to converge much faster in our experiments.

4 Experiments

We evaluate the performance of the AL0-1 classiﬁer and compare with the three most popular multiclass SVM formulations: WW [11], CS [10], and LLW [12]. We use 12 datasets from the UCI Machine Learning repository [30] with various sizes and numbers of classes (details in Table 1). For each dataset, we consider the methods using the original feature space (linear kernel) and a kernelized feature space using the Gaussian radial basis function kernel.

Table 1: Properties of the datasets, the number of constraints considered by SVM models (WW/CS/LLW), the average number of constraints added to the constraint set for AL0-1 and the average number of active constraints at the optima under both linear and Gausssian kernels.

Dataset Properties SVM AL0-1 constraints added and active

# class # train # test # feature constraints Linear kernel Gauss. kernel

(1) iris 3 105 45 4 210 213 13 223 38 (2) glass 6 149 65 9 745 578 125 490 252 (3) redwine 10 1119 480 11 10071 5995 1681 3811 1783 (4) ecoli 8 235 101 7 1645 614 117 821 130 (5) vehicle 4 592 254 18 1776 1310 311 1201 248 (6) segment 7 1617 693 19 9702 4410 244 4312 469 (7) sat 7 4435 2000 36 26610 11721 1524 11860 6269 (8) optdigits 10 3823 1797 64 34407 7932 597 10072 2315 (9) pageblocks 5 3831 1642 10 15324 9459 427 9155 551 (10) libras 15 252 108 90 3528 1592 389 1165 353 (11) vertebral 3 217 93 6 434 344 78 342 86 (12) breasttissue 6 74 32 9 370 258 65 271 145

For our experimental methodology, we ﬁrst make 20 random splits of each dataset into training and testing sets. We then perform two stage, ﬁve-fold cross validation on the training set of the ﬁrst split to tune each model s parameter C and the kernel parameter γ under the kernelized formulation. In the ﬁrst stage, the values for C are 2i, i = {0, 3, 6, 9, 12} and the values for γ are 2i, i = { 12, 9, 6, 3, 0}. We select ﬁnal values for C from 2i C0, i = { 2, 1, 0, 1, 2} and values for γ from 2iγ0, i = { 2, 1, 0, 1, 2} in the second stage, where C0 and γ0 are the best parameters obtained in the ﬁrst stage. Using the selected parameters, we train each model on the 20 training sets and evaluate the performance on the corresponding testing set. We use the Shark machine learning library [31] for the implementation of the three multiclass SVM formulations.

Despite having an exponential number of possible constraints (i.e., n(2|Y| 1) for n examples versus n(|Y| 1) for SVMs), a much smaller number of constraints need to be considered by the AL0-1 algorithm in practice to realize a better approximation (ϵ = 0) than Theorem 7 provides. Table 1 shows how the total number of constraints for multiclass SVM compares to the number considered in practice by our AL0-1 algorithm for linear and Gaussian kernel feature spaces. These range from a small fraction (0.23) of the SVM constraints for optdigits to a slightly greater number (with a fraction of 1.06) for iris. More speciﬁcally, of the over 3.9 million (= 210 3823) possible constraints for optdigits when training the classiﬁer, fewer than 0.3% (7932 or 10072 depending on the feature representation) are added to the constraint set during the constraint generation process. Fewer still (597 or 2315 constraints less than 0.06%) are constraints that are active in the ﬁnal classiﬁer with non-zero dual parameters. The sparsity of the dual parameters provides a key computational beneﬁt for support vector machines over logistic regression, which has essentially all non-zero dual parameters. The small number of active constraints shown in Table 1 demonstrate that AL0-1 induces similar sparsity, providing efﬁciency when employed with kernel methods.

We report the accuracy of each method averaged over the 20 dataset splits for both linear feature representations and Gaussian kernel feature representations in Table 2. We denote the results that are either the best of all four methods or not worse than the best with statistical signiﬁcance (under paired t-test with α = 0.05) using bold font. We also show the accuracy averaged over all of the datasets for each method and the number of dataset for which each method is indistinguishably best (bold numbers) in the last row. As we can see from the table, the only alternative model that is Fisher

Table 2: The mean and (in parentheses) standard deviation of the accuracy for each model with linear kernel and Gaussian kernel feature representations. Bold numbers in each case indicate that the result is the best or not signiﬁcantly worse than the best (paired t-test with α = 0.05).

D Linear Kernel Gaussian Kernel

AL0-1 WW CS LLW AL0-1 WW CS LLW

(1) 96.3 (3.1) 96.0 (2.6) 96.3 (2.4) 79.7 (5.5) 96.7 (2.4) 96.4 (2.4) 96.2 (2.3) 95.4 (2.1) (2) 62.5 (6.0) 62.2 (3.6) 62.5 (3.9) 52.8 (4.6) 69.5 (4.2) 66.8 (4.3) 69.4 (4.8) 69.2 (4.4) (3) 58.8 (2.0) 59.1 (1.9) 56.6 (2.0) 57.7 (1.7) 63.3 (1.8) 64.2 (2.0) 64.2 (1.9) 64.7 (2.1) (4) 86.2 (2.2) 85.7 (2.5) 85.8 (2.3) 74.1 (3.3) 86.0 (2.7) 84.9 (2.4) 85.6 (2.4) 86.0 (2.5) (5) 78.8 (2.2) 78.8 (1.7) 78.4 (2.3) 69.8 (3.7) 84.3 (2.5) 84.4 (2.6) 83.8 (2.3) 84.4 (2.6) (6) 94.9 (0.7) 94.9 (0.8) 95.2 (0.8) 75.8 (1.5) 96.5 (0.6) 96.6 (0.5) 96.3 (0.6) 96.4 (0.5) (7) 84.9 (0.7) 85.4 (0.7) 84.7 (0.7) 74.9 (0.9) 91.9 (0.5) 92.0 (0.6) 91.9 (0.5) 91.9 (0.4) (8) 96.6 (0.6) 96.5 (0.7) 96.3 (0.6) 76.2 (2.2) 98.7 (0.4) 98.8 (0.4) 98.8 (0.3) 98.9 (0.3) (9) 96.0 (0.5) 96.1 (0.5) 96.3 (0.5) 92.5 (0.8) 96.8 (0.5) 96.6 (0.4) 96.7 (0.4) 96.6 (0.4) (10) 74.1 (3.3) 72.0 (3.8) 71.3 (4.3) 34.0 (6.4) 83.6 (3.8) 83.8 (3.4) 85.0 (3.9) 83.2 (4.2) (11) 85.5 (2.9) 85.9 (2.7) 85.4 (3.3) 79.8 (5.6) 86.0 (3.1) 85.3 (2.9) 85.5 (3.3) 84.4 (2.7) (12) 64.4 (7.1) 59.7 (7.8) 66.3 (6.9) 58.3 (8.1) 68.4 (8.6) 68.1 (6.5) 66.6 (8.9) 68.0 (7.2)

avg 81.59 81.02 81.25 68.80 85.14 84.82 85.00 84.93 #bold 9 6 8 0 9 6 6 7

consistent the LLW model performs poorly on all datasets when only linear features are employed. This matches with previous experimental results conducted by Do gan et al. [15] and demonstrates a weakness of using an absolute margin for the loss function (rather than the relative margins of all other methods). The AL0-1 classiﬁer performs competitively with the WW and CS models with a slight advantages on overall average accuracy and a larger number of indistinguishably best performances on datasets or, equivalently, fewer statistically signiﬁcant losses to any other method.

The kernel trick in the Gaussian kernel case provides access to much richer feature spaces, improving the performance of all models, and the LLW model especially. In general, all models provide competitive results in the Gaussian kernel case. The AL0-1 classiﬁer maintains a similarly slight advantage and only provides performance that is sub-optimal (with statistical signiﬁcance) in three of the twelve datasets versus six of twelve and ﬁve of twelve for the other methods. We conclude that the multiclass adversarial method performs well in both low and high dimensional feature spaces. Recalling the theoretical analysis of the adversarial method, it is a well-motivated (from the adversarial zero-one loss minimization) multiclass classiﬁer that enjoys both strong theoretical properties (Fisher consistency and universal consistency) and empirical performance.

5 Conclusion

Generalizing support vector machines to multiclass settings in a theoretically sound manner remains a long-standing open problem. Though the loss function requirements guaranteeing Fisher-consistency are well-understood [13], the few Fisher-consistent classiﬁers that have been developed (e.g., LLW) often are not competitive with inconsistent multiclass classiﬁers in practice. In this paper, we have sought to ﬁll this gap between theory and practice. We have demonstrated that multiclass adversarial classiﬁcation under zero-one loss can be recast from an empirical risk minimization perspective and its surrogate loss, AL0-1, shown to satisfy the Fisher consistency property, leading to a universally consistent classiﬁer that also performs well in practice. We believe that this is an important contribution in understanding both adversarial methods and the generalized hinge loss. Our future work includes investigating the adversarial methods under the different losses and exploring other theoretical properties of the adversarial framework, including generalization bounds.

Acknowledgments

This research was supported as part of the Future of Life Institute (futureoﬂife.org) FLI-RFP-AI1 program, grant#2016-158710 and by NSF grant RI-#1526379.

[1] Vladimir Vapnik. Principles of risk minimization for learning theory. In Advances in Neural Information Processing Systems, pages 831 838, 1992. [2] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Publishing Company, Incorporated, 1st edition, 2008. ISBN 0387772413. [3] Peter Mc Cullagh and John A Nelder. Generalized linear models, volume 37. CRC press, 1989. [4] Bernhard E Boser, Isabelle M Guyon, and Vladimir N Vapnik. A training algorithm for optimal margin classiﬁers. In Proceedings of the Workshop on Computational Learning Theory, pages 144 152, 1992. [5] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273 297, 1995. [6] Yi Lin. Support vector machines and the bayes rule in classiﬁcation. Data Mining and Knowledge Discovery, 6(3):259 275, 2002. [7] Peter L. Bartlett, Michael I. Jordan, and Jon D. Mc Auliffe. Convexity, classiﬁcation, and risk bounds. Journal of the American Statistical Association, 101(473):138 156, 2006. [8] Ingo Steinwart. Support vector machines are universally consistent. J. Complexity, 18(3):768 791, 2002. [9] Ingo Steinwart. Consistency of support vector machines and other regularized kernel classiﬁers. IEEE Trans. Information Theory, 51(1):128 142, 2005. [10] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research, 2:265 292, 2002. [11] Jason Weston, Chris Watkins, et al. Support vector machines for multi-class pattern recognition. In ESANN, volume 99, pages 219 224, 1999. [12] Yoonkyung Lee, Yi Lin, and Grace Wahba. Multicategory support vector machines: Theory and application to the classiﬁcation of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67 81, 2004. [13] Ambuj Tewari and Peter L Bartlett. On the consistency of multiclass classiﬁcation methods. The Journal of Machine Learning Research, 8:1007 1025, 2007. [14] Yufeng Liu. Fisher consistency of multicategory support vector machines. In International Conference on Artiﬁcial Intelligence and Statistics, pages 291 298, 2007. [15] Ürün Do gan, Tobias Glasmachers, and Christian Igel. A uniﬁed view on multi-class support vector classiﬁcation. Journal of Machine Learning Research, 17(45):1 32, 2016. [16] Kaiser Asif, Wei Xing, Sima Behpour, and Brian D. Ziebart. Adversarial cost-sensitive classiﬁcation. In Proceedings of the Conference on Uncertainty in Artiﬁcial Intelligence, 2015. [17] Hong Wang, Wei Xing, Kaiser Asif, and Brian Ziebart. Adversarial prediction games for multivariate losses. In Advances in Neural Information Processing Systems, pages 2710 2718, 2015. [18] Flemming Topsøe. Information theoretical optimization techniques. Kybernetika, 15(1):8 27, 1979. [19] Peter D. Grünwald and A. Phillip Dawid. Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Annals of Statistics, 32:1367 1433, 2004. [20] Farzan Farnia and David Tse. A minimax approach to supervised learning. In Advances in Neural Information Processing Systems, pages 4233 4241. 2016. [21] Peter L Bartlett, Michael I Jordan, and Jon D Mc Auliffe. Large margin classiﬁers: Convex loss, low noise, and convergence rates. In Advances in Neural Information Processing Systems, pages 1173 1180, 2003. [22] Naiyang Deng, Yingjie Tian, and Chunhua Zhang. Support vector machines: optimization based theory, algorithms, and extensions. CRC press, 2012. [23] Nilesh Dalvi, Pedro Domingos, Sumit Sanghai, Deepak Verma, et al. Adversarial classiﬁcation. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, pages 99 108. ACM, 2004. [24] Anqi Liu and Brian Ziebart. Robust classiﬁcation under sample selection bias. In Advances in Neural Information Processing Systems, pages 37 45, 2014. [25] Gert RG Lanckriet, Laurent El Ghaoui, Chiranjib Bhattacharyya, and Michael I Jordan. A robust minimax approach to classiﬁcation. The Journal of Machine Learning Research, 3:555 582, 2003. [26] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672 2680, 2014. [27] Charles A. Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. Journal of Machine Learning Research, 6:2651 2667, 2006. [28] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. In JMLR, pages 1453 1484, 2005. [29] Thorsten Joachims. A support vector method for multivariate performance measures. In Proceedings of the International Conference on Machine Learning, pages 377 384, 2005. [30] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml. [31] Christian Igel, Verena Heidrich-Meisner, and Tobias Glasmachers. Shark. Journal of Machine Learning Research, 9:993 996, 2008.