# multiclass_learning_from_contradictions__050fee7d.pdf Multiclass Learning from Contradictions Sauptik Dhar LG Electronics Santa Clara, CA 95054 sauptik.dhar@lge.com Vladimir Cherkassky University of Minnesota Minneapolis, MN 55455 cherk001@umn.edu Mohak Shah LG Electronics Santa Clara, CA 95054 mohak.shah@lge.com We introduce the notion of learning from contradictions, a.k.a Universum learning, for multiclass problems and propose a novel formulation for multiclass universum SVM (MU-SVM). We show that learning from contradictions (using MU-SVM) incurs lower sample complexity compared to multiclass SVM (M-SVM) by deriving the Natarajan dimension for sample complexity for PAC-learnability of MU-SVM. We also propose an analytic span bound for MU-SVM and demonstrate its utility for model selection resulting in 2 4 faster computation times than standard resampling techniques. We empirically demonstrate the efficacy of MU-SVM on several real world datasets achieving > 20% improvement in test accuracies compared to M-SVM. Insights into the underlying behavior of MU-SVM using a histograms-of-projections method are also provided. 1 Introduction Many machine learning problems in domains such as, healthcare, autonomous driving, and prognostics and health management involve learning from high-dimensional data with limited labeled samples. In such domains labeling very large quantities of data is either extremely expensive, or entirely prohibitive due to the manual effort required. Standard inductive learning methods, including data intensive deep architectures [1], may not be sufficient for such high-dimensional limited-labeleddata problems. The learning from contradictions paradigm (popularly known as Universum learning) has shown to be particularly effective for binary classification problems of this nature [2 11]. In this paradigm, along with the labeled training data we are also given a set of unlabeled universum samples. These universum samples belong to the same application domain as the training data, but are known not to belong to either of the two classes. The rationale behind this setting comes from the fact that even though obtaining labels is very difficult, obtaining such additional unlabeled samples is relatively easier. These unlabeled universum samples act as contradictions and should not be explained by the binary decision rule. However, this paradigm has been mostly limited to binary classification problems making it impractical for most real applications involving classification of more than two categories. Further, this limits incorporation of a priori knowledge by discarding available universum data for such applications. Previous works such as [12,13] have hinted on adopting an Error Correcting Output Code (ECOC) based setting such as one-vs-one (OVO) and one-vs-all (OVA), where several binary Universum SVM [12] classifiers are combined to solve the multiclass problem. However, such studies lack a complete formalization and analysis. An alternative is the adoption of a direct approach [14] where the entire multiclass problem is solved through a single larger optimization formulation by 33rd Conference on Neural Information Processing Systems (Neur IPS 2019), Vancouver, Canada. introducing universum learning under a probabilistic framework using a logistic loss. However, the work does not clarify as to how contradictions are captured through the proposed formulation. In this paper, we propose a formalization for multiclass learning with contradictions. Following [15] for multiclass SVM s, we introduce the new Multiclass Universum SVM (MU-SVM) formulation. The proposed MU-SVM provides a unified framework for multiclass learning under universum settings, with improved performance accuracies. The main contributions of this paper are as follows: 1. Formulation: We formalize the notion of universum learning for multiclass SVM (M-SVM), and propose a novel direct formulation called Multiclass Universum SVM (MU-SVM). 2. PAC Learnability: We derive the Natarajan dimension for the MU-SVM hypothesis class and analyze its sample complexity for PAC learnability (Theorem 2). Our analysis shows that MU-SVM incurs lower sample complexity compared to M-SVM. 3. Useful Properties: MU-SVM reduces to: i) standard multiclass SVM in absence of universum data and ii) binary U-SVM formulation in [16], for two-class problems (proposition 2). In addition, the proposed MU-SVM is solvable through any state-of-art M-SVM solvers (proposition 3). 4. Model Selection: We provide a new span definition specific to MU-SVM, and following [17] derive a leave-one-out bound for MU-SVM (Theorem 3). Under additional assumptions, a computationally efficient version of the bound is also provided (Theorem 4). 5. Empirical validation: Empirical results demonstrate the efficacy of the proposed formulation. We also propose a histogram-of-projections approach to analyse the results (section 4). 2 Multiclass SVM (M-SVM) This section introduces multiclass learning under inductive settings and the popular Crammer and Singer s (C&S) multiclass SVM (M-SVM) formulation [15] used in such settings. Although several other multiclass SVM formulations have been proposed in literature [18 24], C&S s M-SVM is among the most widely used ones. Further, compared to the most popular multiclass formulations, C&S s M-SVM provides the smallest estimation error while ensuring a Figure 1: Loss function for MSVM with fk(x) = w k x. For the soft-margin M-SVM (3) any sample (x, y = k) lying inside the margin is linearly penalized using the slack variable ξ. small approximation error (see [25] Table 1 for details). This makes the C&S M-SVM formulation highly desirable for limited labeled samples settings. In this paper we use the C&S s MSVM with equal misclassification costs for balanced data as an exemplar for multiclass SVM formulations under inductive settings, and refer to it as M-SVM throughout. Definition 1. (Multiclass Learning under Inductive Setting) Given i.i.d training samples T = (xi, yi)n i=1 Dn T , with x X ℜd and y Y = {1, . . . , L}, estimate a hypothesis h : X Y from hypothesis class H which minimizes, inf h H EDT [1y =h(x)] (1) where, DT = the training distribution, 1( ) = indicator function, and EDT ( ) = expectation under training distribution. The M-SVM, by minimizing a margin-based loss function [15], estimates f = [f1, . . . , f L] to construct the decision rule ˆh(x) = argmax l=1,...,L fl(x). The M-SVM hypothesis class is given as, HM-SVM = n x argmax l Y (w T l x) : l=1 wl 2 2 Λ2; w T k x argmax l =k w T l x 1; if y = k o (2) where, Λ 0 is a user-defined parameter which controls the complexity of the hypothesis class. The form in (2) is also known as the hard-margin version of M-SVM. For practical purposes we solve the soft-margin version given below, min w1...w L,ξ 1 2 l=1 wl 2 2 + C i=1 ξi i = 1 . . . n, l = 1 . . . L (3) s.t.: (wyi wl) xi eil ξi; eil = 1 δil δil = 1; yi = l 0; yi = l In this formulation, the training samples falling inside the margin border ( +1 ) are linearly penalized using the slack variables ξi 0, i = 1 . . . n (see Fig 1), which contributes to the margin error n P Eq. (3) balances between minimizing the margin error and regularization term using the user-defined parameter C 0. 3 Multiclass Universum SVM (MU-SVM) 3.1 Multiclass U-SVM formulation Learning from contraditions or Universum learning was introduced in [2] for binary classification problems to incorporate a priori knowledge about admissible data samples. For multiclass problems Figure 2: Loss function for universum samples x for kth class decision boundary w k x max l=1...Lw l x = 0. For soft-margin MU-SVM formulation (7) a sample lying outside the -insensitive zone is linearly penalized using the slack variable ζk. in addition to the labeled training data we are also given with unlabeled universum samples which are known not to belong to any of the classes in the training data. For example, if the goal of learning is to discriminate between handwritten digits (0, 1, 2,...,9), one can introduce additional knowledge in the form of handwritten letters (A, B, C, ... ,Z). These examples from the universum contain certain information (e.g., handwriting styles) but they cannot be assigned to any of the classes (0 to 9). Further, the universum samples do not have the same distribution as labeled training samples. Learning under this setting can be formalized as below. Definition 2. (Multiclass Learning under Universum Setting) Given i.i.d training samples T = (xi, yi)n i=1 Dn T , with x X ℜd and y Y = {1, . . . , L} and additional m unlabeled universum samples U = (x i )m i =1 DU with x X U ℜd, estimate h : X Y from hypothesis class H which, in addition to eq. (1), obtains maximum contradiction on universum samples i.e. maximizes the following probability for x X U, sup h H PDU [x / any class] = sup h H EDU [1{ T k {1,...,L} h(x ) =k}] (4) where, DU is the universum distribution, PDU ( ) is probability under universum distribution, EDU ( ) is the expectation under universum distribution, X U is the domain of universum data. Learning under the universum setting has the dual goal of minimizing eq. (1) while maximizing the contradiction (in eq. (4)) on universum samples. The following proposition 1 provides guidance on how to address this for the M-SVM formulation in eq. (3). Proposition 1. For the M-SVM formulation in (3), maximum contradiction on universum samples x U can be achieved when, |(w k x max l=1...Lw l x )| = 0; k {1, . . . , L} (5) That is, learning under Definition 2 using M-SVM requires the universum samples to lie on the decision boundaries of all the classes {1 . . . L}. Here however, we relax this constraint (5) by requiring the universum samples to lie within a -insensitive zone around the decision boundaries (see Fig 2) as was done for binary scenario [16]. However, different from [16], here the -insensitive loss is introduced for the decision boundaries of all the classes i.e., |w k x max l=1...Lw l x | ; k = 1 . . . L. This reasoning motivates the new multiclass Universum-SVM (MU-SVM) formulation where the Training samples T := (xi, yi)n i=1; yi {1, . . . , L} are penalized by standard hinge loss (similar to M-SVM (3) and shown in Fig. 1); and the Universum samples U := (x i )m i =1 are penalized by a -insensitive loss (see Fig. 2) for the decision functions of all the classes f = [f1, . . . , f L]. The resulting hard-margin MU-SVM hypothesis class is given as, HMU-SVM = n x argmax l=1,...,L (w T l x) : l=1 wl 2 2 Λ2 ; w T k x argmax l =k w T l x 1; if y = k |w k x max l=1...Lw l x | ; k Y o (6) We then relax the hard constraints on the universum samples by linearly penalizing the constraint violations through a slack variable ζ shown in Fig. 2 leading to the following soft-margin MU-SVM formulation 1, min w1...w L,ξ,ζ 1 2 l=1 wl 2 2 + C i=1 ξi + C m X k=1 ζi k i = 1 . . . n, i = 1 . . . m s.t. (wyi wl) xi eil ξi; eil = 1 δil, l = 1 . . . L (7) |(w k x i max l=1...Lw l x i )| + ζi k; k = 1 . . . L ζi k 0, δil = 1; yi = l 0; yi = l Here, for the kth class decision boundary the universum samples (x i )m i =1 that lie outside the -insensitive zone are linearly penalized using the slack variables ζi k 0, i = 1 . . . m. The user-defined parameters C, C 0 control the trade-off between the margin size, the margin-error on training samples, and the contradictions (samples lying outside zone) on the universum samples. Note that for C = 0 eq. (7) reduces to the M-SVM classifier. Proposition 2. For binary classification L = 2, (7) reduces to the standard U-SVM formulation in [16] with w = w1 w2 and b = 0. 3.2 Sample Complexity for PAC Learnability Next we derive the sample complexity for PAC-learnability of HM SV M and HMU SV M and provide a comparative analysis. First we provide the necessary definitions, Definition 3. (Sample Complexity for PAC learnability [26,27]) of an algorithm A : X Y H is defined as the smallest integer n A(ϵ, δ) such that for any given ϵ, δ > 0 and every n > n A(ϵ, δ) and distribution D on X Y we have ˆh = A((xi, yi)n i=1), P (xi,yi)n i=1 D PD[ˆh(x) = y] > inf h HPD[h(x) = y] + ϵ δ (8) The sample complexity of a hypothesis class H : n H(ϵ, δ) = inf A n A(ϵ, δ) The sample complexity for PAC learnability depends on the size (a.k.a capacity measure) of a hypothesis class. Traditional capacity measures used in binary classification do not directly apply for multiclass problems. This has led to the research on several newer capacity measures for multiclass problems [19,28 30]. One of the most widely researched capacity measure for multiclass SVMs is the Natarajan dimension [31] defined next, Definition 4. Shattering (Multiclass version) For any hypothesis class H YX and any S X, where Y = {1, . . . , L} and X := training data domain, we say H shatters S if f1, f2 : S Y with x S, f1(x) = f2(x) and for every T S there is a g H which satisfies, x T, g(x) = f1(x), and x S T, g(x) = f2(x). (9) Natarajan Dimension d N (H) is the maximal cardinality of a set that is shattered by H. An advantage of Natarajan Dimension is that it provides a natural extension to the fundamental learning theorem for multi-class problems (see [26,32]) discussed next. 1Throughout this paper, we use index i, j for training samples, i for universum samples and k, l for the class labels. Theorem 1. (Fundamental Learning Theorem) There exist absolute constants C1, C2 > 0 such that any hypothesis class H of functions from X Y is PAC-learnable with sample complexity C1 d N (H) + log(1/δ) ϵ2 n H(ϵ, δ) C2 d N (H)log(|Y|) + log(1/δ) Theorem 1 shows n H(ϵ, δ) = O( d N (H)log(|Y|)+log(1/δ) ϵ2 ). Hence, for low sample complexity it is desirable for hypothesis classes to have smaller d N (H). With these definitions in place, we prove a new Theorem 2 to characterize d N (H) for HM SV M and HMU SV M as shown below, Theorem 2. The Natarajan dimension for HM SV M and HMU SV M has the form d N (H) = O(ϑlog(ϑ)). Assuming ||x||2 2 R2 ; x X ℜd gives, HM SV M : ϑ = min (d L + 1, 2R2Λ2) (11) HMU SV M : ϑ = min (d L + 1, κ) (12) d = data dimension, L = total no. of classes κ min γ {γ 0 ; G(γ) 0} F(γ) = Λ2 + γ m L(L 1) 2 2 G(γ) = [2F(γ)R2]2 4γF(γ) trace[H(γ)] (14) H(γ) = (I + γVVT ) 1(VZT ZVT ) (15) also, the transformations Z, V are obtained as, (T1) Given: For a maximally shattered set S = {x1, . . . , xd N } using the functions f1(x), f2(x) (see definition (4)) Define: a mapping φ : ℜd ℜd L as , (0d 1, . . . , x f1(x)=l, . . . x f2(x)=k, . . . , 0d 1)d L 1; x T S (0d 1, . . . , x f1(x)=l, . . . x f2(x)=k, . . . , 0d 1)d L 1; x S T Obtain: Z = (z1)T ... (zd N )T Basically, the transformation φ maps a sample x ℜd from the shattered set S to a d L - dimension vector z; where for any x T with f1(x) = l and f2(x) = k; we copy the x vector onto l(d 1) + 1 . . . ld-th position and x vector onto k(d 1) + 1 . . . kd-th position of z. The remaining elements are set to 0. We reverse the sign of the mapping for x S T. (T2) Given: universum set U = (x 1)T ... (x m)T 1(L 1) 1 IL 1 L 1 0 1(L 2) 1 IL 2 L 2 0 0 1(L 3) 1 IL 3 L 3 2 L Obtain: V = (G U) where is the Kronecker product. Due to space constraints, the proof of Theorem 2 is provided in the supplementary material. Theorem 2 provides a framework to analyze the sample complexity for PAC-learnability of HM SV M and HMU SV M. A direct observation from Theorem 2 is that MU-SVM is likely to have a smaller d N and hence a lower sample complexity compared to M-SVM. This is seen from (14), where setting γ = 0 F(γ) = Λ2 G(γ) = [2Λ2R2]2. Hence we always have, κ 2R2Λ2 from (13). This gives ϑHMU SV M ϑHM SV M . In fact ϑHMU SV M can be significantly smaller than ϑHM SV M for a well chosen γ {γ 0 ; G(γ) 0}, resulting to a much smaller d N for MU-SVM compared to M-SVM. The trade-off between m, and the universum data types further ensures low sample-complexity for MU-SVM. 3.3 MU-SVM Implementation Another desirable property of MU-SVM (7) is that it is solvable through state-of-art M-SVM solvers [33,34]. For every universum sample x i we create L artificial samples belonging to all the classes, i.e. (x i , y i 1 = 1), . . . , (x i , y i L = L) and add them to the training set as shown below, (xi, yi, eil, Ci, ξi) = (xi, yi, eil, C, ξi) i = 1 . . . n (x i , y i l, ei l, C , ζi ) i = n + 1 . . . n + m L; i = 1 . . . m; l = 1 . . . L (16) Proposition 3. MU-SVM (7) after the transformation (16) can be solved in the dual form as, max α W(α) = 1 l αilαjl K(xi, xj) X i,l αileil (17) l αil = 0; αi,l Ci if l = yi; αi,l 0 if l = yi; i, j = 1 . . . n + m L, l = 1 . . . L Note that (17) has similar form as the M-SVM s dual form (see [15, 24]), except that (17) has additional m L constraints for the universum samples. Hence, solving MU-SVM using (17) is same as solving an M-SVM problem (3) with n + m L samples. 3.4 Model Selection The MU-SVM (17) has four tunable parameters: C, C , kernel parameter, and . Successful application of MU-SVM significantly depends on optimal model selection. In this paper we simplify the model selection using a two-step approach, (Step a) First, perform optimal tuning of the C and kernel parameters for M-SVM (2). This equivalently tunes the parameters specific only to the training samples in the MU-SVM formulation. (Step b) Tune while keeping C and kernel parameters fixed (from Step a). Also C = n C m L is kept fixed to ensure equal contribution of training and universum samples in MU-SVM (7). The model parameters in Steps (a) & (b) are typically selected through resampling techniques like, leave one out (l.o.o) or stratified cross-validation approaches [35]. Of these approaches, l.o.o provides an almost unbiased estimate of the test error [36]. However, on the downside it can be computationally prohibitive. In this paper we provide a new span definition for MU-SVM in (19), and modify the technique in [17] to derive a new analytic l.o.o bound for MU-SVM. Other span based l.o.o bounds have been derived for alternative versions of the M-SVM formulation [37]. However they do not apply to the C&S s M-SVM and the MU-SVM formulation proposed in this paper. Next, we show that our proposed bound can be sucessfully used for model selection in Steps (a) & (b) thereby avoiding computationally-prohibitive l.o.o and expensive cross-validation. The necessary definitions are provided next. Definition 5. The (Leave-One-Out procedure) with the tth training sample dropped involves solving (17) with an additional constraint αtl = 0; l. The obtained l.o.o solution αt = [αt 11, . . . , αt 1L | {z } αt 1 , . . . , αt t1 = 0, . . . , αt t L = 0 | {z } αt t=0 , . . .] with tth sample prediction ˆyt = i αt il K(xi, xt) gives the leave-one-out error as: Rl.o.o = 1 t=1 1[yt = ˆyt] Definition 6. Support vectors obtained through solving (17) are categorized as: Type 1 SV1 = { i |0 < αiyi < Ci} and Type 2 SV2 = { i |αiyi = Ci}. The set of all support vectors are represented as, SV = SV1 SV2. Similarly, the set of support vectors for l.o.o solution is given as SV t. Under Definition 6 we prove the following, Theorem 3. The leave-one-out error is upper bounded as: | n t SV1 T |St max( C ) 1 o | + | n t SV2 T o | where, | | := Cardinality of a set, and St := Span of a Type 1 support vector xt given as, S2 t = min β l βilβjl)K(xi, xj) (new span definition specific to MU-SVM) (19) s.t. αil βil Ci; {(i = t, l)| 0 < αil < Ci; l = yi} αil βil 0; {(i = t, l)| αil < 0; l = yi} βil = 0; i / SV1 {t} l = 1 . . . L; βtl = αtl; l = 1 . . . L; X and D is the Diameter of the smallest hypersphere containing all training samples. Please refer to the supplementary material for proof of Theorem 3. The practical utility of (18) is limited due to the significant computational complexity for solving (19) which results to O(n + m L)4 (worst case) to compute (18). To alleviate this, we derive a computationally attractive alternative under the following assumptions, Assumption 1. : For the MU-SVM solution, (A1) The sets SV1 and SV2 remain the same during the l.o.o procedure. (A2) The SV1 support vectors have only two active elements i.e. αi SV1 k = yi s.t. αik = αiyi. Theorem 4. Under Assumption 1 the leave-one-out error is upper bounded as: n| n t SV T | S2 t α t X l αil K(xi, xt) o | (20) where, S2 t = α t [(H 1)tt] 1αt t SV1 T α t [K(xt, xt) IL KT t H 1Kt]αt t SV2 T ; H := KSV1 IL A A 0 A := I|SV1| (1L) ; 1L = [ 1 1 . . . 1 | {z } L elements ] ; KSV1 := Kernel matrix of the SV1 support vectors (H 1)tt := sub-matrix of H 1 for indices i = [(t 1)L + 1 . . . t L] ; Kt = [(k T t 1L) 0L |SV1|]T ; kt = n|SV1| 1 dim vector where ith element is K(xi, xt), xi SV1 ; and is the Kronecker product. Theorem 4 provides a good approximation of the l.o.o error (also confirmed from results in Table 3) even when the Assumption 1 is violated just as in [17]. Further, it provides two major advantages over Theorem 3. First, Eq. (20) is valid for both SV1 & SV2 training support vectors and results in a stricter bound. Second, span computation in theorem 4 requires only one matrix inversion H 1. This results in a significant speed-up to O(n + m L)3 for computing the leave-one-out bound using (20) as compared to O(n + m L)4 in (18). 4 Empirical Results We use three real life datasets discussed next: German Traffic Sign Recognition Benchmark (GTSRB) [38]: The goal is to identify the traffic signs Table 1: Real-life datasets. DATASET TRAIN / TEST SIZE GTSRB 300 / 1500 (100 / 500 PER CLASS) ABCDETC 600 / 400 (150 / 100 PER CLASS) ISOLET 500 / 500 (100 / 100 PER CLASS) for the speed-zones 30 , 70 and 80 . Here, the images are represented by their 1568 histogram of gradient (HOG 1) features. For this data we use three kinds of Universum: (U1) Random Averaging (RA) : synthetically created by first selecting a random traffic sign from each class ( 30 , 70 and 80 ) in the training set and averaging them. (U2) Others: all other non-speed traffic signs. (U3) Priority road Sign: an exhaustive search over several non-speed zone traffic signs showed this universum to provide the best performance (see appendix B.4). Handwritten characters (ABCDETC) [16]: The goal is to identify handwritten digits 0 - 3 using their 10000 (100 100) pixel values. We use the characters other than digits as universum i.e., (U1) A - Z uppercase letters, (U2) a - z lowercase letters, (U3) all other symbols like:- ! ? , . ; : = - + / / ( ) $ % " @ and (U4) Random Averaging (RA) generated as above. Speech-based Isolated Letter recognition (ISOLET) [39]: This is a speech recognition dataset where 150 subjects pronounced each letter a - z twice. The goal is to identify the spoken letters a - e using the 617 dimensional spectral coefficients, contour, sonorant, presonorant, and post-sonorant features. We use two different types of universum: (U1) Others which contains all the other speeched letters i.e. f - z and (U2) Random Averaging (RA) discussed above. Due to space constraints and to simplify our analyses in later sections, we used only a subset of the training classes. We see similar results using all the classes (results provided in supplementary material B.1). For the model parameters our initial experiments showed linear parameterization to be optimal for GTSRB; hence only linear kernel has been used for it. For ABCDETC and ISOLET an RBF kernel K(xi, xj) = exp( γ xi xj 2) with γ = 2 7 provided optimal results for M-SVM. For all the experiments model selection is done over the range of parameters C = [10 4, . . . , 103] , C /C = n m L and = [0, 0.01, 0.05, 0.1]. Effectiveness of the MU-SVM formulation (7): Table 2 provides the average test error for MUSVM and several other baseline methods over 10 random training/test partitioning of the data in the proportions shown in Table 1. Model selection within each partition is done using stratified 5 Fold CV [35]. Here, SVMOVA & SVMOVO denotes the popular ECOC based multiclass extensions one-vs-all (OVA) and one-vs-one (OVO) using binary SVM [2] as the base classifier. Similarly, U-SVMOVA & U-SVMOVO uses binary U-SVM [16] as the base classifier. Owing to space constraints, we only show the results for the best performing universum for all the datasets. Also we fix the number of universum samples to m = 500. Additional increase in the universum samples do not provide any significant gains (see appendix B.3 for results). The complete set of results using all the universum types are provided in Appendix B.2. For reproducibility of the results we also provide the typical optimal parameters selected through model selection in Appendix B.2. Table 2 shows that MU-SVM provides lower test errors compared to all the other baseline methods. Specifically, compared to M-SVM, the performance gains using MU-SVM improve significantly up to 20 25%. For sufficiently large universum set size, such significant improvements using MU-SVM depend mostly on the statistical characteristics of the universum data. To better understand these statistical characteristics we adopt the technique of histogram of projections (HOP) originally introduced for binary classification in [40]. Here, different from [40], for a given M-SVM / MU-SVM model we project the training samples onto the decision space of their respective classes i.e. (xi, yi = k) we obtain the projection values as w k xi max l =k w l xi. In addition we also project the universum samples onto the decision spaces of all the classes i.e. (x i ); project k; w k x i max l =k w l x i . Finally we generate the class specific histograms of these projection values. In addition to the histograms, we also generate a frequency plot of the predicted labels for the universum samples using the models. Using this HOP visualization we analyze the effectiveness of the universum U3 for GTSRB dataset (see Fig 3). As seen from Fig. 3, the optimal M-SVM model has high separability for the training samples i.e. most of the training samples lie outside the margin borders (+1). In addition, the universum samples U3 are widely spread about the margin-borders and biased towards the positive side of the decision boundary of the sign 30 (Fig. 3(a)); and hence predominantly gets classified as sign 30 (Fig.3(d)). As seen from Figs 3. (e)-(g), applying the MU-SVM model preserves the separability of the training samples and additionally reduces the spread of the universum samples. Following proposition 1, such a model exhibits higher uncertainty on the universum samples class membership, and uniformly assigns them over all the classes i.e. signs 30 , 70 and 80 (see Fig. 3(h)). This shows that, the resulting MU-SVM model has higher contradiction (uncertainty) on the universum samples and hence provides better generalization compared to M-SVM. This behavior is consistently seen for the other datasets and universum choices (provided in the supplementary material - Appendix B.6). Model Selection using Theorem 4: Table 3 provides the average std. dev of time taken (in 30 70 80 (d) 30 70 80 (h) Figure 3: GTSRB: Histograms of projections for training (in blue) and universum U3 (in red). M-SVM (C = 1): (a) sign 30 . (b) sign 70 . (c) sign 80 . (d) frequency plot of universum labels. MU-SVM ( = 0) :(e) sign 30 . (f) sign 70 . (g) sign 80 . (h) frequency plot of universum labels. Table 2: Mean ( standard deviation) of the test errors (in %) over 10 runs of the experimental setting in Table 1. No. of universum samples (m = 500). DATASET SVMOVA SVMOVO M-SVM U-SVMOVA U-SVMOVO MU-SVM GTSRB (USING U3) 7.17 1.08 7.16 1.92 7.24 1.16 6.05 0.61 5.97 0.63 5.53 0.62 ABCDETC (USING U4) 28.1 4.74 29.1 4.16 27.5 3.34 26.1 4.93 26.9 4.51 22.1 3.24 ISOLET (USING U2) 3.72 0.6 3.88 0.44 3.6 0.31 3.56 0.55 3.88 0.63 2.83 0.32 Table 3: Comparisons for model selection using 5 Fold CV vs. Theorem 4. No. of universum samples (m = 500). Model parameters used C /C = n m L, = [0, 0.01, 0.05, 0.1] . 5-FOLD CV THEOREM 4 MUSVM TEST ERROR (IN %) TIME ( 104sec) TEST ERROR (IN %) TIME ( 104sec) U1 6.9 0.9 3.1 0.5 6.9 0.9 0.8 0.2 U2 7.4 0.9 3.2 0.9 7.1 0.8 0.9 0.3 U3 5.5 0.6 2.9 0.3 5.2 0.4 0.9 0.1 U1 26.1 4.0 2.8 0.1 26.1 3.7 1.1 0.1 U2 24.2 3.1 2.8 0.1 24.4 3.2 1.3 0.1 U3 23.3 3.2 2.6 0.2 24.1 3.8 0.9 0.09 U4 22.1 3.2 2.6 0.1 22.0 2.8 0.9 0.1 U1 3.3 0.3 4.8 0.9 3.3 0.3 2.1 0.5 U2 2.8 0.3 3.1 0.6 2.6 0.3 1.9 0.7 seconds) for model selection using Theorem 4 vs. 5-fold CV for 10 runs over the entire range of parameters as well as the respective average test errors. In each experimental run the data is partitioned as in Table 1. We use a desktop with 12 core Intel Xeon @3.5 Ghz and 32 GB RAM. The bound-based model selection is 2 4 faster than 5-fold CV and provides similar test errors. The advantage offered by Theorem 4 is even more pronounced against l.o.o. For instance, comparison with l.o.o for GTSRB dataset showed 100 improvement in speed using Theorem 4 with similar test accuracies (see Appendix B.5). Additional l.o.o results could not be reported owing to its prohibitively slow speed. 5 Conclusions This paper proposes a new formulation for multiclass SVM (MU-SVM). MU-SVM is shown to incur lower sample-complexity for PAC learnability compared to M-SVM by deriving Natarajan dimension. Further, the proposed MU-SVM embodies several useful mathematical properties amenable for: a) its efficient implementation using existing M-SVM solvers, and b) deriving practical analytic bounds that can perform model selection. We empirically show the effectiveness of the formulation as well as the bound. Insights into the workings of MU-SVM using HOP visualization is also provided. Acknowledgments We thank the anonymous reviewers for their comments which helped improve the quality of the paper. [1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www. deeplearningbook.org. [2] V. Vapnik, Estimation of Dependences Based on Empirical Data (Information Science and Statistics). Springer, Mar. 2006. [3] F. Sinz, O. Chapelle, A. Agarwal, and B. Schölkopf, An analysis of inference with the universum, in Advances in neural information processing systems 20. NY, USA: Curran, Sep. 2008, pp. 1369 1376. [4] S. Dhar and V. Cherkassky, Development and evaluation of cost-sensitive universum-svm, Cybernetics, IEEE Transactions on, vol. 45, no. 4, pp. 806 818, 2015. [5] S. Lu and L. Tong, Weighted twin support vector machine with universum, Advances in Computer Science: an International Journal, vol. 3, no. 2, pp. 17 23, 2014. [6] Z. Qi, Y. Tian, and Y. Shi, A nonparallel support vector machine for a classification problem with universum learning, Journal of Computational and Applied Mathematics, vol. 263, pp. 288 298, 2014. [7] C. Shen, P. Wang, F. Shen, and H. Wang, Uboost: Boosting with the universum, Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 34, no. 4, pp. 825 832, 2012. [8] Z. Wang, Y. Zhu, W. Liu, Z. Chen, and D. Gao, Multi-view learning with universum, Knowledge-Based Systems, vol. 70, pp. 376 391, 2014. [9] D. Zhang, J. Wang, F. Wang, and C. Zhang, Semi-supervised classification with universum. in SDM. SIAM, 2008, pp. 323 333. [10] Y. Xu, M. Chen, Z. Yang, and G. Li, ν-twin support vector machine with universum data for classification, Applied Intelligence, vol. 44, no. 4, pp. 956 968, 2016. [11] C. Zhu, Improved multi-kernel classification machine with nyström approximation technique and universum data, Neurocomputing, vol. 175, pp. 610 634, 2016. [12] F. Sinz, A priori knowledge from non-examples, Ph.D. dissertation, Mar 2007. [13] S. Chen and C. Zhang, Selecting informative universum sample for semi-supervised learning. in IJCAI, 2009, pp. 1016 1021. [14] X. Zhang and Y. Le Cun, Universum prescription: Regularization using unlabeled data. in AAAI, 2017, pp. 2907 2913. [15] K. Crammer and Y. Singer, On the learnability and design of output codes for multiclass problems, Machine learning, vol. 47, no. 2-3, pp. 201 233, 2002. [16] J. Weston, R. Collobert, F. Sinz, L. Bottou, and V. Vapnik, Inference with the universum, in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 1009 1016. [17] V. Vapnik and O. Chapelle, Bounds on error expectation for support vector machines, Neural computation, vol. 12, no. 9, pp. 2013 2036, 2000. [18] J. Weston, C. Watkins et al., Support vector machines for multi-class pattern recognition. in Esann, vol. 99, 1999, pp. 219 224. [19] Y. Lei, U. Dogan, A. Binder, and M. Kloft, Multi-class svms: From tighter data-dependent generalization bounds to novel algorithms, in Advances in Neural Information Processing Systems 28, 2015. [20] S. Szedmak, J. Shawe-Taylor et al., Learning via linear operators: Maximum margin regression, in In Proceedings of 2001 IEEE International Conference on Data Mining, 2005. [21] Y. Lee, Y. Lin, and G. Wahba, Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data, Journal of the American Statistical Association, vol. 99, no. 465, pp. 67 81, 2004. [22] E. J. Bredensteiner and K. P. Bennett, Multicategory classification by support vector machines, in Computational Optimization. Springer, 1999, pp. 53 79. [23] Y. Guermeur and E. Monfrini, A quadratic loss multi-class svm for which a radius margin bound applies, Informatica, vol. 22, no. 1, pp. 73 96, 2011. [24] C. Hsu and C. Lin, A comparison of methods for multiclass support vector machines, Neural Networks, IEEE Transactions on, vol. 13, no. 2, pp. 415 425, 2002. [25] A. Daniely et al., Multiclass learning approaches: A theoretical comparison with implications, in NIPS, 2012. [26] S. Shalev-Shwartz and S. Ben-David, Understanding machine learning: From theory to algorithms. Cambridge university press, 2014. [27] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of machine learning. MIT press, 2018. [28] K. Musayeva, F. Lauer, and Y. Guermeur, Rademacher complexity and generalization performance of multi-category margin classifiers, Neurocomputing, vol. 342, pp. 6 15, 2019. [29] A. Daniely and S. Shalev-Shwartz, Optimal learners for multiclass problems, in Proceedings of The 27th Conference on Learning Theory, ser. Proceedings of Machine Learning Research, vol. 35, 2014. [30] Y. Lei, Ü. Dogan, D. Zhou, and M. Kloft, Generalization error bounds for extreme multi-class classification, Co RR, abs/1706.09814, 2017. [31] B. K. Natarajan, On learning sets and functions, Machine Learning, vol. 4, no. 1, pp. 67 97, 1989. [32] A. Daniely, S. Sabato, S. Ben-David, and S. Shalev-Shwartz, Multiclass learnability and the erm principle, The Journal of Machine Learning Research, vol. 16, no. 1, pp. 2377 2404, 2015. [33] F. Lauer and Y. Guermeur, MSVMpack: a multi-class support vector machine package, Journal of Machine Learning Research, vol. 12, pp. 2269 2272, 2011, http://www.loria.fr/~lauer/MSVMpack. [34] libsvmtools, https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/, accessed: 2019-05-17. [35] N. Japkowicz and M. Shah, Evaluating learning algorithms: a classification perspective. Cambridge University Press, 2011. [36] A. Luntz, On estimation of characters obtained in statistical procedure of recognition, Technicheskaya Kibernetica, 1969. [37] R. Bonidal, Sélection de modèle par chemin de régularisation pour les machines à vecteurs support à coût quadratique. Ph.D. dissertation, June 2013. [38] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel, Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition, Neural Networks, pp. , 2012. [39] M. Fanty and R. Cole, Spoken letter recognition, in Advances in Neural Information Processing Systems, 1991, pp. 220 226. [40] V. Cherkassky, S. Dhar, and W. Dai, Practical conditions for effectiveness of the universum learning, Neural Networks, IEEE Transactions on, vol. 22, no. 8, pp. 1241 1255, 2011.