K-hyperplane Hinge-Minimax Classifier

Margarita Osadchy  RITA@CS.HAIFA.AC.IL
Tamir Hazan  TAMIR@CS.HAIFA.AC.IL
Daniel Keren  DKEREN@CS.HAIFA.AC.IL
Department of Computer Science, University of Haifa, Mount Carmel, Haifa 31905, Israel

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

Abstract

We explore a novel approach to upper bounding the misclassification error for problems in which the data comprise a small number of positive samples and a large number of negative samples. We use the hinge loss to upper bound the misclassification error of the positive examples, and the minimax risk to upper bound the misclassification error with respect to the worst-case distribution that generates the negative examples. This approach is computationally appealing because the majority of training examples (those of the negative class) are represented only by the statistics of their distribution, in contrast to kernel SVM, which produces a very large number of support vectors in such settings. We derive empirical risk bounds for linear and non-linear classification and show that they are independent of the dimension and decay as $1/\sqrt{m}$ for $m$ samples. We propose an efficient algorithm for training an intersection of a finite number of hyperplanes and demonstrate its effectiveness on real data, including letter and scene recognition.

1. Introduction

Linear classifiers are the cornerstone of several applications in machine learning. The generalization ability of linear classifiers has long been studied in the context of support vector machines (SVMs), e.g., using the VC dimension, covering numbers, and Rademacher complexity (Vapnik, 2000; Zhang, 2002; Bartlett & Mendelson, 2003; Bousquet et al., 2004; Kakade et al., 2009). SVMs upper bound the misclassification loss of the linear classifier by the hinge loss. This setting is computationally appealing when there is a fairly small number of support vectors. Alternatively, the minimax risk upper bounds over the distributions that could have generated the instance-label examples in the world (Lanckriet et al., 2003; Honorio & Jaakkola, 2014). This approach is computationally appealing when there are (infinitely) many training examples, since it uses only their statistical properties, such as the mean and covariance.

In our work we consider real-life data that consist of a small number of positive data points and a large number of negative data points, a setting that is prominent in machine learning applications, e.g., in computer vision (object detection) and in security (fraud or malicious-attack detection). We suggest combining the hinge risk with the minimax risk to enjoy the best of both worlds. The idea of combining the minimax risk for the negative class with an SVM-like formulation for the positive samples was introduced in (Osadchy et al., 2012) for training a single hyperplane; no generalization bounds were shown in that work. Here we derive an empirical mixed-risk bound that uses Rademacher complexities to bound the hinge risk and vector Bernstein inequalities to bound the minimax risk. Recently, (Honorio & Jaakkola, 2014) derived a generalization bound for the minimax risk using a PAC-Bayesian approach, a setting that bounds the expected loss with respect to a posterior distribution over all possible classifiers.
Our work differs in that we use stronger assumptions, namely that the norm of the data points is bounded, an assumption that is natural in many applications. This allows us to avoid the variance term that appears in the expected loss of PAC-Bayesian bounds.

Our work mainly considers non-linear classifiers. Non-linear classifiers are usually obtained by applying kernel methods. Unfortunately, the computational complexity of kernel methods increases linearly with the size of the training sample (Steinwart, 2003). Instead, we introduce non-linearity by using an intersection of hyperplanes (Klivans & Servedio, 2004; Arriaga & Vempala, 1999). Unfortunately, the existing algorithms for learning intersections of hyperplanes are computationally costly when the set of negative data points is large. To deal with this difficulty we extend the minimax risk to intersections of hyperplanes over (infinitely many) negative examples. Since this risk bound is loose when there are few data points, as is the case for the positive examples, we apply the sum of hinge losses to the positive examples. We also derive a generalization bound that mixes these two risks in the intersection-of-hyperplanes setting. Our Rademacher complexity bound for classification by an intersection of hyperplanes extends the contraction lemma (cf. (Bartlett & Mendelson, 2003)) to vectors. This may be used to simplify recent multi-class generalization bounds based on Rademacher complexity (Cortes et al., 2013).

We propose an algorithm for training an intersection of hyperplanes that efficiently minimizes the minimax-hinge risk. We show empirically, on two real data sets with very different characteristics, that this algorithm substantially improves over linear classifiers; further, it is on par with the classification rate of ensemble methods (comprising more than 100 simple classifiers, compared to 2-4 hyperplanes in the hinge-minimax classifier), and it even approaches the classification performance of kernel SVM while being orders of magnitude faster.

2. Background

Let $(x, y) \sim D$, where $x \in \mathbb{R}^d$ and $y \in \{-1, 1\}$. Let $y_w(x) = \mathrm{sign}(w^T x)$ denote a linear classifier. For simplicity we assume that $b = 0$ (or that it is absorbed by $w$). The zero-one risk of $y_w(x)$ is defined as
$$L^{0/1}_D(w) = E_D\big[\mathbf{1}[y_w(x) \ne y]\big].$$
The zero-one loss is non-convex. In SVMs the zero-one loss is upper bounded by the hinge loss: $\mathbf{1}[y_w(x) \ne y] \le \max\{0, 1 - y\, w^T x\}$. Thus the hinge risk upper bounds the zero-one risk:
$$L^{0/1}_D(w) \;\le\; E_D\big[\max\{0, 1 - y\, w^T x\}\big] \;\triangleq\; L^H_D(w).$$
Alternatively, one can upper bound the zero-one risk by the minimax risk (Lanckriet et al., 2003; Honorio & Jaakkola, 2014). There, instead of upper bounding the zero-one loss function, one takes the supremum over all distributions that agree with the data's mean and covariance. Denote $\mu = E_{(x,y)\sim D}[yx]$ and $\Sigma = E_{(x,y)\sim D}[(yx - \mu)(yx - \mu)^T]$, and denote by $Z(\mu, \Sigma)$ the set of all distributions with mean $\mu$ and covariance $\Sigma$. Then
$$L^{0/1}_D(w) \;\le\; \sup_{yx \sim Z(\mu,\Sigma)} \Pr(w^T yx \le 0) \;\triangleq\; L^M_{\mu,\Sigma}(w).$$
It was shown in (Lanckriet et al., 2003; Honorio & Jaakkola, 2014) that
$$\sup_{z \sim Z(\mu,\Sigma)} \Pr(w^T z \le 0) = \frac{1}{1 + (w^T \mu)^2 / (w^T \Sigma w)}.$$

In our work we consider non-linear classification with $K$ hyperplanes. Let $w_i$, $i = 1, \dots, K$, denote $K$ hyperplanes and let $W$ be the matrix with $w_i$ as its $i$-th column. We define a non-linear classifier $f_W(x)$ as the intersection of these $K$ hyperplanes:
$$f_W(x) = \begin{cases} 1 & \text{if } W^T x > 0 \\ -1 & \text{otherwise} \end{cases} \qquad (1)$$
where $0$ denotes the vector of zeros and the inequality is taken coordinate-wise.
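To make the quantities above concrete, here is a minimal sketch (ours, not the paper's code; plain NumPy, with variable names of our choosing) that evaluates the single-hyperplane minimax bound $1/(1 + (w^T\mu)^2/(w^T\Sigma w))$ and the intersection classifier $f_W$ of eq. (1).

```python
import numpy as np

def minimax_bound(w, mu, Sigma):
    """The formula 1 / (1 + (w^T mu)^2 / (w^T Sigma w)) from Section 2:
    worst-case error over all distributions with mean mu and covariance Sigma."""
    margin_sq = float(w @ mu) ** 2
    variance = float(w @ Sigma @ w)
    return 1.0 / (1.0 + margin_sq / variance)

def f_W(X, W):
    """Intersection-of-hyperplanes classifier of eq. (1):
    +1 iff every column w_i of W satisfies w_i^T x > 0."""
    return np.where((X @ W > 0).all(axis=1), 1, -1)

# Toy usage with made-up statistics (illustrative values only).
mu = np.array([-1.0, -0.5])                  # mean of the negative class
Sigma = np.array([[1.0, 0.2], [0.2, 0.5]])   # its covariance
w = np.array([1.0, 0.3])                     # a candidate hyperplane
print(minimax_bound(w, mu, Sigma))
```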
3. Minimax-Hinge Risk

We are interested in a classification problem in which the positive class corresponds to a single concept and the negative class is its complement. We assume that the positive sample is relatively small while the negative sample is very large (it can also be represented by unlabeled data and is thus easy to collect). When the sample is small, the regularized hinge loss has been shown to be very effective. The minimax bound is tight for the Gaussian distribution, and it becomes tight as the sample size approaches infinity. Due to these specifics of the problem we propose to use a mixed risk: the hinge risk for the positive class and the minimax risk for the negative class. Next, we formulate the mixed risk for the non-linear classifier of eq. (1):
$$L^{MH}_D(W) = L^{H,1}_D(W) + L^{M,-1}_{\mu(D_{neg}),\Sigma(D_{neg})}(W) \qquad (2)$$
where $W$ is the matrix of $K$ hyperplanes, $D$ is the joint distribution of samples and labels, $D_{neg}$ is the marginal distribution of samples with negative labels, and $\mu(D_{neg})$ and $\Sigma(D_{neg})$ are its mean and covariance, respectively. $L^{H,1}_D$ and $L^{M,-1}_{\mu(D_{neg}),\Sigma(D_{neg})}$ are defined below:
$$L^{H,1}_D(W) = E_D\big[L^H(W, x, y)\,\mathbf{1}[y = 1]\big], \qquad \text{where} \qquad L^H(W, x, 1) = \sum_i \max\{0,\, 1 - w_i^T x\}.$$

3.1. The Expected Risk of the Negative Class

Since $L^{M,-1}_{\mu(D_{neg}),\Sigma(D_{neg})}(W)$ is the risk of a negative sample falling into the intersection of $K$ half-spaces, which is a convex set, we can use the theorem of Marshall and Olkin (Marshall & Olkin, 1960) to derive the risk of the negative class.

Theorem 1. Let $Z(\mu, \Sigma)$ be the set of all distributions with known mean $\mu$ and covariance $\Sigma$.¹ For $K$ fixed hyperplanes $w_i$ ($i = 1, \dots, K$) we have
$$\sup_{z \sim Z(\mu,\Sigma)} \Pr(W^T z > 0) = \frac{1}{1 + d^2}, \qquad d^2 = \mu^T \tilde W (\tilde W^T \Sigma \tilde W)^{-1} \tilde W^T \mu,$$
where $\tilde W$ is the matrix whose columns are those $w_i$ that satisfy $w_i^T z^* = 0$, and $z^* = \arg\min_{W^T z \ge 0} (z - \mu)^T \Sigma^{-1} (z - \mu)$.

¹ For notational simplicity we drop $D_{neg}$ and denote $\mu = \mu(D_{neg})$ and $\Sigma = \Sigma(D_{neg})$.

Proof. Following the result of Marshall and Olkin (Marshall & Olkin, 1960) for a convex set, we obtain
$$\sup_{z \sim Z(\mu,\Sigma)} \Pr(W^T z \ge 0) = \frac{1}{1 + d^2}, \qquad d^2 = \inf_{W^T z \ge 0} (z - \mu)^T \Sigma^{-1} (z - \mu).$$
Next we derive a closed-form expression for $d^2$. We seek the solution of the primal problem
$$\min_z\; (z - \mu)^T \Sigma^{-1} (z - \mu) \quad \text{s.t.} \quad w_i^T z \ge 0, \ \ i = 1, \dots, K.$$
We construct the Lagrangian
$$L(z, \lambda) = (z - \mu)^T \Sigma^{-1} (z - \mu) + \sum_i \lambda_i w_i^T z, \qquad \lambda_i \ge 0.$$
The optimality condition $\nabla_z L = 2\Sigma^{-1} z - 2\Sigma^{-1} \mu + \sum_i \lambda_i w_i = 0$ gives
$$z^* = \mu - \tfrac{1}{2} \sum_i \lambda_i \Sigma w_i.$$
The Lagrange dual function is
$$L(z^*, \lambda) = \Big(\tfrac{1}{2}\sum_i \lambda_i \Sigma w_i\Big)^{\!T} \Sigma^{-1} \Big(\tfrac{1}{2}\sum_j \lambda_j \Sigma w_j\Big) + \sum_i \lambda_i w_i^T \Big(\mu - \tfrac{1}{2}\sum_j \lambda_j \Sigma w_j\Big). \qquad (3)$$
The optimality conditions are $-\tfrac{1}{2}\sum_i \lambda_i w_k^T \Sigma w_i + w_k^T \mu = 0$ for every $k$ such that $\lambda_k > 0$. The dual is therefore maximized at
$$\tilde\lambda = 2\,(\tilde W^T \Sigma \tilde W)^{-1} \tilde W^T \mu, \qquad (4)$$
where $\tilde W$ is formed by the subset of columns of $W$ for which $\lambda_k > 0$, and thus $w_k^T z^* = 0$. For the last step we substitute the optimal $\tilde\lambda$ of eq. (4) into the dual function (3); after simple algebraic manipulations we obtain
$$d^2 = \max_{\lambda \ge 0} L(z^*, \lambda) = \mu^T \tilde W (\tilde W^T \Sigma \tilde W)^{-1} \tilde W^T \mu. \qquad \square$$
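As a sanity check on Theorem 1, the closed form $d^2 = \mu^T\tilde W(\tilde W^T\Sigma\tilde W)^{-1}\tilde W^T\mu$ can be compared against a direct numerical solution of the primal problem. The sketch below is ours, not from the paper; it assumes SciPy's SLSQP solver and uses arbitrary illustrative data. It minimizes the Mahalanobis distance subject to $W^T z \ge 0$ and recovers $d^2$, and hence the bound $1/(1 + d^2)$.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.linalg import inv

rng = np.random.default_rng(1)
d, K = 4, 2
mu = -np.abs(rng.normal(size=d))            # negative-class mean (illustrative)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)                 # a positive-definite covariance
W = rng.normal(size=(d, K))                 # K candidate hyperplanes
Sigma_inv = inv(Sigma)

# Primal of Theorem 1: squared Mahalanobis distance from mu to {z : W^T z >= 0}.
obj = lambda z: (z - mu) @ Sigma_inv @ (z - mu)
cons = [{"type": "ineq", "fun": lambda z, i=i: W[:, i] @ z} for i in range(K)]
res = minimize(obj, x0=np.ones(d), constraints=cons)
d2_numeric = res.fun

# Closed form of Theorem 1, restricted to the active constraints (w_i^T z* ~ 0).
active = [i for i in range(K) if abs(W[:, i] @ res.x) < 1e-6]
Wt = W[:, active] if active else None
d2_closed = 0.0 if Wt is None else mu @ Wt @ inv(Wt.T @ Sigma @ Wt) @ Wt.T @ mu

print(d2_numeric, d2_closed, 1.0 / (1.0 + d2_closed))
```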
4. Generalization Bounds

In the following we bound the risk by its finite-sample estimate. We show that the discrepancy between the risk $L^{MH}_D(W)$ and its empirical estimate $L^{MH}_S(W)$ decays at a rate of $c/\sqrt{m}$, where the constant depends on the confidence $\delta$ over the draws of the training data and $m$ is the size of the training set. The main difficulty arises from mixing the hinge risk for the positive examples with the minimax risk for the negative examples. We therefore split the risk into its positive and negative parts and derive a uniform generalization bound for each.

4.1. Uniform generalization bound for the empirical minimax risk

The minimax risk upper bounds the zero-one risk over the negative examples. To derive a uniform generalization bound for this setting we bound each of its components differently, using a finite sample of training examples. In view of eq. (2), we bound the relative size of the negative set by its empirical average, and we bound the minimax risk itself by its training-sample estimate. For clarity we begin by deriving a generalization bound for a single hyperplane, followed by a generalization bound for an intersection of hyperplanes.

Recall that $D_{neg}$ is the distribution of the negative data points. We abbreviate its mean and covariance by $\mu \equiv \mu(D_{neg})$ and $\Sigma \equiv \Sigma(D_{neg})$, respectively. Let $\hat\mu$ and $\hat\Sigma$ be the mean and covariance estimated from the training points with negative labels. The generalization of the minimax risk $L^{M,-1}_{\mu,\Sigma}(w)$ is governed by the discrepancy
$$\Delta = \frac{1}{1 + (w^T\mu)^2 / (w^T\Sigma w)} - \frac{1}{1 + (w^T\hat\mu)^2 / (w^T\hat\Sigma w)}.$$
Some algebraic manipulation yields a simpler form:
$$\Delta = \frac{w^T\Sigma w\,(w^T\hat\mu)^2 - w^T\hat\Sigma w\,(w^T\mu)^2}{\big(w^T\Sigma w + (w^T\mu)^2\big)\big(w^T\hat\Sigma w + (w^T\hat\mu)^2\big)}. \qquad (5)$$
To obtain a uniform generalization bound for the minimax risk, we show that this discrepancy decreases as the size of the training sample increases. We therefore express it in terms of $\hat\mu - \mu$ and $\hat\Sigma - \Sigma$, which shrink with the training sample. Adding and subtracting $(w^T\mu)^2\, w^T\Sigma w$ in the numerator gives
$$\Delta = \frac{w^T\Sigma w\,\big((w^T\hat\mu)^2 - (w^T\mu)^2\big) + (w^T\mu)^2\, w^T(\Sigma - \hat\Sigma) w}{\big(w^T\Sigma w + (w^T\mu)^2\big)\big(w^T\hat\Sigma w + (w^T\hat\mu)^2\big)} \;\triangleq\; \Delta_1 + \Delta_2.$$
The sampled quantity $\|\hat\mu - \mu\|$ diminishes with high probability as the training size increases. This controls the first term of the minimax discrepancy, as described in the following lemma.

Lemma 1. Let $D_{neg}$ be a distribution over data points $x$ with negative labels such that $\|x\| \le 1$ holds with probability 1. Denote by $\mu$ its mean and by $\Sigma$ its covariance. Let $S_{neg}$ be a training sample of size $\hat m$, let $\hat\mu = \frac{1}{\hat m}\sum_{x \in S_{neg}} x$ be its sample mean and $\hat\Sigma = \frac{1}{\hat m}\sum_{x \in S_{neg}} (x - \hat\mu)(x - \hat\mu)^T$ its sample covariance. Define
$$\Delta_1 = \frac{w^T\Sigma w\,\big((w^T\hat\mu)^2 - (w^T\mu)^2\big)}{\big(w^T\Sigma w + (w^T\mu)^2\big)\big(w^T\hat\Sigma w + (w^T\hat\mu)^2\big)}.$$
Assume that the minimal eigenvalue of $\Sigma$ and of $\hat\Sigma$ is lower bounded by $\alpha$. Then, with probability at least $1 - \delta$ over the draws of the training set $S_{neg}$, the following holds uniformly for all $w$:
$$\Delta_1 \;\le\; \frac{2}{\alpha}\sqrt{\frac{32\,\big(\log(1/\delta) + 1/4\big)}{\hat m}}.$$

Proof. First we upper bound $\Delta_1$ by decreasing the denominator, omitting $(w^T\mu)^2$ and $(w^T\hat\mu)^2$. Using the identity $a^2 - b^2 = (a+b)(a-b)$ with $a = w^T\hat\mu$ and $b = w^T\mu$ we obtain
$$\Delta_1 \le \frac{w^T(\hat\mu - \mu)\; w^T(\hat\mu + \mu)}{w^T\hat\Sigma w}.$$
Next we apply the Cauchy-Schwarz inequality $|a^T b| \le \|a\|\,\|b\|$ to the numerator and the lower bound $w^T\hat\Sigma w \ge \alpha\|w\|^2$ to the denominator:
$$\Delta_1 \le \frac{1}{\alpha}\,\|\hat\mu - \mu\|\,\|\hat\mu + \mu\|.$$
Finally, since $\|x\| \le 1$ we have $\|\hat\mu + \mu\| \le 2$. Moreover, the Bernstein inequality for vectors (cf. (Gross, 2011), Theorem 11; (Candes & Plan, 2011), Theorem 2.6) implies
$$P\big[\|\hat\mu - \mu\| \ge t\big] \le \exp\big(-\hat m t^2/32 + 1/4\big)$$
for any $t \le 2\sqrt{\hat m}$. The result follows by setting $\delta = \exp(-\hat m t^2/32 + 1/4)$, or equivalently $t = \sqrt{32(\log(1/\delta) + 1/4)/\hat m}$. $\square$
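The $1/\sqrt{\hat m}$ behaviour in Lemma 1 is easy to see numerically. The sketch below is ours (plain NumPy, with an arbitrary bounded stand-in for the negative-class distribution); it estimates $E\|\hat\mu - \mu\|$ for growing $\hat m$ and shows it shrinking roughly like $1/\sqrt{\hat m}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10

def sample_negatives(n):
    """Bounded negative-class points (||x|| <= 1), as assumed in Lemma 1."""
    x = rng.normal(size=(n, d))
    x /= np.maximum(1.0, np.linalg.norm(x, axis=1, keepdims=True))  # project into unit ball
    return x

# Approximate the population mean with a very large sample.
mu = sample_negatives(200_000).mean(axis=0)

for m_hat in (100, 400, 1600, 6400):
    errs = [np.linalg.norm(sample_negatives(m_hat).mean(axis=0) - mu)
            for _ in range(50)]
    # The last column stays roughly constant, i.e. the error decays like 1/sqrt(m_hat).
    print(m_hat, np.mean(errs), np.mean(errs) * np.sqrt(m_hat))
```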
We now turn to the second term of the minimax discrepancy in eq. (5). The quantity $\|\hat\Sigma - \Sigma\|$ diminishes with high probability as the training size increases.

Lemma 2. Under the conditions of Lemma 1, define
$$\Delta_2 = \frac{(w^T\mu)^2\, w^T(\Sigma - \hat\Sigma) w}{\big(w^T\Sigma w + (w^T\mu)^2\big)\big(w^T\hat\Sigma w + (w^T\hat\mu)^2\big)}.$$
Assume that the minimal eigenvalues of $\Sigma$ and $\hat\Sigma$ are lower bounded by $\alpha$. Then, with probability at least $1 - \delta$ over the draws of the training set $S_{neg}$, the following holds uniformly for all $w$:
$$\Delta_2 \;\le\; \frac{1}{\alpha^2}\sqrt{\frac{128\,\big(\log(1/\delta) + 1/4\big)}{\hat m}}.$$

Proof. First we lower bound the denominator by $\alpha^2\|w\|^4$ (omitting $(w^T\mu)^2$ and $(w^T\hat\mu)^2$), thus upper bounding $\Delta_2$. Noting that $w^T A w = a^T b$, where $a$ is the vectorization of $A$ and $b$ is the vectorization of $ww^T$, we use the Cauchy-Schwarz inequality to upper bound the numerator by $\|\mu\|^2 \|w\|^2\, \|a\|\,\|b\|$. Since $\|b\| = \|w\|^2$ and $\|\mu\| \le 1$, we obtain the bound $\Delta_2 \le \|\Sigma - \hat\Sigma\| / \alpha^2$, where the norm of $\Sigma - \hat\Sigma$ is taken as the norm of its vectorized form. Using the Bernstein inequality for vectors we obtain $P[\|\Sigma - \hat\Sigma\| \ge t] \le \exp(-\hat m t^2/128 + 1/4)$ for any $t \le 4\sqrt{\hat m}$. The result follows by setting $\delta = \exp(-\hat m t^2/128 + 1/4)$, or equivalently $t = \sqrt{128(\log(1/\delta) + 1/4)/\hat m}$. $\square$

Bounds on the discrepancy between the minimax risk and its empirical estimate that hold uniformly over all $w$ guarantee generalization. The lemmas above show that the penalty for observing only a finite sample decreases as $1/\sqrt{\hat m}$. This is summarized in the following theorem.

Theorem 2. Suppose that $D$ is a distribution over $X \times Y$ with $Y = \{-1, +1\}$ and $X = \{x : \|x\| \le 1\}$. Let $L^{M,-1}_{\mu,\Sigma}(w)$ be the minimax risk over the negative labels, where $\mu, \Sigma$ are the mean and covariance of the marginal distribution of $x$ over the negative labels. Consider a training sample $S$ of size $m$, of which $\hat m$ points have a negative label, and let $L_{S_{neg}}(w)$ be the empirical minimax risk over the negative labels,
$$L_{S_{neg}}(w) = \frac{\hat m}{m} \sup_{z \sim Z(\hat\mu,\hat\Sigma)} P[w^T z \ge 0],$$
where $\hat\mu, \hat\Sigma$ are the empirical mean and covariance of the marginal distribution of $x$ over the negative training labels. Then, for any $\delta \in (0, 1]$, with probability at least $1 - 3\delta$ over an i.i.d. sample of size $m$,
$$L^{M,-1}_{\mu,\Sigma}(w) \;\le\; L_{S_{neg}}(w) + c_1\sqrt{\frac{\log(1/\delta) + 1/4}{\hat m}} + \sqrt{\frac{\log(1/\delta)}{2m}},$$
where $c_1 = \sqrt{128}\,(1/\alpha + 1/\alpha^2)$ and $\alpha$ is the minimal eigenvalue of $\Sigma$ and $\hat\Sigma$.

Proof. We estimate $\rho = E_{(x,y)\sim D}[\mathbf{1}[y = -1]]$ by its empirical mean $\hat m/m$. By the Hoeffding inequality, $P[\,|\hat m/m - \rho| \ge t\,] \le \exp(-2mt^2)$; setting $\delta = \exp(-2mt^2)$ yields the confidence interval $t = \sqrt{\log(1/\delta)/2m}$. Combining this with the lemmas above, with error probability $3\delta$ the minimax risk is upper bounded by
$$\Big(\frac{\hat m}{m} + \sqrt{\frac{\log(1/\delta)}{2m}}\Big)\Big(\frac{1}{1 + (w^T\hat\mu)^2/(w^T\hat\Sigma w)} + \frac{2}{\alpha}\sqrt{\frac{32(\log(1/\delta)+1/4)}{\hat m}} + \frac{1}{\alpha^2}\sqrt{\frac{128(\log(1/\delta)+1/4)}{\hat m}}\Big).$$
We conclude the proof using $\hat m/m \le 1$ and $\frac{1}{1 + (w^T\hat\mu)^2/(w^T\hat\Sigma w)} \le 1$. $\square$

The same type of bound holds for classification by any finite number of hyperplanes, i.e., $W^T x \ge 0$. We omit the derivation, which is tedious and follows the same steps as above.

Theorem 3. Consider the setting of Theorem 2 and let
$$L^{M,-1}_{\mu,\Sigma}(W) = E_{(x,y)\sim D}\big[\mathbf{1}[y = -1]\big]\, \sup_{z \sim Z(\mu,\Sigma)} P[W^T z \ge 0],$$
where $W$ is a matrix whose columns are $k$ different hyperplanes. Then, for any $\delta \in (0, 1]$, with probability at least $1 - \delta$ over an i.i.d. sample of size $m$, the following holds uniformly for all $W$:
$$L^{M,-1}_{\mu,\Sigma}(W) \;\le\; L^{M,-1}_{\hat\mu,\hat\Sigma}(W) + c\sqrt{\frac{\log(1/\delta) + 1/4}{\hat m}},$$
where $c = \sqrt{128k}\,(1/\alpha + 1/\alpha^2)$ and $\alpha$ is the minimal eigenvalue of $\Sigma$ and $\hat\Sigma$.

In our setting it is important to assume that the eigenvalues of $\Sigma$ are lower bounded by $\alpha$. This assumption implies that the eigenvalues of $\hat\Sigma$ are also bounded away from zero whenever $\hat m \gg d$. Indeed, by the Cauchy-Schwarz inequality, for any $w$ of unit norm, $|w^T\Sigma w - w^T\hat\Sigma w| \le \|\Sigma - \hat\Sigma\|$. Using the Bernstein inequality for vectors, $P[\|\Sigma - \hat\Sigma\| \ge t] \le \exp(-\hat m t^2/128 + 1/4)$, so with high probability (the failure probability decays exponentially with $\hat m$) we have, for any $w$ of unit norm, $w^T\hat\Sigma w \ge w^T\Sigma w - t$. Since the eigenvalues of $\Sigma$ are lower bounded by $\alpha$, i.e., $w^T\Sigma w \ge \alpha$, the eigenvalues of $\hat\Sigma$ are also bounded away from zero whenever $t < \alpha$.
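To make Theorem 2 concrete, the following sketch (ours, with illustrative values of $\alpha$, $\delta$, and the data; plain NumPy) evaluates the empirical minimax term $L_{S_{neg}}(w)$ and the finite-sample penalty it adds.

```python
import numpy as np

def empirical_minimax_risk(w, X_neg, m_total):
    """(hat m / m) times the worst-case Pr(w^T z >= 0) over distributions matching
    the empirical mean and covariance of the negative sample (as in Theorem 2)."""
    m_hat = X_neg.shape[0]
    mu_hat = X_neg.mean(axis=0)
    Sigma_hat = np.cov(X_neg, rowvar=False, bias=True)
    d2 = (w @ mu_hat) ** 2 / (w @ Sigma_hat @ w)
    return (m_hat / m_total) / (1.0 + d2), m_hat, Sigma_hat

def theorem2_penalty(m_hat, m_total, alpha, delta):
    """Finite-sample terms added to the empirical risk in Theorem 2."""
    c1 = np.sqrt(128.0) * (1.0 / alpha + 1.0 / alpha ** 2)
    return (c1 * np.sqrt((np.log(1.0 / delta) + 0.25) / m_hat)
            + np.sqrt(np.log(1.0 / delta) / (2.0 * m_total)))

# Illustrative usage with synthetic negatives.
rng = np.random.default_rng(0)
X_neg = rng.normal(loc=-0.3, scale=0.2, size=(5000, 5))
w = np.ones(5) / np.sqrt(5)
risk, m_hat, Sigma_hat = empirical_minimax_risk(w, X_neg, m_total=5200)
alpha = np.linalg.eigvalsh(Sigma_hat).min()      # plug-in eigenvalue floor
print(risk, theorem2_penalty(m_hat, 5200, alpha, delta=0.05))
```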
4.2. Uniform generalization bound for the empirical risk of the hinge loss

The hinge-loss risk upper bounds the zero-one risk over the positive examples. Using Rademacher complexities we derive a uniform generalization bound for classification with $k$ hyperplanes, $W^T x > 0$, i.e., $w_i^T x > 0$ for each $i = 1, \dots, k$. The Rademacher complexity of a bounded set $A \subset \mathbb{R}^m$ is
$$R(A) = \frac{1}{m}\, E_\sigma\Big[\max_{a \in A} \sum_{i=1}^m \sigma_i a_i\Big],$$
where the $\sigma_i \in \{-1, +1\}$ are i.i.d. and equally probable random variables. The set $A$ collects the losses of the predictors $W$ over a training sample $(x_1, y_1), \dots, (x_m, y_m)$, namely $A = \{(L(W^T x_j, y_j))_{j=1,\dots,m}\}$. Whenever the loss is Lipschitz with respect to the $k$ hyperplane predictions, the Rademacher complexity is bounded as follows.

Theorem 4. Consider a $k$-hyperplane loss function $L(W^T x, y)$ for which each hyperplane satisfies $\|w_i\| \le 1$ and each data point satisfies $\|x\| \le 1$. Assume that the loss is Lipschitz for every $y$, i.e., $|L(\alpha, y) - L(\beta, y)| \le \sum_{i=1}^k |\alpha_i - \beta_i|$. Then its Rademacher complexity is bounded by
$$R\big(\{L(W^T x_j, y_j)\}_{j=1,\dots,m}\big) \le \frac{k}{\sqrt m}.$$

Proof. First we prove a decomposition lemma (cf. (Kakade et al., 2009)) for the $k$-hyperplane setting:
$$m\,R\big(\{L(W^T x_j, y_j)\}_{j=1,\dots,m}\big) \;\le\; E_{\sigma_{i,j}}\Big[\max_{\|w_i\| \le 1} \sum_{i,j} \sigma_{i,j}\, w_i^T x_j\Big].$$
For notational convenience we prove it for $m = 1$; the general case follows by induction over $m$. By definition, $R(W, x_1, y_1) = E_\sigma[\max_W \sigma L(W, x_1, y_1)]$, and for brevity we denote this Rademacher complexity by $R$. Since $P[\sigma = 1] = P[\sigma = -1] = 0.5$,
$$R = 0.5 \max_W L(W, x_1, y_1) + 0.5 \max_W \big(-L(W, x_1, y_1)\big).$$
By duplicating the hyperplanes into $W, \hat W$ we can maximize both terms jointly:
$$R = 0.5 \max_{W, \hat W}\big(L(W, x_1, y_1) - L(\hat W, x_1, y_1)\big).$$
By the Lipschitz condition, $L(W, x_1, y_1) - L(\hat W, x_1, y_1) \le \sum_i |w_i^T x_1 - \hat w_i^T x_1|$. Since $w_i$ and $\hat w_i$ range over the same norm ball, for any fixed signs $\hat\sigma_i \in \{-1, 1\}$ the maximum is unchanged when the absolute value is replaced by $\hat\sigma_i(w_i^T x_1 - \hat w_i^T x_1)$:
$$R \le 0.5 \max_{W, \hat W} \sum_i \hat\sigma_i\big(w_i^T x_1 - \hat w_i^T x_1\big).$$
We may therefore take the expectation with respect to $\hat\sigma_i$ while maintaining the inequality. Separating the two maximizations, and noting that $\hat\sigma_i$ and $-\hat\sigma_i$ have the same distribution, we obtain
$$R \le E_{\hat\sigma}\Big[\max_W \sum_i \hat\sigma_i\, w_i^T x_1\Big].$$
For general $m$ we get by induction
$$m\,R\big(\{L(W^T x_j, y_j)\}\big) \le E_{\sigma_{i,j}}\Big[\max_{\|w_i\| \le 1} \sum_{i,j} \sigma_{i,j}\, w_i^T x_j\Big].$$
Standard Rademacher-type arguments, e.g. (Bartlett & Mendelson, 2003), now give
$$m\,R\big(\{L(W^T x_j, y_j)\}\big) \le \sum_{i=1}^k E_{\sigma}\Big\|\sum_{j=1}^m \sigma_{i,j}\, x_j\Big\| \le k\sqrt m,$$
which proves the theorem. $\square$

The theorem above generalizes the standard decomposition lemma (also known as the contraction lemma) to $k$ hyperplanes with any Lipschitz loss. In our setting we consider the sum of hinge losses over the positive examples,
$$L(W^T x, y) = \mathbf{1}[y = 1]\, \sum_{i=1}^k \max\{0,\, 1 - y\, w_i^T x\}.$$
Since this function satisfies the conditions of the theorem, we may use the standard Rademacher uniform generalization bound. Let $L^{H,1}_D(W) = E_{(x,y)\sim D}[L(W^T x, y)]$ be the risk, and let $L^{H,1}_S(W) = \frac{1}{m}\sum_{i=1}^m L(W^T x_i, y_i)$ be the empirical risk over a training sample of size $m$. Then, for any $\delta \in (0, 1]$, with probability at least $1 - \delta$ over an i.i.d. sample of size $m$, the following holds simultaneously for all $\|w_1\|, \dots, \|w_k\| \le 1$, whenever $\|x\| \le 1$:
$$L^{H,1}_D(W) \;\le\; L^{H,1}_S(W) + \frac{2k}{\sqrt m} + c\sqrt{\frac{\log(1/\delta)}{2m}},$$
where $c$ bounds the range of the loss ($c \le 2k$ under the assumptions above).
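The $k/\sqrt m$ rate of Theorem 4 can be checked with a small simulation. The sketch below is ours (NumPy only); it uses the fact that $\max_{\|w_i\|\le 1}\sum_j \sigma_{i,j}\, w_i^T x_j = \|\sum_j \sigma_{i,j}\, x_j\|$ to estimate the right-hand side of the decomposition bound and compares it with $k/\sqrt m$.

```python
import numpy as np

rng = np.random.default_rng(0)

def rademacher_estimate(X, k, trials=2000):
    """Monte Carlo estimate of (1/m) E_sigma[ max_{||w_i||<=1} sum_{i,j} sigma_ij w_i^T x_j ].
    For each hyperplane the inner maximum equals || sum_j sigma_ij x_j ||."""
    m = X.shape[0]
    total = 0.0
    for _ in range(trials):
        sigma = rng.choice([-1.0, 1.0], size=(k, m))
        total += np.linalg.norm(sigma @ X, axis=1).sum()   # sum_i || sum_j sigma_ij x_j ||
    return total / (trials * m)

d, k = 20, 3
for m in (100, 400, 1600):
    X = rng.normal(size=(m, d))
    X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # enforce ||x|| <= 1
    print(m, rademacher_estimate(X, k), k / np.sqrt(m))  # estimate vs. the k/sqrt(m) bound
```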
5. Algorithm

We aim to minimize the risk in eq. (2). To this end, we minimize the empirical risk regularized by the sum of squared $L_2$ norms of the $K$ hyperplanes:
$$\min_W\; \sum_i \|w_i\|^2 + L^{M,-1}_S(W) + L^{H,1}_S(W). \qquad (7)$$
This empirical risk is a non-convex and non-smooth function, hence gradient-based optimization is difficult. However, for a single hyperplane we can write an equivalent optimization problem:
$$\min_w\; \|w\|^2 + C\sum_i \max\{0,\, 1 - w^T x_i\} \quad \text{s.t.} \quad \gamma\sqrt{w^T \Sigma w} + w^T\mu \le 0, \qquad (8)$$
where the sum runs over the positive examples and $\gamma$ is a constant controlling the probability mass of the negative class in the positive half-space: taking $\gamma \ge \sqrt 2\, \mathrm{erf}^{-1}(1 - 2\delta)$ guarantees $\Pr(w^T z \ge 0) \le \delta$.² Since we seek to minimize this probability, we can assume that $\delta < 1/2$ and thus $\gamma > 0$. The constraint in problem (8) is convex for $\gamma > 0$, hence the entire problem (8) is convex.

² The supremum of the minimax risk is attained for the Gaussian distribution.

We propose an iterative procedure (Algorithm 1) that approximates the solution of eq. (7). It starts by training $K$ hyperplanes in a greedy manner and then iteratively adjusts each hyperplane to decrease $\Pr(W^T z \ge 0)$.

Algorithm 1 Intersection of K hyperplanes classifier
Input: $\{x_i\}$, $i = 1, \dots, N_p$, a set of positive examples; $\{z_i\}$, $i = 1, \dots, N_u$, a set of negative examples.
Output: $W$ ($K$ hyperplanes).
{The initial greedy step}
Estimate $\mu$ and $\Sigma$ using $\{z_i\}_{i=1}^{N_u}$.
Find $w_1$ using eq. (8) with $\mu$ and $\Sigma$.
for $k = 2$ to $K$ do
  Estimate $\mu_k$ and $\Sigma_k$ using $\{z_i \mid w_j^T z_i > 0,\ j = 1, \dots, k-1\}$.
  Find $w_k$ using eq. (8) with $\mu_k$ and $\Sigma_k$.
end for
{The refinement iterations}
Let $P_k$ be the probability $\Pr(W^T z \ge 0)$ in iteration $k$.
while $P_{k-1} - P_k > \epsilon$ do
  Estimate $\mu_k$ and $\Sigma_k$ using $\{z_i \mid w_j^T z_i > 0,\ j = 1, \dots, K,\ j \ne k\}$.
  Find $w_k$ using eq. (8) with $\mu_k$ and $\Sigma_k$.
end while

Lemma 3. Algorithm 1 decreases $\Pr(W^T z \ge 0)$ at each iteration.

Proof. We show the proof for two hyperplanes; the same argument holds for $K$ hyperplanes. Let $W = [w_1\, w_2]$. We can write
$$\Pr(W^T z \ge 0) = \Pr(w_1^T z \ge 0 \,\wedge\, w_2^T z \ge 0) = \Pr(w_1^T z \ge 0)\,\Pr(w_2^T z \ge 0 \mid w_1^T z \ge 0).$$
First, $w_1$ minimizes $\Pr(w_1^T z \ge 0)$. Second, the optimization in eq. (8) seeks a $w_2$ that minimizes $\Pr(w_2^T z \ge 0 \mid w_1^T z \ge 0)$. Let $w_1^i$ and $w_2^i$ be the two hyperplanes after $i$ iterations and let $\alpha = \Pr(w_1^{i\,T} z \ge 0 \,\wedge\, w_2^{i\,T} z \ge 0)$ be the current probability of a negative sample falling in the intersection. The algorithm seeks a hyperplane $w_1^{i+1}$ that minimizes $\Pr(w_1^{i+1\,T} z \ge 0 \mid w_2^{i\,T} z \ge 0)$; since $w_1^i$ is a feasible candidate for this minimization,
$$\Pr(w_1^{i+1\,T} z \ge 0 \,\wedge\, w_2^{i\,T} z \ge 0) = \Pr(w_1^{i+1\,T} z \ge 0 \mid w_2^{i\,T} z \ge 0)\,\Pr(w_2^{i\,T} z \ge 0) \le \alpha.$$
Therefore Algorithm 1 decreases $\Pr(W^T z \ge 0)$ at each iteration. $\square$

The empirical risk of the intersection of $K$ hyperplanes also contains the sum of hinge losses over the positive examples. In each iteration the algorithm minimizes the hinge loss of one hyperplane while keeping the rest fixed, so the hinge risk decreases at each iteration as well. Since the empirical risk is the sum of the hinge and minimax risks, it follows from Lemma 3 and the discussion above that Algorithm 1 decreases the empirical risk at each iteration and thus converges.
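The per-hyperplane problem (8) is a second-order-cone program, so any conic solver can handle it. Below is a minimal sketch of Algorithm 1. It is our own code, not the authors' implementation (the paper used CVX in Matlab); it assumes the cvxpy Python package, and the function names, the stopping tolerance, and the round limit are our choices. The greedy step and the refinement sweeps follow the pseudocode above, with the negative class entering only through the mean and covariance of the negatives that remain inside the current intersection.

```python
import numpy as np
import cvxpy as cp
from scipy.linalg import sqrtm

def solve_single_hyperplane(X_pos, mu, Sigma, C=1.0, gamma=1.0):
    """Sketch of the convex subproblem (8): hinge loss on the positives subject to
    the second-order-cone constraint gamma*sqrt(w' Sigma w) + w' mu <= 0."""
    d = X_pos.shape[1]
    w = cp.Variable(d)
    Sigma_half = np.real(sqrtm(Sigma))
    objective = cp.Minimize(cp.sum_squares(w) + C * cp.sum(cp.pos(1 - X_pos @ w)))
    constraints = [gamma * cp.norm(Sigma_half @ w, 2) + mu @ w <= 0]
    cp.Problem(objective, constraints).solve()
    return w.value

def fit_k_hyperplanes(X_pos, Z_neg, K, C=1.0, gamma=1.0, eps=1e-4, max_rounds=25):
    """Sketch of Algorithm 1: greedy initialization followed by refinement sweeps."""
    def stats(Z):
        return Z.mean(axis=0), np.cov(Z, rowvar=False)

    W = []
    for k in range(K):                       # greedy step
        inside = Z_neg if not W else Z_neg[(Z_neg @ np.column_stack(W) > 0).all(axis=1)]
        if len(inside) <= X_pos.shape[1]:    # too few negatives left in the intersection
            break
        mu, Sigma = stats(inside)
        W.append(solve_single_hyperplane(X_pos, mu, Sigma, C, gamma))
    W = np.column_stack(W)
    K = W.shape[1]

    def neg_error(W):                        # empirical Pr(W^T z > 0) on the negatives
        return float(((Z_neg @ W) > 0).all(axis=1).mean())

    prev = neg_error(W)
    for _ in range(max_rounds):              # refinement sweeps over the hyperplanes
        for k in range(K):
            others = np.delete(np.arange(K), k)
            inside = Z_neg[(Z_neg @ W[:, others] > 0).all(axis=1)]
            if len(inside) > W.shape[0]:
                mu, Sigma = stats(inside)
                W[:, k] = solve_single_hyperplane(X_pos, mu, Sigma, C, gamma)
        cur = neg_error(W)
        if prev - cur <= eps:
            break
        prev = cur
    return W

def predict(X, W):
    """Eq. (1): +1 iff all hyperplanes agree that x is positive."""
    return np.where((X @ W > 0).all(axis=1), 1, -1)
```

One design point worth noting: the constraint $\gamma\sqrt{w^T\Sigma w} + w^T\mu \le 0$ is encoded through a matrix square root of $\Sigma$, which keeps the subproblem in standard SOCP form.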
6. Experiments

To test the proposed K-hyperplane hinge-minimax classifier we ran experiments in three different scenarios: synthetic 2D data, letter recognition, and large-scale scene classification. At test time, the K-hyperplane classifier incurs only K times the computational cost of a linear classifier (just K inner products), hence its natural competitors are linear classifiers, and we chose linear SVM as the baseline. We also compared the hinge-minimax classifiers to kernel SVM and to ensemble-based methods, which incur far longer running times (this is especially true for kernel SVM). The classification rates of the hinge-minimax classifier in all our experiments were comparable to ensemble classifiers that required 100-170 basic classifiers to reach similar performance. In experiments with high-dimensional data, the hinge-minimax classifiers performed as well as kernel SVM. The SVM classifiers were trained using C-SVC in LIBSVM.³ We used the CVX optimization package⁴ to find a single hyperplane in Algorithm 1. The ensemble classifiers were trained using the Matlab Statistics toolbox.

³ http://www.csie.ntu.edu.tw/cjlin/libsvm/
⁴ http://cvxr.com/cvx/download/

6.1. Synthetic Data Example

We construct the hinge-minimax classifier for 2D data to illustrate Algorithm 1. We sampled 5000 data points from two highly overlapping Gaussians (see Figure 1) with a varying ratio of positive (shown in red) and negative (shown in blue) examples. Each class was equally partitioned into training, validation, and test sets. We estimated the mean and covariance from the training data and tuned the parameters (C and γ) and the bias using the validation set. Table 1 shows the AUC for the different ratios of positive and negative examples using an intersection of 5 hyperplanes. These results demonstrate the robustness of the algorithm to unbalanced sets.

Table 1. AUC for different size partitions of the positive and negative classes.
Positive fraction:  0.01   0.1    0.2    0.3    0.4    0.5
AUC:                94.68  94.91  95.07  94.96  94.89  95.83

The first five plots in Figure 1 show the result of the initial greedy step for the first, second, third, fourth, and fifth hyperplanes, respectively. The contour lines in Figure 1 illustrate the covariance of the negative distribution inside the intersection, which is used to find the optimal separating hyperplane, depicted in black. The last plot in Figure 1 shows the final classifier after 25 iterations. It illustrates that the approximation algorithm succeeds in separating the positive set from the background, and that the refinement iterations improve the separation boundary.

Figure 1. Illustration of the hinge-minimax classifier construction on a toy example. The first 5 plots show the greedy initial step. The last plot shows the final classifier after 25 iterations. The contour lines show the covariance matrix of the negative distribution inside the intersection of hyperplanes, which is used to find the optimal hyperplane, depicted in black.
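For readers following along in code, the 2D setup of this section is easy to mimic with the sketch from Section 5 (again ours, not the authors' implementation; the Gaussian parameters below are arbitrary stand-ins, not those used in the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_overlapping_gaussians(n_total=5000, pos_fraction=0.2):
    """Two highly overlapping 2D Gaussians; the parameters are illustrative only."""
    n_pos = int(n_total * pos_fraction)
    n_neg = n_total - n_pos
    X_pos = rng.multivariate_normal([1.0, 1.0], [[4.0, 1.0], [1.0, 3.0]], size=n_pos)
    Z_neg = rng.multivariate_normal([-1.0, -1.0], [[5.0, -1.0], [-1.0, 4.0]], size=n_neg)
    return X_pos, Z_neg

X_pos, Z_neg = make_overlapping_gaussians(pos_fraction=0.2)

# Fit an intersection of 5 hyperplanes with the Algorithm 1 sketch from Section 5,
# then score the training negatives; C and gamma would normally be tuned on a
# validation split, as described above.
# W = fit_k_hyperplanes(X_pos, Z_neg, K=5, C=1.0, gamma=1.0)
# print("train negatives inside the intersection:", (predict(Z_neg, W) == 1).mean())
```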
6.2. Letter Recognition

These tests were performed on the letter data set from the UCI Machine Learning Repository (Murphy & Aha, 1994), which provides 16-dimensional feature vectors for the 26 letters of the English alphabet. The letter images are based on 20 different fonts, and each letter within these fonts was randomly distorted to produce 20,000 samples. For each letter we used 100 samples for training, 250 for validation, and the rest for testing (about 400 samples per letter). The parameters of all methods were chosen using the validation set. Since the test set includes 25 times more negatives than positives, classifying all inputs as negative already yields about a 96% classification rate; we therefore used the classification rate at the equal error rate (EER) as a more faithful measure of performance.

Table 2 shows the classification rate at EER, averaged over the 26 letters, and the average classification times of the tested classifiers. The hinge-minimax classifier improves over linear SVM for all K, and for K > 1 it outperforms AdaBoost with a much shorter classification time. For this data set, kernel SVM outperforms all methods; however, the 4-hyperplane hinge-minimax classifier comes fairly close to the performance of the kernel SVM, while its classification time is three orders of magnitude shorter.

Table 2. Letter experiments. K is the number of hyperplanes used in the hinge-minimax classifier; times are in seconds; AdaBoost uses 100 decision trees.
Method                   Classification rate at EER   Classification time
hinge-minimax, K = 1     89.32                        5.6e-07
hinge-minimax, K = 2     92.98                        1.4e-06
hinge-minimax, K = 3     93.93                        1.5e-06
hinge-minimax, K = 4     94.48                        1.7e-06
Linear SVM               84.87                        4.6e-07
RBF kernel SVM           96.47                        1.7e-03
AdaBoost                 92.26                        1.0e-03

6.3. Large Scale Scene Recognition

In this test we used the 397 scene categories of the SUN database that have at least 100 images per category (Xiao et al., 2010). We represent the images as bags of words (BOW) of dense HOG features with 300 words. We downloaded the features from the SUN web page,⁵ which contains a spatial pyramid of BOWs, and used its bottom layer (the details of the feature extraction can be found in (Xiao et al., 2010)). The data is divided into 50 training and 50 test images per category, in 10 folds. Training one-against-all classifiers for 397 categories with 50 training samples per category involves very unbalanced training sets. We therefore assigned different weights to positive and negative samples in SVM training, and used RUSBoost (Seiffert et al., 2008) as the ensemble method (it is designed for skewed data and performed significantly better than AdaBoost on this data set). Note that the hinge-minimax classifier handles unbalanced sets naturally. A hinge-minimax classifier with more than two hyperplanes did not improve the performance. Table 3 shows the average AUC (the evaluation measure used in (Xiao et al., 2010)) of the tested methods and their classification times.

⁵ http://vision.cs.princeton.edu/projects/2010/SUN/

Table 3. Scene classification with 300-dimensional features. The classification time of the RBF kernel SVM is very high, since it selects about 15,000 support vectors from 19,850 training examples. RUSBoost uses 100 decision trees.
Method                   AUC     Classification time
hinge-minimax, K = 1     88.89   9.8e-05
hinge-minimax, K = 2     90.99   1.34e-04
Linear SVM               88.20   8.6e-05
RBF kernel SVM           90.77   23.97
RUSBoost                 90.76   0.08

7. Conclusions and Future Work

We proposed an efficient method for learning an intersection of a finite number of hyperplanes, which combines the hinge risk (for the small number of positive data points) with the minimax risk (for the large number of negative data points), and we derived a generalization bound for the mixed risk. We showed that the proposed classifier yields results comparable to popular non-linear classifiers, at a much lower (orders of magnitude) classification cost. Extending this approach to multi-class learning with K hyperplanes remains open. A one-vs-all heuristic is not directly applicable, since the K-hyperplane intersection yields a binary output with no score. One could use the result of Theorem 1 to find the distance to the intersection and use it as a score. Another direction is to combine structured-hinge and minimax risks.

Acknowledgments: This work has been supported by the Israel Science Foundation grant 839/12 and the European Union's Seventh Framework Programme FP7-ICT-2013-11 under grant agreement No 619435.

References

Arriaga, Rosa I and Vempala, Santosh. An algorithmic theory of learning: Robust concepts and random projection. In Foundations of Computer Science, 1999. 40th Annual Symposium on, pp. 616-623. IEEE, 1999.

Bartlett, Peter L and Mendelson, Shahar. Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3:463-482, 2003.

Bousquet, Olivier, Boucheron, Stéphane, and Lugosi, Gábor. Introduction to statistical learning theory. In Advanced Lectures on Machine Learning, pp. 169-207. Springer, 2004.

Candes, Emmanuel J and Plan, Yaniv. A probabilistic and RIPless theory of compressed sensing. Information Theory, IEEE Transactions on, 57(11):7235-7254, 2011.
Cortes, Corinna, Mohri, Mehryar, and Rostamizadeh, Afshin. Multi-class classification with maximum margin multiple kernel. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 46-54, 2013.

Gross, David. Recovering low-rank matrices from few coefficients in any basis. Information Theory, IEEE Transactions on, 57(3):1548-1566, 2011.

Honorio, Jean and Jaakkola, Tommi. Tight bounds for the expected risk of linear classifiers and PAC-Bayes finite-sample guarantees. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pp. 384-392, 2014.

Kakade, Sham M, Sridharan, Karthik, and Tewari, Ambuj. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pp. 793-800, 2009.

Klivans, Adam R and Servedio, Rocco A. Learning intersections of halfspaces with a margin. In Learning Theory, pp. 348-362. Springer, 2004.

Lanckriet, Gert R.G., Ghaoui, Laurent El, Bhattacharyya, Chiranjib, and Jordan, Michael I. A robust minimax approach to classification. J. Mach. Learn. Res., 3:555-582, 2003.

Marshall, Albert W. and Olkin, Ingram. Multivariate Chebyshev inequalities. Ann. Math. Statist., 31(4):1001-1014, 1960.

Murphy, P. and Aha, D. UCI repository of machine learning databases. Tech. rep., U. California, Dept. of Information and Computer Science, 1994.

Osadchy, M., Keren, D., and Fadida-Specktor, B. Hybrid classifiers for object classification with a rich background. In ECCV (5), pp. 284-297, 2012.

Seiffert, Chris, Khoshgoftaar, Taghi M., Hulse, Jason Van, and Napolitano, Amri. RUSBoost: Improving classification performance when training data is skewed. In ICPR, pp. 1-4, 2008.

Steinwart, Ingo. Sparseness of support vector machines. Journal of Machine Learning Research, 4:1071-1105, 2003.

Vapnik, Vladimir. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2000.

Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., and Torralba, A. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, pp. 3485-3492, 2010.

Zhang, Tong. Covering number bounds of certain regularized linear function classes. The Journal of Machine Learning Research, 2:527-550, 2002.