K-hyperplane Hinge-Minimax Classifier

Margarita Osadchy  RITA@CS.HAIFA.AC.IL
Tamir Hazan  TAMIR@CS.HAIFA.AC.IL
Daniel Keren  DKEREN@CS.HAIFA.AC.IL
Department of Computer Science, University of Haifa, Mount Carmel, Haifa 31905, Israel

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

Abstract

We explore a novel approach to upper bounding the misclassification error for problems in which the data comprise a small number of positive samples and a large number of negative samples. We use the hinge loss to upper bound the misclassification error of the positive examples, and the minimax risk to upper bound the misclassification error with respect to the worst-case distribution that generates the negative examples. This approach is computationally appealing because the majority of training examples (those of the negative class) are represented only by the statistics of their distribution, in contrast to kernel SVM, which produces a very large number of support vectors in such settings. We derive empirical risk bounds for linear and non-linear classification and show that they are independent of the dimension and decay as $1/\sqrt{m}$ for $m$ samples. We propose an efficient algorithm for training an intersection of a finite number of hyperplanes and demonstrate its effectiveness on real data, including letter and scene recognition.

1. Introduction

Linear classifiers are the cornerstone of several applications in machine learning. The generalization ability of linear classifiers has long been studied in the context of support vector machines (SVMs), e.g., using the VC dimension, covering numbers, and Rademacher complexity (Vapnik, 2000; Zhang, 2002; Bartlett & Mendelson, 2003; Bousquet et al., 2004; Kakade et al., 2009). SVMs upper bound the misclassification loss of the linear classifier by the hinge loss. This setting is computationally appealing when there is a fairly small number of support vectors. Alternatively, the minimax risk upper bounds over the distributions that could have generated the instance-label examples in the world (Lanckriet et al., 2003; Honorio & Jaakkola, 2014). This approach is computationally appealing when there are (infinitely) many training examples, since it uses only their statistical properties, such as the mean and covariance.

In our work we consider real-life data that consist of a small number of positive data points and a large number of negative data points, a setting that is prominent in machine learning applications, e.g., in computer vision (object detection) and in security (fraud or malicious-attack detection). We suggest combining the hinge risk with the minimax risk to enjoy the best of both worlds. The idea of combining the minimax risk for the negative class with an SVM-like formulation for the positive samples was introduced in (Osadchy et al., 2012) for training a single hyperplane; no generalization bounds were shown in that work. Here we derive an empirical mixed-risk bound that uses Rademacher complexities to bound the hinge risk and vector Bernstein inequalities to bound the minimax risk. Recently, (Honorio & Jaakkola, 2014) derived a generalization bound for the minimax risk using a PAC-Bayesian approach, a setting that bounds the expected loss with respect to a posterior distribution over all possible classifiers.
Our work differs in that we use stronger assumptions, namely that the norm of the data points is bounded, an assumption that is natural in many applications. This allows us to avoid the variance term that appears in the expected loss of PAC-Bayesian bounds.

Our work mainly considers non-linear classifiers. Non-linear classifiers are usually obtained by applying kernel methods. Unfortunately, the computational complexity of kernel methods increases linearly with the size of the training sample (Steinwart, 2003). Instead, we introduce non-linearity by using an intersection of hyperplanes (Klivans & Servedio, 2004; Arriaga & Vempala, 1999). Unfortunately, the existing algorithms for learning intersections of hyperplanes are computationally costly when the set of negative data points is large. To deal with this difficulty we extend the minimax risk to intersections of hyperplanes over (infinitely many) negative examples. Since this risk bound is loose when there are few data points, as is the case for the positive examples, we apply the sum of hinge losses to the positive examples. We also derive a generalization bound that mixes these two risks in the intersection-of-hyperplanes setting. Our Rademacher complexity bound for classification by an intersection of hyperplanes extends the contraction lemma (cf. (Bartlett & Mendelson, 2003)) to vectors. This may be used to simplify recent multi-class generalization bounds based on Rademacher complexity (Cortes et al., 2013).

We propose an algorithm for training an intersection of hyperplanes that efficiently minimizes the minimax-hinge risk. We show empirically, on two real data sets with very different characteristics, that this algorithm substantially improves over linear classifiers; further, it is on par with the classification rate of ensemble methods (comprising more than 100 simple classifiers, compared to 2-4 hyperplanes in the hinge-minimax classifier), and it even approaches the classification performance of kernel SVM while being orders of magnitude faster.

2. Background

Let $(x, y) \sim D$, where $x \in \mathbb{R}^d$ and $y \in \{-1, 1\}$. Let $y_w(x) = \mathrm{sign}(w^T x)$ denote a linear classifier. For simplicity we assume that $b = 0$ (or that it is absorbed by $w$). The zero-one risk of $y_w(x)$ is defined as
$$L^{0/1}_D(w) = E_D\big[\mathbf{1}[y_w(x) \ne y]\big].$$
The zero-one loss is non-convex. In SVMs the zero-one loss is upper bounded by the hinge loss: $\mathbf{1}[y_w(x) \ne y] \le \max\{0, 1 - y\, w^T x\}$. Thus the hinge risk upper bounds the zero-one risk:
$$L^{0/1}_D(w) \;\le\; E_D\big[\max\{0, 1 - y\, w^T x\}\big] \;\triangleq\; L^H_D(w).$$
Alternatively, one can upper bound the zero-one risk by the minimax risk (Lanckriet et al., 2003; Honorio & Jaakkola, 2014). There, instead of upper bounding the zero-one loss function, one takes the supremum over all distributions that agree with the data's mean and covariance. Denote $\mu = E_{(x,y)\sim D}[yx]$ and $\Sigma = E_{(x,y)\sim D}[(yx - \mu)(yx - \mu)^T]$, and denote by $Z(\mu, \Sigma)$ the set of all distributions with mean $\mu$ and covariance $\Sigma$. Then
$$L^{0/1}_D(w) \;\le\; \sup_{yx \sim Z(\mu,\Sigma)} \Pr(w^T yx \le 0) \;\triangleq\; L^M_{\mu,\Sigma}(w).$$
It was shown in (Lanckriet et al., 2003; Honorio & Jaakkola, 2014) that
$$\sup_{z \sim Z(\mu,\Sigma)} \Pr(w^T z \le 0) = \frac{1}{1 + (w^T \mu)^2 / (w^T \Sigma w)}.$$

In our work we consider non-linear classification with $K$ hyperplanes. Let $w_i$, $i = 1, \dots, K$, denote $K$ hyperplanes and let $W$ be the matrix with $w_i$ as its $i$-th column. We define a non-linear classifier $f_W(x)$ as the intersection of these $K$ hyperplanes:
$$f_W(x) = \begin{cases} 1 & \text{if } W^T x > 0 \\ -1 & \text{otherwise} \end{cases} \qquad (1)$$
where $0$ denotes the vector of zeros and the inequality is taken coordinate-wise.
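To make the quantities above concrete, here is a minimal sketch (ours, not the paper's code; plain NumPy, with variable names of our choosing) that evaluates the single-hyperplane minimax bound $1/(1 + (w^T\mu)^2/(w^T\Sigma w))$ and the intersection classifier $f_W$ of eq. (1).

```python
import numpy as np

def minimax_bound(w, mu, Sigma):
    """The formula 1 / (1 + (w^T mu)^2 / (w^T Sigma w)) from Section 2:
    worst-case error over all distributions with mean mu and covariance Sigma."""
    margin_sq = float(w @ mu) ** 2
    variance = float(w @ Sigma @ w)
    return 1.0 / (1.0 + margin_sq / variance)

def f_W(X, W):
    """Intersection-of-hyperplanes classifier of eq. (1):
    +1 iff every column w_i of W satisfies w_i^T x > 0."""
    return np.where((X @ W > 0).all(axis=1), 1, -1)

# Toy usage with made-up statistics (illustrative values only).
mu = np.array([-1.0, -0.5])                  # mean of the negative class
Sigma = np.array([[1.0, 0.2], [0.2, 0.5]])   # its covariance
w = np.array([1.0, 0.3])                     # a candidate hyperplane
print(minimax_bound(w, mu, Sigma))
```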
3. Minimax-Hinge Risk

We are interested in a classification problem in which the positive class corresponds to a single concept and the negative class is its complement. We assume that the positive sample is relatively small while the negative sample is very large (it can also be represented by unlabeled data and is thus easy to collect). When the sample is small, the regularized hinge loss has been shown to be very effective. The minimax bound is tight for the Gaussian distribution, and it becomes tight as the sample size approaches infinity. Due to these specifics of the problem we propose to use a mixed risk: the hinge risk for the positive class and the minimax risk for the negative class. Next, we formulate the mixed risk for the non-linear classifier of eq. (1):
$$L^{MH}_D(W) = L^{H,1}_D(W) + L^{M,-1}_{\mu(D_{neg}),\Sigma(D_{neg})}(W) \qquad (2)$$
where $W$ is the matrix of $K$ hyperplanes, $D$ is the joint distribution of samples and labels, $D_{neg}$ is the marginal distribution of samples with negative labels, and $\mu(D_{neg})$ and $\Sigma(D_{neg})$ are its mean and covariance, respectively. $L^{H,1}_D$ and $L^{M,-1}_{\mu(D_{neg}),\Sigma(D_{neg})}$ are defined below:
$$L^{H,1}_D(W) = E_D\big[L^H(W, x, y)\,\mathbf{1}[y = 1]\big], \qquad \text{where} \qquad L^H(W, x, 1) = \sum_i \max\{0,\, 1 - w_i^T x\}.$$

3.1. The Expected Risk of the Negative Class

Since $L^{M,-1}_{\mu(D_{neg}),\Sigma(D_{neg})}(W)$ is the risk of a negative sample falling into the intersection of $K$ half-spaces, which is a convex set, we can use the theorem of Marshall and Olkin (Marshall & Olkin, 1960) to derive the risk of the negative class.

Theorem 1. Let $Z(\mu, \Sigma)$ be the set of all distributions with known mean $\mu$ and covariance $\Sigma$.¹ For $K$ fixed hyperplanes $w_i$ ($i = 1, \dots, K$) we have
$$\sup_{z \sim Z(\mu,\Sigma)} \Pr(W^T z > 0) = \frac{1}{1 + d^2}, \qquad d^2 = \mu^T \tilde W (\tilde W^T \Sigma \tilde W)^{-1} \tilde W^T \mu,$$
where $\tilde W$ is the matrix whose columns are those $w_i$ that satisfy $w_i^T z^* = 0$, and $z^* = \arg\min_{W^T z \ge 0} (z - \mu)^T \Sigma^{-1} (z - \mu)$.

¹ For notational simplicity we drop $D_{neg}$ and denote $\mu = \mu(D_{neg})$ and $\Sigma = \Sigma(D_{neg})$.

Proof. Following the result of Marshall and Olkin (Marshall & Olkin, 1960) for a convex set, we obtain
$$\sup_{z \sim Z(\mu,\Sigma)} \Pr(W^T z \ge 0) = \frac{1}{1 + d^2}, \qquad d^2 = \inf_{W^T z \ge 0} (z - \mu)^T \Sigma^{-1} (z - \mu).$$
Next we derive a closed-form expression for $d^2$. We seek the solution of the primal problem
$$\min_z\; (z - \mu)^T \Sigma^{-1} (z - \mu) \quad \text{s.t.} \quad w_i^T z \ge 0, \ \ i = 1, \dots, K.$$
We construct the Lagrangian
$$L(z, \lambda) = (z - \mu)^T \Sigma^{-1} (z - \mu) + \sum_i \lambda_i w_i^T z, \qquad \lambda_i \ge 0.$$
The optimality condition $\nabla_z L = 2\Sigma^{-1} z - 2\Sigma^{-1} \mu + \sum_i \lambda_i w_i = 0$ gives
$$z^* = \mu - \tfrac{1}{2} \sum_i \lambda_i \Sigma w_i.$$
The Lagrange dual function is
$$L(z^*, \lambda) = \Big(\tfrac{1}{2}\sum_i \lambda_i \Sigma w_i\Big)^{\!T} \Sigma^{-1} \Big(\tfrac{1}{2}\sum_j \lambda_j \Sigma w_j\Big) + \sum_i \lambda_i w_i^T \Big(\mu - \tfrac{1}{2}\sum_j \lambda_j \Sigma w_j\Big). \qquad (3)$$
The optimality conditions are $-\tfrac{1}{2}\sum_i \lambda_i w_k^T \Sigma w_i + w_k^T \mu = 0$ for every $k$ such that $\lambda_k > 0$. The dual is therefore maximized at
$$\tilde\lambda = 2\,(\tilde W^T \Sigma \tilde W)^{-1} \tilde W^T \mu, \qquad (4)$$
where $\tilde W$ is formed by the subset of columns of $W$ for which $\lambda_k > 0$, and thus $w_k^T z^* = 0$. For the last step we substitute the optimal $\tilde\lambda$ of eq. (4) into the dual function (3); after simple algebraic manipulations we obtain
$$d^2 = \max_{\lambda \ge 0} L(z^*, \lambda) = \mu^T \tilde W (\tilde W^T \Sigma \tilde W)^{-1} \tilde W^T \mu. \qquad \square$$
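As a sanity check on Theorem 1, the closed form $d^2 = \mu^T\tilde W(\tilde W^T\Sigma\tilde W)^{-1}\tilde W^T\mu$ can be compared against a direct numerical solution of the primal problem. The sketch below is ours, not from the paper; it assumes SciPy's SLSQP solver and uses arbitrary illustrative data. It minimizes the Mahalanobis distance subject to $W^T z \ge 0$ and recovers $d^2$, and hence the bound $1/(1 + d^2)$.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.linalg import inv

rng = np.random.default_rng(1)
d, K = 4, 2
mu = -np.abs(rng.normal(size=d))            # negative-class mean (illustrative)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)                 # a positive-definite covariance
W = rng.normal(size=(d, K))                 # K candidate hyperplanes
Sigma_inv = inv(Sigma)

# Primal of Theorem 1: squared Mahalanobis distance from mu to {z : W^T z >= 0}.
obj = lambda z: (z - mu) @ Sigma_inv @ (z - mu)
cons = [{"type": "ineq", "fun": lambda z, i=i: W[:, i] @ z} for i in range(K)]
res = minimize(obj, x0=np.ones(d), constraints=cons)
d2_numeric = res.fun

# Closed form of Theorem 1, restricted to the active constraints (w_i^T z* ~ 0).
active = [i for i in range(K) if abs(W[:, i] @ res.x) < 1e-6]
Wt = W[:, active] if active else None
d2_closed = 0.0 if Wt is None else mu @ Wt @ inv(Wt.T @ Sigma @ Wt) @ Wt.T @ mu

print(d2_numeric, d2_closed, 1.0 / (1.0 + d2_closed))
```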
4. Generalization Bounds

In the following we bound the risk by its finite-sample estimate. We show that the discrepancy between the risk $L^{MH}_D(W)$ and its empirical estimate $L^{MH}_S(W)$ decays at a rate of $c/\sqrt{m}$, where the constant depends on the confidence $\delta$ over the draws of the training data and $m$ is the size of the training set. The main difficulty arises from mixing the hinge risk for the positive examples with the minimax risk for the negative examples. We therefore split the risk into its positive and negative parts and derive a uniform generalization bound for each.

4.1. Uniform generalization bound for the empirical minimax risk

The minimax risk upper bounds the zero-one risk over the negative examples. To derive a uniform generalization bound for this setting we bound each of its components differently, using a finite sample of training examples. In view of eq. (2), we bound the relative size of the negative set by its empirical average, and we bound the minimax risk itself by its training-sample estimate. For clarity we begin by deriving a generalization bound for a single hyperplane, followed by a generalization bound for an intersection of hyperplanes.

Recall that $D_{neg}$ is the distribution of the negative data points. We abbreviate its mean and covariance by $\mu \equiv \mu(D_{neg})$ and $\Sigma \equiv \Sigma(D_{neg})$, respectively. Let $\hat\mu$ and $\hat\Sigma$ be the mean and covariance estimated from the training points with negative labels. The generalization of the minimax risk $L^{M,-1}_{\mu,\Sigma}(w)$ is governed by the discrepancy
$$\Delta = \frac{1}{1 + (w^T\mu)^2 / (w^T\Sigma w)} - \frac{1}{1 + (w^T\hat\mu)^2 / (w^T\hat\Sigma w)}.$$
Some algebraic manipulation yields a simpler form:
$$\Delta = \frac{w^T\Sigma w\,(w^T\hat\mu)^2 - w^T\hat\Sigma w\,(w^T\mu)^2}{\big(w^T\Sigma w + (w^T\mu)^2\big)\big(w^T\hat\Sigma w + (w^T\hat\mu)^2\big)}. \qquad (5)$$
To obtain a uniform generalization bound for the minimax risk, we show that this discrepancy decreases as the size of the training sample increases. We therefore express it in terms of $\hat\mu - \mu$ and $\hat\Sigma - \Sigma$, which shrink with the training sample. Adding and subtracting $(w^T\mu)^2\, w^T\Sigma w$ in the numerator gives
$$\Delta = \frac{w^T\Sigma w\,\big((w^T\hat\mu)^2 - (w^T\mu)^2\big) + (w^T\mu)^2\, w^T(\Sigma - \hat\Sigma) w}{\big(w^T\Sigma w + (w^T\mu)^2\big)\big(w^T\hat\Sigma w + (w^T\hat\mu)^2\big)} \;\triangleq\; \Delta_1 + \Delta_2.$$
The sampled quantity $\|\hat\mu - \mu\|$ diminishes with high probability as the training size increases. This controls the first term of the minimax discrepancy, as described in the following lemma.

Lemma 1. Let $D_{neg}$ be a distribution over data points $x$ with negative labels such that $\|x\| \le 1$ holds with probability 1. Denote by $\mu$ its mean and by $\Sigma$ its covariance. Let $S_{neg}$ be a training sample of size $\hat m$, let $\hat\mu = \frac{1}{\hat m}\sum_{x \in S_{neg}} x$ be its sample mean and $\hat\Sigma = \frac{1}{\hat m}\sum_{x \in S_{neg}} (x - \hat\mu)(x - \hat\mu)^T$ its sample covariance. Define
$$\Delta_1 = \frac{w^T\Sigma w\,\big((w^T\hat\mu)^2 - (w^T\mu)^2\big)}{\big(w^T\Sigma w + (w^T\mu)^2\big)\big(w^T\hat\Sigma w + (w^T\hat\mu)^2\big)}.$$
Assume that the minimal eigenvalue of $\Sigma$ and of $\hat\Sigma$ is lower bounded by $\alpha$. Then, with probability at least $1 - \delta$ over the draws of the training set $S_{neg}$, the following holds uniformly for all $w$:
$$\Delta_1 \;\le\; \frac{2}{\alpha}\sqrt{\frac{32\,\big(\log(1/\delta) + 1/4\big)}{\hat m}}.$$

Proof. First we upper bound $\Delta_1$ by decreasing the denominator, omitting $(w^T\mu)^2$ and $(w^T\hat\mu)^2$. Using the identity $a^2 - b^2 = (a+b)(a-b)$ with $a = w^T\hat\mu$ and $b = w^T\mu$ we obtain
$$\Delta_1 \le \frac{w^T(\hat\mu - \mu)\; w^T(\hat\mu + \mu)}{w^T\hat\Sigma w}.$$
Next we apply the Cauchy-Schwarz inequality $|a^T b| \le \|a\|\,\|b\|$ to the numerator and the lower bound $w^T\hat\Sigma w \ge \alpha\|w\|^2$ to the denominator:
$$\Delta_1 \le \frac{1}{\alpha}\,\|\hat\mu - \mu\|\,\|\hat\mu + \mu\|.$$
Finally, since $\|x\| \le 1$ we have $\|\hat\mu + \mu\| \le 2$. Moreover, the Bernstein inequality for vectors (cf. (Gross, 2011), Theorem 11; (Candes & Plan, 2011), Theorem 2.6) implies
$$P\big[\|\hat\mu - \mu\| \ge t\big] \le \exp\big(-\hat m t^2/32 + 1/4\big)$$
for any $t \le 2\sqrt{\hat m}$. The result follows by setting $\delta = \exp(-\hat m t^2/32 + 1/4)$, or equivalently $t = \sqrt{32(\log(1/\delta) + 1/4)/\hat m}$. $\square$
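The $1/\sqrt{\hat m}$ behaviour in Lemma 1 is easy to see numerically. The sketch below is ours (plain NumPy, with an arbitrary bounded stand-in for the negative-class distribution); it estimates $E\|\hat\mu - \mu\|$ for growing $\hat m$ and shows it shrinking roughly like $1/\sqrt{\hat m}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10

def sample_negatives(n):
    """Bounded negative-class points (||x|| <= 1), as assumed in Lemma 1."""
    x = rng.normal(size=(n, d))
    x /= np.maximum(1.0, np.linalg.norm(x, axis=1, keepdims=True))  # project into unit ball
    return x

# Approximate the population mean with a very large sample.
mu = sample_negatives(200_000).mean(axis=0)

for m_hat in (100, 400, 1600, 6400):
    errs = [np.linalg.norm(sample_negatives(m_hat).mean(axis=0) - mu)
            for _ in range(50)]
    # The last column stays roughly constant, i.e. the error decays like 1/sqrt(m_hat).
    print(m_hat, np.mean(errs), np.mean(errs) * np.sqrt(m_hat))
```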
We now turn to the second term of the minimax discrepancy in eq. (5). The quantity $\|\hat\Sigma - \Sigma\|$ diminishes with high probability as the training size increases.

Lemma 2. Under the conditions of Lemma 1, define
$$\Delta_2 = \frac{(w^T\mu)^2\, w^T(\Sigma - \hat\Sigma) w}{\big(w^T\Sigma w + (w^T\mu)^2\big)\big(w^T\hat\Sigma w + (w^T\hat\mu)^2\big)}.$$
Assume that the minimal eigenvalues of $\Sigma$ and $\hat\Sigma$ are lower bounded by $\alpha$. Then, with probability at least $1 - \delta$ over the draws of the training set $S_{neg}$, the following holds uniformly for all $w$:
$$\Delta_2 \;\le\; \frac{1}{\alpha^2}\sqrt{\frac{128\,\big(\log(1/\delta) + 1/4\big)}{\hat m}}.$$

Proof. First we lower bound the denominator by $\alpha^2\|w\|^4$ (omitting $(w^T\mu)^2$ and $(w^T\hat\mu)^2$), thus upper bounding $\Delta_2$. Noting that $w^T A w = a^T b$, where $a$ is the vectorization of $A$ and $b$ is the vectorization of $ww^T$, we use the Cauchy-Schwarz inequality to upper bound the numerator by $\|\mu\|^2 \|w\|^2\, \|a\|\,\|b\|$. Since $\|b\| = \|w\|^2$ and $\|\mu\| \le 1$, we obtain the bound $\Delta_2 \le \|\Sigma - \hat\Sigma\| / \alpha^2$, where the norm of $\Sigma - \hat\Sigma$ is taken as the norm of its vectorized form. Using the Bernstein inequality for vectors we obtain $P[\|\Sigma - \hat\Sigma\| \ge t] \le \exp(-\hat m t^2/128 + 1/4)$ for any $t \le 4\sqrt{\hat m}$. The result follows by setting $\delta = \exp(-\hat m t^2/128 + 1/4)$, or equivalently $t = \sqrt{128(\log(1/\delta) + 1/4)/\hat m}$. $\square$

Bounds on the discrepancy between the minimax risk and its empirical estimate that hold uniformly over all $w$ guarantee generalization. The lemmas above show that the penalty for observing only a finite sample decreases as $1/\sqrt{\hat m}$. This is summarized in the following theorem.

Theorem 2. Suppose that $D$ is a distribution over $X \times Y$ with $Y = \{-1, +1\}$ and $X = \{x : \|x\| \le 1\}$. Let $L^{M,-1}_{\mu,\Sigma}(w)$ be the minimax risk over the negative labels, where $\mu, \Sigma$ are the mean and covariance of the marginal distribution of $x$ over the negative labels. Consider a training sample $S$ of size $m$, of which $\hat m$ points have a negative label, and let $L_{S_{neg}}(w)$ be the empirical minimax risk over the negative labels,
$$L_{S_{neg}}(w) = \frac{\hat m}{m} \sup_{z \sim Z(\hat\mu,\hat\Sigma)} P[w^T z \ge 0],$$
where $\hat\mu, \hat\Sigma$ are the empirical mean and covariance of the marginal distribution of $x$ over the negative training labels. Then, for any $\delta \in (0, 1]$, with probability at least $1 - 3\delta$ over an i.i.d. sample of size $m$,
$$L^{M,-1}_{\mu,\Sigma}(w) \;\le\; L_{S_{neg}}(w) + c_1\sqrt{\frac{\log(1/\delta) + 1/4}{\hat m}} + \sqrt{\frac{\log(1/\delta)}{2m}},$$
where $c_1 = \sqrt{128}\,(1/\alpha + 1/\alpha^2)$ and $\alpha$ is the minimal eigenvalue of $\Sigma$ and $\hat\Sigma$.

Proof. We estimate $\rho = E_{(x,y)\sim D}[\mathbf{1}[y = -1]]$ by its empirical mean $\hat m/m$. By the Hoeffding inequality, $P[\,|\hat m/m - \rho| \ge t\,] \le \exp(-2mt^2)$; setting $\delta = \exp(-2mt^2)$ yields the confidence interval $t = \sqrt{\log(1/\delta)/2m}$. Combining this with the lemmas above, with error probability $3\delta$ the minimax risk is upper bounded by
$$\Big(\frac{\hat m}{m} + \sqrt{\frac{\log(1/\delta)}{2m}}\Big)\Big(\frac{1}{1 + (w^T\hat\mu)^2/(w^T\hat\Sigma w)} + \frac{2}{\alpha}\sqrt{\frac{32(\log(1/\delta)+1/4)}{\hat m}} + \frac{1}{\alpha^2}\sqrt{\frac{128(\log(1/\delta)+1/4)}{\hat m}}\Big).$$
We conclude the proof using $\hat m/m \le 1$ and $\frac{1}{1 + (w^T\hat\mu)^2/(w^T\hat\Sigma w)} \le 1$. $\square$

The same type of bound holds for classification by any finite number of hyperplanes, i.e., $W^T x \ge 0$. We omit the derivation, which is tedious and follows the same steps as above.

Theorem 3. Consider the setting of Theorem 2 and let
$$L^{M,-1}_{\mu,\Sigma}(W) = E_{(x,y)\sim D}\big[\mathbf{1}[y = -1]\big]\, \sup_{z \sim Z(\mu,\Sigma)} P[W^T z \ge 0],$$
where $W$ is a matrix whose columns are $k$ different hyperplanes. Then, for any $\delta \in (0, 1]$, with probability at least $1 - \delta$ over an i.i.d. sample of size $m$, the following holds uniformly for all $W$:
$$L^{M,-1}_{\mu,\Sigma}(W) \;\le\; L^{M,-1}_{\hat\mu,\hat\Sigma}(W) + c\sqrt{\frac{\log(1/\delta) + 1/4}{\hat m}},$$
where $c = \sqrt{128k}\,(1/\alpha + 1/\alpha^2)$ and $\alpha$ is the minimal eigenvalue of $\Sigma$ and $\hat\Sigma$.

In our setting it is important to assume that the eigenvalues of $\Sigma$ are lower bounded by $\alpha$. This assumption implies that the eigenvalues of $\hat\Sigma$ are also bounded away from zero whenever $\hat m \gg d$. Indeed, by the Cauchy-Schwarz inequality, for any $w$ of unit norm, $|w^T\Sigma w - w^T\hat\Sigma w| \le \|\Sigma - \hat\Sigma\|$. Using the Bernstein inequality for vectors, $P[\|\Sigma - \hat\Sigma\| \ge t] \le \exp(-\hat m t^2/128 + 1/4)$, so with high probability (the failure probability decays exponentially with $\hat m$) we have, for any $w$ of unit norm, $w^T\hat\Sigma w \ge w^T\Sigma w - t$. Since the eigenvalues of $\Sigma$ are lower bounded by $\alpha$, i.e., $w^T\Sigma w \ge \alpha$, the eigenvalues of $\hat\Sigma$ are also bounded away from zero whenever $t < \alpha$.
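To make Theorem 2 concrete, the following sketch (ours, with illustrative values of $\alpha$, $\delta$, and the data; plain NumPy) evaluates the empirical minimax term $L_{S_{neg}}(w)$ and the finite-sample penalty it adds.

```python
import numpy as np

def empirical_minimax_risk(w, X_neg, m_total):
    """(hat m / m) times the worst-case Pr(w^T z >= 0) over distributions matching
    the empirical mean and covariance of the negative sample (as in Theorem 2)."""
    m_hat = X_neg.shape[0]
    mu_hat = X_neg.mean(axis=0)
    Sigma_hat = np.cov(X_neg, rowvar=False, bias=True)
    d2 = (w @ mu_hat) ** 2 / (w @ Sigma_hat @ w)
    return (m_hat / m_total) / (1.0 + d2), m_hat, Sigma_hat

def theorem2_penalty(m_hat, m_total, alpha, delta):
    """Finite-sample terms added to the empirical risk in Theorem 2."""
    c1 = np.sqrt(128.0) * (1.0 / alpha + 1.0 / alpha ** 2)
    return (c1 * np.sqrt((np.log(1.0 / delta) + 0.25) / m_hat)
            + np.sqrt(np.log(1.0 / delta) / (2.0 * m_total)))

# Illustrative usage with synthetic negatives.
rng = np.random.default_rng(0)
X_neg = rng.normal(loc=-0.3, scale=0.2, size=(5000, 5))
w = np.ones(5) / np.sqrt(5)
risk, m_hat, Sigma_hat = empirical_minimax_risk(w, X_neg, m_total=5200)
alpha = np.linalg.eigvalsh(Sigma_hat).min()      # plug-in eigenvalue floor
print(risk, theorem2_penalty(m_hat, 5200, alpha, delta=0.05))
```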
4.2. Uniform generalization bound for the empirical risk of the hinge loss

The hinge-loss risk upper bounds the zero-one risk over the positive examples. Using Rademacher complexities we derive a uniform generalization bound for classification with $k$ hyperplanes, $W^T x > 0$, i.e., $w_i^T x > 0$ for each $i = 1, \dots, k$. The Rademacher complexity of a bounded set $A \subset \mathbb{R}^m$ is
$$R(A) = \frac{1}{m}\, E_\sigma\Big[\max_{a \in A} \sum_{i=1}^m \sigma_i a_i\Big],$$
where the $\sigma_i \in \{-1, +1\}$ are i.i.d. and equally probable random variables. The set $A$ collects the losses of the predictors $W$ over a training sample $(x_1, y_1), \dots, (x_m, y_m)$, namely $A = \{(L(W^T x_j, y_j))_{j=1,\dots,m}\}$. Whenever the loss is Lipschitz with respect to the $k$ hyperplane predictions, the Rademacher complexity is bounded as follows.

Theorem 4. Consider a $k$-hyperplane loss function $L(W^T x, y)$ for which each hyperplane satisfies $\|w_i\| \le 1$ and each data point satisfies $\|x\| \le 1$. Assume that the loss is Lipschitz for every $y$, i.e., $|L(\alpha, y) - L(\beta, y)| \le \sum_{i=1}^k |\alpha_i - \beta_i|$. Then its Rademacher complexity is bounded by
$$R\big(\{L(W^T x_j, y_j)\}_{j=1,\dots,m}\big) \le \frac{k}{\sqrt m}.$$

Proof. First we prove a decomposition lemma (cf. (Kakade et al., 2009)) for the $k$-hyperplane setting:
$$m\,R\big(\{L(W^T x_j, y_j)\}_{j=1,\dots,m}\big) \;\le\; E_{\sigma_{i,j}}\Big[\max_{\|w_i\| \le 1} \sum_{i,j} \sigma_{i,j}\, w_i^T x_j\Big].$$
For notational convenience we prove it for $m = 1$; the general case follows by induction over $m$. By definition, $R(W, x_1, y_1) = E_\sigma[\max_W \sigma L(W, x_1, y_1)]$, and for brevity we denote this Rademacher complexity by $R$. Since $P[\sigma = 1] = P[\sigma = -1] = 0.5$,
$$R = 0.5 \max_W L(W, x_1, y_1) + 0.5 \max_W \big(-L(W, x_1, y_1)\big).$$
By duplicating the hyperplanes into $W, \hat W$ we can maximize both terms jointly:
$$R = 0.5 \max_{W, \hat W}\big(L(W, x_1, y_1) - L(\hat W, x_1, y_1)\big).$$
By the Lipschitz condition, $L(W, x_1, y_1) - L(\hat W, x_1, y_1) \le \sum_i |w_i^T x_1 - \hat w_i^T x_1|$. Since $w_i$ and $\hat w_i$ range over the same norm ball, for any fixed signs $\hat\sigma_i \in \{-1, 1\}$ the maximum is unchanged when the absolute value is replaced by $\hat\sigma_i(w_i^T x_1 - \hat w_i^T x_1)$:
$$R \le 0.5 \max_{W, \hat W} \sum_i \hat\sigma_i\big(w_i^T x_1 - \hat w_i^T x_1\big).$$
We may therefore take the expectation with respect to $\hat\sigma_i$ while maintaining the inequality. Separating the two maximizations, and noting that $\hat\sigma_i$ and $-\hat\sigma_i$ have the same distribution, we obtain
$$R \le E_{\hat\sigma}\Big[\max_W \sum_i \hat\sigma_i\, w_i^T x_1\Big].$$
For general $m$ we get by induction
$$m\,R\big(\{L(W^T x_j, y_j)\}\big) \le E_{\sigma_{i,j}}\Big[\max_{\|w_i\| \le 1} \sum_{i,j} \sigma_{i,j}\, w_i^T x_j\Big].$$
Standard Rademacher-type arguments, e.g. (Bartlett & Mendelson, 2003), now give
$$m\,R\big(\{L(W^T x_j, y_j)\}\big) \le \sum_{i=1}^k E_{\sigma}\Big\|\sum_{j=1}^m \sigma_{i,j}\, x_j\Big\| \le k\sqrt m,$$
which proves the theorem. $\square$

The theorem above generalizes the standard decomposition lemma (also known as the contraction lemma) to $k$ hyperplanes with any Lipschitz loss. In our setting we consider the sum of hinge losses over the positive examples,
$$L(W^T x, y) = \mathbf{1}[y = 1]\, \sum_{i=1}^k \max\{0,\, 1 - y\, w_i^T x\}.$$
Since this function satisfies the conditions of the theorem, we may use the standard Rademacher uniform generalization bound. Let $L^{H,1}_D(W) = E_{(x,y)\sim D}[L(W^T x, y)]$ be the risk, and let $L^{H,1}_S(W) = \frac{1}{m}\sum_{i=1}^m L(W^T x_i, y_i)$ be the empirical risk over a training sample of size $m$. Then, for any $\delta \in (0, 1]$, with probability at least $1 - \delta$ over an i.i.d. sample of size $m$, the following holds simultaneously for all $\|w_1\|, \dots, \|w_k\| \le 1$, whenever $\|x\| \le 1$:
$$L^{H,1}_D(W) \;\le\; L^{H,1}_S(W) + \frac{2k}{\sqrt m} + c\sqrt{\frac{\log(1/\delta)}{2m}},$$
where $c$ bounds the range of the loss ($c \le 2k$ under the assumptions above).
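The $k/\sqrt m$ rate of Theorem 4 can be checked with a small simulation. The sketch below is ours (NumPy only); it uses the fact that $\max_{\|w_i\|\le 1}\sum_j \sigma_{i,j}\, w_i^T x_j = \|\sum_j \sigma_{i,j}\, x_j\|$ to estimate the right-hand side of the decomposition bound and compares it with $k/\sqrt m$.

```python
import numpy as np

rng = np.random.default_rng(0)

def rademacher_estimate(X, k, trials=2000):
    """Monte Carlo estimate of (1/m) E_sigma[ max_{||w_i||<=1} sum_{i,j} sigma_ij w_i^T x_j ].
    For each hyperplane the inner maximum equals || sum_j sigma_ij x_j ||."""
    m = X.shape[0]
    total = 0.0
    for _ in range(trials):
        sigma = rng.choice([-1.0, 1.0], size=(k, m))
        total += np.linalg.norm(sigma @ X, axis=1).sum()   # sum_i || sum_j sigma_ij x_j ||
    return total / (trials * m)

d, k = 20, 3
for m in (100, 400, 1600):
    X = rng.normal(size=(m, d))
    X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # enforce ||x|| <= 1
    print(m, rademacher_estimate(X, k), k / np.sqrt(m))  # estimate vs. the k/sqrt(m) bound
```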
5. Algorithm

We aim to minimize the risk in eq. (2). To this end, we minimize the empirical risk regularized by the sum of squared $L_2$ norms of the $K$ hyperplanes:
$$\min_W\; \sum_i \|w_i\|^2 + L^{M,-1}_S(W) + L^{H,1}_S(W). \qquad (7)$$
This empirical risk is a non-convex and non-smooth function, hence gradient-based optimization is difficult. However, for a single hyperplane we can write an equivalent optimization problem:
$$\min_w\; \|w\|^2 + C\sum_i \max\{0,\, 1 - w^T x_i\} \quad \text{s.t.} \quad \gamma\sqrt{w^T \Sigma w} + w^T\mu \le 0, \qquad (8)$$
where the sum runs over the positive examples and $\gamma$ is a constant controlling the probability mass of the negative class in the positive half-space: taking $\gamma \ge \sqrt 2\, \mathrm{erf}^{-1}(1 - 2\delta)$ guarantees $\Pr(w^T z \ge 0) \le \delta$.² Since we seek to minimize this probability, we can assume that $\delta < 1/2$ and thus $\gamma > 0$. The constraint in problem (8) is convex for $\gamma > 0$, hence the entire problem (8) is convex.

² The supremum of the minimax risk is attained for the Gaussian distribution.

We propose an iterative procedure (Algorithm 1) that approximates the solution of eq. (7). It starts by training $K$ hyperplanes in a greedy manner and then iteratively adjusts each hyperplane to decrease $\Pr(W^T z \ge 0)$.

Algorithm 1 Intersection of K hyperplanes classifier
Input: $\{x_i\}$, $i = 1, \dots, N_p$, a set of positive examples; $\{z_i\}$, $i = 1, \dots, N_u$, a set of negative examples.
Output: $W$ ($K$ hyperplanes).
{The initial greedy step}
Estimate $\mu$ and $\Sigma$ using $\{z_i\}_{i=1}^{N_u}$.
Find $w_1$ using eq. (8) with $\mu$ and $\Sigma$.
for $k = 2$ to $K$ do
  Estimate $\mu_k$ and $\Sigma_k$ using $\{z_i \mid w_j^T z_i > 0,\ j = 1, \dots, k-1\}$.
  Find $w_k$ using eq. (8) with $\mu_k$ and $\Sigma_k$.
end for
{The refinement iterations}
Let $P_k$ be the probability $\Pr(W^T z \ge 0)$ in iteration $k$.
while $P_{k-1} - P_k > \epsilon$ do
  Estimate $\mu_k$ and $\Sigma_k$ using $\{z_i \mid w_j^T z_i > 0,\ j = 1, \dots, K,\ j \ne k\}$.
  Find $w_k$ using eq. (8) with $\mu_k$ and $\Sigma_k$.
end while

Lemma 3. Algorithm 1 decreases $\Pr(W^T z \ge 0)$ at each iteration.

Proof. We show the proof for two hyperplanes; the same argument holds for $K$ hyperplanes. Let $W = [w_1\, w_2]$. We can write
$$\Pr(W^T z \ge 0) = \Pr(w_1^T z \ge 0 \,\wedge\, w_2^T z \ge 0) = \Pr(w_1^T z \ge 0)\,\Pr(w_2^T z \ge 0 \mid w_1^T z \ge 0).$$
First, $w_1$ minimizes $\Pr(w_1^T z \ge 0)$. Second, the optimization in eq. (8) seeks a $w_2$ that minimizes $\Pr(w_2^T z \ge 0 \mid w_1^T z \ge 0)$. Let $w_1^i$ and $w_2^i$ be the two hyperplanes after $i$ iterations and let $\alpha = \Pr(w_1^{i\,T} z \ge 0 \,\wedge\, w_2^{i\,T} z \ge 0)$ be the current probability of a negative sample falling in the intersection. The algorithm seeks a hyperplane $w_1^{i+1}$ that minimizes $\Pr(w_1^{i+1\,T} z \ge 0 \mid w_2^{i\,T} z \ge 0)$; since $w_1^i$ is a feasible candidate for this minimization,
$$\Pr(w_1^{i+1\,T} z \ge 0 \,\wedge\, w_2^{i\,T} z \ge 0) = \Pr(w_1^{i+1\,T} z \ge 0 \mid w_2^{i\,T} z \ge 0)\,\Pr(w_2^{i\,T} z \ge 0) \le \alpha.$$
Therefore Algorithm 1 decreases $\Pr(W^T z \ge 0)$ at each iteration. $\square$

The empirical risk of the intersection of $K$ hyperplanes also contains the sum of hinge losses over the positive examples. In each iteration the algorithm minimizes the hinge loss of one hyperplane while keeping the rest fixed, so the hinge risk decreases at each iteration as well. Since the empirical risk is the sum of the hinge and minimax risks, it follows from Lemma 3 and the discussion above that Algorithm 1 decreases the empirical risk at each iteration and thus converges.
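The per-hyperplane problem (8) is a second-order-cone program, so any conic solver can handle it. Below is a minimal sketch of Algorithm 1. It is our own code, not the authors' implementation (the paper used CVX in Matlab); it assumes the cvxpy Python package, and the function names, the stopping tolerance, and the round limit are our choices. The greedy step and the refinement sweeps follow the pseudocode above, with the negative class entering only through the mean and covariance of the negatives that remain inside the current intersection.

```python
import numpy as np
import cvxpy as cp
from scipy.linalg import sqrtm

def solve_single_hyperplane(X_pos, mu, Sigma, C=1.0, gamma=1.0):
    """Sketch of the convex subproblem (8): hinge loss on the positives subject to
    the second-order-cone constraint gamma*sqrt(w' Sigma w) + w' mu <= 0."""
    d = X_pos.shape[1]
    w = cp.Variable(d)
    Sigma_half = np.real(sqrtm(Sigma))
    objective = cp.Minimize(cp.sum_squares(w) + C * cp.sum(cp.pos(1 - X_pos @ w)))
    constraints = [gamma * cp.norm(Sigma_half @ w, 2) + mu @ w <= 0]
    cp.Problem(objective, constraints).solve()
    return w.value

def fit_k_hyperplanes(X_pos, Z_neg, K, C=1.0, gamma=1.0, eps=1e-4, max_rounds=25):
    """Sketch of Algorithm 1: greedy initialization followed by refinement sweeps."""
    def stats(Z):
        return Z.mean(axis=0), np.cov(Z, rowvar=False)

    W = []
    for k in range(K):                       # greedy step
        inside = Z_neg if not W else Z_neg[(Z_neg @ np.column_stack(W) > 0).all(axis=1)]
        if len(inside) <= X_pos.shape[1]:    # too few negatives left in the intersection
            break
        mu, Sigma = stats(inside)
        W.append(solve_single_hyperplane(X_pos, mu, Sigma, C, gamma))
    W = np.column_stack(W)
    K = W.shape[1]

    def neg_error(W):                        # empirical Pr(W^T z > 0) on the negatives
        return float(((Z_neg @ W) > 0).all(axis=1).mean())

    prev = neg_error(W)
    for _ in range(max_rounds):              # refinement sweeps over the hyperplanes
        for k in range(K):
            others = np.delete(np.arange(K), k)
            inside = Z_neg[(Z_neg @ W[:, others] > 0).all(axis=1)]
            if len(inside) > W.shape[0]:
                mu, Sigma = stats(inside)
                W[:, k] = solve_single_hyperplane(X_pos, mu, Sigma, C, gamma)
        cur = neg_error(W)
        if prev - cur <= eps:
            break
        prev = cur
    return W

def predict(X, W):
    """Eq. (1): +1 iff all hyperplanes agree that x is positive."""
    return np.where((X @ W > 0).all(axis=1), 1, -1)
```

One design point worth noting: the constraint $\gamma\sqrt{w^T\Sigma w} + w^T\mu \le 0$ is encoded through a matrix square root of $\Sigma$, which keeps the subproblem in standard SOCP form.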
6. Experiments

To test the proposed K-hyperplane hinge-minimax classifier we ran experiments in three different scenarios: synthetic 2D data, letter recognition, and large-scale scene classification. At test time, the K-hyperplane classifier incurs only K times the computational cost of a linear classifier (just K inner products), hence its natural competitors are linear classifiers, and we chose linear SVM as the baseline. We also compared the hinge-minimax classifiers to kernel SVM and to ensemble-based methods, which incur far longer running times (this is especially true for kernel SVM). The classification rates of the hinge-minimax classifier in all our experiments were comparable to ensemble classifiers that required 100-170 basic classifiers to reach similar performance. In experiments with high-dimensional data, the hinge-minimax classifiers performed as well as kernel SVM. The SVM classifiers were trained using C-SVC in LIBSVM.³ We used the CVX optimization package⁴ to find a single hyperplane in Algorithm 1. The ensemble classifiers were trained using the Matlab Statistics toolbox.

³ http://www.csie.ntu.edu.tw/cjlin/libsvm/
⁴ http://cvxr.com/cvx/download/

6.1. Synthetic Data Example

We construct the hinge-minimax classifier for 2D data to illustrate Algorithm 1. We sampled 5000 data points from two highly overlapping Gaussians (see Figure 1) with a varying ratio of positive (shown in red) and negative (shown in blue) examples. Each class was equally partitioned into training, validation, and test sets. We estimated the mean and covariance from the training data and tuned the parameters (C and γ) and the bias using the validation set. Table 1 shows the AUC for the different ratios of positive and negative examples using an intersection of 5 hyperplanes. These results demonstrate the robustness of the algorithm to unbalanced sets.

Table 1. AUC for different size partitions of the positive and negative classes.
Positive fraction:  0.01   0.1    0.2    0.3    0.4    0.5
AUC:                94.68  94.91  95.07  94.96  94.89  95.83

The first five plots in Figure 1 show the result of the initial greedy step for the first, second, third, fourth, and fifth hyperplanes, respectively. The contour lines in Figure 1 illustrate the covariance of the negative distribution inside the intersection, which is used to find the optimal separating hyperplane, depicted in black. The last plot in Figure 1 shows the final classifier after 25 iterations. It illustrates that the approximation algorithm succeeds in separating the positive set from the background, and that the refinement iterations improve the separation boundary.

Figure 1. Illustration of the hinge-minimax classifier construction on a toy example. The first 5 plots show the greedy initial step. The last plot shows the final classifier after 25 iterations. The contour lines show the covariance matrix of the negative distribution inside the intersection of hyperplanes, which is used to find the optimal hyperplane, depicted in black.
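For readers following along in code, the 2D setup of this section is easy to mimic with the sketch from Section 5 (again ours, not the authors' implementation; the Gaussian parameters below are arbitrary stand-ins, not those used in the paper).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_overlapping_gaussians(n_total=5000, pos_fraction=0.2):
    """Two highly overlapping 2D Gaussians; the parameters are illustrative only."""
    n_pos = int(n_total * pos_fraction)
    n_neg = n_total - n_pos
    X_pos = rng.multivariate_normal([1.0, 1.0], [[4.0, 1.0], [1.0, 3.0]], size=n_pos)
    Z_neg = rng.multivariate_normal([-1.0, -1.0], [[5.0, -1.0], [-1.0, 4.0]], size=n_neg)
    return X_pos, Z_neg

X_pos, Z_neg = make_overlapping_gaussians(pos_fraction=0.2)

# Fit an intersection of 5 hyperplanes with the Algorithm 1 sketch from Section 5,
# then score the training negatives; C and gamma would normally be tuned on a
# validation split, as described above.
# W = fit_k_hyperplanes(X_pos, Z_neg, K=5, C=1.0, gamma=1.0)
# print("train negatives inside the intersection:", (predict(Z_neg, W) == 1).mean())
```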
6.2. Letter Recognition

These tests were performed on the letter data set from the UCI Machine Learning Repository (Murphy & Aha, 1994), which provides 16-dimensional feature vectors for the 26 letters of the English alphabet. The letter images are based on 20 different fonts, and each letter within these fonts was randomly distorted to produce 20,000 samples. For each letter we used 100 samples for training, 250 for validation, and the rest for testing (about 400 samples per letter). The parameters of all methods were chosen using the validation set. Since the test set includes 25 times more negatives than positives, classifying all inputs as negative already yields about a 96% classification rate; we therefore used the classification rate at the equal error rate (EER) as a more faithful measure of performance.

Table 2 shows the classification rate at EER, averaged over the 26 letters, and the average classification times of the tested classifiers. The hinge-minimax classifier improves over linear SVM for all K, and for K > 1 it outperforms AdaBoost with a much shorter classification time. For this data set, kernel SVM outperforms all methods; however, the 4-hyperplane hinge-minimax classifier comes fairly close to the performance of the kernel SVM, while its classification time is three orders of magnitude shorter.

Table 2. Letter experiments. K is the number of hyperplanes used in the hinge-minimax classifier; times are in seconds; AdaBoost uses 100 decision trees.
Method                   Classification rate at EER   Classification time
hinge-minimax, K = 1     89.32                        5.6e-07
hinge-minimax, K = 2     92.98                        1.4e-06
hinge-minimax, K = 3     93.93                        1.5e-06
hinge-minimax, K = 4     94.48                        1.7e-06
Linear SVM               84.87                        4.6e-07
RBF kernel SVM           96.47                        1.7e-03
AdaBoost                 92.26                        1.0e-03

6.3. Large Scale Scene Recognition

In this test we used the 397 scene categories of the SUN database that have at least 100 images per category (Xiao et al., 2010). We represent the images as bags of words (BOW) of dense HOG features with 300 words. We downloaded the features from the SUN web page,⁵ which contains a spatial pyramid of BOWs, and used its bottom layer (the details of the feature extraction can be found in (Xiao et al., 2010)). The data is divided into 50 training and 50 test images per category, in 10 folds. Training one-against-all classifiers for 397 categories with 50 training samples per category involves very unbalanced training sets. We therefore assigned different weights to positive and negative samples in SVM training, and used RUSBoost (Seiffert et al., 2008) as the ensemble method (it is designed for skewed data and performed significantly better than AdaBoost on this data set). Note that the hinge-minimax classifier handles unbalanced sets naturally. A hinge-minimax classifier with more than two hyperplanes did not improve the performance. Table 3 shows the average AUC (the evaluation measure used in (Xiao et al., 2010)) of the tested methods and their classification times.

⁵ http://vision.cs.princeton.edu/projects/2010/SUN/

Table 3. Scene classification with 300-dimensional features. The classification time of the RBF kernel SVM is very high, since it selects about 15,000 support vectors from 19,850 training examples. RUSBoost uses 100 decision trees.
Method                   AUC     Classification time
hinge-minimax, K = 1     88.89   9.8e-05
hinge-minimax, K = 2     90.99   1.34e-04
Linear SVM               88.20   8.6e-05
RBF kernel SVM           90.77   23.97
RUSBoost                 90.76   0.08

7. Conclusions and Future Work

We proposed an efficient method for learning an intersection of a finite number of hyperplanes, which combines the hinge risk (for the small number of positive data points) with the minimax risk (for the large number of negative data points), and we derived a generalization bound for the mixed risk. We showed that the proposed classifier yields results comparable to popular non-linear classifiers, at a much lower (orders of magnitude) classification cost. Extending this approach to multi-class learning with K hyperplanes remains open. A one-vs-all heuristic is not directly applicable, since the K-hyperplane intersection yields a binary output with no score. One could use the result of Theorem 1 to find the distance to the intersection and use it as a score. Another direction is to combine structured-hinge and minimax risks.

Acknowledgments: This work has been supported by the Israel Science Foundation grant 839/12 and the European Union's Seventh Framework Programme FP7-ICT-2013-11 under grant agreement No 619435.

References

Arriaga, Rosa I and Vempala, Santosh. An algorithmic theory of learning: Robust concepts and random projection. In Foundations of Computer Science, 1999. 40th Annual Symposium on, pp. 616-623. IEEE, 1999.

Bartlett, Peter L and Mendelson, Shahar. Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3:463-482, 2003.

Bousquet, Olivier, Boucheron, Stéphane, and Lugosi, Gábor. Introduction to statistical learning theory. In Advanced Lectures on Machine Learning, pp. 169-207. Springer, 2004.

Candes, Emmanuel J and Plan, Yaniv. A probabilistic and RIPless theory of compressed sensing. Information Theory, IEEE Transactions on, 57(11):7235-7254, 2011.
Cortes, Corinna, Mohri, Mehryar, and Rostamizadeh, Afshin. Multi-class classification with maximum margin multiple kernel. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 46-54, 2013.

Gross, David. Recovering low-rank matrices from few coefficients in any basis. Information Theory, IEEE Transactions on, 57(3):1548-1566, 2011.

Honorio, Jean and Jaakkola, Tommi. Tight bounds for the expected risk of linear classifiers and PAC-Bayes finite-sample guarantees. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pp. 384-392, 2014.

Kakade, Sham M, Sridharan, Karthik, and Tewari, Ambuj. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pp. 793-800, 2009.

Klivans, Adam R and Servedio, Rocco A. Learning intersections of halfspaces with a margin. In Learning Theory, pp. 348-362. Springer, 2004.

Lanckriet, Gert R.G., Ghaoui, Laurent El, Bhattacharyya, Chiranjib, and Jordan, Michael I. A robust minimax approach to classification. J. Mach. Learn. Res., 3:555-582, 2003.

Marshall, Albert W. and Olkin, Ingram. Multivariate Chebyshev inequalities. Ann. Math. Statist., 31(4):1001-1014, 1960.

Murphy, P. and Aha, D. UCI repository of machine learning databases. Tech. rep., U. California, Dept. of Information and Computer Science, 1994.

Osadchy, M., Keren, D., and Fadida-Specktor, B. Hybrid classifiers for object classification with a rich background. In ECCV (5), pp. 284-297, 2012.

Seiffert, Chris, Khoshgoftaar, Taghi M., Hulse, Jason Van, and Napolitano, Amri. RUSBoost: Improving classification performance when training data is skewed. In ICPR, pp. 1-4, 2008.

Steinwart, Ingo. Sparseness of support vector machines. Journal of Machine Learning Research, 4:1071-1105, 2003.

Vapnik, Vladimir. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2000.

Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., and Torralba, A. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, pp. 3485-3492, 2010.

Zhang, Tong. Covering number bounds of certain regularized linear function classes. The Journal of Machine Learning Research, 2:527-550, 2002.