Rescale-Invariant SVM for Binary Classification

Mojtaba Montazery and Nic Wilson
Insight Centre for Data Analytics, School of Computer Science and IT, University College Cork, Ireland
{mojtaba.montazery, nic.wilson}@insight-centre.org

Abstract

Support Vector Machines (SVM) are among the best-known machine learning methods, with broad use in different scientific areas. However, one necessary pre-processing phase for SVM is normalization (scaling) of features, since SVM is not invariant to the scales of the feature spaces, i.e., different ways of scaling may lead to different results. We define a more robust decision-making approach for binary classification, in which one sample strongly belongs to a class if it belongs to that class for all possible rescalings of features. We derive a way of characterising the approach for binary SVM that allows determining when an instance strongly belongs to a class and when the classification is invariant to rescaling. The characterisation leads to a computational method to determine whether one sample is strongly positive, strongly negative or neither. Our experimental results back up the intuition that being strongly positive suggests stronger confidence that an instance really is positive.

1 Introduction

Among the wide spectrum of (supervised) machine learning techniques, the Support Vector Machine approach [Burges, 1998] is one of the best-known. It is widely used in a variety of application domains, from text classification [Tong and Koller, 2001] to the biological sciences [Byvatov and Schneider, 2002]. However, scaling of features is a crucial requirement for SVM because, for example, a particular feature with very large values, compared with the other features, might effectively veto the effect of the other features on the SVM objective function. Therefore, SVM is clearly sensitive to the way features are scaled [Stolcke et al., 2008; Ben-Hur and Weston, 2010].

The common practice for scaling is based on the properties of the input instances [Jain and Dubes, 1988; Aksoy and Haralick, 2001; Tax and Duin, 2000]; as an example of a scaling method, the value of a feature is subtracted by the minimum of all values of that feature in the dataset and divided by the difference between the maximum and minimum. So, the scaling can make the learnt model sometimes highly sensitive to precisely which instances are received. There can also be subjective, and even rather arbitrary, choices in the scaling of the feature spaces; different ways lead to different results.

It is therefore natural to consider a more cautious classification in which an instance is positively (negatively) classified if it is labelled as positive (negative) for all choices of scaling; the other instances, whose classification can depend on the choice of scaling, are labelled as neutral. Thus, this method refines the set of positively (negatively) classified instances by SVM in order to improve the level of confidence in the classification decisions. This could be helpful in certain sensitive decision-making applications such as disease diagnosis; e.g., the test for presence of a particular disease would fall into three categories: positive, negative, and requires further examination.

There are some studies attempting to account for the dependence on feature scaling in margin-based optimisation methods from a different perspective, see e.g., [Jebara and Shivaswamy, 2009; Dorkó and Schmid, 2003].
Also, note that the motivation behind this paper is different from studies, like [Xu et al., 2006], which are concerned with improving the robustness of SVM against outliers and noise, since we are specifically focusing on the uncertainty caused by rescaling of features.

The rest of the paper is organised as follows. In Section 2, we rephrase conventional binary SVM to facilitate our extension. Section 3 considers the effect of rescaling and defines strong classification. In Section 4, we characterise rescale-optimality, where the rescale-optimal vectors are those that can be made optimal in the SVM objective function for some rescaling. This characterisation leads to a method for computing strong classification, as described in Section 5. In Section 6, the presented approach is evaluated with 18 benchmarks, which are derived from six real data sets.[1]

[1] Because of the space restrictions, it was not possible to include all the proofs. See http://ucc.insight-centre.org/nwilson/Rescale_Invar_SVMClass_Longer.pdf for the missing proofs. The longer document also contains a glossary of symbols.

Figure 1: (a) The input samples discussed in Example 1 are shown as black (positive class) and white (negative class) circles. (b) The shaded region shows G(X), with every element of the line segment between (3, 2) and (2, 3) being rescale-optimal in G(X).

2 Standard SVM for Binary Classification

In this section, we introduce some notation, and express standard SVM using that notation, along with some relevant results. This enables easy generalisation to the rescale-invariant case. In this paper, as an initial step, we just consider the case when the training set is consistent (i.e., the instances are linearly separable).

We define the input set X for a binary classification task to be a set of samples where each sample is characterised by n features. A sample is expressed as a pair (x, y), where x ∈ IR^n is the feature vector (i.e., x(k) being the score for x regarding the kth feature[2]), and y ∈ {+1, −1} indicates the class label for that sample. So, X ⊆ IR^n × {+1, −1}. The input set can also be represented by two disjoint sets:

  X+ = {x+ : (x+, +1) ∈ X}
  X− = {x− : (x−, −1) ∈ X}

We say that X is non-trivial if both X+ and X− are non-empty. We define the following terms for any X ⊆ IR^n × {+1, −1}, which are used throughout the paper, and illustrated in Example 1 below.

  Λ(X) = { (x+ − x−)/2 : x+ ∈ X+, x− ∈ X− }
  I(X) = {1, . . . , |X+| |X−|}
  G(X) = {w ∈ IR^n : ∀λ ∈ Λ(X), w · λ ≥ 1}
  MP_w = min_{x+ ∈ X+} w · x+
  MN_w = min_{x− ∈ X−} w · (−x−)

The dot product u · v is equal to Σ_{j=1}^{n} u(j)v(j) for vectors u, v ∈ IR^n.

[2] Features are assumed to be numeric. However, for ordinal features each value can be replaced by a number, maintaining the order of values. For categorical features one might use the one-hot encoding (a.k.a. 1-of-k coding scheme) to convert a feature with k categories to k Boolean features.

Example 1. Suppose that n = 2, X+ = {(−1, 5), (5, −1), (7/5, 7/5)} and X− = {(1, 1)}, with the points marked in Figure 1(a). Then, Λ(X) is {(−1, 2), (2, −1), (1/5, 1/5)}, I(X) = {1, 2, 3}, and G(X) is the shaded region in Figure 1(b). For w = (2.5, 2.5) (which is in G(X)), MP_w and MN_w are 7 and −5 respectively.
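To make the notation concrete, the following short Python sketch (our own illustration; the variable names are not from the paper) computes Λ(X), tests membership of a vector in G(X), and evaluates MP_w and MN_w on the data of Example 1.

```python
import numpy as np

# Example 1 data
X_pos = np.array([[-1.0, 5.0], [5.0, -1.0], [1.4, 1.4]])   # X+
X_neg = np.array([[1.0, 1.0]])                              # X-

# Lambda(X) = { (x+ - x-)/2 : x+ in X+, x- in X- }
Lam = np.array([(xp - xm) / 2.0 for xp in X_pos for xm in X_neg])

def in_G(w, tol=1e-9):
    """w is in G(X) iff w . lam >= 1 for every lam in Lambda(X)."""
    return bool(np.all(Lam @ w >= 1.0 - tol))

def MP(w):
    return float(np.min(X_pos @ w))        # min over x+ of w . x+

def MN(w):
    return float(np.min(-X_neg @ w))       # min over x- of w . (-x-)

w = np.array([2.5, 2.5])
print(Lam)                 # [[-1, 2], [2, -1], [0.2, 0.2]]
print(in_G(w))             # True
print(MP(w), MN(w))        # 7.0 and -5.0, as in Example 1
```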
By assuming a linear relationship between the feature vector and the class label, each input point (x, y) ∈ X expresses a linear constraint y(w · x + b) ≥ 1 with an unknown weight vector w ∈ IR^n and intercept term b ∈ IR. Thus, the feasible set C(X) is defined as:

  C(X) = {(w, b) ∈ IR^n × IR : ∀(x, y) ∈ X, y(w · x + b) ≥ 1}

The linearity assumption of the model is less restrictive than it sounds; for instance, we could form additional features representing, e.g., pairwise products of the basic features, transforming x from IR^n to a bigger space (say H). However, in this paper, we assume this transformation has been explicitly defined; we do not consider making use of (non-linear) kernels [Aiserman et al., 1964]. For certain problems the linear kernel works sufficiently well, for instance, when the number of features is large [Hsu et al., 2003] (Appendix C).

The principal idea in SVM [Cortes and Vapnik, 1995] is that from the feasible set C(X) a pair (w, b) is chosen that maximises the margin; this corresponds to the situation when w has the minimum (Euclidean) norm in π(C(X)), where π is the projection function π : IR^n × IR → IR^n given by π(w, b) = w. We use ∥w∥ as the notation for the Euclidean norm. We will see in Proposition 1 below that G(X) = π(C(X)). This fact allows us to make use of general mathematical results from [Wilson and Montazery, 2016a], which considers rescaling in preference learning.

Proposition 1. Consider any finite and non-trivial X ⊆ IR^n × {+1, −1} and w ∈ IR^n. Then, w ∈ π(C(X)) if and only if w ∈ G(X). Thus, π(C(X)) = G(X).

As is well known, the solution that is picked by SVM from C(X) is unique (see e.g., [Burges and Crisp, 1999]). The following theorem restates this fact, making use of our notation.

Theorem 2. Consider any non-trivial finite X ⊆ IR^n × {+1, −1}. If C(X) is non-empty then there exists a unique element (w, b) ∈ C(X) such that w has minimal norm in π(C(X)). For that unique element, b = 1 − MP_w = −1 + MN_w.

In Example 1, (2.5, 2.5) clearly is the unique element with minimal norm in G(X), and the corresponding b is −6 (= 1 − MP = −1 + MN). Thus, its associated hyperplane 2.5x₁ + 2.5x₂ − 6 = 0, the dotted line in Figure 1(a), produces the maximum margin.

Let us denote the solution of SVM, which by Theorem 2 is unique, by (w*, b*), where b* = 1 − MP_{w*} = −1 + MN_{w*}. Thereafter, a feature vector α ∈ IR^n with unknown class label is classified with the positive (+1) class label if w* · α + b* ≥ 0, and with the negative (−1) class label otherwise. From now on, we just work with the positive class; the results can be easily applied for the negative class as well. Theorem 2 leads easily to the following characterisation of positive classification, which is of a form that makes our extension in the following sections more straightforward.

Proposition 3. Consider any non-trivial finite X ⊆ IR^n × {+1, −1} and any α ∈ IR^n. Then, vector α is positively classified if and only if there exists x+ ∈ X+ such that w* · (x+ − α) ≤ 1.

Proof: Vector α is positively classified if and only if w* · α + b* ≥ 0, which is, by Theorem 2, iff w* · α + 1 − MP_{w*} ≥ 0, i.e., MP_{w*} ≤ w* · α + 1. This holds, from the definition of MP_{w*}, if and only if min_{x+ ∈ X+} w* · x+ ≤ w* · α + 1, which is if and only if there exists x+ ∈ X+ such that w* · x+ ≤ w* · α + 1. This immediately leads to the result. □
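As a sanity check of Theorem 2 and Proposition 3 on Example 1, the sketch below solves the hard-margin problem directly as a quadratic program. The use of cvxpy is our choice here, not something prescribed by the paper; any QP solver would do.

```python
import numpy as np
import cvxpy as cp

# Example 1: three positive and one negative training point
X = np.array([[-1.0, 5.0], [5.0, -1.0], [1.4, 1.4], [1.0, 1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0])

# Hard-margin SVM: minimise ||w||^2 subject to y_i (w . x_i + b) >= 1
w, b = cp.Variable(2), cp.Variable()
prob = cp.Problem(cp.Minimize(cp.sum_squares(w)),
                  [cp.multiply(y, X @ w + b) >= 1])
prob.solve()
w_star, b_star = w.value, b.value
print(w_star, b_star)        # approx. (2.5, 2.5) and -6, as in the text

# Classify alpha = (-1, 4) in two equivalent ways
alpha = np.array([-1.0, 4.0])
print(w_star @ alpha + b_star >= 0)                          # direct rule: positive
X_pos = X[y > 0]
print(np.any(X_pos @ w_star - w_star @ alpha <= 1 + 1e-6))   # Proposition 3: exists x+ with w*.(x+ - alpha) <= 1
```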
3 Rescaling SVM

In this section we consider how performing certain affine transformations, i.e., translations and rescalings, on the domain of each feature affects the result of SVM. We define a vector as being strongly positively classified if it is positively classified under all such affine transformations of each feature domain.

It is a known fact that the maximum-margin hyperplane is essentially unaffected by a translation of the feature space (see e.g., [Meila, 2003]); i.e., by moving the origin, the normal vector to the separating hyperplane, and hence the result of SVM, does not change. Lemma 4 states this formally.

Lemma 4. Consider any finite and non-trivial X ⊆ IR^n × {+1, −1}, any α, δ ∈ IR^n, and let X_δ = {(x + δ, y) : (x, y) ∈ X}. Then, α is positively classified with respect to X if and only if the vector α + δ is positively classified with respect to X_δ.

In contrast with translations, changing the scales of features may significantly affect the result of SVM. Firstly, we define u ∘ v to be the vector in IR^n given by pointwise multiplication of u, v ∈ IR^n, and thus, for all j = 1, . . . , n, (u ∘ v)(j) = u(j)v(j). The operation ∘ is commutative, associative and distributes over addition of vectors, and satisfies the property (u ∘ v) · w = v · (u ∘ w), for any u, v, w ∈ IR^n. Also, let IR^n_+ be the set of strictly positive elements of IR^n, i.e., v ∈ IR^n such that v(j) > 0 for all j = 1, . . . , n.

Now, consider the effect of a rescaling τ ∈ IR^n_+, so that a feature vector x ∈ IR^n is transformed into x ∘ τ. Therefore, X becomes X_τ = {(x ∘ τ, y) : (x, y) ∈ X}. We write the element of G(X_τ) with minimal norm as w*_τ.

Example 2. If we rescale X in Example 1 by τ = (5, 1) then X+_τ will be {(−5, 5), (25, −1), (7, 7/5)}, X−_τ becomes {(5, 1)}, and consequently, Λ(X_τ) = {(−5, 2), (10, −1), (1, 1/5)}. It can be shown that w*_τ equals (3/5, 2) and b*_τ = −6. Now, let α = (−1, 4) and so α ∘ τ = (−5, 4). Clearly, when there is no scaling, α is positively classified since (2.5, 2.5) · (−1, 4) − 6 = 1.5 > 0, but under scaling τ, α ∘ τ is negatively classified because (3/5, 2) · (−5, 4) − 6 = −1 < 0.

For a rescaling vector τ ∈ IR^n_+, we say that α ∈ IR^n is positively classified under rescaling τ iff α ∘ τ is positively classified with respect to X_τ, which, by using Proposition 3, is if and only if there exists x+_τ ∈ X+_τ such that w*_τ · (x+_τ − (α ∘ τ)) ≤ 1. This holds iff there exists x+ ∈ X+ such that w*_τ · ((x+ ∘ τ) − (α ∘ τ)) ≤ 1, i.e., (w*_τ ∘ τ) · (x+ − α) ≤ 1.

Let us say that α ∈ IR^n is strongly positively classified if and only if it is positively classified under any rescaling τ ∈ IR^n_+ and any shift δ ∈ IR^n. This is if and only if it is positively classified under any rescaling and no shift (i.e., a shift of δ = 0, the zero vector).

Definition 1. For α ∈ IR^n, we define Y_X(α) = 1 if and only if α is strongly positively classified; i.e., for all τ ∈ IR^n_+ there exists x+ ∈ X+ such that (w*_τ ∘ τ) · (x+ − α) ≤ 1.

4 Characterisations Using Rescale Optimality

Here we borrow the notion of rescale-optimality from [Wilson and Montazery, 2016a], and make use of general mathematical results from there. This leads to the computational technique in Section 5 for testing if a vector is strongly positively classified.

We define RO(X) to be {w*_τ ∘ τ : τ ∈ IR^n_+}. Hence, it is clear, from Definition 1, that Y_X(α) = 1 if and only if for all w ∈ RO(X) there exists x+ ∈ X+ such that w · (x+ − α) ≤ 1.

Definition 2 (rescale-optimal). For any non-trivial finite X ⊆ IR^n × {+1, −1}, and w ∈ G(X), let us say that w is rescale-optimal in G(X) if there exists τ ∈ IR^n_+ such that for all u ∈ G(X), ∥u ∘ τ∥² ≥ ∥w ∘ τ∥².

It can be seen intuitively that every element of the line segment between (3, 2) and (2, 3) in Figure 1(b) is rescale-optimal; if τ(1) > τ(2) (i.e., with the ratio τ(1)/τ(2) being increased) then w*_τ ∘ τ moves from (2.5, 2.5) towards (3, 2). Similarly, increasing the ratio τ(2)/τ(1) from 1 moves w*_τ ∘ τ from (2.5, 2.5) towards (2, 3). For instance, in Example 2, τ(1) = 5τ(2) leads to w*_τ ∘ τ = (3, 2).
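The following sketch (our own illustration, not code from the paper) re-solves the hard-margin problem of Example 1 under a sweep of rescalings τ = (t, 1) and records w*_τ ∘ τ. The recorded points trace out part of RO(X), approaching (3, 2) or (2, 3) as the ratio grows or shrinks, and the case t = 5 reproduces Example 2, including the flipped classification of α = (−1, 4).

```python
import numpy as np
import cvxpy as cp

X = np.array([[-1.0, 5.0], [5.0, -1.0], [1.4, 1.4], [1.0, 1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0])
alpha = np.array([-1.0, 4.0])

def hard_margin(Xs):
    """Minimal-norm (w, b) feasible for the hard-margin constraints (as in Theorem 2)."""
    w, b = cp.Variable(2), cp.Variable()
    cp.Problem(cp.Minimize(cp.sum_squares(w)),
               [cp.multiply(y, Xs @ w + b) >= 1]).solve()
    return w.value, b.value

for t in [0.2, 0.5, 1.0, 2.0, 5.0, 20.0]:
    tau = np.array([t, 1.0])
    w_tau, b_tau = hard_margin(X * tau)            # SVM on the rescaled data X_tau
    print(t, np.round(w_tau * tau, 3),             # w*_tau o tau: moves along the segment between (2, 3) and (3, 2)
          "alpha positive?", w_tau @ (alpha * tau) + b_tau >= 0)
# t = 5 gives w*_tau = (0.6, 2), b*_tau = -6, and alpha is classified negative, as in Example 2.
```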
The following proposition, which follows immediately from Proposition 1 in [Wilson and Montazery, 2016a], states that the set of all rescale-optimal elements of G(X) is RO(X). As a result of this equivalence, we can say that Y_X(α) = 1 if and only if for every rescale-optimal element w in G(X), there exists x+ ∈ X+ such that w · (x+ − α) ≤ 1.

Proposition 5. Consider any non-trivial finite X ⊆ IR^n × {+1, −1}, and any w ∈ IR^n. Then, w is rescale-optimal in G(X) if and only if w is in RO(X).

In Section 4.2 below we characterise the rescale-optimal elements, which leads to a computational procedure for strong classification. First, in Section 4.1, we characterise the situations in which rescaling the values of the features makes no difference to the result of the classification, in which case strong classification is the same as the standard classification.

4.1 Determining Invariance to Rescaling

Theorem 6 below shows that rescaling the feature vectors makes no difference to the classification when there is a unique rescale-optimal vector in G(X) (and that vector thus has minimal norm in G(X)).

Theorem 6. Consider any non-trivial finite X ⊆ IR^n × {+1, −1}. Let Pos(X) be the set of all α ∈ IR^n that are positively classified, and let SPos(X) be the set of α that are strongly positively classified. Then Pos(X) = SPos(X) if and only if there exists a unique rescale-optimal element in G(X), i.e., RO(X) is a singleton set.

Proof: For w ∈ IR^n, let Pos_w be the set of all α ∈ IR^n such that there exists x+ ∈ X+ with w · (x+ − α) ≤ 1. Then, by Proposition 3, Pos(X) = Pos_{w*}, and using Proposition 5, SPos(X) is the intersection of Pos_w over all rescale-optimal w in G(X). Note that w* is rescale-optimal in G(X) (using the identity rescaling). Thus, if there is a single rescale-optimal w then w = w* and so Pos(X) = SPos(X). To prove the converse, suppose that Pos(X) = SPos(X), which implies that for all rescale-optimal w in G(X), Pos_w contains Pos_{w*}. Now, we can write Pos_w as {α ∈ IR^n : α · w ≥ MP_w − 1}, and thus Pos_w is a half-space with normal vector w. Hence, since Pos_w contains Pos_{w*}, there exists a real scalar q > 0 with w = q w*. Then for any τ ∈ IR^n_+, ∥w ∘ τ∥² = q² ∥w* ∘ τ∥². By Definition 2, q = 1, since both w and w* are rescale-optimal, and hence w = w*. This shows that there is a unique rescale-optimal element in G(X). □

Definition 3 (pointwise dominance). For u, w ∈ IR^n, w pointwise dominates u if u ≠ w and for all j ∈ {1, . . . , n}, either 0 ≤ w(j) ≤ u(j) or 0 ≥ w(j) ≥ u(j) (i.e., w(j) is closer to zero than u(j)).

Example 3. Consider modifying Example 1 by just removing (7/5, 7/5) from X+. Then, Λ(X) = {(−1, 2), (2, −1)}, and G(X) becomes the intersection of the half-spaces −x₁ + 2x₂ ≥ 1 and 2x₁ − x₂ ≥ 1, with a single extremal point (1, 1). This point pointwise dominates every other element in G(X) (see Figure 1(b)). Pointwise dominating every other element, from Theorem 7 below, means that (1, 1) will be the unique element of G(X) that is rescale-optimal. Using Theorem 6, this implies that the classification is invariant to rescaling (in this example).

Theorem 7 below characterises when there is a unique rescale-optimal element, and Corollary 8 leads to a polynomial algorithm to determine that unique element. These results follow from Theorem 2 and Corollary 1 in [Wilson and Montazery, 2016a; 2016b].

Theorem 7. Consider any non-trivial finite X ⊆ IR^n × {+1, −1} and any w ∈ G(X). Then, w is the unique element of G(X) that is rescale-optimal if and only if w pointwise dominates every element in G(X) \ {w}.

Corollary 8. Consider any non-trivial finite X ⊆ IR^n × {+1, −1} and any v ∈ G(X). Define w ∈ IR^n as follows. For each j ∈ {1, . . . , n}: (i) If v(j) = 0 then define w(j) = 0. (ii) If v(j) > 0 then define w(j) = inf {u(j) : u ∈ G(X), u(j) ≥ 0}. (iii) If v(j) < 0 then define w(j) = sup {u(j) : u ∈ G(X), u(j) ≤ 0}. If w ∈ G(X) then w is uniquely rescale-optimal in G(X). Also, there exists a uniquely rescale-optimal element in G(X) if and only if w ∈ G(X).
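Corollary 8 suggests a direct computational test: each w(j) is the optimal value of a small linear program over G(X). The sketch below is our own rendering using scipy.optimize.linprog (not code supplied by the paper). Applied to Example 1, the constructed w = (2, 2) falls outside G(X), so rescaling can matter; applied to Example 3, w = (1, 1) lies in G(X), so classification is rescale-invariant.

```python
import numpy as np
from scipy.optimize import linprog

def corollary8_w(Lam, v):
    """Build the candidate w of Corollary 8, where G(X) = {u : Lam @ u >= 1} and v is in G(X)."""
    n = Lam.shape[1]
    A_ub, b_ub = -Lam, -np.ones(len(Lam))        # Lam @ u >= 1  <=>  -Lam @ u <= -1
    w = np.zeros(n)
    for j in range(n):
        if v[j] == 0:
            continue                             # case (i): w(j) = 0
        c = np.zeros(n)
        c[j] = 1.0 if v[j] > 0 else -1.0         # minimise u(j)   (maximise it if v(j) < 0)
        bounds = [(None, None)] * n
        bounds[j] = (0, None) if v[j] > 0 else (None, 0)
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        w[j] = res.x[j]
    return w

def unique_rescale_optimal(Lam, v, tol=1e-9):
    w = corollary8_w(Lam, v)
    return w, bool(np.all(Lam @ w >= 1 - tol))   # uniqueness holds iff w is in G(X)

Lam_ex1 = np.array([[-1.0, 2.0], [2.0, -1.0], [0.2, 0.2]])
Lam_ex3 = np.array([[-1.0, 2.0], [2.0, -1.0]])
print(unique_rescale_optimal(Lam_ex1, np.array([2.5, 2.5])))   # ((2, 2), False): rescaling can matter
print(unique_rescale_optimal(Lam_ex3, np.array([1.0, 1.0])))   # ((1, 1), True):  classification is invariant
```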
4.2 Characterising Rescale-Optimality

Here, we characterise the set of rescale-optimal elements in G(X), which leads to a computational method for strong classification.

Definition 4 (agreeing on signs). For w, µ ∈ IR^n, we say that w and µ agree on signs if for all j = 1, . . . , n, (i) w(j) = 0 ⟺ µ(j) = 0; (ii) w(j) > 0 ⟺ µ(j) > 0; and thus also (iii) w(j) < 0 ⟺ µ(j) < 0.

Theorem 9. Consider any finite non-trivial X ⊆ IR^n × {+1, −1}, and any non-zero vector w ∈ G(X). Then, w is rescale-optimal in G(X) if and only if there exist µ ∈ IR^n and non-negative reals r_i, for each i ∈ I(X), such that (a) µ agrees on signs with w; (b) µ = Σ_{i ∈ I(X)} r_i λ_i; and (c), for each i ∈ I(X), either r_i = 0 or λ_i · w = 1, where λ_i is the ith element of Λ(X).

Proof: Theorem 5 of [Wilson and Montazery, 2016a] implies that w is rescale-optimal in G(X) if and only if there exist µ ∈ IR^n and non-negative reals r_i (for each i ∈ I(X)) such that conditions (a), (b), (c) and (d) hold, where (a), (b) and (c) are the conditions above, and (d) is the condition that w · µ = 1.
(⇒): this follows immediately from Theorem 5 of [Wilson and Montazery, 2016a].
(⇐): Assume that there exist µ ∈ IR^n and non-negative reals r_i (for each i ∈ I(X)) such that conditions (a), (b) and (c) hold. w is not the zero vector, since w ∈ G(X). Condition (a), that µ agrees on signs with w, implies that µ · w > 0. Let µ′ = µ / (µ · w), and for each i ∈ I(X) define r′_i = r_i / (µ · w). Then, (a) µ′ agrees on signs with w, (b) µ′ = Σ_{i ∈ I(X)} r′_i λ_i, (c) for each i ∈ I(X) either r′_i = 0 or λ_i · w = 1, and (d) w · µ′ = 1. Theorem 5 of [Wilson and Montazery, 2016a] then implies that w is rescale-optimal in G(X). □

5 Computation of Strong Classification

Here we express whether an instance is strongly positively classified in terms of a set of constraints. For a set X ⊆ IR^n × {+1, −1} and arbitrary α ∈ IR^n, we would like to determine if Y_X(α) = 1, i.e., if α is strongly positively classified. We can infer from Propositions 3 and 5 that Y_X(α) ≠ 1 if and only if there exists a rescale-optimal element w in G(X) such that for all x+ ∈ X+, w · (x+ − α) > 1. By taking into account Theorem 9 for characterising rescale-optimality, we obtain a set of inequalities to determine if Y_X(α) = 1 as follows. Let λ_i be the ith element of Λ(X) where i ∈ I(X). Now, Y_X(α) ≠ 1 if and only if there exist w ∈ IR^n and µ ∈ IR^n, and non-negative reals r_i for each i ∈ I(X), such that:

  ∀x+ ∈ X+: w · (x+ − α) > 1;
  ∀i ∈ I(X): w · λ_i ≥ 1 (i.e., w ∈ G(X));
  ∀i ∈ I(X): w · λ_i = 1 or r_i = 0;
  ∀j = 1, . . . , n: w(j) = 0 ⟺ µ(j) = 0, and w(j) > 0 ⟺ µ(j) > 0 (i.e., agreeing on signs);
  µ = Σ_{i ∈ I(X)} r_i λ_i.

In CPLEX, a disjunctive constraint such as [w · λ_i − 1 = 0 or r_i = 0] can be expressed as (w · λ_i − 1 == 0) + (r_i == 0) ≥ 1 (each logical proposition is treated as an integer: 0 for false and 1 for true). The number of constraints here is 2|X+||X−| + |X+| + n + 2.
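The paper implements this system with CPLEX's logical constraints; as a rough sketch of the same feasibility check, the mixed-integer formulation below uses PuLP with big-M and ε constants (M, EPS and the use of PuLP are our modelling assumptions, not taken from the paper; the strict inequalities are only approximated). The system is feasible exactly when α is not strongly positive, so an infeasible status indicates Y_X(α) = 1.

```python
import numpy as np
import pulp

# Example 1 data
X_pos = np.array([[-1.0, 5.0], [5.0, -1.0], [1.4, 1.4]])
X_neg = np.array([[1.0, 1.0]])
Lam = np.array([(xp - xm) / 2.0 for xp in X_pos for xm in X_neg])
n, m = 2, len(Lam)
M, EPS = 100.0, 1e-2     # big-M bound and strict-inequality tolerance (assumed values)

def not_strongly_positive(alpha):
    """Feasible  <=>  Y_X(alpha) != 1, i.e. alpha is NOT strongly positively classified."""
    prob = pulp.LpProblem("strong_classification", pulp.LpMinimize)
    w  = [pulp.LpVariable(f"w{j}",  lowBound=-M, upBound=M) for j in range(n)]
    mu = [pulp.LpVariable(f"mu{j}", lowBound=-M, upBound=M) for j in range(n)]
    r  = [pulp.LpVariable(f"r{i}",  lowBound=0,  upBound=M) for i in range(m)]
    d  = [pulp.LpVariable(f"d{i}", cat="Binary") for i in range(m)]   # d_i = 1: lambda_i . w = 1
    p  = [pulp.LpVariable(f"p{j}", cat="Binary") for j in range(n)]   # p_j = 1: w(j), mu(j) > 0
    q  = [pulp.LpVariable(f"q{j}", cat="Binary") for j in range(n)]   # q_j = 1: w(j), mu(j) < 0
    prob += pulp.lpSum(r)                              # dummy objective; only feasibility matters
    for xp in X_pos:                                   # w . (x+ - alpha) > 1  (approximated)
        prob += pulp.lpSum(w[j] * (xp[j] - alpha[j]) for j in range(n)) >= 1 + EPS
    for i in range(m):
        dot = pulp.lpSum(w[j] * Lam[i][j] for j in range(n))
        prob += dot >= 1                               # w in G(X)
        prob += dot <= 1 + M * (1 - d[i])              # d_i = 1  =>  lambda_i . w = 1
        prob += r[i] <= M * d[i]                       # d_i = 0  =>  r_i = 0
    for j in range(n):
        prob += p[j] + q[j] <= 1
        for v in (w[j], mu[j]):                        # mu agrees on signs with w
            prob += v >= EPS * p[j] - M * q[j]
            prob += v <= -EPS * q[j] + M * p[j]
        prob += mu[j] == pulp.lpSum(r[i] * Lam[i][j] for i in range(m))
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return pulp.LpStatus[prob.status] == "Optimal"

alpha = np.array([-1.0, 4.0])                          # the query point of Example 2
print(not_strongly_positive(alpha))                    # True: its label depends on the rescaling
```

For contrast, with α = (3, 3) the first constraint for x+ = (7/5, 7/5) requires −(8/5)(w(1) + w(2)) > 1, while w ∈ G(X) forces w(1) + w(2) ≥ 5; the program is therefore infeasible and that point is strongly positive.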
6 Experimental Results

The experiments make use of the UCI machine learning repository,[3] from which six real data sets are chosen, namely Breast Cancer Wisconsin [Street et al., 1993], Pima Indians Diabetes, Blood Transfusion Service Center [Yeh et al., 2009], Indian Liver Patient, Fertility [Méndez et al., 2012], and Banknote Authentication. In fact, we include data sets that meet three criteria: (i) the data set has only two classes, (ii) the data set consists of only numeric features, and (iii) the number of features is at most 10. The first two criteria ensure that the data set complies with the proposed method in this paper, and the third made the computation especially fast. Thereafter, 18 benchmarks are derived from those data sets. Table 1 contains the specifications of these benchmarks.

[3] http://archive.ics.uci.edu/ml/

  Bench.   |X|   |I(X)|   Data Set                    Features (n)
  1.       16    63       Breast Cancer Wisconsin     9
  2.       20    84
  3.       25    126
  4.       22    72       Pima Indians Diabetes       8
  5.       28    171
  6.       18    65
  7.       20    64       Blood Transfusion           4
  8.       25    46
  9.       25    144
  10.      25    84       Indian Liver Patient        10
  11.      15    36
  12.      21    108
  13.      30    104      Fertility                   9
  14.      20    81
  15.      15    14
  16.      25    156      Banknote Authentication     4
  17.      20    100
  18.      15    50

Table 1: The specifications of the 18 benchmarks which are used for the experiments; these are derived from six real data sets.

For constructing a benchmark, a random selector creates two disjoint sets from a data set, one for learning (i.e., X) and one for testing. However, a pre-processing phase deletes some elements of the learning set in order to make it consistent (i.e., C(X) ≠ ∅), since in this paper we consider only the consistent case.

For each instance α in the testing set, the rescaling method determines, based on the learning set X, whether (i) it is strongly positive (Y_X(α) = 1), or (ii) it is strongly negative (Y_X(α) = −1), or (iii) it is neutral (Y_X(α) ≠ 1 and Y_X(α) ≠ −1). The number and percentage of neutral instances for each benchmark can be found in Table 2.

  Benchmark   Neutral #   Total #   Ratio (%)
  1.          261         683       38
  2.          53          100       53
  3.          53          100       53
  4.          186         200       93
  5.          168         200       84
  6.          91          100       91
  7.          35          100       35
  8.          6           100       6
  9.          9           100       9
  10.         72          100       72
  11.         93          100       93
  12.         191         200       95
  13.         46          70        66
  14.         56          81        69
  15.         79          85        93
  16.         14          100       14
  17.         12          100       12
  18.         30          100       30

Table 2: The number and ratio of instances labelled as neutral by the rescaling method are shown for each benchmark.

The ratio of neutral instances for the benchmarks (i.e., those instances being classified differently under different rescalings) varies from 6% to 95%, with a mean of 55.9%. For more than half of the instances, rescaling the feature values can change the result of the SVM classification. That points out the unreliability of standard SVM, especially for the larger ratios such as 95%, with respect to its sensitivity to the way the features are scaled.

In a testing set, among the non-neutral instances, we can also count:
1. The number of negative instances improperly strongly classified as positive (False Positive).
2. The number of positive instances properly strongly classified as positive (True Positive).
3. The number of positive instances improperly strongly classified as negative (False Negative).
4. The number of negative instances properly strongly classified as negative (True Negative).

Conventional SVM can also predict a class label for each instance, positive or negative. As a result, we have False Positive (FP), True Positive (TP), False Negative (FN) and True Negative (TN) for SVM as well. Tables 3 and 4 compare these measures between the two methods. Note that the value of FP + TN (resp. FN + TP) for the rescaling method is not (necessarily) equal to the total number of testing instances with real negative (resp. positive) class label, because some instances are classified as neutral.

In Tables 3 and 4, we also compare the Positive Predictive Value (PPV) and Negative Predictive Value (NPV) of the two methods. PPV is the fraction of truly positive instances among positively classified ones. Similarly, NPV is the fraction of truly negative instances among negatively classified ones. Thus, PPV equals TP / (FP + TP) and NPV equals TN / (FN + TN). The measures PPV and NPV are widely used in medical contexts in order to validate a disease diagnostic test [Parikh et al., 2008]. To illustrate, PPV expresses, if a patient's test is positive, how likely it is that he really has the disease. Similarly, NPV expresses, if a patient's test is negative, how likely it is that she really doesn't have the disease.
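The following small helper (ours, for illustration; the function and variable names are hypothetical) shows how the counts and the PPV/NPV figures reported in Tables 3 and 4 are computed from true labels and three-way predictions, where 0 stands for neutral.

```python
def ppv_npv(y_true, y_pred):
    """y_true in {+1, -1}; y_pred in {+1, -1, 0}, with 0 meaning neutral."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == +1 and t == +1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == +1 and t == -1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if p == -1 and t == -1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p == -1 and t == +1)
    neutral = sum(1 for p in y_pred if p == 0)
    ppv = tp / (fp + tp) if fp + tp else None   # undefined ("-" in Table 3) when nothing is classified positive
    npv = tn / (fn + tn) if fn + tn else None   # undefined ("-" in Table 4) when nothing is classified negative
    return {"FP": fp, "TP": tp, "FN": fn, "TN": tn, "neutral": neutral, "PPV": ppv, "NPV": npv}

# Toy usage with made-up labels:
print(ppv_npv([+1, -1, +1, -1, +1], [+1, -1, 0, +1, +1]))
```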
           SVM                       Rescaling Method
  Bench.   FP    TP    PPV (%)       FP    TP    PPV (%)
  1.       15    439   97            0     333   100
  2.       3     71    96            0     46    100
  3.       9     64    88            0     47    100
  4.       25    48    66            1     1     50
  5.       38    53    58            1     2     67
  6.       8     15    65            0     0     -
  7.       25    6     19            5     4     44
  8.       7     4     43            4     3     36
  9.       27    9     25            25    8     24
  10.      15    13    46            0     2     100
  11.      35    15    30            1     0     0
  12.      44    23    34            3     4     57
  13.      15    2     12            1     0     0
  14.      19    1     5             3     1     25
  15.      11    3     21            0     0     -
  16.      3     45    94            3     38    93
  17.      1     42    98            1     38    97
  18.      3     44    94            0     32    100
  Avg.                 56.56                     62.06

Table 3: A comparison, using the 18 benchmarks, between the PPV of SVM and the rescaling method. The undefined values, labelled "-", are excluded from the mean.

           SVM                       Rescaling Method
  Bench.   FN    TN    NPV (%)       FN    TN    NPV (%)
  1.       10    219   96            3     86    97
  2.       1     25    96            0     1     100
  3.       0     27    100           0     0     -
  4.       22    105   83            0     12    100
  5.       14    95    87            0     29    100
  6.       21    56    73            0     9     100
  7.       17    52    75            11    45    80
  8.       18    71    80            18    69    79
  9.       9     55    86            8     50    86
  10.      19    53    74            9     17    65
  11.      13    37    74            2     4     67
  12.      27    106   80            3     4     100
  13.      6     47    89            1     22    96
  14.      6     55    90            1     20    95
  15.      8     63    89            0     6     100
  16.      2     50    96            0     45    100
  17.      1     56    98            0     49    100
  18.      2     51    96            0     38    100
  Avg.                 86.00                     91.47

Table 4: A comparison, using the 18 benchmarks, between the NPV of SVM and the rescaling method. The undefined values, labelled "-", are excluded from the mean.

The results show an increase of 5.5% in PPV and NPV for the benchmarks on average. Note that our approach is not intended to be a competitor of SVM in terms of classification accuracy. What it does is to highlight certain instances where we can have greater confidence, with the other instances having reduced confidence because the result of the classification could be changed if a different scaling of the features were used. This is why PPV and NPV are appropriate measures for the experimental results.

The approach discussed in Section 5 was implemented using the solver CPLEX 12.6.2. Determining whether an instance is strongly positive, strongly negative or neutral takes less than a couple of seconds for any of the benchmarks, on a computer with a Core i7 2.60 GHz processor and 8 GB of RAM.

7 Discussion

The scaling of individual features, before applying an SVM method, can be subjective and arbitrary. However, the way features are scaled can make a difference; in fact, for a little more than half of the instances in our experiments, rescaling the feature values can change the result of the SVM classification. Based on this fact, we say an instance strongly belongs to a class if it belongs to that class for all rescalings of features.
This new definition boosts the confidence of labelling instances, by excluding those instances which are classified differently under different scalings. Building on the general mathematical results of [Wilson and Montazery, 2016a], we have developed a computational procedure that can test if an instance is strongly positive, i.e., labelled positive for every affine way of rescaling the values of each feature (and similarly for the negative case). We also have a polynomial algorithm for determining when rescaling makes no difference, i.e., when, for a given training set, the classification of any possible instance is not affected by rescaling. Our experiments also showed a slight improvement in the value of prediction, i.e., the likelihood that if an instance is classified as a particular class, it truly belongs to that class.

There are many potential future directions following from this work; for instance, extending the method to the case when there is inconsistency in the input data (i.e., soft-margin SVM); attempting to develop the computational method for certain kernels; performing multi-class classification by making use of repeated binary-class classification (e.g., using one-vs-all or one-vs-one approaches); and computing variations of the method where the user can limit the rescaling range for different features.

Acknowledgments

This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289. Thanks to the reviewers for their comments, which helped improve the final version of the paper.

References

[Aiserman et al., 1964] M.A. Aiserman, E.M. Braverman, and L.I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition. Avtomat. i Telemeh, 25:917–936, 1964.

[Aksoy and Haralick, 2001] Selim Aksoy and Robert M. Haralick. Feature normalization and likelihood-based similarity measures for image retrieval. Pattern Recognition Letters, 22(5):563–582, 2001.

[Ben-Hur and Weston, 2010] Asa Ben-Hur and Jason Weston. A user's guide to support vector machines. Data Mining Techniques for the Life Sciences, pages 223–239, 2010.

[Burges and Crisp, 1999] Christopher J.C. Burges and David J. Crisp. Uniqueness of the SVM solution. In Advances in Neural Information Processing Systems 12 (NIPS 1999, Denver, Colorado, USA, November 29 - December 4, 1999), volume 99, pages 223–229, 1999.

[Burges, 1998] Christopher J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.

[Byvatov and Schneider, 2002] Evgeny Byvatov and Gisbert Schneider. Support vector machine applications in bioinformatics. Applied Bioinformatics, 2(2):67–77, 2002.

[Cortes and Vapnik, 1995] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[Dorkó and Schmid, 2003] Gyuri Dorkó and Cordelia Schmid. Selection of scale-invariant parts for object class recognition. In 9th IEEE International Conference on Computer Vision (ICCV 2003), 14-17 October 2003, Nice, France, volume 1, pages 634–640, 2003.

[Hsu et al., 2003] Chih-Wei Hsu, Chih-Chung Chang, Chih-Jen Lin, et al. A practical guide to support vector classification, 2003.

[Jain and Dubes, 1988] Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Inc., 1988.

[Jebara and Shivaswamy, 2009] Tony Jebara and Pannagadatta K. Shivaswamy.
Relative margin machines. In Advances in Neural Information Processing Systems, pages 1481–1488, 2009.

[Meila, 2003] Marina Meila. Data centering in feature space. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics (AISTATS 2003), Key West, Florida, USA, January 3-6, 2003.

[Méndez et al., 2012] David G. Méndez, Jose L. Girela, Joaquin De Juan, M. Jose Gomez-Torres, and Magnus Johnsson. Predicting seminal quality with artificial intelligence methods. Expert Systems with Applications, 39(16):12564–12573, 2012.

[Parikh et al., 2008] Rajul Parikh, Annie Mathai, Shefali Parikh, G. Chandra Sekhar, Ravi Thomas, et al. Understanding and using sensitivity, specificity and predictive values. Indian Journal of Ophthalmology, 56(1):45, 2008.

[Stolcke et al., 2008] Andreas Stolcke, Sachin Kajarekar, and Luciana Ferrer. Nonparametric feature normalization for SVM-based speaker verification. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2008), March 30 - April 4, 2008, Caesars Palace, Las Vegas, Nevada, USA, pages 1577–1580. IEEE, 2008.

[Street et al., 1993] W. Nick Street, William H. Wolberg, and Olvi L. Mangasarian. Nuclear feature extraction for breast tumor diagnosis. In IS&T/SPIE's Symposium on Electronic Imaging: Science and Technology, pages 861–870. International Society for Optics and Photonics, 1993.

[Tax and Duin, 2000] D.M. Tax and Robert P.W. Duin. Feature scaling in support vector data descriptions. Learning from Imbalanced Datasets, pages 25–30, 2000.

[Tong and Koller, 2001] Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2(Nov):45–66, 2001.

[Wilson and Montazery, 2016a] Nic Wilson and Mojtaba Montazery. Preference inference through rescaling preference learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI 2016), New York, NY, USA, 9-15 July 2016, pages 2203–2209. IJCAI/AAAI Press, 2016.

[Wilson and Montazery, 2016b] Nic Wilson and Mojtaba Montazery. Preference Inference Through Rescaling Preference Learning (extended version of the IJCAI 2016 paper including proofs). Available at http://ucc.insight-centre.org/nwilson/Rescaling_Proofs.pdf, 2016.

[Xu et al., 2006] Linli Xu, Koby Crammer, and Dale Schuurmans. Robust support vector machine training via convex outlier ablation. In Proceedings of the Twenty-First National Conference on Artificial Intelligence and the Eighteenth Innovative Applications of Artificial Intelligence Conference (AAAI 2006), July 16-20, 2006, Boston, Massachusetts, USA.

[Yeh et al., 2009] I-Cheng Yeh, King-Jang Yang, and Tao-Ming Ting. Knowledge discovery on RFM model using Bernoulli sequence. Expert Systems with Applications, 36(3):5866–5871, 2009.