# Anomaly Ranking as Supervised Bipartite Ranking

Stéphan Clémençon (STEPHAN.CLEMENCON@TELECOM-PARISTECH.FR), LTCI UMR Telecom ParisTech/CNRS No. 5141, 46 rue Barrault, 75634 Paris Cedex, France

Sylvain Robbiano (SYLVAIN.ROBBIANO@GMAIL.COM), LTCI UMR Telecom ParisTech/CNRS No. 5141, 46 rue Barrault, 75634 Paris Cedex, France; CIMFAV-Facultad de Ingeniería, Universidad de Valparaíso, Valparaíso, Chile

*Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).*

## Abstract

The Mass Volume (MV) curve is a visual tool for evaluating the performance of a scoring function with regard to its capacity to rank data in the same order as the underlying density function. Anomaly ranking refers to the unsupervised learning task of building, from unlabeled data, a scoring function whose MV curve is as low as possible at every point. In this paper, it is proved that, when the data-generating probability distribution has compact support, anomaly ranking is equivalent to (supervised) bipartite ranking, where the goal is to discriminate between the underlying probability distribution and the uniform distribution with the same support. In this situation, the MV curve can be seen as a simple transform of the corresponding ROC curve. Exploiting this view, we then show how to use bipartite ranking algorithms, possibly combined with random sampling, to solve the MV curve minimization problem. Numerical experiments based on a variety of bipartite ranking algorithms well documented in the literature are displayed in order to illustrate the relevance of our approach.

## 1. Introduction

Motivated by a wide range of applications, such as the design of search engines in information retrieval, credit-risk screening in finance, or supervised anomaly detection in signal processing, the problem of learning how to rank data with ordinal labels has received a good deal of attention in machine learning in recent years; see (Duchi et al., 2010; Clémençon et al., 2008; Agarwal et al., 2005) among others. A wide variety of criteria have been considered so as to cast the task of ranking observations, in an order as close as possible to that induced by the ordinal output variable, as an M-estimation problem, including the (area under the) receiver operating characteristic curve (ROC curve for short) and the precision-recall curve in the bipartite situation (i.e. when the output variable is binary), and the normalized discounted cumulative gain criterion or the ROC manifold in the general multipartite framework. Many practical ranking algorithms, supported by sound theoretical results extending the probabilistic theory of pattern recognition, are now documented in the literature; see (Freund et al., 2003; Clémençon & Vayatis, 2009; Rakotomamonjy, 2004; Pahikkala et al., 2007) for instance. However, in many applications, which can be referred to as unsupervised anomaly/novelty detection and comprise the monitoring of complex systems such as aircraft engines, system management in data centers (cf. (Viswanathan et al., 2012)), network intrusion surveillance, and fraud detection, it would be desirable as well to rank multivariate data, so that the top-ranked observations are ideally the likeliest "outliers", in the absence of any output variable indicating the "degree of abnormality".
Throughout the article, this problem shall be termed anomaly ranking or anomaly scoring, insofar as the most natural way to define a preorder on a general feature space $\mathcal{X} \subset \mathbb{R}^d$, with $d \geq 1$, is to transport the natural order on the real half-line by means of a (measurable) scoring function $s : \mathcal{X} \to \mathbb{R}_+$. The ideal way of ordering all the elements of the feature space naturally corresponds to the reverse of the order induced by the (generally unknown) underlying density function. Because density estimation should be avoided in a high-dimensional setup, due to the curse of dimensionality, a performance criterion of functional nature, analogous to the ROC curve in the supervised framework, has recently been introduced in (Clémençon & Jakubowicz, 2013); it permits the evaluation of the accuracy of any ranking rule with regard to the objective pursued. It has been termed the Mass Volume curve (MV curve), and the lower the MV curve of a scoring function, the more accurate the ranking it induces. But, in contrast to the supervised framework, no algorithm is readily available to build a scoring function with a nearly optimal MV curve from (unlabeled) training data. It should also be recalled that minimum volume set estimation techniques (Scott & Nowak, 2006), originally designed to solve unsupervised anomaly detection problems, are not appropriate for anomaly ranking, cast as MV curve minimization, since they consist in recovering a single density level set, corresponding thus to a single point of the optimal MV curve (with the target mass level as abscissa). The major purpose of this paper is to highlight the connection between (supervised) bipartite ranking and anomaly ranking. Indeed, it is shown in the present article that, when the underlying probability distribution $F(dx)$ has compact support, included in $[0,1]^d$ say, the MV curve of any scoring function $s(x)$ and its ROC curve with regard to the bipartite ranking problem, where the positive distribution is the probability measure $F(dx)$ according to which the observations are drawn and the negative distribution is uniform on $[0,1]^d$, are symmetrical w.r.t. the first diagonal. Hence, minimization of the MV curve boils down to maximizing the corresponding ROC curve, a task which can be achieved by various algorithms based on two independent data samples, one drawn from $F(dx)$ and the other drawn from the uniform distribution. Building on this crucial observation, we first propose to hijack bipartite ranking algorithms by implementing them with the originally unlabeled sample as positive sample and a simulated negative sample made of i.i.d. data uniformly distributed on $[0,1]^d$. Incidentally, we point out that sampling data from a reference measure in order to reveal properties of the density under study by means of supervised techniques is well-known folklore in applied statistics: generalized association rules, for instance, are precisely based on this approach (see Chapter 14 in (Friedman et al., 2009)), and (Steinwart et al., 2005) proposed a method, also involving the simulation of uniformly distributed data, to turn (unsupervised) anomaly detection into a supervised binary classification problem. We next explain how to extend the TREERANK approach, originally introduced in (Clémençon & Vayatis, 2009), to the unsupervised framework.
Beyond the fact that it may produce interpretable ranking rules, visualizable by means of an oriented tree graphic, in contrast to other bipartite ranking algorithms, its implementation in the unsupervised context does not require sampling any extra data uniformly distributed over $[0,1]^d$. Numerical experiments have also been carried out in order to illustrate the performance of various bipartite ranking algorithms for anomaly ranking.

The rest of the article is structured as follows. In section 2, notations are first set out and key notions of the anomaly ranking problem are recalled, together with basic concepts of bipartite ranking and ROC analysis. Section 3 explains at length the connection between the supervised and unsupervised ranking problems in the compact support case, while section 4 shows how to extend the use of certain bipartite ranking algorithms to anomaly ranking from a practical perspective. Finally, numerical results based on synthetic/real datasets are displayed in section 5 for illustration purposes.

## 2. Background and Preliminaries

Here we essentially describe the issue of anomaly ranking and briefly recall the related key concepts of MV curve analysis. We also set out the notations that shall be needed throughout the paper.

### 2.1. The Statistical Framework

In the anomaly ranking problem, the goal pursued is to learn how to order observations by "degree of abnormality", on the basis of a training sample $X_1, \ldots, X_n$ made of i.i.d. copies of a random variable $X$, taking its values in a (possibly high-dimensional) space $\mathcal{X} \subset \mathbb{R}^d$ with $d \geq 1$ and distributed according to a continuous probability measure $F(dx) = f(x)dx$. The preorder on $\mathcal{X}$ is defined by means of a (measurable) scoring function $s : \mathcal{X} \to \mathbb{R}_+$:
$$\forall (x, x') \in \mathcal{X}^2, \quad x \preceq_s x' \iff s(x) \leq s(x').$$
We denote by $\mathcal{S}$ the set of all scoring functions on $\mathcal{X}$ that are integrable w.r.t. the Lebesgue measure on $\mathcal{X}$ (see the next subsection for the explanation of the integrability constraint). Given the nature of the problem, the optimal ranking is naturally that determined by the density function $f(x)$. The set $\mathcal{S}^* \subset \mathcal{S}$ of optimal scoring functions is made of the ($\lambda$-integrable) nonnegative strictly increasing transforms of the density function. We point out that the nature of the problem considered is very different from that of (nonparametric) density estimation: there is no need to estimate the local values taken by the density function, only the preorder it induces on $\mathcal{X}$ is of interest here. We also emphasize that in many applications, the very purpose of unsupervised anomaly detection is not to assign a label "normal" vs. "abnormal" to any new observation, by comparison with the vast majority of the data previously observed (i.e. the training data), but to rank any new set of observations (i.e. test data) by degree of abnormality. The form of the output, namely an ordered list, may greatly facilitate the work of human operators. For instance, in the context of Distributed Fleet Monitoring (DFM) for Flight Operational Quality Assurance (FOQA) programs, an anomaly scoring function (taking flight data and aircraft features as input variables in particular) could permit setting priorities and help optimize the work of FOQA analysts, who do not have time to look at the data for hundreds of thousands of flights: depending on operational constraints, the 10 most abnormal flights will be examined first, then the next 10 flights, etc. Finally, we underline that the framework we develop in this paper is fully nonparametric.
In particular, no parametric assumption on the tail behavior of the underlying multivariate distribution, permitting one to rank new observations according to the corresponding p-values, is made. The following assumptions shall be involved in the subsequent analysis.

**H1** The random variable $f(X)$ is continuous.

**H2** The density $f$ is bounded: $\sup_{x \in \mathcal{X}} f(x) < +\infty$.

Here we denote by $\lambda$ the Lebesgue measure on $\mathcal{X}$ and by $\mathbb{I}\{\mathcal{E}\}$ the indicator function of any event $\mathcal{E}$. The generalized inverse of any cdf $K(t)$ on $\mathbb{R}$ is denoted by $K^{-1}(u) = \inf\{t \in \mathbb{R} : K(t) \geq u\}$.

### 2.2. Mass Volume Curves

In (Clémençon & Jakubowicz, 2013), a natural way of measuring the ranking performance of a given scoring function $s \in \mathcal{S}$ in the unsupervised setting has been introduced (see Definition 2 therein). It consists of plotting its Mass Volume (MV) curve, namely the Probability Measure plot:
$$t > 0 \mapsto \left( \mathbb{P}\{s(X) \geq t\},\; \lambda(\{x \in \mathcal{X} : s(x) \geq t\}) \right). \qquad (1)$$
With $\Omega_{s,t} = \{x \in \mathcal{X} : s(x) \geq t\}$ for any $t \geq 0$, the MV curve is the parametrized curve $t > 0 \mapsto (F(\Omega_{s,t}), \lambda(\Omega_{s,t}))$. Observe that, since $s$ is supposed to be $\lambda$-integrable, the measure $\lambda(\Omega_{s,t}) \leq (\int_{u \in \mathcal{X}} s(u)\, du)/t$ is finite for any $t > 0$. Connecting points corresponding to possible jumps of the parametric curve, the curve can be seen as the plot of a continuous mapping $MV_s : \alpha \in (0,1) \mapsto MV_s(\alpha)$, starting at $(0,0)$. Denoting by $F_s(t)$ the cdf of the r.v. $s(X)$, at any point $\alpha$ where $F_s \circ F_s^{-1}(\alpha) = \alpha$, we have:
$$MV_s(\alpha) = \lambda\left( \{x \in \mathcal{X} : s(x) \geq F_s^{-1}(1 - \alpha)\} \right). \qquad (2)$$
Let $\alpha \in (0,1)$. Under Assumptions H1 and H2, the set $\Omega^*_\alpha = \{x \in \mathcal{X} : f(x) \geq F_f^{-1}(1-\alpha)\}$ is the unique solution of the minimum volume set problem
$$\min_{\Omega \in \mathcal{B}(\mathcal{X})} \lambda(\Omega) \quad \text{subject to } F(\Omega) \geq \alpha, \qquad (3)$$
where $\mathcal{B}(\mathcal{X})$ denotes the ensemble of all Borel subsets of $\mathcal{X}$. For small values of the mass level $\alpha$, minimum volume sets are expected to contain the modes of the distribution, whereas their complements correspond to abnormal observations when considering large values of $\alpha$. Refer to (Einmahl & Mason, 1992; Polonik, 1997) for an account of minimum volume set theory and to (Vert & Vert, 2006; Scott & Nowak, 2006) for related statistical learning results. As stated in Proposition 3 of (Clémençon & Jakubowicz, 2013), this implies that the MV curve of the scoring function $f(x)$, which plots the (minimum) volume $\lambda(\Omega^*_\alpha)$ against the mass $F(\Omega^*_\alpha) = \alpha$ and which shall be denoted by $MV^*$, is dominated by any other MV curve everywhere:
$$\forall s \in \mathcal{S}, \ \forall \alpha \in (0,1), \quad MV^*(\alpha) \leq MV_s(\alpha). \qquad (4)$$
It is noteworthy that $MV^*$ is a convex function and that MV curves are invariant under strictly increasing transforms of the scoring function. A list of properties of MV curves is given in Proposition 5 of (Clémençon & Jakubowicz, 2013). A typical MV curve is depicted in Fig. 5.1. When the distribution $F(dx)$ is highly concentrated around its modes and exhibits light-tail behavior, the optimal MV curve increases very slowly and rises only near 1. Of course, MV curves are generally unknown in practice, just like the distribution $F(dx)$, and must be estimated by their empirical counterparts; see Theorem 8 in (Clémençon & Jakubowicz, 2013) for more details on the statistical estimation of MV curves.

**A partial order on $\mathcal{S}$.** The concept of MV curve induces a partial order on the set of scoring functions $\mathcal{S}$. Let $(s_1, s_2) \in \mathcal{S}^2$; we will say that the scoring function $s_1$ is more accurate than $s_2$ if and only if its MV curve is below that of $s_2$ everywhere, i.e. $\forall \alpha \in (0,1)$, $MV_{s_1}(\alpha) \leq MV_{s_2}(\alpha)$: for any fixed mass level, $s_1$ defines a subset of smaller volume.
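To fix ideas, here is a minimal sketch of such an empirical counterpart on the unit hypercube, assuming Python with NumPy and a vectorized scoring function; `empirical_mv_curve` and its Monte Carlo volume step are illustrative choices, not the estimator analyzed in (Clémençon & Jakubowicz, 2013). The threshold is the empirical $(1-\alpha)$-quantile of the scores of the sample, and the Lebesgue measure of the corresponding level set is approximated with uniform points.

```python
import numpy as np

def empirical_mv_curve(score, X, alphas, n_mc=100_000, seed=None):
    """Empirical MV curve of a scoring function on [0,1]^d (cf. Eq. (2)):
    mass is taken along the sample X ~ F, and the Lebesgue measure of each
    level set {s >= t} is approximated by Monte Carlo with uniform points."""
    rng = np.random.default_rng(seed)
    _, d = X.shape
    s_data = score(X)                             # scores of the observations
    s_unif = score(rng.uniform(size=(n_mc, d)))   # scores of uniform points
    mv = []
    for a in alphas:
        t = np.quantile(s_data, 1.0 - a)          # empirical (1 - alpha)-quantile of s(X)
        mv.append(np.mean(s_unif >= t))           # ~ lambda({x : s(x) >= t}) on [0,1]^d
    return np.asarray(mv)
```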
Equipped with this functional performance criterion, the optimal scoring functions are the elements of the set $\mathcal{S}^* = \{T \circ f : T : \mathrm{Im}\, f(X) \to \mathbb{R}_+ \text{ strictly increasing}\}$, where $\mathrm{Im}\, f(X)$ denotes the image of the r.v. $f(X)$. The closer the MV curve of a candidate scoring function to $MV^*$, the more accurate the ranking it defines. Of course, there are many ways of quantifying closeness in the MV space. One could naturally consider $L_p$ distances, $1 \leq p \leq +\infty$, as in (Clémençon & Jakubowicz, 2013). In this paper, focus is on the situation where the assumption below is fulfilled.

**H3** The probability distribution $F(dx)$ has compact support, equal to $[0,1]^d$ say.

In this case, the MV curve of any scoring function $s \in \mathcal{S}$ ends at $(1,1)$. Letting $p \in [1, +\infty]$, the performance of any $s \in \mathcal{S}$ can be measured through the quantity
$$d_p(s, s^*) = \| MV_s - MV^* \|_p, \qquad (5)$$
where $\|\cdot\|_p$ denotes the $L_p$-norm on $[0,1]$ and $s^* \in \mathcal{S}^*$. The goal of anomaly ranking can then be stated in a quantitative manner: based on a training dataset $\{X_1, \ldots, X_n\}$, the objective is to build a scoring function $s \in \mathcal{S}$ such that $d_p(s, s^*)$ is as small as possible with overwhelming probability. Notice finally that, since we have the decomposition $d_1(s, s^*) = \int_0^1 MV_s(\alpha)\,d\alpha - \int_0^1 MV^*(\alpha)\,d\alpha$ (see Eq. (4)), anomaly ranking reduces to minimization of the area under the MV curve in the case $p = 1$.

**Anomaly ranking versus anomaly detection.** Anomaly ranking is very different from (unsupervised) anomaly detection, cast as minimum volume set estimation. A mass level $\alpha \in (0,1)$ being fixed beforehand, the latter consists in recovering from data a specific density level set $\Omega^*_\alpha$, while the former aims at building a scoring function $s(x)$ whose collection of level sets $\{\Omega_{s,t}\}_{t>0}$ nearly corresponds to that of the underlying density $f(x)$, namely $\{\Omega^*_\alpha\}_{\alpha \in (0,1)}$ (i.e. $s$ is close to an increasing transform of $f(x)$). Hence, anomaly ranking should be viewed as a continuum of anomaly detection problems: finding the observations forming the top 1% most abnormal, then those forming the top 2%, etc.

### 2.3. Bipartite Ranking and ROC Analysis

Consider now two probability distributions on the space $\mathcal{X}$, $G(dx)$ and $H(dx)$, absolutely continuous with respect to each other. The ROC curve of any scoring function $s(x)$ is defined as the PP-plot $t > 0 \mapsto (1 - H_s(t), 1 - G_s(t))$, where $H_s(dt)$ and $G_s(dt)$ respectively denote the images of the distributions $H$ and $G$ by the mapping $s : \mathcal{X} \to \mathbb{R}_+$. Connecting by convention possible jumps by line segments, the ROC curve of the scoring function $s(x)$ can always be viewed as the plot of a continuous mapping $ROC_s : \alpha \in (0,1) \mapsto ROC_s(\alpha)$. It starts at $(0,0)$ and ends at $(1,1)$. At any point $\alpha \in (0,1)$ such that $H_s \circ H_s^{-1}(\alpha) = \alpha$, we have:
$$ROC_s(\alpha) = 1 - G_s \circ H_s^{-1}(1 - \alpha).$$
The curve $ROC_s$ measures the capacity of $s$ to discriminate between the distributions $H$ and $G$. It coincides with the first diagonal when $H_s = G_s$. Observe also that the more stochastically larger than $H_s$ the distribution $G_s$ is, the closer the curve $ROC_s$ to the upper left corner of the ROC space. One may refer to (Fawcett, 2006) for an account of ROC analysis and its applications. The notion of ROC curve defines a partial order on $\mathcal{S}$: a scoring function $s_1$ is more accurate than $s_2$ iff $\forall \alpha \in (0,1)$, $ROC_{s_1}(\alpha) \geq ROC_{s_2}(\alpha)$. A Neyman–Pearson argument shows that the optimal ROC curve, denoted by $ROC^*$, is that of the likelihood ratio statistic $\phi(x) = dG/dH(x)$. It dominates any other ROC curve everywhere:
$$\forall (s, \alpha) \in \mathcal{S} \times (0,1), \quad ROC_s(\alpha) \leq ROC^*(\alpha).$$
The set $\mathcal{S}^*_{H,G} = \{T \circ \phi : T : \mathrm{Im}\, \phi(X) \to \mathbb{R}_+ \text{ strictly increasing}\}$ is the set of optimal scoring functions for the bipartite problem considered. The goal of bipartite ranking is to build a scoring function with a ROC curve as high as possible, based on two independent labeled datasets, $(X^-_1, \ldots, X^-_m)$ and $(X^+_1, \ldots, X^+_q)$, made of independent realizations of $H$ and $G$ respectively, with $m, q \geq 1$. Assigning the label $Y = +1$ to observations drawn from $G(dx)$ and the label $Y = -1$ to those drawn from $H(dx)$, the objective can also be expressed as ranking/scoring any pooled set of observations (in the absence of label information) so that, ideally, the higher the score of an observation $X$, the likelier its (hidden) label $Y$ is to be positive. The accuracy of any $s \in \mathcal{S}$ can be measured by:
$$D_p(s, s^*) = \| ROC_s - ROC^* \|_p, \qquad (6)$$
where $s^* \in \mathcal{S}^*_{H,G}$ and $p \in [1, +\infty]$. Observe that, in the case $p = 1$, one may write $D_1(s, s^*) = AUC^* - AUC(s)$, where $AUC(s) = \int_0^1 ROC_s(\alpha)\,d\alpha$ is the Area Under the ROC Curve (AUC for short) and $AUC^* = AUC(\phi)$ is the maximum AUC. Hence, minimizing $D_1(s, s^*)$ boils down to maximizing the ROC summary $AUC(s)$. The popularity of this quantity arises from the fact that it can be interpreted, in a probabilistic manner, as the rate of concordant pairs:
$$AUC(s) = \mathbb{P}\{s(X) < s(X')\} + \frac{1}{2}\,\mathbb{P}\{s(X) = s(X')\}, \qquad (7)$$
where $X$ and $X'$ denote independent r.v.'s defined on the same probability space, drawn from $H$ and $G$ respectively. An empirical counterpart of $AUC(s)$ can be straightforwardly derived from (7), paving the way for the implementation of empirical risk minimization strategies; see (Clémençon et al., 2008). The algorithms proposed to optimize the AUC criterion or surrogate performance measures are too numerous to be listed exhaustively. In the present article, due to space limitations, we restrict our attention to the extension of the following algorithms to the anomaly ranking problem: 1) the TREERANK method and its variants (see (Clémençon & Vayatis, 2009; Clémençon et al., 2011; 2013)), which relies on recursive AUC maximization, see subsection 4.2; 2) the RankBoost algorithm, which implements a boosting approach tailored to the ranking problem (see (Freund et al., 2003)); 3) the SVMrank algorithm, originally designed for ordinal regression (see (Herbrich et al., 2000)); and 4) the RankRLS procedure proposed in (Pahikkala et al., 2007).

## 3. Anomaly Ranking: a Bipartite View

With the notations of subsection 2.3, we take $H(dx)$ to be the uniform distribution $U(dx)$ on $[0,1]^d$ and $G(dx)$ to be $F(dx)$, the distribution of interest in the anomaly ranking problem. It follows immediately from the definitions and properties recalled in section 2 that, for any scoring function $s \in \mathcal{S}$, the curves $MV_s$ and $ROC_s$ are symmetrical with respect to the first diagonal of the unit square $[0,1]^2$. Hence, as stated in the next result, solving the anomaly ranking problem related to the distribution $F(dx)$ is equivalent to solving the bipartite ranking problem related to the pair $(U, F)$.

**Theorem 1** *Suppose that assumptions H1, H2 and H3 hold true. Let $U(dx)$ be the uniform distribution on $[0,1]^d$. For any $(s, \alpha) \in \mathcal{S} \times (0,1)$, we have:*
$$ROC_s^{-1}(\alpha) = MV_s(\alpha).$$
*We also have $\mathcal{S}^* = \mathcal{S}^*_{U,F}$, and*
$$\forall (s, s^*) \in \mathcal{S} \times \mathcal{S}^*, \quad D_p(s, s^*) = d_p(s, s^*), \quad \text{for } 1 \leq p \leq +\infty.$$
*In particular, we have:*
$$\forall s \in \mathcal{S}, \quad \int_0^1 MV_s(\alpha)\,d\alpha = 1 - \mathbb{P}\{s(W) < s(X)\} - \frac{1}{2}\,\mathbb{P}\{s(W) = s(X)\},$$
*where $W$ and $X$ are independent r.v.'s, drawn from $U(dx)$ and $F(dx)$ respectively.*

The proof is straightforward: it suffices to observe that $\phi = dG/dH = f$ in this context. Details are thus left to the reader.
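As a numerical sanity check of Theorem 1, the area under the empirical MV curve of a scoring function should roughly equal $1 - AUC(s)$, with the AUC estimated as the rate of concordant pairs (7) between a uniform sample and the data. A minimal sketch, reusing `empirical_mv_curve` from the earlier snippet; the toy distribution (a Gaussian essentially supported in $[0,1]^2$) and the scoring rule are arbitrary choices made here for illustration:

```python
import numpy as np

def empirical_auc(s_neg, s_pos):
    """Empirical counterpart of Eq. (7): P{s(W) < s(X)} + (1/2) P{s(W) = s(X)}."""
    diff = s_pos[None, :] - s_neg[:, None]            # all (negative, positive) score pairs
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

rng = np.random.default_rng(0)
X = 0.5 + 0.1 * rng.standard_normal((2000, 2))        # toy F, essentially supported in [0,1]^2
W = rng.uniform(size=(2000, 2))                       # W ~ U([0,1]^2)
score = lambda Z: -np.sum((Z - 0.5) ** 2, axis=1)     # some scoring function s

alphas = np.linspace(0.005, 0.995, 199)
amv = empirical_mv_curve(score, X, alphas).mean()     # ~ area under MV_s
print(amv, 1.0 - empirical_auc(score(W), score(X)))   # the two should roughly agree
```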
Incidentally, we point out that, under the assumptions listed above, the minimal area under the MV curve may thus be interpreted as a measure of dissimilarity between the distribution $F(dx)$ and the uniform distribution on $[0,1]^d$: the closer $\int_0^1 MV^*(\alpha)\,d\alpha$ to $1/2$, the more similar to $U(dx)$ the distribution $F(dx)$.

**Remark 2** (ON THE SUPPORT ASSUMPTION.) In general, the support of $F(dx)$ is unknown, just like the distribution itself. However, the argument above remains valid in the case where $\mathrm{supp}\, F(dx) \subset [0,1]^d$. The sole difference lies in the fact that the curve $MV^*$ then ends at the point of mass-axis coordinate 1 and volume-axis coordinate $\lambda(\mathrm{supp}\, F) \leq 1$, the corresponding curve $ROC^*$ exhibiting a plateau: it reaches 1 from the false positive rate $\lambda(\mathrm{supp}\, F)$ onward. We point out that, when no information about the support is available, one may always carry out the analysis for the conditional distribution given $X \in C$, where $C$ denotes any compact set containing the observations $X_1, \ldots, X_n$.

## 4. Extending Bipartite Methods

Now that the connection between anomaly ranking and bipartite ranking has been highlighted, we show how to exploit it to extend efficient algorithms proposed in the supervised framework to MV curve minimization. Learning procedures are based on a training i.i.d. sample $X_1, \ldots, X_n$, distributed according to the unknown probability measure $F(dx)$ with compact support, included in $[0,1]^d$ say.

### 4.1. Sampling

One may extend the use of any bipartite ranking algorithm $\mathcal{A}$ to the unsupervised context by simulating extra data, uniformly distributed on the unit hypercube, as follows (a sketch of this procedure is given below):

1. Sample additional data $X^-_1, \ldots, X^-_m$, uniformly distributed over $[0,1]^d$.
2. Assign a negative label to the sample $\mathcal{D}^-_m = \{X^-_1, \ldots, X^-_m\}$ and a positive label to the original data $\mathcal{D}^+_n = \{X_1, \ldots, X_n\}$.
3. Run algorithm $\mathcal{A}$ on the bipartite statistical population $\mathcal{D}^-_m \cup \mathcal{D}^+_n$, producing the anomaly scoring function $s(x)$.

Apart from the choice of the algorithm $\mathcal{A}$ and the selection of its hyperparameters, the sole tuning parameter that must be set is the size $m$ of the uniformly distributed sample. In practice, it should be chosen as large as possible, subject to the computational constraints at hand. From a practical perspective, it should be noticed that the computational cost of the sampling stage is low. Indeed, since the $d$ components of a r.v. uniformly distributed on the hypercube $[0,1]^d$ are independent and uniformly distributed on the unit interval, the negative sample can be generated by means of pseudo-random number generators (PRNGs), involving no complex simulation algorithm. Furthermore, uniform distributions on any Borel subset of $[0,1]^d$ can be simulated in quite a similar fashion, with an additional conditioning step. We point out that, in the context of density estimation, a similar sampling technique for transforming this unsupervised problem into one of supervised function approximation is discussed in section 14.2.4 of (Friedman et al., 2009), where it is used in particular to build generalized association rules. This idea is also exploited in (Steinwart et al., 2005) for anomaly detection; see also (Scott & Davenport, 2007).
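For concreteness, here is a minimal sketch of steps 1 to 3, assuming Python with NumPy and scikit-learn. A gradient-boosting classifier stands in here for the bipartite ranking algorithm $\mathcal{A}$, its predicted positive-class probability playing the role of the scoring function; the experiments in this paper use RankBoost, SVMrank and RankRLS instead, and `anomaly_scorer_by_sampling` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def anomaly_scorer_by_sampling(X_train, m=None, seed=None):
    """Steps 1-3 of subsection 4.1: simulate uniform negatives on [0,1]^d,
    label the original observations positive, and train a scorer on the pool."""
    rng = np.random.default_rng(seed)
    n, d = X_train.shape
    m = 2 * n if m is None else m                    # as large as the budget permits
    X_neg = rng.uniform(size=(m, d))                 # step 1: uniform negative sample
    X_pool = np.vstack([X_neg, X_train])             # step 2: pooled labeled data
    y_pool = np.concatenate([np.zeros(m), np.ones(n)])
    clf = GradientBoostingClassifier().fit(X_pool, y_pool)   # step 3: run algorithm A
    return lambda Z: clf.predict_proba(Z)[:, 1]      # anomaly scoring function s(x)
```

Note that the data are assumed to have been rescaled to $[0,1]^d$ beforehand.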
In this respect, it should be mentioned that a variety of techniques, including the one proposed in (Schölkopf et al., 2001), where the SVM machinery has been extended to the unsupervised framework and which is now referred to as ONE-CLASS SVM, have been proposed to recover the set $\Omega^*_\alpha$ for a target mass level $\alpha \in (0,1)$, fixed in advance. Even if the estimates produced are of the form $\{x \in \mathcal{X} : \widehat{f}(x) > t_\alpha\}$ and one could consider using the decision function $\widehat{f}(x)$ as a scoring function, one should keep in mind that there is no statistical guarantee that the sets $\{x \in \mathcal{X} : \widehat{f}(x) > t\}$ are good estimates of density level sets for $t \neq t_\alpha$. This explains the poor performance of such a plug-in approach in practice, as exhibited in section 5.

### 4.2. Unsupervised TREERANK

The TREERANK algorithm, a bipartite ranking technique optimizing the ROC curve in a recursive fashion, has been introduced in (Clémençon & Vayatis, 2009) and its properties have been investigated in detail in (Clémençon et al., 2011). Its output consists of an ordered partition of the feature space $\mathcal{X}$ (thus defining a ranking, in which elements of a same cell are viewed as ties). The ordered recursive partitioning process is described by a left-to-right oriented binary tree structure, referred to as a ranking tree, with fixed maximum depth $J \geq 0$. At depth $j \leq J$, there are $2^j$ nodes, indexed by $(j,k)$ with $0 \leq k < 2^j$. The root node represents the entire space $C_{0,0} = \mathcal{X}$, and each internal node $(j,k)$ with $j < J$ and $k \in \{0, \ldots, 2^j - 1\}$ represents a subset $C_{j,k} \subset \mathcal{X}$, whose left and right children respectively correspond to disjoint subsets $C_{j+1,2k}$ and $C_{j+1,2k+1}$ such that $C_{j,k} = C_{j+1,2k} \cup C_{j+1,2k+1}$. At the root, one starts with the constant scoring function $s_1(x) = \mathbb{I}\{x \in C_{0,0}\} \equiv 1$, and after $m = 2^j + k$ iterations, $0 \leq k < 2^j$, the current scoring function is
$$s_m(x) = \sum_{l=0}^{2k-1} (m - l)\, \mathbb{I}\{x \in C_{j+1,l}\} + \sum_{l=k}^{2^j - 1} (m - k - l)\, \mathbb{I}\{x \in C_{j,l}\},$$
and the cell $C_{j,k}$ is split in order to form a refined version of the scoring function, namely
$$s_{m+1}(x) = \sum_{l=0}^{2k+1} (m + 1 - l)\, \mathbb{I}\{x \in C_{j+1,l}\} + \sum_{l=k+1}^{2^j - 1} (m - k - l)\, \mathbb{I}\{x \in C_{j,l}\},$$
with maximum (empirical) AUC. It turns out that this problem boils down to solving a cost-sensitive binary classification problem on the set $C_{j,k}$; see subsection 3.3 in (Clémençon et al., 2011) for further details. Indeed, one may write the AUC increment as
$$AUC(s_{m+1}) - AUC(s_m) = \frac{1}{2}\, H(C_{j,k})\, G(C_{j,k}) \left( 1 - \Lambda(C_{j+1,2k} \mid C_{j,k}) \right),$$
where $\Lambda(C_{j+1,2k} \mid C_{j,k}) \stackrel{\text{def}}{=} G(C_{j,k} \setminus C_{j+1,2k})/G(C_{j,k}) + H(C_{j+1,2k})/H(C_{j,k})$. Setting $p = G(C_{j,k})/(H(C_{j,k}) + G(C_{j,k}))$, the crucial point of the TREERANK approach is that the quantity $2p(1-p)\,\Lambda(C_{j+1,2k} \mid C_{j,k})$ can be interpreted as the cost-sensitive error of a classifier on $C_{j,k}$ predicting a positive label on $C_{j+1,2k}$ and a negative label on $C_{j,k} \setminus C_{j+1,2k}$, with cost $p$ (respectively, $1-p$) assigned to the error of predicting label $+1$ given $Y = -1$ (resp., label $-1$ given $Y = +1$), thus balancing the two types of error. Hence, at each iteration of the ranking tree growing stage, the TREERANK algorithm calls a cost-sensitive binary classification algorithm, termed LEAFRANK, in order to solve a statistical version of the problem above (replacing the theoretical probabilities involved by their empirical counterparts) and split $C_{j,k}$ into $C_{j+1,2k}$ and $C_{j+1,2k+1}$.
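The split-selection criterion above is simple enough to transcribe directly. The function below is an illustrative sketch of the AUC-gain identity, with the masses of the cell and of the tentative left child under $H$ and $G$ supplied by the caller:

```python
def auc_increment(H_C, G_C, H_left, G_left):
    r"""AUC(s_{m+1}) - AUC(s_m) = (1/2) H(C) G(C) (1 - Lambda(left | C)),
    with Lambda(left | C) = G(C \ left)/G(C) + H(left)/H(C)."""
    lam = (G_C - G_left) / G_C + H_left / H_C
    return 0.5 * H_C * G_C * (1.0 - lam)
```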
As described at length in (Clémençon et al., 2011), one may use cost-sensitive versions of celebrated binary classification algorithms, such as CART or SVM for instance, as the LEAFRANK procedure, the performance depending on their ability to capture the geometry of the level sets of the likelihood ratio $dG/dH(x)$. In general, the growing stage is followed by a pruning procedure, where children of a same parent node are recursively merged in order to produce a ranking subtree that maximizes an estimate of the AUC criterion, usually based on cross-validation (cf. section 4 in (Clémençon et al., 2011)). Under adequate assumptions, consistency results and rate bounds for the TREERANK approach (both in the sup-norm sense and for the AUC deficit) are established in (Clémençon & Vayatis, 2009) and (Clémençon et al., 2011); an extensive experimental study can be found in (Clémençon et al., 2012). In the anomaly ranking context, the negative distribution is $U(dx)$. Therefore, in the situation where LEAFRANK is chosen as a cost-sensitive version of the CART algorithm with axis-parallel splits (see (Breiman et al., 1984)), all the cells $C_{j,k}$ can be expressed as unions of hyperrectangles. The exact computation of the volume $U(C_{j,k})$ is then elementarily feasible, as a function of the threshold values involved in the decision tree describing the split and of the volume of the parent node. Hence, only the empirical counterparts $\widehat{F}_n(C) = (1/n) \sum_{i=1}^n \mathbb{I}\{X_i \in C\}$ of the quantities $F(C)$, for candidate subsets $C \subset [0,1]^d$, are required to estimate the cost-sensitive classification error and implement the splitting stage (AUC maximization). Hence, this approach does not require sampling any additional data, in contrast to that proposed in subsection 4.1. This is a key advantage in practice over simulation-based approaches: for high values of the dimension $d$, the data are expected to lie very sparsely in $[0,1]^d$ and can then very easily be separated from those obtained by sampling a reasonable number of uniform observations, leading bipartite ranking algorithms to overfit. Similarly to the supervised case, the UNSUPERVISED TREERANK algorithm corresponds to a statistical version of an adaptive piecewise linear interpolation scheme of the optimal MV curve; see (Clémençon & Vayatis, 2009).

**Interpretation.** From a practical angle, a crucial advantage of the approach described above lies in the interpretability of the anomaly ranking rules produced. In contrast to alternative techniques, they can be summarized by means of a left-to-right oriented binary tree graphic: observations are considered all the more abnormal the further to the right the terminal leaf in which they fall. An arrow at the bottom of the tree indicates the direction in which the density decreases. Each splitting rule possibly involves the combination, in a hierarchical manner, of elementary threshold rules of the type $X^{(k)} > \kappa$ or $X^{(k)} \leq \kappa$ with $\kappa \in \mathbb{R}$. In addition, it is also possible to rank the $X^{(k)}$'s by relative importance, measured through the decrease in empirical area under the MV curve induced by the splits involving $X^{(k)}$ as split variable, just like in the supervised setup; see section 5.1 in (Clémençon et al., 2011) for further details. This permits the identification of the variables that are most relevant for detecting anomalies.
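Putting the two previous ingredients together, the following sketch searches for the best single axis-parallel split of a hyperrectangular cell, computing the $H$-masses exactly as Lebesgue volumes and the $G$-masses from data counts. It is a much-simplified stand-in for the cost-sensitive CART LEAFRANK; the threshold grid, the cell representation and the function names are choices made here for illustration.

```python
import numpy as np

def best_axis_split(X_cell, box, n_grid=20):
    """Scan axis-parallel splits of the cell `box` = [(lo_1, hi_1), ..., (lo_d, hi_d)].
    H is uniform, so H-masses are exact (unnormalized) volumes; G-masses are
    (unnormalized) counts of the training points lying in the cell. Both
    normalizations only rescale the AUC gain by a factor common to all
    candidate splits of the cell, so the argmax is unaffected."""
    H_C = np.prod([hi - lo for lo, hi in box])            # volume of the cell
    G_C = len(X_cell)                                     # data points in the cell
    assert G_C > 0, "the cell is assumed to contain data"
    best_split, best_gain = None, -np.inf
    for k, (lo, hi) in enumerate(box):                    # candidate split variable X^(k)
        for kappa in np.linspace(lo, hi, n_grid + 2)[1:-1]:   # candidate thresholds
            for keep_low in (True, False):                # which side becomes the left child
                if keep_low:
                    H_left = H_C * (kappa - lo) / (hi - lo)
                    G_left = np.sum(X_cell[:, k] <= kappa)
                else:
                    H_left = H_C * (hi - kappa) / (hi - lo)
                    G_left = np.sum(X_cell[:, k] > kappa)
                gain = auc_increment(H_C, G_C, H_left, G_left)  # cf. previous sketch
                if gain > best_gain:
                    best_split, best_gain = (k, float(kappa), keep_low), gain
    return best_split, best_gain
```

Recursing such splits on the children, with scores assigned from left to right, yields the piecewise constant scoring function described above, with no simulated data involved.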
## 5. Numerical Experiments

We now illustrate the points put forward in sections 3 and 4 by means of numerical experiments, based on unlabeled synthetic/real datasets. Precisely, we implemented the modification of the TREERANK procedure based on locally weighted versions of the CART method (with axis-parallel splits) described at length in subsection 4.2, using the package for the R statistical software (see http://www.r-project.org) available at http://treerank.sourceforge.net (with parameters minsplit = 1 and maxdepth = 4 in LEAFRANK); see (Baskiotis et al., 2009). We also used RankBoost (aggregating 30 stumps, see (Rudin et al., 2005)) and SVMrank (with linear and Gaussian kernels and cross-validated parameters, see (Herbrich et al., 2000)), using the SVM-light implementation available at http://svmlight.joachims.org/. The RankRLS method (http://www.tucs.fi/RLScore, see (Pahikkala et al., 2007)), which implements a regularized least squares algorithm with a linear kernel (bias = 1) and with a Gaussian kernel (γ = 0.01), has also been used, the selection of the intercept on a grid being performed through a leave-one-out procedure. The anomaly ranking procedures based on the latter algorithms required sampling uniformly distributed negative data, as explained in subsection 4.1. As said at the end of section 4, the decision function output by the one-class SVM procedure can be used as a scoring function, improperly however, because the objective function it optimizes is related to a single point of the target MV curve (see the toy example below). We used the R package kernlab with a Gaussian kernel, the parameters being chosen automatically by cross-validation. In the tables displayed below, the anomaly scoring function produced by RankBoost is referred to as "RankBoost", those computed by means of SVMrank (respectively, by means of RankRLS) based on a linear and a Gaussian kernel as "SVMlin" and "SVMgauss" (resp. "RLSlin" and "RLSgauss"), and that produced by the one-class SVM as "1c-SVM". In the following experiments, an estimate of the area under the MV curve (AMV for short) is computed over 5 replications of a 5-fold cross-validation, together with the overall standard deviation (denoted by σ).

### 5.1. Toy Examples and Synthetic Data

Let $Z$ be a $q$-dimensional Gaussian r.v. with mean $\mu$ and covariance matrix $\Gamma$, and consider a Borel subset $C \subset \mathbb{R}^q$ with nonzero Lebesgue measure. We denote by $\mathcal{N}_C(\mu, \Gamma)$ the conditional distribution of $Z$ given $Z \in C$. Equipped with this notation, the probability distribution used as a toy example here can be written as:
$$\frac{2}{10}\, \mathcal{N}_{[0,1]^2}\!\left( \begin{pmatrix} 1/2 \\ 1/2 \end{pmatrix}, \begin{pmatrix} 1/30 & 0 \\ 0 & 1/30 \end{pmatrix} \right) + \frac{4}{10}\, \mathcal{N}_{[0,1]^2}\!\left( \begin{pmatrix} 3/10 \\ 7/10 \end{pmatrix}, \begin{pmatrix} 1/30 & 1/60 \\ 1/60 & 1/30 \end{pmatrix} \right) + \frac{4}{10}\, \mathcal{N}_{[0,1]^2}\!\left( \begin{pmatrix} 7/10 \\ 7/10 \end{pmatrix}, \begin{pmatrix} 1/30 & 1/60 \\ 1/60 & 1/30 \end{pmatrix} \right).$$

*Figure 1. Mixture of Gaussian distributions: (a) pooled sample, red circles representing instances drawn from the density f and blue stars those drawn from the uniform distribution; (b) density level sets.*

*Figure 2. MV curves: in blue, the MV curve obtained using the posterior distribution; in red, the MV curves obtained using TREERANK.*

The simulated dataset is plotted in Fig. 1a, while some level sets of the density $f$ are represented in Fig. 1b. We sampled 5000 independent observations from the distribution $f(x)dx$ and 5000 independent uniformly distributed points. The optimal AMV is denoted by AMV\* (knowing the density, it can be estimated by a basic Monte Carlo scheme).
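The toy distribution above can be simulated by per-component rejection, each accepted draw realizing the conditional law $\mathcal{N}_{[0,1]^2}(\mu, \Gamma)$. A minimal sketch, with the weights and parameters read off the displayed mixture and `sample_toy_mixture` a name introduced here for illustration:

```python
import numpy as np

def sample_toy_mixture(n, seed=None):
    """Draw n points from the truncated-Gaussian mixture of subsection 5.1:
    pick a component with probability (0.2, 0.4, 0.4), then redraw from the
    corresponding Gaussian until the point falls in [0,1]^2, which realizes
    the conditional law N_C(mu, Gamma) with C = [0,1]^2."""
    rng = np.random.default_rng(seed)
    weights = [0.2, 0.4, 0.4]
    means = [np.array([0.5, 0.5]), np.array([0.3, 0.7]), np.array([0.7, 0.7])]
    covs = [np.array([[1/30, 0.0], [0.0, 1/30]]),
            np.array([[1/30, 1/60], [1/60, 1/30]]),
            np.array([[1/30, 1/60], [1/60, 1/30]])]
    out = np.empty((n, 2))
    for i in range(n):
        k = rng.choice(3, p=weights)
        while True:                                   # rejection step: condition on [0,1]^2
            z = rng.multivariate_normal(means[k], covs[k])
            if np.all((z >= 0.0) & (z <= 1.0)):
                break
        out[i] = z
    return out
```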
As expected, given the distribution of the data to be ranked, the linear methods perform worst. Notice in addition that TREERANK and RLSgauss yield comparable results and outperform RankBoost and SVMgauss on this example. Among the nonlinear rules, 1c-SVM yields the poorest performance (cf. section 4). The MV curves produced by the TREERANK algorithm are displayed in Fig. 3.

*Table 1. Comparison of the test AMV (AMV\* = 0.2393)*

| Method | TreeRank | RankBoost | SVMlin | SVMgauss | RLSlin | RLSgauss | 1c-SVM |
|---|---|---|---|---|---|---|---|
| AMV test | 0.2448 | 0.2598 | 0.4128 | 0.2502 | 0.4129 | 0.2443 | 0.3373 |
| σ | 0.0124 | 0.0119 | 0.0125 | 0.0122 | 0.0123 | 0.0122 | 0.0042 |

### 5.2. A Real-World Example

We also used a benchmark dataset in anomaly detection (computer network intrusion detection, namely), proposed as a challenge for intrusion detection at CDMC2013; see http://www.csmining.org/cdmc2013/ and (Song, 2013). In the present analysis, we kept the three features dst.host.rerror.rate, rerror.rate and serror.rate and removed degenerate features, yielding a training set of 6802 instances. Then, we simulated 10000 extra negative data points, uniformly distributed over the cube $[0,1]^3$. Estimates of the area under the MV curve (AMV) have been computed over five replications of a five-fold cross-validation, and the overall standard deviations are reported in Table 2.

*Table 2. Comparison of the test AMV (CDMC2013 experiment)*

| Method | TreeRank | RankBoost | SVMlin | SVMgauss | RLSlin | RLSgauss |
|---|---|---|---|---|---|---|
| AMV test | 0.0225 | 0.1500 | 0.4029 | 0.2501 | 0.4070 | 0.0481 |
| σ | 0.0027 | 0.0133 | 0.0136 | 0.0125 | 0.0135 | 0.0051 |

The modified TREERANK procedure outperforms all the other methods, illustrating the advantage of a method that avoids any sampling stage. Given the clear superiority of the methods based on Gaussian kernels over linear techniques, one may also guess that the level sets of the underlying density are highly nonlinear.

## 6. Conclusion

Here we shed light on the connection between anomaly ranking, cast as Mass Volume curve minimization, and bipartite ranking, when the distribution $F(dx)$ of the (unlabeled) training data is compactly supported. Assuming (up to a rescaling of the observations) that the support is included in $[0,1]^d$, the related bipartite problem corresponds to the situation where $F(dx)$ is the positive distribution and the negative one is uniform on $[0,1]^d$. We thus proved that, following the generation of uniformly distributed data, bipartite ranking algorithms can be readily used to build scoring functions which nearly solve the MV curve minimization problem. We illustrated this through numerical experiments.

## References

Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., and Roth, D. Generalization bounds for the area under the ROC curve. JMLR, 6:393–425, 2005.

Baskiotis, N., Clémençon, S., Depecker, M., and Vayatis, N. R-implementation of the TreeRank algorithm. Submitted for publication, 2009.

Breiman, L., Friedman, J., Olshen, R., and Stone, C. Classification and Regression Trees. Wadsworth and Brooks, 1984.

Clémençon, S. and Jakubowicz, J. Scoring anomalies: a M-estimation formulation. In Proceedings of AISTATS, 2013.

Clémençon, S. and Vayatis, N. Tree-based ranking methods. IEEE Trans. Inf. Theory, 55(9):4316–4336, 2009.

Clémençon, S., Lugosi, G., and Vayatis, N. Ranking and empirical risk minimization of U-statistics. Ann. Stat., 36(2):844–874, 2008.

Clémençon, S., Depecker, M., and Vayatis, N. Adaptive partitioning schemes for bipartite ranking. Machine Learning, 43(1):31–69, 2011.

Clémençon, S., Depecker, M., and Vayatis, N. An empirical comparison of learning algorithms for nonparametric scoring: the TreeRank algorithm and other methods. Patt. Analys. Appl., 2012.

Clémençon, S., Depecker, M., and Vayatis, N. Ranking Forests. J. Mach. Learn. Res., 2013.

Duchi, J., Mackey, L., and Jordan, M. On the consistency of ranking algorithms. In Proceedings of ICML, 2010.

Einmahl, J.H.J. and Mason, D.M. Generalized quantile process. Ann. Stat., 20:1062–1078, 1992.
Fawcett, T. An Introduction to ROC Analysis. Pattern Recognition Letters, 27(8):861–874, 2006.

Freund, Y., Iyer, R., Schapire, R., and Singer, Y. An efficient boosting algorithm for combining preferences. JMLR, 4:933–969, 2003.

Friedman, J., Hastie, T., and Tibshirani, R. The Elements of Statistical Learning. Springer, 2009.

Herbrich, R., Graepel, T., and Obermayer, K. Advances in Large Margin Classifiers, chapter Large margin rank boundaries for ordinal regression, pp. 115–132. MIT Press, 2000.

Pahikkala, T., Tsivtsivadze, E., Airola, A., Boberg, J., and Salakoski, T. Learning to rank with pairwise regularized least-squares. In Proceedings of SIGIR, pp. 27–33, 2007.

Polonik, W. Minimum volume sets and generalized quantile processes. Stochastic Processes and their Applications, 69(1):1–24, 1997.

Rakotomamonjy, A. Optimizing Area Under Roc Curve with SVMs. In Proceedings of the First Workshop on ROC Analysis in AI, 2004.

Rudin, C., Cortes, C., Mohri, M., and Schapire, R. E. Margin-based ranking and boosting meet in the middle. In Proceedings of COLT, 2005.

Schölkopf, B., Platt, J.C., Shawe-Taylor, J., Smola, A., and Williamson, R. Estimating the Support of a High-Dimensional Distribution. Neural Computation, 13(7):1443–1471, 2001.

Scott, C. and Davenport, M. Regression level set estimation via cost-sensitive classification. IEEE Transactions on Signal Processing, 55, 2007.

Scott, C. and Nowak, R. Learning minimum volume sets. JMLR, 7:665–704, 2006.

Song, J. CDMC2013 intrusion detection dataset, 2013.

Steinwart, I., Hush, D., and Scovel, C. A classification framework for anomaly detection. J. Machine Learning Research, 6:211–232, 2005.

Vert, R. and Vert, J.-P. Consistency and convergence rates of one-class SVMs and related algorithms. JMLR, 7:817–854, 2006.

Viswanathan, K., Choudur, L., Talwar, V., Wang, C., Macdonald, G., and Satterfield, W. Ranking anomalies in data centers. In R.D. James (ed.), Network Operations and System Management, pp. 79–87. IEEE, 2012.