# Concept Drift Detection Through Resampling

Maayan Harel (MAAYANGA@TX.TECHNION.AC.IL), Koby Crammer (KOBY@EE.TECHNION.AC.IL), Ran El-Yaniv (RANI@CS.TECHNION.AC.IL), Shie Mannor (SHIE@EE.TECHNION.AC.IL)
Technion - Israel Institute of Technology, Haifa, Israel.

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

## Abstract

Detecting changes in data streams is an important part of enhancing learning quality in dynamic environments. We devise a procedure for detecting concept drifts in data streams that relies on analyzing the empirical loss of learning algorithms. Our method is based on obtaining statistics from the loss distribution by reusing the data multiple times via resampling. We present theoretical guarantees for the proposed procedure based on the stability of the underlying learning algorithms. Experimental results show that the method has high recall and precision, and performs well in the presence of noise.

## 1. Introduction

Effective techniques for analyzing streaming data are required in many big data applications and pose new machine learning challenges. One major challenge arises when the underlying source generating the data is not stationary, in which case the concept to be learned changes over time. For example, in prediction tasks such as fraud detection or user preference prediction, the performance of a previously trained static predictor is bound to degrade over time, as the nature of fraudulent attacks or personal preferences evolves. Reliable detection of such changes can be used to maintain high performance and meaningful analyses of data streams. This work is concerned with detection of change in the context of a prediction problem, and proposes a detection scheme for online identification of the times along the stream at which such changes have occurred.

Changes of the prediction task can be caused by changes in the input data, the target concept, or both. One option is to try detecting the change in the data-generating distribution, a renowned problem in statistics termed change-point detection; see Section 1.1 for further discussion. However, while a drift in the generating distribution may result in a change in the learning problem, this is not a necessity. For example, if a sensor is damaged at some point, yet the resulting changes affect only features with low correlation with the label, or do not affect the decision boundary of the predictor, they should be ignored. Detecting these types of irrelevant changes, and consequently initiating a new training phase, may degrade the quality of the resulting predictor by unnecessarily shrinking the training set size. Moreover, detection of any type of distributional change is a challenging task, especially in high-dimensional data. Therefore, when high prediction quality is the goal, we adopt the principle of Vapnik (1998) and propose to solve the problem directly, by monitoring the drift in the prediction loss of the underlying predictor rather than the intermediate problem of change-point detection. The main idea is to consider concept change only with respect to some hypothesis class. This approach produces no false detections of changes that are irrelevant to the end task. Using this approach, the sample and computation complexity are conveniently controlled by the choice of the hypothesis class.
As an example, consider the task of predicting a user's ratings for a stream of book reviews, where the genre of the books being reviewed changes with time, and the hypothesis class is restricted to linear predictors with a constrained L1 norm. For simplicity of exposition, consider using a very small dictionary instead of the L1 norm constraint. If the dictionary is {bad, good, informative}, the change from fantasy to educational books will be easily detected, due to the correlation of the word "informative" with a good educational book. If the dictionary is {bad, good, kindle}, we will not detect the user's switch to the Kindle, as "kindle" is not correlated with the sentiment. However, if we inspect the distributional change in the input, we may detect an irrelevant change.

The idea of focusing on the change in the target concept through the error of the underlying predictor, instead of focusing on changes in the stream distribution, has previously been proposed only in the context of classification; see Section 1.1 for details. Yet a general-purpose model with theoretical guarantees for any prediction task, be it classification, regression, clustering, etc., has not yet been considered, and is the focus of this work.

In this paper, we term changes in the prediction loss concept drifts and focus on their detection in streaming data. Our definition of a concept drift (see Definition 2) slightly deviates from the standard definition, which associates drifts only with changes in the target concept (Tsymbal, 2004); our terminology links a concept drift also to the learning algorithm and its function class. Our main contribution is a general approach for detecting change in a prediction problem for different types of concept drifts. Our detection mechanism relies on the idea of random permutations of the examples, which generate multiple train-test splits from the stream. The proposed approach is invariant to the learning problem, specifically to the loss function and to the learning algorithm. We analyze the detection quality of the proposed scheme for different types of changes in the prediction problem, from the simplest scenario of an abrupt change to that of a slow, noisy, gradual change. We also show empirically the success of the permutation-based detection.

### 1.1. Related Work

A problem closely related to the concept drift detection pursued in this paper is change-point detection; the goal of the former is to detect changes that affect the mapping from the input space to the target concept, while the goal of the latter is to detect changes in the generating distributions of the time-series. A great deal of work has been invested in methods for change-point detection; some reference books include (Basseville & Nikiforov, 1993; Chen & Gupta, 2012) for parametric methods, and (Brodsky & Darkhovsky, 1993) for non-parametric methods. Change-point detection is associated with homogeneity testing, in which, given two samples, one has to determine whether they were generated by the same distribution. Many attempts have been made to extend classical statistical tests for homogeneity to detection of change in time-series (Kifer et al., 2004; Lung-Yut-Fong et al., 2011).
Other methods are based on a predefined parametric model, such as the generating distribution (Basseville & Nikiforov, 1993; Gustafsson, 1996; Lavielle & Teyssiere, 2006), autoregressive models (Takeuchi & Yamanishi, 2006), and state-space models (Moskvina & Zhigljavsky, 2003; Kawahara et al., 2007). Their success depends on the suitability of the preassigned model. Nonparametric alternatives often rely on the estimation of density functions (Dasu et al., 2006; Sebastião & Gama, 2007), which tends to be problematic in high-dimensional problems. A different direction is to directly estimate the density ratio (Kawahara & Sugiyama, 2012; Liu et al., 2013), while Desobry et al. (2005) propose to segment the series using a one-class support vector machine. Vovk et al. (2003) pursue a related problem: exchangeability of the samples in the stream.¹ The above list is merely a small sample of the numerous attempts at the challenging change-point detection problem. We suggest evading its complexity when the goal is to detect concept drift.

¹This work has a similar spirit to our approach as it also employs permutations; however, the hypotheses tested are distinct.

Existing approaches for detection of a concept drift are commonly based on some heuristic that utilizes the error rate, and draw upon intuition derived from learning theory. For example, in Gama et al. (2004), a change is detected when the error trend increases as the size of the stream grows. Another detection method (Baena-García et al., 2006) detects a concept drift once the distance between the classification errors decreases. A third error-based method compares the accuracy on a recent window with the overall accuracy excluding the recent window by applying a test of equal proportions (Nishida & Yamauchi, 2007). Other methods search directly for the optimal window (Klinkenberg & Joachims, 2000; Bifet & Gavalda, 2007). For example, Klinkenberg & Joachims search for a window that minimizes the approximate error of an SVM classifier. The advantage of most error-based methods is their simple employment as wrappers for any classification algorithm. A limitation of known detection schemes is that they treat only classification settings under the zero-one loss. Theoretical guarantees for existing schemes are also scarce.

Other approaches for learning in the presence of concept drift do not apply detection, but instead adapt to the change or employ an ensemble of learners. Adaptation methods retrain the learner after a fixed window size, or re-weight instances by recency of appearance (Klinkenberg, 2003; 2004). The choice of a proper window size balances a tradeoff between adaptation speed and generalization. Some adaptation methods are tailored to a specific learning algorithm, such as decision trees, and modify elements of the model when drift is suspected by some heuristic strategy (Black & Hickey, 1999; Hulten et al., 2001). In ensemble methods, a weighted voting scheme over multiple learners, trained on different windows of the stream, is applied (Street & Kim, 2001; Stanley, 2003; Wang et al., 2003; Kolter & Maloof, 2007; Masud et al., 2009). There is no explicit detection of the drift, but rather initiation of new classifiers.

## 2. Detection Scheme

A prediction problem is defined on an input space $\mathcal{X}$ and an output space $\mathcal{Y}$, and consists of finding a predictor $h : \mathcal{X} \to \mathcal{Y}$ from some fixed function class $\mathcal{H}$. During the
detection process we observe a sub-sequence $Z^n \doteq Z_1^n = \{z_1, z_2, \ldots, z_n\}$ of independent observations emitted from a stochastic process called a stream, where each observation $z_t \in \mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ is generated by some distribution $D_t$ over $\mathcal{Z}$. Under this definition, even within a single concept, the distribution generating different examples may vary. The risk of $h \in \mathcal{H}$ with respect to $D_t$ is $R_{D_t}(h) = \mathbb{E}_{z \sim D_t}[\ell(h, z)]$, where $\ell(h, z) : \mathcal{Y} \times \mathcal{Y} \to [0, 1]$ is a given bounded loss function. We denote the average risk over an index set $I$, not necessarily an ordinal one, by $R_I(h) = \frac{1}{|I|} \sum_{t \in I} R_{D_t}(h)$, and its empirical estimate by $\hat{R}_I(h) = \frac{1}{|I|} \sum_{t \in I} \ell(h, z_t)$.

[Figure 1. Different types of concept drifts: (a) abrupt change, (b) noisy abrupt change, (c) gradual change (the test window is indicated).]

We would like to detect different types of concept changes. In the following, we formulate different kinds of concept changes, from a sharp abrupt change, to a more realistic noisy change, and finally, a slowly evolving noisy gradual change. For each type of change, we provide a suitable detection algorithm and a corresponding statistical guarantee. The presentation begins with the simplest type and progresses to the most difficult one. Note that the algorithm for noisy gradual change also detects the simpler drift types, and may therefore be applied as an all-purpose detection scheme.

We follow a hypothesis testing paradigm. Our null hypothesis $H_0$ is of equality or proximity of the average true risk on sequential training and test segments. The alternative $H_1$ asserts that there is a substantial difference between the two. Variations of this hypothesis, suited to the different types of concept changes, are presented below.

### 2.1. Abrupt (noisy) change

The first and most simplistic concept change is an abrupt change, where we assume that within each concept the risk is constant, and between concepts, due to the concept switch, there is a change in the risk's value. The model includes a parameter $\Delta \ge 0$ that defines the rate of change to be tested. The threshold $\Delta$ serves as a sensitivity valve, which provides the means to trade off small changes in the concept against a growing training size. The top image in Figure 1a shows an abrupt change, while the bottom image presents a deviation that is under the threshold.

In the first step of our basic detection scheme, the observed sub-sequence $Z^n$ is divided into a training window $S_{ord} = Z_1^k$, $1 < k < n$, and a test window $S'_{ord} = Z_{k+1}^n$. If a change is not detected, the test window is added to the training window and a new test window is introduced. Detection is done by comparing the loss on the test window with the losses on multiple test sets of the same size obtained by random shuffling of the examples. This idea is formulated in Algorithm 1. The shuffled train-test partitions of the data are denoted by $(S, S')$, such that $S$ consists of $k$ elements of $Z^n$, and $S'$ is $Z^n \setminus S$. The predictor $A_S \in \mathcal{H}$ is obtained by training algorithm $A$ on a set $S$. We denote by $f \doteq A_{S_{ord}}$ the predictor obtained when $A$ is trained on the ordered training set. In each iteration of Algorithm 1 we compare the risk on the ordered train-test split, $\hat{R}_{ord} = \hat{R}_{S'_{ord}}(A_{S_{ord}})$, with the permutation loss $\hat{R}_{perm} = \mathbb{E}_{S \sim U_n}[\hat{R}_{S'}(A_S)]$, where $U_n$ denotes the uniform distribution over all $\binom{n}{k}$ possible training sets of size $k$ from the sample $Z^n$. The permutation loss may be estimated over $P \ll \binom{n}{k}$ random partitions.
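For concreteness, the following is a minimal sketch of how $\hat{R}_{ord}$ and the shuffled-split losses used to estimate $\hat{R}_{perm}$ could be computed for a classification stream held in NumPy arrays. The helper names, the k-NN base learner, and the zero-one loss are our own illustrative choices, not a prescribed implementation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def ordered_and_shuffled_losses(X, y, k, P=100, seed=None):
    """Estimate the ordered-split loss R_ord (train on the first k examples,
    test on the rest) and P shuffled-split losses whose mean estimates R_perm."""
    rng = np.random.default_rng(seed)
    n = len(y)

    def split_loss(train_idx, test_idx):
        clf = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
        # Zero-one loss; any bounded loss in [0, 1] could be used instead.
        return float(np.mean(clf.predict(X[test_idx]) != y[test_idx]))

    # Ordered split: the stream order is respected.
    r_ord = split_loss(np.arange(k), np.arange(k, n))

    # Shuffled splits: same sizes, but membership is chosen uniformly at random.
    r_shuffled = []
    for _ in range(P):
        perm = rng.permutation(n)
        r_shuffled.append(split_loss(perm[:k], perm[k:]))
    return r_ord, np.asarray(r_shuffled)
```

The mean of the shuffled-split losses estimates $\hat{R}_{perm}$, while their spread is exactly the kind of variability information the TEST procedures below exploit.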
The comparison between the losses obtained on the ordered split and on the shuffled splits is done by a hypothesis test, denoted TEST in Algorithm 1. The TEST may be carried out either by thresholding the difference between the losses using the bound provided in Theorem 3 below, or by the permutation test detailed in Section 3.

Algorithm 1: Concept Drift Detection Scheme
  Input: algorithm A, window size m, number of permutations P, Δ, significance rate δ.
  Output: a set of change indices D.
  Initialization: t1 ← 1, k ← m, D ← ∅
  while not end of stream do
    S_ord ← Z_{t1}^{k}; S'_ord ← Z_{k+1}^{k+m}
    if detect(A, S_ord, S'_ord, P, Δ, δ) is True then
      t1 ← k; D ← D ∪ {k}
    k ← k + m
  return D

  Procedure detect(A, S_ord, S'_ord, P, Δ, δ):
    for i ← 1 to P do
      (S_i, S'_i) ← random split of S_ord ∪ S'_ord
    return TEST(R̂_ord, {R̂_{S'_i}(A_{S_i})}_{i=1..P}, Δ, δ)

The intuition behind this scheme is that if no concept change has occurred, the prediction on the ordered split should not deviate too much from that on the shuffled splits, especially if the learning algorithm has algorithmic stability. An advantage of using shuffled train-test splits over other error-based schemes, as in (Gama et al., 2004), is that the training and test sets are disjoint; therefore, the test error is a valid estimate of the risk.² In addition, by using multiple train-test splits we gain a good estimate of the error under the null hypothesis, as well as valuable information regarding its variation. Note that the model presented in this paper assumes temporal independence of the samples. In cases of dependency, means to maintain exchangeability of the samples, such as block-wise permutations along with removal of the initial part of each block, may be applied.

²In other methods, the online error is computed, in which case the test errors are no longer independent; thus the classification error is not a random variable of a Bernoulli process.

A generalization of abrupt change is noisy abrupt change, where within each concept the risk may vary up to some rate. These variations are common in most real-life scenarios, where the process generating the stream undergoes small changes, which affect the risk but should be disregarded. We define these permitted variations as follows.

Definition 1 (η-permitted variations). A stream segment $[t_1, t_2]$ is said to have η-permitted variations, for some $\eta \ge 0$, with respect to $h \in \mathcal{H}$, if $\max_{i,j \in [t_1, t_2]} |R_{D_i}(h) - R_{D_j}(h)| \le \eta$.

Figure 1b illustrates an η-permitted variation and a noisy abrupt change. A concept change is defined by:

Definition 2 (Concept change w.r.t. $h \in \mathcal{H}$). Let $I, J$ be two sequential stream segments with η-permitted variations w.r.t. $h$. There is a concept change of size $\Delta > \eta$ between $I$ and $J$ if $|R_J(h) - R_I(h)| \ge \Delta$.

The null and alternative hypotheses for a (noisy) abrupt change are defined as follows. Let $I = [1, k]$ and $J = [k+1, n]$ define the ranges of two consecutive stream segments with η-permitted variations, corresponding to the training and test windows. Then,

$$H_0 : |R_I(f) - R_J(f)| \le \Delta, \qquad H_1 : |R_I(f) - R_J(f)| \ge \Delta + \gamma(\eta). \qquad (1)$$

The function $\gamma(\eta) \ge 0$ depends on the permitted variation and other elements of the algorithm (see Theorem 3 below for further details). Notice that the hypotheses are parameterized by the prediction function $f$, which is trained on the training window $I$. Our analysis relies on algorithmic stability (Bousquet & Elisseeff, 2002), which is reviewed in the supplementary material. The following theorem bounds the difference between the risks under the null and alternative hypotheses; the proof is provided in the supplementary material. The slack $\gamma$ quantifies a region in which a change cannot be detected, as defined in (1); it shrinks as the algorithm becomes more stable and the training and test sets grow larger.

Theorem 3 (Detection of (noisy) abrupt change). Define $H_0$ and $H_1$ as in (1). For an algorithm with stability rate $\beta_n = O(\frac{1}{n})$, slack
$$\gamma = \frac{12 n p \beta_n (1-p) + 2\sqrt{2\log(4/\delta)/(np)} + 2\eta}{(1-p)(1+2p)}, \quad \text{where } p = \frac{|J|}{n},$$
and any $\delta \in (0, 1)$, we have that under $H_0$, with probability at least $1 - \delta$,
$$\hat{R}_{ord} - \hat{R}_{perm} \le 6 n p \beta_n (1-p) + \sqrt{2\log(4/\delta)/(np)} + \Delta(1-p)(1+2p) + \eta.$$
Under $H_1$, the reverse inequality holds with probability at least $1 - \delta$.

Theorem 3 states a threshold that differentiates between the two hypotheses with high probability, and may be used for the TEST routine in Algorithm 1. The role of each element in the bound is as follows:

1. Test window size $p = |J|/n$: the larger the window, the better the concentration of the empirical risk around the risk and, consequently, the tighter the bound. In practice, the window size corresponds to the maximum detection delay, and therefore one should choose it as large as possible while considering the tolerance to detection delay.
2. The significance rate $\delta$ bounds the missed detection and false detection rates; it is usually set in the range 0.01-0.05.
3. The hyperparameter $\Delta$ denotes the rate of change one wishes to ignore. The output of the test for different values of $\Delta$ may be computed after training the predictors once; therefore, detection as a function of $\Delta$ can be reported instead of a single binary value at no further computational cost.
4. The stability rate $\beta_n = O(\frac{1}{n})$ of the algorithm controls the variability of the loss. For a stable algorithm, $n p \beta_n (1-p) \to 0$ as $n \to \infty$.³
5. The rate $\eta$ bounds the permitted variations of the risk within an interval, and can be empirically estimated in advance. As can be expected, in the simplified scenario of an abrupt change ($\eta = 0$), the bound is tighter.

³Examples of stable algorithms are k-Nearest Neighbors, soft-margin SVM, SVM regression (SVR), and Regularized Least Squares (Bousquet & Elisseeff, 2002).

### 2.2. Gradual drift

Another concept drift type is a gradual drift. In many real-life problems, the change between two concepts happens gradually. For example, in fraud detection, the characteristics of fraudulent actions may have a slow transition phase. Gradual drifts may be modeled by a mixing stage, in which an increasing weight is given to the next concept and a decreasing one to the prior concept. We treat two cases of gradual drift: a faster one, where the change length is within the test window, and another corresponding to a gradual drift that evolves very slowly; see Figure 1c for an illustration. In the first case, there is a sufficient difference between the average risk in the test window and the previous concept, and thus the abrupt change algorithm may be used. In the second case, the drift is slower, and therefore the difference between the risks may be insufficient to cross the threshold. A slow drift is the most challenging to detect; however, when it goes undetected, it can slowly degrade the quality of the predictor. To treat this drift type we provide a lookahead procedure, presented in Algorithm 2. In this procedure, if the hypothesis test determines that there may be a change, we retain the same training window and move on to test the next test window.
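Both the basic scheme and this lookahead variant share the same outer structure. As a minimal sketch (not the authors' implementation), the loop of Algorithm 1 could be written as follows, assuming a hypothetical `detect` callable that wraps the shuffling routine sketched earlier together with one of the TEST variants (the Theorem 3 threshold or the permutation test of Section 3):

```python
def concept_drift_detection(stream_X, stream_y, m, detect):
    """Sketch of Algorithm 1's outer loop. `detect(X_tr, y_tr, X_te, y_te) -> bool`
    is assumed to shuffle the concatenated window P times and run TEST."""
    n = len(stream_y)
    t1, k = 0, m                      # training window [t1, k), test window [k, k + m)
    detections = []
    while k + m <= n:
        X_tr, y_tr = stream_X[t1:k], stream_y[t1:k]
        X_te, y_te = stream_X[k:k + m], stream_y[k:k + m]
        if detect(X_tr, y_tr, X_te, y_te):
            detections.append(k)      # record the change index
            t1 = k                    # drop pre-drift data from the training window
        k += m                        # if no drift was flagged, the test window joins the training set
    return detections
```

The lookahead variant of Algorithm 2 keeps this outer structure but, when the test raises only a warning rather than a detection, retains the training window and advances to the next test window for a bounded number of lookahead iterations.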
This process continues until a drift is detected, the test declares a false alarm, or the maximum number of lookahead iterations has been reached.

Algorithm 2: Slow Gradual Drift Detection Scheme (see Algorithm 1 for the detect procedure)
  Input: algorithm A, window size m, number of permutations P, maximal lookahead iterations M, Δ, warning and detection significance rates δw, δd.
  Output: a set of change indices D.
  Initialization: t1 ← 1, k ← m, D ← ∅
  while not end of stream do
    exit ← False; q ← 1
    while exit is False and q ≤ M do
      S_ord ← Z_{t1}^{k}; S'_ord ← Z_{k+(q-1)m+1}^{k+qm}
      if detect(A, S_ord, S'_ord, P, Δ, δd) is True then
        t1 ← k + 1; k ← k + m; D ← D ∪ {k}
        exit ← True
      else if detect(A, S_ord, S'_ord, P, Δ, δw) is False then
        k ← k + m; exit ← True
      q ← q + 1
  return D

We define the null and alternative hypotheses for detecting slow gradual change as follows. The hypotheses are directly defined for the noisy setting, so all stream segments have η-permitted variations w.r.t. $f$ (Definition 1). Let $m$ be the size of the test window, let $I = [1, k]$ be the training window, and let $J_q = [k + (q-1)m + 1,\, k + qm]$, for $q = 1, \ldots, M$, be the ranges of $M$ test stream segments. The hypotheses at the $M'$-th lookahead iteration ($M' \le M$) are:

$$H_0 : \forall q = 1, \ldots, M' \quad |R_I(f) - R_{J_q}(f)| \le \Delta, \qquad (2)$$
$$H_1 : \forall q = 1, \ldots, M'-1 \quad |R_I(f) - R_{J_q}(f)| \le \Delta \quad \text{and} \quad |R_I(f) - R_{J_{M'}}(f)| \ge \Delta + \gamma(\eta).$$

Recall that $S_{ord}$ denotes the training set. Denote the $q$th test set by $S'_{ord,q} = \{z_t : t \in J_q\}$. Let $\hat{R}_{perm,q}$ be the average empirical risk obtained on permutations of $[S_{ord}, S'_{ord,q}]$. The following theorem bounds the missed detection and false alarm rates of a gradual drift at iteration $M' \le M$.

Theorem 4 (Detection of slow gradual change). Define $H_0$ and $H_1$ as in (2). For an algorithm with stability rate $\beta_n = O(\frac{1}{n})$ and slack
$$\gamma = \frac{12 n p \beta_n (1-p) + 2\sqrt{2\log(4M'/\delta)/(np)} + 2\eta}{(1-p)(1+2p)}, \quad \text{where } p = \frac{m}{n},$$
we have that for any $\delta \in (0, 1)$ and
$$T = 6 n p \beta_n (1-p) + \sqrt{2\log(4M'/\delta)/(np)} + \Delta(1-p)(1+2p) + \eta,$$
under $H_0$, with probability at least $1 - \delta$, for all $q = 1, \ldots, M'$, $\hat{R}_{S'_{ord,q}}(f) - \hat{R}_{perm,q} \le T$. Under $H_1$, with probability at least $1 - \delta$,
$$\max_{q = 1, \ldots, M'-1} \hat{R}_{S'_{ord,q}}(f) - \hat{R}_{perm,q} \le T, \quad \text{and} \quad \hat{R}_{S'_{ord,M'}}(f) - \hat{R}_{perm,M'} \ge T.$$

The proof of the theorem relies on the bounds of Theorem 3 and the union bound.

## 3. Test Procedure

Using Theorems 3 and 4 themselves as testing procedures, with type-1 and type-2 error guarantees provided by the bounds, may result in conservative tests, as the bounds are based on large-deviation results. In this section we propose another hypothesis testing procedure for the TEST module in Algorithms 1 and 2. The theorems imply that the variability of the loss, encapsulated in the error stability of the algorithm, should be taken into account in the test procedure. Here we present a different perspective on these variations using permutation tests (Good, 2005). Permutation tests are a well-known methodology in statistics for obtaining most powerful statistical procedures, and they provide tighter thresholds for the testing procedure.

Our test statistic is $\hat{R}_{ord}$, the loss over the ordered train-test split. To test whether a change has taken place, we use a permutation test. The heart of the method is that under the null hypothesis the samples are exchangeable and therefore may be randomly shuffled. According to the null hypothesis, the average risks before and after the tested change index $k$ are $\Delta$-close.
Under this hypothesis, the samples are approximately exchangeable. Specifically, the more stable the algorithm is and the smaller $\Delta$ is, the more exchangeable the examples are. As the size of the training set grows, which happens when a change is not detected, the stability of the learner strengthens and ensures exchangeability. By shuffling the data multiple times we can obtain an estimate of the non-parametric distribution of the test statistic $\hat{R}_{ord}$ under the null hypothesis. Recall that $\hat{R}_{S'_i}(A_{S_i})$ is the empirical error obtained on the $i$th shuffled split. If there was a concept drift at time index $k$, we expect $\hat{R}_{ord}$ to be larger than if $H_0$ were true. The larger the value we observe, the stronger the evidence against $H_0$. Therefore, the achieved significance level of the procedure is $P_{S_i, S'_i}\big[\hat{R}_{ord} - \hat{R}_{S'_i}(A_{S_i}) \le \Delta \mid H_0\big]$. Algorithm 3 defines this testing procedure.

Algorithm 3: TEST(R̂_ord, {R̂_{S'_i}(A_{S_i})}_{i=1..P}, Δ, δ), the test procedure for the detection scheme
  Input: R̂_ord, {R̂_{S'_i}(A_{S_i})}_{i=1..P}, Δ, significance rate δ
  if (1 + Σ_{i=1}^{P} 1[R̂_ord - R̂_{S'_i}(A_{S_i}) ≤ Δ]) / (P + 1) ≤ δ then
    detection ← True
  else
    detection ← False
  return detection

Optimally, to get the exact distribution of the statistic, we would need to examine all possible train-test shuffles, but in practice far fewer are sufficient. Also note that the train-test procedure over the multiple splits can be straightforwardly parallelized.

In the gradual drift detection scheme, at iteration $M'$, $M'$ base hypotheses are combined, and therefore there is a need to correct for multiple comparisons. In Theorem 4, this is done by the union bound, which is equivalent to the Bonferroni correction in the statistical literature. In the permutation test, we use the Benjamini-Hochberg correction (Benjamini & Hochberg, 1995), which is less conservative.

## 4. Experiments

We compare the performance of our detection algorithm in classification and regression settings with other concept-drift detection methods and baseline methods. We use synthetic data and drifts synthesized in real data, thus controlling the drift points and allowing precise performance analysis.

Detection Algorithms: We denote by PERM and grad-PERM the application of Algorithms 1 and 2, respectively, applied with the testing procedure of Algorithm 3. In classification problems, we compare PERM's detection to three known methods for concept drift detection: DDM (Gama et al., 2004), the early drift detection method (EDDM) (Baena-García et al., 2006), and STEPD (Nishida & Yamauchi, 2007); see Section 1.1. We also compare the performance with learners with a fixed memory window (Window #, where # is the window size), a full memory that trains on all prior examples (No detection), and an optimal detector in hindsight that detects a drift at its onset (Exact detection). We set the sensitivity level of PERM and grad-PERM to $\delta = 0.01$, $\Delta = 0$, the warning and detection thresholds of STEPD to 0.05 and 0.01, respectively, and the parameters of EDDM to $\alpha = 0.95$ and $\beta = 0.90$.⁴

Performance Measures: Performance comparisons are done by evaluating detection quality and error rate. A True Positive (TP) detection is defined as a detection within a fixed delay range after the precise concept change time. This range is taken to be the size of the window in PERM and STEPD. A False Negative (FN) is defined as missing a detection within the delay range, and a False Positive (FP) as a detection outside this range or an extra detection within the range.
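Under these definitions, counting TP, FN, and FP for a detector's output against the known drift onsets could be sketched as follows; this is a minimal illustration with hypothetical names, not the authors' evaluation code.

```python
def count_detections(detected_indices, drift_onsets, delay_range):
    """Count true positives, false negatives, and false positives: a detection
    is a TP if it falls within `delay_range` samples after some drift onset,
    each onset is matched at most once, and every other detection is an FP."""
    matched = set()
    tp = 0
    for d in sorted(detected_indices):
        onset = next((o for o in drift_onsets
                      if o <= d <= o + delay_range and o not in matched), None)
        if onset is not None:
            matched.add(onset)
            tp += 1
    fp = len(detected_indices) - tp
    fn = len(drift_onsets) - tp
    return tp, fn, fp
```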
The detection quality is measured by both the Recall = TP/(TP + FN) and the Precision = TP/(TP + FP) of the detector. The prediction on a test example is made by a predictor trained on the previous examples.

### 4.1. Synthetic Data

In the following synthetic classification tasks, the base algorithm was K-Nearest Neighbors (k = 3), each stream was randomly repeated 100 times, and P = 100 reshuffling splits were used in PERM. The first stream, denoted "Mixed", represents abrupt drift with label noise. The features are two boolean (v, w) and two numeric (x, y) attributes; an example is labeled positive if at least two of the conditions v = 0, w = 0, y < 0.5 + 0.3 sin(3πx) are met. The classification is reversed at each concept change. The second dataset, denoted "Circles", presents a more challenging concept drift with label noise. We sample data uniformly from the unit square, and label an example as positive/negative if it resides inside/outside a moving and dynamically changing circle. We set the initial center and direction of movement at random, and at each concept drift gradually move the center and change the radius.⁵ The third and most challenging synthetic dataset is the rotating "Checkerboard" (Elwell & Polikar, 2011), where the examples are sampled uniformly from the unit square and the labels are set by a checkerboard with 0.2 tile width. At each concept drift, the checkerboard is rotated by an angle of π/20 radians.

⁴These hyperparameters were not optimized. The parameters δ and the detection threshold correspond to a P-value and are thus set to a standard significance value. The parameters used for DDM and EDDM were taken as recommended by their authors.
⁵When the radius is maximal (0.45), the circle shrinks (until it reaches 0.15) and its direction changes.

Table 1. Checkerboard dataset, gradual drift (100 repetitions).

| Algorithm | Detection delay | #runs/100 with no detection |
|---|---|---|
| PERM | 241.0 ± 113.2 | 0 |
| grad-PERM | 137.2 ± 115.7 | 0 |
| DDM | 569.6 ± 54.7 | 0 |
| EDDM | 494.0 ± 77.7 | 4 |
| STEPD | 455.3 ± 106.8 | 15 |

The results on the different streams, each consisting of 10 drifts and concept lengths of 1000 samples, are presented in Figure 2: rows correspond to datasets and columns to performance measures. Both the recall and the precision of the PERM algorithm are higher than those of the other methods. The performance difference in the error-noise curves is smaller on the Mixed and Circle datasets. These two problems are not hard, and therefore, while DDM and EDDM encounter many missed detections, their error rate remains low. On the Checkerboard dataset, the performance of PERM improves as the window range grows. This can be anticipated because as the test size grows the variation of the error decreases, which reduces the false detection rate.

Our last synthetic drift was generated using the Checkerboard dataset. In this experiment, a single gradual drift is initiated by increasing the label noise at time indices [1400, 1600, 1800, 2000] with corresponding noise rates [0.05, 0.1, 0.2, 0.3]. The growing label noise serves as a gradual drift which evolves slowly. The window size in PERM, grad-PERM, and STEPD was 100 samples. Table 1 shows that grad-PERM obtained the best detection rate.

### 4.2. User Preference Prediction

We compare detection performance on a user preference prediction task defined using the 20-newsgroups text dataset⁶, consisting of 18,846 documents and over 75,000 features. We apply the TF-IDF weighting scheme to the documents and partition the 20 groups into six subgroups according to subject, as presented in the repository.
We define two prediction problems: a binary classification problem of identifying whether a user likes/dislikes the subject, and a regression problem of rating user preference. In all our experiments the time series is generated by randomly reordering all the documents. We use P = 500, and SVM and SVR with a linear kernel as the learning algorithms.⁷

⁶http://qwone.com/~jason/20Newsgroups
⁷We used the scikit-learn (Machine Learning in Python) toolbox. Cross-validation on a single random concept showed low sensitivity to the choice of C on this dataset, and therefore the default value C = 1 was chosen.

We ran two experiments on the classification problem. In the first, after 1000 documents (about 50 documents from each newsgroup) a concept drift is activated, in which a random third of the binary ratings are switched and the rest remain as in the previous concept.⁸ In the second, we use the same concepts but with a gradual drift: after 500 documents from the first concept, sets of 100 documents are added with mixing proportions [0.8|0.2, 0.6|0.4, 0.4|0.6, 0.2|0.8] from the first and second concepts, respectively. Another 500 documents from the second concept follow.

⁸If at some point a concept has all-positive or all-negative labeling, the concept is reinitialized randomly.

Table 2. 20 Newsgroups, binary classification (50 repetitions). Detection of 10 drift points within a range of 100 samples after drift onset. The detection error for exact windowing is .19 ± .01.

| Algorithm | Recall | Precision | Error |
|---|---|---|---|
| PERM | 0.95 ± .06 | 0.92 ± .075 | .19 ± .01 |
| DDM | 0.67 ± .18 | 0.24 ± .06 | .21 ± .01 |
| EDDM | 0.70 ± .14 | 0.25 ± .08 | .23 ± .02 |
| STEPD | 0.51 ± 0.19 | 0.51 ± 0.19 | .21 ± .01 |
| Window 500 | - | - | .20 ± .01 |
| Window 1000 | - | - | .23 ± .02 |
| No Detection | - | - | .38 ± .04 |

Table 3. 20 Newsgroups, binary classification with gradual drift (100 repetitions).

| Algorithm | Detection delay | #runs/100 with no detection | False alarms before drift |
|---|---|---|---|
| PERM | 245 ± 172 | 14 | 0.15 ± 0.3 |
| grad-PERM | 169 ± 160 | 5 | 0.16 ± 0.3 |
| DDM | 321 ± 140 | 42 | 0.75 ± 0.7 |
| EDDM | 188 ± 155 | 26 | 1.6 ± 1.3 |
| STEPD | 358 ± 199 | 62 | 0.6 ± 0.5 |

In the regression problem, the label is initialized randomly in 1-10 and a concept drift is activated after 2000 documents. Two types of drift are considered: (I) the labels of all six groups undergo a random walk step of 3; (II) four randomly selected groups change their rating by 3. Scenario (II) is more challenging as the drift is smaller. The performance of PERM is compared only to that of "Window", "No detection", and "Exact detection", since DDM, EDDM, and STEPD are limited to zero-one classification problems. The results are presented in Tables 2 to 4. The performance of PERM for regular drift, and of grad-PERM for gradual drift, is the highest in both classification and regression.

## 5. Discussion

We presented a resampling scheme for concept-drift detection with respect to the risk of a learning algorithm. One of the advantages of the method is its applicability to any stable learning algorithm and any bounded loss function chosen as the most suitable for the task at hand. For example, a weighted loss may be used in a data imbalance scenario.
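As an illustration of this flexibility, a bounded, class-weighted zero-one loss (our own example, not part of the paper) could simply replace the plain misclassification rate inside the loss-evaluation sketch of Section 2.1; since it remains in [0, 1], the theory and the permutation test apply unchanged.

```python
import numpy as np

def weighted_zero_one_loss(y_true, y_pred, pos_weight=0.8):
    """Example of a bounded loss for imbalanced data: errors on the positive
    class (assumed to be labeled 1) get weight `pos_weight`, errors on the
    other class get 1 - pos_weight; the weighted average of 0/1 errors
    stays in [0, 1]."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    weights = np.where(y_true == 1, pos_weight, 1.0 - pos_weight)
    errors = (y_pred != y_true).astype(float)
    return float(np.sum(weights * errors) / np.sum(weights))
```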
[Figure 2. Synthetic datasets. Panels (a)-(c): Mixed dataset, detection range of 50 samples (Recall-Noise, Precision-Noise, Error-Noise). Panels (d)-(f): Circle dataset, detection range of 100 samples; a larger range was chosen to compensate for the class imbalance in this problem. Panels (g)-(i): Checkerboard dataset, Recall, Precision, and Error for different detection ranges. Curves compare PERM, DDM, EDDM, and STEPD; the error panels also include the Window 100, Window 1000, No Detection, and Exact Detection baselines.]

Table 4. 20 Newsgroups, regression (50 repetitions). Detection of 5 concept drifts, with a detection range of 200 samples.

(a) Detection rate of PERM.

| Scenario | Recall | Precision |
|---|---|---|
| (I) | 0.93 ± .12 | 0.93 ± .10 |
| (II) | 0.87 ± .17 | 0.89 ± .11 |

(b) Error rate. Exact detection: (I) 1.46 ± .16, (II) 1.48 ± .19.

| Scenario | PERM | Window 500 | Window 1000 | No Detection |
|---|---|---|---|---|
| (I) | 1.50 ± .17 | 1.60 ± .17 | 1.58 ± .13 | 2.30 ± .13 |
| (II) | 1.47 ± .19 | 1.59 ± .22 | 1.50 ± .19 | 2.14 ± .21 |

Our detection algorithm outputs time indices of concept changes that form windows of adaptive size, each comprising examples from a single concept. The detection times may be used for initiating a new training phase, as done in our experimental evaluation, but can also serve as a basis for ensemble learners, by providing an informed choice of the training windows for the ensemble's members. Our experiments show that the proposed scheme is more robust to noise and has better precision and recall rates than existing schemes. There is a bias-variance tradeoff in the choice of window size in PERM: a larger window reduces the variability, which increases the accuracy of the test, but causes a detection delay which may increase the error. In future work we plan to implement an online setting for the scheme, in which the learners of the reshuffled samples are preserved and updated. While the current algorithm is readily parallelized, this regime should enjoy a favorable computation time.

## Acknowledgments

This research was partially supported by the ISF under contract 890015 and by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI).

## References

Baena-García, M., Campo-Ávila, J., Fidalgo, R., Bifet, A., Gavaldà, R., and Morales-Bueno, R. Early drift detection method. In Stream KDD, pp. 77-86, 2006.

Basseville, M. and Nikiforov, I. V. Detection of Abrupt Changes: Theory and Application. Prentice-Hall, 1993.

Benjamini, Y. and Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSS, pp. 289-300, 1995.

Bifet, A. and Gavaldà, R. Learning from time-changing data with adaptive windowing. In ICDM, 2007.

Black, M. and Hickey, R. J. Maintaining the performance of a learned classifier under concept drift. IDA, 1999.

Bousquet, O. and Elisseeff, A. Stability and generalization. JMLR, 2:499-526, 2002.

Brodsky, B. E. and Darkhovsky, B. S. Nonparametric Methods in Change-Point Problems. Springer, 1993.
Chen, J. and Gupta, A. K. Parametric Statistical Change Point Analysis: With Applications to Genetics, Medicine, and Finance. Springer, 2012.

Dasu, T., Krishnan, S., Venkatasubramanian, S., and Yi, K. An information-theoretic approach to detecting changes in multi-dimensional data streams. In ISCSA, 2006.

Desobry, F., Davy, M., and Doncarli, C. An online kernel change detection algorithm. IEEE Trans. Signal Process., 53(8):2961-2974, 2005.

Elwell, R. and Polikar, R. Incremental learning of concept drift in nonstationary environments. IEEE Trans. Neural Netw., 22(10):1517-1531, 2011.

Gama, J., Medas, P., Castillo, G., and Rodrigues, P. Learning with drift detection. In Advances in Artificial Intelligence, SBIA 2004, pp. 286-295, 2004.

Good, P. I. Permutation, Parametric and Bootstrap Tests of Hypotheses. Springer, 2005.

Gustafsson, F. The marginalized likelihood ratio test for detecting abrupt changes. IEEE Trans. Autom. Control, 41(1):66-78, 1996.

Hulten, G., Spencer, L., and Domingos, P. Mining time-changing data streams. In KDD, 2001.

Kawahara, Y. and Sugiyama, M. Sequential change-point detection based on direct density-ratio estimation. Statistical Analysis and Data Mining, 5(2):114-127, 2012.

Kawahara, Y., Yairi, T., and Machida, K. Change-point detection in time-series data based on subspace identification. In ICDM, pp. 559-564, 2007.

Kearns, M. and Ron, D. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427-1453, 1999.

Kifer, D., Ben-David, S., and Gehrke, J. Detecting change in data streams. In VLDB, pp. 180-191, 2004.

Klinkenberg, R. Concept drift and the importance of examples. In Text Mining TAA, pp. 55-77, 2003.

Klinkenberg, R. Learning drifting concepts: Example selection vs. example weighting. IDA, 8:281-300, 2004.

Klinkenberg, R. and Joachims, T. Detecting concept drift with support vector machines. In ICML, 2000.

Kolter, J. Z. and Maloof, M. A. Dynamic weighted majority: An ensemble method for drifting concepts. JMLR, 8:2755-2790, 2007.

Lavielle, M. and Teyssiere, G. Detection of multiple change-points in multivariate time series. Lithuanian Mathematical Journal, 46(3):287-306, 2006.

Liu, S., Yamada, M., Collier, N., and Sugiyama, M. Change-point detection in time-series data by relative density-ratio estimation. Neural Networks, 2013.

Lung-Yut-Fong, A., Lévy-Leduc, C., and Cappé, O. Homogeneity and change-point detection tests for multivariate data using rank statistics. arXiv:1107.1971, 2011.

Masud, M., Gao, J., Khan, L., Han, J., and Thuraisingham, B. A multi-partition multi-chunk ensemble technique to classify concept-drifting data streams. In KDD, 2009.

Moskvina, V. and Zhigljavsky, A. An algorithm based on singular spectrum analysis for change-point detection. Commun. Stat. Simulat., 32(2):319-352, 2003.

Nishida, K. and Yamauchi, K. Detecting concept drift using statistical testing. In DS, pp. 264-269, 2007.

Sebastião, R. and Gama, J. Change detection in learning histograms from data streams. In PAI. Springer, 2007.

Stanley, K. O. Learning concept drift with a committee of decision trees. Technical report, 2003.

Street, W. and Kim, Y. A streaming ensemble algorithm (SEA) for large-scale classification. In KDD, 2001.

Takeuchi, J. and Yamanishi, K. A unifying framework for detecting outliers and change points from time series. IEEE Trans. Knowl. Data Eng., 18(4):482-492, 2006.

Tsymbal, A. The problem of concept drift: definitions and related work. Technical report, 2004.

Vapnik, V. N.
Statistical Learning Theory. Wiley, 1998.

Vovk, V., Nouretdinov, I., and Gammerman, A. Testing exchangeability on-line. In ICML, pp. 768-775, 2003.

Wang, H., Fan, W., Yu, P., and Han, J. Mining concept-drifting data streams using ensemble classifiers. In KDD, 2003.