# M-Statistic for Kernel Change-Point Detection

Shuang Li, Yao Xie
H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology
sli370@gatech.edu, yao.xie@isye.gatech.edu

Hanjun Dai, Le Song
Computational Science and Engineering, College of Computing, Georgia Institute of Technology
hanjundai@gatech.edu, lsong@cc.gatech.edu

Abstract

Detecting the emergence of an abrupt change-point is a classic problem in statistics and machine learning. Kernel-based nonparametric statistics have been proposed for this task, making fewer assumptions on the distributions than traditional parametric approaches. However, none of the existing kernel statistics provides a computationally efficient way to characterize the extremal behavior of the statistic. Such characterization is crucial for setting the detection threshold, to control the significance level in the offline case as well as the average run length in the online case. In this paper we propose two related, computationally efficient M-statistics for kernel-based change-point detection when the amount of background data is large. A novel theoretical result of the paper is the characterization of the tail probability of these statistics using a new technique based on change-of-measure. Such characterization provides accurate detection thresholds for both the offline and online cases in a computationally efficient manner, without the need to resort to more expensive simulations such as bootstrapping. We show that our methods perform well on both synthetic and real-world data.

1 Introduction

Detecting the emergence of abrupt change-points is a classic problem in statistics and machine learning. Given a sequence of samples, x1, x2, . . . , xt, from a domain X, we are interested in detecting a possible change-point τ, such that before τ the samples xi ∼ P i.i.d. for i ≤ τ, where P is the so-called background distribution, and after the change-point the samples xi ∼ Q i.i.d. for i ≥ τ + 1, where Q is a post-change distribution. Here the time horizon t can either be a fixed number t = T0 (called an offline or fixed-sample problem), or t is not fixed and we keep receiving new samples (called a sequential or online problem). Our goal is to detect the existence of the change-point in the offline setting, or to detect the emergence of a change-point as soon as possible after it occurs in the online setting. We restrict our attention to detecting a single change-point, a setting that arises often in monitoring problems. One such example is seismic event detection [9], where we would like to detect the onset of the event precisely in retrospect, to better understand earthquakes, or as quickly as possible from streaming data. Ideally, the detection algorithm should also be robust to distributional assumptions, since we wish to detect all kinds of seismic events that differ from the background. Typically we have a large amount of background data (since seismic events are rare), and we want the algorithm to exploit these data while remaining computationally efficient. Classical approaches to change-point detection are usually parametric, meaning that they rely on strong assumptions on the distribution. Nonparametric and kernel approaches are distribution-free and more robust, as they provide consistent results over larger classes of data distributions (they can possibly be less powerful in settings where a clear distributional assumption can be made).
In particular, many kernel-based statistics have been proposed in the machine learning literature [5, 2, 18, 6, 7, 1], which typically work better on real data while making few assumptions. However, none of these existing kernel statistics provides a computationally efficient way to characterize the tail probability of the extremal value of these statistics. Characterizing such tail probabilities is crucial for setting the correct detection thresholds for both the offline and online cases. Furthermore, efficiency is also an important consideration, since the amount of background data is typically very large. In this case, one has the freedom to restructure and sample the background data during the statistical design to gain computational efficiency. On the other hand, change-point detection problems are related to statistical two-sample testing; however, they are usually more difficult in that for change-point detection we need to search for the unknown change-point location τ. For instance, in the offline case this corresponds to taking a maximum of a series of statistics, each corresponding to one putative change-point location (a similar idea was used in [5] for the offline case), and in the online case we have to characterize the average run length of the test statistic hitting the threshold, which necessarily results in taking a maximum of the statistics over time. Moreover, the statistics being maximized over are usually highly correlated. Hence, analyzing the tail probabilities of the test statistic for change-point detection typically requires more sophisticated probabilistic tools.

In this paper, we design two related M-statistics for change-point detection based on the kernel maximum mean discrepancy (MMD) for two-sample testing [3, 4]. Although MMD has a nice unbiased and minimum-variance U-statistic estimator (MMD_u), it cannot be applied directly, since MMD_u costs O(n²) to compute from a sample of n data points. In the change-point detection case, this translates into a complexity that grows quadratically with the number of background observations and the detection time horizon t. Therefore, we adopt a strategy inspired by the recently developed B-test statistic [17] and design an O(n) statistic for change-point detection. At a high level, our methods sample N blocks of background data of size B, compute the quadratic-time MMD_u² of each reference block with the post-change block, and then average the results. However, different from the simple two-sample test case, in order to provide an accurate change-point detection threshold, the background blocks need to be designed in a novel structured way in the offline setting and updated recursively in the online setting. Besides presenting the new M-statistics, our contributions also include: (1) deriving accurate approximations to the significance level in the offline case and the average run length in the online case for our M-statistics, which enable us to determine thresholds efficiently without resorting to onerous simulations (e.g., repeated bootstrapping); (2) obtaining a closed-form variance estimator that allows us to form the M-statistic easily; (3) developing novel structured ways to design background blocks in the offline setting and update rules in the online setting, which also lead to correlation structures of our statistics that enable accurate approximations of the tail probability.
To approximate the asymptotic tail probabilities, we adopt a highly sophisticated technique based on change-of-measure, recently developed in a series of papers by Yakir, Siegmund, and colleagues [16]. The numerical accuracy of our approximations is validated by numerical examples. We demonstrate the good performance of our method using real speech and human activity data. We also find that, in the two-sample testing scenario, it is always beneficial to increase the block size B, as the distributions of the statistic under the null and the alternative become better separated; however, this is no longer the case in online change-point detection, because a larger block size inevitably causes a larger detection delay. Finally, we point to future directions for relaxing our Gaussian approximation and correcting for the skewness of the kernel-based statistics.

2 Background and Related Work

We briefly review kernel-based methods and the maximum mean discrepancy. A reproducing kernel Hilbert space (RKHS) F on X with a kernel k(x, x′) is a Hilbert space of functions f(·): X → ℝ with inner product ⟨·, ·⟩_F. Its element k(x, ·) satisfies the reproducing property ⟨f(·), k(x, ·)⟩_F = f(x), and consequently ⟨k(x, ·), k(x′, ·)⟩_F = k(x, x′), meaning that we can view the evaluation of a function f at any point x ∈ X as an inner product. Assume there are two sets of n observations each from a domain X, where X = {x1, x2, . . . , xn} are drawn i.i.d. from distribution P and Y = {y1, y2, . . . , yn} are drawn i.i.d. from distribution Q. The maximum mean discrepancy (MMD) is defined as [3]

$$\mathrm{MMD}_0[\mathcal{F}, P, Q] := \sup_{f \in \mathcal{F}} \big\{ \mathbb{E}_x[f(x)] - \mathbb{E}_y[f(y)] \big\}.$$

An unbiased estimate of MMD_0² can be obtained using the U-statistic

$$\mathrm{MMD}_u^2[\mathcal{F}, X, Y] = \frac{1}{n(n-1)} \sum_{i \neq j} h(x_i, x_j, y_i, y_j), \qquad (1)$$

where h(·) is the kernel of the U-statistic, defined as

$$h(x_i, x_j, y_i, y_j) = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i).$$

Intuitively, the empirical test statistic MMD_u² is expected to be small (close to zero) if P = Q, and large if P and Q are far apart. The complexity of evaluating (1) is O(n²), since we have to form the so-called Gram matrix for the data. Under H0 (P = Q), the U-statistic is degenerate and distributed the same as an infinite sum of chi-square variables. To improve computational efficiency and obtain an easy-to-compute threshold for hypothesis testing, [17] recently proposed an alternative statistic for MMD_0², called the B-test. The key idea of the approach is to partition the n samples from P and Q into N non-overlapping blocks, X1, . . . , XN and Y1, . . . , YN, each of constant size B. Then MMD_u²[F, Xi, Yi] is computed for each pair of blocks and averaged over the N blocks, resulting in

$$\mathrm{MMD}_B^2[\mathcal{F}, X, Y] = \frac{1}{N} \sum_{i=1}^{N} \mathrm{MMD}_u^2[\mathcal{F}, X_i, Y_i].$$

Since B is constant, N = O(n), and the computational complexity of MMD_B²[F, X, Y] is O(B²n), a significant reduction compared to MMD_u²[F, X, Y]. Furthermore, by averaging MMD_u²[F, Xi, Yi] over independent blocks, the B-statistic is asymptotically normal by the central limit theorem. This latter property also allows a simple threshold to be derived for the two-sample test, rather than resorting to the more expensive bootstrapping approach. Our statistics are inspired by the B-statistic. However, the change-point detection setting requires significant new derivations to obtain the test threshold, since one cares about the maximum of MMD_B²[F, X, Y] computed at different points in time.
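As a concrete illustration of (1) and the B-test averaging, the following minimal sketch (not the authors' code; the Gaussian kernel choice and function names are illustrative assumptions) computes the quadratic-time MMD_u² estimator and its block-averaged version.

```python
# Minimal sketch of the unbiased MMD_u^2 estimator (1) and the B-test averaging.
# The Gaussian kernel and function names are illustrative assumptions.
import numpy as np

def gaussian_kernel(X, Y, sigma):
    """Gram matrix with k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_u(X, Y, sigma):
    """Unbiased U-statistic estimator MMD_u^2 in (1); O(n^2) per pair of samples."""
    n = X.shape[0]
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    np.fill_diagonal(Kxx, 0.0)   # the U-statistic excludes the i = j terms
    np.fill_diagonal(Kyy, 0.0)
    np.fill_diagonal(Kxy, 0.0)   # h(.) also excludes the i = j cross terms
    return (Kxx.sum() + Kyy.sum() - 2 * Kxy.sum()) / (n * (n - 1))

def mmd2_b(X, Y, sigma, B):
    """B-test statistic: average of MMD_u^2 over N = n // B disjoint blocks of size B."""
    N = X.shape[0] // B
    vals = [mmd2_u(X[i*B:(i+1)*B], Y[i*B:(i+1)*B], sigma) for i in range(N)]
    return np.mean(vals)
```

With B fixed, the cost of `mmd2_b` grows linearly in n, which is what makes the block-averaged form attractive when the background sample is large.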
Moreover, the change-point detection case involves a sum of highly correlated MMD statistics, because these MMD_B² terms are formed with a common test block of data. This is unavoidable in our change-point detection problems because there is much less test data than reference data. Hence, we cannot use the central limit theorem (even a martingale version), but have to adopt the aforementioned change-of-measure approach.

Related work. Other nonparametric change-point detection approaches have been proposed in the literature. In the offline setting, [5] designs a kernel-based test statistic based on a so-called running maximum partition strategy to test for the presence of a change-point; [18] studies a related problem in which there are s anomalous sequences out of n sequences to be detected, and constructs a test statistic using MMD. In the online setting, [6] presents a meta-algorithm that compares data in a reference window to the data in the current window, using empirical distance measures (not kernel-based); [1] detects abrupt changes by comparing two sets of descriptors extracted online from the signal at each time instant, the immediate past set and the immediate future set: based on a soft-margin single-class support vector machine (SVM), they build a dissimilarity measure (asymptotically equivalent to the Fisher ratio in the Gaussian case) in the feature space between those sets, without estimating densities as an intermediate step; [7] uses density-ratio estimation to detect change-points, modeling the density ratio with a non-parametric Gaussian kernel model whose parameters are updated online through stochastic gradient descent. These works lack a theoretical analysis of the extremal behavior of the statistics or of the average run length.

3 M-statistic for offline and online change-point detection

Consider a sequence of observations {. . . , x−2, x−1, x0, x1, . . . , xt}, xi ∈ X, with {. . . , x−2, x−1, x0} denoting the sequence of background (or reference) data. Assume a large amount of reference data is available. Our goal is to detect the existence of a change-point τ, such that before the change-point samples are i.i.d. with a distribution P, and after the change-point samples are i.i.d. with a different distribution Q. The location τ where the change-point occurs is unknown. We may formulate this problem as a hypothesis test, where the null hypothesis states that there is no change-point and the alternative hypothesis is that there exists a change-point at some time τ. We construct our kernel-based M-statistic using the maximum mean discrepancy (MMD) to measure the difference between the distributions of the reference and the test data. We denote by Y the block of data which potentially contains a change-point (also referred to as the post-change block or test block). In the offline setting, we assume the size of Y can be up to Bmax, and we want to search for a location of the change-point within Y such that observations after it are from a different distribution. Inspired by the idea of the B-test [17], we sample N reference blocks of size Bmax independently from the reference pool and index them as X_i^(Bmax), i = 1, . . . , N. Since we search for a location B (2 ≤ B ≤ Bmax) within Y for a change-point, we construct a sub-block from Y by taking B contiguous data points, and denote it by Y^(B).
To form the statistic, we correspondingly construct sub-blocks from each reference block by taking B contiguous data points out of that block, and index these sub-blocks as X_i^(B) (illustrated in Fig. 1(a)).

Figure 1: Illustration of (a) the offline case: data are split into blocks of size Bmax, indexed backwards from time t, and we consider sub-blocks of size B, B = 2, . . . , Bmax; (b) the online (sequential) case. We assume a large amount of reference (background) data following the null distribution is available in the reference pool.

We then compute MMD_u² between (X_i^(B), Y^(B)) and average over the blocks:

$$Z_B := \frac{1}{N} \sum_{i=1}^{N} \mathrm{MMD}_u^2\big(X_i^{(B)}, Y^{(B)}\big) = \frac{1}{N B (B-1)} \sum_{i=1}^{N} \sum_{j \neq l} h\big(X_{i,j}^{(B)}, X_{i,l}^{(B)}, Y_j^{(B)}, Y_l^{(B)}\big), \qquad (2)$$

where X_{i,j}^(B) denotes the jth sample in X_i^(B), and Y_j^(B) denotes the jth sample in Y^(B). Due to the properties of MMD_u², under the null hypothesis E[Z_B] = 0. Let Var[Z_B] denote the variance of Z_B under the null; its expression is given by (6) in the following section. The variance depends on the block size B and the number of blocks N: as B increases, Var[Z_B] decreases (also illustrated in Figure 5 in the appendix). Considering this, we standardize the statistic, maximize over all values of B to define the offline M-statistic, and detect a change-point whenever the M-statistic exceeds a threshold b > 0:

$$M := \max_{B \in \{2, 3, \ldots, B_{\max}\}} \underbrace{Z_B / \sqrt{\mathrm{Var}[Z_B]}}_{Z'_B} > b, \qquad \text{(offline change-point detection)} \qquad (3)$$

where varying the block size from 2 to Bmax corresponds to searching for the unknown change-point location.

In the online setting, suppose the post-change block Y has size B0 and we construct it using a sliding window. In this case, the potential change-point is declared at the end of each block Y. To form the statistic, we take NB0 samples without replacement (since we assume the reference data are i.i.d. with distribution P) from the reference pool to form N reference blocks, compute the quadratic-time MMD_u² statistic between each reference block and the post-change block, and then average them. When a new sample arrives (time moves from t to t + 1), we append it to the post-change block, remove the oldest sample from the post-change block, and move that sample to the reference pool. The reference blocks are updated accordingly: the end point of each reference block is moved to the reference pool, and a new point is sampled from the pool and appended to the front of each reference block, as shown in Fig. 1(b). Using the sliding-window scheme described above, we may similarly define an online M-statistic by forming a standardized average of the MMD_u² between the post-change block in the sliding window and the reference blocks:

$$Z_{B_0,t} := \frac{1}{N} \sum_{i=1}^{N} \mathrm{MMD}_u^2\big(X_i^{(B_0,t)}, Y^{(B_0,t)}\big), \qquad (4)$$

where B0 is the fixed block size, X_i^(B0,t) is the ith reference block of size B0 at time t, and Y^(B0,t) is the post-change block of size B0 at time t. In the online case, we have to characterize the average run length of the test statistic hitting the threshold, which necessarily results in taking a maximum of the statistics over time. The online change-point detection procedure is a stopping time: we detect a change-point whenever the standardized Z_{B0,t} exceeds a pre-determined threshold b > 0:

$$T = \inf\Big\{ t : \underbrace{Z_{B_0,t} / \sqrt{\mathrm{Var}[Z_{B_0}]}}_{M_t} > b \Big\}. \qquad \text{(online change-point detection)} \qquad (5)$$

Note that in the online case, we effectively take a maximum of the standardized statistics over time.
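For concreteness, here is a minimal sketch of evaluating the offline M-statistic (3). It is an illustration under assumptions rather than the authors' implementation: it reuses `mmd2_u` from the earlier sketch, takes each sub-block as the most recent B points of its size-Bmax block (one of several valid contiguous choices), and expects `var_of_zb[B]` to hold an estimate of Var[Z_B] under the null (for example via Lemma 1 in the next section).

```python
# Minimal sketch of the offline M-statistic (3). Assumes mmd2_u from the earlier
# sketch; sub-blocks are taken as the most recent B points of each block, and
# var_of_zb[B] is a precomputed estimate of Var[Z_B] under the null.
import numpy as np

def offline_m_statistic(ref_blocks, Y, sigma, var_of_zb):
    """ref_blocks: list of N arrays of shape (Bmax, d); Y: test block of shape (Bmax, d)."""
    Bmax = Y.shape[0]
    best_M, best_B = -np.inf, None
    for B in range(2, Bmax + 1):
        Yb = Y[-B:]                                         # sub-block Y^(B)
        Zb = np.mean([mmd2_u(Xi[-B:], Yb, sigma) for Xi in ref_blocks])
        Mb = Zb / np.sqrt(var_of_zb[B])                     # standardized statistic Z'_B
        if Mb > best_M:
            best_M, best_B = Mb, B
    return best_M, best_B  # declare a change-point if best_M exceeds the threshold b
```

The returned best_B indicates the sub-block size at which the maximum is attained, which corresponds to the estimated change-point location within Y.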
There is a recursive way to calculate the online M-statistic efficiently, explained in Section A of the appendix. At the stopping time T, we claim that there exists a change-point. There is a tradeoff in choosing the block size B0 in the online setting: a small block size incurs a smaller computational cost, which may be important in the online case, and it also enables a smaller detection delay when the change-point magnitude is strong; however, the disadvantage of a small B0 is lower power, which corresponds to a longer detection delay when the change-point magnitude is weak (for example, when the amplitude of the mean shift is small). Examples of offline and online M-statistics are shown in Fig. 2, based on synthetic data and a segment of a real seismic signal. The proposed offline M-statistic reliably detects the existence of a change-point and accurately pinpoints where the change occurs; the online M-statistic hits the threshold quickly once the change happens.

Figure 2: Examples of the offline and online M-statistics with N = 5: (a) and (b), offline case without and with a change-point (Bmax = 500; the maximum is attained at B = 263); (c) online case with a change-point at τ = 250 and stopping time T = 268 (detection delay 18), using B0 = 50; (d) a real seismic signal and the M-statistic with two kernel bandwidths (100× and 0.1× the median heuristic). The synthetic data are Normal(0, 1) before the change and Laplace(0, 1) after. All thresholds are theoretical values and are marked in red.

4 Theoretical Performance Analysis

We obtain an analytical expression for the variance Var[Z_B] appearing in (3) and (5) by leveraging the correspondence between the MMD_u² statistic and U-statistics [11] (since Z_B is a form of U-statistic) and exploiting known properties of U-statistics. We also derive the covariance structure of the online and offline standardized Z_B statistics, which is crucial for proving Theorems 3 and 4.

Lemma 1 (Variance of Z_B under the null). Given any fixed block size B and number of blocks N, under the null hypothesis,

$$\mathrm{Var}[Z_B] = \binom{B}{2}^{-1} \left[ \frac{1}{N}\,\mathbb{E}\big[h^2(x, x', y, y')\big] + \frac{N-1}{N}\,\mathrm{Cov}\big[h(x, x', y, y'),\ h(x'', x''', y, y')\big] \right], \qquad (6)$$

where x, x′, x″, x‴, y, and y′ are i.i.d. with the null distribution P.

Lemma 1 suggests an easy way to estimate the variance Var[Z_B] from the reference data. To estimate (6), we first estimate E[h²(x, x′, y, y′)] by repeatedly drawing four samples without replacement from the reference data, using them as x, x′, y, y′, evaluating the sampled function value, and forming a Monte Carlo average. Similarly, we may estimate Cov[h(x, x′, y, y′), h(x″, x‴, y, y′)].

Lemma 2 (Covariance structure of the standardized Z_B statistics). Under the null hypothesis, given u and v in [2, Bmax], for the offline case

$$r_{u,v} := \mathrm{Cov}\big(Z'_u, Z'_v\big) = \frac{\sqrt{u(u-1)\,v(v-1)}}{(u \vee v)\big((u \vee v) - 1\big)}, \qquad (7)$$

where u ∨ v = max{u, v}; and for the online case,

$$\mathrm{Cov}\big(M_t, M_{t+s}\big) = \Big(1 - \frac{s}{B_0}\Big)\Big(1 - \frac{s}{B_0 - 1}\Big), \qquad s \geq 0.$$
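As a concrete illustration of the Monte Carlo estimate suggested by Lemma 1, the sketch below (one possible implementation written under stated assumptions, not the authors' code) draws six reference samples without replacement per Monte Carlo draw, so the same loop yields estimates of both E[h²] and the covariance term in (6); the number of draws is arbitrary.

```python
# Minimal sketch of estimating Var[Z_B] under the null from reference data via (6).
# ref_data: numpy array of shape (n, d); sigma: kernel bandwidth; n_mc is an
# arbitrary illustrative number of Monte Carlo draws.
import numpy as np

def estimate_var_zb(ref_data, sigma, B, N, n_mc=10000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    k = lambda a, b: np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))
    h = lambda x, xp, y, yp: k(x, xp) + k(y, yp) - k(x, yp) - k(xp, y)
    h_sq, h_prod = [], []
    for _ in range(n_mc):
        # draw (x, x', y, y', x'', x''') without replacement from the reference pool
        x, xp, y, yp, x2, xp2 = ref_data[rng.choice(len(ref_data), 6, replace=False)]
        h_sq.append(h(x, xp, y, yp) ** 2)                    # for E[h^2(x, x', y, y')]
        h_prod.append(h(x, xp, y, yp) * h(x2, xp2, y, yp))   # shares (y, y'): covariance term
    e_h2, cov_h = np.mean(h_sq), np.mean(h_prod)             # E[h] = 0 under the null, so E[h h'] = Cov
    return (2.0 / (B * (B - 1))) * (e_h2 / N + (N - 1) / N * cov_h)
```

The resulting estimates can be stored in the `var_of_zb` table used by the offline M-statistic sketch above, one entry per candidate block size B.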
In the offline setting, the choice of the threshold b involves a tradeoff between two standard performance metrics: (i) the significance level (SL), the probability that the M-statistic exceeds the threshold b under the null hypothesis (i.e., when there is no change-point); and (ii) power, the probability that the statistic exceeds the threshold under the alternative hypothesis. In the online setting, there are two analogous performance metrics commonly used for analyzing change-point detection procedures [15]: (i) the expected value of the stopping time when there is no change, the average run length (ARL); and (ii) the expected detection delay (EDD), defined as the expected stopping time in the extreme case where a change occurs immediately at τ = 0. We focus on analyzing the SL and ARL of our methods, since they play key roles in setting thresholds. We derive accurate approximations to these quantities as functions of the threshold b, so that given a prescribed SL or ARL, we can solve for the corresponding b analytically. Let P∞ and E∞ denote, respectively, the probability measure and the expectation under the null.

Theorem 3 (SL in the offline case). When b → ∞ and b/√Bmax → c for some constant c, the significance level of the offline M-statistic defined in (3) is given by

$$\mathbb{P}_\infty\Big\{ \max_{B \in \{2, 3, \ldots, B_{\max}\}} Z'_B > b \Big\} = \frac{b\, e^{-b^2/2}}{\sqrt{2\pi}} \sum_{B=2}^{B_{\max}} \frac{2B-1}{2B(B-1)}\, \nu\!\left( b \sqrt{\frac{2B-1}{B(B-1)}} \right) \big[1 + o(1)\big], \qquad (8)$$

where the special function ν(u) can be approximated as

$$\nu(u) \approx \frac{(2/u)\big(\Phi(u/2) - 0.5\big)}{(u/2)\,\Phi(u/2) + \phi(u/2)},$$

and φ and Φ are the probability density function and the cumulative distribution function of the standard normal distribution, respectively.

The proof of Theorem 3 uses a change-of-measure argument based on the likelihood ratio identity (see, e.g., [12, 16]). The likelihood ratio identity relates the computation of the tail probability under the null to a sum of expectations, each under an alternative distribution indexed by a particular parameter value. To illustrate, assume the probability density function (pdf) under the null is f(u). Given a function g_ω(x), with ω in some index set Ω, we may introduce a family of alternative distributions with pdf

$$f_\omega(u) = e^{\theta g_\omega(u) - \psi_\omega(\theta)} f(u), \qquad \psi_\omega(\theta) := \log \int e^{\theta g_\omega(u)} f(u)\, du,$$

where ψ_ω(θ) is the log moment generating function and θ is a parameter to which we may assign an arbitrary value; it is easily verified that f_ω(u) is a pdf. Using this family of alternatives, we may calculate the probability of an event A under the original distribution f by calculating a sum of expectations of the form E_ω[e^{−ℓ_ω}; A], where E[U; A] := E[U · 1{A}], the indicator 1{A} is one when event A is true and zero otherwise, E_ω is the expectation under the pdf f_ω(u), ℓ_ω = log[f_ω(u)/f(u)] = θ g_ω(u) − ψ_ω(θ) is the log-likelihood ratio, and we have the freedom to choose a different θ for each f_ω. The basic idea of change-of-measure in our setting is to treat Z′_B := Z_B/√Var[Z_B] as a random field indexed by B. To characterize the SL, we then need to study the tail probability of the maximum of this random field. Relating this to the setting above, Z′_B corresponds to g_ω(u), B corresponds to ω, and A corresponds to the threshold-crossing event. To compute the expectations under the alternative measures, we take a few steps. First, we choose a parameter value θ_B for the pdf associated with each B such that ψ′_B(θ_B) = b. This is equivalent to setting the mean under each alternative probability to the threshold b, that is, E_B[Z′_B] = b, which allows us to use the local central limit theorem, since under the alternative measure the boundary-crossing event has a much larger probability.
Second, we express the random quantities involved in the expectations as functions of the so-called local field terms {ℓ_s − ℓ_B : s = B, B ± 1, . . .}, as well as the re-centered log-likelihood ratio ℓ̃_B = ℓ_B − b. We show that these are asymptotically independent as b → ∞ with b growing on the order of √B, which further simplifies the calculation. The last step is to analyze the covariance structure of the random field (Lemma 2) and approximate it by a Gaussian random field. Note that the terms Z′_u and Z′_v have non-negligible correlation due to our construction: they share the same post-change block Y^(B). We then apply the localization theorem (Theorem 5.2 in [16]) to obtain the final result.

Theorem 4 (ARL in the online case). When b → ∞ and b/√B0 → c′ for some constant c′, the average run length (ARL) of the stopping time T defined in (5) is given by

$$\mathbb{E}_\infty[T] = \frac{\sqrt{2\pi}\, e^{b^2/2}}{b \cdot \dfrac{2B_0 - 1}{B_0(B_0 - 1)} \cdot \nu\!\left( b\sqrt{\dfrac{2(2B_0 - 1)}{B_0(B_0 - 1)}} \right)} \, \big[1 + o(1)\big]. \qquad (9)$$

The proof of Theorem 4 is similar to that of Theorem 3, due to the fact that, for a given m > 0,

$$\mathbb{P}_\infty\{T \leq m\} = \mathbb{P}_\infty\Big\{ \max_{1 \leq t \leq m} M_t > b \Big\}.$$

Hence, we also need to study the tail probability of the maximum of a random field M_t = Z_{B0,t}/√Var[Z_{B0,t}] for a fixed block size B0. A similar change-of-measure approach can be used, except that the covariance structure of M_t in the online case is slightly different from the offline case. This tail probability turns out to take the form P∞{T ≤ m} = mλ + o(1). Using arguments similar to those in [13, 14], one can show that T is asymptotically exponentially distributed; hence P∞{T ≤ m} − [1 − exp(−λm)] → 0, and consequently E∞[T] ≈ λ⁻¹, which leads to (9).

Theorem 4 shows that the ARL grows exponentially in b² and, hence, b = O(√(log ARL)). On the other hand, the EDD is typically on the order of b/D by Wald's identity [12] (although a more careful analysis should be carried out in future work), where D is the Kullback-Leibler (KL) divergence between the null and alternative distributions (on the order of a constant). Hence, given a desired ARL (typically on the order of 5000 or 10000), an error in the estimated threshold translates only linearly into the EDD; a reasonably accurate b therefore causes little performance loss in EDD. Similarly, Theorem 3 shows that the SL decays exponentially in b², and an analogous argument applies in the offline case.

5 Numerical examples

We test the performance of the M-statistic using simulations and real-world data. Here we only highlight the main results; more details can be found in Appendix C. In the following examples, we use a Gaussian kernel, k(Y, Y′) = exp(−‖Y − Y′‖²/(2σ²)), where σ > 0 is the kernel bandwidth, chosen by the median trick [10, 8] applied to the background data.

Accuracy of Lemma 1 for estimating Var[Z_B]. Fig. 5 in the appendix shows the empirical distributions of Z_B for B = 2 and B = 200, with N = 5. In both cases, we generate 10000 random instances computed from data following N(0, I), I ∈ ℝ^{20×20}, representing the null distribution. We also plot the Gaussian pdf with the sample mean and sample variance, which matches the empirical distribution well. Note that the approximation works better when the block size decreases. (The skewness of the statistic can be corrected; see the discussion in Section 7.)

Accuracy of theoretical results for estimating thresholds. For the offline case, we compare the thresholds obtained from numerical simulations, from bootstrapping, and from our approximation in Theorem 3, for various SL values α. We choose the maximum block size to be Bmax = 20.
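To illustrate how the approximations are used to set thresholds, the following minimal sketch implements the special function ν(u) and numerically inverts the approximations (8) and (9) for b, given a target significance level α or ARL. The constants follow the formulas as reconstructed above (an assumption, to be checked against the original derivations), and the bracketing interval for the root search is an arbitrary illustrative choice.

```python
# Minimal sketch: set thresholds by numerically inverting the tail approximations
# (8) and (9) as reconstructed above; not a verified transcription of the paper.
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def nu(u):
    """Approximation nu(u) ~ (2/u)[Phi(u/2) - 0.5] / [(u/2)Phi(u/2) + phi(u/2)]."""
    return (2.0 / u) * (norm.cdf(u / 2) - 0.5) / ((u / 2) * norm.cdf(u / 2) + norm.pdf(u / 2))

def offline_sl(b, Bmax):
    """Approximate significance level P{M > b} for the offline statistic, per (8)."""
    B = np.arange(2, Bmax + 1)
    mu = (2 * B - 1) / (2.0 * B * (B - 1))     # local decay rate of the covariance in B
    return b * np.exp(-b**2 / 2) / np.sqrt(2 * np.pi) * np.sum(mu * nu(b * np.sqrt(2 * mu)))

def online_arl(b, B0):
    """Approximate average run length E[T] for the online statistic, per (9)."""
    mu = (2 * B0 - 1) / (B0 * (B0 - 1.0))      # local decay rate of the covariance in t
    return np.sqrt(2 * np.pi) * np.exp(b**2 / 2) / (b * mu * nu(b * np.sqrt(2 * mu)))

def offline_threshold(alpha, Bmax):
    """Smallest b whose approximate significance level equals alpha (e.g., 0.05)."""
    return brentq(lambda b: offline_sl(b, Bmax) - alpha, 1.0, 8.0)

def online_threshold(arl, B0):
    """Threshold b whose approximate ARL equals the prescribed value (e.g., 5000)."""
    return brentq(lambda b: online_arl(b, B0) - arl, 1.0, 8.0)
```

Because both tail approximations are monotone in b over the relevant range, a simple bracketed root finder suffices and the thresholds are obtained in milliseconds, in contrast to repeated bootstrapping.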
In the appendix, Fig. 6(a) demonstrates how a threshold is obtained by simulation: for α = 0.05, the threshold b = 2.88 corresponds to the 95% quantile of the empirical distribution of the offline M-statistic. For a range of b values, Fig. 6(b) compares the empirical SL from simulation with that predicted by Theorem 3, and shows that the theory is quite accurate for small α, which is desirable since we usually care about small α values when setting thresholds. Table 1 shows that our approximation works quite well for determining thresholds for given α: the thresholds obtained from our theory match those obtained from Monte Carlo simulation (the null distribution is N(0, I), I ∈ ℝ^{20×20}) quite well, and even those obtained from bootstrapping in a real-data scenario. Here, the bootstrap thresholds are for a speech signal from the CENSREC-1-C dataset; in this case the null distribution P is unknown and we only have 3000 speech samples, so we generate bootstrap samples to estimate the threshold, as shown in Fig. 7 in the appendix. The thresholds b obtained from the theoretical approximations incur little performance degradation; we discuss possible improvements in Section 7.

Table 1: Comparison of thresholds b for the offline case, determined by simulation, bootstrapping, and theory, respectively, for various SL values α.

| Bmax | Method     | α = 0.20 | α = 0.15 | α = 0.10 |
|------|------------|----------|----------|----------|
| 10   | simulation | 1.78     | 2.02     | 2.29     |
| 10   | bootstrap  | 1.77     | 2.05     | 2.45     |
| 10   | theory     | 2.00     | 2.18     | 2.40     |
| 20   | simulation | 1.97     | 2.18     | 2.47     |
| 20   | bootstrap  | 2.29     | 2.63     | 3.09     |
| 20   | theory     | 2.25     | 2.41     | 2.60     |
| 50   | simulation | 2.21     | 2.44     | 2.70     |
| 50   | bootstrap  | 2.47     | 2.78     | 3.25     |
| 50   | theory     | 2.48     | 2.62     | 2.80     |

For the online case, we also compare the thresholds obtained from simulation (using 5000 instances) for various ARL values with those from Theorem 4. As predicted by theory, the threshold is consistently accurate for various null distributions (shown in Fig. 3), and the precision improves as B0 increases. The null distributions we consider include N(0, 1), the exponential distribution with mean 1, an Erdos-Renyi random graph with 10 nodes and edge probability 0.2, and the Laplace distribution.

Figure 3: In the online case, for a range of ARL values, comparison of b obtained from simulation and from Theorem 4 under various null distributions (Gaussian(0, I), Exp(1), random graph with 10 nodes and p = 0.2, Laplace(0, 1)); panels (a), (b), and (c) correspond to B0 = 10, 50, and 200.

Expected detection delay (EDD). In the online setting, we compare the EDD (with the assumption τ = 0) of detecting a change-point when the signal is 20-dimensional and the transition is from a zero-mean Gaussian N(0, I20) to a non-zero-mean Gaussian N(µ, I20), where the post-change mean vector µ is element-wise equal to a constant mean shift. In this setting, Fig. 10(a) demonstrates the tradeoff in choosing a block size: when the block size is too small, the statistical power of the M-statistic is weak and hence the EDD is large; on the other hand, when the block size is too large, although the statistical power is good, the EDD is also large because of the way we update the test block. Therefore, there is an optimal block size for each case. Fig. 10(b) shows that the optimal block size decreases as the mean shift increases, as expected.

6 Real data

We test the performance of our M-statistics using real data.
Our datasets include: (1) CENSREC-1-C, a real-world speech dataset in the Speech Resource Consortium (SRC) corpora provided by the National Institute of Informatics (NII)¹; and (2) the Human Activity Sensing Consortium (HASC) challenge 2011 data². We compare our M-statistic with a state-of-the-art algorithm, the relative density-ratio (RDR) estimate [7] (one limitation of the RDR algorithm, however, is that it is not suitable for high-dimensional data, because estimating a density ratio in the high-dimensional setting is ill-posed). To achieve reasonable performance for the RDR algorithm, we adjust the bandwidth and the regularization parameter at each time step and, hence, the RDR algorithm is computationally more expensive than the M-statistic method. We use the Area Under Curve (AUC) [7] (the larger the better) as the performance metric. Our M-statistics have competitive performance compared with the baseline RDR algorithm on real data. Here we report the main results; details can be found in Appendix D. For the speech data, our goal is to detect the onset of a speech signal emerging from background noise (the background noises are taken from real acoustic signals, such as background noise from highways, airports, and subway stations). The overall AUC for the M-statistic is .8014 and for the baseline algorithm is .7578. For the human activity data, we aim to detect the onset of a transition from one activity to another. Each record consists of human activity information collected by portable three-axis accelerometers. The overall AUC for the M-statistic is .8871 and for the baseline algorithm is .7161.

7 Discussions

We may be able to improve the precision of the tail probability approximations in Theorems 3 and 4 by accounting for the skewness of Z′_B. In the change-of-measure argument, we need to choose parameter values θ_B such that ψ′_B(θ_B) = b. Currently, we use a Gaussian assumption Z′_B ∼ N(0, 1) and, hence, ψ_B(θ) = θ²/2 and θ_B = b. We may improve the precision if we can estimate the skewness κ(Z′_B). In particular, we can include the skewness in the log moment generating function approximation, ψ_B(θ) ≈ θ²/2 + κ(Z′_B)θ³/6, when estimating the change-of-measure parameter: setting the derivative of this approximation to b and solving the quadratic equation κ(Z′_B)θ²/2 + θ = b for θ′_B. This would change the leading exponential term in (8) from e^{−b²/2} to e^{ψ_B(θ′_B) − θ′_B b}. A similar improvement can be made for the ARL approximation in Theorem 4.

Acknowledgments

This research was supported in part by CMMI-1538746 and CCF-1442635 to Y.X., and by NSF/NIH BIGDATA 1R01GM108341, ONR N00014-15-1-2340, NSF IIS-1218749, and NSF CAREER IIS-1350983 to L.S.

¹ Available from http://research.nii.ac.jp/src/en/CENSREC-1-C.html
² Available from http://hasc.jp/hc2011

References

[1] F. Desobry, M. Davy, and C. Doncarli. An online kernel change detection algorithm. IEEE Trans. Sig. Proc., 2005.
[2] F. Enikeeva and Z. Harchaoui. High-dimensional change-point detection with sparse alternatives. arXiv:1312.1900, 2014.
[3] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
[4] Z. Harchaoui, F. Bach, O. Cappe, and E. Moulines. Kernel-based methods for hypothesis testing. IEEE Sig. Proc. Magazine, pages 87–97, 2013.
[5] Z. Harchaoui, F. Bach, and E. Moulines. Kernel change-point analysis. In Adv. in Neural Information Processing Systems 21 (NIPS 2008), 2008.
[6] D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In Proc. of the 30th VLDB Conf., 2004.
[7] Song Liu, Makoto Yamada, Nigel Collier, and Masashi Sugiyama. Change-point detection in time-series data by relative density-ratio estimation. Neural Networks, 43:72–83, 2013.
[8] Aaditya Ramdas, Sashank Jakkam Reddi, Barnabás Póczos, Aarti Singh, and Larry Wasserman. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[9] Z. E. Ross and Y. Ben-Zion. Automatic picking of direct P, S seismic phases and fault zone head waves. Geophys. J. Int., 2014.
[10] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[11] R. J. Serfling. U-statistics. In Approximation Theorems of Mathematical Statistics. John Wiley & Sons, 1980.
[12] D. Siegmund. Sequential Analysis: Tests and Confidence Intervals. Springer, 1985.
[13] D. Siegmund and E. S. Venkatraman. Using the generalized likelihood ratio statistic for sequential detection of a change-point. Ann. Statist., 23:255–271, 1995.
[14] D. Siegmund and B. Yakir. Detecting the emergence of a signal in a noisy image. Stat. Interface, 1:3–12, 2008.
[15] Y. Xie and D. Siegmund. Sequential multi-sensor change-point detection. Annals of Statistics, 41(2):670–692, 2013.
[16] B. Yakir. Extremes in Random Fields: A Theory and Its Applications. Wiley, 2013.
[17] W. Zaremba, A. Gretton, and M. Blaschko. B-test: low variance kernel two-sample test. In Adv. Neural Info. Proc. Sys. (NIPS), 2013.
[18] S. Zou, Y. Liang, H. V. Poor, and X. Shi. Nonparametric detection of anomalous data via kernel mean embedding. arXiv:1405.2294, 2014.