Sequential Changepoint Detection via Backward Confidence Sequences

Shubhanshu Shekhar 1, Aaditya Ramdas 1,2

1 Department of Statistics and Data Science, Carnegie Mellon University, USA. 2 Machine Learning Department, Carnegie Mellon University, USA. Correspondence to: Shubhanshu Shekhar.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Abstract

We present a simple reduction from sequential estimation to sequential changepoint detection (SCD). In short, suppose we are interested in detecting changepoints in some parameter or functional $\theta$ of the underlying distribution. We demonstrate that if we can construct a confidence sequence (CS) for $\theta$, then we can also successfully perform SCD for $\theta$. This is accomplished by checking whether two CSs (one forwards and the other backwards) ever fail to intersect. Since the literature on CSs has been rapidly evolving recently, the reduction provided in this paper immediately solves several old and new change detection problems. Further, our backward CS, constructed by reversing time, is new and potentially of independent interest. We provide strong nonasymptotic guarantees on the frequency of false alarms and detection delay, and demonstrate numerical effectiveness on several problems.

1. Introduction

We study the problem of sequential changepoint detection (SCD), where the goal is to quickly detect any changes in the distribution generating a stream of observations, while controlling the false alarm rate at a specified level. Formally, for some (possibly infinite-dimensional) index set $\Theta$, let $\{P_\theta\}_{\theta \in \Theta}$ denote a class of distributions on some observation space $\mathcal{X}$. Suppose that for some $T \ge 1$, the observations $\{X_t : 1 \le t \le T\}$ are drawn i.i.d. from $P_{\theta_0}$ with $\theta_0 \in \Theta$, and $\{X_t : t > T\}$ are drawn i.i.d. from $P_{\theta_1}$ for some $\theta_1 \in \Theta$, with $\theta_1 \neq \theta_0$. Then, the SCD problem involves deciding between the null $H_0 : \{T = \infty\}$, meaning no change occurred, and the alternative $H_1 = \cup_{i \in \mathbb{N}} \{T = i\}$. Since the observations arrive sequentially, our task is to design a random stopping time, $\tau$, adapted to the natural filtration $\{\mathcal{F}_t : t \ge 1\}$ with $\mathcal{F}_t = \sigma(X_1, \ldots, X_t)$, at which we reject the null.

A good stopping rule $\tau$ takes large values under the null (i.e., when $T = \infty$), while minimizing the time required to detect the change under the alternative (i.e., when $T < \infty$). Formally, when $T = \infty$, we require the average run length (ARL), $E_\infty[\tau]$, to be lower bounded by $1/\alpha$ for a given $\alpha \in (0, 1]$, while ensuring a small detection delay, $E_T[(\tau - T)^+]$, when $T < \infty$. Informally, this means that we will have a false alarm roughly every $1/\alpha$ steps, so the reader may use $\alpha = 10^{-3}$ as a rough guideline. (We also briefly discuss how to keep the probability of even a single false alarm below $\alpha$, but there is a tradeoff between the false alarm guarantee and detection delay; detecting true changes quickly necessitates tolerating infrequent false alarms.)

The literature on the topic of sequential changepoint detection is vast, as this problem arises in several important real-world applications, such as quality control (Shewhart, 1930), monitoring power networks (Chen et al., 2015), analysis of genomes (Chen et al., 2011; Shen and Zhang, 2012), and epidemic detection (Baron et al., 2004; Yu et al., 2013).
Some of the earliest works on this topic (Shewhart, 1925; Page, 1954; Shiryaev, 1963) assume that the pre- and post-change distributions admit known densities $f_0$ and $f_1$ (w.r.t. some common reference measure). These methods use statistics involving likelihood ratios that can be computed efficiently in an incremental manner, and have also been shown to admit strong optimality properties. The ideas underlying these likelihood-based schemes have also been extended to the case of (finite-dimensional) parametric families of distributions, such as the exponential family; see Tartakovsky et al. (2014) for a detailed discussion. However, these parametric assumptions are often too stringent to be applicable to many practical applications, where the data distributions may lie in much larger, nonparametric, classes. Most of the ideas developed for the parametric setting, and in particular the likelihood-based schemes, fail to be applicable in the nonparametric case. With some exceptions discussed later, there are very few general principles for constructing nonparametric changepoint detection schemes. Our work in this paper addresses this issue by developing a conceptually simple meta-algorithm for transforming any confidence sequence construction into a powerful changepoint detection method. As a consequence, we can immediately build upon the recent progress in constructing confidence sequences to instantiate new changepoint detection methods.

Remark 1. We note that the SCD problem is usually studied in two settings, which differ from each other in a subtle manner. In the first (and more common) setting, the pre- and post-change distributions ($P_{\theta_0}$ and $P_{\theta_1}$) are assumed to lie in two different, and usually well-separated, classes of distributions. Using our notation, this is equivalent to assuming that there exist two known disjoint sets $\Theta_0$ and $\Theta_1$, such that $\theta_i \in \Theta_i$ for $i = 0, 1$. The second setting, which is the subject of our paper, assumes less information. That is, both $\theta_0$ and $\theta_1$ are assumed to lie in some common index set ($\Theta$ in our notation), and the only condition is that $\theta_0 \neq \theta_1$. Hence, the first setting is in some sense easier, as the additional knowledge about $\Theta_0$ and $\Theta_1$ (their size, and their separation) can be exploited to design appropriate SCD schemes. While there exist some works that develop methods for SCD in the second setting, those schemes often rely on the specific structure of the problems considered. In this paper, we address this issue by developing a general principle for designing SCD schemes in the second setting.

2. Preliminaries

The primary technical tool we use in our strategy is a time-uniform version of confidence sets, called confidence sequences (CSs), first introduced in the statistics literature by Darling and Robbins (1967). We present a definition adapted to our problem below.

Definition 2 (Confidence Sequences). Suppose $X_1, X_2, \ldots$ are drawn i.i.d. from $P_\theta$, for some $\theta \in \Theta$. Then, for any $\alpha \in (0, 1)$, a level-$(1-\alpha)$ CS, denoted by $\{C_t : t \ge 1\}$, is a collection of subsets $C_t \subseteq \Theta$, such that (i) $C_t$ is $\sigma(X_1, \ldots, X_t)$-measurable, and (ii) $P\left(\forall t \ge 1 : \theta \in C_t\right) \ge 1 - \alpha$.

Remark 3. Due to the time-uniformity in the definition of CSs, we can replace the confidence set $C_t$ with the smaller set $\tilde{C}_t := \cap_{s \le t} C_s$. The new CS, $\{\tilde{C}_t : t \ge 1\}$, consists of nested confidence sets; that is, $\tilde{C}_t \subseteq \tilde{C}_s$ for $s < t$.

Remark 4. The data do not need to be i.i.d. for defining CSs.
The above definition can be easily generalized to the case of independent random variables, with $X_t \sim P_{\vartheta_t}$ for $\vartheta_t \in \Theta$ (see Appendix A). This, however, requires that $\Theta$ is endowed with the notions of addition and scalar multiplication (a sufficient condition is that $\Theta$ is a vector space), which we implicitly assume when needed.

We now introduce a notion of the size of the confidence set $C_t$, which reflects the amount of uncertainty.

Definition 5 (CS width). Let $\Theta$ be endowed with a distance metric $d$, and let $\{C_t : t \ge 1\}$ denote a level-$(1-\alpha)$ CS constructed on observations $X_1, X_2, \ldots$ drawn i.i.d. from $P_\theta$. A function $w(t, \theta, \alpha)$ denotes the pointwise width (bound) of the CS if, for all $t \in \mathbb{N}$ and $\theta \in \Theta$, we have $\sup_{\theta', \theta'' \in C_t} d(\theta', \theta'') \le w(t, \theta, \alpha)$. We define the uniform width over $\Theta$ as $w(t, \Theta, \alpha) = \sup_{\theta \in \Theta} w(t, \theta, \alpha)$.

As we will see later in Section 5, most non-trivial CSs have their pointwise widths (and often, the uniform widths as well) converging to 0 with the number of observations.

Example 6. Consider independent random variables $\{X_t : t \ge 1\}$, with $X_t \sim N(\theta, 1)$ and $\theta \in \mathbb{R}$ for all $t \ge 1$. In this case, the parameter set is $\Theta = \mathbb{R}$. For this process, we can define the CS $\{C_t : t \ge 1\}$ as follows: $C_t = [\bar{X}_t - w_t/2,\ \bar{X}_t + w_t/2]$, where $\bar{X}_t = (1/t)\sum_{i=1}^t X_i$, and $w_t = 3.4\sqrt{\left(\log\log(2t) + 0.72\log(10.4/\alpha)\right)/t}$. Thus, if we endow the parameter space $\Theta$ with the metric $d(\theta, \theta') = |\theta - \theta'|$, we observe that the uniform width of $C_t$, denoted by $w(t, \Theta, \alpha) = w_t$, converges to 0.
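As a quick illustration of Example 6, the following sketch (in Python, our choice of language; the function names and the simulated stream are ours, not from the paper's released code) computes the running confidence sequence:

```python
import numpy as np

def gaussian_cs(x, alpha=0.05):
    """Level-(1 - alpha) CS of Example 6 for the mean of N(theta, 1) data.

    Returns (lower, upper) arrays with C_t = [mean_t - w_t/2, mean_t + w_t/2]
    and w_t = 3.4 * sqrt((log log(2t) + 0.72 * log(10.4 / alpha)) / t).
    """
    t = np.arange(1, len(x) + 1)
    running_mean = np.cumsum(x) / t
    w = 3.4 * np.sqrt((np.log(np.log(2 * t)) + 0.72 * np.log(10.4 / alpha)) / t)
    return running_mean - w / 2, running_mean + w / 2

rng = np.random.default_rng(0)
lo, hi = gaussian_cs(rng.normal(loc=0.3, scale=1.0, size=1000))
print(lo[-1], hi[-1])  # a short interval around the true mean 0.3
```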
Related Work. As we mentioned earlier, a large part of the existing SCD literature focuses on the parametric setting. We refer the reader to some recent surveys, such as those by Veeravalli and Banerjee (2014) and Xie et al. (2021), and the textbook by Tartakovsky et al. (2014) for details. In this section, we discuss some results on nonparametric SCD methods that are more relevant to our work.

Shin et al. (2022) developed a novel framework for changepoint detection by introducing e-detectors, obtained by combining a sequence of e-processes defined uniformly over the class of pre-change distributions. They showed that their resulting strategy using e-detectors controls the ARL under very general conditions, and also proved the optimality of their schemes (in terms of worst-case detection delays) in some cases. However, as we mentioned in Remark 1, their framework is applicable mainly in cases where the pre- and post-change distributions are known to lie in different classes. Our techniques, described in Section 3 and Section 4, address this issue.

Maillard (2019b) considered the task of detecting a change in the mean of a sequence of independent, univariate, sub-Gaussian random variables, and proposed an SCD method by deriving a new, doubly time-uniform confidence sequence for the scan statistics associated with a generalized likelihood ratio scheme (Lai and Xing, 2010). The original proof of this concentration inequality (Maillard, 2019b, Theorem 4) was incomplete, and a corrected version (with an additional log log term) was obtained by the author in (Maillard, 2019a, Chapter 3, Section 4.1). For the resulting scheme, Maillard (2019a) obtained bounds on the probability of false alarm and on the detection delay, and also established the optimality of this scheme under certain scenarios (such as for Gaussian observations). Unlike Maillard (2019a), our SCD framework is applicable to a much wider class of problems beyond univariate mean testing. Nevertheless, when specialized to the case of univariate Gaussian observations, our scheme matches the optimal detection delay bound, while also providing control over the ARL (instead of the probability of false alarm).

Puchkin and Shcherbakova (2023) considered the SCD problem under the assumption that both the pre- and post-change distributions admit densities w.r.t. a common reference measure, and proposed a strategy based on learning a discriminator to estimate the density ratio. They showed that their strategy can control the ARL at the required level, and also obtained a high-probability upper bound on the detection delay in terms of the Jensen-Shannon (JS) divergence and the $L_2$ norm of the difference of densities. Unlike them, our framework does not require the existence of densities, and it also works for distributions that are separated in terms of a large class of metrics, and not just the JS divergence.

Another class of nonparametric schemes for SCD is based on the kernel-MMD metric, first employed by Gretton et al. (2012) for designing powerful nonparametric two-sample tests. Li et al. (2019) proposed an SCD scheme based on a variant of the block-MMD statistic (Zaremba et al., 2013) computed using the observations and a block of pre-change data. More recently, Flynn and Yoo (2019) and Wei and Xie (2022) proposed new SCD schemes that use linear and block-MMD statistics to define nonparametric analogues of the CUSUM test of Page (1954). However, these schemes suffer from weak theoretical guarantees on the detection delay. Furthermore, similar to the case of Puchkin and Shcherbakova (2023), the strategies for designing these SCD schemes are specific to the kernel-MMD metric, and there is no obvious way to extend them to other popular metrics, such as the Kolmogorov-Smirnov metric. Our work addresses these issues.

Similarly, most other existing works in SCD are geared towards specific problem settings. Hence, both the design of the schemes as well as their analysis are strongly tied to the details of the problem being studied. Examples include empirical likelihood based methods for distributions on finite alphabets (Lau et al., 2018), nearest-neighbor techniques for multivariate or non-Euclidean data (Chen, 2019), and spectral scan statistics for graph-valued data (Sharpnack et al., 2013). However, our objective in this paper is different: instead of developing a powerful SCD scheme for a specific task, we develop an abstract unifying template for designing SCD schemes, which can then be instantiated for a large range of (old and new) SCD problems.

Our Contributions. In Section 3, we first present (as a warmup) a changepoint detection scheme that uses a single level-$\alpha$ forward CS. Our strategy is to stop as soon as the CS becomes inconsistent; that is, it includes a point that it had previously discarded. We show in Proposition 8 that this simple strategy controls the probability of false alarm at level $\alpha$, and we also obtain a high-probability upper bound on its detection delay. However, this scheme is too conservative, as its ARL is infinite, and in practice this might result in large detection delays, especially when $T$ is large. In Section 4, we present our main strategy, which proceeds by checking at each time $t$ whether a forward CS and a backward CS (a new notion) are consistent, and stops whenever an inconsistency is detected.
In Theorem 13, we show that the ARL of this scheme is at least $1/(2\alpha)$, and we characterize its expected detection delay under general conditions. Finally, in Section 5, we demonstrate the power and generality of our proposed scheme by instantiating it with five different confidence sequences. The general bound on the detection delay obtained in Theorem 13 easily translates into problem-specific upper bounds in all these cases, and we also empirically verify the theoretical predictions through some simple numerical simulations.

3. Warmup: change detection via a forward CS

Before presenting our general scheme in the next section, we first introduce a simpler SCD method that only uses a single forward CS. We refer to this scheme as the FCS-Detector. The idea underlying this scheme is that if there is a change in the distribution generating the observations $\{X_t : t \ge 1\}$, then the intersection of the CS will eventually end up being empty. Formally, we proceed by constructing a level-$(1-\alpha)$ confidence sequence (CS) for the unknown $\theta_0$, denoted by $\{C_t : t \ge 1\}$, as introduced in Definition 2. When there is no changepoint, the CS satisfies $P(\forall t \in \mathbb{N} : \theta_0 \in C_t) \ge 1 - \alpha$. However, if there is a changepoint at some time $T$, we expect the confidence sets $C_t$ to deviate away from the confidence set $C_T$ for $t > T$. Eventually, after sufficiently many post-change observations, the confidence sequence will be inconsistent and self-contradictory. That is, at some time $t$ such that $t - T$ is large enough, we expect that $\cap_{s=1}^{t} C_s = \emptyset$. We thus define the stopping time, $\tau$, as the smallest $t$ at which the above inconsistency is observed.

Definition 7 (FCS-Detector). Given observations $X_1, X_2, \ldots$, we construct a confidence sequence (CS), denoted by $\{C_t : t \ge 1\}$, for the pre-change parameter $\theta_0$. We stop at time $\tau := \min\{n \ge 1 : \exists t < n,\ C_t \cap C_n = \emptyset\}$.

This strategy satisfies the following properties.

Proposition 8. Consider a changepoint detection problem with observations $X_1, X_2, \ldots$ drawn i.i.d. from $P_{\theta_0}$ for $t \le T$ and from $P_{\theta_1}$ for $t > T$, with $T$ lying in $\mathbb{N} \cup \{\infty\}$ and $\theta_0, \theta_1 \in \Theta$. Suppose for any $\theta \in \Theta$, we can construct confidence sequences $\{C_t : t \ge 1\}$, with uniform width $w(t) \equiv w(t, \Theta, \alpha)$. Then, we have the following:

(i) When $T = \infty$, the FCS-Detector controls the probability of false alarm (PFA) at level $\alpha$. That is, $P(\tau < \infty) \le \alpha$.

(ii) Suppose $T < \infty$ is large enough to ensure that $w(T) < d(\theta_0, \theta_1)$. For $t > T$, define $\lambda_t := T/t$, and $\tilde\theta_t := \lambda_t \theta_0 + (1 - \lambda_t)\theta_1$. Then, we have $\tau \le \min\left\{t \ge T : w(t) + w(T) \le d(\theta_0, \tilde\theta_t)\right\}$, with probability at least $1 - \alpha$.

Remark 9. If $d$ is induced by a norm on $\Theta$, we can bound the detection delay under the good event by
$$(\tau - T)^+ \le \min\left\{t - T : t \ge T,\ w(t) + w(T) \le \frac{t - T}{t}\, d(\theta_0, \theta_1)\right\}.$$
In many instances, we have $w(t) \asymp 1/\sqrt{t}$, suppressing logarithmic factors. Then, the above expression implies that for small $T$, the delay is linear in $T$, while for large $T$, the delay behaves approximately like $\sqrt{T}$. The details of these calculations, along with plots of delay versus $T$, are in Appendix B.

Remark 10. The condition on $T$ used above for obtaining the bound on detection delay is necessary, because we do not assume that the pre-change distribution (i.e., the parameter $\theta_0$) is known to us. Thus, to be able to detect the change in distribution, we must have enough observations from the pre-change distribution to estimate $\theta_0$ accurately enough, in comparison to the magnitude of the change, $d(\theta_0, \theta_1)$. As an extreme example, if $T = 1$, no method can realistically detect that a change occurred (since it is statistically plausible that all the data are simply i.i.d. and no change occurred at all). Said differently, $T = 0$ and $T = \infty$ are information-theoretically equivalent, and we need to be far enough away from those extremes for practical detectability.
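For interval-valued CSs like that of Example 6, the FCS-Detector amounts to tracking the running intersection of the intervals and stopping once it is empty (for intervals, this is equivalent to some pair of intervals being disjoint, as in Definition 7). A minimal sketch under these assumptions; the change location and magnitude in the demo are arbitrary choices of ours:

```python
import numpy as np

def example6_width(t, alpha):
    """Uniform width w(t) of the Gaussian CS from Example 6."""
    return 3.4 * np.sqrt((np.log(np.log(2 * t)) + 0.72 * np.log(10.4 / alpha)) / t)

def fcs_detector(stream, alpha=0.005):
    """Stop at the first n at which the running intersection of C_1, ..., C_n is empty."""
    lo_max, hi_min = -np.inf, np.inf  # running intersection of the intervals so far
    total = 0.0
    for n, x in enumerate(stream, start=1):
        total += x
        mean, w = total / n, example6_width(n, alpha)
        lo_max = max(lo_max, mean - w / 2)
        hi_min = min(hi_min, mean + w / 2)
        if lo_max > hi_min:  # two intervals are disjoint: the CS contradicts itself
            return n
    return None  # never stopped (no alarm)

rng = np.random.default_rng(1)
xs = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(1.0, 1.0, 2000)])
print(fcs_detector(xs))  # detection time, somewhat after the change at T = 500
```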
When $T = \infty$, the FCS-Detector continues sampling without stopping with probability at least $1 - \alpha$. Hence, its ARL is infinite. This makes it too conservative in detecting the changepoint: we cannot provide an upper bound on the expected detection delay when $T < \infty$, and can only characterize the delay under an event of probability $1 - \alpha$ (see Figure 5 in Appendix B). We address this next, by proposing a scheme that augments the forward CS with a series of backward CSs.

4. Change detection via a backward CS

We now introduce our main SCD strategy, which addresses the two drawbacks of the simpler SCD strategy discussed in the previous section. Informally, the idea underlying our strategy is as follows: in each round $t \ge 2$, we construct the usual forward CS, and a new backward CS (using reversed observations, see Definition 11). We refer to this scheme as the BCS-Detector. If there has been a changepoint, we expect the forward and backward CSs to concentrate on different regions of $\Theta$ (i.e., around $\theta_0$ and $\theta_1$ respectively). Hence, we stop as soon as they become inconsistent. See Appendix A for a visual illustration, and Appendix F for an interpretation of our scheme in terms of repeated sequential tests.

We now present our definition of backward CSs.

Definition 11 (Backward Confidence Sequences). Let $X_1, X_2, \ldots$ be drawn i.i.d. from $P_\theta$, for some $\theta \in \Theta$. For any $n \ge 1$, we say that a sequence of sets $\{B^{(n)}_t\}_{1 \le t \le n} \subseteq \Theta$ is a backward CS if (i) $B^{(n)}_t$ is $\sigma(X_t, \ldots, X_n)$-measurable, and (ii) $P\left(\forall t \in [n] : \theta \in B^{(n)}_t\right) \ge 1 - \alpha$.

Note that for $n > 1$, a forward CS $C_t$ does not satisfy the first condition, since $C_t$ is built using $X_1, \ldots, X_t$, but $B^{(n)}_t$ can only use $X_t, \ldots, X_n$. But a backward CS at any $n$ can be interpreted as the usual forward CS, introduced in Definition 2, constructed on observations seen in reverse order from $n$ to 1; see Appendix A for details.

Without loss of generality, we assume that any CS consists of a nested sequence of sets, as discussed in Remark 3. In other words, at a given time $n$, $C_n$ is the smallest set among $\{C_t : t \in [n]\}$, while $B^{(n)}_1$ is the smallest among $\{B^{(n)}_t : t \in [n]\}$. We say that the two confidence sequences $\{C_t : t \ge 1\}$ and $\{B^{(n)}_t : t \in [n]\}$ do not intersect at time $n$ if $C_r \cap B^{(n)}_s = \emptyset$ for some $1 \le r, s \le n$. It is easy to check that two nested CSs do not intersect if and only if their smallest sets (i.e., $C_n$ and $B^{(n)}_1$) do not intersect. When $T = \infty$, at every time $n$, both CSs $\{C_t : t \ge 1\}$ and $\{B^{(n)}_t : t \in [n]\}$ will contain $\theta_0$ with probability at least $1 - 2\alpha$, and thus they will intersect with the same probability. This motivates us to stop and declare a changepoint at the first time $n$ at which they do not intersect.

Definition 12 (BCS-Detector). Given observations $X_1, X_2, \ldots$, suppose we construct a forward CS $\{C_t : t \ge 1\}$ and new backward CSs $\{B^{(n)}_t : t \in [n]\}$ for every $n \ge 1$. Assume that all the constructed CSs are nested. Then, we define the stopping time, $\tau$, as the first time at which the forward and backward CSs do not intersect: $\tau := \inf\{n \ge 1 : C_n \cap B^{(n)}_1 = \emptyset\}$.
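Definition 12 can likewise be implemented directly for interval CSs: maintain the nested forward interval and, at each $n$, rebuild the nested backward interval from the time-reversed data. The sketch below (again with the Gaussian CS of Example 6; total cost quadratic in $\tau$, cf. Remark 26) is our illustration, not the paper's reference implementation:

```python
import numpy as np

def example6_width(t, alpha):
    return 3.4 * np.sqrt((np.log(np.log(2 * t)) + 0.72 * np.log(10.4 / alpha)) / t)

def bcs_detector(xs, alpha=0.005):
    """Stop at the first n where the nested forward and backward CSs are disjoint."""
    f_lo, f_hi, total = -np.inf, np.inf, 0.0
    for n in range(1, len(xs) + 1):
        total += xs[n - 1]
        w = example6_width(n, alpha)
        f_lo = max(f_lo, total / n - w / 2)  # nested forward CS at time n
        f_hi = min(f_hi, total / n + w / 2)
        # Nested backward CS B^{(n)}_1, rebuilt afresh from X_n, X_{n-1}, ..., X_1.
        b_lo, b_hi, rev_total = -np.inf, np.inf, 0.0
        for k in range(1, n + 1):  # k = number of most recent observations used
            rev_total += xs[n - k]
            wk = example6_width(k, alpha)
            b_lo = max(b_lo, rev_total / k - wk / 2)
            b_hi = min(b_hi, rev_total / k + wk / 2)
        if f_lo > b_hi or b_lo > f_hi:  # the two CSs do not intersect: declare a change
            return n
    return None

rng = np.random.default_rng(2)
xs = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(1.0, 1.0, 1500)])
print(bcs_detector(xs))  # stopping time, typically not long after T = 300
```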
Illustration of the BCS-Detector strategy. In Figure 1, we illustrate the intuition underlying our general BCS-Detector strategy, introduced above, using the task of detecting a change in means of bounded observations (details in Section 5.2). The three plots in Figure 1 highlight the following aspects of our scheme:

Prior to the changepoint (or if there is no changepoint), both the forward and backward CSs concentrate around the same region in the parameter space $\Theta$ (in this case, $[0, 1]$). In particular, note that in this case, for all values of $t$, one of the confidence intervals (CIs) is entirely contained in the corresponding CI of the other CS.

As observations from the post-change distribution start arriving, we expect the forward and backward CSs to start drifting away from each other. This is illustrated in the second plot of Figure 1.

Figure 1. The plots illustrate the general ideas underlying our BCS-Detector strategy. Prior to the changepoint (first plot), both the forward and backward CSs have significant overlap. After getting some post-change observations (middle plot), the backward CS starts to drift away from the forward CS, although the deviation is not enough for the two CSs to become inconsistent. Finally, the last plot shows the scenario where a sufficiently large number of post-change observations have arrived, which causes the forward and backward CSs to disagree. When this occurs, our scheme stops and rejects the null.

Finally, after sufficiently many post-change observations, the backward CS starts concentrating around the post-change parameter $\theta_1$, as a result of which some of the backward CIs become disjoint from the forward CIs. Our BCS-Detector scheme uses this occurrence as a signal to stop and declare a changepoint.

We now present the main result of this section.

Theorem 13. Consider an SCD problem with observations $X_1, X_2, \ldots$ drawn i.i.d. from $P_{\theta_0}$ for $t \le T$ and from $P_{\theta_1}$ for $t > T$, with $T$ lying in $\mathbb{N} \cup \{\infty\}$ and $\theta_0, \theta_1 \in \Theta$. Suppose for any $\theta \in \Theta$, we can construct confidence sequences $\{C_t : t \ge 1\}$ with pointwise width $w(\cdot, \theta, \alpha)$. Then, we have the following:

(i) When there is no changepoint ($T = \infty$), the BCS-Detector satisfies $E_\infty[\tau] \ge \frac{1}{2\alpha} - \frac{3}{2}$.

(ii) Suppose $T < \infty$, and the pre-change parameter $\theta_0$ is not known. Introduce the event $E = \{\theta_0 \in C_t : 1 \le t < T\}$, and note that $P_T(E) \ge 1 - \alpha$ by construction. Then, for $\alpha \in (0, 0.5)$, we have $E_T[(\tau - T)^+ \mid E] \le \frac{3\, u_0(\theta_0, \theta_1, T)}{1 - \alpha}$, where $u_0 := \min\{t \ge 1 : w(t, \theta_1, \alpha) + w(T, \theta_0, \alpha) < d(\theta_1, \theta_0)\}$.

Recall from Definition 5 that $w(t, \theta_1, \alpha)$ denotes the width of the level-$\alpha$ confidence set with $t$ observations, when the true parameter is $\theta_1$. The proof of this theorem is given in Appendix C.

In many problem instances, the pre-change parameter $\theta_0$ is known, as it represents the natural state of the process being observed. We can specialize the above result to this case as follows.

Corollary 14. Suppose the pre-change parameter $\theta_0$ is known, and $T < \infty$. Then, we have $E_T[(\tau - T)^+] \le (3/(1-\alpha))\, t_0(\theta_0, \theta_1)$, where $t_0 \equiv t_0(\theta_0, \theta_1) := \min\{t \ge 1 : w(t, \theta_1, \alpha) < d(\theta_1, \theta_0)\}$.
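The quantities $u_0$ (Theorem 13) and $t_0$ (Corollary 14) are easy to evaluate numerically once a width function is fixed. A small sketch using the uniform width of Example 6 for both pointwise terms (a simplification of ours), at illustrative and arbitrary values of $\Delta$, $T$, and $\alpha$:

```python
import numpy as np

def example6_width(t, alpha):
    return 3.4 * np.sqrt((np.log(np.log(2 * t)) + 0.72 * np.log(10.4 / alpha)) / t)

def u0(delta, T, alpha, t_max=10**7):
    """Smallest t with w(t) + w(T) < delta (Theorem 13, uniform-width version)."""
    for t in range(1, t_max):
        if example6_width(t, alpha) + example6_width(T, alpha) < delta:
            return t
    return None

def t0(delta, alpha, t_max=10**7):
    """Smallest t with w(t) < delta (Corollary 14, known theta_0)."""
    for t in range(1, t_max):
        if example6_width(t, alpha) < delta:
            return t
    return None

# Both quantities scale as 1/delta^2 up to log factors, matching Corollary 17.
print(u0(0.4, T=800, alpha=0.005), t0(0.4, alpha=0.005))
```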
Remark 15. These results demonstrate how the drawbacks of the FCS-Detector (introduced in Section 3) are addressed by carefully incorporating the idea of backward CSs in the design strategy. In particular, this new scheme, called the BCS-Detector, is less conservative, and has a finite lower bound on the ARL under the null. More importantly, when $T < \infty$, the expected detection delay of the BCS-Detector is also finite, and furthermore, it is independent of the value of the changepoint $T$. This is in contrast to the (high-probability) bound on the detection delay for the FCS-Detector, in which the detection delay increases approximately as $\sqrt{T}$ as the changepoint $T \to \infty$.

Remark 16 (Other estimates). We can also construct an estimate of the changepoint, denoted by $\hat{T}$, as the time at which $C_t$ and $B^{(\tau)}_t$ are most separated: $\hat{T} \equiv \hat{T}(\tau) := \max \arg\max_{1 \le t \le \tau} d\big(C_t, B^{(\tau)}_t\big) = \max\{s \le \tau : d(C_s, B^{(\tau)}_s) = \max_{1 \le t \le \tau} d(C_t, B^{(\tau)}_t)\}$. Additionally, we define an estimate of the magnitude of the change $\epsilon := d(\theta_0, \theta_1)$ as the separation between $C_{\hat{T}}$ and $B^{(\tau)}_{\hat{T}}$, as measured by the distance metric $d$: $\hat\epsilon \equiv \hat\epsilon(\tau) := \max_{\theta \in C_{\hat{T}},\, \theta' \in B^{(\tau)}_{\hat{T}}} d(\theta, \theta')$. While we do not obtain theoretical guarantees, some empirical results in the next section indicate that these estimates are accurate for several instantiations of the BCS-Detector.

5. Instantiations of BCS-Detector

In the previous section, we introduced a conceptually simple device that allows us to transform any confidence sequence construction into a powerful sequential changepoint detector. This allows us to instantiate our general changepoint detection meta-algorithm in various scenarios, by leveraging the recent progress in constructing confidence sequences. We illustrate this by presenting a variety of parametric and nonparametric SCD problems in this section. The code for reproducing the empirical results is available here.

5.1. Parametric Change of Mean Detection

We begin by considering the simple case of univariate Gaussian mean changepoint detection. In this problem, we observe $\{X_t : t \ge 1\}$, drawn i.i.d. according to the distribution $N(\mu_t, 1)$, with $\mu_t = \theta_0$ for $t \le T$ and $\mu_t = \theta_1$ for $t > T$. Note that in this problem, we have $\Theta = \mathbb{R}$, and we can set the distance metric $d$ to be the absolute value of the difference. We will use the CS for Gaussian means, recently derived by Howard et al. (2021), that we introduced earlier in Example 6. Furthermore, we can use the same expression for constructing the backward CS at any time $n$, denoted by $\{B^{(n)}_t : t \in [n]\}$, but with the order of observations reversed, as described in Section 4. The following result shows that in this parametric setting, our changepoint detection scheme achieves an order-optimal detection delay (i.e., optimal modulo poly-logarithmic factors):

Corollary 17. Suppose $\{X_t : t \ge 1\}$ are drawn i.i.d. according to $P_{\theta_0} = N(\theta_0, 1)$ for $t \le T$, and $P_{\theta_1} = N(\theta_1, 1)$ for $t > T$. Note that in this case, we have $d_{KL}(P_{\theta_1}, P_{\theta_0}) = (\theta_1 - \theta_0)^2/2$. Then, if $T = \Omega\big(\log(1/d_{KL}(P_{\theta_1}, P_{\theta_0}))/d_{KL}(P_{\theta_1}, P_{\theta_0})\big)$, we have
$$E_T[(\tau - T)^+ \mid E] = O\left( \frac{\log\log\big(1/d_{KL}(P_{\theta_1}, P_{\theta_0})\big) + \log(1/\alpha)}{d_{KL}(P_{\theta_1}, P_{\theta_0})} \right),$$
where $E$ is the $(1-\alpha)$-probability event in Theorem 13.

Remark 18. If the pre-change mean ($\theta_0$) is known, the above upper bound holds for the worst-case detection delay without the conditioning, defined as $J_L(\tau) := \sup_{T > 0} \operatorname{ess\,sup}\, E[(\tau - T)^+ \mid \mathcal{F}_T]$. Under the assumption that $\theta_1$ is also known, Lorden (1971) showed the following universal lower bound on this quantity: $\inf_{\tau'} J_L(\tau') = \frac{\log(1/\alpha)}{d_{KL}(P_{\theta_1}, P_{\theta_0})}(1 + o(1))$, as $\alpha \to 0$. Thus, the BCS-Detector matches this optimal performance, modulo logarithmic factors, without knowledge of the post-change parameter. This is unlike some of the existing schemes for Gaussian mean change detection, such as Pollak and Siegmund (1991), which achieve the same order-optimal detection delay, but with additional assumptions (a known lower bound on the change, and in the limit of $T \to \infty$).
Empirical Verification. We now verify the theoretical claims of our proposed changepoint detection scheme using observations drawn from a unit-variance normal distribution with pre-change mean $\theta_0 = 0$ and post-change mean $\theta_1 = \Delta$. In Figure 2, we consider the case of $\Delta = 0.4$ with the change occurring at $T = 800$. The plots show the forward and backward CSs at the time at which the change is detected in a trial, as well as the distributions of the detection delay, estimated changepoint, and estimated change magnitude over 250 trials. The plots indicate that the BCS-Detector detects changes quickly and accurately. In Figure 3, we study the variation of the average detection delay of our changepoint detection scheme as the change magnitude $\Delta$ is varied. The empirical results verify the expected proportionality of the average detection delay to $1/\Delta^2$, as claimed by our theoretical results.

5.2. Nonparametric Change of Mean Detection

We now consider a nonparametric analog of the change-of-mean detection problem from the previous section. Here we assume that $X_1, X_2, \ldots$ are independent random variables taking values in a bounded interval $\mathcal{X} \subseteq \mathbb{R}$, which we set to $[0, 1]$ without loss of generality. Prior to the changepoint $T$, we assume that the observations have a mean $\theta_0 \in \Theta = [0, 1]$, while it changes to $\theta_1 \neq \theta_0$ after time $T$. In this case, we can use the empirical Bernstein (EB) confidence intervals developed by Waudby-Smith and Ramdas (2023). To state the closed-form expression of the EB confidence sequence, we first need to introduce the following terms:
$$\hat\mu_t = \frac{1/2 + \sum_{i=1}^t X_i}{t+1}, \quad \hat\sigma^2_t = \frac{1/4 + \sum_{i=1}^t (X_i - \hat\mu_i)^2}{t+1}, \quad \lambda_t = \sqrt{\frac{2\log(2/\alpha)}{\hat\sigma^2_{t-1}\, t \log(t+1)}},$$
$$\hat\theta_t = \frac{\sum_{i=1}^t \lambda_i X_i}{\sum_{i=1}^t \lambda_i}, \quad v_t = \frac{4}{\log(2/\alpha)}(X_t - \hat\mu_{t-1})^2, \quad \text{and} \quad \psi_E(x) = \frac{-\log(1-x) - x}{4}, \quad \text{for all } x \in [0, 1).$$
Using these terms, we can now state the EB-CS derived by Waudby-Smith and Ramdas (2023) as follows: $C_t = [\hat\theta_t - w_t/2,\ \hat\theta_t + w_t/2]$, with
$$w_t = \frac{2\left(\log(2/\alpha) + \sum_{i=1}^t v_i \psi_E(\lambda_i)\right)}{\sum_{i=1}^t \lambda_i}.$$
Again, by an application of the general result, Theorem 13, we can get the following bound on the expected detection delay of the SCD scheme that uses the above EB-CS.

Proposition 19. Suppose $X_1, X_2, \ldots, X_T$ are drawn i.i.d. from a distribution on $\mathcal{X} = [0, 1]$ with mean $\theta_0 \in \Theta = [0, 1]$, while $X_{T+1}, \ldots$ are drawn from $P_{\theta_1}$ with mean $\theta_1 \neq \theta_0$. Then, assuming $\theta_0$ is known, this instance of the BCS-Detector (Definition 12) satisfies the following (with $\Delta := |\theta_0 - \theta_1|$ and $\sigma_1^2 = E_{P_{\theta_1}}[(X - \theta_1)^2]$):
$$E_T[(\tau - T)^+] = O\left(\frac{\sigma_1^2 \log(1/\Delta) + \log(1/\alpha)}{\Delta^2}\right).$$

Remark 20. While we stated the above result under the assumption that the pre- and post-change observations are i.i.d. from $P_{\theta_0}$ and $P_{\theta_1}$ respectively, we note that similar results can be obtained when the random variables are only independent, with fixed (pre- and post-change) means. Further, the assumption that $\theta_0$ is known can also be waived, at the cost of conditioning on the good event $E$.

Remark 21. In this subsection, we have considered the task of change-of-mean detection in perhaps the simplest (but nontrivial) nonparametric setting. The same ideas developed here can, however, be extended easily to other interesting cases, such as the sub-Gaussian family using the CS derived by Howard et al. (2021), or heavy-tailed distributions (Wang and Ramdas, 2022).
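A sketch of the EB-CS above. The half-width normalization is our reading of the display, and the cap on $\lambda_t$ (which keeps $\psi_E$ finite) is a practical safeguard of our choosing, not prescribed by the text:

```python
import numpy as np

def eb_cs(xs, alpha=0.05, lam_cap=0.5):
    """Empirical-Bernstein CS of Section 5.2 for [0, 1]-valued observations.

    Returns (lower, upper) arrays. Capping lambda_t at lam_cap < 1 keeps
    psi_E finite; the cap value is an implementation choice.
    """
    log2a = np.log(2 / alpha)
    mu, sig2 = 0.5, 0.25            # hat-mu_0 and hat-sigma^2_0 (the 1/2, 1/4 priors)
    num_mu, num_sig = 0.5, 0.25     # running numerators of hat-mu_t, hat-sigma^2_t
    s_lam = s_lamx = s_vpsi = 0.0
    lo, hi = [], []
    for t, x in enumerate(xs, start=1):
        lam = min(np.sqrt(2 * log2a / (sig2 * t * np.log(t + 1))), lam_cap)
        v = (4 / log2a) * (x - mu) ** 2          # uses the predictable hat-mu_{t-1}
        psi = (-np.log(1 - lam) - lam) / 4       # psi_E(lam)
        s_lam += lam
        s_lamx += lam * x
        s_vpsi += v * psi
        center = s_lamx / s_lam                  # hat-theta_t
        half = (log2a + s_vpsi) / s_lam          # w_t / 2
        lo.append(max(center - half, 0.0))
        hi.append(min(center + half, 1.0))
        num_mu += x                              # update hat-mu_t, hat-sigma^2_t
        mu = num_mu / (t + 1)
        num_sig += (x - mu) ** 2
        sig2 = num_sig / (t + 1)
    return np.array(lo), np.array(hi)

rng = np.random.default_rng(3)
lo, hi = eb_cs(rng.uniform(0.2, 1.0, 2000))  # true mean 0.6
print(lo[-1], hi[-1])
```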
Figure 2. The figures show the performance of our changepoint detection scheme, BCS-Detector, with univariate Gaussian observations whose mean changes from 0 to 0.4 at time $T = 800$. The first plot shows the forward and backward CSs at the time of detection ($\tau = 1264$) in one of the trials, with the shaded gray region being the points at which the two CSs disagree. The next three plots show the empirical distribution of the detection delay, the estimated changepoint location, and the estimated change magnitude over 250 repeated trials.

Empirical Verification. To verify our theoretical claims, we consider the case of an independent stream of observations supported on $\mathcal{X} = [0, 1]$, with pre- and post-change means $\theta_0$ and $\theta_1$ respectively. We define distributions with specified means by taking appropriate mixtures of uniform distributions, as described in Appendix E. We plot the performance of our changepoint detection scheme for a fixed problem instance with $(\theta_0, \theta_1, T) = (0.4, 0.6, 800)$ in Figure 6 (Appendix E). The predicted inverse quadratic dependence of the detection delay is verified in Figure 3.

5.3. Detecting Changes in CDFs

Staying with real-valued observations (or more generally, observations on totally ordered spaces), we now consider the more general question of detecting whether there have been any changes in the distribution generating the observations. Since real-valued random variables are completely characterized by their cumulative distribution functions (CDFs), this task can be framed in terms of detecting changes in the CDFs. More formally, we assume that we are given a stream of observations, $X_1, X_2, \ldots$, drawn according to a distribution $\theta_0 = F_0$ for $t \le T$, and according to a distribution $\theta_1 = F_1$ for $t > T$. Thus, in this case, $\Theta$ is the infinite-dimensional space of all feasible CDFs on $\mathbb{R}$, and we endow it with the Kolmogorov-Smirnov (KS) metric, $d_{KS}$, defined as $d_{KS}(F, G) = \sup_{x \in \mathbb{R}} |F(x) - G(x)|$. To instantiate our SCD scheme, we will employ the following level-$(1-\alpha)$ confidence sequence for the CDF in terms of the KS metric, recently derived by Howard and Ramdas (2022): $C_t = \{\theta \in \Theta : d_{KS}(\theta, \hat\theta_t) \le w_t/2\}$, where $\hat\theta_t$ denotes the empirical CDF and $w_t = 1.7\sqrt{\left(\log\log(et) + 0.8\log(1612/\alpha)\right)/t}$. As a consequence of the general result in Theorem 13, we can obtain the following performance guarantee for this scheme.

Corollary 22. Suppose, for some $T < \infty$, the observations $X_1, \ldots, X_T$ are drawn from a known distribution $F_0$, while the observations $X_{T+1}, X_{T+2}, \ldots$ are drawn from an unknown $F_1$. Then, our SCD scheme instantiated with the CS stated above satisfies (with $\Delta := d_{KS}(F_1, F_0)$):
$$E_T[(\tau - T)^+] = O\left(\frac{\log\log(1/\Delta) + \log(1/\alpha)}{\Delta^2}\right).$$

Empirical verification. We test the performance of our proposed scheme for t-distributions with 3 degrees of freedom. In Figure 7 in Appendix E, we show the performance of our proposed scheme for a fixed problem instance where the pre- and post-change CDFs satisfy $\Delta = d_{KS}(F_0, F_1) \approx 0.4$. The variation of the average detection delay with changing values of $\Delta$ is plotted in Figure 3, and it displays the expected inverse quadratic dependence.
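The KS-ball CS above is straightforward to work with: two KS confidence balls are disjoint exactly when the KS distance between the two empirical CDFs exceeds the sum of their radii. A minimal sketch (the t-distribution demo loosely mirrors the experiment above; the shift and sample sizes are arbitrary):

```python
import numpy as np

def ks_radius(t, alpha=0.05):
    """Half-width w_t / 2 of the KS confidence ball from Section 5.3."""
    return 0.5 * 1.7 * np.sqrt((np.log(np.log(np.e * t)) + 0.8 * np.log(1612 / alpha)) / t)

def ks_distance(xs, ys):
    """Two-sample Kolmogorov-Smirnov distance between empirical CDFs."""
    grid = np.sort(np.concatenate([xs, ys]))
    Fx = np.searchsorted(np.sort(xs), grid, side="right") / len(xs)
    Fy = np.searchsorted(np.sort(ys), grid, side="right") / len(ys)
    return np.abs(Fx - Fy).max()

def balls_disjoint(xs, ys, alpha=0.05):
    """True iff no single CDF can lie in both KS confidence balls."""
    return ks_distance(xs, ys) > ks_radius(len(xs), alpha) + ks_radius(len(ys), alpha)

rng = np.random.default_rng(4)
pre = rng.standard_t(3, 2000)          # pre-change block
post = rng.standard_t(3, 2000) + 1.5   # post-change block, shifted
print(balls_disjoint(pre, post))       # True once both blocks are large enough
```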
5.4. Detecting Change in Homogeneity of Two Streams

Suppose we have a stream of observations in a product space $\mathcal{X} = \mathcal{U} \times \mathcal{U}$, and the parameter set consists of all product distributions on $\mathcal{U} \times \mathcal{U}$; that is, $\Theta = \{P \otimes Q : P, Q \in \mathcal{P}(\mathcal{U})\}$. Prior to the changepoint, we assume that the observations $X_1, X_2, \ldots, X_T$, with $X_t = (U_t, V_t) \in \mathcal{U} \times \mathcal{U}$, are drawn from $\theta_0 = P_U \otimes P_V$, while the post-change observations $X_{T+1}, X_{T+2}, \ldots$ are assumed to be drawn from some other product distribution $Q_U \otimes Q_V$. Given some statistical distance measure, $\rho : \mathcal{P}(\mathcal{U}) \times \mathcal{P}(\mathcal{U}) \to \mathbb{R}$, we assume that $\rho(P_U, P_V) \neq \rho(Q_U, Q_V)$. An interesting special case of this problem, motivated by the two-sample testing problem, is when $P_U = P_V$ and $Q_U \neq Q_V$. If $\rho$ is a probability metric, we can use it to induce a distance metric, $d$, on the parameter space $\Theta$ as follows: $d(\theta_0, \theta_1) = |\rho(P_U, P_V) - \rho(Q_U, Q_V)|$, where $\theta_0 = P_U \otimes P_V$ and $\theta_1 = Q_U \otimes Q_V$. Then, to obtain a changepoint detection scheme, we can employ CSs for the statistical distance $\rho$.

We instantiate this strategy with the kernel-MMD metric (defined in Appendix A) associated with a kernel $k$, denoted by $d_{MMD}(\cdot, \cdot)$. We use the following CS derived by Manole and Ramdas (2021) for the kernel-MMD distance between two distributions (assuming that $\sup_{u,u'} k(u, u') \le 1$):
$$C_t = \left\{(P, Q) : d_{MMD}(\hat{P}_t, \hat{Q}_t) - \gamma_t \le d_{MMD}(P, Q) \le d_{MMD}(\hat{P}_t, \hat{Q}_t) + 2\kappa_t \right\},$$
where $\kappa_t = \sqrt{\left(\log\big((1 + \log_2 t)^2 \pi^2/6\big) + \log(4/\alpha)\right)/t}$ and $\gamma_t = \sqrt{\frac{4\left(\log\big(3.54e(1 + \log_2 t)^3\big) + \log(2/\alpha)\right)}{t}}$.

Figure 3. In all the instantiations of the BCS-Detector within Section 5 (Gaussian-mean, Bounded-mean, CDF change, Two-Sample, and Distribution-Shift), the width of the CS decays at a $\sqrt{(\log\log t + \log(1/\alpha))/t}$ rate. Hence, by Theorem 13, we expect the detection delay to have an inverse quadratic dependence on the magnitude of change ($\Delta$). The figure verifies this claim empirically. In particular, the solid lines plot the average detection delay, computed using 250 trials, against $1/\Delta^2$, where $\Delta$ is the change magnitude (i.e., $d(\theta_1, \theta_0)$) for each problem. The dashed lines of the same color represent the best linear fit between the observed detection delay and $1/\Delta^2$. The good agreement between these two lines for each problem validates the prediction of Theorem 13.

For this instantiation, Theorem 13 implies the following.

Corollary 23. With $X_t$ denoting the pair $(U_t, V_t)$ on $\mathcal{U} \times \mathcal{U}$, suppose that $X_1, \ldots, X_T$ are drawn i.i.d. from $\theta_0 = P_U \otimes P_V$, and $X_{T+1}, X_{T+2}, \ldots$ are drawn i.i.d. from a distribution $\theta_1 = Q_U \otimes Q_V$. Then, with $\Delta > 0$ denoting $d(\theta_0, \theta_1) = |d_{MMD}(P_U, P_V) - d_{MMD}(Q_U, Q_V)|$, we have the following upper bound on the expected detection delay of our BCS-Detector based on the CS described above:
$$E[(\tau - T)^+ \mid E] = O\left(\frac{\log(1/\alpha) + \log\log(1/\Delta)}{\Delta^2}\right),$$
where $E$ denotes the good event $\{d_{MMD}(P_U, P_V) \in C_t : t \le T - 1\}$ associated with the forward CS $\{C_t : t \ge 1\}$.

Remark 24. Consider the special case mentioned earlier, where $P_U = P_V = Q_U = P$ for some distribution $P$, and $Q_V = Q \neq P$ for some other distribution $Q$. Furthermore, assume that it is known that prior to the changepoint $P_U = P_V$ (that is, the event $E$ is a probability-one event). Then, the above result implies an upper bound on the expected detection delay of $O\left(\frac{\log(1/\alpha) + \log\log(1/\Delta)}{\Delta^2}\right)$ with $\Delta = d_{MMD}(P, Q)$. This matches existing results, such as the kernel CUSUM scheme of Wei and Xie (2022).

Remark 25. We focused on the case of the kernel-MMD metric mainly due to its generality (it is applicable to distributions over arbitrary spaces on which positive definite kernels can be defined). However, the same ideas are applicable to any statistical distance measure that is convex in its arguments, by using the reverse-submartingale-based confidence sequence construction of Manole and Ramdas (2021). This family includes all popular statistical distances, such as Wasserstein metrics, f-divergences, and general integral probability metrics.
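The scheme in this subsection only requires estimates of $d_{MMD}$ on blocks of data. A sketch of the standard plug-in (V-statistic) estimator, with a Gaussian kernel as our (arbitrary) choice of bounded positive-definite kernel:

```python
import numpy as np

def gaussian_kernel(a, b, bw=1.0):
    """Pairwise k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 bw^2)); bounded by 1."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2 * bw ** 2))

def mmd_estimate(u, v, bw=1.0):
    """Plug-in (V-statistic) estimate of d_MMD(P, Q) from u ~ P and v ~ Q."""
    m2 = (gaussian_kernel(u, u, bw).mean() + gaussian_kernel(v, v, bw).mean()
          - 2 * gaussian_kernel(u, v, bw).mean())
    return np.sqrt(max(m2, 0.0))

rng = np.random.default_rng(5)
u = rng.normal(0.0, 1.0, size=(500, 5))
v = rng.normal(0.5, 1.0, size=(500, 5))
print(mmd_estimate(u, v))  # clearly positive, since the two means differ
```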
Remark 26. The overall computational cost of our scheme is $O(\tau^3)$, as our scheme involves constructing a new backward CS, with $O(t^2)$ cost, every round. In practice, this complexity can be reduced, either by using linear or block-MMD statistics, and/or by computing a new backward CS less frequently (instead of doing so every round).

Empirical Verification. We study the performance of our scheme on a stream of paired multivariate Gaussian observations in $p = 5$ dimensions. The pre-change distributions $P_U$ and $P_V$ both have zero mean and identity covariance, while for the post-change distributions we have $Q_U = P_U$, and $Q_V$ has mean $\delta\mathbf{1}$ and a diagonal covariance matrix with randomly chosen values. In Figure 8 in Appendix E, we plot the performance of our changepoint detection scheme for a fixed problem instance with $\Delta = d_{MMD}(Q_U, Q_V) \approx 0.33$ and $T = 800$, while the inverse quadratic dependence of the average detection delay on $\Delta$ is verified in Figure 3.

5.5. Detecting Harmful Distribution Shifts

As a final application, consider the task of detecting harmful changes between the train and test distributions of a machine learning (ML) model. Following Podkopaev and Ramdas (2021), we are interested in detecting only those distribution changes that lead to a sufficiently large increase in the risk (i.e., expected loss) of the trained ML model. Formally, suppose a machine learning model, denoted by $h$, is trained on a dataset drawn i.i.d. from a source distribution $P_S$ taking values in some space $\mathcal{X}$. For some bounded loss function $\phi$, we let $\theta_0$ denote the expected training loss of this model: $\theta_0 = E_{P_S}[\phi(X, h)]$. Next, we assume that the model $h$ is deployed on a stream of test data, denoted by $X_1, X_2, \ldots$, drawn from the source (or training) distribution $P_S$ for $t \le T$, and from some other distribution $P_T \neq P_S$ afterwards. Our goal is to detect post-change distributions $P_T$ that are harmful to the trained model; that is, those that result in an increase in the expected loss: $\theta_1 := E_{P_T}[\phi(X, h)] > \theta_0$ (see Figure 10 in Appendix E).

For bounded loss functions $\phi$, this problem fits into the nonparametric change-of-mean detection framework of Section 5.2. Since we are only interested in one-sided changes, we can modify the strategy of Section 5.2 to use only upper CSs in the forward direction, and lower CSs in the backward direction. As in Proposition 19, for this strategy, we can show that the expected detection delay of the scheme will depend inversely on how harmful the target distribution is (i.e., on the gap $\theta_1 - \theta_0$).

Empirical Verification. To illustrate the ideas discussed above, we consider a simple binary classification problem with linear classifiers and 2-dimensional features (see Appendix E for details). We plot the performance of our scheme on a specific problem with $\Delta \approx 0.16$ in Figure 9 in Appendix E, and also verify the inverse quadratic dependence of the average detection delay on $\Delta$ in Figure 3.
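A sketch of the one-sided variant just described, in the simpler case where $\theta_0$ is known (so only the backward lower CS is needed). The width function here is an Example-6-style sub-Gaussian boundary rescaled by $1/2$ for $[0,1]$-valued losses; it is an illustrative stand-in of ours for the EB-CS of Section 5.2, not the paper's exact choice:

```python
import numpy as np

def width(t, alpha):
    # Example-6-style boundary rescaled by 1/2, since [0,1]-valued losses
    # are (1/2)-sub-Gaussian; a stand-in for the EB-CS of Section 5.2.
    return 0.5 * 3.4 * np.sqrt((np.log(np.log(2 * t)) + 0.72 * np.log(10.4 / alpha)) / t)

def harmful_shift_time(losses, theta0, alpha=0.005):
    """Alarm when the nested backward *lower* CS on the test risk exceeds theta0.

    theta0 is the known training risk, so no forward CS is needed (a
    one-sided analogue of the known-theta0 setting of Corollary 14).
    """
    losses = np.asarray(losses, dtype=float)
    for n in range(1, len(losses) + 1):
        rev = losses[n - 1::-1]                 # X_n, X_{n-1}, ..., X_1
        k = np.arange(1, n + 1)
        lower = np.cumsum(rev) / k - width(k, alpha) / 2
        if lower.max() > theta0:                # nested backward lower bound
            return n
    return None

rng = np.random.default_rng(6)
losses = np.concatenate([rng.binomial(1, 0.10, 800),    # deployed 0-1 loss, risk 0.10
                         rng.binomial(1, 0.35, 3000)])  # harmful shift to risk 0.35
print(harmful_shift_time(losses, theta0=0.10))
```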
5.6. Other change detection tasks

We have illustrated the generality of our BCS-Detector strategy by instantiating it for five different scenarios in this section. For simplicity, we focused mainly on univariate observations (with the exception of Section 5.4). However, we note that the same ideas used in the previous instantiations also carry over easily to more general observations, or under additional robustness or privacy constraints. We list some such examples here, without going into the details of analysis or practical implementations:

(i) Exponential family. In this case, the observations $X_1, X_2, \ldots$ lie in $\mathcal{X} = \mathbb{R}^p$, and the pre- and post-change distributions ($P_{\theta_0}$ and $P_{\theta_1}$) are chosen from a finite-dimensional exponential family with $\Theta = \mathbb{R}^m$ for some $m < \infty$. Here, we can use the BCS-Detector with the CSs derived by Chowdhury et al. (2022).

(ii) Covariance matrix. Again, we assume that $\mathcal{X} = \mathbb{R}^p$, but now we assume that at the changepoint $T$, the covariance matrix of the observations changes from $\theta_0 \in \mathbb{R}^{p \times p}$ to some $\theta_1 \neq \theta_0$. For this problem, we can instantiate the BCS-Detector with the CS for covariance matrices derived by Howard et al. (2021, Section 4.3).

(iii) Nonparametric regression. Suppose $U_1, U_2, \ldots$ denote i.i.d. uniform draws from $\mathcal{U} = [0, 1]^p$, for some $p \ge 1$. Let $\Theta$ denote an RKHS (with kernel $k$) of functions from $\mathcal{U}$ to $\mathbb{R}$. Then, for any $\theta \in \Theta$, define the random variable $Y_t \equiv Y_t(\theta) = \theta(U_t) + \eta_t$, where $\{\eta_t : t \ge 1\}$ is an i.i.d. sequence of 1-sub-Gaussian noise. Clearly, the joint distribution of $(U_t, Y_t)$ is parametrized by $\theta$. Consider the SCD problem where $\theta = \theta_0$ prior to the changepoint $T$, and $\theta = \theta_1$ after that, with $\|\theta_0 - \theta_1\|_k > 0$. Our BCS-Detector strategy is easily applicable to this scenario, with an infinite-dimensional index set $\Theta$, using the CS constructed by Chowdhury and Gopalan (2017).

(iv) Robust SCD. An interesting variant of the SCD problem involves detecting changepoints under adversarial contamination (Li and Yu, 2021). Our BCS-Detector strategy readily extends to such scenarios, by exploiting recent robust confidence sequence constructions, such as those by Wang and Ramdas (2023).

(v) Private SCD. Privacy is an important concern in many applications, especially those involving personal data, and is often ensured by revealing only randomized versions of the actual data to the analyst. This adds another layer of complexity to the usual SCD task (Cummings et al., 2018). However, our BCS-Detector framework can easily handle this, building upon the recent advances in private CS construction (Waudby-Smith et al., 2022).

6. Conclusion

We proposed a general strategy (BCS-Detector) for designing sequential changepoint detection (SCD) schemes by carefully combining confidence sequences (CSs) and backward CSs, a novel variant of CSs that we introduced in this paper. Under very mild and natural requirements on the CSs, we showed that the BCS-Detector provides tight control over the ARL and the detection delay. Leveraging the recent progress in constructing CSs, we instantiated our strategy for a wide range of SCD problems (both parametric and nonparametric), and empirically verified the theoretical claims via some small-scale numerical experiments. Our work opens up several directions for future work: (i) Constructing a new backward CS in every round can be computationally costly, and in most cases results in an overall quadratic (or even cubic) complexity. An interesting direction to pursue is to investigate whether we can achieve the same performance by updating the backward CS fewer times. (ii) In Remark 16 we defined estimators of the changepoint ($T$) and the change magnitude ($\Delta$), which performed well empirically as shown in Figure 2 and Appendix E. Establishing theoretical guarantees for them is another interesting question for future work.

References

M. Baron, V. Antonov, C. Huber, M. Nikulin, and V. Polischook. Early detection of epidemics as a sequential change-point problem. Longevity, aging and degradation models in reliability, public health, medicine and biology, LAD, pages 7–9, 2004.
H. Chen. Sequential change-point detection based on nearest neighbors. The Annals of Statistics, 47(3):1381–1407, 2019.
J. Chen, A. Yiğiter, and K.-C. Chang. A Bayesian approach to inference about a change point model with application to DNA copy number experimental data. Journal of Applied Statistics, 38(9):1899–1913, 2011.
Y. C. Chen, T. Banerjee, A. D. Dominguez-Garcia, and V. V. Veeravalli. Quickest line outage detection and identification. IEEE Transactions on Power Systems, 31(1):749–758, 2015.
S. R. Chowdhury and A. Gopalan. On kernelized multi-armed bandits. In International Conference on Machine Learning, pages 844–853. PMLR, 2017.
S. R. Chowdhury, P. Saux, O.-A. Maillard, and A. Gopalan. Bregman deviations of generic exponential families. arXiv preprint arXiv:2201.07306, 2022.
R. Cummings, S. Krehbiel, Y. Mei, R. Tuo, and W. Zhang. Differentially private change-point detection. Advances in Neural Information Processing Systems, 31, 2018.
D. A. Darling and H. Robbins. Confidence sequences for mean, variance, and median. Proceedings of the National Academy of Sciences, 58(1):66–68, 1967.
T. Flynn and S. Yoo. Change detection with the kernel cumulative sum algorithm. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 6092–6099. IEEE, 2019.
A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
S. R. Howard and A. Ramdas. Sequential estimation of quantiles with applications to A/B testing and best-arm identification. Bernoulli, 28(3):1704–1728, 2022.
S. R. Howard, A. Ramdas, J. McAuliffe, and J. Sekhon. Time-uniform, nonparametric, nonasymptotic confidence sequences. The Annals of Statistics, 49(2):1055–1080, 2021.
T. L. Lai and H. Xing. Sequential change-point detection when the pre- and post-change parameters are unknown. Sequential Analysis, 29(2):162–175, 2010.
T. S. Lau, W. P. Tay, and V. V. Veeravalli. A binning approach to quickest change detection with unknown post-change distribution. IEEE Transactions on Signal Processing, 67(3):609–621, 2018.
M. Li and Y. Yu. Adversarially robust change point detection. Advances in Neural Information Processing Systems, 34:22955–22967, 2021.
S. Li, Y. Xie, H. Dai, and L. Song. Scan B-statistic for kernel change-point detection. Sequential Analysis, 38(4):503–544, 2019.
G. Lorden. Procedures for reacting to a change in distribution. The Annals of Mathematical Statistics, pages 1897–1908, 1971.
O.-A. Maillard. Mathematics of statistical sequential decision making. Habilitation à Diriger des Recherches, 2019a.
O.-A. Maillard. Sequential change-point detection: Laplace concentration of scan statistics and non-asymptotic delay bounds. In Algorithmic Learning Theory, pages 610–632. PMLR, 2019b.
T. Manole and A. Ramdas. Martingale methods for sequential estimation of convex functionals and divergences. arXiv preprint arXiv:2103.09267, 2021.
F. Orabona. A modern introduction to online learning. arXiv preprint arXiv:1912.13213, 2019.
E. S. Page. Continuous inspection schemes. Biometrika, 41(1/2):100–115, 1954.
A. Podkopaev and A. Ramdas. Tracking the risk of a deployed model and detecting harmful distribution shifts. In International Conference on Learning Representations, 2021.
M. Pollak and D. Siegmund. Sequential detection of a change in a normal mean when the initial value is unknown. The Annals of Statistics, 19(1):394–416, 1991.
N. Puchkin and V. Shcherbakova. A contrastive approach to online change point detection. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 5686–5713. PMLR, 2023.
J. Sharpnack, A. Singh, and A. Rinaldo. Changepoint detection over graphs with the spectral scan statistic. In Artificial Intelligence and Statistics, pages 545–553. PMLR, 2013.
J. J. Shen and N. R. Zhang. Change-point model on nonhomogeneous Poisson processes with application in copy number profiling by next-generation DNA sequencing. The Annals of Applied Statistics, 6(2):476–496, 2012.
W. A. Shewhart. The application of statistics as an aid in maintaining quality of a manufactured product. Journal of the American Statistical Association, 20(152):546–548, 1925.
W. A. Shewhart. Economic quality control of manufactured product. Bell System Technical Journal, 9(2):364–389, 1930.
J. Shin, A. Ramdas, and A. Rinaldo. E-detectors: a nonparametric framework for online changepoint detection. arXiv preprint arXiv:2203.03532, 2022.
A. N. Shiryaev. On optimum methods in quickest detection problems. Theory of Probability & Its Applications, 8(1):22–46, 1963.
A. Tartakovsky, I. Nikiforov, and M. Basseville. Sequential Analysis: Hypothesis Testing and Changepoint Detection. CRC Press, 2014.
V. V. Veeravalli and T. Banerjee. Quickest change detection. In Academic Press Library in Signal Processing, volume 3, pages 209–255. Elsevier, 2014.
H. Wang and A. Ramdas. Catoni-style confidence sequences for heavy-tailed mean estimation. arXiv preprint arXiv:2202.01250, 2022.
H. Wang and A. Ramdas. Huber-robust confidence sequences. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 9662–9679. PMLR, 2023.
I. Waudby-Smith and A. Ramdas. Estimating means of bounded random variables by betting. Journal of the Royal Statistical Society B, 2023.
I. Waudby-Smith, Z. S. Wu, and A. Ramdas. Locally private nonparametric confidence intervals and sequences. arXiv preprint arXiv:2202.08728, 2022.
S. Wei and Y. Xie. Online kernel CUSUM for change-point detection. arXiv preprint arXiv:2211.15070, 2022.
L. Xie, S. Zou, Y. Xie, and V. V. Veeravalli. Sequential (quickest) change detection: Classical results and new directions. IEEE Journal on Selected Areas in Information Theory, 2(2):494–514, 2021.
X. Yu, M. Baron, and P. K. Choudhary. Change-point detection in binomial thinning processes, with applications in epidemiology. Sequential Analysis, 32(3):350–367, 2013.
W. Zaremba, A. Gretton, and M. Blaschko. B-test: A non-parametric, low variance kernel two-sample test. Advances in Neural Information Processing Systems, 26, 2013.

A. Additional Background

CS for non-i.i.d. observations. As mentioned in Remark 4, we can also define CSs for non-i.i.d. observations, as follows.

Definition 27. Suppose $X_1, X_2, \ldots$ denote an independent stream of observations on $\mathcal{X}$, with $X_t \sim P_{\vartheta_t}$ for $\vartheta_t \in \Theta$. Assuming $\Theta$ is a vector space, we say that $\{C_t \subseteq \Theta : t \ge 1\}$ is a level-$(1-\alpha)$ CS for the running average of parameters if $P\left(\forall t \ge 1 : (1/t)\sum_{i=1}^t \vartheta_i \in C_t\right) \ge 1 - \alpha$.

The same definition of pointwise width introduced in Definition 5 is still applicable to the above CS. However, for defining the uniform width, we need to take the supremum over the sequence of parameters, instead of a fixed parameter ($\theta$).
More specifically, the pointwise and uniform widths for the above CS are defined as
$$w(t, \vartheta_1^t, \alpha) = \sup_{\theta', \theta'' \in C_t} d(\theta', \theta''), \quad \text{and} \quad w(t, \Theta, \alpha) = \sup_{\vartheta_1^t \in \Theta^t} w(t, \vartheta_1^t, \alpha),$$
where $\vartheta_1^t$ denotes the parameter sequence $(\vartheta_1, \ldots, \vartheta_t)$. We show a simple illustration of the CS introduced in Example 6 with time-varying parameters in Figure 4.

Figure 4. An example of the CS introduced in Example 6 for the running (conditional) mean of independent Gaussian processes with a time-varying mean function, and variance fixed at 1.

Implementing backward CSs. If we know how to construct forward CSs, we can use that construction directly to obtain backward CSs, via the following steps:

At the end of round $n$, we have observed $X_1, \ldots, X_n$. Introduce the time-reversed version of the observations, $Y_s = X_{n+1-s}$ for $s \in [n] := \{1, \ldots, n\}$.

Construct a new level-$(1-\alpha)$ CS using $\{Y_s : s \in [n]\}$, denoted by $\{\bar{B}^{(n)}_s : s \in [n]\}$. Note that $\bar{B}^{(n)}_s$ is $\sigma(Y_1, \ldots, Y_s) = \sigma(X_n, X_{n-1}, \ldots, X_{n-s+1})$-measurable for all $s \in [n]$.

Finally, we again reverse the index of the CS to obtain $\{B^{(n)}_t : t \in [n]\}$, where $B^{(n)}_t = \bar{B}^{(n)}_{n+1-t}$. Note that by virtue of being a CS, we have $P\left(\forall t \in [n] : \theta_0 \in B^{(n)}_t\right) \ge 1 - \alpha$.

The superscript $(n)$ serves as a reminder that there is only one forward CS, but there is a different backward CS constructed afresh at each time $n$.

The kernel-MMD metric. In Section 5.4, we constructed a scheme for detecting changes in the pairwise kernel-MMD distance between the distributions generating a stream of paired observations. Here, we recall its definition. We assume that $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ denotes a uniformly-bounded positive-definite kernel, and let $\mathcal{H}_k$ denote the reproducing kernel Hilbert space (RKHS) associated with $k$.

Definition 28. Given a positive definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, the kernel-MMD distance between two distributions $P$ and $Q$ on $\mathcal{X}$ is defined as $d_{MMD}(P, Q) = \sup_{g \in \mathcal{H}_k : \|g\|_k \le 1} E_P[g(X)] - E_Q[g(Y)]$.

The kernel-MMD distance defined above is an instance of a class of statistical distances called integral probability metrics (IPMs). For a class of kernels, called characteristic kernels, it is known that $d_{MMD}$ is a distance metric on the space of probability distributions.

B. Proof of Proposition 8

Proof of the bound on probability of false alarm. Recall that the stopping time is defined as the first time, $\tau$, at which we have $\cap_{t=1}^{\tau} C_t = \emptyset$. Now, consider the good event of the CS under the null: $E = \cap_{t=1}^{\infty}\{\theta_0 \in C_t\}$, which satisfies $P(E) \ge 1 - \alpha$ by definition. Under this event, $\theta_0 \in \cap_{t=1}^{\infty} C_t \neq \emptyset$, which in turn implies that $P(\tau = \infty) \ge 1 - \alpha$, as required.

Proof of the bound on detection delay. Let $w(t) \equiv w(t, \Theta, \alpha)$ denote the width of the confidence set $C_t$ after $t$ observations. By assumption, $T$ is large enough to ensure that, under the good event $E = \cap_{t=1}^{\infty}\left\{\frac{1}{t}\sum_{i=1}^{t}\vartheta_i \in C_t\right\}$, we have $\theta_1 \notin C_T$ at the changepoint. Note that in the definition of $E$, we have $\vartheta_t = \theta_0\mathbf{1}_{t \le T} + \theta_1\mathbf{1}_{t > T}$. Under the event $E$, we know that $\theta_0 \in C_T$. For any $t > T$, introduce the terms $\lambda_t = T/t$ and $\bar\lambda_t = 1 - \lambda_t$. Then, by the definition of confidence sequences, we have $\lambda_t\theta_0 + \bar\lambda_t\theta_1 \in C_t$ for all $t > T$ under the event $E$. The width of the set $C_t$ at $t > T$ is no larger than $w(t) \equiv w(t, \Theta, \alpha)$. Hence, a sufficient condition for stopping by time $t > T$ is that the sum of the widths of $C_t$ and $C_T$, that is, $w(t) + w(T)$, is smaller than $d(\theta_0, \lambda_t\theta_0 + \bar\lambda_t\theta_1)$.

Informal calculations for Remark 9. As mentioned in Remark 9, in many cases the stopping time $\tau$ satisfies $\tau \le \min\{t \ge T : 1/\sqrt{T} \lesssim ((t-T)/t)\,\Delta\}$, where $\Delta = d(\theta_0, \theta_1)$. We now consider the behavior of the delay, $\tau - T$, in two different regimes of the changepoint $T$.
First, we consider the case where $T$ is small; that is, $T \approx 1/\Delta^2$. For concreteness, assume that $T = 9/\Delta^2$. Then it is easy to check that $\tau \lesssim 4T$, since for $t = 4T$ we have $((t-T)/t)\,\Delta = (3/4)\Delta \gtrsim 1/\sqrt{T} = \Delta/3$. Hence, in this regime the delay is of the same order as $T$ itself; that is, linear in $T$.

Next, we consider the case where $\Delta$ is fixed, but $T \to \infty$. In this case, we have $\tau - T = O(\sqrt{T}/\Delta)$. To see this, consider $t = T + u$, with $u = o(T)$. Then, we have $\frac{t-T}{t}\,\Delta \approx \frac{u}{T}\,\Delta$, and the stopping condition $1/\sqrt{T} \lesssim (u/T)\,\Delta$ holds once $u \gtrsim \sqrt{T}/\Delta$. Thus, the appropriate order of growth of the detection delay is $u = O(\sqrt{T}/\Delta)$, as $T \to \infty$ with $\Delta$ fixed.

Figure 5. The plots show the variation of the delay of the SCD scheme introduced in Section 3, under the conditions of Remark 9. When the changepoint $T \lesssim 1/\Delta^2$, the delay has a linear dependence on $T$, while in the regime where $T \to \infty$ with $\Delta$ fixed, the delay behaves like $\sqrt{T}$.

C. Proof of Theorem 13

Proof of the ARL control. To prove this result, note that for us to stop under the null at some time $N$, either the forward or the backward CS (or both) must be miscovering, i.e., failing to contain $\theta_0$. In other words, $\{\tau = N, T = \infty\} \subseteq \{C_N \text{ miscovers}\} \cup \{B^{(N)}_1 \text{ miscovers}\}$, and hence
$$\{\tau \le N, T = \infty\} \subseteq \{(C_t)_{t=1}^{N} \text{ miscovers}\} \cup \bigcup_{t=1}^{N} \{B^{(t)}_1 \text{ miscovers}\}.$$
In both of the above implications, we have used the property of nested CSs to define the miscovering events. Now, using the fact that the forward CS and each backward CS is a $(1-\alpha)$-CS, we have by a simple union bound that, for any fixed time $N$, $P(\tau \le N) \le \min\{(N+1)\alpha, 1\}$. Rephrasing, we have $P(\tau > N) \ge (1 - \alpha(N+1)) \vee 0$, and thus
$$E_\infty[\tau] = \sum_{N=0}^{\infty} P(\tau > N) \ge \sum_{N=1}^{1/\alpha - 1} (1 - \alpha - \alpha N) = \frac{1}{2\alpha} - \frac{3}{2} + \alpha \ge \frac{1}{2\alpha} - \frac{3}{2}.$$
As we alluded to earlier in Remark 1, we did not need to know $\theta_0$ or $\theta_1$ or the direction of the changepoint; the aforementioned argument goes through for any indexed family of distributions. Furthermore, the above guarantee does not require the width of the CSs (forward or backward) to decay to zero; it only uses the coverage property of CSs.

Proof of the detection delay bound. We next obtain the upper bound on the average detection delay conditioned on the event $E := \{\theta_0 \in C_t : 1 \le t < T\}$, stated in Theorem 13. Since $(\tau - T)^+ = \max\{0, \tau - T\}$ is a non-negative random variable, we have
$$E_T[(\tau - T)^+ \mid E] = \sum_{t=0}^{\infty} P_T\left((\tau - T)^+ > t \mid E\right) \le t' \sum_{i=0}^{\infty} P_T\left(\tau > T + i t' \mid E\right), \quad \text{for any } t' \ge 1. \tag{1}$$
Recall that $u_0 \equiv u_0(\theta_0, \theta_1, T) := \min\{t \ge 1 : w(T, \theta_0, \alpha) + w(t, \theta_1, \alpha) < d(\theta_0, \theta_1)\}$ was introduced in the statement of Theorem 13, and it represents an upper bound on the smallest time after $T$ at which the backward CS must stop intersecting the forward CS (assuming none of the CSs miscover). For any integer $i \ge 1$, consider the event $\{\tau > T + iu_0\}$, and note that it satisfies the following inclusions:
$$\{\tau > T + iu_0\} \cap E \subseteq \bigcap_{j=1}^{i}\left\{C_T \cap B^{(T+ju_0)}_T \neq \emptyset\right\} \cap E \subseteq \bigcap_{j=1}^{i}\left\{C_T \cap B^{(T+ju_0)}_{T+(j-1)u_0} \neq \emptyset\right\} \cap E \tag{2}$$
$$\subseteq \bigcap_{j=1}^{i}\left\{\theta_1 \notin B^{(T+ju_0)}_{T+(j-1)u_0}\right\}. \tag{3}$$
In the display above, (2) uses the fact that $B^{(n)}_s \subseteq B^{(n)}_{s'}$ for any $s' > s$. The inclusion (3) is the crucial observation of our proof. It relies on the fact that if, for some $j \ge 1$, the backward CS $\{B^{(T+ju_0)}_s : s \in [T+ju_0]\}$ does not miscover, then $B^{(T+ju_0)}_s$ contains $\theta_1$ for all $s \in \{T, \ldots, T+ju_0\}$, and in particular at $s = T + (j-1)u_0$. Also note that the diameter of $C_T$ is smaller than $w(T, \theta_0, \alpha)$, and that of $B^{(T+ju_0)}_{T+(j-1)u_0}$ is smaller than $w(u_0, \theta_1, \alpha)$. Finally, the definition of $u_0$ implies that $w(T, \theta_0, \alpha) + w(u_0, \theta_1, \alpha) < d(\theta_0, \theta_1)$, implying that $C_T$ and $B^{(T+ju_0)}_{T+(j-1)u_0}$ are contained in two disjoint balls, and hence are disjoint. Thus, if $C_T \cap B^{(T+ju_0)}_{T+(j-1)u_0} \neq \emptyset$, then the backward CS at $T + ju_0$ must miscover.
C. Proof of Theorem 13

Proof of the ARL control. To prove this result, note that for us to stop under the null at some time $\tau$, either the forward or the backward CS (or both) must miscover, i.e., fail to contain $\theta_0$. In other words,
\[
\{\tau = N, T = \infty\} \subseteq \{C_N \text{ miscovers}\} \cup \{B^{(N)}_1 \text{ miscovers}\},
\]
and hence
\[
\{\tau \leq N, T = \infty\} \subseteq \{(C_t)_{t=1}^{N} \text{ miscovers}\} \cup \bigcup_{t=1}^{N} \{B^{(t)}_1 \text{ miscovers}\}.
\]
In both of the above implications, we have used the property of nested CSs (Remark 3) to define the miscovering events. Now, using the fact that the forward CS and each backward CS is a level-$(1-\alpha)$ CS, a simple union bound gives, for any fixed time $N$,
\[
\mathbb{P}(\tau \leq N) \leq \min\{(N + 1)\alpha, 1\}.
\]
Rephrasing, we have $\mathbb{P}(\tau > N) \geq \max\{1 - \alpha(N + 1), 0\}$, and thus
\[
\mathbb{E}_\infty[\tau] = \sum_{N=0}^{\infty} \mathbb{P}(\tau > N) \geq \sum_{N=1}^{1/\alpha - 1} (1 - \alpha - \alpha N) = (1/\alpha - 1)(1 - \alpha) - \frac{\alpha}{2}\Big(\frac{1}{\alpha} - 1\Big)\frac{1}{\alpha} = \frac{1}{2\alpha} - \frac{3}{2} + \alpha.
\]
As we alluded to earlier in Remark 1, we did not need to know $\theta_0$ or $\theta_1$, or the direction of the changepoint; the aforementioned argument goes through for any indexed family of distributions. Furthermore, the above guarantee does not require the widths of the CSs (forward or backward) to decay to zero; it only uses the coverage property of CSs.

Proof of the detection delay bound. We next obtain the upper bound on the average detection delay conditioned on the event $E := \{\theta_0 \in C_t : 1 \leq t \leq T\}$, stated in Theorem 13. Since $(\tau - T)_+ = \max\{0, \tau - T\}$ is a non-negative random variable, we have, for any $t' \geq 1$,
\[
\mathbb{E}_T[(\tau - T)_+ \mid E] = \sum_{t=0}^{\infty} \mathbb{P}_T\big((\tau - T)_+ > t \mid E\big) \leq t' + \sum_{t \geq t'} \mathbb{P}_T(\tau > T + t \mid E). \tag{1}
\]
Recall that
\[
u_0 \equiv u_0(\theta_0, \theta_1, T) := \min\{u \geq 1 : w(T, \theta_0, \alpha) + w(u, \theta_1, \alpha) < d(\theta_0, \theta_1)\}
\]
was introduced in the statement of Theorem 13; $T + u_0$ is an upper bound on the smallest time after $T$ at which the backward CS must stop intersecting the forward CS (assuming none of the CSs miscover). For any integer $i \geq 1$, consider the event $\{\tau > T + i u_0\}$, and note that it satisfies the following inclusions:
\[
\{\tau > T + i u_0\} \cap E \subseteq \bigcap_{j=1}^{i} \big\{C_T \cap B^{(T + j u_0)}_T \neq \emptyset\big\} \cap E \subseteq \bigcap_{j=1}^{i} \big\{C_T \cap B^{(T + j u_0)}_{T + (j-1)u_0} \neq \emptyset\big\} \cap E \tag{2}
\]
\[
\subseteq \bigcap_{j=1}^{i} \big\{\theta_1 \notin B^{(T + j u_0)}_{T + (j-1)u_0}\big\}. \tag{3}
\]
In the display above, (2) uses the fact that $B^{(n)}_s \subseteq B^{(n)}_{s'}$ for any $s \leq s'$ (the nested version of the backward CS). The inclusion (3) is the crucial observation of our proof. It relies on the fact that if, for some $j \geq 1$, the backward CS $\{B^{(T + j u_0)}_s : s \in [T + j u_0]\}$ does not miscover, then $B^{(T + j u_0)}_s$ contains $\theta_1$ for all $s \in \{T, \ldots, T + j u_0\}$, and in particular at $s = T + (j - 1)u_0$. Also note that the diameter of $C_T$ is smaller than $w(T, \theta_0, \alpha)$, and that of $B^{(T + j u_0)}_{T + (j-1)u_0}$ is smaller than $w(u_0, \theta_1, \alpha)$. Finally, the definition of $u_0$ implies that $w(T, \theta_0, \alpha) + w(u_0, \theta_1, \alpha) < d(\theta_0, \theta_1)$, so $C_T$ and $B^{(T + j u_0)}_{T + (j-1)u_0}$ are contained in two disjoint balls, and hence are disjoint. Thus, if $C_T \cap B^{(T + j u_0)}_{T + (j-1)u_0} \neq \emptyset$, then the backward CS at $T + j u_0$ must miscover.

Next, we make the following two observations: (I) for $j \neq j'$, the events $E_j := \{\theta_1 \notin B^{(T + j u_0)}_{T + (j-1)u_0}\}$ and $E_{j'} := \{\theta_1 \notin B^{(T + j' u_0)}_{T + (j'-1)u_0}\}$ are independent; and (II) for all $1 \leq j \leq i$, the event $E_j$ is independent of $E$. The statement (I) follows from the observation that the event $E_j$ lies in the sigma-algebra $\sigma(\{X_k : T + (j-1)u_0 < k \leq T + j u_0\})$, while $E_{j'}$ lies in $\sigma(\{X_k : T + (j'-1)u_0 < k \leq T + j' u_0\})$; for $j \neq j'$ these index ranges are disjoint, so the events are independent. Similarly, the second statement (II) uses the fact that $E$ lies in $\sigma(\{X_k : 1 \leq k \leq T\})$, which is independent of $\sigma(\{X_k : T + (j-1)u_0 < k \leq T + j u_0\})$ for all $1 \leq j \leq i$. Based on the above observations, we conclude that, for all $i \geq 1$,
\[
\mathbb{P}_T(\tau > T + i u_0 \mid E) = \frac{\mathbb{P}_T(\{\tau > T + i u_0\} \cap E)}{\mathbb{P}_T(E)} \leq \frac{\mathbb{P}_T\big(\bigcap_{j=1}^{i} E_j \cap E\big)}{\mathbb{P}_T(E)} = \mathbb{P}_T\Big(\bigcap_{j=1}^{i} E_j\Big) \tag{4}
\]
\[
\leq \alpha^i. \tag{5}
\]
In the above display, (4) follows from (II), and (5) uses (I) together with the fact that each $E_j$ is a miscovering event of a level-$(1-\alpha)$ backward CS, and hence has probability at most $\alpha$.

Now, we return to (1) and set $t'$ to $i_0 u_0$ for some integer $i_0$ to be specified later. We then note that
\[
\mathbb{E}_T[(\tau - T)_+ \mid E] \leq i_0 u_0 + \sum_{i \geq i_0} \sum_{t = i u_0}^{(i+1) u_0 - 1} \mathbb{P}_T(\tau > T + t \mid E) \leq i_0 u_0 + u_0 \sum_{i \geq i_0} \alpha^i = i_0 u_0 + \frac{u_0\, \alpha^{i_0}}{1 - \alpha}. \tag{6}
\]
The inequality in (6) uses the bound $\mathbb{P}_T(\tau > T + i u_0 \mid E) \leq \alpha^i$ derived in (5). The final result, as stated in Theorem 13, then follows by selecting $i_0 = \lceil \log(1/(1-\alpha)) / \log(1/\alpha) \rceil$. Note that when $\alpha < 0.5$, we have $i_0 = 1$.

D. Deferred proofs from Section 5

D.1. Proofs of Corollary 17, Corollary 22, and Corollary 23

All three of these results can be obtained as a direct consequence of the following proposition.

Proposition 29. For some $\Delta > 0$, define the time
\[
t_0 := \min\Big\{t \geq 1 : c\sqrt{\frac{\log\log t + \log(1/\alpha)}{t}} \leq \Delta\Big\},
\]
where $c > 0$ is some constant. Then, we have $t_0 = O\big(c^2\, \frac{\log\log(c/\Delta) + \log(1/\alpha)}{\Delta^2}\big)$.

Proof. Without loss of generality, we assume that $c = 1$; equivalently, we can replace $\Delta$ with $\Delta/c$. Now, note that we can upper bound $t_0 \leq t_1 + t_2$, where
\[
t_1 = \min\{t \geq 1 : \sqrt{\log\log(t)/t} \leq \Delta/4\}, \qquad \text{and} \qquad t_2 = \min\{t \geq 1 : \sqrt{\log(1/\alpha)/t} \leq \Delta/4\}.
\]
By a simple calculation, we obtain $t_2 = O(\log(1/\alpha)/\Delta^2)$. Hence, to complete the proof, we will show that $t_1 = O(\log\log(1/\Delta)/\Delta^2)$. We proceed in two steps: (i) first we show that $t_1 \leq 32/\Delta^3$, and (ii) using this, we refine the result to show that $t_1 = O(\log\log(1/\Delta)/\Delta^2)$.

Let $t_3 = 32/\Delta^3$ for $\Delta \leq 1$. Then, observe that
\[
\frac{\log\log(t_3)}{t_3} = \frac{\Delta^3 \log\log(32/\Delta^3)}{32} = \frac{\Delta^2}{16} \cdot \frac{\Delta \log\log(32/\Delta^3)}{2}, \qquad \text{with} \quad \frac{\Delta \log\log(32/\Delta^3)}{2} \leq 0.63 < 1 \quad \text{for all } \Delta \leq 1.
\]
The last inequality, along with the definition of $t_1$, implies that $t_1 \leq t_3$. Hence,
\[
\log\log(t_1) \leq \log\log(t_3) = \log\log(32/\Delta^3) \leq 2.35 + \log\log(1/\Delta).
\]
Moreover, by the minimality of $t_1$, the defining condition fails at $t_1 - 1$, which implies that
\[
t_1 \leq 1 + \frac{16 \log\log(t_1)}{\Delta^2} \leq \frac{16}{\Delta^2}\big(2.35 + \log\log(1/\Delta)\big) + 1 = O\big(\log\log(1/\Delta)/\Delta^2\big).
\]
Combining this bound on $t_1$ with the previously obtained upper bound on $t_2$, and using the fact that $t_0 \leq t_1 + t_2$, we get the required result.
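Proposition 29 is easy to sanity-check numerically. The sketch below (our illustration, with $c = 1$) computes $t_0$ by a direct scan and shows that $t_0 \Delta^2$ stays nearly constant as $\Delta$ shrinks, drifting only through the doubly-logarithmic term:

```python
import math

def t0(delta, alpha, c=1.0):
    """Smallest t with c*sqrt((log log t + log(1/alpha))/t) <= delta; scan from t = 3."""
    t = 3  # log log t requires t > e
    while c * math.sqrt((math.log(math.log(t)) + math.log(1 / alpha)) / t) > delta:
        t += 1
    return t

alpha = 0.01
for delta in [0.5, 0.25, 0.125, 0.0625]:
    t = t0(delta, alpha)
    print(delta, t, t * delta ** 2)   # t grows like 1/delta^2, up to log log factors
```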
D.2. Proof of Proposition 19

To prove the variance-adaptive bound on the expected detection delay, we first show the following result for the width of the CS derived by Waudby-Smith and Ramdas (2023).

Proposition 30. We can modify the backward CS used in instantiating the BCS-Detector in Section 5.2 to obtain a level-$(1 - 2\alpha)$ backward CS (for every $n \geq T$), denoted by $\{B^{(n)}_t : 1 \leq t \leq n\}$, such that the width of $B^{(n)}_t$ satisfies
\[
w(t, \theta_1, \alpha) = O\Big(\sigma_1 \frac{\sqrt{\log(1/\alpha)} + \sqrt{\log t}}{\sqrt{t}}\Big), \qquad \text{for all } T \leq t \leq n.
\]

Remark 31. The main benefit of this result is that it characterizes the width of the backward CS (for $n \geq T$) explicitly in terms of the standard deviation $\sigma_1$ of the post-change distribution $P_{\theta_1}$, unlike the original CS described in Section 5.2, whose width depends on empirical estimates of $\sigma_1$.

As a consequence of Proposition 30, we obtain Proposition 19 by first appealing to Corollary 14, and then repeating the calculations used to obtain Proposition 29.

Proof. Since we are only interested in characterizing the order at which the width of the CS decays (and not the exact constants), we will not track constants in this proof. In particular, we write $A \lesssim B$ to indicate that $A/B = O(1)$, and $A \asymp B$ to indicate that $A \lesssim B$ and $B \lesssim A$. We proceed in the following steps.

First, we show that we can construct a level-$(1-\alpha)$ confidence sequence for the variance based on samples from the post-change distribution. In particular, let $\widehat{\sigma}^2_n = \big(\tfrac{1}{4} + \sum_{t=1}^{n} (X_t - \widehat{\mu}_t)^2\big)/(n+1)$, with $X_1, X_2, \ldots \sim P_{\theta_1}$ i.i.d. Then, we have
\[
\mathbb{P}(E_1) \geq 1 - \alpha, \qquad \text{where} \quad E_1 := \Big\{\forall n \geq 1 : |\widehat{\sigma}^2_n - \sigma^2_1| = O\Big(\sqrt{\tfrac{\log\log n + \log(1/\alpha)}{n}}\Big)\Big\}. \tag{7}
\]
Thus, on this event, for any $n \geq T$, we can modify the backward CS to get a level-$(1 - 2\alpha)$ backward CS in which $\widehat{\sigma}$ is replaced by $\sigma_1$ (plus a small approximation-error term) for $T \leq t \leq n$. In the next two steps, we characterize the width of these level-$(1-2\alpha)$ CSs.

Next, we show that under the event $E_1$, we have
\[
\sum_{i=1}^{n} v_i \psi_E(\lambda_i) \lesssim \log(\sigma^2_1 n). \tag{8}
\]
Under the same event, we then show that
\[
\sum_{i=1}^{n} \lambda_i \asymp \frac{1}{\sigma_1} \sqrt{n \log(2/\alpha)}. \tag{9}
\]
Combining these results, we get that the width of the CS is of the order
\[
w_n \lesssim \frac{\log(2/\alpha) + \sum_{t=1}^{n} v_t \psi_E(\lambda_t)}{\sum_{t=1}^{n} \lambda_t} \lesssim \sigma_1 \frac{\sqrt{\log(2/\alpha)} + \sqrt{\log n}}{\sqrt{n}},
\]
as required. Thus, it remains to prove (7), (8), and (9).

Proof of (7). We first introduce the usual unbiased estimate of the variance: $\widetilde{\sigma}^2_n = \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} \frac{(X_i - X_j)^2}{2}$. Since the observations $X_1, X_2, \ldots \sim P_{\theta_1}$ are bounded, lying in $[0, 1]$, it is easy to verify that $|\widehat{\sigma}^2_n - \widetilde{\sigma}^2_n| = O(1/n)$, and hence $\widehat{\sigma}^2_n \asymp \widetilde{\sigma}^2_n$. Now, $\widetilde{\sigma}^2_n$ is an instance of a U-statistic, and hence the process $\{\widetilde{\sigma}^2_n : n \geq 1\}$ is a reverse martingale adapted to the exchangeable filtration. Using this fact, along with the boundedness (and hence sub-Gaussianity) of $\widetilde{\sigma}^2_n$ for all $n \geq 1$, we can appeal to Manole and Ramdas (2021, Corollary 8) to conclude the time-uniform concentration result:
\[
\mathbb{P}\big(\forall n \geq 1 : |\widetilde{\sigma}^2_n - \sigma^2_1| = O(r_n)\big) \geq 1 - \alpha, \qquad \text{where } r_n = \sqrt{(\log\log n + \log(1/\alpha))/n}.
\]

Proof of (8). Recall that $\psi_E(\lambda)/(\lambda^2/8) \to 1$ as $\lambda \to 0$, and that $\lambda_i \asymp \sqrt{\log(2/\alpha)/(i\, \widehat{\sigma}^2_{i-1})}$. Hence, we have the following:
\[
\sum_{i=1}^{n} v_i \psi_E(\lambda_i) \asymp \sum_{i=1}^{n} (X_i - \widehat{\mu}_{i-1})^2 \lambda_i^2 \asymp \sum_{i=1}^{n} \frac{(X_i - \widehat{\mu}_{i-1})^2}{i\, \widehat{\sigma}^2_{i-1}} \asymp \sum_{i=1}^{n} \frac{(X_i - \widehat{\mu}_{i-1})^2}{\sum_{j=1}^{i} (X_j - \widehat{\mu}_{j-1})^2} \lesssim \log\big(n\, \widetilde{\sigma}^2_n\big) \asymp \log(n \sigma^2_1). \tag{10}
\]
In the display above, the penultimate inequality (10) follows by an application of the following lemma, with $f(x) = 1/x$ and $a_i = (X_i - \widehat{\mu}_{i-1})^2$.

Lemma 32 (Orabona (2019), Lemma 4.13). Let $a_i \geq 0$ for all $i$, and let $f : [0, \infty) \to [0, \infty)$ be a non-increasing function. Then
\[
\sum_{t=1}^{T} a_t f\Big(a_0 + \sum_{i=1}^{t} a_i\Big) \leq \int_{a_0}^{a_0 + \sum_{t=1}^{T} a_t} f(x)\, dx.
\]
This concludes the proof of (8).

Proof of (9). We proceed as follows, with $r_i = \sqrt{(\log\log i + \log(1/\alpha))/i}$. Under the event $E_1$, we have $\widehat{\sigma}^2_{i-1} \asymp \sigma^2_1 (1 + r_i)$, and hence
\[
\sum_{i=1}^{n} \lambda_i \asymp \sum_{i=1}^{n} \sqrt{\frac{\log(2/\alpha)}{i\, \sigma^2_1 (1 + r_i)}} \asymp \frac{1}{\sigma_1} \sqrt{n \log(2/\alpha)},
\]
since $r_i = o(1)$ and $\sum_{i=1}^{n} i^{-1/2} \asymp \sqrt{n}$. This completes the proof of Proposition 30.

E. Details of Experiments

Details of the bounded source with a specified mean (Section 5.2). For testing the performance of the SCD scheme described in Section 5.2, we constructed a probability distribution $P_\theta$ over $\mathcal{X} = [0, 1]$ with a specified mean $\theta \in [0, 1]$ by appropriately mixing two uniform distributions. In particular, we define $P_\theta = (1 - \theta) U_1 + \theta U_2$, where $U_1 \sim \mathrm{Uniform}([0, \theta])$ and $U_2 \sim \mathrm{Uniform}([\theta, 1])$.
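A quick check that this mixture indeed has mean $\theta$: $\mathbb{E}[X] = (1-\theta)\cdot\frac{\theta}{2} + \theta\cdot\frac{1+\theta}{2} = \theta$. The sampler below (our illustration) draws from $P_\theta$ and verifies this empirically:

```python
import numpy as np

def sample_P(theta, size, rng):
    """Draw from P_theta = (1-theta)*U([0,theta]) + theta*U([theta,1]), which has mean theta."""
    from_upper = rng.random(size) < theta      # pick U2 with probability theta
    lower = rng.uniform(0.0, theta, size)      # U1 ~ Uniform([0, theta])
    upper = rng.uniform(theta, 1.0, size)      # U2 ~ Uniform([theta, 1])
    return np.where(from_upper, upper, lower)

rng = np.random.default_rng(1)
for theta in [0.4, 0.6]:                       # the pre- and post-change means of Figure 6
    print(theta, sample_P(theta, 200_000, rng).mean())   # empirical mean is close to theta
```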
Details of the CDF change detection experiment (Section 5.3). For this experiment, we used univariate t-distributions with 3 degrees of freedom. For the pre-change distribution, we set the mean to 0 and the scale parameter to 1. For the post-change distribution, we set the mean to some value $\Delta > 0$, and the scale parameter to 2.

Details of the binary classification source (Section 5.5). We consider feature-label pairs $(Z_i, L_i) \in \mathbb{R}^2 \times \{0, 1\}$, with a source distribution $P_S = P_L \times P_{Z|L}$. We assume that the label $L$ is drawn uniformly on the set $\{0, 1\}$, and the features are drawn from a bivariate normal conditioned on the labels: $P_{Z|L} = N(\mu_L, I_2)$, with $\mu_L = (2L - 1)[1, 0]^\top \in \mathbb{R}^2$. For this problem, we consider linear classifiers parameterized by a weight vector $w \in \mathbb{R}^2$, of the form $h_w(z) = \mathbf{1}_{\langle w, z \rangle \geq 0}$. For the 0-1 loss function, $\phi(z, l, h_w) = \mathbf{1}_{h_w(z) \neq l}$, it is easy to check that the Bayes-optimal classifier for the source distribution is $h^* = h_{w^*}$, with $w^* = [1, 0]^\top$. For the post-change distributions, we rotate the means of the features by an angle $\gamma$; that is, $\mu_L = (2L - 1)[\cos\gamma, \sin\gamma]^\top$.

[Figure 6. The figures show the performance of our changepoint detection scheme with independent bounded observations whose mean changes from $p_0 = 0.4$ to $p_1 = 0.6$ at the time $T = 800$. The first plot shows the forward and backward CSs at the time of detection ($\tau = 863$) in one of the trials, with the shaded gray region being the points at which the two CSs disagree. The next three plots show the empirical distribution of the detection delay, the estimated changepoint location, and the estimated change magnitude over 250 trials of the experiment, with the same value of $\Delta = 0.2$ and $\alpha = 0.01$.]

[Figure 7. The figures show the performance of our changepoint detection scheme with observations drawn from univariate t-distributions (3 degrees of freedom) whose mean changes from 0 to 1.0 at the time $T = 800$. The first plot shows the forward and backward CSs around the empirical CDFs, at the time of detection in one of the trials. The next three plots show the empirical distribution of the detection delay, the estimated changepoint location, and the estimated change magnitude over 250 trials of the experiment.]

F. Repeated sequential test interpretation

Our scheme as repeated sequential tests. Due to the equivalence between sequential hypothesis tests and confidence sequences, we can also motivate our general strategy (Definition 12) in the language of sequential hypothesis testing. In particular, owing to the time-uniform coverage guarantees of CSs, our approach can be informally described as follows: in each round $t \geq 2$, we run a new sequential hypothesis test for every $1 \leq s \leq t - 1$ to decide whether $(X_1, \ldots, X_s)$ and $(X_{s+1}, \ldots, X_t)$ are drawn from the same distribution or not. As soon as we find a $t$ for which some partition of the observations is sufficiently distinct, we can stop and declare the existence of a changepoint. As we saw in Section 4, this idea can be implemented in an elegant manner by combining a single forward CS (as in Section 3) with a succession of CSs constructed on reversed versions of the data, which we called backward CSs. A schematic sketch of this loop is given below.
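The following is a schematic sketch of this loop for a real-valued mean (our illustration, not the paper's reference implementation): the `width` helper is a stand-in for any valid CS construction, here the stitched-style boundary quoted in Appendix G for 1-sub-Gaussian data, and we omit the running-intersection refinement of Remark 3 for simplicity.

```python
import math
import numpy as np

def width(t, alpha):
    """Stand-in CS width for 1-sub-Gaussian means (valid for t >= 2);
    any valid confidence-sequence construction can be substituted here."""
    return 1.7 * math.sqrt((math.log(math.log(2 * t)) + 0.72 * math.log(10.4 / alpha)) / t)

def bcs_detector(stream, alpha=0.01):
    """Schematic BCS-Detector: alarm at the first n where, for some split t, the
    forward CS on X_1..X_t and the backward CS on X_{t+1}..X_n do not intersect."""
    csum = [0.0]
    for n, x in enumerate(stream, start=1):
        csum.append(csum[-1] + x)
        for t in range(2, n - 1):                  # backward block X_{t+1}..X_n
            m = n - t                              # backward block length (>= 2)
            f_mean, f_w = csum[t] / t, width(t, alpha)
            b_mean, b_w = (csum[n] - csum[t]) / m, width(m, alpha)
            if abs(f_mean - b_mean) > f_w + b_w:   # disjoint intervals => alarm
                return n
    return None

rng = np.random.default_rng(7)
stream = np.concatenate([rng.normal(0, 1, 800), rng.normal(1, 1, 400)])
print(bcs_detector(stream))   # typically alarms a few dozen steps after T = 800
```

The inner loop makes the "repeated sequential tests" reading explicit: each split $t$ is a two-sample test of the prefix against the suffix, and the detector stops as soon as any such test rejects.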
Connections to CuSum. We now demonstrate that the popular parametric SCD scheme, CuSum, can also be interpreted as performing repeated sequential tests. Let $f_0$ and $f_1$ denote two density functions on some observation space $\mathcal{X}$. Let $\{X_t : t \geq 1\}$ denote a sequence of independent observations, and consider a changepoint detection problem with $f_0$ and $f_1$ as the pre- and post-change densities respectively.

Definition 33 (CuSum). The cumulative sum (CuSum) method proceeds as follows: $\tau_c = \min\{n \geq 1 : W_n \geq b_\alpha\}$, where $W_1 = 0$, and
\[
W_n = \max_{1 \leq t \leq n} \prod_{i=t}^{n} \frac{f_1(X_i)}{f_0(X_i)}, \qquad \text{for } n \geq 2.
\]
The term $b_\alpha$ is selected to ensure that the ARL is at least $1/\alpha$ for a given $\alpha \in (0, 1)$.

[Figure 8. The figures show the performance of our changepoint detection scheme with independent paired multivariate-Gaussian observations whose kernel-MMD distance changes from a pre-change value of 0 to 0.33 at the time $T = 800$.]

[Figure 9. The figures show the performance of our scheme for detecting harmful changes in the test distribution for two-dimensional feature vectors, as described in Section 5.5. In these plots, there is a change of magnitude 0.16 at the time $T = 800$.]

[Figure 10. The first plot shows the samples corresponding to the two labels ($L = 0$ and $L = 1$), as well as the optimal linear classifier $h^*$ for this problem (the dashed black line). In the second plot, the distributions are more separated, and the same classifier $h^*$ is still Bayes-optimal for this problem, with a smaller risk. Finally, the third figure shows an example of a harmful distribution shift: the feature distributions for the two labels are rotated anti-clockwise by 45 degrees, which makes $h^*$ a suboptimal classifier for this problem. The new Bayes-optimal classifier is shown by the red dotted line.]

Observe that the CuSum procedure can also be interpreted as a sequence of repeated sequential (power-one) tests in the backward direction, testing the null $f_0$ against the alternative $f_1$. These are power-one tests (as opposed to the usual SPRT) because they stop only when the likelihood-ratio process is large, and not when it is small. We now introduce an alternative version of CuSum. In this definition, we use the convention that the minimum (or infimum) over an empty set is infinity.

Definition 34 (CuSum-II). For any $n \geq 1$ and $1 \leq t \leq n$, define the backward sequential test $\tau^{\mathrm{back}}_n$ as follows:
\[
\tau^{\mathrm{back}}_n = \min\{1 \leq t \leq n : L^n_t \geq b_\alpha\}, \qquad \text{where} \quad L^n_t = \prod_{s=t}^{n} \frac{f_1(X_s)}{f_0(X_s)}.
\]
Then, we can define the modified CuSum stopping time as $\tau'_c := \inf\{n \geq 1 : \tau^{\mathrm{back}}_n < \infty\}$.

Proposition 35. The two stopping times defined in Definition 33 and Definition 34 are the same.

Proof. We show that for any $N \in \mathbb{N}$, the events $\{\tau_c = N\}$ and $\{\tau'_c = N\}$ are equal:
\[
\{\tau_c = N\} = \{W_N \geq b_\alpha\} \cap \bigcap_{n=1}^{N-1} \{W_n < b_\alpha\} = \{\exists t \in [N] : L^N_t \geq b_\alpha\} \cap \bigcap_{n=1}^{N-1} \{L^n_t < b_\alpha,\ \forall t \in [n]\}
\]
\[
= \{\tau^{\mathrm{back}}_N < \infty\} \cap \bigcap_{n=1}^{N-1} \{\tau^{\mathrm{back}}_n = \infty\} = \{\tau'_c = N\}.
\]
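For concreteness, here is a minimal log-space implementation of Definition 33 (our illustration; the Gaussian mean change from 0 to 1 and the choice $b_\alpha = 1/\alpha$ are assumptions made for the example). It uses the standard identity $\max_{1 \leq t \leq n} \sum_{i=t}^{n} \ell_i = \ell_n + \max\{0, \max_{1 \leq t \leq n-1} \sum_{i=t}^{n-1} \ell_i\}$ to update the statistic in $O(1)$ per step:

```python
import math
import numpy as np

def cusum(stream, log_lr, log_b):
    """CuSum (Definition 33) in log space: alarm when
    max_{1<=t<=n} sum_{i=t}^{n} log(f1/f0)(X_i) >= log_b."""
    s = 0.0
    for n, x in enumerate(stream, start=1):
        s = log_lr(x) + max(s, 0.0)    # log W_n = l_n + max(log W_{n-1}, 0)
        if n >= 2 and s >= log_b:      # Definition 33 sets W_1 = 0, so no alarm at n = 1
            return n
    return None

# Gaussian mean change: f0 = N(0,1), f1 = N(1,1), so log(f1/f0)(x) = x - 1/2.
rng = np.random.default_rng(3)
stream = np.concatenate([rng.normal(0, 1, 800), rng.normal(1, 1, 400)])
alpha = 0.01
print(cusum(stream, lambda x: x - 0.5, math.log(1 / alpha)))  # typically alarms shortly after 800
```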
F.1. CuSum as an instance of the BCS-Detector

We now discuss how the CuSum test for simple pre- and post-change distributions can be considered an instance of our general BCS-Detector method. To do this, we need to identify a quantity for which we can construct forward and backward CSs. Define $\theta_0$ and $\theta_1$ as the pre- and post-change means of the likelihood ratio:
\[
\theta_0 = \mathbb{E}_{f_0}\Big[\frac{f_1(X)}{f_0(X)}\Big] = 1, \qquad \text{and} \qquad \theta_1 = \mathbb{E}_{f_1}\Big[\frac{f_1(X)}{f_0(X)}\Big] = 1 + d_{\chi^2}(f_0, f_1) > 1.
\]
Since the pre-change distribution is known, we set the forward CS simply equal to $\{\theta_0\}$; that is, $C_t = \{1\}$ for all $t \geq 1$. For any $n \geq 1$, we can construct a betting-based backward CS consisting of subsets of $\Theta = \{\theta_0, \theta_1\} = \{1, \theta_1\}$. To do this, we introduce the following wealth processes, with $\varrho : \mathbb{R} \to [-1, 1]$ denoting an odd sigmoid function:
\[
W^{(n)}_t(\theta_1) = \prod_{i=t}^{n} \Big(1 + \varrho\Big(\frac{f_1(X_i)}{f_0(X_i)} - \theta_1\Big)\Big), \qquad \text{and} \qquad W^{(n)}_t(1) = \prod_{i=t}^{n} \frac{f_1(X_i)}{f_0(X_i)},
\]
and define
\[
B^{(n)}_t = \{a \in \Theta : W^{(n)}_t(a) < 1/\alpha\}.
\]
Since $C_t = \{1\}$ for all $t \geq 1$, the BCS-Detector stops at the first time $n$ at which the backward CS rejects the point 1. In other words, we can define the stopping time $\tau$ as follows:
\[
\tau := \min\{n \geq 1 : \exists t \in [n],\ W^{(n)}_t(1) \geq 1/\alpha\} = \min\Big\{n \geq 1 : \max_{t \in [n]} W^{(n)}_t(1) \geq 1/\alpha\Big\}.
\]
The above stopping time is the same as the original CuSum with $b_\alpha = 1/\alpha$.

G. Details for the Gaussian mean change detection problem

Setup. Recall that in this problem, we are given a stream of real-valued observations $X_1, X_2, \ldots$, drawn independently according to the distribution $N(\mu_t, 1)$, with $\mu_t = \theta_0$ for $t \leq T$, and $\mu_t = \theta_1$ for $t > T$. Here $\theta_0 \neq \theta_1$ are two unknown parameters, belonging to the parameter set $\Theta = \mathbb{R}$ endowed with the distance metric $d(\theta, \theta') = |\theta - \theta'|$.

Confidence sequences. To instantiate both of our schemes (the FCS-Detector of Section 3, and the BCS-Detector of Section 4), we need to construct confidence sequences. A suitable closed-form CS for the mean of an independent Gaussian process was derived by Howard et al. (2021), and can be directly employed as the forward CS in both of our schemes:
\[
C_t = [\bar{X}_t \pm w_t], \qquad \text{where} \quad w_t = 1.7\sqrt{\frac{\log\log(2t) + 0.72\log(10.4/\alpha)}{t}}, \quad \bar{X}_t = \frac{1}{t}\sum_{s=1}^{t} X_s.
\]
The same CS can also be used in the definition of the new backward CS in every round $n$, as follows:
\[
B^{(n)}_t = \big[\bar{X}_{t:n} \pm w_{n-t+1}\big], \qquad \text{where} \quad \bar{X}_{t:n} = \frac{1}{n - t + 1}\sum_{s=t}^{n} X_s.
\]
Note that the width of the (forward) CS decays uniformly to 0 as the number of observations grows, and thus the conditions of both Proposition 8 and Theorem 13 are satisfied.
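In code, the forward CS and the round-$n$ backward CS above take only a few lines. The sketch below is our illustration, using the boundary constants quoted above (the formula requires $t \geq 2$ for $\log\log(2t)$ to be positive):

```python
import numpy as np

def stitched_width(t, alpha):
    """Two-sided stitched boundary of Howard et al. (2021); valid for t >= 2."""
    return 1.7 * np.sqrt((np.log(np.log(2 * t)) + 0.72 * np.log(10.4 / alpha)) / t)

def forward_cs(xs, alpha):
    """C_t = [mean(X_1..X_t) +/- w_t], returned for t = 2..n."""
    xs = np.asarray(xs, dtype=float)
    t = np.arange(2, len(xs) + 1)
    means = np.cumsum(xs)[1:] / t
    w = stitched_width(t, alpha)
    return means - w, means + w

def backward_cs(xs, alpha):
    """B_t^{(n)} = [mean(X_t..X_n) +/- w_{n-t+1}]: run the forward CS on reversed
    data and flip the index back, exactly as in the construction of Appendix A."""
    lo, hi = forward_cs(np.asarray(xs, dtype=float)[::-1], alpha)
    return lo[::-1], hi[::-1]          # entries correspond to t = 1..n-1

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 1000)         # no changepoint
flo, fhi = forward_cs(x, 0.01)         # indices t = 2..n
blo, bhi = backward_cs(x, 0.01)        # indices t = 1..n-1
# Align the common indices t = 2..n-1 and check whether the two CSs ever disagree:
print(np.any(np.maximum(flo[:-1], blo[1:]) > np.minimum(fhi[:-1], bhi[1:])))  # typically False
```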
Performance guarantees for the FCS-Detector. For the SCD scheme obtained by using the above CS in the FCS-Detector method, we can claim the following as a consequence of Proposition 8:

If there is no changepoint, then the probability that the FCS-Detector ever stops is equal to the probability that the running intersection of the CS $\{C_t : t \geq 1\}$ ever becomes empty. This is upper bounded by $\alpha$, by the definition of CSs.

Suppose the changepoint occurs at some time $T \geq 1$. For detection to be possible by the FCS-Detector, we require the changepoint to satisfy $T \geq T_1$, where $T_1 := \min\{t \geq 1 : w_t < d(\theta_0, \theta_1)\}$. Without this requirement, we do not have enough pre-change observations (relative to the change magnitude $d(\theta_0, \theta_1)$) to estimate the pre-change mean parameter $\theta_0$ sufficiently well. Assuming this requirement holds, Proposition 8 implies that the following upper bound on the detection delay holds with probability at least $1 - \alpha$:
\[
\tau - T \leq \min\Big\{t > T : w(t) + w(T) \leq \Delta\, \frac{t - T}{t}\Big\} - T, \qquad \text{where } \Delta = |\theta_0 - \theta_1|.
\]
As we discussed in Appendix B, this detection delay is $O(T)$ when $T \asymp 1/\Delta^2$, and $O(\sqrt{T}/\Delta)$ when $T \to \infty$ with $\Delta$ fixed. In both cases, the detection delay can be made arbitrarily large by making the changepoint $T$ large. Hence, the FCS-Detector provides very strong control of false positives at the cost of weak detection guarantees. We now show how a better trade-off is achieved by the BCS-Detector.

Performance guarantees for the BCS-Detector. By specializing the general result of Theorem 13 to our problem, we get the following performance guarantees:

Instead of a high-probability bound on the false alarm rate, the BCS-Detector provides a guarantee on the average run length (ARL): $\mathbb{E}_\infty[\tau] \geq \frac{1}{2\alpha} - \frac{3}{2}$. In other words, it guarantees that when there is no change, the BCS-Detector will raise a (false) alarm roughly every $\frac{1}{2\alpha}$ steps.

When there is a change in distribution at some finite time $T$, the scheme guarantees the following upper bound on the detection delay, as we showed in Corollary 17:
\[
\mathbb{E}[(\tau - T)_+ \mid E] = O\Big(\frac{\log\log(1/\Delta) + \log(1/\alpha)}{\Delta^2}\Big), \qquad \text{where } \Delta = |\theta_0 - \theta_1|,
\]
and $E$ is the good event of probability at least $1 - \alpha$ associated with the forward CS.

Finally, note that if the pre-change mean parameter $\theta_0$ is known, then the event $E$ has probability 1; hence, in this case, we obtain an unconditional version of the above statement. Furthermore, since for Gaussian distributions with unit variance $d_{\mathrm{KL}}(P_{\theta_1}, P_{\theta_0}) = \frac{1}{2}\Delta^2$, the expected detection delay is
\[
O\Big(\frac{\log\log\big(1/d_{\mathrm{KL}}(P_{\theta_1}, P_{\theta_0})\big) + \log(1/\alpha)}{d_{\mathrm{KL}}(P_{\theta_1}, P_{\theta_0})}\Big),
\]
which, as we discussed in Remark 18, is asymptotically near-optimal.

H. Additional Experiments

In this section, we compare the performance of our SCD scheme with the kernel-CUSUM strategy of Flynn and Yoo (2019) on the problem of testing for homogeneity in a stream of paired observations (Section 5.4). We consider the following datasets for binary classification from the UCI Machine Learning Repository: Higgs, Banknote, and Occupancy.

The kernel-CUSUM test proposed by Flynn and Yoo (2019) declares a detection the first time a running estimate of the kernel-MMD distance between the two streams exceeds a threshold $a$. That is, $\tau = \min\{n \geq 1 : L_n \geq a\}$, where $L_0 = 0$, and
\[
L_{2n} = L_{2n-2} + k(U_{2n-1}, U_{2n}) + k(V_{2n-1}, V_{2n}) - k(U_{2n-1}, V_{2n}) - k(U_{2n}, V_{2n-1}) - \delta.
\]
Here $k$ denotes a positive-definite kernel, and $\delta$ is a lower bound on the change magnitude that is assumed to be known a priori. Since in many problems a large volume of pre-change data is available, Flynn and Yoo (2019) suggest selecting the rejection threshold $a$ as the smallest value that results in the ARL (computed on historical pre-change data) exceeding a required value. We followed this approach for calibrating both the kernel-CuSum test and our proposed SCD scheme of Section 5.4, with a target ARL of 500 on all three datasets. The results are tabulated in Table 1; a sketch of the kernel-CUSUM recursion follows the table.

Table 1. Comparison of the performance of our BCS detector for detecting changes in homogeneity with the kernel-CUSUM method of Flynn and Yoo (2019). Both methods were calibrated to ensure an ARL of at least 500.

Dataset    | ARL (BCS) | ARL (K-CuSum) | Delay (BCS) | Delay (K-CuSum)
-----------|-----------|---------------|-------------|----------------
Higgs      | 547.85    | 569.20        | 251.45      | 440.63
Banknote   | 601.68    | 594.96        | 38.47       | 61.92
Occupancy  | 505.25    | 521.80        | 22.75       | 58.60

The results indicate that our BCS detector can achieve a smaller detection delay than the kernel-CuSum while maintaining the required ARL.
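For reference, a minimal sketch of the kernel-CUSUM recursion displayed above (our illustration; the Gaussian kernel, the values of $\delta$ and $a$, and the synthetic mean-shift stream are all assumptions made for the example, and the data-driven calibration of $a$ is omitted):

```python
import numpy as np

def gauss_k(a, b, bw=1.0):
    """Gaussian kernel between two feature vectors."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * bw ** 2))

def kernel_cusum(U, V, a, delta, bw=1.0):
    """Alarm time of the kernel-CUSUM statistic L_{2n} described above,
    consuming the paired streams (U_t, V_t) two observations at a time."""
    L = 0.0
    for n in range(1, len(U) // 2 + 1):
        i, j = 2 * n - 2, 2 * n - 1          # 0-based indices of the (2n-1, 2n) pair
        L += (gauss_k(U[i], U[j], bw) + gauss_k(V[i], V[j], bw)
              - gauss_k(U[i], V[j], bw) - gauss_k(U[j], V[i], bw) - delta)
        if L >= a:
            return 2 * n
    return None

rng = np.random.default_rng(5)
# U-stream fixed at N(0, I_2); the V-stream's mean shifts at time T = 800.
U = rng.normal(0.0, 1.0, size=(1200, 2))
V = np.concatenate([rng.normal(0.0, 1.0, size=(800, 2)),
                    rng.normal(1.0, 1.0, size=(400, 2))])
print(kernel_cusum(U, V, a=5.0, delta=0.02))   # typically alarms a while after 800
```

Before the change, each increment has mean $-\delta$ (since the population MMD is zero), so the statistic drifts downward; after the change, the increments have positive mean and the statistic climbs toward the threshold $a$.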