Bandit Quickest Changepoint Detection

Aditya Gopalan, Electrical Communication Engineering, Indian Institute of Science, Bengaluru 560012, aditya@iisc.ac.in
Braghadeesh Lakshminarayanan, Electrical Communication Engineering, Indian Institute of Science, Bengaluru 560012, braghadeesh94@gmail.com
Venkatesh Saligrama, Electrical and Computer Engineering, Boston University, Boston, MA 02215, srv@bu.edu

Abstract

Many industrial and security applications employ a suite of sensors for detecting abrupt changes in temporal behavior patterns. These abrupt changes typically manifest locally, rendering only a small subset of sensors informative. Continuous monitoring of every sensor can be expensive due to resource constraints, which motivates the bandit quickest changepoint detection problem: sensing actions (or sensors) are sequentially chosen, and only measurements corresponding to chosen actions are observed. We derive an information-theoretic lower bound on the detection delay for a general class of finitely parameterized probability distributions. We then propose a computationally efficient online sensing scheme that seamlessly balances the need for exploration of different sensing options with exploitation of querying informative actions. We derive expected delay bounds for the proposed scheme and show that these bounds match our information-theoretic lower bounds at low false alarm rates, establishing the optimality of the proposed method. Finally, we perform a number of experiments on synthetic and real datasets demonstrating the effectiveness of our proposed method.

1 Introduction

We propose a framework for bandit¹ quickest change detection (BQCD). Specifically, we are given a multi-dimensional online data stream that can be sequentially probed by means of actions belonging to an action set (for instance, only a few among all of the data stream components can be observed at any time, or a linear combination can be acquired). The online data stream can exhibit abrupt statistical changes, at any time, and only among a few arbitrary components. Our task is to sequentially probe the data stream by adaptively choosing valid actions based on past bandit² observations associated with past actions. Our objective is to detect a change, if one has happened, with minimum detection delay.

Example Scenarios. Surveillance systems [HC11] are equipped with a suite of sensors that can be switched and steered to focus attention on any target or location over a physical landscape (see Fig. 1) to detect abrupt changes at any location. On the other hand, sensor suites are resource limited, and only a limited subset, among all the locations, can be probed at any time. As such, we face a fundamental dilemma: focusing attention at any one location can compromise change detection at other locations.

¹By bandit we generally mean adaptive sampling with partial information.
²This is different from classical quickest change detection [TNB14], where all of the multiple streams are observed at each time, and the only adaptive decision is to declare whether or not a change has happened.
Although we describe a scenario in surveillance, the BQCD problem is general and arises in intrusion detection [Bas99], social networks [Vis+14], disease outbreaks and epidemics [QSC14; YSK15], manufacturing processes [Pur+19a; Din+06], energy-limited sensor networks [OGR10; ES10], and vital health monitoring [Vil+17]. In this paper, we derive a fundamental characterization of the delay performance of BQCD, and make explicit the fundamental tradeoff between delay and false alarm in the low-false-alarm regime. We specifically make the following contributions.

Information-theoretic Lower Bound. We prove a lower bound on the expected detection delay that any BQCD algorithm must suffer at a fixed false alarm rate. The lower bound exhibits a fundamental tradeoff between early stopping (false alarm) and detection delay (time to detect an abrupt change). It offers the key insight that the quickest way to detect a change, at any false alarm rate, is by playing the "most informative action" of an oracle who knows the post-change distribution a priori. This suggests that to quickly identify changepoints, we must direct our effort towards rapidly identifying informative actions. On a technical level, we develop a change-of-measure argument for nonstationary, adaptive change detection, which relates the divergence between random trajectories until stopping to the divergences of the observation laws under each action for any two problem instances.

Figure 1: Example BQCD scenario. There are N spatial locations that are monitored. The steerable sensor is capable of rapidly switching modes (grey and red beams as shown): it can collect an aggregated signal from multiple locations at a low signal-to-noise ratio (SNR), or focus attention narrowly on a single location at high SNR. Locations can exhibit abrupt changes, which manifest in the depicted time series (see top). Due to resource constraints, at any time, only a limited set of locations can be observed at high SNR. BQCD aims to learn a policy that adaptively steers and switches modes across locations to quickly detect abrupt changes, if any.

ϵ-Greedy Change Detector (ϵ-GCD). We propose ϵ-GCD, which, at a high level, uses a small amount of forced exploration to identify informative actions. The forced exploration allows for rapid convergence towards informative actions, and playing these actions minimizes detection delay. ϵ-GCD is based on the generalized maximum-likelihood/likelihood-ratio principle, which is utilized to estimate parametric changes. To prove detection delay bounds, we draw upon key structural insights into ϵ-GCD. We first interpret the scheme in terms of competing parallel queues, where each queue corresponds to a candidate post-change parameter, collects arrivals which are log-likelihood ratios of observations, and cannot go negative. The true parameter corresponds to the queue with the highest growth rate after a change, and the detection delay is the time required for it to dominate and become the longest queue. The dynamics of the queues can be related to nonstationary random walks with drifts. Using these insights, we prove that the expected detection delay of ϵ-GCD at low false alarm rates matches our information-theoretic lower bounds, thus establishing the optimality of our method.

Experiments. We perform numerical experiments with ϵ-GCD on synthetic and real datasets and show that, under variations of changepoints, anomalies, and action sets, we realize gains due to adaptive sensing.
2 Related Work

Classical quickest change detection, dating back to the pioneering works of Page [Pag54] and Lorden [Lor71], studies the problem of deciding when to stop (akin to declaring a change) while sequentially observing a data stream. In this context, the CUSUM algorithm [Pag54], in which a change is announced as soon as a likelihood ratio statistic exceeds a threshold, has been shown to enjoy optimal performance. Our work is motivated by constraints on data collection in multi-stream time-series data, necessitating the design of an adaptive sampling rule in addition to the stopping rule. Similar to our work, Zhang and Mei [ZM20] also propose a sequential sampling method for quickest change detection based on the well-known Shiryaev-Roberts-Pollak scheme (see, e.g., [BN93]). However, their theoretical analysis is rather limited in scope. In particular, they argue "in principle" that their method will control false alarms for sufficiently large thresholds, and will eventually (i.e., after infinite time) choose sensors where changes manifest, without any explicit characterization of detection delays. In contrast, we derive fundamental detection delay bounds under a false alarm rate constraint, thus fully characterizing the BQCD problem. Additionally, their framework is not general enough to handle correlated or structured anomalies, which is a significant aspect of both our theoretical and experimental investigations. In other related work, Gundersen et al. [Gun+21] propose to extend the framework of Bayesian Online Change Detection (BOCD) [AM07] to incorporate costs in making decisions for real-time data acquisition in multi-fidelity sensing scenarios. Different from this perspective, our approach is frequentist, and does not impose action-specific costs or distributional constraints on the underlying latent parameters or on the changepoints. Furthermore, we derive finite-time performance bounds, which is not a focus of that work.

3 Problem Setup

General Setup. We consider the following discrete-time model for the bandit quickest changepoint detection problem. At each decision round $t \in \mathbb{N} := \{1, 2, \ldots\}$, the parameter $\theta_t$ governs the distribution of the observation taken in that round. Let $\nu \in \mathbb{N}$ denote the round at which the change in behavior occurs, so that $\theta_t = \theta_0$ for $t < \nu$ and $\theta_t = \theta^\star \in \Theta^{(1)}$ for $t \geq \nu$. At each round $t$, a learner decides to either (1) stop and output that a change has taken place, denoted by the random variable $U_t = 1$, or (2) continue (denoted by $U_t = 0$) and play an action or sensing decision $A_t$ from an action set $\mathcal{A}$. Upon playing this action, the learner obtains an observation $X_t$, sampled independently of the past from a probability distribution $P_{\theta_t}[\cdot \mid A_t]$, which depends on both the current system parameter $\theta_t$ and the current action $A_t$. The stopping decision and sensing action ($U_t$ and $A_t$) are assumed to be chosen in a causal manner, i.e., depending on all past information $U_0, A_1, X_1, U_1, \ldots, A_{t-1}, X_{t-1}$, along with potentially independent internal randomness. The learner is aware, a priori, of the pre-change parameter $\theta_0$, the post-change parameter set $\Theta^{(1)}$, and the observation distribution structure, i.e., $\{P_\theta[\cdot \mid a]\}_{\theta,a}$ for all actions $a \in \mathcal{A}$. Crucially, the change time $\nu$ and the specific post-change parameter $\theta^\star$ are not known in advance and must be learnt in order to perform well.
The stopping time of the learner is defined to be $\tau = \min\{t \geq 0 : U_t = 1\}$, and the main performance metric that we are interested in is the detection delay, which is the random variable $(\tau - \nu)^+ = \max\{\tau - \nu, 0\}$. Specifically, we are interested in designing sampling and stopping rules so that (a) when no change occurs ($\nu = \infty$), $\tau$ should be at least a specified time duration with high probability, which is a measure of false alarm rate, and (b) when a change occurs ($\nu < \infty$), the expected detection delay should be small, implying quick detection of the change.

Illustrative Example. For illustration, we elaborate on the scenario depicted in Fig. 1. We consider $N$ locations as shown, and the observation at location $j$ at time $t$ follows
$$Y_j(t) = \theta \, \mathbb{1}[t \geq \nu] \, \mathbb{1}[j \in S] + W_j(t), \qquad t \in \mathbb{Z}_+, \; j \in [N],$$
where $\{W_j(t)\}_{j,t}$ is a collection of standard normal random variables and $\mathbb{1}[\cdot]$ denotes the indicator function. This is a temporal extension of the model in [QSC14]. The model with parameter $\theta \neq 0$ signifies an elevated mean after time $\nu \in \mathbb{Z}_+$ at the locations $S \subseteq [N]$. The learner is agnostic to $\theta$, the time $\nu$, as well as which set $S$ among a family of known subsets $\mathcal{S} \subseteq \{0,1\}^N$ exhibits elevated activity. The learner at time $t$ can probe locations by means of sensing actions $A_t \in \mathcal{A} \subseteq \{0,1\}^N$, and observes a scalar measurement
$$X_t \sim P_{\theta_t}(\cdot \mid A_t) \equiv \mathcal{N}\!\left( \theta_t \, \frac{|A_t \cap S|}{\sqrt{|A_t|}}, \; 1 \right),$$
with $\theta_t = 0$ for $t < \nu$, and $\theta_t = \theta \in \Theta^{(1)} = \mathbb{R} \setminus \{0\}$ for $t \geq \nu$. The normalization factor $\sqrt{|A_t|}$ expresses the fact that the same SNR is maintained across different probes $A_t$ (see [QSC14]).

General (Structured) Scenarios. Depending on the application, locations may be endowed with a neighborhood network structure, and a change may be observable in a $K$-neighborhood of the target location; in such cases, the sets in $\mathcal{S}$ are constrained to be $K$-neighborly sets. Similarly, to allow for diverse probing modes, we could allow the action set to include actions that aggregate signal across multiple locations.

Notation. For a post-change parameter $\theta \in \Theta^{(1)}$, we use $a^\star(\theta)$ to denote the most informative action for $\theta$ w.r.t. $\theta_0$, i.e., $a^\star(\theta) \in \arg\max_{a \in \mathcal{A}} D(P_\theta[\cdot|a] \,\|\, P_{\theta_0}[\cdot|a])$, where $D(\cdot\|\cdot)$ stands for the Kullback-Leibler divergence. To lighten notation, in the sequel we often abbreviate the distribution $P_\theta[\cdot|a]$ to $\theta(a)$, and thus $D(P_\theta[\cdot|a] \,\|\, P_{\theta_0}[\cdot|a])$ to $D(\theta(a)\|\theta_0(a))$. The simultaneous subscript-superscript notation $z_i^j$ is used to represent the sequence $(z_i, z_{i+1}, \ldots, z_j)$. For any $\theta \in \Theta^{(1)}$ and $n \in \mathbb{N}$, we use $P_{(n,\theta)}$ to denote the probability measure on the process $U_0, A_1, X_1, U_1, A_2, X_2, \ldots$ induced by the learner's decisions, when the pre-change parameter $\theta_0$ changes to $\theta$ at time $n$. We also denote by $P_{(\infty)}$ the corresponding probability measure under the no-change situation ($\nu = \infty$). We define the filtration $(\mathcal{F}_t)_{t \geq 0}$ by letting $\mathcal{F}_t$ be the $\sigma$-algebra generated by all the past random variables up to time $t$, i.e., $\mathcal{F}_t = \sigma(\{(A_s, X_s) : 0 \leq s \leq t\})$. An (admissible) change detector is an algorithm for which the stopping decision $U_t \in \{0,1\}$ and action $A_t \in \mathcal{A}$ are $\mathcal{F}_{t-1}$-measurable at all rounds $t$.

Objective Specification. Our BQCD framework involves a false alarm rate constraint specified by a tuple $(m, \alpha)$, where $m \in \mathbb{Z}_+$ denotes an a priori fixed time such that the probability of an admissible change detector stopping before time $m$ under the null hypothesis (no change) is at most $\alpha \in (0,1)$, namely, $P_{(\infty)}[\tau < m] \leq \alpha$.
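To make the observation model concrete, here is a minimal Python sketch of a single probe under the Gaussian model above; the function and variable names are ours, for illustration only.

```python
import numpy as np

def sample_observation(action, t, nu, theta, S, rng):
    """Draw one scalar measurement X_t ~ N(theta_t * |A ∩ S| / sqrt(|A|), 1).

    action : 0-1 numpy array of length N (the probe A_t)
    nu     : changepoint; theta_t = 0 before nu, theta after
    S      : 0-1 numpy array of length N (true anomalous support)
    """
    theta_t = theta if t >= nu else 0.0
    overlap = float(np.dot(action, S))          # |A_t ∩ S|
    size = float(action.sum())                  # |A_t|
    mean = theta_t * overlap / np.sqrt(size)    # SNR-normalized mean
    return mean + rng.standard_normal()

# Example: N = 5 locations, anomaly at location 2 with magnitude theta = 1.0
rng = np.random.default_rng(0)
N, nu, theta = 5, 40, 1.0
S = np.zeros(N); S[2] = 1
probe = np.zeros(N); probe[2] = 1               # a "pointy" action probing node 2
print(sample_observation(probe, t=10, nu=nu, theta=theta, S=S, rng=rng))  # pre-change
print(sample_observation(probe, t=50, nu=nu, theta=theta, S=S, rng=rng))  # post-change
```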
This kind of false alarm duration specification for the stopping time ($\tau \geq m$ with probability at least $1-\alpha$) has been commonly employed in the literature on classical (nonadaptive) change detection (see, e.g., [Lai98]) to ensure that the comparative performance rules out early-triggering algorithms. Let $\mathcal{D}_{m,\alpha}$ be the class of admissible change detectors with the $(m, \alpha)$ property stated above. As such, our goal is to find detectors in $\mathcal{D}_{m,\alpha}$ that minimize the detection delay when a change indeed occurs.

Informal Statement of Our Results. Our main result is a sharp optimal characterization of the expected detection delay in the limit of vanishing false alarms ($\alpha \to 0$, $m \to \infty$). The result, stated below, involves the new ϵ-Greedy Change Detector (ϵ-GCD) algorithm in which, in each round $t$, we choose an action uniformly at random from the set $\mathcal{A}$ with probability $\epsilon$, or choose an action that is most informative for an estimated post-change parameter with probability $1-\epsilon$.

Theorem 1 (informal). Consider the limiting situation as $\alpha \to 0$ and $m = \omega\!\left(\log \frac{1}{\alpha}\right)$. Then, any admissible change detector in $\mathcal{D}_{m,\alpha}$ must suffer
$$\liminf_{\alpha \to 0} \frac{E_{(\nu,\theta^\star)}\!\left[(\tau - \nu)^+\right]}{\log \frac{1}{\alpha}} \;\geq\; \frac{1}{\max_a D(\theta^\star(a)\|\theta_0(a))}.$$
On the other hand, the ϵ-GCD algorithm, suitably tuned to be in $\mathcal{D}_{m,\alpha}$, achieves
$$\limsup_{\alpha \to 0} \frac{E_{(\nu,\theta^\star)}\!\left[(\tau - \nu)^+\right]}{\log \frac{1}{\alpha}} \;\leq\; \frac{8}{1-\epsilon} \cdot \frac{1}{\max_a D(\theta^\star(a)\|\theta_0(a))}.$$

4 Fundamental lower bounds on bandit change detection performance

Before embarking upon the design of bandit changepoint algorithms, it is worth understanding the limits of bandit change detection performance, owing to the stochastic and partial nature of the problem's information structure. We expose in this section a universal lower bound on the detection delay of any bandit changepoint algorithm meeting a given false alarm constraint.

Theorem 2. Let $0 < \alpha \leq \frac{1}{10}$ and $m \geq 1$. For any bandit changepoint algorithm satisfying $P_{(\infty)}[\tau < m] \leq \alpha$, we have
$$E_{(1,\theta^\star)}[\tau] \;\geq\; \min\left\{ \frac{\frac{1}{20}\log\frac{1}{\alpha}}{\max_{a \in \mathcal{A}} D(\theta^\star(a)\|\theta_0(a))}, \; \frac{m}{2} \right\}.$$

The result implies that in the small false alarm rate ($\alpha$) regime with large $m = \omega(\log(1/\alpha))$, e.g., $m = \log^2(1/\alpha)$, any algorithm meeting the false alarm rate property $P_{(\infty)}[\tau \geq m] \geq 1-\alpha$ must suffer a detection delay of at least $\Omega\!\left(\frac{\log\frac{1}{\alpha}}{\max_{a\in\mathcal{A}} D(\theta^\star(a)\|\theta_0(a))}\right)$ when a change occurs at time 1 (a similar, anytime lower bound holds for the detection delay at any $\nu \in \mathbb{N}$; see appendix).

False Alarm-Detection Delay Tradeoff. Theorem 2 shows that a basic tradeoff exists between the false alarm rate, or early stopping rate in case of no change, on one hand, and the detection delay after a true change occurs, on the other. Specifically, it is impossible to stop "too early", i.e., before time $\approx \frac{\log(1/\alpha)}{\max_{a\in\mathcal{A}} D(\theta^\star(a)\|\theta_0(a))}$, after a true change if one wishes to stop "too late" under no change, i.e., $P_{(\infty)}[\tau \geq m] \geq 1-\alpha$. We also note that when there is no adaptive sampling of actions (i.e., $|\mathcal{A}| = 1$), the lower bound reduces to the form of a standard lower bound for the classical changepoint detection problem [Lai98].

Information Structure. This is perhaps the most valuable implication of the lower bound. The quantity $D(\theta^\star(a)\|\theta_0(a))$ in the denominator can be interpreted as the amount of information that playing an action $a$ provides (on average) in order to detect a change from $\theta_0$ to $\theta^\star$. Consequently, the quickest way to detect a change is by playing the most informative action, $\arg\max_{a\in\mathcal{A}} D(\theta^\star(a)\|\theta_0(a))$. This can be viewed as the sampling strategy of an oracle who knows the post-change parameter $\theta^\star$ in advance. Unfortunately, this information is not known a priori to a causal algorithm. However, we will see that it can in fact be learnt along the trajectory, and the lower bound can be attained order-wise by a suitable learning algorithm that we propose.
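As an illustration of this quantity, the following sketch (names are ours) computes per-action KL divergences and the resulting most informative action for the unit-variance Gaussian model of Section 3, using $D(\mathcal{N}(\mu_1,1)\|\mathcal{N}(\mu_2,1)) = (\mu_1-\mu_2)^2/2$.

```python
import numpy as np

def gaussian_kl(mu1, mu2, var=1.0):
    # KL( N(mu1, var) || N(mu2, var) ) = (mu1 - mu2)^2 / (2 var)
    return (mu1 - mu2) ** 2 / (2.0 * var)

def most_informative_action(theta, S, actions):
    """a*(theta) = argmax_a D(theta(a) || theta0(a)) for the unit-variance
    Gaussian model with mean theta * |a ∩ S| / sqrt(|a|) (theta0 = 0)."""
    def info(a):
        mean_post = theta * float(np.dot(a, S)) / np.sqrt(a.sum())
        return gaussian_kl(mean_post, 0.0)
    return max(actions, key=info)

# Pointy actions on N = 4 nodes; anomaly supported on node 1
N = 4
actions = [np.eye(N)[i] for i in range(N)]
S = np.eye(N)[1]
a_star = most_informative_action(theta=1.0, S=S, actions=actions)
print(a_star)  # picks the action probing node 1
```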
Proof Sketch for Theorem 2. The main idea is a change-of-measure argument, adapted to our nonstationary changepoint setting from the literature on sample complexity bounds for bandit best arm identification [GK16]. On one hand, if the algorithm under consideration is very "lazy" to begin with, i.e., $P_{(1,\theta^\star)}[\tau \geq m]$ is at least a constant, say $1/2$, then we immediately get $E_{(1,\theta^\star)}[\tau] \geq \frac{1}{2} m$. On the other hand, if the algorithm is "active", i.e., $P_{(1,\theta^\star)}[\tau < m] \geq 1/2$, then the KL divergence between the laws of the indicator random variable $\mathbb{1}\{\tau < m\}$ under the instances $(1, \theta^\star)$ and $(\infty)$ is large (at least a constant times $\log(1/\alpha)$), owing to the hypothesis that $P_{(\infty)}[\tau < m]$ is small (at most $\alpha$). But by the data processing inequality, this divergence is bounded above by the divergence between the distributions of the entire trajectory $U_0, A_1, X_1, U_1, A_2, X_2, \ldots$ under the two instances, which can be seen to equal the quantity $\sum_{a \in \mathcal{A}} D(\theta^\star(a)\|\theta_0(a)) \, E_{(1,\theta^\star)}[N_\tau(a)]$, which is in turn bounded above by $E_{(1,\theta^\star)}[\tau] \cdot \max_{a\in\mathcal{A}} D(\theta^\star(a)\|\theta_0(a))$. See the appendix for a complete proof.

5 ϵ-GCD Algorithm

We describe the ϵ-GCD adaptive sensing algorithm for the bandit quickest change detection problem. The lower bound in Section 4 suggests that it is beneficial to infer the post-change parameter, so that playing the most informative action for it can yield the best possible detection delay performance. This is the key principle underlying the design of the ϵ-GCD algorithm (Algorithm 1). At a high level, ϵ-GCD uses a small amount of forced exploration along with greedy exploitation to play sensing actions. Specifically, it computes, at each round $t$, a maximum likelihood estimate (MLE) of the post-change parameter based on the generalized likelihood ratio test (GLRT) principle. It then plays either a randomly chosen action, if the current slot is an exploration slot, or a "greedy", i.e., most informative, action for the estimated post-change parameter, if it is an exploitation slot.

The MLE of the post-change parameter, in round $t$, admits an interpretation as the longest queue corresponding to some parameter in $\Theta^{(1)}$. To see this, notice that the MLE for the pair $(\nu, \theta^\star)$, given all previous data in exploration rounds, can be written as
$$\arg\max_{\theta \in \Theta^{(1)},\, 1 \leq v \leq t} \; \prod_{\ell=1}^{v-1} f_{\theta_0}(X_\ell|A_\ell)^{E_\ell} \prod_{\ell=v}^{t-1} f_\theta(X_\ell|A_\ell)^{E_\ell} \;=\; \arg\max_{\theta \in \Theta^{(1)},\, 1 \leq v \leq t} \; \sum_{\ell=v}^{t-1} E_\ell \log \frac{f_\theta(X_\ell|A_\ell)}{f_{\theta_0}(X_\ell|A_\ell)},$$
with $f_\theta(\cdot|a)$ taken to be the probability density or mass function of the distribution $P_\theta(\cdot|a)$ (assuming one exists). Observe now that, for each candidate post-change parameter $\theta$, the inner maximum over $v$,
$$Q^{(1)}_t(\theta) := \max_{1 \leq v \leq t} \; \sum_{\ell=v}^{t-1} E_\ell \log \frac{f_\theta(X_\ell|A_\ell)}{f_{\theta_0}(X_\ell|A_\ell)},$$
evolves as a queue³ with arriving work $E_\ell \log \frac{f_\theta(X_\ell|A_\ell)}{f_{\theta_0}(X_\ell|A_\ell)}$ at time slot $\ell$; in other words,
$$Q^{(1)}_{t+1}(\theta) = \left( Q^{(1)}_t(\theta) + E_t \log \frac{f_\theta(X_t|A_t)}{f_{\theta_0}(X_t|A_t)} \right)^+.$$

³This is also known as the Lindley recursion equation in queueing theory.

We define the algorithm using general processing functions $g_\theta$ in place of the log-likelihood ratios $\log \frac{f_\theta}{f_{\theta_0}}$ above.
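Before turning to the choice of $g_\theta$, here is a minimal sketch of this Lindley-type queue update for a single candidate parameter, assuming the unit-variance Gaussian model (all names are ours).

```python
import numpy as np

def loglik_ratio(x, mean_theta, mean_theta0=0.0, var=1.0):
    """log f_theta(x|a) - log f_theta0(x|a) for equal-variance Gaussians."""
    return ((x - mean_theta0) ** 2 - (x - mean_theta) ** 2) / (2.0 * var)

def lindley_update(q, x, e, mean_theta):
    """Q_{t+1}(theta) = max(Q_t(theta) + E_t * loglik_ratio, 0)."""
    return max(q + e * loglik_ratio(x, mean_theta), 0.0)

# Before a change (x ~ N(0,1)) the arrivals have negative mean and the queue
# hovers near 0; after a change (x ~ N(1,1)) it grows roughly linearly.
rng = np.random.default_rng(1)
q = 0.0
for t in range(200):
    x = rng.normal(0.0 if t < 100 else 1.0, 1.0)
    q = lindley_update(q, x, e=1, mean_theta=1.0)
print(q)  # typically near (200 - 100) * D = 100 * 0.5 = 50 after the change
```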
Broadly speaking, the functions $g_\theta$ should ideally be chosen so that (a) $g_\theta(X_\ell, A_\ell)$ is negative in expectation before the change time ($\ell < \nu$), and (b) $g_\theta(X_\ell, A_\ell)$ is large and positive in expectation for $\theta = \theta^\star$, the true change parameter, after the change ($\ell \geq \nu$). The stopping rule that ϵ-GCD uses is based on a generalized likelihood ratio (GLR)-type statistic: it is the largest of an ensemble of evolving exploitation-data queues $Q^{(2)}_t(\theta)$, which mirror the definition of $Q^{(1)}_t(\theta)$ but with $1 - E_t$ in place of $E_t$.

5.1 Theoretical guarantees for the ϵ-GCD algorithm

In this section, we present and discuss theoretical guarantees on the false alarm rate and detection delay performance of the ϵ-GCD algorithm of Section 5.

Algorithm 1 ϵ-GCD
1: Input: $\epsilon \in [0,1]$, $\theta_0$, $\Theta^{(1)}$, $\beta \geq 1$, exploration distribution $\pi$, observation functions $g_\theta(x,a)$ for all $(\theta, x, a) \in \Theta^{(1)} \times \mathcal{X} \times \mathcal{A}$.
2: Init: $Q^{(1)}_1(\theta) \leftarrow 0$, $Q^{(2)}_1(\theta) \leftarrow 0$ for all $\theta \in \Theta^{(1)}$ {CUSUM statistics for exploration and exploitation per $\theta$}
3: for round $t = 1, 2, 3, \ldots$ do
4:   if $\max_{\theta \in \Theta^{(1)}} Q^{(2)}_t(\theta) \geq \log \beta$ then
5:     break {Stop sampling and exit}
6:   end if
7:   Sample $E_t \sim \mathrm{Ber}(\epsilon)$ independently
8:   if $E_t = 1$ then
9:     Play action $A_t \sim \pi$ independently {Explore}
10:    Get observation $X_t$
11:    Set, for all $\theta \in \Theta^{(1)}$: $Q^{(1)}_{t+1}(\theta) \leftarrow \left(Q^{(1)}_t(\theta) + g_\theta(X_t, A_t)\right)^+$ {Update exploration CUSUM statistics}
12:  else
13:    Compute $\hat{\theta}_t = \arg\max_{\theta \in \Theta^{(1)}} Q^{(1)}_t(\theta)$ {Most likely post-change parameter based on exploration data}
14:    Play action $A_t = \arg\max_{a \in \mathcal{A}} D\big(\hat{\theta}_t(a) \,\|\, \theta_0(a)\big)$ {Exploit}
15:    Get observation $X_t$
16:    Set, for all $\theta \in \Theta^{(1)}$: $Q^{(2)}_{t+1}(\theta) \leftarrow \left(Q^{(2)}_t(\theta) + g_\theta(X_t, A_t)\right)^+$ {Update exploitation CUSUM statistics}
17:  end if
18: end for

We make the following assumptions on the parameter space and observation distributions in order to derive performance bounds for the algorithm.

Assumption 1. The post-change parameter set $\Theta^{(1)}$ is finite, i.e., $|\Theta^{(1)}| < \infty$.

This assumption is made primarily for ease of analysis, whereas the algorithm is well defined even for arbitrary parameter spaces. Specifically, it allows for easy control of the fluctuations of an ensemble of (drifting) random walks via a union bound over the parameter set. While we believe that this can be relaxed to handle general parameter spaces via appropriate netting or chaining arguments, this chiefly technical task is left to future investigation.

Assumption 2. Every post-change parameter is detectable by some action, i.e., $\forall \theta \in \Theta^{(1)} \; \exists a \in \mathcal{A} : D(\theta(a)\|\theta_0(a)) > 0$. Moreover, any two post-change parameters are separable by some action, i.e., $\forall \theta, \theta' \in \Theta^{(1)} \; \exists a \in \mathcal{A} : D(\theta(a)\|\theta'(a)) > 0$.

This is a basic identifiability requirement of the setting, without which some parameter changes could be completely undetectable. Put differently, one can only hope to detect changes that the sensing set can tease apart.

Assumption 3 (Bounded marginal KL divergences). There is a constant $D_{\max}$ satisfying $D(\theta_1(a)\|\theta_2(a)) \leq D_{\max}$ for each $a \in \mathcal{A}$ and $\theta_1, \theta_2 \in \Theta^{(1)}$.

This assumption is easily met, for instance, if all log-likelihood ratios are bounded, or if the observations are modelled as Gaussians, in which case log-likelihood ratios are linear functions of the observation.

Assumption 4. Every observation probability distribution $P_\theta[\cdot|a]$, for $a \in \mathcal{A}$ and $\theta \in \Theta^{(1)} \cup \{\theta_0\}$, has either a density⁴ or a mass function, denoted by $f_\theta(\cdot|a)$. Moreover, there exists $r > 0$ such that for any $\theta, \theta', \theta'' \in \Theta^{(1)} \cup \{\theta_0\}$ and $a \in \mathcal{A}$, the log-likelihood ratio $\log \frac{f_{\theta'}(X|a)}{f_{\theta''}(X|a)}$ is $r$-subgaussian⁵ under $X \sim P_\theta[\cdot|a]$.

⁴With respect to a standard reference measure, e.g., the Lebesgue measure.
⁵A random variable $X$ is $r$-subgaussian under $X \sim P$ if $\forall \lambda \in \mathbb{R}$: $E[e^{\lambda(X - E[X])}] \leq e^{\lambda^2 r/2}$.

This assumption is common in the statistical inference literature; we use it to control the fluctuations of the exploration and exploitation CUSUM statistics in the algorithm via standard subgaussian concentration tools.
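The following is a compact, runnable sketch of Algorithm 1, specialized to pointy actions and candidate post-change parameters $\{\delta e_i\}$ under the Gaussian model, with $g_\theta$ the log-likelihood ratio. All function names and constants are ours, chosen for illustration; for these candidates, playing the coordinate of the longest exploration queue implements lines 13-14.

```python
import numpy as np

def eps_gcd(stream, d, delta, eps, log_beta, rng):
    """Sketch of eps-GCD for d pointy actions and candidates {delta * e_i}.
    `stream(t, a)` returns the observation X_t for action index a.
    Returns the stopping time tau (or None if the horizon is exhausted)."""
    q1 = np.zeros(d)  # exploration queues Q^(1)_t(theta_i)
    q2 = np.zeros(d)  # exploitation queues Q^(2)_t(theta_i)

    def g(x, a):
        # Log-likelihood ratios for all candidates theta_i vs theta_0 = 0:
        # nonzero only for the candidate matching the probed coordinate.
        out = np.zeros(d)
        out[a] = delta * x - delta ** 2 / 2.0  # log N(x;delta,1)/N(x;0,1)
        return out

    for t in range(1, 10 ** 6):
        if q2.max() >= log_beta:
            return t                      # stop: declare a change
        if rng.random() < eps:            # exploration slot (E_t = 1)
            a = int(rng.integers(d))      # A_t ~ uniform pi
            q1 = np.maximum(q1 + g(stream(t, a), a), 0.0)
        else:                             # exploitation slot (E_t = 0)
            a = int(np.argmax(q1))        # most informative action for the MLE
            q2 = np.maximum(q2 + g(stream(t, a), a), 0.0)
    return None

# Change at nu = 200 in coordinate 3 out of d = 10, magnitude delta = 1
rng = np.random.default_rng(2)
nu, d, delta = 200, 10, 1.0
stream = lambda t, a: rng.normal(delta if (t >= nu and a == 3) else 0.0, 1.0)
print(eps_gcd(stream, d, delta, eps=0.2, log_beta=np.log(1e4), rng=rng))
```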
A concrete example of a setting that meets Assumptions 1-4 is the linear observation model described in Section 3 and Fig. 1. We are now in a position to state our key theoretical result.

Theorem 3 (False alarm and detection delay for a general changepoint). Under Assumptions 1-4, the following conclusions hold for ϵ-GCD (Algorithm 1) run with the log-likelihood observation functions $g_\theta(x,a) \equiv \log \frac{f_\theta(x|a)}{f_{\theta_0}(x|a)}$.

1. (Time to false alarm) Let $\alpha \in (0,1)$ and $m \in \mathbb{N}$. If the stopping threshold $\beta$ is set as $\beta \geq \frac{m\,|\Theta^{(1)}|}{\alpha}$, then the stopping time satisfies $P_{(\infty)}[\tau < m] \leq \alpha$.

2. (Detection delay) For a change from $\theta_0$ to $\theta^\star \in \Theta^{(1)}$ occurring at time $\nu \in \mathbb{N}$,
$$E_{(\nu,\theta^\star)}\!\left[(\tau - \nu)^+\right] \;\leq\; \frac{8 \log \beta}{(1-\epsilon)\, \max_a D(\theta^\star(a)\|\theta_0(a))} \;+\; O\!\left(\mathrm{poly}\!\left(\tfrac{1}{\epsilon},\, E_{(\infty)}\big[\mathbf{Q}^{(1)}_\nu\big],\, \mathbf{P}\right)\right), \tag{1}$$
provided $\pi(a^\star(\theta^\star)) > 0$, with $\mathbf{Q}^{(1)}_\nu := \big(Q^{(1)}_\nu(\theta)\big)_{\theta \in \Theta^{(1)}}$ denoting the explore queue statistics for all parameters, and $\mathbf{P} \equiv (P_\theta[\cdot|a])_{\theta \in \{\theta_0\} \cup \Theta^{(1)},\, a \in \mathcal{A}}$ all the observation distributions.

Remark. We note that $\beta$ is an internal, algorithmic parameter, whose setting is spelt out so that the $(m, \alpha)$ false alarm constraint can be met. Thus, the first part of Theorem 3 is a correctness statement.⁶

⁶This is, at a high level, akin to specifying a PAC algorithm's correctness requirement on one hand and measuring its sample complexity on the other, i.e., insisting that its returned answer be ϵ-correct with probability at least $1-\delta$, and then analyzing how long it takes to learn and return an answer. The former corresponds to the false alarm constraint, and the latter to measuring detection delay here.

Interpretation of the Detection Delay Bound. The first term in the detection delay bound (1) can be interpreted as the time for an oracle fixed-action strategy (e.g., CUSUM) that always plays the most informative action for $\theta^\star$ to stop. The second term, on the other hand, is a bound on the time taken by the algorithm to learn to play the most informative action for $\theta^\star$ (details follow in the proof sketch). We omit the precise dependence of the second term on the problem structure here for brevity, but detail it explicitly in the appendix. Specifically, for small forced exploration rates $\epsilon$, the second term depends on the information geometry of the problem approximately as $\frac{1}{\epsilon^4} \sum_{\theta \in \Theta^{(1)} : a^\star(\theta) \neq a^\star(\theta^\star)} \frac{1}{\Delta_\theta^4}$. Here, for each candidate post-change parameter $\theta \in \Theta^{(1)}$, its gap $\Delta_\theta$ is a measure of how difficult it is for the algorithm to eliminate $\theta$ as an estimate of the true post-change parameter $\theta^\star$ during this parameter-learning phase. We formally define it as
$$\Delta_\theta := \bar{D}_{\theta^\star,\theta_0} - \frac{1}{2}\left( \bar{D}_{\theta^\star,\theta_0} + \big(\bar{D}_{\theta^\star,\theta_0} - \bar{D}_{\theta^\star,\theta}\big)^+ \right),$$
where $\bar{D}_{\theta',\theta} := \sum_{a \in \mathcal{A}} \pi(a)\, D(\theta'(a)\|\theta(a))$ is the average information divergence between parameters $\theta', \theta$ offered by playing from the exploration distribution $\pi$. With finer analysis, the dependence on the gaps and on $\epsilon$ could be improved; we omit this for simplicity.

We now turn to the (additive and linear; see the appendix) dependence of the second term on $E_{(\infty)}\big[\mathbf{Q}^{(1)}_\nu\big]$. The insight as to why this arises is as follows. The learning phase for $\theta^\star$, after a change takes place at time $\nu$, can be seen as a race between the competing log-likelihood queues of all the parameters vying to become the MLE $\hat{\theta}_t$. At the change time, each non-ground-truth queue $Q^{(1)}_\nu(\theta)$ is, on average, at level $E_{(\infty)}\big[Q^{(1)}_\nu(\theta)\big]$, representing its initial advantage over the ground-truth queue $Q^{(1)}_\nu(\theta^\star)$, which in the worst case could be at level 0. Thus, the difference $Q^{(1)}_\nu(\theta) - Q^{(1)}_\nu(\theta^\star) \leq Q^{(1)}_\nu(\theta)$ is the extra amount of work that the $\theta^\star$ queue must do to overcome the $\theta$ queue. A special case of the result is $\nu = 1$, in which case all queues start out at 0.
For general $\nu$, one can establish that $E_{(\infty)}\big[Q^{(1)}_\nu(\theta)\big]$ must be finite by noticing that each queue is non-negative and accumulates arrivals with negative mean before the change, and then applying standard queueing stability arguments. This also indicates that the detection delay bound is roughly invariant to the location of the changepoint $\nu$.

Optimality at low false alarms. The result implies that in the small time-to-false-alarm regime, i.e., $\alpha \to 0$, $m = \omega(\log(1/\alpha))$ and $\log(m) = o(\log(1/\alpha))$, the detection delay when $\beta$ is set as above is dominated by the first term:
$$E_{(\nu,\theta^\star)}\!\left[(\tau-\nu)^+\right] = O\!\left(\frac{\log \beta}{(1-\epsilon)\, \max_a D(\theta^\star(a)\|\theta_0(a))}\right).$$
This matches, order-wise, the universal lower bound of Theorem 2 up to the algorithm-dependent multiplicative factor $1/(1-\epsilon)$, which may be interpreted as a penalty for the forced exploration mechanism.

Benefits of Adaptivity. Let us consider a concrete setting to elucidate the detection delay bound. In this example, there are $d$ sensing actions, represented by the canonical basis vectors in $\mathbb{R}^d$. The pre-change parameter is $0 \in \mathbb{R}^d$, and there are $d$ candidate post-change parameters, each of which is a canonical basis vector $e_i$, $i \in [d]$, multiplied by a constant $\delta$. The observation from applying action $a$ at system parameter $\theta$ is Gaussian distributed with mean $\langle a, \theta \rangle$ and variance 1. Thus, the aim is to detect a (sparse) change of magnitude $\delta$ in one of the $d$ coordinates of $\theta_0 = 0$, when at any time only a single coordinate can be noisily sensed. Suppose $\theta^\star = \delta e_1$ without loss of generality, for which the most informative action (in fact, the only informative one) is $a^\star(\theta^\star) = e_1$. The first term in the detection delay given by Theorem 3 is then $O\!\left(\frac{\log(1/\alpha)}{(1-\epsilon)\delta^2}\right)$. This is a factor of $d$ smaller than the $\Omega\!\left(\frac{d \log(1/\alpha)}{(1-\epsilon)\delta^2}\right)$ that a standard CUSUM rule with non-adaptive uniform sampling over coordinates would achieve; this is seen by applying the lower bound (Theorem 2) to the case of a single (trivial) action, whose divergence in the denominator reduces to the average divergence across all coordinates, $O(\delta^2/d)$. For estimating the second term, we calculate that for each $\theta = \delta e_i$ with $i \neq 1$,
$$\Delta_\theta = \bar{D}_{\theta^\star,\theta_0} - \frac{\bar{D}_{\theta^\star,\theta_0} + \big(\bar{D}_{\theta^\star,\theta_0} - \bar{D}_{\theta^\star,\theta}\big)^+}{2} = \frac{\delta^2}{2d} - \frac{\delta^2}{4d} = \frac{\delta^2}{4d},$$
so the second term, representing the time to learn the optimal sensing action, scales as $O\!\left(\frac{(d-1)d^4}{\epsilon^4\delta^8}\right) \subseteq O\!\left(\frac{d^5}{\epsilon^4 \delta^8}\right)$. Although we have not optimized the dependence on $d$, $\epsilon$, and $\delta$, the overall bound still gives an idea of how the active sensing problem gets harder as the dimension $d$ grows or as the minimum change magnitude $\delta$ shrinks, making it akin to finding a "needle in a haystack".
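A quick numerical check of these expressions (our own script; the constants are illustrative, not from the paper):

```python
import numpy as np

def avg_divergence(theta1, theta2, d):
    """D-bar_{theta1,theta2} = sum_a pi(a) * KL, under uniform pi over e_1..e_d,
    for unit-variance Gaussians with means <a, theta>."""
    return sum(((theta1[a] - theta2[a]) ** 2 / 2.0) / d for a in range(d))

d, delta, eps = 10, 1.0, 0.2
theta_star = delta * np.eye(d)[0]            # true post-change parameter
theta0 = np.zeros(d)

for i in [1, 2]:                             # two competing candidates
    theta = delta * np.eye(d)[i]
    D_star0 = avg_divergence(theta_star, theta0, d)      # delta^2 / (2d)
    D_star_theta = avg_divergence(theta_star, theta, d)  # delta^2 / d
    gap = D_star0 - (D_star0 + max(D_star0 - D_star_theta, 0.0)) / 2.0
    print(i, gap, delta ** 2 / (4 * d))      # both equal delta^2/(4d) = 0.025

# Order of magnitude of the two delay terms for alpha = 1e-3:
log_beta = np.log(1.0 / 1e-3)
first_term = 8 * log_beta / ((1 - eps) * delta ** 2 / 2.0)        # oracle CUSUM time
second_term = (d - 1) / (eps ** 4 * (delta ** 2 / (4 * d)) ** 4)  # learning time
print(first_term, second_term)
```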
Sketch of the Proof of Theorem 3. We now lay down the key arguments involved in the proof of Theorem 3, bounding the false alarm time and detection delay of the ϵ-GCD algorithm.

1. Bounding the probability of early stopping under no change. The stopping time $\tau$ of ϵ-GCD, by line 4 of Algorithm 1, is equivalent to the first instant $t$ at which the worst-case (over $\Theta^{(1)}$) CUSUM statistic computed over exploitation data, i.e.,
$$\max_{\theta \in \Theta^{(1)}} \; \max_{1 \leq s \leq t} \; \prod_{\ell=s}^{t-1} \left( \frac{f_\theta(X_\ell|A_\ell)}{f_{\theta_0}(X_\ell|A_\ell)} \right)^{1 - E_\ell},$$
exceeds the level $\beta \geq 1$. Under no change, each observation $X_\ell$ is distributed as $P_{\theta_0}[\cdot|A_\ell]$ given $A_\ell$, and the product $\prod_{\ell=s}^{t-1} \left( \frac{f_\theta(X_\ell|A_\ell)}{f_{\theta_0}(X_\ell|A_\ell)} \right)^{1-E_\ell}$, over $t = s, s+1, s+2, \ldots$, behaves as a (standard) likelihood ratio martingale with unit expectation under the appropriate filtration. Hence, the chance that the largest among this ensemble of martingales, one for each $\theta \in \Theta^{(1)}$ and starting time $s \in [m]$, rises above $\beta$ before time $m$ is bounded using a union bound together with Doob's maximal inequality for each individual martingale, yielding the probability bound $\frac{m\,|\Theta^{(1)}|}{\beta} \leq \alpha$.

2. Control of the detection delay. Suppose that the change takes place at time $\nu = 1$, to the post-change parameter $\theta^\star \in \Theta^{(1)}$. The proof strategy is to show, with high probability, that $\tau$ is no more than $t_1 + t_2$, where $t_1$ is an upper bound on the time taken for the plug-in estimate $\hat{\theta}_t$ of the post-change parameter to settle at $\theta^\star$; in other words, we show that after time $t_1$, it is very unlikely that an action other than $a^\star(\theta^\star) = \arg\max_{a\in\mathcal{A}} D(\theta^\star(a)\|\theta_0(a))$ is played in any exploitation round. $t_2$ is an upper bound on the time taken for the worst-case CUSUM statistic $\max_{\theta\in\Theta^{(1)}} Q^{(2)}_t(\theta)$ to grow to the level $\log \beta$, assuming that the optimal (i.e., most informative) action $a^\star(\theta^\star)$ is played at all exploitation rounds. For clarity of exposition, we will assume that we are in the setting of linear measurements in $\mathbb{R}^d$ with additive standard Gaussian noise (this makes KL divergences easy to interpret as Euclidean distances), and that the changepoint is at $\nu = 1$.

Step 1: Finding $t_1$ (time to learn $\theta^\star$). We observe that the MLE $\hat{\theta}_t$ at time $t$ can be written as the parameter $\theta$ associated with the largest of the stochastic processes $\max_{1 \leq v \leq t-1} \log J_{v,t-1}(\theta)$, one for each $\theta$:
$$\hat{\theta}_t = \arg\max_{\theta \in \Theta^{(1)}} \; \max_{1 \leq v \leq t-1} \log J_{v,t-1}(\theta), \qquad \text{where } J_{v,t-1}(\theta) := \prod_{\ell=v}^{t-1} \left(\frac{f_\theta(X_\ell|A_\ell)}{f_{\theta_0}(X_\ell|A_\ell)}\right)^{E_\ell}.$$
Under the post-change distribution $P_{(1,\theta^\star)}$, $\log J_{v,t-1}(\theta)$ can be seen to evolve (as a function of $t$) as a random walk with drift $\frac{\epsilon}{2}\left( \|\theta^\star\|_H^2 - \|\theta - \theta^\star\|_H^2 \right)$. Here, $\|\cdot\|_H$ is the usual matrix-weighted Euclidean norm in $\mathbb{R}^d$, with $H = \sum_{a\in\mathcal{A}} \pi(a)\, a a^\top$ governing how much different directions are explored in expectation⁷ under the exploration distribution $\pi$. Note that $\frac{1}{2}\|\theta^\star - \theta\|_H^2$ here is the KL divergence between the distributions of the observation that results when an action sampled from $\pi$ is played, under the parameters $\theta^\star$ and $\theta$. The preceding discussion implies that the drift is largest (and positive) for the random walk corresponding to the true post-change parameter $\theta = \theta^\star$. The remainder of this part of the proof uses martingale concentration tools (subgaussian Chernoff and maximal Hoeffding bounds) to find the time $t_1$ at which the fastest-growing random walk, $\log J_{v,t}(\theta^\star)$, dominates all the other competing random walks $\log J_{v,t}(\theta)$, $\theta \neq \theta^\star$.

⁷This matrix also arises commonly in the linear design of experiments as the Kiefer-Wolfowitz matrix [LS20].

Step 2: Finding $t_2$ (time to stop under optimal action plays). In a manner similar to Step 1, we observe that the log-likelihood ratio for $\theta^\star$ w.r.t. $\theta_0$ computed over only exploitation rounds, i.e., $\log \prod_{\ell=v}^{t-1} \left(\frac{f_{\theta^\star}(X_\ell|A_\ell)}{f_{\theta_0}(X_\ell|A_\ell)}\right)^{1-E_\ell}$, evolves as a random walk with drift rate $(1-\epsilon)\, D\big(\theta^\star(a^\star)\|\theta_0(a^\star)\big)$ when the optimal action $a^\star = a^\star(\theta^\star)$ is always played. A Chernoff bound can then be used to control the probability of the bad event that this growing random walk has not crossed the level $\log \beta$ within a certain time duration $t_2$ ($t_2$ does not appear explicitly in the main derivation).
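To sanity-check the drift formula used in Step 1, the following small Monte Carlo sketch (our own, with illustrative constants) compares the predicted and empirical per-round drift of $\log J_{1,t}(\theta)$ under uniform exploration over canonical actions.

```python
import numpy as np

# Empirically check the Step-1 drift for log J_{1,t}(theta):
# drift = (eps/2) * (||theta*||_H^2 - ||theta - theta*||_H^2),
# with H = sum_a pi(a) a a^T and pi uniform over e_1, ..., e_d.
rng = np.random.default_rng(3)
d, eps, T, runs = 4, 0.3, 4000, 20
theta_star = np.array([1.0, 0.0, 0.0, 0.0])
theta = np.array([0.0, 1.0, 0.0, 0.0])
H = np.eye(d) / d                              # uniform exploration

def norm_H_sq(v):
    return float(v @ H @ v)

drift_pred = (eps / 2) * (norm_H_sq(theta_star) - norm_H_sq(theta - theta_star))

total = 0.0
for _ in range(runs):
    log_J = 0.0
    for t in range(T):
        if rng.random() < eps:                 # exploration round (E_t = 1)
            a = np.eye(d)[rng.integers(d)]     # A_t ~ pi
            x = float(a @ theta_star) + rng.standard_normal()
            # log f_theta(x|a) - log f_theta0(x|a), with theta0 = 0
            log_J += x * float(a @ theta) - float(a @ theta) ** 2 / 2
    total += log_J / T
print(drift_pred, total / runs)                # the two should roughly agree
```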
6 Experiments

Size   Oracle    ϵ-GCD+       ϵ-GCD        URS
10     30 ± 5    98 ± 62      112 ± 74     306 ± 76
15     31 ± 5    129 ± 95     158 ± 119    460 ± 115
20     31 ± 5    163 ± 128    196 ± 156    616 ± 153
25     31 ± 5    191 ± 154    253 ± 216    764 ± 194

Table 1: Observed mean ± standard deviation of the simulated stopping time for varying graph sizes with the pointy action set A1, for a change occurring at ν = 40.

Our goal in this section is to illustrate various aspects of our theory through experiments on synthetic environments, and to explore the performance of the proposed ϵ-GCD algorithm in a setting based on a real-world dataset. All experiments were performed on a laptop with an Intel Core i5 CPU and 8 GB of RAM, and take under an hour to execute. The closely related prior works on this topic [ZM20; Gun+21] are more specific and are not compatible with many of our structured scenarios (see Sec. 2 and Appendix E). As alternatives, we propose several natural baselines against which we benchmark our method's performance.

Uniform Random Sampling (URS). At every round, we sample an action from $\mathcal{A}$ uniformly at random. Our goal with URS is to calibrate the gains from adaptation.

ϵ-GCD (ours). This is our proposed scheme. A point to be noted here (as seen from Algorithm 1) is that the data used for making a stopping decision ("exploitation") and the data used for estimating the changed distribution ("exploration") are non-overlapping, which leads to inefficient data usage.

ϵ-GCD+. To reduce the aforementioned data inefficiency, we consider a variant in which the parameter estimation utilizes both the data collected during exploration rounds and the data collected during exploitation rounds.

Oracle. Here we consider an omniscient strategy: the algorithm is aware of both the pre-change and post-change distributions ahead of time, and consequently does not require exploration to estimate the post-change parameter. On the other hand, the method has no knowledge of the change time, and thus the problem reduces to conventional quickest change detection. The Oracle characterizes the gold standard in that no adaptation is required, and it gives the smallest possible detection delay for any BQCD problem.

Synthetic Experiments. We conduct synthetic experiments for the model introduced in Section 3 under the item titled "Illustrative Example". We explore how the average detection delay varies under controlled variation of parameters such as the changepoint location, the dimensionality and structure of the ambient space, and the type of action set. Our objective is twofold: (a) illustrate the gains due to the adaptivity of the proposed ϵ-GCD over the non-adaptive method, where actions are chosen uniformly at random; and (b) demonstrate near-optimality by baselining against the Oracle.

We report results for the case where the ambient space is a line graph (see the Illustrative Example in Sec. 3). Nodes are interpreted as physical locations indexed by $[N]$; nodes $j, k$ are connected if $|j - k| = 1$. Each node $n \in [N]$ offers a Gaussian-distributed observation depending on the changepoint $\nu \in \mathbb{N}$.

Isolated and Structured Anomalies. The vector of change parameters $\theta = [\theta_n]$ is of one of two types: (a) an isolated singleton change, namely $\theta_n \in \{0,1\}$ with $\sum_{n \in [N]} \theta_n = 1$; or (b) a structured change, i.e., $\theta \in \{0,1\}^N$ with $\mathrm{Supp}(\theta) = \{n \in [N] : \theta_n \neq 0\}$ a $k$-connected set.

Diffuse and Pointy Action Sets. In a parallel fashion, we consider two types of action sets. The set A1 is pointy, namely $|\mathrm{Supp}(a)| = 1$ for every $a \in$ A1, which allows probing only single nodes. The set A2 is diffuse, where a connected subset of nodes can be queried. In either case, the observation received upon playing an action is as described in Sec. 3.
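The following helper sketch (our own construction, one possible reading of the definitions above) builds the two action-set types and the structured anomaly supports on a line graph.

```python
import numpy as np

def pointy_actions(N):
    """A1: probe exactly one node."""
    return [np.eye(N)[i] for i in range(N)]

def diffuse_actions(N, width):
    """A2 (one possible reading): probe a connected window of `width` nodes."""
    acts = []
    for start in range(N - width + 1):
        a = np.zeros(N)
        a[start:start + width] = 1
        acts.append(a)
    return acts

def k_connected_supports(N, k):
    """Structured anomalies: supports of k consecutive nodes on the line graph."""
    return [np.pad(np.ones(k), (s, N - k - s)) for s in range(N - k + 1)]

N = 10
A1 = pointy_actions(N)
A2 = diffuse_actions(N, width=3)
supports = k_connected_supports(N, k=2)
print(len(A1), len(A2), len(supports))  # 10, 8, 9
```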
Key Findings. We report results for pointy actions and isolated anomalies, and defer the other settings to the appendix. We choose $\sigma^2 = 0.5$, set $\beta$ such that the false alarm rate is about 1% for the Oracle, and use a forced exploration rate $\epsilon = 0.2$. Our results are averaged over 5000 Monte Carlo runs. With this choice of $\beta$, we did not observe false alarms for any of the algorithms.

Gains from Adaptivity. The fact that we perform substantially better than URS uniformly across all of the experiments demonstrates the gains from adaptive sampling.

Data Inefficiency. The fact that the proposed method closely tracks ϵ-GCD+ suggests that our decoupled scheme is capable of fully leveraging the observations, and that the cost of decoupled exploration is small.

Optimality. The fact that the proposed method tracks the Oracle qualitatively across all experimental setups indicates that our method is close to optimal: through adaptivity, we achieve performance similar to a competitor that requires no adaptation (recall that the Oracle knows the change parameters and thus does not need to explore). In fact, we note in passing that any other method would at best lie between our proposed method and the Oracle.

Variation with Changepoint ν. For a fixed graph size, we observed that the average detection delay is relatively constant across changepoint locations for all methods, which is consistent with Theorem 3.

Audio-based Recognition of Machine Anomalies. We experiment using the MIMII audio dataset [Pur+19b]; detailed specifics are in the appendix. The dataset has four machines (ID00, ID02, ID04, ID06), each equipped with audio sensors recording the health of the machine. There are three types of anomaly (rail damage, loose belt, no grease), which can occur at any time in any machine. Corresponding to each machine there is an audio stream, and an anomaly occurs in one of the four machines at an arbitrary time. For each machine, the dataset contains 1000 normal and 300 abnormal audio files, and each audio stream is about 10 seconds long.

Audio Processing. For each audio stream, we train autoencoders on normal data using mel-spectrogram features, and fit Gaussians to the reconstruction errors. This yields pre- and post-change parameters for each machine's autoencoder reconstruction error score, corresponding to normal and abnormal operation.

Experiment. To simulate BQCD, we introduced anomalies in machine ID00 as follows. We concatenated 6 normal files and 54 abnormal files chosen uniformly at random from machine ID00. For the other machines, we concatenated 60 normal files chosen at random. The 60 files correspond to 600 seconds. The changepoint corresponds to the 6th file, which we denote as ν = 6, and our task is to detect this change. Note that the machine ID and the changepoint are both unknown to the learner. Our results are depicted as histograms of stopping times in Fig. 2. We observe that ϵ-GCD's performance is close to the Oracle both in mean and in distribution, while URS exhibits larger delay and significant variance. Appendix D presents histograms for a larger changepoint.

Figure 2: Audio-based change detection of machine anomalies: histograms of stopping times for URS, Oracle, and ϵ-GCD. The histograms for Oracle and ϵ-GCD demonstrate clear adaptivity gains.
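As a schematic illustration of this pipeline, the sketch below feeds synthetic per-machine anomaly scores to a greedy queue-based detector. It is closer in spirit to the ϵ-GCD+ variant (a single queue ensemble is reused for both sampling and stopping), and all score-model parameters are invented for illustration; the actual feature extraction and model fitting are detailed in the paper's appendix.

```python
import numpy as np

# Schematic BQCD harness over 4 machines: each "sensor" action returns the
# autoencoder reconstruction-error score of one machine, modelled as Gaussian
# with pre-change parameters (mu0, s0) and post-change parameters (mu1, s1).
rng = np.random.default_rng(4)
machines = 4
mu0, s0 = 0.0, 1.0       # illustrative pre-change score model
mu1, s1 = 2.0, 1.5       # illustrative post-change score model (machine 0)
nu = 60                  # changepoint, in score samples

def score(t, machine):
    if machine == 0 and t >= nu:
        return rng.normal(mu1, s1)
    return rng.normal(mu0, s0)

def llr(x):
    # log N(x; mu1, s1^2) - log N(x; mu0, s0^2)
    return (-0.5 * ((x - mu1) / s1) ** 2 - np.log(s1)
            + 0.5 * ((x - mu0) / s0) ** 2 + np.log(s0))

# One CUSUM queue per machine; probe the machine with the largest queue
# (greedy), with eps-uniform exploration, and stop at threshold log_beta.
q, eps, log_beta = np.zeros(machines), 0.2, np.log(1e4)
for t in range(10 ** 5):
    if q.max() >= log_beta:
        print("declared change at t =", t)
        break
    m = int(rng.integers(machines)) if rng.random() < eps else int(np.argmax(q))
    q[m] = max(q[m] + llr(score(t, m)), 0.0)
```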