# graph_scan_statistics_with_uncertainty__2aefc6ed.pdf

Graph Scan Statistics With Uncertainty

Jose Cadena, Arinjoy Basak, Anil Vullikanti Department of Computer Science and Biocomplexity Institute, Virginia Tech, Blacksburg, VA 24061 {jcadena,arinjoyb,vsakumar}@vt.edu

Xinwei Deng Department of Statistics, Virginia Tech, Blacksburg, VA 24061 xdeng@vt.edu

Scan statistics is one of the most popular approaches for anomaly detection in spatial and network data. In practice, there are numerous sources of uncertainty in the observed data. However, most prior works have overlooked such uncertainty, which can affect the accuracy and inferences of such methods. In this paper, we develop the ﬁrst systematic approach to incorporating uncertainty in scan statistics. We study two formulations for robust scan statistics, one based on the sample average approximation and the other using a max-min objective. We show that uncertainty signiﬁcantly increases the computational complexity of these problems. Rigorous algorithms and efﬁcient heuristics for both formulations are developed with justiﬁcation of theoretical bounds. We evaluate our proposed methods on synthetic and real datasets, and we observe that our methods give signiﬁcant improvement in the detection power as well as optimization objective, relative to a baseline.

1 Introduction Identifying anomalous hotspots or outliers is an important problem in the analysis of spatio-temporal and network data, with a large number of applications in areas such as disease surveillance, security, systems biology and social network analysis. One of the popular methods for anomaly detection is the scan statistics approach e.g., (Kulldorff 1997; Neill 2012). Informally, this involves formalizing a notion of anomalousness by a form of hypothesis testing, based on either an underlying model of the expected data (referred to as parametric methods), or based on historical values of the data (i.e., non-parametric methods). One of the earliest uses of scan statistics was for ﬁnding unusual disease clusters (Kulldorff 1997). See (Akoglu, Tong, and Koutra 2014) for a survey and Section 7 for more details. Most existing work in anomaly detection for network data typically assumes that the datasets are taken as is , which is often not a realistic assumption. The observed counts in data have uncertainty and do not exactly match the real world due to reporting errors, geocoding errors, missing entries, etc. All these sources of uncertainty would affect the problem formulations and algorithms for anomaly detection, but they are especially relevant when using scan statistics because the

Copyright c 2018, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved.

anomaly score is formalized in terms of the log likelihood of occurrence of observed data. One of the very few results on the impact of uncertainty in scan statistics is by (Malizia 2013), who observes that uncertainty exists and that it can affect the quality of the clusters discovered by scan statistics, but this study does not propose any methods for taking uncertainty into account. To the best of our knowledge, the only scan-statistics-based method that incorporates uncertainty is the Bayesian scan statistic proposed in (Neill et al. 2009) and later extended to the multivariate case (Neill and Cooper 2010) though the authors do not explicitly motivate their work as a solution to uncertainty. Both papers assume simple conjugate priors and thus are able to derive closed-form expressions. However, in general, more complex distributional assumptions for uncertainty would lead to expressions that do not have a closed form and can only be approximately optimized via sampling. In this paper, we use methods from the theory of stochastic optimization to formally characterize scan statistic maximization with uncertainty and develop novel algorithms and heuristics for this problem. Our contributions are summarized below.

Scan Statistics with Uncertainty. We propose two approaches for taking uncertainty into account: a sample average approximation (SAA) and a max-min formulation (Shapiro, Dentcheva, and Ruszczy nski 2009; Bertsekas 1995; Pr ekopa 2013), and show that these are well motivated for scan statistics by analyzing instances with stochastic perturbations. When we account for uncertainty, the anomaly detection task becomes much more challenging. We show that, even without any connectivity constraints, ﬁnding clusters based on scan statistics that optimize the max-min objective function is NP-hard, whereas it can be done in linear time if there is no uncertainty. Rigorous Algorithms and Theoretical Bounds. We develop rigorous algorithms and heuristics for optimizing the scan statistics in both these formulations. For the SAA formulation, our algorithm AGGREGATESAA gives rigorous bounds on the approximation guarantee for ﬁnding solutions of a given effective size k, with or without connectivity constraints, and is a ﬁxed parameter tractable algorithm. For the max-min formulation, we present two algorithms: (1) MAXMINLPROUND for the case of no connectivity constraints, using linear programming rounding, which yields rigorous approximation bounds, and

The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)

(2) a heuristic, BESTMAX, for the case with connectivity. Both AGGREGATESAA and BESTMAX use the technique known as color coding for counting trees, which was adapted for scan statistics by (Cadena, Chen, and Vullikanti 2017). Numerical Evaluation of the Proposed Methods. We evaluate our proposed methods in two popular benchmarks for scan statistics with known events. Both the SAA and max-min approaches lead to fairly robust scan statistics, and our algorithms have a clear improvement in performance, compared to a natural baseline, over a wide range of signal-to-noise ratio regimes. For regimes with high noise in the data, we see up to two-fold improvement in F1 score. Similarly, the objective score improves in all experimental settings, especially when the signal-to-noise ratio is low, where we see gains over the baseline by more than a factor of 2. Many details are omitted for brevity, and are available at (Cadena and others 2018).

2 Preliminaries

It is known that scan statistics have been mostly applied to spatial data (Kulldorff 1997). Here, we consider the more general version of scan statistics on networks, which has attracted attention recently (Mc Fowland, Speakman, and Neill 2013; Neill 2012; Speakman and others 2015; Leiserson and others 2015; Hansen and Vandin 2016). Let us deﬁne an undirected graph G = (V, E) with n = |V | nodes and m = |E| edges. For each node v V , we have an event count c(v) and a baseline count b(v). Using scan statistics, the detection of anomalous clusters can be posed as a hypothesis testing problem. The null hypothesis H0 is that there is no anomalous cluster. That is, the event counts for all nodes are generated independently from the same distribution proportionally to the baseline counts. Under the alternative hypothesis H1(S), there exists a small connected subset of nodes S, such that event counts in S are generated at a higher rate than counts elsewhere (i.e., in V \ S). A scan statistic is an anomalousness function F : S R that evaluates how much a subset S deviates from the null hypothesis. Here, we focus on a commonly used scan statistic based on the likelihood ratio test as

F (S) = Pr(H1(S)|Data)

Pr(H0|Data) = Pr(H1(S)|C(S), B(S), C(V \ S), B(V \ S))

Pr(H0|C(V ), B(V )) ,

where C(S) =

v S c(v) is the total count of S, and B(S) =

v S b(v) is the baseline count. Depending on the type of data (i.e., counts, positive real values, etc.) and the statistical assumptions about the process generating the event counts (i.e., Poisson, Normal, etc.), the scan statistics can be broadly summarized in two categories: Parametric scan statistics. It assumes that observations follow a parameterized distribution, usually from the exponential family, such as Poisson or Normal. For example, the Kulldorff scan statistic for count data, commonly used in disease surveillance (Kulldorff 1997; Duczmal, Kulldorff, and Huang 2006; Kulldorff, Tango, and Park 2003;

Neill 2012), is deﬁned as

F (S) =C(S) log C(S)

+ (C(V ) C(S)) log C(V ) C(S)

C(V ) log C(V )

if C(S)/B(S) > C(V )/B(V ) and 0 otherwise. Nonparametric scan statistics. The observations are not assumed from a speciﬁed distribution. The main idea is to compute the empirical distribution of the counts based on multiple snapshots of the graph. First, one can deﬁne a pvalue p(v) for each vertex v by comparing the current values of c(v) and b(v) to past observations. Then, a hypothesis test can be performed to check whether the empirical p-values are uniformly distributed on [0, 1] (Qian, Saligrama, and Chen 2014; Sharpnack, Krishnamurthy, and Singh 2013; Neill 2012). For example, in the Berk-Jones scan statistic (BJ) (Berk and Jones 1979), a node is declared to be signiﬁcant if p(v) < α for a given signiﬁcance level α. The weight of a node with respect to α is w(v, α) is 1 if v is signiﬁcant and 0 otherwise. We use w(v) when α is clear from context. Then, the weight of a set S is W(S) =

v S w(v). The baseline count is b(v) = 1 for all nodes, so B(S) = |S|. The BJ scan statistic is deﬁned as

F (S) = max α αmax |S| W (S)

|S| log W (S)/|S|

log 1 W (S)/|S|

if W(S)/|S| > α and 0 otherwise. Thus, the anomaly detection problem in the network can be posed as a constrained optimization problem. Given a graph G = (V, E), a scan statistic F( ), and the associated counts for vertices, C and B, the objective is to ﬁnd a connected subset S V , such that F(S) = F(C(S), B(S)) is maximized. Example. Consider an instance of lung cancer cases in a population within a state. We consider a graph G = (V, E) with the set V representing the counties in the state, and E consisting of edges between counties sharing a boundary. The baseline count b(v) is the number of inhabitants in county v, and c(v) is the number of lung cancer cases in v. A cluster of counties in which the incidence counts are signiﬁcantly different from what is expected based on the population would have a high Kulldorff score.

3 Network Scan Statistics With Uncertainty The observed data is affected by different sources of uncertainty including unreported cases, inaccurate data collection, measurement error, etc., and how to account for such uncertainty is a crucial question. Let C be a vector of unobserved real counts, such that c(v) is the count associated with node v. Similarly, let X be the vector of observed noisy counts. We model uncertainty on X as a distribution θ parameterized by the unobserved real counts C. This distribution could be, for instance, Gaussian noise, such that X = C + ϵ, where ϵ captures our belief about the magnitude of uncertainty. Or a multivariate Gamma distribution with mean C and variance speciﬁed by a domain expert. We also deﬁne p(C|φ) as a prior distribution on C controlled by hyperparameters φ, which could be inferred from past data or from a domain expert s

knowledge. More generally, let X = (x(v1), . . . , x(vn)) and C = (c(v1), . . . , c(vn)) be the vectors of observed and real counts, respectively. Then, X|C θ(C). In most of the existing work, the observed counts are taken as the ground truth; that is, X = C deterministically.

3.1 Problem Formulations In this work, we consider two approaches to account for uncertainty. The ﬁrst method is based on the sample average approximation (SAA) method (Shapiro, Dentcheva, and Ruszczy nski 2009). The second is based on a max-min formulation (Bertsekas 1995), which is referred as the worst-case scenario in the stochastic optimization (Pr ekopa 2013). Scan statistics under sample average approximation. The key idea is taking the expectation of F(S) to account for the uncertainty in scan statistics. That is, given a graph G = (V, E), a scan statistic F( ), and the associated counts for vertices, X and B, the objective is to ﬁnd a connected subset S V that maximizes

E[F(S)|X] =

F (S) (F(S)|X)p(F(S)|X)d F(S)

C (F(S)|X)p(X|C)p(C)

Equivalently, we want to maximize

C(F(S)|X)p(X|C)p(C)d C, since P(X) is a constant. The idea of SAA is to generate samples from the distribution for the input and optimize over the set of samples (Swamy and Shmoys 2006). Speciﬁcally, maximizing E[F(S)|X] can be approximated by maximizing the average of N samples in the following manner: 1. for i = 1 to N: (a) Sample the ith vector of real counts Ci with probability proportional to p(Ci) (b) Sample the ith vector of estimate counts Xi from θ(Ci) (c) Compute Fi(S) = F(Xi(S), B(S)) for all connected S 2. Return S that maximizes the sample average: S = arg max S 1

N N i=1 Fi(S)

Problem 1 Given a graph G = (V, E), a scan statistic F( ), and N sets of counts for vertices Xi for i = 1 to N ﬁnd a connected subset S V that maximizes the average score,

i=1 Fi(S) = 1

i=1 F(Xi(S), B(S)).

Scan Statistics under the Max-Min Formulation. Rather than taking the average for accounting the uncertainty, an alternative approach to address uncertainty in scan statistics is to consider the minimum score over all the samples. This gives us the following formulation.

Problem 2 Given a graph G = (V, E), a scan statistic F( ), and N sets of counts for vertices Xi for i = 1 to N ﬁnd a connected subset S V , that maximizes the minimum over all samples,

min i=1,...,N Fi(S) = min i=1,...,N F(Xi(S), B(S)).

4 Motivation for the two formulations and challenges arising from uncertainty We use a simple stochastic perturbation model to show that the two formulations, Problem 1 and 2, are well motivated. We consider a simple model using the BJ statistic on an instance G = (V, E) constructed in the following manner. Let V be partitioned into V = V1 V2, with w(v) = 1 for v V1, and w(v ) = 0 for v V2. We also have |V1| = k, so that V1 is the optimal solution maximizing the BJ statistic for this instance. Next, we assume a simple noise model, in which each non-anomalous node (i.e., those with weight 0) becomes anomalous with a small probability p.

Observation 1 For the above stochastic model, both the SAA and Max-Min formulations correctly identify the optimal cluster V1, if the number of samples N 2 ln n

ln 1/p. In contrast, the optimal solution for any sample is a subset of V2, with high probability, missing the real anomalous subgraph V1.

Uncertainty makes scan statistics much harder. Neill (Neill 2008) observed that many scan statistics (without any uncertainty), including the BJ-statistic can be solved in linear time if there are no connectivity constraints, because of the linear time subset scanning property. As a result, the optimum solution of a given size can be found by considering the items in the order of their counts. In contrast, we show below that under uncertainty, the Max-Min formulation of Problem 2 is NP-complete. We note that maximizing scan statistic score with connectivity constraints was shown to be NP-Hard in (Cadena, Chen, and Vullikanti 2017).

Lemma 1 For any non-parametric scan statistic F( ), ﬁnding a subset S V without any connectivity requirement, and of size at most k, that maximizes mini Fi(S) is NPcomplete.

The proof is by a many-to-one reduction from the set multicover problem (Chekuri, Clarkson, and Har-Peled 2012), and the main idea is that the optimum solution to the Max-Min objective requires a certain number of anomalous nodes in each sample. By guessing this number, and mapping each set to a sample, the reduction can be made to work. The proof is discussed in the Supplementary Material (Cadena and others 2018).

5 Proposed Methods We describe our algorithms for the SAA and Max-Min objectives. We focus on non-parametric functions here, though our methods extend to parametric functions in a natural manner. The approximation bounds depend on the speciﬁc functions, and here we derive those for the BJ-statistic (see Section 2). For the rest of the section, we use wi(v, α) to denote the weight of node v deﬁned in Section 2 in the ith replicate. For brevity, many of the details and proofs are presented in the Supplementary Material.

5.1 Algorithm for the Sample Average Approximation Formulation Algorithm 1 describes AGGREGATESAA for ﬁnding a solution to problem 1. Our method builds on the work of (Cadena,

Algorithm 1: AGGREGATESAA((G(V, E), αmax), k, ϵ).

1: Input: Instance (G(V, E), {wi : i = 1, . . . , N}, αmax), parameters k, ϵ 2: Output: Set S of size at most k 3: Let A be the set of p-values of nodes in V below αmax 4: for α A 5: Let w be a weight vector with w(v) = N i=1 wi(v, α) 6: {S j (α) : j = 1, . . . , k} = MAXWT(G, w, k, ϵ n2 ) 7: S = argmaxj K,α A 1

N N i=1 Fi(S j (α)) 8: return S

9: 10: procedure MAXWT(G(V, E), w, k, ϵ ) 11: Input: Instance (G(V, E), w) and parameters k, ϵ

12: Output: {S j : j K}, such that S j has weight ψj 13: Let ψj = for all j K 14: for ℓ= 1 to ek log (1/ϵ ) 15: For each node v, pick random color col(v) K 16: for v V, s K 17: M(v, {s}) = w(v) if col(v) = s; otherwise 18: for v V and T K, with |T| 2 19: M(v, T) = maxu Nbr(v) T1,T2 T {M(v, T1) + M(u, T2)}

20: If M(v, T) > ψ|T | update ψ|T | = M(v, T) 21: return {S j :

v S j w(v) = ψj, for j K}

Chen, and Vullikanti 2017), and uses the color-coding technique of Alon et al. (Alon, Yuster, and Zwick 1995). This algorithm ﬁnds an optimal subgraph of size at most k, where k is a parameter, in time O(akpoly(n, m)), for some constant a i.e., the running time is polynomial on the size of the graph, but exponential on the solution size. In contrast, a brute force approach would need n k = O(nk) time to examine every possible connected subset of nodes of size at most k. Intuition behind the algorithm. The main idea is to color the nodes of the graph using K = {1, . . . , k} colors and restrict the search to colorful solutions, which are subgraphs with distinctly colored nodes. This immediately leads to an efﬁcient algorithm because: (1) colorful solutions can be computed using a simple dynamic program, and (2) if the coloring is done randomly, there is a reasonable probability that the optimal solution is colorful. (Cadena, Chen, and Vullikanti 2017) show that the scan statistics problem can be solved by such a dynamic program. Further, combined with a graph compression technique, one can discover solutions with hundreds of vertices while keeping k below 10. The ﬁnal algorithm is randomized, and it returns an optimal solution with probability (1 ϵ) in time O((2ϵ)km log(1/ϵ)), where k is solution size after such a compression. Problems for which such algorithms exist are said to be ﬁxed parameter tractable. Overview of Algorithm AGGREGATESAA. The for loop in lines 4-6 tries out each potential value of α. For each candidate α, the aggregate weight vector w is computed following the process described in Section 2. The subroutine MAXWT (from Cadena, Chen, and Vullikanti) is called in line 6, and it returns a candidate solu-

tion S j (α) for each α, and each size j k. It uses color coding to ﬁnd the optimal solution by dynamic programming. Theorem 1 Suppose the solution S computed by Algorithm AGGREGATESAA corresponds to S j (α) for some j K. Suppose N i=1 v S j (α) wi(v) cαj N, for a constant c > 1. Then, the score of S is within a factor of cα log 1/α cα log c+(1 cα) log 1 cα

1 α of the optimum score, with probabil-

ity at least 1 ϵ for any ϵ (0, 1). The total running time and space used are O(2kek|A|Nm log (n2/ϵ)), and O(2kn), respectively. Proof: (Sketch) The proof relies on the convexity of the function F( ) for non-parametric functions. Since N i=1 v S j (α) wi(v) cαj N, the minimum value i Fi(S j (α)) can take is when the weight in each replicate is the average, namely cαj. Therefore,

i Fi(S j (α))/N cα log c + (1 cα) log 1 cα

Since S j (α) maximizes the total weight N i=1 v S j (α) wi(v), it follows that for any other

set S , N i=1 v S j (α) wi(v) N i=1 v S wi(v). The maximum value that can be achieved by i Fi(S ) is when the total weight in some replicates is close to j and 0 in the rest (See (Cadena and others 2018)). Therefore,

v S wi(v) j N log1/α = cα log 1/α

The approximation factor is therefore bounded by the ratio of these, which proves the theorem. Performance guarantee in practice. The performance guarantee in Theorem 1 depends on how far the average weight of the sets is from α, over the samples. Empirically, we ﬁnd that the approximation bound decreases with both c and α, and is very close to 1 in our experiments in Section 6.

5.2 Algorithm for the Max-Min Formulation The basic idea of our algorithm is to guess the total weight z of the anomalous nodes in the optimal solution in the minimum sample, and then ﬁnd a solution that has at least weight z in each sample this corresponds to a multi-cover problem, as in the reduction in the proof of Lemma 1. We use the linear programming rounding method of Kolliopoulos et al. (Kolliopoulos and Young 2005) for ﬁnding such an approximate solution. Algorithm 2 describes MAXMINLPROUND for this problem. Lemma 2 Let Sz be the solution returned by Algorithm 2. Then, mini Fi(Sz ) KL(z /(k log n),α)

KL(z (1 ϵ)/k,α) for any ϵ (0, 1).

Next, we describe the heuristic BESTMAX for problem 2 with connectivity constraints. Proving its approximation guarantee remains an open problem; here, we analyze its running time. Lemma 3 Algorithm BESTMAX takes time O(2kek|A|Nm log (n2/ϵ)) and uses space O(2kn), where A is the set deﬁned in line 3 of the algorithm.

Algorithm 2: MAXMINLPROUND(V, w, k) for Max-Min formulation without connectivity constraints.

1: Input: Instance V, wi, i = 1, . . . , N and parameter k 2: Output: Instance S V that maximizes mini Fi(S) with |S| = k 3: for z [1, maxi

v wi(v)] in powers of (1 + ϵ) 4: Construct matrix A RN |V | with Aiv = wi(v) and b RN with bi = z 5: Find a solution x {0, 1}V that minimizes

i xi and satisﬁes Ax b with the algorithm of (Kolliopoulos and Young 2005) 6: if a solution x exists, let Sz = {i : xi = 1} 7: return Sz = {i : xi = 1} for the maximum z for which there is a solution

Algorithm 3: BESTMAX((G(V, E), αmax), k, ϵ) for Max-Min formulation with connectivity constraints.

1: Input: Instance (G(V, E), {wi : i = 1, . . . , N}, αmax), parameters k, ϵ 2: Output: Set S of size k 3: Let A be the set of p-values of nodes in V below αmax 4: for α A 5: {S j (i, α) : j = 1, . . . , k} = MAXWT(G(V, E), wi, k, ϵ/n2) 6: S = argmaxj K,i,α A min N i =1 Fi (S j (i, α)) 7: return S

6 Experiments Our experiments are motivated by the following questions: Detection and optimization power. Does accounting for uncertainty improve detection of anomalous clusters compared to the deterministic case? How do our algorithms AGGREGATESAA and BESTMAX perform for the objective functions in Problems 1 and 2? Effect of the number of replicates How does detection power vary with the number of replicates N? Approximation guarantee in practice What are the practical implications of Theorem 1? What is the empirical approximation bound of AGGREGATESAA? We evaluate our algorithms for scan statistics with uncertainty on the Kulldorff statistic (Kulldorff 1997) and the BJ statistic (Berk and Jones 1979), which are examples of parametric and non-parametric scan statistics, respectively. We evaluate the event detection power in terms of accuracy, precision, recall, and the F1 score. Let R be the set of nodes in the anomalous subgraph we discover and let S be the detected subgraph; then, we deﬁne (1) Accuracy(R, S) = |R S|

(2) Precision(R, S) = |R S|

(3) Recall(R, S) = |R S|

(4) F1 score = 2 Precision(R,S) Recall(R,S)

Precision(R,S)+Recall(R,S)

. We mainly discuss results for the F1 score below, but plots for the other metrics can be found in (Cadena and others 2018).

Baselines. For the evaluation, we compare our algorithms to the following baseline: Given N count vectors as in Problems 1 and 2, we select one of these vectors uniformly at random and return the connected set S that maximizes the scan statistic in the selected counts. The baseline for Problem 1, referred to as B-SAA, returns the SAA objective value for S. Similarly, the baseline for Problem 2, referred to as B-MAX-MIN, returns the Max-Min objective value for S. These baselines reﬂect the current practice of ignoring uncertainty.

6.1 Datasets The Northeastern USA Benchmark (NEast). This dataset (Kulldorff, Tango, and Park 2003) corresponds to occurrences of cancer in a network of 245 counties in the Northeastern part of USA. Under the null model (i.e., no signiﬁcant cluster), each node v has a count c(v) Poisson(pb(v)), where p = 2.03 10 5. We simulate anomalous clusters in this network as follows: A cluster consists of a node selected at random and all its neighbors. We generated three such clusters, which can be found in (Cadena and others 2018). These clusters vary in difﬁculty of detection easy, medium, and hard. The counts for a node v inside the cluster are sampled from Poisson(qb(v)). We perform experiments with values of q of the form q = βp, where β > 1 is a parameter that we call signal strength. Intuitively, nodes in the anomalous clusters have β times as many expected counts as nodes outside the cluster. In Section 6.2, we discuss the effect of this parameter. Then, we perturb the counts generated above using Gaussian noise. That is, for each node v, we sample and round down a count x(v) N(c(v), σ2), where σ2 is a noise parameter. Notice that one could have a different noise parameter for each node, but we only consider uniform noise in our experiments. Battle of the Water Sensor Networks (BWSN) This dataset (Ostfeld and others 2008) was originally used to evaluate different sensor network designs in terms of early detection of contaminants in a water system. The dataset includes ground truth subgraphs representing parts of the network that are contaminated at different points in time there is one ground truth graph for each snapshot. We control the noise on the sensors with a parameter ϵ. With probability ϵ, the real p-value of a sensor in the network is replaced by a random p-value uniformly sampled from the interval [0, 1]. We show results for three clusters; these are typical for other clusters as well.

6.2 Evaluation Detection and optimization performance. First, we discuss the performance of our methods on the NEast benchmark. We evaluate performance for a wide range of signal (β) and noise (σ2) parameters (deﬁned in Section 6.1). In Figure 1, we show results for SAA (left), Max-Min (center), and Baseline (right) in the medium-difﬁculty instance. The heatmaps correspond to different combinations of signal and noise, with darker colors indicating a higher F1 score. We observe that our algorithms for SAA and Max-Min obtain higher scores than the baseline for a larger range of β-σ2 combinations.

Figure 1: F1 score for various combinations of signal strength and noise for one of the clusters in the NEast dataset, using the AGGREGATESAA and BESTMAX algorithms, and the baseline (darker color means better performance).

In Figure 2, we provide an alternative way to summarize the performance gain for both formulations. The plots show the fraction of signal-noise combinations for which our algorithms have a given percentage of improvement over the baseline. For example, with the SAA formulation (left plot), we obtain at least a 20% improvement (x-axis) over the baseline on 25% (y-axis) of the signal-noise combinations from Figure 1 for the medium-difﬁculty cluster (green line). The two leftmost plots in Figure 2 reveal that we have larger improvement on F1 score over the baseline on the mediumdifﬁculty instance than in the other two. If a subgraph is easy to discover, accounting for uncertainty may not offer a significant advantage. On the other hand, if an instance is hard to discover, performance is affected in all methods. However, we observe an improvement in performance in all clusters. Finally, here, we can see that the Max-Min formulation has higher improvement for the same instance. The same trends are observed for the objective score (two rightmost plots). Next, we discuss results for the BJ statistic. In Figure 3 (top), we show the detection performance on the BWSN dataset with the BJ statistic as a function of noise, for three clusters. As expected, performance degrades for all methods as noise increases. However, our algorithms have a better performance than B-SAA and B-MAX-MIN for all levels of noise, and they degrade with the level of noise gradually. In contrast, the baselines show a very inconsistent performance. AGGREGATESAA and BESTMAX also show improved objective score compared to the baseline. In particular, for Problem 2, BESTMAX performs signiﬁcantly better, typically giving over 20% improvement over B-MAX-MIN.

Effect of N on performance. In Figure 4, we show the F1 score as a function of N, the number of replicates for the BWSN dataset. We observe that the detection power improves as we use more samples for both the SAA and the Max-Min formulation. However, the baseline does not beneﬁt from a larger N.

Approximation bound in practice. We analyze the empirical performance of AGGREGATESAA, compared with the worst case bound derived in Theorem 1. For each snapshot of the BWSN dataset, we compute the approximation ratio given in Theorem 1 for the subgraphs discovered by our algorithm. Figure 5 shows the empirical worst-case guarantee for different noise levels (i.e., each box in the plot) and each snap-

shot of the dataset (i.e., data points used to draw the box). We see that, for almost all the cases, the approximation guarantee is at least 80%. Further, we observe that the approximation generally becomes worse and has higher variance as noise increases. The increase in the two highest levels of noise may seem counterintuitive, but it is explained by the fact that the bound derived in Theorem 1 is with respect to the size of the solution discovered. As noise increases, the heuristic discovers smaller solutions, which are easier to approximate.

7 Related Work

Our paper is related to the broad area of anomaly detection for network data, and we refer to Akoglu et al. (Akoglu, Tong, and Koutra 2014) for a comprehensive survey on this topic. For brevity, we only discuss work on scan statistics. There are a number of parametric scan statistics, depending on the speciﬁc assumption about the observations, e.g., Positive Elevated Mean Scan Statistic (Qian, Saligrama, and Chen 2014), Expectation-based Poisson Scan Statistic (Neill 2012), and Expectation-based Gaussian Scan Statistic (Neill 2012), in addition to the Kulldorff Scan Statistic (Kulldorff 1997) discussed in Section 2. In general, optimizing these functions is challenging in the presence of network constraints, and a number of heuristics have been proposed, e.g., branch-and-bound methods (Speakman and others 2015), Additive Graph Scan (Speakman, Zhang, and Neill 2013) based on shortest paths in the graph, semi-deﬁnite programming (Qian, Saligrama, and Chen 2014) and Steiner tree heuristics (Rozenshtein and others 2014). The color-coding based algorithm of (Cadena, Chen, and Vullikanti 2017) gives rigorous results for all these functions. Similarly, in addition to the Berk-Jones (Berk and Jones 1979) described in Section 2, there are a number of non-parametric scan statistics, such as Higher Criticism (Donoho and Jin 2004), Kolmogorov-Smirnov (Wilcox 2005) and Anderson-Darling (Eicker 1979). There are several heuristics to optimize such functions, (Mc Fowland, Speakman, and Neill 2013; Neill and Lingwall 2007; Neill 2008; Chen and Neill 2014). The approach of (Cadena, Chen, and Vullikanti 2017) extends to these functions as well. However, as discussed in Section 1, none of the above methods deal with data uncertainty. Malizia (Malizia 2013) is one of the few papers considering the effect of uncertainty on scan statistics, but not how how to account for it. The only method that we are aware of for incorporating uncertainty is

Figure 2: Relative improvement of AGGREGATESAA and BESTMAX over the baseline on the F1 score (left plots) and objective score (right plots) in the NEast dataset. The y-axis corresponds to the fraction of cells in Figure 1, for which our algorithms have a certain level of improvement over the baselines (x-axis). Higher is better.

Figure 3: F1 score (top) and objective score (bottom) of AGGREGATESAA and BESTMAX, as a function of the noise level, compared to the baseline for three clusters in the BWSN dataset. Higher is better.

Figure 4: F1 score as a function of N for the BWSN dataset. Detection power increases with the number of replicates for our algorithms, but not for the baseline.

Figure 5: Empirical approximation guarantee of AGGREGATESAA for the BJ statistic (left) and size of the discovered subgraph (right) in the BWSN dataset.

the Bayesian scan statistic proposed in (Neill et al. 2009; Neill and Cooper 2010) though the authors do not ex-

plicitely motivate their work as a solution to uncertainty. The authors propose a Bayesian extension of the Kulldorff statistic assuming a Gamma-Poisson conjugate model, which allows them to derive a closed-form scan statistic. However, the methods that we develop are more general because we don t require a closed-form expression.

8 Conclusions

Scan statistics are used extensively in anomaly detection, but most previous works do not incorporate data uncertainty. We propose the ﬁrst characterization of the effects of uncertainty on scan statistics using two formulations from stochastic optimization, and we design rigorous algorithms and heuristics for these problems. Our evaluation shows that both ap-

proaches give clear improvement on the detection power relative to a natural baseline. We expect our methodology can help incorporate the effects of uncertainty in other problems as well.

Acknowledgements This work was partially supported by the following grants: DTRA CNIMS Contracts HDTRA1-11-D-0016-0010, HDTRA1-17-D-0023, and NSF grants IIS-1633028, ACI1443054.

References Akoglu, L.; Tong, H.; and Koutra, D. 2014. Graph based anomaly detection and description: a survey. Data Mining and Knowledge Discovery. Alon, N.; Yuster, R.; and Zwick, U. 1995. Color-coding. Journal of the ACM (JACM). Berk, R. H., and Jones, D. H. 1979. Goodness-of-ﬁt test statistics that dominate the kolmogorov statistics. Z. Wahrsch. Verw. Gebiete. Bertsekas, D. P. 1995. Dynamic programming and optimal control, volume 1. Athena Scientiﬁc Belmont, MA. Cadena, J., et al. 2018. Graph scan statistics with uncertainty. http://tinyurl.com/ybpaa384. Cadena, J.; Chen, F.; and Vullikanti, A. 2017. Near-optimal and practical algorithms for graph scan statistics. In SIAM Data Mining (SDM). Chekuri, C.; Clarkson, K. L.; and Har-Peled, S. 2012. On the set multicover problem in geometric settings. ACM Trans. Algorithms 9(1):9:1 9:17. Chen, F., and Neill, D. 2014. Non-parametric scan statistics for event detection and forecasting in heterogeneous social media graphs. In KDD. Donoho, D., and Jin, J. 2004. Higher criticism for detecting sparse heterogeneous mixtures. The Annals of Statistics. Duczmal, L.; Kulldorff, M.; and Huang, L. 2006. Evaluation of spatial scan statistics for irregularly shaped clusters. Journal of Computational and Graphical Statistics. Eicker, F. 1979. The asymptotic distribution of the suprema of the standardized empirical processes. The Annals of Statistics. Hansen, T., and Vandin, F. 2016. Finding mutated subnetworks associated with survival in cancer. ar Xiv preprint ar Xiv:1604.02467. Kolliopoulos, S. G., and Young, N. E. 2005. Approximation algorithms for covering/packing integer programs. Journal of Computer and System Sciences 495 505. Kulldorff, M.; Tango, T.; and Park, P. J. 2003. Power comparisons for disease clustering tests. Computational Statistics & Data Analysis. Kulldorff, M. 1997. A spatial scan statistic. Communications in Statistics: Theory and Methods. Leiserson, M., et al. 2015. Pan-cancer network analysis identiﬁes combinations of rare somatic mutations across pathways and protein complexes. Nature genetics 47(2):106 114.

Malizia, N. 2013. Inaccuracy, uncertainty and the space-time permutation scan statistic. PLo S ONE. Mc Fowland, E.; Speakman, S.; and Neill, D. B. 2013. Fast generalized subset scan for anomalous pattern detection. JMLR 14(1). Neill, D. B., and Cooper, G. F. 2010. A multivariate bayesian scan statistic for early event detection and characterization. Machine learning 79(3):261 282. Neill, D. B., and Lingwall, J. 2007. A nonparametric scan statistic for multivariate disease surveillance. Advances in Disease Surveillance. Neill, D. B.; Cooper, G. F.; Das, K.; Jiang, X.; and Schneider, J. 2009. Bayesian network scan statistics for multivariate pattern detection. In Scan Statistics. Neill, D. B. 2008. Fast and ﬂexible outbreak detection by linear-time subset scanning. Advances in Disease Surveillance. Neill, D. B. 2012. Fast subset scan for spatial pattern detection. Journal of the Royal Statistical Society: Series B (Statistical Methodology). Ostfeld, A., et al. 2008. The battle of the water sensor networks (bwsn): A design challenge for engineers and algorithms. Journal of Water Resources Planning and Management. Pr ekopa, A. 2013. Stochastic programming, volume 324. Springer Science & Business Media. Qian, J.; Saligrama, V.; and Chen, Y. 2014. Connected sub-graph detection. In AISTATS. Rozenshtein, P., et al. 2014. Event detection in activity networks. In KDD. Shapiro, A.; Dentcheva, D.; and Ruszczy nski, A. 2009. Lectures on stochastic programming: modeling and theory. SIAM. Sharpnack, J.; Krishnamurthy, A.; and Singh, A. 2013. Nearoptimal anomaly detection in graphs using lovasz extended scan statistic. In NIPS. Speakman, S., et al. 2015. Scalable detection of anomalous patterns with connectivity constraints. Jl Comp Graphical Stat. Speakman, S.; Zhang, Y.; and Neill, D. B. 2013. Dynamic pattern detection with temporal consistency and connectivity constraints. In ICDM. Swamy, C., and Shmoys, D. 2006. Algorithms column: Approximation algorithms for 2-stage stochastic optimization problems. SIGACT News. Wilcox, R. 2005. Kolmogorov smirnov test. Encyclopedia of biostatistics.