# reconstructing_an_epidemic_outbreak_using_steiner_connectivity__2cf3540f.pdf

Reconstructing an Epidemic Outbreak Using Steiner Connectivity

Ritwick Mishra1,2, Jack Heavey1,2, Gursharn Kaur1, Abhijin Adiga1, Anil Vullikanti1,2

1 Biocomplexity Institute & Initiative, University of Virginia 2 Department of Computer Science, University of Virginia {mbc7bu, jch7jm, fug3aj, abhijin, vsakumar}@virginia.edu

Only a subset of infections is actually observed in an outbreak, due to multiple reasons such as asymptomatic cases and under-reporting. Therefore, reconstructing an epidemic cascade given some observed cases is an important step in responding to such an outbreak. A maximum likelihood solution to this problem (referred to as CASCADEMLE) can be shown to be a variation of the classical Steiner subgraph problem, which connects a subset of observed infections. In contrast to prior works on epidemic reconstruction, which consider the standard Steiner tree objective, we show that a solution to CASCADEMLE, based on the actual MLE objective, has a very different structure. We design a logarithmic approximation algorithm for CASCADEMLE, and evaluate it on multiple synthetic and social contact networks, including a contact network constructed for a hospital. Our algorithm has signiﬁcantly better performance compared to a prior baseline.

Introduction In most outbreaks (of diseases, pests, or pathogens), such as COVID-19 (Shaman et al. 2020), Hospital Associated Infections (HAIs), such as Methicillin-resistant Staphylococcus aureus (MRSA), and biological invasions (Robinson et al. 2017) only a subset of infections is known in a timely manner, due to multiple reasons, including asymptomatic cases (Jang et al. 2021), and lack of resources for tracing or inspection. Such events require a prompt response, such as isolation of infected patients (important in the case of HAIs, since hospitalized patients are very vulnerable) and corresponding measures in the case of crops and livestock. Therefore, reconstructing an epidemic cascade which identiﬁes missing infections (or infestations), and explains the pattern of spread an important problem. The problems of reconstructing an epidemic cascade and identifying the source have been studied extensively for both SI and SIR models on networks (Rozenshtein et al. 2016; Jang et al. 2021; Zhu and Ying 2014; Shah and Zaman 2011). These works assume partial information is available about the cascade, e.g., a subset of nodes which are known to be infected (Jang et al. 2021), or both infection state and time of infection (Rozenshtein et al. 2016; Zhu and Ying 2014; Shah and Zaman 2011).

Copyright 2023, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved.

In the simplest SIR type network model of epidemic spread (also referred to as the independent cascade (IC) model), where a disease spreads on each edge of a contact network G = (V, E) with probability p, this reconstruction problem (referred to as CASCADEMLE) involves ﬁnding a connected subgraph T = (V (T), E(T)) such that S V (T), where S is the subset of observed infections. Using the natural maximum likelihood estimation (MLE) approach, the goal of the CASCADEMLE problem is to ﬁnd a connected Steiner subgraph T such that S V (T), and Pr[T] is maximized this can be expressed as an equivalent problem of minimizing cost of T, denoted by Cost(T). As discussed later in the Preliminaries section, Cost(T) includes cost of edges e E(T), as well as cost of edges = (u, v) E(T) not in T, but with {u, v} V (T) = (i.e., at least one end point of e is in E(T)). This makes Cost(T) quite different from the standard Steiner tree objective (Karp 1972), which has been very well studied. Most prior work on reconstructing epidemic cascades in the SIR model has mainly been restricted to the regular Steiner tree objective, e.g., (Jang et al. 2021; Rozenshtein et al. 2016); this immediately connects with the vast literature on algorithms for the Steiner tree problem. We note that Zhu and Ying (2014) consider the actual cost, but mainly focus on trees for rigorous analysis (which is then extended to general graphs through various heuristics). Our Contributions. We study the CASCADEMLE problem of ﬁnding an MLE solution for reconstructing an epidemic cascade for the independent cascade (IC) model.

We show that the solution to CASCADEMLE can have a very different structure from the solution to the regular Steiner tree objective. In particular, there exist instances where the solution to CASCADEMLE has diameter Θ(n), where as the Steiner tree solution can have constant diameter. We observe a signiﬁcant difference in real instances as well. We study the conditions under which the MLE solution to the cascade reconstruction problem will fail. We ﬁnd that the MLE based approach is not good in many classes of instances. The CASCADEMLE hasn t been studied before. We show that it is NP-hard. For the independent cascade model, we present an algorithm with a logarithmic ap-

The Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI-23)

proximation factor, under natural assumptions about the structure of social contact networks. Finally, we evaluate our formulation and algorithms for several synthetic and realistic contact networks, including a contact network for the University of Virginia (UVA) hospital, constructed using Electronic Health Record (EHR) data. Our results show improved performance compared to a prior baseline in identifying missing infections.

Related Work Our paper is most closely related to the work on cascade reconstruction and source identiﬁcation on networks where only a partial set of nodes are observed. Rozenshtein et al. (2016) introduce a directed Steiner tree based algorithm, CULT, for reconstructing an epidemic cascade, where the underlying network is dynamic, and a subset of infections, along with their infection times, is given. They do not make any assumptions about the diffusion model. Jang et al. (2021) consider this problem in the asymptomatic infection setting and propose taking into account the individual s risk factors in the form of node attributes. They formulate this as a directed prize-collecting Steiner tree problem on a temporal contact network with outbreaks starting independently from multiple sources. However, both of these works minimize the regular Steiner tree objective, which consists of only the costs of edges within the subgraph. In contrast, we consider the true MLE cost which includes the costs from edges outside the subgraph corresponding to failed infection attempts. Shah and Zaman (2011) were the ﬁrst to study the source detection problem. They consider the SI model, and assume all the infections are given, and the goal is to determine the source. They study the ML estimator for the source detection problem, and show that it can be solved exactly on trees using a notion of rumor centrality. Zhu and Ying (2014) extend this work to the SIR model. They develop an optimal sample-path-based approach to source detection. A related line of research is the problem of sampling from the space of all cascades that explain the given observations. Xiao, Aslay, and Gionis (2018) map the problem of computing the node infection probabilities, given partial observations, to the problem of sampling Steiner trees. However, in their target sampling distribution, they only consider the contributions of edges within the tree, and ignore the contribution from edges outside the tree corresponding to failed infection attempts.

Preliminaries Spread model. We consider the simplest form of the Susceptible-Infected-Recovered (SIR) model called the Independent Cascade model on a contact graph G = (V, E). In this model, each node is in one of Susceptible (S), Infectious (I) or Recovered (R) state. We start off with all nodes in Susceptible state except the single source node which is in Infected state. At each time-step, an infected node u can infect each susceptible neighbor v with probability pu,v, independent of other neighbors of v. Each infected node gets

Figure 1: In this example, node 1 is the source, the red nodes are the infections, and the brown edges represent the infection cascade T. Here, δT is {(2, 4), (3, 4), (1, 8), (5, 9), (5, 7)}, and λT is {(4, 5), (6, 5)}. Each edge e in T contributes a cost ce to Cost(T). Edges in δT and λT contribute de costs, except for edge (4, 5). Neighbors 4 and 5 get infected at the same time, so they cannot attempt to infect each other.

only one opportunity to spread the infection following which they enter into Recovered state. Probability of a cascade. An outbreak, starting at a node r, is referred to as a cascade, and can be represented as a subgraph Tr = (V (Tr), E(Tr)), rooted at r. Let δTr be the set of edges not in Tr with exactly one endpoint in Tr, i.e., δTr = {(u, v) E \ E(Tr) : u V (Tr), v / V (Tr)}. Let λTr be the set of edges not in Tr with both endpoints in Tr, i.e. λTr = {(u, v) E \ E(Tr) : u, v V (Tr). Under the IC dynamics, the probability of the cascade Tr is:

e E(Tr) pe Y

e δTr (1 pe) Y

e=(u,v) λTr , d Tr (r,u) =d Tr (r,v)

where d Tr(r, u) denotes the distance between root node r and u, in the subgraph Tr. The ﬁrst term corresponds to the contribution of edges in the subgraph. Since every infected node gets a single chance to infect a susceptible neighbor, we have two kinds of (1 pe) terms contributed by edges not in Tr: (a) with exactly one endpoint in Tr, and (b) with both endpoints in Tr, as long as they are at different distances from the root. Let λ Tr denote the set of edges of type (b). MLE solution. We assume that subsets S0, S1 are given where S0 is a set of nodes which are known not to be infected in the outbreak, while S1 is a set of nodes known to be infected. We also assume the outbreak starts at a single node, which need not be in S. We say that a cascade Tr is consistent with (S0, S1) if S1 V (T) and S0 V \ V (T). The MLE problem involves ﬁnding a connected subgraph Tr = (V (Tr), E(Tr)) rooted at a node r which is consistent with the given (S0, S1), and maximizes P(Tr); this is equivalent to the optimal sample path detection problem as described in (Zhu and Ying 2014). Taking the log of the probabilities in P(Tr), we can deﬁne the cost of Tr as

Cost(Tr) = X

e E(Tr) ce + X

e δTr de + X

e=(u,v) λTr , d Tr (r,u) =d Tr (r,v)

Here, ce = log pe is the cost of including an edge e in the subgraph and de = log (1 pe) is the cost of excluding an edge e from the subgraph. In Figure 1, we have an illustrative example. The CASCADEMLE problem. Given subsets S0, S1, the goal is to ﬁnd a connected subgraph Tr rooted at some node r, which is consistent with (S0, S1), and minimizes Cost(Tr).

Theorem 1. CASCADEMLE is NP-hard.

Proof. (Sketch) We show this by a reduction from the standard Steiner tree problem which is NP-hard (Karp 1972). Consider the class of instances of CASCADEMLE where the homogeneous edge probability is p = 1 4n2 , n being the number of nodes in the graph. Thus, c = 2 log 2n and d = log(1 1 4n2 ) 2 4n2 for large n. The CASCADEMLE problem is to ﬁnd the consistent subgraph which minimizes

Cost(Tr) 2 log 2n|E(Tr)| + 1 2n2 |δTr λ Tr|

2 log 2n|E(Tr)| + 1 2n2 .n2

= 2 log 2n|E(Tr)| + 1

We use the fact that there can be at most n2 edges in δTr λ Tr. It can be veriﬁed that Cost(Tr) D if and only if |E(Tr)| D for any integer D n 1. The NP-hardness of CASCADEMLE follows from it.

We say a solution Tr is an α-approximation if Cost(Tr) αCost(T r ), where T r is an optimal solution to the instance of CASCADEMLE. Note that the root of Tr and T r need not be the same; we only need that Tr be consistent with (S0, S1). Remark. In practice, the costs of exclusion of the edges between same-level nodes in the cascade, P (u,v) λT ,d T (r,u)=d T (r,v) d(u,v), is a very small fraction of Cost(Tr) (as we verify in our experiments). This is also supported by the analysis of (Adcock, Sullivan, and Mahoney 2013) that many realistic social and information networks are tree-like, where this condition will hold. In such a setting, P(T) is a good approximation to Pr(Tr):

e E(T ) pe Y

e δT (1 pe) Y

e λT (1 pe). (3)

Observe that P(T) does not depend on the root. We consider the corresponding cost,

Cost(T) = X

e E(T ) ce + X

e δT de + X

e λT de (4)

and will focus on minimizing Cost(T). Algorithm 1 (described in the Approach section) considers the setting where we are given an undirected contact graph G = (V, E) with each edge e having an infection probability pe , and a set of observed infections S V . The

set of observed infected nodes S forms the terminal set in the output Steiner tree. For a node u, let Ne(u) denote the set of edges connecting to its neighbors. In Algorithm 2, we show that with a slight modiﬁcation, our approach extends to the setting where we are also given the set of observed uninfected nodes.

Difference between CASCADEMLE and Steiner Tree Solutions

As mentioned earlier, previous works (Rozenshtein et al. 2016; Jang et al. 2021) minimize the regular Steiner tree cost which consists of only the ﬁrst term in Cost(Tr), namely Costst(T) = P e E(T ) ce. Here we show that a solution which minimizes Costst(T), can have a very different structure than that which minimizes Cost(Tr). Next we study the conditions under which the MLE approach will fail by showing that there exist instances in which a CASCADEMLE solution does not recover the ground truth cascade.

Observation 1. There exist instances in which a CASCADEMLE solution Tr = arg min T r Cost(T r) has diameter Θ(n), while a Steiner tree solution Tst = arg min T Costst(T ) has diameter Θ(1).

Proof. Consider the class of graphs in Figure 2 where A1 A2 form a complete bipartite graph with |A1| = |A2| = N + 1 and terminal node set S = {r, t}. Assume the homogeneous setting i.e. ce = c, de = d for every e E. Consider trees T1 = (r, w1, w 1, t), and T2 = (r, u1, . . . , u N, t). Observe that Costst(T1) = 3c and Costst(T2) = (N + 1)c. It can be veriﬁed that T1 minimizes the Costst( ) objective. On the other hand, we have Cost(T1) = 3c + 2Nd + 2d and Cost(T2) = (N + 1)c + 2d. It can be veriﬁed that there exists a sufﬁciently large value of p for which T2 minimizes Cost( ), and thus, is a CASCADEMLE solution. Hence, there exist regimes in which the CASCADEMLE solution has a diameter Θ(n), while the Steiner tree solution has a diameter Θ(1).

Observation 2. There exist instances in which a CASCADEMLE solution does not recover the true cascade.

Proof. Consider the class of graphs in Figure 2. Suppose T1 is the ground-truth cascade comprising nodes {r, t, w1, w 1}. Given terminal set S = {r, t}, the MLE approach will pick T2 over T1, unless the cost of excluding the bipartite edges is insigniﬁcant, i.e., p is small enough. Thus, there exist regimes in which the MLE solution fails to recover any part of the ground truth cascade.

Our Approach

Assumption 1. For every edge in the network, pe 1/2. Equivalently, c(e) d(e), for all e E.

Lemma 1. Under Assumption 1, a CASCADEMLE solution T is a tree.

Figure 2: In this example, node r is the root, and S = {r, t} is the set of terminals. A1 A2 is a complete bipartite graph on nodes w1, . . . , w N+1, and nodes w 1, . . . , w N+1. T1 is the purple path between r, t through w1, w 1, while T2 is the red path between r, t through nodes u1, .., u N.

Assumption 1 states that the cost of including an edge is greater than than the cost of excluding it. This implies that an optimal subgraph is a tree, as we can always reduce the cost of the subgraph by excluding (rather than including) any edge that forms part of a cycle. Without this assumption, there exist instances where an optimal solution could have cycles. For example, in the homogeneous setting where all edges have the same costs c and d and c < d, an optimal solution could be one which spans the whole graph. Thus, under Assumption 1, we leverage a reduction from the CASCADEMLE problem to the node-weighted Steiner tree problem in our algorithm MINCOSTSTEINERTREE. In Algorithm 1, we construct a node and edge-weighted graph by weighing each node with the sum of the costs of exclusion of each of its incident edges, and each edge with the difference between its costs of inclusion and exclusion. Under Assumption 1, these edge weights are non-negative. Converting this to a purely node-weighted graph, this becomes a node weighted Steiner tree problem, where the goal is to ﬁnd the minimum-weighted Steiner tree with terminal set S. Algorithm 1 runs in polynomial time as we can construct the node-weighted graph in polynomial time and the Klein and Ravi (1995) algorithm has a polynomial time implementation. We now prove that this algorithm has a logarithmic approximation factor.

Theorem 2. Let ˆT be the tree returned by Algorithm 1, and let T be an optimal solution to the CASCADEMLE instance. Then ˆT is consistent with S, and

Cost(T ) Cost( ˆT) 4 ln |S| Cost(T ) (5)

Proof. For any Steiner tree T, X

u V (T ) w(u) + X

e E(T ) w(e)

e Ne(u) de + X

e E(T ) (ce de)

Algorithm 1: MINCOSTSTEINERTREE Input: An undirected contact graph G = (V, E) with edge probabilities pe and a set of observed infected nodes S Output: Tree Tr consistent with S 1: for each edge e do 2: Compute the cost of inclusion ce = log pe and cost of exclusion de = log (1 pe) 3: end for 4: Construct a node and edge-weighted graph G from G, by assigning weights as below: 5: for each node u do 6: w(u) P e Ne(u) de 7: end for 8: for each edge e do 9: w(e) ce de 10: end for 11: Convert G to a purely node-weighted graph ˆG by splitting each edge with a new node having the same weight. 12: Find the minimum weighted Steiner tree ˆT in ˆG with terminal set S, using Klein and Ravi s (1995) algorithm for the node-weighted Steiner tree problem. 13: Let r be any node in ˆT 14: return ˆT, with root r

e Ne(u) E(T ) de + X

e Ne(u) λT de

e Ne(u) δT de + X

e E(T ) (ce de)

e E(T ) de + 2 X

e λT de + X

e δT de + X

e E(T ) (ce de)

e E(T ) ce + X

e E(T ) de + 2 X

e λT de + X

= Cost(T) + X

e E(T ) de + X

e λT de (6)

u V (T ) w(u) + X

e E(T ) w(e)

Continuing from (6) and using Assumption 1, X

u V (T ) w(u) + X

e E(T ) w(e)

Cost(T) + X

e E(T ) ce + X

= 2Cost(T) X

e δT de 2 Cost(T)

Thus, for any Steiner tree T, we have shown that

u V (T ) w(u) + X

e E(T ) w(e) 2 Cost(T) (7)

This holds for the Steiner tree returned by the algorithm, ˆT.

Cost( ˆT) X

u V ( ˆT ) w(u) + X

e E( ˆT ) w(e) (8)

Klein and Ravi s algorithm has a worst-case approximation factor of 2 ln |S|. Hence, X

u V ( ˆT ) w(u) + X

e E( ˆT ) w(e)

u V (T ) w(u) + X

e E(T ) w(e)

4 ln |S|Cost(T ) (9)

Combining (8) and (9),

Cost( ˆT) 4 ln |S|Cost(T ) (10)

Observed Uninfected Nodes Our approach extends easily to the setting where we are given the set of observed uninfected nodes S0 in addition to the set of observed infected nodes S1. Our goal is to ﬁnd an optimal subgraph consistent with S0 and S1. Here we assume the observed uninfected nodes were never part of the cascade, i.e., they remained uninfected throughout the spreading process. Let κ = {(u, v) E \ E(T) : u S0, v T} be the set of edges with one endpoint in S0 and the other in the cascade T. In this setting, deﬁne δT as the set of edges not in cascade, with one endpoint in cascade and the other in V \ (S0 V (T)), i.e., δT = {(u, v) E \ E(T) : u V (T), v (V \(S0 V (T))}. Here λT is the same as before, i.e, λT = {(u, v) E \ E(T) : u, v T}. Then the cost of a subgraph T, under IC dynamics, can be deﬁned as:

Cost(T) = X

e E(T ) ce + X

e λT de + X

e δT de + X

e κ de (11)

Our goal is to ﬁnd a connected subgraph consistent with (S0, S1), and which minimizes Cost(T). Here we present Algorithm 2, MINCOSTSTEINERTREE-OBSUNINFECTED, which is only a slight modiﬁcation of our previous algorithm: we remove the nodes known to be uninfected (and their edges), after constructing the node and edge-weighted graph. This ensures that the graph weighting takes into account the removed edges and the returned tree is consistent with S0 and S1. Algorithm 2 has the same approximation factor as the previous algorithm.

Theorem 3. Let ˆT be the tree returned by Algorithm 2, and let T be an optimal solution to the CASCADEMLE instance. Then ˆT is consistent with S0, S1, and

Cost(T ) Cost( ˆT) 4 ln |S1| Cost(T ) (12)

The proof is similar to that for Theorem 2.

Experimental Results Dataset and Methods We experimentally study the CASCADEMLE problem and evaluate the performance of MINCOSTSTEINERTREE algorithm on several real-world and synthetic networks. The networks are listed in Table 1 and described here.

Algorithm 2: MINCOSTSTEINERTREE-OBS-UNINFECTED Input: An undirected contact graph G = (V, E) with edge probabilities pe, set of observed uninfected nodes S0, a set of observed infected nodes S1 Output: Tree Tr consistent with S0, S1. 1: for each edge e do 2: Compute the cost of inclusion ce = log pe and cost of exclusion de = log (1 pe) 3: end for 4: Construct a node and edge-weighted graph G from G, by assigning weights as below: 5: for each node u do 6: w(u) P e Ne(u) de 7: end for 8: for each edge e do 9: w(e) ce de 10: end for 11: Remove all the nodes in S0 (and their edges) from G . 12: Convert G to a purely node-weighted graph ˆG by splitting each edge with a new node having the same weight. 13: Find the minimum weighted Steiner tree ˆT in ˆG with terminal set S1, using Klein and Ravi s (1995) algorithm for the node-weighted Steiner tree problem. 14: Let r be any node in ˆT 15: return ˆT, with root r

1. ar Xiv High Energy Physics-Theory (HEP-TH): This is an academic collaboration network in the High Energy Physics-Theory community based on the citations in the ar Xiv preprints published between January 1993 and April 2004 (Gehrke, Ginsparg, and Kleinberg 2003; Leskovec and Krevl 2014). Taking the largest connected component, we generate a subgraph with n = 500 nodes, obtained by BFS starting from a random node. We refer to it as arxiv. 2. Erd os-R enyi random graphs: We generate several G(n, q) graphs for evaluating the performance of our method and include results for G(n = 300, q = 0.02). 3. Hospital ICU network: This is a contact network of patients and healthcare providers built using the Electronic Health Records (EHR) of the UVA Hospital s ICU between Jan 1, 2018 and Jan 8, 2018. We choose the largest connected component for our experiments and refer to it as hospital-icu. 4. Power-law networks: We generate power-law networks with n = 1000 nodes, varying the exponent γ in the range [1.5, 3.5].

First, we study the error in approximation of the cost by comparing the true Cost and our approximation Cost. The MINCOSTSTEINERTREE algorithm is evaluated with respect to network structure, diffusion model parameters, and the observation set. We compare our method against CULT (Rozenshtein et al. 2016) which is the state-of-theart Steiner tree-based cascade reconstruction method. Since CULT takes an additional time-of-report information while

MINCOSTSTEINERTREE does not, we consider three variants of this method: 1. CULT-DEL: all nodes are reported as infected at the last time step, 2. CULT-RAND: each node is reported as infected at a time step chosen randomly in between the time of infection and the last time step, and 3. CULT-NOS: all nodes reported as infected at the time step of infection (NOS means No-Shift). Note that these CULT variants are ordered in the increasing amount of information provided to the algorithm. We use the homogeneous probability setting for our experiments where we set the diffusion probability p across all edges to be the same. We generate the infection cascades under IC dynamics for a single source chosen uniformly at random. In our experiments for evaluating the algorithm, we have considered cascade sizes to be within (0.02n, 0.1n), where n is the network size so that sufﬁcient number of observed nodes can be extracted from the cascade. Next, to create the observation node sets from the generated infection cascades, we use two different schemes: (a) random: We randomly sample a ﬁxed % of nodes from the infected node set to form the observed node set. (b) frontier: Here, nodes in the cascade at a distance at least d from the source are chosen as observed. This corresponds to the scenario in which we have observed the more recent infections and our goal is to infer the rest. These schemes are inspired from Rozenshtein et al. (2016), and can help evaluate the performance of our method in two distinct observational settings. We choose Matthews correlation coefﬁcient (MCC) (Matthews 1975) and F1-score as in (Rozenshtein et al. 2016; Jang et al. 2021), to evaluate the quality of the reconstructed cascades with the ground-truth. All reported values are averaged over 100 trials.

Results Difference between Cost(T) and Cost(T). We created several power-law networks on 1000 nodes for various values of the power-law exponent γ in the range [1.5, 3.5]. We generated cascades starting from a source chosen uniformly at random, varying the probability p from 0.05 to 0.49. For each such cascade, we computed the error between the two costs. Representative results are in Figure 3 (the results are consistent across replicates of the networks). We observe that the difference in the costs depends on both the probability p as well as the network structure (which is decided by γ). We recall that the difference between the two costs is d = log(1 p) times the number of node pairs in the cascade that satisfy the property that both nodes are at equal distance from the source. For very low values of p, the difference is low across networks as the cost of including an edge c = log(p) d. As p increases, we observe that the network structure comes into play. For very low values of the power-law exponent γ, there are several nodes with high degree leading to the presence of dense subgraphs. This increases the chances of node pairs where the nodes at equal

0.1 0.2 0.3 0.4 0.5 Diﬀusion probability p

Power law exp.

1.5 2.0 2.5 3.0 3.5

Figure 3: Percentage error between Cost(T) and Cost(T) for cascades on random power-law graphs with varying diffusion probability p and power-law exponent.

distance from the source of the cascade, and in turn, leads to a larger difference in the two costs. On the other hand, for lower γ (2.5 3.5), the graph is more tree-like, and therefore, we see a very low error even for probability approaching 0.5. For γ = 2 in particular, we note that the error is 10% for p as high as 0.3.

Performance of MINCOSTSTEINERTREE. In Figure 4, the MCC scores are plotted for MINCOSTSTEINERTREE and CULT for the two observation schemes and a diffusion probability of 0.1. We observe that MINCOSTSTEINERTREE performance is superior compared to CULT-DEL across observation schemes, networks and diffusion probabilities. We recall that the MINCOSTSTEINERTREE algorithm accounts for the diffusion model while the versions of CULT do not. However, CULT-NOS and, to some extent CULT-RAND, account for the time of infection. In particular, for the G(300, 0.02) graph and the hospital-icu, we observe that the performance of MINCOSTSTEINERTREE is much better than that of CULT. We note that arxiv has a large average shortest path length (and low diameter) compared to the other two networks even though its clustering coefﬁcient is large. Even though hospital-icu has a large clustering coefﬁcient, it has a small average shortest path length like G(300, 0.02). In Figure 5, we have representative plots of the MCC and F1scores under diffusion probabilities 0.05 and 0.20. The performance is similar to that in Figure 4 for G(300, 0.02) and hospital-icu. For the higher probability, we observe inferior performance in the case of arxiv as the distance from source increases under the frontier observation scheme.

Impact of different types of observations. For the random observation scheme, we observe that MINCOSTSTEINERTREE performance drastically increases with increase in the number of observed nodes. This is particularly true for the real-world networks. Typically, we see good per-

Graph Name Nodes Edges Clustering coefﬁcient

Average shortest path length G(n, q) random graph 300 897 0.015 3.45 arxiv 500 895 0.52 12.5 hospital-icu 879 3575 0.59 4.31 Power-law networks 1000 660 6613 Average value reported.

Table 1: Networks and their properties

0.2 0.4 0.6 0.8 #observed nodes/#infected nodes

2 3 4 5 6 Distance from source

0.2 0.4 0.6 0.8 #observed nodes/#infected nodes

2 3 4 5 6 Distance from source

0.2 0.4 0.6 0.8 #observed nodes/#infected nodes

2 3 4 5 6 Distance from source

Min Cost Steiner Tree Cu LT-Del Cu LT-Rand Cu LT-No S

Figure 4: Performance of MINCOSTSTEINERTREE for random and frontier observation schemes. (a) G(300, 0.02); (b) arxiv; and (c) hospital-icu, with ﬁxed p = 0.10

formance when at least 40% of the infected nodes are observed. In the case of FRONTIER observation scheme, we observe that when the distance from the source is 4, the cascades constructed are quite inferior. This puts emphasis on early discovery of the outbreak.

Conclusions

We studied the problem of reconstructing an epidemic cascade given a subset of infections as observed nodes under IC dynamics. We presented an algorithm with a logarithmic approximation factor using a node-weighted Steiner tree approach, and evaluated its performance on several synthetic and real-world networks. An important future direction is to extend our approach to reconstruct cascades resulting from

0.2 0.4 0.6 0.8 #observed nodes/#infected nodes

2 3 4 5 6 Distance from source

0.2 0.4 0.6 0.8 #observed nodes/#infected nodes

2 3 4 5 Distance from source

0.2 0.4 0.6 0.8 #observed nodes/#infected nodes

2 3 4 5 6 Distance from source

Min Cost Steiner Tree Cu LT-Del Cu LT-Rand Cu LT-No S

Figure 5: Performance of MINCOSTSTEINERTREE for (a) G(300, 0.02) with p = 0.05; (b) arxiv with p = 0.20; and (c) hospital-icu with p = 0.05.

more complex SEIR processes with delayed recovery, SI, and the SIS models. Another direction is to incorporate additional information about the cascade such as reporting time or order of infections that can help overcome the limits of the MLE problem studied here.

Acknowledgements

This work was supported in part by the United States Agency for International Development under the Cooperative Agreement no. AID-OAA-L-15-00001, Feed the Future Innovation Laboratory for Integrated Pest Management, Agricultural AI for Transforming Workforce and Decision Support (Ag AID) grant no. 2021-67021-35344 from the USDA National Institute of Food and Agriculture, Network Models of Food Systems and their Application to Inva-

sive Species Spread, grant no. 2019-67021-29933 from the USDA National Institute of Food and Agriculture, UVA, NSF grants Expeditions CCF-1918656, IIS-1955797 NIH 2R01GM109718-07, and CDC MIND U01CK000589.

References Adcock, A. B.; Sullivan, B. D.; and Mahoney, M. W. 2013. Tree-Like Structure in Large Social and Information Networks. In 2013 IEEE 13th International Conference on Data Mining, 1 10. Gehrke, J.; Ginsparg, P. H.; and Kleinberg, J. M. 2003. Overview of the 2003 KDD Cup. SIGKDD Explor., 5: 149 151. Jang, H.; Pai, S.; Adhikari, B.; and Pemmaraju, S. V. 2021. Risk-aware Temporal Cascade Reconstruction to Detect Asymptomatic Cases : For the CDC MIn D Healthcare Network. In 2021 IEEE International Conference on Data Mining (ICDM), 240 249. Karp, R. M. 1972. Reducibility among Combinatorial Problems. In Miller, R. E.; Thatcher, J. W.; and Bohlinger, J. D., eds., Complexity of Computer Computations. The IBM Research Symposia Series, 85 103. Boston, MA: Springer US. ISBN 978-1-4684-2001-2. Klein, P.; and Ravi, R. 1995. A Nearly Best-Possible Approximation Algorithm for Node-Weighted Steiner Trees. Journal of Algorithms, 19(1): 104 115. Leskovec, J.; and Krevl, A. 2014. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/ data. Accessed: 2022-7-25. Matthews, B. 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure, 405(2): 442 451. Robinson, A.; Walshe, T.; Burgman, M.; and Nunn, M. 2017. Invasive Species: Risk Assessment and Management. Cambridge University Press. ISBN 9781108158282. Rozenshtein, P.; Gionis, A.; Prakash, B. A.; and Vreeken, J. 2016. Reconstructing an Epidemic Over Time. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 16, 1835 1844. New York, NY, USA: Association for Computing Machinery. ISBN 9781450342322. Shah, D.; and Zaman, T. 2011. Rumors in a Network: Who s the Culprit? Information Theory, IEEE Transactions on, 57: 5163 5181. Shaman, J.; et al. 2020. An estimation of undetected COVID cases in France. Nature, 590: 38 39. Xiao, H.; Aslay, C.; and Gionis, A. 2018. Robust Cascade Reconstruction by Steiner Tree Sampling. In 2018 IEEE International Conference on Data Mining (ICDM), 637 646. Zhu, K.; and Ying, L. 2014. Information source detection in the SIR model: A sample-path-based approach. IEEE/ACM Transactions on Networking, 24(1): 408 421.