# detecting_sources_of_healthcare_associated_infections__c767f96f.pdf

Detecting Sources of Healthcare Associated Infections

Hankyu Jang1, Andrew Fu2, Jiaming Cui3, Methun Kamruzzaman4, B. Aditya Prakash3, Anil Vullikanti2, 4, Bijaya Adhikari1, Sriram V. Pemmaraju1

1Department of Computer Science, University of Iowa 2Department of Computer Science, University of Virginia 3College of Computing, Georgia Institute of Technology 4Biocomplexity Institute, University of Virginia Email: {hankyu-jang, bijaya-adhikari, sriram-pemmaraju}@uiowa.edu, {af9pn, hkz8wk, vsakumar}@virginia.edu, {jiamingcui1997, badityap}@gatech.edu

Healthcare acquired infections (HAIs) (e.g., Methicillinresistant Staphylococcus aureus infection) have complex transmission pathways, spreading not just via direct personto-person contacts, but also via contaminated surfaces. Prior work in mathematical epidemiology has led to a class of models which we call load sharing models that provide a discrete-time, stochastic formalization of HAI-spread on temporal contact networks. The focus of this paper is the source detection problem for the load sharing model. The source detection problem has been studied extensively in SEIR type models, but this prior work does not apply to load sharing models. We show that a natural formulation of the source detection problem for the load sharing model is computationally hard, even to approximate. We then present two alternate formulations that are much more tractable. The tractability of our problems depends crucially on the submodularity of the expected number of infections as a function of the source set. Prior techniques for showing submodularity, such as the live edge technique are not applicable for the load sharing model and our key technical contribution is to use a more sophisticated coupling technique to show the submodularity result. We propose algorithms for our two problem formulations by extending existing algorithmic results from submodular optimization and combining these with an expectation propagation heuristic for the load sharing model that leads to ordersof-magnitude speedup. We present experimental results on temporal contact networks based on fine-grained EMR data from three different hospitals. Our results on synthetic outbreaks on these networks show that our algorithms outperform baselines by up to 5.97 times. Furthermore, case studies based on hospital outbreaks of Clostridioides difficile infection show that our algorithms identify clinically meaningful sources.

Introduction Healthcare acquired infections (HAIs) such as Methicillinresistant Staphylococcus aureus infection (MRSA) and Clostridioides difficile infection (CDI) pose a significant burden on our healthcare infrastructure (Centers for Disease Control and Prevention 2019). Unlike respiratory infections such as SARS-Co V-2 and Influenza, HAIs such as MRSA

Copyright 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

and CDI have complex transmission pathways (Plipat et al. 2013; Li et al. 2009; Jang et al. 2019) that involve contaminated surfaces in addition to direct person-to-person contacts. As a result, HAIs have been difficult to model, detect, and control. Recent work, e.g., (Plipat et al. 2013; Jang et al. 2019), has shown that standard SEIR type models (Hethcote 2000) are not very suitable for modeling HAI spread; we will describe a load sharing model (Jang et al. 2019) later. Most hospitals have limited testing and when some infections are detected in the hospital, a lot of effort is invested into rapidly identifying the source of infection. This is done so that strategies such as isolation precautions can be imposed in order to limit the spread of the infection. This, of course, corresponds to the classical source detection problem, which has been studied extensively in data mining and network science, e.g., (Prakash, Vreeken, and Faloutsos 2014; Shah and Zaman 2010; Lappas et al. 2010). However, all prior work on source detection problems has been restricted to Susceptible-Exposed-Infected-Recovered (SEIR) type compartmental models, and cannot be easily adapted for HAIs. For instance, the Minimum Description Length (MDL) approach of (Prakash, Vreeken, and Faloutsos 2014) is crucially tied to the structure of SEIR type models. Thus the source detection problem remains open for HAIs, and is the focus of our paper. Motivated by the approach of (Lappas et al. 2010), we use a risk minimization type formulation for the source detection problem on load sharing models for HAIs. We describe this informally below and more formally later. Let G = (G0, G1, . . . , GT 1) be a temporal network, where the network Gt = (Pt, Lt, Et) represents interactions among a set Pt of people (e.g., patients, nurses, physicians) and between people and a set of locations Lt, in time step t. Let Pos be the set of observed positive HAI cases during the time window [0, T 1], and let Neg = (S

t Pt) \ Pos. Let M be an instance of a load sharing model. For a source set S, let Inf M(S) S t Pt denote the (random) subset of people who get infected according to model M due to disease starting at S. For any v S t Pt, let α(v, S) := Prob[v Inf M(S)]. To measure how good the source set S is, at explaining the observations Pos, we define two quantities: g(S) := P v P os α(v, S), the expected number of infections according to M, that are also observed, and f(S) := P

v Neg α(v, S), the expected number of infec-

The Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI-23)

tions according to M, that are not observed. The overall goal of this paper is to solve the following (informally stated) SOURCEDETECTION problem.

SOURCEDETECTION (SD) (informal) Given a temporal network G = (G0, G1, . . . , GT 1), a load sharing model M, and a set of observed HAI cases Pos, find a source set S that makes g(S) large while keeping f(S) small.

The main contributions of our paper are as follows: Problem Formulations: We show that a natural formulation of the SOURCEDETECTION problem for the load sharing model is computationally very hard, even to approximate. We then present two natural, alternate formulations (SD KNAP and SD RATIO) that are much more tractable, both in a theoretical and practical sense.

Submodularity result: The tractability of our formulations depends crucially on a submodularity result that we show. We show that for the load sharing model, the expected number of infections that are also observed and the expected number of infections that are not observed, are both submodular functions of the source set. These functions are analogous to the expected number of infections in the Independent Cascade model (Kempe, Kleinberg, and Tardos 2003), which has been shown to be submodular, using the live edge technique. However, this technique does not work for our load sharing model, since there is no live edges interpretation of infection flow in the load sharing model. Instead, we need to use a more sophisticated coupling technique, motivated by (Mossel and Roch 2007). As far as we know, this is the first submodularity result for a disease-spread model involving the transfer and sharing of pathogen loads.

Scalable implementations with strong worst-case guarantees: We use the submodularity of f( ) and g( ) to design multi-criteria approximation algorithms for the SD KNAP and SD RATIO problems. The worst case approximation guarantees we obtain depend on a notion of curvature of f( ) and g( ), but for our problem settings the function curvatures are such that we obtain strong guarantees. We significantly improve the running time of our algorithms for the SD KNAP and SD RATIO problems using a heuristic that we call truncated expectation propagation. This heuristic allows us to shortcut expensive simulations of the load sharing model, and leads to orders-of-magnitude speedup.

Experimental Results: We evaluate our algorithms on real-world contact network datasets from three hospitals. Our experiments on synthetic outbreaks on real hospital contact data show that our approaches significantly outperforms baselines by up to 6 times. Furthermore, we demonstrate that our approaches identify clinically meaningful sources on an actual in-hospital CDI outbreak. Due to the space limit, all proofs, additional details and results appear in the extended version (Jang et al. 2023).

Background Load Sharing Model Traditional compartmental models for disease-spread via person-to-person contact (e.g., SI, SIS, SIR, and SEIR) (Hethcote 2000) have a long history, dating back to the early 20th century. However, these have been found to be inadequate for modeling transmission of diseases which need to take the environment into account (Li et al. 2009; Tien and Earn 2010; Plipat et al. 2013; Wang and Ruan 2017; Kraay et al. 2018; Jang et al. 2019), such as HAIs. We now formally describe a load sharing model that was proposed for MRSA transmission in (Plipat et al. 2013; Jang et al. 2019). For any y Pt Lt, let Ly(t) denote the HAI pathogen load at node y at time t. Load dynamics can then be described by the following stochastic recurrence:

Ly(t + 1) = (1 d)Ly(t) X

x:{x,y} Et ρy,x Ly(t)

x:{x,y} Et ρx,y Lx(t) + Iinf q (1)

Here d (0, 1) is a die-off parameter, and the term (1 d)Ly(t) denotes the pathogen load remaining after die-off at time t + 1. The next two terms represent pathogen transfer. For each time-t edge (interaction) {x, y} that y participates in, a fraction ρy,x (0, 1) of load Ly(t) is transferred from y to x and a fraction ρx,y (0, 1) of load Lx(t) is transferred from x to y. In (Plipat et al. 2013; Jang et al. 2019) the transfer parameter ρx,y depends on the contact area of touch interaction, the total area of entity x, and the transfer efficiency between the two touching surfaces. The last term denotes the (stochastic) shedding of pathogen load due to infection. The pathogen load Ly(t) at person node y Pt at time t stochastically determines if y becomes infected at time t + 1. If y is infected at time t + 1 then, for some shedding parameter q > 0, y sheds q units of pathogen, i.e., q units are added to y s pathogen load. Whether y becomes infected is determined by a dose-response function p : R+ [0, 1]. (See (Brouwer et al. 2017) for a systematic study of dose response functions for infectious diseases.) Thus, a person node y becomes infected at time t + 1 with probability p(Ly(t)). The quantity Iinf appearing in the last term is the indicator random variable indicating if y is infected in time t + 1. Thus, Prob[Iinf = 1] = p(Ly(t)) if y Pt and 0 otherwise (i.e., if y is a location). For ease of exposition we assume the same n nodes are present in all T times stamps1. Then, this recurrence can be compactly rewritten as a stochastic matrix recurrence

L(t + 1) = (B(t) + D(t)) L(t) + q I(t). (2)

Here L(t) is a length-n vector representing loads at the nodes at time t. B(t) is an n n matrix with entry [x, y] equal to ρx,y if {x, y} is an edge in Gt and 0 otherwise. D(t) is an n n diagonal matrix with non-zero entries [y, y] equal

1In reality and in our datasets, patients are admitted and discharged and healthcare professionals may also change. Thus the number of nodes may change from one time step to the next.

to (1 d P x ρy,x); in this expression the sum P x is over all neighbors x of y in Gt. We restrict the model parameters so that 0 (d + P x ρy,x) 1, thus ensuring that the outgoing load does not cause load at node to become negative. Finally, I(t) is a length-n random indicator variable vector with entry [y] equal to 1 with probability p(Ly(t)) if y Pt and 0 otherwise. This models the fact that only people (not locations) can become infected and subsequently shed. In (Jang et al. 2019), an exponential dose response function (p(z) = 1 e πz) and a truncated linear dose response function (p(z) = min(πz, 1)) are evaluated for an infectivity parameter π > 0. See Fig 1 and Table 1 in (Jang et al. 2023) for an illustration of the load sharing model, and symbols and notation we use in this paper, respectively.

Problem Formulations Intractability of a Natural Formulation

We start by presenting the hardness of a natural formulation of the source detection problem. Inspired by the k-effectors problem in (Lappas et al. 2010) and the Positive-Negative Partial Set Cover problem (denoted PSC) defined in (Miettinen 2008), we define the SD PSC problem as follows.

SD PSC Given a temporal network G = (G0, G1, . . . , GT 1), a load sharing model M, 0 τ1 < τ2 < T, and an observed (positive) set Pos t [τ2,T 1]Pt, find a source set S t [0,τ1]Pt that minimizes X

v P os (1 α(v, S)) + X

v Neg α(v, S). (3)

The first term in the objective function is the expected number of observed positive cases not infected by an infection starting at source set S and the second term is the expected number of negative cases infected, by an infection starting at source set S. While this objective function is a simple and natural model for the SOURCEDETECTIONproblem, we prove the following hardness of approximation result that shows that no reasonable approximation exists for the problem.

Theorem 1. For any ε > 0, the SD PSC problem does not have an α-approximation for any α = O(2log1 ε n4), where n is the number of nodes in G0, unless NP DTIME(npolylog(n)). This is true even for an instance of the SD PSC problem with 3 time stamps.

In (Lappas et al. 2010) it is claimed that the k-effectors problem does not have a β-approximation for any β > 0 for the independent cascade (and similar) models. The proof of this claim and the NP-hardness claim it depends on seem incorrect (see the extended version for details) and so we use a different reduction to prove Theorem 1.

Two Tractable Problem Formulations

In light of the hardness result in the previous section, we present two formulations of the informally stated

SOURCEDETECTION problem that can be viewed as computationally tractable surrogates for the problem. The tractability of these formulations relies crucially on the submodularity of the functions f and g, which is shown in Theorem 2. In the SD KNAP problem defined below, instead of including both the positive and negative set of observations in the objective function (as in SD PSC), we impose constraints that bound the number of people outside the set of observed cases that are reached by the infection starting at the source set. Let Post Pt denote the set of observed cases at time t and let Negt = Pt \ Post. Then let ft(S) := P v Negt α(v, S) denote the expected number of infections in the negative set at time t.

SD KNAP Given temporal graph G = (G0, G1, . . . , GT 1), where Gt = (Pt, Lt, Et), integers τ1, τ2, 0 τ1 < τ2 < T, an observed set Post Pt of cases, positive reals kt, for each t [τ2, T 1], find S = arg maxs g(S) such that S satisfies the constraints ft(S) kt for all t [τ2, T 1].

We next define the SD RATIO problem. The goal of SD RATIO is to find a source set S that maximizes the ratio of g(S) (the expected number of infections in Pos) to P t γt ft(S), a linear combination of the expected number infections in Negt for values of t in the observation period [τ2, T 1]. The coefficients γt can be viewed as penalties, which could vary with time the penalty for getting it wrong in later time steps could be smaller than the penalty of getting it wrong in earlier time steps. This problem is similar in spirit to the problems of maximizing the difference g(S) f(S). However, as pointed out by Bai et al. (Bai et al. 2016) there is an important difference in the approximability of the ratio problem and the difference problem for submodular functions. While the ratio problem has algorithms with bounded approximation ratio for submodular functions, the difference version does not. This motivates the use of the SD RATIO problem for source detection.

SD RATIO Given temporal graph G = (G0, G1, . . . , GT 1), where Gt = (Pt, Lt, Et), integers τ1, τ2, 0 τ1 < τ2 < T, an observed set Post Pt of cases and positive reals γt, for each t [τ2, T 1], find

S = arg max s g(S)

t=τ2 γt ft(S) (4)

Methods Submodularity of Expected Infections In this section we show that the function g(S) denoting the expected number of infections in the positive set is a monotone, submodular set function for any load sharing model that uses a concave dose response function. This result holds for the functions f(S) and ft(S) as well. A concave dose response function models diminishing response to marginal increase in pathogen load. It is easy to see that functional

forms such as exponential and linear, mentioned earlier, are concave functions. A key aspect of our proof consists of showing that if loads at nodes are monotone, submodular functions of the source set and the dose response function is concave, then g(S) is submodular. Showing that the loads are submodular needs new ideas because the live edge technique used in the classical results of (Kempe, Kleinberg, and Tardos 2003) for the independent cascade model and linear threshold model does not seem to apply here. In particular, there seems to be no apriori setting of the random bits that fixes node-infectivity and shedding, thereby yielding a deterministic version of node loads that can be easily shown to be submodular. To get around this obstacle, we base our proof on the coupling technique of (Mossel and Roch 2007). To show the submodularity of loads using the diminishing returns definition of submodularity, we need to consider 4 source sets S, S +v, Q, and Q+v, where S Q and v Q. The key idea we use is the coupling of the stochastic decisions made by the 4 disease-spread processes starting from each of these source sets. With this 4-way coupling in place and using induction over time, we are able to show the submodularity of loads. The full proof is in the extended version (Jang et al. 2023).

Theorem 2. Let τ1, τ2 be integers satisfying 0 τ1 < τ2 < T. For any concave function p : R+ [0, 1], and any load sharing model M using dose response function p, the functions g(S), f(S), and ft(S) for t [τ2, T 1] are all monotone and submodular.

Algorithms for SD KNAP

Since g( ) and ft( ) have been shown to be submodular, it follows that SD KNAP is a problem of maximizing a submodular function subject to multiple submodular knapsack constraints. Even a special case of this problem, with a single cardinality constraint, is well known to be NP-complete (Feige 1998). So we seek an approximation algorithm for SD KNAP that provides strong worst-case guarantees, but is also practical and scalable. For this purpose we adapt the gradient ascent type framework of (Iyer and Bilmes 2013b,a), which has been proposed for the problem of maximizing a submodular function subject to a single submodular knapsack constraint. When used for SD KNAP, each ascent step of the algorithm requires the solution of a simpler submodular function maximization problem with multiple linear constraints. The result of Bilmes and Iyer only works for a single constraint, for which they are able to use a natural greedy algorithm for the ascent step; this does not provide approximation guarantees for the problem with multiple constraints. Multilinear relaxation plus rounding (Kulik, Shachnai, and Tamir 2009) is a well known approach to solving the submodular maximization problem with multiple linear constraints. However, while this approach is theoretically elegant, it does not scale to our input sizes. So we use a simple approach based on the multiplicative update method that has been shown to provide a O(1)-factor approximation to the problem in (Azar and Gamzu 2012). Algorithm 1 shows a high level description of our algorithm MUKnapsack SD for solving SD KNAP. Given a

Algorithm 1: MUKnapsack SD Input: G, τ1, τ2, Pos, and kt for each t [τ2, T 1] Output: A seed set S

1: S 2: while S has not converged do 3: Compute a linear function ˆft defined as

ˆft(Y ) := ft(S) X

j S\Y ft(j|S \ j) + X

j Y \S ft(j| )

4: S Multiplicative Update(g, ˆft, t [τ2, T 1]) 5: end while 6: return S

current solution S, we compute in Line 3, a modular upper bound ˆft of each submodular function ft that is guaranteed to be tight at S. In Line 4, we use the multiplicative update algorithm to solve the problem of maximizing the submodular function g(X) subject to linear constraints ˆft(X) kt. Below we show that MUKnapsack SD yields the following multicriteria approximation guarantee, where Kft := max{|X| : ft(X) kt} and κft is the curvature of the set function ft : 2V R defined as κft := 1 minv V ft(v|V \v)

ft(v) . W is the width of the packing con-

straints defined as W := min{kt/ ˆft(v) : ˆft(v) > 0} for t [τ2, T 1] and v X. Theorem 3. Algorithm MUKnapsack SD returns a solution SA such that g(SA) 1 2(e(T τ2)1/W +1) g( ˆS). Here, ˆS is an optimal solution to the problem of maximizing g(S) subject to constraints

ft(S) nt 1 + (Kft 1)(1 κft)

Kft for all t [τ2, T 1]. Our observations of positive cases typically come from a small time window [τ2, T 1]. For example, in our experiments this time window has size 2. Also, we observe W in [0.01, 21.67] from our experiments (see details in the extended version). Thus, for all practical purposes, the fraction 1/2(e(T τ2)1/W + 1) in the approximation ratio is a constant. In our setting, kt tends to be fairly small because we don t want to allow the infection of too many negative cases. As a result, Kft will also tend to be quite small. Since the fraction 1+(Kft 1)(1 κft)

Kft is bounded below by 1/Kft, it is reasonable to expect the constraint violation in the multicriteria approximation is only a constant factor. Additionally, we also implemented an algorithm called Greedy Knapsack SD, where we use a greedy heuristic instead of multiplicative update, for the ascent step. See the extended version (Jang et al. 2023) for further details.

Algorithms for SD RATIO We use the simple, greedy algorithm from (Bai et al. 2016) as solution to SD RATIO. For the problem of maximiz-

ing the ratio g(S)/f(S), it is shown in (Bai et al. 2016) that their algorithm yields an approximation ratio of 1 1/e1 κf . For the SD RATIO, this immediately translates to an approximation ratio of 1 1/e1 κf , where f (S) = PT 1 t=τ2 γtft(S). While this approximation factor may be tight in the worst case, in the following we show that it can be restated in terms of the curvature of f at the optimal solution. This quantity can be in general much smaller than κf and can therefore lead to a much better approximation ratio. For a set function f : 2V R and X V define the curvature of f at X, denoted κf(X), as

κf(X) := 1 min v X f(v | X \ v)

In our setting, it is reasonable to expect f(v | V \ v) to be near-0 for some v V . This is because f(v | V \ v) represents the marginal increase in the expected number of infected when the set of sources contains all but one nodes and we add that last node as a source. On the other hand, for most HAI applications we are interested in, S will be small and f(v | S \ v) may be relatively large for all v S . As a result, we expect that in practice κf (S ) κf . We use the name Greedy Ratio SD to denote the greedy algorithm that we implemented for SD RATIO problem. Theorem 4. The algorithm Greedy Ratio SD yields a (1 1/e1 κf (S ))-approximation for solving SD RATIO, where S is an optimal solution. It is possible for the g(S)/f(S) ratio to be very high, even when g(S) is very small. Thus, Greedy Ratio SD may return a solution S with very small value of g(S), which may be considered unsatisfactory for the SD RATIO problem. So we have also implemented a variant of Greedy Ratio SD, called Cover Ratio SD, that is forced to return a solution S for which g(S) is at least a prescribed fraction of the number of observed cases. See the extended version (Jang et al. 2023) for more details.

Truncated Expected Load Propagation A significant bottleneck in the running time of all our algorithms is the fact that the functions f and g are stochastic and obtaining good estimates requires a substantial number of simulations. We use a simple expected load propogation heuristic, described below, that allows us to shortcut costly simulations, while preserving solution quality. Given loads at all nodes at time t, the expected load E[Ly(t + 1)] at node y at time t + 1 can be written as

(1 d)Ly(t)+ X

x (ρy,x Lx(t) ρx,y Ly(t)) + q p(Ly(t)) .

Pushing the expectations through the right hand side and approximating the value of the dose response function p(Ly(t)) by p(E[Ly(t)]), we get a deterministic recurrence

E[Ly(t + 1)] = (1 d)E[Ly(t)] + X

x (ρy,x E[Lx(t)]

x ρx,y E[Ly(t)]) + q p(E[Ly(t)])

We use this recurrence to propagate expected loads from time stamp 0 to time stamp τ2 1 and then run simulations only for time stamps in [τ2, T] in order to find the set Inf([τ2, T]|S, M). Due to a somewhat subtle issue with this heuristic, we implement a truncated version of it, described in the extended version (Jang et al. 2023). As shown in Figure 3, this enables 17 speedup of our algorithms, without a noticeable degradation in quality of solution.

Experiments

We design extensive experiments to compare and contrast the performance of our algorithms against baselines in a variety of outbreak scenarios. Our code and public data is available for academic purposes 2. Baselines: Despite there not being a direct competitor, we compare performance of our algorithms against a wide range of methods. Cu LT (Rozenshtein et al. 2016) is the stateof-the-art cascade reconstruction approach that uses a directed Steiner-tree algorithm to span the observed infection set Pos. Similarly, Net Sleuth (Prakash, Vreeken, and Faloutsos 2012) is an MDL-based approach which returns a set of source nodes that best describe the observations. Path Finder simply selects top-k candidate source nodes with the highest number of reachable nodes in Pos. Finally, LOS (Length of Stay) selects patients from the first few time-stamps as sources, who remain longest in the hospital. Data: We run our methods and the baselines in real and simulated HAI outbreaks on a number of datasets. Each dataset comprises 31 daily snapshots of interactions between healthcare workers (HCWs), patients, and locations. Let |V | and |E| denote the average number of nodes and edges, respectively, across snapshots. UIHC (|V |=10.4K, |E|=13.8K) consists of interactions within the University of Iowa Hospitals and Clinics. The interactions were reconstructed from HCW logins and patient admission-dischargetransfer records. UIHCUNIT (|V |=789, |E|=526) is a subset of UIHC: the unit with the history of highest number of CDI cases. UVAPRECOVID (|V |=2.4K, |E|=430) and UVAPOSTCOVID (|V |=0.9K, |E|=396) consist of interactions recorded in the Cardiology department of the University of Virginia Hospital in March 2011 and January 2020 (shortly after the COVID-19 pandemic), respectively. Finally, CARILION (Jim enez, Lewis, and Eubank 2012) (|V |=2.3K |E|=30K) consists of interactions generated from the mobility log in Carilion Hospital in Roanoke, VA. From 72K unique locations and 89K unique individuals, we extract a densely connected subgraph from the largest connected component of the dynamic interaction network.

Quality of the Detected Sources

Our goal here is to quantify the goodness of the sources inferred by our approaches and the baselines. Although we have real HAI outbreaks in some of our datasets, true ground truth sources are unavailable. Hence, in these sets of experiments we rely on simulated outbreaks. We run a

2https://github.com/Hankyu Jang/Detecting-Sources-of Healthcare-Associated-Infections

Figure 1: The performance (using F1-score) of our approaches Knapsack SD (KSD) and Ratio SD (RSD) and the baselines Cu LT, Net Sleuth (NS), Path Finder (PF), and LOS is shown. See the extended version for MCC results. |S+| = 2 is shown in the top row and |S+| = 6 in the bottom row. Our approaches perform consistently well across different settings.

well-calibrated version of the load sharing model (see Section Background ) with an arbitrary set S+ of sources selected from among nodes appearing in time window [0, τ1]. From this run we obtain the sets of infections Pos and noninfections Neg for the time window [τ2, T 1]. Each method m is given sets Pos and Neg and returns a source set Sm. A straightforward metric to measure success is the intersection between Sm and S+. However, it is in general impossible for any algorithm to do well with respect to this metric because the ground truth source set S+ may be quite poor in explaining the observed cases and nonobserved cases relative to other source sets. In fact, we see this in our experiments, where we consistently discover source sets that have much higher probability of leading to observed cases and avoiding non-observed cases than the ground truth source set. So we use two other natural metrics to measure performance. First, we propose to measure the overlap between the ground truth Pos and Neg sets and the sets of infections Posm and non-infections Negm caused by an outbreak starting from the source set Sm selected by method m. Second, we use the distance between S+ and Sm as a metric of success. We present results from this second metric in (Jang et al. 2023). We run 100 simulations starting from Sm for each method m to obtain Posi m and Negi m for 1 i 100. From these, we compute the averaged true positive ATPm, true negative ATNm, false positive AFPm, and false negative AFNm for each method m. Finally, we compute average F1-Score and average MCC score (see the extended version for precise score definitions). Our experiments on all the datasets set |S+| to 2 and 6. We set τ1 = 1 (source set comes from the first two time steps) and τ2 = 29 (observations come from last two time steps). In all settings, the calibrated simulations yield about 5-10% infection in the last snapshot, as is the case for a typical HAI outbreak (Jarvis et al. 2007; Clabots et al. 1992). Our results are summarized in Figure 1. In the rest of the paper and in the figures, we use Knapsack SD to denote our better-performing algorithm for SD KNAP (we implemented two) and Ratio SD to denote our better-performing algorithm for SD RATIO (we implemented two). The solid horizontal line represents performance of the ground truth seed set S+ and the dotted hori-

zontal line represents performance of a random seed set averaged over 30 times. These two lines represent the empirical hardness of the problem instance; a poor performance of ground truth indicates that the problem instance is hard and conversely a good performance of random source sets represent easier problem instance. The primary take-away from the results is that both of our approaches Knapsack SD and Ratio SD consistently outperform all baselines regardless of instance hardness and performance metric. One can observe that Cu LT and Net Sleuth are inconsistent. We hypothesize that in settings where load transfer along multiple pathways plays role in infections, their performance drop as they are unable to model this phenomenon. Performance of other two baselines LOS and Path Finder are also similar. Additional results are presented in (Jang et al. 2023).

Case Study on a CDI Outbreak

We perform a case study in UIHCUNIT to qualitatively explore the sources detected by our approaches in an actual CDI outbreak. In our UIHCUNIT dataset, there were a total of five CDI positive tests. One positive test (with anonymized ID 214) was recorded on day 1, whereas the remaining 4 were on day 25. We make the 4 observed positive cases on day 25 available to both of our methods and ask them to infer sources in days 0 and 1. The sources returned by our approaches did not include the infection observed in time 1. However, they included patients with (anonymized) IDs 753 and 409 (see Fig 2). Digging deeper into these patients medical history, we find that both patients had high co-morbidity scores at admission time. Patient 753 s visit was later marked as a high severity admission by the Agency of Health Research and Quality (AHRQ). We also find that patient 409 was transferred to UIHC from an external acutecare hospital with inpatient facilities. In other words, both patients identified as sources had several of the known risk factors for CDI. To check the quality of patients 409 and 753 as potential sources, we simulated two sets of 1000 outbreaks each with source set {409, 753} and source set {214}. We observe that simulations starting from seeds detected from our methods (409 and 753) are much better at hitting observed infections on day 25: 807 times for {409, 753} versus 373

Figure 2: An illustration of one simulation for the CDI outbreak in UIHCUNIT. The sources (blue nodes) {409, 753} identified by our algorithm were able to infect 3 out of 4 observed infections (red nodes) on day 25. Additionally, in this simulation there were only 3 spurious infections (green nodes) out of a total 43 patients not observed to be cases.

times for {214}. Similarly, simulations starting from with source set {409, 753} hit unobserved infections a total of 2228 times versus 2765 times by simulations with source set {214}. Figure 2 illustrates one simulation with source set {409, 753}. Our conclusion is that patients 409 and 753 are clinically meaningful potential sources.

Speed-Up Due to Expected Load Propagation A big computational bottleneck for our algorithms is the number of simulations required to get good estimates of the stochastic functions g(S) and f(S). The truncated expectation propagation heuristic described earlier allowed us to shortcut costly simulations and was a key factor in our being able run all our experiments. Next we compute the speedups due to the lazy evaluation (Minoux 1978) and expected load propagation on UIHCUNIT. To show the kind of speedups this heuristic leads to, we demonstrate some experiments on UIHCUNIT. We start by running the vanilla version of Knapsack SD, its lazy greedy version, and a version augmented with both lazy greedy and truncated expected load propagation. The results are presented in Figure 3 As we can see lazy evaluation leads to the 1.09 speedup while lazy + expected load prorogation leads to 17.2 speedup. For Ratio SD, we compare the performance of the vanilla version and the one augmented with truncated expected load propagation. Note that Ratio SD does not have a lazy version. Here the expected load propagation version results in 5.6 speed up. These results highlight the importance of expected load propagation on speeding up our algorithms. While lazy greedy provides non-trivial speed-up, it is the truncated expected load propagation heuristic in addition to the lazy greedy that leads to an enormous speed-up.

Related Works Data Mining for Hospital Acquired Infections: Several recent works have leveraged data mining and machine learning techniques for HAI related problems. These include outbreak detection (Adhikari et al. 2019), case predictions (Li

Figure 3: Running time and F1 score for various versions of Knapsack SD and Ratio SD on UIHCUNIT. We achieved dramatic speedups by employing expected load propagation and lazy evaluations, with no sacrifice on the performance.

et al. 2019; Brodzicki et al. 2020), and inferring latent cases (Makar, Guttag, and Wiens 2018; Jang et al. 2021, 2022). Jang et al. used agent based simulation to determine the effect of architectural changes on methicin-resistant Staphyloccus aureus (MRSA) outbreaks (Jang et al. 2019). Source Detection: Detecting patient-zero in epidemics (Lappas et al. 2010; Shah and Zaman 2011; Jiang et al. 2016) has been well studied for the Independent Cascade (IC) and the Susceptible-Infected (SI) models. Several past works employ maximum likelihood estimation (Shah and Zaman 2011, 2012),Minimum Description Length (Prakash, Vreeken, and Faloutsos 2012), entropy (Zhang et al. 2021). Other commonly used approaches include label propagation (Wang et al. 2017) and monte carlo algorithms (Agaskar and Lu 2013) and divide and conquer using a reverse model (Zang et al. 2015). Dynamics over Temporal Networks: Dynamical Processes over temporal networks have been well studied including in information diffusion (Tong et al. 2016), epidemiology (Volz and Meyers 2007), and malwares (Wen et al. 2012). Epidemic threshold on temporal networks are also well studied (Prakash et al. 2010; Leitch, Alexander, and Sengupta 2019; Valdano et al. 2015). Subsequently, dynamics over temporal networks have been used for many applications including viral marketing (Zhuang et al. 2013), community detection (Sattari and Zamanifar 2018), network compression (Adhikari et al. 2017), and so on.

We consider the well-known source detection problem, but for a new and fundamentally different disease-spread model called the load sharing model. We showed that a natural formulation of the problem is intractable, but present two tractable formulations. The tractability of these formulations critically depends on the submodularity of the expected number of infections as a function of the source set. We were able to show submodularity despite not being able to use standard techniques such as the live edge technique. We design scalable algorithms that leverage submodularity and speed these up significantly by using a novel heuristic. Extensive experiments on real and simulated outbreaks on three different hospital contact networks demonstrate significant advantages of our approach over the baselines.

Acknowledgements

This work was partially supported by the NSF (Expeditions CCF-1918770 and CCF-1918656, CAREER IIS-2028586, RAPID IIS-2027862, Medium IIS-1955883, Medium IIS1955939, Medium IIS-2106961, PIPP CCF-2200269, IIS1955797), NIH 2R01GM109718-07, CDC MIn D Healthcare network cooperative agreements U01-CK000594 and U01CK000589, and the associated COVID-19 supplemental funding, startup from the University of Iowa, faculty research award from Facebook and funds/computing resources from Georgia Tech. The University of Iowa authors acknowledge feedback received from members of the Computational Epidemiology research group at the University of Iowa. The authors also thank anonymous reviewers whose feedback improved the quality of the paper.

Adhikari, B.; Lewis, B.; Vullikanti, A.; Jim enez, J. M.; and Prakash, B. A. 2019. Fast and near-optimal monitoring for healthcare acquired infection outbreaks. PLo S computational biology, 15(9): e1007284. Adhikari, B.; Zhang, Y.; Bharadwaj, A.; and Prakash, B. A. 2017. Condensing temporal networks using propagation. In SDM 2017, 417 425. SIAM. Agaskar, A.; and Lu, Y. M. 2013. A fast Monte Carlo algorithm for source localization on graphs. In Wavelets and Sparsity XV, volume 8858, 88581N. SPIE. Azar, Y.; and Gamzu, I. 2012. Efficient Submodular Function Maximization under Linear Packing Constraints. In ICALP, 38 50. Springer. Bai, W.; Iyer, R.; Wei, K.; and Bilmes, J. 2016. Algorithms for optimizing the ratio of submodular functions. In ICML, 2751 2759. PMLR. Brodzicki, A.; Jaworek-Korjakowska, J.; Kleczek, P.; Garland, M.; and Bogyo, M. 2020. Pre-trained deep convolutional neural network for clostridioides difficile bacteria cytotoxicity classification based on fluorescence images. Sensors, 20(23): 6713. Brouwer, A. F.; Weir, M. H.; Eisenberg, M. C.; Meza, R.; and Eisenberg, J. N. S. 2017. Dose-response relationships for environmentally mediated infectious disease transmission models. PLOS Comp Bio, 13(4): 1 28. Centers for Disease Control and Prevention. 2019. Antibiotic Resistance Threats in the United States. https: //www.cdc.gov/drugresistance/biggest-threats.html. Accessed: 2023-03-21. Clabots, C. R.; Johnson, S.; Olson, M. M.; Peterson, L. R.; and Gerding, D. N. 1992. Acquisition of Clostridium difficile by hospitalized patients: evidence for colonized new admissions as a source of infection. Journal of infectious diseases, 166(3): 561 567. Feige, U. 1998. A Threshold of Ln n for Approximating Set Cover. Journal of the ACM (JACM), 45(4): 634 652. Hethcote, H. W. 2000. The mathematics of infectious diseases. SIAM review, 42(4): 599 653.

Iyer, R.; and Bilmes, J. 2013a. Submodular Optimization with Submodular Cover and Submodular Knapsack Constraints. ar Xiv preprint ar Xiv:1311.2106. Iyer, R. K.; and Bilmes, J. A. 2013b. Submodular optimization with submodular cover and submodular knapsack constraints. Neur IPS, 26. Jang, H.; Fu, A.; Cui, J.; Kamruzzaman, M.; Prakash, B. A.; Vullikanti, A.; Adhikari, B.; and Pemmaraju, S. V. 2023. Detecting Sources of Healthcare Associated Infections. Extended arxiv version. Jang, H.; Justice, S.; Polgreen, P. M.; Segre, A. M.; Sewell, D. K.; and Pemmaraju, S. V. 2019. Evaluating architectural changes to alter pathogen dynamics in a dialysis unit. In ASONAM, 961 968. IEEE. Jang, H.; Pai, S.; Adhikari, B.; and Pemmaraju, S. V. 2021. Risk-aware temporal cascade reconstruction to detect asymptomatic cases. In ICDM, 240 249. IEEE. Jang, H.; Pai, S.; Adhikari, B.; and Pemmaraju, S. V. 2022. Risk-aware temporal cascade reconstruction to detect asymptomatic cases. Knowledge and Information Systems, 64(12): 3373 3399. Jarvis, W. R.; Schlosser, J.; Chinn, R. Y.; Tweeten, S.; and Jackson, M. 2007. National prevalence of methicillinresistant Staphylococcus aureus in inpatients at US health care facilities, 2006. American journal of infection control, 35(10): 631 637. Jiang, J.; Wen, S.; Yu, S.; Xiang, Y.; and Zhou, W. 2016. Identifying propagation sources in networks: State-of-theart and comparative studies. IEEE Communications Surveys & Tutorials, 19(1): 465 481. Jim enez, J. M.; Lewis, B.; and Eubank, S. 2012. Hospitals as complex social systems: Agent-based simulations of hospital-acquired infections. In International Conference on Complex Sciences, 165 178. Springer.

Kempe, D.; Kleinberg, J.; and Tardos, E. 2003. Maximizing the spread of influence through a social network. In KDD, 137 146. Kraay, A. N.; Hayashi, M. A.; Hernandez-Ceron, N.; Spicknall, I. H.; Eisenberg, M. C.; Meza, R.; and Eisenberg, J. N. 2018. Fomite-mediated transmission as a sufficient pathway: a comparative analysis across three viral pathogens. BMC infectious diseases, 18(1): 1 13. Kulik, A.; Shachnai, H.; and Tamir, T. 2009. Maximizing submodular set functions subject to multiple linear constraints. In Proceedings of the twentieth annual ACM-SIAM symposium on Discrete algorithms, 545 554. SIAM. Lappas, T.; Terzi, E.; Gunopulos, D.; and Mannila, H. 2010. Finding effectors in social networks. In KDD, 1059 1068. Leitch, J.; Alexander, K. A.; and Sengupta, S. 2019. Toward epidemic thresholds on temporal networks: a review and open questions. Applied Network Science, 4: 1 21. Li, B. Y.; Oh, J.; Young, V. B.; Rao, K.; and Wiens, J. 2019. Using machine learning and the electronic health record to predict complicated Clostridium difficile infection. In OFID, volume 6, ofz186. Oxford University Press US.

Li, S.; Eisenberg, J. N.; Spicknall, I. H.; and Koopman, J. S. 2009. Dynamics and control of infections transmitted from person to person through the environment. American journal of epidemiology, 170(2): 257 265. Makar, M.; Guttag, J.; and Wiens, J. 2018. Learning the probability of activation in the presence of latent spreaders. In AAAI, volume 32. Miettinen, P. 2008. On the Positive Negative Partial Set Cover Problem. Inf. Process. Lett., 108(4): 219 221. Minoux, M. 1978. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization techniques, 234 243. Springer. Mossel, E.; and Roch, S. 2007. On the submodularity of influence in social networks. In STOC, 128 134. Plipat, N.; Spicknall, I. H.; Koopman, J. S.; and Eisenberg, J. N. 2013. The dynamics of methicillin-resistant Staphylococcus aureus exposure in a hospital model and the potential for environmental intervention. BMC infectious diseases, 13(1): 595. Prakash, B. A.; Tong, H.; Valler, N.; Faloutsos, M.; and Faloutsos, C. 2010. Virus propagation on time-varying networks: Theory and immunization algorithms. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 99 114. Springer. Prakash, B. A.; Vreeken, J.; and Faloutsos, C. 2012. Spotting culprits in epidemics: How many and which ones? In ICDM, 11 20. IEEE. Prakash, B. A.; Vreeken, J.; and Faloutsos, C. 2014. Efficiently spotting the starting points of an epidemic in a large graph. Knowledge and information systems, 38: 35 59. Rozenshtein, P.; Gionis, A.; Prakash, B. A.; and Vreeken, J. 2016. Reconstructing an epidemic over time. In KDD, 1835 1844. Sattari, M.; and Zamanifar, K. 2018. A spreading activationbased label propagation algorithm for overlapping community detection in dynamic social networks. Data & Knowledge Engineering, 113: 155 170. Shah, D.; and Zaman, T. 2010. Detecting sources of computer viruses in networks: theory and experiment. In Proceedings of the ACM SIGMETRICS international conference on Measurement and modeling of computer systems, 203 214. Shah, D.; and Zaman, T. 2011. Rumors in a network: Who s the culprit? IEEE Transactions on information theory, 57(8): 5163 5181. Shah, D.; and Zaman, T. 2012. Rumor centrality: a universal source detector. In Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference on Measurement and Modeling of Computer Systems, 199 210. Tien, J. H.; and Earn, D. J. 2010. Multiple transmission pathways and disease dynamics in a waterborne pathogen model. Bulletin of mathematical biology, 72(6): 1506 1533. Tong, G.; Wu, W.; Tang, S.; and Du, D.-Z. 2016. Adaptive influence maximization in dynamic social networks. IEEE/ACM Transactions on Networking, 25(1): 112 125.

Valdano, E.; Ferreri, L.; Poletto, C.; and Colizza, V. 2015. Analytical computation of the epidemic threshold on temporal networks. Physical Review X, 5(2): 021005. Volz, E.; and Meyers, L. A. 2007. Susceptible infected recovered epidemics in dynamic contact networks. Proceedings of the Royal Society B: Biological Sciences, 274(1628): 2925 2934. Wang, L.; and Ruan, S. 2017. Modeling nosocomial infections of methicillin-resistant Staphylococcus aureus with environment contamination. Scientific reports, 7(1): 1 12. Wang, Z.; Wang, C.; Pei, J.; and Ye, X. 2017. Multiple source detection without knowing the underlying propagation model. In AAAI, volume 31. Wen, S.; Zhou, W.; Zhang, J.; Xiang, Y.; Zhou, W.; and Jia, W. 2012. Modeling propagation dynamics of social network worms. IEEE Transactions on Parallel and Distributed Systems, 24(8): 1633 1643. Zang, W.; Zhang, P.; Zhou, C.; and Guo, L. 2015. Locating multiple sources in social networks under the SIR model: A divide-and-conquer approach. Journal of Computational Science, 10: 278 287. Zhang, C.; Guo, Q.; Fu, L.; Gan, X.; and Wang, X. 2021. ITE: A Structural Entropy Based Approach for Source Detection. In INFOCOM, 1 10. IEEE. Zhuang, H.; Sun, Y.; Tang, J.; Zhang, J.; and Sun, X. 2013. Influence maximization in dynamic social networks. In ICDM, 1313 1318. IEEE.