# diffusion_network_inference_from_partial_observations__f19f1730.pdf

Diffusion Network Inference from Partial Observations

Ting Gan1, Keqi Han1, Hao Huang1, Shi Ying1, Yunjun Gao2, Zongpeng Li1

1School of Computer Science, Wuhan University, China 2College of Computer Science and Technology, Zhejiang University, China {ganting, hankeqi, haohuang, yingshi, zongpeng}@whu.edu.cn, gaoyj@zju.edu.cn

To infer the structure of a diffusion network from observed diffusion results, existing approaches customarily assume that observed data are complete and contain the ﬁnal infection status of each node, as well as precise timestamps of node infections. Due to high cost and uncertainties in the monitoring of node infections, exact timestamps are often unavailable in practice, and even the ﬁnal infection statuses of nodes are sometimes missing. In this work, we study how to carry out diffusion network inference without infection timestamps, using only partial observations of the ﬁnal infection statuses of nodes. To this end, we iteratively infer the structure of the target diffusion network with observed data and imputed values for missing data, and learn the most likely infection transmission probabilities between nodes w.r.t. current inferred structure, which then help us update the imputation of missing data in turn. Extensive experimental results on both synthetic and real-world networks show that our approach can properly handle missing data and accurately uncover diffusion network structures.

Introduction The structures of diffusion networks delineate the underlying inﬂuence relationships between the nodes. An explicit diffusion network structure is essential for understanding the mechanisms of historical diffusion processes and for developing strategies to control future diffusions on the network. Unfortunately, in most cases, the diffusion network structures are not naturally accessible and need to be inferred from diffusion results observed from history. To infer the structure of a diffusion network, most existing approaches resort to precise timestamps of historical node infections, and utilize sequences of timestamps (known as cascades) to determine potential parent-child inﬂuence relationships between nodes (Mehmood et al. 2013). By transforming the problem of diffusion network inference into a problem of convex optimization (Myers and Leskovec 2010; Gomez-Rodriguez, Balduzzi, and Sch olkopf 2011; Du et al. 2012; Gomez-Rodriguez, Leskovec, and Sch olkopf 2013a,b; Daneshmand et al. 2014; Wang et al. 2014; Pouget Abadie and Horel 2015; Narasimhan, Parkes, and Singer 2015; Rong, Zhu, and Cheng 2016; Kalimeris et al. 2018)

Copyright c 2021, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved.

or submodular optimization (Gomez-Rodriguez, Leskovec, and Krause 2010; Gomez-Rodriguez and Sch olkopf 2012), they try to ﬁnd a diffusion network structure that maximizes the likelihood of given cascades by solving the corresponding optimization problems. In reality, gathering node infection timestamps as cascades is not always feasible or affordable. In most realworld settings, such as disease propagation, nodes have a wide spatial distribution, and infection monitoring is often labor/resource demanding. Given the limitation of the cascade-based approaches, some infection timestamp-free methods try to learn diffusion network structures from only the ﬁnal infection statuses of the nodes in historical diffusion processes (Gripon and Rabbat 2013; Amin, Heidari, and Kearns 2014; Huang et al. 2019a,b; Han et al. 2020), since observing the ﬁnal infection statuses is relatively easier and less expensive than monitoring exact infection timestamps. Whether the existing approaches use observed cascades or ﬁnal infection statuses of nodes, they generally assume that the observed data are complete, without missing values on any infection timestamp or status. Unfortunately, this assumption often fails to hold in real-world applications (He et al. 2016; Lokhov 2016). For example, it is virtually impossible to avoid missing readings from non-functioning sensors, or avoid non-response in a survey. In situations where missing data are unavoidable, one can make the data complete via certain ad-hoc strategies, such as ﬁlling in the missing parts with average values in the observed data (Sefer and Kingsford 2015), although these strategies often have no guarantee of accuracy and may result in severe biases. So far, a few studies do deal with missing data, but aim at predicting diffusion results (He et al. 2016) or learning only the strengths of inﬂuence relationships (Lokhov 2016; Wilinski and Lokhov 2020), under the assumption that the diffusion network structures are known in advance. In this work, we investigate how to infer the structure of a diffusion network based on only partial observations of the ﬁnal node infection statuses in a limited number of historical diffusion processes. We propose an effective algorithm called POIND (a re-ordered acronym of Diffusion Network Inference with Partial Observations) for this problem. POIND consists of two iterative steps, i.e., (1) inferring the structure of the objective network with observed data and imputed values for missing data, where the imputed values

The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

are sampled from estimated probability distributions for the missing data; and (2) learning the most likely infection transmission probabilities between nodes and their parents w.r.t. current inferred structure, which will help POIND update the probability distribution estimation and then the value imputation of missing data in turn. In summary, our key contributions include the following: (1) To our knowledge, this work is the ﬁrst to explicitly solve the problem of diffusion network inference with partial observations of ﬁnal node infection statuses. (2) We propose an effective approach to infer the structure of objective diffusion network and learn the infection transmission probabilities in an iterative way. (3) We present a probabilistic sampling method for the imputation of missing data, which is able to complete the data properly and improve the accuracy of structure inference. The remainder of the paper is organized as follows. We ﬁrst present our problem statement, and then introduce our proposed POIND algorithm, followed by reporting experimental results and our ﬁndings before concluding the paper.

Problem Statement

A diffusion network can be represented by a directed graph G = (V, E), where V = {v1, ..., vn} is the set of nodes in the network, and E represents the set of parent-child inﬂuence relationships between the nodes. In a diffusion process, infections propagate from an infected parent node to each uninfected child node with a certain probability. Let Fi denote the set of parent nodes of node vi V w.r.t. E, as each node has two possible infection statuses (i.e., infected and uninfected), there are 2|Fi| possible combinations of the infection statuses of vi s parent nodes, where |Fi| refers to the number of nodes in set Fi. Let Xi and XFi be the infection status variable of node vi and the infection status variables of nodes in Fi, respectively, πi,j be the j-th possible combination of the infection statuses of vi s parent nodes (j {1, ..., 2|Fi|}), and θi,j,k be the probability of vi s infection status being k (i.e., Xi = k, where k {0, 1}, 0 refers to uninfected status, and 1 refers to infected status) under the condition that the infection statuses of vi s parent nodes are instantiated to the j-th possible combination (i.e., XFi = πi,j), we refer to the probabilities {θi,j,1 | j {1, ..., 2|Fi|}} as the infection transmission probabilities between Fi and vi (θi,j,1 = 1 θi,j,0, i, j). In the problem of diffusion network inference, the node set V is given, while the structure (i.e., the directed edge set E) of the objective diffusion network is unknown, so are the infection transmission probabilities w.r.t. the structure. To infer the structure of a diffusion network, a set S of diffusion results observed from a number of historical diffusion processes on the network is required. In this work, we deal with the cases that the diffusion results contain only the ﬁnal infection statuses of nodes in each diffusion process, i.e., S = {sℓ i | i {1, ..., n}, ℓ {1, ..., β}}, where sℓ i {0, 1} is the infection status of node vi V observed at the end of the ℓ-th diffusion process , and β is the number of historical diffusion processes. Furthermore, we aim to carry out diffusion network inference in a complex yet more

realistic situation, i.e., partial ﬁnal infection statuses in the diffusion results are unobserved. Let Sobs and Smis denote the observed part and missing part in S, respectively, then our problem statement can be formulated as follows. Given: partial observations Sobs of node infection statuses observed on a diffusion network G at the end of β historical diffusion processes. Infer: the structure of the diffusion network G.

The POIND Algorithm To infer the structure of G with partial observations Sobs, POIND performs the following two steps iteratively, namely (1) inferring the structure with Sobs and imputed values for Smis, and (2) learning the most likely infection transmission probabilities w.r.t. the inferred structure. In the ﬁrst step, to make the data complete for structure inference, POIND initializes the imputed values for Smis randomly (or with some ad-hoc methods) at the ﬁrst iteration; in subsequent iterations, it estimates the probability distribution of each unobserved infection status in Smis based on Sobs, the latest inferred structure, and the current values of infection transmission probabilities Θ = {θi,j,1 | i {1, ..., n}, j {1, ..., 2|Fi|}}, and then carries out the value imputation for missing data by sampling values from an estimated probability distribution of each unobserved infection status. With the completed data, POIND can adopt existing infection timestamp-free methods for diffusion network inference to update the result of structure inference. In the second step, given the latest inferred structure, POIND performs an EM-like approach to approximate the optimal values of infection transmission probabilities Θ w.r.t. this structure in an iterative way. The EM-like approach adopts a carefully designed expectation function, from which the optimal updating result for current values of Θ can be theoretically deduced. EM-like approach iteratively updates the values of Θ until convergence. The details of the above two steps, as well as a complexity analysis for POIND are provided in what follows.

Structure Inference Let E(T ) and Θ(T ) denote the directed edge set and infection transmission probabilities inferred by POIND after the Tth iteration, then in the next iteration, the goal of structure inference is to ﬁnd a directed edge set E(T +1) that satisﬁes the following requirement.

E(T +1) = arg max E P(E | Θ(T ), Sobs). (1)

Nonetheless, as Sobs is not a complete data set, it is difﬁcult to calculate P(E | Θ(T ), Sobs) directly. Aiming at an easier estimation for P(E | Θ(T ), Sobs), we repeatedly sample Smis under the condition of Sobs, E(T ) and Θ(T )

to obtain a set of samples of missing data { ˆSmis 1 , ..., ˆSmis m }, where m is the number of sampling rounds. Then, we can obtain a set of samples of complete data { ˆS1, ..., ˆSm}, in which the r-th sample ˆSr consists of Sobs and ˆSmis r , i.e.,

ˆSr = (Sobs, ˆSmis r ). (2)

If the sampling is sufﬁcient (i.e., m is great enough), then

P(E | Θ(T ), Sobs) P(E | ˆS1, ..., ˆSm). (3)

Given this, we can utilize P(E | ˆS1, ..., ˆSm) to estimate P(E | Θ(T ), Sobs), and reformulate Eq. (1) as follows.

E(T +1) = arg max E P(E | ˆS1, ..., ˆSm). (4)

Note that the problem in Eq. (4) is equivalent to learning diffusion network structures from only the complete data of ﬁnal node infection statuses. One can utilize existing approaches to this problem, such as TWIND (Huang et al. 2019b), to infer E(T +1). Therefore, how to complete the data through sampling is crucial for structure inference. Next we elaborate the sampling method.

The Sampling Method To sample Smis in (T + 1)- th iteration, we need to know the conditional probability distribution P(sℓ i | Sobs, E(T ), Θ(T )) for each unobserved infection status sℓ i Smis. Let Dobs ℓ = {sℓ i Sobs | i {1, ..., n}} denote the set of observed data from ℓth historical diffusion process (ℓ {1, ..., β}), as each historical diffusion process is independent of each other, we have relationship

P(sℓ i | Sobs, E(T ), Θ(T )) = P(sℓ i | Dobs ℓ , E(T ), Θ(T )). (5)

Without loss of generality, let D = {s1, ..., sn} represent the ﬁnal infection statuses of nodes in a historical diffusion process, Dobs and Dmis represent the observed part and missing part in D, respectively. For the sake of simplicity, we adopt E and Θ to represent the latest inferred directed edge set and infection transmission probabilities. The infection probability pi of node vi in this historical diffusion process can be estimated as follows.

0, si Dobs, si = 0 1, si Dobs, si = 1 P(Xi = 1 | Dobs, E, Θ), si Dmis (6)

When the ﬁnal infection status si of node vi is missing (i.e., si Dmis), we consider the following two cases. Case 1: If for each of vi s parent nodes, its infection probability is known already, the infection probability pi of node vi can be estimated as follows.

j=1 P(Xi = 1, XFi = πi,j | Dobs, E, Θ)

α=1 ϕ Fi(α), j

where Fi(α) refers to the index of the α-th node in Fi and

ϕ Fi(α), j = p Fi(α), XFi =πi,j, XFi(α) =1; 1 p Fi(α), XFi =πi,j, XFi(α) =0. (8)

Moreover, we would like to point out that the calculation of pi can be carried out in a more efﬁcient way. Towards this,

we divide the set Fi into two parts, namely (1) the observed part F obs i in which the ﬁnal infection status of each node is observed, and (2) the missing part F mis i in which the ﬁnal infection status of each node is unobserved. Then, Eq. (7) can be rewritten as follows.

α=1 ϕ Fi(α), j θi,j,1

2|F obs i | X

2|F mis i | X

|F obs i | Y

|F mis i | Y

α2=1 θi,j1j2,1

ϕ F obs i (α1), j1 ϕ F mis i (α2), j2 ,

where j1j2 {1, ..., 2|Fi|}, and the j1j2-th possible combination of the infection statues of nodes in Fi corresponds to the j1-th possible combination of the infection statues of nodes in F obs i combined with the j2-th possible combination of the infection statues of nodes in F mis i . For the α1-th node in F obs i , the value of p F obs i (α1) and

the value of ϕ F obs i (α1), j1 for each j1 {1, ..., 2|F obs i |} should be 0 or 1. Without loss of generality, assuming that the infection statuses of nodes in F obs i are instantiated to the ji-th possible combination (ji {1, ..., 2|F obs i |}), then for each j2 {1, ..., 2|F mis i |}, the following relationship holds.

2|F obs i | X

|F obs i | Y

|F mis i | Y

α2=1 θi,j1j2,1

ϕ F obs i (α1), j1 ϕ F mis i (α2), j2

|F mis i | Y

α2=1 ϕ F mis i (α2), j2 θi,jij2,1.

Combining Eqs. (9) & (10), we have a much simpler calculating formula for pi as follows.

2|F mis i | X

|F mis i | Y

α=1 ϕ F mis i (α), j θi,jij,1. (11)

When we start the infection probability estimation, for each node with an observed infection status, we update its infection probability by Eq. (6), and then repeat the following two steps, i.e., (1) checking which nodes satisfy the computation condition of Case 1, and (2) calculating the infection probabilities of these nodes by Eq. (11), until there is no node satisfying the computation condition of Case 1. If there are still a few nodes whose infection probabilities are unknown, it indicates that there are cyclic dependencies among their infection probability estimation. We refer to this kind of situation as Case 2. Case 2: Without loss of generality, let v1, ..., vr V be the remaining nodes that cannot satisfy the computation condition of Case 1. Then, each remaining node vq {v1, ..., vr} will have at least one parent node with unknown

infection probability. Hence, we cannot calculate each pq with Eq. (11) separately. Nevertheless, we can compute infection probabilities jointly for all the remaining nodes by solving the following equations.

(p1, ..., pr) = f(p1, ..., pr). (12)

f(p1, ..., pr) = f1(p1, ..., pr), ..., fr(p1, ..., pr) , (13)

fq(p1, ..., pr) =

2|F mis q | X

|F mis q | Y

α=1 ϕ F mis q (α), j θq,jqj,1.

(14) The above Eq. (12) is a polynomial system of equations w.r.t. p1, ..., pr, of which the numerical solution can be efﬁciently obtained by existing tools, such as the fsolve and root functions in the Sci Py package for Python. When we know all the infection probabilities, we can carry out sampling for each si Dmis based on pi to complete the data of ﬁnal infection statuses in historical diffusion processes, and then perform TWIND on the completed data to infer the structure of the objective network G.

Learning Infection Transmission Probabilities

After inferring E(T +1) based on Θ(T ) and Sobs, we can learn Θ(T +1) based on E(T +1) and Sobs by solving the following problem.

Θ(T +1) = arg max Θ P(Θ | E(T +1), Sobs). (15)

With a given directed edge set E(T +1), the parent node set Fi for each node vi V and the possible combinations of vi s parent nodes are ﬁxed, so that the elements in Θ(T +1) are also ﬁxed. What we can do is to ﬁnd an optimal value for each element θi,j,1 Θ such that the likelihood of Sobs is maximized. Let H(E(T +1)) denote the constrained search space of Θ(T +1) w.r.t. the given E(T +1), the problem of learning Θ(T +1) can be reformulated as

Θ(T +1) = arg max Θ H(E(T +1)) P(Sobs | Θ). (16)

Inspired by the EM algorithm, we propose to update Θ iteratively, and deﬁne an expectation function Q as follows.

=E[log P(Sobs, Smis | Θ) | Sobs, Θ[t]]

Smis P(Smis | Sobs, Θ[t]) log P(Sobs, Smis | Θ). (17)

where Θ[t] refers to the updated Θ after t iterations, and E[ ] refers to the expectation value of a variable. Function Q has a nice theoretical property as follows.

Theorem 1. If Q(Θ, Θ[t]) > Q(Θ[t], Θ[t]), then relationship P(Sobs | Θ) > P(Sobs | Θ[t]) holds.

Proof: The relationship

log P(Sobs | Θ) log P(Sobs | Θ[t])

= log P Smis P(Sobs, Smis | Θ)

P(Sobs | Θ[t]) P(Smis | Sobs, Θ[t]) P(Smis | Sobs, Θ[t])

Smis P(Smis | Sobs, Θ[t]) P(Sobs, Smis | Θ)

P(Sobs, Smis | Θ[t])

Smis P(Smis | Sobs, Θ[t]) log P(Sobs, Smis | Θ)

P(Sobs, Smis | Θ[t])

=Q(Θ, Θ[t]) Q(Θ[t], Θ[t])

holds, where the inequality is derived from Jensens inequality. Thus, the theorem is correct. According to the above Theorem 1, one can learn the optimal Θ[t+1] by ﬁnding a Θ from H(E(T +1)) such that the value of Q(Θ, Θ[t]) is maximized, i.e.,

Θ[t+1] = arg max Θ H(E(T +1)) Q(Θ, Θ[t]). (18)

According to Eq. (3) in the work of Han et al. (2020), the probability P(Sobs, Smis | Θ) can be reformulated as

P(Sobs, Smis | Θ)

j=1 θNi,j,0 i,j,0 θNi,j,1 i,j,1

j=1 θNi,j,0 i,j,0 (1 θi,j,0)Ni,j,1.

where Ni,j,0 and Ni,j,1 represent the number of times situations Xi = 0 XFi = πi,j and Xi = 1 XFi = πi,j appear in (Sobs, Smis), respectively. Then, the expectation function Q can be reformulated as

P(Smis | Sobs, Θ[t])

Ni,j,0 log θi,j,0+ Ni,j,1 log(1 θi,j,0)

aij log θi,j,0+ bij log(1 θi,j,0)

j=1 hij(θi,j,0),

hij(θi,j,0) =

Smis aij log θi,j,0+ P

Smis bij log(1 θi,j,0)

aij = Ni,j,0P(Smis | Sobs, Θ[t]),

bij = Ni,j,1P(Smis | Sobs, Θ[t]).

The following Theorem can help us ﬁnd an optimal Θ that maximizes Q(Θ, Θ[t]).

Theorem 2. Let 0 < θ < 1, a > 0 and b > 0, then function f(θ) = a log θ + b log(1 θ) reaches its maximum value at θ = a a+b.

Proof: The derivative of f(θ) is f (θ) = a

θ b 1 θ, based on which we have that for 0 < θ < a a+b, relationship f (θ) > 0 holds, and for a a+b < θ < 1, relationship f (θ) < 0 holds. Hence, f(θ) reaches its maximum value at θ = a a+b, and the theorem is correct. To maximize Q(Θ, Θ[t]), we need to maximize the value of each hij(θi,j,0). According to Theorem 2, for each i {1, ..., n} and j {1, ..., 2|Fi|}, function hij(θi,j,0) reaches its maximum value at P Smis aij/(P Smis aij +P Smis bij). In other words, the optimal updating result θ[t+1] i,j,0 for θ[t] i,j,0 should be

θ[t+1] i,j,0 = P Smis aij P Smis aij + P Smis bij , (22)

and θ[t+1] i,j,1 = 1 θ[t+1] i,j,0 .

In order to obtain the value of θ[t+1] i,j,1 , we need to calculate P Smis Ni,j,k P(Smis | Sobs, Θ[t]) for k = 0, 1. X

Smis Ni,j,k P(Smis | Sobs, Θ[t])

=E[Ni,j,k | Sobs, Θ[t]]

ℓ=1 E[Ni,j,k | Dobs ℓ , Θ[t]].

If the infection statuses of vi and its parent nodes can be observed in the ℓ-th historical diffusion process, then E[Ni,j,k | Dobs ℓ , Θ[t]] can be calculated as

E[Ni,j,k | Dobs ℓ , Θ[t]]

( 1, si = k and πi,j Dobs ℓ ;

0, si = k or πi,j Dobs ℓ .

Otherwise, E[Ni,j,k | Dobs ℓ , Θ[t]] can be estimated as

E[Ni,j,k | Dobs ℓ , Θ[t]]

=P(Xi = k, XFi = πi,j | Dobs ℓ , Θ[t])

=θ[t] i,j,k P(XFi = πi,j | Dobs ℓ , Θ[t])

α=1 ϕ Fi(α), j θ[t] i,j,k,

where function ϕ is deﬁned in Eq. (8) and its computation method has been discussed in the Cases 1 & 2 of our proposed sampling method for missing data. Given the discussion above, the optimal values of Θ w.r.t. a given directed edge set can be iteratively approximated by repeatedly using Eq. (18).

Complexity Analysis POIND infers the diffusion network structure and updates infection transmission probabilities iteratively.

Graphs Number of Nodes Average Degree LFR1-5 100,150,200,250,300 4 LFR6-10 200 2,3,4,5,6

Table 1: Properties of LFR benchmark graphs.

In structure inference, the most computationally expensive process consists of the following two parts, namely (1) sampling missing data, and (2) performing TWIND on completed data. In the sampling process, calculating the nodes infection probabilities in β historical diffusion processes by Eq. (6) requires O(βn2 +2ηβn) time, where n is the number of nodes in objective diffusion network, and η is the upper bound of the number of parent nodes (η n). Performing TWIND requires about O(βn2) time. The updating of infection transmission probabilities is carried out in an iterative way. In each updating iteration, the most computationally expensive process is calculating Eq. (23) with Eq. (25), which takes O(2ηηβn) time. Let t be the number of all updating iterations, the time complexity of infection transmission probability updating is O(2ηtηβn). In summary, the overall time complexity of POIND algorithm is O(Tβn2 + 2ηTβn + 2ηTtηβn), where T refers to the number of iterations of POIND.

Experimental Evaluation In this section, we ﬁrst introduce the experimental setup, and then verify the effectiveness and efﬁciency of our POIND algorithm on synthetic as well as real-world networks. To this end, we investigate the effects of diffusion network size, diffusion network s average degree, the ratio of missing data, the amount of diffusion processes, and the iterations of POIND on the accuracy and running time performance of POIND. All algorithms in the experiments are implemented in Python, running on a desktop PC with Intel Core i3-6100 CPU at 3.70GHz and 8GB RAM.

Experimental Setup Network. We adopt LFR benchmark graphs (Lancichinetti, Fortunato, and Radicchi 2008) as the synthetic networks. By setting different generation parameters, such as the number of nodes and the average degree of each node, we generate two series of LFR benchmark graphs with properties summarized in Table 1. In addition, we adopt two realworld networks: Net Sci (Newman 2006), a co-authorship network containing 379 scientists and 1602 co-authorships, and DUNF (Wang et al. 2014), a microblogging network containing 750 users and 2974 following relationships. Infection Data. The infection status results S can be obtained by simulating β times of diffusion processes on each network with randomly selected initially infected nodes in each simulation (the ratio of initially infected nodes is 15%). Corresponding cascades are also recorded for cascade-based tested algorithms in the experiments. In each diffusion process, each infected node tries to infect its uninfected child nodes with an infection transmission probability that follows a Gaussian distribution with mean of 0.3 and standard deviation of 0.05, to make about 95% of infection

transmission probabilities are within a range from 0.2 to 0.4. We randomly remove a few infection status data from S as the missing data Smis (let γ denote the ratio of missing data), and use the remaining partial observations Sobs for diffusion network inference. Performance Criterion. To evaluate the accuracy performance of POIND algorithm, we report the F-score (the harmonic mean of precision and recall) of its inferred directed edges, computed as follows.

F-score = 2 Precision Recall

Precision + Recall ,

where Precision = NT P NT P +NF P , Recall = NT P NT P +NF N , NT P refers to the number of true positives, i.e., the true edges which are correctly inferred by the algorithm; NF P refers to the number of false positives, i.e., the wrong inferred edges which are not in the real network; and NF N refers to the number of false negatives, i.e., the true edges which are not correctly inferred by the algorithm. Benchmark Algorithms. We compare POIND with a classical convex optimization-based approach Net Rate (Gomez-Rodriguez, Balduzzi, and Sch olkopf 2011), a state-of-the-art submodular optimization-based approach Mul Tree (Gomez-Rodriguez and Sch olkopf 2012), and a high performance infection timestamp-free approach TWIND (Huang et al. 2019b) for performance comparison, where TWIND is also used to infer the structure of objective diffusion network with completed data in our POIND algorithm. For POIND, the maximum number T of iterations is set to 5, the number m of sampling round is set to 6, and the stop condition for the iterative updates of infection transmission probabilities Θ is Θ[t+1] Θ[t] 0.05. As the three benchmark algorithms require complete infection data, we complete the Sobs with the following ad-hoc method for these benchmark algorithms: we estimate the average infection probability of each node in Sobs, and then sample for the missing data 6 times based on the average infection probabilities. Since Net Rate infers the transmission rate between each two node in the network, we give Net Rate a privilege in accuracy performance comparison, i.e., by calculating the F-score of edges whose transmission rates are greater than a threshold, we use different thresholds to ﬁnd a highest F-score and report this F-score as the ﬁnal accuracy performance of Net Rate. Moreover, as Mul Tree requires users to specify the number of edges to be inferred, we use the actual number m of edges in the network as an input of Mul Tree.

Effect of Diffusion Network Size

To study the effect of diffusion network size on algorithm performance, we adopt ﬁve synthetic networks, LFR1 5, whose sizes vary from 100 to 300. We simulate 200 times of diffusion processes on each network (i.e., β = 200), and randomly remove 15% of infection status observations as missing data (i.e., γ = 0.15). Fig. 1 illustrates the F-score and execution time of each tested algorithm, from which we can observe that (1) among the three benchmark algorithms, TWIND achieves

100 150 200 250 300 Network Size

TWIND POIND

(a) F-score

100 150 200 250 300 Network Size

Running Time (Seconds)

Net Rate Mul Tree TWIND POIND

(b) Running Time

Figure 1: Effect of diffusion network size

2 3 4 5 6 Average Degree of Network

TWIND POIND

(a) F-score

2 3 4 5 6 Average Degree of Network

Running Time (Seconds)

Net Rate Mul Tree TWIND POIND

(b) Running time

Figure 2: Effect of average degree of diffusion network

the highest accuracy and best running time performance; (2) compared with TWIND, POIND achieves a relatively higher accuracy at the price of a longer running time; (3) the running time of each tested algorithm increases with the growth of diffusion network size.

Effect of Average Degree of Diffusion Network To study the effect of diffusion network s average degree on algorithm performance, we test the algorithms on ﬁve synthetic networks, LFR6 10, whose average node degree varies from 2 to 6. We simulate 200 times of diffusion processes on each network (i.e., β = 200), and randomly remove 15% of infection status observations as missing data (i.e., γ = 0.15). Fig. 2 illustrates the F-score and running time of each algorithm, from which we can observe that (1) the POIND leads its competitors in accuracy, and is reasonably insensitive to the network s average degree; (2) the running time of each tested algorithm increases with the growth of average degree, and TWIND is faster than POIND as there is no iteration in TWIND.

Effect of Missing Data Ratio To study the effect of missing data ratio on algorithm performance, we test the algorithms on two real-world networks, Net Sci and DUNF (with β = 200), varying the missing data ratio γ from 0 to 0.2. Figs. 3 & 4 illustrate the F-score and running time of each tested algorithm on Net Sci and DUNF, respectively. From the ﬁgures we can observe that (1) compared with TWIND, which often has the best accuracy among the benchmark algorithms, POIND has slightly higher accuracy on Net Sci

0 0.05 0.1 0.15 0.2 Missing Data Ratio

Net Rate Mul Tree

TWIND POIND

(a) F-score

0 0.05 0.1 0.15 0.2 Missing Data Ratio

Running Time (Seconds)

Net Rate Mul Tree TWIND POIND

(b) Running time

Figure 3: Effect of missing data ratio on Net Sci

0 0.05 0.1 0.15 0.2 Missing Data Ratio

Net Rate Mul Tree TWIND

(a) F-score

0 0.05 0.1 0.15 0.2 Missing Data Ratio

Running Time (Seconds)

Net Rate Mul Tree TWIND POIND

(b) Running time

Figure 4: Effect of missing data ratio on DUNF

and reasonably high accuracy on DUNF; (2) the increase of missing data ratio has a rather mild effect on the running time of Net Rate, Mul Tree and TWIND, but results in longer running time for POIND. This is because POIND needs longer running time to infer more missing data, while the three benchmark algorithms use completed data directly.

Effect of Amount of Diffusion Processes

To study the effect of the amount of diffusion processes on algorithm performance, we test the algorithms on Net Sci and DUNF with different number β of diffusion processes, varying from 100 to 300. In the diffusion result obtained with each β, we randomly remove 15% of infection status observations as missing data (i.e., γ = 0.15). Figs. 5 & 6 illustrate the F-score and running time of each algorithm on Net Sci and DUNF, respectively. From the ﬁgures we can observe that (1) a greater amount of diffusion processes often helps the tested algorithms to achieve more accurate results on diffusion network structure inference. POIND often has a better accuracy performance compared with the other tested algorithms; (2) to analyze the infection status results observed from more diffusion processes, Net Rate and Mul Tree often require more running time. A greater amount of diffusion processes does not always increase the running time of TWIND and POIND.

Effect of Iterations of POIND

To study the effect of POIND s iterations on the performance of POIND, we test POIND on Net Sci and DUNF (with β = 200, γ = 0.15), varying the maximum number T of iterations from 1 to 5.

100 150 200 250 300 Amount of Diffusion Processes

Net Rate Mul Tree

TWIND POIND

(a) F-score

100 150 200 250 300 Amount of Diffusion Processes

Running Time (Seconds)

Net Rate Mul Tree TWIND POIND

(b) Running time

Figure 5: Effect of amount of diffusion processes on Net Sci

100 150 200 250 300 Amount of Diffusion Processes

Net Rate Mul Tree TWIND

(a) F-score

100 150 200 250 300 Amount of Diffusion Processes

Running Time (Seconds)

Net Rate Mul Tree TWIND POIND

(b) Running time

Figure 6: Effect of amount of diffusion processes on DUNF

1 2 3 4 5 Iteration

Net Sci DUNF

(a) F-score

1 2 3 4 5 Iteration

Running Time (Seconds)

Net Sci DUNF

(b) Running time

Figure 7: Effect of iterations

Fig. 7 illustrates the tested results, from which we can observe that (1) POIND converges quickly; (2) more iterations result in more running time for POIND as a rule.

We have investigated the problem of how to infer the structure of a diffusion network from only the partial observations of ﬁnal infection statuses of nodes. Towards this, we have proposed an effective approach called POIND to carry out structure inference and learning infection transmission probabilities in an iterative way. To infer the structure, POIND executes a probabilistic sampling for missing data, and then performs an existing infection timestamp-free method on the completed data. To learn infection transmission probabilities, POIND adopts an EM-like approach to iteratively approximate the optimal solution. Extensive experimental results on both synthetic and real-world networks have veriﬁed that our approach can effectively uncover the diffusion network structures from the partial observations.

Acknowledgments This work was supported in part by the NSFC Grants 61976163, 61902284, 62025206, 61972338 and 61672392, the Technological Innovation Major Projects of Hubei Province under Grant No. 2017AAA125, the Science and Technology Program of Wuhan City under Grant No. 2018010401011288, and the Fundamental Research Funds for the Central Universities. Hao Huang is the corresponding author.

References Amin, K.; Heidari, H.; and Kearns, M. 2014. Learning from Contagion(Without Timestamps). In ICML 2014, 1845 1853.

Daneshmand, H.; Gomez-Rodriguez, M.; Song, L.; and Sch olkopf, B. 2014. Estimating Diffusion Network Structures: Recovery Conditions, Sample Complexity & Soft-thresholding Algorithm. In ICML 2014, 793 801.

Du, N.; Song, L.; Smola, A.; and Yuan, M. 2012. Learning Networks of Heterogeneous Inﬂuence. In NIPS 2012, 2780 2788.

Gomez-Rodriguez, M.; Balduzzi, D.; and Sch olkopf, B. 2011. Uncovering the Temporal Dynamics of Diffusion Networks. In ICML 2011, 561 568.

Gomez-Rodriguez, M.; Leskovec, J.; and Krause, A. 2010. Inferring Networks of Diffusion and Inﬂuence. In KDD 2010, 1019 1028.

Gomez-Rodriguez, M.; Leskovec, J.; and Sch olkopf, B. 2013a. Modeling Information Propagation with Survival Theory. In ICML 2013, 666 674.

Gomez-Rodriguez, M.; Leskovec, J.; and Sch olkopf, B. 2013b. Structure and Dynamics of Information Pathways in Online Media. In WSDM 2013, 23 32.

Gomez-Rodriguez, M.; and Sch olkopf, B. 2012. Submodular Inference of Diffusion Networks from Multiple Trees. In ICML 2012, 489 496.

Gripon, V.; and Rabbat, M. 2013. Reconstructing a Graph from Path Traces. In ISIT 2013, 2488 2492.

Han, K.; Tian, Y.; Zhang, Y.; Han, L.; Huang, H.; and Gao, Y. 2020. Statistical Estimation of Diffusion Network Topologies. In ICDE 2020, 625 636.

He, X.; Xu, K.; Kempe, D.; and Liu, Y. 2016. Learning Inﬂuence Functions from Incomplete Observations. In NIPS 2016, 2065 2073.

Huang, H.; Yan, Q.; Chen, L.; Gao, Y.; and Jensen, C. S. 2019a. Statistical Inference of Diffusion Networks. IEEE Transactions on Knowledge and Data Engineering .

Huang, H.; Yan, Q.; Gan, T.; Niu, D.; Lu, W.; and Gao, Y. 2019b. Learning Diffusions without Timestamps. In AAAI 2019, 582 589.

Kalimeris, D.; Singer, Y.; Subbian, K.; and Weinsberg, U. 2018. Learning Diffusion using Hyperparameters. In ICML 2018, 2420 2428.

Lancichinetti, A.; Fortunato, S.; and Radicchi, F. 2008. Benchmark graphs for testing community detection algorithms. Physical Review E 78(4). Lokhov, A. Y. 2016. Reconstructing Parameters of Spreading Models from Partial Observations. In NIPS 2016, 3467 3475. Mehmood, Y.; Barbieri, N.; Bonchi, F.; and Ukkonen, A. 2013. CSI: Community-Level Social Inﬂuence Analysis. In ECML PKDD 2013, 48 63. Myers, S.; and Leskovec, J. 2010. On the Convexity of Latent Social Network Inference. In NIPS 2010, 1741 1749. Narasimhan, H.; Parkes, D. C.; and Singer, Y. 2015. Learnability of Inﬂuence in Networks. In NIPS 2015, 3186 3194. Newman, M. E. J. 2006. Finding Community Structure in Networks Using the Eigenvectors of Matrices. Physical Review E 74(3): 036104. Pouget-Abadie, J.; and Horel, T. 2015. Inferring Graphs from Cascades: A Sparse Recovery Framework. In ICML 2015, 977 986. Rong, Y.; Zhu, Q.; and Cheng, H. 2016. A Model Free Approach to Infer the Diffusion Network from Event Cascade. In CIKM 2016, 1653 1662. Sefer, E.; and Kingsford, C. 2015. Convex Risk Minimization to Infer Networks from Probabilistic Diffusion Data at Multiple Scales. In ICDE 2015, 663 674. Wang, S.; Hu, X.; Yu, P.; and Li, Z. 2014. MMRate: Inferring Multi-aspect Diffusion Networks with Multipattern Cascades. In KDD 2014, 1246 1255. Wilinski, M.; and Lokhov, A. Y. 2020. Scalable Learning of Independent Cascade Dynamics from Partial Observations. ar Xiv preprint ar Xiv:2007.06557.