# Reconstructing Diffusion Networks from Incomplete Data

Hao Huang, Keqi Han, Beicheng Xu and Ting Gan
School of Computer Science, Wuhan University, China
{haohuang, hankeqi, beichengxu, ganting}@whu.edu.cn

To reconstruct the topology of a diffusion network, existing approaches customarily demand not only the eventual infection statuses of nodes, but also the exact times when infections occur. In real-world settings, such as the spread of epidemics, tracing the exact infection times is often infeasible; even obtaining the eventual infection statuses of all nodes is a challenging task. In this work, we study topology reconstruction of a diffusion network with incomplete observations of the node infection statuses. To this end, we iteratively infer the network topology based on observed infection statuses and estimated values for unobserved infection statuses by investigating the correlation of node infections, and learn the most probable infection propagation probabilities among nodes w.r.t. the current inferred topology, as well as the corresponding probability distribution of each unobserved infection status, which in turn helps update the estimates of the unobserved data. Extensive experimental results on both synthetic and real-world networks verify the effectiveness and efficiency of our approach.

## 1 Introduction

The topology of a diffusion network provides an explicit view of the influence relationships between network nodes, and is crucial for revealing the intrinsic mechanisms of diffusion dynamics. In practice, diffusion network topologies are often not naturally accessible, and need to be reconstructed from diffusion results observed in history [Gan et al., 2021].

To reconstruct the topology of a diffusion network, most existing approaches rely on the assumption that nodes infected sequentially within a short time interval are more likely to possess influence relationships. Therefore, these approaches demand not only the eventual infection statuses of nodes, but also the exact times (called cascades) when infections occur, and aim to find an optimal diffusion network topology that maximizes the likelihood of the cascades [Gomez-Rodriguez et al., 2010; Myers and Leskovec, 2010; Gomez-Rodriguez et al., 2011; Gomez-Rodriguez and Schölkopf, 2012; Du et al., 2012; Gomez-Rodriguez et al., 2013; Daneshmand et al., 2014; Wang et al., 2014; Pouget-Abadie and Horel, 2015; Narasimhan et al., 2015; Rong et al., 2016; Kalimeris et al., 2018]. Nevertheless, in many real-world settings, such as the spread of epidemics and viral marketing campaigns, monitoring the entire diffusion process to obtain precise cascades is expensive and time-consuming. A few existing approaches try to reconstruct diffusion networks with only the eventual infection statuses of nodes [Amin et al., 2014; Huang et al., 2019; Han et al., 2020; Huang et al., 2021], which are more easily accessible than cascades in most cases.

Whether the existing approaches use cascades or the eventual infection statuses of nodes, they usually assume that the observed data are complete. Unfortunately, this assumption often fails to hold in practice [Gan et al., 2021]. For example, it is difficult to collect the infection situations of all individuals during the spread of an epidemic because of population mobility and limited medical resources. It is also hard to guarantee that every respondent provides feedback in viral marketing campaigns.
To deal with incomplete observed data, a simple and straightforward strategy is to replace each missing value with its corresponding expected value computed from the observed part [Sefer and Kingsford, 2015], under the assumption that the missing and observed data follow the same probability distribution. However, this strategy has no accuracy guarantee, and may introduce a severe bias into the estimates of missing data. To reduce such bias, a few studies employ constrained optimization to handle incomplete observed data, but most of them aim to learn either the strengths of influence relationships [Lokhov, 2016; Wilinski and Lokhov, 2020] or influence functions [He et al., 2016] when the topology of the diffusion network is given. Only one work aims to reconstruct diffusion network topologies from incomplete observed data [Gan et al., 2021]. Nonetheless, this work maximizes a relaxed objective function rather than the original one, and thus cannot guarantee to obtain an optimal (or even locally optimal) solution of the original problem.

Aiming at an accurate solution for diffusion network reconstruction with incomplete observations of node infection statuses, we propose the LIDO (Learning from Incomplete Diffusion Observations) algorithm. It consists of three iterative steps: (1) inferring the topology of the target diffusion network based on the observed node infection statuses and estimated values for unobserved node infection statuses; (2) jointly learning the optimal infection propagation probabilities between adjacent nodes in the current inferred topology and the corresponding probability distribution of each unobserved node infection status, which are most likely to generate the observed data; and (3) updating the estimates of unobserved infection statuses by sampling values from the latest learned probability distributions. LIDO repeats these three steps until the inferred topology converges to a stable solution.

The remainder of the paper is organized as follows. We first present our problem statement, and then introduce our proposed LIDO algorithm, followed by reporting experimental results and our findings before concluding the paper.

## 2 Problem Statement

A diffusion network is represented as a directed graph $G = (V, E)$, where $V = \{v_1, \ldots, v_n\}$ is the set of $n$ nodes in the network, and $E$ is the set of directed edges (i.e., influence relationships). A directed edge $(v_j, v_i) \in E$ from a parent node $v_j \in V$ to a child node $v_i \in V$ indicates that when $v_j$ is infected and $v_i$ is uninfected, $v_j$ will infect $v_i$ with a certain probability $p_{ji}$, known as the infection propagation probability. We assume that the diffusion processes on $G$ follow the Independent Cascade (IC) model [Kempe et al., 2003], in which each infected parent node independently tries to infect each of its uninfected children with the corresponding infection propagation probability. This assumption is commonly used in existing approaches, and we would like to point out that our proposed algorithm can also be extended to other propagation models such as the linear threshold model. In the problem of diffusion network reconstruction, the node set $V$ is given, while the topology (i.e., the directed edge set $E$) of the target diffusion network is unknown, and so are the infection propagation probabilities w.r.t. the topology.
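To make the IC assumption concrete, the following minimal Python sketch simulates a single diffusion process under the standard IC model. It is an illustration of the model only, not the authors' implementation; the function name `simulate_ic` and its data layout are our own assumptions.

```python
import random

def simulate_ic(nodes, children, p, seeds):
    """One Independent Cascade diffusion.
    nodes:    iterable of node ids
    children: dict mapping node j -> list of child nodes i with edge (v_j, v_i)
    p:        dict mapping edge (j, i) -> propagation probability p_ji
    seeds:    initially infected nodes
    Returns the final 0/1 infection status s_i of every node."""
    infected, frontier = set(seeds), set(seeds)
    while frontier:
        newly = set()
        for j in frontier:                       # each newly infected node gets a single
            for i in children.get(j, []):        # chance to infect each uninfected child
                if i not in infected and random.random() < p[(j, i)]:
                    newly.add(i)
        infected |= newly
        frontier = newly
    return {i: int(i in infected) for i in nodes}
```

Running such a simulation many times with randomly chosen seed sets produces one final-status vector per diffusion process, which is exactly the kind of data considered next.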
To reconstruct the diffusion network, a set $S$ of diffusion results observed from a number of historical diffusion processes on the network is required. In this paper, we assume that the diffusion results contain only the final infection statuses of nodes in each diffusion process, i.e., $S = \{s_i^\ell \mid i \in \{1, \ldots, n\}, \ell \in \{1, \ldots, \beta\}\}$, where $s_i^\ell \in \{0, 1\}$ is the final infection status of node $v_i$ in the $\ell$-th diffusion process, and $\beta$ is the number of historical diffusion processes. Furthermore, we focus on a complex yet more realistic situation, i.e., partial data in $S$ are unobserved. Let $S^{obs}$ and $S^{mis}$ denote the observed part and unobserved part of $S$, respectively. Our problem can then be formulated as follows.

Given: partial observations $S^{obs}$ of node infection statuses observed on a diffusion network $G$ at the end of $\beta$ historical diffusion processes.

Infer: the unknown edge set $E$ of diffusion network $G$.

## 3 The LIDO Algorithm

To reconstruct the topology of $G$ with partial observations $S^{obs}$, LIDO performs the following three steps iteratively, namely (1) inferring the topology with $S^{obs}$ and estimated values $\hat{S}^{mis}$ for $S^{mis}$, (2) learning the optimal infection propagation probabilities and the corresponding probability distributions of $S^{mis}$ w.r.t. the current inferred topology, and (3) updating $\hat{S}^{mis}$ by sampling from the learned probability distributions. The details of these steps, as well as a complexity analysis for LIDO, are provided in what follows.

**Topology inference.** In general, the goal of diffusion network reconstruction with node infection statuses $S$ is to find an optimal directed edge set $E$ that satisfies the following requirement:
$$E = \arg\max_{E} P(E \mid S). \tag{1}$$
Nonetheless, in our problem setting, the node infection statuses are incomplete, containing only partial observations $S^{obs}$. To estimate $P(E \mid S)$ with less bias, we can repeatedly sample $S^{mis}$ to obtain a set of samples of unobserved data $\{\hat{S}^{mis}_1, \ldots, \hat{S}^{mis}_m\}$, where $m$ is the number of sampling rounds. Then, we can obtain a set of samples of complete data $\{\hat{S}_1, \ldots, \hat{S}_m\}$, in which the $r$-th sample $\hat{S}_r$ consists of $S^{obs}$ and $\hat{S}^{mis}_r$, i.e., $\hat{S}_r = (S^{obs}, \hat{S}^{mis}_r)$. If the sampling is sufficient and follows the underlying probability distributions of the unobserved data, then
$$P(E \mid S) \approx P(E \mid \hat{S}_1, \ldots, \hat{S}_m). \tag{2}$$
Let $\{\hat{S}^{(t)}_1, \ldots, \hat{S}^{(t)}_m\}$ denote the sample set of complete data sampled by LIDO after the $t$-th iteration. Then, in the $(t+1)$-th iteration, the task of topology inference is to find a directed edge set $E^{(t+1)}$ that satisfies the following requirement:
$$E^{(t+1)} = \arg\max_{E} P(E \mid \hat{S}^{(t)}_1, \ldots, \hat{S}^{(t)}_m). \tag{3}$$
Note that the problem in Eq. (3) is equivalent to reconstructing a diffusion network topology from complete data of final node infection statuses only. Existing approaches to this problem, such as TENDS [Han et al., 2020], can be utilized to infer $E^{(t+1)}$. Therefore, how to complete the data through sampling is essential for topology inference.

**Learning probability distributions for sampling.** After inferring $E^{(t+1)}$ with $\{\hat{S}^{(t)}_1, \ldots, \hat{S}^{(t)}_m\}$, the subsequent task is to find the optimal infection propagation probabilities $\{p_{ji} \mid (v_j, v_i) \in E^{(t+1)}\}$ and the corresponding probability distributions of the unobserved data $S^{mis}$ that are most likely to generate the observed data $S^{obs}$. Then, by sampling $S^{mis}$ from the learned probability distributions, we can obtain an updated sample set of complete data $\{\hat{S}^{(t+1)}_1, \ldots, \hat{S}^{(t+1)}_m\}$ for the next iteration of topology inference.
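Once the per-entry probabilities of infection have been learned (as derived next), building each completed sample reduces to a Bernoulli draw per unobserved status. Below is a minimal sketch of this resampling step, assuming the learned probabilities are stored in a dict `prob_infected[(l, i)]`; the names and data layout are our own, not the authors' code.

```python
import random

def resample_missing(S_obs, missing_entries, prob_infected, m):
    """Build m completed copies of the data.
    S_obs:           dict {(l, i): 0 or 1} of observed final statuses
    missing_entries: list of (l, i) pairs whose status is unobserved
    prob_infected:   dict {(l, i): learned P(X_i^l = 1 | S_obs)}
    Returns a list of m dicts, each mapping every (l, i) to a 0/1 status."""
    samples = []
    for _ in range(m):
        completed = dict(S_obs)                      # keep observed statuses as-is
        for key in missing_entries:                  # Bernoulli draw per missing entry
            completed[key] = int(random.random() < prob_infected[key])
        samples.append(completed)
    return samples
```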
According to the IC model, when the infection status $s_i^\ell$ of node $v_i$ in the $\ell$-th diffusion process is observed and $s_i^\ell = 1$, the likelihood that the infection of $v_i$ is caused by any of its parent nodes is $1 - \prod_{v_j \in F_i} (1 - p_{ji} s_j^\ell)$, where $F_i$ refers to the set of parent nodes of $v_i$; when $s_i^\ell$ is observed and $s_i^\ell = 0$, the likelihood that none of the parent nodes of $v_i$ has successfully infected $v_i$ can be estimated as $\prod_{v_j \in F_i} (1 - p_{ji} s_j^\ell)$. When $s_i^\ell$ is unobserved, i.e., $s_i^\ell \in S^{mis}$, we regard $s_i^\ell$ as the probability that $X_i^\ell = 1$, i.e., $s_i^\ell = P(X_i^\ell = 1 \mid S^{obs})$, where $X_i^\ell \in \{0, 1\}$ is the infection status variable of $v_i$ in the $\ell$-th diffusion process. In other words, we substitute the unknown $s_i^\ell$ with its expectation $P(X_i^\ell = 1 \mid S^{obs})$. Then, the likelihood of the observed data $S^{obs}$ can be written as $\prod_{\ell=1}^{\beta} \prod_{s_i^\ell \in S^{obs}} P(X_i^\ell = s_i^\ell)$ with constraints:
$$
\begin{aligned}
P(X_i^\ell = s_i^\ell) &= 1 - \prod_{v_j \in F_i^{(t+1)}} (1 - p_{ji} s_j^\ell), && s_i^\ell = 1,\\
P(X_i^\ell = s_i^\ell) &= \prod_{v_j \in F_i^{(t+1)}} (1 - p_{ji} s_j^\ell), && s_i^\ell = 0,\\
s_i^\ell &= 1 - \prod_{v_j \in F_i^{(t+1)}} (1 - p_{ji} s_j^\ell), && s_i^\ell \in S^{mis},
\end{aligned}
\tag{4}
$$
where $F_i^{(t+1)}$ refers to the set of parent nodes of node $v_i$ in the current inferred topology $E^{(t+1)}$. Then, we can learn the optimal $\{p_{ji} \mid (v_j, v_i) \in E^{(t+1)}\}$ and the corresponding $\{s_i^\ell \in S^{mis}\}$ by solving the following optimization problem:
$$
\max \sum_{\ell=1}^{\beta} \Bigg[ \sum_{s_i^\ell \in S^{obs},\, s_i^\ell = 1} \ln\Big(1 - \prod_{v_j \in F_i^{(t+1)}} (1 - p_{ji} s_j^\ell)\Big) + \sum_{s_i^\ell \in S^{obs},\, s_i^\ell = 0} \ln \prod_{v_j \in F_i^{(t+1)}} (1 - p_{ji} s_j^\ell) \Bigg]
\tag{5}
$$
$$
\text{s.t.}\quad c_i^\ell = 0,\ \ s_i^\ell \in S^{mis}, \qquad c_i^\ell = 1 - s_i^\ell - \prod_{v_j \in F_i^{(t+1)}} (1 - p_{ji} s_j^\ell).
\tag{6}
$$
The Lagrange multiplier method can be used to solve the above problem, and the corresponding Lagrangian is
$$
L = \sum_{\ell=1}^{\beta} \Bigg[ \sum_{s_i^\ell \in S^{obs},\, s_i^\ell = 1} \ln\Big(1 - \prod_{v_j \in F_i^{(t+1)}} (1 - p_{ji} s_j^\ell)\Big) + \sum_{s_i^\ell \in S^{obs},\, s_i^\ell = 0} \ln \prod_{v_j \in F_i^{(t+1)}} (1 - p_{ji} s_j^\ell) \Bigg] + \sum_{\ell,\, s_i^\ell \in S^{mis}} \lambda_i^\ell c_i^\ell,
\tag{7}
$$
where $\lambda_i^\ell$ is the Lagrange multiplier corresponding to the constraint $c_i^\ell = 0$ ($s_i^\ell \in S^{mis}$). Given this Lagrangian $L$, its derivative w.r.t. $p_{ji}$ is
$$
\frac{\partial L}{\partial p_{ji}} = \sum_{\ell,\, s_i^\ell = 1} \frac{\prod_{v_k \in F_i^{(t+1)} \setminus v_j} (1 - p_{ki} s_k^\ell)\, s_j^\ell}{1 - \prod_{v_k \in F_i^{(t+1)}} (1 - p_{ki} s_k^\ell)} - \sum_{\ell,\, s_i^\ell = 0} \frac{s_j^\ell}{1 - p_{ji} s_j^\ell} + \sum_{\ell,\, s_i^\ell \in S^{mis}} \lambda_i^\ell \prod_{v_k \in F_i^{(t+1)} \setminus v_j} (1 - p_{ki} s_k^\ell)\, s_j^\ell,
\tag{8}
$$
and the derivative of $L$ w.r.t. $s_j^\ell \in S^{mis}$ is
$$
\frac{\partial L}{\partial s_j^\ell} = \sum_{s_i^\ell = 1,\, v_j \in F_i^{(t+1)}} \frac{\prod_{v_k \in F_i^{(t+1)} \setminus v_j} (1 - p_{ki} s_k^\ell)\, p_{ji}}{1 - \prod_{v_k \in F_i^{(t+1)}} (1 - p_{ki} s_k^\ell)} - \sum_{s_i^\ell = 0,\, v_j \in F_i^{(t+1)}} \frac{p_{ji}}{1 - p_{ji} s_j^\ell} - \lambda_j^\ell + \sum_{s_i^\ell \in S^{mis},\, v_j \in F_i^{(t+1)}} \lambda_i^\ell \prod_{v_k \in F_i^{(t+1)} \setminus v_j} (1 - p_{ki} s_k^\ell)\, p_{ji}.
\tag{9}
$$
Then, solving the optimization problem in Eq. (5) can be transformed into solving the following equations:
$$
c_i^\ell = 0,\ \ \forall\, s_i^\ell \in S^{mis}; \qquad \frac{\partial L}{\partial s_j^\ell} = 0,\ \ \forall\, s_j^\ell \in S^{mis}; \qquad \frac{\partial L}{\partial p_{ji}} = 0,\ \ \forall\, v_j \in F_i^{(t+1)}.
\tag{10}
$$
Specifically, starting with randomly initialized $p_{ji} \in (0, 1)$, Eq. (10) can be solved by the following three iterative steps.

**Step 1.** Substituting the current $p_{ji}$ into Eq. (6), we have a system of equations w.r.t. $s_i^\ell \in S^{mis}$:
$$
1 - s_i^\ell - \prod_{v_j \in F_i^{(t+1)}} (1 - p_{ji} s_j^\ell) = 0, \quad s_i^\ell \in S^{mis}.
\tag{11}
$$
As the historical diffusion processes are independent of each other, for each $\ell \in \{1, \ldots, \beta\}$ we have a system of equations:
$$
1 - s_i^\ell - \prod_{v_j \in F_i^{(t+1)}} (1 - p_{ji} s_j^\ell) = 0, \quad i \in \{i \mid s_i^\ell \in S^{mis}\}.
\tag{12}
$$
To solve Eq. (12): if, for node $v_i$ in the $\ell$-th diffusion process, every $s_j^\ell$ with $v_j \in F_i^{(t+1)}$ is observed or has already been calculated, then the corresponding $s_i^\ell \in S^{mis}$ can be calculated as
$$
s_i^\ell = 1 - \prod_{v_j \in F_i^{(t+1)}} (1 - p_{ji} s_j^\ell).
\tag{13}
$$
We repeatedly find the nodes in the $\ell$-th diffusion process that satisfy this condition and calculate the corresponding $s_i^\ell$ by Eq. (13), until no such node can be found. If there are still some unobserved $s_i^\ell$ left that cannot be directly calculated by Eq. (13), there exist cyclic dependencies among the calculation of these unobserved $s_i^\ell$. Without loss of generality, let $s_1^\ell, \ldots, s_k^\ell$ be such unobserved infection statuses; we can then calculate their values jointly by solving the following equations:
$$
1 - s_i^\ell - \prod_{v_j \in F_i^{(t+1)}} (1 - p_{ji} s_j^\ell) = 0, \quad i \in \{1, \ldots, k\}.
\tag{14}
$$
Eq. (14) is a polynomial system of equations w.r.t. $s_1^\ell, \ldots, s_k^\ell$, whose numerical solution can be obtained efficiently by existing tools, such as the fsolve and root functions in the SciPy package for Python.
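As a concrete illustration of this step, the sketch below solves a system of the form of Eq. (14) with `scipy.optimize.fsolve` (the SciPy routine the text refers to). The data layout (a parent dict and an edge-probability dict) and the function name are our own assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import fsolve

def solve_cyclic_statuses(unknown, known, parents, p):
    """Jointly solve Eq. (14) for unobserved statuses with cyclic dependencies.
    unknown: list of node ids whose status s_i^l is unobserved and cyclic
    known:   dict {node id: status} for observed / already-computed statuses
    parents: dict {node id: list of parent node ids} in the current topology
    p:       dict {(j, i): p_ji} of current propagation probabilities
    Returns a dict {node id: estimated s_i^l in [0, 1]}."""
    idx = {i: r for r, i in enumerate(unknown)}

    def residuals(x):
        # One residual per unknown: 1 - s_i - prod_j (1 - p_ji * s_j)
        res = np.empty(len(unknown))
        for r, i in enumerate(unknown):
            prod = 1.0
            for j in parents.get(i, []):
                s_j = x[idx[j]] if j in idx else known[j]
                prod *= 1.0 - p[(j, i)] * s_j
            res[r] = 1.0 - x[r] - prod
        return res

    x0 = np.full(len(unknown), 0.5)          # neutral starting point
    sol = fsolve(residuals, x0)
    return {i: float(np.clip(sol[r], 0.0, 1.0)) for i, r in idx.items()}
```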
**Step 2.** Substituting the current $p_{ji}$ and $s_j^\ell$ into the equations $\frac{\partial L}{\partial s_j^\ell} = 0$ ($s_j^\ell \in S^{mis}$), we have a system of equations w.r.t. the Lagrange multipliers $\lambda_i^\ell$. According to Eq. (9), this system of equations is linear and can be solved directly.

**Step 3.** Substituting the current $p_{ji}$, $s_j^\ell$ and $\lambda_i^\ell$ into Eq. (8), we obtain the latest gradient $\frac{\partial L}{\partial p_{ji}}$ of the Lagrangian $L$ w.r.t. $p_{ji}$, and update $p_{ji}$ with this gradient as follows:
$$
p_{ji} = p_{ji} + \theta(\tau)\, \frac{\partial L}{\partial p_{ji}},
\tag{15}
$$
where $\theta(\tau) > 0$ is the updating step size used in the $\tau$-th iteration of Steps 1–3.

Note that using variable step sizes helps strike a trade-off between convergence guarantee and speed. Large step sizes are first adopted to quickly reach the neighborhood of the optimum; then, small steps are adopted to avoid overshooting. If the optimum is static, the gradient-based approach is guaranteed to converge to a stationary point when the step sizes satisfy the relationship below [Kushner and Yin, 2003]:
$$
\lim_{\tau \to +\infty} \theta(\tau) = 0 \quad \text{and} \quad \sum_{\tau=1}^{\infty} \theta(\tau) = \infty.
\tag{16}
$$
Inspired by this, we set $\theta(\tau) = \theta_0 / \tau$, where $\theta_0 = 0.001$ works as the initial step size.

After learning the optimal $\{p_{ji} \mid (v_j, v_i) \in E^{(t+1)}\}$ and the corresponding $\{s_i^\ell \in S^{mis}\}$, LIDO executes sampling based on the learned $\{s_i^\ell \in S^{mis}\}$, which describe the probability distributions of the unobserved data.

**Complexity analysis.** In topology inference, performing the TENDS algorithm on the complete data sampled by LIDO takes about $O(m \beta n^2)$ time, where $n$ is the number of nodes in the target diffusion network, $\beta$ is the number of historical diffusion processes, and $m$ is the number of sampling rounds for unobserved data. To learn the probability distributions for sampling, the most computationally expensive process is solving Eq. (10) by our proposed three steps. The time complexities of the three steps are all bounded by $O(\beta n^3)$, and thus learning the probability distributions takes about $O(\tau \beta n^3)$ time, where $\tau$ is the number of iterations of the three steps. In brief, the time complexity of LIDO is $O(t m \beta n^2 + t \tau \beta n^3)$, where $t$ refers to the number of iterations of LIDO.

## 4 Experimental Evaluation

In this section, we first introduce the experimental setup, and then evaluate LIDO on synthetic and real-world networks with simulated infection data by investigating the effects of the network size, the network's average degree, the unobserved data ratio, the amount of diffusion processes, and the iterations of LIDO on the accuracy and running time of LIDO. Furthermore, we also carry out a case study on a real-world network with real infection data. All algorithms in the experiments are implemented in Python, running on a desktop PC with an Intel Core i3-6100 CPU at 3.70GHz and 8GB RAM.

**Experimental setup.** (1) Network: We adopt LFR benchmark graphs [Lancichinetti et al., 2008] as the synthetic networks, and generate two series of LFR benchmark graphs with the properties summarized in Table 1. In addition, we adopt two commonly used real-world networks: NetSci [Newman, 2006], a co-authorship network containing 379 scientists and 1602 co-authorships, and DUNF [Wang et al., 2014], a microblogging network containing 750 users and 2974 following relationships.
| Graphs | Number of Nodes | Average Degree |
|---|---|---|
| LFR1-5 | 100, 150, 200, 250, 300 | 4 |
| LFR6-10 | 200 | 2, 3, 4, 5, 6 |

Table 1: Properties of LFR benchmark graphs

(2) Infection data: The infection status results $S$ (or cascades) are obtained by simulating $\beta$ diffusion processes on each network with randomly selected initially infected nodes in each simulation (the ratio of initially infected nodes is 15%). In each diffusion process, each infected node tries to infect its uninfected child nodes with an infection propagation probability drawn from a Gaussian distribution with mean 0.3 and standard deviation 0.05, so that about 95% of the infection propagation probabilities fall within the range from 0.2 to 0.4 [Gan et al., 2021]. We randomly remove partial data from $S$ as the unobserved data $S^{mis}$ ($\gamma$ denotes the unobserved data ratio), and use the remaining observed data $S^{obs}$ for diffusion network reconstruction.

(3) Performance criterion: To evaluate the accuracy of the LIDO algorithm, we report the F-score (the harmonic mean of precision and recall) of its inferred directed edges, computed as
$$
\text{F-score} = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}, \quad Precision = \frac{N_{TP}}{N_{TP} + N_{FP}}, \quad Recall = \frac{N_{TP}}{N_{TP} + N_{FN}},
$$
where $N_{TP}$ is the number of true edges that are correctly inferred by the algorithm, $N_{FP}$ is the number of inferred edges that are not in the real network, and $N_{FN}$ is the number of true edges that are not inferred by the algorithm.

(4) Benchmark algorithms: We compare LIDO with a classical cascade-based approach, MulTree [Gomez-Rodriguez and Schölkopf, 2012], a high-performance infection-status-based approach, TENDS [Han et al., 2020], and POIND [Gan et al., 2021], which is the only existing approach to diffusion network reconstruction with incomplete data. In LIDO, the number $m$ of sampling rounds is set to 6, the maximum number $t$ of iterations is set to 5, and the stop condition for updating $p_{ji}$ in each iteration is that the variation of each $p_{ji}$ is less than 0.01. As TENDS requires complete data, we complete $S^{obs}$ for it with the following ad-hoc method: we estimate the average infection probability of each node from $S^{obs}$, and sample the unobserved data 6 times based on these average infection probabilities.

**Effect of diffusion network size.** To study the effect of the diffusion network size, we adopt five synthetic networks, LFR1-5, whose sizes vary from 100 to 300. We simulate 200 diffusion processes on each network ($\beta = 200$), and randomly remove 15% of the infection status observations as unobserved data ($\gamma = 0.15$).
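Before turning to the results, here is a minimal sketch of how the F-score criterion defined above can be computed from an inferred edge set; it is illustrative only, and the function name and variable names are ours.

```python
def f_score(inferred_edges, true_edges):
    """Precision/recall/F-score of an inferred directed edge set, following the
    performance criterion above. Edges are (parent, child) tuples."""
    inferred, true = set(inferred_edges), set(true_edges)
    n_tp = len(inferred & true)          # correctly inferred true edges
    n_fp = len(inferred - true)          # inferred edges not in the real network
    n_fn = len(true - inferred)          # true edges the algorithm missed
    precision = n_tp / (n_tp + n_fp) if inferred else 0.0
    recall = n_tp / (n_tp + n_fn) if true else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```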
Figure 1: Effect of diffusion network size ((a) F-score; (b) running time).

Fig. 1 illustrates the F-score and running time of each algorithm, from which we have the following observations:

(1) LIDO outperforms the three benchmark algorithms in terms of accuracy.

(2) A larger diffusion network size tends to degrade the accuracy of MulTree, while the accuracies of LIDO, TENDS, and POIND are reasonably insensitive to the size of the diffusion network. TENDS's insensitivity to diffusion network size has also been verified in an existing study [Han et al., 2020]. This is one of the reasons why LIDO adopts TENDS for topology inference on the completed data.

(3) The running time of each tested algorithm tends to increase with the growth of the diffusion network size, and TENDS is faster than LIDO as there is no iteration in TENDS.

**Effect of average degree of diffusion network.** To study the effect of the average degree of the diffusion network, we test the algorithms on five synthetic networks, LFR6-10, whose average degree varies from 2 to 6. We simulate 200 diffusion processes on each network ($\beta = 200$), and randomly remove 15% of the infection status observations as unobserved data ($\gamma = 0.15$).

Figure 2: Effect of average degree of diffusion network ((a) F-score; (b) running time).

Fig. 2 illustrates the F-score and running time of each algorithm, from which we have the following observations:

(1) Among the tested algorithms, LIDO has the best accuracy.

(2) With the increase of the average degree, the accuracy of LIDO decreases slightly. This is because a greater average degree often brings more complicated influence relationships between nodes, and thus adds complexity to diffusion network reconstruction.

(3) The running time of each tested algorithm increases with the growth of the average degree.

**Effect of unobserved data ratio.** To study the effect of the ratio of unobserved data, we test the algorithms on two real-world networks, NetSci and DUNF, with $\beta = 200$, varying the unobserved data ratio $\gamma$ from 0 to 0.2.

Figure 3: Effect of unobserved data ratio ((a) F-score on NetSci; (b) running time on NetSci; (c) F-score on DUNF; (d) running time on DUNF).

Fig. 3 illustrates the F-score and running time of each algorithm, from which we have the following observations:

(1) Compared with MulTree and POIND, LIDO often achieves higher accuracy.

(2) LIDO and TENDS have the same accuracy in the setting of $\gamma = 0$; when there is more unobserved data, LIDO shows a greater advantage in accuracy over TENDS.
(3) The increase of the unobserved data ratio has a rather mild effect on the running time of MulTree and TENDS, but results in significantly longer running times for LIDO and POIND. This is because LIDO and POIND require more time to infer more unobserved data, while MulTree and TENDS use the completed data directly.

**Effect of amount of diffusion processes.** To study the effect of the amount $\beta$ of diffusion processes, we test the algorithms on NetSci and DUNF with different $\beta$, varying from 100 to 300. In the diffusion results obtained with each $\beta$, we randomly remove 15% of the infection status observations as unobserved data ($\gamma = 0.15$).

Figure 4: Effect of amount of diffusion processes ((a) F-score on NetSci; (b) running time on NetSci; (c) F-score on DUNF; (d) running time on DUNF).

Fig. 4 illustrates the F-score and running time of each algorithm, from which we have the following observations:

(1) With more diffusion processes, the tested algorithms obtain more accurate results, and LIDO leads its competitors in accuracy.

(2) To analyze the cascades observed from more diffusion processes, MulTree requires more running time, while a greater amount of diffusion processes does not always increase the running time of LIDO, TENDS and POIND.

**Effect of iterations of LIDO.** To study the effect of the iterations of LIDO, we test LIDO on NetSci and DUNF with $\beta = 200$ and $\gamma = 0.15$, and report the accuracy and accumulated running time of LIDO at the end of each iteration.

Figure 5: Effect of iterations of LIDO ((a) F-score; (b) running time; (c) accuracy of estimated $S^{mis}$; (d) MAE of learned $p_{ji}$ over the iterations of updating $p_{ji}$).

Fig. 5(a) & (b) illustrate the testing results, from which we can observe that (1) LIDO shows a fast convergence property, and (2) more iterations result in longer running time for LIDO. The main reason behind the fast convergence is that LIDO can accurately estimate the values of the unobserved data. To verify this point, Fig. 5(c) illustrates the accuracy (i.e., the rate of correctly estimated values) of LIDO's estimated values for the unobserved data $S^{mis}$ in each iteration, from which we can observe that the accuracy of the estimated values increases and converges to a stable level quickly. Another important reason is the effectiveness of LIDO in revealing the underlying infection propagation probabilities, which in turn helps LIDO estimate the probability distributions and the values of the unobserved data. To demonstrate this effectiveness, Fig. 5(d) illustrates the MAE (Mean Absolute Error) of the infection propagation probabilities $\{p_{ji} \mid (v_j, v_i) \in E^{(t+1)}\}$ learned in the first iteration of LIDO (i.e., $t = 1$). A lower MAE indicates a higher accuracy. From Fig. 5(d), we can observe that the accuracy of the learned $p_{ji}$ is significantly improved by iteratively performing our proposed learning approach. Similar results can be observed when $t > 1$.

**Case study on real infection data.** To verify the effectiveness of LIDO in real scenarios, we adopt the commonly used real-world data set MemeTracker (http://memetracker.org), which contains more than 96 million news articles and blog posts from 1 million sources over a period of nine months, from August 1, 2008 to April 30, 2009.
As the ground truth of the network topology is unknown for the MemeTracker data set, we generate a MemeTracker network with 1000 media sites and blogs and 4000 directed hyperlinks as follows. We first extract the top 1000 media sites and blogs with the largest number of documents as the 1000 network nodes, and then add a directed edge $(v_i, v_j)$ if a post on network node $v_i$ linked to a post on network node $v_j$. Hyperlinks between blog posts can be used to trace diffusion processes [Leskovec et al., 2007]. When a blog publishes a piece of information and uses hyperlinks to refer to posts published by other blogs, we consider such activities as events of information propagation. By tracing these information propagation events between the nodes of our generated MemeTracker network, we identify 13000 infection results $S$ and cascades of the corresponding diffusion processes. Furthermore, to create incomplete data, we randomly remove partial data from $S$ as the unobserved data $S^{mis}$ (the unobserved data ratio $\gamma$ varies from 0 to 0.2), and use the remaining observed data $S^{obs}$ for diffusion network reconstruction.

Figure 6: Case study on MemeTracker data set ((a) F-score; (b) running time).

Fig. 6 illustrates the F-score and running time of each tested algorithm, from which we have the following observations:

(1) A greater unobserved data ratio often degrades the accuracy of each tested algorithm, while LIDO leads its competitors in accuracy.

(2) The increase of the unobserved data ratio results in longer running times for LIDO, TENDS and POIND, while having a rather mild effect on the running time of MulTree.

(3) MulTree tends to be relatively more efficient than the other tested algorithms. This is because most of the diffusion processes identified from the MemeTracker data set start from a single media or blog site, with tree-shaped propagation paths, which are more amenable to MulTree.

## 5 Conclusion

In this paper, we have investigated the problem of reconstructing the topology of a diffusion network with incomplete observations of node infection statuses, and presented an effective approach called LIDO that solves this problem in an iterative way. In each iteration, LIDO infers the topology of the target diffusion network based on the observed data and estimated values of the unobserved data, and learns the optimal infection propagation probabilities and the corresponding probability distributions of the unobserved data w.r.t. the inferred topology. The learned probability distributions are used by LIDO to update the estimates of the unobserved data for the next iteration of topology inference. Extensive experimental results on both synthetic and real-world networks show that LIDO can properly handle incomplete observation data and reconstruct diffusion network topologies accurately.

## Acknowledgments

This work was supported in part by NSFC Grants 61976163 and 61902284. Ting Gan is the corresponding author.

## References

[Amin et al., 2014] Kareem Amin, Hoda Heidari, and Michael Kearns. Learning from contagion (without timestamps). In ICML 2014, pages 1845–1853, 2014.

[Daneshmand et al., 2014] Hadi Daneshmand, Manuel Gomez-Rodriguez, Le Song, and Bernhard Schölkopf.
Estimating diffusion network structures: Recovery conditions, sample complexity & soft-thresholding algorithm. In ICML 2014, pages 793–801, 2014.

[Du et al., 2012] Nan Du, Le Song, Alex Smola, and Ming Yuan. Learning networks of heterogeneous influence. In NIPS 2012, pages 2780–2788, 2012.

[Gan et al., 2021] Ting Gan, Keqi Han, Hao Huang, Shi Ying, Yunjun Gao, and Zongpeng Li. Diffusion network inference from partial observations. In AAAI 2021, pages 7493–7500, 2021.

[Gomez-Rodriguez and Schölkopf, 2012] Manuel Gomez-Rodriguez and Bernhard Schölkopf. Submodular inference of diffusion networks from multiple trees. In ICML 2012, pages 489–496, 2012.

[Gomez-Rodriguez et al., 2010] Manuel Gomez-Rodriguez, Jure Leskovec, and Andreas Krause. Inferring networks of diffusion and influence. In KDD 2010, pages 1019–1028, 2010.

[Gomez-Rodriguez et al., 2011] Manuel Gomez-Rodriguez, David Balduzzi, and Bernhard Schölkopf. Uncovering the temporal dynamics of diffusion networks. In ICML 2011, pages 561–568, 2011.

[Gomez-Rodriguez et al., 2013] Manuel Gomez-Rodriguez, Jure Leskovec, and Bernhard Schölkopf. Modeling information propagation with survival theory. In ICML 2013, pages 666–674, 2013.

[Han et al., 2020] Keqi Han, Yuan Tian, Yunjia Zhang, Ling Han, Hao Huang, and Yunjun Gao. Statistical estimation of diffusion network topologies. In ICDE 2020, pages 625–636, 2020.

[He et al., 2016] Xinran He, Ke Xu, David Kempe, and Yan Liu. Learning influence functions from incomplete observations. In NIPS 2016, pages 2065–2073, 2016.

[Huang et al., 2019] Hao Huang, Qian Yan, Ting Gan, Di Niu, Wei Lu, and Yunjun Gao. Learning diffusions without timestamps. In AAAI 2019, pages 582–589, 2019.

[Huang et al., 2021] Hao Huang, Qian Yan, Lu Chen, Yunjun Gao, and Christian S. Jensen. Statistical inference of diffusion networks. IEEE Transactions on Knowledge and Data Engineering, 33(2):742–753, 2021.

[Kalimeris et al., 2018] Dimitris Kalimeris, Yaron Singer, Karthik Subbian, and Udi Weinsberg. Learning diffusion using hyperparameters. In ICML 2018, pages 2420–2428, 2018.

[Kempe et al., 2003] David Kempe, Jon M. Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In KDD 2003, pages 137–146, 2003.

[Kushner and Yin, 2003] Harold J. Kushner and G. George Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer, 2003.

[Lancichinetti et al., 2008] Andrea Lancichinetti, Santo Fortunato, and Filippo Radicchi. Benchmark graphs for testing community detection algorithms. Physical Review E, 78(4), 2008.

[Leskovec et al., 2007] Jure Leskovec, Mary McGlohon, Christos Faloutsos, Natalie S. Glance, and Matthew Hurst. Patterns of cascading behavior in large blog graphs. In SDM 2007, pages 551–556, 2007.

[Lokhov, 2016] Andrey Y. Lokhov. Reconstructing parameters of spreading models from partial observations. In NIPS 2016, pages 3467–3475, 2016.

[Myers and Leskovec, 2010] Seth A. Myers and Jure Leskovec. On the convexity of latent social network inference. In NIPS 2010, pages 1741–1749, 2010.

[Narasimhan et al., 2015] Harikrishna Narasimhan, David C. Parkes, and Yaron Singer. Learnability of influence in networks. In NIPS 2015, pages 3186–3194, 2015.

[Newman, 2006] M. E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.

[Pouget-Abadie and Horel, 2015] Jean Pouget-Abadie and Thibaut Horel. Inferring graphs from cascades: A sparse recovery framework. In ICML 2015, pages 977–986, 2015.
[Rong et al., 2016] Yu Rong, Qiankun Zhu, and Hong Cheng. A model-free approach to infer the diffusion network from event cascade. In CIKM 2016, pages 1653–1662, 2016.

[Sefer and Kingsford, 2015] Emre Sefer and Carl Kingsford. Convex risk minimization to infer networks from probabilistic diffusion data at multiple scales. In ICDE 2015, pages 663–674, 2015.

[Wang et al., 2014] Senzhang Wang, Xia Hu, Philip S. Yu, and Zhoujun Li. MMRate: Inferring multi-aspect diffusion networks with multi-pattern cascades. In KDD 2014, pages 1246–1255, 2014.

[Wilinski and Lokhov, 2020] Mateusz Wilinski and Andrey Y. Lokhov. Scalable learning of independent cascade dynamics from partial observations. arXiv preprint, arXiv:2007.06557, 2020.