# Network Diffusions via Neural Mean-Field Dynamics

Shushan He (Mathematics & Statistics, Georgia State University, Atlanta, Georgia, USA; she4@gsu.edu)
Hongyuan Zha (School of Data Science, Shenzhen Research Institute of Big Data, CUHK, Shenzhen, China; zhahy@cuhk.edu.cn)
Xiaojing Ye (Mathematics & Statistics, Georgia State University, Atlanta, Georgia, USA; xye@gsu.edu)

We propose a novel learning framework based on neural mean-field dynamics for simultaneous inference and estimation problems of diffusions on networks. Our new framework is derived from the Mori-Zwanzig formalism to obtain an exact evolution of the node infection probabilities, which yields a delay differential equation whose memory integral is approximated by learnable time convolution operators, resulting in a highly structured and interpretable RNN. Directly using cascade data, our framework can jointly learn the structure of the diffusion network and the evolution of infection probabilities, which are cornerstones of important downstream applications such as influence maximization. Connections between parameter learning and optimal control are also established. Empirical study shows that our approach is versatile and robust to variations of the underlying diffusion network models, and significantly outperforms existing approaches in accuracy and efficiency on both synthetic and real-world data.

1 Introduction

Continuous-time information diffusion on heterogeneous networks is a prevalent phenomenon [4, 39, 43]. News spreading on social media [13, 15, 49], viral marketing [23, 25, 52], computer malware propagation, and epidemics of contagious diseases [3, 36, 43, 47] are all examples of diffusion on networks, among many others. For instance, a piece of information (such as a tweet) can be retweeted by users (nodes) with followee-follower relationships (edges) on the Twitter network. We call a user infected if she retweets; her followers see her retweet and can also become infected if they retweet in turn, and so on. Such information diffusion mimics an epidemic, where an infectious virus spreads to individuals (human, animal, or plant) and then to many others upon close contact.

In this paper, we are mainly concerned with the estimation of individual node infection probabilities as well as the inference of the underlying diffusion network structure directly from cascade data of historical diffusion events on the network. For infection probability estimation, our goal is to compute the evolution of the probability of each node being infected during a diffusion initiated from a set of source nodes. For network structure inference, we aim to learn the edges as well as the strength of interactions (through the edges) between the nodes of the diffusion network. Not surprisingly, both problems are very challenging due to the extremely large scale of modern networks, the heterogeneous inter-dependencies among the nodes, and the randomness exhibited in cascade data. Most existing works focus on one problem only, e.g., either solely inferring the network structure from cascade data, or estimating influence without providing insights into the underlying network structure.

We propose a novel learning framework, called neural mean-field (NMF) dynamics, to simultaneously tackle both the estimation and inference problems mentioned above. Specifically: (i) We develop a neural mean-field dynamics framework to model the evolution of diffusion on a network.
Our new framework is derived from the Mori-Zwanzig formalism to obtain an exact time evolution of the node infection probabilities with dimension linear in the network size; (ii) We show that the memory term of the Mori-Zwanzig equation can be approximated by a trainable convolution network, which turns the dynamical system into a delay differential equation. We also show that the time discretization of this system reduces to a recurrent neural network. The approximate system is highly interpretable; in particular, the training accepts sample cascades as input and returns both individual infection probability estimates (and hence the influence function) and structural information of the diffusion network as outputs; (iii) We show that parameter learning in NMF can be reduced to an optimal control problem in which the parameters act as a time-invariant control that maximizes the total Hamiltonian of the system; and (iv) Our empirical analysis shows that our approach is robust to variations of the unknown underlying diffusion models, and that it significantly outperforms existing approaches on both synthetic and real-world diffusion networks.

The remainder of this paper is organized as follows. In Section 2, we introduce the diffusion network models and related background, including the influence prediction and structure inference problems. In Section 3, we develop the proposed framework of neural mean-field dynamics for inference and prediction on diffusion networks, as well as an optimal control formulation for parameter learning. We demonstrate the performance of the proposed method on influence estimation and maximization for a variety of synthetic and real-world networks in Section 4. A discussion of related work is given in Section 5. Section 6 concludes the paper.

2 Preliminaries on Diffusion Networks

Throughout this paper, we use boldfaced lower (upper) case letters to denote vectors (matrices) or vector-valued (matrix-valued) functions, and (·)_k (or (·)_ij) for the kth component (or (i, j)-th entry). All vectors are column vectors unless otherwise noted. We follow Matlab syntax and use [x; y] to denote the vector that stacks x and y vertically. We denote the inner product by x^⊤ y and component-wise multiplication by x ⊙ y. Time is denoted by t, either continuous (t ∈ [0, T]) or discrete (t = 0, 1, …, T), for some time horizon T ∈ ℝ_+ (∈ ℕ in the discrete case). The derivative ẋ is with respect to t, and the gradient ∇_x is with respect to x. Probability is denoted by Pr(·), and expectation with respect to X (or p_X) is denoted by E_X[·].

Diffusion network models. Consider a diffusion network model, which consists of a network (directed graph) G = (V, E) with node set V = [n] := {1, …, n} and edge set E ⊆ V × V, and a diffusion model that describes the distribution p(t; α_ij) of the time t that node i takes to infect a healthy neighbor j ∈ {j' : (i, j') ∈ E}, for every (i, j) ∈ E. Then, given a source (seed) set S of nodes that are infected at time 0, they infect their healthy neighbors with infection times following p, the infected neighbors then infect their own healthy neighbors, and so on, so that the infection initiated from S at time 0 propagates to other nodes of the network. Typical diffusion network models are assumed to be progressive, meaning that infected nodes cannot recover, and the infections on different edges are independent.
For example, the standard diffusion model with exponential distribution p(t; α) = α e^{−α t} is the most widely used; other distributions can also be considered, as is done in this paper. For simplicity, we focus on single-parameter distributions, or distributions with multiple parameters of which only one varies across edges, so that the parameter α_ij ≥ 0 indicates the strength of the impact node i has on node j.

Cascade data. Observation data D of a diffusion network are often in the form of sample cascades D := {C_k = (S_k, τ_k) ∈ 2^V × ℝ^n_+ : k ∈ [K]}, where the kth cascade C_k records its source set S_k ⊆ V and the times (τ_k)_i ≥ 0 at which node i was infected (if i was not infected during C_k then (τ_k)_i = ∞). We also equate C_k with {x̂^(k)(t) ∈ {0, 1}^n : t ≥ 0}, such that (x̂^(k)(t))_i = 1 if node i is in the infected state at time t and 0 otherwise. For example, x̂^(k)(0) = χ_{S_k}, where (χ_{S_k})_i = 1 if i ∈ S_k and 0 otherwise. Such cascade data are collected from historical events for training purposes.

Influence prediction and inference of diffusion networks. Given the network G = (V, E), the diffusion model, and A, where (A)_ji = α_ij is the parameter of p(t; α_ij) for edge (i, j), influence prediction (or influence estimation) is to compute

x(t; χ_S) = [x_1(t; χ_S), …, x_n(t; χ_S)]^⊤ ∈ [0, 1]^n    (1)

for all times t ≥ 0 and any source set S ⊆ V. In (1), x_i(t; χ_S) is the probability of node i being infected at time t given a source set S (not necessarily observed as a source set in D). Note that we use χ_S and S interchangeably hereafter. The probability x(t; χ_S) can also be used to compute the influence function σ(t; S) := 1_n^⊤ x(t; χ_S), the expected number of infected nodes at time t. Note that an analytic solution of (1) is intractable due to the exponentially large state space of the complete dynamical system of the diffusion problem [19, 48]. Network inference, on the other hand, refers to learning the network connectivity E and A given cascade data D. The matrix A contains the distribution parameters if the diffusion model p is given, or it simply measures qualitatively the strength of the impact of node i on j if no specific p is known. Influence prediction may also require network inference when only cascade data D are available, resulting in a two-stage approach: network inference is performed first to learn the network structure E and the diffusion model parameters A, and influence estimation is then used to compute the influence of the source set S. However, approximation errors and biases in the two stages will certainly accumulate. Alternatively, one can use a one-stage approach that directly estimates x(t; χ_S) of any S from the cascade data D, which is more versatile and less prone to diffusion model misspecification. Our method is one such one-stage method. Additionally, it allows knowledge of E and/or A, if available, to be integrated for further performance improvement.

Influence maximization. Given cascade data D, influence maximization is to find the source set S that generates the maximal influence σ(t; S) at time t among all subsets of size at most n_0, where t > 0 and 1 ≤ n_0 < n are prescribed. Namely, influence maximization can be formulated as

max_S σ(t; S),  s.t. S ⊆ V, |S| ≤ n_0.    (2)

There are two main ingredients in an influence maximization method for solving (2): an influence prediction subroutine that evaluates the influence σ(t; S) for any given source set S, and an (approximate) combinatorial optimization solver that repeatedly calls this subroutine to find the optimal set S of (2).
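A minimal sketch of such a greedy solver is given below (Python; the influence-estimation subroutine sigma_hat is treated as a black box, for example a forward pass of a trained estimator, and the toy estimator and all names are illustrative assumptions rather than the implementation used in this paper).

```python
import math

def greedy_influence_max(nodes, n0, sigma_hat, t):
    """Greedy solver for (2): repeatedly add the node with the largest
    marginal gain in estimated influence sigma_hat(t, S)."""
    S = set()
    for _ in range(n0):
        base = sigma_hat(t, S)
        gains = {v: sigma_hat(t, S | {v}) - base for v in nodes if v not in S}
        S.add(max(gains, key=gains.get))
    return S

# Toy stand-in for the influence estimator: grows with |S| but saturates
# (submodular), which is the regime where the greedy guarantee of [38] applies.
def toy_sigma(t, S):
    return math.sqrt(len(S)) * min(t, 10.0)

print(greedy_influence_max(range(128), n0=5, sigma_hat=toy_sigma, t=20.0))
```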
The combinatorial optimization problem is NP-hard and is often approximately solved by greedy algorithms, which come with a guaranteed sub-optimality gap when σ(t; S) is submodular in S. In our experiments, we show that a standard greedy approach equipped with our proposed influence estimation method outperforms other state-of-the-art influence maximization algorithms.

3 Neural Mean-Field Dynamics

Modeling diffusion by stochastic jump processes. We begin with the jump process formulation of network diffusion. Given a source set χ_S, let X_i(t; χ_S) denote the infection status of node i at time t; namely, X_i(t) = 1 if node i is infected by time t, and 0 otherwise. Then {X_i(t) : i ∈ [n]} is a set of n coupled jump processes, such that X_i(t; χ_S) jumps from 0 to 1 when node i is infected by any of its infected neighbors at time t. Let λ*_i(t) be the conditional intensity of X_i(t; χ_S) given the history H(t) = {X_i(s; χ_S) : s ≤ t, i ∈ [n]}, i.e.,

λ*_i(t) := lim_{τ→0+} E[X_i(t + τ; χ_S) − X_i(t; χ_S) | H(t)] / τ.    (3)

Note that the numerator of (3) is also the conditional probability Pr(X_i(t + τ) = 1, X_i(t) = 0 | H(t)) for any τ > 0. In influence prediction, our goal is to compute the probability x(t; χ_S) = [x_i(t; χ_S)] in (1), which is the expectation of X_i(t; χ_S) conditioned on H(t):

x_i(t; χ_S) = E_{H(t)}[X_i(t; χ_S) | H(t)].    (4)

To this end, we adopt the following notation (for simplicity we temporarily drop χ_S in this subsection, as the source set S is arbitrary but fixed):

x_I(t) = E_{H(t)}[∏_{i∈I} X_i(t; χ_S) | H(t)],  y_I(t) = ∏_{i∈I} x_i(t),  e_I(t) = x_I(t) − y_I(t)    (5)

for any I ⊆ [n] with |I| ≥ 2. Then we can derive the evolution of z := [x; e]. Here x(t) ∈ [0, 1]^n is the resolved variable, whose value is of interest and whose samples can be directly observed from the cascade data D, and e(t) = […, e_I(t), …]^⊤ ∈ ℝ^{N−n}, where N = 2^n − 1, is the unresolved variable that captures all the second and higher order moments. The complete evolution equation of z is given in the following theorem, whose proof is provided in Appendix B.1.

Theorem 1. The evolution of z(t) = [x(t); e(t)] follows the nonlinear differential equation ż = f(z), where

f(z) = f(x, e) = [f(x; A) − (A ⊙ E) 1_n; …; f_I(x, e); …],    (6)

with initial value z_0 = [χ_S; 0] ∈ ℝ^N, E = [e_ij] ∈ ℝ^{n×n}, and

f(x; A) = A x − diag(x) A x,    (7)

f_I(x, e) = ∑_{i∈I} ∑_{j∉I} α_ji (y_I − y_{I∪{j}} + e_I − e_{I∪{j}}) − ∑_{i∈I} y_{I∖{i}} ∑_{j≠i} α_ji (x_j − y_{ij} − e_{ij}).    (8)

The evolution (6) holds exactly for the standard diffusion model with exponential distribution, and it also approximates other distributions p well, as shown in the empirical study below. In either case, the dimension N of z grows exponentially in the network size n and hence renders the computation infeasible in practice. To overcome this issue, we employ the Mori-Zwanzig formalism [7] to derive a reduced-order model of x with dimension n only.
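Before deriving the memory closure, it is instructive to see what the drift f(x; A) alone predicts: setting e ≡ 0 in (6) closes the system at first order, ẋ = f(x; A) = Ax − diag(x)Ax. The sketch below (Python/NumPy; illustrative only, not the proposed method) integrates this naive mean-field approximation by forward Euler from a source indicator χ_S.

```python
import numpy as np

def streaming_term(x, A):
    """Mean-field drift (7): f(x; A) = A x - diag(x) A x."""
    Ax = A @ x
    return Ax - x * Ax

def naive_mean_field(A, source, T=20.0, dt=0.1):
    """Forward-Euler integration of xdot = f(x; A) with x(0) = chi_S,
    i.e., the first-order closure that drops all correlation corrections e."""
    x = np.zeros(A.shape[0])
    x[list(source)] = 1.0
    for _ in range(int(T / dt)):
        x = np.clip(x + dt * streaming_term(x, A), 0.0, 1.0)
    return x

rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, (8, 8)) * (rng.random((8, 8)) < 0.3)  # toy rate matrix
x_T = naive_mean_field(A, source={0, 3})
print(x_T, x_T.sum())  # infection probability estimates and influence at time T
```

The gap between these first-order estimates and the exact probabilities is precisely what the memory term derived next accounts for.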
Mori-Zwanzig memory closure. We employ the Mori-Zwanzig (MZ) formalism [7], which allows us to introduce a generalized Langevin equation (GLE) for the x part of the dynamics (6). The GLE of x is derived from the original equation (6) describing the evolution of z = [x; e], while retaining the effect of the unresolved part e. This is particularly useful in our case, as we only need x for infection probability estimation and influence prediction. Define the Liouville operator L such that L[g](z) := f(z) · ∇_z g(z) for any real-valued function g of z. Let e^{tL} be the Koopman operator associated with L, so that e^{tL} g(z(0)) = g(z(t)), where z(t) solves (6). Then L is known to satisfy the semi-group property for all g, i.e., e^{tL} g(z) = g(e^{tL} z). Now consider the projection operator P defined as the truncation (P g)(z) = (P g)([x; e]) = g([x; 0]) for any z = [x; e], and its orthogonal complement Q = I − P, where I is the identity operator. The following theorem describes the exact evolution of x(t); the proof is given in Appendix B.2.

Theorem 2. The evolution of x specified in (6) can also be described by the following GLE:

ẋ = f(x; A) + ∫_0^t k(t − s, x(s)) ds,    (9)

where f is given in (7) and k(t, x) := P L e^{tQL} Q L x.

Note that (9) is not an approximation; it is an exact representation of the x part of the original problem (6). Equation (9) can be interpreted as a mean-field equation, where the two terms on the right-hand side are called the streaming term (corresponding to the mean-field dynamics) and the memory term, respectively. The streaming term provides the main drift of the evolution, and the memory term, in convolution form, provides a vital adjustment. This inspires us to approximate the memory term by a time convolution of x, which naturally yields a delay differential equation and further reduces to a structured recurrent neural network (RNN) after discretization, as shown in the next subsection.

Delay differential equation and RNN. To compute the evolution (9) of x, we approximate the Mori-Zwanzig memory term by a neural net ε applied to a time convolution of x:

∫_0^t k(t − s, x(s)) ds ≈ ε(x(t), h(t); η),  where h(t) = ∫_0^t K(t − s; w) x(s) ds.    (10)

In (10), K(·; w) is a convolution operator with parameter w, and ε(x, h; η) is a deep neural net with (x, h) as input and η as parameters. Both w and η are to be trained from the cascade data D. Hence, (9) reduces to a delay differential equation that involves a time integral h(t) of past x:

ẋ = f(x, h; θ) := f(x; A) + ε(x, h; η).    (11)

The initial condition of (11) with source set S is given by

x(0) = χ_S,  h(0) = 0,  and  x(t) = h(t) = 0 for t < 0.    (12)

We call the system (11) with initial condition (12) the neural mean-field (NMF) dynamics. The delay differential equation (11) is equivalent to a coupled system in (x, h). In addition, we show in the following theorem that the discretization of this system reduces to a structured recurrent neural network if K(t; w) is a (linear combination of) matrix convolutions.

Theorem 3. The delay differential equation (11) is equivalent to the coupled system

ẋ = f(x, h; A, η) = f(x; A) + ε(x, h; η),    (13a)
h(t) = ∫_0^t K(t − s; w) f(x(s), h(s); A, η) ds,    (13b)

with initial condition (12). In particular, if K(t; w) = ∑_{l=1}^L B_l e^{−C_l t} for some L ∈ ℕ with w = {(B_l, C_l) : B_l C_l = C_l B_l, l ∈ [L]}, then (13) can be solved by the non-delay system consisting of (13a) and ḣ = ∑_{l=1}^L (B_l x − C_l h). The discretization of this system (with step size normalized to 1) reduces to an RNN with hidden layers (x_t, h_t) for t = 0, 1, …, T − 1:

x_{t+1} = x_t + f(x_t; A) + ε(x_t, h_t; η),    (14a)
h_{t+1} = h_t + ∑_{l=1}^L (B_l x_{t+1} − C_l h_t),    (14b)

where the input is given by x_0 = χ_S and h_0 = 0.

The proof is given in Appendix B.3. The matrices B_l and C_l in (14b) are the weights on x_{t+1} and h_t in the linear transformation, and the neural network ε wraps (x_t, h_t) to approximate the nonlinear effect of the memory term in (10). Here we consider a convolution kernel K(·; w) that is more general than the exponential kernel. Note that, in practice, the convolution weight K on older states x in (10) diminishes rapidly, and hence the memory kernel K can be well approximated using a truncated history of finite length τ > 0 (τ ∈ ℕ after discretization).
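A minimal sketch of the discretized updates (14a)-(14b) follows (Python/NumPy, with a small two-layer network standing in for ε; all shapes, random initializations, and the clipping to [0, 1] are illustrative assumptions rather than the trained model).

```python
import numpy as np

def nmf_cell(x, h, A, Bs, Cs, W1, W2):
    """One step of the discretized NMF dynamics."""
    drift = A @ x - x * (A @ x)                        # streaming term f(x; A) in (7)
    eps = W2 @ np.tanh(W1 @ np.concatenate([x, h]))    # neural memory closure eps(x, h; eta)
    x_next = np.clip(x + drift + eps, 0.0, 1.0)        # (14a), clipped for numerical safety
    h_next = h + sum(B @ x_next - C @ h for B, C in zip(Bs, Cs))  # (14b)
    return x_next, h_next

n, L, hidden = 8, 2, 16
rng = np.random.default_rng(1)
A = rng.uniform(0.0, 0.2, (n, n))
Bs = [0.1 * rng.standard_normal((n, n)) for _ in range(L)]
Cs = [0.1 * np.eye(n) for _ in range(L)]
W1 = 0.1 * rng.standard_normal((hidden, 2 * n))
W2 = 0.1 * rng.standard_normal((n, hidden))

x, h = np.zeros(n), np.zeros(n)
x[[0, 3]] = 1.0                                        # x_0 = chi_S, h_0 = 0
for t in range(20):
    x, h = nmf_cell(x, h, A, Bs, Cs, W1, W2)
print(x)  # estimated infection probabilities after 20 steps
```

In the actual model, A, the kernel weights (B_l, C_l) (or more generally w), and the parameters η of ε are all learned from cascades, as described next.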
Using such a truncated history, we substitute (14b) by

h_t = K^w m_t,  where K^w = [K^w_0, …, K^w_τ] and m_t = [x_t; …; x_{t−τ}].    (15)

We then formulate the evolution of the augmented state m_t defined in (15) and follow (14a) to obtain a single evolution equation for m_t, t = 0, …, T − 1:

m_{t+1} = g(m_t; θ),  where g(m; θ) := [J_0 m + f(J_0 m, K^w m; θ); J_0 m; …; J_{τ−1} m],    (16)

and J_s := […, I, …] ∈ ℝ^{n×(τ+1)n} has the identity I as its (s+1)th block and 0 elsewhere (thus J_s m_t extracts the (s+1)th block x_{t−s} of m_t) for s = 0, …, τ − 1. If (14b) is used instead, a simpler augmented state m_t = [x_t; h_t] can be formed similarly; we omit the details here. We will use the dynamics (16) of the augmented state m_t in the training below.

An optimal control formulation of parameter learning. Now we consider the training of the network parameters θ = (A, η, w) of (16) using cascade data D. Given a sample cascade x̂ = (S, τ) from D, we can observe its values in {0, 1}^n at each of the time points t = 1, …, T and obtain the corresponding infection states, i.e., x̂ = {x̂_t ∈ {0, 1}^n : t ∈ [T]} (see Section 2). Maximizing the log-likelihood of x̂ under the dynamics x_t = x_t(θ) ∈ [0, 1]^n induced by θ is equivalent to minimizing the loss function

ℓ(x, x̂) = −∑_{t=1}^T [x̂_t^⊤ log x_t + (1 − x̂_t)^⊤ log(1 − x_t)],    (17)

where the logarithm is taken component-wise. We can add a regularization term r(θ) to (17) to impose prior knowledge or constraints on θ. In particular, if E is known, we can enforce the constraint that A be supported on E only. Otherwise, we can add ‖A‖_1 or ‖A‖_0 (the ℓ_1 or ℓ_0 norm of the vectorized A) if E is expected to be sparse. In general, A can be interpreted as the convolution to be learned in a graph convolution network (GCN) [26, 53]. The support and the magnitudes of A reveal the network structure and the strength of interactions between nodes, respectively. We provide more details of our numerical implementation in Section 4 and Appendix D.1.

The optimal parameter θ can be obtained by minimizing the loss (17) subject to the NMF dynamics (16). This procedure can also be cast as an optimal control problem, namely, to find the θ that steers m_t to fit the data D through the NMF dynamics (16):

min_θ J(θ) := (1/K) ∑_{k=1}^K ℓ(x^(k), x̂^(k)) + r(θ)    (18a)
s.t. m^(k)_{t+1} = g(m^(k)_t; θ),  m^(k)_0 = [χ_{S_k}; 0; …; 0],  t = 0, …, T − 1, k ∈ [K],    (18b)

where x^(k)_t = J_0 m^(k)_t for all t and k.
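A minimal sketch of the loss (17) is given below (Python/NumPy; the clipping constant and the toy arrays are illustrative). In (18a) this quantity is averaged over the K training cascades and combined with a regularizer such as r(θ) = λ‖A‖_1, while the same θ = (A, η, w) is applied at every time step of the forward dynamics.

```python
import numpy as np

def nmf_loss(x_traj, x_hat_traj, eps=1e-6):
    """Loss (17): negative log-likelihood (binary cross-entropy) between the
    predicted infection probabilities x_t and the observed binary states
    x_hat_t, summed over t = 1, ..., T and over nodes."""
    x = np.clip(np.asarray(x_traj, dtype=float), eps, 1.0 - eps)
    x_hat = np.asarray(x_hat_traj, dtype=float)
    return -np.sum(x_hat * np.log(x) + (1.0 - x_hat) * np.log(1.0 - x))

# Toy check: predictions on a 4-node network over T = 5 steps against a
# cascade in which only node 0 is ever infected.
x_traj = np.full((5, 4), 0.3); x_traj[:, 0] = 0.9
x_hat_traj = np.zeros((5, 4)); x_hat_traj[:, 0] = 1.0
print(nmf_loss(x_traj, x_hat_traj))
```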
The problem of optimal control has been well studied in both the continuous and the discrete case over the past decades [2]. In particular, discrete optimal control with nonlinear difference equations and the associated maximum principle have been extensively exploited. Recently, an optimal control viewpoint of deep learning has been proposed [32]: the parameters of a neural network play the role of the control variable in a discretized differential equation, and training these parameters so that the network output minimizes the loss function can be viewed as finding the optimal control that minimizes the objective at the terminal state. Pontryagin's Maximum Principle (PMP) provides an important optimality condition for the optimal control [2, 32]. In standard optimal control, the control variable can be chosen freely in the allowed set at any given time t, which is a key ingredient in the proof of the PMP. However, the NMF dynamics derived in (13) or (14) require a time-invariant control θ throughout. This is necessary since θ corresponds to the network parameters and needs to be shared across the different layers of the RNN, both in the linear-kernel case with state [x; h] in (14) and in the general case with state m in (16). Therefore, we need to modify the original PMP and the optimality condition for our NMF formulation. To this end, consider the Hamiltonian function

H(m, p; θ) = p^⊤ g(m; θ) − (1/T) r(θ),    (19)

and define the total Hamiltonian of the system (14) as ∑_{t=0}^{T−1} H(m_t, p_{t+1}; θ). Then we can show that the optimal solution θ* is a time-invariant control satisfying a modified PMP, as follows.

Theorem 4. Let x* be the state process optimally controlled by θ*. Then there exists a co-state (adjoint) process p* which satisfies the backward difference equation

m*_{t+1} = g(m*_t; θ*),  m*_0 = [χ_{S_k}; 0; …; 0],  t = 0, …, T − 1,    (20a)
p*_t = ∇_m g(m*_t; θ*)^⊤ p*_{t+1},  p*_T = −∇_{m_T} ℓ,  t = T − 1, …, 0.    (20b)

Moreover, the optimal θ* maximizes the total Hamiltonian: for any θ,

∑_{t=0}^{T−1} H(m*_t, p*_{t+1}; θ*) ≥ ∑_{t=0}^{T−1} H(m*_t, p*_{t+1}; θ).    (21)

In addition, for any given θ, ∇_θ J(θ) = −∑_{t=0}^{T−1} ∇_θ H(m^θ_t, p^θ_{t+1}; θ), where {m^θ_t, p^θ_t : 0 ≤ t ≤ T} are obtained by the forward and backward passes (20a)-(20b) with θ.

The proof is given in Appendix B.4. We introduce the total Hamiltonian ∑_{t=0}^{T−1} H(m_t, p_{t+1}; θ) in Theorem 4 because the NMF dynamics (14) (or (16)) call for a time-invariant control θ independent of t, which corresponds to θ being shared by all layers of an RNN. This is particularly important for time series analysis, where we perform regression on data observed within a limited time window but often want to use the learned parameters to predict events in the distant future. Theorem 4 also implies that performing gradient descent to minimize J in (18a) with back-propagation is equivalent to maximizing the total Hamiltonian, in light of (21).

Our numerical implementation of the proposed NMF is summarized in Algorithm 1. From training cascade data D, NMF learns the parameters θ = (A, η, w). The support (indices of nonzero entries) of the matrix A reveals the edge set E of the diffusion network G = (V, E), and the values of A are the corresponding infection rates on the edges. In addition to the diffusion network parameters inferred through A, we can also estimate (predict) the influence {x_t : t ∈ [T]} of any new source set x_0 = χ_S ∈ ℝ^n by a forward pass of the NMF dynamics (16) with the learned θ. Note that this forward pass can be computed on the fly, which is critical for downstream applications (such as influence maximization) that repeatedly call influence estimation as a subroutine.

4 Numerical Experiments

Infection probability and influence function estimation. We first test NMF on a set of synthetic networks that mimic the structure of real-world diffusion networks. Two types of Kronecker network models [27] are used: hierarchical (Hier) [8] and core-periphery (Core) [28] networks, with parameter matrices [0.9, 0.1; 0.1, 0.9] and [0.9, 0.5; 0.5, 0.3], respectively. For each type of network model, we generate 5 networks consisting of 128 nodes and 512 edges. We simulate diffusions in which the infection times follow the exponential distribution (Exp) or the Rayleigh distribution (Ray). For each distribution, we draw α_ji from Unif[0.1, 1] to simulate the varying interactions between nodes.
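For illustration, a single cascade under the exponential model can be simulated as in the sketch below (Python/NumPy; in the paper the networks and cascades are generated with SNAP [29], so the routine, names, and toy rate matrix here are illustrative assumptions only). It uses the fact that, for progressive diffusion with independent transmission times, a node's infection time is its earliest arrival time over all propagation paths from the source set.

```python
import heapq
import numpy as np

def simulate_cascade(A, source, T=20.0, rng=None):
    """Simulate one cascade of the continuous-time model with exponential
    transmission times: each edge (i, j) with rate alpha_ij = A[j, i] draws an
    independent Exp(alpha_ij) delay, and a node's infection time is its earliest
    arrival time from the source set (progressive diffusion, no recovery)."""
    rng = rng or np.random.default_rng()
    n = A.shape[0]
    tau = np.full(n, np.inf)                    # infection times; inf = never infected
    heap = [(0.0, i) for i in source]
    heapq.heapify(heap)
    for i in source:
        tau[i] = 0.0
    while heap:
        t, i = heapq.heappop(heap)
        if t > tau[i]:
            continue                            # stale heap entry
        for j in range(n):
            a = A[j, i]                         # rate at which i infects j
            if a > 0:
                t_j = t + rng.exponential(1.0 / a)
                if t_j < tau[j] and t_j <= T:
                    tau[j] = t_j
                    heapq.heappush(heap, (t_j, j))
    return tau                                  # the vector tau of a cascade (S, tau)

rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, (16, 16)) * (rng.random((16, 16)) < 0.2)  # toy rates
tau = simulate_cascade(A, source={0, 5}, rng=rng)
x_hat = (tau[:, None] <= np.arange(1.0, 21.0)).astype(int)  # binary states at t = 1..20
print(tau)
print(x_hat[:, 0])                              # observed infection states at t = 1
```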
We generate training data consisting of K = 10,000 cascades, formed by 10 sample cascades for each of 1,000 source sets (a source set is generated by randomly selecting 1 to 10 nodes from the network). All networks and cascades are generated with SNAP [29]. Our numerical implementation of NMF is available at https://github.com/ShushanHe/neural-mf.

Algorithm 1: Neural mean-field (NMF) algorithm for network inference and influence estimation
Input: D = {C_k : k ∈ [K]}, where C_k = {x̂^(k)(t) ∈ {0, 1}^n : t = 0, 1, …, T}.
Initialization: parameters θ = (A, η, w).
for k = 1, …, K do
  Sample a mini-batch D̂ ⊆ D of cascades.
  Compute {m_t : t ∈ [T]} using (16) with θ and m_0 = [χ_S; 0] for each C ∈ D̂. (Forward pass)
  Compute ∇̂_θ J = ∑_{C ∈ D̂} ∇_θ ℓ(x, x̂) with ℓ in (17). (Backward pass)
  Update the parameters: θ ← θ − τ ∇̂_θ J.
end for
Output: Network parameters θ.

We compare NMF with two baseline methods: InfluLearner [12], a state-of-the-art method that learns the coverage function of each node for any fixed time, and a conditional LSTM (LSTM for short) [22]; these are among the few existing methods capable of learning infection probabilities of individual nodes directly from cascade data, as ours does. For InfluLearner, we set the number of features to 128 for optimal accuracy, as suggested in [12]. For LSTM, we use one LSTM block and a dense layer for each t. To evaluate accuracy, we compute the mean absolute error (MAE) of the node infection probabilities and the influence over 100 source sets for each time t. More details of the evaluation criteria are provided in Appendix D.1. The results are given in Figure 1, which shows the mean (center line) and standard deviation (shade) for the three methods. NMF generally has the lowest MAE, except at some early stages where InfluLearner is better. Note that InfluLearner requires, and benefits from, knowledge of the original source node for each infection in the cascades (provided in our training data), which is often unavailable in practice and is not needed by our method.

Figure 1: MAE of influence (top row, panels (a)-(d)) and node infection probability (bottom row, panels (e)-(h)) by LSTM, InfluLearner, and NMF on different combinations of Hierarchical (Hier) and Core-periphery (Core) networks and exponential (Exp) and Rayleigh (Ray) diffusion models: (a)/(e) Hier+Exp, (b)/(f) Hier+Ray, (c)/(g) Core+Exp, (d)/(h) Core+Ray. Mean (center line) and standard deviation (shade) over 100 tests are shown.

We also test NMF on a real dataset [54] from the Sina Weibo social platform, consisting of more than 1.78 million users and 308 million following relationships among them. Following the setting in [12], we select the most popular tweet from the past 1,000 tweets of each user to generate diffusion cascades. We then recreate the cascades by keeping only the 1,000 most frequent nodes in the node set pooled over all cascades. For testing, we uniformly generate 100 source sets of size 10 and use t = 1, 2, …, 10 as the observation time steps. Finally, we test the 100 source sets and compare our model NMF with InfluLearner and LSTM. The MAE of all methods is shown in Figure 2a, which shows that NMF significantly outperforms LSTM and is comparable to InfluLearner. However, unlike InfluLearner, which requires re-training for every t and is computationally expensive, NMF learns the evolution at all t in a single sweep of training and is tens of times faster.

We also test the robustness of NMF to varying network density |E|/n. The MAE of the influence and infection probabilities obtained by NMF on a hierarchical network with n = 128 is shown in Figures 2c and 2b, respectively. NMF remains accurate for denser networks, which can be notoriously difficult for other methods such as InfluLearner.

Network structure inference. The interpretable parameterization of NMF allows us to explicitly learn the weight matrix A. In this test, we examine the quality of the learned A.
We set the recovered adjacency matrix E to the binary indicator matrix [A ≥ ϵ], i.e., (E)_ij = 1 if (A)_ji ≥ ϵ, with ϵ = 0.01. To evaluate the quality of E and A, we use four metrics: precision (Prc), recall (Rcl), accuracy (Acc), and correlation (Cor), defined as follows:

Prc(E, E*) = |E ∩ E*| / |E*|,  Rcl(E, E*) = |E ∩ E*| / |E|,  Acc(E, E*) = 1 − |E Δ E*| / (|E| + |E*|),  Cor(A, A*) = tr(A^⊤ A*) / (‖A‖_F ‖A*‖_F),

where E* and A* are their true values, respectively. In Acc, the edge set E is also interpreted as a matrix, |E| counts the number of nonzero entries in E, and Δ denotes the symmetric difference. In Cor, ‖A‖_F^2 = tr(A^⊤ A) is the squared Frobenius norm of the matrix A. Prc is the ratio of edges in E* that are recovered in E; Rcl is the ratio of correctly recovered edges in E; Acc indicates the ratio of the number of common edges shared by E and E* to the total number of edges in them; Cor measures the similarity between A and A*, taking their values into consideration. All metrics are bounded in [0, 1], and higher values indicate better results. For comparison, we also apply NETRATE [16] to the cascade data and learn A with the Rayleigh distribution. Evaluation by the four metrics is shown in Table 1, which indicates that NMF outperforms NETRATE in all metrics. Note that NMF learns A along with the NMF dynamics for infection probability estimation in its training, whereas NETRATE can only learn the matrix A.

Table 1: Performance of structure inference using NETRATE and the proposed NMF on Random, Hierarchical, and Core-periphery networks with the Rayleigh distribution as the diffusion time model on edges. Quality of the learned edge set E and distribution parameters A is measured by precision (Prc), recall (Rcl), accuracy (Acc), and correlation (Cor). Higher values indicate better quality.

Network        | Method  | Prc   | Rcl   | Acc   | Cor
Random         | NETRATE | 0.481 | 0.399 | 0.434 | 0.465
Random         | NMF     | 0.858 | 0.954 | 0.903 | 0.950
Hierarchical   | NETRATE | 0.659 | 0.429 | 0.519 | 0.464
Hierarchical   | NMF     | 0.826 | 0.978 | 0.893 | 0.938
Core-periphery | NETRATE | 0.150 | 0.220 | 0.178 | 0.143
Core-periphery | NMF     | 0.709 | 0.865 | 0.779 | 0.931
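For concreteness, the four metrics can be computed from the learned matrix A and the ground truth as in the sketch below (Python/NumPy, following the definitions above; the threshold value and all names are illustrative).

```python
import numpy as np

def structure_metrics(A_learned, A_star, eps=0.01):
    """Prc, Rcl, Acc, and Cor between the recovered network (E, A), thresholded
    from A_learned, and the ground truth (E*, A*), per the definitions above."""
    E = {(i, j) for j, i in zip(*np.nonzero(A_learned >= eps))}   # (A)_{ji} >= eps => edge (i, j)
    E_star = {(i, j) for j, i in zip(*np.nonzero(A_star > 0))}
    common = E & E_star
    prc = len(common) / max(len(E_star), 1)                       # |E ∩ E*| / |E*|
    rcl = len(common) / max(len(E), 1)                            # |E ∩ E*| / |E|
    acc = 1.0 - len(E ^ E_star) / max(len(E) + len(E_star), 1)    # 1 - |E Δ E*| / (|E| + |E*|)
    cor = np.trace(A_learned.T @ A_star) / (
        np.linalg.norm(A_learned, "fro") * np.linalg.norm(A_star, "fro"))
    return prc, rcl, acc, cor

rng = np.random.default_rng(0)
A_star = rng.uniform(0.1, 1.0, (32, 32)) * (rng.random((32, 32)) < 0.1)   # toy true rates
A_learned = np.clip(A_star + 0.02 * rng.standard_normal((32, 32)), 0, None)
print(structure_metrics(A_learned, A_star))
```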
Influence maximization. We use NMF as the influence estimation subroutine in the classical greedy algorithm [38] (NMF+Greedy), and compare with DiffuCelf [42], a state-of-the-art method for influence maximization (IM). Like NMF+Greedy, DiffuCelf requires only infection time features and not the network structure, unlike most existing methods. We generate 1,000 cascades with unique sources (as required by DiffuCelf but not by our method) on a hierarchical network with 128 nodes and 512 edges, using the exponential distribution for the transmission function with A generated from Unif[1, 10]. The time window is T = 20. For each source set size n_0 = 1, …, 10, NMF+Greedy and DiffuCelf are applied to identify the optimal source sets, whose influences are computed by averaging 10,000 Monte Carlo simulated cascades. Figure 2d shows that the source sets obtained by NMF+Greedy generate greater influence than those of DiffuCelf consistently for every source size n_0.

Figure 2: (a) MAE of influence estimated by LSTM, InfluLearner, and NMF on the Weibo data; (b), (c) MAE of influence and infection probability of NMF for different network densities; (d) influence of source sets selected by DiffuCelf and NMF+Greedy for n_0 = 1, …, 10.

5 Related Work

Sampling-based influence estimation methods have been considered for both discrete-time and continuous-time diffusion models. Discrete-time models assume that node infections occur only at discrete time points. Under this setting, the Independent Cascade (IC) model [24] is considered in [34], where a method with a provable performance guarantee is developed for a single source by iterating over a sequence of guesses of the true influence until a verifier accepts. To resolve the inefficiency of Monte Carlo simulations, the reverse influence sampling (RIS) sketching method [5] is adopted in [41]. Moreover, instead of using the full network structure, sketch-based approaches only characterize propagation instances for influence computation, such as the method in [10], which considers per-node summary structures defined by the bottom-k min-hash [9] sketch of the combined reachability set. In contrast to discrete-time models, continuous-time diffusion models allow arbitrary event occurrence times and hence are more accurate in modeling real-world diffusion processes. In continuous-time independent cascade (CIC) models, influence estimation can be reformulated as the problem of finding the least-label list, which contains information about the distance to the smallest reachable labels from the source [13, 20]. Compared to methods using a fixed number of samples, a more scalable approximation scheme with a built-in block is developed in [40] to minimize the number of samples needed for the desired accuracy. The aforementioned methods require knowledge of cascade traces [10] or of the diffusion network (a review of related work on network structure inference is provided in Appendix C), such as node connectivity and node-to-node infection rates, as well as various assumptions on the diffusion of interest. However, such knowledge about the diffusion network may not be available in practice, and the assumptions on propagation or data formation are often application-specific and do not hold in most other problems. InfluLearner [12] is a state-of-the-art method that does not require knowledge of the underlying diffusion network: it estimates the influence directly from cascade data in the CIC model by learning the influence function with a parameterization of the coverage functions using random basis functions. However, the random basis functions suggested by [12] require knowledge of the original source node for every infection, which can be difficult or impossible to track in real-world applications. In recent years, deep learning techniques have been employed to improve the scalability of influence estimation on large networks.
In particular, convolutional neural networks (CNNs) and an attention mechanism are combined with both network structure and user-specific features to learn users' latent feature representations in [44]. By piping represented cascade graphs through a gated recurrent unit (GRU), the future incremental influence of a cascade can be predicted [31]. RNNs and CNNs are also applied to capture the temporal relationships on user-generated content networks (e.g., views, likes, comments, reposts) and to extract more powerful features in [55]. Among methods based on graph structures, graph neural networks (GNNs) and graph convolution networks (GCNs) are widely applied. In particular, two coupled GNNs are used to capture the interplay between node activation states and the influence spread [6], while in [30] GCNs integrated with the teleport probability from the PageRank domain enhance the performance of the method in [44]. However, these methods depend critically on structural or content features of the cascades, which may not be available in practice.

6 Conclusion

We proposed a novel framework using neural mean-field dynamics for inference and estimation on diffusion networks. Our new framework is derived from the Mori-Zwanzig formalism to obtain the exact evolution of the node infection probabilities. The memory term of the evolution can be approximated by convolutions, which turns the system into a delay differential equation whose time discretization reduces to a structured and interpretable RNN. Empirical study shows that our approach is versatile and robust to different variations of diffusion network models, and significantly outperforms existing approaches in accuracy and efficiency on both synthetic and real-world data sets.

Broader Impact

This paper makes a significant contribution to the learning of structure and infection probabilities for diffusion networks, which is one of the central problems in the study of stochastic information propagation on large heterogeneous networks. The proposed neural mean-field (NMF) dynamics provide the first principled approach for these inference and estimation problems using cascade data. NMF is shown to be a delay differential equation with a proper approximation of the memory integral by learnable time convolution operators, and the system reduces to a highly structured and interpretable recurrent neural network after time discretization. Potential applications include influence maximization, outbreak detection, and source identification.

Acknowledgments and Disclosure of Funding

The work of SH and XY was supported in part by the National Science Foundation under grants CMMI-1745382, DMS-1818886, and DMS-1925263. The work of HZ was supported in part by National Science Foundation grant IIS-1717916 and the Shenzhen Research Institute of Big Data. He was on leave from the College of Computing, Georgia Institute of Technology.

References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: A system for large-scale machine learning, 2016.
[2] D. P. Bertsekas. Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA, 1995.
[3] Á. Bodó, G. Y. Katona, and P. L. Simon. SIS epidemic propagation on hypergraphs. Bulletin of Mathematical Biology, 78(4):713-735, 2016.
[4] M. Boguná and R. Pastor-Satorras. Epidemic spreading in correlated complex networks. Physical Review E, 66(4):047104, 2002.
[5] C. Borgs, M. Brautbar, J. Chayes, and B. Lucier. Maximizing social influence in nearly optimal time. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, Dec. 2013.
[6] Q. Cao, H. Shen, J. Gao, B. Wei, and X. Cheng. Popularity prediction on social platforms with coupled graph neural networks. In Proceedings of the 13th International Conference on Web Search and Data Mining, WSDM '20, pages 70-78, New York, NY, USA, 2020. Association for Computing Machinery.
[7] A. J. Chorin, O. H. Hald, and R. Kupferman. Optimal prediction and the Mori-Zwanzig representation of irreversible processes. Proceedings of the National Academy of Sciences, 97(7):2968-2973, 2000.
[8] A. Clauset, C. Moore, and M. E. J. Newman. Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191):98-101, May 2008.
[9] E. Cohen. Size-estimation framework with applications to transitive closure and reachability. Journal of Computer and System Sciences, 55(3):441-453, 1997.
[10] E. Cohen, D. Delling, T. Pajor, and R. F. Werneck. Sketch-based influence maximization and computation: Scaling up with guarantees. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM '14, pages 629-638, New York, NY, USA, 2014. Association for Computing Machinery.
[11] X. Dong, D. Thanou, M. Rabbat, and P. Frossard. Learning graphs from data: A signal representation perspective. IEEE Signal Processing Magazine, 36(3):44-63, 2019.
[12] N. Du, Y. Liang, M.-F. Balcan, and L. Song. Influence function learning in information diffusion networks. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML '14, pages II-2016-II-2024. JMLR.org, 2014.
[13] N. Du, L. Song, M. Gomez-Rodriguez, and H. Zha. Scalable influence estimation in continuous-time diffusion networks. In Advances in Neural Information Processing Systems, pages 3147-3155, 2013.
[14] N. Du, L. Song, M. Yuan, and A. J. Smola. Learning networks of heterogeneous influence. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2780-2788. Curran Associates, Inc., 2012.
[15] M. Farajtabar, X. Ye, S. Harati, L. Song, and H. Zha. Multistage campaigning in social networks. In Advances in Neural Information Processing Systems, pages 4718-4726, 2016.
[16] M. Gomez-Rodriguez, D. Balduzzi, and B. Schölkopf. Uncovering the temporal dynamics of diffusion networks. arXiv preprint arXiv:1105.0697, 2011.
[17] M. Gomez-Rodriguez, D. Balduzzi, and B. Schölkopf. Uncovering the temporal dynamics of diffusion networks. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML '11, pages 561-568, Madison, WI, USA, 2011. Omnipress.
[18] M. Gomez-Rodriguez, J. Leskovec, and A. Krause. Inferring networks of diffusion and influence. ACM Transactions on Knowledge Discovery from Data (TKDD), 5(4):21, 2012.
[19] M. Gomez-Rodriguez, J. Leskovec, and B. Schölkopf. Structure and dynamics of information pathways in online media. CoRR, abs/1212.1464, 2012.
[20] M. Gomez-Rodriguez, L. Song, N. Du, H. Zha, and B. Schölkopf. Influence estimation and maximization in continuous-time diffusion networks. ACM Trans. Inf. Syst., 34(2), Feb. 2016.
[21] A. Gouasmi, E. J. Parish, and K. Duraisamy. A priori estimation of memory effects in reduced-order models of nonlinear systems using the Mori-Zwanzig formalism. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 473(2205):20170385, 2017.
[22] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9:1735-1780, 1997.
[23] D. Kempe, J. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 137-146. ACM, 2003.
[24] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 137-146. Association for Computing Machinery, 2003.
[25] D. Kempe, J. Kleinberg, and É. Tardos. Influential nodes in a diffusion model for social networks. In Automata, Languages and Programming, pages 1127-1138. Springer, 2005.
[26] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[27] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani. Kronecker graphs: An approach to modeling networks. The Journal of Machine Learning Research, 11:985-1042, 2010.
[28] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Statistical properties of community structure in large social and information networks. In Proceedings of the 17th International Conference on World Wide Web, WWW '08, pages 695-704, New York, NY, USA, 2008. Association for Computing Machinery.
[29] J. Leskovec and R. Sosič. SNAP: A general-purpose network analysis and graph-mining library. ACM Transactions on Intelligent Systems and Technology (TIST), 8(1):1, 2016.
[30] C. K. Leung, A. Cuzzocrea, J. J. Mai, D. Deng, and F. Jiang. Personalized DeepInf: Enhanced social influence prediction with deep learning and transfer learning. In 2019 IEEE International Conference on Big Data (Big Data), pages 2871-2880, 2019.
[31] C. Li, J. Ma, X. Guo, and Q. Mei. DeepCas: An end-to-end predictor of information cascades. In Proceedings of the 26th International Conference on World Wide Web, pages 577-586, 2017.
[32] Q. Li, L. Chen, C. Tai, and E. Weinan. Maximum principle based algorithms for deep learning. The Journal of Machine Learning Research, 18(1):5998-6026, 2017.
[33] Y. Liang, Z. Jiang, and Y. Zheng. Inferring traffic cascading patterns. In Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, SIGSPATIAL '17, New York, NY, USA, 2017. Association for Computing Machinery.
[34] B. Lucier, J. Oren, and Y. Singer. Influence at scale: Distributed computation of complex contagion in networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, pages 735-744, New York, NY, USA, 2015. Association for Computing Machinery.
[35] G. Mateos, S. Segarra, A. G. Marques, and A. Ribeiro. Connecting the dots: Identifying network structure via graph signal processing. IEEE Signal Processing Magazine, 36(3):16-43, May 2019.
[36] J. C. Miller and I. Z. Kiss. Epidemic spread in networks: Existing methods and current challenges. Mathematical Modelling of Natural Phenomena, 9(2):4, 2014.
[37] S. A. Myers and J. Leskovec. On the convexity of latent social network inference. arXiv preprint arXiv:1010.5504, 2010.
[38] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions I. Mathematical Programming, 14(1):265-294, 1978.
[39] M. Newman. Networks: An Introduction. Oxford University Press, 2010.
[40] H. T. Nguyen, T. P. Nguyen, T. N. Vu, and T. N. Dinh. Outward influence and cascade size estimation in billion-scale networks. Proc. ACM Meas. Anal. Comput. Syst., 1(1), June 2017.
[41] N. Ohsaka, T. Akiba, Y. Yoshida, and K.-i. Kawarabayashi. Dynamic influence analysis in evolving networks. Proc. VLDB Endow., 9(12):1077-1088, Aug. 2016.
[42] G. Panagopoulos, F. D. Malliaros, and M. Vazirgiannis. DiffuGreedy: An influence maximization algorithm based on diffusion cascades. In International Conference on Complex Networks and their Applications, pages 392-404. Springer, 2018.
[43] R. Pastor-Satorras, C. Castellano, P. Van Mieghem, and A. Vespignani. Epidemic processes in complex networks. Reviews of Modern Physics, 87(3):925, 2015.
[44] J. Qiu, J. Tang, H. Ma, Y. Dong, K. Wang, and J. Tang. DeepInf: Social influence prediction with deep learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2110-2119, 2018.
[45] M. G. Rodriguez and B. Schölkopf. Submodular inference of diffusion networks from multiple trees. arXiv preprint arXiv:1205.1671, 2012.
[46] Y. Rong, Q. Zhu, and H. Cheng. A model-free approach to infer the diffusion network from event cascade. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM '16, pages 1653-1662, New York, NY, USA, 2016. Association for Computing Machinery.
[47] F. D. Sahneh and C. Scoglio. Epidemic spread in human networks. In Decision and Control and European Control Conference (CDC-ECC), 2011 50th IEEE Conference on, pages 3008-3013. IEEE, 2011.
[48] P. Van Mieghem, J. Omic, and R. Kooij. Virus spread in networks. IEEE/ACM Transactions on Networking, 17(1):1-14, 2009.
[49] M. Vergeer, L. Hermans, and S. Sams. Online social networks and micro-blogging in political campaigning: The exploration of a new campaign tool and a new campaign style. Party Politics, 19(3):477-501, 2013.
[50] L. Wang, S. Ermon, and J. E. Hopcroft. Feature-enhanced probabilistic models for diffusion network inference. In Machine Learning and Knowledge Discovery in Databases, pages 499-514. Springer, 2012.
[51] Q. Wang, N. Ripamonti, and J. S. Hesthaven. Recurrent neural network closure of parametric POD-Galerkin reduced-order models based on the Mori-Zwanzig formalism. Journal of Computational Physics, 410:109402, 2020.
[52] J. Wortman. Viral marketing and the diffusion of trends on social networks. Technical report, 2008.
[53] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, pages 1-21, 2020.
[54] J. Zhang, B. Liu, J. Tang, T. Chen, and J. Li. Social influence locality for modeling retweeting behaviors. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI '13, pages 2761-2767. AAAI Press, 2013.
[55] Y. Zhu, J. Xie, and Z. Chen. Predicting the popularity of micro-videos with multimodal variational encoder-decoder framework, 2020.