# Network Diffusions via Neural Mean-Field Dynamics

Shushan He (Mathematics & Statistics, Georgia State University, Atlanta, Georgia, USA; she4@gsu.edu)
Hongyuan Zha (School of Data Science, Shenzhen Research Institute of Big Data, CUHK, Shenzhen, China; zhahy@cuhk.edu.cn)
Xiaojing Ye (Mathematics & Statistics, Georgia State University, Atlanta, Georgia, USA; xye@gsu.edu)

We propose a novel learning framework based on neural mean-field dynamics for simultaneous inference and estimation problems of diffusions on networks. Our new framework is derived from the Mori-Zwanzig formalism to obtain an exact evolution of the node infection probabilities, which yields a delay differential equation whose memory integral is approximated by learnable time convolution operators, resulting in a highly structured and interpretable RNN. Directly using cascade data, our framework can jointly learn the structure of the diffusion network and the evolution of infection probabilities, which are cornerstones of important downstream applications such as influence maximization. Connections between parameter learning and optimal control are also established. Empirical study shows that our approach is versatile and robust to variations of the underlying diffusion network models, and significantly outperforms existing approaches in accuracy and efficiency on both synthetic and real-world data.

1 Introduction

Continuous-time information diffusion on heterogeneous networks is a prevalent phenomenon [4, 39, 43]. News spreading on social media [13, 15, 49], viral marketing [23, 25, 52], computer malware propagation, and epidemics of contagious diseases [3, 36, 43, 47] are all examples of diffusion on networks, among many others. For instance, a piece of information (such as a tweet) can be retweeted by users (nodes) with followee-follower relationships (edges) on the Twitter network. We call a user infected if she retweets; her followers see her retweet and can also become infected if they retweet in turn, and so on. Such information diffusion mimics an epidemic, where an infectious virus spreads to individuals (human, animal, or plant) and then to many others upon close contact.

In this paper, we are mainly concerned with the estimation of individual node infection probabilities as well as the inference of the underlying diffusion network structure directly from cascade data of historical diffusion events on the network. For infection probability estimation, our goal is to compute the evolution of the probability of each node being infected during a diffusion initiated from a set of source nodes. For network structure inference, we aim to learn the edges as well as the strength of interactions (through the edges) between the nodes of the diffusion network. Not surprisingly, both problems are very challenging due to the extremely large scale of modern networks, the heterogeneous inter-dependencies among the nodes, and the randomness exhibited in cascade data. Most existing works focus on one problem only, e.g., either solely inferring the network structure from cascade data, or estimating influence without providing insights into the underlying network structure.

We propose a novel learning framework, called neural mean-field (NMF) dynamics, to simultaneously tackle both the estimation and inference problems mentioned above. Specifically: (i) We develop a neural mean-field dynamics framework to model the evolution of diffusion on a network.
Our new framework is derived from the Mori-Zwanzig formalism to obtain an exact time evolution of the node infection probabilities with dimension linear in the network size; (ii) We show that the memory term of the Mori-Zwanzig equation can be approximated by a trainable convolution network, which turns the dynamical system into a delay differential equation. We also show that the time discretization of this system reduces to a recurrent neural network. The approximate system is highly interpretable; in particular, the training accepts sample cascades as input and returns both individual infection probability estimates (and hence the influence function) and structural information of the diffusion network as outputs; (iii) We show that parameter learning in NMF can be reduced to an optimal control problem in which the parameters act as a time-invariant control that maximizes the total Hamiltonian of the system; and (iv) Our empirical analysis shows that our approach is robust to variations of the unknown underlying diffusion models, and that it significantly outperforms existing approaches on both synthetic and real-world diffusion networks.

The remainder of this paper is organized as follows. In Section 2, we introduce the diffusion network models and related background, including the influence prediction and structure inference problems. In Section 3, we develop the proposed framework of neural mean-field dynamics for inference and prediction on diffusion networks, as well as an optimal control formulation for parameter learning. We demonstrate the performance of the proposed method on influence estimation and maximization for a variety of synthetic and real-world networks in Section 4. A discussion of related work is given in Section 5. Section 6 concludes the paper.

2 Preliminaries on Diffusion Networks

Throughout this paper, we use boldfaced lower (upper) case letters to denote vectors (matrices) or vector-valued (matrix-valued) functions, and (·)_k (or (·)_ij) for the kth component (or (i, j)-th entry). All vectors are column vectors unless otherwise noted. We follow Matlab syntax and use [x; y] to denote the vector that stacks x and y vertically. We denote the inner product by x^⊤ y and component-wise multiplication by x ⊙ y. Time is denoted by t, either continuous (t ∈ [0, T]) or discrete (t = 0, 1, …, T), for some time horizon T ∈ ℝ_+ (∈ ℕ in the discrete case). The derivative ẋ is with respect to t, and the gradient ∇_x is with respect to x. Probability is denoted by Pr(·), and expectation with respect to X (or p_X) is denoted by E_X[·].

Diffusion network models. Consider a diffusion network model, which consists of a network (directed graph) G = (V, E) with node set V = [n] := {1, …, n} and edge set E ⊆ V × V, and a diffusion model that describes the distribution p(t; α_ij) of the time t that node i takes to infect a healthy neighbor j ∈ {j' : (i, j') ∈ E}, for every (i, j) ∈ E. Then, given a source (seed) set S of nodes that are infected at time 0, they infect their healthy neighbors with infection times following p, the infected neighbors then infect their own healthy neighbors, and so on, so that the infection initiated from S at time 0 propagates to other nodes of the network. Typical diffusion network models are assumed to be progressive, meaning that infected nodes cannot recover, and the infections on different edges are independent.
For example, the standard diffusion model with exponential distribution p(t; α) = α e^{−α t} is the most widely used; other distributions can also be considered, as is done in this paper. For simplicity, we focus on single-parameter distributions, or distributions with multiple parameters of which only one varies across edges, so that the parameter α_ij ≥ 0 indicates the strength of the impact node i has on node j.

Cascade data. Observation data D of a diffusion network are often in the form of sample cascades D := {C_k = (S_k, τ_k) ∈ 2^V × ℝ^n_+ : k ∈ [K]}, where the kth cascade C_k records its source set S_k ⊆ V and the times (τ_k)_i ≥ 0 at which node i was infected (if i was not infected during C_k then (τ_k)_i = ∞). We also equate C_k with {x̂^(k)(t) ∈ {0, 1}^n : t ≥ 0}, such that (x̂^(k)(t))_i = 1 if node i is in the infected state at time t and 0 otherwise. For example, x̂^(k)(0) = χ_{S_k}, where (χ_{S_k})_i = 1 if i ∈ S_k and 0 otherwise. Such cascade data are collected from historical events for training purposes.

Influence prediction and inference of diffusion networks. Given the network G = (V, E), the diffusion model, and A, where (A)_ji = α_ij is the parameter of p(t; α_ij) for edge (i, j), influence prediction (or influence estimation) is to compute

x(t; χ_S) = [x_1(t; χ_S), …, x_n(t; χ_S)]^⊤ ∈ [0, 1]^n    (1)

for all times t ≥ 0 and any source set S ⊆ V. In (1), x_i(t; χ_S) is the probability of node i being infected at time t given a source set S (not necessarily observed as a source set in D). Note that we use χ_S and S interchangeably hereafter. The probability x(t; χ_S) can also be used to compute the influence function σ(t; S) := 1_n^⊤ x(t; χ_S), the expected number of infected nodes at time t. Note that an analytic solution of (1) is intractable due to the exponentially large state space of the complete dynamical system of the diffusion problem [19, 48]. Network inference, on the other hand, refers to learning the network connectivity E and A given cascade data D. The matrix A contains the distribution parameters if the diffusion model p is given, or it simply measures qualitatively the strength of the impact of node i on j if no specific p is known. Influence prediction may also require network inference when only cascade data D are available, resulting in a two-stage approach: network inference is performed first to learn the network structure E and the diffusion model parameters A, and influence estimation is then used to compute the influence of the source set S. However, approximation errors and biases in the two stages will certainly accumulate. Alternatively, one can use a one-stage approach that directly estimates x(t; χ_S) of any S from the cascade data D, which is more versatile and less prone to diffusion model misspecification. Our method is one such one-stage method. Additionally, it allows knowledge of E and/or A, if available, to be integrated for further performance improvement.

Influence maximization. Given cascade data D, influence maximization is to find the source set S that generates the maximal influence σ(t; S) at time t among all subsets of size at most n_0, where t > 0 and 1 ≤ n_0 < n are prescribed. Namely, influence maximization can be formulated as

max_S σ(t; S),  s.t. S ⊆ V, |S| ≤ n_0.    (2)

There are two main ingredients in an influence maximization method for solving (2): an influence prediction subroutine that evaluates the influence σ(t; S) for any given source set S, and an (approximate) combinatorial optimization solver that repeatedly calls this subroutine to find the optimal set S of (2).
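A minimal sketch of such a greedy solver is given below (Python; the influence-estimation subroutine sigma_hat is treated as a black box, for example a forward pass of a trained estimator, and the toy estimator and all names are illustrative assumptions rather than the implementation used in this paper).

```python
import math

def greedy_influence_max(nodes, n0, sigma_hat, t):
    """Greedy solver for (2): repeatedly add the node with the largest
    marginal gain in estimated influence sigma_hat(t, S)."""
    S = set()
    for _ in range(n0):
        base = sigma_hat(t, S)
        gains = {v: sigma_hat(t, S | {v}) - base for v in nodes if v not in S}
        S.add(max(gains, key=gains.get))
    return S

# Toy stand-in for the influence estimator: grows with |S| but saturates
# (submodular), which is the regime where the greedy guarantee of [38] applies.
def toy_sigma(t, S):
    return math.sqrt(len(S)) * min(t, 10.0)

print(greedy_influence_max(range(128), n0=5, sigma_hat=toy_sigma, t=20.0))
```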
The combinatorial optimization problem is NP-hard and is often approximately solved by greedy algorithms, which come with a guaranteed sub-optimality gap when σ(t; S) is submodular in S. In our experiments, we show that a standard greedy approach equipped with our proposed influence estimation method outperforms other state-of-the-art influence maximization algorithms.

3 Neural Mean-Field Dynamics

Modeling diffusion by stochastic jump processes. We begin with the jump process formulation of network diffusion. Given a source set χ_S, let X_i(t; χ_S) denote the infection status of node i at time t; namely, X_i(t) = 1 if node i is infected by time t, and 0 otherwise. Then {X_i(t) : i ∈ [n]} is a set of n coupled jump processes, such that X_i(t; χ_S) jumps from 0 to 1 when node i is infected by any of its infected neighbors at time t. Let λ*_i(t) be the conditional intensity of X_i(t; χ_S) given the history H(t) = {X_i(s; χ_S) : s ≤ t, i ∈ [n]}, i.e.,

λ*_i(t) := lim_{τ→0+} E[X_i(t + τ; χ_S) − X_i(t; χ_S) | H(t)] / τ.    (3)

Note that the numerator of (3) is also the conditional probability Pr(X_i(t + τ) = 1, X_i(t) = 0 | H(t)) for any τ > 0. In influence prediction, our goal is to compute the probability x(t; χ_S) = [x_i(t; χ_S)] in (1), which is the expectation of X_i(t; χ_S) conditioned on H(t):

x_i(t; χ_S) = E_{H(t)}[X_i(t; χ_S) | H(t)].    (4)

To this end, we adopt the following notation (for simplicity we temporarily drop χ_S in this subsection, as the source set S is arbitrary but fixed):

x_I(t) = E_{H(t)}[∏_{i∈I} X_i(t; χ_S) | H(t)],  y_I(t) = ∏_{i∈I} x_i(t),  e_I(t) = x_I(t) − y_I(t)    (5)

for any I ⊆ [n] with |I| ≥ 2. Then we can derive the evolution of z := [x; e]. Here x(t) ∈ [0, 1]^n is the resolved variable, whose value is of interest and whose samples can be directly observed from the cascade data D, and e(t) = […, e_I(t), …]^⊤ ∈ ℝ^{N−n}, where N = 2^n − 1, is the unresolved variable that captures all the second and higher order moments. The complete evolution equation of z is given in the following theorem, whose proof is provided in Appendix B.1.

Theorem 1. The evolution of z(t) = [x(t); e(t)] follows the nonlinear differential equation ż = f(z), where

f(z) = f(x, e) = [f(x; A) − (A ⊙ E) 1_n; …; f_I(x, e); …],    (6)

with initial value z_0 = [χ_S; 0] ∈ ℝ^N, E = [e_ij] ∈ ℝ^{n×n}, and

f(x; A) = A x − diag(x) A x,    (7)

f_I(x, e) = ∑_{i∈I} ∑_{j∉I} α_ji (y_I − y_{I∪{j}} + e_I − e_{I∪{j}}) − ∑_{i∈I} y_{I∖{i}} ∑_{j≠i} α_ji (x_j − y_{ij} − e_{ij}).    (8)

The evolution (6) holds exactly for the standard diffusion model with exponential distribution, and it also approximates other distributions p well, as shown in the empirical study below. In either case, the dimension N of z grows exponentially in the network size n and hence renders the computation infeasible in practice. To overcome this issue, we employ the Mori-Zwanzig formalism [7] to derive a reduced-order model of x with dimension n only.
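Before deriving the memory closure, it is instructive to see what the drift f(x; A) alone predicts: setting e ≡ 0 in (6) closes the system at first order, ẋ = f(x; A) = Ax − diag(x)Ax. The sketch below (Python/NumPy; illustrative only, not the proposed method) integrates this naive mean-field approximation by forward Euler from a source indicator χ_S.

```python
import numpy as np

def streaming_term(x, A):
    """Mean-field drift (7): f(x; A) = A x - diag(x) A x."""
    Ax = A @ x
    return Ax - x * Ax

def naive_mean_field(A, source, T=20.0, dt=0.1):
    """Forward-Euler integration of xdot = f(x; A) with x(0) = chi_S,
    i.e., the first-order closure that drops all correlation corrections e."""
    x = np.zeros(A.shape[0])
    x[list(source)] = 1.0
    for _ in range(int(T / dt)):
        x = np.clip(x + dt * streaming_term(x, A), 0.0, 1.0)
    return x

rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, (8, 8)) * (rng.random((8, 8)) < 0.3)  # toy rate matrix
x_T = naive_mean_field(A, source={0, 3})
print(x_T, x_T.sum())  # infection probability estimates and influence at time T
```

The gap between these first-order estimates and the exact probabilities is precisely what the memory term derived next accounts for.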
Mori-Zwanzig memory closure. We employ the Mori-Zwanzig (MZ) formalism [7], which allows us to introduce a generalized Langevin equation (GLE) for the x part of the dynamics (6). The GLE of x is derived from the original equation (6) describing the evolution of z = [x; e], while retaining the effect of the unresolved part e. This is particularly useful in our case, as we only need x for infection probability estimation and influence prediction. Define the Liouville operator L such that L[g](z) := f(z) · ∇_z g(z) for any real-valued function g of z. Let e^{tL} be the Koopman operator associated with L, so that e^{tL} g(z(0)) = g(z(t)), where z(t) solves (6). Then L is known to satisfy the semi-group property for all g, i.e., e^{tL} g(z) = g(e^{tL} z). Now consider the projection operator P defined as the truncation (P g)(z) = (P g)([x; e]) = g([x; 0]) for any z = [x; e], and its orthogonal complement Q = I − P, where I is the identity operator. The following theorem describes the exact evolution of x(t); the proof is given in Appendix B.2.

Theorem 2. The evolution of x specified in (6) can also be described by the following GLE:

ẋ = f(x; A) + ∫_0^t k(t − s, x(s)) ds,    (9)

where f is given in (7) and k(t, x) := P L e^{tQL} Q L x.

Note that (9) is not an approximation; it is an exact representation of the x part of the original problem (6). Equation (9) can be interpreted as a mean-field equation, where the two terms on the right-hand side are called the streaming term (corresponding to the mean-field dynamics) and the memory term, respectively. The streaming term provides the main drift of the evolution, and the memory term, in convolution form, provides a vital adjustment. This inspires us to approximate the memory term by a time convolution of x, which naturally yields a delay differential equation and further reduces to a structured recurrent neural network (RNN) after discretization, as shown in the next subsection.

Delay differential equation and RNN. To compute the evolution (9) of x, we approximate the Mori-Zwanzig memory term by a neural net ε applied to a time convolution of x:

∫_0^t k(t − s, x(s)) ds ≈ ε(x(t), h(t); η),  where h(t) = ∫_0^t K(t − s; w) x(s) ds.    (10)

In (10), K(·; w) is a convolution operator with parameter w, and ε(x, h; η) is a deep neural net with (x, h) as input and η as parameters. Both w and η are to be trained from the cascade data D. Hence, (9) reduces to a delay differential equation that involves a time integral h(t) of past x:

ẋ = f(x, h; θ) := f(x; A) + ε(x, h; η).    (11)

The initial condition of (11) with source set S is given by

x(0) = χ_S,  h(0) = 0,  and  x(t) = h(t) = 0 for t < 0.    (12)

We call the system (11) with initial condition (12) the neural mean-field (NMF) dynamics. The delay differential equation (11) is equivalent to a coupled system in (x, h). In addition, we show in the following theorem that the discretization of this system reduces to a structured recurrent neural network if K(t; w) is a (linear combination of) matrix convolutions.

Theorem 3. The delay differential equation (11) is equivalent to the coupled system

ẋ = f(x, h; A, η) = f(x; A) + ε(x, h; η),    (13a)
h(t) = ∫_0^t K(t − s; w) f(x(s), h(s); A, η) ds,    (13b)

with initial condition (12). In particular, if K(t; w) = ∑_{l=1}^L B_l e^{−C_l t} for some L ∈ ℕ with w = {(B_l, C_l) : B_l C_l = C_l B_l, l ∈ [L]}, then (13) can be solved by the non-delay system consisting of (13a) and ḣ = ∑_{l=1}^L (B_l x − C_l h). The discretization of this system (with step size normalized to 1) reduces to an RNN with hidden layers (x_t, h_t) for t = 0, 1, …, T − 1:

x_{t+1} = x_t + f(x_t; A) + ε(x_t, h_t; η),    (14a)
h_{t+1} = h_t + ∑_{l=1}^L (B_l x_{t+1} − C_l h_t),    (14b)

where the input is given by x_0 = χ_S and h_0 = 0.

The proof is given in Appendix B.3. The matrices B_l and C_l in (14b) are the weights on x_{t+1} and h_t in the linear transformation, and the neural network ε wraps (x_t, h_t) to approximate the nonlinear effect of the memory term in (10). Here we consider a convolution kernel K(·; w) that is more general than the exponential kernel. Note that, in practice, the convolution weight K on older states x in (10) diminishes rapidly, and hence the memory kernel K can be well approximated using a truncated history of finite length τ > 0 (τ ∈ ℕ after discretization).
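A minimal sketch of the discretized updates (14a)-(14b) follows (Python/NumPy, with a small two-layer network standing in for ε; all shapes, random initializations, and the clipping to [0, 1] are illustrative assumptions rather than the trained model).

```python
import numpy as np

def nmf_cell(x, h, A, Bs, Cs, W1, W2):
    """One step of the discretized NMF dynamics."""
    drift = A @ x - x * (A @ x)                        # streaming term f(x; A) in (7)
    eps = W2 @ np.tanh(W1 @ np.concatenate([x, h]))    # neural memory closure eps(x, h; eta)
    x_next = np.clip(x + drift + eps, 0.0, 1.0)        # (14a), clipped for numerical safety
    h_next = h + sum(B @ x_next - C @ h for B, C in zip(Bs, Cs))  # (14b)
    return x_next, h_next

n, L, hidden = 8, 2, 16
rng = np.random.default_rng(1)
A = rng.uniform(0.0, 0.2, (n, n))
Bs = [0.1 * rng.standard_normal((n, n)) for _ in range(L)]
Cs = [0.1 * np.eye(n) for _ in range(L)]
W1 = 0.1 * rng.standard_normal((hidden, 2 * n))
W2 = 0.1 * rng.standard_normal((n, hidden))

x, h = np.zeros(n), np.zeros(n)
x[[0, 3]] = 1.0                                        # x_0 = chi_S, h_0 = 0
for t in range(20):
    x, h = nmf_cell(x, h, A, Bs, Cs, W1, W2)
print(x)  # estimated infection probabilities after 20 steps
```

In the actual model, A, the kernel weights (B_l, C_l) (or more generally w), and the parameters η of ε are all learned from cascades, as described next.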
Using such a truncated history, we substitute (14b) by

h_t = K^w m_t,  where K^w = [K^w_0, …, K^w_τ] and m_t = [x_t; …; x_{t−τ}].    (15)

We then formulate the evolution of the augmented state m_t defined in (15) and follow (14a) to obtain a single evolution equation for m_t, t = 0, …, T − 1:

m_{t+1} = g(m_t; θ),  where g(m; θ) := [J_0 m + f(J_0 m, K^w m; θ); J_0 m; …; J_{τ−1} m],    (16)

and J_s := […, I, …] ∈ ℝ^{n×(τ+1)n} has the identity I as its (s+1)th block and 0 elsewhere (thus J_s m_t extracts the (s+1)th block x_{t−s} of m_t) for s = 0, …, τ − 1. If (14b) is used instead, a simpler augmented state m_t = [x_t; h_t] can be formed similarly; we omit the details here. We will use the dynamics (16) of the augmented state m_t in the training below.

An optimal control formulation of parameter learning. Now we consider the training of the network parameters θ = (A, η, w) of (16) using cascade data D. Given a sample cascade x̂ = (S, τ) from D, we can observe its values in {0, 1}^n at each of the time points t = 1, …, T and obtain the corresponding infection states, i.e., x̂ = {x̂_t ∈ {0, 1}^n : t ∈ [T]} (see Section 2). Maximizing the log-likelihood of x̂ under the dynamics x_t = x_t(θ) ∈ [0, 1]^n induced by θ is equivalent to minimizing the loss function

ℓ(x, x̂) = −∑_{t=1}^T [x̂_t^⊤ log x_t + (1 − x̂_t)^⊤ log(1 − x_t)],    (17)

where the logarithm is taken component-wise. We can add a regularization term r(θ) to (17) to impose prior knowledge or constraints on θ. In particular, if E is known, we can enforce the constraint that A be supported on E only. Otherwise, we can add ‖A‖_1 or ‖A‖_0 (the ℓ_1 or ℓ_0 norm of the vectorized A) if E is expected to be sparse. In general, A can be interpreted as the convolution to be learned in a graph convolution network (GCN) [26, 53]. The support and the magnitudes of A reveal the network structure and the strength of interactions between nodes, respectively. We provide more details of our numerical implementation in Section 4 and Appendix D.1.

The optimal parameter θ can be obtained by minimizing the loss (17) subject to the NMF dynamics (16). This procedure can also be cast as an optimal control problem, namely, to find the θ that steers m_t to fit the data D through the NMF dynamics (16):

min_θ J(θ) := (1/K) ∑_{k=1}^K ℓ(x^(k), x̂^(k)) + r(θ)    (18a)
s.t. m^(k)_{t+1} = g(m^(k)_t; θ),  m^(k)_0 = [χ_{S_k}; 0; …; 0],  t = 0, …, T − 1, k ∈ [K],    (18b)

where x^(k)_t = J_0 m^(k)_t for all t and k.
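A minimal sketch of the loss (17) is given below (Python/NumPy; the clipping constant and the toy arrays are illustrative). In (18a) this quantity is averaged over the K training cascades and combined with a regularizer such as r(θ) = λ‖A‖_1, while the same θ = (A, η, w) is applied at every time step of the forward dynamics.

```python
import numpy as np

def nmf_loss(x_traj, x_hat_traj, eps=1e-6):
    """Loss (17): negative log-likelihood (binary cross-entropy) between the
    predicted infection probabilities x_t and the observed binary states
    x_hat_t, summed over t = 1, ..., T and over nodes."""
    x = np.clip(np.asarray(x_traj, dtype=float), eps, 1.0 - eps)
    x_hat = np.asarray(x_hat_traj, dtype=float)
    return -np.sum(x_hat * np.log(x) + (1.0 - x_hat) * np.log(1.0 - x))

# Toy check: predictions on a 4-node network over T = 5 steps against a
# cascade in which only node 0 is ever infected.
x_traj = np.full((5, 4), 0.3); x_traj[:, 0] = 0.9
x_hat_traj = np.zeros((5, 4)); x_hat_traj[:, 0] = 1.0
print(nmf_loss(x_traj, x_hat_traj))
```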
The problem of optimal control has been well studied in both the continuous and the discrete case over the past decades [2]. In particular, discrete optimal control with nonlinear difference equations and the associated maximum principle have been extensively exploited. Recently, an optimal control viewpoint of deep learning has been proposed [32]: the parameters of a neural network play the role of the control variable in a discretized differential equation, and training these parameters so that the network output minimizes the loss function can be viewed as finding the optimal control that minimizes the objective at the terminal state. Pontryagin's Maximum Principle (PMP) provides an important optimality condition for the optimal control [2, 32]. In standard optimal control, the control variable can be chosen freely in the allowed set at any given time t, which is a key ingredient in the proof of the PMP. However, the NMF dynamics derived in (13) or (14) require a time-invariant control θ throughout. This is necessary since θ corresponds to the network parameters and needs to be shared across the different layers of the RNN, both in the linear-kernel case with state [x; h] in (14) and in the general case with state m in (16). Therefore, we need to modify the original PMP and the optimality condition for our NMF formulation. To this end, consider the Hamiltonian function

H(m, p; θ) = p^⊤ g(m; θ) − (1/T) r(θ),    (19)

and define the total Hamiltonian of the system (14) as ∑_{t=0}^{T−1} H(m_t, p_{t+1}; θ). Then we can show that the optimal solution θ* is a time-invariant control satisfying a modified PMP, as follows.

Theorem 4. Let x* be the state process optimally controlled by θ*. Then there exists a co-state (adjoint) process p* which satisfies the backward difference equation

m*_{t+1} = g(m*_t; θ*),  m*_0 = [χ_{S_k}; 0; …; 0],  t = 0, …, T − 1,    (20a)
p*_t = ∇_m g(m*_t; θ*)^⊤ p*_{t+1},  p*_T = −∇_{m_T} ℓ,  t = T − 1, …, 0.    (20b)

Moreover, the optimal θ* maximizes the total Hamiltonian: for any θ,

∑_{t=0}^{T−1} H(m*_t, p*_{t+1}; θ*) ≥ ∑_{t=0}^{T−1} H(m*_t, p*_{t+1}; θ).    (21)

In addition, for any given θ, ∇_θ J(θ) = −∑_{t=0}^{T−1} ∇_θ H(m^θ_t, p^θ_{t+1}; θ), where {m^θ_t, p^θ_t : 0 ≤ t ≤ T} are obtained by the forward and backward passes (20a)-(20b) with θ.

The proof is given in Appendix B.4. We introduce the total Hamiltonian ∑_{t=0}^{T−1} H(m_t, p_{t+1}; θ) in Theorem 4 because the NMF dynamics (14) (or (16)) call for a time-invariant control θ independent of t, which corresponds to θ being shared by all layers of an RNN. This is particularly important for time series analysis, where we perform regression on data observed within a limited time window but often want to use the learned parameters to predict events in the distant future. Theorem 4 also implies that performing gradient descent to minimize J in (18a) with back-propagation is equivalent to maximizing the total Hamiltonian, in light of (21).

Our numerical implementation of the proposed NMF is summarized in Algorithm 1. From training cascade data D, NMF learns the parameters θ = (A, η, w). The support (indices of nonzero entries) of the matrix A reveals the edge set E of the diffusion network G = (V, E), and the values of A are the corresponding infection rates on the edges. In addition to the diffusion network parameters inferred through A, we can also estimate (predict) the influence {x_t : t ∈ [T]} of any new source set x_0 = χ_S ∈ ℝ^n by a forward pass of the NMF dynamics (16) with the learned θ. Note that this forward pass can be computed on the fly, which is critical for downstream applications (such as influence maximization) that repeatedly call influence estimation as a subroutine.

4 Numerical Experiments

Infection probability and influence function estimation. We first test NMF on a set of synthetic networks that mimic the structure of real-world diffusion networks. Two types of Kronecker network models [27] are used: hierarchical (Hier) [8] and core-periphery (Core) [28] networks, with parameter matrices [0.9, 0.1; 0.1, 0.9] and [0.9, 0.5; 0.5, 0.3], respectively. For each type of network model, we generate 5 networks consisting of 128 nodes and 512 edges. We simulate diffusions in which the infection times follow the exponential distribution (Exp) or the Rayleigh distribution (Ray). For each distribution, we draw α_ji from Unif[0.1, 1] to simulate the varying interactions between nodes.
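For illustration, a single cascade under the exponential model can be simulated as in the sketch below (Python/NumPy; in the paper the networks and cascades are generated with SNAP [29], so the routine, names, and toy rate matrix here are illustrative assumptions only). It uses the fact that, for progressive diffusion with independent transmission times, a node's infection time is its earliest arrival time over all propagation paths from the source set.

```python
import heapq
import numpy as np

def simulate_cascade(A, source, T=20.0, rng=None):
    """Simulate one cascade of the continuous-time model with exponential
    transmission times: each edge (i, j) with rate alpha_ij = A[j, i] draws an
    independent Exp(alpha_ij) delay, and a node's infection time is its earliest
    arrival time from the source set (progressive diffusion, no recovery)."""
    rng = rng or np.random.default_rng()
    n = A.shape[0]
    tau = np.full(n, np.inf)                    # infection times; inf = never infected
    heap = [(0.0, i) for i in source]
    heapq.heapify(heap)
    for i in source:
        tau[i] = 0.0
    while heap:
        t, i = heapq.heappop(heap)
        if t > tau[i]:
            continue                            # stale heap entry
        for j in range(n):
            a = A[j, i]                         # rate at which i infects j
            if a > 0:
                t_j = t + rng.exponential(1.0 / a)
                if t_j < tau[j] and t_j <= T:
                    tau[j] = t_j
                    heapq.heappush(heap, (t_j, j))
    return tau                                  # the vector tau of a cascade (S, tau)

rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, (16, 16)) * (rng.random((16, 16)) < 0.2)  # toy rates
tau = simulate_cascade(A, source={0, 5}, rng=rng)
x_hat = (tau[:, None] <= np.arange(1.0, 21.0)).astype(int)  # binary states at t = 1..20
print(tau)
print(x_hat[:, 0])                              # observed infection states at t = 1
```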
We generate training data consisting of K = 10,000 cascades, formed by 10 sample cascades for each of 1,000 source sets (a source set is generated by randomly selecting 1 to 10 nodes from the network). All networks and cascades are generated with SNAP [29]. Our numerical implementation of NMF is available at https://github.com/ShushanHe/neural-mf.

Algorithm 1: Neural mean-field (NMF) algorithm for network inference and influence estimation
Input: D = {C_k : k ∈ [K]}, where C_k = {x̂^(k)(t) ∈ {0, 1}^n : t = 0, 1, …, T}.
Initialization: parameters θ = (A, η, w).
for k = 1, …, K do
  Sample a mini-batch D̂ ⊆ D of cascades.
  Compute {m_t : t ∈ [T]} using (16) with θ and m_0 = [χ_S; 0] for each C ∈ D̂. (Forward pass)
  Compute ∇̂_θ J = ∑_{C ∈ D̂} ∇_θ ℓ(x, x̂) with ℓ in (17). (Backward pass)
  Update the parameters: θ ← θ − τ ∇̂_θ J.
end for
Output: Network parameters θ.

We compare NMF with two baseline methods: InfluLearner [12], a state-of-the-art method that learns the coverage function of each node for any fixed time, and a conditional LSTM (LSTM for short) [22]; these are among the few existing methods capable of learning infection probabilities of individual nodes directly from cascade data, as ours does. For InfluLearner, we set the number of features to 128 for optimal accuracy, as suggested in [12]. For LSTM, we use one LSTM block and a dense layer for each t. To evaluate accuracy, we compute the mean absolute error (MAE) of the node infection probabilities and the influence over 100 source sets for each time t. More details of the evaluation criteria are provided in Appendix D.1. The results are given in Figure 1, which shows the mean (center line) and standard deviation (shade) for the three methods. NMF generally has the lowest MAE, except at some early stages where InfluLearner is better. Note that InfluLearner requires, and benefits from, knowledge of the original source node for each infection in the cascades (provided in our training data), which is often unavailable in practice and is not needed by our method.

Figure 1: MAE of influence (top row, panels (a)-(d)) and node infection probability (bottom row, panels (e)-(h)) by LSTM, InfluLearner, and NMF on different combinations of Hierarchical (Hier) and Core-periphery (Core) networks and exponential (Exp) and Rayleigh (Ray) diffusion models: (a)/(e) Hier+Exp, (b)/(f) Hier+Ray, (c)/(g) Core+Exp, (d)/(h) Core+Ray. Mean (center line) and standard deviation (shade) over 100 tests are shown.

We also test NMF on a real dataset [54] from the Sina Weibo social platform, consisting of more than 1.78 million users and 308 million following relationships among them. Following the setting in [12], we select the most popular tweet from the past 1,000 tweets of each user to generate diffusion cascades. We then recreate the cascades by keeping only the 1,000 most frequent nodes in the node set pooled over all cascades. For testing, we uniformly generate 100 source sets of size 10 and use t = 1, 2, …, 10 as the observation time steps. Finally, we test the 100 source sets and compare our model NMF with InfluLearner and LSTM. The MAE of all methods is shown in Figure 2a, which shows that NMF significantly outperforms LSTM and is comparable to InfluLearner. However, unlike InfluLearner, which requires re-training for every t and is computationally expensive, NMF learns the evolution at all t in a single sweep of training and is tens of times faster.

We also test the robustness of NMF to varying network density |E|/n. The MAE of the influence and infection probabilities obtained by NMF on a hierarchical network with n = 128 is shown in Figures 2c and 2b, respectively. NMF remains accurate for denser networks, which can be notoriously difficult for other methods such as InfluLearner.

Network structure inference. The interpretable parameterization of NMF allows us to explicitly learn the weight matrix A. In this test, we examine the quality of the learned A.
We set the recovered adjacency matrix E to the binary indicator matrix [A ≥ ϵ], i.e., (E)_ij = 1 if (A)_ji ≥ ϵ, with ϵ = 0.01. To evaluate the quality of E and A, we use four metrics: precision (Prc), recall (Rcl), accuracy (Acc), and correlation (Cor), defined as follows:

Prc(E, E*) = |E ∩ E*| / |E*|,  Rcl(E, E*) = |E ∩ E*| / |E|,  Acc(E, E*) = 1 − |E Δ E*| / (|E| + |E*|),  Cor(A, A*) = tr(A^⊤ A*) / (‖A‖_F ‖A*‖_F),

where E* and A* are their true values, respectively. In Acc, the edge set E is also interpreted as a matrix, |E| counts the number of nonzero entries in E, and Δ denotes the symmetric difference. In Cor, ‖A‖_F^2 = tr(A^⊤ A) is the squared Frobenius norm of the matrix A. Prc is the ratio of edges in E* that are recovered in E; Rcl is the ratio of correctly recovered edges in E; Acc indicates the ratio of the number of common edges shared by E and E* to the total number of edges in them; Cor measures the similarity between A and A*, taking their values into consideration. All metrics are bounded in [0, 1], and higher values indicate better results. For comparison, we also apply NETRATE [16] to the cascade data and learn A with the Rayleigh distribution. Evaluation by the four metrics is shown in Table 1, which indicates that NMF outperforms NETRATE in all metrics. Note that NMF learns A along with the NMF dynamics for infection probability estimation in its training, whereas NETRATE can only learn the matrix A.

Table 1: Performance of structure inference using NETRATE and the proposed NMF on Random, Hierarchical, and Core-periphery networks with the Rayleigh distribution as the diffusion time model on edges. Quality of the learned edge set E and distribution parameters A is measured by precision (Prc), recall (Rcl), accuracy (Acc), and correlation (Cor). Higher values indicate better quality.

Network        | Method  | Prc   | Rcl   | Acc   | Cor
Random         | NETRATE | 0.481 | 0.399 | 0.434 | 0.465
Random         | NMF     | 0.858 | 0.954 | 0.903 | 0.950
Hierarchical   | NETRATE | 0.659 | 0.429 | 0.519 | 0.464
Hierarchical   | NMF     | 0.826 | 0.978 | 0.893 | 0.938
Core-periphery | NETRATE | 0.150 | 0.220 | 0.178 | 0.143
Core-periphery | NMF     | 0.709 | 0.865 | 0.779 | 0.931
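For concreteness, the four metrics can be computed from the learned matrix A and the ground truth as in the sketch below (Python/NumPy, following the definitions above; the threshold value and all names are illustrative).

```python
import numpy as np

def structure_metrics(A_learned, A_star, eps=0.01):
    """Prc, Rcl, Acc, and Cor between the recovered network (E, A), thresholded
    from A_learned, and the ground truth (E*, A*), per the definitions above."""
    E = {(i, j) for j, i in zip(*np.nonzero(A_learned >= eps))}   # (A)_{ji} >= eps => edge (i, j)
    E_star = {(i, j) for j, i in zip(*np.nonzero(A_star > 0))}
    common = E & E_star
    prc = len(common) / max(len(E_star), 1)                       # |E ∩ E*| / |E*|
    rcl = len(common) / max(len(E), 1)                            # |E ∩ E*| / |E|
    acc = 1.0 - len(E ^ E_star) / max(len(E) + len(E_star), 1)    # 1 - |E Δ E*| / (|E| + |E*|)
    cor = np.trace(A_learned.T @ A_star) / (
        np.linalg.norm(A_learned, "fro") * np.linalg.norm(A_star, "fro"))
    return prc, rcl, acc, cor

rng = np.random.default_rng(0)
A_star = rng.uniform(0.1, 1.0, (32, 32)) * (rng.random((32, 32)) < 0.1)   # toy true rates
A_learned = np.clip(A_star + 0.02 * rng.standard_normal((32, 32)), 0, None)
print(structure_metrics(A_learned, A_star))
```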
Influence maximization. We use NMF as the influence estimation subroutine in the classical greedy algorithm [38] (NMF+Greedy), and compare with DiffuCelf [42], a state-of-the-art method for influence maximization (IM). Like NMF+Greedy, DiffuCelf requires only infection time features and not the network structure, unlike most existing methods. We generate 1,000 cascades with unique sources (as required by DiffuCelf but not by our method) on a hierarchical network with 128 nodes and 512 edges, using the exponential distribution for the transmission function with A generated from Unif[1, 10]. The time window is T = 20. For each source set size n_0 = 1, …, 10, NMF+Greedy and DiffuCelf are applied to identify the optimal source sets, whose influences are computed by averaging 10,000 Monte Carlo simulated cascades. Figure 2d shows that the source sets obtained by NMF+Greedy generate greater influence than those of DiffuCelf consistently for every source size n_0.

Figure 2: (a) MAE of influence estimated by LSTM, InfluLearner, and NMF on the Weibo data; (b), (c) MAE of influence and infection probability of NMF for different network densities; (d) influence of source sets selected by DiffuCelf and NMF+Greedy for n_0 = 1, …, 10.

5 Related Work

Sampling-based influence estimation methods have been considered for both discrete-time and continuous-time diffusion models. Discrete-time models assume that node infections occur only at discrete time points. Under this setting, the Independent Cascade (IC) model [24] is considered in [34], where a method with a provable performance guarantee is developed for a single source by iterating over a sequence of guesses of the true influence until a verifier accepts. To resolve the inefficiency of Monte Carlo simulations, the reverse influence sampling (RIS) sketching method [5] is adopted in [41]. Moreover, instead of using the full network structure, sketch-based approaches only characterize propagation instances for influence computation, such as the method in [10], which considers per-node summary structures defined by the bottom-k min-hash [9] sketch of the combined reachability set. In contrast to discrete-time models, continuous-time diffusion models allow arbitrary event occurrence times and hence are more accurate in modeling real-world diffusion processes. In continuous-time independent cascade (CIC) models, influence estimation can be reformulated as the problem of finding the least-label list, which contains information about the distance to the smallest reachable labels from the source [13, 20]. Compared to methods using a fixed number of samples, a more scalable approximation scheme with a built-in block is developed in [40] to minimize the number of samples needed for the desired accuracy. The aforementioned methods require knowledge of cascade traces [10] or of the diffusion network (a review of related work on network structure inference is provided in Appendix C), such as node connectivity and node-to-node infection rates, as well as various assumptions on the diffusion of interest. However, such knowledge about the diffusion network may not be available in practice, and the assumptions on propagation or data formation are often application-specific and do not hold in most other problems. InfluLearner [12] is a state-of-the-art method that does not require knowledge of the underlying diffusion network: it estimates the influence directly from cascade data in the CIC model by learning the influence function with a parameterization of the coverage functions using random basis functions. However, the random basis functions suggested by [12] require knowledge of the original source node for every infection, which can be difficult or impossible to track in real-world applications. In recent years, deep learning techniques have been employed to improve the scalability of influence estimation on large networks.
In particular, convolutional neural networks (CNNs) and an attention mechanism are combined with both network structure and user-specific features to learn users' latent feature representations in [44]. By piping represented cascade graphs through a gated recurrent unit (GRU), the future incremental influence of a cascade can be predicted [31]. RNNs and CNNs are also applied to capture the temporal relationships on user-generated content networks (e.g., views, likes, comments, reposts) and to extract more powerful features in [55]. Among methods based on graph structures, graph neural networks (GNNs) and graph convolution networks (GCNs) are widely applied. In particular, two coupled GNNs are used to capture the interplay between node activation states and the influence spread [6], while in [30] GCNs integrated with the teleport probability from the PageRank domain enhance the performance of the method in [44]. However, these methods depend critically on structural or content features of the cascades, which may not be available in practice.

6 Conclusion

We proposed a novel framework using neural mean-field dynamics for inference and estimation on diffusion networks. Our new framework is derived from the Mori-Zwanzig formalism to obtain the exact evolution of the node infection probabilities. The memory term of the evolution can be approximated by convolutions, which turns the system into a delay differential equation whose time discretization reduces to a structured and interpretable RNN. Empirical study shows that our approach is versatile and robust to different variations of diffusion network models, and significantly outperforms existing approaches in accuracy and efficiency on both synthetic and real-world data sets.

Broader Impact

This paper makes a significant contribution to the learning of structure and infection probabilities for diffusion networks, which is one of the central problems in the study of stochastic information propagation on large heterogeneous networks. The proposed neural mean-field (NMF) dynamics provide the first principled approach for these inference and estimation problems using cascade data. NMF is shown to be a delay differential equation with a proper approximation of the memory integral by learnable time convolution operators, and the system reduces to a highly structured and interpretable recurrent neural network after time discretization. Potential applications include influence maximization, outbreak detection, and source identification.

Acknowledgments and Disclosure of Funding

The work of SH and XY was supported in part by the National Science Foundation under grants CMMI-1745382, DMS-1818886, and DMS-1925263. The work of HZ was supported in part by National Science Foundation grant IIS-1717916 and the Shenzhen Research Institute of Big Data. He was on leave from the College of Computing, Georgia Institute of Technology.

References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: A system for large-scale machine learning, 2016.
[2] D. P. Bertsekas. Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA, 1995.
[3] Á. Bodó, G. Y. Katona, and P. L. Simon. SIS epidemic propagation on hypergraphs. Bulletin of Mathematical Biology, 78(4):713-735, 2016.
[4] M. Boguná and R. Pastor-Satorras. Epidemic spreading in correlated complex networks. Physical Review E, 66(4):047104, 2002.
[5] C. Borgs, M. Brautbar, J. Chayes, and B. Lucier. Maximizing social influence in nearly optimal time. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, Dec. 2013.
[6] Q. Cao, H. Shen, J. Gao, B. Wei, and X. Cheng. Popularity prediction on social platforms with coupled graph neural networks. In Proceedings of the 13th International Conference on Web Search and Data Mining, WSDM '20, pages 70-78, New York, NY, USA, 2020. Association for Computing Machinery.
[7] A. J. Chorin, O. H. Hald, and R. Kupferman. Optimal prediction and the Mori-Zwanzig representation of irreversible processes. Proceedings of the National Academy of Sciences, 97(7):2968-2973, 2000.
[8] A. Clauset, C. Moore, and M. E. J. Newman. Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191):98-101, May 2008.
[9] E. Cohen. Size-estimation framework with applications to transitive closure and reachability. Journal of Computer and System Sciences, 55(3):441-453, 1997.
[10] E. Cohen, D. Delling, T. Pajor, and R. F. Werneck. Sketch-based influence maximization and computation: Scaling up with guarantees. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM '14, pages 629-638, New York, NY, USA, 2014. Association for Computing Machinery.
[11] X. Dong, D. Thanou, M. Rabbat, and P. Frossard. Learning graphs from data: A signal representation perspective. IEEE Signal Processing Magazine, 36(3):44-63, 2019.
[12] N. Du, Y. Liang, M.-F. Balcan, and L. Song. Influence function learning in information diffusion networks. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML '14, pages II-2016-II-2024. JMLR.org, 2014.
[13] N. Du, L. Song, M. Gomez-Rodriguez, and H. Zha. Scalable influence estimation in continuous-time diffusion networks. In Advances in Neural Information Processing Systems, pages 3147-3155, 2013.
[14] N. Du, L. Song, M. Yuan, and A. J. Smola. Learning networks of heterogeneous influence. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2780-2788. Curran Associates, Inc., 2012.
[15] M. Farajtabar, X. Ye, S. Harati, L. Song, and H. Zha. Multistage campaigning in social networks. In Advances in Neural Information Processing Systems, pages 4718-4726, 2016.
[16] M. Gomez-Rodriguez, D. Balduzzi, and B. Schölkopf. Uncovering the temporal dynamics of diffusion networks. arXiv preprint arXiv:1105.0697, 2011.
[17] M. Gomez-Rodriguez, D. Balduzzi, and B. Schölkopf. Uncovering the temporal dynamics of diffusion networks. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML '11, pages 561-568, Madison, WI, USA, 2011. Omnipress.
[18] M. Gomez-Rodriguez, J. Leskovec, and A. Krause. Inferring networks of diffusion and influence. ACM Transactions on Knowledge Discovery from Data (TKDD), 5(4):21, 2012.
[19] M. Gomez-Rodriguez, J. Leskovec, and B. Schölkopf. Structure and dynamics of information pathways in online media. CoRR, abs/1212.1464, 2012.
[20] M. Gomez-Rodriguez, L. Song, N. Du, H. Zha, and B. Schölkopf. Influence estimation and maximization in continuous-time diffusion networks. ACM Trans. Inf. Syst., 34(2), Feb. 2016.
[21] A. Gouasmi, E. J. Parish, and K. Duraisamy. A priori estimation of memory effects in reduced-order models of nonlinear systems using the Mori-Zwanzig formalism. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 473(2205):20170385, 2017.
[22] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9:1735-1780, 1997.
[23] D. Kempe, J. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 137-146. ACM, 2003.
[24] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 137-146. Association for Computing Machinery, 2003.
[25] D. Kempe, J. Kleinberg, and É. Tardos. Influential nodes in a diffusion model for social networks. In Automata, Languages and Programming, pages 1127-1138. Springer, 2005.
[26] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[27] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani. Kronecker graphs: An approach to modeling networks. The Journal of Machine Learning Research, 11:985-1042, 2010.
[28] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Statistical properties of community structure in large social and information networks. In Proceedings of the 17th International Conference on World Wide Web, WWW '08, pages 695-704, New York, NY, USA, 2008. Association for Computing Machinery.
[29] J. Leskovec and R. Sosič. SNAP: A general-purpose network analysis and graph-mining library. ACM Transactions on Intelligent Systems and Technology (TIST), 8(1):1, 2016.
[30] C. K. Leung, A. Cuzzocrea, J. J. Mai, D. Deng, and F. Jiang. Personalized DeepInf: Enhanced social influence prediction with deep learning and transfer learning. In 2019 IEEE International Conference on Big Data (Big Data), pages 2871-2880, 2019.
[31] C. Li, J. Ma, X. Guo, and Q. Mei. DeepCas: An end-to-end predictor of information cascades. In Proceedings of the 26th International Conference on World Wide Web, pages 577-586, 2017.
[32] Q. Li, L. Chen, C. Tai, and E. Weinan. Maximum principle based algorithms for deep learning. The Journal of Machine Learning Research, 18(1):5998-6026, 2017.
[33] Y. Liang, Z. Jiang, and Y. Zheng. Inferring traffic cascading patterns. In Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, SIGSPATIAL '17, New York, NY, USA, 2017. Association for Computing Machinery.
[34] B. Lucier, J. Oren, and Y. Singer. Influence at scale: Distributed computation of complex contagion in networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, pages 735-744, New York, NY, USA, 2015. Association for Computing Machinery.
[35] G. Mateos, S. Segarra, A. G. Marques, and A. Ribeiro. Connecting the dots: Identifying network structure via graph signal processing. IEEE Signal Processing Magazine, 36(3):16-43, May 2019.
[36] J. C. Miller and I. Z. Kiss. Epidemic spread in networks: Existing methods and current challenges. Mathematical Modelling of Natural Phenomena, 9(2):4, 2014.
[37] S. A. Myers and J. Leskovec. On the convexity of latent social network inference. arXiv preprint arXiv:1010.5504, 2010.
[38] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions I. Mathematical Programming, 14(1):265-294, 1978.
[39] M. Newman. Networks: An Introduction. Oxford University Press, 2010.
[40] H. T. Nguyen, T. P. Nguyen, T. N. Vu, and T. N. Dinh. Outward influence and cascade size estimation in billion-scale networks. Proc. ACM Meas. Anal. Comput. Syst., 1(1), June 2017.
[41] N. Ohsaka, T. Akiba, Y. Yoshida, and K.-i. Kawarabayashi. Dynamic influence analysis in evolving networks. Proc. VLDB Endow., 9(12):1077-1088, Aug. 2016.
[42] G. Panagopoulos, F. D. Malliaros, and M. Vazirgiannis. DiffuGreedy: An influence maximization algorithm based on diffusion cascades. In International Conference on Complex Networks and their Applications, pages 392-404. Springer, 2018.
[43] R. Pastor-Satorras, C. Castellano, P. Van Mieghem, and A. Vespignani. Epidemic processes in complex networks. Reviews of Modern Physics, 87(3):925, 2015.
[44] J. Qiu, J. Tang, H. Ma, Y. Dong, K. Wang, and J. Tang. DeepInf: Social influence prediction with deep learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2110-2119, 2018.
[45] M. G. Rodriguez and B. Schölkopf. Submodular inference of diffusion networks from multiple trees. arXiv preprint arXiv:1205.1671, 2012.
[46] Y. Rong, Q. Zhu, and H. Cheng. A model-free approach to infer the diffusion network from event cascade. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM '16, pages 1653-1662, New York, NY, USA, 2016. Association for Computing Machinery.
[47] F. D. Sahneh and C. Scoglio. Epidemic spread in human networks. In Decision and Control and European Control Conference (CDC-ECC), 2011 50th IEEE Conference on, pages 3008-3013. IEEE, 2011.
[48] P. Van Mieghem, J. Omic, and R. Kooij. Virus spread in networks. IEEE/ACM Transactions on Networking, 17(1):1-14, 2009.
[49] M. Vergeer, L. Hermans, and S. Sams. Online social networks and micro-blogging in political campaigning: The exploration of a new campaign tool and a new campaign style. Party Politics, 19(3):477-501, 2013.
[50] L. Wang, S. Ermon, and J. E. Hopcroft. Feature-enhanced probabilistic models for diffusion network inference. In Machine Learning and Knowledge Discovery in Databases, pages 499-514. Springer, 2012.
[51] Q. Wang, N. Ripamonti, and J. S. Hesthaven. Recurrent neural network closure of parametric POD-Galerkin reduced-order models based on the Mori-Zwanzig formalism. Journal of Computational Physics, 410:109402, 2020.
[52] J. Wortman. Viral marketing and the diffusion of trends on social networks. Technical report, 2008.
[53] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, pages 1-21, 2020.
[54] J. Zhang, B. Liu, J. Tang, T. Chen, and J. Li. Social influence locality for modeling retweeting behaviors. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI '13, pages 2761-2767. AAAI Press, 2013.
[55] Y. Zhu, J. Xie, and Z. Chen. Predicting the popularity of micro-videos with multimodal variational encoder-decoder framework, 2020.