# learning_influence_functions_from_incomplete_observations__d6e24327.pdf

Learning Inﬂuence Functions from Incomplete

Observations

Xinran He Ke Xu David Kempe Yan Liu University of Southern California, Los Angeles, CA 90089

{xinranhe, xuk, dkempe, yanliu.cs}@usc.edu

We study the problem of learning inﬂuence functions under incomplete observations of node activations. Incomplete observations are a major concern as most (online and real-world) social networks are not fully observable. We establish both proper and improper PAC learnability of inﬂuence functions under randomly missing observations. Proper PAC learnability under the Discrete-Time Linear Threshold (DLT) and Discrete-Time Independent Cascade (DIC) models is established by reducing incomplete observations to complete observations in a modiﬁed graph. Our improper PAC learnability result applies for the DLT and DIC models as well as the Continuous-Time Independent Cascade (CIC) model. It is based on a parametrization in terms of reachability features, and also gives rise to an efﬁcient and practical heuristic. Experiments on synthetic and real-world datasets demonstrate the ability of our method to compensate even for a fairly large fraction of missing observations.

1 Introduction Many social phenomena, such as the spread of diseases, behaviors, technologies, or products, can naturally be modeled as the diffusion of a contagion across a network. Owing to the potentially high social or economic value of accelerating or inhibiting such diffusions, the goal of understanding the ﬂow of information and predicting information cascades has been an active area of research [10, 7, 9, 14, 1, 20]. A key task here is learning inﬂuence functions, mapping sets of initial adopters to the individuals who will be inﬂuenced (also called active) by the end of the diffusion process [10].

Many methods have been developed to solve the inﬂuence function learning problem [9, 7, 5, 8, 3, 16, 18, 24, 25]. Most approaches are based on ﬁtting the parameters of a diffusion model based on

observations, e.g., [8, 7, 18, 9, 16]. Recently, Du et al. [3] proposed a model-free approach to learn inﬂuence functions as coverage functions; Narasimhan et al. [16] establish proper PAC learnability of inﬂuence functions under several widely-used diffusion models.

All existing approaches rely on the assumption that the observations in the training dataset are complete, in the sense that all active nodes are observed as being active. However, this assumption fails to hold in virtually all practical applications [15, 6, 2, 21]. For example, social media data are usually collected through crawlers or acquired with public APIs provided by social media platforms, such as Twitter or Facebook. Due to non-technical reasons and established restrictions on the APIs, it is often impossible to obtain a complete set of observations even for a short period of time. In turn, the existence of unobserved nodes, links, or activations may lead to a signiﬁcant misestimation of the diffusion model s parameters [19, 15].

We take a step towards addressing the problem of learning inﬂuence functions from incomplete observations. Missing data are a complicated phenomenon, but to address it meaningfully and rigorously, one must make at least some assumptions about the process resulting in the loss of data. We focus on random loss of observations: for each activated node independently, the node s activation is observed only with probability r, the retention rate, and fails to be observed with probability

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

1 r. Random observation loss naturally occurs when crawling data from social media, where rate restrictions are likely to affect all observations equally.

We establish both proper and improper PAC learnability of inﬂuence functions under incomplete observations for two popular diffusion models: the Discrete-Time Independent Cascade (DIC) and Discrete-Time Linear Threshold (DLT) models. In fact, randomly missing observations do not even signiﬁcantly increase the required sample complexity. The result is proved by interpreting the incomplete observations as complete observations in a transformed graph,

The PAC learnability result implies good sample complexity bounds for the DIC and DLT models. However, the PAC learnability result does not lead to an efﬁcient algorithm, as it involves marginalizing a large number of hidden variables (one for each node not observed to be active).

Towards designing more practical algorithms and obtaining learnability under a broader class of diffusion models, we pursue improper learning approaches. Concretely, we use the parameterization of Du et al. [3] in terms of reachability basis functions, and optimize a modiﬁed loss function suggested by Natarajan et al. [17] to address incomplete observations. We prove that the algorithm ensures improper PAC learning for the DIC, DLT and Continuous-Time Independent Cascade (CIC) models. Experimental results on synthetic cascades generated from these diffusion models and real-world cascades in the Meme Tracker dataset demonstrate the effectiveness of our approach. Our algorithm achieves nearly a 20% reduction in estimation error compared to the best baseline methods on the Meme Tracker dataset.

Several recent works also aim to address the issue of missing observations in social network analysis, but with different emphases. For example, Chierichetti et al. [2] and Sadikov et al. [21] mainly focus on recovering the size of a diffusion process, while our task is to learn the inﬂuence functions from several incomplete cascades. Myers et al. [15] mainly aim to model unobserved external inﬂuence in diffusion. Duong et al. [6] examine learning diffusion models with missing links from complete observations, while we learn inﬂuence functions from incomplete cascades with missing activations. Most related to our work are papers by Wu et al. [23] and simultaneous work by Lokhov [13]. Both study the problem of network inference under incomplete observations. Lokhov proposes a dynamic message passing approach to marginalize all the missing activations, in order to infer diffusion model parameters using maximum likelihood estimation, while Wu et al. develop an EM algorithm. Notice that the goal of learning the model parameters differs from our goal of learning the inﬂuence functions directly. Both [13] and [23] provide empirical evaluation, but do not provide theoretical guarantees.

2 Preliminaries 2.1 Models of Diffusion and Incomplete Observations Diffusion Model. We model propagation of opinions, products, or behaviors as a diffusion process over a social network. The social network is represented as a directed graph G = (V, E), where n = |V | is the number of nodes, and m = |E| is the number of edges. Each edge e = (u, v) is associated with a parameter wuv representing the strength of inﬂuence user u has on v. We assume that the graph structure (the edge set E) is known, while the parameters wuv are to be learned. Depending on the diffusion model, there are different ways to represent the strength of inﬂuence between individuals. Nodes can be in one of two states, inactive or active. We say that a node gets activated if it adopts the opinion/product/behavior under the diffusion process. In this work, we focus on progressive diffusion models, where a node remains active once it gets activated.

The diffusion process begins with a set of seed nodes (initial adopters) S, who start active. It then proceeds in discrete or continuous time: according to a probabilistic process, additional nodes may become active based on the inﬂuence from their neighbors. Let N(v) be the in-neighbors of node v and At the set of nodes activated by time t. We consider the following three diffusion models:

Discrete-time Linear Threshold (DLT) model [10]: Each node v has a threshold v drawn independently and uniformly from the interval [0, 1]. The diffusion under the DLT model unfolds in discrete time steps: a node v becomes active at step t if the total incoming weight from its active neighbors exceeds its threshold: P

u2N(v)\At 1 wuv v.

Discrete-time Independent Cascade (DIC) model [10]: The DIC model is also a discrete-time model. The weight wuv 2 [0, 1] captures an activation probability. When a node u becomes active in step t, it attempts to activate all currently inactive neighbors in step t + 1. For each neighbor v, it

succeeds with probability wuv. If it succeeds, v becomes active; otherwise, v remains inactive. Once u has made all these attempts, it does not get to make further activation attempts at later times.

Continuous-time Independent Cascade (CIC) model [8]: The CIC model unfolds in continuous time. Each edge e = (u, v) is associated with a delay distribution with wuv as its parameter. When a node u becomes newly active at time t, for every neighbor v that is still inactive, a delay time duv is drawn from the delay distribution. duv is the duration it takes u to activate v, which could be inﬁnite (if u does not succeed in activating v). Nodes are considered activated by the process if they are activated within a speciﬁed observation window [0, ].

Fix one of the diffusion models deﬁned above and its parameters. For each seed set S, let S be the distribution of ﬁnal active sets. (In the case of the DIC and DLT model, this is the set of active nodes when no new activations occur; for the CIC model, it is the set of nodes active at time .) For any node v, let Fv(S) = Prob A S[v 2 A] be the (marginal) probability that v is activated according to the dynamics of the diffusion model with initial seeds S. Then, deﬁne the inﬂuence function F : 2V ! [0, 1]n mapping seed sets to the vector of marginal activation probabilities: F (S) = [F1(S), . . . , Fn(S)]. Notice that the marginal probabilities do not capture the full information about the diffusion process contained in S (since they do not observe co-activation patterns), but they are sufﬁcient for many applications, such as inﬂuence maximization [10] and inﬂuence estimation [4].

Cascades and Incomplete Observations. We focus on the problem of learning inﬂuence functions from cascades. A cascade C = (S, A) is a realization of the random diffusion process; S is the set of seeds and A S, A S is the set of activated nodes at the end of the random process. Similar to Narasimhan et al. [16], we focus on activation-only observations, namely, we only observe which nodes were activated, but not when these activations occurred.1

To capture the fact that some of the node activations may have been unobserved, we use the following model of independently randomly missing data: for each (activated) node v 2 A \ S, the activation of v is actually observed independently with probability r. With probability 1 r, the node s activation is unobservable. For seed nodes v 2 S, the activation is never lost. Formally, deﬁne A as follows: each v 2 S is deterministically in A, and each v 2 A \ S is in A independently with probability r. Then, the incomplete cascade is denoted by C = (S, A).

2.2 Objective Functions and Learning Goals To measure estimation error, we primarily use a quadratic loss function, as in [16, 3]. For two n-dimensional vectors x, y, the quadratic loss is deﬁned as sq(x, y) = 1

2. We also use this notation when one or both arguments are sets: when an argument is a set S, we formally mean to use the indicator function χS as a vector, where χS(v) = 1 if v 2 S, and χS(v) = 0 otherwise. In particular, for an activated set A, we write sq(A, F (S)) = 1

n||χA F (S)||2

We now formally deﬁne the problem of learning inﬂuence functions from incomplete observations. Let P be a distribution over seed sets (i.e., a distribution over 2V ), and ﬁx a diffusion model M and parameters, together giving rise to a distribution S for each seed set. The algorithm is given a set of M incomplete cascades C = {(S1, A1), . . . , (SM, AM)}, where each Si is drawn independently from P, and Ai is obtained by the incomplete observation process described above from the (random) activated set Ai Si. The goal is to learn an inﬂuence function F that accurately captures the diffusion process. Accuracy of the learned inﬂuence function is measured in terms of the squared error with respect to the true model: errsq[F ] = ES P,A S [ sq(A, F (S))]. That is, the expectation is over the seed set and the randomness in the diffusion process, but not the data loss process.

PAC Learnability of Inﬂuence Functions. We characterize the learnability of inﬂuence functions under incomplete observations using the Probably Approximately Correct (PAC) learning framework [22]. Let FM be the class of inﬂuence functions under the diffusion model M, and FL the class of inﬂuence functions the learning algorithm is allowed to choose from. We say that FM is PAC learnable if there exists an algorithm A with the following property: for all ", δ 2 (0, 1), all parametrizations of the diffusion model, and all distributions P over seed sets S: when given activation-only

1Narasimhan et al. [16] refer to this model as partial observations; we change the terminology to avoid confusion with incomplete observations.

and incomplete training cascades C = {(S1, A1), . . . , (SM, AM)} with M poly(n, m, 1/", 1/δ), A outputs an inﬂuence function F 2 FL satisfying Prob C[errsq[F ] errsq[F ] "] δ.

Here, F 2 FM is the ground truth inﬂuence function. The probability is over the training cascades, including the seed set generation, the stochastic diffusion process, and the missing data process. We say that an inﬂuence function learning algorithm A is proper if FL FM; that is, the learned inﬂuence function is guaranteed to be an instance of the true diffusion model. Otherwise, we say that A is an improper learning algorithm.

3 Proper PAC Learning under Incomplete Observations In this section, we establish proper PAC learnability of inﬂuence functions under the DIC and DLT models. For both diffusion models, FM can be parameterized by an edge parameter vector w, whose entries we are the activation probabilities (DIC model) or edge weights (DLT model). Our goal is to ﬁnd an inﬂuence function F w 2 FM that outputs accurate marginal activation probabilities. While our goal is proper learning meaning that the function must be from FM we do not require that the inferred parameters match the true edge parameters w. Our main theoretical results are summarized in Theorems 1 and 2.

Theorem 1. Let λ 2 (0, 0.5). The class of inﬂuence functions under the DIC model in which all edge activation probabilities satisfy we 2 [λ, 1 λ] is PAC learnable under incomplete observations with retention rate r. The sample complexity2 is O( n3m

Theorem 2. Let λ 2 (0, 0.5), and consider the class of inﬂuence functions under the DLT model such that the edge weight for every edge satisﬁes we 2 [λ, 1 λ], and for every node v, 1 P

u2N(v) wuv 2 [λ, 1 λ]. This class is PAC learnable under incomplete observations with retention rate r. The sample complexity is O( n3m

In this section, we present the intuition and a proof sketch for the two theorems. Details of the proof are provided in Appendix B.

The key idea of the proof of both theorems is that a set of incomplete cascades C on G under the two models can be considered as essentially complete cascades on a transformed graph ˆG = ( ˆV , ˆE). The inﬂuence functions of nodes in ˆG can be learned using a modiﬁcation of the result of Narasimhan et al. [16]. Subsequently, the inﬂuence functions for G are directly obtained from the inﬂuence functions for ˆG, by exploiting that inﬂuence functions only focus on the marginal activation probabilities.

The transformed graph ˆG is built by adding a layer of n nodes to the graph G. For each node v 2 V of the original graph, we add a new node v0 2 V 0 and a directed edge (v, v0) with known and ﬁxed edge parameter ˆwvv0 = r. (The same parameter value serves as activation probability under the DIC model and as edge weight under the DLT model.) The new nodes V 0 have no other incident edges, and we retain all edges e = (u, v) 2 E. Inferring their parameters is the learning task.

For each observed (incomplete) cascade (Si, Ai) on G (with Si Ai), we produce an observed activation set A0

i as follows: (1) for each v 2 Ai \ Si, we let v0 2 A0

i deterministically; (2) for each v 2 Si independently, we include v0 2 A0

i with probability r. This deﬁnes the training cascades ˆC = {(Si, A0

Now consider any edge parameters w, applied to both G and the ﬁrst layer of ˆG. Let F (S) denote the inﬂuence function on G, and ˆF (S) = [ ˆF10(S), . . . , ˆFn0(S)] the inﬂuence function of the nodes in the added layer V 0 of ˆG. Then, by the transformation, for all nodes v 2 V , we get that

ˆFv0(S) = r Fv(S). (1)

And by the deﬁnition of the observation loss process, Prob[v 2 Ai] = r Fv(S) = ˆFv0(S) for all non-seed nodes v /2 Si.

While the cascades ˆC are not complete on all of ˆG, in a precise sense, they provide complete information on the activation of nodes in V 0. In Appendix B, we show that Theorem 2 of Narasimhan et al. [16] can be extended to provide identical guarantees for learning ˆF (S) from the modiﬁed

2The O notation suppresses poly-logarithmic dependence on 1/λ, 1/δ, n, and m.

observed cascades ˆC. For the DIC model, this is a straightforward modiﬁcation of the proof from [16]. For the DLT model, [16] had not established PAC learnability3, so we provide a separate proof.

Because the results of [16] and our generalizations ensure proper learning, they provide edge weights w between the nodes of V . We use these exact same edge weights to deﬁne the learned inﬂuence functions in G. Equation (1) then implies that the learned inﬂuence functions on V satisfy Fv(S) = 1

r ˆFv0(S). The detailed analysis in Appendix B shows that the learning error only scales by a multiplicative factor 1 r2 .

The PAC learnability result shows that there is no information-theoretical obstacle to inﬂuence function learning under incomplete observations. However, it does not imply an efﬁcient algorithm. The reason is that a hidden variable would be associated with each node not observed to be active, and computing the objective function for empirical risk minimization would require marginalizing over all of the hidden variables. The proper PAC learnability result also does not readily generalize to the CIC model and other diffusion models, even under complete observations. This is due to the lack of a succinct characterization of inﬂuence functions as under the DIC and DLT models. Therefore, in the next section, we explore improper learning approaches with the goal of designing practical algorithms and establishing learnability under a broader class of diffusion models.

4 Efﬁcient Improper Learning Algorithm Instead of parameterizing the inﬂuence functions using the edge parameters, we adopt the model-free inﬂuence function learning framework, Inﬂu Learner, proposed by Du et al. [3] to represent the inﬂuence function as a sum of weighted basis functions. From now on, we focus on the inﬂuence function Fv(S) of a single ﬁxed node v.

Inﬂuence Function Parameterization. For all three diffusion models (CIC, DIC and DLT), the diffusion process can be characterized equivalently using live-edge graphs. Concretely, the results of [10, 4] state that for each instance of the CIC, DIC, and DLT models, there exists a distribution Γ over live-edge graphs H assigning probability γH to each live-edge graph H such that F

H:at least one node in S has a path to v in H γH.

To reduce the representation complexity, notice that from the perspective of activating v, two different live-edge graphs H, H0 are equivalent if v is reachable from exactly the same nodes in H and H0. Therefore, for any node set T, let β

H:exactly the nodes in T have paths to v in H γH. We then use characteristic vectors as feature vectors r T = χT , where we will interpret the entry for node u as u having a path to v in a live-edge graph. More precisely, let φ(x) = min{x, 1}, and χS the characteristic vector of the seed set S. Then, φ(χ>

S r T ) = 1 if and only if v is reachable from S, and we can write F

This representation still has exponentially many features (one for each T). In order to make the learning problem tractable, we sample a smaller set T of K features from a suitably chosen distribution, implicitly setting the weights βT of all other features to 0. Thus, we will parametrize the learned inﬂuence function as F β

T 2T βT φ(χ>

The goal is then to learn weights βT for the sampled features. (They will form a distribution, i.e., ||β||1 = 1 and β 0.) The crux of the analysis is to show that a sufﬁciently small number K of features (i.e., sampled sets) sufﬁces for a good approximation, and that the weights can be learned efﬁciently from a limited number of observed incomplete cascades. Speciﬁcally, we consider the log likelihood function (t, y) = y log t + (1 y) log(1 t), and learn the parameter vector (a distribution) by maximizing the likelihood PM

v (Si), χAi(v)).

Handling Incomplete Observations. The maximum likelihood estimation cannot be directly applied to incomplete cascades, as we do not have access to Ai (only the incomplete version Ai). To address this issue, notice that the MLE problem is actually a binary classiﬁcation problem with log loss and yi = χAi(v) as the label. From this perspective, incompleteness is simply class-conditional noise on the labels. Let yi = χ Ai(v) be our observation of whether v was activated or not under the incomplete cascade i. Then, Prob[ yi = 1|yi = 1] = r and Prob[ yi = 1|yi = 0] = 0. In words,

3[16] shows that the DLT model with ﬁxed thresholds is PAC learnable under complete cascades. We study the DLT model when the thresholds are uniformly distributed random variables.

the incomplete observation yi suffers from one-sided error compared to the complete observation yi. By results of Natarajan et al. [17], we can construct an unbiased estimator of (t, y) using only the incomplete observations y, as in the following lemma.

Lemma 3 (Corollary of Lemma 1 of [17]). Let y be the true activation of node v and y the incomplete observation. Then, deﬁning (t, y) := 1

ry log t + 2r 1

r (1 y) log(1 t), for any t, we

Based on this lemma, we arrive at the ﬁnal algorithm of solving the maximum likelihood estimation problem with the adjusted likelihood function (t, y):

Maximize PM

v (Si), χ Ai(v)) (2)

subject to ||β||1 = 1, β 0.

We analyze conditions under which the solution to (2) provides improper PAC learnability under incomplete observations; these conditions will apply for all three diffusion models.

These conditions are similar to those of Lemma 1 in the work of Du et al. [3], and concern the approximability of the reachability distribution β

T . Speciﬁcally, let q be a distribution over node sets T such that q(T) Cβ

T for all node sets T. Here, C is a (possibly very large) number that we will want to bound below. Let T1, . . . , TK be K i.i.d. samples drawn from the distribution q. The features are then rk = χTk. We use the truncated version of the function F β,λ

v (S) with parameter4 λ as in [3]: F β,λ

v (S) = (1 2λ)F β

Let Mλ be the class of all such truncated inﬂuence functions, and F β,λ

v 2 Mλ the inﬂuence functions obtained from the optimization problem (2). The following theorem (proved in Appendix C) establishes the accuracy of the learned functions.

Theorem 4. Assume that the learning algorithm uses K = ( C2

"2 ) features in the inﬂuence function it constructs, and observes5 M = ( log C

"4r2 ) incomplete cascades with retention rate r. Then, with probability at least 1 δ, the learned inﬂuence functions F β,λ

v for each node v and seed distribution P satisfy ES P

The theorem implies that with enough incomplete cascades, an algorithm can approximate the ground truth inﬂuence function to arbitrary accuracy. Therefore, all three diffusion models are improperly PAC learnable under incomplete observations. The ﬁnal sample complexity does not contain the graph size, but is implicitly captured by C, which will depend on the graph and how well the distribution β

T can be approximated. Notice that with r = 1, our bound on M has logarithmic dependency on C instead of polynomial, as in [3]. The reason for this improvement is discussed further in Appendix C.

Efﬁcient Implementation. As mentioned above, the features T cannot be sampled from the exact reachability distribution β

T , because it is inaccessible (and complex). In order to obtain useful guarantees from Theorem 4, we follow the approach of Du et al. [3], and approximate the distribution β

T with the product of the marginal distributions, estimated from observed cascades.

The optimization problem (2) is convex and can therefore be solved in time polynomial in the number of features K. However, the guarantees in Theorem 4 require a possibly large number of features. In order to obtain an efﬁcient algorithm for practical use and our experiments, we sacriﬁce the guarantee and use a ﬁxed number of features.

5 Experiments In this section, we experimentally evaluate the algorithm from Section 4. Since no other methods explicitly account for incomplete observations, we compare it to several state-of-the-art methods for inﬂuence function learning with full information. Hence, the main goal of the comparison is to examine to what extent the impact of missing data can be mitigated by being aware of it. We compare

4The proof of Theorem 4 in Appendix C will show how to choose λ. 5The notation suppresses all logarithmic terms except log C, as C could be exponential or worse in the number of nodes.

(a) CIC (b) DIC (c) DLT Figure 1: MAE of estimated inﬂuence as a function of the retention rate on synthetic datasets for (a) CIC model, (b) DIC model, (c) DLT model. The error bars show one standard deviation.

our algorithm to the following approaches: (1) CIC ﬁts the parameters of a CIC model, using the Net Rate algorithm [7] with exponential delay distribution. (2) DIC ﬁts the activation probabilities of a DIC model using the method in [18]. (3) Inﬂu Learner is the model-free approach proposed by Du et al. in [3] and discussed in Section 4. (4) Logistic uses logistic regression to learn the inﬂuence functions Fu(S) = f(χ>

S cu +b) for each u independently, where cu is a learnable weight vector and f(x) = 1 1+e x is the logistic function. (5) Linear uses linear regression to learn the total inﬂuence σ(S) = c> χS + b of the set S. Notice that the CIC and DIC methods have access to the activation time of each node in addition to the ﬁnal activation status, giving them an inherent advantage.

5.1 Synthetic cascades Data generation. We generate synthetic networks with core-peripheral structure following the Kronecker graph model [12] with parameter matrix [0.9, 0.5; 0.5, 0.3].6 Each generated network has 512 nodes and 1024 edges. Subsequently, we generate 8192 cascades as training data using the CIC, DIC and DLT models, with random seed sets whose sizes are power law distributed. The retention rate is varied between 0.1 and 0.9. The test set contains 200 independently sampled seed sets generated in the same way as the training data. Details of the data generation process are provided in Appendix A.

Algorithm settings. We apply all algorithms to cascades generated from all three models; that is, we also consider the results under model misspeciﬁcation. Whenever applicable, we set the hyperparameters of the ﬁve comparison algorithms to the ground truth values. When applying the Net Rate algorithm to discrete-time cascades, we set the observation window to 10.0. When applying the method in [18] to continuous-time cascades, we round activation times up to the nearest multiple of 0.1, resulting in 10 discrete time steps. For the model-free approaches (Inﬂu Learner and our algorithm), we use K = 200 features.

Results. Figure 1 shows the Mean Absolute Error (MAE) between the estimated total inﬂuence σ(S) and the true inﬂuence value, averaged over all testing seed sets. For each setting (diffusion model and retention rate), the reported MAE is averaged over ﬁve independent runs.

The main insight is that accounting for missing observations indeed strongly mitigates their effect: notice that for retention rates as small as 0.5, our algorithm can almost completely compensate for the data loss, whereas both the model-free and parameter ﬁtting approaches deteriorate signiﬁcantly even for retention rates close to 1. For the parameter ﬁtting approaches, even such large retention rates can lead to missing and spurious edges in the inferred networks, and thus signiﬁcant estimation errors. Additional observations include that ﬁtting inﬂuence using (linear or logistic) regression does not perform well at all, and that the CIC inference approach appears more robust to model misspeciﬁcation than the DIC approach.

Sensitivity of retention rate. We presented the algorithms as knowing r. Since r itself is inferred from noisy data, it may be somewhat misestimated. Figure 2 shows the impact of misestimating r. We generate synthetic cascades from all three diffusion models with a true retention rate of 0.8, and

6We also experimented on Kronecker graphs with hierarchical community structure ([0.9, 0.1; 0.1, 0.9]) and random structure ([0.5, 0.5; 0.5, 0.5]). The results are similar and omitted due to space constraints.

0 5 10 15 20 25 30 35 40

1 0.95 0.9 0.85 0.8 0.75 0.7 0.65 0.6

CIC DIC DLT

Figure 2: Relative error in MAE under retention rate misspeciﬁcation. x-axis: retention rate r used by the algorithm. y-axis: relative difference of MAE compared to using the true retention rate 0.8.

1 2 3 4 5 6 7

Groups2of2memes

Linear Logistic DIC CIC Influ Learner Our3Method

Figure 3: MAE of inﬂuence estimation on seven sets of real-world cascades with 20% of activations missing.

then apply our algorithm with (incorrect) retention rate r 2 {0.6, 0.65, . . . , 0.95, 1}. The results are averaged over ﬁve independent runs. While the performance decreases as the misestimation gets worse (after all, with r = 1, the algorithm is basically the same as Inﬂu Learner), the degradation is graceful.

5.2 Inﬂuence Estimation on real cascades We further evaluate the performance of our method on the real-world Meme Tracker7 dataset [11]. The dataset consists of the propagation of short textual phrases, referred to as Memes, via the publication of blog posts and main-stream media news articles between March 2011 and February 2012. Speciﬁcally, the dataset contains seven groups of cascades corresponding to the propagation of Memes with certain keywords, namely apple and jobs , tsunami earthquake , william kate marriage , occupy wall-street , airstrikes , egypt and elections. Each cascade group consists of 1000 nodes, with a number of cascades varying from 1000 to 44000. We follow exactly the same evaluation method as Du et al. [3] with a training/test set split of 60%/40%.

To test the performance of inﬂuence function learning under incomplete observations, we randomly delete 20% of the occurrences, setting r = 0.8. The results for other retention rates are similar and omitted. Figure 3 shows the MAE of our methods and the ﬁve baselines, averaged over 100 random draws of test seed sets, for all groups of memes. While some baselines perform very poorly, even compared to the best baseline (Inﬂu Learner), our algorithm provides an 18% reduction in MAE (averaged over the seven groups), showing the potential of data loss awareness to mitigate its effects.

6 Extensions and Future Work In the full version available on ar Xiv, we show both experimentally and theoretically how to generalize our results to non-uniform (but independent) loss of node activations, and how to deal with a misestimation of the retention rate r. Any non-trivial partial information about r leads to positive PAC learnability results.

A much more signiﬁcant departure for future work would be dependent loss of activations, e.g., losing all activations of some randomly chosen nodes. As another direction, it would be worthwhile to generalize the PAC learnability results to other diffusion models, and to design an efﬁcient algorithm with PAC learning guarantees.

Acknowledgments We would like to thank anonymous reviewers for useful feedback. The research was sponsored in part by NSF research grant IIS-1254206 and by the U.S. Defense Advanced Research Projects Agency (DARPA) under the Social Media in Strategic Communication (SMISC) program, Agreement Number W911NF-12-1-0034. The views and conclusions are those of the authors and should not be interpreted as representing the ofﬁcial policies of the funding agency or the U.S. Government.

7We use the preprocessed version of the dataset released by Du et al. [3] and available at http://www.cc. gatech.edu/~ndu8/Influ Learner.html. Notice that the dataset is semi-real, as multi-node seed cascades are artiﬁcially created by merging single-node seed cascades.

[1] K. Amin, H. Heidari, and M. Kearns. Learning from contagion (without timestamps). In Proc.

31st ICML, pages 1845 1853, 2014. [2] F. Chierichetti, D. Liben-Nowell, and J. M. Kleinberg. Reconstructing Patterns of Information

Diffusion from Incomplete Observations. In Proc. 23rd NIPS, pages 792 800, 2011. [3] N. Du, Y. Liang, M.-F. Balcan, and L. Song. Inﬂuence Function Learning in Information

Diffusion Networks. In Proc. 31st ICML, pages 2016 2024, 2014. [4] N. Du, L. Song, M. Gomez-Rodriguez, and H. Zha. Scalable Inﬂuence Estimation in Continuous-

Time Diffusion Networks. In Proc. 25th NIPS, pages 3147 3155, 2013. [5] N. Du, L. Song, S. Yuan, and A. J. Smola. Learning Networks of Heterogeneous Inﬂuence. In

Proc. 24th NIPS, pages 2780 2788. 2012. [6] Q. Duong, M. P. Wellman, and S. P. Singh. Modeling information diffusion in networks with

unobserved links. In Social Com/PASSAT, pages 362 369, 2011. [7] M. Gomez-Rodriguez, D. Balduzzi, and B. Schölkopf. Uncovering the temporal dynamics of

diffusion networks. In Proc. 28th ICML, pages 561 568, 2011. [8] M. Gomez-Rodriguez, J. Leskovec, and A. Krause. Inferring networks of diffusion and inﬂuence.

ACM Transactions on Knowledge Discovery from Data, 5(4), 2012. [9] A. Goyal, F. Bonchi, and L. V. S. Lakshmanan. Learning inﬂuence probabilities in social

networks. In Proc. 3rd WSDM, pages 241 250, 2010. [10] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the Spread of Inﬂuence in a Social Network.

In Proc. 9th KDD, pages 137 146, 2003. [11] J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the dynamics of the news

cycle. In Proc. 15th KDD, pages 497 506, 2009. [12] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani. Kronecker graphs:

An approach to modeling networks. The Journal of Machine Learning Research, 11:985 1042, 2010. [13] A. Lokhov. Reconstructing parameters of spreading models from partial observations. In Proc.

28th NIPS, pages 3459 3467, 2016. [14] S. A. Myers and J. Leskovec. On the convexity of latent social network inference. In Proc.

22nd NIPS, pages 1741 1749, 2010. [15] S. A. Myers, C. Zhu, and J. Leskovec. Information Diffusion and External Inﬂuence in Networks.

In Proc. 18th KDD, pages 33 41, 2012. [16] H. Narasimhan, D. C. Parkes, and Y. Singer. Learnability of Inﬂuence in Networks. In Proc.

27th NIPS, pages 3168 3176, 2015. [17] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari. Learning with noisy labels. In

Proc. 25th NIPS, pages 1196 1204, 2013. [18] N. Praneeth and S. Sujay. Learning the Graph of Epidemic Cascades. In Proc. 12th ACM

Sigmetrics, pages 211 222, 2012. [19] D. Quang, W. M. P, and S. S. P. Modeling Information Diffusion in Networks with Unobserved

Links. In Social Com, pages 362 369, 2011. [20] N. Rosenfeld, M. Nitzan, and A. Globerson. Discriminative learning of infection models. In

Proc. 9th WSDM, pages 563 572, 2016. [21] E. Sadikov, M. Medina, J. Leskovec, and H. Garcia-Molina. Correcting for missing data in

information cascades. In Proc. 4th WSDM, pages 55 64, 2011. [22] L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134 1142,

1984. [23] X. Wu, A. Kumar, D. Sheldon, and S. Zilberstein. Parameter learning for latent network

diffusion. In Proc. 29th IJCAI, pages 2923 2930, 2013. [24] S.-H. Yang and H. Zha. Mixture of Mutually Exciting Processes for Viral Diffusion. In Proc.

30th ICML, pages 1 9, 2013. [25] K. Zhou, H. Zha, and L. Song. Learning Social Infectivity in Sparse Low-rank Network Using

Multi-dimensional Hawkes Processes. In Proc. 30th ICML, pages 641 649, 2013.