# learnability_of_influence_in_networks__8e48dd3c.pdf

Learnability of Inﬂuence in Networks

Harikrishna Narasimhan David C. Parkes Yaron Singer Harvard University, Cambridge, MA 02138 hnarasimhan@seas.harvard.edu, {parkes, yaron}@seas.harvard.edu

We show PAC learnability of inﬂuence functions for three common inﬂuence models, namely, the Linear Threshold (LT), Independent Cascade (IC) and Voter models, and present concrete sample complexity results in each case. Our results for the LT model are based on interesting connections with neural networks; those for the IC model are based an interpretation of the inﬂuence function as an expectation over random draw of a subgraph and use covering number arguments; and those for the Voter model are based on a reduction to linear regression. We show these results for the case in which the cascades are only partially observed and we do not see the time steps in which a node has been inﬂuenced. We also provide efﬁcient polynomial time learning algorithms for a setting with full observation, i.e. where the cascades also contain the time steps in which nodes are inﬂuenced.

1 Introduction

For several decades there has been much interest in understanding the manner in which ideas, language, and information cascades spread through society. With the advent of social networking technologies in recent years, digital traces of human interactions are becoming available, and the problem of predicting information cascades from these traces has gained enormous practical value. For example, this is critical in applications like viral marketing, where one needs to maximize awareness about a product by selecting a small set of inﬂuential users [1].

To this end, the spread of information in networks is modeled as an inﬂuence function which maps a set of seed nodes who initiate the cascade to (a distribution on) the set of individuals who will be inﬂuenced as a result [2]. These models are parametrized by variables that are unknown and need to be estimated from data. There has been much work on estimating the parameters of inﬂuence models (or the structure of the underlying social graph) from observed cascades of inﬂuence spread, and on using the estimated parameters to predict inﬂuence for a given seed set [3, 4, 5, 6, 7, 8]. These parameter estimation techniques make use of local inﬂuence information at each node, and there has been a recent line of work devoted to providing sample complexity guarantees for these local estimation techniques [9, 10, 11, 12, 13].

However, one cannot locally estimate the inﬂuence parameters when the cascades are not completely observed (e.g. when the cascades do not contain the time at which the nodes are inﬂuenced). Moreover, inﬂuence functions can be sensitive to errors in model parameters, and existing results do not tell us to what accuracy the individual parameters need to be estimated to obtain accurate inﬂuence predictions. If the primary goal in an application is to predict inﬂuence accurately, it is natural to ask for algorithms that have learnability guarantees on the inﬂuence function itself. A benchmark for studying such questions is the Probably Approximately Correct (PAC) learning framework [14]:

Are inﬂuence functions PAC learnable?

While many inﬂuence models have been popularized due to their approximation guarantees for inﬂuence maximization [2, 15, 16], learnability of inﬂuence is an equally fundamental property.

Part of this work was done when HN was a Ph D student at the Indian Institute of Science, Bangalore.

In this paper, we show PAC learnability for three well-studied inﬂuence models: the Linear Threshold, the Independent Cascade, and the Voter models. We primarily consider a setting where the cascades are partially observed, i.e. where only the nodes inﬂuenced and not the time steps at which they were inﬂuenced are observed. This is a setting where existing local estimation techniques cannot be applied to obtain parameter estimates. Additionally, for a fully observed setting where the time of inﬂuence is also observed, we show polynomial time learnability; our methods here are akin to using local estimation techniques, but come with guarantees on the global inﬂuence function.

Main results. Our learnability results are summarized below.

Linear threshold (LT) model: Our result here is based on an interesting observation that LT inﬂuence functions can be seen as multi-layer neural network classiﬁers, and proceed by bounding their VC-dimension. The method analyzed here picks a function with zero training error. While this can be computationally hard to implement under partial observation, we provide a polynomial time algorithm for the full observation case using local computations. Independent cascade (IC) model: Our result uses an interpretation of the inﬂuence function as an expectation over random draw of a subgraph [2]; this allows us to show that the function is Lipschitz and invoke covering number arguments. The algorithm analyzed for partial observation is based on global maximum likelihood estimation. Under full observation (and additional assumptions), we show polynomial time learnability using a local estimation technique. Voter model: Our result follows from a reduction of the learning problem to a linear regression problem; the resulting learning algorithm can be implemented in polynomial time for both the full and partial observation settings.

Related work. A related problem to ours is that of inferring the structure of the underlying social graph from cascades [6]. There has been a series of results on polynomial sample complexity guarantees for this problem under variants of the IC model [9, 12, 10, 11]. Most of these results make speciﬁc assumptions on the cascades/graph structure, and assume a full observation setting. On the other hand, in our problem, the structure of the social graph is assumed to be known, and the goal is to provably learn the underlying inﬂuence function. Our results do not depend on assumptions on the network structure, and primarily apply to the more challenging partial observation setting.

The work that is most related to ours is that of Du et al. [13], who show polynomial sample complexity results for learning inﬂuence in the LT and IC models (under partial observation). However, their approach uses approximations to inﬂuence functions and consequently requires a strong technical condition to hold, which is not necessarily satisﬁed in general. Our results for the LT and IC models are some what orthogonal. While the authors in [13] trade-off assumptions on learnability and gain efﬁcient algorithms that work well in practice, our goal is to show unconditional sample complexity for learning inﬂuence. We do this at the expense of the efﬁciency of the learning algorithms in the partial observation setting. Moreover, the technical approach we take is substantially different.

There has also been work on learnability of families of discrete functions such as submodular [17] and coverage functions [18], under the PAC and the variant PMAC frameworks. These results assume availability of a training sample containing exact values of the target function on the given input sets. While IC inﬂuence functions can be seen as coverage functions [2], the previous results do not directly apply to the IC class, as in practice, the true (expected) value of an IC inﬂuence function on a seed set is never observed, and only a random realization is seen. In contrast, our learnability result for IC functions do not require the exact function values to be known. Moreover, the previous results require strict assumptions on the input distribution. Since we focus on learnability of speciﬁc function classes rather than large families of discrete functions, we are able to handle general seed distributions for most part. Other results relevant to our work include learnability of linear inﬂuence games [19], where the techniques used bear some similarity to our analysis for the LT model.

2 Preliminaries

Inﬂuence models. We represent a social network as a ﬁnite graph G = (V, E), where the nodes V = {1, . . . , n} represent a set of n individuals and edges E V 2 represent their social links. Let |E| = r. The graph is assumed to be directed unless otherwise speciﬁed. Each edge (u, v) E is associated with a weight wuv R+ that indicates the strength of inﬂuence of node v on node u. We consider a setting where each node in the network holds an opinion in {0, 1} and opinions

disseminate in the network. This dissemination process begins with a small subset of nodes called the seed which have opinion 1 while the rest have opinion 0, and continues in discrete time steps. In every time step, a node may change its opinion from 0 to 1 based on the opinion of its neighbors, and according to some local model of inﬂuence; if this happens, we say that the node is inﬂuenced. We will use N(u) to denote the set of neighbors of node u, and At to denote the set of nodes that are inﬂuenced at time step t. We consider three well-studied models:

Linear threshold (LT) model: Each node u holds a threshold ru R+, and is inﬂuenced at time t if the total incoming weight from its neighbors that were inﬂuenced at the previous time step t 1 exceeds the threshold: P

v N(u) At 1 wuv ku. Once inﬂuenced, node u can then inﬂuence its neighbors for one time step, and never changes its opinion to 0.1

Independent cascade (IC) model: Restricting edge weights wuv to be in [0, 1], a node u is inﬂuenced at time t independently by each neighbor v who was inﬂuenced at time t 1. The node can then inﬂuence its neighbors for one time step, and never changes its opinion to 0. Voter model: The graph is assumed to be undirected (with self-loops); at time step t, a node u adopts the opinion of its neighbor v with probability wuv/ P

v N(u) {u} wuv . Unlike the LT and IC models, here a node may change its opinion from 1 to 0 or 0 to 1 at every step.

We stress that a node is inﬂuenced at time t if it changes its opinion from 0 to 1 exactly at t. Also, in both the LT and IC models, no node gets inﬂuenced more than once and hence an inﬂuence cascade can last for at most n time steps. For simplicity, we shall consider in all our deﬁnitions only cascades of length n. While revisiting the Voter model in Section 5, we will look at more general cascades. Deﬁnition 1 (Inﬂuence function). Given an inﬂuence model, a (global) inﬂuence function F : 2V [0, 1]n maps an initial set of nodes X V seeded with opinion 1 to a vector of probabilities [F1(X), . . . , Fn(X)] [0, 1]n, where the uth coordinate indicates the probability of node u V being inﬂuenced during any time step of the corresponding inﬂuence cascades.

Note that for the LT model, the inﬂuence process is deterministic, and the inﬂuence function simply outputs a binary vector in {0, 1}n. Let FG denote the class of all inﬂuence functions under an inﬂuence model over G, obtained for different choices of parameters (edge weights/thresholds) in the model. We will be interested in learning the inﬂuence function for a given parametrization of this inﬂuence model. We shall assume that the initial set of nodes that are seeded with opinion 1 at the start of the inﬂuence process, or the seed set, is chosen i.i.d. according to a distribution µ over all subsets of nodes. We are given a training sample consisting of draws of initial seed sets from µ, along with observations of nodes inﬂuenced in the corresponding inﬂuence process. Our goal is to then learn from FG an inﬂuence function that best captures the observed inﬂuence process.

Measuring Loss. To measure quality of the learned inﬂuence function, we deﬁne a loss function ℓ: 2V [0, 1]n R+ that for any subset of inﬂuenced nodes Y V and predicted inﬂuence probabilities p [0, 1]n assigns a value ℓ(Y, p) measuring discrepancy between Y and p. We deﬁne the error of a learned function F FG for a given seed distribution µ and model parametrization as the expected loss incurred by F: errℓ[F] = EX,Y ℓ Y, F(X) ,

where the above expectation is over a random draw of the seed set X from distribution µ and over the corresponding subsets of nodes Y inﬂuenced during the cascade.

We will be particularly interested in the difference between the error of an inﬂuence function FS FG learned from a training sample S and the minimum possible error achievable over all inﬂuence functions in FG: errℓ FS inf F FG errℓ F , and would like to learn inﬂuence functions for which this difference is guaranteed to be small (using only polynomially many training examples).

Full and partial observation. We primarily work in a setting in which we observe the nodes inﬂuenced in a cascade, but not the time step at which they were inﬂuenced. In other words, we assume availability of a partial observed training sample S = {(X1, Y 1) . . . , (Xm, Y m)}, where Xi denotes the seed set of a cascade i and Y i is the set of nodes inﬂuenced in that cascade. We will also consider a reﬁned notion of full observation in which we are provided a training sample S = {(X1, Y 1 1:n) . . . , (Xm, Y m 1:n)}, where Y i 1:n = {Y i 1 , . . . , Y i n} and Y i t is the set of nodes in

1In settings where the node thresholds are unknown, it is common to assume that they are chosen randomly by each node [2]. In our setup, the thresholds are parameters that need to be learned from cascades.

cascade i who were inﬂuenced precisely at time step t. Notice that here the complete set of nodes inﬂuenced in cascade i is given by Sn t=1 Y i t . This setting is particularly of interest when discussing learnability in polynomial time. The structure of the social graph is always assumed to be known.

PAC learnability of inﬂuence functions. Let FG be the class of all inﬂuence functions under an inﬂuence model over a n-node social network G = (V, E). We say FG is probably approximately correct (PAC) learnable w.r.t. loss ℓif there exists an algorithm s.t. the following holds for ϵ, δ (0, 1), for all parametrizations of the model, and for all (or a subset of) distributions µ over seed sets: when the algorithm is given a partially observed training sample S = {(X1, Y 1), . . . , (Xm, Y m)} with m poly(1/ϵ, 1/δ) examples, it outputs an inﬂuence function FS FG for which

PS errℓ FS inf F FG errℓ F ϵ δ,

where the above probability is over the randomness in S. Moreover, FG is efﬁciently PAC learnable under this setting if the running time of the algorithm in the above deﬁnition is polynomial in m and in the size of G. We say FG is (efﬁciently) PAC learnable under full observation if the above deﬁnition holds with a fully observed training sample S = {(X1, Y 1 1:n), . . . , (Xm, Y m 1:n)}.

Sensitivity of inﬂuence functions to parameter errors. A common approach to predicting inﬂuence under full observation is to estimate the model parameters using local inﬂuence information at each node. However, an inﬂuence function can be highly sensitive to errors in estimated parameters. E.g. consider an IC model on a chain of n nodes where all edge parameters are 1; if the parameters have all been underestimated with a constant error of ϵ, the estimated probability of the last node being inﬂuenced is (1 ϵ)n, which is exponentially smaller than the true value 1 for large n. Our results for full observation provide concrete sample complexity guarantees for learning inﬂuence functions using local estimation, to any desired accuracy; in particular, for the above example, our results prescribe that ϵ be driven below 1/n for accurate predictions (see Section 4 on IC model). Of course, under partial observation, we do not see enough information to locally estimate the individual model parameters, and the inﬂuence function needs to be learned directly from cascades.

3 The Linear Threshold model

We start with learnability in the Linear Threshold (LT) model. Given that the inﬂuence process is deterministic and the inﬂuence function outputs binary values, we use the 0-1 loss for evaluation; for any subset of nodes Y V and predicted boolean vector q {0, 1}n, this is the fraction of nodes on which the prediction is wrong: ℓ0-1(Y, q) = 1 n Pn u=1 1(χu(Y ) = qu), where χu(Y ) = 1(u Y ). Theorem 1 (PAC learnability under LT model). The class of inﬂuence functions under the LT model is PAC learnable w.r.t. ℓ0-1 and the corresponding sample complexity is e O ϵ 1(r + n) . Furthermore, in the full observation setting the inﬂuence functions can be learned in polynomial time.

The proof is in Appendix A and we give an outline here. Let F w denote a LT inﬂuence function with parameters w Rr+n (edge weights and thresholds) and let us focus on the partial observation setting (only a node and not its time of inﬂuence is observed). Consider a simple algorithm that outputs an inﬂuence function with zero error on training sample S = {(X1, Y 1), . . . , (Xm, Y m)}: 1 m

i=1 ℓ0-1 Y i, F w(Xi) = 1 mn

u=1 1 χu(Y i) = F w u (Xi) . (1)

Such a function always exists as the training cascades are generated using the LT model. We will shortly look at computational issues in implementing this algorithm. We now explain our PAC learnability result for this algorithm. The main idea is in interpreting LT inﬂuence functions as neural networks with linear threshold activations. The proof follows by bounding the VC-dimension of the class of all functions F w u for node u, and using standard arguments in showing learnability under ﬁnite VC-dimension [20]. We sketch the neural network (NN) construction in two steps (local inﬂuence as a two-layer NN, and the global inﬂuence as a multilayer network; see Figure 1), where a crucial part is in ensuring that no node gets inﬂuenced more than once during the inﬂuence process:

1. Local inﬂuence as a two-layer NN. Recall that the (local) inﬂuence at a node u for previously inﬂuenced nodes Z is given by 1 P

v N(u) Z wuv ku . This can be modeled as a linear (binary) classiﬁer, or equivalently as a two-layer NN with linear threshold activations. Here the input layer contains a unit for each node in the network and takes a binary value indicating whether the node

Figure 1: Modeling a single time step t of the inﬂuence process Ft,u : 2V {0, 1} as a neural network (t 2): the portion in black computes whether or not node u is inﬂuenced in the current time step t, while that in red/blue enforces the constraint that u does not get inﬂuenced more than once during the inﬂuence process. Here ξt,u is 1 when a node has been inﬂuenced previously and 0 otherwise. The dotted red edges represent strong negative signals (has a large negative weight) and the dotted blue edges represent strong positive signals. The initial input to each node u in the input layer is 1(u X), while that for the auxiliary nodes (in red) is 0.

is present in Z; the output layer contains a binary unit indicating whether u is inﬂuenced after one time step; the connections between the two layers correspond to the edges between u and other nodes; and the threshold term on the output unit is the threshold parameter ku. Thus the ﬁrst step of the inﬂuence process can be modeled using a NN with two n-node layers (the input layer takes information about the seed set, and the binary output indicates which nodes got inﬂuenced).

2. From local to global: the multilayer network. The two-layer NN can be extended to multiple time steps by replicating the output layer once for each step. However, the resulting NN will allow a node to get inﬂuenced more than once during the inﬂuence process. To avoid this, we introduce an additional binary unit u for each node u in a layer, which will record whether node u was inﬂuenced in previous time steps. In particular, whenever node u is inﬂuenced in a layer, a strong positive signal is sent to activate u in the next layer, which in turn will send out strong negative signals to ensure u is never activated in subsequent layers2; we use additional connections to ensure that u remains active there after. Note that a node u in layer t + 1 is 1 whenever u is inﬂuenced at time step t; let F w t,u : 2V {0, 1} denote this function computed at u for a given seed set. The LT inﬂuence function F w u (which for seed set X is 1 whenever u is inﬂuenced in any one of the n time steps) is then given by F w u (X) = Pn t=1 F w t,u(X). Clearly, F w u can be modeled as a NN with n + 1 layers.

A naive application of classic VC-dimension results for NN [21] will give us that the VC-dimension of the class of functions Fu is e O(n(r + n)) (counting r + n parameters for each layer). Since the same parameters are repeated across layers, this can be tightened to e O(r + n). The remaining proof involves standard uniform convergence arguments [20] and a union bound over all nodes.

3.1 Efﬁcient computation

Having shown PAC learnability, we turn to efﬁcient implementation of the prescribed algorithm.

Partial observation. In the case where the training set does not specify the time at which each node was infected, ﬁnding an inﬂuence function with zero training error is computationally hard in general (as this is similar to learning a recurrent neural network). In practice, however, we can leverage the neural network construction, and solve the problem approximately by replacing linear threshold activation functions with sigmoidal activations and the 0-1 loss with a suitable continuous surrogate loss, and apply back-propagation based methods used for neural network learning.

Full observation. Here it turns out that the algorithm can be implemented in polynomial time using local computations. Given a fully observed sample S = {(X1, Y 1 1:n), . . . , (Xm, Y m 1:n)}, the loss of an inﬂuence function F for any (X, Y1:n) is given by ℓ0-1( n t=1Yt, F(X)) and as before measures the fraction of mispredicted nodes. The prescribed algorithm then seeks to ﬁnd parameters w for which the corresponding training error is 0. Given that the time of inﬂuence is observed, this problem can be decoupled into a set of linear programs (LPs) at each node; this is akin to locally estimating the parameters at each node. In particular, let wu denote the parameters local to node u (incoming weights and threshold), and let fu(Z; wu) = 1 P v N(u) Z wuv ku denote the local inﬂuence at u for set Z of previously inﬂuence nodes. Let bα1,u(wu) = 1 m Pm i=1 1 χu(Y i 1 ) = fu(Xi; wu)

and bαt,u(wu) = 1 m Pm i=1 1 χu(Y i t ) = fu(Y i t 1; wu) , t 2, that given the set of nodes Y i t 1 inﬂuenced at time t 1, measures the local prediction error at time t. Since the training sample was

2By a strong signal, we mean a large positive/negative connection weight which will outweigh signals from other connections. Indeed such connections can be created when the weights are all bounded.

generated by a LT model, there always exists parameters such that bαt,u(wu) = 0 for each t and u, which also implies that the overall training error is 0. Such a set of parameters can be obtained by formulating a suitable LP that can be solved in polynomial time. The details are in Appendix A.2.

4 The Independent Cascade model

We now address the question of learnability in the Independent Cascade (IC) model. Since the inﬂuence functions here have probabilistic outputs, the proof techniques we shall use will be different from the previous section, and will rely on arguments based on covering numbers. In this case, we use the squared loss which for any Y V and q [0, 1]n, is given by: ℓsq(Y, q) = 1 n Pn u=1[χu(Y )(1 qu)2 + (1 χu(Y ))q2 u]. We shall make a mild assumption that the edge probabilities are bounded away from 0 and 1, i.e. w [λ, 1 λ]r for some λ (0, 0.5).

Theorem 2 (PAC learnability under IC model). The class of inﬂuence functions under the IC model is PAC learnable w.r.t. ℓsq and the sample complexity is m = e O ϵ 2n3r . Furthermore, in the full observation setting, under additional assumptions (see Assumption 1), the inﬂuence functions can be learned in polynomial time with sample complexity e O(ϵ 2nr3).

The proof is given in Appendix B. As noted earlier, an IC inﬂuence function can be sensitive to errors in estimated parameters. Hence before discussing our algorithms and analysis, we seek to understand the extent to which changes in the IC parameters can produce changes in the inﬂuence function, and in particular, check if the function is Lipschitz. For this, we use the closed-form interpretation of the IC function as an expectation of an indicator term over a randomly drawn subset of edges from the network (see [2]). More speciﬁcally, the IC cascade process can be seen as activating a subset of edges in the network; since each edge can be activated at most once, the active edges can be seen as having been chosen apriori using independent Bernoulli draws. Consider a random subgraph of active edges obtained by choosing each edge (u, v) E independently with probability wuv. For a given subset of such edges A E and seed set X V , let σu(A, X) be an indicator function that evaluates to 1 if u is reachable from a node in X via edges in A and 0 otherwise. Then the IC inﬂuence function can be written as an expectation of σ over random draw of the subgraph:

F w u (X) = X

(a,b) A wab Y

(a,b)/ A (1 wab) σu(A, X). (2)

While the above deﬁnition involves an exponential number of terms, it can be veriﬁed that the corresponding gradient is bounded, thus implying that the IC function is Lipschitz.3

Lemma 3. Fix X V . For any w, w Rr with w w 1 ϵ, F w u (X) F w u (X) ϵ.

This result tells us how small the parameter errors need to be to obtain accurate inﬂuence predictions and will be crucially used in our learnability results. Note that for the chain example in Section 2, this tells us that the errors need to be less than 1/n for meaningful inﬂuence predictions.

We are now ready to provide the PAC learning algorithm for the partial observation setting with sample S = {(X1, Y 1), . . . , (Xm, Y m)}; we shall sketch the proof here. The full observation case is outlined in Section 4.1, where we shall make use of the a different approach based on local estimation. Let F w denote the IC inﬂuence function with parameters w. The algorithm that we consider for partial observation resorts to a maximum likelihood (ML) estimation of the (global) IC function. Let χu(Y ) = 1(u Y ). Deﬁne the (global) log-likelihood for a cascade (X, Y ) as:

L(X, Y ; w) =

u=1 χu(Y ) ln F w u (X) + (1 χu(Y )) ln 1 F w u (X) ,

The prescribed algorithm then solves the following optimization problem, and outputs an IC inﬂuence function F w from the solution w obtained.

max w [λ,1 λ]r

i=1 L(Xi, Y i; w). (3)

3In practice, IC inﬂuence functions can be computed through suitable sampling approaches. Also, note that a function class can be PAC learnable even if the individual functions cannot be computed efﬁciently.

To provide learnability guarantees for the above ML based procedure, we construct a ﬁnite ϵ-cover over the space of IC inﬂuence functions, i.e. show that the class can be approximated to a factor of ϵ (in the inﬁnity norm sense) by a ﬁnite set of IC inﬂuence functions. We ﬁrst construct an ϵ-cover of size O((r/ϵ)r) over the space of parameters [λ, 1 λ]r, and use Lipschitzness to translate this to an ϵ-cover of same size over the IC class. Following this, standard uniform convergence arguments [20] can be used to derive a sample complexity guarantee on the expected likelihood with a logarithmic dependence on the cover size; this then implies the desired learnability result w.r.t. ℓsq: Lemma 4 (Sample complexity guarantee on the log-likelihood objective). Fix ϵ, δ (0, 1) and m = e O ϵ 2n3r . Let w be the parameters obtained from ML estimation. Then w.p. 1 δ,

sup w [λ,1 λ]r E 1

n L(X, Y ; w) E 1

n L(X, Y ; w) ϵ.

Compared to results for the LT model, the sample complexity in Theorem 2 has a square dependence on 1/ϵ. This is not surprising, as unlike the LT model, where the optimal 0-1 error is zero, the optimal squared error here is non-zero in general; in fact, there are standard sample complexity lower bound results that show that for similar settings, one cannot obtain a tighter bound in terms of 1/ϵ [20].

We wish to also note that the approach of Du et al. (2014) for learning inﬂuence under partial observation [13] uses the same interpretation of the IC inﬂuence function as in Eq. (2), but rather than learning the parameters of the model, they seek to learn the weights on the individual indicator functions. Since there are exponentially many indicator terms, they resort to constructing approximations to the inﬂuence function, for which a strong technical condition needs to be satisﬁed; this condition need not however hold in most settings. In contrast, our result applies to general settings.

4.1 Efﬁcient computation

Partial observation. The optimization problem in Eq. (3) that we need to solve for the partial observation case is non-convex in general. Of course, in practice, this can be solved approximately using gradient-based techniques, using sample-based gradient computations to deal with the exponential number of terms in the deﬁnition of F w in the objective (see Appendix B.5).

Full observation. On the other hand, when training sample S = {(X1, Y 1 1:n), . . . , (Xm, Y m 1:n)} contains fully observed cascades, we are able to show polynomial time learnability. For the LT model, we were assured of a set of parameters that would yield zero 0-1 error on the training sample, and hence the same procedure prescribed for partial information could be implemented under the full observation in polynomial time by reduction to local computations. This is not the case with the IC model, where we resort to the common approach of learning inﬂuence by estimating the model parameters through a local maximum likelihood (ML) estimation technique. This method is similar to the maximum likelihood procedure used in [9] for solving a different problem of recovering the structure of an unknown network from cascades. For the purpose of showing learnability, we ﬁnd it sufﬁcient to apply this procedure to only the ﬁrst time step of the cascade.

Our analysis ﬁrst provides guarantees on the estimated parameters, and uses the Lipschitz property in Lemma 3 to translate them to guarantees on the inﬂuence function. Since we now wish to give guarantees in the parameter space, we will require that there exists unique set of parameters that explains the IC cascade process; for this, we will need stricter assumptions. We assume that all edges have a minimum inﬂuence strength, and that even when all neighbors of a node u are inﬂuenced in a time step, there is a small probability of u not being inﬂuenced in the next step; we consider a speciﬁc seed distribution, where each node has a non-zero probability of (not) being a seed node. Assumption 1. Let w denote the parameters of the underlying IC model. Then there exists λ γ (0, 0.5) such that w uv λ for all (u, v) E and Q v N(u)(1 wuv) γ for all u V . Also, each node in V is chosen independently in the initial seed set with probability κ (0, 1).

We ﬁrst deﬁne the local log-likelihood for given seed set X and nodes Y1 inﬂuenced at t = 1:

L(X, Y1; β) = X

χu(Y1) ln 1 exp X

v N(u) X βuv

(1 χu(Y1)) X

v N(u) X βuv

where we have used log-transformed parameters βuv = ln(1 wuv), so that the objective is concave in β. The prescribed algorithm then solves the following maximization problem over all

parameters that satisfy Assumption 1 and constructs an IC inﬂuence function from the parameters.

i=1 L(Xi, Y i 1 ; β) s.t. (u, v) E, βuv ln 1 1 λ

v N(u) βuv ln 1

This problem breaks down into smaller convex problems and can be solved efﬁciently (see [9]). Proposition 5 (PAC learnability under IC model with full observation). Under full observation and Assumption 1, the class of IC inﬂuence functions is PAC learnable in polynomial time through local ML estimation. The corresponding sample complexity is e O nr3(κ2(1 κ)4λ2γ2ϵ2) 1 .

The proof is provided in Appendix B.6 and proceeds through the following steps: (1) we use covering number arguments to show that the local log-likelihood for the estimated parameters is close to the optimal value; (2) we then show that under Assumption 1, the expected log-likelihood is strongly concave, which gives us that closeness to the true model parameters in terms of the likelihood also implies closeness to the true parameters in the parameter space; (3) we ﬁnally use the Lipschitz property in Lemma 3 to translate this to guarantees on the global inﬂuence function.

Note that the sample complexity here has a worse dependence on the number of edges r compared to the partial observation case; this is due to the two-step approach of requiring guarantees on the individual parameters, and then transferring them to the inﬂuence function. The better dependence on the number of nodes n is a consequence of estimating parameters locally. It would be interesting to see if tighter results can be obtained by using inﬂuence information from all time steps, and making different assumptions on the model parameters (e.g. correlation decay assumption in [9]).

5 The Voter model

Before closing, we sketch of our learnability results for the Voter model, where unlike previous models the graph is undirected (with self-loops). Here we shall be interested in learning inﬂuence for a ﬁxed number of K time steps as the cascades can be longer than n. With the squared loss again as the loss function, this problem almost immediately reduces to linear least squares regression.

Let W [0, 1]n n be a matrix of normalized edge weights with Wuv = wuv/ P

v N(u) {u} wuv if (u, v) E and 0 otherwise. Note that W can be seen as a one-step probability transition matrix. Then for an initial seed set Z V , the probability of a node u being inﬂuenced under this model after one time step can be veriﬁed to be 1 u W1X, where 1X {0, 1}n is a column vector containing 1 in entries corresponding to nodes in X, and 0 everywhere else. Similarly, for calculating the probability of a node u being inﬂuenced after K time steps, one can use the K-step transition matrix: Fu(X) = 1 u (WK)1X. Now setting b = (WK) 1u, we have Fu(X) = b 1X which is essentially a linear function parametrized by n weights.

Thus learning inﬂuence in the Voter model (for ﬁxed cascade length) can be posed as n independent linear regression (one per node) with n coefﬁcients each. This can be solved in polynomial time even with partially observed data. We then have the following from standard results [20]. Theorem 6 (PAC learnability under Voter model). The class of inﬂuence functions under the Voter model is PAC learnable w.r.t. ℓsq in polynomial time and the sample complexity is e O ϵ 2n .

6 Conclusion

We have established PAC learnability of some of the most celebrated models of inﬂuence in social networks. Our results point towards interesting connections between learning theory and the literature on inﬂuence in networks. Beyond the practical implications of the ability to learn inﬂuence functions from cascades, the fact that the main models of inﬂuence are PAC learnable, serves as further evidence of their potent modeling capabilities. It would be interesting to see if our results extend to generalizations of the LT and IC models, and to investigate sample complexity lower bounds.

Acknowledgements. Part of this work was carried out while HN was visiting Harvard as a part of a student visit under the Indo-US Joint Center for Advanced Research in Machine Learning, Game Theory & Optimization supported by the Indo-US Science & Technology Forum. HN thanks Kevin Murphy, Shivani Agarwal and Harish G. Ramaswamy for helpful discussions. YS and DP were supported by NSF grant CCF-1301976 and YS by CAREER CCF-1452961 and a Google Faculty Research Award.

[1] Pedro Domingos and Matthew Richardson. Mining the network value of customers. In KDD, 2001.

[2] David Kempe, Jon M. Kleinberg, and Eva Tardos. Maximizing the spread of inﬂuence through a social network. In KDD, 2003.

[3] Amit Goyal, Francesco Bonchi, and Laks VS Lakshmanan. Learning inﬂuence probabilities in social networks. In KDD, 2010.

[4] Manuel Gomez-Rodriguez, David Balduzzi, and Bernhard Sch olkopf. Uncovering the temporal dynamics of diffusion networks. In ICML, 2011.

[5] Nan Du, Le Song, Alexander J. Smola, and Ming Yuan. Learning networks of heterogeneous inﬂuence. In NIPS, 2012.

[6] Manuel Gomez-Rodriguez, Jure Leskovec, and Andreas Krause. Inferring networks of diffusion and inﬂuence. ACM Transactions on Knowledge Discovery from Data, 5(4):21, 2012.

[7] Nan Du, Le Song, Manuel Gomez-Rodriguez, and Hongyuan Zha. Scalable inﬂuence estimation in continuous-time diffusion networks. In NIPS, 2013.

[8] Abir De, Sourangshu Bhattacharya, Parantapa Bhattacharya, Niloy Ganguly, and Soumen Chakrabarti. Learning a linear inﬂuence model from transient opinion dynamics. In CIKM, 2014.

[9] Praneeth Netrapalli and Sujay Sanghavi. Learning the graph of epidemic cascades. In SIGMETRICS, 2012.

[10] Hadi Daneshmand, Manuel Gomez-Rodriguez, Le Song, and Bernhard Sch olkopf. Estimating diffusion network structures: Recovery conditions, sample complexity & soft-thresholding algorithm. In ICML, 2014.

[11] Jean Pouget-Abadie and Thibaut Horel. Inferring graphs from cascades: A sparse recovery framework. ICML, 2015.

[12] Bruno D. Abrahao, Flavio Chierichetti, Robert Kleinberg, and Alessandro Panconesi. Trace complexity of network inference. In KDD, 2013.

[13] Nan Du, Yingyu Liang, Maria-Florina Balcan, and Le Song. Inﬂuence function learning in information diffusion networks. In ICML, 2014.

[14] Leslie G. Valiant. A theory of the learnable. Commununications of the ACM, 27(11):1134 1142, 1984. [15] Elchanan Mossel and S ebastien Roch. On the submodularity of inﬂuence in social networks. In STOC, 2007.

[16] Eyal Even-Dar and Asaf Shapira. A note on maximizing the spread of inﬂuence in social networks. Information Processing Letters, 111(4):184 187, 2011.

[17] Maria-Florina Balcan and Nicholas J.A. Harvey. Learning submodular functions. In STOC, 2011.

[18] Vitaly Feldman and Pravesh Kothari. Learning coverage functions and private release of marginals. In COLT, 2014.

[19] Jean Honorio and Luis Ortiz. Learning the structure and parameters of large-population graphical games from behavioral data. Journal of Machine Learning Research, 16:1157 1210, 2015.

[20] Martin Anthony and Peter L. Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 1999.

[21] Peter L. Bartlett and Wolfgang Maass. Vapnik Chervonenkis dimension of neural nets. Handbook of Brain Theory and Neural Networks, pages 1188 1192, 1995.

[22] Tong Zhang. Statistical behaviour and consistency of classiﬁcation methods based on convex risk minimization. Annals of Mathematical Statistics, 32:56 134, 2004.