# Learning Influence Adoption in Heterogeneous Networks

Vincent Conitzer¹, Debmalya Panigrahi¹, Hanrui Zhang²

¹ Duke University   ² Carnegie Mellon University

conitzer@cs.duke.edu, debmalya@cs.duke.edu, hanruiz1@cs.cmu.edu

**Abstract.** We study the problem of learning influence adoption in networks. In this problem, a communicable entity (such as an infectious disease, a computer virus, or a social media meme) propagates through a network, and the goal is to learn the state of each individual node by sampling only a small number of nodes and observing/testing their states. We study this problem in heterogeneous networks, in which each individual node has a set of distinct features that determine how it is affected by the propagating entity. We give an efficient algorithm with nearly optimal sample complexity for two variants of this learning problem, corresponding to symptomatic and asymptomatic spread. In each case, the optimal sample complexity naturally generalizes both the complexity of learning how nodes are affected in isolation and the complexity of learning influence adoption in a homogeneous network.

## 1 Introduction

The spread of contagious entities such as biological infections, social media memes, or computer viruses in a population via network propagation is a central object of study in the field of network science. Consider, for example, the spread of an infectious disease. Quite often, a new disease originates by genetic mutation, and begins to propagate by first infecting a group of people in close contact with the original source of the disease. After the initial infections, the disease continues to propagate as carriers of the disease interact with others: each carrier exposes colleagues, family, and friends to the disease, who may contract the disease and become spreaders themselves.
While there is substantive research on how to control (or maximize, say, in viral advertising) such spread (Kempe, Kleinberg, and Tardos 2003; Chen, Wang, and Yang 2009; Tang, Xiao, and Shi 2014), and on how to infer network parameters from it (Liben-Nowell and Kleinberg 2007; Chierichetti, Liben-Nowell, and Kleinberg 2011; Du et al. 2012), much less is known about learning the state of individuals in the network based on the propagation process. In the latter problem, we observe the outcome of the propagation process for a sample of individuals in the network, and use these observations to train a learning algorithm that can then infer the outcome for an individual that we have not observed. For example, suppose a university is trying to identify individuals affected by an infectious disease in its student population. Due to limited resources, it might be infeasible to test every student, at least not frequently; instead, the university can test a sample of students and use additional information such as shared dorm residency, class registration, etc., to identify the students who are at risk of contracting the disease. We call this the influence adoption phenomenon in a network, and seek to study its learnability.

Recently, Conitzer, Panigrahi, and Zhang (2020) introduced this problem and studied it in the homogeneous model, where any individual who comes in contact with an infected individual gets infected as well. The state of an individual is then determined by the initial set of infected individuals and the propagation network itself (which may be random). In practice, however, individuals differ in their characteristics, and have varying levels of susceptibility to a spreading contagion.

*Copyright 2022, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.*
For instance, some diseases infect only people of particular age groups, some computer viruses infect only computers with specific system configurations, and some memes are meaningful only to people in particular professions. In many cases, if an individual is not infected, she is not a spreader either. Indeed, this heterogeneity significantly changes the learning task at hand. For instance, if a disease only infects adults, and the community we care about consists exclusively of children, then we only need to learn the fact that children are unaffected, rather than the status of any particular individual. Or, if a disease were to infect people of a particular blood group, the learning task becomes more challenging, since the spread of the disease is now controlled by two sets of information: the initial set of infected individuals and the individual features of members of the population, in addition to the propagation network itself (which may again be random). This leads to the following fundamental question: What is the impact of heterogeneity among individuals on the learnability of influence adoption in a network?

### 1.1 Our Contributions

**Modeling heterogeneity.** We model the community of interest as a network of heterogeneous nodes representing individuals and edges representing connections between them.[1] Each node has a set of features belonging to a common feature space X; these features determine whether the node is susceptible to the propagating entity. In addition to the features of individual nodes, the overall infection pattern also depends on the nodes that are originally exposed to the propagating entity: we call these nodes generators. For example, if a computer virus is propagating through a network, then the features that determine the susceptibility of individual computers include the nature of the OS, the strength of passwords, etc. While for a new virus we may not know exactly which vulnerabilities it targets, we do know that there is an underlying mapping from the space of relevant features to the susceptibility of a computer to this virus. In fact, this yields a hypothesis class of possible sets of vulnerable feature combinations that a virus may target, and makes it possible to reason about the entire infection pattern of the virus by inspecting only some computers. Similarly, we denote the set of all patterns of generators, i.e., the computers infected externally, by a hypothesis class Hg. (See Fig. 1 for a detailed instantiation of this example.)

Figure 1: An example computer network, where the two computers on the left were exposed to, and therefore have been inspected for, a new virus. Suppose the virus targets an unknown set of weaknesses, and suppose the presence of any one of these weaknesses allows the computer to be infected. Since the top left computer, with a weak OS and a weak password, is infected, the weaknesses targeted must include a weak OS, a weak password, or both (meaning that either would suffice for infection). On the other hand, the bottom left computer is not infected, which means the virus does not target a weak OS. So given the prior knowledge and the inspection results, we know for sure that the virus targets a weak password. Given this, we can infer the complete adoption pattern in the network: the top middle computer does not have a weak password, and therefore is invulnerable to the virus. The same is true for the bottom right computer. The bottom middle one, on the other hand, is vulnerable and in contact with the virus at the same time, so it must be infected. The top right computer is vulnerable, but the virus cannot reach it, since neither of its neighbors is infected. So the top right computer is not infected either.

We also distinguish between two common modes in which a communicable entity can spread in a network.
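The caption's reasoning can be reproduced mechanically: enumerate candidate sets of targeted weaknesses, simulate the spread each candidate would induce (assuming, as in the caption, that the entity passes only through susceptible nodes), and discard candidates that contradict the inspected computers. The following Python sketch does this on a hypothetical six-computer instance; the node names, feature sets, and contact edges are our own illustrative choices, not the exact figure.

```python
from itertools import product

# A small hypothetical instance in the spirit of Figure 1: six computers, each
# with a set of weaknesses; a virus targets an unknown subset of weaknesses,
# and a computer is susceptible iff it has at least one targeted weakness.
FEATURES = ["weak_os", "weak_password"]
features = {
    "top_left": {"weak_os", "weak_password"}, "bottom_left": {"weak_os"},
    "top_mid": set(), "bottom_mid": {"weak_password"},
    "top_right": {"weak_password"}, "bottom_right": set(),
}
edges = [("top_left", "top_mid"), ("top_left", "bottom_mid"),
         ("bottom_left", "bottom_mid"), ("bottom_mid", "bottom_right"),
         ("top_mid", "top_right")]
generators = {"top_left", "bottom_left"}  # the externally exposed computers

def neighbors(v):
    return {b for a, b in edges if a == v} | {a for a, b in edges if b == v}

def symptomatic_adoption(targeted):
    """Infected = reachable from a susceptible generator via susceptible nodes."""
    susceptible = {v for v, ws in features.items() if ws & targeted}
    infected, frontier = set(), [v for v in generators if v in susceptible]
    while frontier:
        v = frontier.pop()
        if v not in infected:
            infected.add(v)
            frontier.extend(w for w in neighbors(v) if w in susceptible)
    return infected

# Inspection results from the caption: top left infected, bottom left not.
observed = {"top_left": True, "bottom_left": False}

# Keep exactly the candidate targeted sets consistent with the observations.
consistent = []
for bits in product([False, True], repeat=len(FEATURES)):
    targeted = {f for f, b in zip(FEATURES, bits) if b}
    pattern = symptomatic_adoption(targeted)
    if all((v in pattern) == y for v, y in observed.items()):
        consistent.append(targeted)
print(consistent)  # only {'weak_password'} survives, as argued in the caption
```

On this instance, the surviving hypothesis also pins down the rest of the adoption pattern, mirroring the caption: the bottom middle computer is infected, while the top right computer is vulnerable but unreachable.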
The first, which we call symptomatic spread, refers to the setting where a node transmits the entity to a neighbor only if it received the entity and is susceptible to it (in the context of epidemiology, this roughly corresponds to an SI process spreading on a graph in which some nodes are resistant/removed). This model corresponds to the analysis in the caption of Figure 1. In the second model, which we call asymptomatic spread, a node can transmit the entity to a neighbor as long as it received the entity itself, irrespective of whether it is susceptible (which roughly corresponds to an SI process where some nodes are asymptomatic). In this model, for the example in Figure 1, we would conclude that the top right node is also infected (i.e., shows symptoms), as it would be exposed and also must be susceptible. (This would correspond to the case where the virus spreads through all computers on the network but is only able to affect, say, encrypt the hard drive of, some of them. In our language, only the computers with encrypted hard drives would be infected.) We denote the corresponding hypothesis class for the susceptibility of individual nodes by Hs.

[1] In network/graph mining, heterogeneous networks normally exhibit heterogeneity in both nodes and edges; our model is more similar to attributed networks, where edges are homogeneous.

**Optimal sample complexity bounds.** Our main results in this paper are asymptotically tight bounds on the sample complexity (i.e., the number of random nodes that need to be inspected) of learning influence adoption in heterogeneous networks, for both the symptomatic spread and asymptomatic spread models. In particular, we show the following bounds. In these bounds, ρ is the maximum width of the network, n is the number of nodes in it, and dg and ds are the VC dimensions of the generator and susceptibility hypothesis classes, respectively (we will formally define the sample complexity and the width later):
- The sample complexity of learning in the symptomatic spread model is O(ds log n + min(ρ, dg log n)).
- The sample complexity of learning in the asymptomatic spread model is O(ds + min(ρ, dg log n)).

To gain some intuition about these bounds, let us first compare them to the homogeneous setting studied by Conitzer, Panigrahi, and Zhang (2020). In that model, all nodes are susceptible, so that ds = 0, and any subset of them can be generators, so that dg = n. Thus, the above bounds reduce to the width ρ of the network, which is exactly the result in Conitzer, Panigrahi, and Zhang (2020). Next, suppose every individual in a network is externally infected, so that dg = 0. Then, the complexity of learning the state of each individual reduces to that of learning in the hypothesis class Hs, which demonstrates that the sample complexity bounds must depend on ds. Finally, suppose the network is an empty graph, so that ρ = n, and everybody is susceptible, so that ds = 0. In this case, an individual is infected if and only if that individual is externally exposed, which shows that the dependence on dg is also necessary.

While the discussion above establishes that the sample complexity bounds must depend on ds, dg, and ρ, we show in later sections that the specific bounds given above are also asymptotically tight. For this purpose, we construct a set of hard instances of the problem, on which effective learning using fewer samples than these bounds can be ruled out from an information-theoretic perspective.

**Random propagation.** The propagation of communicable entities, such as the spread of disease between an infected and a healthy person who come in contact, is often inherently a random event. In some cases, we may know who has come into contact with whom, but we generally still do not know for certain whether there was transmission, though we may have a better idea of the probability thereof.
In other cases, who came into contact with whom might itself only be known through a Bayesian process: it might not be known whether two people came in contact with one another, but the probability of their doing so can be inferred from their activities. We abstract away from both these cases by considering a random propagation network, and give a generic reduction from sample complexity bounds for learning in deterministic environments to learning in random environments. Using this reduction and our sample complexity bounds for deterministic propagation, we obtain similar bounds for stable random networks. The stability condition assumes that the target error rate and failure probability of the learner are above the resolution of the random network, a condition that is necessary even for homogeneous networks (see Conitzer, Panigrahi, and Zhang (2020)).

### 1.2 Further Related Work

**Influence propagation.** The phenomenon of influence propagation in a network was first studied by Kempe, Kleinberg, and Tardos (2003). Various models of influence propagation have been investigated (Gruhl et al. 2004; Chen, Wang, and Wang 2010; Chen et al. 2011; Myers, Zhu, and Leskovec 2012). Several aspects of influence propagation have been extensively studied, among which a topic particularly related to our work is learning from influence propagation, i.e., learning the structures/parameters of networks. There, the goal is to learn parameters of the network in which the propagation happens, given outcomes of the propagation procedure. As such, it can be regarded as an inverse problem of the one we study. Some representative results on the topic include (Liben-Nowell and Kleinberg 2007; Goyal, Bonchi, and Lakshmanan 2010; Chierichetti, Liben-Nowell, and Kleinberg 2011; Gomez Rodriguez, Balduzzi, and Schölkopf 2011; Saito et al. 2011; Du et al. 2012; Guille and Hacid 2012; Abrahao et al. 2013; Cheng et al. 2014; Daneshmand et al. 2014; Du et al. 2014; Narasimhan, Parkes, and Singer 2015; He et al. 2016; Kalimeris et al. 2018). A related and more recent line of work is on representation learning for influence propagation (Bourigault, Lamprier, and Gallinari 2016; Li et al. 2017; Wang et al. 2017). Since these works address an inverse problem, our results are not directly comparable to the above.

**Epidemic estimation.** There is a massive body of research on epidemic estimation, where the main focus is on macroscopic prediction tasks, e.g., whether a meme will go viral, or the number of infections over time (Myers, Zhu, and Leskovec 2012; Weng, Menczer, and Ahn 2013). In contrast, our results aim to recover the infection status of every single node, as opposed to aggregate statistics.

**Learning theory.** Valiant (1984) introduced the probably approximately correct (PAC) framework for passive learning. Vapnik-Chervonenkis (VC) theory (see Vapnik (2013)) nicely characterizes the learnability of different concepts via the VC dimension, following which various measures of complexity have been considered (Alon et al. 1997; Bartlett and Mendelson 2002; Pollard 2012; Daniely et al. 2015), and tighter generalization bounds have also been developed (Talagrand 1994; Hanneke 2016). While these general results are powerful, as discussed in later sections, they cannot be directly applied to the specific problem we study.

## 2 Preliminaries

### 2.1 Useful Tools from Learning Theory

**Definition 1 (VC Dimension).** A set S is shattered by a family of sets F if for any T ⊆ S, there exists U ∈ F such that S ∩ U = T. The VC dimension of a hypothesis class H over a space X, denoted d(H), is the cardinality of the largest set S ⊆ X shattered by H.

The sample complexity of PAC learning is given by the following theorem:

**Theorem 2 (VC Theorem, realizable case).** Fix X and H. Consider any distribution D over X, f ∈ H, δ > 0, and ε > 0. Now, given m = O((d(H) log(1/ε) + log(1/δ))/ε) i.i.d.
samples from D, the following holds with probability at least 1 − δ: any hypothesis h ∈ H that is consistent with all m samples (i.e., h(xi) = yi for all i ∈ [m]) satisfies Pr_{x∼D}[h(x) ≠ f(x)] ≤ ε. Moreover, this bound is tight in the sense that any algorithm achieving this guarantee requires Ω((d(H) + log(1/δ))/ε) samples.

### 2.2 Learning Influence Propagation in Networks

**Definition 3 (Implicit Hypothesis Class).** Given a directed network G = (V, E), the implicit hypothesis class associated with G, H(G), is defined to be the family of all subsets S of V such that for any two vertices u, v ∈ V, if u ∈ S and u can reach v in G (i.e., u ⇝_G v), then v ∈ S. In words, the implicit hypothesis class is the family of all sets that are closed under reachability in G.

**Definition 4 (Width of Directed Networks).** The width ρ(G) of a directed network G = (V, E) is the size of a maximum set S ⊆ V of vertices such that for any u, v ∈ S where u ≠ v, there is no u-to-v path in G.

**Lemma 5 (VC Dimension of the Implicit Hypothesis Class (Conitzer, Panigrahi, and Zhang 2020)).** Given a directed network G, the VC dimension of the implicit hypothesis class H(G) associated with G is d(H(G)) = ρ(G).

## 3 Learning under Symptomatic Spread

**The propagation model.** In a directed network G = (V, E), each node u ∈ V has features x(u) ∈ X given by a feature mapping x : V → X. Given a generator concept fg : X → {0, 1} and a susceptibility concept fs : X → {0, 1}, the final adoption pattern fa : V → {0, 1} induced by fg and fs is given by: fa(v) = 1 iff there exists u ∈ V such that fg(x(u)) = 1 and u can reach v in the sub-network G(x, fs) induced by V(x, fs) = {w ∈ V | fs(x(w)) = 1}, i.e., u ⇝_{G(x,fs)} v. In the rest of the paper, we will abuse notation and let fg(u) = fg(x(u)), fs(u) = fs(x(u)), etc. In words, a node u is initially in contact with the influence iff it is assigned label fg(u) = 1 by the generator concept fg.
Moreover, if a node u has label fs(u) = 1, then it is fully susceptible to the influence, meaning that if in contact with the contagious entity, u will adopt and subsequently spread the influence. Otherwise (i.e., when fs(u) = 0), u is fully immune to the influence, meaning that u does not interact with the contagious entity in any way.

Given the propagation model, we are ready to define the learning problem. Roughly speaking, the learning problem asks us to design an algorithm which, given access to a number of labeled sample nodes (where the label is whether the node has adopted the influence), predicts the label of a random node correctly with probability at least 1 − ε. Moreover, the algorithm is allowed to fail with probability δ, in which case it does not need to satisfy any requirements. Below we define the problem formally.

**The learning problem.** Fix a network G, a feature mapping x, a generator hypothesis class Hg, and a susceptibility hypothesis class Hs. The learning problem asks us to design an algorithm which, for any generator concept fg ∈ Hg, susceptibility concept fs ∈ Hs, distribution D over V, ε > 0, and δ > 0, with m = m(ε, δ) labeled (i.e., adoption vs. no adoption) samples from D, outputs a hypothesis h : V → {0, 1} satisfying: with probability at least 1 − δ (over the samples and the algorithm's internal randomness), Pr_{v∼D}[h(v) ≠ fa(v)] ≤ ε, where fa is the adoption pattern induced by fg and fs.

In the above definition, we assume that the algorithm has full access to the generator and susceptibility hypothesis classes. However, being families of subsets of the feature space X, these hypothesis classes can be quite large or even infinite, and in fact, they may not even admit a succinct representation. To this end, throughout the paper, we assume oracle access to Hg and Hs for empirical risk minimization.
More specifically, we assume the following can be done in polynomial time: for H ∈ {Hg, Hs}, given k labeled data points {(xi, yi)} for i ∈ [k], one can find a concept h′ in H that best fits the labels, i.e., h′ ∈ argmin_{h ∈ H} Σ_{i ∈ [k]} |h(xi) − yi|.

In homogeneous networks, as shown by Conitzer, Panigrahi, and Zhang (2020), the key parameter that controls the sample complexity of learning influence adoption is the width. However, in heterogeneous networks, the width by itself is no longer an effective measure of the complexity of the network. For example, the notion of width does not capture the phenomenon that some members of the network may not be susceptible to the influence, making the network less well-connected than it appears to be. To this end, we instead use the maximum width, a generalization of the width to heterogeneous networks, to measure the complexity of a network.

**Definition 6 (Maximum Width of Directed Networks).** For a network G, a feature mapping x, and a susceptibility hypothesis class Hs, the maximum width is defined to be ρ(G, x, Hs) = max_{fs ∈ Hs} ρ(G(x, fs)).

We remark that the maximum width ρ(G, x, Hs) can be either larger or smaller than ρ(G). For example, suppose we know there is precisely one insusceptible node (but not which one). Then, when G has no edges, ρ(G, x, Hs) = ρ(G) − 1, and when G is a chain, ρ(G, x, Hs) = ρ(G) + 1.

Now we are ready to state our main result for the symptomatic spread model. The following theorem gives a sample complexity upper bound for the learning problem with deterministic networks; we will generalize this to random networks in Section 5.

**Theorem 7.** In the symptomatic spread model, for any G = (V, E), x, Hg, and Hs, let Ha be the class of all possible adoption patterns induced by generator concepts in Hg and susceptibility concepts in Hs. Let n = |V|, ρ = ρ(G, x, Hs), dg = d(Hg), and ds = d(Hs). Then we have d(Ha) = O(ds log n + min(ρ, dg log n)).
As a result, there is an algorithm that learns the adoption pattern with any error rate ε > 0 and failure probability δ > 0 using O(((ds log n + min(ρ, dg log n)) log(1/ε) + log(1/δ))/ε) samples. Moreover, the algorithm runs in polynomial time when ds is a constant.

The proof of the theorem, as well as all other proofs, is deferred to the appendices. Here we provide some intuition for the essence of the theorem: the upper bound on the VC dimension of the adoption hypothesis class Ha. Since each member of Ha is generated by a member of Hg and a member of Hs, if we can show that the numbers of effectively different members of Hg and Hs (i.e., members that can induce different adoption patterns) are both reasonably small, then the cardinality of Ha, upper bounded by the product of the two numbers, must be small too. This would imply an upper bound on the VC dimension, since in order to shatter a set of size d, Ha must have at least 2^d members. Given the above observations, a weaker version of the bound can be derived immediately from the Sauer-Shelah lemma, stated below.

**Lemma 8 (Sauer-Shelah Lemma (rephrased)).** Let F be a family of sets with VC dimension d(F) = d. Then for any set S with cardinality |S| = n, |F|_S| ≤ (2en/d)^d, where F|_S = {T ∩ S | T ∈ F}.

Applying the above lemma to Hg and Hs restricted to the nodes V, we have |Hg|_V| ≤ (2en/dg)^{dg} and |Hs|_V| ≤ (2en/ds)^{ds}. So |Ha| ≤ |Hg|_V| · |Hs|_V| = O(n^{dg+ds}), which implies d(Ha) = O(log |Ha|) = O((dg + ds) log n).

In order to obtain the full bound in Theorem 7, we also need to show that d(Ha) = O(ds log n + ρ), which requires more effort. To prove this bound, we consider the susceptibility hypothesis class Hs, and the implicit hypothesis class associated with the sub-network induced by susceptible nodes, i.e., H(G(x, fs)), for each fs ∈ Hs. A counting argument similar to the one above then gives d(Ha) = O((ds + ρ) log n). However, it turns out one can do better via a refined argument, formalized in the following key lemma.
**Lemma 9.** For any integer k > 0 and k families of sets F1, ..., Fk, if d(Fi) ≤ d for every i ∈ [k], then d(∪_{i ∈ [k]} Fi) = O(d + log k).

Applying the lemma to H(G(x, fs)) for all effectively different fs ∈ Hs then gives d(Ha) = O(ds log n + ρ), thereby finishing the proof of Theorem 7.

One may suspect that with a further improved analysis, one could remove the log n factor from all terms in the upper bound. Next we show that this logarithmic dependence on n is unavoidable, and in fact, all terms in our upper bound are tight up to constant factors, even when restricted to undirected networks. See Appendix A for proof sketches.

**Theorem 10.** In the symptomatic spread model, there exist families of hard instances of the learning problem on which achieving constant target error rate ε and failure probability δ requires (1) Ω(ds log n) samples, when dg and ρ are constant; (2) Ω(dg log n) samples, when ds is constant; or (3) Ω(ρ) samples, when ds is constant. In other words, all terms in Theorem 7 are necessary. Moreover, all hard instances only involve undirected networks.

Finally, we show that when the learning algorithm cannot access the feature mapping, no nontrivial bound is possible.

**Theorem 11.** In the symptomatic spread model, when the feature mapping x is unknown (but the features of labeled sample nodes and of nodes to make predictions for are known), there is a family of instances where dg, ds, and ρ are all constant, but the number of samples required to achieve constant error rate ε and failure probability δ is Ω(n).

## 4 Learning under Asymptomatic Spread

**The propagation model.** In a directed network G = (V, E), each node v ∈ V has features x(v) ∈ X given by a feature mapping x : V → X. Given a generator concept fg : X → {0, 1} and a susceptibility concept fs : X → {0, 1}, the final adoption pattern fa : V → {0, 1} induced by fg and fs is given by: fa(v) = 1 iff fs(x(v)) = 1 and there exists u ∈ V such that fg(x(u)) = 1 and u can reach v in G.
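To make the contrast with the symptomatic model of Section 3 concrete, the following sketch computes both adoption patterns on a hypothetical four-node directed path; the instance and the dictionary encoding of fg and fs are our own illustrative choices.

```python
# A minimal sketch contrasting the two spread models on a hypothetical
# directed path a -> b -> c -> d; fg and fs are encoded as plain dicts.
edges = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
fg = {"a": 1, "b": 0, "c": 0, "d": 0}   # only a is a generator
fs = {"a": 1, "b": 0, "c": 1, "d": 1}   # b is not susceptible

def reach(start, allowed):
    """Nodes reachable from `start` using only nodes in `allowed`."""
    seen, stack = set(), [v for v in start if v in allowed]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        stack.extend(w for w in edges[v] if w in allowed)
    return seen

def symptomatic(fg, fs):
    # spread passes only through susceptible nodes
    susceptible = {v for v in edges if fs[v]}
    return reach({v for v in edges if fg[v]}, susceptible)

def asymptomatic(fg, fs):
    # every exposed node relays; only susceptible nodes adopt
    exposed = reach({v for v in edges if fg[v]}, set(edges))
    return {v for v in exposed if fs[v]}

print(symptomatic(fg, fs))   # {'a'}: the insusceptible node b blocks the path
print(asymptomatic(fg, fs))  # b relays without adopting, so c and d adopt too
```

The insusceptible node b illustrates the only difference between the two models: under symptomatic spread it blocks the path, while under asymptomatic spread it transmits without ever adopting.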
Again, we will abuse notation and let fg(u) = fg(x(u)), fs(u) = fs(x(u)), etc. The asymptomatic spread model is similar to the symptomatic spread model discussed in Section 3, except that here, a node that is not susceptible cannot adopt, but can still spread, the influence. Given the propagation model, we can define the learning problem in a similar way.

**The learning problem.** Fix a network G, a feature mapping x, a generator hypothesis class Hg, and a susceptibility hypothesis class Hs. The learning problem asks us to design an algorithm which, for any generator concept fg ∈ Hg, susceptibility concept fs ∈ Hs, distribution D over V, ε > 0, and δ > 0, with m = m(ε, δ) labeled (i.e., adoption vs. no adoption) samples from D, outputs a hypothesis h : V → {0, 1} satisfying: with probability at least 1 − δ (over the samples and the algorithm's internal randomness), Pr_{v∼D}[h(v) ≠ fa(v)] ≤ ε, where fa is the adoption pattern induced by fg and fs.

The following theorem establishes our sample complexity upper bound for the asymptomatic spread model. The proof techniques are similar to those used to establish Theorem 7.

**Theorem 12.** In the asymptomatic spread model, for any G = (V, E), x, Hg, and Hs, let Ha be the class of all possible adoption patterns induced by generator concepts in Hg and susceptibility concepts in Hs. Let n = |V|, ρ = ρ(G), dg = d(Hg), and ds = d(Hs). Then we have d(Ha) = O(ds + min(ρ, dg log n)). As a result, there is an algorithm that learns the adoption pattern with any error rate ε > 0 and failure probability δ > 0 using O(((ds + min(ρ, dg log n)) log(1/ε) + log(1/δ))/ε) samples. Moreover, the algorithm runs in polynomial time when dg and ds are constant.

The following theorem shows that all terms in our sample complexity upper bound are necessary. The hard instances are similar to those in the symptomatic spread model.

**Theorem 13.**
In the asymptomatic spread model, there exist families of hard instances of the learning problem on which achieving constant target error rate ε and failure probability δ requires (1) Ω(ds) samples, when dg and ρ are constant; (2) Ω(dg log n) samples, when ds is constant; or (3) Ω(ρ) samples, when ds is constant. In other words, all terms in Theorem 12 are necessary. All hard instances only involve undirected networks.

We remark that the above bound has a weaker dependence on n than the ones in Theorems 7 and 10. This suggests that with the same amount of prior knowledge, it is easier to learn under asymptomatic spread than under symptomatic spread.

## 5 Random Networks

In this section, we discuss how our sample complexity bounds can be generalized to random networks. We first present a general reduction for PAC learning with random hypothesis classes, and then show how the reduction can be applied to the problem of learning influence adoption, which yields sample complexity bounds for random networks.

**The random hypothesis class model.** Let S be the set of possible states of the world. Each s ∈ S corresponds to a random concept fs,r over a feature space X, given by a random mapping r ∈ R (in other words, fs,r = r(s) is a random subset of X). Moreover, r also maps S to a random hypothesis class Hr = {fs,r | s ∈ S}.

**Learning with random hypothesis classes.** Fix a state space S, a feature space X, and a distribution R over mappings from S to 2^X. The learning problem asks us to design an algorithm which, for any state s ∈ S, distribution D over X, ε > 0, and δ > 0, when r is an unobservable random mapping drawn from R, with m = m(ε, δ) labeled samples from D (i.e., {(xi, yi)} for i ∈ [m], where yi = fs,r(xi) for each i), outputs a hypothesis h : X → {0, 1} satisfying: with probability at least 1 − δ (over the randomness in r, the labeled samples, and the learning algorithm), Pr_{x∼D}[h(x) ≠ fs,r(x)] ≤ ε. We say such an algorithm (ε, δ)-learns the random hypothesis class Hr.
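As a hypothetical instantiation of this abstraction (anticipating the network application below): take the state s to be a seed node, the mapping r to be one draw of a live-edge graph, and fs,r to label a node 1 iff it is reachable from s in the realized graph. A minimal sketch, with an arbitrary path graph, edge probability, and RNG seed:

```python
import random

# Hypothetical instantiation: the state s is a seed node, the random mapping r
# is a draw of a live-edge graph, and fs,r labels a node 1 iff it is reachable
# from s in the realized graph. The path graph, edge probability p, and seed
# are arbitrary illustrative choices.
n, p = 8, 0.7
base_edges = [(i, i + 1) for i in range(n - 1)]  # directed path 0 -> 1 -> ... -> 7

def draw_mapping(rng):
    """One draw of r: realize each edge independently, return s -> fs,r."""
    adj = {v: [] for v in range(n)}
    for u, v in base_edges:
        if rng.random() < p:
            adj[u].append(v)
    def f(s):
        seen, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(adj[v])
        return seen  # fs,r, represented as the subset of nodes labeled 1
    return f

r = draw_mapping(random.Random(0))              # one unobservable environment draw
labels = {v: int(v in r(2)) for v in range(n)}  # sample labels under state s = 2
print(labels)
```

The learner observes only labeled samples such as `labels`, never the realized mapping r itself, which is what makes stability (defined next in the text) necessary.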
Below we generalize the notion of stable networks of Conitzer, Panigrahi, and Zhang (2020) to PAC learning with random hypothesis classes, and show that stability enables efficient learning in the more general problem that we study.

**Definition 14 ((ε0, δ0)-Stability).** Fix a state space S, a feature space X, and a distribution R over mappings from S to 2^X. Let r and r′ be two independent mappings drawn from R. We say (S, X, R) is (ε0, δ0)-stable with respect to a distribution D over X if for any s ∈ S, the following is true: with probability at least 1 − δ0 over r, Pr_{r′∼R, x∼D}[fs,r(x) ≠ fs,r′(x)] ≤ ε0.

**Theorem 15.** Fix a state space S, a feature space X, and a distribution R over mappings from S to 2^X. Suppose (S, X, R) is (ε0, δ0)-stable with respect to some distribution D over X. Then there is an algorithm that, for any ε > C1 · ε0 and δ > C2 · δ0, (ε, δ)-learns the random hypothesis class Hr using m = O((d · log(1/ε) + log(1/δ))/ε) samples, where d = E_r[d(Hr)] is the expected VC dimension of the random hypothesis class Hr.

We note that the stability condition is necessary: Conitzer, Panigrahi, and Zhang (2020) show (in their Proposition 4.1) that it is statistically hard to learn anything nontrivial even in their simpler model without any assumptions on the stability of the network. Intuitively, this is because the learning procedure can be viewed as inferring the seed set given the outcome of the propagation, and if the outcome and the seed set are not correlated well enough, then information-theoretically there is no hope for inference.

Below we define the problem of learning influence adoption in random networks, and apply Theorem 15 to obtain sample complexity bounds. We start with the case of symptomatic spread; the case of asymptomatic spread is similar. Let 𝒢 be a distribution over directed networks defined over vertices V (also known as a random live-edge graph in the context of influence propagation). Each node v has features x(v) given by a feature mapping x.
Given a generator concept fg and a susceptibility concept fs, the final random adoption pattern fa induced by fg and fs is given by the following procedure: first, draw a random network G ∼ 𝒢. Then, fa(v) = 1 iff there exists u ∈ V such that fg(x(u)) = 1 and u can reach v in the sub-network G(x, fs) induced by V(x, fs) = {w ∈ V | fs(x(w)) = 1}, i.e., u ⇝_{G(x,fs)} v. That is, fa is the adoption pattern induced by fg and fs in a random network G ∼ 𝒢.

Observe that the random network G induces a random mapping rG from Hg × Hs to 2^X, where rG(fg, fs) is the adoption pattern induced by fg and fs in G. We say this is the implicit mapping induced by the distribution 𝒢, and let R(𝒢) denote the distribution of this mapping. Moreover, let H𝒢 = rG(Hg × Hs) be the class of possible adoption patterns induced by Hg and Hs in G.

Fix a distribution over networks 𝒢, a feature mapping x, a generator hypothesis class Hg, and a susceptibility hypothesis class Hs. The learning problem asks us to design an algorithm which, for any generator concept fg ∈ Hg, susceptibility concept fs ∈ Hs, distribution D over V, ε > 0, and δ > 0, with m = m(ε, δ) labeled samples from D, outputs a hypothesis h : V → {0, 1} satisfying: with probability at least 1 − δ, Pr_{v∼D}[h(v) ≠ fa(v)] ≤ ε, where fa is the random adoption pattern induced by fg and fs in an unobservable random network G ∼ 𝒢.

**Corollary 16.** In the symptomatic spread model with random networks, fix 𝒢, X, Hg, Hs, and x. Suppose that for a distribution D over X, (Hg × Hs, X, R(𝒢)) is (ε0, δ0)-stable with respect to D. Then the random hypothesis class H𝒢 is (ε, δ)-learnable for any ε > C1 · ε0 and δ > C2 · δ0 using O(((ds log n + min(ρ, dg log n)) log(1/ε) + log(1/δ))/ε) samples, where ρ = E_{G∼𝒢}[ρ(G, x, Hs)].

**Corollary 17.** In the asymptomatic spread model with random networks, fix 𝒢, X, Hg, Hs, and x. Suppose that for a distribution D over X, (Hg × Hs, X, R(𝒢)) is (ε0, δ0)-stable with respect to D.
Then the random hypothesis class H_𝒢 is (ε, δ)-learnable for any ε > C1 · ε0, δ > C2 · δ0 using O(((d_s + min(ρ, d_g log n)) · log(1/ε) + log(1/δ))/ε) samples, where ρ = E_{G∼𝒢}[ρ(G)].

6 Experimental Evaluation

In this section, we instantiate our algorithms in a concrete setup with random networks. Arguably the simplest class of random networks is Erdős-Rényi random graphs. However, such networks are unsuitable for illustrating the efficacy of our approach: since all nodes are symmetric, these networks (before realization) carry almost no structural information that can be utilized in learning. To make the learning problem as nontrivial as possible, we instead consider the following setup. Fixing the number of nodes n, each edge (u, v) realizes independently with probability p_uv, where for any u, v ∈ [n], p_uv = 1/|u − v| if |u − v| ≤ C, and p_uv = 0 otherwise. Intuitively, this models the case where all nodes are located on a line, and nodes closer to each other interact more frequently. This is a one-dimensional instance of the small-world model of Kleinberg (2000), except that we truncate the probabilities at distance C to make the learning problem harder. In our experiments, we choose C = 30.

[Figure 2: Error rates of our algorithm and the feature-oblivious baseline (Conitzer, Panigrahi, and Zhang 2020) on instances with different parameters (sample size m on the horizontal axis; curves for k = 5 and k = 10 in each panel). Each point is the average of 40 independent runs, and each run takes less than a minute on a laptop.]
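The truncated small-world model above is straightforward to sample from. A minimal sketch (the function name is ours); note that since p_uv = 1/1 = 1 at distance 1, all adjacent pairs are always connected:

```python
import random

def sample_line_network(n, C=30, rng=None):
    """Sample a directed network on nodes 0..n-1 from the model of
    Section 6: edge (u, v) realizes independently with probability
    1/|u - v| when 0 < |u - v| <= C, and never otherwise.
    """
    rng = rng or random.Random()
    edges = []
    for u in range(n):
        # Only nodes within distance C of u can receive an edge from u.
        for v in range(max(0, u - C), min(n, u + C + 1)):
            if v != u and rng.random() < 1.0 / abs(u - v):
                edges.append((u, v))
    return edges
```

Restricting the inner loop to the window [u − C, u + C] keeps sampling at O(nC) rather than O(n²), which matters for the larger values of n in the experiments.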
For the generator and susceptibility hypothesis classes, we consider the setting discussed in our introductory example (Figure 1): there are k potential weaknesses, and each susceptibility hypothesis (corresponding to a virus) is induced by a subset of the k weaknesses, namely those the virus targets. Recall that a node is susceptible as long as it has some weakness that the virus targets. For simplicity, we assume the generator hypothesis class consists of singletons of nodes, meaning that exactly one (unknown) computer has been initially exposed to the virus. In our experiments, we choose the ground-truth generator node uniformly at random, and the ground-truth susceptibility hypothesis by adding each weakness to the targeted set independently with probability 3/k (so in expectation the virus targets 3 weaknesses). Then we generate the set of weaknesses of each node by adding each weakness independently with probability 0.2. All these numbers are chosen specifically to make the instance rich and the learning problem nontrivial.

Our results are shown in Figure 2. First note that the hypothesis classes we choose satisfy d(H_g) = 1 and d(H_s) = k. So applying Corollary 16, up to the resolution ε0, given m labeled samples, the error rate of our learner is O((log n + k)/m). This roughly aligns with our experimental observations: as m grows, the error rate drops roughly as the bound indicates, and tends to plateau as it approaches the stability parameter ε0 of the network, which appears to be quite small in this setup. For all 4 values of n, the curves are very similar, which reflects the logarithmic dependence on n. The dependence on k is more significant, as the theoretical results suggest.
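The ground-truth generation described above can be sketched directly; the function and variable names below are ours. A node is susceptible iff its weakness set intersects the virus's targeted set:

```python
import random

def make_instance(n, k, rng=None):
    """Generate a ground-truth instance as in Section 6: a uniformly
    random generator node, a virus targeting each of the k weaknesses
    independently with probability 3/k, and each node possessing each
    weakness independently with probability 0.2.
    """
    rng = rng or random.Random()
    seed_node = rng.randrange(n)
    targeted = {j for j in range(k) if rng.random() < 3.0 / k}
    weaknesses = {v: {j for j in range(k) if rng.random() < 0.2}
                  for v in range(n)}
    # Susceptible iff the node has at least one targeted weakness.
    f_s = lambda v: bool(weaknesses[v] & targeted)
    # Generator hypothesis class is singletons: exactly one seed node.
    f_g = lambda v: v == seed_node
    return seed_node, targeted, weaknesses, f_g, f_s
```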
For comparison, we also include the error rates of the feature-oblivious baseline algorithm of Conitzer, Panigrahi, and Zhang (2020), which ignores the sets of weaknesses and simply assumes all nodes are susceptible (there are two baseline curves because the networks are generated differently for different values of k). As the figure shows, the baseline algorithm fails to exploit the heterogeneity of the network, and suffers significantly higher error rates.

7 Conclusion

In this paper, we studied the problem of learning influence adoption in heterogeneous networks. There are several interesting follow-up directions from this study. First, there are situations, particularly in the context of infectious diseases, where the state of being infected or healthy is transient. In this case, can the state of individual nodes be learned over time, by periodically sampling nodes for their current state? It is also important to study the practical implications of this learning problem, since it has immediate applications to fields like epidemiology. For instance, do the sample complexity bounds shown in this paper lead to practical algorithms that can be used in field studies? Finally, while we assumed that the nodes we inspect are a random sample from an underlying distribution (in the spirit of PAC learning), a different direction is to consider an active learning approach in which the algorithm can choose the nodes it wants to inspect. It is plausible that this additional selectivity can significantly reduce the size of the training set in some situations. In epidemiological applications, for example, we often see a combination of the two approaches: inspect a random sample, but also inspect selected nodes, such as those at high risk of contracting the disease because they neighbor infected nodes from the random sample. Such hybrid models of practical significance constitute an interesting direction for future research.
Acknowledgments

Vincent Conitzer and Hanrui Zhang are supported by NSF award IIS-1814056. Debmalya Panigrahi is supported by NSF CAREER award CCF-1750140, NSF award CCF-1955703, and ARO award W911NF-21-1-0230. The authors thank anonymous reviewers for their helpful feedback.

References

Abrahao, B.; Chierichetti, F.; Kleinberg, R.; and Panconesi, A. 2013. Trace complexity of network inference. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 491–499. ACM.
Alon, N.; Ben-David, S.; Cesa-Bianchi, N.; and Haussler, D. 1997. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM (JACM), 44(4): 615–631.
Bartlett, P. L.; and Mendelson, S. 2002. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov): 463–482.
Bourigault, S.; Lamprier, S.; and Gallinari, P. 2016. Representation learning for information diffusion through social networks: An embedded cascade model. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, 573–582. ACM.
Chen, W.; Collins, A.; Cummings, R.; Ke, T.; Liu, Z.; Rincon, D.; Sun, X.; Wang, Y.; Wei, W.; and Yuan, Y. 2011. Influence maximization in social networks when negative opinions may emerge and propagate. In Proceedings of the 2011 SIAM International Conference on Data Mining, 379–390. SIAM.
Chen, W.; Wang, C.; and Wang, Y. 2010. Scalable influence maximization for prevalent viral marketing in large-scale social networks. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1029–1038. ACM.
Chen, W.; Wang, Y.; and Yang, S. 2009. Efficient influence maximization in social networks. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 199–208. ACM.
Cheng, J.; Adamic, L.; Dow, P. A.; Kleinberg, J. M.; and Leskovec, J. 2014. Can cascades be predicted? In Proceedings of the 23rd International Conference on World Wide Web, 925–936. ACM.
Chierichetti, F.; Liben-Nowell, D.; and Kleinberg, J. M. 2011. Reconstructing patterns of information diffusion from incomplete observations. In Advances in Neural Information Processing Systems, 792–800.
Conitzer, V.; Panigrahi, D.; and Zhang, H. 2020. Learning Opinions in Social Networks. In International Conference on Machine Learning, 2122–2132. PMLR.
Daneshmand, H.; Gomez-Rodriguez, M.; Song, L.; and Schölkopf, B. 2014. Estimating diffusion network structures: Recovery conditions, sample complexity & soft-thresholding algorithm. In International Conference on Machine Learning, 793–801.
Daniely, A.; Sabato, S.; Ben-David, S.; and Shalev-Shwartz, S. 2015. Multiclass learnability and the ERM principle. The Journal of Machine Learning Research, 16(1): 2377–2404.
Du, N.; Liang, Y.; Balcan, M.; and Song, L. 2014. Influence function learning in information diffusion networks. In International Conference on Machine Learning, 2016–2024.
Du, N.; Song, L.; Yuan, M.; and Smola, A. J. 2012. Learning networks of heterogeneous influence. In Advances in Neural Information Processing Systems, 2780–2788.
Gomez Rodriguez, M.; Balduzzi, D.; and Schölkopf, B. 2011. Uncovering the temporal dynamics of diffusion networks. In 28th International Conference on Machine Learning (ICML 2011), 561–568. International Machine Learning Society.
Goyal, A.; Bonchi, F.; and Lakshmanan, L. V. 2010. Learning influence probabilities in social networks. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, 241–250. ACM.
Gruhl, D.; Guha, R.; Liben-Nowell, D.; and Tomkins, A. 2004. Information diffusion through blogspace. In Proceedings of the 13th International Conference on World Wide Web, 491–501. ACM.
Guille, A.; and Hacid, H. 2012. A predictive model for the temporal dynamics of information diffusion in online social networks. In Proceedings of the 21st International Conference on World Wide Web, 1145–1152. ACM.
Hanneke, S. 2016. The optimal sample complexity of PAC learning. The Journal of Machine Learning Research, 17(1): 1319–1333.
He, X.; Xu, K.; Kempe, D.; and Liu, Y. 2016. Learning influence functions from incomplete observations. In Advances in Neural Information Processing Systems, 2073–2081.
Kalimeris, D.; Singer, Y.; Subbian, K.; and Weinsberg, U. 2018. Learning diffusion using hyperparameters. In International Conference on Machine Learning, 2425–2433.
Kempe, D.; Kleinberg, J.; and Tardos, É. 2003. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 137–146. ACM.
Kleinberg, J. 2000. The small-world phenomenon: An algorithmic perspective. In Proceedings of the Thirty-Second Annual ACM Symposium on Theory of Computing, 163–170.
Li, C.; Ma, J.; Guo, X.; and Mei, Q. 2017. DeepCas: An end-to-end predictor of information cascades. In Proceedings of the 26th International Conference on World Wide Web, 577–586. International World Wide Web Conferences Steering Committee.
Liben-Nowell, D.; and Kleinberg, J. 2007. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7): 1019–1031.
Myers, S. A.; Zhu, C.; and Leskovec, J. 2012. Information diffusion and external influence in networks. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 33–41. ACM.
Narasimhan, H.; Parkes, D. C.; and Singer, Y. 2015. Learnability of influence in networks. In Advances in Neural Information Processing Systems, 3186–3194.
Pollard, D. 2012. Convergence of Stochastic Processes. Springer Science & Business Media.
Saito, K.; Ohara, K.; Yamagishi, Y.; Kimura, M.; and Motoda, H. 2011. Learning diffusion probability based on node attributes in social networks. In International Symposium on Methodologies for Intelligent Systems, 153–162. Springer.
Talagrand, M. 1994. Sharper bounds for Gaussian and empirical processes. The Annals of Probability, 28–76.
Tang, Y.; Xiao, X.; and Shi, Y. 2014. Influence maximization: Near-optimal time complexity meets practical efficiency. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, 75–86. ACM.
Valiant, L. G. 1984. A theory of the learnable. In Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing, 436–445. ACM.
Vapnik, V. 2013. The Nature of Statistical Learning Theory. Springer Science & Business Media.
Wang, J.; Zheng, V. W.; Liu, Z.; and Chang, K. C.-C. 2017. Topological recurrent neural network for diffusion prediction. In 2017 IEEE International Conference on Data Mining (ICDM), 475–484. IEEE.
Weng, L.; Menczer, F.; and Ahn, Y.-Y. 2013. Virality prediction and community structure in social networks. Scientific Reports, 3(1): 1–6.