# structinf_mining_structural_influence_from_social_streams__95822aa2.pdf

Struct Inf: Mining Structural Inﬂuence from Social Streams

Jing Zhang, Jie Tang, Yuanyi Zhong , Yuchen Mo, Juanzi Li, Guojie Song, Wendy Hall, Jimeng Sun

Information School, Renmin University of China Department of Computer Science and Technology, Tsinghua University Key Laboratory of Machine Perception, Peking University Electronics and Computer Science, University of Southampton Computational Science and Engineering at College of Computing, Georgia Institute of Technology zhang-jing@ruc.edu.cn, {jietang, lijuanzi}@tsinghua.edu.cn, gjsong@pku.edu.cn, wh@ecs.soton.ac.uk, jsun@cc.gatech.edu

Social inﬂuence is a fundamental issue in social network analysis and has attracted tremendous attention with the rapid growth of online social networks. However, existing research mainly focuses on studying peer inﬂuence. This paper introduces a novel notion of structural inﬂuence and studies how to efﬁciently discover structural inﬂuence patterns from social streams. We present three sampling algorithms with theoretical unbiased guarantee to speed up the discovery process. Experiments on a big microblogging dataset show that the proposed sampling algorithms can achieve a 10 speedup compared to the exact inﬂuence pattern mining algorithm, with an average error rate of only 1.0%. The extracted structural inﬂuence patterns have many applications. We apply them to predict retweet behavior, with performance being signiﬁcantly improved.

1 Introduction Social inﬂuence occurs when one s behaviors or opinions are affected by others. It forms a fundamental mechanism governing the dynamics of social networks, and recently, has attracted tremendous attention with the availability of large online social behavior data. For example, a ﬁeld experiment conducted on Facebook shows strong evidence of social inﬂuence on political mobilization (Bond et al. 2012). The theory of three degrees of inﬂuence (Fowler, Christakis, and others 2008) claims that our behavior can inﬂuence people we have never met. However, the underlying mechanism is still unclear and a thorough investigation is thus needed. We show an example of retweet behavior inﬂuence in three different structures on a dataset of Sina Weibo1 in Figure 1. The red node represents a neighboring user (followee in Weibo) who retweets a message before time t; the white node denotes the target user to be studied. The general question is how likely it is that the target user will retweet this message at time t (0 < t t τ, τ is a short time interval), conditioned on different inﬂuence structures. We can see several interesting patterns. First, the conditional probability in Figure 1(b) increases to 150% higher than that of

Copyright c 2017, Association for the Advancement of Artiﬁcial Intelligence (www.aaai.org). All rights reserved. 1The most popular Chinese microblogging site (weibo.com).

(a) P=0.07%

(b) P=0.11%

(c) P=0.31%

Figure 1: Probability that the target user (white node) retweets a message on Weibo, conditioned on their neighbors (red nodes) having already retweeted the message.

Figure 1(a), suggesting more active neighbors can improve the retweet likelihood. On the other hand, the probability in Figure 1(c) increases to 300% higher than that of Figure 1(b). The difference between them is the relationship between the neighbors. Please note that the target user may be unaware of such a relationship. How does the network structure formed by friends matter in inﬂuencing users behavior changes? What are the most signiﬁcant inﬂuence structures hidden in the huge volume of streaming behavior data? Although social inﬂuence has been extensively studied before (Anagnostopoulos, Kumar, and Mahdian 2008; Kempe, Kleinberg, and Tardos 2003; Tang et al. 2009), most of the existing work focused on peer inﬂuence, and ignored the effect of such inﬂuence structures. Ugander et al. (2012) presented the idea of structural diversity and showed that diverse structures of neighboring users can inﬂuence user s behaviors. However, they only focused on investigating qualitative effects of social inﬂuence, but ignored quantitative estimation. In this work, we formalize the problem of mining structural inﬂuence from the social stream. Speciﬁcally, given large user behavior logs and the network structure among users, how can we discover inﬂuence structures with high conﬁdence? The discovered inﬂuence structures can be applied in many different applications, e.g., to be used as features in a paper s citation network to predict whether the paper will be extensively cited (Shi, Leskovec, and Mc Farland 2010), or to predict one s following behaviors in a user s following network (Zhang et al. 2015). We address the above issue and make the following contributions: (1) we formally deﬁne a novel notion of structural inﬂuence; (2) to handle large streaming behavior data, we propose an exact and several sampling algorithms to

Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

quickly estimate structural inﬂuence; we provide theoretical unbiased guarantees for the sampling algorithms; (3) our empirical study on a large Weibo dataset, show that the proposed sampling algorithms achieve a 10 speedup and result in an average error rate of only 1.0%, compared to our exact counting algorithm; (4) ﬁnally, we demonstrate the effect of structural inﬂuence on retweet behavior prediction.

2 Problem Formulation Let G = (V, E) denote a social network, where V is a set of users and E V V is a set of relationships. We use vi V to represent a user and eij E to represent a relationship between vi and vj. A relationship can be either directed or undirected. With a directed relationship eij, we only consider one-way inﬂuence from vi to vj. An undirected relationship can be divided into two directed relationships, eij and eji, and the inﬂuence is also two-way. We use N(vi) to indicate the neighbors of user vi. Another input of our problem is a stream of user action logs.

Deﬁnition 1 Action Logs. Let L denote a stream of action logs, where each log entry l L is a triple (v, a, t), representing user v V performed action a A at time t. Here A is a set of action types.

For example, in a microblogging site, a retweet behavior is an action, and each tweet can be considered as an action type. We build an action diffusion graph from G and L.

Deﬁnition 2 Action Diffusion Graph. An action diffusion graph, Gp = (L, Ep), is a directed graph, where each node is a log entry in L, and each edge ep ij Ep between two log entries (vi, a, ti) and (vj, a, tj) satisﬁes 1) (vi, vj) E and 2) 0 < tj ti τ, where τ is a short time interval.

We use N p(l) to indicate the neighbors in Gp that point to l. In an action diffusion graph, an edge ep ij represents that when user vi performs action a at time ti, vi has a potential inﬂuence on her neighbor vj to perform a at tj, a short time interval τ after ti. We deﬁne (vi, a, ti) as inﬂuencing action and (vj, a, tj) as target action. If later (vj, a, tj) actually happens in L, we call it as active target action; otherwise inactive target action. Different from conventional research that mainly decomposes the inﬂuence from neighbors into peer inﬂuences, we propose a new notion as:

Deﬁnition 3 Structural Inﬂuence. When the target user vj performs action a at tj and her γ-degree (γ 1) friends who already performed the action before tj (0 < tj ti τ, ti is the latest time when a friend performed the action), we deﬁne structural inﬂuence as the combination effect of peer inﬂuences exerted by those γ-degree active friends on the target user, when the friends and target user form an inﬂuence structure. Table 1 lists all the structures when considering 1, 2, and 3 active γ-degree friends. We also name inﬂuence structure as structural inﬂuence pattern and use Ck to represent the k-th pattern. An instance of Ck for any action is named a pattern instance, and is denoted by ck i .

Structural inﬂuence can be formulated as a conditional probability, IPk = xk xk+yk , where xk represents the frequency of the instances with pattern Ck and target action lt

being active in L, while yk is that of Ck s instances with lt being inactive in L. Given the above deﬁnition, the key question we want to answer is to efﬁciently discover the structural inﬂuence patterns with high inﬂuence probability. Note that a node may be inﬂuenced by multiple pattern instances, and this combination effect can be modeled by a partial credit model or a cascade model (Goyal, Bonchi, and Lakshmanan 2010). This paper simply follows the assumption of Bernoulli distribution (Goyal, Bonchi, and Lakshmanan 2010) to estimate structural inﬂuences from different pattern instances independently. Other assumptions will be studied in the future. Note also that when enumerating an instance, we assign the maximal matched pattern for it. For example, if an instance can be matched to C4 in Table 1, the sub patterns of C4, such as C2 and C3, will not be matched. Later, we aim at quickly estimating IPk for each inﬂuence pattern. One potential application is to incorporate the patterns of high probabilities as features to predict user behaviors, and the details will be given in the experimental section.

3 Struct Inf: Structural Inﬂuence Estimation

In this section, we begin by introducing an exact algorithm to estimate the structural inﬂuence from social streams, based on which we propose three fast sampling strategies.

3.1 Struct Inf-Basic

For each pattern Ck, the goal is to estimate xk and yk, and based on which to calculate structural inﬂuence of Ck. Assume the social network is static, while the action logs are very large and arrive in real time. Our approach loads the static network into memory at the beginning and update xk and yk whenever an action log arrives2. The key idea of the approach is to 1) identify active and inactive target actions; and 2) enumerate the structural inﬂuence patterns from the target actions backwards along the diffusion edges. Identifying Target Actions. We propose Algorithm 1 to estimate xk and yk. We maintain an action diffusion graph Gp, a queue Q and a hashtable H, to record the diffusion edges and action logs within recent time interval τ. For each newly arrived action, (vi, ak, ti), we ﬁrst add it into Q and H (Line 4), and then update x (Lines 5-8) and y (Lines 9-19). To estimate x, we need to ﬁgure out active target actions. Obviously, each newly arrived action is an active target action. Thus when each action lt = (vi, ak, ti) arrives, we ﬁrst ﬁnd those neighbors of vi who are active in [ti τ, ti], and create the new diffusion edges from the actions of those active neighbors to lt in Gp (Lines 5-7), and then enumerate the structural inﬂuence patterns starting from lt (Line 8). To estimate y, we need to ﬁgure out inactive target actions. To achieve this, we enumerate the inﬂuence patterns when an action is outdated rather than newly arrived, because for any action (v, a, t), we can only know whether user v s neighbors are active or not within [t, t + τ] until time t + τ, which is exactly the time that (v, a, t) is popped up from Q. In particular, whenever a new action arrives, we

2In our implementation, storing a static network with millions of nodes and edges costs about 5G memory.

d4, d3, d2, d1

Target action

Figure 2: Illustration of pattern enumeration.

remove the outdated actions from Q and H (Lines 9-10). For each removed action lr = (vj, ah, tj), we identify the neighbors of vj that are inactive in [tj, tj + τ] (Lines 11-12), and assign a virtual time tj + τ to each inactive neighbor(Line 13), to create an inactive target action lt. After that, we add a virtual diffusion edge from lr to lt into Gp (Line 14). Finally, we complete the diffusion edges from other active actions to lt (Lines 15-17). After enumerating from lt (Line 18), we remove the created virtual diffusion edges (Line 19).

Enumerating Inﬂuence Patterns. Algorithm 2 presents how to enumerate all possible inﬂuence patterns from a given action log. The basic idea is: we start by adding an active or inactive target action into Vin, and their neighboring actions into Vext. Then we extend Vin by selecting an arbitrary action l from Vext, and update Vext by adding all the neighbors of l that are not included in Vin. One thing worth noting is the necessary of avoiding duplicate enumeration. To achieve this, we assign each action an incremental (unique) label when it arrives (Line 3 in Algorithm 1), and make sure that, anytime, the labels of the actions in Vext are smaller than those in Vin (Line 7 in Algorithm 2). Figure 2 illustrates the enumeration process under this strategy. The left part demonstrates an action diffusion graph with all the nodes denoted by incremental labels over time. According to Algorithm 2, we ﬁrst add the target action d4 into Vin, and d4 s neighbors, namely d3 and d2, into Vext, and then extend Vin by selecting the actions from Vext one by one. We can see that when selecting d2 into Vin, d3 is removed from Vext because the label of d3 is larger than that of d2 in Vin. A similar idea of enumerating subgraphs was used in (Wernicke 2006). They studied static networks, while we are dealing with the streaming behavior data, and thus the derived action diffusion graph evolves over time. When invoking Algorithm 2, we ﬁrst induce a subgraph by including all the edges between nodes in Vin, and then determine which pattern the induced graph belongs to (Line 2). This is essentially a problem of graph isomorphism. When the pattern is small, we can use the number of nodes/edges and degree sequences, to uniquely identify a pattern, and leverage approximate solutions when the pattern gets larger (Leskovec, Singh, and Kleinberg 2006). The enumeration stops when the considered maximal number of nodes, i.e., N, is reached (Lines 3-4).

Discussions. In summary, the proposed Struct Inf-Basic algorithm is a streaming approach that only needs one-time scan of the streaming action logs and can carefully avoid duplicate enumeration by a dynamic labeling mechanism.

Algorithm 1: Struct Inf-Basic

Input: A network G = (V, E), a stream of action logs L. Output: x, y. Initialize action diffusion graph Gp, queue Q, hashtable H 1 foreach lt = (vi, ak, ti) L do 2

Assign an incremental label d(lt) to lt; 3 Add lt into Q and H; 4 foreach vj N(vi) do 5

if H contains ls = (vj, ak) then 6

Add the edge ls lt into Gp; 7

x += Enum Inf Pattern(Gp, {lt}, N p(lt), []); 8 while lr = (vj, ah, tj) = Q.head() and ti tj > τ do 9

Remove lr from Q and H ; 10 foreach vm N(vj) do 11

if H does not contain lt = (vm, ah) then 12

Set tm = tj + τ ; 13 Add the edge lr lt into Gp; 14 foreach vn N(vm) do 15

if H contains ls = (vn, ah) then 16

Add the edge ls lt into Gp; 17

y += Enum Inf Pattern(Gp, {lt}, N p(lt), []); 18 Remove edges associated with lt from Qp; 19

Algorithm 2: Enum Inf Pattern

Input: Action diffusion graph Gp, selected actions Vin, extended actions Vext, inﬂuence pattern statistics n. Output: n. if |Vin| > 1 then 1

c = Induce Pattern(Vin); pid = Get Pattern ID(c); n[pid]++; 2

if |Vin| = N then 3

return n; 4

foreach l Vext do 5

Vin Vin {l} ; 6 V ext {l N p(l) V in Vext : d(l ) < d(l)} ; 7 Enum Inf Pattern(Gp, Vin, V ext, n); 8

Some other methods proposed by (Kashtan et al. 2004) and (Yan and Han 2002) extend neighboring edges of the selected edges, rather that extending nodes. However, when extending edges, the speeding up is not easy as that of extending nodes (Wernicke 2006). Regarding the time complexity, in Algorithm 1, the time complexity for enumerating active target actions is O(|L|dmax), and is O(|L|d2 max) for inactive target actions, where dmax is the maximal degree of G. The complexity of Algorithm 2 is O((dp max)N), where dp max is the maximal degree of Gp and N is the maximal number of nodes in all the considered inﬂuence patterns. In summary, the total time complexity is O(|L|d2 max(dp max)N).

3.2 Struct Inf-S: Fast Sampling Algorithms

Struct Inf-S1. The time complexity of Struct Inf-Basic is high because it enumerates all possible inﬂuence patterns for each target action. To speed up the algorithm, we propose several sampling strategies. The basic requirement is

to guarantee unbiased estimation of ˆxk and ˆyk. To achieve this, outside Line 8 and 18 in Algorithm 1, we add judgment statements to determine whether target action lt will be enumerated by a probability p respectively. In the same way, we judge whether a neighboring action l will be enumerated by p outside Lines 6-8 in Algorithm 2. We obtain the lemma:

Lemma 1 Given an inﬂuence pattern Ck, any of its instances, ck i , is selected uniformly according to probability pnk, where nk is the number of nodes in Ck.

Proposition 1 Let xk be the exact number by Struct Inf Basic, and ˆxk be the approximate number by Struct Inf-S1, then xk = ˆxk/pnk is an unbiased estimator for xk.

Proof 1 The expectation of xk can be written as

n=1 p(sn)(ˆxk/pnk)sn

where sn represents the nth sample of pattern instances and there are S possible samples in the whole pattern space. Notation (ˆxk/pnk)sn represents the value of ˆxk/pnk that is estimated in the sample sn. We can factorize ˆxk/pnk common to the ith pattern ck i and sum over the population:

ck i sn p(sn)

where ck i sn p(sn) indicates the summation over the probabilities of all samples containing the instance ck i , which is actually the prior probability that ck i is selected in a sample. In Struct Inf-S1, the probability is pnk, thus we have

1 pnk pnk = xk

According to the sampling theory (Horvitz and Thompson 1952), the unbiased estimation of the variance of xk is:

1 p(ck i ) p2(ck i ) +

p(ck i ck j ) p(ck i )p(ck j )

p(ck i ck j )p(ck i )p(ck j ) (1)

The above variance is determined by both the approximate estimation, ˆxk, and the sampling probability for a pattern, p(ck i ). According to the central limit theorem, the sum of a random sample of a large enough size from an arbitrary distribution follows approximately a normal distribution, i.e., xk N(xk, V ( xk)). Thus, the probability of the sampling error with conﬁdence 1 α is:

V ( xk) xk xk + zα/2

V ( xk) = 1 α

where zα/2 represents the number of standard deviations,

V ( xk), by which an observation xk differs from the mean xk, when the conﬁdence is 1 α.

Struct Inf-S2. Another idea is to ﬁrst sample diffusion edges uniformly to form a sampled action diffusion graph, and then enumerate the inﬂuence patterns based on the sampled graph. Speciﬁcally, in Algorithm 1, outside Lines 6-7, we add a judgment statement to determine whether ls lt will be added into Gp by a probability q. Outside Lines 12-19 and 16-17, we also determine whether lr lt or ls lt will be added into Gp by q. We obtain the following lemma:

Lemma 2 Given an inﬂuence pattern Ck, any of its instances, ck i , is selected uniformly according to probability qmk, where mk is the number of edges in Ck.

Proposition 2 Let xk be the exact number and ˆxk be the approximate number obtained by Struct Inf-S2. For the complete graphs such as C4 and C20 in Table 1, ˆxk/qmk is an unbiased estimator of xk, while for the incomplete graphs:

Ci:Ck Ci&nk=ni nikˆxi

Ci:Ck Ci &nk=ni

Proof 2 When a pattern is a complete graph, the proof is the same as that of Proposition 1. When a pattern is an incomplete graph, ˆxk/qmk records the number of not only pattern Ck, but also the patterns that contain Ck as their subgraph and with the same node size,i.e.,{Ci : Ck Ci&nk = ni}. We name Ci as the parent pattern of Ck. In a sampled action diffusion graph, when enumerating an incomplete subgraph, it may be restored to any of its parent patterns. Thus to obtain xk, we need to ﬁrst sum the approximate values of Ck and all its parents, and then divide by the sampling probability, qmk, to get an unbiased estimation (the ﬁrst part of Eq. (2)), and ﬁnally subtract the unbiased value of all the parent patterns (the second part of Eq. (2)). Notation nik is the times that Ck is contained in Ci. For example, in Table 1, C6 appears two times in C14. The unbiased value, xi, of each parent pattern can be estimated in the same way iteratively, until the pattern itself is a complete graph.

Struct Inf-S3. Struct Inf-S3 combines the above two approaches by not only sampling diffusion edges when building the action diffusion graph, but also sampling the action nodes when enumerating pattern instances.

4 Experiment The dataset and code are available online now.3

Experimental Setup. We perform the evaluation on a dataset collected from Sina Weibo1, The network consists of 1,776,950 users as nodes and 308,489,739 following relationships as edges, with the maximal degree dmax as 2,875. We use 23,755,810 tweet/retweet actions to build action diffusion graphs of 3,472,004 diffusion edges. The action type set A contains 300,000 types (i.e., the original tweets). Please refer to (Zhang et al. 2013) for details. In the Weibo dataset, we ﬁrst use the sampling algorithms to estimate the structural inﬂuence for the patterns in Table

3http://aminer.org/structinf

Figure 3: The trade-off between error and time by varying p and q. #Iteration is the times that Algorithm 2 is invoked.

1. We vary sampling probabilities and compare the error and time trade-off curves. The results of Struct Inf-Basic are used as ground truth. Please note that no existing methods can be directly used to estimate structural inﬂuence. Then we apply the extracted inﬂuence patterns to help retweet behavior prediction, to demonstrate the effectiveness of inﬂuence structures. Experimental Results. We use relative error,

UIPk = | IPk IPk|

where IPk is the exact/actual estimation and IPk is the approximate estimation, to measure how far the approximate estimation is from the exact estimation (Ahmed et al. 2014). We present the optimal estimation of structural inﬂuence, IPk, with k from 1 to 20 in Table 1. The results are obtained by executing Struct Inf-S3 with τ = 25 hours, q = 0.9, px = 0.6 and py = 0.1, where px and py are the node sampling probabilities for estimating xk and yk respectively, and q is for sampling edges, and is the same for estimating xk and yk. Because the ratio between active and inactive instances is about 1:700, xk can be estimated much faster than yk. Thus we can ﬁrst try several parameters to ﬁnd the optimal values of q and px, and then estimate py approxi-

mately by px nk

xk yk , that is derived from the variance. We estimate each xk and yk by averaging the results of 10 independent runs, and based on which to calculate IPk. Table 1 shows that each IPk is very close to the exact value. Most of the relative errors are around 1.0% and the worst case is about 5.0%. We also ﬁnd that the top inﬂuential patterns are those with more nodes or edges than others ( the numbers that are in boldface). To get the results in Table 1, the exact method (Struct Inf-Basic) requires 19 hours, while the sampling method (Struct Inf-S3) requires only 1.8 hours. Trade-off between Error and Time. To compare the performance of different sampling methods, we show how accuracy and speed vary by varying the sampling probabilities. Speciﬁcally, we ﬁrst tune the sampling probabilities. We try p in the range {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0} for Struct Inf-S1, try q in the same range for Struct Inf-S2, and try all the possible combinations of p and q in the same range for Struct Inf-S3. Second, for each conﬁguration of p

Table 1: All structural inﬂuence patterns formed by 2, 3, and 4 nodes and the structural inﬂuence estimation results (%). The red nodes represent active friends who performed an action before and the white nodes represent the target user.

k Ck IPk IPk UIPk k Ck IPk IPk UIPk 1

0.066 0.066 0.020 11

0.038 0.038 0.720

0.074 0.074 0.085 12

0.186 0.186 0.088

0.111 0.110 0.425 13

0.399 0.392 1.785

0.307 0.304 0.928 14

0.063 0.062 0.616

0.069 0.069 0.530 15

0.619 0.615 0.548

0.091 0.090 0.358 16

0.444 0.439 1.378

0.067 0.067 0.236 17

0.070 0.070 0.074

0.106 0.099 5.852 18

0.420 0.416 0.890

0.381 0.388 1.666 19

0.662 0.645 2.696

0.165 0.162 1.128 20

0.485 0.479 1.239

Figure 4: Convergence of the relative errors of patterns C1 and C4 by Struct Inf-S3.

and q, we calculate the average error of xk over all the inﬂuence patterns, and further average them over 10 independent sampling results. Third, we use the number of iterations that Algorithm 2 is invoked as the metric to measure how fast an approximate estimation will be. The more iterations we run, the more inﬂuence patterns will be sampled. Figure 3 plots all the (relative-error, #iterations) pairs when varying p and q. For each method, the sampling parameters change from 0.1 to 1.0 incrementally from the upper-left to the lower-right points. Because Struct Inf-S3 has two tunable parameters, the points cannot be connected by one single line. From the results, we can see that ﬁrst, the error curves of al the methods almost follow exponential distributions. The average error drops dramatically when p increases at the very beginning, and then changes slowly when p gets larger than 0.5. Second, Struct Inf-S1 performs better than Struct Inf-S2 because its curve is below that of Struct Inf-S2, implying that it needs less iterations to achieve the same error. Finally, the resultant points of Struct Inf-S3 are more concentrated in the bottom-left corner than those of Struct Inf-S1 and Struct Inf-S2, suggesting that Struct Inf S3 is less sensitive to the parameters. This probably because Struct Inf-S3 combines the power of Struct Inf-S1 and Struct Inf-S2, thus achieves a more stable performance. Unbiasedness. We take Struct Inf-S3 as an example to

(a) τ (b) Retweet prediction

Figure 5: (1) Effect of τ on structural inﬂuence; (2) Performance of retweet behavior prediction.

study the properties of the sampling distribution. Speciﬁcally, for each pair of q and p, we run 20 independent samplings and plot the relative error of each run. From Figure 4, we can see that the sampling distributions of different patterns are all centered over the line with relative error as 0, which implies the unbiased property of the estimations. In addition, inﬂuence patterns with fewer nodes and edges can achieve better performance with few iteration times (i.e., samples). We see that the simplest pattern C1 obtains 0.5% relative error with few samples, while C4 needs more samples to achieve the same error rate, because the sampling probability p(ci) = pmqn decreases with m and n, making the variance larger than that of the simple patterns. Effect of Time Delay τ. We study how time delay τ affects structural inﬂuence. Figure 5(a) shows the changes of structural inﬂuence by varying τ. The Y-axis is the structural inﬂuence of C1, which is the basic component of all the other patterns. We can see that social inﬂuence increases quickly over time and gradually becomes stable after τ gets larger than 25 hours, implying that social inﬂuence decays over time. Thus, we set τ to 25 hours on the Weibo dataset. Application Improvements. We demonstrate how the discovered inﬂuence patterns can help improve the application of retweet prediction. The goal is, given an action triple (v, a, t), predict whether user v will retweet the tweet a at time t. Basically, for each observed action in the dataset, we treat it as a positive instance. For each positive instance (v, a, t), if a follower u of user v did not retweet a before t + τ, we treat (u, a, t + τ) as a negative instance. We uniformly sample a balanced dataset with equal number of positive and negative instances and train a binary classiﬁer with logistic regression. We feed some basic features, such as the number of followees/followers, gender, veriﬁcation status of the user, to the classiﬁer. We aim at investigating whether the structural inﬂuence patterns can improve prediction performance based on the basic features. Additionally, we add structural inﬂuence patterns as features (the number of each pattern that is counted from the target action (v, a, t)). In particular, we divide the patterns into four groups: C1(i.e., the number of active neighbors), weak ( IPk < 0.1), moderate (0.1 IPk < 0.3), and strong ( IPk > 0.3). We respectively add the basic features and pattern groups one by one and evaluate the increase of the predictive performance. A larger increase means a higher predictive power. From the results in Figure 5(b), we observe that weak patterns can

improve a lot upon basic features and the number of active neighbors (+1.1% in terms of F1), indicating the effect of structural inﬂuence on retweet behaviors. Furthermore, signiﬁcant increase on F1 score is observed when adding strong patterns upon moderate patterns (+1.73%), while no evident increase is observed when adding moderate patterns, implying that the discovered signiﬁcant structural inﬂuence patterns can beneﬁt a lot on predicting retweet behaviors.

5 Related Work

Structural Inﬂuence. A few research examined the structural characteristics of social inﬂuence. Ugander et al. (2012) ﬁrstly studied structural diversity and found that the possibility that a user joins Facebook is positively affected by the diverse structure of the friends who have joined Facebook. Lately quite a few work has been conducted based on this idea in various scenarios (Fang et al. 2014; Kloumann et al. 2015; Qiu et al. 2016; Zhang et al. 2013). However, all the studies about structural diversity do not distinguish the particular inﬂuence structures, and our paper proposes an algorithm to enumerate all inﬂuence structures. Inﬂuence Learning. The aim of inﬂuence learning is to associate each edge euv with a probability puv to represent the strength of inﬂuence exerted by u on v. (Kimura et al. 2011) were the ﬁrst to learn inﬂuence by maximizing the likelihood of generating historical behavior data. Tang et al. (2009) and Liu et al. (2012) extended the inﬂuence probabilities to the topic level. Kutzkov et al. (Kutzkov et al. 2013) approximately estimated the pairwise inﬂuence in a behavior stream. In addition to learning pairwise inﬂuence, group inﬂuence is also studied by (Tang, Wu, and Sun 2013; Zhang et al. 2014). To the best of our knowledge, this is the ﬁrst attempt to deﬁne and estimate structural inﬂuence. Subgraph Mining/Sampling. Traditional research counted triangles (Jha, Seshadhri, and Pinar 2013; Pavan et al. 2013), 4-node subgraphs (Jha, Seshadhri, and Pinar 2015), any subgraphs (Ahmed et al. 2014; Wernicke 2006), or speciﬁed ones such as cliques, stars, chains, and cascading patterns (Koutra et al. 2015; Leskovec, Singh, and Kleinberg 2006). However, the methods cannot be directly applied to our problem, because the behavior data is streaming and the structural inﬂuence is beyond the frequency of a subgraph.

6 Conclusion

We present the concept of structural inﬂuence and propose a sampling algorithm to quickly estimate the inﬂuence probabilities of different structures from social stream, with theoretical proof of unbiasedness properties. Experiments on a large Weibo data show that, compared to the exact solution, the third sampling method can achieve the best performance, with a 10 speedup and an average error rate of only 1.0%. We also demonstrate the effectiveness of the extracted high inﬂuential patterns on retweet behavior prediction.

Acknowledgments. The work is supported by the National High-tech R&D Program (No. 2015AA124102, No.2014AA015204), NSSFC (13&ZD190, 12&ZD220), National Key R&D Program(No.2016YFB1000702), NSFC

(61272137, 61202114, 61532021), Online Education Research Center, Ministry of Education (2016ZD102), and an Royal Society-Newton Advanced Fellowship Award. Jimeng Sun was supported by the National Science Foundation, award IIS- #1418511 and CCF-#1533768, Children s Healthcare of Atlanta, CDC I-SMILE project, Google Faculty Award, AWS Research Award, Microsoft Azure Research Award and UCB. Jie Tang is the corresponding author.

Ahmed, N. K.; Dufﬁeld, N.; Neville, J.; and Kompella, R. 2014. Graph sample and hold: A framework for big-graph analytics. In KDD 14, 1446 1455. Anagnostopoulos, A.; Kumar, R.; and Mahdian, M. 2008. Inﬂuence and correlation in social networks. In KDD 08, 7 15. Bond, R. M.; Fariss, C. J.; Jones, J. J.; Kramer, A. D. I.; Marlow, C.; Settle, J. E.; and Fowler, J. H. 2012. A 61million-person experiment in social inﬂuence and political mobilization. Nature 489:295 298. Fang, Z.; Zhou, X.; Tang, J.; Shao, W.; Fong, A.; Sun, L.; Ding, Y.; Zhou, L.; and Luo, J. 2014. Modeling paying behavior in game social networks. In CIKM 14, 411 420. Fowler, J. H.; Christakis, N. A.; et al. 2008. Dynamic spread of happiness in a large social network: longitudinal analysis over 20 years in the framingham heart study. BMJ 337:a2338. Goyal, A.; Bonchi, F.; and Lakshmanan, L. V. 2010. Learning inﬂuence probabilities in social networks. In WSDM 10, 241 250. Horvitz, D. G., and Thompson, D. J. 1952. A generalization of sampling without replacement from a ﬁnite universe. Journal of the American Statistical Association 47(260):663 685. Jha, M.; Seshadhri, C.; and Pinar, A. 2013. A space efﬁcient streaming algorithm for triangle counting using the birthday paradox. In KDD 13, 589 597. Jha, M.; Seshadhri, C.; and Pinar, A. 2015. Path sampling: A fast and provable method for estimating 4-vertex subgraph counts. In WWW 15, 495 505. Kashtan, N.; Itzkovitz, S.; Milo, R.; and Alon, U. 2004. Efﬁcient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 20(11):1746 1758. Kempe, D.; Kleinberg, J.; and Tardos, E. 2003. Maximizing the spread of inﬂuence through a social network. In KDD 03, 137 146. Kimura, M.; Saito, K.; Ohara, K.; and Motoda, H. 2011. Learning information diffusion model in a social network for predicting inﬂuence of nodes. Intell. Data Anal. 15:633 652. Kloumann, I.; Adamic, L.; Kleinberg, J.; and Wu, S. 2015. The lifecycles of apps in a social ecosystem. In WWW 15, 581 591.

Koutra, D.; Kang, U.; Vreeken, J.; and Faloutsos, C. 2015. Summarizing and understanding large graphs. Statistical Analysis and Data Mining 8(3):183 202. Kutzkov, K.; Bifet, A.; Bonchi, F.; and Gionis, A. 2013. Strip: stream learning of inﬂuence probabilities. In KDD 13, 275 283. Leskovec, J.; Singh, A.; and Kleinberg, J. 2006. Patterns of inﬂuence in a recommendation network. In PAKDD 06, 380 389. Liu, L.; Tang, J.; Han, J.; and Yang, S. 2012. Learning inﬂuence from heterogeneous social networks. DMKD 25(3):511 544. Pavan, A.; Tangwongsan, K.; Tirthapura, S.; and Wu, K.-L. 2013. Counting and sampling triangles from a graph stream. VLDB 6(14):1870 1881. Qiu, J.; Li, Y.; Tang, J.; Lu, Z.; Ye, H.; Chen, B.; Yang, Q.; and Hopcroft., J. 2016. The lifecycle and cascade of wechat social messaging groups. In WWW 16. Shi, X.; Leskovec, J.; and Mc Farland, D. A. 2010. Citing for high impact. In JCDL 10, 49 58. Tang, J.; Sun, J.; Wang, C.; and Yang, Z. 2009. Social inﬂuence analysis in large-scale networks. In KDD 09, 807 816. Tang, J.; Wu, S.; and Sun, J. 2013. Conﬂuence: Conformity inﬂuence in large social networks. In KDD 13, 347 355. Ugander, J.; Backstrom, L.; Marlow, C.; and Kleinberg, J. 2012. Structural diversity in social contagion. PNAS 109(16):5962 5966. Wernicke, S. 2006. Efﬁcient detection of network motifs. TCBB 3(4):347 359. Yan, X., and Han, J. 2002. gspan: Graph-based substructure pattern mining. In ICDM 02, 721 724. Zhang, J.; Liu, B.; Tang, J.; Chen, T.; and Li, J. 2013. Social inﬂuence locality for modeling retweeting behaviors. In IJCAI 13, 2761 2767. Zhang, J.; Tang, J.; Zhuang, H.; Leung, C. W.-K.; and Li, J. 2014. Role-aware conformity inﬂuence modeling and analysis in social networks. In AAAI 14, 1 7. Zhang, J.; Fang, Z.; Chen, W.; and Tang, J. 2015. Diffusion of following links in microblogging networks. TKDE 27(8):2093 2106.