# Tracking Idea Flows between Social Groups

Yangxin Zhong1, Shixia Liu1, Xiting Wang1, Jiannan Xiao1 and Yangqiu Song2

1School of Software, Tsinghua University, Beijing, P.R. China
2Lane Department of Computer Science and Electrical Engineering, West Virginia University, United States
1{zhongyx12,wang-xt11,xjn11}@mails.tsinghua.edu.cn, 1shixia@tsinghua.edu.cn, 2yangqiu.song@mail.wvu.edu

S. Liu is the corresponding author. Copyright © 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

## Abstract

In many applications, ideas that are described by a set of words often flow between different groups. To facilitate users in analyzing the flow, we present a method that models the flow behaviors by identifying the lead-lag relationships between word clusters of different user groups. In particular, an improved Bayesian conditional cointegration based on dynamic time warping is employed to learn links between words in different groups. A tensor-based technique is then developed to cluster these linked words into different clusters (ideas) and track the flow of ideas. The main feature of the tensor representation is that we introduce two additional dimensions to represent both time and lead-lag relationships. Experiments on both synthetic and real datasets show that our method is more effective than methods based on traditional clustering techniques and achieves better accuracy. A case study was conducted to demonstrate the usefulness of our method in helping users understand the flow of ideas between different user groups on social media.

## Introduction

As stated in Webster's Third Edition, an idea is a formulated thought or opinion. Hundreds of millions of users post their ideas on bursty events, hot topics, and personal lives on social media. Users from different social groups tend to interact with each other on different ideas. For example, Democrats and Republicans often interact and communicate on ideas of interest on Twitter. One idea concerns the bipartisan senate immigration bill, which can be described by the words "immigration, bill, act, law, bipartisan." As in this example, a set of words is often used to represent an idea (Blei and Lafferty 2006).

Frequently, ideas flow between different user groups on social media, reflecting social interaction and influence (Pentland 2014). These flows facilitate the transfer of information, opinions, and thoughts from group to group. For instance, idea flows between Democrats and Republicans disclose their leadership and impact on different issues ranging from presidential approval ratings and immigration to health care and women's equality. For the immigration idea, Democrats led the discussion since they called for the new immigration bill. However, most Republican voters were concerned that an immigration bill might not solve border security problems. As the two sides argued and debated with each other, the immigration idea flowed between Democrats and Republicans. For more details about this example, please refer to the case study in the evaluation section. In many applications, it is desirable to track such an idea flow and its lead-lag relationships between different groups (Liu et al. 2015; Wu et al. 2013). For this reason, the study of idea (influence) propagation has received a great deal of attention (Leskovec, Backstrom, and Kleinberg 2009; Myers, Zhu, and Leskovec 2012; Shaparenko and Joachims 2007; Wu et al. 2014; Yu et al. 2015).
Existing methods range from link-based and content-based to hybrid approaches (Nallapati et al. 2008). Although these methods have been successful in analyzing individual users and their connections, as well as the paths through which information propagates, less attention has been paid to studying how ideas correlate social groups and interact with each other over time.

The goal of our work is to identify an idea as a word cluster and track the lead-lag relationships between word clusters of different user groups. To achieve this goal, we first derive an augmented bipartite word graph based on the correlations and lead-lag relationships between words. Each word is represented by a time series, which encodes its term frequency change over time. Since the correlation between two words can be irregular and arbitrary over time, we provide a way to automatically identify the time period in which two words are correlated. Specifically, we use dynamic time warping (DTW) to align two time series under the monotonic and slope constraint conditions of different time points (Sakoe and Chiba 1978). Then, we employ Bayesian conditional cointegration (BCC) (Bracegirdle and Barber 2012) to discover the local correlation between the two time series. After applying BCC, we can determine whether two words have a lead or lag relationship at a particular time point. Consequently, we formulate the augmented bipartite graph as a tensor representation. In contrast to traditional time-dependent data analysis using tensors, which employs one additional dimension to represent time information (Sun, Tao, and Faloutsos 2006; Sun et al. 2008), we introduce two additional dimensions to represent both time and lead-lag relationships. Moreover, we automatically discover ideas that are represented by clusters of words by factorizing the 4-order tensor. For a certain pair of ideas, we further apply tensor factorization to cluster the time points, which can be used to segment the time series and identify the lead and lag periods between the ideas.

Figure 1: Basic idea of the idea flow model: (a) each edge is represented by a correlation vector c = [c1, ..., cT] and a lead-lag vector Δt = [Δt1, ..., ΔtT]. T is the number of time points, and ck, Δtk are the correlation value and lead-lag time between two words at the kth time point; (b) words are aggregated into ideas and edges are aggregated into flows.

To demonstrate the effectiveness of our approach, two experiments and a case study were conducted. First, we used a synthetic dataset, in which the ground truth about word clusters and their lead-lag information is known, to evaluate how the algorithm performs at different noise levels. Second, we used ten time series benchmark datasets to show that our algorithm improves time series clustering quality by introducing local lead-lag information. Third, a case study was conducted based on a set of tweets posted by 514 members of the 113th U.S. Congress in 2013. The aim is to demonstrate how the ideas from different social groups interact with each other.
## Algorithm Overview

Suppose each word is tracked by a time series that encodes its frequency change over time. The lead and lag relationships between ideas of different user groups can then be derived from the temporal correlations between words. As shown in Fig. 1, we first extract the correlation and lead-lag relationship between words. An augmented bipartite graph G = (V, E) is used to encode all the correlations and lead-lag relationships between the two groups of words. Second, we discover ideas that are represented by clusters of words, as well as their flows between different user groups. The two steps are detailed below.

**Augmented bipartite graph construction.** In the first step, we calculate correlations between words to construct the augmented bipartite graph. Generally, correlations between words can be irregular and arbitrary over time for two reasons: 1) the correlation between two words may change over time; 2) there are lead-lag relationships between the correlated parts (Fig. 2(a)). BCC is able to identify the time periods in which two time series are correlated, but it is not able to detect lead-lag relationships. To solve this problem, we developed an improved BCC algorithm that incorporates DTW to align two time series and detect the lead-lag relationships. With these correlations and lead-lag relationships, an augmented bipartite graph is built, in which V = V_A ∪ V_B represents the words that belong to user group A or B. Each edge in E is represented by two vectors: 1) a correlation vector c = [c1, ..., cT], where T is the number of time points and ck = 1 (ck = 0) means v(i)_A is correlated (not correlated) with v(j)_B at the kth time point; 2) a lead-lag vector Δt = [Δt1, ..., ΔtT], where Δtk is the lead-lag time between the two words at time k.

Figure 2: An example of applying the DTW alignment: (a) before applying DTW; (b) after applying DTW.

**Tracking idea flows.** In this step, ideas are derived by partitioning the augmented bipartite graph into word clusters. The key challenge is to partition the augmented bipartite graph, in which each edge is represented by two vectors instead of a real number. To solve this problem, we model the augmented bipartite graph as a tensor, which uses additional dimensions to represent time and lead-lag relationships. We then apply tensor factorization techniques to extract a feature vector for each word and cluster the words based on these feature vectors. According to the clustering results, words are aggregated into ideas and edges are aggregated into flows (Fig. 1(b)). For each pair of ideas, we further identify their lead and lag periods by clustering the time points using tensor-based techniques.
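To make the edge representation concrete, the following is a minimal Python sketch (not the authors' code) of how an augmented bipartite graph with per-edge correlation and lead-lag vectors could be stored. The names `Edge` and `AugmentedBipartiteGraph` are illustrative assumptions, not identifiers from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple
import numpy as np

@dataclass
class Edge:
    """Edge between a word in group A and a word in group B.

    c[k]  = 1 if the two words are correlated at time point k, else 0.
    dt[k] = lead-lag time (in time points) at time point k; only
            meaningful where c[k] == 1.
    """
    c: np.ndarray    # shape (T,), values in {0, 1}
    dt: np.ndarray   # shape (T,), signed integers

@dataclass
class AugmentedBipartiteGraph:
    words_a: List[str]
    words_b: List[str]
    # edges keyed by (index of word in A, index of word in B)
    edges: Dict[Tuple[int, int], Edge] = field(default_factory=dict)

# Toy example: word 0 in group A leads word 0 in group B by 2 time points
# during time points 3..5 of a T = 10 series.
T = 10
c = np.zeros(T, dtype=int); c[3:6] = 1
dt = np.zeros(T, dtype=int); dt[3:6] = 2
g = AugmentedBipartiteGraph(words_a=["immigration"], words_b=["border"],
                            edges={(0, 0): Edge(c=c, dt=dt)})
```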
## Augmented Bipartite Graph Construction

This section introduces our algorithm for calculating correlations and lead-lag relationships between words of different groups by combining BCC with DTW. Based on these correlations, we then derive the augmented bipartite word graph. Suppose we have two time series x1:T and y1:T for two words that belong to different user groups. The goal is to extract the correlation vector c = [c1, ..., cT] and the lead-lag vector Δt = [Δt1, ..., ΔtT]. In particular, our algorithm consists of the following two steps.

First, DTW is employed to calculate [Δt1, ..., ΔtT]. DTW is a widely used dynamic programming algorithm that aligns two time series under the monotonic and slope constraint conditions of different time points (Sakoe and Chiba 1978). We choose DTW because it is both flexible and efficient. An example of applying DTW is shown in Fig. 2. Before applying DTW, there are misalignments between the correlated parts of x1:T and y1:T (Fig. 2(a)). After applying DTW, the misaligned parts are successfully aligned (Fig. 2(b)). Δtk is calculated based on this alignment: if xk is aligned to yl, we set Δtk = l − k.

Second, we derive [c1, ..., cT] by using BCC to examine the cointegration between the aligned time series x′1:T and y′1:T. Cointegration describes a relationship between time series in which some linear combination of them is stationary. Specifically, if x′k and y′k are cointegrated in a given time period, we have y′k = α + βx′k + εk, where α is a constant, β is the linear regression coefficient, and [ε1, ε2, ..., εT] follows a mean-reverting, stationary process. Compared with typical measures such as the Pearson correlation and Spearman correlation (Bluman 2012), BCC has two advantages (Bracegirdle and Barber 2012). First, BCC is able to calculate both global and local correlations, while the Pearson and Spearman correlations can only detect global correlations. Second, BCC produces fewer spurious results than the classical cointegration testing approach. After applying BCC, we obtain the correlation (cointegration) status of x′1:T and y′1:T at each time point (c′). We then assign c′ to c according to the alignment relationships derived by DTW.
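As a concrete illustration of this construction step, the sketch below computes a banded DTW alignment with the classic dynamic program and derives Δt from the warping path. For the correlation vector it uses a simple windowed least-squares fit with a residual-variance check as a crude stand-in for BCC, which is far simpler than the Bayesian model of Bracegirdle and Barber (2012); the function names, window size, and threshold are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def dtw_path(x, y, tau_max=6):
    """Classic DTW on two equal-length series; returns the warping path as a
    list of (i, j) index pairs. tau_max bounds |i - j| (a Sakoe-Chiba band)."""
    T = len(x)
    D = np.full((T + 1, T + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(max(1, i - tau_max), min(T, i + tau_max) + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    path, i, j = [], T, T                      # backtrack from the corner
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def lead_lag_vector(path, T):
    """Delta t_k = l - k for the first alignment of x_k to some y_l."""
    dt = np.zeros(T, dtype=int)
    seen = set()
    for k, l in path:
        if k not in seen:
            dt[k] = l - k
            seen.add(k)
    return dt

def correlation_vector(x, y, dt, window=10, resid_tol=0.25):
    """Crude stand-in for BCC: regress the aligned y values on x in a local
    window around each time point and mark the pair as correlated when the
    normalised residual variance is small."""
    T = len(x)
    c = np.zeros(T, dtype=int)
    y_aligned = np.array([y[min(max(k + dt[k], 0), T - 1)] for k in range(T)])
    for k in range(T):
        lo, hi = max(0, k - window), min(T, k + window)
        xs, ys = x[lo:hi], y_aligned[lo:hi]
        beta, alpha = np.polyfit(xs, ys, 1)    # y' = alpha + beta * x'
        resid = ys - (alpha + beta * xs)
        if np.var(resid) < resid_tol * np.var(ys) + 1e-12:
            c[k] = 1
    return c

# Toy example: y is a shifted, scaled copy of x with noise.
rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 6 * np.pi, 200)) + 0.05 * rng.standard_normal(200)
y = np.roll(2 * x + 1, 3) + 0.05 * rng.standard_normal(200)
path = dtw_path(x, y)
dt = lead_lag_vector(path, len(x))
c = correlation_vector(x, y, dt)
```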
## Tracking Idea Flows

This section introduces the tensor representation as well as the tensor-based augmented bipartite graph partition and aggregation.

### Tensor Representation

The key challenge in partitioning the augmented bipartite word graph is that each edge is represented by two vectors rather than a real number (Fig. 1(a)). To tackle this challenge, we model the augmented bipartite word graph as a tensor, which is able to encode time and lead-lag relationships using additional dimensions.

Figure 3: Three tensor representations: (a) word matrix; (b), (c), (d) three representations that encode time and lead-lag relationships.

A straightforward tensor representation is X(1) ∈ ℝ^(NA × NB × T) (Fig. 3(b)). Here NA and NB are the numbers of words in user groups A and B, respectively. In our implementation, the user groups were identified in advance by two professors whose field is media and communications. X(1)ijk represents the lead-lag relationship between the ith word in A and the jth word in B at the kth time point. We set X(1)ijk to Δtk if the two words are correlated at the kth time point, and to ∞ if the two words are not correlated at that time point. However, representing uncorrelated information as ∞ may not be good enough, because ∞ can also mean that one word is correlated with another with an infinite time lead. Moreover, X(1) is very dense, so computation based on X(1) is very expensive.

A more reasonable solution is to extend the 3-order tensor to a 4-order representation: X(2) ∈ ℝ^(NA × NB × T × T) (Fig. 3(c)). X(2)ijkl is set to 1 if and only if the ith word in user group A is correlated with the jth word in user group B at the kth time point and l = k + Δtk. Otherwise, X(2)ijkl is set to 0. Compared with X(1), X(2) is sparse and clearly distinguishes between uncorrelated and correlated time points. However, X(2) fails to represent lead-lag time effectively. As shown in Fig. 3, for words v(i)_A and v(j)_B, the lead-lag times at the 4th and 5th time points are both 1. However, the feature vectors of X(2) at the 4th and 5th time points (grey rows) are different ([0, 0, 0, 0, 1, 0] and [0, 0, 0, 0, 0, 1]). This leads to incorrect lead-lag results along the time dimension.

To solve this issue, we redesign the 4th dimension of X(2) to encode lead-lag time more effectively. Our design is X(3) ∈ ℝ^(NA × NB × T × (2τmax+1)) (Fig. 3(d)), where τmax is the maximum allowable time deviation between two aligned time points in DTW. X(3)ijkl is set to 1 if and only if the ith word in A and the jth word in B are correlated at the kth time point and l = Δtk + τmax + 1. Otherwise, X(3)ijkl is set to 0. As shown in Fig. 3(d), the feature vectors of X(3) at the 4th and 5th time points are the same ([0, 0, 1]). As a result, the 4th and 5th time points will be clustered into one segment and we can accurately detect the lead and lag periods.

### Augmented Bipartite Graph Partition

Based on the tensor representation, our algorithm first employs tensor-based techniques to extract a feature vector for each word and then clusters the words using these feature vectors. The algorithm consists of three steps. First, we employ a tensor factorization algorithm called greedy PARAFAC (Kolda, Bader, and Kenny 2005) to factorize the tensor. This algorithm is adopted because it is efficient and is able to deal with tensors that have more than three dimensions. The algorithm yields a rank-q approximation of X(3) in the form Σ_{m=1..q} λ(m) u(m) ∘ v(m) ∘ w(m) ∘ h(m), where λ(m) ∈ ℝ, u(m) ∈ ℝ^NA, v(m) ∈ ℝ^NB, w(m) ∈ ℝ^T, and h(m) ∈ ℝ^(2τmax+1). Here u(m) ∘ v(m) ∘ w(m) ∘ h(m) is the 4-way outer product, so that (u(m) ∘ v(m) ∘ w(m) ∘ h(m))ijkl = u(m)_i v(m)_j w(m)_k h(m)_l, where u(m)_i is the ith entry of vector u(m). Then we use the factors {u(m)} and {v(m)} to extract feature vectors for the words in groups A and B, respectively. Take user group A as an example. The feature matrix U for this group is constructed as U = [u(1), u(2), ..., u(q)] ∈ ℝ^(NA × q), and the feature vector of the ith word in user group A is [Ui1, ..., Uiq]. Finally, we run a clustering algorithm such as K-means on the extracted feature vectors to detect word clusters.

### Augmented Bipartite Graph Aggregation

Based on the clustering results, we aggregate words into ideas and edges into flows (Fig. 1). To identify the lead and lag periods between ideas, we segment the time series for each idea pair by further applying tensor factorization. Take ideas I(1)_A and I(2)_B as an example. First, we extract the sub-tensor of X(3) that belongs to I(1)_A and I(2)_B. Without loss of generality, assume that the 1st to N1th words in A belong to I(1)_A and the 1st to N2th words in B belong to I(2)_B. The sub-tensor is then X(3)_{1:N1, 1:N2, 1:T, 1:(2τmax+1)} ∈ ℝ^(N1 × N2 × T × (2τmax+1)). After extracting the sub-tensor, we apply tensor factorization to it to derive a feature vector for each time point and cluster these feature vectors with a clustering algorithm such as K-means. Based on the clustering results, we divide the time points into segments. Finally, we extract the correlation between I(1)_A and I(2)_B for the kth segment (c̄k): we average ck over all word pairs and time points that belong to this segment, and set c̄k = 1 if this average is greater than a given threshold and c̄k = 0 otherwise. If c̄k = 1, the lead-lag time between the two ideas, Δt̄k, is computed by averaging Δtk.
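The sketch below illustrates, under simplifying assumptions, how X(3) can be assembled from the per-edge vectors and how word features can be obtained by a greedy sequence of rank-1 approximations (a simplified stand-in for the greedy PARAFAC of Kolda, Bader, and Kenny (2005)) before K-means clustering. It is not the authors' implementation; the helper names are illustrative, indices are 0-based, and `edges` maps (i, j) word-index pairs to (c, dt) arrays as in the earlier edge sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_x3(edges, n_a, n_b, T, tau_max):
    """Assemble the 4-order tensor X(3): entry (i, j, k, l) is 1 iff word i in
    group A and word j in group B are correlated at time k with lead-lag dt[k],
    where l = dt[k] + tau_max (0-based counterpart of l = dt_k + tau_max + 1)."""
    X = np.zeros((n_a, n_b, T, 2 * tau_max + 1))
    for (i, j), (c, dt) in edges.items():
        for k in range(T):
            if c[k] == 1 and abs(dt[k]) <= tau_max:
                X[i, j, k, dt[k] + tau_max] = 1.0
    return X

def rank1_approx(X, n_iter=30):
    """Alternating power iterations for a rank-1 term of a 4-way tensor."""
    u = np.random.rand(X.shape[0]); v = np.random.rand(X.shape[1])
    w = np.random.rand(X.shape[2]); h = np.random.rand(X.shape[3])
    for _ in range(n_iter):
        u = np.einsum('ijkl,j,k,l->i', X, v, w, h); u /= np.linalg.norm(u) + 1e-12
        v = np.einsum('ijkl,i,k,l->j', X, u, w, h); v /= np.linalg.norm(v) + 1e-12
        w = np.einsum('ijkl,i,j,l->k', X, u, v, h); w /= np.linalg.norm(w) + 1e-12
        h = np.einsum('ijkl,i,j,k->l', X, u, v, w)
        lam = np.linalg.norm(h); h /= lam + 1e-12
    return lam, u, v, w, h

def word_features(X, q):
    """Greedy deflation: extract q rank-1 terms and stack the u/v factors as
    word features for groups A and B (shapes (NA, q) and (NB, q))."""
    X = X.copy()
    U, V = [], []
    for _ in range(q):
        lam, u, v, w, h = rank1_approx(X)
        U.append(u); V.append(v)
        X -= lam * np.einsum('i,j,k,l->ijkl', u, v, w, h)  # deflate
    return np.column_stack(U), np.column_stack(V)

# Usage sketch: cluster each group's words into ideas from the factor features.
# U, V = word_features(build_x3(edges, n_a, n_b, T, tau_max), q=10)
# ideas_a = KMeans(n_clusters=5, n_init=10).fit_predict(U)
# ideas_b = KMeans(n_clusters=5, n_init=10).fit_predict(V)
```

The same machinery can be reused for the aggregation step: extract the sub-tensor of an idea pair, factorize it, and run K-means on the time-mode factors (the w vectors) to segment the time points.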
## Evaluation

We conducted two experiments and a case study to illustrate the effectiveness and usefulness of our idea flow tracking method. All the experiments were conducted on a workstation with an Intel Xeon E5-2630 CPU (2.4 GHz) and 64 GB of memory.

### Experiment on Synthetic Data

In this experiment, we used synthetic datasets to show how our algorithm performs at different noise levels. We also compared the accuracy and efficiency of the three tensor representations.

**Experimental Settings.** Synthetic datasets with five different noise levels were generated (noise level L ∈ {0, 0.2, ..., 0.8}). For each L, we generated 50 synthetic datasets and averaged the results over these datasets. For each dataset, we first generated ideas and the lead-lag relationships between ideas (Fig. 1(b)), which contain c̄k and Δt̄k. Then we generated the augmented bipartite word graph (Fig. 1(a)) according to the noise level L. Here L denotes the probability that ck = 0 when c̄k = 1; the larger L is, the more difficult it is to recover accurate idea flows from the augmented bipartite word graph. In the synthetic datasets, the number of ideas in each user group varied from 2 to 6, the number of words in each idea varied from 10 to 30, the lead-lag time varied from -6 to 6, and the number of time points in the lead or lag periods varied from 20 to 40, while T was set to 200. We applied our algorithm with the different tensors (X(1), X(2), and X(3)) to the datasets and reported the accuracy and efficiency.

**Criteria.** We evaluated the accuracy of our algorithm based on five criteria. To evaluate the accuracy in detecting lead-lag relationships between ideas, we used three F1-measures: Flow F1, Flow Lead F1, and Flow Lead Time F1. Flow F1 measures the accuracy of our algorithm in estimating c̄k. Flow Lead F1 is more rigorous than Flow F1: it measures not only the accuracy of c̄k, but also the accuracy of our algorithm in detecting the lead or lag status (i.e., whether Δt̄k > 0, Δt̄k = 0, or Δt̄k < 0). Flow Lead Time F1 is even more rigorous than Flow Lead F1, measuring both c̄k and Δt̄k. A result with larger F1-measures is better. In addition to the F1-measures, we used the Mean Squared Error (MSE) to evaluate the effectiveness of our algorithm in detecting the lead-lag time Δt̄k; smaller MSE values indicate better quality. Moreover, a widely used measure of clustering quality, Normalized Mutual Information (NMI) (Strehl and Ghosh 2003), was utilized to measure the accuracy of idea identification. A larger NMI value indicates more accurate idea identification.
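As a rough illustration of these criteria (not the authors' evaluation code), the snippet below computes a flow-level F1, a stricter lead-status F1, the MSE of lead-lag times on flows detected by both ground truth and prediction, and NMI for word clusters. The exact matching rules (e.g., how lead status is compared) are assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import f1_score, normalized_mutual_info_score

def flow_f1(c_true, c_pred):
    """F1 on the binary flow indicator c-bar (flow present or not)."""
    return f1_score(c_true.ravel(), c_pred.ravel())

def flow_lead_f1(c_true, dt_true, c_pred, dt_pred):
    """Stricter F1: a predicted flow only counts as a true positive when the
    lead/lag status (sign of the lead-lag time) is also correct."""
    match = (np.sign(dt_true) == np.sign(dt_pred))
    return f1_score(c_true.ravel(), (c_pred & match).astype(int).ravel())

def lead_lag_mse(c_true, dt_true, c_pred, dt_pred):
    """MSE of the lead-lag time, measured where both the ground truth and
    the prediction report a flow."""
    both = (c_true == 1) & (c_pred == 1)
    return float(np.mean((dt_true[both] - dt_pred[both]) ** 2)) if both.any() else 0.0

def idea_nmi(labels_true, labels_pred):
    """Clustering quality of the detected ideas (word clusters)."""
    return normalized_mutual_info_score(labels_true, labels_pred)
```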
Figure 4: Comparison of the three tensors (X(1), X(2), and X(3)) in terms of accuracy and efficiency at different noise levels: (a) Flow F1, (b) Flow Lead F1, (c) Flow Lead Time F1, (d) lead-lag time MSE, (e) word clustering NMI, and (f) running time versus noise level.

**Results.** Fig. 4 compares the three tensors in terms of accuracy and efficiency at different noise levels. The following observations can be made from the results.

**Accuracy.** Overall, X(3) has the best performance in terms of accuracy. This demonstrates that, by using two additional dimensions to encode both ck and Δtk, X(3) is less sensitive to noise than X(1). Although X(2) is comparable to X(3) in terms of NMI, it is worse in terms of the F1-measures and MSE. This indicates that X(2) is accurate in identifying ideas, but not very accurate in detecting lead-lag relationships between ideas, owing to its deficiency in clustering the time points and detecting lead and lag periods.

**Efficiency.** As shown in Fig. 4(f), X(1) is much slower than X(2) and X(3), because X(2) and X(3) are much sparser than X(1). The computation times for X(2) and X(3) are comparable, and both are relatively efficient with a maximum computation time of 12 seconds. Because X(3) is both accurate and efficient, we used this tensor representation in the following experiments.

### Experiment on Real Data

In this experiment, we use real-world time series benchmark datasets to demonstrate that our algorithm can improve the quality of time series clustering by incorporating local lead-lag information.

**Experimental Settings.** The experiment was conducted on ten real-world time series datasets in the UCR archive (Chen et al. 2015), in which the class label of each time series is known. The ten datasets include cases where the shapes of the time series are quite similar as well as cases where the shapes are different. Table 1 shows the summary statistics of the UCR time series benchmark datasets used in the experiment. NMI was utilized to measure the clustering quality. We performed a grid search on τmax ({1, 2, ..., 10}) and chose the parameter with the largest NMI value (τmax = 6). To reduce any bias caused by initial cluster assignments, we ran each experiment 100 times with different random seeds and report the mean and standard deviation.

**Baselines.** Three baselines were used in this experiment. The first baseline (B1) was K-means (Hartigan and Wong 1979) and the second baseline (B2) was spectral clustering (Chen et al. 2011). In B1, we utilized the y-values of the time series to form feature vectors. In B2, we employed the KDTW kernel (Marteau and Gibet 2015) to calculate similarities between time series. The third baseline (B3) was similar to our algorithm, except that it does not utilize DTW to calculate local lead-lag times. Instead, it calculates a global lead-lag time by moving one time series forward and backward along the time axis to find the time shift that results in the highest correlation.

**Results.** Table 2 compares our tensor-based clustering algorithm with the three baselines in terms of NMI. For B3, several experiments were unable to return results within 20 days; these experiments are marked with a \. As shown in the table, our algorithm performs better than the baselines on all datasets except one (Fish). The major difference between our algorithm and the baselines is that the baselines do not consider local lead-lag relationships during the clustering process: B1 and B2 do not employ any lead-lag measure, and B3 calculates a global lead-lag time instead of local lead-lag times. This indicates that clustering quality can be improved by using local lead-lag relationships. We also examined why our algorithm is worse than B1 and B2 on the Fish dataset. We found that time series with similar shapes but different amplitudes are labelled as different classes in this dataset. Since B1 and B2 use the absolute y-values of the time series, they are more accurate in distinguishing such time series. In text data, however, the time series of words that have similar shapes but different amplitudes are frequently regarded as correlated. As a result, our algorithm is more appropriate for processing textual data.
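To clarify the difference between B3's single global lead-lag time and the local lead-lag times used by our method, here is a small sketch of how such a global baseline might be implemented (an assumption for illustration, not the authors' code): it searches for the single shift that maximizes the Pearson correlation between two series.

```python
import numpy as np

def global_lead_lag(x, y, max_shift=6):
    """Return the single shift (in time points) of y relative to x that
    maximizes the Pearson correlation over the overlapping region.
    A positive shift means x leads y."""
    best_shift, best_corr = 0, -np.inf
    for s in range(-max_shift, max_shift + 1):
        if s >= 0:
            xs, ys = x[: len(x) - s], y[s:]
        else:
            xs, ys = x[-s:], y[: len(y) + s]
        # skip degenerate (too short or constant) overlaps
        if len(xs) > 2 and np.std(xs) > 0 and np.std(ys) > 0:
            corr = np.corrcoef(xs, ys)[0, 1]
            if corr > best_corr:
                best_corr, best_shift = corr, s
    return best_shift, best_corr

# Example: y lags x by 3 time points, so the best shift should be 3.
x = np.sin(np.linspace(0, 6 * np.pi, 200))
y = np.concatenate([np.zeros(3), x[:-3]])
shift, corr = global_lead_lag(x, y)
```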
| Statistic | Coffee | L7 | Trace | Fish | OSU Leaf | SC | Strawberry | FA | CC | Wafer |
|---|---|---|---|---|---|---|---|---|---|---|
| K | 2 | 7 | 4 | 7 | 6 | 6 | 2 | 14 | 3 | 2 |
| N | 56 | 143 | 200 | 350 | 442 | 600 | 983 | 2250 | 4307 | 7174 |
| T | 286 | 319 | 275 | 463 | 427 | 60 | 235 | 131 | 166 | 152 |

Table 1: Summary statistics of the 10 UCR time series datasets. Here K represents the number of classes, N is the number of time series, and T denotes the number of time points. The dataset names L7, SC, FA, and CC are abbreviations for Lightning-7, Synthetic Control, Face (All), and Chlorine Concentration, respectively.

| Method | Coffee | L7 | Trace | Fish | OSU Leaf | SC | Strawberry | FA | CC | Wafer |
|---|---|---|---|---|---|---|---|---|---|---|
| B1 | 0.00 ± 0.00 | 0.43 ± 0.02 | 0.53 ± 0.02 | 0.29 ± 0.03 | 0.22 ± 0.02 | 0.78 ± 0.03 | 0.12 ± 0.00 | 0.37 ± 0.02 | 0.01 ± 0.00 | 0.00 ± 0.00 |
| B2 | 0.54 ± 0.00 | 0.08 ± 0.00 | 0.45 ± 0.00 | 0.53 ± 0.02 | 0.02 ± 0.00 | 0.59 ± 0.01 | 0.13 ± 0.00 | 0.24 ± 0.00 | 0.01 ± 0.00 | 0.04 ± 0.00 |
| B3 | 0.48 ± 0.00 | 0.22 ± 0.03 | 0.54 ± 0.00 | 0.12 ± 0.00 | 0.13 ± 0.00 | 0.46 ± 0.06 | \ | \ | \ | \ |
| Ours | 0.64 ± 0.00 | 0.44 ± 0.02 | 0.64 ± 0.00 | 0.16 ± 0.01 | 0.26 ± 0.00 | 0.80 ± 0.03 | 0.32 ± 0.00 | 0.39 ± 0.09 | 0.02 ± 0.00 | 0.19 ± 0.00 |

Table 2: Comparison of our tensor-based clustering method with the three baselines in terms of NMI on the 10 UCR time series benchmark datasets.

### Case Study on Twitter Data

To demonstrate the usefulness of our algorithm in detecting idea flows, we conducted a case study with a professor who specializes in media and communications. The dataset contains 156,501 tweets posted by 514 members of the 113th U.S. Congress from Apr. 9 to Oct. 13, 2013. The professor and a second professor in the same field manually divided these members into two groups: 243 Democrats and 271 Republicans. We segmented the data into 94 time points (every 2 days). Stopwords and rare words that occurred fewer than 5 times per day on average were removed to reduce noise. We extracted 15 ideas for each group.

First, we checked the global lead-lag relationships between all ideas. We found 119 lead-lag relationships between pairs of ideas. Among these pairs, 62 (52%) were led by Democrats and 57 (48%) were led by Republicans. This result indicates that Democrats marginally outperformed Republicans in taking the lead at that time. The professor confirmed our findings. He said that President Obama (a Democrat) started to realize the political importance of social media after he was re-elected as President in 2012, and Democrats subsequently put more effort into social media. Because the president was a Democrat and the Democratic Party had a Senate majority, Democrats had an advantage over Republicans in leading public opinion.

We next examined the ideas that were led by Democrats and Republicans, respectively. We ranked the ideas by the number of time points at which they led other ideas. The top three ideas for Democrats are shown in Table 3.

| Rank | Idea words |
|---|---|
| 1 | immigration, bill, act, law, bipartisan |
| 2 | military, join, service, honor, nation |
| 3 | women, american, succeed, job, equality |

Table 3: Top three ideas led by Democrats.

The professor immediately identified them as the bipartisan senate immigration bill, honoring military service members, and women's equality. By checking related tweets, the professor found that Democrats took the lead on these ideas because they were more supportive of them. Take the bipartisan senate immigration bill idea as an example.
Because Democrats supported this bill, they kept posting the latest information about it and hosted new activities ("House leaders should allow debate & vote on Bipartisan Senate Immigration Bill. It's not perfect but we need reform. http://t.co/btKDxblbQf"). Republicans lagged because they mainly responded by disapproving of Democrats' ideas. One major concern was that the immigration bill might not solve border security problems ("The immigration bill offers false promises about border security and enforcement measures."). The professor commented that this kind of interaction is common between Democrats and Republicans.

Next, we checked the ideas led by Republicans. As shown in Table 4, the top three were tax reform, Republicans' plan to create jobs, and the health care vote in Congress (IR).

| Rank | Idea words |
|---|---|
| 1 | tax, economy, reform, family, icymi |
| 2 | job, plan, student, school, program |
| 3 | obamacare, senate, gop, vote, congress |

Table 4: Top three ideas led by Republicans.

While Republicans were supportive of tax reform and plans to create jobs, they did not support Obama's health care plan (Obamacare). To understand why IR was led by Republicans, we examined the correlated ideas from Democrats. Among them, the idea with the largest lag time (ID) was exemplified by the words gov, obamacare, debt, capitol, and pay, all of which are related to Obamacare. To learn why IR led ID, we further checked the local lead-lag relationships between the two ideas.

Figure 5: Local lead-lag relationships between ideas in different user groups (Republican idea IR: obamacare, senate, gop, vote, congress; Democrat idea ID: gov, obamacare, debt, capitol, pay).

As shown in Fig. 5, each idea is represented by a colored stripe and the x-axis represents time. The width of a stripe changes over time, encoding the temporal hotness of the idea; the hotness is measured by the number of words that talk about this idea at each time point. The local lead-lag relationships are visualized by links that connect correlated time points. Fig. 5 shows that the main lead period for idea IR is from Jul. 31 to Aug. 20 (Fig. 5A). By checking the two peaks (Fig. 5B and Fig. 5C) on the stripe during this time period, we found that Republicans led this idea because they were very active in opposing the health care plan. On Aug. 2 (Fig. 5B), they voted to repeal Obamacare for the 40th time. Around Aug. 13 (Fig. 5C), they posted many tweets criticizing the fact that a key consumer protection in the health care plan was delayed ("Yet another delay for Obamacare via @CBSNews - Key consumer protection in Obamacare delayed http://t.co/0brYxzjUj7").

## Related Work

Our work is relevant to influence analysis between documents. Previous research can be categorized into three groups: content-based, link-based, and hybrid methods. Content-based methods use the content of documents and their temporal changes to analyze influence (Cui et al. 2011b; Liu et al. 2012; Shaparenko and Joachims 2007; Zhang et al. 2010). Shaparenko and Joachims assumed that the language model of a new document is a mixture of the language models of all previously published documents (Shaparenko and Joachims 2007). They adopted a likelihood ratio test to detect the influence between documents. However, their method cannot model influence at the word level. To learn influence at the word/phrase level, MemeTracker (Leskovec, Backstrom, and Kleinberg 2009) employed a graph-based method to cluster variants of phrases and derive the corresponding threads in the news cycle.
To link two given documents with a coherent chronological chain, Shahaf and Guestrin (2010) employed a bipartite graph between words and documents and utilized the graph to model the influence between two documents. Compared with this work, our method builds an augmented bipartite graph between the words used by two user groups by employing the improved BCC technique. Each bipartite link is represented by a set of correlation values and lead-lag times between two words at different times instead of a single weight. As a result, we can model the influence between groups from a global overview down to local temporal details.

Link-based methods use links (e.g., citations or follower-followee relationships) in a network to analyze influence. PageRank (Brin and Page 1998) and its variations (Haveliwala 2002; Ma, Guan, and Zhao 2008) are typical examples of such methods. Researchers have also studied the problem of influence maximization in a network (Feng et al. 2014; Kempe, Kleinberg, and Tardos 2003; Ohsaka et al. 2014). However, link-based methods need explicit links in the data to model influence, and such links may not be available in many real-world datasets, such as news articles. To solve this issue, researchers have developed hybrid methods that use both links and content to analyze influence (Cui et al. 2011a; El-Arini and Guestrin 2011; Gomez Rodriguez, Leskovec, and Krause 2010). For example, given the topics discussed in a network, Tang et al. (2009) constructed a factor graph to compute topic-level influence in a large network. To model influence as well as topics in a corpus, researchers have proposed several topic models that incorporate links between documents (Dietz, Bickel, and Scheffer 2007; Erosheva, Fienberg, and Lafferty 2004; Gerrish and Blei 2010; Guo et al. 2010; 2014; Liu et al. 2010; Nallapati et al. 2008; Nallapati and Cohen 2008). Influence can also come from outside the network: Myers et al. (2012) developed an information diffusion model that integrates both the internal and external influence in a network. The major issue with hybrid methods is that they model influence at the document, user, or topic level. In contrast, our method aims at detecting ideas at the word level and studying how the ideas correlate social groups and interact with each other over time.

## Conclusions and Future Work

In this paper, we have aimed to help users understand how ideas propagate between different groups. To this end, we first derive an augmented bipartite word graph based on the correlations and lead-lag relationships between words. The major feature of the augmented bipartite graph is that each edge is represented by two vectors: a correlation vector and a lead-lag vector. The two vectors, along with the augmented bipartite graph, are formulated as a tensor representation. Next, by factorizing the tensor, we automatically discover ideas that are represented by clusters of words. For a certain pair of ideas, we further apply tensor factorization to cluster the time points, which can be used to segment the time series and identify the lead and lag periods between the ideas. Finally, our evaluation demonstrates that the developed approach is generally more effective than the baseline methods.

One interesting direction for future work is exploring a wider range of text data representations for better tracking idea flows on social media. Text data is typically treated as bag-of-words.
In this paper, we follow this paradigm since the representation is simple and easy to understand. There are advanced techniques that can convert unstructured text into a network with typed entity and relation information (Wang et al. 2015a; 2015b; 2016), which can be less ambiguous than using bag-of-words alone. Integrating these techniques with our model is an interesting topic for further study. In addition, we would like to study the lead-lag relationships within each user group and combine them with the lead-lag relationships between different user groups to better model idea flows on social media.

## Acknowledgements

We would like to thank Mengchen Liu for valuable contributions on related work and the model formulation. This research was supported by the National Key Technologies R&D Program of China (2015BAF23B03) and a Microsoft Research Fund (No. FY15-RES-OPP-112).

## References

Blei, D. M., and Lafferty, J. D. 2006. Dynamic topic models. In ICML, 113-120.
Bluman, A. G. 2012. Elementary Statistics: A Step by Step Approach. McGraw-Hill.
Bracegirdle, C., and Barber, D. 2012. Bayesian conditional cointegration. In ICML, 1095-1102.
Brin, S., and Page, L. 1998. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems 30(1):107-117.
Chen, W.-Y.; Song, Y.; Bai, H.; Lin, C.-J.; and Chang, E. Y. 2011. Parallel spectral clustering in distributed systems. PAMI 33(3):568-586.
Chen, Y.; Keogh, E.; Hu, B.; Begum, N.; Bagnall, A.; Mueen, A.; and Batista, G. 2015. The UCR time series classification archive. www.cs.ucr.edu/~eamonn/time_series_data/.
Cui, P.; Wang, F.; Yang, S.; and Sun, L. 2011a. Item-level social influence prediction with probabilistic hybrid factor matrix factorization. In AAAI.
Cui, W.; Liu, S.; Tan, L.; Shi, C.; Song, Y.; Gao, Z.; Qu, H.; and Tong, X. 2011b. TextFlow: Towards better understanding of evolving topics in text. TVCG 17(12):2412-2421.
Dietz, L.; Bickel, S.; and Scheffer, T. 2007. Unsupervised prediction of citation influences. In ICML, 233-240.
El-Arini, K., and Guestrin, C. 2011. Beyond keyword search: Discovering relevant scientific literature. In KDD, 439-447.
Erosheva, E.; Fienberg, S.; and Lafferty, J. 2004. Mixed-membership models of scientific publications. PNAS 101(suppl 1):5220-5227.
Feng, S.; Chen, X.; Cong, G.; Zeng, Y.; Chee, Y. M.; and Xiang, Y. 2014. Influence maximization with novelty decay in social networks. In AAAI.
Gerrish, S., and Blei, D. M. 2010. A language-based approach to measuring scholarly impact. In ICML, 375-382.
Gomez Rodriguez, M.; Leskovec, J.; and Krause, A. 2010. Inferring networks of diffusion and influence. In KDD, 1019-1028.
Guo, Z.; Zhu, S.; Zhang, Z.; Chi, Y.; and Gong, Y. 2010. A topic model for linked documents and update rules for its estimation. In AAAI.
Guo, Z.; Zhang, Z. M.; Zhu, S.; Chi, Y.; and Gong, Y. 2014. A two-level topic model towards knowledge discovery from citation networks. TKDE 26(4):780-794.
Hartigan, J. A., and Wong, M. A. 1979. Algorithm AS 136: A k-means clustering algorithm. Applied Statistics 100-108.
Haveliwala, T. H. 2002. Topic-sensitive PageRank. In WWW, 517-526.
Kempe, D.; Kleinberg, J.; and Tardos, E. 2003. Maximizing the spread of influence through a social network. In KDD, 137-146.
Kolda, T.; Bader, B.; and Kenny, J. 2005. Higher-order web link analysis using multilinear algebra. In ICDM, 242-249.
Leskovec, J.; Backstrom, L.; and Kleinberg, J. 2009. Meme-tracking and the dynamics of the news cycle. In KDD, 497-506.
Liu, L.; Tang, J.; Han, J.; Jiang, M.; and Yang, S. 2010. Mining topic-level influence in heterogeneous networks. In CIKM, 199-208.
Liu, S.; Zhou, M. X.; Pan, S.; Song, Y.; Qian, W.; Cai, W.; and Lian, X. 2012. TIARA: Interactive, topic-based visual text summarization and analysis. ACM TIST 3(2).
Liu, S.; Chen, Y.; Wei, H.; Yang, J.; Zhou, K.; and Drucker, S. M. 2015. Exploring topical lead-lag across corpora. IEEE TKDE 27(1):115-129.
Ma, N.; Guan, J.; and Zhao, Y. 2008. Bringing PageRank to the citation analysis. Information Processing & Management 44(2):800-810.
Marteau, P.-F., and Gibet, S. 2015. On recursive edit distance kernels with application to time series classification. TNNLS 26(6):1121-1133.
Myers, S. A.; Zhu, C.; and Leskovec, J. 2012. Information diffusion and external influence in networks. In KDD, 33-41.
Nallapati, R. M., and Cohen, W. W. 2008. Link-PLSA-LDA: A new unsupervised model for topics and influence of blogs. In ICWSM.
Nallapati, R. M.; Ahmed, A.; Xing, E. P.; and Cohen, W. W. 2008. Joint latent topic models for text and citations. In KDD, 542-550.
Ohsaka, N.; Akiba, T.; Yoshida, Y.; and Kawarabayashi, K.-i. 2014. Fast and accurate influence maximization on large networks with pruned Monte-Carlo simulations. In AAAI.
Pentland, A. P. 2014. Social Physics: How Good Ideas Spread - The Lessons from a New Science. Penguin Press.
Sakoe, H., and Chiba, S. 1978. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 26:43-49.
Shahaf, D., and Guestrin, C. 2010. Connecting the dots between news articles. In KDD, 623-632.
Shaparenko, B., and Joachims, T. 2007. Information genealogy: Uncovering the flow of ideas in non-hyperlinked document databases. In KDD, 619-628.
Strehl, A., and Ghosh, J. 2003. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. JMLR 3:583-617.
Sun, J.; Tao, D.; Papadimitriou, S.; Yu, P. S.; and Faloutsos, C. 2008. Incremental tensor analysis: Theory and applications. TKDD 2(3).
Sun, J.; Tao, D.; and Faloutsos, C. 2006. Beyond streams and graphs: Dynamic tensor analysis. In KDD, 374-383.
Tang, J.; Sun, J.; Wang, C.; and Yang, Z. 2009. Social influence analysis in large-scale networks. In KDD, 807-816.
Wang, C.; Song, Y.; El-Kishky, A.; Roth, D.; Zhang, M.; and Han, J. 2015a. Incorporating world knowledge to document clustering via heterogeneous information networks. In KDD, 1215-1224.
Wang, C.; Song, Y.; Li, H.; Zhang, M.; and Han, J. 2015b. KnowSim: A document similarity measure on structured heterogeneous information networks. In ICDM.
Wang, C.; Song, Y.; Li, H.; Zhang, M.; and Han, J. 2016. Text classification with heterogeneous information network kernels. In AAAI.
Wu, F.; Song, Y.; Liu, S.; Huang, Y.; and Liu, Z. 2013. Lead-lag analysis via sparse co-projection in correlated text streams. In CIKM, 2069-2078.
Wu, Y.; Liu, S.; Yan, K.; Liu, M.; and Wu, F. 2014. OpinionFlow: Visual analysis of opinion diffusion on social media. IEEE TVCG 20(12):1763-1772.
Yu, L.; Cui, P.; Wang, F.; Song, C.; and Yang, S. 2015. From micro to macro: Uncovering and predicting information cascading process with behavioral dynamics. arXiv preprint arXiv:1505.07193.
Zhang, J.; Song, Y.; Zhang, C.; and Liu, S. 2010. Evolutionary hierarchical Dirichlet processes for multiple correlated time-varying corpora. In KDD, 1079-1088.