# Hierarchical Diffusion Attention Network

Zhitao Wang and Wenjie Li
Department of Computing, The Hong Kong Polytechnic University, Hong Kong
{csztwang, cswjli}@comp.polyu.edu.hk

Abstract

A series of recent studies formulated the diffusion prediction problem as a sequence prediction task and proposed several sequential models based on recurrent neural networks. However, non-sequential properties exist in real diffusion cascades, which do not strictly follow the sequential assumptions of previous work. In this paper, we propose a Hierarchical Diffusion Attention Network (HiDAN), which adopts a non-sequential framework with two-level attention mechanisms for diffusion prediction. At the user level, a dependency attention mechanism is proposed to dynamically capture historical user-to-user dependencies and extract dependency-aware user information. At the cascade (i.e., sequence) level, a time-aware influence attention is designed to infer a possible future user's dependencies on historical users by considering both inherent user importance and time-decay effects. When evaluated on three real diffusion datasets, HiDAN is significantly more effective and efficient than state-of-the-art sequential models. Further case studies illustrate that HiDAN can accurately capture diffusion dependencies.

1 Introduction

The emergence of online media has brought fundamental changes to information diffusion styles. Due to these changes, diffusion cascades are massively triggered and traced. These observed diffusion processes provide rich sources for companies to do marketing by forecasting advertisement diffusion, or for governments to maintain stability by tracking opinion diffusion. All these applications require a good understanding of diffusion mechanisms and accurate predictions of diffusion dynamics. These requirements have driven many researchers in recent years to study diffusion phenomena on online media, with a particular focus on the diffusion prediction problem [Bourigault et al., 2016; Wang et al., 2017b; 2017a].

Since retrievable information diffusion cascades are often recorded as sequences, researchers recently formulated the problem as a sequence prediction task: given the historically infected users in an information cascade, the next infected user is predicted. With the great success of recurrent neural networks (RNNs) in sequence modeling, a series of RNN-based sequential models were proposed and their effectiveness was demonstrated on real diffusion data [Du et al., 2016; Wang et al., 2017b; 2017a]. These models sequentially encode the historical information as hidden states and predict the next infected user based on the compressed states. Though a cascade takes the form of a sequence, the real diffusion process behind it does not strictly follow the sequential assumption. This is because there exists an underlying user connection graph, which may not be explicitly observed but can directly determine the diffusion dependencies among users.

Figure 1: An example of non-sequential diffusion dependency (sequential model assumption vs. the real non-sequential dependency induced by the underlying graph).
For example, given a cascade {(A, t_A), (B, t_B), (C, t_C), (D, t_D)} and the underlying graph shown in Figure 1, the sequential models assume that the infections of C and D are influenced by the hidden states h_2 (compressed information of A and B) and h_3 (compressed information of A, B and C), respectively. But in fact, C and D are directly dependent on A and B according to the graph structure. This kind of non-sequential dependency has also been identified as an important characteristic of diffusion sequences in previous work [Wang et al., 2017b]. Though the gating mechanisms in existing sequential models [Hochreiter and Schmidhuber, 1997] can selectively drop the information of B from hidden state h_2 when generating C, they also lose the dependency of D on B in hidden state h_3. The hidden states of compressed information are not expressive enough for such non-sequential diffusion dependency, and thus the prediction power is limited.

In this paper, we propose a Hierarchical Diffusion Attention neural Network (HiDAN) for the problem of predicting diffusion when the underlying graph is unknown. To capture the unique properties of diffusion sequences, we devise a non-sequential architecture with two-level attention mechanisms. Specifically, a user-level dependency attention is suggested to dynamically capture diffusion dependencies among historical users. A fusion gate is then designed to selectively integrate a user's self information and its dependency context. Based on the dependency-aware historical user information, a cascade-level influence attention, which considers both inherent importance and time-decay effects, is developed to infer the influence of historical users on potentially infected future users. The inferred influence can be interpreted as the possible dependencies of the future user on all historical users. We evaluate the proposed model against state-of-the-art sequential diffusion prediction models on three real diffusion datasets. The significantly better performance demonstrates the effectiveness of our model. Case studies on synthetic datasets further indicate that the learned dependency attentions are mostly consistent with the true underlying graph. In addition, HiDAN also shows higher efficiency than sequential models in experiments.

The main contributions of this paper are summarized as follows:

- To the best of our knowledge, we are the first to develop a non-sequential neural network framework for the diffusion prediction problem, which is well adapted to the properties of real diffusion cascades.
- We propose two-level attention mechanisms for cascade modeling, i.e., a user-level dynamic dependency attention, which effectively captures historical diffusion dependencies, and a cascade-level time-aware influence attention, which infers future dependencies by modeling users' inherent importance and time-decay effects.
- The experiments on three real datasets demonstrate the significantly improved effectiveness and efficiency of the proposed model compared with state-of-the-art approaches.

2.1 Preliminary

A diffusion cascade can be represented as $c = \{(u_1, t_1), (u_2, t_2), \ldots, (u_{|c|}, t_{|c|})\}$, where the element $(u_i, t_i)$ denotes that user $u_i$ is infected in this cascade at time $t_i$. The infected users are ordered by time, thus $t_{i-1} < t_i$.
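To make this representation concrete, here is a minimal Python sketch (illustrative only, with hypothetical variable names; not taken from the paper's implementation) of a cascade stored as a time-ordered list of (user, timestamp) pairs and of the history/next-user split used for prediction:

```python
# A cascade is a time-ordered list of (user_id, infection_time) pairs,
# e.g. the cascade {(A, t_A), (B, t_B), (C, t_C), (D, t_D)} from Figure 1.
cascade = [("A", 1.0), ("B", 2.5), ("C", 4.0), ("D", 7.5)]

# The infected users are ordered by infection time: t_{i-1} < t_i.
assert all(t_prev < t_next
           for (_, t_prev), (_, t_next) in zip(cascade, cascade[1:]))

# The history up to step i (here i = 3) is a prefix of the cascade;
# the prediction target is the user infected at step i + 1.
i = 3
history, (next_user, _) = cascade[:i], cascade[i]
print(history, "->", next_user)   # [('A', 1.0), ('B', 2.5), ('C', 4.0)] -> D
```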
Generally, the diffusion prediction problem can be formulated as follows: given the infection history $\{(u_1, t_1), \ldots, (u_i, t_i)\}$ of a cascade, the task is to predict the user $u_{i+1}$ who will be infected next. Given a training cascade set $\mathcal{C} = \{c^1, \ldots, c^M\}$, the goal is to build a model that is able to learn the conditional probability $p(u_{i+1} \mid \{(u_1, t_1), \ldots, (u_i, t_i)\})$.

2.2 The HiDAN Model

The framework of the proposed HiDAN is illustrated in Fig. 2. Initially, each user of an input cascade is embedded as a user embedding. Given the embedded user information, the user-level attention mechanism dynamically captures the diffusion dependencies between each user and its context users. A gating mechanism is developed to integrate a user's own information and his/her dependency context. Based on the dependency-aware user information, the cascade-level attention computes the influence of historical users on possible future users by capturing users' inherent importance and the time-decay effects. Given the cascade embedding constructed with the influence attention, the model then predicts the next infected user.

Figure 2: Overview of the HiDAN model. e, d, u and c correspond to the user embedding, the dependency context embedding derived by user-level attention, the dependency-aware user embedding constructed by the fusion gate, and the cascade embedding composed by cascade-level attention, respectively. The details are described in subsection 2.2.

User Embedding

At time $t_i$, the sequence of already infected users $\{u_1, \ldots, u_i\}$, ordered by infection time, is the input to the model. The raw representation of each input user $u_j \in \{u_1, \ldots, u_i\}$ is the one-hot vector of its user ID, i.e., $\mathbf{x}_j \in \mathbb{R}^{N}$, where $N$ is the total number of distinct users. To extract expressive high-level features of users, we transform the raw input $\mathbf{x}_j$ to the user embedding $\mathbf{e}_j$ via a fully-connected layer:

$$\mathbf{e}_j = f_x(\mathbf{W}_x \mathbf{x}_j + \mathbf{b}_x) \tag{1}$$

where $\mathbf{W}_x \in \mathbb{R}^{d \times N}$ and $\mathbf{b}_x \in \mathbb{R}^{d}$ are learnable parameters, $d$ is the size of the embedding, and $f_x$ is a non-linear activation function.
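As a rough sketch of Eq. (1) in plain NumPy (illustrative parameter values; ELU is assumed for $f_x$ as stated later in the experimental settings, and this is not the authors' implementation):

```python
import numpy as np

def elu(z, a=1.0):
    """Exponential Linear Unit used as the activation f_x."""
    return np.where(z > 0, z, a * (np.exp(z) - 1.0))

N, d = 1109, 64                            # number of users, embedding size (Memes-like)
rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.1, size=(d, N))   # learnable projection W_x
b_x = np.zeros(d)                          # learnable bias b_x

j = 42                                     # user ID of u_j
x_j = np.zeros(N); x_j[j] = 1.0            # one-hot raw representation x_j
e_j = elu(W_x @ x_j + b_x)                 # user embedding e_j, Eq. (1)
print(e_j.shape)                           # (64,)
```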
User-Level Dynamic Dependency Attention

This attention mechanism aims at capturing diffusion dependencies among input cascade users and extracting dependency-aware user features. The diffusion dependency describes who possibly infects whom in the diffusion process, and it possesses the following two characteristics. (1) Each cascade user can only be infected by its previous users, thus the dependency of $u_j$ only exists on $\{u_1, \ldots, u_{j-1}\}$. The previously infected users $\{u_1, \ldots, u_{j-1}\}$ are referred to as the diffusion context users of $u_j$. (2) Diffusion dependency is directional, i.e., a high dependency of $u_j$ on $u_k$ does not indicate the same level of dependency of $u_k$ on $u_j$. This is mainly because the dependency relationship is often directed in the real user graph. For example, on Twitter, a star user who is followed by millions of users frequently infects his/her followers instead of being infected by them. Therefore, we need to differentiate the roles that users play in directed dependencies. Based on the above understanding, we propose a dynamic attention mechanism to capture the diffusion dependency for each input user.

The dependency attention score between user $u_j \in \{u_1, \ldots, u_i\}$ and its context user $u_k \in \{u_1, \ldots, u_{j-1}\}$ is measured as follows:

$$\alpha_{kj} = \frac{\exp(\langle \mathbf{W}_e^c \mathbf{e}_k,\, \mathbf{W}_e^t \mathbf{e}_j \rangle)}{\sum_{l=1}^{j-1} \exp(\langle \mathbf{W}_e^c \mathbf{e}_l,\, \mathbf{W}_e^t \mathbf{e}_j \rangle)} \tag{2}$$

where $\mathbf{W}_e^c, \mathbf{W}_e^t \in \mathbb{R}^{d \times d}$ are transformation matrices for the context user and the target user respectively, and $\langle \cdot, \cdot \rangle$ represents the inner product. $\mathbf{W}_e^c$ and $\mathbf{W}_e^t$ are employed to differentiate user roles in directed dependencies. When checking whether $u_j$ is dependent on $u_k$, $u_k$ is regarded as the context user and transformed by $\mathbf{W}_e^c$, while $u_j$ is treated as the target user and transformed by $\mathbf{W}_e^t$. When checking the dependency in the opposite direction, the roles of $u_k$ and $u_j$ are reversed, thus the dependency scores differ. Similar to most attention mechanisms, we apply a softmax function to derive a probability distribution. Therefore, $\alpha_{kj}$ denotes the probability that $u_j$ is dependent on $u_k$ over all his/her context users. The useful context information of $u_j$, denoted as $\mathbf{d}_j$, is computed via the following attention-weighted sum:

$$\mathbf{d}_j = \sum_{k=1}^{j-1} \alpha_{kj} \mathbf{e}_k \tag{3}$$

The above dependency attention process for one specific user is illustrated in Fig. 3.

Figure 3: Dependency attention mechanism: an example for u_D.

Since dependency attentions and context embeddings are computed independently for each input user in the proposed non-sequential framework, we are able to parallelize the dynamic processes as matrix computation. Given an input cascade $c$ of length $l$ with the embedding matrix of its $l$ users $\mathbf{E} \in \mathbb{R}^{d \times l}$, we replace the time-consuming enumeration by using a mask matrix $\mathbf{M} \in \mathbb{R}^{l \times l}$, where each entry $M_{i,j} = 0$ if $i < j$ and $M_{i,j} = -\infty$ otherwise. Then we derive the matrix of attentions as:

$$\mathbf{A} = \mathrm{softmax}\big((\mathbf{W}_e^c \mathbf{E})^{T} (\mathbf{W}_e^t \mathbf{E}) + \mathbf{M}\big) \tag{4}$$

where $\mathbf{A} \in \mathbb{R}^{l \times l}$. Each row vector $\boldsymbol{\alpha}_j$ in $\mathbf{A}$ represents $u_j$'s attentions on its context users. The mask forces the softmax function to compute valid attentions only over $u_j$'s context users and to assign 0 to the other users (those infected later than $u_j$). The matrix of context embeddings can then be derived as $\mathbf{D} = \mathbf{A}\mathbf{E}^{T}$.

Dependency-Aware Fusion Gating

For each input cascade user $u_j$, we now have the user embedding $\mathbf{e}_j$ and his/her diffusion context embedding $\mathbf{d}_j$. To selectively integrate the important information of the two embeddings, a concise and effective fusion gating mechanism is employed. It produces a dependency-aware user representation $\mathbf{u}_j$ as follows:

$$\mathbf{g}_j = \mathrm{sigmoid}(\mathbf{W}_g^1 \mathbf{e}_j + \mathbf{W}_g^2 \mathbf{d}_j + \mathbf{b}_g) \tag{5}$$

$$\mathbf{u}_j = \mathbf{g}_j \odot \mathbf{e}_j + (1 - \mathbf{g}_j) \odot \mathbf{d}_j \tag{6}$$

where $\mathbf{W}_g^1, \mathbf{W}_g^2 \in \mathbb{R}^{d \times d}$ and $\mathbf{b}_g \in \mathbb{R}^{d}$. The gate $\mathbf{g}_j$ is used to drop unimportant parts of the user embedding $\mathbf{e}_j$ and add new information from its diffusion context embedding $\mathbf{d}_j$, such that the fused embedding $\mathbf{u}_j$ is aware of the diffusion dependency.
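Below is a minimal NumPy sketch of the masked dependency attention and the fusion gate (Eqs. (2)-(6)). Names, initialization, and the softmax axis convention are illustrative choices for this sketch, not the authors' TensorFlow code:

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d, l = 64, 4                                   # embedding size, cascade length
rng = np.random.default_rng(0)
E = rng.normal(scale=0.1, size=(d, l))         # user embeddings e_1..e_l as columns
W_c, W_t = rng.normal(scale=0.1, size=(d, d)), rng.normal(scale=0.1, size=(d, d))

# Mask matrix M: entry (k, j) is 0 when context user k precedes target user j
# (k < j), and -inf otherwise, so invalid positions receive zero attention.
M = np.where(np.triu(np.ones((l, l)), k=1) > 0, 0.0, -np.inf)

# Eq. (4): pairwise scores <W_c e_k, W_t e_j>, then a masked softmax over each
# target user's context users (normalizing each column j, as in Eq. (2)).
S = (W_c @ E).T @ (W_t @ E) + M
A = np.zeros((l, l))
A[:, 1:] = softmax(S[:, 1:], axis=0)           # the first user has no context users
D = E @ A                                      # context embeddings d_j (Eq. (3))

# Fusion gate, Eqs. (5)-(6): blend each user's own embedding with its context.
W_g1, W_g2 = rng.normal(scale=0.1, size=(d, d)), rng.normal(scale=0.1, size=(d, d))
b_g = np.zeros((d, 1))
G = sigmoid(W_g1 @ E + W_g2 @ D + b_g)
U = G * E + (1.0 - G) * D                      # dependency-aware embeddings u_j
print(A.round(2))                              # each column sums to 1 (or 0 for u_1)
```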
Cascade-Level Time-Aware Influence Attention

At the cascade level, we also consider the non-sequential dependency, in which all historical users could trigger the future infection with different probabilities. We interpret such future dependencies as the dynamic influence of historical users on the whole cascade, and propose a time-aware influence attention mechanism. Based on the inferred influence, we compose the dependency-aware embeddings $\mathbf{u}$ into cascade-level features for the final prediction. This attention mechanism captures two factors: the inherent importance of users to the cascade and the dynamic time-decay effects.

The inherent importance of $u_j$ describes how important the information in the dependency-aware embedding $\mathbf{u}_j$ is to the cascade. If only considering the inherent importance, we could define the influence with the self-attention mechanism [Lin et al., 2017] as $\langle \mathbf{w}, f_u(\mathbf{W}_u \mathbf{u}_j + \mathbf{b}_u) \rangle$. However, user influence is generally assumed to decrease as time passes, which is known as the time-decay effect. Empirical studies [Gomez Rodriguez et al., 2011] have shown that time-decay patterns in different data are not identical, thus predefining the form of the time-decay function is often impractical [Cao et al., 2017]. We therefore estimate the time-decay factor in HiDAN directly via a neural function. For $u_j \in \{(u_1, t_1), \ldots, (u_i, t_i)\}$, the elapsed time at the current step $t_i$ is $\Delta t_j = t_i - t_j$. We discretize this elapsed time as a one-hot vector $\mathrm{vec}(\Delta t_j) = \mathbf{t}_j \in \mathbb{R}^{T}$, where $t_j^n = 1$ if $t_{n-1} < t_i - t_j < t_n$. The critical time points of the discretization, such as $t_{n-1}$ and $t_n$, are defined by splitting the time range $(0, T_{\max}]$ into $T$ intervals $\{(0, t_1], \ldots, (t_{n-1}, t_n], \ldots, (t_{T-1}, T_{\max}]\}$, where $T_{\max}$ is the maximum observation time. We aim at mapping the elapsed time $\Delta t_j$ to a vector $\boldsymbol{\lambda}_j$ describing latent features of the time decay. To capture the non-linearity of the decay effect, we compute $\boldsymbol{\lambda}_j$ via the following fully-connected layer:

$$\boldsymbol{\lambda}_j = \mathrm{sigmoid}(\mathbf{W}_t \mathbf{t}_j + \mathbf{b}_t) \tag{7}$$

where $\mathbf{W}_t \in \mathbb{R}^{d \times T}$ and $\mathbf{b}_t \in \mathbb{R}^{d}$. Taking both the inherent importance and the time-decay factor into consideration, we define the following time-aware influence attention:

$$\beta_j = \frac{\exp(\langle \mathbf{w},\, \boldsymbol{\lambda}_j \odot f_u(\mathbf{W}_u \mathbf{u}_j + \mathbf{b}_u) \rangle)}{\sum_{k=1}^{i} \exp(\langle \mathbf{w},\, \boldsymbol{\lambda}_k \odot f_u(\mathbf{W}_u \mathbf{u}_k + \mathbf{b}_u) \rangle)} \tag{8}$$

where $\mathbf{W}_u \in \mathbb{R}^{d \times d}$, $\mathbf{b}_u \in \mathbb{R}^{d}$ and $\mathbf{w} \in \mathbb{R}^{d}$. The decay factor vector $\boldsymbol{\lambda}_j$ serves as a soft gate, which selectively drops the information of already infected users according to their infection times. Given the user influences $\beta_j$, this layer finally composes all dependency-aware user embeddings and constructs the cascade embedding at time $t_i$ as follows:

$$\mathbf{c}_i = \sum_{j=1}^{i} \beta_j \mathbf{u}_j \tag{9}$$

Prediction Layer

Given the cascade embedding $\mathbf{c}_i$ at time $t_i$, HiDAN predicts the probability of the next infected user over all possible users as:

$$\hat{p}(u_{i+1} \mid \mathbf{c}_i) = \mathrm{softmax}(\mathbf{W}_c \mathbf{c}_i + \mathbf{b}_c) \tag{10}$$

where $\mathbf{W}_c \in \mathbb{R}^{N \times d}$ and $\mathbf{b}_c \in \mathbb{R}^{N}$.

2.3 Model Optimization

Given the training set $\mathcal{C} = \{c^1, \ldots, c^M\}$, the learning objective is to minimize the following negative log-likelihood loss:

$$\mathcal{L} = -\sum_{m=1}^{M} \sum_{i} \log \hat{p}(u_{i+1} \mid \mathbf{c}_i^m) \tag{11}$$

where $u_{i+1}$ is the truly infected user in cascade $c^m$ at time $t_{i+1}$. The backpropagation algorithm is utilized in the training process. For parameter updating, we employ mini-batch stochastic gradient descent (SGD) and adopt the Adam optimizer [Kingma and Ba, 2015].
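To recap the cascade-level computation, the sketch below works through Eqs. (7)-(10) in plain NumPy. Values and names are illustrative, ELU is assumed for $f_u$ as in the experimental settings, and this is not the authors' implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
elu = lambda z: np.where(z > 0, z, np.exp(z) - 1.0)   # f_u (ELU, as in Sec. 3.3)

d, i, T, T_max, N = 64, 4, 40, 120.0, 1109   # dims; N distinct users (Memes-like)
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(d, i))       # dependency-aware embeddings u_1..u_i
t = np.array([1.0, 6.0, 30.0, 47.0])         # infection times; t_i is the current step

# Discretize the elapsed times t_i - t_j into T one-hot bins over (0, T_max].
edges = np.linspace(0.0, T_max, T + 1)
bins = np.clip(np.digitize(t[-1] - t, edges) - 1, 0, T - 1)
T_onehot = np.eye(T)[bins].T                 # column j is vec(delta t_j) in R^T

# Eq. (7): latent time-decay features lambda_j.
W_t, b_t = rng.normal(scale=0.1, size=(d, T)), np.zeros((d, 1))
Lam = sigmoid(W_t @ T_onehot + b_t)

# Eq. (8): time-aware influence attention over the i historical users.
W_u, b_u, w = rng.normal(scale=0.1, size=(d, d)), np.zeros((d, 1)), rng.normal(size=d)
beta = softmax(w @ (Lam * elu(W_u @ U + b_u)))

# Eq. (9): cascade embedding; Eq. (10): next-user distribution over all N users.
c_i = U @ beta
W_c, b_c = rng.normal(scale=0.1, size=(N, d)), np.zeros(N)
p_next = softmax(W_c @ c_i + b_c)
print(beta.round(3), float(p_next.sum()))    # influence weights; probabilities sum to 1
```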
3 Experiments

To verify the effectiveness of the proposed model, we conduct comparative experiments on the following three real datasets, which are representative in information diffusion studies.

3.1 Datasets

- Memes [Leskovec et al., 2009]: This dataset contains articles from mainstream news websites and blogs. Each cascade records the diffusion process of a specific key phrase and is represented by a sequence of webpage links with corresponding timestamps.
- Weibo [Zhang et al., 2013]: This dataset consists of content reposting logs crawled from Sina Weibo, a Chinese microblogging site. Each reposting log represents a diffusion process, in which users are ordered as a sequence according to the time they repost.
- Twitter [Weng et al., 2013]: This dataset records the diffusion processes of hash-tags on Twitter. The sequences of users and timestamps using the same hash-tag are traced as diffusion cascades.

Following the previous work [Wang et al., 2017b], we select frequent users and the corresponding cascades as experimental data. The detailed statistics are presented in Table 1. We randomly sample 80% of cascades for training and use the rest for validation and testing with an even split.

Table 1: Statistics of experimental data.

|                     | Memes  | Weibo  | Twitter |
|---------------------|--------|--------|---------|
| # Users             | 1,109  | 8,190  | 13,755  |
| # Cascades          | 42,492 | 43,365 | 72,103  |
| Avg. Cascade Length | 8.8    | 22.5   | 9.4     |

3.2 Baselines

We compare the proposed model, HiDAN, with the following popular and strong sequential baselines.

- RNN: The basic recurrent neural network sequential model.
- LSTM [Hochreiter and Schmidhuber, 1997]: The long short-term memory (LSTM) network is a stronger RNN-based sequential model, which employs an effective gating mechanism to control the information flow in the sequence.
- RMTPP [Du et al., 2016]: The recurrent marked temporal point process (RMTPP) is a state-of-the-art sequential model for sequence prediction. Besides modeling the marker (diffusion user) sequence, it additionally models timing information with a temporal point process.

The following state-of-the-art attention-based sequential models are also compared. All of them compute attentions on hidden states.

- Att-RNN: Att-RNN employs a representative attention mechanism [Luong et al., 2015] in RNN. Attentions are computed between the current hidden state and previous states.
- Att-LSTM: Att-LSTM employs the same attention mechanism as Att-RNN in the LSTM framework.
- CYAN-RNN [Wang et al., 2017b]: This is the latest attention-based sequential method for cascade prediction. Instead of using a single-chain RNN, it employs an encoder-decoder architecture with a coverage-based alignment mechanism. Attentions are computed between current decoder states and previous encoder states.

The following variants of HiDAN are also evaluated.

- HiDAN-NUA: A user-level attention-free variant, which replaces the dependency attention and fusion gate with an average operation over the embeddings of context users.
- HiDAN-NCA: A cascade-level attention-free variant, in which the influence attention is substituted by the average operation.
- HiDAN-NT: A variant that does not consider time decay in the cascade-level attention.

3.3 Evaluation Metrics and Settings

The performance is evaluated by predicting the next infected user based on the previous infections. Due to the large number of potential targets, this prediction task is often regarded as a ranking problem [Wang et al., 2017b]. Given the output probabilities of all users, the ground-truth user, who is truly infected at the next step, is expected to receive a higher probability. We adopt two widely used ranking metrics for evaluation: Mean Reciprocal Rank (MRR) and Accuracy on top k (A@k) [Wang et al., 2017b].
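As a concrete illustration of the two metrics, here is a small hypothetical helper (the exact tie-breaking rule used in the paper is not specified, so a pessimistic one is assumed):

```python
import numpy as np

def mrr_and_acc_at_k(scores, target, ks=(10, 50, 100)):
    """scores: predicted probabilities over all N users; target: index of the
    truly infected next user. Returns the reciprocal rank and A@k hits."""
    # Rank of the ground-truth user (1 = best); ties broken pessimistically.
    rank = 1 + int((scores > scores[target]).sum())
    return 1.0 / rank, {k: int(rank <= k) for k in ks}

# Toy usage: average these per-step values over all test predictions.
rng = np.random.default_rng(0)
scores, target = rng.random(1109), 7
rr, hits = mrr_and_acc_at_k(scores, target)
print(round(rr, 4), hits)
```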
The size of the hidden unit is set to 64 for all baselines. Other parameters of the baselines follow the recommended settings in the original papers. For the proposed model, the dimension size d is also 64, the learning rate is 0.001, the maximum observation time T_max is 120 hours, the number of time intervals T is 40, and the non-linear activation functions f_x, f_u are the Exponential Linear Unit (ELU) [Clevert et al., 2016]. We also apply Dropout [Srivastava et al., 2014] with a keep probability of 0.8 and L2 regularization on the parameters to avoid over-fitting.

Table 2: Diffusion prediction performance.

| Model     | Memes MRR | Memes A@10 | Memes A@50 | Memes A@100 | Weibo MRR | Weibo A@10 | Weibo A@50 | Weibo A@100 | Twitter MRR | Twitter A@10 | Twitter A@50 | Twitter A@100 |
|-----------|-----------|------------|------------|-------------|-----------|------------|------------|-------------|-------------|--------------|--------------|---------------|
| RNN       | 23.26     | 40.23      | 66.33      | 77.24       | 1.33      | 2.25       | 6.04       | 9.49        | 2.04        | 3.68         | 9.83         | 14.92         |
| LSTM      | 24.08     | 41.49      | 67.23      | 77.92       | 1.40      | 2.63       | 7.23       | 11.49       | 2.47        | 4.69         | 11.78        | 16.63         |
| RMTPP     | 23.35     | 41.37      | 66.34      | 76.99       | 1.35      | 2.28       | 6.69       | 10.73       | 1.73        | 3.17         | 8.96         | 13.64         |
| Att-RNN   | 23.51     | 42.05      | 67.14      | 78.10       | 1.57      | 2.52       | 7.51       | 12.18       | 2.53        | 4.56         | 13.68        | 20.14         |
| Att-LSTM  | 24.39     | 43.11      | 68.69      | 79.55       | 1.64      | 2.93       | 8.20       | 12.60       | 2.73        | 5.08         | 14.78        | 21.71         |
| CYAN-RNN  | 17.62     | 35.84      | 57.29      | 69.81       | 1.06      | 1.53       | 5.18       | 7.83        | 1.19        | 1.82         | 5.69         | 8.97          |
| HiDAN-NUA | 16.60     | 33.41      | 61.59      | 73.75       | 1.44      | 2.76       | 8.14       | 12.69       | 2.50        | 4.72         | 11.86        | 16.71         |
| HiDAN-NCA | 20.41     | 38.78      | 65.55      | 77.70       | 1.38      | 2.47       | 7.14       | 11.29       | 2.45        | 4.60         | 11.02        | 16.32         |
| HiDAN-NT  | 27.12     | 47.92      | 73.47      | 83.26       | 2.39      | 4.06       | 11.02      | 16.83       | 5.46        | 11.05        | 23.08        | 29.12         |
| HiDAN     | 27.91     | 48.89      | 74.63      | 84.44       | 2.48      | 4.30       | 11.31      | 17.30       | 5.74        | 11.18        | 23.61        | 30.41         |

3.4 Evaluation Results

As shown in Table 2, the proposed HiDAN consistently and remarkably outperforms all compared methods in terms of MRR, A@10, A@50 and A@100 on the three datasets. The superiority of HiDAN for diffusion prediction is clearly demonstrated. In addition, we observe the following important findings.

The non-sequential framework is capable of modeling cascades. HiDAN significantly outperforms all baselines. Even without either of the two attention mechanisms, its variants still achieve competitive or better performance than the attention-free sequential models on the Weibo and Twitter datasets. This indicates that the proposed non-sequential architecture is capable of modeling cascades without sequential assumptions.

The attention mechanism is beneficial for diffusion prediction. Almost all attention-based sequential models perform better than their non-attention versions. Excluding the attentions from the full HiDAN also brings notable decreases in performance. The proposed hierarchical attentions are specifically designed for capturing historical non-sequential dependencies and inferring future dependencies. Attentions in sequential models also aim at computing long-term dependencies to alleviate the sequential assumptions. This finding is consistent with our argument that modeling non-sequential dependencies is important for diffusion prediction.

HiDAN is more effective in capturing diffusion dependencies. Compared with the state-of-the-art attention-based sequential models, HiDAN gains a very significant improvement. The attention-based sequential models define attentions on hidden states, which represent the accumulated information rather than the independent information of each historical user, so they cannot clearly capture user-to-user dependencies. Differently, HiDAN does not sequentially compress user information but directly computes user-level attentions on unique user embeddings. HiDAN thus captures dependencies more explicitly and effectively.

Time decay matters for influence modeling.
HiDAN outperforms its variant HiDAN-NT across all datasets. Taking time information into account helps infer user influence (dependencies) on future infections more accurately.

3.5 Case Studies on Diffusion Dependency

The experimental results have shown that the user-level dependency attention plays an important role in HiDAN. Here, we further investigate whether the proposed mechanism is better at capturing user-to-user diffusion dependencies. Since it is very difficult to access the complete graph of real data, we utilize two synthetic datasets (CP-Exp and RD-Exp) provided in the previous work [Wang et al., 2017b] for case studies. The graphs are created by the Kronecker generator [Leskovec et al., 2010]. Given the created graph, cascades are generated by simulation processes [Gomez Rodriguez et al., 2011].

We visualize the attention matrices of the sampled cases learned by HiDAN and its best competitor Att-LSTM. Meanwhile, the adjacency matrices of the users involved in the cases are visualized as ground truth.

Figure 4: Case studies on learned dependencies. (a) Dependencies learned by Att-LSTM; (b) dependencies learned by HiDAN; (c) ground-truth dependencies.

As illustrated in Fig. 4, each element $(u_i, u_j)$ in the learned attention matrices indicates how much $u_j$ is infected by $u_i$; the deeper the color, the greater the attention. Each element $(u_k, u_l)$ in the ground-truth matrices denotes whether or not there is a directed edge from $u_k$ to $u_l$ (black indicates "yes" and gray indicates "no"). Compared with Att-LSTM, the attentions learned by HiDAN are more consistent with the ground-truth dependencies. Despite employing an attention mechanism, Att-LSTM mainly focuses on the most recent hidden state. On the contrary, HiDAN is more aware of cross dependencies, especially long-term and multiple dependencies. In the sampled case of the CP-Exp data (left three images), the dependencies of all users except $u_{12}$ are correctly allocated by HiDAN. In the sampled case of the RD-Exp data (right three images), HiDAN accurately captures the dependencies of $u_{30}$, $u_1$ and $u_{24}$. It is interesting that both $u_{11}$ and $u_{15}$ have multiple paths connected to previously infected users, as shown in the ground-truth matrix. HiDAN is able to recognize such multiple diffusion paths.

3.6 Efficiency Analysis

Apart from effectiveness, higher efficiency is another crucial advantage of HiDAN. To demonstrate this, we conduct a training time comparison. All models except CYAN-RNN¹ are implemented with TensorFlow and trained on the same GTX 1080 Ti graphics card with the same batch size. As shown in Table 3, HiDAN has relatively fewer parameters than Att-LSTM and CYAN-RNN. More importantly, HiDAN is substantially faster than all compared sequential models. This can be attributed to its non-sequential architecture. The recurrent layer in sequential models has an approximate $O(ld^2)$ complexity, whereas HiDAN replaces the recurrent layer with the dependency attention, which can be parallelized as matrix computation as shown in Eq. (4) and whose time complexity is only about $O(l^2d)$. Compared with the hidden size $d$ (64 in our experiments), the average cascade length $l$ (9, 23 and 10 in our experiments) is often smaller in real data. Therefore, HiDAN has a lower complexity at the historical user encoding layer.
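For intuition, a quick back-of-the-envelope comparison using the experimental hidden size and the average cascade lengths from Table 1 (illustrative arithmetic only):

```python
# Rough encoding-layer cost comparison with d = 64 and the average
# cascade lengths reported in Table 1.
d = 64
for name, l in [("Memes", 9), ("Weibo", 23), ("Twitter", 10)]:
    recurrent = l * d * d        # O(l d^2): sequential recurrent layer
    attention = l * l * d        # O(l^2 d): parallelizable dependency attention
    print(f"{name}: l*d^2 = {recurrent:,} vs l^2*d = {attention:,}")
```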
Additionally, without sequential assumptions, HiDAN can compute the outputs of all steps in parallel with $O(1)$ complexity at the prediction layer, whereas sequential models need to produce outputs step by step with $O(l)$ complexity. These properties dramatically speed up the training of HiDAN, especially when the cascade length gets larger.

¹ The released code of CYAN-RNN is in Java, so its GPU training time is unknown; but as an RNN-based model, it is at least not faster than RNN.

Table 3: Average training time (seconds) per epoch.

| Model    | # Param. | Memes | Weibo | Twitter |
|----------|----------|-------|-------|---------|
| RNN      | 6.2k     | 22 s  | 145 s | 261 s   |
| LSTM     | 24.8k    | 34 s  | 190 s | 346 s   |
| RMTPP    | 8.8k     | 24 s  | 148 s | 265 s   |
| Att-RNN  | 14.5k    | 29 s  | 152 s | 273 s   |
| Att-LSTM | 33.1k    | 38 s  | 203 s | 422 s   |
| CYAN-RNN | 27.1k    | 22 s  | 145 s | 261 s   |
| HiDAN    | 25.6k    | 12 s  | 36 s  | 85 s    |

4 Related Work

Diffusion Prediction

Inspired by the theoretical independent cascade (IC) model [Kempe et al., 2003], most previous work focused on IC-based methods for diffusion prediction [Saito et al., 2009; Gomez Rodriguez et al., 2011]. Since these methods require strong hypotheses about diffusion patterns, which are hard to specify and verify in practice, they are often not effective on real data [Wang et al., 2017b]. Several recent studies formulated the problem as a sequence prediction task, and a series of RNN-based models were proposed. RMTPP [Du et al., 2016] is the first RNN-based model for predicting cascade dynamics; it models both timing and user information in sequences with a basic RNN framework. Topo-LSTM [Wang et al., 2017a] employs the LSTM framework and considers the topology; however, its requirement for the complete graph limits its application to data without graph information. Another representative work is CYAN-RNN [Wang et al., 2017b], which adopts an encoder-decoder framework and a machine-translation alignment mechanism. Different from these previous sequential models, we are the first to develop a non-sequential neural network for diffusion prediction.

Attention in Neural Networks

The attention mechanism has been widely used in many sequence-based tasks, including neural machine translation [Luong et al., 2015; Bahdanau et al., 2015] and sequence embedding [Lin et al., 2017]. In a series of recent works, attention mechanisms have proven to be more expressive in RNN-free architectures. The RNN-free attention network Transformer [Vaswani et al., 2017] was first proposed for the machine translation task. Another representative model, DiSAN [Shen et al., 2018], was then proposed for various single-sequence problems. Besides, some researchers [Lin et al., 2017] focused on developing self-attention networks to capture the importance of elements in sequences. Inspired by them, we propose hierarchical attentions that are well adapted to the unique properties of diffusion cascades at different levels.

5 Conclusion

In this paper, we propose a hierarchical attention neural network for diffusion prediction, which is well adapted to the non-sequential characteristics of diffusion cascades. The proposed two-level attentions are able to capture historical user-to-user dependencies and infer future dependencies. The experiments on three real diffusion datasets demonstrate the effectiveness and efficiency of our model compared with state-of-the-art sequential models.

Acknowledgments

The work described in this paper was supported by the National Natural Science Foundation of China (61672445) and The Hong Kong Polytechnic University (G-YBJP).
References

[Bahdanau et al., 2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.

[Bourigault et al., 2016] Simon Bourigault, Sylvain Lamprier, and Patrick Gallinari. Representation learning for information diffusion through social networks: an embedded cascade model. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pages 573-582. ACM, 2016.

[Cao et al., 2017] Qi Cao, Huawei Shen, Keting Cen, Wentao Ouyang, and Xueqi Cheng. DeepHawkes: Bridging the gap between prediction and understanding of information cascades. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 1149-1158. ACM, 2017.

[Clevert et al., 2016] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations, 2016.

[Du et al., 2016] Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1555-1564. ACM, 2016.

[Gomez Rodriguez et al., 2011] Manuel Gomez Rodriguez, David Balduzzi, and Bernhard Schölkopf. Uncovering the temporal dynamics of diffusion networks. In Proceedings of the 28th International Conference on Machine Learning, pages 561-568. ACM, 2011.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

[Kempe et al., 2003] David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 137-146. ACM, 2003.

[Kingma and Ba, 2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

[Leskovec et al., 2009] Jure Leskovec, Lars Backstrom, and Jon Kleinberg. Meme-tracking and the dynamics of the news cycle. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 497-506. ACM, 2009.

[Leskovec et al., 2010] Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos, and Zoubin Ghahramani. Kronecker graphs: An approach to modeling networks. Journal of Machine Learning Research, 11(Feb):985-1042, 2010.

[Lin et al., 2017] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. In International Conference on Learning Representations, 2017.

[Luong et al., 2015] Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412-1421, 2015.

[Saito et al., 2009] Kazumi Saito, Masahiro Kimura, Kouzou Ohara, and Hiroshi Motoda. Learning continuous-time information diffusion models for social behavioral data analysis. In Asian Conference on Machine Learning, pages 322-337. Springer, 2009.

[Shen et al., 2018] Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang.
DiSAN: Directional self-attention network for RNN/CNN-free language understanding. In AAAI Conference on Artificial Intelligence, 2018.

[Srivastava et al., 2014] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958, 2014.

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010, 2017.

[Wang et al., 2017a] Jia Wang, Vincent W. Zheng, Zemin Liu, and Kevin Chen-Chuan Chang. Topological recurrent neural network for diffusion prediction. In 2017 IEEE International Conference on Data Mining (ICDM), pages 475-484. IEEE, 2017.

[Wang et al., 2017b] Yongqing Wang, Huawei Shen, Shenghua Liu, Jinhua Gao, and Xueqi Cheng. Cascade dynamics modeling with attention-based recurrent neural network. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pages 2985-2991, 2017.

[Weng et al., 2013] L. Weng, F. Menczer, and Y. Y. Ahn. Virality prediction and community structure in social networks. Scientific Reports, 3(8):618-618, 2013.

[Zhang et al., 2013] Jing Zhang, Biao Liu, Jie Tang, Ting Chen, and Juanzi Li. Social influence locality for modeling retweeting behaviors. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence, pages 2761-2767, 2013.