# Large Scale Evolving Graphs with Burst Detection

Yifeng Zhao1, Xiangwei Wang2, Hongxia Yang2, Le Song3 and Jie Tang1
1Department of Computer Science and Technology, Tsinghua University
2DAMO Academy, Alibaba Group
3Ant Financial
zhao-yf16@mails.tsinghua.edu.cn, {florian.wxw, yang.yhx}@alibaba-inc.com, le.song@antfin.com, jietang@tsinghua.edu.cn

Abstract

Analyzing large-scale evolving graphs is crucial for understanding the dynamic and evolutionary nature of social networks. Most existing work focuses on discovering repeated and consistent temporal patterns; however, such patterns cannot fully explain the complexity observed in dynamic networks. For example, in recommendation scenarios, users sometimes purchase products on a whim while window shopping. In this paper, we design and implement a novel framework called BurstGraph, which captures both recurrent, consistent patterns and unexpected bursty network changes. The performance of the proposed algorithm is demonstrated on both a simulated dataset and a dataset from a world-leading E-Commerce company, showing that it can discriminate recurrent events from extremely bursty events in terms of action propensity.

1 Introduction

Dynamic networks, where edges and vertices arrive over time, are ubiquitous in various scenarios (e.g., social media, security, public health, computational biology, and user-item purchase behaviors on E-Commerce platforms [Akoglu and Faloutsos, 2013; Akoglu et al., 2015]), and have attracted significant research interest in recent years. An important problem over dynamic networks is burst detection: finding objects and relationships that deviate from the norm. Burst detection has many practical applications spanning numerous domains, such as bursty interests of users in E-Commerce [Parikh and Sundaresan, 2008] and cross-community relationships in social networks.

Recently, the research community has focused on network embedding learning. One class of network embedding methods represents nodes as single points in a low-dimensional latent space, aiming to preserve the structural and content information of the network [Perozzi et al., 2014; Grover and Leskovec, 2016]. Other classes include edge embedding and subgraph embedding [Dong et al., 2017]. However, most existing network embedding methods focus mainly on the network structure, ignoring the bursty links that appear in dynamic networks [Perozzi et al., 2014; Dai et al., 2016; Qiu et al., 2018].

In social network dynamics, users may generate consistent temporal patterns by buying consumable goods, such as food and paper, to satisfy their recurrent needs, or by purchasing durable products, such as cell phones and cars, to satisfy their long-term needs. However, in the real world, bursty links are also very common in network evolution. For instance, in a social network, people meet new friends or discover new interests when they enter a new environment; in an E-Commerce network, customers often window-shop while exploring recommendation sections. Figure 1 illustrates the interest evolution of a young lady during shopping. Existing works on modeling dynamic networks mostly focus on repeated and consistent patterns [Trivedi et al., 2017; Li et al., 2017; Zhou et al., 2018] and cannot capture bursty links well due to their sparsity.
Such important bursty information is commonly treated as noise by general machine learning algorithms and ignored in modeling [Chandola et al., 2009]. Furthermore, these bursty dynamics are hidden among other complex network dynamics, including the addition/removal of edges and the update of edge weights. Designing a framework that accounts for all of these changes is challenging.

Figure 1: An illustrative example of the observed interest evolution of a young lady during shopping. The main interests of the lady are clothes and shoes, while bursty interests, such as a mop, also exist.

To tackle the aforementioned challenges, we propose a novel framework with contributions summarized as follows:

Problem Formulation: We formally define the problem of evolving graphs with bursty links. The key idea is to detect bursty links in dynamic graphs during their onset.

Algorithms: We propose a novel framework for modeling evolving graphs with bursty links, namely BurstGraph. BurstGraph divides graph evolution into two parts: vanilla evolution and the occurrence of bursty links. Because bursty links are sparse, a spike-and-slab distribution [Mitchell and Beauchamp, 1988] is introduced as the approximate posterior distribution in the variational autoencoder (VAE) [Kingma and Welling, 2013] framework, while the vanilla evolution uses the original VAE framework. To fully exploit the dynamic information in graph evolution, we propose an RNN-based dynamic neural network that captures the graph structure at each time step. The RNN cell maintains information about both vanilla and bursty evolution and is updated over time.

The rest of this paper is organized as follows: Section 2 briefly reviews related work. Section 3 introduces the problem statement of evolving graphs with bursty links and presents the proposed framework. Experiments on both a simulated dataset and real datasets are presented and discussed in Section 4. Finally, Section 5 concludes the paper and outlines future work.

2 Related Work

Static Network Embedding. Recently, learning representations for networks has attracted considerable research effort. Inspired by Word2Vec [Mikolov et al., 2013], [Perozzi et al., 2014; Grover and Leskovec, 2016] learn a node representation from its neighborhood contexts. Since an adjacency matrix can represent the topology of a network, representative works such as [Qiu et al., 2018] use matrix factorization to learn a low-rank space for the adjacency matrix. Deep learning methods [Wang et al., 2016] introduce effective non-linear function learning into network embedding.

Dynamic Network Embedding. Inductive static methods [Perozzi et al., 2014; Hamilton et al., 2017a] can also handle dynamic networks by making inferences about new vertices. [Du et al., 2018] extends skip-gram methods to update the embeddings of the original vertices. [Zhou et al., 2018] focuses on capturing triadic structure properties for learning network embeddings. Considering both the network structure and node attributes, [Li et al., 2017] focuses on updating the top eigenvectors and eigenvalues for streaming networks.

Burst Detection. Traditionally, burst detection aims to detect an unexpectedly large number of events occurring within some time duration. There are two typical types of burst detection approaches, i.e., threshold-based [Heard et al., 2010] and state-based methods [Kleinberg, 2003].
[Heard et al., 2010] studies fast algorithms that use self-similarity to model bursty time series. [Kleinberg, 2003] uses an infinite-state automaton to model burstiness and extract structure from text streams. However, link building in evolving graphs is usually sparse and slow, and unexpected links rarely occur simultaneously. Therefore, we define a burst simply as an unexpected behavior within a time duration, a straightforward definition adopted by many real-world applications.

Table 1: Summary of Notation.

| Notation | Description |
|---|---|
| $G_t$ | network snapshot at time step $t$ |
| $A_t$ | adjacency matrix for the network structure in $G_t$ |
| $A_{v,t}$ | adjacency matrix for the vanilla evolution in $G_t$ |
| $A_{vi,t}$ | adjacency vector for vertex $v_i$ in the vanilla evolution in $G_t$ |
| $A_{b,t}$ | adjacency matrix for the bursty evolution in $G_t$ |
| $A_{bi,t}$ | adjacency vector for vertex $v_i$ in the bursty evolution in $G_t$ |
| $h_t$ | hidden variable of the RNN-based VAE at time $t$ |
| $z_t$ | random variable for the vanilla evolution |
| $s_t$ | random variable for the bursty evolution |
| $c_t, r_{t,1}, r_{t,2}$ | composition variables of $s_t$ |
| $\lambda, \beta$ | hyperparameters of our method |
| $d$ | dimension of the random variables |
| $n$ | total number of nodes in $G$ |
| $T$ | number of time steps |

3 The Proposed Model

Existing dynamic network embedding methods mainly focus on the expansion of new vertices and new edges, but usually ignore the changes of vertex attributes over time that lead to unexpected bursty network changes [Angel et al., 2012]. To tackle this problem, we propose to learn low-dimensional representations of vertices that capture both vanilla and bursty evolution. Given a dynamic network $\{G_1, \dots, G_T\}$, dynamic network embedding learns a series of functions $\{f_{\phi_1}, \dots, f_{\phi_T}\}$, where each function $f_{\phi_t}$ maps vertices in network $G_t$ to low-dimensional vectors, $f_{\phi_t}(v_i) \in \mathbb{R}^d$, and the $\phi_t$ are the corresponding network parameters. We summarize symbols and notation in Table 1 and use bold uppercase for matrices (e.g., $\mathbf{A}$), bold lowercase for vectors (e.g., $\mathbf{a}$), and normal lowercase for scalars (e.g., $a$). $\mathbf{1}$ denotes a vector whose elements are all 1 and $\mathbf{I}$ denotes the identity matrix.

The framework of the proposed model is illustrated in Figure 2. It mainly consists of two components, namely the original VAE for the vanilla evolution and the spike-and-slab model to detect bursty links.

Figure 2: An illustration of the proposed framework BurstGraph. At time step $t$, the framework generates the vanilla evolution and the bursty evolution based on the network structure $G_t$. Part (a) is an original VAE for the vanilla evolution, where the random variable $z_t$ follows a Gaussian distribution. Part (b) is an extended VAE for the bursty evolution, where the random variable $s_t$ follows a spike-and-slab distribution because of the sparsity of bursty links. The encoders for these two random variables $z_t$ and $s_t$ share the same GraphSAGE network to utilize the information from vertices and their neighbors.

3.1 VAE for Graph Evolution

After detecting the bursty links at each discrete time step $t$, we split the adjacency matrix $A_t$ of the graph into two parts: the vanilla adjacency matrix $A_{v,t}$ and the burst adjacency matrix $A_{b,t}$. To capture the information in these two parts, we extend the VAE framework and introduce a spike-and-slab distribution to model the sparsity of the burst adjacency matrix. The likelihood of the VAE at each time step $t$ is written as:

$$ \prod_{i=1}^{n} p(A_{i,t} \mid G_{i,t}) = \prod_{i=1}^{n} \iint p(A_{vi,t} \mid z_t)\, p(A_{bi,t} \mid s_t)\, p(z_t \mid G_{i,t})\, p(s_t \mid G_{i,t})\, dz_t\, ds_t, \qquad (1) $$

where $G_{i,t}$ denotes the graph structure of vertex $v_i$ at time step $t$, and $z_t$ and $s_t$ are the random variables in the VAE-based model for the vanilla and bursty evolution, respectively.
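To make the vanilla/bursty split of the adjacency matrix concrete, the following minimal sketch separates a snapshot $A_t$ into $A_{v,t}$ and $A_{b,t}$ given a precomputed boolean burst mask (how the mask is obtained is application specific, e.g., the 15-day rule used later for the Alibaba data). The helper name `split_adjacency` is illustrative and not taken from the paper's code.

```python
import numpy as np

def split_adjacency(A_t, burst_mask):
    """Split a snapshot adjacency matrix into vanilla and bursty parts.

    A_t        : (n, n) binary adjacency matrix of snapshot G_t
    burst_mask : (n, n) boolean matrix marking which edges are treated as bursty
    Returns (A_v_t, A_b_t) such that A_t == A_v_t + A_b_t.
    """
    A_b_t = np.where(burst_mask, A_t, 0)   # keep only the bursty links
    A_v_t = A_t - A_b_t                    # the remaining vanilla links
    return A_v_t, A_b_t

# toy usage: 4 nodes, one bursty edge between nodes 0 and 3
A_t = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 0],
                [1, 0, 0, 0]])
mask = np.zeros_like(A_t, dtype=bool)
mask[0, 3] = mask[3, 0] = True
A_v, A_b = split_adjacency(A_t, mask)
```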
The evidence lower bound (ELBO) of the VAE [Kingma and Welling, 2013] can be written as:

$$ \mathcal{L}_t = \mathbb{E}_{z_t \sim q_\phi(z_t \mid G_{i,t})}\!\left[\ln \frac{p_\theta(A_{vi,t} \mid z_t)\, p(z_t)}{q_\phi(z_t \mid G_{i,t})}\right] + \lambda\, \mathbb{E}_{s_t \sim q_\phi(s_t \mid G_{i,t})}\!\left[\ln \frac{p_\theta(A_{bi,t} \mid s_t)\, p(s_t)}{q_\phi(s_t \mid G_{i,t})}\right], \qquad (2) $$

where the importance weight $\lambda$ is a hyperparameter, and $\theta$ and $\phi$ are the parameters of the decoder and encoder networks, respectively. The approximate posterior distribution of the random variable $z_t$ follows a Gaussian distribution, $q_\phi(z_t \mid G_{i,t}) \sim \mathcal{N}(\mu_0(G_{i,t}), \Sigma_0(G_{i,t}))$, where $\mu_0(\cdot)$ and $\Sigma_0(\cdot)$ are the encoder networks, which can be any highly flexible functions such as neural networks [Kingma and Welling, 2013]. We use GraphSAGE [Hamilton et al., 2017a], a representative graph convolutional network framework, to generate a representation from the attributes of a vertex and its neighbours:

$$ u_{i,t} = \mathrm{GraphSAGE}(G_{i,t}), \qquad \mu_0(G_{i,t}) = f_{\mu_0}(u_{i,t}), \qquad \Sigma_0(G_{i,t}) = f_{\Sigma_0}(u_{i,t}), \qquad (3) $$

where $\mu_0(G_{i,t})$ and $\Sigma_0(G_{i,t})$ share the same GraphSAGE network to learn a representation from the topological structure and attributes of vertex $v_i$. In our paper, $f_{\mu_0}(\cdot)$ and $f_{\Sigma_0}(\cdot)$ are both two fully-connected layers whose hidden layers are activated by the ReLU function.

Similar to the original VAE framework [Kingma and Welling, 2013], the prior of the random variable $z_t$ follows a standard Gaussian distribution. This is no longer suitable in the case of rare and discrete bursty links. In our paper, the approximate posterior distribution of the random variable $s_t$ for bursty links is set to follow a spike-and-slab distribution [Mitchell and Beauchamp, 1988]:

$$ \begin{aligned} c_t \mid G_{i,t} &\overset{iid}{\sim} \mathrm{Bernoulli}(\psi(G_{i,t})) \\ r_{t,1} \mid G_{i,t} &\sim \mathcal{N}\!\left(0, \tfrac{1}{\beta}\Sigma_1(G_{i,t})\right) \\ r_{t,2} \mid G_{i,t} &\sim \mathcal{N}(\mu_1(G_{i,t}), \Sigma_1(G_{i,t})) \\ s_t &= (1 - c_t) \odot r_{t,1} + c_t \odot r_{t,2}, \end{aligned} \qquad (4) $$

where $\mu_1(\cdot)$ and $\Sigma_1(\cdot)$ are encoder networks with the same neural network structure as $\mu_0(\cdot)$ and $\Sigma_0(\cdot)$ in Equation (3), respectively, and $\psi(\cdot)$ is also an encoder network, consisting of two fully-connected layers activated by the sigmoid function. The value of $\beta > 0$ is predefined, with the usual setting $\beta = 100$ for more flexibility [Ishwaran et al., 2005]. Therefore, $r_{t,1}$ follows a spike distribution while $r_{t,2}$ follows a slab distribution. For ease of implementation, the prior distribution $p(s_t)$ of $s_t$ is set to:

$$ p(s_t) \sim (1 - \alpha) \odot \mathcal{N}\!\left(0, \tfrac{1}{100}\mathbf{I}\right) + \alpha \odot \mathcal{N}(0, \mathbf{I}), $$

where $\alpha = \{\alpha_i\}$ and each $\alpha_i$ is drawn from a Bernoulli distribution, $\alpha_i \overset{iid}{\sim} \mathrm{Bernoulli}(\tfrac{1}{2})$. The regularization term of the spike-and-slab variable $s_t$ in the ELBO can be written as:

$$ \begin{aligned} -D_{KL}(q(s_t \mid G_{i,t}) \,\|\, p(s_t)) &= \mathbb{E}_{q(s_t \mid G_{i,t})}\!\left[\ln \frac{p(c_t)}{q(c_t \mid G_{i,t})} + \ln \frac{p(r_{t,1})}{q(r_{t,1} \mid G_{i,t})} + \ln \frac{p(r_{t,2})}{q(r_{t,2} \mid G_{i,t})}\right] \\ &= -D_{KL}(q(c_t \mid G_{i,t}) \,\|\, p(c_t)) - D_{KL}(q(r_{t,1} \mid G_{i,t}) \,\|\, p(r_{t,1})) - D_{KL}(q(r_{t,2} \mid G_{i,t}) \,\|\, p(r_{t,2})) \\ &= -\ln 2 - \psi(G_{i,t}) \ln \psi(G_{i,t}) - (1 - \psi(G_{i,t})) \ln(1 - \psi(G_{i,t})) \\ &\quad + \tfrac{1}{2}\big(1 + \ln \Sigma_1(G_{i,t}) - \Sigma_1(G_{i,t}) - \beta\, \mu_1^2(G_{i,t})\big) \\ &\quad + \tfrac{1}{2}\big(1 + \ln \Sigma_1(G_{i,t}) - \Sigma_1(G_{i,t}) - \mu_1^2(G_{i,t})\big). \end{aligned} \qquad (5) $$

Using the reparameterization trick, the random variables $z_t$, $c_t$, $r_{t,1}$ and $r_{t,2}$ can be rewritten as:

$$ \begin{aligned} z_t &= \mu_0(G_{i,t}) + \gamma_0 \odot \Sigma_0^{\frac{1}{2}}(G_{i,t}) \\ c_t &= \sigma\!\left(\ln \epsilon - \ln(1 - \epsilon) + \ln \psi(G_{i,t}) - \ln(1 - \psi(G_{i,t}))\right) \\ r_{t,1} &= \gamma_1 \odot \left(\tfrac{1}{\beta}\Sigma_1(G_{i,t})\right)^{\frac{1}{2}} \\ r_{t,2} &= \mu_1(G_{i,t}) + \gamma_2 \odot \Sigma_1^{\frac{1}{2}}(G_{i,t}), \end{aligned} \qquad (6) $$

where $\epsilon$ follows the uniform distribution $\epsilon \sim U(0, 1)$, and $\gamma_0$, $\gamma_1$ and $\gamma_2$ all follow the standard Gaussian distribution, $\gamma_0, \gamma_1, \gamma_2 \sim \mathcal{N}(0, \mathbf{I})$.
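The reparameterized sampling of the bursty latent $s_t$ in Equations (4) and (6) can be sketched as follows. This is a minimal NumPy illustration of the math only; in the actual model these operations would run inside a differentiable framework so that gradients flow back to the encoder outputs $\mu_1$, $\Sigma_1$ and $\psi$. The function name `sample_spike_and_slab` and the toy shapes are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_spike_and_slab(mu1, sigma1_sq, psi, beta=100.0, rng=np.random):
    """Reparameterized draw of s_t (Eq. 6).

    mu1, sigma1_sq : slab mean / variance from the encoder, shape (d,)
    psi            : Bernoulli parameter of the selector c_t, shape (d,)
    beta           : spike-shrinkage hyperparameter (beta = 100 in the paper)
    """
    eps = rng.uniform(1e-6, 1 - 1e-6, size=psi.shape)   # epsilon ~ U(0, 1)
    gamma1 = rng.standard_normal(psi.shape)              # spike noise gamma_1
    gamma2 = rng.standard_normal(psi.shape)              # slab noise gamma_2

    # relaxed Bernoulli selector c_t: sigmoid of logistic noise plus logit(psi)
    c_t = sigmoid(np.log(eps) - np.log(1 - eps) + np.log(psi) - np.log(1 - psi))
    r_t1 = gamma1 * np.sqrt(sigma1_sq / beta)            # spike: N(0, Sigma_1 / beta)
    r_t2 = mu1 + gamma2 * np.sqrt(sigma1_sq)             # slab:  N(mu_1, Sigma_1)
    return (1 - c_t) * r_t1 + c_t * r_t2                 # s_t

# toy usage with a 4-dimensional latent
d = 4
s_t = sample_spike_and_slab(np.zeros(d), np.ones(d), np.full(d, 0.1))
```

With a small $\psi$, most coordinates of $s_t$ are dominated by the tightly concentrated spike and stay near zero, which is exactly the sparsity the bursty branch is meant to express.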
The decoder network $f_s(s_t) = \sigma(W_s\, s_t)$ is a transition function that reconstructs the bursty links $A_{bi,t}$ at time $t$. For the vanilla evolution, the framework is similar to the original VAE: the prior of the random variable $z_t$ follows a standard Gaussian distribution, and the decoder network $f_z(z_t)$ is a fully-connected network activated by the sigmoid function that reconstructs the vanilla connections $A_{vi,t}$ at time $t$.

3.2 Learning Representations for Dynamic Networks

In the evolving network setting, the goal is to learn the evolving changes from a set of sequential graphs $G = \{G_1, G_2, \dots, G_T\}$. To take the temporal structure of the sequential graphs into account, we introduce an RNN structure into our framework, as shown in Figure 3.

Figure 3: An illustration of the RNN structure in BurstGraph. The hidden variable $h_t$ is updated from the last hidden variable $h_{t-1}$ and the random variables $z_t$ and $s_t$; $h_0$ is set to a zero vector. The priors $s'_t$ and $z'_t$ depend on $h_{t-1}$. The initial prior $s'_0$ follows the distribution $p(s_t)$ mentioned before, while the initial prior $z'_0$ follows a standard Gaussian distribution.

The prior of the latent random variables is no longer set to a standard distribution, but depends on the last hidden variable $h_{t-1}$:

$$ \begin{aligned} z_t \mid h_{t-1} &\sim \mathcal{N}(\mu_0(h_{t-1}), \Sigma_0(h_{t-1})) \\ r_{t,1} \mid h_{t-1} &\sim \mathcal{N}\!\left(0, \tfrac{1}{\beta}\Sigma_1(h_{t-1})\right) \\ r_{t,2} \mid h_{t-1} &\sim \mathcal{N}(\mu_1(h_{t-1}), \Sigma_1(h_{t-1})) \\ c_t \mid h_{t-1} &\overset{iid}{\sim} \mathrm{Bernoulli}(\psi(h_{t-1})). \end{aligned} $$

Similarly, the approximate posterior distribution of these random variables depends on both the current network snapshot $G_{i,t}$ and the last hidden variable $h_{t-1}$:

$$ \begin{aligned} z_t \mid G_{i,t}, h_{t-1} &\sim \mathcal{N}(\mu_0(G_{i,t}, h_{t-1}), \Sigma_0(G_{i,t}, h_{t-1})) \\ r_{t,1} \mid G_{i,t}, h_{t-1} &\sim \mathcal{N}\!\left(0, \tfrac{1}{\beta}\Sigma_1(G_{i,t}, h_{t-1})\right) \\ r_{t,2} \mid G_{i,t}, h_{t-1} &\sim \mathcal{N}(\mu_1(G_{i,t}, h_{t-1}), \Sigma_1(G_{i,t}, h_{t-1})) \\ c_t \mid G_{i,t}, h_{t-1} &\overset{iid}{\sim} \mathrm{Bernoulli}(\psi(G_{i,t}, h_{t-1})). \end{aligned} $$

As in a standard RNN, the hidden variable $h_t$ is updated from the previous hidden variable $h_{t-1}$ and the random variables $z_t$ and $s_t$:

$$ h_t = f_h(h_{t-1}, z_t, s_t), $$

where $f_h(\cdot)$ is also a fully-connected network that transforms the hidden variables.

Training. We adopt the Adam optimizer [Kingma and Ba, 2014] to optimize the objective, and we introduce dropout with weight penalties into the proposed model. In particular, we penalize the L1-norm of the weight $W_s$ to induce sparsity in the output. It is worth noting that all parameters of the proposed model are shared across the dynamic graphs over the time interval $(1, T)$; the GraphSAGE encoder likewise shares a random feature input for each vertex over the time interval $(1, T)$. Edge evolution is always sparse and unbalanced, which makes it difficult to identify positive instances and achieve good performance. Instead of the traditional sigmoid cross-entropy loss, we use an inter-and-intra class balanced loss in our model:

$$ \mathcal{L} = \underbrace{\frac{1}{2}\left(\frac{L^{nod}_{pos}}{n^{nod}_{pos}} + \frac{L^{nod}_{neg}}{n^{nod}_{neg}}\right)}_{\text{inter-class loss}} + \underbrace{\frac{1}{2}\left(\frac{L^{cls}_{pos}}{n^{cls}_{pos}} + \frac{L^{cls}_{neg}}{n^{cls}_{neg}}\right)}_{\text{intra-class loss}}, $$

where $L_{pos}$ and $L_{neg}$ are the cross-entropy losses for positive and negative samples, respectively, and $L^{nod}$ and $L^{cls}$ are the cross-entropy losses for the labels of each node and each class. Similarly, $n_{pos}$ and $n_{neg}$ denote the numbers of positive and negative samples, while $n^{nod}$ and $n^{cls}$ denote the numbers of labels in each node and class, respectively.
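To summarize the recurrent structure of Section 3.2, here is a minimal PyTorch sketch of the hidden-state update $h_t = f_h(h_{t-1}, z_t, s_t)$ unrolled over the snapshots. The paper only states that $f_h$ is a fully-connected transformation, so the class name `RecurrentState`, the layer sizes and the ReLU hidden activation are assumptions made for illustration; the random tensors stand in for the sampled latents.

```python
import torch
import torch.nn as nn

class RecurrentState(nn.Module):
    """Hidden-state update h_t = f_h(h_{t-1}, z_t, s_t) (illustrative sketch)."""

    def __init__(self, d, hidden_dim):
        super().__init__()
        self.f_h = nn.Sequential(
            nn.Linear(hidden_dim + 2 * d, hidden_dim),  # concat of h_{t-1}, z_t, s_t
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, h_prev, z_t, s_t):
        return self.f_h(torch.cat([h_prev, z_t, s_t], dim=-1))

# unrolled over T snapshots, starting from h_0 = 0 as in Figure 3
d, hidden_dim, T = 16, 32, 6
cell = RecurrentState(d, hidden_dim)
h = torch.zeros(1, hidden_dim)
for t in range(T):
    z_t, s_t = torch.randn(1, d), torch.randn(1, d)  # stand-ins for sampled latents
    h = cell(h, z_t, s_t)
```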
4 Experiment

In this section, we evaluate our model in the dynamic setting based on its performance on multi-class link prediction.

Dataset. We first use simulated data to verify the effectiveness of the model; we then apply BurstGraph to a challenging real dataset from a world-leading E-Commerce company.

Table 3: Results (%) comparison of different embedding methods. We use bold to highlight the winners.

| Model | Simulated vanilla (Micro/Macro) | Simulated bursty (Micro/Macro) | Alibaba-S vanilla (Micro/Macro) | Alibaba-S bursty (Micro/Macro) | Alibaba-L vanilla (Micro/Macro) | Alibaba-L bursty (Micro/Macro) |
|---|---|---|---|---|---|---|
| DeepWalk | 73.5 / 71.2 | 71.7 / 69.8 | 79.5 / 74.7 | 71.9 / 70.2 | 80.7 / 70.4 | 68.1 / 64.6 |
| GraphSAGE | 88.3 / 87.4 | 70.8 / 69.2 | 78.6 / 74.2 | 70.8 / 70.3 | 76.1 / 72.8 | 69.7 / 68.5 |
| TNE | 75.9 / 74.3 | 69.1 / 68.3 | 77.6 / 72.1 | 70.4 / 69.2 | 79.9 / 73.9 | 69.1 / 67.2 |
| CTDNE | 72.2 / 69.6 | 70.9 / 68.7 | 79.3 / 75.0 | 72.2 / 70.4 | 80.5 / 70.0 | 67.8 / 64.3 |
| BurstGraph | **91.5 / 91.4** | **78.8 / 78.4** | **80.0 / 77.8** | **77.2 / 75.2** | **81.4 / 77.7** | **73.3 / 70.8** |

Simulated Dataset: We generate a sequence of synthetic bipartite graph snapshots with 21K vertices and 527K edges that share similar characteristics with the real data. We assume that each vertex has two types of representations (vanilla/burst). More specifically, we divide the generation process into two parts: the first part generates hyperparameters for each vertex to ensure independence; the second part generates links by sampling the two representations of each vertex from Gaussian distributions with the fixed hyperparameters. First, for each vertex, we sample the hyperparameters $\mu_0, \sigma_0^2$ from the uniform distributions $\mu_0 \sim U(0, 2.5)$ and $\sigma_0^2 \sim U(0, 0.01)$ for the vanilla evolution, and $\mu_1, \sigma_1^2$ from $\mu_1 \sim U(-2.5, 0)$ and $\sigma_1^2 \sim U(0, 1)$ for the bursty evolution. To make sure that bursty links are sparse, the variance $\sigma_1^2$ is set to be larger than $\sigma_0^2$. We then generate the links in each snapshot as follows: we first determine the link type of each vertex pair, drawing bursty links with probability $\mathrm{Bernoulli}(0.1)$ and vanilla links otherwise. We then generate the links for each vertex pair: according to its link type, we resample the representations of the two vertices from the corresponding Gaussian distributions, $\mathcal{N}(\mu_0, \sigma_0^2)$ or $\mathcal{N}(\mu_1, \sigma_1^2)$. A fixed weight $W$ is employed to transform these two representations into a single value. In all simulated graph snapshots, we set a threshold that truncates around 4% of the vertex pairs as links. The resulting snapshots are illustrated in Figure 4.

Alibaba Dataset: We collect this dataset from the world-leading E-Commerce company Alibaba; it contains two types of nodes, users and items. A link between a user and an item is defined as bursty if the user has had no interaction with the item, or with similar items in the same category, during the last 15 days, a window chosen according to business needs. The other links are viewed as vanilla links. Following the dynamic graph setting, we split the whole graph into a sequence of graph snapshots with the same time interval (e.g., 1 day). The dataset with sampled items is denoted as the Alibaba-S dataset. The statistics of the above datasets are shown in Table 2.

Table 2: Statistics of datasets.

| dataset | #vertices | #edges | #classes | #time |
|---|---|---|---|---|
| Simulate | 20,118 | 527,268 | 118 | 6 |
| Alibaba-S | 19,091 | 479,441 | 213 | 6 |
| Alibaba-L | 25,432 | 3,745,819 | 7,419 | 6 |

In each dataset, the graph snapshot $G_t$ consists of all the nodes and interactions that appear at time step $t$. In our experimental setting, we hide a set of edges from the original graph and train on the remaining graph. The test set contains 10% randomly selected vertices, and negative links are randomly selected in the same number as the positive links for each link type (vanilla/burst).
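One possible reading of the simulated-data procedure above is sketched below at toy scale. The sizes, the way the two vertex representations are combined ($(rep_u + rep_v)^\top W$), and the per-snapshot quantile threshold are assumptions for illustration, not the authors' generator.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 200, 100, 8            # toy sizes; the paper uses ~21K vertices

# per-vertex hyperparameters for the two evolution types
mu0  = rng.uniform(0.0, 2.5,  size=(n_users + n_items, d))   # vanilla mean
var0 = rng.uniform(0.0, 0.01, size=(n_users + n_items, d))   # vanilla variance
mu1  = rng.uniform(-2.5, 0.0, size=(n_users + n_items, d))   # bursty mean
var1 = rng.uniform(0.0, 1.0,  size=(n_users + n_items, d))   # bursty variance

W = rng.standard_normal(d)                   # fixed weight combining representations

def snapshot(link_quantile=0.96, p_burst=0.1):
    """Generate one bipartite snapshot; roughly 4% of pairs become links."""
    is_burst = rng.random((n_users, n_items)) < p_burst       # Bernoulli(0.1) link type
    scores = np.empty((n_users, n_items))
    for u in range(n_users):
        for j in range(n_items):
            v = n_users + j
            mu_u, var_u = (mu1[u], var1[u]) if is_burst[u, j] else (mu0[u], var0[u])
            mu_v, var_v = (mu1[v], var1[v]) if is_burst[u, j] else (mu0[v], var0[v])
            rep_u = mu_u + np.sqrt(var_u) * rng.standard_normal(d)   # resampled reps
            rep_v = mu_v + np.sqrt(var_v) * rng.standard_normal(d)
            scores[u, j] = (rep_u + rep_v) @ W
    thr = np.quantile(scores, link_quantile)  # threshold keeping ~4% of pairs
    A = (scores > thr).astype(int)
    return A, is_burst & (A == 1)             # adjacency and realized bursty links

snapshots = [snapshot() for _ in range(6)]    # 6 time steps, as in Table 2
```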
Baseline Methods. We compare our model against the following network embedding algorithms.

Figure 4: Adjacency matrix heatmap of sub-graphs at four time stamps of the simulated dataset. Each row represents a user and each column represents an item. The red points represent the vanilla links between users and items, and the black points represent the bursty links. Compared to the bursty evolution, the vanilla evolution is more regular and consistent over time.

DeepWalk: DeepWalk^1 [Perozzi et al., 2014] is a representative embedding method for static networks. DeepWalk generates truncated random walks and uses the Skip-gram algorithm [Mikolov et al., 2013] to learn latent representations by treating walks as the equivalent of sentences.

GraphSAGE: GraphSAGE^2 [Hamilton et al., 2017b] is a general inductive framework for network embedding, which leverages the topological structure and attribute information of vertices to efficiently generate vertex embeddings. GraphSAGE maintains good performance even with random feature inputs.

CTDNE: CTDNE [Nguyen et al., 2018] is a continuous-time dynamic network embedding method. CTDNE generates temporal random walks as the context information of each vertex and uses the Skip-gram algorithm to learn latent representations. In this method, the time of each edge is simply set to the index of its snapshot.

TNE: TNE^3 [Zhu et al., 2016] is a dynamic network embedding algorithm based on matrix factorization. This method adopts a temporal smoothness assumption to ensure the continuity of the embeddings in evolving graphs.

^1 https://github.com/phanein/deepwalk
^2 https://github.com/williamleif/GraphSAGE
^3 https://github.com/linhongseba/Temporal-Network-Embedding
^4 https://github.com/ericZhao93/BurstGraph

It is worth mentioning that DeepWalk and GraphSAGE are static network embedding methods. To facilitate the comparison between our method and these baselines, these two methods are trained on the whole graph, which includes all graph snapshots. In the following, we compare the performance of these methods with BurstGraph^4 on three datasets in terms of Micro-F1 and Macro-F1.

Comparison Results. Table 3 shows the overall performance of the different methods on the three datasets. Our model BurstGraph consistently outperforms all baselines across the datasets. Next, we compare and analyze the results on the vanilla evolution and the bursty links, respectively. For the vanilla evolution, BurstGraph achieves an average performance gain of +1.4% in terms of Micro-F1 and +3.5% in terms of Macro-F1. For the bursty evolution, BurstGraph outperforms the other methods by +5.5% in terms of Micro-F1 and +5.3% in terms of Macro-F1 on average. These results show that splitting evolving graphs into vanilla and bursty evolution benefits not only the performance on the bursty evolution, but also that on the vanilla evolution. Moreover, compared to the other baselines, the performance of BurstGraph is quite robust over the three datasets. Notice that DeepWalk, CTDNE and TNE perform poorly on the simulated dataset. One potential reason could be that the simulated generation process may be easier for message-passing algorithms (e.g., GraphSAGE) than for matrix-factorization-based methods (e.g., DeepWalk, CTDNE or TNE).
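The reported numbers are Micro-F1 and Macro-F1 over the held-out links of each type. As a minimal illustration of how such scores can be computed (this is not the authors' evaluation script; the toy labels are invented and the per-link class labels are assumed to be item categories), scikit-learn's `f1_score` can be used directly:

```python
from sklearn.metrics import f1_score

# y_true / y_pred: class labels for the held-out test links of one type
# (e.g., vanilla links), predicted by the model under evaluation
y_true = [3, 1, 1, 7, 3, 2]
y_pred = [3, 1, 2, 7, 3, 3]

micro_f1 = f1_score(y_true, y_pred, average="micro")  # aggregates over all links
macro_f1 = f1_score(y_true, y_pred, average="macro")  # averages per-class F1 scores
print(f"Micro-F1 = {micro_f1:.3f}, Macro-F1 = {macro_f1:.3f}")
```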
Parameter Analysis. We investigate the sensitivity of different hyperparameters in BurstGraph, including the importance weight $\lambda$ and the random variable dimension $d$. Figure 5 shows the performance of BurstGraph on the Alibaba-S dataset when altering the importance weight $\lambda$ and the variable dimension $d$, respectively. From part (a), we can see that the performance of vanilla link prediction is stable when changing the importance weight $\lambda$, while the performance of burst link prediction rises as the importance weight $\lambda$ increases and converges slowly once the importance weight is larger than 1. From part (b), we can conclude that the performance of BurstGraph is relatively stable within a large range of variable dimensions, and the performance decreases when the variable dimension is either too small or too large.

Figure 5: The performance of BurstGraph on the Alibaba-S dataset with increasing importance weight $\lambda$ (left) or variable dimension $d$ (right).

Visualization of vertex representations. We visualize the embedding vectors of sampled vertices in the simulated dataset and the Alibaba-S dataset learned by BurstGraph, projecting the embedding vectors into a 2-dimensional space with the t-SNE method. As shown in Figure 6, embeddings from the vanilla evolution can be clearly separated from embeddings of the bursty evolution. More specifically, the embeddings of vertices in the vanilla evolution spread out evenly in the space, while the embeddings of vertices in the bursty evolution gather in a relatively small area. The reason could be that a vertex embedding in the vanilla evolution depends strongly on the vertex attributes, whereas a vertex embedding in the bursty evolution does not, because of the sparsity of bursty links.

Figure 6: 2D visualization of embeddings (100 randomly selected vertices) for the vanilla evolution and the bursty evolution on the simulated dataset (left) and the Alibaba-S dataset (right). Embeddings from the vanilla evolution spread out while embeddings from the bursty evolution concentrate in a relatively small area.

5 Conclusion

In this work, we propose a novel approach for evolving graphs, assuming that the evolution of edges in a sequence of graph snapshots can be split into two parts: vanilla and bursty evolution. These two parts use variational autoencoders based on two different prior distributions to reconstruct the graph evolution: the vanilla evolution follows a Gaussian distribution, while the bursty evolution follows a spike-and-slab distribution. Experimental results on real-world datasets show the benefits of our model for bursty link prediction in evolving graphs. However, our model still has limitations. First, only bursty links are considered in our framework, whereas other bursty objects, e.g., vertices and communities, should also be taken into account; we plan to extend our approach to support these bursty objects in the future. Second, we plan to propose a new time series model that supports continuous inputs rather than discretized graph snapshots.

References

[Akoglu and Faloutsos, 2013] Leman Akoglu and Christos Faloutsos. Anomaly, event, and fraud detection in large network datasets. In WSDM, pages 773–774. ACM, 2013.

[Akoglu et al., 2015] Leman Akoglu, Hanghang Tong, and Danai Koutra. Graph based anomaly detection and description: a survey. Data Mining and Knowledge Discovery, 29(3):626–688, 2015.

[Angel et al., 2012] Albert Angel, Nikos Sarkas, Nick Koudas, and Divesh Srivastava. Dense subgraph maintenance under streaming edge weight updates for real-time story identification. Proc. VLDB Endow., 5(6):574–585, 2012.
[Chandola et al., 2009] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Comput. Surv., 41(3):15:1–15:58, July 2009.

[Dai et al., 2016] Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pages 2702–2711, 2016.

[Dong et al., 2017] Yuxiao Dong, Nitesh V. Chawla, and Ananthram Swami. metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 135–144. ACM, 2017.

[Du et al., 2018] Lun Du, Yun Wang, Guojie Song, Zhicong Lu, and Junshan Wang. Dynamic network embedding: An extended approach for skip-gram based network embedding. In IJCAI, pages 2086–2092, 2018.

[Grover and Leskovec, 2016] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864. ACM, 2016.

[Hamilton et al., 2017a] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.

[Hamilton et al., 2017b] William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NIPS, 2017.

[Heard et al., 2010] Nicholas A. Heard, David J. Weston, Kiriaki Platanioti, David J. Hand, et al. Bayesian anomaly detection methods for social networks. The Annals of Applied Statistics, 4(2):645–662, 2010.

[Ishwaran et al., 2005] Hemant Ishwaran, J. Sunil Rao, et al. Spike and slab variable selection: frequentist and Bayesian strategies. The Annals of Statistics, 33(2):730–773, 2005.

[Kingma and Ba, 2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[Kingma and Welling, 2013] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[Kleinberg, 2003] Jon Kleinberg. Bursty and hierarchical structure in streams. Data Mining and Knowledge Discovery, 7(4):373–397, 2003.

[Li et al., 2017] Jundong Li, Harsh Dani, Xia Hu, Jiliang Tang, Yi Chang, and Huan Liu. Attributed network embedding for learning in a dynamic environment. In Proceedings of the 2017 ACM Conference on Information and Knowledge Management, pages 387–396. ACM, 2017.

[Mikolov et al., 2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[Mitchell and Beauchamp, 1988] Toby J. Mitchell and John J. Beauchamp. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032, 1988.

[Nguyen et al., 2018] Giang Hoang Nguyen, John Boaz Lee, Ryan A. Rossi, Nesreen K. Ahmed, Eunyee Koh, and Sungchul Kim. Continuous-time dynamic network embeddings. In Companion Proceedings of The Web Conference 2018, pages 969–976. International World Wide Web Conferences Steering Committee, 2018.

[Parikh and Sundaresan, 2008] Nish Parikh and Neel Sundaresan. Scalable and near real-time burst detection from ecommerce queries. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 972–980. ACM, 2008.
[Perozzi et al., 2014] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM, 2014.

[Qiu et al., 2018] Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. Network embedding as matrix factorization: Unifying DeepWalk, LINE, PTE, and node2vec. In WSDM, pages 459–467. ACM, 2018.

[Trivedi et al., 2017] Rakshit Trivedi, Hanjun Dai, Yichen Wang, and Le Song. Know-Evolve: Deep temporal reasoning for dynamic knowledge graphs. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 3462–3471. JMLR.org, 2017.

[Wang et al., 2016] Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding. In KDD, pages 1225–1234. ACM, 2016.

[Zhou et al., 2018] Lekui Zhou, Yang Yang, Xiang Ren, Fei Wu, and Yueting Zhuang. Dynamic network embedding by modeling triadic closure process. In AAAI, 2018.

[Zhu et al., 2016] Linhong Zhu, Dong Guo, Junming Yin, Greg Ver Steeg, and Aram Galstyan. Scalable temporal latent space inference for link prediction in dynamic social networks. IEEE Transactions on Knowledge and Data Engineering, 28(10):2765–2777, 2016.