# Graph-based Dynamic Word Embeddings

Yuyin Lu¹, Xin Cheng², Ziran Liang¹, Yanghui Rao¹
¹School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
²School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai, China
luyy37@mail2.sysu.edu.cn, chengxin 19@aliyun.com, liangzr5@mail2.sysu.edu.cn, raoyangh@mail.sysu.edu.cn

## Abstract

As time goes by, language evolves and word semantics change. Unfortunately, traditional word embedding methods neglect the evolution of language and assume that word representations are static. Although contextualized word embedding models can capture the diverse representations of polysemous words, they ignore temporal information as well. To tackle the aforementioned challenges, we propose a graph-based dynamic word embedding (GDWE) model, which focuses on continually capturing the semantic drift of words. We introduce word-level knowledge graphs (WKGs) to store short-term and long-term knowledge. WKGs provide rich structural information as a supplement to lexical information, which helps enhance word embedding quality and capture semantic drift quickly. Theoretical analysis and extensive experiments validate the effectiveness of our GDWE on dynamic word embedding learning.

## 1 Introduction

Learning how language changes over time is critical to language understanding [McMahon, 1994]. One indication of language evolution is the change of vocabulary and semantics, which arises from the generation of new words and the meaning drift of old words. Taking the word "apple" as an example, it referred to a kind of fruit in the past, but it often refers to a technology company nowadays. Although traditional word embedding models such as Word2Vec [Mikolov et al., 2013b] and PPMI-SVD [Levy and Goldberg, 2014] provide an efficient way to model word meaning, they neglect temporal information and generate static word embeddings. While contextual word embedding models have made impressive progress on various time-invariant natural language processing (NLP) tasks, their performance degrades on a new corpus far from the training period [Lazaridou et al., 2021].

Due to the importance of learning language development and the limitations of prior word embedding models, dynamic word embedding learning has attracted much attention in recent years. Dynamic word embedding models aim at learning word embeddings under different conditions. Since the change of language over time is one of the most common conditions, learning dynamic word embeddings often refers to learning time-specific word embeddings. The key idea of early dynamic word embedding models [Kulkarni et al., 2015; Hamilton et al., 2016] is to align word embedding spaces pre-trained on every time slice separately, called the alignment-after-training (i.e., two-step) method. Later, Yao et al. [2018] proposed to jointly train the word embedding spaces over all time slices, so as to avoid the alignment process. However, this approach needs the full corpus before training, which fails to meet real-time requirements in practice. Instead of learning a word embedding space for each time slice, Bamler and Mandt [2017] proposed a probabilistic model to incrementally fine-tune a word embedding space. Unfortunately, it may forget long-term semantics and fail to capture new meanings with limited contextual information.
Additionally, several studies developed dynamic contextual word embeddings by introducing temporal information into model components [Hofmann et al., 2021] or input texts [Rosin et al., 2022; Dhingra et al., 2022], but their high training costs hinder them from adapting to rapidly changing conditions.

In this work, we propose a graph-based dynamic word embedding (GDWE) model, which continuously updates a word embedding space, because such a mechanism is more consistent with the development of language [Bamler and Mandt, 2017]. To tackle the aforementioned challenges, we introduce word-level knowledge graphs (WKGs) to encode and store past word co-occurrence knowledge, since semantic drift is often reflected by changes of co-occurring words [Kutuzov et al., 2018]. Besides, a word co-occurrence graph carries valuable structural information and can be adapted to support efficient streaming updates [Spitz and Gertz, 2018]. It is worth noting that language changes more significantly over a longer period. For example, the word "you" in modern English was written as "thee" in the Middle Ages, and it is often abbreviated as "u" in informal writing nowadays. To distinguish the characteristics of language at different time stages, we construct and maintain WKGs that store long-term and short-term knowledge separately. Then, we utilize past knowledge encoded in both short-term and long-term WKGs to learn better dynamic word embeddings at present. Furthermore, to improve the efficiency of GDWE, we train on all texts only once. The auxiliary co-occurrence information retrieved from WKGs helps GDWE avoid the risk of underfitting in online learning and balances efficiency and effectiveness well. Our code and supplementary materials are publicly available at https://github.com/luyy9apples/GDWE.

We summarize the contributions of our work as follows:

- We introduce a short-term WKG and multiple long-term WKGs to encode word co-occurrence information for capturing semantic drift over time.
- We propose a graph-based dynamic word embedding model named GDWE, which updates a time-specific word embedding space efficiently.
- We theoretically prove the correctness of using WKGs to assist dynamic word embedding learning and verify the effectiveness of GDWE via cross-time alignment, text stream classification, and qualitative analysis.

## 2 Related Works

Different from static word representations, dynamic word embeddings can capture semantic drift within language. Since the cost functions of most word embedding models are rotation-invariant, word embedding spaces trained on different time slices do not lie in the same latent semantic space [Yao et al., 2018]; thus, the geometric relations among word embeddings of different time slices lack semantic meaning. Early dynamic word embedding models usually adopt a two-step method: they first pre-train a word embedding space for each time slice using static word embedding models such as Word2Vec [Mikolov et al., 2013b], and then align these temporal embedding spaces. Specifically, Kulkarni et al. [2015] achieved alignment by keeping the embeddings of semantics-invariant words (i.e., anchor words) the same in each time slice. Similarly, Hamilton et al. [2016] used an orthogonal transformation to align embeddings. Different from the above two-step approach, Yao et al. [2018] proposed a model called DW2V that jointly trains time-specific word embedding spaces by matrix factorization.
DW2V avoids the alignment process and shares semantic information across time slices. Recently, Gong et al. [2020] considered learning transformations from a general embedding space to temporal embedding spaces. Following the idea of fine-tuning word embeddings incrementally [Kim et al., 2014; Kaji and Kobayashi, 2017], Bamler and Mandt [2017] proposed a probabilistic model that uses approximate Bayesian inference to learn dynamic word embeddings continuously.

On the other hand, some works [Hu et al., 2019; Giulianelli et al., 2020] employ contextual word embeddings [Devlin et al., 2019] to investigate semantic drift. Since the contextual word embeddings used in these works are not dynamic, Hofmann et al. [2021] proposed a dynamic contextual word embedding model, which transforms pre-trained BERT embeddings into time-specific word embeddings through temporal neural networks. Apart from introducing time-specific model components, Rosin et al. [2022] and Dhingra et al. [2022] modified input texts by prepending time tokens. Unfortunately, the results of several shared tasks [Schlechtweg et al., 2020; Basile et al., 2020] indicated that contextual embeddings perform rather poorly on unsupervised lexical semantic change (LSC) detection. Although contextual embedding-based methods first topped the leaderboard in the LSC task with label supervision and external linguistic resources, the divergent results across LSC tasks may stem from differences between models rather than from the difference between contextual and traditional embeddings [Kutuzov and Pivovarova, 2021]. Besides, contextual embedding models require substantial computational resources for pre-training or fine-tuning, which remains an obstacle to adapting them to dynamic conditions.

## 3 Proposed Method

### 3.1 Problem Definition

For convenience of description, we first define the continuous learning paradigm of dynamic word embeddings. As presented in [Hofmann et al., 2021], the training corpus for dynamic word embeddings is a text stream in which new documents constantly appear. While the word embedding space is updated continuously during training, we split the text stream $D$ into $N$ time slices based on timestamps for the convenience of testing. Let $D^{(i)}$ denote the document set containing all the texts in the $i$-th time slice. The goal of dynamic word embedding models is to obtain a word embedding space $U_i$ and a context embedding space $V_i$, which form a snapshot of the continuously updated embedding space in the $i$-th time slice.

### 3.2 Model Architecture

Our GDWE continuously updates a word embedding space whose coordinate axes remain unchanged. Therefore, GDWE avoids the aforementioned alignment process. Additionally, we introduce WKGs to help our model capture semantic drift with limited lexical information. WKGs not only hold past word co-occurrence information, but also contain structural information such as common neighbors. To improve the efficiency of WKGs, we divide $D^{(i)}$ into a sequence of fixed-size mini-batches $\{B^{(i)}_1, \ldots, B^{(i)}_{M_i}\}$ according to timestamps, where $M_i$ denotes the number of mini-batches in the $i$-th time slice. We then update and utilize WKGs by mini-batches, which approximates continuous learning when the mini-batch size is relatively small. Figure 1 illustrates the architecture of our GDWE, which contains the following modules: **(1) Construct and update the short-term WKG.** To capture short-term knowledge, we first construct the current WKG $G^{(i)}_j$ for $B^{(i)}_j$. Then, we cumulatively update a short-term WKG $G^{(i)}_s$.

Algorithm 2 WKG update algorithm
…
5:   E_u ← {e ∈ E | w_e > δ_e}.
6:   V_u ← {n ∈ V | deg(n, G) > δ_d}.
7:   return updated WKG G_u = (V_u, E_u).
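To make the construction and cumulative update of the short-term WKG concrete, below is a minimal Python sketch (not the authors' released implementation). It assumes raw sliding-window co-occurrence counts as edge weights and a merge-then-prune update that mirrors the surviving pruning steps of Algorithm 2 above; the names `build_wkg`, `update_wkg`, `window_size`, `delta_e`, and `delta_d` are illustrative, and the paper's exact construction procedure may differ.

```python
# Minimal sketch of WKG construction and cumulative update (illustrative only).
# Assumptions: edge weights are sliding-window co-occurrence counts, and the
# update merges two WKGs by summing edge weights, then drops edges with
# weight <= delta_e and nodes with degree <= delta_d (Algorithm 2 style).
from collections import defaultdict

import networkx as nx


def build_wkg(mini_batch, window_size=5):
    """Build the current WKG G_j^(i) from a mini-batch of tokenized documents."""
    weights = defaultdict(float)
    for doc in mini_batch:
        for pos, target in enumerate(doc):
            for ctx in doc[pos + 1: pos + 1 + window_size]:
                if ctx != target:
                    weights[frozenset((target, ctx))] += 1.0
    graph = nx.Graph()
    for pair, w in weights.items():
        u, v = tuple(pair)
        graph.add_edge(u, v, weight=w)
    return graph


def update_wkg(g1, g2, delta_e=1.0, delta_d=1):
    """Merge two WKGs and prune weak edges / low-degree nodes."""
    merged = nx.Graph()
    for g in (g1, g2):
        for u, v, data in g.edges(data=True):
            w = data.get("weight", 1.0)
            if merged.has_edge(u, v):
                merged[u][v]["weight"] += w
            else:
                merged.add_edge(u, v, weight=w)
    # E_u <- {e in E | w_e > delta_e}
    weak_edges = [(u, v) for u, v, d in merged.edges(data=True) if d["weight"] <= delta_e]
    merged.remove_edges_from(weak_edges)
    # V_u <- {n in V | deg(n, G) > delta_d}
    weak_nodes = [n for n in merged.nodes if merged.degree(n) <= delta_d]
    merged.remove_nodes_from(weak_nodes)
    return merged


# Usage: cumulatively fold the current WKG of each mini-batch into the short-term WKG.
short_term = nx.Graph()
for mini_batch in [[["apple", "pie", "tart"]], [["apple", "iphone", "ipad"]]]:
    current = build_wkg(mini_batch, window_size=2)
    short_term = update_wkg(short_term, current, delta_e=0.0, delta_d=0)
print(sorted(short_term.edges(data=True)))
```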
Algorithm 3 Long-term WKGs update algorithm
Input: Document streams {D^(1), …, D^(N)} and their corresponding mini-batch sizes {M_1, …, M_N};
Hyper-parameters: the number of long-term WKGs m; thresholds δ_e and δ_d;
Output: Long-term WKGs G^(i)_L = {G^(i)_{L,1}, …, G^(i)_{L,K_i}}, where i ∈ {2, …, N} and K_i = |G^(i)_L|.
 1: For the 1st time slice: G^(1)_L ← {}.
 2: for i = 2 to N do
 3:   Construct a WKG G^(i−1)_l ← G^(i−1)_s.
 …
 6:   if |G^(i)_L| > m then
 7:     G^(i)_{L,2} ← UpdateWKG(G^(i)_{L,1}, G^(i)_{L,2}, δ_e, δ_d).
 8:     Pop G^(i)_{L,1} from G^(i)_L: G^(i)_{L,b} ← G^(i)_{L,(b+1)}, where b ∈ {1, …, m}.
 9:   end if
10:   yield long-term WKGs G^(i)_L.
11: end for

**Retrieve knowledge from WKGs.** The key idea of utilizing WKGs is to extract word co-occurrence knowledge from the short-term and long-term WKGs as a supplement to lexical information. Specifically, consider a word that exists in both the current WKG (denoted as $G$) and one of the short-term and long-term WKGs (denoted as $G_k \in \{G_s, G_L\}$); we sample its direct neighbors in $G_k$ as auxiliary information. The sampling probability of each neighbor is defined as follows:

$$p(n \mid t, G_k) \propto \frac{w_{n,t}}{\deg(n, G_k)}, \quad (2)$$

where $w_{n,t}$ is the weight of the edge between the target word $t$ and its neighbor word $n$, and $\deg(n, G_k)$ denotes the degree of $n$ in $G_k$. The sampling strategy is based on the assumption that a higher edge weight implies a closer correlation between two words. Besides, we apply a degree-based penalty, since a word with many neighbors may carry limited semantic information, such as articles and personal pronouns.

However, the above neighbor sampling method cannot distinguish whether the meaning of the target word has drifted, so there is a risk of replaying neighbor words related to the target word's old meaning and hindering the model from capturing semantic drift. According to the distributional hypothesis, words that often share the same neighboring words tend to be semantically similar. Therefore, we estimate the similarity between the meanings of a target word in different time slices using the classic Jaccard coefficient:

$$\mathrm{sim}(t_G, t_{G_k}) = \frac{|N(t, G) \cap N(t, G_k)|}{|N(t, G) \cup N(t, G_k)|}, \quad (3)$$

where $N(t, G)$ denotes the neighbors of the target word $t$ in a WKG $G$. The numerator is the number of common neighbors of $t$ in $G$ and $G_k$, and the denominator is a regularization term. If the similarity is higher than a threshold $\delta_s$, the neighbors of $t$ in $G_k$ are used as supplementary information.

**Update dynamic word embeddings.** Inspired by the incremental skip-gram with negative sampling (ISGNS) model [Kaji and Kobayashi, 2017], we define the loss function for updating word embeddings continually with word co-occurrence information as follows:

$$\sum_{c \in C_t} \Big[ \psi^{+}_{t,c} + k\, \mathbb{E}_{s \sim q_t(s)}\, \psi^{-}_{t,s} \Big], \quad (4)$$

where $\psi^{\pm}_{t,x} = \ln \sigma(\pm\, u_t v_x^{\top})$. Here, $c$ is a context word of the target word $t$ within a sliding window $C_t$; $u_t \in U$ denotes the word embedding of the target word $t$, while $v_c \in V$ denotes the embedding of the context word $c$. As we apply the adaptive negative sampling algorithm, $s$ is a negative sample drawn from a smoothed unigram (i.e., noise) distribution with sampling probability $q_t(s) \propto f(s)^{\beta}$, where $f(s)$ is the frequency with which word $s$ has occurred before word $t$ in the text stream and $\beta \in [0, 1]$ is a smoothing factor. $k$ is the number of negative samples, and $\sigma(\cdot)$ is the sigmoid function.
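The following Python sketch shows how Equations (2)–(4) could fit together: neighbors of a target word are sampled from a past WKG with probability proportional to edge weight divided by neighbor degree, the Jaccard check of Equation (3) gates whether those neighbors are replayed, and each accepted (target, neighbor) pair triggers a skip-gram-with-negative-sampling step. This is not the authors' code: the uniform negative sampler (instead of the adaptive $q_t(s)$), the fixed learning rate, and all function and parameter names are assumptions for illustration.

```python
# Illustrative sketch of Equations (2)-(4); not the paper's implementation.
import math
import random

import numpy as np


def sample_neighbors(wkg, target, num_samples):
    """Sample neighbors of `target` with p(n | t, G_k) ~ w_{n,t} / deg(n, G_k) (Eq. 2)."""
    neighbors = list(wkg[target])
    if not neighbors:
        return []
    weights = [wkg[target][n]["weight"] / wkg.degree(n) for n in neighbors]
    return random.choices(neighbors, weights=weights, k=num_samples)


def jaccard_sim(current_wkg, past_wkg, target):
    """Jaccard similarity between the target's neighbor sets in two WKGs (Eq. 3)."""
    cur = set(current_wkg[target])
    past = set(past_wkg[target])
    return len(cur & past) / max(len(cur | past), 1)


def sgns_step(U, V, target, context, vocab, k=5, lr=0.025):
    """One skip-gram-with-negative-sampling update for a (target, context) pair (Eq. 4)."""
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-max(min(x, 30.0), -30.0)))

    samples = [(context, 1.0)] + [(random.choice(vocab), 0.0) for _ in range(k)]
    grad_u = np.zeros_like(U[target])
    for word, label in samples:
        # Gradient of ln sigma(+/- u_t v_x^T) w.r.t. the embeddings.
        g = label - sigmoid(float(np.dot(U[target], V[word])))
        grad_u += g * V[word]
        V[word] += lr * g * U[target]
    U[target] += lr * grad_u


def train_with_wkg(U, V, target, current_wkg, past_wkg, vocab,
                   num_samples=3, delta_s=0.1):
    """Replay past-WKG neighbors as supplementary contexts if the Jaccard check passes."""
    if target not in current_wkg or target not in past_wkg:
        return
    if jaccard_sim(current_wkg, past_wkg, target) <= delta_s:
        return  # likely drifted meaning: do not replay old co-occurrences
    for neighbor in sample_neighbors(past_wkg, target, num_samples):
        sgns_step(U, V, target, neighbor, vocab)
```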
As mentioned earlier, we retrieve knowledge from short-term and long-term WKGs as supplementary samples. For $B^{(i)}_j$, the loss function using WKGs is defined as follows:

$$\sum_{t' \in d} \mathbb{E}_{n \sim p(n \mid t', G_k)} \Big[ \psi^{+}_{t',n} + k\, \mathbb{E}_{s \sim q_{i,j}(s)}\, \psi^{-}_{t',s} \Big], \quad (5)$$

where $G_k \in \{G^{(i)}_s, G^{(i)}_L\}$.

… $w_{n,t} > 0$, and $w_{n,t}$ is the weight of the edge between $n$ and $t$. Then, the trending score of $t$ is defined as follows:

$$\mathrm{TS}(t) = \mathrm{SC}(t)\,\mathrm{C}(t). \quad (9)$$

Figure 4 illustrates the top-25 trending words and their connections for NYT news in 2001. The trending-word detection approach based on WKGs effectively discovered that the 9/11 terrorist attacks were the most concerning social event in 2001. Moreover, details of the event were captured, including the terrorists ("osama bin laden"), the terrorist organization ("qaeda"), and the time of occurrence ("sept"). In addition, the trending words indicate that there was a breakthrough in human embryonic stem cell research in 2001. Finally, the result of trending-word detection also includes words with general meanings, such as entertainment-related words ("singer").

Figure 4: Top-25 trending words and their connections for NYT news in 2001, detected from WKGs. The diameters of nodes and the thickness of edges are proportional to their weights. (Nodes include: laden, qaeda, osama, minimum, taliban, nights, anthrax, cover, afghanistan, stem, terrorist, attacks, singer, twodrink, ferrer, tonight, ballesteros, cells, embryonic, alliance, bloomberg.)

### 5.5 Lexical Semantic Drift

To validate the ability of GDWE to capture lexical semantic drift qualitatively, we describe the changing semantics of the representative target words "apple" in NYT and "network" in Arxiv through their neighbor words at different time slices.

As shown in Table 4, the neighbors of "apple" gradually drifted from dessert-related words to electronic-technology-related words. Additionally, GDWE effectively revealed the products and technologies that attracted attention in different periods. For example, personal computers ("imac") were the main products of the Apple company in 2008. In 2010 and 2011, the hot spot shifted to portable devices ("iphone" and "ipad") and their key technologies, including wireless network technology ("3g") and operating systems ("android"). The neighbor words of "apple" in 2016 then changed to other technology companies (e.g., "microsoft" and "google"). For the target word "network", its semantics gradually drifted from communication networks to neural networks (NNs) with the development of computer science. Specifically, its neighbor words in 2010 and 2014 were mainly technical terms related to communication networks, including "p2p", "internet", and "overlay". In recent years, NNs have gained widespread attention, and GDWE can discover the development of NN architectures over time, from feedforward NNs and generative networks in 2017 to convolutional NNs in 2018, and then to graph convolutional NNs in 2019.
Neighbor words of "apple" in NYT:

| Year | 1 | 2 | 3 | 4 | 5 |
|------|---|---|---|---|---|
| 1993 | blueberry | chocolate | tart | mascarpone | pie |
| 2008 | imac | chips | pc | itunes | server |
| 2010 | iphone | chips | server | conagra | itunes |
| 2011 | chips | blackberry | 3g | android | ipad |
| 2016 | microsoft | android | blackberry | samsung | google |

Neighbor words of "network" in Arxiv:

| Year | 1 | 2 | 3 | 4 | 5 |
|------|---|---|---|---|---|
| 2010 | wireless | peertopeer | connectivity | p2p | internet |
| 2014 | overlay | multiplex | vehicular | peertopeer | sensor |
| 2017 | overlay | neural | feedforward | dnn | generative |
| 2018 | neural | convolutional | feedforward | cnn | dnn |
| 2019 | neural | gcn | convolutional | cnn | capsule |
| 2021 | neural | convolutional | dnn | cnn | gcn |

Table 4: Top-5 neighbor words of the target words "apple" (in NYT) and "network" (in Arxiv) in different years, where different noun forms of "network" are normalized for clarity.

## 6 Efficiency Analysis

### 6.1 Time Complexity

In this part, we analyze the time complexity of our GDWE, especially the update and utilization of WKGs. For a WKG $G$, we use $V$ and $E$ to denote the numbers of its nodes and edges, respectively, and $D$ denotes the average degree of a node in $G$. According to Algorithm 2, updating a WKG takes a merge of two WKGs, a traversal of all nodes, and a traversal of all edges, which costs $O(2E + V)$; the same bound holds for updating a short-term or a long-term WKG. Moreover, according to Equations (2) and (3), the time complexity of sampling the neighbors of a word is $O(2dD)$, where $d$ is the number of samples. So it takes $O(2dDV)$ to retrieve word co-occurrence information from a WKG.

### 6.2 Runtime Comparison

Here, we compare the runtime of our GDWE and the baselines on an Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz, taking the training process of each model on NYT as an example. Among all dynamic word embedding models, DW2V is the fastest, taking about 10 minutes to converge, but it performed poorly in the cross-time alignment and text stream classification tasks. GDWE w/o WKG only needed about 31 minutes for training. Because GDWE requires extra time to update and retrieve knowledge in WKGs, it took about 44 minutes to train. At a reasonable time cost, GDWE achieved better results than GDWE w/o WKG in many experiments; that is, GDWE better balances the effectiveness and efficiency of learning dynamic word embeddings. Because both AW2V and TW2V need to pre-train static word embeddings on each time slice, which brings an extra time cost, they took about 73 and 152 minutes for training, respectively. Finally, although CW2V and DCWE performed competitively in time-specific tasks, their time costs are quite high, at about 18.3 hours and 13.6 hours, respectively.

## 7 Conclusion

In this study, we propose a GDWE model to learn dynamic word embeddings continuously, where WKGs are introduced to encode and update long-term and short-term word co-occurrence knowledge. To help GDWE capture semantic drift even with limited contexts, we retrieve knowledge from WKGs to update dynamic word embeddings. Theoretical analysis and extensive experiments on cross-time alignment, text stream classification, and qualitative analysis validate the effectiveness of our model. Since how to apply contextual word embeddings to dynamic conditions more properly and efficiently remains to be studied, we will further explore the differences between contextual and traditional embeddings in the temporal scenario. Besides, as a WKG provides an effective way to memorize past word co-occurrence information and to distinguish changes of word sense based on common neighbors, it has the potential to be combined with contextual embedding models by developing time-specific self-supervised tasks using temporal co-occurrence knowledge.

## Acknowledgements

We are grateful to the reviewers for their constructive comments and suggestions. This work has been supported by the National Natural Science Foundation of China (61972426).

## References

[Bamler and Mandt, 2017] Robert Bamler and Stephan Mandt. Dynamic word embeddings. In ICML, pages 380–389, 2017.
[Basile et al., 2020] Pierpaolo Basile, Annalina Caputo, Tommaso Caselli, Pierluigi Cassotti, and Rossella Varvara. DIACR-Ita @ EVALITA2020: Overview of the EVALITA2020 diachronic lexical semantics (DIACR-Ita) task. In EVALITA, 2020.

[Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pages 4171–4186, 2019.

[Dhingra et al., 2022] Bhuwan Dhingra, Jeremy R. Cole, Julian Martin Eisenschlos, Daniel Gillick, Jacob Eisenstein, and William W. Cohen. Time-aware language models as temporal knowledge bases. TACL, 10:257–273, 2022.

[Giulianelli et al., 2020] Mario Giulianelli, Marco Del Tredici, and Raquel Fernández. Analysing lexical semantic change with contextualised word representations. In ACL, pages 3960–3973, 2020.

[Gong et al., 2020] Hongyu Gong, Suma Bhat, and Pramod Viswanath. Enriching word embeddings with temporal and spatial information. In CoNLL, pages 1–11, 2020.

[Hamilton et al., 2016] William L. Hamilton, Jure Leskovec, and Dan Jurafsky. Diachronic word embeddings reveal statistical laws of semantic change. In ACL, pages 1489–1501, 2016.

[Hofmann et al., 2021] Valentin Hofmann, Janet B. Pierrehumbert, and Hinrich Schütze. Dynamic contextualized word embeddings. In ACL-IJCNLP, pages 6970–6984, 2021.

[Hu et al., 2019] Renfen Hu, Shen Li, and Shichen Liang. Diachronic sense modeling with deep contextualized word embeddings: An ecological view. In ACL, pages 3899–3908, 2019.

[Kaji and Kobayashi, 2017] Nobuhiro Kaji and Hayato Kobayashi. Incremental skip-gram model with negative sampling. In EMNLP, pages 363–371, 2017.

[Kim et al., 2014] Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. Temporal analysis of language through neural language models. In LTCSS@ACL, pages 61–65, 2014.

[Kulkarni et al., 2015] Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. Statistically significant detection of linguistic change. In WWW, pages 625–635, 2015.

[Kutuzov and Pivovarova, 2021] Andrey Kutuzov and Lidia Pivovarova. RuShiftEval: A shared task on semantic shift detection for Russian. In Dialogue, 2021.

[Kutuzov et al., 2018] Andrey Kutuzov, Lilja Øvrelid, Terrence Szymanski, and Erik Velldal. Diachronic word embeddings and semantic shifts: A survey. In COLING, pages 1384–1397, 2018.

[Lazaridou et al., 2021] Angeliki Lazaridou, Adhiguna Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liska, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d'Autume, Tomáš Kočiský, Sebastian Ruder, Dani Yogatama, Kris Cao, Susannah Young, and Phil Blunsom. Mind the gap: Assessing temporal generalization in neural language models. In NeurIPS, pages 29348–29363, 2021.

[Levy and Goldberg, 2014] Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In NIPS, pages 2177–2185, 2014.

[McMahon, 1994] April M. S. McMahon. Understanding language change. Cambridge University Press, 1994.

[Mikolov et al., 2013a] Tomáš Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In ICLR (Workshop Poster), 2013.

[Mikolov et al., 2013b] Tomáš Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.

[Rosin et al., 2022] Guy D. Rosin, Ido Guy, and Kira Radinsky. Time masking for temporal language models. In WSDM, pages 833–841, 2022.
[Saeed et al., 2019] Zafar Saeed, Rabeeh Ayaz Abbasi, Muhammad Imran Razzak, and Guandong Xu. Event detection in Twitter stream using weighted dynamic heartbeat graph approach [application notes]. IEEE Computational Intelligence Magazine, 14(3):29–38, 2019.

[Schlechtweg et al., 2020] Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky, and Nina Tahmasebi. SemEval-2020 Task 1: Unsupervised lexical semantic change detection. In SemEval@COLING, pages 1–23, 2020.

[Spitz and Gertz, 2018] Andreas Spitz and Michael Gertz. Exploring entity-centric networks in entangled news streams. In WWW (Companion Volume), pages 555–563, 2018.

[Yang et al., 2016] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alexander J. Smola, and Eduard H. Hovy. Hierarchical attention networks for document classification. In NAACL-HLT, pages 1480–1489, 2016.

[Yao et al., 2018] Zijun Yao, Yifan Sun, Weicong Ding, Nikhil Rao, and Hui Xiong. Dynamic word embeddings for evolving semantic discovery. In WSDM, pages 673–681, 2018.

[Zuckerman and Last, 2019] Matan Zuckerman and Mark Last. Using graphs for word embedding with enhanced semantic relations. In TextGraphs@EMNLP, pages 32–41, 2019.