# Analogies Explained: Towards Understanding Word Embeddings

Carl Allen¹, Timothy Hospedales¹

¹School of Informatics, University of Edinburgh. Correspondence to: Carl Allen. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019.

**Abstract.** Word embeddings generated by neural network methods such as word2vec (W2V) are well known to exhibit seemingly linear behaviour, e.g. the embeddings of the analogy "woman is to queen as man is to king" approximately describe a parallelogram. This property is particularly intriguing since the embeddings are not trained to achieve it. Several explanations have been proposed, but each introduces assumptions that do not hold in practice. We derive a probabilistically grounded definition of paraphrasing that we re-interpret as word transformation, a mathematical description of "$w_x$ is to $w_y$". From these concepts we prove the existence of linear relationships between W2V-type embeddings that underlie the analogical phenomenon, identifying explicit error terms.

## 1. Introduction

The vector representation, or embedding, of words underpins much of modern machine learning for natural language processing (e.g. Turney & Pantel (2010)). Where embeddings were previously generated explicitly from word statistics, neural network methods are now commonly used to generate neural embeddings that are of low dimension relative to the number of words represented, yet achieve impressive performance on downstream tasks (e.g. Turian et al. (2010); Socher et al. (2013)). Of these, word2vec (W2V) (Mikolov et al., 2013a) and Glove (Pennington et al., 2014) are amongst the best known and the ones on which we focus. Throughout, we refer to the more commonly used skip-gram implementation of W2V with negative sampling (SGNS).

Interestingly, such embeddings exhibit seemingly linear behaviour (Mikolov et al., 2013b; Levy & Goldberg, 2014a): the respective embeddings of analogies, i.e. word relationships of the form "$w_a$ is to $w_{a^*}$ as $w_b$ is to $w_{b^*}$", often satisfy

$$w_{a^*} - w_a + w_b \approx w_{b^*},$$

where $w_i$ is the embedding of word $w_i$. This enables analogical questions such as "man is to king as woman is to ..?" to be solved by vector addition and subtraction. Such higher-order structure is surprising since word embeddings are trained using only pairwise word co-occurrence data extracted from a text corpus.

We first show that, where embeddings factorise pointwise mutual information (PMI), it is paraphrasing that determines when a linear combination of embeddings equates to that of another word. We say that king paraphrases {man, royal}, for example, if there is a semantic equivalence between king and {man, royal} combined. We measure such equivalence with respect to probability distributions over nearby words, in line with Firth's maxim "You shall know a word by the company it keeps" (Firth, 1957). We then show that paraphrasing can be reinterpreted as word transformation with additive parameters (e.g. from man to king by adding royal), and generalise it to also allow subtraction. Finally, we prove that, by interpreting an analogy "$w_a$ is to $w_{a^*}$ as $w_b$ is to $w_{b^*}$" as word transformations $w_a$ to $w_{a^*}$ and $w_b$ to $w_{b^*}$ sharing the same parameters, the linear relationship observed between word embeddings of analogies follows (see overview in Fig 4).
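To make the vector-arithmetic view of analogies concrete, here is a minimal sketch (not from the paper) of how "man is to king as woman is to ..?" is typically answered: form $w_{king} - w_{man} + w_{woman}$ and return the nearest embedding by cosine similarity, excluding the query words. It assumes a dict `vectors` mapping words to numpy arrays, e.g. loaded from pre-trained W2V or Glove vectors; the 2-D toy vectors below are invented purely for illustration.

```python
import numpy as np

def solve_analogy(vectors, a, a_star, b):
    """Rank words by cosine similarity to v(a_star) - v(a) + v(b)."""
    target = vectors[a_star] - vectors[a] + vectors[b]
    target = target / np.linalg.norm(target)
    scores = {}
    for word, vec in vectors.items():
        if word in {a, a_star, b}:
            continue  # conventionally, the three query words are excluded
        scores[word] = float(vec @ target) / np.linalg.norm(vec)
    return sorted(scores, key=scores.get, reverse=True)

# Toy usage with made-up 2-D vectors (real embeddings are typically 100-300-D):
vectors = {
    "man":   np.array([1.0, 0.0]),
    "king":  np.array([1.0, 1.0]),
    "woman": np.array([0.0, 0.2]),
    "queen": np.array([0.0, 1.2]),
}
print(solve_analogy(vectors, "man", "king", "woman")[0])  # -> "queen"
```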
Our key contributions are:

- to derive a probabilistic definition of paraphrasing and show that it governs the relationship between one (PMI-derived) word embedding and any sum of others;
- to show how paraphrasing can be generalised and interpreted as the transformation from one word to another, giving a mathematical formulation for "$w_x$ is to $w_{x^*}$";
- to provide the first rigorous proof of the linear relationship between word embeddings of analogies, including explicit, interpretable error terms; and
- to show how these relationships materialise between vectors of PMI values, and so too in word embeddings that factorise the PMI matrix, or approximate such a factorisation, e.g. W2V and Glove.

*Figure 1. The relative locations of word embeddings for the analogy "man is to king as woman is to ..?". The closest embedding to the linear combination $w_K - w_M + w_W$ is that of queen. We explain why this occurs and interpret the difference between them.*

## 2. Previous Work

Intuition for the presence of linear analogical relationships, or linguistic regularity, amongst word embeddings was first suggested by Mikolov et al. (2013a;b) and Pennington et al. (2014), and has been widely discussed since (e.g. Levy & Goldberg (2014a); Linzen (2016)). More recently, several theoretical explanations have been proposed:

Arora et al. (2016) propose a latent variable model for language that contains several strong a priori assumptions about the spatial distribution of word vectors, discussed by Gittens et al. (2017), that we do not require. Also, the two embedding matrices of W2V are assumed equal, which we show to be false in practice.

Gittens et al. (2017) refer to paraphrasing, from which we draw inspiration, but make several assumptions that fail in practice: (i) that words follow a uniform distribution rather than the (highly non-uniform) Zipf distribution; (ii) that W2V learns a conditional distribution, which is violated by negative sampling (Levy & Goldberg, 2014b); and (iii) that joint probabilities beyond pairwise co-occurrences are zero.

Ethayarajh et al. (2018) offer a recent explanation based on co-occurrence shifted PMI; however, that property lacks motivation and several assumptions fail, e.g. it takes more than opposite sides of equal length to define a parallelogram in $\mathbb{R}^d$, $d > 2$ (their Lemma 1).

To our knowledge, no previous work mathematically interprets analogies so as to rigorously explain why, if $w_a$ is to $w_{a^*}$ as $w_b$ is to $w_{b^*}$, a linear relationship manifests between corresponding word embeddings.

## 3. Background

The Word2Vec algorithm considers a set of word pairs $\{(w_{i_k}, c_{j_k})\}_k$ generated from a (typically large) text corpus by allowing the target word $w_i$ to range over the corpus, and the context word $c_j$ to range over a context window (of size $l$) symmetric about the target word. For each observed word pair (positive sample), $k$ random word pairs (negative samples) are generated according to monogram distributions. The 2-layer neural network architecture simply multiplies two weight matrices $W, C \in \mathbb{R}^{d \times n}$, subject to a non-linear (sigmoid) function, where $d$ is the embedding dimensionality and $n$ is the size of $\mathcal{E}$, the dictionary of unique words in the corpus. Conventionally, $W$ denotes the matrix closest to the input target words. (A minimal sketch of this pair-generation and negative-sampling step is given below.)
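The following sketch (not from the paper) illustrates the data-generation step just described: positive (target, context) pairs from a symmetric window, plus $k$ negative contexts per pair drawn from the unigram distribution. It omits details of the full SGNS pipeline such as frequent-word subsampling; the 0.75 power smoothing of the sampling distribution is a common SGNS choice rather than something stated in this paper, and all names and the toy corpus are illustrative.

```python
import numpy as np
from collections import Counter

def skipgram_pairs(tokens, window=2):
    """Positive (target, context) pairs from a symmetric context window."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def negative_samples(tokens, num_pairs, k=5, power=0.75, seed=0):
    """Draw k negative context words per positive pair from a (smoothed)
    unigram distribution over the corpus vocabulary."""
    rng = np.random.default_rng(seed)
    counts = Counter(tokens)
    vocab = list(counts)
    probs = np.array([counts[w] for w in vocab], dtype=float) ** power
    probs /= probs.sum()
    return rng.choice(vocab, size=(num_pairs, k), p=probs)

corpus = "the king and the queen rule the royal court".split()
pos = skipgram_pairs(corpus, window=2)
neg = negative_samples(corpus, num_pairs=len(pos), k=5)
print(pos[:3], neg[0])
```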
Columns of $W$ and $C$ are the embeddings of words in $\mathcal{E}$: $w_i \in \mathbb{R}^d$ (the $i^{th}$ column of $W$) corresponds to $w_i$, the $i^{th}$ word in $\mathcal{E}$ observed as a target word; and $c_i \in \mathbb{R}^d$ (the $i^{th}$ column of $C$) corresponds to $c_i$, the same word when observed as a context word. Levy & Goldberg (2014b) identified that the objective function for W2V is optimised if:

$$w_i^\top c_j = \mathrm{PMI}(w_i, c_j) - \log k, \tag{1}$$

where $\mathrm{PMI}(w_i, c_j) = \log \frac{p(w_i, c_j)}{p(w_i)\,p(c_j)}$ is known as pointwise mutual information. In matrix form, this equates to:

$$W^\top C = \mathrm{SPMI} \in \mathbb{R}^{n \times n}, \tag{2}$$

where $\mathrm{SPMI}_{i,j} = \mathrm{PMI}(w_i, c_j) - \log k$ (shifted PMI).

Glove (Pennington et al., 2014) has the same architecture as W2V. Its embeddings perform comparably and also exhibit linear analogical structure. Glove's loss function is optimised when:

$$w_i^\top c_j = \log p(w_i, c_j) - b_i - b_j + \log Z \tag{3}$$

for biases $b_i, b_j$ and normalising constant $Z$. (3) generalises (1) due to the biases, giving Glove greater flexibility than W2V and a potentially wider range of solutions. However, we will show that it is factorisation of the PMI matrix that causes linear analogical structure in embeddings, as approximately achieved by W2V (1). We conjecture that the same rationale underpins analogical structure in Glove embeddings, perhaps more weakly due to its increased flexibility.

## 4. Preliminaries

We consider pertinent aspects of the relationship between word embeddings and co-occurrence statistics (1, 2) relevant to the linear structure between embeddings of analogies.

**Impact of the Shift.** As a chosen hyper-parameter, reflecting nothing of word properties, any effect on embeddings of $k$ appearing in (1) is arbitrary. Comparing typical values of $k$ with empirical PMI values (Fig 2) shows that the so-called shift ($-\log k$) may also be material. Further, it is observed that adjusting the W2V algorithm to avoid any direct impact of the shift improves embedding performance (Le, 2017). We conclude that the shift is a detrimental artefact of the W2V algorithm and, unless stated otherwise, consider embeddings that factorise the unshifted PMI matrix:

$$w_i^\top c_j = \mathrm{PMI}(w_i, c_j) \quad \text{or} \quad W^\top C = \mathrm{PMI}. \tag{4}$$

*Figure 2. Histogram of $\mathrm{PMI}(w_i, c_j)$ for word pairs randomly sampled from text (blue), with $\mathrm{PMI}(w_i, c_i)$ for the same word overlaid (red, scale enlarged). The shift is material for typical values of $k$ (e.g. $\log 5$, $\log 10$, $\log 20$).*

**Reconstruction Error.** In practice, (2) and (4) hold only approximately since $W^\top C \in \mathbb{R}^{n \times n}$ is rank-constrained (rank $r \le d < n$) relative to the factored matrix $M$, e.g. $M = \mathrm{PMI}$ in (4). Recovering elements of $M$ from $W$ and $C$ is thus subject to reconstruction error. However, we rely throughout on linear relationships in $\mathbb{R}^n$, requiring only that they are sufficiently maintained when projected down into $\mathbb{R}^d$, the space of embeddings. To ensure this, we assume:

- **A1.** $C$ has full row rank.
- **A2.** Letting $M_i$ denote the $i^{th}$ column of factored matrix $M \in \mathbb{R}^{n \times n}$, the projection $f : \mathbb{R}^n \to \mathbb{R}^d$, $f(M_i) = w_i$, is approximately homomorphic with respect to addition, i.e. $f(M_i + M_j) \approx f(M_i) + f(M_j)$.

A1 is reasonable since $d \ll n$ and $d$ is chosen. A2 means that, whatever the factorisation method used (e.g. analytic, W2V, Glove, weighted matrix factorisation (Srebro & Jaakkola, 2003)), linear relationships between columns of $M$ are sufficiently preserved by columns of $W$, i.e. the embeddings $w_i$.
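As a schematic illustration of (4) and of the "analytic" factorisation route mentioned under A2, the sketch below builds a PMI matrix from co-occurrence counts and factorises it with a truncated SVD, splitting the singular values evenly between $W$ and $C$ (one common convention, not prescribed by the paper). The co-occurrence counts are random placeholders, and the `eps` guard stands in for assumption A3 that all considered probabilities are positive.

```python
import numpy as np

def pmi_matrix(cooc, eps=1e-12):
    """PMI_ij = log p(w_i, c_j) / (p(w_i) p(c_j)) from an n x n
    co-occurrence count matrix `cooc`."""
    p_joint = cooc / cooc.sum()
    p_w = p_joint.sum(axis=1, keepdims=True)   # marginal over target words
    p_c = p_joint.sum(axis=0, keepdims=True)   # marginal over context words
    return np.log(p_joint + eps) - np.log(p_w @ p_c + eps)

def factorise(M, d):
    """Rank-d factorisation M ~= W.T @ C via truncated SVD, sharing the
    singular values symmetrically between the two embedding matrices."""
    U, s, Vt = np.linalg.svd(M)
    W = (U[:, :d] * np.sqrt(s[:d])).T          # d x n target embeddings
    C = np.sqrt(s[:d])[:, None] * Vt[:d]       # d x n context embeddings
    return W, C

# Toy usage: random co-occurrence counts in place of corpus statistics.
rng = np.random.default_rng(0)
cooc = rng.integers(1, 50, size=(20, 20)).astype(float)
PMI = pmi_matrix(cooc)
W, C = factorise(PMI, d=5)
print(np.abs(W.T @ C - PMI).max())  # reconstruction error from the rank constraint
```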
For example, minimising a least squares loss function gives the linear projection $w_i = f^{LSQ}(M_i) = C^{\dagger} M_i$, for which A2 holds exactly (where $C^{\dagger} = (CC^\top)^{-1}C$ is the Moore-Penrose pseudo-inverse of $C^\top$, which exists by A1); whereas for W2V, $w_i = f^{W2V}(M_i)$ is non-linear. W.l.o.g. we write $f(\cdot) = C^{\dagger}(\cdot)$ throughout (except in specific cases) to emphasise the linearity of the relationship. It is beyond the scope of this work to show that A2 is satisfied when the W2V loss function is minimised (4); we instead prove the existence of linear relationships in the full-rank space of PMI columns, and thus in linear projections thereof, and assume A2 holds sufficiently for W2V embeddings given (2) and the empirical observation of linearity.

**Zero Co-occurrence Counts.** The co-occurrence of rare words is often unobserved, so their empirical probability estimates are zero and their PMI estimates undefined. However, for a fixed dictionary $\mathcal{E}$, such zero counts decline as the corpus or context window size increases (the latter can be arbitrarily large if more distant words are down-weighted, e.g. Pennington et al. (2014)). Here, we consider small word sets $\mathcal{W}$ and assume the corpus and context window to be of sufficient size that the true values of the considered probabilities are non-zero and their PMI values well-defined, i.e.:

- **A3.** $p(\mathcal{W}) > 0$ for every word set $\mathcal{W} \subseteq \mathcal{E}$ under consideration.

## 6. Analogies

An analogy is said to hold for words $w_a, w_{a^*}, w_b, w_{b^*} \in \mathcal{E}$ if, in some sense, $w_a$ is to $w_{a^*}$ as $w_b$ is to $w_{b^*}$. Since in principle the same relationship may extend further ("... as $w_c$ is to $w_{c^*}$", etc.), we characterise a general analogy $\mathcal{A}$ by a set of ordered word pairs $S_\mathcal{A} \subseteq \mathcal{E} \times \mathcal{E}$, where $(w_x, w_{x^*}) \in S_\mathcal{A}$, $w_x, w_{x^*} \in \mathcal{E}$, iff $w_x$ is to $w_{x^*}$ as ... [all other analogical pairs] under $\mathcal{A}$. Our aim is to explain why the respective word embeddings often satisfy:

$$w_{b^*} \approx w_{a^*} - w_a + w_b, \tag{8}$$

or why, in the more general case:

$$w_{x^*} - w_x \approx u_\mathcal{A}, \tag{9}$$

for all $(w_x, w_{x^*}) \in S_\mathcal{A}$ and a vector $u_\mathcal{A} \in \mathbb{R}^d$ specific to $\mathcal{A}$.

We split the task of understanding why analogies give rise to Equations 8 and 9 into: Q1) understanding the conditions under which word embeddings can be added and subtracted to approximate other embeddings; Q2) establishing a mathematical interpretation of "$w_x$ is to $w_{x^*}$"; and Q3) drawing a correspondence between those results. We show that all of these can be answered with paraphrasing, by generalising the notion to word sets.

### 6.1. Paraphrasing Word Sets

**Definition D2.** We say word set $\mathcal{W} \subseteq \mathcal{E}$ paraphrases word set $\mathcal{W}^* \subseteq \mathcal{E}$, $|\mathcal{W}|, |\mathcal{W}^*|$
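Returning to (8) and (9), the following numerical sketch (not from the paper; the PMI matrix, context matrix and word indices are random placeholders) illustrates the mechanism argued for above: if the parallelogram relation (8) holds approximately between full PMI columns in $\mathbb{R}^n$, then under the linear projection $f^{LSQ}$ of Section 4 the embedding residual in $\mathbb{R}^d$ is exactly the projected PMI residual, so the relation carries over to the embeddings.

```python
import numpy as np

# Hypothetical set-up: indices a, a_star, b, b_star chosen so that
# relation (8) holds approximately between PMI *columns* in R^n.
rng = np.random.default_rng(1)
n, d = 50, 10
PMI = rng.normal(size=(n, n))
a, a_star, b, b_star = 0, 1, 2, 3
PMI[:, b_star] = PMI[:, a_star] - PMI[:, a] + PMI[:, b] + 0.01 * rng.normal(size=n)

C = rng.normal(size=(d, n))                # context embeddings; full row rank (A1)
C_dagger = np.linalg.inv(C @ C.T) @ C      # Moore-Penrose pseudo-inverse of C.T

def f_lsq(col):
    """Least-squares embedding of a PMI column: w_i = C_dagger @ M_i."""
    return C_dagger @ col

w = {i: f_lsq(PMI[:, i]) for i in (a, a_star, b, b_star)}

# residual of relation (8) in R^n (PMI columns) and in R^d (embeddings)
res_pmi = PMI[:, b_star] - (PMI[:, a_star] - PMI[:, a] + PMI[:, b])
res_emb = w[b_star] - (w[a_star] - w[a] + w[b])

# f_lsq is linear, so the embedding residual is exactly the projected PMI
# residual: a small error in R^n stays small in R^d (A2 holds exactly here).
print(np.linalg.norm(res_pmi), np.linalg.norm(res_emb))
print(np.allclose(res_emb, C_dagger @ res_pmi))  # True
```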