# Analogies Explained: Towards Understanding Word Embeddings

Carl Allen¹, Timothy Hospedales¹

¹School of Informatics, University of Edinburgh. Correspondence to: Carl Allen. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019.

**Abstract.** Word embeddings generated by neural network methods such as word2vec (W2V) are well known to exhibit seemingly linear behaviour, e.g. the embeddings of the analogy "woman is to queen as man is to king" approximately describe a parallelogram. This property is particularly intriguing since the embeddings are not trained to achieve it. Several explanations have been proposed, but each introduces assumptions that do not hold in practice. We derive a probabilistically grounded definition of paraphrasing that we re-interpret as word transformation, a mathematical description of "$w_x$ is to $w_y$". From these concepts we prove the existence of linear relationships between W2V-type embeddings that underlie the analogical phenomenon, identifying explicit error terms.

## 1. Introduction

The vector representation, or embedding, of words underpins much of modern machine learning for natural language processing (e.g. Turney & Pantel (2010)). Where embeddings were previously generated explicitly from word statistics, neural network methods are now commonly used to generate neural embeddings that are of low dimension relative to the number of words represented, yet achieve impressive performance on downstream tasks (e.g. Turian et al. (2010); Socher et al. (2013)). Of these, word2vec (W2V) (Mikolov et al., 2013a) and Glove (Pennington et al., 2014) are amongst the best known and the ones on which we focus. Throughout, we refer to the more commonly used skip-gram implementation of W2V with negative sampling (SGNS).

Interestingly, such embeddings exhibit seemingly linear behaviour (Mikolov et al., 2013b; Levy & Goldberg, 2014a): the respective embeddings of analogies, i.e. word relationships of the form "$w_a$ is to $w_{a^*}$ as $w_b$ is to $w_{b^*}$", often satisfy

$$w_{a^*} - w_a + w_b \approx w_{b^*},$$

where $w_i$ is the embedding of word $w_i$. This enables analogical questions such as "man is to king as woman is to ..?" to be solved by vector addition and subtraction. Such higher-order structure is surprising since word embeddings are trained using only pairwise word co-occurrence data extracted from a text corpus.

We first show that, where embeddings factorise pointwise mutual information (PMI), it is paraphrasing that determines when a linear combination of embeddings equates to that of another word. We say that king paraphrases {man, royal}, for example, if there is a semantic equivalence between king and {man, royal} combined. We measure such equivalence with respect to probability distributions over nearby words, in line with Firth's maxim "You shall know a word by the company it keeps" (Firth, 1957). We then show that paraphrasing can be reinterpreted as word transformation with additive parameters (e.g. from man to king by adding royal), and generalise it to also allow subtraction. Finally, we prove that, by interpreting an analogy "$w_a$ is to $w_{a^*}$ as $w_b$ is to $w_{b^*}$" as word transformations $w_a$ to $w_{a^*}$ and $w_b$ to $w_{b^*}$ sharing the same parameters, the linear relationship observed between word embeddings of analogies follows (see overview in Fig 4).
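To make the vector-arithmetic view of analogies concrete, here is a minimal sketch (not from the paper) of how "man is to king as woman is to ..?" is typically answered: form $w_{king} - w_{man} + w_{woman}$ and return the nearest embedding by cosine similarity, excluding the query words. It assumes a dict `vectors` mapping words to numpy arrays, e.g. loaded from pre-trained W2V or Glove vectors; the 2-D toy vectors below are invented purely for illustration.

```python
import numpy as np

def solve_analogy(vectors, a, a_star, b):
    """Rank words by cosine similarity to v(a_star) - v(a) + v(b)."""
    target = vectors[a_star] - vectors[a] + vectors[b]
    target = target / np.linalg.norm(target)
    scores = {}
    for word, vec in vectors.items():
        if word in {a, a_star, b}:
            continue  # conventionally, the three query words are excluded
        scores[word] = float(vec @ target) / np.linalg.norm(vec)
    return sorted(scores, key=scores.get, reverse=True)

# Toy usage with made-up 2-D vectors (real embeddings are typically 100-300-D):
vectors = {
    "man":   np.array([1.0, 0.0]),
    "king":  np.array([1.0, 1.0]),
    "woman": np.array([0.0, 0.2]),
    "queen": np.array([0.0, 1.2]),
}
print(solve_analogy(vectors, "man", "king", "woman")[0])  # -> "queen"
```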
Our key contributions are:

- to derive a probabilistic definition of paraphrasing and show that it governs the relationship between one (PMI-derived) word embedding and any sum of others;
- to show how paraphrasing can be generalised and interpreted as the transformation from one word to another, giving a mathematical formulation for "$w_x$ is to $w_{x^*}$";
- to provide the first rigorous proof of the linear relationship between word embeddings of analogies, including explicit, interpretable error terms; and
- to show how these relationships materialise between vectors of PMI values, and so too in word embeddings that factorise the PMI matrix, or approximate such a factorisation, e.g. W2V and Glove.

*Figure 1. The relative locations of word embeddings for the analogy "man is to king as woman is to ..?". The closest embedding to the linear combination $w_K - w_M + w_W$ is that of queen. We explain why this occurs and interpret the difference between them.*

## 2. Previous Work

Intuition for the presence of linear analogical relationships, or linguistic regularity, amongst word embeddings was first suggested by Mikolov et al. (2013a;b) and Pennington et al. (2014), and has been widely discussed since (e.g. Levy & Goldberg (2014a); Linzen (2016)). More recently, several theoretical explanations have been proposed:

Arora et al. (2016) propose a latent variable model for language that contains several strong a priori assumptions about the spatial distribution of word vectors, discussed by Gittens et al. (2017), that we do not require. Also, the two embedding matrices of W2V are assumed equal, which we show to be false in practice.

Gittens et al. (2017) refer to paraphrasing, from which we draw inspiration, but make several assumptions that fail in practice: (i) that words follow a uniform distribution rather than the (highly non-uniform) Zipf distribution; (ii) that W2V learns a conditional distribution, which is violated by negative sampling (Levy & Goldberg, 2014b); and (iii) that joint probabilities beyond pairwise co-occurrences are zero.

Ethayarajh et al. (2018) offer a recent explanation based on co-occurrence shifted PMI; however, that property lacks motivation and several assumptions fail, e.g. it takes more than opposite sides of equal length to define a parallelogram in $\mathbb{R}^d$, $d > 2$ (their Lemma 1).

To our knowledge, no previous work mathematically interprets analogies so as to rigorously explain why, if $w_a$ is to $w_{a^*}$ as $w_b$ is to $w_{b^*}$, a linear relationship manifests between corresponding word embeddings.

## 3. Background

The Word2Vec algorithm considers a set of word pairs $\{(w_{i_k}, c_{j_k})\}_k$ generated from a (typically large) text corpus by allowing the target word $w_i$ to range over the corpus, and the context word $c_j$ to range over a context window (of size $l$) symmetric about the target word. For each observed word pair (positive sample), $k$ random word pairs (negative samples) are generated according to monogram distributions. The 2-layer neural network architecture simply multiplies two weight matrices $W, C \in \mathbb{R}^{d \times n}$, subject to a non-linear (sigmoid) function, where $d$ is the embedding dimensionality and $n$ is the size of $\mathcal{E}$, the dictionary of unique words in the corpus. Conventionally, $W$ denotes the matrix closest to the input target words. (A minimal sketch of this pair-generation and negative-sampling step is given below.)
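The following sketch (not from the paper) illustrates the data-generation step just described: positive (target, context) pairs from a symmetric window, plus $k$ negative contexts per pair drawn from the unigram distribution. It omits details of the full SGNS pipeline such as frequent-word subsampling; the 0.75 power smoothing of the sampling distribution is a common SGNS choice rather than something stated in this paper, and all names and the toy corpus are illustrative.

```python
import numpy as np
from collections import Counter

def skipgram_pairs(tokens, window=2):
    """Positive (target, context) pairs from a symmetric context window."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def negative_samples(tokens, num_pairs, k=5, power=0.75, seed=0):
    """Draw k negative context words per positive pair from a (smoothed)
    unigram distribution over the corpus vocabulary."""
    rng = np.random.default_rng(seed)
    counts = Counter(tokens)
    vocab = list(counts)
    probs = np.array([counts[w] for w in vocab], dtype=float) ** power
    probs /= probs.sum()
    return rng.choice(vocab, size=(num_pairs, k), p=probs)

corpus = "the king and the queen rule the royal court".split()
pos = skipgram_pairs(corpus, window=2)
neg = negative_samples(corpus, num_pairs=len(pos), k=5)
print(pos[:3], neg[0])
```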
Columns of $W$ and $C$ are the embeddings of words in $\mathcal{E}$: $w_i \in \mathbb{R}^d$ (the $i^{th}$ column of $W$) corresponds to $w_i$, the $i^{th}$ word in $\mathcal{E}$ observed as a target word; and $c_i \in \mathbb{R}^d$ (the $i^{th}$ column of $C$) corresponds to $c_i$, the same word when observed as a context word. Levy & Goldberg (2014b) identified that the objective function for W2V is optimised if:

$$w_i^\top c_j = \mathrm{PMI}(w_i, c_j) - \log k, \tag{1}$$

where $\mathrm{PMI}(w_i, c_j) = \log \frac{p(w_i, c_j)}{p(w_i)\,p(c_j)}$ is known as pointwise mutual information. In matrix form, this equates to:

$$W^\top C = \mathrm{SPMI} \in \mathbb{R}^{n \times n}, \tag{2}$$

where $\mathrm{SPMI}_{i,j} = \mathrm{PMI}(w_i, c_j) - \log k$ (shifted PMI).

Glove (Pennington et al., 2014) has the same architecture as W2V. Its embeddings perform comparably and also exhibit linear analogical structure. Glove's loss function is optimised when:

$$w_i^\top c_j = \log p(w_i, c_j) - b_i - b_j + \log Z \tag{3}$$

for biases $b_i, b_j$ and normalising constant $Z$. (3) generalises (1) due to the biases, giving Glove greater flexibility than W2V and a potentially wider range of solutions. However, we will show that it is factorisation of the PMI matrix that causes linear analogical structure in embeddings, as approximately achieved by W2V (1). We conjecture that the same rationale underpins analogical structure in Glove embeddings, perhaps more weakly due to its increased flexibility.

## 4. Preliminaries

We consider pertinent aspects of the relationship between word embeddings and co-occurrence statistics (1, 2) relevant to the linear structure between embeddings of analogies.

**Impact of the Shift.** As a chosen hyper-parameter, reflecting nothing of word properties, any effect on embeddings of $k$ appearing in (1) is arbitrary. Comparing typical values of $k$ with empirical PMI values (Fig 2) shows that the so-called shift ($-\log k$) may also be material. Further, it is observed that adjusting the W2V algorithm to avoid any direct impact of the shift improves embedding performance (Le, 2017). We conclude that the shift is a detrimental artefact of the W2V algorithm and, unless stated otherwise, consider embeddings that factorise the unshifted PMI matrix:

$$w_i^\top c_j = \mathrm{PMI}(w_i, c_j) \quad \text{or} \quad W^\top C = \mathrm{PMI}. \tag{4}$$

*Figure 2. Histogram of $\mathrm{PMI}(w_i, c_j)$ for word pairs randomly sampled from text (blue), with $\mathrm{PMI}(w_i, c_i)$ for the same word overlaid (red, scale enlarged). The shift is material for typical values of $k$ (e.g. $\log 5$, $\log 10$, $\log 20$).*

**Reconstruction Error.** In practice, (2) and (4) hold only approximately since $W^\top C \in \mathbb{R}^{n \times n}$ is rank-constrained (rank $r \le d < n$) relative to the factored matrix $M$, e.g. $M = \mathrm{PMI}$ in (4). Recovering elements of $M$ from $W$ and $C$ is thus subject to reconstruction error. However, we rely throughout on linear relationships in $\mathbb{R}^n$, requiring only that they are sufficiently maintained when projected down into $\mathbb{R}^d$, the space of embeddings. To ensure this, we assume:

- **A1.** $C$ has full row rank.
- **A2.** Letting $M_i$ denote the $i^{th}$ column of factored matrix $M \in \mathbb{R}^{n \times n}$, the projection $f : \mathbb{R}^n \to \mathbb{R}^d$, $f(M_i) = w_i$, is approximately homomorphic with respect to addition, i.e. $f(M_i + M_j) \approx f(M_i) + f(M_j)$.

A1 is reasonable since $d \ll n$ and $d$ is chosen. A2 means that, whatever the factorisation method used (e.g. analytic, W2V, Glove, weighted matrix factorisation (Srebro & Jaakkola, 2003)), linear relationships between columns of $M$ are sufficiently preserved by columns of $W$, i.e. the embeddings $w_i$.
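As a schematic illustration of (4) and of the "analytic" factorisation route mentioned under A2, the sketch below builds a PMI matrix from co-occurrence counts and factorises it with a truncated SVD, splitting the singular values evenly between $W$ and $C$ (one common convention, not prescribed by the paper). The co-occurrence counts are random placeholders, and the `eps` guard stands in for assumption A3 that all considered probabilities are positive.

```python
import numpy as np

def pmi_matrix(cooc, eps=1e-12):
    """PMI_ij = log p(w_i, c_j) / (p(w_i) p(c_j)) from an n x n
    co-occurrence count matrix `cooc`."""
    p_joint = cooc / cooc.sum()
    p_w = p_joint.sum(axis=1, keepdims=True)   # marginal over target words
    p_c = p_joint.sum(axis=0, keepdims=True)   # marginal over context words
    return np.log(p_joint + eps) - np.log(p_w @ p_c + eps)

def factorise(M, d):
    """Rank-d factorisation M ~= W.T @ C via truncated SVD, sharing the
    singular values symmetrically between the two embedding matrices."""
    U, s, Vt = np.linalg.svd(M)
    W = (U[:, :d] * np.sqrt(s[:d])).T          # d x n target embeddings
    C = np.sqrt(s[:d])[:, None] * Vt[:d]       # d x n context embeddings
    return W, C

# Toy usage: random co-occurrence counts in place of corpus statistics.
rng = np.random.default_rng(0)
cooc = rng.integers(1, 50, size=(20, 20)).astype(float)
PMI = pmi_matrix(cooc)
W, C = factorise(PMI, d=5)
print(np.abs(W.T @ C - PMI).max())  # reconstruction error from the rank constraint
```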
For example, minimising a least squares loss function gives the linear projection $w_i = f^{LSQ}(M_i) = C^{\dagger} M_i$, for which A2 holds exactly (where $C^{\dagger} = (CC^\top)^{-1}C$ is the Moore-Penrose pseudo-inverse of $C^\top$, which exists by A1); whereas for W2V, $w_i = f^{W2V}(M_i)$ is non-linear. W.l.o.g. we write $f(\cdot) = C^{\dagger}(\cdot)$ throughout (except in specific cases) to emphasise the linearity of the relationship. It is beyond the scope of this work to show that A2 is satisfied when the W2V loss function is minimised (4); we instead prove the existence of linear relationships in the full-rank space of PMI columns, and thus in linear projections thereof, and assume A2 holds sufficiently for W2V embeddings given (2) and the empirical observation of linearity.

**Zero Co-occurrence Counts.** The co-occurrence of rare words is often unobserved, so their empirical probability estimates are zero and their PMI estimates undefined. However, for a fixed dictionary $\mathcal{E}$, such zero counts decline as the corpus or context window size increases (the latter can be arbitrarily large if more distant words are down-weighted, e.g. Pennington et al. (2014)). Here, we consider small word sets $\mathcal{W}$ and assume the corpus and context window to be of sufficient size that the true values of the considered probabilities are non-zero and their PMI values well-defined, i.e.:

- **A3.** $p(\mathcal{W}) > 0$ for every word set $\mathcal{W} \subseteq \mathcal{E}$ under consideration.

## 6. Analogies

An analogy is said to hold for words $w_a, w_{a^*}, w_b, w_{b^*} \in \mathcal{E}$ if, in some sense, $w_a$ is to $w_{a^*}$ as $w_b$ is to $w_{b^*}$. Since in principle the same relationship may extend further ("... as $w_c$ is to $w_{c^*}$", etc.), we characterise a general analogy $\mathcal{A}$ by a set of ordered word pairs $S_\mathcal{A} \subseteq \mathcal{E} \times \mathcal{E}$, where $(w_x, w_{x^*}) \in S_\mathcal{A}$, $w_x, w_{x^*} \in \mathcal{E}$, iff $w_x$ is to $w_{x^*}$ as ... [all other analogical pairs] under $\mathcal{A}$. Our aim is to explain why the respective word embeddings often satisfy:

$$w_{b^*} \approx w_{a^*} - w_a + w_b, \tag{8}$$

or why, in the more general case:

$$w_{x^*} - w_x \approx u_\mathcal{A}, \tag{9}$$

for all $(w_x, w_{x^*}) \in S_\mathcal{A}$ and a vector $u_\mathcal{A} \in \mathbb{R}^d$ specific to $\mathcal{A}$.

We split the task of understanding why analogies give rise to Equations 8 and 9 into: Q1) understanding the conditions under which word embeddings can be added and subtracted to approximate other embeddings; Q2) establishing a mathematical interpretation of "$w_x$ is to $w_{x^*}$"; and Q3) drawing a correspondence between those results. We show that all of these can be answered with paraphrasing, by generalising the notion to word sets.

### 6.1. Paraphrasing Word Sets

**Definition D2.** We say word set $\mathcal{W} \subseteq \mathcal{E}$ paraphrases word set $\mathcal{W}^* \subseteq \mathcal{E}$, $|\mathcal{W}|, |\mathcal{W}^*|$
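Returning to (8) and (9), the following numerical sketch (not from the paper; the PMI matrix, context matrix and word indices are random placeholders) illustrates the mechanism argued for above: if the parallelogram relation (8) holds approximately between full PMI columns in $\mathbb{R}^n$, then under the linear projection $f^{LSQ}$ of Section 4 the embedding residual in $\mathbb{R}^d$ is exactly the projected PMI residual, so the relation carries over to the embeddings.

```python
import numpy as np

# Hypothetical set-up: indices a, a_star, b, b_star chosen so that
# relation (8) holds approximately between PMI *columns* in R^n.
rng = np.random.default_rng(1)
n, d = 50, 10
PMI = rng.normal(size=(n, n))
a, a_star, b, b_star = 0, 1, 2, 3
PMI[:, b_star] = PMI[:, a_star] - PMI[:, a] + PMI[:, b] + 0.01 * rng.normal(size=n)

C = rng.normal(size=(d, n))                # context embeddings; full row rank (A1)
C_dagger = np.linalg.inv(C @ C.T) @ C      # Moore-Penrose pseudo-inverse of C.T

def f_lsq(col):
    """Least-squares embedding of a PMI column: w_i = C_dagger @ M_i."""
    return C_dagger @ col

w = {i: f_lsq(PMI[:, i]) for i in (a, a_star, b, b_star)}

# residual of relation (8) in R^n (PMI columns) and in R^d (embeddings)
res_pmi = PMI[:, b_star] - (PMI[:, a_star] - PMI[:, a] + PMI[:, b])
res_emb = w[b_star] - (w[a_star] - w[a] + w[b])

# f_lsq is linear, so the embedding residual is exactly the projected PMI
# residual: a small error in R^n stays small in R^d (A2 holds exactly here).
print(np.linalg.norm(res_pmi), np.linalg.norm(res_emb))
print(np.allclose(res_emb, C_dagger @ res_pmi))  # True
```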