# Bayesian Neural Word Embedding

Oren Barkan
Tel Aviv University, Israel
Microsoft, Israel

**Abstract.** Recently, several works in the domain of natural language processing presented successful methods for word embedding. Among them, Skip-Gram with negative sampling, also known as word2vec, advanced the state of the art on various linguistic tasks. In this paper, we propose a scalable Bayesian neural word embedding algorithm. The algorithm relies on a Variational Bayes solution to the Skip-Gram objective, and a detailed step-by-step description is provided. We present experimental results that demonstrate the performance of the proposed algorithm on word analogy and similarity tasks over six different datasets and show that it is competitive with the original Skip-Gram method.

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

## 1 Introduction and Related Work

Recent progress in neural word embedding methods has advanced the state of the art in natural language processing (NLP) (Pennington, Socher, and Manning 2014; Collobert and Weston 2008; Mnih and Hinton 2008; Mikolov et al. 2013; Vilnis and McCallum 2015; Zhang et al. 2014). These methods learn a low dimensional representation of words that captures semantic and syntactic relations. Specifically, Skip-Gram (SG) with negative sampling, also known as word2vec (Mikolov et al. 2013), set new records on various linguistic tasks, and its applications have been extended to domains beyond NLP such as computer vision (Frome, Corrado, and Shlens 2013; Lazaridou, Nghia, and Baroni 2015) and collaborative filtering (CF) (Barkan and Koenigstein 2016; Barkan, Brumer, and Koenigstein 2016).

In this paper, we propose a scalable Bayesian neural word embedding algorithm that, in principle, can be applied to any dataset given as sets or sequences of items. Hence, the proposed method is not limited to the task of word embedding and may be applicable to general item similarity tasks as well. We provide a fully detailed step-by-step algorithm that is straightforward to implement and requires a negligible amount of parameter tuning.

Bayesian methods for word representation are proposed in (Vilnis and McCallum 2015; Zhang et al. 2014). Different from these methods, we propose a Variational Bayes (VB) (Bishop 2006) solution to the SG objective; therefore, we name our method Bayesian Skip-Gram (BSG). VB solutions provide stable and robust behavior and require negligible effort in hyperparameter tuning (Bishop 2006). This is in contrast to point estimate solutions, which are more sensitive to singularities in the objective and often require a significant amount of hyperparameter tuning. While the SG method maps words to vectors, BSG maps words to densities in a latent space. Moreover, BSG provides a confidence level for the embedding and opens up density based similarity measures.

Our contribution is twofold: first, we derive a tractable (yet scalable) Bayesian solution to the SG objective and provide a detailed step-by-step algorithm. Second, we propose several density based similarity measures that can be investigated in further research.

The rest of the paper is organized as follows: Section 2 overviews the SG method. In Section 3, we provide the mathematical derivation of the BSG solution. In Section 4, we describe the BSG algorithm in detail.
In Section 5, we present evaluations on six different datasets, where we show that in most cases BSG outperforms SG.

## 2 Skip-Gram with Negative Sampling

SG with negative sampling is a neural word embedding method that was introduced in (Mikolov et al. 2013). The method aims at estimating a word representation that captures semantic and syntactic relations between a word and its surrounding words in a sentence. Note that SG can also be applied with hierarchical softmax (Mikolov et al. 2013), but in this work we refer to SG as SG with negative sampling. The rest of this section provides a brief overview of the SG method.

Given a sequence of words $(w_i)_{i=1}^{L}$ from a finite vocabulary $W = \{w_i\}_{i=1}^{l}$, SG aims at maximizing the following objective

$$\frac{1}{L}\sum_{i=1}^{L}\sum_{-c \le j \le c,\, j \ne 0} \log p(w_{i+j} \mid w_i) \tag{1}$$

where $c$ is defined as the context window size and $p(w_j \mid w_i)$ is the softmax function

$$p(w_j \mid w_i) = \frac{\exp(u_i^T v_j)}{\sum_{k \in I_W} \exp(u_i^T v_k)} \tag{2}$$

where $u_i \in U\ (\subset \mathbb{R}^m)$ and $v_i \in V\ (\subset \mathbb{R}^m)$ are latent vectors that correspond to the target and context representations of the word $w_i \in W$, respectively, $I_W \triangleq \{1, \dots, l\}$, and the parameter $m$ is chosen empirically and according to the size of the dataset. Using Eq. (2) directly is impractical due to the computational complexity of $p(w_j \mid w_i)$, which is linear in $l$, typically in the order of $10^5$–$10^6$.

### 2.1 Negative Sampling

Negative sampling (Mikolov et al. 2013) is introduced in order to overcome the above computational problem by replacing the softmax function from Eq. (2) with

$$p(w_j \mid w_i) = \sigma(u_i^T v_j) \prod_{k=1}^{N} \sigma(-u_i^T v_k) \tag{3}$$

where $\sigma(x) = 1/(1 + \exp(-x))$ and $N$ is a parameter that determines the number of negative examples to be sampled per positive example. A negative word $w_k$ is sampled from the unigram distribution raised to the power of $3/4$, $p_{uni}^{3/4}(w)$, where $p_{uni}(w)$ is defined as the number of times $w$ appears in the entire corpus divided by the total length (in words) of the corpus. This distribution was found to outperform both the uniform and the (unsmoothed) unigram distributions (Mikolov et al. 2013).

The latent representations $U$ and $V$ are estimated by applying stochastic gradient ascent with respect to the objective in Eq. (1). It is worth noting that some versions of word embedding algorithms incorporate bias terms into Eq. (3) as follows:

$$p(w_j \mid w_i) = \sigma(u_i^T v_j + b_j + b_i) \prod_{k=1}^{N} \sigma\big(-(u_i^T v_k + b_k + b_i)\big).$$

These biases often explain properties such as the frequency of a word in the text (Pennington, Socher, and Manning 2014; Mnih and Hinton 2008) or the popularity of an item in the dataset (Paquet and Koenigstein 2013). In this work, we do not use biases, since in our initial experiments their contribution was found to be marginal.

### 2.2 Data Subsampling

In order to overcome the imbalance between rare and frequent words, the following subsampling procedure is suggested: given the input word sequence, we discard each word $w$ in a sentence with probability

$$p(\text{discard} \mid w) = 1 - \sqrt{\rho / f(w)}$$

where $f(w)$ is the frequency of the word $w$ and $\rho$ is a prescribed threshold. This technique is reported to accelerate the learning process and to improve the representation of rare words (Mikolov et al. 2013).
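As a concrete illustration of Sections 2.1 and 2.2, the following Python sketch builds the smoothed unigram distribution used for negative sampling and computes the subsampling keep-probability. It is a minimal sketch, not the word2vec implementation; the function names (`negative_sampling_dist`, `keep_probability`) and the toy corpus are illustrative choices.

```python
import numpy as np
from collections import Counter

def negative_sampling_dist(corpus):
    """Unigram distribution raised to the 3/4 power (Section 2.1)."""
    counts = Counter(corpus)
    vocab = list(counts)
    p_uni = np.array([counts[w] for w in vocab], dtype=float) / len(corpus)
    p_neg = p_uni ** 0.75
    return vocab, p_neg / p_neg.sum()

def keep_probability(freq, rho=1e-5):
    """Probability of keeping a word under the subsampling rule of Section 2.2:
    p(discard | w) = 1 - sqrt(rho / f(w)), clipped to [0, 1]."""
    return np.clip(np.sqrt(rho / freq), 0.0, 1.0)

corpus = "the cat sat on the mat the cat".split()
vocab, p_neg = negative_sampling_dist(corpus)
negatives = np.random.default_rng(0).choice(vocab, size=4, p=p_neg)
print(negatives, keep_probability(freq=0.2))
```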
### 2.3 Word Representation and Similarity

SG produces two different representations, $u_i$ and $v_i$, for the word $w_i$. In this work, we use $u_i$ as the final representation of $w_i$. Alternatively, we can use $v_i$, the additive composition $u_i + v_i$, or the concatenation $[u_i^T\ v_i^T]^T$. The last two options are reported to produce a superior representation (Garten et al. 2015). The similarity between a pair of words $w_a, w_b$ is computed by applying the cosine similarity to their corresponding representations $u_a, u_b$ as follows:

$$\text{sim}(w_a, w_b) = \frac{u_a^T u_b}{\lVert u_a \rVert\, \lVert u_b \rVert}.$$

## 3 Bayesian Skip-Gram (BSG)

As described in Section 2, SG produces point estimates for $U$ and $V$. In this section, we propose a method for deriving full distributions for $U$ and $V$ (in this paper, we use the terms distribution and density interchangeably). We assume that the target and context random vectors are independent and have normal priors with zero mean and diagonal covariance with a precision parameter $\tau$: for all $i \in I_W$,

$$p(u_i) = \mathcal{N}(u_i \mid 0, \tau^{-1} I) \quad \text{and} \quad p(v_i) = \mathcal{N}(v_i \mid 0, \tau^{-1} I).$$

Note that different values of $\tau$ can be set for the priors over $U$ and $V$. Furthermore, these hyperparameters can be treated as random variables and learned from the data (Paquet and Koenigstein 2013). However, in this work, we assume these hyperparameters are given and identical.

We define $C(w_i)$ as a multiset that contains the indices of the context words of $w_i$ in the corpus. Let $I_P = \{(i, j) \mid i \in I_W,\ j \in C(w_i)\}$ and $I_N = \{(i, j) \mid i \in I_W,\ j \notin C(w_i)\}$ be the positive and negative sets, respectively. Note that $I_P$ is a multiset too, and $I_N$'s size might be quadratic in the vocabulary size $l$. Therefore, we approximate $I_N$ by negative sampling as described in Section 2.1. Let $I_D = I_P \uplus I_N$ and define $D = \{d_{ij} \mid (i, j) \in I_D\}$, where $d_{ij}$ is a random variable with $d_{ij} = 1$ if $(i, j) \in I_P$ and $d_{ij} = -1$ if $(i, j) \in I_N$. Then, the likelihood of $d_{ij}$ is given by

$$p(d_{ij} \mid u_i, v_j) = \sigma(d_{ij}\, u_i^T v_j).$$

Note that when applied to multisets, the $\uplus$ operator is defined as the multiset sum and not as the multiset union. The joint distribution of $U$, $V$ and $D$ is given by

$$p(U, V, D) = p(D \mid U, V)\, p(U)\, p(V) = \prod_{(i,j) \in I_D} \sigma(d_{ij}\, u_i^T v_j) \prod_{i \in I_W} \mathcal{N}(u_i \mid 0, \tau^{-1} I) \prod_{i \in I_W} \mathcal{N}(v_i \mid 0, \tau^{-1} I). \tag{4}$$
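To make the generative model of Eq. (4) concrete, the following NumPy sketch evaluates the log of the joint density for a toy configuration. The function name `log_joint` and the toy values are illustrative choices, not part of the paper; the sketch assumes spherical priors with a single shared precision `tau`, as in the text.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -np.logaddexp(0.0, -x)

def log_joint(U, V, pairs, tau=1.0):
    """Log of Eq. (4): logistic likelihood over observed (i, j, d_ij) tuples
    plus zero-mean spherical Gaussian log-priors with precision tau on U, V."""
    ll = sum(log_sigmoid(d * U[i] @ V[j]) for i, j, d in pairs)

    def log_prior(X):
        n, m = X.shape
        return -0.5 * tau * np.sum(X ** 2) + 0.5 * n * m * np.log(tau / (2 * np.pi))

    return ll + log_prior(U) + log_prior(V)

# Toy example: 3 words, 2-dimensional embeddings, one positive and one negative pair.
rng = np.random.default_rng(0)
U, V = rng.normal(size=(3, 2)), rng.normal(size=(3, 2))
print(log_joint(U, V, pairs=[(0, 1, +1), (0, 2, -1)]))
```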
### 3.1 Variational Approximation

We aim at computing the posterior $p(U, V \mid D)$. However, a direct computation is hard. The posterior can be approximated using MCMC approaches such as Gibbs sampling, or by using VB methods. In this work, we choose to apply the VB approximation (Bishop 2006), since empirically it was shown to converge faster to an accurate solution. Let $\theta = \{U, V\}$. VB approximates the posterior $p(\theta \mid D)$ by finding a fully factorized distribution

$$q(\theta) = q(U, V) = q(U)\, q(V) = \prod_{i=1}^{l} q(u_i)\, q(v_i)$$

that minimizes the KL divergence (Bishop 2006)

$$D_{KL}\big(q(\theta)\, \|\, p(\theta \mid D)\big) = -\int q(\theta) \log \frac{p(\theta \mid D)}{q(\theta)}\, d\theta.$$

To this end, we define the following expression:

$$L(q) \triangleq \int q(\theta) \log \frac{p(D, \theta)}{q(\theta)}\, d\theta = \int q(\theta) \log \frac{p(\theta \mid D)}{q(\theta)}\, d\theta + \int q(\theta) \log p(D)\, d\theta = -D_{KL}\big(q(\theta)\, \|\, p(\theta \mid D)\big) + \log p(D) \tag{5}$$

where the last transition in Eq. (5) is due to the fact that $q$ is a PDF. By rearranging Eq. (5) we get the relation $D_{KL}\big(q(\theta)\, \|\, p(\theta \mid D)\big) = \log p(D) - L(q)$, where we notice that $\log p(D)$ is independent of $q$. Hence, minimizing $D_{KL}\big(q(\theta)\, \|\, p(\theta \mid D)\big)$ is equivalent to maximizing $L(q)$. It was shown (Bishop 2006) that $L(q)$ is maximized by an iterative procedure that is guaranteed to converge to a local optimum. This is done by updating each of $q$'s factors, sequentially, according to the update rule

$$q^*_{u_i}(u_i) = \exp\big(\mathbb{E}_{q_{\theta \setminus u_i}}[\log p(D, \theta)] + const\big) \tag{6}$$

where the update for $q^*_{v_i}$ is performed by replacing $u_i$ with $v_i$ in Eq. (6).

Recall that the term $p(D, \theta) = p(U, V, D)$ in Eq. (6) contains the likelihood $p(D \mid U, V)$ from Eq. (4), which is a product of sigmoid functions of $U$ and $V$. Therefore, a conjugate relation between the likelihood and the priors does not hold, and the distribution implied by $\mathbb{E}_{q_{\theta \setminus u_i}}[\log p(D, \theta)]$ in Eq. (6) does not belong to the exponential family. Next, we show that by introducing an additional parameter $\xi_{ij}$, we are able to bring Eq. (6) to a form that is recognized as a Gaussian distribution. We start by lower bounding $p(D \mid U, V)$ using the following logistic bound (Jaakkola and Jordan 1996):

$$\log \sigma(a) \ge \log \sigma(\xi) + \frac{a - \xi}{2} - \lambda(\xi)\big(a^2 - \xi^2\big) \tag{7}$$

where $\lambda(\xi) = \frac{1}{2\xi}\left(\sigma(\xi) - \frac{1}{2}\right)$. By applying the lower bound from (7) to $\log p(D \mid \theta)$ we get

$$\log p(D \mid \theta) \ge \log p(D \mid \theta, \xi) = \sum_{(i,j) \in I_D} \left[ \frac{d_{ij}\, u_i^T v_j - \xi_{ij}}{2} - \lambda(\xi_{ij})\big(u_i^T v_j v_j^T u_i - \xi_{ij}^2\big) + \log \sigma(\xi_{ij}) \right]. \tag{8}$$

By using Eq. (8) we bound $L(q)$ as follows:

$$L(q) = \int q(\theta) \log \frac{p(D, \theta)}{q(\theta)}\, d\theta \ge \int q(\theta) \log \frac{p(D \mid \theta, \xi)\, p(\theta)}{q(\theta)}\, d\theta \triangleq L_\xi(q).$$

Furthermore, it was shown (Jaakkola and Jordan 1996) that the bound in Eq. (8) is tight when

$$\xi_{ij}^2 = \mathbb{E}_q\big[u_i^T v_j v_j^T u_i\big] = \mathbb{E}_q\big[u_i^T v_j\big]^2 + \text{var}\big(u_i^T v_j\big) = \mu_{u_i}^T \mu_{v_j} \mu_{v_j}^T \mu_{u_i} + \text{var}\big(u_i^T v_j\big) \tag{9}$$

where $\mu_{v_j} \triangleq \mathbb{E}_q[v_j]$ and the last transition in Eq. (9) holds since $u_i$ and $v_j$ are independent. By assuming diagonal covariance matrices, the term $\text{var}(u_i^T v_j)$ in Eq. (9) is computed by

$$\text{var}\big(u_i^T v_j\big) = \sum_{k=1}^{m} \text{var}\big(u_{ik} v_{jk}\big) = \sum_{k=1}^{m} \left[ \sigma^2_{u_{ik}} \sigma^2_{v_{jk}} + \sigma^2_{u_{ik}} \mu^2_{v_{jk}} + \mu^2_{u_{ik}} \sigma^2_{v_{jk}} \right]. \tag{10}$$

Finally, by combining Eqs. (9) and (10) we get

$$\xi_{ij} = \sqrt{\sum_{k=1}^{m} \big(\sigma^2_{u_{ik}} + \mu^2_{u_{ik}}\big)\big(\sigma^2_{v_{jk}} + \mu^2_{v_{jk}}\big)}. \tag{11}$$

Therefore, instead of maximizing $L(q)$ we can maximize $L_\xi(q)$, and this is done by replacing the term $\log p(D, \theta)$ in Eq. (6) with $\log p(D, \theta \mid \xi)$ as follows:

$$q^*_{u_i}(u_i) = \exp\big(\mathbb{E}_{q_{\theta \setminus u_i}}[\log p(D, \theta \mid \xi)] + const\big). \tag{12}$$

By applying the natural logarithm to Eq. (12) we get

$$\log q^*_{u_i}(u_i) = \mathbb{E}[\log p(D \mid U, V, \xi)] + \mathbb{E}[\log p(U, V)] + const = -\tfrac{1}{2}\, u_i^T P_{u_i} u_i + u_i^T r_{u_i} + const \tag{13}$$

with

$$P_{u_i} = \tau I + 2 \sum_{j \in I_{u_i}} \lambda(\xi_{ij})\, \mathbb{E}_q\big[v_j v_j^T\big], \qquad r_{u_i} = \frac{1}{2} \sum_{j \in I_{u_i}} d_{ij}\, \mu_{v_j} \tag{14}$$

where $I_{u_i} = \{j \mid (i, j) \in I_D\}$ and $\mathbb{E}_q[v_j v_j^T] = \Sigma_{v_j} + \mu_{v_j} \mu_{v_j}^T$. Note that in the last transition in Eq. (13), all terms that are independent of $u_i$ are absorbed into the $const$ term. By inspecting Eqs. (13) and (14), we see that $q^*_{u_i}$ is normally distributed with the natural parameters $P_{u_i}$ (the precision matrix) and $r_{u_i} = P_{u_i} \mu_{u_i}$ (the mean times precision vector). Note that the computation of $q^*_{v_i}$'s parameters is symmetric. Moreover, since the updates for $\{q_{u_i}\}_{i=1}^{l}$ depend only on $\{q_{v_i}\}_{i=1}^{l}$ and vice versa, they can be performed in parallel. This gives an alternating update scheme that is embarrassingly parallel and (under the assumption of a constant dataset) guaranteed to converge to a local optimum (Bishop 2006): first, update all $\{q_{u_i}\}_{i=1}^{l}$ (in parallel), then update all $\{q_{v_i}\}_{i=1}^{l}$ (in parallel), and repeat until convergence.
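The following NumPy sketch illustrates Eqs. (7), (11) and (14) for a single target word, assuming diagonal covariances as in the derivation. The function and argument names (`update_target`, `mu_v`, `var_v`, `pairs`) are illustrative choices rather than names from the paper; it is a sketch of the update, not the reference implementation.

```python
import numpy as np

def lam(xi):
    # lambda(xi) from Eq. (7): (sigmoid(xi) - 1/2) / (2 * xi).
    return (1.0 / (1.0 + np.exp(-xi)) - 0.5) / (2.0 * xi)

def update_target(mu_u_i, var_u_i, pairs, mu_v, var_v, tau):
    """Sketch of Eqs. (11) and (14) for a single target word w_i.
    `pairs` holds (j, d_ij) tuples for all j in I_{u_i}; `mu_v[j]`, `var_v[j]`
    are the mean and diagonal variance of q(v_j)."""
    m = len(mu_u_i)
    P = tau * np.eye(m)                  # prior precision tau * I
    r = np.zeros(m)
    for j, d in pairs:
        # Eq. (11): xi_ij from the first and second moments of q(u_i) and q(v_j).
        xi = np.sqrt(np.sum((var_u_i + mu_u_i ** 2) * (var_v[j] + mu_v[j] ** 2)))
        E_vvT = np.diag(var_v[j]) + np.outer(mu_v[j], mu_v[j])   # E_q[v_j v_j^T]
        P += 2.0 * lam(xi) * E_vvT                               # precision, Eq. (14)
        r += 0.5 * d * mu_v[j]                                   # mean * precision, Eq. (14)
    Sigma = np.linalg.inv(P)             # the paper uses a Cholesky-based inversion
    return P, r, Sigma @ r, Sigma        # precision, r, posterior mean, covariance

# Toy usage: one positive (j=1) and one negative (j=2) context for word i.
rng = np.random.default_rng(0)
mu_v, var_v = rng.normal(size=(3, 2)), np.ones((3, 2))
P, r, mean, Sigma = update_target(np.zeros(2), np.ones(2), [(1, +1), (2, -1)], mu_v, var_v, tau=1.0)
print(mean, np.diag(Sigma))
```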
### 3.2 Stochastic Updates

Due to data sampling, the effective dataset changes between iterations and the optimization becomes stochastic. Since we do not want to ignore the information from previous steps, we need a way to incorporate this information into our updates. A common practice is to apply updates in the spirit of the Robbins-Monro method (Robbins and Monro 1951). This is performed by introducing an iteration dependent variable $\beta^{(k)}$ that controls the updates as follows:

$$P_{u_i}^{(k)} = \beta^{(k)} P_{u_i}^{(k)} + \big(1 - \beta^{(k)}\big) P_{u_i}^{(k-1)}, \qquad r_{u_i}^{(k)} = \beta^{(k)} r_{u_i}^{(k)} + \big(1 - \beta^{(k)}\big) r_{u_i}^{(k-1)}.$$

In practice, this means that during the runtime of the algorithm we need to keep the results from the previous iteration. Robbins and Monro showed several conditions for convergence, one of which states that $\beta^{(k)}$ needs to satisfy

$$\sum_{k=1}^{\infty} \beta^{(k)} = \infty \quad \text{and} \quad \sum_{k=1}^{\infty} \big(\beta^{(k)}\big)^2 < \infty. \tag{15}$$

To this end, we suggest using $\beta^{(k)} = k^{-\gamma}$ with a decay parameter $0.5 < \gamma \le 1$, as this ensures the conditions in (15) hold. We further suggest setting $\beta^{(k)} = 1$ for the first few iterations, in order to avoid too early convergence. Specifically, in our implementation, we did not perform stochastic updates in the first 10 iterations.

## 4 The BSG Algorithm

In this section, we provide a detailed description of the BSG algorithm, which is based on Sections 2 and 3. The algorithm is described in Fig. 1 and includes three main stages. The first stage is an initialization; then the algorithm iterates between data sampling and parameter updates until a convergence criterion is met or the maximum number of iterations is exceeded. In what follows, we explain each of these stages in detail.

### 4.1 Stage 1 - Initialization

The algorithm is given the following hyperparameters: the input text $T$ (a set of sentences), the target representation dimension $m$, the maximum number of iterations $K$, the maximal window size $c_{max}$, the negative to positive ratio $N$, the subsampling parameter $\rho$, a stopping threshold $\varepsilon$, the decay parameter $\gamma$ and the prior precision parameter $\tau$. As described in Section 3, different values of $\tau$ can be learned for $U$ and $V$; however, in our implementation, we chose to use a single parameter $\tau$.

In this stage, we further need to determine the effective set of words $W$ to learn a representation for. This can be done by considering all words in the data that appear more than a prescribed number of times, or by considering the $l$ most popular words. In this work, we stick with the latter. Then, every word $w \notin W$ is discarded from the data (step 1). Step 2 initializes the parameters of the target distributions $Q = \{q_{u_i}, q_{v_i}\}_{i=1}^{l}$. Specifically, the means are drawn from the multivariate standard normal distribution and the covariance matrices are set to identity. In step 3, we compute $p_{uni}$ according to the description in Section 2.1 and raise it to the $3/4$ power. Step 4 updates $k$ and $K$ according to $\kappa$. This ensures the stochastic updates are not performed in the first $\kappa$ iterations.

### 4.2 Stage 2 - Data Sampling

At the beginning of every iteration, we subsample the data (step 5.1) as described in Section 2.2. Then, we follow the description in Section 3: for each instance of the word $w_i$ in $T$, we sample a window size $c$ from the uniform discrete distribution over the set $\{1, \dots, c_{max}\}$ and consider the $c$ words to the left and to the right of $w_i$ as context words for $w_i$. This results in a multiset $C(w_i)$ that contains the indices of the context words of $w_i$ (an index may appear multiple times). Then, we create a positive multiset of tuples $I_P = \{(i, j) \mid i \in I_W,\ j \in C(w_i)\}$ (step 5.3). Next, for each tuple $(i, j) \in I_P$ we sample $N$ negative examples $(i, z)$ such that $(i, z) \notin I_P$. A negative word $w_z$ is sampled according to $p_{uni}^{3/4}(w_z)$. We further update $I_{u_i}, I_{v_j}, I_{v_z}, d_{ij}, d_{iz}$ accordingly (step 5.4). An alternative implementation of step 5.4 is to save, for each tuple, a counter for the number of times it appears. This can be done by maintaining dictionary data structures that count positive and negative examples. This avoids the need to maintain $\{I_{u_i}, I_{v_i}\}_{i=1}^{l}$ as multisets (list data structures) and replaces them with set data structures. A minimal sketch of this sampling stage is given below.
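The following hypothetical Python sketch illustrates steps 5.3-5.4 with the counter-based alternative described above. The function name `sample_pairs` and its arguments are illustrative, and details such as sentence-boundary handling follow common practice rather than the paper.

```python
import numpy as np
from collections import Counter

def sample_pairs(sentences, vocab_index, neg_words, p_neg, c_max=4, N=1, seed=0):
    """Illustrative version of steps 5.3-5.4: build Counters of positive and
    negative (i, j) index pairs instead of explicit multisets."""
    rng = np.random.default_rng(seed)
    pos, neg = Counter(), Counter()
    for sent in sentences:
        idx = [vocab_index[w] for w in sent if w in vocab_index]
        for pos_in_sent, i in enumerate(idx):
            c = rng.integers(1, c_max + 1)                 # per-instance window size
            lo, hi = max(0, pos_in_sent - c), pos_in_sent + c + 1
            for j in idx[lo:pos_in_sent] + idx[pos_in_sent + 1:hi]:
                pos[(i, j)] += 1                           # positive example, d_ij = 1
                for _ in range(N):                         # N negatives per positive
                    z = vocab_index[rng.choice(neg_words, p=p_neg)]
                    if (i, z) not in pos:
                        neg[(i, z)] += 1                   # negative example, d_iz = -1
    return pos, neg

sentences = [["the", "cat", "sat", "on", "the", "mat"]]
vocab_index = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
neg_words, p_neg = list(vocab_index), np.full(5, 0.2)
pos, neg = sample_pairs(sentences, vocab_index, neg_words, p_neg, c_max=2, N=1)
print(pos.most_common(3), neg.most_common(3))
```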
### 4.3 Stage 3 - Parameter Updates

In this stage, we update the parameters of the distributions $Q = \{q_{u_i}, q_{v_i}\}_{i=1}^{l}$. The updates are performed first for $\{q_{u_i}\}_{i=1}^{l}$ (step 5.6) and then for $\{q_{v_i}\}_{i=1}^{l}$ in a symmetric manner (step 5.7). Moreover, each sequence of updates is performed in parallel. Note that step 5.6.1 involves the computation of $\xi_{ij}$, $\lambda(\xi_{ij})$ and $\mathbb{E}_q[v_j v_j^T]$, which are given by Eqs. (11), (7) and (14), respectively. Due to data sampling, the dataset changes per iteration. Therefore, we apply stochastic updates (step 5.6.2). The stochastic updates are performed starting from iteration $\kappa + 1$. This is ensured by step 5.5. A crucial point to notice is the computation of the mean and the covariance: first, we compute the covariance matrix by inverting the precision matrix (step 5.6.3), using the Cholesky decomposition. Then, we extract the mean (step 5.6.4). Finally, we set all the off-diagonal values in $\Sigma_{u_i}$ to zero (step 5.6.5), while keeping $P_{u_i}$ as is. The algorithm stops if the convergence criterion is met or the number of iterations is exceeded (last line).

### 4.4 Similarity Measures

BSG maps words to normal distributions. In this work, we choose to use the distributions $\{q_{u_i}\}_{i=1}^{l}$ for representing words. The similarity between a pair of words $w_i, w_j$ can be computed by the cosine similarity of their means $\mu_{u_i}, \mu_{u_j}$. By using the covariance, a confidence level can be computed as well. To this end, we define a random variable $y_{ij} = u_i^T u_j$. Though the distribution of $y_{ij}$ is not normal, it has the following mean and variance:

$$\mu_{y_{ij}} = \mu_{u_i}^T \mu_{u_j}, \qquad \sigma^2_{y_{ij}} = tr\big(\Sigma_{u_i} \Sigma_{u_j}\big) + \mu_{u_i}^T \Sigma_{u_j} \mu_{u_i} + \mu_{u_j}^T \Sigma_{u_i} \mu_{u_j}. \tag{16}$$

Hence, we choose to approximate $y_{ij}$'s distribution with $\mathcal{N}\big(y_{ij} \mid \mu_{y_{ij}}, \sigma^2_{y_{ij}}\big)$. Then, $\sigma^2_{y_{ij}}$ can be used as a confidence level of the similarity score.

BSG further enables the application of other similarity types. For example, we can approximate $p(d_{ij} = 1 \mid D)$ via the marginalization

$$p(d_{ij} = 1 \mid D) = \int\!\!\int p(d_{ij} = 1, u_i, u_j \mid D)\, du_i\, du_j = \int\!\!\int p(d_{ij} = 1 \mid u_i, u_j)\, p(u_i \mid D)\, p(u_j \mid D)\, du_i\, du_j \approx \int\!\!\int \sigma\big(u_i^T u_j\big)\, q(u_i)\, q(u_j)\, du_i\, du_j \approx \int \sigma(y)\, p(y_{ij} = y)\, dy \approx \sigma\!\left(\frac{\mu_{y_{ij}}}{\sqrt{1 + \pi \sigma^2_{y_{ij}} / 8}}\right) \tag{17}$$

where $\mu_{y_{ij}}$ and $\sigma^2_{y_{ij}}$ are given by Eq. (16) and the last three approximations follow from the VB approximation, Eq. (16) and (MacKay 1992), respectively. Another option is to apply a similarity measure that is based on a symmetric version of the KL divergence between two multivariate normal distributions:

$$sim_{symKL}\big(q_{u_i}, q_{u_j}\big) = -\Big(D_{KL}\big(q_{u_i} \| q_{u_j}\big) + D_{KL}\big(q_{u_j} \| q_{u_i}\big)\Big) \tag{18}$$

where $D_{KL}(q_{u_i} \| q_{u_j})$ has the following closed form solution (Bishop 2006):

$$D_{KL}\big(q_{u_i} \| q_{u_j}\big) = \frac{1}{2}\left[ tr\big(\Sigma_{u_j}^{-1} \Sigma_{u_i}\big) + \big(\mu_{u_j} - \mu_{u_i}\big)^T \Sigma_{u_j}^{-1} \big(\mu_{u_j} - \mu_{u_i}\big) - m + \log \frac{|\Sigma_{u_j}|}{|\Sigma_{u_i}|} \right].$$

A small sketch of these similarity computations is given below.
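The following NumPy sketch, assuming diagonal covariances, computes the cosine similarity of the means together with the moments of Eq. (16) and the approximation of Eq. (17). The function name `similarity_scores` and the toy inputs are illustrative choices, not names from the paper.

```python
import numpy as np

def similarity_scores(mu_i, var_i, mu_j, var_j):
    """Sketch of Section 4.4 with diagonal covariances: cosine similarity of the
    means, the moments of y_ij = u_i^T u_j (Eq. 16), and the approximation of
    p(d_ij = 1 | D) (Eq. 17)."""
    cosine = mu_i @ mu_j / (np.linalg.norm(mu_i) * np.linalg.norm(mu_j))
    mu_y = mu_i @ mu_j
    var_y = np.sum(var_i * var_j) + mu_i**2 @ var_j + mu_j**2 @ var_i          # Eq. (16)
    p_positive = 1.0 / (1.0 + np.exp(-mu_y / np.sqrt(1.0 + np.pi * var_y / 8)))  # Eq. (17)
    return cosine, mu_y, var_y, p_positive

mu_i, var_i = np.array([0.5, 1.0]), np.array([0.1, 0.2])
mu_j, var_j = np.array([0.4, 0.9]), np.array([0.2, 0.1])
print(similarity_scores(mu_i, var_i, mu_j, var_j))
```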
Note that the application of the BSG algorithm to general item similarity tasks is straightforward. The only requirement is that the data is given in the same format. Specifically, every sentence of words in the data is replaced with a sequence of items. Moreover, if the items are given as sets, then for each sequence the window size should be set to the length of the sequence. This results in a Bayesian version of item2vec (Barkan and Koenigstein 2016).

BSG Algorithm

Input: $m$ - target representation dimension; $T$ - input text, given as a sequence of sequences of words; $\tau$ - prior precision; $K$ - maximum number of iterations; $c_{max}$ - maximal window size; $l$ - number of most frequent words to be considered in the vocabulary; $\rho$ - subsampling parameter; $N$ - negative to positive ratio; $\kappa$ - number of iterations to apply without stochastic updates (in the beginning); $\varepsilon$ - stopping threshold; $\gamma$ - decay parameter.

Output: $Q = \{\mu_{u_i}, \Sigma_{u_i}, \mu_{v_i}, \Sigma_{v_i}\}_{i=1}^{l}$ - parameters of the distributions $Q = \{q_{u_i}, q_{v_i}\}_{i=1}^{l}$.

1. Create a set $W = \{w_i\}_{i=1}^{l}$ of the $l$ most frequent words in $T$ and discard all other words from $T$.
2. for $i = 1$ to $l$
   2.1. $\mu_{u_i} \sim \mathcal{N}(0, I)$, $\mu_{v_i} \sim \mathcal{N}(0, I)$, $P_{u_i} \leftarrow I$, $P_{v_i} \leftarrow I$
3. Compute $p_{uni}^{3/4}$ over $W$ using $T$ as described in Section 2.1.
4. $k \leftarrow 1 - \kappa$, $K \leftarrow K - \kappa$  // the first $\kappa$ iterations are performed without stochastic updates
5. repeat
   5.1. $T_{sub} \leftarrow$ Subsample($T$), as described in Section 2.2
   5.2. for $i = 1$ to $l$
      5.2.1. $I_{u_i} \leftarrow \emptyset$, $I_{v_i} \leftarrow \emptyset$
      5.2.2. $P^{prev}_{u_i} \leftarrow P_{u_i}$, $P^{prev}_{v_i} \leftarrow P_{v_i}$, $r^{prev}_{u_i} \leftarrow r_{u_i}$, $r^{prev}_{v_i} \leftarrow r_{v_i}$  // save values from the previous iteration
   5.3. Create $I_P$ based on $T_{sub}$ as described in Section 4.2  // positive sampling
   5.4. for $(i, j)$ in $I_P$
      5.4.1. $I_{u_i} \leftarrow I_{u_i} \uplus \{j\}$, $I_{v_j} \leftarrow I_{v_j} \uplus \{i\}$, $d_{ij} \leftarrow 1$
      5.4.2. for $n = 1$ to $N$  // negative sampling
         5.4.2.1. Sample a negative word index $z$ according to $p_{uni}^{3/4}(w_z)$ s.t. $(i, z) \notin I_P$
         5.4.2.2. $I_{u_i} \leftarrow I_{u_i} \uplus \{z\}$, $I_{v_z} \leftarrow I_{v_z} \uplus \{i\}$, $d_{iz} \leftarrow -1$
   5.5. if $k > 0$ then $\beta \leftarrow k^{-\gamma}$ else $\beta \leftarrow 1$  // stochastic updates condition
   5.6. parfor $i = 1$ to $l$  // parallel for loop
      5.6.1. Compute $P_{u_i}, r_{u_i}$ using Eq. (14)
      5.6.2. $P_{u_i} \leftarrow \beta P_{u_i} + (1 - \beta) P^{prev}_{u_i}$, $r_{u_i} \leftarrow \beta r_{u_i} + (1 - \beta) r^{prev}_{u_i}$
      5.6.3. $\Sigma_{u_i} \leftarrow P_{u_i}^{-1}$
      5.6.4. $\mu_{u_i} \leftarrow \Sigma_{u_i} r_{u_i}$
      5.6.5. $\Sigma_{u_i} \leftarrow diag[diag[\Sigma_{u_i}]]$
   5.7. Apply a symmetric version of step 5.6 to the $\{q_{v_i}\}_{i=1}^{l}$ parameters
   5.8. $k \leftarrow k + 1$
   until $k > K$ or ($\sum_{i=1}^{l} \lVert r_{u_i} - r^{prev}_{u_i} \rVert < \varepsilon$ and $\sum_{i=1}^{l} \lVert r_{v_i} - r^{prev}_{v_i} \rVert < \varepsilon$)

Figure 1: The BSG algorithm.

## 5 Experimental Setup and Results

In this section, we compare the BSG and SG algorithms (for SG we used the word2vec implementation, https://code.google.com/p/word2vec). The algorithms are evaluated on two different tasks: the word similarity task (Finkelstein et al. 2002) and the word analogy task (Pennington, Socher, and Manning 2014).

The word similarity task requires scoring pairs of words according to their relatedness. For each pair, a ground truth similarity score is given. The similarity score we used for both models is the cosine similarity. Specifically, for BSG we observed no significant improvement when applying the similarities from Eqs. (17) and (18) instead of the cosine similarity. In order to compare the BSG and SG methods, we compute for each method the Spearman (Spearman 1987) rank correlation coefficient with respect to the ground truth.

The word analogy task is essentially a completion task: a bank of questions of the form "$w_a$ is to $w_b$ as $w_c$ is to ?" is given, where the task is to replace ? with the correct word $w_d$. The questions are divided into syntactic questions, such as "onion is to onions as lion is to lions", and semantic questions, e.g. "Berlin is to Germany as London is to England". The method we use to answer the questions is to report the word $w_d$ that gives the highest cosine similarity score between $u_d$ and $u_b - u_a + u_c$. For the BSG and SG models we used $\mu_{u_i}$ and $u_i$ as the representation of the word $w_i$, respectively.
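The analogy scoring rule just described can be sketched as follows. The function name `answer_analogy` is illustrative, and excluding the three query words from the candidate set is an assumption that follows common practice for this benchmark rather than a detail stated in the paper.

```python
import numpy as np

def answer_analogy(a, b, c, vocab, E):
    """Return the word whose vector has the highest cosine similarity to
    u_b - u_a + u_c, excluding the three query words (assumed convention)."""
    word2idx = {w: i for i, w in enumerate(vocab)}
    target = E[word2idx[b]] - E[word2idx[a]] + E[word2idx[c]]
    # Cosine similarity of every vocabulary vector with the target vector.
    sims = E @ target / (np.linalg.norm(E, axis=1) * np.linalg.norm(target) + 1e-12)
    for w in (a, b, c):
        sims[word2idx[w]] = -np.inf
    return vocab[int(np.argmax(sims))]

vocab = ["berlin", "germany", "london", "england", "paris"]
E = np.random.default_rng(1).normal(size=(5, 4))   # stand-in embeddings (means for BSG)
print(answer_analogy("berlin", "germany", "london", vocab, E))
```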
### 5.1 Datasets

We trained both models on the corpus from (Chelba et al. 2014). In order to accelerate the training process, we limited our vocabulary to the 30K most frequent words in the corpus and discarded all other words. Then, we randomly sampled a subset of 2.8M sentences, which results in a total text length of 66M words.

The word similarity evaluation includes several different datasets: WordSim353 (Finkelstein et al. 2002), SCWS (Huang et al. 2012), Rare Words (Luong, Socher, and Manning 2013), MEN (Bruni, Tran, and Baroni 2014) and SimLex999 (Hill, Reichart, and Korhonen 2015). The reader is referred to the references for further details about these datasets. For each combination of dataset and method, we report the Spearman rank correlation (x100). The word analogy evaluation dataset (Mikolov et al. 2013) consists of 14 distinct groups of analogy questions, where each group contains a different number of questions. Both models were evaluated on an effective set that contains 14122 questions (all questions that contain out-of-vocabulary words were discarded).

### 5.2 Parameter Configuration

The same parameter configuration was set for both systems. Specifically, we set the target representation dimension $m = 40$, maximal window size $c_{max} = 4$, subsampling parameter $\rho = 10^{-5}$, vocabulary size $l = 30000$ and negative to positive ratio $N = 1$. For BSG, we further set $\tau = 1$, $\kappa = 10$ and $\gamma = 0.7$ (note that BSG is quite robust to the choice of $\gamma$ as long as $0.5 < \gamma \le 1$). Both models were trained for $K = 40$ iterations (we verified their convergence after ~30 iterations). In order to mitigate the effect of noise in the results, we trained 10 different instances of BSG and SG and report the average score obtained for each entry in the tables.

### 5.3 Results

Table 1 presents the (average) Spearman rank correlation score (x100) obtained by BSG and SG on the word similarity task for various datasets.

| Method | WordSim353 (Finkelstein et al. 2002) | SCWS (Huang et al. 2012) | Rare Words (Luong, Socher, and Manning 2013) | MEN (Bruni, Tran, and Baroni 2014) | SimLex999 (Hill, Reichart, and Korhonen 2015) |
|---|---|---|---|---|---|
| SG | 58.6 | 59.4 | 51.2 | 60.9 | 26.4 |
| BSG | 61.1 | 59.3 | 52.5 | 61.1 | 27.3 |

Table 1: A comparison between BSG and SG on various word similarity datasets.

Table 2 presents the (average) percentage of correct answers for each model per question group on the word analogy task. We see that the models are competitive, where BSG results in a better total accuracy.

| Question group name | BSG (%) | SG (%) |
|---|---|---|
| Capital common countries | 62.4 | 59.5 |
| Capital world | 53.2 | 50.2 |
| Currency | 7.1 | 3.6 |
| City in state | 14.3 | 17.9 |
| Family | 55.7 | 63.3 |
| Adjective to adverb | 20.1 | 15.2 |
| Opposite | 6.7 | 6.7 |
| Comparative | 47.2 | 43 |
| Superlative | 41 | 48.1 |
| Present participle | 39 | 36.2 |
| Nationality adjective | 88.9 | 82.3 |
| Past tense | 43.7 | 39.2 |
| Plural | 46.9 | 42.3 |
| Plural verbs | 29.4 | 29.1 |
| Total | 45.1 | 42.5 |

Table 2: A comparison between BSG and SG on the word analogy task.

Examining the results from both tables, we notice that in most cases BSG achieves better results than SG. This might be explained by the fact that BSG leverages information from second moments as well. Comparing our results with the literature (Pennington, Socher, and Manning 2014; Mikolov et al. 2013), we see that the scores obtained by both models are lower. This might be explained by several reasons: first, we use a smaller corpus of 66M words vs. 1-30B words in (Pennington, Socher, and Manning 2014; Mikolov et al. 2013). Second, the target representation dimension we use is 40 vs. 100-600 in (Pennington, Socher, and Manning 2014; Mikolov et al. 2013). Therefore, we believe that the performance of our models can be improved significantly by increasing the representation dimension as well as the amount of training data. Recall that our main goal is to show that BSG is an effective word embedding method that provides competitive results when compared to the SG method.

## 6 Conclusion

In this paper, we introduced the BSG algorithm, which is based on a VB solution to the SG objective. We provided the mathematical derivation of the proposed solution as well as a step-by-step algorithm that is straightforward to implement. Furthermore, we proposed several density based similarity measures. We demonstrated the application of BSG on various linguistic datasets and presented experimental results that show BSG and SG are competitive.

## References

Barkan, Oren, Yael Brumer, and Noam Koenigstein. 2016. Modelling Session Activity with Neural Embedding. In RecSys Posters.
Barkan, Oren, and Noam Koenigstein. 2016. Item2Vec: Neural Item Embedding for Collaborative Filtering. arXiv preprint arXiv:1603.04259.

Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.

Bruni, Elia, Nam Khanh Tran, and Marco Baroni. 2014. Multimodal Distributional Semantics. Journal of Artificial Intelligence Research 49: 1-47. doi:10.1613/jair.4135.

Chelba, Ciprian, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2014. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling. In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2635-2639.

Collobert, Ronan, and Jason Weston. 2008. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. In Proceedings of the 25th International Conference on Machine Learning, 160-167. doi:10.1145/1390156.1390177.

Finkelstein, Lev, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2002. Placing Search in Context: The Concept Revisited. ACM Transactions on Information Systems 20 (1): 116-131. doi:10.1145/503104.503110.

Frome, Andrea, Greg S. Corrado, and Jonathon Shlens. 2013. DeViSE: A Deep Visual-Semantic Embedding Model. Advances in Neural Information Processing Systems, 1-11.

Garten, Justin, Kenji Sagae, Volkan Ustun, and Morteza Dehghani. 2015. Combining Distributed Vector Representations for Words. In Proceedings of NAACL-HLT, 95-101.

Hill, Felix, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation. Computational Linguistics 41 (4): 665-695. doi:10.1162/COLI.

Huang, Eric H., Richard Socher, Christopher D. Manning, and Andrew Ng. 2012. Improving Word Representations via Global Context and Multiple Word Prototypes. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, 873-882.

Jaakkola, Tommi S., and Michael I. Jordan. 1996. A Variational Approach to Bayesian Logistic Regression Models and Their Extensions. AISTATS.

Lazaridou, Angeliki, Nghia The Pham, and Marco Baroni. 2015. Combining Language and Vision with a Multimodal Skip-Gram Model. Proceedings of Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, 153-163.

Luong, Minh-Thang, Richard Socher, and Christopher D. Manning. 2013. Better Word Representations with Recursive Neural Networks for Morphology. CoNLL-2013, 104-113.

MacKay, David J. C. 1992.
The Evidence Framework Applied to Classification Networks. Neural Computation 4 (5): 720-736. doi:10.1162/neco.1992.4.5.720.

Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and Their Compositionality. Advances in Neural Information Processing Systems (NIPS), 1-9.

Mnih, Andriy, and Geoffrey E. Hinton. 2008. A Scalable Hierarchical Distributed Language Model. Advances in Neural Information Processing Systems, 1-8.

Paquet, Ulrich, and Noam Koenigstein. 2013. One-Class Collaborative Filtering with Random Graphs. Proceedings of the 22nd International Conference on World Wide Web, 999-1008.

Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543. doi:10.3115/v1/D14-1162.

Robbins, Herbert, and Sutton Monro. 1951. A Stochastic Approximation Method. The Annals of Mathematical Statistics 22 (3): 400-407. doi:10.1214/aoms/1177729586.

Spearman, C. 1987. The Proof and Measurement of Association between Two Things. By C. Spearman, 1904. The American Journal of Psychology 100 (3-4): 441-471. doi:10.1037/h0065390.

Vilnis, Luke, and Andrew McCallum. 2015. Word Representations via Gaussian Embedding. In Proceedings of ICLR 2015, 1-12.

Zhang, J., J. Salwen, M. Glass, and A. Gliozzo. 2014. Word Semantic Representations Using Bayesian Probabilistic Tensor Factorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).