# Model Comparison for Semantic Grouping

Francisco Vargas 1, Kamen Brestnichki 1, Nils Hammerla 1

Abstract

We introduce a probabilistic framework for quantifying the semantic similarity between two groups of embeddings. We formulate the task of semantic similarity as a model comparison task in which we contrast a generative model which jointly models two sentences versus one that does not. We illustrate how this framework can be used for the Semantic Textual Similarity tasks using clear assumptions about how the embeddings of words are generated. We apply model comparison that utilises information criteria to address some of the shortcomings of Bayesian model comparison, whilst still penalising model complexity. We achieve competitive results by applying the proposed framework with an appropriate choice of likelihood on the STS datasets.

1. Introduction

The problem of Semantic Textual Similarity (STS), measuring how closely the meaning of one piece of text corresponds to that of another, has been studied in the hope of improving performance across various problems in Natural Language Processing (NLP), including information retrieval (Zheng & Callan, 2015). Recent progress in the learning of word embeddings (Mikolov et al., 2013b) has allowed the encoding of words using distributed vector representations, which capture semantic information through their location in the learned embedding space. Despite the extent to which semantic relations between words are captured in this space, it remains a challenge for researchers to adapt these individual word embeddings to express semantic similarity between word groups, like documents, sentences, and other textual formats.

1 Babylon Health. Correspondence to: Francisco Vargas. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Recent methods for STS rely on additive composition of word vectors (Arora et al., 2016; Blacoe & Lapata, 2012; Mitchell & Lapata, 2008; 2010; Wieting et al., 2015) or deep learning architectures (Kiros et al., 2015), both of which summarise a sentence through a single embedding. The resulting sentence vectors are then compared using cosine similarity, a choice stemming only from the fact that cosine similarity gives good empirical results. It is difficult for a practitioner to utilize word vectors efficiently, as the underlying assumptions in the similarity measure are not well understood and the assumptions made at the word vector level are not clearly defined. For example, embedding magnitude in (Arora et al., 2016) seems to matter at the word level, but not at the sentence level. That is, normalizing word embeddings before the SIF estimate decreases results considerably, while normalising sentence embeddings has no effect on results.

The main contribution of our work is the proposal of a framework that addresses these issues by explicitly deriving the similarity measure through a chosen generative model of embeddings. Via this design process, a practitioner can encode suitable assumptions and constraints that may be favourable to the application of interest. Furthermore, this framework puts forward a new research direction that could help improve the understanding of semantic similarity by allowing practitioners to study suitable embedding distributions and assess how these perform.
The second contribution of our work is the derivation of a similarity measure that performs well in an online setting. Online settings are both practical and key to use-cases that involve information retrieval in dialogue systems. For example, in a chat-bot application new queries will arrive one at a time, and methods such as the one proposed in (Arora et al., 2016) will not perform as strongly as they do on the benchmark datasets. This is because one cannot perform the required data pre-processing on the entire query dataset, which will not be available a priori in online settings. Whilst our framework produces an online similarity metric, it remains competitive with offline methods such as (Arora et al., 2016). We achieve results comparable to (Arora et al., 2016) on the STS datasets in O(nd) time compared to the O(nd^2) average complexity of their method (where n is the number of words in a sentence, and d is the embedding size).

2. Background

The compositional nature of distributed representations demonstrated in (Mikolov et al., 2013b) and (Pennington et al., 2014) indicates the presence of semantic groups in the representation space of word embeddings, an idea which has been further explored in (Athiwaratkun & Wilson, 2017). Under this assumption, the task of semantic similarity can be formulated as the following question: are the two sentences (as groups of words) samples from the same semantic group? Utilising this rephrasing, this work formulates the task of semantic similarity between two arbitrary groups of objects as a model comparison problem.

Taking inspiration from (Ghahramani & Heller, 2006) and (Marshall et al., 2006) we propose the generative models for groups (e.g. sentences) $D_1, D_2$ seen in Figure 1. The Bayes factor for this graphical model is then formally defined as

$$\mathrm{sim}(D_1, D_2) = \log \frac{p(D_1, D_2 \mid M_1)}{p(D_1, D_2 \mid M_2)} = \log \frac{p(D_1, D_2 \mid M_1)}{p(D_1 \mid M_2)\, p(D_2 \mid M_2)}. \qquad (1)$$

To obtain the evidence $p(D \mid M_i)$ the parameters of the model must be marginalised:

$$p(D_1, D_2 \mid M_1) = \int p(D_1, D_2 \mid \theta)\, p(\theta)\, d\theta = \int \prod_{w_k \in D_1 \oplus D_2} p(w_k \mid \theta)\, p(\theta)\, d\theta,$$

$$p(D_i \mid M_2) = \int \prod_{w_k \in D_i} p(w_k \mid \theta)\, p(\theta)\, d\theta,$$

where $\oplus$ denotes concatenation and $w_k$ is a word embedding.

Figure 1. On the left, $M_1$ assumes that both datasets are generated i.i.d. from the same parametric distribution. On the right, $M_2$ assumes that the datasets are generated i.i.d. from distinct parametric distributions.

Computing the semantic similarity score of the two groups $D_1, D_2$ under the Bayesian framework requires selecting a reasonable model likelihood $p(w_k \mid \theta)$, a prior density on the parameters $p(\theta)$, and computing the marginal evidence specified above. Computing the evidence can be computationally intensive and usually requires approximation. What is more, the Bayes factor is very sensitive to the choice of prior and can result in estimates that heavily underfit the data (especially under a vague prior, as shown in Appendix E¹), having the tendency to select the simpler model, as argued further by (Bartlett, 1957) and (Akaike et al., 1981). This is handled in (Ghahramani & Heller, 2006) by using the empirical Bayes procedure, a shortcoming of which is the issue of double counting (Berger, 2000) and thus being prone to over-fitting. We address these issues by choosing to work with information-criteria-based model comparison as opposed to using the Bayes factor. The details are described in Section 3. We are not aware of any prior work on sentence similarity that uses our approach.
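For intuition only, the sketch below evaluates Equation (1) for one simple conjugate choice: an isotropic Gaussian likelihood with known variance `sigma2` and a zero-mean Gaussian prior with variance `tau2` on the mean, for which the evidence has a closed form. This choice of likelihood, prior and hyperparameters is an illustrative assumption of ours rather than a modelling choice made in this paper (which moves away from the Bayes factor in Section 3), and the function names are likewise hypothetical.

```python
import numpy as np

def log_evidence_iso_gaussian(W, sigma2=1.0, tau2=1.0):
    """Log marginal likelihood of embeddings W (n x d) under w ~ N(mu, sigma2*I)
    with known variance and conjugate prior mu ~ N(0, tau2*I).
    Each dimension factorises, so the evidence is a sum of 1-D terms."""
    n, d = W.shape
    s = W.sum(axis=0)                 # per-dimension sum of embeddings
    a = n / sigma2 + 1.0 / tau2       # per-dimension posterior precision of mu
    log_ev_per_dim = (
        -0.5 * n * np.log(2 * np.pi * sigma2)
        - 0.5 * np.log(tau2)
        - 0.5 * np.log(a)
        - 0.5 * (W ** 2).sum(axis=0) / sigma2
        + 0.5 * (s / sigma2) ** 2 / a
    )
    return log_ev_per_dim.sum()

def bayes_factor_similarity(W1, W2, sigma2=1.0, tau2=1.0):
    """log p(D1, D2 | M1) - log p(D1 | M2) - log p(D2 | M2), as in Equation (1)."""
    joint = log_evidence_iso_gaussian(np.vstack([W1, W2]), sigma2, tau2)
    return (joint
            - log_evidence_iso_gaussian(W1, sigma2, tau2)
            - log_evidence_iso_gaussian(W2, sigma2, tau2))
```

In this conjugate setting the score can be computed exactly; for the likelihoods considered later the evidence has no such closed form, which is one more motivation for the information-criteria route taken below.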
We employ a generative model for sentences similar to (Arora et al., 2016), but our contribution differs from theirs in that our similarity is based on the aforementioned model comparison test, whilst theirs is based on the inner product of sentence embeddings derived as maximum likelihood estimators. The method by (Marshall et al., 2006) applies the same score as in Equation 1 in the context of merging datasets, whilst we focus on information retrieval and semantic textual similarity.

3. Methodology

We address the shortcomings of the Bayes factor described in Section 2 by proposing model comparison criteria that minimise the Kullback-Leibler (KL) divergence (denoted in equations as $D_{KL}$) across a candidate set of models. This results in a penalised likelihood ratio test which gives competitive results. This approach relies on Information Theoretic Criteria to design a log-ratio-like test that is prior free and robust to overfitting.

We seek to compare models $M_1$ and $M_2$ using Information Criteria (IC) to assess the goodness of fit of each model. There are multiple IC used for model selection, each with different settings to which they are better suited. The IC which we will be working with have the general form

$$\mathrm{IC}(D, M) = \alpha L(\hat\theta \mid D, M) + \Omega(D, M),$$

where $L(\hat\theta \mid D, M) = \sum_{i=1}^{n} L(\hat\theta \mid w_i, M)$ is the maximised value of the log likelihood function for model $M$, $\alpha$ is a scalar derived for each IC, and $\Omega(D, M)$ represents a model complexity penalty term which is model and IC specific. Using this general formulation for the involved information criteria yields the similarity score

$$\mathrm{sim}(D_1, D_2) = \mathrm{IC}(\{D_1, D_2\}, M_1) - \mathrm{IC}(\{D_1, D_2\}, M_2) = \alpha\Big[\hat L(\hat\theta_{1,2} \mid D, M_1) - \big(\hat L(\hat\theta_1 \mid D_1, M_2) + \hat L(\hat\theta_2 \mid D_2, M_2)\big)\Big] + \Omega(\{D_1, D_2\}, M_1) - \Omega(\{D_1, D_2\}, M_2).$$

Tversky's contrast model² describes what a good similarity is from a cognitive science perspective. Interestingly, the semantic interpretation of the similarity we have derived is similar to the one described in that work. We want to contrast the commonalities of the two datasets (through shared parameters) to the distinctive features of each dataset (through independently fit parameters).

¹ All appendices are provided in the supplementary materials.
² As described in the paragraph surrounding Equation (9) of (Tenenbaum & Griffiths, 2001).

Examples of these criteria can be put into two broad classes. The Bayesian Information Criterion (BIC) is an example of an IC that approximates the model evidence directly, as defined in (Schwarz et al., 1978). Empirically it has been shown that the BIC is likely to underfit the data, especially when the number of samples is small (Dziak et al., 2012). We provide additional empirical results in Appendix E, showing BIC is not a good fit for the STS task as sentences contain a relatively small number of words (samples). Thus we focus on the second class, Information Theoretic Criteria.

3.1. Information Theoretic Criteria

The Information Theoretic Criteria (ITC) are a family of model selection criteria. The task they address is evaluating the expected quality of an estimated model specified by $L(\hat\theta \mid w)$ when it is used to generate unseen data from the true distribution $G(w)$, as defined in (Konishi & Kitagawa, 2008b). This family of criteria perform this evaluation by using the KL divergence between the true model $G(w)$ and the fitted model $L(\hat\theta \mid w)$, with the aim of selecting the model (from a given set of models) that minimizes the quantity

$$D_{KL}\big(G(w) \,\|\, p(w \mid \hat\theta)\big) = \mathbb{E}_G\left[\ln \frac{G(w)}{p(w \mid \hat\theta)}\right] = -H_G(w) - \mathbb{E}_G\left[\ln p(w \mid \hat\theta)\right].$$

The entropy of the true model $H_G(w)$ will remain constant across different likelihoods.
Thus, the quantity of interest in the definition of the information criterion under consideration is given by the expected log likelihood under the true model $\mathbb{E}_G[\ln p(w \mid \hat\theta)]$. The goal is to find a good estimator for this quantity, and one such estimator is given by the normalized maximum log likelihood

$$\mathbb{E}_{\hat G}\left[\ln p(w \mid \hat\theta)\right] = \frac{1}{n}\sum_{i=1}^{n} \ln p(w_i \mid \hat\theta),$$

where $\hat G$ represents the empirical distribution. This estimator introduces a bias that varies with respect to the dimension of the model's parameter vector $\theta$ and requires a correction in order to carry out a fair comparison of information criteria between models. A model specific correction is derived resulting in the following IC, called the Takeuchi Information Criterion (TIC) in (Takeuchi, 1976):

$$\hat J = -\sum_{i=1}^{n} \nabla^2_\theta L(\theta \mid w_i)\Big|_{\theta=\hat\theta}, \qquad \hat I = \sum_{i=1}^{n} \nabla_\theta L(\theta \mid w_i)\, \nabla_\theta L(\theta \mid w_i)^\top\Big|_{\theta=\hat\theta},$$

$$\mathrm{TIC}(D, M) = 2\Big(L(\hat\theta \mid D, M) - \mathrm{tr}\big(\hat I \hat J^{-1}\big)\Big). \qquad (2)$$

For the case where we assume our model has the same parametric form as the true model and as $n \to \infty$, the equality $\hat I = \hat J$ holds, resulting in a penalty of $\mathrm{tr}(\hat I \hat J^{-1}) = \mathrm{tr}(I_k) = k$, where $k$ is the number of model parameters, as shown in (Konishi & Kitagawa, 2008a). This results in the Akaike Information Criterion (AIC) (Akaike, 1974)

$$\mathrm{AIC}(D, M) = 2\big(L(\hat\theta \mid D, M) - k\big).$$

The AIC simplification of TIC relies on several assumptions that hold true in the big data limit. However, as shown in Appendix F, for models with a high number of parameters, TIC may prove unstable and thus AIC will generally perform better. In this study we consider and contrast both. We show in Appendix A that under the TIC we have the following similarity (where we omit the conditioning on the models for brevity)

$$\mathrm{sim}(D_1, D_2) = 2\Big(L(\hat\theta_{1,2} \mid D_{1,2}) - L(\hat\theta_1 \mid D_1) - L(\hat\theta_2 \mid D_2) - \mathrm{tr}\big(\hat I_{1,2}\hat J_{1,2}^{-1}\big) + \mathrm{tr}\big(\hat I_1\hat J_1^{-1}\big) + \mathrm{tr}\big(\hat I_2\hat J_2^{-1}\big)\Big),$$

and therefore under the AIC we have the following similarity

$$\mathrm{sim}(D_1, D_2) = 2\Big(L(\hat\theta_{1,2} \mid D_{1,2}) - L(\hat\theta_1 \mid D_1) - L(\hat\theta_2 \mid D_2) + k\Big),$$

where $D_{1,2} = D_1 \oplus D_2$.

4. Word Embedding Likelihoods

In this section we will illustrate how to derive a similarity score under our ITC framework by choosing a likelihood function that incorporates our prior assumptions about the generating process of the data. Adopting the viewpoint of a practitioner, we would like to compare the performance of two models: one that ignores word embedding magnitude, and one that makes use of it. Our modelling choices for each assumption are the von Mises-Fisher (vMF) and Gaussian likelihoods respectively. The comparison between the two likelihoods we provide in Section 5 provides empirical evidence as to which approach is better suited to modelling word embeddings. As we will see, the TIC penalties of both likelihoods we consider can be calculated in O(nd), thus not increasing the time complexity of the algorithm.

4.1. Von Mises-Fisher Likelihood

Cosine similarity is often used to measure the semantic similarity of words in various information retrieval tasks. Thus, we want to explore a distribution induced by the cosine similarity measure. We model our embeddings as vectors lying on the surface of the $(d-1)$-dimensional unit hypersphere, $w \in S^{d-1}$, distributed i.i.d. according to a vMF likelihood (Fisher et al., 1993)

$$p(w \mid \mu, \kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2}\, I_{d/2-1}(\kappa)} \exp\big(\kappa \mu^\top w\big) = \frac{1}{Z(\kappa)} \exp\big(\kappa \mu^\top w\big),$$

where $\mu$ is the mean direction vector and $\kappa$ is the concentration, with supports $\|\mu\| = \|w\| = 1$, $\kappa \geq 0$. The term $I_\nu(\kappa)$ corresponds to a modified Bessel function of the first kind with order $\nu$.
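As a concrete illustration of fitting this likelihood to a single sentence, the sketch below computes the MLE of the mean direction and the maximised vMF log likelihood. The closed-form approximation to the $\kappa$ MLE (due to Banerjee et al., 2005) and the use of scipy's exponentially scaled Bessel function are our own illustrative assumptions; the paper itself does not prescribe this particular estimator.

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function I_v

def log_Z(kappa, d):
    """log normaliser of the vMF density:
    log Z(k) = (d/2) log(2*pi) + log I_{d/2-1}(k) - (d/2 - 1) log k,
    with log I_v(k) = log ive(v, k) + k for numerical stability."""
    v = d / 2.0 - 1.0
    return (d / 2.0) * np.log(2 * np.pi) + np.log(ive(v, kappa)) + kappa - v * np.log(kappa)

def fit_vmf(W):
    """MLE of (mu, kappa) for unit-norm rows of W (n x d). kappa uses the
    standard closed-form approximation of Banerjee et al. (2005), an
    assumption made here for illustration, not the paper's exact procedure."""
    n, d = W.shape
    mean = W.mean(axis=0)
    R_bar = np.linalg.norm(mean)                    # mean resultant length
    mu = mean / R_bar
    kappa = R_bar * (d - R_bar ** 2) / (1.0 - R_bar ** 2)
    return mu, kappa, R_bar

def vmf_log_likelihood(W):
    """Maximised vMF log likelihood: n * (kappa * R_bar - log Z(kappa))."""
    n, d = W.shape
    _, kappa, R_bar = fit_vmf(W)
    return n * (kappa * R_bar - log_Z(kappa, d))
```

Maximised log likelihoods computed this way can be plugged directly into the TIC or AIC similarity above, one term per sentence and one for the concatenation.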
In this work we reparameterise the random variable to polar hypersphericals $w(\phi)$ (with $\phi = (\phi_1, ..., \phi_{d-1})^\top$) and $\mu(\theta)$ (with $\theta = (\theta_1, ..., \theta_{d-1})^\top$) as adopted in (Mardia, 1975). Further details can be found in Appendix B. We prove (in Appendix C) that the mixed derivatives of the vMF log likelihood are a constant (with respect to $\theta$) times $\partial L(\theta, \kappa \mid \phi)/\partial\theta_k$. Evaluated at the MLE, these entries are therefore zero, so $\hat J = \mathrm{diag}(\hat J_{11}, ..., \hat J_{dd})$ and we can express the TIC penalty described in Equation 2 as

$$\mathrm{tr}\big(\hat I \hat J^{-1}\big) = \sum_{i=1}^{d} \hat J_{ii}^{-1} \hat I_{ii} = \hat J_{11}^{-1}\big(\partial_\kappa L(\theta, \kappa \mid D)\big)^2 + \sum_{i=2}^{d} \hat J_{ii}^{-1}\big(\partial_{\theta_{i-1}} L(\theta, \kappa \mid D)\big)^2. \qquad (3)$$

This quantity only requires O(nd) operations to compute and thus does not increase the asymptotic complexity of the algorithm. The closed form of the similarity measure for two sentences $D_1, D_2$ of length $m$ and $l$ respectively under this model is then

$$\mathrm{sim}(D_1, D_2) = (m+l)\,\hat\kappa_{1,2}\bar R_{1,2} - m\,\hat\kappa_1\bar R_1 - l\,\hat\kappa_2\bar R_2 - (m+l)\log Z(\hat\kappa_{1,2}) + m\log Z(\hat\kappa_1) + l\log Z(\hat\kappa_2) - \mathrm{tr}\big(\hat I_{1,2}\hat J_{1,2}^{-1}\big) + \mathrm{tr}\big(\hat I_1\hat J_1^{-1}\big) + \mathrm{tr}\big(\hat I_2\hat J_2^{-1}\big),$$

where $\bar R$ denotes the norm of the mean embedding of the corresponding sentence and the Jacobian terms (from the reparametrisation) cancel out. The subscripts indicate the sentence, with 1,2 meaning the concatenation of the two sentences.

4.2. Gaussian Likelihood

(Schakel & Wilson, 2015) show that some frequency information is contained in the magnitude of word embeddings. This motivates a choice of a likelihood function that is not constrained to the unit hypersphere, and possibly the simplest such choice is the Gaussian likelihood. Due to the small size of sentences³ we choose a diagonal-covariance Gaussian model

$$p(w \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(w-\mu)^\top \Sigma^{-1}(w-\mu)\Big),$$

where $\Sigma$ is a diagonal matrix.

³ The covariance matrix of n samples with d dimensions such that n < d results in a low rank matrix.

Our framework further allows us to compare two models and pick the better one without having access to a similarity corpus, as long as the comparison is done on the same data. As an example, we compare the diagonal Gaussian with a more restricted version of itself, the spherical Gaussian. We compute the average AIC on a corpus of sentences and observe that the average AIC of the diagonal Gaussian is lower than that of the spherical one. This suggests that the diagonal Gaussian better describes the distribution of word vectors in a sentence and thus will produce a better similarity. We confirm this result in Appendix H, where we also provide the average AIC scores for each model.

As with the vMF likelihood, we prove in Appendix D that the Hessian of the log likelihood evaluated at the MLE is diagonal. The TIC correction is a sum of O(d) terms, similar to the form in Equation 3. This can be nicely written in terms of the biased sample kurtosis (denoted $\hat\kappa$)

$$\mathrm{tr}\big(\hat I \hat J^{-1}\big) = \frac{1}{2}\sum_{i=1}^{d}\big((\hat\kappa)_i + 1\big).$$

The closed form of the similarity measure for two sentences $D_1, D_2$ of length $m$ and $l$ respectively under this model is

$$\mathrm{sim}(D_1, D_2) = \sum_{i=1}^{d}\Big(-(m+l)\ln(\hat\sigma_{1,2})_i + m\ln(\hat\sigma_1)_i + l\ln(\hat\sigma_2)_i\Big) + \sum_{i=1}^{d}\Big(-(\hat\kappa_{1,2})_i + (\hat\kappa_1)_i + (\hat\kappa_2)_i\Big),$$

where the subscripts indicate the sentence, with 1,2 meaning the concatenation of the two sentences.

5. Experiments

We assess our methods' performance on the Semantic Textual Similarity (STS) datasets⁴ (Agirre et al., 2012; 2013; 2014; 2015; 2016). The objective of these tasks is to estimate the similarity between two given sentences, validated against human scores. In our experiments, we evaluate on the pre-trained GloVe (Pennington et al., 2014), FastText (Bojanowski et al., 2016), and Word2Vec GN (Mikolov et al., 2013a) word embeddings. For the vMF distribution, we normalize the word embeddings to be of length 1. Some of the sentences are left with a single word after querying the word embeddings, making the MLE of the $\kappa$ parameter of the vMF and the $\Sigma$ parameter of the Gaussian ill-defined, which in turn causes the similarity metric to take on undefined values. We overcome this issue by padding each sentence with an arbitrary embedding of a word or punctuation symbol from the embedding lexicon (e.g. "." or "the"). Our code builds on top of SentEval (Conneau & Kiela, 2018) and is available at https://github.com/Babylonpartners/MCSG.

⁴ The STS13 dataset does not include the proprietary SMT dataset that was available with the original release of STS.

We first compare our methods, vMF likelihood with TIC correction (vMF+TIC) and diagonal Gaussian likelihood with AIC correction (Diag+AIC), against each other. Then, the better method is compared against the mean word vector (MWV), word mover's distance (WMD) (Kusner et al., 2015)⁵, smooth inverse frequency (SIF), and SIF with principal component removal as defined in (Arora et al., 2016)⁶. We re-ran these models under our experimental setup, to ensure a fair comparison. The metric used is the average Spearman correlation score over each dataset, weighted by the number of sentences. The choice of Spearman correlation is given by its non-parametric nature (it assumes no distribution over the scores), as well as measuring any monotonic relationship between the two compared quantities.

⁵ https://github.com/mkusner/wmd
⁶ https://github.com/PrincetonML/SIF

5.1. Embedding Magnitude

A practitioner may want to learn more about a given set of word embeddings, and the way these embeddings were trained may not allow the user to understand the importance of certain features, say embedding magnitude. This is where our framework can be used to help build intuition, by comparing a likelihood that implicitly incorporates embedding magnitude to one that does not. We present the comparison between the similarities derived from the vMF and Gaussian likelihoods in Table 1.

Table 1. Comparison of Spearman correlations on the STS datasets between the two similarity measures we introduce in the text. The average is weighted according to dataset size.

| Embedding | Method | STS12 | STS13 | STS14 | STS15 | STS16 | Average |
|---|---|---|---|---|---|---|---|
| FastText | vMF+TIC | 0.5219 | 0.5147 | 0.5719 | 0.6456 | 0.6347 | 0.5762 |
| FastText | Diag+AIC | 0.6193 | 0.6334 | 0.6721 | 0.7328 | 0.7518 | 0.6764 |
| GloVe | vMF+TIC | 0.5421 | 0.5598 | 0.5736 | 0.6474 | 0.6168 | 0.5859 |
| GloVe | Diag+AIC | 0.6031 | 0.6131 | 0.6445 | 0.7171 | 0.7346 | 0.6564 |
| Word2Vec GN | vMF+TIC | 0.5665 | 0.5735 | 0.6062 | 0.6681 | 0.6510 | 0.6115 |
| Word2Vec GN | Diag+AIC | 0.5957 | 0.6358 | 0.6614 | 0.7213 | 0.7187 | 0.6618 |

We note that the Gaussian is a much better modelling choice, beating the vMF on every dataset, with an average margin of at least 0.05 (5%) for each of the three word embeddings. This is strong evidence that the information encoded in an embedding's magnitude is useful for tasks such as semantic similarity. This further motivates the conjecture that frequency information is contained in word embedding magnitude, as explored in (Schakel & Wilson, 2015).
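To make the Diag+AIC score used in these experiments concrete, a minimal sketch is given below. It instantiates the AIC similarity of Section 3 with the diagonal Gaussian of Section 4.2; the variance floor `eps` and the function names are our own illustrative choices rather than details taken from the released implementation.

```python
import numpy as np

def diag_gaussian_loglik(W, eps=1e-8):
    """Maximised log likelihood of the rows of W (n x d) under a diagonal
    Gaussian fitted by maximum likelihood (biased variance estimate)."""
    n, d = W.shape
    var = W.var(axis=0) + eps   # eps is a numerical floor, not part of the paper
    return -0.5 * n * (np.log(2 * np.pi * var).sum() + d)

def diag_aic_similarity(W1, W2):
    """AIC-based similarity 2*(L_12 - L_1 - L_2 + k), where k = 2d is the
    parameter count of one diagonal Gaussian (a mean and a variance per dim)."""
    d = W1.shape[1]
    k = 2 * d
    joint = diag_gaussian_loglik(np.vstack([W1, W2]))
    return 2 * (joint - diag_gaussian_loglik(W1) - diag_gaussian_loglik(W2) + k)
```

Here each sentence is represented as an array with one word embedding per row, padded as described above when only a single word survives the embedding lookup.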
5.2. Online Scenario

In an online setting, one cannot perform the principal component removal described in (Arora et al., 2016), as that pre-processing requires access to the entire query dataset a priori. The comparison against the baseline methods is presented in Table 2.

Table 2. Comparison of Spearman correlations on the STS datasets between our best model (Diag+AIC) and SIF, WMD and MWV for three different word vectors. The average is weighted according to dataset size.

| Embedding | Method | STS12 | STS13 | STS14 | STS15 | STS16 | Average |
|---|---|---|---|---|---|---|---|
| FastText | Diag+AIC | 0.6193 | 0.6334 | 0.6721 | 0.7328 | 0.7518 | 0.6764 |
| FastText | SIF | 0.6079 | 0.6989 | 0.6777 | 0.7436 | 0.7135 | 0.6821 |
| FastText | MWV | 0.5994 | 0.6494 | 0.6473 | 0.7114 | 0.6814 | 0.6542 |
| FastText | WMD | 0.5576 | 0.5146 | 0.5915 | 0.6800 | 0.6402 | 0.5997 |
| GloVe | Diag+AIC | 0.6031 | 0.6131 | 0.6445 | 0.7171 | 0.7346 | 0.6564 |
| GloVe | SIF | 0.5774 | 0.6319 | 0.6135 | 0.6740 | 0.6589 | 0.6255 |
| GloVe | MWV | 0.5526 | 0.5643 | 0.5625 | 0.6314 | 0.5804 | 0.5784 |
| GloVe | WMD | 0.5516 | 0.5007 | 0.5811 | 0.6704 | 0.6246 | 0.5896 |
| Word2Vec GN | Diag+AIC | 0.5957 | 0.6358 | 0.6614 | 0.7213 | 0.7187 | 0.6618 |
| Word2Vec GN | SIF | 0.5697 | 0.6594 | 0.6669 | 0.7261 | 0.6952 | 0.6588 |
| Word2Vec GN | MWV | 0.5744 | 0.6330 | 0.6561 | 0.7040 | 0.6617 | 0.6451 |
| Word2Vec GN | WMD | 0.5554 | 0.5250 | 0.6074 | 0.6730 | 0.6399 | 0.6034 |

The method proposed in this work is able to out-perform the standard weighting induced by MWV, as well as the WMD approach, on all datasets, with each of the three word embeddings considered. We outperform SIF using the GloVe embeddings by 0.0309 (3.09%) and effectively tie when using the FastText and Word2Vec GN embeddings, with a difference of 0.0057 (0.57%) and 0.0030 (0.30%) respectively. In Appendix I we conduct a significance analysis. We show that SIF and our method are on par with each other and that both significantly outperform MWV and WMD when using the GloVe embeddings.

5.3. Offline Scenario

There are use-cases in which the entire dataset of sentences is available at evaluation time, for example in clustering applications. For this scenario, we compare against the SIF weightings augmented with the additional pre-processing technique seen in (Arora et al., 2016). We need only consider this baseline, as it outperforms all others by a large margin. The results are shown in Table 3. We remain competitive with SIF+PCA on all three word embeddings, being able to match very closely on GloVe embeddings. On the FastText and Word2Vec embeddings, our method is less than 0.01 (1%) lower on average than SIF+PCA.

Table 3. Comparison of Spearman correlations on the STS datasets between our best model (Diag+AIC) and SIF+PCA for three different word vectors.

| Word vectors | Method | STS12 | STS13 | STS14 | STS15 | STS16 | Average |
|---|---|---|---|---|---|---|---|
| FastText | Diag+AIC | 0.6193 | 0.6334 | 0.6721 | 0.7328 | 0.7518 | 0.6764 |
| FastText | SIF+PCA | 0.5945 | 0.7149 | 0.6824 | 0.7474 | 0.7271 | 0.6843 |
| GloVe | Diag+AIC | 0.6031 | 0.6131 | 0.6445 | 0.7171 | 0.7346 | 0.6564 |
| GloVe | SIF+PCA | 0.5732 | 0.6843 | 0.6546 | 0.7166 | 0.6931 | 0.6589 |
| Word2Vec GN | Diag+AIC | 0.5957 | 0.6358 | 0.6614 | 0.7213 | 0.7187 | 0.6618 |
| Word2Vec GN | SIF+PCA | 0.5602 | 0.6773 | 0.6722 | 0.7354 | 0.7111 | 0.6639 |

6. Conclusion

We've presented a new approach to similarity measurement that achieves competitive performance relative to standard methods in both online and offline settings. Our method requires a set of clear choices: model, likelihood and information criterion. From that, a comparison framework is naturally derived, which supplies us with a statistically justified similarity measure (by utilizing ITC to reduce the resulting model-comparison bias).
This framework is suitable for a variety of modelling scenarios, due to the freedom in specifying the generative process. The graphical model we employ is adaptable to encode structural dependencies beyond the i.i.d. data-generating process we have assumed throughout this study; for example, an auto-regressive (sequential) model may be assumed if the practitioner suspects that word order matters (i.e. compare "Does she want to get pregnant?" to "She does want to get pregnant.").

In this study, we conjecture that the von Mises-Fisher distribution lends itself to representing word embeddings well, if their magnitude is disregarded and a unimodal distribution over individual sentences is assumed. Relaxing the former assumption, we also model word embeddings with a Gaussian likelihood. As this improves results, it suggests that word embedding magnitude carries information relevant for sentence level tasks, which agrees with prior intuition built from (Schakel & Wilson, 2015). We hope that this framework could be a stepping stone in using more complex and accurate generative models of text to assess semantic similarity. For example, relaxing the assumption of unimodality is an interesting area for future research.

Acknowledgements

We would like to thank Dane Sherburn for the help with the TIC robustness plots and Kostis Gourgoulias for the many insightful conversations.

References

Agirre, E., Diab, M., Cer, D., and Gonzalez-Agirre, A. SemEval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pp. 385-393. Association for Computational Linguistics, 2012.

Agirre, E., Cer, D., Diab, M., Gonzalez-Agirre, A., and Guo, W. *SEM 2013 shared task: Semantic textual similarity. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, volume 1, pp. 32-43, 2013.

Agirre, E., Banea, C., Cardie, C., Cer, D., Diab, M., Gonzalez-Agirre, A., Guo, W., Mihalcea, R., Rigau, G., and Wiebe, J. SemEval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pp. 81-91, 2014.

Agirre, E., Banea, C., Cardie, C., Cer, D., Diab, M., Gonzalez-Agirre, A., Guo, W., Lopez-Gazpio, I., Maritxalar, M., Mihalcea, R., et al. SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pp. 252-263, 2015.

Agirre, E., Banea, C., Cer, D., Diab, M., Gonzalez-Agirre, A., Mihalcea, R., Rigau, G., and Wiebe, J. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pp. 497-511, 2016.

Akaike, H. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716-723, 1974.

Akaike, H. et al. Likelihood of a model and information criteria. Journal of Econometrics, 16(1):3-14, 1981.

Arora, S., Liang, Y., and Ma, T. A simple but tough-to-beat baseline for sentence embeddings. International Conference on Learning Representations 2017, 2016.

Athiwaratkun, B. and Wilson, A. G. Multimodal word distributions. arXiv preprint arXiv:1704.08424, 2017.
Berger, J. O. Bayesian analysis: A look at today and thoughts of tomorrow. Journal of the American Statistical Association, 95(452):1269-1276, 2000. ISSN 0162-1459. URL http://www.jstor.org/stable/2669768.

Blacoe, W. and Lapata, M. A comparison of vector-based representations for semantic composition. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 546-556. Association for Computational Linguistics, 2012.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. Enriching word vectors with subword information. CoRR, abs/1607.04606, 2016. URL http://arxiv.org/abs/1607.04606.

Conneau, A. and Kiela, D. SentEval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449, 2018.

Dziak, J. J., Coffman, D. L., Lanza, S. T., and Li, R. Sensitivity and specificity of information criteria. The Methodology Center and Department of Statistics, Penn State, The Pennsylvania State University, 16(30):140, 2012.

Fisher, N. I., Lewis, T., and Embleton, B. J. Statistical analysis of spherical data. Cambridge University Press, 1993.

Ghahramani, Z. and Heller, K. A. Bayesian sets. In Advances in Neural Information Processing Systems, pp. 435-442, 2006.

Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R. S., Torralba, A., Urtasun, R., and Fidler, S. Skip-thought vectors. CoRR, abs/1506.06726, 2015. URL http://arxiv.org/abs/1506.06726.

Konishi, S. and Kitagawa, G. Information criteria and statistical modeling. Springer Science & Business Media, 2008a.

Konishi, S. and Kitagawa, G. Information criteria and statistical modeling. Springer Science & Business Media, 2008b.

Kusner, M., Sun, Y., Kolkin, N., and Weinberger, K. From word embeddings to document distances. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 957-966, Lille, France, 07-09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/kusnerb15.html.

Bartlett, M. S. A comment on D. V. Lindley's statistical paradox. Biometrika, 44(3/4):533-534, 1957.

Mardia, K. Distribution theory for the von Mises-Fisher distribution and its application. Statistical Distributions for Scientific Work, 1:113-130, 1975.

Marshall, P., Rajguru, N., and Slosar, A. Bayesian evidence as a tool for comparing datasets. Physical Review D, 73(6):067302, 2006.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111-3119, 2013b.

Mitchell, J. and Lapata, M. Vector-based models of semantic composition. Proceedings of ACL-08: HLT, pp. 236-244, 2008.

Mitchell, J. and Lapata, M. Composition in distributional models of semantics. Cognitive Science, 34(8):1388-1429, 2010.

Pennington, J., Socher, R., and Manning, C. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543, 2014.

Schakel, A. M. and Wilson, B. J. Measuring word significance using distributed representations of words. arXiv preprint arXiv:1508.02297, 2015.

Schwarz, G. et al. Estimating the dimension of a model. The Annals of Statistics, 6(2):461-464, 1978.
Takeuchi, K. The distribution of information statistics and the criterion of goodness of fit of models. Mathematical Science, 153:12-18, 1976.

Tenenbaum, J. B. and Griffiths, T. L. Generalization, similarity, and Bayesian inference. Behavioral and Brain Sciences, 24(4):629-640, 2001.

Wieting, J., Bansal, M., Gimpel, K., and Livescu, K. Towards universal paraphrastic sentence embeddings. arXiv preprint arXiv:1511.08198, 2015.

Zheng, G. and Callan, J. Learning to reweight terms with distributed representations. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '15, pp. 575-584, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3621-5. doi: 10.1145/2766462.2767700. URL http://doi.acm.org/10.1145/2766462.2767700.