Journal of Artificial Intelligence Research 49 (2014) 1-47. Submitted 7/13; published 1/14.

Multimodal Distributional Semantics

Elia Bruni (elia.bruni@unitn.it), Center for Mind/Brain Sciences, University of Trento, Italy
Nam Khanh Tran (ntran@l3s.de), L3S Research Center, Hannover, Germany
Marco Baroni (marco.baroni@unitn.it), Center for Mind/Brain Sciences and Department of Information Engineering and Computer Science, University of Trento, Italy

Abstract

Distributional semantic models derive computational representations of word meaning from the patterns of co-occurrence of words in text. Such models have been a success story of computational linguistics, being able to provide reliable estimates of semantic relatedness for the many semantic tasks requiring them. However, distributional models extract meaning information exclusively from text, which is an extremely impoverished basis compared to the rich perceptual sources that ground human semantic knowledge. We address the lack of perceptual grounding of distributional models by exploiting computer vision techniques that automatically identify discrete "visual words" in images, so that the distributional representation of a word can be extended to also encompass its co-occurrence with the visual words of images it is associated with. We propose a flexible architecture to integrate text- and image-based distributional information, and we show in a set of empirical tests that our integrated model is superior to the purely text-based approach, and that it provides somewhat complementary semantic information with respect to the latter.

1. Introduction

The distributional hypothesis states that words that occur in similar contexts are semantically similar.
The claim has multiple theoretical roots in psychology, structuralist linguistics, lexicography and possibly even in the later writings of Wittgenstein (Firth, 1957; Harris, 1954; Miller & Charles, 1991; Wittgenstein, 1953). However, the distributional hypothesis has had a huge impact on computational linguistics in the last two decades mainly for empirical reasons, that is, because it suggests a simple and practical method to harvest word meaning representations on a large scale: just record the contexts in which words occur in easy-to-assemble large collections of texts (corpora) and use their contextual profiles as surrogates of their meaning. Nearly all contemporary corpus-based approaches to semantics rely on contextual evidence in one way or another, but the most systematic and extensive application of distributional methods is found in what we call distributional semantic models (DSMs), also known in the literature as vector space or semantic space models of meaning (Landauer & Dumais, 1997; Sahlgren, 2006; Schütze, 1997; Turney & Pantel, 2010). In DSMs, the meaning of a word is approximated with a vector that keeps track of the patterns of co-occurrence of the word in text corpora, so that the degree of semantic similarity or, more generally, relatedness (Budanitsky & Hirst, 2006) of two or more words can be precisely quantified in terms of geometric distance between the vectors representing them. For example, both car and automobile might occur with terms such as street, gas and driver, and thus their distributional vectors are likely to be very close, cuing the fact that these words are synonyms. Extensive empirical evidence has shown that distributional semantics is very good at harvesting effective meaning representations on a large scale, confirming the validity of the distributional hypothesis (see some references in Section 2.1 below).
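The geometric intuition just described can be made concrete with a minimal Python sketch: cosine similarity over toy co-occurrence vectors. The counts and context words below are invented purely for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy co-occurrence counts over the contexts [street, gas, driver, meow]
car        = [103, 72, 61, 0]
automobile = [94, 66, 49, 0]
cat        = [12, 1, 2, 88]

print(cosine(car, automobile))  # close to 1: distributional near-synonyms
print(cosine(car, cat))         # much lower
```

Cosine (rather than raw Euclidean distance) is the usual choice because it abstracts away from overall frequency differences between words, comparing only the direction of the vectors.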
Still, for all its successes, distributional semantics suffers from the obvious limitation that it represents the meaning of a word entirely in terms of connections to other words. A long tradition of studies in cognitive science and philosophy has stressed how models in which the meaning of symbols (e.g., words) is entirely accounted for in terms of other symbols (e.g., other words), without links to the outside world (e.g., via perception), are deeply problematic, an issue that is often referred to as the symbol grounding problem (Harnad, 1990). DSMs have also come under attack for their lack of grounding (Glenberg & Robertson, 2000).1 Although the specific criticisms leveled at them might not be entirely well-founded (Burgess, 2000), there can be little doubt that the limitation to textual contexts makes DSMs very dissimilar from humans, who, thanks to their senses, have access to rich sources of perceptual knowledge when learning the meaning of words, so much so that some cognitive scientists have argued that meaning is directly embodied in sensory-motor processing (see the work in de Vega, Glenberg, & Graesser, 2008, for different views on embodiment in cognitive science). Indeed, in the last decades a large amount of behavioural and neuroscientific evidence has been amassed indicating that our knowledge of words and concepts is inextricably linked with our perceptual and motor systems. For example, perceiving action-denoting verbs such as kick or lick involves the activation of the areas of the brain controlling foot and tongue movements, respectively (Pulvermueller, 2005). Hansen, Olkkonen, Walter, and Gegenfurtner (2006) asked subjects to adjust the color of fruit objects in images until they appeared achromatic. The objects were generally adjusted until their color was shifted away from the subjects' gray point in a direction opposite to the typical color of the fruit: for example, bananas were shifted towards blue because subjects overcorrected for their typical yellow color.
Typical color also influences lexical access: for example, subjects are faster at naming a pumpkin in a picture in which it is presented in orange than in a grayscale representation, and slowest if it is in another color (Therriault, Yaxley, & Zwaan, 2009). As a final example, Kaschak, Madden, Therriault, Yaxley, Aveyard, Blanchard, and Zwaan (2005) found that subjects are slower at processing a sentence describing an action if the sentence is presented concurrently with a visual stimulus depicting motion in the opposite direction of that described (e.g., "The car approached you" is harder to process concurrently with the perception of motion away from you). See Barsalou (2008) for a review of more evidence that conceptual and linguistic competence is strongly embodied. One might argue that the concerns about DSMs not being grounded or embodied are exaggerated, because they overlook the fact that the patterns of linguistic co-occurrence exploited by DSMs reflect semantic knowledge we acquired through perception, so that linguistic and perceptual information are strongly correlated (Louwerse, 2011). Because dogs are more often brown than pink, we are more likely to talk about brown dogs than pink dogs.

1. Harnad, in the original paper, is discussing formal symbols, such as those postulated in Fodor's language of thought (Fodor, 1975), rather than the words of a natural language. However, when the latter are represented in terms of connections to other words, as is the case in DSMs, the same grounding problem arises, and we follow the recent literature on the issue in referring to it as "symbol grounding", where our symbols are natural language words.
Consequently, a child can learn useful facts about the meaning of the concept denoted by dog both by direct perception and through linguistic input (this explains, among other things, why congenitally blind subjects can have an excellent knowledge of color terms; see, e.g., Connolly, Gleitman, & Thompson-Schill, 2007). One could then hypothesize that the meaning representations extracted from text corpora are indistinguishable from those derived from perception, making grounding redundant. However, there is by now a fairly extensive literature showing that this is not the case. Many studies (Andrews, Vigliocco, & Vinson, 2009; Baroni, Barbu, Murphy, & Poesio, 2010; Baroni & Lenci, 2008; Riordan & Jones, 2011) have underlined how text-derived DSMs capture encyclopedic, functional and discourse-related properties of word meanings, but tend to miss their concrete aspects. Intuitively, we might harvest from text the information that bananas are tropical and edible, but not that they are yellow (because few authors will write down obvious statements such as "bananas are yellow"). On the other hand, the same studies show how, when humans are asked to describe concepts, the features they produce (equivalent in a sense to the contextual features exploited by DSMs) are preponderantly of a perceptual nature: bananas are yellow, tigers have stripes, and so on.2 This discrepancy between DSMs and humans is not, per se, proof that DSMs will face empirical difficulties as computational semantic models. However, if we are interested in the potential implications of DSMs as models of how humans acquire and use language, as is the case for many DSM developers (e.g., Griffiths, Steyvers, & Tenenbaum, 2007; Landauer & Dumais, 1997; Lund & Burgess, 1996, and many others), then their complete lack of grounding in perception is a serious blow to their psychological plausibility, and exposes them to all the criticism that classic ungrounded symbolic models have received.
Even at the empirical level, it is reasonable to expect that DSMs enriched with perceptual information would outperform their purely textual counterparts: useful computational semantic models must capture human semantic knowledge, and human semantic knowledge is strongly informed by perception. If we accept that grounding DSMs in perception is a desirable avenue of research, we must ask where we can find a practical source of perceptual information to embed into DSMs. Several interesting recent experiments use features produced by human subjects in concept description tasks (so-called "semantic norms") as a surrogate of true perceptual features (Andrews et al., 2009; Johns & Jones, 2012; Silberer & Lapata, 2012; Steyvers, 2010). While this is a reasonable first step, and the integration methods proposed in these studies are quite sophisticated, using subject-produced features is unsatisfactory both practically and theoretically (see, however, the work reported by Kievit-Kylar & Jones, 2011, for a crowdsourcing project that is addressing both kinds of concerns). Practically, using subject-generated properties limits experiments to those words that denote concepts described in semantic norms, and even large norms contain features for just a few hundred concepts. Theoretically, the features produced by subjects in concept description tasks are far removed from the sort of implicit perceptual features they are supposed to stand for. For example, since they are expressed in words, they are limited to what can be conveyed verbally.

2. To be perfectly fair, this tendency might in part be triggered by the fact that, when subjects are asked to describe concepts, they might be encouraged to focus on their perceptual aspects by the experimenters' instructions. For example, McRae, Cree, Seidenberg, and McNorgan (2005) asked subjects to list first physical properties, such as internal and external parts, and "how [the object] looks".
Moreover, subjects tend to produce only salient and distinctive properties. They do not state that dogs have a head, since that's hardly a distinctive feature for an animal! In this article, we explore a more direct route to integrating perceptual information into DSMs. We exploit recent advances in computer vision (Grauman & Leibe, 2011) and the availability of documents that combine text and images to automatically extract visual features that naturally co-occur with words in multimodal corpora. These image-based features are then combined with standard text-based features to obtain perceptually-enhanced distributional vectors. In doing this, we rely on a natural extension of the distributional hypothesis that encompasses not only similarity of linguistic context, but also similarity of visual context. Interestingly, Landauer and Dumais, in one of the classic papers that laid the groundwork for distributional semantics, already touched on the grounding issue and proposed, speculatively, a solution along the lines of the one we are implementing here: "[I]f one judiciously added numerous pictures of scenes with and without rabbits to the context columns in the [...] corpus matrix, and filled in a handful of appropriate cells in the rabbit and hare word rows, [a DSM] could easily learn that the words rabbit and hare go with pictures containing rabbits and not to ones without, and so forth." (Landauer & Dumais, 1997, p. 227).3 Although vision is just one source of perceptual data, it is a reasonable starting point, both for convenience (availability of suitable data to train the models) and because it is probably the dominant modality in determining word meaning. As just one piece of evidence for this claim, the widely used subject-generated semantic norms of McRae et al. (2005) contain 3,594 distinct perceptual features in total, and, of these, 3,099 (86%) are visual in nature!
Do the relatively low-level and noisy features that we extract from images in multimodal corpora contribute meaningful information to the distributional representation of word meaning? We report the results of a systematic comparison of the network of semantic relations entertained by a set of concrete nouns in the traditional text-based and novel image-based distributional spaces, confirming that image-based features are, indeed, semantically meaningful. Moreover, as expected, they provide somewhat complementary information with respect to text-based features. Having thus found a practical and effective way to extract perceptual information, we must consider next how to combine text- and image-derived features to build a multimodal distributional semantic model. We propose a general parametrized architecture for multimodal fusion that, given appropriate sample data, automatically determines the optimal mixture of text- and image-based features to be used for the target semantic task. Finally, we evaluate our multimodal DSMs in two separate semantic tasks, namely predicting the degree of semantic relatedness assigned to word pairs by humans, and categorizing nominal concepts into classes. We show that in both tasks multimodal DSMs consistently outperform purely textual models, confirming our supposition that, just as for humans, the performance of computational models of meaning improves once meaning is grounded in perception. The article is structured as follows. Section 2 provides the relevant background from computational linguistics and image analysis, and discusses related work. We lay out a general architecture for multimodal fusion in distributional semantics in Section 3. The necessary implementation details are provided in Section 4. Section 5 presents the experiments in which we tested our approach.

3. We thank Mike Jones for pointing out this interesting historical connection to us.
Section 6 concludes by summarizing our current results and sketching what should come next.

2. Background and Related Work

In this section we first give a brief introduction to traditional distributional semantic models (i.e., those based solely on textual information). Then, we describe the image analysis techniques we adopt to extract and manipulate visual information. Next, we discuss earlier attempts to construct a multimodal distributional representation of meaning. Finally, we describe the most relevant strategies to combine information coming from text and images proposed within the computer vision community.

2.1 Distributional Semantics

In the last few decades, a number of different distributional semantic models (DSMs) of word meaning have been proposed in computational linguistics, all relying on the assumption that word meaning can be learned directly from the linguistic environment. Semantic space models are one of the most common types of DSM. They approximate the meaning of words with vectors that record their distributional history in a corpus (Turney & Pantel, 2010). A distributional semantic model is encoded in a matrix whose m rows are semantic vectors representing the meanings of a set of m target words. Each component of a semantic vector is a function of the occurrence counts of the corresponding target word in a certain context (see Lowe, 2001, for a formal treatment). Definitions of context range from simple ones (such as documents, or the occurrence of another word inside a fixed window from the target word) to more linguistically sophisticated ones (such as the occurrence of certain words connected to the target by special syntactic relations) (Padó & Lapata, 2007; Sahlgren, 2006; Turney & Pantel, 2010).
After the raw target-context counts are collected, they are transformed into association scores that typically discount the weights of components whose corresponding word-context pairs have a high probability of chance co-occurrence (Evert, 2005). The rank of the matrix containing the semantic vectors as rows can optionally be decreased by dimensionality reduction, which might provide beneficial smoothing by getting rid of noise components and/or allow more efficient storage and computation (Landauer & Dumais, 1997; Sahlgren, 2005; Schütze, 1997). Finally, the distributional semantic similarity of a pair of target words is estimated by a similarity function that takes their semantic vectors as input and returns a scalar similarity score as output. There are many different semantic space models in the literature. Probably the best known is Latent Semantic Analysis (LSA, Landauer & Dumais, 1997), where a high-dimensional semantic space for words is derived by the use of co-occurrence information between words and the passages where they occur. Another well-known example is the Hyperspace Analogue to Language model (HAL, Lund & Burgess, 1996), where each word is represented by a vector containing weighted co-occurrence values of that word with the other words in a fixed window. Other semantic space models rely on syntactic relations instead of windows (Grefenstette, 1994; Curran & Moens, 2002; Padó & Lapata, 2007). General overviews of semantic space models are provided by Clark (2013), Erk (2012), Manning and Schütze (1999), Sahlgren (2006) and Turney and Pantel (2010). More recently, probabilistic topic models have been receiving increasing attention as an alternative implementation of DSMs (Blei, Ng, & Jordan, 2003; Griffiths et al., 2007).
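Returning to the semantic space pipeline: the association-score transformation can be sketched in a few lines. The example below uses positive pointwise mutual information (PPMI), one standard family of association scores that discounts co-occurrences expected by chance; the toy matrix is invented for illustration, and real models operate on large sparse matrices.

```python
import math

def ppmi(counts):
    """Turn a raw word-by-context count matrix into positive PMI scores:
    log of observed over expected co-occurrence, floored at zero."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    scores = []
    for i, row in enumerate(counts):
        scores.append([
            max(0.0, math.log((c * total) / (row_sums[i] * col_sums[j]))) if c > 0 else 0.0
            for j, c in enumerate(row)
        ])
    return scores

# Rows: target words; columns: contexts
raw = [
    [10, 0],
    [0, 10],
]
print(ppmi(raw))
```

Components whose observed count matches the chance expectation get a score of zero, so only informative word-context pairs survive with positive weight.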
Probabilistic topic models also rely on co-occurrence information from large corpora to derive meaning but, differently from semantic space models, they are based on the assumption that words in a corpus exhibit some probabilistic structure connected to topics. Words are not represented as points in a high-dimensional space but as probability distributions over a set of topics. Conversely, each topic can be defined as a probability distribution over different words. Probabilistic topic models tackle the problem of meaning representation by means of statistical inference: the observed words in the corpus are used to infer the hidden topic structure. Distributional semantic models, whether of the geometric or the probabilistic kind, are ultimately used mainly to provide a similarity score for arbitrary pairs of words, and that is how we will also employ them. Indeed, such models have been shown to be very effective in modeling a wide range of semantic tasks, including judgments of semantic relatedness and word categorization. There are several data sets to assess how well a DSM captures human intuitions about semantic relatedness, such as the Rubenstein and Goodenough set (Rubenstein & Goodenough, 1965) and WordSim353 (Finkelstein, Gabrilovich, Matias, Rivlin, Solan, Wolfman, & Ruppin, 2002). Usually they are constructed by asking subjects to rate a set of word pairs according to a similarity scale. Then, the average rating for each pair is taken as an estimate of the perceived relatedness between the words (e.g., dollar-buck: 9.22, cord-smile: 0.31). To measure how well a distributional model approximates human semantic intuitions, a correlation measure between the similarity scores generated by the model and the human ratings is usually computed.
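One common choice of correlation measure for this comparison is Spearman rank correlation. A minimal sketch (with no handling of tied values, which a real evaluation would need; the model scores and human ratings below are invented):

```python
import math

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed data.
    Assumes no tied values, for simplicity."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, idx in enumerate(order):
            r[idx] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical model similarity scores vs. averaged human ratings
model = [0.81, 0.10, 0.55, 0.30]
human = [9.22, 0.31, 6.10, 3.50]
print(spearman(model, human))  # 1.0: the two rankings agree perfectly
```

Because only ranks matter, the model's scores need not be on the same scale as the human ratings; what is evaluated is the ordering of the word pairs.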
The highest correlation we are aware of on the WordSim353 set we will also employ below is 0.80, and it was obtained by a model called Temporal Semantic Analysis, which captures patterns of word usage over time and where concepts are represented as time series over a corpus of temporally-ordered documents (Radinsky, Agichtein, Gabrilovich, & Markovitch, 2011). This temporal knowledge could be integrated with the perceptual knowledge we encode in our model. As a more direct comparison point, Agirre, Alfonseca, Hall, Kravalova, Paşca, and Soroa (2009) presented an extensive evaluation of distributional and WordNet-based semantic models on WordSim, both achieving a maximum correlation of 0.66 across various parameters.4 Humans are very good at grouping words (or the concepts they denote) into classes based on their semantic relatedness (Murphy, 2002); therefore, a cognitively plausible representation of meaning must also show its proficiency in categorization (e.g., Poesio & Almuhareb, 2005; Baroni et al., 2010). Concept categorization is moreover useful for applications such as automated ontology construction and recognizing textual entailment. Unlike similarity ratings, categorization requires a discrete decision to group coordinate terms (co-hyponyms) into the same class, and it is performed by applying standard clustering techniques to the model-generated vectors representing the words to be categorized. As an example, the Almuhareb-Poesio data set (Almuhareb & Poesio, 2005), which we also employ below, includes 402 concepts from WordNet, balanced in terms of frequency and degree of ambiguity.

4. WordNet, available at http://wordnet.princeton.edu/, is a large computational lexicon of English where nouns, verbs, adjectives and adverbs are grouped into sets of synonyms (synsets), each expressing a distinct concept.
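Categorization performance over such clusterings is commonly scored with cluster purity: each induced cluster is credited with its majority gold class. A minimal sketch, with invented cluster assignments and gold classes:

```python
from collections import Counter

def purity(clusters, gold):
    """Fraction of items that fall in the majority gold class of their cluster."""
    correct = sum(max(Counter(gold[w] for w in cluster).values())
                  for cluster in clusters)
    total = sum(len(cluster) for cluster in clusters)
    return correct / total

gold = {"dog": "animal", "cat": "animal", "car": "vehicle", "bus": "vehicle"}
clusters = [["dog", "cat", "car"], ["bus"]]
print(purity(clusters, gold))  # 0.75: "car" landed in the wrong cluster
```

A purity of 1.0 means every cluster is homogeneous with respect to the gold classes; note that purity is trivially maximized by singleton clusters, so evaluations fix the number of clusters to the number of gold classes.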
The distributional model of Rothenhäusler and Schütze (2009) exploits syntactic information to reach state-of-the-art performance on the Almuhareb-Poesio data set (maximum clustering purity across various parameters: 0.79). The window-based distributional approach of Baroni and Lenci (2010), more directly comparable to our text-based models, achieves 0.65 purity. Other semantic tasks DSMs have been applied to include semantic priming, generation of salient properties of concepts and intuitions about the thematic fit of verb arguments (see, e.g., Baroni & Lenci, 2010; Baroni et al., 2010; McDonald & Brew, 2004; Padó & Lapata, 2007; Padó, Padó, & Erk, 2007). Distributional semantic vectors can be used in a wide range of applications that require a representation of word meaning, and in particular an objective measure of meaning relatedness, including document classification, clustering and retrieval, question answering, automatic thesaurus generation, word sense disambiguation, query expansion, textual advertising and some areas of machine translation (Dumais, 2003; Turney & Pantel, 2010).

2.2 Visual Words

Ideally, to build a multimodal DSM, we would like to extract visual information from images in a way that is similar to how we do it for text. Thanks to a well-known image analysis technique, namely bag-of-visual-words (BoVW), it is indeed possible to discretize the image content and produce visual units somewhat comparable to words in text, known as visual words (Bosch, Zisserman, & Munoz, 2007; Csurka, Dance, Fan, Willamowski, & Bray, 2004; Nister & Stewenius, 2006; Sivic & Zisserman, 2003; Yang, Jiang, Hauptmann, & Ngo, 2007).
Therefore, semantic vectors can be extracted from a corpus of images associated with the target (textual) words using a pipeline similar to the one commonly used to construct text-based vectors: collect co-occurrence counts of target words and discrete image-based contexts (visual words), and approximate the semantic relatedness of two words by a similarity function over the visual words representing them. The BoVW technique to extract visual word representations of documents was inspired by the traditional bag-of-words (BoW) method in Information Retrieval. BoW, in turn, is a dictionary-based method to represent a (textual) document as a bag (i.e., order is not considered) which contains words from the dictionary. BoVW extends this idea to visual documents (namely images), describing them as a collection of discrete regions, capturing their appearance and ignoring their spatial structure (the visual equivalent of ignoring word order in text).

Figure 1: Representing images by BoVW: (i) Salient image patches or keypoints that contain rich local information are detected and represented as vectors of low-level features called descriptors; (ii) Descriptors are mapped to visual words on the basis of their distance from the centers of clusters corresponding to the visual words (the preliminary clustering step is not shown in the figure); (iii) Images are finally represented as a bag-of-visual-words feature vector according to the distribution of visual words they contain. Images depicting the same things with rotations, occlusions, and small differences in the low-level descriptors might still have a similar distribution of visual words, hence the same object can be traced very robustly across images while these conditions change.

A bag-of-visual-words representation of an image is convenient from an image-analysis point of view because it translates a usually large set of high-dimensional local descriptors into a single sparse vector representation across images. Importantly, the size of the original set varies from image to image, while the bag-of-visual-words representation is of fixed dimensionality. Therefore, machine learning algorithms which by default expect fixed-dimensionality vectors as input (e.g., for supervised classification or unsupervised clustering) can be used to tackle typical image analysis tasks such as object recognition, image segmentation, video tracking, motion detection, etc. More specifically, similarly to terms in a text document, an image has local interest points or keypoints, defined as salient image patches that contain rich local information about the image. However, keypoint types in images do not come off-the-shelf like word types in text documents. Local interest points have to be grouped into types (i.e., visual words) within and across images, so that an image can be represented by the number of occurrences of each type in it, analogously to BoW. The following pipeline is typically followed. From every image of a data set, keypoints are automatically detected (note that in most recent approaches a dense, pixelwise sampling of the keypoints is preferred to detecting the most salient ones only, and this is the solution that we also adopt, as explained in Section 4.2.2) and represented as vectors of low-level features called descriptors. Keypoint vectors are then grouped across images into a number of clusters based on their similarity in descriptor space. Each cluster is treated as a discrete visual word. With its keypoints mapped onto visual words, each image can then be represented as a BoVW feature vector recording how many times each visual word occurs in it.
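Assuming the preliminary clustering step has already produced a set of visual-word centers, the mapping from descriptors to a fixed-size count vector can be sketched as follows. The 2-D "descriptors" here are invented toys; real descriptors (e.g., SIFT) are high-dimensional, typically 128-dimensional.

```python
import math

def nearest(descriptor, centers):
    """Index of the visual word (cluster center) closest to a keypoint descriptor."""
    return min(range(len(centers)), key=lambda i: math.dist(descriptor, centers[i]))

def bovw_vector(descriptors, centers):
    """Fixed-dimensionality bag-of-visual-words count vector for one image,
    regardless of how many keypoints the image contains."""
    hist = [0] * len(centers)
    for d in descriptors:
        hist[nearest(d, centers)] += 1
    return hist

centers = [(0.0, 0.0), (10.0, 10.0)]            # a visual vocabulary of size 2
image = [(0.5, 0.3), (9.8, 10.1), (0.1, 0.2)]   # three detected keypoints
print(bovw_vector(image, centers))  # [2, 1]
```

Two images with different numbers of keypoints thus end up with count vectors of the same length, which is exactly the property that makes standard machine learning machinery applicable.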
In this way, we move from representing the image by a varying number of high-dimensional keypoint descriptor vectors to a representation in terms of a single visual word count vector of fixed dimensionality across all images, with the advantages we discussed above. Visual word assignment and its use to represent the image content is exemplified in Figure 1, where two images with similar content are described in terms of bag-of-visual-words vectors. What kind of image content a visual word captures exactly depends on a number of factors, including the descriptors used to identify and represent keypoints, the clustering algorithm and the number of target visual words selected. In general, local interest points assigned to the same visual word tend to be patches with similar low-level appearance; but these local patterns need not be correlated with object-level parts present in the images (Grauman & Leibe, 2011).

2.3 Multimodal Distributional Semantics

The availability of large amounts of mixed media on the Web, on the one hand, and the discrete representation of images as visual words, on the other, have not escaped the attention of computational linguists interested in enriching distributional representations of word meaning with visual features. Feng and Lapata (2010) propose the first multimodal distributional semantic model. Their generative probabilistic setting requires the extraction of textual and visual features from the same mixed-media corpus, because latent dimensions are here estimated through a probabilistic process which assumes that a document is generated by sampling both textual and visual words. Words are then represented by their distribution over a set of latent multimodal dimensions or topics (Griffiths et al., 2007) derived from the surface textual and visual features. Feng and Lapata experiment with a collection of documents downloaded from the BBC News website as their corpus.
They test their semantic representations on the free association norms of Nelson, McEvoy, and Schreiber (1998) and on a subset of 253 pairs from WordSim, obtaining gains in performance when visual information is taken into account (correlations with human judgments of 0.12 and 0.32, respectively), compared to the textual modality alone (0.08 and 0.25, respectively), even if performance is still well below the state of the art for WordSim (see Section 2.1 above). The main drawbacks of this approach are that the textual and visual data must be extracted from the same corpus, thus limiting the choice of the corpora to be used, and that the generative probabilistic approach, while elegant, does not allow much flexibility in how the two information channels are combined. Below, we re-implement the Feng and Lapata method (MixLDA), training it on the ESP-Game data set, the same source of labeled images we adopt for our model. This is possible because the data set contains both images and the textual labels describing them. More generally, we recapture Feng and Lapata's idea of a common latent semantic space in the latent multimodal mixing step of our pipeline (see Section 3.2.1 below). Leong and Mihalcea (2011) also exploit textual and visual information to obtain a multimodal distributional semantic model. While Feng and Lapata merge the two sources of information by learning a joint semantic model, Leong and Mihalcea propose a strategy akin to what we will call Scoring Level fusion below: come up with separate text- and image-based similarity estimates, and combine them to obtain the multimodal score. In particular, they use two combination methods: summing the scores and computing their harmonic mean.
Differently from Feng and Lapata, Leong and Mihalcea extract visual information not from a corpus but from a manually coded resource, namely the ImageNet database (Deng, Dong, Socher, Li, & Fei-Fei, 2009), a large-scale ontology of images.5 Using a hand-coded, annotated visual resource such as ImageNet faces the same sort of problems that using a manually developed lexical database such as WordNet faces with respect to textual information: applications will be severely limited by ImageNet coverage (for example, ImageNet is currently restricted to nominal concepts), and the interest of the model as a computational simulation of word meaning acquisition from naturally occurring language and visual data is somewhat reduced (humans do not learn the meaning of mountain from a set of carefully annotated images of mountains with little else crowding or occluding the scene). In their evaluation, Leong and Mihalcea experiment with small subsets of WordSim, obtaining some improvements, although not at the same level we report (the highest reported correlation is 0.59, on just 56 word pairs). Furthermore, they use the same data set to tune and test their models. In Bruni, Tran, and Baroni (2011) we propose instead to directly concatenate the text- and image-based vectors to produce a single multimodal vector to represent words, as in what we call Feature Level fusion below. The text-based distributional vector representing a word, taken there from a state-of-the-art distributional semantic model (Baroni & Lenci, 2010), is concatenated with a vector representing the same word with visual features, extracted from all the images in the ESP-Game collection we also use here. We obtain promising performance on WordSim and other test sets, although appreciably lower than the results we report here (we obtain a maximum correlation of 0.52 when text- and image-based features are used together; compare to Table 2 below).
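The concatenation idea behind Feature Level fusion can be sketched in a few lines. Normalizing each modality before concatenating is an assumption of this sketch (not necessarily the exact scheme used in any of the cited systems); it keeps one channel from dominating simply because its counts are on a larger scale.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def feature_fusion(text_vec, image_vec):
    """Feature Level (early) fusion: normalize each modality, then concatenate
    into a single multimodal vector."""
    return l2_normalize(text_vec) + l2_normalize(image_vec)

# Toy text-based and image-based vectors for the same word
fused = feature_fusion([3.0, 4.0], [1.0, 0.0, 1.0])
print(len(fused))  # 5: the dimensionalities of the two modalities add up
```

Any similarity function defined on the original spaces, such as cosine, can then be applied directly to the fused vectors.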
Attempts to use multimodal models derived from text and images to perform more specific semantic tasks have also been reported. Bergsma and Goebel (2011) use textual and image-based cues to model selectional preferences of verbs (which nouns are likely arguments of verbs). Their experiment shows that in several cases visual information is more useful than text in this task. For example, looking in textual corpora for words such as carillon, migas or mamey yields little useful information to guess which of the three is a plausible argument for the verb to eat. On the other hand, they also show that, by exploiting Google image search functionality (http://images.google.com/), enough images for these words are found that a vision-based model of edible things can classify them correctly. Finally, we evaluate our multimodal models in the task of discovering the color of concrete objects, showing that the relation between words denoting concrete things and their typical color is better captured when visual information is also taken into account (Bruni, Boleda, Baroni, & Tran, 2012). Moreover, we show that multimodality helps in distinguishing literal and nonliteral uses of color terms.

2.4 Multimodal Fusion

When textual information is used for image analysis, this is mostly done with different aims than ours: Text is used to improve image-related tasks, and typically there is an attempt to model the relation between specific images and specific words or textual passages (e.g., Barnard, Duygulu, Forsyth, de Freitas, Blei, & Jordan, 2003; Berg, Berg, & Shih, 2010; Farhadi, Hejrati, Sadeghi, Young, Rashtchian, Hockenmaier, & Forsyth, 2010; Griffin, Wahab, & Newell, 2013; Kulkarni, Premraj, Dhar, Li, Choi, Berg, & Berg, 2011).
In contrast, (i) we want to use image-derived features to improve the representation of word meaning, and (ii) we are interested in capturing the meaning of word types on the basis of sets of images connected to a word, not in modeling specific word-image relations. Despite these differences, some of the challenges addressed in the image analysis literature that deals with exploiting textual cues are similar to the ones we face. In particular, the problem of merging, or 'fusing', textual and visual cues into a common representational space is exactly the one we have to face when we construct a multimodal semantic space. Traditionally, the image analysis community distinguishes between two classes of fusion schemes, namely early fusion and late fusion. The former fuses modalities in feature space, the latter in semantic similarity space, analogously to what we will call Feature Level and Scoring Level fusion, respectively. For example, Escalante, Hernández, Sucar, and Montes (2008) propose an image retrieval system for multimodal documents. Both early and late fusion strategies for the combination of the image and the textual channels are considered. Early fusion settings include a weighted linear combination of the two channels and a global strategy where different retrieval systems are used concurrently on the entire, joint data set. Late fusion strategies include a per-modality strategy, where documents are retrieved by using only one or the other channel, and a hierarchical setting where first text, images and their combination are used independently to query the database, and then the results are aggregated with four weighted combinations.
Vreeswijk, Huurnink, and Smeulders (2011) train a visual concept classifier for abstract subject categories such as biology and history by using a late fusion approach where image and text information are combined at the output level, that is, first obtaining classification scores from the image- and text-based models separately and then joining them. Similarly to our multimodal mixing step, Pham, Maillot, Lim, and Chevallet (2007) and Caicedo, Ben-Abdallah, González, and Nasraoui (2012) propose an early fusion in which the two inputs are mapped onto the same latent space using dimensionality reduction techniques (e.g., Singular Value Decomposition). The multimodal representation obtained in this way is then directly used to retrieve image documents.

3. A Framework for Multimodal Distributional Semantics

In this section, we present a general and flexible architecture for multimodal semantics. The architecture makes use of distributional semantic models based on textual and visual information to build a multimodal representation of meaning. To merge the two sources, it uses a parameter-based pipeline that captures previously proposed combination strategies, with the advantage that all of them can be explored within a single system.

3.1 Input of the Multimodal Architecture

To construct a multimodal representation of meaning, a semantic model for each single modality has to be implemented. Independently of the actual parameters chosen for its creation (which, from our point of view, can be in a black box), there are some requirements that each model has to satisfy in order to guarantee the proper functioning of the framework. In the first place, each modality must provide a separate representation, to leave room for the various fusion strategies afterwards. Then, each modality must encode the semantic information pertaining to each word of interest into a fixed-size vectorial representation.
Moreover, we assume that both text- and image-based vectors are normalized and arranged in matrices where words are rows and co-occurring elements are columns. In what follows, we assume that we harvested a matrix of text-based semantic vectors, and one of image-based semantic vectors for the same set of target words, representing, respectively, verbal and visual information about the words. In Section 4 below we give the details of how we construct these matrices in our specific implementation.

3.2 Multimodal Fusion

The pipeline is based on two main steps: (1) Latent Multimodal Mixing: The text and vision matrices are concatenated, obtaining a single matrix whose row vectors are projected onto a single, common space to make them interact. (2) Multimodal Similarity Estimation: Information in the text- and image-based matrices is combined in two ways to obtain similarity estimates for pairs of target words: at the Feature Level and at the Scoring Level. Figure 2 describes the infrastructure we propose for fusion. First, we introduce a mixing phase to promote the interaction between modalities that we call Latent Multimodal Mixing. While this step is part of what other approaches would consider Feature Level fusion (see below), we keep it separate as it might benefit Scoring Level fusion as well. Once the mixing is performed, we proceed to integrate the textual and visual features. As reviewed in Section 2.4 above, in the literature fusion is performed at two main levels, the Feature Level and the Scoring Level. In the first case, features are first combined and then treated as a single input for subsequent operations; in the second case, a task is performed separately with different sets of features and the separate results are then combined.
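The contrast between the two fusion levels can be made concrete with a small sketch; this is our own illustration under the assumptions above (normalized vectors, cosine similarity), not code from the paper, and the weights alpha and beta anticipate the parametrization introduced in Section 3.2.2:

```python
import numpy as np

# Feature Level: concatenate weighted text and image vectors, then
# measure similarity once on the combined representation.
# Scoring Level: measure similarity per modality, then mix the scores.

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def feature_level_sim(t1, v1, t2, v2, alpha=0.5):
    # Weighted concatenation produces a single multimodal vector per word.
    f1 = np.concatenate([alpha * t1, (1 - alpha) * v1])
    f2 = np.concatenate([alpha * t2, (1 - alpha) * v2])
    return cosine(f1, f2)

def scoring_level_sim(t1, v1, t2, v2, beta=0.5):
    # Each modality yields its own score; the scores are mixed linearly.
    return beta * cosine(t1, t2) + (1 - beta) * cosine(v1, v2)
```

With beta = 1 (or 0), Scoring Level fusion reduces to a purely text-based (or image-based) model, which is how the special cases of the general framework arise.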
Each approach has its own advantages and limitations, and this is why both of them are incorporated into the multimodal infrastructure and together constitute what we call Multimodal Similarity Estimation.

[Figure 2: Multimodal fusion for combining textual and visual information in a semantic model.]

A Feature Level approach requires only one learning step (i.e., determining the parameters of the feature vector combination) and offers a richer vector-based representation of the combined information, which can also be used for other purposes (e.g., image and text features could be used together to train a classifier). Benefits of a Scoring Level approach include the possibility of having different representations (in principle, not even vectorial) and different similarity scores for different modalities, and the ease of increasing (or decreasing) the number of modalities used in the representation.

3.2.1 Latent Multimodal Mixing

This is a preparatory step in which the textual and the visual components are projected onto a common representation of lower dimensionality to discover correlated latent factors. The result is that new connections are made in each source matrix taking into account information and connections present in the other matrix, originating from overlapping patterns of covariance. Importantly, we assume that mixing is done via a dimensionality reduction technique with the following characteristics: a parameter k that determines the dimensionality of the reduced space, and the property that when k equals the rank of the original matrix, the reduced matrix is identical to, or can be considered a good approximation of, the original one. The commonly used Singular Value Decomposition reduction method that we adopt here for the mixing step satisfies these constraints.
As a toy example of why mixing might be beneficial, consider the concepts pizza and coin, which we could use as features in our text-based semantic vectors (i.e., record the co-occurrences of target words with these concepts as part of the vector dimensions). While these words are not likely to occur in similar contexts in text, they are obviously visually similar. So, the original text features pizza and coin might not be highly correlated. However, after mixing in multimodal space, they might both be associated with (have high weights on) the same reduced space component, if they both have similar distributions to visual features that cue roundness. Consequently, two textual features that were originally uncorrelated might be drawn closer to each other by multimodal mixing, if the corresponding concepts are visually similar, resulting in mixed textual features that are, in a sense, visually enriched, and vice versa for mixed visual features (interestingly, psychologists have shown that, under certain conditions, words such as pizza and coin, which are not strongly associated but perceptually similar, can prime each other; e.g., Pecher, Zeelenberg, & Raaijmakers, 1998). Note that the matrices obtained by splitting the reduced-rank matrix back into the original textual and visual blocks have the same number of feature columns as the original textual and visual blocks, but the values in them have been smoothed by dimensionality reduction (we explain the details of how this is achieved in our specific implementation in the next paragraph). These matrices are then used to calculate a similarity score for a word pair by (re-)merging information at the feature and scoring levels.
Mixing with SVD In our implementation, we perform mixing across text- and image-based features by applying the Singular Value Decomposition (SVD, computed with SVDLIBC: http://tedlab.mit.edu/~dr/SVDLIBC/) to the matrix obtained by concatenating the two feature types row-wise (so that each row of the concatenated matrix describes a target word in textual and visual space). SVD is a widely used technique to find the best approximation of the original data points in a space of lower underlying dimensionality whose basis vectors ('principal components' or 'latent dimensions') are selected to capture as much of the variance in the original space as possible (Manning, Raghavan, & Schütze, 2008, Ch. 18). By performing SVD on the concatenated textual and visual matrices, we project the two types of information into the same space, where they are described as linear combinations of principal components. Following the description by Pham et al. (2007), the SVD of a matrix M of rank r is a factorization of the form

M = U Σ V^T

where U is the matrix of eigenvectors derived from M M^T; Σ is the r × r diagonal matrix of singular values σ, the square roots of the eigenvalues of M M^T; and V^T is the matrix of eigenvectors derived from M^T M. In our context, the matrix M is given by normalizing the two feature matrices separately and then concatenating them. By selecting the k largest values from matrix Σ and keeping the corresponding columns in matrices U and V, the reduced matrix M_k is given by

M_k = U_k Σ_k V_k^T

where k < r is the dimensionality of the latent space. While M_k keeps the same number of columns/dimensions as M, its rank is now k. k is a free parameter that we tune on the development sets. Note that when k equals the rank of the original matrix, then trivially M_k = M. Thus we can consider not performing any SVD reduction as a special case of SVD, which helps when searching for the optimal parameters.
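The mixing step just described (normalize each modality, concatenate, reduce to rank k, split back into blocks) can be sketched with numpy; this is an illustrative reconstruction under our reading of the text, not the SVDLIBC-based implementation, and the names are our own:

```python
import numpy as np

def mix(text_m, image_m, k):
    """Latent Multimodal Mixing sketch: returns the mixed textual and
    visual blocks after rank-k SVD smoothing of the joint matrix."""
    # L2-normalize each modality separately, row by row, so every row
    # of the concatenation describes one target word in both spaces.
    tn = text_m / np.linalg.norm(text_m, axis=1, keepdims=True)
    vn = image_m / np.linalg.norm(image_m, axis=1, keepdims=True)
    m = np.hstack([tn, vn])
    # Truncated SVD: keep only the k largest singular values.
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    mk = u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]
    # Block splitting: the first j columns are the mixed textual
    # features, the remaining ones the mixed visual features.
    j = text_m.shape[1]
    return mk[:, :j], mk[:, j:]
```

When k equals the rank of the joint matrix, the reconstruction is exact and mixing is a no-op, matching the special case noted above.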
Note also that, if M has n columns, then V_k^T is a k × n matrix, so that M_k has the same number of columns as M. If the first j columns of M contain textual features, and columns j + 1 to n contain visual features, the same will hold for M_k, although in the latter the values of the features will have been affected by global SVD smoothing. Thus, in the current implementation of the pipeline in Figure 2, block splitting is attained simply by dividing M_k into a textual mixed matrix containing its first j columns, and a visual mixed matrix containing the remaining columns.

3.2.2 Multimodal Similarity Estimation

Similarity Function Following the distributional hypothesis, DSMs describe a word in terms of the contexts in which it occurs. Therefore, to measure the similarity of two words, DSMs need a function capable of determining the similarity of two such descriptions (i.e., of two semantic vectors). In the literature, there are many different similarity functions used to compare two semantic vectors, including the cosine similarity, Euclidean distance, L1 norm, Jaccard's coefficient, Jensen-Shannon divergence, and Lin's similarity. For an extensive evaluation of different similarity measures, see the work by Weeds (2003). Here we focus on cosine similarity, since it has been shown to be a very effective measure on many semantic benchmarks (Bullinaria & Levy, 2007; Padó & Lapata, 2007). Also, given that our system is based on geometric principles, the cosine, together with Euclidean distance, is the most principled choice to measure similarity. For example, some of the measures listed above, having been developed from probabilistic considerations, are only applicable to vectors that encode well-formed probability distributions, which is typically not the case (for example, after multimodal mixing, our vectors might contain negative values).
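To make the last point concrete, the cosine remains well defined on vectors containing negative entries, whereas divergence-based measures presuppose non-negative values summing to one; the vectors below are made-up stand-ins for SVD-smoothed rows:

```python
import numpy as np

# Cosine similarity on vectors with negative components, as may arise
# after Latent Multimodal Mixing. The example vectors are invented.

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.9, -0.2, 0.4])  # smoothed vector with a negative entry
b = np.array([0.8, -0.1, 0.5])
print(cosine(a, b))             # close to 1: the vectors nearly align
```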
The cosine of two semantic vectors a and b is their dot product divided by the product of their lengths:

cos(a, b) = (Σ_{i=1}^{n} a_i b_i) / (sqrt(Σ_{i=1}^{n} a_i^2) · sqrt(Σ_{i=1}^{n} b_i^2))

The absolute value of the cosine ranges from 0 (orthogonal vectors) to 1 (parallel vectors pointing in the same or opposite directions have cosines of 1 and -1, respectively).

Feature Level Fusion In Feature Level fusion (FL), we use the linear weighted fusion method to combine text- and image-based feature vectors of words into a single representation, and we then use the latter to estimate the similarity of pairs. The linear weighted combination function is defined as

F = (α · F_t) ⊕ ((1 − α) · F_v)

where ⊕ is the vector-concatenation operator.

Scoring Level Fusion In Scoring Level fusion (SL), the text- and image-based matrices are used to estimate the similarity of pairs independently. The scores are then combined to obtain the final estimate by using a linear weighted scoring function:

S = β · S_t + (1 − β) · S_v

General Form and Special Cases Given fixed and normalized text- and image-based matrices, our multimodal approach is parametrized by k (dimensionality of the latent space), FL vs. SL, α (weight of the text component in FL similarity estimation) and β (weight of the text component in SL). Note that when k=r, with r the rank of the original combined matrix, Latent Multimodal Mixing returns the original combined matrix (no actual mixing). Picking SL with β=1 or β=0 corresponds to using the textual or visual matrix only, respectively. We thus derive as special cases the models in which only text (k=r, SL, β=1) or only images (k=r, SL, β=0) are used (called Text and Image models in the Results section below). The simple approach of Bruni et al. (2011), in which the two matrices are concatenated without mixing, is the parametrization k=r, FL, α=0.5 (called Naive FL model, below). The summing approach of Leong and Mihalcea (2011) corresponds to k=r, SL, β=0.5 (Naive SL, below). Picking k