# ORDER-EMBEDDINGS OF IMAGES AND LANGUAGE

Published as a conference paper at ICLR 2016

Ivan Vendrov, Ryan Kiros, Sanja Fidler, Raquel Urtasun
Department of Computer Science, University of Toronto
{vendrov,rkiros,fidler,urtasun}@cs.toronto.edu

ABSTRACT

Hypernymy, textual entailment, and image captioning can be seen as special cases of a single visual-semantic hierarchy over words, sentences, and images. In this paper we advocate for explicitly modeling the partial order structure of this hierarchy. Towards this goal, we introduce a general method for learning ordered representations, and show how it can be applied to a variety of tasks involving images and language. We show that the resulting representations improve performance over current approaches for hypernym prediction and image-caption retrieval.

1 INTRODUCTION

Computer vision and natural language processing are becoming increasingly intertwined. Recent work in vision has moved beyond discriminating between a fixed set of object classes, to automatically generating open-ended lingual descriptions of images (Vinyals et al., 2015). Recent methods for natural language processing such as Young et al. (2014) learn the semantics of language by grounding it in the visual world. Looking to the future, autonomous artificial agents will need to jointly model vision and language in order to parse the visual world and communicate with people.

But what, precisely, is the relationship between images and the words or captions we use to describe them? It is akin to the hypernym relation between words, and textual entailment among phrases: captions are simply abstractions of images. In fact, all three relations can be seen as special cases of a partial order over images and language, illustrated in Figure 1, which we refer to as the visual-semantic hierarchy. As a partial order, this relation is transitive: "woman walking her dog", "woman walking", "person walking", "person", and "entity" are all valid abstractions of the rightmost image. Our goal in this work is to learn representations that respect this partial order structure.

Figure 1: A slice of the visual-semantic hierarchy.

Most recent approaches to modeling the hypernym, entailment, and image-caption relations involve learning distributed representations or embeddings. This is a very powerful and general approach which maps the objects of interest (words, phrases, images) to points in a high-dimensional vector space. One line of work, exemplified by Chopra et al. (2005) and first applied to the caption-image relationship by Socher et al. (2014), requires the mapping to be distance-preserving: semantically similar objects are mapped to points that are nearby in the embedding space. A symmetric distance measure such as Euclidean or cosine distance is typically used. Since the visual-semantic hierarchy is an antisymmetric relation, we expect this approach to introduce systematic model error. Other approaches do not have such explicit constraints, learning a more-or-less general binary relation between the objects of interest, e.g., Bordes et al. (2011); Socher et al. (2013); Ma et al. (2015). Notably, no existing approach directly imposes the transitivity and antisymmetry of the partial order, leaving the model to induce these properties from data.
In contrast, we propose to exploit the partial order structure of the visual-semantic hierarchy by learning a mapping which is not distance-preserving but order-preserving between the visual-semantic hierarchy and a partial order over the embedding space. We call embeddings learned in this way order-embeddings. This idea can be integrated into existing relational learning methods simply by replacing their comparison operation with ours. By modifying existing methods in this way, we find that order-embeddings provide a marked improvement over the state of the art for hypernymy prediction and caption-image retrieval, and near state-of-the-art performance for natural language inference.

This paper is structured as follows. We begin, in Section 2, by giving a unified mathematical treatment of our tasks, and describing the general approach of learning order-embeddings. In the next three sections we describe in detail the tasks we tackle, how we apply the order-embeddings idea to each of them, and the results we obtain. The tasks are hypernym prediction (Section 3), caption-image retrieval (Section 4), and textual entailment (Section 5). In the supplementary material, we visualize novel vector regularities that emerge in our learned embeddings of images and language.

2 LEARNING ORDER-EMBEDDINGS

To unify our treatment of various tasks, we introduce the problem of partial order completion. In partial order completion, we are given a set of positive examples $P = \{(u, v)\}$ of ordered pairs drawn from a partially ordered set $(X, \preceq_X)$, and a set of negative examples $N$ which we know to be unordered. Our goal is to predict whether an unseen pair $(u', v')$ is ordered. Note that hypernym prediction, caption-image retrieval, and textual entailment are all special cases of this task, since they all involve classifying pairs of concepts in the (partially ordered) visual-semantic hierarchy.

We tackle this problem by learning a mapping from $X$ into a partially ordered embedding space $(Y, \preceq_Y)$. The idea is to predict the ordering of an unseen pair in $X$ based on its ordering in the embedding space. This is possible only if the mapping satisfies the following crucial property:

Definition 1. A function $f : (X, \preceq_X) \to (Y, \preceq_Y)$ is an order-embedding if for all $u, v \in X$,
$$u \preceq_X v \ \text{if and only if} \ f(u) \preceq_Y f(v).$$

This definition implies that each combination of embedding space $Y$, order $\preceq_Y$, and order-embedding $f$ determines a unique completion of our data as a partial order $\preceq_X$. In the following, we first consider the choice of $Y$ and $\preceq_Y$, and then discuss how to find an appropriate $f$.

2.1 THE REVERSED PRODUCT ORDER ON $\mathbb{R}^N_+$

The choice of $Y$ and $\preceq_Y$ is somewhat application-dependent. For the purpose of modeling the semantic hierarchy, our choices are narrowed by the following considerations.

Much of the expressive power of human language comes from abstraction and composition. For any two concepts, say "dog" and "cat", we can name a concept that is an abstraction of the two, such as "mammal", as well as a concept that composes the two, such as "dog chasing cat". So, in order to represent the visual-semantic hierarchy, we need to choose an order $\preceq_Y$ that is rich enough to embed these two relations.

We also restrict ourselves to orders $\preceq_Y$ with a top element, which is above every other element in the order. In the visual-semantic hierarchy, this element represents the most general possible concept; practically, it provides an anchor for the embedding.
Finally, we choose the embedding space $Y$ to be continuous in order to allow optimization with gradient-based methods.

A natural choice that satisfies all three properties is the reversed product order on $\mathbb{R}^N_+$, defined by the conjunction of total orders on each coordinate:

$$x \preceq y \quad \text{if and only if} \quad \bigwedge_{i=1}^{N} x_i \geq y_i \qquad (1)$$

for all vectors $x, y$ with nonnegative coordinates. Note the reversal of direction: smaller coordinates imply higher position in the partial order. The origin is then the top element of the order, representing the most general concept.

Instead of viewing our embeddings as single points $x \in \mathbb{R}^N_+$, we can also view them as sets $\{y : y \preceq x\}$. The meaning of a word is then the union of all concepts of which it is a hypernym, and the meaning of a sentence is the union of all sentences that entail it. The visual-semantic hierarchy can then be seen as a special case of the subset relation, a connection also used by Young et al. (2014).

2.2 PENALIZING ORDER VIOLATIONS

Having fixed the embedding space and order, we now consider the problem of finding an order-embedding into this space. In practice, the order-embedding condition (Definition 1) is too restrictive to impose as a hard constraint. Instead, we aim to find an approximate order-embedding: a mapping which violates the order-embedding condition, imposed as a soft constraint, as little as possible.

More precisely, we define a penalty that measures the degree to which a pair of points violates the product order. In particular, we define the penalty for an ordered pair $(x, y)$ of points in $\mathbb{R}^N_+$ as

$$E(x, y) = \| \max(0, y - x) \|^2 \qquad (2)$$

Crucially, $E(x, y) = 0 \iff x \preceq y$ according to the reversed product order; if the order is not satisfied, $E(x, y)$ is positive. This effectively imposes a strong prior on the space of relations, encouraging our learned relation to satisfy the partial order properties of transitivity and antisymmetry. This penalty is key to our method. Throughout the remainder of the paper, we will use it where previous work has used symmetric distances or learned comparison operators.

Recall that $P$ and $N$ are our positive and negative examples, respectively. Then, to learn an approximate order-embedding $f$, we could use a max-margin loss which encourages positive examples to have zero penalty, and negative examples to have penalty greater than a margin:

$$\sum_{(u,v) \in P} E(f(u), f(v)) \ + \sum_{(u',v') \in N} \max\{0,\ \alpha - E(f(u'), f(v'))\} \qquad (3)$$

In practice we are often not given negative examples, in which case this loss admits the trivial solution of mapping all objects to the same point. The best way of dealing with this problem depends on the application, so we will describe task-specific variations of the loss in the next several sections.
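To make Eqs. (1)-(3) concrete, here is a minimal NumPy sketch of the order check, the order-violation penalty, and the generic max-margin loss. The function names are ours and the snippet is purely illustrative; it is not the released implementation.

```python
import numpy as np

def is_below(x, y):
    # Reversed product order of Eq. (1): x is below y (i.e., x is the more
    # specific concept) iff every coordinate of x is at least as large as
    # the corresponding coordinate of y.
    return bool(np.all(x >= y))

def order_violation(x, y):
    # Order-violation penalty of Eq. (2): zero exactly when x is below y,
    # strictly positive otherwise.
    return float(np.sum(np.maximum(0.0, y - x) ** 2))

def max_margin_loss(positive_pairs, negative_pairs, alpha=1.0):
    # Max-margin objective of Eq. (3): positive pairs should incur zero
    # penalty, negative pairs a penalty of at least the margin alpha.
    loss = sum(order_violation(x, y) for x, y in positive_pairs)
    loss += sum(max(0.0, alpha - order_violation(x, y)) for x, y in negative_pairs)
    return loss
```

For example, with x = [2, 3] and y = [1, 1], is_below(x, y) is True and order_violation(x, y) is zero, since x lies below the more general point y in the reversed product order.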
3 HYPERNYM PREDICTION

To test the ability of our model to learn partial orders from incomplete data, our first task is to predict withheld hypernym pairs in WordNet (Miller, 1995). A hypernym pair is a pair of concepts where the first concept is a specialization or an instance of the second, e.g., (woman, person) or (New York, city).

Our setup differs significantly from previous work in that we use only the WordNet hierarchy as training data. The most similar evaluation has been that of Baroni et al. (2012), who use external linguistic data in the form of distributional semantic vectors. Bordes et al. (2011) and Socher et al. (2013) also evaluate on the WordNet hierarchy, but they use other relations in WordNet as training data (and external linguistic data, in Socher's case). Additionally, the latter two consider only direct hypernyms, rather than the full, transitive hypernymy relation. But predicting the transitive hypernym relation is a better-defined problem, because individual hypernym edges in WordNet vary dramatically in the degree of abstraction they require. For instance, (person, organism) is a direct hypernym pair, but it takes eight hypernym edges to get from cat to organism.

3.1 LOSS FUNCTION

To apply order-embeddings to hypernymy, we follow the setup of Socher et al. (2013) in learning an N-dimensional vector for each concept in WordNet, but we replace their neural tensor network with our order-violation penalty defined in Eq. (2). Just like them, we corrupt each hypernym pair by replacing one of the two concepts with a randomly chosen concept, and use these corrupted pairs as negative examples for both training and evaluation. We use their max-margin loss, which encourages the order-violation penalty to be zero for positive examples, and greater than a margin α for negative examples:

$$\sum_{(u,v) \in \mathrm{WordNet}} \Big( E(f(u), f(v)) + \max\{0,\ \alpha - E(f(u'), f(v'))\} \Big) \qquad (4)$$

where $E$ is our order-violation penalty, and $(u', v')$ is a corrupted version of $(u, v)$. Since we learn an independent embedding for each concept, the mapping $f$ is simply a lookup table.

3.2 DATASET

The transitive closure of the WordNet hierarchy gives us 838,073 edges between 82,192 concepts in WordNet. Like Bordes et al. (2011), we randomly select 4000 edges for the test split, and another 4000 for the development set. Note that the majority of test set edges can be inferred simply by applying transitivity, giving us a strong baseline.

3.3 DETAILS OF TRAINING

We learn a 50-dimensional nonnegative vector for each concept in WordNet using the max-margin objective (4) with margin α = 1, sampling 500 true and 500 false hypernym pairs in each batch. We train for 30-50 epochs using the Adam optimizer (Kingma & Ba, 2015) with learning rate 0.01 and early stopping on the validation set. During evaluation, we find the optimal classification threshold on the validation set, then apply it to the test set.
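The setup of Sections 3.1-3.3 can be summarized in a short PyTorch-style sketch: lookup-table embeddings, randomly corrupted negatives, and the loss of Eq. (4). All names are ours, the original code used different tooling, and details such as how nonnegativity is enforced (here, by taking absolute values) or which side of a pair is corrupted are our simplifying assumptions.

```python
import torch

# Hypothetical sketch of the WordNet hypernym training loop.
NUM_CONCEPTS, DIM, MARGIN = 82192, 50, 1.0
concept_emb = torch.nn.Embedding(NUM_CONCEPTS, DIM)
optimizer = torch.optim.Adam(concept_emb.parameters(), lr=0.01)

def penalty(specific, general):
    # Row-wise order-violation penalty E of Eq. (2); abs() keeps the
    # vectors in the nonnegative orthant (one simple choice, not
    # necessarily the paper's).
    return torch.clamp(general.abs() - specific.abs(), min=0).pow(2).sum(dim=1)

def training_step(hypo_ids, hyper_ids):
    # hypo_ids, hyper_ids: LongTensors of concept indices for true hypernym
    # pairs, with hypo_ids the more specific concepts. For brevity we
    # corrupt only the hypernym side to obtain negatives.
    corrupt_ids = torch.randint(NUM_CONCEPTS, hyper_ids.shape)
    pos = penalty(concept_emb(hypo_ids), concept_emb(hyper_ids))
    neg = penalty(concept_emb(hypo_ids), concept_emb(corrupt_ids))
    loss = pos.sum() + torch.clamp(MARGIN - neg, min=0).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```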
3.4 RESULTS

Figure 2: 2-dimensional order-embedding of a small subset of the WordNet hypernym relation. All the true hypernym pairs (green arrows) are correctly embedded, but two spurious pairs (pink arrows) are introduced. Only direct hypernyms are shown.

Since our setup is novel, there are no published numbers to compare to. We therefore compare three variants of our model to two baselines, with results shown in Table 1.

The transitive closure baseline involves no learning; it simply classifies hypernym pairs as positive if they are in the transitive closure of the union of edges in the training and validation sets. The word2gauss baseline evaluates the approach of Vilnis & McCallum (2015), which represents words as Gaussian densities rather than points in the embedding space. This allows a natural representation of hierarchies using the KL divergence. We used 50-dimensional diagonal Gaussian embeddings, trained for 200 epochs on a max-margin objective with margin 7, chosen by grid search.¹ order-embeddings (symmetric) is our full model, but using symmetric cosine distance instead of our asymmetric penalty. order-embeddings (bilinear) replaces our penalty with the bilinear model used by Socher et al. (2013). order-embeddings is our full model.

¹ We used the code of http://github.com/seomoz/word2gauss

Only our full model can do better than the transitive baseline, showing the value of exploiting partial order structure, in contrast to using symmetric similarity or learning a general binary relation as most previous work and our bilinear baseline do. The resulting 50-dimensional embeddings are difficult to visualize. To give some intuition for the structure being learned, Figure 2 shows the results of a toy 2D experiment.

Table 1: Binary classification accuracy on 4000 withheld edges from WordNet.

| Method | Accuracy (%) |
|---|---|
| transitive closure | 88.2 |
| word2gauss | 86.6 |
| order-embeddings (symmetric) | 84.2 |
| order-embeddings (bilinear) | 86.3 |
| order-embeddings | 90.6 |

4 CAPTION-IMAGE RETRIEVAL

The caption-image retrieval task has become a standard evaluation of joint models of vision and language (Hodosh et al., 2013; Lin et al., 2014a). The task involves ranking a large dataset of images by relevance for a query caption (Image Retrieval), and ranking captions by relevance for a query image (Caption Retrieval). Given a set of aligned image-caption pairs as training data, the goal is then to learn a caption-image compatibility score S(c, i) to be used at test time.

Many modern approaches model the caption-image relationship symmetrically, either by embedding into a common visual-semantic space with inner-product similarity (Socher et al., 2014; Kiros et al., 2014), or by using Canonical Correlation Analysis between distributed representations of images and captions (Klein et al., 2015). While Karpathy & Li (2015) and Plummer et al. (2015) model a finer-grained alignment between regions in the image and segments of the caption, the similarity they use is still symmetric. An alternative is to learn an unconstrained binary relation, either with a neural language model conditioned on the image (Vinyals et al., 2015; Mao et al., 2015) or using a multimodal CNN (Ma et al., 2015).

In contrast to these lines of work, we propose to treat the caption-image pairs as a two-level partial order with captions above the images they describe, and let

$$S(c, i) = -E(f_i(i), f_c(c)),$$

with $E$ our order-violation penalty defined in Eq. (2), and $f_c$, $f_i$ embedding functions from captions and images into $\mathbb{R}^N_+$.

4.1 LOSS FUNCTION

To facilitate comparison, we use the same pairwise ranking loss that Socher et al. (2014), Kiros et al. (2014) and Karpathy & Li (2015) have used on this task, simply replacing their symmetric similarity measure with our asymmetric order-violation penalty. This loss function encourages S(c, i) for ground truth caption-image pairs to be greater than that for all other pairs, by a margin:

$$\sum_{(c,i)} \left( \sum_{c'} \max\{0,\ \alpha - S(c, i) + S(c', i)\} + \sum_{i'} \max\{0,\ \alpha - S(c, i) + S(c, i')\} \right) \qquad (5)$$

where $(c, i)$ is a ground truth caption-image pair, $c'$ goes over captions that do not describe $i$, and $i'$ goes over images not described by $c$.
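A hedged PyTorch-style sketch of Eq. (5), with the contrastive captions and images drawn from the same minibatch (the strategy we use in training, Section 4.4). The function names and tensor layout are ours; caption and image embeddings are assumed to be precomputed nonnegative vectors for aligned pairs.

```python
import torch

def order_violation(specific, general):
    # Row-wise order-violation penalty E of Eq. (2); `specific` holds image
    # embeddings and `general` caption embeddings, both nonnegative.
    return torch.clamp(general - specific, min=0).pow(2).sum(dim=-1)

def ranking_loss(cap, img, alpha=0.05):
    # cap, img: (batch, dim) embeddings of aligned caption-image pairs, so
    # S(c, i) = -E(f_i(i), f_c(c)) and scores[c_idx, i_idx] pairs caption
    # c_idx with image i_idx from the batch.
    scores = -order_violation(img.unsqueeze(0), cap.unsqueeze(1))  # (caps, imgs)
    positive = scores.diag().unsqueeze(1)                          # S(c, i)
    cost_img = torch.clamp(alpha - positive + scores, min=0)       # wrong images i'
    cost_cap = torch.clamp(alpha - positive.t() + scores, min=0)   # wrong captions c'
    mask = torch.eye(scores.size(0), dtype=torch.bool)             # drop the true pairs
    return cost_img.masked_fill(mask, 0).sum() + cost_cap.masked_fill(mask, 0).sum()
```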
4.2 IMAGE AND CAPTION EMBEDDINGS

To learn $f_c$ and $f_i$, we use the approach of Kiros et al. (2014) except, since we are embedding into $\mathbb{R}^N_+$, we constrain the embedding vectors to have nonnegative entries by taking their absolute value. Thus, to embed images, we use

$$f_i(i) = |W_i \cdot \mathrm{CNN}(i)| \qquad (6)$$

where $W_i$ is a learned $N \times 4096$ matrix, N being the dimensionality of the embedding space. CNN(i) is the same image feature used by Klein et al. (2015): we rescale images to have smallest side 256 pixels, take 224 × 224 crops from the corners, center, and their horizontal reflections, run the 10 crops through the 19-layer VGG network of Simonyan & Zisserman (2015) (weights pre-trained on ImageNet and fixed during training), and average their fc7 features.

To embed the captions, we use a recurrent neural net encoder with GRU activations (Cho et al., 2014), so $f_c(c) = |\mathrm{GRU}(c)|$, the absolute value of the hidden state after processing the last word.

Table 2: Results of caption-image retrieval evaluation on COCO. R@K is Recall@K, in %. Med r is median rank. Metrics for our models on 1k test images are averages over five 1000-image splits of the 5000-image test set, as in Klein et al. (2015).

| Model | Caption R@1 | Caption R@10 | Caption Med r | Caption Mean r | Image R@1 | Image R@10 | Image Med r | Image Mean r |
|---|---|---|---|---|---|---|---|---|
| *1k test images* | | | | | | | | |
| MNLM (Kiros et al., 2014) | 43.4 | 85.8 | 2 | * | 31.0 | 79.9 | 3 | * |
| m-RNN (Mao et al., 2015) | 41.0 | 83.5 | 2 | * | 29.0 | 77.0 | 3 | * |
| DVSA (Karpathy & Li, 2015) | 38.4 | 80.5 | 1 | * | 27.4 | 74.8 | 3 | * |
| STV (Kiros et al., 2015) | 33.8 | 82.1 | 3 | * | 25.9 | 74.6 | 4 | * |
| FV (Klein et al., 2015) | 39.4 | 80.9 | 2 | 10.4 | 25.1 | 76.6 | 4 | 11.1 |
| m-CNN (Ma et al., 2015) | 38.3 | 81.0 | 2 | * | 27.4 | 79.5 | 3 | * |
| m-CNN_ENS | 42.8 | 84.1 | 2 | * | 32.6 | 82.8 | 3 | * |
| order-embeddings (reversed) | 11.2 | 44.0 | 14.2 | 86.6 | 12.3 | 53.5 | 9.0 | 30.1 |
| order-embeddings (1-crop) | 41.4 | 84.2 | 2.0 | 8.7 | 33.5 | 82.2 | 2.6 | 10.0 |
| order-embeddings (symm.) | 45.4 | 88.7 | 2.0 | 5.8 | 36.3 | 85.8 | 2.0 | 9.0 |
| order-embeddings | 46.7 | 88.9 | 2.0 | 5.7 | 37.9 | 85.9 | 2.0 | 8.1 |
| *5k test images* | | | | | | | | |
| DVSA | 11.8 | 45.4 | 12.2 | * | 8.9 | 36.3 | 19.5 | * |
| FV | 17.3 | 50.2 | 10.0 | 46.4 | 10.8 | 40.1 | 17.0 | 49.3 |
| order-embeddings (symm.) | 21.5 | 62.9 | 6.0 | 24.4 | 16.8 | 56.3 | 8.0 | 40.4 |
| order-embeddings | 23.3 | 65.0 | 5.0 | 24.4 | 18.0 | 57.6 | 7.0 | 35.9 |

4.3 DATASET

We evaluate on the Microsoft COCO dataset (Lin et al., 2014b), which has over 120,000 images, each with at least five human-annotated captions. This is by far the largest dataset commonly used for caption-image retrieval. We use the data splits of Karpathy & Li (2015) for training (113,287 images), validation (5000 images), and test (5000 images).

4.4 DETAILS OF TRAINING

To train the model, we use the standard pairwise ranking objective from Eq. (5). We sample minibatches of 128 random image-caption pairs, and draw all contrastive terms from the minibatch, giving us 127 contrastive images for each caption and 127 contrastive captions for each image. We train for 15-30 epochs using the Adam optimizer with learning rate 0.001, and early stopping on the validation set. We set the dimension of the embedding space and the GRU hidden state N to 1024, the dimension of the learned word embeddings to 300, and the margin α to 0.05. All these hyperparameters, as well as the learning rate and batch size, were selected using the validation set. For consistency with Kiros et al. (2014) and to mitigate overfitting, we constrain the caption and image embeddings to have unit L2 norm. This constraint implies that no two points can be exactly ordered with zero order-violation penalty, but since we use a ranking loss, only the relative size of the penalties matters.
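The encoders and constraints of Sections 4.2 and 4.4 can be summarized in a short PyTorch-style sketch. The module names and layout are ours, images are assumed to arrive as precomputed 4096-dimensional VGG fc7 features, and variable-length caption handling (padding/packing) is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionImageEmbedder(nn.Module):
    # Illustrative sketch only; not the authors' implementation.
    def __init__(self, vocab_size, word_dim=300, embed_dim=1024):
        super().__init__()
        self.img_proj = nn.Linear(4096, embed_dim, bias=False)  # W_i in Eq. (6)
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.gru = nn.GRU(word_dim, embed_dim, batch_first=True)

    def embed_image(self, fc7):
        # f_i(i) = |W_i CNN(i)|, followed by the unit L2 norm constraint
        # of Section 4.4.
        return F.normalize(self.img_proj(fc7).abs(), dim=-1)

    def embed_caption(self, word_ids):
        # f_c(c) = |GRU(c)|: absolute value of the final hidden state,
        # again projected onto the unit sphere.
        _, hidden = self.gru(self.word_emb(word_ids))
        return F.normalize(hidden.squeeze(0).abs(), dim=-1)
```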
4.5 RESULTS

Given a query caption or image, we sort all the images or captions of the test set in order of increasing penalty. We use standard ranking metrics for evaluation. We measure Recall@K, the percent of queries for which the ground truth term is one of the first K retrieved, and the median and mean rank, which are statistics over the position of the ground truth term in the retrieval order.

Table 2 shows a comparison between all state-of-the-art and some older methods² along with our own; see Ma et al. (2015) for a more complete listing. Note that the comparison is additionally complicated by the following: m-CNN_ENS is an ensemble of four different models, whereas the other entries are all single models; STV and FV use external text corpora to learn their language features, whereas the other methods learn them from scratch.

² Note that the numbers for MNLM come not from the published paper but from the recently released code at http://github.com/ryankiros/visual-semantic-embedding.

To facilitate the comparison and to evaluate the contributions of various components of our model, we evaluate four variations of order-embeddings:

order-embeddings is our full model as described above.

order-embeddings (reversed) reverses the order of caption and image embeddings in our order-violation penalty, placing images above captions in the partial order learned by our model. This seemingly slight variation performs atrociously, confirming our prior that captions are much more abstract than images, and should be placed higher in the semantic hierarchy.

order-embeddings (1-crop) computes the image feature using just the center crop, instead of averaging over 10 crops.

order-embeddings (symm.) replaces our asymmetric penalty with the symmetric cosine distance, and allows embedding coordinates to be negative, essentially replicating MNLM, but with better image features. Here we find that a different margin (α = 0.2) works best.

Among the previous methods, the only one whose results are incommensurable with ours is DVSA, since it uses the less discriminative CNN of Krizhevsky et al. (2012) but 20 region features instead of a single whole-image feature. Aside from this limitation, and if only single models are considered, order-embeddings significantly outperform the state-of-the-art approaches for image retrieval, even when we control for image features.

4.6 EXPLORATION

Why would order-embeddings do well on such a shallow partial order? Why are they much more helpful for image retrieval than for caption retrieval?

Intuitively, symmetric similarity should fail when an image has captions with very different levels of detail, because the captions are so dissimilar that it is impossible to map both their embeddings close to the same image embedding. Order-embeddings don't have this problem: the less detailed caption can be embedded very far away from the image while remaining above it in the partial order.

To evaluate this intuition, we use caption length as a proxy for level of detail and select, among pairs of co-referring captions in our validation set, the 100 pairs with the biggest length difference. For image retrieval with 1000 target images, the mean rank over captions in this set is 6.4 for order-embeddings and 9.7 for cosine similarity, a much bigger difference than over the entire dataset. Some particularly dramatic examples of this are shown in Figure 3. Moreover, if we use the shorter caption as a query, and retrieve captions in order of increasing error, the mean rank of the longer caption is 34.0 for order-embeddings and 47.6 for cosine similarity, showing that order-embeddings are able to capture the relatedness of co-referring captions with very different lengths.

This also explains why order-embeddings provide a much smaller improvement for caption retrieval than for image retrieval: all the caption retrieval metrics are based on the position of the first ground truth caption in the retrieval order, so the embeddings need only learn to retrieve one of each image's five captions well, a task symmetric similarity is well suited for.

Figure 3: Images with captions of very different lengths, and the rank of the ground truth image when using each caption as a query.

| Captions | Image rank (cosine) | Image rank (order-emb) |
|---|---|---|
| "a sitting area with furniture and flowers makes a backdrop for a boy with headphones sitting in the foreground at one of the chairs at a dining table that holds glasses and a handbag working at a laptop" / "a kid is wearing headphone while on a laptop" | 286 | 24 |
| "view of top of a white building with tan speckled area an uncovered awning with a pigeon in flight below and a red umbrella behind balcony wall" / "a pigeon flying near white beams of a building" | 91 | 6 |
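For reference, the retrieval metrics reported in Table 2 (Recall@K, median and mean rank) can be computed from a matrix of compatibility scores as in the following NumPy sketch. The convention that the ground truth target of query q sits at index q is an assumption of this sketch, not a description of our evaluation code.

```python
import numpy as np

def retrieval_metrics(scores, ks=(1, 10)):
    # scores[q, t]: compatibility S between query q and target t; the
    # ground truth target of query q is assumed to be target q.
    order = np.argsort(-scores, axis=1)              # best-scoring target first
    gt_rank = np.argmax(order == np.arange(len(scores))[:, None], axis=1) + 1
    recalls = {k: float(np.mean(gt_rank <= k) * 100) for k in ks}
    return recalls, float(np.median(gt_rank)), float(np.mean(gt_rank))
```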
5 TEXTUAL ENTAILMENT / NATURAL LANGUAGE INFERENCE

Natural language inference can be seen as a generalization of hypernymy from words to sentences. For example, from "woman walking her dog in a park" we can infer both "woman walking her dog" and "dog in a park", but not "old woman" or "black dog". Given a pair of sentences, our task is to predict whether we can infer the second sentence (the hypothesis) from the first (the premise).

5.1 LOSS FUNCTION

To apply order-embeddings to this task, we again view it as partial order completion: we can infer a hypothesis from a premise exactly when the hypothesis is above the premise in the visual-semantic hierarchy. Unlike our other tasks, for which we had to generate contrastive negatives, datasets for natural language inference include labeled negative examples. So, we can simply use a max-margin loss:

$$\sum_{(p,h)} E(f(p), f(h)) \ + \sum_{(p',h')} \max\{0,\ \alpha - E(f(p'), f(h'))\} \qquad (7)$$

where $(p, h)$ are positive and $(p', h')$ negative pairs of premise and hypothesis. To embed sentences, we use the same GRU encoder as in the caption-image retrieval task.

5.2 DATASET

To evaluate order-embeddings on the natural language inference task, we use the recently proposed SNLI corpus (Bowman et al., 2015), which contains 570,000 pairs of sentences, each labeled with "entailment" if the inference is valid, "contradiction" if the two sentences contradict, or "neutral" if the inference is invalid but there is no contradiction. Our method only allows us to discriminate between entailment and non-entailment, so we merge the contradiction and neutral classes together to serve as our negative examples.

5.3 IMPLEMENTATION DETAILS

Just as for caption-image ranking, we set the dimensions of the embedding space and GRU hidden state to be 1024, the dimension of the word embeddings to be 300, and constrain the embeddings to have unit L2 norm. We train for 10 epochs with batches of 128 sentence pairs. We use the Adam optimizer with learning rate 0.001 and early stopping on the validation set. During evaluation, we find the optimal classification threshold on the validation set, then use the threshold to classify the test set.
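At test time the decision rule is simply a threshold on the order-violation penalty. A minimal NumPy sketch, with the threshold value standing in for the one selected on the validation set:

```python
import numpy as np

def predict_entailment(premise_vecs, hypothesis_vecs, threshold):
    # premise_vecs, hypothesis_vecs: (batch, dim) nonnegative sentence
    # embeddings. A pair (p, h) is classified as entailment when
    # E(f(p), f(h)) falls below the validation-selected threshold.
    violations = np.sum(np.maximum(0.0, hypothesis_vecs - premise_vecs) ** 2, axis=1)
    return violations < threshold
```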
5.4 RESULTS

The state-of-the-art method for 3-class classification on SNLI is that of Rocktäschel et al. (2015). Unfortunately, they do not compute 2-class accuracy, so we cannot compare to them directly. As a bridge to facilitate comparison, we use a challenging baseline which can be evaluated on both the 2-class and 3-class problems.

The baseline, referred to as skip-thoughts, involves a feedforward neural network on top of skip-thought vectors (Kiros et al., 2015), a state-of-the-art semantic representation of sentences. Given pairs of sentence vectors u and v, the input to the network is the concatenation of u, v, and the absolute difference |u - v|. We tuned the number of layers, layer dimensionality, and dropout rates to optimize performance on the development set, using the Adam optimizer. Batch normalization (Ioffe & Szegedy, 2015) and PReLU units (He et al., 2015) were used. Our best network used 2 hidden layers of 1000 units each, with a dropout rate of 0.5 across both the input and hidden layers. We did not backpropagate through the skip-thought encoder.

We also evaluate against the EOP classifier, a 2-class baseline introduced by Bowman et al. (2015), and against a version of our model where our order-violation penalty is replaced with the symmetric cosine distance, order-embeddings (symmetric).

Table 3: Test accuracy (%) on SNLI.

| Method | 2-class | 3-class |
|---|---|---|
| Neural Attention (Rocktäschel et al., 2015) | * | 83.5 |
| EOP classifier (Bowman et al., 2015) | 75.0 | * |
| skip-thoughts | 87.7 | 81.5 |
| order-embeddings (symmetric) | 79.3 | * |
| order-embeddings | 88.6 | * |

The results for all models are shown in Table 3. We see that order-embeddings outperform the skip-thought baseline despite not using external text corpora. While our method is almost certainly worse than the state-of-the-art method of Rocktäschel et al. (2015), which uses a word-by-word attention mechanism, it is also much simpler.
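For completeness, a PyTorch-style sketch of the skip-thoughts baseline network described above. The exact ordering of dropout, batch normalization, and PReLU within each layer is our assumption, and the dimensionality of the skip-thought vectors is left as a parameter.

```python
import torch
import torch.nn as nn

class SkipThoughtBaseline(nn.Module):
    # Feedforward classifier on [u; v; |u - v|] with two hidden layers of
    # 1000 units, dropout 0.5, batch normalization, and PReLU activations.
    def __init__(self, sent_dim, hidden=1000, num_classes=2):
        super().__init__()
        def block(d_in, d_out):
            return nn.Sequential(nn.Dropout(0.5), nn.Linear(d_in, d_out),
                                 nn.BatchNorm1d(d_out), nn.PReLU())
        self.net = nn.Sequential(block(3 * sent_dim, hidden),
                                 block(hidden, hidden),
                                 nn.Linear(hidden, num_classes))

    def forward(self, u, v):
        # u, v: (batch, sent_dim) precomputed skip-thought vectors; the
        # encoder itself is frozen, as in the paper.
        return self.net(torch.cat([u, v, (u - v).abs()], dim=1))
```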
6 CONCLUSION AND FUTURE WORK

We introduced a simple method to encode order into learned distributed representations, which allows us to explicitly model the partial order structure of the visual-semantic hierarchy. Our method can be easily integrated into existing relational learning methods, as we demonstrated on three challenging tasks involving computer vision and natural language processing. On two of these tasks, hypernym prediction and caption-image retrieval, our methods outperform all previous work.

A promising direction of future work is to learn better classifiers on ImageNet (Deng et al., 2009), which has over 21k image classes arranged by the WordNet hierarchy. Previous approaches, including Frome et al. (2013) and Norouzi et al. (2014), have embedded words and images into a shared semantic space with symmetric similarity, which our experiments suggest to be a poor fit with the partial order structure of WordNet. We expect significant progress on ImageNet classification, and the related problems of one-shot and zero-shot learning, to be possible using order-embeddings.

Going further, order-embeddings may enable learning the entire semantic hierarchy in a single model which jointly reasons about hypernymy, entailment, and the relationship between perception and language, unifying what have been until now almost independent lines of work.

ACKNOWLEDGMENTS

We thank Kaustav Kundu for many fruitful discussions throughout the development of this paper. The work was supported in part by an NSERC Graduate Scholarship.

REFERENCES

Baroni, Marco, Bernardi, Raffaella, Do, Ngoc-Quynh, and Shan, Chung-chieh. Entailment above the word level in distributional semantics. In EACL, 2012.

Bordes, Antoine, Weston, Jason, Collobert, Ronan, and Bengio, Yoshua. Learning structured embeddings of knowledge bases. In AAAI, 2011.

Bowman, Samuel R., Angeli, Gabor, Potts, Christopher, and Manning, Christopher D. A large annotated corpus for learning natural language inference. In EMNLP, 2015.

Cho, Kyunghyun, van Merriënboer, Bart, Gulcehre, Caglar, Bahdanau, Dzmitry, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.

Chopra, Sumit, Hadsell, Raia, and LeCun, Yann. Learning a similarity metric discriminatively, with application to face verification. In CVPR, 2005.

Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Frome, Andrea, Corrado, Greg S., Shlens, Jon, Bengio, Samy, Dean, Jeff, Mikolov, Tomas, et al. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.

Hodosh, Micah, Young, Peter, and Hockenmaier, Julia. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 2013.

Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

Karpathy, Andrej and Li, Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. In ICLR, 2015.

Kiros, Ryan, Salakhutdinov, Ruslan, and Zemel, Richard S. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.

Kiros, Ryan, Zhu, Yukun, Salakhutdinov, Ruslan, Zemel, Richard S., Torralba, Antonio, Urtasun, Raquel, and Fidler, Sanja. Skip-thought vectors. In NIPS, 2015.

Klein, Benjamin, Lev, Guy, Sadeh, Gil, and Wolf, Lior. Associating neural word embeddings with deep image representations using Fisher vectors. In CVPR, 2015.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

Lin, Dahua, Fidler, Sanja, Kong, Chen, and Urtasun, Raquel. Visual semantic search: Retrieving videos via complex textual queries. In CVPR, 2014a.

Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James, Perona, Pietro, Ramanan, Deva, Dollár, Piotr, and Zitnick, C. Lawrence. Microsoft COCO: Common objects in context. In ECCV, 2014b.

Ma, Lin, Lu, Zhengdong, Shang, Lifeng, and Li, Hang. Multimodal convolutional neural networks for matching image and sentence. In ICCV, 2015.

Mao, Junhua, Xu, Wei, Yang, Yi, Wang, Jiang, and Yuille, Alan. Deep captioning with multimodal recurrent neural networks (m-RNN). In ICLR, 2015.

Mikolov, Tomas, Yih, Wen-tau, and Zweig, Geoffrey. Linguistic regularities in continuous space word representations. In HLT-NAACL, pp. 746-751, 2013.

Miller, George A. WordNet: a lexical database for English. Communications of the ACM, 1995.
Norouzi, Mohammad, Mikolov, Tomas, Bengio, Samy, Singer, Yoram, Shlens, Jonathon, Frome, Andrea, Corrado, Greg S., and Dean, Jeffrey. Zero-shot learning by convex combination of semantic embeddings. In ICLR, 2014.

Plummer, Bryan, Wang, Liwei, Cervantes, Chris, Caicedo, Juan, Hockenmaier, Julia, and Lazebnik, Svetlana. Flickr30k Entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. arXiv preprint arXiv:1505.04870, 2015.

Rocktäschel, Tim, Grefenstette, Edward, Hermann, Karl Moritz, Kočiský, Tomáš, and Blunsom, Phil. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664, 2015.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

Socher, Richard, Chen, Danqi, Manning, Christopher D., and Ng, Andrew. Reasoning with neural tensor networks for knowledge base completion. In NIPS, 2013.

Socher, Richard, Karpathy, Andrej, Le, Quoc V., Manning, Christopher D., and Ng, Andrew Y. Grounded compositional semantics for finding and describing images with sentences. TACL, 2014.

Van der Maaten, Laurens and Hinton, Geoffrey. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, 2008.

Vilnis, Luke and McCallum, Andrew. Word representations via Gaussian embedding. In ICLR, 2015.

Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: A neural image caption generator. In CVPR, 2015.

Young, Peter, Lai, Alice, Hodosh, Micah, and Hockenmaier, Julia. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2014.

7 SUPPLEMENTARY MATERIAL

Mikolov et al. (2013) showed that word representations learned using word2vec exhibit semantic regularities, such as king - man + woman ≈ queen. Kiros et al. (2014) showed that similar regularities hold for joint image-language models. We find that order-embeddings exhibit a novel form of regularity, shown in Figure 4: the elementwise max and min operations in the embedding space roughly correspond to composition and abstraction, respectively.

Figure 4: Multimodal regularities found with embeddings learned for the caption-image retrieval task: nearest non-query images in the COCO training set for the queries max("man", "cat") and max("black dog", "park"). Note that some images have been slightly cropped for easier viewing, but no relevant objects have been removed.
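Under the reversed product order these operations have a precise lattice reading, which a small NumPy sketch (illustrative only; function names are ours) makes explicit:

```python
import numpy as np

def compose(*caption_vecs):
    # Elementwise max of nonnegative order-embeddings: the most general
    # point lying below all of the inputs in the reversed product order,
    # which the regularity above interprets as composition
    # (e.g., "man" and "cat" -> scenes containing both).
    return np.maximum.reduce(caption_vecs)

def abstract(*caption_vecs):
    # Elementwise min: the most specific point lying above all of the
    # inputs, interpreted as abstraction (their shared content).
    return np.minimum.reduce(caption_vecs)
```

A composed query vector can then be matched against training images by sorting them by order-violation penalty, as in the retrieval experiments.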