Unsupervised Generative Adversarial Cross-Modal Hashing

Jian Zhang, Yuxin Peng, Mingkuan Yuan
Institute of Computer Science and Technology, Peking University
Beijing 100871, China
pengyuxin@pku.edu.cn

Abstract

Cross-modal hashing aims to map heterogeneous multimedia data into a common Hamming space, which enables fast and flexible retrieval across different modalities. Unsupervised cross-modal hashing is more flexible and applicable than supervised methods, since no intensive labeling work is involved. However, existing unsupervised methods learn hashing functions by preserving inter- and intra-modality correlations, while ignoring the underlying manifold structure across different modalities, which is extremely helpful for capturing meaningful nearest neighbors of different modalities for cross-modal retrieval. To address this problem, in this paper we propose an Unsupervised Generative Adversarial Cross-modal Hashing approach (UGACH), which makes full use of GAN's ability for unsupervised representation learning to exploit the underlying manifold structure of cross-modal data. The main contributions can be summarized as follows: (1) We propose a generative adversarial network to model cross-modal hashing in an unsupervised fashion. In the proposed UGACH, given data of one modality, the generative model tries to fit the distribution over the manifold structure and selects informative data of another modality to challenge the discriminative model. The discriminative model learns to distinguish the generated data from the true positive data sampled from the correlation graph to achieve better retrieval accuracy. These two models are trained in an adversarial way to improve each other and promote hashing function learning. (2) We propose a correlation graph based approach to capture the underlying manifold structure across different modalities, so that data of different modalities but within the same manifold can have smaller Hamming distance and promote retrieval accuracy. Extensive experiments compared with 6 state-of-the-art methods on 2 widely-used datasets verify the effectiveness of our proposed approach.

Introduction

Multimedia retrieval has become an important application over the past decades, which can retrieve multimedia contents that users are interested in. However, it is a big challenge to retrieve multimedia data efficiently from large-scale databases, due to the explosive growth of multimedia information. To address this issue, many hashing methods (Wang et al. 2016; Gionis, Indyk, and Motwani 1999; Zhang and Peng 2017) have been proposed to accomplish efficient retrieval. The goal of hashing methods is to map high-dimensional representations in the original space to short binary codes in the Hamming space. By using these binary hash codes, faster Hamming distance computation can be applied based on bit operations that can be implemented efficiently. Moreover, binary codes take much less storage compared with the original high-dimensional representations. There are large numbers of hashing methods applied to single-modality retrieval (Wang et al. 2016), by which users can only retrieve data by a query of the same modality, such as text retrieval (Baeza-Yates and Ribeiro-Neto 1999) and image retrieval (Wang et al. 2016).
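To make the bit-operation argument concrete, here is a minimal Python sketch (not from the paper) that packs binary hash codes into integers and ranks a toy database by Hamming distance using XOR and popcount; the 64-bit code length and the random codes are purely illustrative assumptions.

```python
import numpy as np

# Hypothetical example: 64-bit hash codes for a database of 5 items and one query.
rng = np.random.default_rng(0)
db_codes = rng.integers(0, 2, size=(5, 64), dtype=np.uint8)   # binary codes
query_code = rng.integers(0, 2, size=64, dtype=np.uint8)

def pack_bits(code_01):
    """Pack a 0/1 vector into a single Python int for bit-level operations."""
    return int("".join(map(str, code_01.tolist())), 2)

db_packed = [pack_bits(c) for c in db_codes]
q_packed = pack_bits(query_code)

# Hamming distance = number of differing bits = popcount of the XOR of two codes.
distances = [bin(q_packed ^ c).count("1") for c in db_packed]
ranking = np.argsort(distances)   # nearest items first
print(distances, ranking)
```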
Nevertheless, single-modality retrieval cannot meet users' increasing demands, due to the different modalities of multimedia data. For example, with single-modality retrieval it is impracticable to search for an image using a textual sentence that describes the semantic content of the image. Therefore, cross-modal hashing has been proposed to meet this kind of retrieval demand in large-scale cross-modal databases. Owing to the effectiveness and flexibility of cross-modal hashing, users can submit whatever they have to retrieve whatever they want (Peng, Huang, and Zhao 2017; Peng et al. 2017). The heterogeneous gap is the key challenge of cross-modal hashing, which means that the similarity between different modalities cannot be measured directly. Consequently, some cross-modal hashing methods (Kumar and Udupa 2011; Rastegari et al. 2013; Ding et al. 2016; Zhang and Li 2014; Zhuang et al. 2014) have been proposed to bridge this gap. Existing cross-modal hashing methods can be categorized into traditional methods and Deep Neural Network (DNN) based methods. Moreover, traditional methods can be divided into unsupervised and supervised methods according to whether semantic information is leveraged.

Unsupervised cross-modal hashing methods usually project data from different modalities into a common Hamming space to maximize their correlations, which holds a similar idea to Canonical Correlation Analysis (CCA) (Hardoon, Szedmak, and Shawe-Taylor 2004). Song et al. propose Inter-Media Hashing (IMH) (Song et al. 2013) to establish a common Hamming space by preserving inter-media and intra-media consistency. Cross-view Hashing (CVH) (Kumar and Udupa 2011) is proposed to consider both intra-view and inter-view similarities, and is an extension of the image hashing method Spectral Hashing (SH) (Weiss, Torralba, and Fergus 2009). Predictable Dual-view Hashing (PDH) (Rastegari et al. 2013) designs an objective function to keep the predictability of pre-generated binary codes. Ding et al. propose Collective Matrix Factorization Hashing (CMFH) (Ding et al. 2016) to learn unified hash codes by collective matrix factorization. Composite Correlation Quantization (CCQ) (Long et al. 2016) jointly learns correlation-maximal mappings that transform different modalities into an isomorphic latent space, and composite quantizers that convert the isomorphic latent features into compact binary codes.

Supervised cross-modal hashing methods utilize labeled semantic information to learn hashing functions. Bronstein et al. propose Cross-Modality Similarity Sensitive Hashing (CMSSH) (Bronstein et al. 2010), which models hashing learning as a classification problem solved in a boosting manner. Wei et al. propose Heterogeneous Translated Hashing (HTH) (Wei et al. 2014), which learns translators to align the separate Hamming spaces of different modalities to perform cross-modal hashing. Semantic Correlation Maximization (SCM) (Zhang and Li 2014) is proposed to learn hashing functions by constructing and preserving the semantic similarity matrix. Semantics-Preserving Hashing (SePH) (Lin et al. 2015) transforms the semantic matrix into a probability distribution and minimizes the KL-divergence to approximate this distribution with the learned hash codes in Hamming space. DNN based methods are inspired by the successful applications of deep learning, such as image classification (Krizhevsky, Sutskever, and Hinton 2012).
Cross-Media Neural Network Hashing (CMNNH) (Zhuang et al. 2014) is proposed to learn cross-modal hashing functions by preserving intra-modal discriminative ability and inter-modal pairwise correlation. Correlation Autoencoder Hashing (CAH) (Cao et al. 2016b) is based on a deep autoencoder structure that maximizes the feature correlation and the semantic correlation between different modalities. Cao et al. propose Deep Visual-semantic Hashing (DVH) (Cao et al. 2016a) as an end-to-end framework that combines representation learning and hashing function learning. Jiang et al. propose Deep Cross-modal Hashing (DCMH) (Jiang and Li 2017), which performs feature learning and hashing function learning simultaneously.

Compared with the unsupervised paradigm, supervised methods use labeled semantic information that requires massive labor to collect, resulting in a high labor cost in real-world applications. On the contrary, unsupervised cross-modal hashing methods can leverage unlabeled data to realize efficient cross-modal retrieval, which is more flexible and applicable in real-world applications. However, most unsupervised methods learn hashing functions by preserving inter- and intra-modality correlations, while ignoring the underlying manifold structure across different modalities, which is extremely helpful for capturing meaningful nearest neighbors of different modalities. To address this problem, in this paper we exploit correlation information from the underlying manifold structure of unlabeled data across different modalities to enhance cross-modal hashing learning. Inspired by the recent progress of Generative Adversarial Networks (GAN) (Goodfellow et al. 2014; Reed et al. 2016; Zhao and Gao 2017; Wang et al. 2017), which have shown the ability to model data distributions in an unsupervised fashion, we propose an Unsupervised Generative Adversarial Cross-modal Hashing (UGACH) approach. We design a graph-based unsupervised correlation method to capture the underlying manifold structure across different modalities, and a generative adversarial network to learn this manifold structure and further enhance performance by an adversarial boosting paradigm. The main contributions of this paper can be summarized as follows:

(1) We propose a generative adversarial network to model cross-modal hashing in an unsupervised fashion. In the proposed UGACH, given data of any modality, the generative model tries to fit the distribution over the manifold structure and selects informative data of another modality to challenge the discriminative model, while the discriminative model learns to distinguish the generated data from the true positive data sampled from the correlation graph to achieve better retrieval accuracy.

(2) We propose a correlation graph based learning approach to capture the underlying manifold structure across different modalities, so that data of different modalities but within the same manifold can have smaller Hamming distance and promote retrieval accuracy. We also integrate the proposed correlation graph into the proposed generative adversarial network to provide manifold correlation guidance and promote cross-modal retrieval accuracy.

Extensive experiments compared with 6 state-of-the-art methods on 2 widely-used datasets verify the effectiveness of our proposed approach. The rest of this paper is organized as follows. In the Proposed Approach section, we present our UGACH approach in detail. The experimental results and analyses are reported in the Experiments section.
Finally, we conclude this paper in the Conclusion section.

The Proposed Approach

Figure 1 presents the overview of our proposed approach, which consists of three parts: feature extraction, the generative model G and the discriminative model D. The feature extraction part employs image feature and text feature extraction to represent unlabeled data of different modalities as original features; its detailed implementation is described in the Experiments section. Given data of one modality, G attempts to select informative data from another modality to generate a pair of data and send it to D. In D, we construct a correlation graph, which can capture the manifold structure among the original features. D receives the generated pairs as inputs, and also samples positive data from the constructed graph to form true manifold pairs. Then D tries to distinguish the manifold pairs from the generated pairs in order to obtain better discriminative ability. These two models play a minimax game to boost each other, and the trained D is finally used as the cross-modal hashing model.

[Figure 1: The overall framework of our proposed unsupervised generative adversarial cross-modal hashing approach (UGACH), which consists of a feature extraction part, a generative model G and a discriminative model D.]
We denote the cross-modal dataset as D = {I, T}, where I represents the image modality and T represents the text modality. In this paper, D = {I, T} is further split into a retrieval database D_db = {I_db, T_db} and a query set D_q = {I_q, T_q}. In the retrieval database D_db, I_db = {i_p}_{p=1}^{n}, T_db = {t_p}_{p=1}^{n}, and n is the number of data pairs. The retrieval database D_db is also the training set. In the query set D_q, I_q = {i_p}_{p=1}^{t}, T_q = {t_p}_{p=1}^{t}, and t is the number of data pairs. The aim is to learn cross-modal hashing functions that generate hash codes C_I = H_I(I) and C_T = H_T(T), so that data of different modalities with similar semantics are close in the common Hamming space. In addition, we denote the hash code length as l. With the generated hash codes, we can efficiently retrieve relevant data of another modality from the database given a query of any modality.

Generative Model

The network of the generative model has a two-pathway architecture, which receives the original features of both images and texts as inputs. Each pathway consists of a common representation layer and a hashing layer, both implemented as fully-connected layers. The first layer converts the modality-specific features into common representations, which make the instances of different modalities measurable in a common space. The representation produced by this layer is denoted as follows:

\phi_c(x) = \tanh(W_c x + b_c)    (1)

where x denotes the original features of images or texts, W_c is the weight parameter of the common representation layer, and b_c is the bias parameter. The hashing layer maps the common representations into binary hash codes, so that the similarity between different modalities can be measured by fast Hamming distance calculation. The continuous real-valued hash code is defined as:

h(x) = \mathrm{sigmoid}(W_h \phi_c(x) + b_h)    (2)

where W_h is the weight parameter and b_h is the bias parameter.
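As an illustration of equations (1) and (2), together with the sign thresholding described next, the following NumPy sketch runs one pathway of the hashing network forward; the layer sizes, the random initialization and the 64-bit code length are assumptions made for the example, not the paper's released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; the paper uses 4096-d features, a 4096-d common layer and code length l.
d_in, d_common, code_len = 512, 512, 64

# Hypothetical randomly initialized parameters (W_c, b_c, W_h, b_h in the paper's notation).
W_c = rng.normal(scale=0.01, size=(d_common, d_in))
b_c = np.zeros(d_common)
W_h = rng.normal(scale=0.01, size=(code_len, d_common))
b_h = np.zeros(code_len)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def common_representation(x):
    # Eq. (1): phi_c(x) = tanh(W_c x + b_c)
    return np.tanh(W_c @ x + b_c)

def relaxed_hash(x):
    # Eq. (2): h(x) = sigmoid(W_h phi_c(x) + b_h), continuous relaxation in [0, 1]
    return sigmoid(W_h @ common_representation(x) + b_h)

def binary_hash(x):
    # Eq. (3): b_k(x) = sgn(h_k(x) - 0.5), written here as {0, 1} bits
    return (relaxed_hash(x) > 0.5).astype(np.uint8)

x_image = rng.normal(size=d_in)      # a toy image feature vector
print(binary_hash(x_image)[:8])      # first 8 bits of the 64-bit code
```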
Then we obtain the binary codes by a thresholding function:

b_k(x) = \mathrm{sgn}(h_k(x) - 0.5), \quad k = 1, 2, \ldots, l    (3)

where l denotes the hash code length. Considering that it is hard to optimize binary codes directly, we use the relaxed continuous real-valued hash codes h(x) in the training process.

Given data of one modality, the goal of the generative model G is to fit the distribution over the manifold structure and select informative data of another modality to challenge the discriminative model. The generative probability of G is p_θ(x^U | q), which is the foundation for selecting a relevant instance of one modality from unpaired data when given a query of another modality. For example, given an image query q_i, the generative model tries to select a relevant text t^U from T_db. The generative probability p_θ(x^U | q) is defined as a softmax function:

p_\theta(x^U | q) = \frac{\exp(-\|h(q) - h(x^U)\|^2)}{\sum_{x^U} \exp(-\|h(q) - h(x^U)\|^2)}    (4)

Given a query, we can use equation (4) to calculate the probability of each candidate, which indicates its possibility of being a relevant sample.

Discriminative Model

The network structure of the discriminative model is the same as that of the generative model. The inputs of this network are the pairs generated by the generative model and the manifold pairs provided by a correlation graph. The goal of the discriminative model is to distinguish whether an input pair is generated or comes from the correlation graph.

First of all, we introduce the correlation graph in the discriminative model. We propose a correlation graph to guide the training of the discriminative model. The correlation graph can capture the underlying manifold structure across different modalities, so that data of different modalities but within the same manifold can have small Hamming distance and promote retrieval accuracy. Specifically, we first construct undirected graphs Graph_i = (V, W_i) and Graph_t = (V, W_t) for the image and text modality respectively, where V denotes the vertices, and W_i and W_t are the similarity matrices. W_i is defined as follows:

w(p, q) = \begin{cases} 1, & x_p \in \mathrm{NN}_k(x_q) \\ 0, & \text{otherwise} \end{cases}    (5)

where NN_k(x_q) denotes the k-nearest neighbors of x_q in the training set; W_t is defined similarly. Then we sample the data of the true distribution based on the constructed graph. For the real pairs p_true(x | q_j) provided by the dataset, we select x_k as a true relevant instance of a given query q_j and associate them as a manifold pair p_manifold(x_k | q_j) when w(k, j) = 1. It is noted that pairwise information naturally exists in cross-modal data, thus if the corresponding text t_k is within the same manifold as the text query q_j, the paired image i_k is also within the same manifold as q_j, and vice versa. With this definition, we intend to utilize the underlying data manifold of different modalities to guide the training of the discriminative model. Intuitively, we want data of different modalities but within the same manifold to have small Hamming distance (e.g., a small Hamming distance between the text query q_j and the image i_k).
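The following sketch illustrates how a k-nearest-neighbor correlation graph in the spirit of equation (5) could be built from original features and how a manifold-paired image could be drawn for a text query; the Euclidean metric, the toy feature sizes and k = 3 are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_img, d_txt, k = 20, 32, 16, 3            # toy sizes; the paper uses 4096-d image / 1000-d BoW features
img_feats = rng.normal(size=(n, d_img))       # paired data: row p of each matrix describes the same instance
txt_feats = rng.normal(size=(n, d_txt))

def knn_graph(feats, k):
    """Eq. (5): w(p, q) = 1 if x_p is among the k nearest neighbors of x_q, else 0."""
    dists = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)           # exclude self-matches
    W = np.zeros((len(feats), len(feats)), dtype=np.uint8)
    for q in range(len(feats)):
        W[np.argsort(dists[:, q])[:k], q] = 1
    return W

W_t = knn_graph(txt_feats, k)                 # text-modality graph; W_i is built the same way

def sample_manifold_image_for_text_query(j):
    """For a text query q_j, return an image i_k whose paired text t_k lies in q_j's manifold (w(k, j) = 1)."""
    neighbors = np.flatnonzero(W_t[:, j])
    k_idx = rng.choice(neighbors)
    return img_feats[k_idx]

print(sample_manifold_image_for_text_query(0).shape)
```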
After receiving the generated and manifold pairs, the discriminative model predicts a relevance score for each pair as its judgment result, where the relevance score between an instance x and its query q is denoted f_φ(x, q). The goal of the discriminative model is to accurately distinguish the true relevant data (manifold pairs) from the non-relevant data (generated pairs) for a query q. The relevance score f_φ(x^G, q) is defined by a triplet ranking loss as follows:

f_\phi(x^G, q) = \max(0,\; m + \|h(q) - h(x^M)\|^2 - \|h(q) - h(x^G)\|^2)    (6)

where x^M is a manifold-paired instance of the query q selected from the correlation graph, x^G is the instance selected by the generative model, and m is a margin parameter, which is set to 1 in our proposed approach. The above equation means that we want the distance of the manifold pair (q, x^M) to be smaller than that of the generated pair (q, x^G) by a margin m, so that the discriminative model can draw a clear line between the manifold and generated pairs. Then the discriminative model D uses the relevance score to produce the predicted probability of an instance x through a sigmoid function:

D(x | q) = \mathrm{sigmoid}(f_\phi(x, q)) = \frac{\exp(f_\phi(x, q))}{1 + \exp(f_\phi(x, q))}    (7)

The generative model tries to select informative data to challenge the discriminative model, which limits its own capability to perform cross-modal retrieval. By contrast, the discriminative model is suitable for retrieving data across different modalities, after being greatly promoted by the generative model. Therefore, after the proposed UGACH is trained, we use the discriminative model to perform cross-modal retrieval via the produced hash codes.

Adversarial Learning

Given the definitions of the generative and discriminative models, we can conduct a minimax game to train them. Given a query, the generative model attempts to generate a pair that is close to the manifold pair to fool the discriminative model. The discriminative model tries to distinguish between the manifold pair sampled from the correlation graph and the generated pair, which forms an adversarial process against the generative model. Inspired by GAN (Goodfellow et al. 2014), this adversarial process can be defined as:

V(G, D) = \min_\theta \max_\phi \sum_{j=1}^{n} \left( \mathbb{E}_{x \sim p_{true}(x^M | q_j)}[\log D(x^M | q_j)] + \mathbb{E}_{x \sim p_\theta(x^G | q_j)}[\log(1 - D(x^G | q_j))] \right)    (8)

The generative and discriminative models can be learned iteratively by maximizing and minimizing the above objective function. As in the general training process, the discriminative model tries to maximize equation (8), while the generative model attempts to minimize equation (8) and fit the distribution over the manifold structure. When the discriminative model is fixed, the generative model can be trained as follows:

\theta^* = \arg\min_\theta \sum_{j=1}^{n} \left( \mathbb{E}_{x \sim p_{true}(x^M | q_j)}[\log(\mathrm{sigmoid}(f_{\phi^*}(x^M, q_j)))] + \mathbb{E}_{x \sim p_\theta(x^G | q_j)}[\log(1 - \mathrm{sigmoid}(f_{\phi^*}(x^G, q_j)))] \right)    (9)

where f_{\phi^*} denotes the discriminative model at the previous iteration. The traditional GAN uses a continuous noise vector to generate new data and is trained by stochastic gradient descent. By contrast, the generative model of our proposed UGACH selects data from the unlabeled data to generate pairs and cannot be optimized continuously due to the discrete selection strategy. We therefore utilize a reinforcement learning based policy-gradient update to train the generative model as follows:

\nabla_\theta \mathbb{E}_{x \sim p_\theta(x^G | q_j)}[\log(1 + \exp(f_\phi(x^G, q_j)))]
= \sum_{k} \nabla_\theta p_\theta(x^G_k | q_j) \log(1 + \exp(f_\phi(x^G_k, q_j)))
= \sum_{k} p_\theta(x^G_k | q_j) \nabla_\theta \log p_\theta(x^G_k | q_j) \log(1 + \exp(f_\phi(x^G_k, q_j)))
= \mathbb{E}_{x \sim p_\theta(x^G | q_j)}[\nabla_\theta \log p_\theta(x^G | q_j) \log(1 + \exp(f_\phi(x^G, q_j)))]
\approx \frac{1}{K} \sum_{k=1}^{K} \nabla_\theta \log p_\theta(x^G_k | q_j) \log(1 + \exp(f_\phi(x^G_k, q_j)))    (10)

where x^G_k denotes the k-th instance selected by the generative model according to a query q_j. From the perspective of reinforcement learning, given the environment q_j, x^G_k is the action taken by the policy p_θ(x^G_k | q_j), and log(1 + exp(f_φ(x^G_k, q_j))) acts as the reward, which encourages the generative model to select data close to the distribution over the manifold structure.
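To make the sampled policy-gradient estimate of equation (10) concrete, the sketch below computes the generative probability of equation (4) over a toy candidate pool, samples K actions, scores them with the triplet relevance of equation (6), and accumulates a REINFORCE-style gradient. For simplicity it differentiates only with respect to the softmax logits and uses random relaxed hash codes; these simplifications are assumptions for illustration, not the paper's TensorFlow implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all values hypothetical): relaxed hash codes in [0, 1] for one text query q_j,
# its manifold-paired image x^M, and a pool of 50 candidate images the generator can pick from.
code_len, n_cand, K, m = 16, 50, 5, 1.0       # hash length, candidate pool size, samples per query, margin
h_q = rng.random(code_len)                    # h(q_j)
h_m = rng.random(code_len)                    # h(x^M), manifold pair from the correlation graph
h_cand = rng.random((n_cand, code_len))       # h(x^U) for unpaired candidates

def sq_dist(a, b):
    return np.sum((a - b) ** 2, axis=-1)

def generator_probs(h_query, h_pool):
    # Eq. (4): softmax over negative squared distances to the query.
    logits = -sq_dist(h_query, h_pool)
    logits -= logits.max()                    # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def relevance_score(h_query, h_manifold, h_generated):
    # Eq. (6): triplet ranking margin used by the discriminator.
    return max(0.0, m + sq_dist(h_query, h_manifold) - sq_dist(h_query, h_generated))

p = generator_probs(h_q, h_cand)
picked = rng.choice(n_cand, size=K, p=p)      # generator "actions": selected candidates x^G_k

# Eq. (10): accumulate grad log p_theta(x^G_k | q_j) * reward, where the reward is
# log(1 + exp(f_phi)). Here the gradient is taken w.r.t. the softmax logits only;
# in the paper it is propagated further into the generator network's parameters.
grad_logits = np.zeros(n_cand)
for idx in picked:
    reward = np.log1p(np.exp(relevance_score(h_q, h_m, h_cand[idx])))
    one_hot = np.eye(n_cand)[idx]
    grad_logits += (one_hot - p) * reward     # d log p(idx) / d logits = one_hot - softmax
grad_logits /= K                              # ascending this estimate raises the expected reward
print(grad_logits[:5])
```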
Finally, the trained discriminative model can be used to generate binary codes for input data of any modality, and cross-modal retrieval can be performed by fast Hamming distance computation between the query and each item in the database.

Experiments

In this section, we present the experimental results of our proposed UGACH approach. We first introduce the datasets, evaluation metrics and implementation details. Then we compare and analyze the results of UGACH with 6 state-of-the-art methods and 2 baseline methods.

Dataset

In the experiments, we conduct cross-modal hashing on 2 widely-used datasets: NUS-WIDE (Chua et al. 2009) and MIRFlickr (Huiskes and Lew 2008).

The NUS-WIDE dataset (Chua et al. 2009) is a relatively large-scale image/tag dataset with 269498 images. Each image has corresponding textual tags, which are regarded as the text modality in our experiments. The NUS-WIDE dataset has 81 categories, but there are overlaps among the categories. Following (Lin et al. 2015), we select the 10 largest categories and the corresponding 186557 images. We take 1% of the NUS-WIDE data as the query set and the rest as the retrieval database, and randomly select 5000 images as the training set for the supervised methods. We represent each image by a 4096-dimensional deep feature extracted from the 19-layer VGGNet, and each text by a 1000-dimensional BoW vector.

The MIRFlickr dataset (Huiskes and Lew 2008) has 25000 images collected from Flickr, organized in 24 categories. Each image is also associated with text tags. Following (Lin et al. 2015), we take 5% of the dataset as the query set and the remainder as the retrieval database. We also randomly select 5000 images as the training set for the supervised methods. Similarly, we represent each image by a 4096-dimensional deep feature extracted from the 19-layer VGGNet, and each text by a 1000-dimensional BoW vector.

Compared Methods

In order to verify the effectiveness of our proposed approach, 4 unsupervised methods and 2 supervised methods are compared in the experiments: the unsupervised methods CVH (Kumar and Udupa 2011), PDH (Rastegari et al. 2013), CMFH (Ding et al. 2016) and CCQ (Long et al. 2016), and the supervised methods CMSSH (Bronstein et al. 2010) and SCM (Zhang and Li 2014). Besides the state-of-the-art methods, we also compare our UGACH approach with 2 baseline methods to verify the effectiveness of our contributions.

Baseline: We design a baseline method without the correlation graph and adversarial training, denoted as Baseline. It is implemented by training the discriminative model alone with the triplet ranking loss in equation (6), where the positive data are only the paired data provided by the cross-modal datasets.

Baseline-GAN: We add adversarial training to Baseline, which means that we further promote the discriminative model in Baseline by the adversarial training defined in equation (8). Comparing Baseline-GAN with Baseline verifies the effectiveness of our proposed generative adversarial network for cross-modal hashing, and comparing our final approach UGACH with Baseline-GAN verifies the effectiveness of the proposed correlation graph.
Retrieval Tasks and Evaluation Metrics

In the experiments, two retrieval tasks are performed: retrieving text by an image query (image→text) and retrieving images by a text query (text→image). Specifically, we first obtain the hash codes for the images and texts in the query set and retrieval database with our UGACH approach and all the compared methods. Then we take an image as the query, compute the Hamming distance to all texts in the retrieval database, and evaluate the ranking list by 3 evaluation metrics: Mean Average Precision (MAP), the precision-recall curve (PR-curve) and the precision at top K returned results (topK-precision), which are defined as follows:

Mean Average Precision (MAP): The MAP score is computed as the mean of the average precision (AP) over all queries, and AP is computed as:

AP = \frac{1}{R} \sum_{k=1}^{n} \frac{R_k}{k} \, rel_k    (11)

where n is the size of the database, R is the number of relevant items in the database, R_k is the number of relevant items in the top k returns, and rel_k = 1 if the item ranked at the k-th position is relevant and 0 otherwise.

Precision-recall curve (PR-curve): The precision at each level of recall of the retrieved ranking list, which is widely used to measure information retrieval performance.

Precision at top K returned results (topK-precision): The precision with respect to different numbers of retrieved samples from the ranking list.

Table 1: The MAP scores of two retrieval tasks on the NUS-WIDE dataset with different lengths of hash codes.

| Methods | image→text: 16 bit | 32 bit | 64 bit | 128 bit | text→image: 16 bit | 32 bit | 64 bit | 128 bit |
|---|---|---|---|---|---|---|---|---|
| CVH (Kumar and Udupa 2011) | 0.458 | 0.432 | 0.410 | 0.392 | 0.474 | 0.445 | 0.419 | 0.398 |
| PDH (Rastegari et al. 2013) | 0.475 | 0.484 | 0.480 | 0.490 | 0.489 | 0.512 | 0.507 | 0.517 |
| CMFH (Ding et al. 2016) | 0.517 | 0.550 | 0.547 | 0.520 | 0.439 | 0.416 | 0.377 | 0.349 |
| CCQ (Long et al. 2016) | 0.504 | 0.505 | 0.506 | 0.505 | 0.499 | 0.496 | 0.492 | 0.488 |
| CMSSH (Bronstein et al. 2010) | 0.512 | 0.470 | 0.479 | 0.466 | 0.519 | 0.498 | 0.456 | 0.488 |
| SCM_orth (Zhang and Li 2014) | 0.389 | 0.376 | 0.368 | 0.360 | 0.388 | 0.372 | 0.360 | 0.353 |
| SCM_seq (Zhang and Li 2014) | 0.517 | 0.514 | 0.518 | 0.518 | 0.518 | 0.510 | 0.517 | 0.518 |
| Baseline | 0.540 | 0.537 | 0.573 | 0.598 | 0.554 | 0.555 | 0.583 | 0.608 |
| Baseline-GAN | 0.575 | 0.594 | 0.602 | 0.623 | 0.580 | 0.609 | 0.617 | 0.629 |
| UGACH (Ours) | 0.613 | 0.623 | 0.628 | 0.631 | 0.603 | 0.614 | 0.640 | 0.641 |

Table 2: The MAP scores of two retrieval tasks on the MIRFlickr dataset with different lengths of hash codes.

| Methods | image→text: 16 bit | 32 bit | 64 bit | 128 bit | text→image: 16 bit | 32 bit | 64 bit | 128 bit |
|---|---|---|---|---|---|---|---|---|
| CVH (Kumar and Udupa 2011) | 0.602 | 0.587 | 0.578 | 0.572 | 0.607 | 0.591 | 0.581 | 0.574 |
| PDH (Rastegari et al. 2013) | 0.623 | 0.624 | 0.621 | 0.626 | 0.627 | 0.628 | 0.628 | 0.629 |
| CMFH (Ding et al. 2016) | 0.659 | 0.660 | 0.663 | 0.653 | 0.611 | 0.606 | 0.575 | 0.563 |
| CCQ (Long et al. 2016) | 0.637 | 0.639 | 0.639 | 0.638 | 0.628 | 0.628 | 0.622 | 0.618 |
| CMSSH (Bronstein et al. 2010) | 0.611 | 0.602 | 0.599 | 0.591 | 0.612 | 0.604 | 0.592 | 0.585 |
| SCM_orth (Zhang and Li 2014) | 0.585 | 0.576 | 0.570 | 0.566 | 0.585 | 0.584 | 0.574 | 0.568 |
| SCM_seq (Zhang and Li 2014) | 0.636 | 0.640 | 0.641 | 0.643 | 0.661 | 0.664 | 0.668 | 0.670 |
| Baseline | 0.619 | 0.631 | 0.633 | 0.646 | 0.625 | 0.635 | 0.634 | 0.649 |
| Baseline-GAN | 0.630 | 0.643 | 0.651 | 0.664 | 0.660 | 0.657 | 0.670 | 0.688 |
| UGACH (Ours) | 0.685 | 0.693 | 0.704 | 0.702 | 0.673 | 0.676 | 0.686 | 0.690 |

It should be noted that the MAP scores are reported for all retrieval results with 4 different lengths of hash codes, while the PR-curve and topK-precision are evaluated on 128-bit hash codes.
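For reference, here is a small sketch of how the MAP metric of equation (11) could be computed from binary relevance judgments over ranked retrieval lists; the toy ranked lists are hypothetical.

```python
import numpy as np

def average_precision(relevance):
    """AP for one query, following Eq. (11): relevance is a 0/1 array over the ranked database."""
    relevance = np.asarray(relevance)
    R = relevance.sum()                           # number of relevant items in the database
    if R == 0:
        return 0.0
    ranks = np.arange(1, len(relevance) + 1)      # k = 1 .. n
    R_k = np.cumsum(relevance)                    # number of relevant items within the top k
    return float(np.sum((R_k / ranks) * relevance) / R)

# Hypothetical ranked lists (1 = relevant, 0 = irrelevant) for two queries; MAP is the mean AP.
ranked_lists = [np.array([1, 0, 1, 0, 0]), np.array([0, 1, 1, 0, 1])]
mean_ap = np.mean([average_precision(r) for r in ranked_lists])
print(round(mean_ap, 4))
```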
Implementation Details

In this section, we present the implementation details of UGACH in the experiments. We take the 4096-dimensional feature extracted from the 19-layer VGGNet for images, and use the 1000-dimensional BoW feature for texts. We implement the proposed UGACH with TensorFlow (https://www.tensorflow.org). The dimension of the common representation layer is set to 4096, while the dimension of the hashing layer is set to the hash code length. Moreover, we train the proposed UGACH in a mini-batch way with a batch size of 64 for both the discriminative and generative models. We train the proposed UGACH iteratively: after the discriminative model is trained for 1 epoch, the generative model is trained for 1 epoch. The learning rate is initialized as 0.01 and decreased by a factor of 10 every two epochs. For the compared methods, we apply the implementations provided by their authors and follow their best settings to perform the experiments. It is noted that, for a fair comparison, we use the same image and text features for all compared methods.

Experiment Results

Tables 1 and 2 show the MAP scores of our UGACH and the compared methods on the NUS-WIDE and MIRFlickr datasets. Compared with the state-of-the-art methods, our proposed UGACH approach achieves the best retrieval accuracy on both datasets. For convenience, we categorize these result tables into three parts: unsupervised compared methods, supervised compared methods and baseline methods. On the NUS-WIDE dataset, our proposed UGACH keeps the best average MAP scores of 0.624 on the image→text task and 0.625 on the text→image task. Compared with the best unsupervised method CCQ (Long et al. 2016), our UGACH achieves an inspiring accuracy improvement from 0.505 to 0.624 on the image→text task, and improves the average MAP score from 0.494 to 0.625 on the text→image task. Even compared with the supervised method SCM_seq (Zhang and Li 2014), our UGACH improves the average MAP score from 0.517 to 0.624 on the image→text task, and from 0.516 to 0.625 on the text→image task. We can observe similar trends on the MIRFlickr dataset from Table 2.

Figures 2 and 3 show the topK-precision and precision-recall curves on the two datasets with 128-bit hash codes. We can observe that on both the image→text and text→image tasks, UGACH achieves the best accuracy among all compared unsupervised methods. UGACH even achieves better retrieval accuracy than the compared supervised methods on most of the evaluation metrics, which further demonstrates the effectiveness of our proposed approach.

[Figure 2: The topK-precision curves with 128-bit hash codes. The left two figures present the results of the image→text task on the NUS-WIDE and MIRFlickr datasets, while the right two figures show the results of the text→image task. Compared methods: UGACH (Ours), SCM_seq, SCM_orth, CMSSH, CCQ, CMFH, PDH, CVH.]

[Figure 3: The precision-recall curves with 128-bit hash codes. The left two figures present the results of the image→text task on the NUS-WIDE and MIRFlickr datasets, while the right two figures show the results of the text→image task. Compared methods: UGACH (Ours), SCM_seq, SCM_orth, CMSSH, CCQ, CMFH, PDH, CVH.]
Compared with the 2 baseline methods on the NUS-WIDE dataset, Baseline-GAN obtains improvements of 0.037 and 0.034 over Baseline on the two tasks, which demonstrates the effectiveness of our proposed generative adversarial network for cross-modal hashing. Comparing our proposed UGACH with Baseline-GAN, we observe further improvements of 0.025 and 0.016 on the two tasks, which demonstrates the effectiveness of our proposed correlation graph. Similar trends can also be observed on the MIRFlickr dataset.

Conclusion

In this paper, we have proposed an Unsupervised Generative Adversarial Cross-modal Hashing approach (UGACH), which makes full use of GAN's ability for unsupervised representation learning to exploit the underlying manifold structure of cross-modal data. On one hand, we propose a generative adversarial network to model cross-modal hashing in an unsupervised fashion. In the proposed UGACH, the generative model tries to fit the distribution over the manifold structure and selects informative data of another modality to challenge the discriminative model, while the discriminative model learns to preserve both the traditional inter-modality correlation and the manifold correlations provided by the generative model to achieve better retrieval accuracy. These two models are trained in an adversarial way to improve each other and achieve better retrieval accuracy. On the other hand, we propose a graph-based correlation learning approach to capture the underlying manifold structure across different modalities, so that data of different modalities but within the same manifold can have smaller Hamming distance and promote retrieval accuracy. Experiments compared with 6 state-of-the-art methods on 2 widely-used datasets verify the effectiveness of our proposed approach.

The future work lies in two aspects. First, we will focus on extending our approach to support retrieval across multiple modalities, such as cross-modal retrieval across image, text, video and audio. Second, we will attempt to extend the current framework to other scenarios such as image captioning to verify its versatility.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant 61771025 and Grant 61532005.

References

Baeza-Yates, R., and Ribeiro-Neto, B. 1999. Modern Information Retrieval, volume 463. ACM Press, New York.

Bronstein, M. M.; Bronstein, A. M.; Michel, F.; and Paragios, N. 2010. Data fusion through cross-modality metric learning using similarity-sensitive hashing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3594-3601.

Cao, Y.; Long, M.; Wang, J.; Yang, Q.; and Yu, P. S. 2016a. Deep visual-semantic hashing for cross-modal retrieval. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1445.

Cao, Y.; Long, M.; Wang, J.; and Zhu, H. 2016b. Correlation autoencoder hashing for supervised cross-modal search. In Proceedings of the 2016 ACM International Conference on Multimedia Retrieval, 197-204.

Chua, T.-S.; Tang, J.; Hong, R.; Li, H.; Luo, Z.; and Zheng, Y. 2009. NUS-WIDE: A real-world web image database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval, 48.

Ding, G.; Guo, Y.; Zhou, J.; and Gao, Y. 2016. Large-scale cross-modality search via collective matrix factorization hashing. IEEE Transactions on Image Processing 25(11):5427-5440.

Gionis, A.; Indyk, P.; and Motwani, R. 1999. Similarity search in high dimensions via hashing. In VLDB, volume 99, 518-529.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672-2680.

Hardoon, D. R.; Szedmak, S.; and Shawe-Taylor, J. 2004. Canonical correlation analysis: An overview with application to learning methods. Neural Computation 16(12):2639-2664.

Huiskes, M. J., and Lew, M. S. 2008. The MIR Flickr retrieval evaluation. In Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, 39-43.

Jiang, Q.-Y., and Li, W.-J. 2017. Deep cross-modal hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3232-3240.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097-1105.

Kumar, S., and Udupa, R. 2011. Learning hash functions for cross-view similarity search. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), volume 22, 1360.

Lin, Z.; Ding, G.; Hu, M.; and Wang, J. 2015. Semantics-preserving hashing for cross-view retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3864-3872.

Long, M.; Cao, Y.; Wang, J.; and Yu, P. S. 2016. Composite correlation quantization for efficient multimodal retrieval. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, 579-588.

Peng, Y.; Zhu, W.; Zhao, Y.; Xu, C.; Huang, Q.; Lu, H.; Zheng, Q.; Huang, T.; and Gao, W. 2017. Cross-media analysis and reasoning: Advances and directions. Frontiers of Information Technology & Electronic Engineering 18(1):44-57.

Peng, Y.; Huang, X.; and Zhao, Y. 2017. An overview of cross-media retrieval: Concepts, methodologies, benchmarks and challenges. IEEE Transactions on Circuits and Systems for Video Technology PP(99):1-14.

Rastegari, M.; Choi, J.; Fakhraei, S.; Hal, D.; and Davis, L. 2013. Predictable dual-view hashing. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 1328-1336.

Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; and Lee, H. 2016. Generative adversarial text to image synthesis. In Proceedings of the 33rd International Conference on Machine Learning, 1060-1069.

Song, J.; Yang, Y.; Yang, Y.; Huang, Z.; and Shen, H. T. 2013. Inter-media hashing for large-scale retrieval from heterogeneous data sources. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, 785-796.

Wang, J.; Liu, W.; Kumar, S.; and Chang, S. F. 2016. Learning to hash for indexing big data - a survey. Proceedings of the IEEE 104(1):34-57.

Wang, J.; Yu, L.; Zhang, W.; Gong, Y.; Xu, Y.; Wang, B.; Zhang, P.; and Zhang, D. 2017. IRGAN: A minimax game for unifying generative and discriminative information retrieval models. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 515-524.

Wei, Y.; Song, Y.; Zhen, Y.; Liu, B.; and Yang, Q. 2014. Scalable heterogeneous translated hashing. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 791-800.

Weiss, Y.; Torralba, A.; and Fergus, R. 2009. Spectral hashing. In Advances in Neural Information Processing Systems, 1753-1760.
Zhang, D., and Li, W.-J. 2014. Large-scale supervised multimodal hashing with semantic correlation maximization. In AAAI, volume 1, 7.

Zhang, J., and Peng, Y. 2017. SSDH: Semi-supervised deep hashing for large scale image retrieval. IEEE Transactions on Circuits and Systems for Video Technology PP(99):1-14.

Zhao, X.; Ding, G.; Guo, Y.; Han, J.; and Gao, Y. 2017. TUCH: Turning cross-view hashing into single-view hashing via generative adversarial nets. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), 3511-3517.

Zhuang, Y.; Yu, Z.; Wang, W.; Wu, F.; Tang, S.; and Shao, J. 2014. Cross-media hashing with neural networks. In Proceedings of the 22nd ACM International Conference on Multimedia, 901-904.